* [PATCH v3 0/5] Support message-based DMA in vfio-user server
@ 2023-09-07 13:04 Mattias Nissler
2023-09-07 13:04 ` [PATCH v3 1/5] softmmu: Per-AddressSpace bounce buffering Mattias Nissler
` (5 more replies)
0 siblings, 6 replies; 23+ messages in thread
From: Mattias Nissler @ 2023-09-07 13:04 UTC (permalink / raw)
To: qemu-devel
Cc: john.levon, Elena Ufimtseva, Michael S. Tsirkin,
David Hildenbrand, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Richard Henderson, Jagannathan Raman, stefanha, Peter Xu,
Paolo Bonzini, Mattias Nissler
This series adds basic support for message-based DMA in qemu's vfio-user
server. This is useful for cases where the client does not provide file
descriptors for accessing system memory via memory mappings. My motivating use
case is to hook up device models as PCIe endpoints to a hardware design. This
works by bridging the PCIe transaction layer to vfio-user, and the endpoint
does not access memory directly, but sends memory request TLPs to the hardware
design in order to perform DMA.
Note that there is some more work required on top of this series to get
message-based DMA to really work well:
* libvfio-user has a long-standing issue where socket communication gets messed
up when messages are sent from both ends at the same time. See
https://github.com/nutanix/libvfio-user/issues/279 for more details. I've
been engaging there and a fix is in review.
* qemu currently breaks down DMA accesses into chunks of at most 8 bytes, each
  of which is handled in a separate vfio-user DMA request message. This is
  quite terrible for large DMA accesses, such as when nvme reads and writes
  page-sized blocks. Thus, I would like to improve qemu to be able to perform
  larger accesses, at least for indirect memory regions. I have something
  working locally, but since this will likely result in more involved surgery
  and discussion, I am leaving this to be addressed in a separate patch.
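  For illustration, with the current 8-byte limit a single 4 KiB nvme transfer
  turns into 4096 / 8 = 512 separate vfio-user DMA request messages, each of
  them a socket round trip.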
Changes from v1:
* Address Stefan's review comments. In particular, enforce an allocation limit
and don't drop the map client callbacks given that map requests can fail when
hitting size limits.
* libvfio-user version bump now included in the series.
* Tested as well on big-endian s390x. This uncovered another byte order issue
in vfio-user server code that I've included a fix for.
Changes from v2:
* Add a preparatory patch to make bounce buffering an AddressSpace-specific
concept.
* The total buffer size limit parameter is now per AddressSpace and can be
  configured for PCIDevice via a property.
* Store a magic value in the first bytes of the bounce buffer struct as a
  best-effort measure to detect invalid pointers in address_space_unmap.
Mattias Nissler (5):
softmmu: Per-AddressSpace bounce buffering
softmmu: Support concurrent bounce buffers
Update subprojects/libvfio-user
vfio-user: Message-based DMA support
vfio-user: Fix config space access byte order
hw/pci/pci.c | 8 ++
hw/remote/trace-events | 2 +
hw/remote/vfio-user-obj.c | 88 +++++++++++++++++--
include/exec/cpu-common.h | 2 -
include/exec/memory.h | 39 ++++++++-
include/hw/pci/pci_device.h | 3 +
softmmu/dma-helpers.c | 4 +-
softmmu/memory.c | 4 +
softmmu/physmem.c | 155 ++++++++++++++++++----------------
subprojects/libvfio-user.wrap | 2 +-
10 files changed, 220 insertions(+), 87 deletions(-)
--
2.34.1
* [PATCH v3 1/5] softmmu: Per-AddressSpace bounce buffering
2023-09-07 13:04 [PATCH v3 0/5] Support message-based DMA in vfio-user server Mattias Nissler
@ 2023-09-07 13:04 ` Mattias Nissler
2023-09-13 18:30 ` Peter Xu
2023-09-07 13:04 ` [PATCH v3 2/5] softmmu: Support concurrent bounce buffers Mattias Nissler
` (4 subsequent siblings)
5 siblings, 1 reply; 23+ messages in thread
From: Mattias Nissler @ 2023-09-07 13:04 UTC (permalink / raw)
To: qemu-devel
Cc: john.levon, Elena Ufimtseva, Michael S. Tsirkin,
David Hildenbrand, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Richard Henderson, Jagannathan Raman, stefanha, Peter Xu,
Paolo Bonzini, Mattias Nissler
Instead of using a single global bounce buffer, give each AddressSpace
its own bounce buffer. The MapClient callback mechanism moves to
AddressSpace accordingly.
This is in preparation for generalizing bounce buffer handling further
to allow multiple bounce buffers, with a total allocation limit
configured per AddressSpace.
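As a minimal usage sketch of the per-AddressSpace API (illustrative only; it
mirrors what softmmu/dma-helpers.c does after this change, and the MyDevState
type and mydev_* functions are made up):
    typedef struct {
        AddressSpace *as;
        hwaddr dma_addr;
        hwaddr dma_len;
        QEMUBH *retry_bh;
    } MyDevState;
    static void mydev_start_dma(MyDevState *s);
    /* Bottom half that re-attempts the mapping once bounce resources free up. */
    static void mydev_retry_dma(void *opaque)
    {
        MyDevState *s = opaque;
        qemu_bh_delete(s->retry_bh);
        s->retry_bh = NULL;
        mydev_start_dma(s);
    }
    static void mydev_start_dma(MyDevState *s)
    {
        hwaddr len = s->dma_len;
        void *buf = address_space_map(s->as, s->dma_addr, &len, false,
                                      MEMTXATTRS_UNSPECIFIED);
        if (!buf) {
            /* Bounce buffer busy: ask to be notified on this AddressSpace. */
            s->retry_bh = qemu_bh_new(mydev_retry_dma, s);
            address_space_register_map_client(s->as, s->retry_bh);
            return;
        }
        /* ... consume buf ..., then: */
        address_space_unmap(s->as, buf, len, false, len);
    }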
Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
---
include/exec/cpu-common.h | 2 -
include/exec/memory.h | 45 ++++++++++++++++-
softmmu/dma-helpers.c | 4 +-
softmmu/memory.c | 3 ++
softmmu/physmem.c | 103 ++++++++++++++++----------------------
5 files changed, 90 insertions(+), 67 deletions(-)
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 41788c0bdd..63463c415d 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -138,8 +138,6 @@ void *cpu_physical_memory_map(hwaddr addr,
bool is_write);
void cpu_physical_memory_unmap(void *buffer, hwaddr len,
bool is_write, hwaddr access_len);
-void cpu_register_map_client(QEMUBH *bh);
-void cpu_unregister_map_client(QEMUBH *bh);
bool cpu_physical_memory_is_io(hwaddr phys_addr);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 68284428f8..7d68936157 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1076,6 +1076,19 @@ struct MemoryListener {
QTAILQ_ENTRY(MemoryListener) link_as;
};
+typedef struct AddressSpaceMapClient {
+ QEMUBH *bh;
+ QLIST_ENTRY(AddressSpaceMapClient) link;
+} AddressSpaceMapClient;
+
+typedef struct {
+ MemoryRegion *mr;
+ void *buffer;
+ hwaddr addr;
+ hwaddr len;
+ bool in_use;
+} BounceBuffer;
+
/**
* struct AddressSpace: describes a mapping of addresses to #MemoryRegion objects
*/
@@ -1092,6 +1105,12 @@ struct AddressSpace {
struct MemoryRegionIoeventfd *ioeventfds;
QTAILQ_HEAD(, MemoryListener) listeners;
QTAILQ_ENTRY(AddressSpace) address_spaces_link;
+
+ /* Bounce buffer to use for this address space. */
+ BounceBuffer bounce;
+ /* List of callbacks to invoke when buffers free up */
+ QemuMutex map_client_list_lock;
+ QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
};
typedef struct AddressSpaceDispatch AddressSpaceDispatch;
@@ -2832,8 +2851,8 @@ bool address_space_access_valid(AddressSpace *as, hwaddr addr, hwaddr len,
* May return %NULL and set *@plen to zero(0), if resources needed to perform
* the mapping are exhausted.
* Use only for reads OR writes - not for read-modify-write operations.
- * Use cpu_register_map_client() to know when retrying the map operation is
- * likely to succeed.
+ * Use address_space_register_map_client() to know when retrying the map
+ * operation is likely to succeed.
*
* @as: #AddressSpace to be accessed
* @addr: address within that address space
@@ -2858,6 +2877,28 @@ void *address_space_map(AddressSpace *as, hwaddr addr,
void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
bool is_write, hwaddr access_len);
+/*
+ * address_space_register_map_client: Register a callback to invoke when
+ * resources for address_space_map() are available again.
+ *
+ * address_space_map may fail when there are not enough resources available,
+ * such as when bounce buffer memory would exceed the limit. The callback can
+ * be used to retry the address_space_map operation. Note that the callback
+ * gets automatically removed after firing.
+ *
+ * @as: #AddressSpace to be accessed
+ * @bh: callback to invoke when address_space_map() retry is appropriate
+ */
+void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
+
+/*
+ * address_space_unregister_map_client: Unregister a callback that has
+ * previously been registered and not fired yet.
+ *
+ * @as: #AddressSpace to be accessed
+ * @bh: callback to unregister
+ */
+void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
/* Internal functions, part of the implementation of address_space_read. */
MemTxResult address_space_read_full(AddressSpace *as, hwaddr addr,
diff --git a/softmmu/dma-helpers.c b/softmmu/dma-helpers.c
index 2463964805..d9fc26c063 100644
--- a/softmmu/dma-helpers.c
+++ b/softmmu/dma-helpers.c
@@ -167,7 +167,7 @@ static void dma_blk_cb(void *opaque, int ret)
if (dbs->iov.size == 0) {
trace_dma_map_wait(dbs);
dbs->bh = aio_bh_new(ctx, reschedule_dma, dbs);
- cpu_register_map_client(dbs->bh);
+ address_space_register_map_client(dbs->sg->as, dbs->bh);
goto out;
}
@@ -197,7 +197,7 @@ static void dma_aio_cancel(BlockAIOCB *acb)
}
if (dbs->bh) {
- cpu_unregister_map_client(dbs->bh);
+ address_space_unregister_map_client(dbs->sg->as, dbs->bh);
qemu_bh_delete(dbs->bh);
dbs->bh = NULL;
}
diff --git a/softmmu/memory.c b/softmmu/memory.c
index 7d9494ce70..5c9622c3d6 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -3105,6 +3105,9 @@ void address_space_init(AddressSpace *as, MemoryRegion *root, const char *name)
as->ioeventfds = NULL;
QTAILQ_INIT(&as->listeners);
QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link);
+ as->bounce.in_use = false;
+ qemu_mutex_init(&as->map_client_list_lock);
+ QLIST_INIT(&as->map_client_list);
as->name = g_strdup(name ? name : "anonymous");
address_space_update_topology(as);
address_space_update_ioeventfds(as);
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 18277ddd67..f40cc564b8 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -2926,55 +2926,37 @@ void cpu_flush_icache_range(hwaddr start, hwaddr len)
NULL, len, FLUSH_CACHE);
}
-typedef struct {
- MemoryRegion *mr;
- void *buffer;
- hwaddr addr;
- hwaddr len;
- bool in_use;
-} BounceBuffer;
-
-static BounceBuffer bounce;
-
-typedef struct MapClient {
- QEMUBH *bh;
- QLIST_ENTRY(MapClient) link;
-} MapClient;
-
-QemuMutex map_client_list_lock;
-static QLIST_HEAD(, MapClient) map_client_list
- = QLIST_HEAD_INITIALIZER(map_client_list);
-
-static void cpu_unregister_map_client_do(MapClient *client)
+static void
+address_space_unregister_map_client_do(AddressSpaceMapClient *client)
{
QLIST_REMOVE(client, link);
g_free(client);
}
-static void cpu_notify_map_clients_locked(void)
+static void address_space_notify_map_clients_locked(AddressSpace *as)
{
- MapClient *client;
+ AddressSpaceMapClient *client;
- while (!QLIST_EMPTY(&map_client_list)) {
- client = QLIST_FIRST(&map_client_list);
+ while (!QLIST_EMPTY(&as->map_client_list)) {
+ client = QLIST_FIRST(&as->map_client_list);
qemu_bh_schedule(client->bh);
- cpu_unregister_map_client_do(client);
+ address_space_unregister_map_client_do(client);
}
}
-void cpu_register_map_client(QEMUBH *bh)
+void address_space_register_map_client(AddressSpace *as, QEMUBH *bh)
{
- MapClient *client = g_malloc(sizeof(*client));
+ AddressSpaceMapClient *client = g_malloc(sizeof(*client));
- qemu_mutex_lock(&map_client_list_lock);
+ qemu_mutex_lock(&as->map_client_list_lock);
client->bh = bh;
- QLIST_INSERT_HEAD(&map_client_list, client, link);
- /* Write map_client_list before reading in_use. */
+ QLIST_INSERT_HEAD(&as->map_client_list, client, link);
+ /* Write map_client_list before reading bounce_buffer_size. */
smp_mb();
- if (!qatomic_read(&bounce.in_use)) {
- cpu_notify_map_clients_locked();
+ if (!qatomic_read(&as->bounce.in_use)) {
+ address_space_notify_map_clients_locked(as);
}
- qemu_mutex_unlock(&map_client_list_lock);
+ qemu_mutex_unlock(&as->map_client_list_lock);
}
void cpu_exec_init_all(void)
@@ -2990,28 +2972,27 @@ void cpu_exec_init_all(void)
finalize_target_page_bits();
io_mem_init();
memory_map_init();
- qemu_mutex_init(&map_client_list_lock);
}
-void cpu_unregister_map_client(QEMUBH *bh)
+void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh)
{
- MapClient *client;
+ AddressSpaceMapClient *client;
- qemu_mutex_lock(&map_client_list_lock);
- QLIST_FOREACH(client, &map_client_list, link) {
+ qemu_mutex_lock(&as->map_client_list_lock);
+ QLIST_FOREACH(client, &as->map_client_list, link) {
if (client->bh == bh) {
- cpu_unregister_map_client_do(client);
+ address_space_unregister_map_client_do(client);
break;
}
}
- qemu_mutex_unlock(&map_client_list_lock);
+ qemu_mutex_unlock(&as->map_client_list_lock);
}
-static void cpu_notify_map_clients(void)
+static void address_space_notify_map_clients(AddressSpace *as)
{
- qemu_mutex_lock(&map_client_list_lock);
- cpu_notify_map_clients_locked();
- qemu_mutex_unlock(&map_client_list_lock);
+ qemu_mutex_lock(&as->map_client_list_lock);
+ address_space_notify_map_clients_locked(as);
+ qemu_mutex_unlock(&as->map_client_list_lock);
}
static bool flatview_access_valid(FlatView *fv, hwaddr addr, hwaddr len,
@@ -3078,8 +3059,8 @@ flatview_extend_translation(FlatView *fv, hwaddr addr,
* May map a subset of the requested range, given by and returned in *plen.
* May return NULL if resources needed to perform the mapping are exhausted.
* Use only for reads OR writes - not for read-modify-write operations.
- * Use cpu_register_map_client() to know when retrying the map operation is
- * likely to succeed.
+ * Use address_space_register_map_client() to know when retrying the map
+ * operation is likely to succeed.
*/
void *address_space_map(AddressSpace *as,
hwaddr addr,
@@ -3102,25 +3083,25 @@ void *address_space_map(AddressSpace *as,
mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
if (!memory_access_is_direct(mr, is_write)) {
- if (qatomic_xchg(&bounce.in_use, true)) {
+ if (qatomic_xchg(&as->bounce.in_use, true)) {
*plen = 0;
return NULL;
}
/* Avoid unbounded allocations */
l = MIN(l, TARGET_PAGE_SIZE);
- bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, l);
- bounce.addr = addr;
- bounce.len = l;
+ as->bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, l);
+ as->bounce.addr = addr;
+ as->bounce.len = l;
memory_region_ref(mr);
- bounce.mr = mr;
+ as->bounce.mr = mr;
if (!is_write) {
flatview_read(fv, addr, MEMTXATTRS_UNSPECIFIED,
- bounce.buffer, l);
+ as->bounce.buffer, l);
}
*plen = l;
- return bounce.buffer;
+ return as->bounce.buffer;
}
@@ -3138,7 +3119,7 @@ void *address_space_map(AddressSpace *as,
void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
bool is_write, hwaddr access_len)
{
- if (buffer != bounce.buffer) {
+ if (buffer != as->bounce.buffer) {
MemoryRegion *mr;
ram_addr_t addr1;
@@ -3154,15 +3135,15 @@ void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
return;
}
if (is_write) {
- address_space_write(as, bounce.addr, MEMTXATTRS_UNSPECIFIED,
- bounce.buffer, access_len);
+ address_space_write(as, as->bounce.addr, MEMTXATTRS_UNSPECIFIED,
+ as->bounce.buffer, access_len);
}
- qemu_vfree(bounce.buffer);
- bounce.buffer = NULL;
- memory_region_unref(bounce.mr);
+ qemu_vfree(as->bounce.buffer);
+ as->bounce.buffer = NULL;
+ memory_region_unref(as->bounce.mr);
/* Clear in_use before reading map_client_list. */
- qatomic_set_mb(&bounce.in_use, false);
- cpu_notify_map_clients();
+ qatomic_set_mb(&as->bounce.in_use, false);
+ address_space_notify_map_clients(as);
}
void *cpu_physical_memory_map(hwaddr addr,
--
2.34.1
* [PATCH v3 2/5] softmmu: Support concurrent bounce buffers
2023-09-07 13:04 [PATCH v3 0/5] Support message-based DMA in vfio-user server Mattias Nissler
2023-09-07 13:04 ` [PATCH v3 1/5] softmmu: Per-AddressSpace bounce buffering Mattias Nissler
@ 2023-09-07 13:04 ` Mattias Nissler
2023-09-13 19:11 ` Peter Xu
2023-09-14 18:49 ` Stefan Hajnoczi
2023-09-07 13:04 ` [PATCH v3 3/5] Update subprojects/libvfio-user Mattias Nissler
` (3 subsequent siblings)
5 siblings, 2 replies; 23+ messages in thread
From: Mattias Nissler @ 2023-09-07 13:04 UTC (permalink / raw)
To: qemu-devel
Cc: john.levon, Elena Ufimtseva, Michael S. Tsirkin,
David Hildenbrand, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Richard Henderson, Jagannathan Raman, stefanha, Peter Xu,
Paolo Bonzini, Mattias Nissler
When DMA memory can't be directly accessed, as is the case when
running the device model in a separate process without shareable DMA
file descriptors, bounce buffering is used.
It is not uncommon for device models to request mapping of several DMA
regions at the same time. Examples include:
* net devices, e.g. when transmitting a packet that is split across
several TX descriptors (observed with igb)
* USB host controllers, when handling a packet with multiple data TRBs
(observed with xhci)
Previously, qemu only provided a single bounce buffer per AddressSpace
and would fail DMA map requests while the buffer was already in use. In
turn, this would cause DMA failures that ultimately manifested as hardware
errors from the guest perspective.
This change allocates DMA bounce buffers dynamically instead of
supporting only a single buffer. Thus, multiple DMA mappings work
correctly even when RAM can't be mmap()-ed.
The total bounce buffer allocation size is limited individually for each
AddressSpace. The default limit is 4096 bytes, matching the previous
maximum buffer size. A new x-max-bounce-buffer-size parameter is
provided to configure the limit for PCI devices.
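For illustration, the limit can then be set on the command line like any other
PCI device property (device and value chosen arbitrarily):
    -device igb,netdev=net0,x-max-bounce-buffer-size=65536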
Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
---
hw/pci/pci.c | 8 ++++
include/exec/memory.h | 14 ++----
include/hw/pci/pci_device.h | 3 ++
softmmu/memory.c | 3 +-
softmmu/physmem.c | 94 +++++++++++++++++++++++++------------
5 files changed, 80 insertions(+), 42 deletions(-)
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 881d774fb6..8c4541b394 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -85,6 +85,8 @@ static Property pci_props[] = {
QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
DEFINE_PROP_BIT("x-pcie-ari-nextfn-1", PCIDevice, cap_present,
QEMU_PCIE_ARI_NEXTFN_1_BITNR, false),
+ DEFINE_PROP_SIZE("x-max-bounce-buffer-size", PCIDevice,
+ max_bounce_buffer_size, 4096),
DEFINE_PROP_END_OF_LIST()
};
@@ -1208,6 +1210,8 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
"bus master container", UINT64_MAX);
address_space_init(&pci_dev->bus_master_as,
&pci_dev->bus_master_container_region, pci_dev->name);
+ pci_dev->bus_master_as.max_bounce_buffer_size =
+ pci_dev->max_bounce_buffer_size;
if (phase_check(PHASE_MACHINE_READY)) {
pci_init_bus_master(pci_dev);
@@ -2664,6 +2668,10 @@ static void pci_device_class_init(ObjectClass *klass, void *data)
k->unrealize = pci_qdev_unrealize;
k->bus_type = TYPE_PCI_BUS;
device_class_set_props(k, pci_props);
+ object_class_property_set_description(
+ klass, "x-max-bounce-buffer-size",
+ "Maximum buffer size allocated for bounce buffers used for mapped "
+ "access to indirect DMA memory");
}
static void pci_device_class_base_init(ObjectClass *klass, void *data)
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 7d68936157..5577542b5e 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1081,14 +1081,6 @@ typedef struct AddressSpaceMapClient {
QLIST_ENTRY(AddressSpaceMapClient) link;
} AddressSpaceMapClient;
-typedef struct {
- MemoryRegion *mr;
- void *buffer;
- hwaddr addr;
- hwaddr len;
- bool in_use;
-} BounceBuffer;
-
/**
* struct AddressSpace: describes a mapping of addresses to #MemoryRegion objects
*/
@@ -1106,8 +1098,10 @@ struct AddressSpace {
QTAILQ_HEAD(, MemoryListener) listeners;
QTAILQ_ENTRY(AddressSpace) address_spaces_link;
- /* Bounce buffer to use for this address space. */
- BounceBuffer bounce;
+ /* Maximum DMA bounce buffer size used for indirect memory map requests */
+ uint64_t max_bounce_buffer_size;
+ /* Total size of bounce buffers currently allocated, atomically accessed */
+ uint64_t bounce_buffer_size;
/* List of callbacks to invoke when buffers free up */
QemuMutex map_client_list_lock;
QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index d3dd0f64b2..f4027c5379 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -160,6 +160,9 @@ struct PCIDevice {
/* ID of standby device in net_failover pair */
char *failover_pair_id;
uint32_t acpi_index;
+
+ /* Maximum DMA bounce buffer size used for indirect memory map requests */
+ uint64_t max_bounce_buffer_size;
};
static inline int pci_intx(PCIDevice *pci_dev)
diff --git a/softmmu/memory.c b/softmmu/memory.c
index 5c9622c3d6..e02799359c 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -3105,7 +3105,8 @@ void address_space_init(AddressSpace *as, MemoryRegion *root, const char *name)
as->ioeventfds = NULL;
QTAILQ_INIT(&as->listeners);
QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link);
- as->bounce.in_use = false;
+ as->max_bounce_buffer_size = 4096;
+ as->bounce_buffer_size = 0;
qemu_mutex_init(&as->map_client_list_lock);
QLIST_INIT(&as->map_client_list);
as->name = g_strdup(name ? name : "anonymous");
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index f40cc564b8..e3d1cf5fba 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -2926,6 +2926,20 @@ void cpu_flush_icache_range(hwaddr start, hwaddr len)
NULL, len, FLUSH_CACHE);
}
+/*
+ * A magic value stored in the first 8 bytes of the bounce buffer struct. Used
+ * to detect illegal pointers passed to address_space_unmap.
+ */
+#define BOUNCE_BUFFER_MAGIC 0xb4017ceb4ffe12ed
+
+typedef struct {
+ uint64_t magic;
+ MemoryRegion *mr;
+ hwaddr addr;
+ size_t len;
+ uint8_t buffer[];
+} BounceBuffer;
+
static void
address_space_unregister_map_client_do(AddressSpaceMapClient *client)
{
@@ -2953,7 +2967,7 @@ void address_space_register_map_client(AddressSpace *as, QEMUBH *bh)
QLIST_INSERT_HEAD(&as->map_client_list, client, link);
/* Write map_client_list before reading bounce_buffer_size. */
smp_mb();
- if (!qatomic_read(&as->bounce.in_use)) {
+ if (qatomic_read(&as->bounce_buffer_size) < as->max_bounce_buffer_size) {
address_space_notify_map_clients_locked(as);
}
qemu_mutex_unlock(&as->map_client_list_lock);
@@ -3081,31 +3095,36 @@ void *address_space_map(AddressSpace *as,
RCU_READ_LOCK_GUARD();
fv = address_space_to_flatview(as);
mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
+ memory_region_ref(mr);
if (!memory_access_is_direct(mr, is_write)) {
- if (qatomic_xchg(&as->bounce.in_use, true)) {
+ size_t size = qatomic_add_fetch(&as->bounce_buffer_size, l);
+ if (size > as->max_bounce_buffer_size) {
+ size_t excess = size - as->max_bounce_buffer_size;
+ l -= excess;
+ qatomic_sub(&as->bounce_buffer_size, excess);
+ }
+
+ if (l == 0) {
*plen = 0;
return NULL;
}
- /* Avoid unbounded allocations */
- l = MIN(l, TARGET_PAGE_SIZE);
- as->bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, l);
- as->bounce.addr = addr;
- as->bounce.len = l;
- memory_region_ref(mr);
- as->bounce.mr = mr;
+ BounceBuffer *bounce = g_malloc0(l + sizeof(BounceBuffer));
+ bounce->magic = BOUNCE_BUFFER_MAGIC;
+ bounce->mr = mr;
+ bounce->addr = addr;
+ bounce->len = l;
+
if (!is_write) {
flatview_read(fv, addr, MEMTXATTRS_UNSPECIFIED,
- as->bounce.buffer, l);
+ bounce->buffer, l);
}
*plen = l;
- return as->bounce.buffer;
+ return bounce->buffer;
}
-
- memory_region_ref(mr);
*plen = flatview_extend_translation(fv, addr, len, mr, xlat,
l, is_write, attrs);
fuzz_dma_read_cb(addr, *plen, mr);
@@ -3119,31 +3138,44 @@ void *address_space_map(AddressSpace *as,
void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
bool is_write, hwaddr access_len)
{
- if (buffer != as->bounce.buffer) {
- MemoryRegion *mr;
- ram_addr_t addr1;
+ MemoryRegion *mr;
+ ram_addr_t addr1;
+
+ mr = memory_region_from_host(buffer, &addr1);
+ if (mr == NULL) {
+ BounceBuffer *bounce = container_of(buffer, BounceBuffer, buffer);
+ if (bounce->magic != BOUNCE_BUFFER_MAGIC) {
+ error_report(
+ "Unmap request for %p, which neither corresponds to a memory "
+ "region, nor looks like a bounce buffer, ignoring!",
+ buffer);
+ return;
+ }
- mr = memory_region_from_host(buffer, &addr1);
- assert(mr != NULL);
if (is_write) {
- invalidate_and_set_dirty(mr, addr1, access_len);
+ address_space_write(as, bounce->addr, MEMTXATTRS_UNSPECIFIED,
+ bounce->buffer, access_len);
}
- if (xen_enabled()) {
- xen_invalidate_map_cache_entry(buffer);
+
+ memory_region_unref(bounce->mr);
+ uint64_t previous_buffer_size =
+ qatomic_fetch_sub(&as->bounce_buffer_size, bounce->len);
+ if (previous_buffer_size == as->max_bounce_buffer_size) {
+ /* Write bounce_buffer_size before reading map_client_list. */
+ smp_mb();
+ address_space_notify_map_clients(as);
}
- memory_region_unref(mr);
+ bounce->magic = ~BOUNCE_BUFFER_MAGIC;
+ g_free(bounce);
return;
}
+
+ if (xen_enabled()) {
+ xen_invalidate_map_cache_entry(buffer);
+ }
if (is_write) {
- address_space_write(as, as->bounce.addr, MEMTXATTRS_UNSPECIFIED,
- as->bounce.buffer, access_len);
- }
- qemu_vfree(as->bounce.buffer);
- as->bounce.buffer = NULL;
- memory_region_unref(as->bounce.mr);
- /* Clear in_use before reading map_client_list. */
- qatomic_set_mb(&as->bounce.in_use, false);
- address_space_notify_map_clients(as);
+ invalidate_and_set_dirty(mr, addr1, access_len);
+ }
}
void *cpu_physical_memory_map(hwaddr addr,
--
2.34.1
* [PATCH v3 3/5] Update subprojects/libvfio-user
2023-09-07 13:04 [PATCH v3 0/5] Support message-based DMA in vfio-user server Mattias Nissler
2023-09-07 13:04 ` [PATCH v3 1/5] softmmu: Per-AddressSpace bounce buffering Mattias Nissler
2023-09-07 13:04 ` [PATCH v3 2/5] softmmu: Support concurrent bounce buffers Mattias Nissler
@ 2023-09-07 13:04 ` Mattias Nissler
2023-09-07 13:04 ` [PATCH v3 4/5] vfio-user: Message-based DMA support Mattias Nissler
` (2 subsequent siblings)
5 siblings, 0 replies; 23+ messages in thread
From: Mattias Nissler @ 2023-09-07 13:04 UTC (permalink / raw)
To: qemu-devel
Cc: john.levon, Elena Ufimtseva, Michael S. Tsirkin,
David Hildenbrand, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Richard Henderson, Jagannathan Raman, stefanha, Peter Xu,
Paolo Bonzini, Mattias Nissler
Brings in assorted bug fixes. In particular, "Fix address calculation
for message-based DMA" corrects a bug in DMA address calculation; that fix
is necessary to get DMA across VFIO-user messages working.
Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
---
subprojects/libvfio-user.wrap | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/subprojects/libvfio-user.wrap b/subprojects/libvfio-user.wrap
index 416955ca45..135667a40d 100644
--- a/subprojects/libvfio-user.wrap
+++ b/subprojects/libvfio-user.wrap
@@ -1,4 +1,4 @@
[wrap-git]
url = https://gitlab.com/qemu-project/libvfio-user.git
-revision = 0b28d205572c80b568a1003db2c8f37ca333e4d7
+revision = f63ef82ad01821417df488cef7ec1fd94c3883fa
depth = 1
--
2.34.1
* [PATCH v3 4/5] vfio-user: Message-based DMA support
2023-09-07 13:04 [PATCH v3 0/5] Support message-based DMA in vfio-user server Mattias Nissler
` (2 preceding siblings ...)
2023-09-07 13:04 ` [PATCH v3 3/5] Update subprojects/libvfio-user Mattias Nissler
@ 2023-09-07 13:04 ` Mattias Nissler
2023-09-14 19:04 ` Stefan Hajnoczi
2023-09-07 13:04 ` [PATCH v3 5/5] vfio-user: Fix config space access byte order Mattias Nissler
2023-09-14 14:39 ` [PATCH v3 0/5] Support message-based DMA in vfio-user server Stefan Hajnoczi
5 siblings, 1 reply; 23+ messages in thread
From: Mattias Nissler @ 2023-09-07 13:04 UTC (permalink / raw)
To: qemu-devel
Cc: john.levon, Elena Ufimtseva, Michael S. Tsirkin,
David Hildenbrand, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Richard Henderson, Jagannathan Raman, stefanha, Peter Xu,
Paolo Bonzini, Mattias Nissler
Wire up support for DMA for the case where the vfio-user client does not
provide mmap()-able file descriptors, but DMA requests must be performed
via the VFIO-user protocol. This installs an indirect memory region,
which already works for pci_dma_{read,write}, and pci_dma_map works
thanks to the existing DMA bounce buffering support.
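For illustration (not part of the patch; guest_addr is a placeholder), an
ordinary DMA access issued by the device model, e.g.
    uint32_t val;
    pci_dma_read(o->pci_dev, guest_addr, &val, sizeof(val));
is now dispatched to the indirect region's MemoryRegionOps and travels over
the vfio-user socket as a DMA read request instead of touching mmap()-ed
guest memory.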
Note that while simple scenarios work with this patch, there's a known
race condition in libvfio-user that will mess up the communication
channel. See https://github.com/nutanix/libvfio-user/issues/279 for
details as well as a proposed fix.
Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
---
hw/remote/trace-events | 2 +
hw/remote/vfio-user-obj.c | 84 +++++++++++++++++++++++++++++++++++----
2 files changed, 79 insertions(+), 7 deletions(-)
diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index 0d1b7d56a5..358a68fb34 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -9,6 +9,8 @@ vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x -> 0x%x"
vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x <- 0x%x"
vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", %zu bytes"
vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
+vfu_dma_read(uint64_t gpa, size_t len) "vfu: DMA read 0x%"PRIx64", %zu bytes"
+vfu_dma_write(uint64_t gpa, size_t len) "vfu: DMA write 0x%"PRIx64", %zu bytes"
vfu_bar_register(int i, uint64_t addr, uint64_t size) "vfu: BAR %d: addr 0x%"PRIx64" size 0x%"PRIx64""
vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR address 0x%"PRIx64""
vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR address 0x%"PRIx64""
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 8b10c32a3c..cee5e615a9 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -300,6 +300,63 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
return count;
}
+static MemTxResult vfu_dma_read(void *opaque, hwaddr addr, uint64_t *val,
+ unsigned size, MemTxAttrs attrs)
+{
+ MemoryRegion *region = opaque;
+ VfuObject *o = VFU_OBJECT(region->owner);
+ uint8_t buf[sizeof(uint64_t)];
+
+ trace_vfu_dma_read(region->addr + addr, size);
+
+ dma_sg_t *sg = alloca(dma_sg_size());
+ vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+ if (vfu_addr_to_sgl(o->vfu_ctx, vfu_addr, size, sg, 1, PROT_READ) < 0 ||
+ vfu_sgl_read(o->vfu_ctx, sg, 1, buf) != 0) {
+ return MEMTX_ERROR;
+ }
+
+ *val = ldn_he_p(buf, size);
+
+ return MEMTX_OK;
+}
+
+static MemTxResult vfu_dma_write(void *opaque, hwaddr addr, uint64_t val,
+ unsigned size, MemTxAttrs attrs)
+{
+ MemoryRegion *region = opaque;
+ VfuObject *o = VFU_OBJECT(region->owner);
+ uint8_t buf[sizeof(uint64_t)];
+
+ trace_vfu_dma_write(region->addr + addr, size);
+
+ stn_he_p(buf, size, val);
+
+ dma_sg_t *sg = alloca(dma_sg_size());
+ vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+ if (vfu_addr_to_sgl(o->vfu_ctx, vfu_addr, size, sg, 1, PROT_WRITE) < 0 ||
+ vfu_sgl_write(o->vfu_ctx, sg, 1, buf) != 0) {
+ return MEMTX_ERROR;
+ }
+
+ return MEMTX_OK;
+}
+
+static const MemoryRegionOps vfu_dma_ops = {
+ .read_with_attrs = vfu_dma_read,
+ .write_with_attrs = vfu_dma_write,
+ .endianness = DEVICE_HOST_ENDIAN,
+ .valid = {
+ .min_access_size = 1,
+ .max_access_size = 8,
+ .unaligned = true,
+ },
+ .impl = {
+ .min_access_size = 1,
+ .max_access_size = 8,
+ },
+};
+
static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
{
VfuObject *o = vfu_get_private(vfu_ctx);
@@ -308,17 +365,30 @@ static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
g_autofree char *name = NULL;
struct iovec *iov = &info->iova;
- if (!info->vaddr) {
- return;
- }
-
name = g_strdup_printf("mem-%s-%"PRIx64"", o->device,
- (uint64_t)info->vaddr);
+ (uint64_t)iov->iov_base);
subregion = g_new0(MemoryRegion, 1);
- memory_region_init_ram_ptr(subregion, NULL, name,
- iov->iov_len, info->vaddr);
+ if (info->vaddr) {
+ memory_region_init_ram_ptr(subregion, OBJECT(o), name,
+ iov->iov_len, info->vaddr);
+ } else {
+ /*
+ * Note that I/O regions' MemoryRegionOps handle accesses of at most 8
+ * bytes at a time, and larger accesses are broken down. However,
+ * many/most DMA accesses are larger than 8 bytes and VFIO-user can
+ * handle large DMA accesses just fine, thus this size restriction
+ * unnecessarily hurts performance, in particular given that each
+ * access causes a round trip on the VFIO-user socket.
+ *
+ * TODO: Investigate how to plumb larger accesses through memory
+ * regions, possibly by amending MemoryRegionOps or by creating a new
+ * memory region type.
+ */
+ memory_region_init_io(subregion, OBJECT(o), &vfu_dma_ops, subregion,
+ name, iov->iov_len);
+ }
dma_as = pci_device_iommu_address_space(o->pci_dev);
--
2.34.1
* [PATCH v3 5/5] vfio-user: Fix config space access byte order
2023-09-07 13:04 [PATCH v3 0/5] Support message-based DMA in vfio-user server Mattias Nissler
` (3 preceding siblings ...)
2023-09-07 13:04 ` [PATCH v3 4/5] vfio-user: Message-based DMA support Mattias Nissler
@ 2023-09-07 13:04 ` Mattias Nissler
2023-09-14 19:34 ` Stefan Hajnoczi
` (2 more replies)
2023-09-14 14:39 ` [PATCH v3 0/5] Support message-based DMA in vfio-user server Stefan Hajnoczi
5 siblings, 3 replies; 23+ messages in thread
From: Mattias Nissler @ 2023-09-07 13:04 UTC (permalink / raw)
To: qemu-devel
Cc: john.levon, Elena Ufimtseva, Michael S. Tsirkin,
David Hildenbrand, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Richard Henderson, Jagannathan Raman, stefanha, Peter Xu,
Paolo Bonzini, Mattias Nissler
PCI config space is little-endian, so on a big-endian host we need to
perform byte swaps for values as they are passed to and received from
the generic PCI config space access machinery.
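As a minimal illustration (not part of the patch), consider reading the
16-bit value 0x1234, which config space stores as the byte sequence 0x34 0x12:
    uint8_t cfg[2] = { 0x34, 0x12 };  /* little-endian encoding of 0x1234 */
    uint32_t val = 0;
    memcpy(&val, cfg, 2);             /* big-endian host: val == 0x34120000 */
    val = ldn_le_p(cfg, 2);           /* val == 0x1234 on any host */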
Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
---
hw/remote/vfio-user-obj.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index cee5e615a9..d38b4700f3 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -281,7 +281,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
while (bytes > 0) {
len = (bytes > pci_access_width) ? pci_access_width : bytes;
if (is_write) {
- memcpy(&val, ptr, len);
+ val = ldn_le_p(ptr, len);
pci_host_config_write_common(o->pci_dev, offset,
pci_config_size(o->pci_dev),
val, len);
@@ -289,7 +289,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
} else {
val = pci_host_config_read_common(o->pci_dev, offset,
pci_config_size(o->pci_dev), len);
- memcpy(ptr, &val, len);
+ stn_le_p(ptr, len, val);
trace_vfu_cfg_read(offset, val);
}
offset += len;
--
2.34.1
* Re: [PATCH v3 1/5] softmmu: Per-AddressSpace bounce buffering
2023-09-07 13:04 ` [PATCH v3 1/5] softmmu: Per-AddressSpace bounce buffering Mattias Nissler
@ 2023-09-13 18:30 ` Peter Xu
2023-09-15 8:37 ` Mattias Nissler
0 siblings, 1 reply; 23+ messages in thread
From: Peter Xu @ 2023-09-13 18:30 UTC (permalink / raw)
To: Mattias Nissler
Cc: qemu-devel, john.levon, Elena Ufimtseva, Michael S. Tsirkin,
David Hildenbrand, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Richard Henderson, Jagannathan Raman, stefanha, Paolo Bonzini
On Thu, Sep 07, 2023 at 06:04:06AM -0700, Mattias Nissler wrote:
> @@ -3105,6 +3105,9 @@ void address_space_init(AddressSpace *as, MemoryRegion *root, const char *name)
> as->ioeventfds = NULL;
> QTAILQ_INIT(&as->listeners);
> QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link);
> + as->bounce.in_use = false;
> + qemu_mutex_init(&as->map_client_list_lock);
> + QLIST_INIT(&as->map_client_list);
> as->name = g_strdup(name ? name : "anonymous");
> address_space_update_topology(as);
> address_space_update_ioeventfds(as);
Missing the counterpart in do_address_space_destroy()?
Perhaps we should assert on having no one using the buffer, or on the
client list too.
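Something along these lines in do_address_space_destroy(), perhaps (untested
sketch):
    assert(!qatomic_read(&as->bounce.in_use));
    assert(QLIST_EMPTY(&as->map_client_list));
    qemu_mutex_destroy(&as->map_client_list_lock);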
Thanks,
--
Peter Xu
* Re: [PATCH v3 2/5] softmmu: Support concurrent bounce buffers
2023-09-07 13:04 ` [PATCH v3 2/5] softmmu: Support concurrent bounce buffers Mattias Nissler
@ 2023-09-13 19:11 ` Peter Xu
2023-09-15 9:32 ` Mattias Nissler
2023-09-14 18:49 ` Stefan Hajnoczi
1 sibling, 1 reply; 23+ messages in thread
From: Peter Xu @ 2023-09-13 19:11 UTC (permalink / raw)
To: Mattias Nissler
Cc: qemu-devel, john.levon, Elena Ufimtseva, Michael S. Tsirkin,
David Hildenbrand, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Richard Henderson, Jagannathan Raman, stefanha, Paolo Bonzini
On Thu, Sep 07, 2023 at 06:04:07AM -0700, Mattias Nissler wrote:
> When DMA memory can't be directly accessed, as is the case when
> running the device model in a separate process without shareable DMA
> file descriptors, bounce buffering is used.
>
> It is not uncommon for device models to request mapping of several DMA
> regions at the same time. Examples include:
> * net devices, e.g. when transmitting a packet that is split across
> several TX descriptors (observed with igb)
> * USB host controllers, when handling a packet with multiple data TRBs
> (observed with xhci)
>
> Previously, qemu only provided a single bounce buffer per AddressSpace
> and would fail DMA map requests while the buffer was already in use. In
> turn, this would cause DMA failures that ultimately manifest as hardware
> errors from the guest perspective.
>
> This change allocates DMA bounce buffers dynamically instead of
> supporting only a single buffer. Thus, multiple DMA mappings work
> correctly also when RAM can't be mmap()-ed.
>
> The total bounce buffer allocation size is limited individually for each
> AddressSpace. The default limit is 4096 bytes, matching the previous
> maximum buffer size. A new x-max-bounce-buffer-size parameter is
> provided to configure the limit for PCI devices.
>
> Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
> ---
> hw/pci/pci.c | 8 ++++
> include/exec/memory.h | 14 ++----
> include/hw/pci/pci_device.h | 3 ++
> softmmu/memory.c | 3 +-
> softmmu/physmem.c | 94 +++++++++++++++++++++++++------------
> 5 files changed, 80 insertions(+), 42 deletions(-)
>
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 881d774fb6..8c4541b394 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -85,6 +85,8 @@ static Property pci_props[] = {
> QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
> DEFINE_PROP_BIT("x-pcie-ari-nextfn-1", PCIDevice, cap_present,
> QEMU_PCIE_ARI_NEXTFN_1_BITNR, false),
> + DEFINE_PROP_SIZE("x-max-bounce-buffer-size", PCIDevice,
> + max_bounce_buffer_size, 4096),
> DEFINE_PROP_END_OF_LIST()
> };
>
> @@ -1208,6 +1210,8 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
> "bus master container", UINT64_MAX);
> address_space_init(&pci_dev->bus_master_as,
> &pci_dev->bus_master_container_region, pci_dev->name);
> + pci_dev->bus_master_as.max_bounce_buffer_size =
> + pci_dev->max_bounce_buffer_size;
>
> if (phase_check(PHASE_MACHINE_READY)) {
> pci_init_bus_master(pci_dev);
> @@ -2664,6 +2668,10 @@ static void pci_device_class_init(ObjectClass *klass, void *data)
> k->unrealize = pci_qdev_unrealize;
> k->bus_type = TYPE_PCI_BUS;
> device_class_set_props(k, pci_props);
> + object_class_property_set_description(
> + klass, "x-max-bounce-buffer-size",
> + "Maximum buffer size allocated for bounce buffers used for mapped "
> + "access to indirect DMA memory");
> }
>
> static void pci_device_class_base_init(ObjectClass *klass, void *data)
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 7d68936157..5577542b5e 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -1081,14 +1081,6 @@ typedef struct AddressSpaceMapClient {
> QLIST_ENTRY(AddressSpaceMapClient) link;
> } AddressSpaceMapClient;
>
> -typedef struct {
> - MemoryRegion *mr;
> - void *buffer;
> - hwaddr addr;
> - hwaddr len;
> - bool in_use;
> -} BounceBuffer;
> -
> /**
> * struct AddressSpace: describes a mapping of addresses to #MemoryRegion objects
> */
> @@ -1106,8 +1098,10 @@ struct AddressSpace {
> QTAILQ_HEAD(, MemoryListener) listeners;
> QTAILQ_ENTRY(AddressSpace) address_spaces_link;
>
> - /* Bounce buffer to use for this address space. */
> - BounceBuffer bounce;
> + /* Maximum DMA bounce buffer size used for indirect memory map requests */
> + uint64_t max_bounce_buffer_size;
> + /* Total size of bounce buffers currently allocated, atomically accessed */
> + uint64_t bounce_buffer_size;
> /* List of callbacks to invoke when buffers free up */
> QemuMutex map_client_list_lock;
> QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
> diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
> index d3dd0f64b2..f4027c5379 100644
> --- a/include/hw/pci/pci_device.h
> +++ b/include/hw/pci/pci_device.h
> @@ -160,6 +160,9 @@ struct PCIDevice {
> /* ID of standby device in net_failover pair */
> char *failover_pair_id;
> uint32_t acpi_index;
> +
> + /* Maximum DMA bounce buffer size used for indirect memory map requests */
> + uint64_t max_bounce_buffer_size;
> };
>
> static inline int pci_intx(PCIDevice *pci_dev)
> diff --git a/softmmu/memory.c b/softmmu/memory.c
> index 5c9622c3d6..e02799359c 100644
> --- a/softmmu/memory.c
> +++ b/softmmu/memory.c
> @@ -3105,7 +3105,8 @@ void address_space_init(AddressSpace *as, MemoryRegion *root, const char *name)
> as->ioeventfds = NULL;
> QTAILQ_INIT(&as->listeners);
> QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link);
> - as->bounce.in_use = false;
> + as->max_bounce_buffer_size = 4096;
Instead of hard-coding this 4k again (besides the pci property), maybe we can
have address_space_init_with_bouncebuffer() and always pass the size in from
pci do_realize? Then by default the AS would have no bounce buffer support
unless explicitly requested.
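E.g. something like this (purely illustrative, the helper doesn't exist yet):
    void address_space_init_with_bouncebuffer(AddressSpace *as,
                                              MemoryRegion *root,
                                              const char *name,
                                              uint64_t max_bounce_buffer_size)
    {
        address_space_init(as, root, name);
        as->max_bounce_buffer_size = max_bounce_buffer_size;
    }
with plain address_space_init() defaulting max_bounce_buffer_size to 0 and pci
do_realize passing the device property in.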
> + as->bounce_buffer_size = 0;
> qemu_mutex_init(&as->map_client_list_lock);
> QLIST_INIT(&as->map_client_list);
> as->name = g_strdup(name ? name : "anonymous");
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index f40cc564b8..e3d1cf5fba 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -2926,6 +2926,20 @@ void cpu_flush_icache_range(hwaddr start, hwaddr len)
> NULL, len, FLUSH_CACHE);
> }
>
> +/*
> + * A magic value stored in the first 8 bytes of the bounce buffer struct. Used
> + * to detect illegal pointers passed to address_space_unmap.
> + */
> +#define BOUNCE_BUFFER_MAGIC 0xb4017ceb4ffe12ed
> +
> +typedef struct {
> + uint64_t magic;
> + MemoryRegion *mr;
> + hwaddr addr;
> + size_t len;
> + uint8_t buffer[];
> +} BounceBuffer;
> +
> static void
> address_space_unregister_map_client_do(AddressSpaceMapClient *client)
> {
> @@ -2953,7 +2967,7 @@ void address_space_register_map_client(AddressSpace *as, QEMUBH *bh)
> QLIST_INSERT_HEAD(&as->map_client_list, client, link);
> /* Write map_client_list before reading bounce_buffer_size. */
> smp_mb();
> - if (!qatomic_read(&as->bounce.in_use)) {
> + if (qatomic_read(&as->bounce_buffer_size) < as->max_bounce_buffer_size) {
> address_space_notify_map_clients_locked(as);
> }
> qemu_mutex_unlock(&as->map_client_list_lock);
> @@ -3081,31 +3095,36 @@ void *address_space_map(AddressSpace *as,
> RCU_READ_LOCK_GUARD();
> fv = address_space_to_flatview(as);
> mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
> + memory_region_ref(mr);
>
> if (!memory_access_is_direct(mr, is_write)) {
> - if (qatomic_xchg(&as->bounce.in_use, true)) {
> + size_t size = qatomic_add_fetch(&as->bounce_buffer_size, l);
> + if (size > as->max_bounce_buffer_size) {
> + size_t excess = size - as->max_bounce_buffer_size;
> + l -= excess;
> + qatomic_sub(&as->bounce_buffer_size, excess);
> + }
> +
> + if (l == 0) {
> *plen = 0;
> return NULL;
MR refcount leak?
> }
> - /* Avoid unbounded allocations */
> - l = MIN(l, TARGET_PAGE_SIZE);
> - as->bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, l);
> - as->bounce.addr = addr;
> - as->bounce.len = l;
>
> - memory_region_ref(mr);
> - as->bounce.mr = mr;
> + BounceBuffer *bounce = g_malloc0(l + sizeof(BounceBuffer));
> + bounce->magic = BOUNCE_BUFFER_MAGIC;
> + bounce->mr = mr;
> + bounce->addr = addr;
> + bounce->len = l;
> +
> if (!is_write) {
> flatview_read(fv, addr, MEMTXATTRS_UNSPECIFIED,
> - as->bounce.buffer, l);
> + bounce->buffer, l);
> }
>
> *plen = l;
> - return as->bounce.buffer;
> + return bounce->buffer;
> }
>
> -
> - memory_region_ref(mr);
> *plen = flatview_extend_translation(fv, addr, len, mr, xlat,
> l, is_write, attrs);
> fuzz_dma_read_cb(addr, *plen, mr);
> @@ -3119,31 +3138,44 @@ void *address_space_map(AddressSpace *as,
> void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
> bool is_write, hwaddr access_len)
> {
> - if (buffer != as->bounce.buffer) {
> - MemoryRegion *mr;
> - ram_addr_t addr1;
> + MemoryRegion *mr;
> + ram_addr_t addr1;
> +
> + mr = memory_region_from_host(buffer, &addr1);
> + if (mr == NULL) {
> + BounceBuffer *bounce = container_of(buffer, BounceBuffer, buffer);
> + if (bounce->magic != BOUNCE_BUFFER_MAGIC) {
> + error_report(
> + "Unmap request for %p, which neither corresponds to a memory "
> + "region, nor looks like a bounce buffer, ignoring!",
> + buffer);
> + return;
Would assert() be better here? So that when it triggers we can get the user
stack right away.
> + }
>
> - mr = memory_region_from_host(buffer, &addr1);
> - assert(mr != NULL);
> if (is_write) {
> - invalidate_and_set_dirty(mr, addr1, access_len);
> + address_space_write(as, bounce->addr, MEMTXATTRS_UNSPECIFIED,
> + bounce->buffer, access_len);
> }
> - if (xen_enabled()) {
> - xen_invalidate_map_cache_entry(buffer);
> +
> + memory_region_unref(bounce->mr);
> + uint64_t previous_buffer_size =
> + qatomic_fetch_sub(&as->bounce_buffer_size, bounce->len);
> + if (previous_buffer_size == as->max_bounce_buffer_size) {
Can there be a race condition where someone just temporarily boosted
bounce_buffer_size, so that it won't exactly equal the max but is actually
larger (but we should still notify)?
If the current context already has a bounce buffer and is going to unmap it
(len>0), IIUC it means we should always notify, because there will
definitely be some space left and the requesters can retry to see whether
the size fits.
> + /* Write bounce_buffer_size before reading map_client_list. */
> + smp_mb();
I know it comes from the old code... but I don't know why this is needed;
the mutex lock should already imply an mb() before the later list iterations.
Just raising it; maybe Paolo would like to comment.
> + address_space_notify_map_clients(as);
> }
> - memory_region_unref(mr);
> + bounce->magic = ~BOUNCE_BUFFER_MAGIC;
> + g_free(bounce);
> return;
> }
> +
> + if (xen_enabled()) {
> + xen_invalidate_map_cache_entry(buffer);
> + }
> if (is_write) {
> - address_space_write(as, as->bounce.addr, MEMTXATTRS_UNSPECIFIED,
> - as->bounce.buffer, access_len);
> - }
> - qemu_vfree(as->bounce.buffer);
> - as->bounce.buffer = NULL;
> - memory_region_unref(as->bounce.mr);
> - /* Clear in_use before reading map_client_list. */
> - qatomic_set_mb(&as->bounce.in_use, false);
> - address_space_notify_map_clients(as);
> + invalidate_and_set_dirty(mr, addr1, access_len);
> + }
> }
>
> void *cpu_physical_memory_map(hwaddr addr,
> --
> 2.34.1
>
--
Peter Xu
* Re: [PATCH v3 0/5] Support message-based DMA in vfio-user server
2023-09-07 13:04 [PATCH v3 0/5] Support message-based DMA in vfio-user server Mattias Nissler
` (4 preceding siblings ...)
2023-09-07 13:04 ` [PATCH v3 5/5] vfio-user: Fix config space access byte order Mattias Nissler
@ 2023-09-14 14:39 ` Stefan Hajnoczi
2023-09-15 8:23 ` Mattias Nissler
5 siblings, 1 reply; 23+ messages in thread
From: Stefan Hajnoczi @ 2023-09-14 14:39 UTC (permalink / raw)
To: Mattias Nissler
Cc: qemu-devel, john.levon, Elena Ufimtseva, Michael S. Tsirkin,
David Hildenbrand, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Richard Henderson, Jagannathan Raman, Peter Xu, Paolo Bonzini
On Thu, Sep 07, 2023 at 06:04:05AM -0700, Mattias Nissler wrote:
> This series adds basic support for message-based DMA in qemu's vfio-user
> server. This is useful for cases where the client does not provide file
> descriptors for accessing system memory via memory mappings. My motivating use
> case is to hook up device models as PCIe endpoints to a hardware design. This
> works by bridging the PCIe transaction layer to vfio-user, and the endpoint
> does not access memory directly, but sends memory requests TLPs to the hardware
> design in order to perform DMA.
>
> Note that there is some more work required on top of this series to get
> message-based DMA to really work well:
>
> * libvfio-user has a long-standing issue where socket communication gets messed
> up when messages are sent from both ends at the same time. See
> https://github.com/nutanix/libvfio-user/issues/279 for more details. I've
> been engaging there and a fix is in review.
>
> * qemu currently breaks down DMA accesses into chunks of size 8 bytes at
> maximum, each of which will be handled in a separate vfio-user DMA request
> message. This is quite terrible for large DMA accesses, such as when nvme
> reads and writes page-sized blocks for example. Thus, I would like to improve
> qemu to be able to perform larger accesses, at least for indirect memory
> regions. I have something working locally, but since this will likely result
> in more involved surgery and discussion, I am leaving this to be addressed in
> a separate patch.
Have you tried setting mr->ops->valid.max_access_size to something like
64 KB?
Paolo: Any suggestions for increasing DMA transaction sizes?
Stefan
>
> Changes from v1:
>
> * Address Stefan's review comments. In particular, enforce an allocation limit
> and don't drop the map client callbacks given that map requests can fail when
> hitting size limits.
>
> * libvfio-user version bump now included in the series.
>
> * Tested as well on big-endian s390x. This uncovered another byte order issue
> in vfio-user server code that I've included a fix for.
>
> Changes from v2:
>
> * Add a preparatory patch to make bounce buffering an AddressSpace-specific
> concept.
>
> * The total buffer size limit parameter is now per AdressSpace and can be
> configured for PCIDevice via a property.
>
> * Store a magic value in first bytes of bounce buffer struct as a best effort
> measure to detect invalid pointers in address_space_unmap.
>
> Mattias Nissler (5):
> softmmu: Per-AddressSpace bounce buffering
> softmmu: Support concurrent bounce buffers
> Update subprojects/libvfio-user
> vfio-user: Message-based DMA support
> vfio-user: Fix config space access byte order
>
> hw/pci/pci.c | 8 ++
> hw/remote/trace-events | 2 +
> hw/remote/vfio-user-obj.c | 88 +++++++++++++++++--
> include/exec/cpu-common.h | 2 -
> include/exec/memory.h | 39 ++++++++-
> include/hw/pci/pci_device.h | 3 +
> softmmu/dma-helpers.c | 4 +-
> softmmu/memory.c | 4 +
> softmmu/physmem.c | 155 ++++++++++++++++++----------------
> subprojects/libvfio-user.wrap | 2 +-
> 10 files changed, 220 insertions(+), 87 deletions(-)
>
> --
> 2.34.1
>
* Re: [PATCH v3 2/5] softmmu: Support concurrent bounce buffers
2023-09-07 13:04 ` [PATCH v3 2/5] softmmu: Support concurrent bounce buffers Mattias Nissler
2023-09-13 19:11 ` Peter Xu
@ 2023-09-14 18:49 ` Stefan Hajnoczi
2023-09-15 9:54 ` Mattias Nissler
1 sibling, 1 reply; 23+ messages in thread
From: Stefan Hajnoczi @ 2023-09-14 18:49 UTC (permalink / raw)
To: Mattias Nissler
Cc: qemu-devel, john.levon, Elena Ufimtseva, Michael S. Tsirkin,
David Hildenbrand, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Richard Henderson, Jagannathan Raman, Peter Xu, Paolo Bonzini
On Thu, Sep 07, 2023 at 06:04:07AM -0700, Mattias Nissler wrote:
> When DMA memory can't be directly accessed, as is the case when
> running the device model in a separate process without shareable DMA
> file descriptors, bounce buffering is used.
>
> It is not uncommon for device models to request mapping of several DMA
> regions at the same time. Examples include:
> * net devices, e.g. when transmitting a packet that is split across
> several TX descriptors (observed with igb)
> * USB host controllers, when handling a packet with multiple data TRBs
> (observed with xhci)
>
> Previously, qemu only provided a single bounce buffer per AddressSpace
> and would fail DMA map requests while the buffer was already in use. In
> turn, this would cause DMA failures that ultimately manifest as hardware
> errors from the guest perspective.
>
> This change allocates DMA bounce buffers dynamically instead of
> supporting only a single buffer. Thus, multiple DMA mappings work
> correctly also when RAM can't be mmap()-ed.
>
> The total bounce buffer allocation size is limited individually for each
> AddressSpace. The default limit is 4096 bytes, matching the previous
> maximum buffer size. A new x-max-bounce-buffer-size parameter is
> provided to configure the limit for PCI devices.
>
> Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
> ---
> hw/pci/pci.c | 8 ++++
> include/exec/memory.h | 14 ++----
> include/hw/pci/pci_device.h | 3 ++
> softmmu/memory.c | 3 +-
> softmmu/physmem.c | 94 +++++++++++++++++++++++++------------
> 5 files changed, 80 insertions(+), 42 deletions(-)
>
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 881d774fb6..8c4541b394 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -85,6 +85,8 @@ static Property pci_props[] = {
> QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
> DEFINE_PROP_BIT("x-pcie-ari-nextfn-1", PCIDevice, cap_present,
> QEMU_PCIE_ARI_NEXTFN_1_BITNR, false),
> + DEFINE_PROP_SIZE("x-max-bounce-buffer-size", PCIDevice,
> + max_bounce_buffer_size, 4096),
> DEFINE_PROP_END_OF_LIST()
> };
>
> @@ -1208,6 +1210,8 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
> "bus master container", UINT64_MAX);
> address_space_init(&pci_dev->bus_master_as,
> &pci_dev->bus_master_container_region, pci_dev->name);
> + pci_dev->bus_master_as.max_bounce_buffer_size =
> + pci_dev->max_bounce_buffer_size;
>
> if (phase_check(PHASE_MACHINE_READY)) {
> pci_init_bus_master(pci_dev);
> @@ -2664,6 +2668,10 @@ static void pci_device_class_init(ObjectClass *klass, void *data)
> k->unrealize = pci_qdev_unrealize;
> k->bus_type = TYPE_PCI_BUS;
> device_class_set_props(k, pci_props);
> + object_class_property_set_description(
> + klass, "x-max-bounce-buffer-size",
> + "Maximum buffer size allocated for bounce buffers used for mapped "
> + "access to indirect DMA memory");
> }
>
> static void pci_device_class_base_init(ObjectClass *klass, void *data)
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 7d68936157..5577542b5e 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -1081,14 +1081,6 @@ typedef struct AddressSpaceMapClient {
> QLIST_ENTRY(AddressSpaceMapClient) link;
> } AddressSpaceMapClient;
>
> -typedef struct {
> - MemoryRegion *mr;
> - void *buffer;
> - hwaddr addr;
> - hwaddr len;
> - bool in_use;
> -} BounceBuffer;
> -
> /**
> * struct AddressSpace: describes a mapping of addresses to #MemoryRegion objects
> */
> @@ -1106,8 +1098,10 @@ struct AddressSpace {
> QTAILQ_HEAD(, MemoryListener) listeners;
> QTAILQ_ENTRY(AddressSpace) address_spaces_link;
>
> - /* Bounce buffer to use for this address space. */
> - BounceBuffer bounce;
> + /* Maximum DMA bounce buffer size used for indirect memory map requests */
> + uint64_t max_bounce_buffer_size;
> + /* Total size of bounce buffers currently allocated, atomically accessed */
> + uint64_t bounce_buffer_size;
> /* List of callbacks to invoke when buffers free up */
> QemuMutex map_client_list_lock;
> QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
> diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
> index d3dd0f64b2..f4027c5379 100644
> --- a/include/hw/pci/pci_device.h
> +++ b/include/hw/pci/pci_device.h
> @@ -160,6 +160,9 @@ struct PCIDevice {
> /* ID of standby device in net_failover pair */
> char *failover_pair_id;
> uint32_t acpi_index;
> +
> + /* Maximum DMA bounce buffer size used for indirect memory map requests */
> + uint64_t max_bounce_buffer_size;
> };
>
> static inline int pci_intx(PCIDevice *pci_dev)
> diff --git a/softmmu/memory.c b/softmmu/memory.c
> index 5c9622c3d6..e02799359c 100644
> --- a/softmmu/memory.c
> +++ b/softmmu/memory.c
> @@ -3105,7 +3105,8 @@ void address_space_init(AddressSpace *as, MemoryRegion *root, const char *name)
> as->ioeventfds = NULL;
> QTAILQ_INIT(&as->listeners);
> QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link);
> - as->bounce.in_use = false;
> + as->max_bounce_buffer_size = 4096;
> + as->bounce_buffer_size = 0;
> qemu_mutex_init(&as->map_client_list_lock);
> QLIST_INIT(&as->map_client_list);
> as->name = g_strdup(name ? name : "anonymous");
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index f40cc564b8..e3d1cf5fba 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -2926,6 +2926,20 @@ void cpu_flush_icache_range(hwaddr start, hwaddr len)
> NULL, len, FLUSH_CACHE);
> }
>
> +/*
> + * A magic value stored in the first 8 bytes of the bounce buffer struct. Used
> + * to detect illegal pointers passed to address_space_unmap.
> + */
> +#define BOUNCE_BUFFER_MAGIC 0xb4017ceb4ffe12ed
> +
> +typedef struct {
> + uint64_t magic;
> + MemoryRegion *mr;
> + hwaddr addr;
> + size_t len;
> + uint8_t buffer[];
> +} BounceBuffer;
> +
> static void
> address_space_unregister_map_client_do(AddressSpaceMapClient *client)
> {
> @@ -2953,7 +2967,7 @@ void address_space_register_map_client(AddressSpace *as, QEMUBH *bh)
> QLIST_INSERT_HEAD(&as->map_client_list, client, link);
> /* Write map_client_list before reading bounce_buffer_size. */
> smp_mb();
> - if (!qatomic_read(&as->bounce.in_use)) {
> + if (qatomic_read(&as->bounce_buffer_size) < as->max_bounce_buffer_size) {
> address_space_notify_map_clients_locked(as);
> }
> qemu_mutex_unlock(&as->map_client_list_lock);
> @@ -3081,31 +3095,36 @@ void *address_space_map(AddressSpace *as,
> RCU_READ_LOCK_GUARD();
> fv = address_space_to_flatview(as);
> mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
> + memory_region_ref(mr);
>
> if (!memory_access_is_direct(mr, is_write)) {
> - if (qatomic_xchg(&as->bounce.in_use, true)) {
> + size_t size = qatomic_add_fetch(&as->bounce_buffer_size, l);
> + if (size > as->max_bounce_buffer_size) {
> + size_t excess = size - as->max_bounce_buffer_size;
> + l -= excess;
> + qatomic_sub(&as->bounce_buffer_size, excess);
I think two threads can race here. as->bounce_buffer_size will be
corrupted (smaller than it should be) and l will be wrong as well. A
cmpxchg loop would solve the race.
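Untested, but something along these lines (reusing the field names from
this patch; MIN/qatomic_cmpxchg as already used elsewhere in QEMU) is
what I mean:

  uint64_t expected = qatomic_read(&as->bounce_buffer_size);
  hwaddr to_alloc;
  for (;;) {
      to_alloc = expected < as->max_bounce_buffer_size ?
                 MIN(as->max_bounce_buffer_size - expected, l) : 0;
      if (to_alloc == 0) {
          break;
      }
      uint64_t actual = qatomic_cmpxchg(&as->bounce_buffer_size, expected,
                                        expected + to_alloc);
      if (actual == expected) {
          break;
      }
      expected = actual; /* lost the race, retry with the updated value */
  }
  l = to_alloc;

That way each mapping only ever reserves (and later releases) exactly the
amount it was granted.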
> + }
> +
> + if (l == 0) {
> *plen = 0;
> return NULL;
> }
> - /* Avoid unbounded allocations */
> - l = MIN(l, TARGET_PAGE_SIZE);
> - as->bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, l);
> - as->bounce.addr = addr;
> - as->bounce.len = l;
>
> - memory_region_ref(mr);
> - as->bounce.mr = mr;
> + BounceBuffer *bounce = g_malloc0(l + sizeof(BounceBuffer));
> + bounce->magic = BOUNCE_BUFFER_MAGIC;
> + bounce->mr = mr;
> + bounce->addr = addr;
> + bounce->len = l;
> +
> if (!is_write) {
> flatview_read(fv, addr, MEMTXATTRS_UNSPECIFIED,
> - as->bounce.buffer, l);
> + bounce->buffer, l);
> }
>
> *plen = l;
> - return as->bounce.buffer;
> + return bounce->buffer;
> }
>
> -
> - memory_region_ref(mr);
> *plen = flatview_extend_translation(fv, addr, len, mr, xlat,
> l, is_write, attrs);
> fuzz_dma_read_cb(addr, *plen, mr);
> @@ -3119,31 +3138,44 @@ void *address_space_map(AddressSpace *as,
> void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
> bool is_write, hwaddr access_len)
> {
> - if (buffer != as->bounce.buffer) {
> - MemoryRegion *mr;
> - ram_addr_t addr1;
> + MemoryRegion *mr;
> + ram_addr_t addr1;
> +
> + mr = memory_region_from_host(buffer, &addr1);
> + if (mr == NULL) {
> + BounceBuffer *bounce = container_of(buffer, BounceBuffer, buffer);
> + if (bounce->magic != BOUNCE_BUFFER_MAGIC) {
> + error_report(
> + "Unmap request for %p, which neither corresponds to a memory "
> + "region, nor looks like a bounce buffer, ignoring!",
> + buffer);
> + return;
> + }
>
> - mr = memory_region_from_host(buffer, &addr1);
> - assert(mr != NULL);
> if (is_write) {
> - invalidate_and_set_dirty(mr, addr1, access_len);
> + address_space_write(as, bounce->addr, MEMTXATTRS_UNSPECIFIED,
> + bounce->buffer, access_len);
> }
> - if (xen_enabled()) {
> - xen_invalidate_map_cache_entry(buffer);
> +
> + memory_region_unref(bounce->mr);
> + uint64_t previous_buffer_size =
> + qatomic_fetch_sub(&as->bounce_buffer_size, bounce->len);
> + if (previous_buffer_size == as->max_bounce_buffer_size) {
> + /* Write bounce_buffer_size before reading map_client_list. */
> + smp_mb();
> + address_space_notify_map_clients(as);
> }
> - memory_region_unref(mr);
> + bounce->magic = ~BOUNCE_BUFFER_MAGIC;
> + g_free(bounce);
> return;
> }
> +
> + if (xen_enabled()) {
> + xen_invalidate_map_cache_entry(buffer);
> + }
> if (is_write) {
> - address_space_write(as, as->bounce.addr, MEMTXATTRS_UNSPECIFIED,
> - as->bounce.buffer, access_len);
> - }
> - qemu_vfree(as->bounce.buffer);
> - as->bounce.buffer = NULL;
> - memory_region_unref(as->bounce.mr);
> - /* Clear in_use before reading map_client_list. */
> - qatomic_set_mb(&as->bounce.in_use, false);
> - address_space_notify_map_clients(as);
> + invalidate_and_set_dirty(mr, addr1, access_len);
> + }
> }
>
> void *cpu_physical_memory_map(hwaddr addr,
> --
> 2.34.1
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 4/5] vfio-user: Message-based DMA support
2023-09-07 13:04 ` [PATCH v3 4/5] vfio-user: Message-based DMA support Mattias Nissler
@ 2023-09-14 19:04 ` Stefan Hajnoczi
2023-09-15 9:58 ` Mattias Nissler
0 siblings, 1 reply; 23+ messages in thread
From: Stefan Hajnoczi @ 2023-09-14 19:04 UTC (permalink / raw)
To: Mattias Nissler
Cc: qemu-devel, john.levon, Elena Ufimtseva, Michael S. Tsirkin,
David Hildenbrand, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Richard Henderson, Jagannathan Raman, Peter Xu, Paolo Bonzini
[-- Attachment #1: Type: text/plain, Size: 6474 bytes --]
On Thu, Sep 07, 2023 at 06:04:09AM -0700, Mattias Nissler wrote:
> Wire up support for DMA for the case where the vfio-user client does not
> provide mmap()-able file descriptors, but DMA requests must be performed
> via the VFIO-user protocol. This installs an indirect memory region,
> which already works for pci_dma_{read,write}, and pci_dma_map works
> thanks to the existing DMA bounce buffering support.
>
> Note that while simple scenarios work with this patch, there's a known
> race condition in libvfio-user that will mess up the communication
> channel. See https://github.com/nutanix/libvfio-user/issues/279 for
> details as well as a proposed fix.
>
> Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
> ---
> hw/remote/trace-events | 2 +
> hw/remote/vfio-user-obj.c | 84 +++++++++++++++++++++++++++++++++++----
> 2 files changed, 79 insertions(+), 7 deletions(-)
>
> diff --git a/hw/remote/trace-events b/hw/remote/trace-events
> index 0d1b7d56a5..358a68fb34 100644
> --- a/hw/remote/trace-events
> +++ b/hw/remote/trace-events
> @@ -9,6 +9,8 @@ vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x -> 0x%x"
> vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x <- 0x%x"
> vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", %zu bytes"
> vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
> +vfu_dma_read(uint64_t gpa, size_t len) "vfu: DMA read 0x%"PRIx64", %zu bytes"
> +vfu_dma_write(uint64_t gpa, size_t len) "vfu: DMA write 0x%"PRIx64", %zu bytes"
> vfu_bar_register(int i, uint64_t addr, uint64_t size) "vfu: BAR %d: addr 0x%"PRIx64" size 0x%"PRIx64""
> vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR address 0x%"PRIx64""
> vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR address 0x%"PRIx64""
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index 8b10c32a3c..cee5e615a9 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -300,6 +300,63 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
> return count;
> }
>
> +static MemTxResult vfu_dma_read(void *opaque, hwaddr addr, uint64_t *val,
> + unsigned size, MemTxAttrs attrs)
> +{
> + MemoryRegion *region = opaque;
> + VfuObject *o = VFU_OBJECT(region->owner);
> + uint8_t buf[sizeof(uint64_t)];
> +
> + trace_vfu_dma_read(region->addr + addr, size);
> +
> + dma_sg_t *sg = alloca(dma_sg_size());
Variable-length arrays have recently been removed from QEMU and
alloca(3) is a similar case. An example is commit
b3c8246750b7077add335559341268f2956f6470 ("hw/nvme: Avoid dynamic stack
allocation").
libvfio-user returns a sane sizeof(struct dma_sg) value so we don't need
to worry about bogus values, so the risk is low here.
However, it's hard to scan for and forbid the dangerous alloca(3) calls
when exceptions are made for some alloca(3) uses.
I would avoid alloca(3) and instead use:
g_autofree dma_sg_t *sg = g_malloc(dma_sg_size());
> + vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
> + if (vfu_addr_to_sgl(o->vfu_ctx, vfu_addr, size, sg, 1, PROT_READ) < 0 ||
> + vfu_sgl_read(o->vfu_ctx, sg, 1, buf) != 0) {
> + return MEMTX_ERROR;
> + }
> +
> + *val = ldn_he_p(buf, size);
> +
> + return MEMTX_OK;
> +}
> +
> +static MemTxResult vfu_dma_write(void *opaque, hwaddr addr, uint64_t val,
> + unsigned size, MemTxAttrs attrs)
> +{
> + MemoryRegion *region = opaque;
> + VfuObject *o = VFU_OBJECT(region->owner);
> + uint8_t buf[sizeof(uint64_t)];
> +
> + trace_vfu_dma_write(region->addr + addr, size);
> +
> + stn_he_p(buf, size, val);
> +
> + dma_sg_t *sg = alloca(dma_sg_size());
Same here.
> + vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
> + if (vfu_addr_to_sgl(o->vfu_ctx, vfu_addr, size, sg, 1, PROT_WRITE) < 0 ||
> + vfu_sgl_write(o->vfu_ctx, sg, 1, buf) != 0) {
> + return MEMTX_ERROR;
> + }
> +
> + return MEMTX_OK;
> +}
> +
> +static const MemoryRegionOps vfu_dma_ops = {
> + .read_with_attrs = vfu_dma_read,
> + .write_with_attrs = vfu_dma_write,
> + .endianness = DEVICE_HOST_ENDIAN,
> + .valid = {
> + .min_access_size = 1,
> + .max_access_size = 8,
> + .unaligned = true,
> + },
> + .impl = {
> + .min_access_size = 1,
> + .max_access_size = 8,
> + },
> +};
> +
> static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
> {
> VfuObject *o = vfu_get_private(vfu_ctx);
> @@ -308,17 +365,30 @@ static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
> g_autofree char *name = NULL;
> struct iovec *iov = &info->iova;
>
> - if (!info->vaddr) {
> - return;
> - }
> -
> name = g_strdup_printf("mem-%s-%"PRIx64"", o->device,
> - (uint64_t)info->vaddr);
> + (uint64_t)iov->iov_base);
>
> subregion = g_new0(MemoryRegion, 1);
>
> - memory_region_init_ram_ptr(subregion, NULL, name,
> - iov->iov_len, info->vaddr);
> + if (info->vaddr) {
> + memory_region_init_ram_ptr(subregion, OBJECT(o), name,
> + iov->iov_len, info->vaddr);
> + } else {
> + /*
> + * Note that I/O regions' MemoryRegionOps handle accesses of at most 8
> + * bytes at a time, and larger accesses are broken down. However,
> + * many/most DMA accesses are larger than 8 bytes and VFIO-user can
> + * handle large DMA accesses just fine, thus this size restriction
> + * unnecessarily hurts performance, in particular given that each
> + * access causes a round trip on the VFIO-user socket.
> + *
> + * TODO: Investigate how to plumb larger accesses through memory
> + * regions, possibly by amending MemoryRegionOps or by creating a new
> + * memory region type.
> + */
> + memory_region_init_io(subregion, OBJECT(o), &vfu_dma_ops, subregion,
> + name, iov->iov_len);
> + }
>
> dma_as = pci_device_iommu_address_space(o->pci_dev);
>
> --
> 2.34.1
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 5/5] vfio-user: Fix config space access byte order
2023-09-07 13:04 ` [PATCH v3 5/5] vfio-user: Fix config space access byte order Mattias Nissler
@ 2023-09-14 19:34 ` Stefan Hajnoczi
2023-09-14 20:29 ` Philippe Mathieu-Daudé
2023-09-14 20:32 ` Stefan Hajnoczi
2 siblings, 0 replies; 23+ messages in thread
From: Stefan Hajnoczi @ 2023-09-14 19:34 UTC (permalink / raw)
To: Mattias Nissler
Cc: qemu-devel, john.levon, Elena Ufimtseva, Michael S. Tsirkin,
David Hildenbrand, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Richard Henderson, Jagannathan Raman, Peter Xu, Paolo Bonzini
[-- Attachment #1: Type: text/plain, Size: 1998 bytes --]
On Thu, Sep 07, 2023 at 06:04:10AM -0700, Mattias Nissler wrote:
> PCI config space is little-endian, so on a big-endian host we need to
> perform byte swaps for values as they are passed to and received from
> the generic PCI config space access machinery.
Byteswapping only works when registers are accessed with their natural
size, even with this patch.
If there is something like a PCI Capability structure, then it needs to
be read one register at a time to get back valid data. It cannot simply
be copied in a single multi-DWORD access.
I'm not sure if this fix is sufficient. Maybe pci_host_*() needs to be
extended to support little-endian accesses instead?
>
> Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
> ---
> hw/remote/vfio-user-obj.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index cee5e615a9..d38b4700f3 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -281,7 +281,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
> while (bytes > 0) {
> len = (bytes > pci_access_width) ? pci_access_width : bytes;
> if (is_write) {
> - memcpy(&val, ptr, len);
> + val = ldn_le_p(ptr, len);
> pci_host_config_write_common(o->pci_dev, offset,
> pci_config_size(o->pci_dev),
> val, len);
> @@ -289,7 +289,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
> } else {
> val = pci_host_config_read_common(o->pci_dev, offset,
> pci_config_size(o->pci_dev), len);
> - memcpy(ptr, &val, len);
> + stn_le_p(ptr, len, val);
> trace_vfu_cfg_read(offset, val);
> }
> offset += len;
> --
> 2.34.1
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 5/5] vfio-user: Fix config space access byte order
2023-09-07 13:04 ` [PATCH v3 5/5] vfio-user: Fix config space access byte order Mattias Nissler
2023-09-14 19:34 ` Stefan Hajnoczi
@ 2023-09-14 20:29 ` Philippe Mathieu-Daudé
2023-09-14 20:32 ` Stefan Hajnoczi
2 siblings, 0 replies; 23+ messages in thread
From: Philippe Mathieu-Daudé @ 2023-09-14 20:29 UTC (permalink / raw)
To: Mattias Nissler, qemu-devel
Cc: john.levon, Elena Ufimtseva, Michael S. Tsirkin,
David Hildenbrand, Marcel Apfelbaum, Richard Henderson,
Jagannathan Raman, stefanha, Peter Xu, Paolo Bonzini
On 7/9/23 15:04, Mattias Nissler wrote:
> PCI config space is little-endian, so on a big-endian host we need to
> perform byte swaps for values as they are passed to and received from
> the generic PCI config space access machinery.
>
> Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
> ---
> hw/remote/vfio-user-obj.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index cee5e615a9..d38b4700f3 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -281,7 +281,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
> while (bytes > 0) {
> len = (bytes > pci_access_width) ? pci_access_width : bytes;
> if (is_write) {
> - memcpy(&val, ptr, len);
> + val = ldn_le_p(ptr, len);
> pci_host_config_write_common(o->pci_dev, offset,
> pci_config_size(o->pci_dev),
> val, len);
> @@ -289,7 +289,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
> } else {
> val = pci_host_config_read_common(o->pci_dev, offset,
> pci_config_size(o->pci_dev), len);
> - memcpy(ptr, &val, len);
> + stn_le_p(ptr, len, val);
> trace_vfu_cfg_read(offset, val);
> }
> offset += len;
This makes sense,
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 5/5] vfio-user: Fix config space access byte order
2023-09-07 13:04 ` [PATCH v3 5/5] vfio-user: Fix config space access byte order Mattias Nissler
2023-09-14 19:34 ` Stefan Hajnoczi
2023-09-14 20:29 ` Philippe Mathieu-Daudé
@ 2023-09-14 20:32 ` Stefan Hajnoczi
2023-09-15 10:24 ` Mattias Nissler
2 siblings, 1 reply; 23+ messages in thread
From: Stefan Hajnoczi @ 2023-09-14 20:32 UTC (permalink / raw)
To: Mattias Nissler
Cc: qemu-devel, john.levon, Elena Ufimtseva, Michael S. Tsirkin,
David Hildenbrand, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Richard Henderson, Jagannathan Raman, Peter Xu, Paolo Bonzini
[-- Attachment #1: Type: text/plain, Size: 941 bytes --]
On Thu, Sep 07, 2023 at 06:04:10AM -0700, Mattias Nissler wrote:
> PCI config space is little-endian, so on a big-endian host we need to
> perform byte swaps for values as they are passed to and received from
> the generic PCI config space access machinery.
>
> Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
> ---
> hw/remote/vfio-user-obj.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
After some discussion about PCI Configuration Space endianness on IRC
with aw, mcayland, and f4bug I am now happy with this patch:
1. Configuration space can only be accessed in 1-, 2-, or 4-byte
accesses.
2. If it's a 2- or 4-byte access then your patch adds the missing
little-endian conversion.
3. If it's a 1-byte access then there is (effectively) no byteswap in
the code path and the pci_dev->config[] array is already
little-endian.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 0/5] Support message-based DMA in vfio-user server
2023-09-14 14:39 ` [PATCH v3 0/5] Support message-based DMA in vfio-user server Stefan Hajnoczi
@ 2023-09-15 8:23 ` Mattias Nissler
0 siblings, 0 replies; 23+ messages in thread
From: Mattias Nissler @ 2023-09-15 8:23 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: qemu-devel, john.levon, Elena Ufimtseva, Michael S. Tsirkin,
David Hildenbrand, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Richard Henderson, Jagannathan Raman, Peter Xu, Paolo Bonzini
On Thu, Sep 14, 2023 at 4:39 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Sep 07, 2023 at 06:04:05AM -0700, Mattias Nissler wrote:
> > This series adds basic support for message-based DMA in qemu's vfio-user
> > server. This is useful for cases where the client does not provide file
> > descriptors for accessing system memory via memory mappings. My motivating use
> > case is to hook up device models as PCIe endpoints to a hardware design. This
> > works by bridging the PCIe transaction layer to vfio-user, and the endpoint
> > does not access memory directly, but sends memory requests TLPs to the hardware
> > design in order to perform DMA.
> >
> > Note that there is some more work required on top of this series to get
> > message-based DMA to really work well:
> >
> > * libvfio-user has a long-standing issue where socket communication gets messed
> > up when messages are sent from both ends at the same time. See
> > https://github.com/nutanix/libvfio-user/issues/279 for more details. I've
> > been engaging there and a fix is in review.
> >
> > * qemu currently breaks down DMA accesses into chunks of size 8 bytes at
> > maximum, each of which will be handled in a separate vfio-user DMA request
> > message. This is quite terrible for large DMA accesses, such as when nvme
> > reads and writes page-sized blocks for example. Thus, I would like to improve
> > qemu to be able to perform larger accesses, at least for indirect memory
> > regions. I have something working locally, but since this will likely result
> > in more involved surgery and discussion, I am leaving this to be addressed in
> > a separate patch.
>
> Have you tried setting mr->ops->valid.max_access_size to something like
> 64 KB?
I had tried that early on, but it's not that easy, unfortunately. The
memory access path eventually hits flatview_read_continue [1], where
memory_region_dispatch_read gets invoked, which passes data in a single
uint64_t; that is also the unit of data that MemoryRegionOps operates
on. Thus, sizeof(uint64_t) is the current hard limit when accessing an
indirect memory region. I have some proof-of-concept code that extends
MemoryRegionOps with functions to read and write larger blocks and
changes the dispatching code to use these if available. I'm not sure
whether that's the right way to go though; it was just what jumped out
at me as a quick way to get what I need :-) Happy to share this code
if it helps the conversation.
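To give a flavor, the rough direction is something like this (hypothetical
member names, not necessarily what I'd propose upstream):

  /* Optional additions to MemoryRegionOps: */
  MemTxResult (*read_block)(void *opaque, hwaddr addr, uint8_t *buf,
                            hwaddr len, MemTxAttrs attrs);
  MemTxResult (*write_block)(void *opaque, hwaddr addr, const uint8_t *buf,
                             hwaddr len, MemTxAttrs attrs);

plus a check in flatview_read_continue()/flatview_write_continue() that
dispatches to these when present and falls back to the existing
memop-sized loop otherwise.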
There are certainly various considerations with this:
* It crossed my mind that we could introduce a separate memory region
type (I understand that indirect memory regions were originally
designed for I/O regions, accessed by the CPU, and thus naturally
limited to memop-sized accesses?). But then again perhaps we want
arbitrarily-sized accesses for potentially all memory regions, not
just those of special types?
* If we do decide to add support to MemoryRegionOps for
arbitrarily-sized accesses, that raises the question of whether this
is a third, optional pair of accessors in addition to read/write and
read_with_attrs/write_with_attrs, or whether MemoryRegionOps deserves
a cleanup to expose only a single pair of arbitrarily-sized accessors.
In the latter case we'd adapt them somehow to the simpler memop-sized
accessors which existing code implements, and which I think make sense
to keep for cases where they are sufficient.
* Performance: we need to keep an eye on what performance implications
these design decisions come with.
[1] https://github.com/qemu/qemu/blob/master/softmmu/physmem.c#L2744
>
> Paolo: Any suggestions for increasing DMA transaction sizes?
>
> Stefan
>
> >
> > Changes from v1:
> >
> > * Address Stefan's review comments. In particular, enforce an allocation limit
> > and don't drop the map client callbacks given that map requests can fail when
> > hitting size limits.
> >
> > * libvfio-user version bump now included in the series.
> >
> > * Tested as well on big-endian s390x. This uncovered another byte order issue
> > in vfio-user server code that I've included a fix for.
> >
> > Changes from v2:
> >
> > * Add a preparatory patch to make bounce buffering an AddressSpace-specific
> > concept.
> >
> > * The total buffer size limit parameter is now per AdressSpace and can be
> > configured for PCIDevice via a property.
> >
> > * Store a magic value in first bytes of bounce buffer struct as a best effort
> > measure to detect invalid pointers in address_space_unmap.
> >
> > Mattias Nissler (5):
> > softmmu: Per-AddressSpace bounce buffering
> > softmmu: Support concurrent bounce buffers
> > Update subprojects/libvfio-user
> > vfio-user: Message-based DMA support
> > vfio-user: Fix config space access byte order
> >
> > hw/pci/pci.c | 8 ++
> > hw/remote/trace-events | 2 +
> > hw/remote/vfio-user-obj.c | 88 +++++++++++++++++--
> > include/exec/cpu-common.h | 2 -
> > include/exec/memory.h | 39 ++++++++-
> > include/hw/pci/pci_device.h | 3 +
> > softmmu/dma-helpers.c | 4 +-
> > softmmu/memory.c | 4 +
> > softmmu/physmem.c | 155 ++++++++++++++++++----------------
> > subprojects/libvfio-user.wrap | 2 +-
> > 10 files changed, 220 insertions(+), 87 deletions(-)
> >
> > --
> > 2.34.1
> >
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 1/5] softmmu: Per-AddressSpace bounce buffering
2023-09-13 18:30 ` Peter Xu
@ 2023-09-15 8:37 ` Mattias Nissler
2023-09-19 15:54 ` Mattias Nissler
0 siblings, 1 reply; 23+ messages in thread
From: Mattias Nissler @ 2023-09-15 8:37 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, john.levon, Elena Ufimtseva, Michael S. Tsirkin,
David Hildenbrand, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Richard Henderson, Jagannathan Raman, stefanha, Paolo Bonzini
On Wed, Sep 13, 2023 at 8:30 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Sep 07, 2023 at 06:04:06AM -0700, Mattias Nissler wrote:
> > @@ -3105,6 +3105,9 @@ void address_space_init(AddressSpace *as, MemoryRegion *root, const char *name)
> > as->ioeventfds = NULL;
> > QTAILQ_INIT(&as->listeners);
> > QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link);
> > + as->bounce.in_use = false;
> > + qemu_mutex_init(&as->map_client_list_lock);
> > + QLIST_INIT(&as->map_client_list);
> > as->name = g_strdup(name ? name : "anonymous");
> > address_space_update_topology(as);
> > address_space_update_ioeventfds(as);
>
> Missing the counterpart in do_address_space_destroy()?
Of course, thanks for pointing this out.
>
> Perhaps we should assert that no one is still using the buffer, and that the
> client list is empty, too.
I agree it makes sense to add these assertions, but let me dig a bit
and do some experiments to see whether they hold true in practice.
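Something like this in do_address_space_destroy() is what I have in mind
(untested sketch):

  assert(!qatomic_read(&as->bounce.in_use));
  assert(QLIST_EMPTY(&as->map_client_list));
  qemu_mutex_destroy(&as->map_client_list_lock);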
>
> Thanks,
>
> --
> Peter Xu
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 2/5] softmmu: Support concurrent bounce buffers
2023-09-13 19:11 ` Peter Xu
@ 2023-09-15 9:32 ` Mattias Nissler
2023-09-15 15:57 ` Peter Xu
0 siblings, 1 reply; 23+ messages in thread
From: Mattias Nissler @ 2023-09-15 9:32 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, john.levon, Elena Ufimtseva, Michael S. Tsirkin,
David Hildenbrand, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Richard Henderson, Jagannathan Raman, stefanha, Paolo Bonzini
On Wed, Sep 13, 2023 at 9:11 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Sep 07, 2023 at 06:04:07AM -0700, Mattias Nissler wrote:
> > When DMA memory can't be directly accessed, as is the case when
> > running the device model in a separate process without shareable DMA
> > file descriptors, bounce buffering is used.
> >
> > It is not uncommon for device models to request mapping of several DMA
> > regions at the same time. Examples include:
> > * net devices, e.g. when transmitting a packet that is split across
> > several TX descriptors (observed with igb)
> > * USB host controllers, when handling a packet with multiple data TRBs
> > (observed with xhci)
> >
> > Previously, qemu only provided a single bounce buffer per AddressSpace
> > and would fail DMA map requests while the buffer was already in use. In
> > turn, this would cause DMA failures that ultimately manifest as hardware
> > errors from the guest perspective.
> >
> > This change allocates DMA bounce buffers dynamically instead of
> > supporting only a single buffer. Thus, multiple DMA mappings work
> > correctly also when RAM can't be mmap()-ed.
> >
> > The total bounce buffer allocation size is limited individually for each
> > AddressSpace. The default limit is 4096 bytes, matching the previous
> > maximum buffer size. A new x-max-bounce-buffer-size parameter is
> > provided to configure the limit for PCI devices.
> >
> > Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
> > ---
> > hw/pci/pci.c | 8 ++++
> > include/exec/memory.h | 14 ++----
> > include/hw/pci/pci_device.h | 3 ++
> > softmmu/memory.c | 3 +-
> > softmmu/physmem.c | 94 +++++++++++++++++++++++++------------
> > 5 files changed, 80 insertions(+), 42 deletions(-)
> >
> > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > index 881d774fb6..8c4541b394 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -85,6 +85,8 @@ static Property pci_props[] = {
> > QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
> > DEFINE_PROP_BIT("x-pcie-ari-nextfn-1", PCIDevice, cap_present,
> > QEMU_PCIE_ARI_NEXTFN_1_BITNR, false),
> > + DEFINE_PROP_SIZE("x-max-bounce-buffer-size", PCIDevice,
> > + max_bounce_buffer_size, 4096),
> > DEFINE_PROP_END_OF_LIST()
> > };
> >
> > @@ -1208,6 +1210,8 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
> > "bus master container", UINT64_MAX);
> > address_space_init(&pci_dev->bus_master_as,
> > &pci_dev->bus_master_container_region, pci_dev->name);
> > + pci_dev->bus_master_as.max_bounce_buffer_size =
> > + pci_dev->max_bounce_buffer_size;
> >
> > if (phase_check(PHASE_MACHINE_READY)) {
> > pci_init_bus_master(pci_dev);
> > @@ -2664,6 +2668,10 @@ static void pci_device_class_init(ObjectClass *klass, void *data)
> > k->unrealize = pci_qdev_unrealize;
> > k->bus_type = TYPE_PCI_BUS;
> > device_class_set_props(k, pci_props);
> > + object_class_property_set_description(
> > + klass, "x-max-bounce-buffer-size",
> > + "Maximum buffer size allocated for bounce buffers used for mapped "
> > + "access to indirect DMA memory");
> > }
> >
> > static void pci_device_class_base_init(ObjectClass *klass, void *data)
> > diff --git a/include/exec/memory.h b/include/exec/memory.h
> > index 7d68936157..5577542b5e 100644
> > --- a/include/exec/memory.h
> > +++ b/include/exec/memory.h
> > @@ -1081,14 +1081,6 @@ typedef struct AddressSpaceMapClient {
> > QLIST_ENTRY(AddressSpaceMapClient) link;
> > } AddressSpaceMapClient;
> >
> > -typedef struct {
> > - MemoryRegion *mr;
> > - void *buffer;
> > - hwaddr addr;
> > - hwaddr len;
> > - bool in_use;
> > -} BounceBuffer;
> > -
> > /**
> > * struct AddressSpace: describes a mapping of addresses to #MemoryRegion objects
> > */
> > @@ -1106,8 +1098,10 @@ struct AddressSpace {
> > QTAILQ_HEAD(, MemoryListener) listeners;
> > QTAILQ_ENTRY(AddressSpace) address_spaces_link;
> >
> > - /* Bounce buffer to use for this address space. */
> > - BounceBuffer bounce;
> > + /* Maximum DMA bounce buffer size used for indirect memory map requests */
> > + uint64_t max_bounce_buffer_size;
> > + /* Total size of bounce buffers currently allocated, atomically accessed */
> > + uint64_t bounce_buffer_size;
> > /* List of callbacks to invoke when buffers free up */
> > QemuMutex map_client_list_lock;
> > QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
> > diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
> > index d3dd0f64b2..f4027c5379 100644
> > --- a/include/hw/pci/pci_device.h
> > +++ b/include/hw/pci/pci_device.h
> > @@ -160,6 +160,9 @@ struct PCIDevice {
> > /* ID of standby device in net_failover pair */
> > char *failover_pair_id;
> > uint32_t acpi_index;
> > +
> > + /* Maximum DMA bounce buffer size used for indirect memory map requests */
> > + uint64_t max_bounce_buffer_size;
> > };
> >
> > static inline int pci_intx(PCIDevice *pci_dev)
> > diff --git a/softmmu/memory.c b/softmmu/memory.c
> > index 5c9622c3d6..e02799359c 100644
> > --- a/softmmu/memory.c
> > +++ b/softmmu/memory.c
> > @@ -3105,7 +3105,8 @@ void address_space_init(AddressSpace *as, MemoryRegion *root, const char *name)
> > as->ioeventfds = NULL;
> > QTAILQ_INIT(&as->listeners);
> > QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link);
> > - as->bounce.in_use = false;
> > + as->max_bounce_buffer_size = 4096;
>
> Instead of hard-coding this 4k again (besides the pci property), maybe we can
> have address_space_init_with_bouncebuffer() and always pass it in from pci
> do_realize? Then by default no bounce buffer would be supported for the AS
> unless explicitly requested.
I haven't verified in a running configuration, but I believe bounce
buffering is also used with non-PCI code, for example
sysbus_ahci_realize grabs &address_space_memory. So, IMO it makes
sense to keep the old default for non-PCI address spaces, unless we
create additional knobs to set the limit for these?
>
> > + as->bounce_buffer_size = 0;
> > qemu_mutex_init(&as->map_client_list_lock);
> > QLIST_INIT(&as->map_client_list);
> > as->name = g_strdup(name ? name : "anonymous");
> > diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> > index f40cc564b8..e3d1cf5fba 100644
> > --- a/softmmu/physmem.c
> > +++ b/softmmu/physmem.c
> > @@ -2926,6 +2926,20 @@ void cpu_flush_icache_range(hwaddr start, hwaddr len)
> > NULL, len, FLUSH_CACHE);
> > }
> >
> > +/*
> > + * A magic value stored in the first 8 bytes of the bounce buffer struct. Used
> > + * to detect illegal pointers passed to address_space_unmap.
> > + */
> > +#define BOUNCE_BUFFER_MAGIC 0xb4017ceb4ffe12ed
> > +
> > +typedef struct {
> > + uint64_t magic;
> > + MemoryRegion *mr;
> > + hwaddr addr;
> > + size_t len;
> > + uint8_t buffer[];
> > +} BounceBuffer;
> > +
> > static void
> > address_space_unregister_map_client_do(AddressSpaceMapClient *client)
> > {
> > @@ -2953,7 +2967,7 @@ void address_space_register_map_client(AddressSpace *as, QEMUBH *bh)
> > QLIST_INSERT_HEAD(&as->map_client_list, client, link);
> > /* Write map_client_list before reading bounce_buffer_size. */
> > smp_mb();
> > - if (!qatomic_read(&as->bounce.in_use)) {
> > + if (qatomic_read(&as->bounce_buffer_size) < as->max_bounce_buffer_size) {
> > address_space_notify_map_clients_locked(as);
> > }
> > qemu_mutex_unlock(&as->map_client_list_lock);
> > @@ -3081,31 +3095,36 @@ void *address_space_map(AddressSpace *as,
> > RCU_READ_LOCK_GUARD();
> > fv = address_space_to_flatview(as);
> > mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
> > + memory_region_ref(mr);
> >
> > if (!memory_access_is_direct(mr, is_write)) {
> > - if (qatomic_xchg(&as->bounce.in_use, true)) {
> > + size_t size = qatomic_add_fetch(&as->bounce_buffer_size, l);
> > + if (size > as->max_bounce_buffer_size) {
> > + size_t excess = size - as->max_bounce_buffer_size;
> > + l -= excess;
> > + qatomic_sub(&as->bounce_buffer_size, excess);
> > + }
> > +
> > + if (l == 0) {
> > *plen = 0;
> > return NULL;
>
> MR refcount leak?
Oops, fixing.
>
> > }
> > - /* Avoid unbounded allocations */
> > - l = MIN(l, TARGET_PAGE_SIZE);
> > - as->bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, l);
> > - as->bounce.addr = addr;
> > - as->bounce.len = l;
> >
> > - memory_region_ref(mr);
> > - as->bounce.mr = mr;
> > + BounceBuffer *bounce = g_malloc0(l + sizeof(BounceBuffer));
> > + bounce->magic = BOUNCE_BUFFER_MAGIC;
> > + bounce->mr = mr;
> > + bounce->addr = addr;
> > + bounce->len = l;
> > +
> > if (!is_write) {
> > flatview_read(fv, addr, MEMTXATTRS_UNSPECIFIED,
> > - as->bounce.buffer, l);
> > + bounce->buffer, l);
> > }
> >
> > *plen = l;
> > - return as->bounce.buffer;
> > + return bounce->buffer;
> > }
> >
> > -
> > - memory_region_ref(mr);
> > *plen = flatview_extend_translation(fv, addr, len, mr, xlat,
> > l, is_write, attrs);
> > fuzz_dma_read_cb(addr, *plen, mr);
> > @@ -3119,31 +3138,44 @@ void *address_space_map(AddressSpace *as,
> > void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
> > bool is_write, hwaddr access_len)
> > {
> > - if (buffer != as->bounce.buffer) {
> > - MemoryRegion *mr;
> > - ram_addr_t addr1;
> > + MemoryRegion *mr;
> > + ram_addr_t addr1;
> > +
> > + mr = memory_region_from_host(buffer, &addr1);
> > + if (mr == NULL) {
> > + BounceBuffer *bounce = container_of(buffer, BounceBuffer, buffer);
> > + if (bounce->magic != BOUNCE_BUFFER_MAGIC) {
> > + error_report(
> > + "Unmap request for %p, which neither corresponds to a memory "
> > + "region, nor looks like a bounce buffer, ignoring!",
> > + buffer);
> > + return;
>
> Would assert() be better here? So that when it triggers we have the user
> stack right away.
Yes, and it's something in the should-never-happen bucket, so assert
is more appropriate. Changing.
>
> > + }
> >
> > - mr = memory_region_from_host(buffer, &addr1);
> > - assert(mr != NULL);
> > if (is_write) {
> > - invalidate_and_set_dirty(mr, addr1, access_len);
> > + address_space_write(as, bounce->addr, MEMTXATTRS_UNSPECIFIED,
> > + bounce->buffer, access_len);
> > }
> > - if (xen_enabled()) {
> > - xen_invalidate_map_cache_entry(buffer);
> > +
> > + memory_region_unref(bounce->mr);
> > + uint64_t previous_buffer_size =
> > + qatomic_fetch_sub(&as->bounce_buffer_size, bounce->len);
> > + if (previous_buffer_size == as->max_bounce_buffer_size) {
>
> Can there be a race condition where someone just temporarily boosted
> bounce_buffer_size, so that it won't exactly equal the max but will actually
> be larger (but we should still notify)?
Ah, yes, we may have overshot in the allocation path. So this should
be >= then, but see below...
>
> If the current context already has a bounce buffer and is going to unmap it
> (len>0), IIUC it means we should always notify because there will
> definitely be some space left, and the requesters can retry to see whether
> the size fits.
The current semantics are actually more awkward: If the requested size
doesn't fit, a caller currently gets a mapping for the portion that is
below the limit, with an adjusted *plen value. So a caller that wants
more would then register their callback and get a callback when more
space frees up. I carried this over from the existing code, but I find
it extremely obscure and error-prone. I would rather change it to the
semantics you suggest, namely rejecting the entire request, but that's
a breaking API change that's hard to validate. So, maybe it's best to
do what you suggest and notify unconditionally when we free up space
and leave it to the caller to deal with whatever situation they find
(which is already the case). So, I'll drop the condition here for now,
let me know if you have a better suggestion.
>
> > + /* Write bounce_buffer_size before reading map_client_list. */
> > + smp_mb();
>
> I know it comes from the old code.. but I don't know why this is needed;
> mutex lock should contain an mb() already before the list iterations later.
> Just to raise it up, maybe Paolo would like to comment.
Hm, are you sure that qemu_mutex_lock includes a full memory barrier?
The atomics docs say that pthread_mutex_lock is guaranteed to have
acquire semantics, but that doesn't guarantee that previous writes are
visible elsewhere. We need a release of bounce_buffer_size and an
acquire of map_client_list. The latter is implied by qemu_mutex_lock,
so I could arguably change this to smp_wmb().
(I hope this makes sense, memory fencing isn't my forte)
>
> > + address_space_notify_map_clients(as);
> > }
> > - memory_region_unref(mr);
> > + bounce->magic = ~BOUNCE_BUFFER_MAGIC;
> > + g_free(bounce);
> > return;
> > }
> > +
> > + if (xen_enabled()) {
> > + xen_invalidate_map_cache_entry(buffer);
> > + }
> > if (is_write) {
> > - address_space_write(as, as->bounce.addr, MEMTXATTRS_UNSPECIFIED,
> > - as->bounce.buffer, access_len);
> > - }
> > - qemu_vfree(as->bounce.buffer);
> > - as->bounce.buffer = NULL;
> > - memory_region_unref(as->bounce.mr);
> > - /* Clear in_use before reading map_client_list. */
> > - qatomic_set_mb(&as->bounce.in_use, false);
> > - address_space_notify_map_clients(as);
> > + invalidate_and_set_dirty(mr, addr1, access_len);
> > + }
> > }
> >
> > void *cpu_physical_memory_map(hwaddr addr,
> > --
> > 2.34.1
> >
>
> --
> Peter Xu
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 2/5] softmmu: Support concurrent bounce buffers
2023-09-14 18:49 ` Stefan Hajnoczi
@ 2023-09-15 9:54 ` Mattias Nissler
2023-09-15 20:40 ` Stefan Hajnoczi
0 siblings, 1 reply; 23+ messages in thread
From: Mattias Nissler @ 2023-09-15 9:54 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: qemu-devel, john.levon, Elena Ufimtseva, Michael S. Tsirkin,
David Hildenbrand, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Richard Henderson, Jagannathan Raman, Peter Xu, Paolo Bonzini
On Thu, Sep 14, 2023 at 8:49 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Sep 07, 2023 at 06:04:07AM -0700, Mattias Nissler wrote:
> > When DMA memory can't be directly accessed, as is the case when
> > running the device model in a separate process without shareable DMA
> > file descriptors, bounce buffering is used.
> >
> > It is not uncommon for device models to request mapping of several DMA
> > regions at the same time. Examples include:
> > * net devices, e.g. when transmitting a packet that is split across
> > several TX descriptors (observed with igb)
> > * USB host controllers, when handling a packet with multiple data TRBs
> > (observed with xhci)
> >
> > Previously, qemu only provided a single bounce buffer per AddressSpace
> > and would fail DMA map requests while the buffer was already in use. In
> > turn, this would cause DMA failures that ultimately manifest as hardware
> > errors from the guest perspective.
> >
> > This change allocates DMA bounce buffers dynamically instead of
> > supporting only a single buffer. Thus, multiple DMA mappings work
> > correctly also when RAM can't be mmap()-ed.
> >
> > The total bounce buffer allocation size is limited individually for each
> > AddressSpace. The default limit is 4096 bytes, matching the previous
> > maximum buffer size. A new x-max-bounce-buffer-size parameter is
> > provided to configure the limit for PCI devices.
> >
> > Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
> > ---
> > hw/pci/pci.c | 8 ++++
> > include/exec/memory.h | 14 ++----
> > include/hw/pci/pci_device.h | 3 ++
> > softmmu/memory.c | 3 +-
> > softmmu/physmem.c | 94 +++++++++++++++++++++++++------------
> > 5 files changed, 80 insertions(+), 42 deletions(-)
> >
> > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > index 881d774fb6..8c4541b394 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -85,6 +85,8 @@ static Property pci_props[] = {
> > QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
> > DEFINE_PROP_BIT("x-pcie-ari-nextfn-1", PCIDevice, cap_present,
> > QEMU_PCIE_ARI_NEXTFN_1_BITNR, false),
> > + DEFINE_PROP_SIZE("x-max-bounce-buffer-size", PCIDevice,
> > + max_bounce_buffer_size, 4096),
> > DEFINE_PROP_END_OF_LIST()
> > };
> >
> > @@ -1208,6 +1210,8 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
> > "bus master container", UINT64_MAX);
> > address_space_init(&pci_dev->bus_master_as,
> > &pci_dev->bus_master_container_region, pci_dev->name);
> > + pci_dev->bus_master_as.max_bounce_buffer_size =
> > + pci_dev->max_bounce_buffer_size;
> >
> > if (phase_check(PHASE_MACHINE_READY)) {
> > pci_init_bus_master(pci_dev);
> > @@ -2664,6 +2668,10 @@ static void pci_device_class_init(ObjectClass *klass, void *data)
> > k->unrealize = pci_qdev_unrealize;
> > k->bus_type = TYPE_PCI_BUS;
> > device_class_set_props(k, pci_props);
> > + object_class_property_set_description(
> > + klass, "x-max-bounce-buffer-size",
> > + "Maximum buffer size allocated for bounce buffers used for mapped "
> > + "access to indirect DMA memory");
> > }
> >
> > static void pci_device_class_base_init(ObjectClass *klass, void *data)
> > diff --git a/include/exec/memory.h b/include/exec/memory.h
> > index 7d68936157..5577542b5e 100644
> > --- a/include/exec/memory.h
> > +++ b/include/exec/memory.h
> > @@ -1081,14 +1081,6 @@ typedef struct AddressSpaceMapClient {
> > QLIST_ENTRY(AddressSpaceMapClient) link;
> > } AddressSpaceMapClient;
> >
> > -typedef struct {
> > - MemoryRegion *mr;
> > - void *buffer;
> > - hwaddr addr;
> > - hwaddr len;
> > - bool in_use;
> > -} BounceBuffer;
> > -
> > /**
> > * struct AddressSpace: describes a mapping of addresses to #MemoryRegion objects
> > */
> > @@ -1106,8 +1098,10 @@ struct AddressSpace {
> > QTAILQ_HEAD(, MemoryListener) listeners;
> > QTAILQ_ENTRY(AddressSpace) address_spaces_link;
> >
> > - /* Bounce buffer to use for this address space. */
> > - BounceBuffer bounce;
> > + /* Maximum DMA bounce buffer size used for indirect memory map requests */
> > + uint64_t max_bounce_buffer_size;
> > + /* Total size of bounce buffers currently allocated, atomically accessed */
> > + uint64_t bounce_buffer_size;
> > /* List of callbacks to invoke when buffers free up */
> > QemuMutex map_client_list_lock;
> > QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
> > diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
> > index d3dd0f64b2..f4027c5379 100644
> > --- a/include/hw/pci/pci_device.h
> > +++ b/include/hw/pci/pci_device.h
> > @@ -160,6 +160,9 @@ struct PCIDevice {
> > /* ID of standby device in net_failover pair */
> > char *failover_pair_id;
> > uint32_t acpi_index;
> > +
> > + /* Maximum DMA bounce buffer size used for indirect memory map requests */
> > + uint64_t max_bounce_buffer_size;
> > };
> >
> > static inline int pci_intx(PCIDevice *pci_dev)
> > diff --git a/softmmu/memory.c b/softmmu/memory.c
> > index 5c9622c3d6..e02799359c 100644
> > --- a/softmmu/memory.c
> > +++ b/softmmu/memory.c
> > @@ -3105,7 +3105,8 @@ void address_space_init(AddressSpace *as, MemoryRegion *root, const char *name)
> > as->ioeventfds = NULL;
> > QTAILQ_INIT(&as->listeners);
> > QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link);
> > - as->bounce.in_use = false;
> > + as->max_bounce_buffer_size = 4096;
> > + as->bounce_buffer_size = 0;
> > qemu_mutex_init(&as->map_client_list_lock);
> > QLIST_INIT(&as->map_client_list);
> > as->name = g_strdup(name ? name : "anonymous");
> > diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> > index f40cc564b8..e3d1cf5fba 100644
> > --- a/softmmu/physmem.c
> > +++ b/softmmu/physmem.c
> > @@ -2926,6 +2926,20 @@ void cpu_flush_icache_range(hwaddr start, hwaddr len)
> > NULL, len, FLUSH_CACHE);
> > }
> >
> > +/*
> > + * A magic value stored in the first 8 bytes of the bounce buffer struct. Used
> > + * to detect illegal pointers passed to address_space_unmap.
> > + */
> > +#define BOUNCE_BUFFER_MAGIC 0xb4017ceb4ffe12ed
> > +
> > +typedef struct {
> > + uint64_t magic;
> > + MemoryRegion *mr;
> > + hwaddr addr;
> > + size_t len;
> > + uint8_t buffer[];
> > +} BounceBuffer;
> > +
> > static void
> > address_space_unregister_map_client_do(AddressSpaceMapClient *client)
> > {
> > @@ -2953,7 +2967,7 @@ void address_space_register_map_client(AddressSpace *as, QEMUBH *bh)
> > QLIST_INSERT_HEAD(&as->map_client_list, client, link);
> > /* Write map_client_list before reading bounce_buffer_size. */
> > smp_mb();
> > - if (!qatomic_read(&as->bounce.in_use)) {
> > + if (qatomic_read(&as->bounce_buffer_size) < as->max_bounce_buffer_size) {
> > address_space_notify_map_clients_locked(as);
> > }
> > qemu_mutex_unlock(&as->map_client_list_lock);
> > @@ -3081,31 +3095,36 @@ void *address_space_map(AddressSpace *as,
> > RCU_READ_LOCK_GUARD();
> > fv = address_space_to_flatview(as);
> > mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
> > + memory_region_ref(mr);
> >
> > if (!memory_access_is_direct(mr, is_write)) {
> > - if (qatomic_xchg(&as->bounce.in_use, true)) {
> > + size_t size = qatomic_add_fetch(&as->bounce_buffer_size, l);
> > + if (size > as->max_bounce_buffer_size) {
> > + size_t excess = size - as->max_bounce_buffer_size;
> > + l -= excess;
> > + qatomic_sub(&as->bounce_buffer_size, excess);
>
> I think two threads can race here. as->bounce_buffer_size will be
> corrupted (smaller than it should be) and l will be wrong as well. A
> cmpxchg loop would solve the race.
Ah, thanks for pointing this out. I think the case you have in mind is this:
1. Thread A bumps the size to be larger than the max
2. Thread B bumps the size further
3. Thread B now computes an incorrect excess and subtracts more than it added.
I should be good if I ensure that each thread will only subtract what
it has previously added to enforce the invariant that additions and
subtractions for each map/unmap pair cancel each other out. Let me
know if there's a reason to still prefer the cmpxchg loop.
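Concretely, I'm thinking of something like this on top of the current
patch (sketch):

  size_t size = qatomic_add_fetch(&as->bounce_buffer_size, l);
  if (size > as->max_bounce_buffer_size) {
      /* Give back at most what this call itself added, never more. */
      size_t excess = MIN(size - as->max_bounce_buffer_size, l);
      l -= excess;
      qatomic_sub(&as->bounce_buffer_size, excess);
  }

The counter can still transiently overshoot the limit, but since the
giveback is capped at what this call added, map/unmap pairs always cancel
out exactly and the counter can't get corrupted.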
>
> > + }
> > +
> > + if (l == 0) {
> > *plen = 0;
> > return NULL;
> > }
> > - /* Avoid unbounded allocations */
> > - l = MIN(l, TARGET_PAGE_SIZE);
> > - as->bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, l);
> > - as->bounce.addr = addr;
> > - as->bounce.len = l;
> >
> > - memory_region_ref(mr);
> > - as->bounce.mr = mr;
> > + BounceBuffer *bounce = g_malloc0(l + sizeof(BounceBuffer));
> > + bounce->magic = BOUNCE_BUFFER_MAGIC;
> > + bounce->mr = mr;
> > + bounce->addr = addr;
> > + bounce->len = l;
> > +
> > if (!is_write) {
> > flatview_read(fv, addr, MEMTXATTRS_UNSPECIFIED,
> > - as->bounce.buffer, l);
> > + bounce->buffer, l);
> > }
> >
> > *plen = l;
> > - return as->bounce.buffer;
> > + return bounce->buffer;
> > }
> >
> > -
> > - memory_region_ref(mr);
> > *plen = flatview_extend_translation(fv, addr, len, mr, xlat,
> > l, is_write, attrs);
> > fuzz_dma_read_cb(addr, *plen, mr);
> > @@ -3119,31 +3138,44 @@ void *address_space_map(AddressSpace *as,
> > void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
> > bool is_write, hwaddr access_len)
> > {
> > - if (buffer != as->bounce.buffer) {
> > - MemoryRegion *mr;
> > - ram_addr_t addr1;
> > + MemoryRegion *mr;
> > + ram_addr_t addr1;
> > +
> > + mr = memory_region_from_host(buffer, &addr1);
> > + if (mr == NULL) {
> > + BounceBuffer *bounce = container_of(buffer, BounceBuffer, buffer);
> > + if (bounce->magic != BOUNCE_BUFFER_MAGIC) {
> > + error_report(
> > + "Unmap request for %p, which neither corresponds to a memory "
> > + "region, nor looks like a bounce buffer, ignoring!",
> > + buffer);
> > + return;
> > + }
> >
> > - mr = memory_region_from_host(buffer, &addr1);
> > - assert(mr != NULL);
> > if (is_write) {
> > - invalidate_and_set_dirty(mr, addr1, access_len);
> > + address_space_write(as, bounce->addr, MEMTXATTRS_UNSPECIFIED,
> > + bounce->buffer, access_len);
> > }
> > - if (xen_enabled()) {
> > - xen_invalidate_map_cache_entry(buffer);
> > +
> > + memory_region_unref(bounce->mr);
> > + uint64_t previous_buffer_size =
> > + qatomic_fetch_sub(&as->bounce_buffer_size, bounce->len);
> > + if (previous_buffer_size == as->max_bounce_buffer_size) {
> > + /* Write bounce_buffer_size before reading map_client_list. */
> > + smp_mb();
> > + address_space_notify_map_clients(as);
> > }
> > - memory_region_unref(mr);
> > + bounce->magic = ~BOUNCE_BUFFER_MAGIC;
> > + g_free(bounce);
> > return;
> > }
> > +
> > + if (xen_enabled()) {
> > + xen_invalidate_map_cache_entry(buffer);
> > + }
> > if (is_write) {
> > - address_space_write(as, as->bounce.addr, MEMTXATTRS_UNSPECIFIED,
> > - as->bounce.buffer, access_len);
> > - }
> > - qemu_vfree(as->bounce.buffer);
> > - as->bounce.buffer = NULL;
> > - memory_region_unref(as->bounce.mr);
> > - /* Clear in_use before reading map_client_list. */
> > - qatomic_set_mb(&as->bounce.in_use, false);
> > - address_space_notify_map_clients(as);
> > + invalidate_and_set_dirty(mr, addr1, access_len);
> > + }
> > }
> >
> > void *cpu_physical_memory_map(hwaddr addr,
> > --
> > 2.34.1
> >
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 4/5] vfio-user: Message-based DMA support
2023-09-14 19:04 ` Stefan Hajnoczi
@ 2023-09-15 9:58 ` Mattias Nissler
0 siblings, 0 replies; 23+ messages in thread
From: Mattias Nissler @ 2023-09-15 9:58 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: qemu-devel, john.levon, Elena Ufimtseva, Michael S. Tsirkin,
David Hildenbrand, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Richard Henderson, Jagannathan Raman, Peter Xu, Paolo Bonzini
On Thu, Sep 14, 2023 at 9:04 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Sep 07, 2023 at 06:04:09AM -0700, Mattias Nissler wrote:
> > Wire up support for DMA for the case where the vfio-user client does not
> > provide mmap()-able file descriptors, but DMA requests must be performed
> > via the VFIO-user protocol. This installs an indirect memory region,
> > which already works for pci_dma_{read,write}, and pci_dma_map works
> > thanks to the existing DMA bounce buffering support.
> >
> > Note that while simple scenarios work with this patch, there's a known
> > race condition in libvfio-user that will mess up the communication
> > channel. See https://github.com/nutanix/libvfio-user/issues/279 for
> > details as well as a proposed fix.
> >
> > Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
> > ---
> > hw/remote/trace-events | 2 +
> > hw/remote/vfio-user-obj.c | 84 +++++++++++++++++++++++++++++++++++----
> > 2 files changed, 79 insertions(+), 7 deletions(-)
> >
> > diff --git a/hw/remote/trace-events b/hw/remote/trace-events
> > index 0d1b7d56a5..358a68fb34 100644
> > --- a/hw/remote/trace-events
> > +++ b/hw/remote/trace-events
> > @@ -9,6 +9,8 @@ vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x -> 0x%x"
> > vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x <- 0x%x"
> > vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", %zu bytes"
> > vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
> > +vfu_dma_read(uint64_t gpa, size_t len) "vfu: DMA read 0x%"PRIx64", %zu bytes"
> > +vfu_dma_write(uint64_t gpa, size_t len) "vfu: DMA write 0x%"PRIx64", %zu bytes"
> > vfu_bar_register(int i, uint64_t addr, uint64_t size) "vfu: BAR %d: addr 0x%"PRIx64" size 0x%"PRIx64""
> > vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR address 0x%"PRIx64""
> > vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR address 0x%"PRIx64""
> > diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> > index 8b10c32a3c..cee5e615a9 100644
> > --- a/hw/remote/vfio-user-obj.c
> > +++ b/hw/remote/vfio-user-obj.c
> > @@ -300,6 +300,63 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
> > return count;
> > }
> >
> > +static MemTxResult vfu_dma_read(void *opaque, hwaddr addr, uint64_t *val,
> > + unsigned size, MemTxAttrs attrs)
> > +{
> > + MemoryRegion *region = opaque;
> > + VfuObject *o = VFU_OBJECT(region->owner);
> > + uint8_t buf[sizeof(uint64_t)];
> > +
> > + trace_vfu_dma_read(region->addr + addr, size);
> > +
> > + dma_sg_t *sg = alloca(dma_sg_size());
>
> Variable-length arrays have recently been removed from QEMU and
> alloca(3) is a similar case. An example is commit
> b3c8246750b7077add335559341268f2956f6470 ("hw/nvme: Avoid dynamic stack
> allocation").
>
> libvfio-user returns a sane sizeof(struct dma_sg) value so we don't need
> to worry about bogus values, so the risk is low here.
>
> However, it's hard to scan for and forbid the dangerous alloca(3) calls
> when exceptions are made for some alloca(3) uses.
>
> I would avoid alloca(3) and instead use:
>
> g_autofree dma_sg_t *sg = g_malloc(dma_sg_size());
Ok, changing. I personally dislike alloca anyway; I was just
following libvfio-user's example code. Plus there's really no valid
performance argument here given that the IPC we're doing will dominate
everything.
>
> > + vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
> > + if (vfu_addr_to_sgl(o->vfu_ctx, vfu_addr, size, sg, 1, PROT_READ) < 0 ||
> > + vfu_sgl_read(o->vfu_ctx, sg, 1, buf) != 0) {
> > + return MEMTX_ERROR;
> > + }
> > +
> > + *val = ldn_he_p(buf, size);
> > +
> > + return MEMTX_OK;
> > +}
> > +
> > +static MemTxResult vfu_dma_write(void *opaque, hwaddr addr, uint64_t val,
> > + unsigned size, MemTxAttrs attrs)
> > +{
> > + MemoryRegion *region = opaque;
> > + VfuObject *o = VFU_OBJECT(region->owner);
> > + uint8_t buf[sizeof(uint64_t)];
> > +
> > + trace_vfu_dma_write(region->addr + addr, size);
> > +
> > + stn_he_p(buf, size, val);
> > +
> > + dma_sg_t *sg = alloca(dma_sg_size());
>
> Same here.
>
> > + vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
> > + if (vfu_addr_to_sgl(o->vfu_ctx, vfu_addr, size, sg, 1, PROT_WRITE) < 0 ||
> > + vfu_sgl_write(o->vfu_ctx, sg, 1, buf) != 0) {
> > + return MEMTX_ERROR;
> > + }
> > +
> > + return MEMTX_OK;
> > +}
> > +
> > +static const MemoryRegionOps vfu_dma_ops = {
> > + .read_with_attrs = vfu_dma_read,
> > + .write_with_attrs = vfu_dma_write,
> > + .endianness = DEVICE_HOST_ENDIAN,
> > + .valid = {
> > + .min_access_size = 1,
> > + .max_access_size = 8,
> > + .unaligned = true,
> > + },
> > + .impl = {
> > + .min_access_size = 1,
> > + .max_access_size = 8,
> > + },
> > +};
> > +
> > static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
> > {
> > VfuObject *o = vfu_get_private(vfu_ctx);
> > @@ -308,17 +365,30 @@ static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
> > g_autofree char *name = NULL;
> > struct iovec *iov = &info->iova;
> >
> > - if (!info->vaddr) {
> > - return;
> > - }
> > -
> > name = g_strdup_printf("mem-%s-%"PRIx64"", o->device,
> > - (uint64_t)info->vaddr);
> > + (uint64_t)iov->iov_base);
> >
> > subregion = g_new0(MemoryRegion, 1);
> >
> > - memory_region_init_ram_ptr(subregion, NULL, name,
> > - iov->iov_len, info->vaddr);
> > + if (info->vaddr) {
> > + memory_region_init_ram_ptr(subregion, OBJECT(o), name,
> > + iov->iov_len, info->vaddr);
> > + } else {
> > + /*
> > + * Note that I/O regions' MemoryRegionOps handle accesses of at most 8
> > + * bytes at a time, and larger accesses are broken down. However,
> > + * many/most DMA accesses are larger than 8 bytes and VFIO-user can
> > + * handle large DMA accesses just fine, thus this size restriction
> > + * unnecessarily hurts performance, in particular given that each
> > + * access causes a round trip on the VFIO-user socket.
> > + *
> > + * TODO: Investigate how to plumb larger accesses through memory
> > + * regions, possibly by amending MemoryRegionOps or by creating a new
> > + * memory region type.
> > + */
> > + memory_region_init_io(subregion, OBJECT(o), &vfu_dma_ops, subregion,
> > + name, iov->iov_len);
> > + }
> >
> > dma_as = pci_device_iommu_address_space(o->pci_dev);
> >
> > --
> > 2.34.1
> >
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 5/5] vfio-user: Fix config space access byte order
2023-09-14 20:32 ` Stefan Hajnoczi
@ 2023-09-15 10:24 ` Mattias Nissler
0 siblings, 0 replies; 23+ messages in thread
From: Mattias Nissler @ 2023-09-15 10:24 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: qemu-devel, john.levon, Elena Ufimtseva, Michael S. Tsirkin,
David Hildenbrand, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Richard Henderson, Jagannathan Raman, Peter Xu, Paolo Bonzini
On Thu, Sep 14, 2023 at 10:32 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Sep 07, 2023 at 06:04:10AM -0700, Mattias Nissler wrote:
> > PCI config space is little-endian, so on a big-endian host we need to
> > perform byte swaps for values as they are passed to and received from
> > the generic PCI config space access machinery.
> >
> > Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
> > ---
> > hw/remote/vfio-user-obj.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
>
> After some discussion about PCI Configuration Space endianness on IRC
> with aw, mcayland, and f4bug I am now happy with this patch:
>
> 1. Configuration space can only be accessed in 1-, 2-, or 4-byte
> accesses.
> 2. If it's a 2- or 4-byte access then your patch adds the missing
> little-endian conversion.
> 3. If it's a 1-byte access then there is (effectively) no byteswap in
> the code path and the pci_dev->config[] array is already
> little-endian.
Thanks for checking! This indeed relies on
pci_host_config_{read,write}_common being register-based access paths.
I have also verified experimentally that this works as expected using
an s390x build.
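
For concreteness, the conversion being discussed looks roughly like this
(a sketch only; buf/count/offset/is_write stand in for the config-access
handler's parameters and are not copied from the actual hunk):

    /* Config space bytes travel little-endian on the wire, while the
     * generic helpers take/return host-endian values, so convert at
     * this boundary. Illustration of the idea, not the exact patch. */
    if (is_write) {
        uint32_t val = ldn_le_p(buf, count);          /* LE wire -> host */
        pci_host_config_write_common(pci_dev, offset,
                                     pci_config_size(pci_dev), val, count);
    } else {
        uint32_t val = pci_host_config_read_common(pci_dev, offset,
                                                   pci_config_size(pci_dev),
                                                   count);
        stn_le_p(buf, count, val);                    /* host -> LE wire */
    }

For 1-byte accesses both conversions degenerate to a plain byte copy, which
matches point 3 above.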
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 2/5] softmmu: Support concurrent bounce buffers
2023-09-15 9:32 ` Mattias Nissler
@ 2023-09-15 15:57 ` Peter Xu
0 siblings, 0 replies; 23+ messages in thread
From: Peter Xu @ 2023-09-15 15:57 UTC (permalink / raw)
To: Mattias Nissler
Cc: qemu-devel, john.levon, Elena Ufimtseva, Michael S. Tsirkin,
David Hildenbrand, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Richard Henderson, Jagannathan Raman, stefanha, Paolo Bonzini
On Fri, Sep 15, 2023 at 11:32:31AM +0200, Mattias Nissler wrote:
> > > @@ -3105,7 +3105,8 @@ void address_space_init(AddressSpace *as, MemoryRegion *root, const char *name)
> > > as->ioeventfds = NULL;
> > > QTAILQ_INIT(&as->listeners);
> > > QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link);
> > > - as->bounce.in_use = false;
> > > + as->max_bounce_buffer_size = 4096;
> >
> > Instead of hard-coding this 4k again (besides the pci property), maybe we can
> > have address_space_init_with_bouncebuffer() and always pass it in from pci
> > do_realize? Then by default the AS would have no bounce buffer support unless
> > explicitly requested.
>
> I haven't verified in a running configuration, but I believe bounce
> buffering is also used with non-PCI code, for example
> sysbus_ahci_realize grabs &address_space_memory. So, IMO it makes
> sense to keep the old default for non-PCI address spaces, unless we
> create additional knobs to set the limit for these?
Oh okay, in that case do we want to have a macro defining the default for
all (as 4K), and then also use the macro in the pci property for the default
value? Maybe it's slightly better than the hard-coded constant.
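
Something along these lines, perhaps (the macro name is made up for
illustration):

    /* hypothetical name, e.g. in include/exec/memory.h */
    #define DEFAULT_MAX_BOUNCE_BUFFER_SIZE (4096)

    /* softmmu/memory.c, address_space_init() */
    as->max_bounce_buffer_size = DEFAULT_MAX_BOUNCE_BUFFER_SIZE;

    /* hw/pci/pci.c, pci_props[] */
    DEFINE_PROP_SIZE("x-max-bounce-buffer-size", PCIDevice,
                     max_bounce_buffer_size, DEFAULT_MAX_BOUNCE_BUFFER_SIZE),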
> > > + /* Write bounce_buffer_size before reading map_client_list. */
> > > + smp_mb();
> >
> > I know it comes from the old code... but I don't know why this is needed;
> > mutex lock should already contain an mb() before the list iterations later.
> > Just raising it here; maybe Paolo would like to comment.
>
> Hm, are you sure that qemu_mutex_lock includes a full memory barrier?
No. :)
> The atomics docs say that pthread_mutex_lock is guaranteed to have
> acquire semantics, but that doesn't guarantee that previous writes are
> visible elsewhere. We need a release of bounce_buffer_size and an
> acquire of map_client_list. The latter is implied by qemu_mutex_lock,
> so I could arguably change this to smp_wmb().
Yeah, I think I made that mistake before, sorry. At some point I looked
into the x86 implementation and I believe it was mb() there, and I kept the
impression that it always would be, but that's not the case. I think you're
right that mutex_lock semantics only guarantee an ACQUIRE, and that's not
the same as mb(), at least not always.
Changing it to wmb() makes sense to me, but indeed that may be more suitable
for another patch. Maybe it's easier to just leave it as-is, as this shouldn't
be a super hot path anyway.
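
For reference, the pairing under discussion is roughly the following; the
register side is as in the patch, the unmap side is sketched from context:

    /* address_space_register_map_client(), under map_client_list_lock: */
    QLIST_INSERT_HEAD(&as->map_client_list, client, link);
    /* Write map_client_list before reading bounce_buffer_size. */
    smp_mb();
    if (qatomic_read(&as->bounce_buffer_size) < as->max_bounce_buffer_size) {
        address_space_notify_map_clients_locked(as);
    }

    /* address_space_unmap() side (sketch): shrink the accounting first, */
    qatomic_sub(&as->bounce_buffer_size, bounce->len);
    /* Write bounce_buffer_size before reading map_client_list. */
    smp_mb();
    address_space_notify_map_clients(as);

Each side writes its own variable and then reads the other's, which is why
the patch uses full barriers rather than plain acquire/release.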
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 2/5] softmmu: Support concurrent bounce buffers
2023-09-15 9:54 ` Mattias Nissler
@ 2023-09-15 20:40 ` Stefan Hajnoczi
0 siblings, 0 replies; 23+ messages in thread
From: Stefan Hajnoczi @ 2023-09-15 20:40 UTC (permalink / raw)
To: Mattias Nissler
Cc: Stefan Hajnoczi, qemu-devel, john.levon, Elena Ufimtseva,
Michael S. Tsirkin, David Hildenbrand, Marcel Apfelbaum,
Philippe Mathieu-Daudé, Richard Henderson, Jagannathan Raman,
Peter Xu, Paolo Bonzini
On Fri, 15 Sept 2023 at 05:55, Mattias Nissler <mnissler@rivosinc.com> wrote:
>
> On Thu, Sep 14, 2023 at 8:49 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Thu, Sep 07, 2023 at 06:04:07AM -0700, Mattias Nissler wrote:
> > > When DMA memory can't be directly accessed, as is the case when
> > > running the device model in a separate process without shareable DMA
> > > file descriptors, bounce buffering is used.
> > >
> > > It is not uncommon for device models to request mapping of several DMA
> > > regions at the same time. Examples include:
> > > * net devices, e.g. when transmitting a packet that is split across
> > > several TX descriptors (observed with igb)
> > > * USB host controllers, when handling a packet with multiple data TRBs
> > > (observed with xhci)
> > >
> > > Previously, qemu only provided a single bounce buffer per AddressSpace
> > > and would fail DMA map requests while the buffer was already in use. In
> > > turn, this would cause DMA failures that ultimately manifest as hardware
> > > errors from the guest perspective.
> > >
> > > This change allocates DMA bounce buffers dynamically instead of
> > > supporting only a single buffer. Thus, multiple DMA mappings work
> > > correctly also when RAM can't be mmap()-ed.
> > >
> > > The total bounce buffer allocation size is limited individually for each
> > > AddressSpace. The default limit is 4096 bytes, matching the previous
> > > maximum buffer size. A new x-max-bounce-buffer-size parameter is
> > > provided to configure the limit for PCI devices.
> > >
> > > Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
> > > ---
> > > hw/pci/pci.c | 8 ++++
> > > include/exec/memory.h | 14 ++----
> > > include/hw/pci/pci_device.h | 3 ++
> > > softmmu/memory.c | 3 +-
> > > softmmu/physmem.c | 94 +++++++++++++++++++++++++------------
> > > 5 files changed, 80 insertions(+), 42 deletions(-)
> > >
> > > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > > index 881d774fb6..8c4541b394 100644
> > > --- a/hw/pci/pci.c
> > > +++ b/hw/pci/pci.c
> > > @@ -85,6 +85,8 @@ static Property pci_props[] = {
> > > QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
> > > DEFINE_PROP_BIT("x-pcie-ari-nextfn-1", PCIDevice, cap_present,
> > > QEMU_PCIE_ARI_NEXTFN_1_BITNR, false),
> > > + DEFINE_PROP_SIZE("x-max-bounce-buffer-size", PCIDevice,
> > > + max_bounce_buffer_size, 4096),
> > > DEFINE_PROP_END_OF_LIST()
> > > };
> > >
> > > @@ -1208,6 +1210,8 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
> > > "bus master container", UINT64_MAX);
> > > address_space_init(&pci_dev->bus_master_as,
> > > &pci_dev->bus_master_container_region, pci_dev->name);
> > > + pci_dev->bus_master_as.max_bounce_buffer_size =
> > > + pci_dev->max_bounce_buffer_size;
> > >
> > > if (phase_check(PHASE_MACHINE_READY)) {
> > > pci_init_bus_master(pci_dev);
> > > @@ -2664,6 +2668,10 @@ static void pci_device_class_init(ObjectClass *klass, void *data)
> > > k->unrealize = pci_qdev_unrealize;
> > > k->bus_type = TYPE_PCI_BUS;
> > > device_class_set_props(k, pci_props);
> > > + object_class_property_set_description(
> > > + klass, "x-max-bounce-buffer-size",
> > > + "Maximum buffer size allocated for bounce buffers used for mapped "
> > > + "access to indirect DMA memory");
> > > }
> > >
> > > static void pci_device_class_base_init(ObjectClass *klass, void *data)
> > > diff --git a/include/exec/memory.h b/include/exec/memory.h
> > > index 7d68936157..5577542b5e 100644
> > > --- a/include/exec/memory.h
> > > +++ b/include/exec/memory.h
> > > @@ -1081,14 +1081,6 @@ typedef struct AddressSpaceMapClient {
> > > QLIST_ENTRY(AddressSpaceMapClient) link;
> > > } AddressSpaceMapClient;
> > >
> > > -typedef struct {
> > > - MemoryRegion *mr;
> > > - void *buffer;
> > > - hwaddr addr;
> > > - hwaddr len;
> > > - bool in_use;
> > > -} BounceBuffer;
> > > -
> > > /**
> > > * struct AddressSpace: describes a mapping of addresses to #MemoryRegion objects
> > > */
> > > @@ -1106,8 +1098,10 @@ struct AddressSpace {
> > > QTAILQ_HEAD(, MemoryListener) listeners;
> > > QTAILQ_ENTRY(AddressSpace) address_spaces_link;
> > >
> > > - /* Bounce buffer to use for this address space. */
> > > - BounceBuffer bounce;
> > > + /* Maximum DMA bounce buffer size used for indirect memory map requests */
> > > + uint64_t max_bounce_buffer_size;
> > > + /* Total size of bounce buffers currently allocated, atomically accessed */
> > > + uint64_t bounce_buffer_size;
> > > /* List of callbacks to invoke when buffers free up */
> > > QemuMutex map_client_list_lock;
> > > QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
> > > diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
> > > index d3dd0f64b2..f4027c5379 100644
> > > --- a/include/hw/pci/pci_device.h
> > > +++ b/include/hw/pci/pci_device.h
> > > @@ -160,6 +160,9 @@ struct PCIDevice {
> > > /* ID of standby device in net_failover pair */
> > > char *failover_pair_id;
> > > uint32_t acpi_index;
> > > +
> > > + /* Maximum DMA bounce buffer size used for indirect memory map requests */
> > > + uint64_t max_bounce_buffer_size;
> > > };
> > >
> > > static inline int pci_intx(PCIDevice *pci_dev)
> > > diff --git a/softmmu/memory.c b/softmmu/memory.c
> > > index 5c9622c3d6..e02799359c 100644
> > > --- a/softmmu/memory.c
> > > +++ b/softmmu/memory.c
> > > @@ -3105,7 +3105,8 @@ void address_space_init(AddressSpace *as, MemoryRegion *root, const char *name)
> > > as->ioeventfds = NULL;
> > > QTAILQ_INIT(&as->listeners);
> > > QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link);
> > > - as->bounce.in_use = false;
> > > + as->max_bounce_buffer_size = 4096;
> > > + as->bounce_buffer_size = 0;
> > > qemu_mutex_init(&as->map_client_list_lock);
> > > QLIST_INIT(&as->map_client_list);
> > > as->name = g_strdup(name ? name : "anonymous");
> > > diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> > > index f40cc564b8..e3d1cf5fba 100644
> > > --- a/softmmu/physmem.c
> > > +++ b/softmmu/physmem.c
> > > @@ -2926,6 +2926,20 @@ void cpu_flush_icache_range(hwaddr start, hwaddr len)
> > > NULL, len, FLUSH_CACHE);
> > > }
> > >
> > > +/*
> > > + * A magic value stored in the first 8 bytes of the bounce buffer struct. Used
> > > + * to detect illegal pointers passed to address_space_unmap.
> > > + */
> > > +#define BOUNCE_BUFFER_MAGIC 0xb4017ceb4ffe12ed
> > > +
> > > +typedef struct {
> > > + uint64_t magic;
> > > + MemoryRegion *mr;
> > > + hwaddr addr;
> > > + size_t len;
> > > + uint8_t buffer[];
> > > +} BounceBuffer;
> > > +
> > > static void
> > > address_space_unregister_map_client_do(AddressSpaceMapClient *client)
> > > {
> > > @@ -2953,7 +2967,7 @@ void address_space_register_map_client(AddressSpace *as, QEMUBH *bh)
> > > QLIST_INSERT_HEAD(&as->map_client_list, client, link);
> > > /* Write map_client_list before reading bounce_buffer_size. */
> > > smp_mb();
> > > - if (!qatomic_read(&as->bounce.in_use)) {
> > > + if (qatomic_read(&as->bounce_buffer_size) < as->max_bounce_buffer_size) {
> > > address_space_notify_map_clients_locked(as);
> > > }
> > > qemu_mutex_unlock(&as->map_client_list_lock);
> > > @@ -3081,31 +3095,36 @@ void *address_space_map(AddressSpace *as,
> > > RCU_READ_LOCK_GUARD();
> > > fv = address_space_to_flatview(as);
> > > mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
> > > + memory_region_ref(mr);
> > >
> > > if (!memory_access_is_direct(mr, is_write)) {
> > > - if (qatomic_xchg(&as->bounce.in_use, true)) {
> > > + size_t size = qatomic_add_fetch(&as->bounce_buffer_size, l);
> > > + if (size > as->max_bounce_buffer_size) {
> > > + size_t excess = size - as->max_bounce_buffer_size;
> > > + l -= excess;
> > > + qatomic_sub(&as->bounce_buffer_size, excess);
> >
> > I think two threads can race here. as->bounce_buffer_size will be
> > corrupted (smaller than it should be) and l will be wrong as well. A
> > cmpxchg loop would solve the race.
>
> Ah, thanks for pointing this out. I think the case you have in mind is this:
> 1. Thread A bumps the size to be larger than the max
> 2. Thread B bumps the size further
> 3. Thread B now computes an incorrect excess and subtracts more than it added.
Yes, that's the case.
> I should be good if I ensure that each thread only subtracts what
> it has previously added, to enforce the invariant that additions and
> subtractions for each map/unmap pair cancel each other out. Let me
> know if there's a reason to still prefer the cmpxchg loop.
Cool, it would be nice to avoid the cmpxchg loop.
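
For illustration, one way to express "each call only subtracts what it added"
on top of the hunk quoted above (a sketch, not the actual follow-up patch) is
to clamp the excess to this call's own contribution:

    /* address_space_map(), bounce-buffer path (sketch of the idea): */
    size_t total = qatomic_add_fetch(&as->bounce_buffer_size, l);
    if (total > as->max_bounce_buffer_size) {
        /* Never give back more than this call just added, even if another
         * thread pushed the total over the limit first. */
        size_t excess = MIN(total - as->max_bounce_buffer_size, l);
        l -= excess;
        qatomic_sub(&as->bounce_buffer_size, excess);
    }

    /* ... allocate the BounceBuffer and remember bounce->len = l ... */

    /* address_space_unmap() then returns exactly what was reserved: */
    qatomic_sub(&as->bounce_buffer_size, bounce->len);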
Stefan
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 1/5] softmmu: Per-AddressSpace bounce buffering
2023-09-15 8:37 ` Mattias Nissler
@ 2023-09-19 15:54 ` Mattias Nissler
0 siblings, 0 replies; 23+ messages in thread
From: Mattias Nissler @ 2023-09-19 15:54 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, john.levon, Elena Ufimtseva, Michael S. Tsirkin,
David Hildenbrand, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Richard Henderson, Jagannathan Raman, stefanha, Paolo Bonzini
On Fri, Sep 15, 2023 at 10:37 AM Mattias Nissler <mnissler@rivosinc.com> wrote:
>
> On Wed, Sep 13, 2023 at 8:30 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Sep 07, 2023 at 06:04:06AM -0700, Mattias Nissler wrote:
> > > @@ -3105,6 +3105,9 @@ void address_space_init(AddressSpace *as, MemoryRegion *root, const char *name)
> > > as->ioeventfds = NULL;
> > > QTAILQ_INIT(&as->listeners);
> > > QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link);
> > > + as->bounce.in_use = false;
> > > + qemu_mutex_init(&as->map_client_list_lock);
> > > + QLIST_INIT(&as->map_client_list);
> > > as->name = g_strdup(name ? name : "anonymous");
> > > address_space_update_topology(as);
> > > address_space_update_ioeventfds(as);
> >
> > Missing the counterpart in do_address_space_destroy()?
>
> Of course, thanks for pointing this out.
>
> >
> > Perhaps we should assert on having no one using the buffer, or on the
> > client list too.
>
> I agree it makes sense to put these assertions, but let me dig a bit
> and do some experiments to see whether these hold true in practice.
To close the loop here: I've experimented a bit to see whether I can
get the shutdown path to trigger the assertions by terminating the
qemu process with mappings present. I tried xhci (for usb_packet_map),
e1000e (for net_tx_pkt_add_raw_fragment_pci), and nvme (for
dma-helpers), some of them with hacked Linux kernels in attempts
to create problematic situations. I found that cleanup of mappings
already works correctly; I wasn't able to trigger the
assertions I added in do_address_space_destroy. That doesn't prove
the absence of a code path that would trigger them, but such a path
would just indicate a bug in device model cleanup code that should be
fixed anyway.
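
For reference, the counterpart and assertions being discussed would amount to
something like this in do_address_space_destroy() (a sketch against patch 1,
not the exact code):

    /* Counterpart to the address_space_init() additions above: */
    assert(!qatomic_read(&as->bounce.in_use));   /* no mapping still in flight */
    assert(QLIST_EMPTY(&as->map_client_list));   /* no waiter left behind */
    qemu_mutex_destroy(&as->map_client_list_lock);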
>
> >
> > Thanks,
> >
> > --
> > Peter Xu
> >
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2023-09-19 15:56 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-07 13:04 [PATCH v3 0/5] Support message-based DMA in vfio-user server Mattias Nissler
2023-09-07 13:04 ` [PATCH v3 1/5] softmmu: Per-AddressSpace bounce buffering Mattias Nissler
2023-09-13 18:30 ` Peter Xu
2023-09-15 8:37 ` Mattias Nissler
2023-09-19 15:54 ` Mattias Nissler
2023-09-07 13:04 ` [PATCH v3 2/5] softmmu: Support concurrent bounce buffers Mattias Nissler
2023-09-13 19:11 ` Peter Xu
2023-09-15 9:32 ` Mattias Nissler
2023-09-15 15:57 ` Peter Xu
2023-09-14 18:49 ` Stefan Hajnoczi
2023-09-15 9:54 ` Mattias Nissler
2023-09-15 20:40 ` Stefan Hajnoczi
2023-09-07 13:04 ` [PATCH v3 3/5] Update subprojects/libvfio-user Mattias Nissler
2023-09-07 13:04 ` [PATCH v3 4/5] vfio-user: Message-based DMA support Mattias Nissler
2023-09-14 19:04 ` Stefan Hajnoczi
2023-09-15 9:58 ` Mattias Nissler
2023-09-07 13:04 ` [PATCH v3 5/5] vfio-user: Fix config space access byte order Mattias Nissler
2023-09-14 19:34 ` Stefan Hajnoczi
2023-09-14 20:29 ` Philippe Mathieu-Daudé
2023-09-14 20:32 ` Stefan Hajnoczi
2023-09-15 10:24 ` Mattias Nissler
2023-09-14 14:39 ` [PATCH v3 0/5] Support message-based DMA in vfio-user server Stefan Hajnoczi
2023-09-15 8:23 ` Mattias Nissler