* [PATCH v2 0/4] Support message-based DMA in vfio-user server
@ 2023-08-23  9:29 Mattias Nissler
  2023-08-23  9:29 ` [PATCH v2 1/4] softmmu: Support concurrent bounce buffers Mattias Nissler
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Mattias Nissler @ 2023-08-23  9:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: john.levon, stefanha, Jagannathan Raman, Philippe Mathieu-Daudé,
      Peter Xu, David Hildenbrand, Paolo Bonzini, Elena Ufimtseva,
      Mattias Nissler

This series adds basic support for message-based DMA in qemu's vfio-user
server. This is useful for cases where the client does not provide file
descriptors for accessing system memory via memory mappings. My motivating
use case is to hook up device models as PCIe endpoints to a hardware
design. This works by bridging the PCIe transaction layer to vfio-user,
and the endpoint does not access memory directly, but sends memory request
TLPs to the hardware design in order to perform DMA.

Note that there is some more work required on top of this series to get
message-based DMA to really work well:

* libvfio-user has a long-standing issue where socket communication gets
  messed up when messages are sent from both ends at the same time. See
  https://github.com/nutanix/libvfio-user/issues/279 for more details.
  I've been engaging there and a fix is in review.

* qemu currently breaks down DMA accesses into chunks of at most 8 bytes,
  each of which is handled in a separate vfio-user DMA request message.
  This is quite terrible for large DMA accesses, such as when nvme reads
  and writes page-sized blocks. Thus, I would like to improve qemu to be
  able to perform larger accesses, at least for indirect memory regions.
  I have something working locally, but since this will likely result in
  more involved surgery and discussion, I am leaving this to be addressed
  in a separate patch.

Changes from v1:

* Address Stefan's review comments. In particular, enforce an allocation
  limit and don't drop the map client callbacks, given that map requests
  can fail when hitting size limits.
* libvfio-user version bump now included in the series.
* Tested as well on big-endian s390x. This uncovered another byte-order
  issue in vfio-user server code that I've included a fix for.

Mattias Nissler (4):
  softmmu: Support concurrent bounce buffers
  Update subprojects/libvfio-user
  vfio-user: Message-based DMA support
  vfio-user: Fix config space access byte order

 hw/remote/trace-events        |  2 +
 hw/remote/vfio-user-obj.c     | 88 +++++++++++++++++++++++++++++++----
 include/sysemu/sysemu.h       |  2 +
 qemu-options.hx               | 27 +++++++++++
 softmmu/globals.c             |  1 +
 softmmu/physmem.c             | 84 ++++++++++++++++++---------------
 softmmu/vl.c                  |  6 +++
 subprojects/libvfio-user.wrap |  2 +-
 8 files changed, 165 insertions(+), 47 deletions(-)

-- 
2.34.1

^ permalink raw reply	[flat|nested] 14+ messages in thread
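For reference, the chunking behavior described in the second bullet above works
roughly as follows: accesses to indirect (non-RAM) memory regions are dispatched
through per-region read/write callbacks that handle at most 8 bytes at a time,
so a large DMA access decays into a loop of small dispatches, each of which
becomes its own vfio-user message in this setup. A simplified illustration (the
callback type and function below are stand-ins, not the actual physmem.c code;
byte-order details are elided):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    typedef uint64_t (*region_read_fn)(void *opaque, uint64_t addr,
                                       unsigned size);

    static void dma_read_indirect(region_read_fn read_cb, void *opaque,
                                  uint64_t addr, uint8_t *buf, size_t len)
    {
        while (len > 0) {
            /* Each dispatch is capped at 8 bytes, however large len is. */
            size_t l = len < 8 ? len : 8;
            uint64_t val = read_cb(opaque, addr, (unsigned)l); /* one message */
            memcpy(buf, &val, l);
            addr += l;
            buf += l;
            len -= l;
        }
    }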
* [PATCH v2 1/4] softmmu: Support concurrent bounce buffers
  2023-08-23  9:29 [PATCH v2 0/4] Support message-based DMA in vfio-user server Mattias Nissler
@ 2023-08-23  9:29 ` Mattias Nissler
  2023-08-23 17:34   ` Peter Xu
  2023-08-23  9:29 ` [PATCH v2 2/4] Update subprojects/libvfio-user Mattias Nissler
                    ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Mattias Nissler @ 2023-08-23  9:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: john.levon, stefanha, Jagannathan Raman, Philippe Mathieu-Daudé,
      Peter Xu, David Hildenbrand, Paolo Bonzini, Elena Ufimtseva,
      Mattias Nissler

When DMA memory can't be directly accessed, as is the case when
running the device model in a separate process without shareable DMA
file descriptors, bounce buffering is used.

It is not uncommon for device models to request mapping of several DMA
regions at the same time. Examples include:
 * net devices, e.g. when transmitting a packet that is split across
   several TX descriptors (observed with igb)
 * USB host controllers, when handling a packet with multiple data TRBs
   (observed with xhci)

Previously, qemu only provided a single bounce buffer and would fail DMA
map requests while the buffer was already in use. In turn, this would
cause DMA failures that ultimately manifest as hardware errors from the
guest perspective.

This change allocates DMA bounce buffers dynamically instead of
supporting only a single buffer. Thus, multiple DMA mappings work
correctly also when RAM can't be mmap()-ed.

The total bounce buffer allocation size is limited by a new command line
parameter. The default is 4096 bytes to match the previous maximum
buffer size. It is expected that suitable limits will vary quite a bit
in practice depending on device models and workloads.

Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
---
 include/sysemu/sysemu.h |  2 +
 qemu-options.hx         | 27 +++++++++++++
 softmmu/globals.c       |  1 +
 softmmu/physmem.c       | 84 +++++++++++++++++++++++------------------
 softmmu/vl.c            |  6 +++
 5 files changed, 83 insertions(+), 37 deletions(-)

diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 25be2a692e..c5dc93cb53 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -61,6 +61,8 @@ extern int nb_option_roms;
 extern const char *prom_envs[MAX_PROM_ENVS];
 extern unsigned int nb_prom_envs;
 
+extern uint64_t max_bounce_buffer_size;
+
 /* serial ports */
 
 /* Return the Chardev for serial port i, or NULL if none */
diff --git a/qemu-options.hx b/qemu-options.hx
index 29b98c3d4c..6071794237 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -4959,6 +4959,33 @@ SRST
 ERST
 #endif
 
+DEF("max-bounce-buffer-size", HAS_ARG,
+    QEMU_OPTION_max_bounce_buffer_size,
+    "-max-bounce-buffer-size size\n"
+    "                DMA bounce buffer size limit in bytes (default=4096)\n",
+    QEMU_ARCH_ALL)
+SRST
+``-max-bounce-buffer-size size``
+    Set the limit in bytes for DMA bounce buffer allocations.
+
+    DMA bounce buffers are used when device models request memory-mapped access
+    to memory regions that can't be directly mapped by the qemu process, so the
+    memory must be read from or written to a temporary local buffer for the
+    device model to work with. This is the case e.g. for I/O memory regions,
+    and when running in multi-process mode without shared access to memory.
+
+    Whether bounce buffering is necessary depends heavily on the device model
+    implementation. Some devices use explicit DMA read and write operations
+    which do not require bounce buffers. Some devices, notably storage, will
+    retry a failed DMA map request after bounce buffer space becomes available
+    again. Most other devices will bail when encountering map request failures,
+    which will typically appear to the guest as a hardware error.
+
+    Suitable bounce buffer size values depend on the workload and guest
+    configuration. A few kilobytes up to a few megabytes are common sizes
+    encountered in practice.
+ERST
+
 DEFHEADING()
 
 DEFHEADING(Generic object creation:)
diff --git a/softmmu/globals.c b/softmmu/globals.c
index e83b5428d1..d3cc010717 100644
--- a/softmmu/globals.c
+++ b/softmmu/globals.c
@@ -54,6 +54,7 @@ const char *prom_envs[MAX_PROM_ENVS];
 uint8_t *boot_splash_filedata;
 int only_migratable; /* turn it off unless user states otherwise */
 int icount_align_option;
+uint64_t max_bounce_buffer_size = 4096;
 
 /* The bytes in qemu_uuid are in the order specified by RFC4122, _not_ in the
  * little-endian "wire format" described in the SMBIOS 2.6 specification.
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 3df73542e1..9f0fec0c8e 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -50,6 +50,7 @@
 #include "sysemu/dma.h"
 #include "sysemu/hostmem.h"
 #include "sysemu/hw_accel.h"
+#include "sysemu/sysemu.h"
 #include "sysemu/xen-mapcache.h"
 #include "trace/trace-root.h"
 
@@ -2904,13 +2905,12 @@ void cpu_flush_icache_range(hwaddr start, hwaddr len)
 
 typedef struct {
     MemoryRegion *mr;
-    void *buffer;
     hwaddr addr;
-    hwaddr len;
-    bool in_use;
+    size_t len;
+    uint8_t buffer[];
 } BounceBuffer;
 
-static BounceBuffer bounce;
+static size_t bounce_buffer_size;
 
 typedef struct MapClient {
     QEMUBH *bh;
@@ -2945,9 +2945,9 @@ void cpu_register_map_client(QEMUBH *bh)
     qemu_mutex_lock(&map_client_list_lock);
     client->bh = bh;
     QLIST_INSERT_HEAD(&map_client_list, client, link);
-    /* Write map_client_list before reading in_use.  */
+    /* Write map_client_list before reading bounce_buffer_size. */
     smp_mb();
-    if (!qatomic_read(&bounce.in_use)) {
+    if (qatomic_read(&bounce_buffer_size) < max_bounce_buffer_size) {
         cpu_notify_map_clients_locked();
     }
     qemu_mutex_unlock(&map_client_list_lock);
@@ -3076,31 +3076,35 @@ void *address_space_map(AddressSpace *as,
     RCU_READ_LOCK_GUARD();
     fv = address_space_to_flatview(as);
     mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
+    memory_region_ref(mr);
 
     if (!memory_access_is_direct(mr, is_write)) {
-        if (qatomic_xchg(&bounce.in_use, true)) {
+        size_t size = qatomic_add_fetch(&bounce_buffer_size, l);
+        if (size > max_bounce_buffer_size) {
+            size_t excess = size - max_bounce_buffer_size;
+            l -= excess;
+            qatomic_sub(&bounce_buffer_size, excess);
+        }
+
+        if (l == 0) {
             *plen = 0;
             return NULL;
         }
-        /* Avoid unbounded allocations */
-        l = MIN(l, TARGET_PAGE_SIZE);
-        bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, l);
-        bounce.addr = addr;
-        bounce.len = l;
 
-        memory_region_ref(mr);
-        bounce.mr = mr;
+        BounceBuffer *bounce = g_malloc(l + sizeof(BounceBuffer));
+        bounce->mr = mr;
+        bounce->addr = addr;
+        bounce->len = l;
+
         if (!is_write) {
             flatview_read(fv, addr, MEMTXATTRS_UNSPECIFIED,
-                          bounce.buffer, l);
+                          bounce->buffer, l);
         }
 
         *plen = l;
-        return bounce.buffer;
+        return bounce->buffer;
     }
 
-
-    memory_region_ref(mr);
     *plen = flatview_extend_translation(fv, addr, len, mr, xlat,
                                         l, is_write, attrs);
     fuzz_dma_read_cb(addr, *plen, mr);
@@ -3114,31 +3118,37 @@ void *address_space_map(AddressSpace *as,
 void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
                          bool is_write, hwaddr access_len)
 {
-    if (buffer != bounce.buffer) {
-        MemoryRegion *mr;
-        ram_addr_t addr1;
+    MemoryRegion *mr;
+    ram_addr_t addr1;
+
+    mr = memory_region_from_host(buffer, &addr1);
+    if (mr == NULL) {
+        /*
+         * Must be a bounce buffer (unless the caller passed a pointer which
+         * wasn't returned by address_space_map, which is illegal).
+         */
+        BounceBuffer *bounce = container_of(buffer, BounceBuffer, buffer);
 
-        mr = memory_region_from_host(buffer, &addr1);
-        assert(mr != NULL);
         if (is_write) {
-            invalidate_and_set_dirty(mr, addr1, access_len);
-        }
-        if (xen_enabled()) {
-            xen_invalidate_map_cache_entry(buffer);
+            address_space_write(as, bounce->addr, MEMTXATTRS_UNSPECIFIED,
+                                bounce->buffer, access_len);
         }
-        memory_region_unref(mr);
+
+        memory_region_unref(bounce->mr);
+        qatomic_sub(&bounce_buffer_size, bounce->len);
+        /* Write bounce_buffer_size before reading map_client_list. */
+        smp_mb();
+        cpu_notify_map_clients();
+        g_free(bounce);
         return;
     }
+
+    if (xen_enabled()) {
+        xen_invalidate_map_cache_entry(buffer);
+    }
     if (is_write) {
-        address_space_write(as, bounce.addr, MEMTXATTRS_UNSPECIFIED,
-                            bounce.buffer, access_len);
-    }
-    qemu_vfree(bounce.buffer);
-    bounce.buffer = NULL;
-    memory_region_unref(bounce.mr);
-    /* Clear in_use before reading map_client_list. */
-    qatomic_set_mb(&bounce.in_use, false);
-    cpu_notify_map_clients();
+        invalidate_and_set_dirty(mr, addr1, access_len);
+    }
 }
 
 void *cpu_physical_memory_map(hwaddr addr,
diff --git a/softmmu/vl.c b/softmmu/vl.c
index b0b96f67fa..dbe52f5ea1 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -3469,6 +3469,12 @@ void qemu_init(int argc, char **argv)
             exit(1);
 #endif
             break;
+        case QEMU_OPTION_max_bounce_buffer_size:
+            if (qemu_strtosz(optarg, NULL, &max_bounce_buffer_size) < 0) {
+                error_report("invalid -max-bounce-buffer-size value");
+                exit(1);
+            }
+            break;
         case QEMU_OPTION_object:
             object_option_parse(optarg);
             break;
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread
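The allocation accounting in the physmem.c hunk above is compact but subtle:
the requested size is reserved optimistically, and any excess over the limit is
returned immediately, truncating rather than failing a request that still
partially fits. A self-contained sketch of the same reserve/clamp/release
pattern, using C11 atomics as a stand-in for qemu's qatomic_* helpers:

    #include <stdatomic.h>
    #include <stddef.h>

    static _Atomic size_t bounce_in_use;
    static const size_t bounce_limit = 4096;

    /* Try to reserve up to 'want' bytes; returns the amount actually granted
     * (possibly truncated to what fits under the limit, possibly 0). */
    static size_t bounce_reserve(size_t want)
    {
        size_t total = atomic_fetch_add(&bounce_in_use, want) + want;
        if (total > bounce_limit) {
            size_t excess = total - bounce_limit;
            /* Never release more than we reserved, even if a concurrent
             * caller has transiently pushed the total past the limit. */
            excess = excess < want ? excess : want;
            atomic_fetch_sub(&bounce_in_use, excess);
            want -= excess;
        }
        return want;
    }

    static void bounce_release(size_t granted)
    {
        atomic_fetch_sub(&bounce_in_use, granted);
    }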
* Re: [PATCH v2 1/4] softmmu: Support concurrent bounce buffers 2023-08-23 9:29 ` [PATCH v2 1/4] softmmu: Support concurrent bounce buffers Mattias Nissler @ 2023-08-23 17:34 ` Peter Xu 2023-08-23 20:08 ` Mattias Nissler 0 siblings, 1 reply; 14+ messages in thread From: Peter Xu @ 2023-08-23 17:34 UTC (permalink / raw) To: Mattias Nissler Cc: qemu-devel, john.levon, stefanha, Jagannathan Raman, Philippe Mathieu-Daudé, David Hildenbrand, Paolo Bonzini, Elena Ufimtseva On Wed, Aug 23, 2023 at 02:29:02AM -0700, Mattias Nissler wrote: > When DMA memory can't be directly accessed, as is the case when > running the device model in a separate process without shareable DMA > file descriptors, bounce buffering is used. > > It is not uncommon for device models to request mapping of several DMA > regions at the same time. Examples include: > * net devices, e.g. when transmitting a packet that is split across > several TX descriptors (observed with igb) > * USB host controllers, when handling a packet with multiple data TRBs > (observed with xhci) > > Previously, qemu only provided a single bounce buffer and would fail DMA > map requests while the buffer was already in use. In turn, this would > cause DMA failures that ultimately manifest as hardware errors from the > guest perspective. > > This change allocates DMA bounce buffers dynamically instead of > supporting only a single buffer. Thus, multiple DMA mappings work > correctly also when RAM can't be mmap()-ed. > > The total bounce buffer allocation size is limited by a new command line > parameter. The default is 4096 bytes to match the previous maximum > buffer size. It is expected that suitable limits will vary quite a bit > in practice depending on device models and workloads. > > Signed-off-by: Mattias Nissler <mnissler@rivosinc.com> > --- > include/sysemu/sysemu.h | 2 + > qemu-options.hx | 27 +++++++++++++ > softmmu/globals.c | 1 + > softmmu/physmem.c | 84 +++++++++++++++++++++++------------------ > softmmu/vl.c | 6 +++ > 5 files changed, 83 insertions(+), 37 deletions(-) > > diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h > index 25be2a692e..c5dc93cb53 100644 > --- a/include/sysemu/sysemu.h > +++ b/include/sysemu/sysemu.h > @@ -61,6 +61,8 @@ extern int nb_option_roms; > extern const char *prom_envs[MAX_PROM_ENVS]; > extern unsigned int nb_prom_envs; > > +extern uint64_t max_bounce_buffer_size; > + > /* serial ports */ > > /* Return the Chardev for serial port i, or NULL if none */ > diff --git a/qemu-options.hx b/qemu-options.hx > index 29b98c3d4c..6071794237 100644 > --- a/qemu-options.hx > +++ b/qemu-options.hx > @@ -4959,6 +4959,33 @@ SRST > ERST > #endif > > +DEF("max-bounce-buffer-size", HAS_ARG, > + QEMU_OPTION_max_bounce_buffer_size, > + "-max-bounce-buffer-size size\n" > + " DMA bounce buffer size limit in bytes (default=4096)\n", > + QEMU_ARCH_ALL) > +SRST > +``-max-bounce-buffer-size size`` > + Set the limit in bytes for DMA bounce buffer allocations. > + > + DMA bounce buffers are used when device models request memory-mapped access > + to memory regions that can't be directly mapped by the qemu process, so the > + memory must read or written to a temporary local buffer for the device > + model to work with. This is the case e.g. for I/O memory regions, and when > + running in multi-process mode without shared access to memory. > + > + Whether bounce buffering is necessary depends heavily on the device model > + implementation. 
Some devices use explicit DMA read and write operations > + which do not require bounce buffers. Some devices, notably storage, will > + retry a failed DMA map request after bounce buffer space becomes available > + again. Most other devices will bail when encountering map request failures, > + which will typically appear to the guest as a hardware error. > + > + Suitable bounce buffer size values depend on the workload and guest > + configuration. A few kilobytes up to a few megabytes are common sizes > + encountered in practice. Does it mean that the default 4K size can still easily fail with some device setup? IIUC the whole point of limit here is to make sure the allocation is still bounded, while 4K itself is not a hard limit. Making it bigger would be, IMHO, nice if it should work with known configs which used to be broken. > +ERST > + > DEFHEADING() > > DEFHEADING(Generic object creation:) > diff --git a/softmmu/globals.c b/softmmu/globals.c > index e83b5428d1..d3cc010717 100644 > --- a/softmmu/globals.c > +++ b/softmmu/globals.c > @@ -54,6 +54,7 @@ const char *prom_envs[MAX_PROM_ENVS]; > uint8_t *boot_splash_filedata; > int only_migratable; /* turn it off unless user states otherwise */ > int icount_align_option; > +uint64_t max_bounce_buffer_size = 4096; > > /* The bytes in qemu_uuid are in the order specified by RFC4122, _not_ in the > * little-endian "wire format" described in the SMBIOS 2.6 specification. > diff --git a/softmmu/physmem.c b/softmmu/physmem.c > index 3df73542e1..9f0fec0c8e 100644 > --- a/softmmu/physmem.c > +++ b/softmmu/physmem.c > @@ -50,6 +50,7 @@ > #include "sysemu/dma.h" > #include "sysemu/hostmem.h" > #include "sysemu/hw_accel.h" > +#include "sysemu/sysemu.h" > #include "sysemu/xen-mapcache.h" > #include "trace/trace-root.h" > > @@ -2904,13 +2905,12 @@ void cpu_flush_icache_range(hwaddr start, hwaddr len) > > typedef struct { > MemoryRegion *mr; > - void *buffer; > hwaddr addr; > - hwaddr len; > - bool in_use; > + size_t len; > + uint8_t buffer[]; > } BounceBuffer; > > -static BounceBuffer bounce; > +static size_t bounce_buffer_size; > > typedef struct MapClient { > QEMUBH *bh; > @@ -2945,9 +2945,9 @@ void cpu_register_map_client(QEMUBH *bh) > qemu_mutex_lock(&map_client_list_lock); > client->bh = bh; > QLIST_INSERT_HEAD(&map_client_list, client, link); > - /* Write map_client_list before reading in_use. */ > + /* Write map_client_list before reading bounce_buffer_size. 
*/ > smp_mb(); > - if (!qatomic_read(&bounce.in_use)) { > + if (qatomic_read(&bounce_buffer_size) < max_bounce_buffer_size) { > cpu_notify_map_clients_locked(); > } > qemu_mutex_unlock(&map_client_list_lock); > @@ -3076,31 +3076,35 @@ void *address_space_map(AddressSpace *as, > RCU_READ_LOCK_GUARD(); > fv = address_space_to_flatview(as); > mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs); > + memory_region_ref(mr); > > if (!memory_access_is_direct(mr, is_write)) { > - if (qatomic_xchg(&bounce.in_use, true)) { > + size_t size = qatomic_add_fetch(&bounce_buffer_size, l); > + if (size > max_bounce_buffer_size) { > + size_t excess = size - max_bounce_buffer_size; > + l -= excess; > + qatomic_sub(&bounce_buffer_size, excess); > + } > + > + if (l == 0) { > *plen = 0; > return NULL; > } > - /* Avoid unbounded allocations */ > - l = MIN(l, TARGET_PAGE_SIZE); > - bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, l); > - bounce.addr = addr; > - bounce.len = l; > > - memory_region_ref(mr); > - bounce.mr = mr; > + BounceBuffer *bounce = g_malloc(l + sizeof(BounceBuffer)); Maybe g_malloc0() would be better? I just checked that we had target page aligned allocations since the 1st day (commit 6d16c2f88f2a). I didn't find any clue showing why it was done like that, but I do have worry on whether any existing caller that may implicitly relying on an address that is target page aligned. But maybe not a major issue; I didn't see anything rely on that yet. > + bounce->mr = mr; > + bounce->addr = addr; > + bounce->len = l; > + > if (!is_write) { > flatview_read(fv, addr, MEMTXATTRS_UNSPECIFIED, > - bounce.buffer, l); > + bounce->buffer, l); > } > > *plen = l; > - return bounce.buffer; > + return bounce->buffer; > } > > - > - memory_region_ref(mr); > *plen = flatview_extend_translation(fv, addr, len, mr, xlat, > l, is_write, attrs); > fuzz_dma_read_cb(addr, *plen, mr); > @@ -3114,31 +3118,37 @@ void *address_space_map(AddressSpace *as, > void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len, > bool is_write, hwaddr access_len) > { > - if (buffer != bounce.buffer) { > - MemoryRegion *mr; > - ram_addr_t addr1; > + MemoryRegion *mr; > + ram_addr_t addr1; > + > + mr = memory_region_from_host(buffer, &addr1); > + if (mr == NULL) { > + /* > + * Must be a bounce buffer (unless the caller passed a pointer which > + * wasn't returned by address_space_map, which is illegal). Is it possible to still have some kind of sanity check to make sure it's a bounce buffer passed in, just in case of a caller bug? Or, the failure can be weird.. > + */ > + BounceBuffer *bounce = container_of(buffer, BounceBuffer, buffer); > > - mr = memory_region_from_host(buffer, &addr1); > - assert(mr != NULL); > if (is_write) { > - invalidate_and_set_dirty(mr, addr1, access_len); > - } > - if (xen_enabled()) { > - xen_invalidate_map_cache_entry(buffer); > + address_space_write(as, bounce->addr, MEMTXATTRS_UNSPECIFIED, > + bounce->buffer, access_len); > } > - memory_region_unref(mr); > + > + memory_region_unref(bounce->mr); > + qatomic_sub(&bounce_buffer_size, bounce->len); > + /* Write bounce_buffer_size before reading map_client_list. 
*/ > + smp_mb(); > + cpu_notify_map_clients(); > + g_free(bounce); > return; > } > + > + if (xen_enabled()) { > + xen_invalidate_map_cache_entry(buffer); > + } > if (is_write) { > - address_space_write(as, bounce.addr, MEMTXATTRS_UNSPECIFIED, > - bounce.buffer, access_len); > - } > - qemu_vfree(bounce.buffer); > - bounce.buffer = NULL; > - memory_region_unref(bounce.mr); > - /* Clear in_use before reading map_client_list. */ > - qatomic_set_mb(&bounce.in_use, false); > - cpu_notify_map_clients(); > + invalidate_and_set_dirty(mr, addr1, access_len); > + } > } > > void *cpu_physical_memory_map(hwaddr addr, > diff --git a/softmmu/vl.c b/softmmu/vl.c > index b0b96f67fa..dbe52f5ea1 100644 > --- a/softmmu/vl.c > +++ b/softmmu/vl.c > @@ -3469,6 +3469,12 @@ void qemu_init(int argc, char **argv) > exit(1); > #endif > break; > + case QEMU_OPTION_max_bounce_buffer_size: > + if (qemu_strtosz(optarg, NULL, &max_bounce_buffer_size) < 0) { > + error_report("invalid -max-ounce-buffer-size value"); > + exit(1); > + } > + break; PS: I had a vague memory that we do not recommend adding more qemu cmdline options, but I don't know enough on the plan to say anything real. > case QEMU_OPTION_object: > object_option_parse(optarg); > break; > -- > 2.34.1 > Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 14+ messages in thread
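A note on the paired barrier comments visible in the quoted diff: client
registration and buffer release each write one shared variable and then read
the other, the classic pattern where both sides need a full barrier between
their write and their read, or a wakeup can be lost. A minimal sketch of that
pattern with C11 atomics (names are stand-ins for qemu's map_client_list and
bounce_buffer_size):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    static _Atomic bool client_waiting;
    static _Atomic size_t buffers_in_use;

    static void map_client_register(void)
    {
        atomic_store_explicit(&client_waiting, true, memory_order_relaxed);
        /* Write client_waiting before reading buffers_in_use (pairs with
         * the fence in bounce_release_and_notify below). */
        atomic_thread_fence(memory_order_seq_cst);
        if (atomic_load_explicit(&buffers_in_use, memory_order_relaxed) == 0) {
            /* Notify immediately: buffer space is already available. */
        }
    }

    static void bounce_release_and_notify(size_t len)
    {
        atomic_fetch_sub_explicit(&buffers_in_use, len, memory_order_relaxed);
        /* Write buffers_in_use before reading client_waiting (pairs with
         * the fence in map_client_register above). */
        atomic_thread_fence(memory_order_seq_cst);
        if (atomic_load_explicit(&client_waiting, memory_order_relaxed)) {
            /* Wake the registered clients. */
        }
    }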
* Re: [PATCH v2 1/4] softmmu: Support concurrent bounce buffers 2023-08-23 17:34 ` Peter Xu @ 2023-08-23 20:08 ` Mattias Nissler 2023-08-23 20:54 ` Peter Xu 0 siblings, 1 reply; 14+ messages in thread From: Mattias Nissler @ 2023-08-23 20:08 UTC (permalink / raw) To: Peter Xu Cc: qemu-devel, john.levon, stefanha, Jagannathan Raman, Philippe Mathieu-Daudé, David Hildenbrand, Paolo Bonzini, Elena Ufimtseva Peter, thanks for taking a look and providing feedback! On Wed, Aug 23, 2023 at 7:35 PM Peter Xu <peterx@redhat.com> wrote: > > On Wed, Aug 23, 2023 at 02:29:02AM -0700, Mattias Nissler wrote: > > When DMA memory can't be directly accessed, as is the case when > > running the device model in a separate process without shareable DMA > > file descriptors, bounce buffering is used. > > > > It is not uncommon for device models to request mapping of several DMA > > regions at the same time. Examples include: > > * net devices, e.g. when transmitting a packet that is split across > > several TX descriptors (observed with igb) > > * USB host controllers, when handling a packet with multiple data TRBs > > (observed with xhci) > > > > Previously, qemu only provided a single bounce buffer and would fail DMA > > map requests while the buffer was already in use. In turn, this would > > cause DMA failures that ultimately manifest as hardware errors from the > > guest perspective. > > > > This change allocates DMA bounce buffers dynamically instead of > > supporting only a single buffer. Thus, multiple DMA mappings work > > correctly also when RAM can't be mmap()-ed. > > > > The total bounce buffer allocation size is limited by a new command line > > parameter. The default is 4096 bytes to match the previous maximum > > buffer size. It is expected that suitable limits will vary quite a bit > > in practice depending on device models and workloads. > > > > Signed-off-by: Mattias Nissler <mnissler@rivosinc.com> > > --- > > include/sysemu/sysemu.h | 2 + > > qemu-options.hx | 27 +++++++++++++ > > softmmu/globals.c | 1 + > > softmmu/physmem.c | 84 +++++++++++++++++++++++------------------ > > softmmu/vl.c | 6 +++ > > 5 files changed, 83 insertions(+), 37 deletions(-) > > > > diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h > > index 25be2a692e..c5dc93cb53 100644 > > --- a/include/sysemu/sysemu.h > > +++ b/include/sysemu/sysemu.h > > @@ -61,6 +61,8 @@ extern int nb_option_roms; > > extern const char *prom_envs[MAX_PROM_ENVS]; > > extern unsigned int nb_prom_envs; > > > > +extern uint64_t max_bounce_buffer_size; > > + > > /* serial ports */ > > > > /* Return the Chardev for serial port i, or NULL if none */ > > diff --git a/qemu-options.hx b/qemu-options.hx > > index 29b98c3d4c..6071794237 100644 > > --- a/qemu-options.hx > > +++ b/qemu-options.hx > > @@ -4959,6 +4959,33 @@ SRST > > ERST > > #endif > > > > +DEF("max-bounce-buffer-size", HAS_ARG, > > + QEMU_OPTION_max_bounce_buffer_size, > > + "-max-bounce-buffer-size size\n" > > + " DMA bounce buffer size limit in bytes (default=4096)\n", > > + QEMU_ARCH_ALL) > > +SRST > > +``-max-bounce-buffer-size size`` > > + Set the limit in bytes for DMA bounce buffer allocations. > > + > > + DMA bounce buffers are used when device models request memory-mapped access > > + to memory regions that can't be directly mapped by the qemu process, so the > > + memory must read or written to a temporary local buffer for the device > > + model to work with. This is the case e.g. 
for I/O memory regions, and when > > + running in multi-process mode without shared access to memory. > > + > > + Whether bounce buffering is necessary depends heavily on the device model > > + implementation. Some devices use explicit DMA read and write operations > > + which do not require bounce buffers. Some devices, notably storage, will > > + retry a failed DMA map request after bounce buffer space becomes available > > + again. Most other devices will bail when encountering map request failures, > > + which will typically appear to the guest as a hardware error. > > + > > + Suitable bounce buffer size values depend on the workload and guest > > + configuration. A few kilobytes up to a few megabytes are common sizes > > + encountered in practice. > > Does it mean that the default 4K size can still easily fail with some > device setup? Yes. The thing is that the respective device setup is pretty exotic, at least the only setup I'm aware of is multi-process with direct RAM access via shared file descriptors from the device process disabled (which hurts performance, so few people will run this setup). In theory, DMA to an I/O region of some sort would also run into the issue even in single process mode, but I'm not aware of such a situation. Looking at it from a historic perspective, note that the single-bounce-buffer restriction has been present since a decade, and thus the issue has been present for years (since multi-process is a thing), without it hurting anyone enough to get fixed. But don't get me wrong - I don't want to downplay anything and very much would like to see this fixed, but I want to be honest and put things into the right perspective. > > IIUC the whole point of limit here is to make sure the allocation is still > bounded, while 4K itself is not a hard limit. Making it bigger would be, > IMHO, nice if it should work with known configs which used to be broken. I'd be in favor of bumping the default. Networking should be fine with 64KB, but I've observed a default Linux + xhci + usb_storage setup to use up to 1MB of DMA buffers, we'd probably need to raise it considerably. Would 4MB still be acceptable? That wouldn't allow a single nefarious VM to stage a memory denial of service attack, but what if you're running many VMs? > > > +ERST > > + > > DEFHEADING() > > > > DEFHEADING(Generic object creation:) > > diff --git a/softmmu/globals.c b/softmmu/globals.c > > index e83b5428d1..d3cc010717 100644 > > --- a/softmmu/globals.c > > +++ b/softmmu/globals.c > > @@ -54,6 +54,7 @@ const char *prom_envs[MAX_PROM_ENVS]; > > uint8_t *boot_splash_filedata; > > int only_migratable; /* turn it off unless user states otherwise */ > > int icount_align_option; > > +uint64_t max_bounce_buffer_size = 4096; > > > > /* The bytes in qemu_uuid are in the order specified by RFC4122, _not_ in the > > * little-endian "wire format" described in the SMBIOS 2.6 specification. 
> > diff --git a/softmmu/physmem.c b/softmmu/physmem.c > > index 3df73542e1..9f0fec0c8e 100644 > > --- a/softmmu/physmem.c > > +++ b/softmmu/physmem.c > > @@ -50,6 +50,7 @@ > > #include "sysemu/dma.h" > > #include "sysemu/hostmem.h" > > #include "sysemu/hw_accel.h" > > +#include "sysemu/sysemu.h" > > #include "sysemu/xen-mapcache.h" > > #include "trace/trace-root.h" > > > > @@ -2904,13 +2905,12 @@ void cpu_flush_icache_range(hwaddr start, hwaddr len) > > > > typedef struct { > > MemoryRegion *mr; > > - void *buffer; > > hwaddr addr; > > - hwaddr len; > > - bool in_use; > > + size_t len; > > + uint8_t buffer[]; > > } BounceBuffer; > > > > -static BounceBuffer bounce; > > +static size_t bounce_buffer_size; > > > > typedef struct MapClient { > > QEMUBH *bh; > > @@ -2945,9 +2945,9 @@ void cpu_register_map_client(QEMUBH *bh) > > qemu_mutex_lock(&map_client_list_lock); > > client->bh = bh; > > QLIST_INSERT_HEAD(&map_client_list, client, link); > > - /* Write map_client_list before reading in_use. */ > > + /* Write map_client_list before reading bounce_buffer_size. */ > > smp_mb(); > > - if (!qatomic_read(&bounce.in_use)) { > > + if (qatomic_read(&bounce_buffer_size) < max_bounce_buffer_size) { > > cpu_notify_map_clients_locked(); > > } > > qemu_mutex_unlock(&map_client_list_lock); > > @@ -3076,31 +3076,35 @@ void *address_space_map(AddressSpace *as, > > RCU_READ_LOCK_GUARD(); > > fv = address_space_to_flatview(as); > > mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs); > > + memory_region_ref(mr); > > > > if (!memory_access_is_direct(mr, is_write)) { > > - if (qatomic_xchg(&bounce.in_use, true)) { > > + size_t size = qatomic_add_fetch(&bounce_buffer_size, l); > > + if (size > max_bounce_buffer_size) { > > + size_t excess = size - max_bounce_buffer_size; > > + l -= excess; > > + qatomic_sub(&bounce_buffer_size, excess); > > + } > > + > > + if (l == 0) { > > *plen = 0; > > return NULL; > > } > > - /* Avoid unbounded allocations */ > > - l = MIN(l, TARGET_PAGE_SIZE); > > - bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, l); > > - bounce.addr = addr; > > - bounce.len = l; > > > > - memory_region_ref(mr); > > - bounce.mr = mr; > > + BounceBuffer *bounce = g_malloc(l + sizeof(BounceBuffer)); > > Maybe g_malloc0() would be better? Good point, will change. > > I just checked that we had target page aligned allocations since the 1st > day (commit 6d16c2f88f2a). I didn't find any clue showing why it was done > like that, but I do have worry on whether any existing caller that may > implicitly relying on an address that is target page aligned. But maybe > not a major issue; I didn't see anything rely on that yet. I did go through the same exercise when noticing the page alignment and arrived at the same conclusion as you. That makes it two people thinking it's OK, so I feel like we should take the risk here, in particular given that we know this code path is already broken as is. 
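To make the alignment trade-off concrete: the old code returned a
qemu_memalign(TARGET_PAGE_SIZE, ...) allocation, while the flexible-array
layout returns a pointer into a plain heap allocation that is only aligned as
strongly as g_malloc() guarantees. A small standalone sketch of the new layout
and the pointer recovery on unmap (field types simplified for illustration):

    #include <glib.h>
    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    typedef struct {
        void *mr;           /* MemoryRegion * in the real code */
        uint64_t addr;
        size_t len;
        uint8_t buffer[];   /* payload immediately follows the header */
    } BounceBuffer;

    int main(void)
    {
        BounceBuffer *b = g_malloc0(sizeof(BounceBuffer) + 4096);
        uint8_t *payload = b->buffer;

        /* The payload is aligned like any heap allocation (typically 8 or
         * 16 bytes), not to TARGET_PAGE_SIZE as qemu_memalign() ensured. */
        printf("payload offset into allocation: %zu bytes\n",
               offsetof(BounceBuffer, buffer));

        /* Unmap path: recover the header from the payload pointer, which is
         * what container_of(buffer, BounceBuffer, buffer) does in the patch. */
        BounceBuffer *back =
            (BounceBuffer *)(payload - offsetof(BounceBuffer, buffer));
        g_assert(back == b);
        g_free(b);
        return 0;
    }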
> > > + bounce->mr = mr; > > + bounce->addr = addr; > > + bounce->len = l; > > + > > if (!is_write) { > > flatview_read(fv, addr, MEMTXATTRS_UNSPECIFIED, > > - bounce.buffer, l); > > + bounce->buffer, l); > > } > > > > *plen = l; > > - return bounce.buffer; > > + return bounce->buffer; > > } > > > > - > > - memory_region_ref(mr); > > *plen = flatview_extend_translation(fv, addr, len, mr, xlat, > > l, is_write, attrs); > > fuzz_dma_read_cb(addr, *plen, mr); > > @@ -3114,31 +3118,37 @@ void *address_space_map(AddressSpace *as, > > void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len, > > bool is_write, hwaddr access_len) > > { > > - if (buffer != bounce.buffer) { > > - MemoryRegion *mr; > > - ram_addr_t addr1; > > + MemoryRegion *mr; > > + ram_addr_t addr1; > > + > > + mr = memory_region_from_host(buffer, &addr1); > > + if (mr == NULL) { > > + /* > > + * Must be a bounce buffer (unless the caller passed a pointer which > > + * wasn't returned by address_space_map, which is illegal). > > Is it possible to still have some kind of sanity check to make sure it's a > bounce buffer passed in, just in case of a caller bug? Or, the failure can > be weird.. I was contemplating putting a magic number as the first struct member as a best effort to detect invalid pointers and corruptions, but wasn't sure it's warranted. Since you ask, I'll make that change. > > > + */ > > + BounceBuffer *bounce = container_of(buffer, BounceBuffer, buffer); > > > > - mr = memory_region_from_host(buffer, &addr1); > > - assert(mr != NULL); > > if (is_write) { > > - invalidate_and_set_dirty(mr, addr1, access_len); > > - } > > - if (xen_enabled()) { > > - xen_invalidate_map_cache_entry(buffer); > > + address_space_write(as, bounce->addr, MEMTXATTRS_UNSPECIFIED, > > + bounce->buffer, access_len); > > } > > - memory_region_unref(mr); > > + > > + memory_region_unref(bounce->mr); > > + qatomic_sub(&bounce_buffer_size, bounce->len); > > + /* Write bounce_buffer_size before reading map_client_list. */ > > + smp_mb(); > > + cpu_notify_map_clients(); > > + g_free(bounce); > > return; > > } > > + > > + if (xen_enabled()) { > > + xen_invalidate_map_cache_entry(buffer); > > + } > > if (is_write) { > > - address_space_write(as, bounce.addr, MEMTXATTRS_UNSPECIFIED, > > - bounce.buffer, access_len); > > - } > > - qemu_vfree(bounce.buffer); > > - bounce.buffer = NULL; > > - memory_region_unref(bounce.mr); > > - /* Clear in_use before reading map_client_list. */ > > - qatomic_set_mb(&bounce.in_use, false); > > - cpu_notify_map_clients(); > > + invalidate_and_set_dirty(mr, addr1, access_len); > > + } > > } > > > > void *cpu_physical_memory_map(hwaddr addr, > > diff --git a/softmmu/vl.c b/softmmu/vl.c > > index b0b96f67fa..dbe52f5ea1 100644 > > --- a/softmmu/vl.c > > +++ b/softmmu/vl.c > > @@ -3469,6 +3469,12 @@ void qemu_init(int argc, char **argv) > > exit(1); > > #endif > > break; > > + case QEMU_OPTION_max_bounce_buffer_size: > > + if (qemu_strtosz(optarg, NULL, &max_bounce_buffer_size) < 0) { > > + error_report("invalid -max-ounce-buffer-size value"); > > + exit(1); > > + } > > + break; > > PS: I had a vague memory that we do not recommend adding more qemu cmdline > options, but I don't know enough on the plan to say anything real. I am aware of that, and I'm really not happy with the command line option myself. 
Consider the command line flag a straw man I put in to see whether any reviewers have better ideas :) More seriously, I actually did look around to see whether I can add the parameter to one of the existing option groupings somewhere, but neither do I have a suitable QOM object that I can attach the parameter to, nor did I find any global option groups that fits: this is not really memory configuration, and it's not really CPU configuration, it's more related to shared device model infrastructure... If you have a good idea for a home for this, I'm all ears. > > > case QEMU_OPTION_object: > > object_option_parse(optarg); > > break; > > -- > > 2.34.1 > > > > Thanks, > > -- > Peter Xu > ^ permalink raw reply [flat|nested] 14+ messages in thread
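The sanity check agreed on here could look roughly like the following
(hypothetical sketch; the field name, magic value, and failure behavior in the
eventual patch may differ):

    #include <assert.h>
    #include <stdint.h>
    #include <stddef.h>

    #define BOUNCE_BUFFER_MAGIC 0xb4df00d1b4df00d1ULL  /* arbitrary guard */

    typedef struct {
        uint64_t magic;     /* set at allocation, verified at unmap */
        uint64_t addr;
        size_t len;
        uint8_t buffer[];
    } BounceBuffer;

    /* Called from the unmap path once memory_region_from_host() is NULL. */
    static BounceBuffer *bounce_from_payload(uint8_t *payload)
    {
        BounceBuffer *b =
            (BounceBuffer *)(payload - offsetof(BounceBuffer, buffer));
        /* Best effort: catches callers handing back a pointer that never
         * came from address_space_map(), turning a weird failure into a
         * clean assert. */
        assert(b->magic == BOUNCE_BUFFER_MAGIC);
        return b;
    }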
* Re: [PATCH v2 1/4] softmmu: Support concurrent bounce buffers 2023-08-23 20:08 ` Mattias Nissler @ 2023-08-23 20:54 ` Peter Xu 2023-08-24 6:58 ` Mattias Nissler 2023-08-24 13:32 ` Stefan Hajnoczi 0 siblings, 2 replies; 14+ messages in thread From: Peter Xu @ 2023-08-23 20:54 UTC (permalink / raw) To: Mattias Nissler Cc: qemu-devel, john.levon, stefanha, Jagannathan Raman, Philippe Mathieu-Daudé, David Hildenbrand, Paolo Bonzini, Elena Ufimtseva On Wed, Aug 23, 2023 at 10:08:08PM +0200, Mattias Nissler wrote: > Peter, thanks for taking a look and providing feedback! > > On Wed, Aug 23, 2023 at 7:35 PM Peter Xu <peterx@redhat.com> wrote: > > > > On Wed, Aug 23, 2023 at 02:29:02AM -0700, Mattias Nissler wrote: > > > When DMA memory can't be directly accessed, as is the case when > > > running the device model in a separate process without shareable DMA > > > file descriptors, bounce buffering is used. > > > > > > It is not uncommon for device models to request mapping of several DMA > > > regions at the same time. Examples include: > > > * net devices, e.g. when transmitting a packet that is split across > > > several TX descriptors (observed with igb) > > > * USB host controllers, when handling a packet with multiple data TRBs > > > (observed with xhci) > > > > > > Previously, qemu only provided a single bounce buffer and would fail DMA > > > map requests while the buffer was already in use. In turn, this would > > > cause DMA failures that ultimately manifest as hardware errors from the > > > guest perspective. > > > > > > This change allocates DMA bounce buffers dynamically instead of > > > supporting only a single buffer. Thus, multiple DMA mappings work > > > correctly also when RAM can't be mmap()-ed. > > > > > > The total bounce buffer allocation size is limited by a new command line > > > parameter. The default is 4096 bytes to match the previous maximum > > > buffer size. It is expected that suitable limits will vary quite a bit > > > in practice depending on device models and workloads. > > > > > > Signed-off-by: Mattias Nissler <mnissler@rivosinc.com> > > > --- > > > include/sysemu/sysemu.h | 2 + > > > qemu-options.hx | 27 +++++++++++++ > > > softmmu/globals.c | 1 + > > > softmmu/physmem.c | 84 +++++++++++++++++++++++------------------ > > > softmmu/vl.c | 6 +++ > > > 5 files changed, 83 insertions(+), 37 deletions(-) > > > > > > diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h > > > index 25be2a692e..c5dc93cb53 100644 > > > --- a/include/sysemu/sysemu.h > > > +++ b/include/sysemu/sysemu.h > > > @@ -61,6 +61,8 @@ extern int nb_option_roms; > > > extern const char *prom_envs[MAX_PROM_ENVS]; > > > extern unsigned int nb_prom_envs; > > > > > > +extern uint64_t max_bounce_buffer_size; > > > + > > > /* serial ports */ > > > > > > /* Return the Chardev for serial port i, or NULL if none */ > > > diff --git a/qemu-options.hx b/qemu-options.hx > > > index 29b98c3d4c..6071794237 100644 > > > --- a/qemu-options.hx > > > +++ b/qemu-options.hx > > > @@ -4959,6 +4959,33 @@ SRST > > > ERST > > > #endif > > > > > > +DEF("max-bounce-buffer-size", HAS_ARG, > > > + QEMU_OPTION_max_bounce_buffer_size, > > > + "-max-bounce-buffer-size size\n" > > > + " DMA bounce buffer size limit in bytes (default=4096)\n", > > > + QEMU_ARCH_ALL) > > > +SRST > > > +``-max-bounce-buffer-size size`` > > > + Set the limit in bytes for DMA bounce buffer allocations. 
> > > + > > > + DMA bounce buffers are used when device models request memory-mapped access > > > + to memory regions that can't be directly mapped by the qemu process, so the > > > + memory must read or written to a temporary local buffer for the device > > > + model to work with. This is the case e.g. for I/O memory regions, and when > > > + running in multi-process mode without shared access to memory. > > > + > > > + Whether bounce buffering is necessary depends heavily on the device model > > > + implementation. Some devices use explicit DMA read and write operations > > > + which do not require bounce buffers. Some devices, notably storage, will > > > + retry a failed DMA map request after bounce buffer space becomes available > > > + again. Most other devices will bail when encountering map request failures, > > > + which will typically appear to the guest as a hardware error. > > > + > > > + Suitable bounce buffer size values depend on the workload and guest > > > + configuration. A few kilobytes up to a few megabytes are common sizes > > > + encountered in practice. > > > > Does it mean that the default 4K size can still easily fail with some > > device setup? > > Yes. The thing is that the respective device setup is pretty exotic, > at least the only setup I'm aware of is multi-process with direct RAM > access via shared file descriptors from the device process disabled > (which hurts performance, so few people will run this setup). In > theory, DMA to an I/O region of some sort would also run into the > issue even in single process mode, but I'm not aware of such a > situation. Looking at it from a historic perspective, note that the > single-bounce-buffer restriction has been present since a decade, and > thus the issue has been present for years (since multi-process is a > thing), without it hurting anyone enough to get fixed. But don't get > me wrong - I don't want to downplay anything and very much would like > to see this fixed, but I want to be honest and put things into the > right perspective. > > > > > IIUC the whole point of limit here is to make sure the allocation is still > > bounded, while 4K itself is not a hard limit. Making it bigger would be, > > IMHO, nice if it should work with known configs which used to be broken. > > I'd be in favor of bumping the default. Networking should be fine with > 64KB, but I've observed a default Linux + xhci + usb_storage setup to > use up to 1MB of DMA buffers, we'd probably need to raise it > considerably. Would 4MB still be acceptable? That wouldn't allow a > single nefarious VM to stage a memory denial of service attack, but > what if you're running many VMs? Could wait and see whether there's any more comments from others. Personally 4MB looks fine, as that's not a constant consumption per-vm, but a worst case limit (probably only when there is an attacker). Multiple VM does can indeed make it worse, but it means the attacker will need to attack all the VMs all success, and the sum up will be 4MB / mem_size_average_vm in percentage, irrelevant of numbers; for ~4GB average VM size it's 0.1% memory, and even less if VM is larger - maybe not something extremely scary even if happened. 
> > > > > > +ERST > > > + > > > DEFHEADING() > > > > > > DEFHEADING(Generic object creation:) > > > diff --git a/softmmu/globals.c b/softmmu/globals.c > > > index e83b5428d1..d3cc010717 100644 > > > --- a/softmmu/globals.c > > > +++ b/softmmu/globals.c > > > @@ -54,6 +54,7 @@ const char *prom_envs[MAX_PROM_ENVS]; > > > uint8_t *boot_splash_filedata; > > > int only_migratable; /* turn it off unless user states otherwise */ > > > int icount_align_option; > > > +uint64_t max_bounce_buffer_size = 4096; > > > > > > /* The bytes in qemu_uuid are in the order specified by RFC4122, _not_ in the > > > * little-endian "wire format" described in the SMBIOS 2.6 specification. > > > diff --git a/softmmu/physmem.c b/softmmu/physmem.c > > > index 3df73542e1..9f0fec0c8e 100644 > > > --- a/softmmu/physmem.c > > > +++ b/softmmu/physmem.c > > > @@ -50,6 +50,7 @@ > > > #include "sysemu/dma.h" > > > #include "sysemu/hostmem.h" > > > #include "sysemu/hw_accel.h" > > > +#include "sysemu/sysemu.h" > > > #include "sysemu/xen-mapcache.h" > > > #include "trace/trace-root.h" > > > > > > @@ -2904,13 +2905,12 @@ void cpu_flush_icache_range(hwaddr start, hwaddr len) > > > > > > typedef struct { > > > MemoryRegion *mr; > > > - void *buffer; > > > hwaddr addr; > > > - hwaddr len; > > > - bool in_use; > > > + size_t len; > > > + uint8_t buffer[]; > > > } BounceBuffer; > > > > > > -static BounceBuffer bounce; > > > +static size_t bounce_buffer_size; > > > > > > typedef struct MapClient { > > > QEMUBH *bh; > > > @@ -2945,9 +2945,9 @@ void cpu_register_map_client(QEMUBH *bh) > > > qemu_mutex_lock(&map_client_list_lock); > > > client->bh = bh; > > > QLIST_INSERT_HEAD(&map_client_list, client, link); > > > - /* Write map_client_list before reading in_use. */ > > > + /* Write map_client_list before reading bounce_buffer_size. */ > > > smp_mb(); > > > - if (!qatomic_read(&bounce.in_use)) { > > > + if (qatomic_read(&bounce_buffer_size) < max_bounce_buffer_size) { > > > cpu_notify_map_clients_locked(); > > > } > > > qemu_mutex_unlock(&map_client_list_lock); > > > @@ -3076,31 +3076,35 @@ void *address_space_map(AddressSpace *as, > > > RCU_READ_LOCK_GUARD(); > > > fv = address_space_to_flatview(as); > > > mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs); > > > + memory_region_ref(mr); > > > > > > if (!memory_access_is_direct(mr, is_write)) { > > > - if (qatomic_xchg(&bounce.in_use, true)) { > > > + size_t size = qatomic_add_fetch(&bounce_buffer_size, l); > > > + if (size > max_bounce_buffer_size) { > > > + size_t excess = size - max_bounce_buffer_size; > > > + l -= excess; > > > + qatomic_sub(&bounce_buffer_size, excess); > > > + } > > > + > > > + if (l == 0) { > > > *plen = 0; > > > return NULL; > > > } > > > - /* Avoid unbounded allocations */ > > > - l = MIN(l, TARGET_PAGE_SIZE); > > > - bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, l); > > > - bounce.addr = addr; > > > - bounce.len = l; > > > > > > - memory_region_ref(mr); > > > - bounce.mr = mr; > > > + BounceBuffer *bounce = g_malloc(l + sizeof(BounceBuffer)); > > > > Maybe g_malloc0() would be better? > > Good point, will change. > > > > > I just checked that we had target page aligned allocations since the 1st > > day (commit 6d16c2f88f2a). I didn't find any clue showing why it was done > > like that, but I do have worry on whether any existing caller that may > > implicitly relying on an address that is target page aligned. But maybe > > not a major issue; I didn't see anything rely on that yet. 
> > I did go through the same exercise when noticing the page alignment > and arrived at the same conclusion as you. That makes it two people > thinking it's OK, so I feel like we should take the risk here, in > particular given that we know this code path is already broken as is. It'll be more important to see if any one person thinks it's not okay in this case, though. :) If we decide to take the risk, we should merge a patch like this in as early stage as possible of the release. > > > > > > + bounce->mr = mr; > > > + bounce->addr = addr; > > > + bounce->len = l; > > > + > > > if (!is_write) { > > > flatview_read(fv, addr, MEMTXATTRS_UNSPECIFIED, > > > - bounce.buffer, l); > > > + bounce->buffer, l); > > > } > > > > > > *plen = l; > > > - return bounce.buffer; > > > + return bounce->buffer; > > > } > > > > > > - > > > - memory_region_ref(mr); > > > *plen = flatview_extend_translation(fv, addr, len, mr, xlat, > > > l, is_write, attrs); > > > fuzz_dma_read_cb(addr, *plen, mr); > > > @@ -3114,31 +3118,37 @@ void *address_space_map(AddressSpace *as, > > > void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len, > > > bool is_write, hwaddr access_len) > > > { > > > - if (buffer != bounce.buffer) { > > > - MemoryRegion *mr; > > > - ram_addr_t addr1; > > > + MemoryRegion *mr; > > > + ram_addr_t addr1; > > > + > > > + mr = memory_region_from_host(buffer, &addr1); > > > + if (mr == NULL) { > > > + /* > > > + * Must be a bounce buffer (unless the caller passed a pointer which > > > + * wasn't returned by address_space_map, which is illegal). > > > > Is it possible to still have some kind of sanity check to make sure it's a > > bounce buffer passed in, just in case of a caller bug? Or, the failure can > > be weird.. > > I was contemplating putting a magic number as the first struct member > as a best effort to detect invalid pointers and corruptions, but > wasn't sure it's warranted. Since you ask, I'll make that change. That'll be good, thanks. I was thinking maybe we can also maintain all the mapped buffers just like before, either in a tree, or a sorted array; the array can be even easier and static if the limit applied here will be "maximum number of bounce buffer mapped" rather than "maximum bytes of bounce buffer mapped", but this whole idea may already be over-complicated to worry on leaked buffers? The magic number sounds good enough. > > > > > > + */ > > > + BounceBuffer *bounce = container_of(buffer, BounceBuffer, buffer); > > > > > > - mr = memory_region_from_host(buffer, &addr1); > > > - assert(mr != NULL); > > > if (is_write) { > > > - invalidate_and_set_dirty(mr, addr1, access_len); > > > - } > > > - if (xen_enabled()) { > > > - xen_invalidate_map_cache_entry(buffer); > > > + address_space_write(as, bounce->addr, MEMTXATTRS_UNSPECIFIED, > > > + bounce->buffer, access_len); > > > } > > > - memory_region_unref(mr); > > > + > > > + memory_region_unref(bounce->mr); > > > + qatomic_sub(&bounce_buffer_size, bounce->len); > > > + /* Write bounce_buffer_size before reading map_client_list. */ > > > + smp_mb(); > > > + cpu_notify_map_clients(); > > > + g_free(bounce); > > > return; > > > } > > > + > > > + if (xen_enabled()) { > > > + xen_invalidate_map_cache_entry(buffer); > > > + } > > > if (is_write) { > > > - address_space_write(as, bounce.addr, MEMTXATTRS_UNSPECIFIED, > > > - bounce.buffer, access_len); > > > - } > > > - qemu_vfree(bounce.buffer); > > > - bounce.buffer = NULL; > > > - memory_region_unref(bounce.mr); > > > - /* Clear in_use before reading map_client_list. 
*/ > > > - qatomic_set_mb(&bounce.in_use, false); > > > - cpu_notify_map_clients(); > > > + invalidate_and_set_dirty(mr, addr1, access_len); > > > + } > > > } > > > > > > void *cpu_physical_memory_map(hwaddr addr, > > > diff --git a/softmmu/vl.c b/softmmu/vl.c > > > index b0b96f67fa..dbe52f5ea1 100644 > > > --- a/softmmu/vl.c > > > +++ b/softmmu/vl.c > > > @@ -3469,6 +3469,12 @@ void qemu_init(int argc, char **argv) > > > exit(1); > > > #endif > > > break; > > > + case QEMU_OPTION_max_bounce_buffer_size: > > > + if (qemu_strtosz(optarg, NULL, &max_bounce_buffer_size) < 0) { > > > + error_report("invalid -max-ounce-buffer-size value"); > > > + exit(1); > > > + } > > > + break; > > > > PS: I had a vague memory that we do not recommend adding more qemu cmdline > > options, but I don't know enough on the plan to say anything real. > > I am aware of that, and I'm really not happy with the command line > option myself. Consider the command line flag a straw man I put in to > see whether any reviewers have better ideas :) > > More seriously, I actually did look around to see whether I can add > the parameter to one of the existing option groupings somewhere, but > neither do I have a suitable QOM object that I can attach the > parameter to, nor did I find any global option groups that fits: this > is not really memory configuration, and it's not really CPU > configuration, it's more related to shared device model > infrastructure... If you have a good idea for a home for this, I'm all > ears. No good & simple suggestion here, sorry. We can keep the option there until someone jumps in, then the better alternative could also come along. After all I expect if we can choose a sensible enough default value, this new option shouldn't be used by anyone for real. Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 14+ messages in thread
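The bookkeeping alternative floated above -- tracking every live bounce buffer
rather than trusting the pointer -- could be as simple as a GLib hash set keyed
by the returned payload pointer. A rough sketch under that assumption (locking
simplified; qemu would likely reuse an existing lock):

    #include <glib.h>

    static GHashTable *live_bounce_buffers;  /* payload pointer -> presence */
    static GMutex live_bounce_lock;

    static void bounce_track(void *payload)
    {
        g_mutex_lock(&live_bounce_lock);
        if (!live_bounce_buffers) {
            live_bounce_buffers =
                g_hash_table_new(g_direct_hash, g_direct_equal);
        }
        g_hash_table_add(live_bounce_buffers, payload);
        g_mutex_unlock(&live_bounce_lock);
    }

    /* Returns TRUE iff payload is a live bounce buffer; removes it if so. */
    static gboolean bounce_untrack(void *payload)
    {
        gboolean found;
        g_mutex_lock(&live_bounce_lock);
        found = live_bounce_buffers &&
                g_hash_table_remove(live_bounce_buffers, payload);
        g_mutex_unlock(&live_bounce_lock);
        return found;
    }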
* Re: [PATCH v2 1/4] softmmu: Support concurrent bounce buffers 2023-08-23 20:54 ` Peter Xu @ 2023-08-24 6:58 ` Mattias Nissler 2023-08-24 13:32 ` Stefan Hajnoczi 1 sibling, 0 replies; 14+ messages in thread From: Mattias Nissler @ 2023-08-24 6:58 UTC (permalink / raw) To: Peter Xu Cc: qemu-devel, john.levon, stefanha, Jagannathan Raman, Philippe Mathieu-Daudé, David Hildenbrand, Paolo Bonzini, Elena Ufimtseva On Wed, Aug 23, 2023 at 10:54 PM Peter Xu <peterx@redhat.com> wrote: > > On Wed, Aug 23, 2023 at 10:08:08PM +0200, Mattias Nissler wrote: > > Peter, thanks for taking a look and providing feedback! > > > > On Wed, Aug 23, 2023 at 7:35 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > On Wed, Aug 23, 2023 at 02:29:02AM -0700, Mattias Nissler wrote: > > > > When DMA memory can't be directly accessed, as is the case when > > > > running the device model in a separate process without shareable DMA > > > > file descriptors, bounce buffering is used. > > > > > > > > It is not uncommon for device models to request mapping of several DMA > > > > regions at the same time. Examples include: > > > > * net devices, e.g. when transmitting a packet that is split across > > > > several TX descriptors (observed with igb) > > > > * USB host controllers, when handling a packet with multiple data TRBs > > > > (observed with xhci) > > > > > > > > Previously, qemu only provided a single bounce buffer and would fail DMA > > > > map requests while the buffer was already in use. In turn, this would > > > > cause DMA failures that ultimately manifest as hardware errors from the > > > > guest perspective. > > > > > > > > This change allocates DMA bounce buffers dynamically instead of > > > > supporting only a single buffer. Thus, multiple DMA mappings work > > > > correctly also when RAM can't be mmap()-ed. > > > > > > > > The total bounce buffer allocation size is limited by a new command line > > > > parameter. The default is 4096 bytes to match the previous maximum > > > > buffer size. It is expected that suitable limits will vary quite a bit > > > > in practice depending on device models and workloads. 
> > > > > > > > Signed-off-by: Mattias Nissler <mnissler@rivosinc.com> > > > > --- > > > > include/sysemu/sysemu.h | 2 + > > > > qemu-options.hx | 27 +++++++++++++ > > > > softmmu/globals.c | 1 + > > > > softmmu/physmem.c | 84 +++++++++++++++++++++++------------------ > > > > softmmu/vl.c | 6 +++ > > > > 5 files changed, 83 insertions(+), 37 deletions(-) > > > > > > > > diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h > > > > index 25be2a692e..c5dc93cb53 100644 > > > > --- a/include/sysemu/sysemu.h > > > > +++ b/include/sysemu/sysemu.h > > > > @@ -61,6 +61,8 @@ extern int nb_option_roms; > > > > extern const char *prom_envs[MAX_PROM_ENVS]; > > > > extern unsigned int nb_prom_envs; > > > > > > > > +extern uint64_t max_bounce_buffer_size; > > > > + > > > > /* serial ports */ > > > > > > > > /* Return the Chardev for serial port i, or NULL if none */ > > > > diff --git a/qemu-options.hx b/qemu-options.hx > > > > index 29b98c3d4c..6071794237 100644 > > > > --- a/qemu-options.hx > > > > +++ b/qemu-options.hx > > > > @@ -4959,6 +4959,33 @@ SRST > > > > ERST > > > > #endif > > > > > > > > +DEF("max-bounce-buffer-size", HAS_ARG, > > > > + QEMU_OPTION_max_bounce_buffer_size, > > > > + "-max-bounce-buffer-size size\n" > > > > + " DMA bounce buffer size limit in bytes (default=4096)\n", > > > > + QEMU_ARCH_ALL) > > > > +SRST > > > > +``-max-bounce-buffer-size size`` > > > > + Set the limit in bytes for DMA bounce buffer allocations. > > > > + > > > > + DMA bounce buffers are used when device models request memory-mapped access > > > > + to memory regions that can't be directly mapped by the qemu process, so the > > > > + memory must read or written to a temporary local buffer for the device > > > > + model to work with. This is the case e.g. for I/O memory regions, and when > > > > + running in multi-process mode without shared access to memory. > > > > + > > > > + Whether bounce buffering is necessary depends heavily on the device model > > > > + implementation. Some devices use explicit DMA read and write operations > > > > + which do not require bounce buffers. Some devices, notably storage, will > > > > + retry a failed DMA map request after bounce buffer space becomes available > > > > + again. Most other devices will bail when encountering map request failures, > > > > + which will typically appear to the guest as a hardware error. > > > > + > > > > + Suitable bounce buffer size values depend on the workload and guest > > > > + configuration. A few kilobytes up to a few megabytes are common sizes > > > > + encountered in practice. > > > > > > Does it mean that the default 4K size can still easily fail with some > > > device setup? > > > > Yes. The thing is that the respective device setup is pretty exotic, > > at least the only setup I'm aware of is multi-process with direct RAM > > access via shared file descriptors from the device process disabled > > (which hurts performance, so few people will run this setup). In > > theory, DMA to an I/O region of some sort would also run into the > > issue even in single process mode, but I'm not aware of such a > > situation. Looking at it from a historic perspective, note that the > > single-bounce-buffer restriction has been present since a decade, and > > thus the issue has been present for years (since multi-process is a > > thing), without it hurting anyone enough to get fixed. 
But don't get > > me wrong - I don't want to downplay anything and very much would like > > to see this fixed, but I want to be honest and put things into the > > right perspective. > > > > > > > > IIUC the whole point of limit here is to make sure the allocation is still > > > bounded, while 4K itself is not a hard limit. Making it bigger would be, > > > IMHO, nice if it should work with known configs which used to be broken. > > > > I'd be in favor of bumping the default. Networking should be fine with > > 64KB, but I've observed a default Linux + xhci + usb_storage setup to > > use up to 1MB of DMA buffers, we'd probably need to raise it > > considerably. Would 4MB still be acceptable? That wouldn't allow a > > single nefarious VM to stage a memory denial of service attack, but > > what if you're running many VMs? > > Could wait and see whether there's any more comments from others. > Personally 4MB looks fine, as that's not a constant consumption per-vm, but > a worst case limit (probably only when there is an attacker). OK, I'll take a note to change to 4MB for the next version of this series then, barring any objections from others. Note to self: Raising the limit will likely also accommodate the ide test's DMA buffer consumption, which means we'd be losing test coverage for the map client callback code path. I probably need to adjust the test or make a new one to make up for that. > > Multiple VM does can indeed make it worse, but it means the attacker will > need to attack all the VMs all success, and the sum up will be 4MB / > mem_size_average_vm in percentage, irrelevant of numbers; for ~4GB average > VM size it's 0.1% memory, and even less if VM is larger - maybe not > something extremely scary even if happened. > > > > > > > > > > +ERST > > > > + > > > > DEFHEADING() > > > > > > > > DEFHEADING(Generic object creation:) > > > > diff --git a/softmmu/globals.c b/softmmu/globals.c > > > > index e83b5428d1..d3cc010717 100644 > > > > --- a/softmmu/globals.c > > > > +++ b/softmmu/globals.c > > > > @@ -54,6 +54,7 @@ const char *prom_envs[MAX_PROM_ENVS]; > > > > uint8_t *boot_splash_filedata; > > > > int only_migratable; /* turn it off unless user states otherwise */ > > > > int icount_align_option; > > > > +uint64_t max_bounce_buffer_size = 4096; > > > > > > > > /* The bytes in qemu_uuid are in the order specified by RFC4122, _not_ in the > > > > * little-endian "wire format" described in the SMBIOS 2.6 specification. 
> > > > diff --git a/softmmu/physmem.c b/softmmu/physmem.c > > > > index 3df73542e1..9f0fec0c8e 100644 > > > > --- a/softmmu/physmem.c > > > > +++ b/softmmu/physmem.c > > > > @@ -50,6 +50,7 @@ > > > > #include "sysemu/dma.h" > > > > #include "sysemu/hostmem.h" > > > > #include "sysemu/hw_accel.h" > > > > +#include "sysemu/sysemu.h" > > > > #include "sysemu/xen-mapcache.h" > > > > #include "trace/trace-root.h" > > > > > > > > @@ -2904,13 +2905,12 @@ void cpu_flush_icache_range(hwaddr start, hwaddr len) > > > > > > > > typedef struct { > > > > MemoryRegion *mr; > > > > - void *buffer; > > > > hwaddr addr; > > > > - hwaddr len; > > > > - bool in_use; > > > > + size_t len; > > > > + uint8_t buffer[]; > > > > } BounceBuffer; > > > > > > > > -static BounceBuffer bounce; > > > > +static size_t bounce_buffer_size; > > > > > > > > typedef struct MapClient { > > > > QEMUBH *bh; > > > > @@ -2945,9 +2945,9 @@ void cpu_register_map_client(QEMUBH *bh) > > > > qemu_mutex_lock(&map_client_list_lock); > > > > client->bh = bh; > > > > QLIST_INSERT_HEAD(&map_client_list, client, link); > > > > - /* Write map_client_list before reading in_use. */ > > > > + /* Write map_client_list before reading bounce_buffer_size. */ > > > > smp_mb(); > > > > - if (!qatomic_read(&bounce.in_use)) { > > > > + if (qatomic_read(&bounce_buffer_size) < max_bounce_buffer_size) { > > > > cpu_notify_map_clients_locked(); > > > > } > > > > qemu_mutex_unlock(&map_client_list_lock); > > > > @@ -3076,31 +3076,35 @@ void *address_space_map(AddressSpace *as, > > > > RCU_READ_LOCK_GUARD(); > > > > fv = address_space_to_flatview(as); > > > > mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs); > > > > + memory_region_ref(mr); > > > > > > > > if (!memory_access_is_direct(mr, is_write)) { > > > > - if (qatomic_xchg(&bounce.in_use, true)) { > > > > + size_t size = qatomic_add_fetch(&bounce_buffer_size, l); > > > > + if (size > max_bounce_buffer_size) { > > > > + size_t excess = size - max_bounce_buffer_size; > > > > + l -= excess; > > > > + qatomic_sub(&bounce_buffer_size, excess); > > > > + } > > > > + > > > > + if (l == 0) { > > > > *plen = 0; > > > > return NULL; > > > > } > > > > - /* Avoid unbounded allocations */ > > > > - l = MIN(l, TARGET_PAGE_SIZE); > > > > - bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, l); > > > > - bounce.addr = addr; > > > > - bounce.len = l; > > > > > > > > - memory_region_ref(mr); > > > > - bounce.mr = mr; > > > > + BounceBuffer *bounce = g_malloc(l + sizeof(BounceBuffer)); > > > > > > Maybe g_malloc0() would be better? > > > > Good point, will change. > > > > > > > > I just checked that we had target page aligned allocations since the 1st > > > day (commit 6d16c2f88f2a). I didn't find any clue showing why it was done > > > like that, but I do worry whether any existing caller may be > > > implicitly relying on an address that is target page aligned. But maybe > > > not a major issue; I didn't see anything relying on that yet. > > > > I did go through the same exercise when noticing the page alignment > > and arrived at the same conclusion as you. That makes it two people > > thinking it's OK, so I feel like we should take the risk here, in > > particular given that we know this code path is already broken as is. > > It'll be more important to see if any one person thinks it's not okay in > this case, though. :) > > If we decide to take the risk, we should merge a patch like this at as > early a stage of the release as possible. Happy to do so.
However, let me know if you'd rather retain page alignment. It'll complicate things though, as it'll likely lead to storing the metadata separately from the buffer allocation, probably using a hash table / tree / array (as you suggest below) to keep track of all allocated buffers and locate their metadata by address. > > > > > > > > > > + bounce->mr = mr; > > > > + bounce->addr = addr; > > > > + bounce->len = l; > > > > + > > > > if (!is_write) { > > > > flatview_read(fv, addr, MEMTXATTRS_UNSPECIFIED, > > > > - bounce.buffer, l); > > > > + bounce->buffer, l); > > > > } > > > > > > > > *plen = l; > > > > - return bounce.buffer; > > > > + return bounce->buffer; > > > > } > > > > > > > > - > > > > - memory_region_ref(mr); > > > > *plen = flatview_extend_translation(fv, addr, len, mr, xlat, > > > > l, is_write, attrs); > > > > fuzz_dma_read_cb(addr, *plen, mr); > > > > @@ -3114,31 +3118,37 @@ void *address_space_map(AddressSpace *as, > > > > void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len, > > > > bool is_write, hwaddr access_len) > > > > { > > > > - if (buffer != bounce.buffer) { > > > > - MemoryRegion *mr; > > > > - ram_addr_t addr1; > > > > + MemoryRegion *mr; > > > > + ram_addr_t addr1; > > > > + > > > > + mr = memory_region_from_host(buffer, &addr1); > > > > + if (mr == NULL) { > > > > + /* > > > > + * Must be a bounce buffer (unless the caller passed a pointer which > > > > + * wasn't returned by address_space_map, which is illegal). > > > > > > Is it possible to still have some kind of sanity check to make sure it's a > > > bounce buffer passed in, just in case of a caller bug? Otherwise, the failure > > > mode can be weird.. > > > > I was contemplating putting a magic number as the first struct member > > as a best effort to detect invalid pointers and corruptions, but > > wasn't sure it was warranted. Since you ask, I'll make that change. > > That'll be good, thanks. > > I was thinking maybe we can also maintain all the mapped buffers just like > before, either in a tree, or a sorted array; the array can be even easier > and static if the limit applied here is "maximum number of bounce > buffers mapped" rather than "maximum bytes of bounce buffers mapped", but > this whole idea may already be over-complicated just for catching leaked buffers? > The magic number sounds good enough. > > > > > > > > > > + */ > > > > + BounceBuffer *bounce = container_of(buffer, BounceBuffer, buffer); > > > > > > > > - mr = memory_region_from_host(buffer, &addr1); > > > > - assert(mr != NULL); > > > > if (is_write) { > > > > - invalidate_and_set_dirty(mr, addr1, access_len); > > > > - } > > > > - if (xen_enabled()) { > > > > - xen_invalidate_map_cache_entry(buffer); > > > > + address_space_write(as, bounce->addr, MEMTXATTRS_UNSPECIFIED, > > > > + bounce->buffer, access_len); > > > > } > > > > - memory_region_unref(mr); > > > > + > > > > + memory_region_unref(bounce->mr); > > > > + qatomic_sub(&bounce_buffer_size, bounce->len); > > > > + /* Write bounce_buffer_size before reading map_client_list.
*/ > > > > + smp_mb(); > > > > + cpu_notify_map_clients(); > > > > + g_free(bounce); > > > > return; > > > > } > > > > + > > > > + if (xen_enabled()) { > > > > + xen_invalidate_map_cache_entry(buffer); > > > > + } > > > > if (is_write) { > > > > - address_space_write(as, bounce.addr, MEMTXATTRS_UNSPECIFIED, > > > > - bounce.buffer, access_len); > > > > - } > > > > - qemu_vfree(bounce.buffer); > > > > - bounce.buffer = NULL; > > > > - memory_region_unref(bounce.mr); > > > > - /* Clear in_use before reading map_client_list. */ > > > > - qatomic_set_mb(&bounce.in_use, false); > > > > - cpu_notify_map_clients(); > > > > + invalidate_and_set_dirty(mr, addr1, access_len); > > > > + } > > > > } > > > > > > > > void *cpu_physical_memory_map(hwaddr addr, > > > > diff --git a/softmmu/vl.c b/softmmu/vl.c > > > > index b0b96f67fa..dbe52f5ea1 100644 > > > > --- a/softmmu/vl.c > > > > +++ b/softmmu/vl.c > > > > @@ -3469,6 +3469,12 @@ void qemu_init(int argc, char **argv) > > > > exit(1); > > > > #endif > > > > break; > > > > + case QEMU_OPTION_max_bounce_buffer_size: > > > > + if (qemu_strtosz(optarg, NULL, &max_bounce_buffer_size) < 0) { > > > > + error_report("invalid -max-bounce-buffer-size value"); > > > > + exit(1); > > > > + } > > > > + break; > > > > > > PS: I had a vague memory that we do not recommend adding more qemu cmdline > > > options, but I don't know enough about the plan to say anything definite. > > > > I am aware of that, and I'm really not happy with the command line > > option myself. Consider the command line flag a straw man I put in to > > see whether any reviewers have better ideas :) > > > > More seriously, I actually did look around to see whether I can add > > the parameter to one of the existing option groupings somewhere, but > > neither do I have a suitable QOM object that I can attach the > > parameter to, nor did I find any global option groups that fit: this > > is not really memory configuration, and it's not really CPU > > configuration, it's more related to shared device model > > infrastructure... If you have a good idea for a home for this, I'm all > > ears. > > No good & simple suggestion here, sorry. We can keep the option there > until someone jumps in; then a better alternative could also come along. > > After all, I expect that if we choose a sensible enough default value, this > new option won't be used by anyone for real. > > Thanks, > > -- > Peter Xu > ^ permalink raw reply [flat|nested] 14+ messages in thread
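To make the direction the review converged on concrete, a minimal sketch of what the allocation path could look like once the g_malloc0() and magic-number suggestions are folded in is shown below. This is an illustration of the discussed approach, not a posted patch; the BOUNCE_BUFFER_MAGIC value is invented here:

    /* Sketch only: BounceBuffer after folding in the review feedback.
     * The magic value is made up; any sufficiently unlikely constant works. */
    #define BOUNCE_BUFFER_MAGIC 0xb4017ceb4ffe12edULL

    typedef struct {
        uint64_t magic;     /* best-effort guard against stray pointers */
        MemoryRegion *mr;
        hwaddr addr;
        size_t len;
        uint8_t buffer[];
    } BounceBuffer;

    /* map side (address_space_map): zeroed allocation, as suggested */
    BounceBuffer *bounce = g_malloc0(sizeof(BounceBuffer) + l);
    bounce->magic = BOUNCE_BUFFER_MAGIC;

    /* unmap side (address_space_unmap): recover and sanity-check metadata */
    BounceBuffer *bounce = container_of(buffer, BounceBuffer, buffer);
    assert(bounce->magic == BOUNCE_BUFFER_MAGIC);

The magic number doesn't make the container_of() recovery safe against arbitrary caller bugs, but it turns most invalid-pointer cases into an immediate assertion failure instead of silent corruption.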
* Re: [PATCH v2 1/4] softmmu: Support concurrent bounce buffers 2023-08-23 20:54 ` Peter Xu 2023-08-24 6:58 ` Mattias Nissler @ 2023-08-24 13:32 ` Stefan Hajnoczi 2023-09-01 13:41 ` Markus Armbruster 1 sibling, 1 reply; 14+ messages in thread From: Stefan Hajnoczi @ 2023-08-24 13:32 UTC (permalink / raw) To: Peter Xu Cc: Mattias Nissler, qemu-devel, john.levon, Jagannathan Raman, Philippe Mathieu-Daudé, David Hildenbrand, Paolo Bonzini, Elena Ufimtseva, armbru On Wed, Aug 23, 2023 at 04:54:06PM -0400, Peter Xu wrote: > On Wed, Aug 23, 2023 at 10:08:08PM +0200, Mattias Nissler wrote: > > On Wed, Aug 23, 2023 at 7:35 PM Peter Xu <peterx@redhat.com> wrote: > > > On Wed, Aug 23, 2023 at 02:29:02AM -0700, Mattias Nissler wrote: > > > > diff --git a/softmmu/vl.c b/softmmu/vl.c > > > > index b0b96f67fa..dbe52f5ea1 100644 > > > > --- a/softmmu/vl.c > > > > +++ b/softmmu/vl.c > > > > @@ -3469,6 +3469,12 @@ void qemu_init(int argc, char **argv) > > > > exit(1); > > > > #endif > > > > break; > > > > + case QEMU_OPTION_max_bounce_buffer_size: > > > > + if (qemu_strtosz(optarg, NULL, &max_bounce_buffer_size) < 0) { > > > > + error_report("invalid -max-bounce-buffer-size value"); > > > > + exit(1); > > > > + } > > > > + break; > > > > > > PS: I had a vague memory that we do not recommend adding more qemu cmdline > > > options, but I don't know enough about the plan to say anything definite. > > > > I am aware of that, and I'm really not happy with the command line > > option myself. Consider the command line flag a straw man I put in to > > see whether any reviewers have better ideas :) > > > > More seriously, I actually did look around to see whether I can add > > the parameter to one of the existing option groupings somewhere, but > > neither do I have a suitable QOM object that I can attach the > > parameter to, nor did I find any global option groups that fit: this > > is not really memory configuration, and it's not really CPU > > configuration, it's more related to shared device model > > infrastructure... If you have a good idea for a home for this, I'm all > > ears. > > No good & simple suggestion here, sorry. We can keep the option there > until someone jumps in; then a better alternative could also come along. > > After all, I expect that if we choose a sensible enough default value, this > new option won't be used by anyone for real. QEMU commits to stability in its external interfaces. Once the command-line option is added, it needs to be supported in the future or go through the deprecation process. I think we should agree on the command-line option now. Two ways to avoid the issue: 1. Drop the command-line option until someone needs it. 2. Make it an experimental option (with an "x-" prefix). The closest to a proper solution that I found was adding it as a -machine property. What bothers me is that if QEMU supports multi-machine emulation in a single process some day, then the property doesn't make sense since it's global rather than specific to a machine. CCing Markus Armbruster for ideas. Stefan ^ permalink raw reply [flat|nested] 14+ messages in thread
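For context, Stefan's option 2 combined with the -machine idea might be wired up roughly as follows. The property name and its placement in machine_class_init() are assumptions made here for illustration, not code from this series:

    /* Hypothetical sketch of an experimental machine property backing the
     * global limit; names are invented, not from a posted patch. */
    static void machine_get_max_bounce_buffer_size(Object *obj, Visitor *v,
                                                   const char *name,
                                                   void *opaque, Error **errp)
    {
        uint64_t value = max_bounce_buffer_size;
        visit_type_size(v, name, &value, errp);
    }

    static void machine_set_max_bounce_buffer_size(Object *obj, Visitor *v,
                                                   const char *name,
                                                   void *opaque, Error **errp)
    {
        uint64_t value;
        if (visit_type_size(v, name, &value, errp)) {
            max_bounce_buffer_size = value;
        }
    }

    /* in machine_class_init(): */
    object_class_property_add(oc, "x-max-bounce-buffer-size", "size",
                              machine_get_max_bounce_buffer_size,
                              machine_set_max_bounce_buffer_size,
                              NULL, NULL);

The "x-" prefix would exempt the knob from the usual interface-stability guarantees while experiments with suitable limits are ongoing.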
* Re: [PATCH v2 1/4] softmmu: Support concurrent bounce buffers 2023-08-24 13:32 ` Stefan Hajnoczi @ 2023-09-01 13:41 ` Markus Armbruster 2023-09-05 7:38 ` Mattias Nissler 0 siblings, 1 reply; 14+ messages in thread From: Markus Armbruster @ 2023-09-01 13:41 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Peter Xu, Mattias Nissler, qemu-devel, john.levon, Jagannathan Raman, Philippe Mathieu-Daudé, David Hildenbrand, Paolo Bonzini, Elena Ufimtseva, armbru Stefan Hajnoczi <stefanha@redhat.com> writes: > On Wed, Aug 23, 2023 at 04:54:06PM -0400, Peter Xu wrote: >> On Wed, Aug 23, 2023 at 10:08:08PM +0200, Mattias Nissler wrote: >> > On Wed, Aug 23, 2023 at 7:35 PM Peter Xu <peterx@redhat.com> wrote: >> > > On Wed, Aug 23, 2023 at 02:29:02AM -0700, Mattias Nissler wrote: >> > > > diff --git a/softmmu/vl.c b/softmmu/vl.c >> > > > index b0b96f67fa..dbe52f5ea1 100644 >> > > > --- a/softmmu/vl.c >> > > > +++ b/softmmu/vl.c >> > > > @@ -3469,6 +3469,12 @@ void qemu_init(int argc, char **argv) >> > > > exit(1); >> > > > #endif >> > > > break; >> > > > + case QEMU_OPTION_max_bounce_buffer_size: >> > > > + if (qemu_strtosz(optarg, NULL, &max_bounce_buffer_size) < 0) { >> > > > + error_report("invalid -max-bounce-buffer-size value"); >> > > > + exit(1); >> > > > + } >> > > > + break; >> > > >> > > PS: I had a vague memory that we do not recommend adding more qemu cmdline >> > > options, but I don't know enough about the plan to say anything definite. >> > >> > I am aware of that, and I'm really not happy with the command line >> > option myself. Consider the command line flag a straw man I put in to >> > see whether any reviewers have better ideas :) >> > >> > More seriously, I actually did look around to see whether I can add >> > the parameter to one of the existing option groupings somewhere, but >> > neither do I have a suitable QOM object that I can attach the >> > parameter to, nor did I find any global option groups that fit: this >> > is not really memory configuration, and it's not really CPU >> > configuration, it's more related to shared device model >> > infrastructure... If you have a good idea for a home for this, I'm all >> > ears. >> >> No good & simple suggestion here, sorry. We can keep the option there >> until someone jumps in; then a better alternative could also come along. >> >> After all, I expect that if we choose a sensible enough default value, this >> new option won't be used by anyone for real. > > QEMU commits to stability in its external interfaces. Once the > command-line option is added, it needs to be supported in the future or > go through the deprecation process. I think we should agree on the > command-line option now. > > Two ways to avoid the issue: > 1. Drop the command-line option until someone needs it. Avoiding unneeded configuration knobs is always good. > 2. Make it an experimental option (with an "x-" prefix). Fine if actual experiments are planned. Also fine if it's a development or debugging aid. > The closest to a proper solution that I found was adding it as a > -machine property. What bothers me is that if QEMU supports > multi-machine emulation in a single process some day, then the property > doesn't make sense since it's global rather than specific to a machine. > > CCing Markus Armbruster for ideas. I'm afraid I'm lacking context. Glancing at the patch... ``-max-bounce-buffer-size size`` Set the limit in bytes for DMA bounce buffer allocations.
DMA bounce buffers are used when device models request memory-mapped access to memory regions that can't be directly mapped by the qemu process, so the memory must be read or written to a temporary local buffer for the device model to work with. This is the case e.g. for I/O memory regions, and when running in multi-process mode without shared access to memory. Whether bounce buffering is necessary depends heavily on the device model implementation. Some devices use explicit DMA read and write operations which do not require bounce buffers. Some devices, notably storage, will retry a failed DMA map request after bounce buffer space becomes available again. Most other devices will bail when encountering map request failures, which will typically appear to the guest as a hardware error. Suitable bounce buffer size values depend on the workload and guest configuration. A few kilobytes up to a few megabytes are common sizes encountered in practice. Sounds quite device-specific. Why isn't this configured per device? ^ permalink raw reply [flat|nested] 14+ messages in thread
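The retry behavior the documentation text alludes to follows the map-client pattern: a device registers a callback and re-attempts the mapping once bounce buffer space is freed. A simplified sketch of that pattern (the state struct and names are invented, not literal QEMU device code):

    typedef struct {
        AddressSpace *as;
        QEMUBH *bh;            /* bottom half that re-invokes this callback */
        dma_addr_t addr;
        dma_addr_t remaining;
        void *buf;
    } MyDMAState;              /* invented for this sketch */

    static void my_dma_retry(void *opaque)
    {
        MyDMAState *s = opaque;
        dma_addr_t len = s->remaining;

        s->buf = dma_memory_map(s->as, s->addr, &len,
                                DMA_DIRECTION_FROM_DEVICE,
                                MEMTXATTRS_UNSPECIFIED);
        if (!s->buf) {
            /* No bounce buffer space left: ask to be called back via the
             * BH when address_space_unmap() frees some, then bail out. */
            cpu_register_map_client(s->bh);
            return;
        }
        /* ... transfer up to len bytes, then dma_memory_unmap() ... */
    }

Devices that follow this pattern degrade gracefully under the limit; devices that treat a NULL mapping as a fatal error are the ones that surface guest-visible hardware errors.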
* Re: [PATCH v2 1/4] softmmu: Support concurrent bounce buffers 2023-09-01 13:41 ` Markus Armbruster @ 2023-09-05 7:38 ` Mattias Nissler 2023-09-05 13:45 ` Peter Xu 0 siblings, 1 reply; 14+ messages in thread From: Mattias Nissler @ 2023-09-05 7:38 UTC (permalink / raw) To: Markus Armbruster Cc: Stefan Hajnoczi, Peter Xu, qemu-devel, john.levon, Jagannathan Raman, Philippe Mathieu-Daudé, David Hildenbrand, Paolo Bonzini, Elena Ufimtseva On Fri, Sep 1, 2023 at 3:41 PM Markus Armbruster <armbru@redhat.com> wrote: > > Stefan Hajnoczi <stefanha@redhat.com> writes: > > > On Wed, Aug 23, 2023 at 04:54:06PM -0400, Peter Xu wrote: > >> On Wed, Aug 23, 2023 at 10:08:08PM +0200, Mattias Nissler wrote: > >> > On Wed, Aug 23, 2023 at 7:35 PM Peter Xu <peterx@redhat.com> wrote: > >> > > On Wed, Aug 23, 2023 at 02:29:02AM -0700, Mattias Nissler wrote: > >> > > > diff --git a/softmmu/vl.c b/softmmu/vl.c > >> > > > index b0b96f67fa..dbe52f5ea1 100644 > >> > > > --- a/softmmu/vl.c > >> > > > +++ b/softmmu/vl.c > >> > > > @@ -3469,6 +3469,12 @@ void qemu_init(int argc, char **argv) > >> > > > exit(1); > >> > > > #endif > >> > > > break; > >> > > > + case QEMU_OPTION_max_bounce_buffer_size: > >> > > > + if (qemu_strtosz(optarg, NULL, &max_bounce_buffer_size) < 0) { > >> > > > + error_report("invalid -max-bounce-buffer-size value"); > >> > > > + exit(1); > >> > > > + } > >> > > > + break; > >> > > > >> > > PS: I had a vague memory that we do not recommend adding more qemu cmdline > >> > > options, but I don't know enough about the plan to say anything definite. > >> > > >> > I am aware of that, and I'm really not happy with the command line > >> > option myself. Consider the command line flag a straw man I put in to > >> > see whether any reviewers have better ideas :) > >> > > >> > More seriously, I actually did look around to see whether I can add > >> > the parameter to one of the existing option groupings somewhere, but > >> > neither do I have a suitable QOM object that I can attach the > >> > parameter to, nor did I find any global option groups that fit: this > >> > is not really memory configuration, and it's not really CPU > >> > configuration, it's more related to shared device model > >> > infrastructure... If you have a good idea for a home for this, I'm all > >> > ears. > >> > >> No good & simple suggestion here, sorry. We can keep the option there > >> until someone jumps in; then a better alternative could also come along. > >> > >> After all, I expect that if we choose a sensible enough default value, this > >> new option won't be used by anyone for real. > > > > QEMU commits to stability in its external interfaces. Once the > > command-line option is added, it needs to be supported in the future or > > go through the deprecation process. I think we should agree on the > > command-line option now. > > > > Two ways to avoid the issue: > > 1. Drop the command-line option until someone needs it. > > Avoiding unneeded configuration knobs is always good. > > > 2. Make it an experimental option (with an "x-" prefix). > > Fine if actual experiments are planned. > > Also fine if it's a development or debugging aid. To a certain extent it is: I've been playing with different device models and bumping the parameter until their DMA requests stopped failing. > > > The closest to a proper solution that I found was adding it as a > > -machine property.
What bothers me is that if QEMU supports > > multi-machine emulation in a single process some day, then the property > > doesn't make sense since it's global rather than specific to a machine. > > > > CCing Markus Armbruster for ideas. > > I'm afraid I'm lacking context. Glancing at the patch... > > ``-max-bounce-buffer-size size`` > Set the limit in bytes for DMA bounce buffer allocations. > > DMA bounce buffers are used when device models request memory-mapped access > to memory regions that can't be directly mapped by the qemu process, so the > memory must be read or written to a temporary local buffer for the device > model to work with. This is the case e.g. for I/O memory regions, and when > running in multi-process mode without shared access to memory. > > Whether bounce buffering is necessary depends heavily on the device model > implementation. Some devices use explicit DMA read and write operations > which do not require bounce buffers. Some devices, notably storage, will > retry a failed DMA map request after bounce buffer space becomes available > again. Most other devices will bail when encountering map request failures, > which will typically appear to the guest as a hardware error. > > Suitable bounce buffer size values depend on the workload and guest > configuration. A few kilobytes up to a few megabytes are common sizes > encountered in practice. > > Sounds quite device-specific. Why isn't this configured per device? It would be nice to use a property on the device that originates the DMA operation to configure this. However, I don't see how to do this in a reasonable way without bigger changes: A typical call path is pci_dma_map -> dma_memory_map -> address_space_map. While pci_dma_map has a PCIDevice*, address_space_map only receives the AddressSpace*. So, we'd probably have to add a new QObject parameter to address_space_map that indicates the originator and pass that through the call chain? Or is there a better alternative to supply context information to address_space_map? Let me know if any of these approaches sound appropriate and I'll be happy to explore them further. ^ permalink raw reply [flat|nested] 14+ messages in thread
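For readers tracing the call path described above, the approximate signatures (paraphrased from the headers of that era) show where the device context gets lost:

    /* The call chain, with approximate signatures: the device pointer is
     * available at the top but has been dropped by the bottom. */
    void *pci_dma_map(PCIDevice *dev, dma_addr_t addr,
                      dma_addr_t *plen, DMADirection dir);   /* has PCIDevice */
    void *dma_memory_map(AddressSpace *as, dma_addr_t addr,
                         dma_addr_t *len, DMADirection dir,
                         MemTxAttrs attrs);                   /* only the AS */
    void *address_space_map(AddressSpace *as, hwaddr addr, hwaddr *plen,
                            bool is_write, MemTxAttrs attrs); /* only the AS */

Only the AddressSpace pointer survives all the way down, which is exactly the observation that makes Peter's reply below work.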
* Re: [PATCH v2 1/4] softmmu: Support concurrent bounce buffers 2023-09-05 7:38 ` Mattias Nissler @ 2023-09-05 13:45 ` Peter Xu 2023-09-07 12:37 ` Mattias Nissler 0 siblings, 1 reply; 14+ messages in thread From: Peter Xu @ 2023-09-05 13:45 UTC (permalink / raw) To: Mattias Nissler Cc: Markus Armbruster, Stefan Hajnoczi, qemu-devel, john.levon, Jagannathan Raman, Philippe Mathieu-Daudé, David Hildenbrand, Paolo Bonzini, Elena Ufimtseva On Tue, Sep 05, 2023 at 09:38:39AM +0200, Mattias Nissler wrote: > It would be nice to use a property on the device that originates the > DMA operation to configure this. However, I don't see how to do this > in a reasonable way without bigger changes: A typical call path is > pci_dma_map -> dma_memory_map -> address_space_map. While pci_dma_map > has a PCIDevice*, address_space_map only receives the AddressSpace*. > So, we'd probably have to add a new QObject parameter to > address_space_map that indicates the originator and pass that through > the call chain? Or is there a better alternative to supply context > information to address_space_map? Let me know if any of these approaches > sound appropriate and I'll be happy to explore them further. Should be possible to do. The pci address space is not shared but per-device by default (even if there is no vIOMMU intervention). See do_pci_register_device(): address_space_init(&pci_dev->bus_master_as, &pci_dev->bus_master_container_region, pci_dev->name); Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 14+ messages in thread
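One way to read Peter's hint: since every PCI device already owns its bus master AddressSpace, a per-device limit could live on the AddressSpace itself, where address_space_map() can consult it without any new parameter. A speculative sketch, with field names invented for illustration:

    /* Speculative: a new field on AddressSpace (name invented) */
    struct AddressSpace {
        /* ... existing fields ... */
        uint32_t max_bounce_buffer_size;
    };

    /* in do_pci_register_device(), after address_space_init(): */
    pci_dev->bus_master_as.max_bounce_buffer_size =
        pci_dev->max_bounce_buffer_size;   /* fed from a qdev property */

This keeps the map/unmap API unchanged while still letting each device carry its own limit, at the cost of growing a core memory structure.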
* Re: [PATCH v2 1/4] softmmu: Support concurrent bounce buffers 2023-09-05 13:45 ` Peter Xu @ 2023-09-07 12:37 ` Mattias Nissler 0 siblings, 0 replies; 14+ messages in thread From: Mattias Nissler @ 2023-09-07 12:37 UTC (permalink / raw) To: Peter Xu Cc: Markus Armbruster, Stefan Hajnoczi, qemu-devel, john.levon, Jagannathan Raman, Philippe Mathieu-Daudé, David Hildenbrand, Paolo Bonzini, Elena Ufimtseva On Tue, Sep 5, 2023 at 3:45 PM Peter Xu <peterx@redhat.com> wrote: > > On Tue, Sep 05, 2023 at 09:38:39AM +0200, Mattias Nissler wrote: > > It would be nice to use a property on the device that originates the > > DMA operation to configure this. However, I don't see how to do this > > in a reasonable way without bigger changes: A typical call path is > > pci_dma_map -> dma_memory_map -> address_space_map. While pci_dma_map > > has a PCIDevice*, address_space_map only receives the AddressSpace*. > > So, we'd probably have to add a new QObject parameter to > > address_space_map that indicates the originator and pass that through > > the call chain? Or is there a better alternative to supply context > > information to address_space_map? Let me know if any of these approaches > > sound appropriate and I'll be happy to explore them further. > > Should be possible to do. The pci address space is not shared but > per-device by default (even if there is no vIOMMU intervention). See > do_pci_register_device(): > > address_space_init(&pci_dev->bus_master_as, > &pci_dev->bus_master_container_region, pci_dev->name); Ah, thanks for that hint! This works, and it probably even makes more sense to treat bounce buffering as a concept tied to AddressSpace rather than a global thing. I'll send an updated series shortly, with the configuration parameter attached to the PCI device, so it can be specified as a -device option on the command line. In that light, I decided to keep the default at 4096 bytes though, since we now have the ability for each device model to choose its default independently. ^ permalink raw reply [flat|nested] 14+ messages in thread
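Attaching the parameter to the PCI device would presumably follow the usual qdev property pattern. A hypothetical sketch; the property name and default are guesses at this point in the thread, not from a posted patch:

    /* Hypothetical per-device property declaration; the "x-" prefix follows
     * Stefan's suggestion for experimental options. */
    DEFINE_PROP_SIZE32("x-max-bounce-buffer-size", PCIDevice,
                       max_bounce_buffer_size, 4096),

A user would then be able to set the limit per device on the command line, along the lines of -device foo,x-max-bounce-buffer-size=1M, which answers Markus's per-device configuration question without any new global option.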
* [PATCH v2 2/4] Update subprojects/libvfio-user 2023-08-23 9:29 [PATCH v2 0/4] Support message-based DMA in vfio-user server Mattias Nissler 2023-08-23 9:29 ` [PATCH v2 1/4] softmmu: Support concurrent bounce buffers Mattias Nissler @ 2023-08-23 9:29 ` Mattias Nissler 2023-08-23 9:29 ` [PATCH v2 3/4] vfio-user: Message-based DMA support Mattias Nissler 2023-08-23 9:29 ` [PATCH v2 4/4] vfio-user: Fix config space access byte order Mattias Nissler 3 siblings, 0 replies; 14+ messages in thread From: Mattias Nissler @ 2023-08-23 9:29 UTC (permalink / raw) To: qemu-devel Cc: john.levon, stefanha, Jagannathan Raman, Philippe Mathieu-Daudé, Peter Xu, David Hildenbrand, Paolo Bonzini, Elena Ufimtseva, Mattias Nissler Brings in assorted bug fixes. In particular, "Fix address calculation for message-based DMA" corrects a bug in DMA address calculation; that fix is necessary to get DMA across VFIO-user messages working. Signed-off-by: Mattias Nissler <mnissler@rivosinc.com> --- subprojects/libvfio-user.wrap | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/subprojects/libvfio-user.wrap b/subprojects/libvfio-user.wrap index 416955ca45..47aad1ae18 100644 --- a/subprojects/libvfio-user.wrap +++ b/subprojects/libvfio-user.wrap @@ -1,4 +1,4 @@ [wrap-git] url = https://gitlab.com/qemu-project/libvfio-user.git -revision = 0b28d205572c80b568a1003db2c8f37ca333e4d7 +revision = cfb7d908dca025bdea6709801c5790863e902ef8 depth = 1 -- 2.34.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v2 3/4] vfio-user: Message-based DMA support 2023-08-23 9:29 [PATCH v2 0/4] Support message-based DMA in vfio-user server Mattias Nissler 2023-08-23 9:29 ` [PATCH v2 1/4] softmmu: Support concurrent bounce buffers Mattias Nissler 2023-08-23 9:29 ` [PATCH v2 2/4] Update subprojects/libvfio-user Mattias Nissler @ 2023-08-23 9:29 ` Mattias Nissler 2023-08-23 9:29 ` [PATCH v2 4/4] vfio-user: Fix config space access byte order Mattias Nissler 3 siblings, 0 replies; 14+ messages in thread From: Mattias Nissler @ 2023-08-23 9:29 UTC (permalink / raw) To: qemu-devel Cc: john.levon, stefanha, Jagannathan Raman, Philippe Mathieu-Daudé, Peter Xu, David Hildenbrand, Paolo Bonzini, Elena Ufimtseva, Mattias Nissler Wire up support for DMA for the case where the vfio-user client does not provide mmap()-able file descriptors, but DMA requests must be performed via the VFIO-user protocol. This installs an indirect memory region, which already works for pci_dma_{read,write}, and pci_dma_map works thanks to the existing DMA bounce buffering support. Note that while simple scenarios work with this patch, there's a known race condition in libvfio-user that will mess up the communication channel. See https://github.com/nutanix/libvfio-user/issues/279 for details as well as a proposed fix. Signed-off-by: Mattias Nissler <mnissler@rivosinc.com> --- hw/remote/trace-events | 2 + hw/remote/vfio-user-obj.c | 84 +++++++++++++++++++++++++++++++++++---- 2 files changed, 79 insertions(+), 7 deletions(-) diff --git a/hw/remote/trace-events b/hw/remote/trace-events index 0d1b7d56a5..358a68fb34 100644 --- a/hw/remote/trace-events +++ b/hw/remote/trace-events @@ -9,6 +9,8 @@ vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x -> 0x%x" vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x <- 0x%x" vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", %zu bytes" vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64"" +vfu_dma_read(uint64_t gpa, size_t len) "vfu: DMA read 0x%"PRIx64", %zu bytes" +vfu_dma_write(uint64_t gpa, size_t len) "vfu: DMA write 0x%"PRIx64", %zu bytes" vfu_bar_register(int i, uint64_t addr, uint64_t size) "vfu: BAR %d: addr 0x%"PRIx64" size 0x%"PRIx64"" vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR address 0x%"PRIx64"" vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR address 0x%"PRIx64"" diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c index 8b10c32a3c..cee5e615a9 100644 --- a/hw/remote/vfio-user-obj.c +++ b/hw/remote/vfio-user-obj.c @@ -300,6 +300,63 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf, return count; } +static MemTxResult vfu_dma_read(void *opaque, hwaddr addr, uint64_t *val, + unsigned size, MemTxAttrs attrs) +{ + MemoryRegion *region = opaque; + VfuObject *o = VFU_OBJECT(region->owner); + uint8_t buf[sizeof(uint64_t)]; + + trace_vfu_dma_read(region->addr + addr, size); + + dma_sg_t *sg = alloca(dma_sg_size()); + vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr); + if (vfu_addr_to_sgl(o->vfu_ctx, vfu_addr, size, sg, 1, PROT_READ) < 0 || + vfu_sgl_read(o->vfu_ctx, sg, 1, buf) != 0) { + return MEMTX_ERROR; + } + + *val = ldn_he_p(buf, size); + + return MEMTX_OK; +} + +static MemTxResult vfu_dma_write(void *opaque, hwaddr addr, uint64_t val, + unsigned size, MemTxAttrs attrs) +{ + MemoryRegion *region = opaque; + VfuObject *o = VFU_OBJECT(region->owner); + uint8_t buf[sizeof(uint64_t)]; + + trace_vfu_dma_write(region->addr + 
addr, size); + + stn_he_p(buf, size, val); + + dma_sg_t *sg = alloca(dma_sg_size()); + vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr); + if (vfu_addr_to_sgl(o->vfu_ctx, vfu_addr, size, sg, 1, PROT_WRITE) < 0 || + vfu_sgl_write(o->vfu_ctx, sg, 1, buf) != 0) { + return MEMTX_ERROR; + } + + return MEMTX_OK; +} + +static const MemoryRegionOps vfu_dma_ops = { + .read_with_attrs = vfu_dma_read, + .write_with_attrs = vfu_dma_write, + .endianness = DEVICE_HOST_ENDIAN, + .valid = { + .min_access_size = 1, + .max_access_size = 8, + .unaligned = true, + }, + .impl = { + .min_access_size = 1, + .max_access_size = 8, + }, +}; + static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info) { VfuObject *o = vfu_get_private(vfu_ctx); @@ -308,17 +365,30 @@ static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info) g_autofree char *name = NULL; struct iovec *iov = &info->iova; - if (!info->vaddr) { - return; - } - name = g_strdup_printf("mem-%s-%"PRIx64"", o->device, - (uint64_t)info->vaddr); + (uint64_t)iov->iov_base); subregion = g_new0(MemoryRegion, 1); - memory_region_init_ram_ptr(subregion, NULL, name, - iov->iov_len, info->vaddr); + if (info->vaddr) { + memory_region_init_ram_ptr(subregion, OBJECT(o), name, + iov->iov_len, info->vaddr); + } else { + /* + * Note that I/O regions' MemoryRegionOps handle accesses of at most 8 + * bytes at a time, and larger accesses are broken down. However, + * many/most DMA accesses are larger than 8 bytes and VFIO-user can + * handle large DMA accesses just fine, thus this size restriction + * unnecessarily hurts performance, in particular given that each + * access causes a round trip on the VFIO-user socket. + * + * TODO: Investigate how to plumb larger accesses through memory + * regions, possibly by amending MemoryRegionOps or by creating a new + * memory region type. + */ + memory_region_init_io(subregion, OBJECT(o), &vfu_dma_ops, subregion, + name, iov->iov_len); + } dma_as = pci_device_iommu_address_space(o->pci_dev); -- 2.34.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
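To see how the new indirect region gets exercised: a device model hosted by the vfio-user server keeps using the ordinary DMA helpers, and its accesses are dispatched through vfu_dma_ops behind the scenes. A sketch from the device model's perspective (the state struct and field names are invented):

    /* An ordinary descriptor fetch in some device model. With the indirect
     * region installed above, each 8-byte chunk of this read becomes one
     * VFIO_USER_DMA_READ round trip to the client. */
    uint64_t desc;
    if (pci_dma_read(pci_dev, s->desc_base, &desc, sizeof(desc)) != MEMTX_OK) {
        /* surface as a device-specific DMA error */
    }

The per-message round trip is also why the 8-byte access-size cap called out in the TODO above matters so much for throughput.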
* [PATCH v2 4/4] vfio-user: Fix config space access byte order 2023-08-23 9:29 [PATCH v2 0/4] Support message-based DMA in vfio-user server Mattias Nissler ` (2 preceding siblings ...) 2023-08-23 9:29 ` [PATCH v2 3/4] vfio-user: Message-based DMA support Mattias Nissler @ 2023-08-23 9:29 ` Mattias Nissler 3 siblings, 0 replies; 14+ messages in thread From: Mattias Nissler @ 2023-08-23 9:29 UTC (permalink / raw) To: qemu-devel Cc: john.levon, stefanha, Jagannathan Raman, Philippe Mathieu-Daudé, Peter Xu, David Hildenbrand, Paolo Bonzini, Elena Ufimtseva, Mattias Nissler PCI config space is little-endian, so on a big-endian host we need to perform byte swaps for values as they are passed to and received from the generic PCI config space access machinery. Signed-off-by: Mattias Nissler <mnissler@rivosinc.com> --- hw/remote/vfio-user-obj.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c index cee5e615a9..d38b4700f3 100644 --- a/hw/remote/vfio-user-obj.c +++ b/hw/remote/vfio-user-obj.c @@ -281,7 +281,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf, while (bytes > 0) { len = (bytes > pci_access_width) ? pci_access_width : bytes; if (is_write) { - memcpy(&val, ptr, len); + val = ldn_le_p(ptr, len); pci_host_config_write_common(o->pci_dev, offset, pci_config_size(o->pci_dev), val, len); @@ -289,7 +289,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf, } else { val = pci_host_config_read_common(o->pci_dev, offset, pci_config_size(o->pci_dev), len); - memcpy(ptr, &val, len); + stn_le_p(ptr, len, val); trace_vfu_cfg_read(offset, val); } offset += len; -- 2.34.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
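A small worked example of why the memcpy was wrong, with an illustrative value rather than one from the patch:

    /* Consider a 2-byte config space read returning vendor ID 0x1af4.
     * pci_host_config_read_common() returns the value in a host-endian
     * integer; on a big-endian host, memcpy of its first two bytes would
     * hand the client 0x00 0x00 instead of the little-endian wire bytes
     * 0xf4 0x1a. stn_le_p() stores the low-order bytes in little-endian
     * order regardless of host endianness: */
    uint64_t val = 0x1af4;
    uint8_t buf[2];
    stn_le_p(buf, 2, val);   /* buf[0] == 0xf4, buf[1] == 0x1a on any host */

On little-endian hosts the memcpy happened to produce the same bytes, which is why the bug only surfaced when the series was tested on big-endian s390x.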
Thread overview: 14+ messages -- 2023-08-23 9:29 [PATCH v2 0/4] Support message-based DMA in vfio-user server Mattias Nissler 2023-08-23 9:29 ` [PATCH v2 1/4] softmmu: Support concurrent bounce buffers Mattias Nissler 2023-08-23 17:34 ` Peter Xu 2023-08-23 20:08 ` Mattias Nissler 2023-08-23 20:54 ` Peter Xu 2023-08-24 6:58 ` Mattias Nissler 2023-08-24 13:32 ` Stefan Hajnoczi 2023-09-01 13:41 ` Markus Armbruster 2023-09-05 7:38 ` Mattias Nissler 2023-09-05 13:45 ` Peter Xu 2023-09-07 12:37 ` Mattias Nissler 2023-08-23 9:29 ` [PATCH v2 2/4] Update subprojects/libvfio-user Mattias Nissler 2023-08-23 9:29 ` [PATCH v2 3/4] vfio-user: Message-based DMA support Mattias Nissler 2023-08-23 9:29 ` [PATCH v2 4/4] vfio-user: Fix config space access byte order Mattias Nissler