[PATCH v3 0/2] system/memory: Make ram device region directly accessible

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v3 0/2] system/memory: Make ram device region directly accessible
@ 2026-06-16  5:25 Gavin Shan
  2026-06-16  5:25 ` [PATCH v3 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors Gavin Shan
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Gavin Shan @ 2026-06-16  5:25 UTC (permalink / raw)
  To: qemu-arm
  Cc: qemu-devel, mst, peterx, alex, richard.henderson, peter.maydell,
	berrange, philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

All ram device regions was turned to be indirectly accessible by commit
4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
to a hanged guest where a NVidia GH100 GPU is passed from host. The memory
in its PCI BAR#4 can be allocated as DMA target buffer. qemu has to take
DMA bounce buffer in address_space_map() to cover the DMA request. However,
the bounce buffer size is 4096 bytes and we're overrunning it easily when
the guest has significant disk activities on compiling 'cuda-samples'.
The full log and problem description can be found from PATCH[1/2]'s commit
log.

Try to fix the issue handled in commit 4a2e242bbb by replacing memcopy()/
memmove() with newly added helpers qemu_ram_{copy, move}() that works on
top of __builtin_{memcpy, memmove} or unaligned access friendly memory
movement in the accessors to the ram device regions. With this, we can
basically revert that commit to make ram device region directly accessible
again and bypass the bounce buffer in address_space_map() where the guest
hang is caused.

PATCH[1] uses qemu_ram_{copy, move}() in ram device region accessors
PATCH[2] makes ram device region directly accessible again

Changelog
=========
v2 -> v3:
  * https://lore.kernel.org/qemu-arm/20260615100200.266968-1-gshan@redhat.com/
  * Documentation for qemu_ram_{copy, move}           (Peter/Michael)
  * Support qemu_ram_move() for overlapped src/dest   (Richard)
  * Use {memcpy, memmove} if step is 16-bytes or more (Michael)
  * Code improvements                                 (Richard/Michael)
v1 -> v2:
  * https://lore.kernel.org/qemu-arm/20260612110307.1264798-1-gshan@redhat.com/
  * Rename address_space_{memcpy, memmove}() to qemu_ram_{copy, move}()
    and move them to physmem.c and memory.h   (Philippe)
  * Use memcpy() and memmove() in qemu_ram_{copy, move}() for the variable
    length case                               (Miachel)
  * Handle unaligned access in qemu_ram_{copy, move}() for all archs
    except i386 and x86_64                    (Richard/Michael)
RFCv1 -> v1:
  * https://lists.nongnu.org/archive/html/qemu-arm/2026-06/msg00307.html
  * Reworked solution based on suggestions from Peter Xu, Peter Maydell
    and Michael S. Tsirkin

Gavin Shan (2):
  system/memory: Use qemu_ram_{copy, move}() in ram device region
    accessors
  system/memory: Make ram device region directly accessible

 hw/remote/vfio-user-obj.c |   4 +-
 include/system/memory.h   |  43 ++++++---
 system/memory.c           |  41 +--------
 system/physmem.c          | 178 +++++++++++++++++++++++++++++++++++++-
 system/trace-events       |   2 -
 5 files changed, 210 insertions(+), 58 deletions(-)

-- 
2.54.0

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-16  5:25 [PATCH v3 0/2] system/memory: Make ram device region directly accessible Gavin Shan
@ 2026-06-16  5:25 ` Gavin Shan
  2026-06-16  6:17   ` Michael S. Tsirkin
  2026-06-25 10:09   ` Peter Maydell
  2026-06-16  5:25 ` [PATCH v3 2/2] system/memory: Make ram device region directly accessible Gavin Shan
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 27+ messages in thread
From: Gavin Shan @ 2026-06-16  5:25 UTC (permalink / raw)
  To: qemu-arm
  Cc: qemu-devel, mst, peterx, alex, richard.henderson, peter.maydell,
	berrange, philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

All ram device regions were turned to be indirectly accessible by commit
4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
to guest hang on attempt to build 'cuda-samples' as reported by Julia. The
guest is started by the following command lines, with GH100 GPU card passed
from the host.

   host$ lspci | grep GH100
   0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
   host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64            \
         -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T         \
         -accel kvm -cpu host -smp cpus=48 -m size=8G                       \
         -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0    \
         -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4     \
         -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
           :
   guest$ cd cuda-samples/build
   guest$ make -j 20 clean
   guest$ make -j 20
           :
   [ 54%] Linking CUDA executable graphMemoryNodes
   [ 54%] Built target graphMemoryNodes
   <no more output afterwards, guest becomes frozen here>

   guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
   [  555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)

When the GPU's driver (NVidia open driver) is loaded on guest bootup,
the memory blocks residing in the PCI BAR#4 of the GH100 GPU card can
be presented to the guest through memory hot-add. The page cache can
then be allocated from the hot added memory blocks when cuda-samples
is being built. Afterwards, the page cache is sent to QEMU's virtio-blk
device as part of the DMA request, the bounce buffer has to be used to
accomodate the request as the corresponding memory region (MemoryRegion)
is an indirectly accessible ram device region in qemu. However, the max
bounce bufer size is only 4096 bytes by default and that is exhausted
quickly, leading to a reset on the virtio-blk device and frozen guest
eventually.

  QEMU
  ====
  virtio_blk_handle_output
    virtio_blk_handle_vq
      virtio_blk_get_request
        virtqueue_pop
          virtqueue_split_pop
            virtqueue_map_desc
              address_space_map
                memory_access_is_direct         # Return false
                  memory_region_supports_direct_access

  (qemu) info mtree
  memory-region: pci_bridge_pci
    0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
      0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
        0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
          0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]

This adds qemu_ram_{copy, move}() and replaces {memcpy, memmove}() with
them in the ram device memory region accessors, similar to what's done
in commit 4a2e242bbb so that the issue (MMIO access instructions were
optimized to SSE instructions) covered by that commit is fixed. This
makes 'ram_device_mem_ops' redundant, paving the way to revert that
commit to make ram device region directly accessible again in the next
patch.

Reported-by: Julia Graham <jugraham@redhat.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Suggested-by: Peter Xu <peterx@redhat.com>
Suggested-by: Richard Henderson <richard.henderson@linaro.org>
Suggested-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
v3: Documentation for qemu_ram_{copy, move}           (Peter/Michael)
    Support qemu_ram_move() for overlapped src/dest   (Richard)
    Use {memcpy, memmove} if step is 16-bytes or more (Michael)
    Code improvements                                 (Richard/Michael)
---
 hw/remote/vfio-user-obj.c |   4 +-
 include/system/memory.h   |  32 ++++++-
 system/physmem.c          | 178 +++++++++++++++++++++++++++++++++++++-
 3 files changed, 207 insertions(+), 7 deletions(-)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 87fa7b6572..97a6c88780 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -375,9 +375,9 @@ static int vfu_object_mr_rw(MemoryRegion *mr, uint8_t *buf, hwaddr offset,
         ram_ptr = memory_region_get_ram_ptr(mr);
 
         if (is_write) {
-            memcpy((ram_ptr + offset), buf, size);
+            qemu_ram_copy(ram_ptr + offset, buf, size);
         } else {
-            memcpy(buf, (ram_ptr + offset), size);
+            qemu_ram_copy(buf, ram_ptr + offset, size);
         }
 
         return 0;
diff --git a/include/system/memory.h b/include/system/memory.h
index 1417132f6d..84203c312d 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -2897,6 +2897,36 @@ void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
 void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
 
 /* Internal functions, part of the implementation of address_space_read.  */
+
+/**
+ * qemu_ram_copy: copy data to ramblock
+ *
+ * @dst: destination where the data is copied to
+ * @src: source where the data is copied from
+ * @n: length of data to be copied
+ *
+ *
+ * Copy @n bytes from @src to @dst with the assumption that @src and @dst
+ * do not overlap. Handles special cases such as uncacheable ramblocks
+ * correctly. Use this for accessing ramblock in response to DMA/VCPU IO,
+ * in preference to memcpy().
+ */
+void qemu_ram_copy(void *dest, const void *src, size_t n);
+
+/**
+ * qemu_ram_move: move data to ramblock
+ *
+ * @dst: destination where the data is moved to
+ * @src: source where the data is moved from
+ * @n: length of data to be moved
+ *
+ * Move @n bytes from @src to @dst with the assumption that @src and @dst
+ * can overlap. Handles special cases such as uncacheable ramblocks
+ * correctly. Use this for accessing ramblock in response to DMA/VCPU IO,
+ * in preference to memmove().
+ */
+void qemu_ram_move(void *dest, const void *src, size_t n);
+
 MemTxResult address_space_read_full(AddressSpace *as, hwaddr addr,
                                     MemTxAttrs attrs, void *buf, hwaddr len);
 MemTxResult flatview_read_continue(FlatView *fv, hwaddr addr,
@@ -2970,7 +3000,7 @@ MemTxResult address_space_read(AddressSpace *as, hwaddr addr,
             mr = flatview_translate(fv, addr, &addr1, &l, false, attrs);
             if (len == l && memory_access_is_direct(mr, false, attrs)) {
                 ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
-                memcpy(buf, ptr, len);
+                qemu_ram_copy(buf, ptr, len);
             } else {
                 result = flatview_read_continue(fv, addr, attrs, buf, len,
                                                 addr1, l, mr);
diff --git a/system/physmem.c b/system/physmem.c
index 7bcbf87573..45a17cd580 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3160,6 +3160,177 @@ void memory_region_flush_rom_device(MemoryRegion *mr, hwaddr addr, hwaddr size)
     invalidate_and_set_dirty(mr, addr, size);
 }
 
+static void qemu_ram_copy_aligned(void *dst, const void *src, size_t n)
+{
+    switch (n) {
+    case 1:
+        __builtin_memcpy(dst, src, 1);
+        break;
+    case 2:
+        __builtin_memcpy(dst, src, 2);
+        break;
+    case 4:
+        __builtin_memcpy(dst, src, 4);
+        break;
+    case 8:
+        __builtin_memcpy(dst, src, 8);
+        break;
+    default:
+        memcpy(dst, src, n);
+    }
+}
+
+static void qemu_ram_move_aligned(void *dst, const void *src, size_t n)
+{
+    switch (n) {
+    case 1:
+        __builtin_memmove(dst, src, 1);
+        break;
+    case 2:
+        __builtin_memmove(dst, src, 2);
+        break;
+    case 4:
+        __builtin_memmove(dst, src, 4);
+        break;
+    case 8:
+        __builtin_memmove(dst, src, 8);
+        break;
+    default:
+        memmove(dst, src, n);
+    }
+}
+
+static void qemu_ram_copy_unaligned(void *dst, const void *src,
+                                    size_t n, size_t max_step)
+{
+    uintptr_t test, step;
+
+    /* Aligned maximal step */
+    max_step = pow2floor(max_step);
+
+    while (n) {
+        test = (uintptr_t)src | (uintptr_t)dst | n | max_step;
+        step = test & -test;
+
+        switch (step) {
+        case 1:
+            qatomic_set((uint8_t *)dst, qatomic_read((uint8_t *)src));
+            src += 1;
+            dst += 1;
+            n -= 1;
+            break;
+        case 2:
+            qatomic_set((uint16_t *)dst, qatomic_read((uint16_t *)src));
+            src += 2;
+            dst += 2;
+            n -= 2;
+            break;
+        case 4:
+            qatomic_set((uint32_t *)dst, qatomic_read((uint32_t *)src));
+            src += 4;
+            dst += 4;
+            n -= 4;
+            break;
+        case 8:
+            qatomic_set((uint64_t *)dst, qatomic_read((uint64_t *)src));
+            src += 8;
+            dst += 8;
+            n -= 8;
+            break;
+        default:
+            memcpy(dst, src, step);
+            src += step;
+            dst += step;
+            n -= step;
+        }
+    }
+}
+
+static void qemu_ram_backwards_copy_unaligned(void *dst, const void *src,
+                                              size_t n, size_t max_step)
+{
+    uintptr_t test, step;
+
+    /* Aligned maximal step */
+    max_step = pow2floor(max_step);
+
+    /* End of the blocks */
+    src += n;
+    dst += n;
+
+    while (n) {
+        test = (uintptr_t)src | (uintptr_t)dst | n | max_step;
+        step = test & -test;
+
+        switch (step) {
+        case 1:
+            src -= 1;
+            dst -= 1;
+            n -= 1;
+            qatomic_set((uint8_t *)dst, qatomic_read((uint8_t *)src));
+            break;
+        case 2:
+            src -= 2;
+            dst -= 2;
+            n -= 2;
+            qatomic_set((uint16_t *)dst, qatomic_read((uint16_t *)src));
+            break;
+        case 4:
+            src -= 4;
+            dst -= 4;
+            n -= 4;
+            qatomic_set((uint32_t *)dst, qatomic_read((uint32_t *)src));
+            break;
+        case 8:
+            src -= 8;
+            dst -= 8;
+            n -= 8;
+            qatomic_set((uint64_t *)dst, qatomic_read((uint64_t *)src));
+            break;
+        default:
+            src -= step;
+            dst -= step;
+            n -= step;
+            memmove(dst, src, step);
+        }
+    }
+}
+
+/* x86 should work with __builtin_{memcpy, memmove}() for IO access */
+#if defined(__i386__) || defined(__x86_64__)
+#define HOST_UNALIGNED_MMIO_OK 1
+#else
+#define HOST_UNALIGNED_MMIO_OK 0
+#endif
+
+void qemu_ram_copy(void *dst, const void *src, size_t n)
+{
+    if (dst == src || n == 0) {
+        return;
+    }
+
+    if (HOST_UNALIGNED_MMIO_OK) {
+        qemu_ram_copy_aligned(dst, src, n);
+    } else {
+        qemu_ram_copy_unaligned(dst, src, n, 8);
+    }
+}
+
+void qemu_ram_move(void *dst, const void *src, size_t n)
+{
+    if (src == dst || n == 0) {
+        return;
+    }
+
+    if (HOST_UNALIGNED_MMIO_OK) {
+        qemu_ram_move_aligned(dst, src, n);
+    } else if (dst < src) {
+        qemu_ram_copy_unaligned(dst, src, n, src - dst);
+    } else {
+        qemu_ram_backwards_copy_unaligned(dst, src, n, dst - src);
+    }
+}
+
 int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr)
 {
     unsigned access_size_max = mr->ops->valid.max_access_size;
@@ -3272,7 +3443,7 @@ static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
         uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
                                                false, true);
 
-        memmove(ram_ptr, buf, *l);
+        qemu_ram_move(ram_ptr, buf, *l);
         invalidate_and_set_dirty(mr, mr_addr, *l);
 
         return MEMTX_OK;
@@ -3365,7 +3536,7 @@ static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
         uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
                                                false, false);
 
-        memcpy(buf, ram_ptr, *l);
+        qemu_ram_copy(buf, ram_ptr, *l);
 
         return MEMTX_OK;
     }
@@ -3503,8 +3674,7 @@ MemTxResult address_space_write_rom(AddressSpace *as, hwaddr addr,
             l = memory_access_size(mr, l, addr1);
         } else {
             /* ROM/RAM case */
-            void *ram_ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
-            memcpy(ram_ptr, buf, l);
+            qemu_ram_copy(qemu_map_ram_ptr(mr->ram_block, addr1), buf, l);
             invalidate_and_set_dirty(mr, addr1, l);
         }
         len -= l;
-- 
2.54.0



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-16  5:25 ` [PATCH v3 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors Gavin Shan
@ 2026-06-16  6:17   ` Michael S. Tsirkin
  2026-06-16  7:15     ` Gavin Shan
  2026-06-25 10:09   ` Peter Maydell
  1 sibling, 1 reply; 27+ messages in thread
From: Michael S. Tsirkin @ 2026-06-16  6:17 UTC (permalink / raw)
  To: Gavin Shan
  Cc: qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	peter.maydell, berrange, philmd, philmd, david, clg, pbonzini,
	phrdina, jugraham, liugang24219, dinghui, shan.gavin

On Tue, Jun 16, 2026 at 03:25:51PM +1000, Gavin Shan wrote:
> All ram device regions were turned to be indirectly accessible by commit
> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> to guest hang on attempt to build 'cuda-samples' as reported by Julia. The
> guest is started by the following command lines, with GH100 GPU card passed
> from the host.
> 
>    host$ lspci | grep GH100
>    0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
>    host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64            \
>          -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T         \
>          -accel kvm -cpu host -smp cpus=48 -m size=8G                       \
>          -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0    \
>          -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4     \
>          -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
>            :
>    guest$ cd cuda-samples/build
>    guest$ make -j 20 clean
>    guest$ make -j 20
>            :
>    [ 54%] Linking CUDA executable graphMemoryNodes
>    [ 54%] Built target graphMemoryNodes
>    <no more output afterwards, guest becomes frozen here>
> 
>    guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
>    [  555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
> 
> When the GPU's driver (NVidia open driver) is loaded on guest bootup,
> the memory blocks residing in the PCI BAR#4 of the GH100 GPU card can
> be presented to the guest through memory hot-add. The page cache can
> then be allocated from the hot added memory blocks when cuda-samples
> is being built. Afterwards, the page cache is sent to QEMU's virtio-blk
> device as part of the DMA request, the bounce buffer has to be used to
> accomodate the request as the corresponding memory region (MemoryRegion)
> is an indirectly accessible ram device region in qemu. However, the max
> bounce bufer size is only 4096 bytes by default and that is exhausted
> quickly, leading to a reset on the virtio-blk device and frozen guest
> eventually.
> 
>   QEMU
>   ====
>   virtio_blk_handle_output
>     virtio_blk_handle_vq
>       virtio_blk_get_request
>         virtqueue_pop
>           virtqueue_split_pop
>             virtqueue_map_desc
>               address_space_map
>                 memory_access_is_direct         # Return false
>                   memory_region_supports_direct_access
> 
>   (qemu) info mtree
>   memory-region: pci_bridge_pci
>     0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
>       0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
>         0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
>           0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
> 
> This adds qemu_ram_{copy, move}() and replaces {memcpy, memmove}() with
> them in the ram device memory region accessors, similar to what's done
> in commit 4a2e242bbb so that the issue (MMIO access instructions were
> optimized to SSE instructions) covered by that commit is fixed. This
> makes 'ram_device_mem_ops' redundant, paving the way to revert that
> commit to make ram device region directly accessible again in the next
> patch.
> 
> Reported-by: Julia Graham <jugraham@redhat.com>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> Suggested-by: Peter Xu <peterx@redhat.com>
> Suggested-by: Richard Henderson <richard.henderson@linaro.org>
> Suggested-by: Peter Maydell <peter.maydell@linaro.org>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
> v3: Documentation for qemu_ram_{copy, move}           (Peter/Michael)
>     Support qemu_ram_move() for overlapped src/dest   (Richard)
>     Use {memcpy, memmove} if step is 16-bytes or more (Michael)
>     Code improvements                                 (Richard/Michael)
> ---
>  hw/remote/vfio-user-obj.c |   4 +-
>  include/system/memory.h   |  32 ++++++-
>  system/physmem.c          | 178 +++++++++++++++++++++++++++++++++++++-
>  3 files changed, 207 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index 87fa7b6572..97a6c88780 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -375,9 +375,9 @@ static int vfu_object_mr_rw(MemoryRegion *mr, uint8_t *buf, hwaddr offset,
>          ram_ptr = memory_region_get_ram_ptr(mr);
>  
>          if (is_write) {
> -            memcpy((ram_ptr + offset), buf, size);
> +            qemu_ram_copy(ram_ptr + offset, buf, size);
>          } else {
> -            memcpy(buf, (ram_ptr + offset), size);
> +            qemu_ram_copy(buf, ram_ptr + offset, size);
>          }
>  
>          return 0;
> diff --git a/include/system/memory.h b/include/system/memory.h
> index 1417132f6d..84203c312d 100644
> --- a/include/system/memory.h
> +++ b/include/system/memory.h
> @@ -2897,6 +2897,36 @@ void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
>  void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
>  
>  /* Internal functions, part of the implementation of address_space_read.  */
> +
> +/**
> + * qemu_ram_copy: copy data to ramblock
> + *
> + * @dst: destination where the data is copied to
> + * @src: source where the data is copied from
> + * @n: length of data to be copied
> + *
> + *
> + * Copy @n bytes from @src to @dst with the assumption that @src and @dst
> + * do not overlap. Handles special cases such as uncacheable ramblocks
> + * correctly. Use this for accessing ramblock in response to DMA/VCPU IO,
> + * in preference to memcpy().
> + */
> +void qemu_ram_copy(void *dest, const void *src, size_t n);
> +
> +/**
> + * qemu_ram_move: move data to ramblock
> + *
> + * @dst: destination where the data is moved to
> + * @src: source where the data is moved from
> + * @n: length of data to be moved
> + *
> + * Move @n bytes from @src to @dst with the assumption that @src and @dst
> + * can overlap. Handles special cases such as uncacheable ramblocks
> + * correctly. Use this for accessing ramblock in response to DMA/VCPU IO,
> + * in preference to memmove().
> + */
> +void qemu_ram_move(void *dest, const void *src, size_t n);
> +
>  MemTxResult address_space_read_full(AddressSpace *as, hwaddr addr,
>                                      MemTxAttrs attrs, void *buf, hwaddr len);
>  MemTxResult flatview_read_continue(FlatView *fv, hwaddr addr,
> @@ -2970,7 +3000,7 @@ MemTxResult address_space_read(AddressSpace *as, hwaddr addr,
>              mr = flatview_translate(fv, addr, &addr1, &l, false, attrs);
>              if (len == l && memory_access_is_direct(mr, false, attrs)) {
>                  ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
> -                memcpy(buf, ptr, len);
> +                qemu_ram_copy(buf, ptr, len);
>              } else {
>                  result = flatview_read_continue(fv, addr, attrs, buf, len,
>                                                  addr1, l, mr);
> diff --git a/system/physmem.c b/system/physmem.c
> index 7bcbf87573..45a17cd580 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -3160,6 +3160,177 @@ void memory_region_flush_rom_device(MemoryRegion *mr, hwaddr addr, hwaddr size)
>      invalidate_and_set_dirty(mr, addr, size);
>  }
>  
> +static void qemu_ram_copy_aligned(void *dst, const void *src, size_t n)
> +{
> +    switch (n) {
> +    case 1:
> +        __builtin_memcpy(dst, src, 1);
> +        break;
> +    case 2:
> +        __builtin_memcpy(dst, src, 2);
> +        break;
> +    case 4:
> +        __builtin_memcpy(dst, src, 4);
> +        break;
> +    case 8:
> +        __builtin_memcpy(dst, src, 8);
> +        break;
> +    default:
> +        memcpy(dst, src, n);
> +    }
> +}
> +
> +static void qemu_ram_move_aligned(void *dst, const void *src, size_t n)
> +{
> +    switch (n) {
> +    case 1:
> +        __builtin_memmove(dst, src, 1);
> +        break;
> +    case 2:
> +        __builtin_memmove(dst, src, 2);
> +        break;
> +    case 4:
> +        __builtin_memmove(dst, src, 4);
> +        break;
> +    case 8:
> +        __builtin_memmove(dst, src, 8);
> +        break;
> +    default:
> +        memmove(dst, src, n);
> +    }
> +}

A weird name for this. You call it for both aligned and unaligned.

maybe this is _unsafe and the "unaligned" below is _safe ?




> +
> +static void qemu_ram_copy_unaligned(void *dst, const void *src,
> +                                    size_t n, size_t max_step)
> +{
> +    uintptr_t test, step;
> +
> +    /* Aligned maximal step */
> +    max_step = pow2floor(max_step);
> +
> +    while (n) {
> +        test = (uintptr_t)src | (uintptr_t)dst | n | max_step;
> +        step = test & -test;
> +
> +        switch (step) {
> +        case 1:
> +            qatomic_set((uint8_t *)dst, qatomic_read((uint8_t *)src));
> +            src += 1;
> +            dst += 1;
> +            n -= 1;
> +            break;
> +        case 2:
> +            qatomic_set((uint16_t *)dst, qatomic_read((uint16_t *)src));
> +            src += 2;
> +            dst += 2;
> +            n -= 2;
> +            break;
> +        case 4:
> +            qatomic_set((uint32_t *)dst, qatomic_read((uint32_t *)src));
> +            src += 4;
> +            dst += 4;
> +            n -= 4;
> +            break;
> +        case 8:
> +            qatomic_set((uint64_t *)dst, qatomic_read((uint64_t *)src));
> +            src += 8;
> +            dst += 8;
> +            n -= 8;
> +            break;
> +        default:
> +            memcpy(dst, src, step);
> +            src += step;
> +            dst += step;
> +            n -= step;
> +        }
> +    }
> +}
> +
> +static void qemu_ram_backwards_copy_unaligned(void *dst, const void *src,
> +                                              size_t n, size_t max_step)
> +{
> +    uintptr_t test, step;
> +
> +    /* Aligned maximal step */
> +    max_step = pow2floor(max_step);
> +
> +    /* End of the blocks */
> +    src += n;
> +    dst += n;
> +
> +    while (n) {
> +        test = (uintptr_t)src | (uintptr_t)dst | n | max_step;
> +        step = test & -test;
> +
> +        switch (step) {
> +        case 1:
> +            src -= 1;
> +            dst -= 1;
> +            n -= 1;
> +            qatomic_set((uint8_t *)dst, qatomic_read((uint8_t *)src));
> +            break;
> +        case 2:
> +            src -= 2;
> +            dst -= 2;
> +            n -= 2;
> +            qatomic_set((uint16_t *)dst, qatomic_read((uint16_t *)src));
> +            break;
> +        case 4:
> +            src -= 4;
> +            dst -= 4;
> +            n -= 4;
> +            qatomic_set((uint32_t *)dst, qatomic_read((uint32_t *)src));
> +            break;
> +        case 8:
> +            src -= 8;
> +            dst -= 8;
> +            n -= 8;
> +            qatomic_set((uint64_t *)dst, qatomic_read((uint64_t *)src));
> +            break;
> +        default:
> +            src -= step;
> +            dst -= step;
> +            n -= step;
> +            memmove(dst, src, step);
> +        }
> +    }
> +}
> +
> +/* x86 should work with __builtin_{memcpy, memmove}() for IO access */
> +#if defined(__i386__) || defined(__x86_64__)
> +#define HOST_UNALIGNED_MMIO_OK 1

maybe HOST_UNALIGNED_VECTOR_MMIO_OK ? We also care that vector stores
are safe.

> +#else
> +#define HOST_UNALIGNED_MMIO_OK 0
> +#endif
> +
> +void qemu_ram_copy(void *dst, const void *src, size_t n)
> +{
> +    if (dst == src || n == 0) {
> +        return;
> +    }
> +
> +    if (HOST_UNALIGNED_MMIO_OK) {
> +        qemu_ram_copy_aligned(dst, src, n);
> +    } else {
> +        qemu_ram_copy_unaligned(dst, src, n, 8);
> +    }
> +}
> +
> +void qemu_ram_move(void *dst, const void *src, size_t n)
> +{
> +    if (src == dst || n == 0) {
> +        return;
> +    }
> +
> +    if (HOST_UNALIGNED_MMIO_OK) {
> +        qemu_ram_move_aligned(dst, src, n);
> +    } else if (dst < src) {
> +        qemu_ram_copy_unaligned(dst, src, n, src - dst);
> +    } else {
> +        qemu_ram_backwards_copy_unaligned(dst, src, n, dst - src);
> +    }
> +}
> +
>  int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr)
>  {
>      unsigned access_size_max = mr->ops->valid.max_access_size;
> @@ -3272,7 +3443,7 @@ static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
>          uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
>                                                 false, true);
>  
> -        memmove(ram_ptr, buf, *l);
> +        qemu_ram_move(ram_ptr, buf, *l);
>          invalidate_and_set_dirty(mr, mr_addr, *l);
>  
>          return MEMTX_OK;
> @@ -3365,7 +3536,7 @@ static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
>          uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
>                                                 false, false);
>  
> -        memcpy(buf, ram_ptr, *l);
> +        qemu_ram_copy(buf, ram_ptr, *l);
>  
>          return MEMTX_OK;
>      }
> @@ -3503,8 +3674,7 @@ MemTxResult address_space_write_rom(AddressSpace *as, hwaddr addr,
>              l = memory_access_size(mr, l, addr1);
>          } else {
>              /* ROM/RAM case */
> -            void *ram_ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
> -            memcpy(ram_ptr, buf, l);
> +            qemu_ram_copy(qemu_map_ram_ptr(mr->ram_block, addr1), buf, l);
>              invalidate_and_set_dirty(mr, addr1, l);
>          }
>          len -= l;
> -- 
> 2.54.0



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-16  6:17   ` Michael S. Tsirkin
@ 2026-06-16  7:15     ` Gavin Shan
  2026-06-16  9:51       ` Michael S. Tsirkin
  0 siblings, 1 reply; 27+ messages in thread
From: Gavin Shan @ 2026-06-16  7:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	peter.maydell, berrange, philmd, philmd, david, clg, pbonzini,
	phrdina, jugraham, liugang24219, dinghui, shan.gavin


On 6/16/26 4:17 PM, Michael S. Tsirkin wrote:
> On Tue, Jun 16, 2026 at 03:25:51PM +1000, Gavin Shan wrote:
>> All ram device regions were turned to be indirectly accessible by commit
>> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
>> to guest hang on attempt to build 'cuda-samples' as reported by Julia. The
>> guest is started by the following command lines, with GH100 GPU card passed
>> from the host.
>>
>>     host$ lspci | grep GH100
>>     0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
>>     host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64            \
>>           -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T         \
>>           -accel kvm -cpu host -smp cpus=48 -m size=8G                       \
>>           -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0    \
>>           -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4     \
>>           -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
>>             :
>>     guest$ cd cuda-samples/build
>>     guest$ make -j 20 clean
>>     guest$ make -j 20
>>             :
>>     [ 54%] Linking CUDA executable graphMemoryNodes
>>     [ 54%] Built target graphMemoryNodes
>>     <no more output afterwards, guest becomes frozen here>
>>
>>     guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
>>     [  555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
>>
>> When the GPU's driver (NVidia open driver) is loaded on guest bootup,
>> the memory blocks residing in the PCI BAR#4 of the GH100 GPU card can
>> be presented to the guest through memory hot-add. The page cache can
>> then be allocated from the hot added memory blocks when cuda-samples
>> is being built. Afterwards, the page cache is sent to QEMU's virtio-blk
>> device as part of the DMA request, the bounce buffer has to be used to
>> accomodate the request as the corresponding memory region (MemoryRegion)
>> is an indirectly accessible ram device region in qemu. However, the max
>> bounce bufer size is only 4096 bytes by default and that is exhausted
>> quickly, leading to a reset on the virtio-blk device and frozen guest
>> eventually.
>>
>>    QEMU
>>    ====
>>    virtio_blk_handle_output
>>      virtio_blk_handle_vq
>>        virtio_blk_get_request
>>          virtqueue_pop
>>            virtqueue_split_pop
>>              virtqueue_map_desc
>>                address_space_map
>>                  memory_access_is_direct         # Return false
>>                    memory_region_supports_direct_access
>>
>>    (qemu) info mtree
>>    memory-region: pci_bridge_pci
>>      0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
>>        0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
>>          0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
>>            0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
>>
>> This adds qemu_ram_{copy, move}() and replaces {memcpy, memmove}() with
>> them in the ram device memory region accessors, similar to what's done
>> in commit 4a2e242bbb so that the issue (MMIO access instructions were
>> optimized to SSE instructions) covered by that commit is fixed. This
>> makes 'ram_device_mem_ops' redundant, paving the way to revert that
>> commit to make ram device region directly accessible again in the next
>> patch.
>>
>> Reported-by: Julia Graham <jugraham@redhat.com>
>> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
>> Suggested-by: Peter Xu <peterx@redhat.com>
>> Suggested-by: Richard Henderson <richard.henderson@linaro.org>
>> Suggested-by: Peter Maydell <peter.maydell@linaro.org>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>> v3: Documentation for qemu_ram_{copy, move}           (Peter/Michael)
>>      Support qemu_ram_move() for overlapped src/dest   (Richard)
>>      Use {memcpy, memmove} if step is 16-bytes or more (Michael)
>>      Code improvements                                 (Richard/Michael)
>> ---
>>   hw/remote/vfio-user-obj.c |   4 +-
>>   include/system/memory.h   |  32 ++++++-
>>   system/physmem.c          | 178 +++++++++++++++++++++++++++++++++++++-
>>   3 files changed, 207 insertions(+), 7 deletions(-)
>>
>> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
>> index 87fa7b6572..97a6c88780 100644
>> --- a/hw/remote/vfio-user-obj.c
>> +++ b/hw/remote/vfio-user-obj.c
>> @@ -375,9 +375,9 @@ static int vfu_object_mr_rw(MemoryRegion *mr, uint8_t *buf, hwaddr offset,
>>           ram_ptr = memory_region_get_ram_ptr(mr);
>>   
>>           if (is_write) {
>> -            memcpy((ram_ptr + offset), buf, size);
>> +            qemu_ram_copy(ram_ptr + offset, buf, size);
>>           } else {
>> -            memcpy(buf, (ram_ptr + offset), size);
>> +            qemu_ram_copy(buf, ram_ptr + offset, size);
>>           }
>>   
>>           return 0;
>> diff --git a/include/system/memory.h b/include/system/memory.h
>> index 1417132f6d..84203c312d 100644
>> --- a/include/system/memory.h
>> +++ b/include/system/memory.h
>> @@ -2897,6 +2897,36 @@ void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
>>   void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
>>   
>>   /* Internal functions, part of the implementation of address_space_read.  */
>> +
>> +/**
>> + * qemu_ram_copy: copy data to ramblock
>> + *
>> + * @dst: destination where the data is copied to
>> + * @src: source where the data is copied from
>> + * @n: length of data to be copied
>> + *
>> + *
>> + * Copy @n bytes from @src to @dst with the assumption that @src and @dst
>> + * do not overlap. Handles special cases such as uncacheable ramblocks
>> + * correctly. Use this for accessing ramblock in response to DMA/VCPU IO,
>> + * in preference to memcpy().
>> + */
>> +void qemu_ram_copy(void *dest, const void *src, size_t n);
>> +
>> +/**
>> + * qemu_ram_move: move data to ramblock
>> + *
>> + * @dst: destination where the data is moved to
>> + * @src: source where the data is moved from
>> + * @n: length of data to be moved
>> + *
>> + * Move @n bytes from @src to @dst with the assumption that @src and @dst
>> + * can overlap. Handles special cases such as uncacheable ramblocks
>> + * correctly. Use this for accessing ramblock in response to DMA/VCPU IO,
>> + * in preference to memmove().
>> + */
>> +void qemu_ram_move(void *dest, const void *src, size_t n);
>> +
>>   MemTxResult address_space_read_full(AddressSpace *as, hwaddr addr,
>>                                       MemTxAttrs attrs, void *buf, hwaddr len);
>>   MemTxResult flatview_read_continue(FlatView *fv, hwaddr addr,
>> @@ -2970,7 +3000,7 @@ MemTxResult address_space_read(AddressSpace *as, hwaddr addr,
>>               mr = flatview_translate(fv, addr, &addr1, &l, false, attrs);
>>               if (len == l && memory_access_is_direct(mr, false, attrs)) {
>>                   ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
>> -                memcpy(buf, ptr, len);
>> +                qemu_ram_copy(buf, ptr, len);
>>               } else {
>>                   result = flatview_read_continue(fv, addr, attrs, buf, len,
>>                                                   addr1, l, mr);
>> diff --git a/system/physmem.c b/system/physmem.c
>> index 7bcbf87573..45a17cd580 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -3160,6 +3160,177 @@ void memory_region_flush_rom_device(MemoryRegion *mr, hwaddr addr, hwaddr size)
>>       invalidate_and_set_dirty(mr, addr, size);
>>   }
>>   
>> +static void qemu_ram_copy_aligned(void *dst, const void *src, size_t n)
>> +{
>> +    switch (n) {
>> +    case 1:
>> +        __builtin_memcpy(dst, src, 1);
>> +        break;
>> +    case 2:
>> +        __builtin_memcpy(dst, src, 2);
>> +        break;
>> +    case 4:
>> +        __builtin_memcpy(dst, src, 4);
>> +        break;
>> +    case 8:
>> +        __builtin_memcpy(dst, src, 8);
>> +        break;
>> +    default:
>> +        memcpy(dst, src, n);
>> +    }
>> +}
>> +
>> +static void qemu_ram_move_aligned(void *dst, const void *src, size_t n)
>> +{
>> +    switch (n) {
>> +    case 1:
>> +        __builtin_memmove(dst, src, 1);
>> +        break;
>> +    case 2:
>> +        __builtin_memmove(dst, src, 2);
>> +        break;
>> +    case 4:
>> +        __builtin_memmove(dst, src, 4);
>> +        break;
>> +    case 8:
>> +        __builtin_memmove(dst, src, 8);
>> +        break;
>> +    default:
>> +        memmove(dst, src, n);
>> +    }
>> +}
> 
> A weird name for this. You call it for both aligned and unaligned.
> 
> maybe this is _unsafe and the "unaligned" below is _safe ?
> 

Yeah, xxx_{aligned, unaligned}() should be renamed to xxx_{unsafe, safe}(). The
question is why we're away from __builtin_{memcpy, memmove}() and memcpy/memmove
since they're not safe as the new function name indicates? :-)

If we're away from memcpy/memmove() and switch to xxx_safe() for x86, we shouldn't
have any more concerns except the performance lost. At least, everything should
be working from function POV because we're not affected by glibc and the optimization
enforced by the compiler any more.

> 
> 
> 
>> +
>> +static void qemu_ram_copy_unaligned(void *dst, const void *src,
>> +                                    size_t n, size_t max_step)
>> +{
>> +    uintptr_t test, step;
>> +
>> +    /* Aligned maximal step */
>> +    max_step = pow2floor(max_step);
>> +
>> +    while (n) {
>> +        test = (uintptr_t)src | (uintptr_t)dst | n | max_step;
>> +        step = test & -test;
>> +
>> +        switch (step) {
>> +        case 1:
>> +            qatomic_set((uint8_t *)dst, qatomic_read((uint8_t *)src));
>> +            src += 1;
>> +            dst += 1;
>> +            n -= 1;
>> +            break;
>> +        case 2:
>> +            qatomic_set((uint16_t *)dst, qatomic_read((uint16_t *)src));
>> +            src += 2;
>> +            dst += 2;
>> +            n -= 2;
>> +            break;
>> +        case 4:
>> +            qatomic_set((uint32_t *)dst, qatomic_read((uint32_t *)src));
>> +            src += 4;
>> +            dst += 4;
>> +            n -= 4;
>> +            break;
>> +        case 8:
>> +            qatomic_set((uint64_t *)dst, qatomic_read((uint64_t *)src));
>> +            src += 8;
>> +            dst += 8;
>> +            n -= 8;
>> +            break;
>> +        default:
>> +            memcpy(dst, src, step);
>> +            src += step;
>> +            dst += step;
>> +            n -= step;
>> +        }
>> +    }
>> +}
>> +
>> +static void qemu_ram_backwards_copy_unaligned(void *dst, const void *src,
>> +                                              size_t n, size_t max_step)
>> +{
>> +    uintptr_t test, step;
>> +
>> +    /* Aligned maximal step */
>> +    max_step = pow2floor(max_step);
>> +
>> +    /* End of the blocks */
>> +    src += n;
>> +    dst += n;
>> +
>> +    while (n) {
>> +        test = (uintptr_t)src | (uintptr_t)dst | n | max_step;
>> +        step = test & -test;
>> +
>> +        switch (step) {
>> +        case 1:
>> +            src -= 1;
>> +            dst -= 1;
>> +            n -= 1;
>> +            qatomic_set((uint8_t *)dst, qatomic_read((uint8_t *)src));
>> +            break;
>> +        case 2:
>> +            src -= 2;
>> +            dst -= 2;
>> +            n -= 2;
>> +            qatomic_set((uint16_t *)dst, qatomic_read((uint16_t *)src));
>> +            break;
>> +        case 4:
>> +            src -= 4;
>> +            dst -= 4;
>> +            n -= 4;
>> +            qatomic_set((uint32_t *)dst, qatomic_read((uint32_t *)src));
>> +            break;
>> +        case 8:
>> +            src -= 8;
>> +            dst -= 8;
>> +            n -= 8;
>> +            qatomic_set((uint64_t *)dst, qatomic_read((uint64_t *)src));
>> +            break;
>> +        default:
>> +            src -= step;
>> +            dst -= step;
>> +            n -= step;
>> +            memmove(dst, src, step);

This should be memcpy() actually since src/dst don't overlap. However, I think
Richard may suggest to avoid using memcpy/memmove() here.

>> +        }
>> +    }
>> +}
>> +
>> +/* x86 should work with __builtin_{memcpy, memmove}() for IO access */
>> +#if defined(__i386__) || defined(__x86_64__)
>> +#define HOST_UNALIGNED_MMIO_OK 1
> 
> maybe HOST_UNALIGNED_VECTOR_MMIO_OK ? We also care that vector stores
> are safe.
> 

Yes, it should be renamed to HOST_UNALIGNED_VECTOR_MMIO_OK.

>> +#else
>> +#define HOST_UNALIGNED_MMIO_OK 0
>> +#endif
>> +
>> +void qemu_ram_copy(void *dst, const void *src, size_t n)
>> +{
>> +    if (dst == src || n == 0) {
>> +        return;
>> +    }
>> +
>> +    if (HOST_UNALIGNED_MMIO_OK) {
>> +        qemu_ram_copy_aligned(dst, src, n);
>> +    } else {
>> +        qemu_ram_copy_unaligned(dst, src, n, 8);
>> +    }
>> +}
>> +
>> +void qemu_ram_move(void *dst, const void *src, size_t n)
>> +{
>> +    if (src == dst || n == 0) {
>> +        return;
>> +    }
>> +
>> +    if (HOST_UNALIGNED_MMIO_OK) {
>> +        qemu_ram_move_aligned(dst, src, n);
>> +    } else if (dst < src) {
>> +        qemu_ram_copy_unaligned(dst, src, n, src - dst);
>> +    } else {
>> +        qemu_ram_backwards_copy_unaligned(dst, src, n, dst - src);
>> +    }
>> +}
>> +
>>   int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr)
>>   {
>>       unsigned access_size_max = mr->ops->valid.max_access_size;
>> @@ -3272,7 +3443,7 @@ static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
>>           uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
>>                                                  false, true);
>>   
>> -        memmove(ram_ptr, buf, *l);
>> +        qemu_ram_move(ram_ptr, buf, *l);
>>           invalidate_and_set_dirty(mr, mr_addr, *l);
>>   
>>           return MEMTX_OK;
>> @@ -3365,7 +3536,7 @@ static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
>>           uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
>>                                                  false, false);
>>   
>> -        memcpy(buf, ram_ptr, *l);
>> +        qemu_ram_copy(buf, ram_ptr, *l);
>>   
>>           return MEMTX_OK;
>>       }
>> @@ -3503,8 +3674,7 @@ MemTxResult address_space_write_rom(AddressSpace *as, hwaddr addr,
>>               l = memory_access_size(mr, l, addr1);
>>           } else {
>>               /* ROM/RAM case */
>> -            void *ram_ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
>> -            memcpy(ram_ptr, buf, l);
>> +            qemu_ram_copy(qemu_map_ram_ptr(mr->ram_block, addr1), buf, l);
>>               invalidate_and_set_dirty(mr, addr1, l);
>>           }
>>           len -= l;
>> -- 
>> 2.54.0
> 

Thanks,
Gavin



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-16  7:15     ` Gavin Shan
@ 2026-06-16  9:51       ` Michael S. Tsirkin
  2026-06-16 12:50         ` Ding Hui
  0 siblings, 1 reply; 27+ messages in thread
From: Michael S. Tsirkin @ 2026-06-16  9:51 UTC (permalink / raw)
  To: Gavin Shan
  Cc: qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	peter.maydell, berrange, philmd, philmd, david, clg, pbonzini,
	phrdina, jugraham, liugang24219, dinghui, shan.gavin

On Tue, Jun 16, 2026 at 05:15:13PM +1000, Gavin Shan wrote:
> 
> On 6/16/26 4:17 PM, Michael S. Tsirkin wrote:
> > On Tue, Jun 16, 2026 at 03:25:51PM +1000, Gavin Shan wrote:
> > > All ram device regions were turned to be indirectly accessible by commit
> > > 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> > > to guest hang on attempt to build 'cuda-samples' as reported by Julia. The
> > > guest is started by the following command lines, with GH100 GPU card passed
> > > from the host.
> > > 
> > >     host$ lspci | grep GH100
> > >     0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
> > >     host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64            \
> > >           -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T         \
> > >           -accel kvm -cpu host -smp cpus=48 -m size=8G                       \
> > >           -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0    \
> > >           -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4     \
> > >           -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
> > >             :
> > >     guest$ cd cuda-samples/build
> > >     guest$ make -j 20 clean
> > >     guest$ make -j 20
> > >             :
> > >     [ 54%] Linking CUDA executable graphMemoryNodes
> > >     [ 54%] Built target graphMemoryNodes
> > >     <no more output afterwards, guest becomes frozen here>
> > > 
> > >     guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
> > >     [  555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
> > > 
> > > When the GPU's driver (NVidia open driver) is loaded on guest bootup,
> > > the memory blocks residing in the PCI BAR#4 of the GH100 GPU card can
> > > be presented to the guest through memory hot-add. The page cache can
> > > then be allocated from the hot added memory blocks when cuda-samples
> > > is being built. Afterwards, the page cache is sent to QEMU's virtio-blk
> > > device as part of the DMA request, the bounce buffer has to be used to
> > > accomodate the request as the corresponding memory region (MemoryRegion)
> > > is an indirectly accessible ram device region in qemu. However, the max
> > > bounce bufer size is only 4096 bytes by default and that is exhausted
> > > quickly, leading to a reset on the virtio-blk device and frozen guest
> > > eventually.
> > > 
> > >    QEMU
> > >    ====
> > >    virtio_blk_handle_output
> > >      virtio_blk_handle_vq
> > >        virtio_blk_get_request
> > >          virtqueue_pop
> > >            virtqueue_split_pop
> > >              virtqueue_map_desc
> > >                address_space_map
> > >                  memory_access_is_direct         # Return false
> > >                    memory_region_supports_direct_access
> > > 
> > >    (qemu) info mtree
> > >    memory-region: pci_bridge_pci
> > >      0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
> > >        0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
> > >          0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
> > >            0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
> > > 
> > > This adds qemu_ram_{copy, move}() and replaces {memcpy, memmove}() with
> > > them in the ram device memory region accessors, similar to what's done
> > > in commit 4a2e242bbb so that the issue (MMIO access instructions were
> > > optimized to SSE instructions) covered by that commit is fixed. This
> > > makes 'ram_device_mem_ops' redundant, paving the way to revert that
> > > commit to make ram device region directly accessible again in the next
> > > patch.
> > > 
> > > Reported-by: Julia Graham <jugraham@redhat.com>
> > > Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> > > Suggested-by: Peter Xu <peterx@redhat.com>
> > > Suggested-by: Richard Henderson <richard.henderson@linaro.org>
> > > Suggested-by: Peter Maydell <peter.maydell@linaro.org>
> > > Signed-off-by: Gavin Shan <gshan@redhat.com>
> > > ---
> > > v3: Documentation for qemu_ram_{copy, move}           (Peter/Michael)
> > >      Support qemu_ram_move() for overlapped src/dest   (Richard)
> > >      Use {memcpy, memmove} if step is 16-bytes or more (Michael)
> > >      Code improvements                                 (Richard/Michael)
> > > ---
> > >   hw/remote/vfio-user-obj.c |   4 +-
> > >   include/system/memory.h   |  32 ++++++-
> > >   system/physmem.c          | 178 +++++++++++++++++++++++++++++++++++++-
> > >   3 files changed, 207 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> > > index 87fa7b6572..97a6c88780 100644
> > > --- a/hw/remote/vfio-user-obj.c
> > > +++ b/hw/remote/vfio-user-obj.c
> > > @@ -375,9 +375,9 @@ static int vfu_object_mr_rw(MemoryRegion *mr, uint8_t *buf, hwaddr offset,
> > >           ram_ptr = memory_region_get_ram_ptr(mr);
> > >           if (is_write) {
> > > -            memcpy((ram_ptr + offset), buf, size);
> > > +            qemu_ram_copy(ram_ptr + offset, buf, size);
> > >           } else {
> > > -            memcpy(buf, (ram_ptr + offset), size);
> > > +            qemu_ram_copy(buf, ram_ptr + offset, size);
> > >           }
> > >           return 0;
> > > diff --git a/include/system/memory.h b/include/system/memory.h
> > > index 1417132f6d..84203c312d 100644
> > > --- a/include/system/memory.h
> > > +++ b/include/system/memory.h
> > > @@ -2897,6 +2897,36 @@ void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
> > >   void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
> > >   /* Internal functions, part of the implementation of address_space_read.  */
> > > +
> > > +/**
> > > + * qemu_ram_copy: copy data to ramblock
> > > + *
> > > + * @dst: destination where the data is copied to
> > > + * @src: source where the data is copied from
> > > + * @n: length of data to be copied
> > > + *
> > > + *
> > > + * Copy @n bytes from @src to @dst with the assumption that @src and @dst
> > > + * do not overlap. Handles special cases such as uncacheable ramblocks
> > > + * correctly. Use this for accessing ramblock in response to DMA/VCPU IO,
> > > + * in preference to memcpy().
> > > + */
> > > +void qemu_ram_copy(void *dest, const void *src, size_t n);
> > > +
> > > +/**
> > > + * qemu_ram_move: move data to ramblock
> > > + *
> > > + * @dst: destination where the data is moved to
> > > + * @src: source where the data is moved from
> > > + * @n: length of data to be moved
> > > + *
> > > + * Move @n bytes from @src to @dst with the assumption that @src and @dst
> > > + * can overlap. Handles special cases such as uncacheable ramblocks
> > > + * correctly. Use this for accessing ramblock in response to DMA/VCPU IO,
> > > + * in preference to memmove().
> > > + */
> > > +void qemu_ram_move(void *dest, const void *src, size_t n);
> > > +
> > >   MemTxResult address_space_read_full(AddressSpace *as, hwaddr addr,
> > >                                       MemTxAttrs attrs, void *buf, hwaddr len);
> > >   MemTxResult flatview_read_continue(FlatView *fv, hwaddr addr,
> > > @@ -2970,7 +3000,7 @@ MemTxResult address_space_read(AddressSpace *as, hwaddr addr,
> > >               mr = flatview_translate(fv, addr, &addr1, &l, false, attrs);
> > >               if (len == l && memory_access_is_direct(mr, false, attrs)) {
> > >                   ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
> > > -                memcpy(buf, ptr, len);
> > > +                qemu_ram_copy(buf, ptr, len);
> > >               } else {
> > >                   result = flatview_read_continue(fv, addr, attrs, buf, len,
> > >                                                   addr1, l, mr);
> > > diff --git a/system/physmem.c b/system/physmem.c
> > > index 7bcbf87573..45a17cd580 100644
> > > --- a/system/physmem.c
> > > +++ b/system/physmem.c
> > > @@ -3160,6 +3160,177 @@ void memory_region_flush_rom_device(MemoryRegion *mr, hwaddr addr, hwaddr size)
> > >       invalidate_and_set_dirty(mr, addr, size);
> > >   }
> > > +static void qemu_ram_copy_aligned(void *dst, const void *src, size_t n)
> > > +{
> > > +    switch (n) {
> > > +    case 1:
> > > +        __builtin_memcpy(dst, src, 1);
> > > +        break;
> > > +    case 2:
> > > +        __builtin_memcpy(dst, src, 2);
> > > +        break;
> > > +    case 4:
> > > +        __builtin_memcpy(dst, src, 4);
> > > +        break;
> > > +    case 8:
> > > +        __builtin_memcpy(dst, src, 8);
> > > +        break;
> > > +    default:
> > > +        memcpy(dst, src, n);
> > > +    }
> > > +}
> > > +
> > > +static void qemu_ram_move_aligned(void *dst, const void *src, size_t n)
> > > +{
> > > +    switch (n) {
> > > +    case 1:
> > > +        __builtin_memmove(dst, src, 1);
> > > +        break;
> > > +    case 2:
> > > +        __builtin_memmove(dst, src, 2);
> > > +        break;
> > > +    case 4:
> > > +        __builtin_memmove(dst, src, 4);
> > > +        break;
> > > +    case 8:
> > > +        __builtin_memmove(dst, src, 8);
> > > +        break;
> > > +    default:
> > > +        memmove(dst, src, n);
> > > +    }
> > > +}
> > 
> > A weird name for this. You call it for both aligned and unaligned.
> > 
> > maybe this is _unsafe and the "unaligned" below is _safe ?
> > 
> 
> Yeah, xxx_{aligned, unaligned}() should be renamed to xxx_{unsafe, safe}(). The
> question is why we're away from __builtin_{memcpy, memmove}() and memcpy/memmove
> since they're not safe as the new function name indicates? :-)

What is the question, exactly? That's why I made the list of 11 issues.
Is that unclear, somehow?  Another reason to include that, maybe.

> If we're away from memcpy/memmove() and switch to xxx_safe() for x86, we shouldn't
> have any more concerns except the performance lost. At least, everything should
> be working from function POV because we're not affected by glibc and the optimization
> enforced by the compiler any more.

There's no way to write C and not depend on compiler's whims for
weird things such as unaligned memory accesses.

> > 
> > 
> > 
> > > +
> > > +static void qemu_ram_copy_unaligned(void *dst, const void *src,
> > > +                                    size_t n, size_t max_step)
> > > +{
> > > +    uintptr_t test, step;
> > > +
> > > +    /* Aligned maximal step */
> > > +    max_step = pow2floor(max_step);
> > > +
> > > +    while (n) {
> > > +        test = (uintptr_t)src | (uintptr_t)dst | n | max_step;
> > > +        step = test & -test;
> > > +
> > > +        switch (step) {
> > > +        case 1:
> > > +            qatomic_set((uint8_t *)dst, qatomic_read((uint8_t *)src));
> > > +            src += 1;
> > > +            dst += 1;
> > > +            n -= 1;
> > > +            break;
> > > +        case 2:
> > > +            qatomic_set((uint16_t *)dst, qatomic_read((uint16_t *)src));
> > > +            src += 2;
> > > +            dst += 2;
> > > +            n -= 2;
> > > +            break;
> > > +        case 4:
> > > +            qatomic_set((uint32_t *)dst, qatomic_read((uint32_t *)src));
> > > +            src += 4;
> > > +            dst += 4;
> > > +            n -= 4;
> > > +            break;
> > > +        case 8:
> > > +            qatomic_set((uint64_t *)dst, qatomic_read((uint64_t *)src));
> > > +            src += 8;
> > > +            dst += 8;
> > > +            n -= 8;
> > > +            break;
> > > +        default:
> > > +            memcpy(dst, src, step);
> > > +            src += step;
> > > +            dst += step;
> > > +            n -= step;
> > > +        }
> > > +    }
> > > +}
> > > +
> > > +static void qemu_ram_backwards_copy_unaligned(void *dst, const void *src,
> > > +                                              size_t n, size_t max_step)
> > > +{
> > > +    uintptr_t test, step;
> > > +
> > > +    /* Aligned maximal step */
> > > +    max_step = pow2floor(max_step);
> > > +
> > > +    /* End of the blocks */
> > > +    src += n;
> > > +    dst += n;
> > > +
> > > +    while (n) {
> > > +        test = (uintptr_t)src | (uintptr_t)dst | n | max_step;
> > > +        step = test & -test;
> > > +
> > > +        switch (step) {
> > > +        case 1:
> > > +            src -= 1;
> > > +            dst -= 1;
> > > +            n -= 1;
> > > +            qatomic_set((uint8_t *)dst, qatomic_read((uint8_t *)src));
> > > +            break;
> > > +        case 2:
> > > +            src -= 2;
> > > +            dst -= 2;
> > > +            n -= 2;
> > > +            qatomic_set((uint16_t *)dst, qatomic_read((uint16_t *)src));
> > > +            break;
> > > +        case 4:
> > > +            src -= 4;
> > > +            dst -= 4;
> > > +            n -= 4;
> > > +            qatomic_set((uint32_t *)dst, qatomic_read((uint32_t *)src));
> > > +            break;
> > > +        case 8:
> > > +            src -= 8;
> > > +            dst -= 8;
> > > +            n -= 8;
> > > +            qatomic_set((uint64_t *)dst, qatomic_read((uint64_t *)src));
> > > +            break;
> > > +        default:
> > > +            src -= step;
> > > +            dst -= step;
> > > +            n -= step;
> > > +            memmove(dst, src, step);
> 
> This should be memcpy() actually since src/dst don't overlap. However, I think
> Richard may suggest to avoid using memcpy/memmove() here.
> 
> > > +        }
> > > +    }
> > > +}
> > > +
> > > +/* x86 should work with __builtin_{memcpy, memmove}() for IO access */
> > > +#if defined(__i386__) || defined(__x86_64__)
> > > +#define HOST_UNALIGNED_MMIO_OK 1
> > 
> > maybe HOST_UNALIGNED_VECTOR_MMIO_OK ? We also care that vector stores
> > are safe.
> > 
> 
> Yes, it should be renamed to HOST_UNALIGNED_VECTOR_MMIO_OK.
> 
> > > +#else
> > > +#define HOST_UNALIGNED_MMIO_OK 0
> > > +#endif
> > > +
> > > +void qemu_ram_copy(void *dst, const void *src, size_t n)
> > > +{
> > > +    if (dst == src || n == 0) {
> > > +        return;
> > > +    }
> > > +
> > > +    if (HOST_UNALIGNED_MMIO_OK) {
> > > +        qemu_ram_copy_aligned(dst, src, n);
> > > +    } else {
> > > +        qemu_ram_copy_unaligned(dst, src, n, 8);
> > > +    }
> > > +}
> > > +
> > > +void qemu_ram_move(void *dst, const void *src, size_t n)
> > > +{
> > > +    if (src == dst || n == 0) {
> > > +        return;
> > > +    }
> > > +
> > > +    if (HOST_UNALIGNED_MMIO_OK) {
> > > +        qemu_ram_move_aligned(dst, src, n);
> > > +    } else if (dst < src) {
> > > +        qemu_ram_copy_unaligned(dst, src, n, src - dst);
> > > +    } else {
> > > +        qemu_ram_backwards_copy_unaligned(dst, src, n, dst - src);
> > > +    }
> > > +}
> > > +
> > >   int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr)
> > >   {
> > >       unsigned access_size_max = mr->ops->valid.max_access_size;
> > > @@ -3272,7 +3443,7 @@ static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
> > >           uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
> > >                                                  false, true);
> > > -        memmove(ram_ptr, buf, *l);
> > > +        qemu_ram_move(ram_ptr, buf, *l);
> > >           invalidate_and_set_dirty(mr, mr_addr, *l);
> > >           return MEMTX_OK;
> > > @@ -3365,7 +3536,7 @@ static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
> > >           uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
> > >                                                  false, false);
> > > -        memcpy(buf, ram_ptr, *l);
> > > +        qemu_ram_copy(buf, ram_ptr, *l);
> > >           return MEMTX_OK;
> > >       }
> > > @@ -3503,8 +3674,7 @@ MemTxResult address_space_write_rom(AddressSpace *as, hwaddr addr,
> > >               l = memory_access_size(mr, l, addr1);
> > >           } else {
> > >               /* ROM/RAM case */
> > > -            void *ram_ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
> > > -            memcpy(ram_ptr, buf, l);
> > > +            qemu_ram_copy(qemu_map_ram_ptr(mr->ram_block, addr1), buf, l);
> > >               invalidate_and_set_dirty(mr, addr1, l);
> > >           }
> > >           len -= l;
> > > -- 
> > > 2.54.0
> > 
> 
> Thanks,
> Gavin



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-16  9:51       ` Michael S. Tsirkin
@ 2026-06-16 12:50         ` Ding Hui
  2026-06-16 15:51           ` Michael S. Tsirkin
  0 siblings, 1 reply; 27+ messages in thread
From: Ding Hui @ 2026-06-16 12:50 UTC (permalink / raw)
  To: Michael S. Tsirkin, Gavin Shan
  Cc: qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	peter.maydell, berrange, philmd, philmd, david, clg, pbonzini,
	phrdina, jugraham, liugang24219, shan.gavin

On 2026/6/16 17:51, Michael S. Tsirkin wrote:
> On Tue, Jun 16, 2026 at 05:15:13PM +1000, Gavin Shan wrote:
>>
>> On 6/16/26 4:17 PM, Michael S. Tsirkin wrote:
>>> On Tue, Jun 16, 2026 at 03:25:51PM +1000, Gavin Shan wrote:
>>>> All ram device regions were turned to be indirectly accessible by commit
>>>> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
>>>> to guest hang on attempt to build 'cuda-samples' as reported by Julia. The
>>>> guest is started by the following command lines, with GH100 GPU card passed
>>>> from the host.
>>>>
>>>>      host$ lspci | grep GH100
>>>>      0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
>>>>      host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64            \
>>>>            -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T         \
>>>>            -accel kvm -cpu host -smp cpus=48 -m size=8G                       \
>>>>            -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0    \
>>>>            -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4     \
>>>>            -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
>>>>              :
>>>>      guest$ cd cuda-samples/build
>>>>      guest$ make -j 20 clean
>>>>      guest$ make -j 20
>>>>              :
>>>>      [ 54%] Linking CUDA executable graphMemoryNodes
>>>>      [ 54%] Built target graphMemoryNodes
>>>>      <no more output afterwards, guest becomes frozen here>
>>>>
>>>>      guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
>>>>      [  555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
>>>>
>>>> When the GPU's driver (NVidia open driver) is loaded on guest bootup,
>>>> the memory blocks residing in the PCI BAR#4 of the GH100 GPU card can
>>>> be presented to the guest through memory hot-add. The page cache can
>>>> then be allocated from the hot added memory blocks when cuda-samples
>>>> is being built. Afterwards, the page cache is sent to QEMU's virtio-blk
>>>> device as part of the DMA request, the bounce buffer has to be used to
>>>> accomodate the request as the corresponding memory region (MemoryRegion)
>>>> is an indirectly accessible ram device region in qemu. However, the max
>>>> bounce bufer size is only 4096 bytes by default and that is exhausted
>>>> quickly, leading to a reset on the virtio-blk device and frozen guest
>>>> eventually.
>>>>
>>>>     QEMU
>>>>     ====
>>>>     virtio_blk_handle_output
>>>>       virtio_blk_handle_vq
>>>>         virtio_blk_get_request
>>>>           virtqueue_pop
>>>>             virtqueue_split_pop
>>>>               virtqueue_map_desc
>>>>                 address_space_map
>>>>                   memory_access_is_direct         # Return false
>>>>                     memory_region_supports_direct_access
>>>>
>>>>     (qemu) info mtree
>>>>     memory-region: pci_bridge_pci
>>>>       0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
>>>>         0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
>>>>           0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
>>>>             0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
>>>>
>>>> This adds qemu_ram_{copy, move}() and replaces {memcpy, memmove}() with
>>>> them in the ram device memory region accessors, similar to what's done
>>>> in commit 4a2e242bbb so that the issue (MMIO access instructions were
>>>> optimized to SSE instructions) covered by that commit is fixed. This
>>>> makes 'ram_device_mem_ops' redundant, paving the way to revert that
>>>> commit to make ram device region directly accessible again in the next
>>>> patch.
>>>>
>>>> Reported-by: Julia Graham <jugraham@redhat.com>
>>>> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
>>>> Suggested-by: Peter Xu <peterx@redhat.com>
>>>> Suggested-by: Richard Henderson <richard.henderson@linaro.org>
>>>> Suggested-by: Peter Maydell <peter.maydell@linaro.org>
>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>> ---
>>>> v3: Documentation for qemu_ram_{copy, move}           (Peter/Michael)
>>>>       Support qemu_ram_move() for overlapped src/dest   (Richard)
>>>>       Use {memcpy, memmove} if step is 16-bytes or more (Michael)
>>>>       Code improvements                                 (Richard/Michael)
>>>> ---
>>>>    hw/remote/vfio-user-obj.c |   4 +-
>>>>    include/system/memory.h   |  32 ++++++-
>>>>    system/physmem.c          | 178 +++++++++++++++++++++++++++++++++++++-
>>>>    3 files changed, 207 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
>>>> index 87fa7b6572..97a6c88780 100644
>>>> --- a/hw/remote/vfio-user-obj.c
>>>> +++ b/hw/remote/vfio-user-obj.c
>>>> @@ -375,9 +375,9 @@ static int vfu_object_mr_rw(MemoryRegion *mr, uint8_t *buf, hwaddr offset,
>>>>            ram_ptr = memory_region_get_ram_ptr(mr);
>>>>            if (is_write) {
>>>> -            memcpy((ram_ptr + offset), buf, size);
>>>> +            qemu_ram_copy(ram_ptr + offset, buf, size);
>>>>            } else {
>>>> -            memcpy(buf, (ram_ptr + offset), size);
>>>> +            qemu_ram_copy(buf, ram_ptr + offset, size);
>>>>            }
>>>>            return 0;
>>>> diff --git a/include/system/memory.h b/include/system/memory.h
>>>> index 1417132f6d..84203c312d 100644
>>>> --- a/include/system/memory.h
>>>> +++ b/include/system/memory.h
>>>> @@ -2897,6 +2897,36 @@ void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
>>>>    void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
>>>>    /* Internal functions, part of the implementation of address_space_read.  */
>>>> +
>>>> +/**
>>>> + * qemu_ram_copy: copy data to ramblock
>>>> + *
>>>> + * @dst: destination where the data is copied to
>>>> + * @src: source where the data is copied from
>>>> + * @n: length of data to be copied
>>>> + *
>>>> + *
>>>> + * Copy @n bytes from @src to @dst with the assumption that @src and @dst
>>>> + * do not overlap. Handles special cases such as uncacheable ramblocks
>>>> + * correctly. Use this for accessing ramblock in response to DMA/VCPU IO,
>>>> + * in preference to memcpy().
>>>> + */
>>>> +void qemu_ram_copy(void *dest, const void *src, size_t n);
>>>> +
>>>> +/**
>>>> + * qemu_ram_move: move data to ramblock
>>>> + *
>>>> + * @dst: destination where the data is moved to
>>>> + * @src: source where the data is moved from
>>>> + * @n: length of data to be moved
>>>> + *
>>>> + * Move @n bytes from @src to @dst with the assumption that @src and @dst
>>>> + * can overlap. Handles special cases such as uncacheable ramblocks
>>>> + * correctly. Use this for accessing ramblock in response to DMA/VCPU IO,
>>>> + * in preference to memmove().
>>>> + */
>>>> +void qemu_ram_move(void *dest, const void *src, size_t n);
>>>> +
>>>>    MemTxResult address_space_read_full(AddressSpace *as, hwaddr addr,
>>>>                                        MemTxAttrs attrs, void *buf, hwaddr len);
>>>>    MemTxResult flatview_read_continue(FlatView *fv, hwaddr addr,
>>>> @@ -2970,7 +3000,7 @@ MemTxResult address_space_read(AddressSpace *as, hwaddr addr,
>>>>                mr = flatview_translate(fv, addr, &addr1, &l, false, attrs);
>>>>                if (len == l && memory_access_is_direct(mr, false, attrs)) {
>>>>                    ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
>>>> -                memcpy(buf, ptr, len);
>>>> +                qemu_ram_copy(buf, ptr, len);
>>>>                } else {
>>>>                    result = flatview_read_continue(fv, addr, attrs, buf, len,
>>>>                                                    addr1, l, mr);
>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>> index 7bcbf87573..45a17cd580 100644
>>>> --- a/system/physmem.c
>>>> +++ b/system/physmem.c
>>>> @@ -3160,6 +3160,177 @@ void memory_region_flush_rom_device(MemoryRegion *mr, hwaddr addr, hwaddr size)
>>>>        invalidate_and_set_dirty(mr, addr, size);
>>>>    }
>>>> +static void qemu_ram_copy_aligned(void *dst, const void *src, size_t n)
>>>> +{
>>>> +    switch (n) {
>>>> +    case 1:
>>>> +        __builtin_memcpy(dst, src, 1);
>>>> +        break;
>>>> +    case 2:
>>>> +        __builtin_memcpy(dst, src, 2);
>>>> +        break;
>>>> +    case 4:
>>>> +        __builtin_memcpy(dst, src, 4);
>>>> +        break;
>>>> +    case 8:
>>>> +        __builtin_memcpy(dst, src, 8);
>>>> +        break;
>>>> +    default:
>>>> +        memcpy(dst, src, n);
>>>> +    }
>>>> +}
>>>> +
>>>> +static void qemu_ram_move_aligned(void *dst, const void *src, size_t n)
>>>> +{
>>>> +    switch (n) {
>>>> +    case 1:
>>>> +        __builtin_memmove(dst, src, 1);
>>>> +        break;
>>>> +    case 2:
>>>> +        __builtin_memmove(dst, src, 2);
>>>> +        break;
>>>> +    case 4:
>>>> +        __builtin_memmove(dst, src, 4);
>>>> +        break;
>>>> +    case 8:
>>>> +        __builtin_memmove(dst, src, 8);
>>>> +        break;
>>>> +    default:
>>>> +        memmove(dst, src, n);
>>>> +    }
>>>> +}
>>>
>>> A weird name for this. You call it for both aligned and unaligned.
>>>
>>> maybe this is _unsafe and the "unaligned" below is _safe ?
>>>
>>
>> Yeah, xxx_{aligned, unaligned}() should be renamed to xxx_{unsafe, safe}(). The
>> question is why we're away from __builtin_{memcpy, memmove}() and memcpy/memmove
>> since they're not safe as the new function name indicates? :-)
> 
> What is the question, exactly? That's why I made the list of 11 issues.
> Is that unclear, somehow?  Another reason to include that, maybe.
> 

We can take the e1000 issue on aarch64 for example:

Link: https://lore.kernel.org/qemu-devel/20260527091711.3901-1-liugang24219@sangfor.com.cn/

The software allocate ring buffer for hardware to receive packet.
In normal case, the hardware fill the RX ring descriptor, and with status DD bit=1 (Descriptor Done),
then the ownership of the ring descriptor is converted to software, after the software consume
the descriptor, it will write status DD bit=0 and give it back to hardware to refill data.

pci_dma_write calls memcpy at the underlying level, and on aarch64 with glibc 2.24+,
the memcpy/memmove implementation uses a branchless sequence that copies
the same byte three times when count == 1.

So the issue case:

qemu/e1000                      |       guest/dpdk
-----------------------------------------------------
memcpy(dst, &desc.status, 1)
   1st write dst DD=1
                                    poll and see DD==1
                                    consume the descriptor
                                    write status DD=0 back
   2nd write dst DD=1
   3rd write dst DD=1

That left the dirty state, and RX will hang.

>> If we're away from memcpy/memmove() and switch to xxx_safe() for x86, we shouldn't
>> have any more concerns except the performance lost. At least, everything should
>> be working from function POV because we're not affected by glibc and the optimization
>> enforced by the compiler any more.
> 
> There's no way to write C and not depend on compiler's whims for
> weird things such as unaligned memory accesses.
> 
>>>
>>>
>>>
>>>> +
>>>> +static void qemu_ram_copy_unaligned(void *dst, const void *src,
>>>> +                                    size_t n, size_t max_step)
>>>> +{
>>>> +    uintptr_t test, step;
>>>> +
>>>> +    /* Aligned maximal step */
>>>> +    max_step = pow2floor(max_step);
>>>> +
>>>> +    while (n) {
>>>> +        test = (uintptr_t)src | (uintptr_t)dst | n | max_step;
>>>> +        step = test & -test;
>>>> +
>>>> +        switch (step) {
>>>> +        case 1:
>>>> +            qatomic_set((uint8_t *)dst, qatomic_read((uint8_t *)src));
>>>> +            src += 1;
>>>> +            dst += 1;
>>>> +            n -= 1;
>>>> +            break;
>>>> +        case 2:
>>>> +            qatomic_set((uint16_t *)dst, qatomic_read((uint16_t *)src));
>>>> +            src += 2;
>>>> +            dst += 2;
>>>> +            n -= 2;
>>>> +            break;
>>>> +        case 4:
>>>> +            qatomic_set((uint32_t *)dst, qatomic_read((uint32_t *)src));
>>>> +            src += 4;
>>>> +            dst += 4;
>>>> +            n -= 4;
>>>> +            break;
>>>> +        case 8:
>>>> +            qatomic_set((uint64_t *)dst, qatomic_read((uint64_t *)src));
>>>> +            src += 8;
>>>> +            dst += 8;
>>>> +            n -= 8;
>>>> +            break;
>>>> +        default:
>>>> +            memcpy(dst, src, step);
>>>> +            src += step;
>>>> +            dst += step;
>>>> +            n -= step;
>>>> +        }
>>>> +    }
>>>> +}
>>>> +
>>>> +static void qemu_ram_backwards_copy_unaligned(void *dst, const void *src,
>>>> +                                              size_t n, size_t max_step)
>>>> +{
>>>> +    uintptr_t test, step;
>>>> +
>>>> +    /* Aligned maximal step */
>>>> +    max_step = pow2floor(max_step);
>>>> +
>>>> +    /* End of the blocks */
>>>> +    src += n;
>>>> +    dst += n;
>>>> +
>>>> +    while (n) {
>>>> +        test = (uintptr_t)src | (uintptr_t)dst | n | max_step;
>>>> +        step = test & -test;
>>>> +
>>>> +        switch (step) {
>>>> +        case 1:
>>>> +            src -= 1;
>>>> +            dst -= 1;
>>>> +            n -= 1;
>>>> +            qatomic_set((uint8_t *)dst, qatomic_read((uint8_t *)src));
>>>> +            break;
>>>> +        case 2:
>>>> +            src -= 2;
>>>> +            dst -= 2;
>>>> +            n -= 2;
>>>> +            qatomic_set((uint16_t *)dst, qatomic_read((uint16_t *)src));
>>>> +            break;
>>>> +        case 4:
>>>> +            src -= 4;
>>>> +            dst -= 4;
>>>> +            n -= 4;
>>>> +            qatomic_set((uint32_t *)dst, qatomic_read((uint32_t *)src));
>>>> +            break;
>>>> +        case 8:
>>>> +            src -= 8;
>>>> +            dst -= 8;
>>>> +            n -= 8;
>>>> +            qatomic_set((uint64_t *)dst, qatomic_read((uint64_t *)src));
>>>> +            break;
>>>> +        default:
>>>> +            src -= step;
>>>> +            dst -= step;
>>>> +            n -= step;
>>>> +            memmove(dst, src, step);
>>
>> This should be memcpy() actually since src/dst don't overlap. However, I think
>> Richard may suggest to avoid using memcpy/memmove() here.
>>
>>>> +        }
>>>> +    }
>>>> +}
>>>> +
>>>> +/* x86 should work with __builtin_{memcpy, memmove}() for IO access */
>>>> +#if defined(__i386__) || defined(__x86_64__)
>>>> +#define HOST_UNALIGNED_MMIO_OK 1
>>>
>>> maybe HOST_UNALIGNED_VECTOR_MMIO_OK ? We also care that vector stores
>>> are safe.
>>>
>>
>> Yes, it should be renamed to HOST_UNALIGNED_VECTOR_MMIO_OK.
>>
>>>> +#else
>>>> +#define HOST_UNALIGNED_MMIO_OK 0
>>>> +#endif
>>>> +
>>>> +void qemu_ram_copy(void *dst, const void *src, size_t n)
>>>> +{
>>>> +    if (dst == src || n == 0) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (HOST_UNALIGNED_MMIO_OK) {
>>>> +        qemu_ram_copy_aligned(dst, src, n);
>>>> +    } else {
>>>> +        qemu_ram_copy_unaligned(dst, src, n, 8);
>>>> +    }
>>>> +}
>>>> +
>>>> +void qemu_ram_move(void *dst, const void *src, size_t n)
>>>> +{
>>>> +    if (src == dst || n == 0) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (HOST_UNALIGNED_MMIO_OK) {
>>>> +        qemu_ram_move_aligned(dst, src, n);
>>>> +    } else if (dst < src) {
>>>> +        qemu_ram_copy_unaligned(dst, src, n, src - dst);
>>>> +    } else {
>>>> +        qemu_ram_backwards_copy_unaligned(dst, src, n, dst - src);
>>>> +    }
>>>> +}
>>>> +
>>>>    int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr)
>>>>    {
>>>>        unsigned access_size_max = mr->ops->valid.max_access_size;
>>>> @@ -3272,7 +3443,7 @@ static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
>>>>            uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
>>>>                                                   false, true);
>>>> -        memmove(ram_ptr, buf, *l);
>>>> +        qemu_ram_move(ram_ptr, buf, *l);
>>>>            invalidate_and_set_dirty(mr, mr_addr, *l);
>>>>            return MEMTX_OK;
>>>> @@ -3365,7 +3536,7 @@ static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
>>>>            uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
>>>>                                                   false, false);
>>>> -        memcpy(buf, ram_ptr, *l);
>>>> +        qemu_ram_copy(buf, ram_ptr, *l);
>>>>            return MEMTX_OK;
>>>>        }
>>>> @@ -3503,8 +3674,7 @@ MemTxResult address_space_write_rom(AddressSpace *as, hwaddr addr,
>>>>                l = memory_access_size(mr, l, addr1);
>>>>            } else {
>>>>                /* ROM/RAM case */
>>>> -            void *ram_ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
>>>> -            memcpy(ram_ptr, buf, l);
>>>> +            qemu_ram_copy(qemu_map_ram_ptr(mr->ram_block, addr1), buf, l);
>>>>                invalidate_and_set_dirty(mr, addr1, l);
>>>>            }
>>>>            len -= l;
>>>> -- 
>>>> 2.54.0
>>>
>>
>> Thanks,
>> Gavin
> 
> 
> 

-- 
Thanks,
- Ding Hui



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-16 12:50         ` Ding Hui
@ 2026-06-16 15:51           ` Michael S. Tsirkin
  2026-06-16 23:01             ` Gavin Shan
  0 siblings, 1 reply; 27+ messages in thread
From: Michael S. Tsirkin @ 2026-06-16 15:51 UTC (permalink / raw)
  To: Ding Hui
  Cc: Gavin Shan, qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	peter.maydell, berrange, philmd, philmd, david, clg, pbonzini,
	phrdina, jugraham, liugang24219, shan.gavin

On Tue, Jun 16, 2026 at 08:50:27PM +0800, Ding Hui wrote:
> > What is the question, exactly? That's why I made the list of 11 issues.
> > Is that unclear, somehow?  Another reason to include that, maybe.
> > 
> 
> We can take the e1000 issue on aarch64 for example:
> 
> Link: https://lore.kernel.org/qemu-devel/20260527091711.3901-1-liugang24219@sangfor.com.cn/
> 
> The software allocate ring buffer for hardware to receive packet.
> In normal case, the hardware fill the RX ring descriptor, and with status DD bit=1 (Descriptor Done),
> then the ownership of the ring descriptor is converted to software, after the software consume
> the descriptor, it will write status DD bit=0 and give it back to hardware to refill data.
> 
> pci_dma_write calls memcpy at the underlying level, and on aarch64 with glibc 2.24+,
> the memcpy/memmove implementation uses a branchless sequence that copies
> the same byte three times when count == 1.

Yes this is issue 11:

10. on non-x86 memcpy will do multiple overlapping stores even
for single byte writes. E.g. it does it to avoid extra branches.
This is causing issues in practice.


-- 
MST



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-16 15:51           ` Michael S. Tsirkin
@ 2026-06-16 23:01             ` Gavin Shan
  0 siblings, 0 replies; 27+ messages in thread
From: Gavin Shan @ 2026-06-16 23:01 UTC (permalink / raw)
  To: Michael S. Tsirkin, Ding Hui
  Cc: qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	peter.maydell, berrange, philmd, philmd, david, clg, pbonzini,
	phrdina, jugraham, liugang24219, shan.gavin

On 6/17/26 1:51 AM, Michael S. Tsirkin wrote:
> On Tue, Jun 16, 2026 at 08:50:27PM +0800, Ding Hui wrote:
>>> What is the question, exactly? That's why I made the list of 11 issues.
>>> Is that unclear, somehow?  Another reason to include that, maybe.
>>>
>>
>> We can take the e1000 issue on aarch64 for example:
>>
>> Link: https://lore.kernel.org/qemu-devel/20260527091711.3901-1-liugang24219@sangfor.com.cn/
>>
>> The software allocate ring buffer for hardware to receive packet.
>> In normal case, the hardware fill the RX ring descriptor, and with status DD bit=1 (Descriptor Done),
>> then the ownership of the ring descriptor is converted to software, after the software consume
>> the descriptor, it will write status DD bit=0 and give it back to hardware to refill data.
>>
>> pci_dma_write calls memcpy at the underlying level, and on aarch64 with glibc 2.24+,
>> the memcpy/memmove implementation uses a branchless sequence that copies
>> the same byte three times when count == 1.
> 
> Yes this is issue 11:
> 
> 10. on non-x86 memcpy will do multiple overlapping stores even
> for single byte writes. E.g. it does it to avoid extra branches.
> This is causing issues in practice.
> 

This issue should be fixed by this series. With this series applied, the memcpy/memmove
are replaced with 'qatomic_set((uint8_t *)dst, qatomic_read((uint8_t *)src))' in
qemu_ram_{copy, backwards_copy}_unaligned(). I hope Ding can give this series a try
to confirm.

After this series is applied:

   pci_dma_write
     pci_dma_rw
       dma_memory_rw
         dma_memory_rw_relaxed
           address_space_rw
             address_space_write
               flatview_write
                 flatview_write_continue
                   flatview_write_continue_step
                     memory_access_is_direct          // return true
                     qemu_ram_move                    // replaced original memmove()
                       qemu_ram_{copy, backwards_copy}_unaligned
                         qatomic_set((uint8_t *)dst, qatomic_read((uint8_t *)src));

For this specific case, what's done in this series is similar to the proposed
in that thread [1]. In the proposal, pci_dma_write() is replaced with address_space_stb(),
and eventually bail into "*(uint8_t *)ptr = v;". However, address_space_stb()
seems not friendly to DMA write from from syntax level because the limited
size is supported by it.

   [1] https://lore.kernel.org/qemu-devel/20260527091711.3901-1-liugang24219@sangfor.com.cn/

   address_space_stb
     address_space_stm_internal
       stm_p
         stb_p
           *(uint8_t *)ptr = v;

Thanks,
Gavin



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-16  5:25 ` [PATCH v3 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors Gavin Shan
  2026-06-16  6:17   ` Michael S. Tsirkin
@ 2026-06-25 10:09   ` Peter Maydell
  2026-06-25 11:07     ` Michael S. Tsirkin
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Maydell @ 2026-06-25 10:09 UTC (permalink / raw)
  To: Gavin Shan
  Cc: qemu-arm, qemu-devel, mst, peterx, alex, richard.henderson,
	berrange, philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

On Tue, 16 Jun 2026 at 06:26, Gavin Shan <gshan@redhat.com> wrote:
>
> All ram device regions were turned to be indirectly accessible by commit
> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> to guest hang on attempt to build 'cuda-samples' as reported by Julia. The
> guest is started by the following command lines, with GH100 GPU card passed
> from the host.
>
>    host$ lspci | grep GH100
>    0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
>    host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64            \
>          -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T         \
>          -accel kvm -cpu host -smp cpus=48 -m size=8G                       \
>          -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0    \
>          -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4     \
>          -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
>            :
>    guest$ cd cuda-samples/build
>    guest$ make -j 20 clean
>    guest$ make -j 20
>            :
>    [ 54%] Linking CUDA executable graphMemoryNodes
>    [ 54%] Built target graphMemoryNodes
>    <no more output afterwards, guest becomes frozen here>
>
>    guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
>    [  555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
>
> When the GPU's driver (NVidia open driver) is loaded on guest bootup,
> the memory blocks residing in the PCI BAR#4 of the GH100 GPU card can
> be presented to the guest through memory hot-add. The page cache can
> then be allocated from the hot added memory blocks when cuda-samples
> is being built. Afterwards, the page cache is sent to QEMU's virtio-blk
> device as part of the DMA request, the bounce buffer has to be used to
> accomodate the request as the corresponding memory region (MemoryRegion)
> is an indirectly accessible ram device region in qemu. However, the max
> bounce bufer size is only 4096 bytes by default and that is exhausted
> quickly, leading to a reset on the virtio-blk device and frozen guest
> eventually.
>
>   QEMU
>   ====
>   virtio_blk_handle_output
>     virtio_blk_handle_vq
>       virtio_blk_get_request
>         virtqueue_pop
>           virtqueue_split_pop
>             virtqueue_map_desc
>               address_space_map
>                 memory_access_is_direct         # Return false
>                   memory_region_supports_direct_access
>
>   (qemu) info mtree
>   memory-region: pci_bridge_pci
>     0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
>       0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
>         0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
>           0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
>
> This adds qemu_ram_{copy, move}() and replaces {memcpy, memmove}() with
> them in the ram device memory region accessors, similar to what's done
> in commit 4a2e242bbb so that the issue (MMIO access instructions were
> optimized to SSE instructions) covered by that commit is fixed. This
> makes 'ram_device_mem_ops' redundant, paving the way to revert that
> commit to make ram device region directly accessible again in the next
> patch.
>
> Reported-by: Julia Graham <jugraham@redhat.com>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> Suggested-by: Peter Xu <peterx@redhat.com>
> Suggested-by: Richard Henderson <richard.henderson@linaro.org>
> Suggested-by: Peter Maydell <peter.maydell@linaro.org>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
> v3: Documentation for qemu_ram_{copy, move}           (Peter/Michael)
>     Support qemu_ram_move() for overlapped src/dest   (Richard)
>     Use {memcpy, memmove} if step is 16-bytes or more (Michael)
>     Code improvements                                 (Richard/Michael)
> ---
>  hw/remote/vfio-user-obj.c |   4 +-
>  include/system/memory.h   |  32 ++++++-
>  system/physmem.c          | 178 +++++++++++++++++++++++++++++++++++++-
>  3 files changed, 207 insertions(+), 7 deletions(-)
>
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index 87fa7b6572..97a6c88780 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -375,9 +375,9 @@ static int vfu_object_mr_rw(MemoryRegion *mr, uint8_t *buf, hwaddr offset,
>          ram_ptr = memory_region_get_ram_ptr(mr);
>
>          if (is_write) {
> -            memcpy((ram_ptr + offset), buf, size);
> +            qemu_ram_copy(ram_ptr + offset, buf, size);
>          } else {
> -            memcpy(buf, (ram_ptr + offset), size);
> +            qemu_ram_copy(buf, ram_ptr + offset, size);
>          }
>
>          return 0;
> diff --git a/include/system/memory.h b/include/system/memory.h
> index 1417132f6d..84203c312d 100644
> --- a/include/system/memory.h
> +++ b/include/system/memory.h
> @@ -2897,6 +2897,36 @@ void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
>  void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
>
>  /* Internal functions, part of the implementation of address_space_read.  */
> +
> +/**
> + * qemu_ram_copy: copy data to ramblock
> + *
> + * @dst: destination where the data is copied to
> + * @src: source where the data is copied from
> + * @n: length of data to be copied
> + *
> + *
> + * Copy @n bytes from @src to @dst with the assumption that @src and @dst
> + * do not overlap. Handles special cases such as uncacheable ramblocks
> + * correctly. Use this for accessing ramblock in response to DMA/VCPU IO,
> + * in preference to memcpy().
> + */

The documentation for these functions needs to say what semantics
the function is providing (e.g. can I rely on it to do a 4 byte aligned
load as a single 4 byte read?). This does not.

> +/* x86 should work with __builtin_{memcpy, memmove}() for IO access */
> +#if defined(__i386__) || defined(__x86_64__)
> +#define HOST_UNALIGNED_MMIO_OK 1
> +#else
> +#define HOST_UNALIGNED_MMIO_OK 0
> +#endif

This is still wrong. We should not have "x86 magically works
and all other hosts do something different" ifdefs. Define what
semantics you need and then we can figure out how to
implement them.

My current thought is that we need to handle accesses of
1, 2, 4 and 8 bytes that are naturally aligned by ensuring that we
do exactly one host load/store of that type, and that anything else
is "the guest isn't relying on specific semantics here, we can just
assume it's plain old RAM and do whatever". That would not
require any architecture specific ifdefs.

thanks
-- PMM


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-25 10:09   ` Peter Maydell
@ 2026-06-25 11:07     ` Michael S. Tsirkin
  2026-06-25 12:48       ` Peter Maydell
  0 siblings, 1 reply; 27+ messages in thread
From: Michael S. Tsirkin @ 2026-06-25 11:07 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Gavin Shan, qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	berrange, philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

On Thu, Jun 25, 2026 at 11:09:26AM +0100, Peter Maydell wrote:
> On Tue, 16 Jun 2026 at 06:26, Gavin Shan <gshan@redhat.com> wrote:
> >
> > All ram device regions were turned to be indirectly accessible by commit
> > 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> > to guest hang on attempt to build 'cuda-samples' as reported by Julia. The
> > guest is started by the following command lines, with GH100 GPU card passed
> > from the host.
> >
> >    host$ lspci | grep GH100
> >    0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
> >    host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64            \
> >          -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T         \
> >          -accel kvm -cpu host -smp cpus=48 -m size=8G                       \
> >          -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0    \
> >          -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4     \
> >          -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
> >            :
> >    guest$ cd cuda-samples/build
> >    guest$ make -j 20 clean
> >    guest$ make -j 20
> >            :
> >    [ 54%] Linking CUDA executable graphMemoryNodes
> >    [ 54%] Built target graphMemoryNodes
> >    <no more output afterwards, guest becomes frozen here>
> >
> >    guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
> >    [  555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
> >
> > When the GPU's driver (NVidia open driver) is loaded on guest bootup,
> > the memory blocks residing in the PCI BAR#4 of the GH100 GPU card can
> > be presented to the guest through memory hot-add. The page cache can
> > then be allocated from the hot added memory blocks when cuda-samples
> > is being built. Afterwards, the page cache is sent to QEMU's virtio-blk
> > device as part of the DMA request, the bounce buffer has to be used to
> > accomodate the request as the corresponding memory region (MemoryRegion)
> > is an indirectly accessible ram device region in qemu. However, the max
> > bounce bufer size is only 4096 bytes by default and that is exhausted
> > quickly, leading to a reset on the virtio-blk device and frozen guest
> > eventually.
> >
> >   QEMU
> >   ====
> >   virtio_blk_handle_output
> >     virtio_blk_handle_vq
> >       virtio_blk_get_request
> >         virtqueue_pop
> >           virtqueue_split_pop
> >             virtqueue_map_desc
> >               address_space_map
> >                 memory_access_is_direct         # Return false
> >                   memory_region_supports_direct_access
> >
> >   (qemu) info mtree
> >   memory-region: pci_bridge_pci
> >     0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
> >       0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
> >         0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
> >           0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
> >
> > This adds qemu_ram_{copy, move}() and replaces {memcpy, memmove}() with
> > them in the ram device memory region accessors, similar to what's done
> > in commit 4a2e242bbb so that the issue (MMIO access instructions were
> > optimized to SSE instructions) covered by that commit is fixed. This
> > makes 'ram_device_mem_ops' redundant, paving the way to revert that
> > commit to make ram device region directly accessible again in the next
> > patch.
> >
> > Reported-by: Julia Graham <jugraham@redhat.com>
> > Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> > Suggested-by: Peter Xu <peterx@redhat.com>
> > Suggested-by: Richard Henderson <richard.henderson@linaro.org>
> > Suggested-by: Peter Maydell <peter.maydell@linaro.org>
> > Signed-off-by: Gavin Shan <gshan@redhat.com>
> > ---
> > v3: Documentation for qemu_ram_{copy, move}           (Peter/Michael)
> >     Support qemu_ram_move() for overlapped src/dest   (Richard)
> >     Use {memcpy, memmove} if step is 16-bytes or more (Michael)
> >     Code improvements                                 (Richard/Michael)
> > ---
> >  hw/remote/vfio-user-obj.c |   4 +-
> >  include/system/memory.h   |  32 ++++++-
> >  system/physmem.c          | 178 +++++++++++++++++++++++++++++++++++++-
> >  3 files changed, 207 insertions(+), 7 deletions(-)
> >
> > diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> > index 87fa7b6572..97a6c88780 100644
> > --- a/hw/remote/vfio-user-obj.c
> > +++ b/hw/remote/vfio-user-obj.c
> > @@ -375,9 +375,9 @@ static int vfu_object_mr_rw(MemoryRegion *mr, uint8_t *buf, hwaddr offset,
> >          ram_ptr = memory_region_get_ram_ptr(mr);
> >
> >          if (is_write) {
> > -            memcpy((ram_ptr + offset), buf, size);
> > +            qemu_ram_copy(ram_ptr + offset, buf, size);
> >          } else {
> > -            memcpy(buf, (ram_ptr + offset), size);
> > +            qemu_ram_copy(buf, ram_ptr + offset, size);
> >          }
> >
> >          return 0;
> > diff --git a/include/system/memory.h b/include/system/memory.h
> > index 1417132f6d..84203c312d 100644
> > --- a/include/system/memory.h
> > +++ b/include/system/memory.h
> > @@ -2897,6 +2897,36 @@ void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
> >  void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
> >
> >  /* Internal functions, part of the implementation of address_space_read.  */
> > +
> > +/**
> > + * qemu_ram_copy: copy data to ramblock
> > + *
> > + * @dst: destination where the data is copied to
> > + * @src: source where the data is copied from
> > + * @n: length of data to be copied
> > + *
> > + *
> > + * Copy @n bytes from @src to @dst with the assumption that @src and @dst
> > + * do not overlap. Handles special cases such as uncacheable ramblocks
> > + * correctly. Use this for accessing ramblock in response to DMA/VCPU IO,
> > + * in preference to memcpy().
> > + */
> 
> The documentation for these functions needs to say what semantics
> the function is providing (e.g. can I rely on it to do a 4 byte aligned
> load as a single 4 byte read?). This does not.
> 
> > +/* x86 should work with __builtin_{memcpy, memmove}() for IO access */
> > +#if defined(__i386__) || defined(__x86_64__)
> > +#define HOST_UNALIGNED_MMIO_OK 1
> > +#else
> > +#define HOST_UNALIGNED_MMIO_OK 0
> > +#endif
> 
> This is still wrong. We should not have "x86 magically works
> and all other hosts do something different" ifdefs. Define what
> semantics you need and then we can figure out how to
> implement them.
> 
> My current thought is that we need to handle accesses of
> 1, 2, 4 and 8 bytes that are naturally aligned by ensuring that we
> do exactly one host load/store of that type, and that anything else
> is "the guest isn't relying on specific semantics here, we can just
> assume it's plain old RAM and do whatever". That would not
> require any architecture specific ifdefs.
> 
> thanks
> -- PMM


Well. X86 is special, as usual. It allows unaligned mmio so
we really have no way to know an x86 guest does not intend
just that. That can only be emulated perfectly on x86
which is sad but I see no reason to actively break it.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-25 11:07     ` Michael S. Tsirkin
@ 2026-06-25 12:48       ` Peter Maydell
  2026-06-25 13:23         ` Michael S. Tsirkin
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Maydell @ 2026-06-25 12:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gavin Shan, qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	berrange, philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

On Thu, 25 Jun 2026 at 12:07, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Thu, Jun 25, 2026 at 11:09:26AM +0100, Peter Maydell wrote:
> > This is still wrong. We should not have "x86 magically works
> > and all other hosts do something different" ifdefs. Define what
> > semantics you need and then we can figure out how to
> > implement them.
> >
> > My current thought is that we need to handle accesses of
> > 1, 2, 4 and 8 bytes that are naturally aligned by ensuring that we
> > do exactly one host load/store of that type, and that anything else
> > is "the guest isn't relying on specific semantics here, we can just
> > assume it's plain old RAM and do whatever". That would not
> > require any architecture specific ifdefs.

> Well. X86 is special, as usual. It allows unaligned mmio so
> we really have no way to know an x86 guest does not intend
> just that. That can only be emulated perfectly on x86
> which is sad but I see no reason to actively break it.

What actual hardware will rely on unaligned accesses
appearing to it as single unaligned accesses?

And if we do need this, we need to specify:
- what are the semantics we need to provide?
- what is the use case that relies on these?
Then we can see how closely we can match that.

There's a lot of extra complexity added in here for
something it I don't think is ever going to matter,
and having the most common host architecture take a
totally different codepath from everything else is
a recipe for that other codepath breaking.

-- PMM


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-25 12:48       ` Peter Maydell
@ 2026-06-25 13:23         ` Michael S. Tsirkin
  2026-06-25 14:02           ` Peter Maydell
  0 siblings, 1 reply; 27+ messages in thread
From: Michael S. Tsirkin @ 2026-06-25 13:23 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Gavin Shan, qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	berrange, philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

On Thu, Jun 25, 2026 at 01:48:11PM +0100, Peter Maydell wrote:
> On Thu, 25 Jun 2026 at 12:07, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Thu, Jun 25, 2026 at 11:09:26AM +0100, Peter Maydell wrote:
> > > This is still wrong. We should not have "x86 magically works
> > > and all other hosts do something different" ifdefs. Define what
> > > semantics you need and then we can figure out how to
> > > implement them.
> > >
> > > My current thought is that we need to handle accesses of
> > > 1, 2, 4 and 8 bytes that are naturally aligned by ensuring that we
> > > do exactly one host load/store of that type, and that anything else
> > > is "the guest isn't relying on specific semantics here, we can just
> > > assume it's plain old RAM and do whatever". That would not
> > > require any architecture specific ifdefs.
> 
> > Well. X86 is special, as usual. It allows unaligned mmio so
> > we really have no way to know an x86 guest does not intend
> > just that. That can only be emulated perfectly on x86
> > which is sad but I see no reason to actively break it.
> 
> What actual hardware will rely on unaligned accesses
> appearing to it as single unaligned accesses?

/me shrugs
any hardware that was only tested under windows?

> And if we do need this, we need to specify:
> - what are the semantics we need to provide?

match what happens on bare metal

> - what is the use case that relies on these?
> Then we can see how closely we can match that.

passhtrough of weird devices, first of all.

> 
> There's a lot of extra complexity added in here for
> something it I don't think is ever going to matter,
> and having the most common host architecture take a
> totally different codepath from everything else is
> a recipe for that other codepath breaking.
> 
> -- PMM

Not sure we have to be pedantic. We have arch specific code and the
world did not end.  It's not a lot of code, it's just for unaligned
accesses on MMIO, which might or might not be broken anyway.
And it worked already, so it's more about going back to what we
did earlier.

-- 
MST



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-25 13:23         ` Michael S. Tsirkin
@ 2026-06-25 14:02           ` Peter Maydell
  2026-06-25 14:52             ` Michael S. Tsirkin
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Maydell @ 2026-06-25 14:02 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gavin Shan, qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	berrange, philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

On Thu, 25 Jun 2026 at 14:23, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Thu, Jun 25, 2026 at 01:48:11PM +0100, Peter Maydell wrote:
> > On Thu, 25 Jun 2026 at 12:07, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Thu, Jun 25, 2026 at 11:09:26AM +0100, Peter Maydell wrote:
> > > > This is still wrong. We should not have "x86 magically works
> > > > and all other hosts do something different" ifdefs. Define what
> > > > semantics you need and then we can figure out how to
> > > > implement them.
> > > >
> > > > My current thought is that we need to handle accesses of
> > > > 1, 2, 4 and 8 bytes that are naturally aligned by ensuring that we
> > > > do exactly one host load/store of that type, and that anything else
> > > > is "the guest isn't relying on specific semantics here, we can just
> > > > assume it's plain old RAM and do whatever". That would not
> > > > require any architecture specific ifdefs.
> >
> > > Well. X86 is special, as usual. It allows unaligned mmio so
> > > we really have no way to know an x86 guest does not intend
> > > just that. That can only be emulated perfectly on x86
> > > which is sad but I see no reason to actively break it.
> >
> > What actual hardware will rely on unaligned accesses
> > appearing to it as single unaligned accesses?
>
> /me shrugs
> any hardware that was only tested under windows?

I seriously doubt any hardware designer implements
registers that are not aligned, especially on PCI devices.
"Only tested under Windows" doesn't imply "hardware
designer did something nuts".

> > And if we do need this, we need to specify:
> > - what are the semantics we need to provide?
>
> match what happens on bare metal

This is too vague. We need to write down what we mean,
in enough specifics that we can then say "OK, we can
do this this way", or "we can do this this way on
some hosts, and these corner cases are not covered
on other hosts".

> > - what is the use case that relies on these?
> > Then we can see how closely we can match that.
>
> passhtrough of weird devices, first of all.

This isn't a use case, it's a vague hypothetical.

> > There's a lot of extra complexity added in here for
> > something it I don't think is ever going to matter,
> > and having the most common host architecture take a
> > totally different codepath from everything else is
> > a recipe for that other codepath breaking.

> Not sure we have to be pedantic. We have arch specific code and the
> world did not end.  It's not a lot of code, it's just for unaligned
> accesses on MMIO, which might or might not be broken anyway.
> And it worked already, so it's more about going back to what we
> did earlier.

But we should not have arch specific code unless we
have a reason to do it. And if we have a reason to do
it then we should say what that reason is and what the
cases the host has to cover are, so that when we can
implement it correctly we do that.

-- PMM


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-25 14:02           ` Peter Maydell
@ 2026-06-25 14:52             ` Michael S. Tsirkin
  2026-06-25 15:23               ` Peter Maydell
  0 siblings, 1 reply; 27+ messages in thread
From: Michael S. Tsirkin @ 2026-06-25 14:52 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Gavin Shan, qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	berrange, philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

On Thu, Jun 25, 2026 at 03:02:42PM +0100, Peter Maydell wrote:
> On Thu, 25 Jun 2026 at 14:23, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Thu, Jun 25, 2026 at 01:48:11PM +0100, Peter Maydell wrote:
> > > On Thu, 25 Jun 2026 at 12:07, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Thu, Jun 25, 2026 at 11:09:26AM +0100, Peter Maydell wrote:
> > > > > This is still wrong. We should not have "x86 magically works
> > > > > and all other hosts do something different" ifdefs. Define what
> > > > > semantics you need and then we can figure out how to
> > > > > implement them.
> > > > >
> > > > > My current thought is that we need to handle accesses of
> > > > > 1, 2, 4 and 8 bytes that are naturally aligned by ensuring that we
> > > > > do exactly one host load/store of that type, and that anything else
> > > > > is "the guest isn't relying on specific semantics here, we can just
> > > > > assume it's plain old RAM and do whatever". That would not
> > > > > require any architecture specific ifdefs.
> > >
> > > > Well. X86 is special, as usual. It allows unaligned mmio so
> > > > we really have no way to know an x86 guest does not intend
> > > > just that. That can only be emulated perfectly on x86
> > > > which is sad but I see no reason to actively break it.
> > >
> > > What actual hardware will rely on unaligned accesses
> > > appearing to it as single unaligned accesses?
> >
> > /me shrugs
> > any hardware that was only tested under windows?
> 
> I seriously doubt any hardware designer implements
> registers that are not aligned, especially on PCI devices.
> "Only tested under Windows" doesn't imply "hardware
> designer did something nuts".

Well I worry.

> > > And if we do need this, we need to specify:
> > > - what are the semantics we need to provide?
> >
> > match what happens on bare metal
> 
> This is too vague. We need to write down what we mean,
> in enough specifics that we can then say "OK, we can
> do this this way", or "we can do this this way on
> some hosts, and these corner cases are not covered
> on other hosts".

I think there is exactly 1 kinda reasonable case. A 2 byte read/write at
offset 0x1 within a dword. This maps nicely to even classical PCI byte
enable mechanism and so yes it works if your CPU can initiate these
things, and it's atomic.

I tried reading LEDCTL on e1000e:

       byte  @ 0xe00: 0x64
       byte  @ 0xe01: 0x2a
       byte  @ 0xe02: 0x00
       byte  @ 0xe03: 0x00
       word  @ 0xe00: 0x2a64
       word  @ 0xe01: 0x002a

Works fine.

How important is it to have a script like this to work
correctly? I don't know.


> > > - what is the use case that relies on these?
> > > Then we can see how closely we can match that.
> >
> > passhtrough of weird devices, first of all.
> 
> This isn't a use case, it's a vague hypothetical.

True.

> > > There's a lot of extra complexity added in here for
> > > something it I don't think is ever going to matter,
> > > and having the most common host architecture take a
> > > totally different codepath from everything else is
> > > a recipe for that other codepath breaking.
> 
> > Not sure we have to be pedantic. We have arch specific code and the
> > world did not end.  It's not a lot of code, it's just for unaligned
> > accesses on MMIO, which might or might not be broken anyway.
> > And it worked already, so it's more about going back to what we
> > did earlier.
> 
> But we should not have arch specific code unless we
> have a reason to do it. And if we have a reason to do
> it then we should say what that reason is and what the
> cases the host has to cover are, so that when we can
> implement it correctly we do that.
> 
> -- PMM

I just like it when we implement whatever host does, qemu being a
hardware emulator, after all.

If you feel it's nonsense, I'm not gonnu fight.

-- 
MST



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-25 14:52             ` Michael S. Tsirkin
@ 2026-06-25 15:23               ` Peter Maydell
  2026-06-25 16:47                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Maydell @ 2026-06-25 15:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gavin Shan, qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	berrange, philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

On Thu, 25 Jun 2026 at 15:52, Michael S. Tsirkin <mst@redhat.com> wrote:
> I think there is exactly 1 kinda reasonable case. A 2 byte read/write at
> offset 0x1 within a dword. This maps nicely to even classical PCI byte
> enable mechanism and so yes it works if your CPU can initiate these
> things, and it's atomic.
>
> I tried reading LEDCTL on e1000e:
>
>        byte  @ 0xe00: 0x64
>        byte  @ 0xe01: 0x2a
>        byte  @ 0xe02: 0x00
>        byte  @ 0xe03: 0x00
>        word  @ 0xe00: 0x2a64
>        word  @ 0xe01: 0x002a
>
> Works fine.

The e1000e datasheet actually documents what it does in this
case (slightly surprising, since hardware engineers love to
leave this kind of corner case undocumented):

# For registers that should be accessed as 32-bit double words,
# partial writes (less than a 32-bit double word) does not take
# effect (such as, the write is ignored). Partial reads
# return all 32 bits of data regardless of the byte enables.
#
# Note: Partial reads to clear-by-read registers (such as, ICR)
# can have unexpected results since all 32 bits are actually read
# regardless of the byte enables. Partial reads should not be done.

So for this specific device that access is out-of-spec.

I guess what I'm wondering is: can we just have code
that does an aligned exact-width access in the 1/2/4/8
byte aligned case, and the host's best approximation to
an unaligned exact-width access for the 2/4/8 byte
unaligned case? (so on sparc you get multiple smaller
accesses, and on most archs including x86 and arm you
get an unaligned load). That would mean the guest could
potentially provoke a fault on the load/store on an
access to a passthrough device, but if you give the
guest passthrough access it can very likely provoke
a fault anyway, depending on exactly what the device is.

I think the most likely reason for an unaligned access
in this codepath is "it's actually RAM, either really
host RAM or else something memory-like in a BAR", and
either way if the guest does a 4-byte unaligned access
then doing a 4-byte unaligned access seems better than
second-guessing it, even on non-x86.

-- PMM

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-25 15:23               ` Peter Maydell
@ 2026-06-25 16:47                 ` Michael S. Tsirkin
  2026-06-25 18:40                   ` Peter Maydell
  0 siblings, 1 reply; 27+ messages in thread
From: Michael S. Tsirkin @ 2026-06-25 16:47 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Gavin Shan, qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	berrange, philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

On Thu, Jun 25, 2026 at 04:23:47PM +0100, Peter Maydell wrote:
> On Thu, 25 Jun 2026 at 15:52, Michael S. Tsirkin <mst@redhat.com> wrote:
> > I think there is exactly 1 kinda reasonable case. A 2 byte read/write at
> > offset 0x1 within a dword. This maps nicely to even classical PCI byte
> > enable mechanism and so yes it works if your CPU can initiate these
> > things, and it's atomic.
> >
> > I tried reading LEDCTL on e1000e:
> >
> >        byte  @ 0xe00: 0x64
> >        byte  @ 0xe01: 0x2a
> >        byte  @ 0xe02: 0x00
> >        byte  @ 0xe03: 0x00
> >        word  @ 0xe00: 0x2a64
> >        word  @ 0xe01: 0x002a
> >
> > Works fine.
> 
> The e1000e datasheet actually documents what it does in this
> case (slightly surprising, since hardware engineers love to
> leave this kind of corner case undocumented):
> 
> # For registers that should be accessed as 32-bit double words,
> # partial writes (less than a 32-bit double word) does not take
> # effect (such as, the write is ignored).
> # Partial reads
> # return all 32 bits of data regardless of the byte enables.
> #
> # Note: Partial reads to clear-by-read registers (such as, ICR)
> # can have unexpected results since all 32 bits are actually read
> # regardless of the byte enables. Partial reads should not be done.
> 
> So for this specific device that access is out-of-spec.

You mean that access to clear by read should not be done, right?

> I guess what I'm wondering is: can we just have code
> that does an aligned exact-width access in the 1/2/4/8
> byte aligned case, and the host's best approximation to
> an unaligned exact-width access for the 2/4/8 byte
> unaligned case?

That's my idea, too.

> (so on sparc you get multiple smaller
> accesses, and on most archs including x86 and arm you
> get an unaligned load).

I thought unaligned load from uncacheable on arm is
also a fault?

> That would mean the guest could
> potentially provoke a fault on the load/store on an
> access to a passthrough device, but if you give the
> guest passthrough access it can very likely provoke
> a fault anyway, depending on exactly what the device is.
> 
> I think the most likely reason for an unaligned access
> in this codepath is "it's actually RAM, either really
> host RAM or else something memory-like in a BAR", and
> either way if the guest does a 4-byte unaligned access
> then doing a 4-byte unaligned access seems better than
> second-guessing it, even on non-x86.
> 
> -- PMM

Right though remember: whether it's RAM doesn't matter. What matters is
how we map it.  qemu might fault because it maps NC but guest maps
cacheable and it's ok.

But, all this in theory. At a high level, I personally think going with
what you propose as a 1st approximation is entirely reasonable, except
for one thing: we really should not crash qemu, since access can be from
guest userspace.

qemu should know how things are mapped, and predict it will fault.
Approximating by splitting in that case sounds reasonable.



-- 
MST



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-25 16:47                 ` Michael S. Tsirkin
@ 2026-06-25 18:40                   ` Peter Maydell
  2026-06-26  0:07                     ` Gavin Shan
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Maydell @ 2026-06-25 18:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gavin Shan, qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	berrange, philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

On Thu, 25 Jun 2026 at 17:47, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Thu, Jun 25, 2026 at 04:23:47PM +0100, Peter Maydell wrote:
> > On Thu, 25 Jun 2026 at 15:52, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > I think there is exactly 1 kinda reasonable case. A 2 byte read/write at
> > > offset 0x1 within a dword. This maps nicely to even classical PCI byte
> > > enable mechanism and so yes it works if your CPU can initiate these
> > > things, and it's atomic.
> > >
> > > I tried reading LEDCTL on e1000e:
> > >
> > >        byte  @ 0xe00: 0x64
> > >        byte  @ 0xe01: 0x2a
> > >        byte  @ 0xe02: 0x00
> > >        byte  @ 0xe03: 0x00
> > >        word  @ 0xe00: 0x2a64
> > >        word  @ 0xe01: 0x002a
> > >
> > > Works fine.
> >
> > The e1000e datasheet actually documents what it does in this
> > case (slightly surprising, since hardware engineers love to
> > leave this kind of corner case undocumented):
> >
> > # For registers that should be accessed as 32-bit double words,
> > # partial writes (less than a 32-bit double word) does not take
> > # effect (such as, the write is ignored).
> > # Partial reads
> > # return all 32 bits of data regardless of the byte enables.
> > #
> > # Note: Partial reads to clear-by-read registers (such as, ICR)
> > # can have unexpected results since all 32 bits are actually read
> > # regardless of the byte enables. Partial reads should not be done.
> >
> > So for this specific device that access is out-of-spec.
>
> You mean that access to clear by read should not be done, right?

The datasheet is ambiguous about whether "Partial reads should
not be done" is meant to apply generally or only to clear-by-read
registers.

> > I guess what I'm wondering is: can we just have code
> > that does an aligned exact-width access in the 1/2/4/8
> > byte aligned case, and the host's best approximation to
> > an unaligned exact-width access for the 2/4/8 byte
> > unaligned case?
>
> That's my idea, too.
>
> > (so on sparc you get multiple smaller
> > accesses, and on most archs including x86 and arm you
> > get an unaligned load).
>
> I thought unaligned load from uncacheable on arm is
> also a fault?

For Arm the distinction is not cacheable/uncacheable
but Normal vs Device. (Device is essentially for things
which are not RAM; Normal is for RAM and RAM-like things,
and includes all of Normal Non-cacheable, Normal WT-Cacheable
and Normal WB-Cacheable.) Things mapped as Normal memory
don't generate unaligned faults (unless the guest turned them
on deliberately). For Device memory, it is IMPLEMENTATION
DEFINED whether you get an alignment fault or not if you
map something as Device that could have handled unaligned
accesses if you had mapped it as Normal.

> > That would mean the guest could
> > potentially provoke a fault on the load/store on an
> > access to a passthrough device, but if you give the
> > guest passthrough access it can very likely provoke
> > a fault anyway, depending on exactly what the device is.
> >
> > I think the most likely reason for an unaligned access
> > in this codepath is "it's actually RAM, either really
> > host RAM or else something memory-like in a BAR", and
> > either way if the guest does a 4-byte unaligned access
> > then doing a 4-byte unaligned access seems better than
> > second-guessing it, even on non-x86.

> Right though remember: whether it's RAM doesn't matter. What matters is
> how we map it.  qemu might fault because it maps NC but guest maps
> cacheable and it's ok.

If QEMU and the guest disagree about the memory attributes
on Arm then we have already lost, because the architecture
says that memory attribute mismatches result in a variety of
undesirable effects including things like loss of cache coherency
(i.e. read and writes via QEMU's NC mapping disagree with ones
via the guest's cacheable mapping because the latter are hitting
in the cache and the former are bypassing it).

> But, all this in theory. At a high level, I personally think going with
> what you propose as a 1st approximation is entirely reasonable, except
> for one thing: we really should not crash qemu, since access can be from
> guest userspace.

You can't prevent faults entirely, though -- if the device
being mapped has e.g. behaviour that says "unaligned accesses
will fault" and then the x86 guest does an unaligned access,
then the device will trigger a fault, and the fault is what
you want because it's what the guest would see on real h/w.
Unfortunately we don't have a convenient way to feed the
fault back to the guest. At some level if you pass through
host hardware you're relying on the guest to not do totally
stupid things.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-25 18:40                   ` Peter Maydell
@ 2026-06-26  0:07                     ` Gavin Shan
  0 siblings, 0 replies; 27+ messages in thread
From: Gavin Shan @ 2026-06-26  0:07 UTC (permalink / raw)
  To: Peter Maydell, Michael S. Tsirkin
  Cc: qemu-arm, qemu-devel, peterx, alex, richard.henderson, berrange,
	philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

On 6/26/26 4:40 AM, Peter Maydell wrote:
> On Thu, 25 Jun 2026 at 17:47, Michael S. Tsirkin <mst@redhat.com> wrote:
>>
>> On Thu, Jun 25, 2026 at 04:23:47PM +0100, Peter Maydell wrote:
>>> On Thu, 25 Jun 2026 at 15:52, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>> I think there is exactly 1 kinda reasonable case. A 2 byte read/write at
>>>> offset 0x1 within a dword. This maps nicely to even classical PCI byte
>>>> enable mechanism and so yes it works if your CPU can initiate these
>>>> things, and it's atomic.
>>>>
>>>> I tried reading LEDCTL on e1000e:
>>>>
>>>>         byte  @ 0xe00: 0x64
>>>>         byte  @ 0xe01: 0x2a
>>>>         byte  @ 0xe02: 0x00
>>>>         byte  @ 0xe03: 0x00
>>>>         word  @ 0xe00: 0x2a64
>>>>         word  @ 0xe01: 0x002a
>>>>
>>>> Works fine.
>>>
>>> The e1000e datasheet actually documents what it does in this
>>> case (slightly surprising, since hardware engineers love to
>>> leave this kind of corner case undocumented):
>>>
>>> # For registers that should be accessed as 32-bit double words,
>>> # partial writes (less than a 32-bit double word) does not take
>>> # effect (such as, the write is ignored).
>>> # Partial reads
>>> # return all 32 bits of data regardless of the byte enables.
>>> #
>>> # Note: Partial reads to clear-by-read registers (such as, ICR)
>>> # can have unexpected results since all 32 bits are actually read
>>> # regardless of the byte enables. Partial reads should not be done.
>>>
>>> So for this specific device that access is out-of-spec.
>>
>> You mean that access to clear by read should not be done, right?
> 
> The datasheet is ambiguous about whether "Partial reads should
> not be done" is meant to apply generally or only to clear-by-read
> registers.
> 

I think the document intends to say: the partial reads shouldn't be issued to
the clear-by-read registers. Those partial reads on no-read-side-affect registers
would be fine.

>>> I guess what I'm wondering is: can we just have code
>>> that does an aligned exact-width access in the 1/2/4/8
>>> byte aligned case, and the host's best approximation to
>>> an unaligned exact-width access for the 2/4/8 byte
>>> unaligned case?
>>
>> That's my idea, too.
>>

I assume this the conclusion of our discussions? If so, we just need to have
unified function where __builtin_{memcpy, memmove}() are used for 1/2/4/8 bytes
access no matter if the address is aligned, and fall back to {memcpy, memmove}()
for other cases?

>>> (so on sparc you get multiple smaller
>>> accesses, and on most archs including x86 and arm you
>>> get an unaligned load).
>>
>> I thought unaligned load from uncacheable on arm is
>> also a fault?
> 
> For Arm the distinction is not cacheable/uncacheable
> but Normal vs Device. (Device is essentially for things
> which are not RAM; Normal is for RAM and RAM-like things,
> and includes all of Normal Non-cacheable, Normal WT-Cacheable
> and Normal WB-Cacheable.) Things mapped as Normal memory
> don't generate unaligned faults (unless the guest turned them
> on deliberately). For Device memory, it is IMPLEMENTATION
> DEFINED whether you get an alignment fault or not if you
> map something as Device that could have handled unaligned
> accesses if you had mapped it as Normal.
> 
>>> That would mean the guest could
>>> potentially provoke a fault on the load/store on an
>>> access to a passthrough device, but if you give the
>>> guest passthrough access it can very likely provoke
>>> a fault anyway, depending on exactly what the device is.
>>>
>>> I think the most likely reason for an unaligned access
>>> in this codepath is "it's actually RAM, either really
>>> host RAM or else something memory-like in a BAR", and
>>> either way if the guest does a 4-byte unaligned access
>>> then doing a 4-byte unaligned access seems better than
>>> second-guessing it, even on non-x86.
> 
>> Right though remember: whether it's RAM doesn't matter. What matters is
>> how we map it.  qemu might fault because it maps NC but guest maps
>> cacheable and it's ok.
> 
> If QEMU and the guest disagree about the memory attributes
> on Arm then we have already lost, because the architecture
> says that memory attribute mismatches result in a variety of
> undesirable effects including things like loss of cache coherency
> (i.e. read and writes via QEMU's NC mapping disagree with ones
> via the guest's cacheable mapping because the latter are hitting
> in the cache and the former are bypassing it).
> 
>> But, all this in theory. At a high level, I personally think going with
>> what you propose as a 1st approximation is entirely reasonable, except
>> for one thing: we really should not crash qemu, since access can be from
>> guest userspace.
> 
> You can't prevent faults entirely, though -- if the device
> being mapped has e.g. behaviour that says "unaligned accesses
> will fault" and then the x86 guest does an unaligned access,
> then the device will trigger a fault, and the fault is what
> you want because it's what the guest would see on real h/w.
> Unfortunately we don't have a convenient way to feed the
> fault back to the guest. At some level if you pass through
> host hardware you're relying on the guest to not do totally
> stupid things.
> 
> thanks
> -- PMM
> 

Thanks,
Gavin



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 2/2] system/memory: Make ram device region directly accessible
  2026-06-16  5:25 [PATCH v3 0/2] system/memory: Make ram device region directly accessible Gavin Shan
  2026-06-16  5:25 ` [PATCH v3 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors Gavin Shan
@ 2026-06-16  5:25 ` Gavin Shan
  2026-06-16  5:36 ` [PATCH v3 0/2] " Michael S. Tsirkin
  2026-06-16  5:40 ` Gavin Shan
  3 siblings, 0 replies; 27+ messages in thread
From: Gavin Shan @ 2026-06-16  5:25 UTC (permalink / raw)
  To: qemu-arm
  Cc: qemu-devel, mst, peterx, alex, richard.henderson, peter.maydell,
	berrange, philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

This basically reverts 4a2e242bbb30 ("memory: Don't use memcpy for
ram_device regions") to make ram device region directly accessible
again. With this, the bounce buffer is bypassed in address_space_map()
when a ram device region is involved, potentially avoid to overrun
the (small) bounce buffer.

Reported-by: Julia Graham <jugraham@redhat.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Suggested-by: Peter Xu <peterx@redhat.com>
Suggested-by: Richard Henderson <richard.henderson@linaro.org>
Suggested-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 include/system/memory.h | 11 ++---------
 system/memory.c         | 41 +----------------------------------------
 system/trace-events     |  2 --
 3 files changed, 3 insertions(+), 51 deletions(-)

diff --git a/include/system/memory.h b/include/system/memory.h
index 84203c312d..28bfb6f1c9 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -2944,15 +2944,8 @@ static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
     if (memory_region_is_romd(mr)) {
         return true;
     }
-    if (!memory_region_is_ram(mr)) {
-        return false;
-    }
-    /*
-     * RAM DEVICE regions can be accessed directly using memcpy, but it might
-     * be MMIO and access using mempy can be wrong (e.g., using instructions not
-     * intended for MMIO access). So we treat this as IO.
-     */
-    return !memory_region_is_ram_device(mr);
+
+    return memory_region_is_ram(mr);
 }
 
 static inline bool memory_access_is_direct(const MemoryRegion *mr,
diff --git a/system/memory.c b/system/memory.c
index 739ba11da6..9549dd1a94 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -1362,43 +1362,6 @@ const MemoryRegionOps unassigned_mem_ops = {
     .endianness = DEVICE_NATIVE_ENDIAN,
 };
 
-static uint64_t memory_region_ram_device_read(void *opaque,
-                                              hwaddr addr, unsigned size)
-{
-    MemoryRegion *mr = opaque;
-    uint64_t data = ldn_he_p(mr->ram_block->host + addr, size);
-
-    trace_memory_region_ram_device_read(get_cpu_index(), mr, addr, data, size);
-
-    return data;
-}
-
-static void memory_region_ram_device_write(void *opaque, hwaddr addr,
-                                           uint64_t data, unsigned size)
-{
-    MemoryRegion *mr = opaque;
-
-    trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, data, size);
-
-    stn_he_p(mr->ram_block->host + addr, size, data);
-}
-
-static const MemoryRegionOps ram_device_mem_ops = {
-    .read = memory_region_ram_device_read,
-    .write = memory_region_ram_device_write,
-    .endianness = HOST_BIG_ENDIAN ? DEVICE_BIG_ENDIAN : DEVICE_LITTLE_ENDIAN,
-    .valid = {
-        .min_access_size = 1,
-        .max_access_size = 8,
-        .unaligned = true,
-    },
-    .impl = {
-        .min_access_size = 1,
-        .max_access_size = 8,
-        .unaligned = true,
-    },
-};
-
 bool memory_region_access_valid(MemoryRegion *mr,
                                 hwaddr addr,
                                 unsigned size,
@@ -1676,10 +1639,8 @@ void memory_region_init_ram_device_ptr(MemoryRegion *mr, Object *owner,
                                        const char *name, uint64_t size,
                                        void *ptr)
 {
-    memory_region_init_io(mr, owner, &ram_device_mem_ops, mr, name, size);
-    mr->ram = true;
+    memory_region_init_ram_ptr(mr, owner, name, size, ptr);
     mr->ram_device = true;
-    memory_region_set_ram_ptr(mr, size, ptr);
 }
 
 void memory_region_init_alias(MemoryRegion *mr, Object *owner,
diff --git a/system/trace-events b/system/trace-events
index e6e1b61279..34af0a3a1e 100644
--- a/system/trace-events
+++ b/system/trace-events
@@ -20,8 +20,6 @@ memory_region_ops_read(int cpu_index, void *mr, uint64_t addr, uint64_t value, u
 memory_region_ops_write(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size, const char *name) "cpu %d mr %p addr 0x%"PRIx64" value 0x%"PRIx64" size %u name '%s'"
 memory_region_subpage_read(int cpu_index, void *mr, uint64_t offset, uint64_t value, unsigned size) "cpu %d mr %p offset 0x%"PRIx64" value 0x%"PRIx64" size %u"
 memory_region_subpage_write(int cpu_index, void *mr, uint64_t offset, uint64_t value, unsigned size) "cpu %d mr %p offset 0x%"PRIx64" value 0x%"PRIx64" size %u"
-memory_region_ram_device_read(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size) "cpu %d mr %p addr 0x%"PRIx64" value 0x%"PRIx64" size %u"
-memory_region_ram_device_write(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size) "cpu %d mr %p addr 0x%"PRIx64" value 0x%"PRIx64" size %u"
 memory_region_sync_dirty(const char *mr, const char *listener, int global) "mr '%s' listener '%s' synced (global=%d)"
 flatview_new(void *view, void *root) "%p (root %p)"
 flatview_destroy(void *view, void *root) "%p (root %p)"
-- 
2.54.0



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/2] system/memory: Make ram device region directly accessible
  2026-06-16  5:25 [PATCH v3 0/2] system/memory: Make ram device region directly accessible Gavin Shan
  2026-06-16  5:25 ` [PATCH v3 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors Gavin Shan
  2026-06-16  5:25 ` [PATCH v3 2/2] system/memory: Make ram device region directly accessible Gavin Shan
@ 2026-06-16  5:36 ` Michael S. Tsirkin
  2026-06-16  5:43   ` Gavin Shan
  2026-06-16  5:40 ` Gavin Shan
  3 siblings, 1 reply; 27+ messages in thread
From: Michael S. Tsirkin @ 2026-06-16  5:36 UTC (permalink / raw)
  To: Gavin Shan
  Cc: qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	peter.maydell, berrange, philmd, philmd, david, clg, pbonzini,
	phrdina, jugraham, liugang24219, dinghui, shan.gavin

On Tue, Jun 16, 2026 at 03:25:50PM +1000, Gavin Shan wrote:
> All ram device regions was turned to be indirectly accessible by commit
> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions").


What is missing here is the list of issues around
direct ram access and how we are solving them.

Thanks,
-- 
MST



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/2] system/memory: Make ram device region directly accessible
  2026-06-16  5:36 ` [PATCH v3 0/2] " Michael S. Tsirkin
@ 2026-06-16  5:43   ` Gavin Shan
  0 siblings, 0 replies; 27+ messages in thread
From: Gavin Shan @ 2026-06-16  5:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	peter.maydell, berrange, philmd, philmd, david, clg, pbonzini,
	phrdina, jugraham, liugang24219, dinghui, shan.gavin


On 6/16/26 3:36 PM, Michael S. Tsirkin wrote:
> On Tue, Jun 16, 2026 at 03:25:50PM +1000, Gavin Shan wrote:
>> All ram device regions was turned to be indirectly accessible by commit
>> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions").
> 
> 
> What is missing here is the list of issues around
> direct ram access and how we are solving them.
> 

Sorry that I saw your request for the inclusion after v3 series was
posted. They have been appended to the thread as a reply.

Thanks,
Gavin

> Thanks,



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/2] system/memory: Make ram device region directly accessible
  2026-06-16  5:25 [PATCH v3 0/2] system/memory: Make ram device region directly accessible Gavin Shan
                   ` (2 preceding siblings ...)
  2026-06-16  5:36 ` [PATCH v3 0/2] " Michael S. Tsirkin
@ 2026-06-16  5:40 ` Gavin Shan
  2026-06-16  5:44   ` Michael S. Tsirkin
  3 siblings, 1 reply; 27+ messages in thread
From: Gavin Shan @ 2026-06-16  5:40 UTC (permalink / raw)
  To: qemu-arm
  Cc: qemu-devel, mst, peterx, alex, richard.henderson, peter.maydell,
	berrange, philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

On 6/16/26 3:25 PM, Gavin Shan wrote:
> All ram device regions was turned to be indirectly accessible by commit
> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> to a hanged guest where a NVidia GH100 GPU is passed from host. The memory
> in its PCI BAR#4 can be allocated as DMA target buffer. qemu has to take
> DMA bounce buffer in address_space_map() to cover the DMA request. However,
> the bounce buffer size is 4096 bytes and we're overrunning it easily when
> the guest has significant disk activities on compiling 'cuda-samples'.
> The full log and problem description can be found from PATCH[1/2]'s commit
> log.
> 
> Try to fix the issue handled in commit 4a2e242bbb by replacing memcopy()/
> memmove() with newly added helpers qemu_ram_{copy, move}() that works on
> top of __builtin_{memcpy, memmove} or unaligned access friendly memory
> movement in the accessors to the ram device regions. With this, we can
> basically revert that commit to make ram device region directly accessible
> again and bypass the bounce buffer in address_space_map() where the guest
> hang is caused.
> 
> PATCH[1] uses qemu_ram_{copy, move}() in ram device region accessors
> PATCH[2] makes ram device region directly accessible again
> 
Michael asked to include below context in the cover letter in v3, but I
didn't noticed that before I sent v3 series, appended with them.

----

The issues listed by Michael:

1. On x86, memcpy is different from __builtin_memcpy if one uses old 1.0
    force-headers from 2019. Likely no longer relevant.

2. variable length memcpy can translate 2,4,8 byte guest access into
    multiple byte accesses. doing this for mmio is guaranteed to break devices.

3. (theoretical concern) also on x86, unaligned accesses are possible on guest
    and host, so converting an unaligned access to a series of aligned ones can
    in theory break devices.

4. also on x86, vector instructions for large (>16 byte) writes into
    pgprot_noncached memory are safe and faster than multiple 8 byte ones.

5. also on x86 it so happens that if you write a fixed-size memcpy this gets
    optimized to a single store/load and it works for aligned and unaligned
    addresses on that architecture. How to ensure this keeps being correct
    is left as an excerise for the reader. But qemu already relies on this
    and did for years.

6. on non-x86 both unaligned accesses and vector instructions for accessing
    UC memory are illegal.

7. standard vfio gives KVM VM_ALLOW_ANY_UNCACHED, so even on non x86 guest can
    map the memory as as pgprot_noncached/ioremap or pgprot_writecombine/ioremap_uc.
    If it does the second then it can use unaligned or vector for access.
    This is why normal passthrough tends to work - it never traps to qemu at
    all. But for qemu, vfio uses  pgprot_noncached unconditionally so qemu
    can't use unaligned or vector instructions on non-x86.


8. But for nvgrace RAM, vfio has a driver that uses pgprot_writecombine/ioremap_uc.
    so qemu could safely use unaligned/vector instructioons even on non-x86.

9. Except sadly, vfio currently does not tell qemu how it maps
    the memory, so qemu can not know what is safe on non-x86.


Now, what is to be done?


A. on x86, we must avoid converting 2,4,8 byte accesses into byte accesses.
At least for aligned, perferably for unaligned accesses too.
Fixed width memcpy seems to work for this. Whether we should bother with
__builtin to work around broken old fortify headers, I donnu.
I do not have any answer how to check that compiler does this correctly.
If anyone is motivated enough, adding a GCC builtin could be possible.
Given qemu did this for years, I think we can leave solving this for
another day.

B. Also on x86, I do not see why we should not use memcpy for large
accesses if we can. Better perf.

C. on non-x86, we currently must not memcpy since we do not know if it
is pgprot_noncached. yes, performance will be bad for DMA into device RAM.

D. It goes without saying that casting an unaligned address to unint32_t
(be it for qatomic_set or whatever) is undefined behaviour in C
and so a bad idea on any architecture.

E. also for non-x86, we really should teach vfio to tell qemu whether
it maps device pgprot_noncached or pgprot_writecombine.
we will then be able to use memcpy for >8 accesses.

Anyone, correct me if I'm wrong? Maybe I should start a new thread with
this summary?

Thanks,
Gavin



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/2] system/memory: Make ram device region directly accessible
  2026-06-16  5:40 ` Gavin Shan
@ 2026-06-16  5:44   ` Michael S. Tsirkin
  2026-06-17  2:35     ` Gavin Shan
  0 siblings, 1 reply; 27+ messages in thread
From: Michael S. Tsirkin @ 2026-06-16  5:44 UTC (permalink / raw)
  To: Gavin Shan
  Cc: qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	peter.maydell, berrange, philmd, philmd, david, clg, pbonzini,
	phrdina, jugraham, liugang24219, dinghui, shan.gavin

On Tue, Jun 16, 2026 at 03:40:34PM +1000, Gavin Shan wrote:
> On 6/16/26 3:25 PM, Gavin Shan wrote:
> > All ram device regions was turned to be indirectly accessible by commit
> > 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> > to a hanged guest where a NVidia GH100 GPU is passed from host. The memory
> > in its PCI BAR#4 can be allocated as DMA target buffer. qemu has to take
> > DMA bounce buffer in address_space_map() to cover the DMA request. However,
> > the bounce buffer size is 4096 bytes and we're overrunning it easily when
> > the guest has significant disk activities on compiling 'cuda-samples'.
> > The full log and problem description can be found from PATCH[1/2]'s commit
> > log.
> > 
> > Try to fix the issue handled in commit 4a2e242bbb by replacing memcopy()/
> > memmove() with newly added helpers qemu_ram_{copy, move}() that works on
> > top of __builtin_{memcpy, memmove} or unaligned access friendly memory
> > movement in the accessors to the ram device regions. With this, we can
> > basically revert that commit to make ram device region directly accessible
> > again and bypass the bounce buffer in address_space_map() where the guest
> > hang is caused.
> > 
> > PATCH[1] uses qemu_ram_{copy, move}() in ram device region accessors
> > PATCH[2] makes ram device region directly accessible again
> > 
> Michael asked to include below context in the cover letter in v3, but I
> didn't noticed that before I sent v3 series, appended with them.
> 
> ----
> 
> The issues listed by Michael:
> 
> 1. On x86, memcpy is different from __builtin_memcpy if one uses old 1.0
>    force-headers from 2019. Likely no longer relevant.
> 
> 2. variable length memcpy can translate 2,4,8 byte guest access into
>    multiple byte accesses. doing this for mmio is guaranteed to break devices.
> 
> 3. (theoretical concern) also on x86, unaligned accesses are possible on guest
>    and host, so converting an unaligned access to a series of aligned ones can
>    in theory break devices.
> 
> 4. also on x86, vector instructions for large (>16 byte) writes into
>    pgprot_noncached memory are safe and faster than multiple 8 byte ones.
> 
> 5. also on x86 it so happens that if you write a fixed-size memcpy this gets
>    optimized to a single store/load and it works for aligned and unaligned
>    addresses on that architecture. How to ensure this keeps being correct
>    is left as an excerise for the reader. But qemu already relies on this
>    and did for years.
> 
> 6. on non-x86 both unaligned accesses and vector instructions for accessing
>    UC memory are illegal.
> 
> 7. standard vfio gives KVM VM_ALLOW_ANY_UNCACHED, so even on non x86 guest can
>    map the memory as as pgprot_noncached/ioremap or pgprot_writecombine/ioremap_uc.
>    If it does the second then it can use unaligned or vector for access.
>    This is why normal passthrough tends to work - it never traps to qemu at
>    all. But for qemu, vfio uses  pgprot_noncached unconditionally so qemu
>    can't use unaligned or vector instructions on non-x86.
> 
> 
> 8. But for nvgrace RAM, vfio has a driver that uses pgprot_writecombine/ioremap_uc.
>    so qemu could safely use unaligned/vector instructioons even on non-x86.
> 
> 9. Except sadly, vfio currently does not tell qemu how it maps
>    the memory, so qemu can not know what is safe on non-x86.
> 

And more:

10. on x86 memcpy will sometimes do multiple overlapping stores when
size is not a power of 2. for example, a 15 byte write is done with
2 8-byte stores. This is theoretically an issue
if guest does something super clever with ordering,
but does not seem to be in practice.

10. on non-x86 memcpy will do multiple overlapping stores even
for single byte writes. E.g. it does it to avoid extra branches.
This is causing issues in practice.





> Now, what is to be done?
> 
> 
> A. on x86, we must avoid converting 2,4,8 byte accesses into byte accesses.
> At least for aligned, perferably for unaligned accesses too.
> Fixed width memcpy seems to work for this. Whether we should bother with
> __builtin to work around broken old fortify headers, I donnu.
> I do not have any answer how to check that compiler does this correctly.
> If anyone is motivated enough, adding a GCC builtin could be possible.
> Given qemu did this for years, I think we can leave solving this for
> another day.
> 
> B. Also on x86, I do not see why we should not use memcpy for large
> accesses if we can. Better perf.
> 
> C. on non-x86, we currently must not memcpy since we do not know if it
> is pgprot_noncached. yes, performance will be bad for DMA into device RAM.
> 
> D. It goes without saying that casting an unaligned address to unint32_t
> (be it for qatomic_set or whatever) is undefined behaviour in C
> and so a bad idea on any architecture.
> 
> E. also for non-x86, we really should teach vfio to tell qemu whether
> it maps device pgprot_noncached or pgprot_writecombine.
> we will then be able to use memcpy for >8 accesses.
> 
> Anyone, correct me if I'm wrong? Maybe I should start a new thread with
> this summary?
> 
> Thanks,
> Gavin



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/2] system/memory: Make ram device region directly accessible
  2026-06-16  5:44   ` Michael S. Tsirkin
@ 2026-06-17  2:35     ` Gavin Shan
  2026-06-17  5:52       ` Michael S. Tsirkin
  0 siblings, 1 reply; 27+ messages in thread
From: Gavin Shan @ 2026-06-17  2:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	peter.maydell, berrange, philmd, philmd, david, clg, pbonzini,
	phrdina, jugraham, liugang24219, dinghui, shan.gavin

On 6/16/26 3:44 PM, Michael S. Tsirkin wrote:
> On Tue, Jun 16, 2026 at 03:40:34PM +1000, Gavin Shan wrote:
>> On 6/16/26 3:25 PM, Gavin Shan wrote:
>>> All ram device regions was turned to be indirectly accessible by commit
>>> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
>>> to a hanged guest where a NVidia GH100 GPU is passed from host. The memory
>>> in its PCI BAR#4 can be allocated as DMA target buffer. qemu has to take
>>> DMA bounce buffer in address_space_map() to cover the DMA request. However,
>>> the bounce buffer size is 4096 bytes and we're overrunning it easily when
>>> the guest has significant disk activities on compiling 'cuda-samples'.
>>> The full log and problem description can be found from PATCH[1/2]'s commit
>>> log.
>>>
>>> Try to fix the issue handled in commit 4a2e242bbb by replacing memcopy()/
>>> memmove() with newly added helpers qemu_ram_{copy, move}() that works on
>>> top of __builtin_{memcpy, memmove} or unaligned access friendly memory
>>> movement in the accessors to the ram device regions. With this, we can
>>> basically revert that commit to make ram device region directly accessible
>>> again and bypass the bounce buffer in address_space_map() where the guest
>>> hang is caused.
>>>
>>> PATCH[1] uses qemu_ram_{copy, move}() in ram device region accessors
>>> PATCH[2] makes ram device region directly accessible again
>>>
>> Michael asked to include below context in the cover letter in v3, but I
>> didn't noticed that before I sent v3 series, appended with them.
>>

Looking at the list of issues (questions) raised by Michael, I don't understand
every one before I'm able to put more time to dig, but I feel this series has
too ambitious goal to cover accesses to all the directly accessible regions
with the newly introduced qemu_ram_{copy, move}. It causes too many behavior
changes and concerns, making this series impossible to land.

I would suggest to break down the goal and step back to apply the newly introduced
qemu_ram_{copy, move} to the ram device regions only? It's actually something
proposed by Peter Xu in the earlier replies. Taking address_space_write() as an
example, the indirectly accessible regions are covered by memory_region_dispatch_write()
in (1), the ram device region is covered by qemu_ram_move() in (2), and all other
directly accessible regions are covered by memmove() in (3).

   address_space_write
     flatview_write
       flatview_write_continue
         flatview_write_continue_step
           memory_access_size             // (1) indirectly accessible region
           memory_region_dispatch_write
             access_with_adjusted_size
               memory_region_write_accessor
                 mr->ops->write
           qemu_ram_move                  // (2) ram device region
           memmove                        // (3) all other directly accessible regions

With the limitation, only the ram device regions in (2) are affected. We're
basically moving the accesses to the ram device region from (1) to (2). No
changes introduced to other types of regions. The goal is to make the ram device
region accessible so that the bounce buffer can be bypassed in DMA path.

>> ----
>>
>> The issues listed by Michael:
>>
>> 1. On x86, memcpy is different from __builtin_memcpy if one uses old 1.0
>>     force-headers from 2019. Likely no longer relevant.
>>
>> 2. variable length memcpy can translate 2,4,8 byte guest access into
>>     multiple byte accesses. doing this for mmio is guaranteed to break devices.
>>
>> 3. (theoretical concern) also on x86, unaligned accesses are possible on guest
>>     and host, so converting an unaligned access to a series of aligned ones can
>>     in theory break devices.
>>
>> 4. also on x86, vector instructions for large (>16 byte) writes into
>>     pgprot_noncached memory are safe and faster than multiple 8 byte ones.
>>
>> 5. also on x86 it so happens that if you write a fixed-size memcpy this gets
>>     optimized to a single store/load and it works for aligned and unaligned
>>     addresses on that architecture. How to ensure this keeps being correct
>>     is left as an excerise for the reader. But qemu already relies on this
>>     and did for years.
>>
>> 6. on non-x86 both unaligned accesses and vector instructions for accessing
>>     UC memory are illegal.
>>
>> 7. standard vfio gives KVM VM_ALLOW_ANY_UNCACHED, so even on non x86 guest can
>>     map the memory as as pgprot_noncached/ioremap or pgprot_writecombine/ioremap_uc.
>>     If it does the second then it can use unaligned or vector for access.
>>     This is why normal passthrough tends to work - it never traps to qemu at
>>     all. But for qemu, vfio uses  pgprot_noncached unconditionally so qemu
>>     can't use unaligned or vector instructions on non-x86.
>>
>>
>> 8. But for nvgrace RAM, vfio has a driver that uses pgprot_writecombine/ioremap_uc.
>>     so qemu could safely use unaligned/vector instructioons even on non-x86.
>>
>> 9. Except sadly, vfio currently does not tell qemu how it maps
>>     the memory, so qemu can not know what is safe on non-x86.
>>
> 
> And more:
> 
> 10. on x86 memcpy will sometimes do multiple overlapping stores when
> size is not a power of 2. for example, a 15 byte write is done with
> 2 8-byte stores. This is theoretically an issue
> if guest does something super clever with ordering,
> but does not seem to be in practice.
> 
> 10. on non-x86 memcpy will do multiple overlapping stores even
> for single byte writes. E.g. it does it to avoid extra branches.
> This is causing issues in practice.
> 
> 
> 
> 
> 
>> Now, what is to be done?
>>
>>
>> A. on x86, we must avoid converting 2,4,8 byte accesses into byte accesses.
>> At least for aligned, perferably for unaligned accesses too.
>> Fixed width memcpy seems to work for this. Whether we should bother with
>> __builtin to work around broken old fortify headers, I donnu.
>> I do not have any answer how to check that compiler does this correctly.
>> If anyone is motivated enough, adding a GCC builtin could be possible.
>> Given qemu did this for years, I think we can leave solving this for
>> another day.
>>
>> B. Also on x86, I do not see why we should not use memcpy for large
>> accesses if we can. Better perf.
>>
>> C. on non-x86, we currently must not memcpy since we do not know if it
>> is pgprot_noncached. yes, performance will be bad for DMA into device RAM.
>>
>> D. It goes without saying that casting an unaligned address to unint32_t
>> (be it for qatomic_set or whatever) is undefined behaviour in C
>> and so a bad idea on any architecture.
>>
>> E. also for non-x86, we really should teach vfio to tell qemu whether
>> it maps device pgprot_noncached or pgprot_writecombine.
>> we will then be able to use memcpy for >8 accesses.
>>
>> Anyone, correct me if I'm wrong? Maybe I should start a new thread with
>> this summary?
>>

Thanks,
Gavin



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/2] system/memory: Make ram device region directly accessible
  2026-06-17  2:35     ` Gavin Shan
@ 2026-06-17  5:52       ` Michael S. Tsirkin
  2026-06-17  7:00         ` Gavin Shan
  0 siblings, 1 reply; 27+ messages in thread
From: Michael S. Tsirkin @ 2026-06-17  5:52 UTC (permalink / raw)
  To: Gavin Shan
  Cc: qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	peter.maydell, berrange, philmd, philmd, david, clg, pbonzini,
	phrdina, jugraham, liugang24219, dinghui, shan.gavin

On Wed, Jun 17, 2026 at 12:35:00PM +1000, Gavin Shan wrote:
> On 6/16/26 3:44 PM, Michael S. Tsirkin wrote:
> > On Tue, Jun 16, 2026 at 03:40:34PM +1000, Gavin Shan wrote:
> > > On 6/16/26 3:25 PM, Gavin Shan wrote:
> > > > All ram device regions was turned to be indirectly accessible by commit
> > > > 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> > > > to a hanged guest where a NVidia GH100 GPU is passed from host. The memory
> > > > in its PCI BAR#4 can be allocated as DMA target buffer. qemu has to take
> > > > DMA bounce buffer in address_space_map() to cover the DMA request. However,
> > > > the bounce buffer size is 4096 bytes and we're overrunning it easily when
> > > > the guest has significant disk activities on compiling 'cuda-samples'.
> > > > The full log and problem description can be found from PATCH[1/2]'s commit
> > > > log.
> > > > 
> > > > Try to fix the issue handled in commit 4a2e242bbb by replacing memcopy()/
> > > > memmove() with newly added helpers qemu_ram_{copy, move}() that works on
> > > > top of __builtin_{memcpy, memmove} or unaligned access friendly memory
> > > > movement in the accessors to the ram device regions. With this, we can
> > > > basically revert that commit to make ram device region directly accessible
> > > > again and bypass the bounce buffer in address_space_map() where the guest
> > > > hang is caused.
> > > > 
> > > > PATCH[1] uses qemu_ram_{copy, move}() in ram device region accessors
> > > > PATCH[2] makes ram device region directly accessible again
> > > > 
> > > Michael asked to include below context in the cover letter in v3, but I
> > > didn't noticed that before I sent v3 series, appended with them.
> > > 
> 
> Looking at the list of issues (questions) raised by Michael, I don't understand
> every one

Gavin, I doubt one should make memory.c changes without understanding the issues
it is trying to address.

What is unclear? Ask away.


> before I'm able to put more time to dig, but I feel this series has
> too ambitious goal to cover accesses to all the directly accessible regions
> with the newly introduced qemu_ram_{copy, move}. It causes too many behavior
> changes and concerns, making this series impossible to land.
> 
> I would suggest to break down the goal and step back to apply the newly introduced
> qemu_ram_{copy, move} to the ram device regions only? It's actually something
> proposed by Peter Xu in the earlier replies. Taking address_space_write() as an
> example, the indirectly accessible regions are covered by memory_region_dispatch_write()
> in (1), the ram device region is covered by qemu_ram_move() in (2), and all other
> directly accessible regions are covered by memmove() in (3).
> 
>   address_space_write
>     flatview_write
>       flatview_write_continue
>         flatview_write_continue_step
>           memory_access_size             // (1) indirectly accessible region
>           memory_region_dispatch_write
>             access_with_adjusted_size
>               memory_region_write_accessor
>                 mr->ops->write
>           qemu_ram_move                  // (2) ram device region
>           memmove                        // (3) all other directly accessible regions
> 
> With the limitation, only the ram device regions in (2) are affected. We're
> basically moving the accesses to the ram device region from (1) to (2). No
> changes introduced to other types of regions. The goal is to make the ram device
> region accessible so that the bounce buffer can be bypassed in DMA path.

Esthetics aside - ram device regions have all the same issues.

Maybe you can limit the scope of the changes,
but I doubt you can get out understanding)


> > > ----
> > > 
> > > The issues listed by Michael:
> > > 
> > > 1. On x86, memcpy is different from __builtin_memcpy if one uses old 1.0
> > >     force-headers from 2019. Likely no longer relevant.
> > > 
> > > 2. variable length memcpy can translate 2,4,8 byte guest access into
> > >     multiple byte accesses. doing this for mmio is guaranteed to break devices.
> > > 
> > > 3. (theoretical concern) also on x86, unaligned accesses are possible on guest
> > >     and host, so converting an unaligned access to a series of aligned ones can
> > >     in theory break devices.
> > > 
> > > 4. also on x86, vector instructions for large (>16 byte) writes into
> > >     pgprot_noncached memory are safe and faster than multiple 8 byte ones.
> > > 
> > > 5. also on x86 it so happens that if you write a fixed-size memcpy this gets
> > >     optimized to a single store/load and it works for aligned and unaligned
> > >     addresses on that architecture. How to ensure this keeps being correct
> > >     is left as an excerise for the reader. But qemu already relies on this
> > >     and did for years.
> > > 
> > > 6. on non-x86 both unaligned accesses and vector instructions for accessing
> > >     UC memory are illegal.
> > > 
> > > 7. standard vfio gives KVM VM_ALLOW_ANY_UNCACHED, so even on non x86 guest can
> > >     map the memory as as pgprot_noncached/ioremap or pgprot_writecombine/ioremap_uc.
> > >     If it does the second then it can use unaligned or vector for access.
> > >     This is why normal passthrough tends to work - it never traps to qemu at
> > >     all. But for qemu, vfio uses  pgprot_noncached unconditionally so qemu
> > >     can't use unaligned or vector instructions on non-x86.
> > > 
> > > 
> > > 8. But for nvgrace RAM, vfio has a driver that uses pgprot_writecombine/ioremap_uc.
> > >     so qemu could safely use unaligned/vector instructioons even on non-x86.
> > > 
> > > 9. Except sadly, vfio currently does not tell qemu how it maps
> > >     the memory, so qemu can not know what is safe on non-x86.
> > > 
> > 
> > And more:
> > 
> > 10. on x86 memcpy will sometimes do multiple overlapping stores when
> > size is not a power of 2. for example, a 15 byte write is done with
> > 2 8-byte stores. This is theoretically an issue
> > if guest does something super clever with ordering,
> > but does not seem to be in practice.
> > 
> > 10. on non-x86 memcpy will do multiple overlapping stores even
> > for single byte writes. E.g. it does it to avoid extra branches.
> > This is causing issues in practice.
> > 
> > 
> > 
> > 
> > 
> > > Now, what is to be done?
> > > 
> > > 
> > > A. on x86, we must avoid converting 2,4,8 byte accesses into byte accesses.
> > > At least for aligned, perferably for unaligned accesses too.
> > > Fixed width memcpy seems to work for this. Whether we should bother with
> > > __builtin to work around broken old fortify headers, I donnu.
> > > I do not have any answer how to check that compiler does this correctly.
> > > If anyone is motivated enough, adding a GCC builtin could be possible.
> > > Given qemu did this for years, I think we can leave solving this for
> > > another day.
> > > 
> > > B. Also on x86, I do not see why we should not use memcpy for large
> > > accesses if we can. Better perf.
> > > 
> > > C. on non-x86, we currently must not memcpy since we do not know if it
> > > is pgprot_noncached. yes, performance will be bad for DMA into device RAM.
> > > 
> > > D. It goes without saying that casting an unaligned address to unint32_t
> > > (be it for qatomic_set or whatever) is undefined behaviour in C
> > > and so a bad idea on any architecture.
> > > 
> > > E. also for non-x86, we really should teach vfio to tell qemu whether
> > > it maps device pgprot_noncached or pgprot_writecombine.
> > > we will then be able to use memcpy for >8 accesses.
> > > 
> > > Anyone, correct me if I'm wrong? Maybe I should start a new thread with
> > > this summary?
> > > 
> 
> Thanks,
> Gavin



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/2] system/memory: Make ram device region directly accessible
  2026-06-17  5:52       ` Michael S. Tsirkin
@ 2026-06-17  7:00         ` Gavin Shan
  2026-06-17  7:27           ` Michael S. Tsirkin
  0 siblings, 1 reply; 27+ messages in thread
From: Gavin Shan @ 2026-06-17  7:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	peter.maydell, berrange, philmd, philmd, david, clg, pbonzini,
	phrdina, jugraham, liugang24219, dinghui, shan.gavin

Hi Michael,

On 6/17/26 3:52 PM, Michael S. Tsirkin wrote:
> On Wed, Jun 17, 2026 at 12:35:00PM +1000, Gavin Shan wrote:
>> On 6/16/26 3:44 PM, Michael S. Tsirkin wrote:
>>> On Tue, Jun 16, 2026 at 03:40:34PM +1000, Gavin Shan wrote:
>>>> On 6/16/26 3:25 PM, Gavin Shan wrote:
>>>>> All ram device regions was turned to be indirectly accessible by commit
>>>>> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
>>>>> to a hanged guest where a NVidia GH100 GPU is passed from host. The memory
>>>>> in its PCI BAR#4 can be allocated as DMA target buffer. qemu has to take
>>>>> DMA bounce buffer in address_space_map() to cover the DMA request. However,
>>>>> the bounce buffer size is 4096 bytes and we're overrunning it easily when
>>>>> the guest has significant disk activities on compiling 'cuda-samples'.
>>>>> The full log and problem description can be found from PATCH[1/2]'s commit
>>>>> log.
>>>>>
>>>>> Try to fix the issue handled in commit 4a2e242bbb by replacing memcopy()/
>>>>> memmove() with newly added helpers qemu_ram_{copy, move}() that works on
>>>>> top of __builtin_{memcpy, memmove} or unaligned access friendly memory
>>>>> movement in the accessors to the ram device regions. With this, we can
>>>>> basically revert that commit to make ram device region directly accessible
>>>>> again and bypass the bounce buffer in address_space_map() where the guest
>>>>> hang is caused.
>>>>>
>>>>> PATCH[1] uses qemu_ram_{copy, move}() in ram device region accessors
>>>>> PATCH[2] makes ram device region directly accessible again
>>>>>
>>>> Michael asked to include below context in the cover letter in v3, but I
>>>> didn't noticed that before I sent v3 series, appended with them.
>>>>
>>
>> Looking at the list of issues (questions) raised by Michael, I don't understand
>> every one
> 
> Gavin, I doubt one should make memory.c changes without understanding the issues
> it is trying to address.
> 
> What is unclear? Ask away.
> 

Yeah, absolutely. I need some time to understand all questions or suggestions
by digging the code a bit, before I'm able to come back to you, but not very
soon though :-)

> 
>> before I'm able to put more time to dig, but I feel this series has
>> too ambitious goal to cover accesses to all the directly accessible regions
>> with the newly introduced qemu_ram_{copy, move}. It causes too many behavior
>> changes and concerns, making this series impossible to land.
>>
>> I would suggest to break down the goal and step back to apply the newly introduced
>> qemu_ram_{copy, move} to the ram device regions only? It's actually something
>> proposed by Peter Xu in the earlier replies. Taking address_space_write() as an
>> example, the indirectly accessible regions are covered by memory_region_dispatch_write()
>> in (1), the ram device region is covered by qemu_ram_move() in (2), and all other
>> directly accessible regions are covered by memmove() in (3).
>>
>>    address_space_write
>>      flatview_write
>>        flatview_write_continue
>>          flatview_write_continue_step
>>            memory_access_size             // (1) indirectly accessible region
>>            memory_region_dispatch_write
>>              access_with_adjusted_size
>>                memory_region_write_accessor
>>                  mr->ops->write
>>            qemu_ram_move                  // (2) ram device region
>>            memmove                        // (3) all other directly accessible regions
>>
>> With the limitation, only the ram device regions in (2) are affected. We're
>> basically moving the accesses to the ram device region from (1) to (2). No
>> changes introduced to other types of regions. The goal is to make the ram device
>> region accessible so that the bounce buffer can be bypassed in DMA path.
> 
> Esthetics aside - ram device regions have all the same issues.
> 
> Maybe you can limit the scope of the changes,
> but I doubt you can get out understanding)
> 

Yes. Limited to the scope of VFIO and PCI BARs, it depends on how the PCI BARs
are mapped, pgprot_noncached or pgprot_writecombine. The listed problems and
concerns are existing on the ram device region exposed with pgprot_noncached.
However, everything should be just fine if the region is exposed with pgprot_writecombine.
Am I understanding this correctly?

[...]

Thanks,
Gavin



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/2] system/memory: Make ram device region directly accessible
  2026-06-17  7:00         ` Gavin Shan
@ 2026-06-17  7:27           ` Michael S. Tsirkin
  0 siblings, 0 replies; 27+ messages in thread
From: Michael S. Tsirkin @ 2026-06-17  7:27 UTC (permalink / raw)
  To: Gavin Shan
  Cc: qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	peter.maydell, berrange, philmd, philmd, david, clg, pbonzini,
	phrdina, jugraham, liugang24219, dinghui, shan.gavin

On Wed, Jun 17, 2026 at 05:00:45PM +1000, Gavin Shan wrote:
> Hi Michael,
> 
> On 6/17/26 3:52 PM, Michael S. Tsirkin wrote:
> > On Wed, Jun 17, 2026 at 12:35:00PM +1000, Gavin Shan wrote:
> > > On 6/16/26 3:44 PM, Michael S. Tsirkin wrote:
> > > > On Tue, Jun 16, 2026 at 03:40:34PM +1000, Gavin Shan wrote:
> > > > > On 6/16/26 3:25 PM, Gavin Shan wrote:
> > > > > > All ram device regions was turned to be indirectly accessible by commit
> > > > > > 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> > > > > > to a hanged guest where a NVidia GH100 GPU is passed from host. The memory
> > > > > > in its PCI BAR#4 can be allocated as DMA target buffer. qemu has to take
> > > > > > DMA bounce buffer in address_space_map() to cover the DMA request. However,
> > > > > > the bounce buffer size is 4096 bytes and we're overrunning it easily when
> > > > > > the guest has significant disk activities on compiling 'cuda-samples'.
> > > > > > The full log and problem description can be found from PATCH[1/2]'s commit
> > > > > > log.
> > > > > > 
> > > > > > Try to fix the issue handled in commit 4a2e242bbb by replacing memcopy()/
> > > > > > memmove() with newly added helpers qemu_ram_{copy, move}() that works on
> > > > > > top of __builtin_{memcpy, memmove} or unaligned access friendly memory
> > > > > > movement in the accessors to the ram device regions. With this, we can
> > > > > > basically revert that commit to make ram device region directly accessible
> > > > > > again and bypass the bounce buffer in address_space_map() where the guest
> > > > > > hang is caused.
> > > > > > 
> > > > > > PATCH[1] uses qemu_ram_{copy, move}() in ram device region accessors
> > > > > > PATCH[2] makes ram device region directly accessible again
> > > > > > 
> > > > > Michael asked to include below context in the cover letter in v3, but I
> > > > > didn't noticed that before I sent v3 series, appended with them.
> > > > > 
> > > 
> > > Looking at the list of issues (questions) raised by Michael, I don't understand
> > > every one
> > 
> > Gavin, I doubt one should make memory.c changes without understanding the issues
> > it is trying to address.
> > 
> > What is unclear? Ask away.
> > 
> 
> Yeah, absolutely. I need some time to understand all questions or suggestions
> by digging the code a bit, before I'm able to come back to you, but not very
> soon though :-)
> 
> > 
> > > before I'm able to put more time to dig, but I feel this series has
> > > too ambitious goal to cover accesses to all the directly accessible regions
> > > with the newly introduced qemu_ram_{copy, move}. It causes too many behavior
> > > changes and concerns, making this series impossible to land.
> > > 
> > > I would suggest to break down the goal and step back to apply the newly introduced
> > > qemu_ram_{copy, move} to the ram device regions only? It's actually something
> > > proposed by Peter Xu in the earlier replies. Taking address_space_write() as an
> > > example, the indirectly accessible regions are covered by memory_region_dispatch_write()
> > > in (1), the ram device region is covered by qemu_ram_move() in (2), and all other
> > > directly accessible regions are covered by memmove() in (3).
> > > 
> > >    address_space_write
> > >      flatview_write
> > >        flatview_write_continue
> > >          flatview_write_continue_step
> > >            memory_access_size             // (1) indirectly accessible region
> > >            memory_region_dispatch_write
> > >              access_with_adjusted_size
> > >                memory_region_write_accessor
> > >                  mr->ops->write
> > >            qemu_ram_move                  // (2) ram device region
> > >            memmove                        // (3) all other directly accessible regions
> > > 
> > > With the limitation, only the ram device regions in (2) are affected. We're
> > > basically moving the accesses to the ram device region from (1) to (2). No
> > > changes introduced to other types of regions. The goal is to make the ram device
> > > region accessible so that the bounce buffer can be bypassed in DMA path.
> > 
> > Esthetics aside - ram device regions have all the same issues.
> > 
> > Maybe you can limit the scope of the changes,
> > but I doubt you can get out understanding)
> > 
> 
> Yes. Limited to the scope of VFIO and PCI BARs, it depends on how the PCI BARs
> are mapped, pgprot_noncached or pgprot_writecombine. The listed problems and
> concerns are existing on the ram device region exposed with pgprot_noncached.
> However, everything should be just fine if the region is exposed with pgprot_writecombine.
> Am I understanding this correctly?

Not everything)

We have issues around guest RAM access, too.

But yes you can then mostly access device RAM as guest RAM on
arm and x86. I am not sure about power. for power pgprot_writecombine
is cache inhibited and I am not sure e.g. vector instructions
with misaligned addresses behave the same.



> [...]
> 
> Thanks,
> Gavin



^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2026-06-26  0:08 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-16  5:25 [PATCH v3 0/2] system/memory: Make ram device region directly accessible Gavin Shan
2026-06-16  5:25 ` [PATCH v3 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors Gavin Shan
2026-06-16  6:17   ` Michael S. Tsirkin
2026-06-16  7:15     ` Gavin Shan
2026-06-16  9:51       ` Michael S. Tsirkin
2026-06-16 12:50         ` Ding Hui
2026-06-16 15:51           ` Michael S. Tsirkin
2026-06-16 23:01             ` Gavin Shan
2026-06-25 10:09   ` Peter Maydell
2026-06-25 11:07     ` Michael S. Tsirkin
2026-06-25 12:48       ` Peter Maydell
2026-06-25 13:23         ` Michael S. Tsirkin
2026-06-25 14:02           ` Peter Maydell
2026-06-25 14:52             ` Michael S. Tsirkin
2026-06-25 15:23               ` Peter Maydell
2026-06-25 16:47                 ` Michael S. Tsirkin
2026-06-25 18:40                   ` Peter Maydell
2026-06-26  0:07                     ` Gavin Shan
2026-06-16  5:25 ` [PATCH v3 2/2] system/memory: Make ram device region directly accessible Gavin Shan
2026-06-16  5:36 ` [PATCH v3 0/2] " Michael S. Tsirkin
2026-06-16  5:43   ` Gavin Shan
2026-06-16  5:40 ` Gavin Shan
2026-06-16  5:44   ` Michael S. Tsirkin
2026-06-17  2:35     ` Gavin Shan
2026-06-17  5:52       ` Michael S. Tsirkin
2026-06-17  7:00         ` Gavin Shan
2026-06-17  7:27           ` Michael S. Tsirkin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.