[PATCH v2 0/2] system/memory: Make ram device region directly accessible

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2 0/2] system/memory: Make ram device region directly accessible
@ 2026-06-15 10:01 Gavin Shan
  2026-06-15 10:01 ` [PATCH v2 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors Gavin Shan
  2026-06-15 10:01 ` [PATCH v2 2/2] system/memory: Make ram device region directly accessible Gavin Shan
  0 siblings, 2 replies; 20+ messages in thread
From: Gavin Shan @ 2026-06-15 10:01 UTC (permalink / raw)
  To: qemu-arm
  Cc: qemu-devel, mst, peterx, alex, richard.henderson, peter.maydell,
	berrange, philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

All ram device regions was turned to be indirectly accessible by commit
4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
to a hanged guest where a NVidia GH100 GPU is passed from host. The memory
in its PCI BAR#4 can be allocated as DMA target buffer. qemu has to take
DMA bounce buffer in address_space_map() to cover the DMA request. However,
the bounce buffer size is 4096 bytes and we're overrunning it easily when
the guest has significant disk activities on compiling 'cuda-samples'.
The full log and problem description can be found from PATCH[1/2]'s commit
log.

Try to fix the issue handled in commit 4a2e242bbb by replacing memcopy()/
memmove() with newly added helpers qemu_ram_{copy, move}() that works on
top of __builtin_{memcpy, memmove} or unaligned access friendly memory
movement in the accessors to the ram device regions. With this, we can
basically revert that commit to make ram device region directly accessible
again and bypass the bounce buffer in address_space_map() where the guest
hang is caused.

PATCH[1] uses qemu_ram_{copy, move}() in ram device region accessors
PATCH[2] makes ram device region directly accessible again

Changelog
=========
v1 -> v2:
  * https://lore.kernel.org/qemu-arm/20260612110307.1264798-1-gshan@redhat.com/
  * Rename address_space_{memcpy, memmove}() to qemu_ram_{copy, move}()
    and move them to physmem.c and memory.h   (Philippe)
  * Use memcpy() and memmove() in qemu_ram_{copy, move}() for the variable
    length case                               (Miachel)
  * Handle unaligned access in qemu_ram_{copy, move}() for all archs
    except i386 and x86_64                    (Richard/Michael)
RFCv1 -> v1:
  * https://lists.nongnu.org/archive/html/qemu-arm/2026-06/msg00307.html
  * Reworked solution based on suggestions from Peter Xu, Peter Maydell
    and Michael S. Tsirkin

Gavin Shan (2):
  system/memory: Use qemu_ram_{copy, move}() in ram device region
    accessors
  system/memory: Make ram device region directly accessible

 hw/remote/vfio-user-obj.c |  4 +-
 include/system/memory.h   | 15 +++----
 system/memory.c           | 41 +-----------------
 system/physmem.c          | 91 +++++++++++++++++++++++++++++++++++++--
 system/trace-events       |  2 -
 5 files changed, 95 insertions(+), 58 deletions(-)

-- 
2.54.0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v2 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-15 10:01 [PATCH v2 0/2] system/memory: Make ram device region directly accessible Gavin Shan
@ 2026-06-15 10:01 ` Gavin Shan
  2026-06-15 10:57   ` Peter Maydell
  2026-06-15 15:17   ` Richard Henderson
  2026-06-15 10:01 ` [PATCH v2 2/2] system/memory: Make ram device region directly accessible Gavin Shan
  1 sibling, 2 replies; 20+ messages in thread
From: Gavin Shan @ 2026-06-15 10:01 UTC (permalink / raw)
  To: qemu-arm
  Cc: qemu-devel, mst, peterx, alex, richard.henderson, peter.maydell,
	berrange, philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

All ram device regions were turned to be indirectly accessible by commit
4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
to guest hang on attempt to build 'cuda-samples' as reported by Julia. The
guest is started by the following command lines, with GH100 GPU card passed
from the host.

   host$ lspci | grep GH100
   0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
   host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64            \
         -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T         \
         -accel kvm -cpu host -smp cpus=48 -m size=8G                       \
         -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0    \
         -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4     \
         -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
           :
   guest$ cd cuda-samples/build
   guest$ make -j 20 clean
   guest$ make -j 20
           :
   [ 54%] Linking CUDA executable graphMemoryNodes
   [ 54%] Built target graphMemoryNodes
   <no more output afterwards, guest becomes frozen here>

   guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
   [  555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)

When the GPU's driver (NVidia open driver) is loaded on guest bootup,
the memory blocks residing in the PCI BAR#4 of the GH100 GPU card can
be presented to the guest through memory hot-add. The page cache can
then be allocated from the hot added memory blocks when cuda-samples
is being built. Afterwards, the page cache is sent to QEMU's virtio-blk
device as part of the DMA request, the bounce buffer has to be used to
accomodate the request as the corresponding memory region (MemoryRegion)
is an indirectly accessible ram device region in qemu. However, the max
bounce bufer size is only 4096 bytes by default and that is exhausted
quickly, leading to a reset on the virtio-blk device and frozen guest
eventually.

  QEMU
  ====
  virtio_blk_handle_output
    virtio_blk_handle_vq
      virtio_blk_get_request
        virtqueue_pop
          virtqueue_split_pop
            virtqueue_map_desc
              address_space_map
                memory_access_is_direct         # Return false
                  memory_region_supports_direct_access

  (qemu) info mtree
  memory-region: pci_bridge_pci
    0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
      0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
        0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
          0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]

This adds qemu_ram_{copy, move}() and replaces {memcpy, memmove}() with
them in the ram device memory region accessors, similar to what's done
in commit 4a2e242bbb so that the issue (MMIO access instructions were
optimized to SSE instructions) covered by that commit is fixed. This
makes 'ram_device_mem_ops' redundant, paving the way to revert that
commit to make ram device region directly accessible again in the next
patch.

Reported-by: Julia Graham <jugraham@redhat.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Suggested-by: Peter Xu <peterx@redhat.com>
Suggested-by: Richard Henderson <richard.henderson@linaro.org>
Suggested-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
v2: Rename address_space_{memcpy, memmove}() to qemu_ram_{copy, move}()
and move them to physmem.c and memory.h, the unaligned access is also
covered for all architectures except i386 and x86_64 as Richard suggested.
---
 hw/remote/vfio-user-obj.c |  4 +-
 include/system/memory.h   |  4 +-
 system/physmem.c          | 91 +++++++++++++++++++++++++++++++++++++--
 3 files changed, 92 insertions(+), 7 deletions(-)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 87fa7b6572..97a6c88780 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -375,9 +375,9 @@ static int vfu_object_mr_rw(MemoryRegion *mr, uint8_t *buf, hwaddr offset,
         ram_ptr = memory_region_get_ram_ptr(mr);
 
         if (is_write) {
-            memcpy((ram_ptr + offset), buf, size);
+            qemu_ram_copy(ram_ptr + offset, buf, size);
         } else {
-            memcpy(buf, (ram_ptr + offset), size);
+            qemu_ram_copy(buf, ram_ptr + offset, size);
         }
 
         return 0;
diff --git a/include/system/memory.h b/include/system/memory.h
index 1417132f6d..5878727d09 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -2897,6 +2897,8 @@ void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
 void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
 
 /* Internal functions, part of the implementation of address_space_read.  */
+void qemu_ram_copy(void *dest, const void *src, size_t n);
+void qemu_ram_move(void *dest, const void *src, size_t n);
 MemTxResult address_space_read_full(AddressSpace *as, hwaddr addr,
                                     MemTxAttrs attrs, void *buf, hwaddr len);
 MemTxResult flatview_read_continue(FlatView *fv, hwaddr addr,
@@ -2970,7 +2972,7 @@ MemTxResult address_space_read(AddressSpace *as, hwaddr addr,
             mr = flatview_translate(fv, addr, &addr1, &l, false, attrs);
             if (len == l && memory_access_is_direct(mr, false, attrs)) {
                 ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
-                memcpy(buf, ptr, len);
+                qemu_ram_copy(buf, ptr, len);
             } else {
                 result = flatview_read_continue(fv, addr, attrs, buf, len,
                                                 addr1, l, mr);
diff --git a/system/physmem.c b/system/physmem.c
index 7bcbf87573..7f98101612 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3160,6 +3160,90 @@ void memory_region_flush_rom_device(MemoryRegion *mr, hwaddr addr, hwaddr size)
     invalidate_and_set_dirty(mr, addr, size);
 }
 
+#if defined(__i386__) || defined(__x86_64__)
+#define HOST_UNALIGNED_MMIO_OK 1
+#else
+#define HOST_UNALIGNED_MMIO_OK 0
+#endif
+
+void qemu_ram_copy(void *dest, const void *src, size_t n)
+{
+    if (HOST_UNALIGNED_MMIO_OK) {
+        switch (n) {
+        case 1:
+            __builtin_memcpy(dest, src, 1);
+            break;
+        case 2:
+            __builtin_memcpy(dest, src, 2);
+            break;
+        case 4:
+            __builtin_memcpy(dest, src, 4);
+            break;
+        case 8:
+            __builtin_memcpy(dest, src, 8);
+            break;
+        default:
+            memcpy(dest, src, n);
+        }
+    } else {
+        uintptr_t test, lsb;
+
+        do {
+            test = (uintptr_t)dest | n;
+            lsb = test & -test;
+            switch (lsb) {
+            case 1:
+                *(uint8_t *)dest = *(uint8_t *)src;
+                src += 1;
+                dest += 1;
+                n -= 1;
+                break;
+            case 2:
+                *(uint16_t *)dest = *(uint16_t *)src;
+                src += 2;
+                dest += 2;
+                n -= 2;
+                break;
+            case 4:
+                *(uint32_t *)dest = *(uint32_t *)src;
+                src += 4;
+                dest += 4;
+                n -= 4;
+                break;
+            default:
+                *(uint64_t *)dest = *(uint64_t *)src;
+                src += 8;
+                dest += 8;
+                n -= 8;
+            }
+        } while (n != 0);
+    }
+}
+
+void qemu_ram_move(void *dest, const void *src, size_t n)
+{
+    if (HOST_UNALIGNED_MMIO_OK) {
+        switch (n) {
+        case 1:
+            __builtin_memmove(dest, src, 1);
+            break;
+        case 2:
+            __builtin_memmove(dest, src, 2);
+            break;
+        case 4:
+            __builtin_memmove(dest, src, 4);
+            break;
+        case 8:
+            __builtin_memmove(dest, src, 8);
+            break;
+        default:
+            memmove(dest, src, n);
+        }
+    } else {
+        qemu_ram_copy(dest, src, n);
+    }
+}
+
 int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr)
 {
     unsigned access_size_max = mr->ops->valid.max_access_size;
@@ -3272,7 +3356,7 @@ static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
         uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
                                                false, true);
 
-        memmove(ram_ptr, buf, *l);
+        qemu_ram_move(ram_ptr, buf, *l);
         invalidate_and_set_dirty(mr, mr_addr, *l);
 
         return MEMTX_OK;
@@ -3365,7 +3449,7 @@ static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
         uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
                                                false, false);
 
-        memcpy(buf, ram_ptr, *l);
+        qemu_ram_copy(buf, ram_ptr, *l);
 
         return MEMTX_OK;
     }
@@ -3503,8 +3587,7 @@ MemTxResult address_space_write_rom(AddressSpace *as, hwaddr addr,
             l = memory_access_size(mr, l, addr1);
         } else {
             /* ROM/RAM case */
-            void *ram_ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
-            memcpy(ram_ptr, buf, l);
+            qemu_ram_copy(qemu_map_ram_ptr(mr->ram_block, addr1), buf, l);
             invalidate_and_set_dirty(mr, addr1, l);
         }
         len -= l;
-- 
2.54.0



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 2/2] system/memory: Make ram device region directly accessible
  2026-06-15 10:01 [PATCH v2 0/2] system/memory: Make ram device region directly accessible Gavin Shan
  2026-06-15 10:01 ` [PATCH v2 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors Gavin Shan
@ 2026-06-15 10:01 ` Gavin Shan
  1 sibling, 0 replies; 20+ messages in thread
From: Gavin Shan @ 2026-06-15 10:01 UTC (permalink / raw)
  To: qemu-arm
  Cc: qemu-devel, mst, peterx, alex, richard.henderson, peter.maydell,
	berrange, philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

This basically reverts 4a2e242bbb30 ("memory: Don't use memcpy for
ram_device regions") to make ram device region directly accessible
again. With this, the bounce buffer is bypassed in address_space_map()
when a ram device region is involved, potentially avoid to overrun
the (small) bounce buffer.

Reported-by: Julia Graham <jugraham@redhat.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Suggested-by: Peter Xu <peterx@redhat.com>
Suggested-by: Richard Henderson <richard.henderson@linaro.org>
Suggested-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 include/system/memory.h | 11 ++---------
 system/memory.c         | 41 +----------------------------------------
 system/trace-events     |  2 --
 3 files changed, 3 insertions(+), 51 deletions(-)

diff --git a/include/system/memory.h b/include/system/memory.h
index 5878727d09..7d6ef9b32e 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -2916,15 +2916,8 @@ static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
     if (memory_region_is_romd(mr)) {
         return true;
     }
-    if (!memory_region_is_ram(mr)) {
-        return false;
-    }
-    /*
-     * RAM DEVICE regions can be accessed directly using memcpy, but it might
-     * be MMIO and access using mempy can be wrong (e.g., using instructions not
-     * intended for MMIO access). So we treat this as IO.
-     */
-    return !memory_region_is_ram_device(mr);
+
+    return memory_region_is_ram(mr);
 }
 
 static inline bool memory_access_is_direct(const MemoryRegion *mr,
diff --git a/system/memory.c b/system/memory.c
index 739ba11da6..9549dd1a94 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -1362,43 +1362,6 @@ const MemoryRegionOps unassigned_mem_ops = {
     .endianness = DEVICE_NATIVE_ENDIAN,
 };
 
-static uint64_t memory_region_ram_device_read(void *opaque,
-                                              hwaddr addr, unsigned size)
-{
-    MemoryRegion *mr = opaque;
-    uint64_t data = ldn_he_p(mr->ram_block->host + addr, size);
-
-    trace_memory_region_ram_device_read(get_cpu_index(), mr, addr, data, size);
-
-    return data;
-}
-
-static void memory_region_ram_device_write(void *opaque, hwaddr addr,
-                                           uint64_t data, unsigned size)
-{
-    MemoryRegion *mr = opaque;
-
-    trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, data, size);
-
-    stn_he_p(mr->ram_block->host + addr, size, data);
-}
-
-static const MemoryRegionOps ram_device_mem_ops = {
-    .read = memory_region_ram_device_read,
-    .write = memory_region_ram_device_write,
-    .endianness = HOST_BIG_ENDIAN ? DEVICE_BIG_ENDIAN : DEVICE_LITTLE_ENDIAN,
-    .valid = {
-        .min_access_size = 1,
-        .max_access_size = 8,
-        .unaligned = true,
-    },
-    .impl = {
-        .min_access_size = 1,
-        .max_access_size = 8,
-        .unaligned = true,
-    },
-};
-
 bool memory_region_access_valid(MemoryRegion *mr,
                                 hwaddr addr,
                                 unsigned size,
@@ -1676,10 +1639,8 @@ void memory_region_init_ram_device_ptr(MemoryRegion *mr, Object *owner,
                                        const char *name, uint64_t size,
                                        void *ptr)
 {
-    memory_region_init_io(mr, owner, &ram_device_mem_ops, mr, name, size);
-    mr->ram = true;
+    memory_region_init_ram_ptr(mr, owner, name, size, ptr);
     mr->ram_device = true;
-    memory_region_set_ram_ptr(mr, size, ptr);
 }
 
 void memory_region_init_alias(MemoryRegion *mr, Object *owner,
diff --git a/system/trace-events b/system/trace-events
index e6e1b61279..34af0a3a1e 100644
--- a/system/trace-events
+++ b/system/trace-events
@@ -20,8 +20,6 @@ memory_region_ops_read(int cpu_index, void *mr, uint64_t addr, uint64_t value, u
 memory_region_ops_write(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size, const char *name) "cpu %d mr %p addr 0x%"PRIx64" value 0x%"PRIx64" size %u name '%s'"
 memory_region_subpage_read(int cpu_index, void *mr, uint64_t offset, uint64_t value, unsigned size) "cpu %d mr %p offset 0x%"PRIx64" value 0x%"PRIx64" size %u"
 memory_region_subpage_write(int cpu_index, void *mr, uint64_t offset, uint64_t value, unsigned size) "cpu %d mr %p offset 0x%"PRIx64" value 0x%"PRIx64" size %u"
-memory_region_ram_device_read(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size) "cpu %d mr %p addr 0x%"PRIx64" value 0x%"PRIx64" size %u"
-memory_region_ram_device_write(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size) "cpu %d mr %p addr 0x%"PRIx64" value 0x%"PRIx64" size %u"
 memory_region_sync_dirty(const char *mr, const char *listener, int global) "mr '%s' listener '%s' synced (global=%d)"
 flatview_new(void *view, void *root) "%p (root %p)"
 flatview_destroy(void *view, void *root) "%p (root %p)"
-- 
2.54.0



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-15 10:01 ` [PATCH v2 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors Gavin Shan
@ 2026-06-15 10:57   ` Peter Maydell
  2026-06-15 14:48     ` Michael S. Tsirkin
  2026-06-15 14:56     ` Michael S. Tsirkin
  2026-06-15 15:17   ` Richard Henderson
  1 sibling, 2 replies; 20+ messages in thread
From: Peter Maydell @ 2026-06-15 10:57 UTC (permalink / raw)
  To: Gavin Shan
  Cc: qemu-arm, qemu-devel, mst, peterx, alex, richard.henderson,
	berrange, philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

On Mon, 15 Jun 2026 at 11:03, Gavin Shan <gshan@redhat.com> wrote:
>
> All ram device regions were turned to be indirectly accessible by commit
> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> to guest hang on attempt to build 'cuda-samples' as reported by Julia. The
> guest is started by the following command lines, with GH100 GPU card passed
> from the host.

> diff --git a/include/system/memory.h b/include/system/memory.h
> index 1417132f6d..5878727d09 100644
> --- a/include/system/memory.h
> +++ b/include/system/memory.h
> @@ -2897,6 +2897,8 @@ void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
>  void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
>
>  /* Internal functions, part of the implementation of address_space_read.  */
> +void qemu_ram_copy(void *dest, const void *src, size_t n);
> +void qemu_ram_move(void *dest, const void *src, size_t n);

New function prototypes in include headers need documentation comments.

In particular for these, it's really important that we clearly say
what semantics we are attempting to provide with them, so that
(a) when we're reviewing or later updating the implementation we
know what we are trying to provide
(b) when we're looking at the callsites we know what the function
is guaranteeing to us and what it is not, and thus whether it's
OK to use it or we need something els.

> +#if defined(__i386__) || defined(__x86_64__)
> +#define HOST_UNALIGNED_MMIO_OK 1
> +#else
> +#define HOST_UNALIGNED_MMIO_OK 0
> +#endif

We need to do something better than this. We can't
just say "oh, we trust that on x86 this works": it is
neither actually true that the compiler guarantees it even
on x86, nor is it the case that only x86 can do unaligned
accesses to normal RAM.

thanks
-- PMM


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-15 10:57   ` Peter Maydell
@ 2026-06-15 14:48     ` Michael S. Tsirkin
  2026-06-15 14:56     ` Michael S. Tsirkin
  1 sibling, 0 replies; 20+ messages in thread
From: Michael S. Tsirkin @ 2026-06-15 14:48 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Gavin Shan, qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	berrange, philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

On Mon, Jun 15, 2026 at 11:57:09AM +0100, Peter Maydell wrote:
> On Mon, 15 Jun 2026 at 11:03, Gavin Shan <gshan@redhat.com> wrote:
> >
> > All ram device regions were turned to be indirectly accessible by commit
> > 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> > to guest hang on attempt to build 'cuda-samples' as reported by Julia. The
> > guest is started by the following command lines, with GH100 GPU card passed
> > from the host.
> 
> > diff --git a/include/system/memory.h b/include/system/memory.h
> > index 1417132f6d..5878727d09 100644
> > --- a/include/system/memory.h
> > +++ b/include/system/memory.h
> > @@ -2897,6 +2897,8 @@ void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
> >  void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
> >
> >  /* Internal functions, part of the implementation of address_space_read.  */
> > +void qemu_ram_copy(void *dest, const void *src, size_t n);
> > +void qemu_ram_move(void *dest, const void *src, size_t n);
> 
> New function prototypes in include headers need documentation comments.
> 
> In particular for these, it's really important that we clearly say
> what semantics we are attempting to provide with them, so that
> (a) when we're reviewing or later updating the implementation we
> know what we are trying to provide
> (b) when we're looking at the callsites we know what the function
> is guaranteeing to us and what it is not, and thus whether it's
> OK to use it or we need something els.
> 
> > +#if defined(__i386__) || defined(__x86_64__)
> > +#define HOST_UNALIGNED_MMIO_OK 1
> > +#else
> > +#define HOST_UNALIGNED_MMIO_OK 0
> > +#endif
> 
> We need to do something better than this. We can't
> just say "oh, we trust that on x86 this works": it is
> neither actually true that the compiler guarantees it even
> on x86, nor is it the case that only x86 can do unaligned
> accesses to normal RAM.
> 
> thanks
> -- PMM

Why normal RAM? It explicitly says MMIO.
x86 seems to be the only arch that can do unaligned MMIO?



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-15 10:57   ` Peter Maydell
  2026-06-15 14:48     ` Michael S. Tsirkin
@ 2026-06-15 14:56     ` Michael S. Tsirkin
  2026-06-15 15:12       ` Peter Maydell
  1 sibling, 1 reply; 20+ messages in thread
From: Michael S. Tsirkin @ 2026-06-15 14:56 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Gavin Shan, qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	berrange, philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

On Mon, Jun 15, 2026 at 11:57:09AM +0100, Peter Maydell wrote:
> On Mon, 15 Jun 2026 at 11:03, Gavin Shan <gshan@redhat.com> wrote:
> >
> > All ram device regions were turned to be indirectly accessible by commit
> > 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> > to guest hang on attempt to build 'cuda-samples' as reported by Julia. The
> > guest is started by the following command lines, with GH100 GPU card passed
> > from the host.
> 
> > diff --git a/include/system/memory.h b/include/system/memory.h
> > index 1417132f6d..5878727d09 100644
> > --- a/include/system/memory.h
> > +++ b/include/system/memory.h
> > @@ -2897,6 +2897,8 @@ void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
> >  void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
> >
> >  /* Internal functions, part of the implementation of address_space_read.  */
> > +void qemu_ram_copy(void *dest, const void *src, size_t n);
> > +void qemu_ram_move(void *dest, const void *src, size_t n);
> 
> New function prototypes in include headers need documentation comments.
> 
> In particular for these, it's really important that we clearly say
> what semantics we are attempting to provide with them, so that
> (a) when we're reviewing or later updating the implementation we
> know what we are trying to provide
> (b) when we're looking at the callsites we know what the function
> is guaranteeing to us and what it is not, and thus whether it's
> OK to use it or we need something els.
> 
> > +#if defined(__i386__) || defined(__x86_64__)
> > +#define HOST_UNALIGNED_MMIO_OK 1
> > +#else
> > +#define HOST_UNALIGNED_MMIO_OK 0
> > +#endif
> 
> We need to do something better than this. We can't
> just say "oh, we trust that on x86 this works": it is
> neither actually true that the compiler guarantees it

sorry guarantee what?

> even
> on x86, nor is it the case that only x86 can do unaligned
> accesses to normal RAM.
> 
> thanks
> -- PMM



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-15 14:56     ` Michael S. Tsirkin
@ 2026-06-15 15:12       ` Peter Maydell
  2026-06-15 19:24         ` Gavin Shan
  2026-06-15 19:52         ` Michael S. Tsirkin
  0 siblings, 2 replies; 20+ messages in thread
From: Peter Maydell @ 2026-06-15 15:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gavin Shan, qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	berrange, philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

On Mon, 15 Jun 2026 at 15:56, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Mon, Jun 15, 2026 at 11:57:09AM +0100, Peter Maydell wrote:
> > On Mon, 15 Jun 2026 at 11:03, Gavin Shan <gshan@redhat.com> wrote:
> > >
> > > All ram device regions were turned to be indirectly accessible by commit
> > > 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> > > to guest hang on attempt to build 'cuda-samples' as reported by Julia. The
> > > guest is started by the following command lines, with GH100 GPU card passed
> > > from the host.
> >
> > > diff --git a/include/system/memory.h b/include/system/memory.h
> > > index 1417132f6d..5878727d09 100644
> > > --- a/include/system/memory.h
> > > +++ b/include/system/memory.h
> > > @@ -2897,6 +2897,8 @@ void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
> > >  void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
> > >
> > >  /* Internal functions, part of the implementation of address_space_read.  */
> > > +void qemu_ram_copy(void *dest, const void *src, size_t n);
> > > +void qemu_ram_move(void *dest, const void *src, size_t n);
> >
> > New function prototypes in include headers need documentation comments.
> >
> > In particular for these, it's really important that we clearly say
> > what semantics we are attempting to provide with them, so that
> > (a) when we're reviewing or later updating the implementation we
> > know what we are trying to provide
> > (b) when we're looking at the callsites we know what the function
> > is guaranteeing to us and what it is not, and thus whether it's
> > OK to use it or we need something els.
> >
> > > +#if defined(__i386__) || defined(__x86_64__)
> > > +#define HOST_UNALIGNED_MMIO_OK 1
> > > +#else
> > > +#define HOST_UNALIGNED_MMIO_OK 0
> > > +#endif
> >
> > We need to do something better than this. We can't
> > just say "oh, we trust that on x86 this works": it is
> > neither actually true that the compiler guarantees it
>
> sorry guarantee what?

Well, that's part of my point about "we need a doc comment":
what exactly are we trying to guarantee ? But whatever it is,
"hardcoded ifdef that privileges x86" is definitely the wrong
answer.

-- PMM


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-15 10:01 ` [PATCH v2 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors Gavin Shan
  2026-06-15 10:57   ` Peter Maydell
@ 2026-06-15 15:17   ` Richard Henderson
  2026-06-15 16:33     ` Gavin Shan
  2026-06-15 16:35     ` Michael S. Tsirkin
  1 sibling, 2 replies; 20+ messages in thread
From: Richard Henderson @ 2026-06-15 15:17 UTC (permalink / raw)
  To: Gavin Shan, qemu-arm
  Cc: qemu-devel, mst, peterx, alex, peter.maydell, berrange, philmd,
	philmd, david, clg, pbonzini, phrdina, jugraham, liugang24219,
	dinghui, shan.gavin

On 6/15/26 03:01, Gavin Shan wrote:
> +void qemu_ram_copy(void *dest, const void *src, size_t n)
> +{
> +    if (HOST_UNALIGNED_MMIO_OK) {
> +        switch (n) {
> +        case 1:
> +            __builtin_memcpy(dest, src, 1);
> +            break;
> +        case 2:
> +            __builtin_memcpy(dest, src, 2);
> +            break;
> +        case 4:
> +            __builtin_memcpy(dest, src, 4);
> +            break;
> +        case 8:
> +            __builtin_memcpy(dest, src, 8);
> +            break;
> +        default:
> +            memcpy(dest, src, n);
> +        }
> +    } else {
> +        uintptr_t test, lsb;
> +
> +        do {
> +            test = (uintptr_t)dest | n;
> +            lsb = test & -test;
> +            switch (lsb) {

Either assert n != 0 to start, or use while not do/while.
Because the body of the loop won't handle n == 0 correctly.

> +            case 1:
> +                *(uint8_t *)dest = *(uint8_t *)src;
> +                src += 1;
> +                dest += 1;
> +                n -= 1;
> +                break;
> +            case 2:
> +                *(uint16_t *)dest = *(uint16_t *)src;
> +                src += 2;
> +                dest += 2;
> +                n -= 2;
> +                break;
> +            case 4:
> +                *(uint32_t *)dest = *(uint32_t *)src;
> +                src += 4;
> +                dest += 4;
> +                n -= 4;
> +                break;
> +            default:
> +                *(uint64_t *)dest = *(uint64_t *)src;
> +                src += 8;
> +                dest += 8;
> +                n -= 8;

Use qatomic_set for the stores.

src is not aligned, so except for case 1, you need ld{uw,l,q}_he_p.

> +void qemu_ram_move(void *dest, const void *src, size_t n)
> +{
> +    if (HOST_UNALIGNED_MMIO_OK) {
> +        switch (n) {
> +        case 1:
> +            __builtin_memmove(dest, src, 1);
> +            break;
> +        case 2:
> +            __builtin_memmove(dest, src, 2);
> +            break;
> +        case 4:
> +            __builtin_memmove(dest, src, 4);
> +            break;
> +        case 8:
> +            __builtin_memmove(dest, src, 8);
> +            break;
> +        default:
> +            memmove(dest, src, n);
> +        }
> +    } else {
> +        qemu_ram_copy(dest, src, n);
> +    }
> +}

The qemu_ram_copy implementation above does not work with overlapping blocks.


r~



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-15 15:17   ` Richard Henderson
@ 2026-06-15 16:33     ` Gavin Shan
  2026-06-15 17:03       ` Richard Henderson
  2026-06-15 16:35     ` Michael S. Tsirkin
  1 sibling, 1 reply; 20+ messages in thread
From: Gavin Shan @ 2026-06-15 16:33 UTC (permalink / raw)
  To: Richard Henderson, qemu-arm
  Cc: qemu-devel, mst, peterx, alex, peter.maydell, berrange, philmd,
	philmd, david, clg, pbonzini, phrdina, jugraham, liugang24219,
	dinghui, shan.gavin

On 6/16/26 1:17 AM, Richard Henderson wrote:
> On 6/15/26 03:01, Gavin Shan wrote:
>> +void qemu_ram_copy(void *dest, const void *src, size_t n)
>> +{
>> +    if (HOST_UNALIGNED_MMIO_OK) {
>> +        switch (n) {
>> +        case 1:
>> +            __builtin_memcpy(dest, src, 1);
>> +            break;
>> +        case 2:
>> +            __builtin_memcpy(dest, src, 2);
>> +            break;
>> +        case 4:
>> +            __builtin_memcpy(dest, src, 4);
>> +            break;
>> +        case 8:
>> +            __builtin_memcpy(dest, src, 8);
>> +            break;
>> +        default:
>> +            memcpy(dest, src, n);
>> +        }
>> +    } else {
>> +        uintptr_t test, lsb;
>> +
>> +        do {
>> +            test = (uintptr_t)dest | n;
>> +            lsb = test & -test;
>> +            switch (lsb) {
> 
> Either assert n != 0 to start, or use while not do/while.
> Because the body of the loop won't handle n == 0 correctly.
> 

I will change this to "while (n > 0)" since we're not expecting
"n < 0" either.

>> +            case 1:
>> +                *(uint8_t *)dest = *(uint8_t *)src;
>> +                src += 1;
>> +                dest += 1;
>> +                n -= 1;
>> +                break;
>> +            case 2:
>> +                *(uint16_t *)dest = *(uint16_t *)src;
>> +                src += 2;
>> +                dest += 2;
>> +                n -= 2;
>> +                break;
>> +            case 4:
>> +                *(uint32_t *)dest = *(uint32_t *)src;
>> +                src += 4;
>> +                dest += 4;
>> +                n -= 4;
>> +                break;
>> +            default:
>> +                *(uint64_t *)dest = *(uint64_t *)src;
>> +                src += 8;
>> +                dest += 8;
>> +                n -= 8;
> 
> Use qatomic_set for the stores.
> 

Could you confirm which stores need qatomic_set()? There are two sets
of stores as below. I guess you're asking have qatomic_set() for (a)?
Could you explain a bit why qatomic_set() is needed?

     // (a)
     *(uint64_t *)dest = *(uint64_t *)src;


     // (b)
     src += 8;
     dest += 8;
     n -= 8;

> src is not aligned, so except for case 1, you need ld{uw,l,q}_he_p.
> 

Yeah, I realized this after this patch was sent. ldub_p() will be
used for case 1 either.

Another unrelated question: why 'int' value is returned from ldub_p()
and ld{uw, l}_he_p() in bswap.h? They would return 'uint{8, 16, 32}_t'
values?

static inline int ldub_p(const void *ptr)
static inline int lduw_he_p(const void *ptr)
static inline int ldl_he_p(const void *ptr)
static inline uint64_t ldq_he_p(const void *ptr)   // return uint64_t

>> +void qemu_ram_move(void *dest, const void *src, size_t n)
>> +{
>> +    if (HOST_UNALIGNED_MMIO_OK) {
>> +        switch (n) {
>> +        case 1:
>> +            __builtin_memmove(dest, src, 1);
>> +            break;
>> +        case 2:
>> +            __builtin_memmove(dest, src, 2);
>> +            break;
>> +        case 4:
>> +            __builtin_memmove(dest, src, 4);
>> +            break;
>> +        case 8:
>> +            __builtin_memmove(dest, src, 8);
>> +            break;
>> +        default:
>> +            memmove(dest, src, n);
>> +        }
>> +    } else {
>> +        qemu_ram_copy(dest, src, n);
>> +    }
>> +}
> 
> The qemu_ram_copy implementation above does not work with overlapping blocks.
> 

Yeah, it's something I need to figure out for next revision. Thanks for your
review and comments.

> 
> r~
> 

Thanks,
Gavin



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-15 15:17   ` Richard Henderson
  2026-06-15 16:33     ` Gavin Shan
@ 2026-06-15 16:35     ` Michael S. Tsirkin
  2026-06-15 16:37       ` Michael S. Tsirkin
  2026-06-15 17:05       ` Richard Henderson
  1 sibling, 2 replies; 20+ messages in thread
From: Michael S. Tsirkin @ 2026-06-15 16:35 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Gavin Shan, qemu-arm, qemu-devel, peterx, alex, peter.maydell,
	berrange, philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

On Mon, Jun 15, 2026 at 08:17:14AM -0700, Richard Henderson wrote:
> On 6/15/26 03:01, Gavin Shan wrote:
> > +void qemu_ram_copy(void *dest, const void *src, size_t n)
> > +{
> > +    if (HOST_UNALIGNED_MMIO_OK) {
> > +        switch (n) {
> > +        case 1:
> > +            __builtin_memcpy(dest, src, 1);
> > +            break;
> > +        case 2:
> > +            __builtin_memcpy(dest, src, 2);
> > +            break;
> > +        case 4:
> > +            __builtin_memcpy(dest, src, 4);
> > +            break;
> > +        case 8:
> > +            __builtin_memcpy(dest, src, 8);
> > +            break;
> > +        default:
> > +            memcpy(dest, src, n);
> > +        }
> > +    } else {
> > +        uintptr_t test, lsb;
> > +
> > +        do {
> > +            test = (uintptr_t)dest | n;
> > +            lsb = test & -test;
> > +            switch (lsb) {
> 
> Either assert n != 0 to start, or use while not do/while.
> Because the body of the loop won't handle n == 0 correctly.
> 
> > +            case 1:
> > +                *(uint8_t *)dest = *(uint8_t *)src;
> > +                src += 1;
> > +                dest += 1;
> > +                n -= 1;
> > +                break;
> > +            case 2:
> > +                *(uint16_t *)dest = *(uint16_t *)src;
> > +                src += 2;
> > +                dest += 2;
> > +                n -= 2;
> > +                break;
> > +            case 4:
> > +                *(uint32_t *)dest = *(uint32_t *)src;
> > +                src += 4;
> > +                dest += 4;
> > +                n -= 4;
> > +                break;
> > +            default:
> > +                *(uint64_t *)dest = *(uint64_t *)src;
> > +                src += 8;
> > +                dest += 8;
> > +                n -= 8;
> 
> Use qatomic_set for the stores.


Won't this do things like locked on x86?



> 
> src is not aligned, so except for case 1, you need ld{uw,l,q}_he_p.

These just map back to memcpy. What is the point?

> > +void qemu_ram_move(void *dest, const void *src, size_t n)
> > +{
> > +    if (HOST_UNALIGNED_MMIO_OK) {
> > +        switch (n) {
> > +        case 1:
> > +            __builtin_memmove(dest, src, 1);
> > +            break;
> > +        case 2:
> > +            __builtin_memmove(dest, src, 2);
> > +            break;
> > +        case 4:
> > +            __builtin_memmove(dest, src, 4);
> > +            break;
> > +        case 8:
> > +            __builtin_memmove(dest, src, 8);
> > +            break;
> > +        default:
> > +            memmove(dest, src, n);
> > +        }
> > +    } else {
> > +        qemu_ram_copy(dest, src, n);
> > +    }
> > +}
> 
> The qemu_ram_copy implementation above does not work with overlapping blocks.

IIUC copy is not supposed to. this is what move is for.

> 
> r~



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-15 16:35     ` Michael S. Tsirkin
@ 2026-06-15 16:37       ` Michael S. Tsirkin
  2026-06-15 17:05       ` Richard Henderson
  1 sibling, 0 replies; 20+ messages in thread
From: Michael S. Tsirkin @ 2026-06-15 16:37 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Gavin Shan, qemu-arm, qemu-devel, peterx, alex, peter.maydell,
	berrange, philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

On Mon, Jun 15, 2026 at 12:36:00PM -0400, Michael S. Tsirkin wrote:
> On Mon, Jun 15, 2026 at 08:17:14AM -0700, Richard Henderson wrote:
> > On 6/15/26 03:01, Gavin Shan wrote:
> > > +void qemu_ram_copy(void *dest, const void *src, size_t n)
> > > +{
> > > +    if (HOST_UNALIGNED_MMIO_OK) {
> > > +        switch (n) {
> > > +        case 1:
> > > +            __builtin_memcpy(dest, src, 1);
> > > +            break;
> > > +        case 2:
> > > +            __builtin_memcpy(dest, src, 2);
> > > +            break;
> > > +        case 4:
> > > +            __builtin_memcpy(dest, src, 4);
> > > +            break;
> > > +        case 8:
> > > +            __builtin_memcpy(dest, src, 8);
> > > +            break;
> > > +        default:
> > > +            memcpy(dest, src, n);
> > > +        }
> > > +    } else {
> > > +        uintptr_t test, lsb;
> > > +
> > > +        do {
> > > +            test = (uintptr_t)dest | n;
> > > +            lsb = test & -test;
> > > +            switch (lsb) {
> > 
> > Either assert n != 0 to start, or use while not do/while.
> > Because the body of the loop won't handle n == 0 correctly.

I think n!=0 performs better than a loop. Worth checking
asm though.

> > > +            case 1:
> > > +                *(uint8_t *)dest = *(uint8_t *)src;
> > > +                src += 1;
> > > +                dest += 1;
> > > +                n -= 1;
> > > +                break;
> > > +            case 2:
> > > +                *(uint16_t *)dest = *(uint16_t *)src;
> > > +                src += 2;
> > > +                dest += 2;
> > > +                n -= 2;
> > > +                break;
> > > +            case 4:
> > > +                *(uint32_t *)dest = *(uint32_t *)src;
> > > +                src += 4;
> > > +                dest += 4;
> > > +                n -= 4;
> > > +                break;
> > > +            default:
> > > +                *(uint64_t *)dest = *(uint64_t *)src;
> > > +                src += 8;
> > > +                dest += 8;
> > > +                n -= 8;
> > 
> > Use qatomic_set for the stores.
> 
> 
> Won't this do things like locked on x86?

Hmm I didn't check. It's relaxed - so no locked. Fine by me.

> 
> 
> > 
> > src is not aligned, so except for case 1, you need ld{uw,l,q}_he_p.
> 
> These just map back to memcpy. What is the point?
> 
> > > +void qemu_ram_move(void *dest, const void *src, size_t n)
> > > +{
> > > +    if (HOST_UNALIGNED_MMIO_OK) {
> > > +        switch (n) {
> > > +        case 1:
> > > +            __builtin_memmove(dest, src, 1);
> > > +            break;
> > > +        case 2:
> > > +            __builtin_memmove(dest, src, 2);
> > > +            break;
> > > +        case 4:
> > > +            __builtin_memmove(dest, src, 4);
> > > +            break;
> > > +        case 8:
> > > +            __builtin_memmove(dest, src, 8);
> > > +            break;
> > > +        default:
> > > +            memmove(dest, src, n);
> > > +        }
> > > +    } else {
> > > +        qemu_ram_copy(dest, src, n);
> > > +    }
> > > +}
> > 
> > The qemu_ram_copy implementation above does not work with overlapping blocks.
> 
> IIUC copy is not supposed to. this is what move is for.
> 
> > 
> > r~



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-15 16:33     ` Gavin Shan
@ 2026-06-15 17:03       ` Richard Henderson
  2026-06-15 18:09         ` Gavin Shan
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2026-06-15 17:03 UTC (permalink / raw)
  To: Gavin Shan, qemu-arm
  Cc: qemu-devel, mst, peterx, alex, peter.maydell, berrange, philmd,
	philmd, david, clg, pbonzini, phrdina, jugraham, liugang24219,
	dinghui, shan.gavin

On 6/15/26 09:33, Gavin Shan wrote:
>> Either assert n != 0 to start, or use while not do/while.
>> Because the body of the loop won't handle n == 0 correctly.
>>
> 
> I will change this to "while (n > 0)" since we're not expecting
> "n < 0" either.

size_t is unsigned.

> Could you confirm which stores need qatomic_set()? There are two sets
> of stores as below. I guess you're asking have qatomic_set() for (a)?
> Could you explain a bit why qatomic_set() is needed?
> 
>      // (a)
>      *(uint64_t *)dest = *(uint64_t *)src;

Of course (a).

Using qatomic_set prevents the compiler from merging the stores, kinda like volatile.  Not 
especially likely with the way you've written this, but if you had exchanged the switch 
and loop,

     switch (lsb) {
     case 1:
         for (; n ; n--, dst++, src++) {
             *(uint8_t *)dst = *(uint8_t *)src;
         }

then the compiler is quite likely to "optimize" the loop back to memcpy.

It's also self-documenting what we're intending with the store.

I guess it's also worth asking if this copy is also used for copying *from* device memory. 
  The commit that added memmove (4a73aee8814) suggests that dma from the device uses these 
paths.  In which case you'll either want a separate function for that, or both source and 
destination must be aligned.

> Another unrelated question: why 'int' value is returned from ldub_p()
> and ld{uw, l}_he_p() in bswap.h? They would return 'uint{8, 16, 32}_t'
> values?

A bad choice 20 years ago, and the need to audit all uses in order to change it now.
Newer functions use uintN_t as you suggest.

r~

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-15 16:35     ` Michael S. Tsirkin
  2026-06-15 16:37       ` Michael S. Tsirkin
@ 2026-06-15 17:05       ` Richard Henderson
  1 sibling, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2026-06-15 17:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gavin Shan, qemu-arm, qemu-devel, peterx, alex, peter.maydell,
	berrange, philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

On 6/15/26 09:35, Michael S. Tsirkin wrote:
>>> +void qemu_ram_move(void *dest, const void *src, size_t n)
>>> +{
>>> +    if (HOST_UNALIGNED_MMIO_OK) {
>>> +        switch (n) {
>>> +        case 1:
>>> +            __builtin_memmove(dest, src, 1);
>>> +            break;
>>> +        case 2:
>>> +            __builtin_memmove(dest, src, 2);
>>> +            break;
>>> +        case 4:
>>> +            __builtin_memmove(dest, src, 4);
>>> +            break;
>>> +        case 8:
>>> +            __builtin_memmove(dest, src, 8);
>>> +            break;
>>> +        default:
>>> +            memmove(dest, src, n);
>>> +        }
>>> +    } else {
>>> +        qemu_ram_copy(dest, src, n);
>>> +    }
>>> +}
>>
>> The qemu_ram_copy implementation above does not work with overlapping blocks.
> 
> IIUC copy is not supposed to. this is what move is for.

My point exactly.  You therefore can't use copy to implement move.


r~


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-15 17:03       ` Richard Henderson
@ 2026-06-15 18:09         ` Gavin Shan
  2026-06-15 18:33           ` Richard Henderson
  2026-06-15 19:40           ` Michael S. Tsirkin
  0 siblings, 2 replies; 20+ messages in thread
From: Gavin Shan @ 2026-06-15 18:09 UTC (permalink / raw)
  To: Richard Henderson, qemu-arm
  Cc: qemu-devel, mst, peterx, alex, peter.maydell, berrange, philmd,
	philmd, david, clg, pbonzini, phrdina, jugraham, liugang24219,
	dinghui, shan.gavin

On 6/16/26 3:03 AM, Richard Henderson wrote:
> On 6/15/26 09:33, Gavin Shan wrote:
>>> Either assert n != 0 to start, or use while not do/while.
>>> Because the body of the loop won't handle n == 0 correctly.
>>>
>>
>> I will change this to "while (n > 0)" since we're not expecting
>> "n < 0" either.
> 
> size_t is unsigned.
> 

oh, yes.

>> Could you confirm which stores need qatomic_set()? There are two sets
>> of stores as below. I guess you're asking have qatomic_set() for (a)?
>> Could you explain a bit why qatomic_set() is needed?
>>
>>      // (a)
>>      *(uint64_t *)dest = *(uint64_t *)src;
> 
> Of course (a).
> 
> Using qatomic_set prevents the compiler from merging the stores, kinda like volatile.  Not especially likely with the way you've written this, but if you had exchanged the switch and loop,
> 
>      switch (lsb) {
>      case 1:
>          for (; n ; n--, dst++, src++) {
>              *(uint8_t *)dst = *(uint8_t *)src;
>          }
> 
> then the compiler is quite likely to "optimize" the loop back to memcpy.
> 
> It's also self-documenting what we're intending with the store.
> 

ok, thanks for your explanation.

> I guess it's also worth asking if this copy is also used for copying *from* device memory.  The commit that added memmove (4a73aee8814) suggests that dma from the device uses these paths.  In which case you'll either want a separate function for that, or both source and destination must be aligned.
> 

I think it's a valid case. Lets handle this by limiting the 'step' based on
src/dest/n, something like below. atomic_read() instead of ldxxx_he_p() is
used for loads. Please take a look if there are anything we can improve
further.

// Not compiled and tested yet
static void qemu_ram_copy_unaligned(void *dest, const void *src, size_t n)
{
     uintptr_t test, step;

     while (n != 0) {
         test = (uintptr_t)src | n;       /* alignment enforced by source */
         step = test & -test;
         test = (uintptr_t)dest | n;      /* alignment enforced by destination */
         step = MIN(step, test & -test);

         switch (step) {
         case 1:
             qatomic_set((uint8_t *)dest, qatomic_read((uint8_t *)src));
             src += 1;
             dest += 1;
             n -= 1;
             break;
         case 2:
             qatomic_set((uint16_t *)dest, qatomic_read((uint16_t *)src));
             src += 2;
             dest += 2;
             n -= 2;
             break;
         case 4:
             qatomic_set((uint32_t *)dest, qatomic_read((uint32_t *)src));
             src += 4;
             dest += 4;
             n -= 4;
             break;
         default:
             qatomic_set((uint64_t *)dest, qatomic_read((uint64_t *)src);
             src += 8;
             dest += 8;
             n -= 8;
         }
     }
}


>> Another unrelated question: why 'int' value is returned from ldub_p()
>> and ld{uw, l}_he_p() in bswap.h? They would return 'uint{8, 16, 32}_t'
>> values?
> 
> A bad choice 20 years ago, and the need to audit all uses in order to change it now.
> Newer functions use uintN_t as you suggest.
> 

Ok :-)

> r~
> 

Thanks,
Gavin



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-15 18:09         ` Gavin Shan
@ 2026-06-15 18:33           ` Richard Henderson
  2026-06-15 19:40           ` Michael S. Tsirkin
  1 sibling, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2026-06-15 18:33 UTC (permalink / raw)
  To: Gavin Shan, qemu-arm
  Cc: qemu-devel, mst, peterx, alex, peter.maydell, berrange, philmd,
	philmd, david, clg, pbonzini, phrdina, jugraham, liugang24219,
	dinghui, shan.gavin

On 6/15/26 11:09, Gavin Shan wrote:
>> I guess it's also worth asking if this copy is also used for copying *from* device 
>> memory.  The commit that added memmove (4a73aee8814) suggests that dma from the device 
>> uses these paths.  In which case you'll either want a separate function for that, or 
>> both source and destination must be aligned.
>>
> 
> I think it's a valid case. Lets handle this by limiting the 'step' based on
> src/dest/n, something like below. atomic_read() instead of ldxxx_he_p() is
> used for loads. Please take a look if there are anything we can improve
> further.
> 
> // Not compiled and tested yet
> static void qemu_ram_copy_unaligned(void *dest, const void *src, size_t n)
> {
>      uintptr_t test, step;
> 
>      while (n != 0) {
>          test = (uintptr_t)src | n;       /* alignment enforced by source */
>          step = test & -test;
>          test = (uintptr_t)dest | n;      /* alignment enforced by destination */
>          step = MIN(step, test & -test);

     while (n) {
         uintptr_t test = (uintptr_t)src | (uintptr_t)dst | n;
         uintptr_t step = test & -test;

Otherwise it looks good.


r~


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-15 15:12       ` Peter Maydell
@ 2026-06-15 19:24         ` Gavin Shan
  2026-06-15 19:42           ` Michael S. Tsirkin
  2026-06-15 19:52         ` Michael S. Tsirkin
  1 sibling, 1 reply; 20+ messages in thread
From: Gavin Shan @ 2026-06-15 19:24 UTC (permalink / raw)
  To: Peter Maydell, Michael S. Tsirkin
  Cc: qemu-arm, qemu-devel, peterx, alex, richard.henderson, berrange,
	philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

Hi Peter and Michael,

On 6/16/26 1:12 AM, Peter Maydell wrote:
> On Mon, 15 Jun 2026 at 15:56, Michael S. Tsirkin <mst@redhat.com> wrote:
>>
>> On Mon, Jun 15, 2026 at 11:57:09AM +0100, Peter Maydell wrote:
>>> On Mon, 15 Jun 2026 at 11:03, Gavin Shan <gshan@redhat.com> wrote:
>>>>
>>>> All ram device regions were turned to be indirectly accessible by commit
>>>> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
>>>> to guest hang on attempt to build 'cuda-samples' as reported by Julia. The
>>>> guest is started by the following command lines, with GH100 GPU card passed
>>>> from the host.
>>>
>>>> diff --git a/include/system/memory.h b/include/system/memory.h
>>>> index 1417132f6d..5878727d09 100644
>>>> --- a/include/system/memory.h
>>>> +++ b/include/system/memory.h
>>>> @@ -2897,6 +2897,8 @@ void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
>>>>   void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
>>>>
>>>>   /* Internal functions, part of the implementation of address_space_read.  */
>>>> +void qemu_ram_copy(void *dest, const void *src, size_t n);
>>>> +void qemu_ram_move(void *dest, const void *src, size_t n);
>>>
>>> New function prototypes in include headers need documentation comments.
>>>
>>> In particular for these, it's really important that we clearly say
>>> what semantics we are attempting to provide with them, so that
>>> (a) when we're reviewing or later updating the implementation we
>>> know what we are trying to provide
>>> (b) when we're looking at the callsites we know what the function
>>> is guaranteeing to us and what it is not, and thus whether it's
>>> OK to use it or we need something els.
>>>
>>>> +#if defined(__i386__) || defined(__x86_64__)
>>>> +#define HOST_UNALIGNED_MMIO_OK 1
>>>> +#else
>>>> +#define HOST_UNALIGNED_MMIO_OK 0
>>>> +#endif
>>>
>>> We need to do something better than this. We can't
>>> just say "oh, we trust that on x86 this works": it is
>>> neither actually true that the compiler guarantees it
>>
>> sorry guarantee what?
> 
> Well, that's part of my point about "we need a doc comment":
> what exactly are we trying to guarantee ? But whatever it is,
> "hardcoded ifdef that privileges x86" is definitely the wrong
> answer.
> 

Could you please check the following comments (documentation context) for the
added functions look good to you? Please let me know if there are anything
can be improved.

+
+/**
+ * qemu_ram_copy: copy data to a RAMBlock
+ *
+ * @dest: destination where the data is copied to. It's the pointer returned
+ *        by ramblock_ptr() and its variants.
+ * @src: source where the data is copied from. It can be a regular memory block
+ *       or a pointer returned by ramblock_ptr() and its variants.
+ * @n: length of data to be copied
+ *
+ * The destination is always a pointer to data area of a RAMBlock which can
+ * be for a regular memory block or a MMIO region. The source can be a pointer
+ * to a regular memory block or a MMIO region. Atomicity isn't guranteed by
+ * memcpy() to copy data between those MMIO regions as we do in this function.
+ * Besides, unaligned accesses are well handled on all architectures except
+ * i386 and x86_64 where the unaligned accesses are supported by the CPUs.
+ *
+ * The ensured atomicity and alignment consideration, which aren't needed
+ * by data copy between two regular memory blocks. So performance penalty
+ * is introduced by this function in such circumstance.
+ */
+void qemu_ram_copy(void *dest, const void *src, size_t n);
+
+/**
+ * qemu_ram_move: move data to a RAMBlock
+ *
+ * @dest: destination where the data is moved to. It's the pointer returned
+ *        by ramblock_ptr() and its variants.
+ * @src: source where the data is moved from. It can be a regular memory block
+ *       or a pointer returned by ramblock_ptr() and its variants.
+ * @n: length of data to be moved
+ *
+ * The destination is always a pointer to data area of a RAMBlock which can
+ * be for a regular memory block or a MMIO region. The source can be a pointer
+ * to a regular memory block or a MMIO region. Atomicity isn't guranteed by
+ * memmove() to move data from or to those MMIO regions as we do in this
+ * function. Besides, unaligned accesses are well handled on all architectures
+ * except i386 and x86_64 where the unaligned accesses are supported by the
+ * CPUs.
+ *
+ * The ensured atomicity and alignment consideration, which aren't needed
+ * by data movement between two regular memory blocks. So performance penalty
+ * is introduced by this function in such circumstance.
+ */
+void qemu_ram_move(void *dest, const void *src, size_t n);
+

> -- PMM
> 

Thanks,
Gavin



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-15 18:09         ` Gavin Shan
  2026-06-15 18:33           ` Richard Henderson
@ 2026-06-15 19:40           ` Michael S. Tsirkin
  1 sibling, 0 replies; 20+ messages in thread
From: Michael S. Tsirkin @ 2026-06-15 19:40 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Richard Henderson, qemu-arm, qemu-devel, peterx, alex,
	peter.maydell, berrange, philmd, philmd, david, clg, pbonzini,
	phrdina, jugraham, liugang24219, dinghui, shan.gavin

On Tue, Jun 16, 2026 at 04:09:27AM +1000, Gavin Shan wrote:
> On 6/16/26 3:03 AM, Richard Henderson wrote:
> > On 6/15/26 09:33, Gavin Shan wrote:
> > > > Either assert n != 0 to start, or use while not do/while.
> > > > Because the body of the loop won't handle n == 0 correctly.
> > > > 
> > > 
> > > I will change this to "while (n > 0)" since we're not expecting
> > > "n < 0" either.
> > 
> > size_t is unsigned.
> > 
> 
> oh, yes.
> 
> > > Could you confirm which stores need qatomic_set()? There are two sets
> > > of stores as below. I guess you're asking have qatomic_set() for (a)?
> > > Could you explain a bit why qatomic_set() is needed?
> > > 
> > >      // (a)
> > >      *(uint64_t *)dest = *(uint64_t *)src;
> > 
> > Of course (a).
> > 
> > Using qatomic_set prevents the compiler from merging the stores, kinda like volatile.  Not especially likely with the way you've written this, but if you had exchanged the switch and loop,
> > 
> >      switch (lsb) {
> >      case 1:
> >          for (; n ; n--, dst++, src++) {
> >              *(uint8_t *)dst = *(uint8_t *)src;
> >          }
> > 
> > then the compiler is quite likely to "optimize" the loop back to memcpy.
> > 
> > It's also self-documenting what we're intending with the store.
> > 
> 
> ok, thanks for your explanation.
> 
> > I guess it's also worth asking if this copy is also used for copying *from* device memory.  The commit that added memmove (4a73aee8814) suggests that dma from the device uses these paths.  In which case you'll either want a separate function for that, or both source and destination must be aligned.
> > 
> 
> I think it's a valid case. Lets handle this by limiting the 'step' based on
> src/dest/n, something like below. atomic_read() instead of ldxxx_he_p() is
> used for loads. Please take a look if there are anything we can improve
> further.
> 
> // Not compiled and tested yet
> static void qemu_ram_copy_unaligned(void *dest, const void *src, size_t n)
> {
>     uintptr_t test, step;
> 
>     while (n != 0) {
>         test = (uintptr_t)src | n;       /* alignment enforced by source */
>         step = test & -test;
>         test = (uintptr_t)dest | n;      /* alignment enforced by destination */
>         step = MIN(step, test & -test);
> 
>         switch (step) {
>         case 1:
>             qatomic_set((uint8_t *)dest, qatomic_read((uint8_t *)src));
>             src += 1;
>             dest += 1;
>             n -= 1;
>             break;
>         case 2:
>             qatomic_set((uint16_t *)dest, qatomic_read((uint16_t *)src));
>             src += 2;
>             dest += 2;
>             n -= 2;
>             break;
>         case 4:
>             qatomic_set((uint32_t *)dest, qatomic_read((uint32_t *)src));
>             src += 4;
>             dest += 4;
>             n -= 4;
>             break;
>         default:
>             qatomic_set((uint64_t *)dest, qatomic_read((uint64_t *)src);
>             src += 8;
>             dest += 8;
>             n -= 8;
>         }


If the length is > 8, I feel it's important to just use memcpy/memmove
normally.

Because at that point it's a large data transfer not MMIO, and performance
is important.


>     }
> }
> 
> 
> > > Another unrelated question: why 'int' value is returned from ldub_p()
> > > and ld{uw, l}_he_p() in bswap.h? They would return 'uint{8, 16, 32}_t'
> > > values?
> > 
> > A bad choice 20 years ago, and the need to audit all uses in order to change it now.
> > Newer functions use uintN_t as you suggest.
> > 
> 
> Ok :-)
> 
> > r~
> > 
> 
> Thanks,
> Gavin



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-15 19:24         ` Gavin Shan
@ 2026-06-15 19:42           ` Michael S. Tsirkin
  2026-06-15 21:31             ` Gavin Shan
  0 siblings, 1 reply; 20+ messages in thread
From: Michael S. Tsirkin @ 2026-06-15 19:42 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Peter Maydell, qemu-arm, qemu-devel, peterx, alex,
	richard.henderson, berrange, philmd, philmd, david, clg, pbonzini,
	phrdina, jugraham, liugang24219, dinghui, shan.gavin

On Tue, Jun 16, 2026 at 05:24:15AM +1000, Gavin Shan wrote:
> Hi Peter and Michael,
> 
> On 6/16/26 1:12 AM, Peter Maydell wrote:
> > On Mon, 15 Jun 2026 at 15:56, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > 
> > > On Mon, Jun 15, 2026 at 11:57:09AM +0100, Peter Maydell wrote:
> > > > On Mon, 15 Jun 2026 at 11:03, Gavin Shan <gshan@redhat.com> wrote:
> > > > > 
> > > > > All ram device regions were turned to be indirectly accessible by commit
> > > > > 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> > > > > to guest hang on attempt to build 'cuda-samples' as reported by Julia. The
> > > > > guest is started by the following command lines, with GH100 GPU card passed
> > > > > from the host.
> > > > 
> > > > > diff --git a/include/system/memory.h b/include/system/memory.h
> > > > > index 1417132f6d..5878727d09 100644
> > > > > --- a/include/system/memory.h
> > > > > +++ b/include/system/memory.h
> > > > > @@ -2897,6 +2897,8 @@ void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
> > > > >   void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
> > > > > 
> > > > >   /* Internal functions, part of the implementation of address_space_read.  */
> > > > > +void qemu_ram_copy(void *dest, const void *src, size_t n);
> > > > > +void qemu_ram_move(void *dest, const void *src, size_t n);
> > > > 
> > > > New function prototypes in include headers need documentation comments.
> > > > 
> > > > In particular for these, it's really important that we clearly say
> > > > what semantics we are attempting to provide with them, so that
> > > > (a) when we're reviewing or later updating the implementation we
> > > > know what we are trying to provide
> > > > (b) when we're looking at the callsites we know what the function
> > > > is guaranteeing to us and what it is not, and thus whether it's
> > > > OK to use it or we need something els.
> > > > 
> > > > > +#if defined(__i386__) || defined(__x86_64__)
> > > > > +#define HOST_UNALIGNED_MMIO_OK 1
> > > > > +#else
> > > > > +#define HOST_UNALIGNED_MMIO_OK 0
> > > > > +#endif
> > > > 
> > > > We need to do something better than this. We can't
> > > > just say "oh, we trust that on x86 this works": it is
> > > > neither actually true that the compiler guarantees it
> > > 
> > > sorry guarantee what?
> > 
> > Well, that's part of my point about "we need a doc comment":
> > what exactly are we trying to guarantee ? But whatever it is,
> > "hardcoded ifdef that privileges x86" is definitely the wrong
> > answer.
> > 
> 
> Could you please check the following comments (documentation context) for the
> added functions look good to you? Please let me know if there are anything
> can be improved.
> 
> +
> +/**
> + * qemu_ram_copy: copy data to a RAMBlock
> + *
> + * @dest: destination where the data is copied to. It's the pointer returned
> + *        by ramblock_ptr() and its variants.
> + * @src: source where the data is copied from. It can be a regular memory block
> + *       or a pointer returned by ramblock_ptr() and its variants.
> + * @n: length of data to be copied
> + *
> + * The destination is always a pointer to data area of a RAMBlock which can
> + * be for a regular memory block or a MMIO region. The source can be a pointer
> + * to a regular memory block or a MMIO region. Atomicity isn't guranteed by
> + * memcpy() to copy data between those MMIO regions as we do in this function.
> + * Besides, unaligned accesses are well handled on all architectures except
> + * i386 and x86_64 where the unaligned accesses are supported by the CPUs.
> + *
> + * The ensured atomicity and alignment consideration, which aren't needed
> + * by data copy between two regular memory blocks. So performance penalty
> + * is introduced by this function in such circumstance.
> + */
> +void qemu_ram_copy(void *dest, const void *src, size_t n);
> +
> +/**
> + * qemu_ram_move: move data to a RAMBlock
> + *
> + * @dest: destination where the data is moved to. It's the pointer returned
> + *        by ramblock_ptr() and its variants.
> + * @src: source where the data is moved from. It can be a regular memory block
> + *       or a pointer returned by ramblock_ptr() and its variants.
> + * @n: length of data to be moved
> + *
> + * The destination is always a pointer to data area of a RAMBlock which can
> + * be for a regular memory block or a MMIO region. The source can be a pointer
> + * to a regular memory block or a MMIO region. Atomicity isn't guranteed by
> + * memmove() to move data from or to those MMIO regions as we do in this
> + * function. Besides, unaligned accesses are well handled on all architectures
> + * except i386 and x86_64 where the unaligned accesses are supported by the
> + * CPUs.
> + *
> + * The ensured atomicity and alignment consideration, which aren't needed
> + * by data movement between two regular memory blocks. So performance penalty
> + * is introduced by this function in such circumstance.
> + */
> +void qemu_ram_move(void *dest, const void *src, size_t n);
> +


I apologize, I don't understand what these comments are saying, at all.



> > -- PMM
> > 
> 
> Thanks,
> Gavin



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-15 15:12       ` Peter Maydell
  2026-06-15 19:24         ` Gavin Shan
@ 2026-06-15 19:52         ` Michael S. Tsirkin
  1 sibling, 0 replies; 20+ messages in thread
From: Michael S. Tsirkin @ 2026-06-15 19:52 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Gavin Shan, qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	berrange, philmd, philmd, david, clg, pbonzini, phrdina, jugraham,
	liugang24219, dinghui, shan.gavin

On Mon, Jun 15, 2026 at 04:12:40PM +0100, Peter Maydell wrote:
> On Mon, 15 Jun 2026 at 15:56, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Mon, Jun 15, 2026 at 11:57:09AM +0100, Peter Maydell wrote:
> > > On Mon, 15 Jun 2026 at 11:03, Gavin Shan <gshan@redhat.com> wrote:
> > > >
> > > > All ram device regions were turned to be indirectly accessible by commit
> > > > 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> > > > to guest hang on attempt to build 'cuda-samples' as reported by Julia. The
> > > > guest is started by the following command lines, with GH100 GPU card passed
> > > > from the host.
> > >
> > > > diff --git a/include/system/memory.h b/include/system/memory.h
> > > > index 1417132f6d..5878727d09 100644
> > > > --- a/include/system/memory.h
> > > > +++ b/include/system/memory.h
> > > > @@ -2897,6 +2897,8 @@ void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
> > > >  void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
> > > >
> > > >  /* Internal functions, part of the implementation of address_space_read.  */
> > > > +void qemu_ram_copy(void *dest, const void *src, size_t n);
> > > > +void qemu_ram_move(void *dest, const void *src, size_t n);
> > >
> > > New function prototypes in include headers need documentation comments.
> > >
> > > In particular for these, it's really important that we clearly say
> > > what semantics we are attempting to provide with them, so that
> > > (a) when we're reviewing or later updating the implementation we
> > > know what we are trying to provide
> > > (b) when we're looking at the callsites we know what the function
> > > is guaranteeing to us and what it is not, and thus whether it's
> > > OK to use it or we need something els.
> > >
> > > > +#if defined(__i386__) || defined(__x86_64__)
> > > > +#define HOST_UNALIGNED_MMIO_OK 1
> > > > +#else
> > > > +#define HOST_UNALIGNED_MMIO_OK 0
> > > > +#endif
> > >
> > > We need to do something better than this. We can't
> > > just say "oh, we trust that on x86 this works": it is
> > > neither actually true that the compiler guarantees it
> >
> > sorry guarantee what?
> 
> Well, that's part of my point about "we need a doc comment":
> what exactly are we trying to guarantee ?

we are trying to guarantee that host transaction sizes match what
guest requested as closely as possible.

> But whatever it is,
> "hardcoded ifdef that privileges x86" is definitely the wrong
> answer.
> 
> -- PMM

why is it "definitely a wrong answer" if a specific host architecture 
supports originating transactions that other host architectures lack?
should we intentionally break setups that could initiate unaligned
accesses and work fine just because some other setups have hardware that
simply can not initiate unaligned accesses?

-- 
MST



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors
  2026-06-15 19:42           ` Michael S. Tsirkin
@ 2026-06-15 21:31             ` Gavin Shan
  0 siblings, 0 replies; 20+ messages in thread
From: Gavin Shan @ 2026-06-15 21:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Maydell, qemu-arm, qemu-devel, peterx, alex,
	richard.henderson, berrange, philmd, philmd, david, clg, pbonzini,
	phrdina, jugraham, liugang24219, dinghui, shan.gavin

On 6/16/26 5:42 AM, Michael S. Tsirkin wrote:
> On Tue, Jun 16, 2026 at 05:24:15AM +1000, Gavin Shan wrote:
>> Hi Peter and Michael,
>>
>> On 6/16/26 1:12 AM, Peter Maydell wrote:
>>> On Mon, 15 Jun 2026 at 15:56, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>
>>>> On Mon, Jun 15, 2026 at 11:57:09AM +0100, Peter Maydell wrote:
>>>>> On Mon, 15 Jun 2026 at 11:03, Gavin Shan <gshan@redhat.com> wrote:
>>>>>>
>>>>>> All ram device regions were turned to be indirectly accessible by commit
>>>>>> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
>>>>>> to guest hang on attempt to build 'cuda-samples' as reported by Julia. The
>>>>>> guest is started by the following command lines, with GH100 GPU card passed
>>>>>> from the host.
>>>>>
>>>>>> diff --git a/include/system/memory.h b/include/system/memory.h
>>>>>> index 1417132f6d..5878727d09 100644
>>>>>> --- a/include/system/memory.h
>>>>>> +++ b/include/system/memory.h
>>>>>> @@ -2897,6 +2897,8 @@ void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
>>>>>>    void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
>>>>>>
>>>>>>    /* Internal functions, part of the implementation of address_space_read.  */
>>>>>> +void qemu_ram_copy(void *dest, const void *src, size_t n);
>>>>>> +void qemu_ram_move(void *dest, const void *src, size_t n);
>>>>>
>>>>> New function prototypes in include headers need documentation comments.
>>>>>
>>>>> In particular for these, it's really important that we clearly say
>>>>> what semantics we are attempting to provide with them, so that
>>>>> (a) when we're reviewing or later updating the implementation we
>>>>> know what we are trying to provide
>>>>> (b) when we're looking at the callsites we know what the function
>>>>> is guaranteeing to us and what it is not, and thus whether it's
>>>>> OK to use it or we need something els.
>>>>>
>>>>>> +#if defined(__i386__) || defined(__x86_64__)
>>>>>> +#define HOST_UNALIGNED_MMIO_OK 1
>>>>>> +#else
>>>>>> +#define HOST_UNALIGNED_MMIO_OK 0
>>>>>> +#endif
>>>>>
>>>>> We need to do something better than this. We can't
>>>>> just say "oh, we trust that on x86 this works": it is
>>>>> neither actually true that the compiler guarantees it
>>>>
>>>> sorry guarantee what?
>>>
>>> Well, that's part of my point about "we need a doc comment":
>>> what exactly are we trying to guarantee ? But whatever it is,
>>> "hardcoded ifdef that privileges x86" is definitely the wrong
>>> answer.
>>>
>>
>> Could you please check the following comments (documentation context) for the
>> added functions look good to you? Please let me know if there are anything
>> can be improved.
>>
>> +
>> +/**
>> + * qemu_ram_copy: copy data to a RAMBlock
>> + *
>> + * @dest: destination where the data is copied to. It's the pointer returned
>> + *        by ramblock_ptr() and its variants.
>> + * @src: source where the data is copied from. It can be a regular memory block
>> + *       or a pointer returned by ramblock_ptr() and its variants.
>> + * @n: length of data to be copied
>> + *
>> + * The destination is always a pointer to data area of a RAMBlock which can
>> + * be for a regular memory block or a MMIO region. The source can be a pointer
>> + * to a regular memory block or a MMIO region. Atomicity isn't guranteed by
>> + * memcpy() to copy data between those MMIO regions as we do in this function.
>> + * Besides, unaligned accesses are well handled on all architectures except
>> + * i386 and x86_64 where the unaligned accesses are supported by the CPUs.
>> + *
>> + * The ensured atomicity and alignment consideration, which aren't needed
>> + * by data copy between two regular memory blocks. So performance penalty
>> + * is introduced by this function in such circumstance.
>> + */
>> +void qemu_ram_copy(void *dest, const void *src, size_t n);
>> +
>> +/**
>> + * qemu_ram_move: move data to a RAMBlock
>> + *
>> + * @dest: destination where the data is moved to. It's the pointer returned
>> + *        by ramblock_ptr() and its variants.
>> + * @src: source where the data is moved from. It can be a regular memory block
>> + *       or a pointer returned by ramblock_ptr() and its variants.
>> + * @n: length of data to be moved
>> + *
>> + * The destination is always a pointer to data area of a RAMBlock which can
>> + * be for a regular memory block or a MMIO region. The source can be a pointer
>> + * to a regular memory block or a MMIO region. Atomicity isn't guranteed by
>> + * memmove() to move data from or to those MMIO regions as we do in this
>> + * function. Besides, unaligned accesses are well handled on all architectures
>> + * except i386 and x86_64 where the unaligned accesses are supported by the
>> + * CPUs.
>> + *
>> + * The ensured atomicity and alignment consideration, which aren't needed
>> + * by data movement between two regular memory blocks. So performance penalty
>> + * is introduced by this function in such circumstance.
>> + */
>> +void qemu_ram_move(void *dest, const void *src, size_t n);
>> +
> 
> 
> I apologize, I don't understand what these comments are saying, at all.
> 

Sorry to hear that, Michael. There are two things done in the newly added
function qemu_ram_{copy, move}(), comparing to the original memcpy() and
memmove():

- Data copy or movement on MMIO region(s) using __bultin_{memcopy, memmove}()
   on x86, or qatomic_set() on other architecutures, to avoid the unexpected
   optimization done by the compiler on memcpy/memmove. Unsafe instructions
   can be produced by the compiler's optimization. This fixes the issue resolved
   by commit 4a2e242bbb ("memory: Don't use memcpy for ram_device regions")

- The unaligned access is handled by forcing the access size aligned to (src | dest | n)
   on non-x86 architectures, comparing to the original memcpy() and memmove().

Please let me know if we're on same page. If we do, I can revise the comments
accordingly.

> 
> 
>>> -- PMM
>>>

Thanks,
Gavin



^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2026-06-15 21:31 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-15 10:01 [PATCH v2 0/2] system/memory: Make ram device region directly accessible Gavin Shan
2026-06-15 10:01 ` [PATCH v2 1/2] system/memory: Use qemu_ram_{copy, move}() in ram device region accessors Gavin Shan
2026-06-15 10:57   ` Peter Maydell
2026-06-15 14:48     ` Michael S. Tsirkin
2026-06-15 14:56     ` Michael S. Tsirkin
2026-06-15 15:12       ` Peter Maydell
2026-06-15 19:24         ` Gavin Shan
2026-06-15 19:42           ` Michael S. Tsirkin
2026-06-15 21:31             ` Gavin Shan
2026-06-15 19:52         ` Michael S. Tsirkin
2026-06-15 15:17   ` Richard Henderson
2026-06-15 16:33     ` Gavin Shan
2026-06-15 17:03       ` Richard Henderson
2026-06-15 18:09         ` Gavin Shan
2026-06-15 18:33           ` Richard Henderson
2026-06-15 19:40           ` Michael S. Tsirkin
2026-06-15 16:35     ` Michael S. Tsirkin
2026-06-15 16:37       ` Michael S. Tsirkin
2026-06-15 17:05       ` Richard Henderson
2026-06-15 10:01 ` [PATCH v2 2/2] system/memory: Make ram device region directly accessible Gavin Shan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.