[PATCH 0/2] system/memory: Make ram device region directly accessible

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] system/memory: Make ram device region directly accessible
@ 2026-06-12 11:03 Gavin Shan
  2026-06-12 11:03 ` [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region Gavin Shan
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Gavin Shan @ 2026-06-12 11:03 UTC (permalink / raw)
  To: qemu-arm
  Cc: qemu-devel, peterx, mst, peter.maydell, berrange, david, alex,
	clg, pbonzini, philmd, phrdina, jugraham, shan.gavin

All ram device regions was turned to be indirectly accessible by commit
4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
to a hanged guest where a NVidia GH100 GPU is passed from host. The memory
in its PCI BAR#4 can be allocated as DMA target buffer. qemu has to take
DMA bounce buffer in address_space_map() to cover the DMA request. However,
the bounce buffer size is 4096 bytes and we're overrunning it easily when
the guest has significant disk activities on compiling 'cuda-samples'.
The full log and problem description can be found from PATCH[1/2]'s commit
log.

Try to fix the issue handled in commit 4a2e242bbb by replacing mem{cpy, move}
with __builtin_mem{cpy, move} in the accessors to the ram device regions.
With this, we can basically revert that commit to make ram device region
directly accessible again and bypass the bounce buffer in address_space_map()
where the guest hang is caused.

PATCH[1] replaces mem{cpy, move} with __builtin_mem{cpy, move}
PATCH[2] makes ram device region directly accessible again

Changelog
=========
RFCv1 -> v1:
  * https://lists.nongnu.org/archive/html/qemu-arm/2026-06/msg00307.html
  * Reworked solution based on suggestions from Peter Xu, Peter Maydell
    and Michael S. Tsirkin

Gavin Shan (2):
  system/memory: Use __builtin_mem{cpy, move} in accessors of ram device
    region
  system/memory: Make ram device region directly accessible

 hw/remote/vfio-user-obj.c |  4 +--
 include/system/memory.h   | 53 +++++++++++++++++++++++++++++++--------
 system/memory.c           | 41 +-----------------------------
 system/physmem.c          |  8 +++---
 system/trace-events       |  2 --
 5 files changed, 50 insertions(+), 58 deletions(-)

-- 
2.54.0

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
  2026-06-12 11:03 [PATCH 0/2] system/memory: Make ram device region directly accessible Gavin Shan
@ 2026-06-12 11:03 ` Gavin Shan
  2026-06-12 11:22   ` Michael S. Tsirkin
                     ` (2 more replies)
  2026-06-12 11:03 ` [PATCH 2/2] system/memory: Make ram device region directly accessible Gavin Shan
  2026-06-12 23:35 ` [PATCH 0/2] " Gavin Shan
  2 siblings, 3 replies; 24+ messages in thread
From: Gavin Shan @ 2026-06-12 11:03 UTC (permalink / raw)
  To: qemu-arm
  Cc: qemu-devel, peterx, mst, peter.maydell, berrange, david, alex,
	clg, pbonzini, philmd, phrdina, jugraham, shan.gavin

All ram device regions was turned to be indirectly accessible by commit
4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
to guest hang on compiling 'cuda-samples' as reported by Julia. The guest
is started by the following command lines, with a GH100 GPU card.

   host$ lspci | grep GH100
   0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
   host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64            \
         -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T         \
         -accel kvm -cpu host -smp cpus=48 -m size=8G                       \
         -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0    \
         -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4     \
         -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
           :
   guest$ cd cuda-samples/build
   guest$ make -j 20 clean
   guest$ make -j 20
           :
   [ 54%] Linking CUDA executable graphMemoryNodes
   [ 54%] Built target graphMemoryNodes
   <no more output afterwards, guest becomes frozen here>

   guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
   [  555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)

When the GPU's driver (NVidia open driver) is loaded on guest bootup,
the memory blocks residing in the PCI BAR#4 can be presented to the
guest through memory hot-add. The page cache can be allocated from the
hot added memory blocks when cuda-samples is being compiled. Afterwards,
the page cache is sent to QEMU's virtio-blk device as part of the DMA
request, the bounce buffer has to be used to accomodate the request as
the corresponding memory region (MemoryRegion) is a RAM DEVICE region
and indirectly accessible in qemu. However, the max bounce bufer size
is only 4096 bytes by default. We're running out of that space quickly.

  QEMU
  ====
  virtio_blk_handle_output
    virtio_blk_handle_vq
      virtio_blk_get_request
        virtqueue_pop
          virtqueue_split_pop
            virtqueue_map_desc
              address_space_map
                memory_access_is_direct         # Return false
                  memory_region_supports_direct_access

  (qemu) info mtree
  memory-region: pci_bridge_pci
    0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
      0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
        0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
          0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]

This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
accessors to ram device memory region, preparatory work to make ram device
region directly accessible and bypass the bounce buffer in the DMA path
in next patch.

Reported-by: Julia Graham <jugraham@redhat.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Suggested-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 hw/remote/vfio-user-obj.c |  4 ++--
 include/system/memory.h   | 42 ++++++++++++++++++++++++++++++++++++++-
 system/physmem.c          |  8 ++++----
 3 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 87fa7b6572..fe6f661fe2 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -375,9 +375,9 @@ static int vfu_object_mr_rw(MemoryRegion *mr, uint8_t *buf, hwaddr offset,
         ram_ptr = memory_region_get_ram_ptr(mr);
 
         if (is_write) {
-            memcpy((ram_ptr + offset), buf, size);
+            address_space_memcpy(ram_ptr + offset, buf, size);
         } else {
-            memcpy(buf, (ram_ptr + offset), size);
+            address_space_memcpy(buf, ram_ptr + offset, size);
         }
 
         return 0;
diff --git a/include/system/memory.h b/include/system/memory.h
index 1417132f6d..6bb2e13eea 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -2938,6 +2938,46 @@ static inline bool memory_access_is_direct(const MemoryRegion *mr,
     return true;
 }
 
+static inline void address_space_memcpy(void *dest, const void *src, size_t n)
+{
+    switch (n) {
+    case 1:
+        __builtin_memcpy(dest, src, 1);
+        break;
+    case 2:
+        __builtin_memcpy(dest, src, 2);
+        break;
+    case 4:
+        __builtin_memcpy(dest, src, 4);
+        break;
+    case 8:
+        __builtin_memcpy(dest, src, 8);
+        break;
+    default:
+        __builtin_memcpy(dest, src, n);
+    }
+}
+
+static inline void address_space_memmove(void *dest, const void *src, size_t n)
+{
+    switch (n) {
+    case 1:
+        __builtin_memmove(dest, src, 1);
+        break;
+    case 2:
+        __builtin_memmove(dest, src, 2);
+        break;
+    case 4:
+        __builtin_memmove(dest, src, 4);
+        break;
+    case 8:
+        __builtin_memmove(dest, src, 8);
+        break;
+    default:
+        __builtin_memmove(dest, src, n);
+    }
+}
+
 /**
  * address_space_read: read from an address space.
  *
@@ -2970,7 +3010,7 @@ MemTxResult address_space_read(AddressSpace *as, hwaddr addr,
             mr = flatview_translate(fv, addr, &addr1, &l, false, attrs);
             if (len == l && memory_access_is_direct(mr, false, attrs)) {
                 ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
-                memcpy(buf, ptr, len);
+                __builtin_memcpy(buf, ptr, len);
             } else {
                 result = flatview_read_continue(fv, addr, attrs, buf, len,
                                                 addr1, l, mr);
diff --git a/system/physmem.c b/system/physmem.c
index 7bcbf87573..5f46a9d676 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3272,7 +3272,7 @@ static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
         uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
                                                false, true);
 
-        memmove(ram_ptr, buf, *l);
+        address_space_memmove(ram_ptr, buf, *l);
         invalidate_and_set_dirty(mr, mr_addr, *l);
 
         return MEMTX_OK;
@@ -3365,7 +3365,7 @@ static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
         uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
                                                false, false);
 
-        memcpy(buf, ram_ptr, *l);
+        address_space_memcpy(buf, ram_ptr, *l);
 
         return MEMTX_OK;
     }
@@ -3503,8 +3503,8 @@ MemTxResult address_space_write_rom(AddressSpace *as, hwaddr addr,
             l = memory_access_size(mr, l, addr1);
         } else {
             /* ROM/RAM case */
-            void *ram_ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
-            memcpy(ram_ptr, buf, l);
+            address_space_memcpy(qemu_map_ram_ptr(mr->ram_block, addr1),
+                                 buf, l);
             invalidate_and_set_dirty(mr, addr1, l);
         }
         len -= l;
-- 
2.54.0



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
  2026-06-12 11:03 ` [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region Gavin Shan
@ 2026-06-12 11:22   ` Michael S. Tsirkin
  2026-06-14 10:04     ` Gavin Shan
  2026-06-12 14:05   ` Philippe Mathieu-Daudé
  2026-06-12 15:29   ` Richard Henderson
  2 siblings, 1 reply; 24+ messages in thread
From: Michael S. Tsirkin @ 2026-06-12 11:22 UTC (permalink / raw)
  To: Gavin Shan
  Cc: qemu-arm, qemu-devel, peterx, peter.maydell, berrange, david,
	alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin

On Fri, Jun 12, 2026 at 09:03:06PM +1000, Gavin Shan wrote:
> All ram device regions was turned to be indirectly accessible by commit
> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> to guest hang on compiling 'cuda-samples' as reported by Julia. The guest
> is started by the following command lines, with a GH100 GPU card.
> 
>    host$ lspci | grep GH100
>    0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
>    host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64            \
>          -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T         \
>          -accel kvm -cpu host -smp cpus=48 -m size=8G                       \
>          -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0    \
>          -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4     \
>          -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
>            :
>    guest$ cd cuda-samples/build
>    guest$ make -j 20 clean
>    guest$ make -j 20
>            :
>    [ 54%] Linking CUDA executable graphMemoryNodes
>    [ 54%] Built target graphMemoryNodes
>    <no more output afterwards, guest becomes frozen here>
> 
>    guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
>    [  555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
> 
> When the GPU's driver (NVidia open driver) is loaded on guest bootup,
> the memory blocks residing in the PCI BAR#4 can be presented to the
> guest through memory hot-add. The page cache can be allocated from the
> hot added memory blocks when cuda-samples is being compiled. Afterwards,
> the page cache is sent to QEMU's virtio-blk device as part of the DMA
> request, the bounce buffer has to be used to accomodate the request as
> the corresponding memory region (MemoryRegion) is a RAM DEVICE region
> and indirectly accessible in qemu. However, the max bounce bufer size
> is only 4096 bytes by default. We're running out of that space quickly.
> 
>   QEMU
>   ====
>   virtio_blk_handle_output
>     virtio_blk_handle_vq
>       virtio_blk_get_request
>         virtqueue_pop
>           virtqueue_split_pop
>             virtqueue_map_desc
>               address_space_map
>                 memory_access_is_direct         # Return false
>                   memory_region_supports_direct_access
> 
>   (qemu) info mtree
>   memory-region: pci_bridge_pci
>     0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
>       0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
>         0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
>           0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
> 
> This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> accessors to ram device memory region, preparatory work to make ram device
> region directly accessible and bypass the bounce buffer in the DMA path
> in next patch.
> 
> Reported-by: Julia Graham <jugraham@redhat.com>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> Suggested-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  hw/remote/vfio-user-obj.c |  4 ++--
>  include/system/memory.h   | 42 ++++++++++++++++++++++++++++++++++++++-
>  system/physmem.c          |  8 ++++----
>  3 files changed, 47 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index 87fa7b6572..fe6f661fe2 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -375,9 +375,9 @@ static int vfu_object_mr_rw(MemoryRegion *mr, uint8_t *buf, hwaddr offset,
>          ram_ptr = memory_region_get_ram_ptr(mr);
>  
>          if (is_write) {
> -            memcpy((ram_ptr + offset), buf, size);
> +            address_space_memcpy(ram_ptr + offset, buf, size);
>          } else {
> -            memcpy(buf, (ram_ptr + offset), size);
> +            address_space_memcpy(buf, ram_ptr + offset, size);
>          }
>  
>          return 0;
> diff --git a/include/system/memory.h b/include/system/memory.h
> index 1417132f6d..6bb2e13eea 100644
> --- a/include/system/memory.h
> +++ b/include/system/memory.h
> @@ -2938,6 +2938,46 @@ static inline bool memory_access_is_direct(const MemoryRegion *mr,
>      return true;
>  }
>  
> +static inline void address_space_memcpy(void *dest, const void *src, size_t n)
> +{
> +    switch (n) {
> +    case 1:
> +        __builtin_memcpy(dest, src, 1);
> +        break;
> +    case 2:
> +        __builtin_memcpy(dest, src, 2);
> +        break;
> +    case 4:
> +        __builtin_memcpy(dest, src, 4);
> +        break;
> +    case 8:
> +        __builtin_memcpy(dest, src, 8);
> +        break;
> +    default:
> +        __builtin_memcpy(dest, src, n);
> +    }
> +}
> +
> +static inline void address_space_memmove(void *dest, const void *src, size_t n)
> +{
> +    switch (n) {
> +    case 1:
> +        __builtin_memmove(dest, src, 1);
> +        break;
> +    case 2:
> +        __builtin_memmove(dest, src, 2);
> +        break;
> +    case 4:
> +        __builtin_memmove(dest, src, 4);
> +        break;
> +    case 8:
> +        __builtin_memmove(dest, src, 8);
> +        break;
> +    default:
> +        __builtin_memmove(dest, src, n);
> +    }
> +}
> +
>  /**
>   * address_space_read: read from an address space.
>   *


The variable length probably should use the regular memcpy/memmove -
no reason to bypass fortification for these.


> @@ -2970,7 +3010,7 @@ MemTxResult address_space_read(AddressSpace *as, hwaddr addr,
>              mr = flatview_translate(fv, addr, &addr1, &l, false, attrs);
>              if (len == l && memory_access_is_direct(mr, false, attrs)) {
>                  ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
> -                memcpy(buf, ptr, len);
> +                __builtin_memcpy(buf, ptr, len);
>              } else {
>                  result = flatview_read_continue(fv, addr, attrs, buf, len,
>                                                  addr1, l, mr);
> diff --git a/system/physmem.c b/system/physmem.c
> index 7bcbf87573..5f46a9d676 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -3272,7 +3272,7 @@ static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
>          uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
>                                                 false, true);
>  
> -        memmove(ram_ptr, buf, *l);
> +        address_space_memmove(ram_ptr, buf, *l);
>          invalidate_and_set_dirty(mr, mr_addr, *l);
>  
>          return MEMTX_OK;
> @@ -3365,7 +3365,7 @@ static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
>          uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
>                                                 false, false);
>  
> -        memcpy(buf, ram_ptr, *l);
> +        address_space_memcpy(buf, ram_ptr, *l);
>  
>          return MEMTX_OK;
>      }
> @@ -3503,8 +3503,8 @@ MemTxResult address_space_write_rom(AddressSpace *as, hwaddr addr,
>              l = memory_access_size(mr, l, addr1);
>          } else {
>              /* ROM/RAM case */
> -            void *ram_ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
> -            memcpy(ram_ptr, buf, l);
> +            address_space_memcpy(qemu_map_ram_ptr(mr->ram_block, addr1),
> +                                 buf, l);
>              invalidate_and_set_dirty(mr, addr1, l);
>          }
>          len -= l;
> -- 
> 2.54.0



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
  2026-06-12 11:22   ` Michael S. Tsirkin
@ 2026-06-14 10:04     ` Gavin Shan
  0 siblings, 0 replies; 24+ messages in thread
From: Gavin Shan @ 2026-06-14 10:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-arm, qemu-devel, peterx, peter.maydell, berrange, david,
	alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin

On 6/12/26 9:22 PM, Michael S. Tsirkin wrote:
> On Fri, Jun 12, 2026 at 09:03:06PM +1000, Gavin Shan wrote:
>> All ram device regions was turned to be indirectly accessible by commit
>> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
>> to guest hang on compiling 'cuda-samples' as reported by Julia. The guest
>> is started by the following command lines, with a GH100 GPU card.
>>
>>     host$ lspci | grep GH100
>>     0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
>>     host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64            \
>>           -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T         \
>>           -accel kvm -cpu host -smp cpus=48 -m size=8G                       \
>>           -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0    \
>>           -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4     \
>>           -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
>>             :
>>     guest$ cd cuda-samples/build
>>     guest$ make -j 20 clean
>>     guest$ make -j 20
>>             :
>>     [ 54%] Linking CUDA executable graphMemoryNodes
>>     [ 54%] Built target graphMemoryNodes
>>     <no more output afterwards, guest becomes frozen here>
>>
>>     guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
>>     [  555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
>>
>> When the GPU's driver (NVidia open driver) is loaded on guest bootup,
>> the memory blocks residing in the PCI BAR#4 can be presented to the
>> guest through memory hot-add. The page cache can be allocated from the
>> hot added memory blocks when cuda-samples is being compiled. Afterwards,
>> the page cache is sent to QEMU's virtio-blk device as part of the DMA
>> request, the bounce buffer has to be used to accomodate the request as
>> the corresponding memory region (MemoryRegion) is a RAM DEVICE region
>> and indirectly accessible in qemu. However, the max bounce bufer size
>> is only 4096 bytes by default. We're running out of that space quickly.
>>
>>    QEMU
>>    ====
>>    virtio_blk_handle_output
>>      virtio_blk_handle_vq
>>        virtio_blk_get_request
>>          virtqueue_pop
>>            virtqueue_split_pop
>>              virtqueue_map_desc
>>                address_space_map
>>                  memory_access_is_direct         # Return false
>>                    memory_region_supports_direct_access
>>
>>    (qemu) info mtree
>>    memory-region: pci_bridge_pci
>>      0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
>>        0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
>>          0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
>>            0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
>>
>> This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
>> accessors to ram device memory region, preparatory work to make ram device
>> region directly accessible and bypass the bounce buffer in the DMA path
>> in next patch.
>>
>> Reported-by: Julia Graham <jugraham@redhat.com>
>> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
>> Suggested-by: Peter Xu <peterx@redhat.com>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   hw/remote/vfio-user-obj.c |  4 ++--
>>   include/system/memory.h   | 42 ++++++++++++++++++++++++++++++++++++++-
>>   system/physmem.c          |  8 ++++----
>>   3 files changed, 47 insertions(+), 7 deletions(-)
>>
>> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
>> index 87fa7b6572..fe6f661fe2 100644
>> --- a/hw/remote/vfio-user-obj.c
>> +++ b/hw/remote/vfio-user-obj.c
>> @@ -375,9 +375,9 @@ static int vfu_object_mr_rw(MemoryRegion *mr, uint8_t *buf, hwaddr offset,
>>           ram_ptr = memory_region_get_ram_ptr(mr);
>>   
>>           if (is_write) {
>> -            memcpy((ram_ptr + offset), buf, size);
>> +            address_space_memcpy(ram_ptr + offset, buf, size);
>>           } else {
>> -            memcpy(buf, (ram_ptr + offset), size);
>> +            address_space_memcpy(buf, ram_ptr + offset, size);
>>           }
>>   
>>           return 0;
>> diff --git a/include/system/memory.h b/include/system/memory.h
>> index 1417132f6d..6bb2e13eea 100644
>> --- a/include/system/memory.h
>> +++ b/include/system/memory.h
>> @@ -2938,6 +2938,46 @@ static inline bool memory_access_is_direct(const MemoryRegion *mr,
>>       return true;
>>   }
>>   
>> +static inline void address_space_memcpy(void *dest, const void *src, size_t n)
>> +{
>> +    switch (n) {
>> +    case 1:
>> +        __builtin_memcpy(dest, src, 1);
>> +        break;
>> +    case 2:
>> +        __builtin_memcpy(dest, src, 2);
>> +        break;
>> +    case 4:
>> +        __builtin_memcpy(dest, src, 4);
>> +        break;
>> +    case 8:
>> +        __builtin_memcpy(dest, src, 8);
>> +        break;
>> +    default:
>> +        __builtin_memcpy(dest, src, n);
>> +    }
>> +}
>> +
>> +static inline void address_space_memmove(void *dest, const void *src, size_t n)
>> +{
>> +    switch (n) {
>> +    case 1:
>> +        __builtin_memmove(dest, src, 1);
>> +        break;
>> +    case 2:
>> +        __builtin_memmove(dest, src, 2);
>> +        break;
>> +    case 4:
>> +        __builtin_memmove(dest, src, 4);
>> +        break;
>> +    case 8:
>> +        __builtin_memmove(dest, src, 8);
>> +        break;
>> +    default:
>> +        __builtin_memmove(dest, src, n);
>> +    }
>> +}
>> +
>>   /**
>>    * address_space_read: read from an address space.
>>    *
> 
> 
> The variable length probably should use the regular memcpy/memmove -
> no reason to bypass fortification for these.
> 

Fixed locally and the changes will be included in next revision.

> 
>> @@ -2970,7 +3010,7 @@ MemTxResult address_space_read(AddressSpace *as, hwaddr addr,
>>               mr = flatview_translate(fv, addr, &addr1, &l, false, attrs);
>>               if (len == l && memory_access_is_direct(mr, false, attrs)) {
>>                   ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
>> -                memcpy(buf, ptr, len);
>> +                __builtin_memcpy(buf, ptr, len);
>>               } else {
>>                   result = flatview_read_continue(fv, addr, attrs, buf, len,
>>                                                   addr1, l, mr);
>> diff --git a/system/physmem.c b/system/physmem.c
>> index 7bcbf87573..5f46a9d676 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -3272,7 +3272,7 @@ static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
>>           uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
>>                                                  false, true);
>>   
>> -        memmove(ram_ptr, buf, *l);
>> +        address_space_memmove(ram_ptr, buf, *l);
>>           invalidate_and_set_dirty(mr, mr_addr, *l);
>>   
>>           return MEMTX_OK;
>> @@ -3365,7 +3365,7 @@ static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
>>           uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
>>                                                  false, false);
>>   
>> -        memcpy(buf, ram_ptr, *l);
>> +        address_space_memcpy(buf, ram_ptr, *l);
>>   
>>           return MEMTX_OK;
>>       }
>> @@ -3503,8 +3503,8 @@ MemTxResult address_space_write_rom(AddressSpace *as, hwaddr addr,
>>               l = memory_access_size(mr, l, addr1);
>>           } else {
>>               /* ROM/RAM case */
>> -            void *ram_ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
>> -            memcpy(ram_ptr, buf, l);
>> +            address_space_memcpy(qemu_map_ram_ptr(mr->ram_block, addr1),
>> +                                 buf, l);
>>               invalidate_and_set_dirty(mr, addr1, l);
>>           }
>>           len -= l;
>> -- 
>> 2.54.0
> 

Thanks,
Gavin



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
  2026-06-12 11:03 ` [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region Gavin Shan
  2026-06-12 11:22   ` Michael S. Tsirkin
@ 2026-06-12 14:05   ` Philippe Mathieu-Daudé
  2026-06-14 10:08     ` Gavin Shan
  2026-06-12 15:29   ` Richard Henderson
  2 siblings, 1 reply; 24+ messages in thread
From: Philippe Mathieu-Daudé @ 2026-06-12 14:05 UTC (permalink / raw)
  To: Gavin Shan, qemu-arm
  Cc: qemu-devel, peterx, mst, peter.maydell, berrange, david, alex,
	clg, pbonzini, phrdina, jugraham, shan.gavin

Hi Gavin,

On 12/6/26 13:03, Gavin Shan wrote:
> All ram device regions was turned to be indirectly accessible by commit
> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> to guest hang on compiling 'cuda-samples' as reported by Julia. The guest
> is started by the following command lines, with a GH100 GPU card.
> 
>     host$ lspci | grep GH100
>     0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
>     host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64            \
>           -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T         \
>           -accel kvm -cpu host -smp cpus=48 -m size=8G                       \
>           -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0    \
>           -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4     \
>           -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
>             :
>     guest$ cd cuda-samples/build
>     guest$ make -j 20 clean
>     guest$ make -j 20
>             :
>     [ 54%] Linking CUDA executable graphMemoryNodes
>     [ 54%] Built target graphMemoryNodes
>     <no more output afterwards, guest becomes frozen here>
> 
>     guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
>     [  555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
> 
> When the GPU's driver (NVidia open driver) is loaded on guest bootup,
> the memory blocks residing in the PCI BAR#4 can be presented to the
> guest through memory hot-add. The page cache can be allocated from the
> hot added memory blocks when cuda-samples is being compiled. Afterwards,
> the page cache is sent to QEMU's virtio-blk device as part of the DMA
> request, the bounce buffer has to be used to accomodate the request as
> the corresponding memory region (MemoryRegion) is a RAM DEVICE region
> and indirectly accessible in qemu. However, the max bounce bufer size
> is only 4096 bytes by default. We're running out of that space quickly.
> 
>    QEMU
>    ====
>    virtio_blk_handle_output
>      virtio_blk_handle_vq
>        virtio_blk_get_request
>          virtqueue_pop
>            virtqueue_split_pop
>              virtqueue_map_desc
>                address_space_map
>                  memory_access_is_direct         # Return false
>                    memory_region_supports_direct_access
> 
>    (qemu) info mtree
>    memory-region: pci_bridge_pci
>      0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
>        0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
>          0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
>            0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
> 
> This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> accessors to ram device memory region, preparatory work to make ram device
> region directly accessible and bypass the bounce buffer in the DMA path
> in next patch.
> 
> Reported-by: Julia Graham <jugraham@redhat.com>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> Suggested-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>   hw/remote/vfio-user-obj.c |  4 ++--
>   include/system/memory.h   | 42 ++++++++++++++++++++++++++++++++++++++-
>   system/physmem.c          |  8 ++++----
>   3 files changed, 47 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index 87fa7b6572..fe6f661fe2 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -375,9 +375,9 @@ static int vfu_object_mr_rw(MemoryRegion *mr, uint8_t *buf, hwaddr offset,
>           ram_ptr = memory_region_get_ram_ptr(mr);
>   
>           if (is_write) {
> -            memcpy((ram_ptr + offset), buf, size);
> +            address_space_memcpy(ram_ptr + offset, buf, size);
>           } else {
> -            memcpy(buf, (ram_ptr + offset), size);
> +            address_space_memcpy(buf, ram_ptr + offset, size);
>           }
>   
>           return 0;
> diff --git a/include/system/memory.h b/include/system/memory.h
> index 1417132f6d..6bb2e13eea 100644
> --- a/include/system/memory.h
> +++ b/include/system/memory.h
> @@ -2938,6 +2938,46 @@ static inline bool memory_access_is_direct(const MemoryRegion *mr,
>       return true;
>   }
>   
> +static inline void address_space_memcpy(void *dest, const void *src, size_t n)

'address_space_' prefix for something that doesn't use neither
AddressSpace nor MemoryRegion is odd.

Maybe prefix 'qemu_ram_' or 'qemu_ram_ptr_' instead? (since the
address is returned by memory_region_get_ram_ptr)
Add the definitions in "system/ramblock.h" with that declaration?

> +{
> +    switch (n) {
> +    case 1:
> +        __builtin_memcpy(dest, src, 1);
> +        break;
> +    case 2:
> +        __builtin_memcpy(dest, src, 2);
> +        break;
> +    case 4:
> +        __builtin_memcpy(dest, src, 4);
> +        break;
> +    case 8:
> +        __builtin_memcpy(dest, src, 8);
> +        break;
> +    default:
> +        __builtin_memcpy(dest, src, n);
> +    }
> +}
> +
> +static inline void address_space_memmove(void *dest, const void *src, size_t n)
> +{
> +    switch (n) {
> +    case 1:
> +        __builtin_memmove(dest, src, 1);
> +        break;
> +    case 2:
> +        __builtin_memmove(dest, src, 2);
> +        break;
> +    case 4:
> +        __builtin_memmove(dest, src, 4);
> +        break;
> +    case 8:
> +        __builtin_memmove(dest, src, 8);
> +        break;
> +    default:
> +        __builtin_memmove(dest, src, n);
> +    }
> +}
> +
>   /**
>    * address_space_read: read from an address space.
>    *
> @@ -2970,7 +3010,7 @@ MemTxResult address_space_read(AddressSpace *as, hwaddr addr,
>               mr = flatview_translate(fv, addr, &addr1, &l, false, attrs);
>               if (len == l && memory_access_is_direct(mr, false, attrs)) {
>                   ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
> -                memcpy(buf, ptr, len);
> +                __builtin_memcpy(buf, ptr, len);
>               } else {
>                   result = flatview_read_continue(fv, addr, attrs, buf, len,
>                                                   addr1, l, mr);
> diff --git a/system/physmem.c b/system/physmem.c
> index 7bcbf87573..5f46a9d676 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -3272,7 +3272,7 @@ static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
>           uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
>                                                  false, true);
>   
> -        memmove(ram_ptr, buf, *l);
> +        address_space_memmove(ram_ptr, buf, *l);
>           invalidate_and_set_dirty(mr, mr_addr, *l);
>   
>           return MEMTX_OK;
> @@ -3365,7 +3365,7 @@ static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
>           uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
>                                                  false, false);
>   
> -        memcpy(buf, ram_ptr, *l);
> +        address_space_memcpy(buf, ram_ptr, *l);
>   
>           return MEMTX_OK;
>       }
> @@ -3503,8 +3503,8 @@ MemTxResult address_space_write_rom(AddressSpace *as, hwaddr addr,
>               l = memory_access_size(mr, l, addr1);
>           } else {
>               /* ROM/RAM case */
> -            void *ram_ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
> -            memcpy(ram_ptr, buf, l);
> +            address_space_memcpy(qemu_map_ram_ptr(mr->ram_block, addr1),
> +                                 buf, l);
>               invalidate_and_set_dirty(mr, addr1, l);
>           }
>           len -= l;



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
  2026-06-12 14:05   ` Philippe Mathieu-Daudé
@ 2026-06-14 10:08     ` Gavin Shan
  0 siblings, 0 replies; 24+ messages in thread
From: Gavin Shan @ 2026-06-14 10:08 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, qemu-arm
  Cc: qemu-devel, peterx, mst, peter.maydell, berrange, david, alex,
	clg, pbonzini, phrdina, jugraham, shan.gavin

Hi Philippe,

On 6/13/26 12:05 AM, Philippe Mathieu-Daudé wrote:
> Hi Gavin,
> 
> On 12/6/26 13:03, Gavin Shan wrote:
>> All ram device regions was turned to be indirectly accessible by commit
>> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
>> to guest hang on compiling 'cuda-samples' as reported by Julia. The guest
>> is started by the following command lines, with a GH100 GPU card.
>>
>>     host$ lspci | grep GH100
>>     0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
>>     host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64            \
>>           -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T         \
>>           -accel kvm -cpu host -smp cpus=48 -m size=8G                       \
>>           -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0    \
>>           -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4     \
>>           -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
>>             :
>>     guest$ cd cuda-samples/build
>>     guest$ make -j 20 clean
>>     guest$ make -j 20
>>             :
>>     [ 54%] Linking CUDA executable graphMemoryNodes
>>     [ 54%] Built target graphMemoryNodes
>>     <no more output afterwards, guest becomes frozen here>
>>
>>     guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
>>     [  555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
>>
>> When the GPU's driver (NVidia open driver) is loaded on guest bootup,
>> the memory blocks residing in the PCI BAR#4 can be presented to the
>> guest through memory hot-add. The page cache can be allocated from the
>> hot added memory blocks when cuda-samples is being compiled. Afterwards,
>> the page cache is sent to QEMU's virtio-blk device as part of the DMA
>> request, the bounce buffer has to be used to accomodate the request as
>> the corresponding memory region (MemoryRegion) is a RAM DEVICE region
>> and indirectly accessible in qemu. However, the max bounce bufer size
>> is only 4096 bytes by default. We're running out of that space quickly.
>>
>>    QEMU
>>    ====
>>    virtio_blk_handle_output
>>      virtio_blk_handle_vq
>>        virtio_blk_get_request
>>          virtqueue_pop
>>            virtqueue_split_pop
>>              virtqueue_map_desc
>>                address_space_map
>>                  memory_access_is_direct         # Return false
>>                    memory_region_supports_direct_access
>>
>>    (qemu) info mtree
>>    memory-region: pci_bridge_pci
>>      0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
>>        0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
>>          0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
>>            0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
>>
>> This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
>> accessors to ram device memory region, preparatory work to make ram device
>> region directly accessible and bypass the bounce buffer in the DMA path
>> in next patch.
>>
>> Reported-by: Julia Graham <jugraham@redhat.com>
>> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
>> Suggested-by: Peter Xu <peterx@redhat.com>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   hw/remote/vfio-user-obj.c |  4 ++--
>>   include/system/memory.h   | 42 ++++++++++++++++++++++++++++++++++++++-
>>   system/physmem.c          |  8 ++++----
>>   3 files changed, 47 insertions(+), 7 deletions(-)
>>
>> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
>> index 87fa7b6572..fe6f661fe2 100644
>> --- a/hw/remote/vfio-user-obj.c
>> +++ b/hw/remote/vfio-user-obj.c
>> @@ -375,9 +375,9 @@ static int vfu_object_mr_rw(MemoryRegion *mr, uint8_t *buf, hwaddr offset,
>>           ram_ptr = memory_region_get_ram_ptr(mr);
>>           if (is_write) {
>> -            memcpy((ram_ptr + offset), buf, size);
>> +            address_space_memcpy(ram_ptr + offset, buf, size);
>>           } else {
>> -            memcpy(buf, (ram_ptr + offset), size);
>> +            address_space_memcpy(buf, ram_ptr + offset, size);
>>           }
>>           return 0;
>> diff --git a/include/system/memory.h b/include/system/memory.h
>> index 1417132f6d..6bb2e13eea 100644
>> --- a/include/system/memory.h
>> +++ b/include/system/memory.h
>> @@ -2938,6 +2938,46 @@ static inline bool memory_access_is_direct(const MemoryRegion *mr,
>>       return true;
>>   }
>> +static inline void address_space_memcpy(void *dest, const void *src, size_t n)
> 
> 'address_space_' prefix for something that doesn't use neither
> AddressSpace nor MemoryRegion is odd.
> 
> Maybe prefix 'qemu_ram_' or 'qemu_ram_ptr_' instead? (since the
> address is returned by memory_region_get_ram_ptr)
> Add the definitions in "system/ramblock.h" with that declaration?
> 

Yes, they have been renamed to qemu_ram_{copy, move}() and moved to
system/ramblock.h in local repo. The changes will be seen in the next
revision.

>> +{
>> +    switch (n) {
>> +    case 1:
>> +        __builtin_memcpy(dest, src, 1);
>> +        break;
>> +    case 2:
>> +        __builtin_memcpy(dest, src, 2);
>> +        break;
>> +    case 4:
>> +        __builtin_memcpy(dest, src, 4);
>> +        break;
>> +    case 8:
>> +        __builtin_memcpy(dest, src, 8);
>> +        break;
>> +    default:
>> +        __builtin_memcpy(dest, src, n);
>> +    }
>> +}
>> +
>> +static inline void address_space_memmove(void *dest, const void *src, size_t n)
>> +{
>> +    switch (n) {
>> +    case 1:
>> +        __builtin_memmove(dest, src, 1);
>> +        break;
>> +    case 2:
>> +        __builtin_memmove(dest, src, 2);
>> +        break;
>> +    case 4:
>> +        __builtin_memmove(dest, src, 4);
>> +        break;
>> +    case 8:
>> +        __builtin_memmove(dest, src, 8);
>> +        break;
>> +    default:
>> +        __builtin_memmove(dest, src, n);
>> +    }
>> +}
>> +
>>   /**
>>    * address_space_read: read from an address space.
>>    *
>> @@ -2970,7 +3010,7 @@ MemTxResult address_space_read(AddressSpace *as, hwaddr addr,
>>               mr = flatview_translate(fv, addr, &addr1, &l, false, attrs);
>>               if (len == l && memory_access_is_direct(mr, false, attrs)) {
>>                   ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
>> -                memcpy(buf, ptr, len);
>> +                __builtin_memcpy(buf, ptr, len);
>>               } else {
>>                   result = flatview_read_continue(fv, addr, attrs, buf, len,
>>                                                   addr1, l, mr);
>> diff --git a/system/physmem.c b/system/physmem.c
>> index 7bcbf87573..5f46a9d676 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -3272,7 +3272,7 @@ static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
>>           uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
>>                                                  false, true);
>> -        memmove(ram_ptr, buf, *l);
>> +        address_space_memmove(ram_ptr, buf, *l);
>>           invalidate_and_set_dirty(mr, mr_addr, *l);
>>           return MEMTX_OK;
>> @@ -3365,7 +3365,7 @@ static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
>>           uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
>>                                                  false, false);
>> -        memcpy(buf, ram_ptr, *l);
>> +        address_space_memcpy(buf, ram_ptr, *l);
>>           return MEMTX_OK;
>>       }
>> @@ -3503,8 +3503,8 @@ MemTxResult address_space_write_rom(AddressSpace *as, hwaddr addr,
>>               l = memory_access_size(mr, l, addr1);
>>           } else {
>>               /* ROM/RAM case */
>> -            void *ram_ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
>> -            memcpy(ram_ptr, buf, l);
>> +            address_space_memcpy(qemu_map_ram_ptr(mr->ram_block, addr1),
>> +                                 buf, l);
>>               invalidate_and_set_dirty(mr, addr1, l);
>>           }
>>           len -= l;
> 

Thanks,
Gavin



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
  2026-06-12 11:03 ` [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region Gavin Shan
  2026-06-12 11:22   ` Michael S. Tsirkin
  2026-06-12 14:05   ` Philippe Mathieu-Daudé
@ 2026-06-12 15:29   ` Richard Henderson
  2026-06-12 16:36     ` Peter Maydell
  2026-06-14 15:42     ` Michael S. Tsirkin
  2 siblings, 2 replies; 24+ messages in thread
From: Richard Henderson @ 2026-06-12 15:29 UTC (permalink / raw)
  To: Gavin Shan, qemu-arm
  Cc: qemu-devel, peterx, mst, peter.maydell, berrange, david, alex,
	clg, pbonzini, philmd, phrdina, jugraham, shan.gavin

On 6/12/26 04:03, Gavin Shan wrote:
> All ram device regions was turned to be indirectly accessible by commit
> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> to guest hang on compiling 'cuda-samples' as reported by Julia. The guest
> is started by the following command lines, with a GH100 GPU card.
> 
>     host$ lspci | grep GH100
>     0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
>     host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64            \
>           -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T         \
>           -accel kvm -cpu host -smp cpus=48 -m size=8G                       \
>           -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0    \
>           -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4     \
>           -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
>             :
>     guest$ cd cuda-samples/build
>     guest$ make -j 20 clean
>     guest$ make -j 20
>             :
>     [ 54%] Linking CUDA executable graphMemoryNodes
>     [ 54%] Built target graphMemoryNodes
>     <no more output afterwards, guest becomes frozen here>
> 
>     guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
>     [  555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
> 
> When the GPU's driver (NVidia open driver) is loaded on guest bootup,
> the memory blocks residing in the PCI BAR#4 can be presented to the
> guest through memory hot-add. The page cache can be allocated from the
> hot added memory blocks when cuda-samples is being compiled. Afterwards,
> the page cache is sent to QEMU's virtio-blk device as part of the DMA
> request, the bounce buffer has to be used to accomodate the request as
> the corresponding memory region (MemoryRegion) is a RAM DEVICE region
> and indirectly accessible in qemu. However, the max bounce bufer size
> is only 4096 bytes by default. We're running out of that space quickly.
> 
>    QEMU
>    ====
>    virtio_blk_handle_output
>      virtio_blk_handle_vq
>        virtio_blk_get_request
>          virtqueue_pop
>            virtqueue_split_pop
>              virtqueue_map_desc
>                address_space_map
>                  memory_access_is_direct         # Return false
>                    memory_region_supports_direct_access
> 
>    (qemu) info mtree
>    memory-region: pci_bridge_pci
>      0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
>        0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
>          0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
>            0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
> 
> This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> accessors to ram device memory region, preparatory work to make ram device
> region directly accessible and bypass the bounce buffer in the DMA path
> in next patch.

memcpy/memmove *always* compile to __builtin_memcpy/memmove, and the compiler later 
decides whether or not to expand inline.

So, this doesn't do what you think it does.
My real question is: what are you attempting to achieve?

(1) is the problem unaligned access to a mapped physical device?
(2) is the problem vector access to a mapped physical device?
(3) something else?


r~


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
  2026-06-12 15:29   ` Richard Henderson
@ 2026-06-12 16:36     ` Peter Maydell
  2026-06-12 17:25       ` Richard Henderson
  2026-06-14 15:42     ` Michael S. Tsirkin
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Maydell @ 2026-06-12 16:36 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Gavin Shan, qemu-arm, qemu-devel, peterx, mst, berrange, david,
	alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin

On Fri, 12 Jun 2026 at 16:29, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 6/12/26 04:03, Gavin Shan wrote:
> > This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> > accessors to ram device memory region, preparatory work to make ram device
> > region directly accessible and bypass the bounce buffer in the DMA path
> > in next patch.
>
> memcpy/memmove *always* compile to __builtin_memcpy/memmove, and the compiler later
> decides whether or not to expand inline.

Yes, but if you pass it a fixed small integer, then it is likely
to expand it inline, whereas if you pass it a variable then it
is likely not to... The patch is attempting to persuade the
compiler to definitely do an inline access for 1, 2, 4, 8
byte access.

> So, this doesn't do what you think it does.
> My real question is: what are you attempting to achieve?
>
> (1) is the problem unaligned access to a mapped physical device?
> (2) is the problem vector access to a mapped physical device?
> (3) something else?

I think there are two problems we're trying to fix here:

(1) If a device does e.g. a pci_dma_write() with size 1, we want
this to turn into exactly 1 byte write into guest memory, for the
normal case where the guest memory is real host RAM.
This deals with the e1000 bug where the pci_dma_write() turns into
a call to glibc memmove() with size 1 and glibc's implementation
turns that into 3 writes of the byte to the same address,
and then a guest write/read might interleave badly with
the extra writes.[*] Similarly some devices specify that they
do definitely do DMA accesses in a particular way (e.g. the
Arm SMMU spec says that the SMMU reads 64-bit page table entries
as single-copy atomic accesses).

(2) If a vCPU (emulated or KVM) does a 4 byte access to an
address which is in the PCI BAR of a vfio-passthrough device,
we want this to turn into exactly a 4-byte access to the
mmap()ed memory which is the BAR of the host device. This
is because that address might be a real hardware device
register, so it's important to access it exactly the way
the guest vCPU asked for, and not do multiple accesses or
misaligned accesses or whatever.

(The reason we make pass-through BARs not "direct access"
but instead go via the bounce buffer is that we were working
around (2).)

[*] I have not fully investigated the e1000 situation so it's
possible that there is some other issue also there, e.g.
guest or device model getting barrier semantics wrong. But
writing the byte with the "done with this descriptor" bit
3 times rather than once seems like asking for trouble.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
  2026-06-12 16:36     ` Peter Maydell
@ 2026-06-12 17:25       ` Richard Henderson
  2026-06-12 17:57         ` Peter Maydell
  2026-06-14 15:13         ` Michael S. Tsirkin
  0 siblings, 2 replies; 24+ messages in thread
From: Richard Henderson @ 2026-06-12 17:25 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Gavin Shan, qemu-arm, qemu-devel, peterx, mst, berrange, david,
	alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin

On 6/12/26 09:36, Peter Maydell wrote:
> On Fri, 12 Jun 2026 at 16:29, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> On 6/12/26 04:03, Gavin Shan wrote:
>>> This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
>>> accessors to ram device memory region, preparatory work to make ram device
>>> region directly accessible and bypass the bounce buffer in the DMA path
>>> in next patch.
>>
>> memcpy/memmove *always* compile to __builtin_memcpy/memmove, and the compiler later
>> decides whether or not to expand inline.
> 
> Yes, but if you pass it a fixed small integer, then it is likely
> to expand it inline, whereas if you pass it a variable then it
> is likely not to... The patch is attempting to persuade the
> compiler to definitely do an inline access for 1, 2, 4, 8
> byte access.

Sure, for hosts with unaligned accesses.  We still have sparc64 and (some?) riscv64 that 
don't automatically have such and will compile to more than one host instruction.

>> My real question is: what are you attempting to achieve?
>>
>> (1) is the problem unaligned access to a mapped physical device?
>> (2) is the problem vector access to a mapped physical device?
>> (3) something else?
> 
> I think there are two problems we're trying to fix here:
> 
> (1) If a device does e.g. a pci_dma_write() with size 1, we want
> this to turn into exactly 1 byte write into guest memory, for the
> normal case where the guest memory is real host RAM.
> This deals with the e1000 bug where the pci_dma_write() turns into
> a call to glibc memmove() with size 1 and glibc's implementation
> turns that into 3 writes of the byte to the same address...

Gotcha.  Easily handled by not using memcpy/memmove at all.

	*(char *)ptr = val;

is sufficient for all hosts.

> (2) If a vCPU (emulated or KVM) does a 4 byte access to an
> address which is in the PCI BAR of a vfio-passthrough device,
> we want this to turn into exactly a 4-byte access to the
> mmap()ed memory which is the BAR of the host device. This
> is because that address might be a real hardware device
> register, so it's important to access it exactly the way
> the guest vCPU asked for, and not do multiple accesses or
> misaligned accesses or whatever.
> 
> (The reason we make pass-through BARs not "direct access"
> but instead go via the bounce buffer is that we were working
> around (2).)

That's going to require qatomic and alignment checks.

That's also going to beg the question of the intended behaviour anything that isn't a 
naturally aligned access of size in {1,2,4,8}: fall back to N operations of the minimum of 
alignment and size?


r~


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
  2026-06-12 17:25       ` Richard Henderson
@ 2026-06-12 17:57         ` Peter Maydell
  2026-06-14 15:13         ` Michael S. Tsirkin
  1 sibling, 0 replies; 24+ messages in thread
From: Peter Maydell @ 2026-06-12 17:57 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Gavin Shan, qemu-arm, qemu-devel, peterx, mst, berrange, david,
	alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin

On Fri, 12 Jun 2026 at 18:25, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 6/12/26 09:36, Peter Maydell wrote:
> > On Fri, 12 Jun 2026 at 16:29, Richard Henderson
> > <richard.henderson@linaro.org> wrote:
> >>
> >> On 6/12/26 04:03, Gavin Shan wrote:
> >>> This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> >>> accessors to ram device memory region, preparatory work to make ram device
> >>> region directly accessible and bypass the bounce buffer in the DMA path
> >>> in next patch.
> >>
> >> memcpy/memmove *always* compile to __builtin_memcpy/memmove, and the compiler later
> >> decides whether or not to expand inline.
> >
> > Yes, but if you pass it a fixed small integer, then it is likely
> > to expand it inline, whereas if you pass it a variable then it
> > is likely not to... The patch is attempting to persuade the
> > compiler to definitely do an inline access for 1, 2, 4, 8
> > byte access.
>
> Sure, for hosts with unaligned accesses.  We still have sparc64 and (some?) riscv64 that
> don't automatically have such and will compile to more than one host instruction.

Good point.

> >> My real question is: what are you attempting to achieve?
> >>
> >> (1) is the problem unaligned access to a mapped physical device?
> >> (2) is the problem vector access to a mapped physical device?
> >> (3) something else?
> >
> > I think there are two problems we're trying to fix here:
> >
> > (1) If a device does e.g. a pci_dma_write() with size 1, we want
> > this to turn into exactly 1 byte write into guest memory, for the
> > normal case where the guest memory is real host RAM.
> > This deals with the e1000 bug where the pci_dma_write() turns into
> > a call to glibc memmove() with size 1 and glibc's implementation
> > turns that into 3 writes of the byte to the same address...
>
> Gotcha.  Easily handled by not using memcpy/memmove at all.
>
>         *(char *)ptr = val;
>
> is sufficient for all hosts.

That helps e1000 because it happens to be doing byte writes,
but the same effect probably occurs for word writes in other
devices...

> > (2) If a vCPU (emulated or KVM) does a 4 byte access to an
> > address which is in the PCI BAR of a vfio-passthrough device,
> > we want this to turn into exactly a 4-byte access to the
> > mmap()ed memory which is the BAR of the host device. This
> > is because that address might be a real hardware device
> > register, so it's important to access it exactly the way
> > the guest vCPU asked for, and not do multiple accesses or
> > misaligned accesses or whatever.
> >
> > (The reason we make pass-through BARs not "direct access"
> > but instead go via the bounce buffer is that we were working
> > around (2).)
>
> That's going to require qatomic and alignment checks.
>
> That's also going to beg the question of the intended behaviour
> anything that isn't a naturally aligned access of size in {1,2,4,8}:
>fall back to N operations of the minimum of alignment and size?

I think in practice we can probably say we don't care and
fall back to memmove: anything unaligned here is almost
certainly going to real RAM and not going to have any
particular ordering requirements (and/or might be caused
by the guest misprogramming a device DMA address).

I feel like we handle "unaligned access, maybe need to split it"
already in multiple parts of QEMU (for TCG accesses we do that
in accel/tcg; system/memory.c has a TODO comment about unaligned
accesses at the IO MemoryRegion read/writefn level and there's
a patchseries that tries to fix that), so if we find ourselves
really needing to do "handle this unaligned access by splitting
it into pieces" again here we ought to take a wider look at
whether we can consolidate that at all...

thanks
-- PMM


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
  2026-06-12 17:25       ` Richard Henderson
  2026-06-12 17:57         ` Peter Maydell
@ 2026-06-14 15:13         ` Michael S. Tsirkin
  2026-06-14 16:01           ` Peter Maydell
  2026-06-14 16:45           ` Richard Henderson
  1 sibling, 2 replies; 24+ messages in thread
From: Michael S. Tsirkin @ 2026-06-14 15:13 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Peter Maydell, Gavin Shan, qemu-arm, qemu-devel, peterx, berrange,
	david, alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin

On Fri, Jun 12, 2026 at 10:25:35AM -0700, Richard Henderson wrote:
> On 6/12/26 09:36, Peter Maydell wrote:
> > On Fri, 12 Jun 2026 at 16:29, Richard Henderson
> > <richard.henderson@linaro.org> wrote:
> > > 
> > > On 6/12/26 04:03, Gavin Shan wrote:
> > > > This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> > > > accessors to ram device memory region, preparatory work to make ram device
> > > > region directly accessible and bypass the bounce buffer in the DMA path
> > > > in next patch.
> > > 
> > > memcpy/memmove *always* compile to __builtin_memcpy/memmove, and the compiler later
> > > decides whether or not to expand inline.
> > 
> > Yes, but if you pass it a fixed small integer, then it is likely
> > to expand it inline, whereas if you pass it a variable then it
> > is likely not to... The patch is attempting to persuade the
> > compiler to definitely do an inline access for 1, 2, 4, 8
> > byte access.
> 
> Sure, for hosts with unaligned accesses.  We still have sparc64 and (some?)
> riscv64 that don't automatically have such and will compile to more than one
> host instruction.
> 
> > > My real question is: what are you attempting to achieve?
> > > 
> > > (1) is the problem unaligned access to a mapped physical device?
> > > (2) is the problem vector access to a mapped physical device?
> > > (3) something else?
> > 
> > I think there are two problems we're trying to fix here:
> > 
> > (1) If a device does e.g. a pci_dma_write() with size 1, we want
> > this to turn into exactly 1 byte write into guest memory, for the
> > normal case where the guest memory is real host RAM.
> > This deals with the e1000 bug where the pci_dma_write() turns into
> > a call to glibc memmove() with size 1 and glibc's implementation
> > turns that into 3 writes of the byte to the same address...
> 
> Gotcha.  Easily handled by not using memcpy/memmove at all.
> 
> 	*(char *)ptr = val;
> 
> is sufficient for all hosts.

Yes, I think it does work because we use -fno-strict-aliasing.
For bigger sizes we'll need packed because the addresses
could be unaligned.



But again, qemu simply already relies on this in bswap.h

I kind of dislike muddying the waters by making several
unrelated changes here. If we do we should change bwap too.


> > (2) If a vCPU (emulated or KVM) does a 4 byte access to an
> > address which is in the PCI BAR of a vfio-passthrough device,
> > we want this to turn into exactly a 4-byte access to the
> > mmap()ed memory which is the BAR of the host device. This
> > is because that address might be a real hardware device
> > register, so it's important to access it exactly the way
> > the guest vCPU asked for, and not do multiple accesses or
> > misaligned accesses or whatever.
> > 
> > (The reason we make pass-through BARs not "direct access"
> > but instead go via the bounce buffer is that we were working
> > around (2).)
> 
> That's going to require qatomic and alignment checks.
> 
> That's also going to beg the question of the intended behaviour anything
> that isn't a naturally aligned access of size in {1,2,4,8}: fall back to N
> operations of the minimum of alignment and size?

For most host/guest pairs things simply work even for unaligned.

And yes, guest drivers do do this.

On classical pci, there are no transactions as such and
an unaligned access will be split anyway.



> 
> 
> r~



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
  2026-06-14 15:13         ` Michael S. Tsirkin
@ 2026-06-14 16:01           ` Peter Maydell
  2026-06-14 16:06             ` Michael S. Tsirkin
  2026-06-14 16:17             ` Michael S. Tsirkin
  2026-06-14 16:45           ` Richard Henderson
  1 sibling, 2 replies; 24+ messages in thread
From: Peter Maydell @ 2026-06-14 16:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Richard Henderson, Gavin Shan, qemu-arm, qemu-devel, peterx,
	berrange, david, alex, clg, pbonzini, philmd, phrdina, jugraham,
	shan.gavin

On Sun, 14 Jun 2026 at 16:13, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Fri, Jun 12, 2026 at 10:25:35AM -0700, Richard Henderson wrote:
> > On 6/12/26 09:36, Peter Maydell wrote:
> > > On Fri, 12 Jun 2026 at 16:29, Richard Henderson
> > > <richard.henderson@linaro.org> wrote:
> > > >
> > > > On 6/12/26 04:03, Gavin Shan wrote:
> > > > > This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> > > > > accessors to ram device memory region, preparatory work to make ram device
> > > > > region directly accessible and bypass the bounce buffer in the DMA path
> > > > > in next patch.
> > > >
> > > > memcpy/memmove *always* compile to __builtin_memcpy/memmove, and the compiler later
> > > > decides whether or not to expand inline.
> > >
> > > Yes, but if you pass it a fixed small integer, then it is likely
> > > to expand it inline, whereas if you pass it a variable then it
> > > is likely not to... The patch is attempting to persuade the
> > > compiler to definitely do an inline access for 1, 2, 4, 8
> > > byte access.
> >
> > Sure, for hosts with unaligned accesses.  We still have sparc64 and (some?)
> > riscv64 that don't automatically have such and will compile to more than one
> > host instruction.
> >
> > > > My real question is: what are you attempting to achieve?
> > > >
> > > > (1) is the problem unaligned access to a mapped physical device?
> > > > (2) is the problem vector access to a mapped physical device?
> > > > (3) something else?
> > >
> > > I think there are two problems we're trying to fix here:
> > >
> > > (1) If a device does e.g. a pci_dma_write() with size 1, we want
> > > this to turn into exactly 1 byte write into guest memory, for the
> > > normal case where the guest memory is real host RAM.
> > > This deals with the e1000 bug where the pci_dma_write() turns into
> > > a call to glibc memmove() with size 1 and glibc's implementation
> > > turns that into 3 writes of the byte to the same address...
> >
> > Gotcha.  Easily handled by not using memcpy/memmove at all.
> >
> >       *(char *)ptr = val;
> >
> > is sufficient for all hosts.
>
> Yes, I think it does work because we use -fno-strict-aliasing.
> For bigger sizes we'll need packed because the addresses
> could be unaligned.

IIRC "packed" will cause architectures that can't do
unaligned word accesses to emit code to do byte accesses,
so you don't want that. You need to explicitly check alignment,
I think.

> But again, qemu simply already relies on this in bswap.h
>
> I kind of dislike muddying the waters by making several
> unrelated changes here. If we do we should change bwap too.

The ldl_p etc functions in bswap.h provide different semantics:
ldl_p() is "do a load of a 32-bit quantity, even if the
address is not 4-aligned". We don't care if the compiler
or the memcpy ends up doing that as 4 byte accesses, which
on some hosts it must do.

thanks
-- PMM


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
  2026-06-14 16:01           ` Peter Maydell
@ 2026-06-14 16:06             ` Michael S. Tsirkin
  2026-06-14 16:17             ` Michael S. Tsirkin
  1 sibling, 0 replies; 24+ messages in thread
From: Michael S. Tsirkin @ 2026-06-14 16:06 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Richard Henderson, Gavin Shan, qemu-arm, qemu-devel, peterx,
	berrange, david, alex, clg, pbonzini, philmd, phrdina, jugraham,
	shan.gavin

On Sun, Jun 14, 2026 at 05:01:34PM +0100, Peter Maydell wrote:
> On Sun, 14 Jun 2026 at 16:13, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Fri, Jun 12, 2026 at 10:25:35AM -0700, Richard Henderson wrote:
> > > On 6/12/26 09:36, Peter Maydell wrote:
> > > > On Fri, 12 Jun 2026 at 16:29, Richard Henderson
> > > > <richard.henderson@linaro.org> wrote:
> > > > >
> > > > > On 6/12/26 04:03, Gavin Shan wrote:
> > > > > > This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> > > > > > accessors to ram device memory region, preparatory work to make ram device
> > > > > > region directly accessible and bypass the bounce buffer in the DMA path
> > > > > > in next patch.
> > > > >
> > > > > memcpy/memmove *always* compile to __builtin_memcpy/memmove, and the compiler later
> > > > > decides whether or not to expand inline.
> > > >
> > > > Yes, but if you pass it a fixed small integer, then it is likely
> > > > to expand it inline, whereas if you pass it a variable then it
> > > > is likely not to... The patch is attempting to persuade the
> > > > compiler to definitely do an inline access for 1, 2, 4, 8
> > > > byte access.
> > >
> > > Sure, for hosts with unaligned accesses.  We still have sparc64 and (some?)
> > > riscv64 that don't automatically have such and will compile to more than one
> > > host instruction.
> > >
> > > > > My real question is: what are you attempting to achieve?
> > > > >
> > > > > (1) is the problem unaligned access to a mapped physical device?
> > > > > (2) is the problem vector access to a mapped physical device?
> > > > > (3) something else?
> > > >
> > > > I think there are two problems we're trying to fix here:
> > > >
> > > > (1) If a device does e.g. a pci_dma_write() with size 1, we want
> > > > this to turn into exactly 1 byte write into guest memory, for the
> > > > normal case where the guest memory is real host RAM.
> > > > This deals with the e1000 bug where the pci_dma_write() turns into
> > > > a call to glibc memmove() with size 1 and glibc's implementation
> > > > turns that into 3 writes of the byte to the same address...
> > >
> > > Gotcha.  Easily handled by not using memcpy/memmove at all.
> > >
> > >       *(char *)ptr = val;
> > >
> > > is sufficient for all hosts.
> >
> > Yes, I think it does work because we use -fno-strict-aliasing.
> > For bigger sizes we'll need packed because the addresses
> > could be unaligned.
> 
> IIRC "packed" will cause architectures that can't do
> unaligned word accesses to emit code to do byte accesses,
> so you don't want that.

I mean, what's left to do if guest wants an unaligned access? Crash?
On classical PCI, unaligned accesses are split up by the bus, so
there's a decent chance most devices will be ok if qemu does it.


> You need to explicitly check alignment,
> I think.
> 
> > But again, qemu simply already relies on this in bswap.h
> >
> > I kind of dislike muddying the waters by making several
> > unrelated changes here. If we do we should change bwap too.
> 
> The ldl_p etc functions in bswap.h provide different semantics:
> ldl_p() is "do a load of a 32-bit quantity, even if the
> address is not 4-aligned". We don't care if the compiler
> or the memcpy ends up doing that as 4 byte accesses, which
> on some hosts it must do.
> 
> thanks
> -- PMM


I mean the ram device ops trick is using bounce buffers precisely
because it happens to give you a single load.
And yes on hosts where it does not work, these devices
won't work, and qemu can't fix that.

-- 
MST



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
  2026-06-14 16:01           ` Peter Maydell
  2026-06-14 16:06             ` Michael S. Tsirkin
@ 2026-06-14 16:17             ` Michael S. Tsirkin
  1 sibling, 0 replies; 24+ messages in thread
From: Michael S. Tsirkin @ 2026-06-14 16:17 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Richard Henderson, Gavin Shan, qemu-arm, qemu-devel, peterx,
	berrange, david, alex, clg, pbonzini, philmd, phrdina, jugraham,
	shan.gavin

On Sun, Jun 14, 2026 at 05:01:34PM +0100, Peter Maydell wrote:
> On Sun, 14 Jun 2026 at 16:13, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Fri, Jun 12, 2026 at 10:25:35AM -0700, Richard Henderson wrote:
> > > On 6/12/26 09:36, Peter Maydell wrote:
> > > > On Fri, 12 Jun 2026 at 16:29, Richard Henderson
> > > > <richard.henderson@linaro.org> wrote:
> > > > >
> > > > > On 6/12/26 04:03, Gavin Shan wrote:
> > > > > > This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> > > > > > accessors to ram device memory region, preparatory work to make ram device
> > > > > > region directly accessible and bypass the bounce buffer in the DMA path
> > > > > > in next patch.
> > > > >
> > > > > memcpy/memmove *always* compile to __builtin_memcpy/memmove, and the compiler later
> > > > > decides whether or not to expand inline.
> > > >
> > > > Yes, but if you pass it a fixed small integer, then it is likely
> > > > to expand it inline, whereas if you pass it a variable then it
> > > > is likely not to... The patch is attempting to persuade the
> > > > compiler to definitely do an inline access for 1, 2, 4, 8
> > > > byte access.
> > >
> > > Sure, for hosts with unaligned accesses.  We still have sparc64 and (some?)
> > > riscv64 that don't automatically have such and will compile to more than one
> > > host instruction.
> > >
> > > > > My real question is: what are you attempting to achieve?
> > > > >
> > > > > (1) is the problem unaligned access to a mapped physical device?
> > > > > (2) is the problem vector access to a mapped physical device?
> > > > > (3) something else?
> > > >
> > > > I think there are two problems we're trying to fix here:
> > > >
> > > > (1) If a device does e.g. a pci_dma_write() with size 1, we want
> > > > this to turn into exactly 1 byte write into guest memory, for the
> > > > normal case where the guest memory is real host RAM.
> > > > This deals with the e1000 bug where the pci_dma_write() turns into
> > > > a call to glibc memmove() with size 1 and glibc's implementation
> > > > turns that into 3 writes of the byte to the same address...
> > >
> > > Gotcha.  Easily handled by not using memcpy/memmove at all.
> > >
> > >       *(char *)ptr = val;
> > >
> > > is sufficient for all hosts.
> >
> > Yes, I think it does work because we use -fno-strict-aliasing.
> > For bigger sizes we'll need packed because the addresses
> > could be unaligned.
> 
> IIRC "packed" will cause architectures that can't do
> unaligned word accesses to emit code to do byte accesses,
> so you don't want that.


I checked arm32 and while it does that, so does memcpy.
We'd have to explicitly code up aligned/unaligned usecases.

But do we really care about device assignment on those
hosts?

Maybe, the thing to do is just to ignore the issue.



> You need to explicitly check alignment,
> I think.
> 
> > But again, qemu simply already relies on this in bswap.h
> >
> > I kind of dislike muddying the waters by making several
> > unrelated changes here. If we do we should change bwap too.
> 
> The ldl_p etc functions in bswap.h provide different semantics:
> ldl_p() is "do a load of a 32-bit quantity, even if the
> address is not 4-aligned". We don't care if the compiler
> or the memcpy ends up doing that as 4 byte accesses, which
> on some hosts it must do.
> 
> thanks
> -- PMM



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
  2026-06-14 15:13         ` Michael S. Tsirkin
  2026-06-14 16:01           ` Peter Maydell
@ 2026-06-14 16:45           ` Richard Henderson
  2026-06-14 17:22             ` Michael S. Tsirkin
  1 sibling, 1 reply; 24+ messages in thread
From: Richard Henderson @ 2026-06-14 16:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Maydell, Gavin Shan, qemu-arm, qemu-devel, peterx, berrange,
	david, alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin

On 6/14/26 08:13, Michael S. Tsirkin wrote:
> Yes, I think it does work because we use -fno-strict-aliasing.
> For bigger sizes we'll need packed because the addresses
> could be unaligned.
...
> For most host/guest pairs things simply work even for unaligned.
> 
> And yes, guest drivers do do this.
> 
> On classical pci, there are no transactions as such and
> an unaligned access will be split anyway.

I'm saying, if you're talking about pass-through to real devices, that won't work. For 
instance, AArch64 will trap unaligned accesses to Device memory.

You need to actually handle unaligned.  Perhaps something like

     /* Find unit to fit size and alignment of dst */
     uintptr_t test = (uintptr_t)dst | size;
     uintptr_t lsb = test & -test;

     switch (lsb) {
     case 1:   // loop over uint8_t
     case 2:   // loop over uint16_t
     case 4:   // loop over uint32_t
     default:  // loop over uint64_t
     }

with the expectation that normally we'll have aligned addresses and size such that the 
loop will iterate once.


r~


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
  2026-06-14 16:45           ` Richard Henderson
@ 2026-06-14 17:22             ` Michael S. Tsirkin
  2026-06-15  1:17               ` Gavin Shan
  2026-06-15  1:41               ` Richard Henderson
  0 siblings, 2 replies; 24+ messages in thread
From: Michael S. Tsirkin @ 2026-06-14 17:22 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Peter Maydell, Gavin Shan, qemu-arm, qemu-devel, peterx, berrange,
	david, alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin

On Sun, Jun 14, 2026 at 09:45:51AM -0700, Richard Henderson wrote:
> On 6/14/26 08:13, Michael S. Tsirkin wrote:
> > Yes, I think it does work because we use -fno-strict-aliasing.
> > For bigger sizes we'll need packed because the addresses
> > could be unaligned.
> ...
> > For most host/guest pairs things simply work even for unaligned.
> > 
> > And yes, guest drivers do do this.
> > 
> > On classical pci, there are no transactions as such and
> > an unaligned access will be split anyway.
> 
> I'm saying, if you're talking about pass-through to real devices, that won't
> work. For instance, AArch64 will trap unaligned accesses to Device memory.

Presumably, AArch64 drivers don't do unaligned at all then?

> You need to actually handle unaligned.  Perhaps something like
> 
>     /* Find unit to fit size and alignment of dst */
>     uintptr_t test = (uintptr_t)dst | size;
>     uintptr_t lsb = test & -test;
> 
>     switch (lsb) {
>     case 1:   // loop over uint8_t
>     case 2:   // loop over uint16_t
>     case 4:   // loop over uint32_t
>     default:  // loop over uint64_t
>     }
> 
> with the expectation that normally we'll have aligned addresses and size
> such that the loop will iterate once.
> 
> 
> r~

And ifdef for arches without unaligned support?

-- 
MST



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
  2026-06-14 17:22             ` Michael S. Tsirkin
@ 2026-06-15  1:17               ` Gavin Shan
  2026-06-15  1:41               ` Richard Henderson
  1 sibling, 0 replies; 24+ messages in thread
From: Gavin Shan @ 2026-06-15  1:17 UTC (permalink / raw)
  To: Michael S. Tsirkin, Richard Henderson
  Cc: Peter Maydell, qemu-arm, qemu-devel, peterx, berrange, david,
	alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin

On 6/15/26 3:22 AM, Michael S. Tsirkin wrote:
> On Sun, Jun 14, 2026 at 09:45:51AM -0700, Richard Henderson wrote:
>> On 6/14/26 08:13, Michael S. Tsirkin wrote:
>>> Yes, I think it does work because we use -fno-strict-aliasing.
>>> For bigger sizes we'll need packed because the addresses
>>> could be unaligned.
>> ...
>>> For most host/guest pairs things simply work even for unaligned.
>>>
>>> And yes, guest drivers do do this.
>>>
>>> On classical pci, there are no transactions as such and
>>> an unaligned access will be split anyway.
>>
>> I'm saying, if you're talking about pass-through to real devices, that won't
>> work. For instance, AArch64 will trap unaligned accesses to Device memory.
> 
> Presumably, AArch64 drivers don't do unaligned at all then?
> 

I think Michael is correct because the unaligned access isn't a concern to
ram device region. MemroyRegionOPs::impl::unaligned is true for ram device
region, which means no unaligned access concerns.

static const MemoryRegionOps ram_device_mem_ops = {
     .read = memory_region_ram_device_read,
     .write = memory_region_ram_device_write,
     .endianness = HOST_BIG_ENDIAN ? DEVICE_BIG_ENDIAN : DEVICE_LITTLE_ENDIAN,
     .valid = {
         .min_access_size = 1,
         .max_access_size = 8,
         .unaligned = true,
     },
     .impl = {
         .min_access_size = 1,
         .max_access_size = 8,
         .unaligned = true,
     },
};

system/physmem.c::memory_access_size
====================================
int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr)
{
         :
     /* Bound the maximum access by the alignment of the address.  */
     if (!mr->ops->impl.unaligned) {
         unsigned align_size_max = addr & -addr;
         if (align_size_max != 0 && align_size_max < access_size_max) {
             access_size_max = align_size_max;
         }
     }
}

Please help to confirm if we really want to cover the unaligned access for
all architectures. After it's confirmed, I can send (v2) for further review.

>> You need to actually handle unaligned.  Perhaps something like
>>
>>      /* Find unit to fit size and alignment of dst */
>>      uintptr_t test = (uintptr_t)dst | size;
>>      uintptr_t lsb = test & -test;
>>
>>      switch (lsb) {
>>      case 1:   // loop over uint8_t
>>      case 2:   // loop over uint16_t
>>      case 4:   // loop over uint32_t
>>      default:  // loop over uint64_t
>>      }
>>
>> with the expectation that normally we'll have aligned addresses and size
>> such that the loop will iterate once.
>>

Thanks, Richard. I think it should work if we really want to cover the unaligned
access. I tried the following snippet based on yours, my issue can be fixed and
no other issues are seen in my environment.

-----> system/physmem.c

+/*
+ * qemu_ram_copy - copy data from ram block
+ *
+ * @dest: destination into which data is copied
+ * @src: source of the data
+ * @n: length of the data to be copied
+ *
+ * This function is friendly to unaligned access.
+ */
+void qemu_ram_copy(void *dest, const void *src, size_t n)
+{
+    uintptr_t test, lsb;
+
+    do {
+        test = (uintptr_t)dest | n;
+        lsb = test & -test;
+        switch (lsb) {
+        case 1:
+            *(uint8_t *)dest = *(uint8_t *)src;
+            src += 1;
+            dest += 1;
+            n -= 1;
+            break;
+        case 2:
+            *(uint16_t *)dest = *(uint16_t *)src;
+            src += 2;
+            dest += 2;
+            n -= 2;
+            break;
+        case 4:
+            *(uint32_t *)dest = *(uint32_t *)src;
+            src += 4;
+            dest += 4;
+            n -= 4;
+            break;
+        default:
+            *(uint64_t *)dest = *(uint64_t *)src;
+            src += 8;
+            dest += 8;
+            n -= 8;
+        }
+    } while (n != 0);
+}

-----> include/system/memory.h

+void qemu_ram_copy(void *dest, const void *src, size_t n);
+static inline void qemu_ram_move(void *dest, const void *src, size_t n)
+{
+    qemu_ram_copy(dest, src, n);
+}

>>
>> r~
> 
> And ifdef for arches without unaligned support?
> 

I guess we probably support unaligned access for all architectures or none of
them. It will depend on guest if the unaligned access will be triggered :-)

Thanks,
Gavin




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
  2026-06-14 17:22             ` Michael S. Tsirkin
  2026-06-15  1:17               ` Gavin Shan
@ 2026-06-15  1:41               ` Richard Henderson
  2026-06-15  7:24                 ` Michael S. Tsirkin
  2026-06-15  7:58                 ` Michael S. Tsirkin
  1 sibling, 2 replies; 24+ messages in thread
From: Richard Henderson @ 2026-06-15  1:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Maydell, Gavin Shan, qemu-arm, qemu-devel, peterx, berrange,
	david, alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin

On 6/14/26 10:22, Michael S. Tsirkin wrote:
> On Sun, Jun 14, 2026 at 09:45:51AM -0700, Richard Henderson wrote:
>> On 6/14/26 08:13, Michael S. Tsirkin wrote:
>>> Yes, I think it does work because we use -fno-strict-aliasing.
>>> For bigger sizes we'll need packed because the addresses
>>> could be unaligned.
>> ...
>>> For most host/guest pairs things simply work even for unaligned.
>>>
>>> And yes, guest drivers do do this.
>>>
>>> On classical pci, there are no transactions as such and
>>> an unaligned access will be split anyway.
>>
>> I'm saying, if you're talking about pass-through to real devices, that won't
>> work. For instance, AArch64 will trap unaligned accesses to Device memory.
> 
> Presumably, AArch64 drivers don't do unaligned at all then?

Yes.

>> You need to actually handle unaligned.  Perhaps something like
>>
>>      /* Find unit to fit size and alignment of dst */
>>      uintptr_t test = (uintptr_t)dst | size;
>>      uintptr_t lsb = test & -test;
>>
>>      switch (lsb) {
>>      case 1:   // loop over uint8_t
>>      case 2:   // loop over uint16_t
>>      case 4:   // loop over uint32_t
>>      default:  // loop over uint64_t
>>      }
>>
>> with the expectation that normally we'll have aligned addresses and size
>> such that the loop will iterate once.
>>
>>
>> r~
> 
> And ifdef for arches without unaligned support?

No ifdef.  All accesses produced by the above are aligned.


r~


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
  2026-06-15  1:41               ` Richard Henderson
@ 2026-06-15  7:24                 ` Michael S. Tsirkin
  2026-06-15  7:58                 ` Michael S. Tsirkin
  1 sibling, 0 replies; 24+ messages in thread
From: Michael S. Tsirkin @ 2026-06-15  7:24 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Peter Maydell, Gavin Shan, qemu-arm, qemu-devel, peterx, berrange,
	david, alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin

On Sun, Jun 14, 2026 at 06:41:57PM -0700, Richard Henderson wrote:
> On 6/14/26 10:22, Michael S. Tsirkin wrote:
> > On Sun, Jun 14, 2026 at 09:45:51AM -0700, Richard Henderson wrote:
> > > On 6/14/26 08:13, Michael S. Tsirkin wrote:
> > > > Yes, I think it does work because we use -fno-strict-aliasing.
> > > > For bigger sizes we'll need packed because the addresses
> > > > could be unaligned.
> > > ...
> > > > For most host/guest pairs things simply work even for unaligned.
> > > > 
> > > > And yes, guest drivers do do this.
> > > > 
> > > > On classical pci, there are no transactions as such and
> > > > an unaligned access will be split anyway.
> > > 
> > > I'm saying, if you're talking about pass-through to real devices, that won't
> > > work. For instance, AArch64 will trap unaligned accesses to Device memory.
> > 
> > Presumably, AArch64 drivers don't do unaligned at all then?
> 
> Yes.
> 
> > > You need to actually handle unaligned.  Perhaps something like
> > > 
> > >      /* Find unit to fit size and alignment of dst */
> > >      uintptr_t test = (uintptr_t)dst | size;
> > >      uintptr_t lsb = test & -test;
> > > 
> > >      switch (lsb) {
> > >      case 1:   // loop over uint8_t
> > >      case 2:   // loop over uint16_t
> > >      case 4:   // loop over uint32_t
> > >      default:  // loop over uint64_t
> > >      }
> > > 
> > > with the expectation that normally we'll have aligned addresses and size
> > > such that the loop will iterate once.

OK though it is worth looking at assembly and checking how to make it ooptimal.

I don't get why we have default here either, for that
we really should use memcpy for a better perf,
I think VCPUs can't initiate MMIO transactions >8 bytes.


> > > 
> > > 
> > > r~
> > 
> > And ifdef for arches without unaligned support?
> 
> No ifdef.  All accesses produced by the above are aligned.
> 
> 
> r~

but I don't think it's a good idea to have this on x86.

x86 does not need this pile of branches, has a popular closed
source guest we are all familiar with, famous for
it's rich ecosystem of drivers) and it's just
barely possible that an x86 guest on x86 host could
work as long as we do not break transaction boundaries.

Is 

#ifdef HOST_X86_64
#define QEMU_UNALIGNED 1
#else
#define QEMU_UNALIGNED 0
#endif

And then if (QEMU_UNALIGNED) a big deal really?





-- 
MST



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
  2026-06-15  1:41               ` Richard Henderson
  2026-06-15  7:24                 ` Michael S. Tsirkin
@ 2026-06-15  7:58                 ` Michael S. Tsirkin
  2026-06-15 10:08                   ` Gavin Shan
  1 sibling, 1 reply; 24+ messages in thread
From: Michael S. Tsirkin @ 2026-06-15  7:58 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Peter Maydell, Gavin Shan, qemu-arm, qemu-devel, peterx, berrange,
	david, alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin

On Sun, Jun 14, 2026 at 06:41:57PM -0700, Richard Henderson wrote:
> On 6/14/26 10:22, Michael S. Tsirkin wrote:
> > On Sun, Jun 14, 2026 at 09:45:51AM -0700, Richard Henderson wrote:
> > > On 6/14/26 08:13, Michael S. Tsirkin wrote:
> > > > Yes, I think it does work because we use -fno-strict-aliasing.
> > > > For bigger sizes we'll need packed because the addresses
> > > > could be unaligned.
> > > ...
> > > > For most host/guest pairs things simply work even for unaligned.
> > > > 
> > > > And yes, guest drivers do do this.
> > > > 
> > > > On classical pci, there are no transactions as such and
> > > > an unaligned access will be split anyway.
> > > 
> > > I'm saying, if you're talking about pass-through to real devices, that won't
> > > work. For instance, AArch64 will trap unaligned accesses to Device memory.
> > 
> > Presumably, AArch64 drivers don't do unaligned at all then?
> 
> Yes.
> 
> > > You need to actually handle unaligned.  Perhaps something like
> > > 
> > >      /* Find unit to fit size and alignment of dst */
> > >      uintptr_t test = (uintptr_t)dst | size;
> > >      uintptr_t lsb = test & -test;
> > > 
> > >      switch (lsb) {
> > >      case 1:   // loop over uint8_t
> > >      case 2:   // loop over uint16_t
> > >      case 4:   // loop over uint32_t
> > >      default:  // loop over uint64_t
> > >      }
> > > 
> > > with the expectation that normally we'll have aligned addresses and size
> > > such that the loop will iterate once.
> > > 
> > > 
> > > r~
> > 
> > And ifdef for arches without unaligned support?
> 
> No ifdef.  All accesses produced by the above are aligned.
> 
> 
> r~

So something like:

#if defined(__i386__) || defined(__x86_64__)
#define HOST_UNALIGNED_MMIO_OK  1
#else
#define HOST_UNALIGNED_MMIO_OK  0
#endif


and then we can check that.

-- 
MST



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
  2026-06-15  7:58                 ` Michael S. Tsirkin
@ 2026-06-15 10:08                   ` Gavin Shan
  0 siblings, 0 replies; 24+ messages in thread
From: Gavin Shan @ 2026-06-15 10:08 UTC (permalink / raw)
  To: Michael S. Tsirkin, Richard Henderson
  Cc: Peter Maydell, qemu-arm, qemu-devel, peterx, berrange, david,
	alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin

On 6/15/26 5:58 PM, Michael S. Tsirkin wrote:
> On Sun, Jun 14, 2026 at 06:41:57PM -0700, Richard Henderson wrote:
>> On 6/14/26 10:22, Michael S. Tsirkin wrote:
>>> On Sun, Jun 14, 2026 at 09:45:51AM -0700, Richard Henderson wrote:
>>>> On 6/14/26 08:13, Michael S. Tsirkin wrote:
>>>>> Yes, I think it does work because we use -fno-strict-aliasing.
>>>>> For bigger sizes we'll need packed because the addresses
>>>>> could be unaligned.
>>>> ...
>>>>> For most host/guest pairs things simply work even for unaligned.
>>>>>
>>>>> And yes, guest drivers do do this.
>>>>>
>>>>> On classical pci, there are no transactions as such and
>>>>> an unaligned access will be split anyway.
>>>>
>>>> I'm saying, if you're talking about pass-through to real devices, that won't
>>>> work. For instance, AArch64 will trap unaligned accesses to Device memory.
>>>
>>> Presumably, AArch64 drivers don't do unaligned at all then?
>>
>> Yes.
>>
>>>> You need to actually handle unaligned.  Perhaps something like
>>>>
>>>>       /* Find unit to fit size and alignment of dst */
>>>>       uintptr_t test = (uintptr_t)dst | size;
>>>>       uintptr_t lsb = test & -test;
>>>>
>>>>       switch (lsb) {
>>>>       case 1:   // loop over uint8_t
>>>>       case 2:   // loop over uint16_t
>>>>       case 4:   // loop over uint32_t
>>>>       default:  // loop over uint64_t
>>>>       }
>>>>
>>>> with the expectation that normally we'll have aligned addresses and size
>>>> such that the loop will iterate once.
>>>>
>>>>
>>>> r~
>>>
>>> And ifdef for arches without unaligned support?
>>
>> No ifdef.  All accesses produced by the above are aligned.
>>
>>
>> r~
> 
> So something like:
> 
> #if defined(__i386__) || defined(__x86_64__)
> #define HOST_UNALIGNED_MMIO_OK  1
> #else
> #define HOST_UNALIGNED_MMIO_OK  0
> #endif
> 
> 
> and then we can check that.
> 

Thanks for all the comments. A new revision (v2) has been sent for further
comments, thanks.

https://lore.kernel.org/qemu-arm/20260615100200.266968-1-gshan@redhat.com/

Thanks,
Gavin



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
  2026-06-12 15:29   ` Richard Henderson
  2026-06-12 16:36     ` Peter Maydell
@ 2026-06-14 15:42     ` Michael S. Tsirkin
  1 sibling, 0 replies; 24+ messages in thread
From: Michael S. Tsirkin @ 2026-06-14 15:42 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Gavin Shan, qemu-arm, qemu-devel, peterx, peter.maydell, berrange,
	david, alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin

On Fri, Jun 12, 2026 at 08:29:34AM -0700, Richard Henderson wrote:
> On 6/12/26 04:03, Gavin Shan wrote:
> > All ram device regions was turned to be indirectly accessible by commit
> > 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> > to guest hang on compiling 'cuda-samples' as reported by Julia. The guest
> > is started by the following command lines, with a GH100 GPU card.
> > 
> >     host$ lspci | grep GH100
> >     0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
> >     host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64            \
> >           -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T         \
> >           -accel kvm -cpu host -smp cpus=48 -m size=8G                       \
> >           -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0    \
> >           -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4     \
> >           -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
> >             :
> >     guest$ cd cuda-samples/build
> >     guest$ make -j 20 clean
> >     guest$ make -j 20
> >             :
> >     [ 54%] Linking CUDA executable graphMemoryNodes
> >     [ 54%] Built target graphMemoryNodes
> >     <no more output afterwards, guest becomes frozen here>
> > 
> >     guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
> >     [  555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
> > 
> > When the GPU's driver (NVidia open driver) is loaded on guest bootup,
> > the memory blocks residing in the PCI BAR#4 can be presented to the
> > guest through memory hot-add. The page cache can be allocated from the
> > hot added memory blocks when cuda-samples is being compiled. Afterwards,
> > the page cache is sent to QEMU's virtio-blk device as part of the DMA
> > request, the bounce buffer has to be used to accomodate the request as
> > the corresponding memory region (MemoryRegion) is a RAM DEVICE region
> > and indirectly accessible in qemu. However, the max bounce bufer size
> > is only 4096 bytes by default. We're running out of that space quickly.
> > 
> >    QEMU
> >    ====
> >    virtio_blk_handle_output
> >      virtio_blk_handle_vq
> >        virtio_blk_get_request
> >          virtqueue_pop
> >            virtqueue_split_pop
> >              virtqueue_map_desc
> >                address_space_map
> >                  memory_access_is_direct         # Return false
> >                    memory_region_supports_direct_access
> > 
> >    (qemu) info mtree
> >    memory-region: pci_bridge_pci
> >      0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
> >        0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
> >          0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
> >            0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
> > 
> > This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> > accessors to ram device memory region, preparatory work to make ram device
> > region directly accessible and bypass the bounce buffer in the DMA path
> > in next patch.
> 
> memcpy/memmove *always* compile to __builtin_memcpy/memmove, and the
> compiler later decides whether or not to expand inline.

Did some research.

The issue that prompted use of __builtin_memcpy in bswap.h is musl's fortify headers
that Alpine Linux was using back in 2019.

Try this:
git clone --depth=1 https://github.com/jvoisin/fortify-headers /tmp/fortify-headers
git -C /tmp/fortify-headers checkout 5aabc7e~1

now:

musl-gcc -O2 -D_FORTIFY_SOURCE=2 -isystem /tmp/fortify-headers/include -S -o- test.c
where test.c is:

#include <string.h>
#include <stdint.h>

void test_memcpy(void *dst, const void *src) {
    memcpy(dst, src, 4);
}

void test_builtin_memcpy(void *dst, const void *src) {
    __builtin_memcpy(dst, src, 4);
}




On x86:

    test_memcpy:
     .LFB13:
        .cfi_startproc
        cmpq    %rsi, %rdi
        jnb     .L2
        leaq    4(%rdi), %rax
        cmpq    %rax, %rsi
        jb      .L6
     .L4:                                                                                            
        movl    $4, %edx
        jmp     memcpy
        .p2align 4,,10
        .p2align 3
     .L2:
        cmpq    %rdi, %rsi
        jnb     .L4
        leaq    4(%rsi), %rax
        cmpq    %rax, %rdi
        jb      .L3
        movl    $4, %edx
        jmp     memcpy
     .L6:
        jmp     .L3
        .cfi_endproc
        .section        .text.unlikely
        .cfi_startproc
        .type   test_memcpy.cold, @function
     test_memcpy.cold:
     .LFSB13:
     .L3:
        ud2
        .cfi_endproc
     .LFE13:
        .text
        .size   test_memcpy, .-test_memcpy



A jmp so a function call.
While builtin:



     test_builtin_memcpy:
     .LFB14:
        .cfi_startproc
        movl    (%rsi), %eax
        movl    %eax, (%rdi)
        ret
        .cfi_endproc
     .LFE14:
        .size   test_builtin_memcpy, .-test_builtin_memcpy



So it's a fortify headers Version 1.0 bug.  Version 1.1 has the fix.

I guess we should rewrite bswap.h then.


> So, this doesn't do what you think it does.
> My real question is: what are you attempting to achieve?
> 
> (1) is the problem unaligned access to a mapped physical device?
> (2) is the problem vector access to a mapped physical device?
> (3) something else?
> 
> 
> r~



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 2/2] system/memory: Make ram device region directly accessible
  2026-06-12 11:03 [PATCH 0/2] system/memory: Make ram device region directly accessible Gavin Shan
  2026-06-12 11:03 ` [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region Gavin Shan
@ 2026-06-12 11:03 ` Gavin Shan
  2026-06-12 23:35 ` [PATCH 0/2] " Gavin Shan
  2 siblings, 0 replies; 24+ messages in thread
From: Gavin Shan @ 2026-06-12 11:03 UTC (permalink / raw)
  To: qemu-arm
  Cc: qemu-devel, peterx, mst, peter.maydell, berrange, david, alex,
	clg, pbonzini, philmd, phrdina, jugraham, shan.gavin

This basically reverts 4a2e242bbb30 ("memory: Don't use memcpy for
ram_device regions") to make ram device region directly accessible
again. With this, the bounce buffer is bypassed in address_space_map()
when a ram device region is involved, potentially avoid to overrun
the bounce buffer.

Reported-by: Julia Graham <jugraham@redhat.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Suggested-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 include/system/memory.h | 11 ++---------
 system/memory.c         | 41 +----------------------------------------
 system/trace-events     |  2 --
 3 files changed, 3 insertions(+), 51 deletions(-)

diff --git a/include/system/memory.h b/include/system/memory.h
index 6bb2e13eea..3ca6155805 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -2914,15 +2914,8 @@ static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
     if (memory_region_is_romd(mr)) {
         return true;
     }
-    if (!memory_region_is_ram(mr)) {
-        return false;
-    }
-    /*
-     * RAM DEVICE regions can be accessed directly using memcpy, but it might
-     * be MMIO and access using mempy can be wrong (e.g., using instructions not
-     * intended for MMIO access). So we treat this as IO.
-     */
-    return !memory_region_is_ram_device(mr);
+
+    return memory_region_is_ram(mr);
 }
 
 static inline bool memory_access_is_direct(const MemoryRegion *mr,
diff --git a/system/memory.c b/system/memory.c
index 739ba11da6..9549dd1a94 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -1362,43 +1362,6 @@ const MemoryRegionOps unassigned_mem_ops = {
     .endianness = DEVICE_NATIVE_ENDIAN,
 };
 
-static uint64_t memory_region_ram_device_read(void *opaque,
-                                              hwaddr addr, unsigned size)
-{
-    MemoryRegion *mr = opaque;
-    uint64_t data = ldn_he_p(mr->ram_block->host + addr, size);
-
-    trace_memory_region_ram_device_read(get_cpu_index(), mr, addr, data, size);
-
-    return data;
-}
-
-static void memory_region_ram_device_write(void *opaque, hwaddr addr,
-                                           uint64_t data, unsigned size)
-{
-    MemoryRegion *mr = opaque;
-
-    trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, data, size);
-
-    stn_he_p(mr->ram_block->host + addr, size, data);
-}
-
-static const MemoryRegionOps ram_device_mem_ops = {
-    .read = memory_region_ram_device_read,
-    .write = memory_region_ram_device_write,
-    .endianness = HOST_BIG_ENDIAN ? DEVICE_BIG_ENDIAN : DEVICE_LITTLE_ENDIAN,
-    .valid = {
-        .min_access_size = 1,
-        .max_access_size = 8,
-        .unaligned = true,
-    },
-    .impl = {
-        .min_access_size = 1,
-        .max_access_size = 8,
-        .unaligned = true,
-    },
-};
-
 bool memory_region_access_valid(MemoryRegion *mr,
                                 hwaddr addr,
                                 unsigned size,
@@ -1676,10 +1639,8 @@ void memory_region_init_ram_device_ptr(MemoryRegion *mr, Object *owner,
                                        const char *name, uint64_t size,
                                        void *ptr)
 {
-    memory_region_init_io(mr, owner, &ram_device_mem_ops, mr, name, size);
-    mr->ram = true;
+    memory_region_init_ram_ptr(mr, owner, name, size, ptr);
     mr->ram_device = true;
-    memory_region_set_ram_ptr(mr, size, ptr);
 }
 
 void memory_region_init_alias(MemoryRegion *mr, Object *owner,
diff --git a/system/trace-events b/system/trace-events
index e6e1b61279..34af0a3a1e 100644
--- a/system/trace-events
+++ b/system/trace-events
@@ -20,8 +20,6 @@ memory_region_ops_read(int cpu_index, void *mr, uint64_t addr, uint64_t value, u
 memory_region_ops_write(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size, const char *name) "cpu %d mr %p addr 0x%"PRIx64" value 0x%"PRIx64" size %u name '%s'"
 memory_region_subpage_read(int cpu_index, void *mr, uint64_t offset, uint64_t value, unsigned size) "cpu %d mr %p offset 0x%"PRIx64" value 0x%"PRIx64" size %u"
 memory_region_subpage_write(int cpu_index, void *mr, uint64_t offset, uint64_t value, unsigned size) "cpu %d mr %p offset 0x%"PRIx64" value 0x%"PRIx64" size %u"
-memory_region_ram_device_read(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size) "cpu %d mr %p addr 0x%"PRIx64" value 0x%"PRIx64" size %u"
-memory_region_ram_device_write(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size) "cpu %d mr %p addr 0x%"PRIx64" value 0x%"PRIx64" size %u"
 memory_region_sync_dirty(const char *mr, const char *listener, int global) "mr '%s' listener '%s' synced (global=%d)"
 flatview_new(void *view, void *root) "%p (root %p)"
 flatview_destroy(void *view, void *root) "%p (root %p)"
-- 
2.54.0



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/2] system/memory: Make ram device region directly accessible
  2026-06-12 11:03 [PATCH 0/2] system/memory: Make ram device region directly accessible Gavin Shan
  2026-06-12 11:03 ` [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region Gavin Shan
  2026-06-12 11:03 ` [PATCH 2/2] system/memory: Make ram device region directly accessible Gavin Shan
@ 2026-06-12 23:35 ` Gavin Shan
  2 siblings, 0 replies; 24+ messages in thread
From: Gavin Shan @ 2026-06-12 23:35 UTC (permalink / raw)
  To: qemu-arm
  Cc: qemu-devel, peterx, mst, peter.maydell, berrange, david, alex,
	clg, pbonzini, philmd, phrdina, jugraham, shan.gavin,
	liugang24219, dinghui

[Cc Liu Gang and Ding hui]

On 6/12/26 9:03 PM, Gavin Shan wrote:
> All ram device regions was turned to be indirectly accessible by commit
> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> to a hanged guest where a NVidia GH100 GPU is passed from host. The memory
> in its PCI BAR#4 can be allocated as DMA target buffer. qemu has to take
> DMA bounce buffer in address_space_map() to cover the DMA request. However,
> the bounce buffer size is 4096 bytes and we're overrunning it easily when
> the guest has significant disk activities on compiling 'cuda-samples'.
> The full log and problem description can be found from PATCH[1/2]'s commit
> log.
> 
> Try to fix the issue handled in commit 4a2e242bbb by replacing mem{cpy, move}
> with __builtin_mem{cpy, move} in the accessors to the ram device regions.
> With this, we can basically revert that commit to make ram device region
> directly accessible again and bypass the bounce buffer in address_space_map()
> where the guest hang is caused.
> 
> PATCH[1] replaces mem{cpy, move} with __builtin_mem{cpy, move}
> PATCH[2] makes ram device region directly accessible again
> 

Liu and Ding, Could you give this series a try to see if your e1000 issue gets
fixed by this?

   https://lore.kernel.org/qemu-devel/20260527091711.3901-1-liugang24219@sangfor.com.cn/

> Changelog
> =========
> RFCv1 -> v1:
>    * https://lists.nongnu.org/archive/html/qemu-arm/2026-06/msg00307.html
>    * Reworked solution based on suggestions from Peter Xu, Peter Maydell
>      and Michael S. Tsirkin
> 
> Gavin Shan (2):
>    system/memory: Use __builtin_mem{cpy, move} in accessors of ram device
>      region
>    system/memory: Make ram device region directly accessible
> 
>   hw/remote/vfio-user-obj.c |  4 +--
>   include/system/memory.h   | 53 +++++++++++++++++++++++++++++++--------
>   system/memory.c           | 41 +-----------------------------
>   system/physmem.c          |  8 +++---
>   system/trace-events       |  2 --
>   5 files changed, 50 insertions(+), 58 deletions(-)
> 

Thanks,
Gavin



^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2026-06-15 10:10 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-12 11:03 [PATCH 0/2] system/memory: Make ram device region directly accessible Gavin Shan
2026-06-12 11:03 ` [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region Gavin Shan
2026-06-12 11:22   ` Michael S. Tsirkin
2026-06-14 10:04     ` Gavin Shan
2026-06-12 14:05   ` Philippe Mathieu-Daudé
2026-06-14 10:08     ` Gavin Shan
2026-06-12 15:29   ` Richard Henderson
2026-06-12 16:36     ` Peter Maydell
2026-06-12 17:25       ` Richard Henderson
2026-06-12 17:57         ` Peter Maydell
2026-06-14 15:13         ` Michael S. Tsirkin
2026-06-14 16:01           ` Peter Maydell
2026-06-14 16:06             ` Michael S. Tsirkin
2026-06-14 16:17             ` Michael S. Tsirkin
2026-06-14 16:45           ` Richard Henderson
2026-06-14 17:22             ` Michael S. Tsirkin
2026-06-15  1:17               ` Gavin Shan
2026-06-15  1:41               ` Richard Henderson
2026-06-15  7:24                 ` Michael S. Tsirkin
2026-06-15  7:58                 ` Michael S. Tsirkin
2026-06-15 10:08                   ` Gavin Shan
2026-06-14 15:42     ` Michael S. Tsirkin
2026-06-12 11:03 ` [PATCH 2/2] system/memory: Make ram device region directly accessible Gavin Shan
2026-06-12 23:35 ` [PATCH 0/2] " Gavin Shan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.