* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
2026-06-12 11:03 ` [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region Gavin Shan
@ 2026-06-12 11:22 ` Michael S. Tsirkin
2026-06-14 10:04 ` Gavin Shan
2026-06-12 14:05 ` Philippe Mathieu-Daudé
2026-06-12 15:29 ` Richard Henderson
2 siblings, 1 reply; 24+ messages in thread
From: Michael S. Tsirkin @ 2026-06-12 11:22 UTC (permalink / raw)
To: Gavin Shan
Cc: qemu-arm, qemu-devel, peterx, peter.maydell, berrange, david,
alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin
On Fri, Jun 12, 2026 at 09:03:06PM +1000, Gavin Shan wrote:
> All ram device regions was turned to be indirectly accessible by commit
> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> to guest hang on compiling 'cuda-samples' as reported by Julia. The guest
> is started by the following command lines, with a GH100 GPU card.
>
> host$ lspci | grep GH100
> 0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
> host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64 \
> -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T \
> -accel kvm -cpu host -smp cpus=48 -m size=8G \
> -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0 \
> -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4 \
> -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
> :
> guest$ cd cuda-samples/build
> guest$ make -j 20 clean
> guest$ make -j 20
> :
> [ 54%] Linking CUDA executable graphMemoryNodes
> [ 54%] Built target graphMemoryNodes
> <no more output afterwards, guest becomes frozen here>
>
> guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
> [ 555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
>
> When the GPU's driver (NVidia open driver) is loaded on guest bootup,
> the memory blocks residing in the PCI BAR#4 can be presented to the
> guest through memory hot-add. The page cache can be allocated from the
> hot added memory blocks when cuda-samples is being compiled. Afterwards,
> the page cache is sent to QEMU's virtio-blk device as part of the DMA
> request, the bounce buffer has to be used to accomodate the request as
> the corresponding memory region (MemoryRegion) is a RAM DEVICE region
> and indirectly accessible in qemu. However, the max bounce bufer size
> is only 4096 bytes by default. We're running out of that space quickly.
>
> QEMU
> ====
> virtio_blk_handle_output
> virtio_blk_handle_vq
> virtio_blk_get_request
> virtqueue_pop
> virtqueue_split_pop
> virtqueue_map_desc
> address_space_map
> memory_access_is_direct # Return false
> memory_region_supports_direct_access
>
> (qemu) info mtree
> memory-region: pci_bridge_pci
> 0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
> 0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
> 0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
> 0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
>
> This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> accessors to ram device memory region, preparatory work to make ram device
> region directly accessible and bypass the bounce buffer in the DMA path
> in next patch.
>
> Reported-by: Julia Graham <jugraham@redhat.com>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> Suggested-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
> hw/remote/vfio-user-obj.c | 4 ++--
> include/system/memory.h | 42 ++++++++++++++++++++++++++++++++++++++-
> system/physmem.c | 8 ++++----
> 3 files changed, 47 insertions(+), 7 deletions(-)
>
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index 87fa7b6572..fe6f661fe2 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -375,9 +375,9 @@ static int vfu_object_mr_rw(MemoryRegion *mr, uint8_t *buf, hwaddr offset,
> ram_ptr = memory_region_get_ram_ptr(mr);
>
> if (is_write) {
> - memcpy((ram_ptr + offset), buf, size);
> + address_space_memcpy(ram_ptr + offset, buf, size);
> } else {
> - memcpy(buf, (ram_ptr + offset), size);
> + address_space_memcpy(buf, ram_ptr + offset, size);
> }
>
> return 0;
> diff --git a/include/system/memory.h b/include/system/memory.h
> index 1417132f6d..6bb2e13eea 100644
> --- a/include/system/memory.h
> +++ b/include/system/memory.h
> @@ -2938,6 +2938,46 @@ static inline bool memory_access_is_direct(const MemoryRegion *mr,
> return true;
> }
>
> +static inline void address_space_memcpy(void *dest, const void *src, size_t n)
> +{
> + switch (n) {
> + case 1:
> + __builtin_memcpy(dest, src, 1);
> + break;
> + case 2:
> + __builtin_memcpy(dest, src, 2);
> + break;
> + case 4:
> + __builtin_memcpy(dest, src, 4);
> + break;
> + case 8:
> + __builtin_memcpy(dest, src, 8);
> + break;
> + default:
> + __builtin_memcpy(dest, src, n);
> + }
> +}
> +
> +static inline void address_space_memmove(void *dest, const void *src, size_t n)
> +{
> + switch (n) {
> + case 1:
> + __builtin_memmove(dest, src, 1);
> + break;
> + case 2:
> + __builtin_memmove(dest, src, 2);
> + break;
> + case 4:
> + __builtin_memmove(dest, src, 4);
> + break;
> + case 8:
> + __builtin_memmove(dest, src, 8);
> + break;
> + default:
> + __builtin_memmove(dest, src, n);
> + }
> +}
> +
> /**
> * address_space_read: read from an address space.
> *
The variable length probably should use the regular memcpy/memmove -
no reason to bypass fortification for these.
> @@ -2970,7 +3010,7 @@ MemTxResult address_space_read(AddressSpace *as, hwaddr addr,
> mr = flatview_translate(fv, addr, &addr1, &l, false, attrs);
> if (len == l && memory_access_is_direct(mr, false, attrs)) {
> ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
> - memcpy(buf, ptr, len);
> + __builtin_memcpy(buf, ptr, len);
> } else {
> result = flatview_read_continue(fv, addr, attrs, buf, len,
> addr1, l, mr);
> diff --git a/system/physmem.c b/system/physmem.c
> index 7bcbf87573..5f46a9d676 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -3272,7 +3272,7 @@ static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
> uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
> false, true);
>
> - memmove(ram_ptr, buf, *l);
> + address_space_memmove(ram_ptr, buf, *l);
> invalidate_and_set_dirty(mr, mr_addr, *l);
>
> return MEMTX_OK;
> @@ -3365,7 +3365,7 @@ static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
> uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
> false, false);
>
> - memcpy(buf, ram_ptr, *l);
> + address_space_memcpy(buf, ram_ptr, *l);
>
> return MEMTX_OK;
> }
> @@ -3503,8 +3503,8 @@ MemTxResult address_space_write_rom(AddressSpace *as, hwaddr addr,
> l = memory_access_size(mr, l, addr1);
> } else {
> /* ROM/RAM case */
> - void *ram_ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
> - memcpy(ram_ptr, buf, l);
> + address_space_memcpy(qemu_map_ram_ptr(mr->ram_block, addr1),
> + buf, l);
> invalidate_and_set_dirty(mr, addr1, l);
> }
> len -= l;
> --
> 2.54.0
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
2026-06-12 11:22 ` Michael S. Tsirkin
@ 2026-06-14 10:04 ` Gavin Shan
0 siblings, 0 replies; 24+ messages in thread
From: Gavin Shan @ 2026-06-14 10:04 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: qemu-arm, qemu-devel, peterx, peter.maydell, berrange, david,
alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin
On 6/12/26 9:22 PM, Michael S. Tsirkin wrote:
> On Fri, Jun 12, 2026 at 09:03:06PM +1000, Gavin Shan wrote:
>> All ram device regions was turned to be indirectly accessible by commit
>> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
>> to guest hang on compiling 'cuda-samples' as reported by Julia. The guest
>> is started by the following command lines, with a GH100 GPU card.
>>
>> host$ lspci | grep GH100
>> 0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
>> host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64 \
>> -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T \
>> -accel kvm -cpu host -smp cpus=48 -m size=8G \
>> -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0 \
>> -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4 \
>> -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
>> :
>> guest$ cd cuda-samples/build
>> guest$ make -j 20 clean
>> guest$ make -j 20
>> :
>> [ 54%] Linking CUDA executable graphMemoryNodes
>> [ 54%] Built target graphMemoryNodes
>> <no more output afterwards, guest becomes frozen here>
>>
>> guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
>> [ 555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
>>
>> When the GPU's driver (NVidia open driver) is loaded on guest bootup,
>> the memory blocks residing in the PCI BAR#4 can be presented to the
>> guest through memory hot-add. The page cache can be allocated from the
>> hot added memory blocks when cuda-samples is being compiled. Afterwards,
>> the page cache is sent to QEMU's virtio-blk device as part of the DMA
>> request, the bounce buffer has to be used to accomodate the request as
>> the corresponding memory region (MemoryRegion) is a RAM DEVICE region
>> and indirectly accessible in qemu. However, the max bounce bufer size
>> is only 4096 bytes by default. We're running out of that space quickly.
>>
>> QEMU
>> ====
>> virtio_blk_handle_output
>> virtio_blk_handle_vq
>> virtio_blk_get_request
>> virtqueue_pop
>> virtqueue_split_pop
>> virtqueue_map_desc
>> address_space_map
>> memory_access_is_direct # Return false
>> memory_region_supports_direct_access
>>
>> (qemu) info mtree
>> memory-region: pci_bridge_pci
>> 0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
>> 0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
>> 0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
>> 0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
>>
>> This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
>> accessors to ram device memory region, preparatory work to make ram device
>> region directly accessible and bypass the bounce buffer in the DMA path
>> in next patch.
>>
>> Reported-by: Julia Graham <jugraham@redhat.com>
>> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
>> Suggested-by: Peter Xu <peterx@redhat.com>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>> hw/remote/vfio-user-obj.c | 4 ++--
>> include/system/memory.h | 42 ++++++++++++++++++++++++++++++++++++++-
>> system/physmem.c | 8 ++++----
>> 3 files changed, 47 insertions(+), 7 deletions(-)
>>
>> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
>> index 87fa7b6572..fe6f661fe2 100644
>> --- a/hw/remote/vfio-user-obj.c
>> +++ b/hw/remote/vfio-user-obj.c
>> @@ -375,9 +375,9 @@ static int vfu_object_mr_rw(MemoryRegion *mr, uint8_t *buf, hwaddr offset,
>> ram_ptr = memory_region_get_ram_ptr(mr);
>>
>> if (is_write) {
>> - memcpy((ram_ptr + offset), buf, size);
>> + address_space_memcpy(ram_ptr + offset, buf, size);
>> } else {
>> - memcpy(buf, (ram_ptr + offset), size);
>> + address_space_memcpy(buf, ram_ptr + offset, size);
>> }
>>
>> return 0;
>> diff --git a/include/system/memory.h b/include/system/memory.h
>> index 1417132f6d..6bb2e13eea 100644
>> --- a/include/system/memory.h
>> +++ b/include/system/memory.h
>> @@ -2938,6 +2938,46 @@ static inline bool memory_access_is_direct(const MemoryRegion *mr,
>> return true;
>> }
>>
>> +static inline void address_space_memcpy(void *dest, const void *src, size_t n)
>> +{
>> + switch (n) {
>> + case 1:
>> + __builtin_memcpy(dest, src, 1);
>> + break;
>> + case 2:
>> + __builtin_memcpy(dest, src, 2);
>> + break;
>> + case 4:
>> + __builtin_memcpy(dest, src, 4);
>> + break;
>> + case 8:
>> + __builtin_memcpy(dest, src, 8);
>> + break;
>> + default:
>> + __builtin_memcpy(dest, src, n);
>> + }
>> +}
>> +
>> +static inline void address_space_memmove(void *dest, const void *src, size_t n)
>> +{
>> + switch (n) {
>> + case 1:
>> + __builtin_memmove(dest, src, 1);
>> + break;
>> + case 2:
>> + __builtin_memmove(dest, src, 2);
>> + break;
>> + case 4:
>> + __builtin_memmove(dest, src, 4);
>> + break;
>> + case 8:
>> + __builtin_memmove(dest, src, 8);
>> + break;
>> + default:
>> + __builtin_memmove(dest, src, n);
>> + }
>> +}
>> +
>> /**
>> * address_space_read: read from an address space.
>> *
>
>
> The variable length probably should use the regular memcpy/memmove -
> no reason to bypass fortification for these.
>
Fixed locally and the changes will be included in next revision.
>
>> @@ -2970,7 +3010,7 @@ MemTxResult address_space_read(AddressSpace *as, hwaddr addr,
>> mr = flatview_translate(fv, addr, &addr1, &l, false, attrs);
>> if (len == l && memory_access_is_direct(mr, false, attrs)) {
>> ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
>> - memcpy(buf, ptr, len);
>> + __builtin_memcpy(buf, ptr, len);
>> } else {
>> result = flatview_read_continue(fv, addr, attrs, buf, len,
>> addr1, l, mr);
>> diff --git a/system/physmem.c b/system/physmem.c
>> index 7bcbf87573..5f46a9d676 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -3272,7 +3272,7 @@ static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
>> uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
>> false, true);
>>
>> - memmove(ram_ptr, buf, *l);
>> + address_space_memmove(ram_ptr, buf, *l);
>> invalidate_and_set_dirty(mr, mr_addr, *l);
>>
>> return MEMTX_OK;
>> @@ -3365,7 +3365,7 @@ static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
>> uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
>> false, false);
>>
>> - memcpy(buf, ram_ptr, *l);
>> + address_space_memcpy(buf, ram_ptr, *l);
>>
>> return MEMTX_OK;
>> }
>> @@ -3503,8 +3503,8 @@ MemTxResult address_space_write_rom(AddressSpace *as, hwaddr addr,
>> l = memory_access_size(mr, l, addr1);
>> } else {
>> /* ROM/RAM case */
>> - void *ram_ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
>> - memcpy(ram_ptr, buf, l);
>> + address_space_memcpy(qemu_map_ram_ptr(mr->ram_block, addr1),
>> + buf, l);
>> invalidate_and_set_dirty(mr, addr1, l);
>> }
>> len -= l;
>> --
>> 2.54.0
>
Thanks,
Gavin
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
2026-06-12 11:03 ` [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region Gavin Shan
2026-06-12 11:22 ` Michael S. Tsirkin
@ 2026-06-12 14:05 ` Philippe Mathieu-Daudé
2026-06-14 10:08 ` Gavin Shan
2026-06-12 15:29 ` Richard Henderson
2 siblings, 1 reply; 24+ messages in thread
From: Philippe Mathieu-Daudé @ 2026-06-12 14:05 UTC (permalink / raw)
To: Gavin Shan, qemu-arm
Cc: qemu-devel, peterx, mst, peter.maydell, berrange, david, alex,
clg, pbonzini, phrdina, jugraham, shan.gavin
Hi Gavin,
On 12/6/26 13:03, Gavin Shan wrote:
> All ram device regions was turned to be indirectly accessible by commit
> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> to guest hang on compiling 'cuda-samples' as reported by Julia. The guest
> is started by the following command lines, with a GH100 GPU card.
>
> host$ lspci | grep GH100
> 0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
> host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64 \
> -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T \
> -accel kvm -cpu host -smp cpus=48 -m size=8G \
> -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0 \
> -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4 \
> -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
> :
> guest$ cd cuda-samples/build
> guest$ make -j 20 clean
> guest$ make -j 20
> :
> [ 54%] Linking CUDA executable graphMemoryNodes
> [ 54%] Built target graphMemoryNodes
> <no more output afterwards, guest becomes frozen here>
>
> guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
> [ 555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
>
> When the GPU's driver (NVidia open driver) is loaded on guest bootup,
> the memory blocks residing in the PCI BAR#4 can be presented to the
> guest through memory hot-add. The page cache can be allocated from the
> hot added memory blocks when cuda-samples is being compiled. Afterwards,
> the page cache is sent to QEMU's virtio-blk device as part of the DMA
> request, the bounce buffer has to be used to accomodate the request as
> the corresponding memory region (MemoryRegion) is a RAM DEVICE region
> and indirectly accessible in qemu. However, the max bounce bufer size
> is only 4096 bytes by default. We're running out of that space quickly.
>
> QEMU
> ====
> virtio_blk_handle_output
> virtio_blk_handle_vq
> virtio_blk_get_request
> virtqueue_pop
> virtqueue_split_pop
> virtqueue_map_desc
> address_space_map
> memory_access_is_direct # Return false
> memory_region_supports_direct_access
>
> (qemu) info mtree
> memory-region: pci_bridge_pci
> 0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
> 0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
> 0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
> 0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
>
> This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> accessors to ram device memory region, preparatory work to make ram device
> region directly accessible and bypass the bounce buffer in the DMA path
> in next patch.
>
> Reported-by: Julia Graham <jugraham@redhat.com>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> Suggested-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
> hw/remote/vfio-user-obj.c | 4 ++--
> include/system/memory.h | 42 ++++++++++++++++++++++++++++++++++++++-
> system/physmem.c | 8 ++++----
> 3 files changed, 47 insertions(+), 7 deletions(-)
>
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index 87fa7b6572..fe6f661fe2 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -375,9 +375,9 @@ static int vfu_object_mr_rw(MemoryRegion *mr, uint8_t *buf, hwaddr offset,
> ram_ptr = memory_region_get_ram_ptr(mr);
>
> if (is_write) {
> - memcpy((ram_ptr + offset), buf, size);
> + address_space_memcpy(ram_ptr + offset, buf, size);
> } else {
> - memcpy(buf, (ram_ptr + offset), size);
> + address_space_memcpy(buf, ram_ptr + offset, size);
> }
>
> return 0;
> diff --git a/include/system/memory.h b/include/system/memory.h
> index 1417132f6d..6bb2e13eea 100644
> --- a/include/system/memory.h
> +++ b/include/system/memory.h
> @@ -2938,6 +2938,46 @@ static inline bool memory_access_is_direct(const MemoryRegion *mr,
> return true;
> }
>
> +static inline void address_space_memcpy(void *dest, const void *src, size_t n)
'address_space_' prefix for something that doesn't use neither
AddressSpace nor MemoryRegion is odd.
Maybe prefix 'qemu_ram_' or 'qemu_ram_ptr_' instead? (since the
address is returned by memory_region_get_ram_ptr)
Add the definitions in "system/ramblock.h" with that declaration?
> +{
> + switch (n) {
> + case 1:
> + __builtin_memcpy(dest, src, 1);
> + break;
> + case 2:
> + __builtin_memcpy(dest, src, 2);
> + break;
> + case 4:
> + __builtin_memcpy(dest, src, 4);
> + break;
> + case 8:
> + __builtin_memcpy(dest, src, 8);
> + break;
> + default:
> + __builtin_memcpy(dest, src, n);
> + }
> +}
> +
> +static inline void address_space_memmove(void *dest, const void *src, size_t n)
> +{
> + switch (n) {
> + case 1:
> + __builtin_memmove(dest, src, 1);
> + break;
> + case 2:
> + __builtin_memmove(dest, src, 2);
> + break;
> + case 4:
> + __builtin_memmove(dest, src, 4);
> + break;
> + case 8:
> + __builtin_memmove(dest, src, 8);
> + break;
> + default:
> + __builtin_memmove(dest, src, n);
> + }
> +}
> +
> /**
> * address_space_read: read from an address space.
> *
> @@ -2970,7 +3010,7 @@ MemTxResult address_space_read(AddressSpace *as, hwaddr addr,
> mr = flatview_translate(fv, addr, &addr1, &l, false, attrs);
> if (len == l && memory_access_is_direct(mr, false, attrs)) {
> ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
> - memcpy(buf, ptr, len);
> + __builtin_memcpy(buf, ptr, len);
> } else {
> result = flatview_read_continue(fv, addr, attrs, buf, len,
> addr1, l, mr);
> diff --git a/system/physmem.c b/system/physmem.c
> index 7bcbf87573..5f46a9d676 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -3272,7 +3272,7 @@ static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
> uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
> false, true);
>
> - memmove(ram_ptr, buf, *l);
> + address_space_memmove(ram_ptr, buf, *l);
> invalidate_and_set_dirty(mr, mr_addr, *l);
>
> return MEMTX_OK;
> @@ -3365,7 +3365,7 @@ static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
> uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
> false, false);
>
> - memcpy(buf, ram_ptr, *l);
> + address_space_memcpy(buf, ram_ptr, *l);
>
> return MEMTX_OK;
> }
> @@ -3503,8 +3503,8 @@ MemTxResult address_space_write_rom(AddressSpace *as, hwaddr addr,
> l = memory_access_size(mr, l, addr1);
> } else {
> /* ROM/RAM case */
> - void *ram_ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
> - memcpy(ram_ptr, buf, l);
> + address_space_memcpy(qemu_map_ram_ptr(mr->ram_block, addr1),
> + buf, l);
> invalidate_and_set_dirty(mr, addr1, l);
> }
> len -= l;
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
2026-06-12 14:05 ` Philippe Mathieu-Daudé
@ 2026-06-14 10:08 ` Gavin Shan
0 siblings, 0 replies; 24+ messages in thread
From: Gavin Shan @ 2026-06-14 10:08 UTC (permalink / raw)
To: Philippe Mathieu-Daudé, qemu-arm
Cc: qemu-devel, peterx, mst, peter.maydell, berrange, david, alex,
clg, pbonzini, phrdina, jugraham, shan.gavin
Hi Philippe,
On 6/13/26 12:05 AM, Philippe Mathieu-Daudé wrote:
> Hi Gavin,
>
> On 12/6/26 13:03, Gavin Shan wrote:
>> All ram device regions was turned to be indirectly accessible by commit
>> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
>> to guest hang on compiling 'cuda-samples' as reported by Julia. The guest
>> is started by the following command lines, with a GH100 GPU card.
>>
>> host$ lspci | grep GH100
>> 0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
>> host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64 \
>> -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T \
>> -accel kvm -cpu host -smp cpus=48 -m size=8G \
>> -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0 \
>> -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4 \
>> -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
>> :
>> guest$ cd cuda-samples/build
>> guest$ make -j 20 clean
>> guest$ make -j 20
>> :
>> [ 54%] Linking CUDA executable graphMemoryNodes
>> [ 54%] Built target graphMemoryNodes
>> <no more output afterwards, guest becomes frozen here>
>>
>> guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
>> [ 555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
>>
>> When the GPU's driver (NVidia open driver) is loaded on guest bootup,
>> the memory blocks residing in the PCI BAR#4 can be presented to the
>> guest through memory hot-add. The page cache can be allocated from the
>> hot added memory blocks when cuda-samples is being compiled. Afterwards,
>> the page cache is sent to QEMU's virtio-blk device as part of the DMA
>> request, the bounce buffer has to be used to accomodate the request as
>> the corresponding memory region (MemoryRegion) is a RAM DEVICE region
>> and indirectly accessible in qemu. However, the max bounce bufer size
>> is only 4096 bytes by default. We're running out of that space quickly.
>>
>> QEMU
>> ====
>> virtio_blk_handle_output
>> virtio_blk_handle_vq
>> virtio_blk_get_request
>> virtqueue_pop
>> virtqueue_split_pop
>> virtqueue_map_desc
>> address_space_map
>> memory_access_is_direct # Return false
>> memory_region_supports_direct_access
>>
>> (qemu) info mtree
>> memory-region: pci_bridge_pci
>> 0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
>> 0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
>> 0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
>> 0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
>>
>> This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
>> accessors to ram device memory region, preparatory work to make ram device
>> region directly accessible and bypass the bounce buffer in the DMA path
>> in next patch.
>>
>> Reported-by: Julia Graham <jugraham@redhat.com>
>> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
>> Suggested-by: Peter Xu <peterx@redhat.com>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>> hw/remote/vfio-user-obj.c | 4 ++--
>> include/system/memory.h | 42 ++++++++++++++++++++++++++++++++++++++-
>> system/physmem.c | 8 ++++----
>> 3 files changed, 47 insertions(+), 7 deletions(-)
>>
>> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
>> index 87fa7b6572..fe6f661fe2 100644
>> --- a/hw/remote/vfio-user-obj.c
>> +++ b/hw/remote/vfio-user-obj.c
>> @@ -375,9 +375,9 @@ static int vfu_object_mr_rw(MemoryRegion *mr, uint8_t *buf, hwaddr offset,
>> ram_ptr = memory_region_get_ram_ptr(mr);
>> if (is_write) {
>> - memcpy((ram_ptr + offset), buf, size);
>> + address_space_memcpy(ram_ptr + offset, buf, size);
>> } else {
>> - memcpy(buf, (ram_ptr + offset), size);
>> + address_space_memcpy(buf, ram_ptr + offset, size);
>> }
>> return 0;
>> diff --git a/include/system/memory.h b/include/system/memory.h
>> index 1417132f6d..6bb2e13eea 100644
>> --- a/include/system/memory.h
>> +++ b/include/system/memory.h
>> @@ -2938,6 +2938,46 @@ static inline bool memory_access_is_direct(const MemoryRegion *mr,
>> return true;
>> }
>> +static inline void address_space_memcpy(void *dest, const void *src, size_t n)
>
> 'address_space_' prefix for something that doesn't use neither
> AddressSpace nor MemoryRegion is odd.
>
> Maybe prefix 'qemu_ram_' or 'qemu_ram_ptr_' instead? (since the
> address is returned by memory_region_get_ram_ptr)
> Add the definitions in "system/ramblock.h" with that declaration?
>
Yes, they have been renamed to qemu_ram_{copy, move}() and moved to
system/ramblock.h in local repo. The changes will be seen in the next
revision.
>> +{
>> + switch (n) {
>> + case 1:
>> + __builtin_memcpy(dest, src, 1);
>> + break;
>> + case 2:
>> + __builtin_memcpy(dest, src, 2);
>> + break;
>> + case 4:
>> + __builtin_memcpy(dest, src, 4);
>> + break;
>> + case 8:
>> + __builtin_memcpy(dest, src, 8);
>> + break;
>> + default:
>> + __builtin_memcpy(dest, src, n);
>> + }
>> +}
>> +
>> +static inline void address_space_memmove(void *dest, const void *src, size_t n)
>> +{
>> + switch (n) {
>> + case 1:
>> + __builtin_memmove(dest, src, 1);
>> + break;
>> + case 2:
>> + __builtin_memmove(dest, src, 2);
>> + break;
>> + case 4:
>> + __builtin_memmove(dest, src, 4);
>> + break;
>> + case 8:
>> + __builtin_memmove(dest, src, 8);
>> + break;
>> + default:
>> + __builtin_memmove(dest, src, n);
>> + }
>> +}
>> +
>> /**
>> * address_space_read: read from an address space.
>> *
>> @@ -2970,7 +3010,7 @@ MemTxResult address_space_read(AddressSpace *as, hwaddr addr,
>> mr = flatview_translate(fv, addr, &addr1, &l, false, attrs);
>> if (len == l && memory_access_is_direct(mr, false, attrs)) {
>> ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
>> - memcpy(buf, ptr, len);
>> + __builtin_memcpy(buf, ptr, len);
>> } else {
>> result = flatview_read_continue(fv, addr, attrs, buf, len,
>> addr1, l, mr);
>> diff --git a/system/physmem.c b/system/physmem.c
>> index 7bcbf87573..5f46a9d676 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -3272,7 +3272,7 @@ static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
>> uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
>> false, true);
>> - memmove(ram_ptr, buf, *l);
>> + address_space_memmove(ram_ptr, buf, *l);
>> invalidate_and_set_dirty(mr, mr_addr, *l);
>> return MEMTX_OK;
>> @@ -3365,7 +3365,7 @@ static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
>> uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
>> false, false);
>> - memcpy(buf, ram_ptr, *l);
>> + address_space_memcpy(buf, ram_ptr, *l);
>> return MEMTX_OK;
>> }
>> @@ -3503,8 +3503,8 @@ MemTxResult address_space_write_rom(AddressSpace *as, hwaddr addr,
>> l = memory_access_size(mr, l, addr1);
>> } else {
>> /* ROM/RAM case */
>> - void *ram_ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
>> - memcpy(ram_ptr, buf, l);
>> + address_space_memcpy(qemu_map_ram_ptr(mr->ram_block, addr1),
>> + buf, l);
>> invalidate_and_set_dirty(mr, addr1, l);
>> }
>> len -= l;
>
Thanks,
Gavin
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
2026-06-12 11:03 ` [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region Gavin Shan
2026-06-12 11:22 ` Michael S. Tsirkin
2026-06-12 14:05 ` Philippe Mathieu-Daudé
@ 2026-06-12 15:29 ` Richard Henderson
2026-06-12 16:36 ` Peter Maydell
2026-06-14 15:42 ` Michael S. Tsirkin
2 siblings, 2 replies; 24+ messages in thread
From: Richard Henderson @ 2026-06-12 15:29 UTC (permalink / raw)
To: Gavin Shan, qemu-arm
Cc: qemu-devel, peterx, mst, peter.maydell, berrange, david, alex,
clg, pbonzini, philmd, phrdina, jugraham, shan.gavin
On 6/12/26 04:03, Gavin Shan wrote:
> All ram device regions was turned to be indirectly accessible by commit
> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> to guest hang on compiling 'cuda-samples' as reported by Julia. The guest
> is started by the following command lines, with a GH100 GPU card.
>
> host$ lspci | grep GH100
> 0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
> host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64 \
> -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T \
> -accel kvm -cpu host -smp cpus=48 -m size=8G \
> -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0 \
> -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4 \
> -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
> :
> guest$ cd cuda-samples/build
> guest$ make -j 20 clean
> guest$ make -j 20
> :
> [ 54%] Linking CUDA executable graphMemoryNodes
> [ 54%] Built target graphMemoryNodes
> <no more output afterwards, guest becomes frozen here>
>
> guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
> [ 555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
>
> When the GPU's driver (NVidia open driver) is loaded on guest bootup,
> the memory blocks residing in the PCI BAR#4 can be presented to the
> guest through memory hot-add. The page cache can be allocated from the
> hot added memory blocks when cuda-samples is being compiled. Afterwards,
> the page cache is sent to QEMU's virtio-blk device as part of the DMA
> request, the bounce buffer has to be used to accomodate the request as
> the corresponding memory region (MemoryRegion) is a RAM DEVICE region
> and indirectly accessible in qemu. However, the max bounce bufer size
> is only 4096 bytes by default. We're running out of that space quickly.
>
> QEMU
> ====
> virtio_blk_handle_output
> virtio_blk_handle_vq
> virtio_blk_get_request
> virtqueue_pop
> virtqueue_split_pop
> virtqueue_map_desc
> address_space_map
> memory_access_is_direct # Return false
> memory_region_supports_direct_access
>
> (qemu) info mtree
> memory-region: pci_bridge_pci
> 0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
> 0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
> 0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
> 0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
>
> This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> accessors to ram device memory region, preparatory work to make ram device
> region directly accessible and bypass the bounce buffer in the DMA path
> in next patch.
memcpy/memmove *always* compile to __builtin_memcpy/memmove, and the compiler later
decides whether or not to expand inline.
So, this doesn't do what you think it does.
My real question is: what are you attempting to achieve?
(1) is the problem unaligned access to a mapped physical device?
(2) is the problem vector access to a mapped physical device?
(3) something else?
r~
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
2026-06-12 15:29 ` Richard Henderson
@ 2026-06-12 16:36 ` Peter Maydell
2026-06-12 17:25 ` Richard Henderson
2026-06-14 15:42 ` Michael S. Tsirkin
1 sibling, 1 reply; 24+ messages in thread
From: Peter Maydell @ 2026-06-12 16:36 UTC (permalink / raw)
To: Richard Henderson
Cc: Gavin Shan, qemu-arm, qemu-devel, peterx, mst, berrange, david,
alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin
On Fri, 12 Jun 2026 at 16:29, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 6/12/26 04:03, Gavin Shan wrote:
> > This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> > accessors to ram device memory region, preparatory work to make ram device
> > region directly accessible and bypass the bounce buffer in the DMA path
> > in next patch.
>
> memcpy/memmove *always* compile to __builtin_memcpy/memmove, and the compiler later
> decides whether or not to expand inline.
Yes, but if you pass it a fixed small integer, then it is likely
to expand it inline, whereas if you pass it a variable then it
is likely not to... The patch is attempting to persuade the
compiler to definitely do an inline access for 1, 2, 4, 8
byte access.
> So, this doesn't do what you think it does.
> My real question is: what are you attempting to achieve?
>
> (1) is the problem unaligned access to a mapped physical device?
> (2) is the problem vector access to a mapped physical device?
> (3) something else?
I think there are two problems we're trying to fix here:
(1) If a device does e.g. a pci_dma_write() with size 1, we want
this to turn into exactly 1 byte write into guest memory, for the
normal case where the guest memory is real host RAM.
This deals with the e1000 bug where the pci_dma_write() turns into
a call to glibc memmove() with size 1 and glibc's implementation
turns that into 3 writes of the byte to the same address,
and then a guest write/read might interleave badly with
the extra writes.[*] Similarly some devices specify that they
do definitely do DMA accesses in a particular way (e.g. the
Arm SMMU spec says that the SMMU reads 64-bit page table entries
as single-copy atomic accesses).
(2) If a vCPU (emulated or KVM) does a 4 byte access to an
address which is in the PCI BAR of a vfio-passthrough device,
we want this to turn into exactly a 4-byte access to the
mmap()ed memory which is the BAR of the host device. This
is because that address might be a real hardware device
register, so it's important to access it exactly the way
the guest vCPU asked for, and not do multiple accesses or
misaligned accesses or whatever.
(The reason we make pass-through BARs not "direct access"
but instead go via the bounce buffer is that we were working
around (2).)
[*] I have not fully investigated the e1000 situation so it's
possible that there is some other issue also there, e.g.
guest or device model getting barrier semantics wrong. But
writing the byte with the "done with this descriptor" bit
3 times rather than once seems like asking for trouble.
thanks
-- PMM
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
2026-06-12 16:36 ` Peter Maydell
@ 2026-06-12 17:25 ` Richard Henderson
2026-06-12 17:57 ` Peter Maydell
2026-06-14 15:13 ` Michael S. Tsirkin
0 siblings, 2 replies; 24+ messages in thread
From: Richard Henderson @ 2026-06-12 17:25 UTC (permalink / raw)
To: Peter Maydell
Cc: Gavin Shan, qemu-arm, qemu-devel, peterx, mst, berrange, david,
alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin
On 6/12/26 09:36, Peter Maydell wrote:
> On Fri, 12 Jun 2026 at 16:29, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> On 6/12/26 04:03, Gavin Shan wrote:
>>> This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
>>> accessors to ram device memory region, preparatory work to make ram device
>>> region directly accessible and bypass the bounce buffer in the DMA path
>>> in next patch.
>>
>> memcpy/memmove *always* compile to __builtin_memcpy/memmove, and the compiler later
>> decides whether or not to expand inline.
>
> Yes, but if you pass it a fixed small integer, then it is likely
> to expand it inline, whereas if you pass it a variable then it
> is likely not to... The patch is attempting to persuade the
> compiler to definitely do an inline access for 1, 2, 4, 8
> byte access.
Sure, for hosts with unaligned accesses. We still have sparc64 and (some?) riscv64 that
don't automatically have such and will compile to more than one host instruction.
>> My real question is: what are you attempting to achieve?
>>
>> (1) is the problem unaligned access to a mapped physical device?
>> (2) is the problem vector access to a mapped physical device?
>> (3) something else?
>
> I think there are two problems we're trying to fix here:
>
> (1) If a device does e.g. a pci_dma_write() with size 1, we want
> this to turn into exactly 1 byte write into guest memory, for the
> normal case where the guest memory is real host RAM.
> This deals with the e1000 bug where the pci_dma_write() turns into
> a call to glibc memmove() with size 1 and glibc's implementation
> turns that into 3 writes of the byte to the same address...
Gotcha. Easily handled by not using memcpy/memmove at all.
*(char *)ptr = val;
is sufficient for all hosts.
> (2) If a vCPU (emulated or KVM) does a 4 byte access to an
> address which is in the PCI BAR of a vfio-passthrough device,
> we want this to turn into exactly a 4-byte access to the
> mmap()ed memory which is the BAR of the host device. This
> is because that address might be a real hardware device
> register, so it's important to access it exactly the way
> the guest vCPU asked for, and not do multiple accesses or
> misaligned accesses or whatever.
>
> (The reason we make pass-through BARs not "direct access"
> but instead go via the bounce buffer is that we were working
> around (2).)
That's going to require qatomic and alignment checks.
That's also going to beg the question of the intended behaviour anything that isn't a
naturally aligned access of size in {1,2,4,8}: fall back to N operations of the minimum of
alignment and size?
r~
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
2026-06-12 17:25 ` Richard Henderson
@ 2026-06-12 17:57 ` Peter Maydell
2026-06-14 15:13 ` Michael S. Tsirkin
1 sibling, 0 replies; 24+ messages in thread
From: Peter Maydell @ 2026-06-12 17:57 UTC (permalink / raw)
To: Richard Henderson
Cc: Gavin Shan, qemu-arm, qemu-devel, peterx, mst, berrange, david,
alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin
On Fri, 12 Jun 2026 at 18:25, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 6/12/26 09:36, Peter Maydell wrote:
> > On Fri, 12 Jun 2026 at 16:29, Richard Henderson
> > <richard.henderson@linaro.org> wrote:
> >>
> >> On 6/12/26 04:03, Gavin Shan wrote:
> >>> This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> >>> accessors to ram device memory region, preparatory work to make ram device
> >>> region directly accessible and bypass the bounce buffer in the DMA path
> >>> in next patch.
> >>
> >> memcpy/memmove *always* compile to __builtin_memcpy/memmove, and the compiler later
> >> decides whether or not to expand inline.
> >
> > Yes, but if you pass it a fixed small integer, then it is likely
> > to expand it inline, whereas if you pass it a variable then it
> > is likely not to... The patch is attempting to persuade the
> > compiler to definitely do an inline access for 1, 2, 4, 8
> > byte access.
>
> Sure, for hosts with unaligned accesses. We still have sparc64 and (some?) riscv64 that
> don't automatically have such and will compile to more than one host instruction.
Good point.
> >> My real question is: what are you attempting to achieve?
> >>
> >> (1) is the problem unaligned access to a mapped physical device?
> >> (2) is the problem vector access to a mapped physical device?
> >> (3) something else?
> >
> > I think there are two problems we're trying to fix here:
> >
> > (1) If a device does e.g. a pci_dma_write() with size 1, we want
> > this to turn into exactly 1 byte write into guest memory, for the
> > normal case where the guest memory is real host RAM.
> > This deals with the e1000 bug where the pci_dma_write() turns into
> > a call to glibc memmove() with size 1 and glibc's implementation
> > turns that into 3 writes of the byte to the same address...
>
> Gotcha. Easily handled by not using memcpy/memmove at all.
>
> *(char *)ptr = val;
>
> is sufficient for all hosts.
That helps e1000 because it happens to be doing byte writes,
but the same effect probably occurs for word writes in other
devices...
> > (2) If a vCPU (emulated or KVM) does a 4 byte access to an
> > address which is in the PCI BAR of a vfio-passthrough device,
> > we want this to turn into exactly a 4-byte access to the
> > mmap()ed memory which is the BAR of the host device. This
> > is because that address might be a real hardware device
> > register, so it's important to access it exactly the way
> > the guest vCPU asked for, and not do multiple accesses or
> > misaligned accesses or whatever.
> >
> > (The reason we make pass-through BARs not "direct access"
> > but instead go via the bounce buffer is that we were working
> > around (2).)
>
> That's going to require qatomic and alignment checks.
>
> That's also going to beg the question of the intended behaviour
> anything that isn't a naturally aligned access of size in {1,2,4,8}:
>fall back to N operations of the minimum of alignment and size?
I think in practice we can probably say we don't care and
fall back to memmove: anything unaligned here is almost
certainly going to real RAM and not going to have any
particular ordering requirements (and/or might be caused
by the guest misprogramming a device DMA address).
I feel like we handle "unaligned access, maybe need to split it"
already in multiple parts of QEMU (for TCG accesses we do that
in accel/tcg; system/memory.c has a TODO comment about unaligned
accesses at the IO MemoryRegion read/writefn level and there's
a patchseries that tries to fix that), so if we find ourselves
really needing to do "handle this unaligned access by splitting
it into pieces" again here we ought to take a wider look at
whether we can consolidate that at all...
thanks
-- PMM
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
2026-06-12 17:25 ` Richard Henderson
2026-06-12 17:57 ` Peter Maydell
@ 2026-06-14 15:13 ` Michael S. Tsirkin
2026-06-14 16:01 ` Peter Maydell
2026-06-14 16:45 ` Richard Henderson
1 sibling, 2 replies; 24+ messages in thread
From: Michael S. Tsirkin @ 2026-06-14 15:13 UTC (permalink / raw)
To: Richard Henderson
Cc: Peter Maydell, Gavin Shan, qemu-arm, qemu-devel, peterx, berrange,
david, alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin
On Fri, Jun 12, 2026 at 10:25:35AM -0700, Richard Henderson wrote:
> On 6/12/26 09:36, Peter Maydell wrote:
> > On Fri, 12 Jun 2026 at 16:29, Richard Henderson
> > <richard.henderson@linaro.org> wrote:
> > >
> > > On 6/12/26 04:03, Gavin Shan wrote:
> > > > This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> > > > accessors to ram device memory region, preparatory work to make ram device
> > > > region directly accessible and bypass the bounce buffer in the DMA path
> > > > in next patch.
> > >
> > > memcpy/memmove *always* compile to __builtin_memcpy/memmove, and the compiler later
> > > decides whether or not to expand inline.
> >
> > Yes, but if you pass it a fixed small integer, then it is likely
> > to expand it inline, whereas if you pass it a variable then it
> > is likely not to... The patch is attempting to persuade the
> > compiler to definitely do an inline access for 1, 2, 4, 8
> > byte access.
>
> Sure, for hosts with unaligned accesses. We still have sparc64 and (some?)
> riscv64 that don't automatically have such and will compile to more than one
> host instruction.
>
> > > My real question is: what are you attempting to achieve?
> > >
> > > (1) is the problem unaligned access to a mapped physical device?
> > > (2) is the problem vector access to a mapped physical device?
> > > (3) something else?
> >
> > I think there are two problems we're trying to fix here:
> >
> > (1) If a device does e.g. a pci_dma_write() with size 1, we want
> > this to turn into exactly 1 byte write into guest memory, for the
> > normal case where the guest memory is real host RAM.
> > This deals with the e1000 bug where the pci_dma_write() turns into
> > a call to glibc memmove() with size 1 and glibc's implementation
> > turns that into 3 writes of the byte to the same address...
>
> Gotcha. Easily handled by not using memcpy/memmove at all.
>
> *(char *)ptr = val;
>
> is sufficient for all hosts.
Yes, I think it does work because we use -fno-strict-aliasing.
For bigger sizes we'll need packed because the addresses
could be unaligned.
But again, qemu simply already relies on this in bswap.h
I kind of dislike muddying the waters by making several
unrelated changes here. If we do we should change bwap too.
> > (2) If a vCPU (emulated or KVM) does a 4 byte access to an
> > address which is in the PCI BAR of a vfio-passthrough device,
> > we want this to turn into exactly a 4-byte access to the
> > mmap()ed memory which is the BAR of the host device. This
> > is because that address might be a real hardware device
> > register, so it's important to access it exactly the way
> > the guest vCPU asked for, and not do multiple accesses or
> > misaligned accesses or whatever.
> >
> > (The reason we make pass-through BARs not "direct access"
> > but instead go via the bounce buffer is that we were working
> > around (2).)
>
> That's going to require qatomic and alignment checks.
>
> That's also going to beg the question of the intended behaviour anything
> that isn't a naturally aligned access of size in {1,2,4,8}: fall back to N
> operations of the minimum of alignment and size?
For most host/guest pairs things simply work even for unaligned.
And yes, guest drivers do do this.
On classical pci, there are no transactions as such and
an unaligned access will be split anyway.
>
>
> r~
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
2026-06-14 15:13 ` Michael S. Tsirkin
@ 2026-06-14 16:01 ` Peter Maydell
2026-06-14 16:06 ` Michael S. Tsirkin
2026-06-14 16:17 ` Michael S. Tsirkin
2026-06-14 16:45 ` Richard Henderson
1 sibling, 2 replies; 24+ messages in thread
From: Peter Maydell @ 2026-06-14 16:01 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Richard Henderson, Gavin Shan, qemu-arm, qemu-devel, peterx,
berrange, david, alex, clg, pbonzini, philmd, phrdina, jugraham,
shan.gavin
On Sun, 14 Jun 2026 at 16:13, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Fri, Jun 12, 2026 at 10:25:35AM -0700, Richard Henderson wrote:
> > On 6/12/26 09:36, Peter Maydell wrote:
> > > On Fri, 12 Jun 2026 at 16:29, Richard Henderson
> > > <richard.henderson@linaro.org> wrote:
> > > >
> > > > On 6/12/26 04:03, Gavin Shan wrote:
> > > > > This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> > > > > accessors to ram device memory region, preparatory work to make ram device
> > > > > region directly accessible and bypass the bounce buffer in the DMA path
> > > > > in next patch.
> > > >
> > > > memcpy/memmove *always* compile to __builtin_memcpy/memmove, and the compiler later
> > > > decides whether or not to expand inline.
> > >
> > > Yes, but if you pass it a fixed small integer, then it is likely
> > > to expand it inline, whereas if you pass it a variable then it
> > > is likely not to... The patch is attempting to persuade the
> > > compiler to definitely do an inline access for 1, 2, 4, 8
> > > byte access.
> >
> > Sure, for hosts with unaligned accesses. We still have sparc64 and (some?)
> > riscv64 that don't automatically have such and will compile to more than one
> > host instruction.
> >
> > > > My real question is: what are you attempting to achieve?
> > > >
> > > > (1) is the problem unaligned access to a mapped physical device?
> > > > (2) is the problem vector access to a mapped physical device?
> > > > (3) something else?
> > >
> > > I think there are two problems we're trying to fix here:
> > >
> > > (1) If a device does e.g. a pci_dma_write() with size 1, we want
> > > this to turn into exactly 1 byte write into guest memory, for the
> > > normal case where the guest memory is real host RAM.
> > > This deals with the e1000 bug where the pci_dma_write() turns into
> > > a call to glibc memmove() with size 1 and glibc's implementation
> > > turns that into 3 writes of the byte to the same address...
> >
> > Gotcha. Easily handled by not using memcpy/memmove at all.
> >
> > *(char *)ptr = val;
> >
> > is sufficient for all hosts.
>
> Yes, I think it does work because we use -fno-strict-aliasing.
> For bigger sizes we'll need packed because the addresses
> could be unaligned.
IIRC "packed" will cause architectures that can't do
unaligned word accesses to emit code to do byte accesses,
so you don't want that. You need to explicitly check alignment,
I think.
> But again, qemu simply already relies on this in bswap.h
>
> I kind of dislike muddying the waters by making several
> unrelated changes here. If we do we should change bwap too.
The ldl_p etc functions in bswap.h provide different semantics:
ldl_p() is "do a load of a 32-bit quantity, even if the
address is not 4-aligned". We don't care if the compiler
or the memcpy ends up doing that as 4 byte accesses, which
on some hosts it must do.
thanks
-- PMM
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
2026-06-14 16:01 ` Peter Maydell
@ 2026-06-14 16:06 ` Michael S. Tsirkin
2026-06-14 16:17 ` Michael S. Tsirkin
1 sibling, 0 replies; 24+ messages in thread
From: Michael S. Tsirkin @ 2026-06-14 16:06 UTC (permalink / raw)
To: Peter Maydell
Cc: Richard Henderson, Gavin Shan, qemu-arm, qemu-devel, peterx,
berrange, david, alex, clg, pbonzini, philmd, phrdina, jugraham,
shan.gavin
On Sun, Jun 14, 2026 at 05:01:34PM +0100, Peter Maydell wrote:
> On Sun, 14 Jun 2026 at 16:13, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Fri, Jun 12, 2026 at 10:25:35AM -0700, Richard Henderson wrote:
> > > On 6/12/26 09:36, Peter Maydell wrote:
> > > > On Fri, 12 Jun 2026 at 16:29, Richard Henderson
> > > > <richard.henderson@linaro.org> wrote:
> > > > >
> > > > > On 6/12/26 04:03, Gavin Shan wrote:
> > > > > > This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> > > > > > accessors to ram device memory region, preparatory work to make ram device
> > > > > > region directly accessible and bypass the bounce buffer in the DMA path
> > > > > > in next patch.
> > > > >
> > > > > memcpy/memmove *always* compile to __builtin_memcpy/memmove, and the compiler later
> > > > > decides whether or not to expand inline.
> > > >
> > > > Yes, but if you pass it a fixed small integer, then it is likely
> > > > to expand it inline, whereas if you pass it a variable then it
> > > > is likely not to... The patch is attempting to persuade the
> > > > compiler to definitely do an inline access for 1, 2, 4, 8
> > > > byte access.
> > >
> > > Sure, for hosts with unaligned accesses. We still have sparc64 and (some?)
> > > riscv64 that don't automatically have such and will compile to more than one
> > > host instruction.
> > >
> > > > > My real question is: what are you attempting to achieve?
> > > > >
> > > > > (1) is the problem unaligned access to a mapped physical device?
> > > > > (2) is the problem vector access to a mapped physical device?
> > > > > (3) something else?
> > > >
> > > > I think there are two problems we're trying to fix here:
> > > >
> > > > (1) If a device does e.g. a pci_dma_write() with size 1, we want
> > > > this to turn into exactly 1 byte write into guest memory, for the
> > > > normal case where the guest memory is real host RAM.
> > > > This deals with the e1000 bug where the pci_dma_write() turns into
> > > > a call to glibc memmove() with size 1 and glibc's implementation
> > > > turns that into 3 writes of the byte to the same address...
> > >
> > > Gotcha. Easily handled by not using memcpy/memmove at all.
> > >
> > > *(char *)ptr = val;
> > >
> > > is sufficient for all hosts.
> >
> > Yes, I think it does work because we use -fno-strict-aliasing.
> > For bigger sizes we'll need packed because the addresses
> > could be unaligned.
>
> IIRC "packed" will cause architectures that can't do
> unaligned word accesses to emit code to do byte accesses,
> so you don't want that.
I mean, what's left to do if guest wants an unaligned access? Crash?
On classical PCI, unaligned accesses are split up by the bus, so
there's a decent chance most devices will be ok if qemu does it.
> You need to explicitly check alignment,
> I think.
>
> > But again, qemu simply already relies on this in bswap.h
> >
> > I kind of dislike muddying the waters by making several
> > unrelated changes here. If we do we should change bwap too.
>
> The ldl_p etc functions in bswap.h provide different semantics:
> ldl_p() is "do a load of a 32-bit quantity, even if the
> address is not 4-aligned". We don't care if the compiler
> or the memcpy ends up doing that as 4 byte accesses, which
> on some hosts it must do.
>
> thanks
> -- PMM
I mean the ram device ops trick is using bounce buffers precisely
because it happens to give you a single load.
And yes on hosts where it does not work, these devices
won't work, and qemu can't fix that.
--
MST
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
2026-06-14 16:01 ` Peter Maydell
2026-06-14 16:06 ` Michael S. Tsirkin
@ 2026-06-14 16:17 ` Michael S. Tsirkin
1 sibling, 0 replies; 24+ messages in thread
From: Michael S. Tsirkin @ 2026-06-14 16:17 UTC (permalink / raw)
To: Peter Maydell
Cc: Richard Henderson, Gavin Shan, qemu-arm, qemu-devel, peterx,
berrange, david, alex, clg, pbonzini, philmd, phrdina, jugraham,
shan.gavin
On Sun, Jun 14, 2026 at 05:01:34PM +0100, Peter Maydell wrote:
> On Sun, 14 Jun 2026 at 16:13, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Fri, Jun 12, 2026 at 10:25:35AM -0700, Richard Henderson wrote:
> > > On 6/12/26 09:36, Peter Maydell wrote:
> > > > On Fri, 12 Jun 2026 at 16:29, Richard Henderson
> > > > <richard.henderson@linaro.org> wrote:
> > > > >
> > > > > On 6/12/26 04:03, Gavin Shan wrote:
> > > > > > This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> > > > > > accessors to ram device memory region, preparatory work to make ram device
> > > > > > region directly accessible and bypass the bounce buffer in the DMA path
> > > > > > in next patch.
> > > > >
> > > > > memcpy/memmove *always* compile to __builtin_memcpy/memmove, and the compiler later
> > > > > decides whether or not to expand inline.
> > > >
> > > > Yes, but if you pass it a fixed small integer, then it is likely
> > > > to expand it inline, whereas if you pass it a variable then it
> > > > is likely not to... The patch is attempting to persuade the
> > > > compiler to definitely do an inline access for 1, 2, 4, 8
> > > > byte access.
> > >
> > > Sure, for hosts with unaligned accesses. We still have sparc64 and (some?)
> > > riscv64 that don't automatically have such and will compile to more than one
> > > host instruction.
> > >
> > > > > My real question is: what are you attempting to achieve?
> > > > >
> > > > > (1) is the problem unaligned access to a mapped physical device?
> > > > > (2) is the problem vector access to a mapped physical device?
> > > > > (3) something else?
> > > >
> > > > I think there are two problems we're trying to fix here:
> > > >
> > > > (1) If a device does e.g. a pci_dma_write() with size 1, we want
> > > > this to turn into exactly 1 byte write into guest memory, for the
> > > > normal case where the guest memory is real host RAM.
> > > > This deals with the e1000 bug where the pci_dma_write() turns into
> > > > a call to glibc memmove() with size 1 and glibc's implementation
> > > > turns that into 3 writes of the byte to the same address...
> > >
> > > Gotcha. Easily handled by not using memcpy/memmove at all.
> > >
> > > *(char *)ptr = val;
> > >
> > > is sufficient for all hosts.
> >
> > Yes, I think it does work because we use -fno-strict-aliasing.
> > For bigger sizes we'll need packed because the addresses
> > could be unaligned.
>
> IIRC "packed" will cause architectures that can't do
> unaligned word accesses to emit code to do byte accesses,
> so you don't want that.
I checked arm32 and while it does that, so does memcpy.
We'd have to explicitly code up aligned/unaligned usecases.
But do we really care about device assignment on those
hosts?
Maybe, the thing to do is just to ignore the issue.
> You need to explicitly check alignment,
> I think.
>
> > But again, qemu simply already relies on this in bswap.h
> >
> > I kind of dislike muddying the waters by making several
> > unrelated changes here. If we do we should change bwap too.
>
> The ldl_p etc functions in bswap.h provide different semantics:
> ldl_p() is "do a load of a 32-bit quantity, even if the
> address is not 4-aligned". We don't care if the compiler
> or the memcpy ends up doing that as 4 byte accesses, which
> on some hosts it must do.
>
> thanks
> -- PMM
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
2026-06-14 15:13 ` Michael S. Tsirkin
2026-06-14 16:01 ` Peter Maydell
@ 2026-06-14 16:45 ` Richard Henderson
2026-06-14 17:22 ` Michael S. Tsirkin
1 sibling, 1 reply; 24+ messages in thread
From: Richard Henderson @ 2026-06-14 16:45 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Peter Maydell, Gavin Shan, qemu-arm, qemu-devel, peterx, berrange,
david, alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin
On 6/14/26 08:13, Michael S. Tsirkin wrote:
> Yes, I think it does work because we use -fno-strict-aliasing.
> For bigger sizes we'll need packed because the addresses
> could be unaligned.
...
> For most host/guest pairs things simply work even for unaligned.
>
> And yes, guest drivers do do this.
>
> On classical pci, there are no transactions as such and
> an unaligned access will be split anyway.
I'm saying, if you're talking about pass-through to real devices, that won't work. For
instance, AArch64 will trap unaligned accesses to Device memory.
You need to actually handle unaligned. Perhaps something like
/* Find unit to fit size and alignment of dst */
uintptr_t test = (uintptr_t)dst | size;
uintptr_t lsb = test & -test;
switch (lsb) {
case 1: // loop over uint8_t
case 2: // loop over uint16_t
case 4: // loop over uint32_t
default: // loop over uint64_t
}
with the expectation that normally we'll have aligned addresses and size such that the
loop will iterate once.
r~
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
2026-06-14 16:45 ` Richard Henderson
@ 2026-06-14 17:22 ` Michael S. Tsirkin
2026-06-15 1:17 ` Gavin Shan
2026-06-15 1:41 ` Richard Henderson
0 siblings, 2 replies; 24+ messages in thread
From: Michael S. Tsirkin @ 2026-06-14 17:22 UTC (permalink / raw)
To: Richard Henderson
Cc: Peter Maydell, Gavin Shan, qemu-arm, qemu-devel, peterx, berrange,
david, alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin
On Sun, Jun 14, 2026 at 09:45:51AM -0700, Richard Henderson wrote:
> On 6/14/26 08:13, Michael S. Tsirkin wrote:
> > Yes, I think it does work because we use -fno-strict-aliasing.
> > For bigger sizes we'll need packed because the addresses
> > could be unaligned.
> ...
> > For most host/guest pairs things simply work even for unaligned.
> >
> > And yes, guest drivers do do this.
> >
> > On classical pci, there are no transactions as such and
> > an unaligned access will be split anyway.
>
> I'm saying, if you're talking about pass-through to real devices, that won't
> work. For instance, AArch64 will trap unaligned accesses to Device memory.
Presumably, AArch64 drivers don't do unaligned at all then?
> You need to actually handle unaligned. Perhaps something like
>
> /* Find unit to fit size and alignment of dst */
> uintptr_t test = (uintptr_t)dst | size;
> uintptr_t lsb = test & -test;
>
> switch (lsb) {
> case 1: // loop over uint8_t
> case 2: // loop over uint16_t
> case 4: // loop over uint32_t
> default: // loop over uint64_t
> }
>
> with the expectation that normally we'll have aligned addresses and size
> such that the loop will iterate once.
>
>
> r~
And ifdef for arches without unaligned support?
--
MST
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
2026-06-14 17:22 ` Michael S. Tsirkin
@ 2026-06-15 1:17 ` Gavin Shan
2026-06-15 1:41 ` Richard Henderson
1 sibling, 0 replies; 24+ messages in thread
From: Gavin Shan @ 2026-06-15 1:17 UTC (permalink / raw)
To: Michael S. Tsirkin, Richard Henderson
Cc: Peter Maydell, qemu-arm, qemu-devel, peterx, berrange, david,
alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin
On 6/15/26 3:22 AM, Michael S. Tsirkin wrote:
> On Sun, Jun 14, 2026 at 09:45:51AM -0700, Richard Henderson wrote:
>> On 6/14/26 08:13, Michael S. Tsirkin wrote:
>>> Yes, I think it does work because we use -fno-strict-aliasing.
>>> For bigger sizes we'll need packed because the addresses
>>> could be unaligned.
>> ...
>>> For most host/guest pairs things simply work even for unaligned.
>>>
>>> And yes, guest drivers do do this.
>>>
>>> On classical pci, there are no transactions as such and
>>> an unaligned access will be split anyway.
>>
>> I'm saying, if you're talking about pass-through to real devices, that won't
>> work. For instance, AArch64 will trap unaligned accesses to Device memory.
>
> Presumably, AArch64 drivers don't do unaligned at all then?
>
I think Michael is correct because the unaligned access isn't a concern to
ram device region. MemroyRegionOPs::impl::unaligned is true for ram device
region, which means no unaligned access concerns.
static const MemoryRegionOps ram_device_mem_ops = {
.read = memory_region_ram_device_read,
.write = memory_region_ram_device_write,
.endianness = HOST_BIG_ENDIAN ? DEVICE_BIG_ENDIAN : DEVICE_LITTLE_ENDIAN,
.valid = {
.min_access_size = 1,
.max_access_size = 8,
.unaligned = true,
},
.impl = {
.min_access_size = 1,
.max_access_size = 8,
.unaligned = true,
},
};
system/physmem.c::memory_access_size
====================================
int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr)
{
:
/* Bound the maximum access by the alignment of the address. */
if (!mr->ops->impl.unaligned) {
unsigned align_size_max = addr & -addr;
if (align_size_max != 0 && align_size_max < access_size_max) {
access_size_max = align_size_max;
}
}
}
Please help to confirm if we really want to cover the unaligned access for
all architectures. After it's confirmed, I can send (v2) for further review.
>> You need to actually handle unaligned. Perhaps something like
>>
>> /* Find unit to fit size and alignment of dst */
>> uintptr_t test = (uintptr_t)dst | size;
>> uintptr_t lsb = test & -test;
>>
>> switch (lsb) {
>> case 1: // loop over uint8_t
>> case 2: // loop over uint16_t
>> case 4: // loop over uint32_t
>> default: // loop over uint64_t
>> }
>>
>> with the expectation that normally we'll have aligned addresses and size
>> such that the loop will iterate once.
>>
Thanks, Richard. I think it should work if we really want to cover the unaligned
access. I tried the following snippet based on yours, my issue can be fixed and
no other issues are seen in my environment.
-----> system/physmem.c
+/*
+ * qemu_ram_copy - copy data from ram block
+ *
+ * @dest: destination into which data is copied
+ * @src: source of the data
+ * @n: length of the data to be copied
+ *
+ * This function is friendly to unaligned access.
+ */
+void qemu_ram_copy(void *dest, const void *src, size_t n)
+{
+ uintptr_t test, lsb;
+
+ do {
+ test = (uintptr_t)dest | n;
+ lsb = test & -test;
+ switch (lsb) {
+ case 1:
+ *(uint8_t *)dest = *(uint8_t *)src;
+ src += 1;
+ dest += 1;
+ n -= 1;
+ break;
+ case 2:
+ *(uint16_t *)dest = *(uint16_t *)src;
+ src += 2;
+ dest += 2;
+ n -= 2;
+ break;
+ case 4:
+ *(uint32_t *)dest = *(uint32_t *)src;
+ src += 4;
+ dest += 4;
+ n -= 4;
+ break;
+ default:
+ *(uint64_t *)dest = *(uint64_t *)src;
+ src += 8;
+ dest += 8;
+ n -= 8;
+ }
+ } while (n != 0);
+}
-----> include/system/memory.h
+void qemu_ram_copy(void *dest, const void *src, size_t n);
+static inline void qemu_ram_move(void *dest, const void *src, size_t n)
+{
+ qemu_ram_copy(dest, src, n);
+}
>>
>> r~
>
> And ifdef for arches without unaligned support?
>
I guess we probably support unaligned access for all architectures or none of
them. It will depend on guest if the unaligned access will be triggered :-)
Thanks,
Gavin
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
2026-06-14 17:22 ` Michael S. Tsirkin
2026-06-15 1:17 ` Gavin Shan
@ 2026-06-15 1:41 ` Richard Henderson
2026-06-15 7:24 ` Michael S. Tsirkin
2026-06-15 7:58 ` Michael S. Tsirkin
1 sibling, 2 replies; 24+ messages in thread
From: Richard Henderson @ 2026-06-15 1:41 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Peter Maydell, Gavin Shan, qemu-arm, qemu-devel, peterx, berrange,
david, alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin
On 6/14/26 10:22, Michael S. Tsirkin wrote:
> On Sun, Jun 14, 2026 at 09:45:51AM -0700, Richard Henderson wrote:
>> On 6/14/26 08:13, Michael S. Tsirkin wrote:
>>> Yes, I think it does work because we use -fno-strict-aliasing.
>>> For bigger sizes we'll need packed because the addresses
>>> could be unaligned.
>> ...
>>> For most host/guest pairs things simply work even for unaligned.
>>>
>>> And yes, guest drivers do do this.
>>>
>>> On classical pci, there are no transactions as such and
>>> an unaligned access will be split anyway.
>>
>> I'm saying, if you're talking about pass-through to real devices, that won't
>> work. For instance, AArch64 will trap unaligned accesses to Device memory.
>
> Presumably, AArch64 drivers don't do unaligned at all then?
Yes.
>> You need to actually handle unaligned. Perhaps something like
>>
>> /* Find unit to fit size and alignment of dst */
>> uintptr_t test = (uintptr_t)dst | size;
>> uintptr_t lsb = test & -test;
>>
>> switch (lsb) {
>> case 1: // loop over uint8_t
>> case 2: // loop over uint16_t
>> case 4: // loop over uint32_t
>> default: // loop over uint64_t
>> }
>>
>> with the expectation that normally we'll have aligned addresses and size
>> such that the loop will iterate once.
>>
>>
>> r~
>
> And ifdef for arches without unaligned support?
No ifdef. All accesses produced by the above are aligned.
r~
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
2026-06-15 1:41 ` Richard Henderson
@ 2026-06-15 7:24 ` Michael S. Tsirkin
2026-06-15 7:58 ` Michael S. Tsirkin
1 sibling, 0 replies; 24+ messages in thread
From: Michael S. Tsirkin @ 2026-06-15 7:24 UTC (permalink / raw)
To: Richard Henderson
Cc: Peter Maydell, Gavin Shan, qemu-arm, qemu-devel, peterx, berrange,
david, alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin
On Sun, Jun 14, 2026 at 06:41:57PM -0700, Richard Henderson wrote:
> On 6/14/26 10:22, Michael S. Tsirkin wrote:
> > On Sun, Jun 14, 2026 at 09:45:51AM -0700, Richard Henderson wrote:
> > > On 6/14/26 08:13, Michael S. Tsirkin wrote:
> > > > Yes, I think it does work because we use -fno-strict-aliasing.
> > > > For bigger sizes we'll need packed because the addresses
> > > > could be unaligned.
> > > ...
> > > > For most host/guest pairs things simply work even for unaligned.
> > > >
> > > > And yes, guest drivers do do this.
> > > >
> > > > On classical pci, there are no transactions as such and
> > > > an unaligned access will be split anyway.
> > >
> > > I'm saying, if you're talking about pass-through to real devices, that won't
> > > work. For instance, AArch64 will trap unaligned accesses to Device memory.
> >
> > Presumably, AArch64 drivers don't do unaligned at all then?
>
> Yes.
>
> > > You need to actually handle unaligned. Perhaps something like
> > >
> > > /* Find unit to fit size and alignment of dst */
> > > uintptr_t test = (uintptr_t)dst | size;
> > > uintptr_t lsb = test & -test;
> > >
> > > switch (lsb) {
> > > case 1: // loop over uint8_t
> > > case 2: // loop over uint16_t
> > > case 4: // loop over uint32_t
> > > default: // loop over uint64_t
> > > }
> > >
> > > with the expectation that normally we'll have aligned addresses and size
> > > such that the loop will iterate once.
OK though it is worth looking at assembly and checking how to make it ooptimal.
I don't get why we have default here either, for that
we really should use memcpy for a better perf,
I think VCPUs can't initiate MMIO transactions >8 bytes.
> > >
> > >
> > > r~
> >
> > And ifdef for arches without unaligned support?
>
> No ifdef. All accesses produced by the above are aligned.
>
>
> r~
but I don't think it's a good idea to have this on x86.
x86 does not need this pile of branches, has a popular closed
source guest we are all familiar with, famous for
it's rich ecosystem of drivers) and it's just
barely possible that an x86 guest on x86 host could
work as long as we do not break transaction boundaries.
Is
#ifdef HOST_X86_64
#define QEMU_UNALIGNED 1
#else
#define QEMU_UNALIGNED 0
#endif
And then if (QEMU_UNALIGNED) a big deal really?
--
MST
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
2026-06-15 1:41 ` Richard Henderson
2026-06-15 7:24 ` Michael S. Tsirkin
@ 2026-06-15 7:58 ` Michael S. Tsirkin
2026-06-15 10:08 ` Gavin Shan
1 sibling, 1 reply; 24+ messages in thread
From: Michael S. Tsirkin @ 2026-06-15 7:58 UTC (permalink / raw)
To: Richard Henderson
Cc: Peter Maydell, Gavin Shan, qemu-arm, qemu-devel, peterx, berrange,
david, alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin
On Sun, Jun 14, 2026 at 06:41:57PM -0700, Richard Henderson wrote:
> On 6/14/26 10:22, Michael S. Tsirkin wrote:
> > On Sun, Jun 14, 2026 at 09:45:51AM -0700, Richard Henderson wrote:
> > > On 6/14/26 08:13, Michael S. Tsirkin wrote:
> > > > Yes, I think it does work because we use -fno-strict-aliasing.
> > > > For bigger sizes we'll need packed because the addresses
> > > > could be unaligned.
> > > ...
> > > > For most host/guest pairs things simply work even for unaligned.
> > > >
> > > > And yes, guest drivers do do this.
> > > >
> > > > On classical pci, there are no transactions as such and
> > > > an unaligned access will be split anyway.
> > >
> > > I'm saying, if you're talking about pass-through to real devices, that won't
> > > work. For instance, AArch64 will trap unaligned accesses to Device memory.
> >
> > Presumably, AArch64 drivers don't do unaligned at all then?
>
> Yes.
>
> > > You need to actually handle unaligned. Perhaps something like
> > >
> > > /* Find unit to fit size and alignment of dst */
> > > uintptr_t test = (uintptr_t)dst | size;
> > > uintptr_t lsb = test & -test;
> > >
> > > switch (lsb) {
> > > case 1: // loop over uint8_t
> > > case 2: // loop over uint16_t
> > > case 4: // loop over uint32_t
> > > default: // loop over uint64_t
> > > }
> > >
> > > with the expectation that normally we'll have aligned addresses and size
> > > such that the loop will iterate once.
> > >
> > >
> > > r~
> >
> > And ifdef for arches without unaligned support?
>
> No ifdef. All accesses produced by the above are aligned.
>
>
> r~
So something like:
#if defined(__i386__) || defined(__x86_64__)
#define HOST_UNALIGNED_MMIO_OK 1
#else
#define HOST_UNALIGNED_MMIO_OK 0
#endif
and then we can check that.
--
MST
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
2026-06-15 7:58 ` Michael S. Tsirkin
@ 2026-06-15 10:08 ` Gavin Shan
0 siblings, 0 replies; 24+ messages in thread
From: Gavin Shan @ 2026-06-15 10:08 UTC (permalink / raw)
To: Michael S. Tsirkin, Richard Henderson
Cc: Peter Maydell, qemu-arm, qemu-devel, peterx, berrange, david,
alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin
On 6/15/26 5:58 PM, Michael S. Tsirkin wrote:
> On Sun, Jun 14, 2026 at 06:41:57PM -0700, Richard Henderson wrote:
>> On 6/14/26 10:22, Michael S. Tsirkin wrote:
>>> On Sun, Jun 14, 2026 at 09:45:51AM -0700, Richard Henderson wrote:
>>>> On 6/14/26 08:13, Michael S. Tsirkin wrote:
>>>>> Yes, I think it does work because we use -fno-strict-aliasing.
>>>>> For bigger sizes we'll need packed because the addresses
>>>>> could be unaligned.
>>>> ...
>>>>> For most host/guest pairs things simply work even for unaligned.
>>>>>
>>>>> And yes, guest drivers do do this.
>>>>>
>>>>> On classical pci, there are no transactions as such and
>>>>> an unaligned access will be split anyway.
>>>>
>>>> I'm saying, if you're talking about pass-through to real devices, that won't
>>>> work. For instance, AArch64 will trap unaligned accesses to Device memory.
>>>
>>> Presumably, AArch64 drivers don't do unaligned at all then?
>>
>> Yes.
>>
>>>> You need to actually handle unaligned. Perhaps something like
>>>>
>>>> /* Find unit to fit size and alignment of dst */
>>>> uintptr_t test = (uintptr_t)dst | size;
>>>> uintptr_t lsb = test & -test;
>>>>
>>>> switch (lsb) {
>>>> case 1: // loop over uint8_t
>>>> case 2: // loop over uint16_t
>>>> case 4: // loop over uint32_t
>>>> default: // loop over uint64_t
>>>> }
>>>>
>>>> with the expectation that normally we'll have aligned addresses and size
>>>> such that the loop will iterate once.
>>>>
>>>>
>>>> r~
>>>
>>> And ifdef for arches without unaligned support?
>>
>> No ifdef. All accesses produced by the above are aligned.
>>
>>
>> r~
>
> So something like:
>
> #if defined(__i386__) || defined(__x86_64__)
> #define HOST_UNALIGNED_MMIO_OK 1
> #else
> #define HOST_UNALIGNED_MMIO_OK 0
> #endif
>
>
> and then we can check that.
>
Thanks for all the comments. A new revision (v2) has been sent for further
comments, thanks.
https://lore.kernel.org/qemu-arm/20260615100200.266968-1-gshan@redhat.com/
Thanks,
Gavin
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
2026-06-12 15:29 ` Richard Henderson
2026-06-12 16:36 ` Peter Maydell
@ 2026-06-14 15:42 ` Michael S. Tsirkin
1 sibling, 0 replies; 24+ messages in thread
From: Michael S. Tsirkin @ 2026-06-14 15:42 UTC (permalink / raw)
To: Richard Henderson
Cc: Gavin Shan, qemu-arm, qemu-devel, peterx, peter.maydell, berrange,
david, alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin
On Fri, Jun 12, 2026 at 08:29:34AM -0700, Richard Henderson wrote:
> On 6/12/26 04:03, Gavin Shan wrote:
> > All ram device regions was turned to be indirectly accessible by commit
> > 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> > to guest hang on compiling 'cuda-samples' as reported by Julia. The guest
> > is started by the following command lines, with a GH100 GPU card.
> >
> > host$ lspci | grep GH100
> > 0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
> > host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64 \
> > -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T \
> > -accel kvm -cpu host -smp cpus=48 -m size=8G \
> > -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0 \
> > -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4 \
> > -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
> > :
> > guest$ cd cuda-samples/build
> > guest$ make -j 20 clean
> > guest$ make -j 20
> > :
> > [ 54%] Linking CUDA executable graphMemoryNodes
> > [ 54%] Built target graphMemoryNodes
> > <no more output afterwards, guest becomes frozen here>
> >
> > guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
> > [ 555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
> >
> > When the GPU's driver (NVidia open driver) is loaded on guest bootup,
> > the memory blocks residing in the PCI BAR#4 can be presented to the
> > guest through memory hot-add. The page cache can be allocated from the
> > hot added memory blocks when cuda-samples is being compiled. Afterwards,
> > the page cache is sent to QEMU's virtio-blk device as part of the DMA
> > request, the bounce buffer has to be used to accomodate the request as
> > the corresponding memory region (MemoryRegion) is a RAM DEVICE region
> > and indirectly accessible in qemu. However, the max bounce bufer size
> > is only 4096 bytes by default. We're running out of that space quickly.
> >
> > QEMU
> > ====
> > virtio_blk_handle_output
> > virtio_blk_handle_vq
> > virtio_blk_get_request
> > virtqueue_pop
> > virtqueue_split_pop
> > virtqueue_map_desc
> > address_space_map
> > memory_access_is_direct # Return false
> > memory_region_supports_direct_access
> >
> > (qemu) info mtree
> > memory-region: pci_bridge_pci
> > 0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
> > 0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
> > 0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
> > 0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
> >
> > This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> > accessors to ram device memory region, preparatory work to make ram device
> > region directly accessible and bypass the bounce buffer in the DMA path
> > in next patch.
>
> memcpy/memmove *always* compile to __builtin_memcpy/memmove, and the
> compiler later decides whether or not to expand inline.
Did some research.
The issue that prompted use of __builtin_memcpy in bswap.h is musl's fortify headers
that Alpine Linux was using back in 2019.
Try this:
git clone --depth=1 https://github.com/jvoisin/fortify-headers /tmp/fortify-headers
git -C /tmp/fortify-headers checkout 5aabc7e~1
now:
musl-gcc -O2 -D_FORTIFY_SOURCE=2 -isystem /tmp/fortify-headers/include -S -o- test.c
where test.c is:
#include <string.h>
#include <stdint.h>
void test_memcpy(void *dst, const void *src) {
memcpy(dst, src, 4);
}
void test_builtin_memcpy(void *dst, const void *src) {
__builtin_memcpy(dst, src, 4);
}
On x86:
test_memcpy:
.LFB13:
.cfi_startproc
cmpq %rsi, %rdi
jnb .L2
leaq 4(%rdi), %rax
cmpq %rax, %rsi
jb .L6
.L4:
movl $4, %edx
jmp memcpy
.p2align 4,,10
.p2align 3
.L2:
cmpq %rdi, %rsi
jnb .L4
leaq 4(%rsi), %rax
cmpq %rax, %rdi
jb .L3
movl $4, %edx
jmp memcpy
.L6:
jmp .L3
.cfi_endproc
.section .text.unlikely
.cfi_startproc
.type test_memcpy.cold, @function
test_memcpy.cold:
.LFSB13:
.L3:
ud2
.cfi_endproc
.LFE13:
.text
.size test_memcpy, .-test_memcpy
A jmp so a function call.
While builtin:
test_builtin_memcpy:
.LFB14:
.cfi_startproc
movl (%rsi), %eax
movl %eax, (%rdi)
ret
.cfi_endproc
.LFE14:
.size test_builtin_memcpy, .-test_builtin_memcpy
So it's a fortify headers Version 1.0 bug. Version 1.1 has the fix.
I guess we should rewrite bswap.h then.
> So, this doesn't do what you think it does.
> My real question is: what are you attempting to achieve?
>
> (1) is the problem unaligned access to a mapped physical device?
> (2) is the problem vector access to a mapped physical device?
> (3) something else?
>
>
> r~
^ permalink raw reply [flat|nested] 24+ messages in thread