* [PATCH 0/2] system/memory: Make ram device region directly accessible
@ 2026-06-12 11:03 Gavin Shan
2026-06-12 11:03 ` [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region Gavin Shan
2026-06-12 11:03 ` [PATCH 2/2] system/memory: Make ram device region directly accessible Gavin Shan
0 siblings, 2 replies; 5+ messages in thread
From: Gavin Shan @ 2026-06-12 11:03 UTC (permalink / raw)
To: qemu-arm
Cc: qemu-devel, peterx, mst, peter.maydell, berrange, david, alex,
clg, pbonzini, philmd, phrdina, jugraham, shan.gavin

All ram device regions were made indirectly accessible by commit
4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
to a guest hang when an NVIDIA GH100 GPU is passed through from the host.
Memory in its PCI BAR#4 can be allocated as a DMA target buffer, so QEMU has
to use the DMA bounce buffer in address_space_map() to serve the DMA request.
However, the bounce buffer size is 4096 bytes, and it is easily exhausted
when the guest has significant disk activity, for example while compiling
'cuda-samples'. The full log and problem description can be found in the
commit log of PATCH[1/2].

Address the issue handled by commit 4a2e242bbb in a different way, by
replacing mem{cpy, move} with __builtin_mem{cpy, move} in the accessors of
ram device regions. With this, that commit can essentially be reverted to
make ram device regions directly accessible again and to bypass the bounce
buffer in address_space_map(), where the guest hang is caused.

PATCH[1] replaces mem{cpy, move} with __builtin_mem{cpy, move}
PATCH[2] makes ram device regions directly accessible again

Changelog
=========
RFCv1 -> v1:
* https://lists.nongnu.org/archive/html/qemu-arm/2026-06/msg00307.html
* Reworked solution based on suggestions from Peter Xu, Peter Maydell
and Michael S. Tsirkin
Gavin Shan (2):
system/memory: Use __builtin_mem{cpy, move} in accessors of ram device
region
system/memory: Make ram device region directly accessible
hw/remote/vfio-user-obj.c | 4 +--
include/system/memory.h | 53 +++++++++++++++++++++++++++++++--------
system/memory.c | 41 +-----------------------------
system/physmem.c | 8 +++---
system/trace-events | 2 --
5 files changed, 50 insertions(+), 58 deletions(-)
--
2.54.0
* [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
2026-06-12 11:03 [PATCH 0/2] system/memory: Make ram device region directly accessible Gavin Shan
@ 2026-06-12 11:03 ` Gavin Shan
2026-06-12 11:22 ` Michael S. Tsirkin
2026-06-12 14:05 ` Philippe Mathieu-Daudé
2026-06-12 11:03 ` [PATCH 2/2] system/memory: Make ram device region directly accessible Gavin Shan
1 sibling, 2 replies; 5+ messages in thread
From: Gavin Shan @ 2026-06-12 11:03 UTC (permalink / raw)
To: qemu-arm
Cc: qemu-devel, peterx, mst, peter.maydell, berrange, david, alex,
clg, pbonzini, philmd, phrdina, jugraham, shan.gavin

All ram device regions were made indirectly accessible by commit
4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
to a guest hang when compiling 'cuda-samples', as reported by Julia. The
guest is started with the following command line, with a GH100 GPU card
passed through.

host$ lspci | grep GH100
0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64 \
-machine virt,gic-version=host,ras=on,highmem-mmio-size=4T \
-accel kvm -cpu host -smp cpus=48 -m size=8G \
-drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0 \
-device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4 \
-device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
:
guest$ cd cuda-samples/build
guest$ make -j 20 clean
guest$ make -j 20
:
[ 54%] Linking CUDA executable graphMemoryNodes
[ 54%] Built target graphMemoryNodes
<no more output afterwards, guest becomes frozen here>
guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
[ 555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)

When the GPU driver (NVIDIA open driver) is loaded during guest boot,
the memory blocks residing in PCI BAR#4 can be presented to the guest
through memory hot-add. Page cache can then be allocated from the
hot-added memory blocks while cuda-samples is being compiled. When those
page cache pages are later sent to QEMU's virtio-blk device as part of a
DMA request, the bounce buffer has to be used to accommodate the request,
because the corresponding memory region (MemoryRegion) is a RAM DEVICE
region and only indirectly accessible in QEMU. However, the maximum bounce
buffer size is only 4096 bytes by default, and that space runs out quickly.

QEMU
====
virtio_blk_handle_output
  virtio_blk_handle_vq
    virtio_blk_get_request
      virtqueue_pop
        virtqueue_split_pop
          virtqueue_map_desc
            address_space_map
              memory_access_is_direct    # Return false
                memory_region_supports_direct_access
(qemu) info mtree
memory-region: pci_bridge_pci
0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
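
For illustration only, the failure mode described above can be modeled in a few lines of C. This is a simplified sketch, not QEMU's code: ToyAddressSpace, toy_map() and toy_unmap() are hypothetical names, and the accounting is only loosely modeled on the bounce buffer budget in address_space_map(). It assumes that an indirect mapping is granted at most the remaining budget and fails once nothing is left.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model; these are not QEMU's data structures. */
typedef struct {
    size_t max_bounce;   /* bounce buffer budget, 4096 bytes in this report */
    size_t used_bounce;  /* bytes held by mappings that are still live */
} ToyAddressSpace;

/*
 * Map 'len' bytes and return how many bytes were actually mapped.
 * Directly accessible regions are mapped in full and never touch the
 * budget. Indirect regions get at most the remaining budget; a return
 * value of 0 means the mapping failed.
 */
static size_t toy_map(ToyAddressSpace *as, bool direct, size_t len)
{
    size_t avail;

    if (direct) {
        return len;
    }
    avail = as->max_bounce - as->used_bounce;
    if (len > avail) {
        len = avail;
    }
    as->used_bounce += len;
    return len;
}

/* Release a mapping of 'len' bytes previously granted by toy_map(). */
static void toy_unmap(ToyAddressSpace *as, bool direct, size_t len)
{
    if (!direct) {
        as->used_bounce -= len;
    }
}
```

In this model, one in-flight 4 KiB page from an indirect region consumes the whole 4096-byte budget, so the next indirect mapping requested before it is released gets 0 bytes. A direct region is not affected by the budget at all.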

This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
accessors of ram device regions. It is preparatory work for the next patch,
which makes ram device regions directly accessible and bypasses the bounce
buffer in the DMA path.

Reported-by: Julia Graham <jugraham@redhat.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Suggested-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
hw/remote/vfio-user-obj.c | 4 ++--
include/system/memory.h | 42 ++++++++++++++++++++++++++++++++++++++-
system/physmem.c | 8 ++++----
3 files changed, 47 insertions(+), 7 deletions(-)
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 87fa7b6572..fe6f661fe2 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -375,9 +375,9 @@ static int vfu_object_mr_rw(MemoryRegion *mr, uint8_t *buf, hwaddr offset,
     ram_ptr = memory_region_get_ram_ptr(mr);

     if (is_write) {
-        memcpy((ram_ptr + offset), buf, size);
+        address_space_memcpy(ram_ptr + offset, buf, size);
     } else {
-        memcpy(buf, (ram_ptr + offset), size);
+        address_space_memcpy(buf, ram_ptr + offset, size);
     }

     return 0;
diff --git a/include/system/memory.h b/include/system/memory.h
index 1417132f6d..6bb2e13eea 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -2938,6 +2938,46 @@ static inline bool memory_access_is_direct(const MemoryRegion *mr,
     return true;
 }

+static inline void address_space_memcpy(void *dest, const void *src, size_t n)
+{
+    switch (n) {
+    case 1:
+        __builtin_memcpy(dest, src, 1);
+        break;
+    case 2:
+        __builtin_memcpy(dest, src, 2);
+        break;
+    case 4:
+        __builtin_memcpy(dest, src, 4);
+        break;
+    case 8:
+        __builtin_memcpy(dest, src, 8);
+        break;
+    default:
+        __builtin_memcpy(dest, src, n);
+    }
+}
+
+static inline void address_space_memmove(void *dest, const void *src, size_t n)
+{
+    switch (n) {
+    case 1:
+        __builtin_memmove(dest, src, 1);
+        break;
+    case 2:
+        __builtin_memmove(dest, src, 2);
+        break;
+    case 4:
+        __builtin_memmove(dest, src, 4);
+        break;
+    case 8:
+        __builtin_memmove(dest, src, 8);
+        break;
+    default:
+        __builtin_memmove(dest, src, n);
+    }
+}
+
 /**
  * address_space_read: read from an address space.
  *
@@ -2970,7 +3010,7 @@ MemTxResult address_space_read(AddressSpace *as, hwaddr addr,
             mr = flatview_translate(fv, addr, &addr1, &l, false, attrs);
             if (len == l && memory_access_is_direct(mr, false, attrs)) {
                 ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
-                memcpy(buf, ptr, len);
+                __builtin_memcpy(buf, ptr, len);
             } else {
                 result = flatview_read_continue(fv, addr, attrs, buf, len,
                                                 addr1, l, mr);
diff --git a/system/physmem.c b/system/physmem.c
index 7bcbf87573..5f46a9d676 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3272,7 +3272,7 @@ static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
         uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
                                                false, true);

-        memmove(ram_ptr, buf, *l);
+        address_space_memmove(ram_ptr, buf, *l);
         invalidate_and_set_dirty(mr, mr_addr, *l);

         return MEMTX_OK;
@@ -3365,7 +3365,7 @@ static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
         uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
                                                false, false);

-        memcpy(buf, ram_ptr, *l);
+        address_space_memcpy(buf, ram_ptr, *l);

         return MEMTX_OK;
     }
@@ -3503,8 +3503,8 @@ MemTxResult address_space_write_rom(AddressSpace *as, hwaddr addr,
             l = memory_access_size(mr, l, addr1);
         } else {
             /* ROM/RAM case */
-            void *ram_ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
-            memcpy(ram_ptr, buf, l);
+            address_space_memcpy(qemu_map_ram_ptr(mr->ram_block, addr1),
+                                 buf, l);
             invalidate_and_set_dirty(mr, addr1, l);
         }
         len -= l;
--
2.54.0
* [PATCH 2/2] system/memory: Make ram device region directly accessible
2026-06-12 11:03 [PATCH 0/2] system/memory: Make ram device region directly accessible Gavin Shan
2026-06-12 11:03 ` [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region Gavin Shan
@ 2026-06-12 11:03 ` Gavin Shan
1 sibling, 0 replies; 5+ messages in thread
From: Gavin Shan @ 2026-06-12 11:03 UTC (permalink / raw)
To: qemu-arm
Cc: qemu-devel, peterx, mst, peter.maydell, berrange, david, alex,
clg, pbonzini, philmd, phrdina, jugraham, shan.gavin

This essentially reverts 4a2e242bbb30 ("memory: Don't use memcpy for
ram_device regions") to make ram device regions directly accessible
again. With this, the bounce buffer is bypassed in address_space_map()
when a ram device region is involved, which avoids exhausting the
bounce buffer in that case.
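
As a rough illustration of the behavioural change, the predicate can be sketched before and after the revert. This is a simplified model under stated assumptions: ToyRegion, toy_direct_before() and toy_direct_after() are hypothetical names, and the three boolean flags stand in for what memory_region_is_ram(), memory_region_is_ram_device() and memory_region_is_romd() report on a real MemoryRegion.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the MemoryRegion properties involved. */
typedef struct {
    bool ram;         /* region is backed by host memory */
    bool ram_device;  /* region is a RAM DEVICE region, e.g. a mapped BAR */
    bool romd;        /* ROM device currently in ROMD mode */
} ToyRegion;

/* Before this patch: RAM DEVICE regions are treated as I/O. */
static bool toy_direct_before(const ToyRegion *mr)
{
    if (mr->romd) {
        return true;
    }
    if (!mr->ram) {
        return false;
    }
    return !mr->ram_device;
}

/* After this patch: any RAM-backed region is directly accessible. */
static bool toy_direct_after(const ToyRegion *mr)
{
    if (mr->romd) {
        return true;
    }
    return mr->ram;
}
```

For a region like BAR 4 mmaps[0] (ram and ram_device both set), the first function returns false and the second returns true, which is what lets address_space_map() skip the bounce buffer. Plain RAM and pure I/O regions get the same answer from both.
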
Reported-by: Julia Graham <jugraham@redhat.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Suggested-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
include/system/memory.h | 11 ++---------
system/memory.c | 41 +----------------------------------------
system/trace-events | 2 --
3 files changed, 3 insertions(+), 51 deletions(-)
diff --git a/include/system/memory.h b/include/system/memory.h
index 6bb2e13eea..3ca6155805 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -2914,15 +2914,8 @@ static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
     if (memory_region_is_romd(mr)) {
         return true;
     }
-    if (!memory_region_is_ram(mr)) {
-        return false;
-    }
-    /*
-     * RAM DEVICE regions can be accessed directly using memcpy, but it might
-     * be MMIO and access using mempy can be wrong (e.g., using instructions not
-     * intended for MMIO access). So we treat this as IO.
-     */
-    return !memory_region_is_ram_device(mr);
+
+    return memory_region_is_ram(mr);
 }

 static inline bool memory_access_is_direct(const MemoryRegion *mr,
diff --git a/system/memory.c b/system/memory.c
index 739ba11da6..9549dd1a94 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -1362,43 +1362,6 @@ const MemoryRegionOps unassigned_mem_ops = {
     .endianness = DEVICE_NATIVE_ENDIAN,
 };

-static uint64_t memory_region_ram_device_read(void *opaque,
-                                              hwaddr addr, unsigned size)
-{
-    MemoryRegion *mr = opaque;
-    uint64_t data = ldn_he_p(mr->ram_block->host + addr, size);
-
-    trace_memory_region_ram_device_read(get_cpu_index(), mr, addr, data, size);
-
-    return data;
-}
-
-static void memory_region_ram_device_write(void *opaque, hwaddr addr,
-                                           uint64_t data, unsigned size)
-{
-    MemoryRegion *mr = opaque;
-
-    trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, data, size);
-
-    stn_he_p(mr->ram_block->host + addr, size, data);
-}
-
-static const MemoryRegionOps ram_device_mem_ops = {
-    .read = memory_region_ram_device_read,
-    .write = memory_region_ram_device_write,
-    .endianness = HOST_BIG_ENDIAN ? DEVICE_BIG_ENDIAN : DEVICE_LITTLE_ENDIAN,
-    .valid = {
-        .min_access_size = 1,
-        .max_access_size = 8,
-        .unaligned = true,
-    },
-    .impl = {
-        .min_access_size = 1,
-        .max_access_size = 8,
-        .unaligned = true,
-    },
-};
-
 bool memory_region_access_valid(MemoryRegion *mr,
                                 hwaddr addr,
                                 unsigned size,
@@ -1676,10 +1639,8 @@ void memory_region_init_ram_device_ptr(MemoryRegion *mr, Object *owner,
                                        const char *name, uint64_t size,
                                        void *ptr)
 {
-    memory_region_init_io(mr, owner, &ram_device_mem_ops, mr, name, size);
-    mr->ram = true;
+    memory_region_init_ram_ptr(mr, owner, name, size, ptr);
     mr->ram_device = true;
-    memory_region_set_ram_ptr(mr, size, ptr);
 }

 void memory_region_init_alias(MemoryRegion *mr, Object *owner,
diff --git a/system/trace-events b/system/trace-events
index e6e1b61279..34af0a3a1e 100644
--- a/system/trace-events
+++ b/system/trace-events
@@ -20,8 +20,6 @@ memory_region_ops_read(int cpu_index, void *mr, uint64_t addr, uint64_t value, u
memory_region_ops_write(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size, const char *name) "cpu %d mr %p addr 0x%"PRIx64" value 0x%"PRIx64" size %u name '%s'"
memory_region_subpage_read(int cpu_index, void *mr, uint64_t offset, uint64_t value, unsigned size) "cpu %d mr %p offset 0x%"PRIx64" value 0x%"PRIx64" size %u"
memory_region_subpage_write(int cpu_index, void *mr, uint64_t offset, uint64_t value, unsigned size) "cpu %d mr %p offset 0x%"PRIx64" value 0x%"PRIx64" size %u"
-memory_region_ram_device_read(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size) "cpu %d mr %p addr 0x%"PRIx64" value 0x%"PRIx64" size %u"
-memory_region_ram_device_write(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size) "cpu %d mr %p addr 0x%"PRIx64" value 0x%"PRIx64" size %u"
memory_region_sync_dirty(const char *mr, const char *listener, int global) "mr '%s' listener '%s' synced (global=%d)"
flatview_new(void *view, void *root) "%p (root %p)"
flatview_destroy(void *view, void *root) "%p (root %p)"
--
2.54.0
* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
2026-06-12 11:03 ` [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region Gavin Shan
@ 2026-06-12 11:22 ` Michael S. Tsirkin
2026-06-12 14:05 ` Philippe Mathieu-Daudé
1 sibling, 0 replies; 5+ messages in thread
From: Michael S. Tsirkin @ 2026-06-12 11:22 UTC (permalink / raw)
To: Gavin Shan
Cc: qemu-arm, qemu-devel, peterx, peter.maydell, berrange, david,
alex, clg, pbonzini, philmd, phrdina, jugraham, shan.gavin
On Fri, Jun 12, 2026 at 09:03:06PM +1000, Gavin Shan wrote:
> All ram device regions was turned to be indirectly accessible by commit
> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> to guest hang on compiling 'cuda-samples' as reported by Julia. The guest
> is started by the following command lines, with a GH100 GPU card.
>
> host$ lspci | grep GH100
> 0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
> host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64 \
> -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T \
> -accel kvm -cpu host -smp cpus=48 -m size=8G \
> -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0 \
> -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4 \
> -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
> :
> guest$ cd cuda-samples/build
> guest$ make -j 20 clean
> guest$ make -j 20
> :
> [ 54%] Linking CUDA executable graphMemoryNodes
> [ 54%] Built target graphMemoryNodes
> <no more output afterwards, guest becomes frozen here>
>
> guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
> [ 555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
>
> When the GPU's driver (NVidia open driver) is loaded on guest bootup,
> the memory blocks residing in the PCI BAR#4 can be presented to the
> guest through memory hot-add. The page cache can be allocated from the
> hot added memory blocks when cuda-samples is being compiled. Afterwards,
> the page cache is sent to QEMU's virtio-blk device as part of the DMA
> request, the bounce buffer has to be used to accomodate the request as
> the corresponding memory region (MemoryRegion) is a RAM DEVICE region
> and indirectly accessible in qemu. However, the max bounce bufer size
> is only 4096 bytes by default. We're running out of that space quickly.
>
> QEMU
> ====
> virtio_blk_handle_output
> virtio_blk_handle_vq
> virtio_blk_get_request
> virtqueue_pop
> virtqueue_split_pop
> virtqueue_map_desc
> address_space_map
> memory_access_is_direct # Return false
> memory_region_supports_direct_access
>
> (qemu) info mtree
> memory-region: pci_bridge_pci
> 0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
> 0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
> 0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
> 0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
>
> This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> accessors to ram device memory region, preparatory work to make ram device
> region directly accessible and bypass the bounce buffer in the DMA path
> in next patch.
>
> Reported-by: Julia Graham <jugraham@redhat.com>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> Suggested-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
> hw/remote/vfio-user-obj.c | 4 ++--
> include/system/memory.h | 42 ++++++++++++++++++++++++++++++++++++++-
> system/physmem.c | 8 ++++----
> 3 files changed, 47 insertions(+), 7 deletions(-)
>
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index 87fa7b6572..fe6f661fe2 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -375,9 +375,9 @@ static int vfu_object_mr_rw(MemoryRegion *mr, uint8_t *buf, hwaddr offset,
> ram_ptr = memory_region_get_ram_ptr(mr);
>
> if (is_write) {
> - memcpy((ram_ptr + offset), buf, size);
> + address_space_memcpy(ram_ptr + offset, buf, size);
> } else {
> - memcpy(buf, (ram_ptr + offset), size);
> + address_space_memcpy(buf, ram_ptr + offset, size);
> }
>
> return 0;
> diff --git a/include/system/memory.h b/include/system/memory.h
> index 1417132f6d..6bb2e13eea 100644
> --- a/include/system/memory.h
> +++ b/include/system/memory.h
> @@ -2938,6 +2938,46 @@ static inline bool memory_access_is_direct(const MemoryRegion *mr,
> return true;
> }
>
> +static inline void address_space_memcpy(void *dest, const void *src, size_t n)
> +{
> + switch (n) {
> + case 1:
> + __builtin_memcpy(dest, src, 1);
> + break;
> + case 2:
> + __builtin_memcpy(dest, src, 2);
> + break;
> + case 4:
> + __builtin_memcpy(dest, src, 4);
> + break;
> + case 8:
> + __builtin_memcpy(dest, src, 8);
> + break;
> + default:
> + __builtin_memcpy(dest, src, n);
> + }
> +}
> +
> +static inline void address_space_memmove(void *dest, const void *src, size_t n)
> +{
> + switch (n) {
> + case 1:
> + __builtin_memmove(dest, src, 1);
> + break;
> + case 2:
> + __builtin_memmove(dest, src, 2);
> + break;
> + case 4:
> + __builtin_memmove(dest, src, 4);
> + break;
> + case 8:
> + __builtin_memmove(dest, src, 8);
> + break;
> + default:
> + __builtin_memmove(dest, src, n);
> + }
> +}
> +
> /**
> * address_space_read: read from an address space.
> *

The variable-length case should probably use the regular memcpy/memmove;
there is no reason to bypass fortification for these.
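
A minimal sketch of what this suggestion could look like, assuming GCC or Clang (for __builtin_memcpy). The name ram_copy_sketch() is hypothetical and does not appear in the patch; only the default branch differs from the posted helper.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical variant of the posted helper. The constant sizes keep
 * using the builtin; every other length goes through the regular
 * memcpy(), so a _FORTIFY_SOURCE build can still check that call.
 */
static inline void ram_copy_sketch(void *dest, const void *src, size_t n)
{
    switch (n) {
    case 1:
        __builtin_memcpy(dest, src, 1);
        break;
    case 2:
        __builtin_memcpy(dest, src, 2);
        break;
    case 4:
        __builtin_memcpy(dest, src, 4);
        break;
    case 8:
        __builtin_memcpy(dest, src, 8);
        break;
    default:
        /* Variable length: leave it to the (possibly fortified) libc. */
        memcpy(dest, src, n);
        break;
    }
}
```

The memmove counterpart would follow the same pattern, with memmove() in the default branch.
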
> @@ -2970,7 +3010,7 @@ MemTxResult address_space_read(AddressSpace *as, hwaddr addr,
> mr = flatview_translate(fv, addr, &addr1, &l, false, attrs);
> if (len == l && memory_access_is_direct(mr, false, attrs)) {
> ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
> - memcpy(buf, ptr, len);
> + __builtin_memcpy(buf, ptr, len);
> } else {
> result = flatview_read_continue(fv, addr, attrs, buf, len,
> addr1, l, mr);
> diff --git a/system/physmem.c b/system/physmem.c
> index 7bcbf87573..5f46a9d676 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -3272,7 +3272,7 @@ static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
> uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
> false, true);
>
> - memmove(ram_ptr, buf, *l);
> + address_space_memmove(ram_ptr, buf, *l);
> invalidate_and_set_dirty(mr, mr_addr, *l);
>
> return MEMTX_OK;
> @@ -3365,7 +3365,7 @@ static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
> uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
> false, false);
>
> - memcpy(buf, ram_ptr, *l);
> + address_space_memcpy(buf, ram_ptr, *l);
>
> return MEMTX_OK;
> }
> @@ -3503,8 +3503,8 @@ MemTxResult address_space_write_rom(AddressSpace *as, hwaddr addr,
> l = memory_access_size(mr, l, addr1);
> } else {
> /* ROM/RAM case */
> - void *ram_ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
> - memcpy(ram_ptr, buf, l);
> + address_space_memcpy(qemu_map_ram_ptr(mr->ram_block, addr1),
> + buf, l);
> invalidate_and_set_dirty(mr, addr1, l);
> }
> len -= l;
> --
> 2.54.0
* Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
2026-06-12 11:03 ` [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region Gavin Shan
2026-06-12 11:22 ` Michael S. Tsirkin
@ 2026-06-12 14:05 ` Philippe Mathieu-Daudé
1 sibling, 0 replies; 5+ messages in thread
From: Philippe Mathieu-Daudé @ 2026-06-12 14:05 UTC (permalink / raw)
To: Gavin Shan, qemu-arm
Cc: qemu-devel, peterx, mst, peter.maydell, berrange, david, alex,
clg, pbonzini, phrdina, jugraham, shan.gavin
Hi Gavin,
On 12/6/26 13:03, Gavin Shan wrote:
> All ram device regions was turned to be indirectly accessible by commit
> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> to guest hang on compiling 'cuda-samples' as reported by Julia. The guest
> is started by the following command lines, with a GH100 GPU card.
>
> host$ lspci | grep GH100
> 0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
> host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64 \
> -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T \
> -accel kvm -cpu host -smp cpus=48 -m size=8G \
> -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0 \
> -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4 \
> -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
> :
> guest$ cd cuda-samples/build
> guest$ make -j 20 clean
> guest$ make -j 20
> :
> [ 54%] Linking CUDA executable graphMemoryNodes
> [ 54%] Built target graphMemoryNodes
> <no more output afterwards, guest becomes frozen here>
>
> guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
> [ 555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
>
> When the GPU's driver (NVidia open driver) is loaded on guest bootup,
> the memory blocks residing in the PCI BAR#4 can be presented to the
> guest through memory hot-add. The page cache can be allocated from the
> hot added memory blocks when cuda-samples is being compiled. Afterwards,
> the page cache is sent to QEMU's virtio-blk device as part of the DMA
> request, the bounce buffer has to be used to accomodate the request as
> the corresponding memory region (MemoryRegion) is a RAM DEVICE region
> and indirectly accessible in qemu. However, the max bounce bufer size
> is only 4096 bytes by default. We're running out of that space quickly.
>
> QEMU
> ====
> virtio_blk_handle_output
> virtio_blk_handle_vq
> virtio_blk_get_request
> virtqueue_pop
> virtqueue_split_pop
> virtqueue_map_desc
> address_space_map
> memory_access_is_direct # Return false
> memory_region_supports_direct_access
>
> (qemu) info mtree
> memory-region: pci_bridge_pci
> 0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
> 0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
> 0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
> 0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
>
> This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> accessors to ram device memory region, preparatory work to make ram device
> region directly accessible and bypass the bounce buffer in the DMA path
> in next patch.
>
> Reported-by: Julia Graham <jugraham@redhat.com>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> Suggested-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
> hw/remote/vfio-user-obj.c | 4 ++--
> include/system/memory.h | 42 ++++++++++++++++++++++++++++++++++++++-
> system/physmem.c | 8 ++++----
> 3 files changed, 47 insertions(+), 7 deletions(-)
>
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index 87fa7b6572..fe6f661fe2 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -375,9 +375,9 @@ static int vfu_object_mr_rw(MemoryRegion *mr, uint8_t *buf, hwaddr offset,
> ram_ptr = memory_region_get_ram_ptr(mr);
>
> if (is_write) {
> - memcpy((ram_ptr + offset), buf, size);
> + address_space_memcpy(ram_ptr + offset, buf, size);
> } else {
> - memcpy(buf, (ram_ptr + offset), size);
> + address_space_memcpy(buf, ram_ptr + offset, size);
> }
>
> return 0;
> diff --git a/include/system/memory.h b/include/system/memory.h
> index 1417132f6d..6bb2e13eea 100644
> --- a/include/system/memory.h
> +++ b/include/system/memory.h
> @@ -2938,6 +2938,46 @@ static inline bool memory_access_is_direct(const MemoryRegion *mr,
> return true;
> }
>
> +static inline void address_space_memcpy(void *dest, const void *src, size_t n)

An 'address_space_' prefix for something that uses neither
AddressSpace nor MemoryRegion is odd.

Maybe prefix with 'qemu_ram_' or 'qemu_ram_ptr_' instead? (since the
address is returned by memory_region_get_ram_ptr)

Add the definitions in "system/ramblock.h" with that declaration?

> +{
> + switch (n) {
> + case 1:
> + __builtin_memcpy(dest, src, 1);
> + break;
> + case 2:
> + __builtin_memcpy(dest, src, 2);
> + break;
> + case 4:
> + __builtin_memcpy(dest, src, 4);
> + break;
> + case 8:
> + __builtin_memcpy(dest, src, 8);
> + break;
> + default:
> + __builtin_memcpy(dest, src, n);
> + }
> +}
> +
> +static inline void address_space_memmove(void *dest, const void *src, size_t n)
> +{
> + switch (n) {
> + case 1:
> + __builtin_memmove(dest, src, 1);
> + break;
> + case 2:
> + __builtin_memmove(dest, src, 2);
> + break;
> + case 4:
> + __builtin_memmove(dest, src, 4);
> + break;
> + case 8:
> + __builtin_memmove(dest, src, 8);
> + break;
> + default:
> + __builtin_memmove(dest, src, n);
> + }
> +}
> +
> /**
> * address_space_read: read from an address space.
> *
> @@ -2970,7 +3010,7 @@ MemTxResult address_space_read(AddressSpace *as, hwaddr addr,
> mr = flatview_translate(fv, addr, &addr1, &l, false, attrs);
> if (len == l && memory_access_is_direct(mr, false, attrs)) {
> ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
> - memcpy(buf, ptr, len);
> + __builtin_memcpy(buf, ptr, len);
> } else {
> result = flatview_read_continue(fv, addr, attrs, buf, len,
> addr1, l, mr);
> diff --git a/system/physmem.c b/system/physmem.c
> index 7bcbf87573..5f46a9d676 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -3272,7 +3272,7 @@ static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
> uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
> false, true);
>
> - memmove(ram_ptr, buf, *l);
> + address_space_memmove(ram_ptr, buf, *l);
> invalidate_and_set_dirty(mr, mr_addr, *l);
>
> return MEMTX_OK;
> @@ -3365,7 +3365,7 @@ static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
> uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
> false, false);
>
> - memcpy(buf, ram_ptr, *l);
> + address_space_memcpy(buf, ram_ptr, *l);
>
> return MEMTX_OK;
> }
> @@ -3503,8 +3503,8 @@ MemTxResult address_space_write_rom(AddressSpace *as, hwaddr addr,
> l = memory_access_size(mr, l, addr1);
> } else {
> /* ROM/RAM case */
> - void *ram_ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
> - memcpy(ram_ptr, buf, l);
> + address_space_memcpy(qemu_map_ram_ptr(mr->ram_block, addr1),
> + buf, l);
> invalidate_and_set_dirty(mr, addr1, l);
> }
> len -= l;