* [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
@ 2026-06-08 0:18 Gavin Shan
2026-06-08 8:55 ` Daniel P. Berrangé
2026-06-10 9:49 ` Michael S. Tsirkin
0 siblings, 2 replies; 37+ messages in thread
From: Gavin Shan @ 2026-06-08 0:18 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm, mst, jugraham, shan.gavin
On the guest where a NVidia's GH100 card is passed from the host, the
guest system hang can be observed on attempt to compile 'cuda-samples',
as reported by Julia.
host$ lspci | grep GH100
0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64 -accel kvm \
-machine virt,gic-version=host,ras=on,highmem-mmio-size=4T \
-cpu host -smp cpus=32 -m size=8G \
-drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0 \
-device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4 \
-device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
guest$ cd cuda-samples/build
guest$ make -j 20 clean
guest$ make -j 20
:
[ 54%] Linking CUDA executable graphMemoryNodes
[ 54%] Built target graphMemoryNodes
<no more output afterwards, guest becomes frozen here>
guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
[ 555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
When the GPU's driver (NVidia open driver) is loaded on guest bootup,
the memory blocks residing in the PCI BAR can be presented to the guest
through memory hot-add. The page cache can be allocated from the hot added
memory blocks when cuda-samples is being built. Afterwards, the page cache
is sent to QEMU's virtio-blk device as part of the DMA request, the bounce
buffer is used to accommodate the request as the corresponding memory
region (MemoryRegion) is a RAM DEVICE region in qemu. For this specific
case, false is returned from memory_access_is_direct() in the path where
the DMA request is handled.
QEMU
====
virtio_blk_handle_output
virtio_blk_handle_vq
virtio_blk_get_request
virtqueue_pop
virtqueue_split_pop
virtqueue_map_desc
address_space_map
memory_access_is_direct # Return false
memory_region_supports_direct_access
(qemu) info mtree
:
memory-region: pci_bridge_pci
0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
By default, the max bounce buffer size is only 4096 bytes, even less
than one page when the guest page is 64KB. This tries to fix the issue
by inheriting the customized max bounce buffer size of the virtio bus's
parent through property 'x-max-bounce-buffer-size' when the customized
size is a larger one. With this applied, no guest system hang is seen
with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
Reported-by: Julia Graham <jugraham@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
hw/virtio/virtio-bus.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/hw/virtio/virtio-bus.c b/hw/virtio/virtio-bus.c
index cef944e015..e0933823f3 100644
--- a/hw/virtio/virtio-bus.c
+++ b/hw/virtio/virtio-bus.c
@@ -42,6 +42,7 @@ do { printf("virtio_bus: " fmt , ## __VA_ARGS__); } while (0)
/* A VirtIODevice is being plugged */
void virtio_bus_device_plugged(VirtIODevice *vdev, Error **errp)
{
+ AddressSpace *as;
DeviceState *qdev = DEVICE(vdev);
BusState *qbus = BUS(qdev_get_parent_bus(qdev));
VirtioBusState *bus = VIRTIO_BUS(qbus);
@@ -100,6 +101,19 @@ void virtio_bus_device_plugged(VirtIODevice *vdev, Error **errp)
return;
}
}
+ } else {
+ /*
+ * The maximal bounce buffer size of the virtio bus's parent may
+ * have been customized by property 'x-max-bounce-buffer-size'.
+ * Lets inherit the customized size if it's larger than the
+ * current one.
+ */
+ as = klass->get_dma_as ? klass->get_dma_as(qbus->parent) : NULL;
+ if (as) {
+ vdev->dma_as->max_bounce_buffer_size = MAX(
+ vdev->dma_as->max_bounce_buffer_size,
+ as->max_bounce_buffer_size);
+ }
}
}
--
2.54.0
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-08 0:18 [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible Gavin Shan
@ 2026-06-08 8:55 ` Daniel P. Berrangé
2026-06-08 11:11 ` Gavin Shan
2026-06-10 9:49 ` Michael S. Tsirkin
1 sibling, 1 reply; 37+ messages in thread
From: Daniel P. Berrangé @ 2026-06-08 8:55 UTC (permalink / raw)
To: Gavin Shan; +Cc: qemu-devel, qemu-arm, mst, jugraham, shan.gavin
On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
> On the guest where a NVidia's GH100 card is passed from the host, the
> guest system hang can be observed on attempt to compile 'cuda-samples',
> as reported by Julia.
snip
> By default, the max bounce buffer size is only 4096 bytes, even less
> than one page when the guest page is 64KB. This tries to fix the issue
> by inheriting the customized max bounce buffer size of the virtio bus's
> parent through property 'x-max-bounce-buffer-size' when the customized
> size is a larger one. With this applied, no guest system hang is seen
> with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
"x-max-bounce-buffer-size" is an experimental / unsupported property.
We really shouldn't be expecting users to have to set this in a production
deployment in order to stop a guest from hanging. Even if we dropped the
experimental marker from this property, users would still need to know to
provide this magic setting, so it would still be broken out of the box.
How can we get a solution that "just works" out of the box, which is
fully supported, not relying on experimental properties ?
>
> Reported-by: Julia Graham <jugraham@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
> hw/virtio/virtio-bus.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
With regards,
Daniel
--
|: https://berrange.com ~~ https://hachyderm.io/@berrange :|
|: https://libvirt.org ~~ https://entangle-photo.org :|
|: https://pixelfed.art/berrange ~~ https://fstop138.berrange.com :|
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-08 8:55 ` Daniel P. Berrangé
@ 2026-06-08 11:11 ` Gavin Shan
2026-06-08 11:38 ` Daniel P. Berrangé
2026-06-10 9:54 ` Pavel Hrdina
0 siblings, 2 replies; 37+ messages in thread
From: Gavin Shan @ 2026-06-08 11:11 UTC (permalink / raw)
To: Daniel P. Berrangé, Peter Xu
Cc: qemu-devel, qemu-arm, mst, jugraham, shan.gavin
Hi Daniel,
On 6/8/26 6:55 PM, Daniel P. Berrangé wrote:
> On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
>> On the guest where a NVidia's GH100 card is passed from the host, the
>> guest system hang can be observed on attempt to compile 'cuda-samples',
>> as reported by Julia.
>
> snip
>
Thanks for looking into this.
>> By default, the max bounce buffer size is only 4096 bytes, even less
>> than one page when the guest page is 64KB. This tries to fix the issue
>> by inheriting the customized max bounce buffer size of the virtio bus's
>> parent through property 'x-max-bounce-buffer-size' when the customized
>> size is a larger one. With this applied, no guest system hang is seen
>> with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
>
> "x-max-bounce-buffer-size" is an experimental / unsupported property.
>
> We really shouldn't be expecting users to have to set this in a production
> deployment in order to stop a guest from hanging. Even if we dropped the
> experimental marker from this property, users would still need to know to
> provide this magic setting, so it would still be broken out of the box.
>
> How can we get a solution that "just works" out of the box, which is
> fully supported, not relying on experimental properties ?
>
How do we know that "x-max-bounce-buffer-size" is an experimental or unsupported
property? I guess the properties whose names start with "x-" are all treated as
experimental and unsupported?
For this case, the bounce buffer is inevitable as the memory region can't be
directly accessed. The memory region is initialized by memory_region_init_ram_device_ptr()
in hw/vfio/region.c::vfio_region_mmap(). So the question is how the allowed
bounce buffer size can be specified by users, and it's why the existing property
"x-max-bounce-buffer-size" is reused.
I even thought of a new property for MachineState (e.g. "limited-bounce-buffer"),
which is set to on by default, following the existing behavior. When it's set to
off by users, the max (allowed) buffer size won't be checked at all. However, I'm
not sure if this makes sense at all.
>>
>> Reported-by: Julia Graham <jugraham@redhat.com>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>> hw/virtio/virtio-bus.c | 14 ++++++++++++++
>> 1 file changed, 14 insertions(+)
>
> With regards,
> Daniel
Thanks,
Gavin
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-08 11:11 ` Gavin Shan
@ 2026-06-08 11:38 ` Daniel P. Berrangé
2026-06-09 2:08 ` Gavin Shan
2026-06-10 9:54 ` Pavel Hrdina
1 sibling, 1 reply; 37+ messages in thread
From: Daniel P. Berrangé @ 2026-06-08 11:38 UTC (permalink / raw)
To: Gavin Shan; +Cc: Peter Xu, qemu-devel, qemu-arm, mst, jugraham, shan.gavin
On Mon, Jun 08, 2026 at 09:11:50PM +1000, Gavin Shan wrote:
> Hi Daniel,
>
> On 6/8/26 6:55 PM, Daniel P. Berrangé wrote:
> > On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
> > > On the guest where a NVidia's GH100 card is passed from the host, the
> > > guest system hang can be observed on attempt to compile 'cuda-samples',
> > > as reported by Julia.
> >
> > snip
> >
>
> Thanks for looking into this.
NB, I didn't really look into it beyond noticing the suggestion
that users set an "x-" property as a proposed solution to failing
to boot, which raised a red-flag to me from a usability POV.
I don't really know anything about the underlying technical problems
here, so can't offer specific guidance in that area.
>
> > > By default, the max bounce buffer size is only 4096 bytes, even less
> > > than one page when the guest page is 64KB. This tries to fix the issue
> > > by inheriting the customized max bounce buffer size of the virtio bus's
> > > parent through property 'x-max-bounce-buffer-size' when the customized
> > > size is a larger one. With this applied, no guest system hang is seen
> > > with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
> >
> > "x-max-bounce-buffer-size" is an experimental / unsupported property.
> >
> > We really shouldn't be expecting users to have to set this in a production
> > deployment in order to stop a guest from hanging. Even if we dropped the
> > experimental marker from this property, users would still need to know to
> > provide this magic setting, so it would still be broken out of the box.
> >
> > How can we get a solution that "just works" out of the box, which is
> > fully supported, not relying on experimental properties ?
> >
>
> How do we know that "x-max-bounce-buffer-size" is an experimental or unsupported
> property? I guess the properties whose names start with "x-" are all treated as
> experimental and unsupported?
Yes, any QEMU property starting with 'x-' is experimental/unstable/
unsupported and can be changed/withdrawn at any time. Libvirt will
not provide any way to configure 'x-' properties, as it requires a
supported/stable solution from QEMU.
With regards,
Daniel
--
|: https://berrange.com ~~ https://hachyderm.io/@berrange :|
|: https://libvirt.org ~~ https://entangle-photo.org :|
|: https://pixelfed.art/berrange ~~ https://fstop138.berrange.com :|
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-08 11:38 ` Daniel P. Berrangé
@ 2026-06-09 2:08 ` Gavin Shan
2026-06-09 16:25 ` Peter Xu
0 siblings, 1 reply; 37+ messages in thread
From: Gavin Shan @ 2026-06-09 2:08 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: Peter Xu, qemu-devel, qemu-arm, mst, jugraham, shan.gavin
On 6/8/26 9:38 PM, Daniel P. Berrangé wrote:
> On Mon, Jun 08, 2026 at 09:11:50PM +1000, Gavin Shan wrote:
>> Hi Daniel,
>>
>> On 6/8/26 6:55 PM, Daniel P. Berrangé wrote:
>>> On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
>>>> On the guest where a NVidia's GH100 card is passed from the host, the
>>>> guest system hang can be observed on attempt to compile 'cuda-samples',
>>>> as reported by Julia.
>>>
>>> snip
>>>
>>
>> Thanks for looking into this.
>
> NB, I didn't really look into it beyond noticing the suggestion
> that users set an "x-" property as a proposed solution to failing
> to boot, which raised a red-flag to me from a usability POV.
>
> I don't really know anything about the underlying technical problems
> here, so can't offer specific guidance in that area.
>
Ok, no worries, I got your points :-)
>>
>>>> By default, the max bounce buffer size is only 4096 bytes, even less
>>>> than one page when the guest page is 64KB. This tries to fix the issue
>>>> by inheriting the customized max bounce buffer size of the virtio bus's
>>>> parent through property 'x-max-bounce-buffer-size' when the customized
>>>> size is a larger one. With this applied, no guest system hang is seen
>>>> with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
>>>
>>> "x-max-bounce-buffer-size" is an experimental / unsupported property.
>>>
>>> We really shouldn't be expecting users to have to set this in a production
>>> deployment in order to stop a guest from hanging. Even if we dropped the
>>> experimental marker from this property, users would still need to know to
>>> provide this magic setting, so it would still be broken out of the box.
>>>
>>> How can we get a solution that "just works" out of the box, which is
>>> fully supported, not relying on experimental properties ?
>>>
>>
>> How do we know that "x-max-bounce-buffer-size" is an experimental or unsupported
>> property? I guess the properties whose names start with "x-" are all treated as
>> experimental and unsupported?
>
> Yes, any QEMU property starting with 'x-' is experimental/unstable/
> unsupported and can be changed/withdrawn at any time. Libvirt will
> not provide any way to configure 'x-' properties, as it requires a
> supported/stable solution from QEMU.
>
Yeah. Apart from the option of adding a new property to MachineState to disable
the check on the max bounce buffer size, we can also make the existing option
"x-max-bounce-buffer-size" officially supported by renaming it to
"max-bounce-buffer-size". Let's see what comments Michael or Peter will have.
> With regards,
> Daniel
Thanks,
Gavin
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-09 2:08 ` Gavin Shan
@ 2026-06-09 16:25 ` Peter Xu
2026-06-10 0:32 ` Gavin Shan
0 siblings, 1 reply; 37+ messages in thread
From: Peter Xu @ 2026-06-09 16:25 UTC (permalink / raw)
To: Gavin Shan
Cc: Daniel P. Berrangé, qemu-devel, qemu-arm, mst, jugraham,
shan.gavin
On Tue, Jun 09, 2026 at 12:08:34PM +1000, Gavin Shan wrote:
> Yeah. Apart from the option of adding a new property to MachineState to disable
> the check on the max bounce buffer size, we also can make this existing option
> "x-max-bounce-buffer-size" official and officially supported by renaming it to
> "max-bounce-buffer-size". Lets see what comments Michael or Peter will have.
IIUC updating max-bounce-buffer-size will be the last resort, because I
don't know how to properly define what is the correct value. When it's
prefixed with x- it's indeed more problematic..
Two pure questions..
Question 1:
I want to better understand the failure case. I don't yet understand why
it has anything to do with page size with the parameter. Say, shouldn't
virtio-blk's DMA requests in form of less-than-page-size, then normally it
should work even for 64k psize (as long as the total of buffers to map goes
beyond 4k)?
Maybe it's because there're a lot of concurrent IOs/DMAs hence it did use
more than that?
Question 2:
Quoting from commit message:
When the GPU's driver (NVidia open driver) is loaded on guest
bootup, the memory blocks residing in the PCI BAR can be presented
to the guest through memory hot-add. The page cache can be
allocated from the hot added memory blocks when cuda-samples is
being built. Afterwards, the page cache is sent to QEMU's virtio-blk
device as part of the DMA request, the bounce buffer is used to
accommodate the request as the corresponding memory region
(MemoryRegion) is a RAM DEVICE region in qemu. For this specific
case, false is returned from memory_access_is_direct() in the path
where the DMA request is handled.
I don't think I know well in this case, but if you say the PCI bars have
page cache in the back, does it mean that it should be directly accessible?
Maybe it's about this line:
/*
* RAM DEVICE regions can be accessed directly using memcpy, but it might
* be MMIO and access using mempy can be wrong (e.g., using instructions not
* intended for MMIO access). So we treat this as IO.
*/
return !memory_region_is_ram_device(mr);
But then my question is: if this is a legal case, can we loosen this check so
that we don't need to use bounce buffers at all?
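The decision being discussed can be reduced to a small model. The sketch below is a simplification under stated assumptions: it only distinguishes three kinds of region and ignores ROM devices, read-only regions and debug accesses. The enum and both function names are hypothetical and are not QEMU APIs; only the final "RAM device is not direct" rule is taken from the check quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical region kinds; real MemoryRegions carry more state. */
enum region_kind {
    REGION_MMIO,        /* plain I/O region, never directly accessible */
    REGION_RAM,         /* ordinary guest RAM */
    REGION_RAM_DEVICE   /* RAM-like mapping of device memory, e.g. a BAR */
};

/* Simplified model of the "supports direct access" decision. */
static bool supports_direct_access(enum region_kind kind)
{
    if (kind == REGION_MMIO) {
        return false;
    }
    /* Mirrors the quoted rule: RAM device regions are treated as I/O. */
    return kind != REGION_RAM_DEVICE;
}

/* A mapping request needs a bounce buffer when access is not direct. */
static bool needs_bounce_buffer(enum region_kind kind)
{
    return !supports_direct_access(kind);
}
```

In this model, loosening the check would amount to returning true for REGION_RAM_DEVICE, which is the change being asked about.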
Thanks,
--
Peter Xu
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-09 16:25 ` Peter Xu
@ 2026-06-10 0:32 ` Gavin Shan
0 siblings, 0 replies; 37+ messages in thread
From: Gavin Shan @ 2026-06-10 0:32 UTC (permalink / raw)
To: Peter Xu
Cc: Daniel P. Berrangé, qemu-devel, qemu-arm, mst, jugraham,
shan.gavin
Hi Peter,
On 6/10/26 2:25 AM, Peter Xu wrote:
> On Tue, Jun 09, 2026 at 12:08:34PM +1000, Gavin Shan wrote:
>> Yeah. Apart from the option of adding a new property to MachineState to disable
>> the check on the max bounce buffer size, we also can make this existing option
>> "x-max-bounce-buffer-size" official and officially supported by renaming it to
>> "max-bounce-buffer-size". Lets see what comments Michael or Peter will have.
>
> IIUC updating max-bounce-buffer-size will be the last resort, because I
> don't know how to properly define what is the correct value. When it's
> prefixed with x- it's indeed more problematic..
>
Ok, thanks for your confirmation. Let's rename 'x-max-bounce-buffer-size' to
'max-bounce-buffer-size' in the next revision. I plan to have two patches for this.
[PATCH 1/2] renames x-max-bounce-buffer-size to max-bounce-buffer-size
[PATCH 2/2] does what's done in this patch, inheriting 'max-bounce-buffer-size'
for virtio device from its bus parent
> Two pure questions..
>
> Question 1:
>
> I want to better understand the failure case. I don't yet understand why
> it has anything to do with page size with the parameter. Say, shouldn't
> virtio-blk's DMA requests in form of less-than-page-size, then normally it
> should work even for 64k psize (as long as the total of buffers to map goes
> beyond 4k)?
>
> Maybe it's because there're a lot of concurrent IOs/DMAs hence it did use
> more than that?
>
I think both affect the bounce buffer. In the failing case, the debugging
output indicates the length of the DMA request is 64KB while the max bounce buffer
size is only 4KB. I believe concurrent DMA requests also put pressure on the
bounce buffer.
In my failing cases, I got the following output from the debugging code below.
It shows that the length of the DMA request is 64KB, aligned to the guest
page size.
Output from qemu:
virtqueue_map_desc: PA=0x420025b0000, size=0x10000, current_PA=0x420025b1000
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 63e2faee99..c038a62717 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -1618,6 +1618,8 @@ static bool virtqueue_map_desc(VirtIODevice *vdev, unsigned int *p_num_sg,
{
bool ok = false;
unsigned num_sg = *p_num_sg;
+ hwaddr saved_pa = pa;
+ size_t saved_sz = sz;
assert(num_sg <= max_num_sg);
if (!sz) {
@@ -1641,6 +1643,9 @@ static bool virtqueue_map_desc(VirtIODevice *vdev, unsigned int *p_num_sg,
MEMTXATTRS_UNSPECIFIED);
if (!iov[num_sg].iov_base) {
virtio_error(vdev, "virtio: bogus descriptor or out of resources");
+ fprintf(stdout, "%s: PA=0x%lx, size=0x%lx, current_PA=0x%lx\n",
+ __func__, (unsigned long)saved_pa, (unsigned long)saved_sz,
+ (unsigned long)pa);
goto out;
}
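The debug line above fits a simple budget model: the descriptor starts at PA 0x420025b0000 and the failure is reported at 0x420025b1000, which is exactly 0x1000 (4096) bytes into the 0x10000 (64KB) request. Below is a minimal sketch of that behaviour, assuming each map call hands out at most what is left of a per-address-space bounce budget and that earlier chunks stay mapped until the whole descriptor has been mapped. The names bounce_budget, bounce_map and map_desc are hypothetical and are not QEMU APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a per-address-space bounce buffer budget. */
struct bounce_budget {
    size_t max;   /* plays the role of max_bounce_buffer_size */
    size_t used;  /* bytes currently handed out */
};

/*
 * Grant up to 'want' bytes from the budget and return how many bytes were
 * granted. A return value of 0 models a failed map.
 */
static size_t bounce_map(struct bounce_budget *b, size_t want)
{
    assert(b->used <= b->max);
    size_t avail = b->max - b->used;
    size_t got = want < avail ? want : avail;

    b->used += got;
    return got;
}

/*
 * Map one descriptor of 'sz' bytes starting at 'pa', chunk by chunk, without
 * releasing earlier chunks. Returns 0 on success, or -1 with *fail_pa set to
 * the address at which no more bounce space could be granted.
 */
static int map_desc(struct bounce_budget *b, uint64_t pa, size_t sz,
                    uint64_t *fail_pa)
{
    while (sz > 0) {
        size_t got = bounce_map(b, sz);

        if (got == 0) {
            *fail_pa = pa;
            return -1;
        }
        pa += got;
        sz -= got;
    }
    return 0;
}
```

With max set to 4096, the first chunk of a 0x10000-byte request at 0x420025b0000 takes the whole budget and the second chunk fails at 0x420025b1000, matching the reported current_PA. With max set to 268435456, the same request maps in a single chunk.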
> Question 2:
>
> Quoting from commit message:
>
> When the GPU's driver (NVidia open driver) is loaded on guest
> bootup, the memory blocks residing in the PCI BAR can be presented
> to the guest through memory hot-add. The page cache can be
> allocated from the hot added memory blocks when cuda-samples is
> being built. Afterwards, the page cache is sent to QEMU's virtio-blk
> device as part of the DMA request, the bounce buffer is used to
> accommodate the request as the corresponding memory region
> (MemoryRegion) is a RAM DEVICE region in qemu. For this specific
> case, false is returned from memory_access_is_direct() in the path
> where the DMA request is handled.
>
> I don't think I know well in this case, but if you say the PCI bars have
> page cache in the back, does it mean that it should be directly accessible?
> Maybe it's about this line:
>
> /*
> * RAM DEVICE regions can be accessed directly using memcpy, but it might
> * be MMIO and access using mempy can be wrong (e.g., using instructions not
> * intended for MMIO access). So we treat this as IO.
> */
> return !memory_region_is_ram_device(mr);
>
> But then my question is: if this is a legal case, can we loosen this check so
> that we don't need to use bounce buffers at all?
>
It's a good point. I did try bypassing the bounce buffer for this particular
memory region, and it worked for me. However, I don't think we're able to
do it because the memory region isn't directly accessible by nature. Accesses
to the memory region are handled by 'ram_device_mem_ops', where
{ldn, stn}_he_p() are used in its read/write handlers. They're different
from memcpy() since the data endianness is well handled in {ldn, stn}_he_p().
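The sized accessors mentioned above take an explicit access width rather than copying an arbitrary run of bytes. A rough sketch of that idea follows, assuming host-endian accesses of 1, 2, 4 or 8 bytes. load_sized and store_sized are hypothetical names loosely modelled on the ldn/stn helpers; this is not QEMU's implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Load 'size' bytes (1, 2, 4 or 8) at 'p' as one host-endian value.
 * A fixed-size memcpy is the portable way to express one access of that
 * width; compilers typically lower it to a single load.
 */
static uint64_t load_sized(const void *p, unsigned size)
{
    switch (size) {
    case 1: { uint8_t v;  memcpy(&v, p, sizeof(v)); return v; }
    case 2: { uint16_t v; memcpy(&v, p, sizeof(v)); return v; }
    case 4: { uint32_t v; memcpy(&v, p, sizeof(v)); return v; }
    case 8: { uint64_t v; memcpy(&v, p, sizeof(v)); return v; }
    default:
        assert(!"unsupported access size");
        return 0;
    }
}

/* Store the low 'size' bytes of 'val' at 'p' as one host-endian value. */
static void store_sized(void *p, unsigned size, uint64_t val)
{
    switch (size) {
    case 1: { uint8_t v = (uint8_t)val;   memcpy(p, &v, sizeof(v)); break; }
    case 2: { uint16_t v = (uint16_t)val; memcpy(p, &v, sizeof(v)); break; }
    case 4: { uint32_t v = (uint32_t)val; memcpy(p, &v, sizeof(v)); break; }
    case 8: { uint64_t v = val;           memcpy(p, &v, sizeof(v)); break; }
    default:
        assert(!"unsupported access size");
    }
}
```

The point of the sketch is only that every access has a declared width, so a 4-byte store never touches a fifth byte and a 2-byte store truncates its value to 16 bits.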
Thanks,
Gavin
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-08 0:18 [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible Gavin Shan
2026-06-08 8:55 ` Daniel P. Berrangé
@ 2026-06-10 9:49 ` Michael S. Tsirkin
2026-06-10 18:30 ` Stefan Hajnoczi
1 sibling, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2026-06-10 9:49 UTC (permalink / raw)
To: Gavin Shan
Cc: qemu-devel, qemu-arm, jugraham, shan.gavin, stefanha, qemu-block
On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
> On the guest where a NVidia's GH100 card is passed from the host, the
> guest system hang can be observed on attempt to compile 'cuda-samples',
> as reported by Julia.
>
> host$ lspci | grep GH100
> 0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
> host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64 -accel kvm \
> -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T \
> -cpu host -smp cpus=32 -m size=8G \
> -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0 \
> -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4 \
> -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
>
> guest$ cd cuda-samples/build
> guest$ make -j 20 clean
> guest$ make -j 20
> :
> [ 54%] Linking CUDA executable graphMemoryNodes
> [ 54%] Built target graphMemoryNodes
> <no more output afterwards, guest becomes frozen here>
>
> guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
> [ 555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
>
> When the GPU's driver (NVidia open driver) is loaded on guest bootup,
> the memory blocks residing in the PCI BAR can be presented to the guest
> through memory hot-add. The page cache can be allocated from the hot added
> memory blocks when cuda-samples is being built. Afterwards, the page cache
> is sent to QEMU's virtio-blk device as part of the DMA request, the bounce
> buffer is used to accommodate the request as the corresponding memory
> region (MemoryRegion) is a RAM DEVICE region in qemu. For this specific
> case, false is returned from memory_access_is_direct() in the path where
> the DMA request is handled.
>
> QEMU
> ====
> virtio_blk_handle_output
> virtio_blk_handle_vq
> virtio_blk_get_request
> virtqueue_pop
> virtqueue_split_pop
> virtqueue_map_desc
> address_space_map
> memory_access_is_direct # Return false
> memory_region_supports_direct_access
>
> (qemu) info mtree
> :
> memory-region: pci_bridge_pci
> 0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
> 0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
> 0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
> 0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
>
> By default, the max bounce buffer size is only 4096 bytes, even less
> than one page when the guest page is 64KB. This tries to fix the issue
> by inheriting the customized max bounce buffer size of the virtio bus's
> parent through property 'x-max-bounce-buffer-size' when the customized
> size is a larger one. With this applied, no guest system hang is seen
> with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
>
> Reported-by: Julia Graham <jugraham@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
> hw/virtio/virtio-bus.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/hw/virtio/virtio-bus.c b/hw/virtio/virtio-bus.c
> index cef944e015..e0933823f3 100644
> --- a/hw/virtio/virtio-bus.c
> +++ b/hw/virtio/virtio-bus.c
> @@ -42,6 +42,7 @@ do { printf("virtio_bus: " fmt , ## __VA_ARGS__); } while (0)
> /* A VirtIODevice is being plugged */
> void virtio_bus_device_plugged(VirtIODevice *vdev, Error **errp)
> {
> + AddressSpace *as;
> DeviceState *qdev = DEVICE(vdev);
> BusState *qbus = BUS(qdev_get_parent_bus(qdev));
> VirtioBusState *bus = VIRTIO_BUS(qbus);
> @@ -100,6 +101,19 @@ void virtio_bus_device_plugged(VirtIODevice *vdev, Error **errp)
> return;
> }
> }
> + } else {
> + /*
> + * The maximal bounce buffer size of the virtio bus's parent may
> + * have been customized by property 'x-max-bounce-buffer-size'.
> + * Lets inherit the customized size if it's larger than the
> + * current one.
> + */
> + as = klass->get_dma_as ? klass->get_dma_as(qbus->parent) : NULL;
> + if (as) {
> + vdev->dma_as->max_bounce_buffer_size = MAX(
> + vdev->dma_as->max_bounce_buffer_size,
> + as->max_bounce_buffer_size);
> + }
> }
> }
>
> --
> 2.54.0
Problem with all this is, users would not know how to size this.
So fundamentally, is not the issue that virtio blk (and scsi!) maps
all of the buffer all the time?
It's not hard to add something like virtio_pop_unmapped that would not map,
then build QEMUSGLists out of addr/len pairs and submit these.
Stefan, do you think doing it like this would be bad for perf? Good for
perf?
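The idea of collecting addr/len pairs without mapping them can be sketched as below. This is an illustration only: sg_entry, sg_list, sg_init and sg_add are hypothetical names for this sketch and are neither the proposed virtio_pop_unmapped nor QEMU's QEMUSGList API, and the fixed capacity is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SG_MAX 16  /* arbitrary capacity for this sketch */

/* One guest-physical range taken from a descriptor, not yet mapped. */
struct sg_entry {
    uint64_t addr;
    uint64_t len;
};

/* Hypothetical stand-in for a scatter-gather list of addr/len pairs. */
struct sg_list {
    struct sg_entry sg[SG_MAX];
    size_t nsg;     /* number of valid entries */
    uint64_t size;  /* total bytes described */
};

static void sg_init(struct sg_list *l)
{
    l->nsg = 0;
    l->size = 0;
}

/*
 * Record a range without mapping it. Returns 0 on success, -1 when the list
 * is full. Zero-length ranges are rejected here as a modelling choice.
 */
static int sg_add(struct sg_list *l, uint64_t addr, uint64_t len)
{
    assert(l->nsg <= SG_MAX);
    if (len == 0 || l->nsg == SG_MAX) {
        return -1;
    }
    l->sg[l->nsg].addr = addr;
    l->sg[l->nsg].len = len;
    l->nsg++;
    l->size += len;
    return 0;
}
```

In this model a 64KB descriptor that points into the device-backed region becomes a single entry and consumes no bounce space when the request is popped; any bouncing is deferred to whoever later consumes the list.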
--
MST
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-08 11:11 ` Gavin Shan
2026-06-08 11:38 ` Daniel P. Berrangé
@ 2026-06-10 9:54 ` Pavel Hrdina
2026-06-10 10:55 ` Gavin Shan
1 sibling, 1 reply; 37+ messages in thread
From: Pavel Hrdina @ 2026-06-10 9:54 UTC (permalink / raw)
To: Gavin Shan
Cc: Daniel P. Berrangé, Peter Xu, qemu-devel, qemu-arm, mst,
jugraham, shan.gavin
On Mon, Jun 08, 2026 at 09:11:50PM +1000, Gavin Shan wrote:
> Hi Daniel,
>
> On 6/8/26 6:55 PM, Daniel P. Berrangé wrote:
> > On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
> > > On the guest where a NVidia's GH100 card is passed from the host, the
> > > guest system hang can be observed on attempt to compile 'cuda-samples',
> > > as reported by Julia.
> >
> > snip
> >
>
> Thanks for looking into this.
>
> > > By default, the max bounce buffer size is only 4096 bytes, even less
> > > than one page when the guest page is 64KB. This tries to fix the issue
> > > by inheriting the customized max bounce buffer size of the virtio bus's
> > > parent through property 'x-max-bounce-buffer-size' when the customized
> > > size is a larger one. With this applied, no guest system hang is seen
> > > with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
> >
> > "x-max-bounce-buffer-size" is an experimental / unsupported property.
> >
> > We really shouldn't be expecting users to have to set this in a production
> > deployment in order to stop a guest from hanging. Even if we dropped the
> > experimental marker from this property, users would still need to know to
> > provide this magic setting, so it would still be broken out of the box.
> >
> > How can we get a solution that "just works" out of the box, which is
> > fully supported, not relying on experimental properties ?
> >
>
> How do we know that "x-max-bounce-buffer-size" is an experimental or unsupported
> property? I guess the properties whose names start with "x-" are all treated as
> experimental and unsupported?
>
> For this case, the bounce buffer is inevitable as the memory region can't be
> directly accessed. The memory region is initialized by memory_region_init_ram_device_ptr()
> in hw/vfio/region.c::vfio_region_mmap(). So the question is how the allowed
> bounce buffer size can be specified by users, and it's why the existing property
> "x-max-bounce-buffer-size" is reused.
>
> I even thought of a new property for MachineState (e.g. "limited-bounce-buffer"),
> which is set to on by default, following the existing behavior. When it's set to
> off by users, the max (allowed) buffer size won't be checked at all. However, I'm
> not sure if this makes sense at all.
Hi Gavin,
You did not answer the question that Daniel was asking, how will user
know that max-bounce-buffer-size should be used if it's necessary to fix
guest system hangs and how will user know what magic value should be set?
Pavel
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-10 9:54 ` Pavel Hrdina
@ 2026-06-10 10:55 ` Gavin Shan
2026-06-10 12:12 ` Michael S. Tsirkin
2026-06-10 12:23 ` Pavel Hrdina
0 siblings, 2 replies; 37+ messages in thread
From: Gavin Shan @ 2026-06-10 10:55 UTC (permalink / raw)
To: Pavel Hrdina
Cc: Daniel P. Berrangé, Peter Xu, qemu-devel, qemu-arm, mst,
jugraham, shan.gavin
Hi Pavel,
On 6/10/26 7:54 PM, Pavel Hrdina wrote:
> On Mon, Jun 08, 2026 at 09:11:50PM +1000, Gavin Shan wrote:
>> Hi Daniel,
>>
>> On 6/8/26 6:55 PM, Daniel P. Berrangé wrote:
>>> On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
>>>> On the guest where a NVidia's GH100 card is passed from the host, the
>>>> guest system hang can be observed on attempt to compile 'cuda-samples',
>>>> as reported by Julia.
>>>
>>> snip
>>>
>>
>> Thanks for looking into this.
>>
>>>> By default, the max bounce buffer size is only 4096 bytes, even less
>>>> than one page when the guest page is 64KB. This tries to fix the issue
>>>> by inheriting the customized max bounce buffer size of the virtio bus's
>>>> parent through property 'x-max-bounce-buffer-size' when the customized
>>>> size is a larger one. With this applied, no guest system hang is seen
>>>> with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
>>>
>>> "x-max-bounce-buffer-size" is an experimental / unsupported property.
>>>
>>> We really shouldn't be expecting users to have to set this in a production
>>> deployment in order to stop a guest from hanging. Even if we dropped the
>>> experimental marker from this property, users would still need to know to
>>> provide this magic setting, so it would still be broken out of the box.
>>>
>>> How can we get a solution that "just works" out of the box, which is
>>> fully supported, not relying on experimental properties ?
>>>
>>
>> How do we know that "x-max-bounce-buffer-size" is an experimental or unsupported
>> property? I guess the properties whose names start with "x-" are all treated as
>> experimental and unsupported?
>>
>> For this case, the bounce buffer is inevitable as the memory region can't be
>> directly accessed. The memory region is initialized by memory_region_init_ram_device_ptr()
>> in hw/vfio/region.c::vfio_region_mmap(). So the question is how the allowed
>> bounce buffer size can be specified by users, and it's why the existing property
>> "x-max-bounce-buffer-size" is reused.
>>
>> I even thought of a new property for MachineState (e.g. "limited-bounce-buffer"),
>> which is set to on by default, following the existing behavior. When it's set to
>> off by users, the max (allowed) buffer size won't be checked at all. However, I'm
>> not sure if this makes sense at all.
>
> Hi Gavin,
>
> You did not answer the question that Daniel was asking, how will user
> know that max-bounce-buffer-size should be used if it's necessary to fix
> guest system hangs and how will user know what magic value should be set?
>

Sorry, I missed answering Daniel's questions. For this specific case,
users need to enlarge the bounce buffer size when they see the following
error message. We can add a more explicit one in address_space_map() if the
existing error message isn't obvious enough.

    qemu-system-aarch64: virtio: bogus descriptor or out of resources

    void *address_space_map(AddressSpace *as,
                            hwaddr addr,
                            hwaddr *plen,
                            bool is_write,
                            MemTxAttrs attrs)
    {
        :
        if (!memory_access_is_direct(mr, is_write, attrs)) {
            :
            if (l == 0) {
                error_report("Running out of bounce buffer size, enlarge it with max-bounce-buffer-size");
                *plen = 0;
                return NULL;
            }
            :
        }
        :
    }

As for the value to use for max-bounce-buffer-size, it really is case by case
and up to the user. Users need to try 4096, 8192, ..., 0xFFFFFFFF to find the
smallest value that works for them. The worst case is to set 0xFFFFFFFF.
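
For illustration only, here is a minimal sketch of why the default limit
runs out so quickly. It assumes the accounting behaves like a simple atomic
budget: each mapping reserves MIN(limit - used, requested) bytes and fails
once that is 0. BounceBudget, bounce_reserve() and bounce_release() are
hypothetical names for this model, not QEMU APIs, and the C11 atomics stand
in for whatever primitives the real code uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the two per-address-space accounting fields. */
typedef struct {
    size_t max_bounce_buffer_size;      /* the limit, 4096 by default in this thread */
    _Atomic size_t bounce_buffer_size;  /* bytes currently reserved */
} BounceBudget;

/*
 * Reserve up to 'want' bytes. Returns the number of bytes granted, which
 * may be smaller than 'want' and is 0 once the budget is exhausted.
 */
static size_t bounce_reserve(BounceBudget *b, size_t want)
{
    size_t used = atomic_load(&b->bounce_buffer_size);

    for (;;) {
        size_t avail = b->max_bounce_buffer_size - used;
        size_t alloc = want < avail ? want : avail;

        /* On failure 'used' is refreshed with the current value; retry. */
        if (atomic_compare_exchange_weak(&b->bounce_buffer_size,
                                         &used, used + alloc)) {
            return alloc;
        }
    }
}

/* Give 'len' previously reserved bytes back to the budget. */
static void bounce_release(BounceBudget *b, size_t len)
{
    assert(len <= atomic_load(&b->bounce_buffer_size));
    atomic_fetch_sub(&b->bounce_buffer_size, len);
}
```

In this model, with a 4096-byte limit, a request for one 64KB page is granted
only 4096 bytes, and any further request made before that reservation is
released is granted 0 bytes, which corresponds to the 'l == 0' failure path
above.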
> Pavel
>
Thanks,
Gavin
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-10 10:55 ` Gavin Shan
@ 2026-06-10 12:12 ` Michael S. Tsirkin
2026-06-10 12:19 ` Gavin Shan
2026-06-10 12:23 ` Pavel Hrdina
1 sibling, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2026-06-10 12:12 UTC (permalink / raw)
To: Gavin Shan
Cc: Pavel Hrdina, Daniel P. Berrangé, Peter Xu, qemu-devel,
qemu-arm, jugraham, shan.gavin
On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
> Hi Pavel,
>
> On 6/10/26 7:54 PM, Pavel Hrdina wrote:
> > On Mon, Jun 08, 2026 at 09:11:50PM +1000, Gavin Shan wrote:
> > > Hi Daniel,
> > >
> > > On 6/8/26 6:55 PM, Daniel P. Berrangé wrote:
> > > > On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
> > > > > On the guest where a NVidia's GH100 card is passed from the host, the
> > > > > guest system hang can be observed on attempt to compile 'cuda-samples',
> > > > > as reported by Julia.
> > > >
> > > > snip
> > > >
> > >
> > > Thanks for looking into this.
> > >
> > > > > By default, the max bounce buffer size is only 4096 bytes, even less
> > > > > than one page when the guest page is 64KB. This tries to fix the issue
> > > > > by inheriting the customized max bounce buffer size of the virtio bus's
> > > > > parent through property 'x-max-bounce-buffer-size' when the customized
> > > > > size is a larger one. With this applied, no guest system hang is seen
> > > > > with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
> > > >
> > > > "x-max-bounce-buffer-size" is an experimental / unsupported property.
> > > >
> > > > We really shouldn't be expecting users to have to set this in a production
> > > > deployment in order to stop a guest from hanging. Even if we dropped the
> > > > experimental marker from this property, users would still need to know to
> > > > provide this magic setting, so it would still be broken out of the box.
> > > >
> > > > How can we get a solution that "just works" out of the box, which is
> > > > fully supported, not relying on experimental properties ?
> > > >
> > >
> > > How do we know that "x-max-bounce-buffer-size" is an experimental or unsupported
> > > property? I guess the properties whose names start with "x-" are all treated as
> > > experimental and unsupported?
> > >
> > > For this case, the bounce buffer is inevitable as the memory region can't be
> > > directly accessed. The memory region is initialized by memory_region_init_ram_device_ptr()
> > > in hw/vfio/region.c::vfio_region_mmap(). So the question is how the allowed
> > > bounce buffer size can be specified by users, and it's why the existing property
> > > "x-max-bounce-buffer-size" is reused.
> > >
> > > I even thought of a new property for MachineState (e.g. "limited-bounce-buffer"),
> > > which is set to on by default, following the existing behavior. When it's set to
> > > off by users, the max (allowed) buffer size won't be checked at all. However, I'm
> > > not sure if this makes sense at all.
> >
> > Hi Gavin,
> >
> > You did not answer the question that Daniel was asking, how will user
> > know that max-bounce-buffer-size should be used if it's necessary to fix
> > guest system hangs and how will user know what magic value should be set?
> >
>
> Sorry that I missed to answer Daniel's questions. For this specific case,
> user need to enlarge the bounce buffer size when seeing the following error
> message. We can add an explicit one in address_space_map() if the existing
> error message isn't obvious.
>
> qemu-system-aarch64: virtio: bogus descriptor or out of resources
>
> void *address_space_map(AddressSpace *as,
> hwaddr addr,
> hwaddr *plen,
> bool is_write,
> MemTxAttrs attrs)
> {
> if (!memory_access_is_direct(mr, is_write, attrs)) {
> if (l == 0) {
> error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
> *plen = 0;
> return NULL;
> }
> }
>
> As to the value user should take for max-bounce-buffer-size, it is really case by case
> and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
> smallest value works for them. The worst case is to set 0xFFFFFFFF.
>
> > Pavel
> >
>
> Thanks,
> Gavin

This is not at all reasonable. All kinds of fixes are possible, but
fundamentally, a bounce-buffering data path is by itself already a
bad idea.

I have no idea what bounce buffering device RAM accomplishes.

In the end, QEMU still simply copies the data to/from the buffer.

My suggestion is to first of all look for ways to mark the
memory as direct.

--
MST
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-10 12:12 ` Michael S. Tsirkin
@ 2026-06-10 12:19 ` Gavin Shan
2026-06-10 12:27 ` Michael S. Tsirkin
0 siblings, 1 reply; 37+ messages in thread
From: Gavin Shan @ 2026-06-10 12:19 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Pavel Hrdina, Daniel P. Berrangé, Peter Xu, qemu-devel,
qemu-arm, jugraham, shan.gavin
Hi Michael,
On 6/10/26 10:12 PM, Michael S. Tsirkin wrote:
> On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
>> On 6/10/26 7:54 PM, Pavel Hrdina wrote:
[...]
>>>
>>> You did not answer the question that Daniel was asking, how will user
>>> know that max-bounce-buffer-size should be used if it's necessary to fix
>>> guest system hangs and how will user know what magic value should be set?
>>>
>>
>> Sorry that I missed to answer Daniel's questions. For this specific case,
>> user need to enlarge the bounce buffer size when seeing the following error
>> message. We can add an explicit one in address_space_map() if the existing
>> error message isn't obvious.
>>
>> qemu-system-aarch64: virtio: bogus descriptor or out of resources
>>
>> void *address_space_map(AddressSpace *as,
>> hwaddr addr,
>> hwaddr *plen,
>> bool is_write,
>> MemTxAttrs attrs)
>> {
>> if (!memory_access_is_direct(mr, is_write, attrs)) {
>> if (l == 0) {
>> error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
>> *plen = 0;
>> return NULL;
>> }
>> }
>>
>> As to the value user should take for max-bounce-buffer-size, it is really case by case
>> and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
>> smallest value works for them. The worst case is to set 0xFFFFFFFF.
>>
>
>
> This is not at all reasonable. All kind of fixes are possible but
> fundamentally, bounce buffering data path is by itself already a
> bad idea.
>
> I have no idea what does bounce buffering device ram accomplish.
>
> In the end, qemu still simply reads the memory from/to the buffer.
>
> My suggestion is to first of all look for ways to mark the
> memory as direct.
>

As I explained to Peter Xu in another reply, we can't simply mark the (RAM
DEVICE) memory region as directly accessible. The memory region is initialized
by memory_region_init_ram_device_ptr() in hw/vfio/region.c::vfio_region_mmap().

Accesses to the memory region are handled by 'ram_device_mem_ops', whose
read/write handlers use {ldn, stn}_he_p(). They differ from memcpy()
because {ldn, stn}_he_p() take care of the data endianness.

Thanks,
Gavin
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-10 10:55 ` Gavin Shan
2026-06-10 12:12 ` Michael S. Tsirkin
@ 2026-06-10 12:23 ` Pavel Hrdina
2026-06-10 14:04 ` Gavin Shan
1 sibling, 1 reply; 37+ messages in thread
From: Pavel Hrdina @ 2026-06-10 12:23 UTC (permalink / raw)
To: Gavin Shan
Cc: Daniel P. Berrangé, Peter Xu, qemu-devel, qemu-arm, mst,
jugraham, shan.gavin
On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
> Hi Pavel,
>
> On 6/10/26 7:54 PM, Pavel Hrdina wrote:
> > On Mon, Jun 08, 2026 at 09:11:50PM +1000, Gavin Shan wrote:
> > > Hi Daniel,
> > >
> > > On 6/8/26 6:55 PM, Daniel P. Berrangé wrote:
> > > > On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
> > > > > On the guest where a NVidia's GH100 card is passed from the host, the
> > > > > guest system hang can be observed on attempt to compile 'cuda-samples',
> > > > > as reported by Julia.
> > > >
> > > > snip
> > > >
> > >
> > > Thanks for looking into this.
> > >
> > > > > By default, the max bounce buffer size is only 4096 bytes, even less
> > > > > than one page when the guest page is 64KB. This tries to fix the issue
> > > > > by inheriting the customized max bounce buffer size of the virtio bus's
> > > > > parent through property 'x-max-bounce-buffer-size' when the customized
> > > > > size is a larger one. With this applied, no guest system hang is seen
> > > > > with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
> > > >
> > > > "x-max-bounce-buffer-size" is an experimental / unsupported property.
> > > >
> > > > We really shouldn't be expecting users to have to set this in a production
> > > > deployment in order to stop a guest from hanging. Even if we dropped the
> > > > experimental marker from this property, users would still need to know to
> > > > provide this magic setting, so it would still be broken out of the box.
> > > >
> > > > How can we get a solution that "just works" out of the box, which is
> > > > fully supported, not relying on experimental properties ?
> > > >
> > >
> > > How do we know that "x-max-bounce-buffer-size" is an experimental or unsupported
> > > property? I guess the properties whose names start with "x-" are all treated as
> > > experimental and unsupported?
> > >
> > > For this case, the bounce buffer is inevitable as the memory region can't be
> > > directly accessed. The memory region is initialized by memory_region_init_ram_device_ptr()
> > > in hw/vfio/region.c::vfio_region_mmap(). So the question is how the allowed
> > > bounce buffer size can be specified by users, and it's why the existing property
> > > "x-max-bounce-buffer-size" is reused.
> > >
> > > I even thought of a new property for MachineState (e.g. "limited-bounce-buffer"),
> > > which is set to on by default, following the existing behavior. When it's set to
> > > off by users, the max (allowed) buffer size won't be checked at all. However, I'm
> > > not sure if this makes sense at all.
> >
> > Hi Gavin,
> >
> > You did not answer the question that Daniel was asking, how will user
> > know that max-bounce-buffer-size should be used if it's necessary to fix
> > guest system hangs and how will user know what magic value should be set?
> >
>
> Sorry that I missed to answer Daniel's questions. For this specific case,
> user need to enlarge the bounce buffer size when seeing the following error
> message. We can add an explicit one in address_space_map() if the existing
> error message isn't obvious.
>
> qemu-system-aarch64: virtio: bogus descriptor or out of resources
>
> void *address_space_map(AddressSpace *as,
> hwaddr addr,
> hwaddr *plen,
> bool is_write,
> MemTxAttrs attrs)
> {
> if (!memory_access_is_direct(mr, is_write, attrs)) {
> if (l == 0) {
> error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
> *plen = 0;
> return NULL;
> }
> }
This may work when using qemu directly but users will not see this error
when using libvirt or management tools like kubevirt.
> As to the value user should take for max-bounce-buffer-size, it is really case by case
> and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
> smallest value works for them. The worst case is to set 0xFFFFFFFF.

Playing a guessing game to figure out how to make a VM work doesn't sound
like a pleasant user experience, and again it will most likely not work for
kubevirt, where users are usually not exposed to these low-level properties.

I'm not familiar with the internals, but isn't there a better way to
solve it without requiring users to guess which value works?

Pavel
> > Pavel
> >
>
> Thanks,
> Gavin
>
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-10 12:19 ` Gavin Shan
@ 2026-06-10 12:27 ` Michael S. Tsirkin
2026-06-10 13:00 ` Gavin Shan
0 siblings, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2026-06-10 12:27 UTC (permalink / raw)
To: Gavin Shan
Cc: Pavel Hrdina, Daniel P. Berrangé, Peter Xu, qemu-devel,
qemu-arm, jugraham, shan.gavin
On Wed, Jun 10, 2026 at 10:19:31PM +1000, Gavin Shan wrote:
> Hi Michael,
>
> On 6/10/26 10:12 PM, Michael S. Tsirkin wrote:
> > On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
> > > On 6/10/26 7:54 PM, Pavel Hrdina wrote:
>
> [...]
>
> > > >
> > > > You did not answer the question that Daniel was asking, how will user
> > > > know that max-bounce-buffer-size should be used if it's necessary to fix
> > > > guest system hangs and how will user know what magic value should be set?
> > > >
> > >
> > > Sorry that I missed to answer Daniel's questions. For this specific case,
> > > user need to enlarge the bounce buffer size when seeing the following error
> > > message. We can add an explicit one in address_space_map() if the existing
> > > error message isn't obvious.
> > >
> > > qemu-system-aarch64: virtio: bogus descriptor or out of resources
> > >
> > > void *address_space_map(AddressSpace *as,
> > > hwaddr addr,
> > > hwaddr *plen,
> > > bool is_write,
> > > MemTxAttrs attrs)
> > > {
> > > if (!memory_access_is_direct(mr, is_write, attrs)) {
> > > if (l == 0) {
> > > error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
> > > *plen = 0;
> > > return NULL;
> > > }
> > > }
> > >
> > > As to the value user should take for max-bounce-buffer-size, it is really case by case
> > > and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
> > > smallest value works for them. The worst case is to set 0xFFFFFFFF.
> > >
> >
> >
> > This is not at all reasonable. All kind of fixes are possible but
> > fundamentally, bounce buffering data path is by itself already a
> > bad idea.
> >
> > I have no idea what does bounce buffering device ram accomplish.
> >
> > In the end, qemu still simply reads the memory from/to the buffer.
> >
> > My suggestion is to first of all look for ways to mark the
> > memory as direct.
> >
>
> As I explained to Peter Xu in another reply, we can't simply mark the (RAM
> DEVICE) memory region is directly accessible. The memory region is initialized
> by memory_region_init_ram_device_ptr() in hw/vfio/region.c::vfio_region_mmap().
>
> The accesses to the memory region is handled by 'ram_device_mem_ops' where
> {ldn, stn}_he_p() are used in its read/write handler. They're different
> from memcpy() since the data endianness is well handled in {ldn, stn}_he_p().
>
> Thanks,
> Gavin
>
What is endianness set to, for this region?
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-10 12:27 ` Michael S. Tsirkin
@ 2026-06-10 13:00 ` Gavin Shan
2026-06-10 13:54 ` Gavin Shan
0 siblings, 1 reply; 37+ messages in thread
From: Gavin Shan @ 2026-06-10 13:00 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Pavel Hrdina, Daniel P. Berrangé, Peter Xu, qemu-devel,
qemu-arm, jugraham, shan.gavin
On 6/10/26 10:27 PM, Michael S. Tsirkin wrote:
> On Wed, Jun 10, 2026 at 10:19:31PM +1000, Gavin Shan wrote:
>> On 6/10/26 10:12 PM, Michael S. Tsirkin wrote:
>>> On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
>>>> On 6/10/26 7:54 PM, Pavel Hrdina wrote:
>>
>> [...]
>>
>>>>>
>>>>> You did not answer the question that Daniel was asking, how will user
>>>>> know that max-bounce-buffer-size should be used if it's necessary to fix
>>>>> guest system hangs and how will user know what magic value should be set?
>>>>>
>>>>
>>>> Sorry that I missed to answer Daniel's questions. For this specific case,
>>>> user need to enlarge the bounce buffer size when seeing the following error
>>>> message. We can add an explicit one in address_space_map() if the existing
>>>> error message isn't obvious.
>>>>
>>>> qemu-system-aarch64: virtio: bogus descriptor or out of resources
>>>>
>>>> void *address_space_map(AddressSpace *as,
>>>> hwaddr addr,
>>>> hwaddr *plen,
>>>> bool is_write,
>>>> MemTxAttrs attrs)
>>>> {
>>>> if (!memory_access_is_direct(mr, is_write, attrs)) {
>>>> if (l == 0) {
>>>> error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
>>>> *plen = 0;
>>>> return NULL;
>>>> }
>>>> }
>>>>
>>>> As to the value user should take for max-bounce-buffer-size, it is really case by case
>>>> and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
>>>> smallest value works for them. The worst case is to set 0xFFFFFFFF.
>>>>
>>>
>>>
>>> This is not at all reasonable. All kind of fixes are possible but
>>> fundamentally, bounce buffering data path is by itself already a
>>> bad idea.
>>>
>>> I have no idea what does bounce buffering device ram accomplish.
>>>
>>> In the end, qemu still simply reads the memory from/to the buffer.
>>>
>>> My suggestion is to first of all look for ways to mark the
>>> memory as direct.
>>>
>>
>> As I explained to Peter Xu in another reply, we can't simply mark the (RAM
>> DEVICE) memory region is directly accessible. The memory region is initialized
>> by memory_region_init_ram_device_ptr() in hw/vfio/region.c::vfio_region_mmap().
>>
>> The accesses to the memory region is handled by 'ram_device_mem_ops' where
>> {ldn, stn}_he_p() are used in its read/write handler. They're different
>> from memcpy() since the data endianness is well handled in {ldn, stn}_he_p().
>>
>> Thanks,
>> Gavin
>>
>
> What is endianness set to, for this region?
>

The endianness of the memory region is set to that of the host.

    static const MemoryRegionOps ram_device_mem_ops = {
        .read = memory_region_ram_device_read,
        .write = memory_region_ram_device_write,
        .endianness = HOST_BIG_ENDIAN ? DEVICE_BIG_ENDIAN : DEVICE_LITTLE_ENDIAN,
    };
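
To illustrate what host endianness implies for the data, here is a minimal
sketch, assuming helpers in the spirit of {ldn, stn}_he_p().
load_host_endian() and store_host_endian() are hypothetical names for this
example, not the QEMU functions, and the sketch says nothing about the access
width or instructions the real handlers use on device memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load 'size' bytes (1, 2, 4 or 8) in host byte order. */
static uint64_t load_host_endian(const void *ptr, unsigned size)
{
    assert(size == 1 || size == 2 || size == 4 || size == 8);

    switch (size) {
    case 1: { uint8_t v;  memcpy(&v, ptr, sizeof(v)); return v; }
    case 2: { uint16_t v; memcpy(&v, ptr, sizeof(v)); return v; }
    case 4: { uint32_t v; memcpy(&v, ptr, sizeof(v)); return v; }
    default: { uint64_t v; memcpy(&v, ptr, sizeof(v)); return v; }
    }
}

/* Hypothetical helper: store the low 'size' bytes of 'val' in host byte order. */
static void store_host_endian(void *ptr, unsigned size, uint64_t val)
{
    assert(size == 1 || size == 2 || size == 4 || size == 8);

    switch (size) {
    case 1: { uint8_t v = (uint8_t)val;   memcpy(ptr, &v, sizeof(v)); break; }
    case 2: { uint16_t v = (uint16_t)val; memcpy(ptr, &v, sizeof(v)); break; }
    case 4: { uint32_t v = (uint32_t)val; memcpy(ptr, &v, sizeof(v)); break; }
    default: { uint64_t v = val;          memcpy(ptr, &v, sizeof(v)); break; }
    }
}
```

Because no byte swap is involved, a value stored this way has the same bytes
in memory as the native variable it came from, on either a little-endian or a
big-endian host.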
Thanks,
Gavin
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-10 13:00 ` Gavin Shan
@ 2026-06-10 13:54 ` Gavin Shan
2026-06-10 14:06 ` Michael S. Tsirkin
0 siblings, 1 reply; 37+ messages in thread
From: Gavin Shan @ 2026-06-10 13:54 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Pavel Hrdina, Daniel P. Berrangé, Peter Xu, qemu-devel,
qemu-arm, jugraham, shan.gavin
Hi Michael and Peter,
On 6/10/26 11:00 PM, Gavin Shan wrote:
> On 6/10/26 10:27 PM, Michael S. Tsirkin wrote:
>> On Wed, Jun 10, 2026 at 10:19:31PM +1000, Gavin Shan wrote:
>>> On 6/10/26 10:12 PM, Michael S. Tsirkin wrote:
>>>> On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
>>>>> On 6/10/26 7:54 PM, Pavel Hrdina wrote:
>>>
>>> [...]
>>>
>>>>>>
>>>>>> You did not answer the question that Daniel was asking, how will user
>>>>>> know that max-bounce-buffer-size should be used if it's necessary to fix
>>>>>> guest system hangs and how will user know what magic value should be set?
>>>>>>
>>>>>
>>>>> Sorry that I missed to answer Daniel's questions. For this specific case,
>>>>> user need to enlarge the bounce buffer size when seeing the following error
>>>>> message. We can add an explicit one in address_space_map() if the existing
>>>>> error message isn't obvious.
>>>>>
>>>>> qemu-system-aarch64: virtio: bogus descriptor or out of resources
>>>>>
>>>>> void *address_space_map(AddressSpace *as,
>>>>> hwaddr addr,
>>>>> hwaddr *plen,
>>>>> bool is_write,
>>>>> MemTxAttrs attrs)
>>>>> {
>>>>> if (!memory_access_is_direct(mr, is_write, attrs)) {
>>>>> if (l == 0) {
>>>>> error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
>>>>> *plen = 0;
>>>>> return NULL;
>>>>> }
>>>>> }
>>>>>
>>>>> As to the value user should take for max-bounce-buffer-size, it is really case by case
>>>>> and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
>>>>> smallest value works for them. The worst case is to set 0xFFFFFFFF.
>>>>>
>>>>
>>>>
>>>> This is not at all reasonable. All kind of fixes are possible but
>>>> fundamentally, bounce buffering data path is by itself already a
>>>> bad idea.
>>>>
>>>> I have no idea what does bounce buffering device ram accomplish.
>>>>
>>>> In the end, qemu still simply reads the memory from/to the buffer.
>>>>
>>>> My suggestion is to first of all look for ways to mark the
>>>> memory as direct.
>>>>
>>>
>>> As I explained to Peter Xu in another reply, we can't simply mark the (RAM
>>> DEVICE) memory region is directly accessible. The memory region is initialized
>>> by memory_region_init_ram_device_ptr() in hw/vfio/region.c::vfio_region_mmap().
>>>
>>> The accesses to the memory region is handled by 'ram_device_mem_ops' where
>>> {ldn, stn}_he_p() are used in its read/write handler. They're different
>>> from memcpy() since the data endianness is well handled in {ldn, stn}_he_p().
>>>
>>> Thanks,
>>> Gavin
>>>
>>
>> What is endianness set to, for this region?
>>
>
> The endianness of the memory region is set to that for the host.
>
> static const MemoryRegionOps ram_device_mem_ops = {
> .read = memory_region_ram_device_read,
> .write = memory_region_ram_device_write,
> .endianness = HOST_BIG_ENDIAN ? DEVICE_BIG_ENDIAN : DEVICE_LITTLE_ENDIAN,
> };
>

How about treating the RAM DEVICE memory region as directly accessible in
address_space_map() only when HOST_BIG_ENDIAN is false? Something like the
change below; I don't hit the guest hang issue with it.

diff --git a/include/system/memory.h b/include/system/memory.h
index 1417132f6d..9daca55251 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -2908,7 +2908,8 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr);
int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr);
bool prepare_mmio_access(MemoryRegion *mr);
-static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
+static inline bool memory_region_supports_direct_access(const MemoryRegion *mr,
+ bool check_ram_device)
{
/* ROM DEVICE regions only allow direct access if in ROMD mode. */
if (memory_region_is_romd(mr)) {
@@ -2922,13 +2923,14 @@ static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
* be MMIO and access using mempy can be wrong (e.g., using instructions not
* intended for MMIO access). So we treat this as IO.
*/
- return !memory_region_is_ram_device(mr);
+ return (!check_ram_device || !memory_region_is_ram_device(mr));
}
static inline bool memory_access_is_direct(const MemoryRegion *mr,
+ bool check_ram_device,
bool is_write, MemTxAttrs attrs)
{
- if (!memory_region_supports_direct_access(mr)) {
+ if (!memory_region_supports_direct_access(mr, check_ram_device)) {
return false;
}
diff --git a/system/physmem.c b/system/physmem.c
index 7bcbf87573..2e6b72b124 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3724,7 +3724,7 @@ void *address_space_map(AddressSpace *as,
fv = address_space_to_flatview(as);
mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
- if (!memory_access_is_direct(mr, is_write, attrs)) {
+ if (!memory_access_is_direct(mr, HOST_BIG_ENDIAN, is_write, attrs)) {
size_t used = qatomic_read(&as->bounce_buffer_size);
for (;;) {
hwaddr alloc = MIN(as->max_bounce_buffer_size - used, l);
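
The effect of the extra 'check_ram_device' parameter can be modeled in
isolation. This is a simplified sketch, assuming only the RAM and RAM DEVICE
attributes matter (the ROMD and read-only handling is left out). RegionModel
and supports_direct_access() are hypothetical names for this example, not the
QEMU types or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for the region attributes the check uses. */
typedef struct {
    bool is_ram;         /* backed by host memory */
    bool is_ram_device;  /* host memory that may really be device MMIO */
} RegionModel;

/*
 * Non-RAM regions are never direct. RAM DEVICE regions are direct only
 * when the caller passes check_ram_device == false.
 */
static bool supports_direct_access(const RegionModel *mr, bool check_ram_device)
{
    assert(mr != NULL);

    if (!mr->is_ram) {
        return false;
    }
    return !check_ram_device || !mr->is_ram_device;
}
```

Passing HOST_BIG_ENDIAN as 'check_ram_device', as the diff does, means that on
a little-endian host a RAM DEVICE region is treated as direct and the bounce
buffer is skipped, while on a big-endian host the existing behavior is kept.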
Thanks,
Gavin
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-10 12:23 ` Pavel Hrdina
@ 2026-06-10 14:04 ` Gavin Shan
2026-06-10 14:08 ` Michael S. Tsirkin
0 siblings, 1 reply; 37+ messages in thread
From: Gavin Shan @ 2026-06-10 14:04 UTC (permalink / raw)
To: Pavel Hrdina
Cc: Daniel P. Berrangé, Peter Xu, qemu-devel, qemu-arm, mst,
jugraham, shan.gavin
On 6/10/26 10:23 PM, Pavel Hrdina wrote:
> On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
>> Hi Pavel,
>>
>> On 6/10/26 7:54 PM, Pavel Hrdina wrote:
>>> On Mon, Jun 08, 2026 at 09:11:50PM +1000, Gavin Shan wrote:
>>>> Hi Daniel,
>>>>
>>>> On 6/8/26 6:55 PM, Daniel P. Berrangé wrote:
>>>>> On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
>>>>>> On the guest where a NVidia's GH100 card is passed from the host, the
>>>>>> guest system hang can be observed on attempt to compile 'cuda-samples',
>>>>>> as reported by Julia.
>>>>>
>>>>> snip
>>>>>
>>>>
>>>> Thanks for looking into this.
>>>>
>>>>>> By default, the max bounce buffer size is only 4096 bytes, even less
>>>>>> than one page when the guest page is 64KB. This tries to fix the issue
>>>>>> by inheriting the customized max bounce buffer size of the virtio bus's
>>>>>> parent through property 'x-max-bounce-buffer-size' when the customized
>>>>>> size is a larger one. With this applied, no guest system hang is seen
>>>>>> with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
>>>>>
>>>>> "x-max-bounce-buffer-size" is an experimental / unsupported property.
>>>>>
>>>>> We really shouldn't be expecting users to have to set this in a production
>>>>> deployment in order to stop a guest from hanging. Even if we dropped the
>>>>> experimental marker from this property, users would still need to know to
>>>>> provide this magic setting, so it would still be broken out of the box.
>>>>>
>>>>> How can we get a solution that "just works" out of the box, which is
>>>>> fully supported, not relying on experimental properties ?
>>>>>
>>>>
>>>> How do we know that "x-max-bounce-buffer-size" is an experimental or unsupported
>>>> property? I guess the properties whose names start with "x-" are all treated as
>>>> experimental and unsupported?
>>>>
>>>> For this case, the bounce buffer is inevitable as the memory region can't be
>>>> directly accessed. The memory region is initialized by memory_region_init_ram_device_ptr()
>>>> in hw/vfio/region.c::vfio_region_mmap(). So the question is how the allowed
>>>> bounce buffer size can be specified by users, and it's why the existing property
>>>> "x-max-bounce-buffer-size" is reused.
>>>>
>>>> I even thought of a new property for MachineState (e.g. "limited-bounce-buffer"),
>>>> which is set to on by default, following the existing behavior. When it's set to
>>>> off by users, the max (allowed) buffer size won't be checked at all. However, I'm
>>>> not sure if this makes sense at all.
>>>
>>> Hi Gavin,
>>>
>>> You did not answer the question that Daniel was asking, how will user
>>> know that max-bounce-buffer-size should be used if it's necessary to fix
>>> guest system hangs and how will user know what magic value should be set?
>>>
>>
>> Sorry that I missed to answer Daniel's questions. For this specific case,
>> user need to enlarge the bounce buffer size when seeing the following error
>> message. We can add an explicit one in address_space_map() if the existing
>> error message isn't obvious.
>>
>> qemu-system-aarch64: virtio: bogus descriptor or out of resources
>>
>> void *address_space_map(AddressSpace *as,
>> hwaddr addr,
>> hwaddr *plen,
>> bool is_write,
>> MemTxAttrs attrs)
>> {
>> if (!memory_access_is_direct(mr, is_write, attrs)) {
>> if (l == 0) {
>> error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
>> *plen = 0;
>> return NULL;
>> }
>> }
>
> This may work when using qemu directly but users will not see this error
> when using libvirt or management tools like kubevirt.
>
Ok, then an error message raised by error_report() won't help.
>> As to the value user should take for max-bounce-buffer-size, it is really case by case
>> and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
>> smallest value works for them. The worst case is to set 0xFFFFFFFF.
>
> Doesn't sound like pleasant user experience playing guessing game to
> figure out how to make a VM work and again will most likely not work for
> kubevirt where users are usually not exposed to these low level properties.
>
> I'm not familiar with the internals but isn't there a better way how to
> solve it without requiring users to figure out by guessing what value works?
>

Not really. The worst case is to have 'max-bounce-buffer-size=0xFFFFFFFF',
which effectively disables the check against the max bounce buffer size :-)

Peter and Michael have already pointed in the direction of bypassing the
bounce buffer for this specific case. It worked for me: no guest hang is seen
when the bounce buffer is bypassed in address_space_map().

Thanks,
Gavin
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-10 13:54 ` Gavin Shan
@ 2026-06-10 14:06 ` Michael S. Tsirkin
2026-06-10 15:36 ` Peter Xu
0 siblings, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2026-06-10 14:06 UTC (permalink / raw)
To: Gavin Shan
Cc: Pavel Hrdina, Daniel P. Berrangé, Peter Xu, qemu-devel,
qemu-arm, jugraham, shan.gavin, Alex Williamson,
David Hildenbrand
On Wed, Jun 10, 2026 at 11:54:47PM +1000, Gavin Shan wrote:
> Hi Michael and Peter,
>
> On 6/10/26 11:00 PM, Gavin Shan wrote:
> > On 6/10/26 10:27 PM, Michael S. Tsirkin wrote:
> > > On Wed, Jun 10, 2026 at 10:19:31PM +1000, Gavin Shan wrote:
> > > > On 6/10/26 10:12 PM, Michael S. Tsirkin wrote:
> > > > > On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
> > > > > > On 6/10/26 7:54 PM, Pavel Hrdina wrote:
> > > >
> > > > [...]
> > > >
> > > > > > >
> > > > > > > You did not answer the question that Daniel was asking, how will user
> > > > > > > know that max-bounce-buffer-size should be used if it's necessary to fix
> > > > > > > guest system hangs and how will user know what magic value should be set?
> > > > > > >
> > > > > >
> > > > > > Sorry that I missed to answer Daniel's questions. For this specific case,
> > > > > > user need to enlarge the bounce buffer size when seeing the following error
> > > > > > message. We can add an explicit one in address_space_map() if the existing
> > > > > > error message isn't obvious.
> > > > > >
> > > > > > qemu-system-aarch64: virtio: bogus descriptor or out of resources
> > > > > >
> > > > > > void *address_space_map(AddressSpace *as,
> > > > > > hwaddr addr,
> > > > > > hwaddr *plen,
> > > > > > bool is_write,
> > > > > > MemTxAttrs attrs)
> > > > > > {
> > > > > > if (!memory_access_is_direct(mr, is_write, attrs)) {
> > > > > > if (l == 0) {
> > > > > > error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
> > > > > > *plen = 0;
> > > > > > return NULL;
> > > > > > }
> > > > > > }
> > > > > >
> > > > > > As to the value user should take for max-bounce-buffer-size, it is really case by case
> > > > > > and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
> > > > > > smallest value works for them. The worst case is to set 0xFFFFFFFF.
> > > > > >
> > > > >
> > > > >
> > > > > This is not at all reasonable. All kind of fixes are possible but
> > > > > fundamentally, bounce buffering data path is by itself already a
> > > > > bad idea.
> > > > >
> > > > > I have no idea what does bounce buffering device ram accomplish.
> > > > >
> > > > > In the end, qemu still simply reads the memory from/to the buffer.
> > > > >
> > > > > My suggestion is to first of all look for ways to mark the
> > > > > memory as direct.
> > > > >
> > > >
> > > > As I explained to Peter Xu in another reply, we can't simply mark the (RAM
> > > > DEVICE) memory region is directly accessible. The memory region is initialized
> > > > by memory_region_init_ram_device_ptr() in hw/vfio/region.c::vfio_region_mmap().
> > > >
> > > > The accesses to the memory region is handled by 'ram_device_mem_ops' where
> > > > {ldn, stn}_he_p() are used in its read/write handler. They're different
> > > > from memcpy() since the data endianness is well handled in {ldn, stn}_he_p().
> > > >
> > > > Thanks,
> > > > Gavin
> > > >
> > >
> > > What is endianness set to, for this region?
> > >
> >
> > The endianness of the memory region is set to that for the host.
> >
> > static const MemoryRegionOps ram_device_mem_ops = {
> > .read = memory_region_ram_device_read,
> > .write = memory_region_ram_device_write,
> > .endianness = HOST_BIG_ENDIAN ? DEVICE_BIG_ENDIAN : DEVICE_LITTLE_ENDIAN,
> > };
> >
So there is never any endianness translation.
I think the reason qemu does the bounce buffer is more
to prevent things like vector access from MMIO.
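
To illustrate the first point: a host-endian sized load returns the bytes
exactly as they sit in memory, so it is equivalent to a fixed-size copy and
no byte swap ever happens. A minimal sketch follows; load_host_endian() is a
hypothetical stand-in written for this example, not QEMU's ldn_he_p().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical host-endian sized load. Each case copies exactly 'size'
 * bytes in host byte order, so the value read back is the value that
 * was stored by the host, with no endianness translation.
 */
static uint64_t load_host_endian(const void *p, unsigned size)
{
    uint64_t v = 0;

    assert(size == 1 || size == 2 || size == 4 || size == 8);

    switch (size) {
    case 1: { uint8_t t;  memcpy(&t, p, sizeof(t)); v = t; break; }
    case 2: { uint16_t t; memcpy(&t, p, sizeof(t)); v = t; break; }
    case 4: { uint32_t t; memcpy(&t, p, sizeof(t)); v = t; break; }
    case 8: { uint64_t t; memcpy(&t, p, sizeof(t)); v = t; break; }
    }
    return v;
}
```

What the sized handlers add over a plain memcpy() of the whole buffer is
therefore not a byte swap but control over the width of each access.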
> How about to treat the RAM DEVICE memory region directly accessible in
> address_space_map() only when HOST_BIG_ENDIAN is false,
> something like
> below and I don't hit the guest hang issue with the changes.
>
> diff --git a/include/system/memory.h b/include/system/memory.h
> index 1417132f6d..9daca55251 100644
> --- a/include/system/memory.h
> +++ b/include/system/memory.h
> @@ -2908,7 +2908,8 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr);
> int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr);
> bool prepare_mmio_access(MemoryRegion *mr);
> -static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
> +static inline bool memory_region_supports_direct_access(const MemoryRegion *mr,
> + bool check_ram_device)
> {
> /* ROM DEVICE regions only allow direct access if in ROMD mode. */
> if (memory_region_is_romd(mr)) {
> @@ -2922,13 +2923,14 @@ static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
> * be MMIO and access using mempy can be wrong (e.g., using instructions not
> * intended for MMIO access). So we treat this as IO.
> */
> - return !memory_region_is_ram_device(mr);
> + return (!check_ram_device || !memory_region_is_ram_device(mr));
> }
> static inline bool memory_access_is_direct(const MemoryRegion *mr,
> + bool check_ram_device,
> bool is_write, MemTxAttrs attrs)
> {
> - if (!memory_region_supports_direct_access(mr)) {
> + if (!memory_region_supports_direct_access(mr, check_ram_device)) {
> return false;
> }
> diff --git a/system/physmem.c b/system/physmem.c
> index 7bcbf87573..2e6b72b124 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -3724,7 +3724,7 @@ void *address_space_map(AddressSpace *as,
> fv = address_space_to_flatview(as);
> mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
> - if (!memory_access_is_direct(mr, is_write, attrs)) {
> + if (!memory_access_is_direct(mr, HOST_BIG_ENDIAN, is_write, attrs)) {
> size_t used = qatomic_read(&as->bounce_buffer_size);
> for (;;) {
> hwaddr alloc = MIN(as->max_bounce_buffer_size - used, l);
>
> Thanks,
> Gavin
>
I do not think it has anything to do with host endian-ness.
This is the change that broke it I think?
commit 4a2e242bbb306ef5c16ce9e7bb2da3bd8a4eb098
Author: Alex Williamson <alex@shazbot.org>
Date: Mon Oct 31 09:53:03 2016 -0600
memory: Don't use memcpy for ram_device regions
Maybe Alex has an opinion on what to do.
--
MST
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-10 14:04 ` Gavin Shan
@ 2026-06-10 14:08 ` Michael S. Tsirkin
0 siblings, 0 replies; 37+ messages in thread
From: Michael S. Tsirkin @ 2026-06-10 14:08 UTC (permalink / raw)
To: Gavin Shan
Cc: Pavel Hrdina, Daniel P. Berrangé, Peter Xu, qemu-devel,
qemu-arm, jugraham, shan.gavin
On Thu, Jun 11, 2026 at 12:04:52AM +1000, Gavin Shan wrote:
> On 6/10/26 10:23 PM, Pavel Hrdina wrote:
> > On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
> > > Hi Pavel,
> > >
> > > On 6/10/26 7:54 PM, Pavel Hrdina wrote:
> > > > On Mon, Jun 08, 2026 at 09:11:50PM +1000, Gavin Shan wrote:
> > > > > Hi Daniel,
> > > > >
> > > > > On 6/8/26 6:55 PM, Daniel P. Berrangé wrote:
> > > > > > On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
> > > > > > > On the guest where a NVidia's GH100 card is passed from the host, the
> > > > > > > guest system hang can be observed on attempt to compile 'cuda-samples',
> > > > > > > as reported by Julia.
> > > > > >
> > > > > > snip
> > > > > >
> > > > >
> > > > > Thanks for looking into this.
> > > > >
> > > > > > > By default, the max bounce buffer size is only 4096 bytes, even less
> > > > > > > than one page when the guest page is 64KB. This tries to fix the issue
> > > > > > > by inheriting the customized max bounce buffer size of the virtio bus's
> > > > > > > parent through property 'x-max-bounce-buffer-size' when the customized
> > > > > > > size is a larger one. With this applied, no guest system hang is seen
> > > > > > > with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
> > > > > >
> > > > > > "x-max-bounce-buffer-size" is an experimental / unsupported property.
> > > > > >
> > > > > > We really shouldn't be expecting users to have to set this in a production
> > > > > > deployment in order to stop a guest from hanging. Even if we dropped the
> > > > > > experimental marker from this property, users would still need to know to
> > > > > > provide this magic setting, so it would still be broken out of the box.
> > > > > >
> > > > > > How can we get a solution that "just works" out of the box, which is
> > > > > > fully supported, not relying on experimental properties ?
> > > > > >
> > > > >
> > > > > How do we know that "x-max-bounce-buffer-size" is an experimental or unsupported
> > > > > property? I guess the properties whose names start with "x-" are all treated as
> > > > > experimental and unsupported?
> > > > >
> > > > > For this case, the bounce buffer is inevitable as the memory region can't be
> > > > > directly accessed. The memory region is initialized by memory_region_init_ram_device_ptr()
> > > > > in hw/vfio/region.c::vfio_region_mmap(). So the question is how the allowed
> > > > > bounce buffer size can be specified by users, and it's why the existing property
> > > > > "x-max-bounce-buffer-size" is reused.
> > > > >
> > > > > I even thought of a new property for MachineState (e.g. "limited-bounce-buffer"),
> > > > > which is set to on by default, following the existing behavior. When it's set to
> > > > > off by users, the max (allowed) buffer size won't be checked at all. However, I'm
> > > > > not sure if this makes sense at all.
> > > >
> > > > Hi Gavin,
> > > >
> > > > You did not answer the question that Daniel was asking, how will user
> > > > know that max-bounce-buffer-size should be used if it's necessary to fix
> > > > guest system hangs and how will user know what magic value should be set?
> > > >
> > >
> > > Sorry that I missed to answer Daniel's questions. For this specific case,
> > > user need to enlarge the bounce buffer size when seeing the following error
> > > message. We can add an explicit one in address_space_map() if the existing
> > > error message isn't obvious.
> > >
> > > qemu-system-aarch64: virtio: bogus descriptor or out of resources
> > >
> > > void *address_space_map(AddressSpace *as,
> > > hwaddr addr,
> > > hwaddr *plen,
> > > bool is_write,
> > > MemTxAttrs attrs)
> > > {
> > > if (!memory_access_is_direct(mr, is_write, attrs)) {
> > > if (l == 0) {
> > > error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
> > > *plen = 0;
> > > return NULL;
> > > }
> > > }
> >
> > This may work when using qemu directly but users will not see this error
> > when using libvirt or management tools like kubevirt.
> >
>
> Ok, then an error message raised by error_report() won't help.
>
> > > As to the value user should take for max-bounce-buffer-size, it is really case by case
> > > and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
> > > smallest value works for them. The worst case is to set 0xFFFFFFFF.
> >
> > Doesn't sound like pleasant user experience playing guessing game to
> > figure out how to make a VM work and again will most likely not work for
> > kubevirt where users are usually not exposed to these low level properties.
> >
> > I'm not familiar with the internals but isn't there a better way how to
> > solve it without requiring users to figure out by guessing what value works?
> >
>
> Not really. The worst case is to have 'max-bounce-buffer-size=0xFFFFFFFF',
> which is to disable the check against the max bounce buffer size :-)
>
> Peter and Michael have already pointed towards bypassing the bounce buffer
> for this specific case. It worked for me: no guest hang is seen when
> the bounce buffer is bypassed in address_space_map().
>
> Thanks,
> Gavin
Mind, I am not against additionally switching virtio to support popping
buffers into a QEMUSGList rather than iovecs.
But the performance is going to be bad for this one.
--
MST
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-10 14:06 ` Michael S. Tsirkin
@ 2026-06-10 15:36 ` Peter Xu
2026-06-10 16:11 ` Peter Maydell
2026-06-10 16:18 ` Michael S. Tsirkin
0 siblings, 2 replies; 37+ messages in thread
From: Peter Xu @ 2026-06-10 15:36 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Gavin Shan, Pavel Hrdina, Daniel P. Berrangé, qemu-devel,
qemu-arm, jugraham, shan.gavin, Alex Williamson,
David Hildenbrand
On Wed, Jun 10, 2026 at 10:06:24AM -0400, Michael S. Tsirkin wrote:
> On Wed, Jun 10, 2026 at 11:54:47PM +1000, Gavin Shan wrote:
> > Hi Michael and Peter,
> >
> > On 6/10/26 11:00 PM, Gavin Shan wrote:
> > > On 6/10/26 10:27 PM, Michael S. Tsirkin wrote:
> > > > On Wed, Jun 10, 2026 at 10:19:31PM +1000, Gavin Shan wrote:
> > > > > On 6/10/26 10:12 PM, Michael S. Tsirkin wrote:
> > > > > > On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
> > > > > > > On 6/10/26 7:54 PM, Pavel Hrdina wrote:
> > > > >
> > > > > [...]
> > > > >
> > > > > > > >
> > > > > > > > You did not answer the question that Daniel was asking, how will user
> > > > > > > > know that max-bounce-buffer-size should be used if it's necessary to fix
> > > > > > > > guest system hangs and how will user know what magic value should be set?
> > > > > > > >
> > > > > > >
> > > > > > > Sorry that I missed to answer Daniel's questions. For this specific case,
> > > > > > > user need to enlarge the bounce buffer size when seeing the following error
> > > > > > > message. We can add an explicit one in address_space_map() if the existing
> > > > > > > error message isn't obvious.
> > > > > > >
> > > > > > > qemu-system-aarch64: virtio: bogus descriptor or out of resources
> > > > > > >
> > > > > > > void *address_space_map(AddressSpace *as,
> > > > > > > hwaddr addr,
> > > > > > > hwaddr *plen,
> > > > > > > bool is_write,
> > > > > > > MemTxAttrs attrs)
> > > > > > > {
> > > > > > > if (!memory_access_is_direct(mr, is_write, attrs)) {
> > > > > > > if (l == 0) {
> > > > > > > error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
> > > > > > > *plen = 0;
> > > > > > > return NULL;
> > > > > > > }
> > > > > > > }
> > > > > > >
> > > > > > > As to the value user should take for max-bounce-buffer-size, it is really case by case
> > > > > > > and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
> > > > > > > smallest value works for them. The worst case is to set 0xFFFFFFFF.
> > > > > > >
> > > > > >
> > > > > >
> > > > > > This is not at all reasonable. All kind of fixes are possible but
> > > > > > fundamentally, bounce buffering data path is by itself already a
> > > > > > bad idea.
> > > > > >
> > > > > > I have no idea what does bounce buffering device ram accomplish.
> > > > > >
> > > > > > In the end, qemu still simply reads the memory from/to the buffer.
> > > > > >
> > > > > > My suggestion is to first of all look for ways to mark the
> > > > > > memory as direct.
> > > > > >
> > > > >
> > > > > As I explained to Peter Xu in another reply, we can't simply mark the (RAM
> > > > > DEVICE) memory region is directly accessible. The memory region is initialized
> > > > > by memory_region_init_ram_device_ptr() in hw/vfio/region.c::vfio_region_mmap().
> > > > >
> > > > > The accesses to the memory region is handled by 'ram_device_mem_ops' where
> > > > > {ldn, stn}_he_p() are used in its read/write handler. They're different
> > > > > from memcpy() since the data endianness is well handled in {ldn, stn}_he_p().
> > > > >
> > > > > Thanks,
> > > > > Gavin
> > > > >
> > > >
> > > > What is endianness set to, for this region?
> > > >
> > >
> > > The endianness of the memory region is set to that for the host.
> > >
> > > static const MemoryRegionOps ram_device_mem_ops = {
> > > .read = memory_region_ram_device_read,
> > > .write = memory_region_ram_device_write,
> > > .endianness = HOST_BIG_ENDIAN ? DEVICE_BIG_ENDIAN : DEVICE_LITTLE_ENDIAN,
> > > };
> > >
>
> So there is never any endianness translation.
> I think the reason qemu does the bounce buffer is more
> to prevent things like vector access from MMIO.
>
>
> > How about to treat the RAM DEVICE memory region directly accessible in
> > address_space_map() only when HOST_BIG_ENDIAN is false,
> > something like
> > below and I don't hit the guest hang issue with the changes.
> >
> > diff --git a/include/system/memory.h b/include/system/memory.h
> > index 1417132f6d..9daca55251 100644
> > --- a/include/system/memory.h
> > +++ b/include/system/memory.h
> > @@ -2908,7 +2908,8 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr);
> > int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr);
> > bool prepare_mmio_access(MemoryRegion *mr);
> > -static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
> > +static inline bool memory_region_supports_direct_access(const MemoryRegion *mr,
> > + bool check_ram_device)
> > {
> > /* ROM DEVICE regions only allow direct access if in ROMD mode. */
> > if (memory_region_is_romd(mr)) {
> > @@ -2922,13 +2923,14 @@ static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
> > * be MMIO and access using mempy can be wrong (e.g., using instructions not
> > * intended for MMIO access). So we treat this as IO.
> > */
> > - return !memory_region_is_ram_device(mr);
> > + return (!check_ram_device || !memory_region_is_ram_device(mr));
> > }
> > static inline bool memory_access_is_direct(const MemoryRegion *mr,
> > + bool check_ram_device,
> > bool is_write, MemTxAttrs attrs)
> > {
> > - if (!memory_region_supports_direct_access(mr)) {
> > + if (!memory_region_supports_direct_access(mr, check_ram_device)) {
> > return false;
> > }
> > diff --git a/system/physmem.c b/system/physmem.c
> > index 7bcbf87573..2e6b72b124 100644
> > --- a/system/physmem.c
> > +++ b/system/physmem.c
> > @@ -3724,7 +3724,7 @@ void *address_space_map(AddressSpace *as,
> > fv = address_space_to_flatview(as);
> > mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
> > - if (!memory_access_is_direct(mr, is_write, attrs)) {
> > + if (!memory_access_is_direct(mr, HOST_BIG_ENDIAN, is_write, attrs)) {
> > size_t used = qatomic_read(&as->bounce_buffer_size);
> > for (;;) {
> > hwaddr alloc = MIN(as->max_bounce_buffer_size - used, l);
> >
> > Thanks,
> > Gavin
> >
>
> I do not think it has anything to do with host endian-ness.
>
>
> This is the change that broke it I think?
>
>
> commit 4a2e242bbb306ef5c16ce9e7bb2da3bd8a4eb098
> Author: Alex Williamson <alex@shazbot.org>
> Date: Mon Oct 31 09:53:03 2016 -0600
>
> memory: Don't use memcpy for ram_device regions
>
>
> Maybe Alex has an opinion on what to do.
I can offer one idea here.

IIUC the major issue was vector ops, but the mr ops might be too heavy.
Another way to fix it is in the memory API: instead of using
memcpy()/memmove(), we always use a helper (say, memmove_no_vector()) that
does the split and properly aligned IOs, as ram_device_mem_ops does right
now. This should only apply to ram_device.

With that, IIUC we can remove the current ram_device_mem_ops; then in
Gavin's case mmap() will go through and the guest will not need to vmexit
at all. Best perf, issue solved.

We just need to be careful to catch every memcpy()/memmove() used in the
memory core. If I didn't miss any, IMO the four below need to be replaced
by memmove_no_vector():

  flatview_write_continue_step()
  flatview_read_continue_step()
  address_space_read()
  address_space_write_rom()
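
A rough sketch of what such a helper could look like for the read
direction, only to make the idea concrete. The names no_vector_step() and
read_device_no_vector() are hypothetical, the sketch assumes the two buffers
do not overlap, and it relies on volatile accesses to keep the compiler from
merging or widening the device-side loads; it has not been checked against
real hardware constraints.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Largest access size (8, 4, 2 or 1 bytes) that is naturally aligned at
 * 'addr' and does not exceed the 'len' bytes still to be copied.
 */
static size_t no_vector_step(uintptr_t addr, size_t len)
{
    size_t step = 8;

    assert(len > 0);
    while (step > 1 && ((addr & (step - 1)) != 0 || step > len)) {
        step >>= 1;
    }
    return step;
}

/*
 * Copy 'len' bytes from device memory 'dev' to ordinary memory 'dst'
 * using only naturally aligned 1/2/4/8-byte loads on the device side.
 */
static void read_device_no_vector(void *dst, const volatile void *dev,
                                  size_t len)
{
    uint8_t *d = dst;
    uintptr_t s = (uintptr_t)dev;

    while (len > 0) {
        size_t step = no_vector_step(s, len);

        switch (step) {
        case 8: { uint64_t v = *(const volatile uint64_t *)s; memcpy(d, &v, 8); break; }
        case 4: { uint32_t v = *(const volatile uint32_t *)s; memcpy(d, &v, 4); break; }
        case 2: { uint16_t v = *(const volatile uint16_t *)s; memcpy(d, &v, 2); break; }
        default: *d = *(const volatile uint8_t *)s; break;
        }
        d += step;
        s += step;
        len -= step;
    }
}
```

The write direction would mirror this with the alignment computed on the
destination instead of the source.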
Thanks,
--
Peter Xu
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-10 15:36 ` Peter Xu
@ 2026-06-10 16:11 ` Peter Maydell
2026-06-10 16:19 ` Michael S. Tsirkin
2026-06-10 16:18 ` Michael S. Tsirkin
1 sibling, 1 reply; 37+ messages in thread
From: Peter Maydell @ 2026-06-10 16:11 UTC (permalink / raw)
To: Peter Xu
Cc: Michael S. Tsirkin, Gavin Shan, Pavel Hrdina,
Daniel P. Berrangé, qemu-devel, qemu-arm, jugraham,
shan.gavin, Alex Williamson, David Hildenbrand
On Wed, 10 Jun 2026 at 16:37, Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Jun 10, 2026 at 10:06:24AM -0400, Michael S. Tsirkin wrote:
> > This is the change that broke it I think?
> >
> >
> > commit 4a2e242bbb306ef5c16ce9e7bb2da3bd8a4eb098
> > Author: Alex Williamson <alex@shazbot.org>
> > Date: Mon Oct 31 09:53:03 2016 -0600
> >
> > memory: Don't use memcpy for ram_device regions
> >
> >
> > Maybe Alex has an opinion on what to do.
>
> I can offer one idea here..
>
> IIUC the major issue was vector ops, but the mr ops might be too heavy.
> Another way to fix it is in the memory API: instead of using
> memcpy()/memmove(), we always use a helper (say, memmove_no_vector()) that
> does the split and properly aligned IOs, as ram_device_mem_ops does right
> now. This should only apply to ram_device.
If the underlying memory needs to be accessed only with specific
alignment/size, as the 4a2e242bbb30 commit message suggests, then
we cannot expose it via address_space_map(), so we must have
a bounce-buffer. The address_space_map() function says
"here's a host pointer to memory, do what you like to it", and
the caller is entitled to memcpy to/from it or otherwise
access it with any C operations, which are not guaranteed to
respect any kind of alignment or similar restrictions.
My guess is that commit 4a2e242bbb30 applied an
overly broad "don't do direct access" hammer to all
vfio assigned devices, and that there needs to be some
concept of "this vfio assigned device's region is OK for
direct access" vs "this other one is not", such that if
this GH100 card's BAR guarantees it can be treated entirely
as RAM then we can have memory_region_supports_direct_access()
return true for it.
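
As a sketch of that per-region distinction (illustration only): RegionSketch
and its direct_ok flag are hypothetical and are not fields of QEMU's
MemoryRegion; the point is just that the decision becomes a property of the
individual region rather than of every RAM device region.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, pared-down region descriptor; not QEMU's MemoryRegion. */
typedef struct {
    bool ram;        /* backed by host-addressable memory */
    bool ram_device; /* mmap of a device region, e.g. a VFIO BAR */
    bool direct_ok;  /* hypothetical: device guarantees plain RAM semantics */
} RegionSketch;

/*
 * Ordinary RAM is always directly accessible. A RAM device region is
 * directly accessible only if it was explicitly marked as safe.
 */
static bool region_supports_direct_access(const RegionSketch *r)
{
    assert(r != NULL);

    if (!r->ram) {
        return false;
    }
    if (r->ram_device) {
        return r->direct_ok;
    }
    return true;
}
```

Regions that are not marked keep today's conservative behaviour, so only
devices that make the guarantee would change.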
thanks
-- PMM
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-10 15:36 ` Peter Xu
2026-06-10 16:11 ` Peter Maydell
@ 2026-06-10 16:18 ` Michael S. Tsirkin
2026-06-11 4:33 ` Gavin Shan
1 sibling, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2026-06-10 16:18 UTC (permalink / raw)
To: Peter Xu
Cc: Gavin Shan, Pavel Hrdina, Daniel P. Berrangé, qemu-devel,
qemu-arm, jugraham, shan.gavin, Alex Williamson,
David Hildenbrand
On Wed, Jun 10, 2026 at 11:36:55AM -0400, Peter Xu wrote:
> On Wed, Jun 10, 2026 at 10:06:24AM -0400, Michael S. Tsirkin wrote:
> > On Wed, Jun 10, 2026 at 11:54:47PM +1000, Gavin Shan wrote:
> > > Hi Michael and Peter,
> > >
> > > On 6/10/26 11:00 PM, Gavin Shan wrote:
> > > > On 6/10/26 10:27 PM, Michael S. Tsirkin wrote:
> > > > > On Wed, Jun 10, 2026 at 10:19:31PM +1000, Gavin Shan wrote:
> > > > > > On 6/10/26 10:12 PM, Michael S. Tsirkin wrote:
> > > > > > > On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
> > > > > > > > On 6/10/26 7:54 PM, Pavel Hrdina wrote:
> > > > > >
> > > > > > [...]
> > > > > >
> > > > > > > > >
> > > > > > > > > You did not answer the question that Daniel was asking, how will user
> > > > > > > > > know that max-bounce-buffer-size should be used if it's necessary to fix
> > > > > > > > > guest system hangs and how will user know what magic value should be set?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Sorry that I missed to answer Daniel's questions. For this specific case,
> > > > > > > > user need to enlarge the bounce buffer size when seeing the following error
> > > > > > > > message. We can add an explicit one in address_space_map() if the existing
> > > > > > > > error message isn't obvious.
> > > > > > > >
> > > > > > > > qemu-system-aarch64: virtio: bogus descriptor or out of resources
> > > > > > > >
> > > > > > > > void *address_space_map(AddressSpace *as,
> > > > > > > > hwaddr addr,
> > > > > > > > hwaddr *plen,
> > > > > > > > bool is_write,
> > > > > > > > MemTxAttrs attrs)
> > > > > > > > {
> > > > > > > > if (!memory_access_is_direct(mr, is_write, attrs)) {
> > > > > > > > if (l == 0) {
> > > > > > > > error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
> > > > > > > > *plen = 0;
> > > > > > > > return NULL;
> > > > > > > > }
> > > > > > > > }
> > > > > > > >
> > > > > > > > As to the value user should take for max-bounce-buffer-size, it is really case by case
> > > > > > > > and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
> > > > > > > > smallest value works for them. The worst case is to set 0xFFFFFFFF.
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > This is not at all reasonable. All kind of fixes are possible but
> > > > > > > fundamentally, bounce buffering data path is by itself already a
> > > > > > > bad idea.
> > > > > > >
> > > > > > > I have no idea what does bounce buffering device ram accomplish.
> > > > > > >
> > > > > > > In the end, qemu still simply reads the memory from/to the buffer.
> > > > > > >
> > > > > > > My suggestion is to first of all look for ways to mark the
> > > > > > > memory as direct.
> > > > > > >
> > > > > >
> > > > > > As I explained to Peter Xu in another reply, we can't simply mark the (RAM
> > > > > > DEVICE) memory region is directly accessible. The memory region is initialized
> > > > > > by memory_region_init_ram_device_ptr() in hw/vfio/region.c::vfio_region_mmap().
> > > > > >
> > > > > > The accesses to the memory region is handled by 'ram_device_mem_ops' where
> > > > > > {ldn, stn}_he_p() are used in its read/write handler. They're different
> > > > > > from memcpy() since the data endianness is well handled in {ldn, stn}_he_p().
> > > > > >
> > > > > > Thanks,
> > > > > > Gavin
> > > > > >
> > > > >
> > > > > What is endianness set to, for this region?
> > > > >
> > > >
> > > > The endianness of the memory region is set to that for the host.
> > > >
> > > > static const MemoryRegionOps ram_device_mem_ops = {
> > > > .read = memory_region_ram_device_read,
> > > > .write = memory_region_ram_device_write,
> > > > .endianness = HOST_BIG_ENDIAN ? DEVICE_BIG_ENDIAN : DEVICE_LITTLE_ENDIAN,
> > > > };
> > > >
> >
> > So there is never any endianness translation.
> > I think the reason qemu does the bounce buffer is more
> > to prevent things like vector access from MMIO.
> >
> >
> > > How about to treat the RAM DEVICE memory region directly accessible in
> > > address_space_map() only when HOST_BIG_ENDIAN is false,
> > > something like
> > > below and I don't hit the guest hang issue with the changes.
> > >
> > > diff --git a/include/system/memory.h b/include/system/memory.h
> > > index 1417132f6d..9daca55251 100644
> > > --- a/include/system/memory.h
> > > +++ b/include/system/memory.h
> > > @@ -2908,7 +2908,8 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr);
> > > int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr);
> > > bool prepare_mmio_access(MemoryRegion *mr);
> > > -static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
> > > +static inline bool memory_region_supports_direct_access(const MemoryRegion *mr,
> > > + bool check_ram_device)
> > > {
> > > /* ROM DEVICE regions only allow direct access if in ROMD mode. */
> > > if (memory_region_is_romd(mr)) {
> > > @@ -2922,13 +2923,14 @@ static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
> > > * be MMIO and access using mempy can be wrong (e.g., using instructions not
> > > * intended for MMIO access). So we treat this as IO.
> > > */
> > > - return !memory_region_is_ram_device(mr);
> > > + return (!check_ram_device || !memory_region_is_ram_device(mr));
> > > }
> > > static inline bool memory_access_is_direct(const MemoryRegion *mr,
> > > + bool check_ram_device,
> > > bool is_write, MemTxAttrs attrs)
> > > {
> > > - if (!memory_region_supports_direct_access(mr)) {
> > > + if (!memory_region_supports_direct_access(mr, check_ram_device)) {
> > > return false;
> > > }
> > > diff --git a/system/physmem.c b/system/physmem.c
> > > index 7bcbf87573..2e6b72b124 100644
> > > --- a/system/physmem.c
> > > +++ b/system/physmem.c
> > > @@ -3724,7 +3724,7 @@ void *address_space_map(AddressSpace *as,
> > > fv = address_space_to_flatview(as);
> > > mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
> > > - if (!memory_access_is_direct(mr, is_write, attrs)) {
> > > + if (!memory_access_is_direct(mr, HOST_BIG_ENDIAN, is_write, attrs)) {
> > > size_t used = qatomic_read(&as->bounce_buffer_size);
> > > for (;;) {
> > > hwaddr alloc = MIN(as->max_bounce_buffer_size - used, l);
> > >
> > > Thanks,
> > > Gavin
> > >
> >
> > I do not think it has anything to do with host endian-ness.
> >
> >
> > This is the change that broke it I think?
> >
> >
> > commit 4a2e242bbb306ef5c16ce9e7bb2da3bd8a4eb098
> > Author: Alex Williamson <alex@shazbot.org>
> > Date: Mon Oct 31 09:53:03 2016 -0600
> >
> > memory: Don't use memcpy for ram_device regions
> >
> >
> > Maybe Alex has an opinion on what to do.
>
> I can offer one idea here..
>
> IIUC the major issue was vector ops, but the mr ops might be too heavy.
> Another way to fix it is in the memory API: instead of using
> memcpy()/memmove(), we always use a helper (say, memmove_no_vector()) that
> does the split and properly aligned IOs, as ram_device_mem_ops does right
> now. This should only apply to ram_device.
>
> With that, IIUC we can remove the current ram_device_mem_ops; then in
> Gavin's case mmap() will go through and the guest will not need to vmexit
> at all. Best perf, issue solved.
>
> We just need to be careful to catch every memcpy()/memmove() used in the
> memory core. If I didn't miss any, IMO the four below need to be replaced
> by memmove_no_vector():
>
> flatview_write_continue_step()
> flatview_read_continue_step()
> address_space_read()
> address_space_write_rom()
>
> Thanks,
>
> --
> Peter Xu
First, this is a nice idea.
Second, the ideal thing is still just allowing direct access.
And I think VFIO actually knows it's regular RAM.
So something like the following small patch in linux, maybe?
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index fa056b69f899..a4ca2d01272c 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -418,6 +418,10 @@ static int nvgrace_gpu_ioctl_get_region_info(struct vfio_device *core_vdev,
struct nvgrace_gpu_pci_core_device *nvdev =
container_of(core_vdev, struct nvgrace_gpu_pci_core_device,
core_device.vdev);
+ struct vfio_region_info_cap_direct_access direct_access = {
+ .header.id = VFIO_REGION_INFO_CAP_DIRECT_ACCESS,
+ .header.version = 1,
+ };
struct vfio_region_info_cap_sparse_mmap *sparse;
struct mem_region *memregion;
u32 size;
@@ -453,6 +457,13 @@ static int nvgrace_gpu_ioctl_get_region_info(struct vfio_device *core_vdev,
if (ret)
return ret;
+ if (info->index == USEMEM_REGION_INDEX) {
+ ret = vfio_info_add_capability(caps, &direct_access.header,
+ sizeof(direct_access));
+ if (ret)
+ return ret;
+ }
+
info->offset = VFIO_PCI_INDEX_TO_OFFSET(info->index);
/*
* The region memory size may not be power-of-2 aligned.
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 5de618a3a5ee..f475f4920b52 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -466,6 +466,16 @@ struct vfio_device_migration_info {
*/
#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3
+/*
+ * The direct access capability informs that a mmappable region may be
+ * accessed by userspace using any CPU load/store operations.
+ */
+#define VFIO_REGION_INFO_CAP_DIRECT_ACCESS 6
+
+struct vfio_region_info_cap_direct_access {
+ struct vfio_info_cap_header header;
+};
+
/*
* Capability with compressed real address (aka SSA - small system address)
* where GPU RAM is mapped on a system bus. Used by a GPU for DMA routing
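
On the userspace side, a consumer such as QEMU would then have to walk the
region info capability chain to look for the new capability. A minimal
sketch, assuming the proposed ID 6 from the patch above: the struct mirrors
the layout of struct vfio_info_cap_header but is redefined locally so the
example builds without kernel headers, and region_info_has_cap() is a
hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct vfio_info_cap_header (id, version, next). */
struct cap_header_sketch {
    uint16_t id;
    uint16_t version;
    uint32_t next; /* offset of the next capability from the start of
                    * the info buffer, 0 terminates the chain */
};

/* Proposed capability ID from the patch; not part of any released uAPI. */
#define CAP_DIRECT_ACCESS_SKETCH 6

/*
 * Walk the capability chain inside a region info buffer of 'info_size'
 * bytes, starting at 'cap_offset', and report whether 'id' is present.
 * The hop limit guards against a malformed, looping chain.
 */
static bool region_info_has_cap(const uint8_t *info, size_t info_size,
                                uint32_t cap_offset, uint16_t id)
{
    size_t max_hops = info_size / sizeof(struct cap_header_sketch);
    uint32_t off = cap_offset;

    assert(info != NULL);

    for (size_t hop = 0; off != 0 && hop < max_hops; hop++) {
        struct cap_header_sketch hdr;

        if (off > info_size || info_size - off < sizeof(hdr)) {
            return false; /* header would run past the buffer */
        }
        memcpy(&hdr, info + off, sizeof(hdr));
        if (hdr.id == id) {
            return true;
        }
        off = hdr.next;
    }
    return false;
}
```

If the capability is found, the region could be registered as ordinary RAM
instead of a RAM device region; if not, the current behaviour stays.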
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-10 16:11 ` Peter Maydell
@ 2026-06-10 16:19 ` Michael S. Tsirkin
2026-06-10 19:10 ` Peter Xu
0 siblings, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2026-06-10 16:19 UTC (permalink / raw)
To: Peter Maydell
Cc: Peter Xu, Gavin Shan, Pavel Hrdina, Daniel P. Berrangé,
qemu-devel, qemu-arm, jugraham, shan.gavin, Alex Williamson,
David Hildenbrand
On Wed, Jun 10, 2026 at 05:11:40PM +0100, Peter Maydell wrote:
> On Wed, 10 Jun 2026 at 16:37, Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Jun 10, 2026 at 10:06:24AM -0400, Michael S. Tsirkin wrote:
> > > This is the change that broke it I think?
> > >
> > >
> > > commit 4a2e242bbb306ef5c16ce9e7bb2da3bd8a4eb098
> > > Author: Alex Williamson <alex@shazbot.org>
> > > Date: Mon Oct 31 09:53:03 2016 -0600
> > >
> > > memory: Don't use memcpy for ram_device regions
> > >
> > >
> > > Maybe Alex has an opinion on what to do.
> >
> > I can offer one idea here..
> >
> > IIUC the major issue was vector ops but the mr ops might be too heavy, then
> > another way to fix it is in memory API instead of using memcpy()/memmove(),
> > we always use a helper (say, memmove_no_vector()) to do the split and
> > properly aligned IOs as what ram_device_mem_ops does right now, this should
> > only applies to ram_device.
>
> If the underlying memory needs to be accessed only with specific
> alignment/size, as the 4a2e242bbb30 commit message suggests, then
> we cannot expose it via address_space_map(), so we must have
> a bounce-buffer.
Right. And virtio currently isn't friendly to the bounce buffer.
We can fix that but I worry about the perf impact.
> The address_space_map() function says
> "here's a host pointer to memory, do what you like to it", and
> the caller is entitled to memcpy to/from it or otherwise
> access it with any C operations, which are not guaranteed to
> respect any kind of alignment or similar restrictions.
>
> My guess from commit 4a2e242bbb30 is that that applied an
> overly broad "don't do direct access" hammer to all
> vfio assigned devices, and that there needs to be some
> concept of "this vfio assigned device's region is OK for
> direct access" vs "this other one is not", such that if
> this GH100 card's BAR guarantees it can be treated entirely
> as RAM then we can have memory_region_supports_direct_access()
> return true for it.
>
> thanks
> -- PMM
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-10 9:49 ` Michael S. Tsirkin
@ 2026-06-10 18:30 ` Stefan Hajnoczi
2026-06-10 21:00 ` Michael S. Tsirkin
2026-06-11 1:19 ` Gavin Shan
0 siblings, 2 replies; 37+ messages in thread
From: Stefan Hajnoczi @ 2026-06-10 18:30 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Gavin Shan, qemu-devel, qemu-arm, jugraham, shan.gavin,
qemu-block
On Wed, Jun 10, 2026 at 05:49:21AM -0400, Michael S. Tsirkin wrote:
> On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
> > On the guest where a NVidia's GH100 card is passed from the host, the
> > guest system hang can be observed on attempt to compile 'cuda-samples',
> > as reported by Julia.
> >
> > host$ lspci | grep GH100
> > 0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
> > host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64 -accel kvm \
> > -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T \
> > -cpu host -smp cpus=32 -m size=8G \
> > -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0 \
> > -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4 \
> > -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
> >
> > guest$ cd cuda-samples/build
> > guest$ make -j 20 clean
> > guest$ make -j 20
> > :
> > [ 54%] Linking CUDA executable graphMemoryNodes
> > [ 54%] Built target graphMemoryNodes
> > <no more output afterwards, guest becomes frozen here>
> >
> > guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
> > [ 555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
> >
> > When the GPU's driver (NVidia open driver) is loaded on guest bootup,
> > the memory blocks residing in the PCI BAR can be presented to the guest
> > through memory hot-add. The page cache can be allocated from the hot added
> > memory blocks when cuda-samples is being built. Afterwards, he page cache
> > is sent to QEMU's virtio-blk device as part of the DMA request, the bounce
> > buffer is used to accomodate the request as the corresponding memory
> > region (MemoryRegion) is a RAM DEVICE region in qemu. For this specific
> > case, false is returned from memory_access_is_direct() in the path where
> > the DMA request is handled.
> >
> > QEMU
> > ====
> > virtio_blk_handle_output
> > virtio_blk_handle_vq
> > virtio_blk_get_request
> > virtqueue_pop
> > virtqueue_split_pop
> > virtqueue_map_desc
> > address_space_map
> > memory_access_is_direct # Return false
> > memory_region_supports_direct_access
> >
> > (qemu) info mtree
> > :
> > memory-region: pci_bridge_pci
> > 0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
> > 0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
> > 0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
> > 0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
> >
> > By default, the max bounce buffer size is only 4096 bytes, even less
> > than one page when the guest page is 64KB. This tries to fix the issue
> > by inheriting the customized max bounce buffer size of the virtio bus's
> > parent through property 'x-max-bounce-buffer-size' when the customized
> > size is a larger one. With this applied, no guest system hang is seen
> > with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
> >
> > Reported-by: Julia Graham <jugraham@redhat.com>
> > Signed-off-by: Gavin Shan <gshan@redhat.com>
> > ---
> > hw/virtio/virtio-bus.c | 14 ++++++++++++++
> > 1 file changed, 14 insertions(+)
> >
> > diff --git a/hw/virtio/virtio-bus.c b/hw/virtio/virtio-bus.c
> > index cef944e015..e0933823f3 100644
> > --- a/hw/virtio/virtio-bus.c
> > +++ b/hw/virtio/virtio-bus.c
> > @@ -42,6 +42,7 @@ do { printf("virtio_bus: " fmt , ## __VA_ARGS__); } while (0)
> > /* A VirtIODevice is being plugged */
> > void virtio_bus_device_plugged(VirtIODevice *vdev, Error **errp)
> > {
> > + AddressSpace *as;
> > DeviceState *qdev = DEVICE(vdev);
> > BusState *qbus = BUS(qdev_get_parent_bus(qdev));
> > VirtioBusState *bus = VIRTIO_BUS(qbus);
> > @@ -100,6 +101,19 @@ void virtio_bus_device_plugged(VirtIODevice *vdev, Error **errp)
> > return;
> > }
> > }
> > + } else {
> > + /*
> > + * The maximal bounce buffer size of the virtio bus's parent may
> > + * have been customized by property 'x-max-bounce-buffer-size'.
> > + * Lets inherit the customized size if it's larger than the
> > + * current one.
> > + */
> > + as = klass->get_dma_as ? klass->get_dma_as(qbus->parent) : NULL;
> > + if (as) {
> > + vdev->dma_as->max_bounce_buffer_size = MAX(
> > + vdev->dma_as->max_bounce_buffer_size,
> > + as->max_bounce_buffer_size);
> > + }
> > }
> > }
> >
> > --
> > 2.54.0
>
>
> Problem with all this is, users would not know how to size this.
>
> So fundamentally, is not the issue that virtio blk (and scsi!) maps
> all of the buffer all the time?
>
> It's not hard to add something like virtio_pop_unmapped that would not map,
> then build QEMUSGLists out of addr/len pairs and submit these.
>
> Stefan, do you think doing it like this would be bad for perf? Good for
> perf?
I'd like to first make sure that the BAR really cannot be mmapped.
A bounce buffer is necessary when QEMU has no way of mmapping the memory
(e.g. it needs to invoke a device model's callback to read/write the
MemoryRegion).
The bounce buffer size is low because it's normally
only used on emulated machines where MMIO registers or similar small
MemoryRegions are accessed by DMA. If we ran into this on modern
machines there would also be other consequences, such as vhost devices
being unable to access that memory since it cannot be shared/mmapped.
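To illustrate why a small budget fails for a large request, here is a toy
model of a bounce-buffer budget. It is only a sketch and not QEMU's actual
address_space_map(): the ToyAddressSpace type and the toy_* functions are
made up for illustration, and it assumes the device keeps every partial
mapping of a request alive until the whole request is mapped.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an address space with a bounce-buffer budget. */
typedef struct {
    size_t max_bounce_buffer_size;  /* total budget, e.g. 4096 */
    size_t bounce_buffer_size;      /* bytes currently handed out */
} ToyAddressSpace;

/*
 * Try to map `want` bytes of a region that cannot be accessed directly.
 * Returns the number of bytes granted: possibly fewer than requested,
 * and 0 once the budget is exhausted.
 */
static size_t toy_map_bounce(ToyAddressSpace *as, size_t want)
{
    size_t left = as->max_bounce_buffer_size - as->bounce_buffer_size;
    size_t granted = want < left ? want : left;

    as->bounce_buffer_size += granted;
    return granted;
}

/* Release bytes previously granted by toy_map_bounce(). */
static void toy_unmap_bounce(ToyAddressSpace *as, size_t granted)
{
    as->bounce_buffer_size -= granted;
}

/*
 * Map a whole descriptor of `len` bytes through repeated partial
 * mappings that all stay alive, as a device that maps the full request
 * up front would. Returns 0 on success, -1 if the budget runs out
 * (in which case everything mapped so far is released again).
 */
static int toy_map_descriptor(ToyAddressSpace *as, size_t len)
{
    size_t mapped = 0;

    while (mapped < len) {
        size_t got = toy_map_bounce(as, len - mapped);

        if (got == 0) {
            toy_unmap_bounce(as, mapped);
            return -1;
        }
        mapped += got;
    }
    return 0;
}
```

In this model a 64KB request against a 4096-byte budget fails after the
first partial mapping, while the same request succeeds once the budget is
raised to 268435456.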
This is why I think we need to understand why this BAR is a RAM DEVICE.
If it can support mmap, then this case, plus anything else like vhost,
would work.
Gavin, can you share the output of `lspci -vv -s 0009:01:00.0`?
Thanks,
Stefan
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-10 16:19 ` Michael S. Tsirkin
@ 2026-06-10 19:10 ` Peter Xu
2026-06-10 21:03 ` Michael S. Tsirkin
0 siblings, 1 reply; 37+ messages in thread
From: Peter Xu @ 2026-06-10 19:10 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Peter Maydell, Gavin Shan, Pavel Hrdina, Daniel P. Berrangé,
qemu-devel, qemu-arm, jugraham, shan.gavin, Alex Williamson,
David Hildenbrand
On Wed, Jun 10, 2026 at 12:19:39PM -0400, Michael S. Tsirkin wrote:
> On Wed, Jun 10, 2026 at 05:11:40PM +0100, Peter Maydell wrote:
> > On Wed, 10 Jun 2026 at 16:37, Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Wed, Jun 10, 2026 at 10:06:24AM -0400, Michael S. Tsirkin wrote:
> > > > This is the change that broke it I think?
> > > >
> > > >
> > > > commit 4a2e242bbb306ef5c16ce9e7bb2da3bd8a4eb098
> > > > Author: Alex Williamson <alex@shazbot.org>
> > > > Date: Mon Oct 31 09:53:03 2016 -0600
> > > >
> > > > memory: Don't use memcpy for ram_device regions
> > > >
> > > >
> > > > Maybe Alex has an opinion on what to do.
> > >
> > > I can offer one idea here..
> > >
> > > IIUC the major issue was vector ops but the mr ops might be too heavy, then
> > > another way to fix it is in memory API instead of using memcpy()/memmove(),
> > > we always use a helper (say, memmove_no_vector()) to do the split and
> > > properly aligned IOs as what ram_device_mem_ops does right now, this should
> > > only applies to ram_device.
> >
> > If the underlying memory needs to be accessed only with specific
> > alignment/size, as the 4a2e242bbb30 commit message suggests, then
> > we cannot expose it via address_space_map(), so we must have
> > a bounce-buffer.
I get the point; this is technically a concern, but IMHO it's still
slightly different, and I expect it to be a non-issue in reality.
Essentially there are two ways to interact with the PCI BAR:
1) via vCPU / CPU access
2) via DMA targets
Alex can correct me, but IIUC that problem arose when the CPU accessed the
mapped region with memcpy(), not when the BAR was used as a DMA target.
Hence, use case 1) only. So my current understanding is that the proposal
shouldn't (hopefully..) regress that Realtek problem, because use case 1) is
properly covered.
I always think it is very bogus to have any register-like MMIO regions
passed over; maybe it's a bug already? That's because I don't know of any
way to guarantee that DMA is performed in a way that is compatible with a
PCI BAR that is register-based, without any side effects. Say, if some
PCI BAR (really register-backed) must be accessed with aligned 4-byte
accesses, how would a DMA request guarantee that?
From that perspective, IMHO it's a guest (driver or app, I'm not sure..)
bug to make such a region a DMA target in the first place. The outcome
of such a setup should be undefined. It will be the same after applying the
proposal I raised: QEMU will have undefined behavior when such PCI BARs
are used as DMA targets.
Thanks,
>
> Right. And virtio currently isn't friendly to the bounce buffer.
> We can fix that but I worry about the perf impact.
>
> > The address_space_map() function says
> > "here's a host pointer to memory, do what you like to it", and
> > the caller is entitled to memcpy to/from it or otherwise
> > access it with any C operations, which are not guaranteed to
> > respect any kind of alignment or similar restrictions.
> >
> > My guess from commit 4a2e242bbb30 is that that applied an
> > overly broad "don't do direct access" hammer to all
> > vfio assigned devices, and that there needs to be some
> > concept of "this vfio assigned device's region is OK for
> > direct access" vs "this other one is not", such that if
> > this GH100 card's BAR guarantees it can be treated entirely
> > as RAM then we can have memory_region_supports_direct_access()
> > return true for it.
> >
> > thanks
> > -- PMM
>
--
Peter Xu
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-10 18:30 ` Stefan Hajnoczi
@ 2026-06-10 21:00 ` Michael S. Tsirkin
2026-06-11 1:19 ` Gavin Shan
1 sibling, 0 replies; 37+ messages in thread
From: Michael S. Tsirkin @ 2026-06-10 21:00 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Gavin Shan, qemu-devel, qemu-arm, jugraham, shan.gavin,
qemu-block
On Wed, Jun 10, 2026 at 02:30:46PM -0400, Stefan Hajnoczi wrote:
> On Wed, Jun 10, 2026 at 05:49:21AM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
> > > On the guest where a NVidia's GH100 card is passed from the host, the
> > > guest system hang can be observed on attempt to compile 'cuda-samples',
> > > as reported by Julia.
> > >
> > > host$ lspci | grep GH100
> > > 0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
> > > host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64 -accel kvm \
> > > -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T \
> > > -cpu host -smp cpus=32 -m size=8G \
> > > -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0 \
> > > -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4 \
> > > -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
> > >
> > > guest$ cd cuda-samples/build
> > > guest$ make -j 20 clean
> > > guest$ make -j 20
> > > :
> > > [ 54%] Linking CUDA executable graphMemoryNodes
> > > [ 54%] Built target graphMemoryNodes
> > > <no more output afterwards, guest becomes frozen here>
> > >
> > > guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
> > > [ 555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
> > >
> > > When the GPU's driver (NVidia open driver) is loaded on guest bootup,
> > > the memory blocks residing in the PCI BAR can be presented to the guest
> > > through memory hot-add. The page cache can be allocated from the hot added
> > > memory blocks when cuda-samples is being built. Afterwards, he page cache
> > > is sent to QEMU's virtio-blk device as part of the DMA request, the bounce
> > > buffer is used to accomodate the request as the corresponding memory
> > > region (MemoryRegion) is a RAM DEVICE region in qemu. For this specific
> > > case, false is returned from memory_access_is_direct() in the path where
> > > the DMA request is handled.
> > >
> > > QEMU
> > > ====
> > > virtio_blk_handle_output
> > > virtio_blk_handle_vq
> > > virtio_blk_get_request
> > > virtqueue_pop
> > > virtqueue_split_pop
> > > virtqueue_map_desc
> > > address_space_map
> > > memory_access_is_direct # Return false
> > > memory_region_supports_direct_access
> > >
> > > (qemu) info mtree
> > > :
> > > memory-region: pci_bridge_pci
> > > 0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
> > > 0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
> > > 0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
> > > 0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
> > >
> > > By default, the max bounce buffer size is only 4096 bytes, even less
> > > than one page when the guest page is 64KB. This tries to fix the issue
> > > by inheriting the customized max bounce buffer size of the virtio bus's
> > > parent through property 'x-max-bounce-buffer-size' when the customized
> > > size is a larger one. With this applied, no guest system hang is seen
> > > with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
> > >
> > > Reported-by: Julia Graham <jugraham@redhat.com>
> > > Signed-off-by: Gavin Shan <gshan@redhat.com>
> > > ---
> > > hw/virtio/virtio-bus.c | 14 ++++++++++++++
> > > 1 file changed, 14 insertions(+)
> > >
> > > diff --git a/hw/virtio/virtio-bus.c b/hw/virtio/virtio-bus.c
> > > index cef944e015..e0933823f3 100644
> > > --- a/hw/virtio/virtio-bus.c
> > > +++ b/hw/virtio/virtio-bus.c
> > > @@ -42,6 +42,7 @@ do { printf("virtio_bus: " fmt , ## __VA_ARGS__); } while (0)
> > > /* A VirtIODevice is being plugged */
> > > void virtio_bus_device_plugged(VirtIODevice *vdev, Error **errp)
> > > {
> > > + AddressSpace *as;
> > > DeviceState *qdev = DEVICE(vdev);
> > > BusState *qbus = BUS(qdev_get_parent_bus(qdev));
> > > VirtioBusState *bus = VIRTIO_BUS(qbus);
> > > @@ -100,6 +101,19 @@ void virtio_bus_device_plugged(VirtIODevice *vdev, Error **errp)
> > > return;
> > > }
> > > }
> > > + } else {
> > > + /*
> > > + * The maximal bounce buffer size of the virtio bus's parent may
> > > + * have been customized by property 'x-max-bounce-buffer-size'.
> > > + * Lets inherit the customized size if it's larger than the
> > > + * current one.
> > > + */
> > > + as = klass->get_dma_as ? klass->get_dma_as(qbus->parent) : NULL;
> > > + if (as) {
> > > + vdev->dma_as->max_bounce_buffer_size = MAX(
> > > + vdev->dma_as->max_bounce_buffer_size,
> > > + as->max_bounce_buffer_size);
> > > + }
> > > }
> > > }
> > >
> > > --
> > > 2.54.0
> >
> >
> > Problem with all this is, users would not know how to size this.
> >
> > So fundamentally, is not the issue that virtio blk (and scsi!) maps
> > all of the buffer all the time?
> >
> > It's not hard to add something like virtio_pop_unmapped that would not map,
> > then build QEMUSGLists out of addr/len pairs and submit these.
> >
> > Stefan, do you think doing it like this would be bad for perf? Good for
> > perf?
>
> I'd like to first make sure that the BAR really cannot be mmapped.
The issue is that QEMU has no way to know up front.
What we could conceivably do is map it, then do the
accesses from QEMU through the bounce buffer and
DMA through the mmap.
> A bounce buffer is necessary when QEMU has no way of mmapping the memory
> (e.g. it needs to invoke a device model's callback to read/write the
> MemoryRegion).
>
> The reason why the bounce buffer size is low is because it's normally
> only used on emulated machines where MMIO registers or similar small
> MemoryRegions are accessed by DMA. If we ran into this on modern
> machines there would also be other consequences like vhost devices would
> be unable to access that memory since it cannot be shared/mmapped.
>
> This is why I think we need to understand why this BAR is a RAM DEVICE.
VFIO maps all memory BARs like this.
> If it can support mmap then this issue, plus anything else like vhost,
> would work.
>
> Gavin, can you share the output of `lspci -vv -s 0009:01:00.0`?
>
> Thanks,
> Stefan
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-10 19:10 ` Peter Xu
@ 2026-06-10 21:03 ` Michael S. Tsirkin
2026-06-10 21:27 ` Peter Xu
0 siblings, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2026-06-10 21:03 UTC (permalink / raw)
To: Peter Xu
Cc: Peter Maydell, Gavin Shan, Pavel Hrdina, Daniel P. Berrangé,
qemu-devel, qemu-arm, jugraham, shan.gavin, Alex Williamson,
David Hildenbrand
On Wed, Jun 10, 2026 at 03:10:46PM -0400, Peter Xu wrote:
> On Wed, Jun 10, 2026 at 12:19:39PM -0400, Michael S. Tsirkin wrote:
> > On Wed, Jun 10, 2026 at 05:11:40PM +0100, Peter Maydell wrote:
> > > On Wed, 10 Jun 2026 at 16:37, Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Wed, Jun 10, 2026 at 10:06:24AM -0400, Michael S. Tsirkin wrote:
> > > > > This is the change that broke it I think?
> > > > >
> > > > >
> > > > > commit 4a2e242bbb306ef5c16ce9e7bb2da3bd8a4eb098
> > > > > Author: Alex Williamson <alex@shazbot.org>
> > > > > Date: Mon Oct 31 09:53:03 2016 -0600
> > > > >
> > > > > memory: Don't use memcpy for ram_device regions
> > > > >
> > > > >
> > > > > Maybe Alex has an opinion on what to do.
> > > >
> > > > I can offer one idea here..
> > > >
> > > > IIUC the major issue was vector ops but the mr ops might be too heavy, then
> > > > another way to fix it is in memory API instead of using memcpy()/memmove(),
> > > > we always use a helper (say, memmove_no_vector()) to do the split and
> > > > properly aligned IOs as what ram_device_mem_ops does right now, this should
> > > > only applies to ram_device.
> > >
> > > If the underlying memory needs to be accessed only with specific
> > > alignment/size, as the 4a2e242bbb30 commit message suggests, then
> > > we cannot expose it via address_space_map(), so we must have
> > > a bounce-buffer.
>
> I get the point; this is technically a concern, but IMHO it's still
> slightly different, and I expect it non-issue in reality.
>
> Essentially we can have two ways to iteract with the pci bar:
>
> 1) via vCPU / CPU access
> 2) via DMA targets
>
> Alex can correct me, but IIUC that problem was when the CPU accesses the
> mapped region with memcpy(), rather than making that bar to be DMA target.
> Hence, use case 1) only. So my current understanding is the proposal
> shouldn't (hopefully..) regress that realtek problem because use case 1) is
> properly covered.
>
> I always think it is very bogus to have any register-like MMIO regions to
> be passed over, maybe it's a bug already? It's because I don't know any
> way to guarantee DMA performs in a way that will be compatible with a pci
> bar that is register-based and will not have any side effect. Say, if some
> pci bar (real register-backed) must be accessed in 4B and aligned, how
> would a DMA request guarantee that?
>
> From that perspective, IMHO it's a guest (driver or app, I'm not sure..)
> bug to make such region to be DMA target in the first place. The outcome
> of such setup should be undefined. It'll be the same after applying the
> proposal I raised, that QEMU will have undefined behavior for such pci bars
> to be used as DMA targets.
>
> Thanks,
Sorry, wasting gigabytes and GB/s of main
RAM and PCI BW just to shuffle data back out the PCI bus
is out of the question.
You don't have to like it)
> >
> > Right. And virtio currently isn't friendly to the bounce buffer.
> > We can fix that but I worry about the perf impact.
> >
> > > The address_space_map() function says
> > > "here's a host pointer to memory, do what you like to it", and
> > > the caller is entitled to memcpy to/from it or otherwise
> > > access it with any C operations, which are not guaranteed to
> > > respect any kind of alignment or similar restrictions.
> > >
> > > My guess from commit 4a2e242bbb30 is that that applied an
> > > overly broad "don't do direct access" hammer to all
> > > vfio assigned devices, and that there needs to be some
> > > concept of "this vfio assigned device's region is OK for
> > > direct access" vs "this other one is not", such that if
> > > this GH100 card's BAR guarantees it can be treated entirely
> > > as RAM then we can have memory_region_supports_direct_access()
> > > return true for it.
> > >
> > > thanks
> > > -- PMM
> >
>
> --
> Peter Xu
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-10 21:03 ` Michael S. Tsirkin
@ 2026-06-10 21:27 ` Peter Xu
2026-06-10 21:44 ` Michael S. Tsirkin
0 siblings, 1 reply; 37+ messages in thread
From: Peter Xu @ 2026-06-10 21:27 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Peter Maydell, Gavin Shan, Pavel Hrdina, Daniel P. Berrangé,
qemu-devel, qemu-arm, jugraham, shan.gavin, Alex Williamson,
David Hildenbrand
On Wed, Jun 10, 2026 at 05:03:59PM -0400, Michael S. Tsirkin wrote:
> On Wed, Jun 10, 2026 at 03:10:46PM -0400, Peter Xu wrote:
> > On Wed, Jun 10, 2026 at 12:19:39PM -0400, Michael S. Tsirkin wrote:
> > > On Wed, Jun 10, 2026 at 05:11:40PM +0100, Peter Maydell wrote:
> > > > On Wed, 10 Jun 2026 at 16:37, Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > On Wed, Jun 10, 2026 at 10:06:24AM -0400, Michael S. Tsirkin wrote:
> > > > > > This is the change that broke it I think?
> > > > > >
> > > > > >
> > > > > > commit 4a2e242bbb306ef5c16ce9e7bb2da3bd8a4eb098
> > > > > > Author: Alex Williamson <alex@shazbot.org>
> > > > > > Date: Mon Oct 31 09:53:03 2016 -0600
> > > > > >
> > > > > > memory: Don't use memcpy for ram_device regions
> > > > > >
> > > > > >
> > > > > > Maybe Alex has an opinion on what to do.
> > > > >
> > > > > I can offer one idea here..
> > > > >
> > > > > IIUC the major issue was vector ops but the mr ops might be too heavy, then
> > > > > another way to fix it is in memory API instead of using memcpy()/memmove(),
> > > > > we always use a helper (say, memmove_no_vector()) to do the split and
> > > > > properly aligned IOs as what ram_device_mem_ops does right now, this should
> > > > > only applies to ram_device.
> > > >
> > > > If the underlying memory needs to be accessed only with specific
> > > > alignment/size, as the 4a2e242bbb30 commit message suggests, then
> > > > we cannot expose it via address_space_map(), so we must have
> > > > a bounce-buffer.
> >
> > I get the point; this is technically a concern, but IMHO it's still
> > slightly different, and I expect it non-issue in reality.
> >
> > Essentially we can have two ways to iteract with the pci bar:
> >
> > 1) via vCPU / CPU access
> > 2) via DMA targets
> >
> > Alex can correct me, but IIUC that problem was when the CPU accesses the
> > mapped region with memcpy(), rather than making that bar to be DMA target.
> > Hence, use case 1) only. So my current understanding is the proposal
> > shouldn't (hopefully..) regress that realtek problem because use case 1) is
> > properly covered.
> >
> > I always think it is very bogus to have any register-like MMIO regions to
> > be passed over, maybe it's a bug already? It's because I don't know any
> > way to guarantee DMA performs in a way that will be compatible with a pci
> > bar that is register-based and will not have any side effect. Say, if some
> > pci bar (real register-backed) must be accessed in 4B and aligned, how
> > would a DMA request guarantee that?
> >
> > From that perspective, IMHO it's a guest (driver or app, I'm not sure..)
> > bug to make such region to be DMA target in the first place. The outcome
> > of such setup should be undefined. It'll be the same after applying the
> > proposal I raised, that QEMU will have undefined behavior for such pci bars
> > to be used as DMA targets.
> >
> > Thanks,
>
> Sorry, wasting gibabytes and GB/s of main
> RAM and PCI BW just to shuffle data back out the PCI bus
> is out of the question.
>
> You don't have to like it)
I'm not sure whether my previous comment was clear, so just to make sure it
is..
The proposal was exactly about making ram_device directly accessible
by default. That is, make memory_region_supports_direct_access() return
true for ram_device like before; however, when QEMU itself does a direct
access, it should use a memmove_no_vector() helper instead of memcpy().
We leave DMA maps like virtio-blk's directly accessible, without
auditing whether device emulations use memcpy() or not: QEMU then faces the
same risk as bare metal when MMIO regions are used as DMA buffers, i.e.
undefined behavior.
In the case of the GPU BAR mapping, it should then work like RAM for virtio-blk.
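A minimal sketch of what such a helper could look like, assuming the goal
is simply "only naturally aligned 1/2/4/8-byte accesses, never anything
wider". The name toy_copy_no_vector() is hypothetical, it only handles
non-overlapping buffers (so it is not a full memmove()), and it is not
what ram_device_mem_ops does today. The volatile casts are there so the
compiler emits each access as written; the type punning assumes a build
where strict-aliasing optimizations are not a concern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy `len` bytes from `src` to `dst` using only naturally aligned
 * accesses of 8, 4, 2 or 1 bytes. Buffers must not overlap.
 */
static void toy_copy_no_vector(void *dst, const void *src, size_t len)
{
    volatile uint8_t *d = dst;
    const volatile uint8_t *s = src;

    while (len > 0) {
        /* Low bits set in either pointer limit the usable access size. */
        uintptr_t bits = (uintptr_t)d | (uintptr_t)s;

        if (len >= 8 && (bits & 7) == 0) {
            *(volatile uint64_t *)d = *(const volatile uint64_t *)s;
            d += 8; s += 8; len -= 8;
        } else if (len >= 4 && (bits & 3) == 0) {
            *(volatile uint32_t *)d = *(const volatile uint32_t *)s;
            d += 4; s += 4; len -= 4;
        } else if (len >= 2 && (bits & 1) == 0) {
            *(volatile uint16_t *)d = *(const volatile uint16_t *)s;
            d += 2; s += 2; len -= 2;
        } else {
            *d = *s;
            d += 1; s += 1; len -= 1;
        }
    }
}
```

With both pointers one byte past an 8-byte boundary and a length of 21,
this issues accesses of 1, 2, 4, 8, 4 and 2 bytes, each naturally aligned.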
Thanks,
>
> > >
> > > Right. And virtio currently isn't friendly to the bounce buffer.
> > > We can fix that but I worry about the perf impact.
> > >
> > > > The address_space_map() function says
> > > > "here's a host pointer to memory, do what you like to it", and
> > > > the caller is entitled to memcpy to/from it or otherwise
> > > > access it with any C operations, which are not guaranteed to
> > > > respect any kind of alignment or similar restrictions.
> > > >
> > > > My guess from commit 4a2e242bbb30 is that that applied an
> > > > overly broad "don't do direct access" hammer to all
> > > > vfio assigned devices, and that there needs to be some
> > > > concept of "this vfio assigned device's region is OK for
> > > > direct access" vs "this other one is not", such that if
> > > > this GH100 card's BAR guarantees it can be treated entirely
> > > > as RAM then we can have memory_region_supports_direct_access()
> > > > return true for it.
> > > >
> > > > thanks
> > > > -- PMM
> > >
> >
> > --
> > Peter Xu
>
>
--
Peter Xu
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-10 21:27 ` Peter Xu
@ 2026-06-10 21:44 ` Michael S. Tsirkin
0 siblings, 0 replies; 37+ messages in thread
From: Michael S. Tsirkin @ 2026-06-10 21:44 UTC (permalink / raw)
To: Peter Xu
Cc: Peter Maydell, Gavin Shan, Pavel Hrdina, Daniel P. Berrangé,
qemu-devel, qemu-arm, jugraham, shan.gavin, Alex Williamson,
David Hildenbrand
On Wed, Jun 10, 2026 at 05:27:31PM -0400, Peter Xu wrote:
> On Wed, Jun 10, 2026 at 05:03:59PM -0400, Michael S. Tsirkin wrote:
> > On Wed, Jun 10, 2026 at 03:10:46PM -0400, Peter Xu wrote:
> > > On Wed, Jun 10, 2026 at 12:19:39PM -0400, Michael S. Tsirkin wrote:
> > > > On Wed, Jun 10, 2026 at 05:11:40PM +0100, Peter Maydell wrote:
> > > > > On Wed, 10 Jun 2026 at 16:37, Peter Xu <peterx@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, Jun 10, 2026 at 10:06:24AM -0400, Michael S. Tsirkin wrote:
> > > > > > > This is the change that broke it I think?
> > > > > > >
> > > > > > >
> > > > > > > commit 4a2e242bbb306ef5c16ce9e7bb2da3bd8a4eb098
> > > > > > > Author: Alex Williamson <alex@shazbot.org>
> > > > > > > Date: Mon Oct 31 09:53:03 2016 -0600
> > > > > > >
> > > > > > > memory: Don't use memcpy for ram_device regions
> > > > > > >
> > > > > > >
> > > > > > > Maybe Alex has an opinion on what to do.
> > > > > >
> > > > > > I can offer one idea here..
> > > > > >
> > > > > > IIUC the major issue was vector ops but the mr ops might be too heavy, then
> > > > > > another way to fix it is in memory API instead of using memcpy()/memmove(),
> > > > > > we always use a helper (say, memmove_no_vector()) to do the split and
> > > > > > properly aligned IOs as what ram_device_mem_ops does right now, this should
> > > > > > only applies to ram_device.
> > > > >
> > > > > If the underlying memory needs to be accessed only with specific
> > > > > alignment/size, as the 4a2e242bbb30 commit message suggests, then
> > > > > we cannot expose it via address_space_map(), so we must have
> > > > > a bounce-buffer.
> > >
> > > I get the point; this is technically a concern, but IMHO it's still
> > > slightly different, and I expect it non-issue in reality.
> > >
> > > Essentially we can have two ways to iteract with the pci bar:
> > >
> > > 1) via vCPU / CPU access
> > > 2) via DMA targets
> > >
> > > Alex can correct me, but IIUC that problem was when the CPU accesses the
> > > mapped region with memcpy(), rather than making that bar to be DMA target.
> > > Hence, use case 1) only. So my current understanding is the proposal
> > > shouldn't (hopefully..) regress that realtek problem because use case 1) is
> > > properly covered.
> > >
> > > I always think it is very bogus to have any register-like MMIO regions to
> > > be passed over, maybe it's a bug already? It's because I don't know any
> > > way to guarantee DMA performs in a way that will be compatible with a pci
> > > bar that is register-based and will not have any side effect. Say, if some
> > > pci bar (real register-backed) must be accessed in 4B and aligned, how
> > > would a DMA request guarantee that?
> > >
> > > From that perspective, IMHO it's a guest (driver or app, I'm not sure..)
> > > bug to make such region to be DMA target in the first place. The outcome
> > > of such setup should be undefined. It'll be the same after applying the
> > > proposal I raised, that QEMU will have undefined behavior for such pci bars
> > > to be used as DMA targets.
> > >
> > > Thanks,
> >
> > Sorry, wasting gibabytes and GB/s of main
> > RAM and PCI BW just to shuffle data back out the PCI bus
> > is out of the question.
> >
> > You don't have to like it)
>
> I'm not sure if my previous comment wasn't clear, but just to make sure it
> is..
>
> The proposal was exactly about making ram_device to be directly accessible
> by default. That is, make memory_region_supports_direct_access() return
> true for ram_device like before, however when doing direct access from
> QEMU, QEMU should use memmove_no_vector() version instead of memcpy().
>
> We leave DMA maps like virtio-blk to be directly accessible without
> auditing device emulations using memcpy() or not: then QEMU faces the same
> risk to bare metal where MMIO regions used for DMA buffers, then it's
> undefined behavior.
>
> In case of GPU bar mapping, it should then work like RAM for virtio-blk.
>
> Thanks,
Ah, I get it. I'm not sure we even need memmove_no_vector(); we can keep
the bounce buffer if we want.
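For reference, here is a rough sketch of what such a helper could look like,
assuming the only goal is to split a copy into naturally aligned 8/4/2/1-byte
accesses, as ram_device_mem_ops does today. The name copy_no_vector() is made
up for illustration, it handles non-overlapping buffers only, and it relies on
volatile accesses and a build with -fno-strict-aliasing. Nothing in this thread
says such a helper exists in QEMU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy len bytes using only naturally aligned
 * 8/4/2/1-byte accesses. The volatile qualifiers keep the compiler from
 * merging the accesses into wider (e.g. vector) loads and stores.
 * The two buffers must not overlap.
 */
static void copy_no_vector(void *dst, const void *src, size_t len)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    assert((uintptr_t)d + len <= (uintptr_t)s ||
           (uintptr_t)s + len <= (uintptr_t)d);

    while (len > 0) {
        /* Both pointers must be aligned for the chosen access width. */
        uintptr_t align = (uintptr_t)d | (uintptr_t)s;
        size_t n;

        if (len >= 8 && (align & 7) == 0) {
            *(volatile uint64_t *)d = *(const volatile uint64_t *)s;
            n = 8;
        } else if (len >= 4 && (align & 3) == 0) {
            *(volatile uint32_t *)d = *(const volatile uint32_t *)s;
            n = 4;
        } else if (len >= 2 && (align & 1) == 0) {
            *(volatile uint16_t *)d = *(const volatile uint16_t *)s;
            n = 2;
        } else {
            *(volatile uint8_t *)d = *(const volatile uint8_t *)s;
            n = 1;
        }

        d += n;
        s += n;
        len -= n;
    }
}
```

A misaligned head or tail falls back to narrower accesses, so a copy that
starts at an odd offset proceeds byte by byte until both pointers are aligned.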
> >
> > > >
> > > > Right. And virtio currently isn't friendly to the bounce buffer.
> > > > We can fix that but I worry about the perf impact.
> > > >
> > > > > The address_space_map() function says
> > > > > "here's a host pointer to memory, do what you like to it", and
> > > > > the caller is entitled to memcpy to/from it or otherwise
> > > > > access it with any C operations, which are not guaranteed to
> > > > > respect any kind of alignment or similar restrictions.
> > > > >
> > > > > My guess from commit 4a2e242bbb30 is that that applied an
> > > > > overly broad "don't do direct access" hammer to all
> > > > > vfio assigned devices, and that there needs to be some
> > > > > concept of "this vfio assigned device's region is OK for
> > > > > direct access" vs "this other one is not", such that if
> > > > > this GH100 card's BAR guarantees it can be treated entirely
> > > > > as RAM then we can have memory_region_supports_direct_access()
> > > > > return true for it.
> > > > >
> > > > > thanks
> > > > > -- PMM
> > > >
> > >
> > > --
> > > Peter Xu
> >
> >
>
> --
> Peter Xu
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-10 18:30 ` Stefan Hajnoczi
2026-06-10 21:00 ` Michael S. Tsirkin
@ 2026-06-11 1:19 ` Gavin Shan
1 sibling, 0 replies; 37+ messages in thread
From: Gavin Shan @ 2026-06-11 1:19 UTC (permalink / raw)
To: Stefan Hajnoczi, Michael S. Tsirkin
Cc: qemu-devel, qemu-arm, jugraham, shan.gavin, qemu-block
On 6/11/26 4:30 AM, Stefan Hajnoczi wrote:
> On Wed, Jun 10, 2026 at 05:49:21AM -0400, Michael S. Tsirkin wrote:
>> On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
>>> On the guest where a NVidia's GH100 card is passed from the host, the
>>> guest system hang can be observed on attempt to compile 'cuda-samples',
>>> as reported by Julia.
>>>
>>> host$ lspci | grep GH100
>>> 0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
>>> host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64 -accel kvm \
>>> -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T \
>>> -cpu host -smp cpus=32 -m size=8G \
>>> -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0 \
>>> -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4 \
>>> -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
>>>
>>> guest$ cd cuda-samples/build
>>> guest$ make -j 20 clean
>>> guest$ make -j 20
>>> :
>>> [ 54%] Linking CUDA executable graphMemoryNodes
>>> [ 54%] Built target graphMemoryNodes
>>> <no more output afterwards, guest becomes frozen here>
>>>
>>> guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
>>> [ 555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
>>>
>>> When the GPU's driver (NVidia open driver) is loaded on guest bootup,
>>> the memory blocks residing in the PCI BAR can be presented to the guest
>>> through memory hot-add. The page cache can be allocated from the hot added
>>> memory blocks when cuda-samples is being built. Afterwards, the page cache
>>> is sent to QEMU's virtio-blk device as part of the DMA request, the bounce
>>> buffer is used to accommodate the request as the corresponding memory
>>> region (MemoryRegion) is a RAM DEVICE region in qemu. For this specific
>>> case, false is returned from memory_access_is_direct() in the path where
>>> the DMA request is handled.
>>>
>>> QEMU
>>> ====
>>> virtio_blk_handle_output
>>> virtio_blk_handle_vq
>>> virtio_blk_get_request
>>> virtqueue_pop
>>> virtqueue_split_pop
>>> virtqueue_map_desc
>>> address_space_map
>>> memory_access_is_direct # Return false
>>> memory_region_supports_direct_access
>>>
>>> (qemu) info mtree
>>> :
>>> memory-region: pci_bridge_pci
>>> 0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
>>> 0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
>>> 0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
>>> 0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
>>>
>>> By default, the max bounce buffer size is only 4096 bytes, even less
>>> than one page when the guest page is 64KB. This tries to fix the issue
>>> by inheriting the customized max bounce buffer size of the virtio bus's
>>> parent through property 'x-max-bounce-buffer-size' when the customized
>>> size is a larger one. With this applied, no guest system hang is seen
>>> with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
>>>
>>> Reported-by: Julia Graham <jugraham@redhat.com>
>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>> ---
>>> hw/virtio/virtio-bus.c | 14 ++++++++++++++
>>> 1 file changed, 14 insertions(+)
>>>
>>> diff --git a/hw/virtio/virtio-bus.c b/hw/virtio/virtio-bus.c
>>> index cef944e015..e0933823f3 100644
>>> --- a/hw/virtio/virtio-bus.c
>>> +++ b/hw/virtio/virtio-bus.c
>>> @@ -42,6 +42,7 @@ do { printf("virtio_bus: " fmt , ## __VA_ARGS__); } while (0)
>>> /* A VirtIODevice is being plugged */
>>> void virtio_bus_device_plugged(VirtIODevice *vdev, Error **errp)
>>> {
>>> + AddressSpace *as;
>>> DeviceState *qdev = DEVICE(vdev);
>>> BusState *qbus = BUS(qdev_get_parent_bus(qdev));
>>> VirtioBusState *bus = VIRTIO_BUS(qbus);
>>> @@ -100,6 +101,19 @@ void virtio_bus_device_plugged(VirtIODevice *vdev, Error **errp)
>>> return;
>>> }
>>> }
>>> + } else {
>>> + /*
>>> + * The maximal bounce buffer size of the virtio bus's parent may
>>> + * have been customized by property 'x-max-bounce-buffer-size'.
>>> + * Lets inherit the customized size if it's larger than the
>>> + * current one.
>>> + */
>>> + as = klass->get_dma_as ? klass->get_dma_as(qbus->parent) : NULL;
>>> + if (as) {
>>> + vdev->dma_as->max_bounce_buffer_size = MAX(
>>> + vdev->dma_as->max_bounce_buffer_size,
>>> + as->max_bounce_buffer_size);
>>> + }
>>> }
>>> }
>>>
>>> --
>>> 2.54.0
>>
>>
>> Problem with all this is, users would not know how to size this.
>>
>> So fundamentally, is not the issue that virtio blk (and scsi!) maps
>> all of the buffer all the time?
>>
>> It's not hard to add something like virtio_pop_unmapped that would not map,
>> then build QEMUSGLists out of addr/len pairs and submit these.
>>
>> Stefan, do you think doing it like this would be bad for perf? Good for
>> perf?
>
> I'd like to first make sure that the BAR really cannot be mmapped.
>
> A bounce buffer is necessary when QEMU has no way of mmapping the memory
> (e.g. it needs to invoke a device model's callback to read/write the
> MemoryRegion).
>
> The reason why the bounce buffer size is low is because it's normally
> only used on emulated machines where MMIO registers or similar small
> MemoryRegions are accessed by DMA. If we ran into this on modern
> machines there would also be other consequences like vhost devices would
> be unable to access that memory since it cannot be shared/mmapped.
>
> This is why I think we need to understand why this BAR is a RAM DEVICE.
> If it can support mmap then this issue, plus anything else like vhost,
> would work.
>
> Gavin, can you share the output of `lspci -vv -s 0009:01:00.0`?
>
root@nvidia-grace-hopper-01:/home/gavin/sandbox/qemu.main# lspci -vv -s 0009:01:00.0
0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
Subsystem: NVIDIA Corporation Device 1809
Physical Slot: 9
Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 170
NUMA node: 0
IOMMU group: 12
Region 0: Memory at 661ffd000000 (64-bit, prefetchable) [size=16M]
Region 2: Memory at 662000000000 (64-bit, prefetchable) [size=128G]
Region 4: Memory at 661ffe000000 (64-bit, prefetchable) [size=32M]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] MSI: Enable- Count=1/16 Maskable+ 64bit+
Address: 0000000000000000 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [60] Express (v2) Endpoint, IntMsgNum 0
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0W TEE-IO-
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 32GT/s, Width x1, ASPM L1, Exit Latency L1 <4us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 128 bytes, LnkDisable- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 16GT/s (downgraded), Width x1
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix+, MaxEETLPPrefixes 1
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
AtomicOpsCtl: ReqEn-
IDOReq- IDOCompl- LTR+ EmergencyPowerReductionReq-
10BitTagReq+ OBFF Disabled, EETLPPrefixBlk-
LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [9c] Vendor Specific Information: Len=14 <?>
Capabilities: [b0] MSI-X: Enable- Count=9 Masked-
Vector table: BAR=0 offset=00b90000
PBA: BAR=0 offset=00ba0000
Capabilities: [100 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [12c v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [14c v1] Data Link Feature <?>
Capabilities: [158 v1] Physical Layer 16.0 GT/s <?>
Capabilities: [188 v1] Physical Layer 32.0 GT/s <?>
Capabilities: [1b8 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CorrIntErr- HeaderOF-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CorrIntErr+ HeaderOF+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 04000001 0000000f 01080000 00000000
Capabilities: [200 v1] Lane Margining at the Receiver
PortCap: Uses Driver+
PortSta: MargReady- MargSoftReady-
Capabilities: [248 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [250 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration- 10BitTagReq+ IntMsgNum 0
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+ 10BitTagReq-
IOVSta: Migration-
Initial VFs: 24, Total VFs: 24, Number of VFs: 0, Function Dependency Link: 00
VF offset: 2, stride: 1, Device ID: 2342
Supported Page Size: 00000573, System Page Size: 00000001
Region 0: Memory at 0000661ffca00000 (64-bit, prefetchable)
Region 2: Memory at 0000000000000000 (64-bit, prefetchable)
Region 4: Memory at 0000000000000000 (64-bit, prefetchable)
VF Migration: offset: 00000000, BIR: 0
Capabilities: [2a4 v1] Vendor Specific Information: ID=0001 Rev=1 Len=014 <?>
Capabilities: [2b8 v1] Power Budgeting <?>
Capabilities: [2c8 v1] Data Object Exchange
DOECap: IntSup+
IntMsgNum 0
DOECtl: IntEn-
DOESta: Busy- IntSta- Error- ObjectReady-
Capabilities: [2e0 v1] Address Translation Service (ATS)
ATSCap: Invalidate Queue Depth: 00
ATSCtl: Enable+, Smallest Translation Unit: 00
Capabilities: [2e8 v1] Process Address Space ID (PASID)
PASIDCap: Exec- Priv-, Max PASID Width: 14
PASIDCtl: Enable+ Exec- Priv-
Capabilities: [2f0 v1] Device Serial Number c5-c5-17-ff-f6-2d-b0-48
Kernel driver in use: nvgrace_gpu_vfio_pci
Kernel modules: nouveau
> Thanks,
> Stefan
Thanks,
Gavin
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-10 16:18 ` Michael S. Tsirkin
@ 2026-06-11 4:33 ` Gavin Shan
2026-06-11 5:31 ` Michael S. Tsirkin
0 siblings, 1 reply; 37+ messages in thread
From: Gavin Shan @ 2026-06-11 4:33 UTC (permalink / raw)
To: Michael S. Tsirkin, Peter Xu
Cc: Pavel Hrdina, Daniel P. Berrangé, qemu-devel, qemu-arm,
jugraham, shan.gavin, Alex Williamson, David Hildenbrand
Hi Peter, Michael and Alex,
On 6/11/26 2:18 AM, Michael S. Tsirkin wrote:
> On Wed, Jun 10, 2026 at 11:36:55AM -0400, Peter Xu wrote:
>> On Wed, Jun 10, 2026 at 10:06:24AM -0400, Michael S. Tsirkin wrote:
>>> On Wed, Jun 10, 2026 at 11:54:47PM +1000, Gavin Shan wrote:
>>>> Hi Michael and Peter,
>>>>
>>>> On 6/10/26 11:00 PM, Gavin Shan wrote:
>>>>> On 6/10/26 10:27 PM, Michael S. Tsirkin wrote:
>>>>>> On Wed, Jun 10, 2026 at 10:19:31PM +1000, Gavin Shan wrote:
>>>>>>> On 6/10/26 10:12 PM, Michael S. Tsirkin wrote:
>>>>>>>> On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
>>>>>>>>> On 6/10/26 7:54 PM, Pavel Hrdina wrote:
>>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>>>>>
>>>>>>>>>> You did not answer the question that Daniel was asking, how will user
>>>>>>>>>> know that max-bounce-buffer-size should be used if it's necessary to fix
>>>>>>>>>> guest system hangs and how will user know what magic value should be set?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Sorry that I missed to answer Daniel's questions. For this specific case,
>>>>>>>>> user need to enlarge the bounce buffer size when seeing the following error
>>>>>>>>> message. We can add an explicit one in address_space_map() if the existing
>>>>>>>>> error message isn't obvious.
>>>>>>>>>
>>>>>>>>> qemu-system-aarch64: virtio: bogus descriptor or out of resources
>>>>>>>>>
>>>>>>>>> void *address_space_map(AddressSpace *as,
>>>>>>>>> hwaddr addr,
>>>>>>>>> hwaddr *plen,
>>>>>>>>> bool is_write,
>>>>>>>>> MemTxAttrs attrs)
>>>>>>>>> {
>>>>>>>>> if (!memory_access_is_direct(mr, is_write, attrs)) {
>>>>>>>>> if (l == 0) {
>>>>>>>>> error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
>>>>>>>>> *plen = 0;
>>>>>>>>> return NULL;
>>>>>>>>> }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> As to the value user should take for max-bounce-buffer-size, it is really case by case
>>>>>>>>> and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
>>>>>>>>> smallest value works for them. The worst case is to set 0xFFFFFFFF.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> This is not at all reasonable. All kind of fixes are possible but
>>>>>>>> fundamentally, bounce buffering data path is by itself already a
>>>>>>>> bad idea.
>>>>>>>>
>>>>>>>> I have no idea what does bounce buffering device ram accomplish.
>>>>>>>>
>>>>>>>> In the end, qemu still simply reads the memory from/to the buffer.
>>>>>>>>
>>>>>>>> My suggestion is to first of all look for ways to mark the
>>>>>>>> memory as direct.
>>>>>>>>
>>>>>>>
>>>>>>> As I explained to Peter Xu in another reply, we can't simply mark the (RAM
>>>>>>> DEVICE) memory region is directly accessible. The memory region is initialized
>>>>>>> by memory_region_init_ram_device_ptr() in hw/vfio/region.c::vfio_region_mmap().
>>>>>>>
>>>>>>> The accesses to the memory region is handled by 'ram_device_mem_ops' where
>>>>>>> {ldn, stn}_he_p() are used in its read/write handler. They're different
>>>>>>> from memcpy() since the data endianness is well handled in {ldn, stn}_he_p().
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Gavin
>>>>>>>
>>>>>>
>>>>>> What is endianness set to, for this region?
>>>>>>
>>>>>
>>>>> The endianness of the memory region is set to that for the host.
>>>>>
>>>>> static const MemoryRegionOps ram_device_mem_ops = {
>>>>> .read = memory_region_ram_device_read,
>>>>> .write = memory_region_ram_device_write,
>>>>> .endianness = HOST_BIG_ENDIAN ? DEVICE_BIG_ENDIAN : DEVICE_LITTLE_ENDIAN,
>>>>> };
>>>>>
>>>
>>> So there is never any endianness translation.
>>> I think the reason qemu does the bounce buffer is more
>>> to prevent things like vector access from MMIO.
>>>
>>>
>>>> How about to treat the RAM DEVICE memory region directly accessible in
>>>> address_space_map() only when HOST_BIG_ENDIAN is false,
>>>> something like
>>>> below and I don't hit the guest hang issue with the changes.
>>>>
>>>> diff --git a/include/system/memory.h b/include/system/memory.h
>>>> index 1417132f6d..9daca55251 100644
>>>> --- a/include/system/memory.h
>>>> +++ b/include/system/memory.h
>>>> @@ -2908,7 +2908,8 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr);
>>>> int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr);
>>>> bool prepare_mmio_access(MemoryRegion *mr);
>>>> -static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
>>>> +static inline bool memory_region_supports_direct_access(const MemoryRegion *mr,
>>>> + bool check_ram_device)
>>>> {
>>>> /* ROM DEVICE regions only allow direct access if in ROMD mode. */
>>>> if (memory_region_is_romd(mr)) {
>>>> @@ -2922,13 +2923,14 @@ static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
>>>> * be MMIO and access using mempy can be wrong (e.g., using instructions not
>>>> * intended for MMIO access). So we treat this as IO.
>>>> */
>>>> - return !memory_region_is_ram_device(mr);
>>>> + return (!check_ram_device || !memory_region_is_ram_device(mr));
>>>> }
>>>> static inline bool memory_access_is_direct(const MemoryRegion *mr,
>>>> + bool check_ram_device,
>>>> bool is_write, MemTxAttrs attrs)
>>>> {
>>>> - if (!memory_region_supports_direct_access(mr)) {
>>>> + if (!memory_region_supports_direct_access(mr, check_ram_device)) {
>>>> return false;
>>>> }
>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>> index 7bcbf87573..2e6b72b124 100644
>>>> --- a/system/physmem.c
>>>> +++ b/system/physmem.c
>>>> @@ -3724,7 +3724,7 @@ void *address_space_map(AddressSpace *as,
>>>> fv = address_space_to_flatview(as);
>>>> mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
>>>> - if (!memory_access_is_direct(mr, is_write, attrs)) {
>>>> + if (!memory_access_is_direct(mr, HOST_BIG_ENDIAN, is_write, attrs)) {
>>>> size_t used = qatomic_read(&as->bounce_buffer_size);
>>>> for (;;) {
>>>> hwaddr alloc = MIN(as->max_bounce_buffer_size - used, l);
>>>>
>>>> Thanks,
>>>> Gavin
>>>>
>>>
>>> I do not think it has anything to do with host endian-ness.
>>>
>>>
>>> This is the change that broke it I think?
>>>
>>>
>>> commit 4a2e242bbb306ef5c16ce9e7bb2da3bd8a4eb098
>>> Author: Alex Williamson <alex@shazbot.org>
>>> Date: Mon Oct 31 09:53:03 2016 -0600
>>>
>>> memory: Don't use memcpy for ram_device regions
>>>
>>>
>>> Maybe Alex has an opinion on what to do.
>>
>> I can offer one idea here..
>>
>> IIUC the major issue was vector ops but the mr ops might be too heavy, then
>> another way to fix it is in memory API instead of using memcpy()/memmove(),
>> we always use a helper (say, memmove_no_vector()) to do the split and
>> properly aligned IOs as what ram_device_mem_ops does right now, this should
>> only applies to ram_device.
>>
>> With that, IIUC we can remove the current ram_device_mem_ops, then in
>> Gavin's case mmap() will go through and guest will not need to vmexit at
>> all. Best perf, issue solve.
>>
>> We just need to be careful to trap all possible memcpy()/memmove() used in
>> memory core.. if I didn't miss any, IMO below four should needs to be
>> replaced by memmove_no_vector():
>>
>> flatview_write_continue_step()
>> flatview_read_continue_step()
>> address_space_read()
>> address_space_write_rom()
>>
>> Thanks,
>>
>> --
>> Peter Xu
>
> First, this is a nice idea.
> Second, the ideal thing is still just allowing direct access.
> And I think VFIO actually knows it's regular RAM.
> So something like the following small patch in linux, maybe?
>
If I understand correctly, Peter's proposal moves the logic covered by
ram_device_mem_ops to the upper layer. I tend to agree with Michael that
we need the host to expose a flag (capability) indicating that the PCI BAR is
directly accessible. The capabilities associated with a PCI BAR are determined
by the host, so it makes sense to ask the host to expose the extra capability
when the PCI BAR is directly accessible.
With the flag (capability) exposed by the host, we split RAM DEVICE regions into
two classes: indirectly accessible regions and directly accessible regions. They
are identified by:
- indirectly accessible RAM DEVICE region
    MemoryRegion::ram         true
    MemoryRegion::ram_device  true
    MemoryRegion::ops         ram_device_mem_ops
- directly accessible RAM DEVICE region
    MemoryRegion::ram         true
    MemoryRegion::ram_device  true
    MemoryRegion::ops         unassigned_mem_ops
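As a sketch only, the classification above can be modelled with stand-in
types. MemoryRegion and MemoryRegionOps below are minimal mock-ups carrying
just the three fields listed, and classify_ram_device() is a hypothetical
name, not an existing QEMU function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock-ups of QEMU's types; only the fields discussed here are modelled. */
typedef struct MemoryRegionOps {
    const char *name;
} MemoryRegionOps;

static const MemoryRegionOps unassigned_mem_ops = { "unassigned" };
static const MemoryRegionOps ram_device_mem_ops = { "ram_device" };

typedef struct MemoryRegion {
    bool ram;
    bool ram_device;
    const MemoryRegionOps *ops;
} MemoryRegion;

typedef enum {
    REGION_NOT_RAM_DEVICE,
    REGION_RAM_DEVICE_INDIRECT,   /* ops == ram_device_mem_ops */
    REGION_RAM_DEVICE_DIRECT,     /* ops == unassigned_mem_ops */
} RamDeviceClass;

/* Hypothetical helper: decide the class purely from ram/ram_device/ops. */
static RamDeviceClass classify_ram_device(const MemoryRegion *mr)
{
    assert(mr != NULL);

    if (!mr->ram || !mr->ram_device) {
        return REGION_NOT_RAM_DEVICE;
    }
    return mr->ops == &unassigned_mem_ops ? REGION_RAM_DEVICE_DIRECT
                                          : REGION_RAM_DEVICE_INDIRECT;
}
```

The point of keying on the ops pointer is that no new MemoryRegion field is
needed; the cost is that the meaning of unassigned_mem_ops becomes overloaded.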
Before I send a kernel patch for review, I hope Alex can take a look and
agree to adding the extra capability as the indicator of a directly accessible
RAM DEVICE region.
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
> index fa056b69f899..a4ca2d01272c 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -418,6 +418,10 @@ static int nvgrace_gpu_ioctl_get_region_info(struct vfio_device *core_vdev,
> struct nvgrace_gpu_pci_core_device *nvdev =
> container_of(core_vdev, struct nvgrace_gpu_pci_core_device,
> core_device.vdev);
> + struct vfio_region_info_cap_direct_access direct_access = {
> + .header.id = VFIO_REGION_INFO_CAP_DIRECT_ACCESS,
> + .header.version = 1,
> + };
> struct vfio_region_info_cap_sparse_mmap *sparse;
> struct mem_region *memregion;
> u32 size;
> @@ -453,6 +457,13 @@ static int nvgrace_gpu_ioctl_get_region_info(struct vfio_device *core_vdev,
> if (ret)
> return ret;
>
> + if (info->index == USEMEM_REGION_INDEX) {
> + ret = vfio_info_add_capability(caps, &direct_access.header,
> + sizeof(direct_access));
> + if (ret)
> + return ret;
> + }
> +
> info->offset = VFIO_PCI_INDEX_TO_OFFSET(info->index);
> /*
> * The region memory size may not be power-of-2 aligned.
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 5de618a3a5ee..f475f4920b52 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -466,6 +466,16 @@ struct vfio_device_migration_info {
> */
> #define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3
>
> +/*
> + * The direct access capability informs that a mmappable region may be
> + * accessed by userspace using any CPU load/store operations.
> + */
> +#define VFIO_REGION_INFO_CAP_DIRECT_ACCESS 6
> +
> +struct vfio_region_info_cap_direct_access {
> + struct vfio_info_cap_header header;
> +};
> +
> /*
> * Capability with compressed real address (aka SSA - small system address)
> * where GPU RAM is mapped on a system bus. Used by a GPU for DMA routing
>
With the above code changes applied to the host, I'm able to avoid the guest
hang with the following additional changes in QEMU:
-----> hw/vfio/region.c
int vfio_region_mmap(VFIORegion *region)
{
    /* region->direct_access reflects VFIO_REGION_INFO_CAP_DIRECT_ACCESS */
    if (region->direct_access) {
        memory_region_init_ram_ptr(&region->mmaps[i].mem,
                                   memory_region_owner(region->mem),
                                   name, region->mmaps[i].size,
                                   region->mmaps[i].mmap);
        region->mmaps[i].mem.ram_device = true;
    } else {
        memory_region_init_ram_device_ptr(&region->mmaps[i].mem,
                                          memory_region_owner(region->mem),
                                          name, region->mmaps[i].size,
                                          region->mmaps[i].mmap);
    }
}
-----> system/memory.c
bool memory_region_has_unassigned_ops(const MemoryRegion *mr)
{
return mr->ops == &unassigned_mem_ops;
}
-----> include/system/memory.h
static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
{
/*
* RAM DEVICE regions can be accessed directly using memcpy, but it might
* be MMIO and access using mempy can be wrong (e.g., using instructions not
- * intended for MMIO access). So we treat this as IO.
+ * intended for MMIO access). So we treat this as IO unless it has been
+ * explicitly declared as being directly accessible. Directly accessible
+ * RAM DEVICE regions are identified by their ops pointing to
+ * unassigned_mem_ops.
*/
- return !memory_region_is_ram_device(mr);
+ return !memory_region_is_ram_device(mr) ||
+ memory_region_has_unassigned_ops(mr);
}
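For completeness, a sketch of how region->direct_access could be derived from
the region info returned by the host, by walking the VFIO capability chain.
The structs below are local mirrors of the vfio_region_info and
vfio_info_cap_header layouts so the example builds without kernel headers.
CAP_DIRECT_ACCESS (6) is the proposed capability ID from the kernel patch
above, not an upstream value, and region_has_cap() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct vfio_info_cap_header. */
struct cap_header {
    uint16_t id;
    uint16_t version;
    uint32_t next;       /* offset from start of info buffer, 0 = end */
};

/* Local mirror of struct vfio_region_info. */
struct region_info {
    uint32_t argsz;
    uint32_t flags;
    uint32_t index;
    uint32_t cap_offset; /* offset of first capability, if any */
    uint64_t size;
    uint64_t offset;
};

#define REGION_INFO_FLAG_CAPS (1u << 3) /* VFIO_REGION_INFO_FLAG_CAPS */
#define CAP_DIRECT_ACCESS     6         /* proposed in this thread only */

/* Hypothetical helper: true if the capability chain in buf contains id. */
static bool region_has_cap(const void *buf, size_t len, uint16_t id)
{
    struct region_info info;
    struct cap_header hdr = { 0 };
    size_t hops = 0;
    uint32_t off;

    assert(buf != NULL);
    if (len < sizeof(info)) {
        return false;
    }
    memcpy(&info, buf, sizeof(info));
    if (!(info.flags & REGION_INFO_FLAG_CAPS)) {
        return false;
    }

    for (off = info.cap_offset; off != 0; off = hdr.next) {
        /* Reject offsets outside the buffer and chains that loop. */
        if (off < sizeof(info) || off > len - sizeof(hdr) ||
            ++hops > len / sizeof(hdr)) {
            return false;
        }
        memcpy(&hdr, (const uint8_t *)buf + off, sizeof(hdr));
        if (hdr.id == id) {
            return true;
        }
    }
    return false;
}
```

With something like this, vfio_region_mmap() would only need the boolean
result; the bounds and loop checks are there because the buffer comes from
outside QEMU.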
Thanks,
Gavin
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-11 4:33 ` Gavin Shan
@ 2026-06-11 5:31 ` Michael S. Tsirkin
2026-06-11 6:28 ` Gavin Shan
0 siblings, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2026-06-11 5:31 UTC (permalink / raw)
To: Gavin Shan
Cc: Peter Xu, Pavel Hrdina, Daniel P. Berrangé, qemu-devel,
qemu-arm, jugraham, shan.gavin, Alex Williamson,
David Hildenbrand
On Thu, Jun 11, 2026 at 02:33:05PM +1000, Gavin Shan wrote:
> Hi Peter, Michael and Alex,
>
> On 6/11/26 2:18 AM, Michael S. Tsirkin wrote:
> > On Wed, Jun 10, 2026 at 11:36:55AM -0400, Peter Xu wrote:
> > > On Wed, Jun 10, 2026 at 10:06:24AM -0400, Michael S. Tsirkin wrote:
> > > > On Wed, Jun 10, 2026 at 11:54:47PM +1000, Gavin Shan wrote:
> > > > > Hi Michael and Peter,
> > > > >
> > > > > On 6/10/26 11:00 PM, Gavin Shan wrote:
> > > > > > On 6/10/26 10:27 PM, Michael S. Tsirkin wrote:
> > > > > > > On Wed, Jun 10, 2026 at 10:19:31PM +1000, Gavin Shan wrote:
> > > > > > > > On 6/10/26 10:12 PM, Michael S. Tsirkin wrote:
> > > > > > > > > On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
> > > > > > > > > > On 6/10/26 7:54 PM, Pavel Hrdina wrote:
> > > > > > > >
> > > > > > > > [...]
> > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > You did not answer the question that Daniel was asking, how will user
> > > > > > > > > > > know that max-bounce-buffer-size should be used if it's necessary to fix
> > > > > > > > > > > guest system hangs and how will user know what magic value should be set?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Sorry that I missed to answer Daniel's questions. For this specific case,
> > > > > > > > > > user need to enlarge the bounce buffer size when seeing the following error
> > > > > > > > > > message. We can add an explicit one in address_space_map() if the existing
> > > > > > > > > > error message isn't obvious.
> > > > > > > > > >
> > > > > > > > > > qemu-system-aarch64: virtio: bogus descriptor or out of resources
> > > > > > > > > >
> > > > > > > > > > void *address_space_map(AddressSpace *as,
> > > > > > > > > > hwaddr addr,
> > > > > > > > > > hwaddr *plen,
> > > > > > > > > > bool is_write,
> > > > > > > > > > MemTxAttrs attrs)
> > > > > > > > > > {
> > > > > > > > > > if (!memory_access_is_direct(mr, is_write, attrs)) {
> > > > > > > > > > if (l == 0) {
> > > > > > > > > > error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
> > > > > > > > > > *plen = 0;
> > > > > > > > > > return NULL;
> > > > > > > > > > }
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > As to the value user should take for max-bounce-buffer-size, it is really case by case
> > > > > > > > > > and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
> > > > > > > > > > smallest value works for them. The worst case is to set 0xFFFFFFFF.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > This is not at all reasonable. All kind of fixes are possible but
> > > > > > > > > fundamentally, bounce buffering data path is by itself already a
> > > > > > > > > bad idea.
> > > > > > > > >
> > > > > > > > > I have no idea what does bounce buffering device ram accomplish.
> > > > > > > > >
> > > > > > > > > In the end, qemu still simply reads the memory from/to the buffer.
> > > > > > > > >
> > > > > > > > > My suggestion is to first of all look for ways to mark the
> > > > > > > > > memory as direct.
> > > > > > > > >
> > > > > > > >
> > > > > > > > As I explained to Peter Xu in another reply, we can't simply mark the (RAM
> > > > > > > > DEVICE) memory region is directly accessible. The memory region is initialized
> > > > > > > > by memory_region_init_ram_device_ptr() in hw/vfio/region.c::vfio_region_mmap().
> > > > > > > >
> > > > > > > > The accesses to the memory region is handled by 'ram_device_mem_ops' where
> > > > > > > > {ldn, stn}_he_p() are used in its read/write handler. They're different
> > > > > > > > from memcpy() since the data endianness is well handled in {ldn, stn}_he_p().
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Gavin
> > > > > > > >
> > > > > > >
> > > > > > > What is endianness set to, for this region?
> > > > > > >
> > > > > >
> > > > > > The endianness of the memory region is set to that for the host.
> > > > > >
> > > > > > static const MemoryRegionOps ram_device_mem_ops = {
> > > > > > .read = memory_region_ram_device_read,
> > > > > > .write = memory_region_ram_device_write,
> > > > > > .endianness = HOST_BIG_ENDIAN ? DEVICE_BIG_ENDIAN : DEVICE_LITTLE_ENDIAN,
> > > > > > };
> > > > > >
> > > >
> > > > So there is never any endianness translation.
> > > > I think the reason qemu does the bounce buffer is more
> > > > to prevent things like vector access from MMIO.
> > > >
> > > >
> > > > > How about to treat the RAM DEVICE memory region directly accessible in
> > > > > address_space_map() only when HOST_BIG_ENDIAN is false,
> > > > > something like
> > > > > below and I don't hit the guest hang issue with the changes.
> > > > >
> > > > > diff --git a/include/system/memory.h b/include/system/memory.h
> > > > > index 1417132f6d..9daca55251 100644
> > > > > --- a/include/system/memory.h
> > > > > +++ b/include/system/memory.h
> > > > > @@ -2908,7 +2908,8 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr);
> > > > > int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr);
> > > > > bool prepare_mmio_access(MemoryRegion *mr);
> > > > > -static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
> > > > > +static inline bool memory_region_supports_direct_access(const MemoryRegion *mr,
> > > > > + bool check_ram_device)
> > > > > {
> > > > > /* ROM DEVICE regions only allow direct access if in ROMD mode. */
> > > > > if (memory_region_is_romd(mr)) {
> > > > > @@ -2922,13 +2923,14 @@ static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
> > > > > * be MMIO and access using mempy can be wrong (e.g., using instructions not
> > > > > * intended for MMIO access). So we treat this as IO.
> > > > > */
> > > > > - return !memory_region_is_ram_device(mr);
> > > > > + return (!check_ram_device || !memory_region_is_ram_device(mr));
> > > > > }
> > > > > static inline bool memory_access_is_direct(const MemoryRegion *mr,
> > > > > + bool check_ram_device,
> > > > > bool is_write, MemTxAttrs attrs)
> > > > > {
> > > > > - if (!memory_region_supports_direct_access(mr)) {
> > > > > + if (!memory_region_supports_direct_access(mr, check_ram_device)) {
> > > > > return false;
> > > > > }
> > > > > diff --git a/system/physmem.c b/system/physmem.c
> > > > > index 7bcbf87573..2e6b72b124 100644
> > > > > --- a/system/physmem.c
> > > > > +++ b/system/physmem.c
> > > > > @@ -3724,7 +3724,7 @@ void *address_space_map(AddressSpace *as,
> > > > > fv = address_space_to_flatview(as);
> > > > > mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
> > > > > - if (!memory_access_is_direct(mr, is_write, attrs)) {
> > > > > + if (!memory_access_is_direct(mr, HOST_BIG_ENDIAN, is_write, attrs)) {
> > > > > size_t used = qatomic_read(&as->bounce_buffer_size);
> > > > > for (;;) {
> > > > > hwaddr alloc = MIN(as->max_bounce_buffer_size - used, l);
> > > > >
> > > > > Thanks,
> > > > > Gavin
> > > > >
> > > >
> > > > I do not think it has anything to do with host endian-ness.
> > > >
> > > >
> > > > This is the change that broke it I think?
> > > >
> > > >
> > > > commit 4a2e242bbb306ef5c16ce9e7bb2da3bd8a4eb098
> > > > Author: Alex Williamson <alex@shazbot.org>
> > > > Date: Mon Oct 31 09:53:03 2016 -0600
> > > >
> > > > memory: Don't use memcpy for ram_device regions
> > > >
> > > > Maybe Alex has an opinion on what to do.
> > >
> > > I can offer one idea here..
> > >
> > > IIUC the major issue was vector ops but the mr ops might be too heavy, then
> > > another way to fix it is in memory API instead of using memcpy()/memmove(),
> > > we always use a helper (say, memmove_no_vector()) to do the split and
> > > properly aligned IOs as what ram_device_mem_ops does right now, this should
> > > only applies to ram_device.
> > >
> > > With that, IIUC we can remove the current ram_device_mem_ops, then in
> > > Gavin's case mmap() will go through and guest will not need to vmexit at
> > > all. Best perf, issue solve.
> > >
> > > We just need to be careful to trap all possible memcpy()/memmove() used in
> > > memory core.. if I didn't miss any, IMO below four should needs to be
> > > replaced by memmove_no_vector():
> > >
> > > flatview_write_continue_step()
> > > flatview_read_continue_step()
> > > address_space_read()
> > > address_space_write_rom()
> > >
> > > Thanks,
> > >
> > > --
> > > Peter Xu
> >
> > First, this is a nice idea.
> > Second, the ideal thing is still just allowing direct access.
> > And I think VFIO actually knows it's regular RAM.
> > So something like the following small patch in linux, maybe?
> >
>
> If I understood everything, Peter's proposal seems to move the logics covered
> by ram_device_mem_ops to the upper layer.
I think the basics of Peter's idea are really simple: if the guest is doing
DMA into a region then that access is treating that region as RAM and so
any vector etc. instructions into it are fine.
So we can fix specifically DMA into RAM DEVICE to bypass bounce buffering.
It's at the low memory level, not the upper layer.
He also apparently feels bounce buffering isn't needed
generally and can be replaced with memmove_no_vector? And
somehow virtio DMA can be done without kicking the host? I'm not
sure I understand these parts.
> I tends to agree with Michael that
> we need the host to expose a flag (capability) indicating the PCI BAR is directly
> accessible. The capabilities associated with the PCI BAR is determined by the
> host, it's making sense to ask host to expose the extra capability if the PCI
> BAR is directly accessible.
>
> With the flag (capability) exposed from host, we split RAM DEVICE region into
> two classes: indirectly accessible region and directly accessible region. They're
> identified by:
>
> - indirectly accessible RAM DEVICE region
> MemoryRegion::ram true
> MemoryRegion::ram_device true
> MemoryRegion::ops ram_device_mem_ops
>
> - directly accessible RAM DEVICE region
> MemoryRegion::ram true
> MemoryRegion::ram_device true
> MemoryRegion::ops unassigned_mem_ops
>
> Before I'm going to send a kernel patch for review, I hope Alex can take a look
> and agree to add the extra capability as the indicator of directly accessible
> RAM DEVICE region.
>
> >
> > diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
> > index fa056b69f899..a4ca2d01272c 100644
> > --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> > +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> > @@ -418,6 +418,10 @@ static int nvgrace_gpu_ioctl_get_region_info(struct vfio_device *core_vdev,
> > struct nvgrace_gpu_pci_core_device *nvdev =
> > container_of(core_vdev, struct nvgrace_gpu_pci_core_device,
> > core_device.vdev);
> > + struct vfio_region_info_cap_direct_access direct_access = {
> > + .header.id = VFIO_REGION_INFO_CAP_DIRECT_ACCESS,
> > + .header.version = 1,
> > + };
> > struct vfio_region_info_cap_sparse_mmap *sparse;
> > struct mem_region *memregion;
> > u32 size;
> > @@ -453,6 +457,13 @@ static int nvgrace_gpu_ioctl_get_region_info(struct vfio_device *core_vdev,
> > if (ret)
> > return ret;
> > + if (info->index == USEMEM_REGION_INDEX) {
> > + ret = vfio_info_add_capability(caps, &direct_access.header,
> > + sizeof(direct_access));
> > + if (ret)
> > + return ret;
> > + }
> > +
> > info->offset = VFIO_PCI_INDEX_TO_OFFSET(info->index);
> > /*
> > * The region memory size may not be power-of-2 aligned.
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 5de618a3a5ee..f475f4920b52 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -466,6 +466,16 @@ struct vfio_device_migration_info {
> > */
> > #define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3
> > +/*
> > + * The direct access capability informs that a mmappable region may be
> > + * accessed by userspace using any CPU load/store operations.
> > + */
> > +#define VFIO_REGION_INFO_CAP_DIRECT_ACCESS 6
> > +
> > +struct vfio_region_info_cap_direct_access {
> > + struct vfio_info_cap_header header;
> > +};
> > +
> > /*
> > * Capability with compressed real address (aka SSA - small system address)
> > * where GPU RAM is mapped on a system bus. Used by a GPU for DMA routing
> >
>
> With above code changes applied to the host, I'm able to avoid the guest hang issue
> with more changes in QEMU:
>
> -----> hw/vfio/region.c
>
> int vfio_region_mmap(VFIORegion *region)
> {
> /* region->direct_access is sync up to VFIO_REGION_INFO_CAP_DIRECT_ACCESS */
> if (region->direct_access) {
> memory_region_init_ram_ptr(&region->mmaps[i].mem,
> memory_region_owner(region->mem),
> name, region->mmaps[i].size,
> region->mmaps[i].mmap);
> region->mmaps[i].mem.ram_device = true;
> } else {
> memory_region_init_ram_device_ptr(&region->mmaps[i].mem,
> memory_region_owner(region->mem),
> name, region->mmaps[i].size,
> region->mmaps[i].mmap);
> }
> }
>
> -----> system/memory.c
>
> bool memory_region_has_unassigned_ops(const MemoryRegion *mr)
> {
> return mr->ops == &unassigned_mem_ops;
> }
>
> -----> include/system/memory.h
>
> static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
> {
> /*
> * RAM DEVICE regions can be accessed directly using memcpy, but it might
> * be MMIO and access using mempy can be wrong (e.g., using instructions not
> - * intended for MMIO access). So we treat this as IO.
> + * intended for MMIO access). So we treat this as IO except it has been
> + * explicitly declared as being directly accessible. For those directly
> + * accessible RAM device regions, their callbacks point to the unassigned
> + * one.
> */
> - return !memory_region_is_ram_device(mr);
> + return !memory_region_is_ram_device(mr) ||
> + memory_region_has_unassigned_ops(mr);
> }
>
> Thanks,
> Gavin
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-11 5:31 ` Michael S. Tsirkin
@ 2026-06-11 6:28 ` Gavin Shan
2026-06-11 6:34 ` Michael S. Tsirkin
2026-06-11 6:51 ` Michael S. Tsirkin
0 siblings, 2 replies; 37+ messages in thread
From: Gavin Shan @ 2026-06-11 6:28 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Peter Xu, Pavel Hrdina, Daniel P. Berrangé, qemu-devel,
qemu-arm, jugraham, shan.gavin, Alex Williamson,
David Hildenbrand
Hi Michael,
On 6/11/26 3:31 PM, Michael S. Tsirkin wrote:
> On Thu, Jun 11, 2026 at 02:33:05PM +1000, Gavin Shan wrote:
>> On 6/11/26 2:18 AM, Michael S. Tsirkin wrote:
>>> On Wed, Jun 10, 2026 at 11:36:55AM -0400, Peter Xu wrote:
>>>> On Wed, Jun 10, 2026 at 10:06:24AM -0400, Michael S. Tsirkin wrote:
>>>>> On Wed, Jun 10, 2026 at 11:54:47PM +1000, Gavin Shan wrote:
[...]
>>>>>
>>>>> I do not think it has anything to do with host endian-ness.
>>>>>
>>>>>
>>>>> This is the change that broke it I think?
>>>>>
>>>>>
>>>>> commit 4a2e242bbb306ef5c16ce9e7bb2da3bd8a4eb098
>>>>> Author: Alex Williamson <alex@shazbot.org>
>>>>> Date: Mon Oct 31 09:53:03 2016 -0600
>>>>>
>>>>> memory: Don't use memcpy for ram_device regions
>>>>>
>>>>> Maybe Alex has an opinion on what to do.
>>>>
>>>> I can offer one idea here..
>>>>
>>>> IIUC the major issue was vector ops but the mr ops might be too heavy, then
>>>> another way to fix it is in memory API instead of using memcpy()/memmove(),
>>>> we always use a helper (say, memmove_no_vector()) to do the split and
>>>> properly aligned IOs as what ram_device_mem_ops does right now, this should
>>>> only applies to ram_device.
>>>>
>>>> With that, IIUC we can remove the current ram_device_mem_ops, then in
>>>> Gavin's case mmap() will go through and guest will not need to vmexit at
>>>> all. Best perf, issue solve.
>>>>
>>>> We just need to be careful to trap all possible memcpy()/memmove() used in
>>>> memory core.. if I didn't miss any, IMO below four should needs to be
>>>> replaced by memmove_no_vector():
>>>>
>>>> flatview_write_continue_step()
>>>> flatview_read_continue_step()
>>>> address_space_read()
>>>> address_space_write_rom()
>>>>
>>>> Thanks,
>>>>
>>>> --
>>>> Peter Xu
>>>
>>> First, this is a nice idea.
>>> Second, the ideal thing is still just allowing direct access.
>>> And I think VFIO actually knows it's regular RAM.
>>> So something like the following small patch in linux, maybe?
>>>
>>
>> If I understood everything, Peter's proposal seems to move the logics covered
>> by ram_device_mem_ops to the upper layer.
>
>
> I think the basics of Peter's idea are really simple: if guest is doing
> DMA into a region then that access is treating that region as RAM and so
> any vectored etc instructions into it are fine.
>
> So we can fix specifically DMA into RAM DEVICE to bypass bounce buffering.
>
> It's at the low memory level, not the upper layer.
>
> He also apparently feels bounce buffering isn't needed
> generally and can be replaced with memmove_no_vector? And
> somehow virtio DMA can be done without kicking host? I'm not
> sure I understand these parts.
>
For Peter's idea, I believe there is something I missed. Let's take our specific
case as an example, where the DMA request is handled as the following call trace
indicates.
virtio_blk_handle_output
virtio_blk_handle_vq
virtio_blk_get_request
virtqueue_pop
virtqueue_split_pop
virtqueue_map_desc
address_space_map
virtio_blk_handle_request
iov_to_buf
memcpy
In address_space_map(), all RAM DEVICE regions are treated as directly accessible
and the buffer (RAMBlock::host + offset) is returned. The buffer is passed on
to virtio_blk_handle_request() and iov_to_buf(), where the data is then copied
over using memcpy(), which we're trying to avoid.
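For reference, a helper of the kind Peter describes could look roughly like
the sketch below. The name copy_no_vector_sketch() and its body are
assumptions made for illustration, not code from QEMU. It copies forward
only, so unlike a real memmove() it does not handle overlapping buffers, and
it relies on volatile scalar accesses of at most 8 bytes to keep the compiler
from merging them into wider (vector) loads and stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the memmove_no_vector() helper named in the
 * thread. Copies n bytes using only naturally aligned 1/2/4/8-byte scalar
 * accesses. Source and destination must not overlap in this sketch.
 * The pointer casts assume a build with -fno-strict-aliasing.
 */
static void copy_no_vector_sketch(void *dst, const void *src, size_t n)
{
    volatile uint8_t *d = dst;
    const volatile uint8_t *s = src;

    while (n > 0) {
        size_t sz = 8;

        /* Shrink the access until it fits and both pointers are aligned. */
        while (sz > n || (((uintptr_t)d | (uintptr_t)s) % sz) != 0) {
            sz >>= 1;
        }

        switch (sz) {
        case 8:
            *(volatile uint64_t *)d = *(const volatile uint64_t *)s;
            break;
        case 4:
            *(volatile uint32_t *)d = *(const volatile uint32_t *)s;
            break;
        case 2:
            *(volatile uint16_t *)d = *(const volatile uint16_t *)s;
            break;
        default:
            *d = *s;
            break;
        }

        d += sz;
        s += sz;
        n -= sz;
    }
}
```

Such a helper would only cover copies done inside the memory core. The
memcpy() in iov_to_buf() above runs on the pointer handed back by
address_space_map(), so it would not go through it.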
Thanks,
Gavin
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-11 6:28 ` Gavin Shan
@ 2026-06-11 6:34 ` Michael S. Tsirkin
2026-06-11 12:33 ` Gavin Shan
2026-06-11 6:51 ` Michael S. Tsirkin
1 sibling, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2026-06-11 6:34 UTC (permalink / raw)
To: Gavin Shan
Cc: Peter Xu, Pavel Hrdina, Daniel P. Berrangé, qemu-devel,
qemu-arm, jugraham, shan.gavin, Alex Williamson,
David Hildenbrand
On Thu, Jun 11, 2026 at 04:28:20PM +1000, Gavin Shan wrote:
> Hi Michael,
>
> On 6/11/26 3:31 PM, Michael S. Tsirkin wrote:
> > On Thu, Jun 11, 2026 at 02:33:05PM +1000, Gavin Shan wrote:
> > > On 6/11/26 2:18 AM, Michael S. Tsirkin wrote:
> > > > On Wed, Jun 10, 2026 at 11:36:55AM -0400, Peter Xu wrote:
> > > > > On Wed, Jun 10, 2026 at 10:06:24AM -0400, Michael S. Tsirkin wrote:
> > > > > > On Wed, Jun 10, 2026 at 11:54:47PM +1000, Gavin Shan wrote:
>
> [...]
>
> > > > > >
> > > > > > I do not think it has anything to do with host endian-ness.
> > > > > >
> > > > > >
> > > > > > This is the change that broke it I think?
> > > > > >
> > > > > >
> > > > > > commit 4a2e242bbb306ef5c16ce9e7bb2da3bd8a4eb098
> > > > > > Author: Alex Williamson <alex@shazbot.org>
> > > > > > Date: Mon Oct 31 09:53:03 2016 -0600
> > > > > >
> > > > > > memory: Don't use memcpy for ram_device regions
> > > > > >
> > > > > > Maybe Alex has an opinion on what to do.
> > > > >
> > > > > I can offer one idea here..
> > > > >
> > > > > IIUC the major issue was vector ops but the mr ops might be too heavy, then
> > > > > another way to fix it is in memory API instead of using memcpy()/memmove(),
> > > > > we always use a helper (say, memmove_no_vector()) to do the split and
> > > > > properly aligned IOs as what ram_device_mem_ops does right now, this should
> > > > > only applies to ram_device.
> > > > >
> > > > > With that, IIUC we can remove the current ram_device_mem_ops, then in
> > > > > Gavin's case mmap() will go through and guest will not need to vmexit at
> > > > > all. Best perf, issue solve.
> > > > >
> > > > > We just need to be careful to trap all possible memcpy()/memmove() used in
> > > > > memory core.. if I didn't miss any, IMO below four should needs to be
> > > > > replaced by memmove_no_vector():
> > > > >
> > > > > flatview_write_continue_step()
> > > > > flatview_read_continue_step()
> > > > > address_space_read()
> > > > > address_space_write_rom()
> > > > >
> > > > > Thanks,
> > > > >
> > > > > --
> > > > > Peter Xu
> > > >
> > > > First, this is a nice idea.
> > > > Second, the ideal thing is still just allowing direct access.
> > > > And I think VFIO actually knows it's regular RAM.
> > > > So something like the following small patch in linux, maybe?
> > > >
> > >
> > > If I understood everything, Peter's proposal seems to move the logics covered
> > > by ram_device_mem_ops to the upper layer.
> >
> >
> > I think the basics of Peter's idea are really simple: if guest is doing
> > DMA into a region then that access is treating that region as RAM and so
> > any vectored etc instructions into it are fine.
> >
> > So we can fix specifically DMA into RAM DEVICE to bypass bounce buffering.
> >
> > It's at the low memory level, not the upper layer.
> >
> > He also apparently feels bounce buffering isn't needed
> > generally and can be replaced with memmove_no_vector? And
> > somehow virtio DMA can be done without kicking host? I'm not
> > sure I understand these parts.
> >
>
> For Peter's idea, I believe there is something I missed. Lets take our specific
> case as an example where the DMA request is handled as the following calltrace
> indicates.
>
> virtio_blk_handle_output
> virtio_blk_handle_vq
> virtio_blk_get_request
> virtqueue_pop
> virtqueue_split_pop
> virtqueue_map_desc
> address_space_map
> virtio_blk_handle_request
> iov_to_buf
> memcpy
>
> In address_space_map(), all RAM DEVICE regions treated as directly accessible
> and the buffer (RAMBlock::host + offset) is returned. The buffer is passed on
> to virtio_blk_handle_request() and iov_to_buf(), the data is then copied over
> using memcpy(), which we're trying to avoid.
>
> Thanks,
> Gavin
>
This is a header copy; why and how would we try to avoid that?
--
MST
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-11 6:28 ` Gavin Shan
2026-06-11 6:34 ` Michael S. Tsirkin
@ 2026-06-11 6:51 ` Michael S. Tsirkin
1 sibling, 0 replies; 37+ messages in thread
From: Michael S. Tsirkin @ 2026-06-11 6:51 UTC (permalink / raw)
To: Gavin Shan
Cc: Peter Xu, Pavel Hrdina, Daniel P. Berrangé, qemu-devel,
qemu-arm, jugraham, shan.gavin, Alex Williamson,
David Hildenbrand
On Thu, Jun 11, 2026 at 04:28:20PM +1000, Gavin Shan wrote:
> Hi Michael,
>
> On 6/11/26 3:31 PM, Michael S. Tsirkin wrote:
> > On Thu, Jun 11, 2026 at 02:33:05PM +1000, Gavin Shan wrote:
> > > On 6/11/26 2:18 AM, Michael S. Tsirkin wrote:
> > > > On Wed, Jun 10, 2026 at 11:36:55AM -0400, Peter Xu wrote:
> > > > > On Wed, Jun 10, 2026 at 10:06:24AM -0400, Michael S. Tsirkin wrote:
> > > > > > On Wed, Jun 10, 2026 at 11:54:47PM +1000, Gavin Shan wrote:
>
> [...]
>
> > > > > >
> > > > > > I do not think it has anything to do with host endian-ness.
> > > > > >
> > > > > >
> > > > > > This is the change that broke it I think?
> > > > > >
> > > > > >
> > > > > > commit 4a2e242bbb306ef5c16ce9e7bb2da3bd8a4eb098
> > > > > > Author: Alex Williamson <alex@shazbot.org>
> > > > > > Date: Mon Oct 31 09:53:03 2016 -0600
> > > > > >
> > > > > > memory: Don't use memcpy for ram_device regions
> > > > > >
> > > > > > Maybe Alex has an opinion on what to do.
> > > > >
> > > > > I can offer one idea here..
> > > > >
> > > > > IIUC the major issue was vector ops but the mr ops might be too heavy, then
> > > > > another way to fix it is in memory API instead of using memcpy()/memmove(),
> > > > > we always use a helper (say, memmove_no_vector()) to do the split and
> > > > > properly aligned IOs as what ram_device_mem_ops does right now, this should
> > > > > only applies to ram_device.
> > > > >
> > > > > With that, IIUC we can remove the current ram_device_mem_ops, then in
> > > > > Gavin's case mmap() will go through and guest will not need to vmexit at
> > > > > all. Best perf, issue solve.
> > > > >
> > > > > We just need to be careful to trap all possible memcpy()/memmove() used in
> > > > > memory core.. if I didn't miss any, IMO below four should needs to be
> > > > > replaced by memmove_no_vector():
> > > > >
> > > > > flatview_write_continue_step()
> > > > > flatview_read_continue_step()
> > > > > address_space_read()
> > > > > address_space_write_rom()
> > > > >
> > > > > Thanks,
> > > > >
> > > > > --
> > > > > Peter Xu
> > > >
> > > > First, this is a nice idea.
> > > > Second, the ideal thing is still just allowing direct access.
> > > > And I think VFIO actually knows it's regular RAM.
> > > > So something like the following small patch in linux, maybe?
> > > >
> > >
> > > If I understood everything, Peter's proposal seems to move the logics covered
> > > by ram_device_mem_ops to the upper layer.
> >
> >
> > I think the basics of Peter's idea are really simple: if guest is doing
> > DMA into a region then that access is treating that region as RAM and so
> > any vectored etc instructions into it are fine.
> >
> > So we can fix specifically DMA into RAM DEVICE to bypass bounce buffering.
> >
> > It's at the low memory level, not the upper layer.
> >
> > He also apparently feels bounce buffering isn't needed
> > generally and can be replaced with memmove_no_vector? And
> > somehow virtio DMA can be done without kicking host? I'm not
> > sure I understand these parts.
> >
>
> For Peter's idea, I believe there is something I missed. Lets take our specific
> case as an example where the DMA request is handled as the following calltrace
> indicates.
>
> virtio_blk_handle_output
> virtio_blk_handle_vq
> virtio_blk_get_request
> virtqueue_pop
> virtqueue_split_pop
> virtqueue_map_desc
> address_space_map
> virtio_blk_handle_request
> iov_to_buf
> memcpy
>
> In address_space_map(), all RAM DEVICE regions treated as directly accessible
> and the buffer (RAMBlock::host + offset) is returned. The buffer is passed on
> to virtio_blk_handle_request() and iov_to_buf(), the data is then copied over
> using memcpy(), which we're trying to avoid.
>
> Thanks,
> Gavin
>
The original bug was QEMU doing MMIO on behalf of the guest, right?
So maybe there's an even simpler thing: I think that if the BAR is
mapped directly into the guest we do not need a bounce buffer in QEMU.
Alex, could you give your opinion on this?
--
MST
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-11 6:34 ` Michael S. Tsirkin
@ 2026-06-11 12:33 ` Gavin Shan
2026-06-11 12:48 ` Peter Maydell
0 siblings, 1 reply; 37+ messages in thread
From: Gavin Shan @ 2026-06-11 12:33 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Peter Xu, Pavel Hrdina, Daniel P. Berrangé, qemu-devel,
qemu-arm, jugraham, shan.gavin, Alex Williamson,
David Hildenbrand
Hi Michael,
On 6/11/26 4:34 PM, Michael S. Tsirkin wrote:
> On Thu, Jun 11, 2026 at 04:28:20PM +1000, Gavin Shan wrote:
>> On 6/11/26 3:31 PM, Michael S. Tsirkin wrote:
>>> On Thu, Jun 11, 2026 at 02:33:05PM +1000, Gavin Shan wrote:
>>>> On 6/11/26 2:18 AM, Michael S. Tsirkin wrote:
>>>>> On Wed, Jun 10, 2026 at 11:36:55AM -0400, Peter Xu wrote:
>>>>>> On Wed, Jun 10, 2026 at 10:06:24AM -0400, Michael S. Tsirkin wrote:
>>>>>>> On Wed, Jun 10, 2026 at 11:54:47PM +1000, Gavin Shan wrote:
>>
>> [...]
>>
>>>>>>>
>>>>>>> I do not think it has anything to do with host endian-ness.
>>>>>>>
>>>>>>>
>>>>>>> This is the change that broke it I think?
>>>>>>>
>>>>>>>
>>>>>>> commit 4a2e242bbb306ef5c16ce9e7bb2da3bd8a4eb098
>>>>>>> Author: Alex Williamson <alex@shazbot.org>
>>>>>>> Date: Mon Oct 31 09:53:03 2016 -0600
>>>>>>>
>>>>>>> memory: Don't use memcpy for ram_device regions
>>>>>>>
>>>>>>> Maybe Alex has an opinion on what to do.
>>>>>>
>>>>>> I can offer one idea here..
>>>>>>
>>>>>> IIUC the major issue was vector ops but the mr ops might be too heavy, then
>>>>>> another way to fix it is in memory API instead of using memcpy()/memmove(),
>>>>>> we always use a helper (say, memmove_no_vector()) to do the split and
>>>>>> properly aligned IOs as what ram_device_mem_ops does right now, this should
>>>>>> only applies to ram_device.
>>>>>>
>>>>>> With that, IIUC we can remove the current ram_device_mem_ops, then in
>>>>>> Gavin's case mmap() will go through and guest will not need to vmexit at
>>>>>> all. Best perf, issue solve.
>>>>>>
>>>>>> We just need to be careful to trap all possible memcpy()/memmove() used in
>>>>>> memory core.. if I didn't miss any, IMO below four should needs to be
>>>>>> replaced by memmove_no_vector():
>>>>>>
>>>>>> flatview_write_continue_step()
>>>>>> flatview_read_continue_step()
>>>>>> address_space_read()
>>>>>> address_space_write_rom()
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> --
>>>>>> Peter Xu
>>>>>
>>>>> First, this is a nice idea.
>>>>> Second, the ideal thing is still just allowing direct access.
>>>>> And I think VFIO actually knows it's regular RAM.
>>>>> So something like the following small patch in linux, maybe?
>>>>>
>>>>
>>>> If I understood everything, Peter's proposal seems to move the logics covered
>>>> by ram_device_mem_ops to the upper layer.
>>>
>>>
>>> I think the basics of Peter's idea are really simple: if guest is doing
>>> DMA into a region then that access is treating that region as RAM and so
>>> any vectored etc instructions into it are fine.
>>>
>>> So we can fix specifically DMA into RAM DEVICE to bypass bounce buffering.
>>>
>>> It's at the low memory level, not the upper layer.
>>>
>>> He also apparently feels bounce buffering isn't needed
>>> generally and can be replaced with memmove_no_vector? And
>>> somehow virtio DMA can be done without kicking host? I'm not
>>> sure I understand these parts.
>>>
>>
>> For Peter's idea, I believe there is something I missed. Lets take our specific
>> case as an example where the DMA request is handled as the following calltrace
>> indicates.
>>
>> virtio_blk_handle_output
>> virtio_blk_handle_vq
>> virtio_blk_get_request
>> virtqueue_pop
>> virtqueue_split_pop
>> virtqueue_map_desc
>> address_space_map
>> virtio_blk_handle_request
>> iov_to_buf
>> memcpy
>>
>> In address_space_map(), all RAM DEVICE regions treated as directly accessible
>> and the buffer (RAMBlock::host + offset) is returned. The buffer is passed on
>> to virtio_blk_handle_request() and iov_to_buf(), the data is then copied over
>> using memcpy(), which we're trying to avoid.
>
> This is header copy why and how would we try to avoid that?
>
Let me try to summarize what I understood. As far as VFIO is concerned, there
are multiple memory regions for one particular PCI BAR, and they're stacked
up. The memory regions for PCI BAR#4 of the GH100 card look as below.
(qemu) info mtree
:
address-space: pci_bridge_pci_mem
0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4 <---- (1) VFIOBAR::mr
0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4 <---- (2) VFIOBAR::VFIORegion::mem
0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0] <---- (3) VFIOBAR::VFIORegion::VFIOMap::mem
(1) Its MemoryRegionOps is NULL. No data accesses are routed to this region.
(2) The data accesses routed to this region are handled by pread() and pwrite().
(3) The data accesses routed to this region were handled by memcpy() before
commit 4a2e242bbb.
There are identified PCI devices that have quirks, see vfio_bar_quirk_setup().
Accesses to part of the PCI BAR have to be emulated by extra IO regions,
something like below for the rtl8168 PCI device, where two extra IO regions are
stacked up for the quirks.
address-space: pci_bridge_pci_mem
0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4 <---- (1) VFIOBAR::mr
0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4 <---- (2) VFIOBAR::VFIORegion::mem
0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0] <---- (3) VFIOBAR::VFIORegion::VFIOMap::mem
0000042000000010-0000042000000014 (prio 1, i/o): 0009:01:00.0 BAR 4 quirk[0] <---- (4) quirk[0]
0000042000000018-000004200000001c (prio 1, i/o): 0009:01:00.0 BAR 4 quirk[1] <---- (5) quirk[1]
Accesses to 0000042000000010-0000042000000014 should be routed to region (4) quirk[0]
and accesses to 0000042000000018-000004200000001c should be routed to region (5) quirk[1].
However, accesses to 0000042000000000-0000042000000020 were routed to region (3) before
commit 4a2e242bbb and the data transfer was done by memcpy(), bypassing regions (4) and
(5). That's not the expected behavior, and it's why memcpy() isn't expected on the
rtl8168's PCI BAR due to the quirks, answering your question.
With commit 4a2e242bbb applied, the accesses are routed to the correct region.
For example, accesses to 0000042000000000-0000042000000020 are routed to (3), (4)
and (5) based on their addresses. Regions (4) and (5) aren't bypassed. That's what I
understood and hopefully nothing has been missed. It's why we're not expecting
memcpy() on region (3).
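The address-to-region lookup assumed in this description can be modelled with
a small sketch. This is a toy, not QEMU's implementation: ToyRegion,
toy_resolve() and toy_bar4 are hypothetical names, containers and flat views
are ignored, and the only rule modelled is that the highest-priority region
covering an address wins. The address ranges are taken from the mtree output
above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model of one overlapping region; all names here are hypothetical. */
typedef struct {
    const char *name;
    uint64_t start;   /* first address covered */
    uint64_t end;     /* last address covered (inclusive) */
    int prio;         /* higher priority wins on overlap */
} ToyRegion;

/* Return the highest-priority region covering addr, or NULL if none does. */
static const ToyRegion *toy_resolve(const ToyRegion *r, size_t n, uint64_t addr)
{
    const ToyRegion *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (addr < r[i].start || addr > r[i].end) {
            continue;
        }
        if (best == NULL || r[i].prio > best->prio) {
            best = &r[i];
        }
    }
    return best;
}

/* Ranges copied from the rtl8168-style example in the mtree listing. */
static const ToyRegion toy_bar4[] = {
    { "mmaps[0]", 0x42000000000ULL, 0x4379fffffffULL, 0 },
    { "quirk[0]", 0x42000000010ULL, 0x42000000014ULL, 1 },
    { "quirk[1]", 0x42000000018ULL, 0x4200000001cULL, 1 },
};
```

In this model an access at 0x42000000010 resolves to quirk[0], while an access
at 0x42000000016, which sits between the two quirk windows, falls through to
mmaps[0].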
----
Back to our case (GH100 card), there are no quirks for the PCI BAR (0009:01:00.0 BAR 4)
so it's fine to mark the RAM DEVICE region as directly accessible. We perhaps don't need
the host to export the capability (VFIO_REGION_INFO_CAP_DIRECT_ACCESS) suggested by you.
It's safe to mark any PCI BAR as directly accessible if it has no quirks attached. All
the devices except those listed in vfio_bar_quirk_setup() are capable of this.
Again, I'm not sure if I understood every detail. Alex is the best person to confirm,
but I believe Peter Xu understands this much better than I do.
Thanks,
Gavin
* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
2026-06-11 12:33 ` Gavin Shan
@ 2026-06-11 12:48 ` Peter Maydell
0 siblings, 0 replies; 37+ messages in thread
From: Peter Maydell @ 2026-06-11 12:48 UTC (permalink / raw)
To: Gavin Shan
Cc: Michael S. Tsirkin, Peter Xu, Pavel Hrdina,
Daniel P. Berrangé, qemu-devel, qemu-arm, jugraham,
shan.gavin, Alex Williamson, David Hildenbrand
On Thu, 11 Jun 2026 at 13:34, Gavin Shan <gshan@redhat.com> wrote:
>
> Let me try to summarize what I understood. As VFIO is concerned, there
> are multiple memory regions for one particular PCI BAR, and they're stacked
> up. The memory regions for PCI BAR#4 of the GH100 card looks as below.
>
> (qemu) info mtree
> :
> address-space: pci_bridge_pci_mem
> 0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
> 0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4 <---- (1) VFIOBAR::mr
> 0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4 <---- (2) VFIOBAR::VFIORegion::mem
> 0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0] <---- (3) VFIOBAR::VFIORegion::VFIOMap::mem
>
> (1) Its MemoryRegionOps is NULL. No data accesses are routed to this region
> (2) The data accesses routed to this region is handled by pread() and pwrite()
> (3) The data accesses routed to this region is handled by memcpy() before
> commit 4a2e242bbb.
>
> There are identified PCI devices who have quirks, see vfio_bar_quirk_setup().
> Accesses to part of the PCI BAR have to be emulated by the extra IO regions,
> something like below for rtl8168 PCI device, where two extra IO regions are
> stacked up for the quirks.
>
> address-space: pci_bridge_pci_mem
> 0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
> 0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4 <---- (1) VFIOBAR::mr
> 0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4 <---- (2) VFIOBAR::VFIORegion::mem
> 0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0] <---- (3) VFIOBAR::VFIORegion::VFIOMap::mem
> 0000042000000010-0000042000000014 (prio 1, i/o): 0009:01:00.0 BAR 4 quirk[0] <---- (4) quirk[0]
> 0000042000000018-000004200000001c (prio 1, i/o): 0009:01:00.0 BAR 4 quirk[1] <---- (5) quirk[1]
>
> Access on 0000042000000010-0000042000000014 should be routed to region (4) quirk[0]
> and access on 0000042000000018-000004200000001c should be routed to region (5) quirk[1].
> However, accesses to 0000042000000000-0000042000000020 are routed to region (3) before
> commit 4a2e242bbb and the data transfer is done by memcpy(), bypassing region (4) and
> (5). It's not the expected behavior and why memcpy() isn't expected on device rtl8168's
> PCI BAR due to the quirks, answering your question.
>
> With commit 4a2e242bbb applied, the accesses will be routed to the correct region.
The way I read 4a2e242bbb's commit message, it isn't about things being routed
to the wrong region. It's about the handling of areas which aren't in the small
quirk regions but which are in the same 4K page as them. These have to be
handled via the memory subsystem's "subpage" mechanism. This does route
everything to the correct region, but if the region (3) is marked as "direct
access is OK" then QEMU assumes that any kind of direct access is OK, i.e. this
behaves like true RAM. It then does a memcpy access to a BAR that's really a
bank of device registers, and this goes wrong.
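A rough sketch of that distinction, as a toy model rather than QEMU code:
ToyOverlay, toy_page_needs_subpage() and toy_access_path() are hypothetical
names, and the bar_direct_ok flag stands in for whatever "direct access is OK"
marking the region carries. The first function asks whether a 4K page shares
space with a quirk overlay (and so cannot be treated as one plain mapping);
the second shows that for an address outside every quirk, the choice between
memcpy and sized accesses depends only on that marking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 0x1000ULL   /* 4K page, as in the description above */

/* Hypothetical quirk overlay window; end is inclusive. */
typedef struct {
    uint64_t start;
    uint64_t end;
} ToyOverlay;

enum ToyPath {
    TOY_PATH_QUIRK,     /* handled by the quirk's own callbacks */
    TOY_PATH_MEMCPY,    /* region marked direct: treated like true RAM */
    TOY_PATH_SIZED_IO,  /* region not direct: aligned, sized accesses */
};

/* True if any overlay intersects the 4K page that contains addr. */
static bool toy_page_needs_subpage(const ToyOverlay *ov, size_t n, uint64_t addr)
{
    uint64_t page_start = addr & ~(TOY_PAGE_SIZE - 1);
    uint64_t page_end = page_start + TOY_PAGE_SIZE - 1;

    for (size_t i = 0; i < n; i++) {
        if (ov[i].start <= page_end && ov[i].end >= page_start) {
            return true;
        }
    }
    return false;
}

/* Pick the access path for addr; routing to quirks is always honoured. */
static enum ToyPath toy_access_path(const ToyOverlay *ov, size_t n,
                                    uint64_t addr, bool bar_direct_ok)
{
    for (size_t i = 0; i < n; i++) {
        if (addr >= ov[i].start && addr <= ov[i].end) {
            return TOY_PATH_QUIRK;
        }
    }
    return bar_direct_ok ? TOY_PATH_MEMCPY : TOY_PATH_SIZED_IO;
}

/* Quirk windows from the example mtree listing quoted above. */
static const ToyOverlay toy_quirks[] = {
    { 0x42000000010ULL, 0x42000000014ULL },
    { 0x42000000018ULL, 0x4200000001cULL },
};
```

With these windows, address 0x42000000800 lies in a page that needs the
subpage treatment but is not itself inside a quirk, so it takes the memcpy
path only when the BAR is marked as direct.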
> Back to our case (GH100 card), there are no quirks for the PCI BAR (0009:01:00.0 BAR 4)
> so it's fine mark the RAM DEVICE region as directly accessible. We perhaps needn't host
> to export the capability (VFIO_REGION_INFO_CAP_DIRECT_ACCESS) suggested by you. It's
> safe to mark any PCI BARs as directly accessible if they have no quirks attached. All
> the devices except those listed in vfio_bar_quirk_setup() are capable of this.
I still feel like there are different kinds of PCI BAR here ("this BAR is
true RAM and can be accessed arbitrarily" vs "this BAR is full of registers
and can't be handled that way") and the vfio code in QEMU needs to set up
the memory regions differently for the two cases. For your example I think
it would be fine to have direct-access even if there were some kind of
quirk memory region, because for the parts of the BAR that aren't covered
by a quirk overlay the underlying BAR still allows "entirely like RAM,
any alignment and size is OK" accesses.
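As a rough sketch of what I mean (a toy model, not the vfio code: the
ToyBar type, its ram_like flag and toy_route() are all hypothetical,
and the 4-byte quirk sizes in the usage note are an assumption), the
decision could look like this:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Which path a toy access takes. */
typedef enum {
    TOY_PATH_QUIRK,         /* handled by a quirk overlay */
    TOY_PATH_DIRECT,        /* RAM-like BAR, direct access is OK */
    TOY_PATH_BAR_CALLBACKS  /* register BAR, go through I/O callbacks */
} ToyPath;

typedef struct {
    uint64_t start;  /* offset of the overlay within the BAR */
    uint64_t len;    /* length of the overlay in bytes */
} ToyRange;

typedef struct {
    bool ram_like;           /* hypothetical per-BAR property */
    const ToyRange *quirks;  /* quirk overlays on top of the BAR */
    size_t nquirks;
} ToyBar;

static bool toy_overlaps(ToyRange r, uint64_t off, uint64_t len)
{
    return off < r.start + r.len && r.start < off + len;
}

/*
 * Quirk overlays take priority. Outside them, the BAR's own kind
 * decides. Simplification: an access that only partly overlaps a
 * quirk is sent to the quirk as a whole rather than being split.
 */
ToyPath toy_route(const ToyBar *bar, uint64_t off, uint64_t len)
{
    for (size_t i = 0; i < bar->nquirks; i++) {
        if (toy_overlaps(bar->quirks[i], off, len)) {
            return TOY_PATH_QUIRK;
        }
    }
    return bar->ram_like ? TOY_PATH_DIRECT : TOY_PATH_BAR_CALLBACKS;
}
```

With overlays at offsets 0x10 and 0x18, a RAM-like BAR still gets the
direct path for an access at offset 0x0, while the same access on a
register BAR goes through the callbacks.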
-- PMM
Thread overview: 37+ messages
2026-06-08 0:18 [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible Gavin Shan
2026-06-08 8:55 ` Daniel P. Berrangé
2026-06-08 11:11 ` Gavin Shan
2026-06-08 11:38 ` Daniel P. Berrangé
2026-06-09 2:08 ` Gavin Shan
2026-06-09 16:25 ` Peter Xu
2026-06-10 0:32 ` Gavin Shan
2026-06-10 9:54 ` Pavel Hrdina
2026-06-10 10:55 ` Gavin Shan
2026-06-10 12:12 ` Michael S. Tsirkin
2026-06-10 12:19 ` Gavin Shan
2026-06-10 12:27 ` Michael S. Tsirkin
2026-06-10 13:00 ` Gavin Shan
2026-06-10 13:54 ` Gavin Shan
2026-06-10 14:06 ` Michael S. Tsirkin
2026-06-10 15:36 ` Peter Xu
2026-06-10 16:11 ` Peter Maydell
2026-06-10 16:19 ` Michael S. Tsirkin
2026-06-10 19:10 ` Peter Xu
2026-06-10 21:03 ` Michael S. Tsirkin
2026-06-10 21:27 ` Peter Xu
2026-06-10 21:44 ` Michael S. Tsirkin
2026-06-10 16:18 ` Michael S. Tsirkin
2026-06-11 4:33 ` Gavin Shan
2026-06-11 5:31 ` Michael S. Tsirkin
2026-06-11 6:28 ` Gavin Shan
2026-06-11 6:34 ` Michael S. Tsirkin
2026-06-11 12:33 ` Gavin Shan
2026-06-11 12:48 ` Peter Maydell
2026-06-11 6:51 ` Michael S. Tsirkin
2026-06-10 12:23 ` Pavel Hrdina
2026-06-10 14:04 ` Gavin Shan
2026-06-10 14:08 ` Michael S. Tsirkin
2026-06-10 9:49 ` Michael S. Tsirkin
2026-06-10 18:30 ` Stefan Hajnoczi
2026-06-10 21:00 ` Michael S. Tsirkin
2026-06-11 1:19 ` Gavin Shan