* [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
@ 2026-06-08  0:18 Gavin Shan
  2026-06-08  8:55 ` Daniel P. Berrangé
  2026-06-10  9:49 ` Michael S. Tsirkin
  0 siblings, 2 replies; 23+ messages in thread
From: Gavin Shan @ 2026-06-08  0:18 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-arm, mst, jugraham, shan.gavin

On a guest with an NVIDIA GH100 card passed through from the host, a
guest system hang can be observed when compiling 'cuda-samples', as
reported by Julia.

   host$ lspci | grep GH100
   0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
   host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64 -accel kvm \
         -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T         \
         -cpu host -smp cpus=32 -m size=8G                                  \
         -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0    \
         -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4     \
         -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0

   guest$ cd cuda-samples/build
   guest$ make -j 20 clean
   guest$ make -j 20
               :
   [ 54%] Linking CUDA executable graphMemoryNodes
   [ 54%] Built target graphMemoryNodes
   <no more output afterwards, guest becomes frozen here>

   guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
   [  555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)

When the GPU's driver (NVIDIA open driver) is loaded on guest bootup,
the memory blocks residing in the PCI BAR can be presented to the guest
through memory hot-add. The page cache can be allocated from the hot-added
memory blocks when cuda-samples is being built. Afterwards, the page cache
is sent to QEMU's virtio-blk device as part of the DMA request. The bounce
buffer is used to accommodate the request because the corresponding memory
region (MemoryRegion) is a RAM device region in QEMU. For this specific
case, false is returned from memory_access_is_direct() in the path where
the DMA request is handled.

  QEMU
  ====
  virtio_blk_handle_output
    virtio_blk_handle_vq
      virtio_blk_get_request
        virtqueue_pop
          virtqueue_split_pop
            virtqueue_map_desc
              address_space_map
                memory_access_is_direct         # Return false
                  memory_region_supports_direct_access

  (qemu) info mtree
          :
  memory-region: pci_bridge_pci
    0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
      0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
        0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
          0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]

By default, the max bounce buffer size is only 4096 bytes, which is even
less than one page when the guest page size is 64KB. This patch tries to
fix the issue by inheriting the customized max bounce buffer size of the
virtio bus's parent, set through property 'x-max-bounce-buffer-size',
when the customized size is larger. With this applied, no guest system
hang is seen with
'-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
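
To illustrate why a 4096-byte limit cannot serve a 64KB request, here is a
minimal single-threaded sketch. It is not QEMU code: 'struct bounce_budget',
bounce_alloc() and map_request() are hypothetical names, and the sketch
assumes that each mapping is truncated to whatever budget remains and that
chunks stay mapped until the whole request completes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one address space's bounce buffer budget. */
struct bounce_budget {
    size_t max_size;   /* models the max bounce buffer size */
    size_t used;       /* bytes currently handed out */
};

/*
 * Hand out up to 'want' bytes from the budget. Returns the number of
 * bytes granted, which is 0 once the budget is exhausted.
 */
static size_t bounce_alloc(struct bounce_budget *b, size_t want)
{
    assert(b->used <= b->max_size);

    size_t avail = b->max_size - b->used;
    size_t granted = want < avail ? want : avail;

    b->used += granted;
    return granted;
}

/*
 * Model of a descriptor mapping loop: keep mapping chunks until the
 * whole request is covered or a chunk cannot be mapped. Returns the
 * number of bytes mapped before the loop stopped.
 */
static size_t map_request(struct bounce_budget *b, size_t len)
{
    size_t done = 0;

    while (done < len) {
        size_t got = bounce_alloc(b, len - done);

        if (got == 0) {
            break;      /* models the "out of resources" failure */
        }
        done += got;
    }
    return done;
}
```

Under these assumptions, a budget of 4096 bytes maps only the first 4096
bytes of a 0x10000-byte request and then fails, while a budget of 268435456
bytes covers the request in a single chunk.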

Reported-by: Julia Graham <jugraham@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 hw/virtio/virtio-bus.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/hw/virtio/virtio-bus.c b/hw/virtio/virtio-bus.c
index cef944e015..e0933823f3 100644
--- a/hw/virtio/virtio-bus.c
+++ b/hw/virtio/virtio-bus.c
@@ -42,6 +42,7 @@ do { printf("virtio_bus: " fmt , ## __VA_ARGS__); } while (0)
 /* A VirtIODevice is being plugged */
 void virtio_bus_device_plugged(VirtIODevice *vdev, Error **errp)
 {
+    AddressSpace *as;
     DeviceState *qdev = DEVICE(vdev);
     BusState *qbus = BUS(qdev_get_parent_bus(qdev));
     VirtioBusState *bus = VIRTIO_BUS(qbus);
@@ -100,6 +101,19 @@ void virtio_bus_device_plugged(VirtIODevice *vdev, Error **errp)
                 return;
             }
         }
+    } else {
+        /*
+         * The maximal bounce buffer size of the virtio bus's parent may
+         * have been customized by property 'x-max-bounce-buffer-size'.
+         * Let's inherit the customized size if it's larger than the
+         * current one.
+         */
+        as = klass->get_dma_as ? klass->get_dma_as(qbus->parent) : NULL;
+        if (as) {
+            vdev->dma_as->max_bounce_buffer_size = MAX(
+                    vdev->dma_as->max_bounce_buffer_size,
+                    as->max_bounce_buffer_size);
+        }
     }
 }
 
-- 
2.54.0




* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
  2026-06-08  0:18 [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible Gavin Shan
@ 2026-06-08  8:55 ` Daniel P. Berrangé
  2026-06-08 11:11   ` Gavin Shan
  2026-06-10  9:49 ` Michael S. Tsirkin
  1 sibling, 1 reply; 23+ messages in thread
From: Daniel P. Berrangé @ 2026-06-08  8:55 UTC (permalink / raw)
  To: Gavin Shan; +Cc: qemu-devel, qemu-arm, mst, jugraham, shan.gavin

On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
> On the guest where a NVidia's GH100 card is passed from the host, the
> guest system hang can be observed on attempt to compile 'cuda-samples',
> as reported by Julia.

snip

> By default, the max bounce buffer size is only 4096 bytes, even less
> than one page when the guest page is 64KB. This tries to fix the issue
> by inheriting the customized max bounce buffer size of the virtio bus's
> parent through property 'x-max-bounce-buffer-size' when the customized
> size is a larger one. With this applied, no guest system hang is seen
> with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.

"x-max-bounce-buffer-size"  is an experimental / unsupported property.

We really shouldn't be expecting users to have to set this in a production
deployment in order to stop a guest from hanging.  Even if we dropped the
experimental marker from this property, users would still need to know to
provide this magic setting, so it would still be broken out of the box.

How can we get a solution that "just works" out of the box, which is
fully supported, not relying on experimental properties?

> 
> Reported-by: Julia Graham <jugraham@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  hw/virtio/virtio-bus.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)

With regards,
Daniel
-- 
|: https://berrange.com       ~~        https://hachyderm.io/@berrange :|
|: https://libvirt.org          ~~          https://entangle-photo.org :|
|: https://pixelfed.art/berrange   ~~    https://fstop138.berrange.com :|




* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
  2026-06-08  8:55 ` Daniel P. Berrangé
@ 2026-06-08 11:11   ` Gavin Shan
  2026-06-08 11:38     ` Daniel P. Berrangé
  2026-06-10  9:54     ` Pavel Hrdina
  0 siblings, 2 replies; 23+ messages in thread
From: Gavin Shan @ 2026-06-08 11:11 UTC (permalink / raw)
  To: Daniel P. Berrangé, Peter Xu
  Cc: qemu-devel, qemu-arm, mst, jugraham, shan.gavin

Hi Daniel,

On 6/8/26 6:55 PM, Daniel P. Berrangé wrote:
> On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
>> On the guest where a NVidia's GH100 card is passed from the host, the
>> guest system hang can be observed on attempt to compile 'cuda-samples',
>> as reported by Julia.
> 
> snip
> 

Thanks for looking into this.

>> By default, the max bounce buffer size is only 4096 bytes, even less
>> than one page when the guest page is 64KB. This tries to fix the issue
>> by inheriting the customized max bounce buffer size of the virtio bus's
>> parent through property 'x-max-bounce-buffer-size' when the customized
>> size is a larger one. With this applied, no guest system hang is seen
>> with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
> 
> "x-max-bounce-buffer-size"  is an experimental / unsupported property.
> 
> We really shouldn't be expecting users to have to set this in a production
> deployment in order to stop a guest from hanging.  Even if we dropped the
> experimental marker from this property, users would still need to know to
> provide this magic setting, so it would still be broken out of the box.
> 
> How can we  get a solution that "just works" out of the box, which is
> fully supported, not relying on experimental properties ?
> 

How do we know that "x-max-bounce-buffer-size" is an experimental or unsupported
property? I guess all properties whose names start with "x-" are treated as
experimental and unsupported?

For this case, the bounce buffer is inevitable as the memory region can't be
directly accessed. The memory region is initialized by memory_region_init_ram_device_ptr()
in hw/vfio/region.c::vfio_region_mmap(). So the question is how users can specify
the allowed bounce buffer size, which is why the existing property
"x-max-bounce-buffer-size" is reused.

I also considered a new property for MachineState (e.g. "limited-bounce-buffer"),
which would be on by default, following the existing behavior. When users set it
to off, the max (allowed) buffer size wouldn't be checked at all. However, I'm
not sure whether this makes sense.

>>
>> Reported-by: Julia Graham <jugraham@redhat.com>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   hw/virtio/virtio-bus.c | 14 ++++++++++++++
>>   1 file changed, 14 insertions(+)
> 
> With regards,
> Daniel

Thanks,
Gavin




* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
  2026-06-08 11:11   ` Gavin Shan
@ 2026-06-08 11:38     ` Daniel P. Berrangé
  2026-06-09  2:08       ` Gavin Shan
  2026-06-10  9:54     ` Pavel Hrdina
  1 sibling, 1 reply; 23+ messages in thread
From: Daniel P. Berrangé @ 2026-06-08 11:38 UTC (permalink / raw)
  To: Gavin Shan; +Cc: Peter Xu, qemu-devel, qemu-arm, mst, jugraham, shan.gavin

On Mon, Jun 08, 2026 at 09:11:50PM +1000, Gavin Shan wrote:
> Hi Daniel,
> 
> On 6/8/26 6:55 PM, Daniel P. Berrangé wrote:
> > On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
> > > On the guest where a NVidia's GH100 card is passed from the host, the
> > > guest system hang can be observed on attempt to compile 'cuda-samples',
> > > as reported by Julia.
> > 
> > snip
> > 
> 
> Thanks for looking into this.

NB, I didn't really look into it beyond noticing the suggestion
that users set an "x-" property as a proposed solution to failing
to boot, which raised a red flag for me from a usability POV.

I don't really know anything about the underlying technical problems
here, so can't offer specific guidance in that area.

> 
> > > By default, the max bounce buffer size is only 4096 bytes, even less
> > > than one page when the guest page is 64KB. This tries to fix the issue
> > > by inheriting the customized max bounce buffer size of the virtio bus's
> > > parent through property 'x-max-bounce-buffer-size' when the customized
> > > size is a larger one. With this applied, no guest system hang is seen
> > > with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
> > 
> > "x-max-bounce-buffer-size"  is an experimental / unsupported property.
> > 
> > We really shouldn't be expecting users to have to set this in a production
> > deployment in order to stop a guest from hanging.  Even if we dropped the
> > experimental marker from this property, users would still need to know to
> > provide this magic setting, so it would still be broken out of the box.
> > 
> > How can we  get a solution that "just works" out of the box, which is
> > fully supported, not relying on experimental properties ?
> > 
> 
> How do we know that "x-max-bounce-buffer-size" is an experimental or unsupported
> property? I guess the properties whose names start with "x-" are all treated as
> experimental and unsupported?

Yes, any QEMU property starting with 'x-' is experimental/unstable/
unsupported and can be changed/withdrawn at any time.  Libvirt will
not provide any way to configure 'x-' properties, as it requires a
supported/stable solution from QEMU.

With regards,
Daniel
-- 
|: https://berrange.com       ~~        https://hachyderm.io/@berrange :|
|: https://libvirt.org          ~~          https://entangle-photo.org :|
|: https://pixelfed.art/berrange   ~~    https://fstop138.berrange.com :|




* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
  2026-06-08 11:38     ` Daniel P. Berrangé
@ 2026-06-09  2:08       ` Gavin Shan
  2026-06-09 16:25         ` Peter Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Gavin Shan @ 2026-06-09  2:08 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Peter Xu, qemu-devel, qemu-arm, mst, jugraham, shan.gavin

On 6/8/26 9:38 PM, Daniel P. Berrangé wrote:
> On Mon, Jun 08, 2026 at 09:11:50PM +1000, Gavin Shan wrote:
>> Hi Daniel,
>>
>> On 6/8/26 6:55 PM, Daniel P. Berrangé wrote:
>>> On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
>>>> On the guest where a NVidia's GH100 card is passed from the host, the
>>>> guest system hang can be observed on attempt to compile 'cuda-samples',
>>>> as reported by Julia.
>>>
>>> snip
>>>
>>
>> Thanks for looking into this.
> 
> NB, I didn't really look into it beyond noticing the suggestion
> that users set an "x-" property as a proposed solution to failing
> to boot, which raised a red-flag to me from a usability POV.
> 
> I don't really know anything about the underlying technical problems
> here, so can't offer specific guidance in that area.
> 

Ok, no worries, I got your points :-)

>>
>>>> By default, the max bounce buffer size is only 4096 bytes, even less
>>>> than one page when the guest page is 64KB. This tries to fix the issue
>>>> by inheriting the customized max bounce buffer size of the virtio bus's
>>>> parent through property 'x-max-bounce-buffer-size' when the customized
>>>> size is a larger one. With this applied, no guest system hang is seen
>>>> with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
>>>
>>> "x-max-bounce-buffer-size"  is an experimental / unsupported property.
>>>
>>> We really shouldn't be expecting users to have to set this in a production
>>> deployment in order to stop a guest from hanging.  Even if we dropped the
>>> experimental marker from this property, users would still need to know to
>>> provide this magic setting, so it would still be broken out of the box.
>>>
>>> How can we  get a solution that "just works" out of the box, which is
>>> fully supported, not relying on experimental properties ?
>>>
>>
>> How do we know that "x-max-bounce-buffer-size" is an experimental or unsupported
>> property? I guess the properties whose names start with "x-" are all treated as
>> experimental and unsupported?
> 
> Yes, any QEMU property starting with 'x-' is experimental/unstable/
> unsupported and can be changed/withdrawn at any time.  Libvirt will
> not provide any way to configure 'x-' properties, as it requires a
> supported/stable solution from QEMU.
> 

Yeah. Apart from the option of adding a new property to MachineState to disable
the check on the max bounce buffer size, we can also make the existing option
"x-max-bounce-buffer-size" officially supported by renaming it to
"max-bounce-buffer-size". Let's see what comments Michael or Peter will have.

> With regards,
> Daniel

Thanks,
Gavin




* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
  2026-06-09  2:08       ` Gavin Shan
@ 2026-06-09 16:25         ` Peter Xu
  2026-06-10  0:32           ` Gavin Shan
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Xu @ 2026-06-09 16:25 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Daniel P. Berrangé, qemu-devel, qemu-arm, mst, jugraham,
	shan.gavin

On Tue, Jun 09, 2026 at 12:08:34PM +1000, Gavin Shan wrote:
> Yeah. Apart from the option of adding a new property to MachineState to disable
> the check on the max bounce buffer size, we also can make this existing option
> "x-max-bounce-buffer-size" official and officially supported by renaming it to
> "max-bounce-buffer-size". Lets see what comments Michael or Peter will have.

IIUC updating max-bounce-buffer-size will be the last resort, because I
don't know how to properly define what is the correct value.  When it's
prefixed with x- it's indeed more problematic..

Two pure questions..

Question 1:

I want to better understand the failure case.  I don't yet understand why
the parameter has anything to do with page size.  Say, shouldn't
virtio-blk's DMA requests come in less-than-page-size pieces, so that
normally it would work even for 64k psize (as long as the total of buffers
to map does not go beyond 4k)?

Maybe it's because there are a lot of concurrent IOs/DMAs, hence it did use
more than that?

Question 2:

Quoting from commit message:

        When the GPU's driver (NVidia open driver) is loaded on guest
        bootup, the memory blocks residing in the PCI BAR can be presented
        to the guest through memory hot-add. The page cache can be
        allocated from the hot added memory blocks when cuda-samples is
        being built. Afterwards, the page cache is sent to QEMU's virtio-blk
        device as part of the DMA request, the bounce buffer is used to
        accommodate the request as the corresponding memory region
        (MemoryRegion) is a RAM DEVICE region in qemu. For this specific
        case, false is returned from memory_access_is_direct() in the path
        where the DMA request is handled.

I don't think I know this case well, but if you say the PCI BARs are
backing page cache, does it mean that they should be directly accessible?
Maybe it's about this line:

    /*
     * RAM DEVICE regions can be accessed directly using memcpy, but it might
     * be MMIO and access using mempy can be wrong (e.g., using instructions not
     * intended for MMIO access). So we treat this as IO.
     */
    return !memory_region_is_ram_device(mr);

But then my question is: if this is a legal case, can we loosen this check
so that we don't need to use bounce buffers at all?

Thanks,

-- 
Peter Xu




* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
  2026-06-09 16:25         ` Peter Xu
@ 2026-06-10  0:32           ` Gavin Shan
  0 siblings, 0 replies; 23+ messages in thread
From: Gavin Shan @ 2026-06-10  0:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P. Berrangé, qemu-devel, qemu-arm, mst, jugraham,
	shan.gavin

Hi Peter,

On 6/10/26 2:25 AM, Peter Xu wrote:
> On Tue, Jun 09, 2026 at 12:08:34PM +1000, Gavin Shan wrote:
>> Yeah. Apart from the option of adding a new property to MachineState to disable
>> the check on the max bounce buffer size, we also can make this existing option
>> "x-max-bounce-buffer-size" official and officially supported by renaming it to
>> "max-bounce-buffer-size". Lets see what comments Michael or Peter will have.
> 
> IIUC updating max-bounce-buffer-size will be the last resort, because I
> don't know how to properly define what is the correct value.  When it's
> prefixed with x- it's indeed more problematic..
> 

Ok, thanks for your confirmation. Let's rename 'x-max-bounce-buffer-size' to
'max-bounce-buffer-size' in the next revision. I plan to have two patches for this.

[PATCH 1/2] renames x-max-bounce-buffer-size to max-bounce-buffer-size
[PATCH 2/2] does what's done in this patch, inheriting 'max-bounce-buffer-size'
             for virtio device from its bus parent

> Two pure questions..
> 
> Question 1:
> 
> I want to better understand the failure case.  I don't yet understand why
> it has anything to do with page size with the parameter.  Say, shouldn't
> virtio-blk's DMA requests in form of less-than-page-size, then normally it
> should work even for 64k psize (as long as the total of buffers to map goes
> beyond 4k)?
> 
> Maybe it's because there're a lot of concurrent IOs/DMAs hence it did use
> more than that?
> 

I think both affect the bounce buffer. In the failing case, the debugging
output indicates that the length of the DMA request is 64KB while the max bounce
buffer size is only 4KB. I believe concurrent DMA requests also put pressure on
the bounce buffer.

In my failing cases, I got the following output from the debugging code below.
It reveals that the length of the DMA request is 64KB, aligned to the guest
page size.

Output from qemu:
virtqueue_map_desc: PA=0x420025b0000, size=0x10000, current_PA=0x420025b1000

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 63e2faee99..c038a62717 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -1618,6 +1618,8 @@ static bool virtqueue_map_desc(VirtIODevice *vdev, unsigned int *p_num_sg,
  {
      bool ok = false;
      unsigned num_sg = *p_num_sg;
+    hwaddr saved_pa = pa;
+    size_t saved_sz = sz;
      assert(num_sg <= max_num_sg);
  
      if (!sz) {
@@ -1641,6 +1643,9 @@ static bool virtqueue_map_desc(VirtIODevice *vdev, unsigned int *p_num_sg,
                                                MEMTXATTRS_UNSPECIFIED);
          if (!iov[num_sg].iov_base) {
              virtio_error(vdev, "virtio: bogus descriptor or out of resources");
+            fprintf(stdout, "%s: PA=0x%lx, size=0x%lx, current_PA=0x%lx\n",
+                    __func__, (unsigned long)saved_pa, (unsigned long)saved_sz,
+                    (unsigned long)pa);
              goto out;
          }
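
Reading the debug line above as arithmetic may help: the gap between
current_PA and PA is how much of the descriptor had been mapped when the
failure hit. The sketch below is only an illustration; 'struct map_progress'
and map_progress_at_failure() are hypothetical names, and it assumes that
the physical address is advanced only after a chunk has been mapped
successfully.

```c
#include <stdint.h>

/* Hypothetical helper type: mapping progress at the point of failure. */
struct map_progress {
    uint64_t mapped;     /* bytes mapped before the failing chunk */
    uint64_t remaining;  /* bytes of the descriptor still unmapped */
};

/*
 * Derive the progress from the three values printed by the debug line:
 * the descriptor's start address, its size, and the address at which
 * the next chunk failed to map.
 */
static struct map_progress map_progress_at_failure(uint64_t start_pa,
                                                   uint64_t size,
                                                   uint64_t current_pa)
{
    struct map_progress p;

    p.mapped = current_pa - start_pa;
    p.remaining = size - p.mapped;
    return p;
}
```

For PA=0x420025b0000, size=0x10000 and current_PA=0x420025b1000, this gives
0x1000 (4096) bytes mapped and 0xf000 bytes left, which matches a 4KB bounce
buffer limit being used up by the first chunk.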

> Question 2:
> 
> Quoting from commit message:
> 
>          When the GPU's driver (NVidia open driver) is loaded on guest
>          bootup, the memory blocks residing in the PCI BAR can be presented
>          to the guest through memory hot-add. The page cache can be
>          allocated from the hot added memory blocks when cuda-samples is
>          being built. Afterwards, the page cache is sent to QEMU's virtio-blk
>          device as part of the DMA request, the bounce buffer is used to
>          accommodate the request as the corresponding memory region
>          (MemoryRegion) is a RAM DEVICE region in qemu. For this specific
>          case, false is returned from memory_access_is_direct() in the path
>          where the DMA request is handled.
> 
> I don't think I know well in this case, but if you say the PCI bars have
> page cache in the back, does it mean that it should be directly accessible?
> Maybe it's about this line:
> 
>      /*
>       * RAM DEVICE regions can be accessed directly using memcpy, but it might
>       * be MMIO and access using mempy can be wrong (e.g., using instructions not
>       * intended for MMIO access). So we treat this as IO.
>       */
>      return !memory_region_is_ram_device(mr);
> 
> But then my question is if this is a legal case can we loose this check so
> that we don't need to use bounce buffers at all.
> 

It's a good point. I once bypassed the bounce buffer for this particular
memory region, and it worked for me. However, I don't think we're able to
do it because the memory region isn't directly accessible by nature.
Accesses to the memory region are handled by 'ram_device_mem_ops', where
{ldn, stn}_he_p() are used in its read/write handlers. They're different
from memcpy() since the data endianness is well handled in {ldn, stn}_he_p().

Thanks,
Gavin




* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
  2026-06-08  0:18 [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible Gavin Shan
  2026-06-08  8:55 ` Daniel P. Berrangé
@ 2026-06-10  9:49 ` Michael S. Tsirkin
  1 sibling, 0 replies; 23+ messages in thread
From: Michael S. Tsirkin @ 2026-06-10  9:49 UTC (permalink / raw)
  To: Gavin Shan
  Cc: qemu-devel, qemu-arm, jugraham, shan.gavin, stefanha, qemu-block

On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
> On the guest where a NVidia's GH100 card is passed from the host, the
> guest system hang can be observed on attempt to compile 'cuda-samples',
> as reported by Julia.
> 
>    host$ lspci | grep GH100
>    0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
>    host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64 -accel kvm \
>          -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T         \
>          -cpu host -smp cpus=32 -m size=8G                                  \
>          -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0    \
>          -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4     \
>          -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
> 
>    guest$ cd cuda-samples/build
>    guest$ make -j 20 clean
>    guest$ make -j 20
>                :
>    [ 54%] Linking CUDA executable graphMemoryNodes
>    [ 54%] Built target graphMemoryNodes
>    <no more output afterwards, guest becomes frozen here>
> 
>    guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
>    [  555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
> 
> When the GPU's driver (NVidia open driver) is loaded on guest bootup,
> the memory blocks residing in the PCI BAR can be presented to the guest
> through memory hot-add. The page cache can be allocated from the hot added
> memory blocks when cuda-samples is being built. Afterwards, the page cache
> is sent to QEMU's virtio-blk device as part of the DMA request, the bounce
> buffer is used to accommodate the request as the corresponding memory
> region (MemoryRegion) is a RAM DEVICE region in qemu. For this specific
> case, false is returned from memory_access_is_direct() in the path where
> the DMA request is handled.
> 
>   QEMU
>   ====
>   virtio_blk_handle_output
>     virtio_blk_handle_vq
>       virtio_blk_get_request
>         virtqueue_pop
>           virtqueue_split_pop
>             virtqueue_map_desc
>               address_space_map
>                 memory_access_is_direct         # Return false
>                   memory_region_supports_direct_access
> 
>   (qemu) info mtree
>           :
>   memory-region: pci_bridge_pci
>     0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
>       0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
>         0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
>           0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
> 
> By default, the max bounce buffer size is only 4096 bytes, even less
> than one page when the guest page is 64KB. This tries to fix the issue
> by inheriting the customized max bounce buffer size of the virtio bus's
> parent through property 'x-max-bounce-buffer-size' when the customized
> size is a larger one. With this applied, no guest system hang is seen
> with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
> 
> Reported-by: Julia Graham <jugraham@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  hw/virtio/virtio-bus.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/hw/virtio/virtio-bus.c b/hw/virtio/virtio-bus.c
> index cef944e015..e0933823f3 100644
> --- a/hw/virtio/virtio-bus.c
> +++ b/hw/virtio/virtio-bus.c
> @@ -42,6 +42,7 @@ do { printf("virtio_bus: " fmt , ## __VA_ARGS__); } while (0)
>  /* A VirtIODevice is being plugged */
>  void virtio_bus_device_plugged(VirtIODevice *vdev, Error **errp)
>  {
> +    AddressSpace *as;
>      DeviceState *qdev = DEVICE(vdev);
>      BusState *qbus = BUS(qdev_get_parent_bus(qdev));
>      VirtioBusState *bus = VIRTIO_BUS(qbus);
> @@ -100,6 +101,19 @@ void virtio_bus_device_plugged(VirtIODevice *vdev, Error **errp)
>                  return;
>              }
>          }
> +    } else {
> +        /*
> +         * The maximal bounce buffer size of the virtio bus's parent may
> +         * have been customized by property 'x-max-bounce-buffer-size'.
> +         * Lets inherit the customized size if it's larger than the
> +         * current one.
> +         */
> +        as = klass->get_dma_as ? klass->get_dma_as(qbus->parent) : NULL;
> +        if (as) {
> +            vdev->dma_as->max_bounce_buffer_size = MAX(
> +                    vdev->dma_as->max_bounce_buffer_size,
> +                    as->max_bounce_buffer_size);
> +        }
>      }
>  }
>  
> -- 
> 2.54.0


The problem with all this is that users would not know how to size this.

So fundamentally, isn't the issue that virtio-blk (and scsi!) map
the whole buffer all the time?

It's not hard to add something like virtio_pop_unmapped that would not map,
then build QEMUSGLists out of addr/len pairs and submit these.

Stefan, do you think doing it like this would be bad for perf? Good for
perf?
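
As a rough illustration of the idea above (collect addr/len pairs without
mapping them, then submit the list), here is a self-contained sketch. It
does not use QEMU's types: 'struct sg_entry', 'struct sg_list',
sg_list_add() and sg_list_destroy() are hypothetical stand-ins, and how
such a list would actually be submitted is left open.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one guest-physical segment of a request. */
struct sg_entry {
    uint64_t addr;
    uint64_t len;
};

/* Hypothetical growable list of segments; nothing in it is mapped. */
struct sg_list {
    struct sg_entry *sg;
    size_t nsg;
    size_t nalloc;
    uint64_t total;     /* sum of all segment lengths */
};

/* Append one addr/len pair. Returns 0 on success, -1 on allocation failure. */
static int sg_list_add(struct sg_list *l, uint64_t addr, uint64_t len)
{
    if (l->nsg == l->nalloc) {
        size_t n = l->nalloc ? l->nalloc * 2 : 4;
        struct sg_entry *p = realloc(l->sg, n * sizeof(*p));

        if (!p) {
            return -1;
        }
        l->sg = p;
        l->nalloc = n;
    }
    l->sg[l->nsg].addr = addr;
    l->sg[l->nsg].len = len;
    l->nsg++;
    l->total += len;
    return 0;
}

/* Release the list and reset it to the empty state. */
static void sg_list_destroy(struct sg_list *l)
{
    free(l->sg);
    l->sg = NULL;
    l->nsg = 0;
    l->nalloc = 0;
    l->total = 0;
}
```

The point of the design is that recording a 64KB segment costs one list
entry regardless of any bounce buffer limit, because no mapping happens
at this stage.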

-- 
MST




* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
  2026-06-08 11:11   ` Gavin Shan
  2026-06-08 11:38     ` Daniel P. Berrangé
@ 2026-06-10  9:54     ` Pavel Hrdina
  2026-06-10 10:55       ` Gavin Shan
  1 sibling, 1 reply; 23+ messages in thread
From: Pavel Hrdina @ 2026-06-10  9:54 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Daniel P. Berrangé, Peter Xu, qemu-devel, qemu-arm, mst,
	jugraham, shan.gavin

On Mon, Jun 08, 2026 at 09:11:50PM +1000, Gavin Shan wrote:
> Hi Daniel,
> 
> On 6/8/26 6:55 PM, Daniel P. Berrangé wrote:
> > On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
> > > On the guest where a NVidia's GH100 card is passed from the host, the
> > > guest system hang can be observed on attempt to compile 'cuda-samples',
> > > as reported by Julia.
> > 
> > snip
> > 
> 
> Thanks for looking into this.
> 
> > > By default, the max bounce buffer size is only 4096 bytes, even less
> > > than one page when the guest page is 64KB. This tries to fix the issue
> > > by inheriting the customized max bounce buffer size of the virtio bus's
> > > parent through property 'x-max-bounce-buffer-size' when the customized
> > > size is a larger one. With this applied, no guest system hang is seen
> > > with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
> > 
> > "x-max-bounce-buffer-size"  is an experimental / unsupported property.
> > 
> > We really shouldn't be expecting users to have to set this in a production
> > deployment in order to stop a guest from hanging.  Even if we dropped the
> > experimental marker from this property, users would still need to know to
> > provide this magic setting, so it would still be broken out of the box.
> > 
> > How can we  get a solution that "just works" out of the box, which is
> > fully supported, not relying on experimental properties ?
> > 
> 
> How do we know that "x-max-bounce-buffer-size" is an experimental or unsupported
> property? I guess the properties whose names start with "x-" are all treated as
> experimental and unsupported?
> 
> For this case, the bounce buffer is inevitable as the memory region can't be
> directly accessed. The memory region is initialized by memory_region_init_ram_device_ptr()
> in hw/vfio/region.c::vfio_region_mmap(). So the question is how the allowed
> bounce buffer size can be specified by users, and it's why the existing property
> "x-max-bounce-buffer-size" is reused.
> 
> I even thought of a new property for MachineState (e.g. "limited-bounce-buffer"),
> which is set to on by default, following the existing behavior. When it's set to
> off by users, the max (allowed) buffer size won't be checked at all. However, I'm
> not sure if this makes sense at all.

Hi Gavin,

You did not answer the question Daniel was asking: how will users know
that max-bounce-buffer-size should be used if it's necessary to fix guest
system hangs, and how will they know what magic value should be set?

Pavel




* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
  2026-06-10  9:54     ` Pavel Hrdina
@ 2026-06-10 10:55       ` Gavin Shan
  2026-06-10 12:12         ` Michael S. Tsirkin
  2026-06-10 12:23         ` Pavel Hrdina
  0 siblings, 2 replies; 23+ messages in thread
From: Gavin Shan @ 2026-06-10 10:55 UTC (permalink / raw)
  To: Pavel Hrdina
  Cc: Daniel P. Berrangé, Peter Xu, qemu-devel, qemu-arm, mst,
	jugraham, shan.gavin

Hi Pavel,

On 6/10/26 7:54 PM, Pavel Hrdina wrote:
> On Mon, Jun 08, 2026 at 09:11:50PM +1000, Gavin Shan wrote:
>> Hi Daniel,
>>
>> On 6/8/26 6:55 PM, Daniel P. Berrangé wrote:
>>> On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
>>>> On the guest where a NVidia's GH100 card is passed from the host, the
>>>> guest system hang can be observed on attempt to compile 'cuda-samples',
>>>> as reported by Julia.
>>>
>>> snip
>>>
>>
>> Thanks for looking into this.
>>
>>>> By default, the max bounce buffer size is only 4096 bytes, even less
>>>> than one page when the guest page is 64KB. This tries to fix the issue
>>>> by inheriting the customized max bounce buffer size of the virtio bus's
>>>> parent through property 'x-max-bounce-buffer-size' when the customized
>>>> size is a larger one. With this applied, no guest system hang is seen
>>>> with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
>>>
>>> "x-max-bounce-buffer-size"  is an experimental / unsupported property.
>>>
>>> We really shouldn't be expecting users to have to set this in a production
>>> deployment in order to stop a guest from hanging.  Even if we dropped the
>>> experimental marker from this property, users would still need to know to
>>> provide this magic setting, so it would still be broken out of the box.
>>>
>>> How can we  get a solution that "just works" out of the box, which is
>>> fully supported, not relying on experimental properties ?
>>>
>>
>> How do we know that "x-max-bounce-buffer-size" is an experimental or unsupported
>> property? I guess the properties whose names start with "x-" are all treated as
>> experimental and unsupported?
>>
>> For this case, the bounce buffer is inevitable as the memory region can't be
>> directly accessed. The memory region is initialized by memory_region_init_ram_device_ptr()
>> in hw/vfio/region.c::vfio_region_mmap(). So the question is how the allowed
>> bounce buffer size can be specified by users, and it's why the existing property
>> "x-max-bounce-buffer-size" is reused.
>>
>> I even thought of a new property for MachineState (e.g. "limited-bounce-buffer"),
>> which is set to on by default, following the existing behavior. When it's set to
>> off by users, the max (allowed) buffer size won't be checked at all. However, I'm
>> not sure if this makes sense at all.
> 
> Hi Gavin,
> 
> You did not answer the question that Daniel was asking, how will user
> know that max-bounce-buffer-size should be used if it's necessary to fix
> guest system hangs and how will user know what magic value should be set?
> 

Sorry that I missed answering Daniel's questions. For this specific case,
users need to enlarge the bounce buffer size when they see the following
error message. We can add a more explicit one in address_space_map() if the
existing error message isn't obvious enough.

   qemu-system-aarch64: virtio: bogus descriptor or out of resources

   void *address_space_map(AddressSpace *as,
                           hwaddr addr,
                           hwaddr *plen,
                           bool is_write,
                           MemTxAttrs attrs)
   {
       /* ... existing code elided ... */
       if (!memory_access_is_direct(mr, is_write, attrs)) {
           /* ... */
           if (l == 0) {
               /* Proposed: tell the user why the mapping failed. */
               error_report("Running out of bounce buffer size, enlarge it with max-bounce-buffer-size");
               *plen = 0;
               return NULL;
           }
       }
       /* ... */
   }

As for the value of max-bounce-buffer-size, it really depends on the case and is
decided by the user. The user needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out
the smallest value that works for them. The worst case is to set 0xFFFFFFFF.
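
A minimal sketch of that trial sequence, assuming each candidate is simply
double the previous one and the sequence saturates at 0xFFFFFFFF. The helper
name bounce_size_trials() is hypothetical and is not part of QEMU; it only
counts how many candidates would have to be tried.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper for illustration only, not part of QEMU.
 * Starting at 'start', double the candidate size until it reaches
 * 'needed', saturating at UINT32_MAX (0xFFFFFFFF). Returns the number
 * of candidates tried and stores the first sufficient one in '*found'.
 */
unsigned bounce_size_trials(uint32_t start, uint32_t needed, uint32_t *found)
{
    uint64_t size = start;   /* 64-bit so that doubling cannot overflow */
    unsigned trials = 1;     /* the starting value is the first trial */

    assert(start != 0);
    while (size < needed) {
        size *= 2;
        if (size > UINT32_MAX) {
            size = UINT32_MAX;
        }
        trials++;
    }
    *found = (uint32_t)size;
    return trials;
}
```

Under this assumption, going from the 4096-byte default to the 268435456
used in the patch description takes 17 candidates (2^12 up to 2^28).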

> Pavel
> 

Thanks,
Gavin




* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
  2026-06-10 10:55       ` Gavin Shan
@ 2026-06-10 12:12         ` Michael S. Tsirkin
  2026-06-10 12:19           ` Gavin Shan
  2026-06-10 12:23         ` Pavel Hrdina
  1 sibling, 1 reply; 23+ messages in thread
From: Michael S. Tsirkin @ 2026-06-10 12:12 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Pavel Hrdina, Daniel P. Berrangé, Peter Xu, qemu-devel,
	qemu-arm, jugraham, shan.gavin

On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
> Hi Pavel,
> 
> On 6/10/26 7:54 PM, Pavel Hrdina wrote:
> > On Mon, Jun 08, 2026 at 09:11:50PM +1000, Gavin Shan wrote:
> > > Hi Daniel,
> > > 
> > > On 6/8/26 6:55 PM, Daniel P. Berrangé wrote:
> > > > On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
> > > > > On the guest where a NVidia's GH100 card is passed from the host, the
> > > > > guest system hang can be observed on attempt to compile 'cuda-samples',
> > > > > as reported by Julia.
> > > > 
> > > > snip
> > > > 
> > > 
> > > Thanks for looking into this.
> > > 
> > > > > By default, the max bounce buffer size is only 4096 bytes, even less
> > > > > than one page when the guest page is 64KB. This tries to fix the issue
> > > > > by inheriting the customized max bounce buffer size of the virtio bus's
> > > > > parent through property 'x-max-bounce-buffer-size' when the customized
> > > > > size is a larger one. With this applied, no guest system hang is seen
> > > > > with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
> > > > 
> > > > "x-max-bounce-buffer-size"  is an experimental / unsupported property.
> > > > 
> > > > We really shouldn't be expecting users to have to set this in a production
> > > > deployment in order to stop a guest from hanging.  Even if we dropped the
> > > > experimental marker from this property, users would still need to know to
> > > > provide this magic setting, so it would still be broken out of the box.
> > > > 
> > > > How can we  get a solution that "just works" out of the box, which is
> > > > fully supported, not relying on experimental properties ?
> > > > 
> > > 
> > > How do we know that "x-max-bounce-buffer-size" is an experimental or unsupported
> > > property? I guess the properties whose names start with "x-" are all treated as
> > > experimental and unsupported?
> > > 
> > > For this case, the bounce buffer is inevitable as the memory region can't be
> > > directly accessed. The memory region is initialized by memory_region_init_ram_device_ptr()
> > > in hw/vfio/region.c::vfio_region_mmap(). So the question is how the allowed
> > > bounce buffer size can be specified by users, and it's why the existing property
> > > "x-max-bounce-buffer-size" is reused.
> > > 
> > > I even thought of a new property for MachineState (e.g. "limited-bounce-buffer"),
> > > which is set to on by default, following the existing behavior. When it's set to
> > > off by users, the max (allowed) buffer size won't be checked at all. However, I'm
> > > not sure if this makes sense at all.
> > 
> > Hi Gavin,
> > 
> > You did not answer the question that Daniel was asking, how will user
> > know that max-bounce-buffer-size should be used if it's necessary to fix
> > guest system hangs and how will user know what magic value should be set?
> > 
> 
> Sorry that I missed to answer Daniel's questions. For this specific case,
> user need to enlarge the bounce buffer size when seeing the following error
> message. We can add an explicit one in address_space_map() if the existing
> error message isn't obvious.
> 
>   qemu-system-aarch64: virtio: bogus descriptor or out of resources
> 
>   void *address_space_map(AddressSpace *as,
>                         hwaddr addr,
>                         hwaddr *plen,
>                         bool is_write,
>                         MemTxAttrs attrs)
>   {
>       if (!memory_access_is_direct(mr, is_write, attrs)) {
>           if (l == 0) {
>               error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
>               *plen = 0;
>               return NULL;
>           }
>       }
> 
> As to the value user should take for max-bounce-buffer-size, it is really case by case
> and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
> smallest value works for them. The worst case is to set 0xFFFFFFFF.
> 
> > Pavel
> > 
> 
> Thanks,
> Gavin


This is not at all reasonable. All kinds of fixes are possible, but
fundamentally, bounce buffering in the data path is by itself already a
bad idea.

I have no idea what bounce buffering device RAM accomplishes.

In the end, QEMU still simply copies the memory to/from the buffer.

My suggestion is to first of all look for ways to mark the
memory as direct.

-- 
MST




* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
  2026-06-10 12:12         ` Michael S. Tsirkin
@ 2026-06-10 12:19           ` Gavin Shan
  2026-06-10 12:27             ` Michael S. Tsirkin
  0 siblings, 1 reply; 23+ messages in thread
From: Gavin Shan @ 2026-06-10 12:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Pavel Hrdina, Daniel P. Berrangé, Peter Xu, qemu-devel,
	qemu-arm, jugraham, shan.gavin

Hi Michael,

On 6/10/26 10:12 PM, Michael S. Tsirkin wrote:
> On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
>> On 6/10/26 7:54 PM, Pavel Hrdina wrote:

[...]

>>>
>>> You did not answer the question that Daniel was asking, how will user
>>> know that max-bounce-buffer-size should be used if it's necessary to fix
>>> guest system hangs and how will user know what magic value should be set?
>>>
>>
>> Sorry that I missed to answer Daniel's questions. For this specific case,
>> user need to enlarge the bounce buffer size when seeing the following error
>> message. We can add an explicit one in address_space_map() if the existing
>> error message isn't obvious.
>>
>>    qemu-system-aarch64: virtio: bogus descriptor or out of resources
>>
>>    void *address_space_map(AddressSpace *as,
>>                          hwaddr addr,
>>                          hwaddr *plen,
>>                          bool is_write,
>>                          MemTxAttrs attrs)
>>    {
>>        if (!memory_access_is_direct(mr, is_write, attrs)) {
>>            if (l == 0) {
>>                error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
>>                *plen = 0;
>>                return NULL;
>>            }
>>        }
>>
>> As to the value user should take for max-bounce-buffer-size, it is really case by case
>> and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
>> smallest value works for them. The worst case is to set 0xFFFFFFFF.
>>
> 
> 
> This is not at all reasonable. All kind of fixes are possible but
> fundamentally, bounce buffering data path is by itself already a
> bad idea.
> 
> I have no idea what does bounce buffering device ram accomplish.
> 
> In the end, qemu still simply reads the memory from/to the buffer.
> 
> My suggestion is to first of all look for ways to mark the
> memory as direct.
> 

As I explained to Peter Xu in another reply, we can't simply mark the (RAM
DEVICE) memory region as directly accessible. The memory region is initialized
by memory_region_init_ram_device_ptr() in hw/vfio/region.c::vfio_region_mmap().

Accesses to the memory region are handled by 'ram_device_mem_ops', whose
read/write handlers use {ldn, stn}_he_p(). They're different from memcpy()
since the data endianness is handled explicitly in {ldn, stn}_he_p().
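
For reference, a minimal sketch of what a size-dispatched host-endian access
looks like, assuming a simplified model in which the value is moved in host
byte order at a fixed width of 1, 2, 4 or 8 bytes. The model_* names are
hypothetical stand-ins; this is not the QEMU implementation of
{ldn, stn}_he_p().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Simplified stand-ins for host-endian load/store helpers. The names and
 * bodies are assumptions for illustration; they are not QEMU's
 * ldn_he_p()/stn_he_p(). 'size' must be 1, 2, 4 or 8.
 */
uint64_t model_ldn_he(const void *ptr, unsigned size)
{
    uint8_t v8;
    uint16_t v16;
    uint32_t v32;
    uint64_t v64 = 0;

    switch (size) {
    case 1: memcpy(&v8, ptr, sizeof(v8)); return v8;
    case 2: memcpy(&v16, ptr, sizeof(v16)); return v16;
    case 4: memcpy(&v32, ptr, sizeof(v32)); return v32;
    case 8: memcpy(&v64, ptr, sizeof(v64)); return v64;
    default: assert(!"unsupported access size"); return 0;
    }
}

void model_stn_he(void *ptr, unsigned size, uint64_t val)
{
    uint8_t v8 = (uint8_t)val;
    uint16_t v16 = (uint16_t)val;
    uint32_t v32 = (uint32_t)val;

    switch (size) {
    case 1: memcpy(ptr, &v8, sizeof(v8)); break;
    case 2: memcpy(ptr, &v16, sizeof(v16)); break;
    case 4: memcpy(ptr, &v32, sizeof(v32)); break;
    case 8: memcpy(ptr, &val, sizeof(val)); break;
    default: assert(!"unsupported access size");
    }
}
```

In this model, a load followed by a store of the same width leaves the bytes
in their original order; what distinguishes it from one large memcpy() is the
fixed access width per call.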

Thanks,
Gavin





* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
  2026-06-10 10:55       ` Gavin Shan
  2026-06-10 12:12         ` Michael S. Tsirkin
@ 2026-06-10 12:23         ` Pavel Hrdina
  2026-06-10 14:04           ` Gavin Shan
  1 sibling, 1 reply; 23+ messages in thread
From: Pavel Hrdina @ 2026-06-10 12:23 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Daniel P. Berrangé, Peter Xu, qemu-devel, qemu-arm, mst,
	jugraham, shan.gavin

On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
> Hi Pavel,
> 
> On 6/10/26 7:54 PM, Pavel Hrdina wrote:
> > On Mon, Jun 08, 2026 at 09:11:50PM +1000, Gavin Shan wrote:
> > > Hi Daniel,
> > > 
> > > On 6/8/26 6:55 PM, Daniel P. Berrangé wrote:
> > > > On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
> > > > > On the guest where a NVidia's GH100 card is passed from the host, the
> > > > > guest system hang can be observed on attempt to compile 'cuda-samples',
> > > > > as reported by Julia.
> > > > 
> > > > snip
> > > > 
> > > 
> > > Thanks for looking into this.
> > > 
> > > > > By default, the max bounce buffer size is only 4096 bytes, even less
> > > > > than one page when the guest page is 64KB. This tries to fix the issue
> > > > > by inheriting the customized max bounce buffer size of the virtio bus's
> > > > > parent through property 'x-max-bounce-buffer-size' when the customized
> > > > > size is a larger one. With this applied, no guest system hang is seen
> > > > > with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
> > > > 
> > > > "x-max-bounce-buffer-size"  is an experimental / unsupported property.
> > > > 
> > > > We really shouldn't be expecting users to have to set this in a production
> > > > deployment in order to stop a guest from hanging.  Even if we dropped the
> > > > experimental marker from this property, users would still need to know to
> > > > provide this magic setting, so it would still be broken out of the box.
> > > > 
> > > > How can we  get a solution that "just works" out of the box, which is
> > > > fully supported, not relying on experimental properties ?
> > > > 
> > > 
> > > How do we know that "x-max-bounce-buffer-size" is an experimental or unsupported
> > > property? I guess the properties whose names start with "x-" are all treated as
> > > experimental and unsupported?
> > > 
> > > For this case, the bounce buffer is inevitable as the memory region can't be
> > > directly accessed. The memory region is initialized by memory_region_init_ram_device_ptr()
> > > in hw/vfio/region.c::vfio_region_mmap(). So the question is how the allowed
> > > bounce buffer size can be specified by users, and it's why the existing property
> > > "x-max-bounce-buffer-size" is reused.
> > > 
> > > I even thought of a new property for MachineState (e.g. "limited-bounce-buffer"),
> > > which is set to on by default, following the existing behavior. When it's set to
> > > off by users, the max (allowed) buffer size won't be checked at all. However, I'm
> > > not sure if this makes sense at all.
> > 
> > Hi Gavin,
> > 
> > You did not answer the question that Daniel was asking, how will user
> > know that max-bounce-buffer-size should be used if it's necessary to fix
> > guest system hangs and how will user know what magic value should be set?
> > 
> 
> Sorry that I missed to answer Daniel's questions. For this specific case,
> user need to enlarge the bounce buffer size when seeing the following error
> message. We can add an explicit one in address_space_map() if the existing
> error message isn't obvious.
> 
>   qemu-system-aarch64: virtio: bogus descriptor or out of resources
> 
>   void *address_space_map(AddressSpace *as,
>                         hwaddr addr,
>                         hwaddr *plen,
>                         bool is_write,
>                         MemTxAttrs attrs)
>   {
>       if (!memory_access_is_direct(mr, is_write, attrs)) {
>           if (l == 0) {
>               error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
>               *plen = 0;
>               return NULL;
>           }
>       }

This may work when using QEMU directly, but users will not see this error
when using libvirt or management tools like KubeVirt.

> As to the value user should take for max-bounce-buffer-size, it is really case by case
> and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
> smallest value works for them. The worst case is to set 0xFFFFFFFF.

Playing a guessing game to figure out how to make a VM work doesn't sound
like a pleasant user experience, and again it will most likely not work for
KubeVirt, where users are usually not exposed to these low-level properties.

I'm not familiar with the internals, but isn't there a better way to solve
this without requiring users to guess which value works?

Pavel

> > Pavel
> > 
> 
> Thanks,
> Gavin
> 




* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
  2026-06-10 12:19           ` Gavin Shan
@ 2026-06-10 12:27             ` Michael S. Tsirkin
  2026-06-10 13:00               ` Gavin Shan
  0 siblings, 1 reply; 23+ messages in thread
From: Michael S. Tsirkin @ 2026-06-10 12:27 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Pavel Hrdina, Daniel P. Berrangé, Peter Xu, qemu-devel,
	qemu-arm, jugraham, shan.gavin

On Wed, Jun 10, 2026 at 10:19:31PM +1000, Gavin Shan wrote:
> Hi Michael,
> 
> On 6/10/26 10:12 PM, Michael S. Tsirkin wrote:
> > On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
> > > On 6/10/26 7:54 PM, Pavel Hrdina wrote:
> 
> [...]
> 
> > > > 
> > > > You did not answer the question that Daniel was asking, how will user
> > > > know that max-bounce-buffer-size should be used if it's necessary to fix
> > > > guest system hangs and how will user know what magic value should be set?
> > > > 
> > > 
> > > Sorry that I missed to answer Daniel's questions. For this specific case,
> > > user need to enlarge the bounce buffer size when seeing the following error
> > > message. We can add an explicit one in address_space_map() if the existing
> > > error message isn't obvious.
> > > 
> > >    qemu-system-aarch64: virtio: bogus descriptor or out of resources
> > > 
> > >    void *address_space_map(AddressSpace *as,
> > >                          hwaddr addr,
> > >                          hwaddr *plen,
> > >                          bool is_write,
> > >                          MemTxAttrs attrs)
> > >    {
> > >        if (!memory_access_is_direct(mr, is_write, attrs)) {
> > >            if (l == 0) {
> > >                error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
> > >                *plen = 0;
> > >                return NULL;
> > >            }
> > >        }
> > > 
> > > As to the value user should take for max-bounce-buffer-size, it is really case by case
> > > and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
> > > smallest value works for them. The worst case is to set 0xFFFFFFFF.
> > > 
> > 
> > 
> > This is not at all reasonable. All kind of fixes are possible but
> > fundamentally, bounce buffering data path is by itself already a
> > bad idea.
> > 
> > I have no idea what does bounce buffering device ram accomplish.
> > 
> > In the end, qemu still simply reads the memory from/to the buffer.
> > 
> > My suggestion is to first of all look for ways to mark the
> > memory as direct.
> > 
> 
> As I explained to Peter Xu in another reply, we can't simply mark the (RAM
> DEVICE) memory region is directly accessible. The memory region is initialized
> by memory_region_init_ram_device_ptr() in hw/vfio/region.c::vfio_region_mmap().
> 
> The  accesses to the memory region is handled by 'ram_device_mem_ops' where
> {ldn, stn}_he_p() are used in its read/write handler. They're different
> from memcpy() since the data endianness is well handled in {ldn, stn}_he_p().
> 
> Thanks,
> Gavin
> 

What is endianness set to, for this region?




* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
  2026-06-10 12:27             ` Michael S. Tsirkin
@ 2026-06-10 13:00               ` Gavin Shan
  2026-06-10 13:54                 ` Gavin Shan
  0 siblings, 1 reply; 23+ messages in thread
From: Gavin Shan @ 2026-06-10 13:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Pavel Hrdina, Daniel P. Berrangé, Peter Xu, qemu-devel,
	qemu-arm, jugraham, shan.gavin

On 6/10/26 10:27 PM, Michael S. Tsirkin wrote:
> On Wed, Jun 10, 2026 at 10:19:31PM +1000, Gavin Shan wrote:
>> On 6/10/26 10:12 PM, Michael S. Tsirkin wrote:
>>> On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
>>>> On 6/10/26 7:54 PM, Pavel Hrdina wrote:
>>
>> [...]
>>
>>>>>
>>>>> You did not answer the question that Daniel was asking, how will user
>>>>> know that max-bounce-buffer-size should be used if it's necessary to fix
>>>>> guest system hangs and how will user know what magic value should be set?
>>>>>
>>>>
>>>> Sorry that I missed to answer Daniel's questions. For this specific case,
>>>> user need to enlarge the bounce buffer size when seeing the following error
>>>> message. We can add an explicit one in address_space_map() if the existing
>>>> error message isn't obvious.
>>>>
>>>>     qemu-system-aarch64: virtio: bogus descriptor or out of resources
>>>>
>>>>     void *address_space_map(AddressSpace *as,
>>>>                           hwaddr addr,
>>>>                           hwaddr *plen,
>>>>                           bool is_write,
>>>>                           MemTxAttrs attrs)
>>>>     {
>>>>         if (!memory_access_is_direct(mr, is_write, attrs)) {
>>>>             if (l == 0) {
>>>>                 error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
>>>>                 *plen = 0;
>>>>                 return NULL;
>>>>             }
>>>>         }
>>>>
>>>> As to the value user should take for max-bounce-buffer-size, it is really case by case
>>>> and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
>>>> smallest value works for them. The worst case is to set 0xFFFFFFFF.
>>>>
>>>
>>>
>>> This is not at all reasonable. All kind of fixes are possible but
>>> fundamentally, bounce buffering data path is by itself already a
>>> bad idea.
>>>
>>> I have no idea what does bounce buffering device ram accomplish.
>>>
>>> In the end, qemu still simply reads the memory from/to the buffer.
>>>
>>> My suggestion is to first of all look for ways to mark the
>>> memory as direct.
>>>
>>
>> As I explained to Peter Xu in another reply, we can't simply mark the (RAM
>> DEVICE) memory region is directly accessible. The memory region is initialized
>> by memory_region_init_ram_device_ptr() in hw/vfio/region.c::vfio_region_mmap().
>>
>> The  accesses to the memory region is handled by 'ram_device_mem_ops' where
>> {ldn, stn}_he_p() are used in its read/write handler. They're different
>> from memcpy() since the data endianness is well handled in {ldn, stn}_he_p().
>>
>> Thanks,
>> Gavin
>>
> 
> What is endianness set to, for this region?
> 

The endianness of the memory region is set to that of the host.

static const MemoryRegionOps ram_device_mem_ops = {
     .read = memory_region_ram_device_read,
     .write = memory_region_ram_device_write,
     .endianness = HOST_BIG_ENDIAN ? DEVICE_BIG_ENDIAN : DEVICE_LITTLE_ENDIAN,
};
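
To illustrate the selection, here is a minimal standalone sketch, assuming
GCC/Clang-style predefined byte-order macros. MODEL_HOST_BIG_ENDIAN and the
other model_* names are hypothetical and are not necessarily how QEMU defines
HOST_BIG_ENDIAN; the runtime probe is only a cross-check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Assumption: GCC/Clang-style predefined macros are available. The macro
 * name is a stand-in and not necessarily QEMU's definition.
 */
#define MODEL_HOST_BIG_ENDIAN (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)

/* Runtime cross-check: inspect the first byte of a known 32-bit value. */
bool model_runtime_is_big_endian(void)
{
    const uint32_t probe = 0x01020304u;
    uint8_t first;

    memcpy(&first, &probe, sizeof(first));
    return first == 0x01;
}

/* Same selection shape as the '.endianness' initializer quoted above. */
enum model_endianness { MODEL_LITTLE, MODEL_BIG };

enum model_endianness model_region_endianness(void)
{
    /* The compile-time answer should agree with the runtime probe. */
    assert(model_runtime_is_big_endian() == (bool)MODEL_HOST_BIG_ENDIAN);
    return MODEL_HOST_BIG_ENDIAN ? MODEL_BIG : MODEL_LITTLE;
}
```

Whichever way the host is built, the region's declared endianness matches the
host's in this model.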

Thanks,
Gavin




* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
  2026-06-10 13:00               ` Gavin Shan
@ 2026-06-10 13:54                 ` Gavin Shan
  2026-06-10 14:06                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 23+ messages in thread
From: Gavin Shan @ 2026-06-10 13:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Pavel Hrdina, Daniel P. Berrangé, Peter Xu, qemu-devel,
	qemu-arm, jugraham, shan.gavin

Hi Michael and Peter,

On 6/10/26 11:00 PM, Gavin Shan wrote:
> On 6/10/26 10:27 PM, Michael S. Tsirkin wrote:
>> On Wed, Jun 10, 2026 at 10:19:31PM +1000, Gavin Shan wrote:
>>> On 6/10/26 10:12 PM, Michael S. Tsirkin wrote:
>>>> On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
>>>>> On 6/10/26 7:54 PM, Pavel Hrdina wrote:
>>>
>>> [...]
>>>
>>>>>>
>>>>>> You did not answer the question that Daniel was asking, how will user
>>>>>> know that max-bounce-buffer-size should be used if it's necessary to fix
>>>>>> guest system hangs and how will user know what magic value should be set?
>>>>>>
>>>>>
>>>>> Sorry that I missed to answer Daniel's questions. For this specific case,
>>>>> user need to enlarge the bounce buffer size when seeing the following error
>>>>> message. We can add an explicit one in address_space_map() if the existing
>>>>> error message isn't obvious.
>>>>>
>>>>>     qemu-system-aarch64: virtio: bogus descriptor or out of resources
>>>>>
>>>>>     void *address_space_map(AddressSpace *as,
>>>>>                           hwaddr addr,
>>>>>                           hwaddr *plen,
>>>>>                           bool is_write,
>>>>>                           MemTxAttrs attrs)
>>>>>     {
>>>>>         if (!memory_access_is_direct(mr, is_write, attrs)) {
>>>>>             if (l == 0) {
>>>>>                 error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
>>>>>                 *plen = 0;
>>>>>                 return NULL;
>>>>>             }
>>>>>         }
>>>>>
>>>>> As to the value user should take for max-bounce-buffer-size, it is really case by case
>>>>> and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
>>>>> smallest value works for them. The worst case is to set 0xFFFFFFFF.
>>>>>
>>>>
>>>>
>>>> This is not at all reasonable. All kind of fixes are possible but
>>>> fundamentally, bounce buffering data path is by itself already a
>>>> bad idea.
>>>>
>>>> I have no idea what does bounce buffering device ram accomplish.
>>>>
>>>> In the end, qemu still simply reads the memory from/to the buffer.
>>>>
>>>> My suggestion is to first of all look for ways to mark the
>>>> memory as direct.
>>>>
>>>
>>> As I explained to Peter Xu in another reply, we can't simply mark the (RAM
>>> DEVICE) memory region is directly accessible. The memory region is initialized
>>> by memory_region_init_ram_device_ptr() in hw/vfio/region.c::vfio_region_mmap().
>>>
>>> The  accesses to the memory region is handled by 'ram_device_mem_ops' where
>>> {ldn, stn}_he_p() are used in its read/write handler. They're different
>>> from memcpy() since the data endianness is well handled in {ldn, stn}_he_p().
>>>
>>> Thanks,
>>> Gavin
>>>
>>
>> What is endianness set to, for this region?
>>
> 
> The endianness of the memory region is set to that for the host.
> 
> static const MemoryRegionOps ram_device_mem_ops = {
>      .read = memory_region_ram_device_read,
>      .write = memory_region_ram_device_write,
>      .endianness = HOST_BIG_ENDIAN ? DEVICE_BIG_ENDIAN : DEVICE_LITTLE_ENDIAN,
> };
> 

How about treating the RAM DEVICE memory region as directly accessible in
address_space_map() only when HOST_BIG_ENDIAN is false, something like the
change below? I don't hit the guest hang issue with these changes.

diff --git a/include/system/memory.h b/include/system/memory.h
index 1417132f6d..9daca55251 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -2908,7 +2908,8 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr);
  int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr);
  bool prepare_mmio_access(MemoryRegion *mr);
  
-static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
+static inline bool memory_region_supports_direct_access(const MemoryRegion *mr,
+                                                        bool check_ram_device)
  {
      /* ROM DEVICE regions only allow direct access if in ROMD mode. */
      if (memory_region_is_romd(mr)) {
@@ -2922,13 +2923,14 @@ static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
       * be MMIO and access using mempy can be wrong (e.g., using instructions not
       * intended for MMIO access). So we treat this as IO.
       */
-    return !memory_region_is_ram_device(mr);
+    return (!check_ram_device || !memory_region_is_ram_device(mr));
  }
  
  static inline bool memory_access_is_direct(const MemoryRegion *mr,
+                                           bool check_ram_device,
                                             bool is_write, MemTxAttrs attrs)
  {
-    if (!memory_region_supports_direct_access(mr)) {
+    if (!memory_region_supports_direct_access(mr, check_ram_device)) {
          return false;
      }
diff --git a/system/physmem.c b/system/physmem.c
index 7bcbf87573..2e6b72b124 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3724,7 +3724,7 @@ void *address_space_map(AddressSpace *as,
      fv = address_space_to_flatview(as);
      mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
  
-    if (!memory_access_is_direct(mr, is_write, attrs)) {
+    if (!memory_access_is_direct(mr, HOST_BIG_ENDIAN, is_write, attrs)) {
          size_t used = qatomic_read(&as->bounce_buffer_size);
          for (;;) {
              hwaddr alloc = MIN(as->max_bounce_buffer_size - used, l);
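
The effect of the new parameter can be summarized with a small standalone
model of the changed return expression. This is a sketch only: the model_*
names are hypothetical, and the ROMD check that precedes the expression in
the real function is left out.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the return expression changed in the diff above.
 * It ignores the ROMD check that precedes it in the real function.
 */
bool model_supports_direct_access(bool is_ram_device, bool check_ram_device)
{
    return !check_ram_device || !is_ram_device;
}

/* Walk the full truth table of the model. */
void model_check_truth_table(void)
{
    /* RAM device region, check disabled: treated as direct. */
    assert(model_supports_direct_access(true, false));
    /* RAM device region, check enabled: not direct. */
    assert(!model_supports_direct_access(true, true));
    /* Other regions are unaffected by the flag. */
    assert(model_supports_direct_access(false, true));
    assert(model_supports_direct_access(false, false));
}
```

With check_ram_device set to HOST_BIG_ENDIAN as in the address_space_map()
hunk, a RAM device region is treated as direct on a little-endian host and
still goes through the bounce buffer on a big-endian host.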

Thanks,
Gavin





* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
  2026-06-10 12:23         ` Pavel Hrdina
@ 2026-06-10 14:04           ` Gavin Shan
  2026-06-10 14:08             ` Michael S. Tsirkin
  0 siblings, 1 reply; 23+ messages in thread
From: Gavin Shan @ 2026-06-10 14:04 UTC (permalink / raw)
  To: Pavel Hrdina
  Cc: Daniel P. Berrangé, Peter Xu, qemu-devel, qemu-arm, mst,
	jugraham, shan.gavin

On 6/10/26 10:23 PM, Pavel Hrdina wrote:
> On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
>> Hi Pavel,
>>
>> On 6/10/26 7:54 PM, Pavel Hrdina wrote:
>>> On Mon, Jun 08, 2026 at 09:11:50PM +1000, Gavin Shan wrote:
>>>> Hi Daniel,
>>>>
>>>> On 6/8/26 6:55 PM, Daniel P. Berrangé wrote:
>>>>> On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
>>>>>> On the guest where a NVidia's GH100 card is passed from the host, the
>>>>>> guest system hang can be observed on attempt to compile 'cuda-samples',
>>>>>> as reported by Julia.
>>>>>
>>>>> snip
>>>>>
>>>>
>>>> Thanks for looking into this.
>>>>
>>>>>> By default, the max bounce buffer size is only 4096 bytes, even less
>>>>>> than one page when the guest page is 64KB. This tries to fix the issue
>>>>>> by inheriting the customized max bounce buffer size of the virtio bus's
>>>>>> parent through property 'x-max-bounce-buffer-size' when the customized
>>>>>> size is a larger one. With this applied, no guest system hang is seen
>>>>>> with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
>>>>>
>>>>> "x-max-bounce-buffer-size"  is an experimental / unsupported property.
>>>>>
>>>>> We really shouldn't be expecting users to have to set this in a production
>>>>> deployment in order to stop a guest from hanging.  Even if we dropped the
>>>>> experimental marker from this property, users would still need to know to
>>>>> provide this magic setting, so it would still be broken out of the box.
>>>>>
>>>>> How can we  get a solution that "just works" out of the box, which is
>>>>> fully supported, not relying on experimental properties ?
>>>>>
>>>>
>>>> How do we know that "x-max-bounce-buffer-size" is an experimental or unsupported
>>>> property? I guess the properties whose names start with "x-" are all treated as
>>>> experimental and unsupported?
>>>>
>>>> For this case, the bounce buffer is inevitable as the memory region can't be
>>>> directly accessed. The memory region is initialized by memory_region_init_ram_device_ptr()
>>>> in hw/vfio/region.c::vfio_region_mmap(). So the question is how the allowed
>>>> bounce buffer size can be specified by users, and it's why the existing property
>>>> "x-max-bounce-buffer-size" is reused.
>>>>
>>>> I even thought of a new property for MachineState (e.g. "limited-bounce-buffer"),
>>>> which is set to on by default, following the existing behavior. When it's set to
>>>> off by users, the max (allowed) buffer size won't be checked at all. However, I'm
>>>> not sure if this makes sense at all.
>>>
>>> Hi Gavin,
>>>
>>> You did not answer the question that Daniel was asking, how will user
>>> know that max-bounce-buffer-size should be used if it's necessary to fix
>>> guest system hangs and how will user know what magic value should be set?
>>>
>>
>> Sorry that I missed to answer Daniel's questions. For this specific case,
>> user need to enlarge the bounce buffer size when seeing the following error
>> message. We can add an explicit one in address_space_map() if the existing
>> error message isn't obvious.
>>
>>    qemu-system-aarch64: virtio: bogus descriptor or out of resources
>>
>>    void *address_space_map(AddressSpace *as,
>>                          hwaddr addr,
>>                          hwaddr *plen,
>>                          bool is_write,
>>                          MemTxAttrs attrs)
>>    {
>>        if (!memory_access_is_direct(mr, is_write, attrs)) {
>>            if (l == 0) {
>>                error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
>>                *plen = 0;
>>                return NULL;
>>            }
>>        }
> 
> This may work when using qemu directly but users will not see this error
> when using libvirt or management tools like kubevirt.
> 

Ok, then an error message raised by error_report() won't help.

>> As to the value user should take for max-bounce-buffer-size, it is really case by case
>> and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
>> smallest value works for them. The worst case is to set 0xFFFFFFFF.
> 
> Doesn't sound like pleasant user experience playing guessing game to
> figure out how to make a VM work and again will most likely not work for
> kubevirt where users are usually not exposed to these low level properties.
> 
> I'm not familiar with the internals but isn't there a better way how to
> solve it without requiring users to figure out by guessing what value works?
> 

Not really. The worst case is to have 'max-bounce-buffer-size=0xFFFFFFFF',
which is to disable the check against the max bounce buffer size :-)
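
As an illustration of the sizing problem, the budget check can be modelled
in a few lines of C. This is only a sketch under stated assumptions:
'struct bounce_budget' and bounce_reserve() are hypothetical names, the model
is single-threaded, and it is not the code in address_space_map(). It shows
why a 4096-byte limit cannot cover even one 64KB guest page, and why a limit
of 0xFFFFFFFF amounts to no check in practice:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a per-address-space bounce buffer budget. */
struct bounce_budget {
    uint64_t max;   /* plays the role of max-bounce-buffer-size */
    uint64_t used;  /* bytes currently handed out */
};

/*
 * Try to reserve 'len' bytes from the budget. Returns the number of bytes
 * actually granted, which may be less than 'len' and is 0 once the budget
 * is exhausted. A caller that is granted 0 bytes has to fail the mapping.
 */
static uint64_t bounce_reserve(struct bounce_budget *b, uint64_t len)
{
    uint64_t avail;
    uint64_t grant;

    assert(b->used <= b->max);
    avail = b->max - b->used;
    grant = len < avail ? len : avail;

    b->used += grant;
    return grant;
}
```

In this model, with max = 4096 a 65536-byte request is granted only 4096
bytes and the next request is granted 0, while with max = 0xFFFFFFFF the
same request is granted in full.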

Peter and Michael have already pointed in the direction of bypassing the
bounce buffer for this specific case. It worked for me: no guest hang is seen
when the bounce buffer is bypassed in address_space_map().

Thanks,
Gavin




* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
  2026-06-10 13:54                 ` Gavin Shan
@ 2026-06-10 14:06                   ` Michael S. Tsirkin
  2026-06-10 15:36                     ` Peter Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Michael S. Tsirkin @ 2026-06-10 14:06 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Pavel Hrdina, Daniel P. Berrangé, Peter Xu, qemu-devel,
	qemu-arm, jugraham, shan.gavin, Alex Williamson,
	David Hildenbrand

On Wed, Jun 10, 2026 at 11:54:47PM +1000, Gavin Shan wrote:
> Hi Michael and Peter,
> 
> On 6/10/26 11:00 PM, Gavin Shan wrote:
> > On 6/10/26 10:27 PM, Michael S. Tsirkin wrote:
> > > On Wed, Jun 10, 2026 at 10:19:31PM +1000, Gavin Shan wrote:
> > > > On 6/10/26 10:12 PM, Michael S. Tsirkin wrote:
> > > > > On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
> > > > > > On 6/10/26 7:54 PM, Pavel Hrdina wrote:
> > > > 
> > > > [...]
> > > > 
> > > > > > > 
> > > > > > > You did not answer the question that Daniel was asking, how will user
> > > > > > > know that max-bounce-buffer-size should be used if it's necessary to fix
> > > > > > > guest system hangs and how will user know what magic value should be set?
> > > > > > > 
> > > > > > 
> > > > > > Sorry that I missed to answer Daniel's questions. For this specific case,
> > > > > > user need to enlarge the bounce buffer size when seeing the following error
> > > > > > message. We can add an explicit one in address_space_map() if the existing
> > > > > > error message isn't obvious.
> > > > > > 
> > > > > >     qemu-system-aarch64: virtio: bogus descriptor or out of resources
> > > > > > 
> > > > > >     void *address_space_map(AddressSpace *as,
> > > > > >                           hwaddr addr,
> > > > > >                           hwaddr *plen,
> > > > > >                           bool is_write,
> > > > > >                           MemTxAttrs attrs)
> > > > > >     {
> > > > > >         if (!memory_access_is_direct(mr, is_write, attrs)) {
> > > > > >             if (l == 0) {
> > > > > >                 error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
> > > > > >                 *plen = 0;
> > > > > >                 return NULL;
> > > > > >             }
> > > > > >         }
> > > > > > 
> > > > > > As to the value user should take for max-bounce-buffer-size, it is really case by case
> > > > > > and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
> > > > > > smallest value works for them. The worst case is to set 0xFFFFFFFF.
> > > > > > 
> > > > > 
> > > > > 
> > > > > This is not at all reasonable. All kind of fixes are possible but
> > > > > fundamentally, bounce buffering data path is by itself already a
> > > > > bad idea.
> > > > > 
> > > > > I have no idea what does bounce buffering device ram accomplish.
> > > > > 
> > > > > In the end, qemu still simply reads the memory from/to the buffer.
> > > > > 
> > > > > My suggestion is to first of all look for ways to mark the
> > > > > memory as direct.
> > > > > 
> > > > 
> > > > As I explained to Peter Xu in another reply, we can't simply mark the (RAM
> > > > DEVICE) memory region is directly accessible. The memory region is initialized
> > > > by memory_region_init_ram_device_ptr() in hw/vfio/region.c::vfio_region_mmap().
> > > > 
> > > > The  accesses to the memory region is handled by 'ram_device_mem_ops' where
> > > > {ldn, stn}_he_p() are used in its read/write handler. They're different
> > > > from memcpy() since the data endianness is well handled in {ldn, stn}_he_p().
> > > > 
> > > > Thanks,
> > > > Gavin
> > > > 
> > > 
> > > What is endianness set to, for this region?
> > > 
> > 
> > The endianness of the memory region is set to that for the host.
> > 
> > static const MemoryRegionOps ram_device_mem_ops = {
> >      .read = memory_region_ram_device_read,
> >      .write = memory_region_ram_device_write,
> >      .endianness = HOST_BIG_ENDIAN ? DEVICE_BIG_ENDIAN : DEVICE_LITTLE_ENDIAN,
> > };
> > 

So there is never any endianness translation.
I think the reason qemu does the bounce buffer is more
to prevent things like vector access from MMIO.
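
To make the point concrete, a host-endian load is just a copy of the bytes
in host order. A minimal sketch, assuming a hypothetical helper
load32_host_endian() that stands in for the host-endian accessors mentioned
above (it is not QEMU's ldn_he_p()):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a host-endian 32-bit load helper. */
static uint32_t load32_host_endian(const void *p)
{
    uint32_t v;

    assert(p != NULL);
    /* No byte swap: bytes are read in host order and kept in host order. */
    memcpy(&v, p, sizeof(v));
    return v;
}
```

Loading a value this way returns exactly what a native read of the same
location would, on both little-endian and big-endian hosts.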


> How about to treat the RAM DEVICE memory region directly accessible in
> address_space_map() only when HOST_BIG_ENDIAN is false,
> something like
> below and I don't hit the guest hang issue with the changes.
> 
> diff --git a/include/system/memory.h b/include/system/memory.h
> index 1417132f6d..9daca55251 100644
> --- a/include/system/memory.h
> +++ b/include/system/memory.h
> @@ -2908,7 +2908,8 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr);
>  int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr);
>  bool prepare_mmio_access(MemoryRegion *mr);
> -static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
> +static inline bool memory_region_supports_direct_access(const MemoryRegion *mr,
> +                                                        bool check_ram_device)
>  {
>      /* ROM DEVICE regions only allow direct access if in ROMD mode. */
>      if (memory_region_is_romd(mr)) {
> @@ -2922,13 +2923,14 @@ static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
>       * be MMIO and access using mempy can be wrong (e.g., using instructions not
>       * intended for MMIO access). So we treat this as IO.
>       */
> -    return !memory_region_is_ram_device(mr);
> +    return (!check_ram_device || !memory_region_is_ram_device(mr));
>  }
>  static inline bool memory_access_is_direct(const MemoryRegion *mr,
> +                                           bool check_ram_device,
>                                             bool is_write, MemTxAttrs attrs)
>  {
> -    if (!memory_region_supports_direct_access(mr)) {
> +    if (!memory_region_supports_direct_access(mr, check_ram_device)) {
>          return false;
>      }
> diff --git a/system/physmem.c b/system/physmem.c
> index 7bcbf87573..2e6b72b124 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -3724,7 +3724,7 @@ void *address_space_map(AddressSpace *as,
>      fv = address_space_to_flatview(as);
>      mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
> -    if (!memory_access_is_direct(mr, is_write, attrs)) {
> +    if (!memory_access_is_direct(mr, HOST_BIG_ENDIAN, is_write, attrs)) {
>          size_t used = qatomic_read(&as->bounce_buffer_size);
>          for (;;) {
>              hwaddr alloc = MIN(as->max_bounce_buffer_size - used, l);
> 
> Thanks,
> Gavin
> 

I do not think it has anything to do with host endian-ness.


This is the change that broke it I think?


commit 4a2e242bbb306ef5c16ce9e7bb2da3bd8a4eb098
Author: Alex Williamson <alex@shazbot.org>
Date:   Mon Oct 31 09:53:03 2016 -0600

    memory: Don't use memcpy for ram_device regions
    

Maybe Alex has an opinion on what to do.


-- 
MST




* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
  2026-06-10 14:04           ` Gavin Shan
@ 2026-06-10 14:08             ` Michael S. Tsirkin
  0 siblings, 0 replies; 23+ messages in thread
From: Michael S. Tsirkin @ 2026-06-10 14:08 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Pavel Hrdina, Daniel P. Berrangé, Peter Xu, qemu-devel,
	qemu-arm, jugraham, shan.gavin

On Thu, Jun 11, 2026 at 12:04:52AM +1000, Gavin Shan wrote:
> On 6/10/26 10:23 PM, Pavel Hrdina wrote:
> > On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
> > > Hi Pavel,
> > > 
> > > On 6/10/26 7:54 PM, Pavel Hrdina wrote:
> > > > On Mon, Jun 08, 2026 at 09:11:50PM +1000, Gavin Shan wrote:
> > > > > Hi Daniel,
> > > > > 
> > > > > On 6/8/26 6:55 PM, Daniel P. Berrangé wrote:
> > > > > > On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
> > > > > > > On the guest where a NVidia's GH100 card is passed from the host, the
> > > > > > > guest system hang can be observed on attempt to compile 'cuda-samples',
> > > > > > > as reported by Julia.
> > > > > > 
> > > > > > snip
> > > > > > 
> > > > > 
> > > > > Thanks for looking into this.
> > > > > 
> > > > > > > By default, the max bounce buffer size is only 4096 bytes, even less
> > > > > > > than one page when the guest page is 64KB. This tries to fix the issue
> > > > > > > by inheriting the customized max bounce buffer size of the virtio bus's
> > > > > > > parent through property 'x-max-bounce-buffer-size' when the customized
> > > > > > > size is a larger one. With this applied, no guest system hang is seen
> > > > > > > with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
> > > > > > 
> > > > > > "x-max-bounce-buffer-size"  is an experimental / unsupported property.
> > > > > > 
> > > > > > We really shouldn't be expecting users to have to set this in a production
> > > > > > deployment in order to stop a guest from hanging.  Even if we dropped the
> > > > > > experimental marker from this property, users would still need to know to
> > > > > > provide this magic setting, so it would still be broken out of the box.
> > > > > > 
> > > > > > How can we  get a solution that "just works" out of the box, which is
> > > > > > fully supported, not relying on experimental properties ?
> > > > > > 
> > > > > 
> > > > > How do we know that "x-max-bounce-buffer-size" is an experimental or unsupported
> > > > > property? I guess the properties whose names start with "x-" are all treated as
> > > > > experimental and unsupported?
> > > > > 
> > > > > For this case, the bounce buffer is inevitable as the memory region can't be
> > > > > directly accessed. The memory region is initialized by memory_region_init_ram_device_ptr()
> > > > > in hw/vfio/region.c::vfio_region_mmap(). So the question is how the allowed
> > > > > bounce buffer size can be specified by users, and it's why the existing property
> > > > > "x-max-bounce-buffer-size" is reused.
> > > > > 
> > > > > I even thought of a new property for MachineState (e.g. "limited-bounce-buffer"),
> > > > > which is set to on by default, following the existing behavior. When it's set to
> > > > > off by users, the max (allowed) buffer size won't be checked at all. However, I'm
> > > > > not sure if this makes sense at all.
> > > > 
> > > > Hi Gavin,
> > > > 
> > > > You did not answer the question that Daniel was asking, how will user
> > > > know that max-bounce-buffer-size should be used if it's necessary to fix
> > > > guest system hangs and how will user know what magic value should be set?
> > > > 
> > > 
> > > Sorry that I missed to answer Daniel's questions. For this specific case,
> > > user need to enlarge the bounce buffer size when seeing the following error
> > > message. We can add an explicit one in address_space_map() if the existing
> > > error message isn't obvious.
> > > 
> > >    qemu-system-aarch64: virtio: bogus descriptor or out of resources
> > > 
> > >    void *address_space_map(AddressSpace *as,
> > >                          hwaddr addr,
> > >                          hwaddr *plen,
> > >                          bool is_write,
> > >                          MemTxAttrs attrs)
> > >    {
> > >        if (!memory_access_is_direct(mr, is_write, attrs)) {
> > >            if (l == 0) {
> > >                error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
> > >                *plen = 0;
> > >                return NULL;
> > >            }
> > >        }
> > 
> > This may work when using qemu directly but users will not see this error
> > when using libvirt or management tools like kubevirt.
> > 
> 
> Ok, then an error message raised by error_report() won't help.
> 
> > > As to the value user should take for max-bounce-buffer-size, it is really case by case
> > > and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
> > > smallest value works for them. The worst case is to set 0xFFFFFFFF.
> > 
> > Doesn't sound like pleasant user experience playing guessing game to
> > figure out how to make a VM work and again will most likely not work for
> > kubevirt where users are usually not exposed to these low level properties.
> > 
> > I'm not familiar with the internals but isn't there a better way how to
> > solve it without requiring users to figure out by guessing what value works?
> > 
> 
> Not really. The worst case is to have 'max-bounce-buffer-size=0xFFFFFFFF',
> which is to disable the check against the max bounce buffer size :-)
> 
> Peter and Michael already lead the direction to bypass the bounce buffer
> for this specific case. It worked for me and no guest hang isn't seen when
> the bounce buffer is bypassed in address_space_map().
> 
> Thanks,
> Gavin

Mind, I am not against additionally switching virtio to support popping
bufs into QEMUSGList and not iovecs.

But the performance is gonna be bad for this one.


-- 
MST




* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
  2026-06-10 14:06                   ` Michael S. Tsirkin
@ 2026-06-10 15:36                     ` Peter Xu
  2026-06-10 16:11                       ` Peter Maydell
  2026-06-10 16:18                       ` Michael S. Tsirkin
  0 siblings, 2 replies; 23+ messages in thread
From: Peter Xu @ 2026-06-10 15:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gavin Shan, Pavel Hrdina, Daniel P. Berrangé, qemu-devel,
	qemu-arm, jugraham, shan.gavin, Alex Williamson,
	David Hildenbrand

On Wed, Jun 10, 2026 at 10:06:24AM -0400, Michael S. Tsirkin wrote:
> On Wed, Jun 10, 2026 at 11:54:47PM +1000, Gavin Shan wrote:
> > Hi Michael and Peter,
> > 
> > On 6/10/26 11:00 PM, Gavin Shan wrote:
> > > On 6/10/26 10:27 PM, Michael S. Tsirkin wrote:
> > > > On Wed, Jun 10, 2026 at 10:19:31PM +1000, Gavin Shan wrote:
> > > > > On 6/10/26 10:12 PM, Michael S. Tsirkin wrote:
> > > > > > On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
> > > > > > > On 6/10/26 7:54 PM, Pavel Hrdina wrote:
> > > > > 
> > > > > [...]
> > > > > 
> > > > > > > > 
> > > > > > > > You did not answer the question that Daniel was asking, how will user
> > > > > > > > know that max-bounce-buffer-size should be used if it's necessary to fix
> > > > > > > > guest system hangs and how will user know what magic value should be set?
> > > > > > > > 
> > > > > > > 
> > > > > > > Sorry that I missed to answer Daniel's questions. For this specific case,
> > > > > > > user need to enlarge the bounce buffer size when seeing the following error
> > > > > > > message. We can add an explicit one in address_space_map() if the existing
> > > > > > > error message isn't obvious.
> > > > > > > 
> > > > > > >     qemu-system-aarch64: virtio: bogus descriptor or out of resources
> > > > > > > 
> > > > > > >     void *address_space_map(AddressSpace *as,
> > > > > > >                           hwaddr addr,
> > > > > > >                           hwaddr *plen,
> > > > > > >                           bool is_write,
> > > > > > >                           MemTxAttrs attrs)
> > > > > > >     {
> > > > > > >         if (!memory_access_is_direct(mr, is_write, attrs)) {
> > > > > > >             if (l == 0) {
> > > > > > >                 error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
> > > > > > >                 *plen = 0;
> > > > > > >                 return NULL;
> > > > > > >             }
> > > > > > >         }
> > > > > > > 
> > > > > > > As to the value user should take for max-bounce-buffer-size, it is really case by case
> > > > > > > and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
> > > > > > > smallest value works for them. The worst case is to set 0xFFFFFFFF.
> > > > > > > 
> > > > > > 
> > > > > > 
> > > > > > This is not at all reasonable. All kind of fixes are possible but
> > > > > > fundamentally, bounce buffering data path is by itself already a
> > > > > > bad idea.
> > > > > > 
> > > > > > I have no idea what does bounce buffering device ram accomplish.
> > > > > > 
> > > > > > In the end, qemu still simply reads the memory from/to the buffer.
> > > > > > 
> > > > > > My suggestion is to first of all look for ways to mark the
> > > > > > memory as direct.
> > > > > > 
> > > > > 
> > > > > As I explained to Peter Xu in another reply, we can't simply mark the (RAM
> > > > > DEVICE) memory region is directly accessible. The memory region is initialized
> > > > > by memory_region_init_ram_device_ptr() in hw/vfio/region.c::vfio_region_mmap().
> > > > > 
> > > > > The  accesses to the memory region is handled by 'ram_device_mem_ops' where
> > > > > {ldn, stn}_he_p() are used in its read/write handler. They're different
> > > > > from memcpy() since the data endianness is well handled in {ldn, stn}_he_p().
> > > > > 
> > > > > Thanks,
> > > > > Gavin
> > > > > 
> > > > 
> > > > What is endianness set to, for this region?
> > > > 
> > > 
> > > The endianness of the memory region is set to that for the host.
> > > 
> > > static const MemoryRegionOps ram_device_mem_ops = {
> > >      .read = memory_region_ram_device_read,
> > >      .write = memory_region_ram_device_write,
> > >      .endianness = HOST_BIG_ENDIAN ? DEVICE_BIG_ENDIAN : DEVICE_LITTLE_ENDIAN,
> > > };
> > > 
> 
> So there is never any endianness translation.
> I think the reason qemu does the bounce buffer is more
> to prevent things like vector access from MMIO.
> 
> 
> > How about to treat the RAM DEVICE memory region directly accessible in
> > address_space_map() only when HOST_BIG_ENDIAN is false,
> > something like
> > below and I don't hit the guest hang issue with the changes.
> > 
> > diff --git a/include/system/memory.h b/include/system/memory.h
> > index 1417132f6d..9daca55251 100644
> > --- a/include/system/memory.h
> > +++ b/include/system/memory.h
> > @@ -2908,7 +2908,8 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr);
> >  int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr);
> >  bool prepare_mmio_access(MemoryRegion *mr);
> > -static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
> > +static inline bool memory_region_supports_direct_access(const MemoryRegion *mr,
> > +                                                        bool check_ram_device)
> >  {
> >      /* ROM DEVICE regions only allow direct access if in ROMD mode. */
> >      if (memory_region_is_romd(mr)) {
> > @@ -2922,13 +2923,14 @@ static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
> >       * be MMIO and access using mempy can be wrong (e.g., using instructions not
> >       * intended for MMIO access). So we treat this as IO.
> >       */
> > -    return !memory_region_is_ram_device(mr);
> > +    return (!check_ram_device || !memory_region_is_ram_device(mr));
> >  }
> >  static inline bool memory_access_is_direct(const MemoryRegion *mr,
> > +                                           bool check_ram_device,
> >                                             bool is_write, MemTxAttrs attrs)
> >  {
> > -    if (!memory_region_supports_direct_access(mr)) {
> > +    if (!memory_region_supports_direct_access(mr, check_ram_device)) {
> >          return false;
> >      }
> > diff --git a/system/physmem.c b/system/physmem.c
> > index 7bcbf87573..2e6b72b124 100644
> > --- a/system/physmem.c
> > +++ b/system/physmem.c
> > @@ -3724,7 +3724,7 @@ void *address_space_map(AddressSpace *as,
> >      fv = address_space_to_flatview(as);
> >      mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
> > -    if (!memory_access_is_direct(mr, is_write, attrs)) {
> > +    if (!memory_access_is_direct(mr, HOST_BIG_ENDIAN, is_write, attrs)) {
> >          size_t used = qatomic_read(&as->bounce_buffer_size);
> >          for (;;) {
> >              hwaddr alloc = MIN(as->max_bounce_buffer_size - used, l);
> > 
> > Thanks,
> > Gavin
> > 
> 
> I do not think it has anything to do with host endian-ness.
> 
> 
> This is the change that broke it I think?
> 
> 
> commit 4a2e242bbb306ef5c16ce9e7bb2da3bd8a4eb098
> Author: Alex Williamson <alex@shazbot.org>
> Date:   Mon Oct 31 09:53:03 2016 -0600
> 
>     memory: Don't use memcpy for ram_device regions
>     
> 
> Maybe Alex has an opinion on what to do.

I can offer one idea here..

IIUC the major issue was vector ops, but the mr ops might be too heavy.
Another way to fix it is in the memory API: instead of using
memcpy()/memmove(), we always use a helper (say, memmove_no_vector()) to do
the split and properly aligned IOs, as ram_device_mem_ops does right now.
This should only apply to ram_device.

With that, IIUC we can remove the current ram_device_mem_ops; then in
Gavin's case mmap() will go through and the guest will not need to vmexit at
all.  Best perf, issue solved.

We just need to be careful to trap all possible memcpy()/memmove() uses in
the memory core. If I didn't miss any, IMO the four below need to be
replaced by memmove_no_vector():

  flatview_write_continue_step()
  flatview_read_continue_step()
  address_space_read()
  address_space_write_rom()
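
A rough sketch of what such a helper could look like follows. Everything
here is an assumption for illustration: copy_no_vector() is a hypothetical
name, it does not handle overlapping buffers (so it is not a full
memmove()), and it relies on volatile accesses plus a build with
-fno-strict-aliasing to keep the compiler from widening or merging the
accesses:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy 'len' bytes using only naturally aligned 1/2/4/8-byte accesses.
 * Each access is volatile so that, in practice, the compiler emits it as
 * written instead of combining several into a wider (e.g. vector) access.
 * Overlapping source and destination are NOT supported.
 */
static void copy_no_vector(void *dst, const void *src, size_t len)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;

    assert(len == 0 || (dst != NULL && src != NULL));

    while (len > 0) {
        /* Largest size both addresses are aligned to and that still fits. */
        size_t n = 8;
        while (n > 1 && (((d | s) & (n - 1)) != 0 || n > len)) {
            n >>= 1;
        }

        switch (n) {
        case 8:
            *(volatile uint64_t *)d = *(const volatile uint64_t *)s;
            break;
        case 4:
            *(volatile uint32_t *)d = *(const volatile uint32_t *)s;
            break;
        case 2:
            *(volatile uint16_t *)d = *(const volatile uint16_t *)s;
            break;
        default:
            *(volatile uint8_t *)d = *(const volatile uint8_t *)s;
            break;
        }

        d += n;
        s += n;
        len -= n;
    }
}
```

When source and destination have different alignment, this sketch falls
back to byte accesses; a real helper would have to decide whether that is
acceptable for the device in question.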

Thanks,

-- 
Peter Xu




* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
  2026-06-10 15:36                     ` Peter Xu
@ 2026-06-10 16:11                       ` Peter Maydell
  2026-06-10 16:19                         ` Michael S. Tsirkin
  2026-06-10 16:18                       ` Michael S. Tsirkin
  1 sibling, 1 reply; 23+ messages in thread
From: Peter Maydell @ 2026-06-10 16:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: Michael S. Tsirkin, Gavin Shan, Pavel Hrdina,
	Daniel P. Berrangé, qemu-devel, qemu-arm, jugraham,
	shan.gavin, Alex Williamson, David Hildenbrand

On Wed, 10 Jun 2026 at 16:37, Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Jun 10, 2026 at 10:06:24AM -0400, Michael S. Tsirkin wrote:
> > This is the change that broke it I think?
> >
> >
> > commit 4a2e242bbb306ef5c16ce9e7bb2da3bd8a4eb098
> > Author: Alex Williamson <alex@shazbot.org>
> > Date:   Mon Oct 31 09:53:03 2016 -0600
> >
> >     memory: Don't use memcpy for ram_device regions
> >
> >
> > Maybe Alex has an opinion on what to do.
>
> I can offer one idea here..
>
> IIUC the major issue was vector ops but the mr ops might be too heavy, then
> another way to fix it is in memory API instead of using memcpy()/memmove(),
> we always use a helper (say, memmove_no_vector()) to do the split and
> properly aligned IOs as what ram_device_mem_ops does right now, this should
> only applies to ram_device.

If the underlying memory needs to be accessed only with specific
alignment/size, as the 4a2e242bbb30 commit message suggests, then
we cannot expose it via address_space_map(), so we must have
a bounce-buffer. The address_space_map() function says
"here's a host pointer to memory, do what you like to it", and
the caller is entitled to memcpy to/from it or otherwise
access it with any C operations, which are not guaranteed to
respect any kind of alignment or similar restrictions.

My guess from commit 4a2e242bbb30 is that that applied an
overly broad "don't do direct access" hammer to all
vfio assigned devices, and that there needs to be some
concept of "this vfio assigned device's region is OK for
direct access" vs "this other one is not", such that if
this GH100 card's BAR guarantees it can be treated entirely
as RAM then we can have memory_region_supports_direct_access()
return true for it.
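
One way to picture that per-region distinction is an opt-in flag. The
sketch below is purely illustrative: 'struct region_model', its fields and
region_supports_direct_access() are hypothetical and do not match the QEMU
MemoryRegion API, and how a vfio device would decide to set the flag is
left open:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of a memory region. */
struct region_model {
    bool is_ram;         /* backed by host memory at all */
    bool is_ram_device;  /* device memory mapped into the guest */
    bool direct_ok;      /* hypothetical opt-in: safe to treat as plain RAM */
};

/*
 * Plain RAM is always directly accessible. A ram_device region is only
 * directly accessible if its owner explicitly opted in; otherwise it is
 * treated as IO, which keeps today's conservative behaviour as default.
 */
static bool region_supports_direct_access(const struct region_model *r)
{
    assert(r != NULL);

    if (!r->is_ram) {
        return false;
    }
    return !r->is_ram_device || r->direct_ok;
}
```

With such a default, existing assigned devices would keep going through the
bounce buffer, and only regions that are known to behave entirely as RAM
would be mapped directly.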

thanks
-- PMM



* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
  2026-06-10 15:36                     ` Peter Xu
  2026-06-10 16:11                       ` Peter Maydell
@ 2026-06-10 16:18                       ` Michael S. Tsirkin
  1 sibling, 0 replies; 23+ messages in thread
From: Michael S. Tsirkin @ 2026-06-10 16:18 UTC (permalink / raw)
  To: Peter Xu
  Cc: Gavin Shan, Pavel Hrdina, Daniel P. Berrangé, qemu-devel,
	qemu-arm, jugraham, shan.gavin, Alex Williamson,
	David Hildenbrand

On Wed, Jun 10, 2026 at 11:36:55AM -0400, Peter Xu wrote:
> On Wed, Jun 10, 2026 at 10:06:24AM -0400, Michael S. Tsirkin wrote:
> > On Wed, Jun 10, 2026 at 11:54:47PM +1000, Gavin Shan wrote:
> > > Hi Michael and Peter,
> > > 
> > > On 6/10/26 11:00 PM, Gavin Shan wrote:
> > > > On 6/10/26 10:27 PM, Michael S. Tsirkin wrote:
> > > > > On Wed, Jun 10, 2026 at 10:19:31PM +1000, Gavin Shan wrote:
> > > > > > On 6/10/26 10:12 PM, Michael S. Tsirkin wrote:
> > > > > > > On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
> > > > > > > > On 6/10/26 7:54 PM, Pavel Hrdina wrote:
> > > > > > 
> > > > > > [...]
> > > > > > 
> > > > > > > > > 
> > > > > > > > > You did not answer the question that Daniel was asking, how will user
> > > > > > > > > know that max-bounce-buffer-size should be used if it's necessary to fix
> > > > > > > > > guest system hangs and how will user know what magic value should be set?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Sorry that I missed to answer Daniel's questions. For this specific case,
> > > > > > > > user need to enlarge the bounce buffer size when seeing the following error
> > > > > > > > message. We can add an explicit one in address_space_map() if the existing
> > > > > > > > error message isn't obvious.
> > > > > > > > 
> > > > > > > >     qemu-system-aarch64: virtio: bogus descriptor or out of resources
> > > > > > > > 
> > > > > > > >     void *address_space_map(AddressSpace *as,
> > > > > > > >                           hwaddr addr,
> > > > > > > >                           hwaddr *plen,
> > > > > > > >                           bool is_write,
> > > > > > > >                           MemTxAttrs attrs)
> > > > > > > >     {
> > > > > > > >         if (!memory_access_is_direct(mr, is_write, attrs)) {
> > > > > > > >             if (l == 0) {
> > > > > > > >                 error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
> > > > > > > >                 *plen = 0;
> > > > > > > >                 return NULL;
> > > > > > > >             }
> > > > > > > >         }
> > > > > > > > 
> > > > > > > > As to the value user should take for max-bounce-buffer-size, it is really case by case
> > > > > > > > and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
> > > > > > > > smallest value works for them. The worst case is to set 0xFFFFFFFF.
> > > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > This is not at all reasonable. All kind of fixes are possible but
> > > > > > > fundamentally, bounce buffering data path is by itself already a
> > > > > > > bad idea.
> > > > > > > 
> > > > > > > I have no idea what does bounce buffering device ram accomplish.
> > > > > > > 
> > > > > > > In the end, qemu still simply reads the memory from/to the buffer.
> > > > > > > 
> > > > > > > My suggestion is to first of all look for ways to mark the
> > > > > > > memory as direct.
> > > > > > > 
> > > > > > 
> > > > > > As I explained to Peter Xu in another reply, we can't simply mark the (RAM
> > > > > > DEVICE) memory region is directly accessible. The memory region is initialized
> > > > > > by memory_region_init_ram_device_ptr() in hw/vfio/region.c::vfio_region_mmap().
> > > > > > 
> > > > > > The  accesses to the memory region is handled by 'ram_device_mem_ops' where
> > > > > > {ldn, stn}_he_p() are used in its read/write handler. They're different
> > > > > > from memcpy() since the data endianness is well handled in {ldn, stn}_he_p().
> > > > > > 
> > > > > > Thanks,
> > > > > > Gavin
> > > > > > 
> > > > > 
> > > > > What is endianness set to, for this region?
> > > > > 
> > > > 
> > > > The endianness of the memory region is set to that for the host.
> > > > 
> > > > static const MemoryRegionOps ram_device_mem_ops = {
> > > >      .read = memory_region_ram_device_read,
> > > >      .write = memory_region_ram_device_write,
> > > >      .endianness = HOST_BIG_ENDIAN ? DEVICE_BIG_ENDIAN : DEVICE_LITTLE_ENDIAN,
> > > > };
> > > > 
> > 
> > So there is never any endianness translation.
> > I think the reason qemu does the bounce buffer is more
> > to prevent things like vector access from MMIO.
> > 
> > 
> > > How about to treat the RAM DEVICE memory region directly accessible in
> > > address_space_map() only when HOST_BIG_ENDIAN is false,
> > > something like
> > > below and I don't hit the guest hang issue with the changes.
> > > 
> > > diff --git a/include/system/memory.h b/include/system/memory.h
> > > index 1417132f6d..9daca55251 100644
> > > --- a/include/system/memory.h
> > > +++ b/include/system/memory.h
> > > @@ -2908,7 +2908,8 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr);
> > >  int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr);
> > >  bool prepare_mmio_access(MemoryRegion *mr);
> > > -static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
> > > +static inline bool memory_region_supports_direct_access(const MemoryRegion *mr,
> > > +                                                        bool check_ram_device)
> > >  {
> > >      /* ROM DEVICE regions only allow direct access if in ROMD mode. */
> > >      if (memory_region_is_romd(mr)) {
> > > @@ -2922,13 +2923,14 @@ static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
> > >       * be MMIO and access using mempy can be wrong (e.g., using instructions not
> > >       * intended for MMIO access). So we treat this as IO.
> > >       */
> > > -    return !memory_region_is_ram_device(mr);
> > > +    return (!check_ram_device || !memory_region_is_ram_device(mr));
> > >  }
> > >  static inline bool memory_access_is_direct(const MemoryRegion *mr,
> > > +                                           bool check_ram_device,
> > >                                             bool is_write, MemTxAttrs attrs)
> > >  {
> > > -    if (!memory_region_supports_direct_access(mr)) {
> > > +    if (!memory_region_supports_direct_access(mr, check_ram_device)) {
> > >          return false;
> > >      }
> > > diff --git a/system/physmem.c b/system/physmem.c
> > > index 7bcbf87573..2e6b72b124 100644
> > > --- a/system/physmem.c
> > > +++ b/system/physmem.c
> > > @@ -3724,7 +3724,7 @@ void *address_space_map(AddressSpace *as,
> > >      fv = address_space_to_flatview(as);
> > >      mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
> > > -    if (!memory_access_is_direct(mr, is_write, attrs)) {
> > > +    if (!memory_access_is_direct(mr, HOST_BIG_ENDIAN, is_write, attrs)) {
> > >          size_t used = qatomic_read(&as->bounce_buffer_size);
> > >          for (;;) {
> > >              hwaddr alloc = MIN(as->max_bounce_buffer_size - used, l);
> > > 
> > > Thanks,
> > > Gavin
> > > 
> > 
> > I do not think it has anything to do with host endian-ness.
> > 
> > 
> > This is the change that broke it I think?
> > 
> > 
> > commit 4a2e242bbb306ef5c16ce9e7bb2da3bd8a4eb098
> > Author: Alex Williamson <alex@shazbot.org>
> > Date:   Mon Oct 31 09:53:03 2016 -0600
> > 
> >     memory: Don't use memcpy for ram_device regions
> >     
> > 
> > Maybe Alex has an opinion on what to do.
> 
> I can offer one idea here..
> 
> IIUC the major issue was vector ops, but the mr ops might be too heavy. Then
> another way to fix it is in the memory API: instead of using memcpy()/memmove(),
> we always use a helper (say, memmove_no_vector()) to do the split and
> properly aligned IOs, as ram_device_mem_ops does right now. This should
> only apply to ram_device.
> 
> With that, IIUC we can remove the current ram_device_mem_ops; then in
> Gavin's case mmap() will go through and the guest will not need to vmexit at
> all.  Best perf, issue solved.
> 
> We just need to be careful to trap all possible memcpy()/memmove() uses in
> the memory core. If I didn't miss any, IMO the four below need to be
> replaced by memmove_no_vector():
> 
>   flatview_write_continue_step()
>   flatview_read_continue_step()
>   address_space_read()
>   address_space_write_rom()
> 
> Thanks,
> 
> -- 
> Peter Xu
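
For illustration only, the kind of helper suggested above could be sketched
roughly as follows. The name copy_no_vector() is hypothetical and is not an
existing QEMU function; unlike a real memmove_no_vector(), this sketch does
not handle overlapping buffers, and it assumes a build with
-fno-strict-aliasing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not an existing QEMU function): copy n bytes between
 * non-overlapping buffers using only naturally aligned 1/2/4/8-byte accesses.
 * The volatile qualifiers keep the compiler from fusing the accesses into
 * wider (e.g. vector) loads and stores.
 */
static void copy_no_vector(void *dst, const void *src, size_t n)
{
    volatile uint8_t *d = dst;
    const volatile uint8_t *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));

    while (n > 0) {
        size_t step = 8;

        /* Shrink the step until it fits and both addresses are aligned to it. */
        while (step > n ||
               ((((uintptr_t)d | (uintptr_t)s) & (step - 1)) != 0)) {
            step >>= 1;
        }

        switch (step) {
        case 8:
            *(volatile uint64_t *)d = *(const volatile uint64_t *)s;
            break;
        case 4:
            *(volatile uint32_t *)d = *(const volatile uint32_t *)s;
            break;
        case 2:
            *(volatile uint16_t *)d = *(const volatile uint16_t *)s;
            break;
        default:
            *d = *s;
            break;
        }

        d += step;
        s += step;
        n -= step;
    }
}
```

When source and destination are misaligned relative to each other, the loop
degrades to narrower accesses, which is the price of never issuing an
unaligned or widened access to the device side.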

First, this is a nice idea.
Second, the ideal thing is still just allowing direct access.
And I think VFIO actually knows it's regular RAM.
So something like the following small patch in Linux, maybe?


diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index fa056b69f899..a4ca2d01272c 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -418,6 +418,10 @@ static int nvgrace_gpu_ioctl_get_region_info(struct vfio_device *core_vdev,
 	struct nvgrace_gpu_pci_core_device *nvdev =
 		container_of(core_vdev, struct nvgrace_gpu_pci_core_device,
 			     core_device.vdev);
+	struct vfio_region_info_cap_direct_access direct_access = {
+		.header.id = VFIO_REGION_INFO_CAP_DIRECT_ACCESS,
+		.header.version = 1,
+	};
 	struct vfio_region_info_cap_sparse_mmap *sparse;
 	struct mem_region *memregion;
 	u32 size;
@@ -453,6 +457,13 @@ static int nvgrace_gpu_ioctl_get_region_info(struct vfio_device *core_vdev,
 	if (ret)
 		return ret;
 
+	if (info->index == USEMEM_REGION_INDEX) {
+		ret = vfio_info_add_capability(caps, &direct_access.header,
+					       sizeof(direct_access));
+		if (ret)
+			return ret;
+	}
+
 	info->offset = VFIO_PCI_INDEX_TO_OFFSET(info->index);
 	/*
 	 * The region memory size may not be power-of-2 aligned.
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 5de618a3a5ee..f475f4920b52 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -466,6 +466,16 @@ struct vfio_device_migration_info {
  */
 #define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
 
+/*
+ * The direct access capability informs that a mmappable region may be
+ * accessed by userspace using any CPU load/store operations.
+ */
+#define VFIO_REGION_INFO_CAP_DIRECT_ACCESS	6
+
+struct vfio_region_info_cap_direct_access {
+	struct vfio_info_cap_header header;
+};
+
 /*
  * Capability with compressed real address (aka SSA - small system address)
  * where GPU RAM is mapped on a system bus. Used by a GPU for DMA routing




* Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
  2026-06-10 16:11                       ` Peter Maydell
@ 2026-06-10 16:19                         ` Michael S. Tsirkin
  0 siblings, 0 replies; 23+ messages in thread
From: Michael S. Tsirkin @ 2026-06-10 16:19 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Peter Xu, Gavin Shan, Pavel Hrdina, Daniel P. Berrangé,
	qemu-devel, qemu-arm, jugraham, shan.gavin, Alex Williamson,
	David Hildenbrand

On Wed, Jun 10, 2026 at 05:11:40PM +0100, Peter Maydell wrote:
> On Wed, 10 Jun 2026 at 16:37, Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Jun 10, 2026 at 10:06:24AM -0400, Michael S. Tsirkin wrote:
> > > This is the change that broke it I think?
> > >
> > >
> > > commit 4a2e242bbb306ef5c16ce9e7bb2da3bd8a4eb098
> > > Author: Alex Williamson <alex@shazbot.org>
> > > Date:   Mon Oct 31 09:53:03 2016 -0600
> > >
> > >     memory: Don't use memcpy for ram_device regions
> > >
> > >
> > > Maybe Alex has an opinion on what to do.
> >
> > I can offer one idea here..
> >
> > > IIUC the major issue was vector ops, but the mr ops might be too heavy. Then
> > > another way to fix it is in the memory API: instead of using memcpy()/memmove(),
> > > we always use a helper (say, memmove_no_vector()) to do the split and
> > > properly aligned IOs, as ram_device_mem_ops does right now. This should
> > > only apply to ram_device.
> 
> If the underlying memory needs to be accessed only with specific
> alignment/size, as the 4a2e242bbb30 commit message suggests, then
> we cannot expose it via address_space_map(), so we must have
> a bounce-buffer.

Right. And virtio currently isn't friendly to the bounce buffer.
We can fix that but I worry about the perf impact.

> The address_space_map() function says
> "here's a host pointer to memory, do what you like to it", and
> the caller is entitled to memcpy to/from it or otherwise
> access it with any C operations, which are not guaranteed to
> respect any kind of alignment or similar restrictions.
> 
> My guess from commit 4a2e242bbb30 is that that applied an
> overly broad "don't do direct access" hammer to all
> vfio assigned devices, and that there needs to be some
> concept of "this vfio assigned device's region is OK for
> direct access" vs "this other one is not", such that if
> this GH100 card's BAR guarantees it can be treated entirely
> as RAM then we can have memory_region_supports_direct_access()
> return true for it.
> 
> thanks
> -- PMM
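
As a toy model of the per-region distinction described above, the check
could take a form like the one below. Everything here is a hypothetical
stand-in: struct toy_region is not QEMU's MemoryRegion, the direct_ok flag
does not exist in QEMU, and the ROMD case handled by the real
memory_region_supports_direct_access() is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a memory region; only the fields the sketch needs. */
struct toy_region {
    bool ram;        /* backed by host memory */
    bool ram_device; /* mmap of a device region, e.g. a VFIO BAR */
    bool direct_ok;  /* hypothetical: device guarantees plain RAM semantics */
};

/*
 * Direct access is allowed for ordinary RAM, and for a RAM device region
 * only when it has been explicitly marked as safe for arbitrary loads and
 * stores. Anything else has to go through a bounce buffer.
 */
static bool toy_supports_direct_access(const struct toy_region *mr)
{
    assert(mr != NULL);

    if (!mr->ram) {
        return false;
    }
    return !mr->ram_device || mr->direct_ok;
}
```

With such a flag, a region whose hardware guarantees RAM behaviour would
skip the bounce buffer, while other assigned-device regions keep the
restriction introduced by commit 4a2e242bbb30.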




end of thread, other threads:[~2026-06-10 16:21 UTC | newest]

Thread overview: 23+ messages
2026-06-08  0:18 [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible Gavin Shan
2026-06-08  8:55 ` Daniel P. Berrangé
2026-06-08 11:11   ` Gavin Shan
2026-06-08 11:38     ` Daniel P. Berrangé
2026-06-09  2:08       ` Gavin Shan
2026-06-09 16:25         ` Peter Xu
2026-06-10  0:32           ` Gavin Shan
2026-06-10  9:54     ` Pavel Hrdina
2026-06-10 10:55       ` Gavin Shan
2026-06-10 12:12         ` Michael S. Tsirkin
2026-06-10 12:19           ` Gavin Shan
2026-06-10 12:27             ` Michael S. Tsirkin
2026-06-10 13:00               ` Gavin Shan
2026-06-10 13:54                 ` Gavin Shan
2026-06-10 14:06                   ` Michael S. Tsirkin
2026-06-10 15:36                     ` Peter Xu
2026-06-10 16:11                       ` Peter Maydell
2026-06-10 16:19                         ` Michael S. Tsirkin
2026-06-10 16:18                       ` Michael S. Tsirkin
2026-06-10 12:23         ` Pavel Hrdina
2026-06-10 14:04           ` Gavin Shan
2026-06-10 14:08             ` Michael S. Tsirkin
2026-06-10  9:49 ` Michael S. Tsirkin
