From: Stefan Hajnoczi <stefanha@redhat.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Gavin Shan <gshan@redhat.com>,
	qemu-devel@nongnu.org, qemu-arm@nongnu.org, jugraham@redhat.com,
	shan.gavin@gmail.com, qemu-block@nongnu.org
Subject: Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
Date: Wed, 10 Jun 2026 14:30:46 -0400
Message-ID: <20260610183046.GB121666@fedora>
In-Reply-To: <20260610041036-mutt-send-email-mst@kernel.org>


On Wed, Jun 10, 2026 at 05:49:21AM -0400, Michael S. Tsirkin wrote:
> On Mon, Jun 08, 2026 at 10:18:21AM +1000, Gavin Shan wrote:
> > On the guest where a NVidia's GH100 card is passed from the host, the
> > guest system hang can be observed on attempt to compile 'cuda-samples',
> > as reported by Julia.
> > 
> >    host$ lspci | grep GH100
> >    0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
> >    host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64 -accel kvm \
> >          -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T         \
> >          -cpu host -smp cpus=32 -m size=8G                                  \
> >          -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0    \
> >          -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4     \
> >          -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
> > 
> >    guest$ cd cuda-samples/build
> >    guest$ make -j 20 clean
> >    guest$ make -j 20
> >                :
> >    [ 54%] Linking CUDA executable graphMemoryNodes
> >    [ 54%] Built target graphMemoryNodes
> >    <no more output afterwards, guest becomes frozen here>
> > 
> >    guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
> >    [  555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
> > 
> > When the GPU's driver (NVidia open driver) is loaded on guest bootup,
> > the memory blocks residing in the PCI BAR can be presented to the guest
> > through memory hot-add. The page cache can be allocated from the hot added
> > memory blocks when cuda-samples is being built. Afterwards, the page cache
> > is sent to QEMU's virtio-blk device as part of the DMA request, the bounce
> > buffer is used to accommodate the request as the corresponding memory
> > region (MemoryRegion) is a RAM DEVICE region in qemu. For this specific
> > case, false is returned from memory_access_is_direct() in the path where
> > the DMA request is handled.
> > 
> >   QEMU
> >   ====
> >   virtio_blk_handle_output
> >     virtio_blk_handle_vq
> >       virtio_blk_get_request
> >         virtqueue_pop
> >           virtqueue_split_pop
> >             virtqueue_map_desc
> >               address_space_map
> >                 memory_access_is_direct         # Return false
> >                   memory_region_supports_direct_access
> > 
> >   (qemu) info mtree
> >           :
> >   memory-region: pci_bridge_pci
> >     0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
> >       0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
> >         0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
> >           0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
> > 
> > By default, the max bounce buffer size is only 4096 bytes, even less
> > than one page when the guest page is 64KB. This tries to fix the issue
> > by inheriting the customized max bounce buffer size of the virtio bus's
> > parent through property 'x-max-bounce-buffer-size' when the customized
> > size is a larger one. With this applied, no guest system hang is seen
> > with '-device virtio-blk-pci,...,x-max-bounce-buffer-size=268435456'.
> > 
> > Reported-by: Julia Graham <jugraham@redhat.com>
> > Signed-off-by: Gavin Shan <gshan@redhat.com>
> > ---
> >  hw/virtio/virtio-bus.c | 14 ++++++++++++++
> >  1 file changed, 14 insertions(+)
> > 
> > diff --git a/hw/virtio/virtio-bus.c b/hw/virtio/virtio-bus.c
> > index cef944e015..e0933823f3 100644
> > --- a/hw/virtio/virtio-bus.c
> > +++ b/hw/virtio/virtio-bus.c
> > @@ -42,6 +42,7 @@ do { printf("virtio_bus: " fmt , ## __VA_ARGS__); } while (0)
> >  /* A VirtIODevice is being plugged */
> >  void virtio_bus_device_plugged(VirtIODevice *vdev, Error **errp)
> >  {
> > +    AddressSpace *as;
> >      DeviceState *qdev = DEVICE(vdev);
> >      BusState *qbus = BUS(qdev_get_parent_bus(qdev));
> >      VirtioBusState *bus = VIRTIO_BUS(qbus);
> > @@ -100,6 +101,19 @@ void virtio_bus_device_plugged(VirtIODevice *vdev, Error **errp)
> >                  return;
> >              }
> >          }
> > +    } else {
> > +        /*
> > +         * The maximal bounce buffer size of the virtio bus's parent may
> > +         * have been customized by property 'x-max-bounce-buffer-size'.
> > +         * Lets inherit the customized size if it's larger than the
> > +         * current one.
> > +         */
> > +        as = klass->get_dma_as ? klass->get_dma_as(qbus->parent) : NULL;
> > +        if (as) {
> > +            vdev->dma_as->max_bounce_buffer_size = MAX(
> > +                    vdev->dma_as->max_bounce_buffer_size,
> > +                    as->max_bounce_buffer_size);
> > +        }
> >      }
> >  }
> >  
> > -- 
> > 2.54.0
> 
> 
> Problem with all this is, users would not know how to size this.
> 
> So fundamentally, is not the issue that virtio blk (and scsi!) maps
> all of the buffer all the time?
>
> It's not hard to add something like virtio_pop_unmapped that would not map,
> then build QEMUSGLists out of addr/len pairs and submit these.
> 
> Stefan, do you think doing it like this would be bad for perf? Good for
> perf?

I'd like to first make sure that the BAR really cannot be mmapped.

A bounce buffer is necessary when QEMU has no way of mmapping the memory
(e.g. it needs to invoke a device model's callback to read/write the
MemoryRegion).

The bounce buffer size is low because it's normally only used on
emulated machines where MMIO registers or similar small MemoryRegions
are accessed by DMA. If we ran into this on modern machines there would
be other consequences too: vhost devices, for example, would be unable
to access that memory since it cannot be shared/mmapped.
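
To make the direct vs. bounce distinction concrete, here is a minimal
sketch in C. It is a toy model under stated assumptions, not QEMU's
actual code: ToyRegion, ToyAddressSpace, toy_map() and toy_unmap() are
hypothetical names, and the only behavior modeled is that a mapping of
non-direct memory is capped by the remaining bounce buffer budget.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy types; these are NOT QEMU's MemoryRegion/AddressSpace. */
typedef struct {
    bool direct;                   /* true: backing memory can be mmapped */
} ToyRegion;

typedef struct {
    size_t max_bounce_buffer_size; /* cap on outstanding bounce bytes */
    size_t bounce_in_use;          /* bounce bytes currently mapped */
} ToyAddressSpace;

/*
 * Map 'len' bytes of 'mr'. Returns the number of bytes actually mapped,
 * which is shorter than 'len' when the bounce budget is the limit.
 */
static size_t toy_map(ToyAddressSpace *as, const ToyRegion *mr, size_t len)
{
    assert(as != NULL && mr != NULL);

    if (mr->direct) {
        return len;                /* plain pointer access: no copy, no cap */
    }

    /* Non-direct memory has to be copied through the bounce buffer. */
    size_t avail = as->max_bounce_buffer_size - as->bounce_in_use;
    size_t got = len < avail ? len : avail;

    as->bounce_in_use += got;
    return got;
}

/* Release a mapping previously returned by toy_map(). */
static void toy_unmap(ToyAddressSpace *as, const ToyRegion *mr, size_t len)
{
    assert(as != NULL && mr != NULL);

    if (!mr->direct) {
        assert(len <= as->bounce_in_use);
        as->bounce_in_use -= len;
    }
}
```

In this model, a 64 KiB request against a non-direct region with a
4096-byte cap comes back short, while the same request against a direct
region is never limited. That is why making the BAR mmappable would
sidestep the sizing question entirely.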

This is why I think we need to understand why this BAR is a RAM DEVICE.
If it can support mmap then this issue would go away, and anything else
that depends on mmap, like vhost, would work too.

Gavin, can you share the output of `lspci -vv -s 0009:01:00.0`?

Thanks,
Stefan

