All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Michael S. Tsirkin" <mst@redhat.com>
To: Peter Xu <peterx@redhat.com>
Cc: "Gavin Shan" <gshan@redhat.com>,
	"Pavel Hrdina" <phrdina@redhat.com>,
	"Daniel P. Berrangé" <berrange@redhat.com>,
	qemu-devel@nongnu.org, qemu-arm@nongnu.org, jugraham@redhat.com,
	shan.gavin@gmail.com, "Alex Williamson" <alex@shazbot.org>,
	"David Hildenbrand" <david@kernel.org>
Subject: Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
Date: Wed, 10 Jun 2026 12:18:10 -0400	[thread overview]
Message-ID: <20260610121026-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <aimEl9QQ_LTRPLtd@x1.local>

On Wed, Jun 10, 2026 at 11:36:55AM -0400, Peter Xu wrote:
> On Wed, Jun 10, 2026 at 10:06:24AM -0400, Michael S. Tsirkin wrote:
> > On Wed, Jun 10, 2026 at 11:54:47PM +1000, Gavin Shan wrote:
> > > Hi Michael and Peter,
> > > 
> > > On 6/10/26 11:00 PM, Gavin Shan wrote:
> > > > On 6/10/26 10:27 PM, Michael S. Tsirkin wrote:
> > > > > On Wed, Jun 10, 2026 at 10:19:31PM +1000, Gavin Shan wrote:
> > > > > > On 6/10/26 10:12 PM, Michael S. Tsirkin wrote:
> > > > > > > On Wed, Jun 10, 2026 at 08:55:10PM +1000, Gavin Shan wrote:
> > > > > > > > On 6/10/26 7:54 PM, Pavel Hrdina wrote:
> > > > > > 
> > > > > > [...]
> > > > > > 
> > > > > > > > > 
> > > > > > > > > You did not answer the question that Daniel was asking, how will user
> > > > > > > > > know that max-bounce-buffer-size should be used if it's necessary to fix
> > > > > > > > > guest system hangs and how will user know what magic value should be set?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Sorry that I missed to answer Daniel's questions. For this specific case,
> > > > > > > > user need to enlarge the bounce buffer size when seeing the following error
> > > > > > > > message. We can add an explicit one in address_space_map() if the existing
> > > > > > > > error message isn't obvious.
> > > > > > > > 
> > > > > > > >     qemu-system-aarch64: virtio: bogus descriptor or out of resources
> > > > > > > > 
> > > > > > > >     void *address_space_map(AddressSpace *as,
> > > > > > > >                           hwaddr addr,
> > > > > > > >                           hwaddr *plen,
> > > > > > > >                           bool is_write,
> > > > > > > >                           MemTxAttrs attrs)
> > > > > > > >     {
> > > > > > > >         if (!memory_access_is_direct(mr, is_write, attrs)) {
> > > > > > > >             if (l == 0) {
> > > > > > > >                 error_report("Running out of bounce buffer size , enlarge it with max-bounce-buffer-size");
> > > > > > > >                 *plen = 0;
> > > > > > > >                 return NULL;
> > > > > > > >             }
> > > > > > > >         }
> > > > > > > > 
> > > > > > > > As to the value user should take for max-bounce-buffer-size, it is really case by case
> > > > > > > > and decided by user. User needs to try 4096, 8192, ..., 0xFFFFFFFF to figure out the
> > > > > > > > smallest value works for them. The worst case is to set 0xFFFFFFFF.
> > > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > This is not at all reasonable. All kind of fixes are possible but
> > > > > > > fundamentally, bounce buffering data path is by itself already a
> > > > > > > bad idea.
> > > > > > > 
> > > > > > > I have no idea what does bounce buffering device ram accomplish.
> > > > > > > 
> > > > > > > In the end, qemu still simply reads the memory from/to the buffer.
> > > > > > > 
> > > > > > > My suggestion is to first of all look for ways to mark the
> > > > > > > memory as direct.
> > > > > > > 
> > > > > > 
> > > > > > As I explained to Peter Xu in another reply, we can't simply mark the (RAM
> > > > > > DEVICE) memory region is directly accessible. The memory region is initialized
> > > > > > by memory_region_init_ram_device_ptr() in hw/vfio/region.c::vfio_region_mmap().
> > > > > > 
> > > > > > The  accesses to the memory region is handled by 'ram_device_mem_ops' where
> > > > > > {ldn, stn}_he_p() are used in its read/write handler. They're different
> > > > > > from memcpy() since the data endianness is well handled in {ldn, stn}_he_p().
> > > > > > 
> > > > > > Thanks,
> > > > > > Gavin
> > > > > > 
> > > > > 
> > > > > What is endianness set to, for this region?
> > > > > 
> > > > 
> > > > The endianness of the memory region is set to that for the host.
> > > > 
> > > > static const MemoryRegionOps ram_device_mem_ops = {
> > > >      .read = memory_region_ram_device_read,
> > > >      .write = memory_region_ram_device_write,
> > > >      .endianness = HOST_BIG_ENDIAN ? DEVICE_BIG_ENDIAN : DEVICE_LITTLE_ENDIAN,
> > > > };
> > > > 
> > 
> > So there is never any endianness translation.
> > I think the reason qemu does the bounce buffer is more
> > to prevent things like vector access from MMIO.
> > 
> > 
> > > How about to treat the RAM DEVICE memory region directly accessible in
> > > address_space_map() only when HOST_BIG_ENDIAN is false,
> > > something like
> > > below and I don't hit the guest hang issue with the changes.
> > > 
> > > diff --git a/include/system/memory.h b/include/system/memory.h
> > > index 1417132f6d..9daca55251 100644
> > > --- a/include/system/memory.h
> > > +++ b/include/system/memory.h
> > > @@ -2908,7 +2908,8 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr);
> > >  int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr);
> > >  bool prepare_mmio_access(MemoryRegion *mr);
> > > -static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
> > > +static inline bool memory_region_supports_direct_access(const MemoryRegion *mr,
> > > +                                                        bool check_ram_device)
> > >  {
> > >      /* ROM DEVICE regions only allow direct access if in ROMD mode. */
> > >      if (memory_region_is_romd(mr)) {
> > > @@ -2922,13 +2923,14 @@ static inline bool memory_region_supports_direct_access(const MemoryRegion *mr)
> > >       * be MMIO and access using mempy can be wrong (e.g., using instructions not
> > >       * intended for MMIO access). So we treat this as IO.
> > >       */
> > > -    return !memory_region_is_ram_device(mr);
> > > +    return (!check_ram_device || !memory_region_is_ram_device(mr));
> > >  }
> > >  static inline bool memory_access_is_direct(const MemoryRegion *mr,
> > > +                                           bool check_ram_device,
> > >                                             bool is_write, MemTxAttrs attrs)
> > >  {
> > > -    if (!memory_region_supports_direct_access(mr)) {
> > > +    if (!memory_region_supports_direct_access(mr, check_ram_device)) {
> > >          return false;
> > >      }
> > > diff --git a/system/physmem.c b/system/physmem.c
> > > index 7bcbf87573..2e6b72b124 100644
> > > --- a/system/physmem.c
> > > +++ b/system/physmem.c
> > > @@ -3724,7 +3724,7 @@ void *address_space_map(AddressSpace *as,
> > >      fv = address_space_to_flatview(as);
> > >      mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
> > > -    if (!memory_access_is_direct(mr, is_write, attrs)) {
> > > +    if (!memory_access_is_direct(mr, HOST_BIG_ENDIAN, is_write, attrs)) {
> > >          size_t used = qatomic_read(&as->bounce_buffer_size);
> > >          for (;;) {
> > >              hwaddr alloc = MIN(as->max_bounce_buffer_size - used, l);
> > > 
> > > Thanks,
> > > Gavin
> > > 
> > 
> > I do not think it has anything to do with host endian-ness.
> > 
> > 
> > This is the change that broke it I think?
> > 
> > 
> > commit 4a2e242bbb306ef5c16ce9e7bb2da3bd8a4eb098
> > Author: Alex Williamson <alex@shazbot.org>
> > Date:   Mon Oct 31 09:53:03 2016 -0600
> > 
> >     memory: Don't use memcpy for ram_device regions
> >     
> > 
> > Maybe Alex has an opinion on what to do.
> 
> I can offer one idea here..
> 
> IIUC the major issue was vector ops but the mr ops might be too heavy, then
> another way to fix it is in memory API instead of using memcpy()/memmove(),
> we always use a helper (say, memmove_no_vector()) to do the split and
> properly aligned IOs as what ram_device_mem_ops does right now, this should
> only applies to ram_device.
> 
> With that, IIUC we can remove the current ram_device_mem_ops, then in
> Gavin's case mmap() will go through and guest will not need to vmexit at
> all.  Best perf, issue solve.
> 
> We just need to be careful to trap all possible memcpy()/memmove() used in
> memory core.. if I didn't miss any, IMO below four should needs to be
> replaced by memmove_no_vector():
> 
>   flatview_write_continue_step()
>   flatview_read_continue_step()
>   address_space_read()
>   address_space_write_rom()
> 
> Thanks,
> 
> -- 
> Peter Xu

First, this is a nice idea.
Second, the ideal thing is still just allowing direct access.
And I think VFIO actually knows it's regular RAM.
So something like the following small patch in linux, maybe?


diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index fa056b69f899..a4ca2d01272c 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -418,6 +418,10 @@ static int nvgrace_gpu_ioctl_get_region_info(struct vfio_device *core_vdev,
 	struct nvgrace_gpu_pci_core_device *nvdev =
 		container_of(core_vdev, struct nvgrace_gpu_pci_core_device,
 			     core_device.vdev);
+	struct vfio_region_info_cap_direct_access direct_access = {
+		.header.id = VFIO_REGION_INFO_CAP_DIRECT_ACCESS,
+		.header.version = 1,
+	};
 	struct vfio_region_info_cap_sparse_mmap *sparse;
 	struct mem_region *memregion;
 	u32 size;
@@ -453,6 +457,13 @@ static int nvgrace_gpu_ioctl_get_region_info(struct vfio_device *core_vdev,
 	if (ret)
 		return ret;
 
+	if (info->index == USEMEM_REGION_INDEX) {
+		ret = vfio_info_add_capability(caps, &direct_access.header,
+					       sizeof(direct_access));
+		if (ret)
+			return ret;
+	}
+
 	info->offset = VFIO_PCI_INDEX_TO_OFFSET(info->index);
 	/*
 	 * The region memory size may not be power-of-2 aligned.
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 5de618a3a5ee..f475f4920b52 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -466,6 +466,16 @@ struct vfio_device_migration_info {
  */
 #define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
 
+/*
+ * The direct access capability informs that a mmappable region may be
+ * accessed by userspace using any CPU load/store operations.
+ */
+#define VFIO_REGION_INFO_CAP_DIRECT_ACCESS	6
+
+struct vfio_region_info_cap_direct_access {
+	struct vfio_info_cap_header header;
+};
+
 /*
  * Capability with compressed real address (aka SSA - small system address)
  * where GPU RAM is mapped on a system bus. Used by a GPU for DMA routing



  parent reply	other threads:[~2026-06-10 16:19 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-08  0:18 [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible Gavin Shan
2026-06-08  8:55 ` Daniel P. Berrangé
2026-06-08 11:11   ` Gavin Shan
2026-06-08 11:38     ` Daniel P. Berrangé
2026-06-09  2:08       ` Gavin Shan
2026-06-09 16:25         ` Peter Xu
2026-06-10  0:32           ` Gavin Shan
2026-06-10  9:54     ` Pavel Hrdina
2026-06-10 10:55       ` Gavin Shan
2026-06-10 12:12         ` Michael S. Tsirkin
2026-06-10 12:19           ` Gavin Shan
2026-06-10 12:27             ` Michael S. Tsirkin
2026-06-10 13:00               ` Gavin Shan
2026-06-10 13:54                 ` Gavin Shan
2026-06-10 14:06                   ` Michael S. Tsirkin
2026-06-10 15:36                     ` Peter Xu
2026-06-10 16:11                       ` Peter Maydell
2026-06-10 16:19                         ` Michael S. Tsirkin
2026-06-10 16:18                       ` Michael S. Tsirkin [this message]
2026-06-10 12:23         ` Pavel Hrdina
2026-06-10 14:04           ` Gavin Shan
2026-06-10 14:08             ` Michael S. Tsirkin
2026-06-10  9:49 ` Michael S. Tsirkin
2026-06-10 18:30   ` Stefan Hajnoczi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260610121026-mutt-send-email-mst@kernel.org \
    --to=mst@redhat.com \
    --cc=alex@shazbot.org \
    --cc=berrange@redhat.com \
    --cc=david@kernel.org \
    --cc=gshan@redhat.com \
    --cc=jugraham@redhat.com \
    --cc=peterx@redhat.com \
    --cc=phrdina@redhat.com \
    --cc=qemu-arm@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=shan.gavin@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.