From: "Michael S. Tsirkin" <mst@redhat.com>
To: Gavin Shan <gshan@redhat.com>
Cc: "Peter Maydell" <peter.maydell@linaro.org>,
"Peter Xu" <peterx@redhat.com>,
"Pavel Hrdina" <phrdina@redhat.com>,
"Daniel P. Berrangé" <berrange@redhat.com>,
qemu-devel@nongnu.org, qemu-arm@nongnu.org, jugraham@redhat.com,
shan.gavin@gmail.com, "Alex Williamson" <alex@shazbot.org>,
"David Hildenbrand" <david@kernel.org>
Subject: Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
Date: Fri, 12 Jun 2026 04:43:00 -0400
Message-ID: <20260612044147-mutt-send-email-mst@kernel.org>
In-Reply-To: <ec32c5ae-782c-4c83-8432-23d979552fbe@redhat.com>

On Fri, Jun 12, 2026 at 02:25:07PM +1000, Gavin Shan wrote:
> On 6/11/26 10:48 PM, Peter Maydell wrote:
> > On Thu, 11 Jun 2026 at 13:34, Gavin Shan <gshan@redhat.com> wrote:
> > >
> > > Let me try to summarize what I understood. As far as VFIO is concerned, there
> > > are multiple memory regions for one particular PCI BAR, and they're stacked
> > > up. The memory regions for PCI BAR#4 of the GH100 card look as below.
> > >
> > > (qemu) info mtree
> > > :
> > > address-space: pci_bridge_pci_mem
> > > 0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
> > > 0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4 <---- (1) VFIOBAR::mr
> > > 0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4 <---- (2) VFIOBAR::VFIORegion::mem
> > > 0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0] <---- (3) VFIOBAR::VFIORegion::VFIOMap::mem
> > >
> > > (1) Its MemoryRegionOps is NULL. No data accesses are routed to this region.
> > > (2) Data accesses routed to this region are handled by pread() and pwrite().
> > > (3) Data accesses routed to this region were handled by memcpy() before
> > > commit 4a2e242bbb.
> > >
> > > There are identified PCI devices that have quirks, see vfio_bar_quirk_setup().
> > > Accesses to part of the PCI BAR have to be emulated by extra IO regions,
> > > something like below for the rtl8168 PCI device, where two extra IO regions are
> > > stacked up for the quirks.
> > >
> > > address-space: pci_bridge_pci_mem
> > > 0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
> > > 0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4 <---- (1) VFIOBAR::mr
> > > 0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4 <---- (2) VFIOBAR::VFIORegion::mem
> > > 0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0] <---- (3) VFIOBAR::VFIORegion::VFIOMap::mem
> > > 0000042000000010-0000042000000014 (prio 1, i/o): 0009:01:00.0 BAR 4 quirk[0] <---- (4) quirk[0]
> > > 0000042000000018-000004200000001c (prio 1, i/o): 0009:01:00.0 BAR 4 quirk[1] <---- (5) quirk[1]
> > >
> > > Access on 0000042000000010-0000042000000014 should be routed to region (4) quirk[0]
> > > and access on 0000042000000018-000004200000001c should be routed to region (5) quirk[1].
> > > However, accesses to 0000042000000000-0000042000000020 are routed to region (3) before
> > > commit 4a2e242bbb and the data transfer is done by memcpy(), bypassing region (4) and
> > > (5). That is not the expected behavior, and it is why memcpy() isn't expected on
> > > the rtl8168's PCI BAR given the quirks, which answers your question.
> > >
> > > With commit 4a2e242bbb applied, the accesses will be routed to the correct region.
> >
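
As an aside for readers of the mtree dumps above, here is a toy model of how
overlapping sibling regions resolve by priority. It is a sketch, not QEMU code:
Region, resolve() and flat_view() are hypothetical names, offsets are relative
to the BAR base, ranges are treated as half-open, and the mmap region is
shortened to one 4K page for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    start: int  # offset from the BAR base, inclusive (assumption)
    end: int    # exclusive (assumption)
    prio: int


def resolve(regions, offset):
    """Return the name of the highest-priority region covering offset."""
    hits = [r for r in regions if r.start <= offset < r.end]
    if not hits:
        return None
    return max(hits, key=lambda r: r.prio).name


def flat_view(regions, lo, hi):
    """Coalesce [lo, hi) into (start, end, name) runs, byte by byte."""
    runs = []
    for off in range(lo, hi):
        name = resolve(regions, off)
        if runs and runs[-1][2] == name:
            runs[-1] = (runs[-1][0], off + 1, name)
        else:
            runs.append((off, off + 1, name))
    return runs


# Hypothetical layout modeled on the rtl8168 example quoted above.
BAR4 = [
    Region("mmaps[0]", 0x00, 0x1000, 0),
    Region("quirk[0]", 0x10, 0x14, 1),
    Region("quirk[1]", 0x18, 0x1C, 1),
]
```

In this model flat_view(BAR4, 0x0, 0x20) splits the first 32 bytes into five
runs: the two quirk windows, with the mmap region filling the gaps around them.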
> > The way I read 4a2e242bbb's commit message, it isn't about things being routed
> > to the wrong region. It's about the handling of areas which aren't in the small
> > quirk regions but which are in the same 4K page as them. These have to be
> > handled via the memory subsystem's "subpage" mechanism. This does route
> > everything to the correct region, but if the region (3) is marked as "direct
> > access is OK" then QEMU assumes that any kind of direct access is OK, i.e. this
> > behaves like true RAM. It then does a memcpy access to a BAR that's really a
> > bank of device registers, and this goes wrong.
> >
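
To illustrate the failure mode Peter describes, a toy model follows. It is not
QEMU code: RegisterBank and read_range() are hypothetical, and the rule that
the device accepts only naturally aligned 1, 2 or 4 byte accesses is an
assumption chosen for the sketch. The "direct" flag stands in for "direct
access is OK".

```python
class RegisterBank:
    """Toy device: only naturally aligned 1/2/4-byte accesses are valid."""

    def __init__(self, size, direct):
        self.data = bytearray(size)
        self.direct = direct  # models the "direct access is OK" marking
        self.log = []         # every (offset, size) access that reached us

    def read(self, offset, size):
        self.log.append((offset, size))
        if size not in (1, 2, 4) or offset % size:
            raise ValueError(f"bad access: offset={offset:#x} size={size}")
        return bytes(self.data[offset:offset + size])


def read_range(region, offset, length):
    """Read length bytes, either memcpy-style or as sized accesses."""
    if region.direct:
        # Treated like RAM: one bulk access of arbitrary size and alignment.
        return region.read(offset, length)
    out = bytearray()
    while length:
        # Pick the largest naturally aligned size that still fits.
        size = 4
        while size > length or offset % size:
            size //= 2
        out += region.read(offset, size)
        offset += size
        length -= size
    return bytes(out)
```

With direct=False a 6-byte read at offset 2 reaches the device as a 2-byte and
a 4-byte access. With direct=True the same helper issues one bulk access, which
this toy device rejects.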
>
> Ok, thanks for your followup and explanation. I also spent some time going through
> system/memory.h and system/physmem.c. I think I fully understand the issue now. There
> are two concerning paths as mentioned by Peter Xu in another reply: (a) vCPU accessors
> like address_space_rw(), address_space_write_rom(), address_space_ldq(), address_space_stq()
> and their variants; (b) functions used in the DMA path like address_space_map() and
> address_space_unmap().
>
> For (a), memmove() and memcpy() are used, and they can be replaced with something else
> that is safe. However, the replacement in (a) can't fix the issue existing in (b). In (b),
> the host's pointer is returned by address_space_map() and the memory block (perhaps not
> a real RAM block) can be accessed by any means after that. I guess we have to ensure
> the memory region is a real RAM block before the pointer can be returned in (b). Otherwise,
> we still fall back to the bounce buffer in (b).

No guest will DMA into a "non RAM block". Whatever is a DMA target is a
'real RAM block'.

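
To make path (b) concrete, here is a toy sketch of mapping with a bounce-buffer
fallback. It is a model under stated assumptions, not the address_space_map()
implementation: Target, Mapping, dma_map() and dma_unmap() are hypothetical
names, bounce_budget is a hypothetical stand-in for a bounce buffer size limit,
and the write-back in 4-byte pieces is an arbitrary choice.

```python
class Target:
    """Toy DMA target: backing bytes plus a log of device-style writes."""

    def __init__(self, size, is_ram):
        self.mem = bytearray(size)
        self.is_ram = is_ram
        self.writes = []  # (offset, size) of each write routed to the device

    def write(self, offset, data):
        self.writes.append((offset, len(data)))
        self.mem[offset:offset + len(data)] = data


class Mapping:
    def __init__(self, target, offset, buf, bounced):
        self.target = target
        self.offset = offset
        self.buf = buf          # what the caller reads and writes
        self.bounced = bounced  # True if buf is a private copy


def dma_map(target, offset, length, bounce_budget):
    """Return a direct view for RAM, a bounce buffer otherwise, or None."""
    if target.is_ram:
        view = memoryview(target.mem)[offset:offset + length]
        return Mapping(target, offset, view, False)
    if length > bounce_budget:
        return None  # hypothetical: limit exceeded, caller must retry
    return Mapping(target, offset, bytearray(length), True)


def dma_unmap(mapping):
    """Flush a bounce buffer back through sized writes; RAM needs nothing."""
    if mapping.bounced:
        for i in range(0, len(mapping.buf), 4):
            chunk = bytes(mapping.buf[i:i + 4])
            mapping.target.write(mapping.offset + i, chunk)
```

In this model a RAM target never takes the bounce path: writes through the
mapping land in the backing bytes immediately. Only a non-RAM target consumes
the budget and sees its data at unmap time.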
> > > Back to our case (GH100 card), there are no quirks for the PCI BAR (0009:01:00.0 BAR 4)
> > > so it's fine to mark the RAM DEVICE region as directly accessible. We perhaps don't need
> > > the host to export the capability (VFIO_REGION_INFO_CAP_DIRECT_ACCESS) suggested by you.
> > > It's safe to mark any PCI BARs as directly accessible if they have no quirks attached. All
> > > the devices except those listed in vfio_bar_quirk_setup() are capable of this.
> >
> > I still feel like there are different kinds of PCI BAR here ("this BAR is
> > true RAM and can be accessed arbitrarily" vs "this BAR is full of registers
> > and can't be handled that way") and the vfio code in QEMU needs to set up
> > the memory regions differently for the two cases. For your example I think
> > it would be fine to have direct-access even if there were some kind of
> > quirk memory region, because for the parts of the BAR that aren't covered
> > by a quirk overlay the underlying BAR still allows "entirely like RAM,
> > any alignment and size is OK" accesses.
> >
>
> Agreed. Whether the PCI BAR is a 'real RAM block' is determined by the hardware vendor.
> I think the solution proposed by Michael is a nice one as far as this specific case is
> concerned: a flag (capability) is returned by the nvgrace_gpu_vfio_pci driver to indicate
> that the PCI BAR is a 'real RAM block' and tolerant of all kinds of memory accessors. QEMU
> marks the memory region for this PCI BAR as directly accessible. Alternatively, QEMU can
> also determine the capability from the PCI device's vendor/device/subsystem/version
> information in hw/vfio/region.c.
>

I sent that proposal before I understood the issue.

> > -- PMM
> >
>
> Thanks,
> Gavin