From: "Michael S. Tsirkin" <mst@redhat.com>
To: Peter Maydell <peter.maydell@linaro.org>
Cc: "Gavin Shan" <gshan@redhat.com>, "Peter Xu" <peterx@redhat.com>,
	"Pavel Hrdina" <phrdina@redhat.com>,
	"Daniel P. Berrangé" <berrange@redhat.com>,
	qemu-devel@nongnu.org, qemu-arm@nongnu.org, jugraham@redhat.com,
	shan.gavin@gmail.com, "Alex Williamson" <alex@shazbot.org>,
	"David Hildenbrand" <david@kernel.org>
Subject: Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
Date: Thu, 11 Jun 2026 11:05:31 -0400
Message-ID: <20260611110156-mutt-send-email-mst@kernel.org>
In-Reply-To: <CAFEAcA-8Hv0o3PibjH=8jixsHyunH2HkOAMQVqHkGr34b1Ascg@mail.gmail.com>

On Thu, Jun 11, 2026 at 03:55:12PM +0100, Peter Maydell wrote:
> On Thu, 11 Jun 2026 at 15:10, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Thu, Jun 11, 2026 at 01:48:51PM +0100, Peter Maydell wrote:
> > > On Thu, 11 Jun 2026 at 13:34, Gavin Shan <gshan@redhat.com> wrote:
> > > >
> > > > Let me try to summarize what I understood. As far as VFIO is concerned, there
> > > > are multiple memory regions for one particular PCI BAR, and they're stacked
> > > > up. The memory regions for PCI BAR#4 of the GH100 card look as below.
> > > >
> > > >    (qemu) info mtree
> > > >               :
> > > >    address-space: pci_bridge_pci_mem
> > > >      0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
> > > >        0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4          <---- (1) VFIOBAR::mr
> > > >          0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4             <---- (2) VFIOBAR::VFIORegion::mem
> > > >            0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0] <---- (3) VFIOBAR::VFIORegion::VFIOMap::mem
> > > >
> > > >    (1) Its MemoryRegionOps is NULL. No data accesses are routed to this region
> > > >    (2) The data accesses routed to this region are handled by pread() and pwrite()
> > > >    (3) The data accesses routed to this region were handled by memcpy() before
> > > >        commit 4a2e242bbb.
> > > >
> > > > Some PCI devices are known to have quirks, see vfio_bar_quirk_setup().
> > > > Accesses to part of the PCI BAR have to be emulated by extra IO regions,
> > > > as shown below for the rtl8168 PCI device, where two extra IO regions are
> > > > stacked up for the quirks.
> > > >
> > > >    address-space: pci_bridge_pci_mem
> > > >      0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
> > > >        0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4          <---- (1) VFIOBAR::mr
> > > >          0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4             <---- (2) VFIOBAR::VFIORegion::mem
> > > >            0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0] <---- (3) VFIOBAR::VFIORegion::VFIOMap::mem
> > > >            0000042000000010-0000042000000014 (prio 1, i/o): 0009:01:00.0 BAR 4 quirk[0]  <---- (4) quirk[0]
> > > >            0000042000000018-000004200000001c (prio 1, i/o): 0009:01:00.0 BAR 4 quirk[1]  <---- (5) quirk[1]
> > > >
> > > > Accesses to 0000042000000010-0000042000000014 should be routed to region (4) quirk[0]
> > > > and accesses to 0000042000000018-000004200000001c should be routed to region (5) quirk[1].
> > > > However, accesses to 0000042000000000-0000042000000020 were routed to region (3) before
> > > > commit 4a2e242bbb and the data transfer was done by memcpy(), bypassing regions (4) and
> > > > (5). That is not the expected behavior, and it is why memcpy() isn't expected on the
> > > > rtl8168's PCI BAR: the quirks get bypassed. That answers your question.
> > > >
> > > > With commit 4a2e242bbb applied, the accesses will be routed to the correct region.
> > >
> > > The way I read 4a2e242bbb's commit message, it isn't about things being routed
> > > to the wrong region. It's about the handling of areas which aren't in the small
> > > quirk regions but which are in the same 4K page as them. These have to be handled
> > > via the memory subsystem's "subpage" mechanism. This does route everything to the
> > > correct region, but if the region (3) is marked as "direct access is OK" then
> > > QEMU assumes that any kind of direct access is OK, i.e. this behaves like true RAM.
> > > It then does a memcpy access to a BAR that's really a bank of device registers,
> > > and this goes wrong.
> > >
> > > > Back to our case (GH100 card), there are no quirks for the PCI BAR (0009:01:00.0 BAR 4)
> > > > so it's fine to mark the RAM DEVICE region as directly accessible. We perhaps don't need
> > > > the host to export the capability (VFIO_REGION_INFO_CAP_DIRECT_ACCESS) suggested by you. It's
> > > > safe to mark any PCI BARs as directly accessible if they have no quirks attached. All
> > > > the devices except those listed in vfio_bar_quirk_setup() are capable of this.
> > >
> > > I still feel like there are different kinds of PCI BAR here ("this BAR is
> > > true RAM and can be accessed arbitrarily" vs "this BAR is full of registers
> > > and can't be handled that way") and the vfio code in QEMU needs to set up
> > > the memory regions differently for the two cases. For your example I think
> > > it would be fine to have direct-access even if there were some kind of
> > > quirk memory region, because for the parts of the BAR that aren't covered
> > > by a quirk overlay the underlying BAR still allows "entirely like RAM,
> > > any alignment and size is OK" accesses.
> 
> > Yea, and I feel this is the main part:
> >
> >     The assumption here is that accesses initiated by the VM are
> >     driven by a device specific driver, which knows the device
> >     capabilities.
> >
> >
> > Frankly I don't get why a big hammer of disabling direct access
> > was taken, when all we apparently need to do is to make sure
> > small aligned accesses through BAR stay aligned and same size.
> 
> If you say an MR is OK for direct access then in my view you are
> saying "any access of any kind is OK", because you're permitting
> the pointer to be directly returned as a host pointer from
> address_space_map() (at which point the caller might do anything
> with that memory). That is not OK for every BAR, only for ones where
> it's really RAM.
>

What is "OK"? If the BAR is RAM and I write into it, I will overwrite
data the guest stored there. Is that "OK"?



 
> > I guess it felt safe - a vfio specific change, and emulating device
> > accesses was assumed to be slow path, anyway.
> >
> > Except it no longer is with people wanting to do direct io
> > into device BARs.
> 
> > Isn't it basically the below?
> > At least I checked asm and it produces the correct code.
> > And then the whole pile of hacks can be reverted?
> >
> >
> > diff --git a/system/physmem.c b/system/physmem.c
> > index 7bcbf87573..aab4390d40 100644
> > --- a/system/physmem.c
> > +++ b/system/physmem.c
> > @@ -3272,7 +3272,29 @@ static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
> >          uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
> >                                                 false, true);
> >
> > -        memmove(ram_ptr, buf, *l);
> > +        switch (*l) {
> > +        case 1:
> > +            __builtin_memmove(ram_ptr, buf, 1);
> > +            break;
> > +        case 2:
> > +            __builtin_memmove(ram_ptr, buf, 2);
> > +            break;
> > +        case 4:
> > +            __builtin_memmove(ram_ptr, buf, 4);
> > +            break;
> > +        case 8:
> > +            __builtin_memmove(ram_ptr, buf, 8);
> 
> Nothing says that __builtin_memmove() is required to do only a single
> access of the right size.
> memmove() can do all kinds of things.
> For instance on aarch64 if you get into the actual glibc memmove then
> a memmove of 1 byte will actually store the same byte multiple times.
> Or an architecture might decide that it's more efficient to do an
> 8-byte write with two 4-byte writes even if there is an 8-byte access
> instruction.
> 
> If and where we need to provide guarantees about "this access will
> really definitely only do this size access and it won't break it
> apart or anything like that" then we need to either be using compiler
> atomics or else inline asm.
> 
> (I think there are places where we do need to be more careful about
> what we do with accesses to real RAM, for where we're emulating a
> device write to RAM that's updating a data structure shared with the
> guest, and things like writing multiple times can cause problems.
> https://lore.kernel.org/qemu-devel/CAFEAcA8dwHV8F48kb-013rxkG9kKcZhym9_qarKmoeUfeh0YWw@mail.gmail.com/
> is an unrelated example of that, which I haven't done detailed
> analysis of yet.)
> 
> thanks
> -- PMM

If it does not work, then QEMU is broken:


/*
 * Any compiler worth its salt will turn these memcpy into native unaligned
 * operations.  Thus we don't need to play games with packed attributes, or
 * inline byte-by-byte stores.
 * Some compilation environments (eg some fortify-source implementations)
 * may intercept memcpy() in a way that defeats the compiler optimization,
 * though, so we use __builtin_memcpy() to give ourselves the best chance
 * of good performance.
 */

static inline int lduw_he_p(const void *ptr)
{
    uint16_t r;
    __builtin_memcpy(&r, ptr, sizeof(r));
    return r;
}

static inline int ldsw_he_p(const void *ptr)
{
    int16_t r;
    __builtin_memcpy(&r, ptr, sizeof(r));
    return r;
}

static inline void stw_he_p(void *ptr, uint16_t v)
{
    __builtin_memcpy(ptr, &v, sizeof(v));
}




Somehow things have worked for years without atomics,
but if you want to fix that, be my guest.
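
For illustration only, and not part of the patch above: a minimal sketch of
what a store built on compiler atomics, as suggested in the quoted reply,
might look like. The helper name store_sized() is hypothetical and is not an
existing QEMU function. It assumes a GCC/Clang-compatible compiler (for
__atomic_store_n) and falls back to memmove() for any length or alignment
that it cannot express as one naturally aligned store.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not a QEMU API): copy len bytes from src to dst.
 * For naturally aligned 1/2/4/8-byte destinations, ask the compiler for a
 * single relaxed atomic store so it is not free to split or repeat the
 * write. Everything else goes through plain memmove().
 * Assumption: 8-byte atomics are lock-free on the host; on some 32-bit
 * hosts this may instead need libatomic.
 */
static inline void store_sized(void *dst, const void *src, size_t len)
{
    uintptr_t addr = (uintptr_t)dst;

    if (len == 1) {
        uint8_t v;
        memcpy(&v, src, sizeof(v));
        __atomic_store_n((uint8_t *)dst, v, __ATOMIC_RELAXED);
    } else if (len == 2 && (addr & 1) == 0) {
        uint16_t v;
        memcpy(&v, src, sizeof(v));
        __atomic_store_n((uint16_t *)dst, v, __ATOMIC_RELAXED);
    } else if (len == 4 && (addr & 3) == 0) {
        uint32_t v;
        memcpy(&v, src, sizeof(v));
        __atomic_store_n((uint32_t *)dst, v, __ATOMIC_RELAXED);
    } else if (len == 8 && (addr & 7) == 0) {
        uint64_t v;
        memcpy(&v, src, sizeof(v));
        __atomic_store_n((uint64_t *)dst, v, __ATOMIC_RELAXED);
    } else {
        /* Odd sizes or unaligned destinations: no single-access guarantee. */
        memmove(dst, src, len);
    }
}
```

Even this only constrains what the compiler emits. Whether a relaxed atomic
store reaches a device BAR as one bus access is still up to the host
architecture and how the region is mapped.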


-- 
MST



