From: "Michael S. Tsirkin" <mst@redhat.com>
To: Peter Maydell <peter.maydell@linaro.org>
Cc: "Gavin Shan" <gshan@redhat.com>, "Peter Xu" <peterx@redhat.com>,
	"Pavel Hrdina" <phrdina@redhat.com>,
	"Daniel P. Berrangé" <berrange@redhat.com>,
	qemu-devel@nongnu.org, qemu-arm@nongnu.org, jugraham@redhat.com,
	shan.gavin@gmail.com, "Alex Williamson" <alex@shazbot.org>,
	"David Hildenbrand" <david@kernel.org>
Subject: Re: [PATCH RFCv1] virtio: Inherit max bounce buffer size from bus parent if possible
Date: Thu, 11 Jun 2026 10:10:20 -0400
Message-ID: <20260611093049-mutt-send-email-mst@kernel.org>
In-Reply-To: <CAFEAcA8Q9qEaZS8wCpZQ-4jpAnQ9JXvnb7iSeRTdhJ9Y3BcCRg@mail.gmail.com>

On Thu, Jun 11, 2026 at 01:48:51PM +0100, Peter Maydell wrote:
> On Thu, 11 Jun 2026 at 13:34, Gavin Shan <gshan@redhat.com> wrote:
> >
> > Let me try to summarize what I understood. As far as VFIO is concerned, there
> > are multiple memory regions for one particular PCI BAR, and they're stacked
> > up. The memory regions for PCI BAR#4 of the GH100 card look as below.
> >
> >    (qemu) info mtree
> >               :
> >    address-space: pci_bridge_pci_mem
> >      0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
> >        0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4          <---- (1) VFIOBAR::mr
> >          0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4             <---- (2) VFIOBAR::VFIORegion::mem
> >            0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0] <---- (3) VFIOBAR::VFIORegion::VFIOMap::mem
> >
> >    (1) Its MemoryRegionOps is NULL. No data accesses are routed to this region
> >    (2) The data accesses routed to this region are handled by pread() and pwrite()
> >    (3) The data accesses routed to this region were handled by memcpy() before
> >        commit 4a2e242bbb.
> >
> > Some PCI devices are known to have quirks, see vfio_bar_quirk_setup().
> > Accesses to part of the PCI BAR have to be emulated by extra IO regions,
> > something like below for the rtl8168 PCI device, where two extra IO regions are
> > stacked up for the quirks.
> >
> >    address-space: pci_bridge_pci_mem
> >      0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
> >        0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4          <---- (1) VFIOBAR::mr
> >          0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4             <---- (2) VFIOBAR::VFIORegion::mem
> >            0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0] <---- (3) VFIOBAR::VFIORegion::VFIOMap::mem
> >            0000042000000010-0000042000000014 (prio 1, i/o): 0009:01:00.0 BAR 4 quirk[0]  <---- (4) quirk[0]
> >            0000042000000018-000004200000001c (prio 1, i/o): 0009:01:00.0 BAR 4 quirk[1]  <---- (5) quirk[1]
> >
> > Accesses to 0000042000000010-0000042000000014 should be routed to region (4) quirk[0]
> > and accesses to 0000042000000018-000004200000001c should be routed to region (5) quirk[1].
> > However, accesses to 0000042000000000-0000042000000020 are routed to region (3) before
> > commit 4a2e242bbb and the data transfer is done by memcpy(), bypassing regions (4) and
> > (5). That is not the expected behavior, and it is why memcpy() isn't expected on the
> > rtl8168's PCI BAR with its quirks, which answers your question.
> >
> > With commit 4a2e242bbb applied, the accesses will be routed to the correct region.
> 
> The way I read 4a2e242bbb's commit message, it isn't about things being routed
> to the wrong region. It's about the handling of areas which aren't in the small
> quirk regions but which are in the same 4K page as them. These have to be
> handled via the memory subsystem's "subpage" mechanism. This does route
> everything to the correct region, but if the region (3) is marked as "direct
> access is OK" then QEMU assumes that any kind of direct access is OK, i.e. this
> behaves like true RAM. It then does a memcpy access to a BAR that's really a
> bank of device registers, and this goes wrong.
> 
> > Back to our case (GH100 card), there are no quirks for the PCI BAR (0009:01:00.0 BAR 4)
> > so it's fine to mark the RAM DEVICE region as directly accessible. We perhaps needn't have
> > the host export the capability (VFIO_REGION_INFO_CAP_DIRECT_ACCESS) you suggested. It's
> > safe to mark any PCI BARs as directly accessible if they have no quirks attached. All
> > the devices except those listed in vfio_bar_quirk_setup() are capable of this.
> 
> I still feel like there are different kinds of PCI BAR here ("this BAR is
> true RAM and can be accessed arbitrarily" vs "this BAR is full of registers
> and can't be handled that way") and the vfio code in QEMU needs to set up
> the memory regions differently for the two cases. For your example I think
> it would be fine to have direct-access even if there were some kind of
> quirk memory region, because for the parts of the BAR that aren't covered
> by a quirk overlay the underlying BAR still allows "entirely like RAM,
> any alignment and size is OK" accesses.
> 
> -- PMM
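
The overlay layout in the quoted rtl8168 listing (prio 1 quirk regions sitting on top of the prio 0 mmap region) can be pictured with a toy resolver. This is a minimal sketch, not QEMU's memory API: `toy_region`, `toy_bar4` and `toy_resolve` are hypothetical names, the ranges are copied from the listing above with inclusive end addresses assumed, and the only rule modeled is that the highest-priority region containing an address wins. It says nothing about subpage handling or about how the access itself is performed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model only: this is not QEMU's MemoryRegion or FlatView code. */
struct toy_region {
    const char *name;
    uint64_t start;   /* first address covered */
    uint64_t end;     /* last address covered (assumed inclusive) */
    int prio;         /* higher value shadows lower value on overlap */
};

/* Ranges taken from the rtl8168 "info mtree" listing quoted above. */
static const struct toy_region toy_bar4[] = {
    { "mmaps[0]", 0x42000000000ULL, 0x4379fffffffULL, 0 },
    { "quirk[0]", 0x42000000010ULL, 0x42000000014ULL, 1 },
    { "quirk[1]", 0x42000000018ULL, 0x4200000001cULL, 1 },
};

/* Return the highest-priority region containing addr, or NULL if none does. */
static const struct toy_region *toy_resolve(const struct toy_region *r,
                                            size_t n, uint64_t addr)
{
    const struct toy_region *best = NULL;

    assert(r != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (addr < r[i].start || addr > r[i].end) {
            continue;               /* address is outside this region */
        }
        if (best == NULL || r[i].prio > best->prio) {
            best = &r[i];           /* higher priority wins */
        }
    }
    return best;
}
```

In this model an address inside a quirk window resolves to the quirk, while a neighbouring address in the same 4K page still resolves to `mmaps[0]`, which is the case Peter's reply is concerned with.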


Yea, and I feel this is the main part: 

    The assumption here is that accesses initiated by the VM are
    driven by a device specific driver, which knows the device
    capabilities.


Frankly I don't get why a big hammer of disabling direct access
was taken, when all we apparently need to do is to make sure
small aligned accesses through a BAR stay aligned and keep the same size.

I guess it felt safe - a vfio specific change, and emulating device
accesses was assumed to be slow path, anyway.

Except it no longer is with people wanting to do direct I/O
into device BARs.

Isn't it basically the below?
At least I checked asm and it produces the correct code.
And then the whole pile of hacks can be reverted?


diff --git a/system/physmem.c b/system/physmem.c
index 7bcbf87573..aab4390d40 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3272,7 +3272,29 @@ static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
         uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
                                                false, true);
 
-        memmove(ram_ptr, buf, *l);
+        switch (*l) {
+        case 1:
+            __builtin_memmove(ram_ptr, buf, 1);
+            break;
+        case 2:
+            __builtin_memmove(ram_ptr, buf, 2);
+            break;
+        case 4:
+            __builtin_memmove(ram_ptr, buf, 4);
+            break;
+        case 8:
+            __builtin_memmove(ram_ptr, buf, 8);
+            break;
+        default:
+            memmove(ram_ptr, buf, *l);
+            break;
+        }
         invalidate_and_set_dirty(mr, mr_addr, *l);
 
         return MEMTX_OK;
@@ -3365,7 +3387,29 @@ static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
         uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
                                                false, false);
 
-        memcpy(buf, ram_ptr, *l);
+        switch (*l) {
+        case 1:
+            __builtin_memcpy(buf, ram_ptr, 1);
+            break;
+        case 2:
+            __builtin_memcpy(buf, ram_ptr, 2);
+            break;
+        case 4:
+            __builtin_memcpy(buf, ram_ptr, 4);
+            break;
+        case 8:
+            __builtin_memcpy(buf, ram_ptr, 8);
+            break;
+        default:
+            memcpy(buf, ram_ptr, *l);
+            break;
+        }
 
         return MEMTX_OK;
     }
-- 
MST
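
The idea in the patch above can be tried outside QEMU with a small standalone helper. This is only a sketch under assumptions: `copy_sized` is a hypothetical name, it uses plain `memcpy()` rather than the `__builtin_` variants, and whether each fixed-size case really becomes a single load and store of that width depends on the compiler, target and optimization level. The C standard does not guarantee it, so the generated assembly has to be inspected, as was done for the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not a QEMU function.
 * Dispatch on the length so that the 1/2/4/8-byte cases call memcpy()
 * with a compile-time-constant size. Optimizing compilers commonly turn
 * such calls into one load and one store of that width; other lengths
 * fall back to an ordinary variable-length copy.
 */
static void copy_sized(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);

    switch (len) {
    case 1:
        memcpy(dst, src, 1);
        break;
    case 2:
        memcpy(dst, src, 2);
        break;
    case 4:
        memcpy(dst, src, 4);
        break;
    case 8:
        memcpy(dst, src, 8);
        break;
    default:
        memcpy(dst, src, len);
        break;
    }
}
```

Functionally this copies the same bytes as a single `memcpy(dst, src, len)`; the only intended difference is the shape of the code the compiler emits for the small power-of-two sizes.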


