Re: [PATCH v3 0/2] system/memory: Make ram device region directly accessible


From: "Michael S. Tsirkin" <mst@redhat.com>
To: Gavin Shan <gshan@redhat.com>
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, peterx@redhat.com,
	alex@shazbot.org, richard.henderson@linaro.org,
	peter.maydell@linaro.org, berrange@redhat.com,
	philmd@oss.qualcomm.com, philmd@mailo.com, david@kernel.org,
	clg@redhat.com, pbonzini@redhat.com, phrdina@redhat.com,
	jugraham@redhat.com, liugang24219@sangfor.com.cn,
	dinghui@sangfor.com.cn, shan.gavin@gmail.com
Subject: Re: [PATCH v3 0/2] system/memory: Make ram device region directly accessible
Date: Tue, 16 Jun 2026 01:44:54 -0400
Message-ID: <20260616014419-mutt-send-email-mst@kernel.org>
In-Reply-To: <ae17b5a6-eed8-4d4c-a798-cb92ffa6c915@redhat.com>

On Tue, Jun 16, 2026 at 03:40:34PM +1000, Gavin Shan wrote:
> On 6/16/26 3:25 PM, Gavin Shan wrote:
> > All ram device regions were made indirectly accessible by commit
> > 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> > to a hung guest when an NVidia GH100 GPU is passed through from the host. The
> > memory in its PCI BAR#4 can be allocated as a DMA target buffer. qemu has to
> > use the DMA bounce buffer in address_space_map() to cover the DMA request.
> > However, the bounce buffer size is 4096 bytes and we overrun it easily when
> > the guest has significant disk activity while compiling 'cuda-samples'.
> > The full log and problem description can be found in PATCH[1/2]'s commit
> > log.
> > 
> > Try to fix the issue handled in commit 4a2e242bbb by replacing memcpy()/
> > memmove() with newly added helpers qemu_ram_{copy, move}() that work on
> > top of __builtin_{memcpy, memmove} or unaligned access friendly memory
> > movement in the accessors to the ram device regions. With this, we can
> > basically revert that commit to make ram device region directly accessible
> > again and bypass the bounce buffer in address_space_map() where the guest
> > hang is caused.
> > 
> > PATCH[1] uses qemu_ram_{copy, move}() in ram device region accessors
> > PATCH[2] makes ram device region directly accessible again
> > 
> Michael asked me to include the context below in the v3 cover letter, but I
> didn't notice that before I sent the v3 series, so I'm appending it here.
> 
> ----
> 
> The issues listed by Michael:
> 
> 1. On x86, memcpy is different from __builtin_memcpy if one uses the old 1.0
>    fortify-headers from 2019. Likely no longer relevant.
> 
> 2. variable length memcpy can translate 2,4,8 byte guest access into
>    multiple byte accesses. doing this for mmio is guaranteed to break devices.
> 
> 3. (theoretical concern) also on x86, unaligned accesses are possible on guest
>    and host, so converting an unaligned access to a series of aligned ones can
>    in theory break devices.
> 
> 4. also on x86, vector instructions for large (>16 byte) writes into
>    pgprot_noncached memory are safe and faster than multiple 8 byte ones.
> 
> 5. also on x86 it so happens that if you write a fixed-size memcpy this gets
>    optimized to a single store/load and it works for aligned and unaligned
>    addresses on that architecture. How to ensure this keeps being correct
>    is left as an exercise for the reader. But qemu already relies on this
>    and did for years.
> 
> 6. on non-x86 both unaligned accesses and vector instructions for accessing
>    UC memory are illegal.
> 
> 7. standard vfio gives KVM VM_ALLOW_ANY_UNCACHED, so even on non-x86 the guest
>    can map the memory as pgprot_noncached/ioremap or pgprot_writecombine/ioremap_uc.
>    If it does the second then it can use unaligned or vector for access.
>    This is why normal passthrough tends to work - it never traps to qemu at
>    all. But for qemu, vfio uses pgprot_noncached unconditionally so qemu
>    can't use unaligned or vector instructions on non-x86.
> 
> 
> 8. But for nvgrace RAM, vfio has a driver that uses pgprot_writecombine/ioremap_uc.
>    so qemu could safely use unaligned/vector instructions even on non-x86.
> 
> 9. Except sadly, vfio currently does not tell qemu how it maps
>    the memory, so qemu can not know what is safe on non-x86.
> 
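
A minimal sketch of the fixed-size memcpy pattern from points 2 and 5, for
illustration only. The helper names sketch_store_fixed() and
sketch_load_fixed() are hypothetical and are not QEMU APIs. Each memcpy()
has a compile-time-constant length, so the compiler is free to emit one
load or store per access. As point 5 says, nothing in C guarantees that it
actually does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helpers, not QEMU APIs. Every memcpy() below has a
 * compile-time-constant length, so no variable-length copy loop is
 * involved and the destination may be unaligned.
 */
static void sketch_store_fixed(void *dst, uint64_t val, unsigned size)
{
    uint8_t v8 = (uint8_t)val;
    uint16_t v16 = (uint16_t)val;
    uint32_t v32 = (uint32_t)val;

    switch (size) {
    case 1: memcpy(dst, &v8, 1); break;
    case 2: memcpy(dst, &v16, 2); break;
    case 4: memcpy(dst, &v32, 4); break;
    case 8: memcpy(dst, &val, 8); break;
    default: assert(!"unsupported access size"); break;
    }
}

static uint64_t sketch_load_fixed(const void *src, unsigned size)
{
    uint8_t v8;
    uint16_t v16;
    uint32_t v32;
    uint64_t v64;

    switch (size) {
    case 1: memcpy(&v8, src, 1); return v8;
    case 2: memcpy(&v16, src, 2); return v16;
    case 4: memcpy(&v32, src, 4); return v32;
    case 8: memcpy(&v64, src, 8); return v64;
    default: assert(!"unsupported access size"); return 0;
    }
}
```

Calling sketch_store_fixed(p, v, 4) with an unaligned p is well defined in
C, unlike a cast of p to uint32_t *. Whether the result is a single 4-byte
store still depends on the compiler and the target.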

And more:

10. on x86 memcpy will sometimes do multiple overlapping stores when
size is not a power of 2. For example, a 15 byte write is done with
two 8-byte stores. This is theoretically an issue
if the guest does something super clever with ordering,
but it does not seem to be one in practice.

11. on non-x86 memcpy will do multiple overlapping stores even
for single byte writes. E.g. it does it to avoid extra branches.
This is causing issues in practice.
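
To make the overlapping stores described above concrete, here is a rough
sketch of the head/tail trick that optimized memcpy implementations
commonly use for 8..16 byte copies. It is not taken from any particular
libc, and sketch_copy_8_to_16() and sketch_overlap_bytes() are hypothetical
names. The C code shows the access pattern only; the compiler decides what
instructions are emitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy n bytes, 8 <= n <= 16, as two 8-byte moves with no branch on the
 * exact length: one at the start, one ending at the last byte. Both
 * loads happen before either store, as memcpy implementations do.
 */
static void sketch_copy_8_to_16(uint8_t *dst, const uint8_t *src, size_t n)
{
    uint64_t head, tail;

    assert(n >= 8 && n <= 16);
    memcpy(&head, src, 8);
    memcpy(&tail, src + n - 8, 8);
    memcpy(dst, &head, 8);          /* writes dst[0 .. 7]     */
    memcpy(dst + n - 8, &tail, 8);  /* writes dst[n-8 .. n-1] */
}

/* Number of destination bytes written by both stores. */
static size_t sketch_overlap_bytes(size_t n)
{
    assert(n >= 8 && n <= 16);
    return 16 - n;
}
```

For n = 15 the first store covers bytes 0..7 and the second covers bytes
7..14, so byte 7 is written twice. That is harmless for RAM, but a device
may observe it as two writes to the same location.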





> Now, what is to be done?
> 
> 
> A. on x86, we must avoid converting 2,4,8 byte accesses into byte accesses.
> At least for aligned, preferably for unaligned accesses too.
> Fixed width memcpy seems to work for this. Whether we should bother with
> __builtin to work around broken old fortify headers, I dunno.
> I do not have any answer how to check that compiler does this correctly.
> If anyone is motivated enough, adding a GCC builtin could be possible.
> Given qemu did this for years, I think we can leave solving this for
> another day.
> 
> B. Also on x86, I do not see why we should not use memcpy for large
> accesses if we can. Better perf.
> 
> C. on non-x86, we currently must not memcpy since we do not know if it
> is pgprot_noncached. Yes, performance will be bad for DMA into device RAM.
> 
> D. It goes without saying that casting an unaligned address to uint32_t *
> (be it for qatomic_set or whatever) is undefined behaviour in C
> and so a bad idea on any architecture.
> 
> E. also for non-x86, we really should teach vfio to tell qemu whether
> it maps device pgprot_noncached or pgprot_writecombine.
> we will then be able to use memcpy for >8 accesses.
> 
> Anyone, correct me if I'm wrong? Maybe I should start a new thread with
> this summary?
> 
> Thanks,
> Gavin
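
One way to read points A, B, C and E together is as a small decision
function. The sketch below only restates the summary; it is not code from
the series. All names (SketchMapping, SketchStrategy,
sketch_pick_strategy()) are hypothetical, and the 8-byte threshold is an
assumption taken from the ">8 accesses" wording in E. It also assumes vfio
can report the mapping type, which per point 9 it currently cannot, so on
non-x86 the caller would pass SKETCH_MAP_UNKNOWN today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* How the host mapped the device memory, if known (hypothetical). */
typedef enum {
    SKETCH_MAP_UNKNOWN,
    SKETCH_MAP_NONCACHED,     /* pgprot_noncached    */
    SKETCH_MAP_WRITECOMBINE   /* pgprot_writecombine */
} SketchMapping;

typedef enum {
    SKETCH_FIXED_WIDTH,       /* one fixed-size access (A)             */
    SKETCH_BULK_MEMCPY,       /* plain memcpy is acceptable (B, E)     */
    SKETCH_SPLIT_FIXED_WIDTH  /* loop of fixed-size accesses, slow (C) */
} SketchStrategy;

static SketchStrategy sketch_pick_strategy(bool host_is_x86,
                                           SketchMapping map, size_t len)
{
    assert(len > 0);

    if (len <= 8) {
        return SKETCH_FIXED_WIDTH;        /* A: never split small accesses */
    }
    if (host_is_x86) {
        return SKETCH_BULK_MEMCPY;        /* B: better performance on x86 */
    }
    if (map == SKETCH_MAP_WRITECOMBINE) {
        return SKETCH_BULK_MEMCPY;        /* E: safe once vfio reports it */
    }
    return SKETCH_SPLIT_FIXED_WIDTH;      /* C: unknown or noncached */
}
```

With today's vfio, sketch_pick_strategy(false, SKETCH_MAP_UNKNOWN, 4096)
yields SKETCH_SPLIT_FIXED_WIDTH, which is the slow DMA path that C accepts
for now.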


