* list of memory/memcpy access issues
@ 2026-06-17  7:14 Michael S. Tsirkin
  2026-06-19  0:44 ` Gavin Shan
  0 siblings, 1 reply; 5+ messages in thread
From: Michael S. Tsirkin @ 2026-06-17  7:14 UTC (permalink / raw)
  To: Gavin Shan
  Cc: qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	peter.maydell, berrange, philmd, philmd, david, clg, pbonzini,
	phrdina, jugraham, liugang24219, dinghui, shan.gavin

This is a top post attempting to summarize some findings related to
emulating DMA and MMIO in the QEMU memory core
using memcpy/memmove.

Hopefully, this will help inform discussion about multiple
changes currently proposed for QEMU.

At a high level, and in a variety of configurations, QEMU gets
DMA requests from a virtual device, or MMIO requests from
a VCPU, and wants to execute them either on guest ram or
passthrough device memory.

Down the road this almost always (virtio ring implementation seems to be
a notable exception) translates to memcpy/memmove calls
(glibc e.g. on x86 currently implements memcpy through memmove).

However, memcpy's signature is:
       void *memcpy(void *dest, const void *src, size_t n);
Note how neither src nor, more importantly, dest is volatile.
Thus it was never designed either for a concurrent access
by another CPU, or for accessing devices.
(Mis)using it for that gives good performance but has issues,
some of which I am trying to enumerate below.
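
To make the volatile point concrete, here is a minimal sketch (not QEMU
code) of a store helper that forces exactly one access of the requested
width. The name store_once is hypothetical, and the sketch assumes a
naturally aligned destination; on a 32-bit host the 8-byte case may still
be split by the compiler.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not QEMU code: store exactly n bytes (n = 1, 2, 4, 8)
 * through one volatile access. Returns 0 on success, -1 if n is unsupported
 * or dst is not naturally aligned for n.
 */
static int store_once(void *dst, const void *src, size_t n)
{
    if (n != 1 && n != 2 && n != 4 && n != 8) {
        return -1;
    }
    if ((uintptr_t)dst % n != 0) {
        return -1;
    }
    switch (n) {
    case 1: { uint8_t v;  memcpy(&v, src, sizeof(v)); *(volatile uint8_t *)dst = v;  break; }
    case 2: { uint16_t v; memcpy(&v, src, sizeof(v)); *(volatile uint16_t *)dst = v; break; }
    case 4: { uint32_t v; memcpy(&v, src, sizeof(v)); *(volatile uint32_t *)dst = v; break; }
    case 8: { uint64_t v; memcpy(&v, src, sizeof(v)); *(volatile uint64_t *)dst = v; break; }
    }
    return 0;
}
```

The memcpy into a local only fetches the value; the volatile store is what
keeps the compiler from splitting, merging or dropping the device access.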

In the below I say memcpy but same applies to memmove just as well.



------------------




1. On x86, memcpy is different from __builtin_memcpy if
one uses old 1.0 fortify-headers from 2019. Thus, QEMU
sometimes uses __builtin and sometimes does not, inconsistently.
Likely no longer relevant and should be cleaned up.


2. Variable-length memcpy can translate a 2, 4 or 8 byte guest access
into multiple byte accesses. Doing this for MMIO is
guaranteed to break devices.


3. (theoretical concern) also on x86, unaligned accesses are possible on guest and host,
so converting an unaligned access to a series of aligned ones can
in theory break devices.

4. also on x86, vector instructions for large (>16 byte) writes
into pgprot_noncached memory are safe and faster than multiple 8 byte
ones.

5. also on x86 it so happens that if you write a fixed-size memcpy this
gets optimized to a single store/load and it works for aligned and
unaligned addresses on that architecture. How to ensure this keeps being
correct is left as an exercise for the reader. But qemu already relies
on this and did for years.

6. on non-x86 both unaligned accesses and vector instructions
for accessing  UC memory are illegal.

7. standard vfio gives KVM VM_ALLOW_ANY_UNCACHED, so even on non-x86 the
guest can map the memory as pgprot_noncached/ioremap or
pgprot_writecombine/ioremap_wc.
If it does the latter then it can use unaligned or vector instructions for access.
This is why normal passthrough tends to work - it never traps to qemu at
all.


But for qemu, vfio uses  pgprot_noncached unconditionally so qemu
can't use unaligned or vector instructions on non-x86.


8. But for nvgrace RAM, vfio has a driver that uses pgprot_writecombine/ioremap_wc,
so qemu could safely use unaligned/vector instructions even on non-x86.

9. Except sadly, vfio currently does not tell qemu how it maps
the memory, so qemu can not know what is safe on non-x86.

10. on x86 memcpy will sometimes do multiple overlapping stores when
size is not a power of 2. for example, a 15 byte write is done with
2 8-byte stores. This is theoretically an issue
if guest does something super clever with ordering,
but does not seem to be in practice.

11. on non-x86 memcpy will do multiple overlapping stores even
for single byte writes. E.g. it does it to avoid extra branches.
This is causing issues in practice.

12. PCI writes are in order, last byte is written last.
memmove in particular sometimes writes the last byte first.
Violating that theoretically can break guests.

13. but if we are copying between 2 addresses that are overlapping,
the standard trick (used by memmove) is to compare dst and src and copy
backwards if dst < src, so last byte is written first.
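
As an illustration of what items 2 and 10-12 ask for, here is a minimal
sketch of a copy loop that writes every destination byte exactly once, in
ascending address order, using the widest naturally aligned store that
fits. The name copy_in_order is hypothetical and this is not an existing
QEMU function. It only handles non-overlapping buffers and is expected to
be much slower than memcpy for large transfers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not QEMU code: copy n bytes from src to dst.
 * Each destination byte is written once, lowest address first, with
 * naturally aligned volatile stores of 8, 4, 2 or 1 bytes.
 * src and dst must not overlap.
 */
static void copy_in_order(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n > 0) {
        size_t w = 8;

        /* Shrink the width until it fits the remaining length and alignment. */
        while (w > n || (uintptr_t)d % w != 0) {
            w >>= 1;
        }
        switch (w) {
        case 8: { uint64_t v; memcpy(&v, s, sizeof(v)); *(volatile uint64_t *)d = v; break; }
        case 4: { uint32_t v; memcpy(&v, s, sizeof(v)); *(volatile uint32_t *)d = v; break; }
        case 2: { uint16_t v; memcpy(&v, s, sizeof(v)); *(volatile uint16_t *)d = v; break; }
        default: *(volatile uint8_t *)d = *s; break;
        }
        d += w;
        s += w;
        n -= w;
    }
}
```

With this scheme a 15 byte write to an 8-byte-aligned destination becomes
stores of 8, 4, 2 and 1 bytes rather than two overlapping 8-byte stores.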

-------------



Some conclusions:

A. on x86, we must avoid converting 2,4,8 byte accesses into byte accesses.
At least for aligned, preferably for unaligned accesses too.
Fixed width memcpy seems to work for this. Whether we should bother with
__builtin to work around broken old fortify headers, I dunno.
I do not have any answer how to check that compiler does this correctly.
If anyone is motivated enough, adding a GCC builtin could be possible.
Given qemu did this for years, I think we can leave solving this for
another day.

B. Also on many architectures, memcpy is much faster for large transfers
than iterating over 8 byte chunks in C.
When we can get away with doing that (e.g. for emulated devices where
we know the concurrency rules, writing into guest RAM), we should.

C. on non-x86, we currently must not memcpy into host devices
since we do not know if it is pgprot_noncached. yes, performance will be
bad for DMA into device RAM.




D.  It goes without saying that casting an unaligned address to uint32_t *
(be it for qatomic_set or whatever) is undefined behaviour in C
and so a bad idea on any architecture.


E. also for non-x86, we really should teach vfio to tell qemu whether
it maps device pgprot_noncached or pgprot_writecombine.
we will then be able to do things like use vector ops
(through memcpy or not) for >8 accesses.

F. Arbitrary device passthrough with drivers doing unaligned accesses and
working across architectures is basically a best effort thing.  It
can't be 100% perfect for all devices.
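
For conclusions A and D, the fixed-width memcpy idiom looks roughly like
the sketch below. The helper names are hypothetical. The C standard only
guarantees that this is well defined at any alignment; that the compiler
turns it into a single load or store is an observed behaviour, not a
guarantee.

```c
#include <stdint.h>
#include <string.h>

/* Well defined at any alignment: copy the bytes into a properly typed local. */
static inline uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof(v));
    return v;
}

/* Counterpart for stores; the size is a compile-time constant. */
static inline void store_u32_unaligned(void *p, uint32_t v)
{
    memcpy(p, &v, sizeof(v));
}
```

By contrast, *(uint32_t *)p with an unaligned p is undefined behaviour even
on architectures where the hardware tolerates the access.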


--------------------

Links:


example of a fix for a bug caused by memcpy to overlapping addresses:
4a73aee881 - "softmmu: Use memmove in flatview_write_continue"
https://lore.kernel.org/qemu-devel/20230131030155.18932-1-akihiko.odaki@daynix.com


example of a bug caused by memcpy as result of DMA:
https://lore.kernel.org/qemu-devel/20260527091711.3901-1-liugang24219@sangfor.com.cn

an attempt to fix bugs caused by memcpy to device memory in response to
MMIO:
4a2e242bbb "memory: Don't use memcpy for ram_device regions"
https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg08129.html

-- 
MST




* Re: list of memory/memcpy access issues
  2026-06-17  7:14 list of memory/memcpy access issues Michael S. Tsirkin
@ 2026-06-19  0:44 ` Gavin Shan
  2026-06-19  5:33   ` Michael S. Tsirkin
  0 siblings, 1 reply; 5+ messages in thread
From: Gavin Shan @ 2026-06-19  0:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	peter.maydell, berrange, philmd, philmd, david, clg, pbonzini,
	phrdina, jugraham, liugang24219, dinghui, shan.gavin

On 6/17/26 5:14 PM, Michael S. Tsirkin wrote:
> This is a top post attempting to summarize some findings related to
> emulating DMA and MMIO existing in QEMU memory core
> using memcpy/memmove.
> 
> Hopefully, this will help inform discussion about multiple
> changes currently proposed for QEMU.
> 
> At a high level, and in a variety of configurations, QEMU gets
> DMA requests from a virtual device, or MMIO requests from
> a VCPU, and wants to execute them either on guest ram or
> passhtrough device memory.
> 
> Down the road this almost always (virtio ring implementation seems to be
> a notable exception) translates to memcpy/memmove calls
> (glibc e.g. on x86 currently implements memcpy through memmove).
> 
> However, memcpy's signature is:
>         void *memcpy(void *dest, const void *src, size_t n);
> note how neither src not more importantly dest are volatile.
> Thus it was never designed either for a concurrent access
> by another CPU, or for accessing devices.
> (Mis)using it for that gives good performance but has issues,
> some of which I am trying to enumerate below.
> 
> In the below I say memcpy but same applies to memmove just as well.
> 

Firstly, thanks to Michael for the summary, which helps to lead the discussion.

I went through the listed questions and suggestions, but I'm not sure if
I understood every one of them. I figured out that we probably
need to do something like the below. Please take a look when you get a chance to
check if there are any gaps.

S1: New field MemoryRegion::cache_mode to indicate how it has been mapped
     for those directly accessible regions: cache_{normal, writecombine, no}
     corresponding to pgprot_{normal, writecombine, noncached}.

     S1.1 MemoryRegion::cache_mode is meaningful to those directly accessible
          regions like ram, ram device and rom device regions
     S1.2 MemoryRegion::cache_mode should be set when those directly accessible
          regions are created
     S1.3 Only cache_{normal, writecombine} regions can be directly accessible

S2: qemu_ram_{copy, move}() which are our private implementations of memcpy()
     and memmove(). They're going to replace memcpy()/memmove() in the memory
     region direct access paths like flatview_{read, write}_continue_step()

     S2.1 Small fixed length (1/2/4/8 bytes) accesses shouldn't be either
          split or reordered
     S2.2 Architectural optimization based on the MemoryRegion::cache_mode,
          unaligned accesses and vector instructions may be allowed

S3: Support VFIO_REGION_INFO_CAP_MMAP_CACHE_MODE where cache_{normal,
     writecombine, no} is provided for every mmapable region or all
     sparse mmaps on the region

     S3.1 The capability is only meaningful when VFIO_REGION_INFO_FLAG_MMAP
          is absent
     S3.2 All sparse mmaps for one specific region should have unified cache
          mode
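
A minimal sketch of how S1 could be expressed, purely for discussion. The
enum and function names are hypothetical and nothing like this exists in
QEMU today; it encodes only the rule stated in S1.3 and does not capture
per-architecture restrictions on write-combine mappings.

```c
#include <stdbool.h>

/* Hypothetical mirror of the proposed MemoryRegion::cache_mode values. */
typedef enum {
    CACHE_MODE_NORMAL,        /* corresponds to pgprot_normal */
    CACHE_MODE_WRITECOMBINE,  /* corresponds to pgprot_writecombine */
    CACHE_MODE_NONE,          /* corresponds to pgprot_noncached */
} CacheMode;

/* S1.3 as stated: only normal and write-combine mappings are accessed directly. */
static bool region_directly_accessible(CacheMode mode)
{
    return mode == CACHE_MODE_NORMAL || mode == CACHE_MODE_WRITECOMBINE;
}
```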

> 
> ------------------
> 
> 1. On x86, memcpy is different from __builtin_memcpy if
> one uses old 1.0 force-headers from 2019. Thus, QEMU
> sometimes uses __builtin sometimes it does not, inconsitently.
> Likely no longer relevant and should be cleaned up.
> 

S2. old 1.0 fortify-headers won't be used with S2?

> 
> 2. variable length memcpy can translate 2,4,8 byte guest access
> into multiple byte accesses. doing this for mmio is
> guaranteed to break devices.
> 

S2.1. However, is it still a problem when the MMIO region is mapped with
pgprot_{normal, writecombine}?

> 
> 3. (theoretical concern) also on x86, unaligned accesses are possible on guest and host,
> so converting an unaligned access to a series of aligned ones can
> in theory break devices.
> 

S1.3, the directly accessible regions have attribute cache_{normal, writecombine}
where unaligned access is allowed. The question is whether unaligned access on those
regions is always safe on all architectures.

> 4. also on x86, vector instructions for large (>16 byte) writes
> into pgprot_noncached memory are safe and faster than multiple 8 byte
> ones.
> 

S1.3, region with pgprot_noncached is indirectly accessible.

> 5. also on x86 it so happens that if you write a fixed-size memcpy this
> gets optimized to a single store/load and it works for aligned and
> unaligned addresses on that architecture. How to ensure this keeps being
> correct is left as an excerise for the reader. But qemu already relies
> on this and did for years.
> 

Sorry, not fully understood. It's perhaps covered by S2 if we're talking
about address_space_{read,write}. If we're talking about address_space_{ldl, stl}(),
we perhaps need to replace __builtin_{memcpy, memmove}() with those private
functions introduced in S2.


> 6. on non-x86 both unaligned accesses and vector instructions
> for accessing  UC memory are illegal.
> 

Assuming UC is equivalent to pgprot_noncached, it's true on aarch64
at least. With S1.3 applied, this kind of region becomes indirectly accessible.

> 7. standard vfio gives KVM VM_ALLOW_ANY_UNCACHED, so even on non x86
> guest can
> map the memory as as pgprot_noncached/ioremap or pgprot_writecombine/ioremap_uc.
> If it does the second then it can use unaligned or vector for access.
> This is why normal passthrough tends to work - it never traps to qemu at
> all.
> 
> 
> But for qemu, vfio uses  pgprot_noncached unconditionally so qemu
> can't use unaligned or vector instructions on non-x86.
> 

VM_ALLOW_ANY_UNCACHED is exclusive to arm64 since commit 8c47ce3e1d2c ("KVM:
arm64: Set io memory s2 pte as normalnc for vfio pci device"). After that, all
VFIO PCI BARs have the pgprot_writecombine attribute on arm64, so unaligned or
vector accesses are safe on those BARs from the guest POV rather than the host. I may be
wrong and Alex can correct me.

S1.3 only cache_{normal, writecombine} regions can be directly accessible.
A region with pgprot_noncached attribute is indirectly accessible.

> 
> 8. But for nvgrace RAM, vfio has a driver that uses pgprot_writecombine/ioremap_uc.
> so qemu could safely use unaligned/vector instructioons even on non-x86.
> 

For my specific case related to GH100 card, Region-0/2 have pgprot_writecombine
while Region-4 has pgprot_normal attribute.

   Region 0: Memory at 44080000000 (64-bit, prefetchable) [size=16M]
   Region 2: Memory at 44000000000 (64-bit, prefetchable) [size=2G]
   Region 4: Memory at 42000000000 (64-bit, prefetchable) [size=128G]


> 9. Except sadly, vfio currently does not tell qemu how it maps
> the memory, so qemu can not know what is safe on non-x86.
> 

S3. Host VFIO driver needs ABI changes to expose the cache mode.

> 10. on x86 memcpy will sometimes do multiple overlapping stores when
> size is not a power of 2. for example, a 15 byte write is done with
> 2 8-byte stores. This is theoretically an issue
> if guest does something super clever with ordering,
> but does not seem to be in practice.
> 

S2. This should be avoided in qemu_ram_{copy, move}() which are going to replace
the standard memcpy() and memmove().


> 11. on non-x86 memcpy will do multiple overlapping stores even
> for single byte writes. E.g. it does it to avoid extra branches.
> This is causing issues in practice.
> 

S2, This should be avoided in qemu_ram_{copy, move}().


> 12. PCI writes are in order, last byte is written last.
> memmove especially writes last byte first sometimes.
> Violating that theoretically can break guests.
> 

S2, the reordering should be avoided in qemu_ram_{copy, move}(). However,
I would think this region becomes indirectly accessible with S1.3 applied?


> 13. but if we are copying between 2 addresses that are overlapping,
> the standard trick (used by memmove) is to compare dst and src and copy
> backwards if dst < src, so last byte is written first.
> 

Backwards copying happens on (dst > src) not on (dst < src). We potentially
convert this to a forwards copying by moving the data in the overlapped area
to somewhere else, and then take that as the src in the subsequent forwards
copying.
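
A minimal sketch of that idea, with hypothetical names and no claim that
this is how it should be implemented in QEMU. Only the part of the source
that the forward copy would clobber is saved in a bounce buffer, so the
buffer is as large as the overlap, which is not bounded by anything here.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Byte copy in ascending address order; volatile keeps the stores in order. */
static void copy_forward(unsigned char *d, const unsigned char *s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        ((volatile unsigned char *)d)[i] = s[i];
    }
}

/*
 * Hypothetical helper: memmove semantics, but the destination is always
 * written first byte first. Returns 0 on success, -1 if the bounce buffer
 * cannot be allocated.
 */
static int move_forward_only(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    uintptr_t da = (uintptr_t)d, sa = (uintptr_t)s;

    /* No overlap, or dst below src: a plain forward copy is already correct. */
    if (da <= sa || da - sa >= n) {
        copy_forward(d, s, n);
        return 0;
    }

    size_t gap = da - sa;    /* source bytes below dst, never clobbered early */
    size_t tail = n - gap;   /* source bytes inside the destination range */
    unsigned char *bounce = malloc(tail);

    if (!bounce) {
        return -1;
    }
    memcpy(bounce, s + gap, tail);        /* save the overlapped source area */
    copy_forward(d, s, gap);              /* safe: reads stay below dst */
    copy_forward(d + gap, bounce, tail);  /* rest comes from the saved copy */
    free(bounce);
    return 0;
}
```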

I think it's unlikely to be a directly accessible region with S1.3 applied.


> -------------
> 
> 
> 
> Some conclusions:
> 
> A. on x86, we must avoid converting 2,4,8 byte accesses into byte accesses.
> At least for aligned, perferably for unaligned accesses too.
> Fixed width memcpy seems to work for this. Whether we should bother with
> __builtin to work around broken old fortify headers, I donnu.
> I do not have any answer how to check that compiler does this correctly.
> If anyone is motivated enough, adding a GCC builtin could be possible.
> Given qemu did this for years, I think we can leave solving this for
> another day.
> 

Covered by S2.1

> B. Also on many architectures, memcpy is much faster for large transfers
> than iterating over 8 byte chunks in C.
> When we can get away with doing that (e.g. for emulated devices where
> we know the concurrency rules, writing into guest RAM), we should.
> 

S2.2, something related to performance optimization for the future

> C. on non-x86, we currently must not memcpy into host devices
> since we do not know if it is pgprot_noncached. yes, performance will be
> bad for DMA into device RAM.
> 

S1.3. This specific region becomes indirectly accessible after S1.3 is applied.

> 
> D.  It goes without saying that casting an unaligned address to unint32_t
> (be it for qatomic_set or whatever) is undefined behaviour in C
> and so a bad idea on any architecture.
> 

S1.3. The directly accessible regions always have the cache_{normal, writecombine}
attribute.

> 
> E. also for non-x86, we really should teach vfio to tell qemu whether
> it maps device pgprot_noncached or pgprot_writecombine.
> we will then be able to do things like use vector ops
> (through memcpy or not) for >8 accesses.
> 

S3. pgprot_normal is also needed. In my case related to GH100 card, the PCI
BAR-4 is mapped with pgprot_normal.

Yes, with that information fed to MemoryRegion::cache_mode in S1, it's
possible to optimize qemu_ram_{copy, move}() in S2.2 for the sake of performance.


> F. Arbitrary device passthrough with drivers doing unalined accesses and
> when working cross architectures basically is a best effort thing.  It
> can't be 100% perfect for all devices.
> 

Yes. For the first step, we perhaps need to guarantee that the directly accessible regions
have pgprot_{normal, writecombine} in S1.3 if you agree.

> 
> --------------------
> 
> Links:
> 
> 
> example of a fix for a bug caused by memcpy to overlapping addresses:
> 4a73aee881 - "softmmu: Use memmove in flatview_write_continue"
> https://lore.kernel.org/qemu-devel/20230131030155.18932-1-akihiko.odaki@daynix.com
> 
> 
> example of a bug caused by memcpy as result of DMA:
> https://lore.kernel.org/qemu-devel/20260527091711.3901-1-liugang24219@sangfor.com.cn
> 
> an attempt to fix bugs caused by memcpy to device memory in response to
> MMIO:
> 4a2e242bbb "memory: Don't use memcpy for ram_device regions"
> https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg08129.html
> 

Thanks,
Gavin




* Re: list of memory/memcpy access issues
  2026-06-19  0:44 ` Gavin Shan
@ 2026-06-19  5:33   ` Michael S. Tsirkin
  2026-06-21  6:51     ` Gavin Shan
  0 siblings, 1 reply; 5+ messages in thread
From: Michael S. Tsirkin @ 2026-06-19  5:33 UTC (permalink / raw)
  To: Gavin Shan
  Cc: qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	peter.maydell, berrange, philmd, philmd, david, clg, pbonzini,
	phrdina, jugraham, liugang24219, dinghui, shan.gavin

On Fri, Jun 19, 2026 at 10:44:17AM +1000, Gavin Shan wrote:
> On 6/17/26 5:14 PM, Michael S. Tsirkin wrote:
> > This is a top post attempting to summarize some findings related to
> > emulating DMA and MMIO existing in QEMU memory core
> > using memcpy/memmove.
> > 
> > Hopefully, this will help inform discussion about multiple
> > changes currently proposed for QEMU.
> > 
> > At a high level, and in a variety of configurations, QEMU gets
> > DMA requests from a virtual device, or MMIO requests from
> > a VCPU, and wants to execute them either on guest ram or
> > passhtrough device memory.
> > 
> > Down the road this almost always (virtio ring implementation seems to be
> > a notable exception) translates to memcpy/memmove calls
> > (glibc e.g. on x86 currently implements memcpy through memmove).
> > 
> > However, memcpy's signature is:
> >         void *memcpy(void *dest, const void *src, size_t n);
> > note how neither src not more importantly dest are volatile.
> > Thus it was never designed either for a concurrent access
> > by another CPU, or for accessing devices.
> > (Mis)using it for that gives good performance but has issues,
> > some of which I am trying to enumerate below.
> > 
> > In the below I say memcpy but same applies to memmove just as well.
> > 
> 
> Firstly, thanks to Michael for the summary and helps to lead the discussions.
> 
> I went through the listed questions and suggestions, but I'm not sure if
> I understood every question and suggestion.
>
> Figured out that we probably
> need to something as below. Please take a look when you get a chance to
> check if there are any gaps.
> 
> S1: New field MemoryRegion::cache_mode to indicate how it has been mapped
>     for those directly accessible regions: cache_{normal, writecombine, no}
>     corrresponding to pgprot_{normal, writecombine, noncached}.
> 
>     S1.1 MemoryRegion::cache_mode is meaningful to those directly accessible
>          regions like ram, ram device and rom device regions
>     S1.2 MemoryRegion::cache_mode should be set when those directly accessible
>          regions are created
>     S1.3 Only cache_{normal, writecombine} regions can be directly accessible

What is "directly accessible" here? That all memory ops thinkable work?
E.g. on power8 not all ops work on pgprot_writecombine, either (and glibc memcpy
uses these).
I am not 100% sure that's a sane userspace API. It's an internal kernel one.
If we are mirroring kernel, we need to include pgprot_device - the CDX vfio
driver uses it.

But if you want to know what memory instructions work from userspace, there is
a lot of detail and pgprot_ macros do not cover all of them.


> S2: qemu_ram_{copy, move}() which are our private implementations of memcpy()
>     and memmove(). They're going to replace memcpy/memove() in the memory
>     region directly access paths like flatview_{read, write}_continue_step()
> 
>     S2.1 Small fixed length (1/2/4/8 bytes) accesses shouldn't be either
>          split or reordered
>     S2.2 Arcitectural optimization based on the MemoryRegion::cache_mode,
>          unaligned accesses and vector instructions may be allowed

hmm. meaning what exactly?

> 
> S3: Support VFIO_REGION_INFO_CAP_MMAP_CACHE_MODE where cache_{normal,
>     writecombine, no} is provided for every mmapable region or all
>     sparse mmaps on the region

seems like you are trying to drive the cache mode from userspace?
but how will userspace know what to set?
I'd expect, instead, to just have VFIO report how it is mapped.


> 
>     S2.1 The capability is only meaningful when VFIO_REGION_INFO_FLAG_MMAP
>          is absent
>     S2.2 All sparse mmaps for one specific region should have unified cache
>          mode

you can not trust userspace to do that.

>
> > 
> > ------------------
> > 
> > 1. On x86, memcpy is different from __builtin_memcpy if
> > one uses old 1.0 force-headers from 2019. Thus, QEMU
> > sometimes uses __builtin sometimes it does not, inconsitently.
> > Likely no longer relevant and should be cleaned up.
> > 
> 
> S2. old 1.0 force-headers won't be used with S2?
> 
> > 
> > 2. variable length memcpy can translate 2,4,8 byte guest access
> > into multiple byte accesses. doing this for mmio is
> > guaranteed to break devices.
> > 
> 
> S2.1. However, is it still a problem when the MMIO region is mapped with
> pgprot_{normal, writecombine}?

MMIO as pgprot_{normal, writecombine} will break devices whatever
userspace does.


> > 
> > 3. (theoretical concern) also on x86, unaligned accesses are possible on guest and host,
> > so converting an unaligned access to a series of aligned ones can
> > in theory break devices.
> > 
> 
> S1.3, the directly accessible regions have attribute cache_{normal, writecombine}
> where unaligned access is allowed. The question is unaligned access on those
> regions are always safe on all architectures?

Define "directly accessible regions". Or better, avoid even thinking
in these terms.


> > 4. also on x86, vector instructions for large (>16 byte) writes
> > into pgprot_noncached memory are safe and faster than multiple 8 byte
> > ones.
> > 
> 
> S1.3, region with pgprot_noncached is indirectly accessible.
> 
> > 5. also on x86 it so happens that if you write a fixed-size memcpy this
> > gets optimized to a single store/load and it works for aligned and
> > unaligned addresses on that architecture. How to ensure this keeps being
> > correct is left as an excerise for the reader. But qemu already relies
> > on this and did for years.
> > 
> 
> Sorry, Not fully understood.


What is unclear? On x86, and some others, glibc will see size 1, 2, 4 and
maybe 8 on 64-bit and inline memcpy, and it happens to do exactly a single
load/store.  And code in bswap.h relies on this to mirror guest MMIO on
the host. So I am assuming that is at least not regressing too much.



> It's perhaps covered by S2 if we're talking
> about address_space_{read,write}. If we're talking about address_space_{ldl, stl}(),
> we perhaps need to replace __builtin_{memcpy, memmove}() with those private
> functions introduced in S2.

Not sure what "covered" means.


> 
> > 6. on non-x86 both unaligned accesses and vector instructions
> > for accessing  UC memory are illegal.
> > 
> 
> Assume UC is equivalent to pgprot_noncached. In that case, it's true on aarch64
> at least. With S1.3 applied, this kind of region becomes indirectly accessible.
> 
> > 7. standard vfio gives KVM VM_ALLOW_ANY_UNCACHED, so even on non x86
> > guest can
> > map the memory as as pgprot_noncached/ioremap or pgprot_writecombine/ioremap_uc.
> > If it does the second then it can use unaligned or vector for access.
> > This is why normal passthrough tends to work - it never traps to qemu at
> > all.
> > 
> > 
> > But for qemu, vfio uses  pgprot_noncached unconditionally so qemu
> > can't use unaligned or vector instructions on non-x86.
> > 
> 
> VM_ALLOW_ANY_UNCACHED is exclusively to arm64 since commit 8c47ce3e1d2c ("KVM:
> arm64: Set io memory s2 pte as normalnc for vfio pci device). After that, all
> VFIO PCI BARs have pgprot_writecombine attribute on arm64, thus unaligned or
> vector accesses are safe on those BARs from guest POV instead of host.

But not e.g. on power8, sadly.

> I maybe
> wrong and Alex can correct me.
> 
> S1.3 only cache_{normal, writecombine} regions can be directly accessible.
> A region with pgprot_noncached attribute is indirectly accessible.
> 
> > 
> > 8. But for nvgrace RAM, vfio has a driver that uses pgprot_writecombine/ioremap_uc.
> > so qemu could safely use unaligned/vector instructioons even on non-x86.
> > 
> 
> For my specific case related to GH100 card, Region-0/2 have pgprot_writecombine
> while Region-4 has pgprot_normal attribute.
> 
>   Region 0: Memory at 44080000000 (64-bit, prefetchable) [size=16M]
>   Region 2: Memory at 44000000000 (64-bit, prefetchable) [size=2G]
>   Region 4: Memory at 42000000000 (64-bit, prefetchable) [size=128G]
> 
> 
> > 9. Except sadly, vfio currently does not tell qemu how it maps
> > the memory, so qemu can not know what is safe on non-x86.
> > 
> 
> S3. Host VFIO driver needs ABI changes to expose the cache mode.
> 
> > 10. on x86 memcpy will sometimes do multiple overlapping stores when
> > size is not a power of 2. for example, a 15 byte write is done with
> > 2 8-byte stores. This is theoretically an issue
> > if guest does something super clever with ordering,
> > but does not seem to be in practice.
> > 
> 
> S2. This should be avoided in qemu_ram_{copy, move}() which are going to replace
> the standard memcpy() and memmove().
> 
> 
> > 11. on non-x86 memcpy will do multiple overlapping stores even
> > for single byte writes. E.g. it does it to avoid extra branches.
> > This is causing issues in practice.
> > 
> 
> S2, This should be avoided in qemu_ram_{copy, move}().
> 
> 
> > 12. PCI writes are in order, last byte is written last.
> > memmove especially writes last byte first sometimes.
> > Violating that theoretically can break guests.
> > 
> 
> S2, the reordering should be avoided in qemu_ram_{copy, move}(). However,
> I would think this region becomes indirectly accessible with S1.3 applied?
> 
> 
> > 13. but if we are copying between 2 addresses that are overlapping,
> > the standard trick (used by memmove) is to compare dst and src and copy
> > backwards if dst < src, so last byte is written first.
> > 
> 
> Backwards copying happens on (dst > src) not on (dst < src). We potentially
> convert this to a forwards copying by moving the data in the overlapped area
> to somewhere else, and then take that as the src in the subsequent forwards
> copying.

Not that simple. Issue is, the size of the overlap is not really limited.
Maybe make last X bytes go through the buffer, the rest copy backwards and hope for the best?


> I think it's unliekly to be a directly accessible region with S1.3 applied.

No, this does not have much to do with how the region is mapped.
If guest or device write bytes 1 to X in order and you decide to
write them X to 1, you have broken some drivers, unless you know
exactly how the device and driver are supposed to work.


> 
> > -------------
> > 
> > 
> > 
> > Some conclusions:
> > 
> > A. on x86, we must avoid converting 2,4,8 byte accesses into byte accesses.
> > At least for aligned, perferably for unaligned accesses too.
> > Fixed width memcpy seems to work for this. Whether we should bother with
> > __builtin to work around broken old fortify headers, I donnu.
> > I do not have any answer how to check that compiler does this correctly.
> > If anyone is motivated enough, adding a GCC builtin could be possible.
> > Given qemu did this for years, I think we can leave solving this for
> > another day.
> > 
> 
> Covered by S2.1
> 
> > B. Also on many architectures, memcpy is much faster for large transfers
> > than iterating over 8 byte chunks in C.
> > When we can get away with doing that (e.g. for emulated devices where
> > we know the concurrency rules, writing into guest RAM), we should.
> > 
> 
> S2.2, something related to performance optimization for the future
> 
> > C. on non-x86, we currently must not memcpy into host devices
> > since we do not know if it is pgprot_noncached. yes, performance will be
> > bad for DMA into device RAM.
> > 
> 
> S1.3. This specific region becomes indirectly accessible after S1.3 is applied.
> 
> > 
> > D.  It goes without saying that casting an unaligned address to unint32_t
> > (be it for qatomic_set or whatever) is undefined behaviour in C
> > and so a bad idea on any architecture.
> > 
> 
> S1.3. The directly accessible region always have cache_{normal, writecombine}
> attribute.
> 
> > 
> > E. also for non-x86, we really should teach vfio to tell qemu whether
> > it maps device pgprot_noncached or pgprot_writecombine.
> > we will then be able to do things like use vector ops
> > (through memcpy or not) for >8 accesses.
> > 
> 
> S3. pgprot_normal is also needed. In my case related to GH100 card, the PCI
> BAR-4 is mapped with pgprot_normal.
> 
> Yes, with those information fed to MemoryRegion::cache_mode in S1, it's
> possible to optimize qemu_ram_{copy, move}() in S2.2 for the performance sake.
> 
> 
> > F. Arbitrary device passthrough with drivers doing unalined accesses and
> > when working cross architectures basically is a best effort thing.  It
> > can't be 100% perfect for all devices.
> > 
> 
> Yes. For the first step, we perhaps need to gurantee the directly accessible region
> have pgprot_{normal, writecombine} in S1.3 if you agree.

I do not know what "directly accessible" is and I feel we should get
out of the habit of thinking in these terms.
VFIO likely DTRT mapping already, and userspace really has no
business overriding it.


> > 
> > --------------------
> > 
> > Links:
> > 
> > 
> > example of a fix for a bug caused by memcpy to overlapping addresses:
> > 4a73aee881 - "softmmu: Use memmove in flatview_write_continue"
> > https://lore.kernel.org/qemu-devel/20230131030155.18932-1-akihiko.odaki@daynix.com
> > 
> > 
> > example of a bug caused by memcpy as result of DMA:
> > https://lore.kernel.org/qemu-devel/20260527091711.3901-1-liugang24219@sangfor.com.cn
> > 
> > an attempt to fix bugs caused by memcpy to device memory in response to
> > MMIO:
> > 4a2e242bbb "memory: Don't use memcpy for ram_device regions"
> > https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg08129.html
> > 
> 
> Thanks,
> Gavin




* Re: list of memory/memcpy access issues
  2026-06-19  5:33   ` Michael S. Tsirkin
@ 2026-06-21  6:51     ` Gavin Shan
  2026-06-21 12:52       ` Michael S. Tsirkin
  0 siblings, 1 reply; 5+ messages in thread
From: Gavin Shan @ 2026-06-21  6:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	peter.maydell, berrange, philmd, philmd, david, clg, pbonzini,
	phrdina, jugraham, liugang24219, dinghui, shan.gavin

On 6/19/26 3:33 PM, Michael S. Tsirkin wrote:
> On Fri, Jun 19, 2026 at 10:44:17AM +1000, Gavin Shan wrote:
>> On 6/17/26 5:14 PM, Michael S. Tsirkin wrote:
>>> This is a top post attempting to summarize some findings related to
>>> emulating DMA and MMIO existing in QEMU memory core
>>> using memcpy/memmove.
>>>
>>> Hopefully, this will help inform discussion about multiple
>>> changes currently proposed for QEMU.
>>>
>>> At a high level, and in a variety of configurations, QEMU gets
>>> DMA requests from a virtual device, or MMIO requests from
>>> a VCPU, and wants to execute them either on guest ram or
>>> passhtrough device memory.
>>>
>>> Down the road this almost always (virtio ring implementation seems to be
>>> a notable exception) translates to memcpy/memmove calls
>>> (glibc e.g. on x86 currently implements memcpy through memmove).
>>>
>>> However, memcpy's signature is:
>>>          void *memcpy(void *dest, const void *src, size_t n);
>>> note how neither src not more importantly dest are volatile.
>>> Thus it was never designed either for a concurrent access
>>> by another CPU, or for accessing devices.
>>> (Mis)using it for that gives good performance but has issues,
>>> some of which I am trying to enumerate below.
>>>
>>> In the below I say memcpy but same applies to memmove just as well.
>>>
>>
>> Firstly, thanks to Michael for the summary and helps to lead the discussions.
>>
>> I went through the listed questions and suggestions, but I'm not sure if
>> I understood every question and suggestion.
>>
>> Figured out that we probably
>> need to something as below. Please take a look when you get a chance to
>> check if there are any gaps.
>>
>> S1: New field MemoryRegion::cache_mode to indicate how it has been mapped
>>      for those directly accessible regions: cache_{normal, writecombine, no}
>>      corrresponding to pgprot_{normal, writecombine, noncached}.
>>
>>      S1.1 MemoryRegion::cache_mode is meaningful to those directly accessible
>>           regions like ram, ram device and rom device regions
>>      S1.2 MemoryRegion::cache_mode should be set when those directly accessible
>>           regions are created
>>      S1.3 Only cache_{normal, writecombine} regions can be directly accessible
> 
> What is "directly accessible" here? That all memory ops thinkable work?
> E.g. on power8 not all ops work on pgprot_writecombine, either (and glibc memcopy
> uses these).
> I am not 100% sure that's a sane userspace API. It's an internal kernel one.
> If we are mirroring kernel, we need to include pgprot_device - the CDX vfio
> driver uses it.
> 
> But if you want to know what memory instructions work from userspace, there is
> a lot of detail and pgprot_ macros do not cover all of them.
> 

A region is 'directly accessible' when memory_access_is_direct() returns true
for it. The accesses to the directly accessible regions are turned into memcpy()
and memmove() in flatview_{read, write}_continue_step(), or {ldm, stm}_p() in
address_space_{ldm, stm}_internal(). Currently, the mmapable VFIO PCI BARs are
exposed as ram device regions, which are indirectly accessible. One of our goals
is to make part of the mappable VFIO PCI BARs (not all of them) directly accessible,
so that the DMA bounce buffer is bypassed when the DMA target buffer resides in
the BAR (region).

Yes, I don't know whether pgprot_xxx is adequate or not; it's just a thought. We
probably need more information, like the combination (host_arch, guest_arch, pgprot_xxx).
The idea is to feed more information to MemoryRegion so that our private access
functions, which S2 introduces to replace the standard memcpy() and memmove(),
know which instructions are safe to use, whether vector instructions can be
used, and so on.

I don't think we can do everything in one shot. Initially, we probably just provide
a sustainable design (or infrastructure) that can evolve over the long term. From
there, we can extend it to other architectures and cases step by step.

     #if defined(__x86_64__) || defined(__aarch64__)
     #define RAM_DEVICE_REGION_CAN_BE_DIRECTLY_ACCESSIBLE  1
     #define USE_OUR_OWN_MEMCPY_AND_MEMMOVE                1
     #else
     #define RAM_DEVICE_REGION_CAN_BE_DIRECTLY_ACCESSIBLE  0
     #define USE_OUR_OWN_MEMCPY_AND_MEMMOVE                0
     #endif

We can also tighten S1.3 so that only cache_normal MMIO regions can be
directly accessible.

     S1.3 Only cache_normal regions can be directly accessible. The question
          is whether a cache_normal MMIO region tolerates unaligned and vectored
          accesses on all architectures.

> 
>> S2: qemu_ram_{copy, move}() which are our private implementations of memcpy()
>>      and memmove(). They're going to replace memcpy/memove() in the memory
>>      region directly access paths like flatview_{read, write}_continue_step()
>>
>>      S2.1 Small fixed length (1/2/4/8 bytes) accesses shouldn't be either
>>           split or reordered
>>      S2.2 Arcitectural optimization based on the MemoryRegion::cache_mode,
>>           unaligned accesses and vector instructions may be allowed
> 
> hmm. meaning what exactly?
> 

glibc's memcpy() and memmove() aren't reliable for this purpose. There are several
related bugs, as you listed. For [1], where a one-byte store is translated into three
stores to the same location, it seems we have to bypass glibc's memcpy(), at least in
some cases? If so, we need our own (well-behaved) memcpy/memmove(), and
qemu_ram_{copy, move}() are our own implementations to replace memcpy/memmove()
in the direct access paths.

     [1] example of a bug caused by memcpy as result of DMA
         https://lore.kernel.org/qemu-devel/20260527091711.3901-1-liugang24219@sangfor.com.cn
     [2] an attempt to fix bugs caused by memcpy to device memory in response to
         MMIO. 4a2e242bbb "memory: Don't use memcpy for ram_device regions"
         https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg08129.html
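
As an illustration only, a minimal sketch of what a well-behaved copy routine
in the spirit of S2.1 could look like. The name sketch_ram_copy() is
hypothetical and is not existing QEMU code. The sketch assumes a 64-bit host,
a build with strict-aliasing optimizations disabled, and non-overlapping
buffers. It keeps naturally aligned 2/4/8 byte accesses as a single load and
store, and otherwise writes each destination byte exactly once in ascending
order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the proposed qemu_ram_copy(); not existing QEMU
 * code. Assumes strict-aliasing optimizations are disabled for the build.
 */
static inline void sketch_ram_copy(void *dst, const void *src, size_t len)
{
    uintptr_t bits = (uintptr_t)dst | (uintptr_t)src;
    volatile uint8_t *d = dst;
    const volatile uint8_t *s = src;

    assert(dst != NULL && src != NULL);

    /* Naturally aligned 2/4/8 byte accesses: one load and one store. */
    if (len == 2 && (bits & 1) == 0) {
        *(volatile uint16_t *)dst = *(const volatile uint16_t *)src;
        return;
    }
    if (len == 4 && (bits & 3) == 0) {
        *(volatile uint32_t *)dst = *(const volatile uint32_t *)src;
        return;
    }
    if (len == 8 && (bits & 7) == 0) {
        *(volatile uint64_t *)dst = *(const volatile uint64_t *)src;
        return;
    }

    /* Everything else: ascending byte stores, each byte written once. */
    for (size_t i = 0; i < len; i++) {
        d[i] = s[i];
    }
}
```

The byte loop is deliberately slow. Any faster path (wider chunks, vector
instructions) would have to be justified per architecture and per mapping
type, which is what S2.2 is about.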

>>
>> S3: Support VFIO_REGION_INFO_CAP_MMAP_CACHE_MODE where cache_{normal,
>>      writecombine, no} is provided for every mmapable region or all
>>      sparse mmaps on the region
> 
> seems like you are trying to drive the cache mode from userspace?
> but how will userspace know what to set?
> I'd expect, instead, to just have VFIO report how it is mapped.
> 

No, the capability should be reported by the host's VFIO PCI driver. For my
GH100-specific case, it's the nvgrace_gpu_vfio_pci driver where this capability is reported.

  
>>
>>      S2.1 The capability is only meaningful when VFIO_REGION_INFO_FLAG_MMAP
>>           is absent
>>      S2.2 All sparse mmaps for one specific region should have unified cache
>>           mode
> 
> you can not trust userspace to do that.
> 

See above explanation. We don't trust userspace to do it. Instead, the host's
VFIO PCI driver needs to report it.

>>
>>>
>>> ------------------
>>>
>>> 1. On x86, memcpy is different from __builtin_memcpy if
>>> one uses old 1.0 force-headers from 2019. Thus, QEMU
>>> sometimes uses __builtin sometimes it does not, inconsitently.
>>> Likely no longer relevant and should be cleaned up.
>>>
>>
>> S2. old 1.0 force-headers won't be used with S2?
>>
>>>
>>> 2. variable length memcpy can translate 2,4,8 byte guest access
>>> into multiple byte accesses. doing this for mmio is
>>> guaranteed to break devices.
>>>
>>
>> S2.1. However, is it still a problem when the MMIO region is mapped with
>> pgprot_{normal, writecombine}?
> 
> MMIO as pgprot_{normal, writecombine} will break devices whatever
> userspace does.
> 

If we introduce something similar to the Linux kernel's readl()/writel()
and use it to access MMIO regions, should that be fine then?


> 
>>>
>>> 3. (theoretical concern) also on x86, unaligned accesses are possible on guest and host,
>>> so converting an unaligned access to a series of aligned ones can
>>> in theory break devices.
>>>
>>
>> S1.3, the directly accessible regions have attribute cache_{normal, writecombine}
>> where unaligned access is allowed. The question is unaligned access on those
>> regions are always safe on all architectures?
> 
> Define "directly accessible regions". Or better, avoid even thinking
> in these terms.
> 

Please see my explanation above.

> 
>>> 4. also on x86, vector instructions for large (>16 byte) writes
>>> into pgprot_noncached memory are safe and faster than multiple 8 byte
>>> ones.
>>>
>>
>> S1.3, region with pgprot_noncached is indirectly accessible.
>>
>>> 5. also on x86 it so happens that if you write a fixed-size memcpy this
>>> gets optimized to a single store/load and it works for aligned and
>>> unaligned addresses on that architecture. How to ensure this keeps being
>>> correct is left as an excerise for the reader. But qemu already relies
>>> on this and did for years.
>>>
>>
>> Sorry, Not fully understood.
> 
> 
> what is unclear? on x86, and some others, glibc will see size 1,2,4 and
> maybe 8 of 64 and inline memcpy and it happens to do exactly a single
> load/store.  and code in bswap.h relies on this to mirror guest MMIO on
> the host. So assuming that is it least not regressing too much.
> 

Ok. So you're saying that __builtin_{memcpy, memmove}() aren't safe for MMIO
accesses? All the functions in bswap.h, based on __builtin_{memcpy, memmove}(),
are only safe for RAM accesses, but unsafe for MMIO accesses?

Currently, ram_device_mem_ops::memory_region_ram_device_write() runs into
stn_he_p() and then __builtin_memcpy(), which aren't safe for accessing VFIO PCI
BARs. This path wasn't considered in the proposed design and needs to be
covered in the revised design. For this, I'm going to add the following
context to the design.

S1: A new set of functions added to include/qemu/io.h, similar to
     {read, write}{b, w, l, q}() in linux/include/asm/io.h, to access MMIO regions

     S1.1 Those new functions will be used to access MMIO regions
          S1.1.1 Direct access paths in address_space_{ldm, stm}_internal(),
                 to replace {ldm, stm}_p().
          S1.1.2 Direct access paths in flatview_{read, write}_continue_step(),
                 to replace memcpy() and memmove().
          S1.1.3 Indirect access paths where MemoryRegionOps is invoked, to
                 replace {ldn, stn}_he_p() or their RAM access variants if the
                 region is an MMIO region.
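
A minimal sketch of what such accessors might look like, for illustration
only. All names (SKETCH_MMIO_ACCESSORS, sketch_mmio_read*/write*) are
hypothetical and nothing here is an existing QEMU API. The sketch only covers
the "one access of exactly this width" property through volatile pointers. It
assumes naturally aligned addresses and leaves out the byte-order handling and
memory barriers that the kernel's readl()/writel() also provide.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical accessor generator: each function performs a single volatile
 * access of the given width on a naturally aligned address.
 */
#define SKETCH_MMIO_ACCESSORS(sfx, type)                                   \
    static inline type sketch_mmio_read##sfx(const volatile void *addr)    \
    {                                                                      \
        assert(((uintptr_t)addr & (sizeof(type) - 1)) == 0);               \
        return *(const volatile type *)addr;                               \
    }                                                                      \
    static inline void sketch_mmio_write##sfx(volatile void *addr, type v) \
    {                                                                      \
        assert(((uintptr_t)addr & (sizeof(type) - 1)) == 0);               \
        *(volatile type *)addr = v;                                        \
    }

SKETCH_MMIO_ACCESSORS(b, uint8_t)   /* sketch_mmio_readb / writeb */
SKETCH_MMIO_ACCESSORS(w, uint16_t)  /* sketch_mmio_readw / writew */
SKETCH_MMIO_ACCESSORS(l, uint32_t)  /* sketch_mmio_readl / writel */
SKETCH_MMIO_ACCESSORS(q, uint64_t)  /* sketch_mmio_readq / writeq */
```

Whether a volatile access is enough on a given host, and what to do for
unaligned guest accesses, is left open here.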

> 
> 
>> It's perhaps covered by S2 if we're talking
>> about address_space_{read,write}. If we're talking about address_space_{ldl, stl}(),
>> we perhaps need to replace __builtin_{memcpy, memmove}() with those private
>> functions introduced in S2.
> 
> Not sure what "covered" means.
> 

Ok, it's not important now since the paths invoked by those functions in bswap.h,
which target an MMIO region, aren't considered in the proposed design.

> 
>>
>>> 6. on non-x86 both unaligned accesses and vector instructions
>>> for accessing  UC memory are illegal.
>>>
>>
>> Assume UC is equivalent to pgprot_noncached. In that case, it's true on aarch64
>> at least. With S1.3 applied, this kind of region becomes indirectly accessible.
>>
>>> 7. standard vfio gives KVM VM_ALLOW_ANY_UNCACHED, so even on non x86
>>> guest can
>>> map the memory as as pgprot_noncached/ioremap or pgprot_writecombine/ioremap_uc.
>>> If it does the second then it can use unaligned or vector for access.
>>> This is why normal passthrough tends to work - it never traps to qemu at
>>> all.
>>>
>>>
>>> But for qemu, vfio uses  pgprot_noncached unconditionally so qemu
>>> can't use unaligned or vector instructions on non-x86.
>>>
>>
>> VM_ALLOW_ANY_UNCACHED is exclusively to arm64 since commit 8c47ce3e1d2c ("KVM:
>> arm64: Set io memory s2 pte as normalnc for vfio pci device). After that, all
>> VFIO PCI BARs have pgprot_writecombine attribute on arm64, thus unaligned or
>> vector accesses are safe on those BARs from guest POV instead of host.
> 
> But not e.g. on power8, sadly.
> 

Ok. With the tighter S1.3 I mentioned above, where only cache_normal regions
can be directly accessible, I guess power8 will be happy with unaligned and
vector accesses?

     S1.3 Only cache_normal regions can be directly accessible. The question
          is whether a cache_normal MMIO region tolerates unaligned and vectored
          accesses on all architectures.

>> I maybe
>> wrong and Alex can correct me.
>>
>> S1.3 only cache_{normal, writecombine} regions can be directly accessible.
>> A region with pgprot_noncached attribute is indirectly accessible.
>>
>>>
>>> 8. But for nvgrace RAM, vfio has a driver that uses pgprot_writecombine/ioremap_uc.
>>> so qemu could safely use unaligned/vector instructioons even on non-x86.
>>>
>>
>> For my specific case related to GH100 card, Region-0/2 have pgprot_writecombine
>> while Region-4 has pgprot_normal attribute.
>>
>>    Region 0: Memory at 44080000000 (64-bit, prefetchable) [size=16M]
>>    Region 2: Memory at 44000000000 (64-bit, prefetchable) [size=2G]
>>    Region 4: Memory at 42000000000 (64-bit, prefetchable) [size=128G]
>>
>>
>>> 9. Except sadly, vfio currently does not tell qemu how it maps
>>> the memory, so qemu can not know what is safe on non-x86.
>>>
>>
>> S3. Host VFIO driver needs ABI changes to expose the cache mode.
>>
>>> 10. on x86 memcpy will sometimes do multiple overlapping stores when
>>> size is not a power of 2. for example, a 15 byte write is done with
>>> 2 8-byte stores. This is theoretically an issue
>>> if guest does something super clever with ordering,
>>> but does not seem to be in practice.
>>>
>>
>> S2. This should be avoided in qemu_ram_{copy, move}() which are going to replace
>> the standard memcpy() and memmove().
>>
>>
>>> 11. on non-x86 memcpy will do multiple overlapping stores even
>>> for single byte writes. E.g. it does it to avoid extra branches.
>>> This is causing issues in practice.
>>>
>>
>> S2, This should be avoided in qemu_ram_{copy, move}().
>>
>>
>>> 12. PCI writes are in order, last byte is written last.
>>> memmove especially writes last byte first sometimes.
>>> Violating that theoretically can break guests.
>>>
>>
>> S2, the reordering should be avoided in qemu_ram_{copy, move}(). However,
>> I would think this region becomes indirectly accessible with S1.3 applied?
>>
>>
>>> 13. but if we are copying between 2 addresses that are overlapping,
>>> the standard trick (used by memmove) is to compare dst and src and copy
>>> backwards if dst < src, so last byte is written first.
>>>
>>
>> Backwards copying happens on (dst > src) not on (dst < src). We potentially
>> convert this to a forwards copying by moving the data in the overlapped area
>> to somewhere else, and then take that as the src in the subsequent forwards
>> copying.
> 
> Not that simple. Issue is, the size of the overlap is not really limited.
> Maybe make last X bytes go through the buffer, the rest copy backwards and hope for the best?
> 

I guess it would work with luck. Alternatively, it can be converted into two
forward copies: the source buffer is split into two parts, and the second
part is copied before the first.
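
A minimal sketch of this forward-only idea, for illustration. The helper name
sketch_move_up_forward() is hypothetical. Two parts are only enough while the
second part is no longer than the distance between dst and src, so the sketch
generalizes to as many chunks as needed: every chunk is at most (dst - src)
bytes, chunks are taken from the end of the buffer towards the start, and each
chunk is copied in ascending order. Bytes are then written forwards within a
chunk, but the chunks themselves still go last to first, so this does not
restore a strict first-byte-to-last-byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: move len bytes to an overlapping destination located
 * above the source (dst > src) using only ascending (forward) copies.
 */
static void sketch_move_up_forward(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t gap, remaining = len;

    assert(dst > src);
    gap = (size_t)(dst - src);          /* distance between the two buffers */

    while (remaining > 0) {
        /* A chunk no longer than the gap never overlaps its own target. */
        size_t chunk = remaining < gap ? remaining : gap;
        size_t start = remaining - chunk;

        for (size_t i = 0; i < chunk; i++) {
            dst[start + i] = src[start + i];
        }
        remaining = start;              /* continue with the preceding chunk */
    }
}
```

The result matches memmove() for the same arguments. Only the order in which
the destination bytes are written differs.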

> 
>> I think it's unliekly to be a directly accessible region with S1.3 applied.
> 
> No, this does not have much to do with how the region is mapped.
> If guest or device write bytes 1 to X in order and you decide to
> write them X to 1, you have broken some drivers, unless you know
> exactly how the device and driver are supposed to work.
> 

Ideally, we should disallow data movement between two overlapping MMIO regions.
In Linux kernel APIs such as readl()/writel(), one of the operands always resides
in RAM, so the source and destination never overlap.

> 
>>
>>> -------------
>>>
>>>
>>>
>>> Some conclusions:
>>>
>>> A. on x86, we must avoid converting 2,4,8 byte accesses into byte accesses.
>>> At least for aligned, perferably for unaligned accesses too.
>>> Fixed width memcpy seems to work for this. Whether we should bother with
>>> __builtin to work around broken old fortify headers, I donnu.
>>> I do not have any answer how to check that compiler does this correctly.
>>> If anyone is motivated enough, adding a GCC builtin could be possible.
>>> Given qemu did this for years, I think we can leave solving this for
>>> another day.
>>>
>>
>> Covered by S2.1
>>
>>> B. Also on many architectures, memcpy is much faster for large transfers
>>> than iterating over 8 byte chunks in C.
>>> When we can get away with doing that (e.g. for emulated devices where
>>> we know the concurrency rules, writing into guest RAM), we should.
>>>
>>
>> S2.2, something related to performance optimization for the future
>>
>>> C. on non-x86, we currently must not memcpy into host devices
>>> since we do not know if it is pgprot_noncached. yes, performance will be
>>> bad for DMA into device RAM.
>>>
>>
>> S1.3. This specific region becomes indirectly accessible after S1.3 is applied.
>>
>>>
>>> D.  It goes without saying that casting an unaligned address to unint32_t
>>> (be it for qatomic_set or whatever) is undefined behaviour in C
>>> and so a bad idea on any architecture.
>>>
>>
>> S1.3. The directly accessible region always have cache_{normal, writecombine}
>> attribute.
>>
>>>
>>> E. also for non-x86, we really should teach vfio to tell qemu whether
>>> it maps device pgprot_noncached or pgprot_writecombine.
>>> we will then be able to do things like use vector ops
>>> (through memcpy or not) for >8 accesses.
>>>
>>
>> S3. pgprot_normal is also needed. In my case related to GH100 card, the PCI
>> BAR-4 is mapped with pgprot_normal.
>>
>> Yes, with those information fed to MemoryRegion::cache_mode in S1, it's
>> possible to optimize qemu_ram_{copy, move}() in S2.2 for the performance sake.
>>
>>
>>> F. Arbitrary device passthrough with drivers doing unalined accesses and
>>> when working cross architectures basically is a best effort thing.  It
>>> can't be 100% perfect for all devices.
>>>
>>
>> Yes. For the first step, we perhaps need to gurantee the directly accessible region
>> have pgprot_{normal, writecombine} in S1.3 if you agree.
> 
> I do not know what "directly accessible" is and I feel we should get
> out of the habit of thinking in these terms.
> VFIO likely DTRT mapping already, and userspace really has no
> business overriding it.
> 

Sorry that I didn't explain 'directly accessible' earlier. It is now explained at
the beginning of this reply.

> 
>>>
>>> --------------------
>>>
>>> Links:
>>>
>>>
>>> example of a fix for a bug caused by memcpy to overlapping addresses:
>>> 4a73aee881 - "softmmu: Use memmove in flatview_write_continue"
>>> https://lore.kernel.org/qemu-devel/20230131030155.18932-1-akihiko.odaki@daynix.com
>>>
>>>
>>> example of a bug caused by memcpy as result of DMA:
>>> https://lore.kernel.org/qemu-devel/20260527091711.3901-1-liugang24219@sangfor.com.cn
>>>
>>> an attempt to fix bugs caused by memcpy to device memory in response to
>>> MMIO:
>>> 4a2e242bbb "memory: Don't use memcpy for ram_device regions"
>>> https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg08129.html
>>>
>>

Thanks,
Gavin




* Re: list of memory/memcpy access issues
  2026-06-21  6:51     ` Gavin Shan
@ 2026-06-21 12:52       ` Michael S. Tsirkin
  0 siblings, 0 replies; 5+ messages in thread
From: Michael S. Tsirkin @ 2026-06-21 12:52 UTC (permalink / raw)
  To: Gavin Shan
  Cc: qemu-arm, qemu-devel, peterx, alex, richard.henderson,
	peter.maydell, berrange, philmd, philmd, david, clg, pbonzini,
	phrdina, jugraham, liugang24219, dinghui, shan.gavin

On Sun, Jun 21, 2026 at 04:51:55PM +1000, Gavin Shan wrote:
> On 6/19/26 3:33 PM, Michael S. Tsirkin wrote:
> > On Fri, Jun 19, 2026 at 10:44:17AM +1000, Gavin Shan wrote:
> > > On 6/17/26 5:14 PM, Michael S. Tsirkin wrote:
> > > > This is a top post attempting to summarize some findings related to
> > > > emulating DMA and MMIO existing in QEMU memory core
> > > > using memcpy/memmove.
> > > > 
> > > > Hopefully, this will help inform discussion about multiple
> > > > changes currently proposed for QEMU.
> > > > 
> > > > At a high level, and in a variety of configurations, QEMU gets
> > > > DMA requests from a virtual device, or MMIO requests from
> > > > a VCPU, and wants to execute them either on guest ram or
> > > > passhtrough device memory.
> > > > 
> > > > Down the road this almost always (virtio ring implementation seems to be
> > > > a notable exception) translates to memcpy/memmove calls
> > > > (glibc e.g. on x86 currently implements memcpy through memmove).
> > > > 
> > > > However, memcpy's signature is:
> > > >          void *memcpy(void *dest, const void *src, size_t n);
> > > > note how neither src not more importantly dest are volatile.
> > > > Thus it was never designed either for a concurrent access
> > > > by another CPU, or for accessing devices.
> > > > (Mis)using it for that gives good performance but has issues,
> > > > some of which I am trying to enumerate below.
> > > > 
> > > > In the below I say memcpy but same applies to memmove just as well.
> > > > 
> > > 
> > > Firstly, thanks to Michael for the summary and helps to lead the discussions.
> > > 
> > > I went through the listed questions and suggestions, but I'm not sure if
> > > I understood every question and suggestion.
> > > 
> > > Figured out that we probably
> > > need to something as below. Please take a look when you get a chance to
> > > check if there are any gaps.
> > > 
> > > S1: New field MemoryRegion::cache_mode to indicate how it has been mapped
> > >      for those directly accessible regions: cache_{normal, writecombine, no}
> > >      corrresponding to pgprot_{normal, writecombine, noncached}.
> > > 
> > >      S1.1 MemoryRegion::cache_mode is meaningful to those directly accessible
> > >           regions like ram, ram device and rom device regions
> > >      S1.2 MemoryRegion::cache_mode should be set when those directly accessible
> > >           regions are created
> > >      S1.3 Only cache_{normal, writecombine} regions can be directly accessible
> > 
> > What is "directly accessible" here? That all memory ops thinkable work?
> > E.g. on power8 not all ops work on pgprot_writecombine, either (and glibc memcopy
> > uses these).
> > I am not 100% sure that's a sane userspace API. It's an internal kernel one.
> > If we are mirroring kernel, we need to include pgprot_device - the CDX vfio
> > driver uses it.
> > 
> > But if you want to know what memory instructions work from userspace, there is
> > a lot of detail and pgprot_ macros do not cover all of them.
> > 
> 
> A region is 'directly accessible' when memory_access_is_direct() returns true
> for it. The accesses to the directly accessible regions are turned into memcpy()
> and memmove()

This is unlikely to *generally* be safe for any memory that can be concurrently
accessed by the guest: neither device memory nor guest memory.
But it might be safe for specific architectures, devices, and lengths.
A heuristic similar to "memcpy/memmove for specific lengths" might
practically work, though.


> in flatview_{read, write}_continue_step(), or {ldm, stm}_p() in
> address_space_{ldm, stm}_internal(). Currently, the mmapable VFIO PCI BARs are
> exposed as ram device regions, which are indirectly accessible. One of our goals
> is to make part of the mappable VFIO PCI BARs (not all of them) directly accessible,
> so that the DMA bounce buffer is bypassed when the DMA target buffer resides in
> the BAR (region).
> 
> Yes, I don't know if pgprot_xxx is adequate or not, but it's just the thought. We
> probably need more information like the combination (host_arch, guest_arch, pgprot_xxx).
> The idea is to have more information fed to MemoryRegion and our private accessing
> functions, which are mentioned in S2 to replace the standard memcpy() and
> memmove(), know which instructions are safe to use, if vector instructions can
> be used, and whatever else.
> 
> I don't think we can do everything in one shot. Initially, we probably just provide
> a sustainable design (or inrastructure) for long-term evolving. From there, we can
> extend it to other architectures and cases step by step.
> 
>     #if defined(__x86_64__) || defined(__aarch64__)
>     #define RAM_DEVICE_REGION_CAN_BE_DIRECTLY_ACCESSIBLE  1
>     #define USE_OUR_OWN_MEMCPY_AND_MEMMOVE                1
>     #else
>     #define RAM_DEVICE_REGION_CAN_BE_DIRECTLY_ACCESSIBLE  0
>     #define USE_OUR_OWN_MEMCPY_AND_MEMMOVE                0
>     #endif
> 
> we also can put more constraints to S1.3 so that only cache_normal
> MMIO region can be directly accessible.

The idea that anything at all is "directly accessible" is kinda flawed.
memcpy can easily be broken for a specific use and it does not
then matter how the memory is mapped.

>     S1.3 Only cache_normal regions can be directly accessible. The question
>          is the cache_normal MMIO region is tolerant to unaligned and vectored
>          access on all architectures?
> 
> > 
> > > S2: qemu_ram_{copy, move}() which are our private implementations of memcpy()
> > >      and memmove(). They're going to replace memcpy/memove() in the memory
> > >      region directly access paths like flatview_{read, write}_continue_step()
> > > 
> > >      S2.1 Small fixed length (1/2/4/8 bytes) accesses shouldn't be either
> > >           split or reordered
> > >      S2.2 Arcitectural optimization based on the MemoryRegion::cache_mode,
> > >           unaligned accesses and vector instructions may be allowed
> > 
> > hmm. meaning what exactly?
> > 
> 
> glibc::{memcopy, memmove}() aren't reliable. There are several related bugs,
> as you listed. For [1], where one-byte-store is translated to triple stores
> to same location. it seems we have to bypass glibc::memcopy(), at least for
> some cases? If so, we need our own (well-behaved) memcpy/memmove(), and
> qemu_ram_{copy, move}() are our own implementations to replace memcpy/memmove()
> in the direct access paths.
> 
>     [1] example of a bug caused by memcpy as result of DMA
>         https://lore.kernel.org/qemu-devel/20260527091711.3901-1-liugang24219@sangfor.com.cn
>     [2] an attempt to fix bugs caused by memcpy to device memory in response to
>         MMIO. 4a2e242bbb "memory: Don't use memcpy for ram_device regions"
>         https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg08129.html

But note the ordering issues (of which multiple stores are one example)
are distinct from ISA issues, and they apply universally for any memory.



> > > 
> > > S3: Support VFIO_REGION_INFO_CAP_MMAP_CACHE_MODE where cache_{normal,
> > >      writecombine, no} is provided for every mmapable region or all
> > >      sparse mmaps on the region
> > 
> > seems like you are trying to drive the cache mode from userspace?
> > but how will userspace know what to set?
> > I'd expect, instead, to just have VFIO report how it is mapped.
> > 
> 
> No, the capability should be reported by host's VFIO PCI driver. For my GH100
> specific case, it's nvgrace_gpu_vfio_pci driver where this capability is reported.

good

> > > 
> > >      S2.1 The capability is only meaningful when VFIO_REGION_INFO_FLAG_MMAP
> > >           is absent
> > >      S2.2 All sparse mmaps for one specific region should have unified cache
> > >           mode
> > 
> > you can not trust userspace to do that.
> > 
> 
> See above explanation. We don't trust userspace to do it. Instead, the host's
> VFIO PCI driver needs to report it.

good

> > > 
> > > > 
> > > > ------------------
> > > > 
> > > > 1. On x86, memcpy is different from __builtin_memcpy if
> > > > one uses old 1.0 force-headers from 2019. Thus, QEMU
> > > > sometimes uses __builtin sometimes it does not, inconsitently.
> > > > Likely no longer relevant and should be cleaned up.
> > > > 
> > > 
> > > S2. old 1.0 force-headers won't be used with S2?
> > > 
> > > > 
> > > > 2. variable length memcpy can translate 2,4,8 byte guest access
> > > > into multiple byte accesses. doing this for mmio is
> > > > guaranteed to break devices.
> > > > 
> > > 
> > > S2.1. However, is it still a problem when the MMIO region is mapped with
> > > pgprot_{normal, writecombine}?
> > 
> > MMIO as pgprot_{normal, writecombine} will break devices whatever
> > userspace does.
> > 
> 
> If we're going to introduce something similar to linux-kernel::readl/writel(),
> and use them to access MMIO region, it should be fine then?
> 

For aligned memory. Unaligned accesses seem to be generally unfixable
unless host and guest are x86. We can split them up and hope for the
best, given we always did.
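
To make "split them up" concrete, here is a minimal sketch of one possible
splitting rule. The helper name sketch_next_chunk() is hypothetical. It picks
the largest naturally aligned power-of-two access (at most 8 bytes) that
starts at the current address and still fits in the remaining length. The
device then sees several aligned accesses instead of one unaligned access,
which is exactly the behaviour change discussed in point 3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: size of the next aligned access when an access of
 * len bytes (len >= 1) starting at addr is split into aligned pieces.
 */
static size_t sketch_next_chunk(uintptr_t addr, size_t len)
{
    size_t size = 8;

    assert(len >= 1);
    /* Shrink until the chunk is naturally aligned and fits in what is left. */
    while (size > 1 && ((addr & (size - 1)) != 0 || size > len)) {
        size >>= 1;
    }
    return size;
}
```

For example, a 4-byte access at an address ending in 0x3 would be issued as
accesses of 1, 2 and 1 bytes under this rule.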

> > 
> > > > 
> > > > 3. (theoretical concern) also on x86, unaligned accesses are possible on guest and host,
> > > > so converting an unaligned access to a series of aligned ones can
> > > > in theory break devices.
> > > > 
> > > 
> > > S1.3, the directly accessible regions have attribute cache_{normal, writecombine}
> > > where unaligned access is allowed. The question is unaligned access on those
> > > regions are always safe on all architectures?
> > 
> > Define "directly accessible regions". Or better, avoid even thinking
> > in these terms.
> > 
> 
> Please see my explanation above.
> 
> > 
> > > > 4. also on x86, vector instructions for large (>16 byte) writes
> > > > into pgprot_noncached memory are safe and faster than multiple 8 byte
> > > > ones.
> > > > 
> > > 
> > > S1.3, region with pgprot_noncached is indirectly accessible.
> > > 
> > > > 5. also on x86 it so happens that if you write a fixed-size memcpy this
> > > > gets optimized to a single store/load and it works for aligned and
> > > > unaligned addresses on that architecture. How to ensure this keeps being
> > > > correct is left as an excerise for the reader. But qemu already relies
> > > > on this and did for years.
> > > > 
> > > 
> > > Sorry, Not fully understood.
> > 
> > 
> > what is unclear? on x86, and some others, glibc will see size 1,2,4 and
> > maybe 8 of 64 and inline memcpy and it happens to do exactly a single
> > load/store.  and code in bswap.h relies on this to mirror guest MMIO on
> > the host. So assuming that is it least not regressing too much.
> > 
> 
> Ok. So you're saying that __builtin_{memcpy, memmove}() aren't safe for MMIO
> accesses?


No, I am saying 1,2,4 byte __builtin_{memcpy, memmove} on x86 hosts
are currently translated to single 1,2,4 byte stores/loads,
and they work for unaligned accesses, which is nice.
I am also saying there's no difference between them and memcpy/memmove
on modern systems.
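
For illustration, a minimal sketch of the fixed-size pattern. The names
sketch_ldl_he() and sketch_stl_he() are hypothetical and only modelled on the
host-endian helpers in bswap.h mentioned above. Because the size is a
compile-time constant, GCC and Clang typically expand __builtin_memcpy() here
into a single 4-byte load or store on x86, but nothing in the C language
guarantees that.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: host-endian 32-bit load from a possibly unaligned pointer. */
static inline uint32_t sketch_ldl_he(const void *ptr)
{
    uint32_t val;

    /* Constant size: usually inlined as one 4-byte load, not a library call. */
    __builtin_memcpy(&val, ptr, sizeof(val));
    return val;
}

/* Hypothetical helper: host-endian 32-bit store to a possibly unaligned pointer. */
static inline void sketch_stl_he(void *ptr, uint32_t val)
{
    /* Constant size: usually inlined as one 4-byte store. */
    __builtin_memcpy(ptr, &val, sizeof(val));
}
```

Checking that the compiler really emits one access needs a look at the
generated code, which is the part that remains unsolved.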


> All the functions in bswap.h, based on __builtin_{memcpy, memmove}(),
> are only safe to RAM accesses, but unsafe to MMIO accesses?
> 
> Currently, ram_device_mem_ops::memory_region_ram_device_write() runs into
> stn_he_p() and then __builtin_memcpy(), which aren't safe to access to VFIO PCI
> BARs. This path wasn't considered in the proposed design and something needs
> to be considered in the revised design. For this, I'm going add the following
> context to the design.
> 
> S1: A new set of functions added to include/qemu/io.h, similar to linux/
>     include/asm/io.h::{read,write}{b, w, l q}() to access MMIO region
> 
>     S1.1 Those new fuctions will be used to access MMIO region
>          S1.1.1 Directly access paths in address_space_{ldm, stm}_internal(),
>                 to replace {ldm, stm}_p().
>          S1.1.2 Directly access paths in flatview_{read, write}_continue_step()
>                 to replace memcpy() and memmove().
>          S1.1.3 Indirectly access paths where MemoryRegionOps is invoked, to
>                 replace {ldn, stn}_he_p() or their RAM access variants if the
>                 region is a MMIO region.
> 
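
For reference, a rough sketch of what accessors of this kind could look
like. The mmio_read/mmio_write names are invented here and are not an
existing QEMU API. The sketch assumes naturally aligned addresses and
relies on volatile so that each call is a single access of the stated
width; it adds no barriers and no endianness conversion:

```c
#include <stdint.h>

/* Hypothetical accessors in the spirit of the kernel's readb/readw/
 * readl/readq and writeb/writew/writel/writeq.  The address must be
 * naturally aligned for the access width. */
#define DEFINE_MMIO_ACCESSORS(suffix, type)                           \
    static inline type mmio_read##suffix(const volatile void *addr)   \
    {                                                                 \
        return *(const volatile type *)addr;                          \
    }                                                                 \
    static inline void mmio_write##suffix(volatile void *addr, type v) \
    {                                                                 \
        *(volatile type *)addr = v;                                   \
    }

DEFINE_MMIO_ACCESSORS(b, uint8_t)
DEFINE_MMIO_ACCESSORS(w, uint16_t)
DEFINE_MMIO_ACCESSORS(l, uint32_t)
DEFINE_MMIO_ACCESSORS(q, uint64_t)
```
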
> > 
> > 
> > > It's perhaps covered by S2 if we're talking
> > > about address_space_{read,write}. If we're talking about address_space_{ldl, stl}(),
> > > we perhaps need to replace __builtin_{memcpy, memmove}() with those private
> > > functions introduced in S2.
> > 
> > Not sure what "covered" means.
> > 
> 
> Ok, it's not important now since the paths invoked by those functions in bswap.h,
> which target a MMIO region, aren't considered in the proposed design.
> 
> > 
> > > 
> > > > 6. on non-x86 both unaligned accesses and vector instructions
> > > > for accessing  UC memory are illegal.
> > > > 
> > > 
> > > Assume UC is equivalent to pgprot_noncached. In that case, it's true on aarch64
> > > at least. With S1.3 applied, this kind of region becomes indirectly accessible.
> > > 
> > > > 7. standard vfio gives KVM VM_ALLOW_ANY_UNCACHED, so even on non x86
> > > > guest can
> > > > map the memory as pgprot_noncached/ioremap or pgprot_writecombine/ioremap_uc.
> > > > If it does the second then it can use unaligned or vector for access.
> > > > This is why normal passthrough tends to work - it never traps to qemu at
> > > > all.
> > > > 
> > > > 
> > > > But for qemu, vfio uses  pgprot_noncached unconditionally so qemu
> > > > can't use unaligned or vector instructions on non-x86.
> > > > 
> > > 
> > > VM_ALLOW_ANY_UNCACHED is exclusive to arm64 since commit 8c47ce3e1d2c ("KVM:
> > > arm64: Set io memory s2 pte as normalnc for vfio pci device"). After that, all
> > > VFIO PCI BARs have pgprot_writecombine attribute on arm64, thus unaligned or
> > > vector accesses are safe on those BARs from guest POV instead of host.
> > 
> > But not e.g. on power8, sadly.
> > 
> 
> Ok. With more constraints applied to S1.3 as I mentioned above, only cache_normal
> regions can be directly accessible, I guess power8 will be happy with unaligned
> and vector access?
> 
>     S1.3 Only cache_normal regions can be directly accessible. The question
>          is the cache_normal MMIO region is tolerant to unaligned and vectored
>          access on all architectures?

I assume you mean pgprot_normal?

For unaligned access there - generally no, but of major ones qemu cares
about, I think only sparc and maybe riscv don't support unaligned memory
accesses for pgprot_normal. Of others, I think mips doesn't.

And I am not sure what you call "MMIO region".


> > > I may be
> > > wrong and Alex can correct me.
> > > 
> > > S1.3 only cache_{normal, writecombine} regions can be directly accessible.
> > > A region with pgprot_noncached attribute is indirectly accessible.
> > > 
> > > > 
> > > > 8. But for nvgrace RAM, vfio has a driver that uses pgprot_writecombine/ioremap_uc.
> > > > so qemu could safely use unaligned/vector instructions even on non-x86.
> > > > 
> > > 
> > > For my specific case related to GH100 card, Region-0/2 have pgprot_writecombine
> > > while Region-4 has pgprot_normal attribute.
> > > 
> > >    Region 0: Memory at 44080000000 (64-bit, prefetchable) [size=16M]
> > >    Region 2: Memory at 44000000000 (64-bit, prefetchable) [size=2G]
> > >    Region 4: Memory at 42000000000 (64-bit, prefetchable) [size=128G]
> > > 
> > > 
> > > > 9. Except sadly, vfio currently does not tell qemu how it maps
> > > > the memory, so qemu can not know what is safe on non-x86.
> > > > 
> > > 
> > > S3. Host VFIO driver needs ABI changes to expose the cache mode.
> > > 
> > > > 10. on x86 memcpy will sometimes do multiple overlapping stores when
> > > > size is not a power of 2. for example, a 15 byte write is done with
> > > > 2 8-byte stores. This is theoretically an issue
> > > > if guest does something super clever with ordering,
> > > > but does not seem to be in practice.
> > > > 
> > > 
> > > S2. This should be avoided in qemu_ram_{copy, move}() which are going to replace
> > > the standard memcpy() and memmove().
> > > 
> > > 
> > > > 11. on non-x86 memcpy will do multiple overlapping stores even
> > > > for single byte writes. E.g. it does it to avoid extra branches.
> > > > This is causing issues in practice.
> > > > 
> > > 
> > > S2, This should be avoided in qemu_ram_{copy, move}().
> > > 
> > > 
> > > > 12. PCI writes are in order, last byte is written last.
> > > > memmove especially writes last byte first sometimes.
> > > > Violating that theoretically can break guests.
> > > > 
> > > 
> > > S2, the reordering should be avoided in qemu_ram_{copy, move}(). However,
> > > I would think this region becomes indirectly accessible with S1.3 applied?
> > > 
> > > 
> > > > 13. but if we are copying between 2 addresses that are overlapping,
> > > > the standard trick (used by memmove) is to compare dst and src and copy
> > > > backwards if dst < src, so last byte is written first.
> > > > 
> > > 
> > > Backwards copying happens on (dst > src) not on (dst < src). We potentially
> > > convert this to a forwards copying by moving the data in the overlapped area
> > > to somewhere else, and then take that as the src in the subsequent forwards
> > > copying.
> > 
> > Not that simple. Issue is, the size of the overlap is not really limited.
> > Maybe make last X bytes go through the buffer, the rest copy backwards and hope for the best?
> > 
> 
> I guess it would work with luck. Alternatively, it can be converted to two
> forwards copying. The source buffer is split into two parts, the second
> part of the source buffer is copied before the first part.

I'm not sure I see it:

Imagine: src 0 to 2G, dst 1G to 3G


SRC:
0 ------- 1G ------- 2G

DST:
          1G ------- 2G ------ 3G


if you want to emulate pci ordering exactly, DST has to be over-written
in exactly address order.

I don't yet see how it can be done without buffering 1G data.
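
To show where the buffering comes from, here is a small sketch (the
helper name is hypothetical, nothing like this exists in qemu today).
It writes dst strictly in ascending address order. When dst lies above
src and the ranges overlap, the part of src that the early stores would
clobber has to be saved first, and that part is n - (dst - src) bytes,
so 1G in the example above. Comparing the two pointers assumes they
point into the same mapping:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy n bytes so that dst is always written in
 * ascending address order, even if the ranges overlap.
 * Returns 0 on success, -1 if the save buffer cannot be allocated. */
static int copy_ascending(unsigned char *dst, const unsigned char *src,
                          size_t n)
{
    volatile unsigned char *out = dst; /* keep stores in program order */
    unsigned char *saved = NULL;
    size_t d = n;  /* leading bytes that can be read straight from src */

    if (dst > src && (size_t)(dst - src) < n) {
        /* Stores to dst[0..] will clobber src[d..n): save it first. */
        d = (size_t)(dst - src);
        saved = malloc(n - d);
        if (!saved) {
            return -1;
        }
        memcpy(saved, src + d, n - d);
    }

    for (size_t i = 0; i < n; i++) {
        out[i] = (i < d) ? src[i] : saved[i - d];
    }

    free(saved);
    return 0;
}
```

The byte loop is only there to make the ordering obvious; it says
nothing about access width or performance.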




> > 
> > > I think it's unlikely to be a directly accessible region with S1.3 applied.
> > 
> > No, this does not have much to do with how the region is mapped.
> > If guest or device write bytes 1 to X in order and you decide to
> > write them X to 1, you have broken some drivers, unless you know
> > exactly how the device and driver are supposed to work.
> > 
> 
> Ideally, we should disallow data movement between two overlapped MMIO regions.

Sadly, devices use exactly this for DMA, this is why qemu switched to
memmove.

> In APIs of linux kernel like readl/writel, one of the operand always resides
> in RAM, the source/destination are never overlapped.



> > 
> > > 
> > > > -------------
> > > > 
> > > > 
> > > > 
> > > > Some conclusions:
> > > > 
> > > > A. on x86, we must avoid converting 2,4,8 byte accesses into byte accesses.
> > > > At least for aligned, preferably for unaligned accesses too.
> > > > Fixed width memcpy seems to work for this. Whether we should bother with
> > > > __builtin to work around broken old fortify headers, I donnu.
> > > > I do not have any answer how to check that compiler does this correctly.
> > > > If anyone is motivated enough, adding a GCC builtin could be possible.
> > > > Given qemu did this for years, I think we can leave solving this for
> > > > another day.
> > > > 
> > > 
> > > Covered by S2.1
> > > 
> > > > B. Also on many architectures, memcpy is much faster for large transfers
> > > > than iterating over 8 byte chunks in C.
> > > > When we can get away with doing that (e.g. for emulated devices where
> > > > we know the concurrency rules, writing into guest RAM), we should.
> > > > 
> > > 
> > > S2.2, something related to performance optimization for the future
> > > 
> > > > C. on non-x86, we currently must not memcpy into host devices
> > > > since we do not know if it is pgprot_noncached. yes, performance will be
> > > > bad for DMA into device RAM.
> > > > 
> > > 
> > > S1.3. This specific region becomes indirectly accessible after S1.3 is applied.
> > > 
> > > > 
> > > > D.  It goes without saying that casting an unaligned address to uint32_t
> > > > (be it for qatomic_set or whatever) is undefined behaviour in C
> > > > and so a bad idea on any architecture.
> > > > 
> > > 
> > > S1.3. The directly accessible region always have cache_{normal, writecombine}
> > > attribute.
> > > 
> > > > 
> > > > E. also for non-x86, we really should teach vfio to tell qemu whether
> > > > it maps device pgprot_noncached or pgprot_writecombine.
> > > > we will then be able to do things like use vector ops
> > > > (through memcpy or not) for >8 accesses.
> > > > 
> > > 
> > > S3. pgprot_normal is also needed. In my case related to GH100 card, the PCI
> > > BAR-4 is mapped with pgprot_normal.
> > > 
> > > Yes, with that information fed to MemoryRegion::cache_mode in S1, it's
> > > possible to optimize qemu_ram_{copy, move}() in S2.2 for the performance sake.
> > > 
> > > 
> > > > F. Arbitrary device passthrough with drivers doing unaligned accesses and
> > > > when working cross architectures basically is a best effort thing.  It
> > > > can't be 100% perfect for all devices.
> > > > 
> > > 
> > > Yes. For the first step, we perhaps need to guarantee the directly accessible region
> > > have pgprot_{normal, writecombine} in S1.3 if you agree.
> > 
> > I do not know what "directly accessible" is and I feel we should get
> > out of the habit of thinking in these terms.
> > VFIO likely DTRT mapping already, and userspace really has no
> > business overriding it.
> > 
> 
> Sorry that I didn't explain 'directly accessible', which has been explained at
> the beginning of this reply.
> 
> > 
> > > > 
> > > > --------------------
> > > > 
> > > > Links:
> > > > 
> > > > 
> > > > example of a fix for a bug caused by memcpy to overlapping addresses:
> > > > 4a73aee881 - "softmmu: Use memmove in flatview_write_continue"
> > > > https://lore.kernel.org/qemu-devel/20230131030155.18932-1-akihiko.odaki@daynix.com
> > > > 
> > > > 
> > > > example of a bug caused by memcpy as result of DMA:
> > > > https://lore.kernel.org/qemu-devel/20260527091711.3901-1-liugang24219@sangfor.com.cn
> > > > 
> > > > an attempt to fix bugs caused by memcpy to device memory in response to
> > > > MMIO:
> > > > 4a2e242bbb "memory: Don't use memcpy for ram_device regions"
> > > > https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg08129.html
> > > > 
> > > 
> 
> Thanks,
> Gavin




end of thread, other threads:[~2026-06-21 12:52 UTC | newest]

Thread overview: 5+ messages
2026-06-17  7:14 list of memory/memcpy access issues Michael S. Tsirkin
2026-06-19  0:44 ` Gavin Shan
2026-06-19  5:33   ` Michael S. Tsirkin
2026-06-21  6:51     ` Gavin Shan
2026-06-21 12:52       ` Michael S. Tsirkin
