From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Sun, 21 Jun 2026 08:52:14 -0400
From: "Michael S. Tsirkin"
To: Gavin Shan
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, peterx@redhat.com,
 alex@shazbot.org, richard.henderson@linaro.org, peter.maydell@linaro.org,
 berrange@redhat.com, philmd@oss.qualcomm.com, philmd@mailo.com,
 david@kernel.org, clg@redhat.com, pbonzini@redhat.com, phrdina@redhat.com,
 jugraham@redhat.com, liugang24219@sangfor.com.cn, dinghui@sangfor.com.cn,
 shan.gavin@gmail.com
Subject: Re: list of memory/memcpy access issues
Message-ID: <20260621031129-mutt-send-email-mst@kernel.org>
References: <20260617022330-mutt-send-email-mst@kernel.org>
 <3f58de99-74a2-4ccc-b800-d254bbd40931@redhat.com>
 <20260619004237-mutt-send-email-mst@kernel.org>
 <829e6103-7f82-4884-9892-86cfb7b743bb@redhat.com>
In-Reply-To: <829e6103-7f82-4884-9892-86cfb7b743bb@redhat.com>
Sender:
qemu-arm-bounces+qemu-arm=archiver.kernel.org@nongnu.org

On Sun, Jun 21, 2026 at 04:51:55PM +1000, Gavin Shan wrote:
> On 6/19/26 3:33 PM, Michael S. Tsirkin wrote:
> > On Fri, Jun 19, 2026 at 10:44:17AM +1000, Gavin Shan wrote:
> > > On 6/17/26 5:14 PM, Michael S. Tsirkin wrote:
> > > > This is a top post attempting to summarize some findings related to
> > > > emulating DMA and MMIO existing in QEMU memory core
> > > > using memcpy/memmove.
> > > >
> > > > Hopefully, this will help inform discussion about multiple
> > > > changes currently proposed for QEMU.
> > > >
> > > > At a high level, and in a variety of configurations, QEMU gets
> > > > DMA requests from a virtual device, or MMIO requests from
> > > > a VCPU, and wants to execute them either on guest ram or
> > > > passthrough device memory.
> > > >
> > > > Down the road this almost always (virtio ring implementation seems to be
> > > > a notable exception) translates to memcpy/memmove calls
> > > > (glibc e.g. on x86 currently implements memcpy through memmove).
> > > >
> > > > However, memcpy's signature is:
> > > > void *memcpy(void *dest, const void *src, size_t n);
> > > > note how neither src nor, more importantly, dest is volatile.
> > > > Thus it was never designed either for a concurrent access
> > > > by another CPU, or for accessing devices.
> > > > (Mis)using it for that gives good performance but has issues,
> > > > some of which I am trying to enumerate below.
> > > >
> > > > In the below I say memcpy but same applies to memmove just as well.
> > > >
> > >
> > > Firstly, thanks to Michael for the summary, which helps to lead the discussion.
> > >
> > > I went through the listed questions and suggestions, but I'm not sure if
> > > I understood every question and suggestion.
> > >
> > > Figured out that we probably
> > > need to do something as below. Please take a look when you get a chance to
> > > check if there are any gaps.
> > >
> > > S1: New field MemoryRegion::cache_mode to indicate how it has been mapped
> > >     for those directly accessible regions: cache_{normal, writecombine, no}
> > >     corresponding to pgprot_{normal, writecombine, noncached}.
> > >
> > >     S1.1 MemoryRegion::cache_mode is meaningful to those directly accessible
> > >          regions like ram, ram device and rom device regions
> > >     S1.2 MemoryRegion::cache_mode should be set when those directly accessible
> > >          regions are created
> > >     S1.3 Only cache_{normal, writecombine} regions can be directly accessible
> >
> > What is "directly accessible" here? That all memory ops thinkable work?
> > E.g. on power8 not all ops work on pgprot_writecombine, either (and glibc memcpy
> > uses these).
> > I am not 100% sure that's a sane userspace API. It's an internal kernel one.
> > If we are mirroring kernel, we need to include pgprot_device - the CDX vfio
> > driver uses it.
> >
> > But if you want to know what memory instructions work from userspace, there is
> > a lot of detail and pgprot_ macros do not cover all of them.
> >
>
> A region is 'directly accessible' when memory_access_is_direct() returns true
> for it. The accesses to the directly accessible regions are turned into memcpy()
> and memmove()

This is unlikely to *generally* be safe for any memory that can be
concurrently accessed by guest. Not device memory, nor guest memory.
But, it might be safe for specific architectures, devices, and lengths.
A heuristic similar to "memcpy/memmove for specific lengths" might
practically work, though.

> in flatview_{read, write}_continue_step(), or {ldm, stm}_p() in
> address_space_{ldm, stm}_internal(). Currently, the mmapable VFIO PCI BARs are
> exposed as ram device regions, which are indirectly accessible. One of our goals
> is to make part of the mappable VFIO PCI BARs (not all of them) directly accessible,
> so that the DMA bounce buffer is bypassed when the DMA target buffer resides in
> the BAR (region).
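
To make the "specific lengths" heuristic concrete, here is a minimal
sketch of what a length-dispatching copy could look like. The name
sketch_copy_fixed() is made up for illustration and is not an existing
QEMU function. It assumes that an aligned volatile access of native
width is emitted as one store, which mainstream compilers do in
practice but the C standard does not promise, and that the build uses
-fno-strict-aliasing or the destination has no declared type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not part of QEMU.
 *
 * 1/2/4/8-byte copies to a naturally aligned destination are done with
 * one volatile store of that width, so the compiler may not split or
 * duplicate them. Every other case (odd sizes, unaligned destinations)
 * falls back to byte stores in ascending address order, which splits
 * the access but never writes a location twice and never writes the
 * last byte first.
 */
static inline void sketch_copy_fixed(void *dst, const void *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst;

    assert(n == 0 || (dst != NULL && src != NULL));

    if (n == 1) {
        uint8_t v;
        memcpy(&v, src, sizeof(v));
        *(volatile uint8_t *)dst = v;
    } else if (n == 2 && (d & 1) == 0) {
        uint16_t v;
        memcpy(&v, src, sizeof(v));
        *(volatile uint16_t *)dst = v;
    } else if (n == 4 && (d & 3) == 0) {
        uint32_t v;
        memcpy(&v, src, sizeof(v));
        *(volatile uint32_t *)dst = v;
    } else if (n == 8 && (d & 7) == 0) {
        uint64_t v;
        memcpy(&v, src, sizeof(v));
        *(volatile uint64_t *)dst = v;
    } else {
        volatile uint8_t *vd = dst;
        const uint8_t *s = src;
        size_t i;

        for (i = 0; i < n; i++) {
            vd[i] = s[i];
        }
    }
}
```

The byte-wise fallback is deliberately the slow and boring path; whether
a faster path is acceptable for large copies depends on the
architecture and on how the target is mapped, which is exactly the
information that is missing today.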
>
> Yes, I don't know if pgprot_xxx is adequate or not, but it's just the thought. We
> probably need more information like the combination (host_arch, guest_arch, pgprot_xxx).
> The idea is to have more information fed to MemoryRegion so that our private access
> functions, which are mentioned in S2 to replace the standard memcpy() and
> memmove(), know which instructions are safe to use, whether vector instructions can
> be used, and so on.
>
> I don't think we can do everything in one shot. Initially, we probably just provide
> a sustainable design (or infrastructure) for long-term evolution. From there, we can
> extend it to other architectures and cases step by step.
>
>    #if defined(__x86_64__) || defined(__aarch64__)
>    #define RAM_DEVICE_REGION_CAN_BE_DIRECTLY_ACCESSIBLE   1
>    #define USE_OUR_OWN_MEMCPY_AND_MEMMOVE                 1
>    #else
>    #define RAM_DEVICE_REGION_CAN_BE_DIRECTLY_ACCESSIBLE   0
>    #define USE_OUR_OWN_MEMCPY_AND_MEMMOVE                 0
>    #endif
>
> We can also put more constraints on S1.3 so that only cache_normal
> MMIO regions can be directly accessible.

The idea that anything at all is "directly accessible" is kinda flawed.
memcpy can easily be broken for a specific use and it does not then
matter how the memory is mapped.

>      S1.3 Only cache_normal regions can be directly accessible. The question
>           is whether the cache_normal MMIO region is tolerant to unaligned and
>           vectored access on all architectures?
>
> > >
> > > S2: qemu_ram_{copy, move}() which are our private implementations of memcpy()
> > >     and memmove(). They're going to replace memcpy/memmove() in the memory
> > >     region direct access paths like flatview_{read, write}_continue_step()
> > >
> > >     S2.1 Small fixed length (1/2/4/8 bytes) accesses shouldn't be either
> > >          split or reordered
> > >     S2.2 Architectural optimization based on the MemoryRegion::cache_mode,
> > >          unaligned accesses and vector instructions may be allowed
> >
> > hmm. meaning what exactly?
> >
>
> glibc::{memcpy, memmove}() aren't reliable.
> There are several related bugs,
> as you listed. For [1], where a one-byte store is translated to triple stores
> to the same location, it seems we have to bypass glibc::memcpy(), at least for
> some cases? If so, we need our own (well-behaved) memcpy/memmove(), and
> qemu_ram_{copy, move}() are our own implementations to replace memcpy/memmove()
> in the direct access paths.
>
> [1] example of a bug caused by memcpy as result of DMA
>     https://lore.kernel.org/qemu-devel/20260527091711.3901-1-liugang24219@sangfor.com.cn
> [2] an attempt to fix bugs caused by memcpy to device memory in response to
>     MMIO. 4a2e242bbb "memory: Don't use memcpy for ram_device regions"
>     https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg08129.html

But note the ordering issues (of which multiple stores are one example)
are distinct from ISA issues, and they apply universally for any memory.

> > >
> > > S3: Support VFIO_REGION_INFO_CAP_MMAP_CACHE_MODE where cache_{normal,
> > >     writecombine, no} is provided for every mmapable region or all
> > >     sparse mmaps on the region
> >
> > seems like you are trying to drive the cache mode from userspace?
> > but how will userspace know what to set?
> > I'd expect, instead, to just have VFIO report how it is mapped.
> >
>
> No, the capability should be reported by host's VFIO PCI driver. For my GH100
> specific case, it's nvgrace_gpu_vfio_pci driver where this capability is reported.

good

> > >
> > >     S2.1 The capability is only meaningful when VFIO_REGION_INFO_FLAG_MMAP
> > >          is absent
> > >     S2.2 All sparse mmaps for one specific region should have unified cache
> > >          mode
> >
> > you can not trust userspace to do that.
> >
>
> See above explanation. We don't trust userspace to do it. Instead, the host's
> VFIO PCI driver needs to report it.

good

> > >
> > > > ------------------
> > > >
> > > > 1. On x86, memcpy is different from __builtin_memcpy if
> > > > one uses old 1.0 fortify-headers from 2019.
> > > > Thus, QEMU
> > > > sometimes uses __builtin, sometimes it does not, inconsistently.
> > > > Likely no longer relevant and should be cleaned up.
> > > >
> > >
> > > S2. old 1.0 fortify-headers won't be used with S2?
> > >
> > > >
> > > > 2. variable length memcpy can translate 2,4,8 byte guest access
> > > > into multiple byte accesses. doing this for mmio is
> > > > guaranteed to break devices.
> > > >
> > >
> > > S2.1. However, is it still a problem when the MMIO region is mapped with
> > > pgprot_{normal, writecombine}?
> >
> > MMIO as pgprot_{normal, writecombine} will break devices whatever
> > userspace does.
> >
>
> If we're going to introduce something similar to linux-kernel::readl/writel(),
> and use them to access MMIO region, it should be fine then?
>

For aligned memory. Unaligned accesses seem to be generally unfixable
unless host and guest are x86. We can split them up and hope for the
best, given we always did.

> > >
> > > >
> > > > 3. (theoretical concern) also on x86, unaligned accesses are possible on guest and host,
> > > > so converting an unaligned access to a series of aligned ones can
> > > > in theory break devices.
> > > >
> > >
> > > S1.3, the directly accessible regions have attribute cache_{normal, writecombine}
> > > where unaligned access is allowed. The question is whether unaligned access on
> > > those regions is always safe on all architectures?
> >
> > Define "directly accessible regions". Or better, avoid even thinking
> > in these terms.
> >
>
> Please see my explanation above.
>
> > >
> > > > 4. also on x86, vector instructions for large (>16 byte) writes
> > > > into pgprot_noncached memory are safe and faster than multiple 8 byte
> > > > ones.
> > > >
> > >
> > > S1.3, region with pgprot_noncached is indirectly accessible.
> > >
> > > > 5. also on x86 it so happens that if you write a fixed-size memcpy this
> > > > gets optimized to a single store/load and it works for aligned and
> > > > unaligned addresses on that architecture.
> > > > How to ensure this keeps being
> > > > correct is left as an exercise for the reader. But qemu already relies
> > > > on this and did for years.
> > > >
> > >
> > > Sorry, Not fully understood.
> >
> >
> > what is unclear? on x86, and some others, glibc will see size 1,2,4 and
> > maybe 8 on 64-bit and inline memcpy and it happens to do exactly a single
> > load/store. and code in bswap.h relies on this to mirror guest MMIO on
> > the host. So assuming that is at least not regressing too much.
> >
>
> Ok. So you're saying that __builtin_{memcpy, memmove}() aren't safe for MMIO
> accesses?

No, I am saying 1,2,4 byte __builtin_{memcpy, memmove} on x86 hosts are
currently translated to single 1,2,4 byte stores/loads, and they work
for unaligned accesses, which is nice. I am also saying there's no
difference between them and memcpy/memmove on modern systems.

> All the functions in bswap.h, based on __builtin_{memcpy, memmove}(),
> are only safe for RAM accesses, but unsafe for MMIO accesses?
>
> Currently, ram_device_mem_ops::memory_region_ram_device_write() runs into
> stn_he_p() and then __builtin_memcpy(), which aren't safe for accessing VFIO PCI
> BARs. This path wasn't considered in the proposed design and something needs
> to be considered in the revised design. For this, I'm going to add the following
> context to the design.
>
> S1: A new set of functions added to include/qemu/io.h, similar to linux/
>     include/asm/io.h::{read,write}{b, w, l, q}() to access MMIO region
>
>     S1.1 Those new functions will be used to access MMIO region
>          S1.1.1 Direct access paths in address_space_{ldm, stm}_internal(),
>                 to replace {ldm, stm}_p().
>          S1.1.2 Direct access paths in flatview_{read, write}_continue_step()
>                 to replace memcpy() and memmove().
>          S1.1.3 Indirect access paths where MemoryRegionOps is invoked, to
>                 replace {ldn, stn}_he_p() or their RAM access variants if the
>                 region is a MMIO region.
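
For reference, a rough userspace sketch of what such
{read,write}{b,w,l,q}-style accessors might look like. All the
sketch_mmio_* names are hypothetical, and include/qemu/io.h is only the
location proposed above, not an existing header. Unlike the kernel's
readl()/writel(), this sketch does no endianness conversion and adds no
memory barriers; it only shows the single, aligned, volatile access
part, and simply asserts on unaligned addresses rather than trying to
handle them.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical accessors, not an existing QEMU or kernel API.
 * Each one performs exactly one volatile access of the given width
 * and requires natural alignment. No byte swapping, no barriers.
 */
#define SKETCH_MMIO_ACCESSORS(sfx, type)                                 \
static inline type sketch_mmio_read##sfx(const volatile void *addr)      \
{                                                                        \
    assert(((uintptr_t)addr % sizeof(type)) == 0);                       \
    return *(const volatile type *)addr;                                 \
}                                                                        \
static inline void sketch_mmio_write##sfx(volatile void *addr, type val) \
{                                                                        \
    assert(((uintptr_t)addr % sizeof(type)) == 0);                       \
    *(volatile type *)addr = val;                                        \
}

SKETCH_MMIO_ACCESSORS(b, uint8_t)
SKETCH_MMIO_ACCESSORS(w, uint16_t)
SKETCH_MMIO_ACCESSORS(l, uint32_t)
SKETCH_MMIO_ACCESSORS(q, uint64_t)
```

A caller in one of the paths listed in S1.1 would pick the accessor from
the access size, e.g. sketch_mmio_writel(host_ptr, val) for a 4-byte
guest store. What to do for unaligned or odd-sized accesses is left
open here, as it is in the discussion.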
>
> > >
> > > It's perhaps covered by S2 if we're talking
> > > about address_space_{read,write}. If we're talking about address_space_{ldl, stl}(),
> > > we perhaps need to replace __builtin_{memcpy, memmove}() with those private
> > > functions introduced in S2.
> >
> > Not sure what "covered" means.
> >
>
> Ok, it's not important now since the paths invoked by those functions in bswap.h,
> which target a MMIO region, aren't considered in the proposed design.
>
> > >
> > > >
> > > > 6. on non-x86 both unaligned accesses and vector instructions
> > > > for accessing UC memory are illegal.
> > > >
> > >
> > > Assume UC is equivalent to pgprot_noncached. In that case, it's true on aarch64
> > > at least. With S1.3 applied, this kind of region becomes indirectly accessible.
> > >
> > > > 7. standard vfio gives KVM VM_ALLOW_ANY_UNCACHED, so even on non x86
> > > > guest can
> > > > map the memory as pgprot_noncached/ioremap or pgprot_writecombine/ioremap_wc.
> > > > If it does the second then it can use unaligned or vector for access.
> > > > This is why normal passthrough tends to work - it never traps to qemu at
> > > > all.
> > > >
> > > >
> > > > But for qemu, vfio uses pgprot_noncached unconditionally so qemu
> > > > can't use unaligned or vector instructions on non-x86.
> > > >
> > >
> > > VM_ALLOW_ANY_UNCACHED is exclusive to arm64 since commit 8c47ce3e1d2c ("KVM:
> > > arm64: Set io memory s2 pte as normalnc for vfio pci device"). After that, all
> > > VFIO PCI BARs have pgprot_writecombine attribute on arm64, thus unaligned or
> > > vector accesses are safe on those BARs from guest POV instead of host.
> >
> > But not e.g. on power8, sadly.
> >
>
> Ok. With more constraints applied to S1.3 as I mentioned above, only cache_normal
> regions can be directly accessible, I guess power8 will be happy with unaligned
> and vector access?
>
>      S1.3 Only cache_normal regions can be directly accessible.
>           The question
>           is whether the cache_normal MMIO region is tolerant to unaligned and
>           vectored access on all architectures?

I assume you mean pgprot_normal? For unaligned access there - generally
no, but of major ones qemu cares about, I think only sparc and maybe
riscv don't support unaligned memory accesses for pgprot_normal. Of
others, I think mips doesn't. And I am not sure what you call "MMIO
region".

> > > I may be
> > > wrong and Alex can correct me.
> > >
> > > S1.3 only cache_{normal, writecombine} regions can be directly accessible.
> > > A region with pgprot_noncached attribute is indirectly accessible.
> > >
> > > >
> > > > 8. But for nvgrace RAM, vfio has a driver that uses pgprot_writecombine/ioremap_wc.
> > > > so qemu could safely use unaligned/vector instructions even on non-x86.
> > > >
> > >
> > > For my specific case related to GH100 card, Region-0/2 have pgprot_writecombine
> > > while Region-4 has pgprot_normal attribute.
> > >
> > >   Region 0: Memory at 44080000000 (64-bit, prefetchable) [size=16M]
> > >   Region 2: Memory at 44000000000 (64-bit, prefetchable) [size=2G]
> > >   Region 4: Memory at 42000000000 (64-bit, prefetchable) [size=128G]
> > >
> > > >
> > > > 9. Except sadly, vfio currently does not tell qemu how it maps
> > > > the memory, so qemu can not know what is safe on non-x86.
> > > >
> > >
> > > S3. Host VFIO driver needs ABI changes to expose the cache mode.
> > >
> > > > 10. on x86 memcpy will sometimes do multiple overlapping stores when
> > > > size is not a power of 2. for example, a 15 byte write is done with
> > > > 2 8-byte stores. This is theoretically an issue
> > > > if guest does something super clever with ordering,
> > > > but does not seem to be in practice.
> > > >
> > >
> > > S2. This should be avoided in qemu_ram_{copy, move}() which are going to replace
> > > the standard memcpy() and memmove().
> > >
> > > >
> > > > 11. on non-x86 memcpy will do multiple overlapping stores even
> > > > for single byte writes. E.g.
> > > > it does it to avoid extra branches.
> > > > This is causing issues in practice.
> > > >
> > >
> > > S2, This should be avoided in qemu_ram_{copy, move}().
> > >
> > > >
> > > > 12. PCI writes are in order, last byte is written last.
> > > > memmove especially writes last byte first sometimes.
> > > > Violating that theoretically can break guests.
> > > >
> > >
> > > S2, the reordering should be avoided in qemu_ram_{copy, move}(). However,
> > > I would think this region becomes indirectly accessible with S1.3 applied?
> > >
> > > >
> > > > 13. but if we are copying between 2 addresses that are overlapping,
> > > > the standard trick (used by memmove) is to compare dst and src and copy
> > > > backwards if dst < src, so last byte is written first.
> > > >
> > >
> > > Backwards copying happens on (dst > src) not on (dst < src). We potentially
> > > convert this to a forwards copying by moving the data in the overlapped area
> > > to somewhere else, and then take that as the src in the subsequent forwards
> > > copying.
> >
> > Not that simple. Issue is, the size of the overlap is not really limited.
> > Maybe make last X bytes go through the buffer, the rest copy backwards and hope for the best?
> >
>
> I guess it would work with luck. Alternatively, it can be converted to two
> forward copies. The source buffer is split into two parts, and the second
> part of the source buffer is copied before the first part.

I'm not sure I see it:

Imagine: src 0 to 2G, dst 1G to 3G

SRC: 0 ------- 1G ------- 2G
DST:           1G ------- 2G ------- 3G

if you want to emulate pci ordering exactly, DST has to be over-written
in exactly address order. I don't yet see how it can be done without
buffering 1G data.

> > >
> > > I think it's unlikely to be a directly accessible region with S1.3 applied.
> >
> > No, this does not have much to do with how the region is mapped.
> > If guest or device write bytes 1 to X in order and you decide to
> > write them X to 1, you have broken some drivers, unless you know
> > exactly how the device and driver are supposed to work.
> >
>
> Ideally, we should disallow data movement between two overlapped MMIO regions.

Sadly, devices use exactly this for DMA, this is why qemu switched to
memmove.

> In APIs of linux kernel like readl/writel, one of the operands always resides
> in RAM, the source/destination are never overlapped.
>
> > >
> > > > -------------
> > > >
> > > >
> > > > Some conclusions:
> > > >
> > > > A. on x86, we must avoid converting 2,4,8 byte accesses into byte accesses.
> > > > At least for aligned, preferably for unaligned accesses too.
> > > > Fixed width memcpy seems to work for this. Whether we should bother with
> > > > __builtin to work around broken old fortify headers, I dunno.
> > > > I do not have any answer how to check that compiler does this correctly.
> > > > If anyone is motivated enough, adding a GCC builtin could be possible.
> > > > Given qemu did this for years, I think we can leave solving this for
> > > > another day.
> > > >
> > >
> > > Covered by S2.1
> > >
> > > > B. Also on many architectures, memcpy is much faster for large transfers
> > > > than iterating over 8 byte chunks in C.
> > > > When we can get away with doing that (e.g. for emulated devices where
> > > > we know the concurrency rules, writing into guest RAM), we should.
> > > >
> > >
> > > S2.2, something related to performance optimization for the future
> > >
> > > > C. on non-x86, we currently must not memcpy into host devices
> > > > since we do not know if it is pgprot_noncached. yes, performance will be
> > > > bad for DMA into device RAM.
> > > >
> > >
> > > S1.3. This specific region becomes indirectly accessible after S1.3 is applied.
> > >
> > > >
> > > > D.
> > > > It goes without saying that casting an unaligned address to uint32_t *
> > > > (be it for qatomic_set or whatever) is undefined behaviour in C
> > > > and so a bad idea on any architecture.
> > > >
> > >
> > > S1.3. The directly accessible regions always have cache_{normal, writecombine}
> > > attribute.
> > >
> > > >
> > > > E. also for non-x86, we really should teach vfio to tell qemu whether
> > > > it maps device pgprot_noncached or pgprot_writecombine.
> > > > we will then be able to do things like use vector ops
> > > > (through memcpy or not) for >8 accesses.
> > > >
> > >
> > > S3. pgprot_normal is also needed. In my case related to GH100 card, the PCI
> > > BAR-4 is mapped with pgprot_normal.
> > >
> > > Yes, with that information fed to MemoryRegion::cache_mode in S1, it's
> > > possible to optimize qemu_ram_{copy, move}() in S2.2 for the sake of performance.
> > >
> > > >
> > > > F. Arbitrary device passthrough with drivers doing unaligned accesses and
> > > > when working cross architectures basically is a best effort thing. It
> > > > can't be 100% perfect for all devices.
> > > >
> > >
> > > Yes. For the first step, we perhaps need to guarantee the directly accessible regions
> > > have pgprot_{normal, writecombine} in S1.3 if you agree.
> >
> > I do not know what "directly accessible" is and I feel we should get
> > out of the habit of thinking in these terms.
> > VFIO likely DTRT mapping already, and userspace really has no
> > business overriding it.
> >
>
> Sorry that I didn't explain 'directly accessible', which has been explained at
> the beginning of this reply.
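
On items 5 and D, the difference between the cast and the fixed-size
memcpy can be shown in a few lines. This is only an illustration in the
spirit of the host-endian helpers in bswap.h; the sketch_* names are
made up. Whether the compiler turns the memcpy into a single load or
store is an optimization that happens in practice on x86 and some
others, not something the C standard guarantees.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helpers for illustration only.
 *
 * What NOT to do for a possibly unaligned ptr:
 *     uint32_t v = *(uint32_t *)ptr;    undefined behaviour in C
 *
 * A fixed-size memcpy is well defined for any alignment, and compilers
 * commonly lower it to one 4-byte load/store where the ISA allows it.
 */
static inline uint32_t sketch_ldl_he(const void *ptr)
{
    uint32_t v;

    memcpy(&v, ptr, sizeof(v));
    return v;
}

static inline void sketch_stl_he(void *ptr, uint32_t v)
{
    memcpy(ptr, &v, sizeof(v));
}
```

This fixes the language-level problem only. It says nothing about
whether the resulting access is split, duplicated or reordered, which
is the separate concern raised earlier in the thread.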
>
> > >
> > > >
> > > > --------------------
> > > >
> > > > Links:
> > > >
> > > >
> > > > example of a fix for a bug caused by memcpy to overlapping addresses:
> > > > 4a73aee881 - "softmmu: Use memmove in flatview_write_continue"
> > > > https://lore.kernel.org/qemu-devel/20230131030155.18932-1-akihiko.odaki@daynix.com
> > > >
> > > >
> > > > example of a bug caused by memcpy as result of DMA:
> > > > https://lore.kernel.org/qemu-devel/20260527091711.3901-1-liugang24219@sangfor.com.cn
> > > >
> > > > an attempt to fix bugs caused by memcpy to device memory in response to
> > > > MMIO:
> > > > 4a2e242bbb "memory: Don't use memcpy for ram_device regions"
> > > > https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg08129.html
> > > >
> > >
>
> Thanks,
> Gavin