From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id C6AFCC43458 for ; Mon, 29 Jun 2026 10:48:56 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender:List-Subscribe:List-Help :List-Post:List-Archive:List-Unsubscribe:List-Id:Content-Transfer-Encoding: Content-Type:In-Reply-To:From:References:Cc:To:Subject:MIME-Version:Date: Message-ID:Reply-To:Content-ID:Content-Description:Resent-Date:Resent-From: Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Owner; bh=bv2MNymaPWPHJTDdOg64m63/msqBwPUlFib40uHg9iI=; b=qCxW+h2VWxiMvKCRGhHyR52+61 lZ19MV9W3BmW3352qrU3Gk+Oe83J9k8eOKL7Jdf78qN2oYP9r9uYK0A3TiITdc/NvojK8NMaQQhBU IJvDlh4fmFPb0Q0fGsW1bzepOEeuTm7qdKZ1HlkEar9Rg/7qpMRvWsZG3yxNEKFZwucJ9+Q73xYEb opC0XcmzwS+OA2oTTSxiNjPRBxraLoqQJobF1SgU1bHhqpUU/8wfi+OBMI7r66hDZWLEu7Hlii2eR 4zy4lBYuFKpsHR8bQUMGcZFJAGCZPxZ3EJlyHqVSmwCXr+kPonGbWqyhZ3Sfr7d7+cX8PEtI2zhh6 10Vi3gbQ==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.99.1 #2 (Red Hat Linux)) id 1we9Xk-0000000EN4T-3IZW; Mon, 29 Jun 2026 10:48:48 +0000 Received: from foss.arm.com ([217.140.110.172]) by bombadil.infradead.org with esmtp (Exim 4.99.1 #2 (Red Hat Linux)) id 1we9Xj-0000000EN3y-0ARz for linux-arm-kernel@lists.infradead.org; Mon, 29 Jun 2026 10:48:48 +0000 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 7DC46176C; Mon, 29 Jun 2026 03:48:41 -0700 (PDT) Received: from [10.1.36.193] (e121487-lin.cambridge.arm.com [10.1.36.193]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id CC8243F836; Mon, 29 Jun 2026 03:48:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=arm.com; s=foss; t=1782730125; bh=HyjFYxbAqFE497eLVxXFT95li2NffwINS9WOP9Nk9NU=; h=Date:Subject:To:Cc:References:From:In-Reply-To:From; b=NYbdVGAflN1iz+6isb+2ix0etO1gfPGgn2t+Gy6GsbFSqiYbcWp1RgRs6+2cRMYdy jcf6JhT8aQsybQdchLAuD6LErxTtolTB9n1aPQXnFNlL0v5Vhqj4u98+3fIru2Gtxp 6CZWCTXP+lUvlLkbs12dOJF6mHTE1YIeG8kobicM= Message-ID: <97b62a6f-a514-46bb-9ee8-81f563220f6a@arm.com> Date: Mon, 29 Jun 2026 11:48:41 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v4 2/2] arm64: io: apply the device store-release workaround once per block write To: Shanker Donthineni , Catalin Marinas , Will Deacon Cc: Jason Gunthorpe , linux-arm-kernel@lists.infradead.org, Mark Rutland , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, Vikram Sethi , Jason Sequeira References: <20260625182425.3194066-1-sdonthineni@nvidia.com> <20260625182425.3194066-3-sdonthineni@nvidia.com> Content-Language: en-GB From: Vladimir Murzin In-Reply-To: <20260625182425.3194066-3-sdonthineni@nvidia.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.9.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20260629_034847_159750_E2169828 X-CRM114-Status: GOOD ( 30.16 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org Hi, On 6/25/26 19:24, Shanker Donthineni wrote: > The generic memset_io()/memcpy_toio() are built on __raw_write*(), so on > parts with the NVIDIA Olympus device store/load ordering erratum the > ARM64_WORKAROUND_DEVICE_STORE_RELEASE workaround promotes every store in > the block to a store-release. Each stlr* carries a barrier cost, so block > MMIO becomes O(n) store-releases, making a block copy many times slower > than a single ordered burst and growing with the transfer size. > > Provide arm64 memset_io()/memcpy_toio() that emit plain str* in the loop > and order the whole block against subsequent loads with a single > trailing dmb osh on affected CPUs (a no-op elsewhere, preserving the > relaxed contract of these helpers). This keeps block MMIO writes at > one-barrier cost rather than scaling with the transfer size. > > Performance (NVIDIA Olympus, write-combining MMIO to a device BAR, single > PE pinned; per-call cost in ns; consecutive writes ping-pong between two > buffers so repeated stores are not coalesced; iowrite64/iowrite32 = > __iowrite{64,32}_copy()): > > Table 1 - arm64 memset_io/memcpy_toio (this patch) > +-------+-----------+-----------+-----------+-------------+ > | size | iowrite64 | iowrite32 | memset_io | memcpy_toio | > +-------+-----------+-----------+-----------+-------------+ > | 8B | 231.6 ns | 231.6 ns | 232.4 ns | 232.4 ns | > | 16B | 231.7 ns | 231.9 ns | 232.7 ns | 232.6 ns | > | 32B | 231.9 ns | 232.7 ns | 232.9 ns | 232.9 ns | > | 64B | 232.7 ns | 235.0 ns | 233.7 ns | 233.6 ns | > | 128B | 233.6 ns | 235.8 ns | 234.4 ns | 234.3 ns | > | 256B | 237.7 ns | 276.8 ns | 264.0 ns | 276.7 ns | > | 512B | 237.7 ns | 277.1 ns | 238.1 ns | 277.6 ns | > | 1KB | 253.7 ns | 279.3 ns | 276.1 ns | 294.1 ns | > | 2KB | 295.0 ns | 318.7 ns | 288.5 ns | 308.3 ns | > | 4KB | 365.9 ns | 381.4 ns | 365.7 ns | 381.3 ns | > +-------+-----------+-----------+-----------+-------------+ > all four helpers end with a single trailing barrier (dmb osh). > > Table 2 - generic per-store memset_io/memcpy_toio > +-------+-----------+-----------+-------------+--------------+ > | size | iowrite64 | iowrite32 | memset_io | memcpy_toio | > +-------+-----------+-----------+-------------+--------------+ > | 8B | 231.6 ns | 231.6 ns | 229.0 ns | 229.0 ns | > | 16B | 231.7 ns | 231.9 ns | 458.4 ns | 458.5 ns | > | 32B | 231.9 ns | 232.7 ns | 917.4 ns | 917.5 ns | > | 64B | 232.7 ns | 234.8 ns | 1835.4 ns | 1835.5 ns | > | 128B | 233.6 ns | 235.8 ns | 3670.9 ns | 3670.8 ns | > | 256B | 237.7 ns | 276.7 ns | 7341.6 ns | 7341.6 ns | > | 512B | 237.7 ns | 279.4 ns | 14001.4 ns | 14001.3 ns | > | 1KB | 253.7 ns | 279.1 ns | 28631.5 ns | 28631.8 ns | > | 2KB | 279.4 ns | 317.9 ns | 57276.3 ns | 57275.2 ns | > | 4KB | 365.7 ns | 381.5 ns | 114564.4 ns | 114563.6 ns | > +-------+-----------+-----------+-------------+--------------+ > the generic memset_io()/memcpy_toio() build on __raw_write*(), which the > workaround promotes to store-release, so every store is individually > ordered - hence O(n) in the store count. > > The arm64 versions stay flat at one-barrier cost while the generic > per-store writers collapse to O(n): at 4KB ~314x slower (~115 us vs > ~366 ns). > > Signed-off-by: Shanker Donthineni > --- > arch/arm64/include/asm/io.h | 5 +++ > arch/arm64/kernel/io.c | 82 +++++++++++++++++++++++++++++++++++++ > 2 files changed, 87 insertions(+) > > diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h > index 69e0fa004d31..649503f347bc 100644 > --- a/arch/arm64/include/asm/io.h > +++ b/arch/arm64/include/asm/io.h > @@ -266,6 +266,11 @@ __iowrite64_copy(void __iomem *to, const void *from, size_t count) > } > #define __iowrite64_copy __iowrite64_copy > > +void memset_io(volatile void __iomem *dst, int c, size_t count); > +#define memset_io memset_io > +void memcpy_toio(volatile void __iomem *dst, const void *src, size_t count); > +#define memcpy_toio memcpy_toio > + > /* > * I/O memory mapping functions. > */ > diff --git a/arch/arm64/kernel/io.c b/arch/arm64/kernel/io.c > index fe86ada23c7d..b5fd9ee6d9eb 100644 > --- a/arch/arm64/kernel/io.c > +++ b/arch/arm64/kernel/io.c > @@ -5,9 +5,91 @@ > * Copyright (C) 2012 ARM Ltd. > */ > > +#include > #include > #include > #include > +#include > + > +#include > + > +/* > + * ARM64_WORKAROUND_DEVICE_STORE_RELEASE promotes every raw MMIO store > + * (__raw_write*()) to a store-release on affected CPUs. The generic > + * memset_io()/memcpy_toio() are built on those helpers, so the workaround would > + * emit one store-release per element and turn a block write into O(n) ordered > + * stores - far more costly than the single barrier a block actually needs. > + * > + * Provide arm64 versions that emit plain STR in the loop and order the whole > + * block against subsequent loads with one trailing DMB OSH, patched in only on > + * affected CPUs (a no-op elsewhere, so the relaxed contract of these helpers is > + * preserved). > + * > + * This capability is currently enabled only for the NVIDIA Olympus device > + * store/load ordering erratum, where a Device-nGnR* load may be observed before > + * an older, non-overlapping Device-nGnR* store to the same peripheral. > + */ > +static __always_inline void iomem_block_store_barrier(void) > +{ > + asm volatile(ALTERNATIVE("nop", "dmb osh", > + ARM64_WORKAROUND_DEVICE_STORE_RELEASE) > + : : : "memory"); > +} > + > +void memset_io(volatile void __iomem *dst, int c, size_t count) > +{ > + u64 qc = (u8)c; > + > + qc *= ~0ULL / 0xff; > + > + while (count && !IS_ALIGNED((__force unsigned long)dst, sizeof(u64))) { > + asm volatile("strb %w0, [%1]" : : "rZ"((u8)c), "r"(dst) : "memory"); > + dst++; > + count--; > + } > + while (count >= sizeof(u64)) { > + asm volatile("str %x0, [%1]" : : "rZ"(qc), "r"(dst) : "memory"); > + dst += sizeof(u64); > + count -= sizeof(u64); > + } > + while (count) { > + asm volatile("strb %w0, [%1]" : : "rZ"((u8)c), "r"(dst) : "memory"); > + dst++; > + count--; > + } > + > + iomem_block_store_barrier(); > +} > +EXPORT_SYMBOL(memset_io); > + > +void memcpy_toio(volatile void __iomem *dst, const void *src, size_t count) > +{ > + while (count && !IS_ALIGNED((__force unsigned long)dst, sizeof(u64))) { > + asm volatile("strb %w0, [%1]" > + : : "rZ"(*(const u8 *)src), "r"(dst) : "memory"); > + src++; > + dst++; > + count--; > + } > + while (count >= sizeof(u64)) { > + asm volatile("str %x0, [%1]" > + : : "rZ"(get_unaligned((const u64 *)src)), "r"(dst) Why do we need get_unaligned() here? I understand this came from the generic implementation, where it needs to handle architectures that do not support unaligned accesses. But IIUC this is not an issue for arm64, and there was no special handling in memcpy_toio() before 0110feaaf6d0 ("arm64: Use new fallback IO memcpy/memset"). Am I missing something? > + : "memory"); > + src += sizeof(u64); > + dst += sizeof(u64); > + count -= sizeof(u64); > + } > + while (count) { > + asm volatile("strb %w0, [%1]" > + : : "rZ"(*(const u8 *)src), "r"(dst) : "memory"); > + src++; > + dst++; > + count--; > + } > + > + iomem_block_store_barrier(); It is perhaps a matter of taste, but having the inline assembly here (and in memset_io()) might make the code clearer. To a casual reader, it would be obvious that the barrier is not guaranteed and is only applicable to ARM64_WORKAROUND_DEVICE_STORE_RELEASE, without having to jump back and forth through the code. Obliviously maintainers might have different preference ;) Cheers Vladimir > +} > +EXPORT_SYMBOL(memcpy_toio); > > /* > * This generates a memcpy that works on a from/to address which is aligned to > -- 2.54.0.windows.1 >