* [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering
@ 2026-06-25 18:24 Shanker Donthineni
2026-06-25 18:24 ` [PATCH v4 1/2] arm64: errata: Workaround " Shanker Donthineni
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Shanker Donthineni @ 2026-06-25 18:24 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Vladimir Murzin
Cc: Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel,
linux-doc, Shanker Donthineni, Vikram Sethi, Jason Sequeira
This series works around the NVIDIA Olympus device store/load ordering
erratum (T410-OLY-1027): a Device-nGnR* load can be observed by a
peripheral before an older, non-overlapping Device-nGnR* store to the
same peripheral, breaking the program order that drivers rely on for
MMIO and potentially leaving a device in an incorrect state.
Patch 1 adds the workaround. It promotes the raw MMIO store helpers
(__raw_writeb/w/l/q, and therefore writel()/writel_relaxed()) to
store-release on affected CPUs, and promotes the trailing DGH of the
write-combining __iowrite{32,64}_copy() helpers to dmb osh. Everything is
gated on a new ARM64_WORKAROUND_DEVICE_STORE_RELEASE cpucap and patched
in only on affected parts, so it is a no-op elsewhere.
Patch 2 provides arm64 memset_io()/memcpy_toio(). The generic versions
are built on __raw_write*(), so patch 1 would promote every store in a
block to a store-release; as each STLR drains the write-combining buffer,
block MMIO becomes O(n) store-releases. The arm64 versions emit plain
STR in the loop and order the whole block with a single trailing dmb osh,
keeping block MMIO at one-barrier cost.
Performance: NVIDIA Olympus, write-combining MMIO to a device BAR, single
PE pinned; per-call cost in ns. Consecutive writes ping-pong between two
buffers so repeated stores are not coalesced. iowrite64/iowrite32 =
__iowrite{64,32}_copy().
Table 1 - workaround off (CONFIG_NVIDIA_OLYMPUS_1027_ERRATUM=n)
+-------+-----------+-----------+-----------+-------------+
| size | iowrite64 | iowrite32 | memset_io | memcpy_toio |
+-------+-----------+-----------+-----------+-------------+
| 8B | 67.9 ns | 67.8 ns | 3.6 ns | 3.6 ns |
| 16B | 67.9 ns | 67.8 ns | 4.0 ns | 4.0 ns |
| 32B | 67.9 ns | 67.9 ns | 4.6 ns | 4.6 ns |
| 64B | 69.1 ns | 69.1 ns | 69.1 ns | 69.0 ns |
| 128B | 138.3 ns | 138.3 ns | 138.4 ns | 138.3 ns |
| 256B | 276.6 ns | 276.6 ns | 276.6 ns | 276.7 ns |
| 512B | 276.6 ns | 276.5 ns | 276.6 ns | 276.6 ns |
| 1KB | 276.6 ns | 278.4 ns | 276.6 ns | 276.6 ns |
| 2KB | 278.4 ns | 278.4 ns | 275.9 ns | 276.6 ns |
| 4KB | 365.7 ns | 365.7 ns | 365.7 ns | 365.7 ns |
+-------+-----------+-----------+-----------+-------------+
relaxed/no-flush: memset_io()/memcpy_toio() issue plain stores with no
trailing dgh() or barrier, unlike __iowrite*_copy() which ends with dgh().
Table 2 - workaround on, arm64 memset_io/memcpy_toio (this series)
+-------+-----------+-----------+-----------+-------------+
| size | iowrite64 | iowrite32 | memset_io | memcpy_toio |
+-------+-----------+-----------+-----------+-------------+
| 8B | 231.6 ns | 231.6 ns | 232.4 ns | 232.4 ns |
| 16B | 231.7 ns | 231.9 ns | 232.7 ns | 232.6 ns |
| 32B | 231.9 ns | 232.7 ns | 232.9 ns | 232.9 ns |
| 64B | 232.7 ns | 235.0 ns | 233.7 ns | 233.6 ns |
| 128B | 233.6 ns | 235.8 ns | 234.4 ns | 234.3 ns |
| 256B | 237.7 ns | 276.8 ns | 264.0 ns | 276.7 ns |
| 512B | 237.7 ns | 277.1 ns | 238.1 ns | 277.6 ns |
| 1KB | 253.7 ns | 279.3 ns | 276.1 ns | 294.1 ns |
| 2KB | 295.0 ns | 318.7 ns | 288.5 ns | 308.3 ns |
| 4KB | 365.9 ns | 381.4 ns | 365.7 ns | 381.3 ns |
+-------+-----------+-----------+-----------+-------------+
all four helpers end with a single trailing barrier (dmb osh).
Table 3 - workaround on, generic per-store memset_io/memcpy_toio
+-------+-----------+-----------+-------------+--------------+
| size | iowrite64 | iowrite32 | memset_io | memcpy_toio |
+-------+-----------+-----------+-------------+--------------+
| 8B | 231.6 ns | 231.6 ns | 229.0 ns | 229.0 ns |
| 16B | 231.7 ns | 231.9 ns | 458.4 ns | 458.5 ns |
| 32B | 231.9 ns | 232.7 ns | 917.4 ns | 917.5 ns |
| 64B | 232.7 ns | 234.8 ns | 1835.4 ns | 1835.5 ns |
| 128B | 233.6 ns | 235.8 ns | 3670.9 ns | 3670.8 ns |
| 256B | 237.7 ns | 276.7 ns | 7341.6 ns | 7341.6 ns |
| 512B | 237.7 ns | 279.4 ns | 14001.4 ns | 14001.3 ns |
| 1KB | 253.7 ns | 279.1 ns | 28631.5 ns | 28631.8 ns |
| 2KB | 279.4 ns | 317.9 ns | 57276.3 ns | 57275.2 ns |
| 4KB | 365.7 ns | 381.5 ns | 114564.4 ns | 114563.6 ns |
+-------+-----------+-----------+-------------+--------------+
the generic memset_io()/memcpy_toio() build on __raw_write*(), which the
workaround promotes to store-release, so every store is individually
ordered - hence O(n) in the store count.
Tables 2 and 3 show why patch 2 is needed: the generic per-store block
writers collapse to O(n) under the workaround (4KB ~314x slower, ~115 us
vs ~366 ns), while the arm64 versions stay flat at one-barrier cost.
Changes since v3:
- Split the workaround into two patches: the erratum fix (1/2) and the
arm64 memset_io()/memcpy_toio() block writers (2/2).
- Reworked the raw MMIO write helpers to use a direct base-register
str*/stlr* alternative sequence instead of a per-write static branch.
- Covered the write-combining __iowrite{32,64}_copy() path by patching
dgh() to dmb osh on affected CPUs, keeping the contiguous STR groups
and the ordering barrier outside the copy loop; the single-element
case now uses a plain str* as well.
- Added arm64 memset_io()/memcpy_toio() so the byte/word block writers
take one trailing dmb osh instead of a per-store store-release.
- Updated the commit messages to describe the offset-addressing
trade-off.
Changes since v2:
- Reworked the raw MMIO write helpers so unaffected CPUs keep the
existing offset-addressed STR sequence, while affected CPUs use the
base-register STLR path.
- Updated the commit message to match the code changes.
- Rebased on top of the arm64 for-next/errata branch:
https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/errata
Changes since v1:
- Updated the commit message based on feedback from Vladimir Murzin.
Shanker Donthineni (2):
arm64: errata: Workaround NVIDIA Olympus device store/load ordering
arm64: io: apply the device store-release workaround once per block
write
Documentation/arch/arm64/silicon-errata.rst | 2 +
arch/arm64/Kconfig | 25 +++++++++
arch/arm64/include/asm/barrier.h | 4 +-
arch/arm64/include/asm/io.h | 36 +++++++++----
arch/arm64/kernel/cpu_errata.c | 8 +++
arch/arm64/kernel/io.c | 82 +++++++++++++++++++++++++++++
arch/arm64/tools/cpucaps | 1 +
7 files changed, 146 insertions(+), 12 deletions(-)
--
2.54.0.windows.1
^ permalink raw reply [flat|nested] 10+ messages in thread* [PATCH v4 1/2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering 2026-06-25 18:24 [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering Shanker Donthineni @ 2026-06-25 18:24 ` Shanker Donthineni 2026-06-30 14:21 ` Will Deacon 2026-06-25 18:24 ` [PATCH v4 2/2] arm64: io: apply the device store-release workaround once per block write Shanker Donthineni 2026-06-29 10:45 ` [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering Vladimir Murzin 2 siblings, 1 reply; 10+ messages in thread From: Shanker Donthineni @ 2026-06-25 18:24 UTC (permalink / raw) To: Catalin Marinas, Will Deacon, Vladimir Murzin Cc: Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel, linux-doc, Shanker Donthineni, Vikram Sethi, Jason Sequeira On systems with NVIDIA Olympus cores, a Device-nGnR* load can be observed by a peripheral before an older, non-overlapping Device-nGnR* store to the same peripheral. This breaks the program-order guarantee that software expects for Device-nGnR* accesses and can leave a peripheral in an incorrect state, as a load is observed before an earlier store takes effect. The erratum can occur only when all of the following apply: - A PE executes a Device-nGnR* store followed by a younger Device-nGnR* load. - The store is not a store-release. - The accesses target the same peripheral and do not overlap in bytes. - There is at most one intervening Device-nGnR* store in program order, and there are no intervening Device-nGnR* loads. - There is no DSB, and no DMB that orders loads, between the store and the load. - Specific micro-architectural and timing conditions occur. Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain str* to stlr* (Store-Release) on affected CPUs, which removes the "store is not a store-release" condition for every device write the kernel issues. Because writel() and writel_relaxed() are both built on __raw_writel() in asm-generic/io.h, patching the raw variants covers both the non-relaxed and relaxed APIs without touching the higher layers. Note that writel()'s own barrier sits before the store, so it does not order the store against a subsequent readl(); the store-release promotion is what provides that ordering. Like ARM64_ERRATUM_832075 on the load side, the change is gated on a new ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only activated on parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs continue to use plain str* instructions. Note: stlr* only supports base-register addressing, so the raw MMIO write helpers use a base-register str*/stlr* alternative sequence. This gives up the offset-addressed str* code generation introduced by commit d044d6ba6f02 ("arm64: io: permit offset addressing"). A static-branch implementation would add extra control flow without preserving the desired offset-addressed code generation in practice, so use a direct base-register str*/stlr* alternative instead. For the write-combining copy helpers (__iowrite{32,64}_copy()), the contiguous str* groups are kept, because replacing those stores would defeat the write-combining behaviour used to improve store performance. Rather than rely on the relaxed, no-ordering contract of these helpers - which would leave affected CPUs behaving differently from every other arm64 system and exposed to any future driver that depends on ordering across such copies - the DGH hint emitted once after each copy is promoted to dmb osh on affected CPUs. That orders the grouped stores against subsequent loads without placing a barrier in the copy loop, while unaffected CPUs keep the existing DGH hint. The single-element case of __const_memcpy_toio_aligned{32,64}() likewise uses a plain str* (instead of __raw_write*()) so it shares that str* group + DGH path rather than taking a per-store store-release. Co-developed-by: Vikram Sethi <vsethi@nvidia.com> Signed-off-by: Vikram Sethi <vsethi@nvidia.com> Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> Link: https://lore.kernel.org/all/ajVZBJgKn-5sxHD6@willie-the-truck/ --- Documentation/arch/arm64/silicon-errata.rst | 2 ++ arch/arm64/Kconfig | 25 +++++++++++++++++ arch/arm64/include/asm/barrier.h | 4 ++- arch/arm64/include/asm/io.h | 31 +++++++++++++-------- arch/arm64/kernel/cpu_errata.c | 8 ++++++ arch/arm64/tools/cpucaps | 1 + 6 files changed, 59 insertions(+), 12 deletions(-) diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst index ad04d1cdc0f0..c4137f89acef 100644 --- a/Documentation/arch/arm64/silicon-errata.rst +++ b/Documentation/arch/arm64/silicon-errata.rst @@ -298,6 +298,8 @@ stable kernels. +----------------+-----------------+-----------------+-----------------------------+ | NVIDIA | Carmel Core | N/A | NVIDIA_CARMEL_CNP_ERRATUM | +----------------+-----------------+-----------------+-----------------------------+ +| NVIDIA | Olympus core | T410-OLY-1027 | NVIDIA_OLYMPUS_1027_ERRATUM | ++----------------+-----------------+-----------------+-----------------------------+ | NVIDIA | Olympus core | T410-OLY-1029 | ARM64_ERRATUM_4118414 | +----------------+-----------------+-----------------+-----------------------------+ | NVIDIA | T241 GICv3/4.x | T241-FABRIC-4 | N/A | diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 10c69474f276..da4e66b19209 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -564,6 +564,31 @@ config ARM64_ERRATUM_832075 If unsure, say Y. +config NVIDIA_OLYMPUS_1027_ERRATUM + bool "NVIDIA Olympus: device store/load ordering erratum" + default y + help + This option adds an alternative code sequence to work around an + NVIDIA Olympus core erratum where a Device-nGnR* store can be + observed by a peripheral after a younger Device-nGnR* load to the + same peripheral. This breaks the program order that drivers rely + on for MMIO and can leave a device in an incorrect state. + + The workaround promotes the raw MMIO store helpers + (__raw_writeb/w/l/q) to Store-Release (STLR), which restores the + required ordering. Because writel() and writel_relaxed() are built + on __raw_writel(), both are covered without changes to the higher + layers. It also promotes the DGH hint used after write-combining + memcpy-to-IO sequences to a DMB, so grouped stores are ordered + against subsequent reads without placing a barrier in the copy loop. + + The fix is applied through the alternatives framework, so enabling + this option does not by itself activate the workaround: it is + patched in only when an affected CPU is detected, and is a no-op on + unaffected CPUs. + + If unsure, say Y. + config ARM64_ERRATUM_834220 bool "Cortex-A57: 834220: Stage 2 translation fault might be incorrectly reported in presence of a Stage 1 fault (rare)" depends on KVM diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h index 9495c4441a46..22792d1305aa 100644 --- a/arch/arm64/include/asm/barrier.h +++ b/arch/arm64/include/asm/barrier.h @@ -38,7 +38,9 @@ * Device-GRE attributes before the hint instruction with any memory accesses * appearing after the hint instruction. */ -#define dgh() asm volatile("hint #6" : : : "memory") +#define dgh() asm volatile(ALTERNATIVE("hint #6", "dmb osh", \ + ARM64_WORKAROUND_DEVICE_STORE_RELEASE) \ + : : : "memory") #define spec_bar() asm volatile(ALTERNATIVE("dsb nsh\nisb\n", \ SB_BARRIER_INSN"nop\n", \ diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h index 8cbd1e96fd50..69e0fa004d31 100644 --- a/arch/arm64/include/asm/io.h +++ b/arch/arm64/include/asm/io.h @@ -16,7 +16,6 @@ #include <asm/memory.h> #include <asm/early_ioremap.h> #include <asm/alternative.h> -#include <asm/cpufeature.h> #include <asm/rsi.h> /* @@ -25,29 +24,37 @@ #define __raw_writeb __raw_writeb static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr) { - volatile u8 __iomem *ptr = addr; - asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr)); + asm volatile(ALTERNATIVE("strb %w0, [%1]", + "stlrb %w0, [%1]", + ARM64_WORKAROUND_DEVICE_STORE_RELEASE) + : : "rZ" (val), "r" (addr)); } #define __raw_writew __raw_writew static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr) { - volatile u16 __iomem *ptr = addr; - asm volatile("strh %w0, %1" : : "rZ" (val), "Qo" (*ptr)); + asm volatile(ALTERNATIVE("strh %w0, [%1]", + "stlrh %w0, [%1]", + ARM64_WORKAROUND_DEVICE_STORE_RELEASE) + : : "rZ" (val), "r" (addr)); } #define __raw_writel __raw_writel static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr) { - volatile u32 __iomem *ptr = addr; - asm volatile("str %w0, %1" : : "rZ" (val), "Qo" (*ptr)); + asm volatile(ALTERNATIVE("str %w0, [%1]", + "stlr %w0, [%1]", + ARM64_WORKAROUND_DEVICE_STORE_RELEASE) + : : "rZ" (val), "r" (addr)); } #define __raw_writeq __raw_writeq static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr) { - volatile u64 __iomem *ptr = addr; - asm volatile("str %x0, %1" : : "rZ" (val), "Qo" (*ptr)); + asm volatile(ALTERNATIVE("str %x0, [%1]", + "stlr %x0, [%1]", + ARM64_WORKAROUND_DEVICE_STORE_RELEASE) + : : "rZ" (val), "r" (addr)); } #define __raw_readb __raw_readb @@ -178,7 +185,8 @@ __const_memcpy_toio_aligned32(volatile u32 __iomem *to, const u32 *from, : "rZ"(from[0]), "rZ"(from[1]), "r"(to)); break; case 1: - __raw_writel(*from, to); + asm volatile("str %w0, [%1]" + : : "rZ"(from[0]), "r"(to) : "memory"); break; default: BUILD_BUG(); @@ -235,7 +243,8 @@ __const_memcpy_toio_aligned64(volatile u64 __iomem *to, const u64 *from, : "rZ"(from[0]), "rZ"(from[1]), "r"(to)); break; case 1: - __raw_writeq(*from, to); + asm volatile("str %x0, [%1]" + : : "rZ"(from[0]), "r"(to) : "memory"); break; default: BUILD_BUG(); diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c index 4b0d5d932897..76c1f8cf1ee0 100644 --- a/arch/arm64/kernel/cpu_errata.c +++ b/arch/arm64/kernel/cpu_errata.c @@ -839,6 +839,14 @@ const struct arm64_cpu_capabilities arm64_errata[] = { ERRATA_MIDR_ALL_VERSIONS(MIDR_NVIDIA_CARMEL), }, #endif +#ifdef CONFIG_NVIDIA_OLYMPUS_1027_ERRATUM + { + /* NVIDIA Olympus core */ + .desc = "NVIDIA Olympus device load/store ordering erratum", + .capability = ARM64_WORKAROUND_DEVICE_STORE_RELEASE, + ERRATA_MIDR_ALL_VERSIONS(MIDR_NVIDIA_OLYMPUS), + }, +#endif #ifdef CONFIG_ARM64_WORKAROUND_TRBE_OVERWRITE_FILL_MODE { /* diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps index 811c2479e82d..d367257bf770 100644 --- a/arch/arm64/tools/cpucaps +++ b/arch/arm64/tools/cpucaps @@ -120,6 +120,7 @@ WORKAROUND_CAVIUM_TX2_219_PRFM WORKAROUND_CAVIUM_TX2_219_TVM WORKAROUND_CLEAN_CACHE WORKAROUND_DEVICE_LOAD_ACQUIRE +WORKAROUND_DEVICE_STORE_RELEASE WORKAROUND_NVIDIA_CARMEL_CNP WORKAROUND_PMUV3_IMPDEF_TRAPS WORKAROUND_QCOM_FALKOR_E1003 -- 2.54.0.windows.1 ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH v4 1/2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering 2026-06-25 18:24 ` [PATCH v4 1/2] arm64: errata: Workaround " Shanker Donthineni @ 2026-06-30 14:21 ` Will Deacon 0 siblings, 0 replies; 10+ messages in thread From: Will Deacon @ 2026-06-30 14:21 UTC (permalink / raw) To: Shanker Donthineni Cc: Catalin Marinas, Vladimir Murzin, Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel, linux-doc, Vikram Sethi, Jason Sequeira On Thu, Jun 25, 2026 at 01:24:24PM -0500, Shanker Donthineni wrote: > On systems with NVIDIA Olympus cores, a Device-nGnR* load can be > observed by a peripheral before an older, non-overlapping Device-nGnR* > store to the same peripheral. This breaks the program-order guarantee > that software expects for Device-nGnR* accesses and can leave a > peripheral in an incorrect state, as a load is observed before an > earlier store takes effect. > > The erratum can occur only when all of the following apply: > > - A PE executes a Device-nGnR* store followed by a younger > Device-nGnR* load. > - The store is not a store-release. > - The accesses target the same peripheral and do not overlap in bytes. > - There is at most one intervening Device-nGnR* store in program > order, and there are no intervening Device-nGnR* loads. > - There is no DSB, and no DMB that orders loads, between the store and > the load. Does that mean that a DMB LD between the store and the load would solve the problem? It would be interesting to see how your benchmarks motivating patch 2 look if you leave __raw_writeX as-is and instead add a barrier in __raw_readX before the load instruction. Will ^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH v4 2/2] arm64: io: apply the device store-release workaround once per block write 2026-06-25 18:24 [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering Shanker Donthineni 2026-06-25 18:24 ` [PATCH v4 1/2] arm64: errata: Workaround " Shanker Donthineni @ 2026-06-25 18:24 ` Shanker Donthineni 2026-06-29 10:48 ` Vladimir Murzin 2026-06-29 10:45 ` [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering Vladimir Murzin 2 siblings, 1 reply; 10+ messages in thread From: Shanker Donthineni @ 2026-06-25 18:24 UTC (permalink / raw) To: Catalin Marinas, Will Deacon, Vladimir Murzin Cc: Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel, linux-doc, Shanker Donthineni, Vikram Sethi, Jason Sequeira The generic memset_io()/memcpy_toio() are built on __raw_write*(), so on parts with the NVIDIA Olympus device store/load ordering erratum the ARM64_WORKAROUND_DEVICE_STORE_RELEASE workaround promotes every store in the block to a store-release. Each stlr* carries a barrier cost, so block MMIO becomes O(n) store-releases, making a block copy many times slower than a single ordered burst and growing with the transfer size. Provide arm64 memset_io()/memcpy_toio() that emit plain str* in the loop and order the whole block against subsequent loads with a single trailing dmb osh on affected CPUs (a no-op elsewhere, preserving the relaxed contract of these helpers). This keeps block MMIO writes at one-barrier cost rather than scaling with the transfer size. Performance (NVIDIA Olympus, write-combining MMIO to a device BAR, single PE pinned; per-call cost in ns; consecutive writes ping-pong between two buffers so repeated stores are not coalesced; iowrite64/iowrite32 = __iowrite{64,32}_copy()): Table 1 - arm64 memset_io/memcpy_toio (this patch) +-------+-----------+-----------+-----------+-------------+ | size | iowrite64 | iowrite32 | memset_io | memcpy_toio | +-------+-----------+-----------+-----------+-------------+ | 8B | 231.6 ns | 231.6 ns | 232.4 ns | 232.4 ns | | 16B | 231.7 ns | 231.9 ns | 232.7 ns | 232.6 ns | | 32B | 231.9 ns | 232.7 ns | 232.9 ns | 232.9 ns | | 64B | 232.7 ns | 235.0 ns | 233.7 ns | 233.6 ns | | 128B | 233.6 ns | 235.8 ns | 234.4 ns | 234.3 ns | | 256B | 237.7 ns | 276.8 ns | 264.0 ns | 276.7 ns | | 512B | 237.7 ns | 277.1 ns | 238.1 ns | 277.6 ns | | 1KB | 253.7 ns | 279.3 ns | 276.1 ns | 294.1 ns | | 2KB | 295.0 ns | 318.7 ns | 288.5 ns | 308.3 ns | | 4KB | 365.9 ns | 381.4 ns | 365.7 ns | 381.3 ns | +-------+-----------+-----------+-----------+-------------+ all four helpers end with a single trailing barrier (dmb osh). Table 2 - generic per-store memset_io/memcpy_toio +-------+-----------+-----------+-------------+--------------+ | size | iowrite64 | iowrite32 | memset_io | memcpy_toio | +-------+-----------+-----------+-------------+--------------+ | 8B | 231.6 ns | 231.6 ns | 229.0 ns | 229.0 ns | | 16B | 231.7 ns | 231.9 ns | 458.4 ns | 458.5 ns | | 32B | 231.9 ns | 232.7 ns | 917.4 ns | 917.5 ns | | 64B | 232.7 ns | 234.8 ns | 1835.4 ns | 1835.5 ns | | 128B | 233.6 ns | 235.8 ns | 3670.9 ns | 3670.8 ns | | 256B | 237.7 ns | 276.7 ns | 7341.6 ns | 7341.6 ns | | 512B | 237.7 ns | 279.4 ns | 14001.4 ns | 14001.3 ns | | 1KB | 253.7 ns | 279.1 ns | 28631.5 ns | 28631.8 ns | | 2KB | 279.4 ns | 317.9 ns | 57276.3 ns | 57275.2 ns | | 4KB | 365.7 ns | 381.5 ns | 114564.4 ns | 114563.6 ns | +-------+-----------+-----------+-------------+--------------+ the generic memset_io()/memcpy_toio() build on __raw_write*(), which the workaround promotes to store-release, so every store is individually ordered - hence O(n) in the store count. The arm64 versions stay flat at one-barrier cost while the generic per-store writers collapse to O(n): at 4KB ~314x slower (~115 us vs ~366 ns). Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> --- arch/arm64/include/asm/io.h | 5 +++ arch/arm64/kernel/io.c | 82 +++++++++++++++++++++++++++++++++++++ 2 files changed, 87 insertions(+) diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h index 69e0fa004d31..649503f347bc 100644 --- a/arch/arm64/include/asm/io.h +++ b/arch/arm64/include/asm/io.h @@ -266,6 +266,11 @@ __iowrite64_copy(void __iomem *to, const void *from, size_t count) } #define __iowrite64_copy __iowrite64_copy +void memset_io(volatile void __iomem *dst, int c, size_t count); +#define memset_io memset_io +void memcpy_toio(volatile void __iomem *dst, const void *src, size_t count); +#define memcpy_toio memcpy_toio + /* * I/O memory mapping functions. */ diff --git a/arch/arm64/kernel/io.c b/arch/arm64/kernel/io.c index fe86ada23c7d..b5fd9ee6d9eb 100644 --- a/arch/arm64/kernel/io.c +++ b/arch/arm64/kernel/io.c @@ -5,9 +5,91 @@ * Copyright (C) 2012 ARM Ltd. */ +#include <linux/align.h> #include <linux/export.h> #include <linux/types.h> #include <linux/io.h> +#include <linux/unaligned.h> + +#include <asm/alternative.h> + +/* + * ARM64_WORKAROUND_DEVICE_STORE_RELEASE promotes every raw MMIO store + * (__raw_write*()) to a store-release on affected CPUs. The generic + * memset_io()/memcpy_toio() are built on those helpers, so the workaround would + * emit one store-release per element and turn a block write into O(n) ordered + * stores - far more costly than the single barrier a block actually needs. + * + * Provide arm64 versions that emit plain STR in the loop and order the whole + * block against subsequent loads with one trailing DMB OSH, patched in only on + * affected CPUs (a no-op elsewhere, so the relaxed contract of these helpers is + * preserved). + * + * This capability is currently enabled only for the NVIDIA Olympus device + * store/load ordering erratum, where a Device-nGnR* load may be observed before + * an older, non-overlapping Device-nGnR* store to the same peripheral. + */ +static __always_inline void iomem_block_store_barrier(void) +{ + asm volatile(ALTERNATIVE("nop", "dmb osh", + ARM64_WORKAROUND_DEVICE_STORE_RELEASE) + : : : "memory"); +} + +void memset_io(volatile void __iomem *dst, int c, size_t count) +{ + u64 qc = (u8)c; + + qc *= ~0ULL / 0xff; + + while (count && !IS_ALIGNED((__force unsigned long)dst, sizeof(u64))) { + asm volatile("strb %w0, [%1]" : : "rZ"((u8)c), "r"(dst) : "memory"); + dst++; + count--; + } + while (count >= sizeof(u64)) { + asm volatile("str %x0, [%1]" : : "rZ"(qc), "r"(dst) : "memory"); + dst += sizeof(u64); + count -= sizeof(u64); + } + while (count) { + asm volatile("strb %w0, [%1]" : : "rZ"((u8)c), "r"(dst) : "memory"); + dst++; + count--; + } + + iomem_block_store_barrier(); +} +EXPORT_SYMBOL(memset_io); + +void memcpy_toio(volatile void __iomem *dst, const void *src, size_t count) +{ + while (count && !IS_ALIGNED((__force unsigned long)dst, sizeof(u64))) { + asm volatile("strb %w0, [%1]" + : : "rZ"(*(const u8 *)src), "r"(dst) : "memory"); + src++; + dst++; + count--; + } + while (count >= sizeof(u64)) { + asm volatile("str %x0, [%1]" + : : "rZ"(get_unaligned((const u64 *)src)), "r"(dst) + : "memory"); + src += sizeof(u64); + dst += sizeof(u64); + count -= sizeof(u64); + } + while (count) { + asm volatile("strb %w0, [%1]" + : : "rZ"(*(const u8 *)src), "r"(dst) : "memory"); + src++; + dst++; + count--; + } + + iomem_block_store_barrier(); +} +EXPORT_SYMBOL(memcpy_toio); /* * This generates a memcpy that works on a from/to address which is aligned to -- 2.54.0.windows.1 ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH v4 2/2] arm64: io: apply the device store-release workaround once per block write 2026-06-25 18:24 ` [PATCH v4 2/2] arm64: io: apply the device store-release workaround once per block write Shanker Donthineni @ 2026-06-29 10:48 ` Vladimir Murzin 2026-06-29 23:09 ` Shanker Donthineni 0 siblings, 1 reply; 10+ messages in thread From: Vladimir Murzin @ 2026-06-29 10:48 UTC (permalink / raw) To: Shanker Donthineni, Catalin Marinas, Will Deacon Cc: Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel, linux-doc, Vikram Sethi, Jason Sequeira Hi, On 6/25/26 19:24, Shanker Donthineni wrote: > The generic memset_io()/memcpy_toio() are built on __raw_write*(), so on > parts with the NVIDIA Olympus device store/load ordering erratum the > ARM64_WORKAROUND_DEVICE_STORE_RELEASE workaround promotes every store in > the block to a store-release. Each stlr* carries a barrier cost, so block > MMIO becomes O(n) store-releases, making a block copy many times slower > than a single ordered burst and growing with the transfer size. > > Provide arm64 memset_io()/memcpy_toio() that emit plain str* in the loop > and order the whole block against subsequent loads with a single > trailing dmb osh on affected CPUs (a no-op elsewhere, preserving the > relaxed contract of these helpers). This keeps block MMIO writes at > one-barrier cost rather than scaling with the transfer size. > > Performance (NVIDIA Olympus, write-combining MMIO to a device BAR, single > PE pinned; per-call cost in ns; consecutive writes ping-pong between two > buffers so repeated stores are not coalesced; iowrite64/iowrite32 = > __iowrite{64,32}_copy()): > > Table 1 - arm64 memset_io/memcpy_toio (this patch) > +-------+-----------+-----------+-----------+-------------+ > | size | iowrite64 | iowrite32 | memset_io | memcpy_toio | > +-------+-----------+-----------+-----------+-------------+ > | 8B | 231.6 ns | 231.6 ns | 232.4 ns | 232.4 ns | > | 16B | 231.7 ns | 231.9 ns | 232.7 ns | 232.6 ns | > | 32B | 231.9 ns | 232.7 ns | 232.9 ns | 232.9 ns | > | 64B | 232.7 ns | 235.0 ns | 233.7 ns | 233.6 ns | > | 128B | 233.6 ns | 235.8 ns | 234.4 ns | 234.3 ns | > | 256B | 237.7 ns | 276.8 ns | 264.0 ns | 276.7 ns | > | 512B | 237.7 ns | 277.1 ns | 238.1 ns | 277.6 ns | > | 1KB | 253.7 ns | 279.3 ns | 276.1 ns | 294.1 ns | > | 2KB | 295.0 ns | 318.7 ns | 288.5 ns | 308.3 ns | > | 4KB | 365.9 ns | 381.4 ns | 365.7 ns | 381.3 ns | > +-------+-----------+-----------+-----------+-------------+ > all four helpers end with a single trailing barrier (dmb osh). > > Table 2 - generic per-store memset_io/memcpy_toio > +-------+-----------+-----------+-------------+--------------+ > | size | iowrite64 | iowrite32 | memset_io | memcpy_toio | > +-------+-----------+-----------+-------------+--------------+ > | 8B | 231.6 ns | 231.6 ns | 229.0 ns | 229.0 ns | > | 16B | 231.7 ns | 231.9 ns | 458.4 ns | 458.5 ns | > | 32B | 231.9 ns | 232.7 ns | 917.4 ns | 917.5 ns | > | 64B | 232.7 ns | 234.8 ns | 1835.4 ns | 1835.5 ns | > | 128B | 233.6 ns | 235.8 ns | 3670.9 ns | 3670.8 ns | > | 256B | 237.7 ns | 276.7 ns | 7341.6 ns | 7341.6 ns | > | 512B | 237.7 ns | 279.4 ns | 14001.4 ns | 14001.3 ns | > | 1KB | 253.7 ns | 279.1 ns | 28631.5 ns | 28631.8 ns | > | 2KB | 279.4 ns | 317.9 ns | 57276.3 ns | 57275.2 ns | > | 4KB | 365.7 ns | 381.5 ns | 114564.4 ns | 114563.6 ns | > +-------+-----------+-----------+-------------+--------------+ > the generic memset_io()/memcpy_toio() build on __raw_write*(), which the > workaround promotes to store-release, so every store is individually > ordered - hence O(n) in the store count. > > The arm64 versions stay flat at one-barrier cost while the generic > per-store writers collapse to O(n): at 4KB ~314x slower (~115 us vs > ~366 ns). > > Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> > --- > arch/arm64/include/asm/io.h | 5 +++ > arch/arm64/kernel/io.c | 82 +++++++++++++++++++++++++++++++++++++ > 2 files changed, 87 insertions(+) > > diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h > index 69e0fa004d31..649503f347bc 100644 > --- a/arch/arm64/include/asm/io.h > +++ b/arch/arm64/include/asm/io.h > @@ -266,6 +266,11 @@ __iowrite64_copy(void __iomem *to, const void *from, size_t count) > } > #define __iowrite64_copy __iowrite64_copy > > +void memset_io(volatile void __iomem *dst, int c, size_t count); > +#define memset_io memset_io > +void memcpy_toio(volatile void __iomem *dst, const void *src, size_t count); > +#define memcpy_toio memcpy_toio > + > /* > * I/O memory mapping functions. > */ > diff --git a/arch/arm64/kernel/io.c b/arch/arm64/kernel/io.c > index fe86ada23c7d..b5fd9ee6d9eb 100644 > --- a/arch/arm64/kernel/io.c > +++ b/arch/arm64/kernel/io.c > @@ -5,9 +5,91 @@ > * Copyright (C) 2012 ARM Ltd. > */ > > +#include <linux/align.h> > #include <linux/export.h> > #include <linux/types.h> > #include <linux/io.h> > +#include <linux/unaligned.h> > + > +#include <asm/alternative.h> > + > +/* > + * ARM64_WORKAROUND_DEVICE_STORE_RELEASE promotes every raw MMIO store > + * (__raw_write*()) to a store-release on affected CPUs. The generic > + * memset_io()/memcpy_toio() are built on those helpers, so the workaround would > + * emit one store-release per element and turn a block write into O(n) ordered > + * stores - far more costly than the single barrier a block actually needs. > + * > + * Provide arm64 versions that emit plain STR in the loop and order the whole > + * block against subsequent loads with one trailing DMB OSH, patched in only on > + * affected CPUs (a no-op elsewhere, so the relaxed contract of these helpers is > + * preserved). > + * > + * This capability is currently enabled only for the NVIDIA Olympus device > + * store/load ordering erratum, where a Device-nGnR* load may be observed before > + * an older, non-overlapping Device-nGnR* store to the same peripheral. > + */ > +static __always_inline void iomem_block_store_barrier(void) > +{ > + asm volatile(ALTERNATIVE("nop", "dmb osh", > + ARM64_WORKAROUND_DEVICE_STORE_RELEASE) > + : : : "memory"); > +} > + > +void memset_io(volatile void __iomem *dst, int c, size_t count) > +{ > + u64 qc = (u8)c; > + > + qc *= ~0ULL / 0xff; > + > + while (count && !IS_ALIGNED((__force unsigned long)dst, sizeof(u64))) { > + asm volatile("strb %w0, [%1]" : : "rZ"((u8)c), "r"(dst) : "memory"); > + dst++; > + count--; > + } > + while (count >= sizeof(u64)) { > + asm volatile("str %x0, [%1]" : : "rZ"(qc), "r"(dst) : "memory"); > + dst += sizeof(u64); > + count -= sizeof(u64); > + } > + while (count) { > + asm volatile("strb %w0, [%1]" : : "rZ"((u8)c), "r"(dst) : "memory"); > + dst++; > + count--; > + } > + > + iomem_block_store_barrier(); > +} > +EXPORT_SYMBOL(memset_io); > + > +void memcpy_toio(volatile void __iomem *dst, const void *src, size_t count) > +{ > + while (count && !IS_ALIGNED((__force unsigned long)dst, sizeof(u64))) { > + asm volatile("strb %w0, [%1]" > + : : "rZ"(*(const u8 *)src), "r"(dst) : "memory"); > + src++; > + dst++; > + count--; > + } > + while (count >= sizeof(u64)) { > + asm volatile("str %x0, [%1]" > + : : "rZ"(get_unaligned((const u64 *)src)), "r"(dst) Why do we need get_unaligned() here? I understand this came from the generic implementation, where it needs to handle architectures that do not support unaligned accesses. But IIUC this is not an issue for arm64, and there was no special handling in memcpy_toio() before 0110feaaf6d0 ("arm64: Use new fallback IO memcpy/memset"). Am I missing something? > + : "memory"); > + src += sizeof(u64); > + dst += sizeof(u64); > + count -= sizeof(u64); > + } > + while (count) { > + asm volatile("strb %w0, [%1]" > + : : "rZ"(*(const u8 *)src), "r"(dst) : "memory"); > + src++; > + dst++; > + count--; > + } > + > + iomem_block_store_barrier(); It is perhaps a matter of taste, but having the inline assembly here (and in memset_io()) might make the code clearer. To a casual reader, it would be obvious that the barrier is not guaranteed and is only applicable to ARM64_WORKAROUND_DEVICE_STORE_RELEASE, without having to jump back and forth through the code. Obliviously maintainers might have different preference ;) Cheers Vladimir > +} > +EXPORT_SYMBOL(memcpy_toio); > > /* > * This generates a memcpy that works on a from/to address which is aligned to > -- 2.54.0.windows.1 > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v4 2/2] arm64: io: apply the device store-release workaround once per block write 2026-06-29 10:48 ` Vladimir Murzin @ 2026-06-29 23:09 ` Shanker Donthineni 2026-06-30 14:17 ` Will Deacon 0 siblings, 1 reply; 10+ messages in thread From: Shanker Donthineni @ 2026-06-29 23:09 UTC (permalink / raw) To: Vladimir Murzin, Catalin Marinas, Will Deacon Cc: Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel, linux-doc, Vikram Sethi, Jason Sequeira Hi Vladimir, On 6/29/2026 5:48 AM, Vladimir Murzin wrote: > External email: Use caution opening links or attachments > > > Hi, > > On 6/25/26 19:24, Shanker Donthineni wrote: >> The generic memset_io()/memcpy_toio() are built on __raw_write*(), so on >> parts with the NVIDIA Olympus device store/load ordering erratum the >> ARM64_WORKAROUND_DEVICE_STORE_RELEASE workaround promotes every store in >> the block to a store-release. Each stlr* carries a barrier cost, so block >> MMIO becomes O(n) store-releases, making a block copy many times slower >> than a single ordered burst and growing with the transfer size. >> >> Provide arm64 memset_io()/memcpy_toio() that emit plain str* in the loop >> and order the whole block against subsequent loads with a single >> trailing dmb osh on affected CPUs (a no-op elsewhere, preserving the >> relaxed contract of these helpers). This keeps block MMIO writes at >> one-barrier cost rather than scaling with the transfer size. >> >> Performance (NVIDIA Olympus, write-combining MMIO to a device BAR, single >> PE pinned; per-call cost in ns; consecutive writes ping-pong between two >> buffers so repeated stores are not coalesced; iowrite64/iowrite32 = >> __iowrite{64,32}_copy()): >> >> Table 1 - arm64 memset_io/memcpy_toio (this patch) >> +-------+-----------+-----------+-----------+-------------+ >> | size | iowrite64 | iowrite32 | memset_io | memcpy_toio | >> +-------+-----------+-----------+-----------+-------------+ >> | 8B | 231.6 ns | 231.6 ns | 232.4 ns | 232.4 ns | >> | 16B | 231.7 ns | 231.9 ns | 232.7 ns | 232.6 ns | >> | 32B | 231.9 ns | 232.7 ns | 232.9 ns | 232.9 ns | >> | 64B | 232.7 ns | 235.0 ns | 233.7 ns | 233.6 ns | >> | 128B | 233.6 ns | 235.8 ns | 234.4 ns | 234.3 ns | >> | 256B | 237.7 ns | 276.8 ns | 264.0 ns | 276.7 ns | >> | 512B | 237.7 ns | 277.1 ns | 238.1 ns | 277.6 ns | >> | 1KB | 253.7 ns | 279.3 ns | 276.1 ns | 294.1 ns | >> | 2KB | 295.0 ns | 318.7 ns | 288.5 ns | 308.3 ns | >> | 4KB | 365.9 ns | 381.4 ns | 365.7 ns | 381.3 ns | >> +-------+-----------+-----------+-----------+-------------+ >> all four helpers end with a single trailing barrier (dmb osh). >> >> Table 2 - generic per-store memset_io/memcpy_toio >> +-------+-----------+-----------+-------------+--------------+ >> | size | iowrite64 | iowrite32 | memset_io | memcpy_toio | >> +-------+-----------+-----------+-------------+--------------+ >> | 8B | 231.6 ns | 231.6 ns | 229.0 ns | 229.0 ns | >> | 16B | 231.7 ns | 231.9 ns | 458.4 ns | 458.5 ns | >> | 32B | 231.9 ns | 232.7 ns | 917.4 ns | 917.5 ns | >> | 64B | 232.7 ns | 234.8 ns | 1835.4 ns | 1835.5 ns | >> | 128B | 233.6 ns | 235.8 ns | 3670.9 ns | 3670.8 ns | >> | 256B | 237.7 ns | 276.7 ns | 7341.6 ns | 7341.6 ns | >> | 512B | 237.7 ns | 279.4 ns | 14001.4 ns | 14001.3 ns | >> | 1KB | 253.7 ns | 279.1 ns | 28631.5 ns | 28631.8 ns | >> | 2KB | 279.4 ns | 317.9 ns | 57276.3 ns | 57275.2 ns | >> | 4KB | 365.7 ns | 381.5 ns | 114564.4 ns | 114563.6 ns | >> +-------+-----------+-----------+-------------+--------------+ >> the generic memset_io()/memcpy_toio() build on __raw_write*(), which the >> workaround promotes to store-release, so every store is individually >> ordered - hence O(n) in the store count. >> >> The arm64 versions stay flat at one-barrier cost while the generic >> per-store writers collapse to O(n): at 4KB ~314x slower (~115 us vs >> ~366 ns). >> >> Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> >> --- >> arch/arm64/include/asm/io.h | 5 +++ >> arch/arm64/kernel/io.c | 82 +++++++++++++++++++++++++++++++++++++ >> 2 files changed, 87 insertions(+) >> >> diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h >> index 69e0fa004d31..649503f347bc 100644 >> --- a/arch/arm64/include/asm/io.h >> +++ b/arch/arm64/include/asm/io.h >> @@ -266,6 +266,11 @@ __iowrite64_copy(void __iomem *to, const void *from, size_t count) >> } >> #define __iowrite64_copy __iowrite64_copy >> >> +void memset_io(volatile void __iomem *dst, int c, size_t count); >> +#define memset_io memset_io >> +void memcpy_toio(volatile void __iomem *dst, const void *src, size_t count); >> +#define memcpy_toio memcpy_toio >> + >> /* >> * I/O memory mapping functions. >> */ >> diff --git a/arch/arm64/kernel/io.c b/arch/arm64/kernel/io.c >> index fe86ada23c7d..b5fd9ee6d9eb 100644 >> --- a/arch/arm64/kernel/io.c >> +++ b/arch/arm64/kernel/io.c >> @@ -5,9 +5,91 @@ >> * Copyright (C) 2012 ARM Ltd. >> */ >> >> +#include <linux/align.h> >> #include <linux/export.h> >> #include <linux/types.h> >> #include <linux/io.h> >> +#include <linux/unaligned.h> >> + >> +#include <asm/alternative.h> >> + >> +/* >> + * ARM64_WORKAROUND_DEVICE_STORE_RELEASE promotes every raw MMIO store >> + * (__raw_write*()) to a store-release on affected CPUs. The generic >> + * memset_io()/memcpy_toio() are built on those helpers, so the workaround would >> + * emit one store-release per element and turn a block write into O(n) ordered >> + * stores - far more costly than the single barrier a block actually needs. >> + * >> + * Provide arm64 versions that emit plain STR in the loop and order the whole >> + * block against subsequent loads with one trailing DMB OSH, patched in only on >> + * affected CPUs (a no-op elsewhere, so the relaxed contract of these helpers is >> + * preserved). >> + * >> + * This capability is currently enabled only for the NVIDIA Olympus device >> + * store/load ordering erratum, where a Device-nGnR* load may be observed before >> + * an older, non-overlapping Device-nGnR* store to the same peripheral. >> + */ >> +static __always_inline void iomem_block_store_barrier(void) >> +{ >> + asm volatile(ALTERNATIVE("nop", "dmb osh", >> + ARM64_WORKAROUND_DEVICE_STORE_RELEASE) >> + : : : "memory"); >> +} >> + >> +void memset_io(volatile void __iomem *dst, int c, size_t count) >> +{ >> + u64 qc = (u8)c; >> + >> + qc *= ~0ULL / 0xff; >> + >> + while (count && !IS_ALIGNED((__force unsigned long)dst, sizeof(u64))) { >> + asm volatile("strb %w0, [%1]" : : "rZ"((u8)c), "r"(dst) : "memory"); >> + dst++; >> + count--; >> + } >> + while (count >= sizeof(u64)) { >> + asm volatile("str %x0, [%1]" : : "rZ"(qc), "r"(dst) : "memory"); >> + dst += sizeof(u64); >> + count -= sizeof(u64); >> + } >> + while (count) { >> + asm volatile("strb %w0, [%1]" : : "rZ"((u8)c), "r"(dst) : "memory"); >> + dst++; >> + count--; >> + } >> + >> + iomem_block_store_barrier(); >> +} >> +EXPORT_SYMBOL(memset_io); >> + >> +void memcpy_toio(volatile void __iomem *dst, const void *src, size_t count) >> +{ >> + while (count && !IS_ALIGNED((__force unsigned long)dst, sizeof(u64))) { >> + asm volatile("strb %w0, [%1]" >> + : : "rZ"(*(const u8 *)src), "r"(dst) : "memory"); >> + src++; >> + dst++; >> + count--; >> + } >> + while (count >= sizeof(u64)) { >> + asm volatile("str %x0, [%1]" >> + : : "rZ"(get_unaligned((const u64 *)src)), "r"(dst) > Why do we need get_unaligned() here? I understand this came from > the generic implementation, where it needs to handle architectures > that do not support unaligned accesses. But IIUC this is not an > issue for arm64, and there was no special handling in memcpy_toio() > before 0110feaaf6d0 ("arm64: Use new fallback IO memcpy/memset"). > Am I missing something? Thanks for the review. I used get_unaligned() because I was trying to keep the arm64 implementation as close as possible to the generic memcpy_toio() implementation in lib/iomem_copy.c. However, you are right that before commit 0110feaaf6d0 (“arm64: Use new fallback IO memcpy/memset”), the arm64 implementation used a direct u64 load and did not explicitly handle source alignment. I can restore the previous arm64 form in v5 if that is preferred. >> + : "memory"); >> + src += sizeof(u64); >> + dst += sizeof(u64); >> + count -= sizeof(u64); >> + } >> + while (count) { >> + asm volatile("strb %w0, [%1]" >> + : : "rZ"(*(const u8 *)src), "r"(dst) : "memory"); >> + src++; >> + dst++; >> + count--; >> + } >> + >> + iomem_block_store_barrier(); > It is perhaps a matter of taste, but having the inline assembly > here (and in memset_io()) might make the code clearer. To a > casual reader, it would be obvious that the barrier is not > guaranteed and is only applicable to ARM64_WORKAROUND_DEVICE_STORE_RELEASE, > without having to jump back and forth through the code. > > Obliviously maintainers might have different preference ;) Regarding the barrier, iomem_block_store_barrier() is declared static __always_inline, so it does not add a function call. The nop/dmb osh alternative is emitted directly in each caller. I used the helper to avoid duplicating the alternative sequence. I understand that placing the assembly directly in both functions could make its conditional nature more obvious. I do not have a strong preference and am happy to follow Will’s and Catalin’s preference here. -Shanker ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v4 2/2] arm64: io: apply the device store-release workaround once per block write 2026-06-29 23:09 ` Shanker Donthineni @ 2026-06-30 14:17 ` Will Deacon 0 siblings, 0 replies; 10+ messages in thread From: Will Deacon @ 2026-06-30 14:17 UTC (permalink / raw) To: Shanker Donthineni Cc: Vladimir Murzin, Catalin Marinas, Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel, linux-doc, Vikram Sethi, Jason Sequeira On Mon, Jun 29, 2026 at 06:09:11PM -0500, Shanker Donthineni wrote: > On 6/29/2026 5:48 AM, Vladimir Murzin wrote: > > > + : "memory"); > > > + src += sizeof(u64); > > > + dst += sizeof(u64); > > > + count -= sizeof(u64); > > > + } > > > + while (count) { > > > + asm volatile("strb %w0, [%1]" > > > + : : "rZ"(*(const u8 *)src), "r"(dst) : "memory"); > > > + src++; > > > + dst++; > > > + count--; > > > + } > > > + > > > + iomem_block_store_barrier(); > > It is perhaps a matter of taste, but having the inline assembly > > here (and in memset_io()) might make the code clearer. To a > > casual reader, it would be obvious that the barrier is not > > guaranteed and is only applicable to ARM64_WORKAROUND_DEVICE_STORE_RELEASE, > > without having to jump back and forth through the code. > > > > Obliviously maintainers might have different preference ;) Oblivious maintainer here :) > Regarding the barrier, iomem_block_store_barrier() is declared > static __always_inline, so it does not add a function call. The nop/dmb > osh alternative is emitted directly in each caller. I used the helper to > avoid duplicating the alternative sequence. > > I understand that placing the assembly directly in both functions could > make its conditional nature more obvious. I do not have a strong preference > and am happy to follow Will’s and Catalin’s preference here. I agree with Vladimir that it would be clearer to inline the conditional barrier. It would be even better if we could avoid having to duplicate this code to start with, but I can't immediately think of a better alternative. Will ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering 2026-06-25 18:24 [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering Shanker Donthineni 2026-06-25 18:24 ` [PATCH v4 1/2] arm64: errata: Workaround " Shanker Donthineni 2026-06-25 18:24 ` [PATCH v4 2/2] arm64: io: apply the device store-release workaround once per block write Shanker Donthineni @ 2026-06-29 10:45 ` Vladimir Murzin 2026-06-29 23:08 ` Shanker Donthineni 2 siblings, 1 reply; 10+ messages in thread From: Vladimir Murzin @ 2026-06-29 10:45 UTC (permalink / raw) To: Shanker Donthineni, Catalin Marinas, Will Deacon Cc: Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel, linux-doc, Vikram Sethi, Jason Sequeira Hi, On 6/25/26 19:24, Shanker Donthineni wrote: > This series works around the NVIDIA Olympus device store/load ordering > erratum (T410-OLY-1027): a Device-nGnR* load can be observed by a > peripheral before an older, non-overlapping Device-nGnR* store to the > same peripheral, breaking the program order that drivers rely on for > MMIO and potentially leaving a device in an incorrect state. > > Patch 1 adds the workaround. It promotes the raw MMIO store helpers > (__raw_writeb/w/l/q, and therefore writel()/writel_relaxed()) to > store-release on affected CPUs, and promotes the trailing DGH of the > write-combining __iowrite{32,64}_copy() helpers to dmb osh. Everything is > gated on a new ARM64_WORKAROUND_DEVICE_STORE_RELEASE cpucap and patched > in only on affected parts, so it is a no-op elsewhere. > > Patch 2 provides arm64 memset_io()/memcpy_toio(). The generic versions > are built on __raw_write*(), so patch 1 would promote every store in a > block to a store-release; as each STLR drains the write-combining buffer, > block MMIO becomes O(n) store-releases. The arm64 versions emit plain > STR in the loop and order the whole block with a single trailing dmb osh, > keeping block MMIO at one-barrier cost. > > Performance: NVIDIA Olympus, write-combining MMIO to a device BAR, single > PE pinned; per-call cost in ns. Consecutive writes ping-pong between two > buffers so repeated stores are not coalesced. iowrite64/iowrite32 = > __iowrite{64,32}_copy(). > > Table 1 - workaround off (CONFIG_NVIDIA_OLYMPUS_1027_ERRATUM=n) > +-------+-----------+-----------+-----------+-------------+ > | size | iowrite64 | iowrite32 | memset_io | memcpy_toio | > +-------+-----------+-----------+-----------+-------------+ > | 8B | 67.9 ns | 67.8 ns | 3.6 ns | 3.6 ns | > | 16B | 67.9 ns | 67.8 ns | 4.0 ns | 4.0 ns | > | 32B | 67.9 ns | 67.9 ns | 4.6 ns | 4.6 ns | > | 64B | 69.1 ns | 69.1 ns | 69.1 ns | 69.0 ns | > | 128B | 138.3 ns | 138.3 ns | 138.4 ns | 138.3 ns | > | 256B | 276.6 ns | 276.6 ns | 276.6 ns | 276.7 ns | > | 512B | 276.6 ns | 276.5 ns | 276.6 ns | 276.6 ns | > | 1KB | 276.6 ns | 278.4 ns | 276.6 ns | 276.6 ns | > | 2KB | 278.4 ns | 278.4 ns | 275.9 ns | 276.6 ns | > | 4KB | 365.7 ns | 365.7 ns | 365.7 ns | 365.7 ns | > +-------+-----------+-----------+-----------+-------------+ > relaxed/no-flush: memset_io()/memcpy_toio() issue plain stores with no > trailing dgh() or barrier, unlike __iowrite*_copy() which ends with dgh(). > > Table 2 - workaround on, arm64 memset_io/memcpy_toio (this series) > +-------+-----------+-----------+-----------+-------------+ > | size | iowrite64 | iowrite32 | memset_io | memcpy_toio | > +-------+-----------+-----------+-----------+-------------+ > | 8B | 231.6 ns | 231.6 ns | 232.4 ns | 232.4 ns | > | 16B | 231.7 ns | 231.9 ns | 232.7 ns | 232.6 ns | > | 32B | 231.9 ns | 232.7 ns | 232.9 ns | 232.9 ns | > | 64B | 232.7 ns | 235.0 ns | 233.7 ns | 233.6 ns | > | 128B | 233.6 ns | 235.8 ns | 234.4 ns | 234.3 ns | > | 256B | 237.7 ns | 276.8 ns | 264.0 ns | 276.7 ns | > | 512B | 237.7 ns | 277.1 ns | 238.1 ns | 277.6 ns | > | 1KB | 253.7 ns | 279.3 ns | 276.1 ns | 294.1 ns | > | 2KB | 295.0 ns | 318.7 ns | 288.5 ns | 308.3 ns | > | 4KB | 365.9 ns | 381.4 ns | 365.7 ns | 381.3 ns | > +-------+-----------+-----------+-----------+-------------+ > all four helpers end with a single trailing barrier (dmb osh). > > Table 3 - workaround on, generic per-store memset_io/memcpy_toio > +-------+-----------+-----------+-------------+--------------+ > | size | iowrite64 | iowrite32 | memset_io | memcpy_toio | > +-------+-----------+-----------+-------------+--------------+ > | 8B | 231.6 ns | 231.6 ns | 229.0 ns | 229.0 ns | > | 16B | 231.7 ns | 231.9 ns | 458.4 ns | 458.5 ns | > | 32B | 231.9 ns | 232.7 ns | 917.4 ns | 917.5 ns | > | 64B | 232.7 ns | 234.8 ns | 1835.4 ns | 1835.5 ns | > | 128B | 233.6 ns | 235.8 ns | 3670.9 ns | 3670.8 ns | > | 256B | 237.7 ns | 276.7 ns | 7341.6 ns | 7341.6 ns | > | 512B | 237.7 ns | 279.4 ns | 14001.4 ns | 14001.3 ns | > | 1KB | 253.7 ns | 279.1 ns | 28631.5 ns | 28631.8 ns | > | 2KB | 279.4 ns | 317.9 ns | 57276.3 ns | 57275.2 ns | > | 4KB | 365.7 ns | 381.5 ns | 114564.4 ns | 114563.6 ns | > +-------+-----------+-----------+-------------+--------------+ > the generic memset_io()/memcpy_toio() build on __raw_write*(), which the > workaround promotes to store-release, so every store is individually > ordered - hence O(n) in the store count. > > Tables 2 and 3 show why patch 2 is needed: the generic per-store block > writers collapse to O(n) under the workaround (4KB ~314x slower, ~115 us > vs ~366 ns), while the arm64 versions stay flat at one-barrier cost. That's interesting. With the way the patch set is structured, it now looks like: 1. Fix the erratum, but cause a performance regression. 2. Restore the performance regression and (re)apply the erratum workaround. Would it make sense to avoid introducing the performance regression in the first place by structuring the patch set slightly differently? 1. (Re)introduce arm64 memset_io()/memcpy_toio(). 2. Fix the erratum once for all What do you reckon? Cheers Vladimir ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering 2026-06-29 10:45 ` [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering Vladimir Murzin @ 2026-06-29 23:08 ` Shanker Donthineni 2026-06-30 13:53 ` Will Deacon 0 siblings, 1 reply; 10+ messages in thread From: Shanker Donthineni @ 2026-06-29 23:08 UTC (permalink / raw) To: Vladimir Murzin, Catalin Marinas, Will Deacon Cc: Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel, linux-doc, Vikram Sethi, Jason Sequeira Hi Vladimir, On 6/29/2026 5:45 AM, Vladimir Murzin wrote: > External email: Use caution opening links or attachments > > > Hi, > > On 6/25/26 19:24, Shanker Donthineni wrote: >> This series works around the NVIDIA Olympus device store/load ordering >> erratum (T410-OLY-1027): a Device-nGnR* load can be observed by a >> peripheral before an older, non-overlapping Device-nGnR* store to the >> same peripheral, breaking the program order that drivers rely on for >> MMIO and potentially leaving a device in an incorrect state. >> >> Patch 1 adds the workaround. It promotes the raw MMIO store helpers >> (__raw_writeb/w/l/q, and therefore writel()/writel_relaxed()) to >> store-release on affected CPUs, and promotes the trailing DGH of the >> write-combining __iowrite{32,64}_copy() helpers to dmb osh. Everything is >> gated on a new ARM64_WORKAROUND_DEVICE_STORE_RELEASE cpucap and patched >> in only on affected parts, so it is a no-op elsewhere. >> >> Patch 2 provides arm64 memset_io()/memcpy_toio(). The generic versions >> are built on __raw_write*(), so patch 1 would promote every store in a >> block to a store-release; as each STLR drains the write-combining buffer, >> block MMIO becomes O(n) store-releases. The arm64 versions emit plain >> STR in the loop and order the whole block with a single trailing dmb osh, >> keeping block MMIO at one-barrier cost. >> >> Performance: NVIDIA Olympus, write-combining MMIO to a device BAR, single >> PE pinned; per-call cost in ns. Consecutive writes ping-pong between two >> buffers so repeated stores are not coalesced. iowrite64/iowrite32 = >> __iowrite{64,32}_copy(). >> >> Table 1 - workaround off (CONFIG_NVIDIA_OLYMPUS_1027_ERRATUM=n) >> +-------+-----------+-----------+-----------+-------------+ >> | size | iowrite64 | iowrite32 | memset_io | memcpy_toio | >> +-------+-----------+-----------+-----------+-------------+ >> | 8B | 67.9 ns | 67.8 ns | 3.6 ns | 3.6 ns | >> | 16B | 67.9 ns | 67.8 ns | 4.0 ns | 4.0 ns | >> | 32B | 67.9 ns | 67.9 ns | 4.6 ns | 4.6 ns | >> | 64B | 69.1 ns | 69.1 ns | 69.1 ns | 69.0 ns | >> | 128B | 138.3 ns | 138.3 ns | 138.4 ns | 138.3 ns | >> | 256B | 276.6 ns | 276.6 ns | 276.6 ns | 276.7 ns | >> | 512B | 276.6 ns | 276.5 ns | 276.6 ns | 276.6 ns | >> | 1KB | 276.6 ns | 278.4 ns | 276.6 ns | 276.6 ns | >> | 2KB | 278.4 ns | 278.4 ns | 275.9 ns | 276.6 ns | >> | 4KB | 365.7 ns | 365.7 ns | 365.7 ns | 365.7 ns | >> +-------+-----------+-----------+-----------+-------------+ >> relaxed/no-flush: memset_io()/memcpy_toio() issue plain stores with no >> trailing dgh() or barrier, unlike __iowrite*_copy() which ends with dgh(). >> >> Table 2 - workaround on, arm64 memset_io/memcpy_toio (this series) >> +-------+-----------+-----------+-----------+-------------+ >> | size | iowrite64 | iowrite32 | memset_io | memcpy_toio | >> +-------+-----------+-----------+-----------+-------------+ >> | 8B | 231.6 ns | 231.6 ns | 232.4 ns | 232.4 ns | >> | 16B | 231.7 ns | 231.9 ns | 232.7 ns | 232.6 ns | >> | 32B | 231.9 ns | 232.7 ns | 232.9 ns | 232.9 ns | >> | 64B | 232.7 ns | 235.0 ns | 233.7 ns | 233.6 ns | >> | 128B | 233.6 ns | 235.8 ns | 234.4 ns | 234.3 ns | >> | 256B | 237.7 ns | 276.8 ns | 264.0 ns | 276.7 ns | >> | 512B | 237.7 ns | 277.1 ns | 238.1 ns | 277.6 ns | >> | 1KB | 253.7 ns | 279.3 ns | 276.1 ns | 294.1 ns | >> | 2KB | 295.0 ns | 318.7 ns | 288.5 ns | 308.3 ns | >> | 4KB | 365.9 ns | 381.4 ns | 365.7 ns | 381.3 ns | >> +-------+-----------+-----------+-----------+-------------+ >> all four helpers end with a single trailing barrier (dmb osh). >> >> Table 3 - workaround on, generic per-store memset_io/memcpy_toio >> +-------+-----------+-----------+-------------+--------------+ >> | size | iowrite64 | iowrite32 | memset_io | memcpy_toio | >> +-------+-----------+-----------+-------------+--------------+ >> | 8B | 231.6 ns | 231.6 ns | 229.0 ns | 229.0 ns | >> | 16B | 231.7 ns | 231.9 ns | 458.4 ns | 458.5 ns | >> | 32B | 231.9 ns | 232.7 ns | 917.4 ns | 917.5 ns | >> | 64B | 232.7 ns | 234.8 ns | 1835.4 ns | 1835.5 ns | >> | 128B | 233.6 ns | 235.8 ns | 3670.9 ns | 3670.8 ns | >> | 256B | 237.7 ns | 276.7 ns | 7341.6 ns | 7341.6 ns | >> | 512B | 237.7 ns | 279.4 ns | 14001.4 ns | 14001.3 ns | >> | 1KB | 253.7 ns | 279.1 ns | 28631.5 ns | 28631.8 ns | >> | 2KB | 279.4 ns | 317.9 ns | 57276.3 ns | 57275.2 ns | >> | 4KB | 365.7 ns | 381.5 ns | 114564.4 ns | 114563.6 ns | >> +-------+-----------+-----------+-------------+--------------+ >> the generic memset_io()/memcpy_toio() build on __raw_write*(), which the >> workaround promotes to store-release, so every store is individually >> ordered - hence O(n) in the store count. >> >> Tables 2 and 3 show why patch 2 is needed: the generic per-store block >> writers collapse to O(n) under the workaround (4KB ~314x slower, ~115 us >> vs ~366 ns), while the arm64 versions stay flat at one-barrier cost. > That's interesting. With the way the patch set is structured, it > now looks like: > > 1. Fix the erratum, but cause a performance regression. > 2. Restore the performance regression and (re)apply the erratum > workaround. > > Would it make sense to avoid introducing the performance > regression in the first place by structuring the patch set > slightly differently? > > 1. (Re)introduce arm64 memset_io()/memcpy_toio(). > 2. Fix the erratum once for all > > What do you reckon? Yes, that ordering makes sense. I can restructure v5 so that patch 1 introduces the arm64 memset_{to}io() implementations while preserving the existing behavior. Patch 2 will then add the complete erratum workaround, including the conditional trailing DMB for those block-write helpers. This avoids introducing the intermediate performance regression and keeps each commit independently usable. Will and Catalin, could you please share your thoughts on this approach? -Shanker ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering 2026-06-29 23:08 ` Shanker Donthineni @ 2026-06-30 13:53 ` Will Deacon 0 siblings, 0 replies; 10+ messages in thread From: Will Deacon @ 2026-06-30 13:53 UTC (permalink / raw) To: Shanker Donthineni Cc: Vladimir Murzin, Catalin Marinas, Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel, linux-doc, Vikram Sethi, Jason Sequeira On Mon, Jun 29, 2026 at 06:08:37PM -0500, Shanker Donthineni wrote: > On 6/29/2026 5:45 AM, Vladimir Murzin wrote: > > That's interesting. With the way the patch set is structured, it > > now looks like: > > > > 1. Fix the erratum, but cause a performance regression. > > 2. Restore the performance regression and (re)apply the erratum > > workaround. > > > > Would it make sense to avoid introducing the performance > > regression in the first place by structuring the patch set > > slightly differently? > > > > 1. (Re)introduce arm64 memset_io()/memcpy_toio(). > > 2. Fix the erratum once for all > > > > What do you reckon? > > Yes, that ordering makes sense. > > I can restructure v5 so that patch 1 introduces the arm64 memset_{to}io() > implementations while preserving the existing behavior. Patch 2 will > then add the complete erratum workaround, including the conditional > trailing DMB for those block-write helpers. This avoids introducing > the intermediate performance regression and keeps each commit > independently usable. > > Will and Catalin, could you please share your thoughts on this approach? tbh, I think I'm ok with the current ordering. The second patch is purely a performance thing for affected CPUs, so doesn't strictly need to be applied or backported for functional correctness afaict. Will ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2026-06-30 14:21 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-06-25 18:24 [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering Shanker Donthineni 2026-06-25 18:24 ` [PATCH v4 1/2] arm64: errata: Workaround " Shanker Donthineni 2026-06-30 14:21 ` Will Deacon 2026-06-25 18:24 ` [PATCH v4 2/2] arm64: io: apply the device store-release workaround once per block write Shanker Donthineni 2026-06-29 10:48 ` Vladimir Murzin 2026-06-29 23:09 ` Shanker Donthineni 2026-06-30 14:17 ` Will Deacon 2026-06-29 10:45 ` [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering Vladimir Murzin 2026-06-29 23:08 ` Shanker Donthineni 2026-06-30 13:53 ` Will Deacon
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox