[PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering

Linux-ARM-Kernel Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering
@ 2026-06-25 18:24 Shanker Donthineni
  2026-06-25 18:24 ` [PATCH v4 1/2] arm64: errata: Workaround " Shanker Donthineni
  2026-06-25 18:24 ` [PATCH v4 2/2] arm64: io: apply the device store-release workaround once per block write Shanker Donthineni
  0 siblings, 2 replies; 3+ messages in thread
From: Shanker Donthineni @ 2026-06-25 18:24 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Vladimir Murzin
  Cc: Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel,
	linux-doc, Shanker Donthineni, Vikram Sethi, Jason Sequeira

This series works around the NVIDIA Olympus device store/load ordering
erratum (T410-OLY-1027): a Device-nGnR* load can be observed by a
peripheral before an older, non-overlapping Device-nGnR* store to the
same peripheral, breaking the program order that drivers rely on for
MMIO and potentially leaving a device in an incorrect state.

Patch 1 adds the workaround. It promotes the raw MMIO store helpers
(__raw_writeb/w/l/q, and therefore writel()/writel_relaxed()) to
store-release on affected CPUs, and promotes the trailing DGH of the
write-combining __iowrite{32,64}_copy() helpers to dmb osh. Everything is
gated on a new ARM64_WORKAROUND_DEVICE_STORE_RELEASE cpucap and patched
in only on affected parts, so it is a no-op elsewhere.

Patch 2 provides arm64 memset_io()/memcpy_toio(). The generic versions
are built on __raw_write*(), so patch 1 would promote every store in a
block to a store-release; as each STLR drains the write-combining buffer,
block MMIO becomes O(n) store-releases. The arm64 versions emit plain
STR in the loop and order the whole block with a single trailing dmb osh,
keeping block MMIO at one-barrier cost.

Performance: NVIDIA Olympus, write-combining MMIO to a device BAR, single
PE pinned; per-call cost in ns. Consecutive writes ping-pong between two
buffers so repeated stores are not coalesced. iowrite64/iowrite32 =
__iowrite{64,32}_copy().

Table 1 - workaround off (CONFIG_NVIDIA_OLYMPUS_1027_ERRATUM=n)
+-------+-----------+-----------+-----------+-------------+
|  size | iowrite64 | iowrite32 | memset_io | memcpy_toio |
+-------+-----------+-----------+-----------+-------------+
|    8B |   67.9 ns |   67.8 ns |    3.6 ns |    3.6 ns   |
|   16B |   67.9 ns |   67.8 ns |    4.0 ns |    4.0 ns   |
|   32B |   67.9 ns |   67.9 ns |    4.6 ns |    4.6 ns   |
|   64B |   69.1 ns |   69.1 ns |   69.1 ns |   69.0 ns   |
|  128B |  138.3 ns |  138.3 ns |  138.4 ns |  138.3 ns   |
|  256B |  276.6 ns |  276.6 ns |  276.6 ns |  276.7 ns   |
|  512B |  276.6 ns |  276.5 ns |  276.6 ns |  276.6 ns   |
|   1KB |  276.6 ns |  278.4 ns |  276.6 ns |  276.6 ns   |
|   2KB |  278.4 ns |  278.4 ns |  275.9 ns |  276.6 ns   |
|   4KB |  365.7 ns |  365.7 ns |  365.7 ns |  365.7 ns   |
+-------+-----------+-----------+-----------+-------------+
relaxed/no-flush: memset_io()/memcpy_toio() issue plain stores with no
trailing dgh() or barrier, unlike __iowrite*_copy() which ends with dgh().

Table 2 - workaround on, arm64 memset_io/memcpy_toio (this series)
+-------+-----------+-----------+-----------+-------------+
|  size | iowrite64 | iowrite32 | memset_io | memcpy_toio |
+-------+-----------+-----------+-----------+-------------+
|    8B |  231.6 ns |  231.6 ns |  232.4 ns |  232.4 ns   |
|   16B |  231.7 ns |  231.9 ns |  232.7 ns |  232.6 ns   |
|   32B |  231.9 ns |  232.7 ns |  232.9 ns |  232.9 ns   |
|   64B |  232.7 ns |  235.0 ns |  233.7 ns |  233.6 ns   |
|  128B |  233.6 ns |  235.8 ns |  234.4 ns |  234.3 ns   |
|  256B |  237.7 ns |  276.8 ns |  264.0 ns |  276.7 ns   |
|  512B |  237.7 ns |  277.1 ns |  238.1 ns |  277.6 ns   |
|   1KB |  253.7 ns |  279.3 ns |  276.1 ns |  294.1 ns   |
|   2KB |  295.0 ns |  318.7 ns |  288.5 ns |  308.3 ns   |
|   4KB |  365.9 ns |  381.4 ns |  365.7 ns |  381.3 ns   |
+-------+-----------+-----------+-----------+-------------+
all four helpers end with a single trailing barrier (dmb osh).

Table 3 - workaround on, generic per-store memset_io/memcpy_toio
+-------+-----------+-----------+-------------+--------------+
|  size | iowrite64 | iowrite32 |   memset_io |  memcpy_toio |
+-------+-----------+-----------+-------------+--------------+
|    8B |  231.6 ns |  231.6 ns |    229.0 ns |    229.0 ns  |
|   16B |  231.7 ns |  231.9 ns |    458.4 ns |    458.5 ns  |
|   32B |  231.9 ns |  232.7 ns |    917.4 ns |    917.5 ns  |
|   64B |  232.7 ns |  234.8 ns |   1835.4 ns |   1835.5 ns  |
|  128B |  233.6 ns |  235.8 ns |   3670.9 ns |   3670.8 ns  |
|  256B |  237.7 ns |  276.7 ns |   7341.6 ns |   7341.6 ns  |
|  512B |  237.7 ns |  279.4 ns |  14001.4 ns |  14001.3 ns  |
|   1KB |  253.7 ns |  279.1 ns |  28631.5 ns |  28631.8 ns  |
|   2KB |  279.4 ns |  317.9 ns |  57276.3 ns |  57275.2 ns  |
|   4KB |  365.7 ns |  381.5 ns | 114564.4 ns | 114563.6 ns  |
+-------+-----------+-----------+-------------+--------------+
the generic memset_io()/memcpy_toio() build on __raw_write*(), which the
workaround promotes to store-release, so every store is individually
ordered - hence O(n) in the store count.

Tables 2 and 3 show why patch 2 is needed: the generic per-store block
writers collapse to O(n) under the workaround (4KB ~314x slower, ~115 us
vs ~366 ns), while the arm64 versions stay flat at one-barrier cost.

Changes since v3:
  - Split the workaround into two patches: the erratum fix (1/2) and the
    arm64 memset_io()/memcpy_toio() block writers (2/2).
  - Reworked the raw MMIO write helpers to use a direct base-register
    str*/stlr* alternative sequence instead of a per-write static branch.
  - Covered the write-combining __iowrite{32,64}_copy() path by patching
    dgh() to dmb osh on affected CPUs, keeping the contiguous STR groups
    and the ordering barrier outside the copy loop; the single-element
    case now uses a plain str* as well.
  - Added arm64 memset_io()/memcpy_toio() so the byte/word block writers
    take one trailing dmb osh instead of a per-store store-release.
  - Updated the commit messages to describe the offset-addressing
    trade-off.

Changes since v2:
  - Reworked the raw MMIO write helpers so unaffected CPUs keep the
    existing offset-addressed STR sequence, while affected CPUs use the
    base-register STLR path.
  - Updated the commit message to match the code changes.
  - Rebased on top of the arm64 for-next/errata branch:
    https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/errata

Changes since v1:
  - Updated the commit message based on feedback from Vladimir Murzin.

Shanker Donthineni (2):
  arm64: errata: Workaround NVIDIA Olympus device store/load ordering
  arm64: io: apply the device store-release workaround once per block
    write

 Documentation/arch/arm64/silicon-errata.rst |  2 +
 arch/arm64/Kconfig                          | 25 +++++++++
 arch/arm64/include/asm/barrier.h            |  4 +-
 arch/arm64/include/asm/io.h                 | 36 +++++++++----
 arch/arm64/kernel/cpu_errata.c              |  8 +++
 arch/arm64/kernel/io.c                      | 82 +++++++++++++++++++++++++++++
 arch/arm64/tools/cpucaps                    |  1 +
 7 files changed, 146 insertions(+), 12 deletions(-)

-- 
2.54.0.windows.1



^ permalink raw reply	[flat|nested] 3+ messages in thread

* [PATCH v4 1/2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering
  2026-06-25 18:24 [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering Shanker Donthineni
@ 2026-06-25 18:24 ` Shanker Donthineni
  2026-06-25 18:24 ` [PATCH v4 2/2] arm64: io: apply the device store-release workaround once per block write Shanker Donthineni
  1 sibling, 0 replies; 3+ messages in thread
From: Shanker Donthineni @ 2026-06-25 18:24 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Vladimir Murzin
  Cc: Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel,
	linux-doc, Shanker Donthineni, Vikram Sethi, Jason Sequeira

On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
observed by a peripheral before an older, non-overlapping Device-nGnR*
store to the same peripheral. This breaks the program-order guarantee
that software expects for Device-nGnR* accesses and can leave a
peripheral in an incorrect state, as a load is observed before an
earlier store takes effect.

The erratum can occur only when all of the following apply:

  - A PE executes a Device-nGnR* store followed by a younger
    Device-nGnR* load.
  - The store is not a store-release.
  - The accesses target the same peripheral and do not overlap in bytes.
  - There is at most one intervening Device-nGnR* store in program
    order, and there are no intervening Device-nGnR* loads.
  - There is no DSB, and no DMB that orders loads, between the store and
    the load.
  - Specific micro-architectural and timing conditions occur.

Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain str*
to stlr* (Store-Release) on affected CPUs, which removes the "store is
not a store-release" condition for every device write the kernel issues.
Because writel() and writel_relaxed() are both built on __raw_writel()
in asm-generic/io.h, patching the raw variants covers both the
non-relaxed and relaxed APIs without touching the higher layers. Note
that writel()'s own barrier sits before the store, so it does not order
the store against a subsequent readl(); the store-release promotion is
what provides that ordering.

Like ARM64_ERRATUM_832075 on the load side, the change is gated on a new
ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only activated on
parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs continue to use
plain str* instructions.

Note: stlr* only supports base-register addressing, so the raw MMIO
write helpers use a base-register str*/stlr* alternative sequence. This
gives up the offset-addressed str* code generation introduced by commit
d044d6ba6f02 ("arm64: io: permit offset addressing"). A static-branch
implementation would add extra control flow without preserving the
desired offset-addressed code generation in practice, so use a direct
base-register str*/stlr* alternative instead.

For the write-combining copy helpers (__iowrite{32,64}_copy()), the
contiguous str* groups are kept, because replacing those stores would
defeat the write-combining behaviour used to improve store performance.
Rather than rely on the relaxed, no-ordering contract of these helpers -
which would leave affected CPUs behaving differently from every other
arm64 system and exposed to any future driver that depends on ordering
across such copies - the DGH hint emitted once after each copy is
promoted to dmb osh on affected CPUs. That orders the grouped stores
against subsequent loads without placing a barrier in the copy loop,
while unaffected CPUs keep the existing DGH hint. The single-element
case of __const_memcpy_toio_aligned{32,64}() likewise uses a plain str*
(instead of __raw_write*()) so it shares that str* group + DGH path
rather than taking a per-store store-release.

Co-developed-by: Vikram Sethi <vsethi@nvidia.com>
Signed-off-by: Vikram Sethi <vsethi@nvidia.com>
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Link: https://lore.kernel.org/all/ajVZBJgKn-5sxHD6@willie-the-truck/
---
 Documentation/arch/arm64/silicon-errata.rst |  2 ++
 arch/arm64/Kconfig                          | 25 +++++++++++++++++
 arch/arm64/include/asm/barrier.h            |  4 ++-
 arch/arm64/include/asm/io.h                 | 31 +++++++++++++--------
 arch/arm64/kernel/cpu_errata.c              |  8 ++++++
 arch/arm64/tools/cpucaps                    |  1 +
 6 files changed, 59 insertions(+), 12 deletions(-)

diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst
index ad04d1cdc0f0..c4137f89acef 100644
--- a/Documentation/arch/arm64/silicon-errata.rst
+++ b/Documentation/arch/arm64/silicon-errata.rst
@@ -298,6 +298,8 @@ stable kernels.
 +----------------+-----------------+-----------------+-----------------------------+
 | NVIDIA         | Carmel Core     | N/A             | NVIDIA_CARMEL_CNP_ERRATUM   |
 +----------------+-----------------+-----------------+-----------------------------+
+| NVIDIA         | Olympus core    | T410-OLY-1027   | NVIDIA_OLYMPUS_1027_ERRATUM |
++----------------+-----------------+-----------------+-----------------------------+
 | NVIDIA         | Olympus core    | T410-OLY-1029   | ARM64_ERRATUM_4118414       |
 +----------------+-----------------+-----------------+-----------------------------+
 | NVIDIA         | T241 GICv3/4.x  | T241-FABRIC-4   | N/A                         |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 10c69474f276..da4e66b19209 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -564,6 +564,31 @@ config ARM64_ERRATUM_832075
 
 	  If unsure, say Y.
 
+config NVIDIA_OLYMPUS_1027_ERRATUM
+	bool "NVIDIA Olympus: device store/load ordering erratum"
+	default y
+	help
+	  This option adds an alternative code sequence to work around an
+	  NVIDIA Olympus core erratum where a Device-nGnR* store can be
+	  observed by a peripheral after a younger Device-nGnR* load to the
+	  same peripheral. This breaks the program order that drivers rely
+	  on for MMIO and can leave a device in an incorrect state.
+
+	  The workaround promotes the raw MMIO store helpers
+	  (__raw_writeb/w/l/q) to Store-Release (STLR), which restores the
+	  required ordering. Because writel() and writel_relaxed() are built
+	  on __raw_writel(), both are covered without changes to the higher
+	  layers. It also promotes the DGH hint used after write-combining
+	  memcpy-to-IO sequences to a DMB, so grouped stores are ordered
+	  against subsequent reads without placing a barrier in the copy loop.
+
+	  The fix is applied through the alternatives framework, so enabling
+	  this option does not by itself activate the workaround: it is
+	  patched in only when an affected CPU is detected, and is a no-op on
+	  unaffected CPUs.
+
+	  If unsure, say Y.
+
 config ARM64_ERRATUM_834220
 	bool "Cortex-A57: 834220: Stage 2 translation fault might be incorrectly reported in presence of a Stage 1 fault (rare)"
 	depends on KVM
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 9495c4441a46..22792d1305aa 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -38,7 +38,9 @@
  * Device-GRE attributes before the hint instruction with any memory accesses
  * appearing after the hint instruction.
  */
-#define dgh()		asm volatile("hint #6" : : : "memory")
+#define dgh()		asm volatile(ALTERNATIVE("hint #6", "dmb osh",	\
+					 ARM64_WORKAROUND_DEVICE_STORE_RELEASE)	\
+				     : : : "memory")
 
 #define spec_bar()	asm volatile(ALTERNATIVE("dsb nsh\nisb\n",		\
 						 SB_BARRIER_INSN"nop\n",	\
diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
index 8cbd1e96fd50..69e0fa004d31 100644
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -16,7 +16,6 @@
 #include <asm/memory.h>
 #include <asm/early_ioremap.h>
 #include <asm/alternative.h>
-#include <asm/cpufeature.h>
 #include <asm/rsi.h>
 
 /*
@@ -25,29 +24,37 @@
 #define __raw_writeb __raw_writeb
 static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
 {
-	volatile u8 __iomem *ptr = addr;
-	asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+	asm volatile(ALTERNATIVE("strb %w0, [%1]",
+				 "stlrb %w0, [%1]",
+				 ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+		     : : "rZ" (val), "r" (addr));
 }
 
 #define __raw_writew __raw_writew
 static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
 {
-	volatile u16 __iomem *ptr = addr;
-	asm volatile("strh %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+	asm volatile(ALTERNATIVE("strh %w0, [%1]",
+				 "stlrh %w0, [%1]",
+				 ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+		     : : "rZ" (val), "r" (addr));
 }
 
 #define __raw_writel __raw_writel
 static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
 {
-	volatile u32 __iomem *ptr = addr;
-	asm volatile("str %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+	asm volatile(ALTERNATIVE("str %w0, [%1]",
+				 "stlr %w0, [%1]",
+				 ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+		     : : "rZ" (val), "r" (addr));
 }
 
 #define __raw_writeq __raw_writeq
 static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
 {
-	volatile u64 __iomem *ptr = addr;
-	asm volatile("str %x0, %1" : : "rZ" (val), "Qo" (*ptr));
+	asm volatile(ALTERNATIVE("str %x0, [%1]",
+				 "stlr %x0, [%1]",
+				 ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+		     : : "rZ" (val), "r" (addr));
 }
 
 #define __raw_readb __raw_readb
@@ -178,7 +185,8 @@ __const_memcpy_toio_aligned32(volatile u32 __iomem *to, const u32 *from,
 			     : "rZ"(from[0]), "rZ"(from[1]), "r"(to));
 		break;
 	case 1:
-		__raw_writel(*from, to);
+		asm volatile("str %w0, [%1]"
+			     : : "rZ"(from[0]), "r"(to) : "memory");
 		break;
 	default:
 		BUILD_BUG();
@@ -235,7 +243,8 @@ __const_memcpy_toio_aligned64(volatile u64 __iomem *to, const u64 *from,
 			     : "rZ"(from[0]), "rZ"(from[1]), "r"(to));
 		break;
 	case 1:
-		__raw_writeq(*from, to);
+		asm volatile("str %x0, [%1]"
+			     : : "rZ"(from[0]), "r"(to) : "memory");
 		break;
 	default:
 		BUILD_BUG();
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 4b0d5d932897..76c1f8cf1ee0 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -839,6 +839,14 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
 		ERRATA_MIDR_ALL_VERSIONS(MIDR_NVIDIA_CARMEL),
 	},
 #endif
+#ifdef CONFIG_NVIDIA_OLYMPUS_1027_ERRATUM
+	{
+		/* NVIDIA Olympus core */
+		.desc = "NVIDIA Olympus device load/store ordering erratum",
+		.capability = ARM64_WORKAROUND_DEVICE_STORE_RELEASE,
+		ERRATA_MIDR_ALL_VERSIONS(MIDR_NVIDIA_OLYMPUS),
+	},
+#endif
 #ifdef CONFIG_ARM64_WORKAROUND_TRBE_OVERWRITE_FILL_MODE
 	{
 		/*
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index 811c2479e82d..d367257bf770 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -120,6 +120,7 @@ WORKAROUND_CAVIUM_TX2_219_PRFM
 WORKAROUND_CAVIUM_TX2_219_TVM
 WORKAROUND_CLEAN_CACHE
 WORKAROUND_DEVICE_LOAD_ACQUIRE
+WORKAROUND_DEVICE_STORE_RELEASE
 WORKAROUND_NVIDIA_CARMEL_CNP
 WORKAROUND_PMUV3_IMPDEF_TRAPS
 WORKAROUND_QCOM_FALKOR_E1003
-- 
2.54.0.windows.1



^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH v4 2/2] arm64: io: apply the device store-release workaround once per block write
  2026-06-25 18:24 [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering Shanker Donthineni
  2026-06-25 18:24 ` [PATCH v4 1/2] arm64: errata: Workaround " Shanker Donthineni
@ 2026-06-25 18:24 ` Shanker Donthineni
  1 sibling, 0 replies; 3+ messages in thread
From: Shanker Donthineni @ 2026-06-25 18:24 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Vladimir Murzin
  Cc: Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel,
	linux-doc, Shanker Donthineni, Vikram Sethi, Jason Sequeira

The generic memset_io()/memcpy_toio() are built on __raw_write*(), so on
parts with the NVIDIA Olympus device store/load ordering erratum the
ARM64_WORKAROUND_DEVICE_STORE_RELEASE workaround promotes every store in
the block to a store-release. Each stlr* carries a barrier cost, so block
MMIO becomes O(n) store-releases, making a block copy many times slower
than a single ordered burst and growing with the transfer size.

Provide arm64 memset_io()/memcpy_toio() that emit plain str* in the loop
and order the whole block against subsequent loads with a single
trailing dmb osh on affected CPUs (a no-op elsewhere, preserving the
relaxed contract of these helpers). This keeps block MMIO writes at
one-barrier cost rather than scaling with the transfer size.

Performance (NVIDIA Olympus, write-combining MMIO to a device BAR, single
PE pinned; per-call cost in ns; consecutive writes ping-pong between two
buffers so repeated stores are not coalesced; iowrite64/iowrite32 =
__iowrite{64,32}_copy()):

Table 1 - arm64 memset_io/memcpy_toio (this patch)
+-------+-----------+-----------+-----------+-------------+
|  size | iowrite64 | iowrite32 | memset_io | memcpy_toio |
+-------+-----------+-----------+-----------+-------------+
|    8B |  231.6 ns |  231.6 ns |  232.4 ns |  232.4 ns   |
|   16B |  231.7 ns |  231.9 ns |  232.7 ns |  232.6 ns   |
|   32B |  231.9 ns |  232.7 ns |  232.9 ns |  232.9 ns   |
|   64B |  232.7 ns |  235.0 ns |  233.7 ns |  233.6 ns   |
|  128B |  233.6 ns |  235.8 ns |  234.4 ns |  234.3 ns   |
|  256B |  237.7 ns |  276.8 ns |  264.0 ns |  276.7 ns   |
|  512B |  237.7 ns |  277.1 ns |  238.1 ns |  277.6 ns   |
|   1KB |  253.7 ns |  279.3 ns |  276.1 ns |  294.1 ns   |
|   2KB |  295.0 ns |  318.7 ns |  288.5 ns |  308.3 ns   |
|   4KB |  365.9 ns |  381.4 ns |  365.7 ns |  381.3 ns   |
+-------+-----------+-----------+-----------+-------------+
all four helpers end with a single trailing barrier (dmb osh).

Table 2 - generic per-store memset_io/memcpy_toio
+-------+-----------+-----------+-------------+--------------+
|  size | iowrite64 | iowrite32 |   memset_io |  memcpy_toio |
+-------+-----------+-----------+-------------+--------------+
|    8B |  231.6 ns |  231.6 ns |    229.0 ns |    229.0 ns  |
|   16B |  231.7 ns |  231.9 ns |    458.4 ns |    458.5 ns  |
|   32B |  231.9 ns |  232.7 ns |    917.4 ns |    917.5 ns  |
|   64B |  232.7 ns |  234.8 ns |   1835.4 ns |   1835.5 ns  |
|  128B |  233.6 ns |  235.8 ns |   3670.9 ns |   3670.8 ns  |
|  256B |  237.7 ns |  276.7 ns |   7341.6 ns |   7341.6 ns  |
|  512B |  237.7 ns |  279.4 ns |  14001.4 ns |  14001.3 ns  |
|   1KB |  253.7 ns |  279.1 ns |  28631.5 ns |  28631.8 ns  |
|   2KB |  279.4 ns |  317.9 ns |  57276.3 ns |  57275.2 ns  |
|   4KB |  365.7 ns |  381.5 ns | 114564.4 ns | 114563.6 ns  |
+-------+-----------+-----------+-------------+--------------+
the generic memset_io()/memcpy_toio() build on __raw_write*(), which the
workaround promotes to store-release, so every store is individually
ordered - hence O(n) in the store count.

The arm64 versions stay flat at one-barrier cost while the generic
per-store writers collapse to O(n): at 4KB ~314x slower (~115 us vs
~366 ns).

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
---
 arch/arm64/include/asm/io.h |  5 +++
 arch/arm64/kernel/io.c      | 82 +++++++++++++++++++++++++++++++++++++
 2 files changed, 87 insertions(+)

diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
index 69e0fa004d31..649503f347bc 100644
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -266,6 +266,11 @@ __iowrite64_copy(void __iomem *to, const void *from, size_t count)
 }
 #define __iowrite64_copy __iowrite64_copy
 
+void memset_io(volatile void __iomem *dst, int c, size_t count);
+#define memset_io memset_io
+void memcpy_toio(volatile void __iomem *dst, const void *src, size_t count);
+#define memcpy_toio memcpy_toio
+
 /*
  * I/O memory mapping functions.
  */
diff --git a/arch/arm64/kernel/io.c b/arch/arm64/kernel/io.c
index fe86ada23c7d..b5fd9ee6d9eb 100644
--- a/arch/arm64/kernel/io.c
+++ b/arch/arm64/kernel/io.c
@@ -5,9 +5,91 @@
  * Copyright (C) 2012 ARM Ltd.
  */
 
+#include <linux/align.h>
 #include <linux/export.h>
 #include <linux/types.h>
 #include <linux/io.h>
+#include <linux/unaligned.h>
+
+#include <asm/alternative.h>
+
+/*
+ * ARM64_WORKAROUND_DEVICE_STORE_RELEASE promotes every raw MMIO store
+ * (__raw_write*()) to a store-release on affected CPUs. The generic
+ * memset_io()/memcpy_toio() are built on those helpers, so the workaround would
+ * emit one store-release per element and turn a block write into O(n) ordered
+ * stores - far more costly than the single barrier a block actually needs.
+ *
+ * Provide arm64 versions that emit plain STR in the loop and order the whole
+ * block against subsequent loads with one trailing DMB OSH, patched in only on
+ * affected CPUs (a no-op elsewhere, so the relaxed contract of these helpers is
+ * preserved).
+ *
+ * This capability is currently enabled only for the NVIDIA Olympus device
+ * store/load ordering erratum, where a Device-nGnR* load may be observed before
+ * an older, non-overlapping Device-nGnR* store to the same peripheral.
+ */
+static __always_inline void iomem_block_store_barrier(void)
+{
+	asm volatile(ALTERNATIVE("nop", "dmb osh",
+				 ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+		     : : : "memory");
+}
+
+void memset_io(volatile void __iomem *dst, int c, size_t count)
+{
+	u64 qc = (u8)c;
+
+	qc *= ~0ULL / 0xff;
+
+	while (count && !IS_ALIGNED((__force unsigned long)dst, sizeof(u64))) {
+		asm volatile("strb %w0, [%1]" : : "rZ"((u8)c), "r"(dst) : "memory");
+		dst++;
+		count--;
+	}
+	while (count >= sizeof(u64)) {
+		asm volatile("str %x0, [%1]" : : "rZ"(qc), "r"(dst) : "memory");
+		dst += sizeof(u64);
+		count -= sizeof(u64);
+	}
+	while (count) {
+		asm volatile("strb %w0, [%1]" : : "rZ"((u8)c), "r"(dst) : "memory");
+		dst++;
+		count--;
+	}
+
+	iomem_block_store_barrier();
+}
+EXPORT_SYMBOL(memset_io);
+
+void memcpy_toio(volatile void __iomem *dst, const void *src, size_t count)
+{
+	while (count && !IS_ALIGNED((__force unsigned long)dst, sizeof(u64))) {
+		asm volatile("strb %w0, [%1]"
+			     : : "rZ"(*(const u8 *)src), "r"(dst) : "memory");
+		src++;
+		dst++;
+		count--;
+	}
+	while (count >= sizeof(u64)) {
+		asm volatile("str %x0, [%1]"
+			     : : "rZ"(get_unaligned((const u64 *)src)), "r"(dst)
+			     : "memory");
+		src += sizeof(u64);
+		dst += sizeof(u64);
+		count -= sizeof(u64);
+	}
+	while (count) {
+		asm volatile("strb %w0, [%1]"
+			     : : "rZ"(*(const u8 *)src), "r"(dst) : "memory");
+		src++;
+		dst++;
+		count--;
+	}
+
+	iomem_block_store_barrier();
+}
+EXPORT_SYMBOL(memcpy_toio);
 
 /*
  * This generates a memcpy that works on a from/to address which is aligned to
-- 
2.54.0.windows.1



^ permalink raw reply related	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2026-06-25 18:25 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-25 18:24 [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering Shanker Donthineni
2026-06-25 18:24 ` [PATCH v4 1/2] arm64: errata: Workaround " Shanker Donthineni
2026-06-25 18:24 ` [PATCH v4 2/2] arm64: io: apply the device store-release workaround once per block write Shanker Donthineni

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox