[PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering

Linux-ARM-Kernel Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering
@ 2026-06-25 18:24 Shanker Donthineni
  2026-06-25 18:24 ` [PATCH v4 1/2] arm64: errata: Workaround " Shanker Donthineni
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Shanker Donthineni @ 2026-06-25 18:24 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Vladimir Murzin
  Cc: Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel,
	linux-doc, Shanker Donthineni, Vikram Sethi, Jason Sequeira

This series works around the NVIDIA Olympus device store/load ordering
erratum (T410-OLY-1027): a Device-nGnR* load can be observed by a
peripheral before an older, non-overlapping Device-nGnR* store to the
same peripheral, breaking the program order that drivers rely on for
MMIO and potentially leaving a device in an incorrect state.

Patch 1 adds the workaround. It promotes the raw MMIO store helpers
(__raw_writeb/w/l/q, and therefore writel()/writel_relaxed()) to
store-release on affected CPUs, and promotes the trailing DGH of the
write-combining __iowrite{32,64}_copy() helpers to dmb osh. Everything is
gated on a new ARM64_WORKAROUND_DEVICE_STORE_RELEASE cpucap and patched
in only on affected parts, so it is a no-op elsewhere.

Patch 2 provides arm64 memset_io()/memcpy_toio(). The generic versions
are built on __raw_write*(), so patch 1 would promote every store in a
block to a store-release; as each STLR drains the write-combining buffer,
block MMIO becomes O(n) store-releases. The arm64 versions emit plain
STR in the loop and order the whole block with a single trailing dmb osh,
keeping block MMIO at one-barrier cost.

Performance: NVIDIA Olympus, write-combining MMIO to a device BAR, single
PE pinned; per-call cost in ns. Consecutive writes ping-pong between two
buffers so repeated stores are not coalesced. iowrite64/iowrite32 =
__iowrite{64,32}_copy().

Table 1 - workaround off (CONFIG_NVIDIA_OLYMPUS_1027_ERRATUM=n)
+-------+-----------+-----------+-----------+-------------+
|  size | iowrite64 | iowrite32 | memset_io | memcpy_toio |
+-------+-----------+-----------+-----------+-------------+
|    8B |   67.9 ns |   67.8 ns |    3.6 ns |    3.6 ns   |
|   16B |   67.9 ns |   67.8 ns |    4.0 ns |    4.0 ns   |
|   32B |   67.9 ns |   67.9 ns |    4.6 ns |    4.6 ns   |
|   64B |   69.1 ns |   69.1 ns |   69.1 ns |   69.0 ns   |
|  128B |  138.3 ns |  138.3 ns |  138.4 ns |  138.3 ns   |
|  256B |  276.6 ns |  276.6 ns |  276.6 ns |  276.7 ns   |
|  512B |  276.6 ns |  276.5 ns |  276.6 ns |  276.6 ns   |
|   1KB |  276.6 ns |  278.4 ns |  276.6 ns |  276.6 ns   |
|   2KB |  278.4 ns |  278.4 ns |  275.9 ns |  276.6 ns   |
|   4KB |  365.7 ns |  365.7 ns |  365.7 ns |  365.7 ns   |
+-------+-----------+-----------+-----------+-------------+
relaxed/no-flush: memset_io()/memcpy_toio() issue plain stores with no
trailing dgh() or barrier, unlike __iowrite*_copy() which ends with dgh().

Table 2 - workaround on, arm64 memset_io/memcpy_toio (this series)
+-------+-----------+-----------+-----------+-------------+
|  size | iowrite64 | iowrite32 | memset_io | memcpy_toio |
+-------+-----------+-----------+-----------+-------------+
|    8B |  231.6 ns |  231.6 ns |  232.4 ns |  232.4 ns   |
|   16B |  231.7 ns |  231.9 ns |  232.7 ns |  232.6 ns   |
|   32B |  231.9 ns |  232.7 ns |  232.9 ns |  232.9 ns   |
|   64B |  232.7 ns |  235.0 ns |  233.7 ns |  233.6 ns   |
|  128B |  233.6 ns |  235.8 ns |  234.4 ns |  234.3 ns   |
|  256B |  237.7 ns |  276.8 ns |  264.0 ns |  276.7 ns   |
|  512B |  237.7 ns |  277.1 ns |  238.1 ns |  277.6 ns   |
|   1KB |  253.7 ns |  279.3 ns |  276.1 ns |  294.1 ns   |
|   2KB |  295.0 ns |  318.7 ns |  288.5 ns |  308.3 ns   |
|   4KB |  365.9 ns |  381.4 ns |  365.7 ns |  381.3 ns   |
+-------+-----------+-----------+-----------+-------------+
all four helpers end with a single trailing barrier (dmb osh).

Table 3 - workaround on, generic per-store memset_io/memcpy_toio
+-------+-----------+-----------+-------------+--------------+
|  size | iowrite64 | iowrite32 |   memset_io |  memcpy_toio |
+-------+-----------+-----------+-------------+--------------+
|    8B |  231.6 ns |  231.6 ns |    229.0 ns |    229.0 ns  |
|   16B |  231.7 ns |  231.9 ns |    458.4 ns |    458.5 ns  |
|   32B |  231.9 ns |  232.7 ns |    917.4 ns |    917.5 ns  |
|   64B |  232.7 ns |  234.8 ns |   1835.4 ns |   1835.5 ns  |
|  128B |  233.6 ns |  235.8 ns |   3670.9 ns |   3670.8 ns  |
|  256B |  237.7 ns |  276.7 ns |   7341.6 ns |   7341.6 ns  |
|  512B |  237.7 ns |  279.4 ns |  14001.4 ns |  14001.3 ns  |
|   1KB |  253.7 ns |  279.1 ns |  28631.5 ns |  28631.8 ns  |
|   2KB |  279.4 ns |  317.9 ns |  57276.3 ns |  57275.2 ns  |
|   4KB |  365.7 ns |  381.5 ns | 114564.4 ns | 114563.6 ns  |
+-------+-----------+-----------+-------------+--------------+
the generic memset_io()/memcpy_toio() build on __raw_write*(), which the
workaround promotes to store-release, so every store is individually
ordered - hence O(n) in the store count.

Tables 2 and 3 show why patch 2 is needed: the generic per-store block
writers collapse to O(n) under the workaround (4KB ~314x slower, ~115 us
vs ~366 ns), while the arm64 versions stay flat at one-barrier cost.

Changes since v3:
  - Split the workaround into two patches: the erratum fix (1/2) and the
    arm64 memset_io()/memcpy_toio() block writers (2/2).
  - Reworked the raw MMIO write helpers to use a direct base-register
    str*/stlr* alternative sequence instead of a per-write static branch.
  - Covered the write-combining __iowrite{32,64}_copy() path by patching
    dgh() to dmb osh on affected CPUs, keeping the contiguous STR groups
    and the ordering barrier outside the copy loop; the single-element
    case now uses a plain str* as well.
  - Added arm64 memset_io()/memcpy_toio() so the byte/word block writers
    take one trailing dmb osh instead of a per-store store-release.
  - Updated the commit messages to describe the offset-addressing
    trade-off.

Changes since v2:
  - Reworked the raw MMIO write helpers so unaffected CPUs keep the
    existing offset-addressed STR sequence, while affected CPUs use the
    base-register STLR path.
  - Updated the commit message to match the code changes.
  - Rebased on top of the arm64 for-next/errata branch:
    https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/errata

Changes since v1:
  - Updated the commit message based on feedback from Vladimir Murzin.

Shanker Donthineni (2):
  arm64: errata: Workaround NVIDIA Olympus device store/load ordering
  arm64: io: apply the device store-release workaround once per block
    write

 Documentation/arch/arm64/silicon-errata.rst |  2 +
 arch/arm64/Kconfig                          | 25 +++++++++
 arch/arm64/include/asm/barrier.h            |  4 +-
 arch/arm64/include/asm/io.h                 | 36 +++++++++----
 arch/arm64/kernel/cpu_errata.c              |  8 +++
 arch/arm64/kernel/io.c                      | 82 +++++++++++++++++++++++++++++
 arch/arm64/tools/cpucaps                    |  1 +
 7 files changed, 146 insertions(+), 12 deletions(-)

-- 
2.54.0.windows.1



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v4 1/2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering
  2026-06-25 18:24 [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering Shanker Donthineni
@ 2026-06-25 18:24 ` Shanker Donthineni
  2026-06-30 14:21   ` Will Deacon
  2026-06-25 18:24 ` [PATCH v4 2/2] arm64: io: apply the device store-release workaround once per block write Shanker Donthineni
  2026-06-29 10:45 ` [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering Vladimir Murzin
  2 siblings, 1 reply; 10+ messages in thread
From: Shanker Donthineni @ 2026-06-25 18:24 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Vladimir Murzin
  Cc: Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel,
	linux-doc, Shanker Donthineni, Vikram Sethi, Jason Sequeira

On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
observed by a peripheral before an older, non-overlapping Device-nGnR*
store to the same peripheral. This breaks the program-order guarantee
that software expects for Device-nGnR* accesses and can leave a
peripheral in an incorrect state, as a load is observed before an
earlier store takes effect.

The erratum can occur only when all of the following apply:

  - A PE executes a Device-nGnR* store followed by a younger
    Device-nGnR* load.
  - The store is not a store-release.
  - The accesses target the same peripheral and do not overlap in bytes.
  - There is at most one intervening Device-nGnR* store in program
    order, and there are no intervening Device-nGnR* loads.
  - There is no DSB, and no DMB that orders loads, between the store and
    the load.
  - Specific micro-architectural and timing conditions occur.

Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain str*
to stlr* (Store-Release) on affected CPUs, which removes the "store is
not a store-release" condition for every device write the kernel issues.
Because writel() and writel_relaxed() are both built on __raw_writel()
in asm-generic/io.h, patching the raw variants covers both the
non-relaxed and relaxed APIs without touching the higher layers. Note
that writel()'s own barrier sits before the store, so it does not order
the store against a subsequent readl(); the store-release promotion is
what provides that ordering.

Like ARM64_ERRATUM_832075 on the load side, the change is gated on a new
ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only activated on
parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs continue to use
plain str* instructions.

Note: stlr* only supports base-register addressing, so the raw MMIO
write helpers use a base-register str*/stlr* alternative sequence. This
gives up the offset-addressed str* code generation introduced by commit
d044d6ba6f02 ("arm64: io: permit offset addressing"). A static-branch
implementation would add extra control flow without preserving the
desired offset-addressed code generation in practice, so use a direct
base-register str*/stlr* alternative instead.

For the write-combining copy helpers (__iowrite{32,64}_copy()), the
contiguous str* groups are kept, because replacing those stores would
defeat the write-combining behaviour used to improve store performance.
Rather than rely on the relaxed, no-ordering contract of these helpers -
which would leave affected CPUs behaving differently from every other
arm64 system and exposed to any future driver that depends on ordering
across such copies - the DGH hint emitted once after each copy is
promoted to dmb osh on affected CPUs. That orders the grouped stores
against subsequent loads without placing a barrier in the copy loop,
while unaffected CPUs keep the existing DGH hint. The single-element
case of __const_memcpy_toio_aligned{32,64}() likewise uses a plain str*
(instead of __raw_write*()) so it shares that str* group + DGH path
rather than taking a per-store store-release.

Co-developed-by: Vikram Sethi <vsethi@nvidia.com>
Signed-off-by: Vikram Sethi <vsethi@nvidia.com>
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Link: https://lore.kernel.org/all/ajVZBJgKn-5sxHD6@willie-the-truck/
---
 Documentation/arch/arm64/silicon-errata.rst |  2 ++
 arch/arm64/Kconfig                          | 25 +++++++++++++++++
 arch/arm64/include/asm/barrier.h            |  4 ++-
 arch/arm64/include/asm/io.h                 | 31 +++++++++++++--------
 arch/arm64/kernel/cpu_errata.c              |  8 ++++++
 arch/arm64/tools/cpucaps                    |  1 +
 6 files changed, 59 insertions(+), 12 deletions(-)

diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst
index ad04d1cdc0f0..c4137f89acef 100644
--- a/Documentation/arch/arm64/silicon-errata.rst
+++ b/Documentation/arch/arm64/silicon-errata.rst
@@ -298,6 +298,8 @@ stable kernels.
 +----------------+-----------------+-----------------+-----------------------------+
 | NVIDIA         | Carmel Core     | N/A             | NVIDIA_CARMEL_CNP_ERRATUM   |
 +----------------+-----------------+-----------------+-----------------------------+
+| NVIDIA         | Olympus core    | T410-OLY-1027   | NVIDIA_OLYMPUS_1027_ERRATUM |
++----------------+-----------------+-----------------+-----------------------------+
 | NVIDIA         | Olympus core    | T410-OLY-1029   | ARM64_ERRATUM_4118414       |
 +----------------+-----------------+-----------------+-----------------------------+
 | NVIDIA         | T241 GICv3/4.x  | T241-FABRIC-4   | N/A                         |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 10c69474f276..da4e66b19209 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -564,6 +564,31 @@ config ARM64_ERRATUM_832075
 
 	  If unsure, say Y.
 
+config NVIDIA_OLYMPUS_1027_ERRATUM
+	bool "NVIDIA Olympus: device store/load ordering erratum"
+	default y
+	help
+	  This option adds an alternative code sequence to work around an
+	  NVIDIA Olympus core erratum where a Device-nGnR* store can be
+	  observed by a peripheral after a younger Device-nGnR* load to the
+	  same peripheral. This breaks the program order that drivers rely
+	  on for MMIO and can leave a device in an incorrect state.
+
+	  The workaround promotes the raw MMIO store helpers
+	  (__raw_writeb/w/l/q) to Store-Release (STLR), which restores the
+	  required ordering. Because writel() and writel_relaxed() are built
+	  on __raw_writel(), both are covered without changes to the higher
+	  layers. It also promotes the DGH hint used after write-combining
+	  memcpy-to-IO sequences to a DMB, so grouped stores are ordered
+	  against subsequent reads without placing a barrier in the copy loop.
+
+	  The fix is applied through the alternatives framework, so enabling
+	  this option does not by itself activate the workaround: it is
+	  patched in only when an affected CPU is detected, and is a no-op on
+	  unaffected CPUs.
+
+	  If unsure, say Y.
+
 config ARM64_ERRATUM_834220
 	bool "Cortex-A57: 834220: Stage 2 translation fault might be incorrectly reported in presence of a Stage 1 fault (rare)"
 	depends on KVM
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 9495c4441a46..22792d1305aa 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -38,7 +38,9 @@
  * Device-GRE attributes before the hint instruction with any memory accesses
  * appearing after the hint instruction.
  */
-#define dgh()		asm volatile("hint #6" : : : "memory")
+#define dgh()		asm volatile(ALTERNATIVE("hint #6", "dmb osh",	\
+					 ARM64_WORKAROUND_DEVICE_STORE_RELEASE)	\
+				     : : : "memory")
 
 #define spec_bar()	asm volatile(ALTERNATIVE("dsb nsh\nisb\n",		\
 						 SB_BARRIER_INSN"nop\n",	\
diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
index 8cbd1e96fd50..69e0fa004d31 100644
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -16,7 +16,6 @@
 #include <asm/memory.h>
 #include <asm/early_ioremap.h>
 #include <asm/alternative.h>
-#include <asm/cpufeature.h>
 #include <asm/rsi.h>
 
 /*
@@ -25,29 +24,37 @@
 #define __raw_writeb __raw_writeb
 static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
 {
-	volatile u8 __iomem *ptr = addr;
-	asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+	asm volatile(ALTERNATIVE("strb %w0, [%1]",
+				 "stlrb %w0, [%1]",
+				 ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+		     : : "rZ" (val), "r" (addr));
 }
 
 #define __raw_writew __raw_writew
 static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
 {
-	volatile u16 __iomem *ptr = addr;
-	asm volatile("strh %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+	asm volatile(ALTERNATIVE("strh %w0, [%1]",
+				 "stlrh %w0, [%1]",
+				 ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+		     : : "rZ" (val), "r" (addr));
 }
 
 #define __raw_writel __raw_writel
 static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
 {
-	volatile u32 __iomem *ptr = addr;
-	asm volatile("str %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+	asm volatile(ALTERNATIVE("str %w0, [%1]",
+				 "stlr %w0, [%1]",
+				 ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+		     : : "rZ" (val), "r" (addr));
 }
 
 #define __raw_writeq __raw_writeq
 static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
 {
-	volatile u64 __iomem *ptr = addr;
-	asm volatile("str %x0, %1" : : "rZ" (val), "Qo" (*ptr));
+	asm volatile(ALTERNATIVE("str %x0, [%1]",
+				 "stlr %x0, [%1]",
+				 ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+		     : : "rZ" (val), "r" (addr));
 }
 
 #define __raw_readb __raw_readb
@@ -178,7 +185,8 @@ __const_memcpy_toio_aligned32(volatile u32 __iomem *to, const u32 *from,
 			     : "rZ"(from[0]), "rZ"(from[1]), "r"(to));
 		break;
 	case 1:
-		__raw_writel(*from, to);
+		asm volatile("str %w0, [%1]"
+			     : : "rZ"(from[0]), "r"(to) : "memory");
 		break;
 	default:
 		BUILD_BUG();
@@ -235,7 +243,8 @@ __const_memcpy_toio_aligned64(volatile u64 __iomem *to, const u64 *from,
 			     : "rZ"(from[0]), "rZ"(from[1]), "r"(to));
 		break;
 	case 1:
-		__raw_writeq(*from, to);
+		asm volatile("str %x0, [%1]"
+			     : : "rZ"(from[0]), "r"(to) : "memory");
 		break;
 	default:
 		BUILD_BUG();
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 4b0d5d932897..76c1f8cf1ee0 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -839,6 +839,14 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
 		ERRATA_MIDR_ALL_VERSIONS(MIDR_NVIDIA_CARMEL),
 	},
 #endif
+#ifdef CONFIG_NVIDIA_OLYMPUS_1027_ERRATUM
+	{
+		/* NVIDIA Olympus core */
+		.desc = "NVIDIA Olympus device load/store ordering erratum",
+		.capability = ARM64_WORKAROUND_DEVICE_STORE_RELEASE,
+		ERRATA_MIDR_ALL_VERSIONS(MIDR_NVIDIA_OLYMPUS),
+	},
+#endif
 #ifdef CONFIG_ARM64_WORKAROUND_TRBE_OVERWRITE_FILL_MODE
 	{
 		/*
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index 811c2479e82d..d367257bf770 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -120,6 +120,7 @@ WORKAROUND_CAVIUM_TX2_219_PRFM
 WORKAROUND_CAVIUM_TX2_219_TVM
 WORKAROUND_CLEAN_CACHE
 WORKAROUND_DEVICE_LOAD_ACQUIRE
+WORKAROUND_DEVICE_STORE_RELEASE
 WORKAROUND_NVIDIA_CARMEL_CNP
 WORKAROUND_PMUV3_IMPDEF_TRAPS
 WORKAROUND_QCOM_FALKOR_E1003
-- 
2.54.0.windows.1



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v4 2/2] arm64: io: apply the device store-release workaround once per block write
  2026-06-25 18:24 [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering Shanker Donthineni
  2026-06-25 18:24 ` [PATCH v4 1/2] arm64: errata: Workaround " Shanker Donthineni
@ 2026-06-25 18:24 ` Shanker Donthineni
  2026-06-29 10:48   ` Vladimir Murzin
  2026-06-29 10:45 ` [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering Vladimir Murzin
  2 siblings, 1 reply; 10+ messages in thread
From: Shanker Donthineni @ 2026-06-25 18:24 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Vladimir Murzin
  Cc: Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel,
	linux-doc, Shanker Donthineni, Vikram Sethi, Jason Sequeira

The generic memset_io()/memcpy_toio() are built on __raw_write*(), so on
parts with the NVIDIA Olympus device store/load ordering erratum the
ARM64_WORKAROUND_DEVICE_STORE_RELEASE workaround promotes every store in
the block to a store-release. Each stlr* carries a barrier cost, so block
MMIO becomes O(n) store-releases, making a block copy many times slower
than a single ordered burst and growing with the transfer size.

Provide arm64 memset_io()/memcpy_toio() that emit plain str* in the loop
and order the whole block against subsequent loads with a single
trailing dmb osh on affected CPUs (a no-op elsewhere, preserving the
relaxed contract of these helpers). This keeps block MMIO writes at
one-barrier cost rather than scaling with the transfer size.

Performance (NVIDIA Olympus, write-combining MMIO to a device BAR, single
PE pinned; per-call cost in ns; consecutive writes ping-pong between two
buffers so repeated stores are not coalesced; iowrite64/iowrite32 =
__iowrite{64,32}_copy()):

Table 1 - arm64 memset_io/memcpy_toio (this patch)
+-------+-----------+-----------+-----------+-------------+
|  size | iowrite64 | iowrite32 | memset_io | memcpy_toio |
+-------+-----------+-----------+-----------+-------------+
|    8B |  231.6 ns |  231.6 ns |  232.4 ns |  232.4 ns   |
|   16B |  231.7 ns |  231.9 ns |  232.7 ns |  232.6 ns   |
|   32B |  231.9 ns |  232.7 ns |  232.9 ns |  232.9 ns   |
|   64B |  232.7 ns |  235.0 ns |  233.7 ns |  233.6 ns   |
|  128B |  233.6 ns |  235.8 ns |  234.4 ns |  234.3 ns   |
|  256B |  237.7 ns |  276.8 ns |  264.0 ns |  276.7 ns   |
|  512B |  237.7 ns |  277.1 ns |  238.1 ns |  277.6 ns   |
|   1KB |  253.7 ns |  279.3 ns |  276.1 ns |  294.1 ns   |
|   2KB |  295.0 ns |  318.7 ns |  288.5 ns |  308.3 ns   |
|   4KB |  365.9 ns |  381.4 ns |  365.7 ns |  381.3 ns   |
+-------+-----------+-----------+-----------+-------------+
all four helpers end with a single trailing barrier (dmb osh).

Table 2 - generic per-store memset_io/memcpy_toio
+-------+-----------+-----------+-------------+--------------+
|  size | iowrite64 | iowrite32 |   memset_io |  memcpy_toio |
+-------+-----------+-----------+-------------+--------------+
|    8B |  231.6 ns |  231.6 ns |    229.0 ns |    229.0 ns  |
|   16B |  231.7 ns |  231.9 ns |    458.4 ns |    458.5 ns  |
|   32B |  231.9 ns |  232.7 ns |    917.4 ns |    917.5 ns  |
|   64B |  232.7 ns |  234.8 ns |   1835.4 ns |   1835.5 ns  |
|  128B |  233.6 ns |  235.8 ns |   3670.9 ns |   3670.8 ns  |
|  256B |  237.7 ns |  276.7 ns |   7341.6 ns |   7341.6 ns  |
|  512B |  237.7 ns |  279.4 ns |  14001.4 ns |  14001.3 ns  |
|   1KB |  253.7 ns |  279.1 ns |  28631.5 ns |  28631.8 ns  |
|   2KB |  279.4 ns |  317.9 ns |  57276.3 ns |  57275.2 ns  |
|   4KB |  365.7 ns |  381.5 ns | 114564.4 ns | 114563.6 ns  |
+-------+-----------+-----------+-------------+--------------+
the generic memset_io()/memcpy_toio() build on __raw_write*(), which the
workaround promotes to store-release, so every store is individually
ordered - hence O(n) in the store count.

The arm64 versions stay flat at one-barrier cost while the generic
per-store writers collapse to O(n): at 4KB ~314x slower (~115 us vs
~366 ns).

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
---
 arch/arm64/include/asm/io.h |  5 +++
 arch/arm64/kernel/io.c      | 82 +++++++++++++++++++++++++++++++++++++
 2 files changed, 87 insertions(+)

diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
index 69e0fa004d31..649503f347bc 100644
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -266,6 +266,11 @@ __iowrite64_copy(void __iomem *to, const void *from, size_t count)
 }
 #define __iowrite64_copy __iowrite64_copy
 
+void memset_io(volatile void __iomem *dst, int c, size_t count);
+#define memset_io memset_io
+void memcpy_toio(volatile void __iomem *dst, const void *src, size_t count);
+#define memcpy_toio memcpy_toio
+
 /*
  * I/O memory mapping functions.
  */
diff --git a/arch/arm64/kernel/io.c b/arch/arm64/kernel/io.c
index fe86ada23c7d..b5fd9ee6d9eb 100644
--- a/arch/arm64/kernel/io.c
+++ b/arch/arm64/kernel/io.c
@@ -5,9 +5,91 @@
  * Copyright (C) 2012 ARM Ltd.
  */
 
+#include <linux/align.h>
 #include <linux/export.h>
 #include <linux/types.h>
 #include <linux/io.h>
+#include <linux/unaligned.h>
+
+#include <asm/alternative.h>
+
+/*
+ * ARM64_WORKAROUND_DEVICE_STORE_RELEASE promotes every raw MMIO store
+ * (__raw_write*()) to a store-release on affected CPUs. The generic
+ * memset_io()/memcpy_toio() are built on those helpers, so the workaround would
+ * emit one store-release per element and turn a block write into O(n) ordered
+ * stores - far more costly than the single barrier a block actually needs.
+ *
+ * Provide arm64 versions that emit plain STR in the loop and order the whole
+ * block against subsequent loads with one trailing DMB OSH, patched in only on
+ * affected CPUs (a no-op elsewhere, so the relaxed contract of these helpers is
+ * preserved).
+ *
+ * This capability is currently enabled only for the NVIDIA Olympus device
+ * store/load ordering erratum, where a Device-nGnR* load may be observed before
+ * an older, non-overlapping Device-nGnR* store to the same peripheral.
+ */
+static __always_inline void iomem_block_store_barrier(void)
+{
+	asm volatile(ALTERNATIVE("nop", "dmb osh",
+				 ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+		     : : : "memory");
+}
+
+void memset_io(volatile void __iomem *dst, int c, size_t count)
+{
+	u64 qc = (u8)c;
+
+	qc *= ~0ULL / 0xff;
+
+	while (count && !IS_ALIGNED((__force unsigned long)dst, sizeof(u64))) {
+		asm volatile("strb %w0, [%1]" : : "rZ"((u8)c), "r"(dst) : "memory");
+		dst++;
+		count--;
+	}
+	while (count >= sizeof(u64)) {
+		asm volatile("str %x0, [%1]" : : "rZ"(qc), "r"(dst) : "memory");
+		dst += sizeof(u64);
+		count -= sizeof(u64);
+	}
+	while (count) {
+		asm volatile("strb %w0, [%1]" : : "rZ"((u8)c), "r"(dst) : "memory");
+		dst++;
+		count--;
+	}
+
+	iomem_block_store_barrier();
+}
+EXPORT_SYMBOL(memset_io);
+
+void memcpy_toio(volatile void __iomem *dst, const void *src, size_t count)
+{
+	while (count && !IS_ALIGNED((__force unsigned long)dst, sizeof(u64))) {
+		asm volatile("strb %w0, [%1]"
+			     : : "rZ"(*(const u8 *)src), "r"(dst) : "memory");
+		src++;
+		dst++;
+		count--;
+	}
+	while (count >= sizeof(u64)) {
+		asm volatile("str %x0, [%1]"
+			     : : "rZ"(get_unaligned((const u64 *)src)), "r"(dst)
+			     : "memory");
+		src += sizeof(u64);
+		dst += sizeof(u64);
+		count -= sizeof(u64);
+	}
+	while (count) {
+		asm volatile("strb %w0, [%1]"
+			     : : "rZ"(*(const u8 *)src), "r"(dst) : "memory");
+		src++;
+		dst++;
+		count--;
+	}
+
+	iomem_block_store_barrier();
+}
+EXPORT_SYMBOL(memcpy_toio);
 
 /*
  * This generates a memcpy that works on a from/to address which is aligned to
-- 
2.54.0.windows.1



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering
  2026-06-25 18:24 [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering Shanker Donthineni
  2026-06-25 18:24 ` [PATCH v4 1/2] arm64: errata: Workaround " Shanker Donthineni
  2026-06-25 18:24 ` [PATCH v4 2/2] arm64: io: apply the device store-release workaround once per block write Shanker Donthineni
@ 2026-06-29 10:45 ` Vladimir Murzin
  2026-06-29 23:08   ` Shanker Donthineni
  2 siblings, 1 reply; 10+ messages in thread
From: Vladimir Murzin @ 2026-06-29 10:45 UTC (permalink / raw)
  To: Shanker Donthineni, Catalin Marinas, Will Deacon
  Cc: Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel,
	linux-doc, Vikram Sethi, Jason Sequeira

Hi,

On 6/25/26 19:24, Shanker Donthineni wrote:
> This series works around the NVIDIA Olympus device store/load ordering
> erratum (T410-OLY-1027): a Device-nGnR* load can be observed by a
> peripheral before an older, non-overlapping Device-nGnR* store to the
> same peripheral, breaking the program order that drivers rely on for
> MMIO and potentially leaving a device in an incorrect state.
> 
> Patch 1 adds the workaround. It promotes the raw MMIO store helpers
> (__raw_writeb/w/l/q, and therefore writel()/writel_relaxed()) to
> store-release on affected CPUs, and promotes the trailing DGH of the
> write-combining __iowrite{32,64}_copy() helpers to dmb osh. Everything is
> gated on a new ARM64_WORKAROUND_DEVICE_STORE_RELEASE cpucap and patched
> in only on affected parts, so it is a no-op elsewhere.
> 
> Patch 2 provides arm64 memset_io()/memcpy_toio(). The generic versions
> are built on __raw_write*(), so patch 1 would promote every store in a
> block to a store-release; as each STLR drains the write-combining buffer,
> block MMIO becomes O(n) store-releases. The arm64 versions emit plain
> STR in the loop and order the whole block with a single trailing dmb osh,
> keeping block MMIO at one-barrier cost.
> 
> Performance: NVIDIA Olympus, write-combining MMIO to a device BAR, single
> PE pinned; per-call cost in ns. Consecutive writes ping-pong between two
> buffers so repeated stores are not coalesced. iowrite64/iowrite32 =
> __iowrite{64,32}_copy().
> 
> Table 1 - workaround off (CONFIG_NVIDIA_OLYMPUS_1027_ERRATUM=n)
> +-------+-----------+-----------+-----------+-------------+
> |  size | iowrite64 | iowrite32 | memset_io | memcpy_toio |
> +-------+-----------+-----------+-----------+-------------+
> |    8B |   67.9 ns |   67.8 ns |    3.6 ns |    3.6 ns   |
> |   16B |   67.9 ns |   67.8 ns |    4.0 ns |    4.0 ns   |
> |   32B |   67.9 ns |   67.9 ns |    4.6 ns |    4.6 ns   |
> |   64B |   69.1 ns |   69.1 ns |   69.1 ns |   69.0 ns   |
> |  128B |  138.3 ns |  138.3 ns |  138.4 ns |  138.3 ns   |
> |  256B |  276.6 ns |  276.6 ns |  276.6 ns |  276.7 ns   |
> |  512B |  276.6 ns |  276.5 ns |  276.6 ns |  276.6 ns   |
> |   1KB |  276.6 ns |  278.4 ns |  276.6 ns |  276.6 ns   |
> |   2KB |  278.4 ns |  278.4 ns |  275.9 ns |  276.6 ns   |
> |   4KB |  365.7 ns |  365.7 ns |  365.7 ns |  365.7 ns   |
> +-------+-----------+-----------+-----------+-------------+
> relaxed/no-flush: memset_io()/memcpy_toio() issue plain stores with no
> trailing dgh() or barrier, unlike __iowrite*_copy() which ends with dgh().
> 
> Table 2 - workaround on, arm64 memset_io/memcpy_toio (this series)
> +-------+-----------+-----------+-----------+-------------+
> |  size | iowrite64 | iowrite32 | memset_io | memcpy_toio |
> +-------+-----------+-----------+-----------+-------------+
> |    8B |  231.6 ns |  231.6 ns |  232.4 ns |  232.4 ns   |
> |   16B |  231.7 ns |  231.9 ns |  232.7 ns |  232.6 ns   |
> |   32B |  231.9 ns |  232.7 ns |  232.9 ns |  232.9 ns   |
> |   64B |  232.7 ns |  235.0 ns |  233.7 ns |  233.6 ns   |
> |  128B |  233.6 ns |  235.8 ns |  234.4 ns |  234.3 ns   |
> |  256B |  237.7 ns |  276.8 ns |  264.0 ns |  276.7 ns   |
> |  512B |  237.7 ns |  277.1 ns |  238.1 ns |  277.6 ns   |
> |   1KB |  253.7 ns |  279.3 ns |  276.1 ns |  294.1 ns   |
> |   2KB |  295.0 ns |  318.7 ns |  288.5 ns |  308.3 ns   |
> |   4KB |  365.9 ns |  381.4 ns |  365.7 ns |  381.3 ns   |
> +-------+-----------+-----------+-----------+-------------+
> all four helpers end with a single trailing barrier (dmb osh).
> 
> Table 3 - workaround on, generic per-store memset_io/memcpy_toio
> +-------+-----------+-----------+-------------+--------------+
> |  size | iowrite64 | iowrite32 |   memset_io |  memcpy_toio |
> +-------+-----------+-----------+-------------+--------------+
> |    8B |  231.6 ns |  231.6 ns |    229.0 ns |    229.0 ns  |
> |   16B |  231.7 ns |  231.9 ns |    458.4 ns |    458.5 ns  |
> |   32B |  231.9 ns |  232.7 ns |    917.4 ns |    917.5 ns  |
> |   64B |  232.7 ns |  234.8 ns |   1835.4 ns |   1835.5 ns  |
> |  128B |  233.6 ns |  235.8 ns |   3670.9 ns |   3670.8 ns  |
> |  256B |  237.7 ns |  276.7 ns |   7341.6 ns |   7341.6 ns  |
> |  512B |  237.7 ns |  279.4 ns |  14001.4 ns |  14001.3 ns  |
> |   1KB |  253.7 ns |  279.1 ns |  28631.5 ns |  28631.8 ns  |
> |   2KB |  279.4 ns |  317.9 ns |  57276.3 ns |  57275.2 ns  |
> |   4KB |  365.7 ns |  381.5 ns | 114564.4 ns | 114563.6 ns  |
> +-------+-----------+-----------+-------------+--------------+
> the generic memset_io()/memcpy_toio() build on __raw_write*(), which the
> workaround promotes to store-release, so every store is individually
> ordered - hence O(n) in the store count.
> 
> Tables 2 and 3 show why patch 2 is needed: the generic per-store block
> writers collapse to O(n) under the workaround (4KB ~314x slower, ~115 us
> vs ~366 ns), while the arm64 versions stay flat at one-barrier cost.

That's interesting. With the way the patch set is structured, it
now looks like:

1. Fix the erratum, but cause a performance regression.
2. Restore the performance regression and (re)apply the erratum
   workaround.

Would it make sense to avoid introducing the performance
regression in the first place by structuring the patch set
slightly differently?

1. (Re)introduce arm64 memset_io()/memcpy_toio().
2. Fix the erratum once for all

What do you reckon?

Cheers
Vladimir



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 2/2] arm64: io: apply the device store-release workaround once per block write
  2026-06-25 18:24 ` [PATCH v4 2/2] arm64: io: apply the device store-release workaround once per block write Shanker Donthineni
@ 2026-06-29 10:48   ` Vladimir Murzin
  2026-06-29 23:09     ` Shanker Donthineni
  0 siblings, 1 reply; 10+ messages in thread
From: Vladimir Murzin @ 2026-06-29 10:48 UTC (permalink / raw)
  To: Shanker Donthineni, Catalin Marinas, Will Deacon
  Cc: Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel,
	linux-doc, Vikram Sethi, Jason Sequeira

Hi,

On 6/25/26 19:24, Shanker Donthineni wrote:
> The generic memset_io()/memcpy_toio() are built on __raw_write*(), so on
> parts with the NVIDIA Olympus device store/load ordering erratum the
> ARM64_WORKAROUND_DEVICE_STORE_RELEASE workaround promotes every store in
> the block to a store-release. Each stlr* carries a barrier cost, so block
> MMIO becomes O(n) store-releases, making a block copy many times slower
> than a single ordered burst and growing with the transfer size.
> 
> Provide arm64 memset_io()/memcpy_toio() that emit plain str* in the loop
> and order the whole block against subsequent loads with a single
> trailing dmb osh on affected CPUs (a no-op elsewhere, preserving the
> relaxed contract of these helpers). This keeps block MMIO writes at
> one-barrier cost rather than scaling with the transfer size.
> 
> Performance (NVIDIA Olympus, write-combining MMIO to a device BAR, single
> PE pinned; per-call cost in ns; consecutive writes ping-pong between two
> buffers so repeated stores are not coalesced; iowrite64/iowrite32 =
> __iowrite{64,32}_copy()):
> 
> Table 1 - arm64 memset_io/memcpy_toio (this patch)
> +-------+-----------+-----------+-----------+-------------+
> |  size | iowrite64 | iowrite32 | memset_io | memcpy_toio |
> +-------+-----------+-----------+-----------+-------------+
> |    8B |  231.6 ns |  231.6 ns |  232.4 ns |  232.4 ns   |
> |   16B |  231.7 ns |  231.9 ns |  232.7 ns |  232.6 ns   |
> |   32B |  231.9 ns |  232.7 ns |  232.9 ns |  232.9 ns   |
> |   64B |  232.7 ns |  235.0 ns |  233.7 ns |  233.6 ns   |
> |  128B |  233.6 ns |  235.8 ns |  234.4 ns |  234.3 ns   |
> |  256B |  237.7 ns |  276.8 ns |  264.0 ns |  276.7 ns   |
> |  512B |  237.7 ns |  277.1 ns |  238.1 ns |  277.6 ns   |
> |   1KB |  253.7 ns |  279.3 ns |  276.1 ns |  294.1 ns   |
> |   2KB |  295.0 ns |  318.7 ns |  288.5 ns |  308.3 ns   |
> |   4KB |  365.9 ns |  381.4 ns |  365.7 ns |  381.3 ns   |
> +-------+-----------+-----------+-----------+-------------+
> all four helpers end with a single trailing barrier (dmb osh).
> 
> Table 2 - generic per-store memset_io/memcpy_toio
> +-------+-----------+-----------+-------------+--------------+
> |  size | iowrite64 | iowrite32 |   memset_io |  memcpy_toio |
> +-------+-----------+-----------+-------------+--------------+
> |    8B |  231.6 ns |  231.6 ns |    229.0 ns |    229.0 ns  |
> |   16B |  231.7 ns |  231.9 ns |    458.4 ns |    458.5 ns  |
> |   32B |  231.9 ns |  232.7 ns |    917.4 ns |    917.5 ns  |
> |   64B |  232.7 ns |  234.8 ns |   1835.4 ns |   1835.5 ns  |
> |  128B |  233.6 ns |  235.8 ns |   3670.9 ns |   3670.8 ns  |
> |  256B |  237.7 ns |  276.7 ns |   7341.6 ns |   7341.6 ns  |
> |  512B |  237.7 ns |  279.4 ns |  14001.4 ns |  14001.3 ns  |
> |   1KB |  253.7 ns |  279.1 ns |  28631.5 ns |  28631.8 ns  |
> |   2KB |  279.4 ns |  317.9 ns |  57276.3 ns |  57275.2 ns  |
> |   4KB |  365.7 ns |  381.5 ns | 114564.4 ns | 114563.6 ns  |
> +-------+-----------+-----------+-------------+--------------+
> the generic memset_io()/memcpy_toio() build on __raw_write*(), which the
> workaround promotes to store-release, so every store is individually
> ordered - hence O(n) in the store count.
> 
> The arm64 versions stay flat at one-barrier cost while the generic
> per-store writers collapse to O(n): at 4KB ~314x slower (~115 us vs
> ~366 ns).
> 
> Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
> ---
>  arch/arm64/include/asm/io.h |  5 +++
>  arch/arm64/kernel/io.c      | 82 +++++++++++++++++++++++++++++++++++++
>  2 files changed, 87 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
> index 69e0fa004d31..649503f347bc 100644
> --- a/arch/arm64/include/asm/io.h
> +++ b/arch/arm64/include/asm/io.h
> @@ -266,6 +266,11 @@ __iowrite64_copy(void __iomem *to, const void *from, size_t count)
>  }
>  #define __iowrite64_copy __iowrite64_copy
>  
> +void memset_io(volatile void __iomem *dst, int c, size_t count);
> +#define memset_io memset_io
> +void memcpy_toio(volatile void __iomem *dst, const void *src, size_t count);
> +#define memcpy_toio memcpy_toio
> +
>  /*
>   * I/O memory mapping functions.
>   */
> diff --git a/arch/arm64/kernel/io.c b/arch/arm64/kernel/io.c
> index fe86ada23c7d..b5fd9ee6d9eb 100644
> --- a/arch/arm64/kernel/io.c
> +++ b/arch/arm64/kernel/io.c
> @@ -5,9 +5,91 @@
>   * Copyright (C) 2012 ARM Ltd.
>   */
>  
> +#include <linux/align.h>
>  #include <linux/export.h>
>  #include <linux/types.h>
>  #include <linux/io.h>
> +#include <linux/unaligned.h>
> +
> +#include <asm/alternative.h>
> +
> +/*
> + * ARM64_WORKAROUND_DEVICE_STORE_RELEASE promotes every raw MMIO store
> + * (__raw_write*()) to a store-release on affected CPUs. The generic
> + * memset_io()/memcpy_toio() are built on those helpers, so the workaround would
> + * emit one store-release per element and turn a block write into O(n) ordered
> + * stores - far more costly than the single barrier a block actually needs.
> + *
> + * Provide arm64 versions that emit plain STR in the loop and order the whole
> + * block against subsequent loads with one trailing DMB OSH, patched in only on
> + * affected CPUs (a no-op elsewhere, so the relaxed contract of these helpers is
> + * preserved).
> + *
> + * This capability is currently enabled only for the NVIDIA Olympus device
> + * store/load ordering erratum, where a Device-nGnR* load may be observed before
> + * an older, non-overlapping Device-nGnR* store to the same peripheral.
> + */
> +static __always_inline void iomem_block_store_barrier(void)
> +{
> +	asm volatile(ALTERNATIVE("nop", "dmb osh",
> +				 ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
> +		     : : : "memory");
> +}
> +
> +void memset_io(volatile void __iomem *dst, int c, size_t count)
> +{
> +	u64 qc = (u8)c;
> +
> +	qc *= ~0ULL / 0xff;
> +
> +	while (count && !IS_ALIGNED((__force unsigned long)dst, sizeof(u64))) {
> +		asm volatile("strb %w0, [%1]" : : "rZ"((u8)c), "r"(dst) : "memory");
> +		dst++;
> +		count--;
> +	}
> +	while (count >= sizeof(u64)) {
> +		asm volatile("str %x0, [%1]" : : "rZ"(qc), "r"(dst) : "memory");
> +		dst += sizeof(u64);
> +		count -= sizeof(u64);
> +	}
> +	while (count) {
> +		asm volatile("strb %w0, [%1]" : : "rZ"((u8)c), "r"(dst) : "memory");
> +		dst++;
> +		count--;
> +	}
> +
> +	iomem_block_store_barrier();
> +}
> +EXPORT_SYMBOL(memset_io);
> +
> +void memcpy_toio(volatile void __iomem *dst, const void *src, size_t count)
> +{
> +	while (count && !IS_ALIGNED((__force unsigned long)dst, sizeof(u64))) {
> +		asm volatile("strb %w0, [%1]"
> +			     : : "rZ"(*(const u8 *)src), "r"(dst) : "memory");
> +		src++;
> +		dst++;
> +		count--;
> +	}
> +	while (count >= sizeof(u64)) {
> +		asm volatile("str %x0, [%1]"
> +			     : : "rZ"(get_unaligned((const u64 *)src)), "r"(dst)

Why do we need get_unaligned() here? I understand this came from
the generic implementation, where it needs to handle architectures
that do not support unaligned accesses. But IIUC this is not an
issue for arm64, and there was no special handling in memcpy_toio()
before 0110feaaf6d0 ("arm64: Use new fallback IO memcpy/memset").
Am I missing something?

> +			     : "memory");
> +		src += sizeof(u64);
> +		dst += sizeof(u64);
> +		count -= sizeof(u64);
> +	}
> +	while (count) {
> +		asm volatile("strb %w0, [%1]"
> +			     : : "rZ"(*(const u8 *)src), "r"(dst) : "memory");
> +		src++;
> +		dst++;
> +		count--;
> +	}
> +
> +	iomem_block_store_barrier();

It is perhaps a matter of taste, but having the inline assembly
here (and in memset_io()) might make the code clearer. To a
casual reader, it would be obvious that the barrier is not
guaranteed and is only applicable to ARM64_WORKAROUND_DEVICE_STORE_RELEASE,
without having to jump back and forth through the code.

Obliviously maintainers might have different preference ;) 

Cheers
Vladimir

> +}
> +EXPORT_SYMBOL(memcpy_toio);
>  
>  /*
>   * This generates a memcpy that works on a from/to address which is aligned to
> -- 2.54.0.windows.1
> 



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering
  2026-06-29 10:45 ` [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering Vladimir Murzin
@ 2026-06-29 23:08   ` Shanker Donthineni
  2026-06-30 13:53     ` Will Deacon
  0 siblings, 1 reply; 10+ messages in thread
From: Shanker Donthineni @ 2026-06-29 23:08 UTC (permalink / raw)
  To: Vladimir Murzin, Catalin Marinas, Will Deacon
  Cc: Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel,
	linux-doc, Vikram Sethi, Jason Sequeira


Hi Vladimir,


On 6/29/2026 5:45 AM, Vladimir Murzin wrote:
> External email: Use caution opening links or attachments
>
>
> Hi,
>
> On 6/25/26 19:24, Shanker Donthineni wrote:
>> This series works around the NVIDIA Olympus device store/load ordering
>> erratum (T410-OLY-1027): a Device-nGnR* load can be observed by a
>> peripheral before an older, non-overlapping Device-nGnR* store to the
>> same peripheral, breaking the program order that drivers rely on for
>> MMIO and potentially leaving a device in an incorrect state.
>>
>> Patch 1 adds the workaround. It promotes the raw MMIO store helpers
>> (__raw_writeb/w/l/q, and therefore writel()/writel_relaxed()) to
>> store-release on affected CPUs, and promotes the trailing DGH of the
>> write-combining __iowrite{32,64}_copy() helpers to dmb osh. Everything is
>> gated on a new ARM64_WORKAROUND_DEVICE_STORE_RELEASE cpucap and patched
>> in only on affected parts, so it is a no-op elsewhere.
>>
>> Patch 2 provides arm64 memset_io()/memcpy_toio(). The generic versions
>> are built on __raw_write*(), so patch 1 would promote every store in a
>> block to a store-release; as each STLR drains the write-combining buffer,
>> block MMIO becomes O(n) store-releases. The arm64 versions emit plain
>> STR in the loop and order the whole block with a single trailing dmb osh,
>> keeping block MMIO at one-barrier cost.
>>
>> Performance: NVIDIA Olympus, write-combining MMIO to a device BAR, single
>> PE pinned; per-call cost in ns. Consecutive writes ping-pong between two
>> buffers so repeated stores are not coalesced. iowrite64/iowrite32 =
>> __iowrite{64,32}_copy().
>>
>> Table 1 - workaround off (CONFIG_NVIDIA_OLYMPUS_1027_ERRATUM=n)
>> +-------+-----------+-----------+-----------+-------------+
>> |  size | iowrite64 | iowrite32 | memset_io | memcpy_toio |
>> +-------+-----------+-----------+-----------+-------------+
>> |    8B |   67.9 ns |   67.8 ns |    3.6 ns |    3.6 ns   |
>> |   16B |   67.9 ns |   67.8 ns |    4.0 ns |    4.0 ns   |
>> |   32B |   67.9 ns |   67.9 ns |    4.6 ns |    4.6 ns   |
>> |   64B |   69.1 ns |   69.1 ns |   69.1 ns |   69.0 ns   |
>> |  128B |  138.3 ns |  138.3 ns |  138.4 ns |  138.3 ns   |
>> |  256B |  276.6 ns |  276.6 ns |  276.6 ns |  276.7 ns   |
>> |  512B |  276.6 ns |  276.5 ns |  276.6 ns |  276.6 ns   |
>> |   1KB |  276.6 ns |  278.4 ns |  276.6 ns |  276.6 ns   |
>> |   2KB |  278.4 ns |  278.4 ns |  275.9 ns |  276.6 ns   |
>> |   4KB |  365.7 ns |  365.7 ns |  365.7 ns |  365.7 ns   |
>> +-------+-----------+-----------+-----------+-------------+
>> relaxed/no-flush: memset_io()/memcpy_toio() issue plain stores with no
>> trailing dgh() or barrier, unlike __iowrite*_copy() which ends with dgh().
>>
>> Table 2 - workaround on, arm64 memset_io/memcpy_toio (this series)
>> +-------+-----------+-----------+-----------+-------------+
>> |  size | iowrite64 | iowrite32 | memset_io | memcpy_toio |
>> +-------+-----------+-----------+-----------+-------------+
>> |    8B |  231.6 ns |  231.6 ns |  232.4 ns |  232.4 ns   |
>> |   16B |  231.7 ns |  231.9 ns |  232.7 ns |  232.6 ns   |
>> |   32B |  231.9 ns |  232.7 ns |  232.9 ns |  232.9 ns   |
>> |   64B |  232.7 ns |  235.0 ns |  233.7 ns |  233.6 ns   |
>> |  128B |  233.6 ns |  235.8 ns |  234.4 ns |  234.3 ns   |
>> |  256B |  237.7 ns |  276.8 ns |  264.0 ns |  276.7 ns   |
>> |  512B |  237.7 ns |  277.1 ns |  238.1 ns |  277.6 ns   |
>> |   1KB |  253.7 ns |  279.3 ns |  276.1 ns |  294.1 ns   |
>> |   2KB |  295.0 ns |  318.7 ns |  288.5 ns |  308.3 ns   |
>> |   4KB |  365.9 ns |  381.4 ns |  365.7 ns |  381.3 ns   |
>> +-------+-----------+-----------+-----------+-------------+
>> all four helpers end with a single trailing barrier (dmb osh).
>>
>> Table 3 - workaround on, generic per-store memset_io/memcpy_toio
>> +-------+-----------+-----------+-------------+--------------+
>> |  size | iowrite64 | iowrite32 |   memset_io |  memcpy_toio |
>> +-------+-----------+-----------+-------------+--------------+
>> |    8B |  231.6 ns |  231.6 ns |    229.0 ns |    229.0 ns  |
>> |   16B |  231.7 ns |  231.9 ns |    458.4 ns |    458.5 ns  |
>> |   32B |  231.9 ns |  232.7 ns |    917.4 ns |    917.5 ns  |
>> |   64B |  232.7 ns |  234.8 ns |   1835.4 ns |   1835.5 ns  |
>> |  128B |  233.6 ns |  235.8 ns |   3670.9 ns |   3670.8 ns  |
>> |  256B |  237.7 ns |  276.7 ns |   7341.6 ns |   7341.6 ns  |
>> |  512B |  237.7 ns |  279.4 ns |  14001.4 ns |  14001.3 ns  |
>> |   1KB |  253.7 ns |  279.1 ns |  28631.5 ns |  28631.8 ns  |
>> |   2KB |  279.4 ns |  317.9 ns |  57276.3 ns |  57275.2 ns  |
>> |   4KB |  365.7 ns |  381.5 ns | 114564.4 ns | 114563.6 ns  |
>> +-------+-----------+-----------+-------------+--------------+
>> the generic memset_io()/memcpy_toio() build on __raw_write*(), which the
>> workaround promotes to store-release, so every store is individually
>> ordered - hence O(n) in the store count.
>>
>> Tables 2 and 3 show why patch 2 is needed: the generic per-store block
>> writers collapse to O(n) under the workaround (4KB ~314x slower, ~115 us
>> vs ~366 ns), while the arm64 versions stay flat at one-barrier cost.
> That's interesting. With the way the patch set is structured, it
> now looks like:
>
> 1. Fix the erratum, but cause a performance regression.
> 2. Restore the performance regression and (re)apply the erratum
>     workaround.
>
> Would it make sense to avoid introducing the performance
> regression in the first place by structuring the patch set
> slightly differently?
>
> 1. (Re)introduce arm64 memset_io()/memcpy_toio().
> 2. Fix the erratum once for all
>
> What do you reckon?

Yes, that ordering makes sense.

I can restructure v5 so that patch 1 introduces the arm64 memset_{to}io()
implementations while preserving the existing behavior. Patch 2 will
then add the complete erratum workaround, including the conditional
trailing DMB for those block-write helpers. This avoids introducing
the intermediate performance regression and keeps each commit
independently usable.

Will and Catalin, could you please share your thoughts on this approach?

-Shanker




^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 2/2] arm64: io: apply the device store-release workaround once per block write
  2026-06-29 10:48   ` Vladimir Murzin
@ 2026-06-29 23:09     ` Shanker Donthineni
  2026-06-30 14:17       ` Will Deacon
  0 siblings, 1 reply; 10+ messages in thread
From: Shanker Donthineni @ 2026-06-29 23:09 UTC (permalink / raw)
  To: Vladimir Murzin, Catalin Marinas, Will Deacon
  Cc: Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel,
	linux-doc, Vikram Sethi, Jason Sequeira


Hi Vladimir,

On 6/29/2026 5:48 AM, Vladimir Murzin wrote:

> External email: Use caution opening links or attachments
>
>
> Hi,
>
> On 6/25/26 19:24, Shanker Donthineni wrote:
>> The generic memset_io()/memcpy_toio() are built on __raw_write*(), so on
>> parts with the NVIDIA Olympus device store/load ordering erratum the
>> ARM64_WORKAROUND_DEVICE_STORE_RELEASE workaround promotes every store in
>> the block to a store-release. Each stlr* carries a barrier cost, so block
>> MMIO becomes O(n) store-releases, making a block copy many times slower
>> than a single ordered burst and growing with the transfer size.
>>
>> Provide arm64 memset_io()/memcpy_toio() that emit plain str* in the loop
>> and order the whole block against subsequent loads with a single
>> trailing dmb osh on affected CPUs (a no-op elsewhere, preserving the
>> relaxed contract of these helpers). This keeps block MMIO writes at
>> one-barrier cost rather than scaling with the transfer size.
>>
>> Performance (NVIDIA Olympus, write-combining MMIO to a device BAR, single
>> PE pinned; per-call cost in ns; consecutive writes ping-pong between two
>> buffers so repeated stores are not coalesced; iowrite64/iowrite32 =
>> __iowrite{64,32}_copy()):
>>
>> Table 1 - arm64 memset_io/memcpy_toio (this patch)
>> +-------+-----------+-----------+-----------+-------------+
>> |  size | iowrite64 | iowrite32 | memset_io | memcpy_toio |
>> +-------+-----------+-----------+-----------+-------------+
>> |    8B |  231.6 ns |  231.6 ns |  232.4 ns |  232.4 ns   |
>> |   16B |  231.7 ns |  231.9 ns |  232.7 ns |  232.6 ns   |
>> |   32B |  231.9 ns |  232.7 ns |  232.9 ns |  232.9 ns   |
>> |   64B |  232.7 ns |  235.0 ns |  233.7 ns |  233.6 ns   |
>> |  128B |  233.6 ns |  235.8 ns |  234.4 ns |  234.3 ns   |
>> |  256B |  237.7 ns |  276.8 ns |  264.0 ns |  276.7 ns   |
>> |  512B |  237.7 ns |  277.1 ns |  238.1 ns |  277.6 ns   |
>> |   1KB |  253.7 ns |  279.3 ns |  276.1 ns |  294.1 ns   |
>> |   2KB |  295.0 ns |  318.7 ns |  288.5 ns |  308.3 ns   |
>> |   4KB |  365.9 ns |  381.4 ns |  365.7 ns |  381.3 ns   |
>> +-------+-----------+-----------+-----------+-------------+
>> all four helpers end with a single trailing barrier (dmb osh).
>>
>> Table 2 - generic per-store memset_io/memcpy_toio
>> +-------+-----------+-----------+-------------+--------------+
>> |  size | iowrite64 | iowrite32 |   memset_io |  memcpy_toio |
>> +-------+-----------+-----------+-------------+--------------+
>> |    8B |  231.6 ns |  231.6 ns |    229.0 ns |    229.0 ns  |
>> |   16B |  231.7 ns |  231.9 ns |    458.4 ns |    458.5 ns  |
>> |   32B |  231.9 ns |  232.7 ns |    917.4 ns |    917.5 ns  |
>> |   64B |  232.7 ns |  234.8 ns |   1835.4 ns |   1835.5 ns  |
>> |  128B |  233.6 ns |  235.8 ns |   3670.9 ns |   3670.8 ns  |
>> |  256B |  237.7 ns |  276.7 ns |   7341.6 ns |   7341.6 ns  |
>> |  512B |  237.7 ns |  279.4 ns |  14001.4 ns |  14001.3 ns  |
>> |   1KB |  253.7 ns |  279.1 ns |  28631.5 ns |  28631.8 ns  |
>> |   2KB |  279.4 ns |  317.9 ns |  57276.3 ns |  57275.2 ns  |
>> |   4KB |  365.7 ns |  381.5 ns | 114564.4 ns | 114563.6 ns  |
>> +-------+-----------+-----------+-------------+--------------+
>> the generic memset_io()/memcpy_toio() build on __raw_write*(), which the
>> workaround promotes to store-release, so every store is individually
>> ordered - hence O(n) in the store count.
>>
>> The arm64 versions stay flat at one-barrier cost while the generic
>> per-store writers collapse to O(n): at 4KB ~314x slower (~115 us vs
>> ~366 ns).
>>
>> Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
>> ---
>>   arch/arm64/include/asm/io.h |  5 +++
>>   arch/arm64/kernel/io.c      | 82 +++++++++++++++++++++++++++++++++++++
>>   2 files changed, 87 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
>> index 69e0fa004d31..649503f347bc 100644
>> --- a/arch/arm64/include/asm/io.h
>> +++ b/arch/arm64/include/asm/io.h
>> @@ -266,6 +266,11 @@ __iowrite64_copy(void __iomem *to, const void *from, size_t count)
>>   }
>>   #define __iowrite64_copy __iowrite64_copy
>>
>> +void memset_io(volatile void __iomem *dst, int c, size_t count);
>> +#define memset_io memset_io
>> +void memcpy_toio(volatile void __iomem *dst, const void *src, size_t count);
>> +#define memcpy_toio memcpy_toio
>> +
>>   /*
>>    * I/O memory mapping functions.
>>    */
>> diff --git a/arch/arm64/kernel/io.c b/arch/arm64/kernel/io.c
>> index fe86ada23c7d..b5fd9ee6d9eb 100644
>> --- a/arch/arm64/kernel/io.c
>> +++ b/arch/arm64/kernel/io.c
>> @@ -5,9 +5,91 @@
>>    * Copyright (C) 2012 ARM Ltd.
>>    */
>>
>> +#include <linux/align.h>
>>   #include <linux/export.h>
>>   #include <linux/types.h>
>>   #include <linux/io.h>
>> +#include <linux/unaligned.h>
>> +
>> +#include <asm/alternative.h>
>> +
>> +/*
>> + * ARM64_WORKAROUND_DEVICE_STORE_RELEASE promotes every raw MMIO store
>> + * (__raw_write*()) to a store-release on affected CPUs. The generic
>> + * memset_io()/memcpy_toio() are built on those helpers, so the workaround would
>> + * emit one store-release per element and turn a block write into O(n) ordered
>> + * stores - far more costly than the single barrier a block actually needs.
>> + *
>> + * Provide arm64 versions that emit plain STR in the loop and order the whole
>> + * block against subsequent loads with one trailing DMB OSH, patched in only on
>> + * affected CPUs (a no-op elsewhere, so the relaxed contract of these helpers is
>> + * preserved).
>> + *
>> + * This capability is currently enabled only for the NVIDIA Olympus device
>> + * store/load ordering erratum, where a Device-nGnR* load may be observed before
>> + * an older, non-overlapping Device-nGnR* store to the same peripheral.
>> + */
>> +static __always_inline void iomem_block_store_barrier(void)
>> +{
>> +     asm volatile(ALTERNATIVE("nop", "dmb osh",
>> +                              ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
>> +                  : : : "memory");
>> +}
>> +
>> +void memset_io(volatile void __iomem *dst, int c, size_t count)
>> +{
>> +     u64 qc = (u8)c;
>> +
>> +     qc *= ~0ULL / 0xff;
>> +
>> +     while (count && !IS_ALIGNED((__force unsigned long)dst, sizeof(u64))) {
>> +             asm volatile("strb %w0, [%1]" : : "rZ"((u8)c), "r"(dst) : "memory");
>> +             dst++;
>> +             count--;
>> +     }
>> +     while (count >= sizeof(u64)) {
>> +             asm volatile("str %x0, [%1]" : : "rZ"(qc), "r"(dst) : "memory");
>> +             dst += sizeof(u64);
>> +             count -= sizeof(u64);
>> +     }
>> +     while (count) {
>> +             asm volatile("strb %w0, [%1]" : : "rZ"((u8)c), "r"(dst) : "memory");
>> +             dst++;
>> +             count--;
>> +     }
>> +
>> +     iomem_block_store_barrier();
>> +}
>> +EXPORT_SYMBOL(memset_io);
>> +
>> +void memcpy_toio(volatile void __iomem *dst, const void *src, size_t count)
>> +{
>> +     while (count && !IS_ALIGNED((__force unsigned long)dst, sizeof(u64))) {
>> +             asm volatile("strb %w0, [%1]"
>> +                          : : "rZ"(*(const u8 *)src), "r"(dst) : "memory");
>> +             src++;
>> +             dst++;
>> +             count--;
>> +     }
>> +     while (count >= sizeof(u64)) {
>> +             asm volatile("str %x0, [%1]"
>> +                          : : "rZ"(get_unaligned((const u64 *)src)), "r"(dst)
> Why do we need get_unaligned() here? I understand this came from
> the generic implementation, where it needs to handle architectures
> that do not support unaligned accesses. But IIUC this is not an
> issue for arm64, and there was no special handling in memcpy_toio()
> before 0110feaaf6d0 ("arm64: Use new fallback IO memcpy/memset").
> Am I missing something?

Thanks for the review.

I used get_unaligned() because I was trying to keep the arm64 implementation
as close as possible to the generic memcpy_toio() implementation in
lib/iomem_copy.c. However, you are right that before commit 0110feaaf6d0
(“arm64: Use new fallback IO memcpy/memset”), the arm64 implementation
used a direct u64 load and did not explicitly handle source alignment. I
can restore the previous arm64 form in v5 if that is preferred.

>> +                          : "memory");
>> +             src += sizeof(u64);
>> +             dst += sizeof(u64);
>> +             count -= sizeof(u64);
>> +     }
>> +     while (count) {
>> +             asm volatile("strb %w0, [%1]"
>> +                          : : "rZ"(*(const u8 *)src), "r"(dst) : "memory");
>> +             src++;
>> +             dst++;
>> +             count--;
>> +     }
>> +
>> +     iomem_block_store_barrier();
> It is perhaps a matter of taste, but having the inline assembly
> here (and in memset_io()) might make the code clearer. To a
> casual reader, it would be obvious that the barrier is not
> guaranteed and is only applicable to ARM64_WORKAROUND_DEVICE_STORE_RELEASE,
> without having to jump back and forth through the code.
>
> Obliviously maintainers might have different preference ;)

Regarding the barrier, iomem_block_store_barrier() is declared
static __always_inline, so it does not add a function call. The nop/dmb
osh alternative is emitted directly in each caller. I used the helper to
avoid duplicating the alternative sequence.

I understand that placing the assembly directly in both functions could
make its conditional nature more obvious. I do not have a strong preference
and am happy to follow Will’s and Catalin’s preference here.

-Shanker




^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering
  2026-06-29 23:08   ` Shanker Donthineni
@ 2026-06-30 13:53     ` Will Deacon
  0 siblings, 0 replies; 10+ messages in thread
From: Will Deacon @ 2026-06-30 13:53 UTC (permalink / raw)
  To: Shanker Donthineni
  Cc: Vladimir Murzin, Catalin Marinas, Jason Gunthorpe,
	linux-arm-kernel, Mark Rutland, linux-kernel, linux-doc,
	Vikram Sethi, Jason Sequeira

On Mon, Jun 29, 2026 at 06:08:37PM -0500, Shanker Donthineni wrote:
> On 6/29/2026 5:45 AM, Vladimir Murzin wrote:
> > That's interesting. With the way the patch set is structured, it
> > now looks like:
> > 
> > 1. Fix the erratum, but cause a performance regression.
> > 2. Restore the performance regression and (re)apply the erratum
> >     workaround.
> > 
> > Would it make sense to avoid introducing the performance
> > regression in the first place by structuring the patch set
> > slightly differently?
> > 
> > 1. (Re)introduce arm64 memset_io()/memcpy_toio().
> > 2. Fix the erratum once for all
> > 
> > What do you reckon?
> 
> Yes, that ordering makes sense.
> 
> I can restructure v5 so that patch 1 introduces the arm64 memset_{to}io()
> implementations while preserving the existing behavior. Patch 2 will
> then add the complete erratum workaround, including the conditional
> trailing DMB for those block-write helpers. This avoids introducing
> the intermediate performance regression and keeps each commit
> independently usable.
> 
> Will and Catalin, could you please share your thoughts on this approach?

tbh, I think I'm ok with the current ordering. The second patch is purely
a performance thing for affected CPUs, so doesn't strictly need to be
applied or backported for functional correctness afaict.

Will


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 2/2] arm64: io: apply the device store-release workaround once per block write
  2026-06-29 23:09     ` Shanker Donthineni
@ 2026-06-30 14:17       ` Will Deacon
  0 siblings, 0 replies; 10+ messages in thread
From: Will Deacon @ 2026-06-30 14:17 UTC (permalink / raw)
  To: Shanker Donthineni
  Cc: Vladimir Murzin, Catalin Marinas, Jason Gunthorpe,
	linux-arm-kernel, Mark Rutland, linux-kernel, linux-doc,
	Vikram Sethi, Jason Sequeira

On Mon, Jun 29, 2026 at 06:09:11PM -0500, Shanker Donthineni wrote:
> On 6/29/2026 5:48 AM, Vladimir Murzin wrote:
> > > +                          : "memory");
> > > +             src += sizeof(u64);
> > > +             dst += sizeof(u64);
> > > +             count -= sizeof(u64);
> > > +     }
> > > +     while (count) {
> > > +             asm volatile("strb %w0, [%1]"
> > > +                          : : "rZ"(*(const u8 *)src), "r"(dst) : "memory");
> > > +             src++;
> > > +             dst++;
> > > +             count--;
> > > +     }
> > > +
> > > +     iomem_block_store_barrier();
> > It is perhaps a matter of taste, but having the inline assembly
> > here (and in memset_io()) might make the code clearer. To a
> > casual reader, it would be obvious that the barrier is not
> > guaranteed and is only applicable to ARM64_WORKAROUND_DEVICE_STORE_RELEASE,
> > without having to jump back and forth through the code.
> > 
> > Obliviously maintainers might have different preference ;)

Oblivious maintainer here :)

> Regarding the barrier, iomem_block_store_barrier() is declared
> static __always_inline, so it does not add a function call. The nop/dmb
> osh alternative is emitted directly in each caller. I used the helper to
> avoid duplicating the alternative sequence.
> 
> I understand that placing the assembly directly in both functions could
> make its conditional nature more obvious. I do not have a strong preference
> and am happy to follow Will’s and Catalin’s preference here.

I agree with Vladimir that it would be clearer to inline the conditional
barrier.

It would be even better if we could avoid having to duplicate this code
to start with, but I can't immediately think of a better alternative.

Will


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 1/2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering
  2026-06-25 18:24 ` [PATCH v4 1/2] arm64: errata: Workaround " Shanker Donthineni
@ 2026-06-30 14:21   ` Will Deacon
  0 siblings, 0 replies; 10+ messages in thread
From: Will Deacon @ 2026-06-30 14:21 UTC (permalink / raw)
  To: Shanker Donthineni
  Cc: Catalin Marinas, Vladimir Murzin, Jason Gunthorpe,
	linux-arm-kernel, Mark Rutland, linux-kernel, linux-doc,
	Vikram Sethi, Jason Sequeira

On Thu, Jun 25, 2026 at 01:24:24PM -0500, Shanker Donthineni wrote:
> On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
> observed by a peripheral before an older, non-overlapping Device-nGnR*
> store to the same peripheral. This breaks the program-order guarantee
> that software expects for Device-nGnR* accesses and can leave a
> peripheral in an incorrect state, as a load is observed before an
> earlier store takes effect.
> 
> The erratum can occur only when all of the following apply:
> 
>   - A PE executes a Device-nGnR* store followed by a younger
>     Device-nGnR* load.
>   - The store is not a store-release.
>   - The accesses target the same peripheral and do not overlap in bytes.
>   - There is at most one intervening Device-nGnR* store in program
>     order, and there are no intervening Device-nGnR* loads.
>   - There is no DSB, and no DMB that orders loads, between the store and
>     the load.

Does that mean that a DMB LD between the store and the load would
solve the problem?

It would be interesting to see how your benchmarks motivating patch 2
look if you leave __raw_writeX as-is and instead add a barrier in
__raw_readX before the load instruction.

Will


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2026-06-30 14:21 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-25 18:24 [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering Shanker Donthineni
2026-06-25 18:24 ` [PATCH v4 1/2] arm64: errata: Workaround " Shanker Donthineni
2026-06-30 14:21   ` Will Deacon
2026-06-25 18:24 ` [PATCH v4 2/2] arm64: io: apply the device store-release workaround once per block write Shanker Donthineni
2026-06-29 10:48   ` Vladimir Murzin
2026-06-29 23:09     ` Shanker Donthineni
2026-06-30 14:17       ` Will Deacon
2026-06-29 10:45 ` [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering Vladimir Murzin
2026-06-29 23:08   ` Shanker Donthineni
2026-06-30 13:53     ` Will Deacon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox