* [PATCH v2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
@ 2026-06-05 14:45 Shanker Donthineni
2026-06-10 11:28 ` Will Deacon
0 siblings, 1 reply; 5+ messages in thread
From: Shanker Donthineni @ 2026-06-05 14:45 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, linux-arm-kernel, Vladimir Murzin
Cc: Mark Rutland, linux-kernel, linux-doc, Shanker Donthineni,
Vikram Sethi, Jason Sequeira
On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
observed by a peripheral before an older, non-overlapping Device-nGnR*
store to the same peripheral. This breaks the program-order guarantee
that software expects for Device-nGnR* accesses and can leave a
peripheral in an incorrect state, as a load is observed before an
earlier store takes effect.
The erratum can occur only when all of the following apply:
- A PE executes a Device-nGnR* store followed by a younger
Device-nGnR* load.
- The store is not a store-release.
- The accesses target the same peripheral and do not overlap in bytes.
- There is at most one intervening Device-nGnR* store in program
order, and there are no intervening Device-nGnR* loads.
- There is no DSB, and no DMB that orders loads, between the store and
the load.
- Specific micro-architectural and timing conditions occur.
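The software-visible conditions listed above can be read as a predicate over one store/load pair. The following is only a minimal userspace sketch of that reading: the struct, its field names, and olympus_pair_is_exposed() are hypothetical and not part of this patch, and the final micro-architectural/timing condition is deliberately not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of an older Device-nGnR* store and a
 * younger Device-nGnR* load issued by the same PE. */
struct dev_access_pair {
	bool store_is_release;       /* store is an STLR* variant */
	bool same_peripheral;        /* both accesses hit one peripheral */
	bool bytes_overlap;          /* accessed byte ranges overlap */
	unsigned intervening_stores; /* Device-nGnR* stores in between */
	unsigned intervening_loads;  /* Device-nGnR* loads in between */
	bool barrier_between;        /* any DSB, or a DMB that orders loads */
};

/* Returns true when every software-visible condition from the list
 * holds. Timing conditions are outside the scope of this model, so
 * "true" means "may be exposed", not "will misbehave". */
static bool olympus_pair_is_exposed(const struct dev_access_pair *p)
{
	assert(p != NULL);
	return !p->store_is_release &&
	       p->same_peripheral &&
	       !p->bytes_overlap &&
	       p->intervening_stores <= 1 &&
	       p->intervening_loads == 0 &&
	       !p->barrier_between;
}
```

Under this model, setting store_is_release (which is what the workaround does for every raw MMIO write) is enough to make the predicate false regardless of the other fields.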
Two ways to restore ordering: insert a barrier (any DSB, or a DMB that
orders loads) between the store and the load, or make the store a
store-release. A load-acquire on the load side would not help, because
acquire semantics do not prevent a load from being observed ahead of an
older store; only the store side (release or a barrier) closes the
window.
Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain str*
to stlr* (Store-Release), which removes the "store is not a
store-release" condition for every device write the kernel issues.
Because writel() and writel_relaxed() are both built on __raw_writel()
in asm-generic/io.h, patching the raw variants covers both the
non-relaxed and relaxed APIs without touching the higher layers. Note
that writel()'s own barrier sits before the store, so it does not order
the store against a subsequent readl(); the store-release promotion is
what provides that ordering.
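The layering described above can be illustrated with a small userspace analogue. This is a sketch under stated assumptions, not kernel code: the model_* names and the model_cap_store_release flag are hypothetical, C11 atomics stand in for the inline asm (AArch64 compilers typically emit stlr for a release store and str for a relaxed one), and a release fence stands in for the barrier that writel() places before the store.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the boot-time capability. In the kernel the choice is
 * patched in through the alternatives framework, not tested per call. */
static bool model_cap_store_release;

/* Lowest layer: the only place that knows about the workaround. */
static inline void model_raw_writel(uint32_t val, _Atomic uint32_t *addr)
{
	assert(addr != NULL);
	if (model_cap_store_release)
		atomic_store_explicit(addr, val, memory_order_release);
	else
		atomic_store_explicit(addr, val, memory_order_relaxed);
}

/* Relaxed accessor: just the raw store. */
static inline void model_writel_relaxed(uint32_t val, _Atomic uint32_t *addr)
{
	model_raw_writel(val, addr);
}

/* Non-relaxed accessor: the barrier sits before the store, so it does
 * not order the store against a later load on its own. */
static inline void model_writel(uint32_t val, _Atomic uint32_t *addr)
{
	atomic_thread_fence(memory_order_release);
	model_raw_writel(val, addr);
}
```

Because both accessors funnel through model_raw_writel(), changing the store flavour there changes it for both, which is the property the commit message relies on.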
Like ARM64_ERRATUM_832075 on the load side, the change is gated on a new
ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only activated on
parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs continue to use
the plain str* sequence.
Note: stlr* only supports base-register addressing, so the raw accessors
can no longer use the offset addressing introduced by commit d044d6ba6f02
("arm64: io: permit offset addressing"). The str* and stlr* alternates
share a single inline-asm operand and the sequence is selected at boot,
so the operand form is fixed at compile time; unaffected CPUs keep using
str* but also revert to base-register addressing. This keeps the store
side as simple as the existing load-side patching (load-acquire) and
avoids adding complexity to the device write path; retaining offset
addressing only for str* would otherwise require a runtime branch on
every write.
Co-developed-by: Vikram Sethi <vsethi@nvidia.com>
Signed-off-by: Vikram Sethi <vsethi@nvidia.com>
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
---
Changes since v1:
Update commit text based on feedback from Vladimir Murzin
Documentation/arch/arm64/silicon-errata.rst | 2 ++
arch/arm64/Kconfig | 23 ++++++++++++++++++++
arch/arm64/include/asm/io.h | 24 ++++++++++++++-------
arch/arm64/kernel/cpu_errata.c | 8 +++++++
arch/arm64/tools/cpucaps | 1 +
5 files changed, 50 insertions(+), 8 deletions(-)
diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst
index 211119ce7adc..899bed3908bb 100644
--- a/Documentation/arch/arm64/silicon-errata.rst
+++ b/Documentation/arch/arm64/silicon-errata.rst
@@ -256,6 +256,8 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| NVIDIA | Carmel Core | N/A | NVIDIA_CARMEL_CNP_ERRATUM |
+----------------+-----------------+-----------------+-----------------------------+
+| NVIDIA | Olympus core | T410-OLY-1027 | NVIDIA_OLYMPUS_1027_ERRATUM |
++----------------+-----------------+-----------------+-----------------------------+
| NVIDIA | T241 GICv3/4.x | T241-FABRIC-4 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
| NVIDIA | T241 MPAM | T241-MPAM-1 | N/A |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fe60738e5943..a6bac84b05a1 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -564,6 +564,29 @@ config ARM64_ERRATUM_832075
If unsure, say Y.
+config NVIDIA_OLYMPUS_1027_ERRATUM
+ bool "NVIDIA Olympus: device store/load ordering erratum"
+ default y
+ help
+ This option adds an alternative code sequence to work around an
+ NVIDIA Olympus core erratum where a Device-nGnR* store can be
+ observed by a peripheral after a younger Device-nGnR* load to the
+ same peripheral. This breaks the program order that drivers rely
+ on for MMIO and can leave a device in an incorrect state.
+
+ The workaround promotes the raw MMIO store helpers
+ (__raw_writeb/w/l/q) to Store-Release (STLR), which restores the
+ required ordering. Because writel() and writel_relaxed() are built
+ on __raw_writel(), both are covered without changes to the higher
+ layers.
+
+ The fix is applied through the alternatives framework, so enabling
+ this option does not by itself activate the workaround: it is
+ patched in only when an affected CPU is detected, and is a no-op on
+ unaffected CPUs.
+
+ If unsure, say Y.
+
config ARM64_ERRATUM_834220
bool "Cortex-A57: 834220: Stage 2 translation fault might be incorrectly reported in presence of a Stage 1 fault (rare)"
depends on KVM
diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
index 8cbd1e96fd50..b6d7966e9c19 100644
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -25,29 +25,37 @@
#define __raw_writeb __raw_writeb
static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
{
- volatile u8 __iomem *ptr = addr;
- asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+ asm volatile(ALTERNATIVE("strb %w0, [%1]",
+ "stlrb %w0, [%1]",
+ ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+ : : "rZ" (val), "r" (addr));
}
#define __raw_writew __raw_writew
static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
{
- volatile u16 __iomem *ptr = addr;
- asm volatile("strh %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+ asm volatile(ALTERNATIVE("strh %w0, [%1]",
+ "stlrh %w0, [%1]",
+ ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+ : : "rZ" (val), "r" (addr));
}
#define __raw_writel __raw_writel
static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
{
- volatile u32 __iomem *ptr = addr;
- asm volatile("str %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+ asm volatile(ALTERNATIVE("str %w0, [%1]",
+ "stlr %w0, [%1]",
+ ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+ : : "rZ" (val), "r" (addr));
}
#define __raw_writeq __raw_writeq
static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
{
- volatile u64 __iomem *ptr = addr;
- asm volatile("str %x0, %1" : : "rZ" (val), "Qo" (*ptr));
+ asm volatile(ALTERNATIVE("str %x0, [%1]",
+ "stlr %x0, [%1]",
+ ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+ : : "rZ" (val), "r" (addr));
}
#define __raw_readb __raw_readb
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 5377e4c2eba2..958d7f16bfeb 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -809,6 +809,14 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
ERRATA_MIDR_ALL_VERSIONS(MIDR_NVIDIA_CARMEL),
},
#endif
+#ifdef CONFIG_NVIDIA_OLYMPUS_1027_ERRATUM
+ {
+ /* NVIDIA Olympus core */
+ .desc = "NVIDIA Olympus device load/store ordering erratum",
+ .capability = ARM64_WORKAROUND_DEVICE_STORE_RELEASE,
+ ERRATA_MIDR_ALL_VERSIONS(MIDR_NVIDIA_OLYMPUS),
+ },
+#endif
#ifdef CONFIG_ARM64_WORKAROUND_TRBE_OVERWRITE_FILL_MODE
{
/*
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index 811c2479e82d..d367257bf770 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -120,6 +120,7 @@ WORKAROUND_CAVIUM_TX2_219_PRFM
WORKAROUND_CAVIUM_TX2_219_TVM
WORKAROUND_CLEAN_CACHE
WORKAROUND_DEVICE_LOAD_ACQUIRE
+WORKAROUND_DEVICE_STORE_RELEASE
WORKAROUND_NVIDIA_CARMEL_CNP
WORKAROUND_PMUV3_IMPDEF_TRAPS
WORKAROUND_QCOM_FALKOR_E1003
--
2.43.0
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH v2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
  2026-06-05 14:45 [PATCH v2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum Shanker Donthineni
@ 2026-06-10 11:28 ` Will Deacon
  2026-06-10 12:50   ` Jason Gunthorpe
                     ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Will Deacon @ 2026-06-10 11:28 UTC (permalink / raw)
To: Shanker Donthineni
Cc: Catalin Marinas, linux-arm-kernel, Vladimir Murzin, Mark Rutland,
linux-kernel, linux-doc, Vikram Sethi, Jason Sequeira, jgg

[+Jason G]

On Fri, Jun 05, 2026 at 09:45:51AM -0500, Shanker Donthineni wrote:
> On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
> observed by a peripheral before an older, non-overlapping Device-nGnR*
> store to the same peripheral. This breaks the program-order guarantee
> that software expects for Device-nGnR* accesses and can leave a
> peripheral in an incorrect state, as a load is observed before an
> earlier store takes effect.
>
> The erratum can occur only when all of the following apply:
>
> - A PE executes a Device-nGnR* store followed by a younger
> Device-nGnR* load.
> - The store is not a store-release.
> - The accesses target the same peripheral and do not overlap in bytes.
> - There is at most one intervening Device-nGnR* store in program
> order, and there are no intervening Device-nGnR* loads.
> - There is no DSB, and no DMB that orders loads, between the store and
> the load.
> - Specific micro-architectural and timing conditions occur.
>
> Two ways to restore ordering: insert a barrier (any DSB, or a DMB that
> orders loads) between the store and the load, or make the store a
> store-release. A load-acquire on the load side would not help, because
> acquire semantics do not prevent a load from being observed ahead of an
> older store; only the store side (release or a barrier) closes the
> window.

I think you can drop the paragraph above. A store-release isn't enough
to order against a later load in the architecture either, so we're
clearly in micro-architecture territory and I don't think you need to
describe mechanisms that don't work here.

> Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain str*
> to stlr* (Store-Release), which removes the "store is not a
> store-release" condition for every device write the kernel issues.
> Because writel() and writel_relaxed() are both built on __raw_writel()
> in asm-generic/io.h, patching the raw variants covers both the
> non-relaxed and relaxed APIs without touching the higher layers. Note
> that writel()'s own barrier sits before the store, so it does not order
> the store against a subsequent readl(); the store-release promotion is
> what provides that ordering.

Sashiko points out that you're missing __const_memcpy_toio_aligned32().

> Like ARM64_ERRATUM_832075 on the load side, the change is gated on a new
> ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only activated on
> parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs continue to use
> the plain str* sequence.
>
> Note: stlr* only supports base-register addressing, so the raw accessors
> can no longer use the offset addressing introduced by commit d044d6ba6f02
> ("arm64: io: permit offset addressing"). The str* and stlr* alternates
> share a single inline-asm operand and the sequence is selected at boot,
> so the operand form is fixed at compile time; unaffected CPUs keep using
> str* but also revert to base-register addressing. This keeps the store
> side as simple as the existing load-side patching (load-acquire) and
> avoids adding complexity to the device write path; retaining offset
> addressing only for str* would otherwise require a runtime branch on
> every write.

I seem to remember Jason caring about that, possibly because some CPUs
are very picky about write-combining?

Will
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
  2026-06-10 11:28 ` Will Deacon
@ 2026-06-10 12:50   ` Jason Gunthorpe
  2026-06-10 12:53   ` Shanker Donthineni
  2026-06-10 13:20   ` Shanker Donthineni
  2 siblings, 0 replies; 5+ messages in thread
From: Jason Gunthorpe @ 2026-06-10 12:50 UTC (permalink / raw)
To: Will Deacon
Cc: Shanker Donthineni, Catalin Marinas, linux-arm-kernel,
Vladimir Murzin, Mark Rutland, linux-kernel, linux-doc, Vikram Sethi,
Jason Sequeira

On Wed, Jun 10, 2026 at 12:28:33PM +0100, Will Deacon wrote:
> > Note: stlr* only supports base-register addressing, so the raw accessors
> > can no longer use the offset addressing introduced by commit d044d6ba6f02
> > ("arm64: io: permit offset addressing"). The str* and stlr* alternates
> > share a single inline-asm operand and the sequence is selected at boot,
> > so the operand form is fixed at compile time; unaffected CPUs keep using
> > str* but also revert to base-register addressing. This keeps the store
> > side as simple as the existing load-side patching (load-acquire) and
> > avoids adding complexity to the device write path; retaining offset
> > addressing only for str* would otherwise require a runtime branch on
> > every write.
>
> I seem to remember Jason caring about that, possibly because some CPUs
> are very picky about write-combining?

I think it was more a fallout of the work there: after looking at the
assembly, this minor edit to the constraint made a nice codegen
impact. It is certainly a shame to lose it for this bug.

If we care about write combining we can't have a branch anyhow, but
that is most important for the specific memcpy operations (which will
need a branch)

Jason
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
  2026-06-10 11:28 ` Will Deacon
  2026-06-10 12:50   ` Jason Gunthorpe
@ 2026-06-10 12:53   ` Shanker Donthineni
  2026-06-10 13:20   ` Shanker Donthineni
  2 siblings, 0 replies; 5+ messages in thread
From: Shanker Donthineni @ 2026-06-10 12:53 UTC (permalink / raw)
To: Will Deacon
Cc: Catalin Marinas, linux-arm-kernel, Vladimir Murzin, Mark Rutland,
linux-kernel, linux-doc, Vikram Sethi, Jason Sequeira, jgg

Hi Will,

On 6/10/2026 6:28 AM, Will Deacon wrote:
> External email: Use caution opening links or attachments
>
>
> [+Jason G]
>
> On Fri, Jun 05, 2026 at 09:45:51AM -0500, Shanker Donthineni wrote:
>> On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
>> observed by a peripheral before an older, non-overlapping Device-nGnR*
>> store to the same peripheral. This breaks the program-order guarantee
>> that software expects for Device-nGnR* accesses and can leave a
>> peripheral in an incorrect state, as a load is observed before an
>> earlier store takes effect.
>>
>> The erratum can occur only when all of the following apply:
>>
>> - A PE executes a Device-nGnR* store followed by a younger
>> Device-nGnR* load.
>> - The store is not a store-release.
>> - The accesses target the same peripheral and do not overlap in bytes.
>> - There is at most one intervening Device-nGnR* store in program
>> order, and there are no intervening Device-nGnR* loads.
>> - There is no DSB, and no DMB that orders loads, between the store and
>> the load.
>> - Specific micro-architectural and timing conditions occur.
>>
>> Two ways to restore ordering: insert a barrier (any DSB, or a DMB that
>> orders loads) between the store and the load, or make the store a
>> store-release. A load-acquire on the load side would not help, because
>> acquire semantics do not prevent a load from being observed ahead of an
>> older store; only the store side (release or a barrier) closes the
>> window.
> I think you can drop the paragraph above. A store-release isn't enough
> to order against a later load in the architecture either, so we're
> clearly in micro-architecture territory and I don't think you need to
> describe mechanisms that don't work here.

Thanks, Will. I’ll drop the paragraph and avoid describing store-release
as an architectural ordering mechanism here.

>> Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain str*
>> to stlr* (Store-Release), which removes the "store is not a
>> store-release" condition for every device write the kernel issues.
>> Because writel() and writel_relaxed() are both built on __raw_writel()
>> in asm-generic/io.h, patching the raw variants covers both the
>> non-relaxed and relaxed APIs without touching the higher layers. Note
>> that writel()'s own barrier sits before the store, so it does not order
>> the store against a subsequent readl(); the store-release promotion is
>> what provides that ordering.
> Sashiko points out that you're missing __const_memcpy_toio_aligned32().

I’ll also cover __const_memcpy_toio_aligned32(); it currently emits
plain STRs directly and can bypass the raw write helper workaround.
I’ll audit the aligned64 path at the same time.

>> Like ARM64_ERRATUM_832075 on the load side, the change is gated on a new
>> ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only activated on
>> parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs continue to use
>> the plain str* sequence.
>>
>> Note: stlr* only supports base-register addressing, so the raw accessors
>> can no longer use the offset addressing introduced by commit d044d6ba6f02
>> ("arm64: io: permit offset addressing"). The str* and stlr* alternates
>> share a single inline-asm operand and the sequence is selected at boot,
>> so the operand form is fixed at compile time; unaffected CPUs keep using
>> str* but also revert to base-register addressing. This keeps the store
>> side as simple as the existing load-side patching (load-acquire) and
>> avoids adding complexity to the device write path; retaining offset
>> addressing only for str* would otherwise require a runtime branch on
>> every write.
> I seem to remember Jason caring about that, possibly because some CPUs
> are very picky about write-combining?

For the offset-addressing concern, I’ll rework the raw accessors so
unaffected CPUs keep the existing offset-addressed STR sequence, and
only CPUs with ARM64_WORKAROUND_DEVICE_STORE_RELEASE take the
base-register STLR path.

I’ll post a v3 using the patched branch from
alternative_has_cap_unlikely(), and include the memcpy_toio()
aligned-helper coverage as shown below.

--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -22,10 +22,46 @@
 /*
  * Generic IO read/write. These perform native-endian accesses.
  */
+static __always_inline bool arm64_needs_device_store_release(void)
+{
+	return alternative_has_cap_unlikely(
+			ARM64_WORKAROUND_DEVICE_STORE_RELEASE);
+}
+
+static __always_inline void __raw_writeb_stlr(u8 val,
+					      volatile void __iomem *addr)
+{
+	asm volatile("stlrb %w0, [%1]" : : "rZ" (val), "r" (addr));
+}
+
+static __always_inline void __raw_writew_stlr(u16 val,
+					      volatile void __iomem *addr)
+{
+	asm volatile("stlrh %w0, [%1]" : : "rZ" (val), "r" (addr));
+}
+
+static __always_inline void __raw_writel_stlr(u32 val,
+					      volatile void __iomem *addr)
+{
+	asm volatile("stlr %w0, [%1]" : : "rZ" (val), "r" (addr));
+}
+
+static __always_inline void __raw_writeq_stlr(u64 val,
+					      volatile void __iomem *addr)
+{
+	asm volatile("stlr %x0, [%1]" : : "rZ" (val), "r" (addr));
+}
+
 #define __raw_writeb __raw_writeb
 static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
 {
 	volatile u8 __iomem *ptr = addr;
+
+	if (arm64_needs_device_store_release()) {
+		__raw_writeb_stlr(val, addr);
+		return;
+	}
+
 	asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
 }
 
@@ -33,6 +69,12 @@ static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
 static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
 {
 	volatile u16 __iomem *ptr = addr;
+
+	if (arm64_needs_device_store_release()) {
+		__raw_writew_stlr(val, addr);
+		return;
+	}
+
 	asm volatile("strh %w0, %1" : : "rZ" (val), "Qo" (*ptr));
 }
 
@@ -40,6 +82,12 @@ static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
 static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
 {
 	volatile u32 __iomem *ptr = addr;
+
+	if (arm64_needs_device_store_release()) {
+		__raw_writel_stlr(val, addr);
+		return;
+	}
+
 	asm volatile("str %w0, %1" : : "rZ" (val), "Qo" (*ptr));
 }
 
@@ -47,6 +95,12 @@ static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
 static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
 {
 	volatile u64 __iomem *ptr = addr;
+
+	if (arm64_needs_device_store_release()) {
+		__raw_writeq_stlr(val, addr);
+		return;
+	}
+
 	asm volatile("str %x0, %1" : : "rZ" (val), "Qo" (*ptr));
 }
 
@@ -147,6 +201,12 @@ static __always_inline void
 __const_memcpy_toio_aligned32(volatile u32 __iomem *to, const u32 *from,
 			      size_t count)
 {
+	if (arm64_needs_device_store_release()) {
+		while (count--)
+			__raw_writel_stlr(*from++, to++);
+		return;
+	}
+
 	switch (count) {
 	case 8:
 		asm volatile("str %w0, [%8, #4 * 0]\n"
@@ -204,6 +264,12 @@ static __always_inline void
 __const_memcpy_toio_aligned64(volatile u64 __iomem *to, const u64 *from,
 			      size_t count)
 {
+	if (arm64_needs_device_store_release()) {
+		while (count--)
+			__raw_writeq_stlr(*from++, to++);
+		return;
+	}
+
 	switch (count) {
 	case 8:
 		asm volatile("str %x0, [%8, #8 * 0]\n"

I'll post the v3 patch with the jump instruction patching.

-Shanker
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
  2026-06-10 11:28 ` Will Deacon
  2026-06-10 12:50   ` Jason Gunthorpe
  2026-06-10 12:53   ` Shanker Donthineni
@ 2026-06-10 13:20   ` Shanker Donthineni
  2 siblings, 0 replies; 5+ messages in thread
From: Shanker Donthineni @ 2026-06-10 13:20 UTC (permalink / raw)
To: Will Deacon
Cc: Catalin Marinas, linux-arm-kernel, Vladimir Murzin, Mark Rutland,
linux-kernel, linux-doc, Vikram Sethi, Jason Sequeira, jgg

Hi Will,

On 6/10/2026 6:28 AM, Will Deacon wrote:
> External email: Use caution opening links or attachments
>
>
> [+Jason G]
>
> On Fri, Jun 05, 2026 at 09:45:51AM -0500, Shanker Donthineni wrote:
>> On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
>> observed by a peripheral before an older, non-overlapping Device-nGnR*
>> store to the same peripheral. This breaks the program-order guarantee
>> that software expects for Device-nGnR* accesses and can leave a
>> peripheral in an incorrect state, as a load is observed before an
>> earlier store takes effect.
>>
>> The erratum can occur only when all of the following apply:
>>
>> - A PE executes a Device-nGnR* store followed by a younger
>> Device-nGnR* load.
>> - The store is not a store-release.
>> - The accesses target the same peripheral and do not overlap in bytes.
>> - There is at most one intervening Device-nGnR* store in program
>> order, and there are no intervening Device-nGnR* loads.
>> - There is no DSB, and no DMB that orders loads, between the store and
>> the load.
>> - Specific micro-architectural and timing conditions occur.
>>
>> Two ways to restore ordering: insert a barrier (any DSB, or a DMB that
>> orders loads) between the store and the load, or make the store a
>> store-release. A load-acquire on the load side would not help, because
>> acquire semantics do not prevent a load from being observed ahead of an
>> older store; only the store side (release or a barrier) closes the
>> window.
> I think you can drop the paragraph above. A store-release isn't enough
> to order against a later load in the architecture either, so we're
> clearly in micro-architecture territory and I don't think you need to
> describe mechanisms that don't work here.
>
>> Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain str*
>> to stlr* (Store-Release), which removes the "store is not a
>> store-release" condition for every device write the kernel issues.
>> Because writel() and writel_relaxed() are both built on __raw_writel()
>> in asm-generic/io.h, patching the raw variants covers both the
>> non-relaxed and relaxed APIs without touching the higher layers. Note
>> that writel()'s own barrier sits before the store, so it does not order
>> the store against a subsequent readl(); the store-release promotion is
>> what provides that ordering.

Based on the existing code comments and after reviewing this path again,
__const_memcpy_toio_aligned32() and __const_memcpy_toio_aligned64()
appear to be intended for WC regions.

Since the erratum is scoped to Device-nGnR* accesses, and WC mappings
are Normal-NC on arm64, I don’t think the STLR workaround should apply
to these helpers by default. Applying it there would also break the
contiguous STR grouping that this path relies on for write combining.

-Shanker
^ permalink raw reply [flat|nested] 5+ messages in thread