* [PATCH v2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
@ 2026-06-05 14:45 Shanker Donthineni
2026-06-10 11:28 ` Will Deacon
0 siblings, 1 reply; 6+ messages in thread
From: Shanker Donthineni @ 2026-06-05 14:45 UTC
To: Catalin Marinas, Will Deacon, linux-arm-kernel, Vladimir Murzin
Cc: Mark Rutland, linux-kernel, linux-doc, Shanker Donthineni,
Vikram Sethi, Jason Sequeira
On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
observed by a peripheral before an older, non-overlapping Device-nGnR*
store to the same peripheral. This breaks the program-order guarantee
that software expects for Device-nGnR* accesses and can leave a
peripheral in an incorrect state, as a load is observed before an
earlier store takes effect.
The erratum can occur only when all of the following apply:
- A PE executes a Device-nGnR* store followed by a younger
Device-nGnR* load.
- The store is not a store-release.
- The accesses target the same peripheral and do not overlap in bytes.
- There is at most one intervening Device-nGnR* store in program
order, and there are no intervening Device-nGnR* loads.
- There is no DSB, and no DMB that orders loads, between the store and
the load.
- Specific micro-architectural and timing conditions occur.
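The conditions above form a conjunction: the window is open only if every one of them holds. As a rough illustration, the user-space C sketch below encodes them as a predicate. The struct, its field names, and the function name are hypothetical and come from neither the erratum notice nor this patch. The last condition cannot be modelled in software, so it appears only as an opaque flag.

```c
#include <stdbool.h>

/* Hypothetical description of one device store -> younger load pair. */
struct mmio_pair {
	bool store_is_release;       /* store issued as STLR* rather than STR* */
	bool same_peripheral;        /* both accesses target the same peripheral */
	bool bytes_overlap;          /* the two accesses overlap in bytes */
	unsigned int stores_between; /* intervening Device-nGnR* stores */
	unsigned int loads_between;  /* intervening Device-nGnR* loads */
	bool barrier_between;        /* a DSB, or a DMB that orders loads */
	bool uarch_timing_hit;       /* opaque micro-architectural condition */
};

/* Returns true only when every listed condition holds at the same time. */
static bool erratum_window_open(const struct mmio_pair *p)
{
	return !p->store_is_release &&
	       p->same_peripheral &&
	       !p->bytes_overlap &&
	       p->stores_between <= 1 &&
	       p->loads_between == 0 &&
	       !p->barrier_between &&
	       p->uarch_timing_hit;
}
```

Falsifying any single term closes the window in this model. The workaround targets the first term, because that is the one the raw write helpers control.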
Two ways to restore ordering: insert a barrier (any DSB, or a DMB that
orders loads) between the store and the load, or make the store a
store-release. A load-acquire on the load side would not help, because
acquire semantics do not prevent a load from being observed ahead of an
older store; only the store side (release or a barrier) closes the
window.
Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain str*
to stlr* (Store-Release), which removes the "store is not a
store-release" condition for every device write the kernel issues.
Because writel() and writel_relaxed() are both built on __raw_writel()
in asm-generic/io.h, patching the raw variants covers both the
non-relaxed and relaxed APIs without touching the higher layers. Note
that writel()'s own barrier sits before the store, so it does not order
the store against a subsequent readl(); the store-release promotion is
what provides that ordering.
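The layering described above can be pictured with a small user-space mock. This is only a sketch: the mock_* functions and the event log are hypothetical stand-ins rather than the kernel accessors, and the endianness conversion done by the real helpers is left out. It shows that both the relaxed and non-relaxed writes go through one raw store, and that the barrier in the non-relaxed variant is issued before that store.

```c
#include <stddef.h>
#include <stdint.h>

enum mmio_event { EV_BARRIER, EV_STORE };

/* Hypothetical event log recording the order in which operations are issued. */
static enum mmio_event mmio_log[8];
static size_t mmio_log_len;

static void log_event(enum mmio_event ev)
{
	if (mmio_log_len < sizeof(mmio_log) / sizeof(mmio_log[0]))
		mmio_log[mmio_log_len++] = ev;
}

/* Stand-in for __raw_writel(): the single place where the store is issued. */
static void mock_raw_writel(uint32_t val, volatile uint32_t *addr)
{
	*addr = val;
	log_event(EV_STORE);
}

/* Stand-in for writel_relaxed(): no barrier, only the raw store. */
static void mock_writel_relaxed(uint32_t val, volatile uint32_t *addr)
{
	mock_raw_writel(val, addr);
}

/* Stand-in for writel(): the barrier is issued first, then the raw store. */
static void mock_writel(uint32_t val, volatile uint32_t *addr)
{
	log_event(EV_BARRIER);
	mock_raw_writel(val, addr);
}
```

Because nothing is logged after EV_STORE in either variant, any ordering against a later read has to come from the store itself, which is why the raw helper is the layer that gets patched.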
Like ARM64_ERRATUM_832075 on the load side, the change is gated on a new
ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only activated on
parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs continue to use
the plain str* sequence.
Note: stlr* only supports base-register addressing, so the raw accessors
can no longer use the offset addressing introduced by commit d044d6ba6f02
("arm64: io: permit offset addressing"). The str* and stlr* alternates
share a single inline-asm operand and the sequence is selected at boot,
so the operand form is fixed at compile time; unaffected CPUs keep using
str* but also revert to base-register addressing. This keeps the store
side as simple as the existing load-side patching (load-acquire) and
avoids adding complexity to the device write path; retaining offset
addressing only for str* would otherwise require a runtime branch on
every write.
Co-developed-by: Vikram Sethi <vsethi@nvidia.com>
Signed-off-by: Vikram Sethi <vsethi@nvidia.com>
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
---
Changes since v1:
Update commit text based on feedback from Vladimir Murzin
Documentation/arch/arm64/silicon-errata.rst | 2 ++
arch/arm64/Kconfig | 23 ++++++++++++++++++++
arch/arm64/include/asm/io.h | 24 ++++++++++++++-------
arch/arm64/kernel/cpu_errata.c | 8 +++++++
arch/arm64/tools/cpucaps | 1 +
5 files changed, 50 insertions(+), 8 deletions(-)
diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst
index 211119ce7adc..899bed3908bb 100644
--- a/Documentation/arch/arm64/silicon-errata.rst
+++ b/Documentation/arch/arm64/silicon-errata.rst
@@ -256,6 +256,8 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| NVIDIA | Carmel Core | N/A | NVIDIA_CARMEL_CNP_ERRATUM |
+----------------+-----------------+-----------------+-----------------------------+
+| NVIDIA | Olympus core | T410-OLY-1027 | NVIDIA_OLYMPUS_1027_ERRATUM |
++----------------+-----------------+-----------------+-----------------------------+
| NVIDIA | T241 GICv3/4.x | T241-FABRIC-4 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
| NVIDIA | T241 MPAM | T241-MPAM-1 | N/A |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fe60738e5943..a6bac84b05a1 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -564,6 +564,29 @@ config ARM64_ERRATUM_832075
If unsure, say Y.
+config NVIDIA_OLYMPUS_1027_ERRATUM
+ bool "NVIDIA Olympus: device store/load ordering erratum"
+ default y
+ help
+ This option adds an alternative code sequence to work around an
+ NVIDIA Olympus core erratum where a Device-nGnR* store can be
+ observed by a peripheral after a younger Device-nGnR* load to the
+ same peripheral. This breaks the program order that drivers rely
+ on for MMIO and can leave a device in an incorrect state.
+
+ The workaround promotes the raw MMIO store helpers
+ (__raw_writeb/w/l/q) to Store-Release (STLR), which restores the
+ required ordering. Because writel() and writel_relaxed() are built
+ on __raw_writel(), both are covered without changes to the higher
+ layers.
+
+ The fix is applied through the alternatives framework, so enabling
+ this option does not by itself activate the workaround: it is
+ patched in only when an affected CPU is detected, and is a no-op on
+ unaffected CPUs.
+
+ If unsure, say Y.
+
config ARM64_ERRATUM_834220
bool "Cortex-A57: 834220: Stage 2 translation fault might be incorrectly reported in presence of a Stage 1 fault (rare)"
depends on KVM
diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
index 8cbd1e96fd50..b6d7966e9c19 100644
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -25,29 +25,37 @@
#define __raw_writeb __raw_writeb
static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
{
- volatile u8 __iomem *ptr = addr;
- asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+ asm volatile(ALTERNATIVE("strb %w0, [%1]",
+ "stlrb %w0, [%1]",
+ ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+ : : "rZ" (val), "r" (addr));
}
#define __raw_writew __raw_writew
static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
{
- volatile u16 __iomem *ptr = addr;
- asm volatile("strh %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+ asm volatile(ALTERNATIVE("strh %w0, [%1]",
+ "stlrh %w0, [%1]",
+ ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+ : : "rZ" (val), "r" (addr));
}
#define __raw_writel __raw_writel
static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
{
- volatile u32 __iomem *ptr = addr;
- asm volatile("str %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+ asm volatile(ALTERNATIVE("str %w0, [%1]",
+ "stlr %w0, [%1]",
+ ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+ : : "rZ" (val), "r" (addr));
}
#define __raw_writeq __raw_writeq
static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
{
- volatile u64 __iomem *ptr = addr;
- asm volatile("str %x0, %1" : : "rZ" (val), "Qo" (*ptr));
+ asm volatile(ALTERNATIVE("str %x0, [%1]",
+ "stlr %x0, [%1]",
+ ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+ : : "rZ" (val), "r" (addr));
}
#define __raw_readb __raw_readb
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 5377e4c2eba2..958d7f16bfeb 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -809,6 +809,14 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
ERRATA_MIDR_ALL_VERSIONS(MIDR_NVIDIA_CARMEL),
},
#endif
+#ifdef CONFIG_NVIDIA_OLYMPUS_1027_ERRATUM
+ {
+ /* NVIDIA Olympus core */
+ .desc = "NVIDIA Olympus device load/store ordering erratum",
+ .capability = ARM64_WORKAROUND_DEVICE_STORE_RELEASE,
+ ERRATA_MIDR_ALL_VERSIONS(MIDR_NVIDIA_OLYMPUS),
+ },
+#endif
#ifdef CONFIG_ARM64_WORKAROUND_TRBE_OVERWRITE_FILL_MODE
{
/*
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index 811c2479e82d..d367257bf770 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -120,6 +120,7 @@ WORKAROUND_CAVIUM_TX2_219_PRFM
WORKAROUND_CAVIUM_TX2_219_TVM
WORKAROUND_CLEAN_CACHE
WORKAROUND_DEVICE_LOAD_ACQUIRE
+WORKAROUND_DEVICE_STORE_RELEASE
WORKAROUND_NVIDIA_CARMEL_CNP
WORKAROUND_PMUV3_IMPDEF_TRAPS
WORKAROUND_QCOM_FALKOR_E1003
--
2.43.0
* Re: [PATCH v2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
2026-06-05 14:45 [PATCH v2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum Shanker Donthineni
@ 2026-06-10 11:28 ` Will Deacon
2026-06-10 12:50 ` Jason Gunthorpe
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Will Deacon @ 2026-06-10 11:28 UTC
To: Shanker Donthineni
Cc: Catalin Marinas, linux-arm-kernel, Vladimir Murzin, Mark Rutland,
linux-kernel, linux-doc, Vikram Sethi, Jason Sequeira, jgg
[+Jason G]
On Fri, Jun 05, 2026 at 09:45:51AM -0500, Shanker Donthineni wrote:
> On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
> observed by a peripheral before an older, non-overlapping Device-nGnR*
> store to the same peripheral. This breaks the program-order guarantee
> that software expects for Device-nGnR* accesses and can leave a
> peripheral in an incorrect state, as a load is observed before an
> earlier store takes effect.
>
> The erratum can occur only when all of the following apply:
>
> - A PE executes a Device-nGnR* store followed by a younger
> Device-nGnR* load.
> - The store is not a store-release.
> - The accesses target the same peripheral and do not overlap in bytes.
> - There is at most one intervening Device-nGnR* store in program
> order, and there are no intervening Device-nGnR* loads.
> - There is no DSB, and no DMB that orders loads, between the store and
> the load.
> - Specific micro-architectural and timing conditions occur.
>
> Two ways to restore ordering: insert a barrier (any DSB, or a DMB that
> orders loads) between the store and the load, or make the store a
> store-release. A load-acquire on the load side would not help, because
> acquire semantics do not prevent a load from being observed ahead of an
> older store; only the store side (release or a barrier) closes the
> window.
I think you can drop the paragraph above. A store-release isn't enough
to order against a later load in the architecture either, so we're
clearly in micro-architecture territory and I don't think you need to
describe mechanisms that don't work here.
> Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain str*
> to stlr* (Store-Release), which removes the "store is not a
> store-release" condition for every device write the kernel issues.
> Because writel() and writel_relaxed() are both built on __raw_writel()
> in asm-generic/io.h, patching the raw variants covers both the
> non-relaxed and relaxed APIs without touching the higher layers. Note
> that writel()'s own barrier sits before the store, so it does not order
> the store against a subsequent readl(); the store-release promotion is
> what provides that ordering.
Sashiko points out that you're missing __const_memcpy_toio_aligned32().
> Like ARM64_ERRATUM_832075 on the load side, the change is gated on a new
> ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only activated on
> parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs continue to use
> the plain str* sequence.
>
> Note: stlr* only supports base-register addressing, so the raw accessors
> can no longer use the offset addressing introduced by commit d044d6ba6f02
> ("arm64: io: permit offset addressing"). The str* and stlr* alternates
> share a single inline-asm operand and the sequence is selected at boot,
> so the operand form is fixed at compile time; unaffected CPUs keep using
> str* but also revert to base-register addressing. This keeps the store
> side as simple as the existing load-side patching (load-acquire) and
> avoids adding complexity to the device write path; retaining offset
> addressing only for str* would otherwise require a runtime branch on
> every write.
I seem to remember Jason caring about that, possibly because some CPUs
are very picky about write-combining?
Will
* Re: [PATCH v2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
2026-06-10 11:28 ` Will Deacon
@ 2026-06-10 12:50 ` Jason Gunthorpe
2026-06-10 12:53 ` Shanker Donthineni
2026-06-10 13:20 ` Shanker Donthineni
2 siblings, 0 replies; 6+ messages in thread
From: Jason Gunthorpe @ 2026-06-10 12:50 UTC
To: Will Deacon
Cc: Shanker Donthineni, Catalin Marinas, linux-arm-kernel,
Vladimir Murzin, Mark Rutland, linux-kernel, linux-doc,
Vikram Sethi, Jason Sequeira
On Wed, Jun 10, 2026 at 12:28:33PM +0100, Will Deacon wrote:
> > Note: stlr* only supports base-register addressing, so the raw accessors
> > can no longer use the offset addressing introduced by commit d044d6ba6f02
> > ("arm64: io: permit offset addressing"). The str* and stlr* alternates
> > share a single inline-asm operand and the sequence is selected at boot,
> > so the operand form is fixed at compile time; unaffected CPUs keep using
> > str* but also revert to base-register addressing. This keeps the store
> > side as simple as the existing load-side patching (load-acquire) and
> > avoids adding complexity to the device write path; retaining offset
> > addressing only for str* would otherwise require a runtime branch on
> > every write.
>
> I seem to remember Jason caring about that, possibly because some CPUs
> are very picky about write-combining?
I think it was more a fallout of the work there; after looking at the
assembly, this minor edit to the constraint made a nice codegen
impact. It is certainly a shame to lose it for this bug.
If we care about write combining we can't have a branch anyhow, but
that is most important for the specific memcpy operations (which will
need a branch).
Jason
* Re: [PATCH v2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
2026-06-10 11:28 ` Will Deacon
2026-06-10 12:50 ` Jason Gunthorpe
@ 2026-06-10 12:53 ` Shanker Donthineni
2026-06-10 13:20 ` Shanker Donthineni
2 siblings, 0 replies; 6+ messages in thread
From: Shanker Donthineni @ 2026-06-10 12:53 UTC
To: Will Deacon
Cc: Catalin Marinas, linux-arm-kernel, Vladimir Murzin, Mark Rutland,
linux-kernel, linux-doc, Vikram Sethi, Jason Sequeira, jgg
Hi Will,
On 6/10/2026 6:28 AM, Will Deacon wrote:
> [+Jason G]
>
> On Fri, Jun 05, 2026 at 09:45:51AM -0500, Shanker Donthineni wrote:
>> On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
>> observed by a peripheral before an older, non-overlapping Device-nGnR*
>> store to the same peripheral. This breaks the program-order guarantee
>> that software expects for Device-nGnR* accesses and can leave a
>> peripheral in an incorrect state, as a load is observed before an
>> earlier store takes effect.
>>
>> The erratum can occur only when all of the following apply:
>>
>> - A PE executes a Device-nGnR* store followed by a younger
>> Device-nGnR* load.
>> - The store is not a store-release.
>> - The accesses target the same peripheral and do not overlap in bytes.
>> - There is at most one intervening Device-nGnR* store in program
>> order, and there are no intervening Device-nGnR* loads.
>> - There is no DSB, and no DMB that orders loads, between the store and
>> the load.
>> - Specific micro-architectural and timing conditions occur.
>>
>> Two ways to restore ordering: insert a barrier (any DSB, or a DMB that
>> orders loads) between the store and the load, or make the store a
>> store-release. A load-acquire on the load side would not help, because
>> acquire semantics do not prevent a load from being observed ahead of an
>> older store; only the store side (release or a barrier) closes the
>> window.
> I think you can drop the paragraph above. A store-release isn't enough
> to order against a later load in the architecture either, so we're
> clearly in micro-architecture territory and I don't think you need to
> describe mechanisms that don't work here.
Thanks, Will. I’ll drop the paragraph and avoid describing store-release
as an architectural ordering mechanism here.
>> Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain str*
>> to stlr* (Store-Release), which removes the "store is not a
>> store-release" condition for every device write the kernel issues.
>> Because writel() and writel_relaxed() are both built on __raw_writel()
>> in asm-generic/io.h, patching the raw variants covers both the
>> non-relaxed and relaxed APIs without touching the higher layers. Note
>> that writel()'s own barrier sits before the store, so it does not order
>> the store against a subsequent readl(); the store-release promotion is
>> what provides that ordering.
> Sashiko points out that you're missing __const_memcpy_toio_aligned32().
I’ll also cover __const_memcpy_toio_aligned32(); it currently emits plain
STRs directly and can bypass the raw write helper workaround. I’ll audit
the aligned64 path at the same time.
>> Like ARM64_ERRATUM_832075 on the load side, the change is gated on a new
>> ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only activated on
>> parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs continue to use
>> the plain str* sequence.
>>
>> Note: stlr* only supports base-register addressing, so the raw accessors
>> can no longer use the offset addressing introduced by commit d044d6ba6f02
>> ("arm64: io: permit offset addressing"). The str* and stlr* alternates
>> share a single inline-asm operand and the sequence is selected at boot,
>> so the operand form is fixed at compile time; unaffected CPUs keep using
>> str* but also revert to base-register addressing. This keeps the store
>> side as simple as the existing load-side patching (load-acquire) and
>> avoids adding complexity to the device write path; retaining offset
>> addressing only for str* would otherwise require a runtime branch on
>> every write.
> I seem to remember Jason caring about that, possibly because some CPUs
> are very picky about write-combining?
For the offset-addressing concern, I’ll rework the raw accessors so
unaffected CPUs keep the existing offset-addressed STR sequence, and
only CPUs with ARM64_WORKAROUND_DEVICE_STORE_RELEASE take the base-register
STLR path.
I’ll post a v3 using the patched branch from alternative_has_cap_unlikely(),
and include the memcpy_toio() aligned-helper coverage as shown below.
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -22,10 +22,46 @@
/*
* Generic IO read/write. These perform native-endian accesses.
*/
+static __always_inline bool arm64_needs_device_store_release(void)
+{
+ return alternative_has_cap_unlikely(
+ ARM64_WORKAROUND_DEVICE_STORE_RELEASE);
+}
+
+static __always_inline void __raw_writeb_stlr(u8 val,
+ volatile void __iomem *addr)
+{
+ asm volatile("stlrb %w0, [%1]" : : "rZ" (val), "r" (addr));
+}
+
+static __always_inline void __raw_writew_stlr(u16 val,
+ volatile void __iomem *addr)
+{
+ asm volatile("stlrh %w0, [%1]" : : "rZ" (val), "r" (addr));
+}
+
+static __always_inline void __raw_writel_stlr(u32 val,
+ volatile void __iomem *addr)
+{
+ asm volatile("stlr %w0, [%1]" : : "rZ" (val), "r" (addr));
+}
+
+static __always_inline void __raw_writeq_stlr(u64 val,
+ volatile void __iomem *addr)
+{
+ asm volatile("stlr %x0, [%1]" : : "rZ" (val), "r" (addr));
+}
+
#define __raw_writeb __raw_writeb
static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
{
volatile u8 __iomem *ptr = addr;
+
+ if (arm64_needs_device_store_release()) {
+ __raw_writeb_stlr(val, addr);
+ return;
+ }
+
asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
}
@@ -33,6 +69,12 @@ static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
{
volatile u16 __iomem *ptr = addr;
+
+ if (arm64_needs_device_store_release()) {
+ __raw_writew_stlr(val, addr);
+ return;
+ }
+
asm volatile("strh %w0, %1" : : "rZ" (val), "Qo" (*ptr));
}
@@ -40,6 +82,12 @@ static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
{
volatile u32 __iomem *ptr = addr;
+
+ if (arm64_needs_device_store_release()) {
+ __raw_writel_stlr(val, addr);
+ return;
+ }
+
asm volatile("str %w0, %1" : : "rZ" (val), "Qo" (*ptr));
}
@@ -47,6 +95,12 @@ static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
{
volatile u64 __iomem *ptr = addr;
+
+ if (arm64_needs_device_store_release()) {
+ __raw_writeq_stlr(val, addr);
+ return;
+ }
+
asm volatile("str %x0, %1" : : "rZ" (val), "Qo" (*ptr));
}
@@ -147,6 +201,12 @@ static __always_inline void
__const_memcpy_toio_aligned32(volatile u32 __iomem *to, const u32 *from,
size_t count)
{
+ if (arm64_needs_device_store_release()) {
+ while (count--)
+ __raw_writel_stlr(*from++, to++);
+ return;
+ }
+
switch (count) {
case 8:
asm volatile("str %w0, [%8, #4 * 0]\n"
@@ -204,6 +264,12 @@ static __always_inline void
__const_memcpy_toio_aligned64(volatile u64 __iomem *to, const u64 *from,
size_t count)
{
+ if (arm64_needs_device_store_release()) {
+ while (count--)
+ __raw_writeq_stlr(*from++, to++);
+ return;
+ }
+
switch (count) {
case 8:
asm volatile("str %x0, [%8, #8 * 0]\n"
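As a rough user-space analogue of the dispatch pattern in the diff above (not the kernel code itself), the sketch below uses C11 atomics. The names needs_store_release, store32_release(), store32_plain() and store32() are hypothetical. A plain bool stands in for alternative_has_cap_unlikely(), which in the kernel is a patched branch rather than a memory load. On AArch64, compilers typically emit STLR for a release store and STR for a relaxed one, but that is a property of the toolchain and is assumed here rather than guaranteed.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the boot-time capability check. */
static bool needs_store_release;

/* Workaround path: release store (typically STLR on AArch64). */
static void store32_release(_Atomic uint32_t *addr, uint32_t val)
{
	atomic_store_explicit(addr, val, memory_order_release);
}

/* Default path: relaxed store, leaving the addressing mode to the compiler. */
static void store32_plain(_Atomic uint32_t *addr, uint32_t val)
{
	atomic_store_explicit(addr, val, memory_order_relaxed);
}

/* Dispatch: affected parts take the release path, all others the plain one. */
static void store32(_Atomic uint32_t *addr, uint32_t val)
{
	if (needs_store_release) {
		store32_release(addr, val);
		return;
	}
	store32_plain(addr, val);
}
```

Keeping the two paths in separate helpers means the default path is not constrained by the base-register-only form that the release store requires.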
I'll post the v3 patch using the patched jump instruction approach.
-Shanker
* Re: [PATCH v2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
2026-06-10 11:28 ` Will Deacon
2026-06-10 12:50 ` Jason Gunthorpe
2026-06-10 12:53 ` Shanker Donthineni
@ 2026-06-10 13:20 ` Shanker Donthineni
2026-06-10 16:11 ` Jason Gunthorpe
2 siblings, 1 reply; 6+ messages in thread
From: Shanker Donthineni @ 2026-06-10 13:20 UTC
To: Will Deacon
Cc: Catalin Marinas, linux-arm-kernel, Vladimir Murzin, Mark Rutland,
linux-kernel, linux-doc, Vikram Sethi, Jason Sequeira, jgg
Hi Will,
On 6/10/2026 6:28 AM, Will Deacon wrote:
> [+Jason G]
>
> On Fri, Jun 05, 2026 at 09:45:51AM -0500, Shanker Donthineni wrote:
>> On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
>> observed by a peripheral before an older, non-overlapping Device-nGnR*
>> store to the same peripheral. This breaks the program-order guarantee
>> that software expects for Device-nGnR* accesses and can leave a
>> peripheral in an incorrect state, as a load is observed before an
>> earlier store takes effect.
>>
>> The erratum can occur only when all of the following apply:
>>
>> - A PE executes a Device-nGnR* store followed by a younger
>> Device-nGnR* load.
>> - The store is not a store-release.
>> - The accesses target the same peripheral and do not overlap in bytes.
>> - There is at most one intervening Device-nGnR* store in program
>> order, and there are no intervening Device-nGnR* loads.
>> - There is no DSB, and no DMB that orders loads, between the store and
>> the load.
>> - Specific micro-architectural and timing conditions occur.
>>
>> Two ways to restore ordering: insert a barrier (any DSB, or a DMB that
>> orders loads) between the store and the load, or make the store a
>> store-release. A load-acquire on the load side would not help, because
>> acquire semantics do not prevent a load from being observed ahead of an
>> older store; only the store side (release or a barrier) closes the
>> window.
> I think you can drop the paragraph above. A store-release isn't enough
> to order against a later load in the architecture either, so we're
> clearly in micro-architecture territory and I don't think you need to
> describe mechanisms that don't work here.
>
>> Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain str*
>> to stlr* (Store-Release), which removes the "store is not a
>> store-release" condition for every device write the kernel issues.
>> Because writel() and writel_relaxed() are both built on __raw_writel()
>> in asm-generic/io.h, patching the raw variants covers both the
>> non-relaxed and relaxed APIs without touching the higher layers. Note
>> that writel()'s own barrier sits before the store, so it does not order
>> the store against a subsequent readl(); the store-release promotion is
>> what provides that ordering.
Based on the existing code comments and after reviewing this path again,
__const_memcpy_toio_aligned32() and __const_memcpy_toio_aligned64()
appear to be intended for WC regions. Since the erratum is scoped to
Device-nGnR* accesses, and WC mappings are Normal-NC on arm64, I don’t
think the STLR workaround should apply to these helpers by default.
Applying it there would also break the contiguous STR grouping that
this path relies on for write combining.
-Shanker
* Re: [PATCH v2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
2026-06-10 13:20 ` Shanker Donthineni
@ 2026-06-10 16:11 ` Jason Gunthorpe
0 siblings, 0 replies; 6+ messages in thread
From: Jason Gunthorpe @ 2026-06-10 16:11 UTC
To: Shanker Donthineni
Cc: Will Deacon, Catalin Marinas, linux-arm-kernel, Vladimir Murzin,
Mark Rutland, linux-kernel, linux-doc, Vikram Sethi,
Jason Sequeira
On Wed, Jun 10, 2026 at 08:20:28AM -0500, Shanker Donthineni wrote:
> Based on the existing code comments and after reviewing this path again,
> __const_memcpy_toio_aligned32() and __const_memcpy_toio_aligned64()
> appear to be intended for WC regions. Since the erratum is scoped to
> Device-nGnR* accesses, and WC mappings are Normal-NC on arm64, I don’t
> think the STLR workaround should apply to these helpers by default.
Hmm, unfortunately I think the APIs mix together IO and WC both as
__iomem things. However, I recall that when I was looking at this everyone
was using it for WC.
Jason