* [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
@ 2026-06-10 16:48 Shanker Donthineni
2026-06-11 13:34 ` Will Deacon
0 siblings, 1 reply; 6+ messages in thread
From: Shanker Donthineni @ 2026-06-10 16:48 UTC
To: Catalin Marinas, Will Deacon, Vladimir Murzin
Cc: Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel,
linux-doc, Shanker Donthineni, Vikram Sethi, Jason Sequeira

On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
observed by a peripheral before an older, non-overlapping Device-nGnR*
store to the same peripheral. This breaks the program-order guarantee
that software expects for Device-nGnR* accesses and can leave a
peripheral in an incorrect state, as a load is observed before an
earlier store takes effect.

The erratum can occur only when all of the following apply:

 - A PE executes a Device-nGnR* store followed by a younger
   Device-nGnR* load.
 - The store is not a store-release.
 - The accesses target the same peripheral and do not overlap in bytes.
 - There is at most one intervening Device-nGnR* store in program
   order, and there are no intervening Device-nGnR* loads.
 - There is no DSB, and no DMB that orders loads, between the store and
   the load.
 - Specific micro-architectural and timing conditions occur.

Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain str*
to stlr* (Store-Release), which removes the "store is not a
store-release" condition for every device write the kernel issues.
Because writel() and writel_relaxed() are both built on __raw_writel()
in asm-generic/io.h, patching the raw variants covers both the
non-relaxed and relaxed APIs without touching the higher layers. Note
that writel()'s own barrier sits before the store, so it does not order
the store against a subsequent readl(); the store-release promotion is
what provides that ordering.
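
As a rough illustration of the barrier placement described above, the
following user-space model records the order of events for a writel()
followed by a readl(). It is only a sketch: model_writel(), model_readl()
and the event trace are hypothetical names that do not exist in the
kernel, and it assumes that readl() places its hardware read barrier
after the load. It models where the barriers sit, not how the hardware
orders the accesses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event trace; none of these names exist in the kernel. */
enum ev { EV_WMB, EV_STORE, EV_LOAD, EV_RMB };

static enum ev trace[16];
static size_t trace_len;

static void record(enum ev e)
{
	assert(trace_len < sizeof(trace) / sizeof(trace[0]));
	trace[trace_len++] = e;
}

/* Models writel(): the write barrier comes first, then the raw store. */
static void model_writel(uint32_t val, volatile uint32_t *reg)
{
	record(EV_WMB);
	record(EV_STORE);
	*reg = val;
}

/* Models readl(): the raw load comes first, then the read barrier (assumed). */
static uint32_t model_readl(const volatile uint32_t *reg)
{
	uint32_t val;

	record(EV_LOAD);
	val = *reg;
	record(EV_RMB);
	return val;
}

/* Returns 1 if a barrier event sits between the first store and the next load. */
static int barrier_between_store_and_load(void)
{
	size_t i, store = trace_len;

	for (i = 0; i < trace_len; i++) {
		if (trace[i] == EV_STORE) {
			store = i;
			break;
		}
	}
	for (i = store + 1; i < trace_len; i++) {
		if (trace[i] == EV_LOAD)
			return 0;
		if (trace[i] == EV_WMB || trace[i] == EV_RMB)
			return 1;
	}
	return 0;
}
```

In this model a model_writel() immediately followed by a model_readl()
leaves no barrier event between the store and the load, which is the
window the store-release promotion is meant to close.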

Like ARM64_ERRATUM_832075 on the load side, the change is gated on a new
ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only activated on
parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs continue to use
the plain str* sequence.

Note: stlr* only supports base-register addressing, so affected CPUs use
a base-register stlr* path. Unaffected CPUs keep the original
offset-addressed str* sequence introduced by commit d044d6ba6f02
("arm64: io: permit offset addressing").

The __const_memcpy_toio_aligned32() and __const_memcpy_toio_aligned64()
helpers are left unchanged. These helpers are intended for
write-combining mappings, which are Normal-NC on arm64. Replacing their
contiguous str* groups would defeat the write-combining behavior used to
improve store performance.

Co-developed-by: Vikram Sethi <vsethi@nvidia.com>
Signed-off-by: Vikram Sethi <vsethi@nvidia.com>
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
---
Changes since v2:
- Reworked the raw MMIO write helpers so unaffected CPUs keep the
existing offset-addressed STR sequence, while affected CPUs use the
base-register STLR path.
- Updated the commit message to match the code changes.
- Rebased on top of the arm64 for-next/errata branch:
https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/errata
Changes since v1:
- Updated the commit message based on feedback from Vladimir Murzin.
Documentation/arch/arm64/silicon-errata.rst | 2 ++
arch/arm64/Kconfig | 23 ++++++++++++++++
arch/arm64/include/asm/io.h | 30 +++++++++++++++++++++
arch/arm64/kernel/cpu_errata.c | 8 ++++++
arch/arm64/tools/cpucaps | 1 +
5 files changed, 64 insertions(+)
diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst
index ad09bbb10da80..fc45125dc2f80 100644
--- a/Documentation/arch/arm64/silicon-errata.rst
+++ b/Documentation/arch/arm64/silicon-errata.rst
@@ -298,6 +298,8 @@ stable kernels.
 +----------------+-----------------+-----------------+-----------------------------+
 | NVIDIA         | Carmel Core     | N/A             | NVIDIA_CARMEL_CNP_ERRATUM   |
 +----------------+-----------------+-----------------+-----------------------------+
+| NVIDIA         | Olympus core    | T410-OLY-1027   | NVIDIA_OLYMPUS_1027_ERRATUM |
++----------------+-----------------+-----------------+-----------------------------+
 | NVIDIA         | Olympus core    | T410-OLY-1029   | ARM64_ERRATUM_4118414       |
 +----------------+-----------------+-----------------+-----------------------------+
 | NVIDIA         | T241 GICv3/4.x  | T241-FABRIC-4   | N/A                         |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c65cef81be86a..d633eb70de1ac 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -564,6 +564,29 @@ config ARM64_ERRATUM_832075
 	  If unsure, say Y.
+config NVIDIA_OLYMPUS_1027_ERRATUM
+	bool "NVIDIA Olympus: device store/load ordering erratum"
+	default y
+	help
+	  This option adds an alternative code sequence to work around an
+	  NVIDIA Olympus core erratum where a Device-nGnR* store can be
+	  observed by a peripheral after a younger Device-nGnR* load to the
+	  same peripheral. This breaks the program order that drivers rely
+	  on for MMIO and can leave a device in an incorrect state.
+
+	  The workaround promotes the raw MMIO store helpers
+	  (__raw_writeb/w/l/q) to Store-Release (STLR), which restores the
+	  required ordering. Because writel() and writel_relaxed() are built
+	  on __raw_writel(), both are covered without changes to the higher
+	  layers.
+
+	  The fix is applied through the alternatives framework, so enabling
+	  this option does not by itself activate the workaround: it is
+	  patched in only when an affected CPU is detected, and is a no-op on
+	  unaffected CPUs.
+
+	  If unsure, say Y.
+
 config ARM64_ERRATUM_834220
 	bool "Cortex-A57: 834220: Stage 2 translation fault might be incorrectly reported in presence of a Stage 1 fault (rare)"
 	depends on KVM
diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
index 8cbd1e96fd50b..801223e754c90 100644
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -22,10 +22,22 @@
 /*
  * Generic IO read/write. These perform native-endian accesses.
  */
+static __always_inline bool arm64_needs_device_store_release(void)
+{
+	return alternative_has_cap_unlikely(
+			ARM64_WORKAROUND_DEVICE_STORE_RELEASE);
+}
+
 #define __raw_writeb __raw_writeb
 static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
 {
 	volatile u8 __iomem *ptr = addr;
+
+	if (arm64_needs_device_store_release()) {
+		asm volatile("stlrb %w0, [%1]" : : "rZ" (val), "r" (addr));
+		return;
+	}
+
 	asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
 }
@@ -33,6 +45,12 @@ static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
 static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
 {
 	volatile u16 __iomem *ptr = addr;
+
+	if (arm64_needs_device_store_release()) {
+		asm volatile("stlrh %w0, [%1]" : : "rZ" (val), "r" (addr));
+		return;
+	}
+
 	asm volatile("strh %w0, %1" : : "rZ" (val), "Qo" (*ptr));
 }
@@ -40,6 +58,12 @@ static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
 static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
 {
 	volatile u32 __iomem *ptr = addr;
+
+	if (arm64_needs_device_store_release()) {
+		asm volatile("stlr %w0, [%1]" : : "rZ" (val), "r" (addr));
+		return;
+	}
+
 	asm volatile("str %w0, %1" : : "rZ" (val), "Qo" (*ptr));
 }
@@ -47,6 +71,12 @@ static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
 static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
 {
 	volatile u64 __iomem *ptr = addr;
+
+	if (arm64_needs_device_store_release()) {
+		asm volatile("stlr %x0, [%1]" : : "rZ" (val), "r" (addr));
+		return;
+	}
+
 	asm volatile("str %x0, %1" : : "rZ" (val), "Qo" (*ptr));
 }
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index d597896b0f7f3..b096d9acca578 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -838,6 +838,14 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
 		ERRATA_MIDR_ALL_VERSIONS(MIDR_NVIDIA_CARMEL),
 	},
 #endif
+#ifdef CONFIG_NVIDIA_OLYMPUS_1027_ERRATUM
+	{
+		/* NVIDIA Olympus core */
+		.desc = "NVIDIA Olympus device load/store ordering erratum",
+		.capability = ARM64_WORKAROUND_DEVICE_STORE_RELEASE,
+		ERRATA_MIDR_ALL_VERSIONS(MIDR_NVIDIA_OLYMPUS),
+	},
+#endif
 #ifdef CONFIG_ARM64_WORKAROUND_TRBE_OVERWRITE_FILL_MODE
 	{
 		/*
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index 811c2479e82d6..d367257bf7703 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -120,6 +120,7 @@ WORKAROUND_CAVIUM_TX2_219_PRFM
 WORKAROUND_CAVIUM_TX2_219_TVM
 WORKAROUND_CLEAN_CACHE
 WORKAROUND_DEVICE_LOAD_ACQUIRE
+WORKAROUND_DEVICE_STORE_RELEASE
 WORKAROUND_NVIDIA_CARMEL_CNP
 WORKAROUND_PMUV3_IMPDEF_TRAPS
 WORKAROUND_QCOM_FALKOR_E1003
--
2.43.0
* Re: [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
2026-06-10 16:48 [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum Shanker Donthineni
@ 2026-06-11 13:34 ` Will Deacon
2026-06-11 14:08 ` Shanker Donthineni
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Will Deacon @ 2026-06-11 13:34 UTC
To: Shanker Donthineni
Cc: Catalin Marinas, Vladimir Murzin, Jason Gunthorpe, linux-arm-kernel,
Mark Rutland, linux-kernel, linux-doc, Vikram Sethi, Jason Sequeira

On Wed, Jun 10, 2026 at 11:48:22AM -0500, Shanker Donthineni wrote:

[...]

> diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
> index 8cbd1e96fd50b..801223e754c90 100644
> --- a/arch/arm64/include/asm/io.h
> +++ b/arch/arm64/include/asm/io.h
> @@ -22,10 +22,22 @@
>  /*
>   * Generic IO read/write. These perform native-endian accesses.
>   */
> +static __always_inline bool arm64_needs_device_store_release(void)
> +{
> +	return alternative_has_cap_unlikely(
> +			ARM64_WORKAROUND_DEVICE_STORE_RELEASE);
> +}
> +
>  #define __raw_writeb __raw_writeb
>  static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
>  {
>  	volatile u8 __iomem *ptr = addr;
> +
> +	if (arm64_needs_device_store_release()) {
> +		asm volatile("stlrb %w0, [%1]" : : "rZ" (val), "r" (addr));
> +		return;
> +	}
> +
>  	asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
>  }

Use an 'else' clause instead of the early return? (similarly for the other
changes).

I still reckon you should do something with the memcpy-to-io routines.
A simple option could be to make dgh() a dmb on parts with the erratum?
That at least moves the barrier out of the loop.

Will
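
The if/else shape suggested above could look roughly like the sketch
below. It is a portable user-space stand-in, not the kernel helper: the
sketch_* names and the boolean flag replacing the capability check are
hypothetical, and the GCC/Clang __atomic_store_n() builtin is used in
place of the inline STLRB so that the code also builds on non-arm64
hosts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the ARM64_WORKAROUND_DEVICE_STORE_RELEASE check. */
static bool sketch_needs_store_release;

/* Count how often each path ran, so the selection can be checked. */
static unsigned int sketch_release_stores, sketch_plain_stores;

static inline void sketch_raw_writeb(uint8_t val, volatile void *addr)
{
	volatile uint8_t *ptr = addr;

	assert(addr);

	if (sketch_needs_store_release) {
		/* Store-release path (a release store via the compiler builtin). */
		__atomic_store_n(ptr, val, __ATOMIC_RELEASE);
		sketch_release_stores++;
	} else {
		/* Plain store path; the compiler is free to pick the addressing mode. */
		*ptr = val;
		sketch_plain_stores++;
	}
}
```

Both branches end at the closing brace of the function, so there is a
single exit and no early return from the store-release path.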
* Re: [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
2026-06-11 13:34 ` Will Deacon
@ 2026-06-11 14:08 ` Shanker Donthineni
2026-06-11 15:08 ` Vladimir Murzin
2026-06-11 17:49 ` Jason Gunthorpe
2 siblings, 0 replies; 6+ messages in thread
From: Shanker Donthineni @ 2026-06-11 14:08 UTC
To: Will Deacon
Cc: Catalin Marinas, Vladimir Murzin, Jason Gunthorpe, linux-arm-kernel,
Mark Rutland, linux-kernel, linux-doc, Vikram Sethi, Jason Sequeira

Hi Will,

On 6/11/2026 8:34 AM, Will Deacon wrote:
> On Wed, Jun 10, 2026 at 11:48:22AM -0500, Shanker Donthineni wrote:

[...]

>>  #define __raw_writeb __raw_writeb
>>  static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
>>  {
>>  	volatile u8 __iomem *ptr = addr;
>> +
>> +	if (arm64_needs_device_store_release()) {
>> +		asm volatile("stlrb %w0, [%1]" : : "rZ" (val), "r" (addr));
>> +		return;
>> +	}
>> +
>>  	asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
>>  }
> Use an 'else' clause instead of the early return? (similarly for the other
> changes).

I agree. I’ll rework the raw write helpers to use an explicit if/else form
instead of returning early from the STLR path.

> I still reckon you should do something with the memcpy-to-io routines.
> A simple option could be to make dgh() a dmb on parts with the erratum?
> That at least moves the barrier out of the loop.

For the memcpy-to-IO routines, would it be acceptable to address the erratum
by patching dgh() to a DMB OSH on affected CPUs, as shown below? I’ll also
sync with the Olympus CPU hardware team to confirm this approach for the v4
patch.

#define dgh() asm volatile(ALTERNATIVE("hint #6", "dmb osh",		\
				ARM64_WORKAROUND_DEVICE_STORE_RELEASE)	\
				: : : "memory")

This keeps the existing memcpy-to-IO store sequences unchanged while placing
the ordering barrier outside the copy loop as you suggested.

-Shanker
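
To make the "barrier outside the copy loop" idea concrete, here is a
small user-space model. It is only a sketch under stated assumptions:
the sketch_* names are hypothetical, the counter stands in for executed
barriers, and a generic compiler fence is used as a placeholder rather
than the proposed "dmb osh". It also assumes that the copy helpers are
followed by a single dgh() call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter standing in for executed ordering barriers. */
static unsigned int sketch_barriers;

/*
 * Stand-in for dgh(). In the proposal above this would be patched to
 * "dmb osh" on affected CPUs; the fence here is only a placeholder.
 */
static inline void sketch_dgh(void)
{
	sketch_barriers++;
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
}

/* Copies count 64-bit words with plain stores, then issues one barrier. */
static inline void sketch_iowrite64_copy(volatile uint64_t *to,
					 const uint64_t *from, size_t count)
{
	size_t i;

	assert(to && from);

	for (i = 0; i < count; i++)
		to[i] = from[i];	/* contiguous plain stores, no barrier */

	sketch_dgh();			/* single barrier after the loop */
}
```

Copying eight words this way executes eight plain stores and one
barrier, instead of one store-release per word.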
* Re: [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum 2026-06-11 13:34 ` Will Deacon 2026-06-11 14:08 ` Shanker Donthineni @ 2026-06-11 15:08 ` Vladimir Murzin 2026-06-11 16:00 ` Shanker Donthineni 2026-06-11 17:49 ` Jason Gunthorpe 2 siblings, 1 reply; 6+ messages in thread From: Vladimir Murzin @ 2026-06-11 15:08 UTC (permalink / raw) To: Will Deacon, Shanker Donthineni Cc: Catalin Marinas, Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel, linux-doc, Vikram Sethi, Jason Sequeira Hi, On 6/11/26 14:34, Will Deacon wrote: > On Wed, Jun 10, 2026 at 11:48:22AM -0500, Shanker Donthineni wrote: >> On systems with NVIDIA Olympus cores, a Device-nGnR* load can be >> observed by a peripheral before an older, non-overlapping Device-nGnR* >> store to the same peripheral. This breaks the program-order guarantee >> that software expects for Device-nGnR* accesses and can leave a >> peripheral in an incorrect state, as a load is observed before an >> earlier store takes effect. >> >> The erratum can occur only when all of the following apply: >> >> - A PE executes a Device-nGnR* store followed by a younger >> Device-nGnR* load. >> - The store is not a store-release. >> - The accesses target the same peripheral and do not overlap in bytes. >> - There is at most one intervening Device-nGnR* store in program >> order, and there are no intervening Device-nGnR* loads. >> - There is no DSB, and no DMB that orders loads, between the store and >> the load. >> - Specific micro-architectural and timing conditions occur. >> >> Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain str* >> to stlr* (Store-Release), which removes the "store is not a >> store-release" condition for every device write the kernel issues. >> Because writel() and writel_relaxed() are both built on __raw_writel() >> in asm-generic/io.h, patching the raw variants covers both the >> non-relaxed and relaxed APIs without touching the higher layers. 
Note >> that writel()'s own barrier sits before the store, so it does not order >> the store against a subsequent readl(); the store-release promotion is >> what provides that ordering. >> >> Like ARM64_ERRATUM_832075 on the load side, the change is gated on a new >> ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only activated on >> parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs continue to use >> the plain str* sequence. >> >> Note: stlr* only supports base-register addressing, so affected CPUs use >> a base-register stlr* path. Unaffected CPUs keep the original >> offset-addressed str* sequence introduced by commit d044d6ba6f02 >> ("arm64: io: permit offset addressing"). >> >> The __const_memcpy_toio_aligned32() and __const_memcpy_toio_aligned64() >> helpers are left unchanged. These helpers are intended for >> write-combining mappings, which are Normal-NC on arm64. Replacing their >> contiguous str* groups would defeat the write-combining behavior used to >> improve store performance. >> >> Co-developed-by: Vikram Sethi <vsethi@nvidia.com> >> Signed-off-by: Vikram Sethi <vsethi@nvidia.com> >> Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> >> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> >> --- >> Changes since v2: >> - Reworked the raw MMIO write helpers so unaffected CPUs keep the >> existing offset-addressed STR sequence, while affected CPUs use the >> base-register STLR path. >> - Updated the commit message to match the code changes. >> - Rebased on top of the arm64 for-next/errata branch: >> https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/errata >> >> Changes since v1: >> - Updated the commit message based on feedback from Vladimir Murzin. 
>> >> Documentation/arch/arm64/silicon-errata.rst | 2 ++ >> arch/arm64/Kconfig | 23 ++++++++++++++++ >> arch/arm64/include/asm/io.h | 30 +++++++++++++++++++++ >> arch/arm64/kernel/cpu_errata.c | 8 ++++++ >> arch/arm64/tools/cpucaps | 1 + >> 5 files changed, 64 insertions(+) >> >> diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst >> index ad09bbb10da80..fc45125dc2f80 100644 >> --- a/Documentation/arch/arm64/silicon-errata.rst >> +++ b/Documentation/arch/arm64/silicon-errata.rst >> @@ -298,6 +298,8 @@ stable kernels. >> +----------------+-----------------+-----------------+-----------------------------+ >> | NVIDIA | Carmel Core | N/A | NVIDIA_CARMEL_CNP_ERRATUM | >> +----------------+-----------------+-----------------+-----------------------------+ >> +| NVIDIA | Olympus core | T410-OLY-1027 | NVIDIA_OLYMPUS_1027_ERRATUM | >> ++----------------+-----------------+-----------------+-----------------------------+ >> | NVIDIA | Olympus core | T410-OLY-1029 | ARM64_ERRATUM_4118414 | >> +----------------+-----------------+-----------------+-----------------------------+ >> | NVIDIA | T241 GICv3/4.x | T241-FABRIC-4 | N/A | >> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig >> index c65cef81be86a..d633eb70de1ac 100644 >> --- a/arch/arm64/Kconfig >> +++ b/arch/arm64/Kconfig >> @@ -564,6 +564,29 @@ config ARM64_ERRATUM_832075 >> >> If unsure, say Y. >> >> +config NVIDIA_OLYMPUS_1027_ERRATUM >> + bool "NVIDIA Olympus: device store/load ordering erratum" >> + default y >> + help >> + This option adds an alternative code sequence to work around an >> + NVIDIA Olympus core erratum where a Device-nGnR* store can be >> + observed by a peripheral after a younger Device-nGnR* load to the >> + same peripheral. This breaks the program order that drivers rely >> + on for MMIO and can leave a device in an incorrect state. 
>> + >> + The workaround promotes the raw MMIO store helpers >> + (__raw_writeb/w/l/q) to Store-Release (STLR), which restores the >> + required ordering. Because writel() and writel_relaxed() are built >> + on __raw_writel(), both are covered without changes to the higher >> + layers. >> + >> + The fix is applied through the alternatives framework, so enabling >> + this option does not by itself activate the workaround: it is >> + patched in only when an affected CPU is detected, and is a no-op on >> + unaffected CPUs. >> + >> + If unsure, say Y. >> + >> config ARM64_ERRATUM_834220 >> bool "Cortex-A57: 834220: Stage 2 translation fault might be incorrectly reported in presence of a Stage 1 fault (rare)" >> depends on KVM >> diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h >> index 8cbd1e96fd50b..801223e754c90 100644 >> --- a/arch/arm64/include/asm/io.h >> +++ b/arch/arm64/include/asm/io.h >> @@ -22,10 +22,22 @@ >> /* >> * Generic IO read/write. These perform native-endian accesses. >> */ >> +static __always_inline bool arm64_needs_device_store_release(void) >> +{ >> + return alternative_has_cap_unlikely( >> + ARM64_WORKAROUND_DEVICE_STORE_RELEASE); >> +} >> + >> #define __raw_writeb __raw_writeb >> static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr) >> { >> volatile u8 __iomem *ptr = addr; >> + >> + if (arm64_needs_device_store_release()) { >> + asm volatile("stlrb %w0, [%1]" : : "rZ" (val), "r" (addr)); >> + return; >> + } >> + >> asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr)); >> } > Use an 'else' clause instead of the early return? (similarly for the other > changes). Perhaps I'm missing something, but it is not clear to me why all that complexity is required. 
IIUC, benefits coming with d044d6ba6f02 ("arm64: io: permit offset
addressing") are from better code generation, so we:
 - save code
 - open opportunity for write-combining

d044d6ba6f02 ("arm64: io: permit offset addressing") comes with a simple
benchmark to measure the effect of code generation:

| void writeq_zero_8_times(void *ptr)
| {
|     writeq_relaxed(0, ptr + 8 * 0);
|     writeq_relaxed(0, ptr + 8 * 1);
|     writeq_relaxed(0, ptr + 8 * 2);
|     writeq_relaxed(0, ptr + 8 * 3);
|     writeq_relaxed(0, ptr + 8 * 4);
|     writeq_relaxed(0, ptr + 8 * 5);
|     writeq_relaxed(0, ptr + 8 * 6);
|     writeq_relaxed(0, ptr + 8 * 7);
| }

which compiles to

| <writeq_zero_8_times>:
|     str xzr, [x0]
|     str xzr, [x0, #8]
|     str xzr, [x0, #16]
|     str xzr, [x0, #24]
|     str xzr, [x0, #32]
|     str xzr, [x0, #40]
|     str xzr, [x0, #48]
|     str xzr, [x0, #56]

v1/v2 compiles to

| <writeq_zero_8_times>:
|     str xzr, [x0]
|     add x1, x0, #0x8
|     str xzr, [x1]
|     add x1, x0, #0x10
|     str xzr, [x1]
|     add x1, x0, #0x18
|     str xzr, [x1]
|     add x1, x0, #0x20
|     str xzr, [x1]
|     add x1, x0, #0x28
|     str xzr, [x1]
|     add x1, x0, #0x30
|     str xzr, [x1]
|     add x0, x0, #0x38
|     str xzr, [x0]

where alternatives are swapping str with stlr. In other words, we are
rolling back to the pre-d044d6ba6f02 implementation.

v3 compiles to:

| <writeq_zero_8_times>:
|     nop
|     str xzr, [x0]
|     add x1, x0, #0x8
|     nop
|     str xzr, [x1]
|     add x1, x0, #0x10
|     nop
|     str xzr, [x1]
|     add x1, x0, #0x18
|     nop
|     str xzr, [x1]
|     add x1, x0, #0x20
|     nop
|     str xzr, [x1]
|     add x1, x0, #0x28
|     nop
|     str xzr, [x1]
|     add x1, x0, #0x30
|     nop
|     str xzr, [x1]
|     add x0, x0, #0x38
|     nop
|     str xzr, [x0]
|     ret

where the static branch swaps each nop with a branch to the stlr and
back to the add.

So it looks to me that we're losing an opportunity for write
combining, but in terms of code size, v1/v2 seems to be the lesser of
two evils.

Cheers
Vladimir

>
> I still reckon you should do something with the memcpy-to-io routines.
> A simple option could be to make dgh() a dmb on parts with the erratum?
> That at least moves the barrier out of the loop.
>
> Will
>
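The difference between the first two listings above comes down to the inline-asm operand constraint: a memory operand ("Qo") lets the compiler choose [base, #imm] addressing, while a register operand ("r") forces the full address into a register first. A minimal userspace sketch of the two forms follows. The sketch_* names are hypothetical, the "memory" clobber is added only so plain C can read the result back, and the non-arm64 fallback exists only so the file builds elsewhere; none of this is the kernel's code.

```c
#include <stdint.h>

/*
 * Offset-capable store, modelled on the "Qo" memory-operand form quoted in
 * the thread. The compiler may fold a constant offset into [base, #imm].
 * The "memory" clobber is an addition for this userspace sketch so that
 * ordinary C code can read the stored value back afterwards.
 */
static inline void sketch_writeq_offset(uint64_t val, volatile void *addr)
{
	volatile uint64_t *ptr = addr;
#if defined(__aarch64__)
	__asm__ volatile("str %x0, %1" : : "rZ" (val), "Qo" (*ptr) : "memory");
#else
	*ptr = val;	/* portable fallback so the sketch builds elsewhere */
#endif
}

/*
 * Base-register-only store: the address must already sit in a register,
 * which is the only addressing form STLR accepts.
 */
static inline void sketch_writeq_base(uint64_t val, volatile void *addr)
{
#if defined(__aarch64__)
	__asm__ volatile("str %x0, [%1]" : : "rZ" (val), "r" (addr) : "memory");
#else
	*(volatile uint64_t *)addr = val;
#endif
}

/* Two adjacent 64-bit stores through each helper, for comparing codegen. */
void sketch_zero_pair_offset(void *p)
{
	sketch_writeq_offset(0, p);
	sketch_writeq_offset(0, (char *)p + 8);
}

void sketch_zero_pair_base(void *p)
{
	sketch_writeq_base(0, p);
	sketch_writeq_base(0, (char *)p + 8);
}
```

Building this for arm64 with optimization and disassembling the two sketch_zero_pair_* functions is one way to compare against the listings in the message above; the exact output depends on the compiler and flags.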
* Re: [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
2026-06-11 15:08 ` Vladimir Murzin
@ 2026-06-11 16:00 ` Shanker Donthineni
0 siblings, 0 replies; 6+ messages in thread
From: Shanker Donthineni @ 2026-06-11 16:00 UTC (permalink / raw)
To: Vladimir Murzin, Will Deacon
Cc: Catalin Marinas, Jason Gunthorpe, linux-arm-kernel, Mark Rutland,
linux-kernel, linux-doc, Vikram Sethi, Jason Sequeira

Hi Vladimir,

On 6/11/2026 10:08 AM, Vladimir Murzin wrote:
> Hi,
>
> On 6/11/26 14:34, Will Deacon wrote:
>> On Wed, Jun 10, 2026 at 11:48:22AM -0500, Shanker Donthineni wrote:
>>> On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
>>> observed by a peripheral before an older, non-overlapping Device-nGnR*
>>> store to the same peripheral. This breaks the program-order guarantee
>>> that software expects for Device-nGnR* accesses and can leave a
>>> peripheral in an incorrect state, as a load is observed before an
>>> earlier store takes effect.
>>>
>>> The erratum can occur only when all of the following apply:
>>>
>>> - A PE executes a Device-nGnR* store followed by a younger
>>>   Device-nGnR* load.
>>> - The store is not a store-release.
>>> - The accesses target the same peripheral and do not overlap in bytes.
>>> - There is at most one intervening Device-nGnR* store in program
>>>   order, and there are no intervening Device-nGnR* loads.
>>> - There is no DSB, and no DMB that orders loads, between the store and
>>>   the load.
>>> - Specific micro-architectural and timing conditions occur.
>>>
>>> Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain str*
>>> to stlr* (Store-Release), which removes the "store is not a
>>> store-release" condition for every device write the kernel issues.
>>> Because writel() and writel_relaxed() are both built on __raw_writel()
>>> in asm-generic/io.h, patching the raw variants covers both the
>>> non-relaxed and relaxed APIs without touching the higher layers. Note
>>> that writel()'s own barrier sits before the store, so it does not order
>>> the store against a subsequent readl(); the store-release promotion is
>>> what provides that ordering.
>>>
>>> Like ARM64_ERRATUM_832075 on the load side, the change is gated on a new
>>> ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only activated on
>>> parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs continue to use
>>> the plain str* sequence.
>>>
>>> Note: stlr* only supports base-register addressing, so affected CPUs use
>>> a base-register stlr* path. Unaffected CPUs keep the original
>>> offset-addressed str* sequence introduced by commit d044d6ba6f02
>>> ("arm64: io: permit offset addressing").
>>>
>>> The __const_memcpy_toio_aligned32() and __const_memcpy_toio_aligned64()
>>> helpers are left unchanged. These helpers are intended for
>>> write-combining mappings, which are Normal-NC on arm64. Replacing their
>>> contiguous str* groups would defeat the write-combining behavior used to
>>> improve store performance.
>>>
>>> Co-developed-by: Vikram Sethi <vsethi@nvidia.com>
>>> Signed-off-by: Vikram Sethi <vsethi@nvidia.com>
>>> Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
>>> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
>>> ---
>>> Changes since v2:
>>> - Reworked the raw MMIO write helpers so unaffected CPUs keep the
>>>   existing offset-addressed STR sequence, while affected CPUs use the
>>>   base-register STLR path.
>>> - Updated the commit message to match the code changes.
>>> - Rebased on top of the arm64 for-next/errata branch:
>>>   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/errata
>>>
>>> Changes since v1:
>>> - Updated the commit message based on feedback from Vladimir Murzin.
>>>
>>>  Documentation/arch/arm64/silicon-errata.rst |  2 ++
>>>  arch/arm64/Kconfig                          | 23 ++++++++++++++++
>>>  arch/arm64/include/asm/io.h                 | 30 +++++++++++++++++++++
>>>  arch/arm64/kernel/cpu_errata.c              |  8 ++++++
>>>  arch/arm64/tools/cpucaps                    |  1 +
>>>  5 files changed, 64 insertions(+)
>>>
>>> diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst
>>> index ad09bbb10da80..fc45125dc2f80 100644
>>> --- a/Documentation/arch/arm64/silicon-errata.rst
>>> +++ b/Documentation/arch/arm64/silicon-errata.rst
>>> @@ -298,6 +298,8 @@ stable kernels.
>>>  +----------------+-----------------+-----------------+-----------------------------+
>>>  | NVIDIA         | Carmel Core     | N/A             | NVIDIA_CARMEL_CNP_ERRATUM   |
>>>  +----------------+-----------------+-----------------+-----------------------------+
>>> +| NVIDIA         | Olympus core    | T410-OLY-1027   | NVIDIA_OLYMPUS_1027_ERRATUM |
>>> ++----------------+-----------------+-----------------+-----------------------------+
>>>  | NVIDIA         | Olympus core    | T410-OLY-1029   | ARM64_ERRATUM_4118414       |
>>>  +----------------+-----------------+-----------------+-----------------------------+
>>>  | NVIDIA         | T241 GICv3/4.x  | T241-FABRIC-4   | N/A                         |
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index c65cef81be86a..d633eb70de1ac 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -564,6 +564,29 @@ config ARM64_ERRATUM_832075
>>>
>>>  	  If unsure, say Y.
>>>
>>> +config NVIDIA_OLYMPUS_1027_ERRATUM
>>> +	bool "NVIDIA Olympus: device store/load ordering erratum"
>>> +	default y
>>> +	help
>>> +	  This option adds an alternative code sequence to work around an
>>> +	  NVIDIA Olympus core erratum where a Device-nGnR* store can be
>>> +	  observed by a peripheral after a younger Device-nGnR* load to the
>>> +	  same peripheral. This breaks the program order that drivers rely
>>> +	  on for MMIO and can leave a device in an incorrect state.
>>> +
>>> +	  The workaround promotes the raw MMIO store helpers
>>> +	  (__raw_writeb/w/l/q) to Store-Release (STLR), which restores the
>>> +	  required ordering. Because writel() and writel_relaxed() are built
>>> +	  on __raw_writel(), both are covered without changes to the higher
>>> +	  layers.
>>> +
>>> +	  The fix is applied through the alternatives framework, so enabling
>>> +	  this option does not by itself activate the workaround: it is
>>> +	  patched in only when an affected CPU is detected, and is a no-op on
>>> +	  unaffected CPUs.
>>> +
>>> +	  If unsure, say Y.
>>> +
>>>  config ARM64_ERRATUM_834220
>>>  	bool "Cortex-A57: 834220: Stage 2 translation fault might be incorrectly reported in presence of a Stage 1 fault (rare)"
>>>  	depends on KVM
>>> diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
>>> index 8cbd1e96fd50b..801223e754c90 100644
>>> --- a/arch/arm64/include/asm/io.h
>>> +++ b/arch/arm64/include/asm/io.h
>>> @@ -22,10 +22,22 @@
>>>  /*
>>>   * Generic IO read/write. These perform native-endian accesses.
>>>   */
>>> +static __always_inline bool arm64_needs_device_store_release(void)
>>> +{
>>> +	return alternative_has_cap_unlikely(
>>> +			ARM64_WORKAROUND_DEVICE_STORE_RELEASE);
>>> +}
>>> +
>>>  #define __raw_writeb __raw_writeb
>>>  static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
>>>  {
>>>  	volatile u8 __iomem *ptr = addr;
>>> +
>>> +	if (arm64_needs_device_store_release()) {
>>> +		asm volatile("stlrb %w0, [%1]" : : "rZ" (val), "r" (addr));
>>> +		return;
>>> +	}
>>> +
>>>  	asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
>>>  }
>> Use an 'else' clause instead of the early return? (similarly for the other
>> changes).
> Perhaps I'm missing something, but it is not clear to me why all that
> complexity is required.
>
> IIUC, benefits coming with d044d6ba6f02 ("arm64: io: permit offset
> addressing") are from better code generation, so we:
>  - save code
>  - open opportunity for write-combining
>
> d044d6ba6f02 ("arm64: io: permit offset addressing") comes with a simple
> benchmark to measure the effect of code generation:
>
> | void writeq_zero_8_times(void *ptr)
> | {
> |     writeq_relaxed(0, ptr + 8 * 0);
> |     writeq_relaxed(0, ptr + 8 * 1);
> |     writeq_relaxed(0, ptr + 8 * 2);
> |     writeq_relaxed(0, ptr + 8 * 3);
> |     writeq_relaxed(0, ptr + 8 * 4);
> |     writeq_relaxed(0, ptr + 8 * 5);
> |     writeq_relaxed(0, ptr + 8 * 6);
> |     writeq_relaxed(0, ptr + 8 * 7);
> | }
>
> which compiles to
>
> | <writeq_zero_8_times>:
> |     str xzr, [x0]
> |     str xzr, [x0, #8]
> |     str xzr, [x0, #16]
> |     str xzr, [x0, #24]
> |     str xzr, [x0, #32]
> |     str xzr, [x0, #40]
> |     str xzr, [x0, #48]
> |     str xzr, [x0, #56]
>
>
> v1/v2 compiles to
>
> | <writeq_zero_8_times>:
> |     str xzr, [x0]
> |     add x1, x0, #0x8
> |     str xzr, [x1]
> |     add x1, x0, #0x10
> |     str xzr, [x1]
> |     add x1, x0, #0x18
> |     str xzr, [x1]
> |     add x1, x0, #0x20
> |     str xzr, [x1]
> |     add x1, x0, #0x28
> |     str xzr, [x1]
> |     add x1, x0, #0x30
> |     str xzr, [x1]
> |     add x0, x0, #0x38
> |     str xzr, [x0]
>
> where alternatives are swapping str with stlr. In other words, we are
> rolling back to the pre-d044d6ba6f02 implementation.
>
> v3 compiles to:
>
> | <writeq_zero_8_times>:
> |     nop
> |     str xzr, [x0]
> |     add x1, x0, #0x8
> |     nop
> |     str xzr, [x1]
> |     add x1, x0, #0x10
> |     nop
> |     str xzr, [x1]
> |     add x1, x0, #0x18
> |     nop
> |     str xzr, [x1]
> |     add x1, x0, #0x20
> |     nop
> |     str xzr, [x1]
> |     add x1, x0, #0x28
> |     nop
> |     str xzr, [x1]
> |     add x1, x0, #0x30
> |     nop
> |     str xzr, [x1]
> |     add x0, x0, #0x38
> |     nop
> |     str xzr, [x0]
> |     ret
>
> where the static branch swaps each nop with a branch to the stlr and
> back to the add.
>
> So it looks to me that we're losing an opportunity for write
> combining, but in terms of code size, v1/v2 seems to be the lesser of
> two evils.

Thanks, that makes sense.

My intent with the v3 change was to keep the offset-addressed STR
sequence on unaffected CPUs and use the base-register STLR sequence
only on affected CPUs. However, as you point out, because STLR only
supports base-register addressing, the affected path still forces the
address to be materialized in a register, and the
alternative_has_cap_unlikely() check adds another instruction at each
write site. So the generated code no longer preserves the benefit from
d044d6ba6f02 in practice.

Given that, I agree the extra complexity is not justified. I'll
simplify the raw MMIO write helpers back to the direct ALTERNATIVE()
form from v1/v2, where both the STR and STLR paths use base-register
addressing. That is still a regression from the offset-addressed STR
sequence on unaffected CPUs, but it avoids the additional
static-branch/nop overhead and is the smaller of the two options.

-Shanker
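As a rough illustration of the simplification described in the reply above, the following userspace sketch keeps a single base-register store per helper and selects STR or STLR in one place. The v1/v2 code is not quoted in this part of the thread, so this is an assumption about its shape. SKETCH_ALT() and SKETCH_DEVICE_STORE_RELEASE are hypothetical stand-ins: the kernel's ALTERNATIVE() patches the instruction at boot based on a CPU capability, whereas this sketch chooses at build time. The "memory" clobber and the non-arm64 fallback are additions so the sketch can be built and checked outside the kernel.

```c
#include <stdint.h>

/*
 * Hypothetical build-time stand-in for the kernel's runtime ALTERNATIVE()
 * patching: pick the plain or the store-release instruction string.
 */
#ifdef SKETCH_DEVICE_STORE_RELEASE
#define SKETCH_ALT(plain, release) release
#else
#define SKETCH_ALT(plain, release) plain
#endif

/* Both variants use [base] addressing, so one operand list serves either. */
static inline void sketch_raw_writeb(uint8_t val, volatile void *addr)
{
#if defined(__aarch64__)
	__asm__ volatile(SKETCH_ALT("strb %w0, [%1]", "stlrb %w0, [%1]")
			 : : "rZ" (val), "r" (addr) : "memory");
#else
	*(volatile uint8_t *)addr = val;	/* portable fallback */
#endif
}

static inline void sketch_raw_writel(uint32_t val, volatile void *addr)
{
#if defined(__aarch64__)
	__asm__ volatile(SKETCH_ALT("str %w0, [%1]", "stlr %w0, [%1]")
			 : : "rZ" (val), "r" (addr) : "memory");
#else
	*(volatile uint32_t *)addr = val;	/* portable fallback */
#endif
}

static inline void sketch_raw_writeq(uint64_t val, volatile void *addr)
{
#if defined(__aarch64__)
	__asm__ volatile(SKETCH_ALT("str %x0, [%1]", "stlr %x0, [%1]")
			 : : "rZ" (val), "r" (addr) : "memory");
#else
	*(volatile uint64_t *)addr = val;	/* portable fallback */
#endif
}
```

Because both instruction strings share the same operands, there is no branch and no extra nop at the call site, which is the trade-off the reply settles on; the cost is that the offset-addressed form is no longer available on unaffected CPUs.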
* Re: [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
2026-06-11 13:34 ` Will Deacon
2026-06-11 14:08 ` Shanker Donthineni
2026-06-11 15:08 ` Vladimir Murzin
@ 2026-06-11 17:49 ` Jason Gunthorpe
2 siblings, 0 replies; 6+ messages in thread
From: Jason Gunthorpe @ 2026-06-11 17:49 UTC (permalink / raw)
To: Will Deacon
Cc: Shanker Donthineni, Catalin Marinas, Vladimir Murzin,
linux-arm-kernel, Mark Rutland, linux-kernel, linux-doc,
Vikram Sethi, Jason Sequeira

On Thu, Jun 11, 2026 at 02:34:14PM +0100, Will Deacon wrote:

> I still reckon you should do something with the memcpy-to-io routines.
> A simple option could be to make dgh() a dmb on parts with the erratum?
> That at least moves the barrier out of the loop.

AFAIK only callers that know they are using WC memory should be
calling dgh() and in that case we know it is NORMAL-NC and we don't
need a different barrier.

Other random users calling memcpy_to_io functions on real IO don't
have to do dgh(), and AFAIK it doesn't do anything on the Device
memory types?

Jason
end of thread, other threads:[~2026-06-11 17:50 UTC]

Thread overview: 6+ messages
2026-06-10 16:48 [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum Shanker Donthineni
2026-06-11 13:34 ` Will Deacon
2026-06-11 14:08   ` Shanker Donthineni
2026-06-11 15:08   ` Vladimir Murzin
2026-06-11 16:00     ` Shanker Donthineni
2026-06-11 17:49   ` Jason Gunthorpe