[PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum

* [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
@ 2026-06-10 16:48 Shanker Donthineni
  2026-06-11 13:34 ` Will Deacon
  0 siblings, 1 reply; 6+ messages in thread
From: Shanker Donthineni @ 2026-06-10 16:48 UTC
  To: Catalin Marinas, Will Deacon, Vladimir Murzin
  Cc: Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel,
	linux-doc, Shanker Donthineni, Vikram Sethi, Jason Sequeira

On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
observed by a peripheral before an older, non-overlapping Device-nGnR*
store to the same peripheral. This breaks the program-order guarantee
that software expects for Device-nGnR* accesses and can leave a
peripheral in an incorrect state, as a load is observed before an
earlier store takes effect.

The erratum can occur only when all of the following apply:

  - A PE executes a Device-nGnR* store followed by a younger
    Device-nGnR* load.
  - The store is not a store-release.
  - The accesses target the same peripheral and do not overlap in bytes.
  - There is at most one intervening Device-nGnR* store in program
    order, and there are no intervening Device-nGnR* loads.
  - There is no DSB, and no DMB that orders loads, between the store and
    the load.
  - Specific micro-architectural and timing conditions occur.

Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain str*
to stlr* (Store-Release), which removes the "store is not a
store-release" condition for every device write the kernel issues.
Because writel() and writel_relaxed() are both built on __raw_writel()
in asm-generic/io.h, patching the raw variants covers both the
non-relaxed and relaxed APIs without touching the higher layers. Note
that writel()'s own barrier sits before the store, so it does not order
the store against a subsequent readl(); the store-release promotion is
what provides that ordering.

Like ARM64_ERRATUM_832075 on the load side, the change is gated on a new
ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only activated on
parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs continue to use
the plain str* sequence.

Note: stlr* only supports base-register addressing, so affected CPUs use
a base-register stlr* path. Unaffected CPUs keep the original
offset-addressed str* sequence introduced by commit d044d6ba6f02
("arm64: io: permit offset addressing").

The __const_memcpy_toio_aligned32() and __const_memcpy_toio_aligned64()
helpers are left unchanged. These helpers are intended for
write-combining mappings, which are Normal-NC on arm64. Replacing their
contiguous str* groups would defeat the write-combining behavior used to
improve store performance.

Co-developed-by: Vikram Sethi <vsethi@nvidia.com>
Signed-off-by: Vikram Sethi <vsethi@nvidia.com>
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
---
Changes since v2:
  - Reworked the raw MMIO write helpers so unaffected CPUs keep the
    existing offset-addressed STR sequence, while affected CPUs use the
    base-register STLR path.
  - Updated the commit message to match the code changes.
  - Rebased on top of the arm64 for-next/errata branch:
    https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/errata

Changes since v1:
  - Updated the commit message based on feedback from Vladimir Murzin.

 Documentation/arch/arm64/silicon-errata.rst |  2 ++
 arch/arm64/Kconfig                          | 23 ++++++++++++++++
 arch/arm64/include/asm/io.h                 | 30 +++++++++++++++++++++
 arch/arm64/kernel/cpu_errata.c              |  8 ++++++
 arch/arm64/tools/cpucaps                    |  1 +
 5 files changed, 64 insertions(+)

diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst
index ad09bbb10da80..fc45125dc2f80 100644
--- a/Documentation/arch/arm64/silicon-errata.rst
+++ b/Documentation/arch/arm64/silicon-errata.rst
@@ -298,6 +298,8 @@ stable kernels.
 +----------------+-----------------+-----------------+-----------------------------+
 | NVIDIA         | Carmel Core     | N/A             | NVIDIA_CARMEL_CNP_ERRATUM   |
 +----------------+-----------------+-----------------+-----------------------------+
+| NVIDIA         | Olympus core    | T410-OLY-1027   | NVIDIA_OLYMPUS_1027_ERRATUM |
++----------------+-----------------+-----------------+-----------------------------+
 | NVIDIA         | Olympus core    | T410-OLY-1029   | ARM64_ERRATUM_4118414       |
 +----------------+-----------------+-----------------+-----------------------------+
 | NVIDIA         | T241 GICv3/4.x  | T241-FABRIC-4   | N/A                         |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c65cef81be86a..d633eb70de1ac 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -564,6 +564,29 @@ config ARM64_ERRATUM_832075
 
 	  If unsure, say Y.
 
+config NVIDIA_OLYMPUS_1027_ERRATUM
+	bool "NVIDIA Olympus: device store/load ordering erratum"
+	default y
+	help
+	  This option adds an alternative code sequence to work around an
+	  NVIDIA Olympus core erratum where a Device-nGnR* store can be
+	  observed by a peripheral after a younger Device-nGnR* load to the
+	  same peripheral. This breaks the program order that drivers rely
+	  on for MMIO and can leave a device in an incorrect state.
+
+	  The workaround promotes the raw MMIO store helpers
+	  (__raw_writeb/w/l/q) to Store-Release (STLR), which restores the
+	  required ordering. Because writel() and writel_relaxed() are built
+	  on __raw_writel(), both are covered without changes to the higher
+	  layers.
+
+	  The fix is applied through the alternatives framework, so enabling
+	  this option does not by itself activate the workaround: it is
+	  patched in only when an affected CPU is detected, and is a no-op on
+	  unaffected CPUs.
+
+	  If unsure, say Y.
+
 config ARM64_ERRATUM_834220
 	bool "Cortex-A57: 834220: Stage 2 translation fault might be incorrectly reported in presence of a Stage 1 fault (rare)"
 	depends on KVM
diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
index 8cbd1e96fd50b..801223e754c90 100644
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -22,10 +22,22 @@
 /*
  * Generic IO read/write.  These perform native-endian accesses.
  */
+static __always_inline bool arm64_needs_device_store_release(void)
+{
+	return alternative_has_cap_unlikely(
+				ARM64_WORKAROUND_DEVICE_STORE_RELEASE);
+}
+
 #define __raw_writeb __raw_writeb
 static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
 {
 	volatile u8 __iomem *ptr = addr;
+
+	if (arm64_needs_device_store_release()) {
+		asm volatile("stlrb %w0, [%1]" : : "rZ" (val), "r" (addr));
+		return;
+	}
+
 	asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
 }
 
@@ -33,6 +45,12 @@ static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
 static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
 {
 	volatile u16 __iomem *ptr = addr;
+
+	if (arm64_needs_device_store_release()) {
+		asm volatile("stlrh %w0, [%1]" : : "rZ" (val), "r" (addr));
+		return;
+	}
+
 	asm volatile("strh %w0, %1" : : "rZ" (val), "Qo" (*ptr));
 }
 
@@ -40,6 +58,12 @@ static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
 static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
 {
 	volatile u32 __iomem *ptr = addr;
+
+	if (arm64_needs_device_store_release()) {
+		asm volatile("stlr %w0, [%1]" : : "rZ" (val), "r" (addr));
+		return;
+	}
+
 	asm volatile("str %w0, %1" : : "rZ" (val), "Qo" (*ptr));
 }
 
@@ -47,6 +71,12 @@ static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
 static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
 {
 	volatile u64 __iomem *ptr = addr;
+
+	if (arm64_needs_device_store_release()) {
+		asm volatile("stlr %x0, [%1]" : : "rZ" (val), "r" (addr));
+		return;
+	}
+
 	asm volatile("str %x0, %1" : : "rZ" (val), "Qo" (*ptr));
 }
 
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index d597896b0f7f3..b096d9acca578 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -838,6 +838,14 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
 		ERRATA_MIDR_ALL_VERSIONS(MIDR_NVIDIA_CARMEL),
 	},
 #endif
+#ifdef CONFIG_NVIDIA_OLYMPUS_1027_ERRATUM
+	{
+		/* NVIDIA Olympus core */
+		.desc = "NVIDIA Olympus device load/store ordering erratum",
+		.capability = ARM64_WORKAROUND_DEVICE_STORE_RELEASE,
+		ERRATA_MIDR_ALL_VERSIONS(MIDR_NVIDIA_OLYMPUS),
+	},
+#endif
 #ifdef CONFIG_ARM64_WORKAROUND_TRBE_OVERWRITE_FILL_MODE
 	{
 		/*
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index 811c2479e82d6..d367257bf7703 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -120,6 +120,7 @@ WORKAROUND_CAVIUM_TX2_219_PRFM
 WORKAROUND_CAVIUM_TX2_219_TVM
 WORKAROUND_CLEAN_CACHE
 WORKAROUND_DEVICE_LOAD_ACQUIRE
+WORKAROUND_DEVICE_STORE_RELEASE
 WORKAROUND_NVIDIA_CARMEL_CNP
 WORKAROUND_PMUV3_IMPDEF_TRAPS
 WORKAROUND_QCOM_FALKOR_E1003
-- 
2.43.0




* Re: [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
  2026-06-10 16:48 [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum Shanker Donthineni
@ 2026-06-11 13:34 ` Will Deacon
  2026-06-11 14:08   ` Shanker Donthineni
                     ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Will Deacon @ 2026-06-11 13:34 UTC
  To: Shanker Donthineni
  Cc: Catalin Marinas, Vladimir Murzin, Jason Gunthorpe,
	linux-arm-kernel, Mark Rutland, linux-kernel, linux-doc,
	Vikram Sethi, Jason Sequeira

On Wed, Jun 10, 2026 at 11:48:22AM -0500, Shanker Donthineni wrote:
> On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
> observed by a peripheral before an older, non-overlapping Device-nGnR*
> store to the same peripheral. This breaks the program-order guarantee
> that software expects for Device-nGnR* accesses and can leave a
> peripheral in an incorrect state, as a load is observed before an
> earlier store takes effect.
> 
> The erratum can occur only when all of the following apply:
> 
>   - A PE executes a Device-nGnR* store followed by a younger
>     Device-nGnR* load.
>   - The store is not a store-release.
>   - The accesses target the same peripheral and do not overlap in bytes.
>   - There is at most one intervening Device-nGnR* store in program
>     order, and there are no intervening Device-nGnR* loads.
>   - There is no DSB, and no DMB that orders loads, between the store and
>     the load.
>   - Specific micro-architectural and timing conditions occur.
> 
> Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain str*
> to stlr* (Store-Release), which removes the "store is not a
> store-release" condition for every device write the kernel issues.
> Because writel() and writel_relaxed() are both built on __raw_writel()
> in asm-generic/io.h, patching the raw variants covers both the
> non-relaxed and relaxed APIs without touching the higher layers. Note
> that writel()'s own barrier sits before the store, so it does not order
> the store against a subsequent readl(); the store-release promotion is
> what provides that ordering.
> 
> Like ARM64_ERRATUM_832075 on the load side, the change is gated on a new
> ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only activated on
> parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs continue to use
> the plain str* sequence.
> 
> Note: stlr* only supports base-register addressing, so affected CPUs use
> a base-register stlr* path. Unaffected CPUs keep the original
> offset-addressed str* sequence introduced by commit d044d6ba6f02
> ("arm64: io: permit offset addressing").
> 
> The __const_memcpy_toio_aligned32() and __const_memcpy_toio_aligned64()
> helpers are left unchanged. These helpers are intended for
> write-combining mappings, which are Normal-NC on arm64. Replacing their
> contiguous str* groups would defeat the write-combining behavior used to
> improve store performance.
> 
> Co-developed-by: Vikram Sethi <vsethi@nvidia.com>
> Signed-off-by: Vikram Sethi <vsethi@nvidia.com>
> Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
> ---
> Changes since v2:
>   - Reworked the raw MMIO write helpers so unaffected CPUs keep the
>     existing offset-addressed STR sequence, while affected CPUs use the
>     base-register STLR path.
>   - Updated the commit message to match the code changes.
>   - Rebased on top of the arm64 for-next/errata branch:
>     https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/errata
> 
> Changes since v1:
>   - Updated the commit message based on feedback from Vladimir Murzin.
> 
>  Documentation/arch/arm64/silicon-errata.rst |  2 ++
>  arch/arm64/Kconfig                          | 23 ++++++++++++++++
>  arch/arm64/include/asm/io.h                 | 30 +++++++++++++++++++++
>  arch/arm64/kernel/cpu_errata.c              |  8 ++++++
>  arch/arm64/tools/cpucaps                    |  1 +
>  5 files changed, 64 insertions(+)
> 
> diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst
> index ad09bbb10da80..fc45125dc2f80 100644
> --- a/Documentation/arch/arm64/silicon-errata.rst
> +++ b/Documentation/arch/arm64/silicon-errata.rst
> @@ -298,6 +298,8 @@ stable kernels.
>  +----------------+-----------------+-----------------+-----------------------------+
>  | NVIDIA         | Carmel Core     | N/A             | NVIDIA_CARMEL_CNP_ERRATUM   |
>  +----------------+-----------------+-----------------+-----------------------------+
> +| NVIDIA         | Olympus core    | T410-OLY-1027   | NVIDIA_OLYMPUS_1027_ERRATUM |
> ++----------------+-----------------+-----------------+-----------------------------+
>  | NVIDIA         | Olympus core    | T410-OLY-1029   | ARM64_ERRATUM_4118414       |
>  +----------------+-----------------+-----------------+-----------------------------+
>  | NVIDIA         | T241 GICv3/4.x  | T241-FABRIC-4   | N/A                         |
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index c65cef81be86a..d633eb70de1ac 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -564,6 +564,29 @@ config ARM64_ERRATUM_832075
>  
>  	  If unsure, say Y.
>  
> +config NVIDIA_OLYMPUS_1027_ERRATUM
> +	bool "NVIDIA Olympus: device store/load ordering erratum"
> +	default y
> +	help
> +	  This option adds an alternative code sequence to work around an
> +	  NVIDIA Olympus core erratum where a Device-nGnR* store can be
> +	  observed by a peripheral after a younger Device-nGnR* load to the
> +	  same peripheral. This breaks the program order that drivers rely
> +	  on for MMIO and can leave a device in an incorrect state.
> +
> +	  The workaround promotes the raw MMIO store helpers
> +	  (__raw_writeb/w/l/q) to Store-Release (STLR), which restores the
> +	  required ordering. Because writel() and writel_relaxed() are built
> +	  on __raw_writel(), both are covered without changes to the higher
> +	  layers.
> +
> +	  The fix is applied through the alternatives framework, so enabling
> +	  this option does not by itself activate the workaround: it is
> +	  patched in only when an affected CPU is detected, and is a no-op on
> +	  unaffected CPUs.
> +
> +	  If unsure, say Y.
> +
>  config ARM64_ERRATUM_834220
>  	bool "Cortex-A57: 834220: Stage 2 translation fault might be incorrectly reported in presence of a Stage 1 fault (rare)"
>  	depends on KVM
> diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
> index 8cbd1e96fd50b..801223e754c90 100644
> --- a/arch/arm64/include/asm/io.h
> +++ b/arch/arm64/include/asm/io.h
> @@ -22,10 +22,22 @@
>  /*
>   * Generic IO read/write.  These perform native-endian accesses.
>   */
> +static __always_inline bool arm64_needs_device_store_release(void)
> +{
> +	return alternative_has_cap_unlikely(
> +				ARM64_WORKAROUND_DEVICE_STORE_RELEASE);
> +}
> +
>  #define __raw_writeb __raw_writeb
>  static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
>  {
>  	volatile u8 __iomem *ptr = addr;
> +
> +	if (arm64_needs_device_store_release()) {
> +		asm volatile("stlrb %w0, [%1]" : : "rZ" (val), "r" (addr));
> +		return;
> +	}
> +
>  	asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
>  }

Use an 'else' clause instead of the early return? (similarly for the other
changes).

I still reckon you should do something with the memcpy-to-io routines.
A simple option could be to make dgh() a dmb on parts with the erratum?
That at least moves the barrier out of the loop.

Will



* Re: [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
  2026-06-11 13:34 ` Will Deacon
@ 2026-06-11 14:08   ` Shanker Donthineni
  2026-06-11 15:08   ` Vladimir Murzin
  2026-06-11 17:49   ` Jason Gunthorpe
  2 siblings, 0 replies; 6+ messages in thread
From: Shanker Donthineni @ 2026-06-11 14:08 UTC
  To: Will Deacon
  Cc: Catalin Marinas, Vladimir Murzin, Jason Gunthorpe,
	linux-arm-kernel, Mark Rutland, linux-kernel, linux-doc,
	Vikram Sethi, Jason Sequeira

Hi Will,

On 6/11/2026 8:34 AM, Will Deacon wrote:
> On Wed, Jun 10, 2026 at 11:48:22AM -0500, Shanker Donthineni wrote:
>> On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
>> observed by a peripheral before an older, non-overlapping Device-nGnR*
>> store to the same peripheral. This breaks the program-order guarantee
>> that software expects for Device-nGnR* accesses and can leave a
>> peripheral in an incorrect state, as a load is observed before an
>> earlier store takes effect.
>>
>> The erratum can occur only when all of the following apply:
>>
>>    - A PE executes a Device-nGnR* store followed by a younger
>>      Device-nGnR* load.
>>    - The store is not a store-release.
>>    - The accesses target the same peripheral and do not overlap in bytes.
>>    - There is at most one intervening Device-nGnR* store in program
>>      order, and there are no intervening Device-nGnR* loads.
>>    - There is no DSB, and no DMB that orders loads, between the store and
>>      the load.
>>    - Specific micro-architectural and timing conditions occur.
>>
>> Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain str*
>> to stlr* (Store-Release), which removes the "store is not a
>> store-release" condition for every device write the kernel issues.
>> Because writel() and writel_relaxed() are both built on __raw_writel()
>> in asm-generic/io.h, patching the raw variants covers both the
>> non-relaxed and relaxed APIs without touching the higher layers. Note
>> that writel()'s own barrier sits before the store, so it does not order
>> the store against a subsequent readl(); the store-release promotion is
>> what provides that ordering.
>>
>> Like ARM64_ERRATUM_832075 on the load side, the change is gated on a new
>> ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only activated on
>> parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs continue to use
>> the plain str* sequence.
>>
>> Note: stlr* only supports base-register addressing, so affected CPUs use
>> a base-register stlr* path. Unaffected CPUs keep the original
>> offset-addressed str* sequence introduced by commit d044d6ba6f02
>> ("arm64: io: permit offset addressing").
>>
>> The __const_memcpy_toio_aligned32() and __const_memcpy_toio_aligned64()
>> helpers are left unchanged. These helpers are intended for
>> write-combining mappings, which are Normal-NC on arm64. Replacing their
>> contiguous str* groups would defeat the write-combining behavior used to
>> improve store performance.
>>
>> Co-developed-by: Vikram Sethi <vsethi@nvidia.com>
>> Signed-off-by: Vikram Sethi <vsethi@nvidia.com>
>> Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
>> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
>> ---
>> Changes since v2:
>>    - Reworked the raw MMIO write helpers so unaffected CPUs keep the
>>      existing offset-addressed STR sequence, while affected CPUs use the
>>      base-register STLR path.
>>    - Updated the commit message to match the code changes.
>>    - Rebased on top of the arm64 for-next/errata branch:
>>      https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/errata
>>
>> Changes since v1:
>>    - Updated the commit message based on feedback from Vladimir Murzin.
>>
>>   Documentation/arch/arm64/silicon-errata.rst |  2 ++
>>   arch/arm64/Kconfig                          | 23 ++++++++++++++++
>>   arch/arm64/include/asm/io.h                 | 30 +++++++++++++++++++++
>>   arch/arm64/kernel/cpu_errata.c              |  8 ++++++
>>   arch/arm64/tools/cpucaps                    |  1 +
>>   5 files changed, 64 insertions(+)
>>
>> diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst
>> index ad09bbb10da80..fc45125dc2f80 100644
>> --- a/Documentation/arch/arm64/silicon-errata.rst
>> +++ b/Documentation/arch/arm64/silicon-errata.rst
>> @@ -298,6 +298,8 @@ stable kernels.
>>   +----------------+-----------------+-----------------+-----------------------------+
>>   | NVIDIA         | Carmel Core     | N/A             | NVIDIA_CARMEL_CNP_ERRATUM   |
>>   +----------------+-----------------+-----------------+-----------------------------+
>> +| NVIDIA         | Olympus core    | T410-OLY-1027   | NVIDIA_OLYMPUS_1027_ERRATUM |
>> ++----------------+-----------------+-----------------+-----------------------------+
>>   | NVIDIA         | Olympus core    | T410-OLY-1029   | ARM64_ERRATUM_4118414       |
>>   +----------------+-----------------+-----------------+-----------------------------+
>>   | NVIDIA         | T241 GICv3/4.x  | T241-FABRIC-4   | N/A                         |
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index c65cef81be86a..d633eb70de1ac 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -564,6 +564,29 @@ config ARM64_ERRATUM_832075
>>
>>          If unsure, say Y.
>>
>> +config NVIDIA_OLYMPUS_1027_ERRATUM
>> +     bool "NVIDIA Olympus: device store/load ordering erratum"
>> +     default y
>> +     help
>> +       This option adds an alternative code sequence to work around an
>> +       NVIDIA Olympus core erratum where a Device-nGnR* store can be
>> +       observed by a peripheral after a younger Device-nGnR* load to the
>> +       same peripheral. This breaks the program order that drivers rely
>> +       on for MMIO and can leave a device in an incorrect state.
>> +
>> +       The workaround promotes the raw MMIO store helpers
>> +       (__raw_writeb/w/l/q) to Store-Release (STLR), which restores the
>> +       required ordering. Because writel() and writel_relaxed() are built
>> +       on __raw_writel(), both are covered without changes to the higher
>> +       layers.
>> +
>> +       The fix is applied through the alternatives framework, so enabling
>> +       this option does not by itself activate the workaround: it is
>> +       patched in only when an affected CPU is detected, and is a no-op on
>> +       unaffected CPUs.
>> +
>> +       If unsure, say Y.
>> +
>>   config ARM64_ERRATUM_834220
>>        bool "Cortex-A57: 834220: Stage 2 translation fault might be incorrectly reported in presence of a Stage 1 fault (rare)"
>>        depends on KVM
>> diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
>> index 8cbd1e96fd50b..801223e754c90 100644
>> --- a/arch/arm64/include/asm/io.h
>> +++ b/arch/arm64/include/asm/io.h
>> @@ -22,10 +22,22 @@
>>   /*
>>    * Generic IO read/write.  These perform native-endian accesses.
>>    */
>> +static __always_inline bool arm64_needs_device_store_release(void)
>> +{
>> +     return alternative_has_cap_unlikely(
>> +                             ARM64_WORKAROUND_DEVICE_STORE_RELEASE);
>> +}
>> +
>>   #define __raw_writeb __raw_writeb
>>   static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
>>   {
>>        volatile u8 __iomem *ptr = addr;
>> +
>> +     if (arm64_needs_device_store_release()) {
>> +             asm volatile("stlrb %w0, [%1]" : : "rZ" (val), "r" (addr));
>> +             return;
>> +     }
>> +
>>        asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
>>   }
> Use an 'else' clause instead of the early return? (similarly for the other
> changes).

I agree. I’ll rework the raw write helpers to use an explicit if/else
form instead of returning early from the STLR path.
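
For illustration only, the if/else shape could look like the user-space
sketch below. It is not the v4 code: model_needs_store_release,
model_raw_writel() and the two counters are hypothetical names, the flag
stands in for alternative_has_cap_unlikely(), and the GCC/Clang
__atomic_store_n() builtin stands in for the inline asm (on arm64 a release
store is emitted as STLR and a relaxed store as a plain STR).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for
 * alternative_has_cap_unlikely(ARM64_WORKAROUND_DEVICE_STORE_RELEASE).
 * In the kernel the branch is patched at boot; here it is a plain flag.
 */
static bool model_needs_store_release;

/* Hypothetical counters so the path taken can be observed. */
static unsigned int model_stlr_count;
static unsigned int model_str_count;

static inline void model_raw_writel(uint32_t val, volatile uint32_t *addr)
{
	if (model_needs_store_release) {
		/* Affected parts: store-release (STLR on arm64). */
		model_stlr_count++;
		__atomic_store_n(addr, val, __ATOMIC_RELEASE);
	} else {
		/* Unaffected parts: plain store (STR on arm64). */
		model_str_count++;
		__atomic_store_n(addr, val, __ATOMIC_RELAXED);
	}
}
```

Both arms fall through to the same closing brace, so the helper has a single
exit and no early return.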

>
> I still reckon you should do something with the memcpy-to-io routines.
> A simple option could be to make dgh() a dmb on parts with the erratum?
> That at least moves the barrier out of the loop.

For the memcpy-to-IO routines, would it be acceptable to address the erratum by
patching dgh() to a DMB OSH on affected CPUs, as shown below? I’ll also sync
with the Olympus CPU hardware team to confirm this approach for the v4 patch.

#define dgh()		asm volatile(ALTERNATIVE("hint #6", "dmb osh",	\
					ARM64_WORKAROUND_DEVICE_STORE_RELEASE) \
				     : : : "memory")

This keeps the existing memcpy-to-IO store sequences unchanged while placing the
ordering barrier outside the copy loop as you suggested.
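
As a rough user-space model of where the barrier would land, assuming the
copy routine ends with a single dgh() call: model_has_erratum, model_dgh(),
model_iowrite32_copy() and model_barrier_count are hypothetical names, and
__atomic_thread_fence() is only a stand-in for the patched instruction (on
arm64 it emits DMB ISH, not the DMB OSH proposed above).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the ARM64_WORKAROUND_DEVICE_STORE_RELEASE cap. */
static bool model_has_erratum;

/* Hypothetical counter: how many barriers were actually issued. */
static unsigned int model_barrier_count;

/* Models dgh(): nothing extra on unaffected parts, a barrier on affected ones. */
static inline void model_dgh(void)
{
	if (model_has_erratum) {
		model_barrier_count++;
		/* Stand-in only: DMB ISH on arm64, whereas the proposal uses DMB OSH. */
		__atomic_thread_fence(__ATOMIC_SEQ_CST);
	}
}

static void model_iowrite32_copy(volatile uint32_t *to, const uint32_t *from,
				 size_t count)
{
	size_t i;

	/* The stores stay plain and contiguous, as in the unchanged helpers. */
	for (i = 0; i < count; i++)
		to[i] = from[i];

	/* One barrier per copy, issued after the loop rather than per store. */
	model_dgh();
}
```

In this model a 16-word copy issues one barrier on an affected part and none
otherwise, independent of the copy length.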

-Shanker




* Re: [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
  2026-06-11 13:34 ` Will Deacon
  2026-06-11 14:08   ` Shanker Donthineni
@ 2026-06-11 15:08   ` Vladimir Murzin
  2026-06-11 16:00     ` Shanker Donthineni
  2026-06-11 17:49   ` Jason Gunthorpe
  2 siblings, 1 reply; 6+ messages in thread
From: Vladimir Murzin @ 2026-06-11 15:08 UTC
  To: Will Deacon, Shanker Donthineni
  Cc: Catalin Marinas, Jason Gunthorpe, linux-arm-kernel, Mark Rutland,
	linux-kernel, linux-doc, Vikram Sethi, Jason Sequeira

Hi,

On 6/11/26 14:34, Will Deacon wrote:
> On Wed, Jun 10, 2026 at 11:48:22AM -0500, Shanker Donthineni wrote:
>> On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
>> observed by a peripheral before an older, non-overlapping Device-nGnR*
>> store to the same peripheral. This breaks the program-order guarantee
>> that software expects for Device-nGnR* accesses and can leave a
>> peripheral in an incorrect state, as a load is observed before an
>> earlier store takes effect.
>>
>> The erratum can occur only when all of the following apply:
>>
>>   - A PE executes a Device-nGnR* store followed by a younger
>>     Device-nGnR* load.
>>   - The store is not a store-release.
>>   - The accesses target the same peripheral and do not overlap in bytes.
>>   - There is at most one intervening Device-nGnR* store in program
>>     order, and there are no intervening Device-nGnR* loads.
>>   - There is no DSB, and no DMB that orders loads, between the store and
>>     the load.
>>   - Specific micro-architectural and timing conditions occur.
>>
>> Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain str*
>> to stlr* (Store-Release), which removes the "store is not a
>> store-release" condition for every device write the kernel issues.
>> Because writel() and writel_relaxed() are both built on __raw_writel()
>> in asm-generic/io.h, patching the raw variants covers both the
>> non-relaxed and relaxed APIs without touching the higher layers. Note
>> that writel()'s own barrier sits before the store, so it does not order
>> the store against a subsequent readl(); the store-release promotion is
>> what provides that ordering.
>>
>> Like ARM64_ERRATUM_832075 on the load side, the change is gated on a new
>> ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only activated on
>> parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs continue to use
>> the plain str* sequence.
>>
>> Note: stlr* only supports base-register addressing, so affected CPUs use
>> a base-register stlr* path. Unaffected CPUs keep the original
>> offset-addressed str* sequence introduced by commit d044d6ba6f02
>> ("arm64: io: permit offset addressing").
>>
>> The __const_memcpy_toio_aligned32() and __const_memcpy_toio_aligned64()
>> helpers are left unchanged. These helpers are intended for
>> write-combining mappings, which are Normal-NC on arm64. Replacing their
>> contiguous str* groups would defeat the write-combining behavior used to
>> improve store performance.
>>
>> Co-developed-by: Vikram Sethi <vsethi@nvidia.com>
>> Signed-off-by: Vikram Sethi <vsethi@nvidia.com>
>> Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
>> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
>> ---
>> Changes since v2:
>>   - Reworked the raw MMIO write helpers so unaffected CPUs keep the
>>     existing offset-addressed STR sequence, while affected CPUs use the
>>     base-register STLR path.
>>   - Updated the commit message to match the code changes.
>>   - Rebased on top of the arm64 for-next/errata branch:
>>     https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/errata
>>
>> Changes since v1:
>>   - Updated the commit message based on feedback from Vladimir Murzin.
>>
>>  Documentation/arch/arm64/silicon-errata.rst |  2 ++
>>  arch/arm64/Kconfig                          | 23 ++++++++++++++++
>>  arch/arm64/include/asm/io.h                 | 30 +++++++++++++++++++++
>>  arch/arm64/kernel/cpu_errata.c              |  8 ++++++
>>  arch/arm64/tools/cpucaps                    |  1 +
>>  5 files changed, 64 insertions(+)
>>
>> diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst
>> index ad09bbb10da80..fc45125dc2f80 100644
>> --- a/Documentation/arch/arm64/silicon-errata.rst
>> +++ b/Documentation/arch/arm64/silicon-errata.rst
>> @@ -298,6 +298,8 @@ stable kernels.
>>  +----------------+-----------------+-----------------+-----------------------------+
>>  | NVIDIA         | Carmel Core     | N/A             | NVIDIA_CARMEL_CNP_ERRATUM   |
>>  +----------------+-----------------+-----------------+-----------------------------+
>> +| NVIDIA         | Olympus core    | T410-OLY-1027   | NVIDIA_OLYMPUS_1027_ERRATUM |
>> ++----------------+-----------------+-----------------+-----------------------------+
>>  | NVIDIA         | Olympus core    | T410-OLY-1029   | ARM64_ERRATUM_4118414       |
>>  +----------------+-----------------+-----------------+-----------------------------+
>>  | NVIDIA         | T241 GICv3/4.x  | T241-FABRIC-4   | N/A                         |
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index c65cef81be86a..d633eb70de1ac 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -564,6 +564,29 @@ config ARM64_ERRATUM_832075
>>  
>>  	  If unsure, say Y.
>>  
>> +config NVIDIA_OLYMPUS_1027_ERRATUM
>> +	bool "NVIDIA Olympus: device store/load ordering erratum"
>> +	default y
>> +	help
>> +	  This option adds an alternative code sequence to work around an
>> +	  NVIDIA Olympus core erratum where a Device-nGnR* store can be
>> +	  observed by a peripheral after a younger Device-nGnR* load to the
>> +	  same peripheral. This breaks the program order that drivers rely
>> +	  on for MMIO and can leave a device in an incorrect state.
>> +
>> +	  The workaround promotes the raw MMIO store helpers
>> +	  (__raw_writeb/w/l/q) to Store-Release (STLR), which restores the
>> +	  required ordering. Because writel() and writel_relaxed() are built
>> +	  on __raw_writel(), both are covered without changes to the higher
>> +	  layers.
>> +
>> +	  The fix is applied through the alternatives framework, so enabling
>> +	  this option does not by itself activate the workaround: it is
>> +	  patched in only when an affected CPU is detected, and is a no-op on
>> +	  unaffected CPUs.
>> +
>> +	  If unsure, say Y.
>> +
>>  config ARM64_ERRATUM_834220
>>  	bool "Cortex-A57: 834220: Stage 2 translation fault might be incorrectly reported in presence of a Stage 1 fault (rare)"
>>  	depends on KVM
>> diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
>> index 8cbd1e96fd50b..801223e754c90 100644
>> --- a/arch/arm64/include/asm/io.h
>> +++ b/arch/arm64/include/asm/io.h
>> @@ -22,10 +22,22 @@
>>  /*
>>   * Generic IO read/write.  These perform native-endian accesses.
>>   */
>> +static __always_inline bool arm64_needs_device_store_release(void)
>> +{
>> +	return alternative_has_cap_unlikely(
>> +				ARM64_WORKAROUND_DEVICE_STORE_RELEASE);
>> +}
>> +
>>  #define __raw_writeb __raw_writeb
>>  static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
>>  {
>>  	volatile u8 __iomem *ptr = addr;
>> +
>> +	if (arm64_needs_device_store_release()) {
>> +		asm volatile("stlrb %w0, [%1]" : : "rZ" (val), "r" (addr));
>> +		return;
>> +	}
>> +
>>  	asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
>>  }
> Use an 'else' clause instead of the early return? (similarly for the other
> changes).

Perhaps I'm missing something, but it is not clear to me why all that
complexity is required.

IIUC, the benefits of d044d6ba6f02 ("arm64: io: permit offset
addressing") come from better code generation, so we:
 - save code
 - open up an opportunity for write-combining

d044d6ba6f02 ("arm64: io: permit offset addressing") comes with a simple
benchmark to measure the effect of code generation:

| void writeq_zero_8_times(void *ptr)
| {
|        writeq_relaxed(0, ptr + 8 * 0);
|        writeq_relaxed(0, ptr + 8 * 1);
|        writeq_relaxed(0, ptr + 8 * 2);
|        writeq_relaxed(0, ptr + 8 * 3);
|        writeq_relaxed(0, ptr + 8 * 4);
|        writeq_relaxed(0, ptr + 8 * 5);
|        writeq_relaxed(0, ptr + 8 * 6);
|        writeq_relaxed(0, ptr + 8 * 7);
| }

which compiles to

| <writeq_zero_8_times>:
|        str     xzr, [x0]
|        str     xzr, [x0, #8]
|        str     xzr, [x0, #16]
|        str     xzr, [x0, #24]
|        str     xzr, [x0, #32]
|        str     xzr, [x0, #40]
|        str     xzr, [x0, #48]
|        str     xzr, [x0, #56]


v1/v2 compiles to

| <writeq_zero_8_times>:
|        str     xzr, [x0]
|        add     x1, x0, #0x8
|        str     xzr, [x1]
|        add     x1, x0, #0x10
|        str     xzr, [x1]
|        add     x1, x0, #0x18
|        str     xzr, [x1]
|        add     x1, x0, #0x20
|        str     xzr, [x1]
|        add     x1, x0, #0x28
|        str     xzr, [x1]
|        add     x1, x0, #0x30
|        str     xzr, [x1]
|        add     x0, x0, #0x38
|        str     xzr, [x0]

where the alternatives swap str with stlr. In other words, we are
rolling back to the pre-d044d6ba6f02 implementation.

v3 compiles to:

| <writeq_zero_8_times>:
|        nop
|        str     xzr, [x0]
|        add     x1, x0, #0x8
|        nop
|        str     xzr, [x1]
|        add     x1, x0, #0x10
|        nop
|        str     xzr, [x1]
|        add     x1, x0, #0x18
|        nop
|        str     xzr, [x1]
|        add     x1, x0, #0x20
|        nop
|        str     xzr, [x1]
|        add     x1, x0, #0x28
|        nop
|        str     xzr, [x1]
|        add     x1, x0, #0x30
|        nop
|        str     xzr, [x1]
|        add     x0, x0, #0x38
|        nop
|        str     xzr, [x0]
|        ret

where the static branch swaps each nop with a branch to the stlr path,
which then branches back to the following add.

So it looks to me that we're losing an opportunity for write
combining, but in terms of code size, v1/v2 seems to be the lesser of
two evils.
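
To put rough numbers on the listings above (a back-of-the-envelope
sketch, not a measurement): for n consecutive writes the offset-addressed
form needs n instructions, the v1/v2 form needs n stores plus n - 1 adds,
and the v3 form adds a further n nops. The helper names below are made up
for illustration. The trailing ret is not counted, and neither are the
out-of-line stlr blocks that the v3 nops get patched to branch to.

```c
#include <stddef.h>

/* Offset-addressed form: one "str xzr, [x0, #off]" per write. */
static size_t insns_offset_str(size_t n)
{
	return n;
}

/* v1/v2 form: one str per write, plus an add for every write after the first. */
static size_t insns_base_reg(size_t n)
{
	return n ? 2 * n - 1 : 0;
}

/* v3 form: the v1/v2 sequence plus one nop (static-branch site) per write. */
static size_t insns_static_branch(size_t n)
{
	return n ? 3 * n - 1 : 0;
}
```

For the 8-write benchmark this gives 8, 15 and 23 instructions
respectively, which matches the three listings.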

Cheers
Vladimir

> 
> I still reckon you should do something with the memcpy-to-io routines.
> A simple option could be to make dgh() a dmb on parts with the erratum?
> That at least moves the barrier out of the loop.
> 
> Will
> 




* Re: [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
  2026-06-11 15:08   ` Vladimir Murzin
@ 2026-06-11 16:00     ` Shanker Donthineni
  0 siblings, 0 replies; 6+ messages in thread
From: Shanker Donthineni @ 2026-06-11 16:00 UTC (permalink / raw)
  To: Vladimir Murzin, Will Deacon
  Cc: Catalin Marinas, Jason Gunthorpe, linux-arm-kernel, Mark Rutland,
	linux-kernel, linux-doc, Vikram Sethi, Jason Sequeira

Hi Vladimir,

On 6/11/2026 10:08 AM, Vladimir Murzin wrote:
> Hi,
>
> On 6/11/26 14:34, Will Deacon wrote:
>> On Wed, Jun 10, 2026 at 11:48:22AM -0500, Shanker Donthineni wrote:
>>> On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
>>> observed by a peripheral before an older, non-overlapping Device-nGnR*
>>> store to the same peripheral. This breaks the program-order guarantee
>>> that software expects for Device-nGnR* accesses and can leave a
>>> peripheral in an incorrect state, as a load is observed before an
>>> earlier store takes effect.
>>>
>>> The erratum can occur only when all of the following apply:
>>>
>>>    - A PE executes a Device-nGnR* store followed by a younger
>>>      Device-nGnR* load.
>>>    - The store is not a store-release.
>>>    - The accesses target the same peripheral and do not overlap in bytes.
>>>    - There is at most one intervening Device-nGnR* store in program
>>>      order, and there are no intervening Device-nGnR* loads.
>>>    - There is no DSB, and no DMB that orders loads, between the store and
>>>      the load.
>>>    - Specific micro-architectural and timing conditions occur.
>>>
>>> Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain str*
>>> to stlr* (Store-Release), which removes the "store is not a
>>> store-release" condition for every device write the kernel issues.
>>> Because writel() and writel_relaxed() are both built on __raw_writel()
>>> in asm-generic/io.h, patching the raw variants covers both the
>>> non-relaxed and relaxed APIs without touching the higher layers. Note
>>> that writel()'s own barrier sits before the store, so it does not order
>>> the store against a subsequent readl(); the store-release promotion is
>>> what provides that ordering.
>>>
>>> Like ARM64_ERRATUM_832075 on the load side, the change is gated on a new
>>> ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only activated on
>>> parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs continue to use
>>> the plain str* sequence.
>>>
>>> Note: stlr* only supports base-register addressing, so affected CPUs use
>>> a base-register stlr* path. Unaffected CPUs keep the original
>>> offset-addressed str* sequence introduced by commit d044d6ba6f02
>>> ("arm64: io: permit offset addressing").
>>>
>>> The __const_memcpy_toio_aligned32() and __const_memcpy_toio_aligned64()
>>> helpers are left unchanged. These helpers are intended for
>>> write-combining mappings, which are Normal-NC on arm64. Replacing their
>>> contiguous str* groups would defeat the write-combining behavior used to
>>> improve store performance.
>>>
>>> Co-developed-by: Vikram Sethi <vsethi@nvidia.com>
>>> Signed-off-by: Vikram Sethi <vsethi@nvidia.com>
>>> Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
>>> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
>>> ---
>>> Changes since v2:
>>>    - Reworked the raw MMIO write helpers so unaffected CPUs keep the
>>>      existing offset-addressed STR sequence, while affected CPUs use the
>>>      base-register STLR path.
>>>    - Updated the commit message to match the code changes.
>>>    - Rebased on top of the arm64 for-next/errata branch:
>>>      https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/errata
>>>
>>> Changes since v1:
>>>    - Updated the commit message based on feedback from Vladimir Murzin.
>>>
>>>   Documentation/arch/arm64/silicon-errata.rst |  2 ++
>>>   arch/arm64/Kconfig                          | 23 ++++++++++++++++
>>>   arch/arm64/include/asm/io.h                 | 30 +++++++++++++++++++++
>>>   arch/arm64/kernel/cpu_errata.c              |  8 ++++++
>>>   arch/arm64/tools/cpucaps                    |  1 +
>>>   5 files changed, 64 insertions(+)
>>>
>>> diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst
>>> index ad09bbb10da80..fc45125dc2f80 100644
>>> --- a/Documentation/arch/arm64/silicon-errata.rst
>>> +++ b/Documentation/arch/arm64/silicon-errata.rst
>>> @@ -298,6 +298,8 @@ stable kernels.
>>>   +----------------+-----------------+-----------------+-----------------------------+
>>>   | NVIDIA         | Carmel Core     | N/A             | NVIDIA_CARMEL_CNP_ERRATUM   |
>>>   +----------------+-----------------+-----------------+-----------------------------+
>>> +| NVIDIA         | Olympus core    | T410-OLY-1027   | NVIDIA_OLYMPUS_1027_ERRATUM |
>>> ++----------------+-----------------+-----------------+-----------------------------+
>>>   | NVIDIA         | Olympus core    | T410-OLY-1029   | ARM64_ERRATUM_4118414       |
>>>   +----------------+-----------------+-----------------+-----------------------------+
>>>   | NVIDIA         | T241 GICv3/4.x  | T241-FABRIC-4   | N/A                         |
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index c65cef81be86a..d633eb70de1ac 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -564,6 +564,29 @@ config ARM64_ERRATUM_832075
>>>
>>>         If unsure, say Y.
>>>
>>> +config NVIDIA_OLYMPUS_1027_ERRATUM
>>> +    bool "NVIDIA Olympus: device store/load ordering erratum"
>>> +    default y
>>> +    help
>>> +      This option adds an alternative code sequence to work around an
>>> +      NVIDIA Olympus core erratum where a Device-nGnR* store can be
>>> +      observed by a peripheral after a younger Device-nGnR* load to the
>>> +      same peripheral. This breaks the program order that drivers rely
>>> +      on for MMIO and can leave a device in an incorrect state.
>>> +
>>> +      The workaround promotes the raw MMIO store helpers
>>> +      (__raw_writeb/w/l/q) to Store-Release (STLR), which restores the
>>> +      required ordering. Because writel() and writel_relaxed() are built
>>> +      on __raw_writel(), both are covered without changes to the higher
>>> +      layers.
>>> +
>>> +      The fix is applied through the alternatives framework, so enabling
>>> +      this option does not by itself activate the workaround: it is
>>> +      patched in only when an affected CPU is detected, and is a no-op on
>>> +      unaffected CPUs.
>>> +
>>> +      If unsure, say Y.
>>> +
>>>   config ARM64_ERRATUM_834220
>>>       bool "Cortex-A57: 834220: Stage 2 translation fault might be incorrectly reported in presence of a Stage 1 fault (rare)"
>>>       depends on KVM
>>> diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
>>> index 8cbd1e96fd50b..801223e754c90 100644
>>> --- a/arch/arm64/include/asm/io.h
>>> +++ b/arch/arm64/include/asm/io.h
>>> @@ -22,10 +22,22 @@
>>>   /*
>>>    * Generic IO read/write.  These perform native-endian accesses.
>>>    */
>>> +static __always_inline bool arm64_needs_device_store_release(void)
>>> +{
>>> +    return alternative_has_cap_unlikely(
>>> +                            ARM64_WORKAROUND_DEVICE_STORE_RELEASE);
>>> +}
>>> +
>>>   #define __raw_writeb __raw_writeb
>>>   static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
>>>   {
>>>       volatile u8 __iomem *ptr = addr;
>>> +
>>> +    if (arm64_needs_device_store_release()) {
>>> +            asm volatile("stlrb %w0, [%1]" : : "rZ" (val), "r" (addr));
>>> +            return;
>>> +    }
>>> +
>>>       asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
>>>   }
>> Use an 'else' clause instead of the early return? (similarly for the other
>> changes).
> Perhaps I'm missing something, but it is not clear to me why all that
> complexity is required.
>
> IIUC, the benefits of d044d6ba6f02 ("arm64: io: permit offset
> addressing") come from better code generation, so we:
>   - save code
>   - open up an opportunity for write-combining
>
> d044d6ba6f02 ("arm64: io: permit offset addressing") comes with a simple
> benchmark to measure the effect of code generation:
>
> | void writeq_zero_8_times(void *ptr)
> | {
> |        writeq_relaxed(0, ptr + 8 * 0);
> |        writeq_relaxed(0, ptr + 8 * 1);
> |        writeq_relaxed(0, ptr + 8 * 2);
> |        writeq_relaxed(0, ptr + 8 * 3);
> |        writeq_relaxed(0, ptr + 8 * 4);
> |        writeq_relaxed(0, ptr + 8 * 5);
> |        writeq_relaxed(0, ptr + 8 * 6);
> |        writeq_relaxed(0, ptr + 8 * 7);
> | }
>
> which compiles to
>
> | <writeq_zero_8_times>:
> |        str     xzr, [x0]
> |        str     xzr, [x0, #8]
> |        str     xzr, [x0, #16]
> |        str     xzr, [x0, #24]
> |        str     xzr, [x0, #32]
> |        str     xzr, [x0, #40]
> |        str     xzr, [x0, #48]
> |        str     xzr, [x0, #56]
>
>
> v1/v2 compiles to
>
> | <writeq_zero_8_times>:
> |        str     xzr, [x0]
> |        add     x1, x0, #0x8
> |        str     xzr, [x1]
> |        add     x1, x0, #0x10
> |        str     xzr, [x1]
> |        add     x1, x0, #0x18
> |        str     xzr, [x1]
> |        add     x1, x0, #0x20
> |        str     xzr, [x1]
> |        add     x1, x0, #0x28
> |        str     xzr, [x1]
> |        add     x1, x0, #0x30
> |        str     xzr, [x1]
> |        add     x0, x0, #0x38
> |        str     xzr, [x0]
>
> where the alternatives swap str with stlr. In other words, we are
> rolling back to the pre-d044d6ba6f02 implementation.
>
> v3 compiles to:
>
> | <writeq_zero_8_times>:
> |        nop
> |        str     xzr, [x0]
> |        add     x1, x0, #0x8
> |        nop
> |        str     xzr, [x1]
> |        add     x1, x0, #0x10
> |        nop
> |        str     xzr, [x1]
> |        add     x1, x0, #0x18
> |        nop
> |        str     xzr, [x1]
> |        add     x1, x0, #0x20
> |        nop
> |        str     xzr, [x1]
> |        add     x1, x0, #0x28
> |        nop
> |        str     xzr, [x1]
> |        add     x1, x0, #0x30
> |        nop
> |        str     xzr, [x1]
> |        add     x0, x0, #0x38
> |        nop
> |        str     xzr, [x0]
> |        ret
>
> where the static branch swaps each nop with a branch to the stlr path,
> which then branches back to the following add.
>
> So it looks to me that we're losing an opportunity for write
> combining, but in terms of code size, v1/v2 seems to be the lesser of
> two evils.

Thanks, that makes sense.

My intent with the v3 change was to keep the offset-addressed STR sequence on
unaffected CPUs and use the base-register STLR sequence only on affected CPUs.
However, as you point out, because STLR only supports base-register addressing,
the affected path still forces the address to be materialized in a register, and
the alternative_has_cap_unlikely() check adds another instruction at each write
site. So the generated code no longer preserves the benefit from d044d6ba6f02 in
practice.

Given that, I agree the extra complexity is not justified. I’ll simplify the raw
MMIO write helpers back to the direct ALTERNATIVE() form from v1/v2, where both
the STR and STLR paths use base-register addressing. That is still a regression
from the offset-addressed STR sequence on unaffected CPUs, but it avoids the
additional static-branch/nop overhead and is the smaller of the two options.
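
As a rough userspace illustration of what "promote the store to a
store-release" means (a sketch only, not the kernel helper): the model
below uses C11 atomics, where a release store to normal memory is
typically lowered to stlr by GCC and Clang on AArch64 and a relaxed store
to a plain str. The names model_needs_store_release, model_raw_writeq()
and model_raw_readq() are hypothetical. The runtime flag only stands in
for the cpucap; with ALTERNATIVE() the kernel patches the instruction in
place at boot, so no branch exists at the write site, and the real
helpers operate on __iomem pointers with inline asm rather than on
_Atomic objects.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the ARM64_WORKAROUND_DEVICE_STORE_RELEASE cpucap. */
static bool model_needs_store_release;

/*
 * Model of a raw 64-bit write helper. Both arms take only a base
 * address, mirroring the base-register addressing that stlr requires.
 */
static inline void model_raw_writeq(uint64_t val, _Atomic uint64_t *addr)
{
	if (model_needs_store_release)
		/* Affected CPU: store-release (stlr on AArch64). */
		atomic_store_explicit(addr, val, memory_order_release);
	else
		/* Unaffected CPU: plain store (str on AArch64). */
		atomic_store_explicit(addr, val, memory_order_relaxed);
}

/* Model of a raw 64-bit read helper, used here only to read the value back. */
static inline uint64_t model_raw_readq(_Atomic uint64_t *addr)
{
	return atomic_load_explicit(addr, memory_order_relaxed);
}
```

Both arms store the same value; only the ordering attached to the store
differs, which is why the workaround can be applied below writel() and
writel_relaxed() without touching their callers.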

-Shanker




* Re: [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
  2026-06-11 13:34 ` Will Deacon
  2026-06-11 14:08   ` Shanker Donthineni
  2026-06-11 15:08   ` Vladimir Murzin
@ 2026-06-11 17:49   ` Jason Gunthorpe
  2 siblings, 0 replies; 6+ messages in thread
From: Jason Gunthorpe @ 2026-06-11 17:49 UTC (permalink / raw)
  To: Will Deacon
  Cc: Shanker Donthineni, Catalin Marinas, Vladimir Murzin,
	linux-arm-kernel, Mark Rutland, linux-kernel, linux-doc,
	Vikram Sethi, Jason Sequeira

On Thu, Jun 11, 2026 at 02:34:14PM +0100, Will Deacon wrote:
> I still reckon you should do something with the memcpy-to-io routines.
> A simple option could be to make dgh() a dmb on parts with the erratum?
> That at least moves the barrier out of the loop.

AFAIK only callers that know they are using WC memory should be
calling dgh(), and in that case we know it is Normal-NC and we don't
need a different barrier.

Other random users calling memcpy_to_io functions on real IO don't
have to do dgh(), and AFAIK it doesn't do anything on the Device
memory types?

Jason



end of thread, other threads:[~2026-06-11 17:50 UTC | newest]

Thread overview: 6+ messages
2026-06-10 16:48 [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum Shanker Donthineni
2026-06-11 13:34 ` Will Deacon
2026-06-11 14:08   ` Shanker Donthineni
2026-06-11 15:08   ` Vladimir Murzin
2026-06-11 16:00     ` Shanker Donthineni
2026-06-11 17:49   ` Jason Gunthorpe
