From: Shanker Donthineni <sdonthineni@nvidia.com>
To: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
linux-arm-kernel@lists.infradead.org,
Vladimir Murzin <vladimir.murzin@arm.com>,
Mark Rutland <mark.rutland@arm.com>,
linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
Vikram Sethi <vsethi@nvidia.com>,
Jason Sequeira <jsequeira@nvidia.com>,
jgg@nvidia.com
Subject: Re: [PATCH v2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
Date: Wed, 10 Jun 2026 07:53:15 -0500
Message-ID: <d12cfd71-3917-4834-9912-404baeca213c@nvidia.com>
In-Reply-To: <ailKYTOX23EMnJsK@willie-the-truck>
Hi Will,
On 6/10/2026 6:28 AM, Will Deacon wrote:
> [+Jason G]
>
> On Fri, Jun 05, 2026 at 09:45:51AM -0500, Shanker Donthineni wrote:
>> On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
>> observed by a peripheral before an older, non-overlapping Device-nGnR*
>> store to the same peripheral. This breaks the program-order guarantee
>> that software expects for Device-nGnR* accesses and can leave a
>> peripheral in an incorrect state, as a load is observed before an
>> earlier store takes effect.
>>
>> The erratum can occur only when all of the following apply:
>>
>> - A PE executes a Device-nGnR* store followed by a younger
>> Device-nGnR* load.
>> - The store is not a store-release.
>> - The accesses target the same peripheral and do not overlap in bytes.
>> - There is at most one intervening Device-nGnR* store in program
>> order, and there are no intervening Device-nGnR* loads.
>> - There is no DSB, and no DMB that orders loads, between the store and
>> the load.
>> - Specific micro-architectural and timing conditions occur.
>>
>> Two ways to restore ordering: insert a barrier (any DSB, or a DMB that
>> orders loads) between the store and the load, or make the store a
>> store-release. A load-acquire on the load side would not help, because
>> acquire semantics do not prevent a load from being observed ahead of an
>> older store; only the store side (release or a barrier) closes the
>> window.
> I think you can drop the paragraph above. A store-release isn't enough
> to order against a later load in the architecture either, so we're
> clearly in micro-architecture territory and I don't think you need to
> describe mechanisms that don't work here.
Thanks, Will. I’ll drop that paragraph and avoid describing store-release
as an architectural ordering mechanism here.
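For reference, the program-order shape that the erratum conditions describe
can be modelled in plain C as below. This is only a minimal user-space
sketch: struct demo_regs and demo_kick_and_poll() are hypothetical names, and
ordinary memory stands in for a Device-nGnR* mapping, so it shows the access
pattern rather than reproducing the reordering.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register block of one peripheral (illustration only). */
struct demo_regs {
	volatile uint32_t ctrl;   /* offset 0x0: written by the CPU */
	volatile uint32_t status; /* offset 0x4: read back by the CPU */
};

/*
 * Older plain store followed by a younger load from a different,
 * non-overlapping register of the same peripheral, with no barrier
 * and no other device load in between. On an affected part, with
 * real Device-nGnR* memory, this is the sequence in which the load
 * could be observed by the peripheral before the store.
 */
static inline uint32_t demo_kick_and_poll(struct demo_regs *regs, uint32_t cmd)
{
	assert(regs != NULL);

	regs->ctrl = cmd;    /* older store, not a store-release */
	return regs->status; /* younger load, same peripheral */
}
```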
>> Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain str*
>> to stlr* (Store-Release), which removes the "store is not a
>> store-release" condition for every device write the kernel issues.
>> Because writel() and writel_relaxed() are both built on __raw_writel()
>> in asm-generic/io.h, patching the raw variants covers both the
>> non-relaxed and relaxed APIs without touching the higher layers. Note
>> that writel()'s own barrier sits before the store, so it does not order
>> the store against a subsequent readl(); the store-release promotion is
>> what provides that ordering.
> Sashiko points out that you're missing __const_memcpy_toio_aligned32().
I’ll also cover __const_memcpy_toio_aligned32(); it currently emits plain
STRs directly and can bypass the raw write helper workaround. I’ll audit
the aligned64 path at the same time.
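A minimal user-space sketch of that fallback, assuming the workaround path
simply issues one release store per element instead of the unrolled STR
block. model_memcpy_toio32_release() is a hypothetical name, and C11 atomics
stand in for the kernel's inline asm (compilers typically emit STLR for a
release store on arm64, but that is not guaranteed by the C standard).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy 'count' 32-bit words to the destination, one release store per
 * word, in ascending address order. This gives up the contiguous
 * offset-addressed store block used on unaffected CPUs.
 */
static inline void model_memcpy_toio32_release(_Atomic uint32_t *to,
					       const uint32_t *from,
					       size_t count)
{
	assert(to != NULL && from != NULL);

	while (count--)
		atomic_store_explicit(to++, *from++, memory_order_release);
}
```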
>> Like ARM64_ERRATUM_832075 on the load side, the change is gated on a new
>> ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only activated on
>> parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs continue to use
>> the plain str* sequence.
>>
>> Note: stlr* only supports base-register addressing, so the raw accessors
>> can no longer use the offset addressing introduced by commit d044d6ba6f02
>> ("arm64: io: permit offset addressing"). The str* and stlr* alternates
>> share a single inline-asm operand and the sequence is selected at boot,
>> so the operand form is fixed at compile time; unaffected CPUs keep using
>> str* but also revert to base-register addressing. This keeps the store
>> side as simple as the existing load-side patching (load-acquire) and
>> avoids adding complexity to the device write path; retaining offset
>> addressing only for str* would otherwise require a runtime branch on
>> every write.
> I seem to remember Jason caring about that, possibly because some CPUs
> are very picky about write-combining?
For the offset-addressing concern, I’ll rework the raw accessors so
unaffected CPUs keep the existing offset-addressed STR sequence, and
only CPUs with ARM64_WORKAROUND_DEVICE_STORE_RELEASE take the base-register
STLR path.
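The intended split can be sketched in user space roughly as follows. This is
an illustration under stated assumptions, not the kernel code: the
model_needs_store_release flag is a hypothetical stand-in for the boot-time
patched capability check, and C11 atomics replace the inline asm (a relaxed
store typically compiles to STR and a release store to STLR on arm64).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the capability check; false on unaffected CPUs. */
static bool model_needs_store_release;

static inline void model_writel(uint32_t val, _Atomic uint32_t *addr)
{
	assert(addr != NULL);

	if (model_needs_store_release) {
		/* Workaround path: store-release, base-register addressing. */
		atomic_store_explicit(addr, val, memory_order_release);
		return;
	}

	/* Default path: plain store, compiler free to pick the addressing mode. */
	atomic_store_explicit(addr, val, memory_order_relaxed);
}
```

Keeping the check as the first statement means the default path stays
unchanged for unaffected CPUs, and the release store is taken only when the
flag is set.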
I’ll post a v3 using the patched branch from alternative_has_cap_unlikely(),
and include the memcpy_toio() aligned-helper coverage as shown below.
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -22,10 +22,46 @@
/*
* Generic IO read/write. These perform native-endian accesses.
*/
+static __always_inline bool arm64_needs_device_store_release(void)
+{
+ return alternative_has_cap_unlikely(
+ ARM64_WORKAROUND_DEVICE_STORE_RELEASE);
+}
+
+static __always_inline void __raw_writeb_stlr(u8 val,
+ volatile void __iomem *addr)
+{
+ asm volatile("stlrb %w0, [%1]" : : "rZ" (val), "r" (addr));
+}
+
+static __always_inline void __raw_writew_stlr(u16 val,
+ volatile void __iomem *addr)
+{
+ asm volatile("stlrh %w0, [%1]" : : "rZ" (val), "r" (addr));
+}
+
+static __always_inline void __raw_writel_stlr(u32 val,
+ volatile void __iomem *addr)
+{
+ asm volatile("stlr %w0, [%1]" : : "rZ" (val), "r" (addr));
+}
+
+static __always_inline void __raw_writeq_stlr(u64 val,
+ volatile void __iomem *addr)
+{
+ asm volatile("stlr %x0, [%1]" : : "rZ" (val), "r" (addr));
+}
+
#define __raw_writeb __raw_writeb
static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
{
volatile u8 __iomem *ptr = addr;
+
+ if (arm64_needs_device_store_release()) {
+ __raw_writeb_stlr(val, addr);
+ return;
+ }
+
asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
}
@@ -33,6 +69,12 @@ static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
{
volatile u16 __iomem *ptr = addr;
+
+ if (arm64_needs_device_store_release()) {
+ __raw_writew_stlr(val, addr);
+ return;
+ }
+
asm volatile("strh %w0, %1" : : "rZ" (val), "Qo" (*ptr));
}
@@ -40,6 +82,12 @@ static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
{
volatile u32 __iomem *ptr = addr;
+
+ if (arm64_needs_device_store_release()) {
+ __raw_writel_stlr(val, addr);
+ return;
+ }
+
asm volatile("str %w0, %1" : : "rZ" (val), "Qo" (*ptr));
}
@@ -47,6 +95,12 @@ static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
{
volatile u64 __iomem *ptr = addr;
+
+ if (arm64_needs_device_store_release()) {
+ __raw_writeq_stlr(val, addr);
+ return;
+ }
+
asm volatile("str %x0, %1" : : "rZ" (val), "Qo" (*ptr));
}
@@ -147,6 +201,12 @@ static __always_inline void
__const_memcpy_toio_aligned32(volatile u32 __iomem *to, const u32 *from,
size_t count)
{
+ if (arm64_needs_device_store_release()) {
+ while (count--)
+ __raw_writel_stlr(*from++, to++);
+ return;
+ }
+
switch (count) {
case 8:
asm volatile("str %w0, [%8, #4 * 0]\n"
@@ -204,6 +264,12 @@ static __always_inline void
__const_memcpy_toio_aligned64(volatile u64 __iomem *to, const u64 *from,
size_t count)
{
+ if (arm64_needs_device_store_release()) {
+ while (count--)
+ __raw_writeq_stlr(*from++, to++);
+ return;
+ }
+
switch (count) {
case 8:
asm volatile("str %x0, [%8, #8 * 0]\n"
I'll post v3 with this jump-instruction patching approach.
-Shanker