From: Shanker Donthineni <sdonthineni@nvidia.com>
To: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
Vladimir Murzin <vladimir.murzin@arm.com>,
Jason Gunthorpe <jgg@nvidia.com>,
linux-arm-kernel@lists.infradead.org,
Mark Rutland <mark.rutland@arm.com>,
linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
Vikram Sethi <vsethi@nvidia.com>,
Jason Sequeira <jsequeira@nvidia.com>
Subject: Re: [PATCH v4 1/2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering
Date: Thu, 2 Jul 2026 19:51:27 -0500
Message-ID: <cb745e73-26ea-4d3f-80f2-ab3467e68426@nvidia.com>
In-Reply-To: <akPQ8F3OgER621UP@willie-the-truck>
Hi Will,
On 6/30/2026 9:21 AM, Will Deacon wrote:
> On Thu, Jun 25, 2026 at 01:24:24PM -0500, Shanker Donthineni wrote:
>> On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
>> observed by a peripheral before an older, non-overlapping Device-nGnR*
>> store to the same peripheral. This breaks the program-order guarantee
>> that software expects for Device-nGnR* accesses and can leave a
>> peripheral in an incorrect state, as a load is observed before an
>> earlier store takes effect.
>>
>> The erratum can occur only when all of the following apply:
>>
>> - A PE executes a Device-nGnR* store followed by a younger
>> Device-nGnR* load.
>> - The store is not a store-release.
>> - The accesses target the same peripheral and do not overlap in bytes.
>> - There is at most one intervening Device-nGnR* store in program
>> order, and there are no intervening Device-nGnR* loads.
>> - There is no DSB, and no DMB that orders loads, between the store and
>> the load.
> Does that mean that a DMB LD between the store and the load would
> solve the problem?
Thank you for suggesting that we leave __raw_writeX() unchanged and apply
the workaround on the read side. It is a much better approach: it
preserves write-combining performance and avoids promoting dgh() to dmb.
The hardware team has confirmed that DMB OSH between the store and load
prevents T410-OLY-1027. They are still validating whether DMB LD is
sufficient, so I do not want to rely on DMB LD until that confirmation is
available.
> It would be interesting to see how your benchmarks motivating patch 2
> look if you leave __raw_writeX as-is and instead add a barrier in
> __raw_readX before the load instruction.
I profiled memcpy_fromio() after implementing your suggested read-side
workaround using DMB OSH:
- Patch 1 leaves __raw_writeX() unchanged and inserts DMB OSH before
each load in __raw_readX().
- Patch 2 provides an arm64 memcpy_fromio() implementation that applies
one DMB OSH before the block copy and then uses direct Device loads.
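The difference between the two barrier placements can be modelled in user
space. The sketch below is only an illustration, not the kernel code: the
names model_barrier(), model_readq(), model_copy_per_load() and
model_copy_per_block() are hypothetical, and a counter stands in for the
DMB OSH instruction. It shows that a copy of N 64-bit words issues N
barriers with the per-load placement and one barrier with the per-block
placement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for "dmb osh": it only counts invocations. */
static unsigned long barrier_count;

static void model_barrier(void)
{
	barrier_count++;
}

/* Patch 1 model: one barrier before every 64-bit load. */
static uint64_t model_readq(const volatile uint64_t *addr)
{
	model_barrier();
	return *addr;
}

/* Block copy built on the per-load helper: 'words' barriers in total. */
static void model_copy_per_load(uint64_t *dst, const volatile uint64_t *src,
				size_t words)
{
	assert(dst && src);
	for (size_t i = 0; i < words; i++)
		dst[i] = model_readq(&src[i]);
}

/* Patch 1 + patch 2 model: one barrier, then plain loads. */
static void model_copy_per_block(uint64_t *dst, const volatile uint64_t *src,
				 size_t words)
{
	assert(dst && src);
	model_barrier();
	for (size_t i = 0; i < words; i++)
		dst[i] = src[i];
}
```

Resetting barrier_count to zero before each call and reading it afterwards
gives the number of barriers a given copy would issue under each scheme.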
With patch 2 applying the barrier once per block, the results show no
noticeable performance regression when the workaround is active. The
micro-benchmark uses a write-combined MMIO buffer and is pinned to one
PE. The loop count is adjusted so each row performs approximately 10,000
64-bit MMIO loads in total. The table reports the per-call latency of
memcpy_fromio() in nanoseconds, with CPU cycles measured using the PMU
cycle counter shown in parentheses.
+-------+--------------------+----------------------+------------------------+
| size | WAR off ns (cyc) | OSH/load P1 ns (cyc) | OSH/block P1+P2 ns(cyc)|
+-------+--------------------+----------------------+------------------------+
| 8B | 830.4 (2735) | 835.0 (2750) | 835.1 (2750) |
| 16B | 1660.1 (5468) | 1669.6 (5499) | 1664.8 (5484) |
| 32B | 3319.7 (10934) | 3339.1 (10998) | 3324.1 (10953) |
| 64B | 6638.6 (21866) | 6677.5 (21994) | 6642.3 (21880) |
| 128B | 13275.8 (43729) | 13355.3 (43989) | 13279.0 (43747) |
| 256B | 26549.7 (87480) | 26714.5 (87993) | 26552.5 (87475) |
+-------+--------------------+----------------------+------------------------+
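For reference, the relative overhead implied by the table can be derived
with a short stand-alone helper. This is a sketch over the nanosecond
figures copied from the table; overhead_pct() and the array names are
hypothetical. By this arithmetic the per-load barrier costs roughly 0.6%
at every size, while the per-block barrier falls from roughly 0.6% at 8B
to roughly 0.01% at 256B.

```c
#include <assert.h>
#include <stddef.h>

#define NROWS 6

/* Per-call latencies in nanoseconds, copied from the table above. */
static const double war_off_ns[NROWS] = {
	830.4, 1660.1, 3319.7, 6638.6, 13275.8, 26549.7
};
static const double per_load_ns[NROWS] = {
	835.0, 1669.6, 3339.1, 6677.5, 13355.3, 26714.5
};
static const double per_block_ns[NROWS] = {
	835.1, 1664.8, 3324.1, 6642.3, 13279.0, 26552.5
};

/* Overhead of a workaround latency relative to the baseline, in percent. */
static double overhead_pct(double baseline_ns, double workaround_ns)
{
	assert(baseline_ns > 0.0);
	return (workaround_ns - baseline_ns) / baseline_ns * 100.0;
}

/* Largest overhead across all rows for one workaround column. */
static double max_overhead_pct(const double *workaround_ns)
{
	double max = 0.0;

	for (size_t i = 0; i < NROWS; i++) {
		double pct = overhead_pct(war_off_ns[i], workaround_ns[i]);

		if (pct > max)
			max = pct;
	}
	return max;
}
```

Calling max_overhead_pct(per_load_ns) and max_overhead_pct(per_block_ns)
gives the worst case for each column.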
Micro-bench test:
	local_irq_save(flags);
	off = 0U;
	c0 = wc_pmu_read();
	t0 = ktime_get();
	for (i = 0UL; i < 10000; i++) {
		memcpy_fromio(dst, map + off, n * sizeof(u64));
		off ^= buf_size;
	}
	t1 = ktime_get();
	c1 = wc_pmu_read();
	local_irq_restore(flags);
Patch 2:
void memcpy_fromio(void *dst, const volatile void __iomem *src, size_t count)
{
	...
	asm volatile(ALTERNATIVE("nop", "dmb osh",
				 ARM64_WORKAROUND_NVIDIA_OLYMPUS_1027)
		     : : : "memory");

	while (count &&
	       !IS_ALIGNED((__force unsigned long)src, sizeof(u64))) {
		u8 val;

		asm volatile(ALTERNATIVE("ldrb %w0, [%1]",
					 "ldarb %w0, [%1]",
					 ARM64_WORKAROUND_DEVICE_LOAD_ACQUIRE)
			     : "=r" (val) : "r" (src));
		*(u8 *)dst = val;
		src++;
		dst++;
		count--;
	}

	while (count >= sizeof(u64)) {
		u64 val;

		asm volatile(ALTERNATIVE("ldr %0, [%1]",
					 "ldar %0, [%1]",
					 ARM64_WORKAROUND_DEVICE_LOAD_ACQUIRE)
			     : "=r" (val) : "r" (src));
		*(u64 *)dst = val;
		src += sizeof(u64);
		dst += sizeof(u64);
		count -= sizeof(u64);
	}

	while (count) {
		u8 val;

		asm volatile(ALTERNATIVE("ldrb %w0, [%1]",
					 "ldarb %w0, [%1]",
					 ARM64_WORKAROUND_DEVICE_LOAD_ACQUIRE)
			     : "=r" (val) : "r" (src));
		*(u8 *)dst = val;
		src++;
		dst++;
		count--;
	}
}
I am also working with the hardware team to understand any broader
implications of using a read-side DMB instead of store-release writes,
and to evaluate the correctness and performance differences between
DMB OSH and DMB LD. If we proceed with the load-side workaround, I will
drop patch 2 and keep the implementation limited to the raw read
helpers. I will post v5 after receiving their feedback.
-Shanker