From: Shanker Donthineni <sdonthineni@nvidia.com>
To: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
Vladimir Murzin <vladimir.murzin@arm.com>,
Jason Gunthorpe <jgg@nvidia.com>,
linux-arm-kernel@lists.infradead.org,
Mark Rutland <mark.rutland@arm.com>,
linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
Vikram Sethi <vsethi@nvidia.com>,
Jason Sequeira <jsequeira@nvidia.com>
Subject: Re: [PATCH v4 1/2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering
Date: Thu, 2 Jul 2026 19:51:27 -0500
Message-ID: <cb745e73-26ea-4d3f-80f2-ab3467e68426@nvidia.com>
In-Reply-To: <akPQ8F3OgER621UP@willie-the-truck>
Hi Will,
On 6/30/2026 9:21 AM, Will Deacon wrote:
> On Thu, Jun 25, 2026 at 01:24:24PM -0500, Shanker Donthineni wrote:
>> On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
>> observed by a peripheral before an older, non-overlapping Device-nGnR*
>> store to the same peripheral. This breaks the program-order guarantee
>> that software expects for Device-nGnR* accesses and can leave a
>> peripheral in an incorrect state, as a load is observed before an
>> earlier store takes effect.
>>
>> The erratum can occur only when all of the following apply:
>>
>> - A PE executes a Device-nGnR* store followed by a younger
>> Device-nGnR* load.
>> - The store is not a store-release.
>> - The accesses target the same peripheral and do not overlap in bytes.
>> - There is at most one intervening Device-nGnR* store in program
>> order, and there are no intervening Device-nGnR* loads.
>> - There is no DSB, and no DMB that orders loads, between the store and
>> the load.
> Does that mean that a DMB LD between the store and the load would
> solve the problem?
Thank you for the suggestion to leave __raw_writeX() unchanged and apply
the workaround on the read side. It is a much better approach:
write-combining performance is preserved, and dgh() no longer needs to
be promoted to a dmb.
The hardware team has confirmed that DMB OSH between the store and load
prevents T410-OLY-1027. They are still validating whether DMB LD is
sufficient, so I do not want to rely on DMB LD until that confirmation is
available.
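To make the barrier placement concrete, here is a minimal userspace
sketch of the store/load sequence with the barrier in between. The
helper names dev_order_barrier() and store_then_load() are hypothetical
and exist only for illustration. It assumes GCC/Clang inline asm, and on
non-arm64 builds it falls back to a plain compiler barrier so that it
compiles anywhere. It is not the kernel implementation, which patches
the barrier in through ALTERNATIVE().

```c
#include <stdint.h>

/*
 * Hypothetical helper: DMB OSH on arm64, compiler barrier elsewhere.
 * DMB OSH is the barrier confirmed so far; DMB LD is not assumed here.
 */
static inline void dev_order_barrier(void)
{
#if defined(__aarch64__)
	__asm__ volatile("dmb osh" : : : "memory");
#else
	__asm__ volatile("" : : : "memory");
#endif
}

/*
 * Store to one register, then load from a different, non-overlapping
 * register. The barrier sits between the two accesses, which is the
 * window the erratum conditions describe.
 */
static inline uint32_t store_then_load(volatile uint32_t *wr_reg,
				       uint32_t val,
				       const volatile uint32_t *rd_reg)
{
	*wr_reg = val;			/* older store */
	dev_order_barrier();		/* barrier before the younger load */
	return *rd_reg;			/* younger load */
}
```

With a real peripheral, wr_reg and rd_reg would point into a
Device-nGnR* mapping. Ordinary memory, as used in a quick test, only
exercises the control flow.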
> It would be interesting to see how your benchmarks motivating patch 2
> look if you leave __raw_writeX as-is and instead add a barrier in
> __raw_readX before the load instruction.
I profiled memcpy_fromio() after implementing your suggested read-side
workaround using DMB OSH:
- Patch 1 leaves __raw_writeX() unchanged and inserts DMB OSH before
each load in __raw_readX().
- Patch 2 provides an arm64 memcpy_fromio() implementation that applies
one DMB OSH before the block copy and then uses direct Device loads.
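The difference between the two placements can be modelled in userspace
as follows. This is only a sketch under stated assumptions: every model_*
name and the barriers_issued counter are hypothetical, the counter is
there purely so that the number of barriers can be observed, and the
non-arm64 path substitutes a compiler barrier. It is not the code from
either patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Instrumentation for this model only; the kernel has no such counter. */
static unsigned long barriers_issued;

static inline void model_dmb_osh(void)
{
	barriers_issued++;
#if defined(__aarch64__)
	__asm__ volatile("dmb osh" : : : "memory");
#else
	__asm__ volatile("" : : : "memory");
#endif
}

/* Per-load placement: one barrier in front of every 64-bit load. */
static inline uint64_t model_raw_readq(const volatile uint64_t *addr)
{
	model_dmb_osh();
	return *addr;
}

static inline void model_copy_per_load(uint64_t *dst,
				       const volatile uint64_t *src,
				       size_t words)
{
	for (size_t i = 0; i < words; i++)
		dst[i] = model_raw_readq(&src[i]);
}

/* Per-block placement: one barrier, then plain loads for the block. */
static inline void model_copy_per_block(uint64_t *dst,
					const volatile uint64_t *src,
					size_t words)
{
	model_dmb_osh();
	for (size_t i = 0; i < words; i++)
		dst[i] = src[i];
}
```

For a 256-byte copy (32 words), the per-load variant issues 32 barriers
and the per-block variant issues one.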
With patch 2 applying the barrier once per block, the results show no
noticeable performance regression when the workaround is active. The
micro-benchmark uses a write-combined MMIO buffer and is pinned to one
PE. The loop count is adjusted so each row performs approximately 10,000
64-bit MMIO loads in total. The table reports the per-call latency of
memcpy_fromio() in nanoseconds, with the CPU cycle count from the PMU
cycle counter shown in parentheses.
+-------+--------------------+----------------------+------------------------+
| size | WAR off ns (cyc) | OSH/load P1 ns (cyc) | OSH/block P1+P2 ns(cyc)|
+-------+--------------------+----------------------+------------------------+
| 8B | 830.4 (2735) | 835.0 (2750) | 835.1 (2750) |
| 16B | 1660.1 (5468) | 1669.6 (5499) | 1664.8 (5484) |
| 32B | 3319.7 (10934) | 3339.1 (10998) | 3324.1 (10953) |
| 64B | 6638.6 (21866) | 6677.5 (21994) | 6642.3 (21880) |
| 128B | 13275.8 (43729) | 13355.3 (43989) | 13279.0 (43747) |
| 256B | 26549.7 (87480) | 26714.5 (87993) | 26552.5 (87475) |
+-------+--------------------+----------------------+------------------------+
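As a rough reading of the table, the relative overhead against the
"WAR off" column can be worked out with a small helper. The function
name overhead_pct() is hypothetical, and the inputs are just the
nanosecond values copied from the rows above.

```c
#include <assert.h>

/*
 * Relative overhead, in percent, of a measured latency over a baseline
 * latency. Both values are expected in the same unit (here: ns).
 */
static inline double overhead_pct(double baseline_ns, double measured_ns)
{
	assert(baseline_ns > 0.0);
	return (measured_ns - baseline_ns) / baseline_ns * 100.0;
}
```

For example, overhead_pct(830.4, 835.0) for the 8B row with the per-load
barrier comes out at roughly 0.55%, and overhead_pct(26549.7, 26714.5)
for the 256B row at roughly 0.62%. With the per-block barrier,
overhead_pct(26549.7, 26552.5) for the 256B row is roughly 0.01%.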
Micro-bench test:
	local_irq_save(flags);
	off = 0U;
	c0 = wc_pmu_read();
	t0 = ktime_get();
	for (i = 0UL; i < 10000; i++) {
		memcpy_fromio(dst, map + off, n * sizeof(u64));
		off ^= buf_size;
	}
	t1 = ktime_get();
	c1 = wc_pmu_read();
	local_irq_restore(flags);
Patch 2:
void memcpy_fromio(void *dst, const volatile void __iomem *src, size_t count)
{
	...
	asm volatile(ALTERNATIVE("nop", "dmb osh",
				 ARM64_WORKAROUND_NVIDIA_OLYMPUS_1027)
		     : : : "memory");

	while (count &&
	       !IS_ALIGNED((__force unsigned long)src, sizeof(u64))) {
		u8 val;

		asm volatile(ALTERNATIVE("ldrb %w0, [%1]",
					 "ldarb %w0, [%1]",
					 ARM64_WORKAROUND_DEVICE_LOAD_ACQUIRE)
			     : "=r" (val) : "r" (src));
		*(u8 *)dst = val;
		src++;
		dst++;
		count--;
	}

	while (count >= sizeof(u64)) {
		u64 val;

		asm volatile(ALTERNATIVE("ldr %0, [%1]",
					 "ldar %0, [%1]",
					 ARM64_WORKAROUND_DEVICE_LOAD_ACQUIRE)
			     : "=r" (val) : "r" (src));
		*(u64 *)dst = val;
		src += sizeof(u64);
		dst += sizeof(u64);
		count -= sizeof(u64);
	}

	while (count) {
		u8 val;

		asm volatile(ALTERNATIVE("ldrb %w0, [%1]",
					 "ldarb %w0, [%1]",
					 ARM64_WORKAROUND_DEVICE_LOAD_ACQUIRE)
			     : "=r" (val) : "r" (src));
		*(u8 *)dst = val;
		src++;
		dst++;
		count--;
	}
}
I am also discussing with the hardware team to understand any broader
implications of using a read-side DMB instead of store-release writes,
and to evaluate the correctness and performance differences between
DMB OSH and DMB LD. If we proceed with the load-side workaround, I will
drop patch 2 and keep the implementation limited to the raw read
helpers. I will post v5 after receiving their feedback.
-Shanker