Linux-ARM-Kernel Archive on lore.kernel.org
From: Shanker Donthineni <sdonthineni@nvidia.com>
To: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
	Vladimir Murzin <vladimir.murzin@arm.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	linux-arm-kernel@lists.infradead.org,
	Mark Rutland <mark.rutland@arm.com>,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	Vikram Sethi <vsethi@nvidia.com>,
	Jason Sequeira <jsequeira@nvidia.com>
Subject: Re: [PATCH v4 1/2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering
Date: Thu, 2 Jul 2026 19:51:27 -0500
Message-ID: <cb745e73-26ea-4d3f-80f2-ab3467e68426@nvidia.com>
In-Reply-To: <akPQ8F3OgER621UP@willie-the-truck>

Hi Will,

On 6/30/2026 9:21 AM, Will Deacon wrote:

> On Thu, Jun 25, 2026 at 01:24:24PM -0500, Shanker Donthineni wrote:
>> On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
>> observed by a peripheral before an older, non-overlapping Device-nGnR*
>> store to the same peripheral. This breaks the program-order guarantee
>> that software expects for Device-nGnR* accesses and can leave a
>> peripheral in an incorrect state, as a load is observed before an
>> earlier store takes effect.
>>
>> The erratum can occur only when all of the following apply:
>>
>>    - A PE executes a Device-nGnR* store followed by a younger
>>      Device-nGnR* load.
>>    - The store is not a store-release.
>>    - The accesses target the same peripheral and do not overlap in bytes.
>>    - There is at most one intervening Device-nGnR* store in program
>>      order, and there are no intervening Device-nGnR* loads.
>>    - There is no DSB, and no DMB that orders loads, between the store and
>>      the load.
> Does that mean that a DMB LD between the store and the load would
> solve the problem?

I appreciate your suggestion to leave __raw_writeX() unchanged and apply
the workaround on the read side. It is a much better approach:
write-combining performance is preserved and the promotion of dgh() to
dmb is avoided.

The hardware team has confirmed that DMB OSH between the store and load
prevents T410-OLY-1027. They are still validating whether DMB LD is
sufficient, so I do not want to rely on DMB LD until that confirmation is
available.
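
For illustration only, here is a user-space sketch of the read-side idea:
the store helper is left alone and the barrier is issued immediately before
the load. All names prefixed with sketch_ are hypothetical, ordinary memory
stands in for a Device-nGnR* mapping, and the barrier macro falls back to a
plain compiler barrier on non-arm64 builds. The actual patch would rely on
the kernel's ALTERNATIVE() patching rather than a compile-time #if.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the workaround barrier. On arm64 it emits
 * DMB OSH; elsewhere it is only a compiler barrier so the sketch builds.
 */
#if defined(__aarch64__)
#define sketch_read_barrier() __asm__ volatile("dmb osh" : : : "memory")
#else
#define sketch_read_barrier() __asm__ volatile("" : : : "memory")
#endif

/* Store side is a plain store, mirroring an unchanged __raw_writeX(). */
static inline void sketch_raw_writeq(uint64_t val, volatile uint64_t *addr)
{
	*addr = val;
}

/* Load side: barrier first, then the 64-bit load. */
static inline uint64_t sketch_raw_readq(const volatile uint64_t *addr)
{
	sketch_read_barrier();
	return *addr;
}

/*
 * The access pattern named in the erratum conditions: an older store
 * followed by a younger, non-overlapping load to the same peripheral.
 * regs must point to at least two 64-bit slots.
 */
static inline uint64_t sketch_store_then_load(volatile uint64_t *regs,
					      uint64_t val)
{
	sketch_raw_writeq(val, &regs[0]);
	return sketch_raw_readq(&regs[1]);
}
```

Because the barrier sits between the store and the load, the sequence no
longer meets the "no DSB, and no DMB that orders loads" condition listed
above.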

> It would be interesting to see how your benchmarks motivating patch 2
> look if you leave __raw_writeX as-is and instead add a barrier in
> __raw_readX before the load instruction.

I profiled memcpy_fromio() after implementing your suggested read-side
workaround using DMB OSH:
   - Patch 1 leaves __raw_writeX() unchanged and inserts DMB OSH before
     each load in __raw_readX().
   - Patch 2 provides an arm64 memcpy_fromio() implementation that applies
     one DMB OSH before the block copy and then uses direct Device loads.

With patch 2 applying the barrier once per block, the results show no
noticeable performance regression when the workaround is active. The
micro-benchmark uses a write-combined MMIO buffer and is pinned to one
PE. The loop count is adjusted so each row performs approximately 10,000
64-bit MMIO loads in total. The table reports the per-call latency of
memcpy_fromio() in nanoseconds; the CPU cycle count, measured with the
PMU cycle counter, is shown in parentheses.

+-------+--------------------+----------------------+------------------------+
|  size | WAR off ns (cyc)   | OSH/load P1 ns (cyc) | OSH/block P1+P2 ns(cyc)|
+-------+--------------------+----------------------+------------------------+
|    8B |       830.4 (2735) |         835.0 (2750) |           835.1 (2750) |
|   16B |      1660.1 (5468) |        1669.6 (5499) |          1664.8 (5484) |
|   32B |     3319.7 (10934) |       3339.1 (10998) |         3324.1 (10953) |
|   64B |     6638.6 (21866) |       6677.5 (21994) |         6642.3 (21880) |
|  128B |    13275.8 (43729) |      13355.3 (43989) |        13279.0 (43747) |
|  256B |    26549.7 (87480) |      26714.5 (87993) |        26552.5 (87475) |
+-------+--------------------+----------------------+------------------------+
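
As a reading aid, the relative overhead in each row can be derived from the
nanosecond columns as (with - base) / base. The sketch below only re-enters
the numbers from the table; the struct and function names are hypothetical.
At 256B it gives roughly 0.62% for the per-load barrier and roughly 0.01%
for the per-block barrier. At 8B the two variants are both near 0.55%,
which is expected since a single 64-bit load needs one barrier either way.

```c
#include <stddef.h>
#include <stdio.h>

/* One table row: per-call latency in ns for each configuration. */
struct bench_row {
	size_t bytes;
	double off_ns;        /* workaround off */
	double per_load_ns;   /* DMB OSH before every load (patch 1) */
	double per_block_ns;  /* DMB OSH once per block (patches 1+2) */
};

/* Values copied from the table above. */
static const struct bench_row rows[] = {
	{   8,   830.4,   835.0,   835.1 },
	{  16,  1660.1,  1669.6,  1664.8 },
	{  32,  3319.7,  3339.1,  3324.1 },
	{  64,  6638.6,  6677.5,  6642.3 },
	{ 128, 13275.8, 13355.3, 13279.0 },
	{ 256, 26549.7, 26714.5, 26552.5 },
};

/* Relative slowdown of `with` versus `base`, in percent. */
static double overhead_pct(double base, double with)
{
	return (with - base) / base * 100.0;
}

/* Print the overhead of both workaround variants for every row. */
static void print_overheads(void)
{
	for (size_t i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
		printf("%4zuB  per-load %+.3f%%  per-block %+.3f%%\n",
		       rows[i].bytes,
		       overhead_pct(rows[i].off_ns, rows[i].per_load_ns),
		       overhead_pct(rows[i].off_ns, rows[i].per_block_ns));
}
```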

Micro-bench test:
        local_irq_save(flags);
        off = 0U;
        c0 = wc_pmu_read();
        t0 = ktime_get();
        for (i = 0UL; i < 10000; i++) {
                memcpy_fromio(dst, map + off, n * sizeof(u64));
                off ^= buf_size;
        }
        t1 = ktime_get();
        c1 = wc_pmu_read();
        local_irq_restore(flags);

Patch 2:
void memcpy_fromio(void *dst, const volatile void __iomem *src, size_t count)
{
...
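     /*
      * One barrier for the whole block: orders the Device loads below
      * after any older Device store when the workaround is active.
      */
     asm volatile(ALTERNATIVE("nop", "dmb osh",
                  ARM64_WORKAROUND_NVIDIA_OLYMPUS_1027)
              : : : "memory");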

     while (count &&
            !IS_ALIGNED((__force unsigned long)src, sizeof(u64))) {
         u8 val;
         asm volatile(ALTERNATIVE("ldrb %w0, [%1]",
                      "ldarb %w0, [%1]",
                      ARM64_WORKAROUND_DEVICE_LOAD_ACQUIRE)
                  : "=r" (val) : "r" (src));
         *(u8 *)dst = val;
         src++;
         dst++;
         count--;
     }
     while (count >= sizeof(u64)) {
         u64 val;
         asm volatile(ALTERNATIVE("ldr %0, [%1]",
                      "ldar %0, [%1]",
                      ARM64_WORKAROUND_DEVICE_LOAD_ACQUIRE)
                  : "=r" (val) : "r" (src));
         *(u64 *)dst = val;
         src += sizeof(u64);
         dst += sizeof(u64);
         count -= sizeof(u64);
     }
     while (count) {
         u8 val;
         asm volatile(ALTERNATIVE("ldrb %w0, [%1]",
                      "ldarb %w0, [%1]",
                      ARM64_WORKAROUND_DEVICE_LOAD_ACQUIRE)
                  : "=r" (val) : "r" (src));
         *(u8 *)dst = val;
         src++;
         dst++;
         count--;
     }
}
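
The shape of the function above (one barrier, then an unaligned head,
an aligned 64-bit body, and a byte tail) can be modeled in user space as
follows. This is only a sketch under stated assumptions: plain volatile
loads replace the ldr/ldar alternatives, ordinary memory replaces the
__iomem mapping, the barrier is a compiler barrier on non-arm64 builds,
and the sketch_ names and the access counters are hypothetical additions
for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
#define sketch_block_barrier() __asm__ volatile("dmb osh" : : : "memory")
#else
#define sketch_block_barrier() __asm__ volatile("" : : : "memory")
#endif

/* Access counters, kept only so the split can be inspected. */
struct sketch_copy_stats {
	size_t barriers;
	size_t byte_loads;
	size_t qword_loads;
};

static struct sketch_copy_stats
sketch_copy_from_dev(void *dst, const volatile void *src, size_t count)
{
	struct sketch_copy_stats st = { 0, 0, 0 };
	const volatile unsigned char *s = src;
	unsigned char *d = dst;

	/* Single barrier for the whole block. */
	sketch_block_barrier();
	st.barriers++;

	/* Head: byte loads until the source is 8-byte aligned. */
	while (count && ((uintptr_t)s % sizeof(uint64_t))) {
		*d++ = *s++;
		count--;
		st.byte_loads++;
	}

	/* Body: one aligned 64-bit load per iteration. */
	while (count >= sizeof(uint64_t)) {
		uint64_t val = *(const volatile uint64_t *)s;

		memcpy(d, &val, sizeof(val)); /* dst may be unaligned */
		s += sizeof(uint64_t);
		d += sizeof(uint64_t);
		count -= sizeof(uint64_t);
		st.qword_loads++;
	}

	/* Tail: remaining bytes. */
	while (count) {
		*d++ = *s++;
		count--;
		st.byte_loads++;
	}

	return st;
}
```

For example, copying 20 bytes from a source that starts 3 bytes past an
8-byte boundary takes 5 head byte loads, one 64-bit load, and 7 tail byte
loads, all behind a single barrier.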

I am also working with the hardware team to understand any broader
implications of using a read-side DMB instead of store-release writes,
and to evaluate the correctness and performance differences between
DMB OSH and DMB LD. If we proceed with the load-side workaround, I will
drop patch 2 and keep the implementation limited to the raw read
helpers. I will post v5 after receiving their feedback.

-Shanker




