linux-arm-kernel.lists.infradead.org archive mirror
From: Willy Tarreau <w@1wt.eu>
To: Breno Leitao <leitao@debian.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	Will Deacon <will@kernel.org>,
	Mark Rutland <mark.rutland@arm.com>,
	linux-arm-kernel@lists.infradead.org, kernel-team@meta.com,
	rmikey@meta.com
Subject: Re: Overhead of arm64 LSE per-CPU atomics?
Date: Tue, 4 Nov 2025 19:08:19 +0100
Message-ID: <20251104180819.GB20579@1wt.eu>
In-Reply-To: <ahkk2r22peni4s7j6c7tnv3uajvwiaeg3vwyusppblcokpvgjw@zuuzipntgu7x>

Hello Breno,

On Tue, Nov 04, 2025 at 07:59:38AM -0800, Breno Leitao wrote:
> I found that the LSE case (__percpu_add_case_64_lse) has a huge variation,
> while LL/SC case is stable.
> In some case, LSE function runs at the same latency as LL/SC function and
> slightly faster on p50, but, something happen to the system and LSE operations
> start to take way longer than LL/SC.
> 
> Here are some interesting output coming from the latency of the functions above>
> 
> 	CPU: 47 - Latency Percentiles:
> 	====================
> 	LL/SC:   p50: 5.69 ns      p95: 5.71 ns      p99: 5.80 ns
> 	LSE  :   p50: 45.53 ns     p95: 54.06 ns     p99: 55.18 ns
(...)

Very interesting. I've run them here on an 80-core Ampere Altra made
of Neoverse-N1 cores (armv8.2) and am consistently getting better timings
with LSE than with LL/SC:

   CPU: 0 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.32 ns     p95: 7.32 ns    p99: 7.33 ns
  LSE  :   p50: 5.01 ns     p95: 5.01 ns    p99: 5.03 ns
  
   CPU: 1 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.32 ns     p95: 7.32 ns    p99: 7.33 ns
  LSE  :   p50: 5.01 ns     p95: 5.01 ns    p99: 5.03 ns
  
   CPU: 2 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.32 ns     p95: 7.32 ns    p99: 7.33 ns
  LSE  :   p50: 5.01 ns     p95: 5.01 ns    p99: 5.02 ns
  (...)

They're *all* like this, between 7.32 and 7.36 for LL/SC p99,
and 5.01 to 5.03 for LSE p99.
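
For readers comparing these figures, here is a minimal sketch of how p50/p95/p99 values like the ones above could be derived from a set of per-iteration latency samples. It assumes the nearest-rank method; the benchmark's actual method is not shown in this message, and the helper names (cmp_double, sort_samples, percentile) are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* qsort() comparator for doubles, ascending order. */
static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a;
	double y = *(const double *)b;

	return (x > y) - (x < y);
}

/* Sort the latency samples in place, smallest first. */
static void sort_samples(double *samples, size_t n)
{
	qsort(samples, n, sizeof(*samples), cmp_double);
}

/*
 * Nearest-rank percentile (an assumption, see above): return the
 * smallest sample such that at least pct percent of the samples are
 * less than or equal to it. Requires n > 0, 0 < pct <= 100 and an
 * array already sorted by sort_samples().
 */
static double percentile(const double *sorted, size_t n, unsigned int pct)
{
	size_t rank = (n * pct + 99) / 100;	/* ceil(n * pct / 100) */

	if (rank == 0)
		rank = 1;
	return sorted[rank - 1];
}
```

With only 100 samples per CPU, as in the runs reported here, a nearest-rank p99 is the second-largest sample, so one or two slow iterations are enough to move it.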

However, on a CIX-P1 (armv9.2, 8xA720 + 4xA520), it's what you've
observed, i.e. a lot of variations that do not even depend on big
vs little cores:

   CPU: 0 - Latency Percentiles:
  ====================
  LL/SC:   p50: 6.56 ns     p95: 7.13 ns    p99: 8.81 ns
  LSE  :   p50: 45.79 ns    p95: 45.80 ns   p99: 45.86 ns
  
   CPU: 1 - Latency Percentiles:
  ====================
  LL/SC:   p50: 6.38 ns     p95: 6.39 ns    p99: 6.39 ns
  LSE  :   p50: 67.72 ns    p95: 67.78 ns   p99: 67.80 ns
  
   CPU: 2 - Latency Percentiles:
  ====================
  LL/SC:   p50: 5.56 ns     p95: 5.57 ns    p99: 5.60 ns
  LSE  :   p50: 59.19 ns    p95: 59.23 ns   p99: 59.25 ns
  (...)

I tried the same on an RK3588, which has 4 Cortex-A55 and 4 Cortex-A76
cores (the latter being very close to Neoverse-N1). The A76 cores (the
last 4) show the same pattern as the Altra above, with LSE consistently
much better than LL/SC:

   CPU: 0 - Latency Percentiles:
  ====================
  LL/SC:   p50: 9.39 ns     p95: 9.40 ns    p99: 9.41 ns
  LSE  :   p50: 4.43 ns     p95: 28.60 ns   p99: 30.29 ns
  
   CPU: 1 - Latency Percentiles:
  ====================
  LL/SC:   p50: 9.39 ns     p95: 9.40 ns    p99: 9.59 ns
  LSE  :   p50: 4.42 ns     p95: 27.51 ns   p99: 29.46 ns
  
   CPU: 2 - Latency Percentiles:
  ====================
  LL/SC:   p50: 9.40 ns     p95: 9.40 ns    p99: 9.40 ns
  LSE  :   p50: 4.42 ns     p95: 27.00 ns   p99: 29.60 ns
  
   CPU: 3 - Latency Percentiles:
  ====================
  LL/SC:   p50: 9.39 ns     p95: 9.40 ns    p99: 10.43 ns
  LSE  :   p50: 8.02 ns     p95: 29.72 ns   p99: 31.05 ns
  
   CPU: 4 - Latency Percentiles:
  ====================
  LL/SC:   p50: 8.85 ns     p95: 8.86 ns    p99: 8.86 ns
  LSE  :   p50: 5.75 ns     p95: 5.75 ns    p99: 5.75 ns
  
   CPU: 5 - Latency Percentiles:
  ====================
  LL/SC:   p50: 8.85 ns     p95: 8.85 ns    p99: 9.28 ns
  LSE  :   p50: 5.75 ns     p95: 5.75 ns    p99: 8.29 ns
  
   CPU: 6 - Latency Percentiles:
  ====================
  LL/SC:   p50: 8.79 ns     p95: 8.80 ns    p99: 8.80 ns
  LSE  :   p50: 5.71 ns     p95: 5.71 ns    p99: 5.71 ns
  
   CPU: 7 - Latency Percentiles:
  ====================
  LL/SC:   p50: 8.80 ns     p95: 8.80 ns    p99: 9.30 ns
  LSE  :   p50: 5.71 ns     p95: 5.72 ns    p99: 5.72 ns

Finally, on a Qualcomm QC6490 with 4xA55 + 4xA78, I'm getting something
between the two (and the governor is in performance mode):

 ./percpu_bench 
ARM64 Per-CPU Atomic Add Benchmark
===================================
Running percentile measurements (100 iterations)...
Detected 8 CPUs

   CPU: 0 - Latency Percentiles:
  ====================
  LL/SC:   p50: 8.23 ns     p95: 8.24 ns    p99: 8.28 ns
  LSE  :   p50: 4.63 ns     p95: 4.64 ns    p99: 19.48 ns
  
   CPU: 1 - Latency Percentiles:
  ====================
  LL/SC:   p50: 8.23 ns     p95: 8.24 ns    p99: 8.26 ns
  LSE  :   p50: 4.63 ns     p95: 4.64 ns    p99: 16.30 ns
  
   CPU: 2 - Latency Percentiles:
  ====================
  LL/SC:   p50: 8.23 ns     p95: 8.25 ns    p99: 8.25 ns
  LSE  :   p50: 4.63 ns     p95: 4.64 ns    p99: 4.65 ns
  
   CPU: 3 - Latency Percentiles:
  ====================
  LL/SC:   p50: 8.23 ns     p95: 8.25 ns    p99: 8.36 ns
  LSE  :   p50: 4.63 ns     p95: 19.01 ns   p99: 32.15 ns
  
   CPU: 4 - Latency Percentiles:
  ====================
  LL/SC:   p50: 6.27 ns     p95: 6.28 ns    p99: 6.29 ns
  LSE  :   p50: 5.44 ns     p95: 5.44 ns    p99: 5.44 ns
  
   CPU: 5 - Latency Percentiles:
  ====================
  LL/SC:   p50: 6.27 ns     p95: 6.28 ns    p99: 6.29 ns
  LSE  :   p50: 5.44 ns     p95: 5.44 ns    p99: 5.44 ns
  
   CPU: 6 - Latency Percentiles:
  ====================
  LL/SC:   p50: 6.27 ns     p95: 6.28 ns    p99: 6.28 ns
  LSE  :   p50: 5.44 ns     p95: 5.44 ns    p99: 5.45 ns
  
   CPU: 7 - Latency Percentiles:
  ====================
  LL/SC:   p50: 5.56 ns     p95: 5.57 ns    p99: 5.58 ns
  LSE  :   p50: 4.82 ns     p95: 4.82 ns    p99: 4.83 ns

So at first glance it seems that LL/SC is generally slower but can be
more consistent on modern machines, while LSE is stable on older machines
and is sometimes stable even on modern ones.

@Catalin, I *tried* to do the ldadd test but I wasn't sure what to put in
the Xt register (to be honest I've never understood Arm's docs regarding
instructions, even the pseudo language is super cryptic to me), and I came
up with this:

        asm volatile(
                /* LSE atomics */
                "    ldadd    %[val], %[out], %[ptr]\n"
                : [ptr] "+Q"(*(u64 *)ptr), [out] "+r" (val)
                : [val] "r"((u64)(val))
                : "memory");

which assembles like this:

 ab8:   f8200040        ldadd   x0, x0, [x2]
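
As a reading aid for the operands: in "ldadd Xs, Xt, [Xn]", Xs holds the value to add, Xt receives the value that was in memory before the add, and [Xn] is the memory location (stadd is the alias of the same instruction with the zero register as Xt). The sketch below wraps that in a helper with a separate output variable. The name fetch_add_u64 is hypothetical, and the non-LSE branch falls back to the compiler's __atomic_fetch_add builtin so the helper also builds on other targets; it is an illustration, not the kernel's per-CPU code.

```c
#include <stdint.h>

/*
 * Atomically add val to *ptr and return the value *ptr held before the
 * add. Hypothetical helper for illustration only.
 */
static inline uint64_t fetch_add_u64(uint64_t *ptr, uint64_t val)
{
#if defined(__aarch64__) && defined(__ARM_FEATURE_ATOMICS)
	uint64_t old;

	asm volatile(
		/* ldadd Xs, Xt, [Xn]: Xt = old *Xn, then *Xn = old + Xs */
		"	ldadd	%[val], %[old], %[ptr]\n"
		: [ptr] "+Q" (*ptr), [old] "=&r" (old)
		: [val] "r" (val)
		: "memory");
	return old;
#else
	/* Portable fallback with the same fetch-and-add semantics. */
	return __atomic_fetch_add(ptr, val, __ATOMIC_RELAXED);
#endif
}
```

Calling fetch_add_u64(&counter, 5) on a counter holding 10 should return 10 and leave 15 in memory, which is an easy way to check that the add is not a no-op. Note that in the snippet above val is bound both as the addend and as the destination ("+r"), so after the instruction val holds the previous memory contents rather than the original addend.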

It now gives me much better LSE performance on the ARMv9:

   CPU: 0 - Latency Percentiles:
  ====================
  LL/SC:   p50: 6.56 ns     p95: 7.32 ns    p99: 8.72 ns
  LSE  :   p50: 2.76 ns     p95: 2.76 ns    p99: 2.77 ns
  
   CPU: 1 - Latency Percentiles:
  ====================
  LL/SC:   p50: 6.38 ns     p95: 6.39 ns    p99: 6.39 ns
  LSE  :   p50: 5.09 ns     p95: 5.11 ns    p99: 5.11 ns
  
   CPU: 2 - Latency Percentiles:
  ====================
  LL/SC:   p50: 5.56 ns     p95: 5.58 ns    p99: 9.07 ns
  LSE  :   p50: 4.45 ns     p95: 4.46 ns    p99: 4.46 ns
  
   CPU: 3 - Latency Percentiles:
  ====================
  LL/SC:   p50: 5.56 ns     p95: 5.57 ns    p99: 7.42 ns
  LSE  :   p50: 4.45 ns     p95: 4.46 ns    p99: 4.46 ns
  
   CPU: 4 - Latency Percentiles:
  ====================
  LL/SC:   p50: 5.56 ns     p95: 5.57 ns    p99: 5.60 ns
  LSE  :   p50: 4.45 ns     p95: 4.46 ns    p99: 4.47 ns
  
   CPU: 5 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.40 ns     p95: 7.40 ns    p99: 7.40 ns
  LSE  :   p50: 3.08 ns     p95: 3.08 ns    p99: 3.08 ns
  
   CPU: 6 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.40 ns     p95: 7.40 ns    p99: 7.42 ns
  LSE  :   p50: 3.08 ns     p95: 3.08 ns    p99: 3.08 ns
  
   CPU: 7 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.40 ns     p95: 7.40 ns    p99: 7.40 ns
  LSE  :   p50: 3.08 ns     p95: 3.08 ns    p99: 3.08 ns
  
   CPU: 8 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.40 ns     p95: 7.40 ns    p99: 7.40 ns
  LSE  :   p50: 3.08 ns     p95: 3.08 ns    p99: 3.08 ns
  
   CPU: 9 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.05 ns     p95: 7.06 ns    p99: 7.07 ns
  LSE  :   p50: 2.96 ns     p95: 2.97 ns    p99: 2.97 ns
  
   CPU: 10 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.05 ns     p95: 7.05 ns    p99: 7.06 ns
  LSE  :   p50: 2.96 ns     p95: 2.96 ns    p99: 2.97 ns
  
   CPU: 11 - Latency Percentiles:
  ====================
  LL/SC:   p50: 6.56 ns     p95: 6.56 ns    p99: 6.57 ns
  LSE  :   p50: 2.76 ns     p95: 2.76 ns    p99: 2.76 ns

(cores 0,5-11 are A720, cores 1-4 are A520). I'd just like
confirmation that my change is correct and that I'm not just running
something that gets ignored or that merely adds zero :-/

If that's OK, then it's indeed way better!

Willy

PS: thanks Breno for sharing your test code, that's super useful!


