From: Michael Ellerman <mpe@ellerman.id.au>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
paulmck <paulmck@kernel.org>, Nicholas Piggin <npiggin@gmail.com>,
Christophe Leroy <christophe.leroy@csgroup.eu>,
"Aneesh Kumar K.V" <aneesh.kumar@kernel.org>,
"Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: "linuxppc-dev@lists.ozlabs.org" <linuxppc-dev@lists.ozlabs.org>
Subject: Re: Appropriate liburcu cache line size for Power
Date: Tue, 02 Apr 2024 18:17:25 +1100
Message-ID: <87a5mc8c8q.fsf@mail.lhotse>
In-Reply-To: <f16552ba-f8f8-4023-a8ab-1c746a254f3c@efficios.com>

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
> On 2024-03-26 03:19, Michael Ellerman wrote:
>> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
>>> In the powerpc architecture support within the liburcu project [1]
>>> we have a cache line size defined as 256 bytes with the following
>>> comment:
>>>
>>> /* Include size of POWER5+ L3 cache lines: 256 bytes */
>>> #define CAA_CACHE_LINE_SIZE 256
>>>
>>> I recently received a pull request on github [2] asking to
>>> change this to 128 bytes. All the material provided supports
>>> that the cache line sizes on powerpc are 128 bytes or less (even
>>> L3 on POWER7, POWER8, and POWER9) [3].
>>>
>>> I wonder where the 256-byte L3 cache line size for POWER5+
>>> we have in liburcu comes from, and I wonder if it's the right choice
>>> for a cache line size on all powerpc, considering that the Linux
>>> kernel appears to use a 128-byte cache line size on recent Power
>>> architectures. I recall some benchmark experiments Paul and I did
>>> on a 64-core 1.9GHz POWER5+ machine that benefited from a 256-byte
>>> cache line size, and I suppose this is why we came up with this
>>> value, but I don't have the detailed specs of that machine.
>>>
>>> Any feedback on this matter would be appreciated.
>>
>> The ISA doesn't specify the cache line size, other than that it is
>> smaller than a page.
>>
>> In practice all the 64-bit IBM server CPUs I'm aware of have used 128
>> bytes. There are some 64-bit CPUs that use 64 bytes, eg. pasemi PA6T and
>> Freescale e6500.
>>
>> It is possible to discover at runtime via AUXV headers. But that's no
>> use if you want a compile-time constant.
>
> Indeed, and this CAA_CACHE_LINE_SIZE is part of the liburcu powerpc ABI,
> so changing this would require a soname bump, which I don't want to do
> without really good reasons.
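
To illustrate the runtime-versus-compile-time point above, here is a minimal sketch (not liburcu code) that reads the data cache block size from the auxiliary vector with getauxval() and checks whether a compile-time constant is at least that large. EXAMPLE_CACHE_LINE_SIZE and both helper functions are hypothetical names introduced for this example. On kernels or architectures that do not export AT_DCACHE_BSIZE, getauxval() returns 0 and the check treats the value as unknown.

```c
#include <stdio.h>
#include <sys/auxv.h>	/* getauxval() */

#ifndef AT_DCACHE_BSIZE
#define AT_DCACHE_BSIZE 19	/* auxv tag used by Linux on powerpc */
#endif

/* Hypothetical stand-in for a library's compile-time constant. */
#define EXAMPLE_CACHE_LINE_SIZE 128

/*
 * Data cache block size reported by the kernel through the auxiliary
 * vector, or 0 if the tag is not provided on this system.
 */
static unsigned long runtime_dcache_block_size(void)
{
	return getauxval(AT_DCACHE_BSIZE);
}

/*
 * Returns 1 if padding to `constant` bytes is at least as large as the
 * runtime block size, or if the runtime value is unknown (0).
 */
static int constant_covers_runtime(unsigned long constant,
				   unsigned long runtime)
{
	return runtime == 0 || constant >= runtime;
}

/* Prints a one-line summary of the comparison. */
static void report_cache_line_size(void)
{
	unsigned long runtime = runtime_dcache_block_size();

	printf("compile-time: %d, runtime: %lu, covered: %d\n",
	       EXAMPLE_CACHE_LINE_SIZE, runtime,
	       constant_covers_runtime(EXAMPLE_CACHE_LINE_SIZE, runtime));
}
```

Calling report_cache_line_size() at program start would be one way to flag a mismatch without touching the ABI-visible constant.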
>
>>
>> I'm happy to run some benchmarks if you can point me at what to run. I
>> had a poke around the repository and found short_bench, but it seemed to
>> run for a very long time.
>
> I've created a dedicated test program for this, see:
>
> https://github.com/compudj/userspace-rcu-dev/tree/false-sharing

Perfect :)

> The test program runs 4 threads by default, which can be overridden
> with "-t N". This may be needed if you want this to use all cores from
> a larger machine. See "-h" for options.
>
> On a POWER9 (architected), altivec supported:
>
> for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
> ok 1 - Stride 8 bytes, increments per ms per thread: 12264
> 1..1
> ok 1 - Stride 16 bytes, increments per ms per thread: 12276
> 1..1
> ok 1 - Stride 32 bytes, increments per ms per thread: 25638
> 1..1
> ok 1 - Stride 64 bytes, increments per ms per thread: 39934
> 1..1
> ok 1 - Stride 128 bytes, increments per ms per thread: 53971
> 1..1
> ok 1 - Stride 256 bytes, increments per ms per thread: 53599
> 1..1
> ok 1 - Stride 512 bytes, increments per ms per thread: 53962
> 1..1
>
> This points to false sharing at strides below 128 bytes.
>
> On a e6500, altivec supported, Model 2.0 (pvr 8040 0120)
>
> for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
> ok 1 - Stride 8 bytes, increments per ms per thread: 9049
> 1..1
> ok 1 - Stride 16 bytes, increments per ms per thread: 9054
> 1..1
> ok 1 - Stride 32 bytes, increments per ms per thread: 18643
> 1..1
> ok 1 - Stride 64 bytes, increments per ms per thread: 37417
> 1..1
> ok 1 - Stride 128 bytes, increments per ms per thread: 37906
> 1..1
> ok 1 - Stride 256 bytes, increments per ms per thread: 37870
> 1..1
> ok 1 - Stride 512 bytes, increments per ms per thread: 37899
> 1..1
>
> This points to false sharing at strides below 64 bytes.
>
> I prefer to be cautious about this cache line size value and aim for
> a value which takes into account the largest known cache line size
> for an architecture, rather than use one that is too small, given the
> large overhead caused by false sharing.
>
> Feedback is welcome.
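
As a rough illustration of why the constant matters, the sketch below pads a per-thread counter to a full cache line using the GCC/Clang aligned attribute. It assumes a 128-byte line. EXAMPLE_CACHE_LINE_SIZE, struct padded_counter and the helpers are hypothetical names for this example and are not taken from liburcu.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical value for this example; not liburcu's definition. */
#define EXAMPLE_CACHE_LINE_SIZE 128

/*
 * The aligned attribute raises the alignment of the type, and sizeof is
 * always a multiple of the alignment, so each array element occupies
 * its own cache line.
 */
struct padded_counter {
	unsigned long count;
} __attribute__((aligned(EXAMPLE_CACHE_LINE_SIZE)));

_Static_assert(sizeof(struct padded_counter) == EXAMPLE_CACHE_LINE_SIZE,
	       "each counter should fill exactly one cache line");

static struct padded_counter per_thread_counters[4];

/* Distance in bytes between two elements of the counter array. */
static size_t counter_distance(size_t a, size_t b)
{
	return (size_t)((char *)&per_thread_counters[b] -
			(char *)&per_thread_counters[a]);
}

/* Returns 1 if element idx starts on a cache line boundary. */
static int counter_is_line_aligned(size_t idx)
{
	return (uintptr_t)&per_thread_counters[idx] %
	       EXAMPLE_CACHE_LINE_SIZE == 0;
}
```

Choosing a constant larger than the real line size only wastes memory, while choosing one that is too small lets neighbouring counters share a line, which is the trade-off described in the quoted paragraph.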

My results are largely similar to yours.

Power9 bare metal (pvr 004e 1202), with 96 threads on 2 nodes:

NUMA:
  NUMA node(s):      2
  NUMA node0 CPU(s): 0-47
  NUMA node8 CPU(s): 48-95

for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -t 96 -s $a; done
ok 1 - Stride 8 bytes, increments per ms per thread: 2569
ok 1 - Stride 16 bytes, increments per ms per thread: 4036
ok 1 - Stride 32 bytes, increments per ms per thread: 7226
ok 1 - Stride 64 bytes, increments per ms per thread: 15385
ok 1 - Stride 128 bytes, increments per ms per thread: 38025 <---
ok 1 - Stride 256 bytes, increments per ms per thread: 37454
ok 1 - Stride 512 bytes, increments per ms per thread: 37310

On the same machine, if I offline all but one core, so running across 4
threads of a single core:

for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -t 4 -s $a; done
ok 1 - Stride 8 bytes, increments per ms per thread: 14542
ok 1 - Stride 16 bytes, increments per ms per thread: 12984
ok 1 - Stride 32 bytes, increments per ms per thread: 22147
ok 1 - Stride 64 bytes, increments per ms per thread: 31378
ok 1 - Stride 128 bytes, increments per ms per thread: 42358 <---
ok 1 - Stride 256 bytes, increments per ms per thread: 41906
ok 1 - Stride 512 bytes, increments per ms per thread: 42060

On a Power10 (pvr 0080 0200), 8 threads (1 big core):

for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -t 8 -s $a; done
ok 1 - Stride 8 bytes, increments per ms per thread: 9235
ok 1 - Stride 16 bytes, increments per ms per thread: 18748
ok 1 - Stride 32 bytes, increments per ms per thread: 28870
ok 1 - Stride 64 bytes, increments per ms per thread: 46794
ok 1 - Stride 128 bytes, increments per ms per thread: 67571 <---
ok 1 - Stride 256 bytes, increments per ms per thread: 67571
ok 1 - Stride 512 bytes, increments per ms per thread: 67570

I tried various other combinations, but in all cases the increment rate
plateaus at a stride of 128 bytes and above.
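
For readers without the test program to hand, the following is a minimal sketch of the kind of measurement involved. It is not the actual test_false_sharing source, and the names run_stride, struct worker and MAX_THREADS are hypothetical. Each thread increments its own counter, and neighbouring counters are placed `stride` bytes apart in a page-aligned buffer. The sketch uses a fixed iteration count and leaves timing to the caller, whereas the real test reports increments per millisecond.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_THREADS 64
#define BUF_ALIGN 4096	/* page-align the buffer so only stride varies */

struct worker {
	volatile unsigned long *counter;	/* this thread's own counter */
	unsigned long iterations;
};

static void *worker_fn(void *arg)
{
	struct worker *w = arg;

	/* volatile forces a real load and store on every iteration */
	for (unsigned long i = 0; i < w->iterations; i++)
		(*w->counter)++;
	return NULL;
}

/*
 * Runs nr_threads threads, each incrementing a private counter located
 * `stride` bytes after its neighbour's. Returns the sum of all counters,
 * or 0 on invalid arguments or failure.
 */
static unsigned long run_stride(size_t stride, int nr_threads,
				unsigned long iterations)
{
	pthread_t tids[MAX_THREADS];
	struct worker workers[MAX_THREADS];
	unsigned long total = 0;
	size_t size;
	char *buf;
	int started = 0;

	if (nr_threads < 1 || nr_threads > MAX_THREADS ||
	    stride < sizeof(unsigned long) ||
	    stride % sizeof(unsigned long) != 0)
		return 0;

	/* aligned_alloc() wants a size that is a multiple of the alignment */
	size = stride * (size_t)nr_threads;
	size = (size + BUF_ALIGN - 1) / BUF_ALIGN * BUF_ALIGN;
	buf = aligned_alloc(BUF_ALIGN, size);
	if (!buf)
		return 0;
	memset(buf, 0, size);

	for (int i = 0; i < nr_threads; i++) {
		workers[i].counter =
			(volatile unsigned long *)(buf + (size_t)i * stride);
		workers[i].iterations = iterations;
		if (pthread_create(&tids[i], NULL, worker_fn, &workers[i]))
			break;
		started++;
	}
	for (int i = 0; i < started; i++)
		pthread_join(tids[i], NULL);
	for (int i = 0; i < started; i++)
		total += *workers[i].counter;

	free(buf);
	return started == nr_threads ? total : 0;
}
```

Timing calls such as run_stride(64, 4, N) and run_stride(128, 4, N) with a monotonic clock should show the same kind of gap as the results above when the smaller stride is below the hardware line size. Pinning threads to specific cores is left out of this sketch.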

cheers