linuxppc-dev.lists.ozlabs.org archive mirror
* Appropriate liburcu cache line size for Power
@ 2024-03-24 12:20 Mathieu Desnoyers
  2024-03-25 20:34 ` Nathan Lynch
  2024-03-26  7:19 ` Michael Ellerman
  0 siblings, 2 replies; 8+ messages in thread
From: Mathieu Desnoyers @ 2024-03-24 12:20 UTC (permalink / raw)
  To: paulmck, Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao
  Cc: linuxppc-dev@lists.ozlabs.org

Hi,

In the powerpc architecture support within the liburcu project [1]
we have a cache line size defined as 256 bytes with the following
comment:

/* Include size of POWER5+ L3 cache lines: 256 bytes */
#define CAA_CACHE_LINE_SIZE     256
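
For context, the constant is used to pad and align data that is
updated concurrently by different threads, so that independently
updated fields do not land on the same cache line. An illustrative
sketch (hypothetical type, not the actual liburcu code):

struct example_per_thread_count {
	unsigned long count;
	/* Pad so that each per-thread counter in an array occupies its
	 * own cache line and updates do not false-share. */
	char padding[CAA_CACHE_LINE_SIZE - sizeof(unsigned long)];
} __attribute__((aligned(CAA_CACHE_LINE_SIZE)));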

I recently received a pull request on github [2] asking to
change this to 128 bytes. All the material provided supports
that the cache line sizes on powerpc are 128 bytes or less (even
L3 on POWER7, POWER8, and POWER9) [3].

I wonder where the 256-byte L3 cache line size for POWER5+
that we have in liburcu comes from, and whether it is the right choice
for a cache line size on all powerpc, considering that the Linux
kernel appears to use a 128-byte cache line size on recent Power
architectures. I recall some benchmark experiments Paul and I did
on a 64-core 1.9GHz POWER5+ machine that benefited from a 256-byte
cache line size, and I suppose this is why we came up with this
value, but I don't have the detailed specs of that machine.

Any feedback on this matter would be appreciated.

Thanks!

Mathieu

[1] https://liburcu.org
[2] https://github.com/urcu/userspace-rcu/pull/22
[3] https://www.7-cpu.com/


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

* Re: Appropriate liburcu cache line size for Power
  2024-03-24 12:20 Appropriate liburcu cache line size for Power Mathieu Desnoyers
@ 2024-03-25 20:34 ` Nathan Lynch
  2024-03-25 21:23   ` Segher Boessenkool
  2024-03-28 18:30   ` Mathieu Desnoyers
  2024-03-26  7:19 ` Michael Ellerman
  1 sibling, 2 replies; 8+ messages in thread
From: Nathan Lynch @ 2024-03-25 20:34 UTC (permalink / raw)
  To: Mathieu Desnoyers, paulmck, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao
  Cc: linuxppc-dev@lists.ozlabs.org

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
> In the powerpc architecture support within the liburcu project [1]
> we have a cache line size defined as 256 bytes with the following
> comment:
>
> /* Include size of POWER5+ L3 cache lines: 256 bytes */
> #define CAA_CACHE_LINE_SIZE     256
>
> I recently received a pull request on github [2] asking to
> change this to 128 bytes. All the material provided supports
> that the cache line sizes on powerpc are 128 bytes or less (even
> L3 on POWER7, POWER8, and POWER9) [3].
>
> I wonder where the 256-byte L3 cache line size for POWER5+
> that we have in liburcu comes from, and whether it is the right choice
> for a cache line size on all powerpc, considering that the Linux
> kernel appears to use a 128-byte cache line size on recent Power
> architectures. I recall some benchmark experiments Paul and I did
> on a 64-core 1.9GHz POWER5+ machine that benefited from a 256-byte
> cache line size, and I suppose this is why we came up with this
> value, but I don't have the detailed specs of that machine.
>
> Any feedback on this matter would be appreciated.

For what it's worth, I found a copy of an IBM Journal of Research &
Development article confirming that POWER5's L3 had a 256-byte line
size:

  Each slice [of the L3] is 12-way set-associative, with 4,096
  congruence classes of 256-byte lines managed as two 128-byte sectors
  to match the L2 line size.

https://www.eecg.utoronto.ca/~moshovos/ACA08/readings/power5.pdf

I don't know of any reason to prefer 256 over 128 for current Power
processors though.

* Re: Appropriate liburcu cache line size for Power
  2024-03-25 20:34 ` Nathan Lynch
@ 2024-03-25 21:23   ` Segher Boessenkool
  2024-03-28 18:30   ` Mathieu Desnoyers
  1 sibling, 0 replies; 8+ messages in thread
From: Segher Boessenkool @ 2024-03-25 21:23 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: paulmck, Aneesh Kumar K.V, Mathieu Desnoyers, Nicholas Piggin,
	Naveen N. Rao, linuxppc-dev@lists.ozlabs.org

On Mon, Mar 25, 2024 at 03:34:30PM -0500, Nathan Lynch wrote:
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
> For what it's worth, I found a copy of an IBM Journal of Research &
> Development article confirming that POWER5's L3 had a 256-byte line
> size:
> 
>   Each slice [of the L3] is 12-way set-associative, with 4,096
>   congruence classes of 256-byte lines managed as two 128-byte sectors
>   to match the L2 line size.
> 
> https://www.eecg.utoronto.ca/~moshovos/ACA08/readings/power5.pdf
> 
> I don't know of any reason to prefer 256 over 128 for current Power
> processors though.

The reason some old CPUs use bigger physical cache line sizes is to have
fewer cache lines, which speeds up lookup, or reduces power consumption
of lookup, or both.  This isn't trivial at all when implemented as a
parallel read and compare of all tags, which was the usual way to do
things long ago.

Nowadays a way predictor is usually used, severely limiting the number
of tags to be compared.  So we can always use a 128B physical line size
now.  Note that this was physical only; everything looked like 128B on
a P5 system as well.

P5 wasn't the first like this, fwiw; look at the L2 on a 604 for example :-)


Segher

* Re: Appropriate liburcu cache line size for Power
  2024-03-24 12:20 Appropriate liburcu cache line size for Power Mathieu Desnoyers
  2024-03-25 20:34 ` Nathan Lynch
@ 2024-03-26  7:19 ` Michael Ellerman
  2024-03-26 14:37   ` Mathieu Desnoyers
  2024-03-26 18:20   ` Segher Boessenkool
  1 sibling, 2 replies; 8+ messages in thread
From: Michael Ellerman @ 2024-03-26  7:19 UTC (permalink / raw)
  To: Mathieu Desnoyers, paulmck, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao
  Cc: linuxppc-dev@lists.ozlabs.org

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
> Hi,

Hi Mathieu,

> In the powerpc architecture support within the liburcu project [1]
> we have a cache line size defined as 256 bytes with the following
> comment:
>
> /* Include size of POWER5+ L3 cache lines: 256 bytes */
> #define CAA_CACHE_LINE_SIZE     256
>
> I recently received a pull request on github [2] asking to
> change this to 128 bytes. All the material provided supports
> that the cache line sizes on powerpc are 128 bytes or less (even
> L3 on POWER7, POWER8, and POWER9) [3].
>
> I wonder where the 256-byte L3 cache line size for POWER5+
> that we have in liburcu comes from, and whether it is the right choice
> for a cache line size on all powerpc, considering that the Linux
> kernel appears to use a 128-byte cache line size on recent Power
> architectures. I recall some benchmark experiments Paul and I did
> on a 64-core 1.9GHz POWER5+ machine that benefited from a 256-byte
> cache line size, and I suppose this is why we came up with this
> value, but I don't have the detailed specs of that machine.
>
> Any feedback on this matter would be appreciated.

The ISA doesn't specify the cache line size, other than it is smaller
than a page.

In practice all the 64-bit IBM server CPUs I'm aware of have used 128
bytes. There are some 64-bit CPUs that use 64 bytes, eg. pasemi PA6T and
Freescale e6500.

It is possible to discover at runtime via AUXV headers. But that's no
use if you want a compile-time constant.
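
For completeness, a minimal sketch of the runtime query, assuming
Linux/glibc where getauxval() and the powerpc AT_DCACHEBSIZE auxv
entry are available:

#include <elf.h>	/* AT_DCACHEBSIZE */
#include <stdio.h>
#include <sys/auxv.h>	/* getauxval() */

int main(void)
{
	/* Data cache block size as passed by the kernel in the ELF
	 * auxiliary vector; 0 means the entry was not provided. */
	unsigned long dcbs = getauxval(AT_DCACHEBSIZE);

	printf("data cache block size: %lu bytes\n", dcbs);
	return 0;
}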

I'm happy to run some benchmarks if you can point me at what to run. I
had a poke around the repository and found short_bench, but it seemed to
run for a very long time.

cheers

* Re: Appropriate liburcu cache line size for Power
  2024-03-26  7:19 ` Michael Ellerman
@ 2024-03-26 14:37   ` Mathieu Desnoyers
  2024-04-02  7:17     ` Michael Ellerman
  2024-03-26 18:20   ` Segher Boessenkool
  1 sibling, 1 reply; 8+ messages in thread
From: Mathieu Desnoyers @ 2024-03-26 14:37 UTC (permalink / raw)
  To: Michael Ellerman, paulmck, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao
  Cc: linuxppc-dev@lists.ozlabs.org

On 2024-03-26 03:19, Michael Ellerman wrote:
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
>> Hi,
> 
> Hi Mathieu,
> 
>> In the powerpc architecture support within the liburcu project [1]
>> we have a cache line size defined as 256 bytes with the following
>> comment:
>>
>> /* Include size of POWER5+ L3 cache lines: 256 bytes */
>> #define CAA_CACHE_LINE_SIZE     256
>>
>> I recently received a pull request on github [2] asking to
>> change this to 128 bytes. All the material provided supports
>> that the cache line sizes on powerpc are 128 bytes or less (even
>> L3 on POWER7, POWER8, and POWER9) [3].
>>
>> I wonder where the 256-byte L3 cache line size for POWER5+
>> that we have in liburcu comes from, and whether it is the right choice
>> for a cache line size on all powerpc, considering that the Linux
>> kernel appears to use a 128-byte cache line size on recent Power
>> architectures. I recall some benchmark experiments Paul and I did
>> on a 64-core 1.9GHz POWER5+ machine that benefited from a 256-byte
>> cache line size, and I suppose this is why we came up with this
>> value, but I don't have the detailed specs of that machine.
>>
>> Any feedback on this matter would be appreciated.
> 
> The ISA doesn't specify the cache line size, other than it is smaller
> than a page.
> 
> In practice all the 64-bit IBM server CPUs I'm aware of have used 128
> bytes. There are some 64-bit CPUs that use 64 bytes, eg. pasemi PA6T and
> Freescale e6500.
> 
> It is possible to discover at runtime via AUXV headers. But that's no
> use if you want a compile-time constant.

Indeed, and this CAA_CACHE_LINE_SIZE is part of the liburcu powerpc ABI,
so changing this would require a soname bump, which I don't want to do
without really good reasons.
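
To illustrate with a hypothetical type (not an actual liburcu
structure): anything in the installed headers that is padded or
aligned with the constant changes size and field offsets if the value
changes, so binaries built against the old value would disagree with
the library on layout.

struct example_slot {
	void *data;
	char pad[CAA_CACHE_LINE_SIZE - sizeof(void *)];
};

/* An application compiled against the old headers bakes the 256-byte
 * element size into its array indexing; a library rebuilt with 128
 * would disagree with it on element offsets. */
#define EXAMPLE_SLOT_OFFSET(i)	((i) * sizeof(struct example_slot))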

> 
> I'm happy to run some benchmarks if you can point me at what to run. I
> had a poke around the repository and found short_bench, but it seemed to
> run for a very long time.

I've created a dedicated test program for this, see:

https://github.com/compudj/userspace-rcu-dev/tree/false-sharing
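
The idea, roughly: each thread repeatedly increments its own counter,
with the counters placed a configurable stride apart in a single
shared buffer. A simplified sketch of that idea (assuming pthreads,
not the actual test_false_sharing source):

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define NR_THREADS	4

static size_t stride = 8;	/* bytes between per-thread counters (-s) */
static unsigned char *buf;	/* shared buffer holding all counters */
static volatile int stop;

static void *incr_thread(void *arg)
{
	/* Thread N's counter lives N * stride bytes into the buffer.
	 * If stride is smaller than the cache line size, neighbouring
	 * counters false-share a line and it bounces between CPUs. */
	volatile uint64_t *c = (uint64_t *)(buf + (uintptr_t)arg * stride);

	while (!stop)
		(*c)++;
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid[NR_THREADS];

	if (argc > 1)
		stride = strtoul(argv[1], NULL, 0);
	if (stride < sizeof(uint64_t))
		stride = sizeof(uint64_t);
	buf = calloc(NR_THREADS, stride);

	for (uintptr_t i = 0; i < NR_THREADS; i++)
		pthread_create(&tid[i], NULL, incr_thread, (void *)i);
	sleep(1);	/* let the threads hammer their counters */
	stop = 1;
	for (int i = 0; i < NR_THREADS; i++)
		pthread_join(tid[i], NULL);

	/* The real test reads the counters back and reports increments
	 * per millisecond per thread for each stride. */
	free(buf);
	return 0;
}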

example use:

On an AMD Ryzen 7 PRO 6850U with Radeon Graphics:

for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
ok 1 - Stride 8 bytes, increments per ms per thread: 21320
1..1
ok 1 - Stride 16 bytes, increments per ms per thread: 22657
1..1
ok 1 - Stride 32 bytes, increments per ms per thread: 47599
1..1
ok 1 - Stride 64 bytes, increments per ms per thread: 531364
1..1
ok 1 - Stride 128 bytes, increments per ms per thread: 523634
1..1
ok 1 - Stride 256 bytes, increments per ms per thread: 519402
1..1
ok 1 - Stride 512 bytes, increments per ms per thread: 520651
1..1

This would point to false-sharing starting with strides smaller than
64 bytes. I get similar results (false-sharing under 64 bytes) with an
AMD EPYC 9654 96-Core Processor.

The test program runs 4 threads by default, which can be overridden
with "-t N". This may be needed if you want this to use all cores from
a larger machine. See "-h" for options.

On a POWER9 (architected), altivec supported:

for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
ok 1 - Stride 8 bytes, increments per ms per thread: 12264
1..1
ok 1 - Stride 16 bytes, increments per ms per thread: 12276
1..1
ok 1 - Stride 32 bytes, increments per ms per thread: 25638
1..1
ok 1 - Stride 64 bytes, increments per ms per thread: 39934
1..1
ok 1 - Stride 128 bytes, increments per ms per thread: 53971
1..1
ok 1 - Stride 256 bytes, increments per ms per thread: 53599
1..1
ok 1 - Stride 512 bytes, increments per ms per thread: 53962
1..1

This points at false-sharing below 128 bytes stride.

On a e6500, altivec supported, Model 2.0 (pvr 8040 0120)

for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
ok 1 - Stride 8 bytes, increments per ms per thread: 9049
1..1
ok 1 - Stride 16 bytes, increments per ms per thread: 9054
1..1
ok 1 - Stride 32 bytes, increments per ms per thread: 18643
1..1
ok 1 - Stride 64 bytes, increments per ms per thread: 37417
1..1
ok 1 - Stride 128 bytes, increments per ms per thread: 37906
1..1
ok 1 - Stride 256 bytes, increments per ms per thread: 37870
1..1
ok 1 - Stride 512 bytes, increments per ms per thread: 37899
1..1

Which points at false-sharing below 64 bytes.

I prefer to be cautious about this cache line size value and aim for
a value which takes into account the largest known cache line size
for an architecture, rather than use a value that is too small, given
the large overhead caused by false-sharing.

Feedback is welcome.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


* Re: Appropriate liburcu cache line size for Power
  2024-03-26  7:19 ` Michael Ellerman
  2024-03-26 14:37   ` Mathieu Desnoyers
@ 2024-03-26 18:20   ` Segher Boessenkool
  1 sibling, 0 replies; 8+ messages in thread
From: Segher Boessenkool @ 2024-03-26 18:20 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: paulmck, Aneesh Kumar K.V, Mathieu Desnoyers, Nicholas Piggin,
	Naveen N. Rao, linuxppc-dev@lists.ozlabs.org

On Tue, Mar 26, 2024 at 06:19:38PM +1100, Michael Ellerman wrote:
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
> The ISA doesn't specify the cache line size, other than it is smaller
> than a page.

It also says it is "aligned".  Nowhere is it said what an aligned size
is, but it seems clear it has to be a power of two.

> In practice all the 64-bit IBM server CPUs I'm aware of have used 128
> bytes.

Yup.  It is 128B on p3 already.

> It is possible to discover at runtime via AUXV headers. But that's no
> use if you want a compile-time constant.

The architecture already does not require the data block size to be
equal to the instruction block size.  But many programs subscribe to an
overly simplified worldview, which is a big reason everything is 128B on
all modern PowerPC.

It is quite a nice tradeoff size; there has to be a huge change in the
world for this to ever change :-)


Segher

* Re: Appropriate liburcu cache line size for Power
  2024-03-25 20:34 ` Nathan Lynch
  2024-03-25 21:23   ` Segher Boessenkool
@ 2024-03-28 18:30   ` Mathieu Desnoyers
  1 sibling, 0 replies; 8+ messages in thread
From: Mathieu Desnoyers @ 2024-03-28 18:30 UTC (permalink / raw)
  To: Nathan Lynch, paulmck, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao
  Cc: linuxppc-dev@lists.ozlabs.org

On 2024-03-25 16:34, Nathan Lynch wrote:
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
>> In the powerpc architecture support within the liburcu project [1]
>> we have a cache line size defined as 256 bytes with the following
>> comment:
>>
>> /* Include size of POWER5+ L3 cache lines: 256 bytes */
>> #define CAA_CACHE_LINE_SIZE     256
>>
>> I recently received a pull request on github [2] asking to
>> change this to 128 bytes. All the material provided supports
>> that the cache line sizes on powerpc are 128 bytes or less (even
>> L3 on POWER7, POWER8, and POWER9) [3].
>>
>> I wonder where the 256-byte L3 cache line size for POWER5+
>> that we have in liburcu comes from, and whether it is the right choice
>> for a cache line size on all powerpc, considering that the Linux
>> kernel appears to use a 128-byte cache line size on recent Power
>> architectures. I recall some benchmark experiments Paul and I did
>> on a 64-core 1.9GHz POWER5+ machine that benefited from a 256-byte
>> cache line size, and I suppose this is why we came up with this
>> value, but I don't have the detailed specs of that machine.
>>
>> Any feedback on this matter would be appreciated.
> 
> For what it's worth, I found a copy of an IBM Journal of Research &
> Development article confirming that POWER5's L3 had a 256-byte line
> size:
> 
>    Each slice [of the L3] is 12-way set-associative, with 4,096
>    congruence classes of 256-byte lines managed as two 128-byte sectors
>    to match the L2 line size.
> 
> https://www.eecg.utoronto.ca/~moshovos/ACA08/readings/power5.pdf
> 
> I don't know of any reason to prefer 256 over 128 for current Power
> processors though.

Thanks for the pointer. I will add a reference to it in the liburcu
source code to explain the cache line size choice.
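
Something along these lines, as a sketch of what the updated comment
could look like:

/*
 * Keep the largest known cache line size for the architecture: the
 * POWER5/POWER5+ L3 uses 256-byte lines, managed as two 128-byte
 * sectors to match the L2 line size.  See
 * https://www.eecg.utoronto.ca/~moshovos/ACA08/readings/power5.pdf
 * More recent IBM Power processors use 128-byte cache lines.
 */
#define CAA_CACHE_LINE_SIZE     256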

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


* Re: Appropriate liburcu cache line size for Power
  2024-03-26 14:37   ` Mathieu Desnoyers
@ 2024-04-02  7:17     ` Michael Ellerman
  0 siblings, 0 replies; 8+ messages in thread
From: Michael Ellerman @ 2024-04-02  7:17 UTC (permalink / raw)
  To: Mathieu Desnoyers, paulmck, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao
  Cc: linuxppc-dev@lists.ozlabs.org

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
> On 2024-03-26 03:19, Michael Ellerman wrote:
>> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
>>> In the powerpc architecture support within the liburcu project [1]
>>> we have a cache line size defined as 256 bytes with the following
>>> comment:
>>>
>>> /* Include size of POWER5+ L3 cache lines: 256 bytes */
>>> #define CAA_CACHE_LINE_SIZE     256
>>>
>>> I recently received a pull request on github [2] asking to
>>> change this to 128 bytes. All the material provided supports
>>> that the cache line sizes on powerpc are 128 bytes or less (even
>>> L3 on POWER7, POWER8, and POWER9) [3].
>>>
>>> I wonder where the 256-byte L3 cache line size for POWER5+
>>> that we have in liburcu comes from, and whether it is the right choice
>>> for a cache line size on all powerpc, considering that the Linux
>>> kernel appears to use a 128-byte cache line size on recent Power
>>> architectures. I recall some benchmark experiments Paul and I did
>>> on a 64-core 1.9GHz POWER5+ machine that benefited from a 256-byte
>>> cache line size, and I suppose this is why we came up with this
>>> value, but I don't have the detailed specs of that machine.
>>>
>>> Any feedback on this matter would be appreciated.
>> 
>> The ISA doesn't specify the cache line size, other than it is smaller
>> than a page.
>> 
>> In practice all the 64-bit IBM server CPUs I'm aware of have used 128
>> bytes. There are some 64-bit CPUs that use 64 bytes, eg. pasemi PA6T and
>> Freescale e6500.
>> 
>> It is possible to discover at runtime via AUXV headers. But that's no
>> use if you want a compile-time constant.
>
> Indeed, and this CAA_CACHE_LINE_SIZE is part of the liburcu powerpc ABI,
> so changing this would require a soname bump, which I don't want to do
> without really good reasons.
>
>> 
>> I'm happy to run some benchmarks if you can point me at what to run. I
>> had a poke around the repository and found short_bench, but it seemed to
>> run for a very long time.
>
> I've created a dedicated test program for this, see:
>
> https://github.com/compudj/userspace-rcu-dev/tree/false-sharing

Perfect :)

> The test program runs 4 threads by default, which can be overridden
> with "-t N". This may be needed if you want this to use all cores from
> a larger machine. See "-h" for options.
>
> On a POWER9 (architected), altivec supported:
>
> for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
> ok 1 - Stride 8 bytes, increments per ms per thread: 12264
> 1..1
> ok 1 - Stride 16 bytes, increments per ms per thread: 12276
> 1..1
> ok 1 - Stride 32 bytes, increments per ms per thread: 25638
> 1..1
> ok 1 - Stride 64 bytes, increments per ms per thread: 39934
> 1..1
> ok 1 - Stride 128 bytes, increments per ms per thread: 53971
> 1..1
> ok 1 - Stride 256 bytes, increments per ms per thread: 53599
> 1..1
> ok 1 - Stride 512 bytes, increments per ms per thread: 53962
> 1..1
>
> This points at false-sharing below 128 bytes stride.
>
> On a e6500, altivec supported, Model 2.0 (pvr 8040 0120)
>
> for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
> ok 1 - Stride 8 bytes, increments per ms per thread: 9049
> 1..1
> ok 1 - Stride 16 bytes, increments per ms per thread: 9054
> 1..1
> ok 1 - Stride 32 bytes, increments per ms per thread: 18643
> 1..1
> ok 1 - Stride 64 bytes, increments per ms per thread: 37417
> 1..1
> ok 1 - Stride 128 bytes, increments per ms per thread: 37906
> 1..1
> ok 1 - Stride 256 bytes, increments per ms per thread: 37870
> 1..1
> ok 1 - Stride 512 bytes, increments per ms per thread: 37899
> 1..1
>
> Which points at false-sharing below 64 bytes.
>
> I prefer to be cautious about this cache line size value and aim for
> a value which takes into account the largest known cache line size
> for an architecture, rather than use a value that is too small, given
> the large overhead caused by false-sharing.
>
> Feedback is welcome.

My results are largely similar to yours.

Power9 bare metal (pvr 004e 1202), with 96 threads on 2 nodes:

  NUMA:
    NUMA node(s):           2
    NUMA node0 CPU(s):      0-47
    NUMA node8 CPU(s):      48-95
  
  for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -t 96 -s $a; done
  ok 1 - Stride 8 bytes, increments per ms per thread: 2569
  ok 1 - Stride 16 bytes, increments per ms per thread: 4036
  ok 1 - Stride 32 bytes, increments per ms per thread: 7226
  ok 1 - Stride 64 bytes, increments per ms per thread: 15385
  ok 1 - Stride 128 bytes, increments per ms per thread: 38025          <---
  ok 1 - Stride 256 bytes, increments per ms per thread: 37454
  ok 1 - Stride 512 bytes, increments per ms per thread: 37310

On the same machine, if I offline all but one core (so running across 4
threads of a single core):

  for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -t 4 -s $a; done
  ok 1 - Stride 8 bytes, increments per ms per thread: 14542
  ok 1 - Stride 16 bytes, increments per ms per thread: 12984
  ok 1 - Stride 32 bytes, increments per ms per thread: 22147
  ok 1 - Stride 64 bytes, increments per ms per thread: 31378
  ok 1 - Stride 128 bytes, increments per ms per thread: 42358          <---
  ok 1 - Stride 256 bytes, increments per ms per thread: 41906
  ok 1 - Stride 512 bytes, increments per ms per thread: 42060

On a Power10 (pvr 0080 0200), 8 threads (1 big core):

  for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -t 8 -s $a; done
  ok 1 - Stride 8 bytes, increments per ms per thread: 9235
  ok 1 - Stride 16 bytes, increments per ms per thread: 18748
  ok 1 - Stride 32 bytes, increments per ms per thread: 28870
  ok 1 - Stride 64 bytes, increments per ms per thread: 46794
  ok 1 - Stride 128 bytes, increments per ms per thread: 67571          <---
  ok 1 - Stride 256 bytes, increments per ms per thread: 67571
  ok 1 - Stride 512 bytes, increments per ms per thread: 67570

I tried various other combinations, but in all cases the increments
plateau at 128 bytes and above.

cheers
