From: Eric Dumazet <dada1@cosmosbay.com>
To: Rusty Russell <rusty@rustcorp.com.au>
Cc: David Miller <davem@davemloft.net>,
rostedt@goodmis.org, akpm@linux-foundation.org,
linux-kernel@vger.kernel.org, mathieu.desnoyers@polymtl.ca,
paulus@samba.org, benh@kernel.crashing.org,
linux-ia64@vger.kernel.org, linux-s390@vger.kernel.org
Subject: Re: local_add_return
Date: Wed, 17 Dec 2008 00:59:00 +0100
Message-ID: <494840C4.50000@cosmosbay.com>
In-Reply-To: <200812170908.05423.rusty@rustcorp.com.au>
Rusty Russell wrote:
> On Tuesday 16 December 2008 17:43:14 David Miller wrote:
>> Here ya go:
>
> Very interesting. There's a little noise there (that first local_inc of 243
> is wrong), but the picture is clear: trivalue is the best implementation for
> sparc64.
>
> Note: trivalue uses 3 values, so instead of hitting random values across 8MB
> it's across 24MB, and despite the resulting cache damage it's 15% faster. The
> cpu_local_inc test is a single value, so no cache effects: it shows trivalue
> to be 3 to 3.5 times faster in the cache-hot case.
>
> This sucks, because it really does mean that there's no one-size-fits-all
> implementation of local_t. There's also no platform yet where atomic_long_t
> is the right choice; and that's the default!
>
> Any chance of an IA64 or s390 run? You can normalize if you like, since
> it's only to compare the different approaches.
>
> Cheers,
> Rusty.
>
> Benchmarks for local_t variants
>
> (This patch also fixes the x86 cpu_local_* macros, which are obviously
> unused).
>
> I chose a large array (1M longs) for the inc/add/add_return tests so
> the trivalue case would show some cache pressure.
>
> The cpu_local_inc case is always cache-hot, so it's not comparable to
> the others.
It would be good to differentiate the results according to whether the data is already in cache or not...
>
> Time in ns per iteration (brackets is with CONFIG_PREEMPT=y):
>
>               inc   add   add_return   cpu_local_inc   read
> x86-32: 2.13 GHz Core Duo 2
> atomic_long   118   118   115          17              17
Really strange that atomic_long performs so badly here.
LOCK + data not in cache -> really, really bad...
> irqsave/rest  77    78    77           23              16
> trivalue      45    45    127          3(6)            21
> local_t       36    36    36           1(5)            17
>
> x86-64: 2.6 GHz Dual-Core AMD Opteron 2218
> atomic_long   55    60    -            6               19
> irqsave/rest  54    54    -            11              19
> trivalue      47    47    -            5               28
> local_t       47    46    -            1               19
>
Running local_t variant benchmarks
atomic_long: local_inc=395001846/11 local_add=395000325/11 cpu_local_inc=362000295/10 local_read=49000040/1 local_add_return=396000322/11 (total was 1728053248)
irqsave/restore: local_inc=498000400/14 local_add=496000395/14 cpu_local_inc=486000384/14 local_read=68000054/2 local_add_return=502000394/14 (total was 1728053248)
trivalue: local_inc=1325001024/39 local_add=1324001226/39 cpu_local_inc=81000080/2 local_read=786000766/23 local_add_return=4193003781/124 (total was 1728053248)
local_t: local_inc=69000059/2 local_add=69000058/2 cpu_local_inc=42000035/1 local_read=50000043/1 local_add_return=90000076/2 (total was 1728053248, warm_total 62914562)
Intel(R) Xeon(R) CPU E5450 @ 3.00GHz
Two quad-core CPUs, x86-32 kernel.
It seems Core2 is really better than Core Duo 2,
or its cache is big enough to hold the array of your test...
(at least for l1 & l2, their 4 MB working set fits in cache)
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5450 @ 3.00GHz
stepping : 6
cpu MHz : 3000.099
cache size : 6144 KB <<<< yes, that's big :) >>>>
If I double the size of the working set
#define NUM_LOCAL_TEST (2*1024*1024)
then I get quite different numbers:
Running local_t variant benchmarks
atomic_long: local_inc=6729007264/100 local_add=6727005943/100 cpu_local_inc=724000569/10 local_read=1030000784/15 local_add_return=6623004616/98 (total was 3456106496)
irqsave/restore: local_inc=4458002796/66 local_add=4459001998/66 cpu_local_inc=971000381/14 local_read=1060000389/15 local_add_return=4528001388/67 (total was 3456106496)
trivalue: local_inc=2871000855/42 local_add=2867000976/42 cpu_local_inc=162000052/2 local_read=1747000551/26 local_add_return=8829002352/131 (total was 3456106496)
local_t: local_inc=2210000492/32 local_add=2206000460/32 cpu_local_inc=84000017/1 local_read=1029000203/15 local_add_return=2216000415/33 (total was 3456106496, warm_total 125829124)
If I now reduce NUM_LOCAL_TEST to 256*1024, so that even the trivalue l3 fits in cache:
Running local_t variant benchmarks
atomic_long: local_inc=98984929/11 local_add=98984889/11 cpu_local_inc=89986248/10 local_read=11998165/1 local_add_return=99003292/11 (total was 2579496960)
irqsave/restore: local_inc=124000102/14 local_add=124000102/14 cpu_local_inc=121000100/14 local_read=17000013/2 local_add_return=126000103/15 (total was 2579496960)
trivalue: local_inc=21000017/2 local_add=20000016/2 cpu_local_inc=20000017/2 local_read=25000021/2 local_add_return=136000110/16 (total was 2579496960)
local_t: local_inc=17000014/2 local_add=17000015/2 cpu_local_inc=11000009/1 local_read=12000010/1 local_add_return=23000019/2 (total was 2579496960, warm_total 15728642)
About trivalues: using them in percpu_counter local storage (one trivalue for each CPU)
would make the accuracy a little bit lazier...