Re: local_add_return - Eric Dumazet

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Eric Dumazet <dada1@cosmosbay.com>
To: Rusty Russell <rusty@rustcorp.com.au>
Cc: David Miller <davem@davemloft.net>,
	rostedt@goodmis.org, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, mathieu.desnoyers@polymtl.ca,
	paulus@samba.org, benh@kernel.crashing.org,
	linux-ia64@vger.kernel.org, linux-s390@vger.kernel.org
Subject: Re: local_add_return
Date: Tue, 16 Dec 2008 23:59:00 +0000	[thread overview]
Message-ID: <494840C4.50000@cosmosbay.com> (raw)
In-Reply-To: <200812170908.05423.rusty@rustcorp.com.au>

Rusty Russell a Ècrit :
> On Tuesday 16 December 2008 17:43:14 David Miller wrote:
>> Here ya go:
> 
> Very interesting.  There's a little noise there (that first local_inc of 243
> is wrong), but the picture is clear: trivalue is the best implementation for
> sparc64.
> 
> Note: trivalue uses 3 values, so instead of hitting random values across 8MB
> it's across 24MB, and despite the resulting cache damage it's 15% faster.  The
> cpu_local_inc test is a single value, so no cache effects: it shows trivalue
> to be 3 to 3.5 times faster in the cache-hot case.
> 
> This sucks, because it really does mean that there's no one-size-fits-all
> implementation of local_t.  There's also no platform yet where atomic_long_t
> is the right choice; and that's the default!
> 
> Any chance of an IA64 or s390 run?  You can normalize if you like, since
> it's only to compare the different approaches.
> 
> Cheers,
> Rusty.
> 
> Benchmarks for local_t variants
> 
> (This patch also fixes the x86 cpu_local_* macros, which are obviously
> unused).
> 
> I chose a large array (1M longs) for the inc/add/add_return tests so
> the trivalue case would show some cache pressure.
> 
> The cpu_local_inc case is always cache-hot, so it's not comparable to
> the others.

Would be good to differenciate results, if data is already in cache or not...

> 
> Time in ns per iteration (brackets is with CONFIG_PREEMPT=y):
> 
> 		inc	add	add_return	cpu_local_inc	read
> x86-32: 2.13 Ghz Core Duo 2
> atomic_long	118	118	115		17		17

really strange atomic_long performs so badly here.
LOCK + data not in cache -> really really bad...

> irqsave/rest	77	78	77		23		16
> trivalue	45	45	127		3(6)		21
> local_t		36	36	36		1(5)		17
> 
> x86-64: 2.6 GHz Dual-Core AMD Opteron 2218
> atomic_long	55	60	-		6		19
> irqsave/rest	54	54	-		11		19
> trivalue	47	47	-		5		28
> local_t		47	46	-		1		19
> 

Running local_t variant benchmarks
atomic_long: local_inc95001846/11 local_add95000325/11 cpu_local_inc62000295/10 local_readI000040/1 local_add_return96000322/11 (total was 1728053248)
irqsave/restore: local_incI8000400/14 local_addI6000395/14 cpu_local_incH6000384/14 local_readh000054/2 local_add_returnP2000394/14 (total was 1728053248)
trivalue: local_inc\x1325001024/39 local_add\x1324001226/39 cpu_local_incÅ000080/2 local_readx6000766/23 local_add_returnA93003781/124 (total was 1728053248)
local_t: local_inci000059/2 local_addi000058/2 cpu_local_incB000035/1 local_readP000043/1 local_add_returnê000076/2 (total was 1728053248, warm_total 62914562)


Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz

two quadcore cpus, x86-32 kernel

It seems Core2 are really better than Core Duo 2,
or their cache is big enough to hold the array of your test...

(at least for l1 & l2, their 4Mbytes working set fits in cache)

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz
stepping        : 6
cpu MHz         : 3000.099
cache size      : 6144 KB    <<<< yes, thats big :) >>>>

If I double size of working set

#define NUM_LOCAL_TEST (2*1024*1024)

then I get quite different numbers :

Running local_t variant benchmarks
atomic_long: local_incg29007264/100 local_addg27005943/100 cpu_local_incr4000569/10 local_read\x1030000784/15 local
_add_returnf23004616/98 (total was 3456106496)
irqsave/restore: local_incD58002796/66 local_addD59001998/66 cpu_local_incó1000381/14 local_read\x1060000389/15 loc
al_add_returnE28001388/67 (total was 3456106496)
trivalue: local_inc(71000855/42 local_add(67000976/42 cpu_local_inc\x162000052/2 local_read\x1747000551/26 local_add_r
eturnà29002352/131 (total was 3456106496)
local_t: local_inc"10000492/32 local_add"06000460/32 cpu_local_incÑ000017/1 local_read\x1029000203/15 local_add_ret
urn"16000415/33 (total was 3456106496, warm_total 125829124)

If now I reduce NUM_LOCAL_TEST to 256*1024 so that even trivalue l3 fits cache.

Running local_t variant benchmarks
atomic_long: local_incò984929/11 local_addò984889/11 cpu_local_incâ986248/10 local_read\x11998165/1 local_add_retur
nô003292/11 (total was 2579496960)
irqsave/restore: local_inc\x124000102/14 local_add\x124000102/14 cpu_local_inc\x121000100/14 local_read\x17000013/2 local_ad
d_return\x126000103/15 (total was 2579496960)
trivalue: local_inc!000017/2 local_add 000016/2 cpu_local_inc 000017/2 local_read%000021/2 local_add_return\x1360
00110/16 (total was 2579496960)
local_t: local_inc\x17000014/2 local_add\x17000015/2 cpu_local_inc\x11000009/1 local_read\x12000010/1 local_add_return#000
019/2 (total was 2579496960, warm_total 15728642)



About trivalues, their use in percpu_counter local storage (one trivalue for each cpu)
would make the accuracy a litle bit more lazy...


--
To unsubscribe from this list: send the line "unsubscribe linux-ia64" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

WARNING: multiple messages have this Message-ID (diff)

From: Eric Dumazet <dada1@cosmosbay.com>
To: Rusty Russell <rusty@rustcorp.com.au>
Cc: David Miller <davem@davemloft.net>,
	rostedt@goodmis.org, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, mathieu.desnoyers@polymtl.ca,
	paulus@samba.org, benh@kernel.crashing.org,
	linux-ia64@vger.kernel.org, linux-s390@vger.kernel.org
Subject: Re: local_add_return
Date: Wed, 17 Dec 2008 00:59:00 +0100	[thread overview]
Message-ID: <494840C4.50000@cosmosbay.com> (raw)
In-Reply-To: <200812170908.05423.rusty@rustcorp.com.au>

Rusty Russell a écrit :
> On Tuesday 16 December 2008 17:43:14 David Miller wrote:
>> Here ya go:
> 
> Very interesting.  There's a little noise there (that first local_inc of 243
> is wrong), but the picture is clear: trivalue is the best implementation for
> sparc64.
> 
> Note: trivalue uses 3 values, so instead of hitting random values across 8MB
> it's across 24MB, and despite the resulting cache damage it's 15% faster.  The
> cpu_local_inc test is a single value, so no cache effects: it shows trivalue
> to be 3 to 3.5 times faster in the cache-hot case.
> 
> This sucks, because it really does mean that there's no one-size-fits-all
> implementation of local_t.  There's also no platform yet where atomic_long_t
> is the right choice; and that's the default!
> 
> Any chance of an IA64 or s390 run?  You can normalize if you like, since
> it's only to compare the different approaches.
> 
> Cheers,
> Rusty.
> 
> Benchmarks for local_t variants
> 
> (This patch also fixes the x86 cpu_local_* macros, which are obviously
> unused).
> 
> I chose a large array (1M longs) for the inc/add/add_return tests so
> the trivalue case would show some cache pressure.
> 
> The cpu_local_inc case is always cache-hot, so it's not comparable to
> the others.

Would be good to differenciate results, if data is already in cache or not...

> 
> Time in ns per iteration (brackets is with CONFIG_PREEMPT=y):
> 
> 		inc	add	add_return	cpu_local_inc	read
> x86-32: 2.13 Ghz Core Duo 2
> atomic_long	118	118	115		17		17

really strange atomic_long performs so badly here.
LOCK + data not in cache -> really really bad...

> irqsave/rest	77	78	77		23		16
> trivalue	45	45	127		3(6)		21
> local_t		36	36	36		1(5)		17
> 
> x86-64: 2.6 GHz Dual-Core AMD Opteron 2218
> atomic_long	55	60	-		6		19
> irqsave/rest	54	54	-		11		19
> trivalue	47	47	-		5		28
> local_t		47	46	-		1		19
> 

Running local_t variant benchmarks
atomic_long: local_inc=395001846/11 local_add=395000325/11 cpu_local_inc=362000295/10 local_read=49000040/1 local_add_return=396000322/11 (total was 1728053248)
irqsave/restore: local_inc=498000400/14 local_add=496000395/14 cpu_local_inc=486000384/14 local_read=68000054/2 local_add_return=502000394/14 (total was 1728053248)
trivalue: local_inc=1325001024/39 local_add=1324001226/39 cpu_local_inc=81000080/2 local_read=786000766/23 local_add_return=4193003781/124 (total was 1728053248)
local_t: local_inc=69000059/2 local_add=69000058/2 cpu_local_inc=42000035/1 local_read=50000043/1 local_add_return=90000076/2 (total was 1728053248, warm_total 62914562)


Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz

two quadcore cpus, x86-32 kernel

It seems Core2 are really better than Core Duo 2,
or their cache is big enough to hold the array of your test...

(at least for l1 & l2, their 4Mbytes working set fits in cache)

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz
stepping        : 6
cpu MHz         : 3000.099
cache size      : 6144 KB    <<<< yes, thats big :) >>>>

If I double size of working set

#define NUM_LOCAL_TEST (2*1024*1024)

then I get quite different numbers :

Running local_t variant benchmarks
atomic_long: local_inc=6729007264/100 local_add=6727005943/100 cpu_local_inc=724000569/10 local_read=1030000784/15 local
_add_return=6623004616/98 (total was 3456106496)
irqsave/restore: local_inc=4458002796/66 local_add=4459001998/66 cpu_local_inc=971000381/14 local_read=1060000389/15 loc
al_add_return=4528001388/67 (total was 3456106496)
trivalue: local_inc=2871000855/42 local_add=2867000976/42 cpu_local_inc=162000052/2 local_read=1747000551/26 local_add_r
eturn=8829002352/131 (total was 3456106496)
local_t: local_inc=2210000492/32 local_add=2206000460/32 cpu_local_inc=84000017/1 local_read=1029000203/15 local_add_ret
urn=2216000415/33 (total was 3456106496, warm_total 125829124)

If now I reduce NUM_LOCAL_TEST to 256*1024 so that even trivalue l3 fits cache.

Running local_t variant benchmarks
atomic_long: local_inc=98984929/11 local_add=98984889/11 cpu_local_inc=89986248/10 local_read=11998165/1 local_add_retur
n=99003292/11 (total was 2579496960)
irqsave/restore: local_inc=124000102/14 local_add=124000102/14 cpu_local_inc=121000100/14 local_read=17000013/2 local_ad
d_return=126000103/15 (total was 2579496960)
trivalue: local_inc=21000017/2 local_add=20000016/2 cpu_local_inc=20000017/2 local_read=25000021/2 local_add_return=1360
00110/16 (total was 2579496960)
local_t: local_inc=17000014/2 local_add=17000015/2 cpu_local_inc=11000009/1 local_read=12000010/1 local_add_return=23000
019/2 (total was 2579496960, warm_total 15728642)



About trivalues, their use in percpu_counter local storage (one trivalue for each cpu)
would make the accuracy a litle bit more lazy...

next prev parent reply	other threads:[~2008-12-16 23:59 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-12-15 13:47 local_add_return Steven Rostedt
2008-12-16  6:33 ` local_add_return Rusty Russell
2008-12-16  6:57   ` local_add_return David Miller
2008-12-16  7:13   ` local_add_return David Miller
2008-12-16 22:38     ` local_add_return Rusty Russell
2008-12-16 22:50       ` local_add_return Rusty Russell
2008-12-16 23:25       ` local_add_return Luck, Tony
2008-12-16 23:25         ` local_add_return Luck, Tony
2008-12-16 23:43       ` local_add_return Heiko Carstens
2008-12-16 23:43         ` local_add_return Heiko Carstens
2008-12-16 23:59       ` Eric Dumazet [this message]
2008-12-16 23:59         ` local_add_return Eric Dumazet
2008-12-17  0:01       ` local_add_return Mathieu Desnoyers
2008-12-17  0:01         ` local_add_return Mathieu Desnoyers
2008-12-18 22:52         ` local_add_return Rusty Russell
2008-12-18 22:53           ` local_add_return Rusty Russell
2008-12-19  3:35           ` local_add_return Mathieu Desnoyers
2008-12-19  3:35             ` local_add_return Mathieu Desnoyers
2008-12-19  5:54             ` local_add_return Rusty Russell
2008-12-19  5:54               ` local_add_return Rusty Russell
2008-12-19 17:06               ` local_add_return Mathieu Desnoyers
2008-12-19 17:06                 ` local_add_return Mathieu Desnoyers
2008-12-20  1:33                 ` local_add_return Rusty Russell
2008-12-20  1:45                   ` local_add_return Rusty Russell
2008-12-20  1:33                   ` local_add_return Rusty Russell
2008-12-22 18:43                   ` local_add_return Mathieu Desnoyers
2008-12-22 18:43                     ` local_add_return Mathieu Desnoyers
2008-12-24 11:42                     ` local_add_return Rusty Russell
2008-12-24 11:54                       ` local_add_return Rusty Russell
2008-12-24 18:53                       ` local_add_return Mathieu Desnoyers
2008-12-24 18:53                         ` local_add_return Mathieu Desnoyers
2008-12-16 16:25   ` local_add_return Mathieu Desnoyers
2008-12-17 11:23     ` local_add_return Rusty Russell

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=494840C4.50000@cosmosbay.com \
    --to=dada1@cosmosbay.com \
    --cc=akpm@linux-foundation.org \
    --cc=benh@kernel.crashing.org \
    --cc=davem@davemloft.net \
    --cc=linux-ia64@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=mathieu.desnoyers@polymtl.ca \
    --cc=paulus@samba.org \
    --cc=rostedt@goodmis.org \
    --cc=rusty@rustcorp.com.au \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.