linux-raid.vger.kernel.org archive mirror
* raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
@ 2008-11-16 16:18 Igor Podlesny
  2008-11-17  0:36 ` H. Peter Anvin
  0 siblings, 1 reply; 16+ messages in thread
From: Igor Podlesny @ 2008-11-16 16:18 UTC
  To: linux-raid; +Cc: neilb

Hi!

Recently I decided to try the x86_64 version of my distro
of choice (subject to change).
I prefer to compile my own kernels, so I know that my x86_32 and
x86_64 .configs are very similar to each other.

The first excerpt is from the x86_32 kernel's dmesg (2.6.27.6-1khz i686):

    [    0.000000] Linux version 2.6.27.6-1khz (poige@arch.localdomain) (gcc version 4.2.4)
    [    0.021589] CPU0: AMD Athlon(tm) 64 X2 Dual Core Processor 6000+ stepping 03

    [    0.092818] Total of 2 processors activated (12061.15 BogoMIPS).

    [    0.093095] xor: automatically using best checksumming function: pIII_sse
    [    0.097988]    pIII_sse  :  9164.000 MB/sec
    [    0.098027] xor: using function: pIII_sse (9164.000 MB/sec)

    [    2.392055] raid6: int32x1   1121 MB/s
    [    2.409048] raid6: int32x2   1226 MB/s
    [    2.412224] input: AT Translated Set 2 keyboard as /class/input/input0
    [    2.426040] raid6: int32x4   1191 MB/s
    [    2.443066] raid6: int32x8    882 MB/s
    [    2.460013] raid6: mmxx1     2453 MB/s
    [    2.477014] raid6: mmxx2     4574 MB/s
    [    2.494024] raid6: sse1x1    2441 MB/s
    [    2.511014] raid6: sse1x2    4222 MB/s
    [    2.528013] raid6: sse2x1    4187 MB/s
    [    2.545004] raid6: sse2x2    5562 MB/s
    [    2.545042] raid6: using algorithm sse2x2 (5562 MB/s)

And here is the x86_64 one:

    [    0.000000] Linux version 2.6.27.6-64_1khz (root@archlive) (gcc version 4.3.1 (GCC) )
    [    0.019180] CPU0: AMD Athlon(tm) 64 X2 Dual Core Processor 6000+ stepping 03

    [    0.091750] Total of 2 processors activated (12061.66 BogoMIPS).

    [    0.092073] xor: automatically using best checksumming function: generic_sse
    [    0.096986]    generic_sse:  9192.000 MB/sec
    [    0.097024] xor: using function: generic_sse (9192.000 MB/sec)

    [    2.583571] md: raid0 personality registered for level 0
    [    2.583614] md: raid1 personality registered for level 1
    [    2.600025] raid6: int64x1   2722 MB/s
    [    2.617010] raid6: int64x2   3660 MB/s
    [    2.634006] raid6: int64x4   3265 MB/s
    [    2.651012] raid6: int64x8   2593 MB/s
    [    2.668034] raid6: sse2x1    1476 MB/s
    [    2.685021] raid6: sse2x2    2316 MB/s
    [    2.702022] raid6: sse2x4    3175 MB/s
    [    2.702060] raid6: using algorithm sse2x4 (3175 MB/s)

So, there are two strange things in those dmesgs. The first might be
unrelated to Linux RAID but affects it: have you noticed that on
x86_64 the raid6 algorithm is ~50% slower than on x86_32? Is that due
to poorly optimized code for x86_64 mode? And the second: why does
raid6 use algorithm sse2x4 (3175 MB/s), whereas int64x2 gives
slightly better (~15%) throughput at 3660 MB/s?

Has anyone on the list made similar observations? Can the gcc version
difference matter this much? I doubt it, but I can try building x86_32
with gcc 4.3.1 (as x86_64 was).

-- 
End of message. Next message?

* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
  2008-11-16 16:18 Igor Podlesny
@ 2008-11-17  0:36 ` H. Peter Anvin
  2008-11-17 20:56   ` Igor Podlesny
  0 siblings, 1 reply; 16+ messages in thread
From: H. Peter Anvin @ 2008-11-17  0:36 UTC
  To: for.poige+linux; +Cc: linux-raid, neilb

Igor Podlesny wrote:
> 
> So, there are two strange things in those dmesgs. The first might be
> unrelated to Linux RAID but affects it: have you noticed that on
> x86_64 the raid6 algorithm is ~50% slower than on x86_32? Is that due
> to poorly optimized code for x86_64 mode? And the second: why does
> raid6 use algorithm sse2x4 (3175 MB/s), whereas int64x2 gives
> slightly better (~15%) throughput at 3660 MB/s?
> 
> Has anyone on the list made similar observations? Can the gcc version
> difference matter this much? I doubt it, but I can try building x86_32
> with gcc 4.3.1 (as x86_64 was).
> 

The SSE modes have nicer cache behaviours and are therefore preferred
even if they are slower.
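A minimal sketch of the selection rule described above: each candidate carries a preference flag, a preferred (cache-friendly) candidate wins even when it benchmarks slower, and measured speed only breaks ties within the same preference class. This is not the kernel's raid6 code; the struct, field and function names are hypothetical, and giving the SSE variants prefer = 1 is an assumption chosen to match the x86_64 dmesg in the first message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical candidate descriptor (not the kernel's struct). */
struct raid6_candidate {
    const char *name;
    unsigned int mbps;   /* benchmark result, MB/s */
    int prefer;          /* assumed: 1 for cache-friendly SSE variants, 0 otherwise */
};

/*
 * Return the candidate with the highest preference; among candidates
 * with equal preference, the one with the best measured throughput.
 */
static const struct raid6_candidate *
pick_algorithm(const struct raid6_candidate *c, size_t n)
{
    const struct raid6_candidate *best = NULL;

    assert(c != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (best == NULL ||
            c[i].prefer > best->prefer ||
            (c[i].prefer == best->prefer && c[i].mbps > best->mbps))
            best = &c[i];
    }
    return best;
}
```

Fed the x86_64 figures (int64x1..int64x8 with prefer = 0, sse2x1..sse2x4 with prefer = 1), this picks sse2x4 at 3175 MB/s; with every prefer flag cleared it would pick int64x2 at 3660 MB/s instead.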

It is very odd that your SSE2 modes are that much slower in 64-bit mode.
It could just be an artifact of the way the test is done (cache
anomalies?), but I kind of suspect there is something more fishy going on.

The sse2 code in the x1 and x2 case is actually identical between x86-32
and -64 (the x4 case is only available for -64) so it is very strange
that you're seeing this kind of effect.
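For reference, the names in the dmesg tables encode the implementation and the unroll factor: int64x2 is the portable integer code working on 64-bit words, two words per loop iteration, and sse2x4 handles four 16-byte SSE2 vectors per iteration, which needs the extra XMM registers that only exist in 64-bit mode. Below is a hedged userspace sketch of what an int64x1-style routine computes: P is the XOR of the data blocks and Q is the sum of g^i * D_i in GF(2^8) with g = {02} and polynomial 0x11d, evaluated by Horner's rule from the highest data disk down. The function names and the pointer layout are hypothetical, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Replicate a byte value into all 8 bytes of a 64-bit word. */
#define NBYTES(x) ((uint64_t)(x) * 0x0101010101010101ULL)

/* Multiply each of the 8 packed bytes by {02} in GF(2^8), polynomial 0x11d. */
static uint64_t gf_mul2_x8(uint64_t v)
{
    uint64_t hi = v & NBYTES(0x80);          /* top bit of every byte */
    uint64_t mask = (hi << 1) - (hi >> 7);   /* 0xff in bytes whose top bit was set */
    uint64_t shl = (v << 1) & NBYTES(0xfe);  /* per-byte shift, cross-byte carries dropped */
    return shl ^ (mask & NBYTES(0x1d));      /* reduce overflowing bytes */
}

/*
 * ptrs[0..ndata-1] are data blocks, ptrs[ndata] receives P and
 * ptrs[ndata+1] receives Q. One 64-bit word per step (the "x1" case),
 * so bytes must be a multiple of 8.
 */
static void gen_syndrome_int64x1(int ndata, size_t bytes, uint8_t **ptrs)
{
    assert(ndata >= 1 && bytes % 8 == 0);
    uint8_t *p = ptrs[ndata], *q = ptrs[ndata + 1];

    for (size_t off = 0; off < bytes; off += 8) {
        uint64_t wp, wq, wd;
        memcpy(&wd, ptrs[ndata - 1] + off, 8);   /* start at the highest data disk */
        wp = wq = wd;
        for (int z = ndata - 2; z >= 0; z--) {
            memcpy(&wd, ptrs[z] + off, 8);
            wp ^= wd;                            /* P: plain XOR */
            wq = gf_mul2_x8(wq) ^ wd;            /* Q: Horner step, q = 2*q + D_z */
        }
        memcpy(p + off, &wp, 8);
        memcpy(q + off, &wq, 8);
    }
}
```

The x2/x4/x8 variants would keep two, four or eight such (wp, wq) pairs live per iteration; with too many of them the working set no longer fits in registers, which is one plausible reason int64x8 benchmarks slower than int64x2 here.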

	-hpa

* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
  2008-11-17  0:36 ` H. Peter Anvin
@ 2008-11-17 20:56   ` Igor Podlesny
  0 siblings, 0 replies; 16+ messages in thread
From: Igor Podlesny @ 2008-11-17 20:56 UTC
  To: H. Peter Anvin; +Cc: linux-raid, neilb

2008/11/17 H. Peter Anvin <hpa@zytor.com>:
> Igor Podlesny wrote:
>>
[...]
>> Has anyone on the list made similar observations? Can the gcc version
>> difference matter this much? I doubt it, but I can try building x86_32
>> with gcc 4.3.1 (as x86_64 was).
>>
>
> The SSE modes have nicer cache behaviours and are therefore preferred
> even if they are slower.
>
I had guessed they were preferable (but wasn't aware of the exact
reason, thanks!). :-)
>
> It is very odd that your SSE2 modes are that much slower in 64-bit mode.
> It could just be an artifact of the way the test is done (cache
> anomalies?), but I kind of suspect there is something more fishy going on.

I built gcc-4.2.4 and recompiled the kernel with it. The .config diff:
	
    -CONFIG_LOCALVERSION="-64_1khz"
    +CONFIG_LOCALVERSION="-64_1khz-gcc42"
     CONFIG_LOCALVERSION_AUTO=y
     CONFIG_SWAP=y
     CONFIG_SYSVIPC=y
    @@ -287,7 +287,7 @@
     # CONFIG_RELOCATABLE is not set
     CONFIG_PHYSICAL_ALIGN=0x200000
     CONFIG_HOTPLUG_CPU=y
    -CONFIG_COMPAT_VDSO=y
    +# CONFIG_COMPAT_VDSO is not set

And dmesg now says:
	
    [    2.606258] raid6: int64x1   2210 MB/s
    [    2.623255] raid6: int64x2   3246 MB/s
    [    2.640257] raid6: int64x4   3289 MB/s
    [    2.657262] raid6: int64x8   3019 MB/s
    [    2.674262] raid6: sse2x1    4253 MB/s
    [    2.691258] raid6: sse2x2    5621 MB/s
    [    2.708261] raid6: sse2x4    5718 MB/s
    [    2.708299] raid6: using algorithm sse2x4 (5718 MB/s)

So I conclude that the difference comes from the gcc version, not from the CONFIG_COMPAT_VDSO change.

* RE: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
@ 2008-11-17 22:35 H. Peter Anvin
  2008-11-18 12:03 ` Igor Podlesny
  0 siblings, 1 reply; 16+ messages in thread
From: H. Peter Anvin @ 2008-11-17 22:35 UTC
  To: for.poige+linux; +Cc: linux-raid, neilb

Interesting... Perhaps you could send me your "bad" kernel vmlinux or raid456.ko file; preferably compiled with CONFIG_DEBUG_INFO.

-- 
Sent from my mobile phone (pardon any lack of formatting)


-----Original Message-----
From: Igor Podlesny <for.poige+linux@gmail.com>
Sent: Monday, November 17, 2008 12:56
To: H. Peter Anvin <hpa@zytor.com>
Cc: linux-raid@vger.kernel.org; neilb@suse.de
Subject: Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.

2008/11/17 H. Peter Anvin <hpa@zytor.com>:
> Igor Podlesny wrote:
>>
[...]
>> Has anyone on the list made similar observations? Can the gcc version
>> difference matter this much? I doubt it, but I can try building x86_32
>> with gcc 4.3.1 (as x86_64 was).
>>
>
> The SSE modes have nicer cache behaviours and are therefore preferred
> even if they are slower.
>
I had guessed they were preferable (but wasn't aware of the exact
reason, thanks!). :-)
>
> It is very odd that your SSE2 modes are that much slower in 64-bit mode.
> It could just be an artifact of the way the test is done (cache
> anomalies?), but I kind of suspect there is something more fishy going on

* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
  2008-11-17 22:35 raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64 H. Peter Anvin
@ 2008-11-18 12:03 ` Igor Podlesny
  2008-11-18 15:47   ` H. Peter Anvin
  0 siblings, 1 reply; 16+ messages in thread
From: Igor Podlesny @ 2008-11-18 12:03 UTC
  To: H. Peter Anvin; +Cc: linux-raid

2008/11/18 H. Peter Anvin <hpa@zytor.com>:
> Interesting... Perhaps you could send me your "bad" kernel vmlinux or raid456.ko file;

	My raid456 isn't a module; it's built into the kernel. Will the .o be
enough, or would it be better to recompile it as a module?

> preferably compiled with CONFIG_DEBUG_INFO.

	ok.

* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
  2008-11-18 12:03 ` Igor Podlesny
@ 2008-11-18 15:47   ` H. Peter Anvin
  2008-11-21 19:22     ` Igor Podlesny
  0 siblings, 1 reply; 16+ messages in thread
From: H. Peter Anvin @ 2008-11-18 15:47 UTC
  To: for.poige+linux; +Cc: linux-raid

Igor Podlesny wrote:
> 2008/11/18 H. Peter Anvin <hpa@zytor.com>:
>> Interesting... Perhaps you could send me your "bad" kernel vmlinux or raid456.ko file;
> 
> 	My raid456 isn't a module; it's built into the kernel. Will the .o be
> enough, or would it be better to recompile it as a module?
> 
>> preferably compiled with CONFIG_DEBUG_INFO.
> 

Built-in is fine.  I need the vmlinux file, though, not bzImage.

	-hpa

* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
  2008-11-18 15:47   ` H. Peter Anvin
@ 2008-11-21 19:22     ` Igor Podlesny
  2008-11-21 19:31       ` H. Peter Anvin
  0 siblings, 1 reply; 16+ messages in thread
From: Igor Podlesny @ 2008-11-21 19:22 UTC
  To: H. Peter Anvin; +Cc: linux-raid

2008/11/18 H. Peter Anvin <hpa@zytor.com>:
> Igor Podlesny wrote:
>>
>> 2008/11/18 H. Peter Anvin <hpa@zytor.com>:
>>>
>>> Interesting... Perhaps you could send me your "bad" kernel vmlinux or
>>> raid456.ko file;

	I've narrowed it down a bit: the problem only occurs when the kernel is built with "-Os".
>
> Built-in is fine.  I need the vmlinux file, though, not bzImage.
>
	OK, it's compiling right now. vmlinux is pretty large, so I guess it
would be better to upload it to an FTP server.

* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
  2008-11-21 19:22     ` Igor Podlesny
@ 2008-11-21 19:31       ` H. Peter Anvin
  2008-11-21 19:33         ` Igor Podlesny
  0 siblings, 1 reply; 16+ messages in thread
From: H. Peter Anvin @ 2008-11-21 19:31 UTC
  To: for.poige+linux; +Cc: linux-raid

Igor Podlesny wrote:
> 2008/11/18 H. Peter Anvin <hpa@zytor.com>:
>> Igor Podlesny wrote:
>>> 2008/11/18 H. Peter Anvin <hpa@zytor.com>:
>>>> Interesting... Perhaps you could send me your "bad" kernel vmlinux or
>>>> raid456.ko file;
> 
> 	I've narrowed it down a bit: the problem only occurs when the kernel is built with "-Os".
>> Built-in is fine.  I need the vmlinux file, though, not bzImage.
>>
> 	OK, it's compiling right now. vmlinux is pretty large, so I guess it
> would be better to upload it to an FTP server.
> 

It probably is.  My side can handle large emails, but yours might not
(and, obviously, don't send it to the list.)

It sort of makes sense that -Os would break this stuff.  For newer gccs,
it would be better to use SSE2 intrinsics rather than inline assembly.
The problem is that it breaks older gcc.
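To illustrate what "SSE2 intrinsics rather than inline assembly" can look like, here is a hedged userspace sketch of an sse2x1-style P/Q routine written with the standard <emmintrin.h> intrinsics. The multiply-by-{02} step uses a signed byte compare against zero to find bytes with the top bit set, a byte-wise add to shift, and an AND/XOR with 0x1d to reduce. Function names and the pointer layout (data blocks first, then P, then Q) are hypothetical; kernel code would additionally have to save and restore the SSE register state around such a routine, which this userspace sketch ignores.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Multiply each of the 16 bytes by {02} in GF(2^8), polynomial 0x11d. */
static inline __m128i gf_mul2_sse2(__m128i v)
{
    /* Signed compare 0 > v yields 0xff in every byte whose top bit is set. */
    __m128i mask = _mm_cmpgt_epi8(_mm_setzero_si128(), v);
    __m128i shl = _mm_add_epi8(v, v);   /* per-byte shift left by one */
    return _mm_xor_si128(shl, _mm_and_si128(mask, _mm_set1_epi8(0x1d)));
}

/*
 * ptrs[0..ndata-1] are data blocks, ptrs[ndata] receives P and
 * ptrs[ndata+1] receives Q; one 16-byte vector per step.
 */
static void gen_syndrome_sse2x1(int ndata, size_t bytes, uint8_t **ptrs)
{
    assert(ndata >= 1 && bytes % 16 == 0);
    uint8_t *p = ptrs[ndata], *q = ptrs[ndata + 1];

    for (size_t off = 0; off < bytes; off += 16) {
        __m128i wq = _mm_loadu_si128((const __m128i *)(ptrs[ndata - 1] + off));
        __m128i wp = wq;
        for (int z = ndata - 2; z >= 0; z--) {
            __m128i wd = _mm_loadu_si128((const __m128i *)(ptrs[z] + off));
            wp = _mm_xor_si128(wp, wd);                   /* P ^= D_z */
            wq = _mm_xor_si128(gf_mul2_sse2(wq), wd);     /* Q = 2*Q + D_z */
        }
        _mm_storeu_si128((__m128i *)(p + off), wp);
        _mm_storeu_si128((__m128i *)(q + off), wq);
    }
}
```

With intrinsics the compiler does the register allocation and scheduling itself, which is what makes the code readable and tunable per compiler, but it also means the generated code depends on how well a given gcc version handles these builtins.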

	-hpa

* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
  2008-11-21 19:31       ` H. Peter Anvin
@ 2008-11-21 19:33         ` Igor Podlesny
  2008-11-21 20:15           ` H. Peter Anvin
  0 siblings, 1 reply; 16+ messages in thread
From: Igor Podlesny @ 2008-11-21 19:33 UTC
  To: H. Peter Anvin; +Cc: linux-raid

2008/11/22 H. Peter Anvin <hpa@zytor.com>:
[...]
> It sort of makes sense that -Os would break this stuff.  For newer gccs,
> it would be better to use SSE2 intrinsics rather than inline assembly.
> The problem is that it breaks older gcc.
>
	Well, #ifdef could be helpful then, couldn't it? :-)
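A hedged sketch of the #ifdef approach: compile the intrinsics path only for compilers assumed to handle it well, and fall back to inline assembly otherwise. The gcc >= 4.3 cut-off is purely an assumption (4.3 is simply the version tested in this thread), and the macro RAID6_SSE2_INTRINSICS is hypothetical. Note that clang also defines __GNUC__ (reporting 4.2), so it would take the assembly branch here. For brevity both branches in this userspace sketch take the __m128i type from <emmintrin.h>; both compute the same GF(2^8) multiply-by-{02} on 16 bytes.

```c
#include <assert.h>
#include <emmintrin.h>

/* Hypothetical switch: assumed cut-off at gcc 4.3. */
#if defined(__GNUC__) && \
    (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 3))
#define RAID6_SSE2_INTRINSICS 1
#else
#define RAID6_SSE2_INTRINSICS 0
#endif

/* Multiply each of the 16 bytes by {02} in GF(2^8), polynomial 0x11d. */
static inline __m128i gf_mul2_sse2(__m128i v)
{
#if RAID6_SSE2_INTRINSICS
    __m128i mask = _mm_cmpgt_epi8(_mm_setzero_si128(), v);
    return _mm_xor_si128(_mm_add_epi8(v, v),
                         _mm_and_si128(mask, _mm_set1_epi8(0x1d)));
#else
    __m128i mask, poly = _mm_set1_epi8(0x1d);
    __asm__("pxor    %0, %0\n\t"   /* mask = 0 */
            "pcmpgtb %1, %0\n\t"   /* mask = (0 > v), signed, per byte */
            "paddb   %1, %1\n\t"   /* v = v + v: per-byte shift left */
            "pand    %2, %0\n\t"   /* mask &= 0x1d */
            "pxor    %0, %1"       /* v ^= mask */
            : "=&x"(mask), "+x"(v)
            : "x"(poly));
    return v;
#endif
}
```

The early-clobber "=&x" on the mask operand keeps the compiler from placing it in the same register as the 0x1d constant, since the mask is written before the constant is read.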

* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
  2008-11-21 19:33         ` Igor Podlesny
@ 2008-11-21 20:15           ` H. Peter Anvin
  2008-11-22  5:40             ` Igor Podlesny
  0 siblings, 1 reply; 16+ messages in thread
From: H. Peter Anvin @ 2008-11-21 20:15 UTC
  To: for.poige+linux; +Cc: linux-raid

Igor Podlesny wrote:
> 2008/11/22 H. Peter Anvin <hpa@zytor.com>:
> [...]
>> It sort of makes sense that -Os would break this stuff.  For newer gccs,
>> it would be better to use SSE2 intrinsics rather than inline assembly.
>> The problem is that it breaks older gcc.
>>
> 	Well, #ifdef could be helpful then, couldn't it? :-)
> 

Yes, and that's probably the way to go.

I just tested a version using gcc intrinsics with gcc 4.3, and it is
almost 20% faster than the inline assembly version.  That, plus the fact
that the code is actually readable, makes me really want to figure out
how best to deploy this.

	-hpa

* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
  2008-11-21 20:15           ` H. Peter Anvin
@ 2008-11-22  5:40             ` Igor Podlesny
  2008-11-22  5:42               ` H. Peter Anvin
  0 siblings, 1 reply; 16+ messages in thread
From: Igor Podlesny @ 2008-11-22  5:40 UTC
  To: H. Peter Anvin; +Cc: linux-raid

2008/11/22 H. Peter Anvin <hpa@zytor.com>:
[...]
> I just tested a version using gcc intrinsics with gcc 4.3, and it is
> almost 20% faster than the inline assembly version.  That, plus the fact
> that the code is actually readable, makes me really want to figure out
> how best to deploy this.

	Please let us know when you commit the patch: 20% is valuable. ;-)

* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
  2008-11-22  5:40             ` Igor Podlesny
@ 2008-11-22  5:42               ` H. Peter Anvin
  2008-11-22  5:45                 ` Igor Podlesny
  0 siblings, 1 reply; 16+ messages in thread
From: H. Peter Anvin @ 2008-11-22  5:42 UTC
  To: for.poige+linux; +Cc: linux-raid

Igor Podlesny wrote:
> 2008/11/22 H. Peter Anvin <hpa@zytor.com>:
> [...]
>> I just tested a version using gcc intrinsics with gcc 4.3, and it is
>> almost 20% faster than the inline assembly version.  That, plus the fact
>> that the code is actually readable, makes me really want to figure out
>> how best to deploy this.
> 
> 	Please let us know when you commit the patch: 20% is valuable. ;-)
> 

Looks like I was a bit too optimistic.  The 20% was because of the 
missed prefetchnta, which means polluting the cache.
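A hedged sketch of what the prefetchnta point is about: the loop below computes only the XOR (P) parity with SSE2 intrinsics and issues a non-temporal prefetch hint, _mm_prefetch(..., _MM_HINT_NTA), which compiles to prefetchnta, before each load. The hint asks the CPU to keep this streamed, use-once data from displacing other cache contents; its exact effect is CPU-specific. The function name is hypothetical, and this is not the code hpa tested.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2; also provides _mm_prefetch via <xmmintrin.h> */
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: p = src[0] ^ src[1] ^ ... ^ src[ndata-1],
 * 16 bytes per step, with a non-temporal hint before every load.
 */
static void xor_parity_nta(int ndata, size_t bytes, uint8_t **src, uint8_t *p)
{
    assert(ndata >= 1 && bytes % 16 == 0);
    for (size_t off = 0; off < bytes; off += 16) {
        __m128i wp = _mm_setzero_si128();
        for (int z = 0; z < ndata; z++) {
            /* Streaming source data: hint that it should not pollute the caches. */
            _mm_prefetch((const char *)(src[z] + off), _MM_HINT_NTA);
            wp = _mm_xor_si128(wp, _mm_loadu_si128((const __m128i *)(src[z] + off)));
        }
        _mm_storeu_si128((__m128i *)(p + off), wp);
    }
}
```

Dropping the _mm_prefetch line gives a variant that may benchmark faster in isolation while evicting more of everyone else's cached data, which is the trade-off discussed in this thread. A non-temporal store (_mm_stream_si128, i.e. movntdq) is the write-side counterpart, but it requires a 16-byte-aligned destination.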

	-hpa

* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
  2008-11-22  5:42               ` H. Peter Anvin
@ 2008-11-22  5:45                 ` Igor Podlesny
  2008-11-23  1:12                   ` John Robinson
  2008-12-05 13:36                   ` Igor Podlesny
  0 siblings, 2 replies; 16+ messages in thread
From: Igor Podlesny @ 2008-11-22  5:45 UTC
  To: H. Peter Anvin; +Cc: linux-raid

2008/11/22 H. Peter Anvin <hpa@zytor.com>:
[...]
>>        Please let us know when you commit the patch: 20% is valuable.
>> ;-)
>>
>
> Looks like I was a bit too optimistic.  The 20% was because of the missed
> prefetchnta, which means polluting the cache.
>
>        -hpa

	Well, whatever it gives, as long as it's not a regression. :-)

* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
  2008-11-22  5:45                 ` Igor Podlesny
@ 2008-11-23  1:12                   ` John Robinson
  2008-12-05 13:36                   ` Igor Podlesny
  1 sibling, 0 replies; 16+ messages in thread
From: John Robinson @ 2008-11-23  1:12 UTC
  To: Linux RAID

On 22/11/2008 05:45, Igor Podlesny wrote:
> 2008/11/22 H. Peter Anvin <hpa@zytor.com>:
> [...]
>>>        Please let us know when you commit the patch: 20% is valuable.
>>> ;-)
>>>
>> Looks like I was a bit too optimistic.  The 20% was because of the missed
>> prefetchnta, which means polluting the cache.
> 
> 	Well, whatever it gives, as long as it's not a regression. :-)

Not sure I agree with that: if the 20% improvement came at the cost of a
heavy impact on the performance of the rest of the system, at a time when
the system is already going to be underperforming, it may not be worth
having (though it might be an option for people who can afford the CPU or whatever).

Cheers,

John.

* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
  2008-11-22  5:45                 ` Igor Podlesny
  2008-11-23  1:12                   ` John Robinson
@ 2008-12-05 13:36                   ` Igor Podlesny
  2008-12-05 17:34                     ` H. Peter Anvin
  1 sibling, 1 reply; 16+ messages in thread
From: Igor Podlesny @ 2008-12-05 13:36 UTC
  To: H. Peter Anvin; +Cc: linux-raid

2008/11/22 Igor Podlesny <for.poige+linux@gmail.com>:
> 2008/11/22 H. Peter Anvin <hpa@zytor.com>:
> [...]
>>>        Please let us know when you commit the patch: 20% is valuable.
>>> ;-)
>>>
>>
>> Looks like I was a bit too optimistic.  The 20% was because of the missed
>> prefetchnta, which means polluting the cache.
>>
>>        -hpa
>
>        Well, whatever it gives, as long as it's not a regression. :-)

	Hi! Any news?

* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
  2008-12-05 13:36                   ` Igor Podlesny
@ 2008-12-05 17:34                     ` H. Peter Anvin
  0 siblings, 0 replies; 16+ messages in thread
From: H. Peter Anvin @ 2008-12-05 17:34 UTC
  To: for.poige+linux; +Cc: linux-raid

Igor Podlesny wrote:
> 2008/11/22 Igor Podlesny <for.poige+linux@gmail.com>:
>> 2008/11/22 H. Peter Anvin <hpa@zytor.com>:
>> [...]
>>>>        Please let us know when you commit the patch: 20% is valuable.
>>>> ;-)
>>>>
>>> Looks like I was a bit too optimistic.  The 20% was because of the missed
>>> prefetchnta, which means polluting the cache.
>>>
>>>        -hpa
>>        Well, whatever it gives, as long as it's not a regression. :-)
> 
> 	Hi! Any news?
> 

No.

	-hpa
