* raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
@ 2008-11-16 16:18 Igor Podlesny
2008-11-17 0:36 ` H. Peter Anvin
0 siblings, 1 reply; 16+ messages in thread
From: Igor Podlesny @ 2008-11-16 16:18 UTC (permalink / raw)
To: linux-raid; +Cc: neilb
Hi!
Recently I decided to try the x86_64 version of my distro of choice
(subject to change).
I prefer to use kernels I compile myself, so I know that the .configs
for the x86_32 and x86_64 builds are very similar to each other.
The first excerpt is the x86_32 kernel's dmesg (2.6.27.6-1khz i686):
[ 0.000000] Linux version 2.6.27.6-1khz (poige@arch.localdomain) (gcc version 4.2.4)
[ 0.021589] CPU0: AMD Athlon(tm) 64 X2 Dual Core Processor 6000+ stepping 03
[ 0.092818] Total of 2 processors activated (12061.15 BogoMIPS).
[ 0.093095] xor: automatically using best checksumming function: pIII_sse
[ 0.097988] pIII_sse : 9164.000 MB/sec
[ 0.098027] xor: using function: pIII_sse (9164.000 MB/sec)
[ 2.392055] raid6: int32x1 1121 MB/s
[ 2.409048] raid6: int32x2 1226 MB/s
[ 2.412224] input: AT Translated Set 2 keyboard as /class/input/input0
[ 2.426040] raid6: int32x4 1191 MB/s
[ 2.443066] raid6: int32x8 882 MB/s
[ 2.460013] raid6: mmxx1 2453 MB/s
[ 2.477014] raid6: mmxx2 4574 MB/s
[ 2.494024] raid6: sse1x1 2441 MB/s
[ 2.511014] raid6: sse1x2 4222 MB/s
[ 2.528013] raid6: sse2x1 4187 MB/s
[ 2.545004] raid6: sse2x2 5562 MB/s
[ 2.545042] raid6: using algorithm sse2x2 (5562 MB/s)
And here is the x86_64 one:
[ 0.000000] Linux version 2.6.27.6-64_1khz (root@archlive) (gcc version 4.3.1 (GCC) )
[ 0.019180] CPU0: AMD Athlon(tm) 64 X2 Dual Core Processor 6000+ stepping 03
[ 0.091750] Total of 2 processors activated (12061.66 BogoMIPS).
[ 0.092073] xor: automatically using best checksumming function: generic_sse
[ 0.096986] generic_sse: 9192.000 MB/sec
[ 0.097024] xor: using function: generic_sse (9192.000 MB/sec)
[ 2.583571] md: raid0 personality registered for level 0
[ 2.583614] md: raid1 personality registered for level 1
[ 2.600025] raid6: int64x1 2722 MB/s
[ 2.617010] raid6: int64x2 3660 MB/s
[ 2.634006] raid6: int64x4 3265 MB/s
[ 2.651012] raid6: int64x8 2593 MB/s
[ 2.668034] raid6: sse2x1 1476 MB/s
[ 2.685021] raid6: sse2x2 2316 MB/s
[ 2.702022] raid6: sse2x4 3175 MB/s
[ 2.702060] raid6: using algorithm sse2x4 (3175 MB/s)
So, there are two strange things in those dmesgs. The first one might be
unrelated to Linux RAID but affects it -- have you noticed that on
x86_64 the raid6 algorithm is ~ 50 % slower than on x86_32? Is that due
to poorly optimized code for x86_64 mode? And the second -- why is
raid6 using algorithm sse2x4 (3175 MB/s), whereas int64x2 gives
slightly better (~ 15 %) throughput -- 3660 MB/s?
Has anyone on the list made similar observations? Can the gcc version
difference have that much effect? I doubt it, but I can try building
x86_32 with gcc 4.3.1 (as x86_64 was).
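For readers who want to check the two percentages, here is a small sketch that parses the raid6 benchmark lines quoted above and computes the ratios. The helper names (`parse_raid6`, `percent_slower`, `percent_faster`) are made up for this illustration and are not part of any kernel tooling; the numbers are copied from the dmesg excerpts in this message.

```python
import re

# Benchmark lines copied from the two dmesg excerpts in this message.
DMESG_X86_32 = """
[ 2.392055] raid6: int32x1 1121 MB/s
[ 2.409048] raid6: int32x2 1226 MB/s
[ 2.426040] raid6: int32x4 1191 MB/s
[ 2.443066] raid6: int32x8 882 MB/s
[ 2.460013] raid6: mmxx1 2453 MB/s
[ 2.477014] raid6: mmxx2 4574 MB/s
[ 2.494024] raid6: sse1x1 2441 MB/s
[ 2.511014] raid6: sse1x2 4222 MB/s
[ 2.528013] raid6: sse2x1 4187 MB/s
[ 2.545004] raid6: sse2x2 5562 MB/s
[ 2.545042] raid6: using algorithm sse2x2 (5562 MB/s)
"""

DMESG_X86_64 = """
[ 2.600025] raid6: int64x1 2722 MB/s
[ 2.617010] raid6: int64x2 3660 MB/s
[ 2.634006] raid6: int64x4 3265 MB/s
[ 2.651012] raid6: int64x8 2593 MB/s
[ 2.668034] raid6: sse2x1 1476 MB/s
[ 2.685021] raid6: sse2x2 2316 MB/s
[ 2.702022] raid6: sse2x4 3175 MB/s
[ 2.702060] raid6: using algorithm sse2x4 (3175 MB/s)
"""

# Matches "raid6: <name> <number> MB/s"; the "using algorithm" summary
# line does not match because the word after "using" is not a number.
BENCH_RE = re.compile(r"raid6:\s+(\w+)\s+(\d+) MB/s")


def parse_raid6(dmesg):
    """Return {algorithm name: MB/s} for the per-algorithm benchmark lines."""
    return {m.group(1): int(m.group(2)) for m in BENCH_RE.finditer(dmesg)}


def percent_slower(reference, measured):
    """How much slower `measured` is than `reference`, in percent."""
    return 100.0 * (1.0 - measured / reference)


def percent_faster(reference, measured):
    """How much faster `measured` is than `reference`, in percent."""
    return 100.0 * (measured / reference - 1.0)


r32 = parse_raid6(DMESG_X86_32)
r64 = parse_raid6(DMESG_X86_64)

# Selected algorithm on each build: sse2x2 (32-bit) vs sse2x4 (64-bit).
print("selected, 64 vs 32: %.1f%% slower"
      % percent_slower(r32["sse2x2"], r64["sse2x4"]))
# Same routine on both builds.
print("sse2x2, 64 vs 32:   %.1f%% slower"
      % percent_slower(r32["sse2x2"], r64["sse2x2"]))
# Fastest 64-bit routine vs the one actually selected.
print("int64x2 vs sse2x4:  %.1f%% faster"
      % percent_faster(r64["sse2x4"], r64["int64x2"]))
```

With these figures the selected 64-bit routine comes out roughly 43 % slower than the selected 32-bit one, sse2x2 alone is roughly 58 % slower, and int64x2 is roughly 15 % faster than sse2x4, which is in line with the "~ 50 %" and "~ 15 %" estimates above.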
--
End of message. Next message?
* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
2008-11-16 16:18 raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64 Igor Podlesny
@ 2008-11-17 0:36 ` H. Peter Anvin
2008-11-17 20:56 ` Igor Podlesny
0 siblings, 1 reply; 16+ messages in thread
From: H. Peter Anvin @ 2008-11-17 0:36 UTC (permalink / raw)
To: for.poige+linux; +Cc: linux-raid, neilb
Igor Podlesny wrote:
>
> So, there are two strange things in those dmesgs. The first one might be
> unrelated to Linux RAID but affects it -- have you noticed that on
> x86_64 the raid6 algorithm is ~ 50 % slower than on x86_32? Is that due
> to poorly optimized code for x86_64 mode? And the second -- why is
> raid6 using algorithm sse2x4 (3175 MB/s), whereas int64x2 gives
> slightly better (~ 15 %) throughput -- 3660 MB/s?
>
> Has anyone on the list made similar observations? Can the gcc version
> difference have that much effect? I doubt it, but I can try building
> x86_32 with gcc 4.3.1 (as x86_64 was).
>
The SSE modes have nicer cache behaviours and are therefore preferred
even if they are slower.
It is very odd that your SSE2 modes are that much slower in 64-bit mode.
It could just be an artifact of the way the test is done (cache
anomalies?), but I kind of suspect there is something more fishy going on.
The sse2 code in the x1 and x2 case is actually identical between x86-32
and -64 (the x4 case is only available for -64) so it is very strange
that you're seeing this kind of effect.
-hpa
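The selection rule described in this reply (prefer the SSE routines for their cache behaviour, even when an integer routine benchmarks faster) can be modelled roughly as below. This is only a sketch and not the kernel's code: the `Algo` type, its `prefer` field and the `pick` helper are assumptions made for illustration, and the throughput figures are the x86_64 numbers from the original report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Algo:
    name: str
    mb_per_s: int
    # Hypothetical flag: 1 for routines with nicer cache behaviour
    # (the SSE2 ones here), 0 for the plain integer routines.
    prefer: int


# x86_64 benchmark results from the original report.
X86_64 = [
    Algo("int64x1", 2722, 0),
    Algo("int64x2", 3660, 0),
    Algo("int64x4", 3265, 0),
    Algo("int64x8", 2593, 0),
    Algo("sse2x1", 1476, 1),
    Algo("sse2x2", 2316, 1),
    Algo("sse2x4", 3175, 1),
]


def pick(algos):
    """Choose by preference first; throughput only breaks ties."""
    return max(algos, key=lambda a: (a.prefer, a.mb_per_s))


def fastest(algos):
    """Choose purely by measured throughput."""
    return max(algos, key=lambda a: a.mb_per_s)


print("fastest:", fastest(X86_64).name)
print("picked: ", pick(X86_64).name)
```

Under this model the raw winner is int64x2 but the pick is sse2x4, which matches the "using algorithm sse2x4" line in the x86_64 dmesg.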
* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
2008-11-17 0:36 ` H. Peter Anvin
@ 2008-11-17 20:56 ` Igor Podlesny
0 siblings, 0 replies; 16+ messages in thread
From: Igor Podlesny @ 2008-11-17 20:56 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: linux-raid, neilb
2008/11/17 H. Peter Anvin <hpa@zytor.com>:
> Igor Podlesny wrote:
>>
[...]
>> Has anyone on the list made similar observations? Can the gcc version
>> difference have that much effect? I doubt it, but I can try building
>> x86_32 with gcc 4.3.1 (as x86_64 was).
>>
>
> The SSE modes have nicer cache behaviours and are therefore preferred
> even if they are slower.
>
That was my guess, that they're preferable (but I wasn't aware of the exact
reason, thanks!). :-)
>
> It is very odd that your SSE2 modes are that much slower in 64-bit mode.
> It could just be an artifact of the way the test is done (cache
> anomalies?), but I kind of suspect there is something more fishy going on.
I've built gcc-4.2.4 and recompiled the kernel with it. .config's diff:
-CONFIG_LOCALVERSION="-64_1khz"
+CONFIG_LOCALVERSION="-64_1khz-gcc42"
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
@@ -287,7 +287,7 @@
# CONFIG_RELOCATABLE is not set
CONFIG_PHYSICAL_ALIGN=0x200000
CONFIG_HOTPLUG_CPU=y
-CONFIG_COMPAT_VDSO=y
+# CONFIG_COMPAT_VDSO is not set
And dmesg now says:
[ 2.606258] raid6: int64x1 2210 MB/s
[ 2.623255] raid6: int64x2 3246 MB/s
[ 2.640257] raid6: int64x4 3289 MB/s
[ 2.657262] raid6: int64x8 3019 MB/s
[ 2.674262] raid6: sse2x1 4253 MB/s
[ 2.691258] raid6: sse2x2 5621 MB/s
[ 2.708261] raid6: sse2x4 5718 MB/s
[ 2.708299] raid6: using algorithm sse2x4 (5718 MB/s)
So I conclude that it's not an effect of VDSO but of the gcc version instead.
--
End of message. Next message?
* RE: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
@ 2008-11-17 22:35 H. Peter Anvin
2008-11-18 12:03 ` Igor Podlesny
0 siblings, 1 reply; 16+ messages in thread
From: H. Peter Anvin @ 2008-11-17 22:35 UTC (permalink / raw)
To: for.poige+linux; +Cc: linux-raid, neilb
Interesting... Perhaps you could send me your "bad" kernel vmlinux or raid456.ko file; preferably compiled with CONFIG_DEBUG_INFO.
--
Sent from my mobile phone (pardon any lack of formatting)
-----Original Message-----
From: Igor Podlesny <for.poige+linux@gmail.com>
Sent: Monday, November 17, 2008 12:56
To: H. Peter Anvin <hpa@zytor.com>
Cc: linux-raid@vger.kernel.org; neilb@suse.de
Subject: Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
2008/11/17 H. Peter Anvin <hpa@zytor.com>:
> Igor Podlesny wrote:
>>
[...]
>> Has anyone on the list made similar observations? Can the gcc version
>> difference have that much effect? I doubt it, but I can try building
>> x86_32 with gcc 4.3.1 (as x86_64 was).
>>
>
> The SSE modes have nicer cache behaviours and are therefore preferred
> even if they are slower.
>
That was my guess, that they're preferable (but I wasn't aware of the exact
reason, thanks!). :-)
>
> It is very odd that your SSE2 modes are that much slower in 64-bit mode.
> It could just be an artifact of the way the test is done (cache
> anomalies?), but I kind of suspect there is something more fishy going on
* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
2008-11-17 22:35 H. Peter Anvin
@ 2008-11-18 12:03 ` Igor Podlesny
2008-11-18 15:47 ` H. Peter Anvin
0 siblings, 1 reply; 16+ messages in thread
From: Igor Podlesny @ 2008-11-18 12:03 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: linux-raid
2008/11/18 H. Peter Anvin <hpa@zytor.com>:
> Interesting... Perhaps you could send me your "bad" kernel vmlinux or raid456.ko file;
My raid456 isn't a module; it's built into the kernel. Will the .o be
enough? Or would it be better to recompile it as a module?
> preferably compiled with CONFIG_DEBUG_INFO.
ok.
--
End of message. Next message?
* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
2008-11-18 12:03 ` Igor Podlesny
@ 2008-11-18 15:47 ` H. Peter Anvin
2008-11-21 19:22 ` Igor Podlesny
0 siblings, 1 reply; 16+ messages in thread
From: H. Peter Anvin @ 2008-11-18 15:47 UTC (permalink / raw)
To: for.poige+linux; +Cc: linux-raid
Igor Podlesny wrote:
> 2008/11/18 H. Peter Anvin <hpa@zytor.com>:
>> Interesting... Perhaps you could send me your "bad" kernel vmlinux or raid456.ko file;
>
> My raid456 isn't a module; it's built into the kernel. Will the .o be
> enough? Or would it be better to recompile it as a module?
>
>> preferably compiled with CONFIG_DEBUG_INFO.
>
Built-in is fine. I need the vmlinux file, though, not bzImage.
-hpa
* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
2008-11-18 15:47 ` H. Peter Anvin
@ 2008-11-21 19:22 ` Igor Podlesny
2008-11-21 19:31 ` H. Peter Anvin
0 siblings, 1 reply; 16+ messages in thread
From: Igor Podlesny @ 2008-11-21 19:22 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: linux-raid
2008/11/18 H. Peter Anvin <hpa@zytor.com>:
> Igor Podlesny wrote:
>>
>> 2008/11/18 H. Peter Anvin <hpa@zytor.com>:
>>>
>>> Interesting... Perhaps you could send me your "bad" kernel vmlinux or
>>> raid456.ko file;
I've narrowed it down a bit: the problem arises only when "-Os" is used.
>
> Built-in is fine. I need the vmlinux file, though, not bzImage.
>
Ok, it's compiling right now. vmlinux is pretty huge, so I guess it'd
be better to upload it to an FTP server.
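The thread does not say how "-Os" ended up on the compiler command line. In kernels of this era it is normally enabled by the CONFIG_CC_OPTIMIZE_FOR_SIZE option, so, assuming that is what happened here, a quick way to compare two builds is to look for that symbol in each .config. The function name and the sample .config fragments below are made up for illustration.

```python
def optimizes_for_size(config_text):
    """Return True if the given .config text enables CONFIG_CC_OPTIMIZE_FOR_SIZE.

    Assumption: the kernel build adds -Os when this option is set to y
    and uses -O2 otherwise.
    """
    for line in config_text.splitlines():
        if line.strip() == "CONFIG_CC_OPTIMIZE_FOR_SIZE=y":
            return True
    return False


# Hypothetical .config fragments, not taken from the thread.
SLOW_BUILD = """\
CONFIG_LOCALVERSION="-64_1khz"
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_HOTPLUG_CPU=y
"""

FAST_BUILD = """\
CONFIG_LOCALVERSION="-64_1khz"
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_HOTPLUG_CPU=y
"""

print("slow build uses -Os:", optimizes_for_size(SLOW_BUILD))
print("fast build uses -Os:", optimizes_for_size(FAST_BUILD))
```

To try it on a real tree, read the file first, for example `optimizes_for_size(open(".config").read())`.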
--
End of message. Next message?
* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
2008-11-21 19:22 ` Igor Podlesny
@ 2008-11-21 19:31 ` H. Peter Anvin
2008-11-21 19:33 ` Igor Podlesny
0 siblings, 1 reply; 16+ messages in thread
From: H. Peter Anvin @ 2008-11-21 19:31 UTC (permalink / raw)
To: for.poige+linux; +Cc: linux-raid
Igor Podlesny wrote:
> 2008/11/18 H. Peter Anvin <hpa@zytor.com>:
>> Igor Podlesny wrote:
>>> 2008/11/18 H. Peter Anvin <hpa@zytor.com>:
>>>> Interesting... Perhaps you could send me your "bad" kernel vmlinux or
>>>> raid456.ko file;
>
> I've narrowed it down a bit: the problem arises only when "-Os" is used.
>> Built-in is fine. I need the vmlinux file, though, not bzImage.
>>
> Ok, it's compiling right now. vmlinux is pretty huge, so I guess it'd
> be better to upload it to an FTP server.
>
It probably is. My side can handle large emails, but yours might not
(and, obviously, don't send it to the list.)
It sort of makes sense that -Os would break this stuff. For newer gccs,
it would be better to use SSE2 intrinsics rather than inline assembly.
The problem is that it breaks older gcc.
-hpa
* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
2008-11-21 19:31 ` H. Peter Anvin
@ 2008-11-21 19:33 ` Igor Podlesny
2008-11-21 20:15 ` H. Peter Anvin
0 siblings, 1 reply; 16+ messages in thread
From: Igor Podlesny @ 2008-11-21 19:33 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: linux-raid
2008/11/22 H. Peter Anvin <hpa@zytor.com>:
[...]
> It sort of makes sense that -Os would break this stuff. For newer gccs,
> it would be better to use SSE2 intrinsics rather than inline assembly.
> The problem is that it breaks older gcc.
>
Well, #ifdef could be helpful then, couldn't it? :-)
--
End of message. Next message?
* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
2008-11-21 19:33 ` Igor Podlesny
@ 2008-11-21 20:15 ` H. Peter Anvin
2008-11-22 5:40 ` Igor Podlesny
0 siblings, 1 reply; 16+ messages in thread
From: H. Peter Anvin @ 2008-11-21 20:15 UTC (permalink / raw)
To: for.poige+linux; +Cc: linux-raid
Igor Podlesny wrote:
> 2008/11/22 H. Peter Anvin <hpa@zytor.com>:
> [...]
>> It sort of makes sense that -Os would break this stuff. For newer gccs,
>> it would be better to use SSE2 intrinsics rather than inline assembly.
>> The problem is that it breaks older gcc.
>>
> Well, #ifdef could be helpful then, couldn't it? :-)
>
Yes, and that's probably the way to go.
I just tested a version using gcc intrinsics with gcc 4.3, and it is
almost 20% faster than the inline assembly version. That, plus the fact
that the code is actually readable, makes me really want to figure out
how best to deploy this.
-hpa
* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
2008-11-21 20:15 ` H. Peter Anvin
@ 2008-11-22 5:40 ` Igor Podlesny
2008-11-22 5:42 ` H. Peter Anvin
0 siblings, 1 reply; 16+ messages in thread
From: Igor Podlesny @ 2008-11-22 5:40 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: linux-raid
2008/11/22 H. Peter Anvin <hpa@zytor.com>:
[...]
> I just tested a version using gcc intrinsics with gcc 4.3, and it is
> almost 20% faster than the inline assembly version. That, plus the fact
> that the code is actually readable, makes me really want to figure out
> how best to deploy this.
Please let us know when committing the patch -- 20 % is valuable. ;-)
--
End of message. Next message?
* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
2008-11-22 5:40 ` Igor Podlesny
@ 2008-11-22 5:42 ` H. Peter Anvin
2008-11-22 5:45 ` Igor Podlesny
0 siblings, 1 reply; 16+ messages in thread
From: H. Peter Anvin @ 2008-11-22 5:42 UTC (permalink / raw)
To: for.poige+linux; +Cc: linux-raid
Igor Podlesny wrote:
> 2008/11/22 H. Peter Anvin <hpa@zytor.com>:
> [...]
>> I just tested a version using gcc intrinsics with gcc 4.3, and it is
>> almost 20% faster than the inline assembly version. That, plus the fact
>> that the code is actually readable, makes me really want to figure out
>> how best to deploy this.
>
> Please let us know when committing the patch -- 20 % is valuable. ;-)
>
Looks like I was a bit too optimistic. The 20% was because of the
missed prefetchnta, which means polluting the cache.
-hpa
* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
2008-11-22 5:42 ` H. Peter Anvin
@ 2008-11-22 5:45 ` Igor Podlesny
2008-11-23 1:12 ` John Robinson
2008-12-05 13:36 ` Igor Podlesny
0 siblings, 2 replies; 16+ messages in thread
From: Igor Podlesny @ 2008-11-22 5:45 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: linux-raid
2008/11/22 H. Peter Anvin <hpa@zytor.com>:
[...]
>> Please let us know when committing the patch -- 20 % is valuable.
>> ;-)
>>
>
> Looks like I was a bit too optimistic. The 20% was because of the missed
> prefetchnta, which means polluting the cache.
>
> -hpa
Well, I'll take whatever it gives, as long as it's not a regression. :-)
--
End of message. Next message?
* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
2008-11-22 5:45 ` Igor Podlesny
@ 2008-11-23 1:12 ` John Robinson
2008-12-05 13:36 ` Igor Podlesny
1 sibling, 0 replies; 16+ messages in thread
From: John Robinson @ 2008-11-23 1:12 UTC (permalink / raw)
To: Linux RAID
On 22/11/2008 05:45, Igor Podlesny wrote:
> 2008/11/22 H. Peter Anvin <hpa@zytor.com>:
> [...]
>>> Please let us know when committing the patch -- 20 % is valuable.
>>> ;-)
>>>
>> Looks like I was a bit too optimistic. The 20% was because of the missed
>> prefetchnta, which means polluting the cache.
>
> Well, I'll take whatever it gives, as long as it's not a regression. :-)
Not sure I agree with that - if the 20% improvement were to have a heavy
impact on the rest of system performance when the system is already
going to be underperforming, it may not be worth having (though it might be
an option for people who can afford the CPU or whatever).
Cheers,
John.
* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
2008-11-22 5:45 ` Igor Podlesny
2008-11-23 1:12 ` John Robinson
@ 2008-12-05 13:36 ` Igor Podlesny
2008-12-05 17:34 ` H. Peter Anvin
1 sibling, 1 reply; 16+ messages in thread
From: Igor Podlesny @ 2008-12-05 13:36 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: linux-raid
2008/11/22 Igor Podlesny <for.poige+linux@gmail.com>:
> 2008/11/22 H. Peter Anvin <hpa@zytor.com>:
> [...]
>>> Please let us know when committing the patch -- 20 % is valuable.
>>> ;-)
>>>
>>
>> Looks like I was a bit too optimistic. The 20% was because of the missed
>> prefetchnta, which means polluting the cache.
>>
>> -hpa
>
> Well, I'll take whatever it gives, as long as it's not a regression. :-)
Hi! Any news?
--
End of message. Next message?
* Re: raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64.
2008-12-05 13:36 ` Igor Podlesny
@ 2008-12-05 17:34 ` H. Peter Anvin
0 siblings, 0 replies; 16+ messages in thread
From: H. Peter Anvin @ 2008-12-05 17:34 UTC (permalink / raw)
To: for.poige+linux; +Cc: linux-raid
Igor Podlesny wrote:
> 2008/11/22 Igor Podlesny <for.poige+linux@gmail.com>:
>> 2008/11/22 H. Peter Anvin <hpa@zytor.com>:
>> [...]
>>>> Please let us know when committing the patch -- 20 % is valuable.
>>>> ;-)
>>>>
>>> Looks like I was a bit too optimistic. The 20% was because of the missed
>>> prefetchnta, which means polluting the cache.
>>>
>>> -hpa
>> Well, I'll take whatever it gives, as long as it's not a regression. :-)
>
> Hi! Any news?
>
No.
-hpa
Thread overview: 16+ messages
2008-11-16 16:18 raid6's using not the best bandwidth method && raid6 algo is significantly slower in x86_64 Igor Podlesny
2008-11-17 0:36 ` H. Peter Anvin
2008-11-17 20:56 ` Igor Podlesny
-- strict thread matches above, loose matches on Subject: below --
2008-11-17 22:35 H. Peter Anvin
2008-11-18 12:03 ` Igor Podlesny
2008-11-18 15:47 ` H. Peter Anvin
2008-11-21 19:22 ` Igor Podlesny
2008-11-21 19:31 ` H. Peter Anvin
2008-11-21 19:33 ` Igor Podlesny
2008-11-21 20:15 ` H. Peter Anvin
2008-11-22 5:40 ` Igor Podlesny
2008-11-22 5:42 ` H. Peter Anvin
2008-11-22 5:45 ` Igor Podlesny
2008-11-23 1:12 ` John Robinson
2008-12-05 13:36 ` Igor Podlesny
2008-12-05 17:34 ` H. Peter Anvin