* Itanium2@900MHz slower than alpha@666MHz ?
@ 2003-10-24 15:14 Ionut Georgescu
2003-10-24 15:35 ` Matthew Wilcox
` (10 more replies)
0 siblings, 11 replies; 12+ messages in thread
From: Ionut Georgescu @ 2003-10-24 15:14 UTC (permalink / raw)
To: linux-ia64
Hello,
I am puzzled about the speed of a zx2000 workstation with a 900MHz CPU.
According to the SPECfp2000 benchmarks, this workstation should be about
twice as fast as a DS10 alpha workstation and, according to the fftw2
benchmarks, at least 50% faster (double precision, real data, 256x256 FFT
transforms). I ran the fftw2 benchmark myself and could reproduce the
data on fftw.org.
However, my program is about 40% slower on the zx2000 than on the alpha.
It only does some Fourier transforms (fftw2, 256x256) and some matrix
operations (sort of an inner product). Both fftw2 and the program have
been compiled with ecc -O2 -ipo -limf. ecc is Version 7.1, Build
20030307.
Both the alpha and the Itanium2 run Debian stable and kernel 2.4.20.
Is there anything else I can do to improve performance? I tried to do some
profiling (CFLAGS="-g -p -Ob0 -O0 -inline_debug_info"), but the report
is missing the call graph and a lot of other information, so I can't
trust the quality of the data. Right now I'm trying to dig my
way through qprof and pfmon (for the moment qprof fails when
QPROF_HW_EVENT is set).
Thanks a lot,
Ionut
--
***************
* Ionut Georgescu
* http://www.physik.tu-cottbus.de/~george/
* Registered Linux User #244479
*
* "In Windows you can do everything Microsoft wants you to do; in Unix you
* can do anything the computer is able to do."
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Itanium2@900MHz slower than alpha@666MHz ?
2003-10-24 15:14 Itanium2@900MHz slower than alpha@666MHz ? Ionut Georgescu
@ 2003-10-24 15:35 ` Matthew Wilcox
2003-10-24 16:26 ` Ionut Georgescu
` (9 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Matthew Wilcox @ 2003-10-24 15:35 UTC (permalink / raw)
To: linux-ia64
On Fri, Oct 24, 2003 at 05:14:50PM +0200, Ionut Georgescu wrote:
> I am puzzled about the speed of a zx2000 workstation with a 900Mhz CPU.
> According to the SPECfp2000 benchmarks, this workstation should be about
> twice as fast as a DS10 alpha workstation and according to the fftw2
> benchmarks at least 50% faster (double precision, real data, 256x256 FFT
> transforms). I ran the fftw2 benchmark myself and I could reproduce the
> data on fftw.org
>
> However, my program is about 40% slower on the zx2000 as on the alpha.
> It only does some Fourier transforms (fftw2, 256x256) and some matrix
> operations (sort of an inner product). Both fftw2 and the program have
> been compiled with ecc -O2 -ipo -limf. ecc is Version 7.1, Build
> 20030307.
This strikes me as possibly being a cache size thing. Do you have the
1.5MB cache or 3MB cache version of the 900MHz Itanium? The DS10 specs
I found at
http://www.spec.org/cpu2000/results/res2000q3/cpu2000-20000630-00134.asc
say it has a 2MB cache, so if your data set fits in a 2MB cache and not
in a 1.5MB cache, that would be a possible cause.
What kind of fluctuations do you see between runs? Linux doesn't
do cache-colouring, so high variation between runs could indicate a
close-to-edge-of-cache scenario.
--
"It's not Hollywood. War is real, war is primarily not about defeat or
victory, it is about death. I've seen thousands and thousands of dead bodies.
Do you think I want to have an academic debate on this subject?" -- Robert Fisk
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Itanium2@900MHz slower than alpha@666MHz ?
2003-10-24 15:14 Itanium2@900MHz slower than alpha@666MHz ? Ionut Georgescu
2003-10-24 15:35 ` Matthew Wilcox
@ 2003-10-24 16:26 ` Ionut Georgescu
2003-10-24 16:49 ` Matthew Wilcox
` (8 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Ionut Georgescu @ 2003-10-24 16:26 UTC (permalink / raw)
To: linux-ia64
Yes, it is a 1.5MB cache CPU.
But it might be more than just cache misses. When running with a 256x256
grid, I am actually using 3 256x257 matrices, which is slightly over
1.5MB. I just made a comparison with a 32x32 grid and the difference is
the same: alpha 10.294s, zx2000 14.777s. -O3 is only 0.4s faster
than -O2 in this case.
The variations are within +-0.02s between runs (for the 32x32 case).
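The sizes and timings above can be checked with a little shell
arithmetic. This is only a sketch under stated assumptions: doubles are
taken as 8 bytes, "1.5MB" is read as 1.5 * 2^20 bytes, and the 32x32 run
is assumed to pad each row by one column the same way the 256x257 case
does. The helper names working_set and slowdown_pct are made up for
illustration.

```sh
#!/bin/sh
# Bytes occupied by `count` matrices of n x (n+1) doubles.
# Assumptions: 8-byte doubles; one padding column per row, as in the
# 256x257 case (the padding of the 32x32 run is a guess).
working_set() {
    n=$1; count=${2:-3}
    echo $(( n * (n + 1) * count * 8 ))
}

# 1.5 MB read as 1.5 * 2^20 bytes (assumption about the unit).
cache_bytes=$(( 3 * 1024 * 1024 / 2 ))

for n in 256 32; do
    ws=$(working_set "$n")
    echo "n=$n: working set $ws bytes, cache $cache_bytes bytes"
done

# Relative slowdown in percent, given the faster and the slower time.
slowdown_pct() {
    awk -v fast="$1" -v slow="$2" \
        'BEGIN { printf "%.1f\n", (slow / fast - 1) * 100 }'
}
slowdown_pct 10.294 14.777
```

Under these assumptions the 256 case comes to 1579008 bytes, which is
6144 bytes more than 1572864, while the 32 case is about 25KB and fits
in the cache with a lot of room to spare. The timing ratio works out to
roughly 43.5%, so the gap is still there when the data is far smaller
than the cache.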
Is there a way to test the bandwidth of the cache? I ask because I think
the alphas actually have 2MB of L2 cache, not L3.
Thanks,
Ionut
--
***************
* Ionut Georgescu
* http://www.physik.tu-cottbus.de/~george/
* Registered Linux User #244479
*
* "In Windows you can do everything Microsoft wants you to do; in Unix you
* can do anything the computer is able to do."
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Itanium2@900MHz slower than alpha@666MHz ?
2003-10-24 15:14 Itanium2@900MHz slower than alpha@666MHz ? Ionut Georgescu
2003-10-24 15:35 ` Matthew Wilcox
2003-10-24 16:26 ` Ionut Georgescu
@ 2003-10-24 16:49 ` Matthew Wilcox
2003-10-24 16:55 ` Grant Grundler
` (7 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Matthew Wilcox @ 2003-10-24 16:49 UTC (permalink / raw)
To: linux-ia64
On Fri, Oct 24, 2003 at 06:26:55PM +0200, Ionut Georgescu wrote:
> Yes, it is a 1.5MB cache CPU.
>
> But it might be more than just cache misses. When running with a 256x256
> grid, I am actually using 3 256x257 matrices, which is slightly over
> 1.5MB. I just made a comparison with a 32x32 grid and the difference is
> the same: alpha 10.294s, zx2000 14.777s . -O3 is only by 0.4s faster
> than -O2 in this case.
>
> The variations are within +-0.02s between runs (for the 32x32 case).
>
> Is there a way to test the bandwidth of the cache ? Because I think the
> alphas have actually 2MB of L2 cache, not L3.
I don't know about testing cache bandwidth, but /proc/pal/cpu0/cache_info
gives various stats on how the different levels of cache play together
on ia64.
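A minimal sketch for looking at that file. The function name
show_cache_info is made up for illustration, and nothing is assumed
about the layout of the file; it is simply printed if it is readable.

```sh
#!/bin/sh
# Print the PAL cache description if the file exists and is readable.
# The default path is the one named above; an alternative path can be
# passed as the first argument.
show_cache_info() {
    file=${1:-/proc/pal/cpu0/cache_info}
    if [ -r "$file" ]; then
        cat "$file"
    else
        echo "cannot read $file (is this an ia64 kernel with /proc/pal?)" >&2
        return 1
    fi
}

# Do not abort the script on machines without /proc/pal.
show_cache_info || true
```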
--
"It's not Hollywood. War is real, war is primarily not about defeat or
victory, it is about death. I've seen thousands and thousands of dead bodies.
Do you think I want to have an academic debate on this subject?" -- Robert Fisk
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Itanium2@900MHz slower than alpha@666MHz ?
2003-10-24 15:14 Itanium2@900MHz slower than alpha@666MHz ? Ionut Georgescu
` (2 preceding siblings ...)
2003-10-24 16:49 ` Matthew Wilcox
@ 2003-10-24 16:55 ` Grant Grundler
2003-10-24 17:10 ` Stephane Eranian
` (6 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Grant Grundler @ 2003-10-24 16:55 UTC (permalink / raw)
To: linux-ia64
On Fri, Oct 24, 2003 at 06:26:55PM +0200, Ionut Georgescu wrote:
> Is there a way to test the bandwidth of the cache ? Because I think the
> alphas have actually 2MB of L2 cache, not L3.
I think lmbench has a section for this.
grant
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Itanium2@900MHz slower than alpha@666MHz ?
2003-10-24 15:14 Itanium2@900MHz slower than alpha@666MHz ? Ionut Georgescu
` (3 preceding siblings ...)
2003-10-24 16:55 ` Grant Grundler
@ 2003-10-24 17:10 ` Stephane Eranian
2003-10-24 17:42 ` Chen, Kenneth W
` (5 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Stephane Eranian @ 2003-10-24 17:10 UTC (permalink / raw)
To: linux-ia64
Ionut,
On Fri, Oct 24, 2003 at 06:26:55PM +0200, Ionut Georgescu wrote:
> Yes, it is a 1.5MB cache CPU.
>
> But it might be more than just cache misses. When running with a 256x256
> grid, I am actually using 3 256x257 matrices, which is slightly over
> 1.5MB. I just made a comparison with a 32x32 grid and the difference is
> the same: alpha 10.294s, zx2000 14.777s . -O3 is only by 0.4s faster
> than -O2 in this case.
>
Another thing to check is the syslog. Check whether your program
is getting "floating point software assist" faults. They can slow
you down significantly.
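A quick way to look for them, as a sketch: count the kernel log lines
that mention a floating-point assist. The exact wording of the kernel
message is an assumption here, so the pattern is kept loose; the
function name count_fp_assist and the two sample lines are made up for
illustration.

```sh
#!/bin/sh
# Count log lines on stdin that look like floating-point assist faults.
# The pattern is deliberately loose because the exact message text is
# an assumption. grep -c exits non-zero when the count is 0, hence the
# "|| true".
count_fp_assist() {
    grep -i -c -E 'floating[- ]point.*assist' || true
}

# Illustrative input only; on a real system one would run
#   dmesg | count_fp_assist
printf '%s\n' \
    'myprog(1234): floating-point assist fault at ip ...' \
    'eth0: link up' | count_fp_assist
```

A large count for your program's name would point at denormal or
otherwise assisted operations being handled in software.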
--
-Stephane
^ permalink raw reply [flat|nested] 12+ messages in thread
* RE: Itanium2@900MHz slower than alpha@666MHz ?
2003-10-24 15:14 Itanium2@900MHz slower than alpha@666MHz ? Ionut Georgescu
` (4 preceding siblings ...)
2003-10-24 17:10 ` Stephane Eranian
@ 2003-10-24 17:42 ` Chen, Kenneth W
2003-10-24 18:36 ` David Mosberger
` (4 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Chen, Kenneth W @ 2003-10-24 17:42 UTC (permalink / raw)
To: linux-ia64
It wasn't clear from the description whether you actually turned on
profile-guided optimization (PGO) with the electron compiler. It is a
two-pass compilation: once with -prof_gen to generate an execution
profile, and then once with -prof_use to complete the optimization.
One other neat thing about the Itanium architecture is the capability of
its performance counters. They can do cycle accounting, which breaks
down the number of cycles lost to various kinds of micro-architectural
events. It is based on the CPU's actual stall cycles in the pipeline, so
you can see exactly where the stalls are coming from and eliminate any
guesswork.
See the electron compiler user's guide for the PGO methodology:
http://www.intel.com/software/products/compilers/c60l/resources/c_ug_lnx.pdf
Cycle accounting is described in the Intel Itanium Software Developer's
Manual.
- Ken
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Itanium2@900MHz slower than alpha@666MHz ?
2003-10-24 15:14 Itanium2@900MHz slower than alpha@666MHz ? Ionut Georgescu
` (5 preceding siblings ...)
2003-10-24 17:42 ` Chen, Kenneth W
@ 2003-10-24 18:36 ` David Mosberger
2003-10-24 18:55 ` Ionut Georgescu
` (3 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: David Mosberger @ 2003-10-24 18:36 UTC (permalink / raw)
To: linux-ia64
>>>>> On Fri, 24 Oct 2003 10:10:44 -0700, Stephane Eranian <eranian@hpl.hp.com> said:
Stephane> Ionut, On Fri, Oct 24, 2003 at 06:26:55PM +0200, Ionut
Stephane> Georgescu wrote:
>> Yes, it is a 1.5MB cache CPU.
>> But it might be more than just cache misses. When running with a
>> 256x256 grid, I am actually using 3 256x257 matrices, which is
>> slightly over 1.5MB. I just made a comparison with a 32x32 grid
>> and the difference is the same: alpha 10.294s, zx2000 14.777s.
>> -O3 is only by 0.4s faster than -O2 in this case.
Stephane> Another thing to check is the syslog. Check to see if your
Stephane> program is not getting "floating point software assist"
Stephane> fault. That can slow you down significantly.
Note that on Red Hat you'd have to raise the dmesg level above WARNING
to see such messages (dmesg -n8 should do the trick).
--david
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Itanium2@900MHz slower than alpha@666MHz ?
2003-10-24 15:14 Itanium2@900MHz slower than alpha@666MHz ? Ionut Georgescu
` (6 preceding siblings ...)
2003-10-24 18:36 ` David Mosberger
@ 2003-10-24 18:55 ` Ionut Georgescu
2003-10-24 18:59 ` Ionut Georgescu
` (2 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Ionut Georgescu @ 2003-10-24 18:55 UTC (permalink / raw)
To: linux-ia64
Thanks, but I still need to learn about this kind of optimization.
After running the program I don't get any .dpi file. And the name of the
.dpy file is rather related to the pid of the process than to the name
of the program.
On Fri, Oct 24, 2003 at 10:42:40AM -0700, Chen, Kenneth W wrote:
> It wasn't clear from the description whether you actually turned on
> profile guided optimization with electron compiler. It is a two pass
> compilation, once with -prof_gen to generate execution profile and then
> once with -prof_use to complete PGO optimization.
>
> One other neat thing about Itanium architecture is the capability of
> it's performance counter. It has capability to do cycle accounting that
> break-down the number of cycles that are lost due to various kinds of
> micro-architectural events, it is based on CPU's actual stall cycles in
> the pipeline so you can see exactly where the stall is coming from to
> eliminate any guess work.
Are qprof and pfmon enough to do this ?
>
> See electron compiler user's guide for PGO methodology:
> http://www.intel.com/software/products/compilers/c60l/resources/c_ug_lnx.pdf
Is this the same for 7.x ?
>
> Cycle accounting is described in Intel Itanium Software developer's
> manual.
>
Thank you for the info. I'll try to find the bottleneck.
Ionut
--
***************
* Ionut Georgescu
* http://www.physik.tu-cottbus.de/~george/
* Registered Linux User #244479
*
* "In Windows you can do everything Microsoft wants you to do; in Unix you
* can do anything the computer is able to do."
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Itanium2@900MHz slower than alpha@666MHz ?
2003-10-24 15:14 Itanium2@900MHz slower than alpha@666MHz ? Ionut Georgescu
` (7 preceding siblings ...)
2003-10-24 18:55 ` Ionut Georgescu
@ 2003-10-24 18:59 ` Ionut Georgescu
2003-10-24 20:25 ` Siddha, Suresh B
2003-10-24 20:36 ` Chen, Kenneth W
10 siblings, 0 replies; 12+ messages in thread
From: Ionut Georgescu @ 2003-10-24 18:59 UTC (permalink / raw)
To: linux-ia64
On Fri, Oct 24, 2003 at 08:55:17PM +0200, Ionut Georgescu wrote:
> Thanks, but I still need to learn about this kind of optimization.
> After running the program I don't get any .dpi file. And the name of the
> .dpy file is rather related to the pid of the process than to the name
Sorry, meant the .dyn file.
--
***************
* Ionut Georgescu
* http://www.physik.tu-cottbus.de/~george/
* Registered Linux User #244479
*
* "In Windows you can do everything Microsoft wants you to do; in Unix you
* can do anything the computer is able to do."
^ permalink raw reply [flat|nested] 12+ messages in thread
* RE: Itanium2@900MHz slower than alpha@666MHz ?
2003-10-24 15:14 Itanium2@900MHz slower than alpha@666MHz ? Ionut Georgescu
` (8 preceding siblings ...)
2003-10-24 18:59 ` Ionut Georgescu
@ 2003-10-24 20:25 ` Siddha, Suresh B
2003-10-24 20:36 ` Chen, Kenneth W
10 siblings, 0 replies; 12+ messages in thread
From: Siddha, Suresh B @ 2003-10-24 20:25 UTC (permalink / raw)
To: linux-ia64
> On Fri, Oct 24, 2003 at 08:55:17PM +0200, Ionut Georgescu wrote:
> > Thanks, but I still need to learn about this kind of optimization.
> > After running the program I don't get any .dpi file. And the name of the
> > .dpy file is rather related to the pid of the process than to the name
>
> Sorry, meant the .dyn file.
>
You need to use profmerge to convert the *.dyn files into the pgopti.dpi
file. The name of each *.dyn file is related to the pid so that
information can be collected from multiple or parallel runs.
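Put together with the -prof_gen / -prof_use passes mentioned earlier in
the thread, the whole cycle might look roughly like the sketch below. It
only prints the commands unless DRY_RUN=0 is set. The file names
myprog.c and myprog, the run and pgo_build helpers, and calling
profmerge without arguments in the build directory are assumptions for
illustration; check the compiler documentation for your version.

```sh
#!/bin/sh
# Print a command (default), or execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Two-pass PGO build: instrument, run, merge the profile, rebuild.
# $1 = source file, $2 = executable name (both hypothetical).
pgo_build() {
    src=$1; exe=$2
    run ecc -O2 -prof_gen -o "$exe" "$src" -limf   # pass 1: instrumented
    run "./$exe"                                   # writes a *.dyn file
    run profmerge                                  # *.dyn -> pgopti.dpi
    run ecc -O2 -prof_use -o "$exe" "$src" -limf   # pass 2: use profile
}

pgo_build myprog.c myprog
```

The instrumented binary should be run on input that is representative
of the real workload, since the second pass optimizes for whatever the
profile recorded.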
thanks,
suresh
^ permalink raw reply [flat|nested] 12+ messages in thread
* RE: Itanium2@900MHz slower than alpha@666MHz ?
2003-10-24 15:14 Itanium2@900MHz slower than alpha@666MHz ? Ionut Georgescu
` (9 preceding siblings ...)
2003-10-24 20:25 ` Siddha, Suresh B
@ 2003-10-24 20:36 ` Chen, Kenneth W
10 siblings, 0 replies; 12+ messages in thread
From: Chen, Kenneth W @ 2003-10-24 20:36 UTC (permalink / raw)
To: linux-ia64
>> One other neat thing about Itanium architecture is the capability of
>> it's performance counter. It has capability to do cycle accounting that
>> break-down the number of cycles that are lost due to various kinds of
>> micro-architectural events, it is based on CPU's actual stall cycles in
>> the pipeline so you can see exactly where the stall is coming from to
>> eliminate any guess work.
>
>Are qprof and pfmon enough to do this ?
Yes.
>> See electron compiler user's guide for PGO methodology:
>> http://www.intel.com/software/products/compilers/c60l/resources/c_ug_lnx.pdf
>
>Is this the same for 7.x ?
I think so.
- Ken
^ permalink raw reply [flat|nested] 12+ messages in thread