* Consistency problem on IPF
@ 2004-07-12 9:13 Marc Gonzalez-Sigler
2004-07-12 10:06 ` Erich Focht
` (8 more replies)
0 siblings, 9 replies; 10+ messages in thread
From: Marc Gonzalez-Sigler @ 2004-07-12 9:13 UTC (permalink / raw)
To: linux-ia64
Hello,
Several weeks ago, I wrote a naive matrix-matrix multiply program.
#define N 512   /* 500 and 1024 were also tried */

int main(void)
{
    static double A[N][N], B[N][N], C[N][N];
    int i, j, k;

    /* Initialize A and B */

    /* Main loop */
    for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
            for (k = 0; k < N; ++k)
                C[i][j] += A[i][k] * B[k][j];

    /* Print the sum of all elements of C */
    return 0;
}
The system:
$ cat /proc/cpuinfo
processor : 0
vendor : GenuineIntel
arch : IA-64
family : Itanium 2
model : 1
revision : 5
archrev : 0
features : branchlong
cpu number : 0
cpu regs : 4
cpu MHz : 1300.000000
itc MHz : 1300.000000
BogoMIPS : 1946.15
processor : 1
[same as processor 0]
$ uname -a
Linux c64 2.6.6 #2 SMP Thu Jun 10 18:03:20 CEST 2004 ia64 GNU/Linux
I started with N=512, which was a bad idea. I ran the same program 100
times on an empty system, and saw very different execution times. I
tried to pin the program to a single CPU, but the results were similar.
N=512
MIN = 1.190000
MAX = 11.470000
MEAN = 4.686900
MEDIAN = 1.390000
STDDEV = 4.181866
OK. N=512 was probably a pathological case. Let us try N=500.
N=500
MIN = 0.670000
MAX = 1.770000
MEAN = 1.013100
MEDIAN = 0.670000
STDDEV = 0.466653
Better, but still quite inconsistent...
The same experiment on a 3.0 GHz Northwood running 2.4.22
N=500
MEAN = 1.375200
MEDIAN = 1.375000
STDDEV = 0.002825
Tony Luck, an Intel engineer, told me on a different list this was a
page-coloring issue. Would you agree? Is there a knob in Linux 2.6 to
request a smarter physical page allocation policy? Do you think I would
get similar results if I used 2.4 instead of 2.6?
Thanks to everybody for reading this far.
--
Regards, Marc
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Consistency problem on IPF
2004-07-12 9:13 Consistency problem on IPF Marc Gonzalez-Sigler
@ 2004-07-12 10:06 ` Erich Focht
2004-07-12 11:12 ` Duraid Madina
` (7 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: Erich Focht @ 2004-07-12 10:06 UTC (permalink / raw)
To: linux-ia64
Hi Marc,
usually, if you can solve the problem at user level it is improbable
that someone will provide a solution in the kernel. To my knowledge
there is no knob for optimizing memory layout in 2.6 and 2.4 results
will be similar. What platform do you use? If it's NUMA, there are more
answers...
At user level you could:
- try an optimized matrix-matrix multiply (BLAS3 DGEMM function from
the Intel MKL (math kernel library) ?). I'd expect that to be coded in
such a way that the impact of your data layout is reduced/limited.
- try using pages from hugetlbfs and keep all data in one page.
Regards,
Erich
On Monday 12 July 2004 11:13, Marc Gonzalez-Sigler wrote:
> Hello,
>
> Several weeks ago, I wrote a naive matrix-matrix multiply program.
>
> #define N 512   /* 500 and 1024 were also tried */
>
> int main(void)
> {
>     static double A[N][N], B[N][N], C[N][N];
>     int i, j, k;
>
>     /* Initialize A and B */
>
>     /* Main loop */
>     for (i = 0; i < N; ++i)
>         for (j = 0; j < N; ++j)
>             for (k = 0; k < N; ++k)
>                 C[i][j] += A[i][k] * B[k][j];
>
>     /* Print the sum of all elements of C */
>     return 0;
> }
>
> The system:
>
> $ cat /proc/cpuinfo
> processor : 0
> vendor : GenuineIntel
> arch : IA-64
> family : Itanium 2
> model : 1
> revision : 5
> archrev : 0
> features : branchlong
> cpu number : 0
> cpu regs : 4
> cpu MHz : 1300.000000
> itc MHz : 1300.000000
> BogoMIPS : 1946.15
>
> processor : 1
> [same as processor 0]
>
> $ uname -a
> Linux c64 2.6.6 #2 SMP Thu Jun 10 18:03:20 CEST 2004 ia64 GNU/Linux
>
>
> I started with N=512, which was a bad idea. I ran the same program 100
> times on an empty system, and saw very different execution times. I
> tried to pin the program to a single CPU, but the results were similar.
>
> N=512
> MIN = 1.190000
> MAX = 11.470000
> MEAN = 4.686900
> MEDIAN = 1.390000
> STDDEV = 4.181866
>
> OK. N=512 was probably a pathological case. Let us try N=500.
>
> N=500
> MIN = 0.670000
> MAX = 1.770000
> MEAN = 1.013100
> MEDIAN = 0.670000
> STDDEV = 0.466653
>
> Better, but still quite inconsistent...
>
> The same experiment on a 3.0 GHz Northwood running 2.4.22
>
> N=500
> MEAN = 1.375200
> MEDIAN = 1.375000
> STDDEV = 0.002825
>
> Tony Luck, an Intel engineer, told me on a different list this was a
> page-coloring issue. Would you agree? Is there a knob in Linux 2.6 to
> request a smarter physical page allocation policy? Do you think I would
> get similar results if I used 2.4 instead of 2.6?
>
> Thanks to everybody for reading this far.
* Re: Consistency problem on IPF
2004-07-12 9:13 Consistency problem on IPF Marc Gonzalez-Sigler
2004-07-12 10:06 ` Erich Focht
@ 2004-07-12 11:12 ` Duraid Madina
2004-07-12 11:24 ` Erich Focht
` (6 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: Duraid Madina @ 2004-07-12 11:12 UTC (permalink / raw)
To: linux-ia64
For the record, on HP-UX 11.23 the standard deviation is down around
0.1s for N=500, 512. For N=1024, it's basically zero. This is on a 2-way
1.5G/6M system.
Linux has a way to go yet...
Duraid
Erich Focht wrote:
> Hi Marc,
>
> usually, if you can solve the problem at user level it is improbable
> that someone will provide a solution in the kernel. To my knowledge
> there is no knob for optimizing memory layout in 2.6 and 2.4 results
> will be similar. What platform do you use? If it's NUMA, there are more
> answers...
>
> At user level you could:
> - try an optimized matrix-matrix multiply (BLAS3 DGEMM function from
> the Intel MKL (math kernel library) ?). I'd expect that to be coded in
> such a way that the impact of your data layout is reduced/limited.
> - try using pages from hugetlbfs and keep all data in one page.
>
> Regards,
> Erich
>
> On Monday 12 July 2004 11:13, Marc Gonzalez-Sigler wrote:
>
>>Hello,
>>
>>Several weeks ago, I wrote a naive matrix-matrix multiply program.
>>
>>#define N 512   /* 500 and 1024 were also tried */
>>
>>int main(void)
>>{
>>    static double A[N][N], B[N][N], C[N][N];
>>    int i, j, k;
>>
>>    /* Initialize A and B */
>>
>>    /* Main loop */
>>    for (i = 0; i < N; ++i)
>>        for (j = 0; j < N; ++j)
>>            for (k = 0; k < N; ++k)
>>                C[i][j] += A[i][k] * B[k][j];
>>
>>    /* Print the sum of all elements of C */
>>    return 0;
>>}
>>
>>The system:
>>
>>$ cat /proc/cpuinfo
>>processor : 0
>>vendor : GenuineIntel
>>arch : IA-64
>>family : Itanium 2
>>model : 1
>>revision : 5
>>archrev : 0
>>features : branchlong
>>cpu number : 0
>>cpu regs : 4
>>cpu MHz : 1300.000000
>>itc MHz : 1300.000000
>>BogoMIPS : 1946.15
>>
>>processor : 1
>>[same as processor 0]
>>
>>$ uname -a
>>Linux c64 2.6.6 #2 SMP Thu Jun 10 18:03:20 CEST 2004 ia64 GNU/Linux
>>
>>
>>I started with N=512, which was a bad idea. I ran the same program 100
>>times on an empty system, and saw very different execution times. I
>>tried to pin the program to a single CPU, but the results were similar.
>>
>>N=512
>>MIN = 1.190000
>>MAX = 11.470000
>>MEAN = 4.686900
>>MEDIAN = 1.390000
>>STDDEV = 4.181866
>>
>>OK. N=512 was probably a pathological case. Let us try N=500.
>>
>>N=500
>>MIN = 0.670000
>>MAX = 1.770000
>>MEAN = 1.013100
>>MEDIAN = 0.670000
>>STDDEV = 0.466653
>>
>>Better, but still quite inconsistent...
>>
>>The same experiment on a 3.0 GHz Northwood running 2.4.22
>>
>>N=500
>>MEAN = 1.375200
>>MEDIAN = 1.375000
>>STDDEV = 0.002825
>>
>>Tony Luck, an Intel engineer, told me on a different list this was a
>>page-coloring issue. Would you agree? Is there a knob in Linux 2.6 to
>>request a smarter physical page allocation policy? Do you think I would
>>get similar results if I used 2.4 instead of 2.6?
>>
>>Thanks to everybody for reading this far.
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ia64" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
* Re: Consistency problem on IPF
2004-07-12 9:13 Consistency problem on IPF Marc Gonzalez-Sigler
2004-07-12 10:06 ` Erich Focht
2004-07-12 11:12 ` Duraid Madina
@ 2004-07-12 11:24 ` Erich Focht
2004-07-12 11:39 ` Marc Gonzalez-Sigler
` (5 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: Erich Focht @ 2004-07-12 11:24 UTC (permalink / raw)
To: linux-ia64
On Monday 12 July 2004 13:12, Duraid Madina wrote:
> For the record, on HP-UX 11.23 the standard deviation is down around
> 0.1s for N=500, 512. For N=1024, it's basically zero. This is on a 2-way
> 1.5G/6M system.
Which page size was the test code using? HP-UX is more flexible here due
to its different usage of the TLB.
What compiler? The initial mail sounded like gcc was used. I'd
expect a reasonable (i.e. optimizing) compiler to recognize the
trivial matrix-matrix multiply pattern and replace it with highly
optimized code, which would reduce the problems...
> Linux has a way to go yet...
Might be, but one should check whether one is comparing apples with
apples ;-)
Erich
* Re: Consistency problem on IPF
2004-07-12 9:13 Consistency problem on IPF Marc Gonzalez-Sigler
` (2 preceding siblings ...)
2004-07-12 11:24 ` Erich Focht
@ 2004-07-12 11:39 ` Marc Gonzalez-Sigler
2004-07-12 12:01 ` Marc Gonzalez-Sigler
` (4 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: Marc Gonzalez-Sigler @ 2004-07-12 11:39 UTC (permalink / raw)
To: linux-ia64
Hello Duraid,
Did you get the same standard deviation for N=500 and N=512?
What page size did you use?
At N=500, I assume most of your runs complete in approximately 0.6
seconds?
Marc
Duraid Madina wrote:
> For the record, on HP-UX 11.23 the standard deviation is down around
> 0.1s for N=500, 512. For N=1024, it's basically zero. This is on a 2-way
> 1.5G/6M system.
>
> Linux has a way to go yet...
>
> Duraid
>
> Erich Focht wrote:
>
>> Hi Marc,
>>
>> usually, if you can solve the problem at user level it is improbable
>> that someone will provide a solution in the kernel. To my knowledge
>> there is no knob for optimizing memory layout in 2.6 and 2.4 results
>> will be similar. What platform do you use? If it's NUMA, there are more
>> answers...
>>
>> At user level you could:
>> - try an optimized matrix-matrix multiply (BLAS3 DGEMM function from
>> the Intel MKL (math kernel library) ?). I'd expect that to be coded in
>> such a way that the impact of your data layout is reduced/limited.
>> - try using pages from hugetlbfs and keep all data in one page.
>>
>> Regards,
>> Erich
>>
>> On Monday 12 July 2004 11:13, Marc Gonzalez-Sigler wrote:
>>
>>> Hello,
>>>
>>> Several weeks ago, I wrote a naive matrix-matrix multiply program.
>>>
>>> #define N 512   /* 500 and 1024 were also tried */
>>>
>>> int main(void)
>>> {
>>>     static double A[N][N], B[N][N], C[N][N];
>>>     int i, j, k;
>>>
>>>     /* Initialize A and B */
>>>
>>>     /* Main loop */
>>>     for (i = 0; i < N; ++i)
>>>         for (j = 0; j < N; ++j)
>>>             for (k = 0; k < N; ++k)
>>>                 C[i][j] += A[i][k] * B[k][j];
>>>
>>>     /* Print the sum of all elements of C */
>>>     return 0;
>>> }
>>>
>>> The system:
>>>
>>> $ cat /proc/cpuinfo
>>> processor : 0
>>> vendor : GenuineIntel
>>> arch : IA-64
>>> family : Itanium 2
>>> model : 1
>>> revision : 5
>>> archrev : 0
>>> features : branchlong
>>> cpu number : 0
>>> cpu regs : 4
>>> cpu MHz : 1300.000000
>>> itc MHz : 1300.000000
>>> BogoMIPS : 1946.15
>>>
>>> processor : 1
>>> [same as processor 0]
>>>
>>> $ uname -a
>>> Linux c64 2.6.6 #2 SMP Thu Jun 10 18:03:20 CEST 2004 ia64 GNU/Linux
>>>
>>>
>>> I started with N=512, which was a bad idea. I ran the same program
>>> 100 times on an empty system, and saw very different execution times.
>>> I tried to pin the program to a single CPU, but the results were
>>> similar.
>>>
>>> N=512
>>> MIN = 1.190000
>>> MAX = 11.470000
>>> MEAN = 4.686900
>>> MEDIAN = 1.390000
>>> STDDEV = 4.181866
>>>
>>> OK. N=512 was probably a pathological case. Let us try N=500.
>>>
>>> N=500
>>> MIN = 0.670000
>>> MAX = 1.770000
>>> MEAN = 1.013100
>>> MEDIAN = 0.670000
>>> STDDEV = 0.466653
>>>
>>> Better, but still quite inconsistent...
>>>
>>> The same experiment on a 3.0 GHz Northwood running 2.4.22
>>>
>>> N=500
>>> MEAN = 1.375200
>>> MEDIAN = 1.375000
>>> STDDEV = 0.002825
>>>
>>> Tony Luck, an Intel engineer, told me on a different list this was a
>>> page-coloring issue. Would you agree? Is there a knob in Linux 2.6 to
>>> request a smarter physical page allocation policy? Do you think I
>>> would get similar results if I used 2.4 instead of 2.6?
* Re: Consistency problem on IPF
2004-07-12 9:13 Consistency problem on IPF Marc Gonzalez-Sigler
` (3 preceding siblings ...)
2004-07-12 11:39 ` Marc Gonzalez-Sigler
@ 2004-07-12 12:01 ` Marc Gonzalez-Sigler
2004-07-12 20:37 ` David Mosberger
` (3 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: Marc Gonzalez-Sigler @ 2004-07-12 12:01 UTC (permalink / raw)
To: linux-ia64
Erich Focht wrote:
> On Monday 12 July 2004 13:12, Duraid Madina wrote:
>
>> For the record, on HP-UX 11.23 the standard deviation is down around
>> 0.1s for N=500, 512. For N=1024, it's basically zero. This is on a 2-way
>> 1.5G/6M system.
>
> Which page-size was the testcode using? HPUX is more flexible here due
> to different usage of the TLB.
>
> What compiler? The initial mail sounded like gcc has been used. I'd
> expect a reasonable (i.e. optimizing) compiler to recognize the
> trivial matrix-matrix multiply pattern and replace it by highly
> optimized code. Which would reduce the problems...
(For the record, I tried gcc-3.3.4 and orcc-2.1)
Perhaps I was not clear enough. I used matrix-matrix multiply only as an
example. My real problem is the non-deterministic behavior.
Say I tile the main loop nest. I want to compare the execution time of
the original, untiled program and the execution time of the modified,
tiled program.
If the original version completes in 1 second 80% of the time, and the
modified version completes in 0.5 seconds 80% of the time, but 2 seconds
20% of the time, then, if I am unlucky, I might eliminate an excellent
candidate. This is why I need the execution times of a given program to
be consistent.
I have looked at /usr/src/linux/Documentation/vm/hugetlbpage.txt and I
will try it if there are no better solutions.
--
Regards, Marc
* Re: Consistency problem on IPF
2004-07-12 9:13 Consistency problem on IPF Marc Gonzalez-Sigler
` (4 preceding siblings ...)
2004-07-12 12:01 ` Marc Gonzalez-Sigler
@ 2004-07-12 20:37 ` David Mosberger
2004-07-12 21:49 ` Duraid Madina
` (2 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: David Mosberger @ 2004-07-12 20:37 UTC (permalink / raw)
To: linux-ia64
>>>>> On Mon, 12 Jul 2004 14:01:15 +0200, Marc Gonzalez-Sigler <marc.gonzalez-sigler@inria.fr> said:
Marc> (For the record, I tried gcc-3.3.4 and orcc-2.1)
For floating-point stuff, you'll definitely want to try the Intel compiler.
It's a free download for the unsupported/non-commercial version.
As others have pointed out, for matrix multiply, you'll definitely
want to use a hand-tuned version as is available in several math
libraries (yeah, I realize you are not really after matrix multiply).
Marc> Perhaps I was not clear enough. I used matrix-matrix multiply
Marc> only as an example. My real problem is the non-deterministic
Marc> behavior.
Page-coloring can make things more deterministic, at the expense of making
_everything_ go slower. If you search the net, you should be able to
find a page-coloring module. We have experimented with it in the past
and it did its job, but its overall impact was to slow things down, so
it's not a great solution.
The other thing you could do on Linux is use huge pages. That will
mitigate/eliminate the effect of page coloring (and also reduce TLB
pressure).
Marc> Say I tile the main loop nest. I want to compare the execution
Marc> time of the original, untiled program and the execution time
Marc> of the modified, tiled program.
Marc> If the original version completes in 1 second 80% of the time,
Marc> and the modified version completes in 0.5 seconds 80% of the
Marc> time, but 2 seconds 20% of the time, then, if I am unlucky, I
Marc> might eliminate an excellent candidate. This is why I need the
Marc> execution times of a given program to be consistent.
I think you have to allow for the fact that modern CPUs (and OSes) do
not really offer deterministic performance. And don't expect things
to get better. Dynamic power throttling, multi-threading, etc., will
make performance analysis very "interesting".
--david
* Re: Consistency problem on IPF
2004-07-12 9:13 Consistency problem on IPF Marc Gonzalez-Sigler
` (5 preceding siblings ...)
2004-07-12 20:37 ` David Mosberger
@ 2004-07-12 21:49 ` Duraid Madina
2004-07-12 22:05 ` David Mosberger
2004-07-12 22:08 ` Rich Altmaier
8 siblings, 0 replies; 10+ messages in thread
From: Duraid Madina @ 2004-07-12 21:49 UTC (permalink / raw)
To: linux-ia64
Erich Focht wrote:
> On Monday 12 July 2004 13:12, Duraid Madina wrote:
>
>>For the record, on HP-UX 11.23 the standard deviation is down around
>>0.1s for N=500, 512. For N=1024, it's basically zero. This is on a 2-way
>>1.5G/6M system.
>
>
> Which page-size was the testcode using?
Doesn't matter too much, but clamping the page size to 4K (the smallest
possible) increases the standard deviation to ~0.5sec (for a +O0
program, see below). 8K improves that somewhat and beyond 8K, the values
are basically what I reported before. There's no need for truly large
pages on this small example..
> HPUX is more flexible here due
> to different usage of the TLB.
Ain't that the truth! I *wish* Linux could do the same thing - problems
like Marc's would disappear (not to mention that performance for many
big numerical codes out there would increase significantly). But no,
thanks to Oracle, we are forced to deal with junk like hugetlbfs..
> What compiler?
HP aC++/ANSI C B3910B A.05.56 [June 09 2004] [aCC6_beta]
> The initial mail sounded like gcc has been used. I'd
> expect a reasonable (i.e. optimizing) compiler to recognize the
> trivial matrix-matrix multiply pattern and replace it by highly
> optimized code. Which would reduce the problems...
The deviation seemed to stay the same when using +O0 (which slowed the
program by more than a factor of 10, so I'm pretty sure it's not using
highly optimized code now.)
>>Linux has a way to go yet...
>
>
> Might be, but one should check whether comparing apples with apples
> ;-)
I thought the G5 was the fastest computer on earth? ;)
Duraid
* Re: Consistency problem on IPF
2004-07-12 9:13 Consistency problem on IPF Marc Gonzalez-Sigler
` (6 preceding siblings ...)
2004-07-12 21:49 ` Duraid Madina
@ 2004-07-12 22:05 ` David Mosberger
2004-07-12 22:08 ` Rich Altmaier
8 siblings, 0 replies; 10+ messages in thread
From: David Mosberger @ 2004-07-12 22:05 UTC (permalink / raw)
To: linux-ia64
>>>>> On Tue, 13 Jul 2004 07:49:35 +1000, Duraid Madina <duraid@octopus.com.au> said:
Duraid> But no, thanks to Oracle, we are forced to deal with junk
Duraid> like hugetlbfs..
Huge pages and transparent superpages are mostly orthogonal. The
former sometimes can be used as a temporary work-around for the
latter, but superpages won't obsolete huge pages (for those problems
where huge pages are suitable, they can't be beat). This is why huge
pages were accepted into the Linux kernel. Superpage support is being
worked on separately.
--david
* Re: Consistency problem on IPF
2004-07-12 9:13 Consistency problem on IPF Marc Gonzalez-Sigler
` (7 preceding siblings ...)
2004-07-12 22:05 ` David Mosberger
@ 2004-07-12 22:08 ` Rich Altmaier
8 siblings, 0 replies; 10+ messages in thread
From: Rich Altmaier @ 2004-07-12 22:08 UTC (permalink / raw)
To: linux-ia64
We took a brief look at this code and noticed that the leading
dimension of the array is a power of 2. This is often problematic
with a weaker compiler like gcc. We used icc (Intel's) and saw
very repeatable execution times, and significantly better times.
Cache-line conflicts (which page coloring addresses) can be
caused by a case like this: three arrays, combined with loop
unrolling, strain even the 8-way set-associative cache. The
compiler can control how many of these ways are consumed.
So far gcc hasn't shown many cases of good execution time, as I see it.
All of this said, we do see cases where we would like some page
coloring control by the OS. But it is also the case that
you will get far, far better results by first deciding to use the
Intel compiler (that is the largest influence on results).
Many of the features we pursue are indeed aimed at achieving deterministic
performance. Customers do like it. Every time a job runs slower than
a prior run, there is a feeling of losing something.
I would also recommend you consider running your jobs on an Altix system!
Thanks, Rich
David Mosberger wrote:
>>>>>>On Mon, 12 Jul 2004 14:01:15 +0200, Marc Gonzalez-Sigler <marc.gonzalez-sigler@inria.fr> said:
>
>
> Marc> (For the record, I tried gcc-3.3.4 and orcc-2.1)
>
> For floating-point stuff, you'll definitely want to try the Intel compiler.
> It's a free download for the unsupported/non-commercial version.
>
> As others have pointed out, for matrix multiply, you'll definitely
> want to use a hand-tuned version as is available in several math
> libraries (yeah, I realize you are not really after matrix multiply).
>
> Marc> Perhaps I was not clear enough. I used matrix-matrix multiply
> Marc> only as an example. My real problem is the non-deterministic
> Marc> behavior.
>
> Page-coloring can make things more deterministic, at the expense of making
> _everything_ go slower. If you search the net, you should be able to
> find a page-coloring module. We have experimented with it in the past
> and it did its job, but its overall impact was to slow things down, so
> it's not a great solution.
>
> The other thing you could do on Linux is use huge pages. That will
> mitigate/eliminate the effect of page coloring (and also reduce TLB
> pressure).
>
> Marc> Say I tile the main loop nest. I want to compare the execution
> Marc> time of the original, untiled program and the execution time
> Marc> of the modified, tiled program.
>
> Marc> If the original version completes in 1 second 80% of the time,
> Marc> and the modified version completes in 0.5 seconds 80% of the
> Marc> time, but 2 seconds 20% of the time, then, if I am unlucky, I
> Marc> might eliminate an excellent candidate. This is why I need the
> Marc> execution times of a given program to be consistent.
>
> I think you have to allow for the fact that modern CPUs (and OSes) do
> not really offer deterministic performance. And don't expect things
> to get better. Dynamic power throttling, multi-threading, etc., will
> make performance analysis very "interesting".
>
> --david
end of thread, other threads:[~2004-07-12 22:08 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
2004-07-12 9:13 Consistency problem on IPF Marc Gonzalez-Sigler
2004-07-12 10:06 ` Erich Focht
2004-07-12 11:12 ` Duraid Madina
2004-07-12 11:24 ` Erich Focht
2004-07-12 11:39 ` Marc Gonzalez-Sigler
2004-07-12 12:01 ` Marc Gonzalez-Sigler
2004-07-12 20:37 ` David Mosberger
2004-07-12 21:49 ` Duraid Madina
2004-07-12 22:05 ` David Mosberger
2004-07-12 22:08 ` Rich Altmaier
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox