From mboxrd@z Thu Jan 1 00:00:00 1970
From: Marc Gonzalez-Sigler
Date: Mon, 12 Jul 2004 11:39:06 +0000
Subject: Re: Consistency problem on IPF
Message-Id: <40F2785A.10903@inria.fr>
List-Id:
References: <40F2562C.10208@inria.fr>
In-Reply-To: <40F2562C.10208@inria.fr>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: linux-ia64@vger.kernel.org

Hello Duraid,

Did you get the same standard deviation for N=500 and N=512?

What page size did you use?

When N=500, I assume most of your programs complete in approximately
0.6 seconds?

Marc

Duraid Madina wrote:

> For the record, on HP-UX 11.23 the standard deviation is down around
> 0.1s for N=500, 512. For N=1024, it's basically zero. This is on a 2-way
> 1.5G/6M system.
>
> Linux has a way to go yet...
>
> Duraid
>
> Erich Focht wrote:
>
>> Hi Marc,
>>
>> usually, if you can solve the problem at user level it is improbable
>> that someone will provide a solution in the kernel. To my knowledge
>> there is no knob for optimizing memory layout in 2.6, and 2.4 results
>> will be similar. What platform do you use? If it's NUMA, there are more
>> answers...
>>
>> At user level you could:
>> - try an optimized matrix-matrix multiply (the BLAS3 DGEMM function from
>>   the Intel MKL (Math Kernel Library)?). I'd expect that to be coded in
>>   such a way that the impact of your data layout is reduced/limited.
>> - try using pages from hugetlbfs and keep all data in one page.
>>
>> Regards,
>> Erich
>>
>> On Monday 12 July 2004 11:13, Marc Gonzalez-Sigler wrote:
>>
>>> Hello,
>>>
>>> Several weeks ago, I wrote a naive matrix-matrix multiply program.
>>>
>>> int main(void)
>>> {
>>>   static double A[N][N], B[N][N], C[N][N];
>>>
>>>   /* Initialize A and B */
>>>
>>>   /* Main loop */
>>>   for (i=0; i < N; ++i)
>>>     for (j=0; j < N; ++j)
>>>       for (k=0; k < N; ++k)
>>>         C[i][j] += A[i][k]*B[k][j];
>>>
>>>   /* Print the sum of all elements of C */
>>> }
>>>
>>> The system:
>>>
>>> $ cat /proc/cpuinfo
>>> processor  : 0
>>> vendor     : GenuineIntel
>>> arch       : IA-64
>>> family     : Itanium 2
>>> model      : 1
>>> revision   : 5
>>> archrev    : 0
>>> features   : branchlong
>>> cpu number : 0
>>> cpu regs   : 4
>>> cpu MHz    : 1300.000000
>>> itc MHz    : 1300.000000
>>> BogoMIPS   : 1946.15
>>>
>>> processor  : 1
>>> [same as processor 0]
>>>
>>> $ uname -a
>>> Linux c64 2.6.6 #2 SMP Thu Jun 10 18:03:20 CEST 2004 ia64 GNU/Linux
>>>
>>>
>>> I started with N=512, which was a bad idea. I ran the same program
>>> 100 times on an empty system, and saw very different execution times.
>>> I tried to pin the program to a single CPU, but the results were
>>> similar.
>>>
>>> N=512
>>> MIN    = 1.190000
>>> MAX    = 11.470000
>>> MEAN   = 4.686900
>>> MEDIAN = 1.390000
>>> STDDEV = 4.181866
>>>
>>> OK. N=512 was probably a pathological case. Let us try N=500.
>>>
>>> N=500
>>> MIN    = 0.670000
>>> MAX    = 1.770000
>>> MEAN   = 1.013100
>>> MEDIAN = 0.670000
>>> STDDEV = 0.466653
>>>
>>> Better, but still quite inconsistent...
>>>
>>> The same experiment on a 3.0 GHz Northwood running 2.4.22
>>>
>>> N=500
>>> MEAN   = 1.375200
>>> MEDIAN = 1.375000
>>> STDDEV = 0.002825
>>>
>>> Tony Luck, an Intel engineer, told me on a different list this was a
>>> page-coloring issue. Would you agree? Is there a knob in Linux 2.6 to
>>> request a smarter physical page allocation policy? Do you think I
>>> would get similar results if I used 2.4 instead of 2.6?