* Re: [Linux-ia64] VHPT performance
2002-02-20 12:21 [Linux-ia64] VHPT performance Christian Hildner
@ 2002-02-20 17:02 ` Michael Madore
2002-02-20 17:34 ` David Mosberger
` (8 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Michael Madore @ 2002-02-20 17:02 UTC (permalink / raw)
To: linux-ia64
Christian Hildner wrote:
>
> Another thing: Is the lia64-sim list broken? I cannot subscribe to the
> list.
I'm curious about this also. The lia64-sim list does not appear in the
list of lists from majordomo.
Mike
* Re: [Linux-ia64] VHPT performance
From: David Mosberger @ 2002-02-20 17:34 UTC (permalink / raw)
To: linux-ia64
>>>>> On Wed, 20 Feb 2002 13:21:22 +0100, Christian Hildner <christian.hildner@hob.de> said:
Christian> Hi! Does anybody have a benchmarking tool or test
Christian> program for performance comparisons on the VHPT feature?
Christian> Or are there applications that heavily use many different
Christian> memory pages?
There certainly are! The SPEC CPU benchmarks are quite memory intensive, for
example.
Christian> Another thing: Is the lia64-sim list broken? I cannot
Christian> subscribe to the list.
I'm not aware of any problems. What exactly isn't working? Note that
you now have to subscribe by mail to lia64-sim-request@linux.hpl.hp.com,
since I recently upgraded from majordomo to mailman. An archive of
the mailing list traffic is at:
http://www.hpl.hp.com/hosted/linux/mail-archives/lia64-sim/
There hasn't been much traffic recently.
--david
* Re: [Linux-ia64] VHPT performance
From: Christian Hildner @ 2002-02-22 11:35 UTC (permalink / raw)
To: linux-ia64
David Mosberger wrote:
> >>>>> On Wed, 20 Feb 2002 13:21:22 +0100, Christian Hildner <christian.hildner@hob.de> said:
>
> Christian> Hi! Does anybody have a benchmarking tool or test
> Christian> program for performance comparisons on the VHPT feature?
> Christian> Or are there applications that heavily use many different
> Christian> memory pages?
>
> There certainly are! The SPEC CPU benchmarks are quite memory intensive, for
> example.
>
I made some measurements and found that on Itanium it takes about 700 CPU cycles to load a single
byte on a TLB miss with the VHPT enabled, vs. about 900 cycles with the VHPT disabled and the
TLB miss handled in the IVT. Does anybody know whether the VHPT walker is implemented in hardware
(probably not), in microcode (that's what I think), or not implemented at all?
Christian
* Re: [Linux-ia64] VHPT performance
From: David Mosberger @ 2002-02-22 16:58 UTC (permalink / raw)
To: linux-ia64
>>>>> On Fri, 22 Feb 2002 12:35:51 +0100, Christian Hildner <christian.hildner@hob.de> said:
Christian> I made some measurements and found that on Itanium it
Christian> takes about 700 CPU cycles to load a single byte on a
Christian> TLB miss with the VHPT enabled, vs. about 900 cycles
Christian> with the VHPT disabled and the TLB miss handled in the
Christian> IVT.
It's not that simple. On Itanium, the VHPT will help only if the TLB
entry can be found in the cache (this is described in the Itanium
microarch. manual, IIRC). I don't think anything has been said
publicly yet about what McKinley does, so we'll have to wait a bit longer.
The 700-cycle number sounds too high. For example, I have a little
test program that shows that repeatedly touching ~92 pages takes about 25
cycles per access on average and touching more than 128 pages takes about
73 cycles per access on average, for a difference of about 48 cycles.
Christian> Does
Christian> anybody know if the VHPT walker is implemented in
Christian> hardware (probably not), microcode (that's what I think)
Christian> or unimplemented at all?
It's definitely implemented in hardware, though I don't know the
implementation details.
--david
* Re: [Linux-ia64] VHPT performance
From: Christian Hildner @ 2002-02-28 8:06 UTC (permalink / raw)
To: linux-ia64
David,
Could you please send me your test program so that I can verify this? Since I haven't locked the
memory in my test program, there may be additional page faults. Is it possible to lock malloc()
storage from user space?
Thanks
Christian
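On the question of locking malloc() storage from user space: one common approach on POSIX systems is to allocate page-aligned memory and pin it with mlock(2). The following is only a minimal sketch under that assumption; the helper name alloc_locked is made up for illustration and does not appear anywhere in this thread. mlock() can fail without sufficient privileges or a large enough RLIMIT_MEMLOCK, so the sketch reports the outcome instead of aborting.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: allocate `size` bytes on a page boundary, touch
   them, and try to lock them into RAM.  *locked is set to 1 if mlock()
   succeeded and to 0 otherwise (e.g. RLIMIT_MEMLOCK too small).
   Returns NULL if the allocation itself fails.  */
void *
alloc_locked (size_t size, int *locked)
{
  long page_size = sysconf(_SC_PAGESIZE);
  void *buf = NULL;

  assert(locked != NULL);
  *locked = 0;
  if (page_size <= 0
      || posix_memalign(&buf, (size_t) page_size, size) != 0)
    return NULL;
  memset(buf, 0, size);          /* fault the pages in up front */
  if (mlock(buf, size) == 0)     /* mlock() returns 0 on success */
    *locked = 1;
  return buf;
}
```

A caller would check the locked flag before trusting that no page faults occur during the measurement, and release the buffer with munlock() and free() afterwards.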
David Mosberger wrote:
> >>>>> On Fri, 22 Feb 2002 12:35:51 +0100, Christian Hildner <christian.hildner@hob.de> said:
>
> Christian> I made some measurements and found that on Itanium it
> Christian> takes about 700 CPU cycles to load a single byte on a
> Christian> TLB miss with the VHPT enabled, vs. about 900 cycles
> Christian> with the VHPT disabled and the TLB miss handled in the
> Christian> IVT.
>
> It's not that simple. On Itanium, the VHPT will help only if the TLB
> entry can be found in the cache (this is described in the Itanium
> microarch. manual, IIRC). I don't think anything has been said
> publicly yet about what McKinley does, so we'll have to wait a bit longer.
>
> The 700-cycle number sounds too high. For example, I have a little
> test program that shows that repeatedly touching ~92 pages takes about 25
> cycles per access on average and touching more than 128 pages takes about
> 73 cycles per access on average, for a difference of about 48 cycles.
>
> Christian> Does
> Christian> anybody know if the VHPT walker is implemented in
> Christian> hardware (probably not), microcode (that's what I think)
> Christian> or unimplemented at all?
>
> It's definitely implemented in hardware, though I don't know the
> implementation details.
>
> --david
* Re: [Linux-ia64] VHPT performance
From: David Mosberger @ 2002-03-01 2:32 UTC (permalink / raw)
To: linux-ia64
>>>>> On Thu, 28 Feb 2002 09:06:54 +0100, Christian Hildner <christian.hildner@hob.de> said:
Christian> David, could you please send me your test program so
Christian> that I can verify this? Since I haven't locked the
Christian> memory in my test program, there may be additional page
Christian> faults.
Well, the copyright is almost longer than the program itself, but here
you go. It was just a quick hack, so treat with care... If you do
make enhancements, I'd be interested, though.
--david
/*
Copyright (c) 1999-2002 Hewlett-Packard Co.
Written by David Mosberger-Tang <davidm@hpl.hp.com>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2, or (at your option)
any later version.
The program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details. */
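#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#define MAX_DTLB_SIZE     4096      /* upper bound on the page count tested */
#define PAGE_SIZE         65536     /* bytes per row of page[] */
#define LINE_SIZE         64        /* cache-line size in bytes */
#define LONGS_PER_LINE    (LINE_SIZE/sizeof(long))
#define LOOKUPS_PER_TEST  30000000  /* memory accesses per measurement */

static long page[2*MAX_DTLB_SIZE][PAGE_SIZE/sizeof(long)];

/* Cycle through the first COUNT rows of page[], reading one word per row.
   The word offset advances by one cache line per row, so successive
   accesses do not all land at the same offset within a page.  */
long
walk (long count)
{
  long index = 0, i, sum = 0;

  for (i = 0; i < LOOKUPS_PER_TEST; ++i)
    {
      sum += page[index][(index*LONGS_PER_LINE) % (PAGE_SIZE/sizeof(long))];
      ++index;
      if (index >= count)
        index = 0;
    }
  return sum;
}

/* Time one call of FUNC(COUNT) with gettimeofday() and print the elapsed
   time and the average time per access.  */
void
run (const char *label, long (*func)(long), long count)
{
  struct timeval tv_start, tv_stop;
  double delta;
  long result;
  int n;

  for (n = 0; n < 1; ++n)
    {
      gettimeofday(&tv_start, 0);
      result = (*func)(count);
      gettimeofday(&tv_stop, 0);
      delta = ((tv_stop.tv_sec + tv_stop.tv_usec / 1000000.0)
               - (tv_start.tv_sec + tv_start.tv_usec / 1000000.0));
      if (delta > 0.0)
        /* result is a long, so print it with %ld */
        printf("%s: %10.5g seconds: %10.5g ns/access (checksum=%ld)\n",
               label, delta, 1e9 * delta / LOOKUPS_PER_TEST, result);
    }
}

int
main (int argc, char ** argv)
{
  char buf[256];
  int i;

  /* Step through the page counts: by 1 below 100 pages, then in
     increasingly coarse steps (1 + i/100).  */
  for (i = 0; i < MAX_DTLB_SIZE; i += 1 + i/100)
    {
      sprintf (buf, "%3d", i);
      run(buf, walk, i);
    }
  return 0;
}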
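The program above reports nanoseconds per access, while the figures discussed in this thread are given in CPU cycles. A small helper for the conversion might look like the sketch below. The function name ns_to_cycles is made up for illustration, and the clock frequency is an assumption the caller has to supply (for example from /proc/cpuinfo); it is not stated anywhere in the thread.

```c
#include <assert.h>

/* Convert a per-access time in nanoseconds into CPU cycles.
   cpu_mhz is supplied by the caller; it is an assumption, not a value
   taken from the thread.  cycles = ns * (MHz / 1000).  */
double
ns_to_cycles (double ns_per_access, double cpu_mhz)
{
  assert(cpu_mhz > 0.0);
  return ns_per_access * cpu_mhz / 1000.0;
}
```

For instance, on a hypothetical 800 MHz part, 10 ns per access corresponds to 8 cycles.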
* Re: [Linux-ia64] VHPT performance
From: Christian Hildner @ 2002-03-08 7:50 UTC (permalink / raw)
To: linux-ia64
David Mosberger wrote:
> >>>>> On Thu, 28 Feb 2002 09:06:54 +0100, Christian Hildner <christian.hildner@hob.de> said:
>
> Christian> David, could you please send me your test program so
> Christian> that I can verify this? Since I haven't locked the
> Christian> memory in my test program, there may be additional page
> Christian> faults.
>
> Well, the copyright is almost longer than the program itself, but here
> you go. It was just a quick hack, so treat with care... If you do
> make enhancements, I'd be interested, though.
>
> --david
>
David,
I tried with my own, different program and the result is the same as before: about 10 cycles
without a TLB miss, about 750 cycles for a TLB miss with the VHPT enabled, and about 900 cycles
for a TLB miss with the VHPT disabled. I don't know what is wrong. Is it the test program or the
processor? The handler in the IVT has only a few instructions to execute. The Intel manual says
that the itc instruction has a variable latency.
Could you please try my test program to verify what I found (compile with -O)? The pointer
increment is half the page size, so TLB hits and TLB misses alternate.
Christian
#define KB 1024
#define INCREMENT 4096
#define MEMSIZE (128*KB)

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* read the ia64 interval time counter (ar.itc) */
static inline long get_ticks(void)
{
    long ticks;

    asm volatile (";;\n\tmov %0=ar.itc\n\t;;\n"
                  : "=r"(ticks));
    return(ticks);
}

int main(void)
{
    char *stg;
    char *current;
    char *end;
    char val = 0;
    long t1, t2;
    struct rlimit rlim;

    printf("tlbtest starting ...\n");

    /* allow MEMSIZE bytes to be locked into memory */
    rlim.rlim_cur = MEMSIZE;
    rlim.rlim_max = MEMSIZE;
    if (setrlimit(RLIMIT_MEMLOCK, &rlim) != 0) {
        printf("setrlimit() failed\n");
        return(-1);
    }

    stg = (char *)malloc(MEMSIZE);
    if (stg == NULL) {              /* compare, do not assign */
        printf("memory not available\n");
        return(-1);
    }
    printf("stg = %p\n", (void *)stg);

    /* mlock() returns 0 on success, so a non-zero result is the failure */
    if (mlock((void *)stg, MEMSIZE) != 0) {
        printf("mlock() failed\n");
        free(stg);
        return(-1);
    }

    /* time a single byte load every INCREMENT bytes */
    end = stg + MEMSIZE;
    current = (char *)stg;
    while (current < end) {
        t1 = get_ticks();
        val = *current;
        t2 = get_ticks();
        printf("ticks1 %ld\n", t2-t1);
        current += INCREMENT;
    }
    printf("ignore this value val = %d\n", (int)val);
    free(stg);
    return(0);
}
typical output:
VHPT disabled                   VHPT enabled
tlbtest starting ...            tlbtest starting ...
stg = 0x20000000002a6010        stg = 0x20000000002a6010
ticks1 11                       ticks1 13
ticks1 36                       ticks1 16
ticks1 1349                     ticks1 1135
ticks1 187                      ticks1 191
ticks1 928                      ticks1 751
ticks1 10                       ticks1 10
ticks1 933                      ticks1 778
ticks1 10                       ticks1 10
ticks1 946                      ticks1 725
ticks1 21                       ticks1 10
ticks1 1101                     ticks1 912
ticks1 10                       ticks1 10
ticks1 941                      ticks1 755
ticks1 10                       ticks1 10
ticks1 939                      ticks1 740
ticks1 10                       ticks1 10
ticks1 966                      ticks1 749
ticks1 10                       ticks1 10
ticks1 954                      ticks1 748
ticks1 21                       ticks1 10
ticks1 920                      ticks1 734
ticks1 10                       ticks1 21
ticks1 936                      ticks1 765
ticks1 21                       ticks1 21
ticks1 945                      ticks1 733
ticks1 10                       ticks1 10
ticks1 1113                     ticks1 921
ticks1 21                       ticks1 10
ticks1 921                      ticks1 754
ticks1 10                       ticks1 10
ticks1 903                      ticks1 719
ticks1 10                       ticks1 10
ignore this value val = 0       ignore this value val = 0
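Since the output alternates between fast and slow loads, one way to summarize a column is to average only the samples above some cut-off. The sketch below does that; the function name mean_above and the idea of a fixed threshold are assumptions made for illustration, not part of the posted program.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of all samples strictly above `threshold`; returns 0.0 if there
   are none.  The threshold separating fast from slow loads is a
   hypothetical parameter chosen by the caller.  */
double
mean_above (const long *ticks, size_t n, long threshold)
{
  double sum = 0.0;
  size_t count = 0, i;

  assert(ticks != NULL || n == 0);
  for (i = 0; i < n; ++i)
    if (ticks[i] > threshold)
      {
        sum += (double) ticks[i];
        ++count;
      }
  return count ? sum / (double) count : 0.0;
}
```

Applied to the six "VHPT disabled" samples 928, 10, 933, 10, 946, 21 with a threshold of 100, this averages the three slow loads only.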
* Re: [Linux-ia64] VHPT performance
From: David Mosberger @ 2002-03-08 8:12 UTC (permalink / raw)
To: linux-ia64
>>>>> On Fri, 08 Mar 2002 08:50:15 +0100, Christian Hildner <christian.hildner@hob.de> said:
Christian> Could you please try my test program to verify what I
Christian> found (compile with -O)? The pointer increment is half
Christian> the page size, so TLB hits and TLB misses alternate.
I think it's your program: I suspect you're measuring cache misses
more than anything else. Try touching a set of pages repeatedly and
try to touch different words, such that there are no cache misses.
--david
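The suggestion above (touch the same set of pages repeatedly, at different words) could be sketched roughly as follows. The function name touch_pages and its parameters are made up for illustration. The idea is to call it once untimed to warm the caches, then time a second call over the same pages.

```c
#include <assert.h>
#include <stddef.h>

/* Read one byte from each of `npages` pages and return the sum of the
   bytes read.  The offset within the page advances by `line_size` per
   page, so consecutive accesses fall on different cache lines instead
   of all sharing the same offset.  */
long
touch_pages (const volatile char *base, size_t npages,
             size_t page_size, size_t line_size)
{
  long sum = 0;
  size_t i;

  assert(page_size > 0 && line_size > 0 && line_size <= page_size);
  for (i = 0; i < npages; ++i)
    sum += base[i * page_size + (i * line_size) % page_size];
  return sum;
}
```

Because the pointer is volatile-qualified, the compiler has to perform every load, so the warm-up pass is not optimized away.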
* Re: [Linux-ia64] VHPT performance
From: Christian Hildner @ 2002-03-08 10:26 UTC (permalink / raw)
To: linux-ia64
David Mosberger wrote:
> >>>>> On Fri, 08 Mar 2002 08:50:15 +0100, Christian Hildner <christian.hildner@hob.de> said:
>
> Christian> Could you please try my test program to verify what I
> Christian> found (compile with -O)? The pointer increment is half
> Christian> the page size, so TLB hits and TLB misses alternate.
>
> I think it's your program: I suspect you're measuring cache misses
> more than anything else. Try touching a set of pages repeatedly and
> try to touch different words, such that there are no cache misses.
>
> --david
OK, that's true. I changed my program so that the measuring loop is preceded by an identical loop
that fills the cache. I also had to increase the memory size to at least 128 pages, because
Itanium has 32 entries in the L1 DTLB and 96 entries in the L2 DTLB. Now I get values of 42
cycles with the VHPT enabled and 180 cycles with the VHPT disabled. These values are close to the
ones you found.
Christian
* Re: [Linux-ia64] VHPT performance
From: David Mosberger @ 2002-03-08 17:31 UTC (permalink / raw)
To: linux-ia64
>>>>> On Fri, 08 Mar 2002 11:26:33 +0100, Christian Hildner <christian.hildner@hob.de> said:
Christian> OK, that's true. I changed my program so that the
Christian> measuring loop is preceded by an identical loop that
Christian> fills the cache. I also had to increase the memory size
Christian> to at least 128 pages, because Itanium has 32 entries in
Christian> the L1 DTLB and 96 entries in the L2 DTLB. Now I get
Christian> values of 42 cycles with the VHPT enabled and 180 cycles
Christian> with the VHPT disabled. These values are close to the
Christian> ones you found.
Great!
Two other things you may want to try:
o Use a stride of PAGE_SIZE+LINE_SIZE. This reduces the likelihood
of exceeding the cache associativity.
o Rather than calling printf() in each iteration, collect the results
in an array and print them once the test is done. printf() is a monster
and will blow away a good portion of the first level caches as well as
a couple of TLB entries.
--david
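Both suggestions above can be combined in a measuring loop along the following lines. This is only a sketch: the function names are made up for illustration, and clock_gettime() is used as a portable stand-in for reading ar.itc, so the samples are in nanoseconds rather than cycles. A caller would pass PAGE_SIZE+LINE_SIZE as the stride.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* Portable stand-in for reading ar.itc: monotonic time in nanoseconds. */
static long long
now_ns (void)
{
  struct timespec ts;

  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (long long) ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Time one load every `stride` bytes and store each latency in out[]
   instead of printing inside the loop.  Returns the number of samples
   recorded (at most max_samples).  */
size_t
measure_strided (const volatile char *buf, size_t size, size_t stride,
                 long long *out, size_t max_samples)
{
  size_t off, n = 0;
  int sink = 0;

  assert(stride > 0);
  for (off = 0; off < size && n < max_samples; off += stride)
    {
      long long t1 = now_ns();

      sink += buf[off];          /* the load being timed */
      out[n++] = now_ns() - t1;
    }
  (void) sink;
  return n;
}

/* Print the collected samples once the measurement is over. */
void
print_samples (const long long *samples, size_t n)
{
  size_t i;

  for (i = 0; i < n; ++i)
    printf("latency %lld ns\n", samples[i]);
}
```

Keeping printf() out of the timed loop means the only code that runs between two samples is the loop itself, so the caches and TLB are disturbed as little as possible.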