public inbox for linux-ia64@vger.kernel.org
* [Linux-ia64] VHPT performance
@ 2002-02-20 12:21 Christian Hildner
  2002-02-20 17:02 ` Michael Madore
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: Christian Hildner @ 2002-02-20 12:21 UTC (permalink / raw)
  To: linux-ia64

Hi!

Does anybody have a benchmarking tool or test program for performance
comparisons on the VHPT feature? Or are there applications that heavily
use many different memory pages?

Another thing: Is the lia64-sim list broken? I cannot subscribe to the
list.


Christian




* Re: [Linux-ia64] VHPT performance
  2002-02-20 12:21 [Linux-ia64] VHPT performance Christian Hildner
@ 2002-02-20 17:02 ` Michael Madore
  2002-02-20 17:34 ` David Mosberger
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Michael Madore @ 2002-02-20 17:02 UTC (permalink / raw)
  To: linux-ia64

Christian Hildner wrote:
> 
> Another thing: Is the lia64-sim list broken? I cannot subscribe to the
> list.

I'm curious about this also.  The lia64-sim list does not appear in the
list of lists from majordomo.

Mike



* Re: [Linux-ia64] VHPT performance
  2002-02-20 12:21 [Linux-ia64] VHPT performance Christian Hildner
  2002-02-20 17:02 ` Michael Madore
@ 2002-02-20 17:34 ` David Mosberger
  2002-02-22 11:35 ` Christian Hildner
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: David Mosberger @ 2002-02-20 17:34 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Wed, 20 Feb 2002 13:21:22 +0100, Christian Hildner <christian.hildner@hob.de> said:

  Christian> Hi!  Does anybody have a benchmarking tool or test
  Christian> program for performance comparisons on the VHPT feature?
  Christian> Or are there applications that heavily use many different
  Christian> memory pages?

There certainly are!  The SPEC CPU benchmarks are quite memory intensive, for
example.

  Christian> Another thing: Is the lia64-sim list broken? I cannot
  Christian> subscribe to the list.

I'm not aware of any problems.  What exactly isn't working?  Note that
you now have to subscribe by mail to lia64-sim-request@linux.hpl.hp.com,
since I recently upgraded from majordomo to mailman.  An archive of
the mailing list traffic is at:

	http://www.hpl.hp.com/hosted/linux/mail-archives/lia64-sim/

There hasn't been much traffic recently.

	--david



* Re: [Linux-ia64] VHPT performance
  2002-02-20 12:21 [Linux-ia64] VHPT performance Christian Hildner
  2002-02-20 17:02 ` Michael Madore
  2002-02-20 17:34 ` David Mosberger
@ 2002-02-22 11:35 ` Christian Hildner
  2002-02-22 16:58 ` David Mosberger
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Christian Hildner @ 2002-02-22 11:35 UTC (permalink / raw)
  To: linux-ia64


David Mosberger wrote:

> >>>>> On Wed, 20 Feb 2002 13:21:22 +0100, Christian Hildner <christian.hildner@hob.de> said:
>
>   Christian> Hi!  Does anybody have a benchmarking tool or test
>   Christian> program for performance comparisons on the VHPT feature?
>   Christian> Or are there applications that heavily use many different
>   Christian> memory pages?
>
> There certainly are!  The SPEC CPU benchmarks are quite memory intensive, for
> example.
>

I took some measurements and found that on Itanium it takes roughly 700 CPU cycles to load a single
byte when the access misses the TLB and the VHPT is enabled, versus roughly 900 cycles with the VHPT
disabled and the TLB miss handled in the IVT. Does anybody know whether the VHPT walker is implemented
in hardware (probably not), in microcode (that's what I think), or not implemented at all?

Christian




* Re: [Linux-ia64] VHPT performance
  2002-02-20 12:21 [Linux-ia64] VHPT performance Christian Hildner
                   ` (2 preceding siblings ...)
  2002-02-22 11:35 ` Christian Hildner
@ 2002-02-22 16:58 ` David Mosberger
  2002-02-28  8:06 ` Christian Hildner
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: David Mosberger @ 2002-02-22 16:58 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Fri, 22 Feb 2002 12:35:51 +0100, Christian Hildner <christian.hildner@hob.de> said:

  Christian> I took some measurements and found that on Itanium it
  Christian> takes roughly 700 CPU cycles to load a single byte when
  Christian> the access misses the TLB and the VHPT is enabled, versus
  Christian> roughly 900 cycles with the VHPT disabled and the TLB
  Christian> miss handled in the IVT.

It's not that simple.  On Itanium, the VHPT will help only if the TLB
entry can be found in the cache (this is described in the Itanium
microarch. manual, IIRC).  I don't think anything has been said
publicly yet about what McKinley does, so we'll have to wait a bit longer.

The 700 cycle number sounds too high.  For example, I have a little
test program that shows repeatedly touching ~92 pages takes about 25
cycles on average and touching more than 128 pages takes about 73
cycles on average, for a difference of about 48 cycles.

  Christian> Does anybody know whether the VHPT walker is implemented
  Christian> in hardware (probably not), in microcode (that's what I
  Christian> think), or not implemented at all?

It's definitely implemented in hardware, though I don't know the
implementation details.

	--david



* Re: [Linux-ia64] VHPT performance
  2002-02-20 12:21 [Linux-ia64] VHPT performance Christian Hildner
                   ` (3 preceding siblings ...)
  2002-02-22 16:58 ` David Mosberger
@ 2002-02-28  8:06 ` Christian Hildner
  2002-03-01  2:32 ` David Mosberger
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Christian Hildner @ 2002-02-28  8:06 UTC (permalink / raw)
  To: linux-ia64

David

Could you please send me your test program so that I can verify this? Since I haven't locked the
memory in my test program, there may be additional page faults. Is it possible to lock malloc()
memory from user space?

Thanks

Christian
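
On the question of locking malloc() memory from user space: one possible
approach, sketched here under assumptions, is to allocate page-aligned
memory and pin it with mlock().  The helper name alloc_pinned() is
hypothetical and does not come from this thread; posix_memalign(),
sysconf() and mlock() are the standard POSIX calls.  Note that mlock()
returns 0 on success and can fail when RLIMIT_MEMLOCK is too low.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: allocate SIZE bytes aligned to the system page
   size, write to every byte so the pages are populated, and try to pin
   them with mlock().  Returns the buffer (release it with free()) or
   NULL if the allocation failed.  *locked is set to 1 if mlock()
   succeeded and to 0 otherwise.  */
static void *
alloc_pinned (size_t size, int *locked)
{
  void *buf = NULL;
  long page_size = sysconf (_SC_PAGESIZE);

  assert (size > 0 && locked != NULL);
  if (page_size <= 0
      || posix_memalign (&buf, (size_t) page_size, size) != 0)
    return NULL;
  memset (buf, 0, size);		/* fault the pages in up front */
  *locked = (mlock (buf, size) == 0);	/* mlock() returns 0 on success */
  return buf;
}
```

A caller would check the returned pointer for NULL and then look at the
locked flag before measuring; if it is 0, page faults can still occur
during the test.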

David Mosberger wrote:

> >>>>> On Fri, 22 Feb 2002 12:35:51 +0100, Christian Hildner <christian.hildner@hob.de> said:
>
>   Christian> I took some measurements and found that on Itanium it
>   Christian> takes roughly 700 CPU cycles to load a single byte when
>   Christian> the access misses the TLB and the VHPT is enabled, versus
>   Christian> roughly 900 cycles with the VHPT disabled and the TLB
>   Christian> miss handled in the IVT.
>
> It's not that simple.  On Itanium, the VHPT will help only if the TLB
> entry can be found in the cache (this is described in the Itanium
> microarch. manual, IIRC).  I don't think anything has been said
> publicly yet about what McKinley does, so we'll have to wait a bit longer.
>
> The 700 cycle number sounds too high.  For example, I have a little
> test program that shows repeatedly touching ~92 pages takes about 25
> cycles on average and touching more than 128 pages takes about 73
> cycles on average, for a difference of about 48 cycles.
>
>   Christian> Does anybody know whether the VHPT walker is implemented
>   Christian> in hardware (probably not), in microcode (that's what I
>   Christian> think), or not implemented at all?
>
> It's definitely implemented in hardware, though I don't know the
> implementation details.
>
>         --david




* Re: [Linux-ia64] VHPT performance
  2002-02-20 12:21 [Linux-ia64] VHPT performance Christian Hildner
                   ` (4 preceding siblings ...)
  2002-02-28  8:06 ` Christian Hildner
@ 2002-03-01  2:32 ` David Mosberger
  2002-03-08  7:50 ` Christian Hildner
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: David Mosberger @ 2002-03-01  2:32 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Thu, 28 Feb 2002 09:06:54 +0100, Christian Hildner <christian.hildner@hob.de> said:

  Christian> David, could you please send me your test program so that
  Christian> I can verify this?  Since I haven't locked the memory in
  Christian> my test program, there may be additional page faults.

Well, the copyright is almost longer than the program itself, but here
you go.  It was just a quick hack, so treat with care...  If you do
make enhancements, I'd be interested, though.

	--david

/*
    Copyright (c) 1999-2002 Hewlett-Packard Co.
	Written by David Mosberger-Tang <davidm@hpl.hp.com>

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2, or (at your option)
any later version.

The program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.  */

#include <stdio.h>
#include <string.h>

#include <sys/time.h>

#define MAX_DTLB_SIZE	  4096
#define PAGE_SIZE	65536
#define LINE_SIZE	   64
#define LONGS_PER_LINE	(LINE_SIZE/sizeof(long))

#define LOOKUPS_PER_TEST	30000000

static long page[2*MAX_DTLB_SIZE][PAGE_SIZE/sizeof(long)];

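/* Touch one word in each of COUNT pages, round-robin, LOOKUPS_PER_TEST
   times in total.  The offset within the page advances by one cache
   line per page, so successive pages are accessed at different
   cache-line offsets.  */
long
walk (long count)
{
  long index = 0, i, sum = 0;

  for (i = 0; i < LOOKUPS_PER_TEST; ++i)
    {
      sum += page[index][(index*LONGS_PER_LINE) % (PAGE_SIZE/sizeof(long))];
      ++index;
      if (index >= count)
	index = 0;
    }
  return sum;
}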


void
run (const char *label, long (*func)(long), long count)
{
  struct timeval tv_start, tv_stop;
  double delta;
  long result;
  int n;

  for (n = 0; n < 1; ++n)
    {
      gettimeofday(&tv_start, 0);
      result = (*func)(count);
      gettimeofday(&tv_stop, 0);
      delta = ((tv_stop.tv_sec + tv_stop.tv_usec / 1000000.0)
	       - (tv_start.tv_sec + tv_start.tv_usec / 1000000.0));
      if (delta > 0.0)
	printf("%s: %10.5g seconds: %10.5g ns/access (checksum=%ld)\n",
	       label, delta, 1e9 * delta / LOOKUPS_PER_TEST, result);
    }
}

int
main (int argc, char ** argv)
{
  char buf[256];
  int i;

  for (i = 0; i < MAX_DTLB_SIZE; i += 1 + i/100)
    {
      sprintf (buf, "%3d", i);
      run(buf, walk, i);
    }
  return 0;
}
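
The program above reports nanoseconds per access, while the numbers
discussed in this thread are CPU cycles.  A minimal sketch of the
conversion follows; the helper name ns_to_cycles() is hypothetical, and
the clock frequency has to be supplied by whoever runs the test (the
800 MHz used in the note below is only an assumed example value).

```c
/* Hypothetical helper: convert a per-access latency in nanoseconds into
   CPU cycles.  cpu_mhz is the core clock in MHz (an input the caller
   must supply); one cycle lasts 1000/cpu_mhz nanoseconds.  */
static double
ns_to_cycles (double ns_per_access, double cpu_mhz)
{
  return ns_per_access * cpu_mhz / 1000.0;
}
```

For example, at an assumed 800 MHz clock, 10 ns per access corresponds
to 8 cycles.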



* Re: [Linux-ia64] VHPT performance
  2002-02-20 12:21 [Linux-ia64] VHPT performance Christian Hildner
                   ` (5 preceding siblings ...)
  2002-03-01  2:32 ` David Mosberger
@ 2002-03-08  7:50 ` Christian Hildner
  2002-03-08  8:12 ` David Mosberger
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Christian Hildner @ 2002-03-08  7:50 UTC (permalink / raw)
  To: linux-ia64


David Mosberger wrote:

> >>>>> On Thu, 28 Feb 2002 09:06:54 +0100, Christian Hildner <christian.hildner@hob.de> said:
>
>   Christian> David, could you please send me your test program so that
>   Christian> I can verify this?  Since I haven't locked the memory in
>   Christian> my test program, there may be additional page faults.
>
> Well, the copyright is almost longer than the program itself, but here
> you go.  It was just a quick hack, so treat with care...  If you do
> make enhancements, I'd be interested, though.
>
>         --david
>

David,

I tried with my own, different program and the results are the same as before: about 10 cycles
without a TLB miss, about 750 cycles for a TLB miss with the VHPT enabled, and about 900 cycles for
a TLB miss with the VHPT disabled. I don't know what is wrong. Is it the test program or the
processor? The handler in the IVT only has a few instructions to execute. The Intel manual says that
the itc instruction has a variable latency.

Could you please try my test program to verify what I found (compile with -O)? The pointer
increment is half the page size, so TLB hits and TLB misses alternate.

Christian

#define KB 1024
#define INCREMENT 4096
#define MEMSIZE (128*KB)

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/resource.h>

static inline long get_ticks(void)
{
    long ticks;

    /* Read the ia64 interval time counter (ar.itc); ";;" is a stop bit. */
    asm volatile (";;\n\t"
                  "mov %0=ar.itc\n\t"
                  ";;\n"
                  : "=r"(ticks));
    return(ticks);
}


int main(void)
{

    char *stg;
    char *current;
    char *end;
    char val;
    long t1,t2,tdiff;
    struct rlimit rlim;

    printf("tlbtest starting ...\n");
    rlim.rlim_cur = MEMSIZE;
    rlim.rlim_max = MEMSIZE;
    if (setrlimit(RLIMIT_MEMLOCK, &rlim) != 0) {
        printf("setrlimit() failed\n");
        return(-1);
    }
    stg = (char *)malloc(MEMSIZE);
    if (stg == NULL) {
        printf("memory not available\n");
        return(-1);
    }
    printf("stg = %p\n", (void *)stg);
    if (mlock((void *)stg, MEMSIZE) != 0) {  /* mlock() returns 0 on success */
        printf("mlock() failed\n");
        free(stg);
        return(-1);
    }
    end = stg + MEMSIZE;
    current = (char *)stg;
    while (current < end) {
        t1 = get_ticks();
        val = *current;
        t2 = get_ticks();
        printf("ticks1 %ld\n", t2-t1);
        current += INCREMENT;
    }
    printf("ignore this value val = %d\n", (int)val);
    free(stg);
    return(0);
}

typical output:
VHPT disabled                VHPT enabled

tlbtest starting ...         tlbtest starting ...
stg = 0x20000000002a6010     stg = 0x20000000002a6010
ticks1 11                    ticks1 13
ticks1 36                    ticks1 16
ticks1 1349                  ticks1 1135
ticks1 187                   ticks1 191
ticks1 928                   ticks1 751
ticks1 10                    ticks1 10
ticks1 933                   ticks1 778
ticks1 10                    ticks1 10
ticks1 946                   ticks1 725
ticks1 21                    ticks1 10
ticks1 1101                  ticks1 912
ticks1 10                    ticks1 10
ticks1 941                   ticks1 755
ticks1 10                    ticks1 10
ticks1 939                   ticks1 740
ticks1 10                    ticks1 10
ticks1 966                   ticks1 749
ticks1 10                    ticks1 10
ticks1 954                   ticks1 748
ticks1 21                    ticks1 10
ticks1 920                   ticks1 734
ticks1 10                    ticks1 21
ticks1 936                   ticks1 765
ticks1 21                    ticks1 21
ticks1 945                   ticks1 733
ticks1 10                    ticks1 10
ticks1 1113                  ticks1 921
ticks1 21                    ticks1 10
ticks1 921                   ticks1 754
ticks1 10                    ticks1 10
ticks1 903                   ticks1 719
ticks1 10                    ticks1 10
ignore this value val = 0    ignore this value val = 0




* Re: [Linux-ia64] VHPT performance
  2002-02-20 12:21 [Linux-ia64] VHPT performance Christian Hildner
                   ` (6 preceding siblings ...)
  2002-03-08  7:50 ` Christian Hildner
@ 2002-03-08  8:12 ` David Mosberger
  2002-03-08 10:26 ` Christian Hildner
  2002-03-08 17:31 ` David Mosberger
  9 siblings, 0 replies; 11+ messages in thread
From: David Mosberger @ 2002-03-08  8:12 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Fri, 08 Mar 2002 08:50:15 +0100, Christian Hildner <christian.hildner@hob.de> said:

  Christian> Could you please try my test program to verify what I
  Christian> found (compile with -O)?  The pointer increment is half
  Christian> the page size, so TLB hits and TLB misses alternate.

I think it's your program: I suspect you're measuring cache misses
more than anything else.  Try touching a set of pages repeatedly and
try to touch different words, such that there are no cache misses.

	--david



* Re: [Linux-ia64] VHPT performance
  2002-02-20 12:21 [Linux-ia64] VHPT performance Christian Hildner
                   ` (7 preceding siblings ...)
  2002-03-08  8:12 ` David Mosberger
@ 2002-03-08 10:26 ` Christian Hildner
  2002-03-08 17:31 ` David Mosberger
  9 siblings, 0 replies; 11+ messages in thread
From: Christian Hildner @ 2002-03-08 10:26 UTC (permalink / raw)
  To: linux-ia64


David Mosberger wrote:

> >>>>> On Fri, 08 Mar 2002 08:50:15 +0100, Christian Hildner <christian.hildner@hob.de> said:
>
>   Christian> Could you please try my test program to verify what I
>   Christian> found (compile with -O)?  The pointer increment is half
>   Christian> the page size, so TLB hits and TLB misses alternate.
>
> I think it's your program: I suspect you're measuring cache misses
> more than anything else.  Try touching a set of pages repeatedly and
> try to touch different words, such that there are no cache misses.
>
>         --david

OK, that's true. I changed my program so that the measuring loop is preceded by an identical loop
that fills the cache. I also had to increase the memory size to at least 128 pages, because
Itanium has 32 L1 DTLB entries and 96 L2 DTLB entries. Now I get about 42 cycles with the VHPT
enabled and about 180 cycles with the VHPT disabled. These values are close to the ones you found.

Christian




* Re: [Linux-ia64] VHPT performance
  2002-02-20 12:21 [Linux-ia64] VHPT performance Christian Hildner
                   ` (8 preceding siblings ...)
  2002-03-08 10:26 ` Christian Hildner
@ 2002-03-08 17:31 ` David Mosberger
  9 siblings, 0 replies; 11+ messages in thread
From: David Mosberger @ 2002-03-08 17:31 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Fri, 08 Mar 2002 11:26:33 +0100, Christian Hildner <christian.hildner@hob.de> said:

  Christian> OK, that's true.  I changed my program so that the
  Christian> measuring loop is preceded by an identical loop that
  Christian> fills the cache.  I also had to increase the memory size
  Christian> to at least 128 pages, because Itanium has 32 L1 DTLB
  Christian> entries and 96 L2 DTLB entries.  Now I get about 42
  Christian> cycles with the VHPT enabled and about 180 cycles with
  Christian> the VHPT disabled.  These values are close to the ones
  Christian> you found.

Great!

Two other things you may want to try:

 o Use a stride of PAGE_SIZE+LINE_SIZE.  This reduces the likelihood
   of exceeding the cache associativity.

 o Rather than calling printf() in each iteration, collect the results
   in an array and print them once the test is done.  printf() is a monster
   and will blow away a good portion of the first level caches as well as
   a couple of TLB entries.

	--david


