public inbox for linux-kernel@vger.kernel.org
* benchmarking bandwidth Northbridge<->RAM
@ 2004-02-05 22:49 Simonas Leleiva
  2004-02-15  0:06 ` dean gaudet
  0 siblings, 1 reply; 3+ messages in thread
From: Simonas Leleiva @ 2004-02-05 22:49 UTC (permalink / raw)
  To: linux-kernel


Hello all,

I'm writing a benchmarking program under Linux.

Here's what I do (and what, sadly, I suspect is flawed):

1. I take the PCI_MEMORY_BAR0 address from PCI device 00:00.0 (according
to 'lspci' it's the Host (or North) bridge), then open /dev/mem and mmap
it at that address into user space, as one would do for any PCI device
whose memory one wants to access.

2. I run memset through both the returned pointer and a malloc'ed
pointer (the latter standing in for plain RAM), keeping in mind that the
time for data to flow between NB and RAM equals the time of a RAM<->RAM
operation minus that of an NB<->RAM one (since I assume the data goes
RAM->NB->CPU->NB->RAM versus NB->CPU->NB->RAM).

Now my doubts on each step mentioned above are:

1. Is this approach correct? I doubt it, because I've come across Host
bridges which do NOT have an addressable memory region (on my AMD Athlon
the NB region starts @ 0xd0000000 and is 128MB long; elsewhere I've seen
it start @ 0xe8000000 with a 32MB length, but recently my program
crashed because there was NO addressable PCI region on the Host bridge).
   After all, is this the right way to access the NB directly?

2. And what about L1/L2 caching ruining the benchmark? I judge that it
does from the implausibly large results (~5GB/s is certainly not the
correct figure for my 2x166MHz RAM on a 64-bit bus - in the best case it
should not exceed 2.5GB/s). I've searched the web for ways to disable
the caches, but all I found was memtest's source code, which works only
in a bare non-Linux (non-PM) environment (memtest builds a bootable
floppy and then runs on the bare metal). So memtest's inline assembly is
useless to me under Linux..
   My present workaround is to run the benchmark over a range of chunk
sizes and watch for the speed drop at the size where a chunk no longer
fits in cache (in my case a drop from 5GB/s to 2.9GB/s - still too
big..), then treat the results for chunks of that size as the actual
benchmark numbers.. However, part of the chunk may sit in cache anyway..

Where am I going wrong? Hoping for a tip from you out there :)

Hear ya (hopefully soon) !

--
Simon






* Re: benchmarking bandwidth Northbridge<->RAM
  2004-02-05 22:49 benchmarking bandwidth Northbridge<->RAM Simonas Leleiva
@ 2004-02-15  0:06 ` dean gaudet
  0 siblings, 0 replies; 3+ messages in thread
From: dean gaudet @ 2004-02-15  0:06 UTC (permalink / raw)
  To: Simonas Leleiva; +Cc: linux-kernel

On Fri, 6 Feb 2004, Simonas Leleiva wrote:

> 2. But what about L1/L2 caching ruining the benchmark, about which I judge
> from very big results (~5GB/s is truly not the correct benchmark with my
> 2x166MHz RAM and 64bit bus - in the best case it should not overflow
> 2.5GB/s)? I've searched the web for cache disablings, but what I've only
> found was the memtest's source-code, which works only under plain non-Linux
> (non PM) environment (memtest makes a bootable floppy and then launches 'bare
> naked'). So I find memtest's inline assembly useless under linux..

mem read/write latency for memory mapped uncachable space doesn't always
have a direct relationship to the cold-cache case of the normal cachable
path.  it's still interesting to measure the uncachable path -- but if
you're really interested in the cachable path you need to do it a
different way.

the absolute best way to look at cold cache mem read latency is to set up
a random pointer chase.  it has to be random because you want to eliminate
the effect of automatic hw prefetchers present in most modern hardware.

lmbench 3.0 has some code which is almost right (iirc it's the lat_mem
component) -- but it does its random walks within a page before moving
to another page.  this doesn't defeat prefetchers well enough: a
prefetcher need only see a couple of accesses to a page before it can
decide to stream the entire page in and get a better cache hit rate.
in my experience there are prefetchers which defeat lmbench's scheme
this way.

in case you haven't heard of a pointer chase, it's basically a loop which
looks like this:

	void *p = foo;			/* foo: first node of the randomized list */

	for (;;) {
		p = *(void **)p;	/* each load depends on the previous one */
		p = *(void **)p;
		p = *(void **)p;
		/* ... repeated 100 times to amortize the loop overhead */
	}

it's a linked list walk basically.

the genius of pointer chases is that you can measure a bazillion memory
system details just by varying the layout of the linked list.

to defeat L1/L2/prefetchers choose a large arena, say 32MiB, break it up
into 64B (or whatever the linesize is) objects, then place those objects
into a linked list in a random order.

-dean

p.s. bunzip2 is a real-world workload which is essentially a random
memory walk over 4-byte objects in a 3600000 byte array.


Thread overview: 3+ messages
2004-02-05 22:49 benchmarking bandwidth Northbridge<->RAM Simonas Leleiva
2004-02-15  0:06 ` dean gaudet
  -- strict thread matches above, loose matches on Subject: below --
2004-02-05 23:49 Simonas Leleiva
