* touch_cache() only touches two thirds
@ 2006-11-09 0:56 Bela Lubkin
2006-11-10 8:25 ` Andi Kleen
0 siblings, 1 reply; 7+ messages in thread
From: Bela Lubkin @ 2006-11-09 0:56 UTC (permalink / raw)
To: linux-kernel
Submitted as <http://bugzilla.kernel.org/show_bug.cgi?id=7476>:
I noticed this bug while looking at something else. These comments are based
purely on source inspection without any runtime observations. The bug
remains in the latest sources I could find: 2.6.19-rc5 and rc5-mm1.
Ingo Molnar's patch referred to as "scheduler cache-hot-auto-tune" introduced
the function kernel/sched.c:touch_cache(). Here is the 2.6.19-rc5-mm1
version (my "forward / backward" comments):
/*
 * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This
 * is the operation that is timed, so we try to generate unpredictable
 * cachemisses that still end up filling the L2 cache:
 */
static void touch_cache(void *__cache, unsigned long __size)
{
	unsigned long size = __size / sizeof(long);
	unsigned long chunk1 = size / 3;
	unsigned long chunk2 = 2 * size / 3;
	unsigned long *cache = __cache;
	int i;

	for (i = 0; i < size / 6; i += 8) {
		switch (i % 6) {
			case 0: cache[i]++;		/* 1st third, forward */
			case 1: cache[size-1-i]++;	/* 3rd third, backward */
			case 2: cache[chunk1-i]++;	/* 1st third, backward */
			case 3: cache[chunk1+i]++;	/* 2nd third, forward */
			case 4: cache[chunk2-i]++;	/* 2nd third, backward */
			case 5: cache[chunk2+i]++;	/* 3rd third, forward */
		}
	}
}
Notice that the for() loop increments `i' by 8 (earlier versions of the code
used a stride of 4). Since all visited values of i are even, `i % 6' can
never be odd. The switch cases 1/3/5 will never be hit -- so the 3rd third
of the test buffer isn't being touched. Migration costs are actually being
calculated relative to buffers that are only 2/3 as large as intended.
An easy fix is to make the stride relatively prime to the modulus:
--- sched.c.orig	2006-11-08 16:17:37.299500000 -0800
+++ sched.c	2006-11-08 16:28:21.699750000 -0800
@@ -5829,5 +5829,5 @@ static void touch_cache(void *__cache, u
 	int i;
 
-	for (i = 0; i < size / 6; i += 8) {
+	for (i = 0; i < size / 6; i += 7) {
 		switch (i % 6) {
 			case 0: cache[i]++;
>Bela<
PS: I suspect this at least partially explains Kenneth W Chen's observations
in "Variation in measure_migration_cost() with
scheduler-cache-hot-autodetect.patch in -mm",
<http://lkml.org/lkml/2005/6/21/473>:
> I'm consistently getting a smaller than expected cache migration cost
> as measured by Ingo's scheduler-cache-hot-autodetect.patch currently
> in -mm tree.
* Re: touch_cache() only touches two thirds
From: Andi Kleen @ 2006-11-10  8:25 UTC (permalink / raw)
To: Bela Lubkin; +Cc: linux-kernel, mingo

"Bela Lubkin" <blubkin@vmware.com> writes:
>
> /*
>  * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This
>  * is the operation that is timed, so we try to generate unpredictable
>  * cachemisses that still end up filling the L2 cache:
>  */

The comment is misleading anyways.  AFAIK several of the modern CPUs
(at least K8, later P4s, Core2, POWER4+, PPC970) have prefetch
predictors advanced enough to follow several streams forward and
backwards in parallel.

I hit this while doing NUMA benchmarking for example.

Most likely to be really unpredictable you need to use a true RND and
somehow make sure still the full cache range is covered.

-Andi
* RE: touch_cache() only touches two thirds
From: Bela Lubkin @ 2006-11-11  0:12 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel, mingo

Andi Kleen wrote:

> "Bela Lubkin" <blubkin@vmware.com> writes:
> >
> > /*
> >  * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This
> >  * is the operation that is timed, so we try to generate unpredictable
> >  * cachemisses that still end up filling the L2 cache:
> >  */
>
> The comment is misleading anyways.  AFAIK several of the modern
> CPUs (at least K8, later P4s, Core2, POWER4+, PPC970) have prefetch
> predictors advanced enough to follow several streams forward and backwards
> in parallel.
>
> I hit this while doing NUMA benchmarking for example.
>
> Most likely to be really unpredictable you need to use a
> true RND and somehow make sure still the full cache range
> is covered.

The corrected code in <http://bugzilla.kernel.org/show_bug.cgi?id=7476#c4>
covers the full cache range.  Granted that modern CPUs may be able to track
multiple simultaneous cache access streams: how many such streams are they
likely to be able to follow at once?  It seems like going from 1 to 2 would
be a big win, 2 to 3 a small win; beyond that it wouldn't likely make much
incremental difference.  So what do the actual implementations in the field
support?

The code (original and corrected) uses 6 simultaneous streams.

I have a modified version that takes a `ways' parameter to use an arbitrary
number of streams.  I'll post that onto bugzilla.kernel.org.

>Bela<
* Re: touch_cache() only touches two thirds
From: Andi Kleen @ 2006-11-11  0:23 UTC (permalink / raw)
To: Bela Lubkin; +Cc: linux-kernel, mingo

> The corrected code in <http://bugzilla.kernel.org/show_bug.cgi?id=7476#c4>
> covers the full cache range.  Granted that modern CPUs may be able to track
> multiple simultaneous cache access streams: how many such streams are they
> likely to be able to follow at once?  It seems like going from 1 to 2 would
> be a big win, 2 to 3 a small win, beyond that it wouldn't likely make much
> incremental difference.  So what do the actual implementations in the field
> support?

I remember reading at some point that a POWER4 could track at least 5+
parallel streams.  I don't know how many K8 handles, but it is multiple
too at least (forward and backwards).

I don't have more data, but at least the newer Intel CPUs seem to be
also very good at prefetching, and when you look at a die photo the L/S
unit in general is quite big.  More than 6 streams handled is certainly
a possibility.

I guess it could be figured out with some clever benchmarking.

> The code (original and corrected) uses 6 simultaneous streams.

My gut feeling is that this is not enough.

> I have a modified version that takes a `ways' parameter to use an arbitrary
> number of streams.  I'll post that onto bugzilla.kernel.org.

Post it to the list please.

-Andi
* Re: touch_cache() only touches two thirds
From: Bela Lubkin @ 2006-11-11  1:48 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel, mingo

Andi Kleen wrote:

Bela>> The corrected code in
Bela>> <http://bugzilla.kernel.org/show_bug.cgi?id=7476#c4> covers the
Bela>> full cache range.  Granted that modern CPUs may be able to track
Bela>> multiple simultaneous cache access streams: how many such streams
Bela>> are they likely to be able to follow at once?  It seems like
Bela>> going from 1 to 2 would be a big win, 2 to 3 a small win, beyond
Bela>> that it wouldn't likely make much incremental difference.  So
Bela>> what do the actual implementations in the field support?

Andi> I remember reading at some point that a POWER4 could track at
Andi> least 5+ parallel streams.  I don't know how many K8 handles, but
Andi> it is multiple too at least (forward and backwards)

Andi> I don't have more data, but at least the newer Intel CPUs seem to
Andi> be also very good at prefetching and when you look at a die photo
Andi> the L/S unit in general is quite big.  More than 6 streams handled
Andi> is certainly a possibility.

Andi> I guess it could be figured out with some clever benchmarking.

Bela>> The code (original and corrected) uses 6 simultaneous streams.

Andi> My gut feeling is that this is not enough.

Bela>> I have a modified version that takes a `ways' parameter to
Bela>> use an arbitrary number of streams.  I'll post that onto
Bela>> bugzilla.kernel.org.

Andi> Post it to the list please.

Ok, will do.  I'm not real sure about list etiquette here.  A discussion
is underway on <http://bugzilla.kernel.org/show_bug.cgi?id=7476>.  Here
is the code I've posted there.  (Slightly newer versions here.)

First is a C program -- a test harness that embeds the new touch_cache()
routine (it needs memory management work before it could go into the
kernel).  Then a shell script I've been using to torture it.  The
torture script tests it with cache lines up to 66 longs, and with up to
656-way streaming (artifacts of the shell's $RANDOM ;-)

Moved this to my home account so I have some control over the mailer...

>Bela<

=============================================================================
/*
 * Test program to demonstrate touch_cache() algorithms
 *
 * Bela Lubkin 2006-11-10
 */

#include <stdio.h>
#include <stdlib.h>

/* Elements Per Displayline -- display parameter for self-check code */
#define EPD 64

/*
 * The following definitions describe the cache line size of the machine
 * architecture:
 *
 *   cache_t, here defined as `long', is a single cache element
 *   LPC, Longs Per Cacheline, is the number of elements per cache line
 *
 * For consistency, cache_t should probably be int32_t, and only LPC
 * should be varied to match various architectures.
 */
#define LPC 8
int lpc = LPC;
typedef long cache_t;

void bar(void)
{
	int i;

	printf("+");
	for (i = 0; i < EPD + (EPD / lpc) - 1; i++)
		printf("=");
	printf("+\n");
}

void clear(cache_t *cache, int size)
{
	int i;

	for (i = -EPD; i < size + EPD; i++)
		cache[i] = 0;
}

/*
 * show() dumps the resulting touched cache and checks it for
 * correctness.
 *
 * The `misplaced' test isn't strictly necessary, as the actual goal is
 * merely to touch each cache line (anywhere within the line).  I found
 * the additional restriction useful to promote overall correctness
 * during the process of refining the touch_cache() algorithm.
 */
void show(cache_t *cache, int size)
{
	int i;
	int missed = 0, underrun = 0, misplaced = 0, overrun = 0;

	for (i = -EPD; i < size + EPD; i++) {
		if ((i + EPD) % EPD == 0)
			printf("|");
		printf("%c", cache[i] ? '0' + cache[i] :
		    (i < 0 || i >= size) ? '-' : '.');
		if ((i + EPD) % EPD == EPD - 1)
			printf("\n");
		else if ((i + EPD) % lpc == lpc - 1)
			printf(":");
		if (i >= 0 && i < size && (i % lpc) == 0 && cache[i] == 0)
			missed++;
		if (cache[i]) {
			if (i < 0)
				underrun += cache[i];
			if (i >= size)
				overrun += cache[i];
			if (i % lpc != 0)
				misplaced += cache[i];
		}
	}
	if ((i + EPD) % EPD != 0)
		printf("\n");
	if (missed)
		printf("ERROR: %d cache lines were missed!\n", missed);
	if (underrun)
		printf("ERROR: %d writes before beginning of buffer!\n",
		    underrun);
	if (overrun)
		printf("ERROR: %d writes after end of buffer!\n", overrun);
	if (misplaced)
		printf("ERROR: %d writes at unexpected alignments within a cache line!\n",
		    misplaced);
	if (missed || underrun || misplaced || overrun)
		exit(1);
}

static int *waystart;

/*
 * When putting into kernel, use vmalloc()/vfree();
 * change error handling.
 */
void prep_ways(int ways, int size)
{
	int way, waysize = size / ways;

	/* one extra `way' is used when `ways' is odd */
	/* (actually, only the even elements of this array are used) */
	waystart = (int *)malloc((ways + 1) * sizeof(*waystart));
	if (!waystart) {
		fprintf(stderr, "malloc failed\n");
		exit(1);
	}
	for (way = 0; way < ways + 1; way++) {
		if ((way & 1) == 0)
			/* even `waystart' points to 1st line in its `way' */
			waystart[way] = way * waysize;
		else
			/* odd `waystart' points to last line in its `way' */
			waystart[way] = (way + 1) * waysize - lpc;
		/* align to next cache line */
		waystart[way] = lpc * ((waystart[way] + lpc - 1) / lpc);
	}
}

void free_ways(void)
{
	free(waystart);
}

void touch_cache(cache_t *cache, int ways, int size)
{
	int way, i;

	/*
	 * We access the buffer via `ways' independent 'streams' of
	 * read/write access which are interleaved; every other one
	 * is written backwards.  This is supposed to keep the cache
	 * from recognizing any linear access pattern.
	 *
	 *	[---> <---|---> <---|---> <---]
	 *
	 * We touch every cacheline in the buffer (32 bytes on 32-bit
	 * systems, 64 bytes on 64-bit systems; actually now `lpc *
	 * sizeof(cache_t)', could be determined at runtime).
	 */
	for (i = 0; i < size / ways; i += lpc) {
		for (way = 0; way < ways; way++) {
			if ((way & 1) == 0)
				cache[waystart[way] + i]++;
			else
				cache[waystart[way] - i]++;
		}
	}
}

int main(int argc, char *argv[])
{
	int size, ways;
	cache_t *cache;

	if (argc < 3) {
		fprintf(stderr,
		    "usage: touch_cache cache_size ways [longs-per-cacheline]\n");
		fprintf(stderr, "       cache_size >= ways\n");
		exit(1);
	}
	size = atoi(argv[1]);
	ways = atoi(argv[2]);
	if (argc > 3)
		lpc = atoi(argv[3]);
	if (ways <= 0 || size < ways) {
		fprintf(stderr,
		    "usage: touch_cache cache_size ways [longs-per-cacheline]\n");
		fprintf(stderr, "       cache_size >= ways\n");
		exit(1);
	}

	/*
	 * This is a bit of a shuck, papering over the fact that it's
	 * hard to get perfect 1:1 cache line coverage in an odd-sized
	 * buffer...
	 */
	if (size % (ways * lpc)) {
		printf("cache_size should be a multiple of %d * ways\n", lpc);
		printf("Raising cache_size...\n");
		size = (lpc * ways) * (1 + size / lpc / ways);
	}
	printf("size: %d, ways: %d, longs-per-cacheline: %d\n",
	    size, ways, lpc);

	/* allocate an extra 2*EPD cache elements, two display lines,
	   to demonstrate not running off the ends of the buffer */
	cache = (cache_t *)malloc((EPD * 2 + size) * sizeof(*cache));
	if (!cache) {
		fprintf(stderr, "malloc failed\n");
		exit(1);
	}
	cache += EPD;
	clear(cache, size);
	prep_ways(ways, size);
	touch_cache(cache, ways, size);
	free_ways();
	bar();
	show(cache, size);
	bar();
	exit(0);
}
=============================================================================
#!/bin/bash

# Torture the touch_cache() algorithm.
#
# This produces about 24MB of output.  Any "ERROR" messages indicate a
# problem; the rest should be rather boring.  Run as:
#
#   ./touch_cache.test.sh > touch_cache.test.out
#   grep -i error touch_cache.test.out

exec 2>&1
i=0
err=0
while [ $i -lt 1000 ]; do
  let i=i+1
  let size=$RANDOM+1
  let ways=$RANDOM/50+1
  case $RANDOM in
    1[01234]???) lpc=4;;
    1[56789]???) lpc=8;;
    2[01234]???) lpc=16;;
    2[56789]???) lpc=32;;
    3????)       lpc=64;;
    *)           let lpc=$RANDOM/500+1;;
  esac
  if [ $ways -gt $size ]; then
    x=$ways ways=$size size=$x
  fi
  ./touch_cache $size $ways $lpc || let err=err+1
done
if [ $err -gt 0 ]; then
  echo ERROR: errors above, check output
else
  echo Test completed with no errors.
fi
* RE: touch_cache() only touches two thirds
From: dean gaudet @ 2006-11-18  4:18 UTC (permalink / raw)
To: Bela Lubkin; +Cc: Andi Kleen, linux-kernel, mingo

On Fri, 10 Nov 2006, Bela Lubkin wrote:

> The corrected code in <http://bugzilla.kernel.org/show_bug.cgi?id=7476#c4>
> covers the full cache range.  Granted that modern CPUs may be able to track
> multiple simultaneous cache access streams: how many such streams are they
> likely to be able to follow at once?  It seems like going from 1 to 2 would
> be a big win, 2 to 3 a small win, beyond that it wouldn't likely make much
> incremental difference.  So what do the actual implementations in the field
> support?

p-m family, core, core2 track one stream on each of 12 to 16 pages.  in
the earlier ones they split the trackers into some forward-only and some
backward-only, but on core2 i think they're all bidirectional.  if i had
to guess they round-robin the trackers, so once you hit 17 pages with
streams they're defeated.

a p4 (0f0403, probably "prescott") i have here is tracking 16 -- seems
to use LRU or pLRU but i haven't tested really, you need to get out past
32 streams before it really starts falling off... and even then the
next-line prefetch in the L2 helps too much (64-byte lines, but 128-byte
tags and a pair of dirty/state bits -- it prefetches the other half of a
pair automatically).  oh it can track forward or backward, and is happy
with strides up to 128.

k8 rev F tracks one stream on each of 20 pages (forward or backward).
it also seems to use round-robin, and is defeated as soon as you have 21
streams.

i swear there was an x86 which did 28 streams, but it was a few years
ago that i last looked really closely at the prefetchers and i don't
have access to the data at the moment.

i suggest that streams are the wrong approach.

i was actually considering this same problem this week, happy to see
your thread.  the approach i was considering was to set up two pointer
chases:

one pointer chase covering enough cache lines (and in a prefetchable
ordering) for "scrubbing" the cache(s).

another pointer chase arranged to fill the L1 (or L2) using many many
pages.  i.e. suppose i wanted to traverse 32KiB L1 with 64B cache lines
then i'd allocate 512 pages and put one line on each page (pages ordered
randomly), but colour them so they fill the L1.  this conveniently
happens to fit in a 2MiB huge page on x86, so you could even ameliorate
the TLB pressure from the microbenchmark.

you can actually get away with a pointer every 256 bytes today -- none
of the prefetchers on today's x86 cores consider a 256 byte stride to be
prefetchable.  for safety you might want to use 512 byte alignment...
this lets you get away with fewer pages for colouring larger caches.

the benchmark i was considering would be like so:

	switch to cpu m
	scrub the cache
	switch to cpu n
	scrub the cache
	traverse the coloured list and modify each cache line as we go
	switch to cpu m
	start timing
	traverse the coloured list without modification
	stop timing

-dean
* RE: touch_cache() only touches two thirds
From: dean gaudet @ 2006-11-18  4:31 UTC (permalink / raw)
To: Bela Lubkin; +Cc: Andi Kleen, linux-kernel, mingo

On Fri, 17 Nov 2006, dean gaudet wrote:

> another pointer chase arranged to fill the L1 (or L2) using many many
> pages.  i.e. suppose i wanted to traverse 32KiB L1 with 64B cache lines
> then i'd allocate 512 pages and put one line on each page (pages ordered
> randomly), but colour them so they fill the L1.  this conveniently happens
> to fit in a 2MiB huge page on x86, so you could even ameliorate the TLB
> pressure from the microbenchmark.

btw, for L2-sized measurements you don't need to be so clever... you can
get away with a random traversal of the L2 on 128B boundaries.  (need to
avoid the "next-line prefetch" issues on p-m/core/core2, p4 model 3 and
later.)  there's just so many more pages required to map the L2 than any
reasonable prefetcher is going to have any time soon.

-dean

> the benchmark i was considering would be like so:
>
> 	switch to cpu m
> 	scrub the cache
> 	switch to cpu n
> 	scrub the cache
> 	traverse the coloured list and modify each cache line as we go
> 	switch to cpu m
> 	start timing
> 	traverse the coloured list without modification
> 	stop timing