* touch_cache() only touches two thirds
@ 2006-11-09 0:56 Bela Lubkin
2006-11-10 8:25 ` Andi Kleen
0 siblings, 1 reply; 7+ messages in thread
From: Bela Lubkin @ 2006-11-09 0:56 UTC (permalink / raw)
To: linux-kernel
Submitted as <http://bugzilla.kernel.org/show_bug.cgi?id=7476>:
I noticed this bug while looking at something else. These comments are based
purely on source inspection without any runtime observations. The bug
remains in the latest sources I could find: 2.6.19-rc5 and rc5-mm1.
Ingo Molnar's patch referred to as "scheduler cache-hot-auto-tune" introduced
the function kernel/sched.c:touch_cache(). Here is the 2.6.19-rc5-mm1
version (my "forward / backward" comments):
/*
 * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This
 * is the operation that is timed, so we try to generate unpredictable
 * cachemisses that still end up filling the L2 cache:
 */
static void touch_cache(void *__cache, unsigned long __size)
{
	unsigned long size = __size / sizeof(long);
	unsigned long chunk1 = size / 3;
	unsigned long chunk2 = 2 * size / 3;
	unsigned long *cache = __cache;
	int i;

	for (i = 0; i < size / 6; i += 8) {
		switch (i % 6) {
			case 0: cache[i]++;		/* 1st third, forward */
			case 1: cache[size-1-i]++;	/* 3rd third, backward */
			case 2: cache[chunk1-i]++;	/* 1st third, backward */
			case 3: cache[chunk1+i]++;	/* 2nd third, forward */
			case 4: cache[chunk2-i]++;	/* 2nd third, backward */
			case 5: cache[chunk2+i]++;	/* 3rd third, forward */
		}
	}
}
Notice that the for() loop increments `i' by 8 (earlier versions of the code
used a stride of 4). Since all visited values of i are even, `i % 6' can
never be odd. The switch cases 1/3/5 will never be hit -- so the 3rd third
of the test buffer isn't being touched. Migration costs are actually being
calculated relative to buffers that are only 2/3 as large as intended.
An easy fix is to make the stride relatively prime to the modulus:
--- sched.c.orig	2006-11-08 16:17:37.299500000 -0800
+++ sched.c	2006-11-08 16:28:21.699750000 -0800
@@ -5829,5 +5829,5 @@ static void touch_cache(void *__cache, u
 	int i;
 
-	for (i = 0; i < size / 6; i += 8) {
+	for (i = 0; i < size / 6; i += 7) {
 		switch (i % 6) {
 			case 0: cache[i]++;
>Bela<
PS: I suspect this at least partially explains Kenneth W Chen's observations
in "Variation in measure_migration_cost() with
scheduler-cache-hot-autodetect.patch in -mm",
<http://lkml.org/lkml/2005/6/21/473>:
> I'm consistently getting a smaller than expected cache migration cost
> as measured by Ingo's scheduler-cache-hot-autodetect.patch currently
> in -mm tree.
* Re: touch_cache() only touches two thirds
From: Andi Kleen @ 2006-11-10  8:25 UTC (permalink / raw)
To: Bela Lubkin; +Cc: linux-kernel, mingo

"Bela Lubkin" <blubkin@vmware.com> writes:
>
> /*
>  * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This
>  * is the operation that is timed, so we try to generate unpredictable
>  * cachemisses that still end up filling the L2 cache:
>  */

The comment is misleading anyways.  AFAIK several of the modern CPUs
(at least K8, later P4s, Core2, POWER4+, PPC970) have prefetch
predictors advanced enough to follow several streams forward and
backwards in parallel.

I hit this while doing NUMA benchmarking for example.

Most likely to be really unpredictable you need to use a true RND and
somehow make sure still the full cache range is covered.

-Andi
* RE: touch_cache() only touches two thirds
From: Bela Lubkin @ 2006-11-11  0:12 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel, mingo

Andi Kleen wrote:

> "Bela Lubkin" <blubkin@vmware.com> writes:
> >
> > /*
> >  * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This
> >  * is the operation that is timed, so we try to generate unpredictable
> >  * cachemisses that still end up filling the L2 cache:
> >  */
>
> The comment is misleading anyways.  AFAIK several of the modern
> CPUs (at least K8, later P4s, Core2, POWER4+, PPC970) have prefetch
> predictors advanced enough to follow several streams forward and backwards
> in parallel.
>
> I hit this while doing NUMA benchmarking for example.
>
> Most likely to be really unpredictable you need to use a
> true RND and somehow make sure still the full cache range
> is covered.

The corrected code in <http://bugzilla.kernel.org/show_bug.cgi?id=7476#c4>
covers the full cache range.  Granted that modern CPUs may be able to track
multiple simultaneous cache access streams: how many such streams are they
likely to be able to follow at once?  It seems like going from 1 to 2 would
be a big win, 2 to 3 a small win; beyond that it wouldn't likely make much
incremental difference.  So what do the actual implementations in the field
support?

The code (original and corrected) uses 6 simultaneous streams.

I have a modified version that takes a `ways' parameter to use an arbitrary
number of streams.  I'll post that onto bugzilla.kernel.org.

>Bela<
* Re: touch_cache() only touches two thirds
From: Andi Kleen @ 2006-11-11  0:23 UTC (permalink / raw)
To: Bela Lubkin; +Cc: linux-kernel, mingo

> The corrected code in <http://bugzilla.kernel.org/show_bug.cgi?id=7476#c4>
> covers the full cache range.  Granted that modern CPUs may be able to track
> multiple simultaneous cache access streams: how many such streams are they
> likely to be able to follow at once?  It seems like going from 1 to 2 would
> be a big win, 2 to 3 a small win, beyond that it wouldn't likely make much
> incremental difference.  So what do the actual implementations in the field
> support?

I remember reading at some point that a POWER4 could track at least 5+
parallel streams.  I don't know how many K8 handles, but it is multiple
too at least (forward and backwards).

I don't have more data, but at least the newer Intel CPUs seem to be
also very good at prefetching, and when you look at a die photo the L/S
unit in general is quite big.  More than 6 streams handled is certainly
a possibility.

I guess it could be figured out with some clever benchmarking.

> The code (original and corrected) uses 6 simultaneous streams.

My gut feeling is that this is not enough.

> I have a modified version that takes a `ways' parameter to use an arbitrary
> number of streams.  I'll post that onto bugzilla.kernel.org.

Post it to the list please.

-Andi
* Re: touch_cache() only touches two thirds
From: Bela Lubkin @ 2006-11-11  1:48 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel, mingo

Andi Kleen wrote:

Bela>> The corrected code in
Bela>> <http://bugzilla.kernel.org/show_bug.cgi?id=7476#c4> covers the
Bela>> full cache range.  Granted that modern CPUs may be able to track
Bela>> multiple simultaneous cache access streams: how many such streams
Bela>> are they likely to be able to follow at once?  It seems like
Bela>> going from 1 to 2 would be a big win, 2 to 3 a small win, beyond
Bela>> that it wouldn't likely make much incremental difference.  So
Bela>> what do the actual implementations in the field support?

Andi> I remember reading at some point that a POWER4 could track at
Andi> least 5+ parallel streams.  I don't know how many K8 handles, but
Andi> it is multiple too at least (forward and backwards)

Andi> I don't have more data, but at least the newer Intel CPUs seem to
Andi> be also very good at prefetching and when you look at a die photo
Andi> the L/S unit in general is quite big.  More than 6 streams handled
Andi> is certainly a possibility.

Andi> I guess it could be figured out with some clever benchmarking.

Bela>> The code (original and corrected) uses 6 simultaneous streams.

Andi> My gut feeling is that this is not enough.

Bela>> I have a modified version that takes a `ways' parameter to
Bela>> use an arbitrary number of streams.  I'll post that onto
Bela>> bugzilla.kernel.org.

Andi> Post it to the list please.

Ok, will do.  I'm not real sure about list etiquette here.  A discussion
is underway on <http://bugzilla.kernel.org/show_bug.cgi?id=7476>.  Here
is the code I've posted there.  (Slightly newer versions here.)

First is a C program -- a test harness that embeds the new touch_cache()
routine (it needs memory management work before it could go into the
kernel).  Then a shell script I've been using to torture it.  The
torture script tests it with cache lines up to 66 longs, and with up to
656-way streaming (artifacts of the shell's $RANDOM ;-)

Moved this to my home account so I have some control over the mailer...

>Bela<

=============================================================================
/*
 * Test program to demonstrate touch_cache() algorithms
 *
 * Bela Lubkin 2006-11-10
 */

#include <stdio.h>
#include <stdlib.h>

/* Elements Per Displayline -- display parameter for self-check code */
#define EPD 64

/*
 * The following definitions describe the cache line size of the machine
 * architecture:
 *
 *   cache_t, here defined as `long', is a single cache element
 *   LPC, Longs Per Cacheline, is the number of elements per cache line
 *
 * For consistency, cache_t should probably be int32_t, and only LPC
 * should be varied to match various architectures.
 */
#define LPC 8
int lpc = LPC;
typedef long cache_t;

void bar(void)
{
	int i;

	printf("+");
	for (i = 0; i < EPD + (EPD / lpc) - 1; i++)
		printf("=");
	printf("+\n");
}

void clear(cache_t *cache, int size)
{
	int i;

	for (i = -EPD; i < size + EPD; i++)
		cache[i] = 0;
}

/*
 * show() dumps the resulting touched cache and checks it for
 * correctness.
 *
 * The `misplaced' test isn't strictly necessary, as the actual goal is
 * merely to touch each cache line (anywhere within the line).  I found
 * the additional restriction useful to promote overall correctness
 * during the process of refining the touch_cache() algorithm.
 */
void show(cache_t *cache, int size)
{
	int i;
	int missed = 0, underrun = 0, misplaced = 0, overrun = 0;

	for (i = -EPD; i < size + EPD; i++) {
		if ((i + EPD) % EPD == 0)
			printf("|");
		printf("%c", cache[i] ? '0' + cache[i] :
		    (i < 0 || i >= size) ? '-' : '.');
		if ((i + EPD) % EPD == EPD - 1)
			printf("\n");
		else if ((i + EPD) % lpc == lpc - 1)
			printf(":");
		if (i >= 0 && i < size && (i % lpc) == 0 && cache[i] == 0)
			missed++;
		if (cache[i]) {
			if (i < 0)
				underrun += cache[i];
			if (i >= size)
				overrun += cache[i];
			if (i % lpc != 0)
				misplaced += cache[i];
		}
	}
	if ((i + EPD) % EPD != 0)
		printf("\n");
	if (missed)
		printf("ERROR: %d cache lines were missed!\n", missed);
	if (underrun)
		printf("ERROR: %d writes before beginning of buffer!\n",
		    underrun);
	if (overrun)
		printf("ERROR: %d writes after end of buffer!\n", overrun);
	if (misplaced)
		printf("ERROR: %d writes at unexpected alignments within a cache line!\n",
		    misplaced);
	if (missed || underrun || misplaced || overrun)
		exit(1);
}

static int *waystart;

/*
 * When putting into kernel, use vmalloc()/vfree();
 * change error handling.
 */
void prep_ways(int ways, int size)
{
	int way, waysize = size / ways;

	/* one extra `way' is used when `ways' is odd */
	/* (actually, only the even elements of this array are used) */
	waystart = (int *)malloc((ways + 1) * sizeof(*waystart));
	if (!waystart) {
		fprintf(stderr, "malloc failed\n");
		exit(1);
	}
	for (way = 0; way < ways + 1; way++) {
		if ((way & 1) == 0)
			/* even `waystart' points to 1st line in its `way' */
			waystart[way] = way * waysize;
		else
			/* odd `waystart' points to last line in its `way' */
			waystart[way] = (way + 1) * waysize - lpc;
		/* align to next cache line */
		waystart[way] = lpc * ((waystart[way] + lpc - 1) / lpc);
	}
}

void free_ways(void)
{
	free(waystart);
}

void touch_cache(cache_t *cache, int ways, int size)
{
	int way, i;

	/*
	 * We access the buffer via `ways' independent 'streams' of
	 * read/write access which are interleaved; every other one
	 * is written backwards.  This is supposed to keep the cache
	 * from recognizing any linear access pattern.
	 *
	 *	[---> <---|---> <---|---> <---]
	 *
	 * We touch every cacheline in the buffer (32 bytes on 32-bit
	 * systems, 64 bytes on 64-bit systems; actually now `lpc *
	 * sizeof(cache_t)', could be determined at runtime).
	 */
	for (i = 0; i < size / ways; i += lpc) {
		for (way = 0; way < ways; way++) {
			if ((way & 1) == 0)
				cache[waystart[way] + i]++;
			else
				cache[waystart[way] - i]++;
		}
	}
}

int main(int argc, char *argv[])
{
	int size, ways;
	cache_t *cache;

	if (argc < 3) {
		fprintf(stderr,
		    "usage: touch_cache cache_size ways [longs-per-cacheline]\n");
		fprintf(stderr, "       cache_size >= ways\n");
		exit(1);
	}
	size = atoi(argv[1]);
	ways = atoi(argv[2]);
	if (argc > 3)
		lpc = atoi(argv[3]);
	if (ways <= 0 || size < ways) {
		fprintf(stderr,
		    "usage: touch_cache cache_size ways [longs-per-cacheline]\n");
		fprintf(stderr, "       cache_size >= ways\n");
		exit(1);
	}

	/*
	 * This is a bit of a shuck, papering over the fact that it's
	 * hard to get perfect 1:1 cache line coverage in an odd-sized
	 * buffer...
	 */
	if (size % (ways * lpc)) {
		printf("cache_size should be a multiple of %d * ways\n", lpc);
		printf("Raising cache_size...\n");
		size = (lpc * ways) * (1 + size / lpc / ways);
	}
	printf("size: %d, ways: %d, longs-per-cacheline: %d\n",
	    size, ways, lpc);

	/* allocate an extra 2*EPD cache elements, two display lines,
	   to demonstrate not running off the ends of the buffer */
	cache = (cache_t *)malloc((EPD * 2 + size) * sizeof(*cache));
	if (!cache) {
		fprintf(stderr, "malloc failed\n");
		exit(1);
	}
	cache += EPD;
	clear(cache, size);
	prep_ways(ways, size);
	touch_cache(cache, ways, size);
	free_ways();
	bar();
	show(cache, size);
	bar();
	exit(0);
}
=============================================================================
#!/bin/bash

# Torture the touch_cache() algorithm.
#
# This produces about 24MB of output.  Any "ERROR" messages indicate a
# problem; the rest should be rather boring.  Run as:
#
#   ./touch_cache.test.sh > touch_cache.test.out
#   grep -i error touch_cache.test.out

exec 2>&1
i=0
err=0
while [ $i -lt 1000 ]; do
  let i=i+1
  let size=$RANDOM+1
  let ways=$RANDOM/50+1
  case $RANDOM in
    1[01234]???) lpc=4;;
    1[56789]???) lpc=8;;
    2[01234]???) lpc=16;;
    2[56789]???) lpc=32;;
    3????)       lpc=64;;
    *)           let lpc=$RANDOM/500+1;;
  esac
  if [ $ways -gt $size ]; then
    x=$ways ways=$size size=$x
  fi
  ./touch_cache $size $ways $lpc || let err=err+1
done
if [ $err -gt 0 ]; then
  echo ERROR: errors above, check output
else
  echo Test completed with no errors.
fi
* RE: touch_cache() only touches two thirds
From: dean gaudet @ 2006-11-18  4:18 UTC (permalink / raw)
To: Bela Lubkin; +Cc: Andi Kleen, linux-kernel, mingo

On Fri, 10 Nov 2006, Bela Lubkin wrote:

> The corrected code in <http://bugzilla.kernel.org/show_bug.cgi?id=7476#c4>
> covers the full cache range.  Granted that modern CPUs may be able to track
> multiple simultaneous cache access streams: how many such streams are they
> likely to be able to follow at once?  It seems like going from 1 to 2 would
> be a big win, 2 to 3 a small win, beyond that it wouldn't likely make much
> incremental difference.  So what do the actual implementations in the field
> support?

p-m family, core, core2 track one stream on each of 12 to 16 pages.  in
the earlier ones they split the trackers into some forward-only and some
backward-only, but on core2 i think they're all bidirectional.  if i had
to guess they round-robin the trackers, so once you hit 17 pages with
streams they're defeated.

a p4 (0f0403, probably "prescott") i have here is tracking 16 -- seems
to use LRU or pLRU but i haven't tested really, you need to get out past
32 streams before it really starts falling off... and even then the
next-line prefetch in the L2 helps too much (64-byte lines, but 128-byte
tags and a pair of dirty/state bits -- it prefetches the other half of a
pair automatically).  oh it can track forward or backward, and is happy
with strides up to 128.

k8 rev F tracks one stream on each of 20 pages (forward or backward).
it also seems to use round-robin, and is defeated as soon as you have 21
streams.

i swear there was an x86 which did 28 streams, but it was a few years
ago that i last looked really closely at the prefetchers and i don't
have access to the data at the moment.

i suggest that streams are the wrong approach.

i was actually considering this same problem this week, happy to see
your thread.  the approach i was considering was to set up two pointer
chases:

one pointer chase covering enough cache lines (and in a prefetchable
ordering) for "scrubbing" the cache(s).

another pointer chase arranged to fill the L1 (or L2) using many many
pages.  i.e. suppose i wanted to traverse 32KiB L1 with 64B cache lines
then i'd allocate 512 pages and put one line on each page (pages ordered
randomly), but colour them so they fill the L1.  this conveniently
happens to fit in a 2MiB huge page on x86, so you could even ameliorate
the TLB pressure from the microbenchmark.

you can actually get away with a pointer every 256 bytes today -- none
of the prefetchers on today's x86 cores consider a 256 byte stride to be
prefetchable.  for safety you might want to use 512 byte alignment...
this lets you get away with fewer pages for colouring larger caches.

the benchmark i was considering would be like so:

	switch to cpu m
	scrub the cache
	switch to cpu n
	scrub the cache
	traverse the coloured list and modify each cache line as we go
	switch to cpu m
	start timing
	traverse the coloured list without modification
	stop timing

-dean
* RE: touch_cache() only touches two thirds
From: dean gaudet @ 2006-11-18  4:31 UTC (permalink / raw)
To: Bela Lubkin; +Cc: Andi Kleen, linux-kernel, mingo

On Fri, 17 Nov 2006, dean gaudet wrote:

> another pointer chase arranged to fill the L1 (or L2) using many many
> pages.  i.e. suppose i wanted to traverse 32KiB L1 with 64B cache lines
> then i'd allocate 512 pages and put one line on each page (pages ordered
> randomly), but colour them so they fill the L1.  this conveniently happens
> to fit in a 2MiB huge page on x86, so you could even ameliorate the TLB
> pressure from the microbenchmark.

btw, for L2-sized measurements you don't need to be so clever... you can
get away with a random traversal of the L2 on 128B boundaries.  (need to
avoid the "next-line prefetch" issues on p-m/core/core2, p4 model 3 and
later.)  there's just so many more pages required to map the L2 than any
reasonable prefetcher is going to have any time soon.

-dean

> the benchmark i was considering would be like so:
>
> 	switch to cpu m
> 	scrub the cache
> 	switch to cpu n
> 	scrub the cache
> 	traverse the coloured list and modify each cache line as we go
> 	switch to cpu m
> 	start timing
> 	traverse the coloured list without modification
> 	stop timing