* Variation in measure_migration_cost() with scheduler-cache-hot-autodetect.patch in -mm
From: Chen, Kenneth W @ 2005-06-22 3:19 UTC
To: 'Ingo Molnar', linux-kernel, linux-ia64
I'm consistently getting a smaller-than-expected cache migration cost
as measured by Ingo's scheduler-cache-hot-autodetect.patch currently
in the -mm tree. In this patch, the memory used to calibrate migration
cost is obtained via a vmalloc() call. Would it make sense to use
__get_free_pages() instead? I did the following experiments on a
variety of machines I have access to:
                            migration cost     migration cost
                            with vmalloc mem   with __get_free_pages
    3.0GHz Xeon, 8MB cache       6.23 ms            6.32 ms
    3.4GHz Xeon, 2MB cache       1.62 ms            2.00 ms
    1.6GHz Itanium2, 9MB          9.2 ms            10.2 ms
    1.4GHz Itanium2, 4MB          4.2 ms             4.4 ms
Why the discrepancy? Possible cache coloring issue?
--- linux-2.6.12/kernel/sched.c.orig	2005-06-21 19:42:45.067876790 -0700
+++ linux-2.6.12/kernel/sched.c	2005-06-21 19:43:42.580571398 -0700
@@ -5420,7 +5420,7 @@ measure_cost(int cpu1, int cpu2, void *c
 __init static unsigned long long measure_migration_cost(int cpu1, int cpu2)
 {
 	unsigned long long max_cost = 0, fluct = 0, avg_fluct = 0;
-	unsigned int max_size, size, size_found = 0;
+	unsigned int order, max_size, size, size_found = 0;
 	long long cost = 0, prev_cost;
 	void *cache;
@@ -5448,7 +5448,10 @@ __init static unsigned long long measure
 	/*
 	 * Allocate the working set:
 	 */
-	cache = vmalloc(max_size);
+	for (order = 0; PAGE_SIZE << order < max_size; order++)
+		;
+	cache = (void *) __get_free_pages(GFP_KERNEL, order);
+
 	if (!cache) {
 		printk("could not vmalloc %d bytes for cache!\n", 2*max_size);
 		return 1000000; // return 1 msec on very small boxen
@@ -5508,7 +5511,7 @@ __init static unsigned long long measure
 	printk("[%d][%d] working set size found: %d, cost: %Ld\n",
 		cpu1, cpu2, size_found, max_cost);
-	vfree(cache);
+	free_pages((unsigned long) cache, order);
 	/*
 	 * A task is considered 'cache cold' if at least 2 times
* RE: Variation in measure_migration_cost() with scheduler-cache-hot-autodetect.patch in -mm

From: Luck, Tony @ 2005-06-22 4:53 UTC
To: Chen, Kenneth W, Ingo Molnar, linux-kernel, linux-ia64
Nitpick:
> printk("could not vmalloc %d bytes for cache!\n", 2*max_size);
You should change this printk() message to not say "vmalloc".
-Tony
* Re: Variation in measure_migration_cost() with scheduler-cache-hot-autodetect.patch in -mm

From: Ingo Molnar @ 2005-06-22 7:14 UTC
To: Chen, Kenneth W; +Cc: linux-kernel, linux-ia64
* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:
> I'm consistently getting a smaller-than-expected cache migration cost
> as measured by Ingo's scheduler-cache-hot-autodetect.patch currently
> in the -mm tree. In this patch, the memory used to calibrate migration
> cost is obtained via a vmalloc() call. Would it make sense to use
> __get_free_pages() instead? I did the following experiments on a
> variety of machines I have access to:
>
>                             migration cost     migration cost
>                             with vmalloc mem   with __get_free_pages
>     3.0GHz Xeon, 8MB cache       6.23 ms            6.32 ms
>     3.4GHz Xeon, 2MB cache       1.62 ms            2.00 ms
>     1.6GHz Itanium2, 9MB          9.2 ms            10.2 ms
>     1.4GHz Itanium2, 4MB          4.2 ms             4.4 ms
>
> Why the discrepancy? Possible cache coloring issue?
probably coloring effects, yes. Another reason could be that
touch_cache() touches 6 separate areas of memory, which combined with
the stack gives a minimum of 7 hugepage TLB entries. How many are there
in these Xeons? If there are, say, 4 of them then we could be thrashing
these TLB entries. There are many more 4K TLB entries. To reduce the
number of TLB entries utilized, could you change touch_cache() to do
something like:
	unsigned long size = __size/sizeof(long), chunk1 = size/2;
	unsigned long *cache = __cache;
	int i;

	for (i = 0; i < size/4; i += 8) {
		switch (i % 4) {
		case 0: cache[i]++;
		case 1: cache[size-1-i]++;
		case 2: cache[chunk1-i]++;
		case 3: cache[chunk1+i]++;
		}
	}
does this change the migration-cost values? Btw., how did you determine
the value of the 'ideal' migration cost? Was this based on the database
benchmark measurements?
There are a couple of reasons vmalloc() is better than gfp(): 1) it has
no size limit in the measured range, and 2) it more accurately mimics
migration costs of userspace apps, which typically have most of their
cache-footprint in paged memory, not in hugepage memory.
Ingo
* RE: Variation in measure_migration_cost() with scheduler-cache-hot-autodetect.patch in -mm

From: Chen, Kenneth W @ 2005-06-23 1:41 UTC
To: 'Ingo Molnar'; +Cc: linux-kernel, linux-ia64
Ingo Molnar wrote on Wednesday, June 22, 2005 12:15 AM
> probably coloring effects, yes. Another reason could be that
> touch_cache() touches 6 separate areas of memory, which combined with
> the stack gives a minimum of 7 hugepage TLB entries. How many are there
> in these Xeons? If there are, say, 4 of them then we could be thrashing
> these TLB entries. There are many more 4K TLB entries. To reduce the
> number of TLB entries utilized, could you change touch_cache() to do
> something like:
>
> 	unsigned long size = __size/sizeof(long), chunk1 = size/2;
> 	unsigned long *cache = __cache;
> 	int i;
>
> 	for (i = 0; i < size/4; i += 8) {
> 		switch (i % 4) {
> 		case 0: cache[i]++;
> 		case 1: cache[size-1-i]++;
> 		case 2: cache[chunk1-i]++;
> 		case 3: cache[chunk1+i]++;
> 		}
> 	}
>
> does this change the migration-cost values?
Yes, it does: it goes down on one processor but up on another, so I'm
not sure I completely understand the behavior.
                        vmalloc'ed   __get_free_pages
    3.0GHz Xeon, 8MB      6.46 ms        7.05 ms
    3.4GHz Xeon, 2MB      0.93 ms        1.22 ms
    1.6GHz ia64, 9MB      9.72 ms       10.06 ms
What I'm really after, though, is to get the parameter close to an
experimentally determined optimal value. By that measure, either access
pattern comes closer when using __get_free_pages.
> Btw., how did you determine the value of the 'ideal' migration cost?
> Was this based on the database benchmark measurements?
Yes, it is based on my favorite "industry standard transaction processing
database" benchmark (I probably should use a shorter name, like OLTP).