RE: Industry db benchmark result on recent 2.6 kernels

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* RE: Industry db benchmark result on recent 2.6 kernels
@ 2005-04-02  1:00 Chen, Kenneth W
  2005-04-02  2:12 ` Nick Piggin
  2005-04-02 14:53 ` [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels] Ingo Molnar
  0 siblings, 2 replies; 82+ messages in thread
From: Chen, Kenneth W @ 2005-04-02  1:00 UTC (permalink / raw)
  To: 'Ingo Molnar'
  Cc: 'Linus Torvalds', 'Nick Piggin',
	'Andrew Morton', linux-kernel

Ingo Molnar wrote on Thursday, March 31, 2005 8:52 PM
> the current defaults for cache_hot_time are 10 msec for NUMA domains,
> and 2.5 msec for SMP domains. Clearly too low for CPUs with 9MB cache.
> Are you increasing cache_hot_time in your experiment? If that solves
> most of the problem that would be an easy thing to fix for 2.6.12.


Chen, Kenneth W wrote on Thursday, March 31, 2005 9:15 PM
> Yes, we are increasing the number in our experiments.  It's in the queue
> and I should have a result soon.

Hot of the press: bumping up cache_hot_time to 10ms on our db setup brings
2.6.11 performance on par with 2.6.9.  Theory confirmed.

- Ken



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Industry db benchmark result on recent 2.6 kernels
  2005-04-02  1:00 Industry db benchmark result on recent 2.6 kernels Chen, Kenneth W
@ 2005-04-02  2:12 ` Nick Piggin
  2005-04-02 14:53 ` [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels] Ingo Molnar
  1 sibling, 0 replies; 82+ messages in thread
From: Nick Piggin @ 2005-04-02  2:12 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Ingo Molnar', 'Linus Torvalds',
	'Andrew Morton', linux-kernel

Chen, Kenneth W wrote:
> Ingo Molnar wrote on Thursday, March 31, 2005 8:52 PM
> 
>>the current defaults for cache_hot_time are 10 msec for NUMA domains,
>>and 2.5 msec for SMP domains. Clearly too low for CPUs with 9MB cache.
>>Are you increasing cache_hot_time in your experiment? If that solves
>>most of the problem that would be an easy thing to fix for 2.6.12.
> 
> 
> 
> Chen, Kenneth W wrote on Thursday, March 31, 2005 9:15 PM
> 
>>Yes, we are increasing the number in our experiments.  It's in the queue
>>and I should have a result soon.
> 
> 
> Hot of the press: bumping up cache_hot_time to 10ms on our db setup brings
> 2.6.11 performance on par with 2.6.9.  Theory confirmed.
> 

OK, that's good. I'll look at whether we can easily use Ingo's
tool on the SMP domain only, to avoid the large O(n^2). That might
be an acceptable short term solution for 2.6.12.

If you get a chance to also look at those block layer patches that
would be good - if they give you a nice improvement, that would
justify getting them into -mm.

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-02  1:00 Industry db benchmark result on recent 2.6 kernels Chen, Kenneth W
  2005-04-02  2:12 ` Nick Piggin
@ 2005-04-02 14:53 ` Ingo Molnar
  2005-04-02 21:22   ` Paul Jackson
  2005-04-03  5:53   ` Paul Jackson
  1 sibling, 2 replies; 82+ messages in thread
From: Ingo Molnar @ 2005-04-02 14:53 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Linus Torvalds', 'Nick Piggin',
	'Andrew Morton', linux-kernel

[-- Attachment #1: Type: text/plain, Size: 6308 bytes --]


* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

> Ingo Molnar wrote on Thursday, March 31, 2005 8:52 PM
> > the current defaults for cache_hot_time are 10 msec for NUMA domains,
> > and 2.5 msec for SMP domains. Clearly too low for CPUs with 9MB cache.
> > Are you increasing cache_hot_time in your experiment? If that solves
> > most of the problem that would be an easy thing to fix for 2.6.12.
> 
> 
> Chen, Kenneth W wrote on Thursday, March 31, 2005 9:15 PM
> > Yes, we are increasing the number in our experiments.  It's in the queue
> > and I should have a result soon.
> 
> Hot of the press: bumping up cache_hot_time to 10ms on our db setup 
> brings 2.6.11 performance on par with 2.6.9.  Theory confirmed.

very good! Could you try the attached patch? (against BK-curr, but 
should apply to 2.6.12-rc1 too)

in this patch i've reworked the migration-tuning/auto-detection code in 
a number of ways, to solve various problems the first version of the 
code had.

the first problem was that detection time scaled with O(N^2), which is 
unacceptable on larger SMP and NUMA systems. To solve this:

- i've added a 'domain distance' function, which is used to cache 
  measurement results. Each distance is only measured once. This means 
  that e.g. on NUMA distances of 0, 1 and 2 might be measured, on HT 
  distances 0 and 1, and on SMP distance 0 is measured. The code walks 
  the domain tree to determine the distance, so it automatically follows 
  whatever hierarchy an architecture sets up. This cuts down on the boot 
  time significantly and removes the O(N^2) limit. The only assumption 
  is that migration costs can be expressed as a function of domain
  distance - this covers the overwhelming majority of existing systems, 
  and is a good guess even for more assymetric systems.

  [ people hacking systems that have assymetries that break this 
    assumption (e.g. different CPU speeds) should experiment a bit with 
    the cpu_distance() function. Adding a ->migration_distance factor to 
    the domain structure would be one possible solution - but lets first 
    see the problem systems, if they exist at all. Lets not overdesign. ]

another problem was that only a single cache-size was used for measuring 
the cost of migration, and most architectures didnt set that variable 
up. Furthermore, a single cache-size does not fit NUMA hierarchies with 
L3 caches and does not fit HT setups, where different CPUs will often 
have different 'effective cache sizes'. To solve this problem:

- instead of relying on a single cache-size provided by the platform and
  sticking to it, the code now auto-detects the 'effective migration 
  cost' between two measured CPUs, via iterating through a wide range of
  cachesizes. The code searches for the maximum migration cost, which
  occurs when the working set of the test-workload falls just below the
  'effective cache size'. I.e. real-life optimized search is done for
  the maximum migration cost, between two real CPUs.

  This, amongst other things, has the positive effect hat if e.g. two
  CPUs share a L2/L3 cache, a different (and accurate) migration cost
  will be found than between two CPUs on the same system that dont share
  any caches.

(the reliable measurement of migration costs is tricky - see the source
for details.)

furthermore i've added various boot-time options to override/tune
migration behavior.

firstly, there's a blanket override for autodetection:

	migration_cost=1000,2000,3000

will override the depth 0/1/2 values with 1msec/2msec/3msec values.

secondly, there's a global factor that can be used to increase (or
decrease) the autodetected values:

	migration_factor=120

will increase the autodetected values by 20%. This option is useful to
tune things in a workload-dependent way - e.g. if a workload is
cache-insensitive then CPU utilization can be maximized by specifying
migration_factor=0.

i've tested the autodetection code quite extensively on x86, on 3
different classes of systems: dual Celeron, dual HT P4, 8-way
P3/Xeon/2MB, and the autodetected values look pretty good:

Dual Celeron (128K L2 cache):

 ---------------------
 migration cost matrix (max_cache_size: 131072, cpu: 467 MHz):
 ---------------------
           [00]    [01]
 [00]:     -     1.7(1)
 [01]:   1.7(1)    -
 ---------------------
 cacheflush times [2]: 0.0 (0) 1.7 (1784008)
 ---------------------

here the slow memory subsystem dominates system performance, and even
though caches are small, the migration cost is 1.7 msecs.

Dual HT P4 (512K L2 cache):

 ---------------------
 migration cost matrix (max_cache_size: 524288, cpu: 2379 MHz):
 ---------------------
           [00]    [01]    [02]    [03]
 [00]:     -     0.4(1)  0.0(0)  0.4(1)
 [01]:   0.4(1)    -     0.4(1)  0.0(0)
 [02]:   0.0(0)  0.4(1)    -     0.4(1)
 [03]:   0.4(1)  0.0(0)  0.4(1)    -
 ---------------------
 cacheflush times [2]: 0.0 (33900) 0.4 (448514)
 ---------------------


here it can be seen that there is no migration cost between two HT 
siblings (CPU#0/2 and CPU#1/3 are separate physical CPUs). A fast memory 
system makes inter-physical-CPU migration pretty cheap: 0.4 msecs.

8-way P3/Xeon [2MB L2 cache]:

 ---------------------
 migration cost matrix (max_cache_size: 2097152, cpu: 700 MHz):
 ---------------------
           [00]    [01]    [02]    [03]    [04]    [05]    [06]    [07]
 [00]:     -    19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
 [01]:  19.2(1)    -    19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
 [02]:  19.2(1) 19.2(1)    -    19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
 [03]:  19.2(1) 19.2(1) 19.2(1)    -    19.2(1) 19.2(1) 19.2(1) 19.2(1)
 [04]:  19.2(1) 19.2(1) 19.2(1) 19.2(1)    -    19.2(1) 19.2(1) 19.2(1)
 [05]:  19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)    -    19.2(1) 19.2(1)
 [06]:  19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)    -    19.2(1)
 [07]:  19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)    -
 ---------------------
 cacheflush times [2]: 0.0 (0) 19.2 (19281756)
 ---------------------

this one has huge caches and a relatively slow memory subsystem - so the 
migration cost is 19 msecs.

in theory the code should work fine on ia64 as well, and i'm very 
curious what kind of default migration costs the code measures there ...

	Ingo

[-- Attachment #2: cache-hot-autodetect.patch --]
[-- Type: text/plain, Size: 14919 bytes --]

--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -4699,6 +4699,400 @@ static void check_sibling_maps(void)
 #endif
 
 /*
+ * Self-tuning task migration cost measurement between source and target CPUs.
+ *
+ * This is done by measuring the cost of manipulating buffers of varying
+ * sizes. For a given buffer-size here are the steps that are taken:
+ *
+ * 1) the source CPU reads a big buffer to flush caches
+ * 2) the source CPU reads+dirties a shared buffer 
+ * 3) the target CPU reads+dirties the same shared buffer
+ * 4) the target CPU reads a big buffer to flush caches
+ *
+ * We measure how long steps #2 and #3 take (step #1 and #4 is not
+ * measured), in the following 4 scenarios:
+ *
+ *  - source: CPU1, target: CPU2 | cost1
+ *  - source: CPU2, target: CPU1 | cost2
+ *  - source: CPU1, target: CPU1 | cost3
+ *  - source: CPU2, target: CPU2 | cost4
+ *
+ * We then calculate the cost3+cost4-cost1-cost2 difference - this is
+ * the cost of migration.
+ *
+ * We then start off from a large buffer-size and iterate down to smaller
+ * buffer sizes, in 5% steps - measuring each buffer-size separately, and
+ * do a maximum search for the cost. The maximum cost for a migration
+ * occurs when the working set is just below the effective cache size.
+ */
+
+
+/*
+ * Flush the cache by reading a big buffer. (We want all writeback
+ * activities to subside. Works only if cache size is larger than
+ * 2*size, but that is good enough as the biggest migration effect
+ * is around cachesize size.)
+ */
+__init static void read_cache(void *__cache, unsigned long __size)
+{
+	unsigned long size = __size/sizeof(long);
+	unsigned long *cache = __cache;
+	volatile unsigned long data;
+	int i;
+
+	for (i = 0; i < 2*size; i += 4)
+		data = cache[i];
+}
+
+
+/*
+ * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This
+ * is the operation that is timed, so we try to generate unpredictable
+ * cachemisses that still end up filling the L2 cache:
+ */
+__init static void touch_cache(void *__cache, unsigned long __size)
+{
+	unsigned long size = __size/sizeof(long), chunk1 = size/3,
+			chunk2 = 2*size/3;
+	unsigned long *cache = __cache;
+	int i;
+
+	for (i = 0; i < size/6; i += 4) {
+		switch (i % 6) {
+			case 0: cache[i]++;
+			case 1: cache[size-1-i]++;
+			case 2: cache[chunk1-i]++;
+			case 3: cache[chunk1+i]++;
+			case 4: cache[chunk2-i]++;
+			case 5: cache[chunk2+i]++;
+		}
+	}
+}
+
+struct flush_data {
+	unsigned long source, target;
+	void (*fn)(void *, unsigned long);
+	void *cache;
+	unsigned long size;
+	unsigned long long delta;
+};
+
+/*
+ * Dirty L2 on the source CPU:
+ */
+__init static void source_handler(void *__data)
+{
+	struct flush_data *data = __data;
+
+	if (smp_processor_id() != data->source)
+		return;
+
+	/*
+	 * Make sure the cache is quiet on this CPU,
+	 * before starting measurement:
+	 */
+	read_cache(data->cache, data->size);
+
+	data->delta = sched_clock();
+	touch_cache(data->cache, data->size);
+}
+
+/*
+ * Dirty the L2 cache on this CPU and then access the shared
+ * buffer. (which represents the working set of the migrated task.)
+ */
+__init static void target_handler(void *__data)
+{
+	struct flush_data *data = __data;
+
+	if (smp_processor_id() != data->target)
+		return;
+
+	touch_cache(data->cache, data->size);
+	data->delta = sched_clock() - data->delta;
+	/*
+	 * Make sure the cache is quiet, so that it does not interfere
+	 * with the next run on this CPU:
+	 */
+	read_cache(data->cache, data->size);
+}
+
+/*
+ * Measure the cache-cost of one task migration. Returns in units of nsec.
+ */
+__init static unsigned long long measure_one(void *cache, unsigned long size,
+					     int source, int target)
+{
+	struct flush_data data;
+
+	data.source = source;
+	data.target = target;
+	data.size = size;
+	data.cache = cache;
+
+	if (on_each_cpu(source_handler, &data, 1, 1) != 0) {
+		printk("measure_one: timed out waiting for other CPUs\n");
+		return -1;
+	}
+	if (on_each_cpu(target_handler, &data, 1, 1) != 0) {
+		printk("measure_one: timed out waiting for other CPUs\n");
+		return -1;
+	}
+
+	return data.delta;
+}
+
+/*
+ * Maximum cache-size that the scheduler should try to measure.
+ * Architectures with larger caches should tune this up during
+ * bootup. Gets used in the domain-setup code (i.e. during SMP
+ * bootup).
+ */
+__initdata unsigned long max_cache_size;
+
+/*
+ * The migration cost is a function of 'domain distance'. Domain
+ * distance is the number of steps a CPU has to iterate down its
+ * domain tree to share a domain with the other CPU. The farther
+ * two CPUs are from each other, the larger the distance gets.
+ *
+ * Note that we use the distance only to cache measurement results,
+ * the distance value is not used numerically otherwise. When two
+ * CPUs have the same distance it is assumed that the migration
+ * cost is the same. (this is a simplification but quite practical)
+ */
+#define MAX_DOMAIN_DISTANCE 32
+
+static __initdata unsigned long long migration_cost[MAX_DOMAIN_DISTANCE];
+
+/*
+ * Allow override of migration cost - in units of microseconds.
+ * E.g. migration_cost=1000,2000,3000 will set up a level-1 cost
+ * of 1 msec, level-2 cost of 2 msecs and level3 cost of 3 msecs:
+ */
+static int __init migration_cost_setup(char *str)
+{
+	int ints[MAX_DOMAIN_DISTANCE+1], i;
+
+	str = get_options(str, ARRAY_SIZE(ints), ints);
+
+	printk("#ints: %d\n", ints[0]);
+	for (i = 1; i <= ints[0]; i++) {
+		migration_cost[i-1] = (unsigned long long)ints[i]*1000;
+		printk("migration_cost[%d]: %Ld\n", i-1, migration_cost[i-1]);
+	}
+	return 1;
+}
+
+__setup ("migration_cost=", migration_cost_setup);
+
+/*
+ * Global multiplier (divisor) for migration-cutoff values,
+ * in percentiles. E.g. use a value of 150 to get 1.5 times
+ * longer cache-hot cutoff times.
+ *
+ * (We scale it from 100 to 128 to long long handling easier.)
+ */
+
+#define MIGRATION_FACTOR_SCALE 128
+
+static __initdata unsigned int migration_factor = MIGRATION_FACTOR_SCALE;
+
+static int __init setup_migration_factor(char *str)
+{
+	get_option(&str, &migration_factor);
+	migration_factor = migration_factor * MIGRATION_FACTOR_SCALE / 100;
+	return 1;
+}
+
+__setup("migration_factor=", setup_migration_factor);
+
+/*
+ * Estimated distance of two CPUs, measured via the number of domains
+ * we have to pass for the two CPUs to be in the same span:
+ */
+__init static unsigned long cpu_distance(int cpu1, int cpu2)
+{
+	unsigned long distance = 0;
+	struct sched_domain *sd;
+
+	for_each_domain(cpu1, sd) {
+		WARN_ON(!cpu_isset(cpu1, sd->span));
+		if (cpu_isset(cpu2, sd->span))
+			return distance;
+		distance++;
+	}
+	if (distance >= MAX_DOMAIN_DISTANCE) {
+		WARN_ON(1);
+		distance = MAX_DOMAIN_DISTANCE-1;
+	}
+
+	return distance;
+}
+
+/*
+ * Measure a series of task migrations and return the average
+ * result. Since this code runs early during bootup the system
+ * is 'undisturbed' and the average latency makes sense.
+ *
+ * The algorithm in essence auto-detects the relevant cache-size,
+ * so it will properly detect different cachesizes for different
+ * cache-hierarchies, depending on how the CPUs are connected.
+ *
+ * Architectures can prime the upper limit of the search range via
+ * max_cache_size, otherwise the search range defaults to 20MB...64K.
+ */
+#define SEARCH_SCOPE		2
+#define MIN_CACHE_SIZE		(64*1024UL)
+#define DEFAULT_CACHE_SIZE	(8*1024*1024UL)
+#define ITERATIONS		4
+
+__init static unsigned long long measure_cacheflush_time(int cpu1, int cpu2)
+{
+	unsigned long long cost = 0, cost1 = 0, cost2 = 0;
+	unsigned long size;
+	void *cache;
+	int i;
+
+	/*
+	 * Search from max_cache_size*5 down to 64K - the real relevant
+	 * cachesize has to lie somewhere inbetween.
+	 */
+	if (max_cache_size)
+		size = max(max_cache_size * SEARCH_SCOPE, MIN_CACHE_SIZE);
+	else
+		/*
+		 * Since we have no estimation about the relevant
+		 * search range
+		 */
+		size = DEFAULT_CACHE_SIZE * SEARCH_SCOPE;
+		
+	if (!cpu_online(cpu1) || !cpu_online(cpu2)) {
+		printk("cpu %d and %d not both online!\n", cpu1, cpu2);
+		return 0;
+	}
+	/*
+	 * We allocate 2*size so that read_cache() can access a
+	 * larger buffer:
+	 */
+	cache = vmalloc(2*size);
+	if (!cache) {
+		printk("could not vmalloc %ld bytes for cache!\n", 2*size);
+		return 1000000; // return 1 msec on very small boxen
+	}
+	memset(cache, 0, 2*size);
+
+	while (size >= MIN_CACHE_SIZE) {
+		/*
+		 * Measure the migration cost of 'size' bytes, over an
+		 * average of 10 runs:
+		 *
+		 * (We perturb the cache size by a small (0..4k)
+		 *  value to compensate size/alignment related artifacts.
+		 *  We also subtract the cost of the operation done on
+		 *  the same CPU.)
+		 */
+		cost1 = 0;
+		for (i = 0; i < ITERATIONS; i++) {
+			cost1 += measure_one(cache, size - i*1024, cpu1, cpu2);
+			cost1 += measure_one(cache, size - i*1024, cpu2, cpu1);
+		}
+
+		cost2 = 0;
+		for (i = 0; i < ITERATIONS; i++) {
+			cost2 += measure_one(cache, size - i*1024, cpu1, cpu1);
+			cost2 += measure_one(cache, size - i*1024, cpu2, cpu2);
+		}
+		if (cost1 > cost2)
+			cost = max(cost, cost1 - cost2);
+		/*
+		 * Iterate down the cachesize (in a non-power-of-2
+		 * way to avoid artifacts) in 5% decrements:
+		 */
+		size = size * 19 / 20;
+	}
+	/*
+	 * Get the per-iteration migration cost:
+	 */
+	do_div(cost, 2*ITERATIONS);
+	vfree(cache);
+
+	/*
+	 * A task is considered 'cache cold' if at least 2 times
+	 * the worst-case cost of migration has passed.
+	 *
+	 * (this limit is only listened to if the load-balancing
+	 * situation is 'nice' - if there is a large imbalance we
+	 * ignore it for the sake of CPU utilization and
+	 * processing fairness.)
+	 */
+	return 2 * cost * migration_factor / MIGRATION_FACTOR_SCALE;
+}
+
+void __devinit calibrate_cache_decay(void)
+{
+	int cpu1 = -1, cpu2 = -1, cpu;
+	struct sched_domain *sd;
+	unsigned long distance, max_distance = 0;
+	unsigned long long cost;
+
+	printk("---------------------\n");
+	printk("migration cost matrix (max_cache_size: %ld, cpu: %ld MHz):\n",
+		max_cache_size, cpu_khz/1000);
+	printk("---------------------\n");
+	printk("      ");
+	for_each_online_cpu(cpu1)
+		printk("    [%02d]", cpu1);
+	printk("\n");
+	/*
+	 * First pass - calculate the cacheflush times:
+	 */
+	for_each_online_cpu(cpu1) {
+		printk("[%02d]: ", cpu1);
+		for_each_online_cpu(cpu2) {
+			if (cpu1 == cpu2) {
+				printk("    -   ");
+				continue;
+			}
+			distance = cpu_distance(cpu1, cpu2);
+			max_distance = max(max_distance, distance);
+			/*
+			 * Do we have the result cached already?
+			 */
+			if (migration_cost[distance])
+				cost = migration_cost[distance];
+			else {
+				cost = measure_cacheflush_time(cpu1, cpu2);
+				migration_cost[distance] = cost;
+			}
+			printk(" %2ld.%ld(%ld)", (long)cost / 1000000,
+				((long)cost / 100000) % 10, distance);
+		}
+		printk("\n");
+	}
+	printk("---------------------\n");
+	printk("cacheflush times [%ld]:", max_distance+1);
+	for (distance = 0; distance <= max_distance; distance++) {
+		cost = migration_cost[distance];
+		printk(" %ld.%ld (%Ld)", (long)cost / 1000000,
+			((long)cost / 100000) % 10, cost);
+	}
+	printk("\n");
+	printk("---------------------\n");
+	/*
+	 * Second pass - update the sched domain hierarchy with
+	 * the new cache-hot-time estimations:
+	 */
+	for_each_online_cpu(cpu) {
+		distance = 0;
+		for_each_domain(cpu, sd) {
+			sd->cache_hot_time = migration_cost[distance];
+			distance++;
+		}
+	}
+}
+
+/*
  * Set up scheduler domains and groups.  Callers must hold the hotplug lock.
  */
 static void __devinit arch_init_sched_domains(void)
@@ -4820,6 +5214,10 @@ static void __devinit arch_init_sched_do
 #endif
 		cpu_attach_domain(sd, i);
 	}
+	/*
+	 * Tune cache-hot values:
+	 */
+	calibrate_cache_decay();
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
--- linux/arch/ia64/kernel/domain.c.orig
+++ linux/arch/ia64/kernel/domain.c
@@ -358,6 +358,11 @@ next_sg:
 #endif
 		cpu_attach_domain(sd, i);
 	}
+	/*
+	 * Tune cache-hot values:
+	 */
+	max_cache_size = 32*1024*1024;
+	calibrate_cache_decay();
 }
 
 void __devinit arch_destroy_sched_domains(void)
--- linux/arch/i386/kernel/smpboot.c.orig
+++ linux/arch/i386/kernel/smpboot.c
@@ -873,6 +873,7 @@ static void smp_tune_scheduling (void)
 			cachesize = 16; /* Pentiums, 2x8kB cache */
 			bandwidth = 100;
 		}
+		max_cache_size = cachesize * 1024;
 	}
 }
 
--- linux/include/asm-ia64/topology.h.orig
+++ linux/include/asm-ia64/topology.h
@@ -51,7 +51,6 @@ void build_cpu_to_node_map(void);
 	.max_interval		= 320,			\
 	.busy_factor		= 320,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
@@ -73,7 +72,6 @@ void build_cpu_to_node_map(void);
 	.max_interval		= 320,			\
 	.busy_factor		= 320,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/linux/topology.h.orig
+++ linux/include/linux/topology.h
@@ -112,7 +112,6 @@
 	.max_interval		= 4,			\
 	.busy_factor		= 64,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (5*1000000/2),	\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -527,7 +527,15 @@ extern cpumask_t cpu_isolated_map;
 extern void init_sched_build_groups(struct sched_group groups[],
 	                        cpumask_t span, int (*group_fn)(int cpu));
 extern void cpu_attach_domain(struct sched_domain *sd, int cpu);
+
 #endif /* ARCH_HAS_SCHED_DOMAIN */
+
+/*
+ * Maximum cache size the auto-tuning code will search:
+ */
+extern unsigned long max_cache_size;
+extern void calibrate_cache_decay(void);
+
 #endif /* CONFIG_SMP */
 
 
--- linux/include/asm-i386/topology.h.orig
+++ linux/include/asm-i386/topology.h
@@ -75,7 +75,6 @@ static inline cpumask_t pcibus_to_cpumas
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/asm-ppc64/topology.h.orig
+++ linux/include/asm-ppc64/topology.h
@@ -46,7 +46,6 @@ static inline int node_to_first_cpu(int 
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/asm-x86_64/topology.h.orig
+++ linux/include/asm-x86_64/topology.h
@@ -48,7 +48,6 @@ static inline cpumask_t __pcibus_to_cpum
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-02 14:53 ` [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels] Ingo Molnar
@ 2005-04-02 21:22   ` Paul Jackson
  2005-04-03  5:53   ` Paul Jackson
  1 sibling, 0 replies; 82+ messages in thread
From: Paul Jackson @ 2005-04-02 21:22 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: kenneth.w.chen, torvalds, nickpiggin, akpm, linux-kernel

Ingo wrote:
> in theory the code should work fine on ia64 as well,

Nice.  I'll try it on our SN2 Altix IA64 as well.
Though I am being delayed a day or two in this
by irrelevant problems.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-02 14:53 ` [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels] Ingo Molnar
  2005-04-02 21:22   ` Paul Jackson
@ 2005-04-03  5:53   ` Paul Jackson
  2005-04-03  7:04     ` Ingo Molnar
  2005-04-03  9:01     ` Paul Jackson
  1 sibling, 2 replies; 82+ messages in thread
From: Paul Jackson @ 2005-04-03  5:53 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: kenneth.w.chen, torvalds, nickpiggin, akpm, linux-kernel

Just so as no else wastes time repeating the little bit I've done so
far, and so I don't waste time figuring out what is already known,
here's what I have so far, trying out Ingo's "sched: auto-tune migration
costs" on ia64 SN2:

To get it to compile against 2.6.12-rc1-mm4, I did thus:

      1. Manually edited "include/asm-x86_64/topology.h" to
         remove .cache_hot_time (patch failed due to conflicts
         with nearby changes to add some *_idx terms).
      2. Moved the 394 line block of new code in kernel/sched.c
         to _before_ the large  #ifdef ARCH_HAS_SCHED_DOMAIN,
         #else, #endif block.  The ia64 arch (only) defines
         ARCH_HAS_SCHED_DOMAIN, so was being denied use of Ingo's
         code when it was buried in the '#else-#endif' side of
         this large conditional block.
      3. Add "#include <linux/vmalloc.h>" to kernel/sched.c
      4. Don't print cpu_khz in the cost matrix header, as cpu_khz
         is only in a few arch's (x86_64, ppc, i386, arm).

Note that (2) was just a superficial fix - it compiles, but the result
could easily be insanely stupid and I'd have no clue.  I need to
read the code some more.

Booting on an 8 CPU ia64 SN2, the console output got far enough to show:

============================ begin ============================
Brought up 8 CPUs
softlockup thread 7 started up.
Total of 8 processors activated (15548.60 BogoMIPS).
---------------------
migration cost matrix (max_cache_size: 33554432):
---------------------
          [00]    [01]    [02]    [03]    [04]    [05]    [06]    [07]
[00]:     -
============================= end =============================

Then it hung for 5 or 10 minutes, and then it blurted out a panic and
died. I'll quote the whole panic, including backtrace, in case someone
happens to see something obvious.

But I'm not asking anyone to think about this yet, unless it amuses
them.  I can usefully occupy myself reading the code and adding printk's
for a while.

Note the first 3 chars of the panic message "4.5".  This looks like it
might be the [00]-[01] entry of Ingo's table, flushed out when the
newlines of the panic came through.

============================ begin ============================
4.5(0)<1>Unable to handle kernel paging request at virtual address 0000000000010008
swapper[1]: Oops 8813272891392 [1]
Modules linked in:

Pid: 1, CPU 0, comm:              swapper
psr : 0000101008026018 ifs : 8000000000000288 ip  : [<a0000001000d9a30>]    Not tainted
ip is at queue_work+0xb0/0x1a0
unat: 0000000000000000 pfs : 0000000000000288 rsc : 0000000000000003
rnat: a000000100ab2a50 bsps: 0000000000100000 pr  : 5a66666956996a65
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a0000001000d99b0 b6  : a000000100003320 b7  : a000000100490200
f6  : 1003e0000000000009ff7 f7  : 1003e000418d3645db265
f8  : 1003e000000003b8186ed f9  : 1003e0000000000005f3b
f10 : 1003e0000000000001000 f11 : 1003e0000000000000040
r1  : a000000100c9de60 r2  : 0000000000000000 r3  : 0000000000000001
r8  : 0000000000000000 r9  : 0000000000000000 r10 : a000000100969c50
r11 : 0000000000000004 r12 : e00001b03a8d7910 r13 : e00001b03a8d0000
r14 : 0000000000000000 r15 : 0000000000010008 r16 : e00001b03a8d0dc0
r17 : 0000000000010008 r18 : 0000000000000103 r19 : a000000100c32048
r20 : a000000100c32018 r21 : a000000100aa92c8 r22 : e000003003005d90
r23 : e000003003005da8 r24 : a000000100cf2098 r25 : e000003003005db0
r26 : a000000100ab4bf4 r27 : e000003003005d81 r28 : 000000010004b001
r29 : 0000000000000000 r30 : 000000010004b000 r31 : a000000100c32010

Call Trace:
 [<a000000100010460>] show_stack+0x80/0xa0
                                sp=e00001b03a8d74d0 bsp=e00001b03a8d1620
 [<a000000100010d40>] show_regs+0x860/0x880
                                sp=e00001b03a8d76a0 bsp=e00001b03a8d15b8
 [<a000000100036390>] die+0x170/0x200
                                sp=e00001b03a8d76b0 bsp=e00001b03a8d1580
 [<a00000010005bb20>] ia64_do_page_fault+0x200/0xa40
                                sp=e00001b03a8d76b0 bsp=e00001b03a8d1520
 [<a00000010000b2c0>] ia64_leave_kernel+0x0/0x290
                                sp=e00001b03a8d7740 bsp=e00001b03a8d1520
 [<a0000001000d9a30>] queue_work+0xb0/0x1a0
                                sp=e00001b03a8d7910 bsp=e00001b03a8d14e0
 [<a0000001000db0d0>] schedule_work+0x30/0x60
                                sp=e00001b03a8d7910 bsp=e00001b03a8d14c8
 [<a000000100490230>] blank_screen_t+0x30/0x60
                                sp=e00001b03a8d7910 bsp=e00001b03a8d14b8
 [<a0000001000c8130>] run_timer_softirq+0x2d0/0x4a0
                                sp=e00001b03a8d7910 bsp=e00001b03a8d1410
 [<a0000001000bb920>] __do_softirq+0x220/0x260
                                sp=e00001b03a8d7930 bsp=e00001b03a8d1378
 [<a0000001000bb9e0>] do_softirq+0x80/0xe0
                                sp=e00001b03a8d7930 bsp=e00001b03a8d1320
 [<a0000001000bbc50>] irq_exit+0x90/0xc0
                                sp=e00001b03a8d7930 bsp=e00001b03a8d1310
 [<a00000010000f4b0>] ia64_handle_irq+0x110/0x140
                                sp=e00001b03a8d7930 bsp=e00001b03a8d12d8
 [<a00000010000b2c0>] ia64_leave_kernel+0x0/0x290
                                sp=e00001b03a8d7930 bsp=e00001b03a8d12d8
 [<a000000100844b20>] read_cache+0x40/0x60
                                sp=e00001b03a8d7b00 bsp=e00001b03a8d12c8
 [<a000000100844fb0>] target_handler+0xd0/0xe0
                                sp=e00001b03a8d7b00 bsp=e00001b03a8d1298
 [<a000000100845150>] measure_one+0x190/0x240
                                sp=e00001b03a8d7b00 bsp=e00001b03a8d1260
 [<a000000100845890>] measure_cacheflush_time+0x270/0x420
                                sp=e00001b03a8d7b30 bsp=e00001b03a8d1200
 [<a0000001000a7350>] calibrate_cache_decay+0x710/0x740
                                sp=e00001b03a8d7b40 bsp=e00001b03a8d1148
 [<a000000100056180>] arch_init_sched_domains+0x12c0/0x1e40
                                sp=e00001b03a8d7b60 bsp=e00001b03a8d0e80
 [<a000000100845a60>] sched_init_smp+0x20/0x60
                                sp=e00001b03a8d7de0 bsp=e00001b03a8d0e70
 [<a000000100009570>] init+0x250/0x440
                                sp=e00001b03a8d7de0 bsp=e00001b03a8d0e38
 [<a000000100012940>] kernel_thread_helper+0xe0/0x100
                                sp=e00001b03a8d7e30 bsp=e00001b03a8d0e10
 [<a000000100009120>] start_kernel_thread+0x20/0x40
                                sp=e00001b03a8d7e30 bsp=e00001b03a8d0e10
 <0>Kernel panic - not syncing: Aiee, killing interrupt handler!
============================= end =============================


-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-03  5:53   ` Paul Jackson
@ 2005-04-03  7:04     ` Ingo Molnar
  2005-04-03  8:15       ` Paul Jackson
                         ` (2 more replies)
  2005-04-03  9:01     ` Paul Jackson
  1 sibling, 3 replies; 82+ messages in thread
From: Ingo Molnar @ 2005-04-03  7:04 UTC (permalink / raw)
  To: Paul Jackson; +Cc: kenneth.w.chen, torvalds, nickpiggin, akpm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3018 bytes --]


* Paul Jackson <pj@engr.sgi.com> wrote:

> Just so as no else wastes time repeating the little bit I've done so 
> far, and so I don't waste time figuring out what is already known, 
> here's what I have so far, trying out Ingo's "sched: auto-tune 
> migration costs" on ia64 SN2:
> 
> To get it to compile against 2.6.12-rc1-mm4, I did thus:
> 
>       1. Manually edited "include/asm-x86_64/topology.h" to
>          remove .cache_hot_time (patch failed due to conflicts
>          with nearby changes to add some *_idx terms).

(next time you can ignore that hunk - we override the cache_hot_time 
value anyway.)

>       2. Moved the 394 line block of new code in kernel/sched.c
>          to _before_ the large  #ifdef ARCH_HAS_SCHED_DOMAIN,
>          #else, #endif block.  The ia64 arch (only) defines
>          ARCH_HAS_SCHED_DOMAIN, so was being denied use of Ingo's
>          code when it was buried in the '#else-#endif' side of
>          this large conditional block.

yeah, indeed. The place you moved it to is the right spot, as it's under 
CONFIG_SMP. I've done this in my tree too.

>       3. Add "#include <linux/vmalloc.h>" to kernel/sched.c

ok, did this in my tree too.

>       4. Don't print cpu_khz in the cost matrix header, as cpu_khz
>          is only in a few arch's (x86_64, ppc, i386, arm).

ok.

> Brought up 8 CPUs
> softlockup thread 7 started up.
> Total of 8 processors activated (15548.60 BogoMIPS).
> ---------------------
> migration cost matrix (max_cache_size: 33554432):
> ---------------------
>           [00]    [01]    [02]    [03]    [04]    [05]    [06]    [07]
> [00]:     -
> ============================= end =============================
> 
> Then it hung for 5 or 10 minutes, [...]

the default on ia64 (32MB) was way too large and caused the search to 
start from 64MB. That can take a _long_ time.

i've attached a new patch with your changes included, and a couple of 
new things added:

 - removed the 32MB max_cache_size hack from ia64 - it should now fall 
   back to the default 5MB and do a search from 10MB downwards. This
   should speed up the search.

 - added a migration_debug boot option - use it to get verbose printouts 
   about the search for the migration cost.

 - added a max_cache_size=<bytes> boot option for debugging.

 - a few cleanups

(in the next iteration of the patch i'll try a new method to further 
speed up the search - but didnt want to change it too much in this 
iteration.)

>  [<a0000001000db0d0>] schedule_work+0x30/0x60
>                                 sp=e00001b03a8d7910 bsp=e00001b03a8d14c8
>  [<a000000100490230>] blank_screen_t+0x30/0x60
>                                 sp=e00001b03a8d7910 bsp=e00001b03a8d14b8
>  [<a0000001000c8130>] run_timer_softirq+0x2d0/0x4a0
>                                 sp=e00001b03a8d7910 bsp=e00001b03a8d1410

i think the crash is an unrelated bug: it seems the screen blanking 
timer hit and has crashed the box - i suspect it didnt expect the bootup 
to take that long.

	Ingo

[-- Attachment #2: cache-hot-autodetect.patch --]
[-- Type: text/plain, Size: 16470 bytes --]

--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -47,6 +47,7 @@
 #include <linux/syscalls.h>
 #include <linux/times.h>
 #include <linux/acct.h>
+#include <linux/vmalloc.h>
 #include <asm/tlb.h>
 
 #include <asm/unistd.h>
@@ -4639,6 +4640,438 @@ void __devinit init_sched_build_groups(s
 	last->next = first;
 }
 
+/*
+ * Self-tuning task migration cost measurement between source and target CPUs.
+ *
+ * This is done by measuring the cost of manipulating buffers of varying
+ * sizes. For a given buffer-size here are the steps that are taken:
+ *
+ * 1) the source CPU reads a big buffer to flush caches
+ * 2) the source CPU reads+dirties a shared buffer 
+ * 3) the target CPU reads+dirties the same shared buffer
+ * 4) the target CPU reads a big buffer to flush caches
+ *
+ * We measure how long steps #2 and #3 take (step #1 and #4 is not
+ * measured), in the following 4 scenarios:
+ *
+ *  - source: CPU1, target: CPU2 | cost1
+ *  - source: CPU2, target: CPU1 | cost2
+ *  - source: CPU1, target: CPU1 | cost3
+ *  - source: CPU2, target: CPU2 | cost4
+ *
+ * We then calculate the cost3+cost4-cost1-cost2 difference - this is
+ * the cost of migration.
+ *
+ * We then start off from a large buffer-size and iterate down to smaller
+ * buffer sizes, in 5% steps - measuring each buffer-size separately, and
+ * do a maximum search for the cost. The maximum cost for a migration
+ * occurs when the working set is just below the effective cache size.
+ */
+
+
+/*
+ * Flush the cache by reading a big buffer. (We want all writeback
+ * activities to subside. Works only if cache size is larger than
+ * 2*size, but that is good enough as the biggest migration effect
+ * is around cachesize size.)
+ */
+__init static void read_cache(void *__cache, unsigned long __size)
+{
+	unsigned long size = __size/sizeof(long);
+	unsigned long *cache = __cache;
+	volatile unsigned long data;
+	int i;
+
+	for (i = 0; i < 2*size; i += 4)
+		data = cache[i];
+}
+
+
+/*
+ * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This
+ * is the operation that is timed, so we try to generate unpredictable
+ * cachemisses that still end up filling the L2 cache:
+ */
+__init static void touch_cache(void *__cache, unsigned long __size)
+{
+	unsigned long size = __size/sizeof(long), chunk1 = size/3,
+			chunk2 = 2*size/3;
+	unsigned long *cache = __cache;
+	int i;
+
+	for (i = 0; i < size/6; i += 4) {
+		switch (i % 6) {
+			case 0: cache[i]++;
+			case 1: cache[size-1-i]++;
+			case 2: cache[chunk1-i]++;
+			case 3: cache[chunk1+i]++;
+			case 4: cache[chunk2-i]++;
+			case 5: cache[chunk2+i]++;
+		}
+	}
+}
+
+struct flush_data {
+	unsigned long source, target;
+	void (*fn)(void *, unsigned long);
+	void *cache;
+	unsigned long size;
+	unsigned long long delta;
+};
+
+/*
+ * Dirty L2 on the source CPU:
+ */
+__init static void source_handler(void *__data)
+{
+	struct flush_data *data = __data;
+
+	if (smp_processor_id() != data->source)
+		return;
+
+	/*
+	 * Make sure the cache is quiet on this CPU,
+	 * before starting measurement:
+	 */
+	read_cache(data->cache, data->size);
+
+	data->delta = sched_clock();
+	touch_cache(data->cache, data->size);
+}
+
+/*
+ * Dirty the L2 cache on this CPU and then access the shared
+ * buffer. (which represents the working set of the migrated task.)
+ */
+__init static void target_handler(void *__data)
+{
+	struct flush_data *data = __data;
+
+	if (smp_processor_id() != data->target)
+		return;
+
+	touch_cache(data->cache, data->size);
+	data->delta = sched_clock() - data->delta;
+	/*
+	 * Make sure the cache is quiet, so that it does not interfere
+	 * with the next run on this CPU:
+	 */
+	read_cache(data->cache, data->size);
+}
+
+/*
+ * Measure the cache-cost of one task migration. Returns in units of nsec.
+ */
+__init static unsigned long long measure_one(void *cache, unsigned long size,
+					     int source, int target)
+{
+	struct flush_data data;
+
+	data.source = source;
+	data.target = target;
+	data.size = size;
+	data.cache = cache;
+
+	if (on_each_cpu(source_handler, &data, 1, 1) != 0) {
+		printk("measure_one: timed out waiting for other CPUs\n");
+		return -1;
+	}
+	if (on_each_cpu(target_handler, &data, 1, 1) != 0) {
+		printk("measure_one: timed out waiting for other CPUs\n");
+		return -1;
+	}
+
+	return data.delta;
+}
+
+/*
+ * Maximum cache-size that the scheduler should try to measure.
+ * Architectures with larger caches should tune this up during
+ * bootup. Gets used in the domain-setup code (i.e. during SMP
+ * bootup).
+ */
+__initdata unsigned int max_cache_size;
+
+static int __init setup_max_cache_size(char *str)
+{
+	get_option(&str, &max_cache_size);
+	return 1;
+}
+
+__setup("max_cache_size=", setup_max_cache_size);
+
+/*
+ * The migration cost is a function of 'domain distance'. Domain
+ * distance is the number of steps a CPU has to iterate down its
+ * domain tree to share a domain with the other CPU. The farther
+ * two CPUs are from each other, the larger the distance gets.
+ *
+ * Note that we use the distance only to cache measurement results,
+ * the distance value is not used numerically otherwise. When two
+ * CPUs have the same distance it is assumed that the migration
+ * cost is the same. (this is a simplification but quite practical)
+ */
+#define MAX_DOMAIN_DISTANCE 32
+
+static __initdata unsigned long long migration_cost[MAX_DOMAIN_DISTANCE];
+
+/*
+ * Allow override of migration cost - in units of microseconds.
+ * E.g. migration_cost=1000,2000,3000 will set up a level-1 cost
+ * of 1 msec, level-2 cost of 2 msecs and level3 cost of 3 msecs:
+ */
+static int __init migration_cost_setup(char *str)
+{
+	int ints[MAX_DOMAIN_DISTANCE+1], i;
+
+	str = get_options(str, ARRAY_SIZE(ints), ints);
+
+	printk("#ints: %d\n", ints[0]);
+	for (i = 1; i <= ints[0]; i++) {
+		migration_cost[i-1] = (unsigned long long)ints[i]*1000;
+		printk("migration_cost[%d]: %Ld\n", i-1, migration_cost[i-1]);
+	}
+	return 1;
+}
+
+__setup ("migration_cost=", migration_cost_setup);
+
+/*
+ * Global multiplier (divisor) for migration-cutoff values,
+ * in percentiles. E.g. use a value of 150 to get 1.5 times
+ * longer cache-hot cutoff times.
+ *
+ * (We scale it from 100 to 128 to long long handling easier.)
+ */
+
+#define MIGRATION_FACTOR_SCALE 128
+
+static __initdata unsigned int migration_factor = MIGRATION_FACTOR_SCALE;
+
+static int __init setup_migration_factor(char *str)
+{
+	get_option(&str, &migration_factor);
+	migration_factor = migration_factor * MIGRATION_FACTOR_SCALE / 100;
+	return 1;
+}
+
+__setup("migration_factor=", setup_migration_factor);
+
+static __initdata unsigned int migration_debug;
+
+static int __init setup_migration_debug(char *str)
+{
+	get_option(&str, &migration_debug);
+	return 1;
+}
+
+__setup("migration_debug=", setup_migration_debug);
+
+/*
+ * Estimated distance of two CPUs, measured via the number of domains
+ * we have to pass for the two CPUs to be in the same span:
+ */
+__init static unsigned long cpu_distance(int cpu1, int cpu2)
+{
+	unsigned long distance = 0;
+	struct sched_domain *sd;
+
+	for_each_domain(cpu1, sd) {
+		WARN_ON(!cpu_isset(cpu1, sd->span));
+		if (cpu_isset(cpu2, sd->span))
+			return distance;
+		distance++;
+	}
+	if (distance >= MAX_DOMAIN_DISTANCE) {
+		WARN_ON(1);
+		distance = MAX_DOMAIN_DISTANCE-1;
+	}
+
+	return distance;
+}
+
+/*
+ * Measure a series of task migrations and return the average
+ * result. Since this code runs early during bootup the system
+ * is 'undisturbed' and the average latency makes sense.
+ *
+ * The algorithm in essence auto-detects the relevant cache-size,
+ * so it will properly detect different cachesizes for different
+ * cache-hierarchies, depending on how the CPUs are connected.
+ *
+ * Architectures can prime the upper limit of the search range via
+ * max_cache_size, otherwise the search range defaults to 20MB...64K.
+ */
+#define SEARCH_SCOPE		2
+#define MIN_CACHE_SIZE		(64*1024U)
+#define DEFAULT_CACHE_SIZE	(5*1024*1024U)
+#define ITERATIONS		2
+
+__init static unsigned long long measure_cacheflush_time(int cpu1, int cpu2)
+{
+	unsigned long long cost = 0, cost1 = 0, cost2 = 0;
+	unsigned int size, cache_size = 0;
+	void *cache;
+	int i;
+
+	/*
+	 * Search from max_cache_size*5 down to 64K - the real relevant
+	 * cachesize has to lie somewhere inbetween.
+	 */
+	if (max_cache_size)
+		size = max(max_cache_size * SEARCH_SCOPE, MIN_CACHE_SIZE);
+	else
+		/*
+		 * Since we have no estimation about the relevant
+		 * search range
+		 */
+		size = DEFAULT_CACHE_SIZE * SEARCH_SCOPE;
+		
+	if (!cpu_online(cpu1) || !cpu_online(cpu2)) {
+		printk("cpu %d and %d not both online!\n", cpu1, cpu2);
+		return 0;
+	}
+	/*
+	 * We allocate 2*size so that read_cache() can access a
+	 * larger buffer:
+	 */
+	cache = vmalloc(2*size);
+	if (!cache) {
+		printk("could not vmalloc %d bytes for cache!\n", 2*size);
+		return 1000000; // return 1 msec on very small boxen
+	}
+	memset(cache, 0, 2*size);
+
+	while (size >= MIN_CACHE_SIZE) {
+		/*
+		 * Measure the migration cost of 'size' bytes, over an
+		 * average of 10 runs:
+		 *
+		 * (We perturb the cache size by a small (0..4k)
+		 *  value to compensate size/alignment related artifacts.
+		 *  We also subtract the cost of the operation done on
+		 *  the same CPU.)
+		 */
+		cost1 = 0;
+		for (i = 0; i < ITERATIONS; i++) {
+			cost1 += measure_one(cache, size - i*1024, cpu1, cpu2);
+			cost1 += measure_one(cache, size - i*1024, cpu2, cpu1);
+		}
+
+		cost2 = 0;
+		for (i = 0; i < ITERATIONS; i++) {
+			cost2 += measure_one(cache, size - i*1024, cpu1, cpu1);
+			cost2 += measure_one(cache, size - i*1024, cpu2, cpu2);
+		}
+		if (cost1 > cost2) {
+			cost = max(cost, cost1 - cost2);
+			cache_size = size;
+		}
+		if (migration_debug)
+			printk("-> [%d][%d][%7d] %3ld.%ld (%ld): (%8Ld %8Ld %8Ld)\n",
+				cpu1, cpu2, size,
+				(long)cost / 1000000,
+				((long)cost / 100000) % 10,
+				cpu_distance(cpu1, cpu2),
+				cost1, cost2, cost1-cost2);
+		/*
+		 * Iterate down the cachesize (in a non-power-of-2
+		 * way to avoid artifacts) in 5% decrements:
+		 */
+		size = size * 19 / 20;
+	}
+	/*
+	 * Get the per-iteration migration cost:
+	 */
+	do_div(cost, 2*ITERATIONS);
+
+	if (migration_debug)
+		printk("[%d][%d] cache size found: %d, cost: %Ld\n",
+			cpu1, cpu2, cache_size, cost);
+
+	vfree(cache);
+
+	/*
+	 * A task is considered 'cache cold' if at least 2 times
+	 * the worst-case cost of migration has passed.
+	 *
+	 * (this limit is only listened to if the load-balancing
+	 * situation is 'nice' - if there is a large imbalance we
+	 * ignore it for the sake of CPU utilization and
+	 * processing fairness.)
+	 */
+	return 2 * cost * migration_factor / MIGRATION_FACTOR_SCALE;
+}
+
+void __devinit calibrate_migration_costs(void)
+{
+	int cpu1 = -1, cpu2 = -1, cpu;
+	struct sched_domain *sd;
+	unsigned long distance, max_distance = 0;
+	unsigned long long cost;
+
+	printk("---------------------\n");
+	printk("migration cost matrix (max_cache_size: %d, cpu: %ld MHz):\n",
+			max_cache_size,
+#ifdef CONFIG_X86
+			cpu_khz/1000
+#else
+			-1
+#endif
+		);
+	printk("---------------------\n");
+	printk("      ");
+	for_each_online_cpu(cpu1)
+		printk("    [%02d]", cpu1);
+	printk("\n");
+	/*
+	 * First pass - calculate the cacheflush times:
+	 */
+	for_each_online_cpu(cpu1) {
+		printk("[%02d]: ", cpu1);
+		for_each_online_cpu(cpu2) {
+			if (cpu1 == cpu2) {
+				printk("    -   ");
+				continue;
+			}
+			distance = cpu_distance(cpu1, cpu2);
+			max_distance = max(max_distance, distance);
+			/*
+			 * Do we have the result cached already?
+			 */
+			if (migration_cost[distance])
+				cost = migration_cost[distance];
+			else {
+				cost = measure_cacheflush_time(cpu1, cpu2);
+				migration_cost[distance] = cost;
+			}
+			printk(" %2ld.%ld(%ld)", (long)cost / 1000000,
+				((long)cost / 100000) % 10, distance);
+		}
+		printk("\n");
+	}
+	printk("---------------------\n");
+	printk("cacheflush times [%ld]:", max_distance+1);
+	for (distance = 0; distance <= max_distance; distance++) {
+		cost = migration_cost[distance];
+		printk(" %ld.%ld (%Ld)", (long)cost / 1000000,
+			((long)cost / 100000) % 10, cost);
+	}
+	printk("\n");
+	printk("---------------------\n");
+	/*
+	 * Second pass - update the sched domain hierarchy with
+	 * the new cache-hot-time estimations:
+	 */
+	for_each_online_cpu(cpu) {
+		distance = 0;
+		for_each_domain(cpu, sd) {
+			sd->cache_hot_time = migration_cost[distance];
+			distance++;
+		}
+	}
+}
+
 
 #ifdef ARCH_HAS_SCHED_DOMAIN
 extern void __devinit arch_init_sched_domains(void);
@@ -4820,6 +5253,10 @@ static void __devinit arch_init_sched_do
 #endif
 		cpu_attach_domain(sd, i);
 	}
+	/*
+	 * Tune cache-hot values:
+	 */
+	calibrate_migration_costs();
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
--- linux/arch/ia64/kernel/domain.c.orig
+++ linux/arch/ia64/kernel/domain.c
@@ -358,6 +358,10 @@ next_sg:
 #endif
 		cpu_attach_domain(sd, i);
 	}
+	/*
+	 * Tune cache-hot values:
+	 */
+	calibrate_migration_costs();
 }
 
 void __devinit arch_destroy_sched_domains(void)
--- linux/arch/i386/kernel/smpboot.c.orig
+++ linux/arch/i386/kernel/smpboot.c
@@ -873,6 +873,7 @@ static void smp_tune_scheduling (void)
 			cachesize = 16; /* Pentiums, 2x8kB cache */
 			bandwidth = 100;
 		}
+		max_cache_size = cachesize * 1024;
 	}
 }
 
--- linux/include/asm-ia64/topology.h.orig
+++ linux/include/asm-ia64/topology.h
@@ -51,7 +51,6 @@ void build_cpu_to_node_map(void);
 	.max_interval		= 320,			\
 	.busy_factor		= 320,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
@@ -73,7 +72,6 @@ void build_cpu_to_node_map(void);
 	.max_interval		= 320,			\
 	.busy_factor		= 320,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/linux/topology.h.orig
+++ linux/include/linux/topology.h
@@ -86,7 +86,6 @@
 	.max_interval		= 2,			\
 	.busy_factor		= 8,			\
 	.imbalance_pct		= 110,			\
-	.cache_hot_time		= 0,			\
 	.cache_nice_tries	= 0,			\
 	.per_cpu_gain		= 25,			\
 	.flags			= SD_LOAD_BALANCE	\
@@ -112,7 +111,6 @@
 	.max_interval		= 4,			\
 	.busy_factor		= 64,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (5*1000000/2),	\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -527,7 +527,17 @@ extern cpumask_t cpu_isolated_map;
 extern void init_sched_build_groups(struct sched_group groups[],
 	                        cpumask_t span, int (*group_fn)(int cpu));
 extern void cpu_attach_domain(struct sched_domain *sd, int cpu);
+
 #endif /* ARCH_HAS_SCHED_DOMAIN */
+
+/*
+ * Maximum cache size the migration-costs auto-tuning code will
+ * search from:
+ */
+extern unsigned int max_cache_size;
+
+extern void calibrate_migration_costs(void);
+
 #endif /* CONFIG_SMP */
 
 
--- linux/include/asm-i386/topology.h.orig
+++ linux/include/asm-i386/topology.h
@@ -75,7 +75,6 @@ static inline cpumask_t pcibus_to_cpumas
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/asm-ppc64/topology.h.orig
+++ linux/include/asm-ppc64/topology.h
@@ -46,7 +46,6 @@ static inline int node_to_first_cpu(int 
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/asm-x86_64/topology.h.orig
+++ linux/include/asm-x86_64/topology.h
@@ -48,7 +48,6 @@ static inline cpumask_t __pcibus_to_cpum
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/asm-mips/mach-ip27/topology.h.orig
+++ linux/include/asm-mips/mach-ip27/topology.h
@@ -24,7 +24,6 @@ extern unsigned char __node_distances[MA
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-03  7:04     ` Ingo Molnar
@ 2005-04-03  8:15       ` Paul Jackson
  2005-04-03 11:34       ` Paul Jackson
  2005-04-04  1:11       ` Chen, Kenneth W
  2 siblings, 0 replies; 82+ messages in thread
From: Paul Jackson @ 2005-04-03  8:15 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: kenneth.w.chen, torvalds, nickpiggin, akpm, linux-kernel

> the default on ia64 (32MB) was way too large

Agreed.  It took about 9 minutes to search the first pair of cpus
(cpu 0 to cpu 1) from a size of 67107840 down to a size of 62906,
based on some prints I added since my last message.


> it seems the screen blanking timer hit

Ah - yes.  That makes sense.


> do a search from 10MB downwards. This
>   should speed up the search.

That will help (I'm guessing not enough - will see shortly.)


> verbose printouts 

I will put them to good use.

Thanks.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-03  7:04     ` Ingo Molnar
  2005-04-03  8:15       ` Paul Jackson
@ 2005-04-03 11:34       ` Paul Jackson
  2005-04-03 14:12         ` Paul Jackson
                           ` (2 more replies)
  2005-04-04  1:11       ` Chen, Kenneth W
  2 siblings, 3 replies; 82+ messages in thread
From: Paul Jackson @ 2005-04-03 11:34 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: kenneth.w.chen, torvalds, nickpiggin, akpm, linux-kernel

Ok - that flies, or at least walks.  It took 53 seconds to
compute this cost matrix.

Here's what it prints, on a small 8 CPU ia64 SN2 Altix, with
the migration_debug prints formatted separately from the primary
table, for ease of reading:

Total of 8 processors activated (15548.60 BogoMIPS).
---------------------
migration cost matrix (max_cache_size: 0, cpu: -1 MHz):
---------------------
          [00]    [01]    [02]    [03]    [04]    [05]    [06]    [07]
[00]:     -     4.0(0) 21.7(1) 21.7(1) 21.7(1) 21.7(1) 21.7(1) 21.7(1)
[01]:   4.0(0)    -    21.7(1) 21.7(1) 21.7(1) 21.7(1) 21.7(1) 21.7(1)
[02]:  21.7(1) 21.7(1)    -     4.0(0) 21.7(1) 21.7(1) 21.7(1) 21.7(1)
[03]:  21.7(1) 21.7(1)  4.0(0)    -    21.7(1) 21.7(1) 21.7(1) 21.7(1)
[04]:  21.7(1) 21.7(1) 21.7(1) 21.7(1)    -     4.0(0) 21.7(1) 21.7(1)
[05]:  21.7(1) 21.7(1) 21.7(1) 21.7(1)  4.0(0)    -    21.7(1) 21.7(1)
[06]:  21.7(1) 21.7(1) 21.7(1) 21.7(1) 21.7(1) 21.7(1)    -     4.0(0)
[07]:  21.7(1) 21.7(1) 21.7(1) 21.7(1) 21.7(1) 21.7(1)  4.0(0)    -
---------------------
cacheflush times [2]: 4.0 (4059264) 21.7 (21764604)
---------------------

-> [0][1][10485760]   0.0 (0): (236441590 260844347 -24402757)
-> [0][1][9961472]   0.0 (0): (223517112 247446351 -23929239)
-> [0][1][9463398]   0.0 (0): (210676318 234128642 -23452324)
-> [0][1][8990228]   0.0 (0): (199150391 222962366 -23811975)
-> [0][1][8540716]   0.0 (0): (188000682 211792893 -23792211)
-> [0][1][8113680]   0.0 (0): (177705384 201661649 -23956265)
-> [0][1][7707996]   0.0 (0): (167300335 190993072 -23692737)
-> [0][1][7322596]   0.0 (0): (157792762 181764189 -23971427)
-> [0][1][6956466]   0.0 (0): (148554966 172428430 -23873464)
-> [0][1][6608642]   0.0 (0): (140208195 163875201 -23667006)
-> [0][1][6278209]   0.0 (0): (131352820 155083956 -23731136)
-> [0][1][5964298]   0.0 (0): (123604215 147567322 -23963107)
-> [0][1][5666083]   0.0 (0): (116411565 140028494 -23616929)
-> [0][1][5382778]   0.0 (0): (109268755 133013626 -23744871)
-> [0][1][5113639]   0.0 (0): (102398180 126017425 -23619245)
-> [0][1][4857957]   0.0 (0): (95917364 119835534 -23918170)
-> [0][1][4615059]   0.0 (0): (90016707 114103575 -24086868)
-> [0][1][4384306]   0.0 (0): (84323765 108006547 -23682782)
-> [0][1][4165090]   0.0 (0): (79059754 102627005 -23567251)
-> [0][1][3956835]   0.0 (0): (73688423 97291492 -23603069)
-> [0][1][3758993]   0.0 (0): (68716008 88560989 -19844981)
-> [0][1][3571043]   0.0 (0): (63733160 81897350 -18164190)
-> [0][1][3392490]   0.0 (0): (59879383 74232277 -14352894)
-> [0][1][3222865]   0.0 (0): (56841544 66555118 -9713574)
-> [0][1][3061721]   0.0 (0): (52946522 56831787 -3885265)
-> [0][1][2908634]   2.1 (0): (48782033 46610015  2172018)
-> [0][1][2763202]   7.4 (0): (45641483 38180422  7461061)
-> [0][1][2625041]   8.1 (0): (42666487 34547956  8118531)
-> [0][1][2493788]   8.1 (0): (40480659 32408260  8072399)
-> [0][1][2369098]   8.1 (0): (37962874 30163246  7799628)
-> [0][1][2250643]   8.1 (0): (34472406 26857206  7615200)
-> [0][1][2138110]   8.1 (0): (31271314 23649223  7622091)
-> [0][1][2031204]   8.1 (0): (28089754 21439413  6650341)
-> [0][1][1929643]   8.1 (0): (26354009 18543359  7810650)
-> [0][1][1833160]   8.1 (0): (21147235 14447434  6699801)
-> [0][1][1741502]   8.1 (0): (18121355 12206595  5914760)
-> [0][1][1654426]   8.1 (0): (15329605 10598656  4730949)
-> [0][1][1571704]   8.1 (0): (13611633  8689517  4922116)
-> [0][1][1493118]   8.1 (0): (11372044  6757841  4614203)
-> [0][1][1418462]   8.1 (0): ( 9444150  4882452  4561698)
-> [0][1][1347538]   8.1 (0): ( 8191406  4085242  4106164)
-> [0][1][1280161]   8.1 (0): ( 7790609  3898213  3892396)
-> [0][1][1216152]   8.1 (0): ( 7374407  3707184  3667223)
-> [0][1][1155344]   8.1 (0): ( 6999015  3515903  3483112)
-> [0][1][1097576]   8.1 (0): ( 6673248  3322754  3350494)
-> [0][1][1042697]   8.1 (0): ( 6335524  3161843  3173681)
-> [0][1][ 990562]   8.1 (0): ( 6004402  3008483  2995919)
-> [0][1][ 941033]   8.1 (0): ( 5725906  2863829  2862077)
-> [0][1][ 893981]   8.1 (0): ( 5426110  2734901  2691209)
-> [0][1][ 849281]   8.1 (0): ( 5140906  2596169  2544737)
-> [0][1][ 806816]   8.1 (0): ( 4898502  2465125  2433377)
-> [0][1][ 766475]   8.1 (0): ( 4649361  2349720  2299641)
-> [0][1][ 728151]   8.1 (0): ( 4427640  2224358  2203282)
-> [0][1][ 691743]   8.1 (0): ( 4205722  2113134  2092588)
-> [0][1][ 657155]   8.1 (0): ( 3991213  1997003  1994210)
-> [0][1][ 624297]   8.1 (0): ( 3808184  1922251  1885933)
-> [0][1][ 593082]   8.1 (0): ( 3637960  1824619  1813341)
-> [0][1][ 563427]   8.1 (0): ( 3436507  1717571  1718936)
-> [0][1][ 535255]   8.1 (0): ( 3258815  1638947  1619868)
-> [0][1][ 508492]   8.1 (0): ( 3107777  1554970  1552807)
-> [0][1][ 483067]   8.1 (0): ( 2947291  1476728  1470563)
-> [0][1][ 458913]   8.1 (0): ( 2791433  1408435  1382998)
-> [0][1][ 435967]   8.1 (0): ( 2652944  1322870  1330074)
-> [0][1][ 414168]   8.1 (0): ( 2535588  1270619  1264969)
-> [0][1][ 393459]   8.1 (0): ( 2412219  1213071  1199148)
-> [0][1][ 373786]   8.1 (0): ( 2282233  1141089  1141144)
-> [0][1][ 355096]   8.1 (0): ( 2177739  1084862  1092877)
-> [0][1][ 337341]   8.1 (0): ( 2072370  1027962  1044408)
-> [0][1][ 320473]   8.1 (0): ( 1968254   992712   975542)
-> [0][1][ 304449]   8.1 (0): ( 1869227   934710   934517)
-> [0][1][ 289226]   8.1 (0): ( 1787025   882591   904434)
-> [0][1][ 274764]   8.1 (0): ( 1685326   834412   850914)
-> [0][1][ 261025]   8.1 (0): ( 1599396   784748   814648)
-> [0][1][ 247973]   8.1 (0): ( 1520534   752650   767884)
-> [0][1][ 235574]   8.1 (0): ( 1447687   697913   749774)
-> [0][1][ 223795]   8.1 (0): ( 1395297   655215   740082)
-> [0][1][ 212605]   8.1 (0): ( 1304578   612879   691699)
-> [0][1][ 201974]   8.1 (0): ( 1242526   582890   659636)
-> [0][1][ 191875]   8.1 (0): ( 1192597   567342   625255)
-> [0][1][ 182281]   8.1 (0): ( 1154183   522506   631677)
-> [0][1][ 173166]   8.1 (0): ( 1063889   496038   567851)
-> [0][1][ 164507]   8.1 (0): ( 1018003   479707   538296)
-> [0][1][ 156281]   8.1 (0): (  966639   453335   513304)
-> [0][1][ 148466]   8.1 (0): (  911832   450767   461065)
-> [0][1][ 141042]   8.1 (0): (  861256   398012   463244)
-> [0][1][ 133989]   8.1 (0): (  816934   375902   441032)
-> [0][1][ 127289]   8.1 (0): (  772519   357192   415327)
-> [0][1][ 120924]   8.1 (0): (  738252   348262   389990)
-> [0][1][ 114877]   8.1 (0): (  700449   337719   362730)
-> [0][1][ 109133]   8.1 (0): (  666714   321362   345352)
-> [0][1][ 103676]   8.1 (0): (  632466   301106   331360)
-> [0][1][  98492]   8.1 (0): (  605840   283103   322737)
-> [0][1][  93567]   8.1 (0): (  574951   270209   304742)
-> [0][1][  88888]   8.1 (0): (  548250   275193   273057)
-> [0][1][  84443]   8.1 (0): (  520930   247909   273021)
-> [0][1][  80220]   8.1 (0): (  497343   235625   261718)
-> [0][1][  76209]   8.1 (0): (  475014   225910   249104)
-> [0][1][  72398]   8.1 (0): (  452979   217067   235912)
-> [0][1][  68778]   8.1 (0): (  437237   210221   227016)
[0][1] cache size found: 68778, cost: 2029632
-> [0][2][10485760]  21.3 (1): (280966301 259655197 21311104)
-> [0][2][9961472]  21.3 (1): (267532701 246464370 21068331)
-> [0][2][9463398]  21.4 (1): (255880339 234472901 21407438)
-> [0][2][8990228]  21.4 (1): (243788056 222976517 20811539)
-> [0][2][8540716]  21.4 (1): (232918189 211696038 21222151)
-> [0][2][8113680]  21.4 (1): (221867712 201185104 20682608)
-> [0][2][7707996]  21.4 (1): (211791934 190954096 20837838)
-> [0][2][7322596]  21.4 (1): (202147650 181014688 21132962)
-> [0][2][6956466]  21.4 (1): (193033236 172217232 20816004)
-> [0][2][6608642]  21.4 (1): (184315412 163326007 20989405)
-> [0][2][6278209]  21.5 (1): (176305487 154773881 21531606)
-> [0][2][5964298]  21.5 (1): (168324309 147210017 21114292)
-> [0][2][5666083]  21.5 (1): (160980756 140036947 20943809)
-> [0][2][5382778]  21.5 (1): (154082090 133018746 21063344)
-> [0][2][5113639]  21.5 (1): (147265497 125683900 21581597)
-> [0][2][4857957]  21.5 (1): (140785593 119757775 21027818)
-> [0][2][4615059]  21.5 (1): (134874499 113951649 20922850)
-> [0][2][4384306]  21.5 (1): (128926147 107954231 20971916)
-> [0][2][4165090]  21.5 (1): (123503066 102480419 21022647)
-> [0][2][3956835]  21.5 (1): (118671392 97407382 21264010)
-> [0][2][3758993]  24.6 (1): (113654772 89043284 24611488)
-> [0][2][3571043]  28.4 (1): (108688391 80283321 28405070)
-> [0][2][3392490]  30.8 (1): (103382097 72550143 30831954)
-> [0][2][3222865]  31.6 (1): (98813621 67206209 31607412)
-> [0][2][3061721]  34.6 (1): (94028301 59338910 34689391)
-> [0][2][2908634]  37.4 (1): (89336206 51906192 37430014)
-> [0][2][2763202]  41.6 (1): (85311763 43645210 41666553)
-> [0][2][2625041]  42.9 (1): (80967888 38067729 42900159)
-> [0][2][2493788]  43.5 (1): (76682195 33152985 43529210)
-> [0][2][2369098]  43.5 (1): (71641348 29795784 41845564)
-> [0][2][2250643]  43.5 (1): (67010318 26541156 40469162)
-> [0][2][2138110]  43.5 (1): (61116015 23104539 38011476)
-> [0][2][2031204]  43.5 (1): (56833321 20862067 35971254)
-> [0][2][1929643]  43.5 (1): (51867693 18230427 33637266)
-> [0][2][1833160]  43.5 (1): (47463072 14447831 33015241)
-> [0][2][1741502]  43.5 (1): (43341579 11804071 31537508)
-> [0][2][1654426]  43.5 (1): (39128869 10120316 29008553)
-> [0][2][1571704]  43.5 (1): (37112854  8701340 28411514)
-> [0][2][1493118]  43.5 (1): (33762646  6895300 26867346)
-> [0][2][1418462]  43.5 (1): (30140145  5233522 24906623)
-> [0][2][1347538]  43.5 (1): (28104612  4404674 23699938)
-> [0][2][1280161]  43.5 (1): (26384793  4142184 22242609)
-> [0][2][1216152]  43.5 (1): (24998530  3738080 21260450)
-> [0][2][1155344]  43.5 (1): (23692228  3530175 20162053)
-> [0][2][1097576]  43.5 (1): (22609076  3339587 19269489)
-> [0][2][1042697]  43.5 (1): (21418282  3178330 18239952)
-> [0][2][ 990562]  43.5 (1): (20406158  3017420 17388738)
-> [0][2][ 941033]  43.5 (1): (19348908  2870304 16478604)
-> [0][2][ 893981]  43.5 (1): (18374639  2739213 15635426)
-> [0][2][ 849281]  43.5 (1): (17423103  2600269 14822834)
-> [0][2][ 806816]  43.5 (1): (16663412  2478036 14185376)
-> [0][2][ 766475]  43.5 (1): (15709827  2350116 13359711)
-> [0][2][ 728151]  43.5 (1): (15008379  2220979 12787400)
-> [0][2][ 691743]  43.5 (1): (14214646  2113246 12101400)
-> [0][2][ 657155]  43.5 (1): (13507218  2012407 11494811)
-> [0][2][ 624297]  43.5 (1): (12850596  1932937 10917659)
-> [0][2][ 593082]  43.5 (1): (12249394  1822902 10426492)
-> [0][2][ 563427]  43.5 (1): (11575604  1741807  9833797)
-> [0][2][ 535255]  43.5 (1): (11013846  1659054  9354792)
-> [0][2][ 508492]  43.5 (1): (10406034  1583114  8822920)
-> [0][2][ 483067]  43.5 (1): ( 9991526  1487716  8503810)
-> [0][2][ 458913]  43.5 (1): ( 9424050  1411217  8012833)
-> [0][2][ 435967]  43.5 (1): ( 8969577  1341693  7627884)
-> [0][2][ 414168]  43.5 (1): ( 8619883  1265423  7354460)
-> [0][2][ 393459]  43.5 (1): ( 8172469  1215225  6957244)
-> [0][2][ 373786]  43.5 (1): ( 7693509  1156090  6537419)
-> [0][2][ 355096]  43.5 (1): ( 7321746  1096988  6224758)
-> [0][2][ 337341]  43.5 (1): ( 6996736  1048745  5947991)
-> [0][2][ 320473]  43.5 (1): ( 6567567  1000294  5567273)
-> [0][2][ 304449]  43.5 (1): ( 6243750   936678  5307072)
-> [0][2][ 289226]  43.5 (1): ( 5979868   896936  5082932)
-> [0][2][ 274764]  43.5 (1): ( 5738336   843368  4894968)
-> [0][2][ 261025]  43.5 (1): ( 5365096   791876  4573220)
-> [0][2][ 247973]  43.5 (1): ( 5059736   743485  4316251)
-> [0][2][ 235574]  43.5 (1): ( 4831530   699291  4132239)
-> [0][2][ 223795]  43.5 (1): ( 4638916   680192  3958724)
-> [0][2][ 212605]  43.5 (1): ( 4481601   637145  3844456)
-> [0][2][ 201974]  43.5 (1): ( 4191611   592293  3599318)
-> [0][2][ 191875]  43.5 (1): ( 3949722   570027  3379695)
-> [0][2][ 182281]  43.5 (1): ( 3726611   537697  3188914)
-> [0][2][ 173166]  43.5 (1): ( 3592882   515552  3077330)
-> [0][2][ 164507]  43.5 (1): ( 3390972   484264  2906708)
-> [0][2][ 156281]  43.5 (1): ( 3245101   459775  2785326)
-> [0][2][ 148466]  43.5 (1): ( 3113578   440451  2673127)
-> [0][2][ 141042]  43.5 (1): ( 2931948   409050  2522898)
-> [0][2][ 133989]  43.5 (1): ( 2808474   388318  2420156)
-> [0][2][ 127289]  43.5 (1): ( 2605945   368634  2237311)
-> [0][2][ 120924]  43.5 (1): ( 2447962   348413  2099549)
-> [0][2][ 114877]  43.5 (1): ( 2311453   341602  1969851)
-> [0][2][ 109133]  43.5 (1): ( 2213124   317917  1895207)
-> [0][2][ 103676]  43.5 (1): ( 2114799   301876  1812923)
-> [0][2][  98492]  43.5 (1): ( 2029078   288864  1740214)
-> [0][2][  93567]  43.5 (1): ( 1944647   282941  1661706)
-> [0][2][  88888]  43.5 (1): ( 1878239   263551  1614688)
-> [0][2][  84443]  43.5 (1): ( 1785472   254075  1531397)
-> [0][2][  80220]  43.5 (1): ( 1708646   241511  1467135)
-> [0][2][  76209]  43.5 (1): ( 1611896   241541  1370355)
-> [0][2][  72398]  43.5 (1): ( 1518939   222991  1295948)
-> [0][2][  68778]  43.5 (1): ( 1439208   224816  1214392)
[0][2] cache size found: 68778, cost: 10882302


-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-03 11:34       ` Paul Jackson
@ 2005-04-03 14:12         ` Paul Jackson
  2005-04-03 15:01           ` Ingo Molnar
  2005-04-03 15:24           ` [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels] Ingo Molnar
  2005-04-03 14:29         ` Ingo Molnar
  2005-04-04  4:25         ` Andy Lutomirski
  2 siblings, 2 replies; 82+ messages in thread
From: Paul Jackson @ 2005-04-03 14:12 UTC (permalink / raw)
  To: mingo; +Cc: kenneth.w.chen, torvalds, nickpiggin, akpm, linux-kernel

Three more observations.

 1) The slowest measure_one() calls are, not surprisingly, those for the
    largest sizes.  At least on my test system of the moment, the plot
    of cost versus size has one major maximum (a one hump camel, not two).
    
    Seems like if we computed from smallest size upward, instead of largest
    downward, and stopped whenever two consecutive measurements were less
    than say 70% of the max seen so far, then we could save a nice chunk
    of the time.

    Of course, if two hump systems exist, this is not reliable on them.

 2) Trivial warning fix for printf format mismatch:

=================================== begin ===================================
--- 2.6.12-rc1-mm4.orig/kernel/sched.c  2005-04-03 06:32:34.000000000 -0700
+++ 2.6.12-rc1-mm4/kernel/sched.c       2005-04-03 06:34:07.000000000 -0700
@@ -5211,7 +5211,7 @@ void __devinit calibrate_migration_costs
 #ifdef CONFIG_X86
                        cpu_khz/1000
 #else
-                       -1
+                       -1L
 #endif
                );
        printk("---------------------\n");
==================================== end ====================================


 3) I was noticing that my test system was only showing a couple of distinct
    values for cpu_distance, even though it has 4 distinct distances for
    values of node_distance.  So I coded up a variant of cpu_distance that
    converts the problem to a node_distance problem, and got the following
    cost matrix:

=================================== begin ===================================
Total of 8 processors activated (15515.64 BogoMIPS).
---------------------
migration cost matrix (max_cache_size: 0, cpu: -1 MHz):
---------------------
          [00]    [01]    [02]    [03]    [04]    [05]    [06]    [07]
[00]:     -     4.0(0) 21.7(1) 21.7(1) 25.2(2) 25.2(2) 25.3(3) 25.3(3)
[01]:   4.0(0)    -    21.7(1) 21.7(1) 25.2(2) 25.2(2) 25.3(3) 25.3(3)
[02]:  21.7(1) 21.7(1)    -     4.0(0) 25.3(3) 25.3(3) 25.2(2) 25.2(2)
[03]:  21.7(1) 21.7(1)  4.0(0)    -    25.3(3) 25.3(3) 25.2(2) 25.2(2)
[04]:  25.2(2) 25.2(2) 25.3(3) 25.3(3)    -     4.0(0) 21.7(1) 21.7(1)
[05]:  25.2(2) 25.2(2) 25.3(3) 25.3(3)  4.0(0)    -    21.7(1) 21.7(1)
[06]:  25.3(3) 25.3(3) 25.2(2) 25.2(2) 21.7(1) 21.7(1)    -     4.0(0)
[07]:  25.3(3) 25.3(3) 25.2(2) 25.2(2) 21.7(1) 21.7(1)  4.0(0)    -
---------------------
cacheflush times [4]: 4.0 (4080540) 21.7 (21781380) 25.2 (25259428) 25.3 (25372682)
---------------------
==================================== end ====================================


    The code (below) is twice as complicated, the runtime twice as long,
    and it's less intuitive - sched_domains seems more appropriate as
    the basis for migration costs than the node distances in SLIT tables.
    Finally, I don't know if distinguishing between costs of 21.7 and
    25.3 is worth much.

    So the case for switching to this node_distance base is less than
    persuasive, to put it politely.

    Perhaps it's only real value is in highlighting that perhaps the
    code to setup the sched_domain topology on our ia64 SN2 Altix systems
    is too coarse, given that it only found two distance values, not four.

    If that's the case, I will have to call in someone else to examine
    whether it's appropriate to refine the sched_domains setup for this
    kind of system.  I'm not competent to determine that, nor to code it.

    Here's the code that bases cpu_distance on node_distance:

=================================== begin ===================================
__init static int cmpint(const void *a, const void *b)
{
        return *(int *)a - *(int *)b;
}

/*
 * Estimate distance of two CPUs based on their node_distance,
 * mapping to sequential integers 0, 1, ... N-1, for the N
 * distinct values of distances (closest CPUs are distance 0,
 * farthest CPUs are distance N-1).  If there are more than
 * MAX_DOMAIN_DISTANCE distinct different distance values,
 * collapse the larger distances to one value.
 */

__init static unsigned long cpu_distance(int cpu1, int cpu2)
{
	static int num_dist_vals;
	static int node_distances[MAX_DOMAIN_DISTANCE];
	int dist = node_distance(cpu_to_node(cpu1), cpu_to_node(cpu2));
	int v;

	if (num_dist_vals == 0) {
		int i, j, k;

		/*
		 * For each dist not already in node_distances[], if there's
		 * room or it's less than an existing 'luser' entry, add it.
		 */
		for_each_online_node(i) {
			for_each_online_node(j) {
				int dist = node_distance(i, j);
				int luser = -1;

				for (k = 0; k < num_dist_vals; k++) {
					if (node_distances[k] == dist)
						break;
					if (dist < node_distances[k])
						luser = k;
				}
				if (node_distances[k] == dist)
					continue;
				else if (num_dist_vals < MAX_DOMAIN_DISTANCE)
					node_distances[num_dist_vals++] = dist;
				else if (luser >= 0)
					node_distances[luser] = dist;
			}
		}
		/*
		 * Now node_distances[] has the smallest num_dist_vals
		 * distinct node distance values.  Lets sort them.
		 */
		sort(node_distances, num_dist_vals, sizeof(int), cmpint, NULL);

		if (migration_debug) {
			printk("Sorted node_distances used for cpu distances\n");
			for(v = 0; v < num_dist_vals; v++)
				printk("  node_distances[%d] = %d\n",
						v, node_distances[v]);
		}
	}

	/* The "- 1" maps all larger sizes onto one */
	for (v = 0; v < num_dist_vals - 1; v++)
		if (dist == node_distances[v])
			break;
	return v;
}
==================================== end ====================================


-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-03 14:12         ` Paul Jackson
@ 2005-04-03 15:01           ` Ingo Molnar
  2005-04-03 22:30             ` Paul Jackson
  2005-04-03 15:24           ` [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels] Ingo Molnar
  1 sibling, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2005-04-03 15:01 UTC (permalink / raw)
  To: Paul Jackson; +Cc: kenneth.w.chen, torvalds, nickpiggin, akpm, linux-kernel


* Paul Jackson <pj@engr.sgi.com> wrote:

> Three more observations.
> 
>  1) The slowest measure_one() calls are, not surprisingly, those for the
>     largest sizes.  At least on my test system of the moment, the plot
>     of cost versus size has one major maximum (a one hump camel, not two).
>     
>     Seems like if we computed from smallest size upward, instead of largest
>     downward, and stopped whenever two consecutive measurements were less
>     than say 70% of the max seen so far, then we could save a nice chunk
>     of the time.
> 
>     Of course, if two hump systems exist, this is not reliable on them.

yes, this is the approach i'm currently working on, but it's not 
reliable yet. (one of the systems i have drifts its cost into infinity 
after the hump, which shouldnt happen)

>  2) Trivial warning fix for printf format mismatch:

thx.

>  3) I was noticing that my test system was only showing a couple of 
>     distinct values for cpu_distance, even though it has 4 distinct 
>     distances for values of node_distance.  So I coded up a variant of 
>     cpu_distance that converts the problem to a node_distance problem, 
>     and got the following cost matrix:
> 
> =================================== begin ===================================
> Total of 8 processors activated (15515.64 BogoMIPS).
> ---------------------
> migration cost matrix (max_cache_size: 0, cpu: -1 MHz):
> ---------------------
>           [00]    [01]    [02]    [03]    [04]    [05]    [06]    [07]
> [00]:     -     4.0(0) 21.7(1) 21.7(1) 25.2(2) 25.2(2) 25.3(3) 25.3(3)
> [01]:   4.0(0)    -    21.7(1) 21.7(1) 25.2(2) 25.2(2) 25.3(3) 25.3(3)
> [02]:  21.7(1) 21.7(1)    -     4.0(0) 25.3(3) 25.3(3) 25.2(2) 25.2(2)
> [03]:  21.7(1) 21.7(1)  4.0(0)    -    25.3(3) 25.3(3) 25.2(2) 25.2(2)
> [04]:  25.2(2) 25.2(2) 25.3(3) 25.3(3)    -     4.0(0) 21.7(1) 21.7(1)
> [05]:  25.2(2) 25.2(2) 25.3(3) 25.3(3)  4.0(0)    -    21.7(1) 21.7(1)
> [06]:  25.3(3) 25.3(3) 25.2(2) 25.2(2) 21.7(1) 21.7(1)    -     4.0(0)
> [07]:  25.3(3) 25.3(3) 25.2(2) 25.2(2) 21.7(1) 21.7(1)  4.0(0)    -
> ---------------------
> cacheflush times [4]: 4.0 (4080540) 21.7 (21781380) 25.2 (25259428) 25.3 (25372682)

i'll first try the bottom-up approach to speed up detection (getting to
the hump is very fast most of the time). The hard part was to create a
workload that generates the hump reliably on a number of boxes - i'm
happy it works on ia64 too.

then we can let the arch override the cpu_distance() method, although i
do think that _if_ there is a significant hierarchy between CPUs it
should be represented via a matching sched-domains hierarchy, and the
full hierarchy should be tuned accordingly.

btw., the migration cost matrix we can later use to tune all the other 
sched-domains balancing related tunables as well - cache_hot_time is 
just the first obvious step. (which also happens to make the most 
difference.)

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-03 15:01           ` Ingo Molnar
@ 2005-04-03 22:30             ` Paul Jackson
  2005-04-05  6:53               ` [patch] sched: auto-tune migration costs Andi Kleen
  0 siblings, 1 reply; 82+ messages in thread
From: Paul Jackson @ 2005-04-03 22:30 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: kenneth.w.chen, torvalds, nickpiggin, akpm, linux-kernel

Ingo wrote:
> if_ there is a significant hierarchy between CPUs it
> should be represented via a matching sched-domains hierarchy,

Agreed.

I'll see how the sched domains hierarchy looks on a bigger SN2 systems.

If the CPU hierarchy is not reflected in the sched-domain hierarchy any
better there, then I will look to involve the "SN2 sched domain
hierarchy experts" in improving SN2 the sched-domain hierarchy.

Ok - that works.  Your patch of yesterday provides just the tool
I need to measure this.  Cool.

> i'll first try the bottom-up approach to speed up detection (getting to
> the hump is very fast most of the time).

Good.

> then we can let the arch override the cpu_distance() method

I'm not aware we need that, yet anyway.  First I should see if
the SN2 sched_domains need improving.  Take a shot at doing it
'the right way' before we go inventing overrides.  I suspect
you agree.

> the migration cost matrix we can later use to tune all the other 
> sched-domains balancing related tunables as well

That comes close to my actual motivation here.  I hope to expose a
"cpu_distance" such as based on this cost matrix, to userland.

We already expose the SLIT table node distances (using SN2 specific
/proc files today, others are working on an arch-neutral mechanism).

As we push more cores and hyperthreads into a single package on one end,
and more complex numa topologies on the other end, this becomes
increasingly interesting to NUMA aware user software.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs
  2005-04-03 22:30             ` Paul Jackson
@ 2005-04-05  6:53               ` Andi Kleen
  2005-04-05  7:20                 ` Paul Jackson
  0 siblings, 1 reply; 82+ messages in thread
From: Andi Kleen @ 2005-04-05  6:53 UTC (permalink / raw)
  To: Paul Jackson; +Cc: kenneth.w.chen, torvalds, nickpiggin, akpm, linux-kernel

Paul Jackson <pj@engr.sgi.com> writes:
>
> We already expose the SLIT table node distances (using SN2 specific
> /proc files today, others are working on an arch-neutral mechanism).

There is already an arch neutral mechanism in sysfs, see
/sys/devices/system/node/node*/distance

That should be exactly SLIT on x86-64 and IA64 where node_distance()
reports SLIT data.

But of course SLIT doesn't know anything about cache latencies.

/Andi

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs
  2005-04-05  6:53               ` [patch] sched: auto-tune migration costs Andi Kleen
@ 2005-04-05  7:20                 ` Paul Jackson
  0 siblings, 0 replies; 82+ messages in thread
From: Paul Jackson @ 2005-04-05  7:20 UTC (permalink / raw)
  To: Andi Kleen; +Cc: kenneth.w.chen, torvalds, nickpiggin, akpm, linux-kernel

Andi wrote:
> There is already an arch neutral mechanism in sysfs, see
> /sys/devices/system/node/node*/distance

Excellent - thank-you.

> But of course SLIT doesn't know anything about cache latencies.

Of course.  Though SLIT does know about basic node distances, which
tend to correlate with cache migration costs, apparently closely enough
for our current modest needs.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-03 14:12         ` Paul Jackson
  2005-04-03 15:01           ` Ingo Molnar
@ 2005-04-03 15:24           ` Ingo Molnar
  2005-04-03 23:08             ` Paul Jackson
  1 sibling, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2005-04-03 15:24 UTC (permalink / raw)
  To: Paul Jackson; +Cc: kenneth.w.chen, torvalds, nickpiggin, akpm, linux-kernel


* Paul Jackson <pj@engr.sgi.com> wrote:

> 
>  3) I was noticing that my test system was only showing a couple of 
>     distinct values for cpu_distance, even though it has 4 distinct 
>     distances for values of node_distance.  So I coded up a variant of 
>     cpu_distance that converts the problem to a node_distance problem, 
>     and got the following cost matrix:

>     The code (below) is twice as complicated, the runtime twice as long,
>     and it's less intuitive - sched_domains seems more appropriate as
>     the basis for migration costs than the node distances in SLIT tables.
>     Finally, I don't know if distinguishing between costs of 21.7 and
>     25.3 is worth much.

the main problem is that we can do nothing with this matrix: we only 
print it, but then the values get written into a 0/1 sched-domains 
hierarchy - so the information is lost.

if you create a sched-domains hierarchy (based on the SLIT tables, or in 
whatever other way) that matches the CPU hierarchy then you'll 
automatically get the proper distances detected.

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-03 15:24           ` [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels] Ingo Molnar
@ 2005-04-03 23:08             ` Paul Jackson
  2005-04-04  2:08               ` Nick Piggin
  2005-04-04  6:50               ` Ingo Molnar
  0 siblings, 2 replies; 82+ messages in thread
From: Paul Jackson @ 2005-04-03 23:08 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: kenneth.w.chen, torvalds, nickpiggin, akpm, linux-kernel

Ingo wrote:
> if you create a sched-domains hierarchy (based on the SLIT tables, or in 
> whatever other way) that matches the CPU hierarchy then you'll 
> automatically get the proper distances detected.

Yes - agreed.  I should push in the direction of improving the
SN2 sched domain hierarchy.

Would be a good idea to rename 'cpu_distance()' to something more
specific, like 'cpu_dist_ndx()', and reserve the generic name
'cpu_distance()' for later use to return a scaled integer distance,
rather like 'node_distance()' does now.  For example, 'cpu_distance()'
might, someday, return integer values such as:

	40  217  252  253

as are displayed (in tenths) in the debug line:

---------------------
cacheflush times [4]: 4.0 (4080540) 21.7 (21781380) 25.2 (25259428) 25.3 (25372682)
---------------------

(that is, the integer (long)cost / 100000 - one less zero).

I don't know that we have any use, yet, for this 'cpu_distance()' as a
scaled integer value.  But I'd be more comfortable reserving that name
for that purpose.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-03 23:08             ` Paul Jackson
@ 2005-04-04  2:08               ` Nick Piggin
  2005-04-04  3:55                 ` Paul Jackson
  2005-04-04  6:50               ` Ingo Molnar
  1 sibling, 1 reply; 82+ messages in thread
From: Nick Piggin @ 2005-04-04  2:08 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Ingo Molnar, kenneth.w.chen, torvalds, akpm, linux-kernel

Paul Jackson wrote:
> Ingo wrote:
> 
>>if you create a sched-domains hierarchy (based on the SLIT tables, or in 
>>whatever other way) that matches the CPU hierarchy then you'll 
>>automatically get the proper distances detected.
> 
> 
> Yes - agreed.  I should push in the direction of improving the
> SN2 sched domain hierarchy.
> 

I'd just be a bit careful about this. Your biggest systems will have
what? At least 7 or 8 domains if you're just going by the number of
hops, right? And maybe more if there is more to your topology than
just number of hops.

sched-domains firstly has a few problems even with your 2 level NUMA
domains (although I'm looking at fixing them if possible), but also
everything just has to do more work as you traverse the domains and
scan all CPUs for balancing opportunities. And its not like the cpu
scheduler uses any sort of exact science to make choices...

Nick

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-04  2:08               ` Nick Piggin
@ 2005-04-04  3:55                 ` Paul Jackson
  2005-04-04  5:45                   ` Ingo Molnar
  2005-04-04  5:56                   ` Nick Piggin
  0 siblings, 2 replies; 82+ messages in thread
From: Paul Jackson @ 2005-04-04  3:55 UTC (permalink / raw)
  To: Nick Piggin; +Cc: mingo, kenneth.w.chen, torvalds, akpm, linux-kernel

Paul wrote:
> I should push in the direction of improving the
> SN2 sched domain hierarchy.

Nick wrote:
> I'd just be a bit careful about this.

Good point - thanks.

I will - be careful.  I have no delusions that I know what would be an
"improvement" to the scheduler - if anything.

Ingo, if I understood correctly, suggested pushing any necessary detail
of the CPU hierarchy into the scheduler domains, so that his latest work
tuning migration costs could pick it up from there.

It makes good sense for the migration cost estimation to be based on
whatever CPU hierarchy is visible in the sched domains.

But if we knew the CPU hierarchy in more detail, and if we had some
other use for that detail (we don't that I know), then I take it from
your comment that we should be reluctant to push those details into the
sched domains.  Put them someplace else if we need them.

One question - how serious do you view difference in migration cost
between say 21.7 and 25.3, two of the cacheflush times I reported on a
small SN2?

I'm guessing that this is probably below the noise threshold, at least
as far as scheduler domains, schedulers and migration care, unless and
until some persuasive measurements show a situation in which it matters.

As you say - not an exact science.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-04  3:55                 ` Paul Jackson
@ 2005-04-04  5:45                   ` Ingo Molnar
  2005-04-04  5:50                     ` Paul Jackson
  2005-04-04  5:56                   ` Nick Piggin
  1 sibling, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2005-04-04  5:45 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Nick Piggin, kenneth.w.chen, torvalds, akpm, linux-kernel

* Paul Jackson <pj@engr.sgi.com> wrote:

> Ingo, if I understood correctly, suggested pushing any necessary 
> detail of the CPU hierarchy into the scheduler domains, so that his 
> latest work tuning migration costs could pick it up from there.
> 
> It makes good sense for the migration cost estimation to be based on 
> whatever CPU hierarchy is visible in the sched domains.
> 
> But if we knew the CPU hierarchy in more detail, and if we had some 
> other use for that detail (we don't that I know), then I take it from 
> your comment that we should be reluctant to push those details into 
> the sched domains.  Put them someplace else if we need them.

There's no other place to push them - most of the hierarchy related 
decisions are done based on the domain tree. So the decision to make is: 
"is it worth complicating the domain tree, in exchange for more accurate 
handling of the real hierarchy?".

In general, the pros are potentially more accuracy and thus higher 
application performance, the cons are overhead (more tree walking) and 
artifacts (the sched-domains logic is good but not perfect, and even if 
there were no bugs in it, the decisions are approximations. One more 
domain level might make things worse.)

but trying and benchmarking it is necessary to tell for sure.

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-04  5:45                   ` Ingo Molnar
@ 2005-04-04  5:50                     ` Paul Jackson
  0 siblings, 0 replies; 82+ messages in thread
From: Paul Jackson @ 2005-04-04  5:50 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: nickpiggin, kenneth.w.chen, torvalds, akpm, linux-kernel

Ingo wrote:
> There's no other place to push them 

One could make a place, if the need arose.

> but trying and benchmarking it is necessary to tell for sure.

Hard to argue with that ... ;).

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-04  3:55                 ` Paul Jackson
  2005-04-04  5:45                   ` Ingo Molnar
@ 2005-04-04  5:56                   ` Nick Piggin
  2005-04-04  6:38                     ` Paul Jackson
  1 sibling, 1 reply; 82+ messages in thread
From: Nick Piggin @ 2005-04-04  5:56 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Ingo Molnar, kenneth.w.chen, torvalds, Andrew Morton, lkml

On Sun, 2005-04-03 at 20:55 -0700, Paul Jackson wrote:

> But if we knew the CPU hierarchy in more detail, and if we had some
> other use for that detail (we don't that I know), then I take it from
> your comment that we should be reluctant to push those details into the
> sched domains.  Put them someplace else if we need them.
> 

In a sense, the information *is* already there - in node_distance.
What I think should be done is probably to use node_distance when
calculating costs, and correlate that with sched-domains as best
we can.

I've got an idea of how to do it, but I'll wait until Ingo gets the
fundamentals working wel before I have a look.

> 
> One question - how serious do you view difference in migration cost
> between say 21.7 and 25.3, two of the cacheflush times I reported on a
> small SN2?
> 
> I'm guessing that this is probably below the noise threshold, at least
> as far as scheduler domains, schedulers and migration care, unless and
> until some persuasive measurements show a situation in which it matters.
> 

Yes, likely below noise. There is an issue with a behavioural
transition point in the wakeup code where you might see good
behaviour with 21 and bad with 25, or vice versa on some workloads.
This is fixed in the scheduler patches coming through -mm though.

But I wasn't worried so much about the absolute value not being
right, rather it maybe not being deterministic. So maybe depending
on what CPU gets assigned what cpuid, you might get different
values on identical machines.

> As you say - not an exact science.
> 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-04  5:56                   ` Nick Piggin
@ 2005-04-04  6:38                     ` Paul Jackson
  2005-04-04  6:48                       ` Ingo Molnar
  0 siblings, 1 reply; 82+ messages in thread
From: Paul Jackson @ 2005-04-04  6:38 UTC (permalink / raw)
  To: Nick Piggin; +Cc: mingo, kenneth.w.chen, torvalds, akpm, linux-kernel

Nick wrote:
> In a sense, the information *is* already there - in node_distance.
> What I think should be done is probably to use node_distance when
> calculating costs, ...

Hmmm ... perhaps I'm confused, but this sure sounds like the alternative
implementation of cpu_distance using node_distance that I submitted to
this thread about 16 hours ago.  It was using this alternative that
got me the more varied matrix:

---------------------
          [00]    [01]    [02]    [03]    [04]    [05]    [06]    [07]
[00]:     -     4.0(0) 21.7(1) 21.7(1) 25.2(2) 25.2(2) 25.3(3) 25.3(3)
[01]:   4.0(0)    -    21.7(1) 21.7(1) 25.2(2) 25.2(2) 25.3(3) 25.3(3)
[02]:  21.7(1) 21.7(1)    -     4.0(0) 25.3(3) 25.3(3) 25.2(2) 25.2(2)
[03]:  21.7(1) 21.7(1)  4.0(0)    -    25.3(3) 25.3(3) 25.2(2) 25.2(2)
[04]:  25.2(2) 25.2(2) 25.3(3) 25.3(3)    -     4.0(0) 21.7(1) 21.7(1)
[05]:  25.2(2) 25.2(2) 25.3(3) 25.3(3)  4.0(0)    -    21.7(1) 21.7(1)
[06]:  25.3(3) 25.3(3) 25.2(2) 25.2(2) 21.7(1) 21.7(1)    -     4.0(0)
[07]:  25.3(3) 25.3(3) 25.2(2) 25.2(2) 21.7(1) 21.7(1)  4.0(0)    -
---------------------

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-04  6:38                     ` Paul Jackson
@ 2005-04-04  6:48                       ` Ingo Molnar
  2005-04-04  7:37                         ` Paul Jackson
  0 siblings, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2005-04-04  6:48 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Nick Piggin, kenneth.w.chen, torvalds, akpm, linux-kernel

* Paul Jackson <pj@engr.sgi.com> wrote:

> Nick wrote:
> > In a sense, the information *is* already there - in node_distance.
> > What I think should be done is probably to use node_distance when
> > calculating costs, ...
> 
> Hmmm ... perhaps I'm confused, but this sure sounds like the alternative
> implementation of cpu_distance using node_distance that I submitted to
> this thread about 16 hours ago.

yes, it's that method.

> [...] It was using this alternative that got me the more varied 
> matrix:
> 
> ---------------------
>           [00]    [01]    [02]    [03]    [04]    [05]    [06]    [07]
> [00]:     -     4.0(0) 21.7(1) 21.7(1) 25.2(2) 25.2(2) 25.3(3) 25.3(3)
> [01]:   4.0(0)    -    21.7(1) 21.7(1) 25.2(2) 25.2(2) 25.3(3) 25.3(3)
> [02]:  21.7(1) 21.7(1)    -     4.0(0) 25.3(3) 25.3(3) 25.2(2) 25.2(2)
> [03]:  21.7(1) 21.7(1)  4.0(0)    -    25.3(3) 25.3(3) 25.2(2) 25.2(2)
> [04]:  25.2(2) 25.2(2) 25.3(3) 25.3(3)    -     4.0(0) 21.7(1) 21.7(1)
> [05]:  25.2(2) 25.2(2) 25.3(3) 25.3(3)  4.0(0)    -    21.7(1) 21.7(1)
> [06]:  25.3(3) 25.3(3) 25.2(2) 25.2(2) 21.7(1) 21.7(1)    -     4.0(0)
> [07]:  25.3(3) 25.3(3) 25.2(2) 25.2(2) 21.7(1) 21.7(1)  4.0(0)    -
> ---------------------

the problem i mentioned earlier is that there is no other use for the 
matrix right now than the domain hierarchy. And if there's no place in 
the domain hieararchy to put this info then the information is lost.

so we might be able to _measure_ a rank-3 matrix, but if the domain is 
only rank-2 then we'll have to discard one level of information.

we could try some hybride method of averaging 25.3 with 21.7 and putting 
that into the domain tree, but i'd be against it for the following 
reasons:

firstly, _if_ an extra level in the hierarchy makes a difference, we 
might as well add it to the domain tree - and that may bring other 
advantages (in terms of more finegrained balancing) in addition to 
better migration.

secondly, right now the cost measurement method and calculation is 
rather simple and has minimal assumptions, and i'd like to keep it so as 
long as possible. If an extra domain level gives problems or artifacts 
elsewhere then we should fix those problems if possible, and not 
complicate the cost calculation.

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-04  6:48                       ` Ingo Molnar
@ 2005-04-04  7:37                         ` Paul Jackson
  0 siblings, 0 replies; 82+ messages in thread
From: Paul Jackson @ 2005-04-04  7:37 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: nickpiggin, kenneth.w.chen, torvalds, akpm, linux-kernel

Ingo wrote:
> the problem i mentioned earlier is that there is no other use

Eh ... whatever.  The present seems straight forward enough, with a
simple sched domain tree and your auto-tune migration cost calculation
bolted directly on top of that.

I'd better leave the futures to those more experienced than I.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-03 23:08             ` Paul Jackson
  2005-04-04  2:08               ` Nick Piggin
@ 2005-04-04  6:50               ` Ingo Molnar
  2005-04-04  7:27                 ` Paul Jackson
  1 sibling, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2005-04-04  6:50 UTC (permalink / raw)
  To: Paul Jackson; +Cc: kenneth.w.chen, torvalds, nickpiggin, akpm, linux-kernel


* Paul Jackson <pj@engr.sgi.com> wrote:

> Would be a good idea to rename 'cpu_distance()' to something more 
> specific, like 'cpu_dist_ndx()', and reserve the generic name 
> 'cpu_distance()' for later use to return a scaled integer distance, 
> rather like 'node_distance()' does now. [...]

agreed - i've changed it to domain_distance() in my tree.

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-04  6:50               ` Ingo Molnar
@ 2005-04-04  7:27                 ` Paul Jackson
  0 siblings, 0 replies; 82+ messages in thread
From: Paul Jackson @ 2005-04-04  7:27 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: kenneth.w.chen, torvalds, nickpiggin, akpm, linux-kernel

Ingo wrote:
> agreed - i've changed it to domain_distance() in my tree.

Good - cool - thanks.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-03 11:34       ` Paul Jackson
  2005-04-03 14:12         ` Paul Jackson
@ 2005-04-03 14:29         ` Ingo Molnar
  2005-04-03 23:15           ` Paul Jackson
  2005-04-04  1:31           ` Chen, Kenneth W
  2005-04-04  4:25         ` Andy Lutomirski
  2 siblings, 2 replies; 82+ messages in thread
From: Ingo Molnar @ 2005-04-03 14:29 UTC (permalink / raw)
  To: Paul Jackson; +Cc: kenneth.w.chen, torvalds, nickpiggin, akpm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1948 bytes --]


* Paul Jackson <pj@engr.sgi.com> wrote:

> Ok - that flies, or at least walks.  It took 53 seconds to compute 
> this cost matrix.

53 seconds is too much - i'm working on reducing it.

> Here's what it prints, on a small 8 CPU ia64 SN2 Altix, with
> the migration_debug prints formatted separately from the primary
> table, for ease of reading:
> 
> Total of 8 processors activated (15548.60 BogoMIPS).
> ---------------------
> migration cost matrix (max_cache_size: 0, cpu: -1 MHz):
> ---------------------
>           [00]    [01]    [02]    [03]    [04]    [05]    [06]    [07]
> [00]:     -     4.0(0) 21.7(1) 21.7(1) 21.7(1) 21.7(1) 21.7(1) 21.7(1)
> [01]:   4.0(0)    -    21.7(1) 21.7(1) 21.7(1) 21.7(1) 21.7(1) 21.7(1)
> [02]:  21.7(1) 21.7(1)    -     4.0(0) 21.7(1) 21.7(1) 21.7(1) 21.7(1)
> [03]:  21.7(1) 21.7(1)  4.0(0)    -    21.7(1) 21.7(1) 21.7(1) 21.7(1)
> [04]:  21.7(1) 21.7(1) 21.7(1) 21.7(1)    -     4.0(0) 21.7(1) 21.7(1)
> [05]:  21.7(1) 21.7(1) 21.7(1) 21.7(1)  4.0(0)    -    21.7(1) 21.7(1)
> [06]:  21.7(1) 21.7(1) 21.7(1) 21.7(1) 21.7(1) 21.7(1)    -     4.0(0)
> [07]:  21.7(1) 21.7(1) 21.7(1) 21.7(1) 21.7(1) 21.7(1)  4.0(0)    -

how close are these numbers to the real worst-case migration costs on 
that box? What are the cache sizes and what is their hierarchies?

i've attached another snapshot - there is no speedup yet, but i've 
changed the debug printout to be separate from the matrix printout, and 
i've fixed the cache_size printout. (the printout of a 68K cache was 
incorrect - that was just the last iteration step)

it will be interesting to see what effect the above assymetry in 
migration costs will have on scheduling. With 4msec intra-node cutoff it 
should be pretty migration-happy, inter-node 21 msec is rather high and 
should avoid unnecessary migration.

is there any workload that shows the same scheduling related performance 
regressions, other than Ken's $1m+ benchmark kit?

	Ingo

[-- Attachment #2: cache-hot-autodetect.patch --]
[-- Type: text/plain, Size: 16768 bytes --]

--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -47,6 +47,7 @@
 #include <linux/syscalls.h>
 #include <linux/times.h>
 #include <linux/acct.h>
+#include <linux/vmalloc.h>
 #include <asm/tlb.h>
 
 #include <asm/unistd.h>
@@ -4639,6 +4640,453 @@ void __devinit init_sched_build_groups(s
 	last->next = first;
 }
 
+/*
+ * Self-tuning task migration cost measurement between source and target CPUs.
+ *
+ * This is done by measuring the cost of manipulating buffers of varying
+ * sizes. For a given buffer-size here are the steps that are taken:
+ *
+ * 1) the source CPU reads a big buffer to flush caches
+ * 2) the source CPU reads+dirties a shared buffer 
+ * 3) the target CPU reads+dirties the same shared buffer
+ * 4) the target CPU reads a big buffer to flush caches
+ *
+ * We measure how long steps #2 and #3 take (step #1 and #4 is not
+ * measured), in the following 4 scenarios:
+ *
+ *  - source: CPU1, target: CPU2 | cost1
+ *  - source: CPU2, target: CPU1 | cost2
+ *  - source: CPU1, target: CPU1 | cost3
+ *  - source: CPU2, target: CPU2 | cost4
+ *
+ * We then calculate the cost3+cost4-cost1-cost2 difference - this is
+ * the cost of migration.
+ *
+ * We then start off from a large buffer-size and iterate down to smaller
+ * buffer sizes, in 5% steps - measuring each buffer-size separately, and
+ * do a maximum search for the cost. The maximum cost for a migration
+ * occurs when the working set is just below the effective cache size.
+ */
+
+
+/*
+ * Flush the cache by reading a big buffer. (We want all writeback
+ * activities to subside. Works only if cache size is larger than
+ * 2*size, but that is good enough as the biggest migration effect
+ * is around cachesize size.)
+ */
+__init static void read_cache(void *__cache, unsigned long __size)
+{
+	unsigned long size = __size/sizeof(long);
+	unsigned long *cache = __cache;
+	volatile unsigned long data;
+	int i;
+
+	for (i = 0; i < 2*size; i += 4)
+		data = cache[i];
+}
+
+
+/*
+ * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This
+ * is the operation that is timed, so we try to generate unpredictable
+ * cachemisses that still end up filling the L2 cache:
+ */
+__init static void touch_cache(void *__cache, unsigned long __size)
+{
+	unsigned long size = __size/sizeof(long), chunk1 = size/3,
+			chunk2 = 2*size/3;
+	unsigned long *cache = __cache;
+	int i;
+
+	for (i = 0; i < size/6; i += 4) {
+		switch (i % 6) {
+			case 0: cache[i]++;
+			case 1: cache[size-1-i]++;
+			case 2: cache[chunk1-i]++;
+			case 3: cache[chunk1+i]++;
+			case 4: cache[chunk2-i]++;
+			case 5: cache[chunk2+i]++;
+		}
+	}
+}
+
+struct flush_data {
+	unsigned long source, target;
+	void (*fn)(void *, unsigned long);
+	void *cache;
+	unsigned long size;
+	unsigned long long delta;
+};
+
+/*
+ * Dirty L2 on the source CPU:
+ */
+__init static void source_handler(void *__data)
+{
+	struct flush_data *data = __data;
+
+	if (smp_processor_id() != data->source)
+		return;
+
+	/*
+	 * Make sure the cache is quiet on this CPU,
+	 * before starting measurement:
+	 */
+	read_cache(data->cache, data->size);
+
+	data->delta = sched_clock();
+	touch_cache(data->cache, data->size);
+}
+
+/*
+ * Dirty the L2 cache on this CPU and then access the shared
+ * buffer. (which represents the working set of the migrated task.)
+ */
+__init static void target_handler(void *__data)
+{
+	struct flush_data *data = __data;
+
+	if (smp_processor_id() != data->target)
+		return;
+
+	touch_cache(data->cache, data->size);
+	data->delta = sched_clock() - data->delta;
+	/*
+	 * Make sure the cache is quiet, so that it does not interfere
+	 * with the next run on this CPU:
+	 */
+	read_cache(data->cache, data->size);
+}
+
+/*
+ * Measure the cache-cost of one task migration. Returns in units of nsec.
+ */
+__init static unsigned long long measure_one(void *cache, unsigned long size,
+					     int source, int target)
+{
+	struct flush_data data;
+
+	data.source = source;
+	data.target = target;
+	data.size = size;
+	data.cache = cache;
+
+	if (on_each_cpu(source_handler, &data, 1, 1) != 0) {
+		printk("measure_one: timed out waiting for other CPUs\n");
+		return -1;
+	}
+	if (on_each_cpu(target_handler, &data, 1, 1) != 0) {
+		printk("measure_one: timed out waiting for other CPUs\n");
+		return -1;
+	}
+
+	return data.delta;
+}
+
+/*
+ * Maximum cache-size that the scheduler should try to measure.
+ * Architectures with larger caches should tune this up during
+ * bootup. Gets used in the domain-setup code (i.e. during SMP
+ * bootup).
+ */
+__initdata unsigned int max_cache_size;
+
+static int __init setup_max_cache_size(char *str)
+{
+	get_option(&str, &max_cache_size);
+	return 1;
+}
+
+__setup("max_cache_size=", setup_max_cache_size);
+
+/*
+ * The migration cost is a function of 'domain distance'. Domain
+ * distance is the number of steps a CPU has to iterate down its
+ * domain tree to share a domain with the other CPU. The farther
+ * two CPUs are from each other, the larger the distance gets.
+ *
+ * Note that we use the distance only to cache measurement results,
+ * the distance value is not used numerically otherwise. When two
+ * CPUs have the same distance it is assumed that the migration
+ * cost is the same. (this is a simplification but quite practical)
+ */
+#define MAX_DOMAIN_DISTANCE 32
+
+static __initdata unsigned long long migration_cost[MAX_DOMAIN_DISTANCE];
+
+/*
+ * Allow override of migration cost - in units of microseconds.
+ * E.g. migration_cost=1000,2000,3000 will set up a level-1 cost
+ * of 1 msec, level-2 cost of 2 msecs and level3 cost of 3 msecs:
+ */
+static int __init migration_cost_setup(char *str)
+{
+	int ints[MAX_DOMAIN_DISTANCE+1], i;
+
+	str = get_options(str, ARRAY_SIZE(ints), ints);
+
+	printk("#ints: %d\n", ints[0]);
+	for (i = 1; i <= ints[0]; i++) {
+		migration_cost[i-1] = (unsigned long long)ints[i]*1000;
+		printk("migration_cost[%d]: %Ld\n", i-1, migration_cost[i-1]);
+	}
+	return 1;
+}
+
+__setup ("migration_cost=", migration_cost_setup);
+
+/*
+ * Global multiplier (divisor) for migration-cutoff values,
+ * in percentiles. E.g. use a value of 150 to get 1.5 times
+ * longer cache-hot cutoff times.
+ *
+ * (We scale it from 100 to 128 to long long handling easier.)
+ */
+
+#define MIGRATION_FACTOR_SCALE 128
+
+static __initdata unsigned int migration_factor = MIGRATION_FACTOR_SCALE;
+
+static int __init setup_migration_factor(char *str)
+{
+	get_option(&str, &migration_factor);
+	migration_factor = migration_factor * MIGRATION_FACTOR_SCALE / 100;
+	return 1;
+}
+
+__setup("migration_factor=", setup_migration_factor);
+
+static __initdata unsigned int migration_debug;
+
+static int __init setup_migration_debug(char *str)
+{
+	get_option(&str, &migration_debug);
+	return 1;
+}
+
+__setup("migration_debug=", setup_migration_debug);
+
+/*
+ * Estimated distance of two CPUs, measured via the number of domains
+ * we have to pass for the two CPUs to be in the same span:
+ */
+__init static unsigned long cpu_distance(int cpu1, int cpu2)
+{
+	unsigned long distance = 0;
+	struct sched_domain *sd;
+
+	for_each_domain(cpu1, sd) {
+		WARN_ON(!cpu_isset(cpu1, sd->span));
+		if (cpu_isset(cpu2, sd->span))
+			return distance;
+		distance++;
+	}
+	if (distance >= MAX_DOMAIN_DISTANCE) {
+		WARN_ON(1);
+		distance = MAX_DOMAIN_DISTANCE-1;
+	}
+
+	return distance;
+}
+
+/*
+ * Measure a series of task migrations and return the average
+ * result. Since this code runs early during bootup the system
+ * is 'undisturbed' and the average latency makes sense.
+ *
+ * The algorithm in essence auto-detects the relevant cache-size,
+ * so it will properly detect different cachesizes for different
+ * cache-hierarchies, depending on how the CPUs are connected.
+ *
+ * Architectures can prime the upper limit of the search range via
+ * max_cache_size, otherwise the search range defaults to 20MB...64K.
+ */
+#define SEARCH_SCOPE		2
+#define MIN_CACHE_SIZE		(64*1024U)
+#define DEFAULT_CACHE_SIZE	(5*1024*1024U)
+#define ITERATIONS		2
+
+__init static unsigned long long measure_cacheflush_time(int cpu1, int cpu2)
+{
+	unsigned long long cost = 0, cost1 = 0, cost2 = 0;
+	unsigned int size, cache_size = 0;
+	void *cache;
+	int i;
+
+	/*
+	 * Search from max_cache_size*5 down to 64K - the real relevant
+	 * cachesize has to lie somewhere inbetween.
+	 */
+	if (max_cache_size)
+		size = max(max_cache_size * SEARCH_SCOPE, MIN_CACHE_SIZE);
+	else
+		/*
+		 * Since we have no estimation about the relevant
+		 * search range
+		 */
+		size = DEFAULT_CACHE_SIZE * SEARCH_SCOPE;
+		
+	if (!cpu_online(cpu1) || !cpu_online(cpu2)) {
+		printk("cpu %d and %d not both online!\n", cpu1, cpu2);
+		return 0;
+	}
+	/*
+	 * We allocate 2*size so that read_cache() can access a
+	 * larger buffer:
+	 */
+	cache = vmalloc(2*size);
+	if (!cache) {
+		printk("could not vmalloc %d bytes for cache!\n", 2*size);
+		return 1000000; // return 1 msec on very small boxen
+	}
+	memset(cache, 0, 2*size);
+
+	while (size >= MIN_CACHE_SIZE) {
+		/*
+		 * Measure the migration cost of 'size' bytes, over an
+		 * average of 10 runs:
+		 *
+		 * (We perturb the cache size by a small (0..4k)
+		 *  value to compensate size/alignment related artifacts.
+		 *  We also subtract the cost of the operation done on
+		 *  the same CPU.)
+		 */
+		cost1 = 0;
+		for (i = 0; i < ITERATIONS; i++) {
+			cost1 += measure_one(cache, size - i*1024, cpu1, cpu2);
+			cost1 += measure_one(cache, size - i*1024, cpu2, cpu1);
+		}
+
+		cost2 = 0;
+		for (i = 0; i < ITERATIONS; i++) {
+			cost2 += measure_one(cache, size - i*1024, cpu1, cpu1);
+			cost2 += measure_one(cache, size - i*1024, cpu2, cpu2);
+		}
+		if (cost1 > cost2) {
+			if (cost < cost1 - cost2) {
+				cost = cost1 - cost2;
+				cache_size = size;
+			}
+		}
+		if (migration_debug)
+			printk("-> [%d][%d][%7d] %3ld.%ld (%ld): (%8Ld %8Ld %8Ld)\n",
+				cpu1, cpu2, size,
+				(long)cost / 1000000,
+				((long)cost / 100000) % 10,
+				cpu_distance(cpu1, cpu2),
+				cost1, cost2, cost1-cost2);
+		/*
+		 * Iterate down the cachesize (in a non-power-of-2
+		 * way to avoid artifacts) in 5% decrements:
+		 */
+		size = size * 19 / 20;
+	}
+	/*
+	 * Get the per-iteration migration cost:
+	 */
+	do_div(cost, 2*ITERATIONS);
+
+	if (migration_debug)
+		printk("[%d][%d] cache size found: %d, cost: %Ld\n",
+			cpu1, cpu2, cache_size, cost);
+
+	vfree(cache);
+
+	/*
+	 * A task is considered 'cache cold' if at least 2 times
+	 * the worst-case cost of migration has passed.
+	 *
+	 * (this limit is only listened to if the load-balancing
+	 * situation is 'nice' - if there is a large imbalance we
+	 * ignore it for the sake of CPU utilization and
+	 * processing fairness.)
+	 */
+	return 2 * cost * migration_factor / MIGRATION_FACTOR_SCALE;
+}
+
+void __devinit calibrate_migration_costs(void)
+{
+	int cpu1 = -1, cpu2 = -1, cpu;
+	struct sched_domain *sd;
+	unsigned long distance, max_distance = 0;
+	unsigned long long cost;
+
+	for_each_online_cpu(cpu1)
+		printk("    [%02d]", cpu1);
+	printk("\n");
+	/*
+	 * First pass - calculate the cacheflush times:
+	 */
+	for_each_online_cpu(cpu1) {
+		for_each_online_cpu(cpu2) {
+			if (cpu1 == cpu2)
+				continue;
+			distance = cpu_distance(cpu1, cpu2);
+			max_distance = max(max_distance, distance);
+			/*
+			 * Do we have the result cached already?
+			 */
+			if (migration_cost[distance])
+				cost = migration_cost[distance];
+			else {
+				cost = measure_cacheflush_time(cpu1, cpu2);
+				migration_cost[distance] = cost;
+			}
+		}
+	}
+	/*
+	 * Second pass - update the sched domain hierarchy with
+	 * the new cache-hot-time estimations:
+	 */
+	for_each_online_cpu(cpu) {
+		distance = 0;
+		for_each_domain(cpu, sd) {
+			sd->cache_hot_time = migration_cost[distance];
+			distance++;
+		}
+	}
+	/*
+	 * Print the matrix:
+	 */
+	printk("---------------------\n");
+	printk("migration cost matrix (max_cache_size: %d, cpu: %ld MHz):\n",
+			max_cache_size,
+#ifdef CONFIG_X86
+			cpu_khz/1000
+#else
+			-1
+#endif
+		);
+	printk("---------------------\n");
+	printk("      ");
+	for_each_online_cpu(cpu1) {
+		printk("[%02d]: ", cpu1);
+		for_each_online_cpu(cpu2) {
+			if (cpu1 == cpu2) {
+				printk("    -   ");
+				continue;
+			}
+			distance = cpu_distance(cpu1, cpu2);
+			max_distance = max(max_distance, distance);
+			cost = migration_cost[distance];
+			printk(" %2ld.%ld(%ld)", (long)cost / 1000000,
+				((long)cost / 100000) % 10, distance);
+		}
+		printk("\n");
+	}
+	printk("---------------------\n");
+	printk("cacheflush times [%ld]:", max_distance+1);
+	for (distance = 0; distance <= max_distance; distance++) {
+		cost = migration_cost[distance];
+		printk(" %ld.%ld (%Ld)", (long)cost / 1000000,
+			((long)cost / 100000) % 10, cost);
+	}
+	printk("\n");
+	printk("---------------------\n");
+
+}
+
 
 #ifdef ARCH_HAS_SCHED_DOMAIN
 extern void __devinit arch_init_sched_domains(void);
@@ -4820,6 +5268,10 @@ static void __devinit arch_init_sched_do
 #endif
 		cpu_attach_domain(sd, i);
 	}
+	/*
+	 * Tune cache-hot values:
+	 */
+	calibrate_migration_costs();
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
--- linux/arch/ia64/kernel/domain.c.orig
+++ linux/arch/ia64/kernel/domain.c
@@ -358,6 +358,10 @@ next_sg:
 #endif
 		cpu_attach_domain(sd, i);
 	}
+	/*
+	 * Tune cache-hot values:
+	 */
+	calibrate_migration_costs();
 }
 
 void __devinit arch_destroy_sched_domains(void)
--- linux/arch/i386/kernel/smpboot.c.orig
+++ linux/arch/i386/kernel/smpboot.c
@@ -873,6 +873,7 @@ static void smp_tune_scheduling (void)
 			cachesize = 16; /* Pentiums, 2x8kB cache */
 			bandwidth = 100;
 		}
+		max_cache_size = cachesize * 1024;
 	}
 }
 
--- linux/include/asm-ia64/topology.h.orig
+++ linux/include/asm-ia64/topology.h
@@ -51,7 +51,6 @@ void build_cpu_to_node_map(void);
 	.max_interval		= 320,			\
 	.busy_factor		= 320,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
@@ -73,7 +72,6 @@ void build_cpu_to_node_map(void);
 	.max_interval		= 320,			\
 	.busy_factor		= 320,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/linux/topology.h.orig
+++ linux/include/linux/topology.h
@@ -86,7 +86,6 @@
 	.max_interval		= 2,			\
 	.busy_factor		= 8,			\
 	.imbalance_pct		= 110,			\
-	.cache_hot_time		= 0,			\
 	.cache_nice_tries	= 0,			\
 	.per_cpu_gain		= 25,			\
 	.flags			= SD_LOAD_BALANCE	\
@@ -112,7 +111,6 @@
 	.max_interval		= 4,			\
 	.busy_factor		= 64,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (5*1000000/2),	\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -527,7 +527,17 @@ extern cpumask_t cpu_isolated_map;
 extern void init_sched_build_groups(struct sched_group groups[],
 	                        cpumask_t span, int (*group_fn)(int cpu));
 extern void cpu_attach_domain(struct sched_domain *sd, int cpu);
+
 #endif /* ARCH_HAS_SCHED_DOMAIN */
+
+/*
+ * Maximum cache size the migration-costs auto-tuning code will
+ * search from:
+ */
+extern unsigned int max_cache_size;
+
+extern void calibrate_migration_costs(void);
+
 #endif /* CONFIG_SMP */
 
 
--- linux/include/asm-i386/topology.h.orig
+++ linux/include/asm-i386/topology.h
@@ -75,7 +75,6 @@ static inline cpumask_t pcibus_to_cpumas
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/asm-ppc64/topology.h.orig
+++ linux/include/asm-ppc64/topology.h
@@ -46,7 +46,6 @@ static inline int node_to_first_cpu(int 
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/asm-x86_64/topology.h.orig
+++ linux/include/asm-x86_64/topology.h
@@ -48,7 +48,6 @@ static inline cpumask_t __pcibus_to_cpum
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/asm-mips/mach-ip27/topology.h.orig
+++ linux/include/asm-mips/mach-ip27/topology.h
@@ -24,7 +24,6 @@ extern unsigned char __node_distances[MA
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-03 14:29         ` Ingo Molnar
@ 2005-04-03 23:15           ` Paul Jackson
  2005-04-04  1:31           ` Chen, Kenneth W
  1 sibling, 0 replies; 82+ messages in thread
From: Paul Jackson @ 2005-04-03 23:15 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: kenneth.w.chen, torvalds, nickpiggin, akpm, linux-kernel

Ingo wrote:
> how close are these numbers to the real worst-case migration costs on 
> that box? What are the cache sizes and what is their hierarchies?
>  ...
> is there any workload that shows the same scheduling related performance 
> regressions, other than Ken's $1m+ benchmark kit?

I'll have to talk to some people Monday and get back to you.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-03 14:29         ` Ingo Molnar
  2005-04-03 23:15           ` Paul Jackson
@ 2005-04-04  1:31           ` Chen, Kenneth W
  2005-04-04  6:24             ` Ingo Molnar
  1 sibling, 1 reply; 82+ messages in thread
From: Chen, Kenneth W @ 2005-04-04  1:31 UTC (permalink / raw)
  To: 'Ingo Molnar', Paul Jackson
  Cc: torvalds, nickpiggin, akpm, linux-kernel

Ingo Molnar wrote on Sunday, April 03, 2005 7:30 AM
> how close are these numbers to the real worst-case migration costs on
> that box?

I booted your latest patch on a 4-way SMP box (1.5 GHz, 9MB ia64). This
is what it produces.  I think the estimate is excellent.

[00]:     -    10.4(0) 10.4(0) 10.4(0)
[01]:  10.4(0)    -    10.4(0) 10.4(0)
[02]:  10.4(0) 10.4(0)    -    10.4(0)
[03]:  10.4(0) 10.4(0) 10.4(0)    -
---------------------
cacheflush times [1]: 10.4 (10448800)

One other minor thing: when booting a numa kernel on smp box, there is
a numa scheduler domain at the top level and cache_hot_time will be set
to 0 in that case on smp box.  Though this will be a mutt point with
recent patch from Suresh Siddha for removing the extra bogus scheduler
domains.  http://marc.theaimsgroup.com/?t=111240208000001&r=1&w=2

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-04  1:31           ` Chen, Kenneth W
@ 2005-04-04  6:24             ` Ingo Molnar
  2005-04-04  6:39               ` Ingo Molnar
  2005-04-06  0:08               ` Chen, Kenneth W
  0 siblings, 2 replies; 82+ messages in thread
From: Ingo Molnar @ 2005-04-04  6:24 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: Paul Jackson, torvalds, nickpiggin, akpm, linux-kernel

* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

> Ingo Molnar wrote on Sunday, April 03, 2005 7:30 AM
> > how close are these numbers to the real worst-case migration costs on
> > that box?
> 
> I booted your latest patch on a 4-way SMP box (1.5 GHz, 9MB ia64). This
> is what it produces.  I think the estimate is excellent.
> 
> [00]:     -    10.4(0) 10.4(0) 10.4(0)
> [01]:  10.4(0)    -    10.4(0) 10.4(0)
> [02]:  10.4(0) 10.4(0)    -    10.4(0)
> [03]:  10.4(0) 10.4(0) 10.4(0)    -
> ---------------------
> cacheflush times [1]: 10.4 (10448800)

great! How long does the benchmark take (hours?), and is there any way 
to speed up the benchmarking (without hurting accuracy), so that 
multiple migration-cost settings could be tried? Would it be possible to 
try a few other values via the migration_factor boot option, in 0.5 msec 
steps or so, to find the current sweet spot? It used to be at 11 msec 
previously, correct? E.g. migration_factor=105 will change the cost to 
10.9 msec, migration_factor=110 will change it to 11.4, etc. Or with the 
latest snapshot you can set absolute values as well, 
migration_cost=11500 sets the cost to 11.5 msec.

> One other minor thing: when booting a numa kernel on smp box, there is 
> a numa scheduler domain at the top level and cache_hot_time will be 
> set to 0 in that case on smp box.  Though this will be a mutt point 
> with recent patch from Suresh Siddha for removing the extra bogus 
> scheduler domains.  
> http://marc.theaimsgroup.com/?t=111240208000001&r=1&w=2

at first sight the dummy domain should not be a problem, the ->cache_hot 
values are only used when deciding whether a task should migrate to a 
parallel domain or not - if there's an extra highlevel domain instance 
then such decisions are never made, so a zero value makes no difference.

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-04  6:24             ` Ingo Molnar
@ 2005-04-04  6:39               ` Ingo Molnar
  2005-04-06  0:08               ` Chen, Kenneth W
  1 sibling, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2005-04-04  6:39 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: Paul Jackson, torvalds, nickpiggin, akpm, linux-kernel

* Ingo Molnar <mingo@elte.hu> wrote:

> > a numa scheduler domain at the top level and cache_hot_time will be 
> > set to 0 in that case on smp box.  Though this will be a mutt point 
> > with recent patch from Suresh Siddha for removing the extra bogus 
> > scheduler domains.  
> > http://marc.theaimsgroup.com/?t=111240208000001&r=1&w=2
> 
> at first sight the dummy domain should not be a problem, [...]

at second sight, maybe it could be a problem after all. It's safe is 
load_balance(), where task_hot() should never happen to be called for 
the dummy domain. (because the dummy domain has only one CPU group on 
such boxes)

But if the dummy domain has SD_WAKE_AFFINE set then it's being 
considered for passive migration, and a value of 0 means 'can always 
migrate', and in situations where other domains signalled 'task is too 
hot', this domain may still override the decision (incorrectly). So the 
safe value for dummy domains would a cacheflush time of 'infinity' - to 
make sure migration decisions are only done via other domains.

I've changed this in my tree - migration_cost[] is now initialized to 
-1LL, which should solve this problem.

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-04  6:24             ` Ingo Molnar
  2005-04-04  6:39               ` Ingo Molnar
@ 2005-04-06  0:08               ` Chen, Kenneth W
  1 sibling, 0 replies; 82+ messages in thread
From: Chen, Kenneth W @ 2005-04-06  0:08 UTC (permalink / raw)
  To: 'Ingo Molnar'
  Cc: Paul Jackson, torvalds, nickpiggin, akpm, linux-kernel

Ingo Molnar wrote on Sunday, April 03, 2005 11:24 PM

> great! How long does the benchmark take (hours?), and is there any way
> to speed up the benchmarking (without hurting accuracy), so that
> multiple migration-cost settings could be tried? Would it be possible to
> try a few other values via the migration_factor boot option, in 0.5 msec
> steps or so, to find the current sweet spot? It used to be at 11 msec
> previously, correct?

It take days, each experiment is 5 hours.  Previous experiments on 2.6.8
shows that the sweet spot was 12.5ms.

This time on 2.6.11, it got pushed into 16 ms.  Results comparing to 10ms:

 8 ms  -0.3%
10 ms  --
12 ms	+0.11%
16 ms +0.14%
20 ms +0.06%

12ms and up all has about 1.5% idle time.  We are not anywhere near the
limits on what the disk storage can deliver.  So there is a potential to
to tune/optimize the scheduler and reap these extra idle time.

- Ken



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-03 11:34       ` Paul Jackson
  2005-04-03 14:12         ` Paul Jackson
  2005-04-03 14:29         ` Ingo Molnar
@ 2005-04-04  4:25         ` Andy Lutomirski
  2005-04-04  4:36           ` Paul Jackson
  2 siblings, 1 reply; 82+ messages in thread
From: Andy Lutomirski @ 2005-04-04  4:25 UTC (permalink / raw)
  To: Paul Jackson; +Cc: kenneth.w.chen, torvalds, nickpiggin, akpm, linux-kernel

Paul Jackson wrote:
> Ok - that flies, or at least walks.  It took 53 seconds to
> compute this cost matrix.

Not that I really know what I'm talking about here, but this sounds 
highly parallelizable.  It seems like you could do N/2 measurements at a 
time, so this should be O(N) to compute the matrix (ignoring issues of 
how long it takes to write the data to memory, but that should be 
insignificant).

Even if you can't parallelize it all the way, it ought to at least help.

--Andy

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-04  4:25         ` Andy Lutomirski
@ 2005-04-04  4:36           ` Paul Jackson
  0 siblings, 0 replies; 82+ messages in thread
From: Paul Jackson @ 2005-04-04  4:36 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: kenneth.w.chen, torvalds, nickpiggin, akpm, linux-kernel

Andy wrote:
> Not that I really know what I'm talking about here, but this sounds 
> highly parallelizable.

I doubt it.  If we are testing the cost of a migration between CPUs
alpha and beta, and at the same time testing betweeen CPUs gamma and
delta, then often there will be some hardware that is shared by both the
<alpha, beta> path, and the <gamma, delta> path.  This would affect the
test results.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-03  7:04     ` Ingo Molnar
  2005-04-03  8:15       ` Paul Jackson
  2005-04-03 11:34       ` Paul Jackson
@ 2005-04-04  1:11       ` Chen, Kenneth W
  2005-04-04 11:37         ` Ingo Molnar
  2 siblings, 1 reply; 82+ messages in thread
From: Chen, Kenneth W @ 2005-04-04  1:11 UTC (permalink / raw)
  To: 'Ingo Molnar', Paul Jackson
  Cc: torvalds, nickpiggin, akpm, linux-kernel

Ingo Molnar wrote on Saturday, April 02, 2005 11:04 PM
> the default on ia64 (32MB) was way too large and caused the search to
> start from 64MB. That can take a _long_ time.
>
> i've attached a new patch with your changes included, and a couple of
> new things added:
>
>  - removed the 32MB max_cache_size hack from ia64 - it should now fall
>    back to the default 5MB and do a search from 10MB downwards. This
>    should speed up the search.

The cache size information on ia64 is already available at the finger tip.
Here is a patch that I whipped up to set max_cache_size for ia64.


--- linux-2.6.12-rc1/arch/ia64/kernel/setup.c.orig	2005-04-03 17:14:40.000000000 -0700
+++ linux-2.6.12-rc1/arch/ia64/kernel/setup.c	2005-04-03 17:55:46.000000000 -0700
@@ -561,6 +561,7 @@ static void
 get_max_cacheline_size (void)
 {
 	unsigned long line_size, max = 1;
+	unsigned int cache_size = 0;
 	u64 l, levels, unique_caches;
         pal_cache_config_info_t cci;
         s64 status;
@@ -585,8 +586,11 @@ get_max_cacheline_size (void)
 		line_size = 1 << cci.pcci_line_size;
 		if (line_size > max)
 			max = line_size;
+		if (cache_size < cci.pcci_cache_size)
+			cache_size = cci.pcci_cache_size;
         }
   out:
+	max_cache_size = max(max_cache_size, cache_size);
 	if (max > ia64_max_cacheline_size)
 		ia64_max_cacheline_size = max;
 }




^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-04  1:11       ` Chen, Kenneth W
@ 2005-04-04 11:37         ` Ingo Molnar
  2005-04-04 17:27           ` Paul Jackson
  2005-04-05  1:43           ` Chen, Kenneth W
  0 siblings, 2 replies; 82+ messages in thread
From: Ingo Molnar @ 2005-04-04 11:37 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: Paul Jackson, torvalds, nickpiggin, akpm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1367 bytes --]


* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

> Ingo Molnar wrote on Saturday, April 02, 2005 11:04 PM
> > the default on ia64 (32MB) was way too large and caused the search to
> > start from 64MB. That can take a _long_ time.
> >
> > i've attached a new patch with your changes included, and a couple of
> > new things added:
> >
> >  - removed the 32MB max_cache_size hack from ia64 - it should now fall
> >    back to the default 5MB and do a search from 10MB downwards. This
> >    should speed up the search.
> 
> The cache size information on ia64 is already available at the finger 
> tip. Here is a patch that I whipped up to set max_cache_size for ia64.

thanks - i've added this to my tree.

i've attached the latest snapshot. There are a number of changes in the 
patch: firstly, i changed the direction of the iteration to go from 
small sizes to larger sizes, and i added a method to detect the maximum.  

Furthermore, i tweaked the test some more to make it both faster and 
more reliable, and i cleaned up the code. (e.g. now we migrate via the 
scheduler, not via on_each_cpu().) The default patch should print enough 
debug information as-is.

I changed the workload too so potentially the detected values might be 
off from the ideal value on your box. The values detected on x86 are 
mostly unchanged, relative to previous patches.

	Ingo

[-- Attachment #2: cache-hot-autodetect.patch --]
[-- Type: text/plain, Size: 18105 bytes --]

--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -47,6 +47,7 @@
 #include <linux/syscalls.h>
 #include <linux/times.h>
 #include <linux/acct.h>
+#include <linux/vmalloc.h>
 #include <asm/tlb.h>
 
 #include <asm/unistd.h>
@@ -4640,6 +4641,478 @@ void __devinit init_sched_build_groups(s
 }
 
 
+/*
+ * Self-tuning task migration cost measurement between source and target CPUs.
+ *
+ * This is done by measuring the cost of manipulating buffers of varying
+ * sizes. For a given buffer-size here are the steps that are taken:
+ *
+ * 1) the source CPU reads+dirties a shared buffer 
+ * 2) the target CPU reads+dirties the same shared buffer
+ *
+ * We measure how long they take, in the following 4 scenarios:
+ *
+ *  - source: CPU1, target: CPU2 | cost1
+ *  - source: CPU2, target: CPU1 | cost2
+ *  - source: CPU1, target: CPU1 | cost3
+ *  - source: CPU2, target: CPU2 | cost4
+ *
+ * We then calculate the cost3+cost4-cost1-cost2 difference - this is
+ * the cost of migration.
+ *
+ * We then start off from a small buffer-size and iterate up to larger
+ * buffer sizes, in 5% steps - measuring each buffer-size separately, and
+ * doing a maximum search for the cost. (The maximum cost for a migration
+ * normally occurs when the working set size is around the effective cache
+ * size.)
+ */
+#define SEARCH_SCOPE		2
+#define MIN_CACHE_SIZE		(64*1024U)
+#define DEFAULT_CACHE_SIZE	(5*1024*1024U)
+#define ITERATIONS		3
+#define SIZE_THRESH		130
+#define COST_THRESH		130
+
+/*
+ * The migration cost is a function of 'domain distance'. Domain
+ * distance is the number of steps a CPU has to iterate down its
+ * domain tree to share a domain with the other CPU. The farther
+ * two CPUs are from each other, the larger the distance gets.
+ *
+ * Note that we use the distance only to cache measurement results,
+ * the distance value is not used numerically otherwise. When two
+ * CPUs have the same distance it is assumed that the migration
+ * cost is the same. (this is a simplification but quite practical)
+ */
+#define MAX_DOMAIN_DISTANCE 32
+
+static __initdata unsigned long long migration_cost[MAX_DOMAIN_DISTANCE] =
+		{ -1LL , };
+
+/*
+ * Allow override of migration cost - in units of microseconds.
+ * E.g. migration_cost=1000,2000,3000 will set up a level-1 cost
+ * of 1 msec, level-2 cost of 2 msecs and level3 cost of 3 msecs:
+ */
+static int __init migration_cost_setup(char *str)
+{
+	int ints[MAX_DOMAIN_DISTANCE+1], i;
+
+	str = get_options(str, ARRAY_SIZE(ints), ints);
+
+	printk("#ints: %d\n", ints[0]);
+	for (i = 1; i <= ints[0]; i++) {
+		migration_cost[i-1] = (unsigned long long)ints[i]*1000;
+		printk("migration_cost[%d]: %Ld\n", i-1, migration_cost[i-1]);
+	}
+	return 1;
+}
+
+__setup ("migration_cost=", migration_cost_setup);
+
+/*
+ * Global multiplier (divisor) for migration-cutoff values,
+ * in percentiles. E.g. use a value of 150 to get 1.5 times
+ * longer cache-hot cutoff times.
+ *
+ * (We scale it from 100 to 128 to long long handling easier.)
+ */
+
+#define MIGRATION_FACTOR_SCALE 128
+
+static __initdata unsigned int migration_factor = MIGRATION_FACTOR_SCALE;
+
+static int __init setup_migration_factor(char *str)
+{
+	get_option(&str, &migration_factor);
+	migration_factor = migration_factor * MIGRATION_FACTOR_SCALE / 100;
+	return 1;
+}
+
+__setup("migration_factor=", setup_migration_factor);
+
+/*
+ * Estimated distance of two CPUs, measured via the number of domains
+ * we have to pass for the two CPUs to be in the same span:
+ */
+__init static unsigned long domain_distance(int cpu1, int cpu2)
+{
+	unsigned long distance = 0;
+	struct sched_domain *sd;
+
+	for_each_domain(cpu1, sd) {
+		WARN_ON(!cpu_isset(cpu1, sd->span));
+		if (cpu_isset(cpu2, sd->span))
+			return distance;
+		distance++;
+	}
+	if (distance >= MAX_DOMAIN_DISTANCE) {
+		WARN_ON(1);
+		distance = MAX_DOMAIN_DISTANCE-1;
+	}
+
+	return distance;
+}
+
+static __initdata unsigned int migration_debug = 1;
+
+static int __init setup_migration_debug(char *str)
+{
+	get_option(&str, &migration_debug);
+	return 1;
+}
+
+__setup("migration_debug=", setup_migration_debug);
+
+/*
+ * Maximum cache-size that the scheduler should try to measure.
+ * Architectures with larger caches should tune this up during
+ * bootup. Gets used in the domain-setup code (i.e. during SMP
+ * bootup).
+ */
+__initdata unsigned int max_cache_size;
+
+static int __init setup_max_cache_size(char *str)
+{
+	get_option(&str, &max_cache_size);
+	return 1;
+}
+
+__setup("max_cache_size=", setup_max_cache_size);
+
+/*
+ * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This
+ * is the operation that is timed, so we try to generate unpredictable
+ * cachemisses that still end up filling the L2 cache:
+ */
+__init static void touch_cache(void *__cache, unsigned long __size)
+{
+	unsigned long size = __size/sizeof(long), chunk1 = size/3,
+			chunk2 = 2*size/3;
+	unsigned long *cache = __cache;
+	int i;
+
+	for (i = 0; i < size/6; i += 8) {
+		switch (i % 6) {
+			case 0: cache[i]++;
+			case 1: cache[size-1-i]++;
+			case 2: cache[chunk1-i]++;
+			case 3: cache[chunk1+i]++;
+			case 4: cache[chunk2-i]++;
+			case 5: cache[chunk2+i]++;
+		}
+	}
+}
+
+/*
+ * Measure the cache-cost of one task migration. Returns in units of nsec.
+ */
+__init static unsigned long long measure_one(void *cache, unsigned long size,
+					     int source, int target)
+{
+	cpumask_t mask, saved_mask;
+	unsigned long long t0, t1, t2, t3, cost;
+
+	saved_mask = current->cpus_allowed;
+
+	/*
+	 * Migrate to the source CPU:
+	 */
+	mask = cpumask_of_cpu(source);
+	set_cpus_allowed(current, mask);
+	WARN_ON(smp_processor_id() != source);
+
+	/*
+	 * Dirty the working set:
+	 */
+	t0 = sched_clock();
+	touch_cache(cache, size);
+	t1 = sched_clock();
+
+	/*
+	 * Migrate to the target CPU, dirty the L2 cache and access
+	 * the shared buffer. (which represents the working set
+	 * of a migrated task.)
+	 */
+	mask = cpumask_of_cpu(target);
+	set_cpus_allowed(current, mask);
+	WARN_ON(smp_processor_id() != target);
+
+	t2 = sched_clock();
+	touch_cache(cache, size);
+	t3 = sched_clock();
+
+	cost = t2-t1 + t3-t2;
+
+	if (migration_debug >= 2)
+		printk("[%d->%d]: %8Ld %8Ld %8Ld => %10Ld.\n",
+			source, target, t1-t0, t2-t1, t3-t2, cost);
+
+	set_cpus_allowed(current, saved_mask);
+
+	return cost;
+}
+
+/*
+ * Measure a series of task migrations and return the average
+ * result. Since this code runs early during bootup the system
+ * is 'undisturbed' and the average latency makes sense.
+ *
+ * The algorithm in essence auto-detects the relevant cache-size,
+ * so it will properly detect different cachesizes for different
+ * cache-hierarchies, depending on how the CPUs are connected.
+ *
+ * Architectures can prime the upper limit of the search range via
+ * max_cache_size, otherwise the search range defaults to 20MB...64K.
+ */
+__init static unsigned long long
+measure_cost(int cpu1, int cpu2, void *cache, unsigned int size)
+{
+	unsigned long long cost1, cost2;
+	int i;
+
+	/*
+	 * Measure the migration cost of 'size' bytes, over an
+	 * average of 10 runs:
+	 *
+	 * (We perturb the cache size by a small (0..4k)
+	 *  value to compensate size/alignment related artifacts.
+	 *  We also subtract the cost of the operation done on
+	 *  the same CPU.)
+	 */
+	cost1 = 0;
+
+	/*
+	 * dry run, to make sure we start off cache-cold on cpu1,
+	 * and to get any vmalloc pagefaults in advance:
+	 */
+	measure_one(cache, size, cpu1, cpu2);
+	for (i = 0; i < ITERATIONS; i++)
+		cost1 += measure_one(cache, size - i*1024, cpu1, cpu2);
+
+	measure_one(cache, size, cpu2, cpu1);
+	for (i = 0; i < ITERATIONS; i++)
+		cost1 += measure_one(cache, size - i*1024, cpu2, cpu1);
+
+	/*
+	 * (We measure the non-migrating [cached] cost on both
+	 *  cpu1 and cpu2, to handle CPUs with different speeds)
+	 */
+	cost2 = 0;
+
+	measure_one(cache, size, cpu1, cpu1);
+	for (i = 0; i < ITERATIONS; i++)
+		cost2 += measure_one(cache, size - i*1024, cpu1, cpu1);
+
+	measure_one(cache, size, cpu2, cpu2);
+	for (i = 0; i < ITERATIONS; i++)
+		cost2 += measure_one(cache, size - i*1024, cpu2, cpu2);
+
+	/*
+	 * Get the per-iteration migration cost:
+	 */
+	do_div(cost1, 2*ITERATIONS);
+	do_div(cost2, 2*ITERATIONS);
+
+	return cost1 - cost2;
+}
+
+__init static unsigned long long measure_migration_cost(int cpu1, int cpu2)
+{
+	unsigned long long max_cost = 0, fluct = 0, avg_fluct = 0;
+	unsigned int max_size, size, size_found = 0;
+	long long cost = 0, prev_cost;
+	void *cache;
+
+	/*
+	 * Search from max_cache_size*5 down to 64K - the real relevant
+	 * cachesize has to lie somewhere inbetween.
+	 */
+	if (max_cache_size) {
+		max_size = max(max_cache_size * SEARCH_SCOPE, MIN_CACHE_SIZE);
+		size = max(max_cache_size / SEARCH_SCOPE, MIN_CACHE_SIZE);
+	} else {
+		/*
+		 * Since we have no estimation about the relevant
+		 * search range
+		 */
+		max_size = DEFAULT_CACHE_SIZE * SEARCH_SCOPE;
+		size = MIN_CACHE_SIZE;
+	}
+		
+	if (!cpu_online(cpu1) || !cpu_online(cpu2)) {
+		printk("cpu %d and %d not both online!\n", cpu1, cpu2);
+		return 0;
+	}
+
+	/*
+	 * Allocate the working set:
+	 */
+	cache = vmalloc(max_size);
+	if (!cache) {
+		printk("could not vmalloc %d bytes for cache!\n", 2*max_size);
+		return 1000000; // return 1 msec on very small boxen
+	}
+
+	while (size <= max_size) {
+		prev_cost = cost;
+		cost = measure_cost(cpu1, cpu2, cache, size);
+
+		/*
+		 * Update the max:
+		 */
+		if (cost > 0) {
+			if (max_cost < cost) {
+				max_cost = cost;
+				size_found = size;
+			}
+		}
+		/*
+		 * Calculate average fluctuation, we use this to prevent
+		 * noise from triggering an early break out of the loop:
+		 */
+		fluct = abs(cost - prev_cost);
+		avg_fluct = (avg_fluct + fluct)/2;
+
+		if (migration_debug)
+			printk("-> [%d][%d][%7d] %3ld.%ld [%3ld.%ld] (%ld): (%8Ld %8Ld)\n",
+				cpu1, cpu2, size,
+				(long)cost / 1000000,
+				((long)cost / 100000) % 10,
+				(long)max_cost / 1000000,
+				((long)max_cost / 100000) % 10,
+				domain_distance(cpu1, cpu2),
+				cost, avg_fluct);
+
+		/*
+		 * If we iterated at least 20% past the previous maximum,
+		 * and the cost has dropped by more than 20% already,
+		 * (taking fluctuations into account) then we assume to
+		 * have found the maximum and break out of the loop early:
+		 */
+		if (size_found && (size*100 > size_found*SIZE_THRESH))
+			if (cost+avg_fluct <= 0 ||
+				max_cost*100 > (cost+avg_fluct)*COST_THRESH) {
+
+				if (migration_debug)
+					printk("-> found max.\n");
+				break;
+			}
+		/*
+		 * Increase the cachesize in 5% steps:
+		 */
+		size = size * 20 / 19;
+	}
+
+	if (migration_debug)
+		printk("[%d][%d] working set size found: %d, cost: %Ld\n",
+			cpu1, cpu2, size_found, max_cost);
+
+	vfree(cache);
+
+	/*
+	 * A task is considered 'cache cold' if at least 2 times
+	 * the worst-case cost of migration has passed.
+	 *
+	 * (this limit is only listened to if the load-balancing
+	 * situation is 'nice' - if there is a large imbalance we
+	 * ignore it for the sake of CPU utilization and
+	 * processing fairness.)
+	 */
+	return 2 * max_cost * migration_factor / MIGRATION_FACTOR_SCALE;
+}
+
+void __devinit calibrate_migration_costs(void)
+{
+	int cpu1 = -1, cpu2 = -1, cpu;
+	struct sched_domain *sd;
+	unsigned long distance, max_distance = 0;
+	unsigned long long cost;
+	unsigned long flags, j0, j1;
+
+	local_irq_save(flags);
+	local_irq_enable();
+	j0 = jiffies;
+
+	/*
+	 * First pass - calculate the cacheflush times:
+	 */
+	for_each_online_cpu(cpu1) {
+		for_each_online_cpu(cpu2) {
+			if (cpu1 == cpu2)
+				continue;
+			distance = domain_distance(cpu1, cpu2);
+			max_distance = max(max_distance, distance);
+			/*
+			 * Do we have the result cached already?
+			 */
+			if (migration_cost[distance])
+				cost = migration_cost[distance];
+			else {
+				cost = measure_migration_cost(cpu1, cpu2);
+				migration_cost[distance] = cost;
+			}
+		}
+	}
+	/*
+	 * Second pass - update the sched domain hierarchy with
+	 * the new cache-hot-time estimations:
+	 */
+	for_each_online_cpu(cpu) {
+		distance = 0;
+		for_each_domain(cpu, sd) {
+			sd->cache_hot_time = migration_cost[distance];
+			distance++;
+		}
+	}
+	/*
+	 * Print the matrix:
+	 */
+	printk("---------------------\n");
+	printk("| migration cost matrix (max_cache_size: %d, cpu: %ld MHz):\n",
+			max_cache_size,
+#ifdef CONFIG_X86
+			cpu_khz/1000
+#else
+			-1L
+#endif
+		);
+	printk("---------------------\n");
+	printk("      ");
+	for_each_online_cpu(cpu1)
+		printk("    [%02d]", cpu1);
+	printk("\n");
+	for_each_online_cpu(cpu1) {
+		printk("[%02d]: ", cpu1);
+		for_each_online_cpu(cpu2) {
+			if (cpu1 == cpu2) {
+				printk("    -   ");
+				continue;
+			}
+			distance = domain_distance(cpu1, cpu2);
+			max_distance = max(max_distance, distance);
+			cost = migration_cost[distance];
+			printk(" %2ld.%ld(%ld)", (long)cost / 1000000,
+				((long)cost / 100000) % 10, distance);
+		}
+		printk("\n");
+	}
+	printk("--------------------------------\n");
+	printk("| cacheflush times [%ld]:", max_distance+1);
+	for (distance = 0; distance <= max_distance; distance++) {
+		cost = migration_cost[distance];
+		printk(" %ld.%ld (%Ld)", (long)cost / 1000000,
+			((long)cost / 100000) % 10, cost);
+	}
+	printk("\n");
+	j1 = jiffies;
+	printk("| calibration delay: %ld seconds\n", (j1-j0)/HZ);
+	printk("--------------------------------\n");
+
+	local_irq_restore(flags);
+}
+
+
 #ifdef ARCH_HAS_SCHED_DOMAIN
 extern void __devinit arch_init_sched_domains(void);
 extern void __devinit arch_destroy_sched_domains(void);
@@ -4820,6 +5293,10 @@ static void __devinit arch_init_sched_do
 #endif
 		cpu_attach_domain(sd, i);
 	}
+	/*
+	 * Tune cache-hot values:
+	 */
+	calibrate_migration_costs();
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
--- linux/arch/ia64/kernel/domain.c.orig
+++ linux/arch/ia64/kernel/domain.c
@@ -358,6 +358,10 @@ next_sg:
 #endif
 		cpu_attach_domain(sd, i);
 	}
+	/*
+	 * Tune cache-hot values:
+	 */
+	calibrate_migration_costs();
 }
 
 void __devinit arch_destroy_sched_domains(void)
--- linux/arch/ia64/kernel/setup.c.orig
+++ linux/arch/ia64/kernel/setup.c
@@ -561,6 +561,7 @@ static void
 get_max_cacheline_size (void)
 {
 	unsigned long line_size, max = 1;
+	unsigned int cache_size = 0;
 	u64 l, levels, unique_caches;
         pal_cache_config_info_t cci;
         s64 status;
@@ -585,8 +586,11 @@ get_max_cacheline_size (void)
 		line_size = 1 << cci.pcci_line_size;
 		if (line_size > max)
 			max = line_size;
+		if (cache_size < cci.pcci_cache_size)
+			cache_size = cci.pcci_cache_size;
         }
   out:
+	max_cache_size = max(max_cache_size, cache_size);
 	if (max > ia64_max_cacheline_size)
 		ia64_max_cacheline_size = max;
 }
--- linux/arch/i386/kernel/smpboot.c.orig
+++ linux/arch/i386/kernel/smpboot.c
@@ -873,6 +873,7 @@ static void smp_tune_scheduling (void)
 			cachesize = 16; /* Pentiums, 2x8kB cache */
 			bandwidth = 100;
 		}
+		max_cache_size = cachesize * 1024;
 	}
 }
 
--- linux/include/asm-ia64/topology.h.orig
+++ linux/include/asm-ia64/topology.h
@@ -51,7 +51,6 @@ void build_cpu_to_node_map(void);
 	.max_interval		= 320,			\
 	.busy_factor		= 320,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
@@ -73,7 +72,6 @@ void build_cpu_to_node_map(void);
 	.max_interval		= 320,			\
 	.busy_factor		= 320,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/linux/topology.h.orig
+++ linux/include/linux/topology.h
@@ -86,7 +86,6 @@
 	.max_interval		= 2,			\
 	.busy_factor		= 8,			\
 	.imbalance_pct		= 110,			\
-	.cache_hot_time		= 0,			\
 	.cache_nice_tries	= 0,			\
 	.per_cpu_gain		= 25,			\
 	.flags			= SD_LOAD_BALANCE	\
@@ -112,7 +111,6 @@
 	.max_interval		= 4,			\
 	.busy_factor		= 64,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (5*1000000/2),	\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -527,7 +527,17 @@ extern cpumask_t cpu_isolated_map;
 extern void init_sched_build_groups(struct sched_group groups[],
 	                        cpumask_t span, int (*group_fn)(int cpu));
 extern void cpu_attach_domain(struct sched_domain *sd, int cpu);
+
 #endif /* ARCH_HAS_SCHED_DOMAIN */
+
+/*
+ * Maximum cache size the migration-costs auto-tuning code will
+ * search from:
+ */
+extern unsigned int max_cache_size;
+
+extern void calibrate_migration_costs(void);
+
 #endif /* CONFIG_SMP */
 
 
--- linux/include/asm-i386/topology.h.orig
+++ linux/include/asm-i386/topology.h
@@ -75,7 +75,6 @@ static inline cpumask_t pcibus_to_cpumas
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/asm-ppc64/topology.h.orig
+++ linux/include/asm-ppc64/topology.h
@@ -46,7 +46,6 @@ static inline int node_to_first_cpu(int 
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/asm-x86_64/topology.h.orig
+++ linux/include/asm-x86_64/topology.h
@@ -48,7 +48,6 @@ static inline cpumask_t __pcibus_to_cpum
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/asm-mips/mach-ip27/topology.h.orig
+++ linux/include/asm-mips/mach-ip27/topology.h
@@ -24,7 +24,6 @@ extern unsigned char __node_distances[MA
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-04 11:37         ` Ingo Molnar
@ 2005-04-04 17:27           ` Paul Jackson
  2005-04-05  1:43           ` Chen, Kenneth W
  1 sibling, 0 replies; 82+ messages in thread
From: Paul Jackson @ 2005-04-04 17:27 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: kenneth.w.chen, torvalds, nickpiggin, akpm, linux-kernel

Ingo wrote:
> i've attached the latest snapshot.

I ran your latest snapshot on 64 CPU (well, 62 - one node wasn't
working) system.  I made one change - chop the matrix lines at 8 terms. 
It's a hack - don't know if it's a good idea.  But the long lines were
hard to read (and would only get worse on a 512).  And I had a fear,
probably unfounded, that the long lines could slow things down.

It built and ran fine, exactly as provided, against 2.6.12-rc1-mm4. I
probably have the unchopped matrix output in my screenlog file, if you
want it.  Though, given that the matrix is more or less symmetric, I
wasn't seeing much value in the part I chopped.

It took 24 seconds - a little painful, but booting this system takes
a few minutes, so 24 seconds is not fatal - just painful.

The maximum finding code - to stop scanning after the max has been
passed, works fine.  If it had been (impossibly) perfect, stopping right
at the max, it would have been perhaps 30% faster, so there is not a
huge amount to be gained from trying to fine tune the scan termination
logic.

I can imagine that one could trim this time by doing a couple of scans,
the first one at lower density (perhaps just one out of four sizes
considered), then the second scan at full density, around the maximum
found by the first.  However this would be less robust, and yet more
logic.

Or perhaps, long shot, one could get fancy with some parameterized curve
fitting.  If some equation is a reasonably fit for the function being
sampled here, then just a low density scan through the max could be used
to estimate the co-efficients of whatever the equation was, and the
equation used to find the maximum, instead of the samples.  This would
be fun to play with, but I can't now - other duties are calling.

The one change:

diff -Naurp auto-tune_migration_costs/kernel/sched.c auto-tune_migration_costs_chopped/kernel/sched.c
--- auto-tune_migration_costs/kernel/sched.c	2005-04-04 09:11:43.000000000 -0700
+++ auto-tune_migration_costs_chopped/kernel/sched.c	2005-04-04 09:11:22.000000000 -0700
@@ -5287,6 +5287,7 @@ void __devinit calibrate_migration_costs
 			distance = domain_distance(cpu1, cpu2);
 			max_distance = max(max_distance, distance);
 			cost = migration_cost[distance];
+			if (cpu2 < 8)
 			printk(" %2ld.%ld(%ld)", (long)cost / 1000000,
 				((long)cost / 100000) % 10, distance);
 		}

With this change, the output was:

Memory: 243350592k/244270096k available (7182k code, 921216k reserved, 3776k data, 368k init)
McKinley Errata 9 workaround not needed; disabling it
Dentry cache hash table entries: 33554432 (order: 14, 268435456 bytes)
Inode-cache hash table entries: 16777216 (order: 13, 134217728 bytes)
Mount-cache hash table entries: 1024
Boot processor id 0x0/0x40
Brought up 62 CPUs
Total of 62 processors activated (138340.68 BogoMIPS).
-> [0][2][3145728]  12.3 [ 12.3] (1): (12361880  6180940)
-> [0][2][3311292]  13.1 [ 13.1] (1): (13175591  3497325)
-> [0][2][3485570]  13.7 [ 13.7] (1): (13718647  2020190)
-> [0][2][3669021]  14.3 [ 14.3] (1): (14356800  1329171)
-> [0][2][3862127]  15.5 [ 15.5] (1): (15522156  1247263)
-> [0][2][4065396]  16.4 [ 16.4] (1): (16487934  1106520)
-> [0][2][4279364]  17.3 [ 17.3] (1): (17356154   987370)
-> [0][2][4504593]  18.1 [ 18.1] (1): (18144452   887834)
-> [0][2][4741676]  18.9 [ 18.9] (1): (18934638   839010)
-> [0][2][4991237]  19.9 [ 19.9] (1): (19965884   935128)
-> [0][2][5253933]  21.0 [ 21.0] (1): (21067441  1018342)
-> [0][2][5530455]  22.3 [ 22.3] (1): (22303727  1127314)
-> [0][2][5821531]  23.4 [ 23.4] (1): (23453867  1138727)
-> [0][2][6127927]  23.4 [ 23.4] (1): (23406625   592984)
-> [0][2][6450449]  23.5 [ 23.5] (1): (23586123   386241)
-> [0][2][6789946]  23.5 [ 23.5] (1): (23519823   226270)
-> [0][2][7147311]  22.6 [ 23.5] (1): (22619385   563354)
-> [0][2][7523485]  21.9 [ 23.5] (1): (21998024   592357)
-> [0][2][7919457]  20.7 [ 23.5] (1): (20705771   942305)
-> [0][2][8336270]  17.2 [ 23.5] (1): (17244361  2201857)
-> [0][2][8775021]  14.6 [ 23.5] (1): (14644331  2400943)
-> found max.
[0][2] working set size found: 6450449, cost: 23586123
-> [0][32][3145728]  17.8 [ 17.8] (2): (17848927  8924463)
-> [0][32][3311292]  18.8 [ 18.8] (2): (18811236  4943386)
-> [0][32][3485570]  19.7 [ 19.7] (2): (19779337  2955743)
-> [0][32][3669021]  20.8 [ 20.8] (2): (20811634  1994020)
-> [0][32][3862127]  21.9 [ 21.9] (2): (21919806  1551096)
-> [0][32][4065396]  23.0 [ 23.0] (2): (23075814  1353552)
-> [0][32][4279364]  24.2 [ 24.2] (2): (24267691  1272714)
-> [0][32][4504593]  25.5 [ 25.5] (2): (25546809  1275916)
-> [0][32][4741676]  26.8 [ 26.8] (2): (26886375  1307741)
-> [0][32][4991237]  28.2 [ 28.2] (2): (28291601  1356483)
-> [0][32][5253933]  29.5 [ 29.5] (2): (29587239  1326060)
-> [0][32][5530455]  30.6 [ 30.6] (2): (30669228  1204024)
-> [0][32][5821531]  30.9 [ 30.9] (2): (30969069   751932)
-> [0][32][6127927]  30.3 [ 30.9] (2): (30353322   683839)
-> [0][32][6450449]  29.3 [ 30.9] (2): (29381521   827820)
-> [0][32][6789946]  27.4 [ 30.9] (2): (27459958  1374691)
-> [0][32][7147311]  26.4 [ 30.9] (2): (26403308  1215670)
-> [0][32][7523485]  23.9 [ 30.9] (2): (23967782  1825598)
-> [0][32][7919457]  19.4 [ 30.9] (2): (19483305  3155037)
-> found max.
[0][32] working set size found: 5821531, cost: 30969069
---------------------
| migration cost matrix (max_cache_size: 6291456, cpu: -1 MHz):
---------------------
          [00]    [01]    [02]    [03]    [04]    [05]    [06]    [07]    [08]    [09]    [10]    [11]    [12]    [13]    [14]    [15]    [16]    [17]    [18]    [19]    [20]    [21]    [22]    [23]    [24]    [25]    [26]    [27]    [28]    [29]    [30]    [31]    [32]    [33]    [34]    [35]    [36]    [37]    [38]    [39]    [40]    [41]    [42]    [43]    [44]    [45]    [46]    [47]    [48]    [49]    [50]    [51]    [52]    [53]    [54]    [55]    [56]    [57]    [58]    [59]    [60]    [61]
[00]:     -     0.0(0) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)
[01]:   0.0(0)    -    47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)
[02]:  47.1(1) 47.1(1)    -     0.0(0) 47.1(1) 47.1(1) 47.1(1) 47.1(1)
[03]:  47.1(1) 47.1(1)  0.0(0)    -    47.1(1) 47.1(1) 47.1(1) 47.1(1)
[04]:  47.1(1) 47.1(1) 47.1(1) 47.1(1)    -     0.0(0) 47.1(1) 47.1(1)
[05]:  47.1(1) 47.1(1) 47.1(1) 47.1(1)  0.0(0)    -    47.1(1) 47.1(1)
[06]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -     0.0(0)
[07]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)  0.0(0)    -
[08]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -
[09]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -
[10]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -
[11]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -
[12]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -
[13]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -
[14]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -
[15]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -
[16]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -
[17]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -
[18]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -
[19]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -
[20]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -
[21]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -
[22]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -
[23]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -
[24]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -
[25]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -
[26]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -
[27]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -
[28]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -
[29]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -
[30]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -
[31]:  47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1) 47.1(1)    -
[32]:  47.1(1) 47.1(1) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2)    -
[33]:  47.1(1) 47.1(1) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2)    -
[34]:  61.9(2) 61.9(2) 47.1(1) 47.1(1) 61.9(2) 61.9(2) 47.1(1) 47.1(1)    -
[35]:  61.9(2) 61.9(2) 47.1(1) 47.1(1) 61.9(2) 61.9(2) 47.1(1) 47.1(1)    -
[36]:  47.1(1) 47.1(1) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2)    -
[37]:  47.1(1) 47.1(1) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2)    -
[38]:  61.9(2) 61.9(2) 47.1(1) 47.1(1) 61.9(2) 61.9(2) 47.1(1) 47.1(1)    -
[39]:  61.9(2) 61.9(2) 47.1(1) 47.1(1) 61.9(2) 61.9(2) 47.1(1) 47.1(1)    -
[40]:  47.1(1) 47.1(1) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2)    -
[41]:  47.1(1) 47.1(1) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2)    -
[42]:  61.9(2) 61.9(2) 47.1(1) 47.1(1) 61.9(2) 61.9(2) 47.1(1) 47.1(1)    -
[43]:  61.9(2) 61.9(2) 47.1(1) 47.1(1) 61.9(2) 61.9(2) 47.1(1) 47.1(1)    -
[44]:  47.1(1) 47.1(1) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2)    -
[45]:  47.1(1) 47.1(1) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2)    -
[46]:  61.9(2) 61.9(2) 47.1(1) 47.1(1) 61.9(2) 61.9(2) 47.1(1) 47.1(1)    -
[47]:  61.9(2) 61.9(2) 47.1(1) 47.1(1) 61.9(2) 61.9(2) 47.1(1) 47.1(1)    -
[48]:  47.1(1) 47.1(1) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2)    -
[49]:  47.1(1) 47.1(1) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2)    -
[50]:  47.1(1) 47.1(1) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2)    -
[51]:  47.1(1) 47.1(1) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2)    -
[52]:  61.9(2) 61.9(2) 47.1(1) 47.1(1) 61.9(2) 61.9(2) 61.9(2) 61.9(2)    -
[53]:  61.9(2) 61.9(2) 47.1(1) 47.1(1) 61.9(2) 61.9(2) 61.9(2) 61.9(2)    -
[54]:  47.1(1) 47.1(1) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2)    -
[55]:  47.1(1) 47.1(1) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2)    -
[56]:  61.9(2) 61.9(2) 47.1(1) 47.1(1) 61.9(2) 61.9(2) 61.9(2) 61.9(2)    -
[57]:  61.9(2) 61.9(2) 47.1(1) 47.1(1) 61.9(2) 61.9(2) 61.9(2) 61.9(2)    -
[58]:  47.1(1) 47.1(1) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2)    -
[59]:  47.1(1) 47.1(1) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2) 61.9(2)    -
[60]:  61.9(2) 61.9(2) 47.1(1) 47.1(1) 61.9(2) 61.9(2) 61.9(2) 61.9(2)    -
[61]:  61.9(2) 61.9(2) 47.1(1) 47.1(1) 61.9(2) 61.9(2) 61.9(2) 61.9(2)    -
--------------------------------
| cacheflush times [3]: 0.0 (-1) 47.1 (47172246) 61.9 (61938138)
| calibration delay: 24 seconds
--------------------------------

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-04 11:37         ` Ingo Molnar
  2005-04-04 17:27           ` Paul Jackson
@ 2005-04-05  1:43           ` Chen, Kenneth W
  2005-04-05  1:49             ` Ingo Molnar
  1 sibling, 1 reply; 82+ messages in thread
From: Chen, Kenneth W @ 2005-04-05  1:43 UTC (permalink / raw)
  To: 'Ingo Molnar'
  Cc: Paul Jackson, torvalds, nickpiggin, akpm, linux-kernel

* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:
> The cache size information on ia64 is already available at the finger
> tip. Here is a patch that I whipped up to set max_cache_size for ia64.

Ingo Molnar wrote on Monday, April 04, 2005 4:38 AM
> thanks - i've added this to my tree.
>
> i've attached the latest snapshot. There are a number of changes in the
> patch: firstly, i changed the direction of the iteration to go from
> small sizes to larger sizes, and i added a method to detect the maximum.
>
> Furthermore, i tweaked the test some more to make it both faster and
> more reliable, and i cleaned up the code. (e.g. now we migrate via the
> scheduler, not via on_each_cpu().) The default patch should print enough
> debug information as-is.
>
> I changed the workload too so potentially the detected values might be
> off from the ideal value on your box. The values detected on x86 are
> mostly unchanged, relative to previous patches.

Perhaps, I'm not getting the latest patch?  It skipped measuring because
migration cost array is non-zero (initialized to -1LL):

        [00]    [01]    [02]    [03]
[00]:     -     0.0(0)  0.0(0)  0.0(0)
[01]:   0.0(0)    -     0.0(0)  0.0(0)
[02]:   0.0(0)  0.0(0)    -     0.0(0)
[03]:   0.0(0)  0.0(0)  0.0(0)    -
--------------------------------
| cacheflush times [1]: 0.0 (-1)
| calibration delay: 0 seconds
--------------------------------


Need this change?  I bet you had that in your tree already.

--- ./kernel/sched.c.orig	2005-04-04 18:01:45.000000000 -0700
+++ ./kernel/sched.c	2005-04-04 18:21:41.000000000 -0700
@@ -5050,7 +5050,7 @@ void __devinit calibrate_migration_costs
 			/*
 			 * Do we have the result cached already?
 			 */
-			if (migration_cost[distance])
+			if (migration_cost[distance] != -1LL)
 				cost = migration_cost[distance];
 			else {
 				cost = measure_migration_cost(cpu1, cpu2);



Also, the cost calculation in measure_one() looks fishy to me in this version.

> +	/*
> +	 * Dirty the working set:
> +	 */
> +	t0 = sched_clock();
> +	touch_cache(cache, size);
> +	t1 = sched_clock();
> +
> +	/*
> +	 * Migrate to the target CPU, dirty the L2 cache and access
> +	 * the shared buffer. (which represents the working set
> +	 * of a migrated task.)
> +	 */
> +	mask = cpumask_of_cpu(target);
> +	set_cpus_allowed(current, mask);
> +	WARN_ON(smp_processor_id() != target);
> +
> +	t2 = sched_clock();
> +	touch_cache(cache, size);
> +	t3 = sched_clock();
> +
> +	cost = t2-t1 + t3-t2;

Typo here ??



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-05  1:43           ` Chen, Kenneth W
@ 2005-04-05  1:49             ` Ingo Molnar
  2005-04-05  3:04               ` Ingo Molnar
  0 siblings, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2005-04-05  1:49 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: Paul Jackson, torvalds, nickpiggin, akpm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 599 bytes --]


* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

> Perhaps, I'm not getting the latest patch?  It skipped measuring 
> because migration cost array is non-zero (initialized to -1LL):

yeah ... some mixup here. I've attached the latest.

> Also, the cost calculation in measure_one() looks fishy to me in this 
> version.

> > +	t0 = sched_clock();
> > +	touch_cache(cache, size);
> > +	t1 = sched_clock();

> > +	t2 = sched_clock();
> > +	touch_cache(cache, size);
> > +	t3 = sched_clock();

> > +	cost = t2-t1 + t3-t2;
> 
> Typo here ??

yeah - fixed this too in the attached snapshot.

	Ingo

[-- Attachment #2: cache-hot-autodetect.patch --]
[-- Type: text/plain, Size: 18113 bytes --]

--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -47,6 +47,7 @@
 #include <linux/syscalls.h>
 #include <linux/times.h>
 #include <linux/acct.h>
+#include <linux/vmalloc.h>
 #include <asm/tlb.h>
 
 #include <asm/unistd.h>
@@ -4640,6 +4641,478 @@ void __devinit init_sched_build_groups(s
 }
 
 
+/*
+ * Self-tuning task migration cost measurement between source and target CPUs.
+ *
+ * This is done by measuring the cost of manipulating buffers of varying
+ * sizes. For a given buffer-size here are the steps that are taken:
+ *
+ * 1) the source CPU reads+dirties a shared buffer 
+ * 2) the target CPU reads+dirties the same shared buffer
+ *
+ * We measure how long they take, in the following 4 scenarios:
+ *
+ *  - source: CPU1, target: CPU2 | cost1
+ *  - source: CPU2, target: CPU1 | cost2
+ *  - source: CPU1, target: CPU1 | cost3
+ *  - source: CPU2, target: CPU2 | cost4
+ *
+ * We then calculate the cost3+cost4-cost1-cost2 difference - this is
+ * the cost of migration.
+ *
+ * We then start off from a small buffer-size and iterate up to larger
+ * buffer sizes, in 5% steps - measuring each buffer-size separately, and
+ * doing a maximum search for the cost. (The maximum cost for a migration
+ * normally occurs when the working set size is around the effective cache
+ * size.)
+ */
+#define SEARCH_SCOPE		2
+#define MIN_CACHE_SIZE		(64*1024U)
+#define DEFAULT_CACHE_SIZE	(5*1024*1024U)
+#define ITERATIONS		3
+#define SIZE_THRESH		130
+#define COST_THRESH		130
+
+/*
+ * The migration cost is a function of 'domain distance'. Domain
+ * distance is the number of steps a CPU has to iterate down its
+ * domain tree to share a domain with the other CPU. The farther
+ * two CPUs are from each other, the larger the distance gets.
+ *
+ * Note that we use the distance only to cache measurement results,
+ * the distance value is not used numerically otherwise. When two
+ * CPUs have the same distance it is assumed that the migration
+ * cost is the same. (this is a simplification but quite practical)
+ */
+#define MAX_DOMAIN_DISTANCE 32
+
+static __initdata unsigned long long migration_cost[MAX_DOMAIN_DISTANCE] =
+		{ -1LL , };
+
+/*
+ * Allow override of migration cost - in units of microseconds.
+ * E.g. migration_cost=1000,2000,3000 will set up a level-1 cost
+ * of 1 msec, level-2 cost of 2 msecs and level3 cost of 3 msecs:
+ */
+static int __init migration_cost_setup(char *str)
+{
+	int ints[MAX_DOMAIN_DISTANCE+1], i;
+
+	str = get_options(str, ARRAY_SIZE(ints), ints);
+
+	printk("#ints: %d\n", ints[0]);
+	for (i = 1; i <= ints[0]; i++) {
+		migration_cost[i-1] = (unsigned long long)ints[i]*1000;
+		printk("migration_cost[%d]: %Ld\n", i-1, migration_cost[i-1]);
+	}
+	return 1;
+}
+
+__setup ("migration_cost=", migration_cost_setup);
+
+/*
+ * Global multiplier (divisor) for migration-cutoff values,
+ * in percentiles. E.g. use a value of 150 to get 1.5 times
+ * longer cache-hot cutoff times.
+ *
+ * (We scale it from 100 to 128 to long long handling easier.)
+ */
+
+#define MIGRATION_FACTOR_SCALE 128
+
+static __initdata unsigned int migration_factor = MIGRATION_FACTOR_SCALE;
+
+static int __init setup_migration_factor(char *str)
+{
+	get_option(&str, &migration_factor);
+	migration_factor = migration_factor * MIGRATION_FACTOR_SCALE / 100;
+	return 1;
+}
+
+__setup("migration_factor=", setup_migration_factor);
+
+/*
+ * Estimated distance of two CPUs, measured via the number of domains
+ * we have to pass for the two CPUs to be in the same span:
+ */
+__init static unsigned long domain_distance(int cpu1, int cpu2)
+{
+	unsigned long distance = 0;
+	struct sched_domain *sd;
+
+	for_each_domain(cpu1, sd) {
+		WARN_ON(!cpu_isset(cpu1, sd->span));
+		if (cpu_isset(cpu2, sd->span))
+			return distance;
+		distance++;
+	}
+	if (distance >= MAX_DOMAIN_DISTANCE) {
+		WARN_ON(1);
+		distance = MAX_DOMAIN_DISTANCE-1;
+	}
+
+	return distance;
+}
+
+static __initdata unsigned int migration_debug = 1;
+
+static int __init setup_migration_debug(char *str)
+{
+	get_option(&str, &migration_debug);
+	return 1;
+}
+
+__setup("migration_debug=", setup_migration_debug);
+
+/*
+ * Maximum cache-size that the scheduler should try to measure.
+ * Architectures with larger caches should tune this up during
+ * bootup. Gets used in the domain-setup code (i.e. during SMP
+ * bootup).
+ */
+__initdata unsigned int max_cache_size;
+
+static int __init setup_max_cache_size(char *str)
+{
+	get_option(&str, &max_cache_size);
+	return 1;
+}
+
+__setup("max_cache_size=", setup_max_cache_size);
+
+/*
+ * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This
+ * is the operation that is timed, so we try to generate unpredictable
+ * cachemisses that still end up filling the L2 cache:
+ */
+__init static void touch_cache(void *__cache, unsigned long __size)
+{
+	unsigned long size = __size/sizeof(long), chunk1 = size/3,
+			chunk2 = 2*size/3;
+	unsigned long *cache = __cache;
+	int i;
+
+	for (i = 0; i < size/6; i += 8) {
+		switch (i % 6) {
+			case 0: cache[i]++;
+			case 1: cache[size-1-i]++;
+			case 2: cache[chunk1-i]++;
+			case 3: cache[chunk1+i]++;
+			case 4: cache[chunk2-i]++;
+			case 5: cache[chunk2+i]++;
+		}
+	}
+}
+
+/*
+ * Measure the cache-cost of one task migration. Returns in units of nsec.
+ */
+__init static unsigned long long measure_one(void *cache, unsigned long size,
+					     int source, int target)
+{
+	cpumask_t mask, saved_mask;
+	unsigned long long t0, t1, t2, t3, cost;
+
+	saved_mask = current->cpus_allowed;
+
+	/*
+	 * Migrate to the source CPU:
+	 */
+	mask = cpumask_of_cpu(source);
+	set_cpus_allowed(current, mask);
+	WARN_ON(smp_processor_id() != source);
+
+	/*
+	 * Dirty the working set:
+	 */
+	t0 = sched_clock();
+	touch_cache(cache, size);
+	t1 = sched_clock();
+
+	/*
+	 * Migrate to the target CPU, dirty the L2 cache and access
+	 * the shared buffer. (which represents the working set
+	 * of a migrated task.)
+	 */
+	mask = cpumask_of_cpu(target);
+	set_cpus_allowed(current, mask);
+	WARN_ON(smp_processor_id() != target);
+
+	t2 = sched_clock();
+	touch_cache(cache, size);
+	t3 = sched_clock();
+
+	cost = t1-t0 + t3-t2;
+
+	if (migration_debug >= 2)
+		printk("[%d->%d]: %8Ld %8Ld %8Ld => %10Ld.\n",
+			source, target, t1-t0, t1-t0, t3-t2, cost);
+
+	set_cpus_allowed(current, saved_mask);
+
+	return cost;
+}
+
+/*
+ * Measure a series of task migrations and return the average
+ * result. Since this code runs early during bootup the system
+ * is 'undisturbed' and the average latency makes sense.
+ *
+ * The algorithm in essence auto-detects the relevant cache-size,
+ * so it will properly detect different cachesizes for different
+ * cache-hierarchies, depending on how the CPUs are connected.
+ *
+ * Architectures can prime the upper limit of the search range via
+ * max_cache_size, otherwise the search range defaults to 20MB...64K.
+ */
+__init static unsigned long long
+measure_cost(int cpu1, int cpu2, void *cache, unsigned int size)
+{
+	unsigned long long cost1, cost2;
+	int i;
+
+	/*
+	 * Measure the migration cost of 'size' bytes, over an
+	 * average of 10 runs:
+	 *
+	 * (We perturb the cache size by a small (0..4k)
+	 *  value to compensate size/alignment related artifacts.
+	 *  We also subtract the cost of the operation done on
+	 *  the same CPU.)
+	 */
+	cost1 = 0;
+
+	/*
+	 * dry run, to make sure we start off cache-cold on cpu1,
+	 * and to get any vmalloc pagefaults in advance:
+	 */
+	measure_one(cache, size, cpu1, cpu2);
+	for (i = 0; i < ITERATIONS; i++)
+		cost1 += measure_one(cache, size - i*1024, cpu1, cpu2);
+
+	measure_one(cache, size, cpu2, cpu1);
+	for (i = 0; i < ITERATIONS; i++)
+		cost1 += measure_one(cache, size - i*1024, cpu2, cpu1);
+
+	/*
+	 * (We measure the non-migrating [cached] cost on both
+	 *  cpu1 and cpu2, to handle CPUs with different speeds)
+	 */
+	cost2 = 0;
+
+	measure_one(cache, size, cpu1, cpu1);
+	for (i = 0; i < ITERATIONS; i++)
+		cost2 += measure_one(cache, size - i*1024, cpu1, cpu1);
+
+	measure_one(cache, size, cpu2, cpu2);
+	for (i = 0; i < ITERATIONS; i++)
+		cost2 += measure_one(cache, size - i*1024, cpu2, cpu2);
+
+	/*
+	 * Get the per-iteration migration cost:
+	 */
+	do_div(cost1, 2*ITERATIONS);
+	do_div(cost2, 2*ITERATIONS);
+
+	return cost1 - cost2;
+}
+
+__init static unsigned long long measure_migration_cost(int cpu1, int cpu2)
+{
+	unsigned long long max_cost = 0, fluct = 0, avg_fluct = 0;
+	unsigned int max_size, size, size_found = 0;
+	long long cost = 0, prev_cost;
+	void *cache;
+
+	/*
+	 * Search from max_cache_size*5 down to 64K - the real relevant
+	 * cachesize has to lie somewhere inbetween.
+	 */
+	if (max_cache_size) {
+		max_size = max(max_cache_size * SEARCH_SCOPE, MIN_CACHE_SIZE);
+		size = max(max_cache_size / SEARCH_SCOPE, MIN_CACHE_SIZE);
+	} else {
+		/*
+		 * Since we have no estimation about the relevant
+		 * search range
+		 */
+		max_size = DEFAULT_CACHE_SIZE * SEARCH_SCOPE;
+		size = MIN_CACHE_SIZE;
+	}
+		
+	if (!cpu_online(cpu1) || !cpu_online(cpu2)) {
+		printk("cpu %d and %d not both online!\n", cpu1, cpu2);
+		return 0;
+	}
+
+	/*
+	 * Allocate the working set:
+	 */
+	cache = vmalloc(max_size);
+	if (!cache) {
+		printk("could not vmalloc %d bytes for cache!\n", 2*max_size);
+		return 1000000; // return 1 msec on very small boxen
+	}
+
+	while (size <= max_size) {
+		prev_cost = cost;
+		cost = measure_cost(cpu1, cpu2, cache, size);
+
+		/*
+		 * Update the max:
+		 */
+		if (cost > 0) {
+			if (max_cost < cost) {
+				max_cost = cost;
+				size_found = size;
+			}
+		}
+		/*
+		 * Calculate average fluctuation, we use this to prevent
+		 * noise from triggering an early break out of the loop:
+		 */
+		fluct = abs(cost - prev_cost);
+		avg_fluct = (avg_fluct + fluct)/2;
+
+		if (migration_debug)
+			printk("-> [%d][%d][%7d] %3ld.%ld [%3ld.%ld] (%ld): (%8Ld %8Ld)\n",
+				cpu1, cpu2, size,
+				(long)cost / 1000000,
+				((long)cost / 100000) % 10,
+				(long)max_cost / 1000000,
+				((long)max_cost / 100000) % 10,
+				domain_distance(cpu1, cpu2),
+				cost, avg_fluct);
+
+		/*
+		 * If we iterated at least 20% past the previous maximum,
+		 * and the cost has dropped by more than 20% already,
+		 * (taking fluctuations into account) then we assume to
+		 * have found the maximum and break out of the loop early:
+		 */
+		if (size_found && (size*100 > size_found*SIZE_THRESH))
+			if (cost+avg_fluct <= 0 ||
+				max_cost*100 > (cost+avg_fluct)*COST_THRESH) {
+
+				if (migration_debug)
+					printk("-> found max.\n");
+				break;
+			}
+		/*
+		 * Increase the cachesize in 5% steps:
+		 */
+		size = size * 20 / 19;
+	}
+
+	if (migration_debug)
+		printk("[%d][%d] working set size found: %d, cost: %Ld\n",
+			cpu1, cpu2, size_found, max_cost);
+
+	vfree(cache);
+
+	/*
+	 * A task is considered 'cache cold' if at least 2 times
+	 * the worst-case cost of migration has passed.
+	 *
+	 * (this limit is only listened to if the load-balancing
+	 * situation is 'nice' - if there is a large imbalance we
+	 * ignore it for the sake of CPU utilization and
+	 * processing fairness.)
+	 */
+	return 2 * max_cost * migration_factor / MIGRATION_FACTOR_SCALE;
+}
+
+void __devinit calibrate_migration_costs(void)
+{
+	int cpu1 = -1, cpu2 = -1, cpu;
+	struct sched_domain *sd;
+	unsigned long distance, max_distance = 0;
+	unsigned long long cost;
+	unsigned long flags, j0, j1;
+
+	local_irq_save(flags);
+	local_irq_enable();
+	j0 = jiffies;
+
+	/*
+	 * First pass - calculate the cacheflush times:
+	 */
+	for_each_online_cpu(cpu1) {
+		for_each_online_cpu(cpu2) {
+			if (cpu1 == cpu2)
+				continue;
+			distance = domain_distance(cpu1, cpu2);
+			max_distance = max(max_distance, distance);
+			/*
+			 * Do we have the result cached already?
+			 */
+			if (migration_cost[distance] != -1LL)
+				cost = migration_cost[distance];
+			else {
+				cost = measure_migration_cost(cpu1, cpu2);
+				migration_cost[distance] = cost;
+			}
+		}
+	}
+	/*
+	 * Second pass - update the sched domain hierarchy with
+	 * the new cache-hot-time estimations:
+	 */
+	for_each_online_cpu(cpu) {
+		distance = 0;
+		for_each_domain(cpu, sd) {
+			sd->cache_hot_time = migration_cost[distance];
+			distance++;
+		}
+	}
+	/*
+	 * Print the matrix:
+	 */
+	printk("---------------------\n");
+	printk("| migration cost matrix (max_cache_size: %d, cpu: %ld MHz):\n",
+			max_cache_size,
+#ifdef CONFIG_X86
+			cpu_khz/1000
+#else
+			-1L
+#endif
+		);
+	printk("---------------------\n");
+	printk("      ");
+	for_each_online_cpu(cpu1)
+		printk("    [%02d]", cpu1);
+	printk("\n");
+	for_each_online_cpu(cpu1) {
+		printk("[%02d]: ", cpu1);
+		for_each_online_cpu(cpu2) {
+			if (cpu1 == cpu2) {
+				printk("    -   ");
+				continue;
+			}
+			distance = domain_distance(cpu1, cpu2);
+			max_distance = max(max_distance, distance);
+			cost = migration_cost[distance];
+			printk(" %2ld.%ld(%ld)", (long)cost / 1000000,
+				((long)cost / 100000) % 10, distance);
+		}
+		printk("\n");
+	}
+	printk("--------------------------------\n");
+	printk("| cacheflush times [%ld]:", max_distance+1);
+	for (distance = 0; distance <= max_distance; distance++) {
+		cost = migration_cost[distance];
+		printk(" %ld.%ld (%Ld)", (long)cost / 1000000,
+			((long)cost / 100000) % 10, cost);
+	}
+	printk("\n");
+	j1 = jiffies;
+	printk("| calibration delay: %ld seconds\n", (j1-j0)/HZ);
+	printk("--------------------------------\n");
+
+	local_irq_restore(flags);
+}
+
+
 #ifdef ARCH_HAS_SCHED_DOMAIN
 extern void __devinit arch_init_sched_domains(void);
 extern void __devinit arch_destroy_sched_domains(void);
@@ -4820,6 +5293,10 @@ static void __devinit arch_init_sched_do
 #endif
 		cpu_attach_domain(sd, i);
 	}
+	/*
+	 * Tune cache-hot values:
+	 */
+	calibrate_migration_costs();
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
--- linux/arch/ia64/kernel/domain.c.orig
+++ linux/arch/ia64/kernel/domain.c
@@ -358,6 +358,10 @@ next_sg:
 #endif
 		cpu_attach_domain(sd, i);
 	}
+	/*
+	 * Tune cache-hot values:
+	 */
+	calibrate_migration_costs();
 }
 
 void __devinit arch_destroy_sched_domains(void)
--- linux/arch/ia64/kernel/setup.c.orig
+++ linux/arch/ia64/kernel/setup.c
@@ -561,6 +561,7 @@ static void
 get_max_cacheline_size (void)
 {
 	unsigned long line_size, max = 1;
+	unsigned int cache_size = 0;
 	u64 l, levels, unique_caches;
         pal_cache_config_info_t cci;
         s64 status;
@@ -585,8 +586,11 @@ get_max_cacheline_size (void)
 		line_size = 1 << cci.pcci_line_size;
 		if (line_size > max)
 			max = line_size;
+		if (cache_size < cci.pcci_cache_size)
+			cache_size = cci.pcci_cache_size;
         }
   out:
+	max_cache_size = max(max_cache_size, cache_size);
 	if (max > ia64_max_cacheline_size)
 		ia64_max_cacheline_size = max;
 }
--- linux/arch/i386/kernel/smpboot.c.orig
+++ linux/arch/i386/kernel/smpboot.c
@@ -873,6 +873,7 @@ static void smp_tune_scheduling (void)
 			cachesize = 16; /* Pentiums, 2x8kB cache */
 			bandwidth = 100;
 		}
+		max_cache_size = cachesize * 1024;
 	}
 }
 
--- linux/include/asm-ia64/topology.h.orig
+++ linux/include/asm-ia64/topology.h
@@ -51,7 +51,6 @@ void build_cpu_to_node_map(void);
 	.max_interval		= 320,			\
 	.busy_factor		= 320,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
@@ -73,7 +72,6 @@ void build_cpu_to_node_map(void);
 	.max_interval		= 320,			\
 	.busy_factor		= 320,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/linux/topology.h.orig
+++ linux/include/linux/topology.h
@@ -86,7 +86,6 @@
 	.max_interval		= 2,			\
 	.busy_factor		= 8,			\
 	.imbalance_pct		= 110,			\
-	.cache_hot_time		= 0,			\
 	.cache_nice_tries	= 0,			\
 	.per_cpu_gain		= 25,			\
 	.flags			= SD_LOAD_BALANCE	\
@@ -112,7 +111,6 @@
 	.max_interval		= 4,			\
 	.busy_factor		= 64,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (5*1000000/2),	\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -527,7 +527,17 @@ extern cpumask_t cpu_isolated_map;
 extern void init_sched_build_groups(struct sched_group groups[],
 	                        cpumask_t span, int (*group_fn)(int cpu));
 extern void cpu_attach_domain(struct sched_domain *sd, int cpu);
+
 #endif /* ARCH_HAS_SCHED_DOMAIN */
+
+/*
+ * Maximum cache size the migration-costs auto-tuning code will
+ * search from:
+ */
+extern unsigned int max_cache_size;
+
+extern void calibrate_migration_costs(void);
+
 #endif /* CONFIG_SMP */
 
 
--- linux/include/asm-i386/topology.h.orig
+++ linux/include/asm-i386/topology.h
@@ -75,7 +75,6 @@ static inline cpumask_t pcibus_to_cpumas
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/asm-ppc64/topology.h.orig
+++ linux/include/asm-ppc64/topology.h
@@ -46,7 +46,6 @@ static inline int node_to_first_cpu(int 
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/asm-x86_64/topology.h.orig
+++ linux/include/asm-x86_64/topology.h
@@ -48,7 +48,6 @@ static inline cpumask_t __pcibus_to_cpum
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/asm-mips/mach-ip27/topology.h.orig
+++ linux/include/asm-mips/mach-ip27/topology.h
@@ -24,7 +24,6 @@ extern unsigned char __node_distances[MA
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-05  1:49             ` Ingo Molnar
@ 2005-04-05  3:04               ` Ingo Molnar
  2005-04-06  3:33                 ` Chen, Kenneth W
  0 siblings, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2005-04-05  3:04 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: Paul Jackson, torvalds, nickpiggin, akpm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 653 bytes --]


latest patch attached. Changes:

 - stabilized calibration even more, by using cache flushing 
   instructions to generate a predictable working set. The cache 
   flushing itself is not timed, it is used to create quiescent
   cache  state.

   I only guessed the ia64 version - e.g. i didnt know what 'type' 
   argument to pass to ia64_sal_cache_flush() to get a d/cache 
   flush+invalidate. Same for ppc/ppc64 - i only guessed the function
   in question but didnt test it.

 - due to more stable results, reduced ITERATIONS from 3 to 2 - this 
   should further speed up calibration.

tested on x86, the calibration results look ok there.

	Ingo

[-- Attachment #2: cache-hot-autodetect.patch --]
[-- Type: text/plain, Size: 18775 bytes --]

--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -47,6 +47,7 @@
 #include <linux/syscalls.h>
 #include <linux/times.h>
 #include <linux/acct.h>
+#include <linux/vmalloc.h>
 #include <asm/tlb.h>
 
 #include <asm/unistd.h>
@@ -4640,6 +4641,506 @@ void __devinit init_sched_build_groups(s
 }
 
 
+/*
+ * Self-tuning task migration cost measurement between source and target CPUs.
+ *
+ * This is done by measuring the cost of manipulating buffers of varying
+ * sizes. For a given buffer-size here are the steps that are taken:
+ *
+ * 1) the source CPU reads+dirties a shared buffer 
+ * 2) the target CPU reads+dirties the same shared buffer
+ *
+ * We measure how long they take, in the following 4 scenarios:
+ *
+ *  - source: CPU1, target: CPU2 | cost1
+ *  - source: CPU2, target: CPU1 | cost2
+ *  - source: CPU1, target: CPU1 | cost3
+ *  - source: CPU2, target: CPU2 | cost4
+ *
+ * We then calculate the cost3+cost4-cost1-cost2 difference - this is
+ * the cost of migration.
+ *
+ * We then start off from a small buffer-size and iterate up to larger
+ * buffer sizes, in 5% steps - measuring each buffer-size separately, and
+ * doing a maximum search for the cost. (The maximum cost for a migration
+ * normally occurs when the working set size is around the effective cache
+ * size.)
+ */
+#define SEARCH_SCOPE		2
+#define MIN_CACHE_SIZE		(64*1024U)
+#define DEFAULT_CACHE_SIZE	(5*1024*1024U)
+#define ITERATIONS		2
+#define SIZE_THRESH		130
+#define COST_THRESH		130
+
+/*
+ * The migration cost is a function of 'domain distance'. Domain
+ * distance is the number of steps a CPU has to iterate down its
+ * domain tree to share a domain with the other CPU. The farther
+ * two CPUs are from each other, the larger the distance gets.
+ *
+ * Note that we use the distance only to cache measurement results,
+ * the distance value is not used numerically otherwise. When two
+ * CPUs have the same distance it is assumed that the migration
+ * cost is the same. (this is a simplification but quite practical)
+ */
+#define MAX_DOMAIN_DISTANCE 32
+
+static __initdata unsigned long long migration_cost[MAX_DOMAIN_DISTANCE] =
+		{ [ 0 ... MAX_DOMAIN_DISTANCE-1 ] = -1LL };
+
+/*
+ * Allow override of migration cost - in units of microseconds.
+ * E.g. migration_cost=1000,2000,3000 will set up a level-1 cost
+ * of 1 msec, level-2 cost of 2 msecs and level3 cost of 3 msecs:
+ */
+static int __init migration_cost_setup(char *str)
+{
+	int ints[MAX_DOMAIN_DISTANCE+1], i;
+
+	str = get_options(str, ARRAY_SIZE(ints), ints);
+
+	printk("#ints: %d\n", ints[0]);
+	for (i = 1; i <= ints[0]; i++) {
+		migration_cost[i-1] = (unsigned long long)ints[i]*1000;
+		printk("migration_cost[%d]: %Ld\n", i-1, migration_cost[i-1]);
+	}
+	return 1;
+}
+
+__setup ("migration_cost=", migration_cost_setup);
+
+/*
+ * Global multiplier (divisor) for migration-cutoff values,
+ * in percentiles. E.g. use a value of 150 to get 1.5 times
+ * longer cache-hot cutoff times.
+ *
+ * (We scale it from 100 to 128 to long long handling easier.)
+ */
+
+#define MIGRATION_FACTOR_SCALE 128
+
+static __initdata unsigned int migration_factor = MIGRATION_FACTOR_SCALE;
+
+static int __init setup_migration_factor(char *str)
+{
+	get_option(&str, &migration_factor);
+	migration_factor = migration_factor * MIGRATION_FACTOR_SCALE / 100;
+	return 1;
+}
+
+__setup("migration_factor=", setup_migration_factor);
+
+/*
+ * Estimated distance of two CPUs, measured via the number of domains
+ * we have to pass for the two CPUs to be in the same span:
+ */
+__init static unsigned long domain_distance(int cpu1, int cpu2)
+{
+	unsigned long distance = 0;
+	struct sched_domain *sd;
+
+	for_each_domain(cpu1, sd) {
+		WARN_ON(!cpu_isset(cpu1, sd->span));
+		if (cpu_isset(cpu2, sd->span))
+			return distance;
+		distance++;
+	}
+	if (distance >= MAX_DOMAIN_DISTANCE) {
+		WARN_ON(1);
+		distance = MAX_DOMAIN_DISTANCE-1;
+	}
+
+	return distance;
+}
+
+static __initdata unsigned int migration_debug = 1;
+
+static int __init setup_migration_debug(char *str)
+{
+	get_option(&str, &migration_debug);
+	return 1;
+}
+
+__setup("migration_debug=", setup_migration_debug);
+
+/*
+ * Maximum cache-size that the scheduler should try to measure.
+ * Architectures with larger caches should tune this up during
+ * bootup. Gets used in the domain-setup code (i.e. during SMP
+ * bootup).
+ */
+__initdata unsigned int max_cache_size;
+
+static int __init setup_max_cache_size(char *str)
+{
+	get_option(&str, &max_cache_size);
+	return 1;
+}
+
+__setup("max_cache_size=", setup_max_cache_size);
+
+/*
+ * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This
+ * is the operation that is timed, so we try to generate unpredictable
+ * cachemisses that still end up filling the L2 cache:
+ */
+__init static void touch_cache(void *__cache, unsigned long __size)
+{
+	unsigned long size = __size/sizeof(long), chunk1 = size/3,
+			chunk2 = 2*size/3;
+	unsigned long *cache = __cache;
+	int i;
+
+	for (i = 0; i < size/6; i += 8) {
+		switch (i % 6) {
+			case 0: cache[i]++;
+			case 1: cache[size-1-i]++;
+			case 2: cache[chunk1-i]++;
+			case 3: cache[chunk1+i]++;
+			case 4: cache[chunk2-i]++;
+			case 5: cache[chunk2+i]++;
+		}
+	}
+}
+
+// FIXME: move sched_cacheflush into arch include files:
+
+#ifdef CONFIG_IA64
+# include <asm/sal.h>
+#endif
+
+__init static void sched_cacheflush(void)
+{
+#ifdef CONFIG_X86
+	asm ("wbinvd");
+#elif defined(CONFIG_IA64)
+	ia64_sal_cache_flush(1); // what argument does d/cache flush?
+#elif defined(CONFIG_PPC64) || defined(CONFIG_PPC)
+	cacheflush();
+#else
+# warning implement sched_cacheflush()! Calibration results may be unreliable.
+#endif
+}
+
+/*
+ * Measure the cache-cost of one task migration. Returns in units of nsec.
+ */
+__init static unsigned long long measure_one(void *cache, unsigned long size,
+					     int source, int target)
+{
+	cpumask_t mask, saved_mask;
+	unsigned long long t0, t1, t2, t3, cost;
+
+	saved_mask = current->cpus_allowed;
+
+	/*
+	 * Flush source caches to RAM and invalidate them:
+	 */
+	sched_cacheflush();
+
+	/*
+	 * Migrate to the source CPU:
+	 */
+	mask = cpumask_of_cpu(source);
+	set_cpus_allowed(current, mask);
+	WARN_ON(smp_processor_id() != source);
+
+	/*
+	 * Dirty the working set:
+	 */
+	t0 = sched_clock();
+	touch_cache(cache, size);
+	t1 = sched_clock();
+
+	/*
+	 * Migrate to the target CPU, dirty the L2 cache and access
+	 * the shared buffer. (which represents the working set
+	 * of a migrated task.)
+	 */
+	mask = cpumask_of_cpu(target);
+	set_cpus_allowed(current, mask);
+	WARN_ON(smp_processor_id() != target);
+
+	t2 = sched_clock();
+	touch_cache(cache, size);
+	t3 = sched_clock();
+
+	cost = t1-t0 + t3-t2;
+
+	if (migration_debug >= 2)
+		printk("[%d->%d]: %8Ld %8Ld %8Ld => %10Ld.\n",
+			source, target, t1-t0, t1-t0, t3-t2, cost);
+	/*
+	 * Flush target caches to RAM and invalidate them:
+	 */
+	sched_cacheflush();
+
+	set_cpus_allowed(current, saved_mask);
+
+	return cost;
+}
+
+/*
+ * Measure a series of task migrations and return the average
+ * result. Since this code runs early during bootup the system
+ * is 'undisturbed' and the average latency makes sense.
+ *
+ * The algorithm in essence auto-detects the relevant cache-size,
+ * so it will properly detect different cachesizes for different
+ * cache-hierarchies, depending on how the CPUs are connected.
+ *
+ * Architectures can prime the upper limit of the search range via
+ * max_cache_size, otherwise the search range defaults to 20MB...64K.
+ */
+__init static unsigned long long
+measure_cost(int cpu1, int cpu2, void *cache, unsigned int size)
+{
+	unsigned long long cost1, cost2;
+	int i;
+
+	/*
+	 * Measure the migration cost of 'size' bytes, over an
+	 * average of 10 runs:
+	 *
+	 * (We perturb the cache size by a small (0..4k)
+	 *  value to compensate size/alignment related artifacts.
+	 *  We also subtract the cost of the operation done on
+	 *  the same CPU.)
+	 */
+	cost1 = 0;
+
+	/*
+	 * dry run, to make sure we start off cache-cold on cpu1,
+	 * and to get any vmalloc pagefaults in advance:
+	 */
+	measure_one(cache, size, cpu1, cpu2);
+	for (i = 0; i < ITERATIONS; i++)
+		cost1 += measure_one(cache, size - i*1024, cpu1, cpu2);
+
+	measure_one(cache, size, cpu2, cpu1);
+	for (i = 0; i < ITERATIONS; i++)
+		cost1 += measure_one(cache, size - i*1024, cpu2, cpu1);
+
+	/*
+	 * (We measure the non-migrating [cached] cost on both
+	 *  cpu1 and cpu2, to handle CPUs with different speeds)
+	 */
+	cost2 = 0;
+
+	measure_one(cache, size, cpu1, cpu1);
+	for (i = 0; i < ITERATIONS; i++)
+		cost2 += measure_one(cache, size - i*1024, cpu1, cpu1);
+
+	measure_one(cache, size, cpu2, cpu2);
+	for (i = 0; i < ITERATIONS; i++)
+		cost2 += measure_one(cache, size - i*1024, cpu2, cpu2);
+
+	/*
+	 * Get the per-iteration migration cost:
+	 */
+	do_div(cost1, 2*ITERATIONS);
+	do_div(cost2, 2*ITERATIONS);
+
+	return cost1 - cost2;
+}
+
+__init static unsigned long long measure_migration_cost(int cpu1, int cpu2)
+{
+	unsigned long long max_cost = 0, fluct = 0, avg_fluct = 0;
+	unsigned int max_size, size, size_found = 0;
+	long long cost = 0, prev_cost;
+	void *cache;
+
+	/*
+	 * Search from max_cache_size*5 down to 64K - the real relevant
+	 * cachesize has to lie somewhere inbetween.
+	 */
+	if (max_cache_size) {
+		max_size = max(max_cache_size * SEARCH_SCOPE, MIN_CACHE_SIZE);
+		size = max(max_cache_size / SEARCH_SCOPE, MIN_CACHE_SIZE);
+	} else {
+		/*
+		 * Since we have no estimation about the relevant
+		 * search range
+		 */
+		max_size = DEFAULT_CACHE_SIZE * SEARCH_SCOPE;
+		size = MIN_CACHE_SIZE;
+	}
+		
+	if (!cpu_online(cpu1) || !cpu_online(cpu2)) {
+		printk("cpu %d and %d not both online!\n", cpu1, cpu2);
+		return 0;
+	}
+
+	/*
+	 * Allocate the working set:
+	 */
+	cache = vmalloc(max_size);
+	if (!cache) {
+		printk("could not vmalloc %d bytes for cache!\n", 2*max_size);
+		return 1000000; // return 1 msec on very small boxen
+	}
+
+	while (size <= max_size) {
+		prev_cost = cost;
+		cost = measure_cost(cpu1, cpu2, cache, size);
+
+		/*
+		 * Update the max:
+		 */
+		if (cost > 0) {
+			if (max_cost < cost) {
+				max_cost = cost;
+				size_found = size;
+			}
+		}
+		/*
+		 * Calculate average fluctuation, we use this to prevent
+		 * noise from triggering an early break out of the loop:
+		 */
+		fluct = abs(cost - prev_cost);
+		avg_fluct = (avg_fluct + fluct)/2;
+
+		if (migration_debug)
+			printk("-> [%d][%d][%7d] %3ld.%ld [%3ld.%ld] (%ld): (%8Ld %8Ld)\n",
+				cpu1, cpu2, size,
+				(long)cost / 1000000,
+				((long)cost / 100000) % 10,
+				(long)max_cost / 1000000,
+				((long)max_cost / 100000) % 10,
+				domain_distance(cpu1, cpu2),
+				cost, avg_fluct);
+
+		/*
+		 * If we iterated at least 20% past the previous maximum,
+		 * and the cost has dropped by more than 20% already,
+		 * (taking fluctuations into account) then we assume to
+		 * have found the maximum and break out of the loop early:
+		 */
+		if (size_found && (size*100 > size_found*SIZE_THRESH))
+			if (cost+avg_fluct <= 0 ||
+				max_cost*100 > (cost+avg_fluct)*COST_THRESH) {
+
+				if (migration_debug)
+					printk("-> found max.\n");
+				break;
+			}
+		/*
+		 * Increase the cachesize in 5% steps:
+		 */
+		size = size * 20 / 19;
+	}
+
+	if (migration_debug)
+		printk("[%d][%d] working set size found: %d, cost: %Ld\n",
+			cpu1, cpu2, size_found, max_cost);
+
+	vfree(cache);
+
+	/*
+	 * A task is considered 'cache cold' if at least 2 times
+	 * the worst-case cost of migration has passed.
+	 *
+	 * (this limit is only listened to if the load-balancing
+	 * situation is 'nice' - if there is a large imbalance we
+	 * ignore it for the sake of CPU utilization and
+	 * processing fairness.)
+	 */
+	return 2 * max_cost * migration_factor / MIGRATION_FACTOR_SCALE;
+}
+
+void __devinit calibrate_migration_costs(void)
+{
+	int cpu1 = -1, cpu2 = -1, cpu;
+	struct sched_domain *sd;
+	unsigned long distance, max_distance = 0;
+	unsigned long long cost;
+	unsigned long flags, j0, j1;
+
+	local_irq_save(flags);
+	local_irq_enable();
+	j0 = jiffies;
+
+	/*
+	 * First pass - calculate the cacheflush times:
+	 */
+	for_each_online_cpu(cpu1) {
+		for_each_online_cpu(cpu2) {
+			if (cpu1 == cpu2)
+				continue;
+			distance = domain_distance(cpu1, cpu2);
+			max_distance = max(max_distance, distance);
+			/*
+			 * Do we have the result cached already?
+			 */
+			if (migration_cost[distance] != -1LL)
+				cost = migration_cost[distance];
+			else {
+				cost = measure_migration_cost(cpu1, cpu2);
+				migration_cost[distance] = cost;
+			}
+		}
+	}
+	/*
+	 * Second pass - update the sched domain hierarchy with
+	 * the new cache-hot-time estimations:
+	 */
+	for_each_online_cpu(cpu) {
+		distance = 0;
+		for_each_domain(cpu, sd) {
+			sd->cache_hot_time = migration_cost[distance];
+			distance++;
+		}
+	}
+	/*
+	 * Print the matrix:
+	 */
+	printk("---------------------\n");
+	printk("| migration cost matrix (max_cache_size: %d, cpu: %ld MHz):\n",
+			max_cache_size,
+#ifdef CONFIG_X86
+			cpu_khz/1000
+#else
+			-1L
+#endif
+		);
+	printk("---------------------\n");
+	printk("      ");
+	for_each_online_cpu(cpu1)
+		printk("    [%02d]", cpu1);
+	printk("\n");
+	for_each_online_cpu(cpu1) {
+		printk("[%02d]: ", cpu1);
+		for_each_online_cpu(cpu2) {
+			if (cpu1 == cpu2) {
+				printk("    -   ");
+				continue;
+			}
+			distance = domain_distance(cpu1, cpu2);
+			max_distance = max(max_distance, distance);
+			cost = migration_cost[distance];
+			printk(" %2ld.%ld(%ld)", (long)cost / 1000000,
+				((long)cost / 100000) % 10, distance);
+		}
+		printk("\n");
+	}
+	printk("--------------------------------\n");
+	printk("| cacheflush times [%ld]:", max_distance+1);
+	for (distance = 0; distance <= max_distance; distance++) {
+		cost = migration_cost[distance];
+		printk(" %ld.%ld (%Ld)", (long)cost / 1000000,
+			((long)cost / 100000) % 10, cost);
+	}
+	printk("\n");
+	j1 = jiffies;
+	printk("| calibration delay: %ld seconds\n", (j1-j0)/HZ);
+	printk("--------------------------------\n");
+
+	local_irq_restore(flags);
+}
+
+
 #ifdef ARCH_HAS_SCHED_DOMAIN
 extern void __devinit arch_init_sched_domains(void);
 extern void __devinit arch_destroy_sched_domains(void);
@@ -4820,6 +5321,10 @@ static void __devinit arch_init_sched_do
 #endif
 		cpu_attach_domain(sd, i);
 	}
+	/*
+	 * Tune cache-hot values:
+	 */
+	calibrate_migration_costs();
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
--- linux/arch/ia64/kernel/domain.c.orig
+++ linux/arch/ia64/kernel/domain.c
@@ -358,6 +358,10 @@ next_sg:
 #endif
 		cpu_attach_domain(sd, i);
 	}
+	/*
+	 * Tune cache-hot values:
+	 */
+	calibrate_migration_costs();
 }
 
 void __devinit arch_destroy_sched_domains(void)
--- linux/arch/ia64/kernel/setup.c.orig
+++ linux/arch/ia64/kernel/setup.c
@@ -561,6 +561,7 @@ static void
 get_max_cacheline_size (void)
 {
 	unsigned long line_size, max = 1;
+	unsigned int cache_size = 0;
 	u64 l, levels, unique_caches;
         pal_cache_config_info_t cci;
         s64 status;
@@ -585,8 +586,11 @@ get_max_cacheline_size (void)
 		line_size = 1 << cci.pcci_line_size;
 		if (line_size > max)
 			max = line_size;
+		if (cache_size < cci.pcci_cache_size)
+			cache_size = cci.pcci_cache_size;
         }
   out:
+	max_cache_size = max(max_cache_size, cache_size);
 	if (max > ia64_max_cacheline_size)
 		ia64_max_cacheline_size = max;
 }
--- linux/arch/i386/kernel/smpboot.c.orig
+++ linux/arch/i386/kernel/smpboot.c
@@ -873,6 +873,7 @@ static void smp_tune_scheduling (void)
 			cachesize = 16; /* Pentiums, 2x8kB cache */
 			bandwidth = 100;
 		}
+		max_cache_size = cachesize * 1024;
 	}
 }
 
--- linux/include/asm-ia64/topology.h.orig
+++ linux/include/asm-ia64/topology.h
@@ -51,7 +51,6 @@ void build_cpu_to_node_map(void);
 	.max_interval		= 320,			\
 	.busy_factor		= 320,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
@@ -73,7 +72,6 @@ void build_cpu_to_node_map(void);
 	.max_interval		= 320,			\
 	.busy_factor		= 320,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/linux/topology.h.orig
+++ linux/include/linux/topology.h
@@ -86,7 +86,6 @@
 	.max_interval		= 2,			\
 	.busy_factor		= 8,			\
 	.imbalance_pct		= 110,			\
-	.cache_hot_time		= 0,			\
 	.cache_nice_tries	= 0,			\
 	.per_cpu_gain		= 25,			\
 	.flags			= SD_LOAD_BALANCE	\
@@ -112,7 +111,6 @@
 	.max_interval		= 4,			\
 	.busy_factor		= 64,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (5*1000000/2),	\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -527,7 +527,17 @@ extern cpumask_t cpu_isolated_map;
 extern void init_sched_build_groups(struct sched_group groups[],
 	                        cpumask_t span, int (*group_fn)(int cpu));
 extern void cpu_attach_domain(struct sched_domain *sd, int cpu);
+
 #endif /* ARCH_HAS_SCHED_DOMAIN */
+
+/*
+ * Maximum cache size the migration-costs auto-tuning code will
+ * search from:
+ */
+extern unsigned int max_cache_size;
+
+extern void calibrate_migration_costs(void);
+
 #endif /* CONFIG_SMP */
 
 
--- linux/include/asm-i386/topology.h.orig
+++ linux/include/asm-i386/topology.h
@@ -75,7 +75,6 @@ static inline cpumask_t pcibus_to_cpumas
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/asm-ppc64/topology.h.orig
+++ linux/include/asm-ppc64/topology.h
@@ -46,7 +46,6 @@ static inline int node_to_first_cpu(int 
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/asm-x86_64/topology.h.orig
+++ linux/include/asm-x86_64/topology.h
@@ -48,7 +48,6 @@ static inline cpumask_t __pcibus_to_cpum
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/asm-mips/mach-ip27/topology.h.orig
+++ linux/include/asm-mips/mach-ip27/topology.h
@@ -24,7 +24,6 @@ extern unsigned char __node_distances[MA
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-05  3:04               ` Ingo Molnar
@ 2005-04-06  3:33                 ` Chen, Kenneth W
  2005-04-06  6:45                   ` Ingo Molnar
  0 siblings, 1 reply; 82+ messages in thread
From: Chen, Kenneth W @ 2005-04-06  3:33 UTC (permalink / raw)
  To: 'Ingo Molnar'
  Cc: Paul Jackson, torvalds, nickpiggin, akpm, linux-kernel

Ingo Molnar wrote on Monday, April 04, 2005 8:05 PM
>
> latest patch attached. Changes:
>
>  - stabilized calibration even more, by using cache flushing
>    instructions to generate a predictable working set. The cache
>    flushing itself is not timed, it is used to create quiescent
>    cache  state.
>
>    I only guessed the ia64 version - e.g. i didnt know what 'type'
>    argument to pass to ia64_sal_cache_flush() to get a d/cache
>    flush+invalidate.

It is preferable to use a ia64_pal_cache_flush instead of SAL call. But
it hangs the machine with that pal call.  I will look at it tomorrow.
The type argument for sal_cache_flush is: 1 for icache, 2 for dcache,
and 3 for i+d.


>  - due to more stable results, reduced ITERATIONS from 3 to 2 - this
>    should further speed up calibration.
>
> tested on x86, the calibration results look ok there.

Calibration result on ia64 (1.5 GHz, 9 MB), somewhat smaller in this
version compare to earlier estimate of 10.4ms.  The optimal setting
found by a db workload is around 16 ms.

---------------------
| migration cost matrix (max_cache_size: 9437184, cpu: -1 MHz):

---------------------
          [00]    [01]    [02]    [03]
[00]:     -     9.3(0)  9.3(0)  9.3(0)
[01]:   9.3(0)    -     9.3(0)  9.3(0)
[02]:   9.3(0)  9.3(0)    -     9.3(0)
[03]:   9.3(0)  9.3(0)  9.3(0)    -
--------------------------------
| cacheflush times [1]: 9.3 (9329800)
| calibration delay: 16 seconds
--------------------------------



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-06  3:33                 ` Chen, Kenneth W
@ 2005-04-06  6:45                   ` Ingo Molnar
  2005-04-08  2:27                     ` Chen, Kenneth W
  0 siblings, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2005-04-06  6:45 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: Paul Jackson, torvalds, nickpiggin, akpm, linux-kernel

* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

> > tested on x86, the calibration results look ok there.
> 
> Calibration result on ia64 (1.5 GHz, 9 MB), somewhat smaller in this 
> version compare to earlier estimate of 10.4ms.  The optimal setting 
> found by a db workload is around 16 ms.

with idle time in the system i'd first concentrate on getting rid of the 
idle time, and then re-measuring the sweet spot. 9.3 msec is certainly a 
correct ballpark figure.

There will also be workload related artifacts: you may speculatively 
delay migration to another CPU, in the hope of the currently executing 
task scheduling away soon. (I have played with that in the past - the 
scheduler has some idea about how scheduling-happy a task is, based on 
the interactivity estimator.)

The cost matrix on the other hand measures the pure cache-related 
migration cost, on a quiet system. There can be an up to factor 2 
increase in the 'effective migration cost', depending on the average 
length of the scheduling atom of the currently executing task. Further 
increases may happen if the system does not scale and interacting 
migrations slow down each other. So with the 9.3msec estimate, the true 
migration factor might be between 9.3 and 18.6 msecs. The bad news would 
be if the estimator gave a higher value than your sweet spot.

once we have a good estimate of the migration cost between domains 
(ignoring permanent penalties such as NUMA), there's a whole lot of 
things we can do with that, to apply it in a broader sense.

> ---------------------
> | migration cost matrix (max_cache_size: 9437184, cpu: -1 MHz):
> ---------------------
>           [00]    [01]    [02]    [03]
> [00]:     -     9.3(0)  9.3(0)  9.3(0)
> [01]:   9.3(0)    -     9.3(0)  9.3(0)
> [02]:   9.3(0)  9.3(0)    -     9.3(0)
> [03]:   9.3(0)  9.3(0)  9.3(0)    -
> --------------------------------
> | cacheflush times [1]: 9.3 (9329800)
> | calibration delay: 16 seconds
> --------------------------------

ok, the delay of 16 secs is alot better. Could you send me the full 
detection log, how stable is the curve?

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-06  6:45                   ` Ingo Molnar
@ 2005-04-08  2:27                     ` Chen, Kenneth W
  0 siblings, 0 replies; 82+ messages in thread
From: Chen, Kenneth W @ 2005-04-08  2:27 UTC (permalink / raw)
  To: 'Ingo Molnar'
  Cc: Paul Jackson, torvalds, nickpiggin, akpm, linux-kernel

Ingo Molnar wrote on Tuesday, April 05, 2005 11:46 PM
> ok, the delay of 16 secs is alot better. Could you send me the full
> detection log, how stable is the curve?

Full log attached.


begin 666 boot.log
M0F]O="!P<F]C97-S;W(@:60@,'@P+S!X8S0Q. I#4%4@,3H@<WEN8VAR;VYI
M>F5D($E40R!W:71H($-052 P("AL87-T(&1I9F8@,R!C>6-L97,L(&UA>&5R
M<B U-# @8WEC;&5S*0I#4%4@,3H@8F%S92!F<F5Q/3$Y.2XT-3E-2'HL($E4
M0R!R871I;STQ-2\R+"!)5$,@9G)E<3TQ-#DU+CDT-$U(>@I#86QI8G)A=&EN
M9R!D96QA>2!L;V]P+BXN(#(R-#$N,#@@0F]G;TU)4%,@*&QP:CTQ,#DS-C,R
M*0I#4%4@,CH@<WEN8VAR;VYI>F5D($E40R!W:71H($-052 P("AL87-T(&1I
M9F8@,R!C>6-L97,L(&UA>&5R<B U-# @8WEC;&5S*0I#4%4@,CH@8F%S92!F
M<F5Q/3$Y.2XT-3E-2'HL($E40R!R871I;STQ-2\R+"!)5$,@9G)E<3TQ-#DU
M+CDT-$U(>@I#86QI8G)A=&EN9R!D96QA>2!L;V]P+BXN(#(R-#$N,#@@0F]G
M;TU)4%,@*&QP:CTQ,#DS-C,R*0I#4%4@,SH@<WEN8VAR;VYI>F5D($E40R!W
M:71H($-052 P("AL87-T(&1I9F8@,R!C>6-L97,L(&UA>&5R<B U-# @8WEC
M;&5S*0I#4%4@,SH@8F%S92!F<F5Q/3$Y.2XT-3E-2'HL($E40R!R871I;STQ
M-2\R+"!)5$,@9G)E<3TQ-#DU+CDT-$U(>@I#86QI8G)A=&EN9R!D96QA>2!L
M;V]P+BXN(#(R-#$N,#@@0F]G;TU)4%,@*&QP:CTQ,#DS-C,R*0I"<F]U9VAT
M('5P(#0@0U!5<PI4;W1A;"!O9B T('!R;V-E<W-O<G,@86-T:79A=&5D("@X
M.38T+C,R($)O9V]-25!3*2X*0U!5,"!A='1A8VAI;F<@<V-H960M9&]M86EN
M.@H@9&]M86EN(# Z('-P86X@9@H@(&=R;W5P<SH@,2 R(#0@. I#4%4Q(&%T
M=&%C:&EN9R!S8VAE9"UD;VUA:6XZ"B!D;VUA:6X@,#H@<W!A;B!F"B @9W)O
M=7!S.B R(#0@." Q"D-053(@871T86-H:6YG('-C:&5D+61O;6%I;CH*(&1O
M;6%I;B P.B!S<&%N(&8*("!G<F]U<',Z(#0@." Q(#(*0U!5,R!A='1A8VAI
M;F<@<V-H960M9&]M86EN.@H@9&]M86EN(# Z('-P86X@9@H@(&=R;W5P<SH@
M." Q(#(@- I;,"T^,5TZ(" W,#<Y-C,P(" W,#<Y-C,P(" S,S(Y.#4P(#T^
M(" @,3 T,#DT.# N"ELP+3XQ73H@(#<P.#$T,C @(#<P.#$T,C @(#,S,S$W
M.3$@/3X@(" Q,#0Q,S(Q,2X*6S M/C%=.B @-S$W,#8U,R @-S$W,#8U,R @
M,S,T.#$Y.2 ]/B @(#$P-3$X.#4R+@I;,2T^,%TZ(" W,#<R-#$U(" W,#<R
M-#$U(" S,S,Q.#0T(#T^(" @,3 T,#0R-3DN"ELQ+3XP73H@(#<P-S8R,C@@
M(#<P-S8R,C@@(#,S,S W,S(@/3X@(" Q,#0P-CDV,"X*6S$M/C!=.B @-S$V
M-#$Q-" @-S$V-#$Q-" @,S,T.#$Q,R ]/B @(#$P-3$R,C(W+@I;,"T^,%TZ
M(" W,#<W,3$W(" W,#<W,3$W(" @.30R,38Q(#T^(" @(#@P,3DR-S@N"ELP
M+3XP73H@(#<P-S0Y-C(@(#<P-S0Y-C(@(" Y-#(R,C$@/3X@(" @.# Q-S$X
M,RX*6S M/C!=.B @-S$W,#(Y-R @-S$W,#(Y-R @(#DS.34S," ]/B @(" X
M,3 Y.#(W+@I;,2T^,5TZ(" W,#<V,#4T(" W,#<V,#4T(" @.30R,CDR(#T^
M(" @(#@P,3@S-#8N"ELQ+3XQ73H@(#<P-S U,S<@(#<P-S U,S<@(" Y-#$T
M-S@@/3X@(" @.# Q,C Q-2X*6S$M/C%=.B @-S$V,34Y,B @-S$V,34Y,B @
M(#DS.#<V." ]/B @(" X,3 P,S8P+@HM/B!;,%U;,5U;-#<Q.#4Y,ET@(" R
M+C0@6R @,BXT72 H,"DZ("@@,C0P,CDV-B @,3(P,30X,RD*6S M/C%=.B @
M-S4S,S(P," @-S4S,S(P," @,S4R,C$T-2 ]/B @(#$Q,#4U,S0U+@I;,"T^
M,5TZ(" W-3,Q-3,V(" W-3,Q-3,V(" S-3(T,#DV(#T^(" @,3$P-34V,S(N
M"ELP+3XQ73H@(#<T-#$Q.#(@(#<T-#$Q.#(@(#,U,#,S.3@@/3X@(" Q,#DT
M-#4X,"X*6S$M/C!=.B @-S4R,S<Q,2 @-S4R,S<Q,2 @,S4R-S4Y-" ]/B @
M(#$Q,#4Q,S U+@I;,2T^,%TZ(" W-3(V,S0P(" W-3(V,S0P(" S-3(V-C,W
M(#T^(" @,3$P-3(Y-S<N"ELQ+3XP73H@(#<T,SDU.3<@(#<T,SDU.3<@(#,U
M,#0S,C8@/3X@(" Q,#DT,SDR,RX*6S M/C!=.B @-S4S,#@X,R @-S4S,#@X
M,R @(#DX.#(Q-2 ]/B @(" X-3$Y,#DX+@I;,"T^,%TZ(" W-3,Q-C$U(" W
M-3,Q-C$U(" @.3@W-C,S(#T^(" @(#@U,3DR-#@N"ELP+3XP73H@(#<T,SDS
M,C8@(#<T,SDS,C8@(" Y.3 W,S@@/3X@(" @.#0S,# V-"X*6S$M/C%=.B @
M-S4R.#4W-R @-S4R.#4W-R @(#DX-SDW,2 ]/B @(" X-3$V-30X+@I;,2T^
M,5TZ(" W-3(T-#@S(" W-3(T-#@S(" @.3@W-S$Q(#T^(" @(#@U,3(Q.30N
M"ELQ+3XQ73H@(#<T,S(V-S<@(#<T,S(V-S<@(" Y.#DW-CD@/3X@(" @.#0R
M,C0T-BX*+3X@6S!=6S%=6S0Y-C8Y,SA=(" @,BXU(%L@(#(N-5T@*# I.B H
M(#(U,C@R.3 @(" V-C,T,#,I"ELP+3XQ73H@(#<X-SDQ.#<@(#<X-SDQ.#<@
M(#,W,#DP,C @/3X@(" Q,34X.#(P-RX*6S M/C%=.B @-S@W-S8V-" @-S@W
M-S8V-" @,S<P-SDQ,2 ]/B @(#$Q-3@U-3<U+@I;,"T^,5TZ(" W.3(T.34Q
M(" W.3(T.34Q(" S-C@W,#$U(#T^(" @,3$V,3$Y-C8N"ELQ+3XP73H@(#<X
M-S$P-3,@(#<X-S$P-3,@(#,W,30P,S(@/3X@(" Q,34X-3 X-2X*6S$M/C!=
M.B @-S@W,3DP-R @-S@W,3DP-R @,S<Q-# W,B ]/B @(#$Q-3@U.3<Y+@I;
M,2T^,%TZ(" W.3(P-#DY(" W.3(P-#DY(" S-CDQ.30S(#T^(" @,3$V,3(T
M-#(N"ELP+3XP73H@(#<X-S4Y.3@@(#<X-S4Y.3@@(#$P-# R-#@@/3X@(" @
M.#DQ-C(T-BX*6S M/C!=.B @-S@W-C0U," @-S@W-C0U," @,3 T,# T." ]
M/B @(" X.3$V-#DX+@I;,"T^,%TZ(" W.3(Q-S$Q(" W.3(Q-S$Q(" Q,#0S
M,C S(#T^(" @(#@Y-C0Y,30N"ELQ+3XQ73H@(#<X-S(R.#@@(#<X-S(R.#@@
M(#$P-# P,3(@/3X@(" @.#DQ,C,P,"X*6S$M/C%=.B @-S@V.3@U.2 @-S@V
M.3@U.2 @,3 S.38X,B ]/B @(" X.3 Y-30Q+@I;,2T^,5TZ(" W.3$W,#@W
M(" W.3$W,#@W(" Q,#0R.#8W(#T^(" @(#@Y-3DY-30N"BT^(%LP75LQ75LU
M,C(X,S4U72 @(#(N-B!;(" R+C9=("@P*3H@*" R-C8Q,C8T(" @,SDX,3@X
M*0I;,"T^,5TZ(" X,C8P-CDR(" X,C8P-CDR(" S.#@R.3<W(#T^(" @,3(Q
M-#,V-CDN"ELP+3XQ73H@(#@R-3<X,#0@(#@R-3<X,#0@(#,X.#,T,30@/3X@
M(" Q,C$T,3(Q."X*6S M/C%=.B @.#0R-CDS,2 @.#0R-CDS,2 @,SDP,S,V
M-R ]/B @(#$R,S,P,CDX+@I;,2T^,%TZ(" X,C4Q,C8T(" X,C4Q,C8T(" S
M.#@W,#<S(#T^(" @,3(Q,S@S,S<N"ELQ+3XP73H@(#@R-3$X,S@@(#@R-3$X
M,S@@(#,X.#8T-3(@/3X@(" Q,C$S.#(Y,"X*6S$M/C!=.B @.#0Q.3,U," @
M.#0Q.3,U," @,SDP-3<Q-R ]/B @(#$R,S(U,#8W+@I;,"T^,%TZ(" X,C4X
M.#@W(" X,C4X.#@W(" Q,#DW,#(Y(#T^(" @(#DS-34Y,38N"ELP+3XP73H@
M(#@R-3,V,S,@(#@R-3,V,S,@(#$Q,# R-C$@/3X@(" @.3,U,S@Y-"X*6S M
M/C!=.B @.#0R-#8X-R @.#0R-#8X-R @,3 Y,SDX,B ]/B @(" Y-3$X-C8Y
M+@I;,2T^,5TZ(" X,C4T-30U(" X,C4T-30U(" Q,#DU.#$T(#T^(" @(#DS
M-3 S-3DN"ELQ+3XQ73H@(#@R-#DT,34@(#@R-#DT,34@(#$P.38V,3@@/3X@
M(" @.3,T-C S,RX*6S$M/C%=.B @.#0Q.#@S,2 @.#0Q.#@S,2 @,3 Y,S8Y
M-" ]/B @(" Y-3$R-3(U+@HM/B!;,%U;,5U;-34P,S4S,5T@(" R+C@@6R @
M,BXX72 H,"DZ("@@,C@P,#DS." @(#(V.#DS,2D*6S M/C%=.B @.#@V,#(R
M-R @.#@V,#(R-R @-#$P.3DT-" ]/B @(#$R.3<P,3<Q+@I;,"T^,5TZ(" X
M.#4X-3,W(" X.#4X-3,W(" T,3$P-S V(#T^(" @,3(Y-CDR-#,N"ELP+3XQ
M73H@(#@W-S4S,#,@(#@W-S4S,#,@(#0P.#<Q.3,@/3X@(" Q,C@V,C0Y-BX*
M6S$M/C!=.B @.#@U,#(S-R @.#@U,#(S-R @-#$Q,S0X-R ]/B @(#$R.38S
M-S(T+@I;,2T^,%TZ(" X.#4R,C0T(" X.#4R,C0T(" T,3$R,S,Y(#T^(" @
M,3(Y-C0U.#,N"ELQ+3XP73H@(#@W-C@U-#0@(#@W-C@U-#0@(#0P.3 R-#8@
M/3X@(" Q,C@U.#<Y,"X*6S M/C!=.B @.#@U-CDW-B @.#@U-CDW-B @,3$U
M,C8Q," ]/B @(#$P,# Y-3@V+@I;,"T^,%TZ(" X.#4T-C4P(" X.#4T-C4P
M(" Q,34T.3(Y(#T^(" @,3 P,#DU-SDN"ELP+3XP73H@(#@W-S0S,S @(#@W
M-S0S,S @(#$Q-30V,38@/3X@(" @.3DR.#DT-BX*6S$M/C%=.B @.#@U,3<V
M,R @.#@U,3<V,R @,3$U,3,W-B ]/B @(#$P,# S,3,Y+@I;,2T^,5TZ(" X
M.#4P-#,X(" X.#4P-#,X(" Q,34P.3,X(#T^(" @,3 P,#$S-S8N"ELQ+3XQ
M73H@(#@W-C8X-C$@(#@W-C8X-C$@(#$Q-30U,#8@/3X@(" @.3DR,3,V-RX*
M+3X@6S!=6S%=6S4W.3,Q.3!=(" @,BXY(%L@(#(N.5T@*# I.B H(#(Y-#@T
M-C$@(" R,#@R,C<I"ELP+3XQ73H@(#DS,S@V-# @(#DS,S@V-# @(#0S,C8T
M,#@@/3X@(" Q,S8V-3 T."X*6S M/C%=.B @.3,S.#DP,2 @.3,S.#DP,2 @
M-#,R-S0R,2 ]/B @(#$S-C8V,S(R+@I;,"T^,5TZ(" Y,C0X-3DS(" Y,C0X
M-3DS(" T,S Q-C@Y(#T^(" @,3,U-3 R.#(N"ELQ+3XP73H@(#DS,C@W-C @
M(#DS,C@W-C @(#0S,C<V-S4@/3X@(" Q,S8U-C0S-2X*6S$M/C!=.B @.3,S
M,# U-B @.3,S,# U-B @-#,S,#,U-" ]/B @(#$S-C8P-#$P+@I;,2T^,%TZ
M(" Y,C,Y,#DW(" Y,C,Y,#DW(" T,S V,CDP(#T^(" @,3,U-#4S.#<N"ELP
M+3XP73H@(#DS,S0Y,#D@(#DS,S0Y,#D@(#$R,3(S.30@/3X@(" Q,#4T-S,P
M,RX*6S M/C!=.B @.3,S-C,U,R @.3,S-C,U,R @,3(Q,3DR,2 ]/B @(#$P
M-30X,C<T+@I;,"T^,%TZ(" Y,C0V,#0X(" Y,C0V,#0X(" Q,C$U,C U(#T^
M(" @,3 T-C$R-3,N"ELQ+3XQ73H@(#DS,S(T.3(@(#DS,S(T.3(@(#$R,3$S
M-S8@/3X@(" Q,#4T,S@V."X*6S$M/C%=.B @.3,S,38Q." @.3,S,38Q." @
M,3(Q,3$T-B ]/B @(#$P-30R-S8T+@I;,2T^,5TZ(" Y,C,W.#(Q(" Y,C,W
M.#(Q(" Q,C$U,#0V(#T^(" @,3 T-3(X-C<N"BT^(%LP75LQ75LV,#DX,#DT
M72 @(#,N,2!;(" S+C%=("@P*3H@*" S,3 T,S$Q(" @,3@R,#,X*0I;,"T^
M,5TZ(" Y-S4V-#0U(" Y-S4V-#0U(" T-34S-38R(#T^(" @,30S,3 P,#<N
M"ELP+3XQ73H@(#DW-38P-C0@(#DW-38P-C0@(#0U-3,X.3,@/3X@(" Q-#,P
M.3DU-RX*6S M/C%=.B @.38U-S$R-2 @.38U-S$R-2 @-#4U,S$V-B ]/B @
M(#$T,C$P,CDQ+@I;,2T^,%TZ(" Y-S0V.38Q(" Y-S0V.38Q(" T-34W,S(V
M(#T^(" @,30S,#0R.#<N"ELQ+3XP73H@(#DW-#DT,3 @(#DW-#DT,3 @(#0U
M-30V-C4@/3X@(" Q-#,P-# W-2X*6S$M/C!=.B @.38U,#DV,2 @.38U,#DV
M,2 @-#4U-C<Q-B ]/B @(#$T,C W-C<W+@I;,"T^,%TZ(" Y-S4S-S4X(" Y
M-S4S-S4X(" Q,C<Y,#<P(#T^(" @,3$P,S(X,C@N"ELP+3XP73H@(#DW-34R
M,3(@(#DW-34R,3(@(#$R-S8U.30@/3X@(" Q,3 S,3@P-BX*6S M/C!=.B @
M.38U.3(R," @.38U.3(R," @,3(W-3DY." ]/B @(#$P.3,U,C$X+@I;,2T^
M,5TZ(" Y-S4P,# R(" Y-S4P,# R(" Q,C<V,C0V(#T^(" @,3$P,C8R-#@N
M"ELQ+3XQ73H@(#DW-#<U.#D@(#DW-#<U.#D@(#$R-S4X-3D@/3X@(" Q,3 R
M,S0T."X*6S$M/C%=.B @.38T-S(T-2 @.38T-S(T-2 @,3(W-C,U," ]/B @
M(#$P.3(S-3DU+@HM/B!;,%U;,5U;-C0Q.3 T-ET@(" S+C(@6R @,RXR72 H
M,"DZ("@@,S(W.30X-" @(#$W.#8P-2D*6S M/C%=.B Q,#$R,SDU-R Q,#$R
M,SDU-R @-#<V.#$R-R ]/B @(#$T.#DR,#@T+@I;,"T^,5TZ(#$P,3(R,34R
M(#$P,3(R,34R(" T-S8W-C0U(#T^(" @,30X.#DW.3<N"ELP+3XQ73H@,3 S
M,C@Q-3<@,3 S,C@Q-3<@(#0W.3(V,S$@/3X@(" Q-3$R,#<X."X*6S$M/C!=
M.B Q,#$Q,S0V-B Q,#$Q,S0V-B @-#<W,3DW," ]/B @(#$T.#@U-#,V+@I;
M,2T^,%TZ(#$P,3$W-#@X(#$P,3$W-#@X(" T-S8X-S$W(#T^(" @,30X.#8R
M,#4N"ELQ+3XP73H@,3 S,C$X-CD@,3 S,C$X-CD@(#0W.30U.3$@/3X@(" Q
M-3$Q-C0V,"X*6S M/C!=.B Q,#$R,#DY." Q,#$R,#DY." @,3,T.3(R-" ]
M/B @(#$Q-#<P,C(R+@I;,"T^,%TZ(#$P,3(T,S8W(#$P,3(T,S8W(" Q,S0Y
M,3$S(#T^(" @,3$T-S,T.# N"ELP+3XP73H@,3 S,C@Q.#(@,3 S,C@Q.#(@
M(#$S-#,V-C4@/3X@(" Q,38W,3@T-RX*6S$M/C%=.B Q,#$Q.# T.2 Q,#$Q
M.# T.2 @,3,T-3@R,2 ]/B @(#$Q-#8S.#<P+@I;,2T^,5TZ(#$P,3$T-C(X
M(#$P,3$T-C(X(" Q,S0U-#4P(#T^(" @,3$T-C P-S@N"ELQ+3XQ73H@,3 S
M,C$Q-38@,3 S,C$Q-38@(#$S-#0X.3 @/3X@(" Q,38V-C T-BX*+3X@6S!=
M6S%=6S8W-38X.3!=(" @,RXT(%L@(#,N-%T@*# I.B H(#,T,S4T-3 @(" Q
M-C<R.#4I"ELP+3XQ73H@,3 W-S$U-3,@,3 W-S$U-3,@(#4P,3@T,#D@/3X@
M(" Q-3<X.3DV,BX*6S M/C%=.B Q,#<W,3(X,B Q,#<W,3(X,B @-3 Q.#0T
M-" ]/B @(#$U-S@Y-S(V+@I;,"T^,5TZ(#$P-S$U,C4R(#$P-S$U,C4R(" U
M,#0T-S8Q(#T^(" @,34W-C P,3,N"ELQ+3XP73H@,3 W-3DX.#D@,3 W-3DX
M.#D@(#4P,C$V-S8@/3X@(" Q-3<X,34V-2X*6S$M/C!=.B Q,#<V-#8Y-" Q
M,#<V-#8Y-" @-3 R,3$Q," ]/B @(#$U-S@U.# T+@I;,2T^,%TZ(#$P-S V
M-3@R(#$P-S V-3@R(" U,#0W.#$W(#T^(" @,34W-30S.3DN"ELP+3XP73H@
M,3 W-C<X,3,@,3 W-C<X,3,@(#$T,3DY-C$@/3X@(" Q,C$X-S<W-"X*6S M
M/C!=.B Q,#<V-S,Q,B Q,#<V-S,Q,B @,30Q.3@Y,R ]/B @(#$R,3@W,C U
M+@I;,"T^,%TZ(#$P-S$Q.#DR(#$P-S$Q.#DR(" Q-#$V-C@X(#T^(" @,3(Q
M,C@U.# N"ELQ+3XQ73H@,3 W-C0P.3,@,3 W-C0P.3,@(#$T,3@X,3@@/3X@
M(" Q,C$X,CDQ,2X*6S$M/C%=.B Q,#<V,3 X,R Q,#<V,3 X,R @,30Q-C,R
M-R ]/B @(#$R,3<W-#$P+@I;,2T^,5TZ(#$P-S U-S0T(#$P-S U-S0T(" Q
M-#$S,C S(#T^(" @,3(Q,3@Y-#<N"BT^(%LP75LQ75LW,3$R-3$U72 @(#,N
M-B!;(" S+C9=("@P*3H@*" S-C$Y-#4P(" @,3<U-C0R*0I;,"T^,5TZ(#$Q
M,C$W.#<X(#$Q,C$W.#<X(" U,C@S-3$X(#T^(" @,38U,#$S.38N"ELP+3XQ
M73H@,3$R,3<P-34@,3$R,3<P-34@(#4R.#$X-#D@/3X@(" Q-C0Y.#DP-"X*
M6S M/C%=.B Q,3,X,38X-" Q,3,X,38X-" @-3,P.38V,2 ]/B @(#$V-CDQ
M,S0U+@I;,2T^,%TZ(#$Q,C V,C4T(#$Q,C V,C4T(" U,C@X-S(Q(#T^(" @
M,38T.30Y-S4N"ELQ+3XP73H@,3$R,#DW-#4@,3$R,#DW-#4@(#4R.#4S-#<@
M/3X@(" Q-C0Y-3 Y,BX*6S$M/C!=.B Q,3,V-S0Y-2 Q,3,V-S0Y-2 @-3,Q
M-34T-R ]/B @(#$V-C@S,#0R+@I;,"T^,%TZ(#$Q,C$T,C4X(#$Q,C$T,C4X
M(" Q-#DT-S<X(#T^(" @,3(W,#DP,S8N"ELP+3XP73H@,3$R,3DP-# @,3$R
M,3DP-# @(#$T.3(Q-S<@/3X@(" Q,C<Q,3(Q-RX*6S M/C!=.B Q,3,W-S@V
M-R Q,3,W-S@V-R @,30Y,3 U,2 ]/B @(#$R.#8X.3$X+@I;,2T^,5TZ(#$Q
M,C Y,S,T(#$Q,C Y,S,T(" Q-#DP.3,P(#T^(" @,3(W,# R-C0N"ELQ+3XQ
M73H@,3$R,#<R-S8@,3$R,#<R-S8@(#$T.3(X-#@@/3X@(" Q,C<P,#$R-"X*
M6S$M/C%=.B Q,3,V-S4T,R Q,3,V-S4T,R @,30X.3DU,B ]/B @(#$R.#4W
M-#DU+@HM/B!;,%U;,5U;-S0X-C@U-UT@(" S+C@@6R @,RXX72 H,"DZ("@@
M,S@P-S8U-R @(#$X,3DR-"D*6S M/C%=.B Q,3DS-C4V,R Q,3DS-C4V,R @
M-34V,#@X-2 ]/B @(#$W-#DW-#0X+@I;,"T^,5TZ(#$Q.3,U,C(Y(#$Q.3,U
M,C(Y(" U-38Q,3@Q(#T^(" @,3<T.38T,3 N"ELP+3XQ73H@,3$X-S(Q-34@
M,3$X-S(Q-34@(#4U.#8Y,3@@/3X@(" Q-S0U.3 W,RX*6S$M/C!=.B Q,3DR
M,C$R,B Q,3DR,C$R,B @-34V-38U,B ]/B @(#$W-#@W-S<T+@I;,2T^,%TZ
M(#$Q.3(V,C U(#$Q.3(V,C U(" U-38W,C@W(#T^(" @,3<T.3,T.3(N"ELQ
M+3XP73H@,3$X-C$U-S$@,3$X-C$U-S$@(#4U.3$P-S@@/3X@(" Q-S0U,C8T
M.2X*6S M/C!=.B Q,3DS,C<U,B Q,3DS,C<U,B @,34W,C8V.2 ]/B @(#$S
M-3 U-#(Q+@I;,"T^,%TZ(#$Q.3,S-3(T(#$Q.3,S-3(T(" Q-3<P,C<T(#T^
M(" @,3,U,#,W.3@N"ELP+3XP73H@,3$X-S(X.3,@,3$X-S(X.3,@(#$U-CDQ
M-S$@/3X@(" Q,S0T,C V-"X*6S$M/C%=.B Q,3DR-#DU,2 Q,3DR-#DU,2 @
M,34W,3,V-B ]/B @(#$S-#DV,S$W+@I;,2T^,5TZ(#$Q.3(T-S Y(#$Q.3(T
M-S Y(" Q-38Y,3@U(#T^(" @,3,T.3,X.30N"ELQ+3XQ73H@,3$X-3DX,S4@
M,3$X-3DX,S4@(#$U-C<P.#4@/3X@(" Q,S0R-CDR,"X*+3X@6S!=6S%=6S<X
M.# Y,#)=(" @-"XP(%L@(#0N,%T@*# I.B H(#0P,#@W,S<@(" Q.3$U,#(I
M"ELP+3XQ73H@,3(U.3DV-#D@,3(U.3DV-#D@(#4X.#0Y,S,@/3X@(" Q.#0X
M-#4X,BX*6S M/C%=.B Q,C4Y.#(Q-2 Q,C4Y.#(Q-2 @-3@X,S4S," ]/B @
M(#$X-#@Q-S0U+@I;,"T^,5TZ(#$R-3DR-S Q(#$R-3DR-S Q(" U.#@V,#(S
M(#T^(" @,3@T-S@W,C0N"ELQ+3XP73H@,3(U.#8X-C$@,3(U.#8X-C$@(#4X
M.#<Y-C4@/3X@(" Q.#0W-#@R-BX*6S$M/C!=.B Q,C4Y,38U-R Q,C4Y,38U
M-R @-3@X-S<S,2 ]/B @(#$X-#<Y,S@X+@I;,2T^,%TZ(#$R-3@R-#0S(#$R
M-3@R-#0S(" U.#@Y,S Q(#T^(" @,3@T-S$W-#0N"ELP+3XP73H@,3(U.38R
M-SD@,3(U.38R-SD@(#$V-3(V-#8@/3X@(" Q-#(T.#DR-2X*6S M/C!=.B Q
M,C4Y-C<U,B Q,C4Y-C<U,B @,38U,34Q." ]/B @(#$T,C0X,C<P+@I;,"T^
M,%TZ(#$R-3DP,SDU(#$R-3DP,SDU(" Q-C0W,3$V(#T^(" @,30R,S<U,3$N
M"ELQ+3XQ73H@,3(U.3 R.3,@,3(U.3 R.3,@(#$V-#<Y.#$@/3X@(" Q-#(S
M.#(W-"X*6S$M/C%=.B Q,C4X-C0S,2 Q,C4X-C0S,2 @,38U,#4P,B ]/B @
M(#$T,C,V.3,S+@I;,2T^,5TZ(#$R-3<V-C,U(#$R-3<V-C,U(" Q-C0Y,CDR
M(#T^(" @,30R,C4Y,C<N"BT^(%LP75LQ75LX,CDU-C@V72 @(#0N,B!;(" T
M+C)=("@P*3H@*" T,C0P-S0P(" @,C$Q-S4R*0I;,"T^,5TZ(#$S,C<U,C0Y
M(#$S,C<U,C0Y(" V,3DU-#@V(#T^(" @,3DT-S W,S4N"ELP+3XQ73H@,3,R
M-S$T,C@@,3,R-S$T,C@@(#8Q.34W-C@@/3X@(" Q.30V-S$Y-BX*6S M/C%=
M.B Q,S(S-S0Q-B Q,S(S-S0Q-B @-C$Y,3$X-B ]/B @(#$Y-#(X-C R+@I;
M,2T^,%TZ(#$S,C8P-3(V(#$S,C8P-3(V(" V,3DW,C0U(#T^(" @,3DT-3<W
M-S$N"ELQ+3XP73H@,3,R-C0X.#D@,3,R-C0X.#D@(#8Q.3<R-C$@/3X@(" Q
M.30V,C$U,"X*6S$M/C!=.B Q,S(R.30R." Q,S(R.30R." @-C$Y-38X,R ]
M/B @(#$Y-#(U,3$Q+@I;,"T^,%TZ(#$S,C<P,3@P(#$S,C<P,3@P(" Q-S,W
M.#8R(#T^(" @,34P,#@P-#(N"ELP+3XP73H@,3,R-CDR-#<@,3,R-CDR-#<@
M(#$W,S0V-S@@/3X@(" Q-3 P,SDR-2X*6S M/C!=.B Q,S(S-C0W,B Q,S(S
M-C0W,B @,3<S-SDX-B ]/B @(#$T.3<T-#4X+@I;,2T^,5TZ(#$S,C8T.#8Y
M(#$S,C8T.#8Y(" Q-S,V.3$R(#T^(" @,34P,#$W.#$N"ELQ+3XQ73H@,3,R
M-3DY,# @,3,R-3DY,# @(#$W,S8W.3D@/3X@(" Q-#DY-C8Y.2X*6S$M/C%=
M.B Q,S(R-#<S.2 Q,S(R-#<S.2 @,3<S-C8T-2 ]/B @(#$T.38Q,S@T+@HM
M/B!;,%U;,5U;.#<S,C,P,5T@(" T+C0@6R @-"XT72 H,"DZ("@@-#0V,38T
M." @(#(Q-C,S,"D*6S M/C%=.B Q,S@Y.#@U,2 Q,S@Y.#@U,2 @-C0X.3 R
M.2 ]/B @(#(P,S@W.#@P+@I;,"T^,5TZ(#$S.#DV,C<V(#$S.#DV,C<V(" V
M-#@Y,3,Y(#T^(" @,C S.#4T,34N"ELP+3XQ73H@,3,X,S0Y,#@@,3,X,S0Y
M,#@@(#8U,3@X.3@@/3X@(" R,#,U,S@P-BX*6S$M/C!=.B Q,S@X,C$P-2 Q
M,S@X,C$P-2 @-C0Y,# Q," ]/B @(#(P,S<R,3$U+@I;,2T^,%TZ(#$S.#@S
M-S(R(#$S.#@S-S(R(" V-#DP,SDR(#T^(" @,C S-S0Q,30N"ELQ+3XP73H@
M,3,X,C0X,# @,3,X,C0X,# @(#8U,C0Y.3@@/3X@(" R,#,T.3<Y."X*6S M
M/C!=.B Q,S@Y,3@Y-" Q,S@Y,3@Y-" @,3@T.#<Y,R ]/B @(#$U-S0P-C@W
M+@I;,"T^,%TZ(#$S.#DS,S0W(#$S.#DS,S0W(" Q.#0X-S<W(#T^(" @,34W
M-#(Q,C0N"ELP+3XP73H@,3,X,S T.38@,3,X,S T.38@(#$X-#0U-C<@/3X@
M(" Q-38W-3 V,RX*6S$M/C%=.B Q,S@X-S,P,2 Q,S@X-S,P,2 @,3@T-#0X
M-B ]/B @(#$U-S,Q-S@W+@I;,2T^,5TZ(#$S.#@Q-S8T(#$S.#@Q-S8T(" Q
M.#0T-#0Y(#T^(" @,34W,C8R,3,N"ELQ+3XQ73H@,3,X,3DV-C0@,3,X,3DV
M-C0@(#$X,SDP-S4@/3X@(" Q-38U.#<S.2X*+3X@6S!=6S%=6SDQ.3$X.35=
M(" @-"XV(%L@(#0N-ET@*# I.B H(#0V-C4R-#D@(" R,#DY-C4I"ELP+3XQ
M73H@,30U,# P-#$@,30U,# P-#$@(#8X-#(W-CD@/3X@(" R,3,T,C@Q,"X*
M6S M/C%=.B Q-#0Y-S<Y,2 Q-#0Y-S<Y,2 @-C@T,C8P.2 ]/B @(#(Q,S0P
M-# P+@I;,"T^,5TZ(#$T-SDS,C,R(#$T-SDS,C,R(" V.#<Y.3 W(#T^(" @
M,C$V-S,Q,SDN"ELQ+3XP73H@,30T.#(V-#<@,30T.#(V-#<@(#8X-#4X,3$@
M/3X@(" R,3,R.#0U."X*6S$M/C!=.B Q-#0X-38Y-R Q-#0X-38Y-R @-C@T
M-C4X.2 ]/B @(#(Q,S,R,C@V+@I;,2T^,%TZ(#$T-S@P,S W(#$T-S@P,S W
M(" V.#@S-C,W(#T^(" @,C$V-C,Y-#0N"ELP+3XP73H@,30T.3$W,S,@,30T
M.3$W,S,@(#(S,S(V-C0@/3X@(" Q-C@R-#,Y-RX*6S M/C!=.B Q-#0Y-CDW
M." Q-#0Y-CDW." @,C,S,C$P," ]/B @(#$V.#(Y,#<X+@I;,"T^,%TZ(#$T
M-SDP,# S(#$T-SDP,# S(" R,S,R-#,S(#T^(" @,3<Q,C(T,S8N"ELQ+3XQ
M73H@,30T.#<S.#(@,30T.#<S.#(@(#(S,C$X-#@@/3X@(" Q-C@P.3(S,"X*
M6S$M/C%=.B Q-#0X,CDT." Q-#0X,CDT." @,C,S,#$Y." ]/B @(#$V.#$S
M,30V+@I;,2T^,5TZ(#$T-S<Y,#,R(#$T-S<Y,#,R(" R,S,S-#(W(#T^(" @
M,3<Q,3(T-3DN"BT^(%LP75LQ75LY-C<U-C<X72 @(#0N-2!;(" T+C9=("@P
M*3H@*" T-3,S,38S(" @,3<Q,#(U*0I;,"T^,5TZ(#$U-# P,3$R(#$U-# P
M,3$R(" W,C8R-3<T(#T^(" @,C(V-C(V.#8N"ELP+3XQ73H@,34T,#4S,#$@
M,34T,#4S,#$@(#<R-38P,#8@/3X@(" R,C8V,3,P-RX*6S M/C%=.B Q-3,S
M,38T,B Q-3,S,38T,B @-S(X,C4S-R ]/B @(#(R-C$T,3<Y+@I;,2T^,%TZ
M(#$U,S@T,C@R(#$U,S@T,C@R(" W,C8U-3@Q(#T^(" @,C(V-#DX-C,N"ELQ
M+3XP73H@,34S.3 Y-C(@,34S.3 Y-C(@(#<R-C P,S@@/3X@(" R,C8U,3 P
M,"X*6S$M/C!=.B Q-3,Q.#<S-B Q-3,Q.#<S-B @-S(Y-# X,2 ]/B @(#(R
M-C$R.#$W+@I;,"T^,%TZ(#$U-# Q,3DY(#$U-# Q,3DY(" S-34V.3<W(#T^
M(" @,3@Y-3@Q-S8N"ELP+3XP73H@,34T,#(W-S0@,34T,#(W-S0@(#,U.#4V
M-S@@/3X@(" Q.#DX.#0U,BX*6S M/C!=.B Q-3,S,#@P-R Q-3,S,#@P-R @
M,S4T-S,R-" ]/B @(#$X.#<X,3,Q+@I;,2T^,5TZ(#$U,SDQ-CDT(#$U,SDQ
M-CDT(" S-3<R,#$W(#T^(" @,3@Y-C,W,3$N"ELQ+3XQ73H@,34S.3 S.#D@
M,34S.3 S.#D@(#,U-3<R-3 @/3X@(" Q.#DT-S8S.2X*6S$M/C%=.B Q-3,Q
M-#@V-" Q-3,Q-#@V-" @,S4U,C0P." ]/B @(#$X.#8W,C<R+@HM/B!;,%U;
M,5U;,3 Q.#0Y,C1=(" @,RXW(%L@(#0N-ET@*# I.B H(#,W,30T-3(@(" T
M.30X-C@I"ELP+3XQ73H@,38R,38X.#,@,38R,38X.#,@(#<X,3 U-3D@/3X@
M(" R-# R-S0T,BX*6S M/C%=.B Q-C(P-S8Q.2 Q-C(P-S8Q.2 @-S@Q-S<R
M,R ]/B @(#(T,#(U,S0R+@I;,"T^,5TZ(#$V,3,S.# Q(#$V,3,S.# Q(" W
M.#<R,30Q(#T^(" @,C0P,#4Y-#(N"ELQ+3XP73H@,38Q.30Q.3 @,38Q.30Q
M.3 @(#<X,C(Q,S<@/3X@(" R-# Q-C,R-RX*6S$M/C!=.B Q-C$Y.3(T-2 Q
M-C$Y.3(T-2 @-S@Q.#<T.2 ]/B @(#(T,#$W.3DT+@I;,2T^,%TZ(#$V,3(P
M,3DS(#$V,3(P,3DS(" W.#@S.34Y(#T^(" @,C0P,#0Q-3(N"ELP+3XP73H@
M,38R,#0S,SD@,38R,#0S,SD@(#<T.#<X,C<@/3X@(" R,S8Y,C$V-BX*6S M
M/C!=.B Q-C(Q,# S-2 Q-C(Q,# S-2 @-S4S.#<Y," ]/B @(#(S-S0X.#(U
M+@I;,"T^,%TZ(#$V,3,Q-3@U(#$V,3,Q-3@U(" W-C0R,3 Y(#T^(" @,C,W
M-S,V.30N"ELQ+3XQ73H@,38R,#$T,#$@,38R,#$T,#$@(#<U,S,S-C(@/3X@
M(" R,S<S-#<V,RX*6S$M/C%=.B Q-C$Y,C4T-B Q-C$Y,C4T-B @-S0W-C<W
M." ]/B @(#(S-C8Y,S(T+@I;,2T^,5TZ(#$V,3$X.#$X(#$V,3$X.#$X(" W
M-C,U-3 R(#T^(" @,C,W-30S,C N"BT^(%LP75LQ75LQ,#<R,#DW,ET@(" P
M+C(@6R @-"XV72 H,"DZ("@@(#(W-C@Q-R @,3DV-C(U,2D*6S M/C%=.B Q
M-S$T,38Y,2 Q-S$T,38Y,2 @.#8V,#DR-R ]/B @(#(U.# R-C$X+@I;,"T^
M,5TZ(#$W,30T,3DX(#$W,30T,3DX(" X-C4Y,S<S(#T^(" @,C4X,#,U-S$N
M"ELP+3XQ73H@,3<R.3<R-3(@,3<R.3<R-3(@(#@V,3DT-C0@/3X@(" R-3DQ
M-C<Q-BX*6S$M/C!=.B Q-S$R-34S-2 Q-S$R-34S-2 @.#8V,C U." ]/B @
M(#(U-S@W-3DS+@I;,2T^,%TZ(#$W,3,P-3(P(#$W,3,P-3(P(" X-C8S.3(X
M(#T^(" @,C4W.30T-#@N"ELQ+3XP73H@,3<R-S@S,C @,3<R-S@S,C @(#@V
M,C$Y-S4@/3X@(" R-3DP,#(Y-2X*6S M/C!=.B Q-S$S.#@Q,R Q-S$S.#@Q
M,R Q-#0U,30T,B ]/B @(#,Q-3DP,C4U+@I;,"T^,%TZ(#$W,3,Y.#DV(#$W
M,3,Y.#DV(#$T-3 X-3(P(#T^(" @,S$V-#@T,38N"ELP+3XP73H@,3<R.#DX
M-C$@,3<R.#DX-C$@,3,U.# Q-34@/3X@(" S,#@W,# Q-BX*6S$M/C%=.B Q
M-S$R.#4S-2 Q-S$R.#4S-2 Q-#0P-3,X,B ]/B @(#,Q-3,S.3$W+@I;,2T^
M,5TZ(#$W,3(V-S$X(#$W,3(V-S$X(#$T-#4S,S(R(#T^(" @,S$U.# P-# N
M"ELQ+3XQ73H@,3<R-S@T.#0@,3<R-S@T.#0@,3,U-C8P-38@/3X@(" S,#@T
M-#4T,"X*+3X@6S!=6S%=6S$Q,C@U,C,S72 @+34N+3,@6R @-"XV72 H,"DZ
M("@M-3,X,3DY-B @,S@Q,C4S,BD*6S M/C%=.B Q.#(P.#8U-B Q.#(P.#8U
M-B @.34V-3,U,B ]/B @(#(W-S<T,# X+@I;,"T^,5TZ(#$X,3DY,S<Q(#$X
M,3DY,S<Q(" Y-38P,#$X(#T^(" @,C<W-3DS.#DN"ELP+3XQ73H@,3@P,S,W
M,3,@,3@P,S,W,3,@(#DU,S(R,S<@/3X@(" R-S4V-3DU,"X*6S$M/C!=.B Q
M.#$W.3<T-B Q.#$W.3<T-B @.34V.# Y,R ]/B @(#(W-S0W.#,Y+@I;,2T^
M,%TZ(#$X,3@U-38T(#$X,3@U-38T(" Y-38Q,C0U(#T^(" @,C<W-#8X,#DN
M"ELQ+3XP73H@,3@P,30Y-C8@,3@P,30Y-C8@(#DU,S0U.3(@/3X@(" R-S4T
M.34U."X*6S M/C!=.B Q.#$Y-3@R,2 Q.#$Y-3@R,2 Q.#,V.#0R-B ]/B @
M(#,V-38T,C0W+@I;,"T^,%TZ(#$X,3DW,3DU(#$X,3DW,3DU(#$X,S8X-C(U
M(#T^(" @,S8U-C4X,C N"ELP+3XP73H@,3@P,C<S-C8@,3@P,C<S-C8@,3@P
M.3 U.38@/3X@(" S-C$Q-SDV,BX*6S$M/C%=.B Q.#$Y,#(X-B Q.#$Y,#(X
M-B Q.#,R,3 Y-R ]/B @(#,V-3$Q,S@S+@I;,2T^,5TZ(#$X,3<Y.#4V(#$X
M,3<Y.#4V(#$X,S0T-C@R(#T^(" @,S8U,C0U,S@N"ELQ+3XQ73H@,3@P,3$T
M,# @,3@P,3$T,# @,3@P.#0P,3<@/3X@(" S-C Y-30Q-RX*+3X@6S!=6S%=
M6S$Q.#<Y,3DR72 @+3@N+38@6R @-"XV72 H,"DZ("@M.#8W,#4P." @,S4U
M,#4R,BD*6S M/C%=.B Q.#@T-S,V-" Q.#@T-S,V-" Q,#0U,C(Y," ]/B @
M(#(Y,CDY-C4T+@I;,"T^,5TZ(#$X.#0P.3DV(#$X.#0P.3DV(#$P-#4Q.38Y
M(#T^(" @,CDR.3(Y-C4N"ELP+3XQ73H@,3DQ-3<Y-#4@,3DQ-3<Y-#4@,3 T
M-30S.#$@/3X@(" R.38Q,C,R-BX*6S$M/C!=.B Q.#@R-3$Y," Q.#@R-3$Y
M," Q,#0U.3(T,R ]/B @(#(Y,C@T-#,S+@I;,2T^,%TZ(#$X.#(U,# R(#$X
M.#(U,# R(#$P-#4S.3@T(#T^(" @,CDR-S@Y.#8N"ELQ+3XP73H@,3DQ,S8R
M,C4@,3DQ,S8R,C4@,3 T-3@R-C8@/3X@(" R.34Y-#0Y,2X*6S M/C!=.B Q
M.#@S.3(R,2 Q.#@S.3(R,2 Q.#DR,34P," ]/B @(#,W-S8P-S(Q+@I;,"T^
M,%TZ(#$X.#,W,S,V(#$X.#,W,S,V(#$X.3$T-34X(#T^(" @,S<W-3$X.30N
M"ELP+3XP73H@,3DQ-3$V-#<@,3DQ-3$V-#<@,3DS,C,W.34@/3X@(" S.#0W
M-30T,BX*6S$M/C%=.B Q.#@R-30Q,R Q.#@R-30Q,R Q.#DP-S@S,R ]/B @
M(#,W-S,S,C0V+@I;,2T^,5TZ(#$X.#(Q.3DR(#$X.#(Q.3DR(#$X.3 V,S,U
M(#T^(" @,S<W,C@S,C<N"ELQ+3XQ73H@,3DQ,S(V,#0@,3DQ,S(V,#0@,3DS
M,3 Q-# @/3X@(" S.#0T,C<T-"X*+3X@6S!=6S%=6S$R-3 T-#$R72 @+3@N
M+38@6R @-"XV72 H,"DZ("@M.#8U-#DP.2 @,3<X,S V,"D*6S M/C%=.B R
M,#(S-#0X." R,#(S-#0X." Q,34P,C@W,2 ]/B @(#,Q-S,W,S4Y+@I;,"T^
M,5TZ(#(P,C,P.3 Y(#(P,C,P.3 Y(#$Q-#DY.#0S(#T^(" @,S$W,S W-3(N
M"ELP+3XQ73H@,C P,CDT-S4@,C P,CDT-S4@,3$T,C@V,S4@/3X@(" S,30U
M.#$Q,"X*6S$M/C!=.B R,#(Q,C0U-B R,#(Q,C0U-B Q,34Q-#,S-2 ]/B @
M(#,Q-S(V-SDQ+@I;,2T^,%TZ(#(P,C$P-S$T(#(P,C$P-S$T(#$Q-3 V.30Q
M(#T^(" @,S$W,3<V-34N"ELQ+3XP73H@,C P,3$R,34@,C P,3$R,34@,3$T
M,S8P,# @/3X@(" S,30T-S(Q-2X*6S M/C!=.B R,#(R,#@S,2 R,#(R,#@S
M,2 R,#0X-S(Q,2 ]/B @(#0P-S X,#0R+@I;,"T^,%TZ(#(P,C(W-S$Y(#(P
M,C(W-S$Y(#(P-#@X.#0X(#T^(" @-# W,38U-C<N"ELP+3XP73H@,C P,C$P
M-3(@,C P,C$P-3(@,C R-S X-3@@/3X@(" T,#(Y,3DQ,"X*6S$M/C%=.B R
M,#(Q-C<U.2 R,#(Q-C<U.2 R,#0W,SDQ-R ]/B @(#0P-CDP-C<V+@I;,2T^
M,5TZ(#(P,C W-C8S(#(P,C W-C8S(#(P-#<Q-C@W(#T^(" @-# V-SDS-3 N
M"ELQ+3XQ73H@,C P,#4S-#<@,C P,#4S-#<@,C R-3,Q,3(@/3X@(" T,#(U
M.#0U.2X*+3X@6S!=6S%=6S$S,38R-3,X72 @+3@N+3@@6R @-"XV72 H,"DZ
M("@M.#@Y.#$S." @,3 Q,S$T-"D*6S M/C%=.B R,#DR-38S-2 R,#DR-38S
M-2 Q,C0X,#4Q.2 ]/B @(#,S-# V,34T+@I;,"T^,5TZ(#(P.3(X-S@T(#(P
M.3(X-S@T(#$R-#@Q.#(X(#T^(" @,S,T,3 V,3(N"ELP+3XQ73H@,C$P.38X
M,3$@,C$P.38X,3$@,3(T-S8W-SD@/3X@(" S,S4W,S4Y,"X*6S$M/C!=.B R
M,#DP,CDR," R,#DP,CDR," Q,C0Y-C(Q,2 ]/B @(#,S,SDY,3,Q+@I;,2T^
M,%TZ(#(P.3 W.#(V(#(P.3 W.#(V(#$R-#@V-C Q(#T^(" @,S,S.30T,C<N
M"ELQ+3XP73H@,C$P-SDP-S(@,C$P-SDP-S(@,3(T-S<X,S<@/3X@(" S,S4U
M-CDP.2X*6S M/C!=.B R,#DR,38Y-B R,#DR,38Y-B R,3 W-#(Q-2 ]/B @
M(#0Q.3DU.3$Q+@I;,"T^,%TZ(#(P.3(P-34T(#(P.3(P-34T(#(Q,#<W,3(Q
M(#T^(" @-#$Y.3<V-S4N"ELP+3XP73H@,C$P.#DR-C0@,C$P.#DR-C0@,C$S
M,S<Y.#0@/3X@(" T,C0R-S(T."X*6S$M/C%=.B R,#DP-3@V,2 R,#DP-3@V
M,2 R,3 V-#0S-" ]/B @(#0Q.3<P,CDU+@I;,2T^,5TZ(#(P.#DY-C,R(#(P
M.#DY-C,R(#(Q,#8S-3 S(#T^(" @-#$Y-C,Q,S4N"ELQ+3XQ73H@,C$P-S8V
M,#0@,C$P-S8V,#0@,C$S,C,T,3<@/3X@(" T,C0P,# R,2X*+3X@6S!=6S%=
M6S$S.#4U,S S72 @+3@N+3<@6R @-"XV72 H,"DZ("@M.#<Q,S$S-2 @(#4Y
M.3 W,RD*6S M/C%=.B R,3DY-#@S-" R,3DY-#@S-" Q,S4X,34S-R ]/B @
M(#,U-3<V,S<Q+@I;,"T^,5TZ(#(Q.3DR.#(X(#(Q.3DR.#(X(#$S-3<X.#0U
M(#T^(" @,S4U-S$V-S,N"ELP+3XQ73H@,C(S-C8R.#@@,C(S-C8R.#@@,3,V
M-3 X-S @/3X@(" S-C Q-S$U."X*6S$M/C!=.B R,3DW,#<X-B R,3DW,#<X
M-B Q,S4Y-# T.2 ]/B @(#,U-38T.#,U+@I;,2T^,%TZ(#(Q.3<P-#@W(#(Q
M.3<P-#@W(#$S-3@U,3DV(#T^(" @,S4U-34V.#,N"ELQ+3XP73H@,C(S-#0S
M,3@@,C(S-#0S,3@@,3,V-3@X,# @/3X@(" S-C P,S$Q."X*6S M/C!=.B R
M,3DY,#,W,2 R,3DY,#,W,2 R,C W-3,R." ]/B @(#0T,#8U-CDY+@I;,"T^
M,%TZ(#(Q.3@W,# R(#(Q.3@W,# R(#(R,#<X.3 T(#T^(" @-#0P-C4Y,#8N
M"ELP+3XP73H@,C(S-C T,C(@,C(S-C T,C(@,C(U-#4W.#,@/3X@(" T-#DP
M-C(P-2X*6S$M/C%=.B R,3DW-C8R-" R,3DW-C8R-" R,C V.34V-" ]/B @
M(#0T,#0V,3@X+@I;,2T^,5TZ(#(Q.38X,#,Y(#(Q.38X,#,Y(#(R,#8V-S<Q
M(#T^(" @-#0P,S0X,3 N"ELQ+3XQ73H@,C(S-#(T,#$@,C(S-#(T,#$@,C(U
M,C@W.38@/3X@(" T-#@W,3$Y-RX*+3X@6S!=6S%=6S$T-3@T-3(Y72 @+3@N
M+38@6R @-"XV72 H,"DZ("@M.#8X,C8R,2 @(#,Q-#<Y,RD*6S M/C%=.B R
M,S(Q,3DR.2 R,S(Q,3DR.2 Q-#<S.34V,R ]/B @(#,W.34Q-#DR+@I;,"T^
M,5TZ(#(S,C$P-3@Q(#(S,C$P-3@Q(#$T-S,W,S4X(#T^(" @,S<Y-#<Y,SDN
M"ELP+3XQ73H@,C,S.30R,C@@,C,S.30R,C@@,30W,S@T.3 @/3X@(" S.#$S
M,C<Q."X*6S$M/C!=.B R,S$X-#DQ," R,S$X-#DQ," Q-#<T-#(P-" ]/B @
M(#,W.3(Y,3$T+@I;,2T^,%TZ(#(S,3@Y,C,S(#(S,3@Y,C,S(#$T-S0Q,#8R
M(#T^(" @,S<Y,S R.34N"ELQ+3XP73H@,C,S-S(Q-C<@,C,S-S(Q-C<@,30W
M-#8Y,3<@/3X@(" S.#$Q.3 X-"X*6S M/C!=.B R,S(P-C$Q.2 R,S(P-C$Q
M.2 R,S,V.3 T-2 ]/B @(#0V-3<U,38T+@I;,"T^,%TZ(#(S,C V,C<Y(#(S
M,C V,C<Y(#(S,S<R-#(U(#T^(" @-#8U-S@W,#0N"ELP+3XP73H@,C,S.3 S
M-C<@,C,S.3 S-C<@,C,V-3(Q.38@/3X@(" T-S T,C4V,RX*6S$M/C%=.B R
M,S$Y,#<U-R R,S$Y,#<U-R R,S,U-#$R." ]/B @(#0V-30T.#@U+@I;,2T^
M,5TZ(#(S,3@U-S$Q(#(S,3@U-S$Q(#(S,S4U-C4X(#T^(" @-#8U-#$S-CDN
M"ELQ+3XQ73H@,C,S-C@P-C0@,C,S-C@P-C0@,C,V,S$R,#@@/3X@(" T-CDY
M.3(W,BX*+3X@6S!=6S%=6S$U,S4R,3,U72 @+3@N+3<@6R @-"XV72 H,"DZ
M("@M.#<U-SDV." @(#$Y-3 W,"D*6S M/C%=.B R-#4X,#,U,2 R-#4X,#,U
M,2 Q-C Q,S@S.2 ]/B @(#0P-3DT,3DP+@I;,"T^,5TZ(#(T-3<V,3@P(#(T
M-3<V,3@P(#$V,#$S-3$X(#T^(" @-# U.#DV.3@N"ELP+3XQ73H@,C0S,30Y
M-S,@,C0S,30Y-S,@,34X-S8R-C<@/3X@(" T,#$Y,3(T,"X*6S$M/C!=.B R
M-#4U-3@W,B R-#4U-3@W,B Q-C R-3(T,R ]/B @(#0P-3@Q,3$U+@I;,2T^
M,%TZ(#(T-38P-#<V(#(T-38P-#<V(#$V,#$W-S0S(#T^(" @-# U-S@R,3DN
M"ELQ+3XP73H@,C0R.3(U,#(@,C0R.3(U,#(@,34X.#DW,C(@/3X@(" T,#$X
M,C(R-"X*6S M/C!=.B R-#4W-S V,2 R-#4W-S V,2 R-#<Q-C P,B ]/B @
M(#0Y,CDS,#8S+@I;,"T^,%TZ(#(T-3<W,C,Q(#(T-3<W,C,Q(#(T-S$T.3@R
M(#T^(" @-#DR.3(R,3,N"ELP+3XP73H@,C0S,#4X,C@@,C0S,#4X,C@@,C0T
M.3$S,3@@/3X@(" T.#<Y-S$T-BX*6S$M/C%=.B R-#4U-3DR-" R-#4U-3DR
M-" R-#8Y,34Q-" ]/B @(#0Y,C0W-#,X+@I;,2T^,5TZ(#(T-34U-#,P(#(T
M-34U-#,P(#(T-C@Y,3@U(#T^(" @-#DR-#0V,34N"ELQ+3XQ73H@,C0R.#,Q
M.34@,C0R.#,Q.34@,C0T-C(Q,3D@/3X@(" T.#<T-3,Q-"X*+3X@6S!=6S%=
M6S$V,38P,30R72 @+3@N+38@6R @-"XV72 H,"DZ("@M.#8S-#0W-R @(#$U
M.3(X,"D*6S M/C%=.B R-3DT,C$U." R-3DT,C$U." Q-S(Y-S$V-" ]/B @
M(#0S,C,Y,S(R+@I;,"T^,5TZ(#(U.30T-#4T(#(U.30T-#4T(#$W,CDV,C8V
M(#T^(" @-#,R-# W,C N"ELP+3XQ73H@,C8Q.3DW,30@,C8Q.3DW,30@,3<S
M-#(T,#8@/3X@(" T,S4T,C$R,"X*6S$M/C!=.B R-3DQ,SDV,B R-3DQ,SDV
M,B Q-S,Q-#0Q.2 ]/B @(#0S,C(X,S@Q+@I;,2T^,%TZ(#(U.3$X.3@W(#(U
M.3$X.3@W(#$W,S$T,3<X(#T^(" @-#,R,S,Q-C4N"ELQ+3XP73H@,C8Q-S4S
M.3(@,C8Q-S4S.3(@,3<S-3<Q-S @/3X@(" T,S4S,C4V,BX*6S M/C!=.B R
M-3DT,S@R,2 R-3DT,S@R,2 R-C$T-S@X,2 ]/B @(#4R,#DQ-S R+@I;,"T^
M,%TZ(#(U.3,X-S8X(#(U.3,X-S8X(#(V,30U-C,W(#T^(" @-3(P.#0T,#4N
M"ELP+3XP73H@,C8Q.30P.38@,C8Q.30P.38@,C8T-3$Y.34@/3X@(" U,C8T
M-C Y,2X*6S$M/C%=.B R-3DQ.3DY," R-3DQ.3DY," R-C$R-#0Y-R ]/B @
M(#4R,#0T-#@W+@I;,2T^,5TZ(#(U.3$V-C(Y(#(U.3$V-C(Y(#(V,3(X,3<S
M(#T^(" @-3(P-#0X,#(N"ELQ+3XQ73H@,C8Q-S(W.3D@,C8Q-S(W.3D@,C8T
M,S(V,3<@/3X@(" U,C8P-30Q-BX*+3X@6S!=6S%=6S$W,#$P-C<U72 @+3@N
M+3D@6R @-"XV72 H,"DZ("@M.#DU.# S-R @(#(T,30R,"D*6S M/C%=.B R
M-S(Y,#DV-2 R-S(Y,#DV-2 Q.#4Y.3 Q-" ]/B @(#0U.#@Y.3<Y+@I;,"T^
M,5TZ(#(W,CDV.#,V(#(W,CDV.#,V(#$X-3DX-C0X(#T^(" @-#4X.34T.#0N
M"ELP+3XQ73H@,C<V,C0W-C<@,C<V,C0W-C<@,3@W-C(P.3 @/3X@(" T-C,X
M-C@U-RX*6S$M/C!=.B R-S(V-3<X-" R-S(V-3<X-" Q.#8Q-C8X," ]/B @
M(#0U.#@R-#8T+@I;,2T^,%TZ(#(W,C8X.3$T(#(W,C8X.3$T(#$X-C$Q-3DW
M(#T^(" @-#4X.# U,3$N"ELQ+3XP73H@,C<V,#8X-C@@,C<V,#8X-C@@,3@W
M-S0R,S(@/3X@(" T-C,X,3$P,"X*6S M/C!=.B R-S(Y,34X-2 R-S(Y,34X
M-2 R-S4R-S$X-B ]/B @(#4T.#$X-S<Q+@I;,"T^,%TZ(#(W,C@V,#(Q(#(W
M,C@V,#(Q(#(W-3(S,3@P(#T^(" @-30X,#DR,#$N"ELP+3XP73H@,C<V,3<W
M.3 @,C<V,3<W.3 @,C<X.30P-S,@/3X@(" U-34Q,3@V,RX*6S$M/C%=.B R
M-S(W,3DS,B R-S(W,3DS,B R-S4P,C$P.2 ]/B @(#4T-S<T,#0Q+@I;,2T^
M,5TZ(#(W,C8V,S U(#(W,C8V,S U(#(W-3 S,C$X(#T^(" @-30W-CDU,C,N
M"ELQ+3XQ73H@,C<U.3DU,38@,C<U.3DU,38@,C<X-C4V,S<@/3X@(" U-30V
M-3$U,RX*+3X@6S!=6S%=6S$W.3 U.3<S72 @+3DN,"!;(" T+C9=("@P*3H@
M*"TY,# R.30W(" @,30S,38U*0I;,"T^,5TZ(#(X.#(X-C8S(#(X.#(X-C8S
M(#(P,3 R-C0P(#T^(" @-#@Y,S$S,#,N"ELP+3XQ73H@,C@X,S,S-#$@,C@X
M,S,S-#$@,C P.3DW-#<@/3X@(" T.#DS,S X."X*6S M/C%=.B R.#8Y,#DY
M-2 R.#8Y,#DY-2 R,# V.3(U-2 ]/B @(#0X-S8P,C4P+@I;,2T^,%TZ(#(X
M.# Q.3,S(#(X.# Q.3,S(#(P,3$X,C P(#T^(" @-#@Y,C Q,S,N"ELQ+3XP
M73H@,C@X,#4V,C4@,C@X,#4V,C4@,C Q,3,P,C,@/3X@(" T.#DQ.#8T."X*
M6S$M/C!=.B R.#8V.#DT," R.#8V.#DT," R,# X,# Y,R ]/B @(#0X-S0Y
M,#,S+@I;,"T^,%TZ(#(X.#(V-C(R(#(X.#(V-C(R(#(Y,#,V,S(U(#T^(" @
M-3<X-C(Y-#<N"ELP+3XP73H@,C@X,C4R,S@@,C@X,C4R,S@@,CDP,C<P,30@
M/3X@(" U-S@U,C(U,BX*6S M/C!=.B R.#8X-S(X.2 R.#8X-S(X.2 R.#@R
M,S8W-2 ]/B @(#4W-3$P.38T+@I;,2T^,5TZ(#(X.# T-#$R(#(X.# T-#$R
M(#(Y,# S,CDV(#T^(" @-3<X,#<W,#@N"ELQ+3XQ73H@,C@X,#(T,C0@,C@X
M,#(T,C0@,C@Y.3DY.3<@/3X@(" U-S@P,C0R,2X*6S$M/C%=.B R.#8V,S0R
M,R R.#8V,S0R,R R.#<Y,3 X-" ]/B @(#4W-#4T-3 W+@HM/B!;,%U;,5U;
M,3@X-#@S.3)=(" M."XM."!;(" T+C9=("@P*3H@*"TX.#$T-S@R(" @,38U
M-C8U*0I;,%U;,5T@=V]R:VEN9R!S970@<VEZ92!F;W5N9#H@.3$Y,3@Y-2P@
M8V]S=#H@-#8V-3(T.0HM+2TM+2TM+2TM+2TM+2TM+2TM+2T*?"!M:6=R871I
M;VX@8V]S="!M871R:7@@*&UA>%]C86-H95]S:7IE.B Y-#,W,3@T+"!C<'4Z
M("TQ($U(>BDZ"BTM+2TM+2TM+2TM+2TM+2TM+2TM+0H@(" @(" @(" @6S P
M72 @("!;,#%=(" @(%LP,ET@(" @6S S70I;,#!=.B @(" @+2 @(" @.2XS
M*# I(" Y+C,H,"D@(#DN,R@P*0I;,#%=.B @(#DN,R@P*2 @(" M(" @(" Y
M+C,H,"D@(#DN,R@P*0I;,#)=.B @(#DN,R@P*2 @.2XS*# I(" @("T@(" @
M(#DN,R@P*0I;,#-=.B @(#DN,R@P*2 @.2XS*# I(" Y+C,H,"D@(" @+2 @
M( HM+2TM+2TM+2TM+2TM+2TM+2TM+2TM+2TM+2TM+2TM+0I\(&-A8VAE9FQU
M<V@@=&EM97,@6S%=.B Y+C,@*#DS,S T.3@I"GP@8V%L:6)R871I;VX@9&5L
M87DZ(#$V('-E8V]N9',*+2TM+2TM+2TM+2TM+2TM+2TM+2TM+2TM+2TM+2TM
#+2T*
`
end


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels]
  2005-04-03  5:53   ` Paul Jackson
  2005-04-03  7:04     ` Ingo Molnar
@ 2005-04-03  9:01     ` Paul Jackson
  1 sibling, 0 replies; 82+ messages in thread
From: Paul Jackson @ 2005-04-03  9:01 UTC (permalink / raw)
  To: Paul Jackson
  Cc: mingo, kenneth.w.chen, torvalds, nickpiggin, akpm, linux-kernel

Earlier, Paul wrote:
> Note the first 3 chars of the panic message "4.5".  This looks like it
> might be the [00]-[01] entry of Ingo's table, flushed out when the
> newlines of the panic came through.

For the record, the above speculation is probably wrong.

More likely, the first six characters "4.5(0)" of my quoted panic
message came out some time before the panic, and represent the the
[0]-[1] entry of the table.  These six chars came out at approx.
nine minutes into the calculation, and the timer panic'd the system at
ten minutes.  I didn't look at the screen between the 9th and 10th
minute, to realize that it had finally computed one table entry.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: Industry db benchmark result on recent 2.6 kernels
@ 2005-04-01 16:34 Manfred Spraul
  0 siblings, 0 replies; 82+ messages in thread
From: Manfred Spraul @ 2005-04-01 16:34 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: Linux Kernel Mailing List

On Mon, 28 Mar 2005, Chen, Kenneth W wrote:
> With that said, here goes our first data point along with some historical data
> we have collected so far.
>
> 2.6.11	-13%
> 2.6.9		- 6%
> 2.6.8		-23%
> 2.6.2		- 1%
> baseline	(rhel3)

Is it possible to generate an instruction level oprofile for one recent kernel?
I have convinced Mark Wong from OSDL to generate a few for postgres DBT-2, but postgres is limited by it's user space buffer manager, thus it wasn't that useful:

http://khack.osdl.org/stp/299167/oprofile/


--
	Manfred


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Industry db benchmark result on recent 2.6 kernels
@ 2005-03-28 19:33 Chen, Kenneth W
  2005-03-28 19:50 ` Dave Hansen
  2005-03-30  0:00 ` Linus Torvalds
  0 siblings, 2 replies; 82+ messages in thread
From: Chen, Kenneth W @ 2005-03-28 19:33 UTC (permalink / raw)
  To: 'Andrew Morton'; +Cc: linux-kernel

The roller coaster ride continues for the 2.6 kernel on how it measure
up in performance using industry standard database transaction processing
benchmark.  We took a measurement on 2.6.11 and found it is 13% down from
the baseline.

We will be taking db benchmark measurements more frequently from now on with
latest kernel from kernel.org (and make these measurements on a fixed interval).
By doing this, I hope to achieve two things: one is to track base kernel
performance on a regular base; secondly, which is more important in my opinion,
is to create a better communication flow to the kernel developers and to keep
all interested party well informed on the kernel performance for this enterprise
workload.

With that said, here goes our first data point along with some historical data
we have collected so far.

2.6.11	-13%
2.6.9		- 6%
2.6.8		-23%
2.6.2		- 1%
baseline	(rhel3)

The glory detail on the benchmark configuration: 4-way SMP, 1.6 GHz Intel
itanium2, 64GB memory, 450 73GB 15k-rpm disks.  All experiments were done
With exact same hardware and application software, except different kernel
versions.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Industry db benchmark result on recent 2.6 kernels
  2005-03-28 19:33 Chen, Kenneth W
@ 2005-03-28 19:50 ` Dave Hansen
  2005-03-28 20:01   ` Chen, Kenneth W
  2005-03-30  0:00 ` Linus Torvalds
  1 sibling, 1 reply; 82+ messages in thread
From: Dave Hansen @ 2005-03-28 19:50 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: 'Andrew Morton', Linux Kernel Mailing List

On Mon, 2005-03-28 at 11:33 -0800, Chen, Kenneth W wrote:
> We will be taking db benchmark measurements more frequently from now on with
> latest kernel from kernel.org (and make these measurements on a fixed interval).
> By doing this, I hope to achieve two things: one is to track base kernel
> performance on a regular base; secondly, which is more important in my opinion,
> is to create a better communication flow to the kernel developers and to keep
> all interested party well informed on the kernel performance for this enterprise
> workload.

I'd guess that doing it on kernel.org is too late, sometimes.  How high
is the overhead of doing a test?  Would you be able to test each -mm
release?  It's somewhat easier to toss something out of -mm for
re-review than it is out of Linus's tree.

-- Dave


^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: Industry db benchmark result on recent 2.6 kernels
  2005-03-28 19:50 ` Dave Hansen
@ 2005-03-28 20:01   ` Chen, Kenneth W
  0 siblings, 0 replies; 82+ messages in thread
From: Chen, Kenneth W @ 2005-03-28 20:01 UTC (permalink / raw)
  To: 'Dave Hansen'; +Cc: 'Andrew Morton', Linux Kernel Mailing List

On Mon, 2005-03-28 at 11:33 -0800, Chen, Kenneth W wrote:
> We will be taking db benchmark measurements more frequently from now on with
> latest kernel from kernel.org (and make these measurements on a fixed interval).
> By doing this, I hope to achieve two things: one is to track base kernel
> performance on a regular base; secondly, which is more important in my opinion,
> is to create a better communication flow to the kernel developers and to keep
> all interested party well informed on the kernel performance for this enterprise
> workload.

Dave Hansen wrote on Monday, March 28, 2005 11:50 AM
> I'd guess that doing it on kernel.org is too late, sometimes.  How high
> is the overhead of doing a test?  Would you be able to test each -mm
> release?  It's somewhat easier to toss something out of -mm for
> re-review than it is out of Linus's tree.

The overhead is fairly high to run the benchmark.  It's not a one minute run.
(more or less like a 5 hour exercise.  Benchmark run time along is 3+ hours).
-mm has so many stuff, I'm not sure we would have the bandwidth to do a search
on which patch trigger N% regression, etc.  Let me try the base kernel first
and if resources are available, I can attempt to do it on -mm tree.



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Industry db benchmark result on recent 2.6 kernels
  2005-03-28 19:33 Chen, Kenneth W
  2005-03-28 19:50 ` Dave Hansen
@ 2005-03-30  0:00 ` Linus Torvalds
  2005-03-30  0:22   ` Chen, Kenneth W
                     ` (2 more replies)
  1 sibling, 3 replies; 82+ messages in thread
From: Linus Torvalds @ 2005-03-30  0:00 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: 'Andrew Morton', linux-kernel

On Mon, 28 Mar 2005, Chen, Kenneth W wrote:
> 
> With that said, here goes our first data point along with some historical data
> we have collected so far.
> 
> 2.6.11	-13%
> 2.6.9		- 6%
> 2.6.8		-23%
> 2.6.2		- 1%
> baseline	(rhel3)

How repeatable are the numbers across reboots with the same kernel? Some
benchmarks will depend heavily on just where things land in memory, 
especially with things like PAE or even just cache behaviour (ie if some 
frequenly-used page needs to be kmap'ped or not depending on where it 
landed).

You don't have the PAE issue on ia64, but there could be other issues.  
Some of them just disk-layout issues or similar, ie performance might
change depending on where on the disk the data is written in relationship
to where most of the reads come from etc etc. The fact that it seems to 
fluctuate pretty wildly makes me wonder how stable the numbers are.

Also, it would be absolutely wonderful to see a finer granularity (which 
would likely also answer the stability question of the numbers). If you 
can do this with the daily snapshots, that would be great. If it's not 
easily automatable, or if a run takes a long time, maybe every other or 
every third day would be possible?

Doing just release kernels means that there will be a two-month lag
between telling developers that something pissed up performance. Doing it
every day (or at least a couple of times a week) will be much more 
interesting.

I realize that testing can easily be overwhelming, but if something like 
this can be automated, and run in a timely fashion, that would be really 
great. Two months (or half a year) later, and we have absolutely _no_ idea 
what might have caused a regression. For example, that 2.6.2->2.6.8 change 
obviously makes pretty much any developer just go "I've got no clue".

In fact, it would be interesting (still) to go back in time if the
benchmark can be done fast enough, and try to do testing of the historical
weekly (if not daily) builds to see where the big differences happened. If
you can narrow down the 6-month gap of 2.6.2->2.6.8 to a week or a few
days, that would already make people sit up a bit - as it is it's too big
a problem for any developer to look at.

The daily patches are all there on kernel.org, even if the old ones have
been moved into /pub/linux/kernel/v2.6/snapshots/old/.. It's "just" a
small matter of automation ;)

Btw, this isn't just for you either - I'd absolutely _love_ it for pretty
much any benchmark. So anybody who has a favourite benchmark, whether
"obviously relevant" or not, and has the inclination to make a _simple_
daily number (preferably a nice graph), go for it. 

		Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: Industry db benchmark result on recent 2.6 kernels
  2005-03-30  0:00 ` Linus Torvalds
@ 2005-03-30  0:22   ` Chen, Kenneth W
  2005-03-30  0:46   ` Chen, Kenneth W
  2005-04-01 22:51   ` Chen, Kenneth W
  2 siblings, 0 replies; 82+ messages in thread
From: Chen, Kenneth W @ 2005-03-30  0:22 UTC (permalink / raw)
  To: 'Linus Torvalds'; +Cc: 'Andrew Morton', linux-kernel

On Mon, 28 Mar 2005, Chen, Kenneth W wrote:
> With that said, here goes our first data point along with some historical data
> we have collected so far.
>
> 2.6.11	-13%
> 2.6.9		- 6%
> 2.6.8		-23%
> 2.6.2		- 1%
> baseline	(rhel3)

Linus Torvalds wrote on Tuesday, March 29, 2005 4:00 PM
> How repeatable are the numbers across reboots with the same kernel? Some
> benchmarks will depend heavily on just where things land in memory,
> especially with things like PAE or even just cache behaviour (ie if some
> frequenly-used page needs to be kmap'ped or not depending on where it
> landed).

Very repeatable.  This workload is very steady and resolution in throughput
is repeatable down to 0.1%.  We toss everything below that level as noise.


> You don't have the PAE issue on ia64, but there could be other issues.
> Some of them just disk-layout issues or similar, ie performance might
> change depending on where on the disk the data is written in relationship
> to where most of the reads come from etc etc. The fact that it seems to
> fluctuate pretty wildly makes me wonder how stable the numbers are.

This workload has been around for 10+ years and people at Intel studied the
characteristics of this workload inside out for 10+ years.  Every stones will
be turned at least more than once while we tune the entire setup making sure
everything is well balanced.  And we tune the system whenever there is a
hardware change.  Data layout on the disk spindle are very well balanced.


> Also, it would be absolutely wonderful to see a finer granularity (which
> would likely also answer the stability question of the numbers). If you
> can do this with the daily snapshots, that would be great. If it's not
> easily automatable, or if a run takes a long time, maybe every other or
> every third day would be possible?

I sure will make my management know that Linus wants to see the performance
number on a daily bases (I will ask for a couple of million dollar to my
manager for this project :-))



^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: Industry db benchmark result on recent 2.6 kernels
  2005-03-30  0:00 ` Linus Torvalds
  2005-03-30  0:22   ` Chen, Kenneth W
@ 2005-03-30  0:46   ` Chen, Kenneth W
  2005-03-30  0:57     ` Linus Torvalds
  2005-04-01 22:51   ` Chen, Kenneth W
  2 siblings, 1 reply; 82+ messages in thread
From: Chen, Kenneth W @ 2005-03-30  0:46 UTC (permalink / raw)
  To: 'Linus Torvalds'; +Cc: 'Andrew Morton', linux-kernel

Linus Torvalds wrote on Tuesday, March 29, 2005 4:00 PM
> The fact that it seems to fluctuate pretty wildly makes me wonder
> how stable the numbers are.

I can't resist myself from bragging. The high point in the fluctuation
might be because someone is working hard trying to make 2.6 kernel run
faster.  Hint hint hint .....  ;-)



^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: Industry db benchmark result on recent 2.6 kernels
  2005-03-30  0:46   ` Chen, Kenneth W
@ 2005-03-30  0:57     ` Linus Torvalds
  2005-03-30  1:31       ` Nick Piggin
  0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2005-03-30  0:57 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: 'Andrew Morton', linux-kernel



On Tue, 29 Mar 2005, Chen, Kenneth W wrote:
>
> Linus Torvalds wrote on Tuesday, March 29, 2005 4:00 PM
> > The fact that it seems to fluctuate pretty wildly makes me wonder
> > how stable the numbers are.
> 
> I can't resist myself from bragging. The high point in the fluctuation
> might be because someone is working hard trying to make 2.6 kernel run
> faster.  Hint hint hint .....  ;-)

Heh. How do you explain the low-point? If there's somebody out there 
working hard on making it run slower, I want to whack the guy ;)

Good luck with the million-dollar grants, btw. We're all rooting for you, 
and hope your manager is a total push-over.

		Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Industry db benchmark result on recent 2.6 kernels
  2005-03-30  0:57     ` Linus Torvalds
@ 2005-03-30  1:31       ` Nick Piggin
  2005-03-30  1:38         ` Chen, Kenneth W
  0 siblings, 1 reply; 82+ messages in thread
From: Nick Piggin @ 2005-03-30  1:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Chen, Kenneth W, 'Andrew Morton', linux-kernel

Linus Torvalds wrote:
> 
> On Tue, 29 Mar 2005, Chen, Kenneth W wrote:
> 
>>Linus Torvalds wrote on Tuesday, March 29, 2005 4:00 PM
>>
>>>The fact that it seems to fluctuate pretty wildly makes me wonder
>>>how stable the numbers are.
>>
>>I can't resist myself from bragging. The high point in the fluctuation
>>might be because someone is working hard trying to make 2.6 kernel run
>>faster.  Hint hint hint .....  ;-)
> 
> 
> Heh. How do you explain the low-point? If there's somebody out there 
> working hard on making it run slower, I want to whack the guy ;)
> 

If it is doing a lot of mapping/unmapping (or fork/exit), then that
might explain why 2.6.11 is worse.

Fortunately there are more patches to improve this on the way.

Kernel profiles would be useful if possible.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: Industry db benchmark result on recent 2.6 kernels
  2005-03-30  1:31       ` Nick Piggin
@ 2005-03-30  1:38         ` Chen, Kenneth W
  2005-03-30  1:56           ` Nick Piggin
  2005-03-31 14:14           ` Ingo Molnar
  0 siblings, 2 replies; 82+ messages in thread
From: Chen, Kenneth W @ 2005-03-30  1:38 UTC (permalink / raw)
  To: 'Nick Piggin', Linus Torvalds
  Cc: 'Andrew Morton', linux-kernel

Nick Piggin wrote on Tuesday, March 29, 2005 5:32 PM
> If it is doing a lot of mapping/unmapping (or fork/exit), then that
> might explain why 2.6.11 is worse.
>
> Fortunately there are more patches to improve this on the way.

Once benchmark reaches steady state, there is no mapping/unmapping
going on.  Actually, the virtual address space for all the processes
are so stable at steady state that we don't even see it grow or shrink.



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Industry db benchmark result on recent 2.6 kernels
  2005-03-30  1:38         ` Chen, Kenneth W
@ 2005-03-30  1:56           ` Nick Piggin
  2005-03-31 14:14           ` Ingo Molnar
  1 sibling, 0 replies; 82+ messages in thread
From: Nick Piggin @ 2005-03-30  1:56 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: Linus Torvalds, 'Andrew Morton', linux-kernel

Chen, Kenneth W wrote:
> Nick Piggin wrote on Tuesday, March 29, 2005 5:32 PM
> 
>>If it is doing a lot of mapping/unmapping (or fork/exit), then that
>>might explain why 2.6.11 is worse.
>>
>>Fortunately there are more patches to improve this on the way.
> 
> 
> Once benchmark reaches steady state, there is no mapping/unmapping
> going on.  Actually, the virtual address space for all the processes
> are so stable at steady state that we don't even see it grow or shrink.
> 

Oh, well there goes that theory ;)

The only other thing I can think of is the CPU scheduler changes
that went into 2.6.11 (but there are obviously a lot that I can't
think of).

I'm sure I don't need to tell you it would be nice to track down
the source of these problems rather than papering over them with
improvements to the block layer... any indication of what has gone
wrong?

Typically if the CPU scheduler has gone bad and is moving too many
tasks around (and hurting caches), you'll see things like copy_*_user
increase in cost for the same units of work performed. Wheras if it
is too reluctant to move tasks, you'll see increased idle time.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Industry db benchmark result on recent 2.6 kernels
  2005-03-30  1:38         ` Chen, Kenneth W
  2005-03-30  1:56           ` Nick Piggin
@ 2005-03-31 14:14           ` Ingo Molnar
  2005-03-31 19:53             ` Chen, Kenneth W
  1 sibling, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2005-03-31 14:14 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Nick Piggin', Linus Torvalds, 'Andrew Morton',
	linux-kernel

* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

> > If it is doing a lot of mapping/unmapping (or fork/exit), then that
> > might explain why 2.6.11 is worse.
> >
> > Fortunately there are more patches to improve this on the way.
> 
> Once benchmark reaches steady state, there is no mapping/unmapping 
> going on.  Actually, the virtual address space for all the processes 
> are so stable at steady state that we don't even see it grow or 
> shrink.

is there any idle time on the system, in steady state (it's a sign of 
under-balancing)? Idle balancing (and wakeup balancing) is one of the 
things that got tuned back and forth alot. Also, do you know what the 
total number of context-switches is during the full test on each kernel?  
Too many context-switches can be an indicator of over-balancing. Another 
sign of migration gone bad can be relative increase of userspace time 
vs. system time. (due to cache trashing, on DB workloads, where most of 
the cache contents are userspace's.)

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: Industry db benchmark result on recent 2.6 kernels
  2005-03-31 14:14           ` Ingo Molnar
@ 2005-03-31 19:53             ` Chen, Kenneth W
  2005-03-31 20:05               ` Linus Torvalds
  0 siblings, 1 reply; 82+ messages in thread
From: Chen, Kenneth W @ 2005-03-31 19:53 UTC (permalink / raw)
  To: 'Ingo Molnar'
  Cc: 'Nick Piggin', Linus Torvalds, 'Andrew Morton',
	linux-kernel

Ingo Molnar wrote on Thursday, March 31, 2005 6:15 AM
> is there any idle time on the system, in steady state (it's a sign of
> under-balancing)? Idle balancing (and wakeup balancing) is one of the
> things that got tuned back and forth alot. Also, do you know what the
> total number of context-switches is during the full test on each kernel?
> Too many context-switches can be an indicator of over-balancing. Another
> sign of migration gone bad can be relative increase of userspace time
> vs. system time. (due to cache trashing, on DB workloads, where most of
> the cache contents are userspace's.)

No, there are no idle time on the system. If system become I/O bound, we
would do everything we can to remove that bottleneck, i.e., throw a couple
hundred GB of memory to the system, or add a couple hundred disk drives,
etc.  Believe it or not, we are currently CPU bound and that's the reason
why I care about every single cpu cycle being spend in the kernel code.



^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: Industry db benchmark result on recent 2.6 kernels
  2005-03-31 19:53             ` Chen, Kenneth W
@ 2005-03-31 20:05               ` Linus Torvalds
  2005-03-31 20:08                 ` Linus Torvalds
  0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2005-03-31 20:05 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Ingo Molnar', 'Nick Piggin',
	'Andrew Morton', linux-kernel



On Thu, 31 Mar 2005, Chen, Kenneth W wrote:
> 
> No, there are no idle time on the system. If system become I/O bound, we
> would do everything we can to remove that bottleneck, i.e., throw a couple
> hundred GB of memory to the system, or add a couple hundred disk drives,
> etc.  Believe it or not, we are currently CPU bound and that's the reason
> why I care about every single cpu cycle being spend in the kernel code.

Can you post oprofile data for a run? Preferably both for the "best case"  
2.6.x thing (no point in comparing 2.4.x oprofiles with current) and for
"current kernel", whether that be 2.6.11 or some more recent snapshot?

		Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: Industry db benchmark result on recent 2.6 kernels
  2005-03-31 20:05               ` Linus Torvalds
@ 2005-03-31 20:08                 ` Linus Torvalds
  2005-03-31 22:14                   ` Chen, Kenneth W
  0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2005-03-31 20:08 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Ingo Molnar', 'Nick Piggin',
	'Andrew Morton', linux-kernel

On Thu, 31 Mar 2005, Linus Torvalds wrote:
> 
> Can you post oprofile data for a run?

Btw, I realize that you can't give good oprofiles for the user-mode
components, but a kernel profile with even just single "time spent in user
mode" datapoint would be good, since a kernel scheduling problem might
just make caches work worse, and so the biggest negative might be visible
in the amount of time we spend in user mode due to more cache misses..

			Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: Industry db benchmark result on recent 2.6 kernels
  2005-03-31 20:08                 ` Linus Torvalds
@ 2005-03-31 22:14                   ` Chen, Kenneth W
  2005-03-31 23:35                     ` Nick Piggin
  2005-04-01  4:52                     ` Ingo Molnar
  0 siblings, 2 replies; 82+ messages in thread
From: Chen, Kenneth W @ 2005-03-31 22:14 UTC (permalink / raw)
  To: 'Linus Torvalds'
  Cc: 'Ingo Molnar', 'Nick Piggin',
	'Andrew Morton', linux-kernel

Linus Torvalds wrote on Thursday, March 31, 2005 12:09 PM
> Btw, I realize that you can't give good oprofiles for the user-mode
> components, but a kernel profile with even just single "time spent in user
> mode" datapoint would be good, since a kernel scheduling problem might
> just make caches work worse, and so the biggest negative might be visible
> in the amount of time we spend in user mode due to more cache misses..

I was going to bring it up in another thread.  Since you brought it up, I
will ride it along.

The low point in 2.6.11 could very well be the change in the scheduler.
It does too many load balancing in the wake up path and possibly made a
lot of unwise decision.  For example, in try_to_wake_up(), it will try
SD_WAKE_AFFINE for task that is not hot.  By not hot, it looks at when it
was last ran and compare to a constant sd->cache_hot_time.  The problem
is this cache_hot_time is fixed for the entire universe, whether it is a
little celeron processor with 128KB of cache or a sever class Itanium2
processor with 9MB L3 cache.  This one size fit all isn't really working
at all.

We had experimented that parameter earlier and found it was one of the major
source of low point in 2.6.8.  I debated the issue on LKML about 4 month
ago and finally everyone agreed to make that parameter a boot time param.
The change made into bk tree for 2.6.9 release, but somehow it got ripped
right out 2 days after it went in.  I suspect 2.6.11 is a replay of 2.6.8
for the regression in the scheduler.  We are running experiment to confirm
this theory.

That actually brings up more thoughts: what about all other sched parameters?
We found values other than the default helps to push performance up, but it
is probably not acceptable to pick a default number from a db benchmark.
Kernel needs either a dynamic closed feedback loop to adapt to the workload
or some runtime tunables to control them.  Though the latter option did not
go anywhere in the past.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Industry db benchmark result on recent 2.6 kernels
  2005-03-31 22:14                   ` Chen, Kenneth W
@ 2005-03-31 23:35                     ` Nick Piggin
  2005-04-01  6:05                       ` Paul Jackson
  2005-04-01  4:52                     ` Ingo Molnar
  1 sibling, 1 reply; 82+ messages in thread
From: Nick Piggin @ 2005-03-31 23:35 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Linus Torvalds', 'Ingo Molnar',
	'Andrew Morton', linux-kernel

Chen, Kenneth W wrote:
> Linus Torvalds wrote on Thursday, March 31, 2005 12:09 PM
> 
>>Btw, I realize that you can't give good oprofiles for the user-mode
>>components, but a kernel profile with even just single "time spent in user
>>mode" datapoint would be good, since a kernel scheduling problem might
>>just make caches work worse, and so the biggest negative might be visible
>>in the amount of time we spend in user mode due to more cache misses..
> 
> 
> I was going to bring it up in another thread.  Since you brought it up, I
> will ride it along.
> 
> The low point in 2.6.11 could very well be the change in the scheduler.
> It does too many load balancing in the wake up path and possibly made a
> lot of unwise decision.

OK, and considering you have got no idle time at all, and the 2.6.11
kernel included some scheduler changes to make balancing much more
aggressive, so unfortunately that's likely to have caused the latest drop.

> For example, in try_to_wake_up(), it will try
> SD_WAKE_AFFINE for task that is not hot.  By not hot, it looks at when it
> was last ran and compare to a constant sd->cache_hot_time.

The other problem with using that value there is that it represents a hard
cutoff point in behaviour. For example, on a workload that really wants to
have wakers and wakees together, it will work poorly on low loads, but when
things get loaded up enough that we start seeing cache cold tasks there,
behaviour suddenly changes.

In the -mm kernels, there are a large number of scheduler changes that
reduce the amount of balancing. They also remove cache_hot_time from this
path (though it is still useful for periodic balancing).

 > The problem
 > is this cache_hot_time is fixed for the entire universe, whether it is a
 > little celeron processor with 128KB of cache or a sever class Itanium2
 > processor with 9MB L3 cache.  This one size fit all isn't really working
 > at all.

Ingo had a cool patch to estimate dirty => dirty cacheline transfer latency
for all processors with respect to all others, and dynamically tune
cache_hot_time. Unfortunately it was never completely polished, and it is
an O(cpus^2) operation. It is a good idea to look into though.

> We had experimented that parameter earlier and found it was one of the major
> source of low point in 2.6.8.  I debated the issue on LKML about 4 month
> ago and finally everyone agreed to make that parameter a boot time param.
> The change made into bk tree for 2.6.9 release, but somehow it got ripped
> right out 2 days after it went in.  I suspect 2.6.11 is a replay of 2.6.8
> for the regression in the scheduler.  We are running experiment to confirm
> this theory.
> 
> That actually brings up more thoughts: what about all other sched parameters?
> We found values other than the default helps to push performance up, but it
> is probably not acceptable to pick a default number from a db benchmark.
> Kernel needs either a dynamic closed feedback loop to adapt to the workload
> or some runtime tunables to control them.  Though the latter option did not
> go anywhere in the past.
> 

They're in -mm. I think Andrew would rather see things (like auto tuning
cache hot time) rather than putting more runtime variables in.

If you were to make a program which adjusted various parameters using a
feedback loop, then that would be a good argument to put runtime tunables
in.

Oh, one last thing - if you do a great deal of scheduler tuning, it would
be very good if you could possibly use the patchset in -mm. Things have
changed sufficiently that optimal values you find in 2.6 will not be the
same as those in -mm. I realise this may be difficult to justify, but I
would hate for the whole cycle to have to happen again when the patches
go into 2.6.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Industry db benchmark result on recent 2.6 kernels
  2005-03-31 23:35                     ` Nick Piggin
@ 2005-04-01  6:05                       ` Paul Jackson
  2005-04-01  6:34                         ` Nick Piggin
                                           ` (2 more replies)
  0 siblings, 3 replies; 82+ messages in thread
From: Paul Jackson @ 2005-04-01  6:05 UTC (permalink / raw)
  To: Nick Piggin; +Cc: kenneth.w.chen, torvalds, mingo, akpm, linux-kernel

Nick wrote:
> Ingo had a cool patch to estimate dirty => dirty cacheline transfer latency
> ... Unfortunately ... and it is an O(cpus^2) operation.

Yes - a cool patch.

If we had an arch-specific bit of code, that for any two cpus, could
give a 'pseudo-distance' between them, where the only real requirements
were that (1) if two pairs of cpus had the same pseudo-distance, then
that meant they had the same size, layout, kind and speed of bus amd
cache hardware between them (*), and (2) it was cheap - hardly more than
a few lines of code and a subroutine call to obtain, then Ingo's code
could be:

	for each cpu c1:
	    for each cpu c2:
		psdist = pseudo_distance(c1, c2)
		if I've seen psdist before, use the latency computed for that psdist
		else compute a real latency number and remember it for that psdist

A generic form of pseudo_distance, which would work for all normal
sized systems, would be:

int pseudo_distance(int c1, int c2)
{
	static int x;
	return x++;
}

Then us poor slobs with big honkin numa iron could code up a real
pseudo_distance() routine, to avoid the actual pain of doing real work
for cpus^2 iterations for large cpu counts.

Our big boxes have regular geometries with much symmetry, so would
provide significant opportunity to exploit equal pseudo-distances.

And I would imagine that costs of K * NCPU * NCPU are tolerable in this
estimation routine. for sufficiently small K, and existing values of
NCPU.

(*) That is, if pseudo_distance(c1, c2) == pseudo_distance(d1, d2), then
    this meant that however c1 and c2 were connected to each other in the
    system (intervening buses and caches and such), cpus d1 and d2 were
    connected the same way, so could be presumed to have the same latency,
    close enough.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Industry db benchmark result on recent 2.6 kernels
  2005-04-01  6:05                       ` Paul Jackson
@ 2005-04-01  6:34                         ` Nick Piggin
  2005-04-01  7:19                           ` Paul Jackson
  2005-04-01  6:46                         ` Ingo Molnar
  2005-04-01  6:59                         ` Ingo Molnar
  2 siblings, 1 reply; 82+ messages in thread
From: Nick Piggin @ 2005-04-01  6:34 UTC (permalink / raw)
  To: Paul Jackson; +Cc: kenneth.w.chen, torvalds, Ingo Molnar, Andrew Morton, lkml

On Thu, 2005-03-31 at 22:05 -0800, Paul Jackson wrote:
> 
> Then us poor slobs with big honkin numa iron could code up a real
> pseudo_distance() routine, to avoid the actual pain of doing real work
> for cpus^2 iterations for large cpu counts.
> 
> Our big boxes have regular geometries with much symmetry, so would
> provide significant opportunity to exploit equal pseudo-distances.
> 

Couple of observations:

This doesn't actually need to be an O(n^2) operation. The result
of it is only going to be used in the sched domains code, so what
is really wanted is "how far away is one sched_group from another",
although we may also scale that based on the *amount* of cache
in the path between 2 cpus, that is often just a property of the
CPUs themselves in smaller systems, so also not O(n^2).

Secondly, we could use Ingo's O(n^2) code for the *SMP* domain on
all architectures (so in your case of only 2 CPUs per node, it is
obviously much cheaper, even over 256 nodes).

Then the NUMA domain could just inherit this SMP value as a default,
and allow architectures to override it individually.

This may allow us to set up decent baseline numbers, properly scaled
by cache size vs memory bandwidth without going overboard in
complexity (while still allowing arch code to do more fancy stuff).

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Industry db benchmark result on recent 2.6 kernels
  2005-04-01  6:34                         ` Nick Piggin
@ 2005-04-01  7:19                           ` Paul Jackson
  0 siblings, 0 replies; 82+ messages in thread
From: Paul Jackson @ 2005-04-01  7:19 UTC (permalink / raw)
  To: Nick Piggin; +Cc: kenneth.w.chen, torvalds, mingo, akpm, linux-kernel

> Couple of observations:

yeah - plausible enough.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Industry db benchmark result on recent 2.6 kernels
  2005-04-01  6:05                       ` Paul Jackson
  2005-04-01  6:34                         ` Nick Piggin
@ 2005-04-01  6:46                         ` Ingo Molnar
  2005-04-01 22:32                           ` Chen, Kenneth W
  2005-04-01  6:59                         ` Ingo Molnar
  2 siblings, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2005-04-01  6:46 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Nick Piggin, kenneth.w.chen, torvalds, akpm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 966 bytes --]


* Paul Jackson <pj@engr.sgi.com> wrote:

> Nick wrote:
> > Ingo had a cool patch to estimate dirty => dirty cacheline transfer latency
> > ... Unfortunately ... and it is an O(cpus^2) operation.
> 
> Yes - a cool patch.

before we get into complexities, i'd like to see whether it solves Ken's 
performance problem. The attached patch (against BK-curr, but should 
apply to vanilla 2.6.12-rc1 too) adds the autodetection feature. (For 
ia64 i've hacked in a cachesize of 9MB for Ken's testsystem.)

boots fine on x86, and gives this on a 4-way box:

 Brought up 4 CPUs
 migration cost matrix (cache_size: 524288, cpu: 2379 MHz):
         [00]  [01]  [02]  [03]
 [00]:    1.3   1.3   1.4   1.2
 [01]:    1.5   1.3   1.3   1.5
 [02]:    1.5   1.3   1.3   1.5
 [03]:    1.3   1.3   1.3   1.3
 min_delta: 1351361
 using cache_decay nsec: 1351361 (1 msec)

which is a pretty reasonable estimate on that box. (fast P4s, small 
cache)

Ken, could you give it a go?

	Ingo

[-- Attachment #2: cache-hot-autodetect-2.6.12-rc1-A0 --]
[-- Type: text/plain, Size: 9988 bytes --]

--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -4699,6 +4699,232 @@ static void check_sibling_maps(void)
 #endif
 
 /*
+ * Task migration cost measurement between source and target CPUs.
+ *
+ * This is done by measuring the worst-case cost. Here are the
+ * steps that are taken:
+ *
+ * 1) the source CPU dirties its L2 cache with a shared buffer
+ * 2) the target CPU dirties its L2 cache with a local buffer
+ * 3) the target CPU dirties the shared buffer
+ *
+ * We measure the time step #3 takes - this is the cost of migrating
+ * a cache-hot task that has a large, dirty dataset in the L2 cache,
+ * to another CPU.
+ */
+
+
+/*
+ * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This
+ * is the operation that is timed, so we try to generate unpredictable
+ * cachemisses that still end up filling the L2 cache:
+ */
+__init static void fill_cache(void *__cache, unsigned long __size)
+{
+	unsigned long size = __size/sizeof(long);
+	unsigned long *cache = __cache;
+	unsigned long data = 0xdeadbeef;
+	int i;
+
+	for (i = 0; i < size/4; i++) {
+		if ((i & 3) == 0)
+			cache[i] = data;
+		if ((i & 3) == 1)
+			cache[size-1-i] = data;
+		if ((i & 3) == 2)
+			cache[size/2-i] = data;
+		if ((i & 3) == 3)
+			cache[size/2+i] = data;
+	}
+}
+
+struct flush_data {
+	unsigned long source, target;
+	void (*fn)(void *, unsigned long);
+	void *cache;
+	void *local_cache;
+	unsigned long size;
+	unsigned long long delta;
+};
+
+/*
+ * Dirty L2 on the source CPU:
+ */
+__init static void source_handler(void *__data)
+{
+	struct flush_data *data = __data;
+
+	if (smp_processor_id() != data->source)
+		return;
+
+	memset(data->cache, 0, data->size);
+}
+
+/*
+ * Dirty the L2 cache on this CPU and then access the shared
+ * buffer. (which represents the working set of the migrated task.)
+ */
+__init static void target_handler(void *__data)
+{
+	struct flush_data *data = __data;
+	unsigned long long t0, t1;
+	unsigned long flags;
+
+	if (smp_processor_id() != data->target)
+		return;
+
+	memset(data->local_cache, 0, data->size);
+	local_irq_save(flags);
+	t0 = sched_clock();
+	fill_cache(data->cache, data->size);
+	t1 = sched_clock();
+	local_irq_restore(flags);
+
+	data->delta = t1 - t0;
+}
+
+/*
+ * Measure the cache-cost of one task migration:
+ */
+__init static unsigned long long measure_one(void *cache, unsigned long size,
+					     int source, int target)
+{
+	struct flush_data data;
+	unsigned long flags;
+	void *local_cache;
+
+	local_cache = vmalloc(size);
+	if (!local_cache) {
+		printk("couldnt allocate local cache ...\n");
+		return 0;
+	}
+	memset(local_cache, 0, size);
+
+	local_irq_save(flags);
+	local_irq_enable();
+
+	data.source = source;
+	data.target = target;
+	data.size = size;
+	data.cache = cache;
+	data.local_cache = local_cache;
+
+	if (on_each_cpu(source_handler, &data, 1, 1) != 0) {
+		printk("measure_one: timed out waiting for other CPUs\n");
+		local_irq_restore(flags);
+		return -1;
+	}
+	if (on_each_cpu(target_handler, &data, 1, 1) != 0) {
+		printk("measure_one: timed out waiting for other CPUs\n");
+		local_irq_restore(flags);
+		return -1;
+	}
+
+	vfree(local_cache);
+
+	return data.delta;
+}
+
+__initdata unsigned long sched_cache_size;
+
+/*
+ * Measure a series of task migrations and return the maximum
+ * result - the worst-case. Since this code runs early during
+ * bootup the system is 'undisturbed' and the maximum latency
+ * makes sense.
+ *
+ * As the working set we use 2.1 times the L2 cache size, this is
+ * chosen in such a nonsymmetric way so that fill_cache() doesnt
+ * iterate at power-of-2 boundaries (which might hit cache mapping
+ * artifacts and pessimise the results).
+ */
+__init static unsigned long long measure_cacheflush_time(int cpu1, int cpu2)
+{
+	unsigned long size = sched_cache_size*21/10;
+	unsigned long long delta, max = 0;
+	void *cache;
+	int i;
+
+	if (!size) {
+		printk("arch has not set cachesize - using default.\n");
+		return 0;
+	}
+	if (!cpu_online(cpu1) || !cpu_online(cpu2)) {
+		printk("cpu %d and %d not both online!\n", cpu1, cpu2);
+		return 0;
+	}
+	cache = vmalloc(size);
+	if (!cache) {
+		printk("could not vmalloc %ld bytes for cache!\n", size);
+		return 0;
+	}
+	memset(cache, 0, size);
+	for (i = 0; i < 20; i++) {
+		delta = measure_one(cache, size, cpu1, cpu2);
+		if (delta > max)
+			max = delta;
+	}
+
+	vfree(cache);
+
+	/*
+	 * A task is considered 'cache cold' if at least 2 times
+	 * the worst-case cost of migration has passed.
+	 * (this limit is only listened to if the load-balancing
+	 * situation is 'nice' - if there is a large imbalance we
+	 * ignore it for the sake of CPU utilization and
+	 * processing fairness.)
+	 *
+	 * (We use 2.1 times the L2 cachesize in our measurement,
+	 *  we keep this factor when returning.)
+	 */
+	return max;
+}
+
+unsigned long long cache_decay_nsec;
+
+void __devinit calibrate_cache_decay(void)
+{
+	int cpu1 = -1, cpu2 = -1;
+	unsigned long long min_delta = -1ULL;
+
+	printk("migration cost matrix (cache_size: %ld, cpu: %ld MHz):\n",
+		sched_cache_size, cpu_khz/1000);
+	printk("      ");
+	for (cpu1 = 0; cpu1 < NR_CPUS; cpu1++) {
+		if (!cpu_online(cpu1))
+			continue;
+		printk("  [%02d]", cpu1);
+	}
+	printk("\n");
+	for (cpu1 = 0; cpu1 < NR_CPUS; cpu1++) {
+		if (!cpu_online(cpu1))
+			continue;
+		printk("[%02d]: ", cpu1);
+		for (cpu2 = 0; cpu2 < NR_CPUS; cpu2++) {
+			unsigned long long delta;
+
+			if (!cpu_online(cpu2))
+				continue;
+			delta = measure_cacheflush_time(cpu1, cpu2);
+			
+			printk(" %3Ld.%ld", delta >> 20,
+				(((long)delta >> 10) / 102) % 10);
+			if ((cpu1 != cpu2) && (delta < min_delta))
+				min_delta = delta;
+		}
+		printk("\n");
+	}
+	printk("min_delta: %Ld\n", min_delta);
+	if (min_delta != -1ULL)
+		cache_decay_nsec = min_delta;
+	printk("using cache_decay nsec: %Ld (%Ld msec)\n",
+		cache_decay_nsec, cache_decay_nsec >> 20);
+
+
+}
+
+/*
  * Set up scheduler domains and groups.  Callers must hold the hotplug lock.
  */
 static void __devinit arch_init_sched_domains(void)
@@ -4706,6 +4932,7 @@ static void __devinit arch_init_sched_do
 	int i;
 	cpumask_t cpu_default_map;
 
+	calibrate_cache_decay();
 #if defined(CONFIG_SCHED_SMT) && defined(CONFIG_NUMA)
 	check_sibling_maps();
 #endif
--- linux/arch/ia64/kernel/domain.c.orig
+++ linux/arch/ia64/kernel/domain.c
@@ -139,6 +139,9 @@ void __devinit arch_init_sched_domains(v
 	int i;
 	cpumask_t cpu_default_map;
 
+	sched_cache_size = 9*1024*1024; // hack for Kenneth
+	calibrate_cache_decay();
+
 	/*
 	 * Setup mask for cpus without special case scheduling requirements.
 	 * For now this just excludes isolated cpus, but could be used to
--- linux/arch/i386/kernel/smpboot.c.orig
+++ linux/arch/i386/kernel/smpboot.c
@@ -873,6 +873,7 @@ static void smp_tune_scheduling (void)
 			cachesize = 16; /* Pentiums, 2x8kB cache */
 			bandwidth = 100;
 		}
+		sched_cache_size = cachesize * 1024;
 	}
 }
 
--- linux/include/asm-ia64/topology.h.orig
+++ linux/include/asm-ia64/topology.h
@@ -51,7 +51,7 @@ void build_cpu_to_node_map(void);
 	.max_interval		= 320,			\
 	.busy_factor		= 320,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
+	.cache_hot_time		= cache_decay_nsec,	\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
@@ -73,7 +73,7 @@ void build_cpu_to_node_map(void);
 	.max_interval		= 320,			\
 	.busy_factor		= 320,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
+	.cache_hot_time		= cache_decay_nsec,	\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/linux/topology.h.orig
+++ linux/include/linux/topology.h
@@ -61,6 +61,12 @@
 #endif
 
 /*
+ * total time penalty to migrate a typical application's cache contents
+ * from one CPU to another. Measured by the boot-time code.
+ */
+extern unsigned long long cache_decay_nsec;
+
+/*
  * Below are the 3 major initializers used in building sched_domains:
  * SD_SIBLING_INIT, for SMT domains
  * SD_CPU_INIT, for SMP domains
@@ -112,7 +118,7 @@
 	.max_interval		= 4,			\
 	.busy_factor		= 64,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (5*1000000/2),	\
+	.cache_hot_time		= cache_decay_nsec,	\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -527,7 +527,12 @@ extern cpumask_t cpu_isolated_map;
 extern void init_sched_build_groups(struct sched_group groups[],
 	                        cpumask_t span, int (*group_fn)(int cpu));
 extern void cpu_attach_domain(struct sched_domain *sd, int cpu);
+
 #endif /* ARCH_HAS_SCHED_DOMAIN */
+
+extern unsigned long sched_cache_size;
+extern void calibrate_cache_decay(void);
+
 #endif /* CONFIG_SMP */
 
 
--- linux/include/asm-i386/topology.h.orig
+++ linux/include/asm-i386/topology.h
@@ -75,7 +75,7 @@ static inline cpumask_t pcibus_to_cpumas
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
+	.cache_hot_time		= cache_decay_nsec,	\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/asm-ppc64/topology.h.orig
+++ linux/include/asm-ppc64/topology.h
@@ -46,7 +46,7 @@ static inline int node_to_first_cpu(int 
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
+	.cache_hot_time		= cache_decay_nsec,	\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
--- linux/include/asm-x86_64/topology.h.orig
+++ linux/include/asm-x86_64/topology.h
@@ -48,7 +48,7 @@ static inline cpumask_t __pcibus_to_cpum
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
+	.cache_hot_time		= cache_decay_nsec,	\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: Industry db benchmark result on recent 2.6 kernels
  2005-04-01  6:46                         ` Ingo Molnar
@ 2005-04-01 22:32                           ` Chen, Kenneth W
  2005-04-01 22:51                             ` Linus Torvalds
  2005-04-02  1:44                             ` Paul Jackson
  0 siblings, 2 replies; 82+ messages in thread
From: Chen, Kenneth W @ 2005-04-01 22:32 UTC (permalink / raw)
  To: 'Ingo Molnar', Paul Jackson
  Cc: Nick Piggin, torvalds, akpm, linux-kernel

Ingo Molnar wrote on Thursday, March 31, 2005 10:46 PM
> before we get into complexities, i'd like to see whether it solves Ken's
> performance problem. The attached patch (against BK-curr, but should
> apply to vanilla 2.6.12-rc1 too) adds the autodetection feature. (For
> ia64 i've hacked in a cachesize of 9MB for Ken's testsystem.)

Very nice, it had a good estimate of 9ms on my test system.  Our historical
data showed that 12ms was the best on the same system for the db workload
(that was done on 2.6.8).  Scheduler dynamic has changed in 2.6.11 and this
old finding may not apply any more for the new kernel.

migration cost matrix (cache_size: 9437184, cpu: 1500 MHz):
        [00]  [01]  [02]  [03]
[00]:    9.1   8.5   8.5   8.5
[01]:    8.5   9.1   8.5   8.5
[02]:    8.5   8.5   9.1   8.5
[03]:    8.5   8.5   8.5   9.1
min_delta: 8908106
using cache_decay nsec: 8908106 (8 msec)


Paul, you definitely want to check this out on your large numa box.  I booted
a kernel with this patch on a 32-way numa box and it took a long .... time
to produce the cost matrix.



^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: Industry db benchmark result on recent 2.6 kernels
  2005-04-01 22:32                           ` Chen, Kenneth W
@ 2005-04-01 22:51                             ` Linus Torvalds
  2005-04-02  2:19                               ` Nick Piggin
  2005-04-04  1:40                               ` Kevin Puetz
  2005-04-02  1:44                             ` Paul Jackson
  1 sibling, 2 replies; 82+ messages in thread
From: Linus Torvalds @ 2005-04-01 22:51 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Ingo Molnar', Paul Jackson, Nick Piggin, akpm,
	linux-kernel

On Fri, 1 Apr 2005, Chen, Kenneth W wrote:
> 
> Paul, you definitely want to check this out on your large numa box.  I booted
> a kernel with this patch on a 32-way numa box and it took a long .... time
> to produce the cost matrix.

Is there anything fundamentally wrong with the notion of just initializing
the cost matrix to something that isn't completely wrong at bootup, and
just lettign user space fill it in?

Then you couple that with a program that can do so automatically (ie 
move the in-kernel heuristics into user-land), and something that can 
re-load it on demand.

Voila - you have something potentially expensive that you run once, and 
then you have a matrix that can be edited by the sysadmin later and just 
re-loaded at each boot.. That sounds pretty optimal, especially in the 
sense that it allows the sysadmin to tweak things depending on the use of 
the box is he really wants to.

Hmm? Or am I just totally on crack?

		Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Industry db benchmark result on recent 2.6 kernels
  2005-04-01 22:51                             ` Linus Torvalds
@ 2005-04-02  2:19                               ` Nick Piggin
  2005-04-04  1:40                               ` Kevin Puetz
  1 sibling, 0 replies; 82+ messages in thread
From: Nick Piggin @ 2005-04-02  2:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chen, Kenneth W, 'Ingo Molnar', Paul Jackson, akpm,
	linux-kernel

Linus Torvalds wrote:
> 
> On Fri, 1 Apr 2005, Chen, Kenneth W wrote:
> 
>>Paul, you definitely want to check this out on your large numa box.  I booted
>>a kernel with this patch on a 32-way numa box and it took a long .... time
>>to produce the cost matrix.
> 
> 
> Is there anything fundamentally wrong with the notion of just initializing
> the cost matrix to something that isn't completely wrong at bootup, and
> just lettign user space fill it in?
> 

That's probably not a bad idea. You'd have to do things like
set RT scheduling for your user tasks, and not have any other
activity happening. So that effectively hangs your system for
a while anyway.

But if you run it once and dump the output to a config file...

Anyway we're faced with the immediate problem of crap performance
for 2.6.12 (for people with 1500 disks), so an in-kernel solution
might be better in the short term. I'll see if we can adapt Ingo's
thingy with something that is "good enough" and doesn't take years
to run on a 512 way.

Nick

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: Industry db benchmark result on recent 2.6 kernels
  2005-04-01 22:51                             ` Linus Torvalds
  2005-04-02  2:19                               ` Nick Piggin
@ 2005-04-04  1:40                               ` Kevin Puetz
  1 sibling, 0 replies; 82+ messages in thread
From: Kevin Puetz @ 2005-04-04  1:40 UTC (permalink / raw)
  To: linux-kernel

Linus Torvalds wrote:

> 
> 
> On Fri, 1 Apr 2005, Chen, Kenneth W wrote:
>> 
>> Paul, you definitely want to check this out on your large numa box.  I
>> booted a kernel with this patch on a 32-way numa box and it took a long
>> .... time to produce the cost matrix.
> 
> Is there anything fundamentally wrong with the notion of just initializing
> the cost matrix to something that isn't completely wrong at bootup, and
> just lettign user space fill it in?

Wouldn't getting rescheduled (and thus having another program trash the
cache on you) really mess up the data collection though? I suppose by
spawning off threads, each with a fixed affinity and SCHED_FIFO one could
hang onto the CPU to collect the data. But then it's not (a lot) different
than doing it in-kernel.
 
> Then you couple that with a program that can do so automatically (ie
> move the in-kernel heuristics into user-land), and something that can
> re-load it on demand.

This part seems sensible though :-)

> Voila - you have something potentially expensive that you run once, and
> then you have a matrix that can be edited by the sysadmin later and just
> re-loaded at each boot.. That sounds pretty optimal, especially in the
> sense that it allows the sysadmin to tweak things depending on the use of
> the box is he really wants to.
> 
> Hmm? Or am I just totally on crack?
> 
> Linus



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Industry db benchmark result on recent 2.6 kernels
  2005-04-01 22:32                           ` Chen, Kenneth W
  2005-04-01 22:51                             ` Linus Torvalds
@ 2005-04-02  1:44                             ` Paul Jackson
  2005-04-02  2:05                               ` Chen, Kenneth W
  1 sibling, 1 reply; 82+ messages in thread
From: Paul Jackson @ 2005-04-02  1:44 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: mingo, nickpiggin, torvalds, akpm, linux-kernel

Kenneth wrote:
> Paul, you definitely want to check this out on your large numa box.

Interesting - thanks.  I can get a kernel patched and booted on a big
box easily enough.  I don't know how to run an "industry db benchmark",
and benchmarks aren't my forte.

Should I rope in one of our guys who is benchmark savvy, or are there
some instructions you can point to for running an appropriate benchmark?

Or are we just interested, first of all, in what sort of values this
cost matrix gets initialized with (and how slow it is to compute)?

I can get time on a 64-cpu with a days notice, and time on a 512-cpu
with 2 or 3 days notice.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: Industry db benchmark result on recent 2.6 kernels
  2005-04-02  1:44                             ` Paul Jackson
@ 2005-04-02  2:05                               ` Chen, Kenneth W
  2005-04-02  2:38                                 ` Paul Jackson
  2005-04-03  6:36                                 ` David Lang
  0 siblings, 2 replies; 82+ messages in thread
From: Chen, Kenneth W @ 2005-04-02  2:05 UTC (permalink / raw)
  To: 'Paul Jackson'; +Cc: mingo, nickpiggin, torvalds, akpm, linux-kernel

Paul Jackson wrote on Friday, April 01, 2005 5:45 PM
> Kenneth wrote:
> > Paul, you definitely want to check this out on your large numa box.
>
> Interesting - thanks.  I can get a kernel patched and booted on a big
> box easily enough.  I don't know how to run an "industry db benchmark",
> and benchmarks aren't my forte.

To run this "industry db benchmark", assuming you have a 32-way numa box,
I recommend buying the following:

512 GB memory
1500 73 GB 15k-rpm fiber channel disks
50 hardware raid controllers, make sure you get the top of the line model
   (the one has 1GB memory in the controller).
25 fiber channel controllers
4  gigabit ethernet controllers.
12 rack frames

Then you will be off to go.  Oh, get several 220 volt power outlets too,
probably some refrigeration unit will go along with that.  Sorry, I
haven't mention the mid-tier and the client machines yet.

;-)

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Industry db benchmark result on recent 2.6 kernels
  2005-04-02  2:05                               ` Chen, Kenneth W
@ 2005-04-02  2:38                                 ` Paul Jackson
  2005-04-03  6:36                                 ` David Lang
  1 sibling, 0 replies; 82+ messages in thread
From: Paul Jackson @ 2005-04-02  2:38 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: mingo, nickpiggin, torvalds, akpm, linux-kernel

Kenneth wrote:
> I recommend buying the following:

ah so ... I think I'll skip running the industry db benchmark
for now, if that's all the same.

What sort of feedback are you looking for from my running this
patch?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: Industry db benchmark result on recent 2.6 kernels
  2005-04-02  2:05                               ` Chen, Kenneth W
  2005-04-02  2:38                                 ` Paul Jackson
@ 2005-04-03  6:36                                 ` David Lang
  2005-04-03  6:53                                   ` Andreas Dilger
  1 sibling, 1 reply; 82+ messages in thread
From: David Lang @ 2005-04-03  6:36 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Paul Jackson', mingo, nickpiggin, torvalds, akpm,
	linux-kernel

On Fri, 1 Apr 2005, Chen, Kenneth W wrote:

> To run this "industry db benchmark", assuming you have a 32-way numa box,
> I recommend buying the following:
>
> 512 GB memory
> 1500 73 GB 15k-rpm fiber channel disks
> 50 hardware raid controllers, make sure you get the top of the line model
>   (the one has 1GB memory in the controller).
> 25 fiber channel controllers
> 4  gigabit ethernet controllers.
> 12 rack frames

Ken, given that you don't have the bandwidth to keep all of those disks 
fully utilized, do you have any idea how big a performance hit you would 
take going to larger, but slower SATA drives?

given that this would let you get the same storage with about 1200 fewer 
drives (with corresponding savings in raid controllers, fiberchannel 
controllers and rack frames) it would be interesting to know how close it 
would be (for a lot of people the savings, which probably are within 
spitting distance of $1M could be work the decrease in performance)

David Lang

-- 
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
  -- C.A.R. Hoare

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Industry db benchmark result on recent 2.6 kernels
  2005-04-03  6:36                                 ` David Lang
@ 2005-04-03  6:53                                   ` Andreas Dilger
  2005-04-03  7:23                                     ` David Lang
  0 siblings, 1 reply; 82+ messages in thread
From: Andreas Dilger @ 2005-04-03  6:53 UTC (permalink / raw)
  To: David Lang
  Cc: Chen, Kenneth W, 'Paul Jackson', mingo, nickpiggin,
	torvalds, akpm, linux-kernel

On Apr 02, 2005  22:36 -0800, David Lang wrote:
> On Fri, 1 Apr 2005, Chen, Kenneth W wrote:
> >To run this "industry db benchmark", assuming you have a 32-way numa box,
> >I recommend buying the following:
> >
> >512 GB memory
> >1500 73 GB 15k-rpm fiber channel disks
> >50 hardware raid controllers, make sure you get the top of the line model
> >  (the one has 1GB memory in the controller).
> >25 fiber channel controllers
> >4  gigabit ethernet controllers.
> >12 rack frames
> 
> Ken, given that you don't have the bandwidth to keep all of those disks 
> fully utilized, do you have any idea how big a performance hit you would 
> take going to larger, but slower SATA drives?
> 
> given that this would let you get the same storage with about 1200 fewer 
> drives (with corresponding savings in raid controllers, fiberchannel 
> controllers and rack frames) it would be interesting to know how close it 
> would be (for a lot of people the savings, which probably are within 
> spitting distance of $1M could be work the decrease in performance)

For benchmarks like these, the issue isn't the storage capacity, but
rather the ability to have lots of heads seeking concurrently to
access the many database tables.  At one large site I used to work at,
the database ran on hundreds of 1, 2, and 4GB disks long after they
could be replaced by many fewer, larger disks...

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Industry db benchmark result on recent 2.6 kernels
  2005-04-03  6:53                                   ` Andreas Dilger
@ 2005-04-03  7:23                                     ` David Lang
  2005-04-03  7:38                                       ` Nick Piggin
  0 siblings, 1 reply; 82+ messages in thread
From: David Lang @ 2005-04-03  7:23 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Chen, Kenneth W, 'Paul Jackson', mingo, nickpiggin,
	torvalds, akpm, linux-kernel

On Sat, 2 Apr 2005, Andreas Dilger wrote:

>> given that this would let you get the same storage with about 1200 fewer
>> drives (with corresponding savings in raid controllers, fiberchannel
>> controllers and rack frames) it would be interesting to know how close it
>> would be (for a lot of people the savings, which probably are within
>> spitting distance of $1M could be work the decrease in performance)
>
> For benchmarks like these, the issue isn't the storage capacity, but
> rather the ability to have lots of heads seeking concurrently to
> access the many database tables.  At one large site I used to work at,
> the database ran on hundreds of 1, 2, and 4GB disks long after they
> could be replaced by many fewer, larger disks...

I can understand this to a point, but it seems to me that after you get 
beyond some point you stop gaining from this (simply becouse you run out 
of bandwidth to keep all the heads busy). I would have guessed that this 
happened somewhere in the hundreds of drives rather then the thousands, so 
going from 1500x73G to 400x300G (even if this drops you from 15Krpm to 
10Krpm) would still saturate the interface bandwidth before the drives

David Lang

-- 
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
  -- C.A.R. Hoare

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Industry db benchmark result on recent 2.6 kernels
  2005-04-03  7:23                                     ` David Lang
@ 2005-04-03  7:38                                       ` Nick Piggin
  0 siblings, 0 replies; 82+ messages in thread
From: Nick Piggin @ 2005-04-03  7:38 UTC (permalink / raw)
  To: David Lang
  Cc: Andreas Dilger, Chen, Kenneth W, 'Paul Jackson', mingo,
	torvalds, akpm, linux-kernel

David Lang wrote:

> On Sat, 2 Apr 2005, Andreas Dilger wrote:
>
>>> given that this would let you get the same storage with about 1200 
>>> fewer
>>> drives (with corresponding savings in raid controllers, fiberchannel
>>> controllers and rack frames) it would be interesting to know how 
>>> close it
>>> would be (for a lot of people the savings, which probably are within
>>> spitting distance of $1M could be work the decrease in performance)
>>
>>
>> For benchmarks like these, the issue isn't the storage capacity, but
>> rather the ability to have lots of heads seeking concurrently to
>> access the many database tables.  At one large site I used to work at,
>> the database ran on hundreds of 1, 2, and 4GB disks long after they
>> could be replaced by many fewer, larger disks...
>
>
> I can understand this to a point, but it seems to me that after you 
> get beyond some point you stop gaining from this (simply becouse you 
> run out of bandwidth to keep all the heads busy). I would have guessed 
> that this happened somewhere in the hundreds of drives rather then the 
> thousands, so going from 1500x73G to 400x300G (even if this drops you 
> from 15Krpm to 10Krpm) would still saturate the interface bandwidth 
> before the drives
>

But in this case probably not - Ken increases IO capacity until the CPUs 
become saturated.
So there probably isn't a very large margin for error, you might need 
2000 of the slower
SATA disks to achieve a similar IOPS capacity.



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Industry db benchmark result on recent 2.6 kernels
  2005-04-01  6:05                       ` Paul Jackson
  2005-04-01  6:34                         ` Nick Piggin
  2005-04-01  6:46                         ` Ingo Molnar
@ 2005-04-01  6:59                         ` Ingo Molnar
  2005-04-01  9:29                           ` Paul Jackson
  2 siblings, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2005-04-01  6:59 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Nick Piggin, kenneth.w.chen, torvalds, akpm, linux-kernel

* Paul Jackson <pj@engr.sgi.com> wrote:

> Nick wrote:
> > Ingo had a cool patch to estimate dirty => dirty cacheline transfer latency
> > ... Unfortunately ... and it is an O(cpus^2) operation.
> 
> Yes - a cool patch.
> 
> If we had an arch-specific bit of code, that for any two cpus, could 
> give a 'pseudo-distance' between them, where the only real 
> requirements were that (1) if two pairs of cpus had the same 
> pseudo-distance, then that meant they had the same size, layout, kind 
> and speed of bus amd cache hardware between them (*), and (2) it was 
> cheap - hardly more than a few lines of code and a subroutine call to 
> obtain, then Ingo's code could be:

yeah. The search can be limited quite drastically if all duplicate 
constellations of CPUs (which is a function of the topology) are only 
measured once.

but can be 'pseudo-distance' be calculated accurately enough? If it's a 
scalar, how do you make sure that unique paths for data to flow have 
different distances? The danger is 'false sharing' in the following 
scenario: lets say CPUs #1 and #2 are connected via hardware H1,H2,H3, 
CPUs #3 and #4 are connected via H4,H5,H6. Each hardware component is 
unique and has different characteristics. (e.g. this scenario can happen 
when different speed CPUs are mixed into the same system - or if there 
is some bus assymetry)

It has to be made sure that H1+H2+H3 != H4+H5+H6, otherwise false 
sharing will happen. For that 'uniqueness of sum' to be guaranteed, one 
has to assign power-of-two values to each separate type of hardware 
component.

[ or one has to assing very accurate 'distance' values to hardware 
components. (adding another source for errors - i.e. false sharing of 
the migration value) ]

and even the power-of-two assignment method has its limitations: it 
obviously runs out at 32/64 components (i'm not sure we can do that), 
and if a given component type can be present in the same path _twice_, 
that component will have to take two bits.

or is the 'at most 64 different hardware component types' limit ok? (it 
feels like a limit we might regret later.)

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Industry db benchmark result on recent 2.6 kernels
  2005-04-01  6:59                         ` Ingo Molnar
@ 2005-04-01  9:29                           ` Paul Jackson
  2005-04-01 10:34                             ` Ingo Molnar
  0 siblings, 1 reply; 82+ messages in thread
From: Paul Jackson @ 2005-04-01  9:29 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: nickpiggin, kenneth.w.chen, torvalds, akpm, linux-kernel

> It has to be made sure that H1+H2+H3 != H4+H5+H6,

Yeah - if you start trying to think about the general case here, the
combinations tend to explode on one.

I'm thinking we get off easy, because:

 1) Specific arch's can apply specific short cuts.

	My intuition was that any specific architecture, when it
	got down to specifics, could find enough ways to cheat
	so that it could get results quickly, that easily fit
	in a single 'distance' word, which results were 'close
	enough.'

 2) The bigger the system, the more uniform its core hardware.

	At least SGI's big iron systems are usually pretty
	uniform in the hardware that matters here.  We might mix
	two cpu speeds, or a couple of memory sizes.  Not much
	more, at least that I know of.  A 1024 NUMA cobbled
	together from a wide variety of parts would be a very
	strange beast indeed.

 3) Approximate results (aliasing at the edges) are ok.

	If the SN2 arch code ends up telling the cache latency
	initialization code that two cpus on opposite sides of
	a 1024 cpu system are the same distance as another such
	pair, even though they aren't exactly the same distance,
	does anyone care?  Not I.

So I think we've got plenty of opportunity to special case arch's,
plenty of headroom, and plenty of latitude to bend not break if we do
start to push the limits.

Think of that 64 bits as if it was floating point, not int.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Industry db benchmark result on recent 2.6 kernels
  2005-04-01  9:29                           ` Paul Jackson
@ 2005-04-01 10:34                             ` Ingo Molnar
  2005-04-01 14:39                               ` Paul Jackson
  0 siblings, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2005-04-01 10:34 UTC (permalink / raw)
  To: Paul Jackson; +Cc: nickpiggin, kenneth.w.chen, torvalds, akpm, linux-kernel

* Paul Jackson <pj@engr.sgi.com> wrote:

> > It has to be made sure that H1+H2+H3 != H4+H5+H6,
> 
> Yeah - if you start trying to think about the general case here, the 
> combinations tend to explode on one.

well, while i dont think we need that much complexity, the most generic 
case is a representation of the actual hardware (cache/bus) layout, 
where separate hardware component types have different IDs.

e.g. a simple hiearchy would be:

         ____H1____
      _H2_        _H2_
    H3    H3    H3    H3
   [P1]  [P2]  [P3]  [P4]

Then all that has to happen is to build a 'path' of ids (e.g. "H3,H2,H3" 
is a path), which is a vector of IDs, and an array of already measured 
vectors. These IDs never get added so they just have to be unique per 
type of component.

there are two different vectors possible: H3,H2,H3 and H3,H2,H1,H2,H3, 
so two measurements are needed, between P1 and P2 and P1 and P3. (the 
first natural occurence of each path)

this is tree walking and vector building/matching. There is no 
restriction on the layout of the hierarchy, other than it has to be a 
tree. (no circularity) It's easy to specify such a tree, and there are 
no 'mixup' dangers.

> I'm thinking we get off easy, because:
> 
>  1) Specific arch's can apply specific short cuts.
> 
> 	My intuition was that any specific architecture, when it
> 	got down to specifics, could find enough ways to cheat
> 	so that it could get results quickly, that easily fit
> 	in a single 'distance' word, which results were 'close
> 	enough.'

yes - but the fundamental problem is already that we do have per-arch 
shortcuts: the cache_hot value. If an arch wanted to set it up, it could 
do it. But it's not easy to set it up and the value is not intuitive. So 
the key is to make it convenient and fool-proof to set up the data - 
otherwise it just wont be used, or will be used incorrectly.

but i'd too go for the simpler 'pseudo-distance' function, because it's 
so much easier to iterate through it. But it's not intuitive. Maybe it 
should be called 'connection ID': a unique ID for each uniqe type of 
path between CPUs. An architecture can take shortcuts if it has a simple 
layout (most have). I.e.:

	sched_cpu_connection_type(int cpu1, int cpu2)

would return a unique type ID for different. Note that 'distance' (or 
'memory access latency', or 'NUMA factor') as a unit is not sufficient, 
as it does not take cache-size nor CPU speed into account, which does 
play a role in the migration characteristics.

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Industry db benchmark result on recent 2.6 kernels
  2005-04-01 10:34                             ` Ingo Molnar
@ 2005-04-01 14:39                               ` Paul Jackson
  0 siblings, 0 replies; 82+ messages in thread
From: Paul Jackson @ 2005-04-01 14:39 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: nickpiggin, kenneth.w.chen, torvalds, akpm, linux-kernel

Ingo wrote:
> but i'd too go for the simpler 'pseudo-distance' function, because it's 
> so much easier to iterate through it. But it's not intuitive. Maybe it 
> should be called 'connection ID': a unique ID for each uniqe type of 
> path between CPUs.

Well said.  Thanks.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Industry db benchmark result on recent 2.6 kernels
  2005-03-31 22:14                   ` Chen, Kenneth W
  2005-03-31 23:35                     ` Nick Piggin
@ 2005-04-01  4:52                     ` Ingo Molnar
  2005-04-01  5:14                       ` Chen, Kenneth W
  1 sibling, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2005-04-01  4:52 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Linus Torvalds', 'Nick Piggin',
	'Andrew Morton', linux-kernel


* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

> The low point in 2.6.11 could very well be the change in the 
> scheduler. It does too many load balancing in the wake up path and 
> possibly made a lot of unwise decision.  For example, in 
> try_to_wake_up(), it will try SD_WAKE_AFFINE for task that is not hot.  
> By not hot, it looks at when it was last ran and compare to a constant 
> sd->cache_hot_time.  The problem is this cache_hot_time is fixed for 
> the entire universe, whether it is a little celeron processor with 
> 128KB of cache or a sever class Itanium2 processor with 9MB L3 cache.  
> This one size fit all isn't really working at all.

the current scheduler queue in -mm has some experimental bits as well 
which will reduce the amount of balancing. But we cannot just merge them 
an bloc right now, there's been too much back and forth in recent 
kernels. The safe-to-merge-for-2.6.12 bits are already in -BK.

> We had experimented that parameter earlier and found it was one of the 
> major source of low point in 2.6.8.  I debated the issue on LKML about 
> 4 month ago and finally everyone agreed to make that parameter a boot 
> time param. The change made into bk tree for 2.6.9 release, but 
> somehow it got ripped right out 2 days after it went in.  I suspect 
> 2.6.11 is a replay of 2.6.8 for the regression in the scheduler.  We 
> are running experiment to confirm this theory.

the current defaults for cache_hot_time are 10 msec for NUMA domains, 
and 2.5 msec for SMP domains. Clearly too low for CPUs with 9MB cache.  
Are you increasing cache_hot_time in your experiment? If that solves 
most of the problem that would be an easy thing to fix for 2.6.12.

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: Industry db benchmark result on recent 2.6 kernels
  2005-04-01  4:52                     ` Ingo Molnar
@ 2005-04-01  5:14                       ` Chen, Kenneth W
  0 siblings, 0 replies; 82+ messages in thread
From: Chen, Kenneth W @ 2005-04-01  5:14 UTC (permalink / raw)
  To: 'Ingo Molnar'
  Cc: 'Linus Torvalds', 'Nick Piggin',
	'Andrew Morton', linux-kernel

Ingo Molnar wrote on Thursday, March 31, 2005 8:52 PM
> the current scheduler queue in -mm has some experimental bits as well
> which will reduce the amount of balancing. But we cannot just merge them
> an bloc right now, there's been too much back and forth in recent
> kernels. The safe-to-merge-for-2.6.12 bits are already in -BK.

I agree, please give me some time to go through these patches on our db
setup.

> the current defaults for cache_hot_time are 10 msec for NUMA domains,
> and 2.5 msec for SMP domains. Clearly too low for CPUs with 9MB cache.
> Are you increasing cache_hot_time in your experiment? If that solves
> most of the problem that would be an easy thing to fix for 2.6.12.

Yes, we are increasing the number in our experiments.  It's in the queue
and I should have a result soon.



^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: Industry db benchmark result on recent 2.6 kernels
  2005-03-30  0:00 ` Linus Torvalds
  2005-03-30  0:22   ` Chen, Kenneth W
  2005-03-30  0:46   ` Chen, Kenneth W
@ 2005-04-01 22:51   ` Chen, Kenneth W
  2 siblings, 0 replies; 82+ messages in thread
From: Chen, Kenneth W @ 2005-04-01 22:51 UTC (permalink / raw)
  To: 'Linus Torvalds'; +Cc: 'Andrew Morton', linux-kernel

Linus Torvalds wrote on Tuesday, March 29, 2005 4:00 PM
> Also, it would be absolutely wonderful to see a finer granularity (which
> would likely also answer the stability question of the numbers). If you
> can do this with the daily snapshots, that would be great. If it's not
> easily automatable, or if a run takes a long time, maybe every other or
> every third day would be possible?
>
> Doing just release kernels means that there will be a two-month lag
> between telling developers that something pissed up performance. Doing it
> every day (or at least a couple of times a week) will be much more
> interesting.

Indeed, we agree that regular disciplined performance testing is important
and we (as Intel) will take on the challenge to support the Linux community.
I just got an approval to start this project.  We will report more detail
on how/where to publish the performance number, etc.

- Ken



^ permalink raw reply	[flat|nested] 82+ messages in thread

end of thread, other threads:[~2005-04-08  2:28 UTC | newest]

Thread overview: 82+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-04-02  1:00 Industry db benchmark result on recent 2.6 kernels Chen, Kenneth W
2005-04-02  2:12 ` Nick Piggin
2005-04-02 14:53 ` [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels] Ingo Molnar
2005-04-02 21:22   ` Paul Jackson
2005-04-03  5:53   ` Paul Jackson
2005-04-03  7:04     ` Ingo Molnar
2005-04-03  8:15       ` Paul Jackson
2005-04-03 11:34       ` Paul Jackson
2005-04-03 14:12         ` Paul Jackson
2005-04-03 15:01           ` Ingo Molnar
2005-04-03 22:30             ` Paul Jackson
2005-04-05  6:53               ` [patch] sched: auto-tune migration costs Andi Kleen
2005-04-05  7:20                 ` Paul Jackson
2005-04-03 15:24           ` [patch] sched: auto-tune migration costs [was: Re: Industry db benchmark result on recent 2.6 kernels] Ingo Molnar
2005-04-03 23:08             ` Paul Jackson
2005-04-04  2:08               ` Nick Piggin
2005-04-04  3:55                 ` Paul Jackson
2005-04-04  5:45                   ` Ingo Molnar
2005-04-04  5:50                     ` Paul Jackson
2005-04-04  5:56                   ` Nick Piggin
2005-04-04  6:38                     ` Paul Jackson
2005-04-04  6:48                       ` Ingo Molnar
2005-04-04  7:37                         ` Paul Jackson
2005-04-04  6:50               ` Ingo Molnar
2005-04-04  7:27                 ` Paul Jackson
2005-04-03 14:29         ` Ingo Molnar
2005-04-03 23:15           ` Paul Jackson
2005-04-04  1:31           ` Chen, Kenneth W
2005-04-04  6:24             ` Ingo Molnar
2005-04-04  6:39               ` Ingo Molnar
2005-04-06  0:08               ` Chen, Kenneth W
2005-04-04  4:25         ` Andy Lutomirski
2005-04-04  4:36           ` Paul Jackson
2005-04-04  1:11       ` Chen, Kenneth W
2005-04-04 11:37         ` Ingo Molnar
2005-04-04 17:27           ` Paul Jackson
2005-04-05  1:43           ` Chen, Kenneth W
2005-04-05  1:49             ` Ingo Molnar
2005-04-05  3:04               ` Ingo Molnar
2005-04-06  3:33                 ` Chen, Kenneth W
2005-04-06  6:45                   ` Ingo Molnar
2005-04-08  2:27                     ` Chen, Kenneth W
2005-04-03  9:01     ` Paul Jackson
  -- strict thread matches above, loose matches on Subject: below --
2005-04-01 16:34 Industry db benchmark result on recent 2.6 kernels Manfred Spraul
2005-03-28 19:33 Chen, Kenneth W
2005-03-28 19:50 ` Dave Hansen
2005-03-28 20:01   ` Chen, Kenneth W
2005-03-30  0:00 ` Linus Torvalds
2005-03-30  0:22   ` Chen, Kenneth W
2005-03-30  0:46   ` Chen, Kenneth W
2005-03-30  0:57     ` Linus Torvalds
2005-03-30  1:31       ` Nick Piggin
2005-03-30  1:38         ` Chen, Kenneth W
2005-03-30  1:56           ` Nick Piggin
2005-03-31 14:14           ` Ingo Molnar
2005-03-31 19:53             ` Chen, Kenneth W
2005-03-31 20:05               ` Linus Torvalds
2005-03-31 20:08                 ` Linus Torvalds
2005-03-31 22:14                   ` Chen, Kenneth W
2005-03-31 23:35                     ` Nick Piggin
2005-04-01  6:05                       ` Paul Jackson
2005-04-01  6:34                         ` Nick Piggin
2005-04-01  7:19                           ` Paul Jackson
2005-04-01  6:46                         ` Ingo Molnar
2005-04-01 22:32                           ` Chen, Kenneth W
2005-04-01 22:51                             ` Linus Torvalds
2005-04-02  2:19                               ` Nick Piggin
2005-04-04  1:40                               ` Kevin Puetz
2005-04-02  1:44                             ` Paul Jackson
2005-04-02  2:05                               ` Chen, Kenneth W
2005-04-02  2:38                                 ` Paul Jackson
2005-04-03  6:36                                 ` David Lang
2005-04-03  6:53                                   ` Andreas Dilger
2005-04-03  7:23                                     ` David Lang
2005-04-03  7:38                                       ` Nick Piggin
2005-04-01  6:59                         ` Ingo Molnar
2005-04-01  9:29                           ` Paul Jackson
2005-04-01 10:34                             ` Ingo Molnar
2005-04-01 14:39                               ` Paul Jackson
2005-04-01  4:52                     ` Ingo Molnar
2005-04-01  5:14                       ` Chen, Kenneth W
2005-04-01 22:51   ` Chen, Kenneth W

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox