* boot-time slowdown for measure_migration_cost
@ 2006-01-27 21:03 Bjorn Helgaas
2006-01-30 17:21 ` Ingo Molnar
0 siblings, 1 reply; 12+ messages in thread
From: Bjorn Helgaas @ 2006-01-27 21:03 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-ia64, linux-kernel
The boot-time migration cost auto-tuning stuff seems to have
been merged to Linus' tree since 2.6.15. On little one- or
two-processor systems, the time required to measure the
migration costs isn't very noticeable, but by the time we
get to even a four-processor ia64 box, it adds about
30 seconds to the boot time, which seems like a lot.
Is that expected? Is the information we get really worth
that much? Could the measurement be done at run-time
instead? Is there a smaller hammer we could use, e.g.,
flushing just the buffer rather than the *entire* cache?
Did we just implement sched_cacheflush() incorrectly for
ia64?
Only ia64, x86, and x86_64 currently have a non-empty
sched_cacheflush(), and the x86* ones contain only "wbinvd()".
So I suspect that only ia64 sees this slowdown. But I would
guess that other arches will implement it in the future.
Bjorn
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: boot-time slowdown for measure_migration_cost
2006-01-27 21:03 boot-time slowdown for measure_migration_cost Bjorn Helgaas
@ 2006-01-30 17:21 ` Ingo Molnar
2006-01-30 18:53 ` Luck, Tony
2006-01-30 19:26 ` Chen, Kenneth W
0 siblings, 2 replies; 12+ messages in thread
From: Ingo Molnar @ 2006-01-30 17:21 UTC (permalink / raw)
To: Bjorn Helgaas; +Cc: Ingo Molnar, linux-ia64, linux-kernel
* Bjorn Helgaas <bjorn.helgaas@hp.com> wrote:
> The boot-time migration cost auto-tuning stuff seems to have been
> merged to Linus' tree since 2.6.15. On little one- or two-processor
> systems, the time required to measure the migration costs isn't very
> noticeable, but by the time we get to even a four-processor ia64 box,
> it adds about 30 seconds to the boot time, which seems like a lot.
>
> Is that expected? Is the information we get really worth that much?
> Could the measurement be done at run-time instead? Is there a smaller
> hammer we could use, e.g., flushing just the buffer rather than the
> *entire* cache? Did we just implement sched_cacheflush() incorrectly
> for ia64?
>
> Only ia64, x86, and x86_64 currently have a non-empty
> sched_cacheflush(), and the x86* ones contain only "wbinvd()". So I
> suspect that only ia64 sees this slowdown. But I would guess that
> other arches will implement it in the future.
the main cost comes from accessing the test-buffer when the buffer size
gets above the real cachesize. There are a couple of ways to improve
that:
- double-check that max_cache_size gets set up correctly on your
architecture - the code searches from ~64K to 2*max_cache_size.
- take the values that are auto-detected and use the migration_cost=
boot parameter - see Documentation/kernel-parameters.txt:
migration_cost=
[KNL,SMP] debug: override scheduler migration costs
Format: <level-1-usecs>,<level-2-usecs>,...
This debugging option can be used to override the
default scheduler migration cost matrix. The numbers
are indexed by 'CPU domain distance'.
E.g. migration_cost=1000,2000,3000 on an SMT NUMA
box will set up an intra-core migration cost of
1 msec, an inter-core migration cost of 2 msecs,
and an inter-node migration cost of 3 msecs.
(a distribution could do this automatically as well in the installer,
i've constructed the bootup printout to be in the format that is
needed for migration_cost. I have not tested this too extensively
though, so double-check the result via an additional
migration_debug=2 printout as well! Let me know if you find any bugs
here.)
via this solution you will get zero overhead on subsequent bootups.
- in kernel/sched.c, decrease ITERATIONS from 2 to 1. This will make the
measurement more noisy though.
- in kernel/sched.c, change this line:
size = size * 20 / 19;
to:
size = size * 10 / 9;
this will probably halve the cost - again at the expense of
accuracy and statistical stability.
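As a user-space illustration of the migration_cost= format documented above, here is a hypothetical parser sketch (function and constant names are illustrative; this is not the kernel's actual parameter parser):

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_DOMAIN_DISTANCE 8

/*
 * Parse a migration_cost=-style string, e.g. "1000,2000,3000",
 * into a per-domain-distance cost array (microseconds).
 * Returns the number of levels parsed. Hypothetical user-space
 * sketch, not the kernel's boot-parameter code.
 */
static int parse_migration_cost(const char *s, unsigned long cost[])
{
	int n = 0;
	char *end;

	while (n < MAX_DOMAIN_DISTANCE) {
		cost[n++] = strtoul(s, &end, 10);
		if (*end != ',')
			break;
		s = end + 1;
	}
	return n;
}
```

With Ingo's SMT-NUMA example, "1000,2000,3000" yields three levels: 1 msec intra-core, 2 msecs inter-core, 3 msecs inter-node.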
Ingo
* Re: boot-time slowdown for measure_migration_cost
2006-01-30 17:21 ` Ingo Molnar
@ 2006-01-30 18:53 ` Luck, Tony
2006-01-30 19:24 ` Ingo Molnar
2006-01-30 20:43 ` John Hawkes
2006-01-30 19:26 ` Chen, Kenneth W
1 sibling, 2 replies; 12+ messages in thread
From: Luck, Tony @ 2006-01-30 18:53 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Bjorn Helgaas, Ingo Molnar, linux-ia64, linux-kernel
On Mon, Jan 30, 2006 at 06:21:40PM +0100, Ingo Molnar wrote:
> - double-check that max_cache_size gets set up correctly on your
> architecture - the code searches from ~64K to 2*max_cache_size.
Ia64 gets that from PAL in get_max_cacheline_size() in ia64/kernel/setup.c
A quick printk() in there confirms that we get the right answer (9MB for
me), and that it happens before we compute the migration cost.
> - take the values that are auto-detected and use the migration_cost=
> boot parameter - see Documentation/kernel-parameters.txt:
> ...
> via this solution you will get zero overhead on subsequent bootups.
But if you are going to go this route, you could drop all this code from
the kernel and have a hard-wired constant, with a user-mode test program
to compute the more accurate value.
> - in kernel/sched.c, decrease ITERATIONS from 2 to 1. This will make the
> measurement more noisy though.
Doing this drops the time to compute the value from 15.58s to 10.39s, while
the value of migration_cost changes from 10112 to 9909.
> - in kernel/sched.c, change this line:
> size = size * 20 / 19;
> to:
> size = size * 10 / 9;
Doing this instead of changing ITERATIONS makes the computation take 7.79s
and the computed migration_cost is 9987.
Doing both gets the time down to 5.20s, and the migration_cost=9990.
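The effect of the 20/19 -> 10/9 change measured above can be sanity-checked in user space: growing the test buffer geometrically from 64K to 2*max_cache_size, the number of sizes visited roughly halves. A minimal sketch, modelling only the size stepping of measure_migration_cost() and nothing else:

```c
#include <assert.h>

/*
 * Count how many buffer sizes a geometric search visits between
 * min_size and max_size, growing by num/den each step (20/19 for
 * 5% steps, 10/9 for ~11% steps). Models only the size-stepping
 * loop, not the per-size measurement.
 */
static int count_sizes(unsigned long long min_size,
		       unsigned long long max_size,
		       unsigned num, unsigned den)
{
	unsigned long long size = min_size;
	int n = 0;

	while (size <= max_size) {
		n++;
		size = size * num / den;
	}
	return n;
}
```

For a 9MB cache (search up to ~18MB), 20/19 steps visit roughly twice as many sizes as 10/9 steps, consistent with Ingo's "probably halve the cost" estimate and the 7.79s-vs-15.58s timings.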
So the variation in the computed value of migration_cost was at worst
2% with these modifications to the algorithm. Do you really need to know
the value to this accuracy? What 2nd order bad effects would occur from
using an off-by-2% value for scheduling decisions?
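The "at worst 2%" figure checks out against the numbers above (baseline 10112 vs 9909, 9987, and 9990). A quick integer check, in per-mille to avoid floating point:

```c
#include <assert.h>

/*
 * Relative deviation of a measured migration_cost from a baseline,
 * in tenths of a percent (per-mille), using integer arithmetic.
 */
static int deviation_permille(long baseline, long value)
{
	long diff = baseline > value ? baseline - value : value - baseline;

	return (int)(diff * 1000 / baseline);
}
```

The ITERATIONS=1 run deviates 2.0% from baseline; the 10/9-step run and the combined run deviate only about 1.2% each.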
On the plus side Prarit's results show that this time isn't scaling with
NR_CPUS ... apparently just cache size and number of domains are significant
in the time to compute.
-Tony
* Re: boot-time slowdown for measure_migration_cost
2006-01-30 18:53 ` Luck, Tony
@ 2006-01-30 19:24 ` Ingo Molnar
2006-01-30 20:00 ` Luck, Tony
2006-01-30 20:43 ` John Hawkes
1 sibling, 1 reply; 12+ messages in thread
From: Ingo Molnar @ 2006-01-30 19:24 UTC (permalink / raw)
To: Luck, Tony
Cc: Bjorn Helgaas, Ingo Molnar, linux-ia64, linux-kernel,
Andrew Morton, Linus Torvalds
* Luck, Tony <tony.luck@intel.com> wrote:
> Doing this drops the time to compute the value from 15.58s to 10.39s,
> while the value of migration_cost changes from 10112 to 9909.
>
> > - in kernel/sched.c, change this line:
> > size = size * 20 / 19;
> > to:
> > size = size * 10 / 9;
>
> Doing this instead of changing ITERATIONS makes the computation take
> 7.79s and the computed migration_cost is 9987.
>
> Doing both gets the time down to 5.20s, and the migration_cost=9990.
ok, that's good enough i think - we could certainly do the patch below
in v2.6.16.
> On the plus side Prarit's results show that this time isn't scaling
> with NR_CPUS ... apparently just cache size and number of domains are
> significant in the time to compute.
yes, this comes from the algorithm, it only computes once per distance
(and uses the cached value from then on), independently of the number of
CPUs.
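The once-per-distance behaviour can be sketched as follows (measure_cost() is a stand-in stub for the expensive cache-flush measurement, not the kernel function):

```c
#include <assert.h>

#define MAX_DOMAIN_DISTANCE 8

static long long cost_cache[MAX_DOMAIN_DISTANCE];
static int measurements;	/* how many real measurements ran */

/* Stand-in stub for the expensive per-distance measurement. */
static long long measure_cost(int distance)
{
	measurements++;
	return 10000LL * (distance + 1);	/* dummy value */
}

/*
 * Return the migration cost for a domain distance, measuring it
 * only on first use and caching the result - so total work scales
 * with the number of distinct distances, not with NR_CPUS.
 */
static long long migration_cost_for(int distance)
{
	if (!cost_cache[distance])
		cost_cache[distance] = measure_cost(distance);
	return cost_cache[distance];
}
```

Repeated queries for the same distance hit the cache, which is why Prarit's 64p timings track cache size and domain count rather than CPU count.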
Ingo
---
reduce the amount of time the migration cost calculations cost during
bootup.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -5141,7 +5141,7 @@ static void init_sched_build_groups(stru
#define SEARCH_SCOPE 2
#define MIN_CACHE_SIZE (64*1024U)
#define DEFAULT_CACHE_SIZE (5*1024*1024U)
-#define ITERATIONS 2
+#define ITERATIONS 1
#define SIZE_THRESH 130
#define COST_THRESH 130
@@ -5480,9 +5480,9 @@ static unsigned long long measure_migrat
break;
}
/*
- * Increase the cachesize in 5% steps:
+ * Increase the cachesize in 10% steps:
*/
- size = size * 20 / 19;
+ size = size * 10 / 9;
}
if (migration_debug)
* Re: boot-time slowdown for measure_migration_cost
2006-01-30 19:24 ` Ingo Molnar
@ 2006-01-30 20:00 ` Luck, Tony
2006-01-30 20:43 ` Prarit Bhargava
0 siblings, 1 reply; 12+ messages in thread
From: Luck, Tony @ 2006-01-30 20:00 UTC (permalink / raw)
To: Ingo Molnar
Cc: Bjorn Helgaas, Ingo Molnar, linux-ia64, linux-kernel,
Andrew Morton, Linus Torvalds
On Mon, Jan 30, 2006 at 08:24:38PM +0100, Ingo Molnar wrote:
> > Doing both gets the time down to 5.20s, and the migration_cost=9990.
>
> ok, that's good enough i think - we could certainly do the patch below
> in v2.6.16.
Might it be wise to see whether the 2% variation that I saw can be
repeated on some other architecture? Bjorn's initial post was just
questioning whether we need to spend this much time during boot to acquire
this data. Now we have *one* data point that on an ia64 with four cpus
with 9MB cache in a single domain that we can speed the calculation by
a factor of three with only a 2% loss of accuracy. Can someone else try
this patch and post the before/after values for migration_cost from dmesg?
-Tony
---
reduce the amount of time the migration cost calculations cost during
bootup.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -5141,7 +5141,7 @@ static void init_sched_build_groups(stru
#define SEARCH_SCOPE 2
#define MIN_CACHE_SIZE (64*1024U)
#define DEFAULT_CACHE_SIZE (5*1024*1024U)
-#define ITERATIONS 2
+#define ITERATIONS 1
#define SIZE_THRESH 130
#define COST_THRESH 130
@@ -5480,9 +5480,9 @@ static unsigned long long measure_migrat
break;
}
/*
- * Increase the cachesize in 5% steps:
+ * Increase the cachesize in 10% steps:
*/
- size = size * 20 / 19;
+ size = size * 10 / 9;
}
if (migration_debug)
* Re: boot-time slowdown for measure_migration_cost
2006-01-30 20:00 ` Luck, Tony
@ 2006-01-30 20:43 ` Prarit Bhargava
2006-01-30 20:52 ` Prarit Bhargava
0 siblings, 1 reply; 12+ messages in thread
From: Prarit Bhargava @ 2006-01-30 20:43 UTC (permalink / raw)
To: Luck, Tony
Cc: Ingo Molnar, Bjorn Helgaas, Ingo Molnar, linux-ia64, linux-kernel,
Andrew Morton, Linus Torvalds
Tony,
>
>
> Might it be wise to see whether the 2% variation that I saw can be
> repeated on some other architecture? Can someone else try
> this patch and post the before/after values for migration_cost from dmesg?
>
Ask and ye shall receive ... on the 64p/64G system.
Pristine:
[ 9.942253] Brought up 64 CPUs
[ 9.942904] Total of 64 processors activated (143654.91 BogoMIPS).
[ 9.943995] build_sched_domains: start
[ 32.108439] migration_cost=0,32232,39021
[ 37.894391] build_sched_domains: end
Patched:
[ 0.001307] Calibrating delay loop... 2244.60 BogoMIPS (lpj=4489216)
[ 9.942308] Brought up 64 CPUs
[ 9.942812] Total of 64 processors activated (143654.91 BogoMIPS).
[ 18.080441] migration_cost=0,31934,38750
[ 23.865993] checking if image is initramfs... it is
P.
* Re: boot-time slowdown for measure_migration_cost
2006-01-30 20:43 ` Prarit Bhargava
@ 2006-01-30 20:52 ` Prarit Bhargava
0 siblings, 0 replies; 12+ messages in thread
From: Prarit Bhargava @ 2006-01-30 20:52 UTC (permalink / raw)
To: Prarit Bhargava
Cc: Luck, Tony, Ingo Molnar, Bjorn Helgaas, Ingo Molnar, linux-ia64,
linux-kernel, Andrew Morton, Linus Torvalds
Prarit Bhargava wrote:
>
> Tony,
>
>>
>>
>> Might it be wise to see whether the 2% variation that I saw can be
>> repeated on some other architecture? Can someone else try
>> this patch and post the before/after values for migration_cost from
>> dmesg?
>>
Whoops. Let's try that again:
Pristine (with build_sched_domains measurement):
[ 9.942253] Brought up 64 CPUs
[ 9.942904] Total of 64 processors activated (143654.91 BogoMIPS).
[ 9.943995] build_sched_domains: start
[ 32.108439] migration_cost=0,32232,39021
[ 37.894391] build_sched_domains: end
Patched (with build_sched_domains measurement):
[ 9.942267] Brought up 64 CPUs
[ 9.942930] Total of 64 processors activated (143654.91 BogoMIPS).
[ 9.944032] build_sched_domains: beginmigration_cost=0,31854,38739
[ 23.868304] build_sched_domains: end
P.
>
* Re: boot-time slowdown for measure_migration_cost
2006-01-30 18:53 ` Luck, Tony
2006-01-30 19:24 ` Ingo Molnar
@ 2006-01-30 20:43 ` John Hawkes
1 sibling, 0 replies; 12+ messages in thread
From: John Hawkes @ 2006-01-30 20:43 UTC (permalink / raw)
To: Luck, Tony, Ingo Molnar
Cc: Bjorn Helgaas, Ingo Molnar, linux-ia64, linux-kernel
From: "Luck, Tony" <tony.luck@intel.com>
...
> So the variation in the computed value of migration_cost was at worst
> 2% with these modifications to the algorithm. Do you really need to know
> the value to this accuracy? What 2nd order bad effects would occur from
> using an off-by-2% value for scheduling decisions?
>
> On the plus side Prarit's results show that this time isn't scaling with
> NR_CPUS ... apparently just cache size and number of domains are significant
> in the time to compute.
Yes, the calculation is done just once per domain level, and a desire to
achieve great accuracy for the calculation presupposes that the cpuM-to-cpuN
migration cost for a given domain level is identical (or very close) across
all the CPU pairs. That is, for a given domain level, only one CPU pair is
chosen for the calculation. For the ia64/sn2 NUMA Altix, and I suspect for
other NUMA platforms, this just isn't true for the middle domain level (i.e.,
the level that appears when the CPU count is >32p) -- some CPU pairs are
"closer" than other pairs. The variation for other CPU pairs in this domain
level is certainly much greater than 2%.
John Hawkes
* RE: boot-time slowdown for measure_migration_cost
2006-01-30 17:21 ` Ingo Molnar
2006-01-30 18:53 ` Luck, Tony
@ 2006-01-30 19:26 ` Chen, Kenneth W
1 sibling, 0 replies; 12+ messages in thread
From: Chen, Kenneth W @ 2006-01-30 19:26 UTC (permalink / raw)
To: 'Ingo Molnar', Bjorn Helgaas
Cc: Ingo Molnar, linux-ia64, linux-kernel
Ingo Molnar wrote on Monday, January 30, 2006 9:22 AM
> - in kernel/sched.c, decrease ITERATIONS from 2 to 1. This will make the
> measurement more noisy though.
>
> - in kernel/sched.c, change this line:
>
> size = size * 20 / 19;
>
> to:
>
> size = size * 10 / 9;
>
> this will probably halve the cost - against at the expense of
> accuracy and statistical stability.
I think the kernel should keep the accuracy and stability. One option
would be by default not to measure the migration cost. People who
want an accurate scheduler parameter could turn on a boot-time
option to do the measurement.
- Ken
* RE: boot-time slowdown for measure_migration_cost
@ 2006-01-27 21:48 Luck, Tony
2006-01-27 22:08 ` Prarit Bhargava
0 siblings, 1 reply; 12+ messages in thread
From: Luck, Tony @ 2006-01-27 21:48 UTC (permalink / raw)
To: Bjorn Helgaas, Ingo Molnar; +Cc: linux-ia64, linux-kernel
> The boot-time migration cost auto-tuning stuff seems to have
> been merged to Linus' tree since 2.6.15. On little one- or
> two-processor systems, the time required to measure the
> migration costs isn't very noticeable, but by the time we
> get to even a four-processor ia64 box, it adds about
> 30 seconds to the boot time, which seems like a lot.
I only see about 16 seconds for a 4-way tiger (not that 16 seconds
is good ... but it's not as bad as 30). This was with a build
from tiger_defconfig that sets CONFIG_NR_CPUS=4 ... so I wonder
what's causing the factor of two. I measured with a printk
each side of build_sched_domains() and booted with the "time"
command line arg to get:
[ 0.540718] Building sched domains
[ 16.124693] migration_cost=10091
[ 16.124789] Done
More importantly, how does this time scale as the number of
cpus increases? Linear, or worse? What happens on a 512 cpu
Altix (if it's quadratic, they may be still waiting for the
boot to finish :-)
-Tony
* Re: boot-time slowdown for measure_migration_cost
2006-01-27 21:48 Luck, Tony
@ 2006-01-27 22:08 ` Prarit Bhargava
0 siblings, 0 replies; 12+ messages in thread
From: Prarit Bhargava @ 2006-01-27 22:08 UTC (permalink / raw)
To: Luck, Tony; +Cc: Bjorn Helgaas, Ingo Molnar, linux-ia64, linux-kernel
Luck, Tony wrote:
>>The boot-time migration cost auto-tuning stuff seems to have
>>been merged to Linus' tree since 2.6.15. On little one- or
>>two-processor systems, the time required to measure the
>>migration costs isn't very noticeable, but by the time we
>>get to even a four-processor ia64 box, it adds about
>>30 seconds to the boot time, which seems like a lot.
>
>
> I only see about 16 seconds for a 4-way tiger (not that 16 seconds
> is good ... but it not as bad as 30). This was with a build
> from tiger_defconfig that sets CONFIG_NR_CPUS=4 ... so I wonder
> what's causing the factor of two. I measured with a printk
> each side of build_sched_domains() and booted with the "time"
> command line arg to get:
I've noticed the delay on a 16p and 64p. At first I thought it was a
system hang but have since learned to live with the delay.
> What happens on a 512 cpu
> Altix (if it's quadratic, they may be still waiting for the
> boot to finish :-)
Not quadratic. This is a 64p Altix ...
[ 9.942253] Brought up 64 CPUs
[ 9.942904] Total of 64 processors activated (143654.91 BogoMIPS).
[ 9.943995] build_sched_domains: start
[ 32.108439] migration_cost=0,32232,39021
[ 37.894391] build_sched_domains: end
P.
>
> -Tony
* Re: boot-time slowdown for measure_migration_cost
@ 2006-02-01 0:50 Chuck Ebbert
0 siblings, 0 replies; 12+ messages in thread
From: Chuck Ebbert @ 2006-02-01 0:50 UTC (permalink / raw)
To: Luck, Tony
Cc: Bjorn Helgaas, Ingo Molnar, linux-ia64, linux-kernel,
Linus Torvalds, Andrew Morton, Chen, Kenneth W
In-Reply-To: <20060130200026.GA5081@agluck-lia64.sc.intel.com>
On Mon, 30 Jan 2006, Tony Luck wrote:
> Might it be wise to see whether the 2% variation that I saw can be
> repeated on some other architecture? Bjorn's initial post was just
> questioning whether we need to spend this much time during boot to acquire
> this data. Now we have *one* data point that on an ia64 with four cpus
> with 9MB cache in a single domain that we can speed the calculation by
> a factor of three with only a 2% loss of accuracy. Can someone else try
> this patch and post the before/after values for migration_cost from dmesg?
Before:
messages.1:Jan 24 01:19:45 d2 kernel: [ 6.377117] migration_cost=9352
messages.1:Jan 27 21:07:55 d2 kernel: [ 6.384871] migration_cost=9329
messages.1:Jan 28 11:00:32 d2 kernel: [ 6.384215] migration_cost=9338
messages.1:Jan 28 12:55:03 d2 kernel: [ 6.389189] migration_cost=9364
After:
messages:Jan 31 07:55:07 d2 kernel: [ 1.859359] migration_cost=9274
This was on a dual PII Xeon with 2MB L2 cache. About 3.5x as fast and
only 1% change.
Maybe the default could be to run the quick test with an option to run the
more-accurate one?
--
Chuck