linux-kernel.vger.kernel.org archive mirror
* Default cache_hot_time value back to 10ms
@ 2004-10-06  0:42 Chen, Kenneth W
  2004-10-06  0:47 ` Con Kolivas
                   ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: Chen, Kenneth W @ 2004-10-06  0:42 UTC (permalink / raw)
  To: 'Ingo Molnar'
  Cc: linux-kernel, 'Andrew Morton', 'Nick Piggin'

Chen, Kenneth W wrote on Tuesday, October 05, 2004 10:31 AM
> We have experimented with similar thing, via bumping up sd->cache_hot_time to
> a very large number, like 1 sec.  What we measured was a equally low throughput.
> But that was because of not enough load balancing.

Since we are talking about load balancing, we decided to measure various
values for the cache_hot_time variable to see how it affects app performance.
We first established a baseline number with the vanilla base kernel (default
at 2.5ms), then swept that variable up to 1000ms.  All of the experiments were
done with Ingo's patch posted earlier.  Here are the results (test environment
is a 4-way SMP machine, 32 GB memory, 500 disks running an industry standard
db transaction processing workload):

cache_hot_time  | workload throughput
--------------------------------------
         2.5ms  - 100.0   (0% idle)
         5ms    - 106.0   (0% idle)
         10ms   - 112.5   (1% idle)
         15ms   - 111.6   (3% idle)
         25ms   - 111.1   (5% idle)
         250ms  - 105.6   (7% idle)
         1000ms - 105.4   (7% idle)

Clearly the default value for SMP has the worst application throughput (12%
below peak performance).  When set too low, the kernel is too aggressive about
load balancing and we are still seeing cache thrashing despite the perf fix.
However, if set too high, the kernel gets too conservative and does not do
enough load balancing.

This value defaulted to 10ms before the domain scheduler; why did the domain
scheduler need to change it to 2.5ms, and on what basis was that decision
made?  We are proposing changing that number back to 10ms.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>

--- linux-2.6.9-rc3/kernel/sched.c.orig	2004-10-05 17:37:21.000000000 -0700
+++ linux-2.6.9-rc3/kernel/sched.c	2004-10-05 17:38:02.000000000 -0700
@@ -387,7 +387,7 @@ struct sched_domain {
 	.max_interval		= 4,			\
 	.busy_factor		= 64,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (5*1000000/2),	\
+	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_BALANCE_NEWIDLE	\
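
For context, the check this value feeds looks roughly like the sketch below.
It is a simplification of the 2.6.9-era task_hot() test in kernel/sched.c,
with approximate names rather than the kernel's exact ones: a task is treated
as cache hot, and therefore skipped by the load balancer, if it last ran less
than cache_hot_time nanoseconds ago.

	/*
	 * Simplified sketch, not the kernel's exact code.  "now" and
	 * "task_timestamp" are in nanoseconds, as is cache_hot_time
	 * from the sched_domain initializer above.
	 */
	static inline int task_is_cache_hot(unsigned long long now,
					    unsigned long long task_timestamp,
					    unsigned long long cache_hot_time)
	{
		/* a recently-run task still has warm cache on its old CPU */
		return (now - task_timestamp) < cache_hot_time;
	}

With the default of 2.5ms the balancer is willing to migrate a task that ran
as little as 2.5ms ago; raising the value to 10ms makes migration of
recently-run tasks correspondingly rarer.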



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-06  0:42 Chen, Kenneth W
@ 2004-10-06  0:47 ` Con Kolivas
  2004-10-06  1:02   ` Nick Piggin
  2004-10-06  0:58 ` Nick Piggin
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread
From: Con Kolivas @ 2004-10-06  0:47 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Ingo Molnar', linux-kernel, 'Andrew Morton',
	'Nick Piggin'

Chen, Kenneth W writes:

> Chen, Kenneth W wrote on Tuesday, October 05, 2004 10:31 AM
>> We have experimented with similar thing, via bumping up sd->cache_hot_time to
>> a very large number, like 1 sec.  What we measured was a equally low throughput.
>> But that was because of not enough load balancing.
> 
> Since we are talking about load balancing, we decided to measure various
> value for cache_hot_time variable to see how it affects app performance. We
> first establish baseline number with vanilla base kernel (default at 2.5ms),
> then sweep that variable up to 1000ms.  All of the experiments are done with
> Ingo's patch posted earlier.  Here are the result (test environment is 4-way
> SMP machine, 32 GB memory, 500 disks running industry standard db transaction
> processing workload):
> 
> cache_hot_time  | workload throughput
> --------------------------------------
>          2.5ms  - 100.0   (0% idle)
>          5ms    - 106.0   (0% idle)
>          10ms   - 112.5   (1% idle)
>          15ms   - 111.6   (3% idle)
>          25ms   - 111.1   (5% idle)
>          250ms  - 105.6   (7% idle)
>          1000ms - 105.4   (7% idle)
> 
> Clearly the default value for SMP has the worst application throughput (12%
> below peak performance).  When set too low, kernel is too aggressive on load
> balancing and we are still seeing cache thrashing despite the perf fix.
> However, If set too high, kernel gets too conservative and not doing enough
> load balance.
> 
> This value was default to 10ms before domain scheduler, why does domain
> scheduler need to change it to 2.5ms? And on what bases does that decision
> take place?  We are proposing change that number back to 10ms.

Should it not be based on the cache flush time? We already measure that to 
set cache_decay_ticks, and could base it on that. What is the cache_decay_ticks 
value reported in the dmesg of your hardware?

Cheers,
Con


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-06  0:42 Chen, Kenneth W
  2004-10-06  0:47 ` Con Kolivas
@ 2004-10-06  0:58 ` Nick Piggin
  2004-10-06  3:55 ` Andrew Morton
  2004-10-06  7:48 ` Ingo Molnar
  3 siblings, 0 replies; 32+ messages in thread
From: Nick Piggin @ 2004-10-06  0:58 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Ingo Molnar', linux-kernel, 'Andrew Morton'

Chen, Kenneth W wrote:
> Chen, Kenneth W wrote on Tuesday, October 05, 2004 10:31 AM
> 
>>We have experimented with similar thing, via bumping up sd->cache_hot_time to
>>a very large number, like 1 sec.  What we measured was a equally low throughput.
>>But that was because of not enough load balancing.
> 
> 
> Since we are talking about load balancing, we decided to measure various
> value for cache_hot_time variable to see how it affects app performance. We
> first establish baseline number with vanilla base kernel (default at 2.5ms),
> then sweep that variable up to 1000ms.  All of the experiments are done with
> Ingo's patch posted earlier.  Here are the result (test environment is 4-way
> SMP machine, 32 GB memory, 500 disks running industry standard db transaction
> processing workload):
> 
> cache_hot_time  | workload throughput
> --------------------------------------
>          2.5ms  - 100.0   (0% idle)
>          5ms    - 106.0   (0% idle)
>          10ms   - 112.5   (1% idle)
>          15ms   - 111.6   (3% idle)
>          25ms   - 111.1   (5% idle)
>          250ms  - 105.6   (7% idle)
>          1000ms - 105.4   (7% idle)
> 
> Clearly the default value for SMP has the worst application throughput (12%
> below peak performance).  When set too low, kernel is too aggressive on load
> balancing and we are still seeing cache thrashing despite the perf fix.
> However, If set too high, kernel gets too conservative and not doing enough
> load balance.
> 

Great testing, thanks.

> This value was default to 10ms before domain scheduler, why does domain
> scheduler need to change it to 2.5ms? And on what bases does that decision
> take place?  We are proposing change that number back to 10ms.
> 

IIRC Ingo wanted it lower, to more closely match previous values (correct
me if I'm wrong).

I think your patch would be fine though; when timeslicing tasks on the same
CPU, I've typically seen large regressions when going below a 10ms timeslice,
even on a small-cache CPU (512K).

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-06  0:47 ` Con Kolivas
@ 2004-10-06  1:02   ` Nick Piggin
  0 siblings, 0 replies; 32+ messages in thread
From: Nick Piggin @ 2004-10-06  1:02 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Chen, Kenneth W, 'Ingo Molnar', linux-kernel,
	'Andrew Morton'

Con Kolivas wrote:

> Should it not be based on the cache flush time? We measure that and set 
> the cache_decay_ticks and can base it on that. What is the 
> cache_decay_ticks value reported in the dmesg of your hardware?
> 

It should be, but the cache_decay_ticks calculation is so crude that I
preferred to use a fixed value to reduce the variation between different
setups.

I once experimented with attempting to figure out memory bandwidth based
on reading an uncached page. That might be the way to go.
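
A rough user-space sketch of that idea (purely illustrative, not the actual
experiment; the 1MB cache size and 64-byte line size are assumptions): stream
through a buffer several times larger than the cache so that most reads miss,
then divide the elapsed time by the number of cache-sized chunks to estimate
how long refilling one cache's worth of data from memory takes.

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <time.h>

	#define CACHE_SIZE	(1UL << 20)		/* assume a 1MB L2 */
	#define BUF_SIZE	(8UL * CACHE_SIZE)	/* big enough to defeat the cache */

	int main(void)
	{
		volatile unsigned char *buf = malloc(BUF_SIZE);
		struct timespec t0, t1;
		unsigned long sum = 0, ns;
		size_t i;

		memset((void *)buf, 1, BUF_SIZE);	/* fault all pages in first */

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < BUF_SIZE; i += 64)	/* one read per cache line */
			sum += buf[i];
		clock_gettime(CLOCK_MONOTONIC, &t1);

		ns = (t1.tv_sec - t0.tv_sec) * 1000000000UL +
		     (t1.tv_nsec - t0.tv_nsec);
		printf("~%lu ns to stream one cache's worth from memory (sum=%lu)\n",
		       ns / (BUF_SIZE / CACHE_SIZE), sum);
		free((void *)buf);
		return 0;
	}

A number like that, scaled up by some factor, could in principle feed a
per-machine cache_hot_time instead of a fixed constant.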

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-06  0:42 Chen, Kenneth W
  2004-10-06  0:47 ` Con Kolivas
  2004-10-06  0:58 ` Nick Piggin
@ 2004-10-06  3:55 ` Andrew Morton
  2004-10-06  4:30   ` Nick Piggin
  2004-10-06  7:48 ` Ingo Molnar
  3 siblings, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2004-10-06  3:55 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: mingo, linux-kernel, nickpiggin

"Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:
>
> This value was default to 10ms before domain scheduler, why does domain
>  scheduler need to change it to 2.5ms? And on what bases does that decision
>  take place?  We are proposing change that number back to 10ms.

It sounds like this needs to be runtime tunable?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-06  3:55 ` Andrew Morton
@ 2004-10-06  4:30   ` Nick Piggin
  2004-10-06  4:51     ` Andrew Morton
  0 siblings, 1 reply; 32+ messages in thread
From: Nick Piggin @ 2004-10-06  4:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Chen, Kenneth W, mingo, linux-kernel

Andrew Morton wrote:
> "Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:
> 
>>This value was default to 10ms before domain scheduler, why does domain
>> scheduler need to change it to 2.5ms? And on what bases does that decision
>> take place?  We are proposing change that number back to 10ms.
> 
> 
> It sounds like this needs to be runtime tunable?
> 

I'd say it is probably too low level to be a useful tunable (although
for testing I guess so... but then you could have *lots* of parameters
tunable).

I don't think there was a really good reason why this value was set to 2.5ms.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-06  4:30   ` Nick Piggin
@ 2004-10-06  4:51     ` Andrew Morton
  2004-10-06  5:00       ` Nick Piggin
                         ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Andrew Morton @ 2004-10-06  4:51 UTC (permalink / raw)
  To: Nick Piggin; +Cc: kenneth.w.chen, mingo, linux-kernel

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>  Andrew Morton wrote:
>  > "Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:
>  > 
>  >>This value was default to 10ms before domain scheduler, why does domain
>  >> scheduler need to change it to 2.5ms? And on what bases does that decision
>  >> take place?  We are proposing change that number back to 10ms.
>  > 
>  > 
>  > It sounds like this needs to be runtime tunable?
>  > 
> 
>  I'd say it is probably too low level to be a useful tunable (although
>  for testing I guess so... but then you could have *lots* of parameters
>  tunable).

This tunable caused an 11% performance difference in (I assume) TPCx. 
That's a big deal, and people will want to diddle it.

If one number works optimally for all machines and workloads then fine.

But yes, avoiding a tunable would be nice, but we need a tunable to work
out whether we can avoid making it tunable ;)

Not that I'm soliciting patches or anything.  I'll duck this one for now.
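
For what it's worth, a runtime tunable here would be a small amount of code
with the 2.6-era sysctl API.  The sketch below is only an illustration of the
shape of it (the variable name, the ctl_name value and the /proc path are made
up, and the per-domain values would still have to be refreshed from the global
somehow):

	#include <linux/sysctl.h>
	#include <linux/init.h>

	/* hypothetical global default, in milliseconds */
	int sysctl_cache_hot_time_ms = 10;

	static ctl_table cache_hot_table[] = {
		{
			.ctl_name	= 99,	/* arbitrary number, sketch only */
			.procname	= "cache_hot_time_ms",
			.data		= &sysctl_cache_hot_time_ms,
			.maxlen		= sizeof(int),
			.mode		= 0644,
			.proc_handler	= &proc_dointvec,
		},
		{ .ctl_name = 0 }
	};

	static ctl_table cache_hot_root[] = {
		{
			.ctl_name	= CTL_KERN,
			.procname	= "kernel",
			.mode		= 0555,
			.child		= cache_hot_table,
		},
		{ .ctl_name = 0 }
	};

	static int __init cache_hot_sysctl_init(void)
	{
		/* would show up as /proc/sys/kernel/cache_hot_time_ms */
		register_sysctl_table(cache_hot_root, 0);
		return 0;
	}
	__initcall(cache_hot_sysctl_init);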

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-06  4:51     ` Andrew Morton
@ 2004-10-06  5:00       ` Nick Piggin
  2004-10-06  5:09         ` Andrew Morton
  2004-10-06  5:52       ` Chen, Kenneth W
  2004-10-06 19:27       ` Chen, Kenneth W
  2 siblings, 1 reply; 32+ messages in thread
From: Nick Piggin @ 2004-10-06  5:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: kenneth.w.chen, mingo, linux-kernel, Judith Lebzelter

Andrew Morton wrote:

> This tunable caused an 11% performance difference in (I assume) TPCx. 
> That's a big deal, and people will want to diddle it.
> 

True. But 2.5 I think really is too low (for anyone, except maybe a
CPU with no/a tiny L2 cache).

> If one number works optimally for all machines and workloads then fine.
> 

Yeah.. 10ms may bring up idle times a bit on other workloads. Judith
had some database tests that were very sensitive to this - if 10ms is
OK there, then I'd say it would be OK for most things.

> But yes, avoiding a tunable would be nice, but we need a tunable to work
> out whether we can avoid making it tunable ;)
> 

Heh. I think it would be good to have an automatic thingy to tune it.
A smarter cache_decay_ticks calculation would suit.

> Not that I'm soliciting patches or anything.  I'll duck this one for now.
> 

OK. Any idea when 2.6.9 will be coming out? :)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-06  5:00       ` Nick Piggin
@ 2004-10-06  5:09         ` Andrew Morton
  2004-10-06  5:21           ` Nick Piggin
  0 siblings, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2004-10-06  5:09 UTC (permalink / raw)
  To: Nick Piggin; +Cc: kenneth.w.chen, mingo, linux-kernel, judith

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> Any idea when 2.6.9 will be coming out?

Before -mm hits 1000 patches, I hope.

2.6.8 wasn't really super-stable, and our main tool for getting the quality
up is to stretch the release times and give ourselves time to shake things
out.  The release time is largely driven by perceptions of current stability,
bug report rates, etc.

A current guess would be -rc4 later this week, 2.6.9 late next week.  We'll
see.

One way of advancing that is to get down and work on bugs in current -linus
tree, yes?

If this still doesn't seem to be working out and if 2.6.9 isn't as good as
we'd like I'll consider shutting down -mm completely once we hit -rc2 so
people have nothing else to do apart from fix bugs in, and test -linus. 
We'll see.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-06  5:09         ` Andrew Morton
@ 2004-10-06  5:21           ` Nick Piggin
  2004-10-06  5:33             ` Andrew Morton
  0 siblings, 1 reply; 32+ messages in thread
From: Nick Piggin @ 2004-10-06  5:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: kenneth.w.chen, mingo, linux-kernel, judith

Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
>>Any idea when 2.6.9 will be coming out?
> 
> 
> Before -mm hits 1000 patches, I hope.
> 
> 2.6.8 wasn't really super-stable and our main tool for getting the quality
> is to stretch the release times, give us time to shake things out.  The
> release time is largely driven by perceptions of current stability, bug
> report rates, etc.
> 
> A current guess would be -rc4 later this week, 2.6.9 late next week.  We'll
> see.
> 
> One way of advancing that is to get down and work on bugs in current -linus
> tree, yes?
> 
> If this still doesn't seem to be working out and if 2.6.9 isn't as good as
> we'd like I'll consider shutting down -mm completely once we hit -rc2 so
> people have nothing else to do apart from fix bugs in, and test -linus. 
> We'll see.
> 

OK thanks for the explanation.

Any thoughts about making -rc's into -pre's, and doing real -rc's?
It would have caught the NFS bug that made 2.6.8.1, and probably
the cd burning problems... Or is Linus' patching finger just too
itchy?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-06  5:21           ` Nick Piggin
@ 2004-10-06  5:33             ` Andrew Morton
  2004-10-06  5:46               ` Nick Piggin
  0 siblings, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2004-10-06  5:33 UTC (permalink / raw)
  To: Nick Piggin; +Cc: kenneth.w.chen, mingo, linux-kernel, judith

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> Any thoughts about making -rc's into -pre's, and doing real -rc's?

I think what we have is OK.  The idea is that once 2.6.9 is released we
merge up all the well-tested code which is sitting in various trees and has
been under test for a few weeks.  As soon as all that well-tested code is
merged, we go into -rc.  So we're pipelining the development of 2.6.10 code
with the stabilisation of 2.6.9.

If someone goes and develops *new* code after the release of, say, 2.6.9
then tough tittie, it's too late for 2.6.9: we don't want new code - we
want old-n-tested code.  So your typed-in-after-2.6.9 code goes into
2.6.11.

That's the theory anyway.  If it means that it takes a long time to get
code into the kernel.org tree, well, that's a cost.  That latency may be
high but the bandwidth is pretty good.

There are exceptions of course.  Completely new
drivers/filesystems/architectures can go in any old time because they won't
break existing setups.  Although I do tend to hold back on even these in
the (probably overoptimistic) hope that people will then concentrate on
mainline bug fixing and testing.

>  It would have caught the NFS bug that made 2.6.8.1, and probably
>  the cd burning problems... Or is Linus' patching finger just too
>  itchy?

uh, let's say that incident was "proof by counter example".

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-06  5:33             ` Andrew Morton
@ 2004-10-06  5:46               ` Nick Piggin
  0 siblings, 0 replies; 32+ messages in thread
From: Nick Piggin @ 2004-10-06  5:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: kenneth.w.chen, mingo, linux-kernel, judith

Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
>>Any thoughts about making -rc's into -pre's, and doing real -rc's?
> 
> 
> I think what we have is OK.  The idea is that once 2.6.9 is released we
> merge up all the well-tested code which is sitting in various trees and has
> been under test for a few weeks.  As soon as all that well-tested code is
> merged, we go into -rc.  So we're pipelining the development of 2.6.10 code
> with the stabilisation of 2.6.9.
> 
> If someone goes and develops *new* code after the release of, say, 2.6.9
> then tough tittie, it's too late for 2.6.9: we don't want new code - we
> want old-n-tested code.  So your typed-in-after-2.6.9 code goes into
> 2.6.11.
> 
> That's the theory anyway.  If it means that it takes a long time to get
> code into the kernel.org tree, well, that's a cost.  That latency may be
> high but the bandwidth is pretty good.
> 
> There are exceptions of course.  Completely new
> drivers/filesystems/architectures can go in any old time becasue they won't
> break existing setups.  Although I do tend to hold back on even these in
> the (probably overoptimistic) hope that people will then concentrate on
> mainline bug fixing and testing.
> 
> 
>> It would have caught the NFS bug that made 2.6.8.1, and probably
>> the cd burning problems... Or is Linus' patching finger just too
>> itchy?
> 
> 
> uh, let's say that incident was "proof by counter example".
> 

Heh :)

OK I agree on all these points. And yeah it has worked quite well...

But by real -rc, I mean 2.6.9 is a week after 2.6.9-rcx minus the
extraversion string; nothing more.

The main point (for me, at least) is that if -rc1 comes out, and I'm
still working on some bug or having something else tested then I can
hurry up and/or send you and Linus a polite email saying don't release
yet.

Would probably be a help for people running automated testing and
regression tests, etc. And just generally increase the userbase a
little bit.

Catching the odd paper bag bug would be a fringe benefit.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: Default cache_hot_time value back to 10ms
  2004-10-06  4:51     ` Andrew Morton
  2004-10-06  5:00       ` Nick Piggin
@ 2004-10-06  5:52       ` Chen, Kenneth W
  2004-10-06 19:27       ` Chen, Kenneth W
  2 siblings, 0 replies; 32+ messages in thread
From: Chen, Kenneth W @ 2004-10-06  5:52 UTC (permalink / raw)
  To: 'Andrew Morton', Nick Piggin; +Cc: mingo, linux-kernel

Andrew Morton wrote on Tuesday, October 05, 2004 9:51 PM
> >  > It sounds like this needs to be runtime tunable?
> >  >
> >
> >  I'd say it is probably too low level to be a useful tunable (although
> >  for testing I guess so... but then you could have *lots* of parameters
> >  tunable).
>
> This tunable caused an 11% performance difference in (I assume) TPCx.
> That's a big deal, and people will want to diddle it.
>
> If one number works optimally for all machines and workloads then fine.
>
> But yes, avoiding a tunable would be nice, but we need a tunable to work
> out whether we can avoid making it tunable ;)

Just to throw in some more benchmark numbers: we measured that specjbb
throughput went up by about 0.3% with cache_hot_time set to 10ms compared
to the default 2.5ms.  No measurable speedup/regression on volanmark (we
just tried 10 and 2.5ms).

- Ken



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-06  0:42 Chen, Kenneth W
                   ` (2 preceding siblings ...)
  2004-10-06  3:55 ` Andrew Morton
@ 2004-10-06  7:48 ` Ingo Molnar
  2004-10-06 17:18   ` Chen, Kenneth W
  3 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2004-10-06  7:48 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Ingo Molnar', linux-kernel, 'Andrew Morton',
	'Nick Piggin'


* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

> Chen, Kenneth W wrote on Tuesday, October 05, 2004 10:31 AM
> > We have experimented with similar thing, via bumping up sd->cache_hot_time to
> > a very large number, like 1 sec.  What we measured was a equally low throughput.
> > But that was because of not enough load balancing.
> 
> Since we are talking about load balancing, we decided to measure various
> value for cache_hot_time variable to see how it affects app performance. We
> first establish baseline number with vanilla base kernel (default at 2.5ms),
> then sweep that variable up to 1000ms.  All of the experiments are done with
> Ingo's patch posted earlier.  Here are the result (test environment is 4-way
> SMP machine, 32 GB memory, 500 disks running industry standard db transaction
> processing workload):
> 
> cache_hot_time  | workload throughput
> --------------------------------------
>          2.5ms  - 100.0   (0% idle)
>          5ms    - 106.0   (0% idle)
>          10ms   - 112.5   (1% idle)
>          15ms   - 111.6   (3% idle)
>          25ms   - 111.1   (5% idle)
>          250ms  - 105.6   (7% idle)
>          1000ms - 105.4   (7% idle)
> 
> Clearly the default value for SMP has the worst application throughput (12%
> below peak performance).  When set too low, kernel is too aggressive on load
> balancing and we are still seeing cache thrashing despite the perf fix.
> However, If set too high, kernel gets too conservative and not doing enough
> load balance.

could you please try the test in 1 msec increments around 10 msec? It
would be very nice to find a good formula and the 5 msec steps are too
coarse. I think it would be nice to test 7,9,11,13 msecs first, and then
the remaining 1 msec slots around the new maximum. (assuming the
workload measurement is stable.)

> This value was default to 10ms before domain scheduler, why does domain
> scheduler need to change it to 2.5ms? And on what bases does that decision
> take place?  We are proposing change that number back to 10ms.

agreed. What value does cache_decay_ticks have on your box?

> 
> Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>

Signed-off-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: Default cache_hot_time value back to 10ms
  2004-10-06  7:48 ` Ingo Molnar
@ 2004-10-06 17:18   ` Chen, Kenneth W
  2004-10-06 19:55     ` Ingo Molnar
  2004-10-06 22:46     ` Peter Williams
  0 siblings, 2 replies; 32+ messages in thread
From: Chen, Kenneth W @ 2004-10-06 17:18 UTC (permalink / raw)
  To: 'Ingo Molnar'
  Cc: 'Ingo Molnar', linux-kernel, 'Andrew Morton',
	'Nick Piggin'

> Chen, Kenneth W wrote on Tuesday, October 05, 2004 10:31 AM
> > We have experimented with similar thing, via bumping up sd->cache_hot_time to
> > a very large number, like 1 sec.  What we measured was a equally low throughput.
> > But that was because of not enough load balancing.
>
> Since we are talking about load balancing, we decided to measure various
> value for cache_hot_time variable to see how it affects app performance. We
> first establish baseline number with vanilla base kernel (default at 2.5ms),
> then sweep that variable up to 1000ms.  All of the experiments are done with
> Ingo's patch posted earlier.  Here are the result (test environment is 4-way
> SMP machine, 32 GB memory, 500 disks running industry standard db transaction
> processing workload):
>
> cache_hot_time  | workload throughput
> --------------------------------------
>          2.5ms  - 100.0   (0% idle)
>          5ms    - 106.0   (0% idle)
>          10ms   - 112.5   (1% idle)
>          15ms   - 111.6   (3% idle)
>          25ms   - 111.1   (5% idle)
>          250ms  - 105.6   (7% idle)
>          1000ms - 105.4   (7% idle)
>
> Clearly the default value for SMP has the worst application throughput (12%
> below peak performance).  When set too low, kernel is too aggressive on load
> balancing and we are still seeing cache thrashing despite the perf fix.
> However, If set too high, kernel gets too conservative and not doing enough
> load balance.

Ingo Molnar wrote on Wednesday, October 06, 2004 12:48 AM
> could you please try the test in 1 msec increments around 10 msec? It
> would be very nice to find a good formula and the 5 msec steps are too
> coarse. I think it would be nice to test 7,9,11,13 msecs first, and then
> the remaining 1 msec slots around the new maximum. (assuming the
> workload measurement is stable.)

I should've posted the whole thing yesterday; we had measurements at 7.5 and
12.5 ms.  Here is the result (repeating 5, 10, 15 for easy reading).

 5   ms 106.0
 7.5 ms 110.3
10   ms 112.5
12.5 ms 112.0
15   ms 111.6


> > This value was default to 10ms before domain scheduler, why does domain
> > scheduler need to change it to 2.5ms? And on what bases does that decision
> > take place?  We are proposing change that number back to 10ms.
>
> agreed. What value does cache_decay_ticks have on your box?


I see all the fancy calculation of cache_decay_ticks on x86, but nobody
actually uses it in the domain scheduler.  Anyway, my box (ia64) has that
value hard-coded to 10ms.

- Ken



^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: Default cache_hot_time value back to 10ms
  2004-10-06  4:51     ` Andrew Morton
  2004-10-06  5:00       ` Nick Piggin
  2004-10-06  5:52       ` Chen, Kenneth W
@ 2004-10-06 19:27       ` Chen, Kenneth W
  2004-10-06 19:39         ` Andrew Morton
  2 siblings, 1 reply; 32+ messages in thread
From: Chen, Kenneth W @ 2004-10-06 19:27 UTC (permalink / raw)
  To: 'Andrew Morton', Nick Piggin; +Cc: mingo, linux-kernel

Andrew Morton wrote on Tuesday, October 05, 2004 9:51 PM
> > Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >  I'd say it is probably too low level to be a useful tunable (although
> >  for testing I guess so... but then you could have *lots* of parameters
> >  tunable).
>
> This tunable caused an 11% performance difference in (I assume) TPCx.
> That's a big deal, and people will want to diddle it.
>
> If one number works optimally for all machines and workloads then fine.
>
> But yes, avoiding a tunable would be nice, but we need a tunable to work
> out whether we can avoid making it tunable ;)
>
> Not that I'm soliciting patches or anything.  I'll duck this one for now.

Andrew, can I safely interpret this response as meaning you are OK with having
cache_hot_time set to 10 ms for now, and that you will merge this change for
2.6.9?  I think Ingo and Nick are both OK with that change as well. Thanks.

- Ken



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-06 19:27       ` Chen, Kenneth W
@ 2004-10-06 19:39         ` Andrew Morton
  2004-10-06 20:38           ` Chen, Kenneth W
  0 siblings, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2004-10-06 19:39 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: nickpiggin, mingo, linux-kernel

"Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:
>
> Andrew Morton wrote on Tuesday, October 05, 2004 9:51 PM
>  > > Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>  > >  I'd say it is probably too low level to be a useful tunable (although
>  > >  for testing I guess so... but then you could have *lots* of parameters
>  > >  tunable).
>  >
>  > This tunable caused an 11% performance difference in (I assume) TPCx.
>  > That's a big deal, and people will want to diddle it.
>  >
>  > If one number works optimally for all machines and workloads then fine.
>  >
>  > But yes, avoiding a tunable would be nice, but we need a tunable to work
>  > out whether we can avoid making it tunable ;)
>  >
>  > Not that I'm soliciting patches or anything.  I'll duck this one for now.
> 
>  Andrew, can I safely interpret this response as you are OK with having
>  cache_hot_time set to 10 ms for now?

I have a lot of scheduler changes queued up and I view this change as being
not very high priority.  If someone sends a patch to update -mm then we can
run with that, however Ingo's auto-tuning seems a far preferable approach.

>  And you will merge this change for 2.6.9?

I was not planning on doing so, but could be persuaded, I guess.

It's very, very late for this and subtle CPU scheduler regressions tend to
take a long time (weeks or months) to be identified.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-06 17:18   ` Chen, Kenneth W
@ 2004-10-06 19:55     ` Ingo Molnar
  2004-10-06 22:46     ` Peter Williams
  1 sibling, 0 replies; 32+ messages in thread
From: Ingo Molnar @ 2004-10-06 19:55 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Ingo Molnar', linux-kernel, 'Andrew Morton',
	'Nick Piggin'


* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

>  5   ms 106.0
>  7.5 ms 110.3
> 10   ms 112.5
> 12.5 ms 112.0
> 15   ms 111.6

ok, great. 9ms and 11ms would still be interesting. My guess would be
that the maximum is at 9 (although the numbers, when plotted, indicate
that the measurement might be a bit noisy).

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: Default cache_hot_time value back to 10ms
  2004-10-06 19:39         ` Andrew Morton
@ 2004-10-06 20:38           ` Chen, Kenneth W
  2004-10-06 20:43             ` Andrew Morton
  2004-10-06 20:50             ` Ingo Molnar
  0 siblings, 2 replies; 32+ messages in thread
From: Chen, Kenneth W @ 2004-10-06 20:38 UTC (permalink / raw)
  To: 'Andrew Morton'; +Cc: nickpiggin, mingo, linux-kernel

Andrew Morton wrote on Wednesday, October 06, 2004 12:40 PM
> "Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:
> >  Andrew, can I safely interpret this response as you are OK with having
> >  cache_hot_time set to 10 ms for now?
>
> I have a lot of scheduler changes queued up and I view this change as being
> not very high priority.  If someone sends a patch to update -mm then we can
> run with that, however Ingo's auto-tuning seems a far preferable approach.
>
> >  And you will merge this change for 2.6.9?
>
> I was not planning on doing so, but could be persuaded, I guess.
>
> It's very, very late for this and subtle CPU scheduler regressions tend to
> take a long time (weeks or months) to be identified.


Let me try to persuade ;-).  First, it is hard to accept that we are leaving
11% of performance on the table just due to a poorly chosen parameter.  A
percentage difference this large on a db workload is a huge deal.  It basically
"unfairly" handicaps the 2.6 kernel against the competition, and even handicaps
us compared to the 2.4 kernel.  We have established across various workloads,
from db to java, that 10 ms works best.  What more data can we provide to swing
you in that direction?

Secondly, let me ask the question again from the first mail in the thread: this
value *WAS* 10 ms for a long time before the domain scheduler.  What's so
special about the domain scheduler that all of a sudden this parameter got
changed to 2.5?  I'd like to see some justification/prior measurement for such
a change made when the domain scheduler kicked in.

- Ken



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-06 20:38           ` Chen, Kenneth W
@ 2004-10-06 20:43             ` Andrew Morton
  2004-10-06 23:14               ` Chen, Kenneth W
  2004-10-06 20:50             ` Ingo Molnar
  1 sibling, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2004-10-06 20:43 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: nickpiggin, mingo, linux-kernel

"Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:
>
>  Secondly, let me ask the question again from the first mail thread:  this value
>  *WAS* 10 ms for a long time, before the domain scheduler.  What's so special
>  about domain scheduler that all the sudden this parameter get changed to 2.5?

So why on earth was it switched from 10 to 2.5 in the first place?

Please resend the final patch.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: Default cache_hot_time value back to 10ms
  2004-10-06 20:38           ` Chen, Kenneth W
  2004-10-06 20:43             ` Andrew Morton
@ 2004-10-06 20:50             ` Ingo Molnar
  2004-10-06 21:03               ` Chen, Kenneth W
  1 sibling, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2004-10-06 20:50 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: 'Andrew Morton', nickpiggin, linux-kernel


On Wed, 6 Oct 2004, Chen, Kenneth W wrote:

> Let me try to persuade ;-).  First, it hard to accept the fact that we
> are leaving 11% of performance on the table just due to a poorly chosen
> parameter. This much percentage difference on a db workload is a huge
> deal.  It basically "unfairly" handicap 2.6 kernel behind competition,
> even handicap ourselves compare to 2.4 kernel.  We have established from
> various workloads that 10 ms works the best, from db to java workload.  
> What more data can we provide to swing you in that direction?

the problem is that 10 msec might be fine for a 9MB L2 cache CPU running a
DB benchmark, but it will surely be too much of a migration cutoff for other
boxes. And too much of a migration cutoff means increased idle time -
resulting in CPU-under-utilization and worse performance.

so i'd prefer to not touch it for 2.6.9 (consider that tree closed from a
scheduler POV), and we can do the auto-tuning in 2.6.10 just fine. It will
need the same weeks-long testcycle that all scheduler balancing patches
need. There are so many different types of workloads ...

> Secondly, let me ask the question again from the first mail thread:  
> this value *WAS* 10 ms for a long time, before the domain scheduler.  
> What's so special about domain scheduler that all the sudden this
> parameter get changed to 2.5? I'd like to see some justification/prior
> measurement for such change when domain scheduler kicks in.

iirc it was tweaked as a result of the other bug that you fixed. But high
sensitivity to this tunable was never truly established, and a 9 MB L2
cache CPU is certainly not typical - and it is certainly the one that
hurts most from migration effects.

anyway, we were running based on cache_decay_ticks for a long time - is
that what was 10 msec on your box? The cache_decay_ticks calculation was
pretty fine too, it scaled up with cachesize.

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: Default cache_hot_time value back to 10ms
  2004-10-06 20:50             ` Ingo Molnar
@ 2004-10-06 21:03               ` Chen, Kenneth W
  0 siblings, 0 replies; 32+ messages in thread
From: Chen, Kenneth W @ 2004-10-06 21:03 UTC (permalink / raw)
  To: 'Ingo Molnar'; +Cc: 'Andrew Morton', nickpiggin, linux-kernel

Ingo Molnar wrote on Wednesday, October 06, 2004 1:51 PM
> On Wed, 6 Oct 2004, Chen, Kenneth W wrote:
> > Let me try to persuade ;-).  First, it hard to accept the fact that we
> > are leaving 11% of performance on the table just due to a poorly chosen
> > parameter. This much percentage difference on a db workload is a huge
> > deal.  It basically "unfairly" handicap 2.6 kernel behind competition,
> > even handicap ourselves compare to 2.4 kernel.  We have established from
> > various workloads that 10 ms works the best, from db to java workload.
> > What more data can we provide to swing you in that direction?
>
> the problem is that 10 msec might be fine for a 9MB L2 cache CPU running a
> DB benchmark, but it will sure be too much of a migration cutoff for other
> boxes. And too much of a migration cutoff means increased idle time -
> resulting in CPU-under-utilization and worse performance.
>
> so i'd prefer to not touch it for 2.6.9 (consider that tree closed from a
> scheduler POV), and we can do the auto-tuning in 2.6.10 just fine. It will
> need the same weeks-long testcycle that all scheduler balancing patches
> need. There are so many different type of workloads ...

I would argue that the testing should be the other way around: having people
argue/provide data why 2.5 is better than 10.  Is there any prior measurement
or mailing list posting out there?


> > Secondly, let me ask the question again from the first mail thread:
> > this value *WAS* 10 ms for a long time, before the domain scheduler.
> > What's so special about domain scheduler that all the sudden this
> > parameter get changed to 2.5? I'd like to see some justification/prior
> > measurement for such change when domain scheduler kicks in.
>
> iirc it was tweaked as a result of the other bug that you fixed.

Is it possible that whatever tweaking was done before was done with broken
load-balancing logic, which would invalidate the 2.5 ms result?


> anyway, we were running based on cache_decay_ticks for a long time - is
> that what was 10 msec on your box? The cache_decay_ticks calculation was
> pretty fine too, it scaled up with cachesize.

Yes, cache_decay_ticks is what I was referring to.  I guess I was too focused
on ia64; for ia64, it is hard-coded to 10ms regardless of the cache size.

cache_decay_ticks isn't used anywhere in 2.6.9-rc3; maybe that should be the
value used for cache_hot_time.

- Ken



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-06 17:18   ` Chen, Kenneth W
  2004-10-06 19:55     ` Ingo Molnar
@ 2004-10-06 22:46     ` Peter Williams
  1 sibling, 0 replies; 32+ messages in thread
From: Peter Williams @ 2004-10-06 22:46 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Ingo Molnar', 'Ingo Molnar', linux-kernel,
	'Andrew Morton', 'Nick Piggin'

Chen, Kenneth W wrote:
>>Chen, Kenneth W wrote on Tuesday, October 05, 2004 10:31 AM
>>
>>>We have experimented with similar thing, via bumping up sd->cache_hot_time to
>>>a very large number, like 1 sec.  What we measured was a equally low throughput.
>>>But that was because of not enough load balancing.
>>
>>Since we are talking about load balancing, we decided to measure various
>>value for cache_hot_time variable to see how it affects app performance. We
>>first establish baseline number with vanilla base kernel (default at 2.5ms),
>>then sweep that variable up to 1000ms.  All of the experiments are done with
>>Ingo's patch posted earlier.  Here are the result (test environment is 4-way
>>SMP machine, 32 GB memory, 500 disks running industry standard db transaction
>>processing workload):
>>
>>cache_hot_time  | workload throughput
>>--------------------------------------
>>         2.5ms  - 100.0   (0% idle)
>>         5ms    - 106.0   (0% idle)
>>         10ms   - 112.5   (1% idle)
>>         15ms   - 111.6   (3% idle)
>>         25ms   - 111.1   (5% idle)
>>         250ms  - 105.6   (7% idle)
>>         1000ms - 105.4   (7% idle)
>>
>>Clearly the default value for SMP has the worst application throughput (12%
>>below peak performance).  When set too low, kernel is too aggressive on load
>>balancing and we are still seeing cache thrashing despite the perf fix.
>>However, If set too high, kernel gets too conservative and not doing enough
>>load balance.
> 
> 
> Ingo Molnar wrote on Wednesday, October 06, 2004 12:48 AM
> 
>>could you please try the test in 1 msec increments around 10 msec? It
>>would be very nice to find a good formula and the 5 msec steps are too
>>coarse. I think it would be nice to test 7,9,11,13 msecs first, and then
>>the remaining 1 msec slots around the new maximum. (assuming the
>>workload measurement is stable.)
> 
> 
> I should've post the whole thing yesterday, we had measurement of 7.5 and
> 12.5 ms.  Here is the result (repeating 5, 10, 15 for easy reading).
> 
>  5   ms 106.0
>  7.5 ms 110.3
> 10   ms 112.5
> 12.5 ms 112.0
> 15   ms 111.6
> 
> 
> 
>>>This value was default to 10ms before domain scheduler, why does domain
>>>scheduler need to change it to 2.5ms? And on what bases does that decision
>>>take place?  We are proposing change that number back to 10ms.
>>
>>agreed. What value does cache_decay_ticks have on your box?
> 
> 
> 
> I see all the fancy calculation with cache_decay_ticks on x86, but nobody
> actually uses it in the domain scheduler.  Anyway, my box has that value
> hard coded to 10ms (ia64).
> 

If you fit a quadratic equation to this data, take the first derivative, 
and solve for zero, it will give the cache_hot_time that maximizes the 
throughput.
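
Applied to the three samples around the peak, a quick stand-alone calculation
of that vertex (ordinary C, nothing kernel-specific; the helper name is made
up for the example):

	#include <stdio.h>

	/*
	 * Vertex of the parabola through three equally spaced samples
	 * (x1 - h, y0), (x1, y1), (x1 + h, y2).
	 */
	static double parabola_peak(double x1, double h,
				    double y0, double y1, double y2)
	{
		return x1 + h * (y0 - y2) / (2.0 * (y0 - 2.0 * y1 + y2));
	}

	int main(void)
	{
		/* throughput at 7.5ms, 10ms and 12.5ms from the mail above */
		double peak = parabola_peak(10.0, 2.5, 110.3, 112.5, 112.0);

		printf("estimated optimum cache_hot_time: ~%.1f ms\n", peak);
		return 0;	/* prints roughly 10.8 ms */
	}

On the numbers posted in this thread that lands at roughly 10.8 ms, i.e. close
to the proposed 10 ms default.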

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: Default cache_hot_time value back to 10ms
  2004-10-06 20:43             ` Andrew Morton
@ 2004-10-06 23:14               ` Chen, Kenneth W
  2004-10-07  2:26                 ` Nick Piggin
  2004-10-07  6:29                 ` Ingo Molnar
  0 siblings, 2 replies; 32+ messages in thread
From: Chen, Kenneth W @ 2004-10-06 23:14 UTC (permalink / raw)
  To: 'Andrew Morton'; +Cc: nickpiggin, mingo, linux-kernel

Andrew Morton wrote on Wednesday, October 06, 2004 1:43 PM
> "Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:
> >
> >  Secondly, let me ask the question again from the first mail thread:  this value
> >  *WAS* 10 ms for a long time, before the domain scheduler.  What's so special
> >  about domain scheduler that all the sudden this parameter get changed to 2.5?
>
> So why on earth was it switched from 10 to 2.5 in the first place?
>
> Please resend the final patch.


Here is a patch that reverts the default cache_hot_time value back to the
pre-domain-scheduler equivalent, which determined a task's cache affinity via
the architecture-defined variable cache_decay_ticks.

This is merely a request that we go back to what *was* before, *NOT* a new
scheduler tweak (whatever tweak was done for the domain scheduler broke a
traditional/industry-recognized workload).

As a side note, I'd like to get involved in future scheduler tuning experiments;
we have a fair number of benchmark environments where we can validate things
across various kinds of workloads, e.g., db, java, cpu, etc.  Thanks.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>

patch against 2.6.9-rc3:

--- linux-2.6.9-rc3/kernel/sched.c.orig	2004-10-06 15:10:56.000000000 -0700
+++ linux-2.6.9-rc3/kernel/sched.c	2004-10-06 15:18:51.000000000 -0700
@@ -387,7 +387,7 @@ struct sched_domain {
 	.max_interval		= 4,			\
 	.busy_factor		= 64,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (5*1000000/2),	\
+	.cache_hot_time		= cache_decay_ticks*1000000,\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_BALANCE_NEWIDLE	\


patch against 2.6.9-rc3-mm2:

--- linux-2.6.9-rc3/include/linux/topology.h.orig	2004-10-06 15:32:48.000000000 -0700
+++ linux-2.6.9-rc3/include/linux/topology.h	2004-10-06 15:33:25.000000000 -0700
@@ -113,7 +113,7 @@ static inline int __next_node_with_cpus(
 	.max_interval		= 4,			\
 	.busy_factor		= 64,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (5*1000/2),		\
+	.cache_hot_time		= (cache_decay_ticks*1000),\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-06 23:14               ` Chen, Kenneth W
@ 2004-10-07  2:26                 ` Nick Piggin
  2004-10-07  6:29                 ` Ingo Molnar
  1 sibling, 0 replies; 32+ messages in thread
From: Nick Piggin @ 2004-10-07  2:26 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: 'Andrew Morton', mingo, linux-kernel

Chen, Kenneth W wrote:
> Andrew Morton wrote on Wednesday, October 06, 2004 1:43 PM
> 
>>"Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:
>>
>>> Secondly, let me ask the question again from the first mail thread:  this value
>>> *WAS* 10 ms for a long time, before the domain scheduler.  What's so special
>>> about domain scheduler that all the sudden this parameter get changed to 2.5?
>>
>>So why on earth was it switched from 10 to 2.5 in the first place?
>>
>>Please resend the final patch.
> 
> 
> 
> Here is a patch that revert default cache_hot_time value back to the equivalence
> of pre-domain scheduler, which determins task's cache affinity via architecture
> defined variable cache_decay_ticks.
> 
> This is a mere request that we go back to what *was* before, *NOT* as a new
> scheduler tweak (Whatever tweak done for domain scheduler broke traditional/
> industry recognized workload).
> 

OK... Well Andrew as I said I'd be happy for this to go in. I'd be *extra*
happy if Judith ran a few of those dbt thingy tests which had been sensitive
to idle time. Can you ask her about that? or should I?

> As a side note, I'd like to get involved on future scheduler tuning experiments,
> we have fair amount of benchmark environments where we can validate things across
> various kind of workload, i.e., db, java, cpu, etc.  Thanks.
> 

That would be very welcome indeed. We have a big backlog of scheduler things
to go in after 2.6.9 is released (although not many of them change the runtime
behaviour IIRC). After that, I have some experimental performance work that
could use wider testing. After *that*, the multiprocessor scheduler will in a
state where 2.6 shouldn't need much more work, so we can concentrate on just
tuning the dials.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-06 23:14               ` Chen, Kenneth W
  2004-10-07  2:26                 ` Nick Piggin
@ 2004-10-07  6:29                 ` Ingo Molnar
  2004-10-07  7:08                   ` Jeff Garzik
  1 sibling, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2004-10-07  6:29 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: 'Andrew Morton', nickpiggin, linux-kernel


* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

> Here is a patch that revert default cache_hot_time value back to the
> equivalence of pre-domain scheduler, which determins task's cache
> affinity via architecture defined variable cache_decay_ticks.

i could agree with this one-liner patch for 2.6.9; it only affects the
SMP balancer, and there, for the most common boxes, it likely results in a
migration cutoff similar to the 2.5 msec we currently have. Here are the
changes that occur on a couple of x86 boxes:

 2-way celeron, 128K cache:         2.5 msec -> 1.0 msec 
 2-way/4-way P4 Xeon 1MB cache:     2.5 msec -> 2.0 msec
 8-way P3 Xeon 2MB cache:           2.5 msec -> 6.0 msec

each of these changes is sane and not too drastic.

(on ia64 there is no auto-tuning of cache_decay_ticks, there you've got
a decay=<x> boot parameter anyway, to fix things up.)

there was one particular DB test that was quite sensitive to idle time
introduced by too large a migration cutoff: dbt2-pgsql. Could you try that
one too and compare -rc3 performance to -rc3+migration-patches?

> This is a mere request that we go back to what *was* before, *NOT* as
> a new scheduler tweak (Whatever tweak done for domain scheduler broke
> traditional/ industry recognized workload).
> 
> As a side note, I'd like to get involved on future scheduler tuning
> experiments, we have fair amount of benchmark environments where we
> can validate things across various kind of workload, i.e., db, java,
> cpu, etc.  Thanks.

yeah, it would be nice to test the following 3 kernels:

 2.6.9-rc3 vanilla,
 2.6.9-rc3 + cache_hot_fix + use-cache_decay_ticks
 2.6.9-rc3 + cache_hot_fixes + autotune-patch

using as many different CPU types (and # of CPUs) as possible.

The most important factor in these measurements is statistical stability
of the result - if noise is too high then it's hard to judge. (the
numbers you posted in previous mails are quite stable, but not all
benchmarks are like that.)

> Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>

Signed-off-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-07  6:29                 ` Ingo Molnar
@ 2004-10-07  7:08                   ` Jeff Garzik
  2004-10-07  7:26                     ` Ingo Molnar
  0 siblings, 1 reply; 32+ messages in thread
From: Jeff Garzik @ 2004-10-07  7:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Chen, Kenneth W, 'Andrew Morton', nickpiggin,
	linux-kernel

Ingo Molnar wrote:
> * Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:
> >>Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
> 
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>


[tangent]  FWIW Andrew has recently been using "Acked-by" as well, 
presumably for patches created by person X but reviewed by a wholly 
independent person Y (since signed-off-by indicates you have some amount 
of legal standing to actually sign off on the patch).

	Jeff



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-07  7:08                   ` Jeff Garzik
@ 2004-10-07  7:26                     ` Ingo Molnar
  0 siblings, 0 replies; 32+ messages in thread
From: Ingo Molnar @ 2004-10-07  7:26 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Chen, Kenneth W, 'Andrew Morton', nickpiggin,
	linux-kernel


* Jeff Garzik <jgarzik@pobox.com> wrote:

> Ingo Molnar wrote:
> >* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:
> >>>Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
> >
> >
> >Signed-off-by: Ingo Molnar <mingo@elte.hu>
> 
> 
> [tangent] FWIW Andrew has recently been using "Acked-by" as well,
> presumably for patches created by person X from but reviewed by wholly
> independent person Y (since signed-off-by indicates you have some
> amount of legal standing to actually sign off on the patch)

[even more tangential] even if this weren't a one-liner, i might have some
amount of legal standing: i wrote the original cache_decay_ticks code
that this patch reverts to ;) But yeah, Acked-by would be more
informative here.

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Default cache_hot_time value back to 10ms
       [not found] <200410071028.01931.habanero@us.ibm.com>
@ 2004-10-07 15:58 ` Andrew Theurer
  2004-10-08  9:47   ` Nick Piggin
  0 siblings, 1 reply; 32+ messages in thread
From: Andrew Theurer @ 2004-10-07 15:58 UTC (permalink / raw)
  To: linux-kernel; +Cc: kernel, pwil3058, nickpiggin, mingo, kenneth.w.chen, akpm

[-- Attachment #1: Type: text/plain, Size: 2973 bytes --]

> OK... Well Andrew as I said I'd be happy for this to go in. I'd be *extra*
> happy if Judith ran a few of those dbt thingy tests which had been
> sensitive to idle time. Can you ask her about that? or should I?
>
> > As a side note, I'd like to get involved on future scheduler tuning
> > experiments, we have fair amount of benchmark environments where we can
> > validate things across various kind of workload, i.e., db, java, cpu,
> > etc.  Thanks.
>
> That would be very welcome indeed. We have a big backlog of scheduler
> things to go in after 2.6.9 is released (although not many of them change
> the runtime behaviour IIRC). After that, I have some experimental
> performance work that could use wider testing. After *that*, the
> multiprocessor scheduler will in a state where 2.6 shouldn't need much more
> work, so we can concentrate on just tuning the dials.

I'd like to add some comments as well:

1) We are seeing similar problems with that "well known" DB transaction 
benchmark, as well as another well-known benchmark measuring multi-tier J2EE 
server performance.  Both problems are with load balancing, though it's not 
quite the same situation: we have too much idle time and not enough 
throughput.  Making the idle balance more aggressive has helped there.  The 3 
areas we have changed are:

wake_idle()  - find the first idle cpu, starting with cpu->sd and moving up the 
sd's as needed.  Modify SD_NODE_INIT.flags and SD_CPU_INIT.flags to include 
SD_WAKE_IDLE.  Now, if there is an idle cpu (and task->cpu is busy), we move 
it to the closest idle cpu.

can_migrate() put back (again) the aggressive idle condition in can_migrate().  
Do not look at task_hot when we have an idle cpu.

idle_balance() / SD_NODE_INIT - add SD_BALANCE_NEWIDLE to SD_NODE_INIT.flags 
so a newly-idle balance can try to balance from an appropriate cpu, first a 
cpu close to it, then farther out.

(the above changes IMO could also pave the way for removing timer based -idle- 
balances)

IMO, I don't think idle cpus should play by the exact same rules as busy ones 
when load balancing.  I am not saying the answer is to ignore cache warmth 
entirely, but maybe a much more relaxed policy is in order.

Also, finding (at boot time) the best cache_hot_time is a step in the right 
direction, but I have to wonder if cache_hot() is really doing the right 
thing.  It looks like all cache_hot() does is decide that a task is cache hot 
because it ran recently.  Who's to say the task got cache warm in the first 
place?  Shouldn't we be looking at both how long ago it ran and the length of 
time it ran?  Some of these workloads have very high transaction rates, and 
in turn very high context switch rates.  I would be surprised if many of 
the tasks get more than enough continuous run time to build good cache warmth 
anyway.  I am all for testing cache warmth, but I think we should start 
looking at more than just how long ago the task ran.
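
One rough way to sketch that idea (illustrative only, not one of the attached
patches; the run-length bookkeeping and the min_run_ns threshold are invented
for the example) is to treat a task as cache hot only if it both ran recently
and ran for long enough to have warmed the cache:

	/*
	 * Sketch only: a task is cache hot if its last run ended less than
	 * cache_hot_time ns ago *and* that run lasted at least min_run_ns.
	 * The kernel does not track last_run_len today; it would have to
	 * be added to the task's scheduling stats.
	 */
	static inline int task_cache_hot(unsigned long long now,
					 unsigned long long last_ran,
					 unsigned long long last_run_len,
					 unsigned long long cache_hot_time,
					 unsigned long long min_run_ns)
	{
		return (now - last_ran) < cache_hot_time &&
		       last_run_len >= min_run_ns;
	}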

-Andrew Theurer




[-- Attachment #2: 100-wake_idle-patch.269-rc3 --]
[-- Type: text/x-diff, Size: 2668 bytes --]

diff -Naurp linux-2.6.9-rc3/kernel/sched.c linux-2.6.9-rc3-wake_idle/kernel/sched.c
--- linux-2.6.9-rc3/kernel/sched.c	2004-10-09 05:59:47.000000000 -0700
+++ linux-2.6.9-rc3-wake_idle/kernel/sched.c	2004-10-11 00:57:28.590909272 -0700
@@ -393,7 +393,8 @@ struct sched_domain {
 	.flags			= SD_BALANCE_NEWIDLE	\
 				| SD_BALANCE_EXEC	\
 				| SD_WAKE_AFFINE	\
-				| SD_WAKE_BALANCE,	\
+				| SD_WAKE_BALANCE	\
+				| SD_WAKE_IDLE,		\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
 	.nr_balance_failed	= 0,			\
@@ -413,7 +414,8 @@ struct sched_domain {
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_BALANCE_EXEC	\
-				| SD_WAKE_BALANCE,	\
+				| SD_WAKE_BALANCE	\
+				| SD_WAKE_IDLE,		\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
 	.nr_balance_failed	= 0,			\
@@ -1066,35 +1068,30 @@ static inline unsigned long target_load(
 #endif
 
 /*
- * wake_idle() is useful especially on SMT architectures to wake a
- * task onto an idle sibling if we would otherwise wake it onto a
- * busy sibling.
+ * wake_idle() will wake a task on an idle cpu if task->cpu is
+ * not idle and and idle cpu is available.  The span of cpus to
+ * search is the top most sched domain with SD_BALANCE_EXEC.
  *
  * Returns the CPU we should wake onto.
  */
 #if defined(ARCH_HAS_SCHED_WAKE_IDLE)
 static int wake_idle(int cpu, task_t *p)
 {
-	cpumask_t tmp;
-	runqueue_t *rq = cpu_rq(cpu);
-	struct sched_domain *sd;
+	cpumask_t cpumask;
+	struct sched_domain *sd = NULL;
 	int i;
 
-	if (idle_cpu(cpu))
-		return cpu;
-
-	sd = rq->sd;
-	if (!(sd->flags & SD_WAKE_IDLE))
-		return cpu;
-
-	cpus_and(tmp, sd->span, cpu_online_map);
-	cpus_and(tmp, tmp, p->cpus_allowed);
-
-	for_each_cpu_mask(i, tmp) {
-		if (idle_cpu(i))
-			return i;
+	for_each_domain(cpu, sd) {
+		if (sd->flags & SD_WAKE_IDLE) {
+			cpus_and(cpumask, sd->span, cpu_online_map);
+			cpus_and(cpumask, cpumask, p->cpus_allowed);
+			for_each_cpu_mask(i, cpumask) {
+				if (idle_cpu(i))
+					return i;
+			}
+		}
+		else break;
 	}
-
 	return cpu;
 }
 #else
@@ -1205,10 +1202,12 @@ static int try_to_wake_up(task_t * p, un
 	new_cpu = cpu; /* Could not wake to this_cpu. Wake to cpu instead */
 out_set_cpu:
 	schedstat_inc(rq, ttwu_attempts);
-	new_cpu = wake_idle(new_cpu, p);
-	if (new_cpu != cpu && cpu_isset(new_cpu, p->cpus_allowed)) {
-		schedstat_inc(rq, ttwu_moved);
-		set_task_cpu(p, new_cpu);
+	if (!idle_cpu(cpu)) {
+		new_cpu = wake_idle(new_cpu, p);
+		if (new_cpu != cpu) {
+			schedstat_inc(rq, ttwu_moved);
+			set_task_cpu(p, new_cpu);
+		}
 		task_rq_unlock(rq, &flags);
 		/* might preempt at this point */
 		rq = task_rq_lock(p, &flags);

[-- Attachment #3: 120-new_idle-patch.269-rc3 --]
[-- Type: text/x-diff, Size: 627 bytes --]

diff -Naurp linux-2.6.9-rc3-wake_idle-can_migrate/kernel/sched.c linux-2.6.9-rc3-wake_idle-can_migrate-newidle/kernel/sched.c
--- linux-2.6.9-rc3-wake_idle-can_migrate/kernel/sched.c	2004-10-11 01:11:58.016917328 -0700
+++ linux-2.6.9-rc3-wake_idle-can_migrate-newidle/kernel/sched.c	2004-10-11 01:24:05.826894464 -0700
@@ -413,7 +413,8 @@ struct sched_domain {
 	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
-	.flags			= SD_BALANCE_EXEC	\
+	.flags			= SD_BALANCE_NEWIDLE	\
+				| SD_BALANCE_EXEC	\
 				| SD_WAKE_BALANCE	\
 				| SD_WAKE_IDLE,		\
 	.last_balance		= jiffies,		\

[-- Attachment #4: 110-can_migrate-patch.269-rc3 --]
[-- Type: text/x-diff, Size: 719 bytes --]

diff -Naurp linux-2.6.9-rc3-wake_idle/kernel/sched.c linux-2.6.9-rc3-wake_idle-can_migrate/kernel/sched.c
--- linux-2.6.9-rc3-wake_idle/kernel/sched.c	2004-10-11 00:57:28.590909272 -0700
+++ linux-2.6.9-rc3-wake_idle-can_migrate/kernel/sched.c	2004-10-11 01:11:58.016917328 -0700
@@ -1780,14 +1780,8 @@ int can_migrate_task(task_t *p, runqueue
 		return 0;
 	if (!cpu_isset(this_cpu, p->cpus_allowed))
 		return 0;
-
-	/* Aggressive migration if we've failed balancing */
-	if (idle == NEWLY_IDLE ||
-			sd->nr_balance_failed < sd->cache_nice_tries) {
-		if (task_hot(p, rq->timestamp_last_tick, sd))
-			return 0;
-	}
-
+	if (idle == NOT_IDLE && task_hot(p, rq->timestamp_last_tick, sd))
+		return 0;
 	return 1;
 }
 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Default cache_hot_time value back to 10ms
@ 2004-10-07 18:44 Albert Cahalan
  0 siblings, 0 replies; 32+ messages in thread
From: Albert Cahalan @ 2004-10-07 18:44 UTC (permalink / raw)
  To: linux-kernel mailing list
  Cc: nickpiggin, kenneth.w.chen, kernel, mingo, Andrew Morton OSDL

Con Kolivas writes:

> Should it not be based on the cache flush time? We measure
> that and set the cache_decay_ticks and can base it on that.

Often one must use the time, but...

If the system goes idle for an hour, the last-run
process is still cache-hot.

Many systems let you measure cache line castouts.
Time is a very crude approximation of this.
Memory traffic is a slightly better approximation.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-07 15:58 ` Andrew Theurer
@ 2004-10-08  9:47   ` Nick Piggin
  2004-10-08 14:11     ` Andrew Theurer
  0 siblings, 1 reply; 32+ messages in thread
From: Nick Piggin @ 2004-10-08  9:47 UTC (permalink / raw)
  To: Andrew Theurer
  Cc: linux-kernel, kernel, pwil3058, mingo, kenneth.w.chen, akpm

Andrew Theurer wrote:

> I'd like to add some comments as well:
> 

OK, thanks Andrew. Can I ask that we revisit these after 2.6.9 comes
out? I don't think the situation should be worse than 2.6.8, and
basically 2.6.9 is in bugfix only mode at this point, so I doubt we
could get anything more in even if we wanted to.

Please be sure to bring this up again after 2.6.9. Thanks.

Nick

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-08  9:47   ` Nick Piggin
@ 2004-10-08 14:11     ` Andrew Theurer
  0 siblings, 0 replies; 32+ messages in thread
From: Andrew Theurer @ 2004-10-08 14:11 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, kernel, pwil3058, mingo, kenneth.w.chen, akpm

On Friday 08 October 2004 04:47, Nick Piggin wrote:
> Andrew Theurer wrote:
> > I'd like to add some comments as well:
>
> OK, thanks Andrew. Can I ask that we revisit these after 2.6.9 comes
> out? I don't think the situation should be worse than 2.6.8, and
> basically 2.6.9 is in bugfix only mode at this point, so I doubt we
> could get anything more in even if we wanted to.
>
> Please be sure to bring this up again after 2.6.9. Thanks.

No problem.  I was thinking this would be post 2.6.9 or somewhere in -mm.

Andrew Theurer

^ permalink raw reply	[flat|nested] 32+ messages in thread


Thread overview: 32+ messages
2004-10-07 18:44 Default cache_hot_time value back to 10ms Albert Cahalan
     [not found] <200410071028.01931.habanero@us.ibm.com>
2004-10-07 15:58 ` Andrew Theurer
2004-10-08  9:47   ` Nick Piggin
2004-10-08 14:11     ` Andrew Theurer
  -- strict thread matches above, loose matches on Subject: below --
2004-10-06  0:42 Chen, Kenneth W
2004-10-06  0:47 ` Con Kolivas
2004-10-06  1:02   ` Nick Piggin
2004-10-06  0:58 ` Nick Piggin
2004-10-06  3:55 ` Andrew Morton
2004-10-06  4:30   ` Nick Piggin
2004-10-06  4:51     ` Andrew Morton
2004-10-06  5:00       ` Nick Piggin
2004-10-06  5:09         ` Andrew Morton
2004-10-06  5:21           ` Nick Piggin
2004-10-06  5:33             ` Andrew Morton
2004-10-06  5:46               ` Nick Piggin
2004-10-06  5:52       ` Chen, Kenneth W
2004-10-06 19:27       ` Chen, Kenneth W
2004-10-06 19:39         ` Andrew Morton
2004-10-06 20:38           ` Chen, Kenneth W
2004-10-06 20:43             ` Andrew Morton
2004-10-06 23:14               ` Chen, Kenneth W
2004-10-07  2:26                 ` Nick Piggin
2004-10-07  6:29                 ` Ingo Molnar
2004-10-07  7:08                   ` Jeff Garzik
2004-10-07  7:26                     ` Ingo Molnar
2004-10-06 20:50             ` Ingo Molnar
2004-10-06 21:03               ` Chen, Kenneth W
2004-10-06  7:48 ` Ingo Molnar
2004-10-06 17:18   ` Chen, Kenneth W
2004-10-06 19:55     ` Ingo Molnar
2004-10-06 22:46     ` Peter Williams
