* Industry db benchmark result on recent 2.6 kernels
From: Chen, Kenneth W @ 2005-03-28 19:33 UTC
To: 'Andrew Morton'; +Cc: linux-kernel

The roller-coaster ride continues for the 2.6 kernel in how it measures up
on an industry-standard database transaction processing benchmark. We took a
measurement on 2.6.11 and found it is 13% down from the baseline.

We will be taking db benchmark measurements more frequently from now on with
the latest kernel from kernel.org (and making these measurements at a fixed
interval). By doing this, I hope to achieve two things: first, to track base
kernel performance on a regular basis; second, and more important in my
opinion, to create a better communication flow to the kernel developers and
to keep all interested parties well informed about kernel performance for
this enterprise workload.

With that said, here is our first data point along with some historical data
we have collected so far:

	2.6.11            -13%
	2.6.9             - 6%
	2.6.8             -23%
	2.6.2             - 1%
	baseline (RHEL3)

The gory details of the benchmark configuration: 4-way SMP, 1.6 GHz Intel
Itanium 2, 64 GB memory, 450 73 GB 15k-rpm disks. All experiments were done
with the exact same hardware and application software, except for the kernel
version.
* Re: Industry db benchmark result on recent 2.6 kernels
From: Dave Hansen @ 2005-03-28 19:50 UTC
To: Chen, Kenneth W; +Cc: 'Andrew Morton', Linux Kernel Mailing List

On Mon, 2005-03-28 at 11:33 -0800, Chen, Kenneth W wrote:
> We will be taking db benchmark measurements more frequently from now on
> with the latest kernel from kernel.org (and making these measurements at a
> fixed interval). By doing this, I hope to achieve two things: first, to
> track base kernel performance on a regular basis; second, and more
> important in my opinion, to create a better communication flow to the
> kernel developers and to keep all interested parties well informed about
> kernel performance for this enterprise workload.

I'd guess that testing only kernel.org releases is sometimes too late. How
high is the overhead of doing a test? Would you be able to test each -mm
release? It's somewhat easier to toss something out of -mm for re-review
than it is out of Linus's tree.

-- Dave
* RE: Industry db benchmark result on recent 2.6 kernels
From: Chen, Kenneth W @ 2005-03-28 20:01 UTC
To: 'Dave Hansen'; +Cc: 'Andrew Morton', Linux Kernel Mailing List

On Mon, 2005-03-28 at 11:33 -0800, Chen, Kenneth W wrote:
> We will be taking db benchmark measurements more frequently from now on
> with the latest kernel from kernel.org (and making these measurements at a
> fixed interval). By doing this, I hope to achieve two things: first, to
> track base kernel performance on a regular basis; second, and more
> important in my opinion, to create a better communication flow to the
> kernel developers and to keep all interested parties well informed about
> kernel performance for this enterprise workload.

Dave Hansen wrote on Monday, March 28, 2005 11:50 AM
> I'd guess that testing only kernel.org releases is sometimes too late. How
> high is the overhead of doing a test? Would you be able to test each -mm
> release? It's somewhat easier to toss something out of -mm for re-review
> than it is out of Linus's tree.

The overhead of running the benchmark is fairly high. It's not a one-minute
run; it is more like a 5-hour exercise (the benchmark run time alone is 3+
hours). -mm has so much stuff in it that I'm not sure we would have the
bandwidth to search for which patch triggers an N% regression, etc. Let me
try the base kernel first, and if resources are available, I can attempt to
do it on the -mm tree.
* Re: Industry db benchmark result on recent 2.6 kernels
From: Linus Torvalds @ 2005-03-30 0:00 UTC
To: Chen, Kenneth W; +Cc: 'Andrew Morton', linux-kernel

On Mon, 28 Mar 2005, Chen, Kenneth W wrote:
>
> With that said, here is our first data point along with some historical
> data we have collected so far:
>
> 	2.6.11            -13%
> 	2.6.9             - 6%
> 	2.6.8             -23%
> 	2.6.2             - 1%
> 	baseline (RHEL3)

How repeatable are the numbers across reboots with the same kernel? Some
benchmarks will depend heavily on just where things land in memory,
especially with things like PAE or even just cache behaviour (ie if some
frequently-used page needs to be kmap'ped or not depending on where it
landed).

You don't have the PAE issue on ia64, but there could be other issues. Some
of them are just disk-layout issues or similar, ie performance might change
depending on where on the disk the data is written in relationship to where
most of the reads come from etc etc. The fact that it seems to fluctuate
pretty wildly makes me wonder how stable the numbers are.

Also, it would be absolutely wonderful to see a finer granularity (which
would likely also answer the stability question of the numbers). If you can
do this with the daily snapshots, that would be great. If it's not easily
automatable, or if a run takes a long time, maybe every other or every third
day would be possible?

Doing just release kernels means that there will be a two-month lag in
telling developers that something hurt performance. Doing it every day (or
at least a couple of times a week) will be much more interesting. I realize
that testing can easily be overwhelming, but if something like this can be
automated, and run in a timely fashion, that would be really great.

Two months (or half a year) later, we have absolutely _no_ idea what might
have caused a regression. For example, that 2.6.2->2.6.8 change obviously
makes pretty much any developer just go "I've got no clue".

In fact, it would be interesting (still) to go back in time, if the
benchmark can be done fast enough, and try to do testing of the historical
weekly (if not daily) builds to see where the big differences happened. If
you can narrow down the 6-month gap of 2.6.2->2.6.8 to a week or a few days,
that would already make people sit up a bit - as it is, it's too big a
problem for any developer to look at.

The daily patches are all there on kernel.org, even if the old ones have
been moved into /pub/linux/kernel/v2.6/snapshots/old/.. It's "just" a small
matter of automation ;)

Btw, this isn't just for you either - I'd absolutely _love_ it for pretty
much any benchmark. So anybody who has a favourite benchmark, whether
"obviously relevant" or not, and has the inclination to make a _simple_
daily number (preferably a nice graph), go for it.

		Linus
* RE: Industry db benchmark result on recent 2.6 kernels
From: Chen, Kenneth W @ 2005-03-30 0:22 UTC
To: 'Linus Torvalds'; +Cc: 'Andrew Morton', linux-kernel

Linus Torvalds wrote on Tuesday, March 29, 2005 4:00 PM
> How repeatable are the numbers across reboots with the same kernel? Some
> benchmarks will depend heavily on just where things land in memory,
> especially with things like PAE or even just cache behaviour (ie if some
> frequently-used page needs to be kmap'ped or not depending on where it
> landed).

Very repeatable. This workload is very steady, and the throughput is
repeatable down to a resolution of 0.1%. We toss everything below that level
as noise.

> You don't have the PAE issue on ia64, but there could be other issues.
> Some of them are just disk-layout issues or similar, ie performance might
> change depending on where on the disk the data is written in relationship
> to where most of the reads come from etc etc. The fact that it seems to
> fluctuate pretty wildly makes me wonder how stable the numbers are.

This workload has been around for 10+ years, and people at Intel have
studied its characteristics inside and out for just as long. Every stone is
turned at least once while we tune the entire setup to make sure everything
is well balanced, and we re-tune the system whenever there is a hardware
change. Data layout on the disk spindles is very well balanced.

> Also, it would be absolutely wonderful to see a finer granularity (which
> would likely also answer the stability question of the numbers). If you
> can do this with the daily snapshots, that would be great. If it's not
> easily automatable, or if a run takes a long time, maybe every other or
> every third day would be possible?

I sure will let my management know that Linus wants to see the performance
numbers on a daily basis (and I will ask my manager for a couple of million
dollars for this project :-))
* RE: Industry db benchmark result on recent 2.6 kernels
From: Chen, Kenneth W @ 2005-03-30 0:46 UTC
To: 'Linus Torvalds'; +Cc: 'Andrew Morton', linux-kernel

Linus Torvalds wrote on Tuesday, March 29, 2005 4:00 PM
> The fact that it seems to fluctuate pretty wildly makes me wonder
> how stable the numbers are.

I can't resist bragging: the high point in the fluctuation might be because
someone is working hard trying to make the 2.6 kernel run faster. Hint hint
hint ..... ;-)
* RE: Industry db benchmark result on recent 2.6 kernels
From: Linus Torvalds @ 2005-03-30 0:57 UTC
To: Chen, Kenneth W; +Cc: 'Andrew Morton', linux-kernel

On Tue, 29 Mar 2005, Chen, Kenneth W wrote:
>
> Linus Torvalds wrote on Tuesday, March 29, 2005 4:00 PM
> > The fact that it seems to fluctuate pretty wildly makes me wonder
> > how stable the numbers are.
>
> I can't resist bragging: the high point in the fluctuation might be
> because someone is working hard trying to make the 2.6 kernel run faster.
> Hint hint hint ..... ;-)

Heh. How do you explain the low point? If there's somebody out there working
hard on making it run slower, I want to whack the guy ;)

Good luck with the million-dollar grants, btw. We're all rooting for you,
and hope your manager is a total push-over.

		Linus
* Re: Industry db benchmark result on recent 2.6 kernels
From: Nick Piggin @ 2005-03-30 1:31 UTC
To: Linus Torvalds; +Cc: Chen, Kenneth W, 'Andrew Morton', linux-kernel

Linus Torvalds wrote:
>
> On Tue, 29 Mar 2005, Chen, Kenneth W wrote:
>
>> Linus Torvalds wrote on Tuesday, March 29, 2005 4:00 PM
>>
>>> The fact that it seems to fluctuate pretty wildly makes me wonder
>>> how stable the numbers are.
>>
>> I can't resist bragging: the high point in the fluctuation might be
>> because someone is working hard trying to make the 2.6 kernel run faster.
>> Hint hint hint ..... ;-)
>
> Heh. How do you explain the low point? If there's somebody out there
> working hard on making it run slower, I want to whack the guy ;)

If it is doing a lot of mapping/unmapping (or fork/exit), then that might
explain why 2.6.11 is worse.

Fortunately there are more patches to improve this on the way.

Kernel profiles would be useful if possible.
* RE: Industry db benchmark result on recent 2.6 kernels
From: Chen, Kenneth W @ 2005-03-30 1:38 UTC
To: 'Nick Piggin', Linus Torvalds; +Cc: 'Andrew Morton', linux-kernel

Nick Piggin wrote on Tuesday, March 29, 2005 5:32 PM
> If it is doing a lot of mapping/unmapping (or fork/exit), then that
> might explain why 2.6.11 is worse.
>
> Fortunately there are more patches to improve this on the way.

Once the benchmark reaches steady state, there is no mapping/unmapping going
on. Actually, the virtual address space of all the processes is so stable at
steady state that we don't even see it grow or shrink.
* Re: Industry db benchmark result on recent 2.6 kernels
From: Nick Piggin @ 2005-03-30 1:56 UTC
To: Chen, Kenneth W; +Cc: Linus Torvalds, 'Andrew Morton', linux-kernel

Chen, Kenneth W wrote:
> Nick Piggin wrote on Tuesday, March 29, 2005 5:32 PM
>
>> If it is doing a lot of mapping/unmapping (or fork/exit), then that
>> might explain why 2.6.11 is worse.
>>
>> Fortunately there are more patches to improve this on the way.
>
> Once the benchmark reaches steady state, there is no mapping/unmapping
> going on. Actually, the virtual address space of all the processes is so
> stable at steady state that we don't even see it grow or shrink.

Oh, well there goes that theory ;)

The only other thing I can think of is the CPU scheduler changes that went
into 2.6.11 (but there are obviously a lot that I can't think of).

I'm sure I don't need to tell you it would be nice to track down the source
of these problems rather than papering over them with improvements to the
block layer... any indication of what has gone wrong?

Typically, if the CPU scheduler has gone bad and is moving too many tasks
around (and hurting caches), you'll see things like copy_*_user increase in
cost for the same units of work performed. Whereas if it is too reluctant to
move tasks, you'll see increased idle time.
* Re: Industry db benchmark result on recent 2.6 kernels
From: Ingo Molnar @ 2005-03-31 14:14 UTC
To: Chen, Kenneth W; +Cc: 'Nick Piggin', Linus Torvalds, 'Andrew Morton', linux-kernel

* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

> > If it is doing a lot of mapping/unmapping (or fork/exit), then that
> > might explain why 2.6.11 is worse.
> >
> > Fortunately there are more patches to improve this on the way.
>
> Once the benchmark reaches steady state, there is no mapping/unmapping
> going on. Actually, the virtual address space of all the processes is so
> stable at steady state that we don't even see it grow or shrink.

is there any idle time on the system in steady state (it's a sign of
under-balancing)? Idle balancing (and wakeup balancing) is one of the things
that got tuned back and forth a lot. Also, do you know what the total number
of context switches is during the full test on each kernel? Too many context
switches can be an indicator of over-balancing. Another sign of migration
gone bad can be a relative increase of userspace time vs. system time (due
to cache thrashing; on DB workloads most of the cache contents are
userspace's).

	Ingo
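The counters Ingo asks about are easy to sample from userspace. Here is a
minimal sketch, not part of the original thread, that reads the system-wide
context-switch count and per-mode CPU times from /proc/stat; running it once
before and once after a benchmark and diffing the two samples gives the
numbers in question:

	/* stat_sample.c - print context-switch count and CPU time split
	 * from /proc/stat; run before and after the benchmark and diff. */
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		FILE *f = fopen("/proc/stat", "r");
		char line[512];
		unsigned long long user, nice, sys, idle, ctxt;

		if (!f) {
			perror("/proc/stat");
			return 1;
		}
		while (fgets(line, sizeof(line), f)) {
			/* aggregate "cpu " line only, not the per-cpu lines */
			if (!strncmp(line, "cpu ", 4) &&
			    sscanf(line, "cpu %llu %llu %llu %llu",
				   &user, &nice, &sys, &idle) == 4)
				printf("user=%llu nice=%llu system=%llu idle=%llu\n",
				       user, nice, sys, idle);
			else if (sscanf(line, "ctxt %llu", &ctxt) == 1)
				printf("context switches=%llu\n", ctxt);
		}
		fclose(f);
		return 0;
	}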
* RE: Industry db benchmark result on recent 2.6 kernels
From: Chen, Kenneth W @ 2005-03-31 19:53 UTC
To: 'Ingo Molnar'; +Cc: 'Nick Piggin', Linus Torvalds, 'Andrew Morton', linux-kernel

Ingo Molnar wrote on Thursday, March 31, 2005 6:15 AM
> is there any idle time on the system in steady state (it's a sign of
> under-balancing)? Idle balancing (and wakeup balancing) is one of the
> things that got tuned back and forth a lot. Also, do you know what the
> total number of context switches is during the full test on each kernel?
> Too many context switches can be an indicator of over-balancing. Another
> sign of migration gone bad can be a relative increase of userspace time
> vs. system time (due to cache thrashing; on DB workloads most of the
> cache contents are userspace's).

No, there is no idle time on the system. If the system becomes I/O bound, we
do everything we can to remove that bottleneck, i.e., throw a couple hundred
GB of memory at the system, add a couple hundred disk drives, etc. Believe
it or not, we are currently CPU bound, and that's the reason I care about
every single CPU cycle being spent in kernel code.
* RE: Industry db benchmark result on recent 2.6 kernels
From: Linus Torvalds @ 2005-03-31 20:05 UTC
To: Chen, Kenneth W; +Cc: 'Ingo Molnar', 'Nick Piggin', 'Andrew Morton', linux-kernel

On Thu, 31 Mar 2005, Chen, Kenneth W wrote:
>
> No, there is no idle time on the system. If the system becomes I/O bound,
> we do everything we can to remove that bottleneck, i.e., throw a couple
> hundred GB of memory at the system, add a couple hundred disk drives, etc.
> Believe it or not, we are currently CPU bound, and that's the reason I
> care about every single CPU cycle being spent in kernel code.

Can you post oprofile data for a run? Preferably both for the "best case"
2.6.x kernel (no point in comparing 2.4.x oprofiles with current) and for
the "current kernel", whether that be 2.6.11 or some more recent snapshot?

		Linus
* RE: Industry db benchmark result on recent 2.6 kernels
From: Linus Torvalds @ 2005-03-31 20:08 UTC
To: Chen, Kenneth W; +Cc: 'Ingo Molnar', 'Nick Piggin', 'Andrew Morton', linux-kernel

On Thu, 31 Mar 2005, Linus Torvalds wrote:
>
> Can you post oprofile data for a run?

Btw, I realize that you can't give good oprofiles for the user-mode
components, but a kernel profile with even just a single "time spent in user
mode" data point would be good, since a kernel scheduling problem might just
make caches work worse, and so the biggest negative might be visible in the
amount of time we spend in user mode due to more cache misses..

		Linus
* RE: Industry db benchmark result on recent 2.6 kernels
From: Chen, Kenneth W @ 2005-03-31 22:14 UTC
To: 'Linus Torvalds'; +Cc: 'Ingo Molnar', 'Nick Piggin', 'Andrew Morton', linux-kernel

Linus Torvalds wrote on Thursday, March 31, 2005 12:09 PM
> Btw, I realize that you can't give good oprofiles for the user-mode
> components, but a kernel profile with even just a single "time spent in
> user mode" data point would be good, since a kernel scheduling problem
> might just make caches work worse, and so the biggest negative might be
> visible in the amount of time we spend in user mode due to more cache
> misses..

I was going to bring this up in another thread; since you brought it up, I
will ride it along here.

The low point in 2.6.11 could very well be the change in the scheduler. It
does too much load balancing in the wake-up path and possibly makes a lot of
unwise decisions. For example, in try_to_wake_up() it will try
SD_WAKE_AFFINE for a task that is not cache hot. To decide "not hot", it
looks at when the task last ran and compares that against a constant,
sd->cache_hot_time. The problem is that this cache_hot_time is fixed for the
entire universe, whether it is a little Celeron processor with 128 KB of
cache or a server-class Itanium 2 processor with a 9 MB L3 cache. This
one-size-fits-all approach isn't really working at all.

We experimented with that parameter earlier and found it was one of the
major sources of the low point in 2.6.8. I debated the issue on LKML about
4 months ago, and finally everyone agreed to make that parameter a boot-time
param. The change made it into the BK tree for the 2.6.9 release, but
somehow it got ripped right out two days after it went in. I suspect 2.6.11
is a replay of 2.6.8 for the scheduler regression. We are running
experiments to confirm this theory.

That actually brings up more thoughts: what about all the other sched
parameters? We found that values other than the defaults help push
performance up, but it is probably not acceptable to pick a default number
from a db benchmark. The kernel needs either a dynamic, closed feedback loop
to adapt to the workload, or some runtime tunables to control them, though
the latter option did not go anywhere in the past.
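The check being criticized here boils down to a single threshold comparison.
A minimal illustrative sketch follows; the helper and parameter names are
for illustration only and are not the exact 2.6.11 source:

	/* Illustrative sketch of the "is this task still cache hot?" test:
	 * a task counts as hot if it ran more recently than a fixed
	 * per-domain constant, regardless of the actual cache size. */
	static inline int task_is_cache_hot(unsigned long long now,
					    unsigned long long last_ran,
					    unsigned long long cache_hot_time)
	{
		/* cache_hot_time is the same constant for a 128 KB Celeron
		 * and a 9 MB Itanium 2 - the complaint in this message */
		return (now - last_ran) < cache_hot_time;
	}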
* Re: Industry db benchmark result on recent 2.6 kernels
From: Nick Piggin @ 2005-03-31 23:35 UTC
To: Chen, Kenneth W; +Cc: 'Linus Torvalds', 'Ingo Molnar', 'Andrew Morton', linux-kernel

Chen, Kenneth W wrote:
> The low point in 2.6.11 could very well be the change in the scheduler. It
> does too much load balancing in the wake-up path and possibly makes a lot
> of unwise decisions.

OK - and considering you have no idle time at all, and the 2.6.11 kernel
included some scheduler changes that make balancing much more aggressive,
that is unfortunately likely to have caused the latest drop.

> For example, in try_to_wake_up() it will try SD_WAKE_AFFINE for a task
> that is not cache hot. To decide "not hot", it looks at when the task last
> ran and compares that against a constant, sd->cache_hot_time.

The other problem with using that value there is that it represents a hard
cutoff point in behaviour. For example, on a workload that really wants to
have wakers and wakees together, it will work poorly at low loads, but when
things get loaded up enough that we start seeing cache-cold tasks there,
behaviour suddenly changes.

In the -mm kernels, there are a large number of scheduler changes that
reduce the amount of balancing. They also remove cache_hot_time from this
path (though it is still useful for periodic balancing).

> The problem is that this cache_hot_time is fixed for the entire universe,
> whether it is a little Celeron processor with 128 KB of cache or a
> server-class Itanium 2 processor with a 9 MB L3 cache. This
> one-size-fits-all approach isn't really working at all.

Ingo had a cool patch to estimate dirty => dirty cacheline transfer latency
for all processors with respect to all others, and dynamically tune
cache_hot_time. Unfortunately it was never completely polished, and it is an
O(cpus^2) operation. It is a good idea to look into though.

> That actually brings up more thoughts: what about all the other sched
> parameters? We found that values other than the defaults help push
> performance up, but it is probably not acceptable to pick a default number
> from a db benchmark. The kernel needs either a dynamic, closed feedback
> loop to adapt to the workload, or some runtime tunables to control them,
> though the latter option did not go anywhere in the past.

They're in -mm. I think Andrew would rather see things like auto-tuning of
cache_hot_time than more runtime variables going in. If you were to write a
program that adjusted the various parameters using a feedback loop, that
would be a good argument for adding runtime tunables.

Oh, one last thing - if you do a great deal of scheduler tuning, it would be
very good if you could possibly use the patchset in -mm. Things have changed
sufficiently that the optimal values you find in 2.6 will not be the same as
those in -mm. I realise this may be difficult to justify, but I would hate
for the whole cycle to have to happen again when the patches go into 2.6.
* Re: Industry db benchmark result on recent 2.6 kernels
From: Paul Jackson @ 2005-04-01 6:05 UTC
To: Nick Piggin; +Cc: kenneth.w.chen, torvalds, mingo, akpm, linux-kernel

Nick wrote:
> Ingo had a cool patch to estimate dirty => dirty cacheline transfer
> latency ... Unfortunately ... and it is an O(cpus^2) operation.

Yes - a cool patch.

If we had an arch-specific bit of code that, for any two cpus, could give a
'pseudo-distance' between them, where the only real requirements were that
(1) if two pairs of cpus had the same pseudo-distance, then that meant they
had the same size, layout, kind and speed of bus and cache hardware between
them (*), and (2) it was cheap - hardly more than a few lines of code and a
subroutine call to obtain - then Ingo's code could be:

	for each cpu c1:
	    for each cpu c2:
	        psdist = pseudo_distance(c1, c2)
	        if I've seen psdist before,
	            use the latency computed for that psdist
	        else
	            compute a real latency number and remember it for that psdist

A generic form of pseudo_distance, which would work for all normal-sized
systems, would be:

	int pseudo_distance(int c1, int c2)
	{
		static int x;
		return x++;
	}

Then us poor slobs with big honkin' NUMA iron could code up a real
pseudo_distance() routine, to avoid the actual pain of doing real work for
cpus^2 iterations for large cpu counts.

Our big boxes have regular geometries with much symmetry, so they would
provide significant opportunity to exploit equal pseudo-distances. And I
would imagine that costs of K * NCPU * NCPU are tolerable in this estimation
routine, for sufficiently small K and existing values of NCPU.

(*) That is, if pseudo_distance(c1, c2) == pseudo_distance(d1, d2), then
this meant that however c1 and c2 were connected to each other in the system
(intervening buses and caches and such), cpus d1 and d2 were connected the
same way, so they could be presumed to have the same latency, close enough.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401
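A minimal C sketch of the memoized loop Paul describes; pseudo_distance(),
measure_latency() and set_migration_cost() are placeholders for the arch
hook and Ingo's measurement code, not real kernel interfaces:

	/* Sketch: do the expensive cache-transfer measurement only once per
	 * distinct pseudo-distance, then reuse the result for every cpu pair
	 * that shares that distance. */
	#define MAX_PSDIST 64

	static unsigned long long latency_for[MAX_PSDIST];
	static int seen[MAX_PSDIST];

	static void build_migration_costs(int ncpus)
	{
		int c1, c2;

		for (c1 = 0; c1 < ncpus; c1++) {
			for (c2 = 0; c2 < ncpus; c2++) {
				int d = pseudo_distance(c1, c2);  /* arch hook */

				if (!seen[d]) {
					/* slow part: run once per unique path */
					latency_for[d] = measure_latency(c1, c2);
					seen[d] = 1;
				}
				set_migration_cost(c1, c2, latency_for[d]);
			}
		}
	}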
* Re: Industry db benchmark result on recent 2.6 kernels
From: Nick Piggin @ 2005-04-01 6:34 UTC
To: Paul Jackson; +Cc: kenneth.w.chen, torvalds, Ingo Molnar, Andrew Morton, lkml

On Thu, 2005-03-31 at 22:05 -0800, Paul Jackson wrote:
>
> Then us poor slobs with big honkin' NUMA iron could code up a real
> pseudo_distance() routine, to avoid the actual pain of doing real work
> for cpus^2 iterations for large cpu counts.
>
> Our big boxes have regular geometries with much symmetry, so they would
> provide significant opportunity to exploit equal pseudo-distances.

A couple of observations:

This doesn't actually need to be an O(n^2) operation. The result of it is
only going to be used in the sched-domains code, so what is really wanted is
"how far away is one sched_group from another". Although we may also scale
that based on the *amount* of cache in the path between two CPUs, that is
often just a property of the CPUs themselves in smaller systems, so also not
O(n^2).

Secondly, we could use Ingo's O(n^2) code for the *SMP* domain on all
architectures (so in your case of only 2 CPUs per node, it is obviously much
cheaper, even over 256 nodes). Then the NUMA domain could just inherit this
SMP value as a default, and allow architectures to override it individually.

This may allow us to set up decent baseline numbers, properly scaled by
cache size vs memory bandwidth, without going overboard in complexity (while
still allowing arch code to do more fancy stuff).
* Re: Industry db benchmark result on recent 2.6 kernels
From: Paul Jackson @ 2005-04-01 7:19 UTC
To: Nick Piggin; +Cc: kenneth.w.chen, torvalds, mingo, akpm, linux-kernel

> A couple of observations:

yeah - plausible enough.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401
* Re: Industry db benchmark result on recent 2.6 kernels 2005-04-01 6:05 ` Paul Jackson 2005-04-01 6:34 ` Nick Piggin @ 2005-04-01 6:46 ` Ingo Molnar 2005-04-01 22:32 ` Chen, Kenneth W 2005-04-01 6:59 ` Ingo Molnar 2 siblings, 1 reply; 41+ messages in thread From: Ingo Molnar @ 2005-04-01 6:46 UTC (permalink / raw) To: Paul Jackson; +Cc: Nick Piggin, kenneth.w.chen, torvalds, akpm, linux-kernel [-- Attachment #1: Type: text/plain, Size: 966 bytes --] * Paul Jackson <pj@engr.sgi.com> wrote: > Nick wrote: > > Ingo had a cool patch to estimate dirty => dirty cacheline transfer latency > > ... Unfortunately ... and it is an O(cpus^2) operation. > > Yes - a cool patch. before we get into complexities, i'd like to see whether it solves Ken's performance problem. The attached patch (against BK-curr, but should apply to vanilla 2.6.12-rc1 too) adds the autodetection feature. (For ia64 i've hacked in a cachesize of 9MB for Ken's testsystem.) boots fine on x86, and gives this on a 4-way box: Brought up 4 CPUs migration cost matrix (cache_size: 524288, cpu: 2379 MHz): [00] [01] [02] [03] [00]: 1.3 1.3 1.4 1.2 [01]: 1.5 1.3 1.3 1.5 [02]: 1.5 1.3 1.3 1.5 [03]: 1.3 1.3 1.3 1.3 min_delta: 1351361 using cache_decay nsec: 1351361 (1 msec) which is a pretty reasonable estimate on that box. (fast P4s, small cache) Ken, could you give it a go? Ingo [-- Attachment #2: cache-hot-autodetect-2.6.12-rc1-A0 --] [-- Type: text/plain, Size: 9988 bytes --] --- linux/kernel/sched.c.orig +++ linux/kernel/sched.c @@ -4699,6 +4699,232 @@ static void check_sibling_maps(void) #endif /* + * Task migration cost measurement between source and target CPUs. + * + * This is done by measuring the worst-case cost. Here are the + * steps that are taken: + * + * 1) the source CPU dirties its L2 cache with a shared buffer + * 2) the target CPU dirties its L2 cache with a local buffer + * 3) the target CPU dirties the shared buffer + * + * We measure the time step #3 takes - this is the cost of migrating + * a cache-hot task that has a large, dirty dataset in the L2 cache, + * to another CPU. + */ + + +/* + * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This + * is the operation that is timed, so we try to generate unpredictable + * cachemisses that still end up filling the L2 cache: + */ +__init static void fill_cache(void *__cache, unsigned long __size) +{ + unsigned long size = __size/sizeof(long); + unsigned long *cache = __cache; + unsigned long data = 0xdeadbeef; + int i; + + for (i = 0; i < size/4; i++) { + if ((i & 3) == 0) + cache[i] = data; + if ((i & 3) == 1) + cache[size-1-i] = data; + if ((i & 3) == 2) + cache[size/2-i] = data; + if ((i & 3) == 3) + cache[size/2+i] = data; + } +} + +struct flush_data { + unsigned long source, target; + void (*fn)(void *, unsigned long); + void *cache; + void *local_cache; + unsigned long size; + unsigned long long delta; +}; + +/* + * Dirty L2 on the source CPU: + */ +__init static void source_handler(void *__data) +{ + struct flush_data *data = __data; + + if (smp_processor_id() != data->source) + return; + + memset(data->cache, 0, data->size); +} + +/* + * Dirty the L2 cache on this CPU and then access the shared + * buffer. (which represents the working set of the migrated task.) 
+ */ +__init static void target_handler(void *__data) +{ + struct flush_data *data = __data; + unsigned long long t0, t1; + unsigned long flags; + + if (smp_processor_id() != data->target) + return; + + memset(data->local_cache, 0, data->size); + local_irq_save(flags); + t0 = sched_clock(); + fill_cache(data->cache, data->size); + t1 = sched_clock(); + local_irq_restore(flags); + + data->delta = t1 - t0; +} + +/* + * Measure the cache-cost of one task migration: + */ +__init static unsigned long long measure_one(void *cache, unsigned long size, + int source, int target) +{ + struct flush_data data; + unsigned long flags; + void *local_cache; + + local_cache = vmalloc(size); + if (!local_cache) { + printk("couldnt allocate local cache ...\n"); + return 0; + } + memset(local_cache, 0, size); + + local_irq_save(flags); + local_irq_enable(); + + data.source = source; + data.target = target; + data.size = size; + data.cache = cache; + data.local_cache = local_cache; + + if (on_each_cpu(source_handler, &data, 1, 1) != 0) { + printk("measure_one: timed out waiting for other CPUs\n"); + local_irq_restore(flags); + return -1; + } + if (on_each_cpu(target_handler, &data, 1, 1) != 0) { + printk("measure_one: timed out waiting for other CPUs\n"); + local_irq_restore(flags); + return -1; + } + + vfree(local_cache); + + return data.delta; +} + +__initdata unsigned long sched_cache_size; + +/* + * Measure a series of task migrations and return the maximum + * result - the worst-case. Since this code runs early during + * bootup the system is 'undisturbed' and the maximum latency + * makes sense. + * + * As the working set we use 2.1 times the L2 cache size, this is + * chosen in such a nonsymmetric way so that fill_cache() doesnt + * iterate at power-of-2 boundaries (which might hit cache mapping + * artifacts and pessimise the results). + */ +__init static unsigned long long measure_cacheflush_time(int cpu1, int cpu2) +{ + unsigned long size = sched_cache_size*21/10; + unsigned long long delta, max = 0; + void *cache; + int i; + + if (!size) { + printk("arch has not set cachesize - using default.\n"); + return 0; + } + if (!cpu_online(cpu1) || !cpu_online(cpu2)) { + printk("cpu %d and %d not both online!\n", cpu1, cpu2); + return 0; + } + cache = vmalloc(size); + if (!cache) { + printk("could not vmalloc %ld bytes for cache!\n", size); + return 0; + } + memset(cache, 0, size); + for (i = 0; i < 20; i++) { + delta = measure_one(cache, size, cpu1, cpu2); + if (delta > max) + max = delta; + } + + vfree(cache); + + /* + * A task is considered 'cache cold' if at least 2 times + * the worst-case cost of migration has passed. + * (this limit is only listened to if the load-balancing + * situation is 'nice' - if there is a large imbalance we + * ignore it for the sake of CPU utilization and + * processing fairness.) + * + * (We use 2.1 times the L2 cachesize in our measurement, + * we keep this factor when returning.) 
+ */ + return max; +} + +unsigned long long cache_decay_nsec; + +void __devinit calibrate_cache_decay(void) +{ + int cpu1 = -1, cpu2 = -1; + unsigned long long min_delta = -1ULL; + + printk("migration cost matrix (cache_size: %ld, cpu: %ld MHz):\n", + sched_cache_size, cpu_khz/1000); + printk(" "); + for (cpu1 = 0; cpu1 < NR_CPUS; cpu1++) { + if (!cpu_online(cpu1)) + continue; + printk(" [%02d]", cpu1); + } + printk("\n"); + for (cpu1 = 0; cpu1 < NR_CPUS; cpu1++) { + if (!cpu_online(cpu1)) + continue; + printk("[%02d]: ", cpu1); + for (cpu2 = 0; cpu2 < NR_CPUS; cpu2++) { + unsigned long long delta; + + if (!cpu_online(cpu2)) + continue; + delta = measure_cacheflush_time(cpu1, cpu2); + + printk(" %3Ld.%ld", delta >> 20, + (((long)delta >> 10) / 102) % 10); + if ((cpu1 != cpu2) && (delta < min_delta)) + min_delta = delta; + } + printk("\n"); + } + printk("min_delta: %Ld\n", min_delta); + if (min_delta != -1ULL) + cache_decay_nsec = min_delta; + printk("using cache_decay nsec: %Ld (%Ld msec)\n", + cache_decay_nsec, cache_decay_nsec >> 20); + + +} + +/* * Set up scheduler domains and groups. Callers must hold the hotplug lock. */ static void __devinit arch_init_sched_domains(void) @@ -4706,6 +4932,7 @@ static void __devinit arch_init_sched_do int i; cpumask_t cpu_default_map; + calibrate_cache_decay(); #if defined(CONFIG_SCHED_SMT) && defined(CONFIG_NUMA) check_sibling_maps(); #endif --- linux/arch/ia64/kernel/domain.c.orig +++ linux/arch/ia64/kernel/domain.c @@ -139,6 +139,9 @@ void __devinit arch_init_sched_domains(v int i; cpumask_t cpu_default_map; + sched_cache_size = 9*1024*1024; // hack for Kenneth + calibrate_cache_decay(); + /* * Setup mask for cpus without special case scheduling requirements. * For now this just excludes isolated cpus, but could be used to --- linux/arch/i386/kernel/smpboot.c.orig +++ linux/arch/i386/kernel/smpboot.c @@ -873,6 +873,7 @@ static void smp_tune_scheduling (void) cachesize = 16; /* Pentiums, 2x8kB cache */ bandwidth = 100; } + sched_cache_size = cachesize * 1024; } } --- linux/include/asm-ia64/topology.h.orig +++ linux/include/asm-ia64/topology.h @@ -51,7 +51,7 @@ void build_cpu_to_node_map(void); .max_interval = 320, \ .busy_factor = 320, \ .imbalance_pct = 125, \ - .cache_hot_time = (10*1000000), \ + .cache_hot_time = cache_decay_nsec, \ .cache_nice_tries = 1, \ .per_cpu_gain = 100, \ .flags = SD_LOAD_BALANCE \ @@ -73,7 +73,7 @@ void build_cpu_to_node_map(void); .max_interval = 320, \ .busy_factor = 320, \ .imbalance_pct = 125, \ - .cache_hot_time = (10*1000000), \ + .cache_hot_time = cache_decay_nsec, \ .cache_nice_tries = 1, \ .per_cpu_gain = 100, \ .flags = SD_LOAD_BALANCE \ --- linux/include/linux/topology.h.orig +++ linux/include/linux/topology.h @@ -61,6 +61,12 @@ #endif /* + * total time penalty to migrate a typical application's cache contents + * from one CPU to another. Measured by the boot-time code. 
+ */ +extern unsigned long long cache_decay_nsec; + +/* * Below are the 3 major initializers used in building sched_domains: * SD_SIBLING_INIT, for SMT domains * SD_CPU_INIT, for SMP domains @@ -112,7 +118,7 @@ .max_interval = 4, \ .busy_factor = 64, \ .imbalance_pct = 125, \ - .cache_hot_time = (5*1000000/2), \ + .cache_hot_time = cache_decay_nsec, \ .cache_nice_tries = 1, \ .per_cpu_gain = 100, \ .flags = SD_LOAD_BALANCE \ --- linux/include/linux/sched.h.orig +++ linux/include/linux/sched.h @@ -527,7 +527,12 @@ extern cpumask_t cpu_isolated_map; extern void init_sched_build_groups(struct sched_group groups[], cpumask_t span, int (*group_fn)(int cpu)); extern void cpu_attach_domain(struct sched_domain *sd, int cpu); + #endif /* ARCH_HAS_SCHED_DOMAIN */ + +extern unsigned long sched_cache_size; +extern void calibrate_cache_decay(void); + #endif /* CONFIG_SMP */ --- linux/include/asm-i386/topology.h.orig +++ linux/include/asm-i386/topology.h @@ -75,7 +75,7 @@ static inline cpumask_t pcibus_to_cpumas .max_interval = 32, \ .busy_factor = 32, \ .imbalance_pct = 125, \ - .cache_hot_time = (10*1000000), \ + .cache_hot_time = cache_decay_nsec, \ .cache_nice_tries = 1, \ .per_cpu_gain = 100, \ .flags = SD_LOAD_BALANCE \ --- linux/include/asm-ppc64/topology.h.orig +++ linux/include/asm-ppc64/topology.h @@ -46,7 +46,7 @@ static inline int node_to_first_cpu(int .max_interval = 32, \ .busy_factor = 32, \ .imbalance_pct = 125, \ - .cache_hot_time = (10*1000000), \ + .cache_hot_time = cache_decay_nsec, \ .cache_nice_tries = 1, \ .per_cpu_gain = 100, \ .flags = SD_LOAD_BALANCE \ --- linux/include/asm-x86_64/topology.h.orig +++ linux/include/asm-x86_64/topology.h @@ -48,7 +48,7 @@ static inline cpumask_t __pcibus_to_cpum .max_interval = 32, \ .busy_factor = 32, \ .imbalance_pct = 125, \ - .cache_hot_time = (10*1000000), \ + .cache_hot_time = cache_decay_nsec, \ .cache_nice_tries = 1, \ .per_cpu_gain = 100, \ .flags = SD_LOAD_BALANCE \ ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: Industry db benchmark result on recent 2.6 kernels
From: Chen, Kenneth W @ 2005-04-01 22:32 UTC
To: 'Ingo Molnar', Paul Jackson; +Cc: Nick Piggin, torvalds, akpm, linux-kernel

Ingo Molnar wrote on Thursday, March 31, 2005 10:46 PM
> before we get into complexities, i'd like to see whether it solves Ken's
> performance problem. The attached patch (against BK-curr, but should
> apply to vanilla 2.6.12-rc1 too) adds the autodetection feature. (For
> ia64 i've hacked in a cachesize of 9MB for Ken's testsystem.)

Very nice, it produced a good estimate of 9 ms on my test system. Our
historical data showed that 12 ms was the best value on the same system for
the db workload (that was measured on 2.6.8). The scheduler dynamics have
changed in 2.6.11, so that old finding may not apply to the new kernel any
more.

	migration cost matrix (cache_size: 9437184, cpu: 1500 MHz):
	          [00]    [01]    [02]    [03]
	[00]:     9.1     8.5     8.5     8.5
	[01]:     8.5     9.1     8.5     8.5
	[02]:     8.5     8.5     9.1     8.5
	[03]:     8.5     8.5     8.5     9.1
	min_delta: 8908106
	using cache_decay nsec: 8908106 (8 msec)

Paul, you definitely want to check this out on your large NUMA box. I booted
a kernel with this patch on a 32-way NUMA box and it took a long .... time
to produce the cost matrix.
* RE: Industry db benchmark result on recent 2.6 kernels
From: Linus Torvalds @ 2005-04-01 22:51 UTC
To: Chen, Kenneth W; +Cc: 'Ingo Molnar', Paul Jackson, Nick Piggin, akpm, linux-kernel

On Fri, 1 Apr 2005, Chen, Kenneth W wrote:
>
> Paul, you definitely want to check this out on your large NUMA box. I
> booted a kernel with this patch on a 32-way NUMA box and it took a long
> .... time to produce the cost matrix.

Is there anything fundamentally wrong with the notion of just initializing
the cost matrix to something that isn't completely wrong at bootup, and just
letting user space fill it in?

Then you couple that with a program that can do so automatically (ie move
the in-kernel heuristics into user-land), and something that can re-load it
on demand.

Voila - you have something potentially expensive that you run once, and then
you have a matrix that can be edited by the sysadmin later and just
re-loaded at each boot. That sounds pretty optimal, especially in the sense
that it allows the sysadmin to tweak things depending on the use of the box
if he really wants to.

Hmm? Or am I just totally on crack?

		Linus
* Re: Industry db benchmark result on recent 2.6 kernels
From: Nick Piggin @ 2005-04-02 2:19 UTC
To: Linus Torvalds; +Cc: Chen, Kenneth W, 'Ingo Molnar', Paul Jackson, akpm, linux-kernel

Linus Torvalds wrote:
>
> Is there anything fundamentally wrong with the notion of just initializing
> the cost matrix to something that isn't completely wrong at bootup, and
> just letting user space fill it in?

That's probably not a bad idea. You'd have to do things like set RT
scheduling for your user tasks, and not have any other activity happening,
so that effectively hangs your system for a while anyway. But if you run it
once and dump the output to a config file...

Anyway, we're faced with the immediate problem of crap performance for
2.6.12 (for people with 1500 disks), so an in-kernel solution might be
better in the short term. I'll see if we can adapt Ingo's thingy with
something that is "good enough" and doesn't take years to run on a 512-way.

Nick
-- 
SUSE Labs, Novell Inc.
* RE: Industry db benchmark result on recent 2.6 kernels
From: Kevin Puetz @ 2005-04-04 1:40 UTC
To: linux-kernel

Linus Torvalds wrote:
>
> Is there anything fundamentally wrong with the notion of just initializing
> the cost matrix to something that isn't completely wrong at bootup, and
> just letting user space fill it in?

Wouldn't getting rescheduled (and thus having another program trash the
cache on you) really mess up the data collection though? I suppose that by
spawning off threads, each with a fixed affinity and SCHED_FIFO, one could
hang onto the CPU to collect the data. But then it's not (a lot) different
from doing it in-kernel.

> Then you couple that with a program that can do so automatically (ie move
> the in-kernel heuristics into user-land), and something that can re-load
> it on demand.

This part seems sensible though :-)

> Voila - you have something potentially expensive that you run once, and
> then you have a matrix that can be edited by the sysadmin later and just
> re-loaded at each boot. That sounds pretty optimal, especially in the
> sense that it allows the sysadmin to tweak things depending on the use of
> the box if he really wants to.
>
> Hmm? Or am I just totally on crack?
>
> 		Linus
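For reference, the pinning setup Kevin describes can be done with standard
calls; a minimal userspace sketch (error handling trimmed, the priority
value is arbitrary, and it must run with root privileges):

	/* Pin the calling thread to 'cpu' and switch it to SCHED_FIFO so
	 * nothing else preempts it while it measures cache-transfer cost. */
	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>

	static int pin_and_elevate(int cpu)
	{
		cpu_set_t mask;
		struct sched_param sp = { .sched_priority = 50 };

		CPU_ZERO(&mask);
		CPU_SET(cpu, &mask);
		if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
			perror("sched_setaffinity");
			return -1;
		}
		if (sched_setscheduler(0, SCHED_FIFO, &sp) < 0) {
			perror("sched_setscheduler");
			return -1;
		}
		return 0;
	}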
* Re: Industry db benchmark result on recent 2.6 kernels
From: Paul Jackson @ 2005-04-02 1:44 UTC
To: Chen, Kenneth W; +Cc: mingo, nickpiggin, torvalds, akpm, linux-kernel

Kenneth wrote:
> Paul, you definitely want to check this out on your large NUMA box.

Interesting - thanks. I can get a kernel patched and booted on a big box
easily enough. I don't know how to run an "industry db benchmark", and
benchmarks aren't my forte.

Should I rope in one of our guys who is benchmark-savvy, or are there some
instructions you can point to for running an appropriate benchmark?

Or are we just interested, first of all, in what sort of values this cost
matrix gets initialized with (and how slow it is to compute)?

I can get time on a 64-cpu box with a day's notice, and time on a 512-cpu
box with 2 or 3 days' notice.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401
* RE: Industry db benchmark result on recent 2.6 kernels
From: Chen, Kenneth W @ 2005-04-02 2:05 UTC
To: 'Paul Jackson'; +Cc: mingo, nickpiggin, torvalds, akpm, linux-kernel

Paul Jackson wrote on Friday, April 01, 2005 5:45 PM
> Kenneth wrote:
> > Paul, you definitely want to check this out on your large NUMA box.
>
> Interesting - thanks. I can get a kernel patched and booted on a big
> box easily enough. I don't know how to run an "industry db benchmark",
> and benchmarks aren't my forte.

To run this "industry db benchmark", assuming you have a 32-way NUMA box, I
recommend buying the following:

	512 GB memory
	1500 73 GB 15k-rpm fiber channel disks
	50 hardware RAID controllers - make sure you get the top-of-the-line
	   model (the one with 1 GB of memory in the controller)
	25 fiber channel controllers
	4 gigabit ethernet controllers
	12 rack frames

Then you will be good to go. Oh, get several 220-volt power outlets too;
probably some refrigeration units will go along with that. Sorry, I haven't
mentioned the mid-tier and the client machines yet. ;-)
* Re: Industry db benchmark result on recent 2.6 kernels
From: Paul Jackson @ 2005-04-02 2:38 UTC
To: Chen, Kenneth W; +Cc: mingo, nickpiggin, torvalds, akpm, linux-kernel

Kenneth wrote:
> I recommend buying the following:

ah so ... I think I'll skip running the industry db benchmark for now, if
that's all the same.

What sort of feedback are you looking for from my running this patch?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401
* RE: Industry db benchmark result on recent 2.6 kernels
From: David Lang @ 2005-04-03 6:36 UTC
To: Chen, Kenneth W; +Cc: 'Paul Jackson', mingo, nickpiggin, torvalds, akpm, linux-kernel

On Fri, 1 Apr 2005, Chen, Kenneth W wrote:

> To run this "industry db benchmark", assuming you have a 32-way NUMA box,
> I recommend buying the following:
>
> 	512 GB memory
> 	1500 73 GB 15k-rpm fiber channel disks
> 	50 hardware RAID controllers - make sure you get the top-of-the-line
> 	   model (the one with 1 GB of memory in the controller)
> 	25 fiber channel controllers
> 	4 gigabit ethernet controllers
> 	12 rack frames

Ken, given that you don't have the bandwidth to keep all of those disks
fully utilized, do you have any idea how big a performance hit you would
take by going to larger, but slower, SATA drives?

Given that this would let you get the same storage with about 1200 fewer
drives (with corresponding savings in RAID controllers, fiber channel
controllers and rack frames), it would be interesting to know how close it
would be. For a lot of people the savings, which are probably within
spitting distance of $1M, could be worth the decrease in performance.

David Lang

-- 
There are two ways of constructing a software design. One way is to make it
so simple that there are obviously no deficiencies. And the other way is to
make it so complicated that there are no obvious deficiencies.
 -- C.A.R. Hoare
* Re: Industry db benchmark result on recent 2.6 kernels
From: Andreas Dilger @ 2005-04-03 6:53 UTC
To: David Lang; +Cc: Chen, Kenneth W, 'Paul Jackson', mingo, nickpiggin, torvalds, akpm, linux-kernel

On Apr 02, 2005 22:36 -0800, David Lang wrote:
> Ken, given that you don't have the bandwidth to keep all of those disks
> fully utilized, do you have any idea how big a performance hit you would
> take by going to larger, but slower, SATA drives?
>
> Given that this would let you get the same storage with about 1200 fewer
> drives (with corresponding savings in RAID controllers, fiber channel
> controllers and rack frames), it would be interesting to know how close it
> would be. For a lot of people the savings, which are probably within
> spitting distance of $1M, could be worth the decrease in performance.

For benchmarks like these, the issue isn't the storage capacity, but rather
the ability to have lots of heads seeking concurrently to access the many
database tables. At one large site I used to work at, the database ran on
hundreds of 1, 2, and 4 GB disks long after they could be replaced by many
fewer, larger disks...

Cheers, Andreas
-- 
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
* Re: Industry db benchmark result on recent 2.6 kernels
From: David Lang @ 2005-04-03 7:23 UTC
To: Andreas Dilger; +Cc: Chen, Kenneth W, 'Paul Jackson', mingo, nickpiggin, torvalds, akpm, linux-kernel

On Sat, 2 Apr 2005, Andreas Dilger wrote:

> For benchmarks like these, the issue isn't the storage capacity, but
> rather the ability to have lots of heads seeking concurrently to access
> the many database tables. At one large site I used to work at, the
> database ran on hundreds of 1, 2, and 4 GB disks long after they could be
> replaced by many fewer, larger disks...

I can understand this to a point, but it seems to me that beyond some point
you stop gaining from this (simply because you run out of bandwidth to keep
all the heads busy). I would have guessed that this happens somewhere in the
hundreds of drives rather than the thousands, so going from 1500x73G to
400x300G (even if this drops you from 15k rpm to 10k rpm) would still
saturate the interface bandwidth before the drives.

David Lang

-- 
There are two ways of constructing a software design. One way is to make it
so simple that there are obviously no deficiencies. And the other way is to
make it so complicated that there are no obvious deficiencies.
 -- C.A.R. Hoare
* Re: Industry db benchmark result on recent 2.6 kernels
From: Nick Piggin @ 2005-04-03 7:38 UTC
To: David Lang; +Cc: Andreas Dilger, Chen, Kenneth W, 'Paul Jackson', mingo, torvalds, akpm, linux-kernel

David Lang wrote:
> I can understand this to a point, but it seems to me that beyond some
> point you stop gaining from this (simply because you run out of bandwidth
> to keep all the heads busy). I would have guessed that this happens
> somewhere in the hundreds of drives rather than the thousands, so going
> from 1500x73G to 400x300G (even if this drops you from 15k rpm to 10k rpm)
> would still saturate the interface bandwidth before the drives.

But in this case probably not - Ken increases IO capacity until the CPUs
become saturated. So there probably isn't a very large margin for error:
you might need 2000 of the slower SATA disks to achieve a similar IOPS
capacity.
* Re: Industry db benchmark result on recent 2.6 kernels
  2005-04-01  6:05 ` Paul Jackson
  2005-04-01  6:34   ` Nick Piggin
  2005-04-01  6:46   ` Ingo Molnar
@ 2005-04-01  6:59   ` Ingo Molnar
  2005-04-01  9:29     ` Paul Jackson
  2 siblings, 1 reply; 41+ messages in thread
From: Ingo Molnar @ 2005-04-01 6:59 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Nick Piggin, kenneth.w.chen, torvalds, akpm, linux-kernel

* Paul Jackson <pj@engr.sgi.com> wrote:

> Nick wrote:
> > Ingo had a cool patch to estimate dirty => dirty cacheline transfer latency
> > ... Unfortunately ... and it is an O(cpus^2) operation.
>
> Yes - a cool patch.
>
> If we had an arch-specific bit of code, that for any two cpus, could
> give a 'pseudo-distance' between them, where the only real
> requirements were that (1) if two pairs of cpus had the same
> pseudo-distance, then that meant they had the same size, layout, kind
> and speed of bus and cache hardware between them (*), and (2) it was
> cheap - hardly more than a few lines of code and a subroutine call to
> obtain, then Ingo's code could be:

yeah. The search can be limited quite drastically if all duplicate
constellations of CPUs (which is a function of the topology) are only
measured once.

but can 'pseudo-distance' be calculated accurately enough? If it's a
scalar, how do you make sure that unique paths for data to flow have
different distances? The danger is 'false sharing' in the following
scenario: let's say CPUs #1 and #2 are connected via hardware H1,H2,H3,
and CPUs #3 and #4 are connected via H4,H5,H6. Each hardware component
is unique and has different characteristics. (e.g. this scenario can
happen when different speed CPUs are mixed into the same system - or if
there is some bus asymmetry)

It has to be made sure that H1+H2+H3 != H4+H5+H6, otherwise false
sharing will happen. For that 'uniqueness of sum' to be guaranteed, one
has to assign power-of-two values to each separate type of hardware
component. [ or one has to assign very accurate 'distance' values to
hardware components. (adding another source for errors - i.e. false
sharing of the migration value) ]

and even the power-of-two assignment method has its limitations: it
obviously runs out at 32/64 components (i'm not sure we can do that),
and if a given component type can be present in the same path _twice_,
that component will have to take two bits.

or is the 'at most 64 different hardware component types' limit ok? (it
feels like a limit we might regret later.)

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread
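The aliasing problem Ingo describes is easy to see in a small sketch. The
code below is purely illustrative and not from any kernel tree; the
component "distances" and IDs are made-up numbers, chosen only to show why
summed per-component values can collide while power-of-two type IDs cannot.

/*
 * Illustrative only -- not kernel code.  Why summing per-component
 * "distance" values can alias two genuinely different CPU-to-CPU paths,
 * and why per-type power-of-two IDs cannot.
 */
#include <stdio.h>

/* Hypothetical per-component "distance" weights (made-up numbers). */
enum { D_H1 = 3, D_H2 = 2, D_H3 = 1, D_H4 = 4, D_H5 = 1, D_H6 = 1 };

/* Hypothetical power-of-two component-type IDs. */
enum {
	ID_H1 = 1 << 0, ID_H2 = 1 << 1, ID_H3 = 1 << 2,
	ID_H4 = 1 << 3, ID_H5 = 1 << 4, ID_H6 = 1 << 5,
};

int main(void)
{
	/* CPUs #1/#2 talk over H1,H2,H3; CPUs #3/#4 over H4,H5,H6. */
	int sum_a  = D_H1 + D_H2 + D_H3;	/* 6 */
	int sum_b  = D_H4 + D_H5 + D_H6;	/* 6 -- false sharing: the paths collide */
	int mask_a = ID_H1 | ID_H2 | ID_H3;	/* 0x07 */
	int mask_b = ID_H4 | ID_H5 | ID_H6;	/* 0x38 -- distinct, as required */

	printf("sums:  %d vs %d\n", sum_a, sum_b);
	printf("masks: %#x vs %#x\n", mask_a, mask_b);

	/* The cost Ingo points out: one bit per component type (32/64 max
	 * per word), and a component that occurs twice on one path would
	 * need a second bit. */
	return 0;
}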
* Re: Industry db benchmark result on recent 2.6 kernels
  2005-04-01  6:59 ` Ingo Molnar
@ 2005-04-01  9:29   ` Paul Jackson
  2005-04-01 10:34     ` Ingo Molnar
  0 siblings, 1 reply; 41+ messages in thread
From: Paul Jackson @ 2005-04-01 9:29 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: nickpiggin, kenneth.w.chen, torvalds, akpm, linux-kernel

> It has to be made sure that H1+H2+H3 != H4+H5+H6,

Yeah - if you start trying to think about the general case here, the
combinations tend to explode on one.

I'm thinking we get off easy, because:

 1) Specific arch's can apply specific short cuts.

    My intuition was that any specific architecture, when it got down to
    specifics, could find enough ways to cheat so that it could get
    results quickly, that easily fit in a single 'distance' word, which
    results were 'close enough.'

 2) The bigger the system, the more uniform its core hardware.

    At least SGI's big iron systems are usually pretty uniform in the
    hardware that matters here. We might mix two cpu speeds, or a couple
    of memory sizes. Not much more, at least that I know of. A 1024 cpu
    NUMA system cobbled together from a wide variety of parts would be a
    very strange beast indeed.

 3) Approximate results (aliasing at the edges) are ok.

    If the SN2 arch code ends up telling the cache latency initialization
    code that two cpus on opposite sides of a 1024 cpu system are the
    same distance as another such pair, even though they aren't exactly
    the same distance, does anyone care? Not I.

So I think we've got plenty of opportunity to special case arch's, plenty
of headroom, and plenty of latitude to bend not break if we do start to
push the limits.

Think of that 64 bits as if it was floating point, not int.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: Industry db benchmark result on recent 2.6 kernels
  2005-04-01  9:29 ` Paul Jackson
@ 2005-04-01 10:34   ` Ingo Molnar
  2005-04-01 14:39     ` Paul Jackson
  0 siblings, 1 reply; 41+ messages in thread
From: Ingo Molnar @ 2005-04-01 10:34 UTC (permalink / raw)
  To: Paul Jackson; +Cc: nickpiggin, kenneth.w.chen, torvalds, akpm, linux-kernel

* Paul Jackson <pj@engr.sgi.com> wrote:

> > It has to be made sure that H1+H2+H3 != H4+H5+H6,
>
> Yeah - if you start trying to think about the general case here, the
> combinations tend to explode on one.

well, while i dont think we need that much complexity, the most generic
case is a representation of the actual hardware (cache/bus) layout,
where separate hardware component types have different IDs.

e.g. a simple hierarchy would be:

               ____H1____
             _H2_      _H2_
            H3    H3  H3    H3
           [P1] [P2] [P3] [P4]

Then all that has to happen is to build a 'path' of ids (e.g. "H3,H2,H3"
is a path), which is a vector of IDs, and an array of already measured
vectors. These IDs never get added together, so they just have to be
unique per type of component.

there are two different vectors possible: H3,H2,H3 and H3,H2,H1,H2,H3,
so two measurements are needed, between P1 and P2 and between P1 and P3.
(the first natural occurrence of each path)

this is tree walking and vector building/matching. There is no
restriction on the layout of the hierarchy, other than it has to be a
tree. (no circularity) It's easy to specify such a tree, and there are
no 'mixup' dangers.

> I'm thinking we get off easy, because:
>
>  1) Specific arch's can apply specific short cuts.
>
>     My intuition was that any specific architecture, when it
>     got down to specifics, could find enough ways to cheat
>     so that it could get results quickly, that easily fit
>     in a single 'distance' word, which results were 'close
>     enough.'

yes - but the fundamental problem is already that we do have per-arch
shortcuts: the cache_hot value. If an arch wanted to set it up, it could
do it. But it's not easy to set it up and the value is not intuitive. So
the key is to make it convenient and fool-proof to set up the data -
otherwise it just won't be used, or will be used incorrectly.

but i'd too go for the simpler 'pseudo-distance' function, because it's
so much easier to iterate through it. But it's not intuitive. Maybe it
should be called 'connection ID': a unique ID for each unique type of
path between CPUs. An architecture can take shortcuts if it has a simple
layout (most have). I.e.:

	sched_cpu_connection_type(int cpu1, int cpu2)

would return a unique type ID for each different kind of connection.
Note that 'distance' (or 'memory access latency', or 'NUMA factor') as a
unit is not sufficient, as it does not take cache-size nor CPU speed
into account, which does play a role in the migration characteristics.

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread
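A minimal sketch of the path-vector idea follows. It is purely
illustrative: the structures and the sched_cpu_connection_type() helper
are hypothetical and do not exist in any kernel, and build_path() stands
in for the arch-specific walk up the topology tree (P1->P2 yielding
H3,H2,H3 and P1->P3 yielding H3,H2,H1,H2,H3 in Ingo's example).

/*
 * Sketch only -- not kernel code.  Each unique sequence of component-type
 * IDs along the path between two CPUs gets one "connection type" index,
 * so the expensive cacheline-transfer measurement runs once per path.
 */
#include <string.h>

#define MAX_PATH_LEN	8
#define MAX_PATHS	64

struct cpu_path {
	int len;
	int id[MAX_PATH_LEN];	/* component-type IDs along the path */
};

static struct cpu_path paths[MAX_PATHS];	/* paths seen so far */
static int nr_paths;

/* Arch-specific in this sketch: fill in the component types between
 * cpu1 and cpu2 by walking up to their common ancestor and back down. */
extern void build_path(int cpu1, int cpu2, struct cpu_path *p);

/* Return the same small integer for every CPU pair whose path crosses
 * the same sequence of component types.  (Bounds checking of MAX_PATHS
 * is omitted for brevity.) */
static int sched_cpu_connection_type(int cpu1, int cpu2)
{
	struct cpu_path p;
	int i;

	build_path(cpu1, cpu2, &p);
	for (i = 0; i < nr_paths; i++)
		if (paths[i].len == p.len &&
		    !memcmp(paths[i].id, p.id, p.len * sizeof(int)))
			return i;		/* constellation already measured */

	paths[nr_paths] = p;			/* first occurrence: measure it */
	return nr_paths++;
}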
* Re: Industry db benchmark result on recent 2.6 kernels
  2005-04-01 10:34 ` Ingo Molnar
@ 2005-04-01 14:39   ` Paul Jackson
  0 siblings, 0 replies; 41+ messages in thread
From: Paul Jackson @ 2005-04-01 14:39 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: nickpiggin, kenneth.w.chen, torvalds, akpm, linux-kernel

Ingo wrote:
> but i'd too go for the simpler 'pseudo-distance' function, because it's
> so much easier to iterate through it. But it's not intuitive. Maybe it
> should be called 'connection ID': a unique ID for each unique type of
> path between CPUs.

Well said. Thanks.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: Industry db benchmark result on recent 2.6 kernels
  2005-03-31 22:14 ` Chen, Kenneth W
  2005-03-31 23:35   ` Nick Piggin
@ 2005-04-01  4:52   ` Ingo Molnar
  2005-04-01  5:14     ` Chen, Kenneth W
  1 sibling, 1 reply; 41+ messages in thread
From: Ingo Molnar @ 2005-04-01 4:52 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Linus Torvalds', 'Nick Piggin', 'Andrew Morton', linux-kernel

* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

> The low point in 2.6.11 could very well be the change in the
> scheduler. It does too much load balancing in the wake up path and
> possibly makes a lot of unwise decisions. For example, in
> try_to_wake_up(), it will try SD_WAKE_AFFINE for a task that is not hot.
> By not hot, it looks at when the task last ran and compares that to a
> constant sd->cache_hot_time. The problem is this cache_hot_time is
> fixed for the entire universe, whether it is a little celeron processor
> with 128KB of cache or a server class Itanium2 processor with 9MB L3
> cache. This one-size-fits-all approach isn't really working at all.

the current scheduler queue in -mm has some experimental bits as well
which will reduce the amount of balancing. But we cannot just merge them
en bloc right now, there's been too much back and forth in recent
kernels. The safe-to-merge-for-2.6.12 bits are already in -BK.

> We had experimented with that parameter earlier and found it was one of
> the major sources of the low point in 2.6.8. I debated the issue on
> LKML about 4 months ago and finally everyone agreed to make that
> parameter a boot time param. The change made it into the bk tree for
> the 2.6.9 release, but somehow it got ripped right out 2 days after it
> went in. I suspect 2.6.11 is a replay of 2.6.8 for the regression in
> the scheduler. We are running an experiment to confirm this theory.

the current defaults for cache_hot_time are 10 msec for NUMA domains,
and 2.5 msec for SMP domains. Clearly too low for CPUs with 9MB cache.
Are you increasing cache_hot_time in your experiment? If that solves
most of the problem that would be an easy thing to fix for 2.6.12.

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread
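The mechanism Ken describes boils down to a single comparison. The snippet
below is a simplified paraphrase, not verbatim 2.6.11 code; the function
name and parameter list here are illustrative only.

/*
 * Simplified paraphrase of the behaviour discussed above.  A task counts
 * as "cache hot" -- and is left where it is by the wakeup/balance paths --
 * if it ran more recently than the domain-wide constant threshold.
 */
static inline int task_is_cache_hot(unsigned long long now,
				    unsigned long long last_ran,
				    unsigned long long cache_hot_time)
{
	return (long long)(now - last_ran) < (long long)cache_hot_time;
}

/*
 * The complaint: cache_hot_time is one fixed number for the whole system
 * (Ingo quotes 2.5 msec for SMP domains, 10 msec for NUMA domains), so a
 * CPU with a 9MB L3 sees its tasks declared "cold" -- and migrated --
 * long before their working set has actually left the cache.
 */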
* RE: Industry db benchmark result on recent 2.6 kernels
  2005-04-01  4:52 ` Ingo Molnar
@ 2005-04-01  5:14   ` Chen, Kenneth W
  0 siblings, 0 replies; 41+ messages in thread
From: Chen, Kenneth W @ 2005-04-01 5:14 UTC (permalink / raw)
  To: 'Ingo Molnar'
  Cc: 'Linus Torvalds', 'Nick Piggin', 'Andrew Morton', linux-kernel

Ingo Molnar wrote on Thursday, March 31, 2005 8:52 PM
> the current scheduler queue in -mm has some experimental bits as well
> which will reduce the amount of balancing. But we cannot just merge them
> en bloc right now, there's been too much back and forth in recent
> kernels. The safe-to-merge-for-2.6.12 bits are already in -BK.

I agree; please give me some time to go through these patches on our db
setup.

> the current defaults for cache_hot_time are 10 msec for NUMA domains,
> and 2.5 msec for SMP domains. Clearly too low for CPUs with 9MB cache.
> Are you increasing cache_hot_time in your experiment? If that solves
> most of the problem that would be an easy thing to fix for 2.6.12.

Yes, we are increasing the number in our experiments. It's in the queue
and I should have a result soon.

^ permalink raw reply	[flat|nested] 41+ messages in thread
* RE: Industry db benchmark result on recent 2.6 kernels
  2005-03-30  0:00 ` Linus Torvalds
  2005-03-30  0:22   ` Chen, Kenneth W
  2005-03-30  0:46   ` Chen, Kenneth W
@ 2005-04-01 22:51   ` Chen, Kenneth W
  2 siblings, 0 replies; 41+ messages in thread
From: Chen, Kenneth W @ 2005-04-01 22:51 UTC (permalink / raw)
  To: 'Linus Torvalds'; +Cc: 'Andrew Morton', linux-kernel

Linus Torvalds wrote on Tuesday, March 29, 2005 4:00 PM
> Also, it would be absolutely wonderful to see a finer granularity (which
> would likely also answer the stability question of the numbers). If you
> can do this with the daily snapshots, that would be great. If it's not
> easily automatable, or if a run takes a long time, maybe every other or
> every third day would be possible?
>
> Doing just release kernels means that there will be a two-month lag
> between telling developers that something pissed up performance. Doing it
> every day (or at least a couple of times a week) will be much more
> interesting.

Indeed, we agree that regular, disciplined performance testing is
important and we (as Intel) will take on the challenge to support the
Linux community. I just got approval to start this project. We will
report more details on how/where we will publish the performance
numbers, etc.

- Ken

^ permalink raw reply	[flat|nested] 41+ messages in thread
* RE: Industry db benchmark result on recent 2.6 kernels
@ 2005-04-01 16:34 Manfred Spraul
0 siblings, 0 replies; 41+ messages in thread
From: Manfred Spraul @ 2005-04-01 16:34 UTC (permalink / raw)
To: Chen, Kenneth W; +Cc: Linux Kernel Mailing List
On Mon, 28 Mar 2005, Chen, Kenneth W wrote:
> With that said, here goes our first data point along with some historical data
> we have collected so far.
>
> 2.6.11 -13%
> 2.6.9 - 6%
> 2.6.8 -23%
> 2.6.2 - 1%
> baseline (rhel3)
Is it possible to generate an instruction level oprofile for one recent kernel?
I have convinced Mark Wong from OSDL to generate a few for postgres DBT-2, but postgres is limited by its user space buffer manager, thus it wasn't that useful:
http://khack.osdl.org/stp/299167/oprofile/
--
Manfred
^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: Industry db benchmark result on recent 2.6 kernels
@ 2005-04-02  1:00 Chen, Kenneth W
  2005-04-02  2:12 ` Nick Piggin
  0 siblings, 1 reply; 41+ messages in thread
From: Chen, Kenneth W @ 2005-04-02 1:00 UTC (permalink / raw)
  To: 'Ingo Molnar'
  Cc: 'Linus Torvalds', 'Nick Piggin', 'Andrew Morton', linux-kernel

Ingo Molnar wrote on Thursday, March 31, 2005 8:52 PM
> the current defaults for cache_hot_time are 10 msec for NUMA domains,
> and 2.5 msec for SMP domains. Clearly too low for CPUs with 9MB cache.
> Are you increasing cache_hot_time in your experiment? If that solves
> most of the problem that would be an easy thing to fix for 2.6.12.

Chen, Kenneth W wrote on Thursday, March 31, 2005 9:15 PM
> Yes, we are increasing the number in our experiments. It's in the queue
> and I should have a result soon.

Hot off the press: bumping up cache_hot_time to 10ms on our db setup
brings 2.6.11 performance on par with 2.6.9. Theory confirmed.

- Ken

^ permalink raw reply	[flat|nested] 41+ messages in thread
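For readers wondering where that 10ms lives: in kernels of this vintage
cache_hot_time was a per-sched-domain field set from arch initializers
rather than a runtime tunable, so an experiment like Ken's amounts to a
one-line change and a rebuild. The fragment below is only a sketch under
that assumption; the nanosecond unit and the absence of a stock runtime
knob are my reading of the 2.6.11-era code, so verify against the actual
tree before relying on it.

/* Sketch only: what "bumping cache_hot_time to 10ms" would look like for
 * one domain.  In-kernel context assumed; unit (nanoseconds) is an
 * assumption -- check the tree being patched. */
#include <linux/sched.h>

static void bump_cache_hot_time(struct sched_domain *sd)
{
	sd->cache_hot_time = 10ULL * 1000 * 1000;	/* 10 ms in ns */
}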
* Re: Industry db benchmark result on recent 2.6 kernels
  2005-04-02  1:00 Chen, Kenneth W
@ 2005-04-02  2:12 ` Nick Piggin
  0 siblings, 0 replies; 41+ messages in thread
From: Nick Piggin @ 2005-04-02 2:12 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Ingo Molnar', 'Linus Torvalds', 'Andrew Morton', linux-kernel

Chen, Kenneth W wrote:
> Ingo Molnar wrote on Thursday, March 31, 2005 8:52 PM
>
>> the current defaults for cache_hot_time are 10 msec for NUMA domains,
>> and 2.5 msec for SMP domains. Clearly too low for CPUs with 9MB cache.
>> Are you increasing cache_hot_time in your experiment? If that solves
>> most of the problem that would be an easy thing to fix for 2.6.12.
>
> Chen, Kenneth W wrote on Thursday, March 31, 2005 9:15 PM
>
>> Yes, we are increasing the number in our experiments. It's in the queue
>> and I should have a result soon.
>
> Hot off the press: bumping up cache_hot_time to 10ms on our db setup
> brings 2.6.11 performance on par with 2.6.9. Theory confirmed.

OK, that's good. I'll look at whether we can easily use Ingo's tool on
the SMP domain only, to avoid the large O(n^2). That might be an
acceptable short term solution for 2.6.12.

If you get a chance to also look at those block layer patches that would
be good - if they give you a nice improvement, that would justify getting
them into -mm.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 41+ messages in thread