public inbox for linux-kernel@vger.kernel.org
* Scaling noise
@ 2003-09-03  4:03 Larry McVoy
  2003-09-03  4:12 ` Roland Dreier
  2003-09-03  4:18 ` Anton Blanchard
  0 siblings, 2 replies; 64+ messages in thread
From: Larry McVoy @ 2003-09-03  4:03 UTC (permalink / raw)
  To: linux-kernel

I've frequently tried to make the point that all the scaling for lots of
processors is nonsense.  Mr Dell says it better:

    "Eight-way (servers) are less than 1 percent of the market and shrinking
    pretty dramatically," Dell said. "If our competitors want to claim
    they're No. 1 in eight-ways, that's fine. We want to lead the market
    with two-way and four-way (processor machines)."

Tell me again that it is a good idea to screw up uniprocessor performance
for 64 way machines.  Great idea, that.  Go Dinosaurs!
-- 
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03  4:03 Scaling noise Larry McVoy
@ 2003-09-03  4:12 ` Roland Dreier
  2003-09-03  4:20   ` Larry McVoy
  2003-09-03 15:12   ` Martin J. Bligh
  2003-09-03  4:18 ` Anton Blanchard
  1 sibling, 2 replies; 64+ messages in thread
From: Roland Dreier @ 2003-09-03  4:12 UTC (permalink / raw)
  Cc: linux-kernel

+--------------+
|  Don't feed  |
|  the trolls  |
|              |
|  thank you   |
+--------------+
      | |
      | |
      | |
      | |
  ....\ /....

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03  4:03 Scaling noise Larry McVoy
  2003-09-03  4:12 ` Roland Dreier
@ 2003-09-03  4:18 ` Anton Blanchard
  2003-09-03  4:29   ` Larry McVoy
  1 sibling, 1 reply; 64+ messages in thread
From: Anton Blanchard @ 2003-09-03  4:18 UTC (permalink / raw)
  To: Larry McVoy, linux-kernel


> I've frequently tried to make the point that all the scaling for lots of
> processors is nonsense.  Mr Dell says it better:
> 
>     "Eight-way (servers) are less than 1 percent of the market and shrinking
>     pretty dramatically," Dell said. "If our competitors want to claim
>     they're No. 1 in eight-ways, that's fine. We want to lead the market
>     with two-way and four-way (processor machines)."
> 
> Tell me again that it is a good idea to screw up uniprocessor performance
> for 64 way machines.  Great idea, that.  Go Dinosaurs!

And does your 4 way have hyperthreading?

Anton

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03  4:12 ` Roland Dreier
@ 2003-09-03  4:20   ` Larry McVoy
  2003-09-03 15:12   ` Martin J. Bligh
  1 sibling, 0 replies; 64+ messages in thread
From: Larry McVoy @ 2003-09-03  4:20 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-kernel

And here I thought that real data was interesting.  My mistake.

On Tue, Sep 02, 2003 at 09:12:36PM -0700, Roland Dreier wrote:
> +--------------+
> |  Don't feed  |
> |  the trolls  |
-- 
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03  4:18 ` Anton Blanchard
@ 2003-09-03  4:29   ` Larry McVoy
  2003-09-03  4:33     ` CaT
  2003-09-03  6:28     ` Anton Blanchard
  0 siblings, 2 replies; 64+ messages in thread
From: Larry McVoy @ 2003-09-03  4:29 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: Larry McVoy, linux-kernel

On Wed, Sep 03, 2003 at 02:18:51PM +1000, Anton Blanchard wrote:
> > I've frequently tried to make the point that all the scaling for lots of
> > processors is nonsense.  Mr Dell says it better:
> > 
> >     "Eight-way (servers) are less than 1 percent of the market and shrinking
> >     pretty dramatically," Dell said. "If our competitors want to claim
> >     they're No. 1 in eight-ways, that's fine. We want to lead the market
> >     with two-way and four-way (processor machines)."
> > 
> > Tell me again that it is a good idea to screw up uniprocessor performance
> > for 64 way machines.  Great idea, that.  Go Dinosaurs!
> 
> And does your 4 way have hyperthreading?

What part of "shrinking pretty dramatically" did you not understand?  Maybe
you know more than Mike Dell.  Could you share that insight?
-- 
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03  4:29   ` Larry McVoy
@ 2003-09-03  4:33     ` CaT
  2003-09-03  5:08       ` Larry McVoy
  2003-09-03  6:28     ` Anton Blanchard
  1 sibling, 1 reply; 64+ messages in thread
From: CaT @ 2003-09-03  4:33 UTC (permalink / raw)
  To: Larry McVoy, Anton Blanchard, Larry McVoy, linux-kernel

On Tue, Sep 02, 2003 at 09:29:53PM -0700, Larry McVoy wrote:
> On Wed, Sep 03, 2003 at 02:18:51PM +1000, Anton Blanchard wrote:
> > > I've frequently tried to make the point that all the scaling for lots of
> > > processors is nonsense.  Mr Dell says it better:
> > > 
> > >     "Eight-way (servers) are less than 1 percent of the market and shrinking
> > >     pretty dramatically," Dell said. "If our competitors want to claim
> > >     they're No. 1 in eight-ways, that's fine. We want to lead the market
> > >     with two-way and four-way (processor machines)."
> > > 
> > > Tell me again that it is a good idea to screw up uniprocessor performance
> > > for 64 way machines.  Great idea, that.  Go Dinosaurs!
> > 
> > And does your 4 way have hyperthreading?
> 
> What part of "shrinking pretty dramatically" did you not understand?  Maybe
> you know more than Mike Dell.  Could you share that insight?

I think Anton is referring to the fact that on a 4-way CPU machine with
HT enabled you basically have an 8-way SMP box (with special conditions),
and so if 4-way machines are becoming more popular, making sure that 8-way
SMP works well is a good idea.

At least that's how I took it.
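
To put a number on it, here is a trivial user-space check (a sketch of
mine; the expected count of 8 assumes a 4-way box with HT enabled):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* The kernel presents each hyperthread as a logical CPU, so a
         * 4-way box with HT enabled reports 8 processors online. */
        long cpus = sysconf(_SC_NPROCESSORS_ONLN);
        printf("%ld logical CPUs online\n", cpus);
        return 0;
    }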

-- 
"How can I not love the Americans? They helped me with a flat tire the
other day," he said.
	- http://tinyurl.com/h6fo

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03  4:33     ` CaT
@ 2003-09-03  5:08       ` Larry McVoy
  2003-09-03  5:44         ` Mikael Abrahamsson
                           ` (5 more replies)
  0 siblings, 6 replies; 64+ messages in thread
From: Larry McVoy @ 2003-09-03  5:08 UTC (permalink / raw)
  To: CaT; +Cc: Larry McVoy, Anton Blanchard, linux-kernel

On Wed, Sep 03, 2003 at 02:33:56PM +1000, CaT wrote:
> I think Anton is referring to the fact that on a 4-way CPU machine with
> HT enabled you basically have an 8-way SMP box (with special conditions),
> and so if 4-way machines are becoming more popular, making sure that 8-way
> SMP works well is a good idea.

Maybe this is a better way to get my point across.  Think about more CPUs
on the same memory subsystem.  I've been trying to make this scaling point
ever since I discovered how much cache misses hurt.  That was about 1995
or so.  At that point, memory latency was about 200 ns and processor speeds
were at about 200 MHz, or 5 ns per cycle.  Today, memory latency is about
130 ns and processor cycle times are about 0.3 ns.  Processor speeds are
15 times faster and
memory is less than 2 times faster.  SMP makes that ratio worse.
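
To make that arithmetic concrete, a back-of-the-envelope sketch using only
the round numbers above:

    #include <stdio.h>

    int main(void)
    {
        /* Cache-miss penalty in CPU cycles = memory latency / cycle time. */
        double miss_1995 = 200.0 / 5.0;    /* ~40 cycles at 200 MHz   */
        double miss_2003 = 130.0 / 0.3;    /* ~433 cycles at ~3.3 GHz */

        printf("1995: ~%.0f cycles per cache miss\n", miss_1995);
        printf("2003: ~%.0f cycles per cache miss\n", miss_2003);
        return 0;
    }

A miss that used to stall the CPU for ~40 cycles now stalls it for ~430,
and piling more CPUs onto the same memory just adds contention on top.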

It's called asymptotic behavior.  After a while you can look at the graph
and see that more CPUs on the same memory doesn't make sense.  It hasn't
made sense for a decade, what makes anyone think that is changing?
-- 
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03  5:08       ` Larry McVoy
@ 2003-09-03  5:44         ` Mikael Abrahamsson
  2003-09-03  6:12         ` Bernd Eckenfels
                           ` (4 subsequent siblings)
  5 siblings, 0 replies; 64+ messages in thread
From: Mikael Abrahamsson @ 2003-09-03  5:44 UTC (permalink / raw)
  To: linux-kernel

On Tue, 2 Sep 2003, Larry McVoy wrote:

> It's called asymptotic behavior.  After a while you can look at the graph
> and see that more CPUs on the same memory doesn't make sense.  It hasn't
> made sense for a decade, what makes anyone think that is changing?

It didn't make sense two decades ago either: the VAX 8300 could be made to
go 6-way, and it stopped going faster around the third processor added.

(My memory is a bit rusty, but I believe this is what we came up with when
we were donated a few of those in the mid '90s.  And yes, they're not from
'83 but perhaps from '86-'87, so not quite two decades ago either.)

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03  5:08       ` Larry McVoy
  2003-09-03  5:44         ` Mikael Abrahamsson
@ 2003-09-03  6:12         ` Bernd Eckenfels
  2003-09-03 12:09           ` Alan Cox
  2003-09-03  8:11         ` Giuliano Pochini
                           ` (3 subsequent siblings)
  5 siblings, 1 reply; 64+ messages in thread
From: Bernd Eckenfels @ 2003-09-03  6:12 UTC (permalink / raw)
  To: linux-kernel

In article <20030903050859.GD10257@work.bitmover.com> you wrote:
> It's called asymptotic behavior.  After a while you can look at the graph
> and see that more CPUs on the same memory doesn't make sense.  It hasn't
> made sense for a decade, what makes anyone think that is changing?

That's why NUMA is getting so popular.

Larry, don't forget that Linux is growing in the university labs, where
those big NUMA and multi-node clusters are most popular for number
crunching.

Greetings
Bernd
-- 
eckes privat - http://www.eckes.org/
Project Freefire - http://www.freefire.org/

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03  4:29   ` Larry McVoy
  2003-09-03  4:33     ` CaT
@ 2003-09-03  6:28     ` Anton Blanchard
  2003-09-03  6:55       ` Nick Piggin
  1 sibling, 1 reply; 64+ messages in thread
From: Anton Blanchard @ 2003-09-03  6:28 UTC (permalink / raw)
  To: Larry McVoy, Larry McVoy, linux-kernel


> > > I've frequently tried to make the point that all the scaling for
> > > lots of processors is nonsense.  Mr Dell says it better:
> > > 
> > >     "Eight-way (servers) are less than 1 percent of the market and
> > >     shrinking pretty dramatically," Dell said. "If our competitors
> > >     want to claim they're No. 1 in eight-ways, that's fine. We
> > >     want to lead the market with two-way and four-way (processor
> > >     machines)."
> > > 
> > > Tell me again that it is a good idea to screw up uniprocessor
> > > performance for 64 way machines.  Great idea, that.  Go Dinosaurs!
> > 
> > And does your 4 way have hyperthreading?
> 
> What part of "shrinking pretty dramatically" did you not understand?
> Maybe you know more than Mike Dell.  Could you share that insight?

Ok. But only because you asked nicely.

Mike Dell wants to sell 2 and 4 processor boxes and Intel wants to sell 
processors with hyperthreading on them. Scaling to 4 or 8 threads is just
like scaling to 4 or 8 processors, only worse.

However, let's not end up in yet another 64-way scalability argument here.

The thing we should be worrying about is the UP -> 2-way SMP scalability
issue. If every chip in the future has hyperthreading then all of a sudden
everyone is running an SMP kernel. And what hurts us?

atomic ops
memory barriers

I've always worried about those atomic ops that only appear in an SMP
kernel, but Rusty recently reminded me it's the same story for most of the
memory barriers.
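
For reference, a rough sketch of what's at stake (simplified from the
i386 headers of this era; the real macros differ in detail):

    #ifdef CONFIG_SMP
    #define LOCK_PREFIX "lock; "   /* locked bus cycle: the cost UP skips */
    #else
    #define LOCK_PREFIX ""         /* UP build: a plain incl, no penalty  */
    #endif

    typedef struct { volatile int counter; } atomic_t;

    static inline void atomic_inc(atomic_t *v)
    {
        __asm__ __volatile__(
            LOCK_PREFIX "incl %0"
            : "+m" (v->counter));
    }

Once every chip is hyperthreaded, everyone pays for that lock prefix on
every atomic op, whether or not their workload ever needed SMP.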

Things like RCU can do a lot for this UP -> 2-way SMP issue. The fact that
it also helps the big end of town is just a bonus.

Anton

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03  6:28     ` Anton Blanchard
@ 2003-09-03  6:55       ` Nick Piggin
  2003-09-03 15:23         ` Martin J. Bligh
  2003-09-03 15:51         ` UP Regression (was) " Cliff White
  0 siblings, 2 replies; 64+ messages in thread
From: Nick Piggin @ 2003-09-03  6:55 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: Larry McVoy, Larry McVoy, linux-kernel

Anton Blanchard wrote:

>>>>I've frequently tried to make the point that all the scaling for
>>>>lots of processors is nonsense.  Mr Dell says it better:
>>>>
>>>>    "Eight-way (servers) are less than 1 percent of the market and
>>>>    shrinking pretty dramatically," Dell said. "If our competitors
>>>>    want to claim they're No. 1 in eight-ways, that's fine. We
>>>>    want to lead the market with two-way and four-way (processor
>>>>    machines)."
>>>>
>>>>Tell me again that it is a good idea to screw up uniprocessor
>>>>performance for 64 way machines.  Great idea, that.  Go Dinosaurs!
>>>>
>>>And does your 4 way have hyperthreading?
>>>
>>What part of "shrinking pretty dramatically" did you not understand?
>>Maybe you know more than Mike Dell.  Could you share that insight?
>>
>
>Ok. But only because you asked nicely.
>
>Mike Dell wants to sell 2 and 4 processor boxes and Intel wants to sell 
>processors with hyperthreading on them. Scaling to 4 or 8 threads is just
>like scaling to 4 or 8 processors, only worse.
>
>However, let's not end up in yet another 64-way scalability argument here.
>
>The thing we should be worrying about is the UP -> 2-way SMP scalability
>issue. If every chip in the future has hyperthreading then all of a sudden
>everyone is running an SMP kernel. And what hurts us?
>
>atomic ops
>memory barriers
>
>I've always worried about those atomic ops that only appear in an SMP
>kernel, but Rusty recently reminded me it's the same story for most of the
>memory barriers.
>
>Things like RCU can do a lot for this UP -> 2-way SMP issue. The fact that
>it also helps the big end of town is just a bonus.
>

I think LM advocates aiming single-image scalability at or before the knee
of the CPU vs. performance curve. Say that's 4-way; it means you should get
good performance on 8-ways while keeping top performance on 1-, 2- and
4-ways. (Sorry if I misrepresent your position.)

I don't think anyone advocates sacrificing UP performance for 32-ways, but
as he says it can happen 0.1% at a time.

But it looks like 2.6 will scale well to 16-way and higher. I wonder if
there are many regressions from 2.4 or 2.2 on small systems.



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03  5:08       ` Larry McVoy
  2003-09-03  5:44         ` Mikael Abrahamsson
  2003-09-03  6:12         ` Bernd Eckenfels
@ 2003-09-03  8:11         ` Giuliano Pochini
  2003-09-03 14:25         ` Steven Cole
                           ` (2 subsequent siblings)
  5 siblings, 0 replies; 64+ messages in thread
From: Giuliano Pochini @ 2003-09-03  8:11 UTC (permalink / raw)
  To: Larry McVoy; +Cc: linux-kernel


On 03-Sep-2003 Larry McVoy wrote:
> That was about 1995
> or so.  At that point, memory latency was about 200 ns and processor speeds
> were at about 200 MHz, or 5 ns per cycle.  Today, memory latency is about
> 130 ns and processor cycle times are about 0.3 ns.  Processor speeds are
> 15 times faster and
> memory is less than 2 times faster.  SMP makes that ratio worse.

Latency is not bandwidth.  BTW, you are right; that's why caches are
growing, too.  It's likely that in the future there will be only UP (HT'd?)
and NUMA machines.


Bye.
    Giuliano.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03  6:12         ` Bernd Eckenfels
@ 2003-09-03 12:09           ` Alan Cox
  2003-09-03 15:10             ` Martin J. Bligh
  0 siblings, 1 reply; 64+ messages in thread
From: Alan Cox @ 2003-09-03 12:09 UTC (permalink / raw)
  To: Bernd Eckenfels; +Cc: Linux Kernel Mailing List

On Mer, 2003-09-03 at 07:12, Bernd Eckenfels wrote:
> That's why NUMA is getting so popular.

NUMA doesn't help you much.

> Larry, don't forget that Linux is growing in the university labs, where
> those big NUMA and multi-node clusters are most popular for number
> crunching.

Multi-node yes, NUMA not much, and where NUMA-like systems are being used
they are being used for message passing, not as a fake big PC.

NUMA is valuable because
- It makes some things go faster without having to rewrite them
- It lets you partition a large box into several effective small ones,
  cutting maintenance
- It lets you partition a large box into several effective small ones,
  so you can avoid buying two software licenses for expensive toys

If you actually care enough about performance to write the code to do
the job then its value is rather questionable. There are exceptions, as
with anything else.



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03 14:25         ` Steven Cole
@ 2003-09-03 12:47           ` Antonio Vargas
  2003-09-03 15:31             ` Steven Cole
  2003-09-08 19:12           ` bill davidsen
  1 sibling, 1 reply; 64+ messages in thread
From: Antonio Vargas @ 2003-09-03 12:47 UTC (permalink / raw)
  To: Steven Cole; +Cc: Larry McVoy, CaT, Anton Blanchard, linux-kernel

On Wed, Sep 03, 2003 at 08:25:36AM -0600, Steven Cole wrote:
> On Tue, 2003-09-02 at 23:08, Larry McVoy wrote:
> > On Wed, Sep 03, 2003 at 02:33:56PM +1000, CaT wrote:

> [snip]
> 
> The question which will continue to be important in the next kernel
> series is: How to best accommodate the future many-CPU machines without
> sacrificing performance on the low-end?  The change is that the 'many'
> in the above may start to double every few years.
> 
> Some candidate answers to this have been discussed before, such as
> cache-coherent clusters.  I just hope this gets worked out before the
> hardware ships.

As you probably know, CC-clusters were heavily advocated by the
same Larry McVoy who started this thread.

Greets, Antonio.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03  5:08       ` Larry McVoy
                           ` (2 preceding siblings ...)
  2003-09-03  8:11         ` Giuliano Pochini
@ 2003-09-03 14:25         ` Steven Cole
  2003-09-03 12:47           ` Antonio Vargas
  2003-09-08 19:12           ` bill davidsen
  2003-09-03 16:37         ` Kurt Wall
  2003-09-06 15:08         ` Pavel Machek
  5 siblings, 2 replies; 64+ messages in thread
From: Steven Cole @ 2003-09-03 14:25 UTC (permalink / raw)
  To: Larry McVoy; +Cc: CaT, Anton Blanchard, linux-kernel

On Tue, 2003-09-02 at 23:08, Larry McVoy wrote:
> On Wed, Sep 03, 2003 at 02:33:56PM +1000, CaT wrote:
> > I think Anton is referring to the fact that on a 4-way CPU machine with
> > HT enabled you basically have an 8-way SMP box (with special conditions),
> > and so if 4-way machines are becoming more popular, making sure that 8-way
> > SMP works well is a good idea.
> 
> Maybe this is a better way to get my point across.  Think about more CPUs
> on the same memory subsystem.  I've been trying to make this scaling point
> ever since I discovered how much cache misses hurt.  That was about 1995
> or so.  At that point, memory latency was about 200 ns and processor speeds
> were at about 200 MHz, or 5 ns per cycle.  Today, memory latency is about
> 130 ns and processor cycle times are about 0.3 ns.  Processor speeds are
> 15 times faster and
> memory is less than 2 times faster.  SMP makes that ratio worse.
> 
> It's called asymptotic behavior.  After a while you can look at the graph
> and see that more CPUs on the same memory doesn't make sense.  It hasn't
> made sense for a decade, what makes anyone think that is changing?

You're right about the asymptotic behavior and you'll just get more
right as time goes on, but other forces are at work.

What is changing is that the number of cores per 'processor' is increasing.
The Intel Montecito will increase this to two, and rumor has it that the
Intel Tanglewood may have as many as sixteen.  The IBM Power6 will
likely be similarly capable.

The Tanglewood is not some far-off flight of fancy; it may be available
as soon as the 2.8.x stable series, so planning to accommodate it should
be happening now.

With companies like SGI building Altix systems with 64 and 128 CPUs
using the current single-core Madison, just think of what will be
possible using the future hardware. 

In four years, Michael Dell will still be saying the same thing, but
he'll just fudge his answer by a factor of four. 

The question which will continue to be important in the next kernel
series is: How to best accommodate the future many-CPU machines without
sacrificing performance on the low-end?  The change is that the 'many'
in the above may start to double every few years.

Some candidate answers to this have been discussed before, such as
cache-coherent clusters.  I just hope this gets worked out before the
hardware ships.

Steven


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03 12:09           ` Alan Cox
@ 2003-09-03 15:10             ` Martin J. Bligh
  2003-09-03 16:01               ` Jörn Engel
  2003-09-04 20:36               ` Rik van Riel
  0 siblings, 2 replies; 64+ messages in thread
From: Martin J. Bligh @ 2003-09-03 15:10 UTC (permalink / raw)
  To: Alan Cox, Bernd Eckenfels; +Cc: Linux Kernel Mailing List

> Multi-node yes, NUMA not much, and where NUMA-like systems are being used
> they are being used for message passing, not as a fake big PC.
> 
> NUMA is valuable because
> - It makes some things go faster without having to rewrite them
> - It lets you partition a large box into several effective small ones,
>   cutting maintenance
> - It lets you partition a large box into several effective small ones,
>   so you can avoid buying two software licenses for expensive toys
> 
> If you actually care enough about performance to write the code to do
> the job then its value is rather questionable. There are exceptions, as
> with anything else.

The real core use of NUMA is to run one really big app on one machine, 
where it's hard to split it across a cluster. You just can't build an 
SMP box big enough for some of these things.

M.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03  4:12 ` Roland Dreier
  2003-09-03  4:20   ` Larry McVoy
@ 2003-09-03 15:12   ` Martin J. Bligh
  1 sibling, 0 replies; 64+ messages in thread
From: Martin J. Bligh @ 2003-09-03 15:12 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-kernel

--Roland Dreier <roland@topspin.com> wrote (on Tuesday, September 02, 2003 21:12:36 -0700):

> +--------------+
> |  Don't feed  |
> |  the trolls  |
> |              |
> |  thank you   |
> +--------------+
>       | |
>       | |
>       | |
>       | |
>   ....\ /....

Agreed. Please refer to the last flamefest a few months ago, when this was
covered in detail.

M.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03  6:55       ` Nick Piggin
@ 2003-09-03 15:23         ` Martin J. Bligh
  2003-09-03 15:39           ` Larry McVoy
  2003-09-03 17:16           ` William Lee Irwin III
  2003-09-03 15:51         ` UP Regression (was) " Cliff White
  1 sibling, 2 replies; 64+ messages in thread
From: Martin J. Bligh @ 2003-09-03 15:23 UTC (permalink / raw)
  To: Nick Piggin, Anton Blanchard; +Cc: Larry McVoy, linux-kernel

> I think LM advocates aiming single-image scalability at or before the knee
> of the CPU vs. performance curve. Say that's 4-way; it means you should get
> good performance on 8-ways while keeping top performance on 1-, 2- and
> 4-ways. (Sorry if I misrepresent your position.)

Splitting big machines into a cluster is not a solution. However, oddly 
enough I actually agree with Larry, with one major caveat ... you have to
make it an SSI cluster (single system image) - that way it's transparent
to users. Unfortunately that's hard to do, but since we still have a 
system that's single memory image coherent, it shouldn't actually be nearly 
as hard as doing it across machines, as you can still fudge in the odd 
global piece if you need it. 

Without SSI, it's pretty useless, you're just turning an expensive box
into a cheap cluster, and burning a lot of cash.

> I don't think anyone advocates sacrificing UP performance for 32-ways, but
> as he says it can happen 0.1% at a time.
> 
> But it looks like 2.6 will scale well to 16-way and higher. I wonder if
> there are many regressions from 2.4 or 2.2 on small systems.

You want real data instead of FUD? How *dare* you? ;-)

Would be real interesting to see this ... there are actually plenty of
real degradations there, none of which (that I've seen) come from any
scalability changes. Things like RMAP on fork times (for which there are
other legitimate reasons) are more responsible (for which the "scalability"
people have offered a solution).

Numbers would be cool ... particularly if people can refrain from the
"it's worse, therefore it must be some scalability change that's at fault"
insta-moron-leap-of-logic.

M.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03 12:47           ` Antonio Vargas
@ 2003-09-03 15:31             ` Steven Cole
  2003-09-04  1:50               ` Daniel Phillips
  0 siblings, 1 reply; 64+ messages in thread
From: Steven Cole @ 2003-09-03 15:31 UTC (permalink / raw)
  To: Antonio Vargas; +Cc: Larry McVoy, CaT, Anton Blanchard, linux-kernel

On Wed, 2003-09-03 at 06:47, Antonio Vargas wrote:
> On Wed, Sep 03, 2003 at 08:25:36AM -0600, Steven Cole wrote:
> > On Tue, 2003-09-02 at 23:08, Larry McVoy wrote:
> > > On Wed, Sep 03, 2003 at 02:33:56PM +1000, CaT wrote:
> 
> > [snip]
> > 
> > The question which will continue to be important in the next kernel
> > series is: How to best accommodate the future many-CPU machines without
> > sacrificing performance on the low-end?  The change is that the 'many'
> > in the above may start to double every few years.
> > 
> > Some candidate answers to this have been discussed before, such as
> > cache-coherent clusters.  I just hope this gets worked out before the
> > hardware ships.
> 
> As you may probably know, CC-clusters were heavily advocated by the
> same Larry McVoy who has started this thread.
> 

Yes, thanks.  I'm well aware of that.  I would like to get a discussion
going again on CC-clusters, since that seems to be a way out of the
scaling spiral.  Here is an interesting link:
http://www.opersys.com/adeos/practical-smp-clusters/

Steven




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03 15:23         ` Martin J. Bligh
@ 2003-09-03 15:39           ` Larry McVoy
  2003-09-03 15:50             ` Martin J. Bligh
                               ` (2 more replies)
  2003-09-03 17:16           ` William Lee Irwin III
  1 sibling, 3 replies; 64+ messages in thread
From: Larry McVoy @ 2003-09-03 15:39 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Nick Piggin, Anton Blanchard, Larry McVoy, linux-kernel

On Wed, Sep 03, 2003 at 08:23:44AM -0700, Martin J. Bligh wrote:
> > I think LM advocates aiming single-image scalability at or before the knee
> > of the CPU vs. performance curve. Say that's 4-way; it means you should get
> > good performance on 8-ways while keeping top performance on 1-, 2- and
> > 4-ways. (Sorry if I misrepresent your position.)
> 
> Splitting big machines into a cluster is not a solution. However, oddly 
> enough I actually agree with Larry, with one major caveat ... you have to
> make it an SSI cluster (single system image) - that way it's transparent
> to users. 

Err, when did I ever say it wasn't SSI?  If you look at what I said it's
clearly SSI.  Unified process, device, file, and memory namespaces.

I'm pretty sure people were so eager to argue with my lovely personality
that they never bothered to understand the architecture.  It's _always_
been SSI.  I have slides going back at least 4 years that state this:

	http://www.bitmover.com/talks/smp-clusters
	http://www.bitmover.com/talks/cliq

> Numbers would be cool ... particularly if people can refrain from the
> "it's worse, therefore it must be some scalability change that's at fault"
> insta-moron-leap-of-logic.

It's really easy to claim that scalability isn't the problem.  Scaling
changes in general cause very minute differences, it's just that there
are a lot of them.  There is constant pressure to scale further and people
think it's cool.  You can argue all you want that scaling done right
isn't a problem but nobody has ever managed to do it right.  I know it's
politically incorrect to say this group won't either but there is no 
evidence that they will.

Instead of doggedly following the footsteps down a path that hasn't worked
before, why not do something cool?  The CC stuff is a fun place to work,
it's the last paradigm shift that will ever happen in OS, it's a chance 
for Linux to actually do something new.  I harp all the time that open
source is a copying mechanism and you are playing right into my hands.
Make me wrong.  Do something new.  Don't like this design?  OK, then come
up with a better design.
-- 
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03 15:39           ` Larry McVoy
@ 2003-09-03 15:50             ` Martin J. Bligh
  2003-09-04  0:49               ` Larry McVoy
  2003-09-04  4:49             ` Scaling noise David S. Miller
  2003-09-08 19:50             ` bill davidsen
  2 siblings, 1 reply; 64+ messages in thread
From: Martin J. Bligh @ 2003-09-03 15:50 UTC (permalink / raw)
  To: Larry McVoy; +Cc: Nick Piggin, Anton Blanchard, linux-kernel

> Err, when did I ever say it wasn't SSI?  If you look at what I said it's
> clearly SSI.  Unified process, device, file, and memory namespaces.

I think it was the bit where you suggested using BitKeeper to sync multiple
/etc/passwd files that I really switched off ... perhaps you were just
joking ;-) Perhaps we just had a massive communication disconnect.
 
> I'm pretty sure people were so eager to argue with my lovely personality
> that they never bothered to understand the architecture.  It's _always_
> been SSI.  I have slides going back at least 4 years that state this:
> 
> 	http://www.bitmover.com/talks/smp-clusters
> 	http://www.bitmover.com/talks/cliq

I can go back and re-read them; if I misread them last time then I apologise.
I've also shifted perspectives on SSI clusters somewhat over the last year. 
Yes, if it's SSI, I'd agree for the most part ... once it's implemented ;-)

I'd rather start with everything separate (one OS instance per node), and
bind things back together, than split everything up. However, I'm really
not sure how feasible it is until we actually have something that works.

I have a rough plan of how to go about it mapped out, in small steps that
might be useful by themselves. It's a lot of fairly complex hard work ;-)

>> Numbers would be cool ... particularly if people can refrain from the
>> "it's worse, therefore it must be some scalability change that's at fault"
>> insta-moron-leap-of-logic.
> 
> It's really easy to claim that scalability isn't the problem.  Scaling
> changes in general cause very minute differences, it's just that there
> are a lot of them.  There is constant pressure to scale further and people
> think it's cool.  You can argue all you want that scaling done right
> isn't a problem but nobody has ever managed to do it right.  I know it's
> politically incorrect to say this group won't either but there is no 
> evidence that they will.

Let's not go into that one again; we've both dragged that over the coals
already. Time to agree to disagree. All the significant degradations I
looked at that people screamed were scalability changes turned out to
be something else completely.
 
> Instead of doggedly following the footsteps down a path that hasn't worked
> before, why not do something cool?  The CC stuff is a fun place to work,
> it's the last paradigm shift that will ever happen in OS, it's a chance 
> for Linux to actually do something new.  I harp all the time that open
> source is a copying mechanism and you are playing right into my hands.
> Make me wrong.  Do something new.  Don't like this design?  OK, then come
> up with a better design.

I'm cool with doing SSI clusters over NUMA on a per-node basis. But it's
still vapourware ... yes, I'd love to work on that full time to try and
change that if I can get funding to do so.

M.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* UP Regression (was) Re: Scaling noise
  2003-09-03  6:55       ` Nick Piggin
  2003-09-03 15:23         ` Martin J. Bligh
@ 2003-09-03 15:51         ` Cliff White
  2003-09-03 17:21           ` William Lee Irwin III
  2003-09-04  0:54           ` Nick Piggin
  1 sibling, 2 replies; 64+ messages in thread
From: Cliff White @ 2003-09-03 15:51 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, cliffw


[snip]
.
> 
> I don't think anyone advocates sacrificing UP performance for 32-ways, but
> as he says it can happen 0.1% at a time.
> 
> But it looks like 2.6 will scale well to 16-way and higher. I wonder if
> there are many regressions from 2.4 or 2.2 on small systems.
> 
> 
On the Scalable Test Platform, running osdl-aim-7, for the
UP case 2.4 is a bit better than 2.6; this is consistent across
many runs. For SMP, 2.6 is better, but the delta is rather
small until we get to 8 CPUs. We have a lot of un-parsed data from other
tests - there might be some trends there also.
See http://developer.osdl.org/cliffw/reaim/index.html
2.4 kernels are at the bottom of the page.

Run #    PLM #   Kernel               Workload    Max JPM   Max lusers   Host
1-way
278671   2083    patch-2.4.23-pre2    new_dbase   1066.75       18       stp1-003
278835   2087    2.6.0-test4-mm5      new_dbase    995.74       17       stp1-003
2-way
278690   2083    patch-2.4.23-pre2    new_dbase   1300.01       22       stp2-000
278854   2087    2.6.0-test4-mm5      new_dbase   1340.96       22       stp2-000
4-way
278437   2075    patch-2.4.23-pre1    new_dbase   5268.41       80       stp4-000
278805   2084    2.6.0-test4-mm4      new_dbase   5355.73       88       stp4-000
8-way
278651   2083    patch-2.4.23-pre2    new_dbase   6790.01      112       stp8-002
278722   2084    2.6.0-test4-mm4      new_dbase   8189.51      136       stp8-001

cliffw



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03 15:10             ` Martin J. Bligh
@ 2003-09-03 16:01               ` Jörn Engel
  2003-09-03 16:21                 ` Martin J. Bligh
  2003-09-04 20:36               ` Rik van Riel
  1 sibling, 1 reply; 64+ messages in thread
From: Jörn Engel @ 2003-09-03 16:01 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Alan Cox, Bernd Eckenfels, Linux Kernel Mailing List

On Wed, 3 September 2003 08:10:33 -0700, Martin J. Bligh wrote:
> 
> > Multi-node yes, NUMA not much, and where NUMA-like systems are being used
> > they are being used for message passing, not as a fake big PC.
> > 
> > NUMA is valuable because
> > - It makes some things go faster without having to rewrite them
> > - It lets you partition a large box into several effective small ones,
> >   cutting maintenance
> > - It lets you partition a large box into several effective small ones,
> >   so you can avoid buying two software licenses for expensive toys
> > 
> > If you actually care enough about performance to write the code to do
> > the job then its value is rather questionable. There are exceptions, as
> > with anything else.
> 
> The real core use of NUMA is to run one really big app on one machine, 
> where it's hard to split it across a cluster. You just can't build an 
> SMP box big enough for some of these things.

This "hard to split" is usually caused by memory use instead of cpu
use, right?

I don't see a big problem scaling number crunchers over a cluster, but
a process with a working set >64GB cannot be split between 4GB
machines easily.

Jörn

-- 
Good warriors cause others to come to them and do not go to others.
-- Sun Tzu

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03 16:01               ` Jörn Engel
@ 2003-09-03 16:21                 ` Martin J. Bligh
  2003-09-03 19:41                   ` Mike Fedyk
  0 siblings, 1 reply; 64+ messages in thread
From: Martin J. Bligh @ 2003-09-03 16:21 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Alan Cox, Bernd Eckenfels, Linux Kernel Mailing List

>> The real core use of NUMA is to run one really big app on one machine, 
>> where it's hard to split it across a cluster. You just can't build an 
>> SMP box big enough for some of these things.
> 
> This "hard to split" is usually caused by memory use instead of cpu
> use, right?

Heavy process intercommunication I guess, often but not always through
shared mem.
 
> I don't see a big problem scaling number crunchers over a cluster, but
> a process with a working set >64GB cannot be split between 4GB
> machines easily.

Right - some problems split nicely, and should get run on clusters because
it's a shitload cheaper. Preferably an SSI cluster so you get to manage
things easily, but either way. As you say, some things just don't split
that way, and that's why people pay for big iron (which ends up being
NUMA). 

I've seen people use big machines for clusterable things, which I think
is a waste of money, but the cost of the machine compared to the cost
of admin (vs multiple machines) may have come down to the point where 
it's worth it now. You get implicit "cluster" load balancing done in a
transparent way by the OS on NUMA boxes.

M.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03  5:08       ` Larry McVoy
                           ` (3 preceding siblings ...)
  2003-09-03 14:25         ` Steven Cole
@ 2003-09-03 16:37         ` Kurt Wall
  2003-09-06 15:08         ` Pavel Machek
  5 siblings, 0 replies; 64+ messages in thread
From: Kurt Wall @ 2003-09-03 16:37 UTC (permalink / raw)
  To: linux-kernel

Quoth Larry McVoy:

[SMP hits memory latency wall]

> It's called asymptotic behavior.  After a while you can look at the graph
> and see that more CPUs on the same memory doesn't make sense.  It hasn't
> made sense for a decade, what makes anyone think that is changing?

Isn't this what NUMA is for, then?

Kurt
-- 
"There was a boy called Eustace Clarence Scrubb, and he almost deserved
it."
		-- C. S. Lewis, The Chronicles of Narnia

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03 15:23         ` Martin J. Bligh
  2003-09-03 15:39           ` Larry McVoy
@ 2003-09-03 17:16           ` William Lee Irwin III
  1 sibling, 0 replies; 64+ messages in thread
From: William Lee Irwin III @ 2003-09-03 17:16 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Nick Piggin, Anton Blanchard, Larry McVoy, linux-kernel

On Wed, Sep 03, 2003 at 08:23:44AM -0700, Martin J. Bligh wrote:
> Would be real interesting to see this ... there are actually plenty of
> real degradations there, none of which (that I've seen) come from any
> scalability changes. Things like RMAP on fork times (for which there are
> other legitimate reasons) are more responsible (for which the "scalability"
> people have offered a solution).

How'd that get capitalized? It's not an acronym.

At any rate, fork()'s relevance to performance is not being measured
in any context remotely resembling real usage cases, e.g. forking
servers. There are other problems with kernel compiles, for instance:
internally limited parallelism, and a relatively highly constrained
userspace component whose concurrency is impossible to increase.


-- wli

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: UP Regression (was) Re: Scaling noise
  2003-09-03 15:51         ` UP Regression (was) " Cliff White
@ 2003-09-03 17:21           ` William Lee Irwin III
  2003-09-03 18:53             ` Cliff White
  2003-09-04  0:54           ` Nick Piggin
  1 sibling, 1 reply; 64+ messages in thread
From: William Lee Irwin III @ 2003-09-03 17:21 UTC (permalink / raw)
  To: Cliff White; +Cc: Nick Piggin, linux-kernel

On Wed, Sep 03, 2003 at 08:51:56AM -0700, Cliff White wrote:
> On the Scalable Test Platform, running osdl-aim-7, for the
> UP case 2.4 is a bit better than 2.6; this is consistent across
> many runs. For SMP, 2.6 is better, but the delta is rather
> small until we get to 8 CPUs. We have a lot of un-parsed data from other
> tests - there might be some trends there also.
> See http://developer.osdl.org/cliffw/reaim/index.html
> 2.4 kernels are at the bottom of the page.

Do you have profile data for these runs? Also, that webpage doesn't
have 2.4.x results.


-- wli

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: UP Regression (was) Re: Scaling noise
  2003-09-03 17:21           ` William Lee Irwin III
@ 2003-09-03 18:53             ` Cliff White
  0 siblings, 0 replies; 64+ messages in thread
From: Cliff White @ 2003-09-03 18:53 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Nick Piggin, linux-kernel

> On Wed, Sep 03, 2003 at 08:51:56AM -0700, Cliff White wrote:
> > On the Scalable Test Platform, running osdl-aim-7, for the
> > UP case 2.4 is a bit better than 2.6; this is consistent across
> > many runs. For SMP, 2.6 is better, but the delta is rather
> > small until we get to 8 CPUs. We have a lot of un-parsed data from other
> > tests - there might be some trends there also.
> > See http://developer.osdl.org/cliffw/reaim/index.html
> > 2.4 kernels are at the bottom of the page.
> 
> Do you have profile data for these runs? 

For most of them, yes. The link to the profile data is at
the top of the report. Report sorted by load right now.

> Also, that webpage doesn't
> have 2.4.x results.

>> 2.4 kernels are at the bottom of the page.

Scroll all the way down, look for the 'Other Kernels' 
header. There are results for linux-2.4.22, 2.4.23-pre1 + pre2 
for both the new_dbase and compute workloads.

Here's a link to 2.4.23-pre2 on an 8-way, if you don't see it..
http://khack.osdl.org/stp/278651/
cliffw

> 
> 
> -- wli
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03 16:21                 ` Martin J. Bligh
@ 2003-09-03 19:41                   ` Mike Fedyk
  2003-09-03 20:11                     ` Martin J. Bligh
  0 siblings, 1 reply; 64+ messages in thread
From: Mike Fedyk @ 2003-09-03 19:41 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Jörn Engel, Alan Cox, Bernd Eckenfels, Linux Kernel Mailing List

On Wed, Sep 03, 2003 at 09:21:47AM -0700, Martin J. Bligh wrote:
> I've seen people use big machines for clusterable things, which I think
> is a waste of money, but the cost of the machine compared to the cost
> of admin (vs multiple machines) may have come down to the point where 
> it's worth it now. You get implicit "cluster" load balancing done in a
> transparent way by the OS on NUMA boxes.

Doesn't SSI clustering do something similar (without the efficiency of the
interconnections, though)?

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03 19:41                   ` Mike Fedyk
@ 2003-09-03 20:11                     ` Martin J. Bligh
  0 siblings, 0 replies; 64+ messages in thread
From: Martin J. Bligh @ 2003-09-03 20:11 UTC (permalink / raw)
  To: Mike Fedyk
  Cc: Jörn Engel, Alan Cox, Bernd Eckenfels, Linux Kernel Mailing List

> On Wed, Sep 03, 2003 at 09:21:47AM -0700, Martin J. Bligh wrote:
>> I've seen people use big machines for clusterable things, which I think
>> is a waste of money, but the cost of the machine compared to the cost
>> of admin (vs multiple machines) may have come down to the point where 
>> it's worth it now. You get implicit "cluster" load balancing done in a
>> transparent way by the OS on NUMA boxes.
> 
> Doesn't SSI clustering do something similar (without the efficiency of the
> interconnections, though)?

Yes ... *if* someone had a implementation that worked well and was 
maintainable ;-)

M.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03 15:50             ` Martin J. Bligh
@ 2003-09-04  0:49               ` Larry McVoy
  2003-09-04  2:21                 ` Daniel Phillips
  0 siblings, 1 reply; 64+ messages in thread
From: Larry McVoy @ 2003-09-04  0:49 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Larry McVoy, Nick Piggin, Anton Blanchard, linux-kernel

On Wed, Sep 03, 2003 at 08:50:46AM -0700, Martin J. Bligh wrote:
> > Err, when did I ever say it wasn't SSI?  If you look at what I said it's
> > clearly SSI.  Unified process, device, file, and memory namespaces.
> 
> I think it was the bit where you suggested using BitKeeper to sync multiple
> /etc/passwd files that I really switched off ... perhaps you were just
> joking ;-) Perhaps we just had a massive communication disconnect.

I wasn't joking, but that has nothing to do with clusters.  The BK license
has a "single user is free" mode because I wanted very much to allow distros
to use BK to control their /etc files.  It would be amazingly useful if you
could do an upgrade and merge your config changes with their config changes.
Instead we're still in the 80's in terms of config files.

By the way, I couldn't care less if it were BK, CVS, SVN, SCCS, RCS,
whatever.  The config files need to be under version control and you
need to be able to merge in your changes.  BK is what I'd like because
I understand it and know it would work, but it's not a BK thing at all;
I'd happily do work on RCS or whatever to make this happen.  It's just
amazingly painful that these files aren't under version control; it's
stupid, there is an obviously better answer, and the distros aren't
seeing it.  Bummer.

But this has nothing to do with clusters.

> > I'm pretty sure people were so eager to argue with my lovely personality
> > that they never bothered to understand the architecture.  It's _always_
> > been SSI.  I have slides going back at least 4 years that state this:
> > 
> > 	http://www.bitmover.com/talks/smp-clusters
> > 	http://www.bitmover.com/talks/cliq
> 
> I can go back and re-read them; if I misread them last time then I apologise.
> I've also shifted perspectives on SSI clusters somewhat over the last year. 
> Yes, if it's SSI, I'd agree for the most part ... once it's implemented ;-)

Cool!

> I'd rather start with everything separate (one OS instance per node), and
> bind things back together, than split everything up. However, I'm really
> not sure how feasible it is until we actually have something that works.

I'm in 100% agreement.  It's much better to have a bunch of OS's and pull
them together than have one and try and pry it apart.

> I have a rough plan of how to go about it mapped out, in small steps that
> might be useful by themselves. It's a lot of fairly complex hard work ;-)

I've spent quite a bit of time thinking about this and if it started going
anywhere it would be easy for you to tell me to put up or shut up.  I'd
be happy to do some real work on this.  Maybe it would just be doing the
architecture stuff but I strongly suspect there are few people out there
masochistic enough to make controlling tty semantics work properly in this
environment.  I don't want to do it, I'd love someone else to do it, but
if no one steps up to bat I will.  I did all the POSIX crud in SunOS,
I understand the issues, I can do it here, and it is part of the least fun
work, so if I'm pushing the model I should be willing to put some work into
the non-fun part.

The VM work is a lot more fun, I'd like to play there but I suspect that if
we got rolling there are far more talented people who would push me aside.
That's cool, the best people should do the work.
-- 
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: UP Regression (was) Re: Scaling noise
  2003-09-03 15:51         ` UP Regression (was) " Cliff White
  2003-09-03 17:21           ` William Lee Irwin III
@ 2003-09-04  0:54           ` Nick Piggin
  1 sibling, 0 replies; 64+ messages in thread
From: Nick Piggin @ 2003-09-04  0:54 UTC (permalink / raw)
  To: Cliff White; +Cc: linux-kernel



Cliff White wrote:

>[snip]
>.
>
>>I don't think anyone advocates sacrificing UP performance for 32-ways, but
>>as he says it can happen 0.1% at a time.
>>
>>But it looks like 2.6 will scale well to 16-way and higher. I wonder if
>>there are many regressions from 2.4 or 2.2 on small systems.
>>
>>
>>
>On the Scalable Test Platform, running osdl-aim-7, for the
>UP case 2.4 is a bit better than 2.6; this is consistent across
>many runs. For SMP, 2.6 is better, but the delta is rather
>small until we get to 8 CPUs. We have a lot of un-parsed data from other
>tests - there might be some trends there also.
>See http://developer.osdl.org/cliffw/reaim/index.html
>2.4 kernels are at the bottom of the page.
>

Forgive my ignorance of your benchmarks, but this might very well
be HZ == 1000?



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03 15:31             ` Steven Cole
@ 2003-09-04  1:50               ` Daniel Phillips
  2003-09-04  1:52                 ` Larry McVoy
                                   ` (3 more replies)
  0 siblings, 4 replies; 64+ messages in thread
From: Daniel Phillips @ 2003-09-04  1:50 UTC (permalink / raw)
  To: Steven Cole, Antonio Vargas
  Cc: Larry McVoy, CaT, Anton Blanchard, linux-kernel

On Wednesday 03 September 2003 17:31, Steven Cole wrote:
> On Wed, 2003-09-03 at 06:47, Antonio Vargas wrote:
> > As you probably know, CC-clusters were heavily advocated by the
> > same Larry McVoy who started this thread.
>
> Yes, thanks.  I'm well aware of that.  I would like to get a discussion
> going again on CC-clusters, since that seems to be a way out of the
> scaling spiral.  Here is an interesting link:
> http://www.opersys.com/adeos/practical-smp-clusters/

As you know, the argument is that locking overhead grows by some factor worse 
than linear as the size of an SMP cluster increases, so that the locking 
overhead explodes at some point, and thus it would be more efficient to 
eliminate the SMP overhead entirely and run a cluster of UP kernels, 
communicating through the high bandwidth channel provided by shared memory.

There are other arguments, such as how complex locking is, and how it will 
never work correctly, but those are noise: it's pretty much done now, the 
complexity is still manageable, and Linux has never been more stable.

There was a time when SMP locking overhead actually cost something in the high 
single digits on Linux, on certain loads.  Today, you'd have to work at it to 
find a real load where the 2.5/6 kernel spends more than 1% of its time in 
locking overhead, even on a large SMP machine (sample size of one: I asked 
Bill Irwin how his 32 node Numa cluster is running these days).  This blows 
the ccCluster idea out of the water, sorry.  The only way ccCluster gets to 
live is if SMP locking is pathetic and it's not.
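
If you want to sanity-check the order of magnitude yourself, here is a
crude user-space toy of mine (it times raw uncontended lock/unlock pairs,
which is only a rough proxy for kernel locking cost):

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        pthread_spinlock_t lock;
        struct timespec t0, t1;
        long i, n = 10000000;
        double ns;

        pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < n; i++) {
            pthread_spin_lock(&lock);     /* uncontended fast path */
            pthread_spin_unlock(&lock);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("~%.1f ns per lock/unlock pair\n", ns / n);
        return 0;
    }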

As for Karim's work, it's a quintessentially flashy trick to make two UP 
kernels run on a dual processor.  It's worth doing, but not because it blazes 
the way forward for ccClusters.  It can be the basis for hot kernel swap: 
migrate all the processes to one of the two CPUs, load and start a new kernel 
on the other one, migrate all processes to it, and let the new kernel restart 
the first processor, which is now idle.

Regards,

Daniel


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-04  1:50               ` Daniel Phillips
@ 2003-09-04  1:52                 ` Larry McVoy
  2003-09-04  4:42                   ` David S. Miller
  2003-09-04  2:18                 ` William Lee Irwin III
                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 64+ messages in thread
From: Larry McVoy @ 2003-09-04  1:52 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Steven Cole, Antonio Vargas, Larry McVoy, CaT, Anton Blanchard,
	linux-kernel

On Thu, Sep 04, 2003 at 03:50:31AM +0200, Daniel Phillips wrote:
> There are other arguments, such as how complex locking is, and how it will 
> never work correctly, but those are noise: it's pretty much done now, the 
> complexity is still manageable, and Linux has never been more stable.

yeah, right.  I'm not sure what you are smoking but I'll avoid your dealer.

Your politics are showing, Daniel.  Try staying focussed on the technical
merits and we can have a discussion.  Otherwise you just get ignored.  
-- 
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-04  1:50               ` Daniel Phillips
  2003-09-04  1:52                 ` Larry McVoy
@ 2003-09-04  2:18                 ` William Lee Irwin III
  2003-09-04  2:19                 ` Steven Cole
  2003-09-08 19:27                 ` bill davidsen
  3 siblings, 0 replies; 64+ messages in thread
From: William Lee Irwin III @ 2003-09-04  2:18 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Steven Cole, Antonio Vargas, Larry McVoy, CaT, Anton Blanchard,
	linux-kernel

On Thu, Sep 04, 2003 at 03:50:31AM +0200, Daniel Phillips wrote:
> Bill Irwin how his 32 node Numa cluster is running these days).  This blows 

Sorry for any misunderstanding; the model only goes to 16 nodes/64x, and
the box mentioned was 32 CPUs. It's also SMP (SSI, shared memory,
mach-numaq), not a cluster. I also only have half of it full-time.


-- wli

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-04  1:50               ` Daniel Phillips
  2003-09-04  1:52                 ` Larry McVoy
  2003-09-04  2:18                 ` William Lee Irwin III
@ 2003-09-04  2:19                 ` Steven Cole
  2003-09-04  2:35                   ` William Lee Irwin III
  2003-09-04  3:07                   ` Daniel Phillips
  2003-09-08 19:27                 ` bill davidsen
  3 siblings, 2 replies; 64+ messages in thread
From: Steven Cole @ 2003-09-04  2:19 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Antonio Vargas, Larry McVoy, CaT, Anton Blanchard, linux-kernel

On Wed, 2003-09-03 at 19:50, Daniel Phillips wrote:
> On Wednesday 03 September 2003 17:31, Steven Cole wrote:
> > On Wed, 2003-09-03 at 06:47, Antonio Vargas wrote:
> > > As you probably know, CC-clusters were heavily advocated by the
> > > same Larry McVoy who started this thread.
> >
> > Yes, thanks.  I'm well aware of that.  I would like to get a discussion
> > going again on CC-clusters, since that seems to be a way out of the
> > scaling spiral.  Here is an interesting link:
> > http://www.opersys.com/adeos/practical-smp-clusters/
> 
> As you know, the argument is that locking overhead grows by some factor worse 
> than linear as the size of an SMP cluster increases, so that the locking 
> overhead explodes at some point, and thus it would be more efficient to 
> eliminate the SMP overhead entirely and run a cluster of UP kernels, 
> communicating through the high bandwidth channel provided by shared memory.
> 
> There are other arguments, such as how complex locking is, and how it will 
> never work correctly, but those are noise: it's pretty much done now, the 
> complexity is still manageable, and Linux has never been more stable.
> 
> There was a time when SMP locking overhead actually cost something in the high 
> single digits on Linux, on certain loads.  Today, you'd have to work at it to 
> find a real load where the 2.5/6 kernel spends more than 1% of its time in 
> locking overhead, even on a large SMP machine (sample size of one: I asked 
> Bill Irwin how his 32 node Numa cluster is running these days).  This blows 
> the ccCluster idea out of the water, sorry.  The only way ccCluster gets to 
> live is if SMP locking is pathetic and it's not.

I would never call the SMP locking pathetic, but it could be improved.
Looking at Figure 6 (Star-CD, 1-64 processors on Altix) and Figure 7
(Gaussian 1-32 processors on Altix) on page 13 of "Linux Scalability for
Large NUMA Systems", available for download here:
http://archive.linuxsymposium.org/ols2003/Proceedings/
it appears that for those applications, the curves begin to flatten
rather alarmingly.  This may have little to do with locking overhead.

One possible benefit of using ccClusters would be to stay on that lower
part of the curve for the nodes, using  perhaps 16 CPUs in a node.  That
way, a 256 CPU (e.g. Altix 3000) system might perform better than if a
single kernel were to be used.  I say might.  It's likely that only
empirical data will tell the tale for sure.
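
Even with zero locking overhead, some flattening is what Amdahl's law
predicts.  A back-of-the-envelope sketch in C (the 2% serial fraction is
an assumption picked purely for illustration, not a measured number):

#include <stdio.h>

/* Amdahl's law: speedup(n) = 1 / (s + (1 - s)/n), where s is the
 * serial fraction of the workload.  Even a small s flattens the
 * curve long before 256 CPUs. */
int main(void)
{
        const double s = 0.02;  /* assumed serial fraction */
        int n;

        for (n = 1; n <= 256; n *= 2)
                printf("%3d CPUs: speedup %5.1f\n",
                       n, 1.0 / (s + (1.0 - s) / n));
        return 0;
}

With that assumed 2% serial fraction, the speedup is already capped near
42 at 256 CPUs, knee and all, before locking enters the picture.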

> 
> As for Karim's work, it's a quintessentially flashy trick to make two UP 
> kernels run on a dual processor.  It's worth doing, but not because it blazes 
> the way forward for ccClusters.  It can be the basis for hot kernel swap: 
> migrate all the processes to one of the two CPUs, load and start a new kernel 
> on the other one, migrate all processes to it, and let the new kernel restart 
> the first processor, which is now idle.
> 
Thank you for that very succinct summary of my rather long-winded
exposition on that subject which I posted here:
http://marc.theaimsgroup.com/?l=linux-kernel&m=105214105131450&w=2
Quite a bit of the complexity which I mentioned, if it were necessary at
all, could go into user-space helper processes which get spawned for the
outgoing kernel, and run before init on the incoming kernel. Also, my
comment about not being able to shoe-horn two kernels in at once on
32-bit arches may have been addressed by Ingo's 4G/4G split.

Steven


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-04  0:49               ` Larry McVoy
@ 2003-09-04  2:21                 ` Daniel Phillips
  2003-09-04  2:35                   ` Martin J. Bligh
  2003-09-04  2:46                   ` Larry McVoy
  0 siblings, 2 replies; 64+ messages in thread
From: Daniel Phillips @ 2003-09-04  2:21 UTC (permalink / raw)
  To: Larry McVoy, Martin J. Bligh
  Cc: Larry McVoy, Nick Piggin, Anton Blanchard, linux-kernel

On Thursday 04 September 2003 02:49, Larry McVoy wrote:
> It's much better to have a bunch of OS's and pull
> them together than have one and try and pry it apart.

This is bogus.  The numbers clearly don't work if the ccCluster is made of 
uniprocessors, so obviously the SMP locking has to be implemented anyway, to 
get each node up to the size just below the supposed knee in the scaling 
curve.  This eliminates the argument about saving complexity and/or work.

The way Linux scales now, the locking stays out of the range where SSI could 
compete up to, what?  128 processors?  More?  Maybe we'd better ask SGI about 
that, but we already know what the answer is for 32: boring old SMP wins 
hands down.  Where is the machine that has the knee in the wrong part of the 
curve?  Oh, maybe we should all just stop whatever work we're doing and wait 
ten years for one to show up.

But far be it from me to suggest that reality should interfere with your fun.

Regards,

Daniel


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-04  2:19                 ` Steven Cole
@ 2003-09-04  2:35                   ` William Lee Irwin III
  2003-09-04  2:40                     ` Steven Cole
  2003-09-04  3:07                   ` Daniel Phillips
  1 sibling, 1 reply; 64+ messages in thread
From: William Lee Irwin III @ 2003-09-04  2:35 UTC (permalink / raw)
  To: Steven Cole
  Cc: Daniel Phillips, Antonio Vargas, Larry McVoy, CaT,
	Anton Blanchard, linux-kernel

On Wed, Sep 03, 2003 at 08:19:26PM -0600, Steven Cole wrote:
> I would never call the SMP locking pathetic, but it could be improved.
> Looking at Figure 6 (Star-CD, 1-64 processors on Altix) and Figure 7
> (Gaussian 1-32 processors on Altix) on page 13 of "Linux Scalability for
> Large NUMA Systems", available for download here:
> http://archive.linuxsymposium.org/ols2003/Proceedings/
> it appears that for those applications, the curves begin to flatten
> rather alarmingly.  This may have little to do with locking overhead.

Those numbers are 2.4.x


-- wli

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-04  2:21                 ` Daniel Phillips
@ 2003-09-04  2:35                   ` Martin J. Bligh
  2003-09-04  2:46                   ` Larry McVoy
  1 sibling, 0 replies; 64+ messages in thread
From: Martin J. Bligh @ 2003-09-04  2:35 UTC (permalink / raw)
  To: Daniel Phillips, Larry McVoy; +Cc: Nick Piggin, Anton Blanchard, linux-kernel

> On Thursday 04 September 2003 02:49, Larry McVoy wrote:
>> It's much better to have a bunch of OS's and pull
>> them together than have one and try and pry it apart.
> 
> This is bogus.  The numbers clearly don't work if the ccCluster is made of 
> uniprocessors, so obviously the SMP locking has to be implemented anyway, to 
> get each node up to the size just below the supposed knee in the scaling 
> curve.  This eliminates the argument about saving complexity and/or work.
> 
> The way Linux scales now, the locking stays out of the range where SSI could 
> compete up to, what?  128 processors?  More?  Maybe we'd better ask SGI about 
> that, but we already know what the answer is for 32: boring old SMP wins 
> hands down.  Where is the machine that has the knee in the wrong part of the 
> curve?  Oh, maybe we should all just stop whatever work we're doing and wait 
> ten years for one to show up.
> 
> But far be it from me to suggest that reality should interfere with your fun.

Yes, you need locking, but only for the bits where you glue stuff back
together. Plenty of bits can operate independently per node, or at
least ... I'm hoping they can in my vapourware world ;-)

M.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-04  2:35                   ` William Lee Irwin III
@ 2003-09-04  2:40                     ` Steven Cole
  2003-09-04  3:20                       ` Nick Piggin
  0 siblings, 1 reply; 64+ messages in thread
From: Steven Cole @ 2003-09-04  2:40 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Daniel Phillips, Antonio Vargas, Larry McVoy, CaT,
	Anton Blanchard, linux-kernel

On Wed, 2003-09-03 at 20:35, William Lee Irwin III wrote:
> On Wed, Sep 03, 2003 at 08:19:26PM -0600, Steven Cole wrote:
> > I would never call the SMP locking pathetic, but it could be improved.
> > Looking at Figure 6 (Star-CD, 1-64 processors on Altix) and Figure 7
> > (Gaussian 1-32 processors on Altix) on page 13 of "Linux Scalability for
> > Large NUMA Systems", available for download here:
> > http://archive.linuxsymposium.org/ols2003/Proceedings/
> > it appears that for those applications, the curves begin to flatten
> > rather alarmingly.  This may have little to do with locking overhead.
> 
> Those numbers are 2.4.x

Yes, I saw that.  It would be interesting to see results for recent
2.6.0-testX kernels.  Judging from other recent numbers out of OSDL, the
results for 2.6 should be quite a bit better.  But won't the curves
still begin to flatten, just at a higher CPU count?  Or has the miracle
goodness of RCU pushed those limits to insanely high numbers?

Steven


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-04  2:21                 ` Daniel Phillips
  2003-09-04  2:35                   ` Martin J. Bligh
@ 2003-09-04  2:46                   ` Larry McVoy
  2003-09-04  4:58                     ` David S. Miller
  1 sibling, 1 reply; 64+ messages in thread
From: Larry McVoy @ 2003-09-04  2:46 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Larry McVoy, Martin J. Bligh, Nick Piggin, Anton Blanchard,
	linux-kernel

On Thu, Sep 04, 2003 at 04:21:16AM +0200, Daniel Phillips wrote:
> On Thursday 04 September 2003 02:49, Larry McVoy wrote:
> > It's much better to have a bunch of OS's and pull
> > them together than have one and try and pry it apart.
> 
> This is bogus.  The numbers clearly don't work if the ccCluster is made of 
> uniprocessors, so obviously the SMP locking has to be implemented anyway, to 
> get each node up to the size just below the supposed knee in the scaling 
> curve.  This eliminates the argument about saving complexity and/or work.

If you thought before you spoke you'd realize how wrong you are.  How many
locks are there in the IRIX/Solaris/Linux I/O path?  How many are needed for
2-4 way scaling?  

Here's the litmus test: list all the locks in the kernel and the locking
hierarchy.  If you, a self claimed genius, can't do it, how can the rest
of us mortals possibly do it?  Quick.  You have 30 seconds, I want a list.
A complete list with the locking hierarchy, no silly awk scripts.  You have
to show which locks can deadlock, from memory.

No list?  Cool, you just proved my point.
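
(For the mortals following along: the kind of thing such a list has to
spell out, lock by lock, is an ordering rule.  A trivial sketch with
invented lock names, nothing from a real kernel path:

#include <linux/spinlock.h>

/* Documented hierarchy: inode_lock is always taken before page_lock.
 * Any code path taking them in the other order can deadlock against
 * this one -- the classic AB-BA case. */
static spinlock_t inode_lock = SPIN_LOCK_UNLOCKED;
static spinlock_t page_lock  = SPIN_LOCK_UNLOCKED;

static void update_both(void)
{
        spin_lock(&inode_lock);         /* level 1 */
        spin_lock(&page_lock);          /* level 2: respects the order */
        /* ... critical section ... */
        spin_unlock(&page_lock);
        spin_unlock(&inode_lock);
}

Now multiply that ordering rule by every pair of locks in the I/O path.)
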
-- 
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-04  2:19                 ` Steven Cole
  2003-09-04  2:35                   ` William Lee Irwin III
@ 2003-09-04  3:07                   ` Daniel Phillips
  1 sibling, 0 replies; 64+ messages in thread
From: Daniel Phillips @ 2003-09-04  3:07 UTC (permalink / raw)
  To: Steven Cole
  Cc: Antonio Vargas, Larry McVoy, CaT, Anton Blanchard, linux-kernel

On Thursday 04 September 2003 04:19, Steven Cole wrote:
> On Wed, 2003-09-03 at 19:50, Daniel Phillips wrote:
> > There was a time when SMP locking overhead actually cost something in the
> > high single digits on Linux, on certain loads.  Today, you'd have to work
> > at it to find a real load where the 2.5/6 kernel spends more than 1% of
> > its time in locking overhead, even on a large SMP machine (sample size of
> > one: I asked Bill Irwin how his 32 node Numa cluster is running these
> > days).  This blows the ccCluster idea out of the water, sorry.  The only
> > way ccCluster gets to live is if SMP locking is pathetic and it's not.
>
> I would never call the SMP locking pathetic, but it could be improved.
> Looking at Figure 6 (Star-CD, 1-64 processors on Altix) and Figure 7
> (Gaussian 1-32 processors on Altix) on page 13 of "Linux Scalability for
> Large NUMA Systems", available for download here:
> http://archive.linuxsymposium.org/ols2003/Proceedings/
> it appears that for those applications, the curves begin to flatten
> rather alarmingly.  This may have little to do with locking overhead.

2.4.17 is getting a little old, don't you think?  This is the thing that 
changed most in 2.4 -> 2.6, and indeed, much of the work was in locking.

> One possible benefit of using ccClusters would be to stay on that lower
> part of the curve for the nodes, using  perhaps 16 CPUs in a node.  That
> way, a 256 CPU (e.g. Altix 3000) system might perform better than if a
> single kernel were to be used.  I say might.  It's likely that only
> empirical data will tell the tale for sure.

Right, and we do not see SGI contributing patches for partitioning their 256 
CPU boxes.  That's all the empirical data I need at this point.

They surely do partition them, but not at the Linux OS level.

> > As for Karim's work, it's a quintessentially flashy trick to make two UP
> > kernels run on a dual processor.  It's worth doing, but not because it
> > blazes the way forward for ccClusters.  It can be the basis for hot
> > kernel swap: migrate all the processes to one of the two CPUs, load and
> > start a new kernel on the other one, migrate all processes to it, and let
> > the new kernel restart the first processor, which is now idle.
>
> Thank you for that very succinct summary of my rather long-winded
> exposition on that subject which I posted here:
> http://marc.theaimsgroup.com/?l=linux-kernel&m=105214105131450&w=2

I swear I made the above up on the spot, just now :-)

> Quite a bit of the complexity which I mentioned, if it were necessary at
> all, could go into user space helper processes which get spawned for the
> kernel going away, and before init for the on-coming kernel. Also, my
> comment about not being able to shoe-horn two kernels in at once for
> 32-bit arches may have been addressed by Ingo's 4G/4G split.

I don't see what you're worried about: they are separate kernels, and you
get two instances of whatever split you want.

Regards,

Daniel


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-04  2:40                     ` Steven Cole
@ 2003-09-04  3:20                       ` Nick Piggin
  0 siblings, 0 replies; 64+ messages in thread
From: Nick Piggin @ 2003-09-04  3:20 UTC (permalink / raw)
  To: Steven Cole
  Cc: William Lee Irwin III, Daniel Phillips, Antonio Vargas,
	Larry McVoy, CaT, Anton Blanchard, linux-kernel

Steven Cole wrote:

>On Wed, 2003-09-03 at 20:35, William Lee Irwin III wrote:
>
>>On Wed, Sep 03, 2003 at 08:19:26PM -0600, Steven Cole wrote:
>>
>>>I would never call the SMP locking pathetic, but it could be improved.
>>>Looking at Figure 6 (Star-CD, 1-64 processors on Altix) and Figure 7
>>>(Gaussian 1-32 processors on Altix) on page 13 of "Linux Scalability for
>>>Large NUMA Systems", available for download here:
>>>http://archive.linuxsymposium.org/ols2003/Proceedings/
>>>it appears that for those applications, the curves begin to flatten
>>>rather alarmingly.  This may have little to do with locking overhead.
>>>
>>Those numbers are 2.4.x
>>
>
>Yes, I saw that.  It would be interesting to see results for recent
>2.6.0-testX kernels.  Judging from other recent numbers out of OSDL, the
>results for 2.6 should be quite a bit better.  But won't the curves
>still begin to flatten, just at a higher CPU count?  Or has the miracle
>goodness of RCU pushed those limits to insanely high numbers?
>

They fixed some big 2.4 scalability problems, so it wouldn't be as
impressive as a plain 2.4 -> 2.6 comparison. However, there are obviously
hardware scalability limits as well as software ones, so a more
interesting comparison would of course be 2.6 vs LM's SSI clusters.



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-04  1:52                 ` Larry McVoy
@ 2003-09-04  4:42                   ` David S. Miller
  2003-09-08 19:40                     ` bill davidsen
  0 siblings, 1 reply; 64+ messages in thread
From: David S. Miller @ 2003-09-04  4:42 UTC (permalink / raw)
  To: Larry McVoy; +Cc: phillips, elenstev, wind, lm, cat, anton, linux-kernel

On Wed, 3 Sep 2003 18:52:49 -0700
Larry McVoy <lm@bitmover.com> wrote:

> On Thu, Sep 04, 2003 at 03:50:31AM +0200, Daniel Phillips wrote:
> > There are other arguments, such as how complex locking is, and how it will 
> > never work correctly, but those are noise: it's pretty much done now, the 
> > complexity is still manageable, and Linux has never been more stable.
> 
> yeah, right.  I'm not sure what you are smoking but I'll avoid your dealer.

I hate to enter these threads but...

The number of locking bugs found in the core networking, ipv4, and
ipv6 over the last year or two of 2.4.x has been nearly nil.

If you're going to try and argue against supporting huge SMP
to me, don't make locking complexity one of the arguments. :-)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03 15:39           ` Larry McVoy
  2003-09-03 15:50             ` Martin J. Bligh
@ 2003-09-04  4:49             ` David S. Miller
  2003-09-08 19:50             ` bill davidsen
  2 siblings, 0 replies; 64+ messages in thread
From: David S. Miller @ 2003-09-04  4:49 UTC (permalink / raw)
  To: Larry McVoy; +Cc: mbligh, piggin, anton, lm, linux-kernel

On Wed, 3 Sep 2003 08:39:01 -0700
Larry McVoy <lm@bitmover.com> wrote:

> It's really easy to claim that scalability isn't the problem.  Scaling
> changes in general cause very minute differences, it's just that there
> are a lot of them.  There is constant pressure to scale further and people
> think it's cool.

So why are people still going down this path?

I'll tell you why: because as SMP issues start to show up on
mainstream boxes, people are going to find clever solutions
to most of the memory-sharing issues that cause all the "lock
overhead".

Things like RCU are just the tip of the iceberg.  And think, Larry:
we didn't have stuff like RCU back when you were directly working
on, and watching people work on, huge SMP systems.
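
For anyone who hasn't watched RCU in action, here is a minimal sketch
(the config struct and function names are invented; the primitive names
are the ones the RCU API settled on):

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct config { int value; };
static struct config *global_cfg;

/* Read side: no lock is taken and no shared cache line is written,
 * so there is nothing to bounce between CPUs. */
static int read_value(void)
{
        struct config *cfg;
        int val;

        rcu_read_lock();
        cfg = rcu_dereference(global_cfg);
        val = cfg->value;
        rcu_read_unlock();
        return val;
}

/* Update side: publish the new version, then wait until every
 * pre-existing reader is done before freeing the old one. */
static void update_value(struct config *newcfg)
{
        struct config *old = global_cfg;

        rcu_assign_pointer(global_cfg, newcfg);
        synchronize_rcu();
        kfree(old);
}

The read side is exactly the memory-sharing cost being argued about,
reduced to nothing.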

I think it's instructive to look at hyperthreading from another
angle in this argument: the CPU people invested billions of
dollars of work to turn memory latency into free CPU cycles.

Put that in your pipe and smoke it :-)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-04  2:46                   ` Larry McVoy
@ 2003-09-04  4:58                     ` David S. Miller
  2003-09-10 15:47                       ` Lock EVERYTHING (for testing) [was: Re: Scaling noise] Timothy Miller
  0 siblings, 1 reply; 64+ messages in thread
From: David S. Miller @ 2003-09-04  4:58 UTC (permalink / raw)
  To: Larry McVoy; +Cc: phillips, lm, mbligh, piggin, anton, linux-kernel

On Wed, 3 Sep 2003 19:46:08 -0700
Larry McVoy <lm@bitmover.com> wrote:

> Here's the litmus test: list all the locks in the kernel and the locking
> hierarchy.  If you, a self claimed genius, can't do it, how can the rest
> of us mortals possibly do it?  Quick.  You have 30 seconds, I want a list.
> A complete list with the locking hierarchy, no silly awk scripts.  You have
> to show which locks can deadlock, from memory.
> 
> No list?  Cool, you just proved my point.

No point Larry, asking the same question about how the I/O
path works sans the locks will give you the same blank stare.

I absolutely do not accept the complexity argument.  We have a fully
scalable kernel now.  Do you know why?  It's not because we have some
weird genius trolls writing the code, it's because of our insanely
huge testing base.

People give a lot of credit to the people writing the code in the
Linux kernel which actually belongs to the people running the
code. :-)

That's where the other systems failed: all the in-house stress
testing in the world is not going to find the bugs we do find in
Linux.  That's why Solaris goes out buggy and with all kinds of
SMP deadlocks; their tester base is just too small to hit all
the important bugs.

FWIW, I actually can list all the locks taken for the primary paths in
the networking, and that's about as finely locked as we can make it.
As can Alexey Kuznetsov...

So again, if you're going to argue against huge SMP (at least to me),
don't use the locking complexity argument.  Not only have we basically
conquered it, we've along the way found some amazing ways to find
locking bugs both at runtime and at compile time.  You can even debug
them on uniprocessor systems.  And this doesn't even count the
potential things we can do with Linus's sparse tool.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03 15:10             ` Martin J. Bligh
  2003-09-03 16:01               ` Jörn Engel
@ 2003-09-04 20:36               ` Rik van Riel
  2003-09-04 20:47                 ` Martin J. Bligh
  2003-09-04 21:30                 ` William Lee Irwin III
  1 sibling, 2 replies; 64+ messages in thread
From: Rik van Riel @ 2003-09-04 20:36 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Alan Cox, Bernd Eckenfels, Linux Kernel Mailing List

On Wed, 3 Sep 2003, Martin J. Bligh wrote:

> The real core use of NUMA is to run one really big app on one machine,
> where it's hard to split it across a cluster. You just can't build an
> SMP box big enough for some of these things.

That only works when the NUMA factor is low enough that
you can effectively treat the box as an SMP system.

It doesn't work when you have a NUMA factor of 15 (like
some unspecified box you are very familiar with) and
half of your database index is always on the "other half"
of the two-node NUMA system.

You'll end up with half your accesses being 15 times as
slow, meaning that your average memory access time is 8
times as high (0.5 x 1 + 0.5 x 15 = 8)!  Good way to REDUCE
performance, but most people won't like that...

If the NUMA factor is low enough that applications can
treat it like SMP, then the kernel NUMA support won't
have to be very high either...

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-04 20:36               ` Rik van Riel
@ 2003-09-04 20:47                 ` Martin J. Bligh
  2003-09-04 21:30                 ` William Lee Irwin III
  1 sibling, 0 replies; 64+ messages in thread
From: Martin J. Bligh @ 2003-09-04 20:47 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Alan Cox, Bernd Eckenfels, Linux Kernel Mailing List

> On Wed, 3 Sep 2003, Martin J. Bligh wrote:
> 
>> The real core use of NUMA is to run one really big app on one machine,
>> where it's hard to split it across a cluster. You just can't build an
>> SMP box big enough for some of these things.
> 
> That only works when the NUMA factor is low enough that
> you can effectively treat the box as an SMP system.
> 
> It doesn't work when you have a NUMA factor of 15 (like
> some unspecified box you are very familiar with) and
> half of your database index is always on the "other half"
> of the two-node NUMA system.
> 
> You'll end up with half your accesses being 15 times as
> slow, meaning that your average memory access time is 8
> times as high!  Good way to REDUCE performance, but most
> people won't like that...
> 
> If the NUMA factor is low enough that applications can
> treat it like SMP, then the kernel NUMA support won't
> have to be very high either...

I think there are a few too many assumptions in that - are you thinking
of a big r/w shmem application? There are lots of other application
programming models that wouldn't suffer nearly so much ... but maybe
they're more splittable ... there are lots of things we can do to ensure
at least better-than-average node-locality for most of the memory.

M.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-04 20:36               ` Rik van Riel
  2003-09-04 20:47                 ` Martin J. Bligh
@ 2003-09-04 21:30                 ` William Lee Irwin III
  1 sibling, 0 replies; 64+ messages in thread
From: William Lee Irwin III @ 2003-09-04 21:30 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Martin J. Bligh, Alan Cox, Bernd Eckenfels,
	Linux Kernel Mailing List

On Thu, Sep 04, 2003 at 04:36:56PM -0400, Rik van Riel wrote:
> You'll end up with half your accesses being 15 times as
> slow, meaning that your average memory access time is 8
> times as high!  Good way to REDUCE performance, but most
> people won't like that...
> If the NUMA factor is low enough that applications can
> treat it like SMP, then the kernel NUMA support won't
> have to be very high either...

This does not hold. The data set is not necessarily where the
communication occurs.


-- wli

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03  5:08       ` Larry McVoy
                           ` (4 preceding siblings ...)
  2003-09-03 16:37         ` Kurt Wall
@ 2003-09-06 15:08         ` Pavel Machek
  2003-09-08 13:38           ` Alan Cox
  5 siblings, 1 reply; 64+ messages in thread
From: Pavel Machek @ 2003-09-06 15:08 UTC (permalink / raw)
  To: Larry McVoy, CaT, Larry McVoy, Anton Blanchard, linux-kernel

Hi!

> Maybe this is a better way to get my point across.  Think about more CPUs
> on the same memory subsystem.  I've been trying to make this scaling point

The point of hyperthreading is that more virtual CPUs on the same memory
subsystem can actually help stuff.
-- 
				Pavel
Written on sharp zaurus, because my Velo1 broke. If you have Velo you don't need...


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-06 15:08         ` Pavel Machek
@ 2003-09-08 13:38           ` Alan Cox
  2003-09-09  6:11             ` Rob Landley
  0 siblings, 1 reply; 64+ messages in thread
From: Alan Cox @ 2003-09-08 13:38 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Larry McVoy, CaT, Larry McVoy, Anton Blanchard,
	Linux Kernel Mailing List

On Sad, 2003-09-06 at 16:08, Pavel Machek wrote:
> Hi!
> 
> > Maybe this is a better way to get my point across.  Think about more CPUs
> > on the same memory subsystem.  I've been trying to make this scaling point
> 
> The point of hyperthreading is that more virtual CPUs on the same memory
> subsystem can actually help stuff.

It's a way of exposing asynchronicity while keeping the old instruction set.
It's trying to make better use of the available bandwidth by having
something else to schedule into stalls. That's why HT is really good for
code which is full of polling I/O and badly coded memory accesses, but is
worthless on perfectly tuned, hand-coded stuff which doesn't stall.

Its great feature is that HT gets *more*, not less, useful as the CPU gets
faster.
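
A sketch of the kind of code HT loves (plain C, invented structure):
every load depends on the one before it, so the core sits stalled on
memory and the sibling thread gets those cycles.

struct node {
        struct node *next;
        long payload;
};

/* Dependent pointer chase: each iteration must wait a full memory
 * latency for n->next before it can even issue the next load. */
static long chase(struct node *head)
{
        struct node *n;
        long sum = 0;

        for (n = head; n != NULL; n = n->next)
                sum += n->payload;
        return sum;
}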


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03 14:25         ` Steven Cole
  2003-09-03 12:47           ` Antonio Vargas
@ 2003-09-08 19:12           ` bill davidsen
  1 sibling, 0 replies; 64+ messages in thread
From: bill davidsen @ 2003-09-08 19:12 UTC (permalink / raw)
  To: linux-kernel

In article <1062599136.1724.84.camel@spc9.esa.lanl.gov>,
Steven Cole  <elenstev@mesatop.com> wrote:
| On Tue, 2003-09-02 at 23:08, Larry McVoy wrote:
| > On Wed, Sep 03, 2003 at 02:33:56PM +1000, CaT wrote:
| > > I think Anton is referring to the fact that on a 4-way cpu machine with
| > > HT enabled you basically have an 8-way smp box (with special conditions)
| > > and so if 4-way machines are becoming more popular, making sure that 8-way
| > > smp works well is a good idea.
| > 
| > Maybe this is a better way to get my point across.  Think about more CPUs
| > on the same memory subsystem.  I've been trying to make this scaling point
| > ever since I discovered how much cache misses hurt.  That was about 1995
| > or so.  At that point, memory latency was about 200 ns and processor speeds
| > were at about 200Mhz or 5 ns.  Today, memory latency is about 130 ns and
| > processor speeds are about .3 ns.  Processor speeds are 15 times faster and
| > memory is less than 2 times faster.  SMP makes that ratio worse.
| > 
| > It's called asymptotic behavior.  After a while you can look at the graph
| > and see that more CPUs on the same memory doesn't make sense.  It hasn't
| > made sense for a decade, what makes anyone think that is changing?
| 
| You're right about the asymptotic behavior and you'll just get more
| right as time goes on, but other forces are at work.
| 
| What is changing is the number of cores per 'processor' is increasing. 
| The Intel Montecito will increase this to two, and rumor has it that the
| Intel Tanglewood may have as many as sixteen.  The IBM Power6 will
| likely be similarly capable.
| 
| The Tanglewood is not some far off flight of fancy; it may be available
| as soon as the 2.8.x stable series, so planning to accommodate it should
| be happening now.  
| 
| With companies like SGI building Altix systems with 64 and 128 CPUs
| using the current single-core Madison, just think of what will be
| possible using the future hardware. 
| 
| In four years, Michael Dell will still be saying the same thing, but
| he'll just fudge his answer by a factor of four. 

The mass market will still be in small machines, because the CPUs keep
on getting faster. And at least for most small servers running Linux,
like news, mail, DNS, and web, the disk, memory and network are more of
a problem than the CPU. Some database and CGI loads are CPU intensive,
but I don't see that the nature of loads will change; most aren't CPU
intensive.

| The question which will continue to be important in the next kernel
| series is: How to best accommodate the future many-CPU machines without
| sacrificing performance on the low-end?  The change is that the 'many'
| in the above may start to double every few years.

Since you can still get a decent research grant or graduate thesis out
of ways to use a lot of CPUs, there will not be a lack of thought on the
topic. I think Larry is just worried that some of these solutions may
really work poorly on smaller systems.

| Some candidate answers to this have been discussed before, such as
| cache-coherent clusters.  I just hope this gets worked out before the
| hardware ships.

Honestly, I would expect a good solution to scale better at the "more"
end of the range than the "less".  A good 16-way approach will probably
not need major work for 256, while it may be pretty grim for the uni or
2-way (counting HT) machines.

With all the work people are doing on scheduler changes for
responsiveness, and the number of people trying them, I would assume there
is a real need for improvement on small machines, favoring response over
throughput.
-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-04  1:50               ` Daniel Phillips
                                   ` (2 preceding siblings ...)
  2003-09-04  2:19                 ` Steven Cole
@ 2003-09-08 19:27                 ` bill davidsen
  3 siblings, 0 replies; 64+ messages in thread
From: bill davidsen @ 2003-09-08 19:27 UTC (permalink / raw)
  To: linux-kernel

In article <200309040350.31949.phillips@arcor.de>,
Daniel Phillips  <phillips@arcor.de> wrote:

| As for Karim's work, it's a quintessentially flashy trick to make two UP 
| kernels run on a dual processor.  It's worth doing, but not because it blazes 
| the way forward for ccClusters.  It can be the basis for hot kernel swap: 
| migrate all the processes to one of the two CPUs, load and start a new kernel 
| on the other one, migrate all processes to it, and let the new kernel restart 
| the first processor, which is now idle.

UML running on a sibling, anyone? Interesting concept, not necessarily
useful.
-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-04  4:42                   ` David S. Miller
@ 2003-09-08 19:40                     ` bill davidsen
  0 siblings, 0 replies; 64+ messages in thread
From: bill davidsen @ 2003-09-08 19:40 UTC (permalink / raw)
  To: linux-kernel

In article <20030903214233.24d3c902.davem@redhat.com>,
David S. Miller <davem@redhat.com> wrote:
| On Wed, 3 Sep 2003 18:52:49 -0700
| Larry McVoy <lm@bitmover.com> wrote:
| 
| > On Thu, Sep 04, 2003 at 03:50:31AM +0200, Daniel Phillips wrote:
| > > There are other arguments, such as how complex locking is, and how it will 
| > > never work correctly, but those are noise: it's pretty much done now, the 
| > > complexity is still manageable, and Linux has never been more stable.
| > 
| > yeah, right.  I'm not sure what you are smoking but I'll avoid your dealer.
| 
| I hate to enter these threads but...
| 
| The amount of locking bugs found in the core networking, ipv4, and
| ipv6 for a year or two in 2.4.x has been nearly nil.
| 
| If you're going to try and argue against supporting huge SMP
| to me, don't make locking complexity one of the arguments. :-)

If you count only "bugs" which cause a hang or oops, sure. But just
because something works doesn't make it simple (or non-complex if you
prefer). Look at all the "lockless" changes and such in 2.4, and I
think you will agree that there have been quite a number, and that it is
complex. I don't think stable and complex are mutually exclusive in this
case.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-03 15:39           ` Larry McVoy
  2003-09-03 15:50             ` Martin J. Bligh
  2003-09-04  4:49             ` Scaling noise David S. Miller
@ 2003-09-08 19:50             ` bill davidsen
  2003-09-08 23:39               ` Peter Chubb
  2 siblings, 1 reply; 64+ messages in thread
From: bill davidsen @ 2003-09-08 19:50 UTC (permalink / raw)
  To: linux-kernel

In article <20030903153901.GB5769@work.bitmover.com>,
Larry McVoy  <lm@bitmover.com> wrote:

| It's really easy to claim that scalability isn't the problem.  Scaling
| changes in general cause very minute differences, it's just that there
| are a lot of them.  There is constant pressure to scale further and people
| think it's cool.  You can argue you all you want that scaling done right
| isn't a problem but nobody has ever managed to do it right.  I know it's
| politically incorrect to say this group won't either but there is no 
| evidence that they will.

I think that if the problem of a single scheduler which is "best" at
everything proves out of reach, perhaps in 2.7 a modular scheduler will
appear, which will allow the user to select the Nick+Con+Ingo
responsiveness scheduler, the default that is pretty good at everything,
or the 4kbit-affinity-mask NUMA-on-steroids solution.

I have faith that Linux will solve this one one way or the other,
probably both.
-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-08 19:50             ` bill davidsen
@ 2003-09-08 23:39               ` Peter Chubb
  0 siblings, 0 replies; 64+ messages in thread
From: Peter Chubb @ 2003-09-08 23:39 UTC (permalink / raw)
  To: bill davidsen; +Cc: linux-kernel

>>>>> "bill" == bill davidsen <davidsen@tmr.com> writes:

> In article <20030903153901.GB5769@work.bitmover.com>, Larry
> McVoy <lm@bitmover.com> wrote:

Larry> It's really easy to claim that scalability isn't the problem.
Larry> Scaling changes in general cause very minute differences, it's
Larry> just that there are a lot of them.  There is constant pressure
Larry> to scale further and people think it's cool.  You can argue
Larry> you all you want that scaling done right isn't a problem but
Larry> nobody has ever managed to do it right.  I know it's 
Larry> politically incorrect to say this group won't either but there
Larry> is no evidence that they will.

bill> I think that if the problem of a single scheduler which is
bill> "best" at everything proves out of reach, perhaps in 2.7 a
bill> modular scheduler will appear, which will allow the user to
bill> select the Nick+Con+Ingo responsiveness, or the default pretty
bill> good at everything, or the 4kbit affinity mask NUMA on steroids
bill> solution.

Well, as I see it, it's not processor but memory scalability that's the
problem right now.  Memories are getting larger (and, for NUMA systems,
sparser), and the current Linux solutions don't scale particularly
well --- especially when, for architectures like PPC or IA64, you
need two copies of the page tables in different formats: one for the
hardware to look up, and one for the OS.

I *do* think that pluggable schedulers are a good idea --- I'd like to
introduce something like the scheduler class mechanism that SVr4 has
(except that I've seen that code, and don't want to get sued by SCO)
to allow different processes to be in different classes in a cleaner
manner than the current FIFO-or-RR vs OTHER split.  We should be
able to introduce isochronous, gang, lottery or fairshare schedulers
(etc.) at runtime, and then tie processes, severally and individually,
to those schedulers, with a well-defined idea of what happens when
scheduler priorities overlap, and well-defined APIs to adjust
scheduler parameters.  However, this will require more major
infrastructure changes, and a better separation of the dispatcher from
the scheduler than in the current one-size-fits-all scheduler.
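
Roughly the shape I have in mind, as a sketch only (a hypothetical
interface, not SVr4 code and not anything in the current tree):

struct task_struct;

/* Hypothetical pluggable scheduler class: the dispatcher calls
 * through these hooks and never needs to know which policy
 * (isochronous, gang, lottery, fairshare, ...) it is running. */
struct sched_class {
        const char *name;
        void (*enqueue)(struct task_struct *p);
        void (*dequeue)(struct task_struct *p);
        struct task_struct *(*pick_next)(void);
        void (*tick)(struct task_struct *p);    /* periodic bookkeeping */
};

/* Classes would register at runtime, with overlap between classes
 * resolved by an explicit priority ordering rather than ad-hoc rules. */
int register_sched_class(struct sched_class *cls, int prio);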


--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
You are lost in a maze of BitKeeper repositories,   all slightly different.



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-08 13:38           ` Alan Cox
@ 2003-09-09  6:11             ` Rob Landley
  2003-09-09 16:07               ` Ricardo Bugalho
  0 siblings, 1 reply; 64+ messages in thread
From: Rob Landley @ 2003-09-09  6:11 UTC (permalink / raw)
  To: Alan Cox, Pavel Machek
  Cc: CaT, Larry McVoy, Anton Blanchard, Linux Kernel Mailing List

On Monday 08 September 2003 09:38, Alan Cox wrote:
> On Sad, 2003-09-06 at 16:08, Pavel Machek wrote:
> > Hi!
> >
> > > Maybe this is a better way to get my point across.  Think about more
> > > CPUs on the same memory subsystem.  I've been trying to make this
> > > scaling point
> >
> > The point of hyperthreading is that more virtual CPUs on the same memory
> > subsystem can actually help stuff.
>
> It's a way of exposing asynchronicity while keeping the old instruction set.
> It's trying to make better use of the available bandwidth by having
> something else to schedule into stalls. That's why HT is really good for
> code which is full of polling I/O and badly coded memory accesses, but is
> worthless on perfectly tuned, hand-coded stuff which doesn't stall.

<rant>

I wouldn't call it worthless.  "Proof of concept", maybe.

Modern processors (Athlon and P4 both, I believe) have three execution cores, 
and so are trying to dispatch three instructions per clock.  With 
speculation, lookahead, branch prediction, register renaming, instruction 
reordering, magic pixie dust, happy thoughts, a tailwind, and 8 zillion other 
related things, they can just about do it too, but not even close to 100% of 
the time.  Extracting three parallel instructions from one instruction stream 
is doable, but not fun, and not consistent.

The third core is unavoidably idle some of the time.  Trying to keep four 
cores busy would be a nightmare.  (All the VLIW guys keep trying to unload 
this on the compiler.  Don't ask me how a compiler is supposed to do branch 
prediction and speculative execution.  I suppose having to recompile your 
binaries for more cores isn't TOO big a problem these days, but the boxed 
mainstream desktop apps people wouldn't like it at all.)

Transistor budgets keep going up as manufacturing die sizes shrink, and the 
engineers keep wanting to throw transistors at the problem.  The first really 
easy way to turn transistors into performance is a bigger L1 cache, but 
somewhere between 256k and one megabyte per running process you hit some 
serious diminishing returns since your working set is in cache and your far 
accesses to big datasets (or streaming data) just aren't going to be helped 
by more L1 cache.

The other obvious way to turn transistors into performance is to build 
execution cores out of them.  (Yeah, you can also pipeline yourself to death 
to do less per clock for marketing reasons, but there are serious diminishing 
returns there too.)  With more execution cores, you can (theoretically) 
execute more instructions per clock.  Except that keeping 3 cores busy out of 
one instruction stream is really hard, and 4 would be a nightmare...

Hyperthreading is just a neat hack to keep multiple cores busy.  Having 
another point of execution to schedule instructions from means you're 
guaranteed to keep 1 core busy all the time for each point of execution 
(barring memory access latency on "branch to mars" conditions), and with 3 
cores and 2 pointes of execution they can fight over the middle core, which 
should just about never be idle when the system is loaded.

With hyperthreading (SMT, whatever you wanna call it), the move to 4 execution 
cores becomes a no-brainer (keeping 2 cores busy from one instruction 
stream is relatively trivial), and even 5 (since keeping 3 cores busy is a 
solved problem; the third isn't busy all the time, but the two threads can 
fight for the extra core when they actually have something for it to do...)

And THAT is where SMT starts showing real performance benefits, when you get 
to 4 or 5 cores.  It's cheaper than SMP on a die because they can share all 
sorts of hardware (not the least of which being L1 cache, and you can even 
expand L1 cache a bit because you now have the working sets of 2 processes to 
stick in it)...

Intel's been desperate for a way to make use of its transistor budget for a 
while; manufacturing is what it does better than AMD, not clever processor 
design.  The original Itanic, case in point, had more than 3 instruction 
execution cores in each chip: 3 VLIW, an HP PA-RISC, and a brain-damaged 
Pentium (which itself had a couple execution cores)...  The long list of 
reasons Itanic sucked started with the fact that it had 3 different modes and 
whichever one you were in, circuitry for the other 2 wouldn't contribute a 
darn thing to your performance (although it did not stop there, and in fact 
didn't even slow down...)

Of course since power is now the third variable along with price/performance, 
sooner or later you'll see chips that individually power down cores as they 
go dormant.  Possibly even a banked L1 cache; who knows?  (It's another 
alternative to clocking down the whole chip; power down individual functional 
units of the chip.  Dunno who might actually do that, or when, but it's nice 
to have options...)

</rant>

In brief: hyper threading is cool.

> Its great feature is that HT gets *more*, not less, useful as the CPU gets
> faster.

Execution point 1 stalls waiting for memory, so execution point 2 gets the 
extra cores.  The classic tale of overlapping processing and I/O, only this 
time with the memory bus being the slow device you have to wait for...

Rob



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-09  6:11             ` Rob Landley
@ 2003-09-09 16:07               ` Ricardo Bugalho
  2003-09-10  5:14                 ` Rob Landley
  0 siblings, 1 reply; 64+ messages in thread
From: Ricardo Bugalho @ 2003-09-09 16:07 UTC (permalink / raw)
  To: linux-kernel

On Tue, 09 Sep 2003 02:11:15 -0400, Rob Landley wrote:


> Modern processors (Athlon and P4 both, I believe) have three execution
> cores, and so are trying to dispatch three instructions per clock.  With

Neither of these CPUs is multi-core. They're just superscalar cores, that
is, they can dispatch multiple instructions in parallel. An example of a
multi-core CPU is the POWER4: there are two complete cores on the same
silicon die, sharing some cache levels and the memory bus.

BTW, Pentium [Pro,II,III] and Athlon are three-way in the sense that they
have three-way decoders that decode up to three x86 instructions into µOPs.
Pentium4 has a one-way decoder and a trace cache that stores decoded µOPs.
As a curiosity, AMD's K5 and K6 were 4-way.

> four cores busy would be a nightmare.  (All the VLIW guys keep trying to
> unload this on the compiler.  Don't ask me how a compiler is supposed to
> do branch prediction and speculative execution.  I suppose having to
> recompile your binaries for more cores isn't TOO big a problem these
> days, but the boxed mainstream desktop apps people wouldn't like it at
> all.)

In normal instruction sets, whatever CPUs do, from the software
perspective it MUST look like the CPU is executing one instruction at a
time. In VLIW, some forms of parallelism are exposed. For example, before
executing two instructions in parallel, non-VLIW CPUs have to check for
data dependencies. If they exist, those two instructions can't be executed
in parallel. VLIW instruction sets just define that instructions MUST be
grouped in sets of N instructions that can be executed in parallel, and
that if they can't be, the CPU will yield an exception or undefined
behaviour.
In a similar manner, there is the issue of available execution units and
exceptions.
The net result is that in-order VLIW CPUs are simpler to design than
in-order superscalar RISC CPUs, but I think it won't make much of a
difference for out-of-order CPUs. I've never seen a VLIW out-of-order
implementation.
VLIW ISAs are no different from others regarding branch prediction --
which is a problem for ALL pipelined implementations, superscalar or not.
Speculative execution is a feature of out-of-order implementations.
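
A toy illustration of the grouping rule, written as annotated C for a
hypothetical 2-wide VLIW (the bundle assignments in the comments are
the compiler's job; the hardware checks nothing):

int vliw_example(int x, int y)
{
        int a = x + 1;  /* bundle 0, slot 0 */
        int b = y + 2;  /* bundle 0, slot 1: provably independent of 'a' */
        int c = a + b;  /* bundle 1: RAW dependency on 'a' and 'b',
                           so it cannot share their bundle */
        return c;
}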


> Transistor budgets keep going up as manufacturing die sizes shrink, and
> the engineers keep wanting to throw transistors at the problem.  The
> first really easy way to turn transistors into performance is a bigger
> L1 cache, but somewhere between 256k and one megabyte per running
> process you hit some serious diminishing returns since your working set
> is in cache and your far accesses to big datasets (or streaming data)
> just aren't going to be helped by more L1 cache.

L1 caches are kept small so they can be fast.

> Hyperthreading is just a neat hack to keep multiple cores busy.  Having

SMT (Simultaneous Multi-Threading, aka Hyperthreading in Intel's marketing
term) is a neat hack to keep execution units within the same core busy.
And its a cheap hack when the CPUs are alread out-of-order. CMP
(Concurrent Multi-Processing) is a neat hack to keep expensive resources
like big L2/L3 caches and memory interfaces busy by placing multiple cores
on the same die.
CMP is simpler, but is only usefull for multi-thread performance. With
SMT, it makes sense to add more execution units that now, so it can also
help single-thread performance.


> Intel's been desperate for a way to make use of its transistor budget
> for a while; manufacturing is what it does better than AMD, not clever
> processor design.  The original Itanic, case in point, had more than 3
> instruction execution cores in each chip: 3 VLIW, an HP PA-RISC, and a
> brain-damaged Pentium (which itself had a couple execution cores)... The
> long list of reasons Itanic sucked started with the fact that it had 3
> different modes and whichever one you were in circuitry for the other 2
> wouldn't contribute a darn thing to your performance (although it did
> not stop there, and in fact didn't even slow down...)

Itanium doesn't have hardware support for PA-RISC emulation. The IA-64 ISA
has some similarities with PA-RISC to ease dynamic translation though.
But you're right: the IA-32 hardware emulation layer is not a Good Thing™.

-- 
	Ricardo


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-09 16:07               ` Ricardo Bugalho
@ 2003-09-10  5:14                 ` Rob Landley
  2003-09-10  5:45                   ` David Mosberger
  2003-09-10 10:10                   ` Ricardo Bugalho
  0 siblings, 2 replies; 64+ messages in thread
From: Rob Landley @ 2003-09-10  5:14 UTC (permalink / raw)
  To: Ricardo Bugalho, linux-kernel

On Tuesday 09 September 2003 12:07, Ricardo Bugalho wrote:
> On Tue, 09 Sep 2003 02:11:15 -0400, Rob Landley wrote:
> > Modern processors (Athlon and P4 both, I believe) have three execution
> > cores, and so are trying to dispatch three instructions per clock.  With
>
> Neither of these CPUs is multi-core. They're just superscalar cores, that
> is, they can dispatch multiple instructions in parallel. An example of a
> multi-core CPU is the POWER4: there are two complete cores on the same
> silicon die, sharing some cache levels and the memory bus.

Sorry, wrong terminology.  (I'm a software dude.)

"Instruction execution thingy".  (Well you didn't give it a name either. :)

> BTW, Pentium [Pro,II,III] and Athlon are three-way in the sense that they
> have three-way decoders that decode up to three x86 instructions into µOPs.
> Pentium4 has a one-way decoder and a trace cache that stores decoded
> µOPs.
> As a curiosity, AMD's K5 and K6 were 4-way.

I hadn't known that.  (I had known that the AMD guys I talked to around Austin 
had proven to themselves that 4 way was not a good idea in the real world, 
but I didn't know it had actually made it outside of the labs...)

> > four cores busy would be a nightmare.  (All the VLIW guys keep trying to
> > unload this on the compiler.  Don't ask me how a compiler is supposed to
> > do branch prediction and speculative execution.  I suppose having to
> > recompile your binaries for more cores isn't TOO big a problem these
> > days, but the boxed mainstream desktop apps people wouldn't like it at
> > all.)
>
> In normal instruction sets, whatever CPUs do, from the software
> perspective, it MUST look like the CPU is executing one instruction at a
> time.

Yup.

> In VLIW, some forms of parallelism are exposed.

I tend to think of it as "unloaded upon the compiler"...

> For example, before
> executing two instructions in parallel, non-VLIW CPUs have to check for
> data dependencies. If they exist, those two instructions can't be executed
> in parallel. VLIW instruction sets just define that instructions MUST be
> grouped in sets of N instructions that can be executed in parallel, and
> that if they can't be, the CPU will yield an exception or undefined
> behaviour.

Presumably this is the compiler's job, and the CPU can just have "undefined 
behavior" if fed impossible instruction mixes.  But yeah, throwing an 
exception would be the conscientious thing to do. :)

> In a similar manner, there is the issue of available execution units and
> exceptions.
> The net result is that in-order VLIW CPUs are simpler to design than
> in-order superscalar RISC CPUs, but I think it won't make much of a
> difference for out-of-order CPUs. I've never seen a VLIW out-of-order
> implementation.

I'm not sure what the point of out-of-order VLIW would be.  You just put extra 
pressure on the memory bus by tagging your instructions with grouping info, 
just to give you even LESS leeway about shuffling the groups at run-time...

> VLIW ISAs are no different from others regarding branch prediction --
> which is a problem for ALL pipelined implementations, superscalar or not.
> Speculative execution is a feature of out-of-order implementations.

Ah yes, predication.  Rather than having instruction execution thingies be 
idle, have them follow both branches and do work with a 100% chance of being 
thrown away.  And you wonder why the chips have heat problems... :)

> > Transistor budgets keep going up as manufacturing die sizes shrink, and
> > the engineers keep wanting to throw transistors at the problem.  The
> > first really easy way to turn transistors into performance is a bigger
> > L1 cache, but somewhere between 256k and one megabyte per running
> > process you hit some serious diminishing returns since your working set
> > is in cache and your far accesses to big datasets (or streaming data)
> > just aren't going to be helped by more L1 cache.
>
> L1 caches are kept small so they can be fast.

Sorry, I still refer to on-die L2 caches as L1.  Bad habit.  (As I said, I get 
the names wrong...)  "On die cache."  Right.

The point was, you can spend your transistor budget with big caches on the 
die, but there are diminishing returns.

> > Intel's been desperate for a way to make use of its transistor budget
> > for a while; manufacturing is what it does better than AMD< not clever
> > processor design.  The original Itanic, case in point, had more than 3
> > instruction execution cores in each chip: 3 VLIW, a HP-PA Risc, and a
> > brain-damaged Pentium (which itself had a couple execution cores)... The
> > long list of reasons Itanic sucked started with the fact that it had 3
> > different modes and whichever one you were in circuitry for the other 2
> > wouldn't contribute a darn thing to your performance (although it did
> > not stop there, and in fact didn't even slow down...)
>
> Itanium doesn't have hardware support for PA-RISC emulation.

I'm under the impression it used to be part of the design, circa 1997.  But I 
must admit that when discussing Itanium I'm not really prepared; I stopped 
paying much attention a year or so after the sucker had taped out but still 
had no silicon to play with, especially after HP and SGI revived their own 
chip designs due to the delay...

I only actually got to play with the original Itanium hardware once, and never 
got it out of the darn monitor that substituted for a BIOS.  The people who 
did get it running benchmarked it at about Pentium III 300 MHz levels, and it 
became a doorstop.  (These days, I've got a friend who's got an Itanium II 
evaluation system, but it's another doorstop and I'm not going to make him 
hook it up again just so I can go "yeah, I agree with you, it sucks"...)

> The IA-64 ISA
> has some similarities with PA-RISC to ease dynamic translation though.
> But you're right: the IA-32 hardware emulation layer is not a Good Thing™.

It's apparently going away.

http://news.com.com/2100-1006-997936.html?tag=nl

Rob

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-10  5:14                 ` Rob Landley
@ 2003-09-10  5:45                   ` David Mosberger
  2003-09-10 10:10                   ` Ricardo Bugalho
  1 sibling, 0 replies; 64+ messages in thread
From: David Mosberger @ 2003-09-10  5:45 UTC (permalink / raw)
  To: rob; +Cc: Ricardo Bugalho, linux-kernel

>>>>> On Wed, 10 Sep 2003 01:14:37 -0400, Rob Landley <rob@landley.net> said:

  Rob> (These days, I've got a friend who's got an Itanium II
  Rob> evaluation system, but it's another doorstop and I'm not going
  Rob> to make him hook it up again just so I can go "yeah, I agree
  Rob> with you, it sucks"...)

I'm sorry to hear that.  If you really do want to try out an Itanium 2
system, an easy way to go about it is to get an account at
http://testdrive.hp.com/ .  It's a quick and painless process and a
single account will give you access to all test-drive machines,
including various Linux Itanium machines (up to 4x 1.4GHz),
as shown here: http://testdrive.hp.com/current.shtml

	--david

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Scaling noise
  2003-09-10  5:14                 ` Rob Landley
  2003-09-10  5:45                   ` David Mosberger
@ 2003-09-10 10:10                   ` Ricardo Bugalho
  1 sibling, 0 replies; 64+ messages in thread
From: Ricardo Bugalho @ 2003-09-10 10:10 UTC (permalink / raw)
  To: rob; +Cc: linux-kernel

On Wed, 2003-09-10 at 06:14, Rob Landley wrote:
> I'm not sure what the point of out-of-order VLIW would be.  You just put extra 
> pressure on the memory bus by tagging your instructions with grouping info, 
> just to give you even LESS leeway about shuffling the groups at run-time...

The point is: simpler in-order implementations. In-order CPUs don't
reorder instructions at run-time, as the name suggests.

> > VLIW ISAs are no different from others regarding branch prediction --
> > which is a problem for ALL pipelined implementations, superscalar or not.
> > Speculative execution is a feature of out-of-order implementations.
> 
> Ah yes, predication.  Rather than having instruction execution thingies be 
> idle, have them follow both branches and do work with a 100% chance of being 
> thrown away.  And you wonder why the chips have heat problems... :)

You're confusing branch prediction with instruction predication.
Branch prediction is a design feature, needed for most pipelined CPUs.
Because they're pipelined, the CPU may not know whether or not to take
the branch when it's time to fetch the next instructions. So, instead of
stalling, it guesses. If it's wrong, it has to roll back.
Instruction predication is another form of conditional execution: each
instruction has a predicate (a register) and is only executed if the
predicate is true.
The bad thing is that these instructions take their slot in the
pipeline, even if the CPU knows, at the moment it fetches them, that
they'll never be executed.
The good sides are:
a) Unlike branches, it doesn't have a constant mispredict penalty, so
it's good for replacing "small" and unpredictable branches.
b) Instead of a control dependency (branches), predication is a data
dependency, so it gives compilers more freedom in scheduling.
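
A sketch of the trade-off in C (the predicated form is what a compiler
for such an ISA might emit; on x86 it would typically become a cmov):

/* Branchy form: pays a misprediction penalty whenever 'cond' is
 * unpredictable. */
int branchy(int cond, int a, int b)
{
        if (cond)
                return a;
        return b;
}

/* Branch-free form: both operands are evaluated and one result is
 * discarded.  On a predicated ISA the two assignments issue under
 * complementary predicates; the dependency on 'cond' becomes a data
 * dependency instead of a control dependency. */
int predicated(int cond, int a, int b)
{
        return cond ? a : b;
}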

> The point was, you can spend your transistor budget with big caches on the 
> die, but there are diminishing returns.

Depends on the workload..

-- 
	Ricardo


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Lock EVERYTHING (for testing) [was: Re: Scaling noise]
@ 2003-09-10 15:47 John Bradford
  2003-09-11 16:37 ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 64+ messages in thread
From: John Bradford @ 2003-09-10 15:47 UTC (permalink / raw)
  To: davem, miller; +Cc: anton, linux-kernel, lm, mbligh, phillips, piggin

> The analogy for Linux is this:  At a machine level, we add a check to 
> EVERY access.  The check is there to ensure that every memory access is 
> properly locked.  So, if some access is made where there isn't a proper 
> lock applied, then we can print a warning with the line number or drop 
> out into kdb or something of that sort.
>
> I'm betting there's another solution to this; otherwise, I wouldn't 
> suggest such an idea, because of the relative amount of work versus 
> benefit.  But it may require massive modifications to GCC to add this 
> code in at the machine level.

Couldn't Valgrind be modified to do this for the kernel?

http://developer.kde.org/~sewardj/

John.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Lock EVERYTHING (for testing) [was: Re: Scaling noise]
  2003-09-04  4:58                     ` David S. Miller
@ 2003-09-10 15:47                       ` Timothy Miller
  0 siblings, 0 replies; 64+ messages in thread
From: Timothy Miller @ 2003-09-10 15:47 UTC (permalink / raw)
  To: David S. Miller
  Cc: Larry McVoy, phillips, mbligh, piggin, anton, linux-kernel



David S. Miller wrote:

> 
> So again, if you're going to argue against huge SMP (at least to me),
> don't use the locking complexity argument.  Not only have we basically
> conquered it, we've along the way found some amazing ways to find
> locking bugs both at runtime and at compile time.  You can even debug
> them on uniprocessor systems.  And this doesn't even count the
> potential things we can do with Linus's sparse tool.


Pardon me for suggesting another idea for which I have no code written, 
but I was just wondering...

Is there a way we could get gcc to wrap EVERY memory access with some 
kind of debug lock?

Actually, I do have code, but for another application.  I designed a 
graphics drawing engine which has a FIFO for commands.  Before sending 
commands, you have to be sure there is enough free space in the FIFO, so 
there is a macro we use which tries to do this in an efficient way. 
Anyhow, there have been instances where we didn't check for enough space 
or didn't check for space at all, etc., and those bugs have sometimes 
been hard to find.

Two macros involved are CHECK_FIFO and WRITE_WORD.  Normally, CHECK_FIFO 
just checks for space, and WRITE_WORD just writes a word (it's more 
complicated than that, but never mind).  However, we have a second set 
of macros which check to make sure we're doing everything right.  The 
"check checker" macros have CHECK_FIFO set a counter and WRITE_WORD 
decrement that.  (Again, a bit more complex than that.)  If the counter 
ever goes below zero, we know we screwed up and exactly where.  Another 
thing we have is a way to indicate that we know we're doing something 
that looks like it may violate the normal way of things but really 
doesn't (for instance, sometimes, we write fewer words than we check 
for, and that is something we still print warnings about, but not in the 
cases where it's intentional).
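
In rough C terms -- purely illustrative, with the space-wait, the
write, and the reporting helper names made up for this sketch -- the
debug variants look something like:

	#ifdef FIFO_DEBUG
	static int fifo_words_left;	/* hypothetical debug counter */

	#define CHECK_FIFO(n) do {					\
		wait_for_fifo_space(n);	/* the normal space check */	\
		fifo_words_left = (n);					\
	} while (0)

	#define WRITE_WORD(w) do {					\
		if (--fifo_words_left < 0)				\
			report_fifo_bug(__FILE__, __LINE__);		\
		fifo_write(w);		/* the normal write */		\
	} while (0)
	#endif

Any WRITE_WORD that wasn't paid for by a preceding CHECK_FIFO drives
the counter negative and gets reported with an exact file and line.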

The analogy for Linux is this:  At a machine level, we add a check to 
EVERY access.  The check is there to ensure that every memory access is 
properly locked.  So, if some access is made where there isn't a proper 
lock applied, then we can print a warning with the line number or drop 
out into kdb or something of that sort.

I'm betting there's another solution to this; otherwise, I wouldn't 
suggest such an idea, because of the relative amount of work versus 
benefit.  But it may require massive modifications to GCC to add this 
code in at the machine level.
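
As a crude sketch of the kind of wrapper such GCC-emitted checks might
expand to -- assuming each shared datum has a known associated
spinlock, and leaning on the kernel's spin_is_locked(), which only
says the lock is held by *somebody*, so a real tool would also need
owner tracking:

	/* hypothetical instrumentation around every shared access */
	#define CHECKED_ACCESS(lock, expr) ({				\
		if (!spin_is_locked(&(lock)))				\
			printk(KERN_WARNING				\
			       "unlocked access at %s:%d\n",		\
			       __FILE__, __LINE__);			\
		(expr);							\
	})

	/* instead of a bare "count++", instrumented code would do: */
	CHECKED_ACCESS(count_lock, count++);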

Perhaps an even better solution would be to run an emulator.  Anyone 
know of a 686 emulator I can compile for intel?  The emulator could be 
modified to track locks and determine if any accesses are made without 
proper locks.

And another option that I could REALLY sink my teeth into.  If there was 
a 686 implementation in Verilog that I could run on an FPGA, it would be 
an order of magnitude slower than a real CPU, but still faster than an 
emulator.

One idea is to have something which can run the 686 ISA, fits in a 
Virtex 1000, and runs at maybe 66MHz.  We put that, with some adaptor 
board, into an old dual-processor PC that expects a Pentium Pro with a 
66MHz FSB.

That's probably overly ambitious, although I do do chip design for a 
living, so it's not entirely beyond the realm of possibility.

One problem is that we need to have metadata about memory accesses so we 
can track the difference between accesses which are to memory private to 
a CPU (no lock required) and accesses which are to shared memory (lock 
required) so we can determine what is a violation.  The FPGA daughter 
board would have to have its own RAM on it to track that.

And that leads me to another idea:  Reprogramming Transmeta processors 
to do all that.  :)



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Lock EVERYTHING (for testing) [was: Re: Scaling noise]
  2003-09-10 15:47 Lock EVERYTHING (for testing) [was: Re: Scaling noise] John Bradford
@ 2003-09-11 16:37 ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 64+ messages in thread
From: Jeremy Fitzhardinge @ 2003-09-11 16:37 UTC (permalink / raw)
  To: John Bradford
  Cc: davem, miller, anton, Linux Kernel List, lm, mbligh, phillips,
	piggin

On Wed, 2003-09-10 at 08:47, John Bradford wrote:
> > The analogy for Linux is this:  At a machine level, we add a check to 
> > EVERY access.  The check is there to ensure that every memory access is 
> > properly locked.  So, if some access is made where there isn't a proper 
> > lock applied, then we can print a warning with the line number or drop 
> > out into kdb or something of that sort.
> >
> > I'm betting there's another solution to this; otherwise, I wouldn't 
> > suggest such an idea, because of the relative amount of work versus 
> > benefit.  But it may require massive modifications to GCC to add this 
> > code in at the machine level.
> 
> Couldn't Valgrind be modified to do this for the kernel?
> 
> http://developer.kde.org/~sewardj/

I have a UML-under-Valgrind project on the backburner.  Valgrind has an
instrumentation mode which checks that every memory access is covered
by appropriate locks in an MT program.  I'm afraid it will generate a
lot of noise in the kernel though, since there's a lot of code which
does unlocked memory access (probably correctly).
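
(For the curious: that instrumentation mode is the helgrind tool, so a
user-space run is roughly "valgrind --tool=helgrind ./some_mt_program",
though the exact option spelling has varied between Valgrind releases.)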

	J


^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2003-09-11 16:37 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-09-03  4:03 Scaling noise Larry McVoy
2003-09-03  4:12 ` Roland Dreier
2003-09-03  4:20   ` Larry McVoy
2003-09-03 15:12   ` Martin J. Bligh
2003-09-03  4:18 ` Anton Blanchard
2003-09-03  4:29   ` Larry McVoy
2003-09-03  4:33     ` CaT
2003-09-03  5:08       ` Larry McVoy
2003-09-03  5:44         ` Mikael Abrahamsson
2003-09-03  6:12         ` Bernd Eckenfels
2003-09-03 12:09           ` Alan Cox
2003-09-03 15:10             ` Martin J. Bligh
2003-09-03 16:01               ` Jörn Engel
2003-09-03 16:21                 ` Martin J. Bligh
2003-09-03 19:41                   ` Mike Fedyk
2003-09-03 20:11                     ` Martin J. Bligh
2003-09-04 20:36               ` Rik van Riel
2003-09-04 20:47                 ` Martin J. Bligh
2003-09-04 21:30                 ` William Lee Irwin III
2003-09-03  8:11         ` Giuliano Pochini
2003-09-03 14:25         ` Steven Cole
2003-09-03 12:47           ` Antonio Vargas
2003-09-03 15:31             ` Steven Cole
2003-09-04  1:50               ` Daniel Phillips
2003-09-04  1:52                 ` Larry McVoy
2003-09-04  4:42                   ` David S. Miller
2003-09-08 19:40                     ` bill davidsen
2003-09-04  2:18                 ` William Lee Irwin III
2003-09-04  2:19                 ` Steven Cole
2003-09-04  2:35                   ` William Lee Irwin III
2003-09-04  2:40                     ` Steven Cole
2003-09-04  3:20                       ` Nick Piggin
2003-09-04  3:07                   ` Daniel Phillips
2003-09-08 19:27                 ` bill davidsen
2003-09-08 19:12           ` bill davidsen
2003-09-03 16:37         ` Kurt Wall
2003-09-06 15:08         ` Pavel Machek
2003-09-08 13:38           ` Alan Cox
2003-09-09  6:11             ` Rob Landley
2003-09-09 16:07               ` Ricardo Bugalho
2003-09-10  5:14                 ` Rob Landley
2003-09-10  5:45                   ` David Mosberger
2003-09-10 10:10                   ` Ricardo Bugalho
2003-09-03  6:28     ` Anton Blanchard
2003-09-03  6:55       ` Nick Piggin
2003-09-03 15:23         ` Martin J. Bligh
2003-09-03 15:39           ` Larry McVoy
2003-09-03 15:50             ` Martin J. Bligh
2003-09-04  0:49               ` Larry McVoy
2003-09-04  2:21                 ` Daniel Phillips
2003-09-04  2:35                   ` Martin J. Bligh
2003-09-04  2:46                   ` Larry McVoy
2003-09-04  4:58                     ` David S. Miller
2003-09-10 15:47                       ` Lock EVERYTHING (for testing) [was: Re: Scaling noise] Timothy Miller
2003-09-04  4:49             ` Scaling noise David S. Miller
2003-09-08 19:50             ` bill davidsen
2003-09-08 23:39               ` Peter Chubb
2003-09-03 17:16           ` William Lee Irwin III
2003-09-03 15:51         ` UP Regression (was) " Cliff White
2003-09-03 17:21           ` William Lee Irwin III
2003-09-03 18:53             ` Cliff White
2003-09-04  0:54           ` Nick Piggin
  -- strict thread matches above, loose matches on Subject: below --
2003-09-10 15:47 Lock EVERYTHING (for testing) [was: Re: Scaling noise] John Bradford
2003-09-11 16:37 ` Jeremy Fitzhardinge

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox