RE: SMP/cc Cluster description

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* RE: SMP/cc Cluster description
@ 2001-12-07 19:14 Dana Lacoste
  2001-12-07 19:28 ` Larry McVoy
  0 siblings, 1 reply; 75+ messages in thread
From: Dana Lacoste @ 2001-12-07 19:14 UTC (permalink / raw)
  To: 'Larry McVoy', Martin J. Bligh; +Cc: linux-kernel

Man you guys are NUTS.

But this is a fun conversation so I'm going to join in.

> Did you even consider that this is virtually identical to the problem
> that a network of workstations or servers has?  Did it occur 
> to you that
> people have solved this problem in many different ways?  Or 
> did you just
> want to piss into the wind and enjoy the spray?

I may be a total tool here, but this question is really bugging me :

What, if any, advantages does your proposal have over (say) a Beowulf
cluster?  Why does having the cluster in one box seem a better solution
than having a Beowulf type cluster with a shared Network filesystem?

You've declared everything to be separate, so that I can't see
what's not separate any more :)

Is it just an issue of shared memory?  You want to be able to share
memory between processes on separate systems at high speed?  Why
not Myrinet then?  Yeah, it's slower, but the order of magnitude
reduction in cost compared to a 64 way SMP box makes this a trivial
decision in my books....

Or am I missing something really obvious here????

Dana Lacoste
Embedded Linux Developer (The OPPOSITE side of the scale)
Ottawa, Canada

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-07 19:14 SMP/cc Cluster description Dana Lacoste
@ 2001-12-07 19:28 ` Larry McVoy
  0 siblings, 0 replies; 75+ messages in thread
From: Larry McVoy @ 2001-12-07 19:28 UTC (permalink / raw)
  To: Dana Lacoste; +Cc: 'Larry McVoy', Martin J. Bligh, linux-kernel

On Fri, Dec 07, 2001 at 11:14:13AM -0800, Dana Lacoste wrote:
> Man you guys are NUTS.

I resemble that remark :-)

> > Did you even consider that this is virtually identical to the problem
> > that a network of workstations or servers has?  Did it occur 
> > to you that
> > people have solved this problem in many different ways?  Or 
> > did you just
> > want to piss into the wind and enjoy the spray?
> 
> I may be a total tool here, but this question is really bugging me :
> 
> What, if any, advantages does your proposal have over (say) a Beowulf
> cluster?  Why does having the cluster in one box seem a better solution
> than having a Beowulf type cluster with a shared Network filesystem?

Because I can mmap the same data across cluster nodes and get at it using
hardware, so a cache miss is a cache miss regardless of which node I'm
on, and it takes ~200 nanoseconds.  With a network based cluster, those
times go up about a factor of 10,000 or so.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* RE: SMP/cc Cluster description
@ 2001-12-10 15:59 cardente, john
  0 siblings, 0 replies; 75+ messages in thread
From: cardente, john @ 2001-12-10 15:59 UTC (permalink / raw)
  To: 'Jeff V. Merkey '
  Cc: 'David S. Miller ', 'lm@bitmover.com ',
	'davidel@xmailserver.org ',
	'rusty@rustcorp.com.au ',
	'Martin.Bligh@us.ibm.com ',
	'riel@conectiva.com.br ', 'lars.spam@nocrew.org ',
	'alan@lxorguk.ukuu.org.uk ', 'hps@intermeta.de ',
	'linux-kernel@vger.kernel.org ',
	'jmerkey@timpanogas.org '

>I know what the PCI cards do.  I was the person who pushed
>marty Albert, the Chairman of the Dolphin Board at the time in 
>1995 to pursue design work on them.  I also worked with Justin 
>Rattner (I saw one of your early prototype boxes in 1996 in his labs).  

Ahh, sometimes it's hard to gauge "understanding" on this list  ;-)
Good idea BTW. For a while we looked into using those cards
to implement a non-cc NUMA cluster system. That was a while
ago, however, and I've managed to forget most of the details. Also,
with the assimilation of DG into EMC I've tossed most of my dolphin
specs.

>Those stubs were aweful short for the lost slot in your 
>system, and I am surprised you did not get signal skew.  Those
>stubs had to be 1.5 inches long :-).

Yes, I spent many hours in the lab hunting for signal integrity
issues. As you may guess it was not always easy being a third
party agent on an intel bus...

>Wrong.  There is a small window where you can copy into a 
>remote nodes memory.

As I said above I tossed by P2B spec so I cant refresh my memory
on this. Did this work like reflective memory or do you scribble
on a piece of memory and then poke the card to send to another node?
Its my guess that the former prohibits the memory being cacheable
while the latter relies on compliant SW and therefore doesnt afford
transparent cross node memory references. Are either of these right?

>It's OK.  We love DG and your support of SCI.  Keep up the good 
>work.

Wish that I was but sadly I'm not. DG was my first job after grad school
and cutting my teeth on the ccNUMA stuff was simply an outstanding
experience.
Those were good days....

Thanks for the reply...
john

ps. I got two of the older PCI cards sitting in my desk drawer.
Now you've got me considering pulling those guys out and having
some fun!!!

^ permalink raw reply	[flat|nested] 75+ messages in thread

* RE: SMP/cc Cluster description
@ 2001-12-06 22:20 cardente, john
  2001-12-06 23:00 ` Jeff V. Merkey
  0 siblings, 1 reply; 75+ messages in thread
From: cardente, john @ 2001-12-06 22:20 UTC (permalink / raw)
  To: 'Jeff V. Merkey', David S. Miller
  Cc: lm, davidel, rusty, Martin.Bligh, riel, lars.spam, alan, hps,
	linux-kernel, jmerkey

Hi Jeff,

I was one of the primary SCI guys at DG for all of
their Intel based ccNUMA machines. I worked with
Dolphin closely on a variety of things for those
systems including micro-coding a modified/optimized
version of their SCI implementation as well as 
architecting and implementing changes to their SCI
coherency ASIC for the third (last) DG ccNUMA system.
Beyond that I was the primary coherency protocol 
person for the project and was responsible for making
sure we played nice with Intel's coherency protocol.

Getting to the point I saw your post below and I thought
there might be some confusion between what the DG boxes
did and what those PCI cards do. In the DG system we
implemented ASIC's that sat on the processor bus which
examined every memory reference to maintain system wide
coherency. These evaluations were done for every bus
transaction at a cache line granularity. These chips
acted as bridges that enforced coherency between the SMP local
snoopy bus protocol and the SCI protocol used system
wide. The essential point here is that only by being
apart of the coherency protocol on the processor bus 
were those chips able to implement ccNUMA at a cacheline 
level coherency.

The Dolphin PCI cards, however, cannot perform the same
function due to the fact that the PCI bus is outside of the
Intel coherency domain. Therefore it lacks the visiblity
and control to enforce coherency. Instead, those cards 
only allowed for the explicit sending of messages across 
SCI for use with clustering libraries like MPI. One could
use this kind of messaging protocol to implement explicit
coherency (as you noted) but the sharing granularity of
such a system is at the page level, not cache line. There
have been many efforts to implement this kind of system
and (if I recall correctly) they usually go under the
name of Shared Virtual Memory systems.

Anyway, there were two reasons for the post. First, if I've
been following the thread correctly most of the discussion
up to this point has involed issues at the cacheline level
and dont apply to a system built from Dolphin PCi cards.
Nor can one build such a system from those cards and
I felt compelled to clear up any potential confusion. My
second, prideful, reason was to justify the cost of those
DG machines!!! (and NUMA-Q's as they were very similar in
architecture).

take care, and please disregard if I misunderstood your
post or the thread...

john

-----Original Message-----
From: Jeff V. Merkey [mailto:jmerkey@vger.timpanogas.org]
Sent: Thursday, December 06, 2001 1:38 PM
To: David S. Miller
Cc: lm@bitmover.com; davidel@xmailserver.org; rusty@rustcorp.com.au;
Martin.Bligh@us.ibm.com; riel@conectiva.com.br; lars.spam@nocrew.org;
alan@lxorguk.ukuu.org.uk; hps@intermeta.de;
linux-kernel@vger.kernel.org; jmerkey@timpanogas.org
Subject: Re: SMP/cc Cluster description

On Thu, Dec 06, 2001 at 11:27:31AM -0700, Jeff V. Merkey wrote:

And also, if you download the SCI drivers in my area, and order
some SCI adapters from Dolphin in Albquerque, NM, you can set up 
a ccNUMA system with standard PCs.  Dolphin has 66Mhz versions (and
a 133Mhz coming in the future) that run at almost a gigabyte per 
second node-2-node over a parallel fabric.  The cross-sectional
SCI fabric bandwidth scales at (O)(2N) as you add nodes.  

If you want to play around with ccNUMA with Standard PCs, these 
cards are relatively inepxensive, and allow you to setup some 
powerful cc/SMP systems with explicit coherence.  The full 
ccNUMA boxes from DG are expensive, however.  That way, instead
of everyone talking about it, you guys could get some cool 
hardware and experiment with some of your rather forward 
looking and interesting ideas.

:-)

Jeff

> 
> 
> Guys,
> 
> I am the maintaner of SCI, the ccNUMA technology standard.  I know
> alot about this stuff, and have been involved with SCI since 
> 1994.  I work with it every day and the Dolphin guys on some huge 
> supercomputer accounts, like Los Alamos and Sandia Labs in NM.  
> I will tell you this from what I know.
> 
> A shared everything approach is a programmers dream come true,
> but you can forget getting reasonable fault tolerance with it.  The 
> shared memory zealots want everyone to believe ccNUMA is better 
> than sex, but it does not scale when compared to Shared-Nothing
> programming models.  There's also a lot of tough issues for dealing 
> with failed nodes, and how you recover when peoples memory is 
> all over the place across a nuch of machines.  
> 
> SCI scales better in ccNUMA and all NUMA technoogies scale very 
> well when they are used with "Explicit Coherence" instead of 
> "Implicit Coherence" which is what you get with SMP systems.  
> Years of research by Dr. Justin Rattner at Intel's High 
> performance labs demonstrated that shared nothing models scaled
> into the thousands of nodes, while all these shared everything
> "Super SMP" approaches hit the wall at 64 processors generally.
> 
> SCI is the fastest shared nothing interface out there, and it also
> can do ccNUMA.  Sequent, Sun, DG and a host of other NUMA providers
> use Dolphin's SCI technology and have for years.   ccNUMA is useful 
> for applications that still assume a shared nothing approach but that
> use the ccNUMA and NUMA capabilities for better optimization.
> 
> Forget trying to recreate the COMA architecture of Kendall-Square.  
> The name was truly descriptive of what happened in this architecture
> when a node fails -- goes into a "COMA".  This whole discussion I have
> lived through before and you will find that ccNUMA is virtually 
> unimplementable on most general purpose OSs.  And yes, there are 
> a lot of products and software out there, but when you look under 
> the cover (like ServerNet) you discover their coherence models 
> for the most part relay on push/pull explicit coherence models.
> 
> My 2 cents.
> 
> Jeff 
> 
> 
> 
> On Thu, Dec 06, 2001 at 12:09:32AM -0800, David S. Miller wrote:
> >    From: Larry McVoy <lm@bitmover.com>
> >    Date: Thu, 6 Dec 2001 00:02:16 -0800
> >    
> >    Err, Dave, that's *exactly* the point of the ccCluster stuff.  You
get
> >    all that seperation for every data structure for free.  Think about
> >    it a bit.  Aren't you going to feel a little bit stupid if you do all
> >    this work, one object at a time, and someone can come along and do
the
> >    whole OS in one swoop?  Yeah, I'm spouting crap, it isn't that easy,
> >    but it is much easier than the route you are taking.  
> > 
> > How does ccClusters avoid the file system namespace locking issues?
> > How do all the OS nodes see a consistent FS tree?
> > 
> > All the talk is about the "magic filesystem, thread it as much as you
> > want" and I'm telling you that is the fundamental problem, the
> > filesystem name space locking.
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 22:20 cardente, john
@ 2001-12-06 23:00 ` Jeff V. Merkey
  0 siblings, 0 replies; 75+ messages in thread
From: Jeff V. Merkey @ 2001-12-06 23:00 UTC (permalink / raw)
  To: cardente, john
  Cc: David S. Miller, lm, davidel, rusty, Martin.Bligh, riel,
	lars.spam, alan, hps, linux-kernel, jmerkey

On Thu, Dec 06, 2001 at 05:20:22PM -0500, cardente, john wrote:
> Hi Jeff,
> 
> I was one of the primary SCI guys at DG for all of
> their Intel based ccNUMA machines. I worked with
> Dolphin closely on a variety of things for those
> systems including micro-coding a modified/optimized
> version of their SCI implementation as well as 
> architecting and implementing changes to their SCI
> coherency ASIC for the third (last) DG ccNUMA system.
> Beyond that I was the primary coherency protocol 
> person for the project and was responsible for making
> sure we played nice with Intel's coherency protocol.

I know what the PCI cards do.  I was the person who pushed
marty Albert, the Chairman of the Dolphin Board at the time in 
1995 to pursue design work on them.  I also worked with Justin 
Rattner (I saw one of your early prototype boxes in 1996 in his labs).  

> 
> Getting to the point I saw your post below and I thought
> there might be some confusion between what the DG boxes
> did and what those PCI cards do. In the DG system we
> implemented ASIC's that sat on the processor bus which
> examined every memory reference to maintain system wide
> coherency. These evaluations were done for every bus
> transaction at a cache line granularity. These chips
> acted as bridges that enforced coherency between the SMP local
> snoopy bus protocol and the SCI protocol used system
> wide. The essential point here is that only by being
> apart of the coherency protocol on the processor bus 
> were those chips able to implement ccNUMA at a cacheline 
> level coherency.
> 

There was a Pentium Pro adapter that plugged into the memory 
bus Dolphin was working on.  Intel backed away from this 
design since they had little interest in opening up their 
memory bus architecture.  DG was the only exception, and if
you remember, you guys had to spin your own motherboards.  
Those stubs were aweful short for the lost slot in your 
system, and I am surprised you did not get signal skew.  Those
stubs had to be 1.5 inches long :-).

> 
> The Dolphin PCI cards, however, cannot perform the same
> function due to the fact that the PCI bus is outside of the
> Intel coherency domain. Therefore it lacks the visiblity
> and control to enforce coherency. Instead, those cards 
> only allowed for the explicit sending of messages across 
> SCI for use with clustering libraries like MPI. One could
> use this kind of messaging protocol to implement explicit
> coherency (as you noted) but the sharing granularity of
> such a system is at the page level, not cache line. There
> have been many efforts to implement this kind of system
> and (if I recall correctly) they usually go under the
> name of Shared Virtual Memory systems.

Wrong.  There is a small window where you can copy into a 
remote nodes memory.

> 
> 
> Anyway, there were two reasons for the post. First, if I've
> been following the thread correctly most of the discussion
> up to this point has involed issues at the cacheline level
> and dont apply to a system built from Dolphin PCi cards.
> Nor can one build such a system from those cards and
> I felt compelled to clear up any potential confusion. My
> second, prideful, reason was to justify the cost of those
> DG machines!!! (and NUMA-Q's as they were very similar in
> architecture).
> 
> take care, and please disregard if I misunderstood your
> post or the thread...

It's OK.  We love DG and your support of SCI.  Keep up the good 
work.

Jeff


> 
> john
> 
> 
> -----Original Message-----
> From: Jeff V. Merkey [mailto:jmerkey@vger.timpanogas.org]
> Sent: Thursday, December 06, 2001 1:38 PM
> To: David S. Miller
> Cc: lm@bitmover.com; davidel@xmailserver.org; rusty@rustcorp.com.au;
> Martin.Bligh@us.ibm.com; riel@conectiva.com.br; lars.spam@nocrew.org;
> alan@lxorguk.ukuu.org.uk; hps@intermeta.de;
> linux-kernel@vger.kernel.org; jmerkey@timpanogas.org
> Subject: Re: SMP/cc Cluster description
> 
> 
> On Thu, Dec 06, 2001 at 11:27:31AM -0700, Jeff V. Merkey wrote:
> 
> And also, if you download the SCI drivers in my area, and order
> some SCI adapters from Dolphin in Albquerque, NM, you can set up 
> a ccNUMA system with standard PCs.  Dolphin has 66Mhz versions (and
> a 133Mhz coming in the future) that run at almost a gigabyte per 
> second node-2-node over a parallel fabric.  The cross-sectional
> SCI fabric bandwidth scales at (O)(2N) as you add nodes.  
> 
> If you want to play around with ccNUMA with Standard PCs, these 
> cards are relatively inepxensive, and allow you to setup some 
> powerful cc/SMP systems with explicit coherence.  The full 
> ccNUMA boxes from DG are expensive, however.  That way, instead
> of everyone talking about it, you guys could get some cool 
> hardware and experiment with some of your rather forward 
> looking and interesting ideas.
> 
> :-)
> 
> Jeff
> 
> 
> 
> > 
> > 
> > Guys,
> > 
> > I am the maintaner of SCI, the ccNUMA technology standard.  I know
> > alot about this stuff, and have been involved with SCI since 
> > 1994.  I work with it every day and the Dolphin guys on some huge 
> > supercomputer accounts, like Los Alamos and Sandia Labs in NM.  
> > I will tell you this from what I know.
> > 
> > A shared everything approach is a programmers dream come true,
> > but you can forget getting reasonable fault tolerance with it.  The 
> > shared memory zealots want everyone to believe ccNUMA is better 
> > than sex, but it does not scale when compared to Shared-Nothing
> > programming models.  There's also a lot of tough issues for dealing 
> > with failed nodes, and how you recover when peoples memory is 
> > all over the place across a nuch of machines.  
> > 
> > SCI scales better in ccNUMA and all NUMA technoogies scale very 
> > well when they are used with "Explicit Coherence" instead of 
> > "Implicit Coherence" which is what you get with SMP systems.  
> > Years of research by Dr. Justin Rattner at Intel's High 
> > performance labs demonstrated that shared nothing models scaled
> > into the thousands of nodes, while all these shared everything
> > "Super SMP" approaches hit the wall at 64 processors generally.
> > 
> > SCI is the fastest shared nothing interface out there, and it also
> > can do ccNUMA.  Sequent, Sun, DG and a host of other NUMA providers
> > use Dolphin's SCI technology and have for years.   ccNUMA is useful 
> > for applications that still assume a shared nothing approach but that
> > use the ccNUMA and NUMA capabilities for better optimization.
> > 
> > Forget trying to recreate the COMA architecture of Kendall-Square.  
> > The name was truly descriptive of what happened in this architecture
> > when a node fails -- goes into a "COMA".  This whole discussion I have
> > lived through before and you will find that ccNUMA is virtually 
> > unimplementable on most general purpose OSs.  And yes, there are 
> > a lot of products and software out there, but when you look under 
> > the cover (like ServerNet) you discover their coherence models 
> > for the most part relay on push/pull explicit coherence models.
> > 
> > My 2 cents.
> > 
> > Jeff 
> > 
> > 
> > 
> > On Thu, Dec 06, 2001 at 12:09:32AM -0800, David S. Miller wrote:
> > >    From: Larry McVoy <lm@bitmover.com>
> > >    Date: Thu, 6 Dec 2001 00:02:16 -0800
> > >    
> > >    Err, Dave, that's *exactly* the point of the ccCluster stuff.  You
> get
> > >    all that seperation for every data structure for free.  Think about
> > >    it a bit.  Aren't you going to feel a little bit stupid if you do all
> > >    this work, one object at a time, and someone can come along and do
> the
> > >    whole OS in one swoop?  Yeah, I'm spouting crap, it isn't that easy,
> > >    but it is much easier than the route you are taking.  
> > > 
> > > How does ccClusters avoid the file system namespace locking issues?
> > > How do all the OS nodes see a consistent FS tree?
> > > 
> > > All the talk is about the "magic filesystem, thread it as much as you
> > > want" and I'm telling you that is the fundamental problem, the
> > > filesystem name space locking.
> > > -
> > > To unsubscribe from this list: send the line "unsubscribe linux-kernel"
> in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > Please read the FAQ at  http://www.tux.org/lkml/
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Linux/Pro [was Re: Coding style - a non-issue]
@ 2001-12-04 23:31 Rik van Riel
  2001-12-04 23:37 ` Martin J. Bligh
  0 siblings, 1 reply; 75+ messages in thread
From: Rik van Riel @ 2001-12-04 23:31 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Lars Brinkhoff, Alan Cox, Larry McVoy, hps, linux-kernel

On Tue, 4 Dec 2001, Martin J. Bligh wrote:

> > Premise 3: it is far easier to take a bunch of operating system images
> >    and make them share the parts they need to share (i.e., the page
> >    cache), than to take a single image and pry it apart so that it
> >    runs well on N processors.
>
> Of course it's easier. But it seems like you're left with much more
> work to reiterate in each application you write to run on this thing.
> Do you want to do the work once in the kernel, or repeatedly in each
> application?

There seems to be a little misunderstanding here; from what
I gathered when talking to Larry, the idea behind ccClusters
is that they provide a single system image in a NUMA box, but
with separated operating system kernels.

Of course, this is close to what a "single" NUMA kernel often
ends up doing with much ugliness, so I think Larry's idea to
construct NUMA OSes by putting individual kernels of nodes to
work together makes a lot of sense.

regards,

Rik
-- 
Shortwave goes a long way:  irc.starchat.net  #swl

http://www.surriel.com/		http://distro.conectiva.com/

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Linux/Pro [was Re: Coding style - a non-issue]
  2001-12-04 23:31 Linux/Pro [was Re: Coding style - a non-issue] Rik van Riel
@ 2001-12-04 23:37 ` Martin J. Bligh
  2001-12-05  0:36   ` SMP/cc Cluster description [was Linux/Pro] Larry McVoy
  0 siblings, 1 reply; 75+ messages in thread
From: Martin J. Bligh @ 2001-12-04 23:37 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Lars Brinkhoff, Alan Cox, Larry McVoy, hps, linux-kernel

>> > Premise 3: it is far easier to take a bunch of operating system images
>> >    and make them share the parts they need to share (i.e., the page
>> >    cache), than to take a single image and pry it apart so that it
>> >    runs well on N processors.
>> 
>> Of course it's easier. But it seems like you're left with much more
>> work to reiterate in each application you write to run on this thing.
>> Do you want to do the work once in the kernel, or repeatedly in each
>> application?
> 
> There seems to be a little misunderstanding here; from what
> I gathered when talking to Larry, the idea behind ccClusters
> is that they provide a single system image in a NUMA box, but
> with separated operating system kernels.

OK, then I've partially misunderstood this ... can people provide some 
more reference material? Please email to me, and I'll collate the results
back to the list (should save some traffic).

I have already the following:

http://www.bitmover.com/talks/cliq/slide01.html 
http://www.uwsg.indiana.edu/hypermail/linux/kernel/0001.2/1172.html 
http://www.uwsg.indiana.edu/hypermail/linux/kernel/0001.3/0236.html 
http://www.uwsg.indiana.edu/hypermail/linux/kernel/0007.3/1222.html 

Thanks,

Martin.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* SMP/cc Cluster description [was Linux/Pro]
  2001-12-04 23:37 ` Martin J. Bligh
@ 2001-12-05  0:36   ` Larry McVoy
  2001-12-05  2:36     ` SMP/cc Cluster description David S. Miller
  2001-12-05  3:17     ` Stephen Satchell
  0 siblings, 2 replies; 75+ messages in thread
From: Larry McVoy @ 2001-12-05  0:36 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Rik van Riel, Lars Brinkhoff, Alan Cox, Larry McVoy, hps,
	linux-kernel

On Tue, Dec 04, 2001 at 03:37:37PM -0800, Martin J. Bligh wrote:
> >> > Premise 3: it is far easier to take a bunch of operating system images
> >> >    and make them share the parts they need to share (i.e., the page
> >> >    cache), than to take a single image and pry it apart so that it
> >> >    runs well on N processors.
> >> 
> >> Of course it's easier. But it seems like you're left with much more
> >> work to reiterate in each application you write to run on this thing.
> >> Do you want to do the work once in the kernel, or repeatedly in each
> >> application?
> > 
> > There seems to be a little misunderstanding here; from what
> > I gathered when talking to Larry, the idea behind ccClusters
> > is that they provide a single system image in a NUMA box, but
> > with separated operating system kernels.

Right except NUMA is orthogonal, ccClusters work fine on a regular SMP 
box.

> OK, then I've partially misunderstood this ... can people provide some 
> more reference material? Please email to me, and I'll collate the results
> back to the list (should save some traffic).

I'll try and type in a small explanation, I apologize in advance for the
bervity, I'm under a lot of pressure on the BK front these days...

The most recent set of slides are here:

    http://www.bitmover.com/ml/slide01.html

A couple of useful papers are at

    http://www.bitmover.com/llnl/smp.pdf
    http://www.bitmover.com/llnl/labs.pdf

The first explains why I think fine grained multi threading is a mistake
and the second is a paper I wrote to try and get LLNL to push for what
I called SMP clusters (which are not a cluster of SMPs, they are a 
cluster of operating system instances on a single SMP).

The basic idea is this: if you consider the usefulness of an SMP versus a
cluster, the main thing in favor of the SMP is

    all processes/processors can share the same memory at memory speeds.
    I typically describe this as "all processes can mmap the same data".
    A cluster loses here, even if it provides DSM over a high speed
    link, it isn't going to have 200 ns caches misses, it's orders of
    magnitude slower.  For a lot of MPI apps that doesn't matter, but
    there are apps for which high performance shared memory is required.

There are other issues like having a big fast bus, load balancing, etc.,
but the main thing is that you can share data quickly and coherently.
If you don't need that performance/coherency and you can afford to 
replicate the data, a traditional cluster is a *much* cheaper and 
easier answer.  Many problems, such as web server farms, are better
done on Beowulf style clusters than an SMP, they will actually scale
better.

OK, so suppose we focus on the SMP problem space.  It's a requirement
that all the processes on all the processors need to be able to access
memory coherently.  DSM and/or MPI isn't an answer for this problem 
space.

The traditional way to use an SMP is to take a single OS image and 
"thread" it such that all the CPUs can be in the OS at the same time.
Pretty much all the data structures need to get a lock and each CPU
takes the lock before it uses the data structure.  The limit of the
ratio of locks to cache lines is 1:1, i.e., each cache line will need
a lock in order to get 100% of the scaling on the system (yes, I know
this isn't quite true but it is close and you get the idea).

Go read the "smp.pdf" paper for my reasons on why this is a bad approach,
I'll assume for now you are willing to agree that it is for the purposes
of discussion.

If we want to get the most use out of big SMP boxes but we also want to
do the least amount of "damage" in the form of threading complexity in
the source base.  This is a "have your cake and eat it too" goal, one
that I think is eminently reachable.

So how I propose we do this is by booting multiple Linux images on
a single box.  Each OS image owns part of the machine, 1-4 CPUs, 0 or
more devices such as disk, ethernet, etc., part of memory.  In addition,
all OS images share, as a page cache, part of main memory, typically
the bulk of main memory.

The first thing to understand that the *only* way to share data is in
memory, in the globally shared page cache.  You do not share devices,
devices are proxied.  So if I want data from your disk or file system,
I ask you to put it in memory and then I mmap it.  In fact, you really
only share files and you only share them via mmap (yeah, read and write
as well but that's the uninteresting case).

This sharing gets complex because now we have more than one OS image
which is managing the same set of pages.  One could argue that the 
code complexity is just as bad as a fine grained multi threaded OS
image but that's simply incorrect.  I would hide almost 100% of this
code in a file system, with some generic changes (as few as possible)
in the VM system.  There are some changes in the process layer as well,
but we'll talk about them later.

If you're sitting here thinking about all the complexity involved in
sharing pages, it is really helpful to think about this in the following
way (note you would not actually implement it like this in the long
run but you could start this way):

Imagine that for any given file system there is one server OS image and N
client os images.  Imagine that for each client, there is a proxy process
running on behalf of the client on the server.  Sort of like NFS biods.
Each time the client OS wants to do an mmap() it asks the proxy to do
the mmap().  There are some corner cases but if you think about it, by
having the proxies do the mmaps, we *know* that all the server OS data
structures are correct.  As far as the server is concerned, the remote
OS clients are no different than the local proxy process.  This is from
the correctness point of view, not the performance point of view.

OK, so we've handled setting up the page tables, but we haven't handled
page faults or pageouts.  Let's punt on pageouts for the time being,
we can come back to that.  Let's figure out a pagefault path that will
give correct, albeit slow, behaviour.  Suppose that when the client faults
on a page, the client side file system sends a pagefault message to the
proxy, the proxy faults in the page, calls a new vtop() system call to
get the physical page, and passes that page descriptor back to the client
side.  The client side loads up the TLB & page tables and away we go.
Whoops, no we don't, because the remote OS could page out the page and
the client OS will get the wrong data (think about a TLB shootdown that
_didn't_ happen when it should have; bad bad bad).  Again, thinking 
just from the correctness point of view, suppose the proxy mlock()ed
the page into memory.  Now we know it is OK to load it up and use it.
This is why I said skip pageout for now, we're not going to do them 
to start with anyway.

OK, so start throwing stones at this.  Once we have a memory model that
works, I'll go through the process model.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-05  0:36   ` SMP/cc Cluster description [was Linux/Pro] Larry McVoy
@ 2001-12-05  2:36     ` David S. Miller
  2001-12-05  3:23       ` Larry McVoy
  2001-12-05  3:25       ` Davide Libenzi
  2001-12-05  3:17     ` Stephen Satchell
  1 sibling, 2 replies; 75+ messages in thread
From: David S. Miller @ 2001-12-05  2:36 UTC (permalink / raw)
  To: lm; +Cc: Martin.Bligh, riel, lars.spam, alan, hps, linux-kernel

   From: Larry McVoy <lm@bitmover.com>
   Date: Tue, 4 Dec 2001 16:36:46 -0800

   OK, so start throwing stones at this.  Once we have a memory model that
   works, I'll go through the process model.

What is the difference between your messages and spin locks?
Both seem to shuffle between cpus anytime anything interesting
happens.

In the spinlock case, I can thread out the locks in the page cache
hash table so that the shuffling is reduced.  In the message case, I
always have to talk to someone.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-05  2:36     ` SMP/cc Cluster description David S. Miller
@ 2001-12-05  3:23       ` Larry McVoy
  2001-12-05  6:05         ` David S. Miller
  2001-12-05  8:12         ` Momchil Velikov
  2001-12-05  3:25       ` Davide Libenzi
  1 sibling, 2 replies; 75+ messages in thread
From: Larry McVoy @ 2001-12-05  3:23 UTC (permalink / raw)
  To: David S. Miller
  Cc: lm, Martin.Bligh, riel, lars.spam, alan, hps, linux-kernel

On Tue, Dec 04, 2001 at 06:36:01PM -0800, David S. Miller wrote:
>    From: Larry McVoy <lm@bitmover.com>
>    Date: Tue, 4 Dec 2001 16:36:46 -0800
>    
>    OK, so start throwing stones at this.  Once we have a memory model that
>    works, I'll go through the process model.
> 
> What is the difference between your messages and spin locks?
> Both seem to shuffle between cpus anytime anything interesting
> happens.
> 
> In the spinlock case, I can thread out the locks in the page cache
> hash table so that the shuffling is reduced.  In the message case, I
> always have to talk to someone.

Two points: 

a) if you haven't already, go read the Solaris doors paper.  Now think about
   a cross OS instead of a cross address space doors call.  They are very
   similar.  Think about a TLB shootdown, it's sort of a doors style cross
   call already, not exactly the same, but do you see what I mean?

b) I am not claiming that you will have less threading in the page cache.
   I suspect it will be virtually identical except that in my case 90%+
   of the threading is in the ccCluster file system, not the generic code.
   I do not want to get into a "my threading is better than your threading"
   discussion.  I'll even go so far as to say that this approach will have
   more threading if that makes you happy, as long as we agree it is outside
   of the generic part of the kernel.  So unless I have CONFIG_CC_CLUSTER,
   you pay nothing or very, very little.

   Where this approach wins big is everywhere except the page cache.  Every
   single data structure in the system becomes N-way more parallel -- with
   no additional locks -- when you boot up N instances of the OS.  That's
   a powerful statement.  Sure, I freely admit that you'll add a few locks
   to deal with cross OS synchronization but all of those are configed out
   unless you are running a ccCluster.  There is absolutely zero chance that
   you can get the same level of scaling with the same number of locks using
   the traditional approach.  I'll go out on a limb and predict it is at
   least 100x as many locks.  Think about it, there are a lot of things that
   are threaded simply because of some corner case that happens to show up
   in some benchmark or workload.  In the ccCluster case, those by and large
   go away.  Less locks, less deadlocks, less memory barriers, more reliable,
   less filling, tastes great and Mikey likes it :-)
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-05  3:23       ` Larry McVoy
@ 2001-12-05  6:05         ` David S. Miller
  2001-12-05  6:51           ` Jeff Merkey
  2001-12-06  2:52           ` Rusty Russell
  2001-12-05  8:12         ` Momchil Velikov
  1 sibling, 2 replies; 75+ messages in thread
From: David S. Miller @ 2001-12-05  6:05 UTC (permalink / raw)
  To: lm; +Cc: Martin.Bligh, riel, lars.spam, alan, hps, linux-kernel

   From: Larry McVoy <lm@bitmover.com>
   Date: Tue, 4 Dec 2001 19:23:17 -0800

      There is absolutely zero chance that you can get the same level
      of scaling with the same number of locks using the traditional
      approach.  I'll go out on a limb and predict it is at least 100x as many locks.

Coming from the background of having threaded from scratch a complete
networking stack (ie. my shit doesn't stink and I'm here to tell you
about it :-))) I think your claims are pretty much out of whack.

Starting from initial implementation to having all the locking
disappear from the kernel profiles during any given load was composed
of three tasks:

	1) Is this object almost entirely reader based (or the corrolary)?
	   Use special locking that exploits this.  See linux/brlock.h
	   for one such "special locking" we invented to handle these
	   cases optimally.

	   This transformation results in ZERO shared dirty cache
	   lines if the initial analysis is correct.

	2) Can we "fan out" the locking so that people touching
           seperate objects %99 of the time touch different cache
	   lines?

	   This doesn't mean "more per-object locking", it means more
	   like "more per-zone locking".  Per-hashchain locking falls
	   into this category and is very effective for any load I
	   have ever seen.

	3) Is this really a per-cpu "thing"?  The per-cpu SKB header
	   caches are an example of this.  The per-cpu SLAB magazines
	   are yet another.

Another source of scalability problems has nothing to do with whether
you do some kind of clustering or not.  You are still going to get
into situations where multiple cpus want (for example) page 3 of
libc.so :-)  (what I'm trying to say is that it is a hardware issue
in some classes of situations)

Frankly, after applying #1 and/or #2 and/or #3 above to what shows up
to have contention (I claim the ipv4 stack to have had this done for
it) there isn't much you are going to get back.  I see zero reasons to
add any more locks to ipv4, and I don't think we've overdone it in the
networking either.

Even more boldly, I claim that Linux's current ipv4 scales further
than anything coming out of Sun engineering.  From my perspective
Sun's scalability efforts are more in line with "rubber-stamp"
per-object locking when things show up in the profiles than anything
else.  Their networking is really big and fat.  For example the
Solaris per-socket TCP information is nearly 4 times the size of that
in Linux (last time I checked their headers).  And as we all know
their stuff sits on top of some thick infrastructure (ie. STREAMS)
(OK, here is where someone pops up a realistic networking benchmark
where we scale worse than Solaris.  I would welcome such a retort
because it'd probably end up being a really simple thing to fix.)

My main point: I think we currently scale as far as we could in the
places we've done the work (which would include networking) and it
isn't "too much locking".

The problem areas of scalability, for which no real solution is
evident yet, involve the file name lookup tree data structures,
ie. the dcache under Linux.  All accesses here are tree based, and
everyone starts from similar roots.  So you can't per-node or
per-branch lock as everyone traverses the same paths.  Furthermore you
can't use "special locks" as in #1 since this data structure is
neither heavy reader nor heavy writer.

But the real point here is that SMP/cc clusters are not going to solve
this name lookup scaling problem.

The dcache_lock shows up heavily on real workloads under current
Linux.  And it will show up just as badly on a SMP/cc cluster.  SMP/cc
clusters talk a lot about "put it into a special filesystem and scale
that all you want" but I'm trying to show that frankly thats where the
"no solution evident" scaling problems actually are today.

If LLNL was not too jazzed up about your proposal, I right now don't
blame them.  Because with the information I have right now, I think
your claims about it's potential are bogus.

I really want to be shown wrong, simply because the name path locking
issue is one that has been giving me mental gas for quite some time.

Another thing I've found is that SMP scalability changes that help
the "8, 16, 32, 64" cpu case almost never harm the "4, 2" cpu
cases.  Rather, they tend to improve the smaller cpu number cases.
Finally, as I think Ingo pointed out recently, some of the results of
our SMP work has even improved the uniprocessor cases.

Franks a lot,
David S. Miller
davem@redhat.com

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-05  6:05         ` David S. Miller
@ 2001-12-05  6:51           ` Jeff Merkey
  2001-12-06  2:52           ` Rusty Russell
  1 sibling, 0 replies; 75+ messages in thread
From: Jeff Merkey @ 2001-12-05  6:51 UTC (permalink / raw)
  To: lm, David S. Miller
  Cc: Martin.Bligh, riel, lars.spam, alan, hps, linux-kernel

----- Original Message -----
From: "David S. Miller" <davem@redhat.com>
To: <lm@bitmover.com>
Cc: <Martin.Bligh@us.ibm.com>; <riel@conectiva.com.br>;
<lars.spam@nocrew.org>; <alan@lxorguk.ukuu.org.uk>; <hps@intermeta.de>;
<linux-kernel@vger.kernel.org>
Sent: Tuesday, December 04, 2001 11:05 PM
Subject: Re: SMP/cc Cluster description

>    From: Larry McVoy <lm@bitmover.com>
>    Date: Tue, 4 Dec 2001 19:23:17 -0800
>
> Even more boldly, I claim that Linux's current ipv4 scales further
> than anything coming out of Sun engineering.  From my perspective
> Sun's scalability efforts are more in line with "rubber-stamp"
> per-object locking when things show up in the profiles than anything
> else.  Their networking is really big and fat.  For example the
> Solaris per-socket TCP information is nearly 4 times the size of that
> in Linux (last time I checked their headers).  And as we all know
> their stuff sits on top of some thick infrastructure (ie. STREAMS)
> (OK, here is where someone pops up a realistic networking benchmark
> where we scale worse than Solaris.  I would welcome such a retort
> because it'd probably end up being a really simple thing to fix.)

David,

The job you did on ipv4 is quite excellent.  I multi-threaded the ODI layer
in NetWare,
and I have reviewed your work and it's as good as anything out there.  The
fewer locks,
the better.  Also, I agree that Sun's "hot air" regarding their SMP is just
that.  Sure, they
have a greart priority inheritenace model, but big f_cking deal.  sleep
locks and their behaviors
have little to do with I./O scaling on interrupt paths, other than to
increase the background
transaction activity on the memory bus.

Your ipv4 work is not perfect, but it's certainly good enough.  We found
with NetWare that SMP
scaling was tough to achieve since the processor was never the bottleneck --
the I/O bus was.
Uniprocessor NetWare 3.12 still runs circles around Linux or anything else,
and it's not
multithreaded, just well optimaized (and hand coded in assembler).

There are a few optimizations you could still to do to make it even faster,
but these are off line
discussions.  :-)

Jeff
.
>
> Franks a lot,
> David S. Miller
> davem@redhat.com
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-05  6:05         ` David S. Miller
  2001-12-05  6:51           ` Jeff Merkey
@ 2001-12-06  2:52           ` Rusty Russell
  2001-12-06  3:19             ` Davide Libenzi
  1 sibling, 1 reply; 75+ messages in thread
From: Rusty Russell @ 2001-12-06  2:52 UTC (permalink / raw)
  To: David S. Miller
  Cc: lm, Martin.Bligh, riel, lars.spam, alan, hps, linux-kernel

On Tue, 04 Dec 2001 22:05:11 -0800 (PST)
"David S. Miller" <davem@redhat.com> wrote:

> The problem areas of scalability, for which no real solution is
> evident yet, involve the file name lookup tree data structures,
> ie. the dcache under Linux.  All accesses here are tree based, and
> everyone starts from similar roots.  So you can't per-node or
> per-branch lock as everyone traverses the same paths.  Furthermore you
> can't use "special locks" as in #1 since this data structure is
> neither heavy reader nor heavy writer.

Yes.  dbench on 4-way was showing d_lookup hurting us, but replacing
dcache_lock with a rw_lock (Anton Blanchard provided an atomic_dec_and_wlock)
and a separate lock for the unused list DIDN'T HELP ONE BIT.

Why?  Because there's no contention on the lock!  The problem is almost
entirely moving the cacheline around (which is the same for a rw lock).

I'd love to say that I can solve this with RCU, but it's vastly non-trivial
and I haven't got code, so I'm not going to say that. 8)

Rusty.
-- 
  Anyone who quotes me is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06  2:52           ` Rusty Russell
@ 2001-12-06  3:19             ` Davide Libenzi
  2001-12-06  7:56               ` David S. Miller
  2001-12-06 14:24               ` Rik van Riel
  0 siblings, 2 replies; 75+ messages in thread
From: Davide Libenzi @ 2001-12-06  3:19 UTC (permalink / raw)
  To: Rusty Russell
  Cc: David S. Miller, lm, Martin.Bligh, riel, lars.spam, alan, hps,
	linux-kernel

On Thu, 6 Dec 2001, Rusty Russell wrote:

> I'd love to say that I can solve this with RCU, but it's vastly non-trivial
> and I haven't got code, so I'm not going to say that. 8)

Lockless algos could help if we're able to have "good" quiescent point
inside the kernel. Or better have a good quiescent infrastructure to have
lockless code to plug in.



- Davide



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06  3:19             ` Davide Libenzi
@ 2001-12-06  7:56               ` David S. Miller
  2001-12-06  8:02                 ` Larry McVoy
  2001-12-06 14:24               ` Rik van Riel
  1 sibling, 1 reply; 75+ messages in thread
From: David S. Miller @ 2001-12-06  7:56 UTC (permalink / raw)
  To: davidel; +Cc: rusty, lm, Martin.Bligh, riel, lars.spam, alan, hps, linux-kernel

   From: Davide Libenzi <davidel@xmailserver.org>
   Date: Wed, 5 Dec 2001 19:19:19 -0800 (PST)

   On Thu, 6 Dec 2001, Rusty Russell wrote:

   > I'd love to say that I can solve this with RCU, but it's vastly non-trivial
   > and I haven't got code, so I'm not going to say that. 8)

   Lockless algos could help if we're able to have "good" quiescent point
   inside the kernel. Or better have a good quiescent infrastructure to have
   lockless code to plug in.

Lockless algorithms don't get rid of the shared cache lines.

I used to once think that lockless algorithms were the SMP holy-grail,
but this was undone when I realized they had the same fundamental
overhead spinlocks do.

These lockless algorithms, instructions like CAS, DCAS, "infinite
consensus number", it's all crap.  You have to seperate out the access
areas amongst different cpus so they don't collide, and none of these
mechanisms do that.

That is, unless some lockless algorithm involving %100 local dirty
state has been invented while I wasn't looking :-)

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06  7:56               ` David S. Miller
@ 2001-12-06  8:02                 ` Larry McVoy
  2001-12-06  8:09                   ` David S. Miller
  2001-12-06 19:42                   ` Daniel Phillips
  0 siblings, 2 replies; 75+ messages in thread
From: Larry McVoy @ 2001-12-06  8:02 UTC (permalink / raw)
  To: David S. Miller
  Cc: davidel, rusty, lm, Martin.Bligh, riel, lars.spam, alan, hps,
	linux-kernel

On Wed, Dec 05, 2001 at 11:56:17PM -0800, David S. Miller wrote:
> These lockless algorithms, instructions like CAS, DCAS, "infinite
> consensus number", it's all crap.  You have to seperate out the access
> areas amongst different cpus so they don't collide, and none of these
> mechanisms do that.

Err, Dave, that's *exactly* the point of the ccCluster stuff.  You get
all that seperation for every data structure for free.  Think about
it a bit.  Aren't you going to feel a little bit stupid if you do all
this work, one object at a time, and someone can come along and do the
whole OS in one swoop?  Yeah, I'm spouting crap, it isn't that easy,
but it is much easier than the route you are taking.  
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06  8:02                 ` Larry McVoy
@ 2001-12-06  8:09                   ` David S. Miller
  2001-12-06 18:27                     ` Jeff V. Merkey
  2001-12-06 19:42                   ` Daniel Phillips
  1 sibling, 1 reply; 75+ messages in thread
From: David S. Miller @ 2001-12-06  8:09 UTC (permalink / raw)
  To: lm; +Cc: davidel, rusty, Martin.Bligh, riel, lars.spam, alan, hps,
	linux-kernel

   From: Larry McVoy <lm@bitmover.com>
   Date: Thu, 6 Dec 2001 00:02:16 -0800

   Err, Dave, that's *exactly* the point of the ccCluster stuff.  You get
   all that seperation for every data structure for free.  Think about
   it a bit.  Aren't you going to feel a little bit stupid if you do all
   this work, one object at a time, and someone can come along and do the
   whole OS in one swoop?  Yeah, I'm spouting crap, it isn't that easy,
   but it is much easier than the route you are taking.  

How does ccClusters avoid the file system namespace locking issues?
How do all the OS nodes see a consistent FS tree?

All the talk is about the "magic filesystem, thread it as much as you
want" and I'm telling you that is the fundamental problem, the
filesystem name space locking.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06  8:09                   ` David S. Miller
@ 2001-12-06 18:27                     ` Jeff V. Merkey
  2001-12-06 18:37                       ` Jeff V. Merkey
  2001-12-06 19:11                       ` Davide Libenzi
  0 siblings, 2 replies; 75+ messages in thread
From: Jeff V. Merkey @ 2001-12-06 18:27 UTC (permalink / raw)
  To: David S. Miller
  Cc: lm, davidel, rusty, Martin.Bligh, riel, lars.spam, alan, hps,
	linux-kernel, jmerkey

Guys,

I am the maintaner of SCI, the ccNUMA technology standard.  I know
alot about this stuff, and have been involved with SCI since 
1994.  I work with it every day and the Dolphin guys on some huge 
supercomputer accounts, like Los Alamos and Sandia Labs in NM.  
I will tell you this from what I know.

A shared everything approach is a programmers dream come true,
but you can forget getting reasonable fault tolerance with it.  The 
shared memory zealots want everyone to believe ccNUMA is better 
than sex, but it does not scale when compared to Shared-Nothing
programming models.  There's also a lot of tough issues for dealing 
with failed nodes, and how you recover when peoples memory is 
all over the place across a nuch of machines.  

SCI scales better in ccNUMA and all NUMA technoogies scale very 
well when they are used with "Explicit Coherence" instead of 
"Implicit Coherence" which is what you get with SMP systems.  
Years of research by Dr. Justin Rattner at Intel's High 
performance labs demonstrated that shared nothing models scaled
into the thousands of nodes, while all these shared everything
"Super SMP" approaches hit the wall at 64 processors generally.

SCI is the fastest shared nothing interface out there, and it also
can do ccNUMA.  Sequent, Sun, DG and a host of other NUMA providers
use Dolphin's SCI technology and have for years.   ccNUMA is useful 
for applications that still assume a shared nothing approach but that
use the ccNUMA and NUMA capabilities for better optimization.

Forget trying to recreate the COMA architecture of Kendall-Square.  
The name was truly descriptive of what happened in this architecture
when a node fails -- goes into a "COMA".  This whole discussion I have
lived through before and you will find that ccNUMA is virtually 
unimplementable on most general purpose OSs.  And yes, there are 
a lot of products and software out there, but when you look under 
the cover (like ServerNet) you discover their coherence models 
for the most part relay on push/pull explicit coherence models.

My 2 cents.

Jeff 

On Thu, Dec 06, 2001 at 12:09:32AM -0800, David S. Miller wrote:
>    From: Larry McVoy <lm@bitmover.com>
>    Date: Thu, 6 Dec 2001 00:02:16 -0800
>    
>    Err, Dave, that's *exactly* the point of the ccCluster stuff.  You get
>    all that seperation for every data structure for free.  Think about
>    it a bit.  Aren't you going to feel a little bit stupid if you do all
>    this work, one object at a time, and someone can come along and do the
>    whole OS in one swoop?  Yeah, I'm spouting crap, it isn't that easy,
>    but it is much easier than the route you are taking.  
> 
> How does ccClusters avoid the file system namespace locking issues?
> How do all the OS nodes see a consistent FS tree?
> 
> All the talk is about the "magic filesystem, thread it as much as you
> want" and I'm telling you that is the fundamental problem, the
> filesystem name space locking.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 18:27                     ` Jeff V. Merkey
@ 2001-12-06 18:37                       ` Jeff V. Merkey
  2001-12-06 18:36                         ` Martin J. Bligh
  2001-12-06 19:11                       ` Davide Libenzi
  1 sibling, 1 reply; 75+ messages in thread
From: Jeff V. Merkey @ 2001-12-06 18:37 UTC (permalink / raw)
  To: David S. Miller
  Cc: lm, davidel, rusty, Martin.Bligh, riel, lars.spam, alan, hps,
	linux-kernel, jmerkey

On Thu, Dec 06, 2001 at 11:27:31AM -0700, Jeff V. Merkey wrote:

And also, if you download the SCI drivers in my area, and order
some SCI adapters from Dolphin in Albquerque, NM, you can set up 
a ccNUMA system with standard PCs.  Dolphin has 66Mhz versions (and
a 133Mhz coming in the future) that run at almost a gigabyte per 
second node-2-node over a parallel fabric.  The cross-sectional
SCI fabric bandwidth scales at (O)(2N) as you add nodes.  

If you want to play around with ccNUMA with Standard PCs, these 
cards are relatively inepxensive, and allow you to setup some 
powerful cc/SMP systems with explicit coherence.  The full 
ccNUMA boxes from DG are expensive, however.  That way, instead
of everyone talking about it, you guys could get some cool 
hardware and experiment with some of your rather forward 
looking and interesting ideas.

:-)

Jeff



> 
> 
> Guys,
> 
> I am the maintaner of SCI, the ccNUMA technology standard.  I know
> alot about this stuff, and have been involved with SCI since 
> 1994.  I work with it every day and the Dolphin guys on some huge 
> supercomputer accounts, like Los Alamos and Sandia Labs in NM.  
> I will tell you this from what I know.
> 
> A shared everything approach is a programmers dream come true,
> but you can forget getting reasonable fault tolerance with it.  The 
> shared memory zealots want everyone to believe ccNUMA is better 
> than sex, but it does not scale when compared to Shared-Nothing
> programming models.  There's also a lot of tough issues for dealing 
> with failed nodes, and how you recover when peoples memory is 
> all over the place across a nuch of machines.  
> 
> SCI scales better in ccNUMA and all NUMA technoogies scale very 
> well when they are used with "Explicit Coherence" instead of 
> "Implicit Coherence" which is what you get with SMP systems.  
> Years of research by Dr. Justin Rattner at Intel's High 
> performance labs demonstrated that shared nothing models scaled
> into the thousands of nodes, while all these shared everything
> "Super SMP" approaches hit the wall at 64 processors generally.
> 
> SCI is the fastest shared nothing interface out there, and it also
> can do ccNUMA.  Sequent, Sun, DG and a host of other NUMA providers
> use Dolphin's SCI technology and have for years.   ccNUMA is useful 
> for applications that still assume a shared nothing approach but that
> use the ccNUMA and NUMA capabilities for better optimization.
> 
> Forget trying to recreate the COMA architecture of Kendall-Square.  
> The name was truly descriptive of what happened in this architecture
> when a node fails -- goes into a "COMA".  This whole discussion I have
> lived through before and you will find that ccNUMA is virtually 
> unimplementable on most general purpose OSs.  And yes, there are 
> a lot of products and software out there, but when you look under 
> the cover (like ServerNet) you discover their coherence models 
> for the most part relay on push/pull explicit coherence models.
> 
> My 2 cents.
> 
> Jeff 
> 
> 
> 
> On Thu, Dec 06, 2001 at 12:09:32AM -0800, David S. Miller wrote:
> >    From: Larry McVoy <lm@bitmover.com>
> >    Date: Thu, 6 Dec 2001 00:02:16 -0800
> >    
> >    Err, Dave, that's *exactly* the point of the ccCluster stuff.  You get
> >    all that seperation for every data structure for free.  Think about
> >    it a bit.  Aren't you going to feel a little bit stupid if you do all
> >    this work, one object at a time, and someone can come along and do the
> >    whole OS in one swoop?  Yeah, I'm spouting crap, it isn't that easy,
> >    but it is much easier than the route you are taking.  
> > 
> > How does ccClusters avoid the file system namespace locking issues?
> > How do all the OS nodes see a consistent FS tree?
> > 
> > All the talk is about the "magic filesystem, thread it as much as you
> > want" and I'm telling you that is the fundamental problem, the
> > filesystem name space locking.
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 18:37                       ` Jeff V. Merkey
@ 2001-12-06 18:36                         ` Martin J. Bligh
  2001-12-06 18:45                           ` Jeff V. Merkey
  0 siblings, 1 reply; 75+ messages in thread
From: Martin J. Bligh @ 2001-12-06 18:36 UTC (permalink / raw)
  To: Jeff V. Merkey, David S. Miller
  Cc: lm, davidel, rusty, riel, lars.spam, alan, hps, linux-kernel,
	jmerkey

> If you want to play around with ccNUMA with Standard PCs, these 
> cards are relatively inepxensive, and allow you to setup some 
> powerful cc/SMP systems with explicit coherence.  The full 
> ccNUMA boxes from DG are expensive, however.  That way, instead
> of everyone talking about it, you guys could get some cool 
> hardware and experiment with some of your rather forward 
> looking and interesting ideas.

Or you could just book some time on the 16x NUMA-Q that's publicly
available in the OSDL for free, and running Linux already.

Martin.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 18:36                         ` Martin J. Bligh
@ 2001-12-06 18:45                           ` Jeff V. Merkey
  0 siblings, 0 replies; 75+ messages in thread
From: Jeff V. Merkey @ 2001-12-06 18:45 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: David S. Miller, lm, davidel, rusty, riel, lars.spam, alan, hps,
	linux-kernel, jmerkey

On Thu, Dec 06, 2001 at 10:36:57AM -0800, Martin J. Bligh wrote:
> > If you want to play around with ccNUMA with Standard PCs, these 
> > cards are relatively inepxensive, and allow you to setup some 
> > powerful cc/SMP systems with explicit coherence.  The full 
> > ccNUMA boxes from DG are expensive, however.  That way, instead
> > of everyone talking about it, you guys could get some cool 
> > hardware and experiment with some of your rather forward 
> > looking and interesting ideas.
> 
> Or you could just book some time on the 16x NUMA-Q that's publicly
> available in the OSDL for free, and running Linux already.
> 
> Martin.

Could even have some shipped off to you guys if you to play 
around with them.  :-)

Jeff


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 18:27                     ` Jeff V. Merkey
  2001-12-06 18:37                       ` Jeff V. Merkey
@ 2001-12-06 19:11                       ` Davide Libenzi
  2001-12-06 19:34                         ` Jeff V. Merkey
  1 sibling, 1 reply; 75+ messages in thread
From: Davide Libenzi @ 2001-12-06 19:11 UTC (permalink / raw)
  To: Jeff V. Merkey
  Cc: David S. Miller, lm, rusty, Martin J. Bligh, Rik vav Riel,
	lars.spam, Alan Cox, hps, lkml, jmerkey

On Thu, 6 Dec 2001, Jeff V. Merkey wrote:

> Guys,
>
> I am the maintaner of SCI, the ccNUMA technology standard.  I know
> alot about this stuff, and have been involved with SCI since
> 1994.  I work with it every day and the Dolphin guys on some huge
> supercomputer accounts, like Los Alamos and Sandia Labs in NM.
> I will tell you this from what I know.
>
> A shared everything approach is a programmers dream come true,
> but you can forget getting reasonable fault tolerance with it.  The
> shared memory zealots want everyone to believe ccNUMA is better
> than sex, but it does not scale when compared to Shared-Nothing
> programming models.  There's also a lot of tough issues for dealing
> with failed nodes, and how you recover when peoples memory is
> all over the place across a nuch of machines.

If you can afford rewriting/rearchitecting your application it's pretty
clear that the share-nothing model is the winner one.
But if you can rewrite your application using a share-nothing model you
don't need any fancy clustering architectures since beowulf like cluster
would work for you and they'll give you a great scalability over the
number of nodes.
The problem arises when you've to choose between a new architecture
( share nothing ) using conventional clusters and a
share-all/keep-all-your-application-as-is one.
The share nothing is cheap and gives you a very nice scalability, these
are the two mayor pros for this solution.
On the other side you've a vary bad scalability and a very expensive
solution.
But you've to consider :

1) rewriting is risky

2) good developers to rewrite your stuff are expensive ( $100K up to $150K
	in my area )

These are the reason that let me think that conventional SMP machines will
have a future in addition to my believing that technology will help a lot
to improve scalability.

- Davide

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 19:11                       ` Davide Libenzi
@ 2001-12-06 19:34                         ` Jeff V. Merkey
  2001-12-06 23:16                           ` David Lang
  0 siblings, 1 reply; 75+ messages in thread
From: Jeff V. Merkey @ 2001-12-06 19:34 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: David S. Miller, lm, rusty, Martin J. Bligh, Rik vav Riel,
	lars.spam, Alan Cox, hps, lkml, jmerkey

On Thu, Dec 06, 2001 at 11:11:27AM -0800, Davide Libenzi wrote:
> On Thu, 6 Dec 2001, Jeff V. Merkey wrote:
> 
> > Guys,
> >
> > I am the maintaner of SCI, the ccNUMA technology standard.  I know
> > alot about this stuff, and have been involved with SCI since
> > 1994.  I work with it every day and the Dolphin guys on some huge
> > supercomputer accounts, like Los Alamos and Sandia Labs in NM.
> > I will tell you this from what I know.
> >
> > A shared everything approach is a programmers dream come true,
> > but you can forget getting reasonable fault tolerance with it.  The
> > shared memory zealots want everyone to believe ccNUMA is better
> > than sex, but it does not scale when compared to Shared-Nothing
> > programming models.  There's also a lot of tough issues for dealing
> > with failed nodes, and how you recover when peoples memory is
> > all over the place across a nuch of machines.
> 
> If you can afford rewriting/rearchitecting your application it's pretty
> clear that the share-nothing model is the winner one.
> But if you can rewrite your application using a share-nothing model you
> don't need any fancy clustering architectures since beowulf like cluster
> would work for you and they'll give you a great scalability over the
> number of nodes.
> The problem arises when you've to choose between a new architecture
> ( share nothing ) using conventional clusters and a
> share-all/keep-all-your-application-as-is one.
> The share nothing is cheap and gives you a very nice scalability, these
> are the two mayor pros for this solution.
> On the other side you've a vary bad scalability and a very expensive
> solution.
> But you've to consider :
> 
> 1) rewriting is risky
> 
> 2) good developers to rewrite your stuff are expensive ( $100K up to $150K
> 	in my area )
> 
> These are the reason that let me think that conventional SMP machines will
> have a future in addition to my believing that technology will help a lot
> to improve scalability.
> 

There's a way through the fog.  Shared Nothing with explicit coherence.
You are correct, applications need to be rewritten to exploit it.  It 
is possible to run existing SMP apps process -> process across nodes
with ccNUMA, and this works, but you don't get the scaling as shared
nothing.

Jeff

Jeff


> 
> 
> 
> - Davide
> 
> 
> 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 19:34                         ` Jeff V. Merkey
@ 2001-12-06 23:16                           ` David Lang
  2001-12-07  2:56                             ` Jeff V. Merkey
  0 siblings, 1 reply; 75+ messages in thread
From: David Lang @ 2001-12-06 23:16 UTC (permalink / raw)
  To: Jeff V. Merkey
  Cc: Davide Libenzi, David S. Miller, lm, rusty, Martin J. Bligh,
	Rik vav Riel, lars.spam, Alan Cox, hps, lkml, jmerkey

also some applications (i.e. databases) are such that nobody has really
been able to rewrite them into the shared nothing model (although oracle
has attempted it, from what I hear it has problems)

David Lang

 On Thu, 6 Dec 2001,
Jeff V. Merkey wrote:

> Date: Thu, 6 Dec 2001 12:34:48 -0700
> From: Jeff V. Merkey <jmerkey@vger.timpanogas.org>
> To: Davide Libenzi <davidel@xmailserver.org>
> Cc: David S. Miller <davem@redhat.com>, lm@bitmover.com,
>      rusty@rustcorp.com.au, Martin J. Bligh <Martin.Bligh@us.ibm.com>,
>      Rik vav Riel <riel@conectiva.com.br>, lars.spam@nocrew.org,
>      Alan Cox <alan@lxorguk.ukuu.org.uk>, hps@intermeta.de,
>      lkml <linux-kernel@vger.kernel.org>, jmerkey@timpanogas.org
> Subject: Re: SMP/cc Cluster description
>
> On Thu, Dec 06, 2001 at 11:11:27AM -0800, Davide Libenzi wrote:
> > On Thu, 6 Dec 2001, Jeff V. Merkey wrote:
> >
> > > Guys,
> > >
> > > I am the maintaner of SCI, the ccNUMA technology standard.  I know
> > > alot about this stuff, and have been involved with SCI since
> > > 1994.  I work with it every day and the Dolphin guys on some huge
> > > supercomputer accounts, like Los Alamos and Sandia Labs in NM.
> > > I will tell you this from what I know.
> > >
> > > A shared everything approach is a programmers dream come true,
> > > but you can forget getting reasonable fault tolerance with it.  The
> > > shared memory zealots want everyone to believe ccNUMA is better
> > > than sex, but it does not scale when compared to Shared-Nothing
> > > programming models.  There's also a lot of tough issues for dealing
> > > with failed nodes, and how you recover when peoples memory is
> > > all over the place across a nuch of machines.
> >
> > If you can afford rewriting/rearchitecting your application it's pretty
> > clear that the share-nothing model is the winner one.
> > But if you can rewrite your application using a share-nothing model you
> > don't need any fancy clustering architectures since beowulf like cluster
> > would work for you and they'll give you a great scalability over the
> > number of nodes.
> > The problem arises when you've to choose between a new architecture
> > ( share nothing ) using conventional clusters and a
> > share-all/keep-all-your-application-as-is one.
> > The share nothing is cheap and gives you a very nice scalability, these
> > are the two mayor pros for this solution.
> > On the other side you've a vary bad scalability and a very expensive
> > solution.
> > But you've to consider :
> >
> > 1) rewriting is risky
> >
> > 2) good developers to rewrite your stuff are expensive ( $100K up to $150K
> > 	in my area )
> >
> > These are the reason that let me think that conventional SMP machines will
> > have a future in addition to my believing that technology will help a lot
> > to improve scalability.
> >
>
> There's a way through the fog.  Shared Nothing with explicit coherence.
> You are correct, applications need to be rewritten to exploit it.  It
> is possible to run existing SMP apps process -> process across nodes
> with ccNUMA, and this works, but you don't get the scaling as shared
> nothing.
>
> Jeff
>
> Jeff
>
>
> >
> >
> >
> > - Davide
> >
> >
> >
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 23:16                           ` David Lang
@ 2001-12-07  2:56                             ` Jeff V. Merkey
  2001-12-07  4:23                               ` David Lang
  0 siblings, 1 reply; 75+ messages in thread
From: Jeff V. Merkey @ 2001-12-07  2:56 UTC (permalink / raw)
  To: David Lang
  Cc: Davide Libenzi, David S. Miller, lm, rusty, Martin J. Bligh,
	Rik vav Riel, lars.spam, Alan Cox, hps, lkml, jmerkey

On Thu, Dec 06, 2001 at 03:16:13PM -0800, David Lang wrote:
> also some applications (i.e. databases) are such that nobody has really
> been able to rewrite them into the shared nothing model (although oracle
> has attempted it, from what I hear it has problems)
> 
> David Lang

OPS (Oracle Parallel Server) is shared nothing.  

Jeff


> 
>  On Thu, 6 Dec 2001,
> Jeff V. Merkey wrote:
> 
> > Date: Thu, 6 Dec 2001 12:34:48 -0700
> > From: Jeff V. Merkey <jmerkey@vger.timpanogas.org>
> > To: Davide Libenzi <davidel@xmailserver.org>
> > Cc: David S. Miller <davem@redhat.com>, lm@bitmover.com,
> >      rusty@rustcorp.com.au, Martin J. Bligh <Martin.Bligh@us.ibm.com>,
> >      Rik vav Riel <riel@conectiva.com.br>, lars.spam@nocrew.org,
> >      Alan Cox <alan@lxorguk.ukuu.org.uk>, hps@intermeta.de,
> >      lkml <linux-kernel@vger.kernel.org>, jmerkey@timpanogas.org
> > Subject: Re: SMP/cc Cluster description
> >
> > On Thu, Dec 06, 2001 at 11:11:27AM -0800, Davide Libenzi wrote:
> > > On Thu, 6 Dec 2001, Jeff V. Merkey wrote:
> > >
> > > > Guys,
> > > >
> > > > I am the maintaner of SCI, the ccNUMA technology standard.  I know
> > > > alot about this stuff, and have been involved with SCI since
> > > > 1994.  I work with it every day and the Dolphin guys on some huge
> > > > supercomputer accounts, like Los Alamos and Sandia Labs in NM.
> > > > I will tell you this from what I know.
> > > >
> > > > A shared everything approach is a programmers dream come true,
> > > > but you can forget getting reasonable fault tolerance with it.  The
> > > > shared memory zealots want everyone to believe ccNUMA is better
> > > > than sex, but it does not scale when compared to Shared-Nothing
> > > > programming models.  There's also a lot of tough issues for dealing
> > > > with failed nodes, and how you recover when peoples memory is
> > > > all over the place across a nuch of machines.
> > >
> > > If you can afford rewriting/rearchitecting your application it's pretty
> > > clear that the share-nothing model is the winner one.
> > > But if you can rewrite your application using a share-nothing model you
> > > don't need any fancy clustering architectures since beowulf like cluster
> > > would work for you and they'll give you a great scalability over the
> > > number of nodes.
> > > The problem arises when you've to choose between a new architecture
> > > ( share nothing ) using conventional clusters and a
> > > share-all/keep-all-your-application-as-is one.
> > > The share nothing is cheap and gives you a very nice scalability, these
> > > are the two mayor pros for this solution.
> > > On the other side you've a vary bad scalability and a very expensive
> > > solution.
> > > But you've to consider :
> > >
> > > 1) rewriting is risky
> > >
> > > 2) good developers to rewrite your stuff are expensive ( $100K up to $150K
> > > 	in my area )
> > >
> > > These are the reason that let me think that conventional SMP machines will
> > > have a future in addition to my believing that technology will help a lot
> > > to improve scalability.
> > >
> >
> > There's a way through the fog.  Shared Nothing with explicit coherence.
> > You are correct, applications need to be rewritten to exploit it.  It
> > is possible to run existing SMP apps process -> process across nodes
> > with ccNUMA, and this works, but you don't get the scaling as shared
> > nothing.
> >
> > Jeff
> >
> > Jeff
> >
> >
> > >
> > >
> > >
> > > - Davide
> > >
> > >
> > >
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> >

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-07  2:56                             ` Jeff V. Merkey
@ 2001-12-07  4:23                               ` David Lang
  2001-12-07  5:45                                 ` Jeff V. Merkey
  0 siblings, 1 reply; 75+ messages in thread
From: David Lang @ 2001-12-07  4:23 UTC (permalink / raw)
  To: Jeff V. Merkey
  Cc: Davide Libenzi, David S. Miller, lm, rusty, Martin J. Bligh,
	Rik vav Riel, lars.spam, Alan Cox, hps, lkml, jmerkey

On Thu, 6 Dec 2001, Jeff V. Merkey wrote:

> On Thu, Dec 06, 2001 at 03:16:13PM -0800, David Lang wrote:
> > also some applications (i.e. databases) are such that nobody has really
> > been able to rewrite them into the shared nothing model (although oracle
> > has attempted it, from what I hear it has problems)
> >
> > David Lang
>
> OPS (Oracle Parallel Server) is shared nothing.
>

correct, and from what I have been hearing from my local database folks
it's significantly less efficiant then a large SMP machine (up intil you
hit the point where you just can't buy a machine big enough :-)

I'm interested in hearing more if you have had better experiances with it.

David Lang

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-07  4:23                               ` David Lang
@ 2001-12-07  5:45                                 ` Jeff V. Merkey
  0 siblings, 0 replies; 75+ messages in thread
From: Jeff V. Merkey @ 2001-12-07  5:45 UTC (permalink / raw)
  To: David Lang
  Cc: Davide Libenzi, David S. Miller, lm, rusty, Martin J. Bligh,
	Rik vav Riel, lars.spam, Alan Cox, hps, lkml, jmerkey

On Thu, Dec 06, 2001 at 08:23:36PM -0800, David Lang wrote:
> On Thu, 6 Dec 2001, Jeff V. Merkey wrote:
> 
> > On Thu, Dec 06, 2001 at 03:16:13PM -0800, David Lang wrote:
> > > also some applications (i.e. databases) are such that nobody has really
> > > been able to rewrite them into the shared nothing model (although oracle
> > > has attempted it, from what I hear it has problems)
> > >
> > > David Lang
> >
> > OPS (Oracle Parallel Server) is shared nothing.
> >
> 
> correct, and from what I have been hearing from my local database folks
> it's significantly less efficiant then a large SMP machine (up intil you
> hit the point where you just can't buy a machine big enough :-)
> 
> I'm interested in hearing more if you have had better experiances with it.
> 
> David Lang

I worked with the OPS code years back.  It came from DEC originally 
and is a very old technology.  It grew out of disk pinging, where 
the messages would be pinged across a shared disk.  Some cool features,
but I never saw it scale well beyond 16 nodes.  SQL queries are a lot 
like HTML requests, so similair approaches work well with them.  The code
was not so good, or readable.

Databases are "structured data" applications and present unique problems,
but most data stored on planet earth is "unstructured data", word files, 
emails, spreadsheets, etc.  The problem of scaling involves different 
approaches for these two categories, and the unstructured data problem 
is easily solvable and scalable.  

I ported Oracle to NetWare SMP in 1995 with Van Okamura (a very fun 
project), and SMP scaling was much better.  In those days, shared 
SCSI was what was around.  Writing an SOSD layer for Oracle 
was a lot of fun.  Working on OPS was also fun, but the code 
was not in such good shape.  Their method of dealing with 
deadlocks was not to.  Their approach assumed deadlocks were 
infrequent events (which they were) and they used a mathod that would 
detect them after the fact rather than before and deal with them 
then.  

I saw some impressive numbers for TPC-C come out of OPS 4 way clusters,
but more nodes than this (except the N-cube implementation) seemed to 
not do so well.

Jeff 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06  8:02                 ` Larry McVoy
  2001-12-06  8:09                   ` David S. Miller
@ 2001-12-06 19:42                   ` Daniel Phillips
  2001-12-06 19:53                     ` Larry McVoy
  1 sibling, 1 reply; 75+ messages in thread
From: Daniel Phillips @ 2001-12-06 19:42 UTC (permalink / raw)
  To: Larry McVoy, David S. Miller
  Cc: davidel, rusty, lm, Martin.Bligh, riel, lars.spam, alan, hps,
	linux-kernel

On December 6, 2001 09:02 am, Larry McVoy wrote:
> On Wed, Dec 05, 2001 at 11:56:17PM -0800, David S. Miller wrote:
> > These lockless algorithms, instructions like CAS, DCAS, "infinite
> > consensus number", it's all crap.  You have to seperate out the access
> > areas amongst different cpus so they don't collide, and none of these
> > mechanisms do that.
> 
> Err, Dave, that's *exactly* the point of the ccCluster stuff.  You get
> all that seperation for every data structure for free.  Think about
> it a bit.  Aren't you going to feel a little bit stupid if you do all
> this work, one object at a time, and someone can come along and do the
> whole OS in one swoop?  Yeah, I'm spouting crap, it isn't that easy,
> but it is much easier than the route you are taking.  

What I don't get after looking at your material, is how you intend to do the 
locking.  Sharing a mmap across OS instances is fine, but how do processes on 
the two different OS's avoid stepping on each other when they access the same 
file?

--
Daniel

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 19:42                   ` Daniel Phillips
@ 2001-12-06 19:53                     ` Larry McVoy
  2001-12-06 20:10                       ` Daniel Phillips
  0 siblings, 1 reply; 75+ messages in thread
From: Larry McVoy @ 2001-12-06 19:53 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Larry McVoy, David S. Miller, davidel, rusty, Martin.Bligh, riel,
	lars.spam, alan, hps, linux-kernel

On Thu, Dec 06, 2001 at 08:42:05PM +0100, Daniel Phillips wrote:
> On December 6, 2001 09:02 am, Larry McVoy wrote:
> > On Wed, Dec 05, 2001 at 11:56:17PM -0800, David S. Miller wrote:
> > > These lockless algorithms, instructions like CAS, DCAS, "infinite
> > > consensus number", it's all crap.  You have to seperate out the access
> > > areas amongst different cpus so they don't collide, and none of these
> > > mechanisms do that.
> > 
> > Err, Dave, that's *exactly* the point of the ccCluster stuff.  You get
> > all that seperation for every data structure for free.  Think about
> > it a bit.  Aren't you going to feel a little bit stupid if you do all
> > this work, one object at a time, and someone can come along and do the
> > whole OS in one swoop?  Yeah, I'm spouting crap, it isn't that easy,
> > but it is much easier than the route you are taking.  
> 
> What I don't get after looking at your material, is how you intend to do the 
> locking.  Sharing a mmap across OS instances is fine, but how do processes on 
> the two different OS's avoid stepping on each other when they access the same 
> file?

Exactly the same way they would if they were two processes on a traditional
SMP OS.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 19:53                     ` Larry McVoy
@ 2001-12-06 20:10                       ` Daniel Phillips
  2001-12-06 20:10                         ` Larry McVoy
  0 siblings, 1 reply; 75+ messages in thread
From: Daniel Phillips @ 2001-12-06 20:10 UTC (permalink / raw)
  To: Larry McVoy
  Cc: Larry McVoy, David S. Miller, davidel, rusty, Martin.Bligh, riel,
	lars.spam, alan, hps, linux-kernel

On December 6, 2001 08:53 pm, Larry McVoy wrote:
> On Thu, Dec 06, 2001 at 08:42:05PM +0100, Daniel Phillips wrote:
> > On December 6, 2001 09:02 am, Larry McVoy wrote:
> > > On Wed, Dec 05, 2001 at 11:56:17PM -0800, David S. Miller wrote:
> > > > These lockless algorithms, instructions like CAS, DCAS, "infinite
> > > > consensus number", it's all crap.  You have to seperate out the access
> > > > areas amongst different cpus so they don't collide, and none of these
> > > > mechanisms do that.
> > > 
> > > Err, Dave, that's *exactly* the point of the ccCluster stuff.  You get
> > > all that seperation for every data structure for free.  Think about
> > > it a bit.  Aren't you going to feel a little bit stupid if you do all
> > > this work, one object at a time, and someone can come along and do the
> > > whole OS in one swoop?  Yeah, I'm spouting crap, it isn't that easy,
> > > but it is much easier than the route you are taking.  
> > 
> > What I don't get after looking at your material, is how you intend to do the 
> > locking.  Sharing a mmap across OS instances is fine, but how do processes on 
> > the two different OS's avoid stepping on each other when they access the same 
> > file?
> 
> Exactly the same way they would if they were two processes on a traditional
> SMP OS.

They'd use locks internal to the VFS and fs, plus Posix-style locks and
assorted other userland serializers, which don't come for free.  As davem 
said, you'll have to present a coherent namespace, that's just one of the 
annoying details.  So far you haven't said much about how such things are 
going to be handled.

--
Daniel

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 20:10                       ` Daniel Phillips
@ 2001-12-06 20:10                         ` Larry McVoy
  2001-12-06 20:15                           ` David S. Miller
  2001-12-06 22:38                           ` Alan Cox
  0 siblings, 2 replies; 75+ messages in thread
From: Larry McVoy @ 2001-12-06 20:10 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Larry McVoy, David S. Miller, davidel, rusty, Martin.Bligh, riel,
	lars.spam, alan, hps, linux-kernel

On Thu, Dec 06, 2001 at 09:10:51PM +0100, Daniel Phillips wrote:
> On December 6, 2001 08:53 pm, Larry McVoy wrote:
> > On Thu, Dec 06, 2001 at 08:42:05PM +0100, Daniel Phillips wrote:
> > > On December 6, 2001 09:02 am, Larry McVoy wrote:
> > > > On Wed, Dec 05, 2001 at 11:56:17PM -0800, David S. Miller wrote:
> > > > > These lockless algorithms, instructions like CAS, DCAS, "infinite
> > > > > consensus number", it's all crap.  You have to seperate out the access
> > > > > areas amongst different cpus so they don't collide, and none of these
> > > > > mechanisms do that.
> > > > 
> > > > Err, Dave, that's *exactly* the point of the ccCluster stuff.  You get
> > > > all that seperation for every data structure for free.  Think about
> > > > it a bit.  Aren't you going to feel a little bit stupid if you do all
> > > > this work, one object at a time, and someone can come along and do the
> > > > whole OS in one swoop?  Yeah, I'm spouting crap, it isn't that easy,
> > > > but it is much easier than the route you are taking.  
> > > 
> > > What I don't get after looking at your material, is how you intend to do the 
> > > locking.  Sharing a mmap across OS instances is fine, but how do processes on 
> > > the two different OS's avoid stepping on each other when they access the same 
> > > file?
> > 
> > Exactly the same way they would if they were two processes on a traditional
> > SMP OS.
> 
> They'd use locks internal to the VFS and fs, plus Posix-style locks and
> assorted other userland serializers, which don't come for free.  As davem 
> said, you'll have to present a coherent namespace, that's just one of the 
> annoying details.  So far you haven't said much about how such things are 
> going to be handled.

Huh?  Of course not, they'd use mutexes in a mmap-ed file, which uses
the hardware's coherency.  No locks in the vfs or fs, that's all done
in the mmap/page fault path for sure, but once the data is mapped you
aren't dealing with the file system at all.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 20:10                         ` Larry McVoy
@ 2001-12-06 20:15                           ` David S. Miller
  2001-12-06 20:21                             ` Larry McVoy
  2001-12-06 22:38                           ` Alan Cox
  1 sibling, 1 reply; 75+ messages in thread
From: David S. Miller @ 2001-12-06 20:15 UTC (permalink / raw)
  To: lm
  Cc: phillips, davidel, rusty, Martin.Bligh, riel, lars.spam, alan,
	hps, linux-kernel

   From: Larry McVoy <lm@bitmover.com>
   Date: Thu, 6 Dec 2001 12:10:04 -0800

   Huh?  Of course not, they'd use mutexes in a mmap-ed file, which uses
   the hardware's coherency.  No locks in the vfs or fs, that's all done
   in the mmap/page fault path for sure, but once the data is mapped you
   aren't dealing with the file system at all.

We're talking about two things.

Once the data is MMAP'd, sure things are coherent just like on any
other SMP, for the user.

But HOW DID YOU GET THERE?  That is the question you are avoiding.
How do I look up "/etc/passwd" in the filesystem on a ccCluster?
How does OS image 1 see the same "/etc/passwd" as OS image 2?

If you aren't getting rid of this locking, what is the point?
That is what we are trying to talk about.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 20:15                           ` David S. Miller
@ 2001-12-06 20:21                             ` Larry McVoy
  2001-12-06 21:02                               ` David S. Miller
                                                 ` (2 more replies)
  0 siblings, 3 replies; 75+ messages in thread
From: Larry McVoy @ 2001-12-06 20:21 UTC (permalink / raw)
  To: David S. Miller
  Cc: lm, phillips, davidel, rusty, Martin.Bligh, riel, lars.spam, alan,
	hps, linux-kernel

On Thu, Dec 06, 2001 at 12:15:54PM -0800, David S. Miller wrote:
>    From: Larry McVoy <lm@bitmover.com>
>    Date: Thu, 6 Dec 2001 12:10:04 -0800
> 
>    Huh?  Of course not, they'd use mutexes in a mmap-ed file, which uses
>    the hardware's coherency.  No locks in the vfs or fs, that's all done
>    in the mmap/page fault path for sure, but once the data is mapped you
>    aren't dealing with the file system at all.
> 
> We're talking about two things.
> 
> Once the data is MMAP'd, sure things are coherent just like on any
> other SMP, for the user.
> 
> But HOW DID YOU GET THERE?  That is the question you are avoiding.
> How do I look up "/etc/passwd" in the filesystem on a ccCluster?
> How does OS image 1 see the same "/etc/passwd" as OS image 2?
> 
> If you aren't getting rid of this locking, what is the point?
> That is what we are trying to talk about.

The points are:

a) you have to thread the entire kernel, every data structure which is a
   problem.  Scheduler, networking, device drivers, everything.  That's
   thousands of locks and uncountable bugs, not to mention the impact on
   uniprocessor performance.

b) I have to thread a file system.

So I'm not saying that I'll thread less in the file system (actually I am,
but let's skip that for now and assume I have to do everything you have
to do).  All I'm saying is that I don't have to worry about the rest of
the kernel which is a huge savings.

You tell me - which is easier, multithreading the networking stack to 
64 way SMP or running 64 distinct networking stacks?
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 20:21                             ` Larry McVoy
@ 2001-12-06 21:02                               ` David S. Miller
  2001-12-06 22:27                                 ` Benjamin LaHaise
  2001-12-06 21:30                               ` Daniel Phillips
  2001-12-06 22:37                               ` Alan Cox
  2 siblings, 1 reply; 75+ messages in thread
From: David S. Miller @ 2001-12-06 21:02 UTC (permalink / raw)
  To: lm
  Cc: phillips, davidel, rusty, Martin.Bligh, riel, lars.spam, alan,
	hps, linux-kernel

   From: Larry McVoy <lm@bitmover.com>
   Date: Thu, 6 Dec 2001 12:21:16 -0800

   You tell me - which is easier, multithreading the networking stack to 
   64 way SMP or running 64 distinct networking stacks?

We've done %90 of the "other stuff" already, why waste the work?
We've done the networking, we've done the scheduler, and the
networking/block drivers are there too.

I was actually pretty happy with how easy (relatively) the networking
was to thread nicely.

The point is, you have to make a captivating argument for ccClusters,
what does it buy us now that we've done a lot of the work you are
telling us it will save?

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 21:02                               ` David S. Miller
@ 2001-12-06 22:27                                 ` Benjamin LaHaise
  2001-12-06 22:59                                   ` Alan Cox
  2001-12-06 23:08                                   ` David S. Miller
  0 siblings, 2 replies; 75+ messages in thread
From: Benjamin LaHaise @ 2001-12-06 22:27 UTC (permalink / raw)
  To: David S. Miller
  Cc: lm, phillips, davidel, rusty, Martin.Bligh, riel, lars.spam, alan,
	hps, linux-kernel

On Thu, Dec 06, 2001 at 01:02:02PM -0800, David S. Miller wrote:
> We've done %90 of the "other stuff" already, why waste the work?
> We've done the networking, we've done the scheduler, and the
> networking/block drivers are there too.

The scheduler doesn't scale too well...

> I was actually pretty happy with how easy (relatively) the networking
> was to thread nicely.
> 
> The point is, you have to make a captivating argument for ccClusters,
> what does it buy us now that we've done a lot of the work you are
> telling us it will save?

The most captivating arguments are along the following lines:

	- scales perfectly across NUMA fabrics: there are a number of 
	  upcoming architechures (hammer, power4, others) where the 
	  latency costs on remote memory are significantly higher.  By
	  making the entire kernel local, we'll see optimal performance 
	  for local operations, with good performance for the remote 
	  actions (the ccClusterFS should be very low overhead).
	- opens up a number of possibilities in terms of serviceability: 
	  if a chunk of the system is taken offline, only the one kernel 
	  group has to go away.  Useful in containing failures.
	- lower overhead for SMP systems.  We can use UP kernels local 
	  to each CPU.  Should make kernel compiles faster. ;-)

At the very least it is well worth investigating.  Bootstrapping the 
ccCluster work shouldn't take more than a week or so, which will let 
us attach some hard numbers to the kind of impact it has on purely 
cpu local tasks.

		-ben
-- 
Fish.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 22:27                                 ` Benjamin LaHaise
@ 2001-12-06 22:59                                   ` Alan Cox
  2001-12-06 23:08                                   ` David S. Miller
  1 sibling, 0 replies; 75+ messages in thread
From: Alan Cox @ 2001-12-06 22:59 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: David S. Miller, lm, phillips, davidel, rusty, Martin.Bligh, riel,
	lars.spam, alan, hps, linux-kernel

> On Thu, Dec 06, 2001 at 01:02:02PM -0800, David S. Miller wrote:
> > We've done %90 of the "other stuff" already, why waste the work?
> > We've done the networking, we've done the scheduler, and the
> > networking/block drivers are there too.
> 
> The scheduler doesn't scale too well...

Understatement. However retrofitting a real scheduler doesn't break the
scalability of the system IMHO.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 22:27                                 ` Benjamin LaHaise
  2001-12-06 22:59                                   ` Alan Cox
@ 2001-12-06 23:08                                   ` David S. Miller
  2001-12-06 23:26                                     ` Larry McVoy
  1 sibling, 1 reply; 75+ messages in thread
From: David S. Miller @ 2001-12-06 23:08 UTC (permalink / raw)
  To: bcrl
  Cc: lm, phillips, davidel, rusty, Martin.Bligh, riel, lars.spam, alan,
	hps, linux-kernel

   From: Benjamin LaHaise <bcrl@redhat.com>
   Date: Thu, 6 Dec 2001 17:27:08 -0500

   	- lower overhead for SMP systems.  We can use UP kernels local 
   	  to each CPU.  Should make kernel compiles faster. ;-)
   
Actually, this isn't what is being proposed.  Something like
"4 cpu" SMP kernels.

   At the very least it is well worth investigating.  Bootstrapping the 
   ccCluster work shouldn't take more than a week or so, which will let 
   us attach some hard numbers to the kind of impact it has on purely 
   cpu local tasks.
   
I think it is worth considering too, but I don't know if a week
estimate is sane or not :-)

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 23:08                                   ` David S. Miller
@ 2001-12-06 23:26                                     ` Larry McVoy
  2001-12-07  2:49                                       ` Adam Keys
  0 siblings, 1 reply; 75+ messages in thread
From: Larry McVoy @ 2001-12-06 23:26 UTC (permalink / raw)
  To: David S. Miller
  Cc: bcrl, lm, phillips, davidel, rusty, Martin.Bligh, riel, lars.spam,
	alan, hps, linux-kernel

On Thu, Dec 06, 2001 at 03:08:47PM -0800, David S. Miller wrote:
>    From: Benjamin LaHaise <bcrl@redhat.com>
>    Date: Thu, 6 Dec 2001 17:27:08 -0500
> 
>    	- lower overhead for SMP systems.  We can use UP kernels local 
>    	  to each CPU.  Should make kernel compiles faster. ;-)
>    
> Actually, this isn't what is being proposed.  Something like
> "4 cpu" SMP kernels.

I personally want to cluster small SMP OS images because I don't want to
do the process migration crap anywhere except at exec time, it simplifies
a lot.  So I need a second order load balancing term that I can get
from 2 or 4 way smp nodes.  If you are willing to process migration to
handle load imbalances, then you could do uniprocessor only.  I think
the complexity tradeoff is in favor of the small SMP OS clusters, we
already have them.  Process migration is a rats nest.

>    At the very least it is well worth investigating.  Bootstrapping the 
>    ccCluster work shouldn't take more than a week or so, which will let 
>    us attach some hard numbers to the kind of impact it has on purely 
>    cpu local tasks.
>    
> I think it is worth considering too, but I don't know if a week
> estimate is sane or not :-)

Yeah, it's possible that you could get something booting in a week but I
think it's a bit harder than that too.  One idea that was kicked around
was to use Jeff's UML work and "boot" multiple UML's on top of a virtual
SMP.  You get things to work there and then do a "port" to real hardware.
Kind of a cool idea if you ask me.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 23:26                                     ` Larry McVoy
@ 2001-12-07  2:49                                       ` Adam Keys
  2001-12-07  4:40                                         ` Jeff Dike
  0 siblings, 1 reply; 75+ messages in thread
From: Adam Keys @ 2001-12-07  2:49 UTC (permalink / raw)
  To: Larry McVoy; +Cc: linux-kernel

On December 06, 2001 05:26, Larry McVoy wrote: > Yeah, it's possible that you 
could get something booting in a week but I > think it's a bit harder than 
that too. One idea that was kicked around > was to use Jeff's UML work and 
"boot" multiple UML's on top of a virtual > SMP. You get things to work there 
and then do a "port" to real hardware. > Kind of a cool idea if you ask me.

Point me in the right direction. After reading over your slides and SMP paper 
(still have the labs.pdf on my queue), it seemed to me that you could easily 
simulate what you want with lots of UML's talking to each other. I think you 
would need to create some kind of device that uses a file or a shared memory 
segment as the cluster's memory. Actually, I think that (shared memory) is 
how Jeff had intended on implementing SMP in UML anyway. At this point I 
don't think UML supports SMP though I know of at least one person who was 
attempting it.

Once said device would implemented, you could start working on the unique 
challenges ccClusters present. I guess this would be what you consider 
"bootstrapping", although I don't really know what that would entail at this 
point. Then you just need some bored college student :) to hack it out.

I've been negligent in following this mammoth link...cluebat me if you 
mentioned it somewhere upthread.

-- akk~

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-07  2:49                                       ` Adam Keys
@ 2001-12-07  4:40                                         ` Jeff Dike
  0 siblings, 0 replies; 75+ messages in thread
From: Jeff Dike @ 2001-12-07  4:40 UTC (permalink / raw)
  To: Adam Keys; +Cc: Larry McVoy, linux-kernel

akeys@post.cis.smu.edu said:
>  it seemed to me that you could easily  simulate what you want with
> lots of UML's talking to each other. I think you  would need to create
> some kind of device that uses a file or a shared memory  segment as
> the cluster's memory. 

Yeah, there is already support for mapping in a random file and using that
as UML memory, so that would be used for the cluster interconnect for any
cluster emulations you wanted to run with UML.

> Actually, I think that (shared memory) is  how
> Jeff had intended on implementing SMP in UML anyway.

No, at least not any shared memory that's not already there.  UML uses a 
host process for each UML process, and UML kernel data and text are
shared between all these host processes.  SMP just means having more than
one host process runnable at a time.  Each runnable process on the host
is a virtual processor.

> At this point I
> don't think UML supports SMP though I know of at least one person who
> was  attempting it.

It doesn't yet.  Someone is (or was), but I haven't heard a peep from him in
at least a month.  So this is starting to look like another little project
which got quickly going but just as quickly abandoned.

				Jeff

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 20:21                             ` Larry McVoy
  2001-12-06 21:02                               ` David S. Miller
@ 2001-12-06 21:30                               ` Daniel Phillips
  2001-12-07  8:54                                 ` Henning Schmiedehausen
  2001-12-06 22:37                               ` Alan Cox
  2 siblings, 1 reply; 75+ messages in thread
From: Daniel Phillips @ 2001-12-06 21:30 UTC (permalink / raw)
  To: Larry McVoy, David S. Miller
  Cc: lm, davidel, rusty, Martin.Bligh, riel, lars.spam, alan, hps,
	linux-kernel

On December 6, 2001 09:21 pm, Larry McVoy wrote:
> On Thu, Dec 06, 2001 at 12:15:54PM -0800, David S. Miller wrote:
> > If you aren't getting rid of this locking, what is the point?
> > That is what we are trying to talk about.
> 
> The points are:
> 
> a) you have to thread the entire kernel, every data structure which is a
>    problem.  Scheduler, networking, device drivers, everything.  That's
>    thousands of locks and uncountable bugs, not to mention the impact on
>    uniprocessor performance.
> 
> b) I have to thread a file system.

OK, this is your central point.  It's a little more than just a mmap, no?
We're pressing you on your specific ideas on how to handle the 'peripheral' 
details.

--
Daniel

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 21:30                               ` Daniel Phillips
@ 2001-12-07  8:54                                 ` Henning Schmiedehausen
  2001-12-07 16:06                                   ` Larry McVoy
  0 siblings, 1 reply; 75+ messages in thread
From: Henning Schmiedehausen @ 2001-12-07  8:54 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Larry McVoy, David S. Miller, davidel, rusty, Martin.Bligh, riel,
	lars.spam, alan, linux-kernel

On Thu, 2001-12-06 at 22:30, Daniel Phillips wrote:
> On December 6, 2001 09:21 pm, Larry McVoy wrote:
> > On Thu, Dec 06, 2001 at 12:15:54PM -0800, David S. Miller wrote:
> > > If you aren't getting rid of this locking, what is the point?
> > > That is what we are trying to talk about.
> > 
> > The points are:
> > 
> > a) you have to thread the entire kernel, every data structure which is a
> >    problem.  Scheduler, networking, device drivers, everything.  That's
> >    thousands of locks and uncountable bugs, not to mention the impact on
> >    uniprocessor performance.
> > 
> > b) I have to thread a file system.
> 
> OK, this is your central point.  It's a little more than just a mmap, no?
> We're pressing you on your specific ideas on how to handle the 'peripheral' 
> details.

How about creating one node as "master" and write a "cluster network
filesystem" which uses shared memory as its "network layer". 

Then boot all other nodes diskless from these cluster network
filesystems.

You can still have shared mmap (which I believe is Larry's toy point)
between the nodes but you avoid all of the filesystem locking issues,
because you're going over (a hopefully superfast) memory network
filesystem.

Or go iSCSI and attach a network device to each of the cluster node. Or
go 802.1q, attach a virtual network device to each cluster node, pull
all of them out over GigE and let some Cisco outside sort these out
again. :-)

What I don't like about the approach is the fact that all nodes should
share the same file system. One (at least IMHO) does not want this for
at least /etc. The "all nodes the same FS" works fine for your number
cruncher clusters where every node runs more or less the same software. 

It does not work for cluster boxes like the Starfire or 390. There you
want to boot totally different OS images on the nodes. No sense in
implementing a "threaded file system" that is not used.

	Regards
		Henning

-- 
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen       -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH     hps@intermeta.de

Am Schwabachgrund 22  Fon.: 09131 / 50654-0   info@intermeta.de
D-91054 Buckenhof     Fax.: 09131 / 50654-20   

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-07  8:54                                 ` Henning Schmiedehausen
@ 2001-12-07 16:06                                   ` Larry McVoy
  2001-12-07 16:44                                     ` Martin J. Bligh
  0 siblings, 1 reply; 75+ messages in thread
From: Larry McVoy @ 2001-12-07 16:06 UTC (permalink / raw)
  To: Henning Schmiedehausen
  Cc: Daniel Phillips, Larry McVoy, David S. Miller, davidel, rusty,
	Martin.Bligh, riel, lars.spam, alan, linux-kernel

> How about creating one node as "master" and write a "cluster network
> filesystem" which uses shared memory as its "network layer". 

Right.

> Then boot all other nodes diskless from these cluster network
> filesystems.

Wrong.  Give each node its own private boot fs.  Then mount /data.

> You can still have shared mmap (which I believe is Larry's toy point)
> between the nodes but you avoid all of the filesystem locking issues,
> because you're going over (a hopefully superfast) memory network
> filesystem.

There is no network, unless you consider the memory interconnect a 
network (I think the hardware guys would raise their eyebrows at 
that name).

> What I don't like about the approach is the fact that all nodes should
> share the same file system. One (at least IMHO) does not want this for
> at least /etc. 

Read through my other postings, I said that things are private by
default.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-07 16:06                                   ` Larry McVoy
@ 2001-12-07 16:44                                     ` Martin J. Bligh
  2001-12-07 17:23                                       ` Larry McVoy
  0 siblings, 1 reply; 75+ messages in thread
From: Martin J. Bligh @ 2001-12-07 16:44 UTC (permalink / raw)
  To: Larry McVoy, Henning Schmiedehausen; +Cc: linux-kernel

I'm taking mercy on people and trimming down the cc: list ...

>> What I don't like about the approach is the fact that all nodes should
>> share the same file system. One (at least IMHO) does not want this for
>> at least /etc. 
> 
> Read through my other postings, I said that things are private by
> default.

So if I understand you correctly, you're saying you have a private /etc for 
each instance of the sub-OS. Doesn't this make management of the system 
a complete pig? And require modifying many user level tools?

M.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-07 16:44                                     ` Martin J. Bligh
@ 2001-12-07 17:23                                       ` Larry McVoy
  2001-12-07 18:04                                         ` Martin J. Bligh
  2001-12-07 19:00                                         ` Daniel Bergman
  0 siblings, 2 replies; 75+ messages in thread
From: Larry McVoy @ 2001-12-07 17:23 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Larry McVoy, Henning Schmiedehausen, linux-kernel

On Fri, Dec 07, 2001 at 08:44:03AM -0800, Martin J. Bligh wrote:
> I'm taking mercy on people and trimming down the cc: list ...
> 
> >> What I don't like about the approach is the fact that all nodes should
> >> share the same file system. One (at least IMHO) does not want this for
> >> at least /etc. 
> > 
> > Read through my other postings, I said that things are private by
> > default.
> 
> So if I understand you correctly, you're saying you have a private /etc for 
> each instance of the sub-OS. Doesn't this make management of the system 
> a complete pig? And require modifying many user level tools?

My pay job is developing a distributed source management system which works
by replication.  We already have users who put all the etc files in it and
manage them that way.  Works great.  It's like rdist except it never screws
up and it has merging.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-07 17:23                                       ` Larry McVoy
@ 2001-12-07 18:04                                         ` Martin J. Bligh
  2001-12-07 18:23                                           ` Larry McVoy
  2001-12-07 19:00                                         ` Daniel Bergman
  1 sibling, 1 reply; 75+ messages in thread
From: Martin J. Bligh @ 2001-12-07 18:04 UTC (permalink / raw)
  To: Larry McVoy; +Cc: Henning Schmiedehausen, linux-kernel

> My pay job is developing a distributed source management system which works
> by replication.  We already have users who put all the etc files in it and
> manage them that way.  Works great.  It's like rdist except it never screws
> up and it has merging.

So would that mean I would need bitkeeper installed in order to change my
password? 

And IIRC, bitkeeper is not free either?

M.




^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-07 18:04                                         ` Martin J. Bligh
@ 2001-12-07 18:23                                           ` Larry McVoy
  2001-12-07 18:42                                             ` Martin J. Bligh
  0 siblings, 1 reply; 75+ messages in thread
From: Larry McVoy @ 2001-12-07 18:23 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Larry McVoy, Henning Schmiedehausen, linux-kernel

On Fri, Dec 07, 2001 at 10:04:11AM -0800, Martin J. Bligh wrote:
> > My pay job is developing a distributed source management system which works
> > by replication.  We already have users who put all the etc files in it and
> > manage them that way.  Works great.  It's like rdist except it never screws
> > up and it has merging.
> 
> So would that mean I would need bitkeeper installed in order to change my
> password? 

No, that's just one way to solve the problem.  Another way would be to have
a master/slave relationship between the replicas sort of like CVS.  In fact,
you could use CVS.

> And IIRC, bitkeeper is not free either?

Actually it is for this purpose.  You can either do open logging (probably
not what you want) or run it in single user mode which doesn't log, you
just lose the audit trail (all checkins look like they are made by root).

If I could figure out a way to allow the use of BK for /etc with out any
restrictions at all, and at the same time prevent people from just putting
all their source in /etc and shutting down our commercial revenue, I'd
do it in a heartbeat.  I'd *love it* if when I did an upgrade from Red Hat,
the config files were part of a BK repository and I just did a pull/merge
to join my local changes with whatever they've done.  That would be a huge
step in making sys admin a lot less problematic.  But this is more than a
bit off topic...
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-07 18:23                                           ` Larry McVoy
@ 2001-12-07 18:42                                             ` Martin J. Bligh
  2001-12-07 18:48                                               ` Larry McVoy
  0 siblings, 1 reply; 75+ messages in thread
From: Martin J. Bligh @ 2001-12-07 18:42 UTC (permalink / raw)
  To: Larry McVoy; +Cc: Henning Schmiedehausen, linux-kernel

>> So would that mean I would need bitkeeper installed in order to change my
>> password? 
> 
> No, that's just one way to solve the problem.  Another way would be to have
> a master/slave relationship between the replicas sort of like CVS.  In fact,
> you could use CVS.

I'm not sure that's any less vomitworthy. 

Keeping things simple that users and/or sysadmins have to deal with is a 
Good Thing (tm). I'd have the complexity in the kernel, where complexity 
is pushed to the kernel developers, thanks.

>> And IIRC, bitkeeper is not free either?
>
> (... some slighty twisted concept of free snipped.)
>
>  But this is more than a bit off topic...

No it's not that far off topic, my point is that you're shifting the complexity 
problems to other areas (eg. system mangement / the application level / 
filesystems / scheduler load balancing) rather than solving them.

Martin.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-07 18:42                                             ` Martin J. Bligh
@ 2001-12-07 18:48                                               ` Larry McVoy
  2001-12-07 19:06                                                 ` Martin J. Bligh
  0 siblings, 1 reply; 75+ messages in thread
From: Larry McVoy @ 2001-12-07 18:48 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Larry McVoy, Henning Schmiedehausen, linux-kernel

On Fri, Dec 07, 2001 at 10:42:00AM -0800, Martin J. Bligh wrote:
> >> So would that mean I would need bitkeeper installed in order to change my
> >> password? 
> > 
> > No, that's just one way to solve the problem.  Another way would be to have
> > a master/slave relationship between the replicas sort of like CVS.  In fact,
> > you could use CVS.
> 
> I'm not sure that's any less vomitworthy. 

You're right, it's so much better to manage all machines independently
so that they can get out of sync with each other.

Did you even consider that this is virtually identical to the problem
that a network of workstations or servers has?  Did it occur to you that
people have solved this problem in many different ways?  Or did you just
want to piss into the wind and enjoy the spray?

> Keeping things simple that users and/or sysadmins have to deal with is a 
> Good Thing (tm). I'd have the complexity in the kernel, where complexity 
> is pushed to the kernel developers, thanks.

Yeah, that's what I want, my password file management in the kernel.  
Brilliant.  Why didn't I think of that?

> >> And IIRC, bitkeeper is not free either?
> >
> > (... some slighty twisted concept of free snipped.)
> >
> >  But this is more than a bit off topic...
> 
> No it's not that far off topic, my point is that you're shifting the complexity 
> problems to other areas (eg. system mangement / the application level / 
> filesystems / scheduler load balancing) rather than solving them.

Whoops, you are so right, in order to work on OS scaling I'd better solve
password file management or the OS ideas are meaningless.  Uh huh.  I'll
get right on that, thanks for setting me straight here.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-07 18:48                                               ` Larry McVoy
@ 2001-12-07 19:06                                                 ` Martin J. Bligh
  0 siblings, 0 replies; 75+ messages in thread
From: Martin J. Bligh @ 2001-12-07 19:06 UTC (permalink / raw)
  To: Larry McVoy; +Cc: Henning Schmiedehausen, linux-kernel

> You're right, it's so much better to manage all machines independently
> so that they can get out of sync with each other.

No it's much better to just have one machine running one instance of
the OS so that it can't get out of sync with itself.

>> Keeping things simple that users and/or sysadmins have to deal with is a 
>> Good Thing (tm). I'd have the complexity in the kernel, where complexity 
>> is pushed to the kernel developers, thanks.
> 
> Yeah, that's what I want, my password file management in the kernel.  
> Brilliant.  Why didn't I think of that?

No, I want my password file management to be in a one file for the whole
machine. Where it is now. Without requiring syncronisation. If we put the 
complexity in the kernel to make the system scale running one OS we 
don't have the problem that you're creating at all.

>> No it's not that far off topic, my point is that you're shifting the complexity 
>> problems to other areas (eg. system mangement / the application level / 
>> filesystems / scheduler load balancing) rather than solving them.
> 
> Whoops, you are so right, in order to work on OS scaling I'd better solve
> password file management or the OS ideas are meaningless.  Uh huh.  I'll
> get right on that, thanks for setting me straight here.

If you don't chop the OS up into multiple instances, you don't have these
problems. If you create the problems, I expect you to solve them. 

You're not making the system as a whole scale, you're just pushing the
problems out somewhere else.

Martin.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-07 17:23                                       ` Larry McVoy
  2001-12-07 18:04                                         ` Martin J. Bligh
@ 2001-12-07 19:00                                         ` Daniel Bergman
  2001-12-07 19:07                                           ` Larry McVoy
  2001-12-09  9:24                                           ` Pavel Machek
  1 sibling, 2 replies; 75+ messages in thread
From: Daniel Bergman @ 2001-12-07 19:00 UTC (permalink / raw)
  To: Larry McVoy; +Cc: Martin J. Bligh, Henning Schmiedehausen, linux-kernel


> My pay job is developing a distributed source management system which works
> by replication.  We already have users who put all the etc files in it and
> manage them that way.  Works great.  It's like rdist except it never screws
> up and it has merging.

I'm just curious, what about security? Is this done in clear-text? 
Sounds dangerous to put /etc/shadow, for example, in clear-text on the
cable.

Sorry for getting off-topic.

Regards,
Daniel

--
Daniel Bergman
Phone: 08 - 55064278
Mobile: 08 - 6311430 
Fax: 08 - 59827056
Email: d-b@home.se

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-07 19:00                                         ` Daniel Bergman
@ 2001-12-07 19:07                                           ` Larry McVoy
  2001-12-09  9:24                                           ` Pavel Machek
  1 sibling, 0 replies; 75+ messages in thread
From: Larry McVoy @ 2001-12-07 19:07 UTC (permalink / raw)
  To: Daniel Bergman
  Cc: Larry McVoy, Martin J. Bligh, Henning Schmiedehausen,
	linux-kernel

On Fri, Dec 07, 2001 at 08:00:32PM +0100, Daniel Bergman wrote:
> > My pay job is developing a distributed source management system which works
> > by replication.  We already have users who put all the etc files in it and
> > manage them that way.  Works great.  It's like rdist except it never screws
> > up and it has merging.
> 
> I'm just curious, what about security? Is this done in clear-text? 
> Sounds dangerous to put /etc/shadow, for example, in clear-text on the
> cable.

BitKeeper can, and typically does, use ssh as a transport.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-07 19:00                                         ` Daniel Bergman
  2001-12-07 19:07                                           ` Larry McVoy
@ 2001-12-09  9:24                                           ` Pavel Machek
  1 sibling, 0 replies; 75+ messages in thread
From: Pavel Machek @ 2001-12-09  9:24 UTC (permalink / raw)
  To: Daniel Bergman
  Cc: Larry McVoy, Martin J. Bligh, Henning Schmiedehausen,
	linux-kernel

Hi!

> > My pay job is developing a distributed source management system which works
> > by replication.  We already have users who put all the etc files in it and
> > manage them that way.  Works great.  It's like rdist except it never screws
> > up and it has merging.
> 
> I'm just curious, what about security? Is this done in clear-text? 
> Sounds dangerous to put /etc/shadow, for example, in clear-text on the
> cable.

This is going over System Area Network. You don't encrypt your PCI, either.
-- 
"I do not steal MS software. It is not worth it."
                                -- Pavel Kankovsky

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 20:21                             ` Larry McVoy
  2001-12-06 21:02                               ` David S. Miller
  2001-12-06 21:30                               ` Daniel Phillips
@ 2001-12-06 22:37                               ` Alan Cox
  2001-12-06 22:35                                 ` Larry McVoy
  2 siblings, 1 reply; 75+ messages in thread
From: Alan Cox @ 2001-12-06 22:37 UTC (permalink / raw)
  To: Larry McVoy
  Cc: David S. Miller, lm, phillips, davidel, rusty, Martin.Bligh, riel,
	lars.spam, alan, hps, linux-kernel

>    problem.  Scheduler, networking, device drivers, everything.  That's
>    thousands of locks and uncountable bugs, not to mention the impact on
>    uniprocessor performance.

Most of my block drivers in Linux have one lock. The block queuing layer
has one lock which is often the same lock.

> You tell me - which is easier, multithreading the networking stack to 
> 64 way SMP or running 64 distinct networking stacks?

Which is easier. Managing 64 routers or managing 1 router ?

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 22:37                               ` Alan Cox
@ 2001-12-06 22:35                                 ` Larry McVoy
  2001-12-06 22:54                                   ` Alan Cox
  0 siblings, 1 reply; 75+ messages in thread
From: Larry McVoy @ 2001-12-06 22:35 UTC (permalink / raw)
  To: Alan Cox
  Cc: Larry McVoy, David S. Miller, phillips, davidel, rusty,
	Martin.Bligh, riel, lars.spam, hps, linux-kernel

On Thu, Dec 06, 2001 at 10:37:18PM +0000, Alan Cox wrote:
> >    problem.  Scheduler, networking, device drivers, everything.  That's
> >    thousands of locks and uncountable bugs, not to mention the impact on
> >    uniprocessor performance.
> 
> Most of my block drivers in Linux have one lock. The block queuing layer
> has one lock which is often the same lock.

Hooray!  That's great and that's the way I'd like to keep it.  Do you think
you can do that on a 64 way SMP?  Not much chance, right?

> > You tell me - which is easier, multithreading the networking stack to 
> > 64 way SMP or running 64 distinct networking stacks?
> 
> Which is easier. Managing 64 routers or managing 1 router ?

That's a red herring, there are not 64 routers in either picture, there
are 64 ethernet interfaces in both pictures.  So let me rephrase the
question: given 64 ethernets, 64 CPUs, on one machine, what's easier,
1 multithreaded networking stack or 64 independent networking stacks?
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 22:35                                 ` Larry McVoy
@ 2001-12-06 22:54                                   ` Alan Cox
  2001-12-07  2:34                                     ` Larry McVoy
  0 siblings, 1 reply; 75+ messages in thread
From: Alan Cox @ 2001-12-06 22:54 UTC (permalink / raw)
  To: Larry McVoy
  Cc: Alan Cox, Larry McVoy, David S. Miller, phillips, davidel, rusty,
	Martin.Bligh, riel, lars.spam, hps, linux-kernel

> > Most of my block drivers in Linux have one lock. The block queuing layer
> > has one lock which is often the same lock.
> 
> Hooray!  That's great and that's the way I'd like to keep it.  Do you think
> you can do that on a 64 way SMP?  Not much chance, right?

It wouldn't be a big problem to keep it that way on the well designed
hardware. The badly designed stuff (here Im thinking the NCR5380 I debugged
today since its fresh in my mind) I'd probably want 2 locks, one for queue
locking, one for request management.

> > Which is easier. Managing 64 routers or managing 1 router ?
> That's a red herring, there are not 64 routers in either picture, there
> are 64 ethernet interfaces in both pictures.  So let me rephrase the
> question: given 64 ethernets, 64 CPUs, on one machine, what's easier,
> 1 multithreaded networking stack or 64 independent networking stacks?

I think you miss the point. If I have to program the system as 64
independant stacks from the app level I'm going to go slowly mad


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 22:54                                   ` Alan Cox
@ 2001-12-07  2:34                                     ` Larry McVoy
  2001-12-07  2:50                                       ` David S. Miller
  0 siblings, 1 reply; 75+ messages in thread
From: Larry McVoy @ 2001-12-07  2:34 UTC (permalink / raw)
  To: Alan Cox
  Cc: Larry McVoy, David S. Miller, phillips, davidel, rusty,
	Martin.Bligh, riel, lars.spam, hps, linux-kernel

On Thu, Dec 06, 2001 at 10:54:03PM +0000, Alan Cox wrote:
> > That's a red herring, there are not 64 routers in either picture, there
> > are 64 ethernet interfaces in both pictures.  So let me rephrase the
> > question: given 64 ethernets, 64 CPUs, on one machine, what's easier,
> > 1 multithreaded networking stack or 64 independent networking stacks?
> 
> I think you miss the point. If I have to program the system as 64
> independant stacks from the app level I'm going to go slowly mad

Well, that depends.  Suppose the application is a webserver.  Not your
simple static page web server, that one is on a shared nothing cluster
already.  It's a webserver that has a big honkin' database, with lots
of data being updated all time, the classic sort of thing that a big
SMP can handle but a cluster could not.  Fair enough?

Now imagine that the system is a collection of little OS images, each
with their own file system, etc.  Except /home/httpd is mounted on 
a globally shared file system.  Each os image has its own set of 
interfaces, one or more, and its own http server.  Which updates 
data in /home/httpd.

Can you see that this is a non-issue?  For this application, the ccCluster
model works great.  The data is all in a shared file system, nice and 
coherent, the apps don't actually know there is another OS banging on the 
data, it all just works.

Wait, I'll admit this means that the apps have to be thread safe, but that's
true for the traditional SMP as well.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-07  2:34                                     ` Larry McVoy
@ 2001-12-07  2:50                                       ` David S. Miller
  0 siblings, 0 replies; 75+ messages in thread
From: David S. Miller @ 2001-12-07  2:50 UTC (permalink / raw)
  To: lm
  Cc: alan, phillips, davidel, rusty, Martin.Bligh, riel, lars.spam,
	hps, linux-kernel

   From: Larry McVoy <lm@bitmover.com>
   Date: Thu, 6 Dec 2001 18:34:51 -0800

   The data is all in a shared file system, nice and coherent, the
   apps don't actually know there is another OS banging on the data,
   it all just works.

Larry1: "One way to get the ccCluster scalability is by un-globalizing
         the filesystem"

Larry2: "Let me tell you about this great application of ccClusters,
	 it involves using a shared file system.  It all just works."

Either you're going to replicate everyone's content or you're going to
use a shared filesystem.  In one case you'll go fast but have the same
locking problems as a traditional SMP, in the other case you'll go
slow because you'll be replicating all the time.

Which is it :-)

What I suppose is coming up is some example application that really
doesn't need a shared filesystem, which I bet will be a quite obscure
one or at least obscure enough that it can't justify ccCluster all on
it's own.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 20:10                         ` Larry McVoy
  2001-12-06 20:15                           ` David S. Miller
@ 2001-12-06 22:38                           ` Alan Cox
  2001-12-06 22:32                             ` Larry McVoy
  1 sibling, 1 reply; 75+ messages in thread
From: Alan Cox @ 2001-12-06 22:38 UTC (permalink / raw)
  To: Larry McVoy
  Cc: Daniel Phillips, Larry McVoy, David S. Miller, davidel, rusty,
	Martin.Bligh, riel, lars.spam, alan, hps, linux-kernel

> the hardware's coherency.  No locks in the vfs or fs, that's all done
> in the mmap/page fault path for sure, but once the data is mapped you
> aren't dealing with the file system at all.

ftruncate

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 22:38                           ` Alan Cox
@ 2001-12-06 22:32                             ` Larry McVoy
  2001-12-06 22:48                               ` Alexander Viro
  2001-12-06 22:55                               ` Alan Cox
  0 siblings, 2 replies; 75+ messages in thread
From: Larry McVoy @ 2001-12-06 22:32 UTC (permalink / raw)
  To: Alan Cox
  Cc: Larry McVoy, Daniel Phillips, David S. Miller, davidel, rusty,
	Martin.Bligh, riel, lars.spam, hps, linux-kernel

On Thu, Dec 06, 2001 at 10:38:14PM +0000, Alan Cox wrote:
> > the hardware's coherency.  No locks in the vfs or fs, that's all done
> > in the mmap/page fault path for sure, but once the data is mapped you
> > aren't dealing with the file system at all.
> 
> ftruncate

I'm not sure what the point is.  We've already agreed that the multiple OS
instances will have synchonization to do for file operations, ftruncate
being one of them.

I thought the question was how N user processes do locking and my answer
stands: exactly like they'd do it on an SMP, with mutex_enter()/exit() on
some portion of the mapped file.  The mapped file is just a chunk of cache
coherent memory.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 22:32                             ` Larry McVoy
@ 2001-12-06 22:48                               ` Alexander Viro
  2001-12-06 22:55                               ` Alan Cox
  1 sibling, 0 replies; 75+ messages in thread
From: Alexander Viro @ 2001-12-06 22:48 UTC (permalink / raw)
  To: Larry McVoy
  Cc: Alan Cox, Daniel Phillips, David S. Miller, davidel, rusty,
	Martin.Bligh, riel, lars.spam, hps, linux-kernel

On Thu, 6 Dec 2001, Larry McVoy wrote:

> I'm not sure what the point is.  We've already agreed that the multiple OS
> instances will have synchonization to do for file operations, ftruncate
> being one of them.

That's nice.  But said operation involves serious wanking with metadata
and _that_ would better have exclusion with write(2) and some warranties
about pageouts.

You can do lockless get_block() and truncate().  And it will be a hive
of races always ready to break out.  We used to try that and it was
a fscking mess of unbelievable proportions - worse than that in full-blown
rename() support.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 22:32                             ` Larry McVoy
  2001-12-06 22:48                               ` Alexander Viro
@ 2001-12-06 22:55                               ` Alan Cox
  2001-12-06 23:15                                 ` Larry McVoy
  1 sibling, 1 reply; 75+ messages in thread
From: Alan Cox @ 2001-12-06 22:55 UTC (permalink / raw)
  To: Larry McVoy
  Cc: Alan Cox, Larry McVoy, Daniel Phillips, David S. Miller, davidel,
	rusty, Martin.Bligh, riel, lars.spam, hps, linux-kernel

> > ftruncate
> 
> I'm not sure what the point is.  We've already agreed that the multiple OS
> instances will have synchonization to do for file operations, ftruncate
> being one of them.
> 
> I thought the question was how N user processes do locking and my answer
> stands: exactly like they'd do it on an SMP, with mutex_enter()/exit() on
> some portion of the mapped file.  The mapped file is just a chunk of cache

ftrucate invalidates that memory under you, on all nodes. That means you do
end up needing cross node locking and your file operations simply won't lie
down and scale cleanly


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 22:55                               ` Alan Cox
@ 2001-12-06 23:15                                 ` Larry McVoy
  2001-12-06 23:19                                   ` David S. Miller
  0 siblings, 1 reply; 75+ messages in thread
From: Larry McVoy @ 2001-12-06 23:15 UTC (permalink / raw)
  To: Alan Cox
  Cc: Larry McVoy, Daniel Phillips, David S. Miller, davidel, rusty,
	Martin.Bligh, riel, lars.spam, hps, linux-kernel

On Thu, Dec 06, 2001 at 10:55:37PM +0000, Alan Cox wrote:
> > > ftruncate
> > 
> > I'm not sure what the point is.  We've already agreed that the multiple OS
> > instances will have synchonization to do for file operations, ftruncate
> > being one of them.
> > 
> > I thought the question was how N user processes do locking and my answer
> > stands: exactly like they'd do it on an SMP, with mutex_enter()/exit() on
> > some portion of the mapped file.  The mapped file is just a chunk of cache
> 
> ftrucate invalidates that memory under you, on all nodes. That means you do
> end up needing cross node locking and your file operations simply won't lie
> down and scale cleanly

Wait a second, you're missing something.  If you are going to make a single
OS image work on a 64 way (or whatever) SMP, you have all of these issues,
right?  I'm not introducing additional locking problems with the design.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 23:15                                 ` Larry McVoy
@ 2001-12-06 23:19                                   ` David S. Miller
  2001-12-06 23:32                                     ` Larry McVoy
  0 siblings, 1 reply; 75+ messages in thread
From: David S. Miller @ 2001-12-06 23:19 UTC (permalink / raw)
  To: lm
  Cc: alan, phillips, davidel, rusty, Martin.Bligh, riel, lars.spam,
	hps, linux-kernel

   From: Larry McVoy <lm@bitmover.com>
   Date: Thu, 6 Dec 2001 15:15:04 -0800

   Wait a second, you're missing something.  If you are going to make a single
   OS image work on a 64 way (or whatever) SMP, you have all of these issues,
   right?  I'm not introducing additional locking problems with the design.

And you're not taking any of them away from the VFS layer.  That is
where all the real fundamental current scaling problems are in the
Linux kernel.

That is why I spent so much timing describing the filesystem name path
locking problem, those are the major hurdles we have to go over.
Networking is old hat, people have done work to improve the scheduler
scaling, it's just these hand full of VFS layer issues that are
dogging us.

So my point is, if you're going to promote some "locking complexity"
advantage, I don't think that's where a ccCluster even makes a dent in
the viability spectrum.

Where it does have advantages are for things like offlining node
clusters in a NUMA system.  High availability et al.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 23:19                                   ` David S. Miller
@ 2001-12-06 23:32                                     ` Larry McVoy
  2001-12-06 23:47                                       ` David S. Miller
  0 siblings, 1 reply; 75+ messages in thread
From: Larry McVoy @ 2001-12-06 23:32 UTC (permalink / raw)
  To: David S. Miller
  Cc: lm, alan, phillips, davidel, rusty, Martin.Bligh, riel, lars.spam,
	hps, linux-kernel

On Thu, Dec 06, 2001 at 03:19:45PM -0800, David S. Miller wrote:
>    From: Larry McVoy <lm@bitmover.com>
>    Date: Thu, 6 Dec 2001 15:15:04 -0800
>    
>    Wait a second, you're missing something.  If you are going to make a single
>    OS image work on a 64 way (or whatever) SMP, you have all of these issues,
>    right?  I'm not introducing additional locking problems with the design.
> 
> And you're not taking any of them away from the VFS layer.  That is
> where all the real fundamental current scaling problems are in the
> Linux kernel.

Sure I am, but I haven't told you how.  Suppose that your current
VFS can handle N cpus byt you have N*M cpus.  Take a look at
http://bitmover.com/lm/papers/bigfoot.ps and imagine applying that
technique here.  To summarize what I'm proposing, the locking problems are
because too many cpus want at the same data structures at the same time.
One way to solve that is to fine grain thread the data structures, and
that is a pain in the ass.  Another way to solve it may be to "stripe"
the file "servers".  Imagine each CPU serving up a part of a bigfoot
file system.  I've just reduced the scaling problems by a factor of M.

And, the ccCluster approach moves most of the nasty locking
problems into a ccCluster specific filesystem rather than buggering up
the generic paths.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 23:32                                     ` Larry McVoy
@ 2001-12-06 23:47                                       ` David S. Miller
  2001-12-07  0:17                                         ` Larry McVoy
  0 siblings, 1 reply; 75+ messages in thread
From: David S. Miller @ 2001-12-06 23:47 UTC (permalink / raw)
  To: lm
  Cc: alan, phillips, davidel, rusty, Martin.Bligh, riel, lars.spam,
	hps, linux-kernel

   From: Larry McVoy <lm@bitmover.com>
   Date: Thu, 6 Dec 2001 15:32:57 -0800

   And, the ccCluster approach moves most of the nasty locking
   problems into a ccCluster specific filesystem rather than buggering
   up the generic paths.

I still don't believe this, you are still going to need a lot of
generic VFS threading.  This is why myself and others keep talking
about ftruncate(), namei() et al.

If I look up "/etc" on bigfoot, littletoe, or whatever fancy name you
want to call the filesystem setup, SOMETHING has to control access to
the path name components (ie. there has to be locking).

You are not "N*M scaling" lookups on filesystem path components.
In fact, bigfoot sounds like it would make path name traversal more
heavyweight than it is now because these stripes need to coordinate
with each other somehow.

You keep saying "it'll be in the filesystem" over and over.  And the
point I'm trying to make is that this is not going to do away with the
fundamental problems.  They are still there with a ccCluster, they are
still there with bigfoot, and you are not getting N*M scaling on
filesystem name component walks.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 23:47                                       ` David S. Miller
@ 2001-12-07  0:17                                         ` Larry McVoy
  2001-12-07  2:37                                           ` David S. Miller
  0 siblings, 1 reply; 75+ messages in thread
From: Larry McVoy @ 2001-12-07  0:17 UTC (permalink / raw)
  To: David S. Miller
  Cc: lm, alan, phillips, davidel, rusty, Martin.Bligh, riel, lars.spam,
	hps, linux-kernel

On Thu, Dec 06, 2001 at 03:47:35PM -0800, David S. Miller wrote:
>    From: Larry McVoy <lm@bitmover.com>
>    Date: Thu, 6 Dec 2001 15:32:57 -0800
>    
>    And, the ccCluster approach moves most of the nasty locking
>    problems into a ccCluster specific filesystem rather than buggering
>    up the generic paths.
> 
> I still don't believe this, you are still going to need a lot of
> generic VFS threading.  This is why myself and others keep talking
> about ftruncate(), namei() et al.
> 
> If I look up "/etc" on bigfoot, littletoe, or whatever fancy name you
> want to call the filesystem setup, SOMETHING has to control access to
> the path name components (ie. there has to be locking).

Sure, but you are assuming one file system, which is global.  That's
certainly one way to do it, but not the only way, and not the way that
I've suggested.  I'm not sure if you remember this, but I always advocated
partially shared and partially private.  /, for example, is private.
In fact, so is /proc, /tmp, and /dev.  There are /gproc, /gtmp, and /gdev
which are in the global namespace and do for the cluster what /<xxx>
does for a regular machine.

We can go around and around on this and the end result will be that I will
have narrowed the locking problem down to the point that only the processes
which are actually using the resource have to participate in the locking.
In a traditional SMP OS, all processes have to participate.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-07  0:17                                         ` Larry McVoy
@ 2001-12-07  2:37                                           ` David S. Miller
  2001-12-07  2:43                                             ` Larry McVoy
  0 siblings, 1 reply; 75+ messages in thread
From: David S. Miller @ 2001-12-07  2:37 UTC (permalink / raw)
  To: lm
  Cc: alan, phillips, davidel, rusty, Martin.Bligh, riel, lars.spam,
	hps, linux-kernel

   From: Larry McVoy <lm@bitmover.com>
   Date: Thu, 6 Dec 2001 16:17:44 -0800

   There are /gproc, /gtmp, and /gdev
   which are in the global namespace and do for the cluster what /<xxx>
   does for a regular machine.

And /getc, which is where my /getc/passwd is going to be.

   We can go around and around on this and the end result will be that
   I will have narrowed the locking problem down to the point that
   only the processes which are actually using the resource have to
   participate in the locking.  In a traditional SMP OS, all processes
   have to participate.

We can split up name spaces today with Al Viro's namespace
infrastructure.

But frankly for the cases where scalability matters, like a http
server, they are all going at the same files in a global file
space.

I still think ccClusters don't solve any new problems in the
locking space.  "I get rid of it by putting people on different
filesystems" is not an answer which is unique to ccClusters, current
systems can do that.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-07  2:37                                           ` David S. Miller
@ 2001-12-07  2:43                                             ` Larry McVoy
  2001-12-07  2:59                                               ` David S. Miller
  2001-12-07  3:17                                               ` Martin J. Bligh
  0 siblings, 2 replies; 75+ messages in thread
From: Larry McVoy @ 2001-12-07  2:43 UTC (permalink / raw)
  To: David S. Miller
  Cc: lm, alan, phillips, davidel, rusty, Martin.Bligh, riel, lars.spam,
	hps, linux-kernel

> I still think ccClusters don't solve any new problems in the
> locking space.  "I get rid of it by putting people on different
> filesystems" is not an answer which is unique to ccClusters, current
> systems can do that.

If your point is that it doesn't solve any locking problems in the filesystem,
I'll almost grant you that.  Not quite because ccClusters open the door to 
different ways of solving problems that a traditional SMP doesn't.

However, where it wins big is on everything else.  Please explain to me how
you are going to make a scheduler that works for 64 CPUS that doesn't suck?
And explain to me how that will perform as well as N different scheduler
queues which I get for free.  Just as an example.  We can then go down the
path of every device driver, the networking stack, the process interfaces,
signals, etc.  

There is a hell of a lot of threading that has to go on to get to
64 cpus and it screws the heck out of the uniprocessor performance.
I think you want to prove how studly you are at threading, David,
but what you are really doing is proving that you are buried in the
trees and can't see the forest.  Pop up 50,000 feet and think about it.
Let's go have some beers and talk about it off line.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-07  2:43                                             ` Larry McVoy
@ 2001-12-07  2:59                                               ` David S. Miller
  2001-12-07  3:17                                               ` Martin J. Bligh
  1 sibling, 0 replies; 75+ messages in thread
From: David S. Miller @ 2001-12-07  2:59 UTC (permalink / raw)
  To: lm
  Cc: alan, phillips, davidel, rusty, Martin.Bligh, riel, lars.spam,
	hps, linux-kernel

   From: Larry McVoy <lm@bitmover.com>
   Date: Thu, 6 Dec 2001 18:43:27 -0800

   However, where it wins big is on everything else.  Please explain to me how
   you are going to make a scheduler that works for 64 CPUS that doesn't suck?

What stops me from basically doing a scheduler which ends up doing
what ccCluster does, groups of 4 cpu nodes?  Absolutely nothing of
course.  How is ccCluster unique in this regard then?

The scheduler is a mess right now only because Linus hates making
major changes to it.

   We can then go down the path of ... the networking stack ...
   signals ...

Done and done.  Device drivers are mostly done, and what was your
other category... oh process interfaces, those are done too.

In fact character devices are the only ugly area in 2.5.x, and
who really cares if TTYs scale to 64 cpus :-)  But this will get
mostly done anyways to kill off the global kernel lock completely.

   There is a hell of a lot of threading that has to go on to get to
   64 cpus and it screws the heck out of the uniprocessor performance.

Not with CONFIG_SMP turned off.  None of the interesting SMP overhead
hits the uniprocessor case.

Why do you keep talking about uniprocessor being screwed?  This is why
we have CONFIG_SMP, to nop the bulk of it out.

   I think you want to prove how studly you are at threading, David,

No, frankly I don't.

What I want is for you to show what is really unique and new about
ccClusters and what incredible doors are openned up by it.  So far I
have been shown ONE, and that is the high availability aspect.

To me, it is far from the holy grail you portray it to be.

   Let's go have some beers and talk about it off line.

How about posting some compelling arguments online first? :-)

It all boils down to the same shit currently.  "ccClusters lets you do
this", and this is leading to "but we can do that already today".

Franks a lot,
David S. Miller
davem@redhat.com

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-07  2:43                                             ` Larry McVoy
  2001-12-07  2:59                                               ` David S. Miller
@ 2001-12-07  3:17                                               ` Martin J. Bligh
  1 sibling, 0 replies; 75+ messages in thread
From: Martin J. Bligh @ 2001-12-07  3:17 UTC (permalink / raw)
  To: Larry McVoy, David S. Miller
  Cc: alan, phillips, davidel, rusty, riel, lars.spam, hps,
	linux-kernel

> However, where it wins big is on everything else.  Please explain to me how
> you are going to make a scheduler that works for 64 CPUS that doesn't suck?

Modifying the current scheduler to use multiple scheduler queues is not 
particularly hard.  It's been done already. See http://lse.sourceforge.net or 
Davide's work.

Please explain how you're going to load balance the scheduler queues across 
your system in a way that doesn't suck.

> And explain to me how that will perform as well as N different scheduler
> queues which I get for free.  Just as an example. 

I dispute that the work that you have to do up front to get ccClusters to work
is "free". It's free after you've done the work already. You may think it's 
*easier*, but that's different.

So we're going to do our work one step at a time in many small chunks, and
you're going to do it all at once in one big chunk (and not end up with as
cohesive a system out of the end of it) .... not necessarily a huge difference
in overall effort (though I know you think it is, the logic you're using to demonstrate
your point is fallacious).

Your objection to the "more traditional" way of scaling things (which seems
to be based around what you call the locking cliff) seems to be that it greatly
increases the complexity of the OS. I would say that splitting the system into
multiple kernels then trying to glue it all back together also greatly increases
the complexity of the OS. Oh, and the complexity of the applications that run 
on it too if they have to worry about seperate bits of the FS for each instance 
of the sub-OS.

Martin.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06  3:19             ` Davide Libenzi
  2001-12-06  7:56               ` David S. Miller
@ 2001-12-06 14:24               ` Rik van Riel
  2001-12-06 17:28                 ` Davide Libenzi
  1 sibling, 1 reply; 75+ messages in thread
From: Rik van Riel @ 2001-12-06 14:24 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Rusty Russell, David S. Miller, lm, Martin.Bligh, lars.spam, alan,
	hps, linux-kernel

On Wed, 5 Dec 2001, Davide Libenzi wrote:
> On Thu, 6 Dec 2001, Rusty Russell wrote:
>
> > I'd love to say that I can solve this with RCU, but it's vastly non-trivial
> > and I haven't got code, so I'm not going to say that. 8)
>
> Lockless algos could help if we're able to have "good" quiescent point
> inside the kernel. Or better have a good quiescent infrastructure to
> have lockless code to plug in.

Machines get dragged down by _uncontended_ locks, simply
due to cache line ping-pong effects.

regards,

Rik
-- 
DMCA, SSSCA, W3C?  Who cares?  http://thefreeworld.net/

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 14:24               ` Rik van Riel
@ 2001-12-06 17:28                 ` Davide Libenzi
  2001-12-06 17:52                   ` Rik van Riel
  0 siblings, 1 reply; 75+ messages in thread
From: Davide Libenzi @ 2001-12-06 17:28 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Rusty Russell, David S. Miller, lm, Martin J. Bligh, lars.spam,
	Alan Cox, hps, lkml

On Thu, 6 Dec 2001, Rik van Riel wrote:

> On Wed, 5 Dec 2001, Davide Libenzi wrote:
> > On Thu, 6 Dec 2001, Rusty Russell wrote:
> >
> > > I'd love to say that I can solve this with RCU, but it's vastly non-trivial
> > > and I haven't got code, so I'm not going to say that. 8)
> >
> > Lockless algos could help if we're able to have "good" quiescent point
> > inside the kernel. Or better have a good quiescent infrastructure to
> > have lockless code to plug in.
>
> Machines get dragged down by _uncontended_ locks, simply
> due to cache line ping-pong effects.

Rik, i think you're confused about lockless algos.
It's not an rwlock where the reader has to dirty a cacheline in any case,
the reader simply does _not_ write any cache line accessing the
list/hash/tree or whatever you use.
These algo uses barries and all changes are done when the system walk
through a quiescent state by flushing a list-of-changes.
Drawback, you've to be able to tollerate stale data.

- Davide

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 17:28                 ` Davide Libenzi
@ 2001-12-06 17:52                   ` Rik van Riel
  2001-12-06 18:10                     ` Davide Libenzi
  0 siblings, 1 reply; 75+ messages in thread
From: Rik van Riel @ 2001-12-06 17:52 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Rusty Russell, David S. Miller, lm, Martin J. Bligh, lars.spam,
	Alan Cox, hps, lkml

On Thu, 6 Dec 2001, Davide Libenzi wrote:

> > Machines get dragged down by _uncontended_ locks, simply
> > due to cache line ping-pong effects.
>
> Rik, i think you're confused about lockless algos.
> It's not an rwlock where the reader has to dirty a cacheline in any case,
> the reader simply does _not_ write any cache line accessing the
> list/hash/tree or whatever you use.

Hmmm indeed, so the cache lines can be shared as long
as the data is mostly read-only. I think I see it now.

However, this would only work for data which is mostly
read-only, not for anything else...

Rik
-- 
DMCA, SSSCA, W3C?  Who cares?  http://thefreeworld.net/

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-06 17:52                   ` Rik van Riel
@ 2001-12-06 18:10                     ` Davide Libenzi
  0 siblings, 0 replies; 75+ messages in thread
From: Davide Libenzi @ 2001-12-06 18:10 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Rusty Russell, David S. Miller, lm, Martin J. Bligh, lars.spam,
	Alan Cox, hps, lkml

On Thu, 6 Dec 2001, Rik van Riel wrote:

> On Thu, 6 Dec 2001, Davide Libenzi wrote:
>
> > > Machines get dragged down by _uncontended_ locks, simply
> > > due to cache line ping-pong effects.
> >
> > Rik, i think you're confused about lockless algos.
> > It's not an rwlock where the reader has to dirty a cacheline in any case,
> > the reader simply does _not_ write any cache line accessing the
> > list/hash/tree or whatever you use.
>
> Hmmm indeed, so the cache lines can be shared as long
> as the data is mostly read-only. I think I see it now.
>
> However, this would only work for data which is mostly
> read-only, not for anything else...

yes of course, but in such case these methods could help solving cache
issues over traditional rwlocks.



- Davide



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-05  3:23       ` Larry McVoy
  2001-12-05  6:05         ` David S. Miller
@ 2001-12-05  8:12         ` Momchil Velikov
  1 sibling, 0 replies; 75+ messages in thread
From: Momchil Velikov @ 2001-12-05  8:12 UTC (permalink / raw)
  To: Larry McVoy
  Cc: David S. Miller, Martin.Bligh, riel, lars.spam, alan, hps,
	linux-kernel

>>>>> "Larry" == Larry McVoy <lm@bitmover.com> writes:
Larry>    Where this approach wins big is everywhere except the page cache.  Every
Larry>    single data structure in the system becomes N-way more parallel -- with
Larry>    no additional locks -- when you boot up N instances of the OS.  That's

I was wondering about multiple OS instances in their own address
space. What's the need for separate address spaces for the kernels ?

It looks more natural to me to _actually_ have N instances of kernel
data structures in the _same address space_, i.e. turning
each global variable into an array_ indexed by an "instance id",
much in the same way as we have now per-CPU structures. Well,
I don't actually think it would be as simple as stated above, I'm just
proposing it as a general approach towards ccclustering.

(btw, there was some discussion on #kernelnewbies, on Nov 12th and
21st, you can find the logs here
http://vengeance.et.tudelft.nl/~smoke/log/kernelnewbies/2001-11/)

Regards,
-velco

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-05  2:36     ` SMP/cc Cluster description David S. Miller
  2001-12-05  3:23       ` Larry McVoy
@ 2001-12-05  3:25       ` Davide Libenzi
  1 sibling, 0 replies; 75+ messages in thread
From: Davide Libenzi @ 2001-12-05  3:25 UTC (permalink / raw)
  To: David S. Miller
  Cc: lm, Martin.Bligh, riel, lars.spam, alan, hps, linux-kernel

On Tue, 4 Dec 2001, David S. Miller wrote:

>    From: Larry McVoy <lm@bitmover.com>
>    Date: Tue, 4 Dec 2001 16:36:46 -0800
>
>    OK, so start throwing stones at this.  Once we have a memory model that
>    works, I'll go through the process model.
>
> What is the difference between your messages and spin locks?
> Both seem to shuffle between cpus anytime anything interesting
> happens.
>
> In the spinlock case, I can thread out the locks in the page cache
> hash table so that the shuffling is reduced.  In the message case, I
> always have to talk to someone.

Time ago I read an interesting article that implemented shared memory over
network ( ATM in that case ) reproducing in large scale the
cache/memory/bus computer architecture.
Shared memory on each node was the equivalent of the CPU cache, a "generic
shared memory repository" was the equivalent of the main memory and the
snooping traffic was running on the network.
I think I picked it up from ACM but I can't find it right now.




- Davide



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: SMP/cc Cluster description
  2001-12-05  0:36   ` SMP/cc Cluster description [was Linux/Pro] Larry McVoy
  2001-12-05  2:36     ` SMP/cc Cluster description David S. Miller
@ 2001-12-05  3:17     ` Stephen Satchell
  1 sibling, 0 replies; 75+ messages in thread
From: Stephen Satchell @ 2001-12-05  3:17 UTC (permalink / raw)
  To: David S. Miller, lm
  Cc: Martin.Bligh, riel, lars.spam, alan, hps, linux-kernel

At 06:36 PM 12/4/01 -0800, David S. Miller wrote:
>What is the difference between your messages and spin locks?
>Both seem to shuffle between cpus anytime anything interesting
>happens.
>
>In the spinlock case, I can thread out the locks in the page cache
>hash table so that the shuffling is reduced.  In the message case, I
>always have to talk to someone.

While what I'm about to say has little bearing on the SMP/cc case:  one 
significant advantage of messages over spinlocks is being able to assign 
priority with low overhead in the quick-response real-time multi-CPU 
arena.  I worked with a cluster of up to 14 CPUs using something very much 
like NUMA in which task scheduling used a set of prioritized message 
queues.  The system I worked on was designed to break transaction-oriented 
tasks into a string of "work units" each of which could be processed very 
quickly -- on the order of three milliseconds or less.  (The limit of 14 
CPUs was set by the hardware used to implement the main system bus.)

I bring this up only because I have never seen a spinlock system that dealt 
with priority issues very well when under heavy load.

OK, I've said my piece, now I'll sit back and continue to watch your 
discussion.

Satch

^ permalink raw reply	[flat|nested] 75+ messages in thread

end of thread, other threads:[~2001-12-10 16:00 UTC | newest]

Thread overview: 75+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2001-12-07 19:14 SMP/cc Cluster description Dana Lacoste
2001-12-07 19:28 ` Larry McVoy
  -- strict thread matches above, loose matches on Subject: below --
2001-12-10 15:59 cardente, john
2001-12-06 22:20 cardente, john
2001-12-06 23:00 ` Jeff V. Merkey
2001-12-04 23:31 Linux/Pro [was Re: Coding style - a non-issue] Rik van Riel
2001-12-04 23:37 ` Martin J. Bligh
2001-12-05  0:36   ` SMP/cc Cluster description [was Linux/Pro] Larry McVoy
2001-12-05  2:36     ` SMP/cc Cluster description David S. Miller
2001-12-05  3:23       ` Larry McVoy
2001-12-05  6:05         ` David S. Miller
2001-12-05  6:51           ` Jeff Merkey
2001-12-06  2:52           ` Rusty Russell
2001-12-06  3:19             ` Davide Libenzi
2001-12-06  7:56               ` David S. Miller
2001-12-06  8:02                 ` Larry McVoy
2001-12-06  8:09                   ` David S. Miller
2001-12-06 18:27                     ` Jeff V. Merkey
2001-12-06 18:37                       ` Jeff V. Merkey
2001-12-06 18:36                         ` Martin J. Bligh
2001-12-06 18:45                           ` Jeff V. Merkey
2001-12-06 19:11                       ` Davide Libenzi
2001-12-06 19:34                         ` Jeff V. Merkey
2001-12-06 23:16                           ` David Lang
2001-12-07  2:56                             ` Jeff V. Merkey
2001-12-07  4:23                               ` David Lang
2001-12-07  5:45                                 ` Jeff V. Merkey
2001-12-06 19:42                   ` Daniel Phillips
2001-12-06 19:53                     ` Larry McVoy
2001-12-06 20:10                       ` Daniel Phillips
2001-12-06 20:10                         ` Larry McVoy
2001-12-06 20:15                           ` David S. Miller
2001-12-06 20:21                             ` Larry McVoy
2001-12-06 21:02                               ` David S. Miller
2001-12-06 22:27                                 ` Benjamin LaHaise
2001-12-06 22:59                                   ` Alan Cox
2001-12-06 23:08                                   ` David S. Miller
2001-12-06 23:26                                     ` Larry McVoy
2001-12-07  2:49                                       ` Adam Keys
2001-12-07  4:40                                         ` Jeff Dike
2001-12-06 21:30                               ` Daniel Phillips
2001-12-07  8:54                                 ` Henning Schmiedehausen
2001-12-07 16:06                                   ` Larry McVoy
2001-12-07 16:44                                     ` Martin J. Bligh
2001-12-07 17:23                                       ` Larry McVoy
2001-12-07 18:04                                         ` Martin J. Bligh
2001-12-07 18:23                                           ` Larry McVoy
2001-12-07 18:42                                             ` Martin J. Bligh
2001-12-07 18:48                                               ` Larry McVoy
2001-12-07 19:06                                                 ` Martin J. Bligh
2001-12-07 19:00                                         ` Daniel Bergman
2001-12-07 19:07                                           ` Larry McVoy
2001-12-09  9:24                                           ` Pavel Machek
2001-12-06 22:37                               ` Alan Cox
2001-12-06 22:35                                 ` Larry McVoy
2001-12-06 22:54                                   ` Alan Cox
2001-12-07  2:34                                     ` Larry McVoy
2001-12-07  2:50                                       ` David S. Miller
2001-12-06 22:38                           ` Alan Cox
2001-12-06 22:32                             ` Larry McVoy
2001-12-06 22:48                               ` Alexander Viro
2001-12-06 22:55                               ` Alan Cox
2001-12-06 23:15                                 ` Larry McVoy
2001-12-06 23:19                                   ` David S. Miller
2001-12-06 23:32                                     ` Larry McVoy
2001-12-06 23:47                                       ` David S. Miller
2001-12-07  0:17                                         ` Larry McVoy
2001-12-07  2:37                                           ` David S. Miller
2001-12-07  2:43                                             ` Larry McVoy
2001-12-07  2:59                                               ` David S. Miller
2001-12-07  3:17                                               ` Martin J. Bligh
2001-12-06 14:24               ` Rik van Riel
2001-12-06 17:28                 ` Davide Libenzi
2001-12-06 17:52                   ` Rik van Riel
2001-12-06 18:10                     ` Davide Libenzi
2001-12-05  8:12         ` Momchil Velikov
2001-12-05  3:25       ` Davide Libenzi
2001-12-05  3:17     ` Stephen Satchell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox