public inbox for linux-kernel@vger.kernel.org
* RE: SMP/cc Cluster description
@ 2001-12-07 19:14 Dana Lacoste
  2001-12-07 19:28 ` Larry McVoy
  0 siblings, 1 reply; 75+ messages in thread
From: Dana Lacoste @ 2001-12-07 19:14 UTC (permalink / raw)
  To: 'Larry McVoy', Martin J. Bligh; +Cc: linux-kernel

Man you guys are NUTS.

But this is a fun conversation so I'm going to join in.

> Did you even consider that this is virtually identical to the problem
> that a network of workstations or servers has?  Did it occur to you
> that people have solved this problem in many different ways?  Or did
> you just want to piss into the wind and enjoy the spray?

I may be a total tool here, but this question is really bugging me:

What, if any, advantages does your proposal have over (say) a Beowulf
cluster?  Why does having the cluster in one box seem a better solution
than having a Beowulf type cluster with a shared Network filesystem?

You've declared everything to be separate, so that I can't see
what's not separate any more :)

Is it just an issue of shared memory?  You want to be able to share
memory between processes on separate systems at high speed?  Why
not Myrinet then?  Yeah, it's slower, but the order of magnitude
reduction in cost compared to a 64 way SMP box makes this a trivial
decision in my books....

Or am I missing something really obvious here????

Dana Lacoste
Embedded Linux Developer (The OPPOSITE side of the scale)
Ottawa, Canada

^ permalink raw reply	[flat|nested] 75+ messages in thread
* RE: SMP/cc Cluster description
@ 2001-12-10 15:59 cardente, john
  0 siblings, 0 replies; 75+ messages in thread
From: cardente, john @ 2001-12-10 15:59 UTC (permalink / raw)
  To: 'Jeff V. Merkey '
  Cc: 'David S. Miller ', 'lm@bitmover.com ',
	'davidel@xmailserver.org ',
	'rusty@rustcorp.com.au ',
	'Martin.Bligh@us.ibm.com ',
	'riel@conectiva.com.br ', 'lars.spam@nocrew.org ',
	'alan@lxorguk.ukuu.org.uk ', 'hps@intermeta.de ',
	'linux-kernel@vger.kernel.org ',
	'jmerkey@timpanogas.org '

 

>I know what the PCI cards do.  I was the person who pushed
>Marty Albert, the Chairman of the Dolphin Board at the time, in
>1995 to pursue design work on them.  I also worked with Justin
>Rattner (I saw one of your early prototype boxes in 1996 in his labs).

Ahh, sometimes it's hard to gauge "understanding" on this list  ;-)
Good idea, BTW. For a while we looked into using those cards
to implement a non-cc NUMA cluster system. That was a while
ago, however, and I've managed to forget most of the details. Also,
with the assimilation of DG into EMC I've tossed most of my Dolphin
specs.


>Those stubs were awfully short for the lost slot in your
>system, and I am surprised you did not get signal skew.  Those
>stubs had to be 1.5 inches long :-).

Yes, I spent many hours in the lab hunting for signal integrity
issues. As you may guess, it was not always easy being a
third-party agent on an Intel bus...


>Wrong.  There is a small window where you can copy into a 
>remote nodes memory.

As I said above, I tossed my P2B spec so I can't refresh my memory
on this. Did this work like reflective memory, or do you scribble
on a piece of memory and then poke the card to send it to another node?
My guess is that the former prohibits the memory being cacheable,
while the latter relies on compliant SW and therefore doesn't afford
transparent cross-node memory references. Are either of these right?


>It's OK.  We love DG and your support of SCI.  Keep up the good 
>work.

Wish that I was, but sadly I'm not. DG was my first job after grad
school, and cutting my teeth on the ccNUMA stuff was simply an
outstanding experience.
Those were good days....

Thanks for the reply...
john

ps. I've got two of the older PCI cards sitting in my desk drawer.
Now you've got me considering pulling those guys out and having
some fun!!!

* RE: SMP/cc Cluster description
@ 2001-12-06 22:20 cardente, john
  2001-12-06 23:00 ` Jeff V. Merkey
  0 siblings, 1 reply; 75+ messages in thread
From: cardente, john @ 2001-12-06 22:20 UTC (permalink / raw)
  To: 'Jeff V. Merkey', David S. Miller
  Cc: lm, davidel, rusty, Martin.Bligh, riel, lars.spam, alan, hps,
	linux-kernel, jmerkey

Hi Jeff,

I was one of the primary SCI guys at DG for all of
their Intel based ccNUMA machines. I worked with
Dolphin closely on a variety of things for those
systems including micro-coding a modified/optimized
version of their SCI implementation as well as 
architecting and implementing changes to their SCI
coherency ASIC for the third (last) DG ccNUMA system.
Beyond that I was the primary coherency protocol 
person for the project and was responsible for making
sure we played nice with Intel's coherency protocol.

Getting to the point, I saw your post below and thought
there might be some confusion between what the DG boxes
did and what those PCI cards do. In the DG systems we
implemented ASICs that sat on the processor bus and
examined every memory reference to maintain system-wide
coherency. These evaluations were done for every bus
transaction at cache-line granularity. The chips acted
as bridges that enforced coherency between the local SMP
snoopy bus protocol and the SCI protocol used system-wide.
The essential point here is that only by being a part of
the coherency protocol on the processor bus were those
chips able to implement ccNUMA with cache-line-level
coherency.


The Dolphin PCI cards, however, cannot perform the same
function, because the PCI bus is outside the Intel
coherency domain and therefore lacks the visibility
and control needed to enforce coherency. Instead, those cards
only allow the explicit sending of messages across
SCI for use with clustering libraries like MPI. One could
use this kind of messaging protocol to implement explicit
coherency (as you noted), but the sharing granularity of
such a system is at the page level, not the cache line. There
have been many efforts to implement this kind of system,
and (if I recall correctly) they usually go under the
name of Shared Virtual Memory systems.


Anyway, there were two reasons for the post. First, if I've
been following the thread correctly, most of the discussion
up to this point has involved issues at the cache-line level,
which don't apply to a system built from Dolphin PCI cards.
Nor can one build such a system from those cards, and
I felt compelled to clear up any potential confusion. My
second, prideful, reason was to justify the cost of those
DG machines!!! (and NUMA-Qs, as they were very similar in
architecture).

take care, and please disregard if I misunderstood your
post or the thread...

john


-----Original Message-----
From: Jeff V. Merkey [mailto:jmerkey@vger.timpanogas.org]
Sent: Thursday, December 06, 2001 1:38 PM
To: David S. Miller
Cc: lm@bitmover.com; davidel@xmailserver.org; rusty@rustcorp.com.au;
Martin.Bligh@us.ibm.com; riel@conectiva.com.br; lars.spam@nocrew.org;
alan@lxorguk.ukuu.org.uk; hps@intermeta.de;
linux-kernel@vger.kernel.org; jmerkey@timpanogas.org
Subject: Re: SMP/cc Cluster description


On Thu, Dec 06, 2001 at 11:27:31AM -0700, Jeff V. Merkey wrote:

And also, if you download the SCI drivers in my area, and order
some SCI adapters from Dolphin in Albuquerque, NM, you can set up
a ccNUMA system with standard PCs.  Dolphin has 66MHz versions (and
a 133MHz version coming in the future) that run at almost a gigabyte
per second node-to-node over a parallel fabric.  The cross-sectional
SCI fabric bandwidth scales at O(2N) as you add nodes.

If you want to play around with ccNUMA on standard PCs, these
cards are relatively inexpensive, and allow you to set up some
powerful cc/SMP systems with explicit coherence.  The full
ccNUMA boxes from DG are expensive, however.  That way, instead
of everyone talking about it, you guys could get some cool
hardware and experiment with some of your rather forward-looking
and interesting ideas.

:-)

Jeff



> 
> 
> Guys,
> 
> I am the maintainer of SCI, the ccNUMA technology standard.  I know
> a lot about this stuff, and have been involved with SCI since
> 1994.  I work with it every day, and with the Dolphin guys on some huge
> supercomputer accounts, like Los Alamos and Sandia Labs in NM.
> I will tell you this from what I know.
> 
> A shared-everything approach is a programmer's dream come true,
> but you can forget getting reasonable fault tolerance with it.  The
> shared memory zealots want everyone to believe ccNUMA is better
> than sex, but it does not scale when compared to shared-nothing
> programming models.  There are also a lot of tough issues for dealing
> with failed nodes, and how you recover when people's memory is
> spread all over the place across a bunch of machines.
> 
> SCI scales better in ccNUMA, and all NUMA technologies scale very
> well when they are used with "Explicit Coherence" instead of the
> "Implicit Coherence" you get with SMP systems.
> Years of research by Dr. Justin Rattner at Intel's
> high-performance labs demonstrated that shared-nothing models scaled
> into the thousands of nodes, while all these shared-everything
> "Super SMP" approaches generally hit the wall at 64 processors.
> 
> SCI is the fastest shared nothing interface out there, and it also
> can do ccNUMA.  Sequent, Sun, DG and a host of other NUMA providers
> use Dolphin's SCI technology and have for years.   ccNUMA is useful 
> for applications that still assume a shared nothing approach but that
> use the ccNUMA and NUMA capabilities for better optimization.
> 
> Forget trying to recreate the COMA architecture of Kendall Square.
> The name was truly descriptive of what happened in this architecture
> when a node failed -- it went into a "COMA".  This whole discussion I
> have lived through before, and you will find that ccNUMA is virtually
> unimplementable on most general-purpose OSes.  And yes, there are
> a lot of products and software out there, but when you look under
> the covers (like ServerNet) you discover that their coherence models
> for the most part rely on push/pull explicit coherence models.
> 
> My 2 cents.
> 
> Jeff 
> 
> 
> 
> On Thu, Dec 06, 2001 at 12:09:32AM -0800, David S. Miller wrote:
> >    From: Larry McVoy <lm@bitmover.com>
> >    Date: Thu, 6 Dec 2001 00:02:16 -0800
> >    
> >    Err, Dave, that's *exactly* the point of the ccCluster stuff.  You get
> >    all that separation for every data structure for free.  Think about
> >    it a bit.  Aren't you going to feel a little bit stupid if you do all
> >    this work, one object at a time, and someone can come along and do the
> >    whole OS in one swoop?  Yeah, I'm spouting crap, it isn't that easy,
> >    but it is much easier than the route you are taking.
> > 
> > How does ccClusters avoid the file system namespace locking issues?
> > How do all the OS nodes see a consistent FS tree?
> > 
> > All the talk is about the "magic filesystem, thread it as much as you
> > want" and I'm telling you that is the fundamental problem, the
> > filesystem name space locking.
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/

* Re: Linux/Pro [was Re: Coding style - a non-issue]
@ 2001-12-04 23:31 Rik van Riel
  2001-12-04 23:37 ` Martin J. Bligh
  0 siblings, 1 reply; 75+ messages in thread
From: Rik van Riel @ 2001-12-04 23:31 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Lars Brinkhoff, Alan Cox, Larry McVoy, hps, linux-kernel

On Tue, 4 Dec 2001, Martin J. Bligh wrote:

> > Premise 3: it is far easier to take a bunch of operating system images
> >    and make them share the parts they need to share (i.e., the page
> >    cache), than to take a single image and pry it apart so that it
> >    runs well on N processors.
>
> Of course it's easier. But it seems like you're left with much more
> work to reiterate in each application you write to run on this thing.
> Do you want to do the work once in the kernel, or repeatedly in each
> application?

There seems to be a little misunderstanding here; from what
I gathered when talking to Larry, the idea behind ccClusters
is that they provide a single system image in a NUMA box, but
with separated operating system kernels.

Of course, this is close to what a "single" NUMA kernel often
ends up doing anyway, with much ugliness, so I think Larry's idea
of constructing NUMA OSes by making the individual kernels on the
nodes work together makes a lot of sense.

regards,

Rik
-- 
Shortwave goes a long way:  irc.starchat.net  #swl

http://www.surriel.com/		http://distro.conectiva.com/



end of thread, other threads:[~2001-12-10 16:00 UTC | newest]

Thread overview: 75+ messages
-- links below jump to the message on this page --
2001-12-07 19:14 SMP/cc Cluster description Dana Lacoste
2001-12-07 19:28 ` Larry McVoy
  -- strict thread matches above, loose matches on Subject: below --
2001-12-10 15:59 cardente, john
2001-12-06 22:20 cardente, john
2001-12-06 23:00 ` Jeff V. Merkey
2001-12-04 23:31 Linux/Pro [was Re: Coding style - a non-issue] Rik van Riel
2001-12-04 23:37 ` Martin J. Bligh
2001-12-05  0:36   ` SMP/cc Cluster description [was Linux/Pro] Larry McVoy
2001-12-05  2:36     ` SMP/cc Cluster description David S. Miller
2001-12-05  3:23       ` Larry McVoy
2001-12-05  6:05         ` David S. Miller
2001-12-05  6:51           ` Jeff Merkey
2001-12-06  2:52           ` Rusty Russell
2001-12-06  3:19             ` Davide Libenzi
2001-12-06  7:56               ` David S. Miller
2001-12-06  8:02                 ` Larry McVoy
2001-12-06  8:09                   ` David S. Miller
2001-12-06 18:27                     ` Jeff V. Merkey
2001-12-06 18:37                       ` Jeff V. Merkey
2001-12-06 18:36                         ` Martin J. Bligh
2001-12-06 18:45                           ` Jeff V. Merkey
2001-12-06 19:11                       ` Davide Libenzi
2001-12-06 19:34                         ` Jeff V. Merkey
2001-12-06 23:16                           ` David Lang
2001-12-07  2:56                             ` Jeff V. Merkey
2001-12-07  4:23                               ` David Lang
2001-12-07  5:45                                 ` Jeff V. Merkey
2001-12-06 19:42                   ` Daniel Phillips
2001-12-06 19:53                     ` Larry McVoy
2001-12-06 20:10                       ` Daniel Phillips
2001-12-06 20:10                         ` Larry McVoy
2001-12-06 20:15                           ` David S. Miller
2001-12-06 20:21                             ` Larry McVoy
2001-12-06 21:02                               ` David S. Miller
2001-12-06 22:27                                 ` Benjamin LaHaise
2001-12-06 22:59                                   ` Alan Cox
2001-12-06 23:08                                   ` David S. Miller
2001-12-06 23:26                                     ` Larry McVoy
2001-12-07  2:49                                       ` Adam Keys
2001-12-07  4:40                                         ` Jeff Dike
2001-12-06 21:30                               ` Daniel Phillips
2001-12-07  8:54                                 ` Henning Schmiedehausen
2001-12-07 16:06                                   ` Larry McVoy
2001-12-07 16:44                                     ` Martin J. Bligh
2001-12-07 17:23                                       ` Larry McVoy
2001-12-07 18:04                                         ` Martin J. Bligh
2001-12-07 18:23                                           ` Larry McVoy
2001-12-07 18:42                                             ` Martin J. Bligh
2001-12-07 18:48                                               ` Larry McVoy
2001-12-07 19:06                                                 ` Martin J. Bligh
2001-12-07 19:00                                         ` Daniel Bergman
2001-12-07 19:07                                           ` Larry McVoy
2001-12-09  9:24                                           ` Pavel Machek
2001-12-06 22:37                               ` Alan Cox
2001-12-06 22:35                                 ` Larry McVoy
2001-12-06 22:54                                   ` Alan Cox
2001-12-07  2:34                                     ` Larry McVoy
2001-12-07  2:50                                       ` David S. Miller
2001-12-06 22:38                           ` Alan Cox
2001-12-06 22:32                             ` Larry McVoy
2001-12-06 22:48                               ` Alexander Viro
2001-12-06 22:55                               ` Alan Cox
2001-12-06 23:15                                 ` Larry McVoy
2001-12-06 23:19                                   ` David S. Miller
2001-12-06 23:32                                     ` Larry McVoy
2001-12-06 23:47                                       ` David S. Miller
2001-12-07  0:17                                         ` Larry McVoy
2001-12-07  2:37                                           ` David S. Miller
2001-12-07  2:43                                             ` Larry McVoy
2001-12-07  2:59                                               ` David S. Miller
2001-12-07  3:17                                               ` Martin J. Bligh
2001-12-06 14:24               ` Rik van Riel
2001-12-06 17:28                 ` Davide Libenzi
2001-12-06 17:52                   ` Rik van Riel
2001-12-06 18:10                     ` Davide Libenzi
2001-12-05  8:12         ` Momchil Velikov
2001-12-05  3:25       ` Davide Libenzi
2001-12-05  3:17     ` Stephen Satchell
