* SMP Theory (was: Re: Interesting analysis of linux kernel threading by IBM)
From: dg50 @ 2000-01-24 22:46 UTC (permalink / raw)
To: linux-kernel
I've been reading the SMP thread and this is a truly educational and
fascinating discussion. How SMP works, and how much of a benefit it
provides has always been a bit of a mystery to me - and I think the light
is slowly coming on.
But I have a couple of (perhaps dumb) questions.
OK, if you have an n-way SMP box, then you have n processors with n (local)
caches sharing a single block of main system memory. If you then run a
threaded program (like a renderer) with a thread per processor, you wind up
with n threads all looking at a single block of shared memory - right?
OK, if a thread accesses (I assume writes, reading isn't destructive, is
it?) a memory location that another processor is "interested" in, then
you've invalidated that processor's local cache - so it has to be flushed
and refreshed. Have enough cross-talk between threads, and you can achieve
the worst-case scenario where every memory access flushes the cache of
every processor, totally defeating the purpose of the cache, and perhaps
even adding nontrivial cache-flushing overhead.
If this is indeed the case (please correct any misconceptions I have) then
it strikes me that perhaps the hardware design of SMP is broken. That
instead of sharing main memory, each processor should have its own main
memory. You connect the various main memory chunks to the "primary" CPU via
some sort of very wide, very fast memory bus, and then when you spawn a
thread, you instead do something more like a fork - copy the relevant
process and data to the child cpu's private main memory (perhaps via some
sort of blitter) over this bus, and then let that CPU go play in its own
sandbox for a while.
Which really is more like the "array of uni-processor boxen joined by a
network" model than it is current SMP - just with a REALLY fast&wide
network pipe that just happens to be in the same physical box.
Comments? Please feel free to reply privately if this is just too
entry-level for general discussion.
DG
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
* Re: SMP Theory (was: Re: Interesting analysis of linux kernel threading by IBM)
From: Jamie Lokier @ 2000-01-24 23:56 UTC (permalink / raw)
To: dg50; +Cc: linux-kernel
dg50@daimlerchrysler.com wrote:
> If this is indeed the case (please correct any misconceptions I have) then
> it strikes me that perhaps the hardware design of SMP is broken. That
> instead of sharing main memory, each processor should have its own main
> memory. You connect the various main memory chunks to the "primary" CPU via
> some sort of very wide, very fast memory bus, and then when you spawn a
> thread, you instead do something more like a fork - copy the relevant
> process and data to the child cpu's private main memory (perhaps via some
> sort of blitter) over this bus, and then let that CPU go play in its own
> sandbox for a while.
I think you just reinvented NUMA -- Non-Uniform Memory Access. Every
CPU can access the others' memory, but you really want them to
concentrate on their own. SGI does some boxes like that.
Linux even has a memory allocator which is moving in the direction of
supporting those things.
> Which really is more like the "array of uni-processor boxen joined by a
> network" model than it is current SMP - just with a REALLY fast&wide
> network pipe that just happens to be in the same physical box.
It's been proposed to have multiple instances of the OS running too,
instead of one OS running on all CPUs.
-- Jamie
* Re: SMP Theory (was: Re: Interesting analysis of linux kernel threading by IBM)
From: Larry McVoy @ 2000-01-25 0:54 UTC (permalink / raw)
To: dg50; +Cc: linux-kernel
: OK, if you have an n-way SMP box, then you have n processors with n (local)
: caches sharing a single block of main system memory. If you then run a
: threaded program (like a renderer) with a thread per processor, you wind up
: with n threads all looking at a single block of shared memory - right?
Yes. A really good reference for all of this, at an accessible and
general level, is ``Computer Architecture, A Quantitative Approach''
by Patterson and Hennessy, ISBN 1-55860-329-8. I have the second edition,
which was updated in '96. It's an excellent book, and I think that if all
the members of this list read it and understood it, that would be of
enormous benefit. It's a hardware book, but I'm of the school that says
OS people are part hardware people and should have more than a passing
understanding of hardware.
The chapter you want is Chapter 8: multiprocessors. But read the whole
thing, it's one of the most approachable yet useful texts in the field.
Contrast it to Knuth's works, which are without a doubt the definitive
works in his area, but are also not at all for the faint of heart.
I personally can rarely make head or tail of what Knuth is saying
without some outside help, but I can understand all of P&H on my own -
you can too.
: OK, if a thread accesses (I assume writes, reading isn't destructive, is
: it?) a memory location that another processor is "interested" in, then
: you've invalidated that processor's local cache - so it has to be flushed
: and refreshed.
Reading is better than writing, lots of people can read. Where they
read from depends on whether the L2 caches are write through or write back
(L1 caches I believe are 100% write through, I can't really see how a
write back one would ever work even on a uniprocessor unless all I/O's
caused a cache flush).
If you did your caches correctly, and you are using a snooping protocol,
then typically what happens is CPU 1 puts a request for location X on
the bus; CPU 5 has that data in a dirty cache line; CPU 5 invalidates
the memory op and responds as if it were memory, saying here's X.
(It's not the CPUs at all, by the way, it's the cache controllers.
They are quite async when compared to CPUs, though there are a lot of
interdependencies).
I believe it was the SGI Challenge which had a busted cache controller
and had to do a write back to main memory and then a read, causing
cache to cache transactions to actually be cache to memory, memory
to cache (which really hurts).
: Have enough cross-talk between threads, and you can achieve
: the worst-case scenario where every memory access flushes the cache of
: every processor, totally defeating the purpose of the cache, and perhaps
: even adding nontrivial cache-flushing overhead.
Yes, indeed. At SGI, where they had a very finely threaded kernel, cache
misses in the OS became a major bottleneck and they did a great deal of
restructuring to make things work better. The most typical thing was to
put the lock and the data most likely to be needed on the same cache line.
I.e.,
	struct widget {
		rwlock		w_lock;
		int		flags;
		struct widget	*next;
		/* more data after this */
	};

and then they would do

	for (p = list;
	     (read_lock(p) == OK) && (p->flags != whatever);
	     t = p->next, read_unlock(p), p = t)
		;
with the idea being that the common case was that one cache line miss would
get you the lock and the data that you most cared about.
This, by the way, is a REALLY STUPID IDEA. If you are locking at
this level, you really need to be looking for other ways to scale up
your system. You're working like crazy to support a model that the
hardware can't really support. As a part hardware and part software guy,
it breaks my heart to see the hardware guys bust their butts to give you
a system that works, only to watch the software guys abuse the sh*t out
of the system and make it look bad. On the other side of the coin, the
hardware guys do advertise this stuff as working and only later come out
and explain the limitations.
A great case in point is that the hardware guys, when they first started
doing SMP boxes, talked about how anything could run anywhere. Later,
they discovered that this was really false. They never said that, they
wrote a bunch of papers on ``cache affinity'' which is a nice way of
saying ``when you put a process on a CPU/cache, don't move it unless the
world will come to an end if you leave it there''. Mark Hahn's equations
that he posted about the scheduler have everything to do with this concept:
you have built up some state in the cache, and the last thing you want to do
is have to rebuild it somewhere else.
This is an area, sorry to pick on them again, that SGI just completely
screwed up in their scheduler. The simple and right answer is to put
processes on a CPU and leave them there until the system is dramatically
unbalanced. The complicated and slow thing to do is to rethink the
decision at every context switch. The SGI scheduler got it upside down,
you had to prove that you needed to stay on a cache rather than prove that
you needed to move. So I/O bound jobs got screwed because they never
ran long enough to have built up what was considered a cache foot print
so they were always consider candidates for movement. I got substantial
improvements in BDS (an extension to NFS that did ~ 100Mbyte/sec I/O)
by locking the BDS daemons down to individual processors, something the
scheduler could have trivially done correctly itself.
: If this is indeed the case (please correct any misconceptions I have) then
: it strikes me that perhaps the hardware design of SMP is broken. That
: instead of sharing main memory, each processor should have its own main
: memory. You connect the various main memory chunks to the "primary" CPU via
: some sort of very wide, very fast memory bus, and then when you spawn a
: thread, you instead do something more like a fork - copy the relevant
: process and data to the child cpu's private main memory (perhaps via some
: sort of blitter) over this bus, and then let that CPU go play in its own
: sandbox for a while.
What you have described is an SGI Origin. Each node in an Origin looks like:

	    [ 2x R10K CPU ]
	           |
	memory ----+---- I/O
	           |
	           |
	 interconnect to the rest of the nodes
Each card had 2 cpus, some memory, an I/O bus, and an interconnect bus
which connected the card to the rest of the system in a hypercube fabric
(think of that as a more scalable thing than a bus; more scalable yes,
but with a slight penalty for remote memory access).
The memory coherency was directory based rather than snoopy (snoopy
works when all caches see all transactions, but that doesn't scale.
Directory would appear to scale farther but guess what - the OS locks
completely thrashed the sh*t out of the directories, once again proving
that the illusion of shared, coherent memory is a false one).
Anyway, another thing that didn't work was moving memory. This is called
"page migration" and the problem with it is that the work that you do
has to substantially outweigh the cost of moving the page in the first
place and it usually doesn't. So in practice, at SGI and at customers,
the migration idea didn't pay off. Maybe things have changed since I
left, my info is pretty dated. I kind of doubt it but there must be
some SGI folks reading this who are willing to correct me.
: Which really is more like the "array of uni-processor boxen joined by a
: network" model than it is current SMP - just with a REALLY fast&wide
: network pipe that just happens to be in the same physical box.
Yup. Take a look at the book and the short papers I pointed to the other
day. You'll see a lot in both along these lines.
* Re: SMP Theory (was: Re: Interesting analysis of linux kernel threading by IBM)
From: Ralf Baechle @ 2000-01-25 2:38 UTC (permalink / raw)
To: Jamie Lokier; +Cc: dg50, linux-kernel
On Tue, Jan 25, 2000 at 12:56:45AM +0100, Jamie Lokier wrote:
> dg50@daimlerchrysler.com wrote:
> > If this is indeed the case (please correct any misconceptions I have) then
> > it strikes me that perhaps the hardware design of SMP is broken. That
> > instead of sharing main memory, each processor should have its own main
> > memory. You connect the various main memory chunks to the "primary" CPU via
> > some sort of very wide, very fast memory bus, and then when you spawn a
> > thread, you instead do something more like a fork - copy the relevant
> > process and data to the child cpu's private main memory (perhaps via some
> > sort of blitter) over this bus, and then let that CPU go play in its own
> > sandbox for a while.
>
> I think you just reinvented NUMA -- Non-Uniform Memory Access. Every
> CPU can access the others' memory, but you really want them to
> concentrate on their own. SGI does some boxes like that.
SGI does ccNUMA, cache-coherent NUMA. The difference is that, unlike in
`real' NUMA machines, each processor on a node has direct access to the
memory in every node. A node in an Origin system is a dual-CPU SMP system.
> Linux even has a memory allocator which is moving in the direction of
> supporting those things.
It's actually the start of the support for the Origin series, but other
systems are expected to jump on the bandwagon.
> > Which really is more like the "array of uni-processor boxen joined by a
> > network" model than it is current SMP - just with a REALLY fast&wide
> > network pipe that just happens to be in the same physical box.
>
> It's been proposed to have multiple instances of the OS running too,
> instead of one OS running on all CPUs.
Ok, but users and application developers still want the entire system to
feel like a single system.
Ralf
* Re: SMP Theory (was: Re: Interesting analysis of linux kernel threading by IBM)
From: Davide Libenzi @ 2000-01-25 10:47 UTC (permalink / raw)
To: David Schwartz, dg50, linux-kernel
Tuesday, January 25, 2000 3:18 AM
David Schwartz <davids@webmaster.com> wrote :
> If you wanted completely separate memory spaces for each processor, the
> current hardware will let you have it. Just separate the address space into
> logical chunks and code each processor only to use its chunk. The current
> hardware design lets you do exactly what you are suggesting. And if one
> processor does need to access the memory earmarked for another, the current
> hardware provides a fast way to do it.
100% agree, and it is faster than an ethernet connection between N separate
UP machines. Probably the cost of an N-way SMP machine is higher than that
of N single UP machines ( at least for PCs ), but that isn't linux-business,
is it ?
The cache-miss cost that an SMP architecture must sustain is :

    CMTc = Nm * F( Np * Ms * WTR )

where :

    CMTc = cache-miss time cost
    Nm   = number of memory shares
    Np   = number of processes sharing Ms
    Ms   = memory size shared among the Np processes
    WTR  = write touch rate ( statistical average ) at which the Np
           processes write-access Ms
    F    = a probably non-linear function depending on architecture, etc ...
This is an absolute value that _must_ be compared ( weighted ) against the
time spent by the single processes in computing, to ponder whether the
application design we've chosen for SMP is right, or even more, whether SMP
is the correct target for our app.
Take the rendering pipeline example.
At each step we read a bit of data ( think of it as stdin ), do a relatively
long computation on that data, and write another kind of data ( think of it
as stdout ) to be processed by the next pipeline step.
The step pattern can be expressed as :
RCCCCCCCCCCCCW
where R = read, C = compute and W = write.
Say we've a six step pipeline, so :
Nm = 5 ( 6 - 1 )
Np = 2
Ms = typically small ( triangles, scanlines, ... )
WTR = small compared with the computing times
We can think of Ms as a ( relatively big ) object set.
This increases Ms but also lengthens the computing path, so the weighted
cost evens out.
This is, IMVHO, a good candidate for SMP.
Consider now a typical data-centric application in which we have continuous
read-write cycles over the entire data set :
RCCWRCRWCCRCWC
If we can't split this data set into autonomous chunks of data, we have :
Nm = the number of threads we've split the app into
Np = typically equal to Nm
Ms = probably the entire data set
WTR = typically high because of the nature of the application
This is not a good candidate for SMP.
Typical examples of such applications are the ones in which the later steps
of the computing path must access data computed ( read as write-accessed )
by most of the previous steps.
Davide.
--
All this stuff is IMVHO
* Re: SMP Theory (was: Re: Interesting analysis of linux kernel threading by IBM)
From: Iain McClatchie @ 2000-01-25 19:39 UTC (permalink / raw)
To: Larry McVoy; +Cc: linux-kernel
One of the problems with this forum is that you can't hear the murmur
of assent ripple through the hardware design crowd when Larry rants
about this stuff. Larry has had his head out of the box for a long
time.
Look at the ASCI project. The intention was for SGI to build an
Origin with around 1000 CPUs. That Origin had extra cache coherence
directory RAM and special encodings in that RAM so that the hardware
could actually keep the memory across all 1000 CPUs coherent. We
added extra physical address bits to the R10K to make this machine
possible.
Last I heard, the machine is mostly programmed with message passing.
I remember having a talk with an O/S guy who was implementing some
sort of message delivery utility inside the O/S. This was when
Cellular IRIX was in development, and they were investigating having
the various O/S images talk to each other with messages across the
shared memory. Then someone found out the O/S images could signal
each other FASTER through the HIPPI connections than they could
through shared memory. That is, this machine had a HIPPI port local
to each O/S image, and all those HIPPI ports were connected together
via a HIPPI switch.
Those HIPPI connections were built with the _same_physical_link_ as
the shared memory - an 800 MB/s source-synchronous channel. But if
you're sending a message, it's better to have the I/O system just
send the bits one way than have the shared memory system do two round
trips, one to invalidate the mailbox buffer for writing and another to
process the remote cache miss to receive the message.
-Iain McClatchie
www.10xinc.com
iain@10xinc.com
650-364-0520 voice
650-364-0530 FAX
* Re: SMP Theory (was: Re: Interesting analysis of linux kernel threading by IBM)
From: Iain McClatchie @ 2000-01-25 21:26 UTC (permalink / raw)
To: Jeff V. Merkey; +Cc: Larry McVoy, linux-kernel
Jeff> How do you do fault tolerance on Shared-Everything (like what you
Jeff> describe)?
Disclaimer: I worked on CPUs, not the system controller, so I didn't
actually look at the Verilog.
At an electrical level, the links were designed to be hot-pluggable.
Each O2000 cabinet had I think 8 CPU cards in it (16 CPUs), and had
its own power supply and plug. Within a cabinet, the I/O cables were
hot pluggable, as were the power supplies, fans, etc. CPU cards were
not hot pluggable.
The hub chip on each CPU card connected to the shared memory network,
as well as the memory and the two CPUs. I think the hub had hardware
registers which granted memory access and I/O access to other hubs. A
CPU could deny access to its local resources from another CPU. IRIX
could use these capabilities to completely partition the machine, but
I don't think they got any better than a shared-everything vs
shared-nothing toggle, at least, not while I was there.
[I think I'm using the word image wrongly, but you know what I mean.]
The idea was to write an operating system that could transfer
processes between O/S images, reboot O/S images independently, and
tolerate different O/S revisions on different partitions of the
machine. That would allow online O/S upgrades, and perhaps even
replacement of whole cabinets of hardware.
I know the I/O system was set up to connect two different hub chips to
each I/O crossbar, to maintain access to the I/O resources should a
CPU card go down. You could (of course) have multiple network
adapters going to the same network, and spread the adapters across
different cabinets. I think you could also arrange to have fiber
channel disk arrays driven from two different cabinets. This could
have made it possible to hot plug a cabinet while the machine was
online, without taking down any more processes or disks or whatever
than had already been taken down by the failing hardware.
The I/O stuff sounds heavyweight, but I would imagine you'd have to do
the very same thing on a "shared-nothing" cluster.
Jeff> Sounds like COMA or SCI?
COMA: Cache Only Memory Architecture. Whenever you look at such a
thing, ask when cache lines are invalidated. That's the hard problem,
and I haven't yet heard a reasonable answer.
SCI: Rings are bad for latency. Latency is bad for CPUs. SCI is all
rings - in the hardware, and in the coherency protocols. Kendall
Square Research did something like this; it was a disaster.
SGI did neither of these things. ccNUMA was most similar to the DASH
project from Stanford. Not surprising -- SGI has very close ties to
Stanford. Note that the Flash work is done on a mutated Origin.
Jeff> Most folks dismiss shared-everything architectures as
Jeff> non-resilent (Intel and Microsoft > have traditionally been
Jeff> shared-nothing bigots on this point).
Hmm. Do people actually use NT for big clusters? I thought clusters
were all done with VMS, Unix, and O/S 360 (or whatever it's called
these days). I'm not sure how Microsoft's opinion on the matter would
affect anyone.
After working at SGI, I'm convinced that most of the MP problem is a
software problem. Granted, if the CPU is integrated with the memory
controller, you'll need some good hooks to make MP work, but that's 3
to 10 man-years of work. I'm not familiar with what Intel does in the
O/S and libraries/application area (iWarp?), but I would imagine any
bigotry on their part comes from the marketing guys selling what they
have.
And yes, I've heard about Intel's ASCI offering. What was it, 9000
Pentium Pros in a single room? Has Intel Oregon sold much to anyone
else?
-Iain McClatchie
www.10xinc.com
iain@10xinc.com
650-364-0520 voice
650-364-0530 FAX