help offered

* help offered
@ 1998-11-24 12:27 Torbjörn Gannholm
  1998-11-24 20:33 ` Ariel Faigon
  0 siblings, 1 reply; 35+ messages in thread
From: Torbjörn Gannholm @ 1998-11-24 12:27 UTC (permalink / raw)
  To: linux@cthulhu.engr.sgi.com

First my apologies for asking questions you all know the answer to:
What kind of SGI-machines does Linux currently work on?
In what areas does IRIX6.5 have a significant edge over Linux
performancewise?

We really want to put Linux on _everything_ we've got, from PI's to
O2000 (we might also keep a PowerSeries380 for fun), as well as on Suns
and PCs. It would make everything a lot simpler to administer, plus if
we're not happy we can try to hack something.

If necessary for performance, we can keep IRIX on the number crunchers
and, if that's not a performance problem, use gcc/egcs and glibc there. The
same questions as above apply to these versus the IRIX Development Kit.

I have my employer's blessing to put time into porting and/or
development of Linux/gcc/glibc for SGI-machines if someone just points
me in the right direction.
I have solid programming experience, I have dabbled a bit in sysadmin
and am a quick learner (I have at times written useful code in unknown
languages from examples), but haven't been this deep in before.

--
/Torbjörn

This message is a personal message from Torbjörn Gannholm
and does not necessarily represent the opinion of my employer.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
  1998-11-24 12:27 help offered Torbjörn Gannholm
@ 1998-11-24 20:33 ` Ariel Faigon
  1998-11-25 19:49   ` Olivier Galibert
  0 siblings, 1 reply; 35+ messages in thread
From: Ariel Faigon @ 1998-11-24 20:33 UTC (permalink / raw)
  To: Torbjörn Gannholm; +Cc: linux

:
:First my apologies for asking questions you all know the answer to:
:What kind of SGI-machines does Linux currently work on?
:
Only Indys.

:In what areas does IRIX6.5 have a significant edge over Linux
:performancewise?
:
	- Scalability: up to 256 CPUs
	- Guaranteed Real Time response (the kernel is preemptible,
	  i.e. you can have multiple system calls executing in server
	  space simultaneously).
	- A real journalling filesystem (XFS). Reboot doesn't
	  require a lengthy 'fsck'.  Even if you have a terabyte
	  filesystem the filesystem check takes one second or so.
	- Bandwidth (I/O networking) e.g. 4 GB/sec write to
	  RAID disks.

Where Linux has a clear edge is latency (as opposed to bandwidth):
short system call paths, simplicity, and a general speed advantage
almost across the board on machines with a single CPU, small disks,
small files, etc.

Note that Linux doesn't even support big files (more than 4 GB)
on x86, while IRIX supports many terabyte files.

:We really want to put Linux on _everything_ we've got, from PI's to
:O2000 (we might also keep a PowerSeries380 for fun), as well as on suns
:and pcs. It would make everything a lot simpler to administrate, plus if
:we're not happy we can try to hack something.
:
Me too.  This is not easy to do; it is a lot of work.

:If necessary for performance, we can keep IRIX on the numbercrunchers,
:and, if that's not a performance problem, use gcc/egcs and glibc. Same
:questions for these contra Irix Development Kit as above.
:
:I have my employer's blessing to put time into porting and/or
:development of Linux/gcc/glibc for SGI-machines if someone just points
:me in the right direction.
:I have solid programming experience, I have dabbled a bit in sysadmin
:and am a quick learner (I have at times written useful code in unknown
:languages from examples), but haven't been this deep in before.
:
If you could make 'glibc' run on IRIX and send me the details
of what you did (and patches to the maintainers) that would be
a great great thing.  It will make most freeware programs more
portable between IRIX and Linux.

-- 
Peace, Ariel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
  1998-11-24 20:33 ` Ariel Faigon
@ 1998-11-25 19:49   ` Olivier Galibert
  1998-11-25 19:57       ` John E. Schimmel
                       ` (3 more replies)
  0 siblings, 4 replies; 35+ messages in thread
From: Olivier Galibert @ 1998-11-25 19:49 UTC (permalink / raw)
  To: linux

On Tue, Nov 24, 1998 at 12:33:45PM -0800, Ariel Faigon wrote:
> 	- Scalability: up to 256 CPUs

I can tell you that an O2K with 64 CPUs works quite well when the hardware
isn't failing, but the hardware is often failing...

> 	- Guaranteed Real Time response (kernel is preemptible
> 	  i.e. you can have multiple system calls executing in server
> 	  space simultaneously.

Linux 2.1.* is very preemptible, even if there are still some things to
do.

> 	- A real journalling filesystem (XFS). Reboot doesn't
> 	  require a lengthy 'fsck'.  Even if you have a terabyte
> 	  filesystem the filesystem check takes one second or so.

xfs is _very_ good.

> 	- Bandwidth (I/O networking) e.g. 4 GB/sec write to
> 	  RAID disks.

Interesting.  Our "local" SGI vendor  (i.e. the one for France),  told
us that 1GB/sec write  speed was too much  and he could only guarantee
800MB/sec for our 1TB raid array.

> Note that Linux doesn't even support big files (more than 4 GB)
> on x86, while IRIX supports many terabyte files.

The limit on x86 is 2GB.  To be fair, said terabyte files and
filesystems are connected to systems with a 64-bit architecture.
Afaik, Linux on Alpha handles terabyte files.

  OG.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
@ 1998-11-25 19:57       ` John E. Schimmel
  0 siblings, 0 replies; 35+ messages in thread
From: John E. Schimmel @ 1998-11-25 19:57 UTC (permalink / raw)
  To: Olivier Galibert; +Cc: linux

> 
> The  limit on  x86 is  2GB.   To  be  fair,  said  terabyte  files and
> filesystems  are  connected  to systems   with  a 64bits architecture.
> Afaik, linux on alpha handles terabytes files.
> 
>   OG.
> 

We support >2GB on 32 bit systems, and added lseek64() and friends
before we had 64 bit size_t/off_t.
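
To make the 2 GB versus 4 GB figures in this exchange concrete: a signed
32-bit file offset tops out at 2^31 - 1 bytes, an unsigned 32-bit count at
2^32 - 1, and a 64-bit offset of the kind lseek64() takes is far beyond
terabyte files.  A minimal Python sketch of that arithmetic follows; the
helper names max_offset and fits_in_signed_32 are made up for illustration
and are not part of any real API.

```python
import struct

GIB = 2 ** 30
TIB = 2 ** 40

def max_offset(bits, signed=True):
    """Largest byte offset representable in an integer of the given width."""
    return 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1

def fits_in_signed_32(offset):
    """True if the offset can be stored in a signed 32-bit field."""
    try:
        struct.pack("<i", offset)   # "<i" is a little-endian signed 32-bit int
        return True
    except struct.error:
        return False

limit_32_signed = max_offset(32)                  # the 2 GB limit
limit_32_unsigned = max_offset(32, signed=False)  # the 4 GB figure
limit_64_signed = max_offset(64)                  # a 64-bit offset

print("signed 32-bit limit:  ", limit_32_signed, "bytes")
print("unsigned 32-bit limit:", limit_32_unsigned, "bytes")
print("signed 64-bit limit:  ", limit_64_signed // TIB, "TiB (rounded down)")
```

An offset of exactly 2 GiB no longer fits in the signed 32-bit field, which
is why a separate 64-bit seek interface is needed on 32-bit systems.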

--------------------------------------------------------------
John E. Schimmel                       Email:    jes@sgi.com         
KD6MNW				       Voice:    (650)933-4116
Silicon Graphics Inc.                  Fax:      (650)933-0513
http://reality.sgi.com/jes             Cellular: (209)631-0896
--------------------------------------------------------------

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
  1998-11-25 19:49   ` Olivier Galibert
  1998-11-25 19:57       ` John E. Schimmel
@ 1998-11-25 20:11     ` Jeffrey Watts
  1998-11-25 20:43       ` Greg Chesson
  1998-11-25 20:37       ` Ariel Faigon
  1998-11-25 20:56     ` Greg Chesson
  3 siblings, 1 reply; 35+ messages in thread
From: Jeffrey Watts @ 1998-11-25 20:11 UTC (permalink / raw)
  To: Olivier Galibert; +Cc: linux

On Wed, 25 Nov 1998, Olivier Galibert wrote:

> I can tell than an O2K with 64 CPUS works quite well when the hardware
> isn't failing, but the hardware is often failing...

Have you had a high failure rate?  We just bought 8 O2Ks with 12 CPUs
each.  We are supposed to take delivery next month.  The 2 single-module
Integrated Test machines we've gotten seem to work great.  These machines
will be used in a high-availability environment (4 nines, moving to 5
nines with FailSafe 2.0).
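
For readers unfamiliar with the "nines" shorthand, the downtime budget it
implies is simple arithmetic.  The sketch below assumes a 365-day year and
reads "4 nines" as 99.99% availability; the function name is illustrative
only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # assumes a non-leap year

def downtime_minutes_per_year(nines):
    """Downtime budget per year for an availability of `nines` nines."""
    unavailability = 10.0 ** -nines   # 4 nines -> 0.0001
    return MINUTES_PER_YEAR * unavailability

for n in (4, 5):
    print(f"{n} nines: {downtime_minutes_per_year(n):.2f} minutes/year")
```

Under those assumptions, moving from 4 to 5 nines shrinks the yearly budget
from a little under an hour to a little over five minutes.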

J.

o-----------------------------------o
| Jeffrey Watts                     |
| watts@sunflower.com           o-------------------------------------o
| Systems Analyst               | "I don't think Microsoft is evil in |
| Sprint - Systems Management   |  itself; I just think that they     |
o-------------------------------|  make really crappy operating       |
                                |  systems."                          |
                                |  -- Linus Torvalds                  |
                                o-------------------------------------o

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
@ 1998-11-25 20:37       ` Ariel Faigon
  0 siblings, 0 replies; 35+ messages in thread
From: Ariel Faigon @ 1998-11-25 20:37 UTC (permalink / raw)
  To: Olivier Galibert; +Cc: linux

:
:> 	- Guaranteed Real Time response (kernel is preemptible
:> 	  i.e. you can have multiple system calls executing in server
:> 	  space simultaneously.
:
:Linux 2.1.* is very preemptible, even if there are still some things to
:do.
:

Interesting.  Could you elaborate on:

	0) What was changed in recent Linux kernels
	   to support preemptibility in kernel space?
	1) Which "serious" (i.e. not 'getpid') system calls are
	   now reentrant?
	2) What still remains to be done so Linux can really
	   scale before it gets bottlenecked by kernel locks?

:> 	- Bandwidth (I/O networking) e.g. 4 GB/sec write to
:> 	  RAID disks.
:
:Interesting.  Our "local" SGI vendor  (i.e. the one for France),  told
:us that 1GB/sec write  speed was too much  and he could only guarantee
:800MB/sec for our 1TB raid array.
:

I've seen much higher numbers.  They are not official, and
are not supposed to be used in sales situations, but were obtained
in our labs with XFS and arrays that were designed and tuned to
maximize bandwidth and to prove that XFS is not the bottleneck.
I believe they also used Fibre Channel etc.   Anyway, there are
some much greater experts on this subject on this list if they
care to give the details.

:
:> Note that Linux doesn't even support big files (more than 4 GB)
:> on x86, while IRIX supports many terabyte files.
:
:The  limit on  x86 is  2GB.   To  be  fair,  said  terabyte  files and
:filesystems  are  connected  to systems   with  a 64bits architecture.
:Afaik, linux on alpha handles terabytes files.
:
Yes, I know this is not a strictly-Linux limitation, which is why
I was careful to add "on x86".

Please don't get me wrong: I'm not trying to advocate any OS over
the other (I love and use both) and I didn't mean to turn this
into an advocacy thread.  I was just trying to respond honestly and
fairly to a specific question I was asked.

-- 
Peace, Ariel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
  1998-11-25 20:11     ` Jeffrey Watts
@ 1998-11-25 20:43       ` Greg Chesson
  0 siblings, 0 replies; 35+ messages in thread
From: Greg Chesson @ 1998-11-25 20:43 UTC (permalink / raw)
  To: Jeffrey Watts, Olivier Galibert; +Cc: linux

Small systems have a much lower failure rate than large (128p) systems.
This is for software as well as hardware.

There have been improvements in all hw failure modes in the last
year.  The most common failure is memory.  This is no surprise since there
are statistically 10X more memory components in a system compared to everything
else.  The second most common failure is power supplies.  The power supplies
have been reengineered.  New systems have the new supplies.  Systems
in the field are upgraded when there are problems.

Although all systems are burned in and tested before leaving the factory,
they can suffer damage by the time they arrive at a new site.  Although the
DOA rate is low, it is still non-zero.

Once a system is installed and any infant mortality problems have been
solved, the probability for continuous operation is very high.
Bad power, frequent reconfigurations, moving cables and boards about
can cause problems with any system.

g

-- 
Greg Chesson

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
  1998-11-25 20:37       ` Ariel Faigon
  (?)
@ 1998-11-25 20:51       ` pjlahaie
  1998-11-25 21:18           ` William J. Earl
                           ` (2 more replies)
  -1 siblings, 3 replies; 35+ messages in thread
From: pjlahaie @ 1998-11-25 20:51 UTC (permalink / raw)
  To: Ariel Faigon; +Cc: Olivier Galibert, linux

On Wed, 25 Nov 1998, Ariel Faigon wrote:

> I've seen way much higher numbers.  They are not official, and
> are not supposed to be used in sales situations, but were obtained
> in our labs with XFS and arrays that were designed and tuned to
> maximize bandwidth and to prove that XFS is not the bottleneck.
> I believe they also used fiberchannel etc.   Anyway, there are
> some much greater experts on this subject on this list if they
> care to give the details.

    I was under the impression the O2k memory bandwidth was limited to
~800MB/s.  If so, even if you can read 4GB/s what are you going to do with
it?  It would have to go over the CrayLink "network" and that doesn't do
4GB/s.  The only way I can see 4GB/s disk throughput is multiple nodes
accessing "local" drives and adding all the bandwidth together.

						- Paul

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
  1998-11-25 19:49   ` Olivier Galibert
                       ` (2 preceding siblings ...)
  1998-11-25 20:37       ` Ariel Faigon
@ 1998-11-25 20:56     ` Greg Chesson
  1998-11-25 21:12       ` Olivier Galibert
  3 siblings, 1 reply; 35+ messages in thread
From: Greg Chesson @ 1998-11-25 20:56 UTC (permalink / raw)
  To: Olivier Galibert, linux

>Interesting.  Our "local" SGI vendor  (i.e. the one for France),  told
>us that 1GB/sec write  speed was too much  and he could only guarantee
>800MB/sec for our 1TB raid array.

800 MB/s might be a good conservative estimate for a particular RAID array.
However, it is not a limit for Origin systems or disk arrays in general.
We regularly specify and deliver systems with file and network performance
much greater than 800 MB/s.  Also, regarding file system bandwidth,
most discussions do not distinguish between peak, sustained, and average
performance, or specify the transfer sizes or number of clients or other
important environmental factors.  Disk vendors are the worst offenders.  RAID vendors
are pretty bad, too.  I've seen two different vendors claim the sum of the
peak bandwidths of the disk channels on their boxes as the expected
file system performance.  Woe to the customer who actually believes
such garbage.  And woe to the vendors who have to compete against such garbage.

g

-- 
Greg Chesson

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
@ 1998-11-25 21:04           ` Greg Chesson
  0 siblings, 0 replies; 35+ messages in thread
From: Greg Chesson @ 1998-11-25 21:04 UTC (permalink / raw)
  To: Alan Cox, ariel; +Cc: galibert, linux

One definition of OS scalability that I have not seen
in general use is this:

	an OS scales to S number of processors if all S processors
	can be executing in the kernel at the same time.

An OS that scales to S active kernels can usually operate hardware
with P processors, where P > S.  A system for 1P should
be able to handle 2P with a little work.  I expect a lightweight kernel
like Linux to handle 4P with a few locks if on average only one of the
4P is in the kernel.  I'd suggest that the Linux kernel is at present
(1S, 4P) or maybe (1.5S, 4P).
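
One way to read the (S, P) notation is as a comparison between S and the
average number of processors that want to be in the kernel at once.  The
sketch below is only an illustrative averaging model, not something stated
in this thread: it assumes every processor spends the same fraction of its
time in the kernel and ignores bursts and lock hold times.  The function
names are made up.

```python
def kernel_demand(processors, kernel_fraction):
    """Average number of processors executing in the kernel at once."""
    return processors * kernel_fraction

def within_scaling_limit(s, processors, kernel_fraction):
    """True if average kernel demand does not exceed S active kernels."""
    return kernel_demand(processors, kernel_fraction) <= s

# 4 processors, each in the kernel 25% of the time: demand is 1.0
print(within_scaling_limit(1, 4, 0.25))     # fits a (1S, 4P) system
# Same hardware, but 50% kernel time per processor: demand is 2.0
print(within_scaling_limit(1, 4, 0.50))     # exceeds S = 1
# A (1.5S, 4P) system tolerates up to 37.5% kernel time per processor
print(within_scaling_limit(1.5, 4, 0.375))
```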

g

-- 
Greg Chesson

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
  1998-11-25 20:56     ` Greg Chesson
@ 1998-11-25 21:12       ` Olivier Galibert
  0 siblings, 0 replies; 35+ messages in thread
From: Olivier Galibert @ 1998-11-25 21:12 UTC (permalink / raw)
  To: Greg Chesson, linux

On Wed, Nov 25, 1998 at 12:56:41PM -0800, Greg Chesson wrote:
> And woe to the vendors who have to compete against such garbage.

Ohh yeah.

Actually, in our case, it was slightly better:
- we wanted 1TB of disk.
- we wanted to be able to  dump the full 8GB memory  of the O2K to the
  disk in around 10 seconds.
- we wanted to buy everything from SGI (fibre channel raid array, disks,
  everything).

So the SGI dudes  were able to  choose solutions known to work instead
of having to cope with existing hardware :-)
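
The memory-dump requirement above translates directly into a sustained
write rate.  A small sketch of the division, with the GB-to-MB factor left
as a parameter because the thread does not say whether decimal or binary
units are meant (the function name is illustrative):

```python
def required_rate_mb_s(memory_gb, seconds, mb_per_gb=1000):
    """Sustained write rate needed to dump `memory_gb` in `seconds`."""
    return memory_gb * mb_per_gb / seconds

print(required_rate_mb_s(8, 10))                   # decimal units
print(required_rate_mb_s(8, 10, mb_per_gb=1024))   # binary units
```

Either way the target lands at roughly 800 MB/s, the same order as the
figure the vendor was willing to guarantee.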

  OG.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
@ 1998-11-25 21:18           ` William J. Earl
  0 siblings, 0 replies; 35+ messages in thread
From: William J. Earl @ 1998-11-25 21:18 UTC (permalink / raw)
  To: pjlahaie; +Cc: Ariel Faigon, Olivier Galibert, linux

pjlahaie@atlsci.com writes:
 > On Wed, 25 Nov 1998, Ariel Faigon wrote:
 > 
 > > I've seen way much higher numbers.  They are not official, and
 > > are not supposed to be used in sales situations, but were obtained
 > > in our labs with XFS and arrays that were designed and tuned to
 > > maximize bandwidth and to prove that XFS is not the bottleneck.
 > > I believe they also used fiberchannel etc.   Anyway, there are
 > > some much greater experts on this subject on this list if they
 > > care to give the details.
 > 
 >     I was under the impression the O2k memory bandwidth was limited to
 > ~800MB/s.  If so, even if you can read 4GB/s what are you going to do with
 > it?  It would have to go over the CrayLink "network" and that doesn't do
 > 4GB/s.  The only way I can see 4GB/s disk throughput is multiple nodes
 > accessing "local" drives and adding all the bandwidth together.

       A single node is only 800 MB/s, but an 8P Origin 2000 has four nodes,
and a 32P has 16 nodes.  The router network bandwidth scales with the number
of nodes, so the memory bandwidth of a 32P Origin 2000 is far more than enough
for 4 GB/s.  If you attach the drives to multiple controllers on multiple
nodes, then it is easy to stripe across them with the volume manager to
get high bandwidth.  The volume manager does requests in parallel, so it
is not a bottleneck.
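
The node arithmetic above can be sketched as follows.  Two CPUs per node is
inferred from "8P has four nodes, 32P has 16 nodes"; treating the total as
a plain sum of per-node figures is an assumption that gives an upper bound,
not a measured number.  All names are illustrative.

```python
import math

NODE_BANDWIDTH_MB_S = 800   # per-node figure quoted above
CPUS_PER_NODE = 2           # inferred: 8P -> 4 nodes, 32P -> 16 nodes

def node_count(cpus):
    """Number of nodes in a machine with the given CPU count."""
    return cpus // CPUS_PER_NODE

def upper_bound_mb_s(cpus):
    """Sum of per-node memory bandwidth; a ceiling, not a sustained rate."""
    return node_count(cpus) * NODE_BANDWIDTH_MB_S

def nodes_needed(target_mb_s):
    """Fewest nodes whose summed bandwidth reaches the target."""
    return math.ceil(target_mb_s / NODE_BANDWIDTH_MB_S)

print(node_count(8), node_count(32))   # 4 and 16 nodes
print(upper_bound_mb_s(32))            # ceiling for a 32P machine, in MB/s
print(nodes_needed(4000))              # nodes needed to reach 4 GB/s
```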

     The Origin architecture does not have a central bus, so it is not bus
limited.  Just add boxes until the bandwidth is enough for what you need.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
@ 1998-11-25 21:24           ` Greg Chesson
  0 siblings, 0 replies; 35+ messages in thread
From: Greg Chesson @ 1998-11-25 21:24 UTC (permalink / raw)
  To: pjlahaie, Ariel Faigon; +Cc: Olivier Galibert, linux

The max rate on an I/O channel is 800 MB/s.  The sustainable rates range
from 580 to 720 MB/s depending on which channel you're looking at in the machine.

But the memory subsystem is ccNUMA.  That means any channel in the system
can read/write any memory in the system.  With io buffers that comprise
multiple pages, and with the pages of the buffer located on several different
memory controllers, multiple io channels can burst (in parallel) to the
"array" of pages that comprise the buffer.

A system in my lab has 24 Fibre Channels.
We built an XFS file system that operates all 24 channels on a read or write.
There's a RAID controller on each FC with an 8+1 LUN.
The file system is arranged so that each controller has a 4MB IO,
each disk gets a 512KB IO, and the IO size of the native file allocation
block is 24*4MB == 96 MB.  It takes about 50ms for a one-disk transfer.
During that time all 24 FC burst into memory.  The DMA rate is around 90 MB/s
per channel.  So, that is 2160 MB/s.  The aggregate memory bandwidth for a
16-CPU Origin is about 4000 MB/s (sustained).  The actual memory bandwidth is
about 20% more, but I derate it when doing this kind of exercise.
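
The stripe geometry in the paragraph above can be checked with a few lines
of arithmetic.  This sketch assumes the "8" in 8+1 means eight data disks
(with the ninth holding parity) and that the 90 MB/s DMA rate applies per
channel; the constant names are made up.

```python
CHANNELS = 24             # Fibre Channel links, one RAID controller each
DATA_DISKS_PER_LUN = 8    # assumed: the "8" in 8+1, ninth disk is parity
DISK_IO_KB = 512          # per-disk transfer size
DMA_RATE_MB_S = 90        # assumed to be per channel

controller_io_mb = DATA_DISKS_PER_LUN * DISK_IO_KB / 1024   # per controller
stripe_mb = CHANNELS * controller_io_mb                      # allocation block
peak_mb_s = CHANNELS * DMA_RATE_MB_S                         # all channels

print(controller_io_mb, "MB per controller IO")
print(stripe_mb, "MB per full stripe")
print(peak_mb_s, "MB/s with every channel bursting")
```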

In order to avoid page management overhead, we rely on the ability
to specify large (16MB) pages for buffers of this kind.  The OS is quite
happy to manage large pages as well as the default-sized ones (16KB).
Without this capability, the page management overhead would be a major
stumbling block.  Also, striped IO of this kind does not go through the
file system buffer cache.  These are direct-io transfers between the channel
and user-supplied buffers.  It's not clear that Linux permits DMA to a mapped
user page.... I get different opinions from folks.  Nevertheless, large pages
and direct IO are necessary tools for operating big IO.
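
To see why page size matters here, count the pages behind one 96 MB IO
buffer at each of the two page sizes mentioned.  This is a back-of-envelope
sketch; pages_needed is an illustrative helper, not an IRIX or Linux call.

```python
KB = 1024
MB = 1024 * KB

def pages_needed(buffer_bytes, page_bytes):
    """Pages required to cover a buffer, rounding up to whole pages."""
    return -(-buffer_bytes // page_bytes)   # ceiling division

buffer_bytes = 96 * MB

print(pages_needed(buffer_bytes, 16 * KB))   # default-sized pages
print(pages_needed(buffer_bytes, 16 * MB))   # large pages
```

With default pages the OS has to track thousands of mappings per buffer;
with large pages it tracks a handful, which is the overhead the message
refers to.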

The channel bandwidth is 2160 MB/s during an IO as noted above.
However, the channels can't transfer continuously in this configuration
because the disks have to seek occasionally.  So, the amount that you
derate the peak bandwidth to get to sustained file system bandwidth
depends on the block size, number of seeks, average seek distance,
the number of IO requests in the hardware/controller pipeline, plus
some fuzz to account for faster transfer rates on the outside cylinders
compared to inside cylinders plus the number of direction changes
(mix of reads and writes) plus some analysis of extra drive rotations
on writes and the effectiveness of disk and controller track caches.
Whew.  Anyway the 24-channel RAID-3 system will sustain about 1520 MB/s
under arbitrary read/write and seek patterns.  "Good" patterns will trend
toward 2000 MB/s, but it won't fall below 1500.

So, a 16-processor Origin can operate a 2 GB/s file system and use only
40-50% of its internal bandwidth.  Obviously, many many configurations
of processors, channels, disks and network devices are possible.
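
The derating and utilization figures quoted in this message are consistent
with each other, as a quick calculation shows.  The sketch only restates
numbers from the message; the variable names are made up.

```python
PEAK_CHANNEL_MB_S = 2160       # 24 channels bursting
SUSTAINED_WORST_MB_S = 1520    # arbitrary read/write and seek patterns
FILE_SYSTEM_MB_S = 2000        # the "2 GB/s file system"
MEMORY_SUSTAINED_MB_S = 4000   # derated memory bandwidth, 16-CPU Origin
MEMORY_ACTUAL_MB_S = MEMORY_SUSTAINED_MB_S * 120 // 100   # "about 20% more"

derating = SUSTAINED_WORST_MB_S / PEAK_CHANNEL_MB_S
share_of_derated = FILE_SYSTEM_MB_S / MEMORY_SUSTAINED_MB_S
share_of_actual = FILE_SYSTEM_MB_S / MEMORY_ACTUAL_MB_S

print(f"sustained / peak channel rate: {derating:.0%}")
print(f"file system share of memory bandwidth: "
      f"{share_of_actual:.0%} to {share_of_derated:.0%}")
```

The two shares bracket the "40-50%" quoted above, depending on whether the
derated or the actual memory bandwidth is used as the denominator.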

When configuring a bandwidth-oriented file server, we shoot for
1/3 bandwidth for the disks, 1/3 for the network, and the rest for software.
These are just rules of thumb, but indicate the kind of thought process
that should be applied.

Sorry for the long message, but I detected some basic misunderstandings
of what this hardware and software can do.

g

-- 
Greg Chesson

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
  1998-11-25 21:24           ` Greg Chesson
  (?)
@ 1998-11-25 21:38           ` pjlahaie
  1998-11-25 21:57               ` Greg Chesson
  1998-11-25 22:08               ` William J. Earl
  -1 siblings, 2 replies; 35+ messages in thread
From: pjlahaie @ 1998-11-25 21:38 UTC (permalink / raw)
  To: Greg Chesson; +Cc: Ariel Faigon, Olivier Galibert, linux

On Wed, 25 Nov 1998, Greg Chesson wrote:

> But the memory subsystem is ccNUMA.  That means any channel in the system
> can read/write any memory in the system.  With io buffers that comprise
> multiple pages, and with the pages of the buffer located on several different
> memory controllers, multiple io channels can burst (in parallel) to the
> "array" of pages that comprise the buffer.

    Except some of this has to go through the CrayLink.  The memory you
are "bursting" to is not on the same node.  Therefore, if you have a
dual-threaded application that runs over the data, at most the max
bandwidth is 1.6GB/s (seeing as it's advantageous to spread your code to
two nodes and split the memory between them).  If your application can make
use of all processors on that box, then you get the full bandwidth.  The
most any single processor in that Origin can handle is 800MB/s and if it
needs to get that data, eventually that data is shoveled through the
CrayLink (and hopefully it gets migrated there).  Is there anything flawed
with this reasoning?
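
The argument above amounts to taking the smaller of two limits: what the IO
side can deliver and what the nodes running the application can consume.  A
minimal sketch of that reasoning, assuming 800 MB/s per node and ignoring
CrayLink contention entirely (the function name is illustrative):

```python
NODE_MB_S = 800   # per-node memory bandwidth assumed in the argument

def consumable_mb_s(nodes_used, io_mb_s):
    """Bandwidth the application can actually use: the smaller limit wins."""
    return min(nodes_used * NODE_MB_S, io_mb_s)

print(consumable_mb_s(1, 4000))   # single node
print(consumable_mb_s(2, 4000))   # dual-threaded, spread over two nodes
print(consumable_mb_s(8, 4000))   # application spread over eight nodes
```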

> file system buffer cache.  These are direct-io transfers between the channel
> and user-supplied buffers.  It's not clear that Linux permits dma to a mapped
> user page.... I get different opinions from folks.  Nevertheless, large pages

    I don't see why it cannot be done.  The page-cache/file system buffer
cache are supposed to be merged.  If you mmap that data, you should just
get a pte pointing to that area in the page cache.
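
For readers who have not used it, here is a minimal Python sketch of the
user-space side of this: mapping a file and reading it through the mapping
instead of through read().  The helper name `read_via_mmap` and the demo
file are assumptions for illustration.  It shows only what the application
sees; whether the kernel can dma straight into those pages is exactly the
open question above, and this sketch does not answer it.

```python
import mmap
import os
import tempfile


def read_via_mmap(path, length):
    """Map a file read-only and return its first `length` bytes.

    The bytes come from the mapping itself; no read() call copies them
    into a separate user buffer first.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[:length]


# Demo on a throwaway file (hypothetical contents).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"page cache demo " * 1000)
    demo_path = tmp.name
try:
    print(read_via_mmap(demo_path, 15))
finally:
    os.unlink(demo_path)
```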

> So, a 16-processor Origin can operate a 2 GB/s file system and use only
> 40-50% of its internal bandwidth.  Obviously, many many configurations
> of processors, channels, disks and network devices are possible.

    But that bandwidth isn't single node bandwidth.  No single node can do
4GB/s.  All nodes need to use their local memory to achieve max bandwidth.

						- Paul

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
  1998-11-25 20:37       ` Ariel Faigon
  (?)
  (?)
@ 1998-11-25 21:46       ` Alan Cox
  1998-11-25 21:04           ` Greg Chesson
  -1 siblings, 1 reply; 35+ messages in thread
From: Alan Cox @ 1998-11-25 21:46 UTC (permalink / raw)
  To: ariel; +Cc: galibert, linux

> :Linux 2.1.* is very preemptible, even if there are still some things to
> :do.

Umm

> :
> 
> Interesting.  Could you elaborate on:
> 
> 	0) What was changed in recent Linux kernels
> 	   to support preemptibility in kernel space?

Nothing

> 	1) Which "serious" (i.e not 'getpid') system calls are
> 	   now reentrant ?

signals, scheduling related stuff

> 	2) What still remains to be done so Linux can really
> 	   scale before it gets bottlenecked by kernel locks ?

Actually it scales fine to 4 CPUs for most stuff on Intel. The pieces that
don't scale are memory intensive and the Intel hardware doesn't scale either 8)

But from a theoretical point of view the page cache, vm and fs layers
don't scale.

Alan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
@ 1998-11-25 21:57               ` Greg Chesson
  0 siblings, 0 replies; 35+ messages in thread
From: Greg Chesson @ 1998-11-25 21:57 UTC (permalink / raw)
  To: pjlahaie; +Cc: Ariel Faigon, Olivier Galibert, linux

On Nov 25,  4:38pm, <pjlahaie@atlsci.com> wrote:
> Subject: Re: help offered
> On Wed, 25 Nov 1998, Greg Chesson wrote:
>
> > But the memory subsystem is ccNUMA.  That means any channel in the system
> > can read/write any memory in the system.  With io buffers that comprise
> > multiple pages, and with the pages of the buffer located on several
> > different memory controllers, multiple io channels can burst (in
> > parallel) to the "array" of pages that comprise the buffer.
>
>     Except some of this has to go through the CrayLink.  The memory you

there are 8 IO links in the example I gave plus numerous CrayLinks -
I think it's 16 for the example.  The bandwidth definitely does not go
down a single link.

> are "bursting" to is not on the same node.  Therefore, if you have a
> dual-threaded application that runs over the data, at most the max
> bandwidth is 1.6GB/s (seeing as it's advantageous to spread your code to

the application in the example is single-threaded.
Lots of people just want a big pipe and a single file descriptor.

> two nodes and split the memory between them).  If your application can make
> use of all processors on that box, then you get the full bandwidth.  The

a single-thread app can easily malloc pages from all the processor slots
on the box.  Can't do that in a cluster or a shared-nothing machine.
You can think of processor slots as just extra memory controllers.
For some applications we ship with "sparse" processor nodes for just
this purpose.

> most any single processor in that Origin can handle is 800MB/s and if it
> needs to get that data, eventually that data is shoveled through the
> CrayLink (and hopefully it gets migrated there).  Is there anything flawed
> with this reasoning?

The single processor limit is set by the memory controller bandwidth.
It can peak at over 600 MB/s, but 500 MB/s is a good number for sustained
random access ops.
>
>     I don't see why it cannot be done.  The page-cache/file system buffer
> cache are supposed to be merged.  If you mmap that data, you should just
> get a pte pointing to that area in the page cache.

ok.

>
>     But that bandwidth isn't single node bandwidth.  No single node can do
> 4GB/s.  All nodes need to use their local memory to achieve max bandwidth.
>

we do make systems where single node bandwidth is many gigabytes/sec.
They're called vector supercomputers.

The max amount of memory on a motherboard is 4 GBytes, I think.
In order to get a bigger memory, more processors must be added.
Do you want to criticize that, too?

The "beauty" of the ccNUMA memory architecture is that by using off-the-shelf
memory circuits, both bandwidth and capacity can be aggregated in a modular
way and still be mapped into a coherent virtual address space.
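
To make the aggregation argument concrete, here is a minimal Python sketch
using the figures quoted in this thread: 500 MB/s sustained and roughly
600 MB/s peak per memory controller, and a 2 GB/s file system on a
16-processor Origin.  The two-processors-per-node figure and the function
names are assumptions for illustration, and treating aggregate bandwidth as
nodes times per-node bandwidth is a simplification that ignores
interconnect contention.

```python
def aggregate_bandwidth(processors, mb_s_per_node, procs_per_node=2):
    """Aggregate memory bandwidth in MB/s, assuming one memory
    controller per node and procs_per_node processors on each node."""
    nodes = processors // procs_per_node
    return nodes * mb_s_per_node


def utilisation(load_mb_s, total_mb_s):
    """Fraction of the aggregate bandwidth consumed by a given load."""
    return load_mb_s / total_mb_s


# Per-controller figures as quoted in the thread; 2000 MB/s is the
# 2 GB/s file system from the example.
for label, per_node in (("sustained", 500), ("peak", 600)):
    total = aggregate_bandwidth(16, per_node)
    share = utilisation(2000, total)
    print(f"{label}: {total} MB/s aggregate, file system uses {share:.0%}")
```

Under those assumptions the file system's share is consistent with the
40-50% range mentioned for the 16-processor example.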

g

-- 
Greg Chesson

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
@ 1998-11-25 22:08               ` William J. Earl
  0 siblings, 0 replies; 35+ messages in thread
From: William J. Earl @ 1998-11-25 22:08 UTC (permalink / raw)
  To: pjlahaie; +Cc: Greg Chesson, Ariel Faigon, Olivier Galibert, linux

pjlahaie@atlsci.com writes:
 > On Wed, 25 Nov 1998, Greg Chesson wrote:
 > 
 > > But the memory subsystem is ccNUMA.  That means any channel in the system
 > > can read/write any memory in the system.  With io buffers that comprise
 > > multiple pages, and with the pages of the buffer located on several different
 > > memory controllers, multiple io channels can burst (in parallel) to the
 > > "array" of pages that comprise the buffer.
 > 
 >     Except some of this has to go through the CrayLink.  The memory you
 > are "bursting" to is not on the same node.  Therefore, if you have a
 > dual-threaded application that runs over the data, at most the max
 > bandwidth is 1.6GB/s (seeing as it's advantageous to spread your code to
 > two nodes and split the memory between them).  If your application can make
 > use of all processors on that box, then you get the full bandwidth.  The
 > most any single processor in that Origin can handle is 800MB/s and if it
 > needs to get that data, eventually that data is shoveled through the
 > CrayLink (and hopefully it gets migrated there).  Is there anything flawed
 > with this reasoning?

      There is not a single CrayLink.  Each router port is a CrayLink.
There are many CrayLinks if you have many nodes, so the aggregate bandwidth
scales.  If you want to do some real processing on all of the data, you
will need more processors than would be required to simply read it into
some processor cache anyway, so the node bandwidth is unlikely to be the
limiting factor.  When you add processors, you get more aggregate bandwidth.

...
 > > So, a 16-processor Origin can operate a 2 GB/s file system and use only
 > > 40-50% of its internal bandwidth.  Obviously, many many configurations
 > > of processors, channels, disks and network devices are possible.
 > 
 >     But that bandwidth isn't single node bandwidth.  No single node can do
 > 4GB/s.  All nodes need to use their local memory to achieve max bandwidth.

       Yes.  The point of the operating system is to spread the load of
processing, I/O, and networking over the complete system, not create
bottlenecks at a particular node card. 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
  1998-11-25 21:57               ` Greg Chesson
  (?)
@ 1998-11-25 22:09               ` pjlahaie
  1998-11-25 22:57                   ` Greg Chesson
  -1 siblings, 1 reply; 35+ messages in thread
From: pjlahaie @ 1998-11-25 22:09 UTC (permalink / raw)
  To: Greg Chesson; +Cc: Ariel Faigon, Olivier Galibert, linux

On Wed, 25 Nov 1998, Greg Chesson wrote:

> we do make systems where single node bandwidth is many gigabytes/sec.
> They're called vector supercomputers.

    I wasn't criticizing anything or anyone.  I was just trying to get more
information, since I had questions our SGI tech/salesman combo could not
answer.

> The max amount of memory on a motherboard is 4 GBytes, I think.
> In order to get a bigger memory, more processors must be added.
> Do you want to criticize that, too?

    No need to take anything personally.  I'm just asking for information
and you guys get jumpy.  I have nothing bad to say about the O2k hardware,
I'm just curious about its architecture and how exactly "bandwidth" is
measured, since it's not as straightforward as in an SMP system.

						- Paul

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
  1998-11-25 20:51       ` pjlahaie
  1998-11-25 21:18           ` William J. Earl
  1998-11-25 21:24           ` Greg Chesson
@ 1998-11-25 22:13         ` Alex Kozlov
  1998-11-25 22:15           ` pjlahaie
  1998-11-25 22:25             ` William J. Earl
  2 siblings, 2 replies; 35+ messages in thread
From: Alex Kozlov @ 1998-11-25 22:13 UTC (permalink / raw)
  To: pjlahaie; +Cc: ariel, galibert, linux

pjlahaie@atlsci.com wrote:
>
>     I was under the impression the O2k memory bandwidth was limited to
> ~800MB/s.  If so, even if you can read 4GB/s what are you going to do with
> it?  It would have to go over the CrayLink "network" and that doesn't do
> 4GB/s.  The only way I can see 4GB/s disk throughput is multiple nodes
> accessing "local" drives and adding all the bandwidth together.
> 

I thought craylink is 6 GB/s:

cl0: flags=4041<UP,RUNNING,DRVRLOCK>
        inet 192.0.2.113 netmask 0xffffff00 
        speed 6.40 Gbit/s

Is it not true in practice?

-- 
Alexander V. Kozlov | alexvk@engr.sgi.com | (650) 933-8493

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
  1998-11-25 22:13         ` Alex Kozlov
@ 1998-11-25 22:15           ` pjlahaie
  1998-11-25 22:25             ` William J. Earl
  1 sibling, 0 replies; 35+ messages in thread
From: pjlahaie @ 1998-11-25 22:15 UTC (permalink / raw)
  To: Alex Kozlov; +Cc: ariel, galibert, linux

On Wed, 25 Nov 1998, Alex Kozlov wrote:

> I thought craylink is 6 GB/s:
                        ^^^^
> 
> cl0: flags=4041<UP,RUNNING,DRVRLOCK>
>         inet 192.0.2.113 netmask 0xffffff00 
>         speed 6.40 Gbit/s
                     ^^^^

    Those are bits, not bytes.  So 6.40Gb/s is 0.8GB/s.

						- Paul

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
@ 1998-11-25 22:25             ` William J. Earl
  0 siblings, 0 replies; 35+ messages in thread
From: William J. Earl @ 1998-11-25 22:25 UTC (permalink / raw)
  To: Alex Kozlov; +Cc: pjlahaie, ariel, galibert, linux

Alex Kozlov writes:
 > pjlahaie@atlsci.com wrote:
 > >
 > >     I was under the impression the O2k memory bandwidth was limited to
 > > ~800MB/s.  If so, even if you can read 4GB/s what are you going to do with
 > > it?  It would have to go over the CrayLink "network" and that doesn't do
 > > 4GB/s.  The only way I can see 4GB/s disk throughput is multiple nodes
 > > accessing "local" drives and adding all the bandwidth together.
 > > 
 > 
 > I thought craylink is 6 GB/s:
 > 
 > cl0: flags=4041<UP,RUNNING,DRVRLOCK>
 >         inet 192.0.2.113 netmask 0xffffff00 
 >         speed 6.40 Gbit/s
 > 
 > Is it not true in practice?

     Yes, but I think the "4GB/s" was "4 gigabytes/second".  The link is
about 800 megabytes/second (6.4 gigabits/second).
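
The unit mix-up is easy to check mechanically.  Here is a minimal Python
sketch of the conversion; the function name is an assumption for
illustration, and the link count at the end ignores protocol overhead and
topology, so treat it as a lower bound rather than a sizing rule.

```python
import math

BITS_PER_BYTE = 8


def gbit_to_gbyte_per_s(gbit_per_s):
    """Convert a link speed from gigabits/second to gigabytes/second."""
    return gbit_per_s / BITS_PER_BYTE


link_gbyte_s = gbit_to_gbyte_per_s(6.40)
print(f"6.40 Gbit/s = {link_gbyte_s:.2f} GB/s")

# How many such links it would take, at minimum, to carry 4 gigabytes/s.
links = math.ceil(round(4 / link_gbyte_s, 9))
print(f"links needed for 4 GB/s: {links}")
```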

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
@ 1998-11-25 22:57                   ` Greg Chesson
  0 siblings, 0 replies; 35+ messages in thread
From: Greg Chesson @ 1998-11-25 22:57 UTC (permalink / raw)
  To: pjlahaie; +Cc: Ariel Faigon, Olivier Galibert, linux

no problem, I apologize for being jumpy.

When you're married to an English teacher you can easily learn
to read too much into someone's choice of words (because that's
what literary criticism is all about.... :-)

g

-- 
Greg Chesson

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
  1998-11-25 20:37       ` Ariel Faigon
                         ` (2 preceding siblings ...)
  (?)
@ 1998-11-26 12:28       ` ralf
       [not found]         ` <19981126085407.A2201@uni-koblenz.de>
  -1 siblings, 1 reply; 35+ messages in thread
From: ralf @ 1998-11-26 12:28 UTC (permalink / raw)
  To: Ariel Faigon, Olivier Galibert; +Cc: linux

On Wed, Nov 25, 1998 at 12:37:36PM -0800, Ariel Faigon wrote:

> :Linux 2.1.* is very preemptible, even if there are still some things to
> :do.
> 
> Interesting.  Could you elaborate on:
> 
> 	0) What was changed in recent Linux kernels
> 	   to support preemptibility in kernel space?

I think people are confusing the terms reentrant and preemptible.

> 	1) Which "serious" (i.e not 'getpid') system calls are
> 	   now reentrant ?

The large majority of the ``small stuff'' is now reentrant, that means
signals, interrupts, stuff like getpid.  Many subsystems or structures are
nowadays protected by their own locks and no longer by the big evil
lock-everything kernel lock.

> 	2) What still remains to be done so Linux can really
> 	   scale before it gets bottlenecked by kernel locks ?

The big ones which still need a lot of work are

 - VFS and lower layers are protected by the big kernel lock.
 - bottom half handlers run on only one CPU.
 - socket code is protected by the big kernel lock.

It's 2.3 work, don't expect it to happen any day soon.  If you're
interested in more details, grep the kernel for lock_kernel / unlock_kernel.
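
For anyone who prefers a per-file tally over raw grep output, here is a
minimal Python sketch that counts lock_kernel() / unlock_kernel() call
sites in a source tree.  The function name `count_bkl_calls`, the regular
expression, and the demo tree are assumptions for illustration; point it at
a real kernel checkout to get meaningful numbers.

```python
import os
import re
import tempfile

# Matches calls to lock_kernel( or unlock_kernel( as whole identifiers.
BKL_CALL = re.compile(r"\b(?:un)?lock_kernel\s*\(")


def count_bkl_calls(root):
    """Return {relative_path: count} for .c files under `root` that
    contain calls to lock_kernel() or unlock_kernel()."""
    hits = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(".c"):
                continue
            path = os.path.join(dirpath, name)
            # latin-1 never fails to decode, which suits old source trees.
            with open(path, encoding="latin-1") as f:
                n = len(BKL_CALL.findall(f.read()))
            if n:
                hits[os.path.relpath(path, root)] = n
    return hits


# Demo on a throwaway tree with one hypothetical source file.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "fs"))
    with open(os.path.join(root, "fs", "demo.c"), "w") as f:
        f.write("void f(void) { lock_kernel(); work(); unlock_kernel(); }\n")
    for path, n in sorted(count_bkl_calls(root).items()):
        print(f"{n:4d}  {path}")
```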

  Ralf

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
  1998-11-25 20:37       ` Ariel Faigon
                         ` (3 preceding siblings ...)
  (?)
@ 1998-11-26 22:17       ` Miguel de Icaza
  -1 siblings, 0 replies; 35+ messages in thread
From: Miguel de Icaza @ 1998-11-26 22:17 UTC (permalink / raw)
  To: ariel; +Cc: galibert, linux

> 	  1) Which "serious" (i.e not 'getpid') system calls are
> 	     now reentrant ?

Very few, and neither the file system layer nor the networking layer
has been properly fine-grain locked for this task to make sense.

Linux is still far from competing with IRIX and Solaris in this
field.

Miguel.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
       [not found]         ` <19981126085407.A2201@uni-koblenz.de>
@ 1998-11-27 22:59           ` Olivier Galibert
  1998-11-28  2:45             ` ralf
  0 siblings, 1 reply; 35+ messages in thread
From: Olivier Galibert @ 1998-11-27 22:59 UTC (permalink / raw)
  To: ralf, Ariel Faigon; +Cc: linux

On Thu, Nov 26, 1998 at 08:54:07AM -0600, ralf@uni-koblenz.de wrote:
> On Thu, Nov 26, 1998 at 06:28:37AM -0600, ralf@uni-koblenz.de wrote:
> 
> > The big ones which still need a lot of work are
> > 
> >  - VFS and lower layers are protected by the big kernel lock.
> 
> Talked with Stephen Tweedie about this, it's considered a tough job to
> multithread that right.

Afaik, the main problem is avoiding deadlocks.  Tough job.

  OG.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: help offered
  1998-11-27 22:59           ` Olivier Galibert
@ 1998-11-28  2:45             ` ralf
  0 siblings, 0 replies; 35+ messages in thread
From: ralf @ 1998-11-28  2:45 UTC (permalink / raw)
  To: Ariel Faigon, linux

On Fri, Nov 27, 1998 at 11:59:18PM +0100, Olivier Galibert wrote:

> On Thu, Nov 26, 1998 at 08:54:07AM -0600, ralf@uni-koblenz.de wrote:
> > On Thu, Nov 26, 1998 at 06:28:37AM -0600, ralf@uni-koblenz.de wrote:
> > 
> > > The big ones which still need a lot of work are
> > > 
> > >  - VFS and lower layers are protected by the big kernel lock.
> > 
> > Talked with Stephen Tweedie about this, it's considered a tough job to
> > multithread that right.
> 
> Afaik, the main problem is avoiding deadlocks.  Tough job.

Not only that: it has to be efficient and Linus has to like it.  As things
are right now only some of Stephen's work in that area was done with MT in
mind.

  Ralf

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~1998-11-28  2:46 UTC | newest]

Thread overview: 35+ messages
1998-11-24 12:27 help offered Torbjörn Gannholm
1998-11-24 20:33 ` Ariel Faigon
1998-11-25 19:49   ` Olivier Galibert
1998-11-25 19:57     ` John E. Schimmel
1998-11-25 19:57       ` John E. Schimmel
1998-11-25 20:11     ` Jeffrey Watts
1998-11-25 20:43       ` Greg Chesson
1998-11-25 20:37     ` Ariel Faigon
1998-11-25 20:37       ` Ariel Faigon
1998-11-25 20:51       ` pjlahaie
1998-11-25 21:18         ` William J. Earl
1998-11-25 21:18           ` William J. Earl
1998-11-25 21:24         ` Greg Chesson
1998-11-25 21:24           ` Greg Chesson
1998-11-25 21:38           ` pjlahaie
1998-11-25 21:57             ` Greg Chesson
1998-11-25 21:57               ` Greg Chesson
1998-11-25 22:09               ` pjlahaie
1998-11-25 22:57                 ` Greg Chesson
1998-11-25 22:57                   ` Greg Chesson
1998-11-25 22:08             ` William J. Earl
1998-11-25 22:08               ` William J. Earl
1998-11-25 22:13         ` Alex Kozlov
1998-11-25 22:15           ` pjlahaie
1998-11-25 22:25           ` William J. Earl
1998-11-25 22:25             ` William J. Earl
1998-11-25 21:46       ` Alan Cox
1998-11-25 21:04         ` Greg Chesson
1998-11-25 21:04           ` Greg Chesson
1998-11-26 12:28       ` ralf
     [not found]         ` <19981126085407.A2201@uni-koblenz.de>
1998-11-27 22:59           ` Olivier Galibert
1998-11-28  2:45             ` ralf
1998-11-26 22:17       ` Miguel de Icaza
1998-11-25 20:56     ` Greg Chesson
1998-11-25 21:12       ` Olivier Galibert
