All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jesse Pollard <jesse@cats-chateau.net>
To: ebiederm@xmission.com (Eric W. Biederman),
	Werner Almesberger <werner@almesberger.net>
Cc: Jeff Garzik <jgarzik@pobox.com>,
	Nivedita Singhvi <niv@us.ibm.com>,
	netdev@oss.sgi.com, linux-kernel@vger.kernel.org
Subject: Re: TOE brain dump
Date: Wed, 6 Aug 2003 07:46:33 -0500	[thread overview]
Message-ID: <03080607463300.08387@tabby> (raw)
In-Reply-To: <m1u18wuinm.fsf@frodo.biederman.org>

On Tuesday 05 August 2003 12:19, Eric W. Biederman wrote:
> Werner Almesberger <werner@almesberger.net> writes:
> > Eric W. Biederman wrote:
> > > The optimized for low latency cases seem to have a strong
> > > market in clusters.
> >
> > Clusters have captive, no, _desperate_ customers ;-) And it
> > seems that people are just as happy putting MPI as their
> > transport on top of all those link-layer technologies.
>
> MPI is not a transport.  It an interface like the Berkeley sockets
> layer.  The semantics it wants right now are usually mapped to
> TCP/IP when used on an IP network.  Though I suspect SCTP might
> be a better fit.
>
> But right now nothing in the IP stack is a particularly good fit.
>
> Right now there is a very strong feeling among most of the people
> using and developing on clusters that by and large what they are doing
> is not of interest to the general kernel community, and so has no
> chance of going in.   So you see hack piled on top of hack piled on
> top of hack.
>
> Mostly I think the that is less true, at least if they can stand the
> process of severe code review and cleaning up their code.  If we can
> put in code to scale the kernel to 64 processors.  NIC drivers for
> fast interconnects and a few similar tweaks can't hurt either.
>
> But of course to get through the peer review process people need
> to understand what they are doing.
>
> > > There is one place in low latency communications that I can think
> > > of where TCP/IP is not the proper solution.  For low latency
> > > communication the checksum is at the wrong end of the packet.
> >
> > That's one of the few things ATM's AAL5 got right. But in the end,
> > I think it doesn't really matter. At 1 Gbps, an MTU-sized packet
> > flies by within 13 us. At 10 Gbps, it's only 1.3 us. At that point,
> > you may well treat it as an atomic unit.
>
> So store and forward of packets in a 3 layer switch hierarchy, at 1.3 us
> per copy. 1.3us to the NIC + 1.3us to the first switch chip + 1.3us to the
> second switch chip + 1.3us to the top level switch chip + 1.3us to a middle
> layer switch chip + 1.3us to the receiving NIC + 1.3us the receiver.
>
> 1.3us * 7 = 9.1us to deliver a packet to the other side.  That is
> still quite painful.  Right now I can get better latencies over any of
> the cluster interconnects.  I think 5 us is the current low end, with
> the high end being about 1 us.

I think you are off here since the second and third layer should not recompute
checksums other than for the header (if they even did that). Most of the
switches I used (mind, not configured) were wire speed. Only header checksums
had recomputes, and I understood it was only for routing.

> Quite often in MPI when a message is sent the program cannot continue
> until the reply is received.  Possibly this is a fundamental problem
> with the application programming model, encouraging applications to
> be latency sensitive.  But it is a well established API and
> programming paradigm so it has to be lived with.
>
> All of this is pretty much the reverse of the TOE case.  Things are
> latency sensitive because real work needs to be done.  And the more
> latency you have the slower that work gets done.
>
> A lot of the NICs which are used for MPI tend to be smart for two
> reasons.  1) So they can do source routing. 2) So they can safely
> export some of their interface to user space, so in the fast path
> they can bypass the kernel.

And bypass any security checks required. A single rogue MPI application
using such an interface can/will bring the cluster down.

Now this is not as much of a problem since many clusters use a standalone
internal network, AND are single application clusters. These clusters
tend to be relatively small (32 - 64 nodes? perhaps 16-32 is better. The
clusters I've worked with have always been large 128-300 nodes, so I'm
not a good judge of "small").

This is immediately broken when you schedule two or more batch jobs on
a cluster in parallel.

It is also broken if the two jobs require different security contexts.

  parent reply	other threads:[~2003-08-06 12:47 UTC|newest]

Thread overview: 84+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2003-08-02 17:04 TOE brain dump Werner Almesberger
2003-08-02 17:32 ` Nivedita Singhvi
2003-08-02 18:06   ` Werner Almesberger
2003-08-02 19:08   ` Jeff Garzik
2003-08-02 21:49     ` Werner Almesberger
2003-08-03  6:40       ` Jeff Garzik
2003-08-03 17:57         ` Werner Almesberger
2003-08-03 18:27           ` Erik Andersen
2003-08-03 19:40             ` Larry McVoy
2003-08-03 20:13               ` David Lang
2003-08-03 20:30                 ` Larry McVoy
2003-08-03 21:21                   ` David Lang
2003-08-03 21:21                     ` David Lang
2003-08-03 23:44                     ` Larry McVoy
2003-08-03 21:58                   ` Jeff Garzik
2003-08-05 19:28                   ` Timothy Miller
2003-08-03 20:34               ` jamal
2003-08-04  1:47         ` Glen Turner
2003-08-04  3:48           ` Larry McVoy
2003-08-04  5:25           ` David S. Miller
2003-08-04 16:20             ` Web100 Matt Mathis
2003-08-06  7:12         ` TOE brain dump Andre Hedrick
     [not found]         ` <Pine.LNX.4.10.10308060009130.25045-100000@master.linux-ide .org>
2003-08-06  8:20           ` Lincoln Dale
2003-08-06  8:22             ` David S. Miller
2003-08-06 13:07               ` Jesse Pollard
2003-08-03 19:21       ` Eric W. Biederman
2003-08-04 19:24         ` Werner Almesberger
2003-08-04 19:26           ` David S. Miller
2003-08-05 17:25             ` Eric W. Biederman
2003-08-05 17:19           ` Eric W. Biederman
2003-08-06  5:13             ` Werner Almesberger
2003-08-06  7:58               ` Eric W. Biederman
2003-08-06 13:37                 ` Werner Almesberger
2003-08-06 15:58                   ` Andy Isaacson
2003-08-06 16:27                     ` Chris Friesen
2003-08-06 17:01                       ` Andy Isaacson
2003-08-06 17:55                         ` Matti Aarnio
2003-08-07  2:14                         ` Lincoln Dale
2003-08-06 12:46             ` Jesse Pollard [this message]
2003-08-06 16:25               ` Andy Isaacson
2003-08-06 18:58                 ` Jesse Pollard
2003-08-06 19:39                   ` Andy Isaacson
2003-08-06 21:13                     ` David Schwartz
2003-08-03  4:01     ` Ben Greear
2003-08-03  6:22       ` Alan Shih
2003-08-03  6:22         ` Alan Shih
2003-08-03  6:41         ` Jeff Garzik
2003-08-03  8:25         ` David Lang
2003-08-03 18:05           ` Werner Almesberger
2003-08-03 22:02           ` Alan Shih
2003-08-03 20:52       ` Alan Cox
2003-08-04 14:36     ` Ingo Oeser
2003-08-04 17:19       ` Alan Shih
2003-08-04 17:19         ` Alan Shih
2003-08-05  8:15         ` Ingo Oeser
2003-08-02 20:57 ` Alan Cox
2003-08-02 22:14   ` Werner Almesberger
2003-08-03 20:51     ` Alan Cox
     [not found] <g83n.8vu.9@gated-at.bofh.it>
2003-08-03 12:13 ` Ihar 'Philips' Filipau
2003-08-03 18:10   ` Werner Almesberger
2003-08-04  8:55     ` Ihar 'Philips' Filipau
2003-08-04 13:08       ` Jesse Pollard
2003-08-04 19:32       ` Werner Almesberger
2003-08-04 19:48         ` David Lang
2003-08-04 19:56           ` Werner Almesberger
2003-08-04 20:01             ` David Lang
2003-08-04 20:09               ` Werner Almesberger
2003-08-04 20:24                 ` David Lang
2003-08-05  1:38                   ` Werner Almesberger
2003-08-05  1:46                     ` David Lang
2003-08-05  1:54                       ` Larry McVoy
2003-08-05  2:30                         ` Werner Almesberger
2003-08-06  1:47                           ` Val Henson
2003-08-05  3:04                       ` Werner Almesberger
2003-08-04 23:30           ` Peter Chubb
     [not found] <gq0f.8bj.9@gated-at.bofh.it>
     [not found] ` <gvCD.4mJ.5@gated-at.bofh.it>
     [not found]   ` <gJmp.7Th.33@gated-at.bofh.it>
     [not found]     ` <gNpS.2YJ.9@gated-at.bofh.it>
2003-08-04 14:15       ` Ihar 'Philips' Filipau
2003-08-04 14:56         ` Jesse Pollard
2003-08-04 15:51           ` Ihar 'Philips' Filipau
  -- strict thread matches above, loose matches on Subject: below --
2003-08-04 16:45 jamal
2003-08-04 18:48 ` Ihar 'Philips' Filipau
2003-08-04 19:42   ` jamal
2003-08-04 20:06     ` Ihar 'Philips' Filipau
2003-08-04 18:36 Perez-Gonzalez, Inaky
2003-08-04 19:03 ` Alan Cox

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=03080607463300.08387@tabby \
    --to=jesse@cats-chateau.net \
    --cc=ebiederm@xmission.com \
    --cc=jgarzik@pobox.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@oss.sgi.com \
    --cc=niv@us.ibm.com \
    --cc=werner@almesberger.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.