netdev.vger.kernel.org archive mirror
* Re: TOE brain dump
@ 2003-08-04 16:45 jamal
  2003-08-04 18:48 ` Ihar 'Philips' Filipau
  0 siblings, 1 reply; 53+ messages in thread
From: jamal @ 2003-08-04 16:45 UTC (permalink / raw)
  To: netdev; +Cc: Ihar 'Philips' Filipau


Can you please post to netdev? Posting networking-related issues to
linux-kernel alone is considered rude. Posting them to netdev only
is acceptable.

>  Ihar 'Philips' Filipau wrote:
>
> >Werner Almesberger wrote:
> > Ihar 'Philips' Filipau wrote:
> >
> > >  Modern NPUs generally do this.
> >
> >
> > Unfortunately, they don't - they run *some* code, but that
> > is rarely a Linux kernel, or a substantial part of it.
> >
>
>    The embedded CPU we are using is MIPS-based, and has a lot of
> specialized instructions.
>    It does not make much sense to run a kernel (especially Linux) on a
> CPU which is optimized for handling network packets. (And which actually
> has several co-processors to help with this task.)

The coprocessors are useful, but that has nothing to do with the value
of the NPU. You could add those to a general-purpose processor system.
I am also in the camp that believes that, to be really useful, these
things need to run a real OS - Linux.

>    How much sense does it make to run a general-purpose OS (optimized
> for PCs and servers) on devices which can perform only a couple of
> functions? (And which have no MMU, btw.)
>
>    That is the whole idea behind this kind of CPU - to do a few
> functions, but to do them well.
>
>    If you start stretching CPUs like this to fit the Linux kernel, it
> will generally just increase the price. Probably there are some markets
> which can afford this.
>

Actually I believe it will lower prices. I am waiting for Intel to get
hyperthreading right - then we'll see these things disappear.
The only thing useful about NPUs is their ability to manage the
discrepancy between memory latency and CPU speeds. Trust me, I used to
be in the same camp as you. If you note, a lot of these things appeared
around the height of the .com days. VCs were looking for something
new and exciting.

>    Remember - "Small is beautiful" (c) - and the Linux kernel is far
> from it.
>    Our routing code which handles two GE interfaces (actually not pure
> GE, but up to 2.5Gb/s) fits into 3k. 3k of code - and that's it, not
> 650kb of bzip-compressed bloat. And it handles two interfaces, handles
> the fast data path from sibling interfaces, handles up to 1e6 routes.
> 3k of code, not 650k of bzip.

If all you wanted was to do L3, why not just buy a $5 chip that can do this
for a lot more interfaces? Why sweat over optimizing L3 routing into a
3K space?
To nit: it's no longer about routing or bridging, friend. That's like getting
fries at McDonald's.

cheers,
jamal 

* RE: TOE brain dump
@ 2003-08-04 18:36 Perez-Gonzalez, Inaky
  2003-08-04 19:03 ` Alan Cox
  0 siblings, 1 reply; 53+ messages in thread
From: Perez-Gonzalez, Inaky @ 2003-08-04 18:36 UTC (permalink / raw)
  To: Larry McVoy, David Lang
  Cc: Erik Andersen, Werner Almesberger, Jeff Garzik, netdev,
	linux-kernel, Nivedita Singhvi


> From: Larry McVoy [mailto:lm@bitmover.com]
>
> > 2. router nodes that have access to main memory (PCI card running linux
> > acting as a router/firewall/VPN to offload the main CPU's)
> 
> I can get an entire machine, memory, disk, > Ghz CPU, case, power supply,
> cdrom, floppy, onboard enet extra net card for routing, for $250 or less,
> quantity 1, shipped to my door.
> 
> Why would I want to spend money on some silly offload card when I can get
> the whole PC for less than the card?

Because you want to stack 200 of those together in a huge
data center interconnecting whatever you want to interconnect
and you don't want your maintenance costs to go up to the sky?

I see your point, though :)

Iñaky Pérez-González -- Not speaking for Intel -- all opinions are my own (and my fault)

* TOE brain dump
@ 2003-08-02 17:04 Werner Almesberger
  2003-08-02 17:32 ` Nivedita Singhvi
  2003-08-02 20:57 ` Alan Cox
  0 siblings, 2 replies; 53+ messages in thread
From: Werner Almesberger @ 2003-08-02 17:04 UTC (permalink / raw)
  To: netdev, linux-kernel

At OLS, there was a bit of discussion on (true and false *) TOEs
(TCP Offload Engines). In the course of this discussion, I've
suggested what might be a novel approach, so in case this is a
good idea, I'd like to dump my thoughts on it, before someone
tries to patent my ideas. (Most likely, some of this has already
been done or tried elsewhere, but it can't hurt to try to err on
the safe side.)

(*) The InfiniBand people unfortunately also call their TCP/IP
    bypass "TOE" (for which they promptly get shouted down,
    every time they use that word). This is misleading, because
    there is no TCP that's getting offloaded - TCP is simply
    never done. I would consider it more accurate to view
    this as a separate networking technology, with semantics
    different from TCP/IP, similar to ATM and AAL5.

While I'm not entirely convinced about the usefulness of TOE in
all the cases it's been suggested for, I can see value in certain
areas, e.g. when TCP per-packet overhead becomes an issue.

However, I consider the approach of putting a new or heavily
modified stack, which duplicates a considerable amount of the
functionality in the main kernel, on a separate piece of hardware
questionable at best. Some of the issues:

 - if this stack is closed source or generally hard to modify,
   security fixes will be slowed down

 - if this stack is closed source or generally hard to modify,
   TOE will not be available to projects modifying the stack,
   e.g. any of the research projects trying to make TCP work at
   gigabit speeds

 - this stack either needs to implement all administrative
   interfaces of the regular kernel, or such a system would have
   non-uniform configuration/monitoring across interfaces

 - in some cases, administrative interfaces will require a
   NIC/TOE-specific switch in the kernel (netlink helps here)

 - route changes on multi-homed hosts (or any similar kind of
   failover) are difficult if the state of TCP connections is
   tied to specific NICs (I've discussed some issues when
   "migrating" TCP connections in the documentation of tcpcp,
   http://www.almesberger.net/tcpcp/)

 - new kernel features will always lag behind on this kind of
   TOE, and different kernels will require different "firmware"

 - last but not least, keeping TOE firmware up to date with the
   TCP/IP stack in the mainstream kernel will require - for each
   such TOE device - a significant and continuous effort over a
   long period of time

In short, I think such a solution is either a pain to use, or
unmaintainable, or - most likely - both.

So, how to do better? Easy: use the Source, Luke. Here's my
idea:

 - instead of putting a different stack on the TOE, a
   general-purpose processor (probably with some enhancements,
   and certainly with optimized data paths) is added to the NIC

 - that processor runs the same Linux kernel image as the host,
   acting like a NUMA system

 - a selectable part of TCP/IP is handled on the NIC, and the
   rest of the system runs on the host processor

 - instrumentation is added to the mainstream kernel to ensure
   that as little data as possible is shared between the main
   CPU and such peripheral CPUs. Note that such instrumentation
   would be generic, outlining possible boundaries, and not tied
   to a specific TOE design.

 - depending on hardware details (cache coherence, etc.), the
   instrumentation mentioned above may even be necessary for
   correctness. This would have the unfortunate effect of making
   the design very fragile with respect to changes in the
   mainstream kernel. (Performance loss in the case of imperfect
   instrumentation would be preferable.)

 - further instrumentation may be needed to let the kernel switch
   CPUs (i.e. host to NIC, and vice versa) at the right time

 - since the NIC would probably use a CPU design different from
   the host CPU, we'd need "fat" kernel binaries:

   - data structures are the same, i.e. word sizes, byte order,
     bit numbering, etc. are compatible, and alignments are
     chosen such that all CPUs involved are reasonably happy

   - kernels live in the same address space

   - function pointers become arrays, with one pointer per
     architecture. When comparing pointers, the first element is
     used.

 - if one should choose to also run parts of user space on the
   NIC, fat binaries would also be needed for this (along with
   other complications)

Benefits:

 - putting the CPU next to the NIC keeps data paths short, and
   allows for all kinds of optimizations (e.g. a pipelined
   memory architecture)

 - the design is fairly generic, and would equally apply to
   other areas of the kernel than TCP/IP

 - using the same kernel image eliminates most maintenance
   problems, and encourages experimenting with the stack

 - using the same kernel image (and compatible data structures)
   guarantees that administrative interfaces are uniform in the
   entire system

 - such a design is likely to be able to allow TCP state to be
   moved to a different NIC, if necessary

Possible problems, that may kill this idea:

 - it may be too hard to achieve correctness

 - it may be too hard to switch CPUs properly

 - it may not be possible to express copy operations efficiently
   in such a context

 - there may be no way to avoid sharing of hardware-specific
   data structures, such as page tables, or to emulate their use

 - people may consider the instrumentation required for this,
   although fairly generic, too intrusive

 - all this instrumentation may eat too much performance

 - nobody may be interested in building hardware for this

 - nobody may be patient enough to pursue such long-termish
   development, with uncertain outcome

 - something I haven't thought of

I lack the resources (hardware, financial, and otherwise) to
actually do something with these ideas, so please feel free to
put them to some use.

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina     werner@almesberger.net /
/_http://www.almesberger.net/____________________________________________/


Thread overview: 53+ messages
2003-08-04 16:45 TOE brain dump jamal
2003-08-04 18:48 ` Ihar 'Philips' Filipau
2003-08-04 19:42   ` jamal
2003-08-04 20:06     ` Ihar 'Philips' Filipau
  -- strict thread matches above, loose matches on Subject: below --
2003-08-04 18:36 Perez-Gonzalez, Inaky
2003-08-04 19:03 ` Alan Cox
2003-08-02 17:04 Werner Almesberger
2003-08-02 17:32 ` Nivedita Singhvi
2003-08-02 18:06   ` Werner Almesberger
2003-08-02 19:08   ` Jeff Garzik
2003-08-02 21:49     ` Werner Almesberger
2003-08-03  6:40       ` Jeff Garzik
2003-08-03 17:57         ` Werner Almesberger
2003-08-03 18:27           ` Erik Andersen
2003-08-03 19:40             ` Larry McVoy
2003-08-03 20:13               ` David Lang
2003-08-03 20:30                 ` Larry McVoy
2003-08-03 21:21                   ` David Lang
2003-08-03 23:44                     ` Larry McVoy
2003-08-03 21:58                   ` Jeff Garzik
2003-08-05 19:28                   ` Timothy Miller
2003-08-03 20:34               ` jamal
     [not found]         ` <3F2DBB2B.9050803@aarnet.edu.au>
2003-08-04  5:25           ` David S. Miller
2003-08-06  7:12         ` Andre Hedrick
     [not found]         ` <Pine.LNX.4.10.10308060009130.25045-100000@master.linux-ide.org>
2003-08-06  8:20           ` Lincoln Dale
2003-08-06  8:22             ` David S. Miller
2003-08-06 13:07               ` Jesse Pollard
2003-08-03 19:21       ` Eric W. Biederman
2003-08-04 19:24         ` Werner Almesberger
2003-08-04 19:26           ` David S. Miller
2003-08-05 17:25             ` Eric W. Biederman
2003-08-05 17:19           ` Eric W. Biederman
2003-08-06  5:13             ` Werner Almesberger
2003-08-06  7:58               ` Eric W. Biederman
2003-08-06 13:37                 ` Werner Almesberger
2003-08-06 12:46             ` Jesse Pollard
2003-08-06 16:25               ` Andy Isaacson
2003-08-06 18:58                 ` Jesse Pollard
2003-08-06 19:39                   ` Andy Isaacson
2003-08-06 21:13                     ` David Schwartz
2003-08-03  4:01     ` Ben Greear
2003-08-03  6:22       ` Alan Shih
2003-08-03  6:41         ` Jeff Garzik
2003-08-03  8:25         ` David Lang
2003-08-03 18:05           ` Werner Almesberger
2003-08-03 22:02           ` Alan Shih
2003-08-03 20:52       ` Alan Cox
2003-08-04 14:36     ` Ingo Oeser
2003-08-04 17:19       ` Alan Shih
2003-08-05  8:15         ` Ingo Oeser
2003-08-02 20:57 ` Alan Cox
2003-08-02 22:14   ` Werner Almesberger
2003-08-03 20:51     ` Alan Cox
