The ultimate TOE design

* The ultimate TOE design
@ 2004-09-15 19:33 Jeff Garzik
  2004-09-15 20:04 ` Paul Jakma
                   ` (3 more replies)
  0 siblings, 4 replies; 69+ messages in thread
From: Jeff Garzik @ 2004-09-15 19:33 UTC (permalink / raw)
  To: netdev; +Cc: leonid.grossman, Linux Kernel

(reply-to set to netdev)

Every now and then people ask on the lists about TOE, TCP assist, and 
that sort of thing.  Ignoring the issue of TCP hardware assist, I wanted 
to describe what I feel is an optimal method to _fully offload_ the 
Linux TCP stack.

Put simply, the "ultimate TOE card" would be a card with network ports, 
a generic CPU (ARM, MIPS, whatever), some RAM, and some flash.  This 
card's "firmware" is the Linux kernel, configured to run as a _totally 
independent network node_, with IP address(es) all its own.

Then, your host system OS will communicate with the Linux kernel running 
on the card across the PCI bus, using IP packets (64K fixed MTU).

This has several effects:

1) Fragment processing, IPsec, and other services move onto the card.

2) You can use huge card<->host MTUs, which makes sendfile(2) faster 
with _zero_ kernel changes.

3) You can let the PCI card do 100% of the checksum 
processing/generation, and treat the network connection 
across the PCI bus as CHECKSUM_UNNECESSARY.

4) With enough RAM and CPU cycles, you can even offload complex services 
like Web services:  the PCI card runs Apache, and fetches files across 
the network (your PCI bus!) from the host system.

5) It does not require _any_ modification of the Linux network stack. 
Interfacing with the card merely requires a simple DMA interface to copy 
IP (not ethernet) packets across the PCI bus, and that fits within the 
existing Linux net driver API.

6) It ensures that the TOE "firmware" [the Linux kernel] can be easily 
updated in the event of new features or (more importantly) security 
problems.

7) Linux is the most RFC-compliant net stack in the world.  Why 
re-create (or license) an inferior one?

8) Long-term maintenance of TOE firmware is a BIG problem with existing 
full-TOE systems.  Under this design, sysadmins would update and patch 
their PCI card with security updates just like any other system on their 
network.  This is added work, yes, but it's a known quantity and a task 
they are already doing for other systems.

9) The design is both portable [tons of embedded CPUs, with and without 
MMUs, can run Linux] and scalable.
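
The checksum item above is about which side performs the ones'-complement
sum over every byte. As a rough user-space illustration of that work (this
is not kernel code, and `internet_checksum` is a name made up for this
sketch), the RFC 1071 Internet checksum can be written in Python as:

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit ones'-complement Internet checksum (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with one zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# Example words taken from RFC 1071.
sample = bytes.fromhex("0001f203f4f5f6f7")
csum = internet_checksum(sample)
```

A receiver verifies by running the same sum over the data plus the
transmitted checksum and expecting 0. That per-byte pass is what the host
gets to skip when the driver marks packets CHECKSUM_UNNECESSARY.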

My dream is that some vendor will come along and implement such a 
design, and sell it in enough volume that it's US$100 or less.  There 
are a few cards on the market already where implementing this design 
_may_ be possible, but they are all fairly expensive.  You just need enough 
resources on the PCI card to be able to run Linux as a 
router/firewall/iSCSI/web-proxy gadget.

And I'm not aware of anybody doing a direct IP-over-PCI thing, either.

But I'll keep on dreaming...  ;-)

	Jeff

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 19:33 The ultimate TOE design Jeff Garzik
@ 2004-09-15 20:04 ` Paul Jakma
  2004-09-15 19:14   ` Alan Cox
                     ` (3 more replies)
  2004-09-15 20:11 ` David Stevens
                   ` (2 subsequent siblings)
  3 siblings, 4 replies; 69+ messages in thread
From: Paul Jakma @ 2004-09-15 20:04 UTC (permalink / raw)
  To: Netdev; +Cc: leonid.grossman, Linux Kernel

On Wed, 15 Sep 2004, Jeff Garzik wrote:

> Put simply, the "ultimate TOE card" would be a card with network ports, a 
> generic CPU (ARM, MIPS, whatever), some RAM, and some flash.  This card's 
> "firmware" is the Linux kernel, configured to run as a _totally independent 
> network node_, with IP address(es) all its own.
>
> Then, your host system OS will communicate with the Linux kernel running on 
> the card across the PCI bus, using IP packets (64K fixed MTU).

> My dream is that some vendor will come along and implement such a 
> design, and sell it in enough volume that it's US$100 or less. 
> There are a few cards on the market already where implementing this 
> design _may_ be possible, but they are all fairly expensive.

The Intel IXPs are like the above: XScale+extra-bits host-on-a-PCI 
card running Linux. Or is that what you were referring to with 
"<cards exist> but they are all fairly expensive."?

> 	Jeff

regards,
-- 
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
There is nothing so easy but that it becomes difficult when you do it
reluctantly.
 		-- Publius Terentius Afer (Terence)

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 20:04 ` Paul Jakma
@ 2004-09-15 19:14   ` Alan Cox
  2004-09-15 20:41     ` Jeff Garzik
                       ` (3 more replies)
  2004-09-15 20:26   ` Neil Horman
                     ` (2 subsequent siblings)
  3 siblings, 4 replies; 69+ messages in thread
From: Alan Cox @ 2004-09-15 19:14 UTC (permalink / raw)
  To: Paul Jakma; +Cc: Netdev, leonid.grossman, Linux Kernel Mailing List

On Wed, 2004-09-15 at 21:04, Paul Jakma wrote:
> The Intel IXPs are like the above: XScale+extra-bits host-on-a-PCI 
> card running Linux. Or is that what you were referring to with 
> "<cards exist> but they are all fairly expensive."?

Last time I checked, 2GHz accelerators for Intel and AMD were quite cheap
and also had the advantage that they ran user-mode code when idle from
network processing.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 19:14   ` Alan Cox
@ 2004-09-15 20:41     ` Jeff Garzik
  2004-09-15 21:01       ` David S. Miller
  2004-09-15 20:53     ` David S. Miller
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 69+ messages in thread
From: Jeff Garzik @ 2004-09-15 20:41 UTC (permalink / raw)
  To: Alan Cox; +Cc: Paul Jakma, Netdev, leonid.grossman, Linux Kernel Mailing List

Alan Cox wrote:
> On Wed, 2004-09-15 at 21:04, Paul Jakma wrote:
> 
>>The Intel IXPs are like the above: XScale+extra-bits host-on-a-PCI 
>>card running Linux. Or is that what you were referring to with 
>>"<cards exist> but they are all fairly expensive."?
> 
> 
> Last time I checked, 2GHz accelerators for Intel and AMD were quite cheap
> and also had the advantage that they ran user-mode code when idle from
> network processing.


The point was more to steer people who are doing TOE _anyway_ toward a 
decent design.

As I said in another post, "just don't bother with TOE" is a very valid 
answer with today's CPUs.

	Jeff

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 20:41     ` Jeff Garzik
@ 2004-09-15 21:01       ` David S. Miller
  2004-09-15 21:08         ` Jeff Garzik
  2004-09-15 21:15         ` Michael Richardson
  0 siblings, 2 replies; 69+ messages in thread
From: David S. Miller @ 2004-09-15 21:01 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: alan, paul, netdev, leonid.grossman, linux-kernel

On Wed, 15 Sep 2004 16:41:51 -0400
Jeff Garzik <jgarzik@pobox.com> wrote:

> The point was more to steer people who are doing TOE _anyway_ toward a 
> decent design.

We shouldn't be forced to refine people's non-sensible ideas which
we'll not support anyways.

If TOE is supported on Windows only, I happily welcome that.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 21:01       ` David S. Miller
@ 2004-09-15 21:08         ` Jeff Garzik
  2004-09-15 21:13           ` David S. Miller
  2004-09-15 21:15         ` Michael Richardson
  1 sibling, 1 reply; 69+ messages in thread
From: Jeff Garzik @ 2004-09-15 21:08 UTC (permalink / raw)
  To: David S. Miller; +Cc: alan, paul, netdev, leonid.grossman, linux-kernel

On Wed, Sep 15, 2004 at 02:01:23PM -0700, David S. Miller wrote:
> On Wed, 15 Sep 2004 16:41:51 -0400
> Jeff Garzik <jgarzik@pobox.com> wrote:
> 
> > The point was more to steer people who are doing TOE _anyway_ toward a 
> > decent design.
> 
> We shouldn't be forced to refine people's non-sensible ideas which
> we'll not support anyways.

I just described a design that -we already support-.

It's a generic, scalable model that has applications outside the acronym
"TOE".  Did you read my message, or just see 'TOE' and nothing else?

Sun used this model with their x86 cards.  Total MP did something
similar with their 4-processor PowerPC cards.

There's nothing inherently wrong with sticking a computer running
Linux inside another computer ;-)

	Jeff

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 21:08         ` Jeff Garzik
@ 2004-09-15 21:13           ` David S. Miller
  2004-09-15 21:23             ` Jeff Garzik
  2004-09-15 22:31             ` Jeff Garzik
  0 siblings, 2 replies; 69+ messages in thread
From: David S. Miller @ 2004-09-15 21:13 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: alan, paul, netdev, leonid.grossman, linux-kernel

On Wed, 15 Sep 2004 17:08:18 -0400
Jeff Garzik <jgarzik@pobox.com> wrote:

> There's nothing inherently wrong with sticking a computer running
> Linux inside another computer ;-)

And we already support that :-)

Plus we have things like TSO too but that doesn't require a full Linux
instance to realize on a networking port.
Simple silicon implements this already.
I don't see how that differs from your "big MTU" ideas.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 21:13           ` David S. Miller
@ 2004-09-15 21:23             ` Jeff Garzik
  2004-09-15 21:29               ` David S. Miller
                                 ` (2 more replies)
  2004-09-15 22:31             ` Jeff Garzik
  1 sibling, 3 replies; 69+ messages in thread
From: Jeff Garzik @ 2004-09-15 21:23 UTC (permalink / raw)
  To: David S. Miller; +Cc: alan, paul, netdev, leonid.grossman, linux-kernel

David S. Miller wrote:
> On Wed, 15 Sep 2004 17:08:18 -0400
> Jeff Garzik <jgarzik@pobox.com> wrote:
> 
> 
>>There's nothing inherently wrong with sticking a computer running
>>Linux inside another computer ;-)
> 
> 
> And we already support that :-)
> 
> Plus we have things like TSO too but that doesn't require a full Linux
> instance to realize on a networking port.
> Simple silicon implements this already.
> I don't see how that differs from your "big MTU" ideas.


Part of this is about how to talk to business people.... marketing.

The typical definition of TOE is "offload 90+% of the net stack", as 
opposed to "TCP assist", which is stuff like TSO.

If people ask about how to support TOE in Linux, you can say "sure, we 
_already_ support TOE, just stick Linux on a PCI card" rather than "no 
we don't support it."

And voilà, we support TOE with zero code changes ;-)

	Jeff, who would love to have a bunch of Athlons
	on PCI cards to play with.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 21:23             ` Jeff Garzik
@ 2004-09-15 21:29               ` David S. Miller
  2004-09-15 22:26                 ` Jeff Garzik
  2004-09-15 23:29                 ` Leonid Grossman
  2004-09-16  0:57               ` jamal
  2004-09-16  9:29               ` Lincoln Dale
  2 siblings, 2 replies; 69+ messages in thread
From: David S. Miller @ 2004-09-15 21:29 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: alan, paul, netdev, leonid.grossman, linux-kernel

On Wed, 15 Sep 2004 17:23:49 -0400
Jeff Garzik <jgarzik@pobox.com> wrote:

> The typical definition of TOE is "offload 90+% of the net stack", as 
> opposed to "TCP assist", which is stuff like TSO.

I think a better goal is "offload 90+% of the net stack cost" which
is effectively what TSO does on the send side.

This is why these discussions are so circular.

If we want to discuss something specific, like receive offload
schemes, that is a very different matter.  And I'm sure folks
like Rusty have a lot to contribute in this area :-)

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 21:29               ` David S. Miller
@ 2004-09-15 22:26                 ` Jeff Garzik
  2004-09-15 23:29                 ` Leonid Grossman
  1 sibling, 0 replies; 69+ messages in thread
From: Jeff Garzik @ 2004-09-15 22:26 UTC (permalink / raw)
  To: David S. Miller; +Cc: alan, paul, netdev, leonid.grossman, linux-kernel

David S. Miller wrote:
> On Wed, 15 Sep 2004 17:23:49 -0400
> Jeff Garzik <jgarzik@pobox.com> wrote:
> 
> 
>>The typical definition of TOE is "offload 90+% of the net stack", as 
>>opposed to "TCP assist", which is stuff like TSO.
> 
> 
> I think a better goal is "offload 90+% of the net stack cost" which
> is effectively what TSO does on the send side.


A better goal is to not bother with TOE at all, and just get multi-core 
processors with huge memory bandwidth :)

Again, the point of my message is to have something _positive_ to tell 
people when they specifically asked about TOE.  Rather than "no, we'll 
never do TOE" we have "it's possible, but there are better questions you 
should be asking"

	Jeff

^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: The ultimate TOE design
  2004-09-15 21:29               ` David S. Miller
  2004-09-15 22:26                 ` Jeff Garzik
@ 2004-09-15 23:29                 ` Leonid Grossman
  2004-09-24 13:07                   ` Lennert Buytenhek
  1 sibling, 1 reply; 69+ messages in thread
From: Leonid Grossman @ 2004-09-15 23:29 UTC (permalink / raw)
  To: 'David S. Miller', 'Jeff Garzik'
  Cc: alan, paul, netdev, linux-kernel

I think Jeff's "ultimate TOE card" based upon generic embedded CPU is doable
at GbE, but we may not see such a product because it's too late for it to
succeed.

TOE is a pretty questionable product in itself; one of the main reasons
people build TOE cards is to put RDMA on top of it and end up with an RNIC
(NIC+TOE+RDMA) Ethernet card.
The hope is to eventually run all three types of server traffic (network,
storage, IPC) over an RNIC, and get rid of two other HBAs in a system.

For this "fabric conversion" over Ethernet to happen it has to be at 10GbE,
not GbE, since storage (Fibre Channel) is already at 4Gb.
And at 10GbE, embedded CPUs just don't cut it - it has to be custom ASIC
(granted, with some means to simplify debugging and reduce the risk of hw
bugs and TCP changes).

On some other points on the thread:

WRT the TOE price, I suspect that when RNICs come out they will command
little premium over conventional NICs - it will be just a technology
upgrade.

WRT larger MTU - going to bigger MTUs helps a lot, but it will be years
before the infrastructure moves beyond 9600 byte MTU. Even right now, usage
of 9600 byte Jumbos is not universal.
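
To put rough numbers on the MTU point, here is a back-of-the-envelope
sketch of how many packets per second it takes to fill a link with
MTU-sized packets. The helper name `packets_per_second` is made up for
this example, and the calculation deliberately ignores Ethernet framing,
preamble, and inter-frame gap, so real rates are somewhat lower.

```python
def packets_per_second(link_bits_per_sec: float, mtu_bytes: int) -> float:
    """Packets per second needed to fill the link with MTU-sized packets.

    Simplification: counts only the MTU bytes, no framing overhead.
    """
    return link_bits_per_sec / (mtu_bytes * 8)


# 10GbE with standard frames, 9600-byte jumbos, and a 64K card<->host MTU.
for mtu in (1500, 9600, 65535):
    print(mtu, round(packets_per_second(10e9, mtu)))
```

At 10GbE that works out to roughly 833,000 packets/s at 1500 bytes, about
130,000 at 9600 bytes, and about 19,000 at 64K, which is why anything that
cuts the per-packet count helps the host so much.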

WRT TSO, for applications that don't require RDMA, TSO indeed helps a lot on
the transmit side for 1500 MTU - 10GbE cards are inevitably CPU bound, and
we are seeing ~3x throughput improvement with normal frames.

This leaves receive offload schemes in Linux as the biggest improvement (short
of supporting TOE) to make.
It would be great to see such receive schemes defined and implemented; as I
stated in an earlier thread, we are willing to participate in such work
and put the support in the S2io 10GbE ASIC and drivers.

> -----Original Message-----
> From: David S. Miller [mailto:davem@davemloft.net] 
> Sent: Wednesday, September 15, 2004 2:29 PM
> To: Jeff Garzik
> Cc: alan@lxorguk.ukuu.org.uk; paul@clubi.ie; 
> netdev@oss.sgi.com; leonid.grossman@s2io.com; 
> linux-kernel@vger.kernel.org
> Subject: Re: The ultimate TOE design
> 
> On Wed, 15 Sep 2004 17:23:49 -0400
> Jeff Garzik <jgarzik@pobox.com> wrote:
> 
> > The typical definition of TOE is "offload 90+% of the net 
> stack", as 
> > opposed to "TCP assist", which is stuff like TSO.
> 
> I think a better goal is "offload 90+% of the net stack cost" 
> which is effectively what TSO does on the send side.
> 
> This is why these discussions are so circular.
> 
> If we want to discuss something specific, like receive 
> offload schemes, that is a very different matter.  And I'm 
> sure folks like Rusty have a lot to contribute in this area :-)
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 23:29                 ` Leonid Grossman
@ 2004-09-24 13:07                   ` Lennert Buytenhek
  2004-09-24 13:21                     ` Leonid Grossman
  0 siblings, 1 reply; 69+ messages in thread
From: Lennert Buytenhek @ 2004-09-24 13:07 UTC (permalink / raw)
  To: Leonid Grossman
  Cc: 'David S. Miller', 'Jeff Garzik', alan, paul,
	netdev, linux-kernel

On Wed, Sep 15, 2004 at 04:29:45PM -0700, Leonid Grossman wrote:

> And at 10GbE, embedded CPUs just don't cut it - it has to be custom ASIC
> (granted, with some means to simplify debugging and reduce the risk of hw
> bugs and TCP changes).

Intel's IXP2800 can do 10GbE.

http://www.intel.com/design/network/products/npfamily/ixp2800.htm


--L

^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: The ultimate TOE design
  2004-09-24 13:07                   ` Lennert Buytenhek
@ 2004-09-24 13:21                     ` Leonid Grossman
  2004-09-24 18:09                       ` Lennert Buytenhek
  0 siblings, 1 reply; 69+ messages in thread
From: Leonid Grossman @ 2004-09-24 13:21 UTC (permalink / raw)
  To: 'Lennert Buytenhek'
  Cc: 'David S. Miller', 'Jeff Garzik', alan, paul,
	netdev, linux-kernel

 

> -----Original Message-----
> From: Lennert Buytenhek [mailto:buytenh@wantstofly.org] 
> Sent: Friday, September 24, 2004 6:08 AM
> To: Leonid Grossman
> Cc: 'David S. Miller'; 'Jeff Garzik'; 
> alan@lxorguk.ukuu.org.uk; paul@clubi.ie; netdev@oss.sgi.com; 
> linux-kernel@vger.kernel.org
> Subject: Re: The ultimate TOE design
> 
> On Wed, Sep 15, 2004 at 04:29:45PM -0700, Leonid Grossman wrote:
> 
> > And at 10GbE, embedded CPUs just don't cut it - it has to be custom 
> > ASIC (granted, with some means to simplify debugging and reduce the 
> > risk of hw bugs and TCP changes).
> 
> Intel's IXP2800 can do 10GbE.

Hi Lennert,
I was referring to the server side. 
One can certainly build a 10GbE box based on IXP2800 (or some other parts),
but at 17-25W it is not usable in NICs since the entire PCI card budget is
less than that - nothing left for 10GbE PHY, memory, etc.
Leonid

> 
> http://www.intel.com/design/network/products/npfamily/ixp2800.htm
> 
> 
> --L
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-24 13:21                     ` Leonid Grossman
@ 2004-09-24 18:09                       ` Lennert Buytenhek
  2004-09-24 19:39                         ` Joel Jaeggli
  0 siblings, 1 reply; 69+ messages in thread
From: Lennert Buytenhek @ 2004-09-24 18:09 UTC (permalink / raw)
  To: Leonid Grossman
  Cc: 'David S. Miller', 'Jeff Garzik', alan, paul,
	netdev, linux-kernel

On Fri, Sep 24, 2004 at 06:21:35AM -0700, Leonid Grossman wrote:

> > > And at 10GbE, embedded CPUs just don't cut it - it has to be custom 
> > > ASIC (granted, with some means to simplify debugging and reduce the 
> > > risk of hw bugs and TCP changes).
> > 
> > Intel's IXP2800 can do 10GbE.
> 
> Hi Lennert,

Hello,


> I was referring to the server side. 
> One can certainly build a 10GbE box based on IXP2800 (or some other parts),
> but at 17-25W it is not usable in NICs since the entire PCI card budget is
> less than that - nothing left for 10GbE PHY, memory, etc.

Ah, ok, that makes sense.  As someone else also noted, the IXP2800
only has a 64/66 PCI interface anyway, so it wouldn't really be
suitable for the task you were referring to.
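
The "64/66" limit is easy to check with arithmetic. A minimal sketch,
assuming the usual simplification that a parallel PCI bus moves one
bus-width word per clock at best (the function name is made up here, and
arbitration and protocol overhead only make the real figure lower):

```python
def pci_peak_bits_per_sec(bus_width_bits: int, clock_hz: float) -> float:
    """Theoretical peak of a parallel PCI bus: one full-width transfer per clock."""
    return bus_width_bits * clock_hz


peak = pci_peak_bits_per_sec(64, 66e6)  # 64-bit, 66 MHz PCI
print(peak / 1e9)    # peak in Gbit/s
print(peak >= 10e9)  # enough for 10GbE line rate?
```

That gives about 4.2 Gbit/s of peak bus bandwidth, well under half of what
a single 10GbE port can deliver in one direction.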


cheers,
Lennert

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-24 18:09                       ` Lennert Buytenhek
@ 2004-09-24 19:39                         ` Joel Jaeggli
  0 siblings, 0 replies; 69+ messages in thread
From: Joel Jaeggli @ 2004-09-24 19:39 UTC (permalink / raw)
  To: Lennert Buytenhek
  Cc: Leonid Grossman, 'David S. Miller', 'Jeff Garzik',
	alan, paul, netdev, linux-kernel

On Fri, 24 Sep 2004, Lennert Buytenhek wrote:

>
>> I was referring to the server side.
>> One can certainly build a 10GbE box based on IXP2800 (or some other parts),
>> but at 17-25W it is not usable in NICs since the entire PCI card budget is
>> less than that - nothing left for 10GbE PHY, memory, etc.

I have a graphics card which requires two four-pin Molex power connectors. 
Going back in time, there have always been certain peripheral cards which 
required external (non-bus-supplied) power sources for whatever reason 
(hard drive on a card, SPARC on a card, PC on a card, early-90s hardware 
MPEG encoder, data collection device, remote management card, graphics card 
in a modern Mac, etc.). It's obviously not a general solution, but it's been 
done frequently enough that it shouldn't just be discarded out of hand.

> Ah, ok, that makes sense.  As someone else also noted, the IXP2800
> only has a 64/66 PCI interface anyway, so it wouldn't really be
> suitable for the task you were referring to.
>
>
> cheers,
> Lennert

-- 
-------------------------------------------------------------------------- 
Joel Jaeggli  	       Unix Consulting 	       joelja@darkwing.uoregon.edu 
GPG Key Fingerprint:     5C6E 0104 BAF0 40B0 5BD3 C38B F000 35AB B67F 56B2

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 21:23             ` Jeff Garzik
  2004-09-15 21:29               ` David S. Miller
@ 2004-09-16  0:57               ` jamal
  2004-09-16  5:25                 ` Leonid Grossman
  2004-09-16  9:29               ` Lincoln Dale
  2 siblings, 1 reply; 69+ messages in thread
From: jamal @ 2004-09-16  0:57 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: David S. Miller, alan, paul, netdev, leonid.grossman,
	linux-kernel

Jeff,
You are only allowed to start a TOE thread every six months ;->

On a serious note, I think that PCI-express (if it lives up to its
expectations) will demolish the dreams of a lot of these TOE investments.
Our problem is NOT the CPU right now (80% idle processing 450Kpps
forwarding). Bus and memory distance/latency are. If intel would get rid
of the big conspiracy in the form of chipset division and just integrate
the MC like AMD is, we'll be on our way to kill TOE and a lot of the
network processors (like the IXP). Dang, running Linux is more exciting
than microcoding things to fit into a 2Kword program store. 
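
For a feel of what "80% idle at 450Kpps" means per packet, a quick sketch.
It assumes a single CPU and that the idle figure applies to that CPU for
the whole run; neither is stated above, and the function name is made up
for this example.

```python
def per_packet_budget_us(packets_per_sec: float, cpu_idle_fraction: float):
    """Return (microseconds between packets, CPU microseconds spent per packet)."""
    interval_us = 1e6 / packets_per_sec
    cpu_us = interval_us * (1.0 - cpu_idle_fraction)
    return interval_us, cpu_us


interval, cpu = per_packet_budget_us(450_000, 0.80)
print(f"{interval:.2f} us between packets, {cpu:.2f} us of CPU per packet")
```

Under those assumptions a packet arrives about every 2.2 microseconds and
costs under half a microsecond of CPU, which leaves very little room for
trips across a slow bus or out to memory.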

I rest my canadiana $.02 

cheers,
jamal

^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: The ultimate TOE design
  2004-09-16  0:57               ` jamal
@ 2004-09-16  5:25                 ` Leonid Grossman
  0 siblings, 0 replies; 69+ messages in thread
From: Leonid Grossman @ 2004-09-16  5:25 UTC (permalink / raw)
  To: hadi, 'Jeff Garzik'
  Cc: 'David S. Miller', alan, paul, netdev, linux-kernel

 

> -----Original Message-----
> From: jamal [mailto:hadi@cyberus.ca] 
> Sent: Wednesday, September 15, 2004 5:58 PM
> To: Jeff Garzik
> Cc: David S. Miller; alan@lxorguk.ukuu.org.uk; paul@clubi.ie; 
> netdev@oss.sgi.com; leonid.grossman@s2io.com; 
> linux-kernel@vger.kernel.org
> Subject: Re: The ultimate TOE design
> 
> Jeff,
> You are only allowed to start a TOE thread every six months ;->
> 
> On a serious note, I think that PCI-express (if it lives up to its
> expectations) will demolish the dreams of a lot of these TOE investments.
> Our problem is NOT the CPU right now (80% idle processing 
> 450Kpps forwarding). Bus and memory distance/latency are. 

In servers, both bottlenecks are there - if you look at the cost of TCP and
filesystem processing at 10GbE, CPU is a huge problem (and will be for the
foreseeable future), even for the fastest 64-bit systems. 
I agree, though, that bus and memory are bigger issues; this is exactly the
reason for all these RDMA over Ethernet investments :-)
Anyway, I did not mean to start an argument - with all the new CPU, bus and
HBA technologies coming to the market it will be another 18-24 months before
we know what works and what doesn't...
Leonid


>If 
> intel would get rid of the big conspiracy in the form of 
> chipset division and just integrate the MC like AMD is, we'll 
> be on our way to kill TOE and a lot of the network 
> processors (like the IXP). Dang, running Linux is more 
> exciting than microcoding things to fit into a 2Kword program store. 
> 
> I rest my canadiana $.02 
> 
> cheers,
> jamal
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 21:23             ` Jeff Garzik
  2004-09-15 21:29               ` David S. Miller
  2004-09-16  0:57               ` jamal
@ 2004-09-16  9:29               ` Lincoln Dale
  2004-09-16 12:19                 ` Alan Cox
  2 siblings, 1 reply; 69+ messages in thread
From: Lincoln Dale @ 2004-09-16  9:29 UTC (permalink / raw)
  To: Jeff Garzik, David S. Miller
  Cc: alan, paul, netdev, leonid.grossman, linux-kernel

not that i disagree with the general idea and rationale, but reality is 
what it is today for some reasons:

At 07:23 AM 16/09/2004, Jeff Garzik wrote:
>         Jeff, who would love to have a bunch of Athlons
>         on PCI cards to play with.

. . . this ignores the realities of power restrictions of PCI today . . .
sure, one could create a PCI card that takes a power-connector, but that 
doesn't scale so well either . . .

At 07:29 AM 16/09/2004, David S. Miller wrote:
>I think a better goal is "offload 90+% of the net stack cost" which
>is effectively what TSO does on the send side.
>
>This is why these discussions are so circular.

TSO works on LAN-like environments (zero latency, minimal drop), it doesn't 
work so well across the internet . . .

i believe that there are better alternatives than TSO, but they involve NICs 
having decent scatter-gather DMA engines and being able to handle 
multiple transactions (packets/frames) at once.
in theory, NICs like tg2/tg3 should be capable of implementing something 
like this -- if one could get to the ucode on the embedded cores.

at least with PCI Express the general architecture of a PC starts to have a 
hope of keeping up with Moore's law.
the same couldn't be said prior to DDR-SDRAM and higher front-side-bus 
frequencies.

cheers,

lincoln.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-16  9:29               ` Lincoln Dale
@ 2004-09-16 12:19                 ` Alan Cox
  2004-09-16 13:33                   ` Andi Kleen
  0 siblings, 1 reply; 69+ messages in thread
From: Alan Cox @ 2004-09-16 12:19 UTC (permalink / raw)
  To: Lincoln Dale
  Cc: Jeff Garzik, David S. Miller, paul, netdev, leonid.grossman,
	Linux Kernel Mailing List

On Thu, 2004-09-16 at 10:29, Lincoln Dale wrote:
> . . . this ignores the realities of power restrictions of PCI today . . .
> sure, one could create a PCI card that takes a power-connector, but that 
> doesn't scale so well either . . .

At 1GHz the Athlon Geode NX draws about 6W. That's less than my SCSI
controller. I'm sure it's not a coincidence that PowerPC shows up on such
boards a lot more than x86, however.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-16 12:19                 ` Alan Cox
@ 2004-09-16 13:33                   ` Andi Kleen
  2004-09-16 12:57                     ` Alan Cox
  0 siblings, 1 reply; 69+ messages in thread
From: Andi Kleen @ 2004-09-16 13:33 UTC (permalink / raw)
  To: Alan Cox
  Cc: Lincoln Dale, Jeff Garzik, David S. Miller, paul, netdev,
	leonid.grossman, Linux Kernel Mailing List

On Thu, Sep 16, 2004 at 01:19:21PM +0100, Alan Cox wrote:
> On Thu, 2004-09-16 at 10:29, Lincoln Dale wrote:
> > . . . this ignores the realities of power restrictions of PCI today . . .
> > sure, one could create a PCI card that takes a power-connector, but that 
> > doesn't scale so well either . . .
> 
> At 1GHz the Athlon Geode NX draws about 6W. That's less than my SCSI

Are you sure that's worst case, not average? Worst case is usually
much worse on a big CPU like an Athlon, but the power supply 
has to be sized for it. 

-Andi

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-16 13:33                   ` Andi Kleen
@ 2004-09-16 12:57                     ` Alan Cox
  2004-09-16 22:37                       ` Lincoln Dale
  0 siblings, 1 reply; 69+ messages in thread
From: Alan Cox @ 2004-09-16 12:57 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Lincoln Dale, Jeff Garzik, David S. Miller, paul, netdev,
	leonid.grossman, Linux Kernel Mailing List

On Thu, 2004-09-16 at 14:33, Andi Kleen wrote:
> > At 1GHz the Athlon Geode NX draws about 6W. That's less than my SCSI
> 
> Are you sure that's worst case, not average? Worst case is usually
> much worse on a big CPU like an Athlon, but the power supply 
> has to be sized for it. 

You are correct - 6W average, 9W TDP, still less than my SCSI controller
8)

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-16 12:57                     ` Alan Cox
@ 2004-09-16 22:37                       ` Lincoln Dale
  2004-09-17 13:38                         ` Jörn Engel
  0 siblings, 1 reply; 69+ messages in thread
From: Lincoln Dale @ 2004-09-16 22:37 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Jeff Garzik, David S. Miller, paul, netdev,
	leonid.grossman, Linux Kernel Mailing List

Hi Alan,

At 10:57 PM 16/09/2004, Alan Cox wrote:
>On Thu, 2004-09-16 at 14:33, Andi Kleen wrote:
> > > At 1GHz the Athlon Geode NX draws about 6W. That's less than my SCSI
> >
> > Are you sure that's worst case, not average? Worst case is usually
> > much worse on a big CPU like an Athlon, but the power supply
> > has to be sized for it.
>
>You are correct - 6W average, 9W TDP, still less than my SCSI controller
>8)

sure -- ok -- that gets you the main processor.
now add to that a Northbridge (perhaps AMD doesn't need that but i'm sure it 
still does), Southbridge, DDR-SDRAM, ancillary chips for doing MAC, PHY, ...

couple that with the voltage of PCI where you're likely to need 
step-up/step-down circuits (which aren't 100% efficient themselves), you're 
still going to get very close to the limit, if not over it.

... and after all that, the Geode is really designed to be an embedded 
processor.
Jeff was implying using garden-variety processors which seem to have large 
heatsinks, not to mention cooling fans, not to mention quite significant 
heat generation.

we're not _quite_ at the stage of being able to take garden-variety 
processors and build-your-own-blade-server using PCI _just_ yet. :-)

cheers,

lincoln.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-16 22:37                       ` Lincoln Dale
@ 2004-09-17 13:38                         ` Jörn Engel
  0 siblings, 0 replies; 69+ messages in thread
From: Jörn Engel @ 2004-09-17 13:38 UTC (permalink / raw)
  To: Lincoln Dale
  Cc: Alan Cox, Andi Kleen, Jeff Garzik, David S. Miller, paul, netdev,
	leonid.grossman, Linux Kernel Mailing List

On Fri, 17 September 2004 08:37:17 +1000, Lincoln Dale wrote:
> 
> sure -- ok -- that gets you the main processor.
> now add to that a Northbridge (perhaps AMD doesn't need that but i'm sure it 
> still does), Southbridge, DDR-SDRAM, ancillary chips for doing MAC, PHY, 
> ...
> 
> couple that with the voltage of PCI where you're likely to need 
> step-up/step-down circuits (which aren't 100% efficient themselves), you're 
> still going to get very close to the limit, if not over it.
> 
> ... and after all that, the Geode is really designed to be an embedded 
> processor.
> Jeff was implying using garden-variety processors which seem to have large 
> heatsinks, not to mention cooling fans, not to mention quite significant 
> heat generation.
> 
> we're not _quite_ at the stage of being able to take garden-variety 
> processors and build-your-own-blade-server using PCI _just_ yet. :-)

FWIW, I've already been working with complete systems that suck their
power from PCI.  They do exist, just not in the grocery store next
door.

Jörn

-- 
Das Aufregende am Schreiben ist es, eine Ordnung zu schaffen, wo
vorher keine existiert hat.
-- Doris Lessing

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 21:13           ` David S. Miller
  2004-09-15 21:23             ` Jeff Garzik
@ 2004-09-15 22:31             ` Jeff Garzik
  1 sibling, 0 replies; 69+ messages in thread
From: Jeff Garzik @ 2004-09-15 22:31 UTC (permalink / raw)
  To: David S. Miller; +Cc: alan, paul, netdev, leonid.grossman, linux-kernel

David S. Miller wrote:
> Plus we have things like TSO too but that doesn't require a full Linux
> instance to realize on a networking port.
> Simple silicon implements this already.
> I don't see how that differs from your "big MTU" ideas.


WRT MTU:  if the card is a buffering endpoint, rather than a 
passthrough, the card deals with Path MTU and fragmentation, leaving the 
card<->host MTU at 64K, getting nice big fat frames.
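
To put a number on "nice big fat frames", here is a small sketch that
counts how many IP packets the host would have to handle for the same
payload at an ethernet-sized MTU versus a 64K card<->host MTU. It assumes
40 bytes of IPv4 + TCP headers with no options, and uses 65535 as the
"64K" MTU; the function name is made up for illustration.

```python
import math


def packets_needed(payload_bytes, mtu, header_bytes=40):
    """IP packets required to carry payload_bytes of TCP data at a given MTU."""
    per_packet = mtu - header_bytes  # TCP payload that fits in one packet
    return math.ceil(payload_bytes / per_packet)


payload = 1_000_000
wire_packets = packets_needed(payload, 1500)   # what the card sees on ethernet
bus_packets = packets_needed(payload, 65535)   # what the host sees over PCI
print(wire_packets, bus_packets)
```

Under these assumptions the host handles 16 packets where the wire side
carries 685, so per-packet overhead on the host drops by more than an
order of magnitude.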

	Jeff

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 21:01       ` David S. Miller
  2004-09-15 21:08         ` Jeff Garzik
@ 2004-09-15 21:15         ` Michael Richardson
  1 sibling, 0 replies; 69+ messages in thread
From: Michael Richardson @ 2004-09-15 21:15 UTC (permalink / raw)
  To: David S. Miller
  Cc: Jeff Garzik, alan, paul, netdev, leonid.grossman, linux-kernel

-----BEGIN PGP SIGNED MESSAGE-----


>>>>> "David" == David S Miller <davem@davemloft.net> writes:
    >> The point was more to show people who are doing TOE _anyway_ to a decent 
    >> design.

    David> We shouldn't be forced to refine people's non-sensible ideas which
    David> we'll not support anyways.

    David> If TOE is supported on Windows only, I happily welcome that.

  Ha. Too hard to do :-)

  The TOEs and L7 content switches that I know of are supported...
      UNDER LINUX ONLY 

  The one that I'm most familiar with (Seaway's SW5000/NCA2000)
provides a new socket family to the host, which corresponds to streams
that terminate on the NCA2000.  The host can request things like having
two TCP streams be cross-connected, even adding/subtracting SSL along
the way.

  This code does not interact with the Linux IP stack at all --- so it
isn't exactly a TOE. You have to, at a minimum, recompile applications.

- --
]     "Elmo went to the wrong fundraiser" - The Simpson         |  firewalls  [
]   Michael Richardson,    Xelerance Corporation, Ottawa, ON    |net architect[
] mcr@xelerance.com      http://www.sandelman.ottawa.on.ca/mcr/ |device driver[
] panic("Just another Debian GNU/Linux using, kernel hacking, security guy"); [
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)
Comment: Finger me for keys

iQCVAwUBQUiw3YqHRg3pndX9AQHrwQQAoK2C4btD6vk/UZ1Bv7zTgtbw/EvZuU2F
ZqPDiYfHMIsfsCYBWqLrjU2oxkkO+RgH3NOoNTJQuuVFjLlDw2pPHgH9DXaYdZy8
3To0LGdmIZR4u+mMx2WFRyYjuDM1iQ3ZbAskN5JzW3Jc77SbrJZaap1fQua5U3qg
gfNQ21OPkSI=
=+JBc
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 19:14   ` Alan Cox
  2004-09-15 20:41     ` Jeff Garzik
@ 2004-09-15 20:53     ` David S. Miller
  2004-09-16  1:05       ` Andrea Arcangeli
  2004-09-15 21:10     ` David Lang
  2004-09-15 23:05     ` Paul Jakma
  3 siblings, 1 reply; 69+ messages in thread
From: David S. Miller @ 2004-09-15 20:53 UTC (permalink / raw)
  To: Alan Cox; +Cc: paul, netdev, leonid.grossman, linux-kernel

On Wed, 15 Sep 2004 20:14:22 +0100
Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> On Mer, 2004-09-15 at 21:04, Paul Jakma wrote:
> > The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI 
> > card running Linux. Or is that what you were referring to with 
> > "<cards exist> but they are all fairly expensive."?
> 
> Last time I checked 2Ghz accelerators for intel and AMD were quite cheap
> and also had the advantage they ran user mode code when idle from
> network processing.

ROFL, and this is my position on this topic as well.

There are absolutely no justified economics in these
TOE engines.  By the time you deploy them, the cpus
and memory catch up and what's more those are general
purpose and not just for networking as David Stevens
and others have said.

TOE is just junk, and we'll reject any attempt to put
that garbage into the kernel.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 20:53     ` David S. Miller
@ 2004-09-16  1:05       ` Andrea Arcangeli
  0 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2004-09-16  1:05 UTC (permalink / raw)
  To: David S. Miller; +Cc: Alan Cox, paul, netdev, leonid.grossman, linux-kernel

On Wed, Sep 15, 2004 at 01:53:08PM -0700, David S. Miller wrote:
> There are absolutely no justified economics in these
> TOE engines.  By the time you deploy them, the cpus
> and memory catch up and what's more those are general
> purpose and not just for networking as David Stevens
> and others have said.

I'm not sure economics is the worst part of what is being shipped;
to me the worst part is security. I'd never trust such a
non-open-source TCP stack for something critical, even if it were
much cheaper and more performant. Even my PDA is using the linux tcp
stack, and my cell phone only speaks UDP with the wap server anyways.
TCP segment offload OTOH doesn't involve much "intelligence" in the NIC
and it's very reasonable to trust it especially because all the incoming
packets (the real potential offenders) are still processed by the linux
tcp stack.
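
A minimal sketch of why segment offload needs so little "intelligence":
on transmit the NIC only has to cut one large buffer into MSS-sized
pieces and advance the sequence number for each. This is an illustration
of the idea, not how any particular NIC or the Linux stack implements it;
the function name and the 1460-byte MSS are assumptions.

```python
def segment(payload, mss, start_seq=0):
    """Split one large send into (sequence_number, chunk) pairs of at most mss bytes."""
    return [
        (start_seq + offset, payload[offset:offset + mss])
        for offset in range(0, len(payload), mss)
    ]


segments = segment(b"x" * 4000, mss=1460)
for seq, chunk in segments:
    print(seq, len(chunk))
```

There is no connection state machine, no retransmit logic and no handling
of incoming data here; all of that stays in the host stack, which is the
distinction being drawn against a full TOE.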

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 19:14   ` Alan Cox
  2004-09-15 20:41     ` Jeff Garzik
  2004-09-15 20:53     ` David S. Miller
@ 2004-09-15 21:10     ` David Lang
  2004-09-15 23:05     ` Paul Jakma
  3 siblings, 0 replies; 69+ messages in thread
From: David Lang @ 2004-09-15 21:10 UTC (permalink / raw)
  To: Alan Cox; +Cc: Paul Jakma, Netdev, leonid.grossman, Linux Kernel Mailing List

On Wed, 15 Sep 2004, Alan Cox wrote:

> On Mer, 2004-09-15 at 21:04, Paul Jakma wrote:
>> The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI
>> card running Linux. Or is that what you were referring to with
>> "<cards exist> but they are all fairly expensive."?
>
> Last time I checked 2Ghz accelerators for intel and AMD were quite cheap
> and also had the advantage they ran user mode code when idle from
> network processing.

That depends on how many of these accelerators you already have in the 
system. If you have 4 of them and they are heavily used so that you want 
to offload them it definitely isn't cheap to add a 5th (you usually have 
to go up to 8 or so and the difference between 4 and 8 is frequently 2x-4x 
the cost of the 4 processor box)

now if you start with a single CPU system then yes, adding a second one is 
cheap. but these are usually not the people who really need TOE (they may 
think that they do, but that's a different story :-)

David Lang

-- 
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
  -- C.A.R. Hoare

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 19:14   ` Alan Cox
                       ` (2 preceding siblings ...)
  2004-09-15 21:10     ` David Lang
@ 2004-09-15 23:05     ` Paul Jakma
  3 siblings, 0 replies; 69+ messages in thread
From: Paul Jakma @ 2004-09-15 23:05 UTC (permalink / raw)
  To: Alan Cox; +Cc: Netdev, leonid.grossman, Linux Kernel Mailing List

On Wed, 15 Sep 2004, Alan Cox wrote:

> Last time I checked 2Ghz accelerators for intel and AMD were quite 
> cheap and also had the advantage they ran user mode code when idle 
> from network processing.

Indeed.

Unfortunately though, my vague understanding is, the interesting bits 
on the IXP, the microengines, are integrated with the XScale ASIC.

I agree it's silly to stick a general purpose CPU in there, but you 
get it for "free" anyway.

regards,
-- 
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
War is an equal opportunity destroyer.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 20:04 ` Paul Jakma
  2004-09-15 19:14   ` Alan Cox
@ 2004-09-15 20:26   ` Neil Horman
  2004-09-15 21:03     ` Wes Felter
  2004-09-16  5:51     ` Matt Porter
  2004-09-15 21:36   ` Deepak Saxena
  2004-09-15 21:59   ` Tony Lee
  3 siblings, 2 replies; 69+ messages in thread
From: Neil Horman @ 2004-09-15 20:26 UTC (permalink / raw)
  To: Paul Jakma; +Cc: Netdev, leonid.grossman, Linux Kernel

Paul Jakma wrote:
> On Wed, 15 Sep 2004, Jeff Garzik wrote:
> 
>> Put simply, the "ultimate TOE card" would be a card with network 
>> ports, a generic CPU (arm, mips, whatever.), some RAM, and some 
>> flash.  This card's "firmware" is the Linux kernel, configured to run 
>> as a _totally indepenent network node_, with IP address(es) all its own.
>>
>> Then, your host system OS will communicate with the Linux kernel 
>> running on the card across the PCI bus, using IP packets (64K fixed MTU).
> 
> 
>> My dream is that some vendor will come along and implement such a 
>> design, and sell it in enough volume that it's US$100 or less. There 
>> are a few cards on the market already where implementing this design 
>> _may_ be possible, but they are all fairly expensive.
> 
> 
> The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI card 
> running Linux. Or is that what you were referring to with "<cards exist> 
> but they are all fairly expensive."?
> 
>>     Jeff
> 
> 
> regards,

IBM's PowerNP chip was also very similar (a powerpc core with lots of 
hardware assists for DMA and packet inspection in the extended register 
area).  Don't know if they still sell it, but at one time I had heard 
they had booted linux on it.
Neil

-- 
/***************************************************
  *Neil Horman
  *Software Engineer
  *Red Hat, Inc.
  *nhorman@redhat.com
  *gpg keyid: 1024D / 0x92A74FA1
  *http://pgp.mit.edu
  ***************************************************/

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 20:26   ` Neil Horman
@ 2004-09-15 21:03     ` Wes Felter
  2004-09-15 21:15       ` Jeff Garzik
                         ` (2 more replies)
  2004-09-16  5:51     ` Matt Porter
  1 sibling, 3 replies; 69+ messages in thread
From: Wes Felter @ 2004-09-15 21:03 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev

Neil Horman wrote:
> Paul Jakma wrote:
> 
>> On Wed, 15 Sep 2004, Jeff Garzik wrote:
>>
>>> Put simply, the "ultimate TOE card" would be a card with network 
>>> ports, a generic CPU (arm, mips, whatever.), some RAM, and some 
>>> flash.  This card's "firmware" is the Linux kernel, configured to run 
>>> as a _totally indepenent network node_, with IP address(es) all its own.
>>>
>>> Then, your host system OS will communicate with the Linux kernel 
>>> running on the card across the PCI bus, using IP packets (64K fixed 
>>> MTU).

>> The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI 
>> card running Linux. Or is that what you were referring to with "<cards 
>> exist> but they are all fairly expensive."?

> IBM's PowerNP chip was also very simmilar (a powerpc core with lots of 
> hardware assists for DMA and packet inspection in the extended register 
> area).  Don't know if they still sell it, but at one time I had heard 
> they had booted linux on it.

An IXP or PowerNP wouldn't work for Jeff's idea. The IXP's XScale core 
and PowerNP's PowerPC core are way too slow to do any significant 
processing; they are intended for control tasks like updating the 
routing tables. All the work in the IXP or PowerNP is done by the 
microengines, which have weird, non-Linux-compatible architectures.

To do 10 Gbps Ethernet with Jeff's approach, wouldn't you need a 5-10 
GHz processor on the card? Sounds expensive.

A 440GX or BCM1250 on a cheap PCI card would be fun to play with, though.

Wes Felter - wesley@felter.org - http://felter.org/wesley/

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 21:03     ` Wes Felter
@ 2004-09-15 21:15       ` Jeff Garzik
  2004-09-15 21:35         ` Wes Felter
  2004-09-15 21:25       ` Imran Badr
  2004-09-16 11:37       ` Neil Horman
  2 siblings, 1 reply; 69+ messages in thread
From: Jeff Garzik @ 2004-09-15 21:15 UTC (permalink / raw)
  To: Wes Felter; +Cc: netdev, linux-kernel

On Wed, Sep 15, 2004 at 04:03:57PM -0500, Wes Felter wrote:
> To do 10 Gbps Ethernet with Jeff's approach, wouldn't you need a 5-10 
> GHz processor on the card? Sounds expensive.

Do you need a 5-10 Ghz Intel server to handle 10 Gbps ethernet?

	Jeff

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 21:15       ` Jeff Garzik
@ 2004-09-15 21:35         ` Wes Felter
  2004-09-15 21:42           ` Jeff Garzik
  0 siblings, 1 reply; 69+ messages in thread
From: Wes Felter @ 2004-09-15 21:35 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel

Jeff Garzik wrote:

> On Wed, Sep 15, 2004 at 04:03:57PM -0500, Wes Felter wrote:
> 
>>To do 10 Gbps Ethernet with Jeff's approach, wouldn't you need a 5-10 
>>GHz processor on the card? Sounds expensive.
> 
> 
> Do you need a 5-10 Ghz Intel server to handle 10 Gbps ethernet?

Yes. (Or a 4-way ~2GHz server.)

When the fastest general-purpose processors cannot handle the fastest 
Ethernet links, putting such a processor on a NIC won't help much. I 
think this is why people are attracted to TOE ASICs, even if that isn't 
the right solution.
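
The estimate above appears to rest on a rule of thumb of roughly 0.5 to
1 Hz of CPU clock per bit/s of TCP throughput. That ratio is an
assumption read back from the 5-10 GHz figure, not something measured in
this thread, and it is disputed in the replies. As a sketch of the
arithmetic only:

```python
def cpu_ghz_needed(link_gbps, hz_per_bit):
    """Aggregate CPU clock implied by a 'Hz per bit/s' rule of thumb."""
    return link_gbps * hz_per_bit


low = cpu_ghz_needed(10, 0.5)    # optimistic end of the assumed range
high = cpu_ghz_needed(10, 1.0)   # pessimistic end of the assumed range
four_way = 4 * 2.0               # aggregate clock of a 4-way ~2GHz server
print(low, high, four_way)
```

A 4-way ~2GHz box sums to 8 GHz, which sits inside the 5-10 GHz range and
is presumably why it is offered as the alternative.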

-- 
Wes Felter - wesley@felter.org - http://felter.org/wesley/

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 21:35         ` Wes Felter
@ 2004-09-15 21:42           ` Jeff Garzik
  0 siblings, 0 replies; 69+ messages in thread
From: Jeff Garzik @ 2004-09-15 21:42 UTC (permalink / raw)
  To: Wes Felter; +Cc: netdev, linux-kernel

On Wed, Sep 15, 2004 at 04:35:31PM -0500, Wes Felter wrote:
> Jeff Garzik wrote:
> 
> >On Wed, Sep 15, 2004 at 04:03:57PM -0500, Wes Felter wrote:
> >
> >>To do 10 Gbps Ethernet with Jeff's approach, wouldn't you need a 5-10 
> >>GHz processor on the card? Sounds expensive.
> >
> >
> >Do you need a 5-10 Ghz Intel server to handle 10 Gbps ethernet?
> 
> Yes. (Or a 4-way ~2GHz server.)

It was a rhetorical question.

No, you don't.

	Jeff

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 21:03     ` Wes Felter
  2004-09-15 21:15       ` Jeff Garzik
@ 2004-09-15 21:25       ` Imran Badr
  2004-09-16 11:37       ` Neil Horman
  2 siblings, 0 replies; 69+ messages in thread
From: Imran Badr @ 2004-09-15 21:25 UTC (permalink / raw)
  To: Wes Felter, linux-kernel; +Cc: netdev

Please see:

"Cavium Networks Introduces OCTEON(TM) Family of Integrated Network Services
Processors With up to 16 MIPS64(R)-Based Cores for Internet Services,
Content and Security Processing"

http://www.linuxelectrons.com/article.php?story=20040913082030668&mode=print





^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 21:03     ` Wes Felter
  2004-09-15 21:15       ` Jeff Garzik
  2004-09-15 21:25       ` Imran Badr
@ 2004-09-16 11:37       ` Neil Horman
  2 siblings, 0 replies; 69+ messages in thread
From: Neil Horman @ 2004-09-16 11:37 UTC (permalink / raw)
  To: Wes Felter; +Cc: linux-kernel, netdev

Wes Felter wrote:
> Neil Horman wrote:
> 
>> Paul Jakma wrote:
>>
>>> On Wed, 15 Sep 2004, Jeff Garzik wrote:
>>>
>>>> Put simply, the "ultimate TOE card" would be a card with network 
>>>> ports, a generic CPU (arm, mips, whatever.), some RAM, and some 
>>>> flash.  This card's "firmware" is the Linux kernel, configured to 
>>>> run as a _totally indepenent network node_, with IP address(es) all 
>>>> its own.
>>>>
>>>> Then, your host system OS will communicate with the Linux kernel 
>>>> running on the card across the PCI bus, using IP packets (64K fixed 
>>>> MTU).
> 
> 
>>> The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI 
>>> card running Linux. Or is that what you were referring to with 
>>> "<cards exist> but they are all fairly expensive."?
> 
> 
>> IBM's PowerNP chip was also very simmilar (a powerpc core with lots of 
>> hardware assists for DMA and packet inspection in the extended 
>> register area).  Don't know if they still sell it, but at one time I 
>> had heard they had booted linux on it.
> 
> 
> An IXP or PowerNP wouldn't work for Jeff's idea. The IXP's XScale core 
> and PowerNP's PowerPC core are way too slow to do any significant 
> processing; they are intended for control tasks like updating the 
> routing tables. All the work in the IXP or PowerNP is done by the 
> microengines, which have weird, non-Linux-compatible architectures.
> 
I didn't say the assist hardware wouldn't need an extra driver.  It's not 
100% free, as Jeff proposes, but the CPU portion of these designs is 
_sufficient_ to run linux, and a driver can be written to drive the 
remainder of these chips.  It's the combination that network device 
manufacturers design to today: A specialized chip to do L3/L2 forwarding 
at line rate over a large number of ports, and just enough general 
purpose CPU to manage the user interface, the forwarding hardware and 
any overflow forwarding that the forwarding hardware can't deal with 
quickly.
> To do 10 Gbps Ethernet with Jeff's approach, wouldn't you need a 5-10 
> GHz processor on the card? Sounds expensive.
> 
To handle port densities that are competing in the market today?  Yes, 
which as I mentioned earlier would price designs like this out of the 
market.  Jeff's idea is a nice one, but it doesn't really fit well with 
the hardware that networking equipment manufacturers are building today. 
Take a look at Broadcom's StrataSwitch/StrataXGS lines, or Switchcore's 
Xpeedium processors.  These are the sorts of things we have to work with.  
They provide network stack offload in competitive port densities, but 
they aren't also general purpose processors.  They need a driver to 
massage their behavior into something more linux friendly.  If we could 
develop an infrastructure that made these chips easy to integrate into a 
platform running linux, linux could quickly come to dominate a large 
portion of the network device space.

Neil

> Wes Felter - wesley@felter.org - http://felter.org/wesley/
> 


-- 
/***************************************************
  *Neil Horman
  *Software Engineer
  *Red Hat, Inc.
  *nhorman@redhat.com
  *gpg keyid: 1024D / 0x92A74FA1
  *http://pgp.mit.edu
  ***************************************************/

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 20:26   ` Neil Horman
  2004-09-15 21:03     ` Wes Felter
@ 2004-09-16  5:51     ` Matt Porter
  1 sibling, 0 replies; 69+ messages in thread
From: Matt Porter @ 2004-09-16  5:51 UTC (permalink / raw)
  To: Neil Horman; +Cc: Paul Jakma, Netdev, leonid.grossman, Linux Kernel

On Wed, Sep 15, 2004 at 04:26:09PM -0400, Neil Horman wrote:
> IBM's PowerNP chip was also very simmilar (a powerpc core with lots of 
> hardware assists for DMA and packet inspection in the extended register 
> area).  Don't know if they still sell it, but at one time I had heard 
> they had booted linux on it.

Well, yes, PowerNP support has been in the kernel for years and embedded
Linux distros like Mvista support them.  It's no longer an IBM chip,
though.  AMCC purchased the PPC4xx network processors (PowerNP) from
IBM and later purchased the entire standard SoC PPC4xx product line
from IBM.  That is, except for the PPC4xx STB chips like those found in
the Hauppauge MediaMVP; IBM retained those.  AMCC pretty much owns all
the PPC4xx line and PowerNP 405H/L are still available.

-Matt

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 20:04 ` Paul Jakma
  2004-09-15 19:14   ` Alan Cox
  2004-09-15 20:26   ` Neil Horman
@ 2004-09-15 21:36   ` Deepak Saxena
  2004-09-15 23:03     ` Paul Jakma
  2004-09-24 13:11     ` Lennert Buytenhek
  2004-09-15 21:59   ` Tony Lee
  3 siblings, 2 replies; 69+ messages in thread
From: Deepak Saxena @ 2004-09-15 21:36 UTC (permalink / raw)
  To: Paul Jakma; +Cc: Netdev, leonid.grossman, Linux Kernel

On Sep 15 2004, at 21:04, Paul Jakma was caught saying:
> On Wed, 15 Sep 2004, Jeff Garzik wrote:
> 
> >Put simply, the "ultimate TOE card" would be a card with network ports, a 
> >generic CPU (arm, mips, whatever.), some RAM, and some flash.  This card's 
> >"firmware" is the Linux kernel, configured to run as a _totally indepenent 
> >network node_, with IP address(es) all its own.
> >
> >Then, your host system OS will communicate with the Linux kernel running 
> >on the card across the PCI bus, using IP packets (64K fixed MTU).
> 
> >My dream is that some vendor will come along and implement such a 
> >design, and sell it in enough volume that it's US$100 or less. 
> >There are a few cards on the market already where implementing this 
> >design _may_ be possible, but they are all fairly expensive.
> 
> The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI 
> card running Linux. Or is that what you were referring to with 
> "<cards exist> but they are all fairly expensive."?

Unfortunately all the SW that lets one make use of the interesting
features of the IXPs (microEngines, crypto, etc) is a pile of
proprietary code.

~Deepak


-- 
Deepak Saxena - dsaxena at plexity dot net - http://www.plexity.net/

"Unlike me, many of you have accepted the situation of your imprisonment
and will die here like rotten cabbages." - Number 6

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 21:36   ` Deepak Saxena
@ 2004-09-15 23:03     ` Paul Jakma
  2004-09-24 13:11     ` Lennert Buytenhek
  1 sibling, 0 replies; 69+ messages in thread
From: Paul Jakma @ 2004-09-15 23:03 UTC (permalink / raw)
  To: Deepak Saxena; +Cc: Netdev, leonid.grossman, Linux Kernel

On Wed, 15 Sep 2004, Deepak Saxena wrote:

> Unfortunately all the SW that lets one make use of the interesting 
> features of the IXPs (microEngines, crypto, etc) is a pile of 
> propietary code.

My vague understanding is that while Intel's microengine code is 
proprietary, they do provide the docs to the microengines to let you 
write your own, no?

> ~Deepak

regards,
-- 
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
Better tried by twelve than carried by six.
 		-- Jeff Cooper

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 21:36   ` Deepak Saxena
  2004-09-15 23:03     ` Paul Jakma
@ 2004-09-24 13:11     ` Lennert Buytenhek
  1 sibling, 0 replies; 69+ messages in thread
From: Lennert Buytenhek @ 2004-09-24 13:11 UTC (permalink / raw)
  To: Deepak Saxena; +Cc: Paul Jakma, Netdev, leonid.grossman, Linux Kernel

On Wed, Sep 15, 2004 at 02:36:00PM -0700, Deepak Saxena wrote:

> > The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI 
> > card running Linux. Or is that what you were referring to with 
> > "<cards exist> but they are all fairly expensive."?
> 
> Unfortunately all the SW that lets one make use of the interesting
> features of the IXPs (microEngines, crypto, etc) is a pile of
> propietary code.

I'm working on open source microengine code for the IXP line, which
should be available Real Soon Now(TM).


--L

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 20:04 ` Paul Jakma
                     ` (2 preceding siblings ...)
  2004-09-15 21:36   ` Deepak Saxena
@ 2004-09-15 21:59   ` Tony Lee
  3 siblings, 0 replies; 69+ messages in thread
From: Tony Lee @ 2004-09-15 21:59 UTC (permalink / raw)
  To: Paul Jakma; +Cc: Netdev, leonid.grossman, Linux Kernel

On Wed, 15 Sep 2004 21:04:38 +0100 (IST), Paul Jakma <paul@clubi.ie> wrote:
> On Wed, 15 Sep 2004, Jeff Garzik wrote:
> 
> > Put simply, the "ultimate TOE card" would be a card with network ports, a
> > generic CPU (arm, mips, whatever.), some RAM, and some flash.  This card's
> > "firmware" is the Linux kernel, configured to run as a _totally indepenent
> > network node_, with IP address(es) all its own.
> >
> > Then, your host system OS will communicate with the Linux kernel running on
> > the card across the PCI bus, using IP packets (64K fixed MTU).
> 
> > My dream is that some vendor will come along and implement such a
> > design, and sell it in enough volume that it's US$100 or less.
> > There are a few cards on the market already where implementing this
> > design _may_ be possible, but they are all fairly expensive.
> 
> The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI
> card running Linux. Or is that what you were referring to with
> "<cards exist> but they are all fairly expensive."?
> 
> >       Jeff
> 
> regards,
> --
> Paul Jakma      paul@clubi.ie   paul@jakma.org  Key ID: 64A2FF6A



I believe the Broadcom 5704 (570x) chip/nic card comes with 2 MIPS CPUs (133 MHz),
one each for the Tx and Rx data paths.   The GIGE nic card cost < $50
a couple of years ago.


Too bad, the software SDK for them is closed (quoted at $96K a couple of years ago).

Otherwise, there can be some interesting applications with that extremely
inexpensive chip/nic card.

RDMA over TCP/UDP with that chip/nic card over gige can be very interesting.

so is SSL proxy, SSH tunnel, etc.

With the right distributed processing design, it might even be possible
to offload SMB and NFS to the "right" nic card.


-Tony
--
Having fun with Xilinx Virtex Pro II reconfigurable HW +  integrated PPC + Linux

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 19:33 The ultimate TOE design Jeff Garzik
  2004-09-15 20:04 ` Paul Jakma
@ 2004-09-15 20:11 ` David Stevens
  2004-09-15 20:16   ` David Schwartz
                     ` (5 more replies)
  2004-09-15 21:36 ` John Heffner
  2004-09-16  9:03 ` Lars Marowsky-Bree
  3 siblings, 6 replies; 69+ messages in thread
From: David Stevens @ 2004-09-15 20:11 UTC (permalink / raw)
  To: Netdev; +Cc: leonid.grossman, Linux Kernel, netdev

I've never understood why people are so interested in off-loading
networking. Isn't that just a multi-processor system where you can't
use any of the network processor cycles for anything else? And, of
course, to be cheap, the network processor will be slower, and much
harder to debug and update software.

If the PCI bus is too slow, or MTU's too small, wouldn't
it be better to fix those directly and use a fast host processor that can
also do other things when not needed for networking? And why have
memory on a NIC that can't be used by other things?

Why don't we off-load filesystems to disks instead?  Or a graphics
card that implements X ? :-) I'd rather have shared system resources--
more flexible. :-)

                                        +-DLS

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 20:11 ` David Stevens
@ 2004-09-15 20:16   ` David Schwartz
  2004-09-15 20:25   ` Jeff Garzik
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 69+ messages in thread
From: David Schwartz @ 2004-09-15 20:16 UTC (permalink / raw)
  To: David Stevens, Netdev, leonid.grossman, Linux Kernel

David Stevens wrote:

> I've never understood why people are so interested in off-loading
> networking. Isn't that just a multi-processor system where you can't
> use any of the network processor cycles for anything else? And, of
> course, to be cheap, the network processor will be slower, and much
> harder to debug and update software.

The issue of debugging the network processor software and maintaining it is 
certainly a legitimate one. However, nothing stops you from using the extra 
network processor cycles for other purposes.

> If the PCI bus is too slow, or MTU's too small, wouldn't
> it be better to fix those directly and use a fast host processor that
> can also do other things when not needed for networking? And why have
> memory on a NIC that can't be used by other things?

This isn't an either-or. Processors are cheap. Memory is cheap.

> Why don't we off-load filesystems to disks instead?  Or a graphics
> card that implements X ? :-) I'd rather have shared system resources--
> more flexible. :-)

It's not one or the other. If, for example, your network card, graphics 
card, and hard drive controller all use a common instruction set and are all 
interconnected by a fast bus, code can be fairly mobile and run wherever 
it's the most efficient. Nothing stops the OS from offloading internal tasks 
to these processors as well.

The only real stumbling blocks have been cost/volume considerations and the 
fact that the central processor(s) can be so fast, and the I/O so slow in 
comparison, that there's not much to gain.

DS

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 20:11 ` David Stevens
  2004-09-15 20:16   ` David Schwartz
@ 2004-09-15 20:25   ` Jeff Garzik
  2004-09-15 20:54     ` Neil Horman
  2004-09-15 20:31   ` Bill Rugolsky Jr.
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 69+ messages in thread
From: Jeff Garzik @ 2004-09-15 20:25 UTC (permalink / raw)
  To: David Stevens; +Cc: Netdev, leonid.grossman, Linux Kernel

David Stevens wrote:
> I've never understood why people are so interested in off-loading
> networking. Isn't that just a multi-processor system where you can't
> use any of the network processor cycles for anything else? And, of
> course, to be cheap, the network processor will be slower, and much
> harder to debug and update software.

Well I do agree there is a strong don't-bother-with-TOE argument: 
Moore's law, the CPUs (manufactured in vast quantities) will usually


However, there are companies that Just Gotta Do TOE...  and I am not 
inclined to assist in any effort that compromises Linux's RFC compliancy 
or security.  Current TOE efforts seem to be of the "shove your data 
through this black box" variety, which is rather disheartening.

Even non-TOE NICs these days have ever-more-complex firmwares.  tg3 is a 
MIPS-based engine for example.


> If the PCI bus is too slow, or MTU's too small, wouldn't
> it be better to fix those directly and use a fast host processor that can
> also do other things when not needed for networking? And why have
> memory on a NIC that can't be used by other things?

PCI bus tends to be slower than DRAM<->CPU speed, and MTUs across the 
Internet will be small as long as ethernet enjoys continued success.

	Jeff

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 20:25   ` Jeff Garzik
@ 2004-09-15 20:54     ` Neil Horman
  0 siblings, 0 replies; 69+ messages in thread
From: Neil Horman @ 2004-09-15 20:54 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: David Stevens, Netdev, leonid.grossman, Linux Kernel

Jeff Garzik wrote:
> David Stevens wrote:
> 
>> I've never understood why people are so interested in off-loading
>> networking. Isn't that just a multi-processor system where you can't
>> use any of the network processor cycles for anything else? And, of
>> course, to be cheap, the network processor will be slower, and much
>> harder to debug and update software.
> 
> 
> Well I do agree there is a strong don't-bother-with-TOE argument: 
> Moore's law, the CPUs (manufactured in vast quantities) will usually
> 
> 
> However, there are companies that Just Gotta Do TOE...  and I am not 
> inclined to assist in any effort that compromises Linux's RFC compliancy 
> or security.  Current TOE efforts seem to be of the "shove your data 
> through this black box" variety, which is rather disheartening.
> 
> Even non-TOE NICs these days have ever-more-complex firmwares.  tg3 is a 
> MIPS-based engine for example.
> 
> 
>> If the PCI bus is too slow, or MTU's too small, wouldn't
>> it be better to fix those directly and use a fast host processor that can
>> also do other things when not needed for networking? And why have
>> memory on a NIC that can't be used by other things?
> 
> 
> PCI bus tends to be slower than DRAM<->CPU speed, and MTUs across the 
> Internet will be small as long as ethernet enjoys continued success.
> 
>     Jeff

There is also something to be said for the embedded market here. 
Offload chips are fairly useful when building switches and routers. 
Dave M. in a thread just a few weeks ago provided some metrics for how 
much bandwidth a PCI-X bus and a some-odd-gigahertz processor could 
handle.  It worked out that a PC with the right components could 
theoretically handle about 4 gigabit NICs running traffic full duplex 
at line rate.  That's great, but it doesn't come close to what you need 
for a 24 port gigabit L3 switch, nor does it approach the correct price 
point.  Most of these designs use a less expensive processor running at 
a slower speed, and an offload chip (that incorporates tx/rx logic and a 
switching fabric) to perform most of the routing and switching.  For 
cost-conscious network equipment manufacturers, they are really the way 
to go.  Unfortunately, many of them don't actually run as a 
co-processor, and so don't enable Jeff's idea very well (yet :))
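
As a rough back-of-the-envelope check of the gap described above (a sketch
only: it takes the four NICs to be gigabit parts, counts a full-duplex port
as its line rate in each direction, and ignores framing overhead and how the
switching fabric is actually built):

```python
# Aggregate full-duplex bandwidth, in Gb/s, for a box with `ports` ports.
# Assumption: every port runs at line rate in both directions at once.
def aggregate_gbps(ports, port_rate_gbps=1.0):
    return ports * port_rate_gbps * 2

pc_capacity = aggregate_gbps(4)       # the PC estimate mentioned above
switch_demand = aggregate_gbps(24)    # a 24-port gigabit L3 switch

print(f"PC: {pc_capacity:.0f} Gb/s, switch: {switch_demand:.0f} Gb/s, "
      f"shortfall: {switch_demand / pc_capacity:.0f}x")
```

Under those assumptions the PC tops out at 8 Gb/s against 48 Gb/s for the
switch, a factor of six before price even enters the picture.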

Neil

-- 
/***************************************************
  *Neil Horman
  *Software Engineer
  *Red Hat, Inc.
  *nhorman@redhat.com
  *gpg keyid: 1024D / 0x92A74FA1
  *http://pgp.mit.edu
  ***************************************************/

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 20:11 ` David Stevens
  2004-09-15 20:16   ` David Schwartz
  2004-09-15 20:25   ` Jeff Garzik
@ 2004-09-15 20:31   ` Bill Rugolsky Jr.
  2004-09-15 21:41   ` Joel Jaeggli
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 69+ messages in thread
From: Bill Rugolsky Jr. @ 2004-09-15 20:31 UTC (permalink / raw)
  To: Netdev; +Cc: Jeff Garzik, David Stevens

On Wed, Sep 15, 2004 at 02:11:04PM -0600, David Stevens wrote:
> If the PCI bus is too slow, or MTU's too small, wouldn't
> it be better to fix those directly and use a fast host processor that can
> also do other things when not needed for networking? And why have
> memory on a NIC that can't be used by other things?

I tend to agree.

Referring to the Opteron with its per-CPU memory controller, Robert Olsson
just wrote in the "TX performance of Intel 82546" thread:

  This is a little breakthrough as we for the first time see some
  aggregated performance with packet forwarding and got something in
  return for all multiprocessor efforts.

  IMO this is much more important than the last percent of performance
  of pps numbers.

In 2005, we'll have commodity dual-core packages, making a four-core
(dual-CPU) system available at an attractive price point.  The number
will rise dramatically after that.  I don't really think CPU cycles are
the problem.  A  useful reason I can see for "offloading" is isolation of
concerns, e.g., locking, real-time latencies, security, etc.  But then,
why not run something like the Xen2 virtual machine environment?

Regards,

	Bill Rugolsky

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 20:11 ` David Stevens
                     ` (2 preceding siblings ...)
  2004-09-15 20:31   ` Bill Rugolsky Jr.
@ 2004-09-15 21:41   ` Joel Jaeggli
  2004-09-16  6:33   ` Valdis.Kletnieks
  2004-09-17  6:46   ` Eric Mudama
  5 siblings, 0 replies; 69+ messages in thread
From: Joel Jaeggli @ 2004-09-15 21:41 UTC (permalink / raw)
  To: David Stevens; +Cc: Netdev, leonid.grossman, Linux Kernel

On Wed, 15 Sep 2004, David Stevens wrote:

> I've never understood why people are so interested in off-loading
> networking. Isn't that just a multi-processor system where you can't
> use any of the network processor cycles for anything else? And, of
> course, to be cheap, the network processor will be slower, and much
> harder to debug and update software.

I'd like to amplify this. Adding more general-purpose CPU to a machine 
strikes me as the right design choice, since general-purpose CPUs are simply 
more useful than dedicated ones. Look at Linux software RAID compared to the 
alternatives: frankly I haven't seen a hardware controller that can touch 
it for performance given a similar number of disks and interfaces... 
Currently graphics cards have substantially more memory bandwidth and 
pipelines than most general-purpose CPUs, but eventually that won't be the 
case. As it is, GPUs still represent the biggest chunk of independent 
computational power in a machine, and at least on the server side we don't 
even use them.

> If the PCI bus is too slow, or MTU's too small, wouldn't
> it be better to fix those directly and use a fast host processor that can
> also do other things when not needed for networking? And why have
> memory on a NIC that can't be used by other things?

Between HyperTransport tunnels, PCI-X, PCI Express and InfiniBand, the 
bottlenecks between the CPU core and the peripherals and memory are falling 
away at a rapid clip even as CPUs get faster. We're in a much better 
position to build balanced systems than we were 2 years ago.

> Why don't we off-load filesystems to disks instead?  Or a graphics
> card that implements X ? :-) I'd rather have shared system resources--
> more flexible. :-)
>
>                                        +-DLS

-- 
-------------------------------------------------------------------------- 
Joel Jaeggli  	       Unix Consulting 	       joelja@darkwing.uoregon.edu 
GPG Key Fingerprint:     5C6E 0104 BAF0 40B0 5BD3 C38B F000 35AB B67F 56B2

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 20:11 ` David Stevens
                     ` (3 preceding siblings ...)
  2004-09-15 21:41   ` Joel Jaeggli
@ 2004-09-16  6:33   ` Valdis.Kletnieks
  2004-09-17  6:46   ` Eric Mudama
  5 siblings, 0 replies; 69+ messages in thread
From: Valdis.Kletnieks @ 2004-09-16  6:33 UTC (permalink / raw)
  To: David Stevens; +Cc: Netdev, leonid.grossman, Linux Kernel

[-- Attachment #1: Type: text/plain, Size: 1384 bytes --]

On Wed, 15 Sep 2004 14:11:04 MDT, David Stevens said:

> Why don't we off-load filesystems to disks instead?  Or a graphics
> card that implements X ? :-) I'd rather have shared system resources--
> more flexible. :-)

All depends where in the "cycle of reincarnation" we are at the moment.  Way
back in 1964, IBM released this monster called System/360 - and one of the
things it did was push a *lot* of the disk processing off on the channel and
disk controller using a count-key-data format rather than the fixed-block that
Linux uses. So out on the platters, the disk format would say things like "This
is a 400 byte record, the first 56 of which is a search key". A lot of stuff,
both userspace and OS, used things like 'Search Key Equal' and letting the disk
do all the searching.
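
To make the count-key-data idea concrete, here is a toy model (a sketch, not
how the channel programs actually looked): each record carries its own key,
and "Search Key Equal" is a scan that the controller, rather than the CPU,
would perform. The names `CkdRecord` and `search_key_equal` are made up for
illustration.

```python
from dataclasses import dataclass

@dataclass
class CkdRecord:
    key: bytes    # search key stored with the record on the track
    data: bytes   # the rest of the record

def search_key_equal(track, wanted):
    """Return the first record on the track whose key equals `wanted`, or None.

    On the real hardware this comparison ran in the channel and disk
    controller; here it is just a loop on the host.
    """
    for record in track:
        if record.key == wanted:
            return record
    return None

# A 400-byte record whose first 56 bytes are the key, as in the example above.
key = b"SMITH".ljust(56)
track = [
    CkdRecord(key=b"JONES".ljust(56), data=bytes(344)),
    CkdRecord(key=key, data=bytes(344)),
]
hit = search_key_equal(track, key)
print(len(hit.key) + len(hit.data))
```

The point of the design was that the host issued one request and got back
only the matching record, instead of reading every block and comparing keys
itself.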

There was also this terminal beast called the 3270, which had a local
controller for the terminals, and only interrupted the CPU on 'page send' type
events.

Back then, the ideas made sense - it wasn't at all unreasonable for a single
S/360-65 to drive 3,000+ concurrent terminals in an airline reservation system or
similar (and we're talking about a box that had literally only half the
hamsters of a VAX780).

But today, the 3270 isn't seen much anymore, and currently IBM emulates the CKD
format on fixed-block systems for their z/Series boxes running z/OS or whatever MVS is
called now....

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 20:11 ` David Stevens
                     ` (4 preceding siblings ...)
  2004-09-16  6:33   ` Valdis.Kletnieks
@ 2004-09-17  6:46   ` Eric Mudama
  2004-09-17 14:15     ` Alan Cox
  2004-09-17 20:27     ` Valdis.Kletnieks
  5 siblings, 2 replies; 69+ messages in thread
From: Eric Mudama @ 2004-09-17  6:46 UTC (permalink / raw)
  To: David Stevens; +Cc: Netdev, leonid.grossman, Linux Kernel

On Wed, 15 Sep 2004 14:11:04 -0600, David Stevens <dlstevens@us.ibm.com> wrote:
> Why don't we off-load filesystems to disks instead?

Disks have had file systems on them since close to the beginning...

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-17  6:46   ` Eric Mudama
@ 2004-09-17 14:15     ` Alan Cox
  2004-09-17 20:27     ` Valdis.Kletnieks
  1 sibling, 0 replies; 69+ messages in thread
From: Alan Cox @ 2004-09-17 14:15 UTC (permalink / raw)
  To: Eric Mudama
  Cc: David Stevens, Netdev, leonid.grossman, Linux Kernel Mailing List

On Gwe, 2004-09-17 at 07:46, Eric Mudama wrote:
> On Wed, 15 Sep 2004 14:11:04 -0600, David Stevens <dlstevens@us.ibm.com> wrote:
> > Why don't we off-load filesystems to disks instead?
> 
> Disks have had file systems on them since close to the beginning...

This is essentially the path Lustre is taking. Although it seems you
don't want to have a "full" file system on the disk since you lose too
much flexibility, instead you want the ability to allocate by handle
giving hints about locality and use.

People have also tried full file system offload - intel for example
prototyped an I2O file system class, and adaptec clearly were trying
this out on aacraid development from looking at the public headers.

Alan

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-17  6:46   ` Eric Mudama
  2004-09-17 14:15     ` Alan Cox
@ 2004-09-17 20:27     ` Valdis.Kletnieks
  2004-09-17 20:36       ` David Lang
  2004-09-22 23:25       ` Eric Mudama
  1 sibling, 2 replies; 69+ messages in thread
From: Valdis.Kletnieks @ 2004-09-17 20:27 UTC (permalink / raw)
  To: Eric Mudama; +Cc: David Stevens, Netdev, leonid.grossman, Linux Kernel

[-- Attachment #1: Type: text/plain, Size: 840 bytes --]

On Fri, 17 Sep 2004 00:46:59 MDT, Eric Mudama said:
> On Wed, 15 Sep 2004 14:11:04 -0600, David Stevens <dlstevens@us.ibm.com> wrote:
> > Why don't we off-load filesystems to disks instead?
> 
> Disks have had file systems on them since close to the beginning...

No, he means "offload the processing of the filesystem to the disk itself".

IBM's MVS  systems basically did that - it used the disk's "Search Key" I/O
opcodes to basically get the equivalent of doing namei() out on the disk itself
(it did this for system catalog and PDS directory searches from the beginning,
and added 'indexed VTOC' support in the mid-80s).  So you'd send out a CCW
(channel command word) stream that basically said "Find me the dataset
USER3.ACCTING.TESTJOBS", and when the I/O completed, you'd have the DSCB (the
moral equiv of an inode) ready to go.

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-17 20:27     ` Valdis.Kletnieks
@ 2004-09-17 20:36       ` David Lang
  2004-09-17 23:20         ` Tony Lee
  2004-09-22 23:25       ` Eric Mudama
  1 sibling, 1 reply; 69+ messages in thread
From: David Lang @ 2004-09-17 20:36 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Eric Mudama, David Stevens, Netdev, leonid.grossman, Linux Kernel

Actually the sector-based access that is made to modern drives is a very 
primitive filesystem. If you go back to the days of the MFM and RLL drives 
you had the computer sending the raw bitstreams to the drives, but with 
SCSI and IDE this stopped, and you instead send a higher-level logical block 
to the drive and it deals with the details of getting it to and from the 
platter.

David Lang

On Fri, 17 Sep 2004 Valdis.Kletnieks@vt.edu wrote:

> Date: Fri, 17 Sep 2004 16:27:31 -0400
> From: Valdis.Kletnieks@vt.edu
> To: Eric Mudama <edmudama@gmail.com>
> Cc: David Stevens <dlstevens@us.ibm.com>, Netdev <netdev@oss.sgi.com>,
>     leonid.grossman@s2io.com, Linux Kernel <linux-kernel@vger.kernel.org>
> Subject: Re: The ultimate TOE design 
> 
> On Fri, 17 Sep 2004 00:46:59 MDT, Eric Mudama said:
>> On Wed, 15 Sep 2004 14:11:04 -0600, David Stevens <dlstevens@us.ibm.com> wrote:
>>> Why don't we off-load filesystems to disks instead?
>>
>> Disks have had file systems on them since close to the beginning...
>
> No, he means "offload the processing of the filesystem to the disk itself".
>
> IBM's MVS  systems basically did that - it used the disk's "Search Key" I/O
> opcodes to basically get the equivalent of doing namei() out on the disk itself
> (it did this for system catalog and PDS directory searches from the beginning,
> and added 'indexed VTOC' support in the mid-80s).  So you'd send out a CCW
> (channel command word) stream that basically said "Find me the dataset
> USER3.ACCTING.TESTJOBS", and when the I/O completed, you'd have the DSCB (the
> moral equiv of an inode) ready to go.
>
>

-- 
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
  -- C.A.R. Hoare

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-17 20:36       ` David Lang
@ 2004-09-17 23:20         ` Tony Lee
  2004-09-17 23:36           ` Leonid Grossman
  0 siblings, 1 reply; 69+ messages in thread
From: Tony Lee @ 2004-09-17 23:20 UTC (permalink / raw)
  To: David Lang
  Cc: valdis.kletnieks, Eric Mudama, David Stevens, Netdev,
	leonid.grossman, Linux Kernel

On Fri, 17 Sep 2004 13:36:14 -0700 (PDT), David Lang
<david.lang@digitalinsight.com> wrote:
> actually the sector based access that is made to modern drives is a very
> primitive filesystem. if you go back to the days of the MFM and RLL drives
> you had the computer sending the raw bitstreams to the drives, but with
> SCSI and IDE this stopped and you instead send a higher level logical block to
> the drive and it deals with the details of getting it to and from the
> platter.
> 
> David Lang
> 

Maybe the next evolutionary step is to put the VFS layer directly on top of
RDMA -> PCI Express/latest serial I/O, etc.  Similar to accessing files
through NFS/SMB, just on a faster, standardized (RDMA) transport.

On the networking front, instead of TOE, it should be services offload,
similar to a web load balancer: offload services based on src/dest addr,
port, and proto (tcp/udp).  NSO (network service offload) - kind of like
Apache's reverse proxy with URL rewrite, but maybe for other applications.

Question for Leonid of S2io.com:  Your company has an interesting card.
I think it must have some kind of embedded CPU.  Care to tell us what kind 
of CPU it is?

-- 
-Tony
Having a lot of fun with Xilinx Virtex Pro II reconfigurable HW + ppc + Linux

^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: The ultimate TOE design
  2004-09-17 23:20         ` Tony Lee
@ 2004-09-17 23:36           ` Leonid Grossman
  0 siblings, 0 replies; 69+ messages in thread
From: Leonid Grossman @ 2004-09-17 23:36 UTC (permalink / raw)
  To: 'Tony Lee', 'David Lang'
  Cc: valdis.kletnieks, 'Eric Mudama', 'David Stevens',
	'Netdev', 'Linux Kernel'

 

> -----Original Message-----
> From: Tony Lee [mailto:tony.p.lee@gmail.com] 
> Sent: Friday, September 17, 2004 4:21 PM
Skipped...

> Question for Leonid of S2io.com:  Your company has an 
> interesting card.
> I think it must have some kind of embedded CPU.  Care to tell 
> us what kind of CPU are they?

Hi Tony, 
For 10GbE card, we designed our own ASIC - embedded CPUs don't cut it at
10GbE...
Leonid


> -- 
> -Tony
> Having a lot of fun with Xilinx Virtex Pro II reconfigurable 
> HW + ppc + Linux
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-17 20:27     ` Valdis.Kletnieks
  2004-09-17 20:36       ` David Lang
@ 2004-09-22 23:25       ` Eric Mudama
  1 sibling, 0 replies; 69+ messages in thread
From: Eric Mudama @ 2004-09-22 23:25 UTC (permalink / raw)
  To: valdis.kletnieks@vt.edu
  Cc: David Stevens, Netdev, leonid.grossman, Linux Kernel

On Fri, 17 Sep 2004 16:27:31 -0400, valdis.kletnieks@vt.edu
<valdis.kletnieks@vt.edu> wrote:
> No, he means "offload the processing of the filesystem to the disk itself".

I know what was meant.

I'm not saying the filesystem on the drive is very advanced, but it's
still a filesystem.  Our "Record ID" is the LBA identifier, and all
records are 1 block in size.  We can handle defects, reallocations,
and other issues, with some success.
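
A toy sketch of that view of a drive, under loud assumptions: the class
name, the spare-pool layout, and the remap policy are invented for
illustration and do not describe any real firmware. Records are fixed-size,
the LBA is the record ID, and a defective location is hidden behind a remap
table.

```python
class ToyDrive:
    """Toy model: fixed-size records addressed by LBA, with defect remapping."""

    def __init__(self, num_lbas, num_spares):
        self.num_lbas = num_lbas
        self.store = {}    # physical sector -> data
        self.remap = {}    # LBA -> spare physical sector
        # Hypothetical layout: spares live just past the user-visible LBAs.
        self.spares = list(range(num_lbas, num_lbas + num_spares))

    def _physical(self, lba):
        if not 0 <= lba < self.num_lbas:
            raise IndexError("LBA out of range")
        return self.remap.get(lba, lba)

    def write(self, lba, data):
        self.store[self._physical(lba)] = data

    def read(self, lba):
        return self.store.get(self._physical(lba))

    def reallocate(self, lba):
        """Move a defective LBA to a spare sector, keeping its data if present."""
        if not self.spares:
            raise RuntimeError("no spare sectors left")
        old = self._physical(lba)
        new = self.spares.pop(0)
        if old in self.store:
            self.store[new] = self.store.pop(old)
        self.remap[lba] = new

drive = ToyDrive(num_lbas=8, num_spares=2)
drive.write(3, b"payload")
drive.reallocate(3)          # the host never sees the move
print(drive.read(3))
```

The host keeps using LBA 3 throughout; only the drive knows the record now
lives in a spare sector.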

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 19:33 The ultimate TOE design Jeff Garzik
  2004-09-15 20:04 ` Paul Jakma
  2004-09-15 20:11 ` David Stevens
@ 2004-09-15 21:36 ` John Heffner
  2004-09-15 21:46   ` David S. Miller
  2004-09-15 23:16   ` James Morris
  2004-09-16  9:03 ` Lars Marowsky-Bree
  3 siblings, 2 replies; 69+ messages in thread
From: John Heffner @ 2004-09-15 21:36 UTC (permalink / raw)
  To: Netdev; +Cc: leonid.grossman

My view on TOE is that it is brought up in response to the fact that when
leading edge network technologies are brought out (GigE a few years ago,
10 GigE now), hosts can't keep up.  Specifically, these people usually
don't care about wasting their main CPU cycles, but rather the fact they
can't get the full rate out of their expensive new NIC.

The reasons hosts can't keep up are

  (a) host bus speeds, or (not exclusive)
  (b) the CPU can't handle per-packet processing

In the case of (a), TOE doesn't really help.  In the case of (b), Jeff's
proposed general-purpose offload doesn't help -- you really need a custom
ASIC or maybe FPGA if you hope to beat the host CPU.  Thus I think Jeff's
idea is not likely to fly with this crowd of TOE proponents.

The other (much nicer) solution to case (b) is to just USE A BIGGER MTU.
1500 bytes is ridiculously small.  Even with a 9k MTU, the benefits of TOE
or TSO are nearly vanishing.  Those who say they require high performance,
but are unwilling to buy or produce networking gear with an MTU larger
than 1500 bytes probably deserve what they get.
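
To put rough numbers on the per-packet argument, the sketch below computes
how many frames per second a host must handle to fill a link. It assumes
standard Ethernet framing overhead and full-sized frames only; the 10 Gb/s
figure is just an example, and none of this is a measurement.

```python
# Per-frame overhead on the wire, in bytes (standard Ethernet framing):
# 14 header + 4 FCS + 8 preamble/SFD + 12 inter-frame gap.
ETH_OVERHEAD = 14 + 4 + 8 + 12

def frames_per_second(link_bps, mtu):
    """Frames per second needed to fill the link with MTU-sized packets."""
    return link_bps / (8 * (mtu + ETH_OVERHEAD))

for mtu in (1500, 9000):
    print(mtu, round(frames_per_second(10e9, mtu)))
```

At 10 Gb/s this works out to roughly 813 thousand frames per second with a
1500-byte MTU versus roughly 138 thousand with 9000 bytes, so the per-packet
work drops by nearly a factor of six.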

There are other possible justifications for TOE (with other
counter-arguments) -- basically to reduce load on the main CPU -- but I
think these are for the most part NOT what is driving the market (let me
know if I'm being myopic here).  This issue is also largely or completely
solved by using a bigger MTU.

  -John

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 21:36 ` John Heffner
@ 2004-09-15 21:46   ` David S. Miller
  2004-09-16  6:20     ` Andi Kleen
  2004-09-15 23:16   ` James Morris
  1 sibling, 1 reply; 69+ messages in thread
From: David S. Miller @ 2004-09-15 21:46 UTC (permalink / raw)
  To: John Heffner; +Cc: netdev, leonid.grossman

On Wed, 15 Sep 2004 17:36:18 -0400 (EDT)
John Heffner <jheffner@psc.edu> wrote:

> The other (much nicer) solution to case (b) is to just USE A BIGGER MTU.
> 1500 bytes is ridiculously small.  Even with a 9k MTU, the benefits of TOE
> or TSO are nearly vanishing.  Those who say they require high performance,
> but are unwilling to buy or produce networking gear with an MTU larger
> than 1500 bytes probably deserve what they get.

TSO gives a kind of virtual 64K MTU, FWIW.  But I do see your
point.
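
For readers unfamiliar with TSO, a minimal sketch of the idea: the stack
hands down one large buffer and the NIC cuts it into MSS-sized segments. The
function name and the numbers are illustrative assumptions; real hardware
also replicates headers and fixes up sequence numbers and checksums, which
is left out here.

```python
def tso_segments(payload_len, mss):
    """Split one large send into (offset, length) pieces of at most `mss` bytes."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [(off, min(mss, payload_len - off))
            for off in range(0, payload_len, mss)]

# Hypothetical numbers: a 64 KiB send and a 1448-byte MSS (1500-byte MTU
# minus 20 bytes IP, 20 bytes TCP, and 12 bytes of timestamp option).
segs = tso_segments(64 * 1024, 1448)
print(len(segs), segs[-1])
```

With those numbers the stack makes one trip down the transmit path and the
card emits 46 segments, the last one carrying 376 bytes.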

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-15 21:46   ` David S. Miller
@ 2004-09-16  6:20     ` Andi Kleen
  2004-09-16 13:10       ` Leonid Grossman
  0 siblings, 1 reply; 69+ messages in thread
From: Andi Kleen @ 2004-09-16  6:20 UTC (permalink / raw)
  To: David S. Miller; +Cc: John Heffner, netdev, leonid.grossman

On Wed, Sep 15, 2004 at 02:46:24PM -0700, David S. Miller wrote:
> On Wed, 15 Sep 2004 17:36:18 -0400 (EDT)
> John Heffner <jheffner@psc.edu> wrote:
> 
> > The other (much nicer) solution to case (b) is to just USE A BIGGER MTU.
> > 1500 bytes is ridiculously small.  Even with a 9k MTU, the benefits of TOE
> > or TSO are nearly vanishing.  Those who say they require high performance,
> > but are unwilling to buy or produce networking gear with an MTU larger
> > than 1500 bytes probably deserve what they get.
> 
> TSO gives a kind of virtual 64K MTU, FWIW.  But I do see your
> point.

We still need to solve the same problem for RX though.

-Andi

^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: The ultimate TOE design
  2004-09-16  6:20     ` Andi Kleen
@ 2004-09-16 13:10       ` Leonid Grossman
  2004-09-16 16:18         ` Nivedita Singhvi
  0 siblings, 1 reply; 69+ messages in thread
From: Leonid Grossman @ 2004-09-16 13:10 UTC (permalink / raw)
  To: 'Andi Kleen', 'David S. Miller'
  Cc: 'John Heffner', netdev

> -----Original Message-----
> From: Andi Kleen [mailto:ak@suse.de] 
> Sent: Wednesday, September 15, 2004 11:21 PM
> To: David S. Miller
> Cc: John Heffner; netdev@oss.sgi.com; leonid.grossman@s2io.com
> Subject: Re: The ultimate TOE design
> 
> On Wed, Sep 15, 2004 at 02:46:24PM -0700, David S. Miller wrote:
> > On Wed, 15 Sep 2004 17:36:18 -0400 (EDT) John Heffner 
> > <jheffner@psc.edu> wrote:
> > 
> > > The other (much nicer) solution to case (b) is to just 
> USE A BIGGER MTU.
> > > 1500 bytes is ridiculously small.  Even with a 9k MTU, 
> the benefits 
> > > of TOE or TSO are nearly vanishing.  Those who say they 
> require high 
> > > performance, but are unwilling to buy or produce networking gear 
> > > with an MTU larger than 1500 bytes probably deserve what they get.
> > 
> > TSO gives a kind of virtual 64K MTU, FWIW.  But I do see your point.
> 
> We still need to solve the same problem for RX though.
> 
> -Andi

Ditto.

We can dream about the benefits of huge MTUs, but the reality is that moving
beyond 9k MTU is years away. Reasons - mainly infrastructure, plus MTU above
~10k may lose checksum protection (granted, this depends on whether the errors
are simple or complex, and also this may not be a showstopper for some
people).
Even 9k MTU is very far from being universally accepted, eight years after
our Alteon spec went out :-).
TSO works great for the transmit side (even for 9k MTU, the impact is not
insignificant), but the RX problem that Andi is talking about is a major issue
for a lot of users.

I don't have hard data yet, but the expectations are that the effect of
doing "RX side TSO" will be close to having 64k RX MTU - I'll publish some
numbers once we bring up first Unix drivers with this feature and do some
measurements.

Leonid

> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-16 13:10       ` Leonid Grossman
@ 2004-09-16 16:18         ` Nivedita Singhvi
  2004-09-16 20:34           ` Leonid Grossman
  0 siblings, 1 reply; 69+ messages in thread
From: Nivedita Singhvi @ 2004-09-16 16:18 UTC (permalink / raw)
  To: Leonid Grossman
  Cc: 'Andi Kleen', 'David S. Miller',
	'John Heffner', netdev

Leonid Grossman wrote:

> We can dream about benefits of huge MTUs, but the reality is that moving
> beyond 9k MTU is years away. Reasons - mainly infrastructure, plus MTU above
> ~10k may lose checksum protection (granted, this depends whether the errors
> are simple or complex, and also this may not be a showstopper for some
> people).
> Even 9k MTU is very far from being universally accepted, eight years after
> our Alteon spec went out :-).

One other factor is TCP congestion control, and congestion
windows we obey. Most of the time, you just can't send that
much.
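
A quick sketch of why the window matters, using the usual rule that a sender
can have at most one window in flight per round trip. The window size and
RTT below are hypothetical, picked only to show the scale.

```python
def window_limited_bps(window_bytes, rtt_seconds):
    """Upper bound on TCP throughput: at most one window per round trip."""
    return window_bytes * 8 / rtt_seconds

def window_needed_bytes(link_bps, rtt_seconds):
    """Bandwidth-delay product: window required to keep the link full."""
    return link_bps * rtt_seconds / 8

# Hypothetical path: 64 KiB window, 100 ms round-trip time.
print(window_limited_bps(64 * 1024, 0.1) / 1e6, "Mb/s ceiling")
print(window_needed_bytes(10e9, 0.1) / 1e6, "MB window to fill 10 Gb/s")
```

On such a path a 64 KiB window caps out around 5 Mb/s regardless of MTU,
while filling 10 Gb/s would take a window of about 125 MB.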

thanks,
Nivedita

^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: The ultimate TOE design
  2004-09-16 16:18         ` Nivedita Singhvi
@ 2004-09-16 20:34           ` Leonid Grossman
  2004-09-22 20:18             ` Nivedita Singhvi
  0 siblings, 1 reply; 69+ messages in thread
From: Leonid Grossman @ 2004-09-16 20:34 UTC (permalink / raw)
  To: 'Nivedita Singhvi'
  Cc: 'Andi Kleen', 'David S. Miller',
	'John Heffner', netdev

 

> -----Original Message-----
> From: Nivedita Singhvi [mailto:niv@us.ibm.com] 
> Sent: Thursday, September 16, 2004 9:19 AM
> To: Leonid Grossman
> Cc: 'Andi Kleen'; 'David S. Miller'; 'John Heffner'; 
> netdev@oss.sgi.com
> Subject: Re: The ultimate TOE design
> 
> Leonid Grossman wrote:
> 
> > We can dream about benefits of huge MTUs, but the reality is that 
> > moving beyond 9k MTU is years away. Reasons - mainly 
> infrastructure, 
> > plus MTU above ~10k may lose checksum protection (granted, this 
> > depends whether the errors are simple or complex, and also this may 
> > not be a showstopper for some people).
> > Even 9k MTU is very far from being universally accepted, 
> eight years 
> > after our Alteon spec went out :-).
> 
> One other factor is TCP congestion control, and congestion 
> windows we obey. Most of the time, you just can't send that much.

It's a bit painful to set up, but in general with 9k jumbos and TSO we were
able to get close to the PCI-X 133 limit - both in LAN and WAN tests.
Leonid
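
For scale, a sketch of where that bus limit sits, assuming a 64-bit PCI-X
slot at a nominal 133 MHz and one transfer per clock; real throughput is
lower once bus protocol overhead and arbitration are counted.

```python
def bus_peak_gbps(width_bits, clock_mhz):
    """Theoretical peak of a parallel bus: one `width_bits` transfer per clock."""
    return width_bits * clock_mhz * 1e6 / 1e9

pcix_133 = bus_peak_gbps(64, 133)   # 64-bit PCI-X at a nominal 133 MHz
print(f"{pcix_133:.2f} Gb/s peak, before bus protocol overhead")
```

That is about 8.5 Gb/s shared between both directions, so the bus rather
than the 10GbE link is the ceiling in this setup.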

> 
> thanks,
> Nivedita
> 
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
  2004-09-16 20:34           ` Leonid Grossman
@ 2004-09-22 20:18             ` Nivedita Singhvi
  2004-09-23  4:46               ` Leonid Grossman
  0 siblings, 1 reply; 69+ messages in thread
From: Nivedita Singhvi @ 2004-09-22 20:18 UTC (permalink / raw)
  To: Leonid Grossman
  Cc: 'Andi Kleen', 'David S. Miller',
	'John Heffner', netdev, linux-kernel

Leonid Grossman wrote:

>>From: Nivedita Singhvi [mailto:niv@us.ibm.com] 
>>Sent: Thursday, September 16, 2004 9:19 AM
>>To: Leonid Grossman
>>Cc: 'Andi Kleen'; 'David S. Miller'; 'John Heffner'; 
>>netdev@oss.sgi.com
>>Subject: Re: The ultimate TOE design
>>
>>Leonid Grossman wrote:
>>
>>
>>>We can dream about benefits of huge MTUs, but the reality is that 
>>>moving beyond 9k MTU is years away. Reasons - mainly infrastructure, 
>>>plus MTU above ~10k may lose checksum protection (granted, this 
>>>depends whether the errors are simple or complex, and also this may 
>>>not be a showstopper for some people).
>>>Even 9k MTU is very far from being universally accepted, 
>>>eight years after our Alteon spec went out :-).
>>
>>One other factor is TCP congestion control, and congestion 
>>windows we obey. Most of the time, you just can't send that much.
> 
> 
> It's a bit painful to setup, but in general with 9k jumbos and TSO we were
> able to get close to pci-x 133 limit - both in LAN and WAN tests.
> Leonid

Cool, but a very specific environment, no? ;)

What concerns me about all this is that it seems
such a host-centric design. Wouldn't it be nice if
we had a little bit more of a network-centric worldview
when designing network infrastructure?

It isn't just a matter of how hard we can push stuff
out, it also matters how much the network can take.
Blasting tens of gigs into the ether seems all very
exciting, sexy and cool, but suited for dedicated links
or network attached storage channels, not general-purpose
networking on the Internet or intranets.

And if that is the case, we're talking about a much
smaller market (but perhaps a more profitable
one ;))...

thanks,
Nivedita

^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: The ultimate TOE design
  2004-09-22 20:18             ` Nivedita Singhvi
@ 2004-09-23  4:46               ` Leonid Grossman
  0 siblings, 0 replies; 69+ messages in thread
From: Leonid Grossman @ 2004-09-23  4:46 UTC (permalink / raw)
  To: 'Nivedita Singhvi'
  Cc: 'Andi Kleen', 'David S. Miller',
	'John Heffner', netdev, linux-kernel


> > 
> > It's a bit painful to setup, but in general with 9k jumbos 
> and TSO we 
> > were able to get close to pci-x 133 limit - both in LAN and 
> WAN tests.
> > Leonid
> 
> Cool, but a very specific environment, no? ;)

Define specific environment :-). We are running common TCP benchmarks like
nttcp or iperf or Chariot or filesystem applications on very generic white
boxes, with generic OS/settings.

> 
> What concerns me about all this is that it seems so very 
> host-centric design. Wouldn't it be nice if we had a little 
> bit more network-centric worldview when designing network 
> infrastructure?
> 
> It isn't just a matter of how hard we can push stuff out, it 
> also matters how much the network can take.
> Blasting tens of gigs into the ether seems all very exciting 
> sexy and cool, but suited for dedicated links or network 
> attached storage channels, not general-purpose networking on 
> the Internet or intra-nets.

This is somewhat different from IB or FC "miniature networks";
some/most of the 10GbE testing runs in existing datacenters or over 
existing long-haul links - see for example
http://sravot.home.cern.ch/sravot/Networking/10GbE/LSR_041504.htm

Cheers, Leonid

> 
> And if that is the case, we're talking about a much smaller 
> market (but perhaps a more profitable one ;))...
> 
> thanks,
> Nivedita
> 
> 
> 


* Re: The ultimate TOE design
  2004-09-15 21:36 ` John Heffner
  2004-09-15 21:46   ` David S. Miller
@ 2004-09-15 23:16   ` James Morris
  2004-09-15 23:37     ` Leonid Grossman
  2004-09-15 23:52     ` John Heffner
  1 sibling, 2 replies; 69+ messages in thread
From: James Morris @ 2004-09-15 23:16 UTC (permalink / raw)
  To: John Heffner; +Cc: Netdev, leonid.grossman

On Wed, 15 Sep 2004, John Heffner wrote:

> The other (much nicer) solution to case (b) is to just USE A BIGGER MTU.
> 1500 bytes is ridiculously small.  Even with a 9k MTU, the benefits of TOE
> or TSO are nearly vanishing.

Do you have any figures on (large) MTU size vs performance on a current
commodity system?


- James
-- 
James Morris
<jmorris@redhat.com>


* RE: The ultimate TOE design
  2004-09-15 23:16   ` James Morris
@ 2004-09-15 23:37     ` Leonid Grossman
  2004-09-15 23:52     ` John Heffner
  1 sibling, 0 replies; 69+ messages in thread
From: Leonid Grossman @ 2004-09-15 23:37 UTC (permalink / raw)
  To: 'James Morris', 'John Heffner'; +Cc: 'Netdev'

 

> -----Original Message-----
> From: James Morris [mailto:jmorris@redhat.com] 
> Sent: Wednesday, September 15, 2004 4:16 PM
> To: John Heffner
> Cc: Netdev; leonid.grossman@s2io.com
> Subject: Re: The ultimate TOE design
> 
> On Wed, 15 Sep 2004, John Heffner wrote:
> 
> > The other (much nicer) solution to case (b) is to just USE 
> A BIGGER MTU.
> > 1500 bytes is ridiculously small.  Even with a 9k MTU, the 
> benefits of 
> > TOE or TSO are nearly vanishing.
> 
> Do you have any figures on (large) MTU size vs performance on 
> a current commodity system?

It's very system-dependent. For example, on a 2-way Xeon our card goes from
~2Gbps to ~6Gbps; on 64-bit systems the delta is obviously smaller.

For 9k MTU the delta goes down, of course, but it is still ~10% on 2.6
systems - say, on 2-way Opterons we go from 7Gbps to 7.6Gbps.
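
The deltas above can be put on a common footing as relative gains. This
is only a small sketch of the arithmetic on the figures quoted in this
message; the function and labels are illustrative.

```python
def relative_gain(before_gbps, after_gbps):
    """Return the throughput improvement as a fraction of the baseline."""
    return (after_gbps - before_gbps) / before_gbps


# (before, after) throughput in Gbit/s, as quoted in the message.
cases = {
    "2-way Xeon": (2.0, 6.0),
    "2-way Opteron, 9k MTU": (7.0, 7.6),
}

for name, (before, after) in cases.items():
    gain = relative_gain(before, after)
    print(f"{name}: {before} -> {after} Gbit/s, gain {gain:.1%}")
```

By this arithmetic the 9k MTU case works out a little under the quoted
~10%, which is consistent with the figures being rounded.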

Leonid

> 
> 
> - James
> --
> James Morris
> <jmorris@redhat.com>
> 
> 


* Re: The ultimate TOE design
  2004-09-15 23:16   ` James Morris
  2004-09-15 23:37     ` Leonid Grossman
@ 2004-09-15 23:52     ` John Heffner
  2004-09-16  1:43       ` James Morris
  1 sibling, 1 reply; 69+ messages in thread
From: John Heffner @ 2004-09-15 23:52 UTC (permalink / raw)
  To: James Morris; +Cc: Netdev, leonid.grossman

On Wed, 15 Sep 2004, James Morris wrote:

> On Wed, 15 Sep 2004, John Heffner wrote:
>
> > The other (much nicer) solution to case (b) is to just USE A BIGGER MTU.
> > 1500 bytes is ridiculously small.  Even with a 9k MTU, the benefits of TOE
> > or TSO are nearly vanishing.
>
> Do you have any figures on (large) MTU size vs performance on a current
> commodity system?

What qualifies as large?  I ran some measurements out to 9k w/GigE a few
years ago.  (Something like 100 byte increments.)  I can try to find the
data if anyone is interested.  I may try to run a similar experiment on 10
GigE with modern hardware, and I think these cards can go out to 16k as
well.

The basic idea is that the marginal benefit (in terms of CPU cycles) of
increasing the MTU is proportional to log(MTU).  In all experiments I have
done or seen, the data agree with this, except for discontinuities due to
PCI settings, page sizes and maybe a couple of other things.
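
A rough way to picture this, offered as a sketch and not as the
measurements referred to above: the packet rate at a fixed link speed
falls as 1/MTU, while a log-shaped benefit adds the same increment for
every doubling of the MTU. The 10 Gbit/s link rate, the constant k and
both function names are assumptions made for illustration.

```python
import math


def packets_per_second(link_bps, mtu_bytes):
    # Assumption: every packet is exactly one MTU; framing overhead ignored.
    return link_bps / (mtu_bytes * 8)


def log_benefit(mtu_bytes, base_mtu=1500, k=1.0):
    # Hypothetical model: benefit relative to base_mtu grows as
    # k * ln(mtu / base_mtu); k is an arbitrary scale factor.
    return k * math.log(mtu_bytes / base_mtu)


for mtu in (1500, 3000, 6000, 9000, 16000):
    pps = packets_per_second(10e9, mtu)
    print(f"MTU {mtu:>5}: {pps:>8.0f} pkt/s, "
          f"modelled benefit {log_benefit(mtu):.2f}")
```

Under this reading, going from 1500 to 3000 bytes buys as much as going
from 3000 to 6000, which is why the curve flattens at large MTUs.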

There are some arguments that going to much larger MTUs could be of
substantial benefit other than CPU cycles, but this is harder to quantify.
See <http://www.psc.edu/~mathis/MTU/>.

  -John


* Re: The ultimate TOE design
  2004-09-15 23:52     ` John Heffner
@ 2004-09-16  1:43       ` James Morris
  0 siblings, 0 replies; 69+ messages in thread
From: James Morris @ 2004-09-16  1:43 UTC (permalink / raw)
  To: John Heffner; +Cc: Netdev, leonid.grossman

On Wed, 15 Sep 2004, John Heffner wrote:

> > Do you have any figures on (large) MTU size vs performance on a current
> > commodity system?
> 
> What qualifies as large?  I ran some measurements out to 9k w/GigE a few
> years ago.  (Something like 100 byte increments.)  I can try to find the
> data if anyone is interested.  I may try to run a similar experiment on 10
> GigE with modern hardware, and I think these cards can go out to 16k as
> well.

Anything above 1500 bytes (up to 64k would be interesting).

> There are some arguments that going to much larger MTUs could be of
> substantial benefit other than CPU cycles, but this is harder to quantify.
> See <http://www.psc.edu/~mathis/MTU/>.

Thanks for the info.


- James
-- 
James Morris
<jmorris@redhat.com>


* Re: The ultimate TOE design
  2004-09-15 19:33 The ultimate TOE design Jeff Garzik
                   ` (2 preceding siblings ...)
  2004-09-15 21:36 ` John Heffner
@ 2004-09-16  9:03 ` Lars Marowsky-Bree
  3 siblings, 0 replies; 69+ messages in thread
From: Lars Marowsky-Bree @ 2004-09-16  9:03 UTC (permalink / raw)
  To: Netdev; +Cc: Linux Kernel

On 2004-09-15T15:33:47,
   Jeff Garzik <jgarzik@pobox.com> said:

> Then, your host system OS will communicate with the Linux kernel running 
> on the card across the PCI bus, using IP packets (64K fixed MTU).
> 
> This effectively:

Actually, given that there's almost no reason to offload TCP/IP
processing for speed (better to spend the money on CPU / memory for the
main system), I like the idea of this for security: Off-load the packet
filtering to create an additional security barrier. (Different CPU
architecture and all that.)

(With two cards, one could even use the conntrack fail-over internally.
- A Linux-running NIC with builtin firewalling, sell to all the windows
weenies... ;)

With dedicated processors, maybe an IPsec accelerator would also be
cool, but I'd think a crypto accelerator for the main system would again
be saner here (unless, of course, the argument of the security domain
isolation is applied again).

Admittedly, one can solve all these differently, but it still might be
cool. ;-)

Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX AG - A Novell company


[parent not found: <1095328673.1063.130.camel@jzny.localdomain>]

* RE: The ultimate TOE design
       [not found] <1095328673.1063.130.camel@jzny.localdomain>
@ 2004-09-16 14:57 ` Leonid Grossman
  0 siblings, 0 replies; 69+ messages in thread
From: Leonid Grossman @ 2004-09-16 14:57 UTC (permalink / raw)
  To: hadi
  Cc: 'Jeff Garzik', 'David S. Miller', alan, paul,
	netdev, linux-kernel

 

> -----Original Message-----
> From: jamal [mailto:hadi@cyberus.ca] 
> Sent: Thursday, September 16, 2004 2:58 AM
> To: Leonid Grossman
> Cc: 'Jeff Garzik'; 'David S. Miller'; 
> alan@lxorguk.ukuu.org.uk; paul@clubi.ie; netdev@oss.sgi.com; 
> linux-kernel@vger.kernel.org
> Subject: RE: The ultimate TOE design
> 
> On Thu, 2004-09-16 at 01:25, Leonid Grossman wrote:
> >  
> > > -----Original Message-----
> > > From: jamal [mailto:hadi@cyberus.ca]
> 
> > > On a serious note, I think that PCI-express (if it lives up to its
> > > expectation) will demolish dreams of a lot of these TOE 
> investments.
> > > Our problem is NOT the CPU right now (80% idle processing 450Kpps 
> > > forwarding). Bus and memory distance/latency are.
> > 
> > In servers, both bottlenecks are there - if you look at the cost of 
> > TCP and filesystem processing at 10GbE, CPU is a huge problem (and 
> > will be for foreseeable future), even for fastest 64-bit systems.
> 
> True, but with the bus contention being a non-issue you got 
> more of that xeon being available for use (lets say i can use 
> 50% more of its capacity then i can do more). IOW, it becomes 
> a compute capacity problem mostly - one that you should in 
> theory be able to throw more CPU at. SMT (the way power5 and 
> some of the network processors do it[1]) should go a long way 
> to address both additional compute and hardware threading to 
> work around memory latencies. With PCI-express, compute power 
> in mini-clustering in the form of AS 
> (http://www.asi-sig.org/home) is being plotted as we speak.
> To summarize: The problem to solve in 24 months may be 100GigE.
> 
> > I agree though that bus and memory are bigger issues, this 
> is exactly 
> > the reason for all these RDMA over Ethernet investments :-)
> 
> And AS does a damn good job at specing all those RDMA 
> requirements; my view is that intel is going to build them 
> chips - so it can be done on a
> $5 board off the pacific rim. This takes most of the small 
> players out of the market.
> 
> > Anyways, did not mean to start an argument - with all the 
> new CPU, bus 
> > and HBA technologies coming to the market it will be another 18-24 
> > months before we know what works and what doesn't...
> 
> Agreed. Would you like to invest in something that will be obsoleted in
> 18-24 months though? OR even not obsoleted, but holds that 
> uncertainty?
> I think that's the risk facing you when you are in the offload 
> business.

Well.. Any business has risks, this one doesn't seem to be higher than
others :-)
I view the 18-26 month timeframe as the start of offload mass adoption,
not the end of it.

In our tests, the bus contention and the %cpu are mostly orthogonal
problems; PCI-X DDR and PCI-Express will help but only to a point.
(BTW this is all related to the higher end systems - 2-4 way and above,
running 10GbE NICs. Client is a different story, cpu is mostly "free"
there).
My sense is that (unlike in previous cycles) the "slow host, fast network"
scenario is here to stay for a long while, and will have to be addressed one
way or another - whether it is a full TOE+RDMA offload in the longer run, or
an improvement to "static" offloads. 
In server space, applications will never be happy with less than 80% cpu.

Leonid

> 
> Here are results for Hifn 7956 ref board on 2.6GHz P4 (HT) 
> system, kernel  2.6.6 SMP as compared to a s/ware only setup 
> on same machine.
> [Name of tester withheld to protect privacy].
> 
> first column - algo, second - packet size, third - time in us 
> spent by hw crypto, fourth - time in us spent by sw crypto:
> 
> des   64:       28      3
> des  128:       29      6
> des  192:       33      9
> des  256:       33      12
> des  320:       37      15
> des  384:       38      18
> des  448:       41      21
> des  512:       42      23
> des  576:       45      26
> des  640:       46      29
> des  704:       49      33
> des  768:       50      35
> des  832:       53      38
> des  896:       54      41
> des  960:       57      44
> des 1024:       58      47
> des 1088:       61      50
> des 1152:       62      53
> des 1216:       66      56
> des 1280:       66      59
> des 1344:       70      62
> des 1408:       71      65
> des 1472:       74      68
> des3_ede   64:  28      6
> des3_ede  128:  30      13
> des3_ede  192:  34      20
> des3_ede  256:  43      26
> des3_ede  320:  38      33
> des3_ede  384:  48      40
> des3_ede  448:  44      45
> des3_ede  512:  54      53
> des3_ede  576:  50      60
> des3_ede  640:  59      67
> des3_ede  704:  55      74
> des3_ede  768:  66      78
> des3_ede  832:  61      85
> des3_ede  896:  72      94
> des3_ede  960:  67      100
> des3_ede 1024:  77      107
> des3_ede 1088:  73      114
> des3_ede 1152:  82      121
> des3_ede 1216:  79      127
> des3_ede 1280:  88      128
> des3_ede 1344:  84      135
> des3_ede 1408:  94      147
> des3_ede 1472:  90      153
> aes   64:       28      2
> aes  192:       33      6
> aes  320:       37      10
> aes  448:       46      15
> aes  576:       53      19
> aes  704:       53      23
> aes  832:       65      28
> aes  960:       66      32
> aes 1088:       71      37
> aes 1216:       80      41
> aes 1344:       83      45
> aes 1472:       92      50
> 
> Moral of the data above: The 2.6Ghz is already showing signs 
> of obsoleting the hifn crypto offloader[2]. I think it took 
> less than a year for it to happen.
> 
> cheers,
> jamal
> 
> [1] I also like the MIPS.com approach to SMT
> 
> [2] There are actually issues with some of the crypto 
> offloading in Linux; however this does serve as a good example.
> 
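
The quoted table can be summarised by asking, per algorithm, at which
packet sizes the hardware path is actually faster than software. The
sketch below copies the timings from the table as-is; the helper name
and data layout are illustrative, not part of any driver or benchmark.

```python
# Timings in microseconds copied from the quoted table.
SIZES_64 = list(range(64, 1473, 64))    # packet sizes for des, des3_ede
SIZES_128 = list(range(64, 1473, 128))  # packet sizes for aes

# algo -> (sizes, hw times, sw times)
TIMINGS = {
    "des": (SIZES_64,
            [28, 29, 33, 33, 37, 38, 41, 42, 45, 46, 49, 50,
             53, 54, 57, 58, 61, 62, 66, 66, 70, 71, 74],
            [3, 6, 9, 12, 15, 18, 21, 23, 26, 29, 33, 35,
             38, 41, 44, 47, 50, 53, 56, 59, 62, 65, 68]),
    "des3_ede": (SIZES_64,
                 [28, 30, 34, 43, 38, 48, 44, 54, 50, 59, 55, 66,
                  61, 72, 67, 77, 73, 82, 79, 88, 84, 94, 90],
                 [6, 13, 20, 26, 33, 40, 45, 53, 60, 67, 74, 78,
                  85, 94, 100, 107, 114, 121, 127, 128, 135, 147, 153]),
    "aes": (SIZES_128,
            [28, 33, 37, 46, 53, 53, 65, 66, 71, 80, 83, 92],
            [2, 6, 10, 15, 19, 23, 28, 32, 37, 41, 45, 50]),
}


def sizes_where_hw_wins(algo):
    """Return the packet sizes at which hw crypto took less time than sw."""
    sizes, hw, sw = TIMINGS[algo]
    return [size for size, h, s in zip(sizes, hw, sw) if h < s]


for algo, (sizes, _, _) in TIMINGS.items():
    wins = sizes_where_hw_wins(algo)
    first = wins[0] if wins else "never"
    print(f"{algo}: hw faster at {len(wins)} of {len(sizes)} sizes, "
          f"first at {first}")
```

On these numbers the offload board only pays off for des3_ede on larger
packets; for des and aes the software path is faster at every size
listed.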


end of thread, other threads:[~2004-09-24 19:39 UTC | newest]

Thread overview: 69+ messages
2004-09-15 19:33 The ultimate TOE design Jeff Garzik
2004-09-15 20:04 ` Paul Jakma
2004-09-15 19:14   ` Alan Cox
2004-09-15 20:41     ` Jeff Garzik
2004-09-15 21:01       ` David S. Miller
2004-09-15 21:08         ` Jeff Garzik
2004-09-15 21:13           ` David S. Miller
2004-09-15 21:23             ` Jeff Garzik
2004-09-15 21:29               ` David S. Miller
2004-09-15 22:26                 ` Jeff Garzik
2004-09-15 23:29                 ` Leonid Grossman
2004-09-24 13:07                   ` Lennert Buytenhek
2004-09-24 13:21                     ` Leonid Grossman
2004-09-24 18:09                       ` Lennert Buytenhek
2004-09-24 19:39                         ` Joel Jaeggli
2004-09-16  0:57               ` jamal
2004-09-16  5:25                 ` Leonid Grossman
2004-09-16  9:29               ` Lincoln Dale
2004-09-16 12:19                 ` Alan Cox
2004-09-16 13:33                   ` Andi Kleen
2004-09-16 12:57                     ` Alan Cox
2004-09-16 22:37                       ` Lincoln Dale
2004-09-17 13:38                         ` Jörn Engel
2004-09-15 22:31             ` Jeff Garzik
2004-09-15 21:15         ` Michael Richardson
2004-09-15 20:53     ` David S. Miller
2004-09-16  1:05       ` Andrea Arcangeli
2004-09-15 21:10     ` David Lang
2004-09-15 23:05     ` Paul Jakma
2004-09-15 20:26   ` Neil Horman
2004-09-15 21:03     ` Wes Felter
2004-09-15 21:15       ` Jeff Garzik
2004-09-15 21:35         ` Wes Felter
2004-09-15 21:42           ` Jeff Garzik
2004-09-15 21:25       ` Imran Badr
2004-09-16 11:37       ` Neil Horman
2004-09-16  5:51     ` Matt Porter
2004-09-15 21:36   ` Deepak Saxena
2004-09-15 23:03     ` Paul Jakma
2004-09-24 13:11     ` Lennert Buytenhek
2004-09-15 21:59   ` Tony Lee
2004-09-15 20:11 ` David Stevens
2004-09-15 20:16   ` David Schwartz
2004-09-15 20:25   ` Jeff Garzik
2004-09-15 20:54     ` Neil Horman
2004-09-15 20:31   ` Bill Rugolsky Jr.
2004-09-15 21:41   ` Joel Jaeggli
2004-09-16  6:33   ` Valdis.Kletnieks
2004-09-17  6:46   ` Eric Mudama
2004-09-17 14:15     ` Alan Cox
2004-09-17 20:27     ` Valdis.Kletnieks
2004-09-17 20:36       ` David Lang
2004-09-17 23:20         ` Tony Lee
2004-09-17 23:36           ` Leonid Grossman
2004-09-22 23:25       ` Eric Mudama
2004-09-15 21:36 ` John Heffner
2004-09-15 21:46   ` David S. Miller
2004-09-16  6:20     ` Andi Kleen
2004-09-16 13:10       ` Leonid Grossman
2004-09-16 16:18         ` Nivedita Singhvi
2004-09-16 20:34           ` Leonid Grossman
2004-09-22 20:18             ` Nivedita Singhvi
2004-09-23  4:46               ` Leonid Grossman
2004-09-15 23:16   ` James Morris
2004-09-15 23:37     ` Leonid Grossman
2004-09-15 23:52     ` John Heffner
2004-09-16  1:43       ` James Morris
2004-09-16  9:03 ` Lars Marowsky-Bree
     [not found] <1095328673.1063.130.camel@jzny.localdomain>
2004-09-16 14:57 ` Leonid Grossman
