* The ultimate TOE design
@ 2004-09-15 19:33 Jeff Garzik
2004-09-15 20:04 ` Paul Jakma
` (3 more replies)
0 siblings, 4 replies; 69+ messages in thread
From: Jeff Garzik @ 2004-09-15 19:33 UTC (permalink / raw)
To: netdev; +Cc: leonid.grossman, Linux Kernel
(reply-to set to netdev)
Every now and then people ask on the lists about TOE, TCP assist, and
that sort of thing. Ignoring the issue of TCP hardware assist, I wanted
to describe what I feel is an optimal method to _fully offload_ the
Linux TCP stack.
Put simply, the "ultimate TOE card" would be a card with network ports,
a generic CPU (arm, mips, whatever.), some RAM, and some flash. This
card's "firmware" is the Linux kernel, configured to run as a _totally
independent network node_, with IP address(es) all its own.
Then, your host system OS will communicate with the Linux kernel running
on the card across the PCI bus, using IP packets (64K fixed MTU).
This effectively:
1) Moves fragment processing, IPsec, and other services onto the card.
2) You can use huge card<->host MTUs, which makes sendfile(2) faster
with _zero_ kernel changes
3) You can let the PCI card do 100% of the checksum
processing/generation, and treat the network connection
across the PCI bus as CHECKSUM_UNNECESSARY.
4) With enough RAM and cpu cycles, you can even offload complex services
like Web services: the PCI card runs Apache, and fetches files across
the network (your PCI bus!) from the host system.
5) Does not require _any_ modification of Linux network stack.
Interfacing with the card merely requires a simple DMA interface to copy
IP (not ethernet) packets across the PCI bus, and that fits within the
existing Linux net driver API.
6) ensures that the TOE "firmware" [the Linux kernel] can be easily
updated in the event of new features or (more importantly) security
problems.
7) Linux is the most RFC-compliant net stack in the world. Why
re-create (or license) an inferior one?
8) Long-term maintenance of TOE firmware is a BIG problem with existing
full-TOE systems. Under this design, sysadmins would update and patch
their PCI card with security updates just like any other system on their
network. This is added work, yes, but it's a known quantity and a task
they are already doing for other systems.
9) The design is both portable [tons of embedded CPUs, with and without
MMUs, can run Linux] and scalable.
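[Editor's sketch, not part of the original mail.] The checksum point in the list above can be made concrete. The code below is a minimal Python model of the 16-bit one's-complement Internet checksum in the style of RFC 1071, which is the work the card would do so that the host can treat packets arriving over the PCI link as CHECKSUM_UNNECESSARY. The function names are illustrative only and do not correspond to any kernel API.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit one's-complement Internet checksum of data."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a trailing zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def checksum_ok(data_with_checksum: bytes) -> bool:
    """True if a block that already carries its checksum verifies."""
    return internet_checksum(data_with_checksum) == 0


if __name__ == "__main__":
    payload = bytes.fromhex("0001f203f4f5f6f7")
    csum = internet_checksum(payload)
    print(hex(csum))
    print(checksum_ok(payload + csum.to_bytes(2, "big")))
```

In the proposed design this loop would run on the card's CPU (or its checksum hardware) for traffic on the wire, and the host driver would simply mark received packets as already verified.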
My dream is that some vendor will come along and implement such a
design, and sell it in enough volume that it's US$100 or less. There
are a few cards on the market already where implementing this design
_may_ be possible, but they are all fairly expensive. Just need enough
resources on the PCI card to be able to run Linux as a
router/firewall/iSCSI/web-proxy gadget.
And I'm not aware of anybody doing a direct IP-over-PCI thing, either.
But I'll keep on dreaming... ;-)
Jeff
^ permalink raw reply [flat|nested] 69+ messages in thread

* Re: The ultimate TOE design
From: Paul Jakma @ 2004-09-15 20:04 UTC (permalink / raw)
To: Netdev; +Cc: leonid.grossman, Linux Kernel

On Wed, 15 Sep 2004, Jeff Garzik wrote:
> Put simply, the "ultimate TOE card" would be a card with network ports, a
> generic CPU (arm, mips, whatever.), some RAM, and some flash. This card's
> "firmware" is the Linux kernel, configured to run as a _totally independent
> network node_, with IP address(es) all its own.
>
> Then, your host system OS will communicate with the Linux kernel running on
> the card across the PCI bus, using IP packets (64K fixed MTU).

> My dream is that some vendor will come along and implement such a
> design, and sell it in enough volume that it's US$100 or less.

> There are a few cards on the market already where implementing this
> design _may_ be possible, but they are all fairly expensive.

The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI
card running Linux. Or is that what you were referring to with
"<cards exist> but they are all fairly expensive."?

> Jeff

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
Fortune:
There is nothing so easy but that it becomes difficult when you do it
reluctantly.
-- Publius Terentius Afer (Terence)

* Re: The ultimate TOE design
From: Alan Cox @ 2004-09-15 19:14 UTC (permalink / raw)
To: Paul Jakma; +Cc: Netdev, leonid.grossman, Linux Kernel Mailing List

On Mer, 2004-09-15 at 21:04, Paul Jakma wrote:
> The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI
> card running Linux. Or is that what you were referring to with
> "<cards exist> but they are all fairly expensive."?

Last time I checked 2GHz accelerators for intel and AMD were quite cheap
and also had the advantage they ran user mode code when idle from
network processing.

* Re: The ultimate TOE design
From: Jeff Garzik @ 2004-09-15 20:41 UTC (permalink / raw)
To: Alan Cox; +Cc: Paul Jakma, Netdev, leonid.grossman, Linux Kernel Mailing List

Alan Cox wrote:
> On Mer, 2004-09-15 at 21:04, Paul Jakma wrote:
>
>>The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI
>>card running Linux. Or is that what you were referring to with
>>"<cards exist> but they are all fairly expensive."?
>
> Last time I checked 2GHz accelerators for intel and AMD were quite cheap
> and also had the advantage they ran user mode code when idle from
> network processing.

The point was more to show people who are doing TOE _anyway_ to a decent
design.

As I said in another post, "just don't bother with TOE" is a very valid
answer with today's CPUs.

Jeff

* Re: The ultimate TOE design
From: David S. Miller @ 2004-09-15 21:01 UTC (permalink / raw)
To: Jeff Garzik; +Cc: alan, paul, netdev, leonid.grossman, linux-kernel

On Wed, 15 Sep 2004 16:41:51 -0400
Jeff Garzik <jgarzik@pobox.com> wrote:

> The point was more to show people who are doing TOE _anyway_ to a decent
> design.

We shouldn't be forced to refine people's non-sensible ideas which
we'll not support anyways.

If TOE is supported on Windows only, I happily welcome that.

* Re: The ultimate TOE design
From: Jeff Garzik @ 2004-09-15 21:08 UTC (permalink / raw)
To: David S. Miller; +Cc: alan, paul, netdev, leonid.grossman, linux-kernel

On Wed, Sep 15, 2004 at 02:01:23PM -0700, David S. Miller wrote:
> On Wed, 15 Sep 2004 16:41:51 -0400
> Jeff Garzik <jgarzik@pobox.com> wrote:
>
> > The point was more to show people who are doing TOE _anyway_ to a decent
> > design.
>
> We shouldn't be forced to refine people's non-sensible ideas which
> we'll not support anyways.

I just described a design that -we already support-. It's a generic,
scalable model that has application outside the acronym "TOE".

Did you read my message, or just see 'TOE' and nothing else?

Sun used this model with their x86 cards. Total MP did something similar
with their 4-processor PowerPC cards. There's nothing inherently wrong
with sticking a computer running Linux inside another computer ;-)

Jeff

* Re: The ultimate TOE design
From: David S. Miller @ 2004-09-15 21:13 UTC (permalink / raw)
To: Jeff Garzik; +Cc: alan, paul, netdev, leonid.grossman, linux-kernel

On Wed, 15 Sep 2004 17:08:18 -0400
Jeff Garzik <jgarzik@pobox.com> wrote:

> There's nothing inherently wrong with sticking a computer running
> Linux inside another computer ;-)

And we already support that :-)

Plus we have things like TSO too but that doesn't require a full Linux
instance to realize on a networking port.
Simple silicon implements this already.
I don't see how that differs from your "big MTU" ideas.

* Re: The ultimate TOE design
From: Jeff Garzik @ 2004-09-15 21:23 UTC (permalink / raw)
To: David S. Miller; +Cc: alan, paul, netdev, leonid.grossman, linux-kernel

David S. Miller wrote:
> On Wed, 15 Sep 2004 17:08:18 -0400
> Jeff Garzik <jgarzik@pobox.com> wrote:
>
>>There's nothing inherently wrong with sticking a computer running
>>Linux inside another computer ;-)
>
> And we already support that :-)
>
> Plus we have things like TSO too but that doesn't require a full Linux
> instance to realize on a networking port.
> Simple silicon implements this already.
> I don't see how that differs from your "big MTU" ideas.

Part of this is about how to talk to business people.... marketing.

The typical definition of TOE is "offload 90+% of the net stack", as
opposed to "TCP assist", which is stuff like TSO.

If people ask about how to support TOE in Linux, you can say "sure, we
_already_ support TOE, just stick Linux on a PCI card" rather than "no
we don't support it."

And wha-la, we support TOE with zero code changes ;-)

Jeff, who would love to have a bunch of Athlons
on PCI cards to play with.

* Re: The ultimate TOE design
From: David S. Miller @ 2004-09-15 21:29 UTC (permalink / raw)
To: Jeff Garzik; +Cc: alan, paul, netdev, leonid.grossman, linux-kernel

On Wed, 15 Sep 2004 17:23:49 -0400
Jeff Garzik <jgarzik@pobox.com> wrote:

> The typical definition of TOE is "offload 90+% of the net stack", as
> opposed to "TCP assist", which is stuff like TSO.

I think a better goal is "offload 90+% of the net stack cost" which
is effectively what TSO does on the send side.

This is why these discussions are so circular.

If we want to discuss something specific, like receive offload schemes,
that is a very different matter. And I'm sure folks like Rusty have
a lot to contribute in this area :-)

* Re: The ultimate TOE design
From: Jeff Garzik @ 2004-09-15 22:26 UTC (permalink / raw)
To: David S. Miller; +Cc: alan, paul, netdev, leonid.grossman, linux-kernel

David S. Miller wrote:
> On Wed, 15 Sep 2004 17:23:49 -0400
> Jeff Garzik <jgarzik@pobox.com> wrote:
>
>>The typical definition of TOE is "offload 90+% of the net stack", as
>>opposed to "TCP assist", which is stuff like TSO.
>
> I think a better goal is "offload 90+% of the net stack cost" which
> is effectively what TSO does on the send side.

A better goal is to not bother with TOE at all, and just get multi-core
processors with huge memory bandwidth :)

Again, the point of my message is to have something _positive_ to tell
people when they specifically asked about TOE. Rather than "no, we'll
never do TOE" we have "it's possible, but there are better questions you
should be asking"

Jeff

* RE: The ultimate TOE design
From: Leonid Grossman @ 2004-09-15 23:29 UTC (permalink / raw)
To: 'David S. Miller', 'Jeff Garzik'; +Cc: alan, paul, netdev, linux-kernel

I think Jeff's "ultimate TOE card" based upon generic embedded CPU is
doable at GbE, but we may not see such a product because it's too late for
it to succeed.

TOE is a pretty questionable product in itself; one of the main reasons
people build TOE cards is to put RDMA on top of it and end up with an RNIC
(NIC+TOE+RDMA) Ethernet card. The hope is to eventually run all three types
of server traffic (network, storage, IPC) over an RNIC, and get rid of two
other HBAs in a system.

For this "fabric conversion" over Ethernet to happen it has to be at 10GbE
not GbE, since storage (Fibre Channel) is already at 4Gb. And at 10GbE,
embedded CPUs just don't cut it - it has to be custom ASIC (granted, with
some means to simplify debugging and reduce the risk of hw bugs and TCP
changes).

On some other points on the thread:

WRT the TOE price, I suspect that when RNICs come out they will command
little premium over conventional NICs - it will be just a technology
upgrade.

WRT larger MTU - going to bigger MTUs helps a lot, but it will be years
before the infrastructure moves beyond 9600 byte MTU. Even right now, usage
of 9600 byte Jumbos is not universal.

WRT TSO, for applications that don't require RDMA TSO indeed helps a lot on
the transmit side for 1500 MTU - 10GbE cards are inevitably CPU bound, and
we are seeing ~3x throughput improvement with normal frames.

This leaves receive offload schemes in Linux as the biggest improvement
(short of supporting TOE) to make. It will be great to see such receive
schemes defined and implemented, as I stated in an earlier thread we will
be willing to participate in such work and put the support in S2io 10GbE
ASIC and drivers.

> -----Original Message-----
> From: David S. Miller [mailto:davem@davemloft.net]
> Sent: Wednesday, September 15, 2004 2:29 PM
> To: Jeff Garzik
> Cc: alan@lxorguk.ukuu.org.uk; paul@clubi.ie;
> netdev@oss.sgi.com; leonid.grossman@s2io.com;
> linux-kernel@vger.kernel.org
> Subject: Re: The ultimate TOE design
>
> On Wed, 15 Sep 2004 17:23:49 -0400
> Jeff Garzik <jgarzik@pobox.com> wrote:
>
> > The typical definition of TOE is "offload 90+% of the net
> > stack", as opposed to "TCP assist", which is stuff like TSO.
>
> I think a better goal is "offload 90+% of the net stack cost"
> which is effectively what TSO does on the send side.
>
> This is why these discussions are so circular.
>
> If we want to discuss something specific, like receive
> offload schemes, that is a very different matter. And I'm
> sure folks like Rusty have a lot to contribute in this area :-)
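
[Editor's sketch, not part of the original mail.] For readers unfamiliar with TSO, the idea on the transmit side is that the host hands the NIC one large buffer and the NIC cuts it into wire-sized TCP segments, so the host stack runs once per buffer instead of once per frame. The Python below only models that slicing step. The 1460-byte MSS assumes a 1500-byte MTU minus 20-byte IPv4 and 20-byte TCP headers with no options, and the function name is illustrative rather than taken from any driver.

```python
from typing import List

MTU = 1500
IPV4_HEADER = 20  # assumption: no IP options
TCP_HEADER = 20   # assumption: no TCP options
MSS = MTU - IPV4_HEADER - TCP_HEADER  # 1460 bytes of payload per segment


def tso_segment(payload: bytes, mss: int = MSS) -> List[bytes]:
    """Split one large send buffer into MSS-sized chunks, as a TSO NIC would."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]


if __name__ == "__main__":
    buf = bytes(64 * 1024)  # one 64 KiB buffer handed down by the host
    segments = tso_segment(buf)
    print(len(segments), "segments, last one", len(segments[-1]), "bytes")
```

A real NIC also replicates and fixes up the IP and TCP headers for every segment (sequence numbers, lengths, checksums); that part is omitted here.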

* Re: The ultimate TOE design
From: Lennert Buytenhek @ 2004-09-24 13:07 UTC (permalink / raw)
To: Leonid Grossman; +Cc: 'David S. Miller', 'Jeff Garzik', alan, paul, netdev, linux-kernel

On Wed, Sep 15, 2004 at 04:29:45PM -0700, Leonid Grossman wrote:

> And at 10GbE, embedded CPUs just don't cut it - it has to be custom ASIC
> (granted, with some means to simplify debugging and reduce the risk of hw
> bugs and TCP changes).

Intel's IXP2800 can do 10GbE.

http://www.intel.com/design/network/products/npfamily/ixp2800.htm

--L

* RE: The ultimate TOE design
From: Leonid Grossman @ 2004-09-24 13:21 UTC (permalink / raw)
To: 'Lennert Buytenhek'; +Cc: 'David S. Miller', 'Jeff Garzik', alan, paul, netdev, linux-kernel

> -----Original Message-----
> From: Lennert Buytenhek [mailto:buytenh@wantstofly.org]
> Sent: Friday, September 24, 2004 6:08 AM
> To: Leonid Grossman
> Cc: 'David S. Miller'; 'Jeff Garzik';
> alan@lxorguk.ukuu.org.uk; paul@clubi.ie; netdev@oss.sgi.com;
> linux-kernel@vger.kernel.org
> Subject: Re: The ultimate TOE design
>
> On Wed, Sep 15, 2004 at 04:29:45PM -0700, Leonid Grossman wrote:
>
> > And at 10GbE, embedded CPUs just don't cut it - it has to be custom
> > ASIC (granted, with some means to simplify debugging and reduce the
> > risk of hw bugs and TCP changes).
>
> Intel's IXP2800 can do 10GbE.

Hi Lennert,
I was referring to the server side.
One can certainly build a 10GbE box based on IXP2800 (or some other parts),
but at 17-25W it is not usable in NICs since the entire PCI card budget is
less than that - nothing left for 10GbE PHY, memory, etc.

Leonid

>
> http://www.intel.com/design/network/products/npfamily/ixp2800.htm
>
>
> --L
>

* Re: The ultimate TOE design
From: Lennert Buytenhek @ 2004-09-24 18:09 UTC (permalink / raw)
To: Leonid Grossman; +Cc: 'David S. Miller', 'Jeff Garzik', alan, paul, netdev, linux-kernel

On Fri, Sep 24, 2004 at 06:21:35AM -0700, Leonid Grossman wrote:

> > > And at 10GbE, embedded CPUs just don't cut it - it has to be custom
> > > ASIC (granted, with some means to simplify debugging and reduce the
> > > risk of hw bugs and TCP changes).
> >
> > Intel's IXP2800 can do 10GbE.
>
> Hi Lennert,

Hello,

> I was referring to the server side.
> One can certainly build a 10GbE box based on IXP2800 (or some other parts),
> but at 17-25W it is not usable in NICs since the entire PCI card budget is
> less than that - nothing left for 10GbE PHY, memory, etc.

Ah, ok, that makes sense. As someone else also noted, the IXP2800
only has a 64/66 PCI interface anyway, so it wouldn't really be
suitable for the task you were referring to.

cheers,
Lennert
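
[Editor's sketch, not part of the original mail.] The "64/66 PCI" remark can be checked with simple arithmetic: a parallel bus moves at most width times clock bits per second. The Python below computes that theoretical peak, ignoring arbitration, address phases and other protocol overhead, so real throughput is lower. The helper name is illustrative.

```python
def pci_peak_gbps(width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak transfer rate of a parallel PCI bus in Gbit/s."""
    return width_bits * clock_mhz * 1e6 / 1e9


if __name__ == "__main__":
    for width, clock in ((32, 33), (64, 66)):
        rate = pci_peak_gbps(width, clock)
        print(f"{width}-bit/{clock} MHz PCI: {rate:.3f} Gbit/s peak")
    # Even the best case for 64/66 is well short of a 10 Gbit/s line rate.
    print(pci_peak_gbps(64, 66) < 10.0)
```

So a part that talks to the host only over 64-bit/66 MHz PCI cannot feed a 10GbE port at line rate, whatever its packet-processing capacity.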

* Re: The ultimate TOE design
From: Joel Jaeggli @ 2004-09-24 19:39 UTC (permalink / raw)
To: Lennert Buytenhek; +Cc: Leonid Grossman, 'David S. Miller', 'Jeff Garzik', alan, paul, netdev, linux-kernel

On Fri, 24 Sep 2004, Lennert Buytenhek wrote:

>> I was referring to the server side.
>> One can certainly build a 10GbE box based on IXP2800 (or some other parts),
>> but at 17-25W it is not usable in NICs since the entire PCI card budget is
>> less than that - nothing left for 10GbE PHY, memory, etc.

I have a graphics card which requires two four-pin Molex power connectors.
Going back in time, there have always been certain peripheral cards which
required external (non-bus-supplied) power sources for whatever reason
(hard-drive on a card, sparc on a card, pc on a card, early 90's hardware
mpeg encoder, data collection device, remote management card, graphics card
in modern mac, etc.). It's obviously not a general solution, but it's been
done frequently enough that it shouldn't just be discarded out of hand.

> Ah, ok, that makes sense. As someone else also noted, the IXP2800
> only has a 64/66 PCI interface anyway, so it wouldn't really be
> suitable for the task you were referring to.
>
>
> cheers,
> Lennert
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

--
--------------------------------------------------------------------------
Joel Jaeggli Unix Consulting joelja@darkwing.uoregon.edu
GPG Key Fingerprint: 5C6E 0104 BAF0 40B0 5BD3 C38B F000 35AB B67F 56B2

* Re: The ultimate TOE design
From: jamal @ 2004-09-16 0:57 UTC (permalink / raw)
To: Jeff Garzik; +Cc: David S. Miller, alan, paul, netdev, leonid.grossman, linux-kernel

Jeff,
You are only allowed to start a TOE thread every six months ;->

On a serious note, I think that PCI-express (if it lives up to its
expectation) will demolish dreams of a lot of these TOE investments.
Our problem is NOT the CPU right now (80% idle processing 450Kpps
forwarding). Bus and memory distance/latency are. If intel would get rid
of the big conspiracy in the form of chipset division and just integrate
the MC like AMD is, we'll be on our way to kill TOE and a lot of the
network processors (like the IXP). Dang, running Linux is more exciting
than microcoding things to fit into a 2Kword program store.

I rest my canadiana $.02

cheers,
jamal

* RE: The ultimate TOE design
From: Leonid Grossman @ 2004-09-16 5:25 UTC (permalink / raw)
To: hadi, 'Jeff Garzik'; +Cc: 'David S. Miller', alan, paul, netdev, linux-kernel

> -----Original Message-----
> From: jamal [mailto:hadi@cyberus.ca]
> Sent: Wednesday, September 15, 2004 5:58 PM
> To: Jeff Garzik
> Cc: David S. Miller; alan@lxorguk.ukuu.org.uk; paul@clubi.ie;
> netdev@oss.sgi.com; leonid.grossman@s2io.com;
> linux-kernel@vger.kernel.org
> Subject: Re: The ultimate TOE design
>
> Jeff,
> You are only allowed to start a TOE thread every six months ;->
>
> On a serious note, I think that PCI-express (if it lives up to its
> expectation) will demolish dreams of a lot of these TOE investments.
> Our problem is NOT the CPU right now (80% idle processing
> 450Kpps forwarding). Bus and memory distance/latency are.

In servers, both bottlenecks are there - if you look at the cost of TCP and
filesystem processing at 10GbE, CPU is a huge problem (and will be for
foreseeable future), even for fastest 64-bit systems.

I agree though that bus and memory are bigger issues, this is exactly the
reason for all these RDMA over Ethernet investments :-)

Anyways, did not mean to start an argument - with all the new CPU, bus and
HBA technologies coming to the market it will be another 18-24 months before
we know what works and what doesn't...

Leonid

> If
> intel would get rid of the big conspiracy in the form of
> chipset division and just integrate the MC like AMD is, we'll
> be on our way to kill TOE and a lot of the network
> processors (like the IXP). Dang, running Linux is more
> exciting than microcoding things to fit into a 2Kword program store.
>
> I rest my canadiana $.02
>
> cheers,
> jamal
>

* Re: The ultimate TOE design
From: Lincoln Dale @ 2004-09-16 9:29 UTC (permalink / raw)
To: Jeff Garzik, David S. Miller; +Cc: alan, paul, netdev, leonid.grossman, linux-kernel

not that i disagree with the general idea and rationale, but reality is
what it is today for some reasons:

At 07:23 AM 16/09/2004, Jeff Garzik wrote:
> Jeff, who would love to have a bunch of Athlons
> on PCI cards to play with.

. . . this ignores the realities of power restrictions of PCI today . . .
sure, one could create a PCI card that takes a power-connector, but that
don't scale so well either . . .

At 07:29 AM 16/09/2004, David S. Miller wrote:
>I think a better goal is "offload 90+% of the net stack cost" which
>is effectively what TSO does on the send side.
>
>This is why these discussions are so circular.

TSO works on LAN-like environments (zero latency, minimal drop), it doesn't
work so well across the internet . . .

i believe that there are better alternatives than TSO, but it involves NICs
having decent scatter-gather DMA engines and being able to handle multiple
transactions (packets/frames) at once.
in theory, NICs like tg2/tg3 should be capable of implementing something
like this -- if one could get to the ucode on the embedded cores.

at least with PCI Express the general architecture of a PC starts to have a
hope of keeping up with Moore's law.
the same couldn't be said prior to DDR-SDRAM and higher front-side-bus
frequencies.

cheers,
lincoln.

* Re: The ultimate TOE design
From: Alan Cox @ 2004-09-16 12:19 UTC (permalink / raw)
To: Lincoln Dale; +Cc: Jeff Garzik, David S. Miller, paul, netdev, leonid.grossman, Linux Kernel Mailing List

On Iau, 2004-09-16 at 10:29, Lincoln Dale wrote:
> . . . this ignores the realities of power restrictions of PCI today . . .
> sure, one could create a PCI card that takes a power-connector, but that
> don't scale so well either . . .

At 1GHz the Athlon Geode NX draws about 6W. That's less than my SCSI
controller. I'm sure it's not coincidence that powerpc shows up on such
boards a lot more than x86 however.

* Re: The ultimate TOE design
From: Andi Kleen @ 2004-09-16 13:33 UTC (permalink / raw)
To: Alan Cox; +Cc: Lincoln Dale, Jeff Garzik, David S. Miller, paul, netdev, leonid.grossman, Linux Kernel Mailing List

On Thu, Sep 16, 2004 at 01:19:21PM +0100, Alan Cox wrote:
> On Iau, 2004-09-16 at 10:29, Lincoln Dale wrote:
> > . . . this ignores the realities of power restrictions of PCI today . . .
> > sure, one could create a PCI card that takes a power-connector, but that
> > don't scale so well either . . .
>
> At 1GHz the Athlon Geode NX draws about 6W. That's less than my SCSI

Are you sure that's worst case, not average? Worst case is usually
much worse on a big CPU like an Athlon, but the power supply
has to be sized for it.

-Andi

* Re: The ultimate TOE design
From: Alan Cox @ 2004-09-16 12:57 UTC (permalink / raw)
To: Andi Kleen; +Cc: Lincoln Dale, Jeff Garzik, David S. Miller, paul, netdev, leonid.grossman, Linux Kernel Mailing List

On Iau, 2004-09-16 at 14:33, Andi Kleen wrote:
> > At 1GHz the Athlon Geode NX draws about 6W. That's less than my SCSI
>
> Are you sure that's worst case, not average? Worst case is usually
> much worse on a big CPU like an Athlon, but the power supply
> has to be sized for it.

You are correct - 6W average, 9W TDP, still less than my SCSI controller
8)

* Re: The ultimate TOE design
From: Lincoln Dale @ 2004-09-16 22:37 UTC (permalink / raw)
To: Alan Cox; +Cc: Andi Kleen, Jeff Garzik, David S. Miller, paul, netdev, leonid.grossman, Linux Kernel Mailing List

Hi Alan,

At 10:57 PM 16/09/2004, Alan Cox wrote:
>On Iau, 2004-09-16 at 14:33, Andi Kleen wrote:
> > > At 1GHz the Athlon Geode NX draws about 6W. That's less than my SCSI
> >
> > Are you sure that's worst case, not average? Worst case is usually
> > much worse on a big CPU like an Athlon, but the power supply
> > has to be sized for it.
>
>You are correct - 6W average, 9W TDP, still less than my SCSI controller
>8)

sure -- ok -- that gets you the main processor.
now add to that a Northbridge (perhaps AMD doesn't need that but i'm sure it
still does), Southbridge, DDR-SDRAM, ancillary chips for doing MAC, PHY,
...

couple that with the voltage of PCI where you're likely to need
step-up/step-down circuits (which aren't 100% efficient themselves), you're
still going to get very close to the limit, if not over it.

... and after all that, the Geode is really designed to be an embedded
processor.
Jeff was implying using garden-variety processors which seem to have large
heatsinks, not to mention cooling fans, not to mention quite significant
heat generation.

we're not _quite_ at the stage of being able to take garden-variety
processors and build-your-own-blade-server using PCI _just_ yet. :-)

cheers,
lincoln.
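
[Editor's sketch, not part of the original mail.] The power argument above is a budget sum: add up every chip on the card, divide by the efficiency of the on-card voltage converters, and compare with what the slot may supply. The Python below does that arithmetic. Only the 9 W CPU figure comes from this thread; the 25 W slot limit is the editor's assumption for a conventional PCI add-in card, and the chipset, memory, MAC/PHY and efficiency numbers are purely hypothetical placeholders, not measurements.

```python
from typing import Iterable

PCI_CARD_BUDGET_W = 25.0  # assumption: upper limit for a conventional PCI card


def required_input_watts(loads_w: Iterable[float], efficiency: float) -> float:
    """Power drawn from the slot to deliver loads_w through lossy converters."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return sum(loads_w) / efficiency


def fits_budget(loads_w: Iterable[float], efficiency: float,
                budget_w: float = PCI_CARD_BUDGET_W) -> bool:
    """True if the card's total draw stays within the slot budget."""
    return required_input_watts(loads_w, efficiency) <= budget_w


if __name__ == "__main__":
    loads = {
        "cpu_tdp": 9.0,   # figure quoted in the thread for the Geode NX
        "chipset": 5.0,   # hypothetical
        "memory": 4.0,    # hypothetical
        "mac_phy": 3.0,   # hypothetical
    }
    for eff in (0.85, 0.80):  # hypothetical converter efficiencies
        draw = required_input_watts(loads.values(), eff)
        print(f"efficiency {eff:.2f}: {draw:.1f} W, fits: "
              f"{fits_budget(loads.values(), eff)}")
```

With these placeholder values the card lands just under the limit at 85% converter efficiency and over it at 80%, which is the shape of the "very close to the limit, if not over it" point; real part numbers would be needed to settle it.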

* Re: The ultimate TOE design
From: Jörn Engel @ 2004-09-17 13:38 UTC (permalink / raw)
To: Lincoln Dale; +Cc: Alan Cox, Andi Kleen, Jeff Garzik, David S. Miller, paul, netdev, leonid.grossman, Linux Kernel Mailing List

On Fri, 17 September 2004 08:37:17 +1000, Lincoln Dale wrote:
>
> sure -- ok -- that gets you the main processor.
> now add to that a Northbridge (perhaps AMD doesn't need that but i'm sure it
> still does), Southbridge, DDR-SDRAM, ancillary chips for doing MAC, PHY,
> ...
>
> couple that with the voltage of PCI where you're likely to need
> step-up/step-down circuits (which aren't 100% efficient themselves), you're
> still going to get very close to the limit, if not over it.
>
> ... and after all that, the Geode is really designed to be an embedded
> processor.
> Jeff was implying using garden-variety processors which seem to have large
> heatsinks, not to mention cooling fans, not to mention quite significant
> heat generation.
>
> we're not _quite_ at the stage of being able to take garden-variety
> processors and build-your-own-blade-server using PCI _just_ yet. :-)

FWIW, I've already been working with complete systems that suck their
power from PCI. They do exist, just not in the grocery store next
door.

Jörn

--
The exciting thing about writing is creating an order where none
existed before.
-- Doris Lessing

* Re: The ultimate TOE design
From: Jeff Garzik @ 2004-09-15 22:31 UTC (permalink / raw)
To: David S. Miller; +Cc: alan, paul, netdev, leonid.grossman, linux-kernel

David S. Miller wrote:
> Plus we have things like TSO too but that doesn't require a full Linux
> instance to realize on a networking port.
> Simple silicon implements this already.
> I don't see how that differs from your "big MTU" ideas.

WRT MTU: if the card is a buffering endpoint, rather than a passthrough,
the card deals with Path MTU and fragmentation, leaving the card<->host
MTU at 64K, getting nice big fat frames.

Jeff
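
[Editor's sketch, not part of the original mail.] To put a number on "nice big fat frames": the Python below counts how many IP packets the host has to push across the card<->host link for a given amount of data at two MTUs. It assumes 40 bytes of IPv4 plus TCP header per packet with no options, and uses 65535 as the "64K" MTU because that is the largest IPv4 total length. The function name and the 1 MiB transfer size are illustrative choices.

```python
HEADER_BYTES = 40  # assumption: 20-byte IPv4 + 20-byte TCP header, no options


def packets_needed(total_bytes: int, mtu: int,
                   header_bytes: int = HEADER_BYTES) -> int:
    """Number of packets required to carry total_bytes of payload."""
    payload_per_packet = mtu - header_bytes
    if payload_per_packet <= 0:
        raise ValueError("mtu must be larger than the header size")
    # Integer ceiling division.
    return (total_bytes + payload_per_packet - 1) // payload_per_packet


if __name__ == "__main__":
    one_mib = 1024 * 1024
    for mtu in (1500, 65535):
        print(f"MTU {mtu}: {packets_needed(one_mib, mtu)} packets per MiB")
```

Per-packet costs on the host (interrupts, descriptor handling, stack traversal) scale with that packet count, which is why a large host-side MTU helps even though the card still emits normal-sized frames on the wire.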

* Re: The ultimate TOE design
From: Michael Richardson @ 2004-09-15 21:15 UTC (permalink / raw)
To: David S. Miller; +Cc: Jeff Garzik, alan, paul, netdev, leonid.grossman, linux-kernel

-----BEGIN PGP SIGNED MESSAGE-----

>>>>> "David" == David S Miller <davem@davemloft.net> writes:
    >> The point was more to show people who are doing TOE _anyway_ to a decent
    >> design.

    David> We shouldn't be forced to refine people's non-sensible ideas which
    David> we'll not support anyways.

    David> If TOE is supported on Windows only, I happily welcome that.

Ha. Too hard to do :-)

The TOEs and L7 content switches that I know of are supported...
UNDER LINUX ONLY

The one that I'm most familiar with (Seaway's SW5000/NCA2000) provides
a new socket family to the host, which corresponds to streams that
terminate on the NCA2000. The host can request things like having two
TCP streams be cross-connected, even adding/subtracting SSL along the
way.

This code does not interact with the Linux IP stack at all --- so it
isn't exactly a TOE. You have to, at a minimum recompile applications.

- --
] "Elmo went to the wrong fundraiser" - The Simpson | firewalls [
] Michael Richardson, Xelerance Corporation, Ottawa, ON |net architect[
] mcr@xelerance.com http://www.sandelman.ottawa.on.ca/mcr/ |device driver[
] panic("Just another Debian GNU/Linux using, kernel hacking, security guy"); [
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)
Comment: Finger me for keys

iQCVAwUBQUiw3YqHRg3pndX9AQHrwQQAoK2C4btD6vk/UZ1Bv7zTgtbw/EvZuU2F
ZqPDiYfHMIsfsCYBWqLrjU2oxkkO+RgH3NOoNTJQuuVFjLlDw2pPHgH9DXaYdZy8
3To0LGdmIZR4u+mMx2WFRyYjuDM1iQ3ZbAskN5JzW3Jc77SbrJZaap1fQua5U3qg
gfNQ21OPkSI=
=+JBc
-----END PGP SIGNATURE-----
* Re: The ultimate TOE design 2004-09-15 19:14 ` Alan Cox 2004-09-15 20:41 ` Jeff Garzik @ 2004-09-15 20:53 ` David S. Miller 2004-09-16 1:05 ` Andrea Arcangeli 2004-09-15 21:10 ` David Lang 2004-09-15 23:05 ` Paul Jakma 3 siblings, 1 reply; 69+ messages in thread From: David S. Miller @ 2004-09-15 20:53 UTC (permalink / raw) To: Alan Cox; +Cc: paul, netdev, leonid.grossman, linux-kernel On Wed, 15 Sep 2004 20:14:22 +0100 Alan Cox <alan@lxorguk.ukuu.org.uk> wrote: > On Mer, 2004-09-15 at 21:04, Paul Jakma wrote: > > The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI > > card running Linux. Or is that what you were referring to with > > "<cards exist> but they are all fairly expensive."? > > Last time I checked 2Ghz accelerators for intel and AMD were quite cheap > and also had the advantage they ran user mode code when idle from > network processing. ROFL, and this is my position on this topic as well. There are absolutely no justified economics in these TOE engines. By the time you deploy them, the cpus and memory catch up and what's more those are general purpose and not just for networking as David Stevens and others have said. TOE is just junk, and we'll reject any attempt to put that garbage into the kernel. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-15 20:53 ` David S. Miller @ 2004-09-16 1:05 ` Andrea Arcangeli 0 siblings, 0 replies; 69+ messages in thread From: Andrea Arcangeli @ 2004-09-16 1:05 UTC (permalink / raw) To: David S. Miller; +Cc: Alan Cox, paul, netdev, leonid.grossman, linux-kernel On Wed, Sep 15, 2004 at 01:53:08PM -0700, David S. Miller wrote: > There are absolutely no justified economics in these > TOE engines. By the time you deploy them, the cpus > and memory catch up and what's more those are general > purpose and not just for networking as David Stevens > and others have said. I'm not sure if economics are the worst part of what is being shipped, to me the worst part is security, I'd never trust myself such a non-open-source TCP stack for something critical even if it was going to be much cheaper and performant. Even my PDA is using the linux tcp stack, and my cell phone only speaks UDP with the wap server anyways. TCP segment offload OTOH doesn't involve much "intelligence" in the NIC and it's very reasonable to trust it especially because all the incoming packets (the real potential offenders) are still processed by the linux tcp stack. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-15 19:14 ` Alan Cox 2004-09-15 20:41 ` Jeff Garzik 2004-09-15 20:53 ` David S. Miller @ 2004-09-15 21:10 ` David Lang 2004-09-15 23:05 ` Paul Jakma 3 siblings, 0 replies; 69+ messages in thread From: David Lang @ 2004-09-15 21:10 UTC (permalink / raw) To: Alan Cox; +Cc: Paul Jakma, Netdev, leonid.grossman, Linux Kernel Mailing List On Wed, 15 Sep 2004, Alan Cox wrote: > On Mer, 2004-09-15 at 21:04, Paul Jakma wrote: >> The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI >> card running Linux. Or is that what you were referring to with >> "<cards exist> but they are all fairly expensive."? > > Last time I checked 2Ghz accelerators for intel and AMD were quite cheap > and also had the advantage they ran user mode code when idle from > network processing. That depends on how many of these accelerators you already have in the system. If you have 4 of them and they are heavily used, so that you want to offload them, it definitely isn't cheap to add a 5th (you usually have to go up to 8 or so, and the difference between 4 and 8 is frequently 2x-4x the cost of the 4-processor box). Now, if you start with a single-CPU system, then yes, adding a second one is cheap, but these are usually not the people who really need TOE (they may think that they do, but that's a different story :-) David Lang -- There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies. -- C.A.R. Hoare ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-15 19:14 ` Alan Cox ` (2 preceding siblings ...) 2004-09-15 21:10 ` David Lang @ 2004-09-15 23:05 ` Paul Jakma 3 siblings, 0 replies; 69+ messages in thread From: Paul Jakma @ 2004-09-15 23:05 UTC (permalink / raw) To: Alan Cox; +Cc: Netdev, leonid.grossman, Linux Kernel Mailing List On Wed, 15 Sep 2004, Alan Cox wrote: > Last time I checked 2Ghz accelerators for intel and AMD were quite > cheap and also had the advantage they ran user mode code when idle > from network processing. Indeed. Unfortunately though, my vague understanding is, the interesting bits on the IXP, the microengines, are integrated with the XScale ASIC. I agree it's silly to stick a general purpose CPU in there, but you get it for "free" anyway. regards, -- Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A Fortune: War is an equal opportunity destroyer. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-15 20:04 ` Paul Jakma 2004-09-15 19:14 ` Alan Cox @ 2004-09-15 20:26 ` Neil Horman 2004-09-15 21:03 ` Wes Felter 2004-09-16 5:51 ` Matt Porter 2004-09-15 21:36 ` Deepak Saxena 2004-09-15 21:59 ` Tony Lee 3 siblings, 2 replies; 69+ messages in thread From: Neil Horman @ 2004-09-15 20:26 UTC (permalink / raw) To: Paul Jakma; +Cc: Netdev, leonid.grossman, Linux Kernel Paul Jakma wrote: > On Wed, 15 Sep 2004, Jeff Garzik wrote: > >> Put simply, the "ultimate TOE card" would be a card with network >> ports, a generic CPU (arm, mips, whatever.), some RAM, and some >> flash. This card's "firmware" is the Linux kernel, configured to run >> as a _totally indepenent network node_, with IP address(es) all its own. >> >> Then, your host system OS will communicate with the Linux kernel >> running on the card across the PCI bus, using IP packets (64K fixed MTU). > > >> My dream is that some vendor will come along and implement such a >> design, and sell it in enough volume that it's US$100 or less. There >> are a few cards on the market already where implementing this design >> _may_ be possible, but they are all fairly expensive. > > > The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI card > running Linux. Or is that what you were referring to with "<cards exist> > but they are all fairly expensive."? > >> Jeff > > > regards, IBM's PowerNP chip was also very simmilar (a powerpc core with lots of hardware assists for DMA and packet inspection in the extended register area). Don't know if they still sell it, but at one time I had heard they had booted linux on it. Neil -- /*************************************************** *Neil Horman *Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-15 20:26 ` Neil Horman @ 2004-09-15 21:03 ` Wes Felter 2004-09-15 21:15 ` Jeff Garzik ` (2 more replies) 2004-09-16 5:51 ` Matt Porter 1 sibling, 3 replies; 69+ messages in thread From: Wes Felter @ 2004-09-15 21:03 UTC (permalink / raw) To: linux-kernel; +Cc: netdev Neil Horman wrote: > Paul Jakma wrote: > >> On Wed, 15 Sep 2004, Jeff Garzik wrote: >> >>> Put simply, the "ultimate TOE card" would be a card with network >>> ports, a generic CPU (arm, mips, whatever.), some RAM, and some >>> flash. This card's "firmware" is the Linux kernel, configured to run >>> as a _totally indepenent network node_, with IP address(es) all its own. >>> >>> Then, your host system OS will communicate with the Linux kernel >>> running on the card across the PCI bus, using IP packets (64K fixed >>> MTU). >> The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI >> card running Linux. Or is that what you were referring to with "<cards >> exist> but they are all fairly expensive."? > IBM's PowerNP chip was also very simmilar (a powerpc core with lots of > hardware assists for DMA and packet inspection in the extended register > area). Don't know if they still sell it, but at one time I had heard > they had booted linux on it. An IXP or PowerNP wouldn't work for Jeff's idea. The IXP's XScale core and PowerNP's PowerPC core are way too slow to do any significant processing; they are intended for control tasks like updating the routing tables. All the work in the IXP or PowerNP is done by the microengines, which have weird, non-Linux-compatible architectures. To do 10 Gbps Ethernet with Jeff's approach, wouldn't you need a 5-10 GHz processor on the card? Sounds expensive. A 440GX or BCM1250 on a cheap PCI card would be fun to play with, though. Wes Felter - wesley@felter.org - http://felter.org/wesley/ ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-15 21:03 ` Wes Felter @ 2004-09-15 21:15 ` Jeff Garzik 2004-09-15 21:35 ` Wes Felter 2004-09-15 21:25 ` Imran Badr 2004-09-16 11:37 ` Neil Horman 2 siblings, 1 reply; 69+ messages in thread From: Jeff Garzik @ 2004-09-15 21:15 UTC (permalink / raw) To: Wes Felter; +Cc: netdev, linux-kernel On Wed, Sep 15, 2004 at 04:03:57PM -0500, Wes Felter wrote: > To do 10 Gbps Ethernet with Jeff's approach, wouldn't you need a 5-10 > GHz processor on the card? Sounds expensive. Do you need a 5-10 Ghz Intel server to handle 10 Gbps ethernet? Jeff ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-15 21:15 ` Jeff Garzik @ 2004-09-15 21:35 ` Wes Felter 2004-09-15 21:42 ` Jeff Garzik 0 siblings, 1 reply; 69+ messages in thread From: Wes Felter @ 2004-09-15 21:35 UTC (permalink / raw) To: netdev; +Cc: linux-kernel Jeff Garzik wrote: > On Wed, Sep 15, 2004 at 04:03:57PM -0500, Wes Felter wrote: > >>To do 10 Gbps Ethernet with Jeff's approach, wouldn't you need a 5-10 >>GHz processor on the card? Sounds expensive. > > > Do you need a 5-10 Ghz Intel server to handle 10 Gbps ethernet? Yes. (Or a 4-way ~2GHz server.) When the fastest general-purpose processors cannot handle the fastest Ethernet links, putting such a processor on a NIC won't help much. I think this is why people are attracted to TOE ASICs, even if that isn't the right solution. -- Wes Felter - wesley@felter.org - http://felter.org/wesley/ ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-15 21:35 ` Wes Felter @ 2004-09-15 21:42 ` Jeff Garzik 0 siblings, 0 replies; 69+ messages in thread From: Jeff Garzik @ 2004-09-15 21:42 UTC (permalink / raw) To: Wes Felter; +Cc: netdev, linux-kernel On Wed, Sep 15, 2004 at 04:35:31PM -0500, Wes Felter wrote: > Jeff Garzik wrote: > > >On Wed, Sep 15, 2004 at 04:03:57PM -0500, Wes Felter wrote: > > > >>To do 10 Gbps Ethernet with Jeff's approach, wouldn't you need a 5-10 > >>GHz processor on the card? Sounds expensive. > > > > > >Do you need a 5-10 Ghz Intel server to handle 10 Gbps ethernet? > > Yes. (Or a 4-way ~2GHz server.) It was a rhetorical question. No, you don't. Jeff ^ permalink raw reply [flat|nested] 69+ messages in thread
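The exchange above turns on how much CPU work each packet costs at 10 Gbps. A rough way to frame it is a per-packet cycle budget. The sketch below is an illustration only, not a measurement and not anyone's claim from the thread: the helper names are hypothetical, the 2 GHz clock and the two frame sizes are assumptions, and Ethernet framing overhead is ignored. It does not settle who is right, since that depends on how many cycles the stack actually spends per packet.

```python
# Back-of-the-envelope sketch; helper names and all input figures are assumptions.

def packets_per_second(link_bps: float, frame_bytes: int) -> float:
    """Packets per second needed to fill the link (framing overhead ignored)."""
    return link_bps / (frame_bytes * 8)


def cycles_per_packet(cpu_hz: float, pps: float) -> float:
    """CPU cycles available for each packet on a single core at the given rate."""
    return cpu_hz / pps


LINK_BPS = 10e9   # 10 Gbps Ethernet
CPU_HZ = 2e9      # assumed 2 GHz core

for frame_bytes in (1500, 65535):  # standard Ethernet MTU vs. a ~64K frame
    pps = packets_per_second(LINK_BPS, frame_bytes)
    budget = cycles_per_packet(CPU_HZ, pps)
    print(f"{frame_bytes:>6} bytes: {pps:>10.0f} pkt/s, {budget:>10.0f} cycles/pkt")
```

The point of the comparison is that the budget scales with frame size, which is why large card<->host frames and segmentation offload keep coming up in this thread.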
* Re: The ultimate TOE design 2004-09-15 21:03 ` Wes Felter 2004-09-15 21:15 ` Jeff Garzik @ 2004-09-15 21:25 ` Imran Badr 2004-09-16 11:37 ` Neil Horman 2 siblings, 0 replies; 69+ messages in thread From: Imran Badr @ 2004-09-15 21:25 UTC (permalink / raw) To: Wes Felter, linux-kernel; +Cc: netdev Please see: Cavium Networks Introduces OCTEON(TM) Family of Integrated Network Services Processors With up to 16 MIPS64(R)-Based Cores for Internet Services, Content and Security Processing" http://www.linuxelectrons.com/article.php?story=20040913082030668&mode=print -----Original Message----- From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-owner@vger.kernel.org]On Behalf Of Wes Felter Sent: Wednesday, September 15, 2004 2:04 PM To: linux-kernel@vger.kernel.org Cc: netdev@oss.sgi.com Subject: [SPAM] Re: The ultimate TOE design Neil Horman wrote: > Paul Jakma wrote: > >> On Wed, 15 Sep 2004, Jeff Garzik wrote: >> >>> Put simply, the "ultimate TOE card" would be a card with network >>> ports, a generic CPU (arm, mips, whatever.), some RAM, and some >>> flash. This card's "firmware" is the Linux kernel, configured to run >>> as a _totally indepenent network node_, with IP address(es) all its own. >>> >>> Then, your host system OS will communicate with the Linux kernel >>> running on the card across the PCI bus, using IP packets (64K fixed >>> MTU). >> The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI >> card running Linux. Or is that what you were referring to with "<cards >> exist> but they are all fairly expensive."? > IBM's PowerNP chip was also very simmilar (a powerpc core with lots of > hardware assists for DMA and packet inspection in the extended register > area). Don't know if they still sell it, but at one time I had heard > they had booted linux on it. An IXP or PowerNP wouldn't work for Jeff's idea. 
The IXP's XScale core and PowerNP's PowerPC core are way too slow to do any significant processing; they are intended for control tasks like updating the routing tables. All the work in the IXP or PowerNP is done by the microengines, which have weird, non-Linux-compatible architectures. To do 10 Gbps Ethernet with Jeff's approach, wouldn't you need a 5-10 GHz processor on the card? Sounds expensive. A 440GX or BCM1250 on a cheap PCI card would be fun to play with, though. Wes Felter - wesley@felter.org - http://felter.org/wesley/ ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-15 21:03 ` Wes Felter 2004-09-15 21:15 ` Jeff Garzik 2004-09-15 21:25 ` Imran Badr @ 2004-09-16 11:37 ` Neil Horman 2 siblings, 0 replies; 69+ messages in thread From: Neil Horman @ 2004-09-16 11:37 UTC (permalink / raw) To: Wes Felter; +Cc: linux-kernel, netdev Wes Felter wrote: > Neil Horman wrote: > >> Paul Jakma wrote: >> >>> On Wed, 15 Sep 2004, Jeff Garzik wrote: >>> >>>> Put simply, the "ultimate TOE card" would be a card with network >>>> ports, a generic CPU (arm, mips, whatever.), some RAM, and some >>>> flash. This card's "firmware" is the Linux kernel, configured to >>>> run as a _totally indepenent network node_, with IP address(es) all >>>> its own. >>>> >>>> Then, your host system OS will communicate with the Linux kernel >>>> running on the card across the PCI bus, using IP packets (64K fixed >>>> MTU). > > >>> The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI >>> card running Linux. Or is that what you were referring to with >>> "<cards exist> but they are all fairly expensive."? > > >> IBM's PowerNP chip was also very simmilar (a powerpc core with lots of >> hardware assists for DMA and packet inspection in the extended >> register area). Don't know if they still sell it, but at one time I >> had heard they had booted linux on it. > > > An IXP or PowerNP wouldn't work for Jeff's idea. The IXP's XScale core > and PowerNP's PowerPC core are way too slow to do any significant > processing; they are intended for control tasks like updating the > routing tables. All the work in the IXP or PowerNP is done by the > microengines, which have weird, non-Linux-compatible architectures. > I didn't say the assist hardware wouldn't need an extra driver. Its not 100% free, as Jeff proposes, but the CPU portion of these designs is _sufficient_ to run linux, and a driver can be written to drive the remainder of these chips. 
It's the combination that network device manufacturers design to today: a specialized chip to do L3/L2 forwarding at line rate over a large number of ports, and just enough general purpose CPU to manage the user interface, the forwarding hardware, and any overflow forwarding that the forwarding hardware can't deal with quickly. > To do 10 Gbps Ethernet with Jeff's approach, wouldn't you need a 5-10 > GHz processor on the card? Sounds expensive. > To handle port densities that are competing in the market today? Yes, which as I mentioned earlier would price designs like this out of the market. Jeff's idea is a nice one, but it doesn't really fit well with the hardware that networking equipment manufacturers are building today. Take a look at Broadcom's StrataSwitch/StrataXGS lines, or SwitchCore's Xpeedium processors. These are the sorts of things we have to work with. They provide network stack offload in competitive port densities, but they aren't also general purpose processors. They need a driver to massage their behavior into something more Linux-friendly. If we could develop an infrastructure that made these chips easy to integrate into a platform running Linux, Linux could quickly come to dominate a large portion of the network device space. Neil > Wes Felter - wesley@felter.org - http://felter.org/wesley/ > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- /*************************************************** *Neil Horman *Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-15 20:26 ` Neil Horman 2004-09-15 21:03 ` Wes Felter @ 2004-09-16 5:51 ` Matt Porter 1 sibling, 0 replies; 69+ messages in thread From: Matt Porter @ 2004-09-16 5:51 UTC (permalink / raw) To: Neil Horman; +Cc: Paul Jakma, Netdev, leonid.grossman, Linux Kernel On Wed, Sep 15, 2004 at 04:26:09PM -0400, Neil Horman wrote: > IBM's PowerNP chip was also very simmilar (a powerpc core with lots of > hardware assists for DMA and packet inspection in the extended register > area). Don't know if they still sell it, but at one time I had heard > they had booted linux on it. Well, yes, PowerNP support has been in the kernel for years and embedded Linux distros like Mvista support them. It's no longer an IBM chip, though. AMCC purchased the PPC4xx network processors (PowerNP) from IBM and later purchased the entire standard SoC PPC4xx product line from IBM. That is, except for the PPC4xx STB chips like are found in the Hauppage MediaMVP, IBM retained those. AMCC pretty much owns all the PPC4xx line and PowerNP 405H/L are still available. -Matt ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-15 20:04 ` Paul Jakma 2004-09-15 19:14 ` Alan Cox 2004-09-15 20:26 ` Neil Horman @ 2004-09-15 21:36 ` Deepak Saxena 2004-09-15 23:03 ` Paul Jakma 2004-09-24 13:11 ` Lennert Buytenhek 2004-09-15 21:59 ` Tony Lee 3 siblings, 2 replies; 69+ messages in thread From: Deepak Saxena @ 2004-09-15 21:36 UTC (permalink / raw) To: Paul Jakma; +Cc: Netdev, leonid.grossman, Linux Kernel On Sep 15 2004, at 21:04, Paul Jakma was caught saying: > On Wed, 15 Sep 2004, Jeff Garzik wrote: > > >Put simply, the "ultimate TOE card" would be a card with network ports, a > >generic CPU (arm, mips, whatever.), some RAM, and some flash. This card's > >"firmware" is the Linux kernel, configured to run as a _totally indepenent > >network node_, with IP address(es) all its own. > > > >Then, your host system OS will communicate with the Linux kernel running > >on the card across the PCI bus, using IP packets (64K fixed MTU). > > >My dream is that some vendor will come along and implement such a > >design, and sell it in enough volume that it's US$100 or less. > >There are a few cards on the market already where implementing this > >design _may_ be possible, but they are all fairly expensive. > > The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI > card running Linux. Or is that what you were referring to with > "<cards exist> but they are all fairly expensive."? Unfortunately all the SW that lets one make use of the interesting features of the IXPs (microEngines, crypto, etc) is a pile of propietary code. ~Deepak -- Deepak Saxena - dsaxena at plexity dot net - http://www.plexity.net/ "Unlike me, many of you have accepted the situation of your imprisonment and will die here like rotten cabbages." - Number 6 ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-15 21:36 ` Deepak Saxena @ 2004-09-15 23:03 ` Paul Jakma 2004-09-24 13:11 ` Lennert Buytenhek 1 sibling, 0 replies; 69+ messages in thread From: Paul Jakma @ 2004-09-15 23:03 UTC (permalink / raw) To: Deepak Saxena; +Cc: Netdev, leonid.grossman, Linux Kernel On Wed, 15 Sep 2004, Deepak Saxena wrote: > Unfortunately all the SW that lets one make use of the interesting > features of the IXPs (microEngines, crypto, etc) is a pile of > propietary code. My vague understanding is that while Intel's microengine code is proprietary, they do provide the docs to the microengines to let you write your own, no? > ~Deepak regards, -- Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A Fortune: Better tried by twelve than carried by six. -- Jeff Cooper ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-15 21:36 ` Deepak Saxena 2004-09-15 23:03 ` Paul Jakma @ 2004-09-24 13:11 ` Lennert Buytenhek 1 sibling, 0 replies; 69+ messages in thread From: Lennert Buytenhek @ 2004-09-24 13:11 UTC (permalink / raw) To: Deepak Saxena; +Cc: Paul Jakma, Netdev, leonid.grossman, Linux Kernel On Wed, Sep 15, 2004 at 02:36:00PM -0700, Deepak Saxena wrote: > > The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI > > card running Linux. Or is that what you were referring to with > > "<cards exist> but they are all fairly expensive."? > > Unfortunately all the SW that lets one make use of the interesting > features of the IXPs (microEngines, crypto, etc) is a pile of > propietary code. I'm working on open source microengine code for the IXP line, which should be available Real Soon Now(TM). --L ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-15 20:04 ` Paul Jakma ` (2 preceding siblings ...) 2004-09-15 21:36 ` Deepak Saxena @ 2004-09-15 21:59 ` Tony Lee 3 siblings, 0 replies; 69+ messages in thread From: Tony Lee @ 2004-09-15 21:59 UTC (permalink / raw) To: Paul Jakma; +Cc: Netdev, leonid.grossman, Linux Kernel On Wed, 15 Sep 2004 21:04:38 +0100 (IST), Paul Jakma <paul@clubi.ie> wrote: > On Wed, 15 Sep 2004, Jeff Garzik wrote: > > > Put simply, the "ultimate TOE card" would be a card with network ports, a > > generic CPU (arm, mips, whatever.), some RAM, and some flash. This card's > > "firmware" is the Linux kernel, configured to run as a _totally indepenent > > network node_, with IP address(es) all its own. > > > > Then, your host system OS will communicate with the Linux kernel running on > > the card across the PCI bus, using IP packets (64K fixed MTU). > > > My dream is that some vendor will come along and implement such a > > design, and sell it in enough volume that it's US$100 or less. > > There are a few cards on the market already where implementing this > > design _may_ be possible, but they are all fairly expensive. > > The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI > card running Linux. Or is that what you were referring to with > "<cards exist> but they are all fairly expensive."? > > > Jeff > > regards, > -- > Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A I believe the Broadcom 5704 (570x) chip/NIC comes with 2 MIPS CPUs (133 MHz), one each for the Tx and Rx data paths. The GigE NIC cost < $50 a couple of years ago. Too bad the software SDK for them is closed (quoted at $96K a couple of years ago). Otherwise, there could be some interesting applications for that extremely inexpensive chip/NIC. RDMA over TCP/UDP with that chip/NIC over GigE could be very interesting; so could an SSL proxy, an SSH tunnel, etc. With the right distributed processing design, it might even be possible to offload SMB or NFS to the "right" NIC. 
-Tony -- Having fun with Xilinx Virtex Pro II reconfigurable HW + integrated PPC + Linux ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-15 19:33 The ultimate TOE design Jeff Garzik 2004-09-15 20:04 ` Paul Jakma @ 2004-09-15 20:11 ` David Stevens 2004-09-15 20:16 ` David Schwartz ` (5 more replies) 2004-09-15 21:36 ` John Heffner 2004-09-16 9:03 ` Lars Marowsky-Bree 3 siblings, 6 replies; 69+ messages in thread From: David Stevens @ 2004-09-15 20:11 UTC (permalink / raw) To: Netdev; +Cc: leonid.grossman, Linux Kernel, netdev I've never understood why people are so interested in off-loading networking. Isn't that just a multi-processor system where you can't use any of the network processor cycles for anything else? And, of course, to be cheap, the network processor will be slower, and much harder to debug and update software. If the PCI bus is too slow, or MTU's too small, wouldn't it be better to fix those directly and use a fast host processor that can also do other things when not needed for networking? And why have memory on a NIC that can't be used by other things? Why don't we off-load filesystems to disks instead? Or a graphics card that implements X ? :-) I'd rather have shared system resources-- more flexible. :-) +-DLS ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-15 20:11 ` David Stevens @ 2004-09-15 20:16 ` David Schwartz 2004-09-15 20:25 ` Jeff Garzik ` (4 subsequent siblings) 5 siblings, 0 replies; 69+ messages in thread From: David Schwartz @ 2004-09-15 20:16 UTC (permalink / raw) To: David Stevens, Netdev, leonid.grossman, Linux Kernel David Stevens wrote: > I've never understood why people are so interested in off-loading > networking. Isn't that just a multi-processor system where you can't > use any of the network processor cycles for anything else? And, of > course, to be cheap, the network processor will be slower, and much > harder to debug and update software. The issue of debugging and maintaining the network processor software is certainly a legitimate one. However, nothing stops you from using the extra network processor cycles for other purposes. > If the PCI bus is too slow, or MTU's too small, wouldn't > it be better to fix those directly and use a fast host processor that can > also do other things when not needed for networking? And why have > memory on a NIC that can't be used by other things? This isn't an either-or. Processors are cheap. Memory is cheap. > Why don't we off-load filesystems to disks instead? Or a graphics > card that implements X ? :-) I'd rather have shared system resources-- > more flexible. :-) It's not one or the other. If, for example, your network card, graphics card, and hard drive controller all use a common instruction set and are all interconnected by a fast bus, code can be fairly mobile and run wherever it's the most efficient. Nothing stops the OS from offloading internal tasks to these processors as well. The only real stumbling blocks have been cost/volume considerations and the fact that the central processor(s) can be so fast, and the I/O so slow in comparison, that there's not much to gain. DS ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-15 20:11 ` David Stevens 2004-09-15 20:16 ` David Schwartz @ 2004-09-15 20:25 ` Jeff Garzik 2004-09-15 20:54 ` Neil Horman 2004-09-15 20:31 ` Bill Rugolsky Jr. ` (3 subsequent siblings) 5 siblings, 1 reply; 69+ messages in thread From: Jeff Garzik @ 2004-09-15 20:25 UTC (permalink / raw) To: David Stevens; +Cc: Netdev, leonid.grossman, Linux Kernel David Stevens wrote: > I've never understood why people are so interested in off-loading > networking. Isn't that just a multi-processor system where you can't > use any of the network processor cycles for anything else? And, of > course, to be cheap, the network processor will be slower, and much > harder to debug and update software. Well I do agree there is a strong don't-bother-with-TOE argument: Moore's law, the CPUs (manufactured in vast quantities) will usually However, there are companies that Just Gotta Do TOE... and I am not inclined to assist in any effort that compromises Linux's RFC compliance or security. Current TOE efforts seem to be of the "shove your data through this black box" variety, which is rather disheartening. Even non-TOE NICs these days have ever-more-complex firmware. tg3 is a MIPS-based engine, for example. > If the PCI bus is too slow, or MTU's too small, wouldn't > it be better to fix those directly and use a fast host processor that can > also do other things when not needed for networking? And why have > memory on a NIC that can't be used by other things? The PCI bus tends to be slower than DRAM<->CPU speed, and MTUs across the Internet will be small as long as ethernet enjoys continued success. Jeff ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-15 20:25 ` Jeff Garzik @ 2004-09-15 20:54 ` Neil Horman 0 siblings, 0 replies; 69+ messages in thread From: Neil Horman @ 2004-09-15 20:54 UTC (permalink / raw) To: Jeff Garzik; +Cc: David Stevens, Netdev, leonid.grossman, Linux Kernel Jeff Garzik wrote: > David Stevens wrote: > >> I've never understood why people are so interested in off-loading >> networking. Isn't that just a multi-processor system where you can't >> use any of the network processor cycles for anything else? And, of >> course, to be cheap, the network processor will be slower, and much >> harder to debug and update software. > > > Well I do agree there is a strong don't-bother-with-TOE argument: > Moore's law, the CPUs (manufactured in vast quantities) will usually > > > However, there are companies are Just Gotta Do TOE... and I am not > inclined to assist in any effort that compromises Linux's RFC compliancy > or security. Current TOE efforts seem to be of the "shove your data > through this black box" variety, which is rather disheartening. > > Even non-TOE NICs these days have ever-more-complex firmwares. tg3 is a > MIPS-based engine for example. > > >> If the PCI bus is too slow, or MTU's too small, wouldn't >> it be better to fix those directly and use a fast host processor that can >> also do other things when not needed for networking? And why have >> memory on a NIC that can't be used by other things? > > > PCI bus tends to be slower than DRAM<->CPU speed, and MTUs across the > Internet will be small as long as ethernet enjoys continued success. > > Jeff There is also something to be said for the embedded market here. offload chips are fairly usefull when building switches and routers. Dave M. in a thread just a few weeks ago provided some metrics for how much bandwidth a PCI-x bus and a some-odd-gigahertz processor could handle. 
It worked out that a PC with the right components could theoretically handle about 4 gigabit NICs running traffic full duplex at line rate. That's great, but it doesn't come close to what you need for a 24-port gigabit L3 switch, nor does it approach the correct price point. Most of these designs use a less expensive processor running at a slower speed, and an offload chip (that incorporates tx/rx logic and a switching fabric) to perform most of the routing and switching. For cost-conscious network equipment manufacturers, they are really the way to go. Unfortunately, many of them don't actually run as a co-processor, and so don't enable Jeff's idea very well (yet :)) Neil -- /*************************************************** *Neil Horman *Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ ^ permalink raw reply [flat|nested] 69+ messages in thread
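The gap described above can be put in numbers. This is only arithmetic on the figures in the message, under two assumptions: that the "4 ... nics" figure refers to four 1 Gbps NICs, and that "line rate, full duplex" means each port carries 1 Gbps in each direction at once. The function name is hypothetical.

```python
# Arithmetic sketch; the interpretation of the figures is an assumption (see above).

def aggregate_gbps(ports: int, port_gbps: float = 1.0, full_duplex: bool = True) -> float:
    """Total traffic the box must move when every port runs at line rate."""
    directions = 2 if full_duplex else 1
    return ports * port_gbps * directions


pc_capacity = aggregate_gbps(ports=4)    # what the PC was estimated to handle
switch_need = aggregate_gbps(ports=24)   # a 24-port gigabit L3 switch

print(f"PC estimate:  {pc_capacity:.0f} Gbps")
print(f"Switch needs: {switch_need:.0f} Gbps")
print(f"Shortfall factor: {switch_need / pc_capacity:.0f}x")
```

On these assumptions the general-purpose box falls short by a constant factor equal to the port ratio, before price is even considered, which is the argument for a dedicated forwarding chip in that market.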
* Re: The ultimate TOE design 2004-09-15 20:11 ` David Stevens 2004-09-15 20:16 ` David Schwartz 2004-09-15 20:25 ` Jeff Garzik @ 2004-09-15 20:31 ` Bill Rugolsky Jr. 2004-09-15 21:41 ` Joel Jaeggli ` (2 subsequent siblings) 5 siblings, 0 replies; 69+ messages in thread From: Bill Rugolsky Jr. @ 2004-09-15 20:31 UTC (permalink / raw) To: Netdev; +Cc: Jeff Garzik, David Stevens On Wed, Sep 15, 2004 at 02:11:04PM -0600, David Stevens wrote: > If the PCI bus is too slow, or MTU's too small, wouldn't > it be better to fix those directly and use a fast host processor that can > also do other things when not needed for networking? And why have > memory on a NIC that can't be used by other things? I tend to agree. Referring to the Opteron with its per-CPU memory controller, Robert Olsson just wrote in the "TX performance of Intel 82546" thread: This is a little breakthrough as we for the first time see some aggregated performance with packet forwarding and got something in return for all multiprocessor efforts. IMO this is much more important then the last percent of performance of pps numbers. In 2005, we'll have commodity dual-core packages, making a four-core (dual-CPU) system available at an attractive price point. The number will rise dramatically after that. I don't really think CPU cycles are the problem. A useful reason I can see for "offloading" is isolation of concerns, e.g., locking, real-time latencies, security, etc. But then, why not run something like the Xen2 virtual machine environment? Regards, Bill Rugolsky ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design
2004-09-15 20:11 ` David Stevens
` (2 preceding siblings ...)
2004-09-15 20:31 ` Bill Rugolsky Jr.
@ 2004-09-15 21:41 ` Joel Jaeggli
2004-09-16 6:33 ` Valdis.Kletnieks
2004-09-17 6:46 ` Eric Mudama
5 siblings, 0 replies; 69+ messages in thread
From: Joel Jaeggli @ 2004-09-15 21:41 UTC (permalink / raw)
To: David Stevens; +Cc: Netdev, leonid.grossman, Linux Kernel

On Wed, 15 Sep 2004, David Stevens wrote:

> I've never understood why people are so interested in off-loading
> networking. Isn't that just a multi-processor system where you can't
> use any of the network processor cycles for anything else? And, of
> course, to be cheap, the network processor will be slower, and much
> harder to debug and update software.

I'd like to amplify this. Adding more general-purpose CPUs to a machine strikes me as the right design choice, since they're simply more generally useful than dedicated CPUs. Look at Linux software RAID compared to the alternatives: frankly, I haven't seen a hardware controller that can touch it for performance given a similar number of disks and interfaces...

Currently graphics cards have substantially more memory bandwidth and pipelines than most general-purpose CPUs, but eventually that won't be the case. As it is, GPUs still represent the biggest chunk of independent computational power in a machine, and at least on the server side we don't even use them.

> If the PCI bus is too slow, or MTU's too small, wouldn't
> it be better to fix those directly and use a fast host processor that can
> also do other things when not needed for networking? And why have
> memory on a NIC that can't be used by other things?

Between HyperTransport tunnels, PCI-X, PCI Express and InfiniBand, the bottlenecks between the CPU core and the peripherals and memory are falling away at a rapid clip even as CPUs get faster. We're in a much better position to build balanced systems than we were 2 years ago.

> Why don't we off-load filesystems to disks instead?
Or a graphics > card that implements X ? :-) I'd rather have shared system resources-- > more flexible. :-) > > +-DLS > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > -- -------------------------------------------------------------------------- Joel Jaeggli Unix Consulting joelja@darkwing.uoregon.edu GPG Key Fingerprint: 5C6E 0104 BAF0 40B0 5BD3 C38B F000 35AB B67F 56B2 ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-15 20:11 ` David Stevens ` (3 preceding siblings ...) 2004-09-15 21:41 ` Joel Jaeggli @ 2004-09-16 6:33 ` Valdis.Kletnieks 2004-09-17 6:46 ` Eric Mudama 5 siblings, 0 replies; 69+ messages in thread From: Valdis.Kletnieks @ 2004-09-16 6:33 UTC (permalink / raw) To: David Stevens; +Cc: Netdev, leonid.grossman, Linux Kernel [-- Attachment #1: Type: text/plain, Size: 1384 bytes --] On Wed, 15 Sep 2004 14:11:04 MDT, David Stevens said: > Why don't we off-load filesystems to disks instead? Or a graphics > card that implements X ? :-) I'd rather have shared system resources-- > more flexible. :-) All depends where in the "cycle of reincarnation" we are at the moment. Way back in 1964, IBM released this monster called System/360 - and one of the things it did was push a *lot* of the disk processing off on the channel and disk controller using a count-key-data format rather than the fixed-block that Linux uses. So out on the platters, the disk format would say things like "This is a 400 byte record, the first 56 of which is a search key". A lot of stuff, both userspace and OS, used things like 'Search Key Equal' and letting the disk do all the searching. There was also this terminal beast called the 3270, which had a local controller for the terminals, and only interrupted the CPU on 'page send' type events. Back then, the ideas made sense - it wasn't at all unreasonable for a single S/360-65 to drive 3,000+ concurrent terminals in an airline reservation system or similar (and we're talking about a box that had literally only half the hamsters of a VAX780). But today, the 3270 isn't seen much anymore, and currently IBM emulates the CKD format on fixed-block systems for their z/Series boxes running z/OS or whatever MVS is called now.... [-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --] ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-15 20:11 ` David Stevens ` (4 preceding siblings ...) 2004-09-16 6:33 ` Valdis.Kletnieks @ 2004-09-17 6:46 ` Eric Mudama 2004-09-17 14:15 ` Alan Cox 2004-09-17 20:27 ` Valdis.Kletnieks 5 siblings, 2 replies; 69+ messages in thread From: Eric Mudama @ 2004-09-17 6:46 UTC (permalink / raw) To: David Stevens; +Cc: Netdev, leonid.grossman, Linux Kernel On Wed, 15 Sep 2004 14:11:04 -0600, David Stevens <dlstevens@us.ibm.com> wrote: > Why don't we off-load filesystems to disks instead? Disks have had file systems on them since close to the beginning... ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design
2004-09-17 6:46 ` Eric Mudama
@ 2004-09-17 14:15 ` Alan Cox
2004-09-17 20:27 ` Valdis.Kletnieks
1 sibling, 0 replies; 69+ messages in thread
From: Alan Cox @ 2004-09-17 14:15 UTC (permalink / raw)
To: Eric Mudama
Cc: David Stevens, Netdev, leonid.grossman, Linux Kernel Mailing List

On Gwe, 2004-09-17 at 07:46, Eric Mudama wrote:
> On Wed, 15 Sep 2004 14:11:04 -0600, David Stevens <dlstevens@us.ibm.com> wrote:
> > Why don't we off-load filesystems to disks instead?
>
> Disks have had file systems on them since close to the beginning...

This is essentially the path Lustre is taking, although it seems you don't want to have a "full" file system on the disk since you lose too much flexibility. Instead you want the ability to allocate by handle, giving hints about locality and use.

People have also tried full file system offload - Intel for example prototyped an I2O file system class, and Adaptec clearly were trying this out on aacraid development, judging from the public headers.

Alan

^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-17 6:46 ` Eric Mudama 2004-09-17 14:15 ` Alan Cox @ 2004-09-17 20:27 ` Valdis.Kletnieks 2004-09-17 20:36 ` David Lang 2004-09-22 23:25 ` Eric Mudama 1 sibling, 2 replies; 69+ messages in thread From: Valdis.Kletnieks @ 2004-09-17 20:27 UTC (permalink / raw) To: Eric Mudama; +Cc: David Stevens, Netdev, leonid.grossman, Linux Kernel [-- Attachment #1: Type: text/plain, Size: 840 bytes --] On Fri, 17 Sep 2004 00:46:59 MDT, Eric Mudama said: > On Wed, 15 Sep 2004 14:11:04 -0600, David Stevens <dlstevens@us.ibm.com> wrot e: > > Why don't we off-load filesystems to disks instead? > > Disks have had file systems on them since close to the beginning... No, he means "offload the processing of the filesystem to the disk itself". IBM's MVS systems basically did that - it used the disk's "Search Key" I/O opcodes to basically get the equivalent of doing namei() out on the disk itself (it did this for system catalog and PDS directory searches from the beginning, and added 'indexed VTOC' support in the mid-80s). So you'd send out a CCW (channel command word) stream that basically said "Find me the dataset USER3.ACCTING.TESTJOBS", and when the I/O completed, you'd have the DSCB (the moral equiv of an inode) ready to go. [-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --] ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design
2004-09-17 20:27 ` Valdis.Kletnieks
@ 2004-09-17 20:36 ` David Lang
2004-09-17 23:20 ` Tony Lee
2004-09-22 23:25 ` Eric Mudama
1 sibling, 1 reply; 69+ messages in thread
From: David Lang @ 2004-09-17 20:36 UTC (permalink / raw)
To: Valdis.Kletnieks
Cc: Eric Mudama, David Stevens, Netdev, leonid.grossman, Linux Kernel

Actually, the sector-based access that is made to modern drives is a very primitive filesystem. If you go back to the days of the MFM and RLL drives, you had the computer sending the raw bitstreams to the drives, but with SCSI and IDE this stopped, and you instead send a higher-level logical block to the drive and it deals with the details of getting it to and from the platter.

David Lang

On Fri, 17 Sep 2004 Valdis.Kletnieks@vt.edu wrote:

> Date: Fri, 17 Sep 2004 16:27:31 -0400
> From: Valdis.Kletnieks@vt.edu
> To: Eric Mudama <edmudama@gmail.com>
> Cc: David Stevens <dlstevens@us.ibm.com>, Netdev <netdev@oss.sgi.com>,
> leonid.grossman@s2io.com, Linux Kernel <linux-kernel@vger.kernel.org>
> Subject: Re: The ultimate TOE design
>
> On Fri, 17 Sep 2004 00:46:59 MDT, Eric Mudama said:
>> On Wed, 15 Sep 2004 14:11:04 -0600, David Stevens <dlstevens@us.ibm.com> wrote:
>>> Why don't we off-load filesystems to disks instead?
>>
>> Disks have had file systems on them since close to the beginning...
>
> No, he means "offload the processing of the filesystem to the disk itself".
>
> IBM's MVS systems basically did that - it used the disk's "Search Key" I/O
> opcodes to basically get the equivalent of doing namei() out on the disk itself
> (it did this for system catalog and PDS directory searches from the beginning,
> and added 'indexed VTOC' support in the mid-80s). So you'd send out a CCW
> (channel command word) stream that basically said "Find me the dataset
> USER3.ACCTING.TESTJOBS", and when the I/O completed, you'd have the DSCB (the
> moral equiv of an inode) ready to go.
> > -- There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies. -- C.A.R. Hoare ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-17 20:36 ` David Lang @ 2004-09-17 23:20 ` Tony Lee 2004-09-17 23:36 ` Leonid Grossman 0 siblings, 1 reply; 69+ messages in thread From: Tony Lee @ 2004-09-17 23:20 UTC (permalink / raw) To: David Lang Cc: valdis.kletnieks, Eric Mudama, David Stevens, Netdev, leonid.grossman, Linux Kernel On Fri, 17 Sep 2004 13:36:14 -0700 (PDT), David Lang <david.lang@digitalinsight.com> wrote: > actually the sector based access that is made to modern drives is a very > primitive filesystem. if you go back to the days of the MFM and RLL drives > you had the computer sending the raw bitstreams to the drives, but with > SCSI and IDE this stopped and you instead a higher level logical block to > the drive and it deals with the details of getting it to and from the > platter. > > David Lang > Maybe next evolutionary step is to put VFS layer directory on top of RDMA -> PCI Express/Latest serial IO, etc. Similar to access file thru NFS/SMB just on a faster standardize (RDMA) transport. On the networking front, instead of TOE, it should be services offload, similar to web load balancer. Offload service base on src/dest addr port proto (tcp/udp). NSO (Network service offload.) - kind of like Apache's reverse proxy with URL rewrite, but maybe for other applications. Question for Leonid of S2io.com: Your company has an interesting card. I think it must have some kind of embedded CPU. Care to tell us what kind of CPU are they? -- -Tony Having a lot of fun with Xilinx Virtex Pro II reconfigurable HW + ppc + Linux ^ permalink raw reply [flat|nested] 69+ messages in thread
* RE: The ultimate TOE design 2004-09-17 23:20 ` Tony Lee @ 2004-09-17 23:36 ` Leonid Grossman 0 siblings, 0 replies; 69+ messages in thread From: Leonid Grossman @ 2004-09-17 23:36 UTC (permalink / raw) To: 'Tony Lee', 'David Lang' Cc: valdis.kletnieks, 'Eric Mudama', 'David Stevens', 'Netdev', 'Linux Kernel' > -----Original Message----- > From: Tony Lee [mailto:tony.p.lee@gmail.com] > Sent: Friday, September 17, 2004 4:21 PM Skipped... > Question for Leonid of S2io.com: Your company has an > interesting card. > I think it must have some kind of embedded CPU. Care to tell > us what kind of CPU are they? Hi Tony, For 10GbE card, we designed our own ASIC - embedded CPUs don't cut it at 10GbE... Leonid > -- > -Tony > Having a lot of fun with Xilinx Virtex Pro II reconfigurable > HW + ppc + Linux > ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-17 20:27 ` Valdis.Kletnieks 2004-09-17 20:36 ` David Lang @ 2004-09-22 23:25 ` Eric Mudama 1 sibling, 0 replies; 69+ messages in thread From: Eric Mudama @ 2004-09-22 23:25 UTC (permalink / raw) To: valdis.kletnieks@vt.edu Cc: David Stevens, Netdev, leonid.grossman, Linux Kernel On Fri, 17 Sep 2004 16:27:31 -0400, valdis.kletnieks@vt.edu <valdis.kletnieks@vt.edu> wrote: > No, he means "offload the processing of the filesystem to the disk itself". I know what was meant. I'm not saying the filesystem on the drive is very advanced, but it's still a filesystem. Our "Record ID" is the LBA identifier, and all records are 1 block in size. We can handle defects, reallocations, and other issues, with some success. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design
2004-09-15 19:33 The ultimate TOE design Jeff Garzik
2004-09-15 20:04 ` Paul Jakma
2004-09-15 20:11 ` David Stevens
@ 2004-09-15 21:36 ` John Heffner
2004-09-15 21:46 ` David S. Miller
2004-09-15 23:16 ` James Morris
2004-09-16 9:03 ` Lars Marowsky-Bree
3 siblings, 2 replies; 69+ messages in thread
From: John Heffner @ 2004-09-15 21:36 UTC (permalink / raw)
To: Netdev; +Cc: leonid.grossman

My view on TOE is that it is brought up in response to the fact that when leading edge network technologies are brought out (GigE a few years ago, 10 GigE now), hosts can't keep up. Specifically, these people usually don't care about wasting their main CPU cycles, but rather the fact they can't get the full rate out of their expensive new NIC.

The reasons hosts can't keep up are (a) host bus speeds, or (not exclusive) (b) the CPU can't handle per-packet processing.

In the case of (a), TOE doesn't really help. In the case of (b), Jeff's proposed general-purpose offload doesn't help -- you really need a custom ASIC or maybe FPGA if you hope to beat the host CPU. Thus I think Jeff's idea is not likely to fly with this crowd of TOE proponents.

The other (much nicer) solution to case (b) is to just USE A BIGGER MTU. 1500 bytes is ridiculously small. Even with a 9k MTU, the benefits of TOE or TSO are nearly vanishing. Those who say they require high performance, but are unwilling to buy or produce networking gear with an MTU larger than 1500 bytes probably deserve what they get.

There are other possible justifications for TOE (with other counter-arguments) -- basically to reduce load on the main CPU -- but I think these are for the most part NOT what is driving the market (let me know if I'm being myopic here). This issue is also largely or completely solved by using a bigger MTU.

-John

^ permalink raw reply [flat|nested] 69+ messages in thread
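[Editorial sketch, not part of the original message.] The per-packet argument in case (b) can be made concrete with a little arithmetic: the number of packets the host must handle each second to fill a link falls in direct proportion to the MTU. The Python below is a minimal illustration only. The function name is hypothetical, the 10 Gb/s rate is chosen to match the 10 GigE case discussed in the thread, and Ethernet framing overhead (preamble, header, FCS, inter-frame gap) is deliberately ignored, so real figures would be somewhat lower.

```python
def packets_per_second(line_rate_bps: float, mtu_bytes: int) -> float:
    """Packets per second needed to fill a link of the given rate.

    Assumption: every packet is exactly mtu_bytes long and link-layer
    framing overhead is ignored.
    """
    return line_rate_bps / (mtu_bytes * 8)


if __name__ == "__main__":
    ten_gige = 10e9  # 10 Gb/s, assumed line rate for illustration
    for mtu in (1500, 9000):
        pps = packets_per_second(ten_gige, mtu)
        print(f"MTU {mtu:>5} bytes: about {pps:,.0f} packets/s")
```

Under these assumptions, moving from a 1500-byte to a 9000-byte MTU cuts the packet rate, and with it the per-packet work, by a factor of six.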
* Re: The ultimate TOE design 2004-09-15 21:36 ` John Heffner @ 2004-09-15 21:46 ` David S. Miller 2004-09-16 6:20 ` Andi Kleen 2004-09-15 23:16 ` James Morris 1 sibling, 1 reply; 69+ messages in thread From: David S. Miller @ 2004-09-15 21:46 UTC (permalink / raw) To: John Heffner; +Cc: netdev, leonid.grossman On Wed, 15 Sep 2004 17:36:18 -0400 (EDT) John Heffner <jheffner@psc.edu> wrote: > The other (much nicer) solution to case (b) is to just USE A BIGGER MTU. > 1500 bytes is ridiculously small. Even with a 9k MTU, the benefits of TOE > or TSO are nearly vanishing. Those who say they require high performance, > but are unwilling to buy or produce networking gear with an MTU larger > than 1500 bytes probably deserve what they get. TSO gives a kind of virtual 64K MTU, FWIW. But I do see your point. ^ permalink raw reply [flat|nested] 69+ messages in thread
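[Editorial sketch, not part of the original message.] One way to read the "virtual 64K MTU" remark: with TSO the stack hands the NIC one large buffer and the NIC cuts it into wire-sized segments, so the host pays per-packet cost once per large send rather than once per segment. The Python below only counts segments. The function name is hypothetical, and it assumes 20-byte IPv4 and 20-byte TCP headers with no options and a 64 KiB payload; actual TSO limits and header sizes vary by driver and connection.

```python
import math

IP_HEADER_BYTES = 20   # assumption: IPv4, no options
TCP_HEADER_BYTES = 20  # assumption: TCP, no options (no timestamps)


def wire_segments(send_bytes: int, mtu_bytes: int) -> int:
    """Number of on-the-wire TCP segments needed for one large send."""
    mss = mtu_bytes - IP_HEADER_BYTES - TCP_HEADER_BYTES
    return math.ceil(send_bytes / mss)


if __name__ == "__main__":
    for mtu in (1500, 9000):
        n = wire_segments(64 * 1024, mtu)
        print(f"MTU {mtu:>5}: one 64 KiB send becomes {n} wire segments")
```

With these assumptions a 64 KiB send becomes 45 segments at a 1500-byte MTU and 8 at 9000 bytes, which is the transmit-side work TSO moves off the host.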
* Re: The ultimate TOE design 2004-09-15 21:46 ` David S. Miller @ 2004-09-16 6:20 ` Andi Kleen 2004-09-16 13:10 ` Leonid Grossman 0 siblings, 1 reply; 69+ messages in thread From: Andi Kleen @ 2004-09-16 6:20 UTC (permalink / raw) To: David S. Miller; +Cc: John Heffner, netdev, leonid.grossman On Wed, Sep 15, 2004 at 02:46:24PM -0700, David S. Miller wrote: > On Wed, 15 Sep 2004 17:36:18 -0400 (EDT) > John Heffner <jheffner@psc.edu> wrote: > > > The other (much nicer) solution to case (b) is to just USE A BIGGER MTU. > > 1500 bytes is ridiculously small. Even with a 9k MTU, the benefits of TOE > > or TSO are nearly vanishing. Those who say they require high performance, > > but are unwilling to buy or produce networking gear with an MTU larger > > than 1500 bytes probably deserve what they get. > > TSO gives a kind of virtual 64K MTU, FWIW. But I do see your > point. We still need to solve the same problem for RX though. -Andi ^ permalink raw reply [flat|nested] 69+ messages in thread
* RE: The ultimate TOE design
2004-09-16 6:20 ` Andi Kleen
@ 2004-09-16 13:10 ` Leonid Grossman
2004-09-16 16:18 ` Nivedita Singhvi
0 siblings, 1 reply; 69+ messages in thread
From: Leonid Grossman @ 2004-09-16 13:10 UTC (permalink / raw)
To: 'Andi Kleen', 'David S. Miller'
Cc: 'John Heffner', netdev

> -----Original Message-----
> From: Andi Kleen [mailto:ak@suse.de]
> Sent: Wednesday, September 15, 2004 11:21 PM
> To: David S. Miller
> Cc: John Heffner; netdev@oss.sgi.com; leonid.grossman@s2io.com
> Subject: Re: The ultimate TOE design
>
> On Wed, Sep 15, 2004 at 02:46:24PM -0700, David S. Miller wrote:
> > On Wed, 15 Sep 2004 17:36:18 -0400 (EDT) John Heffner
> > <jheffner@psc.edu> wrote:
> >
> > > The other (much nicer) solution to case (b) is to just
> > > USE A BIGGER MTU.
> > > 1500 bytes is ridiculously small. Even with a 9k MTU,
> > > the benefits
> > > of TOE or TSO are nearly vanishing. Those who say they
> > > require high
> > > performance, but are unwilling to buy or produce networking gear
> > > with an MTU larger than 1500 bytes probably deserve what they get.
> >
> > TSO gives a kind of virtual 64K MTU, FWIW. But I do see your point.
>
> We still need to solve the same problem for RX though.
>
> -Andi

Ditto. We can dream about the benefits of huge MTUs, but the reality is that moving beyond 9k MTU is years away. Reasons - mainly infrastructure, plus an MTU above ~10k may lose checksum protection (granted, this depends on whether the errors are simple or complex, and also this may not be a showstopper for some people). Even 9k MTU is very far from being universally accepted, eight years after our Alteon spec went out :-).

TSO works great for the transmit side (even for 9k MTU, the impact is not insignificant), but the RX problem that Andi is talking about is a major issue for a lot of users.
I don't have hard data yet, but the expectations are that the effect of doing "RX side TSO" will be close to having 64k RX MTU - I'll publish some numbers once we bring up first Unix drivers with this feature and do some measurements. Leonid > ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-16 13:10 ` Leonid Grossman @ 2004-09-16 16:18 ` Nivedita Singhvi 2004-09-16 20:34 ` Leonid Grossman 0 siblings, 1 reply; 69+ messages in thread From: Nivedita Singhvi @ 2004-09-16 16:18 UTC (permalink / raw) To: Leonid Grossman Cc: 'Andi Kleen', 'David S. Miller', 'John Heffner', netdev Leonid Grossman wrote: > We can dream about benefits of huge MTUs, but the reality is that moving > beyond 9k MTU is years away. Reasons - mainly infrastructure, plus MTU above > ~10k may loose checksum protection (granted, this depends whether the errors > are simple or complex, and also this may not be a showstopper for some > people). > Even 9k MTU is very far from being universally accepted, eight years after > our Alteon spec went out :-). One other factor is TCP congestion control, and congestion windows we obey. Most of the time, you just can't send that much. thanks, Nivedita ^ permalink raw reply [flat|nested] 69+ messages in thread
* RE: The ultimate TOE design 2004-09-16 16:18 ` Nivedita Singhvi @ 2004-09-16 20:34 ` Leonid Grossman 2004-09-22 20:18 ` Nivedita Singhvi 0 siblings, 1 reply; 69+ messages in thread From: Leonid Grossman @ 2004-09-16 20:34 UTC (permalink / raw) To: 'Nivedita Singhvi' Cc: 'Andi Kleen', 'David S. Miller', 'John Heffner', netdev > -----Original Message----- > From: Nivedita Singhvi [mailto:niv@us.ibm.com] > Sent: Thursday, September 16, 2004 9:19 AM > To: Leonid Grossman > Cc: 'Andi Kleen'; 'David S. Miller'; 'John Heffner'; > netdev@oss.sgi.com > Subject: Re: The ultimate TOE design > > Leonid Grossman wrote: > > > We can dream about benefits of huge MTUs, but the reality is that > > moving beyond 9k MTU is years away. Reasons - mainly > infrastructure, > > plus MTU above ~10k may loose checksum protection (granted, this > > depends whether the errors are simple or complex, and also this may > > not be a showstopper for some people). > > Even 9k MTU is very far from being universally accepted, > eight years > > after our Alteon spec went out :-). > > One other factor is TCP congestion control, and congestion > windows we obey. Most of the time, you just can't send that much. It's a bit painful to setup, but in general with 9k jumbos and TSO we were able to get close to pci-x 133 limit - both in LAN and WAN tests. Leonid > > thanks, > Nivedita > > ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design
2004-09-16 20:34 ` Leonid Grossman
@ 2004-09-22 20:18 ` Nivedita Singhvi
2004-09-23 4:46 ` Leonid Grossman
0 siblings, 1 reply; 69+ messages in thread
From: Nivedita Singhvi @ 2004-09-22 20:18 UTC (permalink / raw)
To: Leonid Grossman
Cc: 'Andi Kleen', 'David S. Miller', 'John Heffner', netdev, linux-kernel

Leonid Grossman wrote:

>>From: Nivedita Singhvi [mailto:niv@us.ibm.com]
>>Sent: Thursday, September 16, 2004 9:19 AM
>>To: Leonid Grossman
>>Cc: 'Andi Kleen'; 'David S. Miller'; 'John Heffner';
>>netdev@oss.sgi.com
>>Subject: Re: The ultimate TOE design
>>
>>Leonid Grossman wrote:
>>
>>
>>>We can dream about benefits of huge MTUs, but the reality is that
>>>moving beyond 9k MTU is years away. Reasons - mainly infrastructure,
>>>plus MTU above ~10k may loose checksum protection (granted, this
>>>depends whether the errors are simple or complex, and also this may
>>>not be a showstopper for some people).
>>>Even 9k MTU is very far from being universally accepted,
>>>eight years after our Alteon spec went out :-).
>>
>>One other factor is TCP congestion control, and congestion
>>windows we obey. Most of the time, you just can't send that much.
>
>
> It's a bit painful to setup, but in general with 9k jumbos and TSO we were
> able to get close to pci-x 133 limit - both in LAN and WAN tests.
> Leonid

Cool, but a very specific environment, no? ;)

What concerns me about all this is that it seems such a host-centric design. Wouldn't it be nice if we had a little bit more of a network-centric worldview when designing network infrastructure?

It isn't just a matter of how hard we can push stuff out; it also matters how much the network can take. Blasting tens of gigs into the ether seems all very exciting, sexy and cool, but suited for dedicated links or network attached storage channels, not general-purpose networking on the Internet or intra-nets.
And if that is the case, we're talking about a much smaller market (but perhaps a more profitable one ;))... thanks, Nivedita ^ permalink raw reply [flat|nested] 69+ messages in thread
* RE: The ultimate TOE design 2004-09-22 20:18 ` Nivedita Singhvi @ 2004-09-23 4:46 ` Leonid Grossman 0 siblings, 0 replies; 69+ messages in thread From: Leonid Grossman @ 2004-09-23 4:46 UTC (permalink / raw) To: 'Nivedita Singhvi' Cc: 'Andi Kleen', 'David S. Miller', 'John Heffner', netdev, linux-kernel > > > > It's a bit painful to setup, but in general with 9k jumbos > and TSO we > > were able to get close to pci-x 133 limit - both in LAN and > WAN tests. > > Leonid > > Cool, but a very specific environment, no? ;) Define specific environment :-). We are running common tcp benchmarks like nttcp or iperf or Chariot or filesystem applications on a very generic white boxes, with generic OS/settings. > > What concerns me about all this is that it seems so very > host-centric design. Wouldn't it be nice if we had a little > bit more network-centric worldview when designing network > infrastructure? > > It isn't just a matter of how had we can push stuff out, it > also matters how much the network can take. > Blasting tens of gigs into the ether seems all very exciting > sexy and cool, but suited for dedicated links or network > attached storage channels, not general-purpose networking on > the Internet or intra-nets. This is somewhat different from IB or FC "miniature networks", some/most of 10GbE testing runs in existing datacenters or over existing long-haul links - see for example http://sravot.home.cern.ch/sravot/Networking/10GbE/LSR_041504.htm Cheers, Leonid > > And if that is the case, we're talking about a much smaller > market (but perhaps a more profitable one ;))... > > thanks, > Nivedita > > > ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-15 21:36 ` John Heffner 2004-09-15 21:46 ` David S. Miller @ 2004-09-15 23:16 ` James Morris 2004-09-15 23:37 ` Leonid Grossman 2004-09-15 23:52 ` John Heffner 1 sibling, 2 replies; 69+ messages in thread From: James Morris @ 2004-09-15 23:16 UTC (permalink / raw) To: John Heffner; +Cc: Netdev, leonid.grossman On Wed, 15 Sep 2004, John Heffner wrote: > The other (much nicer) solution to case (b) is to just USE A BIGGER MTU. > 1500 bytes is ridiculously small. Even with a 9k MTU, the benefits of TOE > or TSO are nearly vanishing. Do you have any figures on (large) MTU size vs performance on a current commidity system? - James -- James Morris <jmorris@redhat.com> ^ permalink raw reply [flat|nested] 69+ messages in thread
* RE: The ultimate TOE design 2004-09-15 23:16 ` James Morris @ 2004-09-15 23:37 ` Leonid Grossman 2004-09-15 23:52 ` John Heffner 1 sibling, 0 replies; 69+ messages in thread From: Leonid Grossman @ 2004-09-15 23:37 UTC (permalink / raw) To: 'James Morris', 'John Heffner'; +Cc: 'Netdev' > -----Original Message----- > From: James Morris [mailto:jmorris@redhat.com] > Sent: Wednesday, September 15, 2004 4:16 PM > To: John Heffner > Cc: Netdev; leonid.grossman@s2io.com > Subject: Re: The ultimate TOE design > > On Wed, 15 Sep 2004, John Heffner wrote: > > > The other (much nicer) solution to case (b) is to just USE > A BIGGER MTU. > > 1500 bytes is ridiculously small. Even with a 9k MTU, the > benefits of > > TOE or TSO are nearly vanishing. > > Do you have any figures on (large) MTU size vs performance on > a current commidity system? It's very system-dependent. Say on 2-way Xeon our card goes from ~2Gbps to ~6Gbps, on 64-bit systems the delta is obviously less. For 9k MTU, the delta goes down of course but it is still ~10% on 2.6 systems - say on 2-way Opterons we go from 7Gbps to 7.6Gbps. Leonid > > > - James > -- > James Morris > <jmorris@redhat.com> > > ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design
2004-09-15 23:16 ` James Morris
2004-09-15 23:37 ` Leonid Grossman
@ 2004-09-15 23:52 ` John Heffner
2004-09-16 1:43 ` James Morris
1 sibling, 1 reply; 69+ messages in thread
From: John Heffner @ 2004-09-15 23:52 UTC (permalink / raw)
To: James Morris; +Cc: Netdev, leonid.grossman

On Wed, 15 Sep 2004, James Morris wrote:

> On Wed, 15 Sep 2004, John Heffner wrote:
>
> > The other (much nicer) solution to case (b) is to just USE A BIGGER MTU.
> > 1500 bytes is ridiculously small. Even with a 9k MTU, the benefits of TOE
> > or TSO are nearly vanishing.
>
> Do you have any figures on (large) MTU size vs performance on a current
> commidity system?

What qualifies as large? I ran some measurements out to 9k w/GigE a few years ago. (Something like 100 byte increments.) I can try to find the data if anyone is interested. I may try to run a similar experiment on 10 GigE with modern hardware, and I think these cards can go out to 16k as well.

The basic idea is that the marginal benefit (in terms of CPU cycles) of increasing the MTU is proportional to log(MTU). In all experiments I have done or seen, the data agree with this, except for discontinuities due to PCI settings, page sizes and maybe a couple of other things.

There are some arguments that going to much larger MTUs could be of substantial benefit other than CPU cycles, but this is harder to quantify. See <http://www.psc.edu/~mathis/MTU/>.

-John

^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design 2004-09-15 23:52 ` John Heffner @ 2004-09-16 1:43 ` James Morris 0 siblings, 0 replies; 69+ messages in thread From: James Morris @ 2004-09-16 1:43 UTC (permalink / raw) To: John Heffner; +Cc: Netdev, leonid.grossman On Wed, 15 Sep 2004, John Heffner wrote: > > Do you have any figures on (large) MTU size vs performance on a current > > commidity system? > > What qualifies as large? I ran some measurements out to 9k w/GigE a few > years ago. (Something like 100 byte increments.) I can try to find the > data if anyone is interested. I may try to run a similar experiment on 10 > GigE with modern hardware, and I think these cards can go out to 16k as > well. Anything above 1500 bytes (up to 64k would be interesting). > There are some arguments that going to much larger MTUs could be of > substantial benefit other than CPU cycles, but this is harder to quantify. > See <http://www.psc.edu/~mathis/MTU/>. Thanks for the info. - James -- James Morris <jmorris@redhat.com> ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: The ultimate TOE design
2004-09-15 19:33 The ultimate TOE design Jeff Garzik
` (2 preceding siblings ...)
2004-09-15 21:36 ` John Heffner
@ 2004-09-16 9:03 ` Lars Marowsky-Bree
3 siblings, 0 replies; 69+ messages in thread
From: Lars Marowsky-Bree @ 2004-09-16 9:03 UTC (permalink / raw)
To: Netdev; +Cc: Linux Kernel

On 2004-09-15T15:33:47, Jeff Garzik <jgarzik@pobox.com> said:

> Then, your host system OS will communicate with the Linux kernel running
> on the card across the PCI bus, using IP packets (64K fixed MTU).
>
> This effectively:

Actually, given that there's almost no reason to offload TCP/IP processing for speed (better to spend the money on CPU / memory for the main system), I like the idea of this for security: Off-load the packet filtering to create an additional security barrier. (Different CPU architecture and all that.)

(With two cards, one could even use the conntrack fail-over internally. - A Linux-running NIC with builtin firewalling, sell to all the Windows weenies... ;)

With dedicated processors, maybe an IPsec accelerator would also be cool, but I'd think a crypto accelerator for the main system would again be saner here (unless, of course, the argument of the security domain isolation is applied again).

Admittedly, one can solve all these differently, but it still might be cool. ;-)

Sincerely,
Lars Marowsky-Brée <lmb@suse.de>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX AG - A Novell company

^ permalink raw reply [flat|nested] 69+ messages in thread
[parent not found: <1095328673.1063.130.camel@jzny.localdomain>]
* RE: The ultimate TOE design
[not found] <1095328673.1063.130.camel@jzny.localdomain>
@ 2004-09-16 14:57 ` Leonid Grossman
0 siblings, 0 replies; 69+ messages in thread
From: Leonid Grossman @ 2004-09-16 14:57 UTC (permalink / raw)
To: hadi
Cc: 'Jeff Garzik', 'David S. Miller', alan, paul, netdev, linux-kernel

> -----Original Message-----
> From: jamal [mailto:hadi@cyberus.ca]
> Sent: Thursday, September 16, 2004 2:58 AM
> To: Leonid Grossman
> Cc: 'Jeff Garzik'; 'David S. Miller'; alan@lxorguk.ukuu.org.uk;
> paul@clubi.ie; netdev@oss.sgi.com; linux-kernel@vger.kernel.org
> Subject: RE: The ultimate TOE design
>
> On Thu, 2004-09-16 at 01:25, Leonid Grossman wrote:
> >
> > > -----Original Message-----
> > > From: jamal [mailto:hadi@cyberus.ca]
> >
> > > On a serious note, I think that PCI-express (if it lives up to its
> > > expectation) will demolish dreams of a lot of these TOE investments.
> > > Our problem is NOT the CPU right now (80% idle processing 450Kpps
> > > forwarding). Bus and memory distance/latency are.
> >
> > In servers, both bottlenecks are there - if you look at the cost of
> > TCP and filesystem processing at 10GbE, CPU is a huge problem (and
> > will be for foreseeable future), even for fastest 64-bit systems.
>
> True, but with the bus contention being a non-issue you got
> more of that xeon being available for use (let's say i can use
> 50% more of its capacity then i can do more). IOW, it becomes
> a compute capacity problem mostly - one that you should in
> theory be able to throw more CPU at. SMT (the way power5 and
> some of the network processors do it[1]) should go a long way
> to address both additional compute and hardware threading to
> work around memory latencies. With PCI-express, compute power
> in mini-clustering in the form of AS
> (http://www.asi-sig.org/home) is being plotted as we speak.
> To summarize: The problem to solve in 24 months may be 100GigE.
>
> > I agree though that bus and memory are bigger issues, this is exactly
> > the reason for all these RDMA over Ethernet investments :-)
>
> And AS does a damn good job at specing all those RDMA
> requirements; my view is that intel is going to build them
> chips - so it can be done on a $5 board off the pacific rim.
> This takes most of the small players out of the market.
>
> > Anyways, did not mean to start an argument - with all the new CPU, bus
> > and HBA technologies coming to the market it will be another 18-24
> > months before we know what works and what doesn't...
>
> Agreed. Would you like to invest in something that will be obsoleted in
> 18-24 months though? OR even not obsoleted, but holds that
> uncertainty?
> I think that's the risk facing you when you are in the offload
> business.

Well.. Any business has risks, this one doesn't seem to be higher than
others :-)
I view 18-26 mo timeframe as a start of the offload mass-adoption, not the
end of it.

In our tests, the bus contention and the %cpu are mostly orthogonal
problems; PCI-X DDR and PCI-Express will help but only to a point.
(BTW this is all related to the higher end systems - 2-4 way and above,
running 10GbE NICs. Client is a different story, cpu is mostly "free"
there).

My sense is that (unlike on previous cycles) the "slow host, fast network"
scenario is here to stay for a long while, and will have to be addressed
one way or another - whether it is a full TOE+RDMA offload in a longer run,
or an improvement to "static" offloads. In server space, applications will
never be happy with less than 80% cpu.

Leonid

>
> Here are results for Hifn 7956 ref board on 2.6GHz P4 (HT)
> system, kernel 2.6.6 SMP as compared to a s/ware only setup
> on same machine.
> [Name of tester withheld to protect privacy].
>
> first column - algo, second - packet size, third - time in us
> spent by hw crypto, fourth - time in us spent by sw crypto:
>
> des        64:   28    3
> des       128:   29    6
> des       192:   33    9
> des       256:   33   12
> des       320:   37   15
> des       384:   38   18
> des       448:   41   21
> des       512:   42   23
> des       576:   45   26
> des       640:   46   29
> des       704:   49   33
> des       768:   50   35
> des       832:   53   38
> des       896:   54   41
> des       960:   57   44
> des      1024:   58   47
> des      1088:   61   50
> des      1152:   62   53
> des      1216:   66   56
> des      1280:   66   59
> des      1344:   70   62
> des      1408:   71   65
> des      1472:   74   68
> des3_ede   64:   28    6
> des3_ede  128:   30   13
> des3_ede  192:   34   20
> des3_ede  256:   43   26
> des3_ede  320:   38   33
> des3_ede  384:   48   40
> des3_ede  448:   44   45
> des3_ede  512:   54   53
> des3_ede  576:   50   60
> des3_ede  640:   59   67
> des3_ede  704:   55   74
> des3_ede  768:   66   78
> des3_ede  832:   61   85
> des3_ede  896:   72   94
> des3_ede  960:   67  100
> des3_ede 1024:   77  107
> des3_ede 1088:   73  114
> des3_ede 1152:   82  121
> des3_ede 1216:   79  127
> des3_ede 1280:   88  128
> des3_ede 1344:   84  135
> des3_ede 1408:   94  147
> des3_ede 1472:   90  153
> aes        64:   28    2
> aes       192:   33    6
> aes       320:   37   10
> aes       448:   46   15
> aes       576:   53   19
> aes       704:   53   23
> aes       832:   65   28
> aes       960:   66   32
> aes      1088:   71   37
> aes      1216:   80   41
> aes      1344:   83   45
> aes      1472:   92   50
>
> Moral of the data above: The 2.6GHz is already showing signs
> of obsoleting the hifn crypto offloader[2]. I think it took
> less than a year for it to happen.
>
> cheers,
> jamal
>
> [1] I also like the MIPS.com approach to SMT
>
> [2] There are actually issues with some of the crypto
> offloading in Linux; however this does serve as a good example.
>

^ permalink raw reply [flat|nested] 69+ messages in thread
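For anyone who wants to check the "moral" against the quoted Hifn numbers, here is a minimal sketch (not from the original posters) that looks for the packet size at which the hardware column drops below the software column. The timings are transcribed from the table in the message above; the names `first_hw_win` and `hw_wins_from` are made up for this illustration, and the script says nothing about how the original measurements were taken.

```python
# Timings transcribed from the quoted table:
# packet size in bytes -> (hw crypto time in us, sw crypto time in us).
DES = {
    64: (28, 3), 128: (29, 6), 192: (33, 9), 256: (33, 12),
    320: (37, 15), 384: (38, 18), 448: (41, 21), 512: (42, 23),
    576: (45, 26), 640: (46, 29), 704: (49, 33), 768: (50, 35),
    832: (53, 38), 896: (54, 41), 960: (57, 44), 1024: (58, 47),
    1088: (61, 50), 1152: (62, 53), 1216: (66, 56), 1280: (66, 59),
    1344: (70, 62), 1408: (71, 65), 1472: (74, 68),
}
DES3_EDE = {
    64: (28, 6), 128: (30, 13), 192: (34, 20), 256: (43, 26),
    320: (38, 33), 384: (48, 40), 448: (44, 45), 512: (54, 53),
    576: (50, 60), 640: (59, 67), 704: (55, 74), 768: (66, 78),
    832: (61, 85), 896: (72, 94), 960: (67, 100), 1024: (77, 107),
    1088: (73, 114), 1152: (82, 121), 1216: (79, 127), 1280: (88, 128),
    1344: (84, 135), 1408: (94, 147), 1472: (90, 153),
}
AES = {
    64: (28, 2), 192: (33, 6), 320: (37, 10), 448: (46, 15),
    576: (53, 19), 704: (53, 23), 832: (65, 28), 960: (66, 32),
    1088: (71, 37), 1216: (80, 41), 1344: (83, 45), 1472: (92, 50),
}


def first_hw_win(table):
    """Smallest measured size at which hw is strictly faster than sw, or None."""
    for size in sorted(table):
        hw_us, sw_us = table[size]
        if hw_us < sw_us:
            return size
    return None


def hw_wins_from(table):
    """Smallest measured size from which hw is faster at every larger size, or None."""
    result = None
    for size in sorted(table, reverse=True):
        hw_us, sw_us = table[size]
        if hw_us >= sw_us:
            break  # sw is at least as fast here, so the hw-only run ends
        result = size
    return result


if __name__ == "__main__":
    for name, table in (("des", DES), ("des3_ede", DES3_EDE), ("aes", AES)):
        print(name, first_hw_win(table), hw_wins_from(table))
```

On these figures the hardware never beats software for des or aes at any measured size up to 1472 bytes. For des3_ede it first wins at 448 bytes, loses again at 512, and wins consistently from 576 bytes upward.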
end of thread, other threads:[~2004-09-24 19:39 UTC | newest]
Thread overview: 69+ messages
2004-09-15 19:33 The ultimate TOE design Jeff Garzik
2004-09-15 20:04 ` Paul Jakma
2004-09-15 19:14 ` Alan Cox
2004-09-15 20:41 ` Jeff Garzik
2004-09-15 21:01 ` David S. Miller
2004-09-15 21:08 ` Jeff Garzik
2004-09-15 21:13 ` David S. Miller
2004-09-15 21:23 ` Jeff Garzik
2004-09-15 21:29 ` David S. Miller
2004-09-15 22:26 ` Jeff Garzik
2004-09-15 23:29 ` Leonid Grossman
2004-09-24 13:07 ` Lennert Buytenhek
2004-09-24 13:21 ` Leonid Grossman
2004-09-24 18:09 ` Lennert Buytenhek
2004-09-24 19:39 ` Joel Jaeggli
2004-09-16 0:57 ` jamal
2004-09-16 5:25 ` Leonid Grossman
2004-09-16 9:29 ` Lincoln Dale
2004-09-16 12:19 ` Alan Cox
2004-09-16 13:33 ` Andi Kleen
2004-09-16 12:57 ` Alan Cox
2004-09-16 22:37 ` Lincoln Dale
2004-09-17 13:38 ` Jörn Engel
2004-09-15 22:31 ` Jeff Garzik
2004-09-15 21:15 ` Michael Richardson
2004-09-15 20:53 ` David S. Miller
2004-09-16 1:05 ` Andrea Arcangeli
2004-09-15 21:10 ` David Lang
2004-09-15 23:05 ` Paul Jakma
2004-09-15 20:26 ` Neil Horman
2004-09-15 21:03 ` Wes Felter
2004-09-15 21:15 ` Jeff Garzik
2004-09-15 21:35 ` Wes Felter
2004-09-15 21:42 ` Jeff Garzik
2004-09-15 21:25 ` Imran Badr
2004-09-16 11:37 ` Neil Horman
2004-09-16 5:51 ` Matt Porter
2004-09-15 21:36 ` Deepak Saxena
2004-09-15 23:03 ` Paul Jakma
2004-09-24 13:11 ` Lennert Buytenhek
2004-09-15 21:59 ` Tony Lee
2004-09-15 20:11 ` David Stevens
2004-09-15 20:16 ` David Schwartz
2004-09-15 20:25 ` Jeff Garzik
2004-09-15 20:54 ` Neil Horman
2004-09-15 20:31 ` Bill Rugolsky Jr.
2004-09-15 21:41 ` Joel Jaeggli
2004-09-16 6:33 ` Valdis.Kletnieks
2004-09-17 6:46 ` Eric Mudama
2004-09-17 14:15 ` Alan Cox
2004-09-17 20:27 ` Valdis.Kletnieks
2004-09-17 20:36 ` David Lang
2004-09-17 23:20 ` Tony Lee
2004-09-17 23:36 ` Leonid Grossman
2004-09-22 23:25 ` Eric Mudama
2004-09-15 21:36 ` John Heffner
2004-09-15 21:46 ` David S. Miller
2004-09-16 6:20 ` Andi Kleen
2004-09-16 13:10 ` Leonid Grossman
2004-09-16 16:18 ` Nivedita Singhvi
2004-09-16 20:34 ` Leonid Grossman
2004-09-22 20:18 ` Nivedita Singhvi
2004-09-23 4:46 ` Leonid Grossman
2004-09-15 23:16 ` James Morris
2004-09-15 23:37 ` Leonid Grossman
2004-09-15 23:52 ` John Heffner
2004-09-16 1:43 ` James Morris
2004-09-16 9:03 ` Lars Marowsky-Bree
[not found] <1095328673.1063.130.camel@jzny.localdomain>
2004-09-16 14:57 ` Leonid Grossman