* RDMA will be reverted
  From: David Miller @ 2006-06-28  7:07 UTC
  To: rolandd; Cc: netdev, akpm

Roland, there is no way in the world we would have let support for
RDMA into the kernel tree had we seen and reviewed it on netdev. I've
discussed this with Andrew Morton, and we'd like you to please revert
all of the RDMA code from Linus's tree immediately.

Folks are well aware of how opposed the Linux networking developers
are to RDMA and TOE type schemes. So the fact that none of these RDMA
changes went up for review on netdev strikes me as just a little bit
more than suspicious.

Please do not do this again, thank you.
* Re: RDMA will be reverted
  From: Evgeniy Polyakov @ 2006-06-28  7:41 UTC
  To: David Miller; Cc: rolandd, netdev, akpm

On Wed, Jun 28, 2006 at 12:07:15AM -0700, David Miller (davem@davemloft.net) wrote:
> Roland, there is no way in the world we would have let support for
> RDMA into the kernel tree had we seen and reviewed it on netdev. I've
> discussed this with Andrew Morton, and we'd like you to please revert
> all of the RDMA code from Linus's tree immediately.

May I suggest not reverting it? RDMA and RDDP can be treated like
tun/tap or packet socket devices until they start to change internal
network structures. As far as I can see they do not; they only use
existing interfaces, much as userspace can.

--
Evgeniy Polyakov
* Re: RDMA will be reverted
  From: Tom Tucker @ 2006-06-28 14:56 UTC
  To: David Miller; Cc: rolandd, netdev, akpm

On Wed, 2006-06-28 at 00:07 -0700, David Miller wrote:
> Roland, there is no way in the world we would have let support for
> RDMA into the kernel tree had we seen and reviewed it on netdev. I've
> discussed this with Andrew Morton, and we'd like you to please revert
> all of the RDMA code from Linus's tree immediately.
>
> Folks are well aware of how opposed the Linux networking developers
> are to RDMA and TOE type schemes. So the fact that none of these RDMA
> changes went up for review on netdev strikes me as just a little bit
> more than suspicious.
>
> Please do not do this again, thank you.

I believe Roland is on vacation (they just had a baby...). It is my
belief that everything Roland submitted went through both netdev and
lkml review.
* Re: RDMA will be reverted
  From: Steve Wise @ 2006-06-28 15:01 UTC
  To: David Miller; Cc: rolandd, netdev, akpm

On Wed, 2006-06-28 at 00:07 -0700, David Miller wrote:
> Roland, there is no way in the world we would have let support for
> RDMA into the kernel tree had we seen and reviewed it on netdev. I've
> discussed this with Andrew Morton, and we'd like you to please revert
> all of the RDMA code from Linus's tree immediately.
>
> Please do not do this again, thank you.

Dave,

There is no support for RDMA/TCP in Linux today, nor in Roland's git
tree for that matter.

I have posted a patch series for RDMA/TCP core support to lkml and
netdev over the last few weeks and gone through 3 review cycles (see
the "iWARP Core Changes" threads). In addition, I posted the Ammasso
RDMA driver for review as well; it also went through 3 review cycles.

Based on the review feedback and the lack of any serious issues, it
was my understanding that everyone was comfortable with RDMA/TCP.
Nothing underhanded was going on here.

Steve.
* Re: RDMA will be reverted
  From: Roland Dreier @ 2006-06-29 16:54 UTC
  To: David Miller; Cc: netdev, akpm

David> Roland, there is no way in the world we would have let
David> support for RDMA into the kernel tree had we seen and
David> reviewed it on netdev. I've discussed this with Andrew
David> Morton, and we'd like you to please revert all of the RDMA
David> code from Linus's tree immediately.

David> Folks are well aware of how opposed the Linux networking
David> developers are to RDMA and TOE type schemes. So the fact
David> that none of these RDMA changes went up for review on
David> netdev strikes me as just a little bit more than suspicious.

[I'm really on paternity leave, but this was brought to my attention
and seems important enough to respond to]

Dave, you're going to have to be more specific. What do you mean by
RDMA? The whole drivers/infiniband infrastructure, which handles RDMA
over IB, has been upstream for a year and a half, and was in fact
originally merged by you, so I'm guessing that's not what you mean.

If you're talking about the "RDMA CM" (drivers/infiniband/core/cma.c
et al) that was just merged, then you should be aware that it was
posted by Sean Hefty to netdev for review, multiple times (e.g. a
quick search finds <http://lwn.net/Articles/170202/>). It is true
that the intention of the abstraction is to provide a common mechanism
for handling IB and iWARP (RDMA/TCP) connections, but at the moment no
iWARP code is upstream. Right now all it does is allow IP addressing
to be used for IB connections.

In any case, I think we need to find a way for Linux to support iWARP
hardware, since there are users who want this, and (some of) the
vendors are working hard to do things the right way (including cc'ing
netdev on the conversation). I don't think it's good for Linux for
the answer to just be, "sorry, you're wrong to want to use that
hardware."

 - Roland
* Re: RDMA will be reverted
  From: YOSHIFUJI Hideaki / 吉藤英明 @ 2006-06-29 17:32 UTC
  To: rdreier; Cc: davem, netdev, akpm, yoshfuji

Hello.

In article <adawtazgawi.fsf@cisco.com> (at Thu, 29 Jun 2006 09:54:37 -0700),
Roland Dreier <rdreier@cisco.com> says:

> Dave, you're going to have to be more specific. What do you mean by
> RDMA? The whole drivers/infiniband infrastructure, which handles RDMA
> over IB, has been upstream for a year and a half, and was in fact
> originally merged by you, so I'm guessing that's not what you mean.

NET_DMA things.

--yoshfuji
* Re: RDMA will be reverted
  From: Roland Dreier @ 2006-06-29 17:35 UTC
  To: YOSHIFUJI Hideaki / 吉藤英明; Cc: davem, netdev, akpm

> > Dave, you're going to have to be more specific. What do you mean by
> > RDMA? The whole drivers/infiniband infrastructure, which handles RDMA
> > over IB, has been upstream for a year and a half, and was in fact
> > originally merged by you, so I'm guessing that's not what you mean.
>
> NET_DMA things.

But NET_DMA seems to be for the new DMA engine support (I/OAT, really,
I guess?). I had nothing to do with merging any of that, and as far as
I can tell, Dave signed off on all of those changes, so I don't think
that's what he's complaining about either.

 - R.
* Re: RDMA will be reverted
  From: YOSHIFUJI Hideaki / 吉藤英明 @ 2006-06-29 17:40 UTC
  To: rdreier; Cc: davem, netdev, akpm, yoshfuji

In article <adasllng8zn.fsf@cisco.com> (at Thu, 29 Jun 2006 10:35:56 -0700),
Roland Dreier <rdreier@cisco.com> says:

> But NET_DMA seems to be for the new DMA engine support (I/OAT, really,
> I guess?). I had nothing to do with merging any of that, and as far as
> I can tell, Dave signed off on all of those changes, so I don't think
> that's what he's complaining about either.

Oops, sorry, you're right...

--yoshfuji
* Re: RDMA will be reverted
  From: David Miller @ 2006-06-29 19:46 UTC
  To: rdreier; Cc: netdev, akpm

From: Roland Dreier <rdreier@cisco.com>
Date: Thu, 29 Jun 2006 09:54:37 -0700

> In any case, I think we need to find a way for Linux to support iWARP
> hardware, since there are users who want this, and (some of) the
> vendors are working hard to do things the right way (including cc'ing
> netdev on the conversation). I don't think it's good for Linux for
> the answer to just be, "sorry, you're wrong to want to use that
> hardware."

We give the same response for TOE stuff.

The integration of iWARP with the Linux networking stack, while much
better than TOE, is still heavily flawed. What most people might not
realize when using this stuff is that:

1) None of their firewall rules will apply to the iWARP communications.
2) None of their packet scheduling configurations can be applied to
   the iWARP communications.
3) It is not possible to encapsulate iWARP traffic in IPSEC.

And the list goes on and on. This is what we don't like about
technologies that implement their own networking stack in the card
firmware.
* Re: RDMA will be reverted
  From: Tom Tucker @ 2006-06-29 20:11 UTC
  To: David Miller; Cc: rdreier, netdev, akpm

On Thu, 2006-06-29 at 12:46 -0700, David Miller wrote:
> We give the same response for TOE stuff.

What does the word "we" represent in this context? Is it the Linux
community at large, Linux and Andrew, you? I'm not trying to be
argumentative; I just want to understand how carefully, and by whom,
iWARP technology has been considered.

> The integration of iWARP with the Linux networking stack, while much
> better than TOE, is still heavily flawed. What most people might not
> realize when using this stuff is that:

Agreed, the patch improves some things but doesn't address others.
But isn't this position a condemnation of the good to spite the bad?

> 1) None of their firewall rules will apply to the iWARP communications.
> 2) None of their packet scheduling configurations can be applied to
>    the iWARP communications.
> 3) It is not possible to encapsulate iWARP traffic in IPSEC.
>
> And the list goes on and on.

It does. However, this position statement makes things worse, not
better. By this I mean that deep adapters (iSCSI, iWARP) are even more
debilitated by not being able to snoop MTU changes, etc., and are
therefore forced to duplicate subsystems (e.g. ARP, ICMP, ...) already
ably implemented in host software.

> This is what we don't like about technologies that implement their own
> networking stack in the card firmware.

Doesn't this position force vendors to build deeper adapters, not
shallower ones? Isn't that exactly the opposite of what is intended?
* Re: RDMA will be reverted
  From: Tom Tucker @ 2006-06-29 20:16 UTC
  To: David Miller; Cc: rdreier, netdev, akpm

[...snip...]
> community at large, Linux and Andrew, you? I'm not trying to be

Linus, sorry... spell checker...
* Re: RDMA will be reverted
  From: David Miller @ 2006-06-29 20:19 UTC
  To: tom; Cc: rdreier, netdev, akpm

From: Tom Tucker <tom@opengridcomputing.com>
Date: Thu, 29 Jun 2006 15:11:06 -0500

> Doesn't this position force vendors to build deeper adapters, not
> shallower ones? Isn't that exactly the opposite of what is intended?

Nope.

Look at what the networking developers give a lot of attention and
effort to: things like TCP Large Receive Offload and Van Jacobson net
channels, both of which are fully stack-integrated receive performance
enhancements. They do not bypass netfilter, they do not bypass packet
scheduling, and yet they provide a hardware-assisted performance
improvement for receive.

This has been stated over and over again.

If companies keep designing undesirable hardware that unnecessarily
takes features away from the user, that really is not our problem.
* Re: RDMA will be reverted
  From: Tom Tucker @ 2006-06-29 20:47 UTC
  To: David Miller; Cc: rdreier, netdev, akpm

On Thu, 2006-06-29 at 13:19 -0700, David Miller wrote:
> Look at what the networking developers give a lot of attention and
> effort to: things like TCP Large Receive Offload and Van Jacobson net
> channels, both of which are fully stack-integrated receive performance
> enhancements. They do not bypass netfilter, they do not bypass packet
> scheduling, and yet they provide a hardware-assisted performance
> improvement for receive.

These technologies are integrated because someone chose to, and was
allowed to, integrate them. I contend that iWARP could be equally well
integrated if the decision were made to do so. It would, however,
require cooperation from both the hardware vendors and the netdev
maintainers.

> This has been stated over and over again.

For TOE, you are correct; however, for iWARP, you can't do RDMA
(direct placement into application buffers) without state in the
adapter. I personally tried very hard to build an adapter without
doing so, but alas, I failed ;-)

> If companies keep designing undesirable hardware that unnecessarily
> takes features away from the user, that really is not our problem.

I concede that features have been lost, but some applications benefit
greatly from RDMA. For these applications and these customers, the
hardware is not undesirable, and the fact that netfilter won't work on
their sub-5us-latency adapter is not perceived to be a big issue. The
mention of packet scheduling would cause an apoplectic seizure...
unless it were in the hardware...

All that verbiage aside, I believe it is not a matter of whether it is
possible to integrate iWARP; it is a question of whether it is
permissible.
* Re: RDMA will be reverted
  From: David Miller @ 2006-06-29 20:53 UTC
  To: tom; Cc: rdreier, netdev, akpm

From: Tom Tucker <tom@opengridcomputing.com>
Date: Thu, 29 Jun 2006 15:47:13 -0500

> I concede that features have been lost, but some applications benefit
> greatly from RDMA. For these applications and these customers,

TOE folks give the same story... it's a broken record, really.

Let us know when you can say something new about the situation.

Under Linux we get to make better long-term, architecturally sane
decisions, even if it is to the dismay of the backers of certain
short-sighted pieces of technology.
* Re: RDMA will be reverted
  From: Tom Tucker @ 2006-06-29 21:28 UTC
  To: David Miller; Cc: rdreier, netdev, akpm

On Thu, 2006-06-29 at 13:53 -0700, David Miller wrote:
> Under Linux we get to make better long-term, architecturally sane
> decisions, even if it is to the dismay of the backers of certain
> short-sighted pieces of technology.

Would you indulge me with one final clarification?

- Are you condemning RDMA over TCP as an ill-conceived technology?
- Are you condemning the implementation of iWARP?
- Are you condemning both?

Thanks,
Tom
* Re: RDMA will be reverted
  From: Andi Kleen @ 2006-06-29 21:25 UTC
  To: David Miller; Cc: tom, rdreier, netdev, akpm

> They do not bypass netfilter, they do not bypass packet scheduling,
> and yet they provide a hardware-assisted performance improvement for
> receive.

Not that I'm a TOE advocate, but as long as the adapter leaves
SYN/SYN-ACK handling to the stack and only turns on RDMA once a
connection is ESTABLISHED, it could at least pass through nearly all
of netfilter too (as established in the channel discussion).

-Andi
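To make Andi's point concrete, here is a minimal sketch (not from the
thread) of a netfilter hook that still sees every TCP connection
attempt even when an adapter takes over established connections. It
uses today's kernel names (nf_register_net_hook, NF_INET_PRE_ROUTING);
the 2006-era equivalents were nf_register_hook and NF_IP_PRE_ROUTING.

    #include <linux/netfilter.h>
    #include <linux/netfilter_ipv4.h>
    #include <linux/ip.h>
    #include <linux/tcp.h>

    /* Accept everything, but observe TCP SYNs: if the adapter only
     * takes over ESTABLISHED connections, every connection attempt
     * still passes here and can be matched by ordinary firewall
     * rules. */
    static unsigned int syn_watch(void *priv, struct sk_buff *skb,
                                  const struct nf_hook_state *state)
    {
            const struct iphdr *iph = ip_hdr(skb);

            if (iph->protocol == IPPROTO_TCP) {
                    /* The transport header is not set yet at
                     * PRE_ROUTING; locate the TCP header from the IP
                     * header length. (pskb_may_pull() checks omitted
                     * for brevity.) */
                    const struct tcphdr *th = (const struct tcphdr *)
                            ((const u8 *)iph + iph->ihl * 4);

                    if (th->syn && !th->ack)
                            pr_debug("SYN %pI4:%u -> %pI4:%u\n",
                                     &iph->saddr, ntohs(th->source),
                                     &iph->daddr, ntohs(th->dest));
            }
            return NF_ACCEPT;
    }

    static struct nf_hook_ops syn_watch_ops = {
            .hook     = syn_watch,
            .pf       = NFPROTO_IPV4,
            .hooknum  = NF_INET_PRE_ROUTING,
            .priority = NF_IP_PRI_FIRST,
    };
    /* registered from module init with:
     *   nf_register_net_hook(&init_net, &syn_watch_ops); */

Returning NF_DROP for a disallowed SYN here would keep the offloaded
connection from ever being set up, which is the property Andi is
pointing at.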
* Re: RDMA will be reverted
  From: James Morris @ 2006-06-29 20:42 UTC
  To: Tom Tucker; Cc: David Miller, rdreier, netdev, akpm

On Thu, 29 Jun 2006, Tom Tucker wrote:
> What does the word "we" represent in this context? Is it the Linux
> community at large, Linux and Andrew, you? I'm not trying to be
> argumentative; I just want to understand how carefully, and by whom,
> iWARP technology has been considered.

$ grep -ri davem /usr/src/linux

- James
--
James Morris <jmorris@namei.org>
* Re: RDMA will be reverted
  From: Roland Dreier @ 2006-06-30 20:51 UTC
  To: David Miller; Cc: netdev, akpm

You snipped my question about what specifically you wanted reverted,
so I'm going to assume that after cooling down and understanding the
situation, you're OK with everything that's in Linus's tree...

> The integration of iWARP with the Linux networking stack, while much
> better than TOE, is still heavily flawed. What most people might not
> realize when using this stuff is that:
>
> 1) None of their firewall rules will apply to the iWARP communications.
> 2) None of their packet scheduling configurations can be applied to
>    the iWARP communications.
> 3) It is not possible to encapsulate iWARP traffic in IPSEC.

Yes, there are tradeoffs with iWARP. However, there seem to be users
who are willing to make those tradeoffs. And I can't think of a single
other example of a case where we refused to merge a driver, not
because of any issues with the driver code, but because we don't like
the hardware it drives and think that people shouldn't be able to use
that hardware with Linux. And it makes me sad that we're doing that
here.

Don't get me wrong, I'm all for rejecting patches that make the core
networking stack worse or harder to maintain, or that are bad patches
for whatever reason. I know that the present is science fiction, but
I always thought the forbidden technologies would be stuff like
nanotech or human cloning -- I never would have guessed that iWARP
would be in that category.

Anyway, what is your feeling about changes strictly under
drivers/infiniband that add low-level driver support for iWARP
devices? The changes that Steve Wise proposed aren't strictly
necessary for iWARP support -- they just make things work better when
routes change.

Thanks,
  Roland
* Re: RDMA will be reverted
  From: David Miller @ 2006-06-30 21:16 UTC
  To: rdreier; Cc: netdev, akpm

From: Roland Dreier <rdreier@cisco.com>
Date: Fri, 30 Jun 2006 13:51:19 -0700

> And I can't think of a single other example of a case where we
> refused to merge a driver, not because of any issues with the driver
> code, but because we don't like the hardware it drives and think
> that people shouldn't be able to use that hardware with Linux. And
> it makes me sad that we're doing that here.

The TOE folks have tried to submit their hooks and drivers on several
occasions, and we've rejected it every time.

I definitely don't want the iWARP stuff to go in until we have a long,
good discussion about this. And you have a good week-long opportunity
to do so, as I'm about to go on vacation until next Friday :-)
* Re: RDMA will be reverted
  From: Tom Tucker @ 2006-06-30 23:01 UTC
  To: David Miller; Cc: rdreier, netdev, akpm

On Fri, 2006-06-30 at 14:16 -0700, David Miller wrote:
> The TOE folks have tried to submit their hooks and drivers on several
> occasions, and we've rejected it every time.

iWARP != TOE
* Re: RDMA will be reverted
  From: Andi Kleen @ 2006-07-01 14:26 UTC
  To: Tom Tucker; Cc: David Miller, rdreier, netdev, akpm

On Saturday 01 July 2006 01:01, Tom Tucker wrote:
> On Fri, 2006-06-30 at 14:16 -0700, David Miller wrote:
> > The TOE folks have tried to submit their hooks and drivers on
> > several occasions, and we've rejected it every time.
>
> iWARP != TOE

Perhaps a good start to the discussion David asked for would be if you
could give us an overview of the differences and of how you avoid the
TOE problems.

-Andi
* Re: RDMA will be reverted
  From: Andy Gay @ 2006-07-04 18:34 UTC
  To: Andi Kleen; Cc: Tom Tucker, David Miller, rdreier, netdev, akpm

On Sat, 2006-07-01 at 16:26 +0200, Andi Kleen wrote:
> Perhaps a good start to the discussion David asked for would be if
> you could give us an overview of the differences and of how you avoid
> the TOE problems.

Interesting thread; I hope someone replies to Andi's request.

I've actually no real idea what RDMA, iWARP and TOE are, so I may be
barking up completely the wrong tree here. If so, apologies. But it
sounds like we're talking about technologies that offload some part of
the network/transport layer processing to the interface device? And
the primary objection is that this may bypass some of the cool
features of the Linux stack? Stuff like iptables and... what exactly?

Presumably the reasons why such offloading would be a Good Thing have
to do with very high speed network processing, 10G Ethernet and the
like. That sounds to me very much like the way dedicated networking
kit would do it. So if you have a device that needs to be a very high
performance router, you dedicate it to that function and don't try to
do clever per-packet or per-flow processing at the same time.

In the Cisco world, there's a network design approach where you
consider your equipment in three 'layers'; I think they call them the
core, distribution, and access layers, or something like that. The
idea is that the core layer handles the real high speed stuff: you
don't do anything much except routing/switching in there. The other
layers aggregate flows for the core; if you need extra processing
(firewalls etc.) you do it somewhere in there. So, for example, the
packet capture functions (sort of like tcpdump) don't work if fast
switching is in use, which it would be in the core. Users accept
these tradeoffs, because if you design it right you can do the extra
processing on some other device in the outer layers.

So perhaps there's a good argument to make that a Linux system with
the right hardware could be considered a core device. Likely any place
you have such a system, it would be dedicated to just moving data as
well as possible, letting other systems do the other stuff. You
wouldn't want your server farm systems to also be your firewalls.

Bottom line - these technologies seem to me to have a place in a well
designed network.

Just my 2c...

- Andy
* Re: RDMA will be reverted
  From: Andi Kleen @ 2006-07-04 20:47 UTC
  To: Andy Gay; Cc: Tom Tucker, David Miller, rdreier, netdev, akpm

> So perhaps there's a good argument to make that a Linux system with
> the right hardware could be considered a core device. Likely any place
> you have such a system, it would be dedicated to just moving data as
> well as possible, letting other systems do the other stuff. You
> wouldn't want your server farm systems to also be your firewalls.

Why not? While Linux firewall performance is not flawless, its
problems (e.g. slow conntrack) seem to be mostly in an area where TOE
cannot do much about them.

> Bottom line - these technologies seem to me to have a place in a well
> designed network.

I think there is a web page listing why it's bad, but here's a quick
summary:

One worry is debugging it all together. Currently we have a single
stack to debug, although it's already difficult to control the
complexity as it grows more bells and whistles. Just take a look at
Cisco IOS release notes to see how hard it is to get it all to work
together.

Another reason is that there are general doubts that TOE can keep up
with the ever-growing performance of CPUs. Even if Linux added it
today, it would likely be slower again a few months later. That is
also a big difference from Cisco hardware: Linux usually runs on fast
main CPUs (or if you run it on slow CPUs you normally don't expect the
best network performance), and they get faster and faster constantly.
Admittedly 10Gb NICs are still a bit too fast for mainstream systems,
but that seems to be mostly a problem outside the CPUs, and it looks
like the next generation of systems will catch up with enough
bandwidth in this area.

Also, it tends to accelerate the wrong thing. On a lot of workloads
the main problem is keeping a lot of different connections under
control, and TOE tends to be slow at keeping connection information
synchronized with the host.

That is why the Linux strategy has been to ask for useful stateless
offloads instead. Examples are checksum offload (a long-time classic),
TSO (TCP segmentation offload), UFO (UDP fragmentation offload), Intel
I/OAT (memcpy offload), and RX hashing with MSI-X (not implemented
yet, but basically it allows load balancing of incoming streams across
CPUs). Note that all of these are more or less stateless offloads.

iWARP is not clear yet. From the meager bits of information about it
that have reached netdev so far, it at least sounds like it does RDMA,
needs far more state than any of the other offloads we have so far,
and likely has the usual TOE scaling issues. It's also likely on the
wrong side of Moore's law.

-Andi
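For reference, the stateless offloads Andi lists are advertised per
device through feature flags on struct net_device, which the stack
consults packet by packet. A hedged sketch using current flag names
(the netdev_features_t type and NETIF_F_GRO postdate this thread, and
the NETIF_F_UFO flag has since been removed from the kernel):

    #include <linux/netdevice.h>

    /* Print which of the classic stateless offloads a NIC advertises.
     * The stack checks these bits on the fly, e.g. it only hands the
     * hardware an unsegmented super-packet if NETIF_F_TSO is set. */
    static void report_stateless_offloads(const struct net_device *dev)
    {
            netdev_features_t f = dev->features;

            pr_info("%s: ip-csum=%d sg=%d tso=%d gro=%d\n", dev->name,
                    !!(f & NETIF_F_IP_CSUM),
                    !!(f & NETIF_F_SG),
                    !!(f & NETIF_F_TSO),
                    !!(f & NETIF_F_GRO));
    }

These are the same bits that "ethtool -k <dev>" reports from
userspace.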
* Re: RDMA will be reverted
  From: Andy Gay @ 2006-07-04 22:22 UTC
  To: Andi Kleen; Cc: Tom Tucker, David Miller, rdreier, netdev, akpm

On Tue, 2006-07-04 at 22:47 +0200, Andi Kleen wrote:
> > So perhaps there's a good argument to make that a Linux system with
> > the right hardware could be considered a core device. [...] You
> > wouldn't want your server farm systems to also be your firewalls.
>
> Why not? While Linux firewall performance is not flawless, its
> problems (e.g. slow conntrack) seem to be mostly in an area where TOE
> cannot do much about them.

No doubt you *can* do this, but would you want to? My point wasn't
really about performance here; it's more that systems needing this
level of performance (a server farm is just an example) will probably
be on an 'inside' network, with firewalling done elsewhere (at the
access layer, to use the Cisco paradigm). It's just not good design to
attach such systems directly to an untrusted network, IMHO. So these
systems just don't need netfilter capabilities.

> One worry is debugging it all together. Currently we have a single
> stack to debug, although it's already difficult to control the
> complexity as it grows more bells and whistles. Just take a look at
> Cisco IOS release notes to see how hard it is to get it all to work
> together.

No argument there!
* Re: RDMA will be reverted
  From: Andi Kleen @ 2006-07-04 23:01 UTC
  To: Andy Gay; Cc: Tom Tucker, David Miller, rdreier, netdev, akpm

> My point wasn't really about performance here; it's more that systems
> needing this level of performance (a server farm is just an example)
> will probably be on an 'inside' network, with firewalling done
> elsewhere (at the access layer, to use the Cisco paradigm). It's just
> not good design to attach such systems directly to an untrusted
> network, IMHO. So these systems just don't need netfilter
> capabilities.

Don't think of the high end. It is exotic and rare.

Think of the ordinary single Linux box somewhere at a rackspace
provider, which represents the majority of Linux boxes around, with a
not-too-skilled admin who mostly uses the default settings of his
configuration. For that box, running firewalling on the same machine
makes a lot of sense. Normally it is not that loaded and it doesn't
matter much how it performs, but it might occasionally be slashdotted,
and then it should still hold up.

BTW, basic firewalling is not really that bad as long as you don't
have too many rules. Mostly conntrack is painful right now. I'm sure
at some point it will be fixed too.

-Andi
* Re: RDMA will be reverted
  From: Andy Gay @ 2006-07-04 23:48 UTC
  To: Andi Kleen; Cc: Tom Tucker, David Miller, rdreier, netdev, akpm

On Wed, 2006-07-05 at 01:01 +0200, Andi Kleen wrote:
> Don't think of the high end. It is exotic and rare.

Sure. But isn't the high end exactly where these new technologies are
intended to fit?

> Think of the ordinary single Linux box somewhere at a rackspace
> provider, which represents the majority of Linux boxes around.

How many of those need 10G NICs?

> With a not-too-skilled admin who mostly uses the default settings of
> his configuration. For that box, running firewalling on the same
> machine makes a lot of sense.

Yup. I run a few of those. And I run firewalls on them. But they're on
1.5M T1 pipes at best. I probably fit into your 'not too skilled'
category, too :)

> BTW, basic firewalling is not really that bad as long as you don't
> have too many rules. Mostly conntrack is painful right now. I'm sure
> at some point it will be fixed too.

Actually, I wasn't aware of any pain with conntrack; it works great
for me. But like I said, I don't run any real high-speed connections.

We're focusing on netfilter here. Is breaking netfilter really the
only issue with this stuff? I know you mentioned some other concerns
(about TOE specifically), but they were really scalability things,
weren't they? Like you're not convinced this really solves any
performance issues long term. I'm certainly not qualified to discuss
that; hopefully some of the others will weigh in here.
* Re: RDMA will be reverted
  From: Andi Kleen @ 2006-07-05  0:04 UTC
  To: Andy Gay; Cc: Tom Tucker, David Miller, rdreier, netdev, akpm

> > Think of the ordinary single Linux box somewhere at a rackspace
> > provider, which represents the majority of Linux boxes around.
>
> How many of those need 10G NICs?

Most of them already have gigabit. At some point they will have 10G
too. Admittedly the iThingy under discussion here seems to be
Infiniband-only, which will probably not appear in such a use case.

> We're focusing on netfilter here. Is breaking netfilter really the
> only issue with this stuff?

Another concern is that it will just not be able to keep up with a
high rate of new connections, or a high number of them (because the
hardware has too limited state). And then there are the other issues I
listed, like subtle TCP bugs (TSO is already a nightmare in this area,
and it's still not quite right), etc.

> I know you mentioned some other concerns (about TOE specifically),
> but they were really scalability things, weren't they?

There was more than just scalability. Reread it.

Anyway, the thread is already getting off topic. I'm not actually that
interested in a generic TOE discussion, because the issue is pretty
much settled already with broad consensus. You can refer to the netdev
archives or the respective web pages if you want more details. It
would need someone who can describe how this new RDMA device avoids
all the problems, but so far its advocates don't seem to be interested
in doing that, and I cannot contribute more.

-Andi
* Re: RDMA will be reverted
  From: Roland Dreier @ 2006-07-04 20:34 UTC
  To: Andi Kleen; Cc: Tom Tucker, David Miller, netdev, akpm

Andi> Perhaps a good start to the discussion David asked for would
Andi> be if you could give us an overview of the differences and of
Andi> how you avoid the TOE problems.

Well, here's a quick overview, leaving out some of the details. The
difference between TOE and iWARP/RDMA is really the interface that
they present.

A TOE ("TCP Offload Engine") is a piece of hardware that offloads TCP
processing from the main system to handle regular sockets. There is
either some way to hand off a socket from the host stack to the TOE,
or a socket is created on the TOE to start with, but in both cases the
TOE is handling processing for normal TCP sockets. This means that
the TOE has some hardware and/or firmware to do stateful TCP
processing.

An iWARP device, or RNIC (RDMA NIC), also usually has hardware and/or
firmware TCP processing, but this isn't exposed through the BSD socket
interface. Instead, an RNIC presents an interface more like an
InfiniBand HCA: work requests (sends, receives, RDMA operations) are
passed to the RNIC via work queues, and completion notification is
returned asynchronously via completion queues. An iWARP connection
can handle both send/receive ("two-sided") and get/put (RDMA or
"one-sided") operations. For full details of the protocol used for
this, you can look at the drafts from the IETF rddp working group, but
basically an RDMA protocol is layered on top of a connected stream
protocol -- usually TCP, but SCTP could be used as well.

A lot of the performance of iWARP comes from the RDMA/direct placement
capabilities -- for example, an NFS/RDMA server can process requests
out of order and put data directly into the buffer that's waiting for
it, without using any CPU on the destination -- but even send/receive
operations can be useful. As a side note, an RNIC will also typically
support the same sort of kernel bypass as an IB HCA, where work queues
can be safely mapped into a userspace process's memory so that work
requests can be posted without a system call. In fact, people usually
use RDMA as a shorthand for the combination of the three features I
described: asynchronous work queues and completion queues, connections
that support both send/receive and RDMA, and kernel bypass.

In any case, RNIC support can be added to the existing IB stack with
fairly minor modifications -- you can search the netdev archives for
the patchsets posted by Steve Wise, but nearly all of the new code is
in the low-level hardware driver for the specific iWARP devices. The
real issues for netdev are things like Steve Wise's patch to add route
change notifiers, which could be used to tell RNICs when to update the
next hop for a connection they're handling.

More generally, it would be interesting to see if it's possible to tie
an RNIC into the kernel's packet filtering, so that disallowed
connections don't get set up. This seems very similar in spirit to
the problems around packet filtering that were raised for VJ
netchannels.

 - Roland
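To illustrate the work-queue interface Roland describes, here is a
hedged sketch of posting a one-sided RDMA WRITE through the kernel
verbs API that the IB stack (and, later, iWARP drivers) export. It
uses the modern struct ib_rdma_wr layout; the 2.6.x-era code carried
the same remote_addr/rkey fields in a union inside struct ib_send_wr.

    #include <rdma/ib_verbs.h>

    /* Queue one RDMA WRITE on an already-connected queue pair: the
     * adapter places 'len' bytes from our registered local buffer
     * directly into the peer's memory at remote_addr, with no CPU
     * involvement on the remote side. Completion shows up later on
     * the send CQ. */
    static int post_rdma_write(struct ib_qp *qp, u64 local_addr,
                               u32 lkey, u64 remote_addr, u32 rkey,
                               u32 len)
    {
            struct ib_sge sge = {
                    .addr   = local_addr,
                    .length = len,
                    .lkey   = lkey,
            };
            struct ib_rdma_wr wr = {
                    .wr = {
                            .opcode     = IB_WR_RDMA_WRITE,
                            .sg_list    = &sge,
                            .num_sge    = 1,
                            .send_flags = IB_SEND_SIGNALED,
                    },
                    .remote_addr = remote_addr,
                    .rkey        = rkey,
            };
            const struct ib_send_wr *bad_wr;

            return ib_post_send(qp, &wr.wr, &bad_wr);
    }

The caller later reaps a struct ib_wc for this request, e.g. with
ib_poll_cq() on the send completion queue; the same post/poll model
covers sends, receives, and RDMA reads.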
* Re: RDMA will be reverted
  From: David Miller @ 2006-07-24 22:06 UTC
  To: rdreier; Cc: ak, tom, netdev, akpm

From: Roland Dreier <rdreier@cisco.com>
Date: Tue, 04 Jul 2006 13:34:27 -0700

> Well, here's a quick overview, leaving out some of the details. The
> difference between TOE and iWARP/RDMA is really the interface that
> they present.

Thanks for the description, Roland. It helps me understand the
situation better.

> The real issues for netdev are things like Steve Wise's patch to add
> route change notifiers, which could be used to tell RNICs when to
> update the next hop for a connection they're handling.

I'll probably put Steve's patches in soon.

> More generally, it would be interesting to see if it's possible to
> tie an RNIC into the kernel's packet filtering, so that disallowed
> connections don't get set up. This seems very similar in spirit to
> the problems around packet filtering that were raised for VJ
> netchannels.

Don't get too excited about VJ netchannels; more and more roadblocks
to their practicality are being found every day.

For example, my idea to allow ESTABLISHED TCP socket demux to be done
before netfilter is flawed. Connection tracking and NAT can change the
packet ID and loop it back to us to hit exactly an ESTABLISHED TCP
socket; therefore we must always hit netfilter first. All the original
costs of route, netfilter, and TCP socket lookup reappear as we make
VJ netchannels fit all the rules of real practical systems,
eliminating their gains entirely.

I will also note in passing that papers on related ideas, such as the
Exokernel stuff, are very careful not to address 1) how practical
their demux engine is, and 2) the negative side effects of userspace
TCP implementations. For an example of the latter, if you have some
1GB Java process, you do not want to wake that monster up just to do
some ACK processing or TCP window updates, yet if you don't you
violate TCP's rules and risk spurious, unnecessary retransmits.

Furthermore, the VJ netchannel gains can be partially obtained from
generic stateless facilities that we are going to get anyway.
Networking chips supporting multiple MSI-X vectors, chosen by hashing
the flow ID, can move TCP processing to "end nodes", which in this
case are CPU threads, by having each such MSI-X vector target a
different CPU thread.

The good news is that we've survived a long time without revolutions
like VJ net channels, and the existing TCP stack can be improved
dramatically, in ways people will see benefits from in a shorter
amount of time. For example, Alexey Kuznetsov and I have some ideas on
how to make the most expensive TCP function for a sender, tcp_ack(),
more efficient by using different data structures for the retransmit
queue and the loss/recovery packet SACK state.
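A sketch of the stateless hash-based dispatch David describes,
essentially what hardware later shipped as RSS, using the kernel's
jhash over the TCP/IPv4 4-tuple. The function name and seed here are
illustrative, not taken from any driver:

    #include <linux/types.h>
    #include <linux/jhash.h>

    /* Map a TCP/IPv4 flow to one of nr_queues MSI-X vectors, each
     * bound to a different CPU thread. Stateless: no per-connection
     * table is needed, just the hash, and every segment of a flow
     * always lands on the same CPU. */
    static u32 flow_to_queue(__be32 saddr, __be32 daddr,
                             __be16 sport, __be16 dport, u32 nr_queues)
    {
            u32 ports = ((u32)ntohs(sport) << 16) | ntohs(dport);
            u32 hash  = jhash_3words(ntohl(saddr), ntohl(daddr), ports,
                                     0xdeadbeef /* illustrative seed */);

            return hash % nr_queues;
    }

Because the mapping is a pure function of the addressing, the NIC
keeps no connection state at all, which is exactly the "stateless
facilities" contrast David is drawing against TOE-style offload.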
* Re: RDMA will be reverted
  From: Andi Kleen @ 2006-07-24 23:10 UTC
  To: David Miller; Cc: rdreier, tom, netdev, akpm

> For example, my idea to allow ESTABLISHED TCP socket demux to be done
> before netfilter is flawed. Connection tracking and NAT can change the
> packet ID and loop it back to us to hit exactly an ESTABLISHED TCP
> socket; therefore we must always hit netfilter first.

Hmm, how does this happen? I guess either when a connection is
masqueraded and an application did a bind() on a local port that is
used by the masquerading engine (that could be handled by just
disallowing it), or when you have a transparent proxy setup with the
proxy on the local host. Perhaps in that case netfilter could be
taught to reinject packets in a way that they hit another ESTABLISHED
lookup. Did I miss a case?

> All the original costs of route, netfilter, and TCP socket lookup
> reappear as we make VJ netchannels fit all the rules of real practical
> systems, eliminating their gains entirely.

Most of the optimizations from the early demux scheme could probably
be gotten more simply by adding a fast path to iptables/conntrack/etc.
that checks whether all rules only look at SYN etc. packets, and skips
walking the full rule set in that case (or, more generally, a fast TCP
flag-mask check similar to what TCP does). With that, ESTABLISHED
packets would hit TCP with only relatively small overhead.

> For an example of the latter, if you have some 1GB Java process, you
> do not want to wake that monster up just to do some ACK processing or
> TCP window updates, yet if you don't you violate TCP's rules and risk
> spurious, unnecessary retransmits.

I don't quite get why the size of the process matters here: if only
some userspace TCP library is called directly, then it shouldn't
really matter how big or small the rest of the process is. Or did you
mean the migration costs described below?

But on the other hand, full userspace TCP seems to me of little gain
compared to a process-context implementation. I somehow like it better
to hide these implementation details in the kernel.

> Furthermore, the VJ netchannel gains can be partially obtained from
> generic stateless facilities that we are going to get anyway.
> Networking chips supporting multiple MSI-X vectors, chosen by hashing
> the flow ID, can move TCP processing to "end nodes", which in this
> case are CPU threads, by having each such MSI-X vector target a
> different CPU thread.

The problem with the scheme is that to do process-context processing
effectively, you would need to teach the scheduler to aggressively
migrate on wakeup (so that the process ends up on the CPU that was
selected by the hash function in the NIC). But what do you do when you
have lots of different connections with different target-CPU hash
values, or when this would require you to move multiple
compute-intensive processes onto a single core?

Without user-context TCP, using softirqs instead, it looks a bit
better, because you can at least use different CPUs to do the ACK
processing etc., and the hash function spreading connections out over
your CPUs doesn't hurt. But you still have relatively high cache line
transfer costs in handing these packets over from the softirq CPUs to
the final process consumer. I liked VJ's idea of using
arrays-of-something instead of lists for that, to avoid some cache
line transfers. OK, at least it sounds nice in theory; I haven't seen
any hard numbers on this scheme compared to a traditional doubly
linked list.

-Andi
* Re: RDMA will be reverted
  From: David Miller @ 2006-07-24 23:22 UTC
  To: ak; Cc: rdreier, tom, netdev, akpm

From: Andi Kleen <ak@suse.de>
Date: Tue, 25 Jul 2006 01:10:25 +0200

> Most of the optimizations from the early demux scheme could probably
> be gotten more simply by adding a fast path to iptables/conntrack/etc.
> that checks whether all rules only look at SYN etc. packets, and skips
> walking the full rule set in that case (or, more generally, a fast TCP
> flag-mask check similar to what TCP does). With that, ESTABLISHED
> packets would hit TCP with only relatively small overhead.

Actually, all is not lost. Alexey has a more clever idea, which is
basically to run the netfilter hooks in the socket receive path. So
we'd do the socket demux, wake the net channel task on the remote CPU,
and that thread of control would run the netfilter hooks.

> I don't quite get why the size of the process matters here: if only
> some userspace TCP library is called directly, then it shouldn't
> really matter how big or small the rest of the process is.

Where does state live in such a huge process? Usually it is scattered
all over its address space. Let us say that Java application just did
a lot of churning on its own data structures, swapping out the TCP
library's state objects; we would prematurely swap that stuff back in
just to spit out an ACK or similar.

> But on the other hand, full userspace TCP seems to me of little gain
> compared to a process-context implementation.

I totally agree.

> The problem with the scheme is that to do process-context processing
> effectively, you would need to teach the scheduler to aggressively
> migrate on wakeup (so that the process ends up on the CPU that was
> selected by the hash function in the NIC).

I don't see this as a big problem. It's all in software; we can
control the behavior.

> But what do you do when you have lots of different connections with
> different target-CPU hash values, or when this would require you to
> move multiple compute-intensive processes onto a single core?

That is why we have a scheduler :) Even in a best-effort scenario,
things will generally be better than they are currently, plus there is
nothing precluding the flow-demux MSI-X selection from getting more
intelligent.

For example, the demuxer could "notice" that TCP data transmits for
flow X tend to happen on CPU X, and update a flow table to record that
fact. It could use the same flow table as the one used for LRO.

> But you still have relatively high cache line transfer costs in
> handing these packets over from the softirq CPUs to the final process
> consumer.

It is true that in order to get the full benefit we have to target the
MSI-X vectors intelligently. For stateless things like routing and
IPSEC gateways and firewalls, none of this really matters. But for
local transports, it matters a lot.
* Re: RDMA will be reverted 2006-07-24 23:22 ` David Miller @ 2006-07-25 0:02 ` Andi Kleen 2006-07-25 0:29 ` Rick Jones 0 siblings, 1 reply; 74+ messages in thread From: Andi Kleen @ 2006-07-25 0:02 UTC (permalink / raw) To: David Miller; +Cc: rdreier, tom, netdev, akpm On Tuesday 25 July 2006 01:22, David Miller wrote: > From: Andi Kleen <ak@suse.de> > Date: Tue, 25 Jul 2006 01:10:25 +0200 > > > > All the original costs of route, netfilter, TCP socket lookup all > > > reappear as we make VJ netchannels fit all the rules of real practical > > > systems, eliminating their gains entirely. > > > > At least most of the optimizations from the early demux scheme could > > be probably gotten simpler by adding a fast path to iptables/conntrack/etc. > > that checks if all rules only check SYN etc. packets and doesn't walk the > > full rules then (or more generalized a fast TCP flag mask check similar > > to what TCP does). With that ESTABLISHED would hit TCP only with relatively > > small overhead. > > Actually, all is not lost. Alexey has a more clever idea which > is basically to run the netfilter hooks in the socket receive > path. The gain being that the target CPU does the work instead of the softirq one? Some combined lookup and better handler of ESTABLISHED still seems like a good idea. One idea I had at some point was to separate conntrack for local connection vs routed connections and attach the local conntrack to the socket (and use its lookup tables). Then at least for local connections conntrack should be nearly free. It should also solve the issue we currently have that enabled conntrack makes local network performance significantly worse. > Where does state live in such a huge process? Usually, it is > scattered all over it's address space. Let us say that java > application just did a lot of churning on it's own data > structure, swapping out TCP library state objects, we will > prematurely swap that stuff back in just to spit out an ACK > or similar. TCP state is usually multiple cache lines, so you would have cache misses anyways. Do you worry about the TLBs? > > But what do you do when you have lots of different connections > > with different target CPU hash values or when this would require > > you to move multiple compute intensive processes or a single core? > > That is why we have scheduler :) It can't do well if it gets conflicting input. > Even in a best effort scenerio, things > will be generally better than they are currently, plus there is nothing > precluding the flow demux MSI-X selection from getting more intelligent. Intelligent = statefull in this case. AFAIK the only way to do it stateless is hashes and the output of hashes tends to be unpredictible by definition. > For example, the demuxer could "notice" that TCPdata transmits for > flow X tend to happen on cpu X, and update a flow table to record that > fact. It could use the same flow table as the one used for LRO. Hmm, i somewhat doubt that lower end NICs will ever have such flow tables. Also the flow tables could always thrash (because the on NIC RAM is necessarily limited) or they or require the NIC to look up state in memory which is probably not much faster than the CPUs doing it. Using hash functions in the hardware to select the MSI-X seems more elegant, cheaper and much more scalable to me. The drawback of hashes is that for processes with multiple connections you have to move some work back into the softirqs that run on the MSI-X target CPUs. 
So basically doing process-context TCP fully will require much more complex and stateful hardware. Or you can keep it only as a fast path for specific situations (a single busy connection per thread) and stay with mostly-softirq processing for the many-connection cases. -Andi ^ permalink raw reply [flat|nested] 74+ messages in thread
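Andi's quoted fast-path suggestion is easy to picture in miniature: precompute, whenever the ruleset changes, whether every rule can only ever match SYN packets; if so, ESTABLISHED segments can skip the full rule walk. A hedged userspace sketch follows, with all structures hypothetical rather than actual netfilter internals.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define TCPHDR_SYN 0x02

    struct rule {
        uint8_t tcp_flag_mask;              /* TCP flags the rule inspects */
        uint8_t tcp_flag_cmp;               /* values those flags must have */
        /* match/target fields elided */
    };

    /* recomputed once per ruleset change, not per packet: true only if
     * every rule requires SYN to be set in order to match */
    static bool ruleset_syn_only(const struct rule *rules, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (!(rules[i].tcp_flag_mask & TCPHDR_SYN) ||
                !(rules[i].tcp_flag_cmp & TCPHDR_SYN))
                return false;
        return true;
    }

    /* per packet: a non-SYN segment against a SYN-only ruleset can
     * bypass the rule walk entirely */
    static bool can_skip_rule_walk(bool syn_only_ruleset, uint8_t tcp_flags)
    {
        return syn_only_ruleset && !(tcp_flags & TCPHDR_SYN);
    }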
* Re: RDMA will be reverted 2006-07-25 0:02 ` Andi Kleen @ 2006-07-25 0:29 ` Rick Jones 2006-07-25 0:45 ` David Miller 2006-07-25 1:42 ` Andi Kleen 0 siblings, 2 replies; 74+ messages in thread From: Rick Jones @ 2006-07-25 0:29 UTC (permalink / raw) To: Andi Kleen; +Cc: David Miller, rdreier, tom, netdev, akpm This all sounds like the discussions we had within HP-UX between 10.20 and 11.0 concerning Inbound Packet Scheduling vs Thread Optimized Packet Scheduling. IPS was done by the 10.20 stack at the handoff between the driver and netisr. If the packet was not an IP datagram fragment, parts of the transport and IP headers would be hashed, and the result would be the netisr queue to which the packet would be queued for further processing. It worked fine and dandy for stuff like aggregate netperf TCP_RR tests because there was a 1-1 correspondence between a connection and a process/thread. It was "OK" for the networking to dictate where the process should run. That feels rather like a NIC that would hash packets and pick the MSI-X based on that. However, as Andi discusses, when there is a process/thread doing more than one connection, picking a CPU based on address hashing will be like TweedleDee and TweedleDum telling Alice to go in opposite directions. Hence TOPS in 11.X. This time, when there is a "normal" lookup location in the path, the CPU where the application last accessed the socket is determined, and things shift over to that CPU. This then is the process (well, actually the scheduler) telling networking where it should do its work. That addresses the multiple connections per thread/process and still works just as well for 1-1. There are still issues if you have multiple threads/processes concurrently accessing the same socket/connection, but that one is much more rare. Nirvana I suppose would be the addition of a field in the header which could be used for the determination of where to process. A Transport Protocol option I suppose, maybe the IPv6 flow id, but knuth only knows if anyone would go for something along those lines. It does though mean that the "state" is per-packet without it having to be based on addressing information. Almost like RDMA arriving saying where the data goes, but this thing says where the processing should happen :) rick jones ^ permalink raw reply [flat|nested] 74+ messages in thread
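The TOPS idea Rick describes — the application, via the scheduler, deciding where packet processing happens — reduces to remembering one integer per socket. A speculative C sketch; none of these names exist in any stack, and they only illustrate the shape of the mechanism:

    #include <stdint.h>

    struct sock_lite {
        uint32_t flow_hash;
        int      last_app_cpu;              /* -1 until the app touches the socket */
    };

    /* called from the read/recv path with the CPU the application ran on */
    static void sock_note_app_cpu(struct sock_lite *sk, int cpu)
    {
        sk->last_app_cpu = cpu;
    }

    /* called from the RX demux: follow the application if we know where
     * it is, otherwise fall back to plain hash steering */
    static int sock_pick_rx_cpu(const struct sock_lite *sk, int nr_cpus)
    {
        if (sk->last_app_cpu >= 0)
            return sk->last_app_cpu;
        return (int)(sk->flow_hash % (uint32_t)nr_cpus);
    }

This handles the many-connections-per-thread case that pure hashing gets wrong, and degenerates to hash steering for the 1-1 case.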
* Re: RDMA will be reverted 2006-07-25 0:29 ` Rick Jones @ 2006-07-25 0:45 ` David Miller 2006-07-25 0:55 ` Rick Jones 2006-07-25 1:03 ` Rick Jones 2006-07-25 1:42 ` Andi Kleen 1 sibling, 2 replies; 74+ messages in thread From: David Miller @ 2006-07-25 0:45 UTC (permalink / raw) To: rick.jones2; +Cc: ak, rdreier, tom, netdev, akpm From: Rick Jones <rick.jones2@hp.com> Date: Mon, 24 Jul 2006 17:29:05 -0700 > Nirvana I suppose would be the addition of a field in the header > which could be used for the determination of where to process. A > Transport Protocol option I suppose, maybe the IPv6 flow id, but > knuth only knows if anyone would go for something along those lines. > It does though mean that the "state" is per-packet without it having > to be based on addressing information. Almost like RDMA arriving > saying where the data goes, but this thing says where the processing > should happen :) Since the full interpretation of the TCP timestamp option field value is largely local to the peer setting it, there is nothing wrong with stealing a few bits for destination cpu information. It would have to be done in such a way as to not make the PAWS tests fail by accident. But I think it's doable. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-25 0:45 ` David Miller @ 2006-07-25 0:55 ` Rick Jones 2006-07-25 1:04 ` Andi Kleen 2006-07-25 1:21 ` David Miller 1 sibling, 2 replies; 74+ messages in thread From: Rick Jones @ 2006-07-25 0:55 UTC (permalink / raw) To: David Miller; +Cc: ak, rdreier, tom, netdev, akpm David Miller wrote: > From: Rick Jones <rick.jones2@hp.com> > Date: Mon, 24 Jul 2006 17:29:05 -0700 > > >>Nirvana I suppose would be the addition of a field in the header >>which could be used for the determination of where to process. A >>Transport Protocol option I suppose, maybe the IPv6 flow id, but >>knuth only knows if anyone would go for something along those lines. >>It does though mean that the "state" is per-packet without it having >>to be based on addressing information. Almost like RDMA arriving >>saying where the data goes, but this thing says where the processing >>should happen :) > > > Since the full interpretation of the TCP timestamp option field value > is largely local to the peer setting it, there is nothing wrong with > stealing a few bits for destination cpu information. Even enough bits for 1024 or 2048 CPUs in the single system image? I have seen 1024 touted by SGI, and with things going so multi-core, perhaps 16384, while sounding initially bizarre, would be in the realm of the theoretically possible before tooooo long. > It would have to be done in such a way as to not make the PAWS > tests fail by accident. But I think it's doable. That would cover TCP; are there similarly fungible fields in SCTP or other ULPs? And if we were to want to get HW support for the thing, getting it adopted in a de jure standards body would probably be in order :) rick jones ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-25 0:55 ` Rick Jones @ 2006-07-25 1:04 ` Andi Kleen 2006-07-25 1:21 ` David Miller 1 sibling, 0 replies; 74+ messages in thread From: Andi Kleen @ 2006-07-25 1:04 UTC (permalink / raw) To: Rick Jones; +Cc: David Miller, rdreier, tom, netdev, akpm > Even enough bits for 1024 or 2048 CPUs in the single system image? MSI-X supports at most 255 sub-interrupts, and most hardware probably supports far fewer (e.g. 8 seems to be a popular number). It can always be hashed down to the existing CPUs. It's a nice idea, but I think standard hashing + processing in softirq would be worth a try first, at least. -Andi ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-25 0:55 ` Rick Jones 2006-07-25 1:04 ` Andi Kleen @ 2006-07-25 1:21 ` David Miller 2006-07-25 16:29 ` Rick Jones 1 sibling, 1 reply; 74+ messages in thread From: David Miller @ 2006-07-25 1:21 UTC (permalink / raw) To: rick.jones2; +Cc: ak, rdreier, tom, netdev, akpm From: Rick Jones <rick.jones2@hp.com> Date: Mon, 24 Jul 2006 17:55:24 -0700 > Even enough bits for 1024 or 2048 CPUs in the single system image? I have seen > 1024 touted by SGI, and with things going so multi-core, perhaps 16384, while > sounding initially bizarre, would be in the realm of the theoretically possible > before tooooo long. Read the RSS NDIS documents from Microsoft. You aren't going to want to demux to more than, say, 256 cpus for a single network adapter even on the largest machines. Therefore a simple translation table and/or "base cpu number" is sufficient to only need 8 bits of cpu identification. You will be limited by the number of MSI-X vectors also, for implementations demuxing directly to cpus using MSI-X selection. > That would cover TCP; are there similarly fungible fields in SCTP or > other ULPs? And if we were to want to get HW support for the thing, > getting it adopted in a de jure standards body would probably be in > order :) Microsoft never does this, neither do we. LRO came out of our own design, the network folks found it reasonable and thus they have started to implement it. The same is true for Microsoft's RSS stuff. It's a hardware interpretation, therefore it belongs in a driver API specification, nowhere else. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-25 1:21 ` David Miller @ 2006-07-25 16:29 ` Rick Jones 2006-07-25 16:32 ` Andi Kleen 0 siblings, 1 reply; 74+ messages in thread From: Rick Jones @ 2006-07-25 16:29 UTC (permalink / raw) To: David Miller; +Cc: ak, rdreier, tom, netdev, akpm David Miller wrote: > From: Rick Jones <rick.jones2@hp.com> > Date: Mon, 24 Jul 2006 17:55:24 -0700 > > >>Even enough bits for 1024 or 2048 CPUs in the single system image? I have seen >>1024 touted by SGI, and with things going so multi-core, perhaps 16384, while >>sounding initially bizarre, would be in the realm of the theoretically possible >>before tooooo long. > > > Read the RSS NDIS documents from Microsoft. I'll see about hunting them down. > You aren't going to want > to demux to more than, say, 256 cpus for a single network adapter even > on the largest machines. I suppose, it just seems to tweak _small_ alarms in my intuition - maybe because it still sounds like networking telling the scheduler where to run threads of execution, and even though I'm a networking guy I seem to have the notion that it should be the other way 'round. >>That would cover TCP; are there similarly fungible fields in SCTP or >>other ULPs? And if we were to want to get HW support for the thing, >>getting it adopted in a de jure standards body would probably be in >>order :) > > > Microsoft never does this, neither do we. LRO came out of our own > design, the network folks found it reasonable and thus they have > started to implement it. The same is true for Microsoft's RSS stuff. > > It's a hardware interpretation, therefore it belongs in a driver API > specification, nowhere else. It may be a hardware interpretation, but doesn't it have non-trivial system implications - where one runs threads/processes etc? rick jones ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-25 16:29 ` Rick Jones @ 2006-07-25 16:32 ` Andi Kleen 0 siblings, 0 replies; 74+ messages in thread From: Andi Kleen @ 2006-07-25 16:32 UTC (permalink / raw) To: Rick Jones; +Cc: David Miller, rdreier, tom, netdev, akpm > It may be a hardware interpretation, but doesn't it have non-trivial system > implications - where one runs threads/processes etc? Only if you do process-context RX processing. If you choose not to, it doesn't have much influence. -Andi ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-25 0:45 ` David Miller 2006-07-25 0:55 ` Rick Jones @ 2006-07-25 1:03 ` Rick Jones 1 sibling, 0 replies; 74+ messages in thread From: Rick Jones @ 2006-07-25 1:03 UTC (permalink / raw) To: David Miller; +Cc: ak, rdreier, tom, netdev, akpm > It would have to be done in such a way as to not make the PAWS > tests fail by accident. But I think it's doable. CPU ID and higher-order generation number such that whenever the process migrates to a lower-numbered CPU, the generation number is bumped to make the timestamp larger than before? rick jones ^ permalink raw reply [flat|nested] 74+ messages in thread
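Rick's encoding can be written down in a dozen lines. A speculative sketch, assuming the ~256-CPU limit discussed above; the key property is that PAWS only needs the values a sender emits to be non-decreasing, which the generation bump preserves. Every name is invented for illustration.

    #include <stdint.h>

    #define CPU_BITS 8                      /* per the 8-bit cpu id discussed above */
    #define CPU_MASK ((1u << CPU_BITS) - 1)

    struct ts_state {
        uint32_t gen;                       /* bumped so encoded values never decrease */
        uint32_t last_cpu;
    };

    /* sender: clock in the high bits, CPU id in the low bits */
    static uint32_t ts_encode(struct ts_state *s, uint32_t clock, uint32_t cpu)
    {
        if (cpu < s->last_cpu)
            s->gen++;                       /* migration to a lower-numbered CPU */
        s->last_cpu = cpu;
        return ((clock + s->gen) << CPU_BITS) | (cpu & CPU_MASK);
    }

    /* receiver: recover the CPU hint from the echoed timestamp */
    static uint32_t ts_decode_cpu(uint32_t tsval)
    {
        return tsval & CPU_MASK;
    }

Bumping the generation adds 2^CPU_BITS to the encoded value, which always exceeds any drop in the low CPU bits, so the sequence stays monotonic across migrations.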
* Re: RDMA will be reverted 2006-07-25 0:29 ` Rick Jones 2006-07-25 0:45 ` David Miller @ 2006-07-25 1:42 ` Andi Kleen 1 sibling, 0 replies; 74+ messages in thread From: Andi Kleen @ 2006-07-25 1:42 UTC (permalink / raw) To: Rick Jones; +Cc: David Miller, rdreier, tom, netdev, akpm On Tuesday 25 July 2006 02:29, Rick Jones wrote: > This all sounds like the discussions we had within HP-UX between 10.20 and 11.0 > concerning Inbound Packet Scheduling vs Thread Optimized Packet Scheduling. We've also been talking about this for many years, just no code so far. Or rather, Linux has so far left the job to manual tuning. -Andi ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-24 22:06 ` David Miller 2006-07-24 23:10 ` Andi Kleen @ 2006-07-25 5:51 ` Evgeniy Polyakov 2006-07-25 6:48 ` David Miller 1 sibling, 1 reply; 74+ messages in thread From: Evgeniy Polyakov @ 2006-07-25 5:51 UTC (permalink / raw) To: David Miller; +Cc: rdreier, ak, tom, netdev, akpm On Mon, Jul 24, 2006 at 03:06:13PM -0700, David Miller (davem@davemloft.net) wrote: > Don't get too excited about VJ netchannels, more and more roadblocks > to their practicality are being found every day. > > For example, my idea to allow ESTABLISHED TCP socket demux to be done > before netfilter is flawed. Connection tracking and NAT can change > the packet ID and loop it back to us to hit exactly an ESTABLISHED TCP > socket, therefore we must always hit netfilter first. There is no problem with netfilter and process-context processing - when an skb is removed from the hardware list/array and is being processed by netfilter in a netchannel (or in process context in general), there is no problem if a changed skb is rerouted into a different queue and state. > All the original costs of route, netfilter, TCP socket lookup all > reappear as we make VJ netchannels fit all the rules of real practical > systems, eliminating their gains entirely. I will also note in > passing that papers on related ideas, such as the Exokernel stuff, are > very careful to not address the issue of how practical 1) their demux > engine is and 2) the negative side effects of userspace TCP > implementations. For an example of the latter, if you have some 1GB > JAVA process you do not want to wake that monster up just to do some > ACK processing or TCP window updates, yet if you don't you violate > TCP's rules and risk spurious unnecessary retransmits. I still plan to continue the userspace implementation. If the gigantic-java-monster (tm) is going to read some data, it has been awakened already, thus it is in memory (with the linked TCP lib), so there is zero overhead. > Furthermore, the VJ netchannel gains can be partially obtained from > generic stateless facilities that we are going to get anyways. > Networking chips supporting multiple MSI-X vectors, chosen by hashing > the flow ID, can move TCP processing to "end nodes" which are cpu > threads in this case, by having each such MSI-X vector target a > different cpu thread. And if that CPU is very busy? Linux should somehow tell the NIC that some CPUs are valid and some are not right now, not in a second, so the scheduler must be tightly bound to network internals. Just my 2 coins. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-25 5:51 ` Evgeniy Polyakov @ 2006-07-25 6:48 ` David Miller 2006-07-25 6:59 ` Evgeniy Polyakov 0 siblings, 1 reply; 74+ messages in thread From: David Miller @ 2006-07-25 6:48 UTC (permalink / raw) To: johnpol; +Cc: rdreier, ak, tom, netdev, akpm From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Date: Tue, 25 Jul 2006 09:51:28 +0400 > On Mon, Jul 24, 2006 at 03:06:13PM -0700, David Miller (davem@davemloft.net) wrote: > > Furthermore, the VJ netchannel gains can be partially obtained from > > generic stateless facilities that we are going to get anyways. > > Networking chips supporting multiple MSI-X vectors, chosen by hashing > > the flow ID, can move TCP processing to "end nodes" which are cpu > > threads in this case, by having each such MSI-X vector target a > > different cpu thread. > > And if that CPU is very busy? > Linux should somehow tell the NIC that some CPUs are valid and some are not > right now, not in a second, so the scheduler must be tightly bound to > network internals. Yes, it is a research problem. Most of the time, even a stateless version will improve things. From another viewpoint, even in the worst case, it can be no worse than the current situation. :) BTW, such dynamic remapping is provided for in the NDIS interfaces. There is an indexing table that is gone through using a computed hash to get a "cpu number". ^ permalink raw reply [flat|nested] 74+ messages in thread
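The NDIS-style indexing table David mentions is tiny: the NIC hashes the flow, the hash indexes a host-writable table, and the table entry names the target CPU. A minimal sketch, with the size and names invented purely for illustration:

    #include <stdint.h>

    #define RSS_TABLE_SIZE 128                  /* small, host-writable */

    static uint8_t rss_table[RSS_TABLE_SIZE];   /* entry = target CPU */

    /* the NIC computes the hash over the flow's addresses and ports */
    static unsigned int rss_pick_cpu(uint32_t flow_hash)
    {
        return rss_table[flow_hash % RSS_TABLE_SIZE];
    }

    /* dynamic remapping: rewriting an entry shifts every flow that hashes
     * to it onto another CPU, with no per-flow state on the NIC at all */
    static void rss_remap(unsigned int entry, uint8_t new_cpu)
    {
        rss_table[entry % RSS_TABLE_SIZE] = new_cpu;
    }

The indirection is what keeps the scheme stateless: the hardware never tracks flows, only a fixed-size table the host can rewrite at will.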
* Re: RDMA will be reverted 2006-07-25 6:48 ` David Miller @ 2006-07-25 6:59 ` Evgeniy Polyakov 2006-07-25 7:33 ` David Miller 0 siblings, 1 reply; 74+ messages in thread From: Evgeniy Polyakov @ 2006-07-25 6:59 UTC (permalink / raw) To: David Miller; +Cc: rdreier, ak, tom, netdev, akpm On Mon, Jul 24, 2006 at 11:48:53PM -0700, David Miller (davem@davemloft.net) wrote: > > And if that CPU is very busy? > > Linux should somehow tell the NIC that some CPUs are valid and some are not > > right now, not in a second, so the scheduler must be tightly bound to > > network internals. > > Yes, it is a research problem. > > Most of the time, even a stateless version will improve things. > From another viewpoint, even in the worst case, it can be no > worse than the current situation. :) > > BTW, such dynamic remapping is provided for in the NDIS interfaces. > There is an indexing table that is gone through using a computed hash to > get a "cpu number". I think we should have the Linux scheduler export some easily accessed CPU statistics, so that info can be used by the irq layer/protocol processing. As a side note, completely unrelated to either my work or others' :) - I think it is a nano-optimisation - we get a bit of performance here, and lose that bit in another place. When a bag is filled, there is not much sense in rearranging the stuff inside to be able to place another item - it is better to buy a new bag. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 74+ messages in thread
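If the scheduler did export such statistics, the feedback loop Evgeniy wants could be as simple as periodically rebuilding the indirection table from CPU load. A hedged sketch only - cpu_load_pct() is a hypothetical export that exists nowhere, and a real policy would need hysteresis so flows are not reordered constantly:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    extern unsigned int cpu_load_pct(unsigned int cpu);   /* hypothetical scheduler export */

    /* hand table entries to lightly loaded CPUs; skip saturated ones */
    static void rss_rebalance(uint8_t *table, size_t tbl_size,
                              unsigned int nr_cpus)
    {
        size_t i = 0;

        while (i < tbl_size) {
            bool placed = false;

            for (unsigned int cpu = 0; cpu < nr_cpus && i < tbl_size; cpu++) {
                if (cpu_load_pct(cpu) > 90)
                    continue;               /* nearly saturated, skip */
                table[i++] = (uint8_t)cpu;
                placed = true;
            }
            if (!placed)                    /* everyone busy: plain round-robin */
                for (unsigned int cpu = 0; i < tbl_size; cpu++)
                    table[i++] = (uint8_t)(cpu % nr_cpus);
        }
    }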
* Re: RDMA will be reverted 2006-07-25 6:59 ` Evgeniy Polyakov @ 2006-07-25 7:33 ` David Miller 2006-07-25 7:42 ` Evgeniy Polyakov 0 siblings, 1 reply; 74+ messages in thread From: David Miller @ 2006-07-25 7:33 UTC (permalink / raw) To: johnpol; +Cc: rdreier, ak, tom, netdev, akpm From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Date: Tue, 25 Jul 2006 10:59:21 +0400 > As a side note, completely unrelated to either my work or others' :) - > I think it is a nano-optimisation - we get a bit of performance here, > and lose that bit in another place. > When a bag is filled, there is not much sense in rearranging the stuff > inside to be able to place another item - it is better to buy a new bag. It is a matter of what the viewpoint is, I suppose. I think in this specific case it might turn out to be better for the scheduler to respond to what the device throws at it, rather than the other way around. And in that case we need no feedback from the scheduler to the cpu demux engine. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-25 7:33 ` David Miller @ 2006-07-25 7:42 ` Evgeniy Polyakov 0 siblings, 0 replies; 74+ messages in thread From: Evgeniy Polyakov @ 2006-07-25 7:42 UTC (permalink / raw) To: David Miller; +Cc: rdreier, ak, tom, netdev, akpm On Tue, Jul 25, 2006 at 12:33:44AM -0700, David Miller (davem@davemloft.net) wrote: > From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> > Date: Tue, 25 Jul 2006 10:59:21 +0400 > > > As a side note, completely unrelated to either my work or others' :) - > > I think it is a nano-optimisation - we get a bit of performance here, > > and lose that bit in another place. > > When a bag is filled, there is not much sense in rearranging the stuff > > inside to be able to place another item - it is better to buy a new bag. > > It is a matter of what the viewpoint is, I suppose. Definitely. > I think in this specific case it might turn out to be > better for the scheduler to respond to what the device > throws at it, rather than the other way around. And > in that case we need no feedback from the scheduler to > the cpu demux engine. That's exactly the one-bit lose/gain - if the CPU is loafing we get a gain, and lose otherwise - so instead of generally predictable steady behaviour we can end up with bursty shapes. Actually, without real tests all this is just handwaving, so let's see when modern NICs get that capability, so that network softirq scheduling can be changed accordingly. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-01 14:26 ` Andi Kleen 2006-07-04 18:34 ` Andy Gay 2006-07-04 20:34 ` Roland Dreier @ 2006-07-05 17:09 ` Tom Tucker 2006-07-05 17:50 ` Steve Wise 2006-07-24 22:23 ` David Miller 2 siblings, 2 replies; 74+ messages in thread From: Tom Tucker @ 2006-07-05 17:09 UTC (permalink / raw) To: Andi Kleen; +Cc: David Miller, rdreier, netdev, akpm On Sat, 2006-07-01 at 16:26 +0200, Andi Kleen wrote: > On Saturday 01 July 2006 01:01, Tom Tucker wrote: > > On Fri, 2006-06-30 at 14:16 -0700, David Miller wrote: > > > > > The TOE folks have tried to submit their hooks and drivers > > > on several occasions, and we've rejected it every time. > > > > iWARP != TOE > > Perhaps a good start of that discussion David asked for would > be if you could give us an overview of the differences > and how you avoid the TOE problems. I think Roland already gave the high-level overview. For those interested in some of the details, the API for iWARP transports was originally conceived independently from IB and is documented in the RDMAC Verbs Specification found here: http://www.rdmaconsortium.org/home/draft-hilland-iwarp-verbs-v1.0-RDMAC.pdf The protocols, etc... are available here: http://www.ietf.org/html.charters/rddp-charter.html As Roland mentioned, the RDMAC verbs are *very* similar to the IB verbs, and so when we were thinking about how to design an API for iWARP we concluded it would be best to leverage the tremendous amount of work already done for IB by OpenFabrics and then work iteratively to extend this API to include features unique to iWARP. This work has been ongoing since September of 2005. There is an open source svn repository available for the iWARP source at https://openib.org/svn/gen2/branches/iwarp. There is also an open source NFS over RDMA implementation for Linux available here: http://sourceforge.net/projects/nfs-rdma. So how do we avoid the TOE pitfalls with iWARP? I think it depends on the pitfall. At the low level: - Stale Network/Address Information: Path MTU Change, ICMP Redirect and ARP next hop changes need netlink notifier events so that hardware can be updated when they change. I see this support as an extension (new events) to an existing service and a relatively low level of "parallel stack integration". iSCSI and IB could also benefit from these events. - Port Space Collision, i.e. socket app and rdma/iWARP apps collide on a port number: The RDMA CMA needs to be able to allocate and de-allocate port numbers; however, the services that do this today are not exported and would need some minor tweaking. iSCSI and IB benefit from these services as well. - netfilter rules, syn-flood, conn-rate, etc.... You pointed out that if connection establishment were done in the native stack (SYN, SYN/ACK), this would account for the bulk of the netfilter utility; however, this probably results in falling into many of the TOE traps people have issue with. WRT http://linux-net.osdl.org/index.php/TOE Security Updates "A TOE net stack is closed source firmware. Linux engineers have no way to fix security issues that arise. As a result, only non-TOE users will receive security updates, leaving random windows of vulnerability for each TOE NIC's users." - A Linux security update may or may not be relevant to a vendor's implementation. - If a vendor's implementation has a security issue then the customer must rely on the vendor to fix it. This is no less true for iWARP than for any adapter.
Point-in-time Solution "Each TOE NIC has a limited lifetime of usefulness, because system hardware rapidly catches up to TOE performance levels, and eventually exceeds TOE performance levels. We saw this with 10mbit TOE, 100mbit TOE, gigabit TOE, and soon with 10gig TOE." - iWARP needs to do protocol processing in order to validate and evaluate TCP payload in advance of direct data placement. This requirement is independent of CPU speed. Different Network Behavior "System administrators are quite familiar with how the Linux network stack interoperates with the world at large. TOE is a black box, each NIC requires re-examination of network behavior. Network scanners and analysis tools must be updated, or they will provide faulty analysis." - Native Linux tools like tcpdump, netstat, etc... will not work as expected. - Network analyzers such as Finisar, etc... will work just fine. Performance "Experience has shown that TOE implementations require additional work (programming the hardware, hardware-specific socket manipulation) to set up and tear down connections. For connection intensive protocols such as HTTP, TOE often underperforms." - I suspect that connection rates for RDMA adapters fall well below the rates attainable with a dumb device. That said, all of the RDMA applications that I know of are not connection intensive. Even for TOE, the later HTTP versions make connection rates less of an issue. Hardware-specific limits "TOE NICs are more resource limited than your overall computer system. This is most readily apparent under load, when trying to support thousands of simultaneous connections. TOE NICs simply do not have the memory resources to buffer thousands of connections, much less have the CPU power to handle such loads. Further, each TOE NIC has different resource limitations (often unpublished, only to be discovered at the worst moments)." - Any hardware device has this issue, and so does iWARP. "Once resources are exhausted, TOE will either fall back to 100% software net stack, defeating the purpose of TOE, or will deny service to additional clients." - A depleted iWARP adapter will simply fail the request. There is no parallel iWARP stack to fall back on. Resource-based denial-of-service attacks "If an attacker can discover the TOE NIC model in use, they can use this information to enable resource-based algorithmic attacks. For example, a SYN flood could potentially use up all TOE resources in a matter of seconds. The TOE NIC will either stop accepting connections (complete DoS), or will constantly bounce back to the software net stack." - True of iWARP too. RFC compliance "Linux is the most RFC-compliant network stack available. TOE can only diminish this. Further, as a black box, each TOE NIC will have a different level of RFC compliance, and different TCP/IP features they do/don't support." - True of iWARP too. Linux features "TOE is by definition poorly integrated into Linux. TOE NICs will not provide netfilter, packet scheduling, QoS, and many other features that Linux users depend on. Or if they do provide this, they implement the features in a vendor-specific manner. The featureset becomes vendor-specific." - This is the problem we're trying to solve...incrementally and responsibly. Requires vendor-specific tools "In order to configure a TOE NIC, hardware-specific tools are usually required. This dramatically increases support costs."
- OpenFabrics is an attempt to solve this not only across vendors, but also across transports (at this time IB and iWARP). Poor user support "Linux engineers cannot provide an adequate level of support for TOE users, and must instead refer users to the vendor -- who in all likelihood cares more about non-Linux operating systems." - This will certainly be true for iWARP early on. Short term kernel maintenance "Supporting TOE requires massive, heavily invasive hooks into the network stack. This increases the kernel maintenance burden on Linux engineers, to support a solution Linux engineers have no control over." - iWARP does not use sockets and does not share data structures with the TCP stack. - It is not my opinion, however, that the patches in question consist of "massive, heavily invasive hooks into the network stack". Long term user support "Linux has been in existence for over a decade, and some pieces of decade-old hardware continue to be used and supported. In contrast, most hardware vendors end-of-life (stop supporting) their hardware after just a few years. For most hardware vendors, the sales of old hardware simply do not justify dedicating engineers to Linux support for many years." - If the hooks are not hideous and invasive then support should not be any more onerous than for any other hardware device. Long term kernel maintenance "Similarly, kernel engineers must support TOE for as long as users continue to use the hardware. Hardware vendors disappear, get bought, or simply disappear (go out of business) during our maintenance timeframe. Once a hardware vendor loses interest in Linux, TOE NICs will cease to receive security updates, and hardware issues become incredibly difficult to debug. Each new generation of system hardware often requires re-examination of hardware drivers, a task made far more difficult without a hardware vendor to receive questions." - This seems like a general rant against any hardware device, and so it applies to iWARP too. Eliminates global system view "With TOE, the system no longer has a complete picture of all resources used by network connections. Some connections are software-based, and thus limited by existing policy controls (such as per-socket memory limits). Other connections are managed by TOE, and these details are hidden. As such, the VM cannot adequately manage overall socket buffer memory usage, TOE-enabled connections cannot be rate-limited by the same controls as software-based connections, per-user socket security limits may be ignored, etc." - iWARP doesn't use socket buffers. "Linux has several TCP Congestion Control algorithms available. For TOE connections, this would no longer be true, all the congestion control would be done by proprietary vendor specific algorithms on the card." - I don't know of any proprietary congestion control algorithms built into iWARP and doubt they would work between vendors. There is an iWARP Interoperability Lab at UNH that tests this kind of thing. > -Andi ^ permalink raw reply [flat|nested] 74+ messages in thread
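The port-space point above is worth making concrete. Nothing below exists in the kernel; it is only a toy picture of the kind of service Tom says would need exporting, so that an RDMA CM and the socket layer draw ports from one shared pool:

    #include <stdbool.h>
    #include <stdint.h>

    static bool port_in_use[65536];     /* stand-in for the shared bind table */

    /* an RDMA CM asks for a port from the same space sockets use */
    static int rdma_reserve_port(uint16_t port)
    {
        if (port_in_use[port])
            return -1;                  /* would collide with a socket user */
        port_in_use[port] = true;
        return 0;
    }

    static void rdma_release_port(uint16_t port)
    {
        port_in_use[port] = false;
    }

The hard part, of course, is not the data structure but agreeing that the kernel should export such allocate/release entry points at all.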
* Re: RDMA will be reverted 2006-07-05 17:09 ` Tom Tucker @ 2006-07-05 17:50 ` Steve Wise 2006-07-24 22:25 ` David Miller 0 siblings, 1 reply; 74+ messages in thread From: Steve Wise @ 2006-07-05 17:50 UTC (permalink / raw) To: Tom Tucker; +Cc: Andi Kleen, David Miller, rdreier, netdev, akpm On Wed, 2006-07-05 at 12:09 -0500, Tom Tucker wrote: > On Sat, 2006-07-01 at 16:26 +0200, Andi Kleen wrote: > > On Saturday 01 July 2006 01:01, Tom Tucker wrote: > > > On Fri, 2006-06-30 at 14:16 -0700, David Miller wrote: > > > > > > > The TOE folks have tried to submit their hooks and drivers > > > > on several occasions, and we've rejected it every time. > > > > > > iWARP != TOE > > > > Perhaps a good start of that discussion David asked for would > > be if you could give us an overview of the differences > > and how you avoid the TOE problems. > > I think Roland already gave the high-level overview. For those > interested in some of the details, the API for iWARP transports was > originally conceived independently from IB and is documented in the > RDMAC Verbs Specification found here: > > http://www.rdmaconsortium.org/home/draft-hilland-iwarp-verbs-v1.0-RDMAC.pdf > > The protocols, etc... are available here: > http://www.ietf.org/html.charters/rddp-charter.html > > As Roland mentioned, the RDMAC verbs are *very* similar to the IB verbs, > and so when we were thinking about how to design an API for iWARP we > concluded it would be best to leverage the tremendous amount of work > already done for IB by OpenFabrics and then work iteratively to extend > this API to include features unique to iWARP. This work has been ongoing > since September of 2005. > > There is an open source svn repository available for the iWARP source at > https://openib.org/svn/gen2/branches/iwarp. > > There is also an open source NFS over RDMA implementation for Linux > available here: http://sourceforge.net/projects/nfs-rdma. > > > So how do we avoid the TOE pitfalls with iWARP? I think it depends on > the pitfall. At the low level: > > - Stale Network/Address Information: Path MTU Change, ICMP Redirect > and ARP next hop changes need netlink notifier events so that hardware > can be updated when they change. I see this support as an extension (new > events) to an existing service and a relatively low level of "parallel > stack integration". iSCSI and IB could also benefit from these events. > > - Port Space Collision, i.e. socket app and rdma/iWARP apps collide on > a port number: The RDMA CMA needs to be able to allocate and de-allocate > port numbers; however, the services that do this today are not exported > and would need some minor tweaking. iSCSI and IB benefit from these > services as well. > > - netfilter rules, syn-flood, conn-rate, etc.... You pointed out that > if connection establishment were done in the native stack (SYN, > SYN/ACK), this would account for the bulk of the netfilter utility; > however, this probably results in falling into many of the TOE traps > people have issue with. However, iWARP devices _could_ integrate with netfilter. For most devices the connection request event (SYN) gets passed up to the host driver. So the driver can enforce filter rules then. Also, I think a notification-type mechanism could be used to trigger iWARP drivers to go re-apply filter rules on existing connections and kill ones that should be filtered. I'm not that familiar yet with netfilter, but I think this could all be done... Steve.
^ permalink raw reply [flat|nested] 74+ messages in thread
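Steve's scheme, sketched naively: consult policy when the RNIC hands up a connection request, and sweep established offloaded connections whenever rules change. filter_allows() and the structures are hypothetical stand-ins, not netfilter or driver APIs - and David's reply below explains why this is not sufficient for NAT and conntrack.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct conn_tuple {
        uint32_t saddr, daddr;
        uint16_t sport, dport;
    };

    extern bool filter_allows(const struct conn_tuple *t);   /* hypothetical policy check */

    struct offloaded_conn {
        struct conn_tuple tuple;
        bool active;
    };

    /* connection request (SYN) event handed up by the RNIC */
    static bool iwarp_accept_connreq(const struct conn_tuple *t)
    {
        return filter_allows(t);
    }

    /* re-validate offloaded connections after a ruleset change */
    static void iwarp_revalidate(struct offloaded_conn *conns, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (conns[i].active && !filter_allows(&conns[i].tuple))
                conns[i].active = false;    /* tear the connection down */
    }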
* Re: RDMA will be reverted 2006-07-05 17:50 ` Steve Wise @ 2006-07-24 22:25 ` David Miller 2006-07-24 22:47 ` Caitlin Bestler 0 siblings, 1 reply; 74+ messages in thread From: David Miller @ 2006-07-24 22:25 UTC (permalink / raw) To: swise; +Cc: tom, ak, rdreier, netdev, akpm From: Steve Wise <swise@opengridcomputing.com> Date: Wed, 05 Jul 2006 12:50:34 -0500 > However, iWARP devices _could_ integrate with netfilter. For most > devices the connection request event (SYN) gets passed up to the host > driver. So the driver can enforce filter rules then. This doesn't work. In order to handle things like NAT and connection tracking properly you must even allow ESTABLISHED state packets to pass through netfilter. Netfilter can have rules such as "NAT port 200 to 300, leave the other fields alone" and your suggested scheme cannot handle this. ^ permalink raw reply [flat|nested] 74+ messages in thread
* RE: RDMA will be reverted 2006-07-24 22:25 ` David Miller @ 2006-07-24 22:47 ` Caitlin Bestler 0 siblings, 0 replies; 74+ messages in thread From: Caitlin Bestler @ 2006-07-24 22:47 UTC (permalink / raw) To: David Miller, swise; +Cc: tom, ak, rdreier, netdev, akpm netdev-owner@vger.kernel.org wrote: > From: Steve Wise <swise@opengridcomputing.com> > Date: Wed, 05 Jul 2006 12:50:34 -0500 > >> However, iWARP devices _could_ integrate with netfilter. For most >> devices the connection request event (SYN) gets passed up to the host >> driver. So the driver can enforce filter rules then. > > This doesn't work. In order to handle things like NAT and > connection tracking properly you must even allow ESTABLISHED > state packets to pass through netfilter. > > Netfilter can have rules such as "NAT port 200 to 300, leave > the other fields alone" and your suggested scheme cannot handle this. This is totally irrelevant. But it does work. First, an RDMA connection, once established, associates a TCP connection *as identified external to the box* with an RDMA endpoint (conventionally a "QP"). Performing a NAT translation on a TCP packet would certainly be within the capabilities of an RNIC, but it would be pointless. The relabeled TCP segment would be associated with the same QP. Once an RDMA connection is established, the individual TCP segments are only of interest to the RDMA endpoint. Payload is delivered through the RDMA interface (the same one already used for InfiniBand). The purpose of integration with netfilter would be to ensure that no RDMA connection could exist, or persist, if netfilter would not allow the TCP connection to be created. That is not a matter of packet filtering, it is a matter of administrative consistency. If someone uses netfilter to block connections from a given IP netmask then they reasonably expect that there will be no connections with any host within that IP netmask. They do not expect exceptions for RDMA, iSCSI or InfiniBand. The existing connection management interfaces in openfabrics, designed to support both InfiniBand and iWARP, could naturally be extended to validate all RDMA connections using an IP address with netfilter. This would be of real value. The only real value of a rule such as "NAT port 200 to 300" is to allow a remote peer to establish a connection to port 200 with a local listener using port 300. That *can* be supported without actually manipulating the header in each TCP packet. It is also possible to discuss other netfilter functionality that serves a valid end-user purpose, such as counting packets. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-05 17:09 ` Tom Tucker 2006-07-05 17:50 ` Steve Wise @ 2006-07-24 22:23 ` David Miller 2006-07-24 22:57 ` Caitlin Bestler 1 sibling, 1 reply; 74+ messages in thread From: David Miller @ 2006-07-24 22:23 UTC (permalink / raw) To: tom; +Cc: ak, rdreier, netdev, akpm From: Tom Tucker <tom@opengridcomputing.com> Date: Wed, 05 Jul 2006 12:09:42 -0500 > "A TOE net stack is closed source firmware. Linux engineers have no way > to fix security issues that arise. As a result, only non-TOE users will > receive security updates, leaving random windows of vulnerability for > each TOE NIC's users." > > - A Linux security update may or may not be relevant to a vendor's > implementation. > > - If a vendor's implementation has a security issue then the customer > must rely on the vendor to fix it. This is no less true for iWARP than > for any adapter. This isn't how things actually work. Users have a computer, and they can rightly expect the community to help them solve problems that occur in the upstream kernel. When a bug is found and the person is using NIC X, we don't necessarily forward the bug report to the vendor of NIC X. Instead we try to fix the bug. Many chip drivers are maintained by people who do not work for the company that makes the chip, and this works just fine. If only the chip vendor can fix a security problem, this makes Linux less agile to fix. Every aspect of a problem on a Linux system that cannot be fixed entirely by the community is a net negative for Linux. > - iWARP needs to do protocol processing in order to validate and > evaluate TCP payload in advance of direct data placement. This > requirement is independent of CPU speed. Yet, RDMA itself is just an optimization meant to deal with limitations of cpu and memory speed. You can rephrase the situation in whatever way suits your argument, but it does not make the core issue go away :) > - I suspect that connection rates for RDMA adapters fall well below the > rates attainable with a dumb device. That said, all of the RDMA > applications that I know of are not connection intensive. Even for TOE, > the later HTTP versions make connection rates less of an issue. This is a very naive evaluation of the situation. Yes, newer versions of protocols such as HTTP make the per-client connection burden lower, but the number of clients will increase in time to more than make up for whatever gains are seen due to this. And then you have protocols which by design are connection heavy, and rightly so, such as bittorrent. Being able to handle enormous numbers of connections, with extreme scalability and low latency, is an absolute requirement of any modern day serious TCP stack. And this requirement is not going away. Wishing this requirement away due to HTTP persistent connections is very unrealistic, at best. > - This is the problem we're trying to solve...incrementally and > responsibly. You can't. See my email to Roland about why even VJ net channels are found to be impractical. To support netfilter properly, you must traverse the whole netfilter stack, because NAT can rewrite packets, yet still make them destined for the local system, and thus they will have a different identity for connection demux by the time the TCP stack sees the packet. All of these transformations occur between normal packet receive and the TCP stack. You would therefore need to put your card between netfilter and TCP in the packet input path, and at that point why bother with the stateful card at all?
The fact is that stateless approaches will always be better than stateful things because you cannot replicate the functionality we have in the Linux stack without replicating 10 years of work into your chip's firmware. At that point you should just run Linux on your NIC since that is what you are effectively doing :) In conversations such as these, it helps us a lot if you can be frank and honest about the true absolute limitations of your technology. I can see that your viewpoint is tainted when I hear things such as HTTP persistent connections being used as a reason why high TCP connection rates won't matter in the future. Such assertions are understood to be patently false by anyone who understands TCP and how it is used in the real world. ^ permalink raw reply [flat|nested] 74+ messages in thread
* RE: RDMA will be reverted 2006-07-24 22:23 ` David Miller @ 2006-07-24 22:57 ` Caitlin Bestler 0 siblings, 0 replies; 74+ messages in thread From: Caitlin Bestler @ 2006-07-24 22:57 UTC (permalink / raw) To: David Miller, tom; +Cc: ak, rdreier, netdev, akpm netdev-owner@vger.kernel.org wrote: > From: Tom Tucker <tom@opengridcomputing.com> > Date: Wed, 05 Jul 2006 12:09:42 -0500 > >> "A TOE net stack is closed source firmware. Linux engineers have no >> way to fix security issues that arise. As a result, only non-TOE >> users will receive security updates, leaving random windows of >> vulnerability for each TOE NIC's users." >> >> - A Linux security update may or may not be relevant to a vendor's >> implementation. >> >> - If a vendor's implementation has a security issue then the customer >> must rely on the vendor to fix it. This is no less true for iWARP >> than for any adapter. > > This isn't how things actually work. > > Users have a computer, and they can rightly expect the > community to help them solve problems that occur in the > upstream kernel. > > When a bug is found and the person is using NIC X, we don't > necessarily forward the bug report to the vendor of NIC X. > Instead we try to fix the bug. Many chip drivers are > maintained by people who do not work for the company that > makes the chip, and this works just fine. > > If only the chip vendor can fix a security problem, this > makes Linux less agile to fix. Every aspect of a problem on a > Linux system that cannot be fixed entirely by the community > is a net negative for Linux. > >> - iWARP needs to do protocol processing in order to validate and >> evaluate TCP payload in advance of direct data placement. This >> requirement is independent of CPU speed. > > Yet, RDMA itself is just an optimization meant to deal with > limitations of cpu and memory speed. You can rephrase the > situation in whatever way suits your argument, but it does > not make the core issue go away :) > RDMA is a protocol that allows the application to more precisely state the actual ordering requirements. It improves the end-to-end interactions and has value over a protocol with only byte or message stream semantics regardless of local interface efficiencies. See http://ietf.org/internet-drafts/draft-ietf-rddp-applicability-08.txt In any event, isn't the value of an RDMA interface to applications already settled? The question is how best to integrate the usage of IP addresses with the kernel. The inability to validate the low-level packet processing in open source code is a limitation of *all* RDMA solutions; the transport layer of InfiniBand is just as offloaded as it is for iWARP. The patches proposed are intended to support integrated connection management for RDMA connections using IP addresses, no matter what the underlying transport is. The only difference is that *all* iWARP connections use IP addresses. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-06-30 23:01 ` Tom Tucker 2006-07-01 14:26 ` Andi Kleen @ 2006-07-01 21:45 ` David Miller 2006-07-04 20:34 ` Roland Dreier 1 sibling, 1 reply; 74+ messages in thread From: David Miller @ 2006-07-01 21:45 UTC (permalink / raw) To: tom; +Cc: rdreier, netdev, akpm From: Tom Tucker <tom@opengridcomputing.com> Date: Fri, 30 Jun 2006 18:01:43 -0500 > On Fri, 2006-06-30 at 14:16 -0700, David Miller wrote: > > > The TOE folks have tried to submit their hooks and drivers > > on several occasions, and we've rejected it every time. > > iWARP != TOE You are taking my comment out of context. And the fact that you removed the comment I am responding to shows that you really aren't following the conversation. Roland stated that it has never been the case that we have rejected adding support for a certain class of devices on the kinds of merits being discussed in this thread. And I'm saying that TOE is such a case where we have emphatically done so. So I am not saying iWARP or RDMA is equal to TOE, and if you had actually read this thread you would have understood that. You're just looking for cannon fodder in my emails. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-01 21:45 ` David Miller @ 2006-07-04 20:34 ` Roland Dreier 2006-07-05 18:27 ` David Miller 0 siblings, 1 reply; 74+ messages in thread From: Roland Dreier @ 2006-07-04 20:34 UTC (permalink / raw) To: David Miller; +Cc: tom, netdev, akpm > Roland stated that it has never been the case that we have > rejected adding support for a certain class of devices on the > kinds of merits being discussed in this thread. And I'm saying > that TOE is such a case where we have emphatically done so. Well, in the past it's seemed more like patches have been rejected not because of a blanket refusal to consider support for certain hardware, but rather because of issues with the patches themselves. e.g. last year when Chelsio submitted some TOE hooks, you wrote the following <http://marc.theaimsgroup.com/?l=linux-netdev&m=112382991506811&w=2> >> There is no way you're going to be allowed to call such deep TCP >> internals from your driver. >> This would mean that every time we wish to change the data structures >> and interfaces for TCP socket lookup, your drivers would need to >> change. which looks like a very good reason to reject the changes. > So I am not saying iWARP or RDMA is equal to TOE, and if you had > actually read this thread you would have understood that. There's definitely been quite a bit of conflation between the two in this thread, even if you're not responsible... - R. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-04 20:34 ` Roland Dreier @ 2006-07-05 18:27 ` David Miller 2006-07-05 20:29 ` Roland Dreier 0 siblings, 1 reply; 74+ messages in thread From: David Miller @ 2006-07-05 18:27 UTC (permalink / raw) To: rdreier; +Cc: tom, netdev, akpm From: Roland Dreier <rdreier@cisco.com> Date: Tue, 04 Jul 2006 13:34:30 -0700 > Well, in the past it's seemed more like patches have been rejected not > because of a blanket refusal to consider support for certain hardware, Then why in the world would we put up explicit web pages that say "TOE is bad, here's a list of reasons why" if we had any intention of ever adding support for these kinds of devices? http://linux-net.osdl.org/index.php/TOE It had nothing to do with a particular implementation of the patches, it had everything to do with fundamentals of the technology. It's going to be difficult to discuss RDMA and iWARP sanely unless you accept the indisputable fact that we've rejected TOE as a technology entirely, and it is an example of precedence for disallowing support for entire classes of hardware. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-05 18:27 ` David Miller @ 2006-07-05 20:29 ` Roland Dreier 2006-07-06 3:03 ` David Miller 0 siblings, 1 reply; 74+ messages in thread From: Roland Dreier @ 2006-07-05 20:29 UTC (permalink / raw) To: David Miller; +Cc: tom, netdev, akpm > Then why in the world would we put up explicit web pages that > say "TOE is bad, here's a list of reasons why" if we had any > intention of ever adding support for these kinds of devices? I think there's a little bit of a leap of logic there. Everyone agrees that winmodems are bad and yet there's still drivers/char/mwave. This TOE-phobia feels almost as if in the middle of one of those silly IDE vs. SCSI flamewars, someone declared that Linux shouldn't have IDE drivers. > It's going to be difficult to discuss RDMA and iWARP sanely unless you > accept the indisputable fact that we've rejected TOE as a technology > entirely, and it is an example of precedence for disallowing support > for entire classes of hardware. Fine. I don't think I have much more to add to the discussion anyway. The way forward seems to be to merge basic iWARP support that lives in drivers/infiniband, and then you can accept or reject things for better integration, like notifiers for routing changes. - R. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-05 20:29 ` Roland Dreier @ 2006-07-06 3:03 ` David Miller 2006-07-06 5:25 ` Tom Tucker 0 siblings, 1 reply; 74+ messages in thread From: David Miller @ 2006-07-06 3:03 UTC (permalink / raw) To: rdreier; +Cc: tom, netdev, akpm From: Roland Dreier <rdreier@cisco.com> Date: Wed, 05 Jul 2006 13:29:35 -0700 > The way forward seems to be to merge basic iWARP support that lives in > drivers/infiniband, and then you can accept or reject things for > better integration, like notifiers for routing changes. <sarcasm> Let's merge in drivers before the necessary infrastructure. </sarcasm> No, I think that's not the way forward. You build the foundation before the house, if the foundation cannot be built then you are wasting your time with the house idea. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-06 3:03 ` David Miller @ 2006-07-06 5:25 ` Tom Tucker 2006-07-06 14:08 ` Herbert Xu 2006-07-07 6:53 ` David Miller 0 siblings, 2 replies; 74+ messages in thread From: Tom Tucker @ 2006-07-06 5:25 UTC (permalink / raw) To: David Miller; +Cc: rdreier, netdev, akpm On Wed, 2006-07-05 at 20:03 -0700, David Miller wrote: > From: Roland Dreier <rdreier@cisco.com> > Date: Wed, 05 Jul 2006 13:29:35 -0700 > > > The way forward seems to be to merge basic iWARP support that lives in > > drivers/infiniband, and then you can accept or reject things for > > better integration, like notifiers for routing changes. > > <sarcasm> > Let's merge in drivers before the necessary infrastructure. > </sarcasm> > > No, I think that's not the way forward. You build the foundation > before the house, if the foundation cannot be built then you are > wasting your time with the house idea. We have been running NFS and other apps over iWARP 24x7 for the last 6 months...without the proposed netdev patch. We've run 200+ node MPI clusters for days and days over iWARP...without the proposed netdev patch. We ran iWARP interoperability tests across the country between Boston and San Jose...without ... yes I know ... you get it. <sarcasm> News flash...the foundation is built! </sarcasm> But! Our stable LAN and the WAN tests didn't often experience MTU changes and redirects...but of course we knew these were inevitable. So our goal was to make iWARP more robust in the face of a more dynamic network topology. Shutters on the house maybe...I dunno, it's your analogy ;-) All that said, the proposed patch helps not only iWARP, but other transports (iSCSI, IB) as well. It is not large, invasive, intrusive...hell, it's not even new. It leverages an existing event notifier mechanism. This patch is about dotting I's and crossing T's, it's not about foundations. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-06 5:25 ` Tom Tucker @ 2006-07-06 14:08 ` Herbert Xu 2006-07-06 17:36 ` Tom Tucker 2006-07-07 6:53 ` David Miller 1 sibling, 1 reply; 74+ messages in thread From: Herbert Xu @ 2006-07-06 14:08 UTC (permalink / raw) To: Tom Tucker; +Cc: davem, rdreier, netdev, akpm Tom Tucker <tom@opengridcomputing.com> wrote: > > All that said, the proposed patch helps not only iWARP, but other > transports (iSCSI, IB) as well. It is not large, invasive, Care to explain how it helps those other technologies? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-06 14:08 ` Herbert Xu @ 2006-07-06 17:36 ` Tom Tucker 2006-07-07 0:03 ` Herbert Xu 0 siblings, 1 reply; 74+ messages in thread From: Tom Tucker @ 2006-07-06 17:36 UTC (permalink / raw) To: Herbert Xu; +Cc: davem, rdreier, netdev, akpm On Fri, 2006-07-07 at 00:08 +1000, Herbert Xu wrote: > Tom Tucker <tom@opengridcomputing.com> wrote: > > > > All that said, the proposed patch helps not only iWARP, but other > > transports (iSCSI, IB) as well. It is not large, invasive, > > Care to explain how it helps those other technologies? The RDMA CMA uses IP addresses and port numbers to create a uniform addressing scheme across all transport types. For IB, it is necessary to resolve IP addresses to IB GIDs. The ARP protocol is used to do this, and a netfilter rule is installed to snoop the incoming ARP replies. This would not be necessary if ARP events were provided as in the patch. Unified wire iSCSI adapters have the same issue as iWARP wrt managing IP addresses and ports. > > Cheers, ^ permalink raw reply [flat|nested] 74+ messages in thread
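For flavor, consuming such events would look roughly like the kernel-style sketch below. The notifier registration follows the shape of the netevent interface under discussion (a neighbour-update event delivered through a notifier chain); cma_complete_resolution() is a hypothetical hook standing in for the CMA's real address-resolution bookkeeping, so treat the whole thing as illustrative rather than as the patch itself.

    #include <linux/init.h>
    #include <linux/notifier.h>
    #include <net/netevent.h>
    #include <net/neighbour.h>

    static void cma_complete_resolution(struct neighbour *neigh); /* hypothetical */

    static int cma_netevent_cb(struct notifier_block *self,
                               unsigned long event, void *ctx)
    {
        if (event == NETEVENT_NEIGH_UPDATE) {
            struct neighbour *neigh = ctx;

            /* a pending IP-to-hardware-address resolution may now be
             * answerable, with no ARP snooping and no polling */
            if (neigh->nud_state & NUD_VALID)
                cma_complete_resolution(neigh);
        }
        return NOTIFY_DONE;
    }

    static struct notifier_block cma_nb = {
        .notifier_call = cma_netevent_cb,
    };

    static int __init cma_netevent_init(void)
    {
        return register_netevent_notifier(&cma_nb);
    }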
* Re: RDMA will be reverted 2006-07-06 17:36 ` Tom Tucker @ 2006-07-07 0:03 ` Herbert Xu 2006-07-07 0:32 ` Tom Tucker 0 siblings, 1 reply; 74+ messages in thread From: Herbert Xu @ 2006-07-07 0:03 UTC (permalink / raw) To: Tom Tucker; +Cc: davem, rdreier, netdev, akpm On Thu, Jul 06, 2006 at 12:36:24PM -0500, Tom Tucker wrote: > > The RDMA CMA uses IP addresses and port numbers to create a uniform > addressing scheme across all transport types. For IB, it is necessary to > resolve IP addresses to IB GIDs. The ARP protocol is used to do this, and > a netfilter rule is installed to snoop the incoming ARP replies. This > would not be necessary if ARP events were provided as in the patch. Well, the concerns we have do not apply to just iWARP, but RDMA/IP in general, so this isn't really another technology. In fact, it seems that we now have IP-specific knowledge living in drivers/infiniband/core/cma.c which is suboptimal. > Unified wire iSCSI adapters have the same issue as iWARP wrt managing > IP addresses and ports. If by Unified wire iSCSI you mean something that presents a SCSI interface together with an Ethernet interface where the two share the same MAC and IP address, then we have the same concerns with it as we do with iWARP or TOE. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-07 0:03 ` Herbert Xu @ 2006-07-07 0:32 ` Tom Tucker 0 siblings, 0 replies; 74+ messages in thread From: Tom Tucker @ 2006-07-07 0:32 UTC (permalink / raw) To: Herbert Xu; +Cc: davem, rdreier, netdev, akpm On Fri, 2006-07-07 at 10:03 +1000, Herbert Xu wrote: > On Thu, Jul 06, 2006 at 12:36:24PM -0500, Tom Tucker wrote: > > > > The RDMA CMA uses IP addresses and port numbers to create a uniform > > addressing scheme across all transport types. For IB, it is necessary to > > resolve IP addresses to IB GIDs. The ARP protocol is used to do this, and > > a netfilter rule is installed to snoop the incoming ARP replies. This > > would not be necessary if ARP events were provided as in the patch. > > Well, the concerns we have do not apply to just iWARP, but RDMA/IP in > general, so this isn't really another technology. > > In fact, it seems that we now have IP-specific knowledge living in > drivers/infiniband/core/cma.c which is suboptimal. To be clear, the CMA doesn't look in the ARP packet; it just uses the existence of the packet as an indication that it should check to see if the ARP request it submitted for an IP address has been resolved yet. I agree that this is suboptimal, which is why I think the notifier is a nice alternative. > > > Unified wire iSCSI adapters have the same issue as iWARP wrt managing > > IP addresses and ports. > > If by Unified wire iSCSI you mean something that presents a SCSI interface > together with an Ethernet interface where the two share the same MAC and > IP address, Yes, this is what I mean. But the notifier doesn't actually support this; you would need to expose the IP/port space database to solve that problem. What I was referring to relative to iSCSI is that if the adapter is relying on Linux to do ARP via the above suboptimal implementation, then it would benefit from the notifier patch. > then we have the same concerns with it as we do with iWARP or > TOE. Yes indeed. > > Cheers, ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-06 5:25 ` Tom Tucker 2006-07-06 14:08 ` Herbert Xu @ 2006-07-07 6:53 ` David Miller 2006-07-07 8:11 ` What is RDMA (was: RDMA will be reverted) Herbert Xu 2006-07-07 13:29 ` RDMA will be reverted Tom Tucker 1 sibling, 2 replies; 74+ messages in thread From: David Miller @ 2006-07-07 6:53 UTC (permalink / raw) To: tom; +Cc: rdreier, netdev, akpm From: Tom Tucker <tom@opengridcomputing.com> Date: Thu, 06 Jul 2006 00:25:03 -0500 > This patch is about dotting I's and crossing T's; it's not about > foundations. You assume that I've flat-out rejected RDMA; in fact, I haven't. I really don't have enough information to form a final opinion yet. There's about a week of emails on this topic which I need to read and digest first. What I am saying, however, is that we need to understand the technology and the hooks you guys want before we put any of it in. I don't think that's unreasonable. ^ permalink raw reply [flat|nested] 74+ messages in thread
* What is RDMA (was: RDMA will be reverted) 2006-07-07 6:53 ` David Miller @ 2006-07-07 8:11 ` Herbert Xu 2006-07-07 18:25 ` Steve Wise 2006-07-24 22:29 ` What is RDMA David Miller 2006-07-07 13:29 ` RDMA will be reverted Tom Tucker 1 sibling, 2 replies; 74+ messages in thread From: Herbert Xu @ 2006-07-07 8:11 UTC (permalink / raw) To: David Miller; +Cc: tom, rdreier, netdev, akpm, Jeff Garzik On Fri, Jul 07, 2006 at 06:53:20AM +0000, David Miller wrote: > > What I am saying, however, is that we need to understand the > technology and the hooks you guys want before we put any of it in. Yes indeed. Here is what I've understood so far, so let's see if we can start building a consensus. 1) RDMA over straight Infiniband is not contentious. In this case no IP networking is involved. 2) RDMA over TCP/IP (or SCTP) can theoretically run on any network that supports IP, including Infiniband and Ethernet. 3) When RDMA over TCP is completely done in hardware, i.e., it has its own IP address, MAC address, and simply presents an RDMA interface (whatever that may be) to Linux, we're OK with it. This is similar to how some iSCSI adapters work. 4) When RDMA over TCP is done completely in the Linux networking stack, we don't have a problem because the existing TCP stack is still in charge. However, this is pretty pointless. 5) RDMA over TCP on the receive side is offloaded into the NIC. This allows the NIC to directly place data into the application's buffer. We're starting to have a little bit of a problem because it means that part of the incoming IP traffic is now being directly processed by the NIC, with no input from the Linux TCP/IP stack. However, as long as the connection establishment/acks are still controlled/seen by Linux we can probably live with it. 6) RDMA over TCP on the transmit side is offloaded into the NIC. This is starting to look very worrying. The reason is that we lose all control over crucial aspects of TCP like congestion control. It is now completely up to the NIC to do that. For straight RDMA over Infiniband this isn't an issue because the traffic is not likely to travel across the Internet. However, for RDMA over TCP, one of their goals is to support sending traffic over the Internet, so this is a concern. Incidentally, this is why they need to know about things like MAC/route/MTU changing. 7) RDMA over TCP is completely offloaded into the NIC; however, they still use Linux's IP address, MAC address, and rely on us to tell it about events such as MTU updates or MAC changes. In addition to the problems we have in 5) and 6), we now have a portion of TCP port space which has suddenly become invisible to Linux. What's more, we lose control (e.g., netfilter) over what connections may or may not be established. So to my mind, RDMA over TCP is most problematic when it shares the same IP/MAC address as the Linux host, and when the transmit side and/or the connection establishment (case 6 and 7) is offloaded into the NIC. This also happens to be the only scenario where they need the notification patch that started all this discussion. BTW, this URL gives an interesting perspective on RDMA over TCP (particularly Q14/Q15): http://www.rdmaconsortium.org/home/FAQs_Apr25.htm Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: What is RDMA (was: RDMA will be reverted) 2006-07-07 8:11 ` What is RDMA (was: RDMA will be reverted) Herbert Xu @ 2006-07-07 18:25 ` Steve Wise 2006-07-11 8:17 ` Herbert Xu 2006-07-24 22:29 ` What is RDMA David Miller 1 sibling, 1 reply; 74+ messages in thread From: Steve Wise @ 2006-07-07 18:25 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, tom, rdreier, netdev, akpm, Jeff Garzik Great summation. Comments in-line... On Fri, 2006-07-07 at 18:11 +1000, Herbert Xu wrote: > On Fri, Jul 07, 2006 at 06:53:20AM +0000, David Miller wrote: > > > > What I am saying, however, is that we need to understand the > > technology and the hooks you guys want before we put any of it in. > > Yes indeed. > > Here is what I've understood so far, so let's see if we can start building > a consensus. > > 1) RDMA over straight Infiniband is not contentious. In this case no > IP networking is involved. > Some IP networking is involved for this. IP addresses and port numbers are used by the RDMA Connection Manager. The motivation for this was two-fold, I think: 1) to simplify the connection setup model. The IB CM model was very complex. 2) to allow ULPs to be transport independent. Thus a single code base for NFSoRDMA, for example, can run over Infiniband and RDMA/TCP transports without code changes or knowing about transport-specific addressing. The routing table is also consulted to determine which rdma device should be used for connection setup. Each rdma device also installs a netdev device for native stack traffic. The RDMA CM maintains an association between the netdev device and the rdma device. And the Infiniband subsystem uses ARP over IPoIB to map IP addresses to GID/QPN info. This is done by calling arp_send() directly, and snooping all ARP packets to "discover" when the ARP entry is completed. > 2) RDMA over TCP/IP (or SCTP) can theoretically run on any network that > supports IP, including Infiniband and Ethernet. > > 3) When RDMA over TCP is completely done in hardware, i.e., it has its > own IP address, MAC address, and simply presents an RDMA interface > (whatever that may be) to Linux, we're OK with it. > > This is similar to how some iSCSI adapters work. > The Ammasso driver implements this method. It supports two MAC addresses on the single GigE port: one for native host networking traffic only, and one for RDMA/TCP only. The firmware implements a full TCP/IP/ARP/ICMP stack and handles all functions of the RDMA/TCP connection setup. However, even these types of devices need some integration with the networking subsystem. Namely, the existing Infiniband rdma connection manager assumes it will find a netdev device for each rdma device registered. So it uses the routing table to look up a netdev to determine which rdma device should be used for connection setup. The Ammasso driver installs two netdevs, one of which is a virtual device used solely for assigning IP addresses to the RDMA side of the NIC, and for the RDMA CM to find this device... > 4) When RDMA over TCP is done completely in the Linux networking stack, > we don't have a problem because the existing TCP stack is still in > charge. However, this is pretty pointless. > Indeed. I see one case where this model might be useful: if the optimizations that RDMA gives help mainly the server side of an application, then the client side might use a software-only RDMA stack and a dumb NIC. The server buys the deep RNIC adapter and gets the perf benefits... > > 5) RDMA over TCP on the receive side is offloaded into the NIC. This > allows the NIC to directly place data into the application's buffer. > > We're starting to have a little bit of a problem because it means that > part of the incoming IP traffic is now being directly processed by the > NIC, with no input from the Linux TCP/IP stack. > > However, as long as the connection establishment/acks are still > controlled/seen by Linux we can probably live with it. > > 6) RDMA over TCP on the transmit side is offloaded into the NIC. This > is starting to look very worrying. > > The reason is that we lose all control over crucial aspects of TCP like > congestion control. It is now completely up to the NIC to do that. > For straight RDMA over Infiniband this isn't an issue because the > traffic is not likely to travel across the Internet. > > However, for RDMA over TCP, one of their goals is to support sending > traffic over the Internet, so this is a concern. Incidentally, this is > why they need to know about things like MAC/route/MTU changing. > > 7) RDMA over TCP is completely offloaded into the NIC; however, they still > use Linux's IP address, MAC address, and rely on us to tell it about > events such as MTU updates or MAC changes. > I only know of type 3 RNICs (Ammasso) and type 7 RNICs (Chelsio + others). I haven't seen any type 5 or 6 designs yet for RDMA/TCP... > In addition to the problems we have in 5) and 6), we now have a portion > of TCP port space which has suddenly become invisible to Linux. What's > more, we lose control (e.g., netfilter) over what connections may or > may not be established. Port space issues and netfilter integration can be fixed, I think, if there is a desire to do so. > > So to my mind, RDMA over TCP is most problematic when it shares the same > IP/MAC address as the Linux host, and when the transmit side and/or the > connection establishment (case 6 and 7) is offloaded into the NIC. This > also happens to be the only scenario where they need the notification > patch that started all this discussion. > Note that the current Infiniband RDMA connection setup could also benefit from the notification patch. Then it would not need to filter all incoming ARP packets... Steve. ^ permalink raw reply [flat|nested] 74+ messages in thread
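[Editorial aside: to make the addressing model Steve describes concrete, a kernel ULP drives the entire IP-based setup through the rdma_cm calls and never touches GIDs or MACs itself. A condensed sketch follows, assuming the mid-2006 in-tree interface (a later tree may differ); error paths and QP creation are omitted, and the 2000 ms timeouts are arbitrary.]

/* Sketch of IP-addressed connection setup through the RDMA CM. */
#include <rdma/rdma_cm.h>
#include <linux/err.h>
#include <linux/in.h>

static int ulp_cma_handler(struct rdma_cm_id *id,
                           struct rdma_cm_event *event)
{
        struct rdma_conn_param param = { .responder_resources = 1,
                                         .initiator_depth = 1,
                                         .retry_count = 7 };

        switch (event->event) {
        case RDMA_CM_EVENT_ADDR_RESOLVED:
                /* IP -> GID (IB) or IP -> MAC (iWARP) is done. */
                return rdma_resolve_route(id, 2000);
        case RDMA_CM_EVENT_ROUTE_RESOLVED:
                /* A QP would be created here before connecting. */
                return rdma_connect(id, &param);
        case RDMA_CM_EVENT_ESTABLISHED:
                /* Connection is up; post work requests from now on. */
                return 0;
        default:
                /* Errors and disconnects would be handled here. */
                return 0;
        }
}

static int ulp_connect(struct sockaddr_in *src, struct sockaddr_in *dst)
{
        struct rdma_cm_id *id;

        id = rdma_create_id(ulp_cma_handler, NULL, RDMA_PS_TCP);
        if (IS_ERR(id))
                return PTR_ERR(id);

        /* Consults the host routing table to bind this connection
         * to the right rdma device, exactly as described above. */
        return rdma_resolve_addr(id, (struct sockaddr *)src,
                                 (struct sockaddr *)dst, 2000);
}

[The same handler runs unchanged over IB and iWARP; only the address-resolution step underneath differs, which is where the ARP snooping, or the proposed notifier, comes in.]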
* Re: What is RDMA (was: RDMA will be reverted) 2006-07-07 18:25 ` Steve Wise @ 2006-07-11 8:17 ` Herbert Xu 2006-07-11 13:27 ` Steve Wise 0 siblings, 1 reply; 74+ messages in thread From: Herbert Xu @ 2006-07-11 8:17 UTC (permalink / raw) To: Steve Wise; +Cc: David Miller, tom, rdreier, netdev, akpm, Jeff Garzik On Fri, Jul 07, 2006 at 01:25:44PM -0500, Steve Wise wrote: > > Some IP networking is involved for this. IP addresses and port numbers > are used by the RDMA Connection Manager. The motivation for this was > two-fold, I think: > > 1) to simplify the connection setup model. The IB CM model was very > complex. > > 2) to allow ULPs to be transport independent. Thus a single code base > for NFSoRDMA, for example, can run over Infiniband and RDMA/TCP > transports without code changes or knowing about transport-specific > addressing. > > The routing table is also consulted to determine which rdma device > should be used for connection setup. Each rdma device also installs a > netdev device for native stack traffic. The RDMA CM maintains an > association between the netdev device and the rdma device. > > And the Infiniband subsystem uses ARP over IPoIB to map IP addresses to > GID/QPN info. This is done by calling arp_send() directly, and snooping > all ARP packets to "discover" when the ARP entry is completed. This sounds interesting. Since this is going to be IB-neutral, what about moving high-level logic like this out of drivers/infiniband and into net? That way the rest of the networking community can add input into how things are done. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: What is RDMA (was: RDMA will be reverted) 2006-07-11 8:17 ` Herbert Xu @ 2006-07-11 13:27 ` Steve Wise 0 siblings, 0 replies; 74+ messages in thread From: Steve Wise @ 2006-07-11 13:27 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, tom, rdreier, netdev, akpm, Jeff Garzik On Tue, 2006-07-11 at 18:17 +1000, Herbert Xu wrote: > On Fri, Jul 07, 2006 at 01:25:44PM -0500, Steve Wise wrote: > > > > Some IP networking is involved for this. IP addresses and port numbers > > are used by the RDMA Connection Manager. The motivation for this was > > two-fold, I think: > > > > 1) to simplify the connection setup model. The IB CM model was very > > complex. > > > > 2) to allow ULPs to be transport independent. Thus a single code base > > for NFSoRDMA, for example, can run over Infiniband and RDMA/TCP > > transports without code changes or knowing about transport-specific > > addressing. > > > > The routing table is also consulted to determine which rdma device > > should be used for connection setup. Each rdma device also installs a > > netdev device for native stack traffic. The RDMA CM maintains an > > association between the netdev device and the rdma device. > > > > And the Infiniband subsystem uses ARP over IPoIB to map IP addresses to > > GID/QPN info. This is done by calling arp_send() directly, and snooping > > all ARP packets to "discover" when the ARP entry is completed. > > This sounds interesting. > > Since this is going to be IB-neutral, what about moving high-level logic > like this out of drivers/infiniband and into net? > > That way the rest of the networking community can add input into how > things are done. > The notifier patch I posted sort of does this already by eliminating the need to snoop ARP replies. It will notify interested subsystems of neighbour changes (e.g., when an ARP reply is processed and the neighbour struct is updated with the correct hardware MAC address). And I _think_ neigh_event_send() could be used instead of arp_send(). Steve. ^ permalink raw reply [flat|nested] 74+ messages in thread
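[Editorial aside: a consumer of the notifier Steve describes might look like the sketch below. This assumes the interface in the posted patch resembles what was being discussed at the time: a register_netevent_notifier() entry point and events such as NETEVENT_NEIGH_UPDATE carrying the updated struct neighbour. The names are taken from that proposal, not from a merged tree, so treat them as assumptions.]

/* Sketch of a netevent consumer, assuming the proposed interface:
 * register_netevent_notifier() plus NETEVENT_* events. */
#include <linux/notifier.h>
#include <net/netevent.h>
#include <net/neighbour.h>

static int rdma_netevent_cb(struct notifier_block *nb,
                            unsigned long event, void *ctx)
{
        struct neighbour *neigh;

        switch (event) {
        case NETEVENT_NEIGH_UPDATE:
                neigh = ctx;
                /* No ARP snooping needed: check pending address
                 * resolutions against neigh->ha directly. */
                if (neigh->nud_state & NUD_VALID)
                        ; /* complete any matching resolution */
                break;
        case NETEVENT_PMTU_UPDATE:
        case NETEVENT_REDIRECT:
                /* Push new path information down to the RNIC. */
                break;
        }
        return NOTIFY_DONE;
}

static struct notifier_block rdma_netevent_nb = {
        .notifier_call = rdma_netevent_cb,
};

/* register_netevent_notifier(&rdma_netevent_nb) at module init,
 * unregister_netevent_notifier(&rdma_netevent_nb) at exit. */

[Compared with the ARP hook sketched earlier in the thread, this fires only on actual neighbour state changes rather than on every reply seen on the wire.]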
* Re: What is RDMA 2006-07-07 8:11 ` What is RDMA (was: RDMA will be reverted) Herbert Xu 2006-07-07 18:25 ` Steve Wise @ 2006-07-24 22:29 ` David Miller 2006-07-24 22:34 ` Rick Jones 1 sibling, 1 reply; 74+ messages in thread From: David Miller @ 2006-07-24 22:29 UTC (permalink / raw) To: herbert; +Cc: tom, rdreier, netdev, akpm, jgarzik From: Herbert Xu <herbert@gondor.apana.org.au> Date: Fri, 7 Jul 2006 18:11:31 +1000 > 5) RDMA over TCP on the receive side is offloaded into the NIC. This > allows the NIC to directly place data into the application's buffer. > > We're starting to have a little bit of a problem because it means that > part of the incoming IP traffic is now being directly processed by the > NIC, with no input from the Linux TCP/IP stack. > > However, as long as the connection establishment/acks are still > controlled/seen by Linux we can probably live with it. As I have detailed in other emails, even if you get the connection establishment packets processed by netfilter, you can end up with a non-working connection, because NAT will want to transform all of the established-state packets in the same way. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: What is RDMA 2006-07-24 22:29 ` What is RDMA David Miller @ 2006-07-24 22:34 ` Rick Jones 2006-07-24 22:39 ` David Miller 2006-07-24 22:49 ` Andi Kleen 0 siblings, 2 replies; 74+ messages in thread From: Rick Jones @ 2006-07-24 22:34 UTC (permalink / raw) To: David Miller; +Cc: herbert, tom, rdreier, netdev, akpm, jgarzik That TOE/iWARP could end up being precluded by NAT seems so ironic from a POE2E standpoint. rick jones "Purity Of End To End" ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: What is RDMA 2006-07-24 22:34 ` Rick Jones @ 2006-07-24 22:39 ` David Miller 2006-07-24 22:49 ` Andi Kleen 1 sibling, 0 replies; 74+ messages in thread From: David Miller @ 2006-07-24 22:39 UTC (permalink / raw) To: rick.jones2; +Cc: herbert, tom, rdreier, netdev, akpm, jgarzik From: Rick Jones <rick.jones2@hp.com> Date: Mon, 24 Jul 2006 15:34:30 -0700 > That TOE/iWARP could end up being precluded by NAT seems so ironic > from a POE2E standpoint. To be honest, we do not have a pure end-to-end internet, and some of our failed experiments in the past prove this :-) For example, we have an optimization that allows much earlier termination of TIME_WAIT connections. It relies upon TCP timestamps and attributes we can determine about end hosts using that information (it is yet another Van Jacobson idea, btw). But NAT means that IP+Port does not necessarily equate to the same host over time, not even over short periods of time. A NAT box could be using Port X for host A and then for host B a short time later. Therefore we had to disable the early timewait recycling trick by default. ^ permalink raw reply [flat|nested] 74+ messages in thread
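[Editorial aside: the "early timewait recycling trick" David refers to is, if memory serves, the net.ipv4.tcp_tw_recycle sysctl, which uses TCP timestamps to recycle TIME_WAIT sockets early and has defaulted to off for precisely these NAT reasons; treat the knob name as a recollection, not gospel. A quick userspace check:]

/* Reads the tcp_tw_recycle sysctl; 0 means the early TIME_WAIT
 * recycling described above is disabled (the default). */
#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/sys/net/ipv4/tcp_tw_recycle", "r");
        int val = -1;

        if (f) {
                if (fscanf(f, "%d", &val) != 1)
                        val = -1;
                fclose(f);
        }
        printf("tcp_tw_recycle = %d\n", val);
        return 0;
}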
* Re: What is RDMA 2006-07-24 22:34 ` Rick Jones 2006-07-24 22:39 ` David Miller @ 2006-07-24 22:49 ` Andi Kleen 1 sibling, 0 replies; 74+ messages in thread From: Andi Kleen @ 2006-07-24 22:49 UTC (permalink / raw) To: Rick Jones; +Cc: David Miller, herbert, tom, rdreier, netdev, akpm, jgarzik On Tuesday 25 July 2006 00:34, Rick Jones wrote: > That TOE/iWARP could end up being precluded by NAT seems so ironic from a POE2E > standpoint. Yes, it's sad, but it's reality, unfortunately. There is even precedent: the VJ stateless TW recycling scheme also turned out not to work because of NAT considerations. -Andi ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-07 6:53 ` David Miller 2006-07-07 8:11 ` What is RDMA (was: RDMA will be reverted) Herbert Xu @ 2006-07-07 13:29 ` Tom Tucker 1 sibling, 0 replies; 74+ messages in thread From: Tom Tucker @ 2006-07-07 13:29 UTC (permalink / raw) To: David Miller; +Cc: rdreier, netdev, akpm On Thu, 2006-07-06 at 23:53 -0700, David Miller wrote: > From: Tom Tucker <tom@opengridcomputing.com> > Date: Thu, 06 Jul 2006 00:25:03 -0500 > > > This patch is about dotting I's and crossing T's, it's not about > > foundations. > > You assume that I've flat out rejected RDMA, in fact I haven't. I > really don't have enough information to form a final opinion yet. > There's about a week of emails on this topic which I need to read > and digest first. I realize that there is a tremendous amount of work out there and this is just one thread. > > What I am saying, however, is that we need to understand the > technology and the hooks you guys want before we put any of it in. Absolutely. > > I don't think that's unreasonable. Not at all. Let me know if I can help. Tom ^ permalink raw reply [flat|nested] 74+ messages in thread
* RE: RDMA will be reverted @ 2006-07-06 13:26 Caitlin Bestler 0 siblings, 0 replies; 74+ messages in thread From: Caitlin Bestler @ 2006-07-06 13:26 UTC (permalink / raw) To: Andi Kleen, Andy Gay; +Cc: Tom Tucker, David Miller, rdreier, netdev, akpm Andi Kleen wrote: > >> We're focusing on netfilter here. Is breaking netfilter really the >> only issue with this stuff? > > Another concern is that it will just not be able to keep > up with a high rate of new connections or a high number of them > (because the hardware has too limited state) > Neither iWARP nor an iSCSI initiator will require extremely high rates of connection establishment. An RNIC only establishes connections when its services have been explicitly requested (via use of a specific service). In any event, the key question here is whether integration with the netdevice improves things or whether the offload device should be "totally transparent" to the kernel. If the offload device somehow insisted on handling connection requests that the kernel would have been able to handle, then this would be an issue. But the kernel is not currently handling RDMA Connect requests on its own, and I know of no one who has suggested that a software-only implementation of RDMA at 10Gbit is feasible. Netfilter integration is definitely something that needs to be addressed, but the L2/L3 integrations need to be in place first. > And then there are the other issues I listed like subtle TCP bugs > (TSO is already a nightmare in this area and it's still not quite > right) etc. > Making an RNIC "fully transparent" to the kernel would require it to handle many L2 and L3 issues in parallel with the host stack. That increases the chance of a bug, or at least a subtle difference between the host and the RNIC which, while compliant, would be unexpected. The purpose of the proposed patches is to enable the RNIC to be in full compliance with the host stack on IP layer issues. > It would need someone who can describe how this new RDMA device avoids > all the problems, but so far its advocates don't seem to be interested > in doing that and I cannot contribute more. > RDMA services are already defined for the kernel. The connection management and network notifier patches are about enabling RDMA devices to use IP addresses in a way that is consistent. Obviously doing so is more important for an iWARP device than for an InfiniBand device, but even InfiniBand users have expressed a desire to use IP addressing. Applications do not use RDMA by accident; it is a major design decision. Once an application uses RDMA it is no longer a direct consumer of the transport layer protocol. Indeed, one of the main objectives of the OpenFabrics stack is to enable typical applications to be written that will work over RDMA without caring what the underlying transport is. The options for control will still be there, but just as a sockets programmer does not typically care whether their IP is carried over SLIP, PPP, Ethernet, or ATM, most RDMA developers should not have to worry about iWARP or InfiniBand. http://ietf.org/internet-drafts/draft-ietf-rddp-applicability-08.txt provides an overview on how RDMA benefits applications, and when applications would benefit from its use as compared to plain TCP. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted @ 2006-07-25 19:59 Tom Tucker 0 siblings, 0 replies; 74+ messages in thread From: Tom Tucker @ 2006-07-25 19:59 UTC (permalink / raw) To: David Miller; +Cc: ak, rdreier, netdev, akpm On Mon, 2006-07-24 at 15:23 -0700, David Miller wrote: > From: Tom Tucker <tom@opengridcomputing.com> > Date: Wed, 05 Jul 2006 12:09:42 -0500 > > > "A TOE net stack is closed source firmware. Linux engineers have no way > > to fix security issues that arise. As a result, only non-TOE users will > > receive security updates, leaving random windows of vulnerability for > > each TOE NIC's users." > > > > - A Linux security update may or may not be relevant to a vendor's > > implementation. > > > > - If a vendor's implementation has a security issue, then the customer > > must rely on the vendor to fix it. This is no less true for iWARP than > > for any adapter. > > This isn't how things actually work. > > Users have a computer, and they can rightly expect the community > to help them solve problems that occur in the upstream kernel. > > When a bug is found and the person is using NIC X, we don't > necessarily forward the bug report to the vendor of NIC X. > Instead we try to fix the bug. Many chip drivers are maintained > by people who do not work for the company that makes the chip, > and this works just fine. > > If only the chip vendor can fix a security problem, this makes Linux > less agile. Every aspect of a problem on a Linux system that > cannot be fixed entirely by the community is a net negative for Linux. > All true. What I meant to say was that this is "no less true than for any deep adapter". It is incontrovertible that a deep adapter is less flexible and more difficult to support than a shallow adapter. > > - iWARP needs to do protocol processing in order to validate and > > evaluate TCP payload in advance of direct data placement. This > > requirement is independent of CPU speed. > > Yet, RDMA itself is just an optimization meant to deal with > limitations of CPU and memory speed. You can rephrase the > situation in whatever way suits your argument, but it does not > make the core issue go away :) Yep. > > - I suspect that connection rates for RDMA adapters fall well below the > > rates attainable with a dumb device. That said, all of the RDMA > > applications that I know of are not connection intensive. Even for TOE, > > the later HTTP versions make connection rates less of an issue. > > This is a very naive evaluation of the situation. Yes, newer > versions of protocols such as HTTP make the per-client connection > burden lower, but the number of clients will increase in time to > more than make up for whatever gains are seen due to this. Naive is being kind; my HTTP comment is irrelevant :). > And then you have protocols which by design are connection heavy, > and rightly so, such as BitTorrent. > > Being able to handle enormous numbers of connections, with extreme > scalability and low latency, is an absolute requirement of any modern-day > serious TCP stack. And this requirement is not going away. > Wishing this requirement away due to HTTP persistent connections is > very unrealistic, at best. > > > - This is the problem we're trying to solve...incrementally and > > responsibly. > > You can't. See my email to Roland about why even VJ net channels > are found to be impractical.
To support netfilter properly, you > must traverse the whole netfilter stack, because NAT can rewrite > packets, yet still make them destined for the local system, and > thus they will have a different identity for connection demux > by the time the TCP stack sees the packet. > I'm not claiming that all the problems can be solved; I'm suggesting that integration could be better and that partial integration is better than none. > All of these transformations occur between normal packet receive > and the TCP stack. You would therefore need to put your card > between netfilter and TCP in the packet input path, and at that > point why bother with the stateful card at all? > > The fact is that stateless approaches will always be better than > stateful things because you cannot replicate the functionality we > have in the Linux stack without replicating 10 years of work into > your chip's firmware. At that point you should just run Linux > on your NIC since that is what you are effectively doing :) > I wish...I'd have a better stack. > In conversations such as these, it helps us a lot if you can be frank > and honest about the true absolute limitations of your technology. I'm trying ... classifying these limitations as "core can't fix" and "fixable with integration" is where we're getting crosswise. > I > can see that your viewpoint is tainted when I hear things such as HTTP > persistent connections being used as a reason why high TCP connection > rates won't matter in the future. Such assertions are understood to > be patently false by anyone who understands TCP and how it is used in > the real world. Partial "Fixable with Integration" Summary: - ARP Resolution - ICMP Redirect - Path MTU Change - Route Update - Colliding TCP Port Spaces Partial "Can't Fix" Issues Summary: - Many devices cannot support more than tens of thousands of concurrent connections (16-64k would be typical). The number of supported RDMA connections does not scale with server resources. - Netfilter integration is busted. Some have suggested that devices that do connection establishment in host software could honor netfilter rules at startup. I'm concerned that this would be more confusing than helpful (which rules work, which don't). - NAT doesn't work when run on the same machine as the RDMA stack with hardware assist. Post connection establishment, the adapter sees untranslated packets. - Connection rates will likely be lower for devices that do connection establishment in the device vs. in the host. - The open source community cannot easily predict, diagnose or fix problems in the hardware stack. It's a black box. - Most hardware stacks lack the security features present in the native stack and cannot be extended to handle new exploits. ^ permalink raw reply [flat|nested] 74+ messages in thread