* RDMA will be reverted
  From: David Miller @ 2006-06-28  7:07 UTC
  To: rolandd; Cc: netdev, akpm

Roland, there is no way in the world we would have let support for
RDMA into the kernel tree had we seen and reviewed it on netdev. I've
discussed this with Andrew Morton, and we'd like you to please revert
all of the RDMA code from Linus's tree immediately.

Folks are well aware of how opposed the Linux networking developers
are to RDMA and TOE type schemes. So the fact that none of these RDMA
changes went up for review on netdev strikes me as just a little bit
more than suspicious.

Please do not do this again, thank you.
* Re: RDMA will be reverted
  From: Evgeniy Polyakov @ 2006-06-28  7:41 UTC
  To: David Miller; Cc: rolandd, netdev, akpm

On Wed, Jun 28, 2006 at 12:07:15AM -0700, David Miller (davem@davemloft.net) wrote:
> Roland, there is no way in the world we would have let support for
> RDMA into the kernel tree had we seen and reviewed it on netdev. I've
> discussed this with Andrew Morton, and we'd like you to please revert
> all of the RDMA code from Linus's tree immediately.

May I suggest not reverting it? RDMA and RDDP can be treated like
tun/tap or packet socket devices until they start to change internal
network structures. As far as I can see they do not; they only use
existing interfaces, much as userspace can.

--
Evgeniy Polyakov
* Re: RDMA will be reverted
  From: Tom Tucker @ 2006-06-28 14:56 UTC
  To: David Miller; Cc: rolandd, netdev, akpm

On Wed, 2006-06-28 at 00:07 -0700, David Miller wrote:
> Roland, there is no way in the world we would have let support for
> RDMA into the kernel tree had we seen and reviewed it on netdev. I've
> discussed this with Andrew Morton, and we'd like you to please revert
> all of the RDMA code from Linus's tree immediately.
>
> Folks are well aware of how opposed the Linux networking developers
> are to RDMA and TOE type schemes. So the fact that none of these RDMA
> changes went up for review on netdev strikes me as just a little bit
> more than suspicious.
>
> Please do not do this again, thank you.

I believe Roland is on vacation (they just had a baby...). It is my
belief that everything Roland submitted went through both netdev and
lkml review.
* Re: RDMA will be reverted
  From: Steve Wise @ 2006-06-28 15:01 UTC
  To: David Miller; Cc: rolandd, netdev, akpm

On Wed, 2006-06-28 at 00:07 -0700, David Miller wrote:
> Roland, there is no way in the world we would have let support for
> RDMA into the kernel tree had we seen and reviewed it on netdev. I've
> discussed this with Andrew Morton, and we'd like you to please revert
> all of the RDMA code from Linus's tree immediately.
>
> Please do not do this again, thank you.

Dave,

There is no support for RDMA/TCP in Linux today, nor in Roland's git
tree for that matter.

I have posted a patch series for RDMA/TCP core support to lkml and
netdev over the last few weeks and gone through 3 review cycles (see
the "iWARP Core Changes" threads). In addition, I posted the Ammasso
RDMA driver for review as well; it also went through 3 review cycles.

Based on the review feedback and the lack of any serious issues, it
was my understanding that everyone was comfortable with RDMA/TCP.
Nothing underhanded was going on here.

Steve.
* Re: RDMA will be reverted
  From: Roland Dreier @ 2006-06-29 16:54 UTC
  To: David Miller; Cc: netdev, akpm

David> Roland, there is no way in the world we would have let
David> support for RDMA into the kernel tree had we seen and
David> reviewed it on netdev. I've discussed this with Andrew
David> Morton, and we'd like you to please revert all of the RDMA
David> code from Linus's tree immediately.

David> Folks are well aware of how opposed the Linux networking
David> developers are to RDMA and TOE type schemes. So the fact
David> that none of these RDMA changes went up for review on
David> netdev strikes me as just a little bit more than suspicious.

[I'm really on paternity leave, but this was brought to my attention
and seems important enough to respond to]

Dave, you're going to have to be more specific. What do you mean by
RDMA? The whole drivers/infiniband infrastructure, which handles RDMA
over IB, has been upstream for a year and a half, and was in fact
originally merged by you, so I'm guessing that's not what you mean.

If you're talking about the "RDMA CM" (drivers/infiniband/core/cma.c
et al) that was just merged, then you should be aware that it was
posted by Sean Hefty to netdev for review, multiple times (e.g. a
quick search finds <http://lwn.net/Articles/170202/>). It is true
that the intention of the abstraction is to provide a common mechanism
for handling IB and iWARP (RDMA/TCP) connections, but at the moment no
iWARP code is upstream. Right now all it does is allow IP addressing
to be used for IB connections.

In any case, I think we need to find a way for Linux to support iWARP
hardware, since there are users who want this, and (some of) the
vendors are working hard to do things the right way (including cc'ing
netdev on the conversation). I don't think it's good for Linux for
the answer to just be, "sorry, you're wrong to want to use that
hardware."

 - Roland
* Re: RDMA will be reverted
  From: YOSHIFUJI Hideaki / 吉藤英明 @ 2006-06-29 17:32 UTC
  To: rdreier; Cc: davem, netdev, akpm, yoshfuji

Hello.

In article <adawtazgawi.fsf@cisco.com> (at Thu, 29 Jun 2006 09:54:37 -0700),
Roland Dreier <rdreier@cisco.com> says:

> Dave, you're going to have to be more specific. What do you mean by
> RDMA? The whole drivers/infiniband infrastructure, which handles RDMA
> over IB, has been upstream for a year and a half, and was in fact
> originally merged by you, so I'm guessing that's not what you mean.

NET_DMA things.

--yoshfuji
* Re: RDMA will be reverted
  From: Roland Dreier @ 2006-06-29 17:35 UTC
  To: YOSHIFUJI Hideaki / 吉藤英明; Cc: davem, netdev, akpm

> > Dave, you're going to have to be more specific. What do you mean by
> > RDMA? The whole drivers/infiniband infrastructure, which handles RDMA
> > over IB, has been upstream for a year and a half, and was in fact
> > originally merged by you, so I'm guessing that's not what you mean.
>
> NET_DMA things.

But NET_DMA seems to be for the new DMA engine support (I/OAT, really,
I guess?). I had nothing to do with merging any of that, and as far as
I can tell, Dave signed off on all of those changes, so I don't think
that's what he's complaining about either.

 - R.
* Re: RDMA will be reverted
  From: YOSHIFUJI Hideaki / 吉藤英明 @ 2006-06-29 17:40 UTC
  To: rdreier; Cc: davem, netdev, akpm, yoshfuji

In article <adasllng8zn.fsf@cisco.com> (at Thu, 29 Jun 2006 10:35:56 -0700),
Roland Dreier <rdreier@cisco.com> says:

> But NET_DMA seems to be for the new DMA engine support (I/OAT, really,
> I guess?). I had nothing to do with merging any of that, and as far as
> I can tell, Dave signed off on all of those changes, so I don't think
> that's what he's complaining about either.

Oops, sorry, you're right...

--yoshfuji
* Re: RDMA will be reverted
  From: David Miller @ 2006-06-29 19:46 UTC
  To: rdreier; Cc: netdev, akpm

From: Roland Dreier <rdreier@cisco.com>
Date: Thu, 29 Jun 2006 09:54:37 -0700

> In any case, I think we need to find a way for Linux to support iWARP
> hardware, since there are users who want this, and (some of) the
> vendors are working hard to do things the right way (including cc'ing
> netdev on the conversation). I don't think it's good for Linux for
> the answer to just be, "sorry, you're wrong to want to use that
> hardware."

We give the same response for TOE stuff.

The integration of iWARP with the Linux networking stack, while much
better than TOE, is still heavily flawed. What most people might not
realize when using this stuff is that:

1) None of their firewall rules will apply to the iWARP communications.
2) None of their packet scheduling configurations can be applied to
   the iWARP communications.
3) It is not possible to encapsulate iWARP traffic in IPSEC.

And the list goes on and on. This is what we don't like about
technologies that implement their own networking stack in the card
firmware.
* Re: RDMA will be reverted
  From: Tom Tucker @ 2006-06-29 20:11 UTC
  To: David Miller; Cc: rdreier, netdev, akpm

On Thu, 2006-06-29 at 12:46 -0700, David Miller wrote:
> We give the same response for TOE stuff.

What does the word "we" represent in this context? Is it the Linux
community at large, Linux and Andrew, you? I'm not trying to be
argumentative; I just want to understand how carefully, and by whom,
iWARP technology has been considered.

> The integration of iWARP with the Linux networking stack, while much
> better than TOE, is still heavily flawed. What most people might not
> realize when using this stuff is that:

Agreed, the patch improves some things but doesn't address others.
But isn't this position a condemnation of the good to spite the bad?

> 1) None of their firewall rules will apply to the iWARP communications.
> 2) None of their packet scheduling configurations can be applied to
>    the iWARP communications.
> 3) It is not possible to encapsulate iWARP traffic in IPSEC.
>
> And the list goes on and on.

It does. However, this position statement makes things worse, not
better. By this I mean that deep adapters (iSCSI, iWARP) are even more
debilitated by not being able to snoop MTU changes, etc., and are
therefore forced to duplicate subsystems (e.g. ARP, ICMP, ...) already
ably implemented in host software.

> This is what we don't like about technologies that implement their own
> networking stack in the card firmware.

Doesn't this position force vendors to build deeper adapters, not
shallower ones? Isn't that exactly the opposite of what is intended?
* Re: RDMA will be reverted
  From: Tom Tucker @ 2006-06-29 20:16 UTC
  To: David Miller; Cc: rdreier, netdev, akpm

[...snip...]
> community at large, Linux and Andrew, you? I'm not trying to be

Linus, sorry... spell checker...
* Re: RDMA will be reverted
  From: David Miller @ 2006-06-29 20:19 UTC
  To: tom; Cc: rdreier, netdev, akpm

From: Tom Tucker <tom@opengridcomputing.com>
Date: Thu, 29 Jun 2006 15:11:06 -0500

> Doesn't this position force vendors to build deeper adapters, not
> shallower ones? Isn't that exactly the opposite of what is intended?

Nope.

Look at what the networking developers give a lot of attention and
effort to: things like TCP Large Receive Offload and Van Jacobson net
channels, both of which are fully stack-integrated receive performance
enhancements. They do not bypass netfilter, they do not bypass packet
scheduling, and yet they provide a hardware-assisted performance
improvement for receive.

This has been stated over and over again.

If companies keep designing undesirable hardware that unnecessarily
takes features away from the user, that really is not our problem.
* Re: RDMA will be reverted
  From: Tom Tucker @ 2006-06-29 20:47 UTC
  To: David Miller; Cc: rdreier, netdev, akpm

On Thu, 2006-06-29 at 13:19 -0700, David Miller wrote:
> Look at what the networking developers give a lot of attention and
> effort to: things like TCP Large Receive Offload and Van Jacobson net
> channels, both of which are fully stack-integrated receive performance
> enhancements. They do not bypass netfilter, they do not bypass packet
> scheduling, and yet they provide a hardware-assisted performance
> improvement for receive.

These technologies are integrated because someone chose to, and was
allowed to, integrate them. I contend that iWARP could be equally well
integrated if the decision were made to do so. It would, however,
require cooperation from both the hardware vendors and the netdev
maintainers.

> This has been stated over and over again.

For TOE, you are correct; however, for iWARP, you can't do RDMA
(direct placement into application buffers) without state in the
adapter. I personally tried very hard to build an adapter without
doing so, but alas, I failed ;-)

> If companies keep designing undesirable hardware that unnecessarily
> takes features away from the user, that really is not our problem.

I concede that features have been lost, but some applications benefit
greatly from RDMA. For these applications and these customers, the
hardware is not undesirable, and the fact that netfilter won't work on
their sub-5us-latency adapter is not perceived to be a big issue. The
mention of packet scheduling would cause an apoplectic seizure...
unless it were in the hardware...

All that verbiage aside, I believe it is not a matter of whether it is
possible to integrate iWARP; it is a question of whether it is
permissible.
* Re: RDMA will be reverted
  From: David Miller @ 2006-06-29 20:53 UTC
  To: tom; Cc: rdreier, netdev, akpm

From: Tom Tucker <tom@opengridcomputing.com>
Date: Thu, 29 Jun 2006 15:47:13 -0500

> I concede that features have been lost, but some applications benefit
> greatly from RDMA. For these applications and these customers,

TOE folks give the same story... it's a broken record, really.

Let us know when you can say something new about the situation.

Under Linux we get to make better long-term, architecturally sane
decisions, even if it is to the dismay of the backers of certain
short-sighted pieces of technology.
* Re: RDMA will be reverted
  From: Tom Tucker @ 2006-06-29 21:28 UTC
  To: David Miller; Cc: rdreier, netdev, akpm

On Thu, 2006-06-29 at 13:53 -0700, David Miller wrote:
> Under Linux we get to make better long-term, architecturally sane
> decisions, even if it is to the dismay of the backers of certain
> short-sighted pieces of technology.

Would you indulge me with one final clarification?

- Are you condemning RDMA over TCP as an ill-conceived technology?
- Are you condemning the implementation of iWARP?
- Are you condemning both?

Thanks,
Tom
* Re: RDMA will be reverted
  From: Andi Kleen @ 2006-06-29 21:25 UTC
  To: David Miller; Cc: tom, rdreier, netdev, akpm

> They do not bypass netfilter, they do not bypass packet scheduling,
> and yet they provide a hardware-assisted performance improvement for
> receive.

Not that I'm a TOE advocate, but as long as the adapter leaves
SYN/SYN-ACK handling to the stack and only turns on RDMA once a
connection is ESTABLISHED, it could at least pass through nearly all
of netfilter too (as established in the channel discussion).

-Andi
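To make Andi's point concrete, here is a minimal sketch (not from the
thread) of a netfilter hook that still sees every TCP connection
attempt even when an adapter takes over established connections. It
uses today's kernel names (nf_register_net_hook, NF_INET_PRE_ROUTING);
the 2006-era equivalents were nf_register_hook and NF_IP_PRE_ROUTING.

    #include <linux/netfilter.h>
    #include <linux/netfilter_ipv4.h>
    #include <linux/ip.h>
    #include <linux/tcp.h>

    /* Accept everything, but observe TCP SYNs: if the adapter only
     * takes over ESTABLISHED connections, every connection attempt
     * still passes here and can be matched by ordinary firewall
     * rules. */
    static unsigned int syn_watch(void *priv, struct sk_buff *skb,
                                  const struct nf_hook_state *state)
    {
            const struct iphdr *iph = ip_hdr(skb);

            if (iph->protocol == IPPROTO_TCP) {
                    /* The transport header is not set yet at
                     * PRE_ROUTING; locate the TCP header from the IP
                     * header length. (pskb_may_pull() checks omitted
                     * for brevity.) */
                    const struct tcphdr *th = (const struct tcphdr *)
                            ((const u8 *)iph + iph->ihl * 4);

                    if (th->syn && !th->ack)
                            pr_debug("SYN %pI4:%u -> %pI4:%u\n",
                                     &iph->saddr, ntohs(th->source),
                                     &iph->daddr, ntohs(th->dest));
            }
            return NF_ACCEPT;
    }

    static struct nf_hook_ops syn_watch_ops = {
            .hook     = syn_watch,
            .pf       = NFPROTO_IPV4,
            .hooknum  = NF_INET_PRE_ROUTING,
            .priority = NF_IP_PRI_FIRST,
    };
    /* registered from module init with:
     *   nf_register_net_hook(&init_net, &syn_watch_ops); */

Returning NF_DROP for a disallowed SYN here would keep the offloaded
connection from ever being set up, which is the property Andi is
pointing at.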
* Re: RDMA will be reverted
  From: James Morris @ 2006-06-29 20:42 UTC
  To: Tom Tucker; Cc: David Miller, rdreier, netdev, akpm

On Thu, 29 Jun 2006, Tom Tucker wrote:
> What does the word "we" represent in this context? Is it the Linux
> community at large, Linux and Andrew, you? I'm not trying to be
> argumentative; I just want to understand how carefully, and by whom,
> iWARP technology has been considered.

$ grep -ri davem /usr/src/linux

- James
--
James Morris <jmorris@namei.org>
* Re: RDMA will be reverted
  From: Roland Dreier @ 2006-06-30 20:51 UTC
  To: David Miller; Cc: netdev, akpm

You snipped my question about what specifically you wanted reverted,
so I'm going to assume that after cooling down and understanding the
situation, you're OK with everything that's in Linus's tree...

> The integration of iWARP with the Linux networking stack, while much
> better than TOE, is still heavily flawed. What most people might not
> realize when using this stuff is that:
>
> 1) None of their firewall rules will apply to the iWARP communications.
> 2) None of their packet scheduling configurations can be applied to
>    the iWARP communications.
> 3) It is not possible to encapsulate iWARP traffic in IPSEC.

Yes, there are tradeoffs with iWARP. However, there seem to be users
who are willing to make those tradeoffs. And I can't think of a single
other example of a case where we refused to merge a driver, not
because of any issues with the driver code, but because we don't like
the hardware it drives and think that people shouldn't be able to use
that hardware with Linux. And it makes me sad that we're doing that
here.

Don't get me wrong, I'm all for rejecting patches that make the core
networking stack worse or harder to maintain, or that are bad patches
for whatever reason. I know that the present is science fiction, but
I always thought the forbidden technologies would be stuff like
nanotech or human cloning -- I never would have guessed that iWARP
would be in that category.

Anyway, what is your feeling about changes strictly under
drivers/infiniband that add low-level driver support for iWARP
devices? The changes that Steve Wise proposed aren't strictly
necessary for iWARP support -- they just make things work better when
routes change.

Thanks,
  Roland
* Re: RDMA will be reverted
  From: David Miller @ 2006-06-30 21:16 UTC
  To: rdreier; Cc: netdev, akpm

From: Roland Dreier <rdreier@cisco.com>
Date: Fri, 30 Jun 2006 13:51:19 -0700

> And I can't think of a single other example of a case where we
> refused to merge a driver, not because of any issues with the driver
> code, but because we don't like the hardware it drives and think
> that people shouldn't be able to use that hardware with Linux. And
> it makes me sad that we're doing that here.

The TOE folks have tried to submit their hooks and drivers on several
occasions, and we've rejected it every time.

I definitely don't want the iWARP stuff to go in until we have a long,
good discussion about this. And you have a good week-long opportunity
to do so, as I'm about to go on vacation until next Friday :-)
* Re: RDMA will be reverted
  From: Tom Tucker @ 2006-06-30 23:01 UTC
  To: David Miller; Cc: rdreier, netdev, akpm

On Fri, 2006-06-30 at 14:16 -0700, David Miller wrote:
> The TOE folks have tried to submit their hooks and drivers on several
> occasions, and we've rejected it every time.

iWARP != TOE
* Re: RDMA will be reverted
  From: Andi Kleen @ 2006-07-01 14:26 UTC
  To: Tom Tucker; Cc: David Miller, rdreier, netdev, akpm

On Saturday 01 July 2006 01:01, Tom Tucker wrote:
> On Fri, 2006-06-30 at 14:16 -0700, David Miller wrote:
> > The TOE folks have tried to submit their hooks and drivers on
> > several occasions, and we've rejected it every time.
>
> iWARP != TOE

Perhaps a good start to the discussion David asked for would be if you
could give us an overview of the differences and of how you avoid the
TOE problems.

-Andi
* Re: RDMA will be reverted
  From: Andy Gay @ 2006-07-04 18:34 UTC
  To: Andi Kleen; Cc: Tom Tucker, David Miller, rdreier, netdev, akpm

On Sat, 2006-07-01 at 16:26 +0200, Andi Kleen wrote:
> Perhaps a good start to the discussion David asked for would be if
> you could give us an overview of the differences and of how you avoid
> the TOE problems.

Interesting thread; I hope someone replies to Andi's request.

I've actually no real idea what RDMA, iWARP and TOE are, so I may be
barking up completely the wrong tree here. If so, apologies. But it
sounds like we're talking about technologies that offload some part of
the network/transport layer processing to the interface device? And
the primary objection is that this may bypass some of the cool
features of the Linux stack? Stuff like iptables and... what exactly?

Presumably the reasons why such offloading would be a Good Thing have
to do with very high speed network processing, 10G Ethernet and the
like. That sounds to me very much like the way dedicated networking
kit would do it. So if you have a device that needs to be a very high
performance router, you dedicate it to that function and don't try to
do clever per-packet or per-flow processing at the same time.

In the Cisco world, there's a network design approach where you
consider your equipment in three 'layers'; I think they call them the
core, distribution, and access layers, or something like that. The
idea is that the core layer handles the real high speed stuff: you
don't do anything much except routing/switching in there. The other
layers aggregate flows for the core; if you need extra processing
(firewalls etc.) you do it somewhere in there. So, for example, the
packet capture functions (sort of like tcpdump) don't work if fast
switching is in use, which it would be in the core. Users accept
these tradeoffs, because if you design it right you can do the extra
processing on some other device in the outer layers.

So perhaps there's a good argument to make that a Linux system with
the right hardware could be considered a core device. Likely any place
you have such a system, it would be dedicated to just moving data as
well as possible, letting other systems do the other stuff. You
wouldn't want your server farm systems to also be your firewalls.

Bottom line - these technologies seem to me to have a place in a well
designed network.

Just my 2c...

- Andy
* Re: RDMA will be reverted
  From: Andi Kleen @ 2006-07-04 20:47 UTC
  To: Andy Gay; Cc: Tom Tucker, David Miller, rdreier, netdev, akpm

> So perhaps there's a good argument to make that a Linux system with
> the right hardware could be considered a core device. Likely any place
> you have such a system, it would be dedicated to just moving data as
> well as possible, letting other systems do the other stuff. You
> wouldn't want your server farm systems to also be your firewalls.

Why not? While Linux firewall performance is not flawless, its
problems (e.g. slow conntrack) seem to be mostly in an area where TOE
cannot do much about them.

> Bottom line - these technologies seem to me to have a place in a well
> designed network.

I think there is a web page listing why it's bad, but here's a quick
summary:

One worry is debugging it all together. Currently we have a single
stack to debug, although it's already difficult to control the
complexity as it grows more bells and whistles. Just take a look at
Cisco IOS release notes to see how hard it is to get it all to work
together.

Another reason is that there are general doubts that TOE can keep up
with the ever-growing performance of CPUs. Even if Linux added it
today, it would likely be slower again a few months later. That is
also a big difference from Cisco hardware: Linux usually runs on fast
main CPUs (or if you run it on slow CPUs you normally don't expect the
best network performance), and they get faster and faster constantly.
Admittedly 10Gb NICs are still a bit too fast for mainstream systems,
but that seems to be mostly a problem outside the CPUs, and it looks
like the next generation of systems will catch up with enough
bandwidth in this area.

Also, it tends to accelerate the wrong thing. On a lot of workloads
the main problem is keeping a lot of different connections under
control, and TOE tends to be slow at keeping connection information
synchronized with the host.

That is why the Linux strategy has been to ask for useful stateless
offloads instead. Examples are checksum offload (a long-time classic),
TSO (TCP segmentation offload), UFO (UDP fragmentation offload), Intel
I/OAT (memcpy offload), and RX hashing with MSI-X (not implemented
yet, but basically it allows load balancing of incoming streams across
CPUs). Note that all of these are more or less stateless offloads.

iWARP is not clear yet. From the meager bits of information about it
that have reached netdev so far, it at least sounds like it does RDMA,
needs far more state than any of the other offloads we have so far,
and likely has the usual TOE scaling issues. It's also likely on the
wrong side of Moore's law.

-Andi
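For reference, the stateless offloads Andi lists are advertised per
device through feature flags on struct net_device, which the stack
consults packet by packet. A hedged sketch using current flag names
(the netdev_features_t type and NETIF_F_GRO postdate this thread, and
the NETIF_F_UFO flag has since been removed from the kernel):

    #include <linux/netdevice.h>

    /* Print which of the classic stateless offloads a NIC advertises.
     * The stack checks these bits on the fly, e.g. it only hands the
     * hardware an unsegmented super-packet if NETIF_F_TSO is set. */
    static void report_stateless_offloads(const struct net_device *dev)
    {
            netdev_features_t f = dev->features;

            pr_info("%s: ip-csum=%d sg=%d tso=%d gro=%d\n", dev->name,
                    !!(f & NETIF_F_IP_CSUM),
                    !!(f & NETIF_F_SG),
                    !!(f & NETIF_F_TSO),
                    !!(f & NETIF_F_GRO));
    }

These are the same bits that "ethtool -k <dev>" reports from
userspace.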
* Re: RDMA will be reverted
  From: Andy Gay @ 2006-07-04 22:22 UTC
  To: Andi Kleen; Cc: Tom Tucker, David Miller, rdreier, netdev, akpm

On Tue, 2006-07-04 at 22:47 +0200, Andi Kleen wrote:
> > So perhaps there's a good argument to make that a Linux system with
> > the right hardware could be considered a core device. [...] You
> > wouldn't want your server farm systems to also be your firewalls.
>
> Why not? While Linux firewall performance is not flawless, its
> problems (e.g. slow conntrack) seem to be mostly in an area where TOE
> cannot do much about them.

No doubt you *can* do this, but would you want to? My point wasn't
really about performance here; it's more that systems needing this
level of performance (a server farm is just an example) will probably
be on an 'inside' network, with firewalling done elsewhere (at the
access layer, to use the Cisco paradigm). It's just not good design to
attach such systems directly to an untrusted network, IMHO. So these
systems just don't need netfilter capabilities.

> One worry is debugging it all together. Currently we have a single
> stack to debug, although it's already difficult to control the
> complexity as it grows more bells and whistles. Just take a look at
> Cisco IOS release notes to see how hard it is to get it all to work
> together.

No argument there!
* Re: RDMA will be reverted
  From: Andi Kleen @ 2006-07-04 23:01 UTC
  To: Andy Gay; Cc: Tom Tucker, David Miller, rdreier, netdev, akpm

> My point wasn't really about performance here; it's more that systems
> needing this level of performance (a server farm is just an example)
> will probably be on an 'inside' network, with firewalling done
> elsewhere (at the access layer, to use the Cisco paradigm). It's just
> not good design to attach such systems directly to an untrusted
> network, IMHO. So these systems just don't need netfilter
> capabilities.

Don't think of the high end. It is exotic and rare.

Think of the ordinary single Linux box somewhere at a rackspace
provider, which represents the majority of Linux boxes around, with a
not-too-skilled admin who mostly uses the default settings of his
configuration. For that box, running firewalling on the same machine
makes a lot of sense. Normally it is not that loaded and it doesn't
matter much how it performs, but it might occasionally be slashdotted,
and then it should still hold up.

BTW, basic firewalling is not really that bad as long as you don't
have too many rules. Mostly conntrack is painful right now. I'm sure
at some point it will be fixed too.

-Andi
* Re: RDMA will be reverted
  From: Andy Gay @ 2006-07-04 23:48 UTC
  To: Andi Kleen; Cc: Tom Tucker, David Miller, rdreier, netdev, akpm

On Wed, 2006-07-05 at 01:01 +0200, Andi Kleen wrote:
> Don't think of the high end. It is exotic and rare.

Sure. But isn't the high end exactly where these new technologies are
intended to fit?

> Think of the ordinary single Linux box somewhere at a rackspace
> provider, which represents the majority of Linux boxes around.

How many of those need 10G NICs?

> With a not-too-skilled admin who mostly uses the default settings of
> his configuration. For that box, running firewalling on the same
> machine makes a lot of sense.

Yup. I run a few of those. And I run firewalls on them. But they're on
1.5M T1 pipes at best. I probably fit into your 'not too skilled'
category, too :)

> BTW, basic firewalling is not really that bad as long as you don't
> have too many rules. Mostly conntrack is painful right now. I'm sure
> at some point it will be fixed too.

Actually, I wasn't aware of any pain with conntrack; it works great
for me. But like I said, I don't run any real high-speed connections.

We're focusing on netfilter here. Is breaking netfilter really the
only issue with this stuff? I know you mentioned some other concerns
(about TOE specifically), but they were really scalability things,
weren't they? Like you're not convinced this really solves any
performance issues long term. I'm certainly not qualified to discuss
that; hopefully some of the others will weigh in here.
* Re: RDMA will be reverted
  From: Andi Kleen @ 2006-07-05  0:04 UTC
  To: Andy Gay; Cc: Tom Tucker, David Miller, rdreier, netdev, akpm

> > Think of the ordinary single Linux box somewhere at a rackspace
> > provider, which represents the majority of Linux boxes around.
>
> How many of those need 10G NICs?

Most of them already have gigabit. At some point they will have 10G
too. Admittedly the iThingy under discussion here seems to be
Infiniband-only, which will probably not appear in such a use case.

> We're focusing on netfilter here. Is breaking netfilter really the
> only issue with this stuff?

Another concern is that it will just not be able to keep up with a
high rate of new connections, or a high number of them (because the
hardware has too limited state). And then there are the other issues I
listed, like subtle TCP bugs (TSO is already a nightmare in this area,
and it's still not quite right), etc.

> I know you mentioned some other concerns (about TOE specifically),
> but they were really scalability things, weren't they?

There was more than just scalability. Reread it.

Anyway, the thread is already getting off topic. I'm not actually that
interested in a generic TOE discussion, because the issue is pretty
much settled already with broad consensus. You can refer to the netdev
archives or the respective web pages if you want more details. It
would need someone who can describe how this new RDMA device avoids
all the problems, but so far its advocates don't seem to be interested
in doing that, and I cannot contribute more.

-Andi
* Re: RDMA will be reverted
  From: Roland Dreier @ 2006-07-04 20:34 UTC
  To: Andi Kleen; Cc: Tom Tucker, David Miller, netdev, akpm

Andi> Perhaps a good start to the discussion David asked for would
Andi> be if you could give us an overview of the differences and of
Andi> how you avoid the TOE problems.

Well, here's a quick overview, leaving out some of the details. The
difference between TOE and iWARP/RDMA is really the interface that
they present.

A TOE ("TCP Offload Engine") is a piece of hardware that offloads TCP
processing from the main system to handle regular sockets. There is
either some way to hand off a socket from the host stack to the TOE,
or a socket is created on the TOE to start with, but in both cases the
TOE is handling processing for normal TCP sockets. This means that
the TOE has some hardware and/or firmware to do stateful TCP
processing.

An iWARP device, or RNIC (RDMA NIC), also usually has hardware and/or
firmware TCP processing, but this isn't exposed through the BSD socket
interface. Instead, an RNIC presents an interface more like an
InfiniBand HCA: work requests (sends, receives, RDMA operations) are
passed to the RNIC via work queues, and completion notification is
returned asynchronously via completion queues. An iWARP connection
can handle both send/receive ("two-sided") and get/put (RDMA or
"one-sided") operations. For full details of the protocol used for
this, you can look at the drafts from the IETF rddp working group, but
basically an RDMA protocol is layered on top of a connected stream
protocol -- usually TCP, but SCTP could be used as well.

A lot of the performance of iWARP comes from the RDMA/direct placement
capabilities -- for example, an NFS/RDMA server can process requests
out of order and put data directly into the buffer that's waiting for
it, without using any CPU on the destination -- but even send/receive
operations can be useful. As a side note, an RNIC will also typically
support the same sort of kernel bypass as an IB HCA, where work queues
can be safely mapped into a userspace process's memory so that work
requests can be posted without a system call. In fact, people usually
use RDMA as a shorthand for the combination of the three features I
described: asynchronous work queues and completion queues, connections
that support both send/receive and RDMA, and kernel bypass.

In any case, RNIC support can be added to the existing IB stack with
fairly minor modifications -- you can search the netdev archives for
the patchsets posted by Steve Wise, but nearly all of the new code is
in the low-level hardware driver for the specific iWARP devices. The
real issues for netdev are things like Steve Wise's patch to add route
change notifiers, which could be used to tell RNICs when to update the
next hop for a connection they're handling.

More generally, it would be interesting to see if it's possible to tie
an RNIC into the kernel's packet filtering, so that disallowed
connections don't get set up. This seems very similar in spirit to
the problems around packet filtering that were raised for VJ
netchannels.

 - Roland
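To illustrate the work-queue interface Roland describes, here is a
hedged sketch of posting a one-sided RDMA WRITE through the kernel
verbs API that the IB stack (and, later, iWARP drivers) export. It
uses the modern struct ib_rdma_wr layout; the 2.6.x-era code carried
the same remote_addr/rkey fields in a union inside struct ib_send_wr.

    #include <rdma/ib_verbs.h>

    /* Queue one RDMA WRITE on an already-connected queue pair: the
     * adapter places 'len' bytes from our registered local buffer
     * directly into the peer's memory at remote_addr, with no CPU
     * involvement on the remote side. Completion shows up later on
     * the send CQ. */
    static int post_rdma_write(struct ib_qp *qp, u64 local_addr,
                               u32 lkey, u64 remote_addr, u32 rkey,
                               u32 len)
    {
            struct ib_sge sge = {
                    .addr   = local_addr,
                    .length = len,
                    .lkey   = lkey,
            };
            struct ib_rdma_wr wr = {
                    .wr = {
                            .opcode     = IB_WR_RDMA_WRITE,
                            .sg_list    = &sge,
                            .num_sge    = 1,
                            .send_flags = IB_SEND_SIGNALED,
                    },
                    .remote_addr = remote_addr,
                    .rkey        = rkey,
            };
            const struct ib_send_wr *bad_wr;

            return ib_post_send(qp, &wr.wr, &bad_wr);
    }

The caller later reaps a struct ib_wc for this request, e.g. with
ib_poll_cq() on the send completion queue; the same post/poll model
covers sends, receives, and RDMA reads.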
* Re: RDMA will be reverted
  From: David Miller @ 2006-07-24 22:06 UTC
  To: rdreier; Cc: ak, tom, netdev, akpm

From: Roland Dreier <rdreier@cisco.com>
Date: Tue, 04 Jul 2006 13:34:27 -0700

> Well, here's a quick overview, leaving out some of the details. The
> difference between TOE and iWARP/RDMA is really the interface that
> they present.

Thanks for the description, Roland. It helps me understand the
situation better.

> The real issues for netdev are things like Steve Wise's patch to add
> route change notifiers, which could be used to tell RNICs when to
> update the next hop for a connection they're handling.

I'll probably put Steve's patches in soon.

> More generally, it would be interesting to see if it's possible to
> tie an RNIC into the kernel's packet filtering, so that disallowed
> connections don't get set up. This seems very similar in spirit to
> the problems around packet filtering that were raised for VJ
> netchannels.

Don't get too excited about VJ netchannels; more and more roadblocks
to their practicality are being found every day.

For example, my idea to allow ESTABLISHED TCP socket demux to be done
before netfilter is flawed. Connection tracking and NAT can change the
packet ID and loop it back to us to hit exactly an ESTABLISHED TCP
socket; therefore we must always hit netfilter first. All the original
costs of route, netfilter, and TCP socket lookup reappear as we make
VJ netchannels fit all the rules of real practical systems,
eliminating their gains entirely.

I will also note in passing that papers on related ideas, such as the
Exokernel stuff, are very careful not to address 1) how practical
their demux engine is, and 2) the negative side effects of userspace
TCP implementations. For an example of the latter, if you have some
1GB Java process, you do not want to wake that monster up just to do
some ACK processing or TCP window updates, yet if you don't you
violate TCP's rules and risk spurious, unnecessary retransmits.

Furthermore, the VJ netchannel gains can be partially obtained from
generic stateless facilities that we are going to get anyway.
Networking chips supporting multiple MSI-X vectors, chosen by hashing
the flow ID, can move TCP processing to "end nodes", which in this
case are CPU threads, by having each such MSI-X vector target a
different CPU thread.

The good news is that we've survived a long time without revolutions
like VJ net channels, and the existing TCP stack can be improved
dramatically, in ways people will see benefits from in a shorter
amount of time. For example, Alexey Kuznetsov and I have some ideas on
how to make the most expensive TCP function for a sender, tcp_ack(),
more efficient by using different data structures for the retransmit
queue and the loss/recovery packet SACK state.
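A sketch of the stateless hash-based dispatch David describes,
essentially what hardware later shipped as RSS, using the kernel's
jhash over the TCP/IPv4 4-tuple. The function name and seed here are
illustrative, not taken from any driver:

    #include <linux/types.h>
    #include <linux/jhash.h>

    /* Map a TCP/IPv4 flow to one of nr_queues MSI-X vectors, each
     * bound to a different CPU thread. Stateless: no per-connection
     * table is needed, just the hash, and every segment of a flow
     * always lands on the same CPU. */
    static u32 flow_to_queue(__be32 saddr, __be32 daddr,
                             __be16 sport, __be16 dport, u32 nr_queues)
    {
            u32 ports = ((u32)ntohs(sport) << 16) | ntohs(dport);
            u32 hash  = jhash_3words(ntohl(saddr), ntohl(daddr), ports,
                                     0xdeadbeef /* illustrative seed */);

            return hash % nr_queues;
    }

Because the mapping is a pure function of the addressing, the NIC
keeps no connection state at all, which is exactly the "stateless
facilities" contrast David is drawing against TOE-style offload.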
* Re: RDMA will be reverted
  From: Andi Kleen @ 2006-07-24 23:10 UTC
  To: David Miller; Cc: rdreier, tom, netdev, akpm

> For example, my idea to allow ESTABLISHED TCP socket demux to be done
> before netfilter is flawed. Connection tracking and NAT can change the
> packet ID and loop it back to us to hit exactly an ESTABLISHED TCP
> socket; therefore we must always hit netfilter first.

Hmm, how does this happen? I guess either when a connection is
masqueraded and an application did a bind() on a local port that is
used by the masquerading engine (that could be handled by just
disallowing it), or when you have a transparent proxy setup with the
proxy on the local host. Perhaps in that case netfilter could be
taught to reinject packets in a way that they hit another ESTABLISHED
lookup. Did I miss a case?

> All the original costs of route, netfilter, and TCP socket lookup
> reappear as we make VJ netchannels fit all the rules of real practical
> systems, eliminating their gains entirely.

Most of the optimizations from the early demux scheme could probably
be gotten more simply by adding a fast path to iptables/conntrack/etc.
that checks whether all rules only look at SYN etc. packets, and skips
walking the full rule set in that case (or, more generally, a fast TCP
flag-mask check similar to what TCP does). With that, ESTABLISHED
packets would hit TCP with only relatively small overhead.

> For an example of the latter, if you have some 1GB Java process, you
> do not want to wake that monster up just to do some ACK processing or
> TCP window updates, yet if you don't you violate TCP's rules and risk
> spurious, unnecessary retransmits.

I don't quite get why the size of the process matters here: if only
some userspace TCP library is called directly, then it shouldn't
really matter how big or small the rest of the process is. Or did you
mean the migration costs described below?

But on the other hand, full userspace TCP seems to me of little gain
compared to a process-context implementation. I somehow like it better
to hide these implementation details in the kernel.

> Furthermore, the VJ netchannel gains can be partially obtained from
> generic stateless facilities that we are going to get anyway.
> Networking chips supporting multiple MSI-X vectors, chosen by hashing
> the flow ID, can move TCP processing to "end nodes", which in this
> case are CPU threads, by having each such MSI-X vector target a
> different CPU thread.

The problem with the scheme is that to do process-context processing
effectively, you would need to teach the scheduler to aggressively
migrate on wakeup (so that the process ends up on the CPU that was
selected by the hash function in the NIC). But what do you do when you
have lots of different connections with different target-CPU hash
values, or when this would require you to move multiple
compute-intensive processes onto a single core?

Without user-context TCP, using softirqs instead, it looks a bit
better, because you can at least use different CPUs to do the ACK
processing etc., and the hash function spreading connections out over
your CPUs doesn't hurt. But you still have relatively high cache line
transfer costs in handing these packets over from the softirq CPUs to
the final process consumer. I liked VJ's idea of using
arrays-of-something instead of lists for that, to avoid some cache
line transfers. OK, at least it sounds nice in theory; I haven't seen
any hard numbers on this scheme compared to a traditional doubly
linked list.

-Andi
* Re: RDMA will be reverted
  From: David Miller @ 2006-07-24 23:22 UTC
  To: ak; Cc: rdreier, tom, netdev, akpm

From: Andi Kleen <ak@suse.de>
Date: Tue, 25 Jul 2006 01:10:25 +0200

> Most of the optimizations from the early demux scheme could probably
> be gotten more simply by adding a fast path to iptables/conntrack/etc.
> that checks whether all rules only look at SYN etc. packets, and skips
> walking the full rule set in that case (or, more generally, a fast TCP
> flag-mask check similar to what TCP does). With that, ESTABLISHED
> packets would hit TCP with only relatively small overhead.

Actually, all is not lost. Alexey has a more clever idea, which is
basically to run the netfilter hooks in the socket receive path. So
we'd do the socket demux, wake the net channel task on the remote CPU,
and that thread of control would run the netfilter hooks.

> I don't quite get why the size of the process matters here: if only
> some userspace TCP library is called directly, then it shouldn't
> really matter how big or small the rest of the process is.

Where does state live in such a huge process? Usually it is scattered
all over its address space. Let us say that Java application just did
a lot of churning on its own data structures, swapping out the TCP
library's state objects; we would prematurely swap that stuff back in
just to spit out an ACK or similar.

> But on the other hand, full userspace TCP seems to me of little gain
> compared to a process-context implementation.

I totally agree.

> The problem with the scheme is that to do process-context processing
> effectively, you would need to teach the scheduler to aggressively
> migrate on wakeup (so that the process ends up on the CPU that was
> selected by the hash function in the NIC).

I don't see this as a big problem. It's all in software; we can
control the behavior.

> But what do you do when you have lots of different connections with
> different target-CPU hash values, or when this would require you to
> move multiple compute-intensive processes onto a single core?

That is why we have a scheduler :) Even in a best-effort scenario,
things will generally be better than they are currently, plus there is
nothing precluding the flow-demux MSI-X selection from getting more
intelligent.

For example, the demuxer could "notice" that TCP data transmits for
flow X tend to happen on CPU X, and update a flow table to record that
fact. It could use the same flow table as the one used for LRO.

> But you still have relatively high cache line transfer costs in
> handing these packets over from the softirq CPUs to the final process
> consumer.

It is true that in order to get the full benefit we have to target the
MSI-X vectors intelligently. For stateless things like routing and
IPSEC gateways and firewalls, none of this really matters. But for
local transports, it matters a lot.
* Re: RDMA will be reverted 2006-07-24 23:22 ` David Miller @ 2006-07-25 0:02 ` Andi Kleen 2006-07-25 0:29 ` Rick Jones 0 siblings, 1 reply; 74+ messages in thread From: Andi Kleen @ 2006-07-25 0:02 UTC (permalink / raw) To: David Miller; +Cc: rdreier, tom, netdev, akpm On Tuesday 25 July 2006 01:22, David Miller wrote: > From: Andi Kleen <ak@suse.de> > Date: Tue, 25 Jul 2006 01:10:25 +0200 > > > > All the original costs of route, netfilter, TCP socket lookup all > > > reappear as we make VJ netchannels fit all the rules of real practical > > > systems, eliminating their gains entirely. > > > > At least most of the optimizations from the early demux scheme could > > be probably gotten simpler by adding a fast path to iptables/conntrack/etc. > > that checks if all rules only check SYN etc. packets and doesn't walk the > > full rules then (or more generalized a fast TCP flag mask check similar > > to what TCP does). With that ESTABLISHED would hit TCP only with relatively > > small overhead. > > Actually, all is not lost. Alexey has a more clever idea which > is basically to run the netfilter hooks in the socket receive > path. The gain being that the target CPU does the work instead of the softirq one? Some combined lookup and better handler of ESTABLISHED still seems like a good idea. One idea I had at some point was to separate conntrack for local connection vs routed connections and attach the local conntrack to the socket (and use its lookup tables). Then at least for local connections conntrack should be nearly free. It should also solve the issue we currently have that enabled conntrack makes local network performance significantly worse. > Where does state live in such a huge process? Usually, it is > scattered all over it's address space. Let us say that java > application just did a lot of churning on it's own data > structure, swapping out TCP library state objects, we will > prematurely swap that stuff back in just to spit out an ACK > or similar. TCP state is usually multiple cache lines, so you would have cache misses anyways. Do you worry about the TLBs? > > But what do you do when you have lots of different connections > > with different target CPU hash values or when this would require > > you to move multiple compute intensive processes or a single core? > > That is why we have scheduler :) It can't do well if it gets conflicting input. > Even in a best effort scenerio, things > will be generally better than they are currently, plus there is nothing > precluding the flow demux MSI-X selection from getting more intelligent. Intelligent = statefull in this case. AFAIK the only way to do it stateless is hashes and the output of hashes tends to be unpredictible by definition. > For example, the demuxer could "notice" that TCPdata transmits for > flow X tend to happen on cpu X, and update a flow table to record that > fact. It could use the same flow table as the one used for LRO. Hmm, i somewhat doubt that lower end NICs will ever have such flow tables. Also the flow tables could always thrash (because the on NIC RAM is necessarily limited) or they or require the NIC to look up state in memory which is probably not much faster than the CPUs doing it. Using hash functions in the hardware to select the MSI-X seems more elegant, cheaper and much more scalable to me. The drawback of hashes is that for processes with multiple connections you have to move some work back into the softirqs that run on the MSI-X target CPUs. 
So basically doing process-context TCP fully will require much more complex and stateful hardware. Or you can keep it only as a fast path for specific situations (a single busy connection per thread) and stay with mostly-softirq processing for the many-connection cases. -Andi ^ permalink raw reply [flat|nested] 74+ messages in thread
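Andi's quoted fast-path suggestion is easy to picture in miniature: precompute, whenever the ruleset changes, whether every rule can only ever match SYN packets; if so, ESTABLISHED segments can skip the full rule walk. A hedged userspace sketch follows, with all structures hypothetical rather than actual netfilter internals.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define TCPHDR_SYN 0x02

    struct rule {
        uint8_t tcp_flag_mask;              /* TCP flags the rule inspects */
        uint8_t tcp_flag_cmp;               /* values those flags must have */
        /* match/target fields elided */
    };

    /* recomputed once per ruleset change, not per packet: true only if
     * every rule requires SYN to be set in order to match */
    static bool ruleset_syn_only(const struct rule *rules, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (!(rules[i].tcp_flag_mask & TCPHDR_SYN) ||
                !(rules[i].tcp_flag_cmp & TCPHDR_SYN))
                return false;
        return true;
    }

    /* per packet: a non-SYN segment against a SYN-only ruleset can
     * bypass the rule walk entirely */
    static bool can_skip_rule_walk(bool syn_only_ruleset, uint8_t tcp_flags)
    {
        return syn_only_ruleset && !(tcp_flags & TCPHDR_SYN);
    }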
* Re: RDMA will be reverted 2006-07-25 0:02 ` Andi Kleen @ 2006-07-25 0:29 ` Rick Jones 2006-07-25 0:45 ` David Miller 2006-07-25 1:42 ` Andi Kleen 0 siblings, 2 replies; 74+ messages in thread From: Rick Jones @ 2006-07-25 0:29 UTC (permalink / raw) To: Andi Kleen; +Cc: David Miller, rdreier, tom, netdev, akpm This all sounds like the discussions we had within HP-UX between 10.20 and 11.0 concerning Inbound Packet Scheduling vs Thread Optimized Packet Scheduling. IPS was done by the 10.20 stack at the handoff between the driver and netisr. If the packet was not an IP datagram fragment, parts of the transport and IP headers would be hashed, and the result would be the netisr queue to which the packet would be queued for further processing. It worked fine and dandy for stuff like aggregate netperf TCP_RR tests because there was a 1-1 correspondence between a connection and a process/thread. It was "OK" for the networking to dictate where the process should run. That feels rather like a NIC that would hash packets and pick the MSI-X based on that. However, as Andi discusses, when there is a process/thread doing more than one connection, picking a CPU based on address hashing will be like TweedleDee and TweedleDum telling Alice to go in opposite directions. Hence TOPS in 11.X. This time, when there is a "normal" lookup location in the path, the CPU where the application last accessed the socket is determined, and things shift over to that CPU. This then is the process (well, actually the scheduler) telling networking where it should do its work. That addresses the multiple connections per thread/process and still works just as well for 1-1. There are still issues if you have multiple threads/processes concurrently accessing the same socket/connection, but that one is much more rare. Nirvana I suppose would be the addition of a field in the header which could be used for the determination of where to process. A Transport Protocol option I suppose, maybe the IPv6 flow id, but knuth only knows if anyone would go for something along those lines. It does though mean that the "state" is per-packet without it having to be based on addressing information. Almost like RDMA arriving saying where the data goes, but this thing says where the processing should happen :) rick jones ^ permalink raw reply [flat|nested] 74+ messages in thread
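The TOPS idea Rick describes — the application, via the scheduler, deciding where packet processing happens — reduces to remembering one integer per socket. A speculative C sketch; none of these names exist in any stack, and they only illustrate the shape of the mechanism:

    #include <stdint.h>

    struct sock_lite {
        uint32_t flow_hash;
        int      last_app_cpu;              /* -1 until the app touches the socket */
    };

    /* called from the read/recv path with the CPU the application ran on */
    static void sock_note_app_cpu(struct sock_lite *sk, int cpu)
    {
        sk->last_app_cpu = cpu;
    }

    /* called from the RX demux: follow the application if we know where
     * it is, otherwise fall back to plain hash steering */
    static int sock_pick_rx_cpu(const struct sock_lite *sk, int nr_cpus)
    {
        if (sk->last_app_cpu >= 0)
            return sk->last_app_cpu;
        return (int)(sk->flow_hash % (uint32_t)nr_cpus);
    }

This handles the many-connections-per-thread case that pure hashing gets wrong, and degenerates to hash steering for the 1-1 case.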
* Re: RDMA will be reverted 2006-07-25 0:29 ` Rick Jones @ 2006-07-25 0:45 ` David Miller 2006-07-25 0:55 ` Rick Jones 2006-07-25 1:03 ` Rick Jones 2006-07-25 1:42 ` Andi Kleen 1 sibling, 2 replies; 74+ messages in thread From: David Miller @ 2006-07-25 0:45 UTC (permalink / raw) To: rick.jones2; +Cc: ak, rdreier, tom, netdev, akpm From: Rick Jones <rick.jones2@hp.com> Date: Mon, 24 Jul 2006 17:29:05 -0700 > Nirvana I suppose would be the addition of a field in the header > which could be used for the determination of where to process. A > Transport Protocol option I suppose, maybe the IPv6 flow id, but > knuth only knows if anyone would go for something along those lines. > It does though mean that the "state" is per-packet without it having > to be based on addressing information. Almost like RDMA arriving > saying where the data goes, but this thing says where the processing > should happen :) Since the full interpretation of the TCP timestamp option field value is largely local to the peer setting it, there is nothing wrong with stealing a few bits for destination cpu information. It would have to be done in such a way as to not make the PAWS tests fail by accident. But I think it's doable. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-25 0:45 ` David Miller @ 2006-07-25 0:55 ` Rick Jones 2006-07-25 1:04 ` Andi Kleen 2006-07-25 1:21 ` David Miller 1 sibling, 2 replies; 74+ messages in thread From: Rick Jones @ 2006-07-25 0:55 UTC (permalink / raw) To: David Miller; +Cc: ak, rdreier, tom, netdev, akpm David Miller wrote: > From: Rick Jones <rick.jones2@hp.com> > Date: Mon, 24 Jul 2006 17:29:05 -0700 > > >>Nirvana I suppose would be the addition of a field in the header >>which could be used for the determination of where to process. A >>Transport Protocol option I suppose, maybe the IPv6 flow id, but >>knuth only knows if anyone would go for something along those lines. >>It does though mean that the "state" is per-packet without it having >>to be based on addressing information. Almost like RDMA arriving >>saying where the data goes, but this thing says where the processing >>should happen :) > > > Since the full interpretation of the TCP timestamp option field value > is largely local to the peer setting it, there is nothing wrong with > stealing a few bits for destination cpu information. Even enough bits for 1024 or 2048 CPUs in the single system image? I have seen 1024 touted by SGI, and with things going so multi-core, perhaps 16384, while sounding initially bizarre, would be in the realm of the theoretically possible before tooooo long. > It would have to be done in such a way as to not make the PAWS > tests fail by accident. But I think it's doable. That would cover TCP; are there similarly fungible fields in SCTP or other ULPs? And if we were to want to get HW support for the thing, getting it adopted in a de jure standards body would probably be in order :) rick jones ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-25 0:55 ` Rick Jones @ 2006-07-25 1:04 ` Andi Kleen 2006-07-25 1:21 ` David Miller 1 sibling, 0 replies; 74+ messages in thread From: Andi Kleen @ 2006-07-25 1:04 UTC (permalink / raw) To: Rick Jones; +Cc: David Miller, rdreier, tom, netdev, akpm > Even enough bits for 1024 or 2048 CPUs in the single system image? MSI-X supports at most 255 sub-interrupts, and most hardware probably supports far fewer (e.g. 8 seems to be a popular number). It can always be hashed down to the existing CPUs. It's a nice idea, but I think standard hashing + processing in softirq would be worth a try first, at least. -Andi ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-25 0:55 ` Rick Jones 2006-07-25 1:04 ` Andi Kleen @ 2006-07-25 1:21 ` David Miller 2006-07-25 16:29 ` Rick Jones 1 sibling, 1 reply; 74+ messages in thread From: David Miller @ 2006-07-25 1:21 UTC (permalink / raw) To: rick.jones2; +Cc: ak, rdreier, tom, netdev, akpm From: Rick Jones <rick.jones2@hp.com> Date: Mon, 24 Jul 2006 17:55:24 -0700 > Even enough bits for 1024 or 2048 CPUs in the single system image? I have seen > 1024 touted by SGI, and with things going so multi-core, perhaps 16384, while > sounding initially bizarre, would be in the realm of the theoretically possible > before tooooo long. Read the RSS NDIS documents from Microsoft. You aren't going to want to demux to more than, say, 256 cpus for a single network adapter even on the largest machines. Therefore a simple translation table and/or "base cpu number" is sufficient to only need 8 bits of cpu identification. You will be limited by the number of MSI-X vectors also, for implementations demuxing directly to cpus using MSI-X selection. > That would cover TCP; are there similarly fungible fields in SCTP or > other ULPs? And if we were to want to get HW support for the thing, > getting it adopted in a de jure standards body would probably be in > order :) Microsoft never does this, neither do we. LRO came out of our own design, the network folks found it reasonable and thus they have started to implement it. The same is true for Microsoft's RSS stuff. It's a hardware interpretation, therefore it belongs in a driver API specification, nowhere else. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-25 1:21 ` David Miller @ 2006-07-25 16:29 ` Rick Jones 2006-07-25 16:32 ` Andi Kleen 0 siblings, 1 reply; 74+ messages in thread From: Rick Jones @ 2006-07-25 16:29 UTC (permalink / raw) To: David Miller; +Cc: ak, rdreier, tom, netdev, akpm David Miller wrote: > From: Rick Jones <rick.jones2@hp.com> > Date: Mon, 24 Jul 2006 17:55:24 -0700 > > >>Even enough bits for 1024 or 2048 CPUs in the single system image? I have seen >>1024 touted by SGI, and with things going so multi-core, perhaps 16384, while >>sounding initially bizarre, would be in the realm of the theoretically possible >>before tooooo long. > > > Read the RSS NDIS documents from Microsoft. I'll see about hunting them down. > You aren't going to want > to demux to more than, say, 256 cpus for a single network adapter even > on the largest machines. I suppose, it just seems to tweak _small_ alarms in my intuition - maybe because it still sounds like networking telling the scheduler where to run threads of execution, and even though I'm a networking guy I seem to have the notion that it should be the other way 'round. >>That would cover TCP; are there similarly fungible fields in SCTP or >>other ULPs? And if we were to want to get HW support for the thing, >>getting it adopted in a de jure standards body would probably be in >>order :) > > > Microsoft never does this, neither do we. LRO came out of our own > design, the network folks found it reasonable and thus they have > started to implement it. The same is true for Microsoft's RSS stuff. > > It's a hardware interpretation, therefore it belongs in a driver API > specification, nowhere else. It may be a hardware interpretation, but doesn't it have non-trivial system implications - where one runs threads/processes etc? rick jones ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-25 16:29 ` Rick Jones @ 2006-07-25 16:32 ` Andi Kleen 0 siblings, 0 replies; 74+ messages in thread From: Andi Kleen @ 2006-07-25 16:32 UTC (permalink / raw) To: Rick Jones; +Cc: David Miller, rdreier, tom, netdev, akpm > It may be a hardware interpretation, but doesn't it have non-trivial system > implications - where one runs threads/processes etc? Only if you do process-context RX processing. If you choose not to, it doesn't have much influence. -Andi ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-25 0:45 ` David Miller 2006-07-25 0:55 ` Rick Jones @ 2006-07-25 1:03 ` Rick Jones 1 sibling, 0 replies; 74+ messages in thread From: Rick Jones @ 2006-07-25 1:03 UTC (permalink / raw) To: David Miller; +Cc: ak, rdreier, tom, netdev, akpm > It would have to be done in such a way as to not make the PAWS > tests fail by accident. But I think it's doable. CPU ID and higher-order generation number such that whenever the process migrates to a lower-numbered CPU, the generation number is bumped to make the timestamp larger than before? rick jones ^ permalink raw reply [flat|nested] 74+ messages in thread
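Rick's encoding can be written down in a dozen lines. A speculative sketch, assuming the ~256-CPU limit discussed above; the key property is that PAWS only needs the values a sender emits to be non-decreasing, which the generation bump preserves. Every name is invented for illustration.

    #include <stdint.h>

    #define CPU_BITS 8                      /* per the 8-bit cpu id discussed above */
    #define CPU_MASK ((1u << CPU_BITS) - 1)

    struct ts_state {
        uint32_t gen;                       /* bumped so encoded values never decrease */
        uint32_t last_cpu;
    };

    /* sender: clock in the high bits, CPU id in the low bits */
    static uint32_t ts_encode(struct ts_state *s, uint32_t clock, uint32_t cpu)
    {
        if (cpu < s->last_cpu)
            s->gen++;                       /* migration to a lower-numbered CPU */
        s->last_cpu = cpu;
        return ((clock + s->gen) << CPU_BITS) | (cpu & CPU_MASK);
    }

    /* receiver: recover the CPU hint from the echoed timestamp */
    static uint32_t ts_decode_cpu(uint32_t tsval)
    {
        return tsval & CPU_MASK;
    }

Bumping the generation adds 2^CPU_BITS to the encoded value, which always exceeds any drop in the low CPU bits, so the sequence stays monotonic across migrations.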
* Re: RDMA will be reverted 2006-07-25 0:29 ` Rick Jones 2006-07-25 0:45 ` David Miller @ 2006-07-25 1:42 ` Andi Kleen 1 sibling, 0 replies; 74+ messages in thread From: Andi Kleen @ 2006-07-25 1:42 UTC (permalink / raw) To: Rick Jones; +Cc: David Miller, rdreier, tom, netdev, akpm On Tuesday 25 July 2006 02:29, Rick Jones wrote: > This all sounds like the discussions we had within HP-UX between 10.20 and 11.0 > concerning Inbound Packet Scheduling vs Thread Optimized Packet Scheduling. We've also been talking about this for many years, just no code so far. Or rather, Linux has so far left the job to manual tuning. -Andi ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-24 22:06 ` David Miller 2006-07-24 23:10 ` Andi Kleen @ 2006-07-25 5:51 ` Evgeniy Polyakov 2006-07-25 6:48 ` David Miller 1 sibling, 1 reply; 74+ messages in thread From: Evgeniy Polyakov @ 2006-07-25 5:51 UTC (permalink / raw) To: David Miller; +Cc: rdreier, ak, tom, netdev, akpm On Mon, Jul 24, 2006 at 03:06:13PM -0700, David Miller (davem@davemloft.net) wrote: > Don't get too excited about VJ netchannels, more and more roadblocks > to their practicality are being found every day. > > For example, my idea to allow ESTABLISHED TCP socket demux to be done > before netfilter is flawed. Connection tracking and NAT can change > the packet ID and loop it back to us to hit exactly an ESTABLISHED TCP > socket, therefore we must always hit netfilter first. There is no problem with netfilter and process-context processing - when an skb is removed from the hardware list/array and is being processed by netfilter in a netchannel (or in process context in general), there is no problem if a changed skb is rerouted into a different queue and state. > All the original costs of route, netfilter, TCP socket lookup all > reappear as we make VJ netchannels fit all the rules of real practical > systems, eliminating their gains entirely. I will also note in > passing that papers on related ideas, such as the Exokernel stuff, are > very careful to not address the issue of how practical 1) their demux > engine is and 2) the negative side effects of userspace TCP > implementations. For an example of the latter, if you have some 1GB > JAVA process you do not want to wake that monster up just to do some > ACK processing or TCP window updates, yet if you don't you violate > TCP's rules and risk spurious unnecessary retransmits. I still plan to continue the userspace implementation. If the gigantic-java-monster (tm) is going to read some data, it has been awakened already, thus it is in memory (with the linked TCP lib), so there is zero overhead. > Furthermore, the VJ netchannel gains can be partially obtained from > generic stateless facilities that we are going to get anyways. > Networking chips supporting multiple MSI-X vectors, chosen by hashing > the flow ID, can move TCP processing to "end nodes" which are cpu > threads in this case, by having each such MSI-X vector target a > different cpu thread. And if that CPU is very busy? Linux should somehow tell the NIC that some CPUs are valid and some are not right now, not in a second, so the scheduler must be tightly bound to network internals. Just my 2 coins. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-25 5:51 ` Evgeniy Polyakov @ 2006-07-25 6:48 ` David Miller 2006-07-25 6:59 ` Evgeniy Polyakov 0 siblings, 1 reply; 74+ messages in thread From: David Miller @ 2006-07-25 6:48 UTC (permalink / raw) To: johnpol; +Cc: rdreier, ak, tom, netdev, akpm From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Date: Tue, 25 Jul 2006 09:51:28 +0400 > On Mon, Jul 24, 2006 at 03:06:13PM -0700, David Miller (davem@davemloft.net) wrote: > > Furthermore, the VJ netchannel gains can be partially obtained from > > generic stateless facilities that we are going to get anyways. > > Networking chips supporting multiple MSI-X vectors, chosen by hashing > > the flow ID, can move TCP processing to "end nodes" which are cpu > > threads in this case, by having each such MSI-X vector target a > > different cpu thread. > > And if that CPU is very busy? > Linux should somehow tell the NIC that some CPUs are valid and some are not > right now, not in a second, so the scheduler must be tightly bound to > network internals. Yes, it is a research problem. Most of the time, even a stateless version will improve things. From another viewpoint, even in the worst case, it can be no worse than the current situation. :) BTW, such dynamic remapping is provided for in the NDIS interfaces. There is an indexing table that is gone through using a computed hash to get a "cpu number". ^ permalink raw reply [flat|nested] 74+ messages in thread
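The NDIS-style indexing table David mentions is tiny: the NIC hashes the flow, the hash indexes a host-writable table, and the table entry names the target CPU. A minimal sketch, with the size and names invented purely for illustration:

    #include <stdint.h>

    #define RSS_TABLE_SIZE 128                  /* small, host-writable */

    static uint8_t rss_table[RSS_TABLE_SIZE];   /* entry = target CPU */

    /* the NIC computes the hash over the flow's addresses and ports */
    static unsigned int rss_pick_cpu(uint32_t flow_hash)
    {
        return rss_table[flow_hash % RSS_TABLE_SIZE];
    }

    /* dynamic remapping: rewriting an entry shifts every flow that hashes
     * to it onto another CPU, with no per-flow state on the NIC at all */
    static void rss_remap(unsigned int entry, uint8_t new_cpu)
    {
        rss_table[entry % RSS_TABLE_SIZE] = new_cpu;
    }

The indirection is what keeps the scheme stateless: the hardware never tracks flows, only a fixed-size table the host can rewrite at will.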
* Re: RDMA will be reverted 2006-07-25 6:48 ` David Miller @ 2006-07-25 6:59 ` Evgeniy Polyakov 2006-07-25 7:33 ` David Miller 0 siblings, 1 reply; 74+ messages in thread From: Evgeniy Polyakov @ 2006-07-25 6:59 UTC (permalink / raw) To: David Miller; +Cc: rdreier, ak, tom, netdev, akpm On Mon, Jul 24, 2006 at 11:48:53PM -0700, David Miller (davem@davemloft.net) wrote: > > And if that CPU is very busy? > > Linux should somehow tell the NIC that some CPUs are valid and some are not > > right now, not in a second, so the scheduler must be tightly bound to > > network internals. > > Yes, it is a research problem. > > Most of the time, even a stateless version will improve things. > From another viewpoint, even in the worst case, it can be no > worse than the current situation. :) > > BTW, such dynamic remapping is provided for in the NDIS interfaces. > There is an indexing table that is gone through using a computed hash to > get a "cpu number". I think we should have the Linux scheduler export some easily accessed CPU statistics, so that info can be used by the irq layer/protocol processing. As a side note, completely unrelated to either my work or others' :) - I think it is a nano-optimisation - we get a bit of performance here, and lose that bit in another place. When a bag is filled, there is not much sense in rearranging the stuff inside to be able to place another item - it is better to buy a new bag. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 74+ messages in thread
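If the scheduler did export such statistics, the feedback loop Evgeniy wants could be as simple as periodically rebuilding the indirection table from CPU load. A hedged sketch only - cpu_load_pct() is a hypothetical export that exists nowhere, and a real policy would need hysteresis so flows are not reordered constantly:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    extern unsigned int cpu_load_pct(unsigned int cpu);   /* hypothetical scheduler export */

    /* hand table entries to lightly loaded CPUs; skip saturated ones */
    static void rss_rebalance(uint8_t *table, size_t tbl_size,
                              unsigned int nr_cpus)
    {
        size_t i = 0;

        while (i < tbl_size) {
            bool placed = false;

            for (unsigned int cpu = 0; cpu < nr_cpus && i < tbl_size; cpu++) {
                if (cpu_load_pct(cpu) > 90)
                    continue;               /* nearly saturated, skip */
                table[i++] = (uint8_t)cpu;
                placed = true;
            }
            if (!placed)                    /* everyone busy: plain round-robin */
                for (unsigned int cpu = 0; i < tbl_size; cpu++)
                    table[i++] = (uint8_t)(cpu % nr_cpus);
        }
    }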
* Re: RDMA will be reverted 2006-07-25 6:59 ` Evgeniy Polyakov @ 2006-07-25 7:33 ` David Miller 2006-07-25 7:42 ` Evgeniy Polyakov 0 siblings, 1 reply; 74+ messages in thread From: David Miller @ 2006-07-25 7:33 UTC (permalink / raw) To: johnpol; +Cc: rdreier, ak, tom, netdev, akpm From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Date: Tue, 25 Jul 2006 10:59:21 +0400 > As a side note, completely unrelated to either my work or others' :) - > I think it is a nano-optimisation - we get a bit of performance here, > and lose that bit in another place. > When a bag is filled, there is not much sense in rearranging the stuff > inside to be able to place another item - it is better to buy a new bag. It is a matter of what the viewpoint is, I suppose. I think in this specific case it might turn out to be better for the scheduler to respond to what the device throws at it, rather than the other way around. And in that case we need no feedback from the scheduler to the cpu demux engine. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-25 7:33 ` David Miller @ 2006-07-25 7:42 ` Evgeniy Polyakov 0 siblings, 0 replies; 74+ messages in thread From: Evgeniy Polyakov @ 2006-07-25 7:42 UTC (permalink / raw) To: David Miller; +Cc: rdreier, ak, tom, netdev, akpm On Tue, Jul 25, 2006 at 12:33:44AM -0700, David Miller (davem@davemloft.net) wrote: > From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> > Date: Tue, 25 Jul 2006 10:59:21 +0400 > > > As a side note, completely unrelated to either my work or others' :) - > > I think it is a nano-optimisation - we get a bit of performance here, > > and lose that bit in another place. > > When a bag is filled, there is not much sense in rearranging the stuff > > inside to be able to place another item - it is better to buy a new bag. > > It is a matter of what the viewpoint is, I suppose. Definitely. > I think in this specific case it might turn out to be > better for the scheduler to respond to what the device > throws at it, rather than the other way around. And > in that case we need no feedback from the scheduler to > the cpu demux engine. That's exactly the one-bit lose/gain - if the CPU is loafing we get a gain, and lose otherwise - so instead of generally predictable steady behaviour we can end up with bursty shapes. Actually, without real tests all this is just handwaving, so let's see when modern NICs get that capability, so that network softirq scheduling can be changed accordingly. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-01 14:26 ` Andi Kleen 2006-07-04 18:34 ` Andy Gay 2006-07-04 20:34 ` Roland Dreier @ 2006-07-05 17:09 ` Tom Tucker 2006-07-05 17:50 ` Steve Wise 2006-07-24 22:23 ` David Miller 2 siblings, 2 replies; 74+ messages in thread From: Tom Tucker @ 2006-07-05 17:09 UTC (permalink / raw) To: Andi Kleen; +Cc: David Miller, rdreier, netdev, akpm On Sat, 2006-07-01 at 16:26 +0200, Andi Kleen wrote: > On Saturday 01 July 2006 01:01, Tom Tucker wrote: > > On Fri, 2006-06-30 at 14:16 -0700, David Miller wrote: > > > > > The TOE folks have tried to submit their hooks and drivers > > > on several occasions, and we've rejected it every time. > > > > iWARP != TOE > > Perhaps a good start of that discussion David asked for would > be if you could give us an overview of the differences > and how you avoid the TOE problems. I think Roland already gave the high-level overview. For those interested in some of the details, the API for iWARP transports was originally conceived independently from IB and is documented in the RDMAC Verbs Specification found here: http://www.rdmaconsortium.org/home/draft-hilland-iwarp-verbs-v1.0-RDMAC.pdf The protocols, etc... are available here: http://www.ietf.org/html.charters/rddp-charter.html As Roland mentioned, the RDMAC verbs are *very* similar to the IB verbs, and so when we were thinking about how to design an API for iWARP we concluded it would be best to leverage the tremendous amount of work already done for IB by OpenFabrics and then work iteratively to extend this API to include features unique to iWARP. This work has been ongoing since September of 2005. There is an open source svn repository available for the iWARP source at https://openib.org/svn/gen2/branches/iwarp. There is also an open source NFS over RDMA implementation for Linux available here: http://sourceforge.net/projects/nfs-rdma. So how do we avoid the TOE pitfalls with iWARP? I think it depends on the pitfall. At the low level: - Stale Network/Address Information: Path MTU Change, ICMP Redirect and ARP next hop changes need netlink notifier events so that hardware can be updated when they change. I see this support as an extension (new events) to an existing service and a relatively low level of "parallel stack integration". iSCSI and IB could also benefit from these events. - Port Space Collision, i.e. socket app and rdma/iWARP apps collide on a port number: The RDMA CMA needs to be able to allocate and de-allocate port numbers; however, the services that do this today are not exported and would need some minor tweaking. iSCSI and IB benefit from these services as well. - netfilter rules, syn-flood, conn-rate, etc.... You pointed out that if connection establishment were done in the native stack (SYN, SYN/ACK), this would account for the bulk of the netfilter utility; however, this probably results in falling into many of the TOE traps people have issue with. WRT http://linux-net.osdl.org/index.php/TOE Security Updates "A TOE net stack is closed source firmware. Linux engineers have no way to fix security issues that arise. As a result, only non-TOE users will receive security updates, leaving random windows of vulnerability for each TOE NIC's users." - A Linux security update may or may not be relevant to a vendor's implementation. - If a vendor's implementation has a security issue then the customer must rely on the vendor to fix it. This is no less true for iWARP than for any adapter.
Point-in-time Solution "Each TOE NIC has a limited lifetime of usefulness, because system hardware rapidly catches up to TOE performance levels, and eventually exceeds TOE performance levels. We saw this with 10mbit TOE, 100mbit TOE, gigabit TOE, and soon with 10gig TOE." - iWARP needs to do protocol processing in order to validate and evaluate TCP payload in advance of direct data placement. This requirement is independent of CPU speed. Different Network Behavior "System administrators are quite familiar with how the Linux network stack interoperates with the world at large. TOE is a black box, each NIC requires re-examination of network behavior. Network scanners and analysis tools must be updated, or they will provide faulty analysis." - Native Linux tools like tcpdump, netstat, etc... will not work as expected. - Network analyzers such as Finisar, etc... will work just fine. Performance "Experience has shown that TOE implementations require additional work (programming the hardware, hardware-specific socket manipulation) to set up and tear down connections. For connection intensive protocols such as HTTP, TOE often underperforms." - I suspect that connection rates for RDMA adapters fall well below the rates attainable with a dumb device. That said, all of the RDMA applications that I know of are not connection intensive. Even for TOE, the later HTTP versions make connection rates less of an issue. Hardware-specific limits "TOE NICs are more resource limited than your overall computer system. This is most readily apparent under load, when trying to support thousands of simultaneous connections. TOE NICs simply do not have the memory resources to buffer thousands of connections, much less have the CPU power to handle such loads. Further, each TOE NIC has different resource limitations (often unpublished, only to be discovered at the worst moments)." - Any hardware device has this issue, and so does iWARP. "Once resources are exhausted, TOE will either fall back to 100% software net stack, defeating the purpose of TOE, or will deny service to additional clients." - A depleted iWARP adapter will simply fail the request. There is no parallel iWARP stack to fall back on. Resource-based denial-of-service attacks "If an attacker can discover the TOE NIC model in use, they can use this information to enable resource-based algorithmic attacks. For example, a SYN flood could potentially use up all TOE resources in a matter of seconds. The TOE NIC will either stop accepting connections (complete DoS), or will constantly bounce back to the software net stack." - True of iWARP too. RFC compliance "Linux is the most RFC-compliant network stack available. TOE can only diminish this. Further, as a black box, each TOE NIC will have a different level of RFC compliance, and different TCP/IP features they do/don't support." - True of iWARP too. Linux features "TOE is by definition poorly integrated into Linux. TOE NICs will not provide netfilter, packet scheduling, QoS, and many other features that Linux users depend on. Or if they do provide this, they implement the features in a vendor-specific manner. The featureset becomes vendor-specific." - This is the problem we're trying to solve...incrementally and responsibly. Requires vendor-specific tools "In order to configure a TOE NIC, hardware-specific tools are usually required. This dramatically increases support costs."
- OpenFabrics is an attempt to solve this not only across vendors, but also across transports (at this time IB and iWARP). Poor user support "Linux engineers cannot provide an adequate level of support for TOE users, and must instead refer users to the vendor -- who in all likelihood cares more about non-Linux operating systems." - This will certainly be true for iWARP early on. Short term kernel maintenance "Supporting TOE requires massive, heavily invasive hooks into the network stack. This increases the kernel maintenance burden on Linux engineers, to support a solution Linux engineers have no control over." - iWARP does not use sockets and does not share data structures with the TCP stack. - It is not my opinion, however, that the patches in question consist of "massive, heavily invasive hooks into the network stack". Long term user support "Linux has been in existence for over a decade, and some pieces of decade-old hardware continue to be used and supported. In contrast, most hardware vendors end-of-life (stop supporting) their hardware after just a few years. For most hardware vendors, the sales of old hardware simply do not justify dedicating engineers to Linux support for many years." - If the hooks are not hideous and invasive then support should not be any more onerous than for any other hardware device. Long term kernel maintenance "Similarly, kernel engineers must support TOE for as long as users continue to use the hardware. Hardware vendors disappear, get bought, or simply disappear (go out of business) during our maintenance timeframe. Once a hardware vendor loses interest in Linux, TOE NICs will cease to receive security updates, and hardware issues become incredibly difficult to debug. Each new generation of system hardware often requires re-examination of hardware drivers, a task made far more difficult without a hardware vendor to receive questions." - This seems like a general rant against any hardware device, and so it applies to iWARP too. Eliminates global system view "With TOE, the system no longer has a complete picture of all resources used by network connections. Some connections are software-based, and thus limited by existing policy controls (such as per-socket memory limits). Other connections are managed by TOE, and these details are hidden. As such, the VM cannot adequately manage overall socket buffer memory usage, TOE-enabled connections cannot be rate-limited by the same controls as software-based connections, per-user socket security limits may be ignored, etc." - iWARP doesn't use socket buffers. "Linux has several TCP Congestion Control algorithms available. For TOE connections, this would no longer be true, all the congestion control would be done by proprietary vendor specific algorithms on the card." - I don't know of any proprietary congestion control algorithms built into iWARP and doubt they would work between vendors. There is an iWARP Interoperability Lab at UNH that tests this kind of thing. > -Andi ^ permalink raw reply [flat|nested] 74+ messages in thread
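The port-space point above is worth making concrete. Nothing below exists in the kernel; it is only a toy picture of the kind of service Tom says would need exporting, so that an RDMA CM and the socket layer draw ports from one shared pool:

    #include <stdbool.h>
    #include <stdint.h>

    static bool port_in_use[65536];     /* stand-in for the shared bind table */

    /* an RDMA CM asks for a port from the same space sockets use */
    static int rdma_reserve_port(uint16_t port)
    {
        if (port_in_use[port])
            return -1;                  /* would collide with a socket user */
        port_in_use[port] = true;
        return 0;
    }

    static void rdma_release_port(uint16_t port)
    {
        port_in_use[port] = false;
    }

The hard part, of course, is not the data structure but agreeing that the kernel should export such allocate/release entry points at all.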
* Re: RDMA will be reverted 2006-07-05 17:09 ` Tom Tucker @ 2006-07-05 17:50 ` Steve Wise 2006-07-24 22:25 ` David Miller 0 siblings, 1 reply; 74+ messages in thread From: Steve Wise @ 2006-07-05 17:50 UTC (permalink / raw) To: Tom Tucker; +Cc: Andi Kleen, David Miller, rdreier, netdev, akpm On Wed, 2006-07-05 at 12:09 -0500, Tom Tucker wrote: > On Sat, 2006-07-01 at 16:26 +0200, Andi Kleen wrote: > > On Saturday 01 July 2006 01:01, Tom Tucker wrote: > > > On Fri, 2006-06-30 at 14:16 -0700, David Miller wrote: > > > > > > > The TOE folks have tried to submit their hooks and drivers > > > > on several occasions, and we've rejected it every time. > > > > > > iWARP != TOE > > > > Perhaps a good start of that discussion David asked for would > > be if you could give us an overview of the differences > > and how you avoid the TOE problems. > > I think Roland already gave the high-level overview. For those > interested in some of the details, the API for iWARP transports was > originally conceived independently from IB and is documented in the > RDMAC Verbs Specification found here: > > http://www.rdmaconsortium.org/home/draft-hilland-iwarp-verbs-v1.0-RDMAC.pdf > > The protocols, etc... are available here: > http://www.ietf.org/html.charters/rddp-charter.html > > As Roland mentioned, the RDMAC verbs are *very* similar to the IB verbs, > and so when we were thinking about how to design an API for iWARP we > concluded it would be best to leverage the tremendous amount of work > already done for IB by OpenFabrics and then work iteratively to extend > this API to include features unique to iWARP. This work has been ongoing > since September of 2005. > > There is an open source svn repository available for the iWARP source at > https://openib.org/svn/gen2/branches/iwarp. > > There is also an open source NFS over RDMA implementation for Linux > available here: http://sourceforge.net/projects/nfs-rdma. > > > So how do we avoid the TOE pitfalls with iWARP? I think it depends on > the pitfall. At the low level: > > - Stale Network/Address Information: Path MTU Change, ICMP Redirect > and ARP next hop changes need netlink notifier events so that hardware > can be updated when they change. I see this support as an extension (new > events) to an existing service and a relatively low level of "parallel > stack integration". iSCSI and IB could also benefit from these events. > > - Port Space Collision, i.e. socket app and rdma/iWARP apps collide on > a port number: The RDMA CMA needs to be able to allocate and de-allocate > port numbers; however, the services that do this today are not exported > and would need some minor tweaking. iSCSI and IB benefit from these > services as well. > > - netfilter rules, syn-flood, conn-rate, etc.... You pointed out that > if connection establishment were done in the native stack (SYN, > SYN/ACK), this would account for the bulk of the netfilter utility; > however, this probably results in falling into many of the TOE traps > people have issue with. However, iWARP devices _could_ integrate with netfilter. For most devices the connection request event (SYN) gets passed up to the host driver. So the driver can enforce filter rules then. Also, I think a notification-type mechanism could be used to trigger iWARP drivers to go re-apply filter rules on existing connections and kill ones that should be filtered. I'm not that familiar yet with netfilter, but I think this could all be done... Steve.
^ permalink raw reply [flat|nested] 74+ messages in thread
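Steve's scheme, sketched naively: consult policy when the RNIC hands up a connection request, and sweep established offloaded connections whenever rules change. filter_allows() and the structures are hypothetical stand-ins, not netfilter or driver APIs - and David's reply below explains why this is not sufficient for NAT and conntrack.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct conn_tuple {
        uint32_t saddr, daddr;
        uint16_t sport, dport;
    };

    extern bool filter_allows(const struct conn_tuple *t);   /* hypothetical policy check */

    struct offloaded_conn {
        struct conn_tuple tuple;
        bool active;
    };

    /* connection request (SYN) event handed up by the RNIC */
    static bool iwarp_accept_connreq(const struct conn_tuple *t)
    {
        return filter_allows(t);
    }

    /* re-validate offloaded connections after a ruleset change */
    static void iwarp_revalidate(struct offloaded_conn *conns, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (conns[i].active && !filter_allows(&conns[i].tuple))
                conns[i].active = false;    /* tear the connection down */
    }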
* Re: RDMA will be reverted 2006-07-05 17:50 ` Steve Wise @ 2006-07-24 22:25 ` David Miller 2006-07-24 22:47 ` Caitlin Bestler 0 siblings, 1 reply; 74+ messages in thread From: David Miller @ 2006-07-24 22:25 UTC (permalink / raw) To: swise; +Cc: tom, ak, rdreier, netdev, akpm From: Steve Wise <swise@opengridcomputing.com> Date: Wed, 05 Jul 2006 12:50:34 -0500 > However, iWARP devices _could_ integrate with netfilter. For most > devices the connection request event (SYN) gets passed up to the host > driver. So the driver can enforce filter rules then. This doesn't work. In order to handle things like NAT and connection tracking properly you must even allow ESTABLISHED state packets to pass through netfilter. Netfilter can have rules such as "NAT port 200 to 300, leave the other fields alone" and your suggested scheme cannot handle this. ^ permalink raw reply [flat|nested] 74+ messages in thread
* RE: RDMA will be reverted 2006-07-24 22:25 ` David Miller @ 2006-07-24 22:47 ` Caitlin Bestler 0 siblings, 0 replies; 74+ messages in thread From: Caitlin Bestler @ 2006-07-24 22:47 UTC (permalink / raw) To: David Miller, swise; +Cc: tom, ak, rdreier, netdev, akpm netdev-owner@vger.kernel.org wrote: > From: Steve Wise <swise@opengridcomputing.com> > Date: Wed, 05 Jul 2006 12:50:34 -0500 > >> However, iWARP devices _could_ integrate with netfilter. For most >> devices the connection request event (SYN) gets passed up to the host >> driver. So the driver can enforce filter rules then. > > This doesn't work. In order to handle things like NAT and > connection tracking properly you must even allow ESTABLISHED > state packets to pass through netfilter. > > Netfilter can have rules such as "NAT port 200 to 300, leave > the other fields alone" and your suggested scheme cannot handle this. This is totally irrelevant. But it does work. First, an RDMA connection, once established, associates a TCP connection *as identified external to the box* with an RDMA endpoint (conventionally a "QP"). Performing a NAT translation on a TCP packet would certainly be within the capabilities of an RNIC, but it would be pointless. The relabeled TCP segment would be associated with the same QP. Once an RDMA connection is established, the individual TCP segments are only of interest to the RDMA endpoint. Payload is delivered through the RDMA interface (the same one already used for InfiniBand). The purpose of integration with netfilter would be to ensure that no RDMA connection could exist, or persist, if netfilter would not allow the TCP connection to be created. That is not a matter of packet filtering, it is a matter of administrative consistency. If someone uses netfilter to block connections from a given IP netmask then they reasonably expect that there will be no connections with any host within that IP netmask. They do not expect exceptions for RDMA, iSCSI or InfiniBand. The existing connection management interfaces in openfabrics, designed to support both InfiniBand and iWARP, could naturally be extended to validate all RDMA connections using an IP address with netfilter. This would be of real value. The only real value of a rule such as "NAT port 200 to 300" is to allow a remote peer to establish a connection to port 200 with a local listener using port 300. That *can* be supported without actually manipulating the header in each TCP packet. It is also possible to discuss other netfilter functionality that serves a valid end-user purpose, such as counting packets. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-05 17:09 ` Tom Tucker 2006-07-05 17:50 ` Steve Wise @ 2006-07-24 22:23 ` David Miller 2006-07-24 22:57 ` Caitlin Bestler 1 sibling, 1 reply; 74+ messages in thread From: David Miller @ 2006-07-24 22:23 UTC (permalink / raw) To: tom; +Cc: ak, rdreier, netdev, akpm From: Tom Tucker <tom@opengridcomputing.com> Date: Wed, 05 Jul 2006 12:09:42 -0500 > "A TOE net stack is closed source firmware. Linux engineers have no way > to fix security issues that arise. As a result, only non-TOE users will > receive security updates, leaving random windows of vulnerability for > each TOE NIC's users." > > - A Linux security update may or may not be relevant to a vendor's > implementation. > > - If a vendor's implementation has a security issue then the customer > must rely on the vendor to fix it. This is no less true for iWARP than > for any adapter. This isn't how things actually work. Users have a computer, and they can rightly expect the community to help them solve problems that occur in the upstream kernel. When a bug is found and the person is using NIC X, we don't necessarily forward the bug report to the vendor of NIC X. Instead we try to fix the bug. Many chip drivers are maintained by people who do not work for the company that makes the chip, and this works just fine. If only the chip vendor can fix a security problem, this makes Linux less agile to fix. Every aspect of a problem on a Linux system that cannot be fixed entirely by the community is a net negative for Linux. > - iWARP needs to do protocol processing in order to validate and > evaluate TCP payload in advance of direct data placement. This > requirement is independent of CPU speed. Yet, RDMA itself is just an optimization meant to deal with limitations of cpu and memory speed. You can rephrase the situation in whatever way suits your argument, but it does not make the core issue go away :) > - I suspect that connection rates for RDMA adapters fall well below the > rates attainable with a dumb device. That said, all of the RDMA > applications that I know of are not connection intensive. Even for TOE, > the later HTTP versions make connection rates less of an issue. This is a very naive evaluation of the situation. Yes, newer versions of protocols such as HTTP make the per-client connection burden lower, but the number of clients will increase in time to more than make up for whatever gains are seen due to this. And then you have protocols which by design are connection heavy, and rightly so, such as bittorrent. Being able to handle enormous numbers of connections, with extreme scalability and low latency, is an absolute requirement of any modern day serious TCP stack. And this requirement is not going away. Wishing this requirement away due to HTTP persistent connections is very unrealistic, at best. > - This is the problem we're trying to solve...incrementally and > responsibly. You can't. See my email to Roland about why even VJ net channels are found to be impractical. To support netfilter properly, you must traverse the whole netfilter stack, because NAT can rewrite packets, yet still make them destined for the local system, and thus they will have a different identity for connection demux by the time the TCP stack sees the packet. All of these transformations occur between normal packet receive and the TCP stack. You would therefore need to put your card between netfilter and TCP in the packet input path, and at that point why bother with the stateful card at all?
The fact is that stateless approaches will always be better than stateful things because you cannot replicate the functionality we have in the Linux stack without replicating 10 years of work into your chip's firmware. At that point you should just run Linux on your NIC since that is what you are effectively doing :) In conversations such as these, it helps us a lot if you can be frank and honest about the true absolute limitations of your technology. I can see that your viewpoint is tainted when I hear things such as HTTP persistent connections being used as a reason why high TCP connection rates won't matter in the future. Such assertions are understood to be patently false by anyone who understands TCP and how it is used in the real world. ^ permalink raw reply [flat|nested] 74+ messages in thread
* RE: RDMA will be reverted 2006-07-24 22:23 ` David Miller @ 2006-07-24 22:57 ` Caitlin Bestler 0 siblings, 0 replies; 74+ messages in thread From: Caitlin Bestler @ 2006-07-24 22:57 UTC (permalink / raw) To: David Miller, tom; +Cc: ak, rdreier, netdev, akpm netdev-owner@vger.kernel.org wrote: > From: Tom Tucker <tom@opengridcomputing.com> > Date: Wed, 05 Jul 2006 12:09:42 -0500 > >> "A TOE net stack is closed source firmware. Linux engineers have no >> way to fix security issues that arise. As a result, only non-TOE >> users will receive security updates, leaving random windows of >> vulnerability for each TOE NIC's users." >> >> - A Linux security update may or may not be relevant to a vendor's >> implementation. >> >> - If a vendor's implementation has a security issue then the customer >> must rely on the vendor to fix it. This is no less true for iWARP >> than for any adapter. > > This isn't how things actually work. > > Users have a computer, and they can rightly expect the > community to help them solve problems that occur in the > upstream kernel. > > When a bug is found and the person is using NIC X, we don't > necessarily forward the bug report to the vendor of NIC X. > Instead we try to fix the bug. Many chip drivers are > maintained by people who do not work for the company that > makes the chip, and this works just fine. > > If only the chip vendor can fix a security problem, this > makes Linux less agile to fix. Every aspect of a problem on a > Linux system that cannot be fixed entirely by the community > is a net negative for Linux. > >> - iWARP needs to do protocol processing in order to validate and >> evaluate TCP payload in advance of direct data placement. This >> requirement is independent of CPU speed. > > Yet, RDMA itself is just an optimization meant to deal with > limitations of cpu and memory speed. You can rephrase the > situation in whatever way suits your argument, but it does > not make the core issue go away :) > RDMA is a protocol that allows the application to more precisely state the actual ordering requirements. It improves the end-to-end interactions and has value over a protocol with only byte or message stream semantics regardless of local interface efficiencies. See http://ietf.org/internet-drafts/draft-ietf-rddp-applicability-08.txt In any event, isn't the value of an RDMA interface to applications already settled? The question is how best to integrate the usage of IP addresses with the kernel. The inability to validate the low-level packet processing in open source code is a limitation of *all* RDMA solutions; the transport layer of InfiniBand is just as offloaded as it is for iWARP. The patches proposed are intended to support integrated connection management for RDMA connections using IP addresses, no matter what the underlying transport is. The only difference is that *all* iWARP connections use IP addresses. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-06-30 23:01 ` Tom Tucker 2006-07-01 14:26 ` Andi Kleen @ 2006-07-01 21:45 ` David Miller 2006-07-04 20:34 ` Roland Dreier 1 sibling, 1 reply; 74+ messages in thread From: David Miller @ 2006-07-01 21:45 UTC (permalink / raw) To: tom; +Cc: rdreier, netdev, akpm From: Tom Tucker <tom@opengridcomputing.com> Date: Fri, 30 Jun 2006 18:01:43 -0500 > On Fri, 2006-06-30 at 14:16 -0700, David Miller wrote: > > > The TOE folks have tried to submit their hooks and drivers > > on several occasions, and we've rejected it every time. > > iWARP != TOE You are taking my comment out of context. And the fact that you removed the comment I am responding to shows that you really aren't following the conversation. Roland stated that it has never been the case that we have rejected adding support for a certain class of devices on the kinds of merits being discussed in this thread. And I'm saying that TOE is such a case where we have emphatically done so. So I am not saying iWARP or RDMA is equal to TOE, and if you had actually read this thread you would have understood that. You're just looking for cannon fodder in my emails. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-01 21:45 ` David Miller @ 2006-07-04 20:34 ` Roland Dreier 2006-07-05 18:27 ` David Miller 0 siblings, 1 reply; 74+ messages in thread From: Roland Dreier @ 2006-07-04 20:34 UTC (permalink / raw) To: David Miller; +Cc: tom, netdev, akpm > Roland stated that it has never been the case that we have > rejected adding support for a certain class of devices on the > kinds of merits being discussed in this thread. And I'm saying > that TOE is such a case where we have emphatically done so. Well, in the past it's seemed more like patches have been rejected not because of a blanket refusal to consider support for certain hardware, but rather because of issues with the patches themselves. e.g. last year when Chelsio submitted some TOE hooks, you wrote the following <http://marc.theaimsgroup.com/?l=linux-netdev&m=112382991506811&w=2> >> There is no way you're going to be allowed to call such deep TCP >> internals from your driver. >> This would mean that every time we wish to change the data structures >> and interfaces for TCP socket lookup, your drivers would need to >> change. which looks like a very good reason to reject the changes. > So I am not saying iWARP or RDMA is equal to TOE, and if you had > actually read this thread you would have understood that. There's definitely been quite a bit of conflation between the two in this thread, even if you're not responsible... - R. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-04 20:34 ` Roland Dreier @ 2006-07-05 18:27 ` David Miller 2006-07-05 20:29 ` Roland Dreier 0 siblings, 1 reply; 74+ messages in thread From: David Miller @ 2006-07-05 18:27 UTC (permalink / raw) To: rdreier; +Cc: tom, netdev, akpm From: Roland Dreier <rdreier@cisco.com> Date: Tue, 04 Jul 2006 13:34:30 -0700 > Well, in the past it's seemed more like patches have been rejected not > because of a blanket refusal to consider support for certain hardware, Then why in the world would we put up explicit web pages that say "TOE is bad, here's a list of reasons why" if we had any intention of ever adding support for these kinds of devices? http://linux-net.osdl.org/index.php/TOE It had nothing to do with a particular implementation of the patches, it had everything to do with fundamentals of the technology. It's going to be difficult to discuss RDMA and iWARP sanely unless you accept the indisputable fact that we've rejected TOE as a technology entirely, and it is an example of precedence for disallowing support for entire classes of hardware. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-05 18:27 ` David Miller @ 2006-07-05 20:29 ` Roland Dreier 2006-07-06 3:03 ` David Miller 0 siblings, 1 reply; 74+ messages in thread From: Roland Dreier @ 2006-07-05 20:29 UTC (permalink / raw) To: David Miller; +Cc: tom, netdev, akpm > Then why in the world would we put up explicit web pages that > say "TOE is bad, here's a list of reasons why" if we had any > intention of ever adding support for these kinds of devices? I think there's a little bit of a leap of logic there. Everyone agrees that winmodems are bad and yet there's still drivers/char/mwave. This TOE-phobia feels almost as if in the middle of one of those silly IDE vs. SCSI flamewars, someone declared that Linux shouldn't have IDE drivers. > It's going to be difficult to discuss RDMA and iWARP sanely unless you > accept the indisputable fact that we've rejected TOE as a technology > entirely, and it is an example of precedence for disallowing support > for entire classes of hardware. Fine. I don't think I have much more to add to the discussion anyway. The way forward seems to be to merge basic iWARP support that lives in drivers/infiniband, and then you can accept or reject things for better integration, like notifiers for routing changes. - R. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-05 20:29 ` Roland Dreier @ 2006-07-06 3:03 ` David Miller 2006-07-06 5:25 ` Tom Tucker 0 siblings, 1 reply; 74+ messages in thread From: David Miller @ 2006-07-06 3:03 UTC (permalink / raw) To: rdreier; +Cc: tom, netdev, akpm From: Roland Dreier <rdreier@cisco.com> Date: Wed, 05 Jul 2006 13:29:35 -0700 > The way forward seems to be to merge basic iWARP support that lives in > drivers/infiniband, and then you can accept or reject things for > better integration, like notifiers for routing changes. <sarcasm> Let's merge in drivers before the necessary infrastructure. </sarcasm> No, I think that's not the way forward. You build the foundation before the house, if the foundation cannot be built then you are wasting your time with the house idea. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-06 3:03 ` David Miller @ 2006-07-06 5:25 ` Tom Tucker 2006-07-06 14:08 ` Herbert Xu 2006-07-07 6:53 ` David Miller 0 siblings, 2 replies; 74+ messages in thread From: Tom Tucker @ 2006-07-06 5:25 UTC (permalink / raw) To: David Miller; +Cc: rdreier, netdev, akpm On Wed, 2006-07-05 at 20:03 -0700, David Miller wrote: > From: Roland Dreier <rdreier@cisco.com> > Date: Wed, 05 Jul 2006 13:29:35 -0700 > > > The way forward seems to be to merge basic iWARP support that lives in > > drivers/infiniband, and then you can accept or reject things for > > better integration, like notifiers for routing changes. > > <sarcasm> > Let's merge in drivers before the necessary infrastructure. > </sarcasm> > > No, I think that's not the way forward. You build the foundation > before the house, if the foundation cannot be built then you are > wasting your time with the house idea. We have been running NFS and other apps over iWARP 24x7 for the last 6 months...without the proposed netdev patch. We've run 200+ node MPI clusters for days and days over iWARP...without the proposed netdev patch. We ran iWARP interoperability tests across the country between Boston and San Jose...without ... yes I know ... you get it. <sarcasm> News flash...the foundation is built! </sarcasm> But! Our stable LAN and the WAN tests didn't often experience MTU changes and redirects...but of course we knew these were inevitable. So our goal was to make iWARP more robust in the face of a more dynamic network topology. Shutters on the house maybe...I dunno, it's your analogy ;-) All that said, the proposed patch helps not only iWARP, but other transports (iSCSI, IB) as well. It is not large, invasive, intrusive...hell, it's not even new. It leverages an existing event notifier mechanism. This patch is about dotting I's and crossing T's, it's not about foundations. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-06 5:25 ` Tom Tucker @ 2006-07-06 14:08 ` Herbert Xu 2006-07-06 17:36 ` Tom Tucker 2006-07-07 6:53 ` David Miller 1 sibling, 1 reply; 74+ messages in thread From: Herbert Xu @ 2006-07-06 14:08 UTC (permalink / raw) To: Tom Tucker; +Cc: davem, rdreier, netdev, akpm Tom Tucker <tom@opengridcomputing.com> wrote: > > All that said, the proposed patch helps not only iWARP, but other > transports (iSCSI, IB) as well. It is not large, invasive, Care to explain how it helps those other technologies? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-06 14:08 ` Herbert Xu @ 2006-07-06 17:36 ` Tom Tucker 2006-07-07 0:03 ` Herbert Xu 0 siblings, 1 reply; 74+ messages in thread From: Tom Tucker @ 2006-07-06 17:36 UTC (permalink / raw) To: Herbert Xu; +Cc: davem, rdreier, netdev, akpm On Fri, 2006-07-07 at 00:08 +1000, Herbert Xu wrote: > Tom Tucker <tom@opengridcomputing.com> wrote: > > > > All that said, the proposed patch helps not only iWARP, but other > > transports (iSCSI, IB) as well. It is not large, invasive, > > Care to explain how it helps those other technologies? The RDMA CMA uses IP addresses and port numbers to create a uniform addressing scheme across all transport types. For IB, it is necessary to resolve IP addresses to IB GIDs. The ARP protocol is used to do this, and a netfilter rule is installed to snoop the incoming ARP replies. This would not be necessary if ARP events were provided as in the patch. Unified wire iSCSI adapters have the same issue as iWARP wrt managing IP addresses and ports. > > Cheers, ^ permalink raw reply [flat|nested] 74+ messages in thread
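For flavor, consuming such events would look roughly like the kernel-style sketch below. The notifier registration follows the shape of the netevent interface under discussion (a neighbour-update event delivered through a notifier chain); cma_complete_resolution() is a hypothetical hook standing in for the CMA's real address-resolution bookkeeping, so treat the whole thing as illustrative rather than as the patch itself.

    #include <linux/init.h>
    #include <linux/notifier.h>
    #include <net/netevent.h>
    #include <net/neighbour.h>

    static void cma_complete_resolution(struct neighbour *neigh); /* hypothetical */

    static int cma_netevent_cb(struct notifier_block *self,
                               unsigned long event, void *ctx)
    {
        if (event == NETEVENT_NEIGH_UPDATE) {
            struct neighbour *neigh = ctx;

            /* a pending IP-to-hardware-address resolution may now be
             * answerable, with no ARP snooping and no polling */
            if (neigh->nud_state & NUD_VALID)
                cma_complete_resolution(neigh);
        }
        return NOTIFY_DONE;
    }

    static struct notifier_block cma_nb = {
        .notifier_call = cma_netevent_cb,
    };

    static int __init cma_netevent_init(void)
    {
        return register_netevent_notifier(&cma_nb);
    }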
* Re: RDMA will be reverted 2006-07-06 17:36 ` Tom Tucker @ 2006-07-07 0:03 ` Herbert Xu 2006-07-07 0:32 ` Tom Tucker 0 siblings, 1 reply; 74+ messages in thread From: Herbert Xu @ 2006-07-07 0:03 UTC (permalink / raw) To: Tom Tucker; +Cc: davem, rdreier, netdev, akpm On Thu, Jul 06, 2006 at 12:36:24PM -0500, Tom Tucker wrote: > > The RDMA CMA uses IP addresses and port numbers to create a uniform > addressing scheme across all transport types. For IB, it is necessary to > resolve IP addresses to IB GIDs. The ARP protocol is used to do this, and > a netfilter rule is installed to snoop the incoming ARP replies. This > would not be necessary if ARP events were provided as in the patch. Well, the concerns we have do not apply to just iWARP, but RDMA/IP in general, so this isn't really another technology. In fact, it seems that we now have IP-specific knowledge living in drivers/infiniband/core/cma.c which is suboptimal. > Unified wire iSCSI adapters have the same issue as iWARP wrt managing > IP addresses and ports. If by Unified wire iSCSI you mean something that presents a SCSI interface together with an Ethernet interface where the two share the same MAC and IP address, then we have the same concerns with it as we do with iWARP or TOE. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-07 0:03 ` Herbert Xu @ 2006-07-07 0:32 ` Tom Tucker 0 siblings, 0 replies; 74+ messages in thread From: Tom Tucker @ 2006-07-07 0:32 UTC (permalink / raw) To: Herbert Xu; +Cc: davem, rdreier, netdev, akpm On Fri, 2006-07-07 at 10:03 +1000, Herbert Xu wrote: > On Thu, Jul 06, 2006 at 12:36:24PM -0500, Tom Tucker wrote: > > > > The RDMA CMA uses IP addresses and port numbers to create a uniform > > addressing scheme across all transport types. For IB, it is necessary to > > resolve IP addresses to IB GIDs. The ARP protocol is used to do this, and > > a netfilter rule is installed to snoop the incoming ARP replies. This > > would not be necessary if ARP events were provided as in the patch. > > Well, the concerns we have do not apply to just iWARP, but RDMA/IP in > general, so this isn't really another technology. > > In fact, it seems that we now have IP-specific knowledge living in > drivers/infiniband/core/cma.c which is suboptimal. To be clear, the CMA doesn't look in the ARP packet; it just uses the existence of the packet as an indication that it should check to see if the ARP request it submitted for an IP address has been resolved yet. I agree that this is suboptimal, which is why I think the notifier is a nice alternative. > > > Unified wire iSCSI adapters have the same issue as iWARP wrt managing > > IP addresses and ports. > > If by Unified wire iSCSI you mean something that presents a SCSI interface > together with an Ethernet interface where the two share the same MAC and > IP address, Yes, this is what I mean. But the notifier doesn't actually support this; you would need to expose the IP/port space database to solve that problem. What I was referring to relative to iSCSI is that if the adapter is relying on Linux to do ARP via the above suboptimal implementation, then it would benefit from the notifier patch. > then we have the same concerns with it as we do with iWARP or > TOE. Yes indeed. > > Cheers, ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-06 5:25 ` Tom Tucker 2006-07-06 14:08 ` Herbert Xu @ 2006-07-07 6:53 ` David Miller 2006-07-07 8:11 ` What is RDMA (was: RDMA will be reverted) Herbert Xu 2006-07-07 13:29 ` RDMA will be reverted Tom Tucker 1 sibling, 2 replies; 74+ messages in thread From: David Miller @ 2006-07-07 6:53 UTC (permalink / raw) To: tom; +Cc: rdreier, netdev, akpm From: Tom Tucker <tom@opengridcomputing.com> Date: Thu, 06 Jul 2006 00:25:03 -0500 > This patch is about dotting I's and crossing T's; it's not about > foundations. You assume that I've flat-out rejected RDMA; in fact, I haven't. I really don't have enough information to form a final opinion yet. There's about a week of emails on this topic which I need to read and digest first. What I am saying, however, is that we need to understand the technology and the hooks you guys want before we put any of it in. I don't think that's unreasonable. ^ permalink raw reply [flat|nested] 74+ messages in thread
* What is RDMA (was: RDMA will be reverted) 2006-07-07 6:53 ` David Miller @ 2006-07-07 8:11 ` Herbert Xu 2006-07-07 18:25 ` Steve Wise 2006-07-24 22:29 ` What is RDMA David Miller 2006-07-07 13:29 ` RDMA will be reverted Tom Tucker 1 sibling, 2 replies; 74+ messages in thread From: Herbert Xu @ 2006-07-07 8:11 UTC (permalink / raw) To: David Miller; +Cc: tom, rdreier, netdev, akpm, Jeff Garzik On Fri, Jul 07, 2006 at 06:53:20AM +0000, David Miller wrote: > > What I am saying, however, is that we need to understand the > technology and the hooks you guys want before we put any of it in. Yes indeed. Here is what I've understood so far, so let's see if we can start building a consensus. 1) RDMA over straight Infiniband is not contentious. In this case no IP networking is involved. 2) RDMA over TCP/IP (or SCTP) can theoretically run on any network that supports IP, including Infiniband and Ethernet. 3) When RDMA over TCP is completely done in hardware, i.e., it has its own IP address, MAC address, and simply presents an RDMA interface (whatever that may be) to Linux, we're OK with it. This is similar to how some iSCSI adapters work. 4) When RDMA over TCP is done completely in the Linux networking stack, we don't have a problem because the existing TCP stack is still in charge. However, this is pretty pointless. 5) RDMA over TCP on the receive side is offloaded into the NIC. This allows the NIC to directly place data into the application's buffer. We're starting to have a little bit of a problem because it means that part of the incoming IP traffic is now being directly processed by the NIC, with no input from the Linux TCP/IP stack. However, as long as the connection establishment/acks are still controlled/seen by Linux we can probably live with it. 6) RDMA over TCP on the transmit side is offloaded into the NIC. This is starting to look very worrying. The reason is that we lose all control over crucial aspects of TCP like congestion control. It is now completely up to the NIC to do that. For straight RDMA over Infiniband this isn't an issue because the traffic is not likely to travel across the Internet. However, for RDMA over TCP, one of their goals is to support sending traffic over the Internet, so this is a concern. Incidentally, this is why they need to know about things like MAC/route/MTU changing. 7) RDMA over TCP is completely offloaded into the NIC; however, they still use Linux's IP address, MAC address, and rely on us to tell it about events such as MTU updates or MAC changes. In addition to the problems we have in 5) and 6), we now have a portion of TCP port space which has suddenly become invisible to Linux. What's more, we lose control (e.g., netfilter) over what connections may or may not be established. So to my mind, RDMA over TCP is most problematic when it shares the same IP/MAC address as the Linux host, and when the transmit side and/or the connection establishment (case 6 and 7) is offloaded into the NIC. This also happens to be the only scenario where they need the notification patch that started all this discussion. BTW, this URL gives an interesting perspective on RDMA over TCP (particularly Q14/Q15): http://www.rdmaconsortium.org/home/FAQs_Apr25.htm Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: What is RDMA (was: RDMA will be reverted) 2006-07-07 8:11 ` What is RDMA (was: RDMA will be reverted) Herbert Xu @ 2006-07-07 18:25 ` Steve Wise 2006-07-11 8:17 ` Herbert Xu 2006-07-24 22:29 ` What is RDMA David Miller 1 sibling, 1 reply; 74+ messages in thread From: Steve Wise @ 2006-07-07 18:25 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, tom, rdreier, netdev, akpm, Jeff Garzik Great summation. Comments in-line... On Fri, 2006-07-07 at 18:11 +1000, Herbert Xu wrote: > On Fri, Jul 07, 2006 at 06:53:20AM +0000, David Miller wrote: > > > > What I am saying, however, is that we need to understand the > > technology and the hooks you guys want before we put any of it in. > > Yes indeed. > > Here is what I've understood so far, so let's see if we can start building > a consensus. > > 1) RDMA over straight Infiniband is not contentious. In this case no > IP networking is involved. > Some IP networking is involved for this. IP addresses and port numbers are used by the RDMA Connection Manager. The motivation for this was two-fold, I think: 1) to simplify the connection setup model. The IB CM model was very complex. 2) to allow ULPs to be transport independent. Thus a single code base for NFSoRDMA, for example, can run over Infiniband and RDMA/TCP transports without code changes or knowing about transport-specific addressing. The routing table is also consulted to determine which rdma device should be used for connection setup. Each rdma device also installs a netdev device for native stack traffic. The RDMA CM maintains an association between the netdev device and the rdma device. And the Infiniband subsystem uses ARP over IPoIB to map IP addresses to GID/QPN info. This is done by calling arp_send() directly, and snooping all ARP packets to "discover" when the ARP entry is completed. > 2) RDMA over TCP/IP (or SCTP) can theoretically run on any network that > supports IP, including Infiniband and Ethernet. > > 3) When RDMA over TCP is completely done in hardware, i.e., it has its > own IP address, MAC address, and simply presents an RDMA interface > (whatever that may be) to Linux, we're OK with it. > > This is similar to how some iSCSI adapters work. > The Ammasso driver implements this method. It supports two MAC addresses on the single GigE port: one for native host networking traffic only, and one for RDMA/TCP only. The firmware implements a full TCP/IP/ARP/ICMP stack and handles all functions of the RDMA/TCP connection setup. However, even these types of devices need some integration with the networking subsystem. Namely, the existing Infiniband rdma connection manager assumes it will find a netdev device for each rdma device registered. So it uses the routing table to look up a netdev to determine which rdma device should be used for connection setup. The Ammasso driver installs two netdevs, one of which is a virtual device used solely for assigning IP addresses to the RDMA side of the NIC, and for the RDMA CM to find this device... > 4) When RDMA over TCP is done completely in the Linux networking stack, > we don't have a problem because the existing TCP stack is still in > charge. However, this is pretty pointless. > Indeed. I see one case where this model might be useful: if the optimizations that RDMA gives help mainly the server side of an application, then the client side might use a software-only RDMA stack and a dumb NIC. The server buys the deep RNIC adapter and gets the perf benefits... > > 5) RDMA over TCP on the receive side is offloaded into the NIC. This > allows the NIC to directly place data into the application's buffer. > > We're starting to have a little bit of a problem because it means that > part of the incoming IP traffic is now being directly processed by the > NIC, with no input from the Linux TCP/IP stack. > > However, as long as the connection establishment/acks are still > controlled/seen by Linux we can probably live with it. > > 6) RDMA over TCP on the transmit side is offloaded into the NIC. This > is starting to look very worrying. > > The reason is that we lose all control over crucial aspects of TCP like > congestion control. It is now completely up to the NIC to do that. > For straight RDMA over Infiniband this isn't an issue because the > traffic is not likely to travel across the Internet. > > However, for RDMA over TCP, one of their goals is to support sending > traffic over the Internet, so this is a concern. Incidentally, this is > why they need to know about things like MAC/route/MTU changing. > > 7) RDMA over TCP is completely offloaded into the NIC; however, they still > use Linux's IP address, MAC address, and rely on us to tell it about > events such as MTU updates or MAC changes. > I only know of type 3 RNICs (Ammasso) and type 7 RNICs (Chelsio + others). I haven't seen any type 5 or 6 designs yet for RDMA/TCP... > In addition to the problems we have in 5) and 6), we now have a portion > of TCP port space which has suddenly become invisible to Linux. What's > more, we lose control (e.g., netfilter) over what connections may or > may not be established. Port space issues and netfilter integration can be fixed, I think, if there is a desire to do so. > > So to my mind, RDMA over TCP is most problematic when it shares the same > IP/MAC address as the Linux host, and when the transmit side and/or the > connection establishment (case 6 and 7) is offloaded into the NIC. This > also happens to be the only scenario where they need the notification > patch that started all this discussion. > Note that the current Infiniband RDMA connection setup could also benefit from the notification patch. Then it would not need to filter all incoming ARP packets... Steve. ^ permalink raw reply [flat|nested] 74+ messages in thread
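[Editorial aside: to make the addressing model Steve describes concrete, a kernel ULP drives the entire IP-based setup through the rdma_cm calls and never touches GIDs or MACs itself. A condensed sketch follows, assuming the mid-2006 in-tree interface (a later tree may differ); error paths and QP creation are omitted, and the 2000 ms timeouts are arbitrary.]

/* Sketch of IP-addressed connection setup through the RDMA CM. */
#include <rdma/rdma_cm.h>
#include <linux/err.h>
#include <linux/in.h>

static int ulp_cma_handler(struct rdma_cm_id *id,
                           struct rdma_cm_event *event)
{
        struct rdma_conn_param param = { .responder_resources = 1,
                                         .initiator_depth = 1,
                                         .retry_count = 7 };

        switch (event->event) {
        case RDMA_CM_EVENT_ADDR_RESOLVED:
                /* IP -> GID (IB) or IP -> MAC (iWARP) is done. */
                return rdma_resolve_route(id, 2000);
        case RDMA_CM_EVENT_ROUTE_RESOLVED:
                /* A QP would be created here before connecting. */
                return rdma_connect(id, &param);
        case RDMA_CM_EVENT_ESTABLISHED:
                /* Connection is up; post work requests from now on. */
                return 0;
        default:
                /* Errors and disconnects would be handled here. */
                return 0;
        }
}

static int ulp_connect(struct sockaddr_in *src, struct sockaddr_in *dst)
{
        struct rdma_cm_id *id;

        id = rdma_create_id(ulp_cma_handler, NULL, RDMA_PS_TCP);
        if (IS_ERR(id))
                return PTR_ERR(id);

        /* Consults the host routing table to bind this connection
         * to the right rdma device, exactly as described above. */
        return rdma_resolve_addr(id, (struct sockaddr *)src,
                                 (struct sockaddr *)dst, 2000);
}

[The same handler runs unchanged over IB and iWARP; only the address-resolution step underneath differs, which is where the ARP snooping, or the proposed notifier, comes in.]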
* Re: What is RDMA (was: RDMA will be reverted) 2006-07-07 18:25 ` Steve Wise @ 2006-07-11 8:17 ` Herbert Xu 2006-07-11 13:27 ` Steve Wise 0 siblings, 1 reply; 74+ messages in thread From: Herbert Xu @ 2006-07-11 8:17 UTC (permalink / raw) To: Steve Wise; +Cc: David Miller, tom, rdreier, netdev, akpm, Jeff Garzik On Fri, Jul 07, 2006 at 01:25:44PM -0500, Steve Wise wrote: > > Some IP networking is involved for this. IP addresses and port numbers > are used by the RDMA Connection Manager. The motivation for this was > two-fold, I think: > > 1) to simplify the connection setup model. The IB CM model was very > complex. > > 2) to allow ULPs to be transport independent. Thus a single code base > for NFSoRDMA, for example, can run over Infiniband and RDMA/TCP > transports without code changes or knowing about transport-specific > addressing. > > The routing table is also consulted to determine which rdma device > should be used for connection setup. Each rdma device also installs a > netdev device for native stack traffic. The RDMA CM maintains an > association between the netdev device and the rdma device. > > And the Infiniband subsystem uses ARP over IPoIB to map IP addresses to > GID/QPN info. This is done by calling arp_send() directly, and snooping > all ARP packets to "discover" when the ARP entry is completed. This sounds interesting. Since this is going to be IB-neutral, what about moving high-level logic like this out of drivers/infiniband and into net? That way the rest of the networking community can add input into how things are done. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: What is RDMA (was: RDMA will be reverted) 2006-07-11 8:17 ` Herbert Xu @ 2006-07-11 13:27 ` Steve Wise 0 siblings, 0 replies; 74+ messages in thread From: Steve Wise @ 2006-07-11 13:27 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, tom, rdreier, netdev, akpm, Jeff Garzik On Tue, 2006-07-11 at 18:17 +1000, Herbert Xu wrote: > On Fri, Jul 07, 2006 at 01:25:44PM -0500, Steve Wise wrote: > > > > Some IP networking is involved for this. IP addresses and port numbers > > are used by the RDMA Connection Manager. The motivation for this was > > two-fold, I think: > > > > 1) to simplify the connection setup model. The IB CM model was very > > complex. > > > > 2) to allow ULPs to be transport independent. Thus a single code base > > for NFSoRDMA, for example, can run over Infiniband and RDMA/TCP > > transports without code changes or knowing about transport-specific > > addressing. > > > > The routing table is also consulted to determine which rdma device > > should be used for connection setup. Each rdma device also installs a > > netdev device for native stack traffic. The RDMA CM maintains an > > association between the netdev device and the rdma device. > > > > And the Infiniband subsystem uses ARP over IPoIB to map IP addresses to > > GID/QPN info. This is done by calling arp_send() directly, and snooping > > all ARP packets to "discover" when the ARP entry is completed. > > This sounds interesting. > > Since this is going to be IB-neutral, what about moving high-level logic > like this out of drivers/infiniband and into net? > > That way the rest of the networking community can add input into how > things are done. > The notifier patch I posted sort of does this already by eliminating the need to snoop ARP replies. It will notify interested subsystems of neighbour changes (e.g., when an ARP reply is processed and the neighbour struct is updated with the correct hardware MAC address). And I _think_ neigh_event_send() could be used instead of arp_send(). Steve. ^ permalink raw reply [flat|nested] 74+ messages in thread
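[Editorial aside: a consumer of the notifier Steve describes might look like the sketch below. This assumes the interface in the posted patch resembles what was being discussed at the time: a register_netevent_notifier() entry point and events such as NETEVENT_NEIGH_UPDATE carrying the updated struct neighbour. The names are taken from that proposal, not from a merged tree, so treat them as assumptions.]

/* Sketch of a netevent consumer, assuming the proposed interface:
 * register_netevent_notifier() plus NETEVENT_* events. */
#include <linux/notifier.h>
#include <net/netevent.h>
#include <net/neighbour.h>

static int rdma_netevent_cb(struct notifier_block *nb,
                            unsigned long event, void *ctx)
{
        struct neighbour *neigh;

        switch (event) {
        case NETEVENT_NEIGH_UPDATE:
                neigh = ctx;
                /* No ARP snooping needed: check pending address
                 * resolutions against neigh->ha directly. */
                if (neigh->nud_state & NUD_VALID)
                        ; /* complete any matching resolution */
                break;
        case NETEVENT_PMTU_UPDATE:
        case NETEVENT_REDIRECT:
                /* Push new path information down to the RNIC. */
                break;
        }
        return NOTIFY_DONE;
}

static struct notifier_block rdma_netevent_nb = {
        .notifier_call = rdma_netevent_cb,
};

/* register_netevent_notifier(&rdma_netevent_nb) at module init,
 * unregister_netevent_notifier(&rdma_netevent_nb) at exit. */

[Compared with the ARP hook sketched earlier in the thread, this fires only on actual neighbour state changes rather than on every reply seen on the wire.]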
* Re: What is RDMA 2006-07-07 8:11 ` What is RDMA (was: RDMA will be reverted) Herbert Xu 2006-07-07 18:25 ` Steve Wise @ 2006-07-24 22:29 ` David Miller 2006-07-24 22:34 ` Rick Jones 1 sibling, 1 reply; 74+ messages in thread From: David Miller @ 2006-07-24 22:29 UTC (permalink / raw) To: herbert; +Cc: tom, rdreier, netdev, akpm, jgarzik From: Herbert Xu <herbert@gondor.apana.org.au> Date: Fri, 7 Jul 2006 18:11:31 +1000 > 5) RDMA over TCP on the receive side is offloaded into the NIC. This > allows the NIC to directly place data into the application's buffer. > > We're starting to have a little bit of a problem because it means that > part of the incoming IP traffic is now being directly processed by the > NIC, with no input from the Linux TCP/IP stack. > > However, as long as the connection establishment/acks are still > controlled/seen by Linux we can probably live with it. As I have detailed in other emails, even if you get the connection establishment packets processed by netfilter, you can end up with a non-working connection, because NAT will want to transform all of the established-state packets in the same way. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: What is RDMA 2006-07-24 22:29 ` What is RDMA David Miller @ 2006-07-24 22:34 ` Rick Jones 2006-07-24 22:39 ` David Miller 2006-07-24 22:49 ` Andi Kleen 0 siblings, 2 replies; 74+ messages in thread From: Rick Jones @ 2006-07-24 22:34 UTC (permalink / raw) To: David Miller; +Cc: herbert, tom, rdreier, netdev, akpm, jgarzik That TOE/iWARP could end up being precluded by NAT seems so ironic from a POE2E standpoint. rick jones "Purity Of End To End" ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: What is RDMA 2006-07-24 22:34 ` Rick Jones @ 2006-07-24 22:39 ` David Miller 2006-07-24 22:49 ` Andi Kleen 1 sibling, 0 replies; 74+ messages in thread From: David Miller @ 2006-07-24 22:39 UTC (permalink / raw) To: rick.jones2; +Cc: herbert, tom, rdreier, netdev, akpm, jgarzik From: Rick Jones <rick.jones2@hp.com> Date: Mon, 24 Jul 2006 15:34:30 -0700 > That TOE/iWARP could end up being precluded by NAT seems so ironic > from a POE2E standpoint. To be honest, we do not have a pure end-to-end internet, and some of our failed experiments in the past prove this :-) For example, we have an optimization that allows much earlier termination of TIME_WAIT connections. It relies upon TCP timestamps and attributes we can determine about end hosts using that information (it is yet another Van Jacobson idea, btw). But NAT means that IP+Port does not necessarily equate to the same host over time, not even over short periods of time. A NAT box could be using Port X for host A and then for host B a short time later. Therefore we had to disable the early timewait recycling trick by default. ^ permalink raw reply [flat|nested] 74+ messages in thread
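[Editorial aside: the "early timewait recycling trick" David refers to is, if memory serves, the net.ipv4.tcp_tw_recycle sysctl, which uses TCP timestamps to recycle TIME_WAIT sockets early and has defaulted to off for precisely these NAT reasons; treat the knob name as a recollection, not gospel. A quick userspace check:]

/* Reads the tcp_tw_recycle sysctl; 0 means the early TIME_WAIT
 * recycling described above is disabled (the default). */
#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/sys/net/ipv4/tcp_tw_recycle", "r");
        int val = -1;

        if (f) {
                if (fscanf(f, "%d", &val) != 1)
                        val = -1;
                fclose(f);
        }
        printf("tcp_tw_recycle = %d\n", val);
        return 0;
}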
* Re: What is RDMA 2006-07-24 22:34 ` Rick Jones 2006-07-24 22:39 ` David Miller @ 2006-07-24 22:49 ` Andi Kleen 1 sibling, 0 replies; 74+ messages in thread From: Andi Kleen @ 2006-07-24 22:49 UTC (permalink / raw) To: Rick Jones; +Cc: David Miller, herbert, tom, rdreier, netdev, akpm, jgarzik On Tuesday 25 July 2006 00:34, Rick Jones wrote: > That TOE/iWARP could end up being precluded by NAT seems so ironic from a POE2E > standpoint. Yes, it's sad, but it's reality, unfortunately. There is even precedent: the VJ stateless TW recycling scheme also turned out not to work because of NAT considerations. -Andi ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted 2006-07-07 6:53 ` David Miller 2006-07-07 8:11 ` What is RDMA (was: RDMA will be reverted) Herbert Xu @ 2006-07-07 13:29 ` Tom Tucker 1 sibling, 0 replies; 74+ messages in thread From: Tom Tucker @ 2006-07-07 13:29 UTC (permalink / raw) To: David Miller; +Cc: rdreier, netdev, akpm On Thu, 2006-07-06 at 23:53 -0700, David Miller wrote: > From: Tom Tucker <tom@opengridcomputing.com> > Date: Thu, 06 Jul 2006 00:25:03 -0500 > > > This patch is about dotting I's and crossing T's, it's not about > > foundations. > > You assume that I've flat out rejected RDMA, in fact I haven't. I > really don't have enough information to form a final opinion yet. > There's about a week of emails on this topic which I need to read > and digest first. I realize that there is a tremendous amount of work out there and this is just one thread. > > What I am saying, however, is that we need to understand the > technology and the hooks you guys want before we put any of it in. Absolutely. > > I don't think that's unreasonable. Not at all. Let me know if I can help. Tom ^ permalink raw reply [flat|nested] 74+ messages in thread
* RE: RDMA will be reverted @ 2006-07-06 13:26 Caitlin Bestler 0 siblings, 0 replies; 74+ messages in thread From: Caitlin Bestler @ 2006-07-06 13:26 UTC (permalink / raw) To: Andi Kleen, Andy Gay; +Cc: Tom Tucker, David Miller, rdreier, netdev, akpm Andi Kleen wrote: > >> We're focusing on netfilter here. Is breaking netfilter really the >> only issue with this stuff? > > Another concern is that it will just not be able to keep > up with a high rate of new connections or a high number of them > (because the hardware has too limited state) > Neither iWARP nor an iSCSI initiator will require extremely high rates of connection establishment. An RNIC only establishes connections when its services have been explicitly requested (via use of a specific service). In any event, the key question here is whether integration with the netdevice improves things or whether the offload device should be "totally transparent" to the kernel. If the offload device somehow insisted on handling connection requests that the kernel would have been able to handle, then this would be an issue. But the kernel is not currently handling RDMA Connect requests on its own, and I know of no one who has suggested that a software-only implementation of RDMA at 10Gbit is feasible. Netfilter integration is definitely something that needs to be addressed, but the L2/L3 integrations need to be in place first. > And then there are the other issues I listed like subtle TCP bugs > (TSO is already a nightmare in this area and it's still not quite > right) etc. > Making an RNIC "fully transparent" to the kernel would require it to handle many L2 and L3 issues in parallel with the host stack. That increases the chance of a bug, or at least a subtle difference between the host and the RNIC which, while compliant, would be unexpected. The purpose of the proposed patches is to enable the RNIC to be in full compliance with the host stack on IP layer issues. > It would need someone who can describe how this new RDMA device avoids > all the problems, but so far its advocates don't seem to be interested > in doing that and I cannot contribute more. > RDMA services are already defined for the kernel. The connection management and network notifier patches are about enabling RDMA devices to use IP addresses in a way that is consistent. Obviously doing so is more important for an iWARP device than for an InfiniBand device, but even InfiniBand users have expressed a desire to use IP addressing. Applications do not use RDMA by accident; it is a major design decision. Once an application uses RDMA it is no longer a direct consumer of the transport layer protocol. Indeed, one of the main objectives of the OpenFabrics stack is to enable typical applications to be written that will work over RDMA without caring what the underlying transport is. The options for control will still be there, but just as a sockets programmer does not typically care whether their IP is carried over SLIP, PPP, Ethernet, or ATM, most RDMA developers should not have to worry about iWARP or InfiniBand. http://ietf.org/internet-drafts/draft-ietf-rddp-applicability-08.txt provides an overview on how RDMA benefits applications, and when applications would benefit from its use as compared to plain TCP. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: RDMA will be reverted @ 2006-07-25 19:59 Tom Tucker 0 siblings, 0 replies; 74+ messages in thread From: Tom Tucker @ 2006-07-25 19:59 UTC (permalink / raw) To: David Miller; +Cc: ak, rdreier, netdev, akpm On Mon, 2006-07-24 at 15:23 -0700, David Miller wrote: > From: Tom Tucker <tom@opengridcomputing.com> > Date: Wed, 05 Jul 2006 12:09:42 -0500 > > > "A TOE net stack is closed source firmware. Linux engineers have no way > > to fix security issues that arise. As a result, only non-TOE users will > > receive security updates, leaving random windows of vulnerability for > > each TOE NIC's users." > > > > - A Linux security update may or may not be relevant to a vendor's > > implementation. > > > > - If a vendor's implementation has a security issue, then the customer > > must rely on the vendor to fix it. This is no less true for iWARP than > > for any adapter. > > This isn't how things actually work. > > Users have a computer, and they can rightly expect the community > to help them solve problems that occur in the upstream kernel. > > When a bug is found and the person is using NIC X, we don't > necessarily forward the bug report to the vendor of NIC X. > Instead we try to fix the bug. Many chip drivers are maintained > by people who do not work for the company that makes the chip, > and this works just fine. > > If only the chip vendor can fix a security problem, this makes Linux > less agile. Every aspect of a problem on a Linux system that > cannot be fixed entirely by the community is a net negative for Linux. > All true. What I meant to say was that this is "no less true than for any deep adapter". It is incontrovertible that a deep adapter is less flexible and more difficult to support than a shallow adapter. > > - iWARP needs to do protocol processing in order to validate and > > evaluate TCP payload in advance of direct data placement. This > > requirement is independent of CPU speed. > > Yet, RDMA itself is just an optimization meant to deal with > limitations of CPU and memory speed. You can rephrase the > situation in whatever way suits your argument, but it does not > make the core issue go away :) Yep. > > - I suspect that connection rates for RDMA adapters fall well below the > > rates attainable with a dumb device. That said, all of the RDMA > > applications that I know of are not connection intensive. Even for TOE, > > the later HTTP versions make connection rates less of an issue. > > This is a very naive evaluation of the situation. Yes, newer > versions of protocols such as HTTP make the per-client connection > burden lower, but the number of clients will increase in time to > more than make up for whatever gains are seen due to this. Naive is being kind; my HTTP comment is irrelevant :). > And then you have protocols which by design are connection heavy, > and rightly so, such as BitTorrent. > > Being able to handle enormous numbers of connections, with extreme > scalability and low latency, is an absolute requirement of any modern-day > serious TCP stack. And this requirement is not going away. > Wishing this requirement away due to HTTP persistent connections is > very unrealistic, at best. > > > - This is the problem we're trying to solve...incrementally and > > responsibly. > > You can't. See my email to Roland about why even VJ net channels > are found to be impractical.
To support netfilter properly, you > must traverse the whole netfilter stack, because NAT can rewrite > packets, yet still make them destined for the local system, and > thus they will have a different identity for connection demux > by the time the TCP stack sees the packet. > I'm not claiming that all the problems can be solved; I'm suggesting that integration could be better and that partial integration is better than none. > All of these transformations occur between normal packet receive > and the TCP stack. You would therefore need to put your card > between netfilter and TCP in the packet input path, and at that > point why bother with the stateful card at all? > > The fact is that stateless approaches will always be better than > stateful things because you cannot replicate the functionality we > have in the Linux stack without replicating 10 years of work into > your chip's firmware. At that point you should just run Linux > on your NIC since that is what you are effectively doing :) > I wish...I'd have a better stack. > In conversations such as these, it helps us a lot if you can be frank > and honest about the true absolute limitations of your technology. I'm trying ... classifying these limitations as "core can't fix" and "fixable with integration" is where we're getting crosswise. > I > can see that your viewpoint is tainted when I hear things such as HTTP > persistent connections being used as a reason why high TCP connection > rates won't matter in the future. Such assertions are understood to > be patently false by anyone who understands TCP and how it is used in > the real world. Partial "Fixable with Integration" Summary: - ARP Resolution - ICMP Redirect - Path MTU Change - Route Update - Colliding TCP Port Spaces Partial "Can't Fix" Issues Summary: - Many devices cannot support more than tens of thousands of concurrent connections (16-64k would be typical). The number of supported RDMA connections does not scale with server resources. - Netfilter integration is busted. Some have suggested that devices that do connection establishment in host software could honor netfilter rules at startup. I'm concerned that this would be more confusing than helpful (which rules work, which don't). - NAT doesn't work when run on the same machine as the RDMA stack with hardware assist. Post connection establishment, the adapter sees untranslated packets. - Connection rates will likely be lower for devices that do connection establishment in the device vs. in the host. - The open source community cannot easily predict, diagnose or fix problems in the hardware stack. It's a black box. - Most hardware stacks lack the security features present in the native stack and cannot be extended to handle new exploits. ^ permalink raw reply [flat|nested] 74+ messages in thread