netdev.vger.kernel.org archive mirror
* RDMA will be reverted
@ 2006-06-28  7:07 David Miller
From: David Miller @ 2006-06-28  7:07 UTC (permalink / raw)
  To: rolandd; +Cc: netdev, akpm


Roland, there is no way in the world we would have let support for
RDMA into the kernel tree had we seen and reviewed it on netdev.  I've
discussed this with Andrew Morton, and we'd like you to please revert
all of the RDMA code from Linus's tree immediately.

Folks are well aware how against RDMA and TOE type schemes the Linux
networking developers are.  So the fact that none of these RDMA
changes went up for review on netdev strikes me as just a little bit
more than suspicious.

Please do not do this again, thank you.

* RE: RDMA will be reverted
@ 2006-07-06 13:26 Caitlin Bestler
From: Caitlin Bestler @ 2006-07-06 13:26 UTC (permalink / raw)
  To: Andi Kleen, Andy Gay; +Cc: Tom Tucker, David Miller, rdreier, netdev, akpm

Andi Kleen wrote:
> 
>> We're focusing on netfilter here. Is breaking netfilter really the
>> only issue with this stuff?
> 
> Another concern is that it will just not be able to keep
> up with a high rate of new connections or a high number of them
> (because the hardware has too limited state)
> 

Neither iWARP nor an iSCSI initiator will require extremely high
rates of connection establishment. An RNIC only establishes connections
when its services have been explicitly requested (via use of a specific
service).

In any event, the key question here is whether integration with
the netdevice improves things or whether the offload device should
be "totally transparent" to the kernel. If the offload device somehow
insisted on handling connection requests that the kernel would have
been able to handle then this would be an issue. But the kernel is
not currently handling RDMA Connect requests on its own, and I know
of no one who has suggested that a software-only implementation of
RDMA is feasible at 10Gbit.

netfilter integration is definitely something that needs to be addressed,
but the L2/L3 integrations need to be in place first.

> And then there are the other issues I listed like subtle TCP bugs
> (TSO is already a nightmare in this area and it's still not quite
> right) etc. 
> 

Making an RNIC "fully transparent" to the kernel would require it
to handle many L2 and L3 issues in parallel with the host stack.
That increases the chance of a bug, or at least a subtle difference
between the host and the RNIC which, while compliant, would
be unexpected.

The purpose of the proposed patches is to enable the RNIC to be
in full compliance with the host stack on IP layer issues.



> 
> It would need someone who can describe how this new RDMA device avoids
> all the problems, but so far its advocates don't seem to be interested
> in doing that and I cannot contribute more.
> 

RDMA services are already defined for the kernel. The connection
management and network notifier patches are about enabling RDMA
devices to use IP addresses in a way that is consistent.

Obviously doing so is more important for an iWARP device than for
an InfiniBand device, but even InfiniBand users have expressed a
desire to use IP addressing.

Applications do not use RDMA by accident; it is a major design
decision. Once an application uses RDMA it is no longer a direct
consumer of the transport layer protocol. Indeed, one of the
main objectives of the OpenFabrics stack is to enable typical
applications to be written that will work over RDMA without
caring what the underlying transport is. The options for control
will still be there, but just as a sockets programmer does not
typically care whether their IP is carried over SLIP, PPP,
Ethernet or ATM; most RDMA developers should not have to worry
about iWARP or InfiniBand.

http://ietf.org/internet-drafts/draft-ietf-rddp-applicability-08.txt
provides an overview on how RDMA benefits applications, and when
applications would benefit from its use as compared to plain TCP.


* Re: RDMA will be reverted
@ 2006-07-25 19:59 Tom Tucker
From: Tom Tucker @ 2006-07-25 19:59 UTC (permalink / raw)
  To: David Miller; +Cc: ak, rdreier, netdev, akpm

On Mon, 2006-07-24 at 15:23 -0700, David Miller wrote: 
> From: Tom Tucker <tom@opengridcomputing.com>
> Date: Wed, 05 Jul 2006 12:09:42 -0500
> 
> > "A TOE net stack is closed source firmware. Linux engineers have no way
> > to fix security issues that arise. As a result, only non-TOE users will
> > receive security updates, leaving random windows of vulnerability for
> > each TOE NIC's users."
> > 
> > - A Linux security update may or may not be relevant to a vendor's
> > implementation. 
> > 
> > - If a vendor's implementation has a security issue then the customer
> > must rely on the vendor to fix it. This is no less true for iWARP than
> > for any adapter.
> 
> This isn't how things actually work.
> 
> Users have a computer, and they can rightly expect the community
> to help them solve problems that occur in the upstream kernel.
> 
> When a bug is found and the person is using NIC X, we don't
> necessarily forward the bug report to the vendor of NIC X.
> Instead we try to fix the bug.  Many chip drivers are maintained
> by people who do not work for the company that makes the chip,
> and this works just fine.
> 
> If only the chip vendor can fix a security problem, this makes Linux
> less agile to fix.  Every aspect of a problem on a Linux system that
> cannot be fixed entirely by the community is a net negative for Linux.
> 

All true. What I meant to say was that this is "no less true than for
any deep adapter". It is incontrovertible that a deep adapter is less
flexible, and more difficult to support than a shallow adapter.

> > - iWARP needs to do protocol processing in order to validate and
> > evaluate TCP payload in advance of direct data placement. This
> > requirement is independent of CPU speed. 
> 
> Yet, RDMA itself is just an optimization meant to deal with
> limitations of cpu and memory speed.  You can rephrase the
> situation in whatever way suits your argument, but it does not
> make the core issue go away :)

Yep.

> 
> > - I suspect that connection rates for RDMA adapters fall well below the
> > rates attainable with a dumb device. That said, all of the RDMA
> > applications that I know of are not connection intensive. Even for TOE,
> > the later HTTP versions make connection rates less of an issue.
> 
> This is a very naive evaluation of the situation.  Yes, newer
> versions of protocols such as HTTP make the per-client connection
> burden lower, but the number of clients will increase in time to
> more than make up for whatever gains are seen due to this.

Naive is being kind, my HTTP comment is irrelevant :).  

> And then you have protocols which by design are connection heavy,
> and rightly so, such as bittorrent.
> 
> Being able to handle enormous numbers of connections, with extreme
> scalability and low latency, is an absolute requirement of any modern
> day serious TCP stack.  And this requirement is not going away.
> Wishing this requirement away due to HTTP persistent connections is
> very unrealistic, at best.
> 
> > - This is the problem we're trying to solve...incrementally and
> > responsibly.
> 
> You can't.  See my email to Roland about why even VJ net channels
> are found to be impractical.  To support netfilter properly, you
> must traverse the whole netfilter stack, because NAT can rewrite
> packets, yet still make them destined for the local system, and
> thus they will have a different identity for connection demux
> by the time the TCP stack sees the packet.
> 

I'm not claiming that all the problems can be solved, I'm suggesting
that integration could be better and that partial integration is better
than none. 
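
The demux problem described in the quoted text can be sketched roughly
as follows. This is only an illustration, not kernel or adapter code:
the rule table, the addresses, and every function name here are
assumptions made up for the example.

```python
# Illustrative only: hypothetical names, not kernel or RNIC code.
from typing import NamedTuple, Optional


class FourTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


# Assumed DNAT rule: traffic to port 8080 is rewritten to local port 80
# before the TCP stack ever sees it.
DNAT_RULES = {("192.0.2.10", 8080): ("192.0.2.10", 80)}

# Connections as the host TCP stack knows them (post-NAT identity).
HOST_CONNECTIONS = {
    FourTuple("198.51.100.7", 40000, "192.0.2.10", 80): "conn-A",
}


def apply_dnat(pkt: FourTuple) -> FourTuple:
    """Rewrite the destination if a DNAT rule matches, else pass through."""
    new_dst = DNAT_RULES.get((pkt.dst_ip, pkt.dst_port))
    if new_dst is None:
        return pkt
    return pkt._replace(dst_ip=new_dst[0], dst_port=new_dst[1])


def host_demux(pkt: FourTuple) -> Optional[str]:
    """Host path: the packet crosses netfilter, then is demuxed."""
    return HOST_CONNECTIONS.get(apply_dnat(pkt))


def adapter_demux(pkt: FourTuple) -> Optional[str]:
    """Stateful adapter path: demux on the tuple as seen on the wire."""
    return HOST_CONNECTIONS.get(pkt)


# The packet as it arrives on the wire, before any NAT rewrite.
wire_packet = FourTuple("198.51.100.7", 40000, "192.0.2.10", 8080)
```

With these assumed rules, host_demux(wire_packet) finds the connection
because the lookup happens after the rewrite, while adapter_demux(wire_packet)
finds nothing because the wire tuple no longer matches the identity the
TCP stack uses.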

> All of these transformations occur between normal packet receive
> and the TCP stack.  You would therefore need to put your card
> between netfilter and TCP in the packet input path, and at that
> point why bother with the stateful card at all?
> 
> The fact is that stateless approaches will always be better than
> stateful things because you cannot replicate the functionality we
> have in the Linux stack without replicating 10 years of work into
> your chip's firmware.  At that point you should just run Linux
> on your NIC since that is what you are effectively doing :)
> 

I wish...I'd have a better stack. 

> In conversations such as these, it helps us a lot if you can be frank
> and honest about the true absolute limitations of your technology.  

I'm trying ... classifying these limitations as "core can't fix" and
"fixable with integration" is where we're getting crosswise. 

> I
> can see that your viewpoint is tainted when I hear things such as HTTP
> persistent connections being used as a reason why high TCP connection
> rates won't matter in the future.  Such assertions are understood to
> be patently false by anyone who understands TCP and how it is used in
> the real world.

Partial "Fixable with Integration" Summary

- ARP Resolution
- ICMP Redirect
- Path MTU Change
- Route Update
- Colliding TCP Port Spaces
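
As a rough sketch of what "fixable with integration" could look like
for the first few items, the host stack publishes events and the
offload device updates its cached state from them. The class names,
event names, and fields below are hypothetical and chosen only for
illustration; they are not the interface of the proposed patches.

```python
# Illustrative only: a toy notifier, not the kernel's interface.
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List


class NetEventNotifier:
    """Host-side publisher: fans each event out to its subscribers."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[..., None]]] = (
            defaultdict(list)
        )

    def subscribe(self, event: str, handler: Callable[..., None]) -> None:
        self._subscribers[event].append(handler)

    def publish(self, event: str, **info) -> None:
        for handler in self._subscribers[event]:
            handler(**info)


class OffloadDevice:
    """Stand-in for an RNIC that mirrors host L2/L3 state it depends on."""

    def __init__(self, notifier: NetEventNotifier) -> None:
        self.next_hop_mac: Dict[str, str] = {}
        self.path_mtu: Dict[str, int] = {}
        notifier.subscribe("neigh_update", self.on_neigh_update)
        notifier.subscribe("pmtu_update", self.on_pmtu_update)

    def on_neigh_update(self, ip: str, mac: str) -> None:
        # ARP resolution changed on the host: refresh the cached next hop.
        self.next_hop_mac[ip] = mac

    def on_pmtu_update(self, dst: str, mtu: int) -> None:
        # Path MTU changed: the device must segment to the new size.
        self.path_mtu[dst] = mtu


notifier = NetEventNotifier()
rnic = OffloadDevice(notifier)
notifier.publish("neigh_update", ip="192.0.2.1", mac="00:00:5e:00:53:01")
notifier.publish("pmtu_update", dst="198.51.100.7", mtu=1400)
```

The point of the sketch is that the device resolves nothing on its own;
it follows whatever the host stack has already decided.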

Partial "Can't Fix" Issues Summary:

- Many devices cannot support more than tens of thousands of concurrent
connections (16-64k would be typical). The number of supported RDMA
connections does not scale with server resources. 

- Netfilter integration is busted. Some have suggested that devices that
do connection establishment in host software could honor netfilter rules
at startup. I'm concerned that this would be more confusing than
helpful (which rules work, which don't)

- NAT doesn't work when run on the same machine as the RDMA stack with
hardware assist. After connection establishment, the adapter sees
untranslated packets.

- Connection rates will likely be lower for devices that do connection
establishment in the device vs. in the host.  

- The open source community cannot easily predict, diagnose or fix
problems in the hardware stack. It's a black box.

- Most hardware stacks lack the security features present in the native
stack and cannot be extended to handle new exploits.




