* [ofa-general] [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. @ 2007-08-07 14:37 Steve Wise 2007-08-07 14:54 ` [ofa-general] " Evgeniy Polyakov 2007-08-09 18:49 ` Steve Wise 0 siblings, 2 replies; 53+ messages in thread From: Steve Wise @ 2007-08-07 14:37 UTC (permalink / raw) To: Roland Dreier, David S. Miller; +Cc: netdev, linux-kernel, OpenFabrics General Networking experts, I'd like input on the patch below, and help in solving this bug properly. iWARP devices that support both native stack TCP and iWARP (aka RDMA over TCP/IP/Ethernet) connections on the same interface need the fix below or some similar fix to the RDMA connection manager. This is a BUG in the Linux RDMA-CMA code as it stands today. Here is the issue: Consider an mpi cluster running mvapich2. And the cluster runs MPI/Sockets jobs concurrently with MPI/RDMA jobs. It is possible, without the patch below, for MPI/Sockets processes to mistakenly get incoming RDMA connections and vice versa. The way mvapich2 works is that the ranks all bind and listen to a random port (retrying new random ports if the bind fails with "in use"). Once they get a free port and bind/listen, they advertise that port number to the peers to do connection setup. Currently, without the patch below, the mpi/rdma processes can end up binding/listening to the _same_ port number as the mpi/sockets processes running over the native tcp stack. This is due to duplicate port spaces for native stack TCP and the rdma cm's RDMA_PS_TCP port space. If this happens, then the connections can get screwed up. The correct solution in my mind is to use the host stack's TCP port space for _all_ RDMA_PS_TCP port allocations. The patch below is a minimal delta to unify the port spaces by using the kernel stack to bind ports. This is done by allocating a kernel socket and binding to the appropriate local addr/port. It also allows the kernel stack to pick ephemeral ports by virtue of just passing in port 0 on the kernel bind operation. There has been a discussion already on the RDMA list if anyone is interested: http://www.mail-archive.com/general@lists.openfabrics.org/msg05162.html Thanks, Steve. --- RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. This is needed for iwarp providers that support native and rdma connections over the same interface. 
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---

 drivers/infiniband/core/cma.c |   27 ++++++++++++++++++++++++++-
 1 files changed, 26 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 9e0ab04..e4d2d7f 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -111,6 +111,7 @@ struct rdma_id_private {
 	struct rdma_cm_id	id;
 
 	struct rdma_bind_list	*bind_list;
+	struct socket		*sock;
 	struct hlist_node	node;
 	struct list_head	list;
 	struct list_head	listen_list;
@@ -695,6 +696,8 @@ static void cma_release_port(struct rdma
 		kfree(bind_list);
 	}
 	mutex_unlock(&lock);
+	if (id_priv->sock)
+		sock_release(id_priv->sock);
 }
 
 void rdma_destroy_id(struct rdma_cm_id *id)
@@ -1790,6 +1793,25 @@ static int cma_use_port(struct idr *ps,
 	return 0;
 }
 
+static int cma_get_tcp_port(struct rdma_id_private *id_priv)
+{
+	int ret;
+	struct socket *sock;
+
+	ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
+	if (ret)
+		return ret;
+	ret = sock->ops->bind(sock,
+			      (struct sockaddr *)&id_priv->id.route.addr.src_addr,
+			      ip_addr_size(&id_priv->id.route.addr.src_addr));
+	if (ret) {
+		sock_release(sock);
+		return ret;
+	}
+	id_priv->sock = sock;
+	return 0;
+}
+
 static int cma_get_port(struct rdma_id_private *id_priv)
 {
 	struct idr *ps;
@@ -1801,6 +1823,9 @@ static int cma_get_port(struct rdma_id_p
 		break;
 	case RDMA_PS_TCP:
 		ps = &tcp_ps;
+		ret = cma_get_tcp_port(id_priv); /* Synch with native stack */
+		if (ret)
+			goto out;
 		break;
 	case RDMA_PS_UDP:
 		ps = &udp_ps;
@@ -1815,7 +1840,7 @@ static int cma_get_port(struct rdma_id_p
 	else
 		ret = cma_use_port(ps, id_priv);
 	mutex_unlock(&lock);
-
+out:
 	return ret;
 }

^ permalink raw reply related	[flat|nested] 53+ messages in thread
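To see the reservation trick from the patch in isolation: the sketch below is a minimal, stand-alone version written against the 2007-era in-kernel socket API (kernel_bind() is equivalent to calling sock->ops->bind() as the patch does). The wrapper name cma_reserve_tcp_port() is illustrative only and is not part of the patch.

/*
 * Minimal sketch of the port-reservation idea, assuming the 2007-era
 * kernel socket API.  The socket carries no data; it exists only to
 * claim the addr/port in the host TCP port space.
 */
#include <linux/net.h>
#include <linux/in.h>
#include <net/sock.h>

static int cma_reserve_tcp_port(struct sockaddr_in *src, struct socket **sockp)
{
	struct socket *sock;
	int ret;

	ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
	if (ret)
		return ret;

	/*
	 * Binding claims the port in the native TCP port space.  If
	 * src->sin_port is 0, the stack picks an ephemeral port.
	 */
	ret = kernel_bind(sock, (struct sockaddr *)src, sizeof(*src));
	if (ret) {
		sock_release(sock);
		return ret;
	}

	*sockp = sock;	/* caller must sock_release() to give the port back */
	return 0;
}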
* [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-07 14:37 [ofa-general] [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space Steve Wise
@ 2007-08-07 14:54 ` Evgeniy Polyakov
  2007-08-07 15:06   ` Steve Wise
  2007-08-09 18:49 ` Steve Wise
  1 sibling, 1 reply; 53+ messages in thread
From: Evgeniy Polyakov @ 2007-08-07 14:54 UTC (permalink / raw)
  To: Steve Wise
  Cc: netdev, Roland Dreier, linux-kernel, OpenFabrics General, David S. Miller

Hi Steve.

On Tue, Aug 07, 2007 at 09:37:41AM -0500, Steve Wise (swise@opengridcomputing.com) wrote:
> +static int cma_get_tcp_port(struct rdma_id_private *id_priv)
> +{
> +	int ret;
> +	struct socket *sock;
> +
> +	ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
> +	if (ret)
> +		return ret;
> +	ret = sock->ops->bind(sock,
> +		(struct sockaddr *)&id_priv->id.route.addr.src_addr,
> +		ip_addr_size(&id_priv->id.route.addr.src_addr));

If we set aside the talks about broken offloading, this one results in a
situation where the usual network dataflow can enter private rdma land,
i.e. after the bind succeeds this socket is accessible via any other
network device. Is it intended?

And this is quite a noticeable overhead per rdma connection, btw.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 53+ messages in thread
* Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. 2007-08-07 14:54 ` [ofa-general] " Evgeniy Polyakov @ 2007-08-07 15:06 ` Steve Wise 2007-08-07 15:39 ` [ofa-general] " Evgeniy Polyakov 0 siblings, 1 reply; 53+ messages in thread From: Steve Wise @ 2007-08-07 15:06 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Roland Dreier, David S. Miller, netdev, linux-kernel, Sean Hefty, OpenFabrics General Evgeniy Polyakov wrote: > Hi Steve. > > On Tue, Aug 07, 2007 at 09:37:41AM -0500, Steve Wise (swise@opengridcomputing.com) wrote: >> +static int cma_get_tcp_port(struct rdma_id_private *id_priv) >> +{ >> + int ret; >> + struct socket *sock; >> + >> + ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock); >> + if (ret) >> + return ret; >> + ret = sock->ops->bind(sock, >> + (struct socketaddr >> *)&id_priv->id.route.addr.src_addr, >> + ip_addr_size(&id_priv->id.route.addr.src_addr)); > > If get away from talks about broken offloading, this one will result in > the case, when usual network dataflow can enter private rdma land, i.e. > after bind succeeded this socket is accessible via any other network > device. Is it inteded? > And this is quite noticeble overhead per rdma connection, btw. > I'm not sure I understand your question? What do you mean by "accessible"? The intention is to _just_ reserve the addr/port. The socket struct alloc and bind was a simple way to do this. I assume we'll have to come up with a better way though. Namely provide a low level interface to the port space allocator allowing both rdma and the host tcp stack to share the space without requiring a socket struct for rdma connections. Or maybe we'll come up a different and better solution to this issue... Steve. ^ permalink raw reply [flat|nested] 53+ messages in thread
* [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-07 15:06   ` Steve Wise
@ 2007-08-07 15:39     ` Evgeniy Polyakov
  0 siblings, 0 replies; 53+ messages in thread
From: Evgeniy Polyakov @ 2007-08-07 15:39 UTC (permalink / raw)
  To: Steve Wise
  Cc: netdev, Roland Dreier, linux-kernel, OpenFabrics General, David S. Miller

On Tue, Aug 07, 2007 at 10:06:29AM -0500, Steve Wise (swise@opengridcomputing.com) wrote:
> >On Tue, Aug 07, 2007 at 09:37:41AM -0500, Steve Wise
> >(swise@opengridcomputing.com) wrote:
> >>+static int cma_get_tcp_port(struct rdma_id_private *id_priv)
> >>+{
> >>+	int ret;
> >>+	struct socket *sock;
> >>+
> >>+	ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
> >>+	if (ret)
> >>+		return ret;
> >>+	ret = sock->ops->bind(sock,
> >>+		(struct sockaddr *)&id_priv->id.route.addr.src_addr,
> >>+		ip_addr_size(&id_priv->id.route.addr.src_addr));
> >
> >If we set aside the talks about broken offloading, this one results in a
> >situation where the usual network dataflow can enter private rdma land,
> >i.e. after the bind succeeds this socket is accessible via any other
> >network device. Is it intended?
> >And this is quite a noticeable overhead per rdma connection, btw.
> 
> I'm not sure I understand your question?  What do you mean by
> "accessible"?  The intention is to _just_ reserve the addr/port.

The RDMA ->bind() above ends up in tcp_v4_get_port(), which only adds the
socket to the bhash. That hash is consulted only when new sockets are
bound (for listening connections or an explicit bind); incoming network
traffic checks only the listening and established hashes, which are not
affected by the above change, so it was a false alarm from my side. It
does allow one to 'grab' a port and forbid its possible reuse.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 53+ messages in thread
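Evgeniy's bhash point can also be seen from user space with nothing RDMA-specific: a socket that is bound but never listen()ed blocks other binds to the same port, while incoming connections to that port are simply refused. A small illustrative program (the port number is an arbitrary example):

/* Demonstrates that bind() alone reserves a TCP port: the second bind
 * fails with EADDRINUSE even though nothing is listening on the port. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
	struct sockaddr_in addr;
	int s1 = socket(AF_INET, SOCK_STREAM, 0);
	int s2 = socket(AF_INET, SOCK_STREAM, 0);

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(5555);	/* arbitrary example port */

	if (bind(s1, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		perror("first bind");
	/* No listen(): the port is reserved but connections are refused. */

	if (bind(s2, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		printf("second bind fails: %s\n", strerror(errno)); /* EADDRINUSE */

	return 0;
}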
* [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. 2007-08-07 14:37 [ofa-general] [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space Steve Wise 2007-08-07 14:54 ` [ofa-general] " Evgeniy Polyakov @ 2007-08-09 18:49 ` Steve Wise 2007-08-09 21:40 ` Sean Hefty 1 sibling, 1 reply; 53+ messages in thread From: Steve Wise @ 2007-08-09 18:49 UTC (permalink / raw) To: Roland Dreier, David S. Miller; +Cc: netdev, linux-kernel, OpenFabrics General Any more comments? Steve Wise wrote: > Networking experts, > > I'd like input on the patch below, and help in solving this bug > properly. iWARP devices that support both native stack TCP and iWARP > (aka RDMA over TCP/IP/Ethernet) connections on the same interface need > the fix below or some similar fix to the RDMA connection manager. > > This is a BUG in the Linux RDMA-CMA code as it stands today. > > Here is the issue: > > Consider an mpi cluster running mvapich2. And the cluster runs > MPI/Sockets jobs concurrently with MPI/RDMA jobs. It is possible, > without the patch below, for MPI/Sockets processes to mistakenly get > incoming RDMA connections and vice versa. The way mvapich2 works is > that the ranks all bind and listen to a random port (retrying new random > ports if the bind fails with "in use"). Once they get a free port and > bind/listen, they advertise that port number to the peers to do > connection setup. Currently, without the patch below, the mpi/rdma > processes can end up binding/listening to the _same_ port number as the > mpi/sockets processes running over the native tcp stack. This is due to > duplicate port spaces for native stack TCP and the rdma cm's RDMA_PS_TCP > port space. If this happens, then the connections can get screwed up. > > The correct solution in my mind is to use the host stack's TCP port > space for _all_ RDMA_PS_TCP port allocations. The patch below is a > minimal delta to unify the port spaces by using the kernel stack to bind > ports. This is done by allocating a kernel socket and binding to the > appropriate local addr/port. It also allows the kernel stack to pick > ephemeral ports by virtue of just passing in port 0 on the kernel bind > operation. > > There has been a discussion already on the RDMA list if anyone is > interested: > > http://www.mail-archive.com/general@lists.openfabrics.org/msg05162.html > > > Thanks, > > Steve. > > > --- > > RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. > > This is needed for iwarp providers that support native and rdma > connections over the same interface. 
> > Signed-off-by: Steve Wise <swise@opengridcomputing.com> > --- > > drivers/infiniband/core/cma.c | 27 ++++++++++++++++++++++++++- > 1 files changed, 26 insertions(+), 1 deletions(-) > > diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c > index 9e0ab04..e4d2d7f 100644 > --- a/drivers/infiniband/core/cma.c > +++ b/drivers/infiniband/core/cma.c > @@ -111,6 +111,7 @@ struct rdma_id_private { > struct rdma_cm_id id; > > struct rdma_bind_list *bind_list; > + struct socket *sock; > struct hlist_node node; > struct list_head list; > struct list_head listen_list; > @@ -695,6 +696,8 @@ static void cma_release_port(struct rdma > kfree(bind_list); > } > mutex_unlock(&lock); > + if (id_priv->sock) > + sock_release(id_priv->sock); > } > > void rdma_destroy_id(struct rdma_cm_id *id) > @@ -1790,6 +1793,25 @@ static int cma_use_port(struct idr *ps, > return 0; > } > > +static int cma_get_tcp_port(struct rdma_id_private *id_priv) > +{ > + int ret; > + struct socket *sock; > + > + ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock); > + if (ret) > + return ret; > + ret = sock->ops->bind(sock, > + (struct sockaddr *)&id_priv->id.route.addr.src_addr, > + ip_addr_size(&id_priv->id.route.addr.src_addr)); > + if (ret) { > + sock_release(sock); > + return ret; > + } > + id_priv->sock = sock; > + return 0; > +} > + > static int cma_get_port(struct rdma_id_private *id_priv) > { > struct idr *ps; > @@ -1801,6 +1823,9 @@ static int cma_get_port(struct rdma_id_p > break; > case RDMA_PS_TCP: > ps = &tcp_ps; > + ret = cma_get_tcp_port(id_priv); /* Synch with native stack */ > + if (ret) > + goto out; > break; > case RDMA_PS_UDP: > ps = &udp_ps; > @@ -1815,7 +1840,7 @@ static int cma_get_port(struct rdma_id_p > else > ret = cma_use_port(ps, id_priv); > mutex_unlock(&lock); > - > +out: > return ret; > } > > > - > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. 2007-08-09 18:49 ` Steve Wise @ 2007-08-09 21:40 ` Sean Hefty 2007-08-09 21:55 ` David Miller 0 siblings, 1 reply; 53+ messages in thread From: Sean Hefty @ 2007-08-09 21:40 UTC (permalink / raw) To: Steve Wise Cc: netdev, Roland Dreier, David S. Miller, OpenFabrics General, linux-kernel Steve Wise wrote: > Any more comments? Does anyone have ideas on how to reserve the port space without using a struct socket? - Sean ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-09 21:40 ` Sean Hefty
@ 2007-08-09 21:55   ` David Miller
  2007-08-09 23:22     ` Sean Hefty
                       ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: David Miller @ 2007-08-09 21:55 UTC (permalink / raw)
  To: mshefty; +Cc: netdev, rdreier, linux-kernel, general

From: Sean Hefty <mshefty@ichips.intel.com>
Date: Thu, 09 Aug 2007 14:40:16 -0700

> Steve Wise wrote:
> > Any more comments?
> 
> Does anyone have ideas on how to reserve the port space without using a
> struct socket?

How about we just remove the RDMA stack altogether?  I am not at all
kidding.  If you guys can't stay in your sand box and need to cause
problems for the normal network stack, it's unacceptable.  We were
told all along that if RDMA went into the tree none of this kind of
stuff would be an issue.

These are exactly the kinds of problems that people like myself were
dreading.  These subsystems have no business using the TCP port space
of the Linux software stack, absolutely none.

After TCP port reservation, what's next?  It seems an at least
bi-monthly event that the RDMA folks need to put their fingers
into something else in the normal networking stack.  No more.

I will NACK any patch that opens up sockets to eat up ports or
anything stupid like that.

^ permalink raw reply	[flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. 2007-08-09 21:55 ` David Miller @ 2007-08-09 23:22 ` Sean Hefty 2007-08-15 14:42 ` Steve Wise 2007-10-08 21:54 ` Steve Wise 2 siblings, 0 replies; 53+ messages in thread From: Sean Hefty @ 2007-08-09 23:22 UTC (permalink / raw) To: David Miller; +Cc: swise, rdreier, netdev, linux-kernel, general > How about we just remove the RDMA stack altogether? I am not at all > kidding. If you guys can't stay in your sand box and need to cause > problems for the normal network stack, it's unacceptable. We were > told all along the if RDMA went into the tree none of this kind of > stuff would be an issue. There are currently two RDMA solutions available. Each solution has different requirements and uses the normal network stack differently. Infiniband uses its own transport. iWarp runs over TCP. We have tried to leverage the existing infrastructure where it makes sense. > After TCP port reservation, what's next? It seems an at least > bi-monthly event that the RDMA folks need to put their fingers > into something else in the normal networking stack. No more. Currently, the RDMA stack uses its own port space. This causes a problem for iWarp, and is what Steve is looking for a solution for. I'm not an iWarp guru, so I don't know what options exist. Can iWarp use its own address family? Identify specific IP addresses for iWarp use? Restrict iWarp to specific port numbers? Let the app control the correct operation? I don't know. Steve merely defined a problem and suggested a possible solution. He's looking for constructive help trying to solve the problem. - Sean ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-09 21:55   ` David Miller
  2007-08-09 23:22     ` Sean Hefty
@ 2007-08-15 14:42     ` Steve Wise
  2007-08-16  2:26       ` Jeff Garzik
  2007-10-08 21:54     ` Steve Wise
  2 siblings, 1 reply; 53+ messages in thread
From: Steve Wise @ 2007-08-15 14:42 UTC (permalink / raw)
  To: David Miller; +Cc: rdreier, linux-kernel, general, netdev

David Miller wrote:
> From: Sean Hefty <mshefty@ichips.intel.com>
> Date: Thu, 09 Aug 2007 14:40:16 -0700
> 
>> Steve Wise wrote:
>>> Any more comments?
>> Does anyone have ideas on how to reserve the port space without using a
>> struct socket?
> 
> How about we just remove the RDMA stack altogether?  I am not at all
> kidding.  If you guys can't stay in your sand box and need to cause
> problems for the normal network stack, it's unacceptable.  We were
> told all along that if RDMA went into the tree none of this kind of
> stuff would be an issue.

I think removing the RDMA stack is the wrong thing to do, and you
shouldn't just threaten to yank entire subsystems because you don't like
the technology.  Let's keep this constructive, can we?  RDMA should get
the respect of any other technology in Linux.  Maybe it's a niche in your
opinion, but come on, there's more RDMA users than, say, the sparc64
port.  Eh?

> These are exactly the kinds of problems that people like myself were
> dreading.  These subsystems have no business using the TCP port space
> of the Linux software stack, absolutely none.

Ok, although IMO it's the correct solution.  But I'll propose other
solutions below.  I ask for your feedback (and everyone's!) on these
alternate solutions.

> After TCP port reservation, what's next?  It seems an at least
> bi-monthly event that the RDMA folks need to put their fingers
> into something else in the normal networking stack.  No more.

The only other change requested and committed, if I recall correctly,
was for netevents, and that enabled both Infiniband and iWARP to
integrate with the neighbour subsystem.  I think that was a useful and
needed change.  Prior to that, these subsystems were snooping ARP
replies to trigger events.  That was back in 2.6.18 or 2.6.19 I think...

> I will NACK any patch that opens up sockets to eat up ports or
> anything stupid like that.

Got it.

Here are alternate solutions that avoid the need to share the port space:

Solution 1)

1) admins must set up an alias interface on the iwarp device for use
with rdma.  This interface will have to be on a separate subnet from the
"TCP used" interface, and have a canonical name that indicates it's
"for rdma only", like eth2:iw or eth2:rdma.  There can be many of these
per device.

2) admins make sure their sockets/tcp services don't use the interface
configured in #1, and their rdma services do use said interface.

3) iwarp providers must translate binds to ipaddr 0.0.0.0 to the
associated "for rdma only" ip addresses.  They can do this by searching
for all aliases of the canonical name that are aliases of the TCP
interface for their nic device.  Or: somehow not handle incoming
connections to any address but the "for rdma use" addresses and instead
pass them up and not offload them.

This will avoid the collisions as long as the above steps are followed.

Solution 2)

Another possibility would be for the driver to create two net devices
(and hence two interface names) like "eth2" and "iw2", and artificially
separate the RDMA stuff that way.

These two solutions are similar in that they create a "rdma only"
interface.

Pros:
- is not intrusive into the core networking code
- very minimal changes needed, and only in the iwarp provider's code,
  who are the ones with this problem
- makes it clear which subnets are RDMA only

Cons:
- relies on the system admin to set it up correctly.
- native stack can still "use" this rdma-only interface and the same
  port space issue will exist.

For the record, here are possible port-sharing solutions Dave sez he'll
NAK:

Solution NAK-1)

The rdma-cma just allocates a socket and binds it to reserve TCP ports.

Pros:
- minimal changes needed to implement (always a plus in my mind :)
- simple, clean, and it works (KISS)
- if no RDMA is in use, there is no impact on the native stack
- no need for a separate RDMA interface

Cons:
- wastes memory
- puts a TCP socket in the "CLOSED" state in the pcb tables.
- Dave will NAK it :)

Solution NAK-2)

Create a low-level sockets-agnostic port allocation service that is
shared by both TCP and RDMA.  This way, the rdma-cm can reserve ports in
an efficient manner instead of doing it via kernel_bind() using a sock
struct.

Pros:
- probably the correct solution (my opinion :) if we went down the path
  of sharing port space
- if no RDMA is in use, there is no impact on the native stack
- no need for a separate RDMA interface

Cons:
- very intrusive change because the port allocation stuff is tightly
  bound to the host stack and sock struct, etc.
- Dave will NAK it :)


Steve.

^ permalink raw reply	[flat|nested] 53+ messages in thread
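To make Solution NAK-2 more concrete, a sockets-agnostic reservation service might look roughly like the sketch below. Every name in it is hypothetical; no such interface exists in the kernel, and this is only one way the idea could be shaped.

/*
 * Hypothetical interface for a TCP port space shared by the native
 * stack and the RDMA CM (Solution NAK-2).  These functions do not
 * exist; the names and types are purely illustrative.
 */
struct tcp_port_reservation;		/* opaque handle, no struct sock behind it */

/* Reserve addr:port in the host TCP port space; port 0 means
 * "pick an ephemeral port for me". */
int tcp_port_reserve(__be32 addr, __be16 port,
		     struct tcp_port_reservation **res);

/* Read back the port actually allocated (ephemeral case). */
__be16 tcp_port_reservation_port(const struct tcp_port_reservation *res);

/* Return the port to the shared space. */
void tcp_port_unreserve(struct tcp_port_reservation *res);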
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. 2007-08-15 14:42 ` Steve Wise @ 2007-08-16 2:26 ` Jeff Garzik 2007-08-16 3:11 ` Roland Dreier ` (2 more replies) 0 siblings, 3 replies; 53+ messages in thread From: Jeff Garzik @ 2007-08-16 2:26 UTC (permalink / raw) To: Steve Wise; +Cc: David Miller, mshefty, rdreier, netdev, linux-kernel, general Steve Wise wrote: > > > David Miller wrote: >> From: Sean Hefty <mshefty@ichips.intel.com> >> Date: Thu, 09 Aug 2007 14:40:16 -0700 >> >>> Steve Wise wrote: >>>> Any more comments? >>> Does anyone have ideas on how to reserve the port space without using >>> a struct socket? >> >> How about we just remove the RDMA stack altogether? I am not at all >> kidding. If you guys can't stay in your sand box and need to cause >> problems for the normal network stack, it's unacceptable. We were >> told all along the if RDMA went into the tree none of this kind of >> stuff would be an issue. > > I think removing the RDMA stack is the wrong thing to do, and you > shouldn't just threaten to yank entire subsystems because you don't like > the technology. Lets keep this constructive, can we? RDMA should get > the respect of any other technology in Linux. Maybe its a niche in your > opinion, but come on, there's more RDMA users than say, the sparc64 > port. Eh? It's not about being a niche. It's about creating a maintainable software net stack that has predictable behavior. Needing to reach out of the RDMA sandbox and reserve net stack resources away from itself travels a path we've consistently avoided. >> I will NACK any patch that opens up sockets to eat up ports or >> anything stupid like that. > > Got it. Ditto for me as well. Jeff ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. 2007-08-16 2:26 ` Jeff Garzik @ 2007-08-16 3:11 ` Roland Dreier 2007-08-16 3:27 ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom " Sean Hefty 2007-08-16 13:43 ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from " Tom Tucker 2 siblings, 0 replies; 53+ messages in thread From: Roland Dreier @ 2007-08-16 3:11 UTC (permalink / raw) To: Jeff Garzik; +Cc: netdev, linux-kernel, general, David Miller > Needing to reach out of the RDMA sandbox and reserve net stack > resources away from itself travels a path we've consistently avoided. Where did the idea of an "RDMA sandbox" come from? Obviously no one disagrees with keeping things clean and maintainable, but the idea that RDMA is a second-class citizen that doesn't get any input into the evolution of the networking code seems kind of offensive to me. - R. ^ permalink raw reply [flat|nested] 53+ messages in thread
* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom the host TCP port space. 2007-08-16 2:26 ` Jeff Garzik 2007-08-16 3:11 ` Roland Dreier @ 2007-08-16 3:27 ` Sean Hefty 2007-08-16 13:43 ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from " Tom Tucker 2 siblings, 0 replies; 53+ messages in thread From: Sean Hefty @ 2007-08-16 3:27 UTC (permalink / raw) To: 'Jeff Garzik', Steve Wise Cc: netdev, rdreier, linux-kernel, general, David Miller >It's not about being a niche. It's about creating a maintainable >software net stack that has predictable behavior. > >Needing to reach out of the RDMA sandbox and reserve net stack resources >away from itself travels a path we've consistently avoided. We need to ensure that we're also creating a maintainable kernel. RDMA doesn't use sockets, but that doesn't mean it's not part of the networking support provided by the Linux kernel. Making blanket statements that RDMA should stay within a sandbox is equivalent to saying that RDMA should duplicate any network related functionality that it might need. >>> I will NACK any patch that opens up sockets to eat up ports or >>> anything stupid like that. > >Ditto for me as well. I agree that using a socket is the wrong approach, but my guess is that it was suggested as a possibility because of the attempt to keep RDMA in its 'sandbox'. The iWarp architecture implements RDMA over TCP; it just doesn't use sockets. The Linux network stack doesn't easily support this possibility. Are there any reasonable ways to enable this to the degree necessary for iWarp? - Sean ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. 2007-08-16 2:26 ` Jeff Garzik 2007-08-16 3:11 ` Roland Dreier 2007-08-16 3:27 ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom " Sean Hefty @ 2007-08-16 13:43 ` Tom Tucker 2007-08-16 21:17 ` David Miller 2 siblings, 1 reply; 53+ messages in thread From: Tom Tucker @ 2007-08-16 13:43 UTC (permalink / raw) To: Jeff Garzik; +Cc: netdev, rdreier, linux-kernel, general, David Miller On Wed, 2007-08-15 at 22:26 -0400, Jeff Garzik wrote: [...snip...] > > I think removing the RDMA stack is the wrong thing to do, and you > > shouldn't just threaten to yank entire subsystems because you don't like > > the technology. Lets keep this constructive, can we? RDMA should get > > the respect of any other technology in Linux. Maybe its a niche in your > > opinion, but come on, there's more RDMA users than say, the sparc64 > > port. Eh? > > It's not about being a niche. It's about creating a maintainable > software net stack that has predictable behavior. Isn't RDMA _part_ of the "software net stack" within Linux? Why isn't making RDMA stable, supportable and maintainable equally as important as any other subsystem? > > Needing to reach out of the RDMA sandbox and reserve net stack resources > away from itself travels a path we've consistently avoided. > > > >> I will NACK any patch that opens up sockets to eat up ports or > >> anything stupid like that. > > > > Got it. > > Ditto for me as well. > > Jeff > > > - > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. 2007-08-16 13:43 ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from " Tom Tucker @ 2007-08-16 21:17 ` David Miller 2007-08-17 19:52 ` Roland Dreier 0 siblings, 1 reply; 53+ messages in thread From: David Miller @ 2007-08-16 21:17 UTC (permalink / raw) To: tom; +Cc: jeff, netdev, rdreier, linux-kernel, general From: Tom Tucker <tom@opengridcomputing.com> Date: Thu, 16 Aug 2007 08:43:11 -0500 > Isn't RDMA _part_ of the "software net stack" within Linux? It very much is not so. When using RDMA you lose the capability to do packet shaping, classification, and all the other wonderful networking facilities you've grown to love and use over the years. I'm glad this is a surprise to you, because it illustrates the point some of us keep trying to make about technologies like this. Imagine if you didn't know any of this, you purchase and begin to deploy a huge piece of RDMA infrastructure, you then get the mandate from IT that you need to add firewalling on the RDMA connections at the host level, and "oh shit" you can't? This is why none of us core networking developers like RDMA at all. It's totally not integrated with the rest of the Linux stack and on top of that it even gets in the way. It's an abberation, an eye sore, and a constant source of consternation. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. 2007-08-16 21:17 ` David Miller @ 2007-08-17 19:52 ` Roland Dreier 2007-08-17 21:27 ` David Miller 0 siblings, 1 reply; 53+ messages in thread From: Roland Dreier @ 2007-08-17 19:52 UTC (permalink / raw) To: David Miller; +Cc: jeff, netdev, linux-kernel, general > > Isn't RDMA _part_ of the "software net stack" within Linux? > It very much is not so. This is just nit-picking. You can draw the boundary of the "software net stack" wherever you want, but I think Sean's point was just that RDMA drivers already are part of Linux, and we all want them to get better. > When using RDMA you lose the capability to do packet shaping, > classification, and all the other wonderful networking facilities > you've grown to love and use over the years. Same thing with TSO and LRO and who knows what else. I know you're going to make a distinction between "stateless" and "stateful" offloads, but really it's just an arbitrary distinction between things you like and things you don't. > Imagine if you didn't know any of this, you purchase and begin to > deploy a huge piece of RDMA infrastructure, you then get the mandate > from IT that you need to add firewalling on the RDMA connections at > the host level, and "oh shit" you can't? It's ironic that you bring up firewalling. I've had vendors of iWARP hardware tell me they would *love* to work with the community to make firewalling work better for RDMA connections. But instead we get the catch-22 of your changing arguments -- first, you won't even consider changes that might help RDMA work better in the name of maintainability; then you have to protect poor, ignorant users from accidentally using RDMA because of some problem or another; and then when someone tries to fix some of the problems you mention, it's back to step one. Obviously some decisions have been prejudged here, so I guess this moves to the realm of politics. I have plenty of interesting technical stuff, so I'll leave it to the people with a horse in the race to find ways to twist your arm. - R. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. 2007-08-17 19:52 ` Roland Dreier @ 2007-08-17 21:27 ` David Miller 2007-08-17 23:31 ` Roland Dreier 0 siblings, 1 reply; 53+ messages in thread From: David Miller @ 2007-08-17 21:27 UTC (permalink / raw) To: rdreier; +Cc: tom, jeff, swise, mshefty, netdev, linux-kernel, general From: Roland Dreier <rdreier@cisco.com> Date: Fri, 17 Aug 2007 12:52:39 -0700 > > When using RDMA you lose the capability to do packet shaping, > > classification, and all the other wonderful networking facilities > > you've grown to love and use over the years. > > Same thing with TSO and LRO and who knows what else. Not true at all. Full classification and filtering still is usable with TSO and LRO. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. 2007-08-17 21:27 ` David Miller @ 2007-08-17 23:31 ` Roland Dreier 2007-08-18 0:00 ` David Miller 0 siblings, 1 reply; 53+ messages in thread From: Roland Dreier @ 2007-08-17 23:31 UTC (permalink / raw) To: David Miller; +Cc: jeff, netdev, linux-kernel, general > > > When using RDMA you lose the capability to do packet shaping, > > > classification, and all the other wonderful networking facilities > > > you've grown to love and use over the years. > > > > Same thing with TSO and LRO and who knows what else. > > Not true at all. Full classification and filtering still is usable > with TSO and LRO. Well, obviously with TSO and LRO the packets that the stack sends or receives are not the same as what's on the wire. Whether that breaks your wonderful networking facilities or not depends on the specifics of the particular facility I guess -- for example shaping is clearly broken by TSO. (And people can wonder what the packet trains TSO creates do to congestion control on the internet, but the netdev crowd has already decided that TSO is "good" and RDMA is "bad") - R. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. 2007-08-17 23:31 ` Roland Dreier @ 2007-08-18 0:00 ` David Miller 2007-08-18 5:23 ` Roland Dreier 0 siblings, 1 reply; 53+ messages in thread From: David Miller @ 2007-08-18 0:00 UTC (permalink / raw) To: rdreier; +Cc: jeff, netdev, linux-kernel, general From: Roland Dreier <rdreier@cisco.com> Date: Fri, 17 Aug 2007 16:31:07 -0700 > > > > When using RDMA you lose the capability to do packet shaping, > > > > classification, and all the other wonderful networking facilities > > > > you've grown to love and use over the years. > > > > > > Same thing with TSO and LRO and who knows what else. > > > > Not true at all. Full classification and filtering still is usable > > with TSO and LRO. > > Well, obviously with TSO and LRO the packets that the stack sends or > receives are not the same as what's on the wire. Whether that breaks > your wonderful networking facilities or not depends on the specifics > of the particular facility I guess -- for example shaping is clearly > broken by TSO. (And people can wonder what the packet trains TSO > creates do to congestion control on the internet, but the netdev crowd > has already decided that TSO is "good" and RDMA is "bad") This is also a series of falsehoods. All packet filtering, queue management, and packet scheduling facilities work perfectly fine and as designed with both LRO and TSO. When problems come up, they are bugs, and we fix them. Please stop spreading this FUD about TSO and LRO. The fact is that RDMA bypasses the whole stack so that supporting these facilities is not even _POSSIBLE_. With stateless offloads it is possible to support all of these facilities, and we do. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. 2007-08-18 0:00 ` David Miller @ 2007-08-18 5:23 ` Roland Dreier 2007-08-18 6:44 ` David Miller 0 siblings, 1 reply; 53+ messages in thread From: Roland Dreier @ 2007-08-18 5:23 UTC (permalink / raw) To: David Miller; +Cc: jeff, netdev, linux-kernel, general > This is also a series of falsehoods. All packet filtering, > queue management, and packet scheduling facilities work perfectly > fine and as designed with both LRO and TSO. I'm not sure I follow. Perhaps "broken" was too strong a word to use, but if you pass a huge segment to a NIC with TSO, then you've given the NIC control of scheduling the packets that end up getting put on the wire. If your software packet scheduling is operating at a bigger scale, then things work fine, but I don't see how you can say that TSO doesn't lead to head-of-line blocking etc at short time scales. And yes of course I agree you can make sure things work by using short segments or not using TSO at all. Similarly with LRO the packets that get passed to the stack are not the packets that were actually on the wire. Sure, most filtering will work fine but eg are you sure your RTT estimates aren't going to get screwed up and cause some subtle bug? And I could trot out all the same bugaboos that are brought up about RDMA and warn darkly about security problems with bugs in NIC hardware that after all has to parse and rewrite TCP and IP packets. Also, looking at the complexity and bug-fixing effort that go into making TSO work vs the really pretty small gain it gives also makes part of me wonder whether the noble proclamations about maintainability are always taken to heart. Of course I know everything I just wrote is wrong because I forgot to refer to the crucial axiom that stateless == good && RDMA == bad. And sometimes it's unfortunate that in Linux when there's disagreement about something, the default action is *not* to do something. Sorry for prolonging this argument. Dave, I should say that I appreciate all the work you've done in helping build the most kick-ass networking stack in history. And as I said before, I have plenty of interesting work to do however this turns out, so I'll try to leave any further arguing to people who actually have a dog in this fight. - R. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. 2007-08-18 5:23 ` Roland Dreier @ 2007-08-18 6:44 ` David Miller 2007-08-19 7:01 ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom " Sean Hefty 2007-08-21 1:16 ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from " Roland Dreier 0 siblings, 2 replies; 53+ messages in thread From: David Miller @ 2007-08-18 6:44 UTC (permalink / raw) To: rdreier; +Cc: jeff, netdev, linux-kernel, general From: Roland Dreier <rdreier@cisco.com> Date: Fri, 17 Aug 2007 22:23:01 -0700 > Also, looking at the complexity and bug-fixing effort that go into > making TSO work vs the really pretty small gain it gives also makes > part of me wonder whether the noble proclamations about > maintainability are always taken to heart. The cpu and bus utilization improvements of TSO on the sender side are more than significant. Ask anyone who looks closely at this. For example, as part of his batching work Krisha Kumar has been posting lots of numbers lately on the netdev list, I'm sure he can post more specific numbers comparing the current stack in the case of TSO disabled vs. TSO enabled if that is what you need to see how beneficial TSO in fact is. If TSO is such a lose why does pretty much every ethernet chip vendor implement it in hardware? If you say it's just because Microsoft defines TSO in their NDI, that's a total cop-out. It really does help performance a lot. Why did the Xen folks bother making generic software TSO infrastructure for the kernel for the benefit of their virtualization network device? Why would someone as bright as Herbert Xu even bother to implement that stuff if TSO gives a "pretty small gain"? Similarly for LRO and this isn't defined in NDI at all. Vendors are going so far as to put full flow tables in their chips in order to do LRO better. Using the bugs and issues we've run into while implementing TSO as evidence there is something wrong with it is a total straw man. Look how many times the filesystem page cache has been rewritten over the years. Use the TSO problems as more of an example of how shitty a programmer I must be. :) Just be realistic and accept that RDMA is a point in time solution, and like any other such technology takes flexibility away from users. Horizontal scaling of cpus up to huge arity cores, network devices using large numbers of transmit and receive queues and classification based queue selection, are all going to work to make things like RDMA even more irrelevant than they already are. If you can't see that this is the future, you have my condolences. Because frankly, the signs are all around that this is where things are going. The work doesn't belong in these special purpose devices, they belong in the far-end-node compute resources, and our computers are getting more and more of these general purpose compute engines every day. We will be constantly moving away from specialized solutions and towards those which solve large classes of problems for large groups of people. ^ permalink raw reply [flat|nested] 53+ messages in thread
* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom the host TCP port space. 2007-08-18 6:44 ` David Miller @ 2007-08-19 7:01 ` Sean Hefty 2007-08-19 7:23 ` David Miller 2007-08-21 1:16 ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from " Roland Dreier 1 sibling, 1 reply; 53+ messages in thread From: Sean Hefty @ 2007-08-19 7:01 UTC (permalink / raw) To: 'David Miller', rdreier; +Cc: netdev, general, linux-kernel, jeff >Just be realistic and accept that RDMA is a point in time solution, >and like any other such technology takes flexibility away from users. All technologies are just point in time solutions. While management is important, shouldn't the customers decide how important it is relative to their problems? Whether some future technology will be better matters little if a problem needs to be solved today. >If you can't see that this is the future, you have my condolences. >Because frankly, the signs are all around that this is where things >are going. Adding a bazillion cores to a processor doesn't do a thing to help memory bandwidth. Millions of Infiniband ports are in operation today. Over 25% of the top 500 supercomputers use Infiniband. The formation of the OpenFabrics Alliance was pushed and has been continuously funded by an RDMA customer - the US National Labs. RDMA technologies are backed by Cisco, IBM, Intel, QLogic, Sun, Voltaire, Mellanox, NetApp, AMD, Dell, HP, Oracle, Unisys, Emulex, Hitachi, NEC, Fujitsu, LSI, SGI, Sandia, and at least two dozen other companies. IDC expects Infiniband adapter revenue to triple between 2006 and 2011, and switch revenue to increase six-fold (combined revenues of 1 billion). Customers see real benefits using channel based architectures. Do all customers need it? Of course not. Is it a niche? Yes, but I would say that about any 10+ gig network. That doesn't mean that it hasn't become essential for some customers. - Sean ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom the host TCP port space. 2007-08-19 7:01 ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom " Sean Hefty @ 2007-08-19 7:23 ` David Miller 2007-08-19 17:33 ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom " Felix Marti 2007-08-20 4:31 ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom " ssufficool 0 siblings, 2 replies; 53+ messages in thread From: David Miller @ 2007-08-19 7:23 UTC (permalink / raw) To: sean.hefty; +Cc: rdreier, jeff, netdev, linux-kernel, general From: "Sean Hefty" <sean.hefty@intel.com> Date: Sun, 19 Aug 2007 00:01:07 -0700 > Millions of Infiniband ports are in operation today. Over 25% of the top 500 > supercomputers use Infiniband. The formation of the OpenFabrics Alliance was > pushed and has been continuously funded by an RDMA customer - the US National > Labs. RDMA technologies are backed by Cisco, IBM, Intel, QLogic, Sun, Voltaire, > Mellanox, NetApp, AMD, Dell, HP, Oracle, Unisys, Emulex, Hitachi, NEC, Fujitsu, > LSI, SGI, Sandia, and at least two dozen other companies. IDC expects > Infiniband adapter revenue to triple between 2006 and 2011, and switch revenue > to increase six-fold (combined revenues of 1 billion). Scale these numbers with reality and usage. These vendors pour in huge amounts of money into a relatively small number of extremely large cluster installations. Besides the folks doing nuke and whole-earth simulations at some government lab, nobody cares. And part of the investment is not being done wholly for smart economic reasons, but also largely publicity purposes. So present your great Infiniband numbers with that being admitted up front, ok? It's relevance to Linux as a general purpose operating system that should be "good enough" for %99 of the world is close to NIL. People have been pouring tons of money and research into doing stupid things to make clusters go fast, and in such a way that make zero sense for general purpose operating systems, for ages. RDMA is just one such example. BTW, I find it ironic that you mention memory bandwidth as a retort, as Roland's favorite stateless offload devil, TSO, deals explicity with lowering the per-packet BUS bandwidth usage of TCP. LRO offloading does likewise. ^ permalink raw reply [flat|nested] 53+ messages in thread
* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space. 2007-08-19 7:23 ` David Miller @ 2007-08-19 17:33 ` Felix Marti 2007-08-19 19:32 ` David Miller 2007-08-20 0:18 ` Herbert Xu 2007-08-20 4:31 ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom " ssufficool 1 sibling, 2 replies; 53+ messages in thread From: Felix Marti @ 2007-08-19 17:33 UTC (permalink / raw) To: David Miller, sean.hefty; +Cc: netdev, rdreier, jeff, linux-kernel, general > -----Original Message----- > From: general-bounces@lists.openfabrics.org [mailto:general- > bounces@lists.openfabrics.org] On Behalf Of David Miller > Sent: Sunday, August 19, 2007 12:24 AM > To: sean.hefty@intel.com > Cc: netdev@vger.kernel.org; rdreier@cisco.com; > general@lists.openfabrics.org; linux-kernel@vger.kernel.org; > jeff@garzik.org > Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate > PS_TCPportsfrom the host TCP port space. > > From: "Sean Hefty" <sean.hefty@intel.com> > Date: Sun, 19 Aug 2007 00:01:07 -0700 > > > Millions of Infiniband ports are in operation today. Over 25% of the > top 500 > > supercomputers use Infiniband. The formation of the OpenFabrics > Alliance was > > pushed and has been continuously funded by an RDMA customer - the US > National > > Labs. RDMA technologies are backed by Cisco, IBM, Intel, QLogic, > Sun, Voltaire, > > Mellanox, NetApp, AMD, Dell, HP, Oracle, Unisys, Emulex, Hitachi, > NEC, Fujitsu, > > LSI, SGI, Sandia, and at least two dozen other companies. IDC > expects > > Infiniband adapter revenue to triple between 2006 and 2011, and > switch revenue > > to increase six-fold (combined revenues of 1 billion). > > Scale these numbers with reality and usage. > > These vendors pour in huge amounts of money into a relatively small > number of extremely large cluster installations. Besides the folks > doing nuke and whole-earth simulations at some government lab, nobody > cares. And part of the investment is not being done wholly for smart > economic reasons, but also largely publicity purposes. > > So present your great Infiniband numbers with that being admitted up > front, ok? > > It's relevance to Linux as a general purpose operating system that > should be "good enough" for %99 of the world is close to NIL. > > People have been pouring tons of money and research into doing stupid > things to make clusters go fast, and in such a way that make zero > sense for general purpose operating systems, for ages. RDMA is just > one such example. [Felix Marti] Ouch, and I believed linux to be a leading edge OS, scaling from small embedded systems to hundreds of CPUs and hence I assumed that the same 'scalability' applies to the network subsystem. > > BTW, I find it ironic that you mention memory bandwidth as a retort, > as Roland's favorite stateless offload devil, TSO, deals explicity > with lowering the per-packet BUS bandwidth usage of TCP. LRO > offloading does likewise. [Felix Marti] Aren't you confusing memory and bus BW here? - RDMA enables DMA from/to application buffers removing the user-to-kernel/ kernel-to-user memory copy with is a significant overhead at the rates we're talking about: memory copy at 20Gbps (10Gbps in and 10Gbps out) requires 60Gbps of BW on most common platforms. So, receiving and transmitting at 10Gbps with LRO and TSO requires 80Gbps of system memory BW (which is beyond what most systems can do) whereas RDMA can do with 20Gbps! 
In addition, BUS improvements are really not significant (nor are buses the bottleneck anymore with wide availability of PCI-E >= x8); TSO avoids the DMA of a bunch of network headers... a typical example of stateless offload - improving performance by a few percent while offload technologies provide system improvements of hundreds of percent. I know that you don't agree that TSO has drawbacks, as outlined by Roland, but its history showing something else: the addition of TSO took a fair amount of time and network performance was erratic for multiple kernel revisions and the TSO code is sprinkled across the network stack. It is an example of an intrusive 'improvement' whereas Steve (who started this thread) is asking for a relatively small change (decoupling the 4-tuple allocation from the socket). As Steve has outlined, your refusal of the change requires RDMA users to work around the issue which pushes the issue to the end-users and thus slowing down the acceptance of the technology leading to a chicken-and-egg problem: you only care if there are lots of users but you make it hard to use the technology in the first place, clever ;) > _______________________________________________ > general mailing list > general@lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- > general ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space. 2007-08-19 17:33 ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom " Felix Marti @ 2007-08-19 19:32 ` David Miller 2007-08-19 19:49 ` Felix Marti 2007-08-20 0:18 ` Herbert Xu 1 sibling, 1 reply; 53+ messages in thread From: David Miller @ 2007-08-19 19:32 UTC (permalink / raw) To: felix; +Cc: jeff, netdev, rdreier, linux-kernel, general From: "Felix Marti" <felix@chelsio.com> Date: Sun, 19 Aug 2007 10:33:31 -0700 > I know that you don't agree that TSO has drawbacks, as outlined by > Roland, but its history showing something else: the addition of TSO > took a fair amount of time and network performance was erratic for > multiple kernel revisions and the TSO code is sprinkled across the > network stack. This thing you call "sprinkled" is a necessity of any hardware offload when it is possible for a packet to later get "steered" to a device which cannot perform the offload. Therefore we need a software implementation of TSO so that those packets can still get output to the non-TSO-capable device. We do the same thing for checksum offloading. And for free we can use the software offloading mechanism to get batching to arbitrary network devices, even those which cannot do TSO. What benefits does RDMA infrastructure give to non-RDMA capable devices? None? I see, that's great. And again the TSO bugs and issues are being overstated and, also for the second time, these issues are more indicative of my bad programming skills then they are of intrinsic issues of TSO. The TSO implementation was looking for a good design, and it took me a while to find it because I personally suck. Face it, stateless offloads are always going to be better in the long term. And this is proven. You RDMA folks really do live in some kind of fantasy land. ^ permalink raw reply [flat|nested] 53+ messages in thread
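The software-fallback point is visible in how a driver enables TSO: it only advertises feature flags, and a packet steered to a device without those flags is segmented in software by the stack instead. A sketch of that, assuming the 2007-era flag names; mydev_set_features() and the device are placeholders, not code from any real driver.

/* Sketch: stateless offload as a pure hint.  The driver advertises TSO
 * via feature flags; packets routed to a device lacking NETIF_F_TSO are
 * segmented in software by the stack. */
#include <linux/netdevice.h>

static void mydev_set_features(struct net_device *dev)
{
	dev->features |= NETIF_F_SG | NETIF_F_IP_CSUM;	/* prerequisites for TSO */
	dev->features |= NETIF_F_TSO;			/* hardware TCP segmentation */
}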
* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space. 2007-08-19 19:32 ` David Miller @ 2007-08-19 19:49 ` Felix Marti 2007-08-19 23:04 ` David Miller 2007-08-19 23:27 ` Andi Kleen 0 siblings, 2 replies; 53+ messages in thread From: Felix Marti @ 2007-08-19 19:49 UTC (permalink / raw) To: David Miller; +Cc: jeff, netdev, rdreier, linux-kernel, general > -----Original Message----- > From: David Miller [mailto:davem@davemloft.net] > Sent: Sunday, August 19, 2007 12:32 PM > To: Felix Marti > Cc: sean.hefty@intel.com; netdev@vger.kernel.org; rdreier@cisco.com; > general@lists.openfabrics.org; linux-kernel@vger.kernel.org; > jeff@garzik.org > Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate > PS_TCPportsfrom the host TCP port space. > > From: "Felix Marti" <felix@chelsio.com> > Date: Sun, 19 Aug 2007 10:33:31 -0700 > > > I know that you don't agree that TSO has drawbacks, as outlined by > > Roland, but its history showing something else: the addition of TSO > > took a fair amount of time and network performance was erratic for > > multiple kernel revisions and the TSO code is sprinkled across the > > network stack. > > This thing you call "sprinkled" is a necessity of any hardware > offload when it is possible for a packet to later get "steered" > to a device which cannot perform the offload. > > Therefore we need a software implementation of TSO so that those > packets can still get output to the non-TSO-capable device. > > We do the same thing for checksum offloading. > > And for free we can use the software offloading mechanism to > get batching to arbitrary network devices, even those which cannot > do TSO. > > What benefits does RDMA infrastructure give to non-RDMA capable > devices? None? I see, that's great. > > And again the TSO bugs and issues are being overstated and, also for > the second time, these issues are more indicative of my bad > programming skills then they are of intrinsic issues of TSO. The > TSO implementation was looking for a good design, and it took me > a while to find it because I personally suck. > > Face it, stateless offloads are always going to be better in the long > term. And this is proven. > > You RDMA folks really do live in some kind of fantasy land. [Felix Marti] You're not at all addressing the fact that RDMA does solve the memory BW problem and stateless offload doesn't. Apart from that, I don't quite understand your argument with respect to the benefits of the RDMA infrastructure; what benefits does the TSO infrastructure give the non-TSO capable devices? Isn't the answer none and yet you added TSO support?! I don't think that the argument is stateless _versus_ stateful offload both have their advantages and disadvantages. Stateless offload does help, i.e. TSO/LRO do improve performance in back-to-back benchmarks. It seems me that _you_ claim that there is no benefit to statefull offload and that is where we're disagreeing; there is benefit and i.e. the much lower memory BW requirements is just one example, yet an important one. We'll probably never agree but it seems to me that we're asking only for small changes to the software stack and then we can give the choice to the end users: they can opt for stateless offload if it fits the performance needs or for statefull offload if their apps require the extra boost in performance. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space. 2007-08-19 19:49 ` Felix Marti @ 2007-08-19 23:04 ` David Miller 2007-08-20 0:32 ` Felix Marti 2007-08-19 23:27 ` Andi Kleen 1 sibling, 1 reply; 53+ messages in thread From: David Miller @ 2007-08-19 23:04 UTC (permalink / raw) To: felix; +Cc: jeff, netdev, rdreier, linux-kernel, general From: "Felix Marti" <felix@chelsio.com> Date: Sun, 19 Aug 2007 12:49:05 -0700 > You're not at all addressing the fact that RDMA does solve the > memory BW problem and stateless offload doesn't. It does, I just didn't retort to your claims because they were so blatantly wrong. ^ permalink raw reply [flat|nested] 53+ messages in thread
* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space. 2007-08-19 23:04 ` David Miller @ 2007-08-20 0:32 ` Felix Marti 2007-08-20 0:40 ` David Miller 0 siblings, 1 reply; 53+ messages in thread From: Felix Marti @ 2007-08-20 0:32 UTC (permalink / raw) To: David Miller; +Cc: jeff, netdev, rdreier, linux-kernel, general > -----Original Message----- > From: David Miller [mailto:davem@davemloft.net] > Sent: Sunday, August 19, 2007 4:04 PM > To: Felix Marti > Cc: sean.hefty@intel.com; netdev@vger.kernel.org; rdreier@cisco.com; > general@lists.openfabrics.org; linux-kernel@vger.kernel.org; > jeff@garzik.org > Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate > PS_TCPportsfrom the host TCP port space. > > From: "Felix Marti" <felix@chelsio.com> > Date: Sun, 19 Aug 2007 12:49:05 -0700 > > > You're not at all addressing the fact that RDMA does solve the > > memory BW problem and stateless offload doesn't. > > It does, I just didn't retort to your claims because they were > so blatantly wrong. [Felix Marti] Hmmm, interesting... I guess it is impossible to even have a discussion on the subject. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space. 2007-08-20 0:32 ` Felix Marti @ 2007-08-20 0:40 ` David Miller 2007-08-20 0:47 ` Felix Marti 0 siblings, 1 reply; 53+ messages in thread From: David Miller @ 2007-08-20 0:40 UTC (permalink / raw) To: felix; +Cc: jeff, netdev, rdreier, linux-kernel, general From: "Felix Marti" <felix@chelsio.com> Date: Sun, 19 Aug 2007 17:32:39 -0700 [ Why do you put that "[Felix Marti]" everywhere you say something? It's annoying and superfluous. The quoting done by your mail client makes clear who is saying what. ] > Hmmm, interesting... I guess it is impossible to even have > a discussion on the subject. Nice try, Herbert Xu gave a great explanation. ^ permalink raw reply [flat|nested] 53+ messages in thread
* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space. 2007-08-20 0:40 ` David Miller @ 2007-08-20 0:47 ` Felix Marti 2007-08-20 1:05 ` David Miller 2007-08-20 9:43 ` Evgeniy Polyakov 0 siblings, 2 replies; 53+ messages in thread From: Felix Marti @ 2007-08-20 0:47 UTC (permalink / raw) To: David Miller; +Cc: jeff, netdev, rdreier, linux-kernel, general > -----Original Message----- > From: David Miller [mailto:davem@davemloft.net] > Sent: Sunday, August 19, 2007 5:40 PM > To: Felix Marti > Cc: sean.hefty@intel.com; netdev@vger.kernel.org; rdreier@cisco.com; > general@lists.openfabrics.org; linux-kernel@vger.kernel.org; > jeff@garzik.org > Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate > PS_TCPportsfrom the host TCP port space. > > From: "Felix Marti" <felix@chelsio.com> > Date: Sun, 19 Aug 2007 17:32:39 -0700 > > [ Why do you put that "[Felix Marti]" everywhere you say something? > It's annoying and superfluous. The quoting done by your mail client > makes clear who is saying what. ] > > > Hmmm, interesting... I guess it is impossible to even have > > a discussion on the subject. > > Nice try, Herbert Xu gave a great explanation. [Felix Marti] David and Herbert, so you agree that the user<>kernel space memory copy overhead is a significant overhead and we want to enable zero-copy in both the receive and transmit path? - Yes, copy avoidance is mainly an API issue and unfortunately the so widely used (synchronous) sockets API doesn't make copy avoidance easy, which is one area where protocol offload can help. Yes, some apps can resort to sendfile() but there are many apps which seem to have trouble switching to that API... and what about the receive path? ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space. 2007-08-20 0:47 ` Felix Marti @ 2007-08-20 1:05 ` David Miller 2007-08-20 1:41 ` Felix Marti 2007-08-20 9:43 ` Evgeniy Polyakov 1 sibling, 1 reply; 53+ messages in thread From: David Miller @ 2007-08-20 1:05 UTC (permalink / raw) To: felix; +Cc: jeff, netdev, rdreier, linux-kernel, general From: "Felix Marti" <felix@chelsio.com> Date: Sun, 19 Aug 2007 17:47:59 -0700 > [Felix Marti] Please stop using this to start your replies, thank you. > David and Herbert, so you agree that the user<>kernel > space memory copy overhead is a significant overhead and we want to > enable zero-copy in both the receive and transmit path? - Yes, copy > avoidance is mainly an API issue and unfortunately the so widely used > (synchronous) sockets API doesn't make copy avoidance easy, which is one > area where protocol offload can help. Yes, some apps can resort to > sendfile() but there are many apps which seem to have trouble switching > to that API... and what about the receive path? On the send side none of this is an issue. You either are sending static content, in which using sendfile() is trivial, or you're generating data dynamically in which case the data copy is in the noise or too small to do zerocopy on and if not you can use a shared mmap to generate your data into, and then sendfile out from that file, to avoid the copy that way. splice() helps a lot too. Splice has the capability to do away with the receive side too, and there are a few receivefile() implementations that could get cleaned up and merged in. Also, the I/O bus is still the more limiting factor and main memory bandwidth in all of this, it is the smallest data pipe for communications out to and from the network. So the protocol header avoidance gains of TSO and LRO are still a very worthwhile savings. But even if RDMA increases performance 100 fold, it still doesn't avoid the issue that it doesn't fit in with the rest of the networking stack and feature set. Any monkey can change the rules around ("ok I can make it go fast as long as you don't need firewalling, packet scheduling, classification, and you only need to talk to specific systems that speak this same special protocol") to make things go faster. On the other hand well designed solutions can give performance gains within the constraints of the full system design and without sactificing functionality. ^ permalink raw reply [flat|nested] 53+ messages in thread
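A minimal sketch of the copy-avoiding send path described above: generate the payload into a file-backed shared mapping, then push it to the socket with sendfile(). A connected TCP socket is assumed; the file name and the trimmed error handling are illustrative only.

/* Sketch only: dynamically generated data is written into a shared
 * file mapping and handed to the stack with sendfile(), so no
 * user-to-kernel payload copy is needed on the transmit path. */
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int send_generated(int sock, size_t len)
{
	int fd = open("/tmp/payload", O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0 || ftruncate(fd, len) < 0)
		return -1;

	/* generate the data in place through a shared mapping of the file */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return -1;
	}
	memset(p, 'x', len);		/* stand-in for real data generation */
	munmap(p, len);

	/* the pages go out from the page cache; no copy through a user buffer */
	off_t off = 0;
	ssize_t sent = sendfile(sock, fd, &off, len);
	close(fd);
	return sent == (ssize_t)len ? 0 : -1;
}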
* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space. 2007-08-20 1:05 ` David Miller @ 2007-08-20 1:41 ` Felix Marti 2007-08-20 11:07 ` Andi Kleen 0 siblings, 1 reply; 53+ messages in thread From: Felix Marti @ 2007-08-20 1:41 UTC (permalink / raw) To: David Miller; +Cc: sean.hefty, netdev, rdreier, general, linux-kernel, jeff > -----Original Message----- > From: David Miller [mailto:davem@davemloft.net] > Sent: Sunday, August 19, 2007 6:06 PM > To: Felix Marti > Cc: sean.hefty@intel.com; netdev@vger.kernel.org; rdreier@cisco.com; > general@lists.openfabrics.org; linux-kernel@vger.kernel.org; > jeff@garzik.org > Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate > PS_TCPportsfrom the host TCP port space. > > From: "Felix Marti" <felix@chelsio.com> > Date: Sun, 19 Aug 2007 17:47:59 -0700 > > > [Felix Marti] > > Please stop using this to start your replies, thank you. Better? > > > David and Herbert, so you agree that the user<>kernel > > space memory copy overhead is a significant overhead and we want to > > enable zero-copy in both the receive and transmit path? - Yes, copy > > avoidance is mainly an API issue and unfortunately the so widely used > > (synchronous) sockets API doesn't make copy avoidance easy, which is > one > > area where protocol offload can help. Yes, some apps can resort to > > sendfile() but there are many apps which seem to have trouble > switching > > to that API... and what about the receive path? > > On the send side none of this is an issue. You either are sending > static content, in which using sendfile() is trivial, or you're > generating data dynamically in which case the data copy is in the > noise or too small to do zerocopy on and if not you can use a shared > mmap to generate your data into, and then sendfile out from that file, > to avoid the copy that way. > > splice() helps a lot too. > > Splice has the capability to do away with the receive side too, and > there are a few receivefile() implementations that could get cleaned > up and merged in. I don't believe it is as simple as that. Many apps synthesize their payload in user space buffers (i.e. malloced memory) and expect to receive their data in user space buffers _and_ expect the received data to have a certain alignment and to be contiguous - something not addressed by these 'new' APIs. Look, people writing HPC apps tend to take advantage of whatever they can to squeeze some extra performance out of their apps and they are resorting to protocol offload technology for a reason, wouldn't you agree? > > Also, the I/O bus is still the more limiting factor and main memory > bandwidth in all of this, it is the smallest data pipe for > communications out to and from the network. So the protocol header > avoidance gains of TSO and LRO are still a very worthwhile savings. So, i.e. with TSO, your saving about 16 headers (let us say 14 + 20 + 20), 864B, when moving ~64KB of payload - looks like very much in the noise to me. And again, PCI-E provides more bandwidth than the wire... > > But even if RDMA increases performance 100 fold, it still doesn't > avoid the issue that it doesn't fit in with the rest of the networking > stack and feature set. > > Any monkey can change the rules around ("ok I can make it go fast as > long as you don't need firewalling, packet scheduling, classification, > and you only need to talk to specific systems that speak this same > special protocol") to make things go faster. 
> On the other hand well > designed solutions can give performance gains within the constraints > of the full system design and without sactificing functionality. While I believe that you should give people an option to get 'high performance' _instead_ of other features and let them choose whatever they care about, I really do agree with what you're saying and believe that offload devices _should_ be integrated with the facilities that you mention (in fact, offload can do a much better job at lots of things that you mention ;) ... but you're not letting offload devices integrate and you're slowing down innovation in this field. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space. 2007-08-20 1:41 ` Felix Marti @ 2007-08-20 11:07 ` Andi Kleen 2007-08-20 16:26 ` Felix Marti 2007-08-20 19:16 ` Rick Jones 0 siblings, 2 replies; 53+ messages in thread From: Andi Kleen @ 2007-08-20 11:07 UTC (permalink / raw) To: Felix Marti Cc: David Miller, sean.hefty, netdev, rdreier, general, linux-kernel, jeff "Felix Marti" <felix@chelsio.com> writes: > > avoidance gains of TSO and LRO are still a very worthwhile savings. > So, i.e. with TSO, your saving about 16 headers (let us say 14 + 20 + > 20), 864B, when moving ~64KB of payload - looks like very much in the > noise to me. TSO is beneficial for the software again. The linux code currently takes several locks and does quite a few function calls for each packet and using larger packets lowers this overhead. At least with 10GbE saving CPU cycles is still quite important. > an option to get 'high performance' Shouldn't you qualify that? It is unlikely you really duplicated all the tuning for corner cases that went over many years into good software TCP stacks in your hardware. So e.g. for wide area networks with occasional packet loss the software might well perform better. -Andi ^ permalink raw reply [flat|nested] 53+ messages in thread
* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space. 2007-08-20 11:07 ` Andi Kleen @ 2007-08-20 16:26 ` Felix Marti 2007-08-20 19:16 ` Rick Jones 1 sibling, 0 replies; 53+ messages in thread From: Felix Marti @ 2007-08-20 16:26 UTC (permalink / raw) To: Andi Kleen Cc: David Miller, sean.hefty, netdev, rdreier, general, linux-kernel, jeff > -----Original Message----- > From: ak@suse.de [mailto:ak@suse.de] On Behalf Of Andi Kleen > Sent: Monday, August 20, 2007 4:07 AM > To: Felix Marti > Cc: David Miller; sean.hefty@intel.com; netdev@vger.kernel.org; > rdreier@cisco.com; general@lists.openfabrics.org; linux- > kernel@vger.kernel.org; jeff@garzik.org > Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate > PS_TCPportsfrom the host TCP port space. > > "Felix Marti" <felix@chelsio.com> writes: > > > avoidance gains of TSO and LRO are still a very worthwhile savings. > > So, i.e. with TSO, your saving about 16 headers (let us say 14 + 20 + > > 20), 864B, when moving ~64KB of payload - looks like very much in the > > noise to me. > > TSO is beneficial for the software again. The linux code currently > takes several locks and does quite a few function calls for each > packet and using larger packets lowers this overhead. At least with > 10GbE saving CPU cycles is still quite important. > > > an option to get 'high performance' > > Shouldn't you qualify that? > > It is unlikely you really duplicated all the tuning for corner cases > that went over many years into good software TCP stacks in your > hardware. So e.g. for wide area networks with occasional packet loss > the software might well perform better. Yes, it used to be sufficient to submit performance data to show that a technology make 'sense'. In fact, I believe it was Alan Cox who once said that linux will have a look at offload once an offload device holds the land speed record (probably assuming that the day never comes ;). For the last few years it has been Chelsio offload devices that have been improving their own LSRs (as IO bus speeds have been increasing). It is worthwhile to point out that OC-192 doesn't offer full 10Gbps BW and the fine-grained (per packet and not per TSO-burst) packet scheduler in the offload device played a crucial part in pushing performance to the limits of what OC-192 can do. Most other customers use our offload products in low-latency cluster environments. - The problem with offload devices is that they are not all born equal and there have been a lot of poor implementation giving the technology a bad name. I can only speak for Chelsio and do claim that we have a solid implementation that scales from low-latency clusters environments to LFNs. Andi, I could present performance numbers, i.e. throughput and CPU utilization in function of IO size, number of connections, ... in a back-to-back environment and/or in a cluster environment... but what will it get me? I'd still get hit by the 'not integrated' hammer :( > > -Andi ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space. 2007-08-20 11:07 ` Andi Kleen 2007-08-20 16:26 ` Felix Marti @ 2007-08-20 19:16 ` Rick Jones 1 sibling, 0 replies; 53+ messages in thread From: Rick Jones @ 2007-08-20 19:16 UTC (permalink / raw) To: Andi Kleen; +Cc: jeff, netdev, rdreier, linux-kernel, general, David Miller Andi Kleen wrote: > TSO is beneficial for the software again. The linux code currently > takes several locks and does quite a few function calls for each > packet and using larger packets lowers this overhead. At least with > 10GbE saving CPU cycles is still quite important. Some quick netperf TCP_RR tests between a pair of dual-core rx6600's running 2.6.23-rc3. The NICs are dual-core e1000's connected back-to-back with the interrupt throttle disabled. I like using TCP_RR to tickle path-length questions because it rarely runs into bandwidth limitations regardless of the link-type. First, with TSO enabled on both sides, then with it disabled, netperf/netserver bound to the same CPU as takes interrupts, which is the "best" place to be for a TCP_RR test (although not always for a TCP_STREAM test...):

:~# netperf -T 1 -t TCP_RR -H 192.168.2.105 -I 99,1 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.105 (192.168.2.105) port 0 AF_INET : +/-0.5% @ 99% conf. : first burst 0 : cpu bind
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      :  0.3%
!!!                       Local CPU util  : 39.3%
!!!                       Remote CPU util : 40.6%

Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      10.01   18611.32 20.96  22.35  22.522  24.017
16384  87380

:~# ethtool -K eth2 tso off
e1000: eth2: e1000_set_tso: TSO is Disabled
:~# netperf -T 1 -t TCP_RR -H 192.168.2.105 -I 99,1 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.105 (192.168.2.105) port 0 AF_INET : +/-0.5% @ 99% conf. : first burst 0 : cpu bind
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      :  0.4%
!!!                       Local CPU util  : 21.0%
!!!                       Remote CPU util : 25.2%

Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      10.01   19812.51 17.81  17.19  17.983  17.358
16384  87380

While the confidence intervals for CPU util weren't hit, I suspect the differences in service demand were still real. On throughput we are talking about +/- 0.2%, for CPU util we are talking about +/- 20% (percent not percentage points) in the first test and 12.5% in the second. So, in broad handwaving terms, TSO increased the per-transaction service demand by something along the lines of (23.27 - 17.67)/17.67 or ~30% and the transaction rate decreased by ~6%. rick jones bitrate blindness is a constant concern ^ permalink raw reply [flat|nested] 53+ messages in thread
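For reference, the ~30% figure is just the ratio of the averaged service-demand columns from the two runs above; a trivial check, with the us/Tr values copied from that output:

/* Quick check of the service-demand delta quoted above (us/Tr values
 * taken from the two netperf runs). */
#include <stdio.h>

int main(void)
{
	double tso_on  = (22.522 + 24.017) / 2;	/* ~23.27 us/Tr */
	double tso_off = (17.983 + 17.358) / 2;	/* ~17.67 us/Tr */

	printf("service demand increase with TSO: %.1f%%\n",
	       (tso_on - tso_off) / tso_off * 100);	/* ~31.7%, i.e. "~30%" */
	return 0;
}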
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space. 2007-08-20 0:47 ` Felix Marti 2007-08-20 1:05 ` David Miller @ 2007-08-20 9:43 ` Evgeniy Polyakov 2007-08-20 16:53 ` Felix Marti 1 sibling, 1 reply; 53+ messages in thread From: Evgeniy Polyakov @ 2007-08-20 9:43 UTC (permalink / raw) To: Felix Marti Cc: David Miller, sean.hefty, netdev, rdreier, general, linux-kernel, jeff On Sun, Aug 19, 2007 at 05:47:59PM -0700, Felix Marti (felix@chelsio.com) wrote: > [Felix Marti] David and Herbert, so you agree that the user<>kernel > space memory copy overhead is a significant overhead and we want to > enable zero-copy in both the receive and transmit path? - Yes, copy It depends. If you need to access that data after it is received, you will get a cache miss and performance will not be much better (if any) than with a copy. > avoidance is mainly an API issue and unfortunately the so widely used > (synchronous) sockets API doesn't make copy avoidance easy, which is one > area where protocol offload can help. Yes, some apps can resort to > sendfile() but there are many apps which seem to have trouble switching > to that API... and what about the receive path? There are a number of implementations, and all they are suitable for is to have recvfile(), since this is likely the only case which can work without the cache. And actually the RDMA stack exists and no one said it should be thrown away _until_ it messes with the main stack. It has started to steal ports. What will happen when it gets all of the port space and no new legal network connection can be opened, although there is no way to show the user who got it? What will happen if a hardware RDMA connection gets terminated and the software could not free the port? Will RDMA request to export connection reset functions out of the stack to drop network connections which are on the ports which are supposed to be used by new RDMA connections? RDMA is not a problem, but how it influences the network stack is. Let's better think about how to work correctly with the network stack (since we already have that cr^Wdifferent hardware) instead of saying that others do bad work and do not allow a shiny new feature to exist. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 53+ messages in thread
* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space. 2007-08-20 9:43 ` Evgeniy Polyakov @ 2007-08-20 16:53 ` Felix Marti 2007-08-20 18:10 ` Andi Kleen 2007-08-20 20:33 ` Patrick Geoffray 0 siblings, 2 replies; 53+ messages in thread From: Felix Marti @ 2007-08-20 16:53 UTC (permalink / raw) To: Evgeniy Polyakov Cc: jeff, netdev, rdreier, linux-kernel, general, David Miller > -----Original Message----- > From: Evgeniy Polyakov [mailto:johnpol@2ka.mipt.ru] > Sent: Monday, August 20, 2007 2:43 AM > To: Felix Marti > Cc: David Miller; sean.hefty@intel.com; netdev@vger.kernel.org; > rdreier@cisco.com; general@lists.openfabrics.org; linux- > kernel@vger.kernel.org; jeff@garzik.org > Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate > PS_TCPportsfrom the host TCP port space. > > On Sun, Aug 19, 2007 at 05:47:59PM -0700, Felix Marti > (felix@chelsio.com) wrote: > > [Felix Marti] David and Herbert, so you agree that the user<>kernel > > space memory copy overhead is a significant overhead and we want to > > enable zero-copy in both the receive and transmit path? - Yes, copy > > It depends. If you need to access that data after received, you will > get > cache miss and performance will not be much better (if any) that with > copy. Yes, the app will take the cache hits when accessing the data. However, the fact remains that if there is a copy in the receive path, you require and additional 3x memory BW (which is very significant at these high rates and most likely the bottleneck for most current systems)... and somebody always has to take the cache miss be it the copy_to_user or the app. > > > avoidance is mainly an API issue and unfortunately the so widely used > > (synchronous) sockets API doesn't make copy avoidance easy, which is > one > > area where protocol offload can help. Yes, some apps can resort to > > sendfile() but there are many apps which seem to have trouble > switching > > to that API... and what about the receive path? > > There is number of implementations, and all they are suitable for is > to have recvfile(), since this is likely the only case, which can work > without cache. > > And actually RDMA stack exist and no one said it should be thrown away > _until_ it messes with main stack. It started to speal ports. What will > happen when it gest all port space and no new legal network conection > can be opened, although there is no way to show to user who got it? > What will happen if hardware RDMA connection got terminated and > software > could not free the port? Will RDMA request to export connection reset > functions out of stack to drop network connections which are on the > ports > which are supposed to be used by new RDMA connections? Yes, RDMA support is there... but we could make it better and easier to use. We have a problem today with port sharing and there was a proposal to address the issue by tighter integration (see the beginning of the thread) but the proposal got shot down immediately... because it is RDMA and not for technical reasons. I believe this email threads shows in detail how RDMA (a network technology) is treated as bastard child by the network folks, well at least by one of them. > > RDMA is not a problem, but how it influence to the network stack is. > Let's better think about how to work correctly with network stack > (since > we already have that cr^Wdifferent hardware) instead of saying that > others do bad work and do not allow shiny new feature to exist. 
By no means did I want to imply that others do bad work; are you referring to me using TSO implementation issues as an example? - If so, let me clarify: I understand that the TSO implementation took some time to get right. What I was referring to is that TSO(/LRO) have their own issues, some alluded to by Roland and me. In fact, customers working on the LSR couldn't use TSO due to the burstiness it introduces and had to fall back to our fine-grained packet scheduling done in the offload device. I am for variety: let us support new technologies that solve real problems (lots of folks are buying this stuff for a reason) instead of the 'ah, it's brain-dead and has no future' attitude... there is precedent for offloading the host CPUs: have a look at graphics. Graphics used to be done by the host CPU and now we have dedicated graphics adapters that do a much better job... so, why is it so farfetched that offload devices can do a better job at a data-flow problem? > > -- > Evgeniy Polyakov ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space. 2007-08-20 16:53 ` Felix Marti @ 2007-08-20 18:10 ` Andi Kleen 2007-08-20 19:02 ` Felix Marti 2007-08-20 20:33 ` Patrick Geoffray 1 sibling, 1 reply; 53+ messages in thread From: Andi Kleen @ 2007-08-20 18:10 UTC (permalink / raw) To: Felix Marti Cc: Evgeniy Polyakov, jeff, netdev, rdreier, linux-kernel, general, David Miller "Felix Marti" <felix@chelsio.com> writes: > What I was referring to is that TSO(/LRO) have their own > issues, some eluded to by Roland and me. In fact, customers working on > the LSR couldn't use TSO due to the burstiness it introduces That was in old kernels where TSO didn't honor the initial cwnd correctly, right? I assume it's long fixed. If not please clarify what the problem was. > have a look at graphics. > Graphics used to be done by the host CPU and now we have dedicated > graphics adapters that do a much better job... Is your off load device as programable as a modern GPU? > farfetched that offload devices can do a better job at a data-flow > problem? One big difference is that there is no potentially adverse and always varying internet between the graphics card and your monitor. -Andi ^ permalink raw reply [flat|nested] 53+ messages in thread
* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space. 2007-08-20 18:10 ` Andi Kleen @ 2007-08-20 19:02 ` Felix Marti 2007-08-20 20:18 ` Thomas Graf 0 siblings, 1 reply; 53+ messages in thread From: Felix Marti @ 2007-08-20 19:02 UTC (permalink / raw) To: Andi Kleen Cc: Evgeniy Polyakov, jeff, netdev, rdreier, linux-kernel, general, David Miller > -----Original Message----- > From: ak@suse.de [mailto:ak@suse.de] On Behalf Of Andi Kleen > Sent: Monday, August 20, 2007 11:11 AM > To: Felix Marti > Cc: Evgeniy Polyakov; jeff@garzik.org; netdev@vger.kernel.org; > rdreier@cisco.com; linux-kernel@vger.kernel.org; > general@lists.openfabrics.org; David Miller > Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate > PS_TCPportsfrom the host TCP port space. > > "Felix Marti" <felix@chelsio.com> writes: > > > What I was referring to is that TSO(/LRO) have their own > > issues, some eluded to by Roland and me. In fact, customers working > on > > the LSR couldn't use TSO due to the burstiness it introduces > > That was in old kernels where TSO didn't honor the initial cwnd > correctly, > right? I assume it's long fixed. > > If not please clarify what the problem was. The problem is that is that Ethernet is about the only technology that discloses 'useable' throughput while everybody else talks about signaling rates ;) - OC-192 can carry about 9.128Gbps (or close to that number) and hence 10Gbps Ethernet was overwhelming the OC-192 network. The customer needed to schedule packets at about 98% of OC-192 throughput in order to avoid packet drop. The scheduling needed to be done on a per packet basis and not per 'burst of packets' basis in order to avoid packet drop. > > > have a look at graphics. > > Graphics used to be done by the host CPU and now we have dedicated > > graphics adapters that do a much better job... > > Is your off load device as programable as a modern GPU? It has a lot of knobs to turn. > > > farfetched that offload devices can do a better job at a data-flow > > problem? > > One big difference is that there is no potentially adverse and > always varying internet between the graphics card and your monitor. These graphic adapters provide a wealth of features that you can take advantage of to bring these amazing graphics to life. General purpose CPUs cannot keep up. Chelsio offload devices do the same thing in the realm of networking. - Will there be things you can't do, probably yes, but as I said, there are lots of knobs to turn (and the latest and greatest feature that gets hyped up might not always be the best thing since sliced bread anyway; what happened to BIC love? ;) > > -Andi ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space. 2007-08-20 19:02 ` Felix Marti @ 2007-08-20 20:18 ` Thomas Graf 2007-08-20 20:33 ` Andi Kleen 0 siblings, 1 reply; 53+ messages in thread From: Thomas Graf @ 2007-08-20 20:18 UTC (permalink / raw) To: Felix Marti Cc: Andi Kleen, Evgeniy Polyakov, jeff, netdev, rdreier, linux-kernel, general, David Miller * Felix Marti <felix@chelsio.com> 2007-08-20 12:02 > These graphic adapters provide a wealth of features that you can take > advantage of to bring these amazing graphics to life. General purpose > CPUs cannot keep up. Chelsio offload devices do the same thing in the > realm of networking. - Will there be things you can't do, probably yes, > but as I said, there are lots of knobs to turn (and the latest and > greatest feature that gets hyped up might not always be the best thing > since sliced bread anyway; what happened to BIC love? ;) GPUs have almost no influence on system security, the network stack OTOH is probably the most vulnerable part of an operating system. Even if all vendors would implement all the features collected over the last years properly which seems unlikely. Having such an essential and critical part depend on the vendor of my network card without being able to even verify it properly is truly frightening. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space. 2007-08-20 20:18 ` Thomas Graf @ 2007-08-20 20:33 ` Andi Kleen 0 siblings, 0 replies; 53+ messages in thread From: Andi Kleen @ 2007-08-20 20:33 UTC (permalink / raw) To: Thomas Graf Cc: Evgeniy Polyakov, jeff, netdev, rdreier, linux-kernel, Andi Kleen, general, David Miller > GPUs have almost no influence on system security, Unless you use direct rendering from user space. -Andi ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space. 2007-08-20 16:53 ` Felix Marti 2007-08-20 18:10 ` Andi Kleen @ 2007-08-20 20:33 ` Patrick Geoffray 2007-08-21 4:21 ` Felix Marti 1 sibling, 1 reply; 53+ messages in thread From: Patrick Geoffray @ 2007-08-20 20:33 UTC (permalink / raw) To: Felix Marti Cc: Evgeniy Polyakov, jeff, netdev, rdreier, linux-kernel, general, David Miller Felix Marti wrote: > Yes, the app will take the cache hits when accessing the data. However, > the fact remains that if there is a copy in the receive path, you > require and additional 3x memory BW (which is very significant at these > high rates and most likely the bottleneck for most current systems)... > and somebody always has to take the cache miss be it the copy_to_user or > the app. The cache miss is going to cost you half the memory bandwidth of a full copy. If the data is already in cache, then the copy is cheaper. However, removing the copy removes the kernel from the picture on the receive side, so you lose demultiplexing, asynchronism, security, accounting, flow-control, swapping, etc. If it's ok with you to not use the kernel stack, then why expect to fit in the existing infrastructure anyway ? > Yes, RDMA support is there... but we could make it better and easier to What do you need from the kernel for RDMA support beyond HW drivers ? A fast way to pin and translate user memory (ie registration). That is pretty much the sandbox that David referred to. Eventually, it would be useful to be able to track the VM space to implement a registration cache instead of using ugly hacks in user-space to hijack malloc, but this is completely independent from the net stack. > use. We have a problem today with port sharing and there was a proposal The port spaces are either totally separate and there is no issue, or completely identical and you should then run your connection manager in user-space or fix your middlewares. > and not for technical reasons. I believe this email threads shows in > detail how RDMA (a network technology) is treated as bastard child by > the network folks, well at least by one of them. I don't think it's fair. This thread actually show how pushy some RDMA folks are about not acknowledging that the current infrastructure is here for a reason, and about mistaking zero-copy and RDMA. This is a similar argument than the TOE discussion, and it was definitively a good decision to not mess up the Linux stack with TOEs. Patrick ^ permalink raw reply [flat|nested] 53+ messages in thread
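To make the "pin and translate" registration step above concrete, a minimal user-space sketch against the libibverbs API; the device and protection domain setup are assumed to have been done elsewhere, and error handling beyond the allocation check is omitted.

/* Sketch of user memory registration with libibverbs: ibv_reg_mr() pins
 * the pages and sets up the adapter's address translation.  The returned
 * lkey/rkey are what local and remote work requests use to name this
 * buffer afterwards.  Assumes an already-allocated protection domain. */
#include <stdlib.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
{
	void *buf = malloc(len);	/* ordinary, unpinned user memory */

	if (!buf)
		return NULL;
	return ibv_reg_mr(pd, buf, len,
			  IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}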
* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space. 2007-08-20 20:33 ` Patrick Geoffray @ 2007-08-21 4:21 ` Felix Marti 0 siblings, 0 replies; 53+ messages in thread From: Felix Marti @ 2007-08-21 4:21 UTC (permalink / raw) To: Patrick Geoffray Cc: Evgeniy Polyakov, jeff, netdev, rdreier, linux-kernel, general, David Miller > -----Original Message----- > From: Patrick Geoffray [mailto:patrick@myri.com] > Sent: Monday, August 20, 2007 1:34 PM > To: Felix Marti > Cc: Evgeniy Polyakov; David Miller; sean.hefty@intel.com; > netdev@vger.kernel.org; rdreier@cisco.com; > general@lists.openfabrics.org; linux-kernel@vger.kernel.org; > jeff@garzik.org > Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate > PS_TCPportsfrom the host TCP port space. > > Felix Marti wrote: > > Yes, the app will take the cache hits when accessing the data. > However, > > the fact remains that if there is a copy in the receive path, you > > require and additional 3x memory BW (which is very significant at > these > > high rates and most likely the bottleneck for most current > systems)... > > and somebody always has to take the cache miss be it the copy_to_user > or > > the app. > > The cache miss is going to cost you half the memory bandwidth of a full > copy. If the data is already in cache, then the copy is cheaper. > > However, removing the copy removes the kernel from the picture on the > receive side, so you lose demultiplexing, asynchronism, security, > accounting, flow-control, swapping, etc. If it's ok with you to not use > the kernel stack, then why expect to fit in the existing infrastructure > anyway ? Many of the things you're referring to are moved to the offload adapter but from an ease of use point of view, it would be great if the user could still collect stats the same way, i.e. netstat reports the 4-tuple in use and other network stats. In addition, security features and packet scheduling could be integrated so that the user configures them the same way as the network stack. > > > Yes, RDMA support is there... but we could make it better and easier > to > > What do you need from the kernel for RDMA support beyond HW drivers ? A > fast way to pin and translate user memory (ie registration). That is > pretty much the sandbox that David referred to. > > Eventually, it would be useful to be able to track the VM space to > implement a registration cache instead of using ugly hacks in user- > space > to hijack malloc, but this is completely independent from the net > stack. > > > use. We have a problem today with port sharing and there was a > proposal > > The port spaces are either totally separate and there is no issue, or > completely identical and you should then run your connection manager in > user-space or fix your middlewares. When running on an iWarp device (and hence on top of TCP) I believe that the port space should shared and i.e. netstat reports the 4-tuple in use. > > > and not for technical reasons. I believe this email threads shows in > > detail how RDMA (a network technology) is treated as bastard child by > > the network folks, well at least by one of them. > > I don't think it's fair. This thread actually show how pushy some RDMA > folks are about not acknowledging that the current infrastructure is > here for a reason, and about mistaking zero-copy and RDMA. Zero-copy and RDMA are not the same but in the context of this discussion I referred to RDMA as a superset (zero-copy is implied). 
> > This is a similar argument than the TOE discussion, and it was > definitively a good decision to not mess up the Linux stack with TOEs. > > Patrick ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space. 2007-08-19 19:49 ` Felix Marti 2007-08-19 23:04 ` David Miller @ 2007-08-19 23:27 ` Andi Kleen 2007-08-19 23:12 ` David Miller 2007-08-20 1:45 ` Felix Marti 1 sibling, 2 replies; 53+ messages in thread From: Andi Kleen @ 2007-08-19 23:27 UTC (permalink / raw) To: Felix Marti; +Cc: jeff, netdev, rdreier, linux-kernel, general, David Miller "Felix Marti" <felix@chelsio.com> writes: > what benefits does the TSO infrastructure give the > non-TSO capable devices? It improves performance on software queueing devices between guests and hypervisors. This is a more and more important application these days. Even when the system running the Hypervisor has a non TSO capable device in the end it'll still save CPU cycles this way. Right now virtualized IO tends to much more CPU intensive than direct IO so any help it can get is beneficial. It also makes loopback faster, although given that's probably not that useful. And a lot of the "TSO infrastructure" was needed for zero copy TX anyways, which benefits most reasonable modern NICs (anything with hardware checksumming) -Andi ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space. 2007-08-19 23:27 ` Andi Kleen @ 2007-08-19 23:12 ` David Miller 2007-08-20 1:45 ` Felix Marti 1 sibling, 0 replies; 53+ messages in thread From: David Miller @ 2007-08-19 23:12 UTC (permalink / raw) To: andi; +Cc: jeff, netdev, rdreier, linux-kernel, general From: Andi Kleen <andi@firstfloor.org> Date: 20 Aug 2007 01:27:35 +0200 > "Felix Marti" <felix@chelsio.com> writes: > > > what benefits does the TSO infrastructure give the > > non-TSO capable devices? > > It improves performance on software queueing devices between guests > and hypervisors. This is a more and more important application these > days. Even when the system running the Hypervisor has a non TSO > capable device in the end it'll still save CPU cycles this way. Right now > virtualized IO tends to much more CPU intensive than direct IO so any > help it can get is beneficial. > > It also makes loopback faster, although given that's probably not that > useful. > > And a lot of the "TSO infrastructure" was needed for zero copy TX anyways, > which benefits most reasonable modern NICs (anything with hardware > checksumming) And also, you can enable TSO generation for a non-TSO-hw device and get all of the segmentation overhead reduction gains which works out as a pure win as long as the device can at a minimum do checksumming. ^ permalink raw reply [flat|nested] 53+ messages in thread
* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space. 2007-08-19 23:27 ` Andi Kleen 2007-08-19 23:12 ` David Miller @ 2007-08-20 1:45 ` Felix Marti 1 sibling, 0 replies; 53+ messages in thread From: Felix Marti @ 2007-08-20 1:45 UTC (permalink / raw) To: Andi Kleen; +Cc: jeff, netdev, rdreier, linux-kernel, general, David Miller > -----Original Message----- > From: ak@suse.de [mailto:ak@suse.de] On Behalf Of Andi Kleen > Sent: Sunday, August 19, 2007 4:28 PM > To: Felix Marti > Cc: David Miller; jeff@garzik.org; netdev@vger.kernel.org; > rdreier@cisco.com; linux-kernel@vger.kernel.org; > general@lists.openfabrics.org > Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate > PS_TCPportsfrom the host TCP port space. > > "Felix Marti" <felix@chelsio.com> writes: > > > what benefits does the TSO infrastructure give the > > non-TSO capable devices? > > It improves performance on software queueing devices between guests > and hypervisors. This is a more and more important application these > days. Even when the system running the Hypervisor has a non TSO > capable device in the end it'll still save CPU cycles this way. Right > now > virtualized IO tends to much more CPU intensive than direct IO so any > help it can get is beneficial. > > It also makes loopback faster, although given that's probably not that > useful. > > And a lot of the "TSO infrastructure" was needed for zero copy TX > anyways, > which benefits most reasonable modern NICs (anything with hardware > checksumming) Hi Andi, yes, you're right. I should have chosen my example more carefully. > > -Andi ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space. 2007-08-19 17:33 ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom " Felix Marti 2007-08-19 19:32 ` David Miller @ 2007-08-20 0:18 ` Herbert Xu 1 sibling, 0 replies; 53+ messages in thread From: Herbert Xu @ 2007-08-20 0:18 UTC (permalink / raw) To: Felix Marti Cc: davem, sean.hefty, netdev, rdreier, general, linux-kernel, jeff Felix Marti <felix@chelsio.com> wrote: > > [Felix Marti] Aren't you confusing memory and bus BW here? - RDMA > enables DMA from/to application buffers removing the user-to-kernel/ > kernel-to-user memory copy with is a significant overhead at the > rates we're talking about: memory copy at 20Gbps (10Gbps in and 10Gbps > out) requires 60Gbps of BW on most common platforms. So, receiving and > transmitting at 10Gbps with LRO and TSO requires 80Gbps of system > memory BW (which is beyond what most systems can do) whereas RDMA can > do with 20Gbps! Actually this is false. TSO only requires a copy if the user chooses to use the sendmsg interface instead of sendpage. The same is true for RDMA really. Except that instead of having to switch your application to sendfile/splice, you're switching it to RDMA. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom the host TCP port space. 2007-08-19 7:23 ` David Miller 2007-08-19 17:33 ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom " Felix Marti @ 2007-08-20 4:31 ` ssufficool 1 sibling, 0 replies; 53+ messages in thread From: ssufficool @ 2007-08-20 4:31 UTC (permalink / raw) To: David Miller; +Cc: jeff, netdev, rdreier, linux-kernel, general We implemented a small office solution using Infiniband purely on a cost per performance mark. We have a small cluster of 10 servers, all from HP, for less than 120K. Pure and simple, Infiniband offers the best price per performance when considering SAN and MPI consolidation vs F.C. + GbE. Not limited to top 500 HPC anymore, just those with common sense. On Sun, 2007-08-19 at 00:23 -0700, David Miller wrote: > From: "Sean Hefty" <sean.hefty@intel.com> > Date: Sun, 19 Aug 2007 00:01:07 -0700 > > > Millions of Infiniband ports are in operation today. Over 25% of the top 500 > > supercomputers use Infiniband. The formation of the OpenFabrics Alliance was > > pushed and has been continuously funded by an RDMA customer - the US National > > Labs. RDMA technologies are backed by Cisco, IBM, Intel, QLogic, Sun, Voltaire, > > Mellanox, NetApp, AMD, Dell, HP, Oracle, Unisys, Emulex, Hitachi, NEC, Fujitsu, > > LSI, SGI, Sandia, and at least two dozen other companies. IDC expects > > Infiniband adapter revenue to triple between 2006 and 2011, and switch revenue > > to increase six-fold (combined revenues of 1 billion). > > Scale these numbers with reality and usage. > > These vendors pour in huge amounts of money into a relatively small > number of extremely large cluster installations. Besides the folks > doing nuke and whole-earth simulations at some government lab, nobody > cares. And part of the investment is not being done wholly for smart > economic reasons, but also largely publicity purposes. > > So present your great Infiniband numbers with that being admitted up > front, ok? > > It's relevance to Linux as a general purpose operating system that > should be "good enough" for %99 of the world is close to NIL. > > People have been pouring tons of money and research into doing stupid > things to make clusters go fast, and in such a way that make zero > sense for general purpose operating systems, for ages. RDMA is just > one such example. > > BTW, I find it ironic that you mention memory bandwidth as a retort, > as Roland's favorite stateless offload devil, TSO, deals explicity > with lowering the per-packet BUS bandwidth usage of TCP. LRO > offloading does likewise. > _______________________________________________ > general mailing list > general@lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. 2007-08-18 6:44 ` David Miller 2007-08-19 7:01 ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom " Sean Hefty @ 2007-08-21 1:16 ` Roland Dreier 2007-08-21 6:58 ` David Miller 1 sibling, 1 reply; 53+ messages in thread From: Roland Dreier @ 2007-08-21 1:16 UTC (permalink / raw) To: David Miller; +Cc: tom, jeff, swise, mshefty, netdev, linux-kernel, general [TSO / LRO discussion snipped -- it's not the main point so no sense spending energy arguing about it] > Just be realistic and accept that RDMA is a point in time solution, > and like any other such technology takes flexibility away from users. > > Horizontal scaling of cpus up to huge arity cores, network devices > using large numbers of transmit and receive queues and classification > based queue selection, are all going to work to make things like RDMA > even more irrelevant than they already are. To me there is a real fundamental difference between RDMA and traditional SOCK_STREAM / SOCK_DATAGRAM networking, namely that messages can carry the address where they're supposed to be delivered (what the IETF calls "direct data placement"). And on top of that you can build one-sided operations aka put/get aka RDMA. And direct data placement really does give you a factor of two at least, because otherwise you're stuck receiving the data in one buffer, looking at some of the data at least, and then figuring out where to copy it. And memory bandwidth is if anything becoming more valuable; maybe LRO + header splitting + page remapping tricks can get you somewhere but as NCPUS grows then it seems the TLB shootdown cost of page flipping is only going to get worse. Don't get too hung up on the fact that current iWARP (RDMA over IP) implementations are using TCP offload -- to me that is just a side effect of doing enough processing on the NIC side of the PCI bus to be able to do direct data placement. InfiniBand with competely different transport, link and physical layers is one way to implement RDMA without TCP offload and I'm sure there will be others -- eg Intel's IOAT stuff could probably evolve to the point where you could implement iWARP with software TCP and the data placement offloaded to some DMA engine. - R. ^ permalink raw reply [flat|nested] 53+ messages in thread
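As an illustration of "the message carries the address where it is supposed to be delivered", a sketch of posting a one-sided RDMA write with libibverbs; queue pair setup, memory registration, and the out-of-band exchange of the peer's address and rkey are all assumed to have happened already, and completion handling is omitted.

/* Sketch of a one-sided RDMA write: the work request itself names the
 * remote buffer (address + rkey), which is the direct data placement
 * property discussed above. */
#include <stdint.h>
#include <infiniband/verbs.h>

int rdma_put(struct ibv_qp *qp, struct ibv_mr *mr, void *local, uint32_t len,
	     uint64_t remote_addr, uint32_t rkey)
{
	struct ibv_sge sge = {
		.addr   = (uintptr_t)local,
		.length = len,
		.lkey   = mr->lkey,
	};
	struct ibv_send_wr wr = {
		.opcode     = IBV_WR_RDMA_WRITE,
		.sg_list    = &sge,
		.num_sge    = 1,
		.send_flags = IBV_SEND_SIGNALED,
	};
	struct ibv_send_wr *bad;

	/* the destination is named by the sender; the adapter places the
	 * data at remote_addr directly, with no receive-side copy */
	wr.wr.rdma.remote_addr = remote_addr;
	wr.wr.rdma.rkey        = rkey;

	return ibv_post_send(qp, &wr, &bad);
}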
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. 2007-08-21 1:16 ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from " Roland Dreier @ 2007-08-21 6:58 ` David Miller 0 siblings, 0 replies; 53+ messages in thread From: David Miller @ 2007-08-21 6:58 UTC (permalink / raw) To: rdreier; +Cc: jeff, netdev, linux-kernel, general From: Roland Dreier <rdreier@cisco.com> Date: Mon, 20 Aug 2007 18:16:54 -0700 > And direct data placement really does give you a factor of two at > least, because otherwise you're stuck receiving the data in one > buffer, looking at some of the data at least, and then figuring out > where to copy it. And memory bandwidth is if anything becoming more > valuable; maybe LRO + header splitting + page remapping tricks can get > you somewhere but as NCPUS grows then it seems the TLB shootdown cost > of page flipping is only going to get worse. As Herbert has said already, people can code for this just like they have to code for RDMA. There is no fundamental difference from converting an application to sendfile or similar. The only thing this needs is a "recvmsg_I_dont_care_where_the_data_is()" call. There are no alignment issues unless you are trying to push this data directly into the page cache. Couple this with a card that makes sure that on a per-page basis, only data for a particular flow (or group of flows) will accumulate. People already make cards that can do stuff like this, it can be done statelessly with an on-chip dynamically maintained flow table. And best yet it doesn't turn off every feature in the networking nor bypass it for the actual protocol processing. ^ permalink raw reply [flat|nested] 53+ messages in thread
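The receive call suggested above does not exist; purely to illustrate the shape of the idea, such an interface might look roughly like the following. All names and semantics here are hypothetical.

/* Hypothetical sketch only; no such call exists.  The kernel (or a
 * flow-aware NIC behind it) hands back references to the pages where a
 * flow's data already landed, instead of copying into a caller-supplied
 * buffer.  The caller would release the pages afterwards with a matching,
 * equally hypothetical, call. */
#include <stddef.h>
#include <sys/types.h>

struct rcv_region {		/* hypothetical */
	void	*addr;		/* kernel-chosen, page-aligned data */
	size_t	 len;
};

/* returns the number of regions filled in, or -1 on error */
ssize_t recvmsg_anywhere(int sock, struct rcv_region *regions,
			 int nregions, int flags);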
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. 2007-08-09 21:55 ` David Miller 2007-08-09 23:22 ` Sean Hefty 2007-08-15 14:42 ` Steve Wise @ 2007-10-08 21:54 ` Steve Wise 2007-10-09 13:44 ` James Lentini 2007-10-10 21:01 ` Sean Hefty 2 siblings, 2 replies; 53+ messages in thread From: Steve Wise @ 2007-10-08 21:54 UTC (permalink / raw) To: David Miller; +Cc: mshefty, rdreier, netdev, linux-kernel, general David Miller wrote: > From: Sean Hefty <mshefty@ichips.intel.com> > Date: Thu, 09 Aug 2007 14:40:16 -0700 > >> Steve Wise wrote: >>> Any more comments? >> Does anyone have ideas on how to reserve the port space without using a >> struct socket? > > How about we just remove the RDMA stack altogether? I am not at all > kidding. If you guys can't stay in your sand box and need to cause > problems for the normal network stack, it's unacceptable. We were > told all along the if RDMA went into the tree none of this kind of > stuff would be an issue. > > These are exactly the kinds of problems for which people like myself > were dreading. These subsystems have no buisness using the TCP port > space of the Linux software stack, absolutely none. > > After TCP port reservation, what's next? It seems an at least > bi-monthly event that the RDMA folks need to put their fingers > into something else in the normal networking stack. No more. > > I will NACK any patch that opens up sockets to eat up ports or > anything stupid like that. Hey Dave, The hack to use a socket and bind it to claim the port was just for demostrating the idea. The correct solution, IMO, is to enhance the core low level 4-tuple allocation services to be more generic (eg: not be tied to a struct sock). Then the host tcp stack and the host rdma stack can allocate TCP/iWARP ports/4tuples from this common exported service and share the port space. This allocation service could also be used by other deep adapters like iscsi adapters if needed. Will you NAK such a solution if I go implement it and submit for review? The dual ip subnet solution really sux, and I'm trying one more time to see if you will entertain the common port space solution, if done correctly. Thanks, Steve. ^ permalink raw reply [flat|nested] 53+ messages in thread
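Nothing like the common allocation service proposed above exists in the kernel; purely to make the shape of the proposal concrete, the exported interface might look roughly like this, with all names invented for illustration.

/* Hypothetical sketch of the shared 4-tuple allocation service proposed
 * above; the names are invented for illustration only.  Both the native
 * TCP stack and an iWARP connection manager would reserve endpoints here
 * instead of keeping separate port spaces. */
struct inet_4tuple {
	__be32	saddr, daddr;
	__be16	sport, dport;	/* sport == 0 means pick an ephemeral port */
};

/* Reserve a local port (or a full 4-tuple) in the shared TCP port space.
 * Returns the bound source port, or a negative errno. */
int inet_tuple_reserve(struct inet_4tuple *t, void *owner);

/* Release a previously reserved tuple. */
void inet_tuple_release(const struct inet_4tuple *t);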
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. 2007-10-08 21:54 ` Steve Wise @ 2007-10-09 13:44 ` James Lentini 2007-10-10 21:01 ` Sean Hefty 1 sibling, 0 replies; 53+ messages in thread From: James Lentini @ 2007-10-09 13:44 UTC (permalink / raw) To: Steve Wise; +Cc: David Miller, rdreier, linux-kernel, general, netdev On Mon, 8 Oct 2007, Steve Wise wrote: > The correct solution, IMO, is to enhance the core low level 4-tuple > allocation services to be more generic (eg: not be tied to a struct > sock). Then the host tcp stack and the host rdma stack can allocate > TCP/iWARP ports/4tuples from this common exported service and share > the port space. This allocation service could also be used by other > deep adapters like iscsi adapters if needed. As a developer of an RDMA ULP, NFS-RDMA, I like this approach because it will simplify the configuration of an RDMA device and the services that use it. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. 2007-10-08 21:54 ` Steve Wise 2007-10-09 13:44 ` James Lentini @ 2007-10-10 21:01 ` Sean Hefty 2007-10-10 23:04 ` David Miller 1 sibling, 1 reply; 53+ messages in thread From: Sean Hefty @ 2007-10-10 21:01 UTC (permalink / raw) To: Steve Wise; +Cc: netdev, rdreier, David Miller, general, linux-kernel > The hack to use a socket and bind it to claim the port was just for > demostrating the idea. The correct solution, IMO, is to enhance the > core low level 4-tuple allocation services to be more generic (eg: not > be tied to a struct sock). Then the host tcp stack and the host rdma > stack can allocate TCP/iWARP ports/4tuples from this common exported > service and share the port space. This allocation service could also be > used by other deep adapters like iscsi adapters if needed. Since iWarp runs on top of TCP, the port space is really the same. FWIW, I agree that this proposal is the correct solution to support iWarp. - Sean ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. 2007-10-10 21:01 ` Sean Hefty @ 2007-10-10 23:04 ` David Miller 0 siblings, 0 replies; 53+ messages in thread From: David Miller @ 2007-10-10 23:04 UTC (permalink / raw) To: mshefty; +Cc: netdev, rdreier, linux-kernel, general From: Sean Hefty <mshefty@ichips.intel.com> Date: Wed, 10 Oct 2007 14:01:07 -0700 > > The hack to use a socket and bind it to claim the port was just for > > demostrating the idea. The correct solution, IMO, is to enhance the > > core low level 4-tuple allocation services to be more generic (eg: not > > be tied to a struct sock). Then the host tcp stack and the host rdma > > stack can allocate TCP/iWARP ports/4tuples from this common exported > > service and share the port space. This allocation service could also be > > used by other deep adapters like iscsi adapters if needed. > > Since iWarp runs on top of TCP, the port space is really the same. > FWIW, I agree that this proposal is the correct solution to support iWarp. But you can be sure it's not going to happen, sorry. It would mean that we'd need to export the entire TCP socket table so then when iWARP connections are created you can search to make sure there is not an existing full 4-tuple that is the same. It is not just about local TCP ports. iWARP needs to live in it's seperate little container and not contaminate the rest of the networking, this is the deal. Any suggested such change which breaks that deal will be NACK'd by all of the core networking developers. ^ permalink raw reply [flat|nested] 53+ messages in thread