* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: Jeff Garzik @ 2008-07-30 19:35 UTC
To: Karen Xie
Cc: netdev, open-iscsi, davem, michaelc, swise, rdreier, daisyc, wenxiong, bhua, divy, dm, leedom, linux-scsi, LKML

Karen Xie wrote:
> Cxgb3i iSCSI driver
>
> Signed-off-by: Karen Xie <kxie@chelsio.com>
> ---
>
>  drivers/scsi/cxgb3i/Kconfig          |    6
>  drivers/scsi/cxgb3i/Makefile         |    5
>  drivers/scsi/cxgb3i/cxgb3i.h         |  155 +++
>  drivers/scsi/cxgb3i/cxgb3i_init.c    |  109 ++
>  drivers/scsi/cxgb3i/cxgb3i_iscsi.c   |  800 ++++++++++++++
>  drivers/scsi/cxgb3i/cxgb3i_offload.c | 2001 ++++++++++++++++++++++++++++++++++
>  drivers/scsi/cxgb3i/cxgb3i_offload.h |  242 ++++
>  drivers/scsi/cxgb3i/cxgb3i_ulp2.c    |  692 ++++++++++++
>  drivers/scsi/cxgb3i/cxgb3i_ulp2.h    |  106 ++
>  9 files changed, 4116 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/scsi/cxgb3i/Kconfig
>  create mode 100644 drivers/scsi/cxgb3i/Makefile
>  create mode 100644 drivers/scsi/cxgb3i/cxgb3i.h
>  create mode 100644 drivers/scsi/cxgb3i/cxgb3i_init.c
>  create mode 100644 drivers/scsi/cxgb3i/cxgb3i_iscsi.c
>  create mode 100644 drivers/scsi/cxgb3i/cxgb3i_offload.c
>  create mode 100644 drivers/scsi/cxgb3i/cxgb3i_offload.h
>  create mode 100644 drivers/scsi/cxgb3i/cxgb3i_ulp2.c
>  create mode 100644 drivers/scsi/cxgb3i/cxgb3i_ulp2.h

Comments:

* SCSI drivers should be submitted via the linux-scsi@vger.kernel.org mailing list.

* The driver is clean and readable, well done.

* From a networking standpoint, our main concern becomes how this interacts with the networking stack. In particular, I'm concerned based on reading the source that this driver uses "TCP port stealing" rather than using a totally separate MAC address (and IP).

  Stealing a TCP port on an IP/interface already assigned is a common solution in this space, but also a flawed one. Precisely because the kernel and applications are unaware of this "special, magic TCP port", you open the potential for application problems that are very difficult for an admin to diagnose based on observed behavior.

  So, additional information on your TCP port usage would be greatly appreciated.

Also, how does this interact with IPv6? Clearly it interacts with IPv4...

	Jeff
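To make the diagnosis problem Jeff describes concrete, here is a minimal user-space sketch (illustrative only; ordinary socket calls, no driver code, and no specific stolen port number is implied, since the whole problem is that the kernel keeps no record of one): the stack has no idea the firmware owns a port, so nothing stops it from handing that same port to an application.

/* Illustration of the "stolen port" failure mode described above. */
#include <stdio.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	struct sockaddr_in local = { .sin_family = AF_INET };
	socklen_t len = sizeof(local);

	local.sin_addr.s_addr = htonl(INADDR_ANY);
	local.sin_port = 0;	/* ask the kernel for any free port */
	bind(fd, (struct sockaddr *)&local, sizeof(local));
	getsockname(fd, (struct sockaddr *)&local, &len);

	/* The kernel believes this port is free and unremarkable.  If the
	 * NIC firmware is independently using the same port for an
	 * offloaded iSCSI connection, inbound segments for it never reach
	 * the stack: the application sees timeouts or resets, and netstat
	 * shows nothing unusual to explain them. */
	printf("bound to port %u\n", ntohs(local.sin_port));
	close(fd);
	return 0;
}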
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: Roland Dreier @ 2008-07-30 21:35 UTC
To: Jeff Garzik
Cc: Karen Xie, netdev, open-iscsi, davem, michaelc, swise, daisyc, wenxiong, bhua, divy, dm, leedom, linux-scsi, LKML

> * From a networking standpoint, our main concern becomes how this
> interacts with the networking stack. In particular, I'm concerned
> based on reading the source that this driver uses "TCP port stealing"
> rather than using a totally separate MAC address (and IP).
>
> Stealing a TCP port on an IP/interface already assigned is a common
> solution in this space, but also a flawed one. Precisely because the
> kernel and applications are unaware of this "special, magic TCP port",
> you open the potential for application problems that are very
> difficult for an admin to diagnose based on observed behavior.

That's true, but using a separate MAC and IP opens up a bunch of other
operational problems. I don't think the right answer for iSCSI offload
is clear yet.

 - R.
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: Divy Le Ray @ 2008-08-01 0:51 UTC
To: Roland Dreier
Cc: Jeff Garzik, Karen Xie, netdev, open-iscsi, davem, michaelc, Steve Wise, daisyc, wenxiong, bhua, Dimitrios Michailidis, Casey Leedom, linux-scsi, LKML

On Wednesday 30 July 2008 02:35:51 pm Roland Dreier wrote:
> > * From a networking standpoint, our main concern becomes how this
> > interacts with the networking stack. In particular, I'm concerned
> > based on reading the source that this driver uses "TCP port stealing"
> > rather than using a totally separate MAC address (and IP).
> >
> > Stealing a TCP port on an IP/interface already assigned is a common
> > solution in this space, but also a flawed one. Precisely because the
> > kernel and applications are unaware of this "special, magic TCP port",
> > you open the potential for application problems that are very
> > difficult for an admin to diagnose based on observed behavior.
>
> That's true, but using a separate MAC and IP opens up a bunch of other
> operational problems. I don't think the right answer for iSCSI offload
> is clear yet.
>
> - R.

Hi Jeff,

We've considered the approach of having separate IP/MAC addresses to
manage iSCSI connections. In such a context, the stack would have to be
unaware of this iSCSI-specific IP address. The iSCSI driver would then
have to implement at least its own ARP reply mechanism. DHCP too would
have to be managed separately. Most network setting/monitoring tools
would also be unavailable.

The open-iscsi initiator is not a huge consumer of TCP connections;
allocating a TCP port from the stack would be reasonable in terms of
resources in this context. It is however unclear if that is an
acceptable approach.

Our current implementation was designed to be the most tolerable one
within the aforementioned constraints - real or expected.

Cheers,
Divy
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: Divy Le Ray @ 2008-08-07 18:45 UTC
To: Roland Dreier
Cc: Jeff Garzik, Karen Xie, netdev, open-iscsi, davem, michaelc, Steve Wise, daisyc, wenxiong, bhua, Dimitrios Michailidis, Casey Leedom, linux-scsi, LKML

On Thursday 31 July 2008 05:51:59 pm Divy Le Ray wrote:
> On Wednesday 30 July 2008 02:35:51 pm Roland Dreier wrote:
> > [...]
> > That's true, but using a separate MAC and IP opens up a bunch of other
> > operational problems. I don't think the right answer for iSCSI offload
> > is clear yet.
>
> Hi Jeff,
>
> We've considered the approach of having separate IP/MAC addresses to
> manage iSCSI connections. In such a context, the stack would have to be
> unaware of this iSCSI-specific IP address. The iSCSI driver would then
> have to implement at least its own ARP reply mechanism. DHCP too would
> have to be managed separately. Most network setting/monitoring tools
> would also be unavailable.
>
> The open-iscsi initiator is not a huge consumer of TCP connections;
> allocating a TCP port from the stack would be reasonable in terms of
> resources in this context. It is however unclear if that is an
> acceptable approach.
>
> Our current implementation was designed to be the most tolerable one
> within the aforementioned constraints - real or expected.

Hi Jeff,

Mike Christie will not merge this code until he has an explicit
acknowledgement from netdev.

As you mentioned, the port stealing approach we've taken has its issues.
We consequently analyzed your suggestion to use a different IP/MAC
address for iSCSI, and it raises other tough issues (separate ARP and
DHCP management, unavailability of common networking tools). On these
grounds, we believe our current approach is the most tolerable. If the
stack provided a TCP port allocation service, we'd be glad to use it to
resolve the current concerns.

The cxgb3i driver is up and running here; its merge is pending our
decision.

Cheers,
Divy
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: Mike Christie @ 2008-08-07 20:07 UTC
To: Divy Le Ray
Cc: Roland Dreier, Jeff Garzik, Karen Xie, netdev, open-iscsi, davem, Steve Wise, daisyc, wenxiong, bhua, Dimitrios Michailidis, Casey Leedom, linux-scsi, LKML

Divy Le Ray wrote:
> On Thursday 31 July 2008 05:51:59 pm Divy Le Ray wrote:
> > [...]
>
> Hi Jeff,
>
> Mike Christie will not merge this code until he has an explicit
> acknowledgement from netdev.
>
> As you mentioned, the port stealing approach we've taken has its issues.
> We consequently analyzed your suggestion to use a different IP/MAC
> address for iSCSI, and it raises other tough issues (separate ARP and
> DHCP management, unavailability of common networking tools).

If the iscsi tools did not have to deal with networking issues that are
already handled by other networking tools, it would be great for iscsi
users, since they would not have to learn new tools. Maybe we could
somehow hook into the existing network tools so they support these iscsi
hbas as well as normal NICs. Would it be possible to have the iscsi hbas
export the necessary network interfaces so that existing network tools
can manage them?

If it comes down to it and your port stealing implementation is not
acceptable, like how broadcom's was not, I will be ok with doing some
special iscsi network tools. Or, instead of special iscsi tools, is
there something that the RDMA/iWarp guys are using that we can share?
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: Steve Wise @ 2008-08-08 18:09 UTC
To: Divy Le Ray, Jeff Garzik, davem
Cc: Roland Dreier, Karen Xie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, Dimitrios Michailidis, Casey Leedom, linux-scsi, LKML

> Hi Jeff,
>
> Mike Christie will not merge this code until he has an explicit
> acknowledgement from netdev.
>
> As you mentioned, the port stealing approach we've taken has its issues.
> We consequently analyzed your suggestion to use a different IP/MAC
> address for iSCSI, and it raises other tough issues (separate ARP and
> DHCP management, unavailability of common networking tools).
> On these grounds, we believe our current approach is the most tolerable.
> If the stack provided a TCP port allocation service, we'd be glad to
> use it to resolve the current concerns.
> The cxgb3i driver is up and running here; its merge is pending our
> decision.
>
> Cheers,
> Divy

Hey Dave/Jeff,

I think we need some guidance here on how to proceed. Is the approach
currently being reviewed ACKable? Or is it DOA? If it's DOA, then what
approach do you recommend? I believe Jeff's opinion is a separate
ipaddr. But Dave, what do you think? Let's get some agreement on a high
level design here.

Possible solutions seen to date include:

1) Reserving a socket to allocate the port. This has been NAK'd in the
past and I assume is still a no go.

2) Creating a 4-tuple allocation service so the host stack, the rdma
stack, and the iscsi stack can share the same TCP 4-tuple space. This
also has been NAK'd in the past and I assume is still a no go.

3) The iscsi device allocates its own local ephemeral ports (port
stealing) and uses the host's ip address for the iscsi offload device.
This is the current proposal and you can review the thread for the pros
and cons. IMO it is the least objectionable (and I think we really
should be doing #2).

4) The iscsi device manages its own ip address, thus ensuring 4-tuple
uniqueness.

Unless you all want to re-open considering #1 or #2, we're left with 3
or 4. Which one?

Steve.
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: Jeff Garzik @ 2008-08-08 22:15 UTC
To: Steve Wise, davem
Cc: Divy Le Ray, Roland Dreier, Karen Xie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, Dimitrios Michailidis, Casey Leedom, linux-scsi, LKML, Mike Christie

Steve Wise wrote:
>> [...]
>
> Hey Dave/Jeff,
>
> I think we need some guidance here on how to proceed. Is the approach
> currently being reviewed ACKable? Or is it DOA? If it's DOA, then what
> approach do you recommend? I believe Jeff's opinion is a separate
> ipaddr. But Dave, what do you think? Let's get some agreement on a
> high level design here.
>
> Possible solutions seen to date include:
>
> 1) Reserving a socket to allocate the port. This has been NAK'd in the
> past and I assume is still a no go.
>
> 2) Creating a 4-tuple allocation service so the host stack, the rdma
> stack, and the iscsi stack can share the same TCP 4-tuple space. This
> also has been NAK'd in the past and I assume is still a no go.
>
> 3) The iscsi device allocates its own local ephemeral ports (port
> stealing) and uses the host's ip address for the iscsi offload device.
> This is the current proposal and you can review the thread for the
> pros and cons. IMO it is the least objectionable (and I think we
> really should be doing #2).
>
> 4) The iscsi device manages its own ip address, thus ensuring 4-tuple
> uniqueness.

Conceptually, it is a nasty business for the OS kernel to be forced to
co-manage an IP address in conjunction with a remote, independent entity.

Hardware designers make the mistake of assuming that firmware management
of a TCP port ("port stealing") successfully provides the illusion to
the OS that that port is simply inactive, and the OS happily continues
internetworking its merry way through life.

This is certainly not true, because of current netfilter and userland
application behavior, which often depends on being able to allocate
(bind) to random TCP ports. Allocating a TCP port successfully within
the OS, which then behaves differently from all other TCP ports (because
it is the magic iSCSI port), creates a cascading functional disconnect.
On that magic iSCSI port, strange errors will be returned instead of
proper behavior. Which, in turn, cascades through new (and inevitably
under-utilized) error handling paths in the app.

So, of course, one must work around problems like this, which leads to
one of two broad choices:

1) implement co-management (sharing) of IP address/port space, between
the OS kernel and a remote entity.

2) come up with a solution in hardware that does not require the OS to
co-manage the data it has so far been managing exclusively in software.
It should be obvious that we prefer path #2. For, trudging down path #1
means:

* One must give the user the ability to manage shared IP addresses IN A
NON-HARDWARE-SPECIFIC manner. Currently most vendors of "TCP port
stealing" solutions seem to expect each user to learn a vendor-specific
method of identifying and managing the "magic port". Excuse my language,
but what a fucking security and management nightmare in a cross-vendor
environment.

  It is already a pain, with some [unnamed system/chipset vendors']
management stealing TCP ports -- and admins only discover this fact when
applications behave strangely on new hardware. But... it's tough to
notice, because stumbling upon the magic TCP port won't happen often
unless the server is heavily loaded. Thus you have a
security/application problem once in a blue moon, due to this magic TCP
port mentioned in some obscure online documentation nobody has read.

* However, giving the user the ability to co-manage IP addresses means
hacking up the kernel TCP code and userland tools for this new concept,
something that I think DaveM would rightly be a bit reluctant to do? You
are essentially adding a bunch of special case code whenever TCP ports
are used:

	if (port in list of "magic" TCP ports with special,
	    hardware-specific behavior)
		...
	else
		do what we've been doing for decades

  ISTR Roland(?) pointing out code that already does a bit of this in
the IB space... but the point is

Finally, this shared IP address/port co-management thing has several
problems listed on the TOE page:

	http://www.linuxfoundation.org/en/Net:TOE

such as:

* Security updates for TCP problems mean that a single IP address can be
PARTIALLY SECURE, because security updates for the kernel TCP stack and
h/w's firmware are inevitably updated separately (even if distributed
and compiled together). Yay, we are introducing a wonderful new security
problem here.

* From a security, network scanner and packet classifier point of view,
a single IP address no longer behaves like Linux. It behaves like
Linux... sometimes. Depending on whether it is a magic TCP port or not.
Talk about security audit hell.

This should be plenty, so I'm stopping now. But looking down the TOE
wiki page I could easily come up with more reasons why "IP address
remote co-management" is more complicated and costly than you think.

	Jeff
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: Jeff Garzik @ 2008-08-08 22:20 UTC
To: Steve Wise, davem
Cc: Divy Le Ray, Roland Dreier, Karen Xie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, Dimitrios Michailidis, Casey Leedom, linux-scsi, LKML

Jeff Garzik wrote:
> * However, giving the user the ability to co-manage IP addresses means
> hacking up the kernel TCP code and userland tools for this new concept,
> something that I think DaveM would rightly be a bit reluctant to do?
> You are essentially adding a bunch of special case code whenever TCP
> ports are used:
>
>	if (port in list of "magic" TCP ports with special,
>	    hardware-specific behavior)
>		...
>	else
>		do what we've been doing for decades
>
> ISTR Roland(?) pointing out code that already does a bit of this in
> the IB space... but the point is

grrr. But the point is that the solution is not at all complete, with
feature disconnects and security audit differences still outstanding,
and non-hw-specific management apps still unwritten. (I'm not calling
for their existence, merely trying to strike the justification that the
capability to limp along already exists.)

	Jeff
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: David Miller @ 2008-08-09 7:28 UTC
To: jgarzik
Cc: swise, divy, rdreier, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel

From: Jeff Garzik <jgarzik@pobox.com>
Date: Fri, 08 Aug 2008 18:15:41 -0400

> * Security updates for TCP problems mean that a single IP address can
> be PARTIALLY SECURE, because security updates for the kernel TCP stack
> and h/w's firmware are inevitably updated separately (even if
> distributed and compiled together). Yay, we are introducing a
> wonderful new security problem here.
>
> * From a security, network scanner and packet classifier point of
> view, a single IP address no longer behaves like Linux. It behaves
> like Linux... sometimes. Depending on whether it is a magic TCP port
> or not.

I agree with everything Jeff has stated.

Also, I find it ironic that the port abduction is being asked for in
order to be "compatible with existing tools", yet in fact this stuff
breaks everything. You can't netfilter this traffic, you can't apply
qdiscs to it, you can't execute TC actions on them, you can't do
segmentation offload on them, you can't look for the usual TCP MIB
statistics on the connection, etc. etc. etc.

It is broken from every possible angle.
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: Steve Wise @ 2008-08-09 14:04 UTC
To: David Miller
Cc: jgarzik, divy, rdreier, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel

David Miller wrote:
> From: Jeff Garzik <jgarzik@pobox.com>
> Date: Fri, 08 Aug 2008 18:15:41 -0400
>
>> [...]
>
> I agree with everything Jeff has stated.
>
> Also, I find it ironic that the port abduction is being asked for in
> order to be "compatible with existing tools", yet in fact this stuff
> breaks everything. You can't netfilter this traffic, you can't apply
> qdiscs to it, you can't execute TC actions on them, you can't do
> segmentation offload on them, you can't look for the usual TCP MIB
> statistics on the connection, etc. etc. etc.
>
> It is broken from every possible angle.

I think a lot of these _could_ be implemented and integrated with the
standard tools.
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: Roland Dreier @ 2008-08-10 5:14 UTC
To: David Miller
Cc: jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel

> Also, I find it ironic that the port abduction is being asked for in
> order to be "compatible with existing tools", yet in fact this stuff
> breaks everything. You can't netfilter this traffic, you can't apply
> qdiscs to it, you can't execute TC actions on them, you can't do
> segmentation offload on them, you can't look for the usual TCP MIB
> statistics on the connection, etc. etc. etc.

We already support offloads that break other features, e.g. large
receive offload breaks forwarding. We deal with it. I'm sure if we
thought about it we could come up with clean ways to fix some of the
issues you raise, and just disable the offload if someone wanted to use
a feature we can't support.

 - R.
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: David Miller @ 2008-08-10 5:47 UTC
To: rdreier
Cc: jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel

From: Roland Dreier <rdreier@cisco.com>
Date: Sat, 09 Aug 2008 22:14:11 -0700

> > Also, I find it ironic that the port abduction is being asked for in
> > order to be "compatible with existing tools", yet in fact this stuff
> > breaks everything. [...]
>
> We already support offloads that break other features, e.g. large
> receive offload breaks forwarding. We deal with it.

We turn it off. If I want to shape or filter one of these iSCSI
connections can we turn it off?

It's funny you mention LRO, because it probably gives most of whatever
gain these special iSCSI TCP connection offload things get.
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: Herbert Xu @ 2008-08-10 6:34 UTC
To: David Miller
Cc: rdreier, jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel

David Miller <davem@davemloft.net> wrote:
>
>> We already support offloads that break other features, e.g. large
>> receive offload breaks forwarding. We deal with it.
>
> We turn it off. If I want to shape or filter one of these iSCSI
> connections can we turn it off?

Actually, one of my TODO items is to restructure software LRO so that we
preserve the original packet headers while aggregating the packets. That
would allow us to easily refragment them on output for forwarding.

In other words, LRO (at least the software variant) is not fundamentally
incompatible with forwarding.

I'd also like to encourage all hardware manufacturers considering LRO
support to provide a way for us to access the original headers so that
it doesn't have to be turned off for forwarding.

Cheers,
Herbert
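A rough sketch of the header-preserving aggregation Herbert describes (purely illustrative; these are not the actual Linux LRO/GRO structures, and all names are made up): each merged segment keeps a verbatim copy of its original headers, so local delivery can consume one big buffer while the forwarding path re-emits wire-identical frames.

/* Illustrative only -- not real kernel data structures. */
#include <string.h>

#define LRO_MAX_HDR  64   /* enough for Ethernet + IPv4 + TCP w/ options */
#define LRO_MAX_SEGS 32

struct lro_seg {
	unsigned char hdr[LRO_MAX_HDR]; /* copy of the original headers      */
	unsigned int  hdr_len;
	unsigned int  payload_off;      /* this segment's slice of the       */
	unsigned int  payload_len;      /*   merged payload buffer           */
};

struct lro_aggregate {
	struct lro_seg seg[LRO_MAX_SEGS];
	unsigned int   nr_segs;
	unsigned char *payload;         /* merged, contiguous payload bytes  */
	unsigned int   payload_len;
};

/* Forwarding path: rebuild original frame i, unchanged, into 'frame'.
 * Returns the frame length. */
static unsigned int
lro_refragment(const struct lro_aggregate *agg, unsigned int i,
	       unsigned char *frame)
{
	const struct lro_seg *s = &agg->seg[i];

	memcpy(frame, s->hdr, s->hdr_len);
	memcpy(frame + s->hdr_len, agg->payload + s->payload_off,
	       s->payload_len);
	return s->hdr_len + s->payload_len;
}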
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: Steve Wise @ 2008-08-10 17:57 UTC
To: David Miller
Cc: rdreier, jgarzik, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel

David Miller wrote:
> From: Roland Dreier <rdreier@cisco.com>
> Date: Sat, 09 Aug 2008 22:14:11 -0700
>
>> [...]
>> We already support offloads that break other features, e.g. large
>> receive offload breaks forwarding. We deal with it.
>
> We turn it off. If I want to shape or filter one of these iSCSI
> connections can we turn it off?

Sure.

Seems to me we _could_ architect this all so that these devices would
have to support a method for the management/admin tools to tweak, and if
nothing else kill, offloaded connections if policy rules change and the
existing connections aren't implementing the policy. IE: if the
offloaded connection doesn't support whatever security or other
facilities the admin requires, then the admin should have the ability to
disable that device.

And of course, some devices will allow doing things like netfilter, qos,
tweaking vlan tags, etc. even on active connections, if the OS
infrastructure is there to hook it all up.

BTW: I think all these offload devices provide MIBs and could be pulled
into the normal management tools.

Steve.
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: Roland Dreier @ 2008-08-11 16:09 UTC
To: David Miller
Cc: jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel

> We turn it off. If I want to shape or filter one of these iSCSI
> connections can we turn it off?

That seems like a reasonable idea to me -- the standard thing to do when
a NIC offload conflicts with something else is to turn off the offload
and fall back to software.

 - R.
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: David Miller @ 2008-08-11 21:09 UTC
To: rdreier
Cc: jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel

From: Roland Dreier <rdreier@cisco.com>
Date: Mon, 11 Aug 2008 09:09:02 -0700

> > We turn it off. If I want to shape or filter one of these iSCSI
> > connections can we turn it off?
>
> That seems like a reasonable idea to me -- the standard thing to do
> when a NIC offload conflicts with something else is to turn off the
> offload and fall back to software.

But as Herbert says, we can make LRO such that turning it off isn't
necessary.

Can we shape the iSCSI offload traffic without turning it off?
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: Roland Dreier @ 2008-08-11 21:37 UTC
To: David Miller
Cc: jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel

> But as Herbert says, we can make LRO such that turning it off
> isn't necessary.
>
> Can we shape the iSCSI offload traffic without turning it off?

Sure... the same way we can ask the HW vendors to keep old headers
around when aggregating for LRO, we can ask HW vendors for hooks for
shaping iSCSI traffic. And the Chelsio TCP speed record seems to show
that they already have pretty sophisticated queueing/shaping in their
current HW.

 - R.
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: David Miller @ 2008-08-11 21:51 UTC
To: rdreier
Cc: jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel

From: Roland Dreier <rdreier@cisco.com>
Date: Mon, 11 Aug 2008 14:37:59 -0700

> > But as Herbert says, we can make LRO such that turning it off
> > isn't necessary.
> >
> > Can we shape the iSCSI offload traffic without turning it off?
>
> Sure... the same way we can ask the HW vendors to keep old headers
> around when aggregating for LRO, we can ask HW vendors for hooks for
> shaping iSCSI traffic. And the Chelsio TCP speed record seems to show
> that they already have pretty sophisticated queueing/shaping in their
> current HW.

You don't get it: you can't add the entire netfilter and qdisc stack
into the silly firmware. And we can't fix bugs there either.
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: Steve Wise @ 2008-08-11 23:20 UTC
To: David Miller
Cc: rdreier, jgarzik, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel

David Miller wrote:
> From: Roland Dreier <rdreier@cisco.com>
> Date: Mon, 11 Aug 2008 09:09:02 -0700
>
>> [...]
>> That seems like a reasonable idea to me -- the standard thing to do
>> when a NIC offload conflicts with something else is to turn off the
>> offload and fall back to software.
>
> But as Herbert says, we can make LRO such that turning it off isn't
> necessary.
>
> Can we shape the iSCSI offload traffic without turning it off?

With Chelsio's product you can do this. Maybe Divy can provide details?
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: Divy Le Ray @ 2008-08-11 23:45 UTC
To: Steve Wise
Cc: David Miller, rdreier, jgarzik, Karen Xie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, Dimitrios Michailidis, Casey Leedom, linux-scsi, linux-kernel

On Monday 11 August 2008 04:20:07 pm Steve Wise wrote:
> David Miller wrote:
> > [...]
> > Can we shape the iSCSI offload traffic without turning it off?
>
> With Chelsio's product you can do this. Maybe Divy can provide details?

The T3 adapter is capable of performing rate control and pacing based on
RTT on a per-connection basis.

Cheers,
Divy
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: David Miller @ 2008-08-12 0:22 UTC
To: swise
Cc: rdreier, jgarzik, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel

From: Steve Wise <swise@opengridcomputing.com>
Date: Mon, 11 Aug 2008 18:20:07 -0500

> David Miller wrote:
> > [...]
> > Can we shape the iSCSI offload traffic without turning it off?
>
> With Chelsio's product you can do this. Maybe Divy can provide details?

When I say shape I mean apply any packet scheduler, any netfilter
module, and any other feature we support.
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: Roland Dreier @ 2008-08-10 5:12 UTC
To: Jeff Garzik
Cc: Steve Wise, davem, Divy Le Ray, Karen Xie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, Dimitrios Michailidis, Casey Leedom, linux-scsi, LKML

> * However, giving the user the ability to co-manage IP addresses means
> hacking up the kernel TCP code and userland tools for this new
> concept, something that I think DaveM would rightly be a bit reluctant
> to do? You are essentially adding a bunch of special case code
> whenever TCP ports are used:
>
>	if (port in list of "magic" TCP ports with special,
>	    hardware-specific behavior)
>		...
>	else
>		do what we've been doing for decades

I think you're arguing against something that no one is actually
pushing. What I'm sure Chelsio and probably other iSCSI offload vendors
would like is a way to make iSCSI (and other) offloads not steal magic
ports but actually hook into the normal infrastructure, so that the
offloaded connections show up in netstat, etc. Having this solution
would be nice not just for TCP offload but also for things like in-band
system management, which currently leads to the same hard-to-diagnose
issues when someone hits the stolen port. And it also would seem to
help "classifier NICs" (Sun Neptune, Solarflare, etc.) where some
traffic might be steered to a userspace TCP stack.

I don't think the proposal of just using a separate MAC and IP for the
iSCSI HBA really works, for two reasons:

 - It doesn't work in theory, because the suggestion (I guess) is that
   the iSCSI HBA has its own MAC and IP and behaves like a separate
   system. But this means that to start with the HBA needs its own ARP,
   ICMP, routing, etc. interface, which means we need some (probably
   new) interface to configure all of this. And then it doesn't work in
   lots of networks; for example, the ethernet jack in my office doesn't
   work without 802.1x authentication, and putting all of that in an
   iSCSI HBA's firmware clearly is crazy (not to mention creating the
   interface to pass 802.1x credentials into the kernel to pass to the
   HBA).

 - It doesn't work in practice because most of the existing NICs that
   are capable of iSCSI offload, e.g. Chelsio and Broadcom as well as 3
   or 4 other vendors, don't handle ARP, ICMP, etc. in the device --
   they need the host system to do it. Which means that either we have
   a separate ARP/ICMP stack for offload adapters (obviously untenable)
   or a separate implementation in each driver (even more untenable),
   or we use the normal stack for the adapter, which seems to force us
   into creating a normal netdev for the iSCSI offload interface, which
   in turn seems to force us to figure out a way for offload adapters
   to coexist with the host stack (assuming of course that we care
   about iSCSI HBAs and/or stuff like NFS/RDMA).

A long time ago, DaveM pointed me at the paper "TCP offload is a dumb
idea whose time has come"
(<http://www.usenix.org/events/hotos03/tech/full_papers/mogul/mogul_html/index.html>),
an interesting paper that argues that this time really is different, and
OS developers need to figure out how transport offload fits in.
As a side note, funnily enough, back in the thread where DaveM mentioned
that paper, Alan Cox said "Take a look at who holds the official
internet land speed record. Its not a TOE using system" -- but at least
as of now, the current record for IPv4 (http://www.internet2.edu/lsr/)
*is* held by a TOE.

I think there are two ways to proceed:

 - Start trying to figure out the best way to support the iSCSI offload
   hardware that's out there. I don't know the perfect answer but I'm
   sure we can figure something out if we make an honest effort.

 - Ignore the issue and let users of iSCSI offload hardware (and iWARP
   and NFS/RDMA etc.) stick to hacky out-of-tree solutions. This pays
   off if stuff like the Intel CRC32C instruction plus faster CPUs (or
   "multithreaded" NICs that use multicore better) makes offload
   irrelevant. However this ignores the fundamental 3X memory bandwidth
   cost of not doing direct placement in the NIC, and risks us being in
   a "well, Solaris has support" situation down the road.

To be honest I think the best thing to do is just to get support for
these iSCSI offload adapters upstream in whatever form we can all agree
on, so that we can see a) whether anyone cares and b) if someone does
care, whether there's some better way to do things.

> ISTR Roland(?) pointing out code that already does a bit of this in
> the IB space... but the point is

Not me... and I don't think that there would be anything like this for
InfiniBand, since IB is a completely different animal that has nothing
to do with TCP/IP. You may be thinking of iWARP (RDMA over TCP/IP), but
actually the current Linux iWARP support completely punts on the issue
of coexisting with the native stack (basically because of a lack of
interest in solving the problems from the netdev side of things), which
leads to nasty issues that show up when things happen to collide. So
far people seem to be coping by using nasty out-of-tree hacks.

 - R.
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: David Miller @ 2008-08-10 5:46 UTC
To: rdreier
Cc: jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel

From: Roland Dreier <rdreier@cisco.com>
Date: Sat, 09 Aug 2008 22:12:07 -0700

> What I'm sure Chelsio and probably other iSCSI offload vendors
> would like is a way to make iSCSI (and other) offloads not steal magic
> ports but actually hook into the normal infrastructure, so that the
> offloaded connections show up in netstat, etc.

Why show these special connections if the user cannot interact with or
shape the stream at all like normal ones?

This whole "make it look normal" argument is entirely bogus, because
none of the standard Linux networking facilities can be applied to these
things.

And I even wonder, these days, if you probably get 90% or more of the
gain these "optimized" iSCSI connections obtain from things like LRO.
And since LRO can be done entirely in software (although stateless HW
assistance helps), it is even a NIC-agnostic performance improvement.
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: Roland Dreier @ 2008-08-11 16:07 UTC
To: David Miller
Cc: jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel

> Why show these special connections if the user cannot interact with or
> shape the stream at all like normal ones?

So that an admin can see what connections are open, so that the stack
doesn't try to reuse the same 4-tuple for another connection, etc., etc.

> And I even wonder, these days, if you probably get 90% or more of the
> gain these "optimized" iSCSI connections obtain from things like LRO.

Yes, that's the question -- are stateless offloads (plus CRC32C in the
CPU etc.) going to give good enough performance that the whole TCP
offload exercise is pointless? The only issue is that I don't see how to
avoid the fundamental 3X increase in memory bandwidth that is chewed up
if the NIC can't do direct placement. (Without direct placement, each
received payload byte is written to memory once by the NIC's DMA, then
read and written again when it is copied to its final destination --
three memory transits instead of one.)

 - R.
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: David Miller @ 2008-08-11 21:08 UTC
To: rdreier
Cc: jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel

From: Roland Dreier <rdreier@cisco.com>
Date: Mon, 11 Aug 2008 09:07:51 -0700

> Yes, that's the question -- are stateless offloads (plus CRC32C in the
> CPU etc.) going to give good enough performance that the whole TCP
> offload exercise is pointless?

This is by definition true, over time. And this has steadfastly proven
itself, over and over again.

That's why we call stateful offloads a point-in-time solution. They are
constantly being obsoleted by time.
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: Roland Dreier @ 2008-08-11 21:39 UTC
To: David Miller
Cc: jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel

> > Yes, that's the question -- are stateless offloads (plus CRC32C in
> > the CPU etc.) going to give good enough performance that the whole
> > TCP offload exercise is pointless?
>
> This is by definition true, over time. And this has steadfastly proven
> itself, over and over again.

By the definition of what?

 - R.
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: David Miller @ 2008-08-11 21:52 UTC
To: rdreier
Cc: jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel

From: Roland Dreier <rdreier@cisco.com>
Date: Mon, 11 Aug 2008 14:39:47 -0700

> > This is by definition true, over time. And this has steadfastly
> > proven itself, over and over again.
>
> By the definition of what?

By the definition of time always advancing forward, CPUs always getting
faster, and memory (albeit more slowly) increasing in speed too.
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: Rick Jones @ 2008-08-11 18:13 UTC
To: David Miller
Cc: rdreier, jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel

David Miller wrote:
> And I even wonder, these days, if you probably get 90% or more of the
> gain these "optimized" iSCSI connections obtain from things like LRO.
> And since LRO can be done entirely in software (although stateless HW
> assistance helps), it is even a NIC-agnostic performance improvement.

Probably depends on whether or not the iSCSI offload solutions are doing
zero-copy receive into the filecache?

rick jones
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: David Miller @ 2008-08-11 21:12 UTC
To: rick.jones2
Cc: rdreier, jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel

From: Rick Jones <rick.jones2@hp.com>
Date: Mon, 11 Aug 2008 11:13:25 -0700

> Probably depends on whether or not the iSCSI offload solutions are
> doing zero-copy receive into the filecache?

That's a data placement issue, which also can be solved with stateless
offloading.
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: Roland Dreier @ 2008-08-11 21:41 UTC
To: David Miller
Cc: rick.jones2, jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel

> > Probably depends on whether or not the iSCSI offload solutions are
> > doing zero-copy receive into the filecache?
>
> That's a data placement issue, which also can be solved with stateless
> offloading.

How can you place iSCSI data properly with only stateless offloads?

 - R.
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: David Miller @ 2008-08-11 21:53 UTC
To: rdreier
Cc: rick.jones2, jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel

From: Roland Dreier <rdreier@cisco.com>
Date: Mon, 11 Aug 2008 14:41:16 -0700

> How can you place iSCSI data properly with only stateless offloads?

By teaching the stateless offload how to parse the iSCSI headers on the
flow and place the data into pages at the correct offsets, such that you
can place the pages hanging off of the SKB directly into the page cache.
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: Divy Le Ray @ 2008-08-12 21:57 UTC
To: David Miller
Cc: rdreier, rick.jones2, jgarzik, Steve Wise, Karen Xie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, Dimitrios Michailidis, Casey Leedom, linux-scsi, linux-kernel

On Monday 11 August 2008 02:53:13 pm David Miller wrote:
> From: Roland Dreier <rdreier@cisco.com>
> Date: Mon, 11 Aug 2008 14:41:16 -0700
>
> > How can you place iSCSI data properly with only stateless offloads?
>
> By teaching the stateless offload how to parse the iSCSI headers on
> the flow and place the data into pages at the correct offsets, such
> that you can place the pages hanging off of the SKB directly into the
> page cache.

Hi Dave,

iSCSI PDUs might span multiple TCP segments; it is unclear to me how to
do placement without keeping some state for the transactions.

In any case, such a stateless solution is not yet designed, whereas
accelerated iSCSI is available now, from us and other companies. The
accelerated iSCSI streams benefit from the performance TOE provides,
outlined in the following third-party papers:

http://www.chelsio.com/assetlibrary/pdf/redhat-chelsio-toe-final_v2.pdf
http://www.chelsio.com/assetlibrary/pdf/RMDS6BNTChelsioRHEL5.pdf

iSCSI is primarily targeted at the data center, where the SW stack's
traffic shaping features might be redundant with specialized equipment.
It should however be possible to integrate security features on a
per-offloaded-connection basis, and TOEs - at least ours :) - are
capable of rate control and traffic shaping.

While CPU and - to a far lesser extent - memory performance improves, so
does ethernet's. 40G and 100G are not too far ahead. It is not obvious
at all that TOE is a point-in-time solution, especially for heavy load
traffic as in a storage environment. It is quite the opposite, actually.

There is room for co-existence of SW-managed traffic and accelerated
traffic. As our submission shows, enabling accelerated iSCSI is not
intrusive code-wise to the stack. The port stealing issue is solved if
we can grab a port from the stack.

Cheers,
Divy
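To illustrate Divy's point about PDUs spanning segments, here is a hedged sketch (not code from any driver; the 48-byte BHS layout with DataSegmentLength in bytes 5-7 follows RFC 3720, but AHS, digests, and 4-byte padding are ignored for brevity): because a TCP segment boundary can fall anywhere in a PDU, the parser must carry mid-PDU progress from one segment to the next, which is exactly the per-flow state in question.

/* Hedged illustration: why iSCSI placement wants per-flow state. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct pdu_state {
	uint8_t  bhs[48];       /* Basic Header Segment under assembly  */
	uint32_t bhs_have;      /* header bytes collected so far        */
	uint32_t payload_left;  /* payload bytes still expected         */
	uint32_t payload_off;   /* placement offset for the next byte   */
};

static void feed_tcp_segment(struct pdu_state *s, const uint8_t *p, size_t len)
{
	while (len) {
		if (s->payload_left == 0) {
			/* (Re)assemble a BHS, possibly split across segments. */
			size_t n = 48 - s->bhs_have;
			if (n > len)
				n = len;
			memcpy(s->bhs + s->bhs_have, p, n);
			s->bhs_have += n;
			p += n;
			len -= n;
			if (s->bhs_have == 48) {
				/* DataSegmentLength: bytes 5-7, big-endian */
				s->payload_left = (s->bhs[5] << 16) |
						  (s->bhs[6] << 8) |
						   s->bhs[7];
				s->payload_off = 0;
				s->bhs_have = 0;
			}
		} else {
			size_t n = len < s->payload_left ? len : s->payload_left;
			/* A real device would DMA these n bytes to the
			 * destination buffer at payload_off -- exactly the
			 * state a purely stateless parser would not have. */
			s->payload_off += n;
			s->payload_left -= n;
			p += n;
			len -= n;
		}
	}
}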
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: David Miller @ 2008-08-12 22:01 UTC
To: divy
Cc: rdreier, rick.jones2, jgarzik, swise, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel

From: Divy Le Ray <divy@chelsio.com>
Date: Tue, 12 Aug 2008 14:57:09 -0700

> iSCSI PDUs might span multiple TCP segments; it is unclear to me how
> to do placement without keeping some state for the transactions.

You keep a flow table with buffer IDs and offsets. The S2IO guys did
something similar for one of their initial LRO implementations.

It's still strictly stateless, and best-effort. Entries can fall out of
the flow cache, which makes upcoming data use new buffers and offsets.

But these are the kinds of tricks you hardware folks should be more than
adequately able to design, rather than me. :-)
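A minimal sketch of the kind of best-effort flow cache Dave describes (all names and sizes are invented for illustration; this is not from the S2IO driver or any other): placement hints live in a small table keyed by the TCP 4-tuple, and correctness never depends on a hit, which is what keeps the scheme "strictly stateless".

/* Invented illustration of a best-effort placement cache. */
#include <stdint.h>
#include <string.h>

#define FLOW_SLOTS 256

struct flow_key {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

struct flow_entry {
	struct flow_key key;
	uint32_t buffer_id;    /* destination buffer for this stream     */
	uint32_t next_offset;  /* where the next in-order byte is placed */
	uint8_t  valid;
};

static struct flow_entry cache[FLOW_SLOTS];
static uint32_t next_buffer_id;

static unsigned int flow_hash(const struct flow_key *k)
{
	return (k->saddr ^ k->daddr ^ k->sport ^ k->dport) % FLOW_SLOTS;
}

/* Called per received segment: on a hit, payload continues at the
 * remembered offset; on a miss (entry evicted, or flow never seen),
 * fall back to a fresh buffer at offset 0.  Upper layers must cope
 * either way, so the cache is an optimization, never a correctness
 * requirement. */
static void place_segment(const struct flow_key *k, uint32_t payload_len)
{
	struct flow_entry *e = &cache[flow_hash(k)];

	if (!e->valid || memcmp(&e->key, k, sizeof(*k)) != 0) {
		e->key = *k;
		e->buffer_id = ++next_buffer_id; /* stand-in for a real allocator */
		e->next_offset = 0;
		e->valid = 1;
	}
	/* hardware would DMA payload_len bytes to (buffer_id, next_offset) */
	e->next_offset += payload_len;
}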
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
From: David Miller @ 2008-08-12 22:02 UTC
To: divy
Cc: rdreier, rick.jones2, jgarzik, swise, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel

From: Divy Le Ray <divy@chelsio.com>
Date: Tue, 12 Aug 2008 14:57:09 -0700

> In any case, such a stateless solution is not yet designed, whereas
> accelerated iSCSI is available now, from us and other companies.

So, WHAT?!

There are TOE pieces of crap out there too. It's strictly not our
problem.

Like Herbert said, this is the TOE discussion all over again. The
results will be the same, and as per our decisions wrt. TOE, history
speaks for itself.
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-12 22:02 ` David Miller @ 2008-08-12 22:21 ` Divy Le Ray 2008-08-13 1:57 ` Herbert Xu 2008-08-13 18:35 ` Vladislav Bolkhovitin 0 siblings, 2 replies; 63+ messages in thread From: Divy Le Ray @ 2008-08-12 22:21 UTC (permalink / raw) To: David Miller Cc: rdreier, rick.jones2, jgarzik, Steve Wise, Karen Xie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, Dimitrios Michailidis, Casey Leedom, linux-scsi, linux-kernel On Tuesday 12 August 2008 03:02:46 pm David Miller wrote: > From: Divy Le Ray <divy@chelsio.com> > Date: Tue, 12 Aug 2008 14:57:09 -0700 > > > In any case, such a stateless solution is not yet designed, whereas > > accelerated iSCSI is available now, from us and other companies. > > So, WHAT?! > > There are TOE pieces of crap out there too. Well, there is demand for accelerated iSCSI out there, which is the driving reason for our driver submission. > > It's strictly not our problem. > > Like Herbert said, this is the TOE discussion all over again. > The results will be the same, and as per our decisions wrt. > TOE, history speaks for itself. Herbert requested some benchmark numbers, and I consequently obliged. Cheers, Divy ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-12 22:21 ` Divy Le Ray @ 2008-08-13 1:57 ` Herbert Xu 2008-08-13 18:35 ` Vladislav Bolkhovitin 1 sibling, 0 replies; 63+ messages in thread From: Herbert Xu @ 2008-08-13 1:57 UTC (permalink / raw) To: Divy Le Ray Cc: davem, rdreier, rick.jones2, jgarzik, swise, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel Divy Le Ray <divy@chelsio.com> wrote: > > Herbert requested some benchmark numbers, and I consequently obliged. Have you posted a hardware-accelerated iSCSI vs. LRO comparison? Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-12 22:21 ` Divy Le Ray 2008-08-13 1:57 ` Herbert Xu @ 2008-08-13 18:35 ` Vladislav Bolkhovitin 2008-08-13 19:29 ` Jeff Garzik 2008-08-13 20:23 ` David Miller 1 sibling, 2 replies; 63+ messages in thread From: Vladislav Bolkhovitin @ 2008-08-13 18:35 UTC (permalink / raw) To: David Miller Cc: open-iscsi, rdreier, rick.jones2, jgarzik, Steve Wise, Karen Xie, netdev, michaelc, daisyc, wenxiong, bhua, Dimitrios Michailidis, Casey Leedom, linux-scsi, linux-kernel Divy Le Ray wrote: > On Tuesday 12 August 2008 03:02:46 pm David Miller wrote: >> From: Divy Le Ray <divy@chelsio.com> >> Date: Tue, 12 Aug 2008 14:57:09 -0700 >> >>> In any case, such a stateless solution is not yet designed, whereas >>> accelerated iSCSI is available now, from us and other companies. >> So, WHAT?! >> >> There are TOE pieces of crap out there too. > > Well, there is demand for accelerated iSCSI out there, which is the driving > reason for our driver submission. As an iSCSI target developer, I'm strongly voting for hardware iSCSI offload. Having the possibility of direct data placement is a *HUGE* performance gain. For example, according to measurements done by one iSCSI-SCST user in a system with an iSCSI initiator and an iSCSI target (with iSCSI-SCST (http://scst.sourceforge.net/target_iscsi.html) running), both with identical modern high speed hardware and 10GbE cards, the _INITIATOR_ is the bottleneck for READs (data transfers from target to initiator). This is because the target sends data in a zero-copy manner, so its CPU is capable of dealing with the load, but on the initiator there are additional data copies from skb's to the page cache and from the page cache to the application. As a result, in the measurements the initiator got near 100% CPU load and only ~500MB/s throughput. The target had ~30% CPU load. For the opposite direction (WRITEs), where there is no application data copy on the target, throughput was ~800MB/s, also with near 100% CPU load, but in this case on the target. The initiator ran Linux with open-iscsi. The test was with real backstorage: the target ran BLOCKIO (direct BIOs to/from backstorage) with a 3ware card. Locally on the target the backstorage was able to provide 900+MB/s for READs and about 1GB/s for WRITEs. The command queue in both cases was sufficiently big to eliminate the link and processing latencies (20-30 outstanding commands). Vlad ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-13 18:35 ` Vladislav Bolkhovitin @ 2008-08-13 19:29 ` Jeff Garzik 2008-08-13 20:13 ` David Miller 2008-08-14 18:24 ` Vladislav Bolkhovitin 1 sibling, 2 replies; 63+ messages in thread From: Jeff Garzik @ 2008-08-13 19:29 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: David Miller, open-iscsi, rdreier, rick.jones2, Steve Wise, Karen Xie, netdev, michaelc, daisyc, wenxiong, bhua, Dimitrios Michailidis, Casey Leedom, linux-scsi, linux-kernel Vladislav Bolkhovitin wrote: > Divy Le Ray wrote: >> On Tuesday 12 August 2008 03:02:46 pm David Miller wrote: >>> From: Divy Le Ray <divy@chelsio.com> >>> Date: Tue, 12 Aug 2008 14:57:09 -0700 >>> >>>> In any case, such a stateless solution is not yet designed, whereas >>>> accelerated iSCSI is available now, from us and other companies. >>> So, WHAT?! >>> >>> There are TOE pieces of crap out there too. >> >> Well, there is demand for accelerated iSCSI out there, which is the >> driving reason for our driver submission. > > As an iSCSI target developer, I'm strongly voting for hardware iSCSI > offload. Having the possibility of direct data placement is a *HUGE* > performance gain. Well, two responses here: * no one is arguing against hardware iSCSI offload. Rather, it is a problem with a specific implementation, one that falsely assumes two independent TCP stacks can co-exist peacefully on the same IP address and MAC. * direct data placement is possible without offloading the entire TCP stack onto a firmware/chip. There is plenty of room for hardware iSCSI offload... Jeff ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-13 19:29 ` Jeff Garzik @ 2008-08-13 20:13 ` David Miller 2008-08-14 18:24 ` Vladislav Bolkhovitin 1 sibling, 0 replies; 63+ messages in thread From: David Miller @ 2008-08-13 20:13 UTC (permalink / raw) To: jgarzik Cc: vst, open-iscsi, rdreier, rick.jones2, swise, kxie, netdev, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel From: Jeff Garzik <jgarzik@pobox.com> Date: Wed, 13 Aug 2008 15:29:55 -0400 > * direct data placement is possible without offloading the entire TCP > stack onto a firmware/chip. I've even described in this thread how that's possible. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-13 19:29 ` Jeff Garzik 2008-08-13 20:13 ` David Miller @ 2008-08-14 18:24 ` Vladislav Bolkhovitin 2008-08-14 21:59 ` Nicholas A. Bellinger 1 sibling, 1 reply; 63+ messages in thread From: Vladislav Bolkhovitin @ 2008-08-14 18:24 UTC (permalink / raw) To: Jeff Garzik Cc: David Miller, open-iscsi, rdreier, rick.jones2, Steve Wise, Karen Xie, netdev, michaelc, daisyc, wenxiong, bhua, Dimitrios Michailidis, Casey Leedom, linux-scsi, linux-kernel Jeff Garzik wrote: > Vladislav Bolkhovitin wrote: >> Divy Le Ray wrote: >>> On Tuesday 12 August 2008 03:02:46 pm David Miller wrote: >>>> From: Divy Le Ray <divy@chelsio.com> >>>> Date: Tue, 12 Aug 2008 14:57:09 -0700 >>>> >>>>> In any case, such a stateless solution is not yet designed, whereas >>>>> accelerated iSCSI is available now, from us and other companies. >>>> So, WHAT?! >>>> >>>> There are TOE pieces of crap out there too. >>> Well, there is demand for accelerated iSCSI out there, which is the >>> driving reason for our driver submission. >> As an iSCSI target developer, I'm strongly voting for hardware iSCSI >> offload. Having the possibility of direct data placement is a *HUGE* >> performance gain. > > Well, two responses here: > > * no one is arguing against hardware iSCSI offload. Rather, it is a > problem with a specific implementation, one that falsely assumes two > independent TCP stacks can co-exist peacefully on the same IP address > and MAC. > > * direct data placement is possible without offloading the entire TCP > stack onto a firmware/chip. > > There is plenty of room for hardware iSCSI offload... Sure, nobody is arguing against that. My points are: 1. All those are things not for the near future. I don't think it can be implemented in less than a year's time, but there is a huge demand for high speed and low CPU overhead iSCSI _now_. Nobody's satisfied by the fact that with the latest high end hardware one can saturate a 10GbE link to less than 50%(!). Additionally, for me, as an iSCSI target developer, it looks especially annoying that hardware requirements for _clients_ (initiators) are significantly higher than for the _server_ (target). This situation looks like nonsense to me. 2. I believe that the iSCSI/TCP pair is a sufficiently heavyweight protocol to be completely offloaded to hardware. Partial offloads will never make it comparably efficient. It would still consume a lot of CPU. For example, consider digests. Even if they are computed by the new CRC32C instruction, the computation would still need a chunk of CPU power, I think at least as much as copying the computed block to a new location. Can we save it? Sure, with hardware offload. The additional CPU load can be acceptable if only data are transferred and there are no other activities, but in real life this is quite rare. Consider, for instance, a VPS server, like VMware. It always lacks CPU power, and 30% CPU load during data transfers makes a huge difference. Another example is a target doing some processing of the transferred data, like encryption or de-duplication. Note, I'm not advocating this particular cxgb3 driver. I have not examined it closely enough and don't have sufficient knowledge about the hardware to judge it. But I'm advocating the concept of full offload HBAs, because they provide a real gain, which IMHO can't be reached by any partial offloads. Actually, in the Fibre Channel world the entire FC protocol has been implemented in hardware from the very beginning, and everybody has been happy with that. Now FCoE is coming, which means that the Linux kernel is going to have a big chunk of the FC protocol implemented in software. Then, hopefully, nobody would declare all existing FC cards crap and force FC vendors to redesign their hardware to use the Linux FC implementation and make partial offloads for it? ;) Instead, several implementations would live in peace. The situation is the same with iSCSI. What we need is only to find an acceptable way for two TCP implementations to coexist. Then iSCSI on 10GbE hardware would have good chances to outperform 8Gbps FC in both performance and CPU efficiency. Vlad ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-14 18:24 ` Vladislav Bolkhovitin @ 2008-08-14 21:59 ` Nicholas A. Bellinger 0 siblings, 0 replies; 63+ messages in thread From: Nicholas A. Bellinger @ 2008-08-14 21:59 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Jeff Garzik, David Miller, open-iscsi, rdreier, rick.jones2, Steve Wise, Karen Xie, netdev, michaelc, daisyc, wenxiong, bhua, Dimitrios Michailidis, Casey Leedom, linux-scsi, linux-kernel On Thu, 2008-08-14 at 22:24 +0400, Vladislav Bolkhovitin wrote: > Jeff Garzik wrote: > > Vladislav Bolkhovitin wrote: > >> Divy Le Ray wrote: > >>> On Tuesday 12 August 2008 03:02:46 pm David Miller wrote: > >>>> From: Divy Le Ray <divy@chelsio.com> > >>>> Date: Tue, 12 Aug 2008 14:57:09 -0700 > >>>> > >>>>> In any case, such a stateless solution is not yet designed, whereas > >>>>> accelerated iSCSI is available now, from us and other companies. > >>>> So, WHAT?! > >>>> > >>>> There are TOE pieces of crap out there too. > >>> Well, there is demand for accelerated iSCSI out there, which is the > >>> driving reason for our driver submission. > >> As an iSCSI target developer, I'm strongly voting for hardware iSCSI > >> offload. Having the possibility of direct data placement is a *HUGE* > >> performance gain. > > > > Well, two responses here: > > > > * no one is arguing against hardware iSCSI offload. Rather, it is a > > problem with a specific implementation, one that falsely assumes two > > independent TCP stacks can co-exist peacefully on the same IP address > > and MAC. > > > > * direct data placement is possible without offloading the entire TCP > > stack onto a firmware/chip. > > > > There is plenty of room for hardware iSCSI offload... > > Sure, nobody is arguing against that. My points are: > > 1. All those are things not for the near future. I don't think it can be > implemented in less than a year's time, but there is a huge demand for > high speed and low CPU overhead iSCSI _now_. Well, the first step wrt this for us software folks is getting the Slicing-by-8 CRC32C algorithm into the kernel. This would be a great benefit for not just traditional iSCSI/TCP, but also the Linux/SCTP and Linux/iWARP software codebases. > Nobody's satisfied by the > fact that with the latest high end hardware one can saturate a 10GbE link > to less than 50%(!). Additionally, for me, as an iSCSI target > developer, it looks especially annoying that hardware requirements for > _clients_ (initiators) are significantly higher than for the _server_ > (target). This situation looks like nonsense to me. I have always found this to be the historical case wrt iSCSI on x86 hardware. The rough estimate was that given identical hardware and network configuration, an iSCSI target talking to a SCSI subsystem layer would be able to handle 2x the throughput compared to an iSCSI initiator, obviously as long as the actual storage could handle it. > 2. I believe that the iSCSI/TCP pair is a sufficiently heavyweight > protocol to be completely offloaded to hardware. Heh, I think the period of designing new ASICs for traditional iSCSI offload is probably slowing, given the actual difficulty of doing so and of competing with software iSCSI on commodity x86 4x & 8x core (8x and 16x thread) microprocessors, with highly efficient software implementations that can do BOTH traditional iSCSI offload (where available) and real-deal, OS independent connection recovery (ErrorRecoveryLevel=2) between multiple stateless iSER iWARP/TCP connections across both hardware *AND* software iWARP RNICs. > Partial offloads will never make it comparably efficient. With traditional iSCSI, I definitely agree on this. With iWARP and iSER, however, I believe the end balance of simplicity is greater for both hardware and software, and allows both to scale more effectively: the simple gain of a framed PDU on top of legacy TCP, with RFC 504[0-4] to determine the offload of received packets into storage subsystem layer memory for eventual hardware DMA, applies across a vast array of Linux supported storage hardware and CPU architectures. > It would still consume a lot of > CPU. For example, consider digests. Even if they are computed by the new > CRC32C instruction, the computation would still need a chunk of CPU power, > I think at least as much as copying the computed block to a new location. > Can we save it? Sure, with hardware offload. So yes, we are talking about quite a few possible cases: I) Traditional iSCSI: 1) Complete hardware offload for legacy HBAs 2) Hybrid of hardware/software As mentioned, reducing application layer checksum overhead for current software implementations is very important for our quickly increasing user base. Using the Slicing-by-8 CRC32C will help the current code, but I think the only other real optimization by network ASIC design folks would be to do something for the traditional iSCSI application layer along the lines of what, say, the e1000 driver does with transport and network layer checksums today. I believe the complexity and time to market considerations of a complete traditional iSCSI offload solution compared to highly optimized software iSCSI on dedicated commodity cores still outweigh the benefit IMHO. Not that I am saying there is no room for improvement over the current set of iSCSI initiator TOEs. Again, I could build a children's fortress from the iSCSI TOEs and their retail boxes that I have collected in my office over the years. I would definitely like to see them running on the LIO production fabric and VHACS bare-metal storage clouds at some point for validation purposes, et al. But as for new designs, this is still a very difficult proposition; I am glad to see it being discussed here. II) iWARP/TCP and iSER 1) Hardware RNIC w/ iWARP/TCP with software iSER 2) Software RNIC w/ iWARP/TCP with software iSER 3) More possible iSER logic in hardware for latency/performance optimizations (We won't know this until #1 and #2 happen) Ahh, now this is the interesting case for scaling a vendor independent IP storage fabric to multiple-port, full duplex 10 Gb/sec fabrics. As this hardware on PCIe gets out (yes, I have some AMSO1100 goodness too, Steve :-), and iSER initiators/targets on iWARP/TCP come online, I believe the common code between the different flavours of implementations will be much larger here. For example, I previously mentioned ERL=2 in the context of traditional iSCSI/iSER. This logic is independent of what RFC 5045 defines for a network fabric capable of direct data placement. I will also make this code independent in lio-target-2.6.git for my upstream work. > The additional CPU load can > be acceptable if only data are transferred and there are no other > activities, but in real life this is quite rare. Consider, for instance, > a VPS server, like VMware. It always lacks CPU power, and 30% CPU load > during data transfers makes a huge difference. Another example is a > target doing some processing of the transferred data, like encryption or > de-duplication. Well, I think a lot of this depends on hardware. For example, there is the X3100 adapter from Neterion today that can do 10 Gb/sec line rate with x86_64 virtualization. Obviously, the Linux kernel (and my project, Linux-iSCSI.org) wants to be able to support this in as vendor neutral a manner as possible, which is why we make extensive use of multiple technologies in our production fabrics, and in the VHACS stack. :-) Also, Nested Page Tables would be a big win for this particular case, but I am not familiar with the exact numbers.. > > Actually, in the Fibre Channel world the entire FC protocol has been > implemented in hardware from the very beginning, and everybody has been > happy with that. Now FCoE is coming, which means that the Linux kernel is > going to have a big chunk of the FC protocol implemented in software. Then, > hopefully, nobody would declare all existing FC cards crap and > force FC vendors to redesign their hardware to use the Linux FC implementation > and make partial offloads for it? ;) Instead, several implementations > would live in peace. The situation is the same with iSCSI. What we > need is only to find an acceptable way for two TCP implementations to > coexist. Then iSCSI on 10GbE hardware would have good chances to > outperform 8Gbps FC in both performance and CPU efficiency. > <nod> :-) --nab > Vlad > > -- > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 63+ messages in thread
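To make the digest discussion concrete: the CRC32C that iSCSI uses for header and data digests can be computed bit-by-bit as in the generic reference sketch below; Slicing-by-8 computes the same function with eight precomputed 256-entry tables, consuming eight input bytes per loop iteration instead of one bit. This is an illustration only, not the kernel's code:

#include <stddef.h>
#include <stdint.h>

/* Reference bitwise CRC32C (Castagnoli polynomial, reflected form
 * 0x82F63B78). Correct but slow: one bit per inner-loop step.
 * Slicing-by-8 replaces the inner loop with table lookups, giving
 * the same result at a fraction of the cycles per byte. */
uint32_t crc32c(uint32_t crc, const void *data, size_t len)
{
	const uint8_t *p = data;

	crc = ~crc;
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0x82F63B78 & -(crc & 1));
	}
	return ~crc;
}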
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-13 18:35 ` Vladislav Bolkhovitin 2008-08-13 19:29 ` Jeff Garzik @ 2008-08-13 20:23 ` David Miller 2008-08-14 18:27 ` Vladislav Bolkhovitin 1 sibling, 1 reply; 63+ messages in thread From: David Miller @ 2008-08-13 20:23 UTC (permalink / raw) To: vst Cc: open-iscsi, rdreier, rick.jones2, jgarzik, swise, kxie, netdev, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel From: Vladislav Bolkhovitin <vst@vlnb.net> Date: Wed, 13 Aug 2008 22:35:34 +0400 > This is because the target sends data in a zero-copy manner, so its > CPU is capable of dealing with the load, but on the initiator there are > additional data copies from skb's to the page cache and from the page cache > to the application. If you've actually been reading at all what I've been saying in this thread you'll see that I've described a method to do this copy avoidance in a completely stateless manner. You don't need to implement a TCP stack in the card in order to do data placement optimizations. They can be done completely stateless. Also, large portions of the CPU overhead are transactional costs, which are significantly reduced by existing technologies such as LRO. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-13 20:23 ` David Miller @ 2008-08-14 18:27 ` Vladislav Bolkhovitin 2008-08-14 18:30 ` Vladislav Bolkhovitin 0 siblings, 1 reply; 63+ messages in thread From: Vladislav Bolkhovitin @ 2008-08-14 18:27 UTC (permalink / raw) To: David Miller Cc: open-iscsi, rdreier, rick.jones2, jgarzik, swise, kxie, netdev, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1392 bytes --] David Miller wrote: > From: Vladislav Bolkhovitin <vst@vlnb.net> > Date: Wed, 13 Aug 2008 22:35:34 +0400 > >> This is because the target sends data in a zero-copy manner, so its >> CPU is capable of dealing with the load, but on the initiator there are >> additional data copies from skb's to the page cache and from the page cache >> to the application. > > If you've actually been reading at all what I've been saying in this > thread you'll see that I've described a method to do this copy > avoidance in a completely stateless manner. > > You don't need to implement a TCP stack in the card in order to do > data placement optimizations. They can be done completely stateless. Sure, I read what you wrote before writing (although, frankly, I didn't get the idea). But I don't think that overall it would be as efficient as full hardware offload. See my reply to Jeff Garzik about that. > Also, large portions of the CPU overhead are transactional costs, > which are significantly reduced by existing technologies such as > LRO. The test used Myricom Myri-10G cards (myri10ge driver), which support LRO. And from the ethtool -S output I conclude it was enabled. Just in case, I attached it, so you can recheck me. Thus, apparently, LRO doesn't make a fundamental difference. Maybe this particular implementation isn't too efficient, I don't know. I don't have enough information for that. Vlad [-- Attachment #2: ethtool_initiator.txt --] [-- Type: text/plain, Size: 1498 bytes --] NIC statistics: rx_packets: 471090527 tx_packets: 175404246 rx_bytes: 683684492944 tx_bytes: 636200696592 rx_errors: 0 tx_errors: 0 rx_dropped: 0 tx_dropped: 0 multicast: 0 collisions: 0 rx_length_errors: 0 rx_over_errors: 0 rx_crc_errors: 0 rx_frame_errors: 0 rx_fifo_errors: 0 rx_missed_errors: 0 tx_aborted_errors: 0 tx_carrier_errors: 0 tx_fifo_errors: 0 tx_heartbeat_errors: 0 tx_window_errors: 0 rx_skbs: 0 alloc_order: 0 builtin_fw: 0 napi: 1 tx_boundary: 4096 WC: 2 irq: 1268 MSI: 1 MSIX: 0 read_dma_bw_MBs: 1575 write_dma_bw_MBs: 1375 read_write_dma_bw_MBs: 2406 serial_number: 320283 watchdog_resets: 0 link_changes: 2 link_up: 1 dropped_link_overflow: 0 dropped_link_error_or_filtered: 0 dropped_pause: 0 dropped_bad_phy: 0 dropped_bad_crc32: 0 dropped_unicast_filtered: 0 dropped_multicast_filtered: 0 dropped_runt: 0 dropped_overrun: 0 dropped_no_small_buffer: 0 dropped_no_big_buffer: 479 ----------- slice ---------: 0 tx_pkt_start: 176354843 tx_pkt_done: 176354843 tx_req: 474673372 tx_done: 474673372 rx_small_cnt: 19592127 rx_big_cnt: 462319631 wake_queue: 0 stop_queue: 0 tx_linearized: 0 LRO aggregated: 481899984 LRO flushed: 43071334 LRO avg aggr: 11 LRO no_desc: 0 ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-14 18:27 ` Vladislav Bolkhovitin @ 2008-08-14 18:30 ` Vladislav Bolkhovitin 0 siblings, 0 replies; 63+ messages in thread From: Vladislav Bolkhovitin @ 2008-08-14 18:30 UTC (permalink / raw) To: David Miller Cc: open-iscsi, rdreier, rick.jones2, jgarzik, swise, kxie, netdev, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel Vladislav Bolkhovitin wrote: > David Miller wrote: >> From: Vladislav Bolkhovitin <vst@vlnb.net> >> Date: Wed, 13 Aug 2008 22:35:34 +0400 >> >>> This is because the target sends data in a zero-copy manner, so its >>> CPU is capable of dealing with the load, but on the initiator there are >>> additional data copies from skb's to the page cache and from the page cache >>> to the application. >> If you've actually been reading at all what I've been saying in this >> thread you'll see that I've described a method to do this copy >> avoidance in a completely stateless manner. >> >> You don't need to implement a TCP stack in the card in order to do >> data placement optimizations. They can be done completely stateless. > > Sure, I read what you wrote before writing (although, frankly, I didn't > get the idea). But I don't think that overall it would be as efficient > as full hardware offload. See my reply to Jeff Garzik about that. > >> Also, large portions of the CPU overhead are transactional costs, >> which are significantly reduced by existing technologies such as >> LRO. > > The test used Myricom Myri-10G cards (myri10ge driver), which support > LRO. And from the ethtool -S output I conclude it was enabled. Just in case, > I attached it, so you can recheck me. Also, there wasn't a big difference between MTU 1500 and 9000, which is another indication that LRO was working. > Thus, apparently, LRO doesn't make a fundamental difference. Maybe this > particular implementation isn't too efficient, I don't know. I don't > have enough information for that. > > Vlad > > ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-11 21:53 ` David Miller 2008-08-12 21:57 ` Divy Le Ray @ 2008-08-13 21:27 ` Roland Dreier 2008-08-13 22:08 ` David Miller 1 sibling, 1 reply; 63+ messages in thread From: Roland Dreier @ 2008-08-13 21:27 UTC (permalink / raw) To: David Miller Cc: rick.jones2, jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel > > How can you place iSCSI data properly with only stateless offloads? > By teaching the stateless offload how to parse the iSCSI headers > on the flow and place the data into pages at the correct offsets > such that you can place the pages hanging off of the SKB directly > into the page cache. I don't see how this could work. First, it seems that you have to let the adapter know which connections are iSCSI connections so that it knows when to try and parse iSCSI headers. So you're already not totally stateless. Then, since (AFAIK -- I'm not an expert on iSCSI and especially I'm not an expert on what common practice is for current implementations) the iSCSI PDUs can start at any offset in the TCP stream, I don't see how a stateless adapter can even find the PDU headers to parse -- there's not any way that I know of to recognize where a PDU boundary is without keeping track of the lengths of all the PDUs that go by (ie you need per-connection state). Even if the adapter could find the PDUs, I don't see how it could come up with the correct offset to place the data -- PDUs with response data just carry an opaque tag assigned by the iSCSI initiator. Finally, if there are ways around all of those difficulties, we would still have to do major surgery to our block layer to cope with read requests that complete into random pages, rather than using a scatter list passed into the low-level driver. But I think all this argument is missing the point anyway. The real issue is not hand-waving about what someone might build someday, but how we want to support iSCSI offload with the existing Chelsio, Broadcom, etc adapters. The answer might be, "we don't," but I disagree with that choice because: a. "No upstream support" really ends up being "enterprise distros and customers end up using hacky out-of-tree drivers and blaming us." b. It sends a bad message to vendors who put a lot of effort into writing a clean, mergeable driver and responding to review if the answer is, "Sorry, your hardware is wrong so no driver for you." Maybe the answer is that we just add the iSCSI HBA drivers with no help from the networking stack, and ignore the port collision problem. For iSCSI initiators, it's really not an issue: for a 4-tuple to collide, someone would have to use both offloaded and non-offloaded connections to the same target and be unlucky in the source port chosen. It would be nice to be able to discuss solutions to port collisions, but it may be that this is too emotional an issue for that to be possible. - R. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-13 21:27 ` Roland Dreier @ 2008-08-13 22:08 ` David Miller 2008-08-13 23:03 ` Roland Dreier 0 siblings, 1 reply; 63+ messages in thread From: David Miller @ 2008-08-13 22:08 UTC (permalink / raw) To: rdreier Cc: rick.jones2, jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel From: Roland Dreier <rdreier@cisco.com> Date: Wed, 13 Aug 2008 14:27:50 -0700 > I don't see how this could work. First, it seems that you have to let > the adapter know which connections are iSCSI connections so that it > knows when to try and parse iSCSI headers. It always starts from offset zero for never-seen-before connections. > So you're already not totally stateless. Yes, we are. > Then, since (AFAIK -- I'm not an expert on iSCSI and > especially I'm not an expert on what common practice is for current > implementations) the iSCSI PDUs can start at any offset in the TCP > stream, I don't see how a stateless adapter can even find the PDU > headers to parse -- there's not any way that I know of to recognize > where a PDU boundary is without keeping track of the lengths of all the > PDUs that go by (ie you need per-connection state). Like I said, you retain a "flow cache" (say it a million times, "flow cache") that remembers the current parameters and the buffers currently assigned to that flow and what offset within those buffers. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-13 22:08 ` David Miller @ 2008-08-13 23:03 ` Roland Dreier 2008-08-13 23:12 ` David Miller 0 siblings, 1 reply; 63+ messages in thread From: Roland Dreier @ 2008-08-13 23:03 UTC (permalink / raw) To: David Miller Cc: rick.jones2, jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel > Like I said, you retain a "flow cache" (say it a million times, "flow > cache") that remembers the current parameters and the buffers > currently assigned to that flow and what offset within those buffers. OK, I admit you could make something work -- add hooks for the low-level driver to ask the iSCSI initiator where PDU boundaries are so it can resync when something is evicted from the flow cache, have the initiator format its tags in a special way to encode placement data, etc, etc. The scheme does bring to mind Alan's earlier comment about pigs and propulsion, though. In any case, as I said in the part of my email that you snipped, the real issue is not designing hypothetical hardware, but deciding how to support the Chelsio, Broadcom, etc hardware that exists today. - R. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-13 23:03 ` Roland Dreier @ 2008-08-13 23:12 ` David Miller 2008-08-14 1:26 ` Tom Tucker 0 siblings, 1 reply; 63+ messages in thread From: David Miller @ 2008-08-13 23:12 UTC (permalink / raw) To: rdreier Cc: rick.jones2, jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel From: Roland Dreier <rdreier@cisco.com> Date: Wed, 13 Aug 2008 16:03:15 -0700 > OK, I admit you could make something work -- add hooks for the low-level > driver to ask the iSCSI initiator where PDU boundaries are so it can > resync when something is evicted from the flow cache, have the initiator > format its tags in a special way to encode placement data, etc, etc. > The scheme does bring to mind Alan's earlier comment about pigs and > propulsion, though. There would need to be _NO_ hooks into the iSCSI initiator at all. The card would land the block I/O data onto the necessary page boundaries and the iSCSI code would just be able to thus use the pages directly and as-is. It would look perfectly like normal TCP receive traffic. No hooks, no special cases, nothing like that. > In any case, as I said in the part of my email that you snipped, the > real issue is not designing hypothetical hardware, but deciding how to > support the Chelsio, Broadcom, etc hardware that exists today. The same like we support TOE hardware that exists today. That is, we don't. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-13 23:12 ` David Miller @ 2008-08-14 1:26 ` Tom Tucker 2008-08-14 1:37 ` David Miller 2008-08-14 2:09 ` David Miller 0 siblings, 2 replies; 63+ messages in thread From: Tom Tucker @ 2008-08-14 1:26 UTC (permalink / raw) To: David Miller Cc: rdreier, rick.jones2, jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel David Miller wrote: > From: Roland Dreier <rdreier@cisco.com> > Date: Wed, 13 Aug 2008 16:03:15 -0700 > > >> OK, I admit you could make something work -- add hooks for the low-level >> driver to ask the iSCSI initiator where PDU boundaries are so it can >> resync when something is evicted from the flow cache, have the initiator >> format its tags in a special way to encode placement data, etc, etc. >> The scheme does bring to mind Alan's earlier comment about pigs and >> propulsion, though. >> > > There would need to be _NO_ hooks into the iSCSI initiator at all. > > The card would land the block I/O data onto the necessary page boundaries > and the iSCSI code would just be able to thus use the pages directly > and as-is. > > It would look perfectly like normal TCP receive traffic. No hooks, > no special cases, nothing like that. > > >> In any case, as I said in the part of my email that you snipped, the >> real issue is not designing hypothetical hardware, but deciding how to >> support the Chelsio, Broadcom, etc hardware that exists today. >> > > The same like we support TOE hardware that exists today. That is, we > don't. > > Is there any chance you could discuss exactly how a stateless adapter can determine if a network segment is in-order, next expected, minus productive ack, PAWS compliant, etc... without TCP state? I get how you can optimize "flows", but "flows" are a fancy name for a key (typically the four-tuple) that looks into a TCAM to get the "information" necessary to do header prediction. Can you explain how this "information" somehow doesn't qualify as "state". Doesn't the next expected sequence number at the very least need to be updated? una? etc...? Could you also include the "non-state-full" information necessary to do iSCSI header digest validation, data placement, and marker removal? Thanks, Tom > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-14 1:26 ` Tom Tucker @ 2008-08-14 1:37 ` David Miller 2008-08-14 1:52 ` Steve Wise 2008-08-14 1:57 ` Tom Tucker 2008-08-14 2:09 ` David Miller 1 sibling, 2 replies; 63+ messages in thread From: David Miller @ 2008-08-14 1:37 UTC (permalink / raw) To: tom Cc: rdreier, rick.jones2, jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel From: Tom Tucker <tom@opengridcomputing.com> Date: Wed, 13 Aug 2008 20:26:51 -0500 > Can you explain how this "information" somehow doesn't qualify as > "state". Doesn't the next expected sequence number at the very least > need to be updated? una? etc...? > > Could you also include the "non-state-full" information necessary to do > iSCSI header digest validation, data placement, and marker removal? It's stateless because the full packet traverses the real networking stack and thus can be treated like any other packet. The data placement is a side effect that the networking stack can completely ignore if it chooses to. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-14 1:37 ` David Miller @ 2008-08-14 1:52 ` Steve Wise 2008-08-14 2:05 ` David Miller 2008-08-14 1:57 ` Tom Tucker 1 sibling, 1 reply; 63+ messages in thread From: Steve Wise @ 2008-08-14 1:52 UTC (permalink / raw) To: David Miller Cc: tom, rdreier, rick.jones2, jgarzik, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel David Miller wrote: > From: Tom Tucker <tom@opengridcomputing.com> > Date: Wed, 13 Aug 2008 20:26:51 -0500 > > >> Can you explain how this "information" somehow doesn't qualify as >> "state". Doesn't the next expected sequence number at the very least >> need to be updated? una? etc...? >> >> Could you also include the "non-state-full" information necessary to do >> iSCSI header digest validation, data placement, and marker removal? >> > > It's stateless because the full packet traverses the real networking > stack and thus can be treated like any other packet. > > The data placement is a side effect that the networking stack can > completely ignore if it chooses to. > How do you envision programming such a device? It will need TCP and iSCSI state to have any chance of doing useful and productive placement of data. The smarts about the iSCSI stateless offload hw will be in the device driver, probably the iscsi device driver. How will it gather the information from the TCP stack to insert the correct state for a flow into the hw cache? ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-14 1:52 ` Steve Wise @ 2008-08-14 2:05 ` David Miller 2008-08-14 2:44 ` Steve Wise 0 siblings, 1 reply; 63+ messages in thread From: David Miller @ 2008-08-14 2:05 UTC (permalink / raw) To: swise Cc: tom, rdreier, rick.jones2, jgarzik, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel From: Steve Wise <swise@opengridcomputing.com> Date: Wed, 13 Aug 2008 20:52:47 -0500 > How do you envision programming such a device? There should be no special programming. > It will need TCP and iSCSI state to have any chance of doing useful > and productive placement of data. The card can see the entire TCP stream, it doesn't need anything more than that. It can parse every packet header, see what kind of data transfer is being requested or responded to, etc. Look, I'm not going to design this whole friggin' thing for you guys. I've stated clearly what the base requirement is, which is that the packet is fully processed by the networking stack and that the card merely does data placement optimizations that the stack can completely ignore if it wants to. You have an entire engine in there that can interpret an iSCSI transport stream, you have the logic to do these kinds of things, and it can be done without managing the connection on the card. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-14 2:05 ` David Miller @ 2008-08-14 2:44 ` Steve Wise 0 siblings, 0 replies; 63+ messages in thread From: Steve Wise @ 2008-08-14 2:44 UTC (permalink / raw) To: David Miller Cc: tom, rdreier, rick.jones2, jgarzik, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel David Miller wrote: > I've stated clearly what the base requirement is, which is that the > packet is fully processed by the networking stack and that the card > merely does data placement optimizations that the stack can completely > ignore if it wants to. > > You have an entire engine in there that can interpret an iSCSI > transport stream, you have the logic to do these kinds of things, > and it can be done without managing the connection on the card. > Thanks for finally stating it clearly. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-14 1:37 ` David Miller 2008-08-14 1:52 ` Steve Wise @ 2008-08-14 1:57 ` Tom Tucker 2008-08-14 2:07 ` David Miller 1 sibling, 1 reply; 63+ messages in thread From: Tom Tucker @ 2008-08-14 1:57 UTC (permalink / raw) To: David Miller Cc: rdreier, rick.jones2, jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel David Miller wrote: > From: Tom Tucker <tom@opengridcomputing.com> > Date: Wed, 13 Aug 2008 20:26:51 -0500 > > >> Can you explain how this "information" somehow doesn't qualify as >> "state". Doesn't the next expected sequence number at the very least >> need to be updated? una? etc...? >> >> Could you also include the "non-state-full" information necessary to do >> iSCSI header digest validation, data placement, and marker removal? >> > > It's stateless because the full packet traverses the real networking > stack and thus can be treated like any other packet. > > The data placement is a side effect that the networking stack can > completely ignore if it chooses to. > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Ok. Maybe we're getting somewhere here ... or at least I am :-) I'm not trying to be pedantic here but let me try and restate what I think you said above: - The "header" traverses the real networking stack - The "payload" is placed either by the hardware if possible or by the native stack if on the exception path - The "header" may aggregate multiple PDU (RSO) - Data ready indications are controlled entirely by the software/real networking stack Thanks, Tom ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-14 1:57 ` Tom Tucker @ 2008-08-14 2:07 ` David Miller 0 siblings, 0 replies; 63+ messages in thread From: David Miller @ 2008-08-14 2:07 UTC (permalink / raw) To: tom Cc: rdreier, rick.jones2, jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel From: Tom Tucker <tom@opengridcomputing.com> Date: Wed, 13 Aug 2008 20:57:08 -0500 > I'm not trying to be pedantic here but let me try and restate what I > think you said above: > > - The "header" traverses the real networking stack > - The "payload" is placed either by by the hardware if possible or by > the native stack if on the exception path > - The "header" may aggregate multiple PDU (RSO) > - Data ready indications are controlled entirely by the software/real > networking stack SKB's can be paged, in fact many devices already work by chopping up lists of pages that the driver gives to the card. NIU is one of several examples. The only difference between what a device like NIU is doing now and what I propose is smart determination of at what offset and into which buffers to do the demarcation. ^ permalink raw reply [flat|nested] 63+ messages in thread
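To make the paged-SKB point concrete, a receive-completion path of the sort being described might attach card-placed payload pages as SKB fragments roughly as sketched below. This is only an illustration against the generic skb helpers; the completion descriptor and its fields are assumptions, not any real driver's structures:

#include <linux/etherdevice.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/string.h>

struct rx_done {			/* hypothetical completion descriptor */
	void *hdr;			/* protocol headers, copied out by the card */
	unsigned int hdr_len;
	struct page *page;		/* page the card landed the payload in */
	unsigned int page_off;		/* placement offset chosen by the card */
	unsigned int payload_len;
};

static void rx_complete(struct net_device *dev, struct rx_done *done)
{
	struct sk_buff *skb = netdev_alloc_skb(dev, done->hdr_len + NET_IP_ALIGN);

	if (!skb)
		return;			/* drop; the card recycles the page */

	skb_reserve(skb, NET_IP_ALIGN);
	/* Headers go in the linear area, so the stack parses a perfectly
	 * ordinary TCP segment; the payload stays page-aligned in a frag. */
	memcpy(skb_put(skb, done->hdr_len), done->hdr, done->hdr_len);
	skb_fill_page_desc(skb, 0, done->page, done->page_off, done->payload_len);
	skb->len      += done->payload_len;
	skb->data_len  = done->payload_len;
	skb->truesize += PAGE_SIZE;

	skb->protocol = eth_type_trans(skb, dev);
	netif_receive_skb(skb);
}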
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-14 1:26 ` Tom Tucker 2008-08-14 1:37 ` David Miller @ 2008-08-14 2:09 ` David Miller 1 sibling, 0 replies; 63+ messages in thread From: David Miller @ 2008-08-14 2:09 UTC (permalink / raw) To: tom Cc: rdreier, rick.jones2, jgarzik, swise, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel From: Tom Tucker <tom@opengridcomputing.com> Date: Wed, 13 Aug 2008 20:26:51 -0500 > Is there any chance you could discuss exactly how a stateless adapter > can determine if a network segment > is in-order, next expected, minus productive ack, PAWS compliant, etc... > without TCP state? If you're getting packets out of order, data placement optimizations are the least of your concerns. In fact this is exactly where we want all of the advanced loss handling algorithms of the Linux TCP stack to get engaged. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-10 5:12 ` Roland Dreier 2008-08-10 5:46 ` David Miller @ 2008-08-10 6:24 ` Herbert Xu 2008-08-10 9:19 ` Alan Cox 2 siblings, 0 replies; 63+ messages in thread From: Herbert Xu @ 2008-08-10 6:24 UTC (permalink / raw) To: Roland Dreier Cc: jgarzik, swise, davem, divy, kxie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, dm, leedom, linux-scsi, linux-kernel Roland Dreier <rdreier@cisco.com> wrote: > > I think there are two ways to proceed: > > - Start trying to figure out the best way to support the iSCSI offload > hardware that's out there. I don't know the perfect answer but I'm > sure we can figure something out if we make an honest effort. > > - Ignore the issue and let users of iSCSI offload hardware (and iWARP > and NFS/RDMA etc) stick to hacky out-of-tree solutions. This pays > off if stuff like the Intel CRC32C instruction plus faster CPUs (or > "multithreaded" NICs that use multicore better) makes offload > irrelevant. However this ignores the fundamental 3X memory bandwidth > cost of not doing direct placement in the NIC, and risks us being in > a "well Solaris has support" situation down the road. We've been here many times before. This is just the same old TOE debate all over again. The fact with TOE is that history has shown that Dave's decision has been spot on. So you're going to have to come up with some really convincing evidence that shows we are all wrong and these TOE-like hardware offload solutions are the only way to go. You can start by collecting solid benchmark numbers that we can all reproduce and look into. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-10 5:12 ` Roland Dreier 2008-08-10 5:46 ` David Miller 2008-08-10 6:24 ` Herbert Xu @ 2008-08-10 9:19 ` Alan Cox 2008-08-10 12:49 ` Jeff Garzik 2 siblings, 1 reply; 63+ messages in thread From: Alan Cox @ 2008-08-10 9:19 UTC (permalink / raw) To: Roland Dreier Cc: Jeff Garzik, Steve Wise, davem, Divy Le Ray, Karen Xie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, Dimitrios Michailidis, Casey Leedom, linux-scsi, LKML > - It doesn't work in theory, because the suggestion (I guess) is that > the iSCSI HBA has its own MAC and IP and behaves like a separate The iSCSI HBA is its own system - that is the root of the problem. > system. But this means that to start with the HBA needs its own ARP, > ICMP, routing, etc interface, which means we need some (probably new) > interface to configure all of this. And then it doesn't work in lots It's another system so surely SNMP ;) More seriously, I do think iSCSI is actually a subtly special case of TOE. Most TOE disintegrates under carefully chosen "malicious" workloads because of the way it is optimised, and the lack of security integration can be very, very dangerous. A pure iSCSI connection is generally private, single purpose, and really is the classic application of "pigs fly given enough thrust" - which is the only way to make the pig in question (iSCSI) work properly. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-10 9:19 ` Alan Cox @ 2008-08-10 12:49 ` Jeff Garzik 2008-08-10 14:54 ` James Bottomley 0 siblings, 1 reply; 63+ messages in thread From: Jeff Garzik @ 2008-08-10 12:49 UTC (permalink / raw) To: Alan Cox Cc: Roland Dreier, Steve Wise, davem, Divy Le Ray, Karen Xie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, Dimitrios Michailidis, Casey Leedom, linux-scsi, LKML Alan Cox wrote: >> - It doesn't work in theory, because the suggestion (I guess) is that >> the iSCSI HBA has its own MAC and IP and behaves like a separate > > The iSCSI HBA is its own system - that is the root of the problem. Indeed. Just like with TOE, from the net stack's point of view, an iSCSI HBA is essentially a wholly asynchronous remote system [with a really fast communication bus like PCI Express]. As such, the task becomes updating the net stack such that formerly-private resources are now shared with an independent, external system... with all the complexity, additional failure modes, and additional security complications that come along with that. Jeff ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-10 12:49 ` Jeff Garzik @ 2008-08-10 14:54 ` James Bottomley 2008-08-11 16:50 ` Mike Christie 0 siblings, 1 reply; 63+ messages in thread From: James Bottomley @ 2008-08-10 14:54 UTC (permalink / raw) To: Jeff Garzik Cc: Alan Cox, Roland Dreier, Steve Wise, davem, Divy Le Ray, Karen Xie, netdev, open-iscsi, michaelc, daisyc, wenxiong, bhua, Dimitrios Michailidis, Casey Leedom, linux-scsi, LKML On Sun, 2008-08-10 at 08:49 -0400, Jeff Garzik wrote: > Alan Cox wrote: > >> - It doesn't work in theory, because the suggestion (I guess) is that > >> the iSCSI HBA has its own MAC and IP and behaves like a separate > > > > The iSCSI HBA is its own system - that is the root of the problem. > > Indeed. > > Just like with TOE, from the net stack's point of view, an iSCSI HBA is > essentially a wholly asynchronous remote system [with a really fast > communication bus like PCI Express]. > > As such, the task becomes updating the net stack such that > formerly-private resources are now shared with an independent, external > system... with all the complexity, additional failure modes, and > additional security complications that come along with that. What's wrong with making it configurable identically to current software iSCSI? i.e. plumb the thing into the current iscsi transport class so that we use the standard daemon for creating and binding sessions? Then, only once the session is bound do you let your iSCSI TOE stack take over. That way the connection appears to the network as completely normal, because it has an open socket associated with it; and, since the transport class has done the connection login, it even looks like a normal iSCSI connection to the usual tools. iSCSI would manage connection and authentication, so your TOE stack can be simply around the block acceleration piece (i.e. you'd need to get the iscsi daemon to do relogin and things). I would assume net will require some indicator that the opened connection has been subsumed, so it knows not to try to manage it, but other than that I don't see it will need any alteration. The usual tools, like netfilter could even use this information to know the limits of their management. If this model works, we can use it for TOE acceleration of individual applications (rather than the entire TCP stack) on an as needed basis. This is like the port stealing proposal, but since the iSCSI daemon is responsible for maintaining the session, the port isn't completely stolen, just switched to accelerator mode when doing the iSCSI offload. James ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-08-10 14:54 ` James Bottomley @ 2008-08-11 16:50 ` Mike Christie 0 siblings, 0 replies; 63+ messages in thread From: Mike Christie @ 2008-08-11 16:50 UTC (permalink / raw) To: James Bottomley Cc: Jeff Garzik, Alan Cox, Roland Dreier, Steve Wise, davem, Divy Le Ray, Karen Xie, netdev, open-iscsi, daisyc, wenxiong, bhua, Dimitrios Michailidis, Casey Leedom, linux-scsi, LKML James Bottomley wrote: > On Sun, 2008-08-10 at 08:49 -0400, Jeff Garzik wrote: >> Alan Cox wrote: >>>> - It doesn't work in theory, because the suggestion (I guess) is that >>>> the iSCSI HBA has its own MAC and IP and behaves like a separate >>> The iSCSI HBA is its own system - that is the root of the problem. >> Indeed. >> >> Just like with TOE, from the net stack's point of view, an iSCSI HBA is >> essentially a wholly asynchronous remote system [with a really fast >> communication bus like PCI Express]. >> >> As such, the task becomes updating the net stack such that >> formerly-private resources are now shared with an independent, external >> system... with all the complexity, additional failure modes, and >> additional security complications that come along with that. > > What's wrong with making it configurable identically to current software > iSCSI? i.e. plumb the thing into the current iscsi transport class so > that we use the standard daemon for creating and binding sessions? > Then, only once the session is bound do you let your iSCSI TOE stack > take over. > > That way the connection appears to the network as completely normal, > because it has an open socket associated with it; and, since the > transport class has done the connection login, it even looks like a > normal iSCSI connection to the usual tools. iSCSI would manage > connection and authentication, so your TOE stack can be simply around > the block acceleration piece (i.e. you'd need to get the iscsi daemon to > do relogin and things). This is what Chelsio and Broadcom do today, more or less. Chelsio did the socket trick you are proposing. Broadcom went with a different hack. But in the end both hook into the iscsi transport class (the current iscsi transport class works for this today), userspace daemon and tools, so that the iscsi daemon handles iscsi login, iscsi authentication and all other iscsi operations, like it does for software iscsi. > > I would assume net will require some indicator that the opened > connection has been subsumed, so it knows not to try to manage it, but > other than that I don't see it will need any alteration. The usual > tools, like netfilter could even use this information to know the limits > of their management. > > If this model works, we can use it for TOE acceleration of individual > applications (rather than the entire TCP stack) on an as needed basis. > > This is like the port stealing proposal, but since the iSCSI daemon is > responsible for maintaining the session, the port isn't completely > stolen, just switched to accelerator mode when doing the iSCSI offload. > ^ permalink raw reply [flat|nested] 63+ messages in thread
* RE: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator 2008-07-30 19:35 ` [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator Jeff Garzik 2008-07-30 21:35 ` Roland Dreier @ 2008-07-31 1:24 ` Karen Xie 2008-07-31 12:45 ` Boaz Harrosh 1 sibling, 1 reply; 63+ messages in thread From: Karen Xie @ 2008-07-31 1:24 UTC (permalink / raw) To: Jeff Garzik Cc: netdev, open-iscsi, davem, michaelc, Steve Wise, rdreier, daisyc, wenxiong, bhua, Divy Le Ray, Dimitrios Michailidis, Casey Leedom, linux-scsi, LKML >Comments: > >* SCSI drivers should be submitted via the linux-scsi@vger.kernel.org >mailing list. Will do that. Thanks. > >* The driver is clean and readable, well done > >* From a networking standpoint, our main concern becomes how this >interacts with the networking stack. In particular, I'm concerned based >on reading the source that this driver uses "TCP port stealing" rather >than using a totally separate MAC address (and IP). > >Stealing a TCP port on an IP/interface already assigned is a common >solution in this space, but also a flawed one. Precisely because the >kernel and applications are unaware of this "special, magic TCP port" >you open the potential for application problems that are very difficult >for an admin to diagnose based on observed behavior. Collisions between the host stack and the iSCSI offload are unlikely because the iSCSI target server's port is unique (nailed down as 3260). If an offload card is plugged in, all iSCSI connections to a given target (i.e., destination/port) are offloaded. There is precedent for this approach, such as RDMA/iWARP. > >So, additional information on your TCP port usage would be greatly >appreciated. Also, how does this interact with IPv6? Clearly it >interacts with IPv4... Currently, IPv6 connection requests will not be honored; I will make sure the checking is added in the resubmission. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator
  2008-07-31  1:24   ` Karen Xie
@ 2008-07-31 12:45     ` Boaz Harrosh
  0 siblings, 0 replies; 63+ messages in thread
From: Boaz Harrosh @ 2008-07-31 12:45 UTC (permalink / raw)
  To: open-iscsi
  Cc: Jeff Garzik, netdev, davem, michaelc, Steve Wise, rdreier,
	daisyc, wenxiong, bhua, Divy Le Ray, Dimitrios Michailidis,
	Casey Leedom, linux-scsi, LKML

Karen Xie wrote:
>> Comments:
>>
>> * SCSI drivers should be submitted via the linux-scsi@vger.kernel.org
>> mailing list.
>
> Will do that. Thanks.
>
>> * The driver is clean and readable, well done
>>
>> * From a networking standpoint, our main concern becomes how this
>> interacts with the networking stack. In particular, I'm concerned based
>> on reading the source that this driver uses "TCP port stealing" rather
>> than using a totally separate MAC address (and IP).
>>
>> Stealing a TCP port on an IP/interface already assigned is a common
>> solution in this space, but also a flawed one. Precisely because the
>> kernel and applications are unaware of this "special, magic TCP port"
>> you open the potential for application problems that are very difficult
>> for an admin to diagnose based on observed behavior.
>
> Collisions between the host stack and iSCSI offload are unlikely
> because the iSCSI target server's port is well known (nailed down as
> 3260). If an offload card is plugged in, all iSCSI connections to a
> given target (i.e., destination address/port) are offloaded. There is
> precedent for this approach, such as RDMA/iWARP.
>

Please note that all SW iSCSI targets I know of let you change the
default 3260 port to whatever you want. Is that supported?

Jeff, is there a way for the user-mode daemon to reserve the port
beforehand so it will appear to be taken?

>> So, additional information on your TCP port usage would be greatly
>> appreciated. Also, how does this interact with IPv6? Clearly it
>> interacts with IPv4...
>
> Currently, IPv6 connection requests will not be honored; I will make
> sure that check is added in the resubmission.
>

Boaz

^ permalink raw reply	[flat|nested] 63+ messages in thread
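Boaz's reservation idea needs no new kernel interface at all, which is
worth spelling out. A userspace sketch, assuming the daemon learns the
HBA's local address and source port by some driver-specific means (the
local_ip and port parameters here are placeholders):

	#include <arpa/inet.h>
	#include <netinet/in.h>
	#include <string.h>
	#include <sys/socket.h>
	#include <unistd.h>

	/* Reserve a local TCP port by binding a socket to it and keeping
	 * the descriptor open; the host stack will then neither hand the
	 * port to other applications nor pick it as an ephemeral source
	 * port.  Returns the fd (the reservation lives as long as the fd
	 * stays open) or -1 on error. */
	static int reserve_port(const char *local_ip, unsigned short port)
	{
		struct sockaddr_in addr;
		int fd = socket(AF_INET, SOCK_STREAM, 0);

		if (fd < 0)
			return -1;

		memset(&addr, 0, sizeof(addr));
		addr.sin_family = AF_INET;
		addr.sin_port = htons(port);
		if (inet_pton(AF_INET, local_ip, &addr.sin_addr) != 1 ||
		    bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
			close(fd);
			return -1;
		}
		/* Deliberately no listen() or connect(): the bind alone
		 * marks the port as taken. */
		return fd;
	}

The caveat is that this answers only half of Jeff's original complaint:
other applications can no longer collide with the stolen port, but the
host stack still knows nothing about the HBA's live connection on that
4-tuple, so firewalls and diagnostics remain blind to it.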
end of thread, other threads: [~2008-08-14 21:59 UTC | newest]
Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <200807300019.m6U0JkdY012558@localhost.localdomain>
2008-07-30 19:35 ` [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator Jeff Garzik
2008-07-30 21:35 ` Roland Dreier
2008-08-01 0:51 ` Divy Le Ray
2008-08-07 18:45 ` Divy Le Ray
2008-08-07 20:07 ` Mike Christie
2008-08-08 18:09 ` Steve Wise
2008-08-08 22:15 ` Jeff Garzik
2008-08-08 22:20 ` Jeff Garzik
2008-08-09 7:28 ` David Miller
2008-08-09 14:04 ` Steve Wise
2008-08-10 5:14 ` Roland Dreier
2008-08-10 5:47 ` David Miller
2008-08-10 6:34 ` Herbert Xu
2008-08-10 17:57 ` Steve Wise
2008-08-11 16:09 ` Roland Dreier
2008-08-11 21:09 ` David Miller
2008-08-11 21:37 ` Roland Dreier
2008-08-11 21:51 ` David Miller
2008-08-11 23:20 ` Steve Wise
2008-08-11 23:45 ` Divy Le Ray
2008-08-12 0:22 ` David Miller
2008-08-10 5:12 ` Roland Dreier
2008-08-10 5:46 ` David Miller
2008-08-11 16:07 ` Roland Dreier
2008-08-11 21:08 ` David Miller
2008-08-11 21:39 ` Roland Dreier
2008-08-11 21:52 ` David Miller
2008-08-11 18:13 ` Rick Jones
2008-08-11 21:12 ` David Miller
2008-08-11 21:41 ` Roland Dreier
2008-08-11 21:53 ` David Miller
2008-08-12 21:57 ` Divy Le Ray
2008-08-12 22:01 ` David Miller
2008-08-12 22:02 ` David Miller
2008-08-12 22:21 ` Divy Le Ray
2008-08-13 1:57 ` Herbert Xu
2008-08-13 18:35 ` Vladislav Bolkhovitin
2008-08-13 19:29 ` Jeff Garzik
2008-08-13 20:13 ` David Miller
2008-08-14 18:24 ` Vladislav Bolkhovitin
2008-08-14 21:59 ` Nicholas A. Bellinger
2008-08-13 20:23 ` David Miller
2008-08-14 18:27 ` Vladislav Bolkhovitin
2008-08-14 18:30 ` Vladislav Bolkhovitin
2008-08-13 21:27 ` Roland Dreier
2008-08-13 22:08 ` David Miller
2008-08-13 23:03 ` Roland Dreier
2008-08-13 23:12 ` David Miller
2008-08-14 1:26 ` Tom Tucker
2008-08-14 1:37 ` David Miller
2008-08-14 1:52 ` Steve Wise
2008-08-14 2:05 ` David Miller
2008-08-14 2:44 ` Steve Wise
2008-08-14 1:57 ` Tom Tucker
2008-08-14 2:07 ` David Miller
2008-08-14 2:09 ` David Miller
2008-08-10 6:24 ` Herbert Xu
2008-08-10 9:19 ` Alan Cox
2008-08-10 12:49 ` Jeff Garzik
2008-08-10 14:54 ` James Bottomley
2008-08-11 16:50 ` Mike Christie
2008-07-31 1:24 ` Karen Xie
2008-07-31 12:45 ` Boaz Harrosh