* TSO and IPoIB performance degradation @ 2006-03-06 22:34 Michael S. Tsirkin 2006-03-06 22:40 ` David S. Miller 2006-03-06 22:50 ` Stephen Hemminger 0 siblings, 2 replies; 35+ messages in thread From: Michael S. Tsirkin @ 2006-03-06 22:34 UTC (permalink / raw) To: netdev, openib-general, Linux Kernel Mailing List, David S. Miller, Matt Leininger Hello, Dave! As you might know, the TSO patches merged into the mainline kernel since 2.6.11 have hurt performance for the simple (non-TSO) high-speed netdevice that is the IPoIB driver. This was discussed at length here http://openib.org/pipermail/openib-general/2005-October/012271.html I'm trying to figure out what can be done to improve the situation. In particular, I'm looking at the Super TSO patch http://oss.sgi.com/archives/netdev/2005-05/msg00889.html merged into mainline here http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=314324121f9b94b2ca657a494cf2b9cb0e4a28cc There, you said: When we do ucopy receive (ie. copying directly to userspace during tcp input processing) we attempt to delay the ACK until cleanup_rbuf() is invoked. Most of the time this technique works very well, and we emit one ACK advertising the largest window. But this explodes if the ucopy prequeue is large enough. When the receiver is cpu limited and TSO frames are large, the receiver is inundated with ucopy processing, such that the ACK comes out very late. Often, this is so late that by the time the sender gets the ACK the window has emptied too much to be kept full by the sender. The existing TSO code mostly avoided this by keeping the TSO packets no larger than 1/8 of the available window. But with the new code we can get much larger TSO frames. So I'm trying to get a handle on it: could a solution be to simply look at the frame size, and call tcp_send_delayed_ack if the frame size is no larger than 1/8 of the window? Does this make sense? Thanks, -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies ^ permalink raw reply [flat|nested] 35+ messages in thread
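To make the proposal above concrete, here is a minimal sketch of the kind of check being suggested, assuming the 2.6-era TCP field names that are quoted later in this thread (tp->rcv_nxt, tp->rcv_wup, inet_csk(sk)->icsk_ack.rcv_mss). The helper name, the exact threshold, and the place it would be called from are illustrative assumptions, not existing kernel code:

#include <net/tcp.h>

/* Illustrative sketch only, not existing kernel code: decide whether an
 * incoming frame is "small" relative to the receive window, in the spirit
 * of the old 1/8-of-the-window TSO rule mentioned above.  Field names
 * follow the 2.6-era TCP stack; the helper and threshold are assumptions.
 */
static int frame_is_small_relative_to_window(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	unsigned int rcv_mss = inet_csk(sk)->icsk_ack.rcv_mss;

	/* "No larger than 1/8 of the available window" */
	return rcv_mss <= tp->rcv_wnd / 8;
}

/* The idea would then be, somewhere in the receiver's ACK decision path:
 *
 *	if (frame_is_small_relative_to_window(sk))
 *		tcp_send_delayed_ack(sk);
 *	else
 *		tcp_send_ack(sk);
 */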
* Re: TSO and IPoIB performance degradation 2006-03-06 22:34 TSO and IPoIB performance degradation Michael S. Tsirkin @ 2006-03-06 22:40 ` David S. Miller 2006-03-06 22:50 ` Stephen Hemminger 1 sibling, 0 replies; 35+ messages in thread From: David S. Miller @ 2006-03-06 22:40 UTC (permalink / raw) To: mst; +Cc: netdev, linux-kernel, openib-general From: "Michael S. Tsirkin" <mst@mellanox.co.il> Date: Tue, 7 Mar 2006 00:34:38 +0200 > So I'm trying to get a handle on it: could a solution be to simply > look at the frame size, and call tcp_send_delayed_ack from > if the frame size is no larger than 1/8? > > Does this make sense? The comment you mention is very old, and no longer applies. Get full packet traces from the kernel TSO code in the 2.6.x kernel, analyze them, and post here what you think is occurring that is causing the performance problems. One thing to note is that the newer TSO code really needs to have large socket buffers, so you can experiment with that. ^ permalink raw reply [flat|nested] 35+ messages in thread
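On the large-socket-buffer suggestion: besides raising the tcp_rmem/tcp_wmem sysctls (which Matt sweeps later in this thread), the buffers can also be pinned per socket from the benchmark program itself. This is a generic userspace sketch, not taken from the thread; the 4 MB figure is an arbitrary example, and explicit SO_SNDBUF/SO_RCVBUF settings fix the buffer size instead of letting the kernel autotune it:

#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Force large per-socket buffers for a throughput test.  The value is an
 * arbitrary example; call this on the data socket before connect() or
 * listen() so the window scale is negotiated to match.
 */
static int set_big_buffers(int fd, int bytes)
{
	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0 ||
	    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0) {
		perror("setsockopt");
		return -1;
	}
	return 0;
}

/* e.g. set_big_buffers(fd, 4 * 1024 * 1024); */

Setting these before connection setup matters because the TCP window scale option is chosen from the receive buffer size when the connection is established.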
* Re: TSO and IPoIB performance degradation 2006-03-06 22:34 TSO and IPoIB performance degradation Michael S. Tsirkin 2006-03-06 22:40 ` David S. Miller @ 2006-03-06 22:50 ` Stephen Hemminger 2006-03-07 3:13 ` Shirley Ma 1 sibling, 1 reply; 35+ messages in thread From: Stephen Hemminger @ 2006-03-06 22:50 UTC (permalink / raw) To: Michael S. Tsirkin Cc: netdev, Linux Kernel Mailing List, openib-general, David S. Miller On Tue, 7 Mar 2006 00:34:38 +0200 "Michael S. Tsirkin" <mst@mellanox.co.il> wrote: > Hello, Dave! > As you might know, the TSO patches merged into mainline kernel > since 2.6.11 have hurt performance for the simple (non-TSO) > high-speed netdevice that is IPoIB driver. > > This was discussed at length here > http://openib.org/pipermail/openib-general/2005-October/012271.html > > I'm trying to figure out what can be done to improve the situation. > In partucular, I'm looking at the Super TSO patch > http://oss.sgi.com/archives/netdev/2005-05/msg00889.html > > merged into mainline here > > http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=314324121f9b94b2ca657a494cf2b9cb0e4a28cc > > There, you said: > > When we do ucopy receive (ie. copying directly to userspace > during tcp input processing) we attempt to delay the ACK > until cleanup_rbuf() is invoked. Most of the time this > technique works very well, and we emit one ACK advertising > the largest window. > > But this explodes if the ucopy prequeue is large enough. > When the receiver is cpu limited and TSO frames are large, > the receiver is inundated with ucopy processing, such that > the ACK comes out very late. Often, this is so late that > by the time the sender gets the ACK the window has emptied > too much to be kept full by the sender. > > The existing TSO code mostly avoided this by keeping the > TSO packets no larger than 1/8 of the available window. > But with the new code we can get much larger TSO frames. > > So I'm trying to get a handle on it: could a solution be to simply > look at the frame size, and call tcp_send_delayed_ack from > if the frame size is no larger than 1/8? > > Does this make sense? > > Thanks, > > More likely you are getting hit by the fact that TSO prevents the congestion window from increasing properly. This was fixed in 2.6.15 (around mid of Nov 2005). ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Re: TSO and IPoIB performance degradation 2006-03-06 22:50 ` Stephen Hemminger @ 2006-03-07 3:13 ` Shirley Ma 2006-03-07 21:44 ` Matt Leininger 0 siblings, 1 reply; 35+ messages in thread From: Shirley Ma @ 2006-03-07 3:13 UTC (permalink / raw) To: Stephen Hemminger Cc: netdev, Linux Kernel Mailing List, openib-general, openib-general-bounces, David S. Miller > More likely you are getting hit by the fact that TSO prevents the congestion window from increasing properly. This was fixed in 2.6.15 (around mid of Nov 2005). Yep, I noticed the same problem. After updating to the new kernel, the performance is much better, but it's still lower than before. Thanks, Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Re: TSO and IPoIB performance degradation 2006-03-07 3:13 ` Shirley Ma @ 2006-03-07 21:44 ` Matt Leininger 2006-03-07 21:49 ` Stephen Hemminger 0 siblings, 1 reply; 35+ messages in thread From: Matt Leininger @ 2006-03-07 21:44 UTC (permalink / raw) To: Shirley Ma Cc: netdev, Linux Kernel Mailing List, openib-general, David S. Miller, Stephen Hemminger On Mon, 2006-03-06 at 19:13 -0800, Shirley Ma wrote: > > > More likely you are getting hit by the fact that TSO prevents the > congestion > window from increasing properly. This was fixed in 2.6.15 (around mid > of Nov 2005). > > Yep, I noticed the same problem. After updating to the new kernel, the > performance are much better, but it's still lower than before. Here is an updated version of OpenIB IPoIB performance for various kernels with and without one of the TSO patches. The netperf performance for the latest kernels has not improved the TSO performance drop. Any comments or suggestions would be appreciated. - Matt

All benchmarks are with RHEL4 x86_64 with HCA FW v4.7.0
dual EM64T 3.2 GHz PCIe IB HCA (memfull)
patch 1 - remove changeset 314324121f9b94b2ca657a494cf2b9cb0e4a28cc

Kernel                  OpenIB      msi_x   netperf (MB/s)
2.6.16-rc5              in-kernel   1       367
2.6.15                  in-kernel   1       382
2.6.14-rc4 patch 1      in-kernel   1       434
2.6.14-rc4              in-kernel   1       385
2.6.14-rc3              in-kernel   1       374
2.6.13.2                svn3627     1       386
2.6.13.2 patch 1        svn3627     1       446
2.6.13.2                in-kernel   1       394
2.6.13-rc3 patch 12     in-kernel   1       442
2.6.13-rc3 patch 1      in-kernel   1       450
2.6.13-rc3              in-kernel   1       395
2.6.12.5-lustre         in-kernel   1       399
2.6.12.5 patch 1        in-kernel   1       464
2.6.12.5                in-kernel   1       402
2.6.12                  in-kernel   1       406
2.6.12-rc6 patch 1      in-kernel   1       470
2.6.12-rc6              in-kernel   1       407
2.6.12-rc5              in-kernel   1       405
2.6.12-rc5 patch 1      in-kernel   1       474
2.6.12-rc4              in-kernel   1       470
2.6.12-rc3              in-kernel   1       466
2.6.12-rc2              in-kernel   1       469
2.6.12-rc1              in-kernel   1       466
2.6.11                  in-kernel   1       464
2.6.11                  svn3687     1       464
2.6.9-11.ELsmp          svn3513     1       425   (Woody's results, 3.6Ghz EM64T)

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Re: TSO and IPoIB performance degradation 2006-03-07 21:44 ` Matt Leininger @ 2006-03-07 21:49 ` Stephen Hemminger 2006-03-07 21:53 ` Michael S. Tsirkin 2006-03-08 0:11 ` Matt Leininger 0 siblings, 2 replies; 35+ messages in thread From: Stephen Hemminger @ 2006-03-07 21:49 UTC (permalink / raw) To: Matt Leininger Cc: netdev, Linux Kernel Mailing List, openib-general, David S. Miller On Tue, 07 Mar 2006 13:44:51 -0800 Matt Leininger <mlleinin@hpcn.ca.sandia.gov> wrote: > On Mon, 2006-03-06 at 19:13 -0800, Shirley Ma wrote: > > > > > More likely you are getting hit by the fact that TSO prevents the > > congestion > > window from increasing properly. This was fixed in 2.6.15 (around mid > > of Nov 2005). > > > > Yep, I noticed the same problem. After updating to the new kernel, the > > performance are much better, but it's still lower than before. > > Here is an updated version of OpenIB IPoIB performance for various > kernels with and without one of the TSO patches. The netperf > performance for the latest kernels has not improved the TSO performance > drop. > > Any comments or suggestions would be appreciated. > > - Matt Configuration information? like did you increase the tcp_rmem, tcp_wmem? Tcpdump traces of what is being sent and available window? Is IB using NAPI or just doing netif_rx()? ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Re: TSO and IPoIB performance degradation 2006-03-07 21:49 ` Stephen Hemminger @ 2006-03-07 21:53 ` Michael S. Tsirkin 2006-03-08 0:11 ` Matt Leininger 1 sibling, 0 replies; 35+ messages in thread From: Michael S. Tsirkin @ 2006-03-07 21:53 UTC (permalink / raw) To: Stephen Hemminger Cc: netdev, Linux Kernel Mailing List, openib-general, David S. Miller Quoting r. Stephen Hemminger <shemminger@osdl.org>: > Is IB using NAPI or just doing netif_rx()? No, IPoIB doesn't use NAPI. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Re: TSO and IPoIB performance degradation 2006-03-07 21:49 ` Stephen Hemminger 2006-03-07 21:53 ` Michael S. Tsirkin @ 2006-03-08 0:11 ` Matt Leininger 2006-03-08 0:18 ` David S. Miller 1 sibling, 1 reply; 35+ messages in thread From: Matt Leininger @ 2006-03-08 0:11 UTC (permalink / raw) To: Stephen Hemminger Cc: netdev, Linux Kernel Mailing List, openib-general, David S. Miller On Tue, 2006-03-07 at 13:49 -0800, Stephen Hemminger wrote: > On Tue, 07 Mar 2006 13:44:51 -0800 > Matt Leininger <mlleinin@hpcn.ca.sandia.gov> wrote: > > > On Mon, 2006-03-06 at 19:13 -0800, Shirley Ma wrote: > > > > > > > More likely you are getting hit by the fact that TSO prevents the > > > congestion > > > window from increasing properly. This was fixed in 2.6.15 (around mid > > > of Nov 2005). > > > > > > Yep, I noticed the same problem. After updating to the new kernel, the > > > performance are much better, but it's still lower than before. > > > > Here is an updated version of OpenIB IPoIB performance for various > > kernels with and without one of the TSO patches. The netperf > > performance for the latest kernels has not improved the TSO performance > > drop. > > > > Any comments or suggestions would be appreciated. > > > > - Matt > > Configuration information? like did you increase the tcp_rmem, tcp_wmem? > Tcpdump traces of what is being sent and available window? > Is IB using NAPI or just doing netif_rx()? I used the standard setting for tcp_rmem and tcp_wmem. Here are a few other runs that change those variables. I was able to improve performance by ~30MB/s to 403 MB/s, but this is still a ways from the 474 MB/s before the TSO patches. Thanks, - Matt

All benchmarks are with RHEL4 x86_64 with HCA FW v4.7.0
dual EM64T 3.2 GHz PCIe IB HCA (memfull)
patch 1 - remove changeset 314324121f9b94b2ca657a494cf2b9cb0e4a28cc
msi_x=1 for all tests

Kernel       OpenIB      netperf (MB/s)
2.6.16-rc5   in-kernel   403   tcp_wmem 4096  87380 16777216   tcp_rmem 4096  87380 16777216
2.6.16-rc5   in-kernel   395   tcp_wmem 4096 102400 16777216   tcp_rmem 4096 102400 16777216
2.6.16-rc5   in-kernel   392   tcp_wmem 4096  65536 16777216   tcp_rmem 4096  87380 16777216
2.6.16-rc5   in-kernel   394   tcp_wmem 4096 131072 16777216   tcp_rmem 4096 102400 16777216
2.6.16-rc5   in-kernel   377   tcp_wmem 4096 131072 16777216   tcp_rmem 4096 153600 16777216
2.6.16-rc5   in-kernel   377   tcp_wmem 4096 131072 16777216   tcp_rmem 4096 131072 16777216
2.6.16-rc5   in-kernel   353   tcp_wmem 4096 262144 16777216   tcp_rmem 4096 262144 16777216
2.6.16-rc5   in-kernel   305   tcp_wmem 4096 262144 16777216   tcp_rmem 4096 524288 16777216
2.6.16-rc5   in-kernel   303   tcp_wmem 4096 131072 16777216   tcp_rmem 4096 524288 16777216
2.6.16-rc5   in-kernel   290   tcp_wmem 4096 524288 16777216   tcp_rmem 4096 524288 16777216
2.6.16-rc5   in-kernel   367   default tcp values

--------------------
All with standard tcp settings

Kernel                  OpenIB      netperf (MB/s)
2.6.16-rc5              in-kernel   367
2.6.15                  in-kernel   382
2.6.14-rc4 patch 12     in-kernel   436
2.6.14-rc4 patch 1      in-kernel   434
2.6.14-rc4              in-kernel   385
2.6.14-rc3              in-kernel   374
2.6.13.2                svn3627     386
2.6.13.2 patch 1        svn3627     446
2.6.13.2                in-kernel   394
2.6.13-rc3 patch 12     in-kernel   442
2.6.13-rc3 patch 1      in-kernel   450
2.6.13-rc3              in-kernel   395
2.6.12.5-lustre         in-kernel   399
2.6.12.5 patch 1        in-kernel   464
2.6.12.5                in-kernel   402
2.6.12                  in-kernel   406
2.6.12-rc6 patch 1      in-kernel   470
2.6.12-rc6              in-kernel   407
2.6.12-rc5              in-kernel   405
2.6.12-rc5 patch 1      in-kernel   474
2.6.12-rc4              in-kernel   470
2.6.12-rc3              in-kernel   466
2.6.12-rc2              in-kernel   469
2.6.12-rc1              in-kernel   466
2.6.11                  in-kernel   464
2.6.11                  svn3687     464
2.6.9-11.ELsmp          svn3513     425   (Woody's results, 3.6Ghz EM64T)

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Re: TSO and IPoIB performance degradation 2006-03-08 0:11 ` Matt Leininger @ 2006-03-08 0:18 ` David S. Miller 2006-03-08 1:17 ` Roland Dreier 0 siblings, 1 reply; 35+ messages in thread From: David S. Miller @ 2006-03-08 0:18 UTC (permalink / raw) To: mlleinin; +Cc: netdev, linux-kernel, openib-general, shemminger From: Matt Leininger <mlleinin@hpcn.ca.sandia.gov> Date: Tue, 07 Mar 2006 16:11:37 -0800 > I used the standard setting for tcp_rmem and tcp_wmem. Here are a > few other runs that change those variables. I was able to improve > performance by ~30MB/s to 403 MB/s, but this is still a ways from the > 474 MB/s before the TSO patches. How limited are the IPoIB devices, TX descriptor wise? One side effect of the TSO changes is that one extra descriptor will be used for outgoing packets. This is because we have to put the headers as well as the user data, into page based buffers now. Perhaps you can experiment with increasing the transmit descriptor table size, if that's possible. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Re: TSO and IPoIB performance degradation 2006-03-08 0:18 ` David S. Miller @ 2006-03-08 1:17 ` Roland Dreier 2006-03-08 1:23 ` David S. Miller 0 siblings, 1 reply; 35+ messages in thread From: Roland Dreier @ 2006-03-08 1:17 UTC (permalink / raw) To: David S. Miller; +Cc: netdev, linux-kernel, openib-general, shemminger David> How limited are the IPoIB devices, TX descriptor wise? David> One side effect of the TSO changes is that one extra David> descriptor will be used for outgoing packets. This is David> because we have to put the headers as well as the user David> data, into page based buffers now. We have essentially no limit on TX descriptors. However I think there's some confusion about TSO: IPoIB does _not_ do TSO -- generic InfiniBand hardware does not have any TSO capability. In the future we might be able to implement TSO for certain hardware that does have support, but even that requires some firmware help from the HCA vendors, etc. So right now the IPoIB driver does not do TSO. The reason TSO comes up is that reverting the patch described below helps (or helped at some point at least) IPoIB throughput quite a bit. Clearly this was a bug fix so we can't revert it in general but I think what Michael Tsirkin was suggesting at the beginning of this thread is to do what the last paragraph of the changelog says -- find some way to re-enable the trick. diff-tree 3143241... (from e16fa6b...) Author: David S. Miller <davem@davemloft.net> Date: Mon May 23 12:03:06 2005 -0700 [TCP]: Fix stretch ACK performance killer when doing ucopy. When we are doing ucopy, we try to defer the ACK generation to cleanup_rbuf(). This works most of the time very well, but if the ucopy prequeue is large, this ACKing behavior kills performance. With TSO, it is possible to fill the prequeue so large that by the time the ACK is sent and gets back to the sender, most of the window has emptied of data and performance suffers significantly. This behavior does help in some cases, so we should think about re-enabling this trick in the future, using some kind of limit in order to avoid the bug case. - R. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Re: TSO and IPoIB performance degradation 2006-03-08 1:17 ` Roland Dreier @ 2006-03-08 1:23 ` David S. Miller 2006-03-08 1:34 ` Roland Dreier 2006-03-08 12:53 ` Michael S. Tsirkin 0 siblings, 2 replies; 35+ messages in thread From: David S. Miller @ 2006-03-08 1:23 UTC (permalink / raw) To: rdreier; +Cc: netdev, linux-kernel, openib-general, shemminger From: Roland Dreier <rdreier@cisco.com> Date: Tue, 07 Mar 2006 17:17:30 -0800 > The reason TSO comes up is that reverting the patch described below > helps (or helped at some point at least) IPoIB throughput quite a bit. I wish you had started the thread by mentioning this specific patch, we wasted an enormous amount of precious developer time speculating and asking for arbitrary tests to be run in order to narrow down the problem, yet you knew the specific change that introduced the performance regression already... This is a good example of how not to report a bug. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Re: TSO and IPoIB performance degradation 2006-03-08 1:23 ` David S. Miller @ 2006-03-08 1:34 ` Roland Dreier 2006-03-08 12:53 ` Michael S. Tsirkin 1 sibling, 0 replies; 35+ messages in thread From: Roland Dreier @ 2006-03-08 1:34 UTC (permalink / raw) To: David S. Miller; +Cc: netdev, linux-kernel, openib-general, shemminger David> I wish you had started the thread by mentioning this David> specific patch, we wasted an enormous amount of precious David> developer time speculating and asking for arbitrary tests David> to be run in order to narrow down the problem, yet you knew David> the specific change that introduced the performance David> regression already... Sorry, you're right. I was a little confused because I had a memory of Michael's original email (http://lkml.org/lkml/2006/3/6/150) quoting a changelog entry, but looking back at the message, it was quoting something completely different and misleading. I think the most interesting email in the old thread is http://openib.org/pipermail/openib-general/2005-October/012482.html which shows that reverting 314324121 (the "stretch ACK performance killer" fix) gives ~400 Mbit/sec in extra IPoIB performance. - R. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Re: TSO and IPoIB performance degradation 2006-03-08 1:23 ` David S. Miller 2006-03-08 1:34 ` Roland Dreier @ 2006-03-08 12:53 ` Michael S. Tsirkin 2006-03-08 20:53 ` David S. Miller 2006-03-09 23:48 ` David S. Miller 1 sibling, 2 replies; 35+ messages in thread From: Michael S. Tsirkin @ 2006-03-08 12:53 UTC (permalink / raw) To: David S. Miller; +Cc: netdev, rdreier, linux-kernel, openib-general, shemminger Quoting r. David S. Miller <davem@davemloft.net>: > Subject: Re: Re: TSO and IPoIB performance degradation > > From: Roland Dreier <rdreier@cisco.com> > Date: Tue, 07 Mar 2006 17:17:30 -0800 > > > The reason TSO comes up is that reverting the patch described below > > helps (or helped at some point at least) IPoIB throughput quite a bit. > > I wish you had started the thread by mentioning this specific patch Er, since you mention it, the first message in thread did include this link: http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=314324121f9b94b2ca657a494cf2b9cb0e4a28cc and I even pasted the patch description there, but oh well. Now that Roland helped us clear it all up, and now that it has been clarified that reverting this patch gives us back most of the performance, is the answer to my question the same? What I was trying to figure out was, how can we re-enable the trick without hurting TSO? Could a solution be to simply look at the frame size, and call tcp_send_delayed_ack if the frame size is small? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-08 12:53 ` Michael S. Tsirkin @ 2006-03-08 20:53 ` David S. Miller 2006-03-09 23:48 ` David S. Miller 1 sibling, 0 replies; 35+ messages in thread From: David S. Miller @ 2006-03-08 20:53 UTC (permalink / raw) To: mst; +Cc: rdreier, netdev, linux-kernel, openib-general, shemminger From: "Michael S. Tsirkin" <mst@mellanox.co.il> Date: Wed, 8 Mar 2006 14:53:11 +0200 > What I was trying to figure out was, how can we re-enable the trick without > hurting TSO? Could a solution be to simply look at the frame size, and call > tcp_send_delayed_ack if the frame size is small? The problem is that this patch helps performance when the receiver is CPU limited. The old code would delay ACKs forever if the CPU of the receiver was slow, because we'd wait for all received packets to be copied into userspace before spitting out the ACK. This would allow the pipe to empty, since the sender is waiting for ACKs in order to send more into the pipe, and once the ACK did go out it would cause the sender to emit an enormous burst of data. Both of these behaviors are highly frowned upon for a TCP stack. I'll try to look at this some more later today. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-08 12:53 ` Michael S. Tsirkin 2006-03-08 20:53 ` David S. Miller @ 2006-03-09 23:48 ` David S. Miller 2006-03-10 0:10 ` Michael S. Tsirkin 2006-03-10 0:21 ` Rick Jones 1 sibling, 2 replies; 35+ messages in thread From: David S. Miller @ 2006-03-09 23:48 UTC (permalink / raw) To: mst; +Cc: netdev, rdreier, linux-kernel, openib-general, shemminger From: "Michael S. Tsirkin" <mst@mellanox.co.il> Date: Wed, 8 Mar 2006 14:53:11 +0200 > What I was trying to figure out was, how can we re-enable the trick > without hurting TSO? Could a solution be to simply look at the frame > size, and call tcp_send_delayed_ack if the frame size is small? The change is really not related to TSO. By reverting it, you are reducing the number of ACKs on the wire, and the number of context switches at the sender to push out new data. That's why it can make things go faster, but it also leads to bursty TCP sender behavior, which is bad for congestion on the internet. When the receiver has a strong cpu and can keep up with the incoming packet rate very well and we are in an environment with no congestion, the old code helps a lot. But if the receiver is cpu limited or we have congestion of any kind, it does exactly the wrong thing. It will delay ACKs a very long time to the point where the pipe is depleted and this kills performance in that case. For congested environments, due to the decreased ACK feedback, packet loss recovery will be extremely poor. This is the first reason behind my change. The behavior is also specifically frowned upon in the TCP implementor community. It is specifically mentioned in the Known TCP Implementation Problems RFC2525, in section 2.13 "Stretch ACK violation". The entry, quoted below for reference, is very clear on the reasons why stretch ACKs are bad. And although it may help performance for your case, in congested environments and also with cpu limited receivers it will have a negative impact on performance. So, this was the second reason why I made this change. So reverting the change isn't really an option. Name of Problem Stretch ACK violation Classification Congestion Control/Performance Description To improve efficiency (both computer and network) a data receiver may refrain from sending an ACK for each incoming segment, according to [RFC1122]. However, an ACK should not be delayed an inordinate amount of time. Specifically, ACKs SHOULD be sent for every second full-sized segment that arrives. If a second full- sized segment does not arrive within a given timeout (of no more than 0.5 seconds), an ACK should be transmitted, according to [RFC1122]. A TCP receiver which does not generate an ACK for every second full-sized segment exhibits a "Stretch ACK Violation". Significance TCP receivers exhibiting this behavior will cause TCP senders to generate burstier traffic, which can degrade performance in congested environments. In addition, generating fewer ACKs increases the amount of time needed by the slow start algorithm to open the congestion window to an appropriate point, which diminishes performance in environments with large bandwidth-delay products. Finally, generating fewer ACKs may cause needless retransmission timeouts in lossy environments, as it increases the possibility that an entire window of ACKs is lost, forcing a retransmission timeout. Implications When not in loss recovery, every ACK received by a TCP sender triggers the transmission of new data segments. 
The burst size is determined by the number of previously unacknowledged segments each ACK covers. Therefore, a TCP receiver ack'ing more than 2 segments at a time causes the sending TCP to generate a larger burst of traffic upon receipt of the ACK. This large burst of traffic can overwhelm an intervening gateway, leading to higher drop rates for both the connection and other connections passing through the congested gateway. In addition, the TCP slow start algorithm increases the congestion window by 1 segment for each ACK received. Therefore, increasing the ACK interval (thus decreasing the rate at which ACKs are transmitted) increases the amount of time it takes slow start to increase the congestion window to an appropriate operating point, and the connection consequently suffers from reduced performance. This is especially true for connections using large windows. Relevant RFCs RFC 1122 outlines delayed ACKs as a recommended mechanism. Trace file demonstrating it Trace file taken using tcpdump at host B, the data receiver (and ACK originator). The advertised window (which never changed) and timestamp options have been omitted for clarity, except for the first packet sent by A: 12:09:24.820187 A.1174 > B.3999: . 2049:3497(1448) ack 1 win 33580 <nop,nop,timestamp 2249877 2249914> [tos 0x8] 12:09:24.824147 A.1174 > B.3999: . 3497:4945(1448) ack 1 12:09:24.832034 A.1174 > B.3999: . 4945:6393(1448) ack 1 12:09:24.832222 B.3999 > A.1174: . ack 6393 12:09:24.934837 A.1174 > B.3999: . 6393:7841(1448) ack 1 12:09:24.942721 A.1174 > B.3999: . 7841:9289(1448) ack 1 12:09:24.950605 A.1174 > B.3999: . 9289:10737(1448) ack 1 12:09:24.950797 B.3999 > A.1174: . ack 10737 12:09:24.958488 A.1174 > B.3999: . 10737:12185(1448) ack 1 12:09:25.052330 A.1174 > B.3999: . 12185:13633(1448) ack 1 12:09:25.060216 A.1174 > B.3999: . 13633:15081(1448) ack 1 12:09:25.060405 B.3999 > A.1174: . ack 15081 This portion of the trace clearly shows that the receiver (host B) sends an ACK for every third full sized packet received. Further investigation of this implementation found that the cause of the increased ACK interval was the TCP options being used. The implementation sent an ACK after it was holding 2*MSS worth of unacknowledged data. In the above case, the MSS is 1460 bytes so the receiver transmits an ACK after it is holding at least 2920 bytes of unacknowledged data. However, the length of the TCP options being used [RFC1323] took 12 bytes away from the data portion of each packet. This produced packets containing 1448 bytes of data. But the additional bytes used by the options in the header were not taken into account when determining when to trigger an ACK. Therefore, it took 3 data segments before the data receiver was holding enough unacknowledged data (>= 2*MSS, or 2920 bytes in the above example) to transmit an ACK. Trace file demonstrating correct behavior Trace file taken using tcpdump at host B, the data receiver (and ACK originator), again with window and timestamp information omitted except for the first packet: 12:06:53.627320 A.1172 > B.3999: . 1449:2897(1448) ack 1 win 33580 <nop,nop,timestamp 2249575 2249612> [tos 0x8] 12:06:53.634773 A.1172 > B.3999: . 2897:4345(1448) ack 1 12:06:53.634961 B.3999 > A.1172: . ack 4345 12:06:53.737326 A.1172 > B.3999: . 4345:5793(1448) ack 1 12:06:53.744401 A.1172 > B.3999: . 5793:7241(1448) ack 1 12:06:53.744592 B.3999 > A.1172: . ack 7241 12:06:53.752287 A.1172 > B.3999: . 7241:8689(1448) ack 1 12:06:53.847332 A.1172 > B.3999: . 
8689:10137(1448) ack 1 12:06:53.847525 B.3999 > A.1172: . ack 10137 This trace shows the TCP receiver (host B) ack'ing every second full-sized packet, according to [RFC1122]. This is the same implementation shown above, with slight modifications that allow the receiver to take the length of the options into account when deciding when to transmit an ACK. References This problem is documented in [Allman97] and [Paxson97]. How to detect Stretch ACK violations show up immediately in receiver-side packet traces of bulk transfers, as shown above. However, packet traces made on the sender side of the TCP connection may lead to ambiguities when diagnosing this problem due to the possibility of lost ACKs. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-09 23:48 ` David S. Miller @ 2006-03-10 0:10 ` Michael S. Tsirkin 2006-03-10 0:38 ` Michael S. Tsirkin 2006-03-10 7:18 ` David S. Miller 2006-03-10 0:21 ` Rick Jones 1 sibling, 2 replies; 35+ messages in thread From: Michael S. Tsirkin @ 2006-03-10 0:10 UTC (permalink / raw) To: David S. Miller; +Cc: netdev, rdreier, linux-kernel, openib-general, shemminger Quoting David S. Miller <davem@davemloft.net>: > Description > To improve efficiency (both computer and network) a data receiver > may refrain from sending an ACK for each incoming segment, > according to [RFC1122]. However, an ACK should not be delayed an > inordinate amount of time. Specifically, ACKs SHOULD be sent for > every second full-sized segment that arrives. If a second full- > sized segment does not arrive within a given timeout (of no more > than 0.5 seconds), an ACK should be transmitted, according to > [RFC1122]. A TCP receiver which does not generate an ACK for > every second full-sized segment exhibits a "Stretch ACK > Violation". Thanks very much for the info! So the longest we can delay, according to this spec, is until we have two full sized segments. But with the change we are discussing, could an ack now be sent even sooner than we have at least two full sized segments? Or does __tcp_ack_snd_check delay until we have at least two full sized segments? David, could you explain please? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-10 0:10 ` Michael S. Tsirkin @ 2006-03-10 0:38 ` Michael S. Tsirkin 0 siblings, 0 replies; 35+ messages in thread From: Michael S. Tsirkin @ 2006-03-10 0:38 UTC (permalink / raw) To: David S. Miller; +Cc: rdreier, netdev, linux-kernel, openib-general, shemminger Quoting r. Michael S. Tsirkin <mst@mellanox.co.il>: > Or does __tcp_ack_snd_check delay until we have at least two full sized > segments? What I'm trying to say is: since RFC 2525, section 2.13, talks about "every second full-sized segment", then, following the code in __tcp_ack_snd_check, why does it do

    /* More than one full frame received... */
    if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss

rather than

    /* At least two full frames received... */
    if (((tp->rcv_nxt - tp->rcv_wup) >= 2 * inet_csk(sk)->icsk_ack.rcv_mss

-- Michael S. Tsirkin Staff Engineer, Mellanox Technologies ^ permalink raw reply [flat|nested] 35+ messages in thread
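For readers without the source at hand, both fragments above come from logic that looks roughly like the following; this is an abridged reconstruction of the 2.6-era __tcp_ack_snd_check(), not a verbatim copy, and David's answer below refers to its first condition:

/* Abridged reconstruction of the 2.6-era receiver ACK decision; may not
 * match any particular release exactly.  An immediate ACK goes out only
 * once more than one full-sized frame is pending, i.e. on the second
 * full segment; otherwise the ACK is delayed.
 */
static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
{
	struct tcp_sock *tp = tcp_sk(sk);

	    /* More than one full frame received... */
	if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss &&
	     /* ...and the advertised window can advance far enough... */
	     __tcp_select_window(sk) >= tp->rcv_wnd) ||
	    /* ...or we are in quickack mode... */
	    tcp_in_quickack_mode(sk) ||
	    /* ...or we have out-of-order data. */
	    (ofo_possible && skb_peek(&tp->out_of_order_queue) != NULL)) {
		/* Then ack it now */
		tcp_send_ack(sk);
	} else {
		/* Else, send delayed ack. */
		tcp_send_delayed_ack(sk);
	}
}

When exactly one full-sized segment is pending, rcv_nxt - rcv_wup equals rcv_mss, which does not satisfy the strict greater-than test, so the ACK waits either for the second full segment or for the delayed-ACK timer.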
* Re: TSO and IPoIB performance degradation 2006-03-10 0:10 ` Michael S. Tsirkin 2006-03-10 0:38 ` Michael S. Tsirkin @ 2006-03-10 7:18 ` David S. Miller 1 sibling, 0 replies; 35+ messages in thread From: David S. Miller @ 2006-03-10 7:18 UTC (permalink / raw) To: mst; +Cc: netdev, rdreier, linux-kernel, openib-general, shemminger From: "Michael S. Tsirkin" <mst@mellanox.co.il> Date: Fri, 10 Mar 2006 02:10:31 +0200 > But with the change we are discussing, could an ack now be sent even > sooner than we have at least two full sized segments? Or does > __tcp_ack_snd_check delay until we have at least two full sized > segments? David, could you explain please? __tcp_ack_snd_check() delays until we have at least two full sized segments. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-09 23:48 ` David S. Miller 2006-03-10 0:10 ` Michael S. Tsirkin @ 2006-03-10 0:21 ` Rick Jones 2006-03-10 7:23 ` David S. Miller 1 sibling, 1 reply; 35+ messages in thread From: Rick Jones @ 2006-03-10 0:21 UTC (permalink / raw) To: netdev; +Cc: rdreier, linux-kernel, openib-general David S. Miller wrote: > From: "Michael S. Tsirkin" <mst@mellanox.co.il> > Date: Wed, 8 Mar 2006 14:53:11 +0200 > > >>What I was trying to figure out was, how can we re-enable the trick >>without hurting TSO? Could a solution be to simply look at the frame >>size, and call tcp_send_delayed_ack if the frame size is small? > > > The change is really not related to TSO. > > By reverting it, you are reducing the number of ACKs on the wire, and > the number of context switches at the sender to push out new data. > That's why it can make things go faster, but it also leads to bursty > TCP sender behavior, which is bad for congestion on the internet. naughty naughty Solaris and HP-UX TCP :) > > When the receiver has a strong cpu and can keep up with the incoming > packet rate very well and we are in an environment with no congestion, > the old code helps a lot. But if the receiver is cpu limited or we > have congestion of any kind, it does exactly the wrong thing. It will > delay ACKs a very long time to the point where the pipe is depleted > and this kills performance in that case. For congested environments, > due to the decreased ACK feedback, packet loss recovery will be > extremely poor. This is the first reason behind my change. well, there are stacks which do "stretch acks" (after a fashion) that make sure when they see packet loss to "do the right thing" wrt sending enough acks to allow cwnds to open again in a timely fashion. that brings-back all that stuff I posted ages ago about the performance delta when using an HP-UX receiver and altering the number of segmetns per ACK. should be in the netdev archive somewhere. might have been around the time of the discussions about MacOS and its ack avoidance - which wasn't done very well at the time. > > The behavior is also specifically frowned upon in the TCP implementor > community. It is specifically mentioned in the Known TCP > Implementation Problems RFC2525, in section 2.13 "Stretch ACK > violation". > > The entry, quoted below for reference, is very clear on the reasons > why stretch ACKs are bad. And although it may help performance for > your case, in congested environments and also with cpu limited > receivers it will have a negative impact on performance. So, this was > the second reason why I made this change. I would have thought that a receiver "stretching ACK's" would be helpful when it was CPU limited since it was spending fewer CPU cycles generating ACKs? > > So reverting the change isn't really an option. > > Name of Problem > Stretch ACK violation > > Classification > Congestion Control/Performance > > Description > To improve efficiency (both computer and network) a data receiver > may refrain from sending an ACK for each incoming segment, > according to [RFC1122]. However, an ACK should not be delayed an > inordinate amount of time. Specifically, ACKs SHOULD be sent for > every second full-sized segment that arrives. If a second full- > sized segment does not arrive within a given timeout (of no more > than 0.5 seconds), an ACK should be transmitted, according to > [RFC1122]. 
A TCP receiver which does not generate an ACK for > every second full-sized segment exhibits a "Stretch ACK > Violation". How can it be a "violation" of a SHOULD?-) > > Significance > TCP receivers exhibiting this behavior will cause TCP senders to > generate burstier traffic, which can degrade performance in > congested environments. In addition, generating fewer ACKs > increases the amount of time needed by the slow start algorithm to > open the congestion window to an appropriate point, which > diminishes performance in environments with large bandwidth-delay > products. Finally, generating fewer ACKs may cause needless > retransmission timeouts in lossy environments, as it increases the > possibility that an entire window of ACKs is lost, forcing a > retransmission timeout. Of those three, I think the most meaningful is the second, which can be dealt with by smarts in the ACK-stretching receiver. For the first, it will only degrade performance if it triggers packet loss. I'm not sure I've ever seen the third item happen. > > Implications > When not in loss recovery, every ACK received by a TCP sender > triggers the transmission of new data segments. The burst size is > determined by the number of previously unacknowledged segments > each ACK covers. Therefore, a TCP receiver ack'ing more than 2 > segments at a time causes the sending TCP to generate a larger > burst of traffic upon receipt of the ACK. This large burst of > traffic can overwhelm an intervening gateway, leading to higher > drop rates for both the connection and other connections passing > through the congested gateway. Doesn't RED mean that those other connections are rather less likely to be affected? > > In addition, the TCP slow start algorithm increases the congestion > window by 1 segment for each ACK received. Therefore, increasing > the ACK interval (thus decreasing the rate at which ACKs are > transmitted) increases the amount of time it takes slow start to > increase the congestion window to an appropriate operating point, > and the connection consequently suffers from reduced performance. > This is especially true for connections using large windows. This one is dealt with by ABC isn't it? At least in part since ABC appears to cap cwnd increase at 2*SMSS (I only know this because I just read the RFC mentioned in another thread - seems a bit much to have made that limit a MUST rather than a SHOULD :) rick jones ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-10 0:21 ` Rick Jones @ 2006-03-10 7:23 ` David S. Miller 2006-03-10 17:44 ` Rick Jones 2006-03-20 9:06 ` Michael S. Tsirkin 0 siblings, 2 replies; 35+ messages in thread From: David S. Miller @ 2006-03-10 7:23 UTC (permalink / raw) To: rick.jones2; +Cc: netdev, rdreier, linux-kernel, openib-general From: Rick Jones <rick.jones2@hp.com> Date: Thu, 09 Mar 2006 16:21:05 -0800 > well, there are stacks which do "stretch acks" (after a fashion) that > make sure when they see packet loss to "do the right thing" wrt sending > enough acks to allow cwnds to open again in a timely fashion. Once a loss happens, it's too late to stop doing the stretch ACKs, the damage is done already. It is going to take you at least one extra RTT to recover from the loss compared to if you were not doing stretch ACKs. You have to keep giving consistent well spaced ACKs back to the receiver in order to recover from loss optimally. The ACK every 2 full sized frames behavior of TCP is absolutely essential. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-10 7:23 ` David S. Miller @ 2006-03-10 17:44 ` Rick Jones 2006-03-20 9:06 ` Michael S. Tsirkin 1 sibling, 0 replies; 35+ messages in thread From: Rick Jones @ 2006-03-10 17:44 UTC (permalink / raw) To: netdev; +Cc: rdreier, linux-kernel, openib-general David S. Miller wrote: > From: Rick Jones <rick.jones2@hp.com> > Date: Thu, 09 Mar 2006 16:21:05 -0800 > > >>well, there are stacks which do "stretch acks" (after a fashion) that >>make sure when they see packet loss to "do the right thing" wrt sending >>enough acks to allow cwnds to open again in a timely fashion. > > > Once a loss happens, it's too late to stop doing the stretch ACKs, the > damage is done already. It is going to take you at least one > extra RTT to recover from the loss compared to if you were not doing > stretch ACKs. I must be dense (entirely possible), but how is that absolute? If there is no more data in flight after the segment that was lost the "stretch ACK" stacks with which I'm familiar will generate the standalone ACK within the deferred ACK interval (50 milliseconds). I guess that can be the "one extra RTT" However, if there is data in flight after the point of loss, the immediate ACK upon receipt of out-of order data kicks in. > You have to keep giving consistent well spaced ACKs back to the > receiver in order to recover from loss optimally. The key there is defining consistent and well spaced. Certainly an ACK only after a window's-worth of data would not be well spaced, but I believe that an ACK after more than two full sized frames could indeed be well-spaced. > The ACK every 2 full sized frames behavior of TCP is absolutely > essential. I don't think it is _quite_ that cut and dried, otherwise, HP-UX and Solaris, since < 1997 would have had big time problems. rick jones ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-10 7:23 ` David S. Miller 2006-03-10 17:44 ` Rick Jones @ 2006-03-20 9:06 ` Michael S. Tsirkin 2006-03-20 9:55 ` David S. Miller 1 sibling, 1 reply; 35+ messages in thread From: Michael S. Tsirkin @ 2006-03-20 9:06 UTC (permalink / raw) To: David S. Miller Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general Quoting r. David S. Miller <davem@davemloft.net>: > > well, there are stacks which do "stretch acks" (after a fashion) that > > make sure when they see packet loss to "do the right thing" wrt sending > > enough acks to allow cwnds to open again in a timely fashion. > > Once a loss happens, it's too late to stop doing the stretch ACKs, the > damage is done already. It is going to take you at least one > extra RTT to recover from the loss compared to if you were not doing > stretch ACKs. > > You have to keep giving consistent well spaced ACKs back to the > receiver in order to recover from loss optimally. Is it the case then that this requirement is less essential on networks such as IP over InfiniBand, which are very low latency and essentially lossless (with explicit congestion notifications in hardware)? > The ACK every 2 full sized frames behavior of TCP is absolutely > essential. Interestingly, I was pointed towards the following RFC draft http://www.ietf.org/internet-drafts/draft-ietf-tcpm-rfc2581bis-00.txt The requirement that an ACK "SHOULD" be generated for at least every second full-sized segment is listed in [RFC1122] in one place as a SHOULD and another as a MUST. Here we unambiguously state it is a SHOULD. We also emphasize that this is a SHOULD, meaning that an implementor should indeed only deviate from this requirement after careful consideration of the implications. And as Matt Leininger's research appears to show, stretch ACKs are good for performance in the case of IP over InfiniBand. Given all this, would it make sense to add a per-netdevice (or per-neighbour) flag to re-enable the trick for these net devices (as was done before 314324121f9b94b2ca657a494cf2b9cb0e4a28cc)? The IP over InfiniBand driver would then simply set this flag. David, would you accept such a patch? It would be nice to get 2.6.17 back to within at least 10% of 2.6.11. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-20 9:06 ` Michael S. Tsirkin @ 2006-03-20 9:55 ` David S. Miller 2006-03-20 10:22 ` Michael S. Tsirkin 0 siblings, 1 reply; 35+ messages in thread From: David S. Miller @ 2006-03-20 9:55 UTC (permalink / raw) To: mst; +Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general From: "Michael S. Tsirkin" <mst@mellanox.co.il> Date: Mon, 20 Mar 2006 11:06:29 +0200 > Is it the case then that this requirement is less essential on > networks such as IP over InfiniBand, which are very low latency > and essencially lossless (with explicit congestion contifications > in hardware)? You can never assume any attribute of the network whatsoever. Even if initially the outgoing device is IPoIB, something in the middle, like a traffic classification or netfilter rule, could rewrite the packet and make it go somewhere else. This even applies to loopback packets, because packets can get rewritten and redirected even once they are passed in via netif_receive_skb(). > And as Matt Leininger's research appears to show, stretch ACKs > are good for performance in case of IP over InfiniBand. > > Given all this, would it make sense to add a per-netdevice (or per-neighbour) > flag to re-enable the trick for these net devices (as was done before > 314324121f9b94b2ca657a494cf2b9cb0e4a28cc)? > IP over InfiniBand driver would then simply set this flag. See above, this is not feasible. The path an SKB can take is opaque and unknown until the very last moment it is actually given to the device transmit function. People need to get the "special case this topology" ideas out of their heads. :-) ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-20 9:55 ` David S. Miller @ 2006-03-20 10:22 ` Michael S. Tsirkin 2006-03-20 10:37 ` David S. Miller 0 siblings, 1 reply; 35+ messages in thread From: Michael S. Tsirkin @ 2006-03-20 10:22 UTC (permalink / raw) To: David S. Miller Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general Quoting r. David S. Miller <davem@davemloft.net>: > The path an SKB can take is opaque and unknown until the very last > moment it is actually given to the device transmit function. Why, I was proposing looking at dst cache. If that's NULL, well, we won't stretch ACKs. Worst case we apply the wrong optimization. Right? > People need to get the "special case this topology" ideas out of their > heads. :-) Okay, I get that. What I'd like to clarify, however: rfc2581 explicitly states that in some cases it might be OK to generate ACKs less frequently than every second full-sized segment. Given Matt's measurements, TCP on top of IP over InfiniBand on Linux seems to hit one of these cases. Do you agree to that? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies ^ permalink raw reply [flat|nested] 35+ messages in thread
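To make the dst-cache idea concrete, the check being described might look something like the sketch below. NETIF_F_STRETCH_ACK is an invented name for a hypothetical per-netdevice opt-in flag (no such flag exists), and, as the reply below argues, the cached dst only says where this socket's own packets would go out, not which path the peer's data actually took:

#include <linux/netdevice.h>
#include <net/dst.h>
#include <net/sock.h>

/* Hypothetical flag, not part of the kernel: a netdevice would set it to
 * declare that stretch ACKs are acceptable on this link. */
#define NETIF_F_STRETCH_ACK	0x80000000

/* Illustrative sketch of the proposal above, not a real patch. */
static int sk_allows_stretch_ack(struct sock *sk)
{
	struct dst_entry *dst = __sk_dst_get(sk);

	/* No cached route: assume the worst and do not stretch ACKs. */
	if (!dst || !dst->dev)
		return 0;

	return (dst->dev->features & NETIF_F_STRETCH_ACK) != 0;
}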
* Re: TSO and IPoIB performance degradation 2006-03-20 10:22 ` Michael S. Tsirkin @ 2006-03-20 10:37 ` David S. Miller 2006-03-20 11:27 ` Michael S. Tsirkin 2006-04-27 4:13 ` Troy Benjegerdes 0 siblings, 2 replies; 35+ messages in thread From: David S. Miller @ 2006-03-20 10:37 UTC (permalink / raw) To: mst; +Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general From: "Michael S. Tsirkin" <mst@mellanox.co.il> Date: Mon, 20 Mar 2006 12:22:34 +0200 > Quoting r. David S. Miller <davem@davemloft.net>: > > The path an SKB can take is opaque and unknown until the very last > > moment it is actually given to the device transmit function. > > Why, I was proposing looking at dst cache. If that's NULL, well, > we won't stretch ACKs. Worst case we apply the wrong optimization. > Right? Where you receive a packet from isn't very useful for determining even the full path on which that packet itself flowed. More importantly, packets also do not necessarily go back out over the same path on which packets are received for a connection. This is actually quite common. Maybe packets for this connection come in via IPoIB but go out via gigabit ethernet and another route altogether. > What I'd like to clarify, however: rfc2581 explicitly states that in > some cases it might be OK to generate ACKs less frequently than > every second full-sized segment. Given Matt's measurements, TCP on > top of IP over InfiniBand on Linux seems to hit one of these cases. > Do you agree to that? I disagree with Linux changing its behavior. It would be great to turn off congestion control completely over local gigabit networks, but that isn't determinable in any way, so we don't do that. The IPoIB situation is no different, you can set all the bits you want in incoming packets, the barrier to doing this remains the same. It hurts performance if any packet drop occurs because it will require an extra round trip for recovery to begin to be triggered at the sender. The network is a black box, routes to and from a destination are arbitrary, and so is packet rewriting and reflection, so being able to say "this all occurs on IPoIB" is simply infeasible. I don't know how else to say this, we simply cannot special case IPoIB or any other topology type. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-20 10:37 ` David S. Miller @ 2006-03-20 11:27 ` Michael S. Tsirkin 2006-03-20 11:47 ` Arjan van de Ven 2006-04-27 4:13 ` Troy Benjegerdes 1 sibling, 1 reply; 35+ messages in thread From: Michael S. Tsirkin @ 2006-03-20 11:27 UTC (permalink / raw) To: David S. Miller Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general Quoting David S. Miller <davem@davemloft.net>: > I disagree with Linux changing it's behavior. It would be great to > turn off congestion control completely over local gigabit networks, > but that isn't determinable in any way, so we don't do that. Interesting. Would it make sense to make it another tunable knob in /proc, sysfs or sysctl then? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-20 11:27 ` Michael S. Tsirkin @ 2006-03-20 11:47 ` Arjan van de Ven 2006-03-20 11:49 ` Lennert Buytenhek 0 siblings, 1 reply; 35+ messages in thread From: Arjan van de Ven @ 2006-03-20 11:47 UTC (permalink / raw) To: Michael S. Tsirkin Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general, David S. Miller On Mon, 2006-03-20 at 13:27 +0200, Michael S. Tsirkin wrote: > Quoting David S. Miller <davem@davemloft.net>: > > I disagree with Linux changing it's behavior. It would be great to > > turn off congestion control completely over local gigabit networks, > > but that isn't determinable in any way, so we don't do that. > > Interesting. Would it make sense to make it another tunable knob in > /proc, sysfs or sysctl then? that's not the right level; since that is per interface. And you only know the actual interface waay too late (as per earlier posts). Per socket.. maybe But then again it's not impossible to have packets for one socket go out to multiple interfaces (think load balancing bonding over 2 interfaces, one IB another ethernet) ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-20 11:47 ` Arjan van de Ven @ 2006-03-20 11:49 ` Lennert Buytenhek 2006-03-20 11:53 ` Arjan van de Ven 2006-03-20 12:04 ` Michael S. Tsirkin 0 siblings, 2 replies; 35+ messages in thread From: Lennert Buytenhek @ 2006-03-20 11:49 UTC (permalink / raw) To: Arjan van de Ven Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general, David S. Miller On Mon, Mar 20, 2006 at 12:47:03PM +0100, Arjan van de Ven wrote: > > > I disagree with Linux changing it's behavior. It would be great to > > > turn off congestion control completely over local gigabit networks, > > > but that isn't determinable in any way, so we don't do that. > > > > Interesting. Would it make sense to make it another tunable knob in > > /proc, sysfs or sysctl then? > > that's not the right level; since that is per interface. And you only > know the actual interface waay too late (as per earlier posts). > Per socket.. maybe > But then again it's not impossible to have packets for one socket go out > to multiple interfaces > (think load balancing bonding over 2 interfaces, one IB another > ethernet) I read it as if he was proposing to have a sysctl knob to turn off TCP congestion control completely (which has so many issues it's not even funny.) ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-20 11:49 ` Lennert Buytenhek @ 2006-03-20 11:53 ` Arjan van de Ven 2006-03-20 13:35 ` Michael S. Tsirkin 2006-03-20 12:04 ` Michael S. Tsirkin 1 sibling, 1 reply; 35+ messages in thread From: Arjan van de Ven @ 2006-03-20 11:53 UTC (permalink / raw) To: Lennert Buytenhek Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general, David S. Miller On Mon, 2006-03-20 at 12:49 +0100, Lennert Buytenhek wrote: > On Mon, Mar 20, 2006 at 12:47:03PM +0100, Arjan van de Ven wrote: > > > > > I disagree with Linux changing it's behavior. It would be great to > > > > turn off congestion control completely over local gigabit networks, > > > > but that isn't determinable in any way, so we don't do that. > > > > > > Interesting. Would it make sense to make it another tunable knob in > > > /proc, sysfs or sysctl then? > > > > that's not the right level; since that is per interface. And you only > > know the actual interface waay too late (as per earlier posts). > > Per socket.. maybe > > But then again it's not impossible to have packets for one socket go out > > to multiple interfaces > > (think load balancing bonding over 2 interfaces, one IB another > > ethernet) > > I read it as if he was proposing to have a sysctl knob to turn off > TCP congestion control completely (which has so many issues it's not > even funny.) owww that's so bad I didn't even consider that ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-20 11:53 ` Arjan van de Ven @ 2006-03-20 13:35 ` Michael S. Tsirkin 0 siblings, 0 replies; 35+ messages in thread From: Michael S. Tsirkin @ 2006-03-20 13:35 UTC (permalink / raw) To: Arjan van de Ven Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general, David S. Miller, Lennert Buytenhek Quoting Arjan van de Ven <arjan@infradead.org>: > > I read it as if he was proposing to have a sysctl knob to turn off > > TCP congestion control completely (which has so many issues it's not > > even funny.) > > owww that's so bad I didn't even consider that No, I think that comment was taken out of thread context. We were talking about stretching ACKs - while avoiding stretch ACKs is important for TCP congestion control, it's not the only mechanism. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-20 11:49 ` Lennert Buytenhek 2006-03-20 11:53 ` Arjan van de Ven @ 2006-03-20 12:04 ` Michael S. Tsirkin 2006-03-20 15:09 ` Benjamin LaHaise 1 sibling, 1 reply; 35+ messages in thread From: Michael S. Tsirkin @ 2006-03-20 12:04 UTC (permalink / raw) To: Lennert Buytenhek Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general, David S. Miller, Arjan van de Ven Quoting r. Lennert Buytenhek <buytenh@wantstofly.org>: > > > > I disagree with Linux changing it's behavior. It would be great to > > > > turn off congestion control completely over local gigabit networks, > > > > but that isn't determinable in any way, so we don't do that. > > > > > > Interesting. Would it make sense to make it another tunable knob in > > > /proc, sysfs or sysctl then? > > > > that's not the right level; since that is per interface. And you only > > know the actual interface waay too late (as per earlier posts). > > Per socket.. maybe > > But then again it's not impossible to have packets for one socket go out > > to multiple interfaces > > (think load balancing bonding over 2 interfaces, one IB another > > ethernet) > > I read it as if he was proposing to have a sysctl knob to turn off > TCP congestion control completely (which has so many issues it's not > even funny.) Not really, that was David :) What started this thread was the fact that since 2.6.11 Linux does not stretch ACKs anymore. RFC 2581 does mention that it might be OK to stretch ACKs "after careful consideration", and we are seeing that it helps IP over InfiniBand, so recent Linux kernels perform worse in that respect. And since there does not seem to be a way to figure it out automagically when doing this is a good idea, I proposed adding some kind of knob that will let the user apply the consideration for us. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-20 12:04 ` Michael S. Tsirkin @ 2006-03-20 15:09 ` Benjamin LaHaise 2006-03-20 18:58 ` Rick Jones 2006-03-20 23:00 ` David S. Miller 0 siblings, 2 replies; 35+ messages in thread From: Benjamin LaHaise @ 2006-03-20 15:09 UTC (permalink / raw) To: Michael S. Tsirkin Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general, David S. Miller, Lennert Buytenhek, Arjan van de Ven On Mon, Mar 20, 2006 at 02:04:07PM +0200, Michael S. Tsirkin wrote: > does not stretch ACKs anymore. RFC 2581 does mention that it might be OK to > stretch ACKs "after careful consideration", and we are seeing that it helps > IP over InfiniBand, so recent Linux kernels perform worse in that respect. > > And since there does not seem to be a way to figure out automagically when > doing this is a good idea, I proposed adding some kind of knob that will let the > user apply the consideration for us. Wouldn't it make sense to stretch the ACK when the previous ACK is still in the TX queue of the device? I know that sort of behaviour was always an issue on modem links where you don't want to send out redundant ACKs. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <dont@kvack.org>. ^ permalink raw reply [flat|nested] 35+ messages in thread
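A rough sketch of the check Ben describes, under two loud assumptions: that the stack kept a pointer to the last pure ACK it transmitted (it does not; struct tcp_sock has no such field), and that the skb refcount is a usable proxy for "still sitting in the device TX queue" (skb->users is an atomic_t in the kernels under discussion):

#include <linux/skbuff.h>

/* Hypothetical helper: non-zero if the previously sent ACK still seems
 * to be queued in the driver, i.e. somebody besides us holds a
 * reference to it.  The caller would have to stash last_ack_skb in an
 * invented per-socket field at the time the ACK was sent. */
static int prev_ack_still_queued(struct sk_buff *last_ack_skb)
{
        return last_ack_skb && atomic_read(&last_ack_skb->users) > 1;
}

If this returned true the receiver would hold back and let a single later ACK advertise the larger window. Rick's and Dave's replies below spell out why the refcount is a weak proxy: a bare ACK may not be referenced at all, and the count only drops at TX-completion interrupt time.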
* Re: TSO and IPoIB performance degradation 2006-03-20 15:09 ` Benjamin LaHaise @ 2006-03-20 18:58 ` Rick Jones 2006-03-20 23:00 ` David S. Miller 1 sibling, 0 replies; 35+ messages in thread From: Rick Jones @ 2006-03-20 18:58 UTC (permalink / raw) To: Benjamin LaHaise Cc: netdev, rdreier, linux-kernel, openib-general, David S. Miller, Lennert Buytenhek, Arjan van de Ven > Wouldn't it make sense to stretch the ACK when the previous ACK is still in > the TX queue of the device? I know that sort of behaviour was always an > issue on modem links where you don't want to send out redundant ACKs. Perhaps, but it isn't clear that it would be worth the cycles to check. I doubt that a simple reference count on the ACK skb would do it, since if it were a bare ACK I doubt that TCP keeps a reference to the skb in the first place? Also, what would be the "trigger" to send the next ACK after the previous one had left the building (Elvis-like)? Receipt of N in-order segments? A timeout? If you are going to go ahead and try to do stretch-ACKs, then I suspect the way to go about doing it is to have it behave very much like HP-UX or Solaris, both of which have arguably reasonable ACK-avoidance heuristics in them. But don't try to do it quick and dirty. rick "likes ACK avoidance, just check the archives" jones on netdev, no need to cc me directly ^ permalink raw reply [flat|nested] 35+ messages in thread
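To make the shape of such a heuristic concrete, here is one guess at a "careful" receiver-side policy, not taken from HP-UX or Solaris: ACK at least every Nth in-order full-sized segment, and sooner if the data not yet acknowledged approaches a fraction of the advertised window, with the delayed-ACK timer left armed as the backstop. Both thresholds below are assumptions.

#include <linux/tcp.h>

/* Sketch of an ACK-avoidance decision for the receive path.  "segs" is
 * a hypothetical count of in-order full-sized segments received since
 * the last ACK; the 8-segment and quarter-window limits are arbitrary. */
static int should_ack_now(const struct tcp_sock *tp, unsigned int segs)
{
        u32 unacked = tp->rcv_nxt - tp->rcv_wup;  /* data not yet ACKed */

        if (unacked >= (tp->rcv_wnd >> 2))        /* don't starve the sender's window */
                return 1;

        return segs >= 8;                         /* otherwise every 8th segment */
}

Picking the segment count and the fallback timer so that the sender's congestion window still opens at a sane rate is exactly the part that cannot be done quick and dirty.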
* Re: TSO and IPoIB performance degradation 2006-03-20 15:09 ` Benjamin LaHaise 2006-03-20 18:58 ` Rick Jones @ 2006-03-20 23:00 ` David S. Miller 1 sibling, 0 replies; 35+ messages in thread From: David S. Miller @ 2006-03-20 23:00 UTC (permalink / raw) To: bcrl Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general, buytenh, arjan From: Benjamin LaHaise <bcrl@kvack.org> Date: Mon, 20 Mar 2006 10:09:42 -0500 > Wouldn't it make sense to stretch the ACK when the previous ACK is still in > the TX queue of the device? I know that sort of behaviour was always an > issue on modem links where you don't want to send out redundant ACKs. I thought about doing some similar trick with TSO, wherein we would not defer a TSO send if all the previous packets sent are out of the device transmit queue. The idea was to prevent the pipe from ever emptying, which is the danger of deferring too much for TSO. This has several problems. It's hard to implement. You have to decide if you want precise state, by checking the TX descriptors. Or you go for imprecise but easier-to-implement state, which is very imprecise and therefore not very useful, by just checking the SKB refcount or similar. That means you find out it has left the TX queue only after the TX purge interrupt, which can be a long time after the event, and by then the pipe has emptied, which is what you were trying to prevent. Lastly, you don't want to touch remote cpu state, which is what such a hack is going to end up doing much of the time. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-20 10:37 ` David S. Miller 2006-03-20 11:27 ` Michael S. Tsirkin @ 2006-04-27 4:13 ` Troy Benjegerdes 1 sibling, 0 replies; 35+ messages in thread From: Troy Benjegerdes @ 2006-04-27 4:13 UTC (permalink / raw) To: David S. Miller Cc: mst, rick.jones2, netdev, rdreier, linux-kernel, openib-general On Mon, Mar 20, 2006 at 02:37:04AM -0800, David S. Miller wrote: > From: "Michael S. Tsirkin" <mst@mellanox.co.il> > Date: Mon, 20 Mar 2006 12:22:34 +0200 > > > Quoting r. David S. Miller <davem@davemloft.net>: > > > The path an SKB can take is opaque and unknown until the very last > > > moment it is actually given to the device transmit function. > > > > Why, I was proposing looking at dst cache. If that's NULL, well, > > we won't stretch ACKs. Worst case we apply the wrong optimization. > > Right? > > Where you receive a packet from isn't very useful for determining > even the full path on which that packet itself flowed. > > More importantly, packets also do not necessarily go back out over the > same path on which packets are received for a connection. This is > actually quite common. > > Maybe packets for this connection come in via IPoIB but go out via > gigabit ethernet and another route altogether. > > > What I'd like to clarify, however: rfc2581 explicitly states that in > > some cases it might be OK to generate ACKs less frequently than > > every second full-sized segment. Given Matt's measurements, TCP on > > top of IP over InfiniBand on Linux seems to hit one of these cases. > > Do you agree to that? > > I disagree with Linux changing its behavior. It would be great to > turn off congestion control completely over local gigabit networks, > but that isn't determinable in any way, so we don't do that. > > The IPoIB situation is no different, you can set all the bits you want > in incoming packets, the barrier to doing this remains the same. > > It hurts performance if any packet drop occurs because it will require > an extra round trip for recovery to begin to be triggered at the > sender. > > The network is a black box, routes to and from a destination are > arbitrary, and so is packet rewriting and reflection, so being able to > say "this all occurs on IPoIB" is simply infeasible. > > I don't know how else to say this, we simply cannot special case IPoIB > or any other topology type. David is right. If you care about performance, you are already using SDP or verbs layer for the transport anyway. If I am going to be doing IPoIB, it's because eventually I expect the packet might get off the IB network and onto some other network and go halfway across the country. ^ permalink raw reply [flat|nested] 35+ messages in thread
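For reference, the dst-cache test Michael floats in the quoted exchange would amount to something like the sketch below; as Dave points out, a cached route only describes the first hop of the outgoing leg, and says nothing about later hops or about the path the peer's packets take back.

#include <linux/if_arp.h>
#include <net/dst.h>
#include <net/sock.h>

/* Sketch: does the socket's cached route currently point at an IPoIB
 * interface?  Returns 0 when there is no cached route, in which case
 * the caller would simply not stretch ACKs.  Assumes it is called with
 * the socket lock held. */
static int outgoing_dev_is_ipoib(struct sock *sk)
{
        struct dst_entry *dst = __sk_dst_get(sk);

        return dst && dst->dev && dst->dev->type == ARPHRD_INFINIBAND;
}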
end of thread, other threads: [~2006-04-27  4:13 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-03-06 22:34 TSO and IPoIB performance degradation Michael S. Tsirkin
2006-03-06 22:40 ` David S. Miller
2006-03-06 22:50 ` Stephen Hemminger
2006-03-07  3:13 ` Shirley Ma
2006-03-07 21:44 ` Matt Leininger
2006-03-07 21:49 ` Stephen Hemminger
2006-03-07 21:53 ` Michael S. Tsirkin
2006-03-08  0:11 ` Matt Leininger
2006-03-08  0:18 ` David S. Miller
2006-03-08  1:17 ` Roland Dreier
2006-03-08  1:23 ` David S. Miller
2006-03-08  1:34 ` Roland Dreier
2006-03-08 12:53 ` Michael S. Tsirkin
2006-03-08 20:53 ` David S. Miller
2006-03-09 23:48 ` David S. Miller
2006-03-10  0:10 ` Michael S. Tsirkin
2006-03-10  0:38 ` Michael S. Tsirkin
2006-03-10  7:18 ` David S. Miller
2006-03-10  0:21 ` Rick Jones
2006-03-10  7:23 ` David S. Miller
2006-03-10 17:44 ` Rick Jones
2006-03-20  9:06 ` Michael S. Tsirkin
2006-03-20  9:55 ` David S. Miller
2006-03-20 10:22 ` Michael S. Tsirkin
2006-03-20 10:37 ` David S. Miller
2006-03-20 11:27 ` Michael S. Tsirkin
2006-03-20 11:47 ` Arjan van de Ven
2006-03-20 11:49 ` Lennert Buytenhek
2006-03-20 11:53 ` Arjan van de Ven
2006-03-20 13:35 ` Michael S. Tsirkin
2006-03-20 12:04 ` Michael S. Tsirkin
2006-03-20 15:09 ` Benjamin LaHaise
2006-03-20 18:58 ` Rick Jones
2006-03-20 23:00 ` David S. Miller
2006-04-27  4:13 ` Troy Benjegerdes