* TSO and IPoIB performance degradation @ 2006-03-06 22:34 Michael S. Tsirkin 2006-03-06 22:40 ` David S. Miller 2006-03-06 22:50 ` Stephen Hemminger 0 siblings, 2 replies; 35+ messages in thread From: Michael S. Tsirkin @ 2006-03-06 22:34 UTC (permalink / raw) To: netdev, openib-general, Linux Kernel Mailing List, David S. Miller, Matt Leininger Hello, Dave! As you might know, the TSO patches merged into the mainline kernel since 2.6.11 have hurt performance for the simple (non-TSO) high-speed netdevice that is the IPoIB driver. This was discussed at length here http://openib.org/pipermail/openib-general/2005-October/012271.html I'm trying to figure out what can be done to improve the situation. In particular, I'm looking at the Super TSO patch http://oss.sgi.com/archives/netdev/2005-05/msg00889.html merged into mainline here http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=314324121f9b94b2ca657a494cf2b9cb0e4a28cc There, you said: When we do ucopy receive (ie. copying directly to userspace during tcp input processing) we attempt to delay the ACK until cleanup_rbuf() is invoked. Most of the time this technique works very well, and we emit one ACK advertising the largest window. But this explodes if the ucopy prequeue is large enough. When the receiver is cpu limited and TSO frames are large, the receiver is inundated with ucopy processing, such that the ACK comes out very late. Often, this is so late that by the time the sender gets the ACK the window has emptied too much to be kept full by the sender. The existing TSO code mostly avoided this by keeping the TSO packets no larger than 1/8 of the available window. But with the new code we can get much larger TSO frames. So I'm trying to get a handle on it: could a solution be to simply look at the frame size, and call tcp_send_delayed_ack if the frame size is no larger than 1/8 of the window? Does this make sense? Thanks, -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies ^ permalink raw reply [flat|nested] 35+ messages in thread
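To make the proposal above concrete, here is a minimal sketch of the kind of check being suggested, assuming the 2.6-era TCP field names that are quoted later in this thread (tp->rcv_nxt, tp->rcv_wup, inet_csk(sk)->icsk_ack.rcv_mss). The helper name, the exact threshold, and the place it would be called from are illustrative assumptions, not existing kernel code:

#include <net/tcp.h>

/* Illustrative sketch only, not existing kernel code: decide whether an
 * incoming frame is "small" relative to the receive window, in the spirit
 * of the old 1/8-of-the-window TSO rule mentioned above.  Field names
 * follow the 2.6-era TCP stack; the helper and threshold are assumptions.
 */
static int frame_is_small_relative_to_window(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	unsigned int rcv_mss = inet_csk(sk)->icsk_ack.rcv_mss;

	/* "No larger than 1/8 of the available window" */
	return rcv_mss <= tp->rcv_wnd / 8;
}

/* The idea would then be, somewhere in the receiver's ACK decision path:
 *
 *	if (frame_is_small_relative_to_window(sk))
 *		tcp_send_delayed_ack(sk);
 *	else
 *		tcp_send_ack(sk);
 */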
* Re: TSO and IPoIB performance degradation 2006-03-06 22:34 TSO and IPoIB performance degradation Michael S. Tsirkin @ 2006-03-06 22:40 ` David S. Miller 2006-03-06 22:50 ` Stephen Hemminger 1 sibling, 0 replies; 35+ messages in thread From: David S. Miller @ 2006-03-06 22:40 UTC (permalink / raw) To: mst; +Cc: netdev, linux-kernel, openib-general From: "Michael S. Tsirkin" <mst@mellanox.co.il> Date: Tue, 7 Mar 2006 00:34:38 +0200 > So I'm trying to get a handle on it: could a solution be to simply > look at the frame size, and call tcp_send_delayed_ack from > if the frame size is no larger than 1/8? > > Does this make sense? The comment you mention is very old, and no longer applies. Get full packet traces from the kernel TSO code in the 2.6.x kernel, analyze them, and post here what you think is occurring that is causing the performance problems. One thing to note is that the newer TSO code really needs to have large socket buffers, so you can experiment with that. ^ permalink raw reply [flat|nested] 35+ messages in thread
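On the large-socket-buffer suggestion: besides raising the tcp_rmem/tcp_wmem sysctls (which Matt sweeps later in this thread), the buffers can also be pinned per socket from the benchmark program itself. This is a generic userspace sketch, not taken from the thread; the 4 MB figure is an arbitrary example, and explicit SO_SNDBUF/SO_RCVBUF settings fix the buffer size instead of letting the kernel autotune it:

#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Force large per-socket buffers for a throughput test.  The value is an
 * arbitrary example; call this on the data socket before connect() or
 * listen() so the window scale is negotiated to match.
 */
static int set_big_buffers(int fd, int bytes)
{
	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0 ||
	    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0) {
		perror("setsockopt");
		return -1;
	}
	return 0;
}

/* e.g. set_big_buffers(fd, 4 * 1024 * 1024); */

Setting these before connection setup matters because the TCP window scale option is chosen from the receive buffer size when the connection is established.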
* Re: TSO and IPoIB performance degradation 2006-03-06 22:34 TSO and IPoIB performance degradation Michael S. Tsirkin 2006-03-06 22:40 ` David S. Miller @ 2006-03-06 22:50 ` Stephen Hemminger 2006-03-07 3:13 ` Shirley Ma 1 sibling, 1 reply; 35+ messages in thread From: Stephen Hemminger @ 2006-03-06 22:50 UTC (permalink / raw) To: Michael S. Tsirkin Cc: netdev, Linux Kernel Mailing List, openib-general, David S. Miller On Tue, 7 Mar 2006 00:34:38 +0200 "Michael S. Tsirkin" <mst@mellanox.co.il> wrote: > Hello, Dave! > As you might know, the TSO patches merged into mainline kernel > since 2.6.11 have hurt performance for the simple (non-TSO) > high-speed netdevice that is IPoIB driver. > > This was discussed at length here > http://openib.org/pipermail/openib-general/2005-October/012271.html > > I'm trying to figure out what can be done to improve the situation. > In partucular, I'm looking at the Super TSO patch > http://oss.sgi.com/archives/netdev/2005-05/msg00889.html > > merged into mainline here > > http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=314324121f9b94b2ca657a494cf2b9cb0e4a28cc > > There, you said: > > When we do ucopy receive (ie. copying directly to userspace > during tcp input processing) we attempt to delay the ACK > until cleanup_rbuf() is invoked. Most of the time this > technique works very well, and we emit one ACK advertising > the largest window. > > But this explodes if the ucopy prequeue is large enough. > When the receiver is cpu limited and TSO frames are large, > the receiver is inundated with ucopy processing, such that > the ACK comes out very late. Often, this is so late that > by the time the sender gets the ACK the window has emptied > too much to be kept full by the sender. > > The existing TSO code mostly avoided this by keeping the > TSO packets no larger than 1/8 of the available window. > But with the new code we can get much larger TSO frames. > > So I'm trying to get a handle on it: could a solution be to simply > look at the frame size, and call tcp_send_delayed_ack from > if the frame size is no larger than 1/8? > > Does this make sense? > > Thanks, > > More likely you are getting hit by the fact that TSO prevents the congestion window from increasing properly. This was fixed in 2.6.15 (around mid of Nov 2005). ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Re: TSO and IPoIB performance degradation 2006-03-06 22:50 ` Stephen Hemminger @ 2006-03-07 3:13 ` Shirley Ma 2006-03-07 21:44 ` Matt Leininger 0 siblings, 1 reply; 35+ messages in thread From: Shirley Ma @ 2006-03-07 3:13 UTC (permalink / raw) To: Stephen Hemminger Cc: netdev, Linux Kernel Mailing List, openib-general, openib-general-bounces, David S. Miller > More likely you are getting hit by the fact that TSO prevents the congestion window from increasing properly. This was fixed in 2.6.15 (around mid of Nov 2005). Yep, I noticed the same problem. After updating to the new kernel, the performance is much better, but it's still lower than before. Thanks, Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Re: TSO and IPoIB performance degradation 2006-03-07 3:13 ` Shirley Ma @ 2006-03-07 21:44 ` Matt Leininger 2006-03-07 21:49 ` Stephen Hemminger 0 siblings, 1 reply; 35+ messages in thread From: Matt Leininger @ 2006-03-07 21:44 UTC (permalink / raw) To: Shirley Ma Cc: netdev, Linux Kernel Mailing List, openib-general, David S. Miller, Stephen Hemminger On Mon, 2006-03-06 at 19:13 -0800, Shirley Ma wrote: > > > More likely you are getting hit by the fact that TSO prevents the > congestion > window from increasing properly. This was fixed in 2.6.15 (around mid > of Nov 2005). > > Yep, I noticed the same problem. After updating to the new kernel, the > performance are much better, but it's still lower than before. Here is an updated version of OpenIB IPoIB performance for various kernels with and without one of the TSO patches. The netperf performance for the latest kernels has not improved the TSO performance drop. Any comments or suggestions would be appreciated. - Matt

All benchmarks are with RHEL4 x86_64 with HCA FW v4.7.0
dual EM64T 3.2 GHz PCIe IB HCA (memfull)
patch 1 - remove changeset 314324121f9b94b2ca657a494cf2b9cb0e4a28cc

Kernel                  OpenIB      msi_x   netperf (MB/s)
2.6.16-rc5              in-kernel   1       367
2.6.15                  in-kernel   1       382
2.6.14-rc4 patch 1      in-kernel   1       434
2.6.14-rc4              in-kernel   1       385
2.6.14-rc3              in-kernel   1       374
2.6.13.2                svn3627     1       386
2.6.13.2 patch 1        svn3627     1       446
2.6.13.2                in-kernel   1       394
2.6.13-rc3 patch 12     in-kernel   1       442
2.6.13-rc3 patch 1      in-kernel   1       450
2.6.13-rc3              in-kernel   1       395
2.6.12.5-lustre         in-kernel   1       399
2.6.12.5 patch 1        in-kernel   1       464
2.6.12.5                in-kernel   1       402
2.6.12                  in-kernel   1       406
2.6.12-rc6 patch 1      in-kernel   1       470
2.6.12-rc6              in-kernel   1       407
2.6.12-rc5              in-kernel   1       405
2.6.12-rc5 patch 1      in-kernel   1       474
2.6.12-rc4              in-kernel   1       470
2.6.12-rc3              in-kernel   1       466
2.6.12-rc2              in-kernel   1       469
2.6.12-rc1              in-kernel   1       466
2.6.11                  in-kernel   1       464
2.6.11                  svn3687     1       464
2.6.9-11.ELsmp          svn3513     1       425   (Woody's results, 3.6Ghz EM64T)

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Re: TSO and IPoIB performance degradation 2006-03-07 21:44 ` Matt Leininger @ 2006-03-07 21:49 ` Stephen Hemminger 2006-03-07 21:53 ` Michael S. Tsirkin 2006-03-08 0:11 ` Matt Leininger 0 siblings, 2 replies; 35+ messages in thread From: Stephen Hemminger @ 2006-03-07 21:49 UTC (permalink / raw) To: Matt Leininger Cc: netdev, Linux Kernel Mailing List, openib-general, David S. Miller On Tue, 07 Mar 2006 13:44:51 -0800 Matt Leininger <mlleinin@hpcn.ca.sandia.gov> wrote: > On Mon, 2006-03-06 at 19:13 -0800, Shirley Ma wrote: > > > > > More likely you are getting hit by the fact that TSO prevents the > > congestion > > window from increasing properly. This was fixed in 2.6.15 (around mid > > of Nov 2005). > > > > Yep, I noticed the same problem. After updating to the new kernel, the > > performance are much better, but it's still lower than before. > > Here is an updated version of OpenIB IPoIB performance for various > kernels with and without one of the TSO patches. The netperf > performance for the latest kernels has not improved the TSO performance > drop. > > Any comments or suggestions would be appreciated. > > - Matt Configuration information? like did you increase the tcp_rmem, tcp_wmem? Tcpdump traces of what is being sent and available window? Is IB using NAPI or just doing netif_rx()? ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Re: TSO and IPoIB performance degradation 2006-03-07 21:49 ` Stephen Hemminger @ 2006-03-07 21:53 ` Michael S. Tsirkin 2006-03-08 0:11 ` Matt Leininger 1 sibling, 0 replies; 35+ messages in thread From: Michael S. Tsirkin @ 2006-03-07 21:53 UTC (permalink / raw) To: Stephen Hemminger Cc: netdev, Linux Kernel Mailing List, openib-general, David S. Miller Quoting r. Stephen Hemminger <shemminger@osdl.org>: > Is IB using NAPI or just doing netif_rx()? No, IPoIB doesn't use NAPI. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Re: TSO and IPoIB performance degradation 2006-03-07 21:49 ` Stephen Hemminger 2006-03-07 21:53 ` Michael S. Tsirkin @ 2006-03-08 0:11 ` Matt Leininger 2006-03-08 0:18 ` David S. Miller 1 sibling, 1 reply; 35+ messages in thread From: Matt Leininger @ 2006-03-08 0:11 UTC (permalink / raw) To: Stephen Hemminger Cc: netdev, Linux Kernel Mailing List, openib-general, David S. Miller On Tue, 2006-03-07 at 13:49 -0800, Stephen Hemminger wrote: > On Tue, 07 Mar 2006 13:44:51 -0800 > Matt Leininger <mlleinin@hpcn.ca.sandia.gov> wrote: > > > On Mon, 2006-03-06 at 19:13 -0800, Shirley Ma wrote: > > > > > > > More likely you are getting hit by the fact that TSO prevents the > > > congestion > > > window from increasing properly. This was fixed in 2.6.15 (around mid > > > of Nov 2005). > > > > > > Yep, I noticed the same problem. After updating to the new kernel, the > > > performance are much better, but it's still lower than before. > > > > Here is an updated version of OpenIB IPoIB performance for various > > kernels with and without one of the TSO patches. The netperf > > performance for the latest kernels has not improved the TSO performance > > drop. > > > > Any comments or suggestions would be appreciated. > > > > - Matt > > Configuration information? like did you increase the tcp_rmem, tcp_wmem? > Tcpdump traces of what is being sent and available window? > Is IB using NAPI or just doing netif_rx()? I used the standard setting for tcp_rmem and tcp_wmem. Here are a few other runs that change those variables. I was able to improve performance by ~30MB/s to 403 MB/s, but this is still a ways from the 474 MB/s before the TSO patches. Thanks, - Matt

All benchmarks are with RHEL4 x86_64 with HCA FW v4.7.0
dual EM64T 3.2 GHz PCIe IB HCA (memfull)
patch 1 - remove changeset 314324121f9b94b2ca657a494cf2b9cb0e4a28cc
msi_x=1 for all tests

Kernel       OpenIB      netperf (MB/s)
2.6.16-rc5   in-kernel   403   tcp_wmem 4096  87380 16777216   tcp_rmem 4096  87380 16777216
2.6.16-rc5   in-kernel   395   tcp_wmem 4096 102400 16777216   tcp_rmem 4096 102400 16777216
2.6.16-rc5   in-kernel   392   tcp_wmem 4096  65536 16777216   tcp_rmem 4096  87380 16777216
2.6.16-rc5   in-kernel   394   tcp_wmem 4096 131072 16777216   tcp_rmem 4096 102400 16777216
2.6.16-rc5   in-kernel   377   tcp_wmem 4096 131072 16777216   tcp_rmem 4096 153600 16777216
2.6.16-rc5   in-kernel   377   tcp_wmem 4096 131072 16777216   tcp_rmem 4096 131072 16777216
2.6.16-rc5   in-kernel   353   tcp_wmem 4096 262144 16777216   tcp_rmem 4096 262144 16777216
2.6.16-rc5   in-kernel   305   tcp_wmem 4096 262144 16777216   tcp_rmem 4096 524288 16777216
2.6.16-rc5   in-kernel   303   tcp_wmem 4096 131072 16777216   tcp_rmem 4096 524288 16777216
2.6.16-rc5   in-kernel   290   tcp_wmem 4096 524288 16777216   tcp_rmem 4096 524288 16777216
2.6.16-rc5   in-kernel   367   default tcp values

--------------------
All with standard tcp settings

Kernel                  OpenIB      netperf (MB/s)
2.6.16-rc5              in-kernel   367
2.6.15                  in-kernel   382
2.6.14-rc4 patch 12     in-kernel   436
2.6.14-rc4 patch 1      in-kernel   434
2.6.14-rc4              in-kernel   385
2.6.14-rc3              in-kernel   374
2.6.13.2                svn3627     386
2.6.13.2 patch 1        svn3627     446
2.6.13.2                in-kernel   394
2.6.13-rc3 patch 12     in-kernel   442
2.6.13-rc3 patch 1      in-kernel   450
2.6.13-rc3              in-kernel   395
2.6.12.5-lustre         in-kernel   399
2.6.12.5 patch 1        in-kernel   464
2.6.12.5                in-kernel   402
2.6.12                  in-kernel   406
2.6.12-rc6 patch 1      in-kernel   470
2.6.12-rc6              in-kernel   407
2.6.12-rc5              in-kernel   405
2.6.12-rc5 patch 1      in-kernel   474
2.6.12-rc4              in-kernel   470
2.6.12-rc3              in-kernel   466
2.6.12-rc2              in-kernel   469
2.6.12-rc1              in-kernel   466
2.6.11                  in-kernel   464
2.6.11                  svn3687     464
2.6.9-11.ELsmp          svn3513     425   (Woody's results, 3.6Ghz EM64T)

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Re: TSO and IPoIB performance degradation 2006-03-08 0:11 ` Matt Leininger @ 2006-03-08 0:18 ` David S. Miller 2006-03-08 1:17 ` Roland Dreier 0 siblings, 1 reply; 35+ messages in thread From: David S. Miller @ 2006-03-08 0:18 UTC (permalink / raw) To: mlleinin; +Cc: netdev, linux-kernel, openib-general, shemminger From: Matt Leininger <mlleinin@hpcn.ca.sandia.gov> Date: Tue, 07 Mar 2006 16:11:37 -0800 > I used the standard setting for tcp_rmem and tcp_wmem. Here are a > few other runs that change those variables. I was able to improve > performance by ~30MB/s to 403 MB/s, but this is still a ways from the > 474 MB/s before the TSO patches. How limited are the IPoIB devices, TX descriptor wise? One side effect of the TSO changes is that one extra descriptor will be used for outgoing packets. This is because we have to put the headers as well as the user data, into page based buffers now. Perhaps you can experiment with increasing the transmit descriptor table size, if that's possible. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Re: TSO and IPoIB performance degradation 2006-03-08 0:18 ` David S. Miller @ 2006-03-08 1:17 ` Roland Dreier 2006-03-08 1:23 ` David S. Miller 0 siblings, 1 reply; 35+ messages in thread From: Roland Dreier @ 2006-03-08 1:17 UTC (permalink / raw) To: David S. Miller; +Cc: netdev, linux-kernel, openib-general, shemminger David> How limited are the IPoIB devices, TX descriptor wise? David> One side effect of the TSO changes is that one extra David> descriptor will be used for outgoing packets. This is David> because we have to put the headers as well as the user David> data, into page based buffers now. We have essentially no limit on TX descriptors. However I think there's some confusion about TSO: IPoIB does _not_ do TSO -- generic InfiniBand hardware does not have any TSO capability. In the future we might be able to implement TSO for certain hardware that does have support, but even that requires some firmware help from the HCA vendors, etc. So right now the IPoIB driver does not do TSO. The reason TSO comes up is that reverting the patch described below helps (or helped at some point at least) IPoIB throughput quite a bit. Clearly this was a bug fix so we can't revert it in general but I think what Michael Tsirkin was suggesting at the beginning of this thread is to do what the last paragraph of the changelog says -- find some way to re-enable the trick. diff-tree 3143241... (from e16fa6b...) Author: David S. Miller <davem@davemloft.net> Date: Mon May 23 12:03:06 2005 -0700 [TCP]: Fix stretch ACK performance killer when doing ucopy. When we are doing ucopy, we try to defer the ACK generation to cleanup_rbuf(). This works most of the time very well, but if the ucopy prequeue is large, this ACKing behavior kills performance. With TSO, it is possible to fill the prequeue so large that by the time the ACK is sent and gets back to the sender, most of the window has emptied of data and performance suffers significantly. This behavior does help in some cases, so we should think about re-enabling this trick in the future, using some kind of limit in order to avoid the bug case. - R. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Re: TSO and IPoIB performance degradation 2006-03-08 1:17 ` Roland Dreier @ 2006-03-08 1:23 ` David S. Miller 2006-03-08 1:34 ` Roland Dreier 2006-03-08 12:53 ` Michael S. Tsirkin 0 siblings, 2 replies; 35+ messages in thread From: David S. Miller @ 2006-03-08 1:23 UTC (permalink / raw) To: rdreier; +Cc: netdev, linux-kernel, openib-general, shemminger From: Roland Dreier <rdreier@cisco.com> Date: Tue, 07 Mar 2006 17:17:30 -0800 > The reason TSO comes up is that reverting the patch described below > helps (or helped at some point at least) IPoIB throughput quite a bit. I wish you had started the thread by mentioning this specific patch, we wasted an enormous amount of precious developer time speculating and asking for arbitrary tests to be run in order to narrow down the problem, yet you knew the specific change that introduced the performance regression already... This is a good example of how not to report a bug. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Re: TSO and IPoIB performance degradation 2006-03-08 1:23 ` David S. Miller @ 2006-03-08 1:34 ` Roland Dreier 2006-03-08 12:53 ` Michael S. Tsirkin 1 sibling, 0 replies; 35+ messages in thread From: Roland Dreier @ 2006-03-08 1:34 UTC (permalink / raw) To: David S. Miller; +Cc: netdev, linux-kernel, openib-general, shemminger David> I wish you had started the thread by mentioning this David> specific patch, we wasted an enormous amount of precious David> developer time speculating and asking for arbitrary tests David> to be run in order to narrow down the problem, yet you knew David> the specific change that introduced the performance David> regression already... Sorry, you're right. I was a little confused because I had a memory of Michael's original email (http://lkml.org/lkml/2006/3/6/150) quoting a changelog entry, but looking back at the message, it was quoting something completely different and misleading. I think the most interesting email in the old thread is http://openib.org/pipermail/openib-general/2005-October/012482.html which shows that reverting 314324121 (the "stretch ACK performance killer" fix) gives ~400 Mbit/sec in extra IPoIB performance. - R. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Re: TSO and IPoIB performance degradation 2006-03-08 1:23 ` David S. Miller 2006-03-08 1:34 ` Roland Dreier @ 2006-03-08 12:53 ` Michael S. Tsirkin 2006-03-08 20:53 ` David S. Miller 2006-03-09 23:48 ` David S. Miller 1 sibling, 2 replies; 35+ messages in thread From: Michael S. Tsirkin @ 2006-03-08 12:53 UTC (permalink / raw) To: David S. Miller; +Cc: netdev, rdreier, linux-kernel, openib-general, shemminger Quoting r. David S. Miller <davem@davemloft.net>: > Subject: Re: Re: TSO and IPoIB performance degradation > > From: Roland Dreier <rdreier@cisco.com> > Date: Tue, 07 Mar 2006 17:17:30 -0800 > > > The reason TSO comes up is that reverting the patch described below > > helps (or helped at some point at least) IPoIB throughput quite a bit. > > I wish you had started the thread by mentioning this specific patch Er, since you mention it, the first message in thread did include this link: http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=314324121f9b94b2ca657a494cf2b9cb0e4a28cc and I even pasted the patch description there, but oh well. Now that Roland helped us clear it all up, and now that it has been clarified that reverting this patch gives us back most of the performance, is the answer to my question the same? What I was trying to figure out was, how can we re-enable the trick without hurting TSO? Could a solution be to simply look at the frame size, and call tcp_send_delayed_ack if the frame size is small? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-08 12:53 ` Michael S. Tsirkin @ 2006-03-08 20:53 ` David S. Miller 2006-03-09 23:48 ` David S. Miller 1 sibling, 0 replies; 35+ messages in thread From: David S. Miller @ 2006-03-08 20:53 UTC (permalink / raw) To: mst; +Cc: rdreier, netdev, linux-kernel, openib-general, shemminger From: "Michael S. Tsirkin" <mst@mellanox.co.il> Date: Wed, 8 Mar 2006 14:53:11 +0200 > What I was trying to figure out was, how can we re-enable the trick without > hurting TSO? Could a solution be to simply look at the frame size, and call > tcp_send_delayed_ack if the frame size is small? The problem is that this patch helps performance when the receiver is CPU limited. The old code would delay ACKs forever if the CPU of the receiver was slow, because we'd wait for all received packets to be copied into userspace before spitting out the ACK. This would allow the pipe to empty, since the sender is waiting for ACKs in order to send more into the pipe, and once the ACK did go out it would cause the sender to emit an enormous burst of data. Both of these behaviors are highly frowned upon for a TCP stack. I'll try to look at this some more later today. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-08 12:53 ` Michael S. Tsirkin 2006-03-08 20:53 ` David S. Miller @ 2006-03-09 23:48 ` David S. Miller 2006-03-10 0:10 ` Michael S. Tsirkin 2006-03-10 0:21 ` Rick Jones 1 sibling, 2 replies; 35+ messages in thread From: David S. Miller @ 2006-03-09 23:48 UTC (permalink / raw) To: mst; +Cc: netdev, rdreier, linux-kernel, openib-general, shemminger From: "Michael S. Tsirkin" <mst@mellanox.co.il> Date: Wed, 8 Mar 2006 14:53:11 +0200 > What I was trying to figure out was, how can we re-enable the trick > without hurting TSO? Could a solution be to simply look at the frame > size, and call tcp_send_delayed_ack if the frame size is small? The change is really not related to TSO. By reverting it, you are reducing the number of ACKs on the wire, and the number of context switches at the sender to push out new data. That's why it can make things go faster, but it also leads to bursty TCP sender behavior, which is bad for congestion on the internet. When the receiver has a strong cpu and can keep up with the incoming packet rate very well and we are in an environment with no congestion, the old code helps a lot. But if the receiver is cpu limited or we have congestion of any kind, it does exactly the wrong thing. It will delay ACKs a very long time to the point where the pipe is depleted and this kills performance in that case. For congested environments, due to the decreased ACK feedback, packet loss recovery will be extremely poor. This is the first reason behind my change. The behavior is also specifically frowned upon in the TCP implementor community. It is specifically mentioned in the Known TCP Implementation Problems RFC2525, in section 2.13 "Stretch ACK violation". The entry, quoted below for reference, is very clear on the reasons why stretch ACKs are bad. And although it may help performance for your case, in congested environments and also with cpu limited receivers it will have a negative impact on performance. So, this was the second reason why I made this change. So reverting the change isn't really an option. Name of Problem Stretch ACK violation Classification Congestion Control/Performance Description To improve efficiency (both computer and network) a data receiver may refrain from sending an ACK for each incoming segment, according to [RFC1122]. However, an ACK should not be delayed an inordinate amount of time. Specifically, ACKs SHOULD be sent for every second full-sized segment that arrives. If a second full- sized segment does not arrive within a given timeout (of no more than 0.5 seconds), an ACK should be transmitted, according to [RFC1122]. A TCP receiver which does not generate an ACK for every second full-sized segment exhibits a "Stretch ACK Violation". Significance TCP receivers exhibiting this behavior will cause TCP senders to generate burstier traffic, which can degrade performance in congested environments. In addition, generating fewer ACKs increases the amount of time needed by the slow start algorithm to open the congestion window to an appropriate point, which diminishes performance in environments with large bandwidth-delay products. Finally, generating fewer ACKs may cause needless retransmission timeouts in lossy environments, as it increases the possibility that an entire window of ACKs is lost, forcing a retransmission timeout. Implications When not in loss recovery, every ACK received by a TCP sender triggers the transmission of new data segments. 
The burst size is determined by the number of previously unacknowledged segments each ACK covers. Therefore, a TCP receiver ack'ing more than 2 segments at a time causes the sending TCP to generate a larger burst of traffic upon receipt of the ACK. This large burst of traffic can overwhelm an intervening gateway, leading to higher drop rates for both the connection and other connections passing through the congested gateway. In addition, the TCP slow start algorithm increases the congestion window by 1 segment for each ACK received. Therefore, increasing the ACK interval (thus decreasing the rate at which ACKs are transmitted) increases the amount of time it takes slow start to increase the congestion window to an appropriate operating point, and the connection consequently suffers from reduced performance. This is especially true for connections using large windows. Relevant RFCs RFC 1122 outlines delayed ACKs as a recommended mechanism. Trace file demonstrating it Trace file taken using tcpdump at host B, the data receiver (and ACK originator). The advertised window (which never changed) and timestamp options have been omitted for clarity, except for the first packet sent by A: 12:09:24.820187 A.1174 > B.3999: . 2049:3497(1448) ack 1 win 33580 <nop,nop,timestamp 2249877 2249914> [tos 0x8] 12:09:24.824147 A.1174 > B.3999: . 3497:4945(1448) ack 1 12:09:24.832034 A.1174 > B.3999: . 4945:6393(1448) ack 1 12:09:24.832222 B.3999 > A.1174: . ack 6393 12:09:24.934837 A.1174 > B.3999: . 6393:7841(1448) ack 1 12:09:24.942721 A.1174 > B.3999: . 7841:9289(1448) ack 1 12:09:24.950605 A.1174 > B.3999: . 9289:10737(1448) ack 1 12:09:24.950797 B.3999 > A.1174: . ack 10737 12:09:24.958488 A.1174 > B.3999: . 10737:12185(1448) ack 1 12:09:25.052330 A.1174 > B.3999: . 12185:13633(1448) ack 1 12:09:25.060216 A.1174 > B.3999: . 13633:15081(1448) ack 1 12:09:25.060405 B.3999 > A.1174: . ack 15081 This portion of the trace clearly shows that the receiver (host B) sends an ACK for every third full sized packet received. Further investigation of this implementation found that the cause of the increased ACK interval was the TCP options being used. The implementation sent an ACK after it was holding 2*MSS worth of unacknowledged data. In the above case, the MSS is 1460 bytes so the receiver transmits an ACK after it is holding at least 2920 bytes of unacknowledged data. However, the length of the TCP options being used [RFC1323] took 12 bytes away from the data portion of each packet. This produced packets containing 1448 bytes of data. But the additional bytes used by the options in the header were not taken into account when determining when to trigger an ACK. Therefore, it took 3 data segments before the data receiver was holding enough unacknowledged data (>= 2*MSS, or 2920 bytes in the above example) to transmit an ACK. Trace file demonstrating correct behavior Trace file taken using tcpdump at host B, the data receiver (and ACK originator), again with window and timestamp information omitted except for the first packet: 12:06:53.627320 A.1172 > B.3999: . 1449:2897(1448) ack 1 win 33580 <nop,nop,timestamp 2249575 2249612> [tos 0x8] 12:06:53.634773 A.1172 > B.3999: . 2897:4345(1448) ack 1 12:06:53.634961 B.3999 > A.1172: . ack 4345 12:06:53.737326 A.1172 > B.3999: . 4345:5793(1448) ack 1 12:06:53.744401 A.1172 > B.3999: . 5793:7241(1448) ack 1 12:06:53.744592 B.3999 > A.1172: . ack 7241 12:06:53.752287 A.1172 > B.3999: . 7241:8689(1448) ack 1 12:06:53.847332 A.1172 > B.3999: . 
8689:10137(1448) ack 1 12:06:53.847525 B.3999 > A.1172: . ack 10137 This trace shows the TCP receiver (host B) ack'ing every second full-sized packet, according to [RFC1122]. This is the same implementation shown above, with slight modifications that allow the receiver to take the length of the options into account when deciding when to transmit an ACK. References This problem is documented in [Allman97] and [Paxson97]. How to detect Stretch ACK violations show up immediately in receiver-side packet traces of bulk transfers, as shown above. However, packet traces made on the sender side of the TCP connection may lead to ambiguities when diagnosing this problem due to the possibility of lost ACKs. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-09 23:48 ` David S. Miller @ 2006-03-10 0:10 ` Michael S. Tsirkin 2006-03-10 0:38 ` Michael S. Tsirkin 2006-03-10 7:18 ` David S. Miller 2006-03-10 0:21 ` Rick Jones 1 sibling, 2 replies; 35+ messages in thread From: Michael S. Tsirkin @ 2006-03-10 0:10 UTC (permalink / raw) To: David S. Miller; +Cc: netdev, rdreier, linux-kernel, openib-general, shemminger Quoting David S. Miller <davem@davemloft.net>: > Description > To improve efficiency (both computer and network) a data receiver > may refrain from sending an ACK for each incoming segment, > according to [RFC1122]. However, an ACK should not be delayed an > inordinate amount of time. Specifically, ACKs SHOULD be sent for > every second full-sized segment that arrives. If a second full- > sized segment does not arrive within a given timeout (of no more > than 0.5 seconds), an ACK should be transmitted, according to > [RFC1122]. A TCP receiver which does not generate an ACK for > every second full-sized segment exhibits a "Stretch ACK > Violation". Thanks very much for the info! So the longest we can delay, according to this spec, is until we have two full sized segments. But with the change we are discussing, could an ack now be sent even sooner than we have at least two full sized segments? Or does __tcp_ack_snd_check delay until we have at least two full sized segments? David, could you explain please? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-10 0:10 ` Michael S. Tsirkin @ 2006-03-10 0:38 ` Michael S. Tsirkin 0 siblings, 0 replies; 35+ messages in thread From: Michael S. Tsirkin @ 2006-03-10 0:38 UTC (permalink / raw) To: David S. Miller; +Cc: rdreier, netdev, linux-kernel, openib-general, shemminger Quoting r. Michael S. Tsirkin <mst@mellanox.co.il>: > Or does __tcp_ack_snd_check delay until we have at least two full sized > segments? What I'm trying to say is: since RFC 2525, section 2.13, talks about "every second full-sized segment", then, following the code in __tcp_ack_snd_check, why does it do

    /* More than one full frame received... */
    if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss

rather than

    /* At least two full frames received... */
    if (((tp->rcv_nxt - tp->rcv_wup) >= 2 * inet_csk(sk)->icsk_ack.rcv_mss

-- Michael S. Tsirkin Staff Engineer, Mellanox Technologies ^ permalink raw reply [flat|nested] 35+ messages in thread
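For readers without the source at hand, both fragments above come from logic that looks roughly like the following; this is an abridged reconstruction of the 2.6-era __tcp_ack_snd_check(), not a verbatim copy, and David's answer below refers to its first condition:

/* Abridged reconstruction of the 2.6-era receiver ACK decision; may not
 * match any particular release exactly.  An immediate ACK goes out only
 * once more than one full-sized frame is pending, i.e. on the second
 * full segment; otherwise the ACK is delayed.
 */
static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
{
	struct tcp_sock *tp = tcp_sk(sk);

	    /* More than one full frame received... */
	if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss &&
	     /* ...and the advertised window can advance far enough... */
	     __tcp_select_window(sk) >= tp->rcv_wnd) ||
	    /* ...or we are in quickack mode... */
	    tcp_in_quickack_mode(sk) ||
	    /* ...or we have out-of-order data. */
	    (ofo_possible && skb_peek(&tp->out_of_order_queue) != NULL)) {
		/* Then ack it now */
		tcp_send_ack(sk);
	} else {
		/* Else, send delayed ack. */
		tcp_send_delayed_ack(sk);
	}
}

When exactly one full-sized segment is pending, rcv_nxt - rcv_wup equals rcv_mss, which does not satisfy the strict greater-than test, so the ACK waits either for the second full segment or for the delayed-ACK timer.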
* Re: TSO and IPoIB performance degradation 2006-03-10 0:10 ` Michael S. Tsirkin 2006-03-10 0:38 ` Michael S. Tsirkin @ 2006-03-10 7:18 ` David S. Miller 1 sibling, 0 replies; 35+ messages in thread From: David S. Miller @ 2006-03-10 7:18 UTC (permalink / raw) To: mst; +Cc: netdev, rdreier, linux-kernel, openib-general, shemminger From: "Michael S. Tsirkin" <mst@mellanox.co.il> Date: Fri, 10 Mar 2006 02:10:31 +0200 > But with the change we are discussing, could an ack now be sent even > sooner than we have at least two full sized segments? Or does > __tcp_ack_snd_check delay until we have at least two full sized > segments? David, could you explain please? __tcp_ack_snd_check() delays until we have at least two full sized segments. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-09 23:48 ` David S. Miller 2006-03-10 0:10 ` Michael S. Tsirkin @ 2006-03-10 0:21 ` Rick Jones 2006-03-10 7:23 ` David S. Miller 1 sibling, 1 reply; 35+ messages in thread From: Rick Jones @ 2006-03-10 0:21 UTC (permalink / raw) To: netdev; +Cc: rdreier, linux-kernel, openib-general David S. Miller wrote: > From: "Michael S. Tsirkin" <mst@mellanox.co.il> > Date: Wed, 8 Mar 2006 14:53:11 +0200 > > >>What I was trying to figure out was, how can we re-enable the trick >>without hurting TSO? Could a solution be to simply look at the frame >>size, and call tcp_send_delayed_ack if the frame size is small? > > > The change is really not related to TSO. > > By reverting it, you are reducing the number of ACKs on the wire, and > the number of context switches at the sender to push out new data. > That's why it can make things go faster, but it also leads to bursty > TCP sender behavior, which is bad for congestion on the internet. naughty naughty Solaris and HP-UX TCP :) > > When the receiver has a strong cpu and can keep up with the incoming > packet rate very well and we are in an environment with no congestion, > the old code helps a lot. But if the receiver is cpu limited or we > have congestion of any kind, it does exactly the wrong thing. It will > delay ACKs a very long time to the point where the pipe is depleted > and this kills performance in that case. For congested environments, > due to the decreased ACK feedback, packet loss recovery will be > extremely poor. This is the first reason behind my change. well, there are stacks which do "stretch acks" (after a fashion) that make sure when they see packet loss to "do the right thing" wrt sending enough acks to allow cwnds to open again in a timely fashion. that brings-back all that stuff I posted ages ago about the performance delta when using an HP-UX receiver and altering the number of segmetns per ACK. should be in the netdev archive somewhere. might have been around the time of the discussions about MacOS and its ack avoidance - which wasn't done very well at the time. > > The behavior is also specifically frowned upon in the TCP implementor > community. It is specifically mentioned in the Known TCP > Implementation Problems RFC2525, in section 2.13 "Stretch ACK > violation". > > The entry, quoted below for reference, is very clear on the reasons > why stretch ACKs are bad. And although it may help performance for > your case, in congested environments and also with cpu limited > receivers it will have a negative impact on performance. So, this was > the second reason why I made this change. I would have thought that a receiver "stretching ACK's" would be helpful when it was CPU limited since it was spending fewer CPU cycles generating ACKs? > > So reverting the change isn't really an option. > > Name of Problem > Stretch ACK violation > > Classification > Congestion Control/Performance > > Description > To improve efficiency (both computer and network) a data receiver > may refrain from sending an ACK for each incoming segment, > according to [RFC1122]. However, an ACK should not be delayed an > inordinate amount of time. Specifically, ACKs SHOULD be sent for > every second full-sized segment that arrives. If a second full- > sized segment does not arrive within a given timeout (of no more > than 0.5 seconds), an ACK should be transmitted, according to > [RFC1122]. 
A TCP receiver which does not generate an ACK for > every second full-sized segment exhibits a "Stretch ACK > Violation". How can it be a "violation" of a SHOULD?-) > > Significance > TCP receivers exhibiting this behavior will cause TCP senders to > generate burstier traffic, which can degrade performance in > congested environments. In addition, generating fewer ACKs > increases the amount of time needed by the slow start algorithm to > open the congestion window to an appropriate point, which > diminishes performance in environments with large bandwidth-delay > products. Finally, generating fewer ACKs may cause needless > retransmission timeouts in lossy environments, as it increases the > possibility that an entire window of ACKs is lost, forcing a > retransmission timeout. Of those three, I think the most meaningful is the second, which can be dealt with by smarts in the ACK-stretching receiver. For the first, it will only degrade performance if it triggers packet loss. I'm not sure I've ever seen the third item happen. > > Implications > When not in loss recovery, every ACK received by a TCP sender > triggers the transmission of new data segments. The burst size is > determined by the number of previously unacknowledged segments > each ACK covers. Therefore, a TCP receiver ack'ing more than 2 > segments at a time causes the sending TCP to generate a larger > burst of traffic upon receipt of the ACK. This large burst of > traffic can overwhelm an intervening gateway, leading to higher > drop rates for both the connection and other connections passing > through the congested gateway. Doesn't RED mean that those other connections are rather less likely to be affected? > > In addition, the TCP slow start algorithm increases the congestion > window by 1 segment for each ACK received. Therefore, increasing > the ACK interval (thus decreasing the rate at which ACKs are > transmitted) increases the amount of time it takes slow start to > increase the congestion window to an appropriate operating point, > and the connection consequently suffers from reduced performance. > This is especially true for connections using large windows. This one is dealt with by ABC isn't it? At least in part since ABC appears to cap cwnd increase at 2*SMSS (I only know this because I just read the RFC mentioned in another thread - seems a bit much to have made that limit a MUST rather than a SHOULD :) rick jones ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-10 0:21 ` Rick Jones @ 2006-03-10 7:23 ` David S. Miller 2006-03-10 17:44 ` Rick Jones 2006-03-20 9:06 ` Michael S. Tsirkin 0 siblings, 2 replies; 35+ messages in thread From: David S. Miller @ 2006-03-10 7:23 UTC (permalink / raw) To: rick.jones2; +Cc: netdev, rdreier, linux-kernel, openib-general From: Rick Jones <rick.jones2@hp.com> Date: Thu, 09 Mar 2006 16:21:05 -0800 > well, there are stacks which do "stretch acks" (after a fashion) that > make sure when they see packet loss to "do the right thing" wrt sending > enough acks to allow cwnds to open again in a timely fashion. Once a loss happens, it's too late to stop doing the stretch ACKs, the damage is done already. It is going to take you at least one extra RTT to recover from the loss compared to if you were not doing stretch ACKs. You have to keep giving consistent well spaced ACKs back to the receiver in order to recover from loss optimally. The ACK every 2 full sized frames behavior of TCP is absolutely essential. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-10 7:23 ` David S. Miller @ 2006-03-10 17:44 ` Rick Jones 2006-03-20 9:06 ` Michael S. Tsirkin 1 sibling, 0 replies; 35+ messages in thread From: Rick Jones @ 2006-03-10 17:44 UTC (permalink / raw) To: netdev; +Cc: rdreier, linux-kernel, openib-general David S. Miller wrote: > From: Rick Jones <rick.jones2@hp.com> > Date: Thu, 09 Mar 2006 16:21:05 -0800 > > >>well, there are stacks which do "stretch acks" (after a fashion) that >>make sure when they see packet loss to "do the right thing" wrt sending >>enough acks to allow cwnds to open again in a timely fashion. > > > Once a loss happens, it's too late to stop doing the stretch ACKs, the > damage is done already. It is going to take you at least one > extra RTT to recover from the loss compared to if you were not doing > stretch ACKs. I must be dense (entirely possible), but how is that absolute? If there is no more data in flight after the segment that was lost the "stretch ACK" stacks with which I'm familiar will generate the standalone ACK within the deferred ACK interval (50 milliseconds). I guess that can be the "one extra RTT" However, if there is data in flight after the point of loss, the immediate ACK upon receipt of out-of order data kicks in. > You have to keep giving consistent well spaced ACKs back to the > receiver in order to recover from loss optimally. The key there is defining consistent and well spaced. Certainly an ACK only after a window's-worth of data would not be well spaced, but I believe that an ACK after more than two full sized frames could indeed be well-spaced. > The ACK every 2 full sized frames behavior of TCP is absolutely > essential. I don't think it is _quite_ that cut and dried, otherwise, HP-UX and Solaris, since < 1997 would have had big time problems. rick jones ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-10 7:23 ` David S. Miller 2006-03-10 17:44 ` Rick Jones @ 2006-03-20 9:06 ` Michael S. Tsirkin 2006-03-20 9:55 ` David S. Miller 1 sibling, 1 reply; 35+ messages in thread From: Michael S. Tsirkin @ 2006-03-20 9:06 UTC (permalink / raw) To: David S. Miller Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general Quoting r. David S. Miller <davem@davemloft.net>: > > well, there are stacks which do "stretch acks" (after a fashion) that > > make sure when they see packet loss to "do the right thing" wrt sending > > enough acks to allow cwnds to open again in a timely fashion. > > Once a loss happens, it's too late to stop doing the stretch ACKs, the > damage is done already. It is going to take you at least one > extra RTT to recover from the loss compared to if you were not doing > stretch ACKs. > > You have to keep giving consistent well spaced ACKs back to the > receiver in order to recover from loss optimally. Is it the case then that this requirement is less essential on networks such as IP over InfiniBand, which are very low latency and essentially lossless (with explicit congestion notifications in hardware)? > The ACK every 2 full sized frames behavior of TCP is absolutely > essential. Interestingly, I was pointed towards the following RFC draft http://www.ietf.org/internet-drafts/draft-ietf-tcpm-rfc2581bis-00.txt The requirement that an ACK "SHOULD" be generated for at least every second full-sized segment is listed in [RFC1122] in one place as a SHOULD and another as a MUST. Here we unambiguously state it is a SHOULD. We also emphasize that this is a SHOULD, meaning that an implementor should indeed only deviate from this requirement after careful consideration of the implications. And as Matt Leininger's research appears to show, stretch ACKs are good for performance in the case of IP over InfiniBand. Given all this, would it make sense to add a per-netdevice (or per-neighbour) flag to re-enable the trick for these net devices (as was done before 314324121f9b94b2ca657a494cf2b9cb0e4a28cc)? The IP over InfiniBand driver would then simply set this flag. David, would you accept such a patch? It would be nice to get 2.6.17 back to within at least 10% of 2.6.11. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-20 9:06 ` Michael S. Tsirkin @ 2006-03-20 9:55 ` David S. Miller 2006-03-20 10:22 ` Michael S. Tsirkin 0 siblings, 1 reply; 35+ messages in thread From: David S. Miller @ 2006-03-20 9:55 UTC (permalink / raw) To: mst; +Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general From: "Michael S. Tsirkin" <mst@mellanox.co.il> Date: Mon, 20 Mar 2006 11:06:29 +0200 > Is it the case then that this requirement is less essential on > networks such as IP over InfiniBand, which are very low latency > and essencially lossless (with explicit congestion contifications > in hardware)? You can never assume any attribute of the network whatsoever. Even if initially the outgoing device is IPoIB, something in the middle, like a traffic classification or netfilter rule, could rewrite the packet and make it go somewhere else. This even applies to loopback packets, because packets can get rewritten and redirected even once they are passed in via netif_receive_skb(). > And as Matt Leininger's research appears to show, stretch ACKs > are good for performance in case of IP over InfiniBand. > > Given all this, would it make sense to add a per-netdevice (or per-neighbour) > flag to re-enable the trick for these net devices (as was done before > 314324121f9b94b2ca657a494cf2b9cb0e4a28cc)? > IP over InfiniBand driver would then simply set this flag. See above, this is not feasible. The path an SKB can take is opaque and unknown until the very last moment it is actually given to the device transmit function. People need to get the "special case this topology" ideas out of their heads. :-) ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-20 9:55 ` David S. Miller @ 2006-03-20 10:22 ` Michael S. Tsirkin 2006-03-20 10:37 ` David S. Miller 0 siblings, 1 reply; 35+ messages in thread From: Michael S. Tsirkin @ 2006-03-20 10:22 UTC (permalink / raw) To: David S. Miller Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general Quoting r. David S. Miller <davem@davemloft.net>: > The path an SKB can take is opaque and unknown until the very last > moment it is actually given to the device transmit function. Why, I was proposing looking at dst cache. If that's NULL, well, we won't stretch ACKs. Worst case we apply the wrong optimization. Right? > People need to get the "special case this topology" ideas out of their > heads. :-) Okay, I get that. What I'd like to clarify, however: rfc2581 explicitly states that in some cases it might be OK to generate ACKs less frequently than every second full-sized segment. Given Matt's measurements, TCP on top of IP over InfiniBand on Linux seems to hit one of these cases. Do you agree to that? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies ^ permalink raw reply [flat|nested] 35+ messages in thread
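To make the dst-cache idea concrete, the check being described might look something like the sketch below. NETIF_F_STRETCH_ACK is an invented name for a hypothetical per-netdevice opt-in flag (no such flag exists), and, as the reply below argues, the cached dst only says where this socket's own packets would go out, not which path the peer's data actually took:

#include <linux/netdevice.h>
#include <net/dst.h>
#include <net/sock.h>

/* Hypothetical flag, not part of the kernel: a netdevice would set it to
 * declare that stretch ACKs are acceptable on this link. */
#define NETIF_F_STRETCH_ACK	0x80000000

/* Illustrative sketch of the proposal above, not a real patch. */
static int sk_allows_stretch_ack(struct sock *sk)
{
	struct dst_entry *dst = __sk_dst_get(sk);

	/* No cached route: assume the worst and do not stretch ACKs. */
	if (!dst || !dst->dev)
		return 0;

	return (dst->dev->features & NETIF_F_STRETCH_ACK) != 0;
}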
* Re: TSO and IPoIB performance degradation 2006-03-20 10:22 ` Michael S. Tsirkin @ 2006-03-20 10:37 ` David S. Miller 2006-03-20 11:27 ` Michael S. Tsirkin 2006-04-27 4:13 ` Troy Benjegerdes 0 siblings, 2 replies; 35+ messages in thread From: David S. Miller @ 2006-03-20 10:37 UTC (permalink / raw) To: mst; +Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general From: "Michael S. Tsirkin" <mst@mellanox.co.il> Date: Mon, 20 Mar 2006 12:22:34 +0200 > Quoting r. David S. Miller <davem@davemloft.net>: > > The path an SKB can take is opaque and unknown until the very last > > moment it is actually given to the device transmit function. > > Why, I was proposing looking at dst cache. If that's NULL, well, > we won't stretch ACKs. Worst case we apply the wrong optimization. > Right? Where you receive a packet from isn't very useful for determining even the full path on which that packet itself flowed. More importantly, packets also do not necessarily go back out over the same path on which packets are received for a connection. This is actually quite common. Maybe packets for this connection come in via IPoIB but go out via gigabit ethernet and another route altogether. > What I'd like to clarify, however: rfc2581 explicitly states that in > some cases it might be OK to generate ACKs less frequently than > every second full-sized segment. Given Matt's measurements, TCP on > top of IP over InfiniBand on Linux seems to hit one of these cases. > Do you agree to that? I disagree with Linux changing its behavior. It would be great to turn off congestion control completely over local gigabit networks, but that isn't determinable in any way, so we don't do that. The IPoIB situation is no different, you can set all the bits you want in incoming packets, the barrier to doing this remains the same. It hurts performance if any packet drop occurs because it will require an extra round trip for recovery to begin to be triggered at the sender. The network is a black box, routes to and from a destination are arbitrary, and so is packet rewriting and reflection, so being able to say "this all occurs on IPoIB" is simply infeasible. I don't know how else to say this, we simply cannot special case IPoIB or any other topology type. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-20 10:37 ` David S. Miller @ 2006-03-20 11:27 ` Michael S. Tsirkin 2006-03-20 11:47 ` Arjan van de Ven 2006-04-27 4:13 ` Troy Benjegerdes 1 sibling, 1 reply; 35+ messages in thread From: Michael S. Tsirkin @ 2006-03-20 11:27 UTC (permalink / raw) To: David S. Miller Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general Quoting David S. Miller <davem@davemloft.net>: > I disagree with Linux changing it's behavior. It would be great to > turn off congestion control completely over local gigabit networks, > but that isn't determinable in any way, so we don't do that. Interesting. Would it make sense to make it another tunable knob in /proc, sysfs or sysctl then? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-20 11:27 ` Michael S. Tsirkin @ 2006-03-20 11:47 ` Arjan van de Ven 2006-03-20 11:49 ` Lennert Buytenhek 0 siblings, 1 reply; 35+ messages in thread From: Arjan van de Ven @ 2006-03-20 11:47 UTC (permalink / raw) To: Michael S. Tsirkin Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general, David S. Miller On Mon, 2006-03-20 at 13:27 +0200, Michael S. Tsirkin wrote: > Quoting David S. Miller <davem@davemloft.net>: > > I disagree with Linux changing it's behavior. It would be great to > > turn off congestion control completely over local gigabit networks, > > but that isn't determinable in any way, so we don't do that. > > Interesting. Would it make sense to make it another tunable knob in > /proc, sysfs or sysctl then? that's not the right level; since that is per interface. And you only know the actual interface waay too late (as per earlier posts). Per socket.. maybe But then again it's not impossible to have packets for one socket go out to multiple interfaces (think load balancing bonding over 2 interfaces, one IB another ethernet) ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-20 11:47 ` Arjan van de Ven @ 2006-03-20 11:49 ` Lennert Buytenhek 2006-03-20 11:53 ` Arjan van de Ven 2006-03-20 12:04 ` Michael S. Tsirkin 0 siblings, 2 replies; 35+ messages in thread From: Lennert Buytenhek @ 2006-03-20 11:49 UTC (permalink / raw) To: Arjan van de Ven Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general, David S. Miller On Mon, Mar 20, 2006 at 12:47:03PM +0100, Arjan van de Ven wrote: > > > I disagree with Linux changing it's behavior. It would be great to > > > turn off congestion control completely over local gigabit networks, > > > but that isn't determinable in any way, so we don't do that. > > > > Interesting. Would it make sense to make it another tunable knob in > > /proc, sysfs or sysctl then? > > that's not the right level; since that is per interface. And you only > know the actual interface waay too late (as per earlier posts). > Per socket.. maybe > But then again it's not impossible to have packets for one socket go out > to multiple interfaces > (think load balancing bonding over 2 interfaces, one IB another > ethernet) I read it as if he was proposing to have a sysctl knob to turn off TCP congestion control completely (which has so many issues it's not even funny.) ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-20 11:49 ` Lennert Buytenhek @ 2006-03-20 11:53 ` Arjan van de Ven 2006-03-20 13:35 ` Michael S. Tsirkin 2006-03-20 12:04 ` Michael S. Tsirkin 1 sibling, 1 reply; 35+ messages in thread From: Arjan van de Ven @ 2006-03-20 11:53 UTC (permalink / raw) To: Lennert Buytenhek Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general, David S. Miller On Mon, 2006-03-20 at 12:49 +0100, Lennert Buytenhek wrote: > On Mon, Mar 20, 2006 at 12:47:03PM +0100, Arjan van de Ven wrote: > > > > > I disagree with Linux changing it's behavior. It would be great to > > > > turn off congestion control completely over local gigabit networks, > > > > but that isn't determinable in any way, so we don't do that. > > > > > > Interesting. Would it make sense to make it another tunable knob in > > > /proc, sysfs or sysctl then? > > > > that's not the right level; since that is per interface. And you only > > know the actual interface waay too late (as per earlier posts). > > Per socket.. maybe > > But then again it's not impossible to have packets for one socket go out > > to multiple interfaces > > (think load balancing bonding over 2 interfaces, one IB another > > ethernet) > > I read it as if he was proposing to have a sysctl knob to turn off > TCP congestion control completely (which has so many issues it's not > even funny.) owww that's so bad I didn't even consider that ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-20 11:53 ` Arjan van de Ven @ 2006-03-20 13:35 ` Michael S. Tsirkin 0 siblings, 0 replies; 35+ messages in thread From: Michael S. Tsirkin @ 2006-03-20 13:35 UTC (permalink / raw) To: Arjan van de Ven Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general, David S. Miller, Lennert Buytenhek Quoting Arjan van de Ven <arjan@infradead.org>: > > I read it as if he was proposing to have a sysctl knob to turn off > > TCP congestion control completely (which has so many issues it's not > > even funny.) > > owww that's so bad I didn't even consider that No, I think that comment was taken out of thread context. We were talking about stretching ACKs - while avoiding stretch ACKs is important for TCP congestion control, it's not the only mechanism. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-20 11:49 ` Lennert Buytenhek 2006-03-20 11:53 ` Arjan van de Ven @ 2006-03-20 12:04 ` Michael S. Tsirkin 2006-03-20 15:09 ` Benjamin LaHaise 1 sibling, 1 reply; 35+ messages in thread From: Michael S. Tsirkin @ 2006-03-20 12:04 UTC (permalink / raw) To: Lennert Buytenhek Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general, David S. Miller, Arjan van de Ven Quoting r. Lennert Buytenhek <buytenh@wantstofly.org>: > > > > I disagree with Linux changing it's behavior. It would be great to > > > > turn off congestion control completely over local gigabit networks, > > > > but that isn't determinable in any way, so we don't do that. > > > > > > Interesting. Would it make sense to make it another tunable knob in > > > /proc, sysfs or sysctl then? > > > > that's not the right level; since that is per interface. And you only > > know the actual interface waay too late (as per earlier posts). > > Per socket.. maybe > > But then again it's not impossible to have packets for one socket go out > > to multiple interfaces > > (think load balancing bonding over 2 interfaces, one IB another > > ethernet) > > I read it as if he was proposing to have a sysctl knob to turn off > TCP congestion control completely (which has so many issues it's not > even funny.) Not really, that was David :) What started this thread was the fact that since 2.6.11 Linux does not stretch ACKs anymore. RFC 2581 does mention that it might be OK to stretch ACKs "after careful consideration", and we are seeing that it helps IP over InfiniBand, so recent Linux kernels perform worse in that respect. And since there does not seem to be a way to figure it out automagically when doing this is a good idea, I proposed adding some kind of knob that will let the user apply the consideration for us. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-20 12:04 ` Michael S. Tsirkin @ 2006-03-20 15:09 ` Benjamin LaHaise 2006-03-20 18:58 ` Rick Jones 2006-03-20 23:00 ` David S. Miller 0 siblings, 2 replies; 35+ messages in thread From: Benjamin LaHaise @ 2006-03-20 15:09 UTC (permalink / raw) To: Michael S. Tsirkin Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general, David S. Miller, Lennert Buytenhek, Arjan van de Ven On Mon, Mar 20, 2006 at 02:04:07PM +0200, Michael S. Tsirkin wrote: > does not stretch ACKs anymore. RFC 2581 does mention that it might be OK to > stretch ACKs "after careful consideration", and we are seeing that it helps > IP over InfiniBand, so recent Linux kernels perform worse in that respect. > > And since there does not seem to be a way to figure out automagically when > doing this is a good idea, I proposed adding some kind of knob that will let the > user apply the consideration for us. Wouldn't it make sense to stretch the ACK when the previous ACK is still in the TX queue of the device? I know that sort of behaviour was always an issue on modem links where you don't want to send out redundant ACKs. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <dont@kvack.org>. ^ permalink raw reply [flat|nested] 35+ messages in thread
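A rough sketch of the check Ben describes, under two loud assumptions: that the stack kept a pointer to the last pure ACK it transmitted (it does not; struct tcp_sock has no such field), and that the skb refcount is a usable proxy for "still sitting in the device TX queue" (skb->users is an atomic_t in the kernels under discussion):

#include <linux/skbuff.h>

/* Hypothetical helper: non-zero if the previously sent ACK still seems
 * to be queued in the driver, i.e. somebody besides us holds a
 * reference to it.  The caller would have to stash last_ack_skb in an
 * invented per-socket field at the time the ACK was sent. */
static int prev_ack_still_queued(struct sk_buff *last_ack_skb)
{
        return last_ack_skb && atomic_read(&last_ack_skb->users) > 1;
}

If this returned true the receiver would hold back and let a single later ACK advertise the larger window. Rick's and Dave's replies below spell out why the refcount is a weak proxy: a bare ACK may not be referenced at all, and the count only drops at TX-completion interrupt time.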
* Re: TSO and IPoIB performance degradation 2006-03-20 15:09 ` Benjamin LaHaise @ 2006-03-20 18:58 ` Rick Jones 2006-03-20 23:00 ` David S. Miller 1 sibling, 0 replies; 35+ messages in thread From: Rick Jones @ 2006-03-20 18:58 UTC (permalink / raw) To: Benjamin LaHaise Cc: netdev, rdreier, linux-kernel, openib-general, David S. Miller, Lennert Buytenhek, Arjan van de Ven > Wouldn't it make sense to stretch the ACK when the previous ACK is still in > the TX queue of the device? I know that sort of behaviour was always an > issue on modem links where you don't want to send out redundant ACKs. Perhaps, but it isn't clear that it would be worth the cycles to check. I doubt that a simple reference count on the ACK skb would do it, since if it were a bare ACK I doubt that TCP keeps a reference to the skb in the first place? Also, what would be the "trigger" to send the next ACK after the previous one had left the building (Elvis-like)? Receipt of N in-order segments? A timeout? If you are going to go ahead and try to do stretch-ACKs, then I suspect the way to go about doing it is to have it behave very much like HP-UX or Solaris, both of which have arguably reasonable ACK-avoidance heuristics in them. But don't try to do it quick and dirty. rick "likes ACK avoidance, just check the archives" jones on netdev, no need to cc me directly ^ permalink raw reply [flat|nested] 35+ messages in thread
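To make the shape of such a heuristic concrete, here is one guess at a "careful" receiver-side policy, not taken from HP-UX or Solaris: ACK at least every Nth in-order full-sized segment, and sooner if the data not yet acknowledged approaches a fraction of the advertised window, with the delayed-ACK timer left armed as the backstop. Both thresholds below are assumptions.

#include <linux/tcp.h>

/* Sketch of an ACK-avoidance decision for the receive path.  "segs" is
 * a hypothetical count of in-order full-sized segments received since
 * the last ACK; the 8-segment and quarter-window limits are arbitrary. */
static int should_ack_now(const struct tcp_sock *tp, unsigned int segs)
{
        u32 unacked = tp->rcv_nxt - tp->rcv_wup;  /* data not yet ACKed */

        if (unacked >= (tp->rcv_wnd >> 2))        /* don't starve the sender's window */
                return 1;

        return segs >= 8;                         /* otherwise every 8th segment */
}

Picking the segment count and the fallback timer so that the sender's congestion window still opens at a sane rate is exactly the part that cannot be done quick and dirty.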
* Re: TSO and IPoIB performance degradation 2006-03-20 15:09 ` Benjamin LaHaise 2006-03-20 18:58 ` Rick Jones @ 2006-03-20 23:00 ` David S. Miller 1 sibling, 0 replies; 35+ messages in thread From: David S. Miller @ 2006-03-20 23:00 UTC (permalink / raw) To: bcrl Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general, buytenh, arjan From: Benjamin LaHaise <bcrl@kvack.org> Date: Mon, 20 Mar 2006 10:09:42 -0500 > Wouldn't it make sense to stretch the ACK when the previous ACK is still in > the TX queue of the device? I know that sort of behaviour was always an > issue on modem links where you don't want to send out redundant ACKs. I thought about doing some similar trick with TSO, wherein we would not defer a TSO send if all the previous packets sent are out of the device transmit queue. The idea was to prevent the pipe from ever emptying, which is the danger of deferring too much for TSO. This has several problems. It's hard to implement. You have to decide if you want precise state, by checking the TX descriptors. Or you go for imprecise but easier-to-implement state, which is very imprecise and therefore not very useful, by just checking the SKB refcount or similar. That means you find out it has left the TX queue only after the TX purge interrupt, which can be a long time after the event, and by then the pipe has emptied, which is what you were trying to prevent. Lastly, you don't want to touch remote cpu state, which is what such a hack is going to end up doing much of the time. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: TSO and IPoIB performance degradation 2006-03-20 10:37 ` David S. Miller 2006-03-20 11:27 ` Michael S. Tsirkin @ 2006-04-27 4:13 ` Troy Benjegerdes 1 sibling, 0 replies; 35+ messages in thread From: Troy Benjegerdes @ 2006-04-27 4:13 UTC (permalink / raw) To: David S. Miller Cc: mst, rick.jones2, netdev, rdreier, linux-kernel, openib-general On Mon, Mar 20, 2006 at 02:37:04AM -0800, David S. Miller wrote: > From: "Michael S. Tsirkin" <mst@mellanox.co.il> > Date: Mon, 20 Mar 2006 12:22:34 +0200 > > > Quoting r. David S. Miller <davem@davemloft.net>: > > > The path an SKB can take is opaque and unknown until the very last > > > moment it is actually given to the device transmit function. > > > > Why, I was proposing looking at dst cache. If that's NULL, well, > > we won't stretch ACKs. Worst case we apply the wrong optimization. > > Right? > > Where you receive a packet from isn't very useful for determining > even the full path on which that packet itself flowed. > > More importantly, packets also do not necessarily go back out over the > same path on which packets are received for a connection. This is > actually quite common. > > Maybe packets for this connection come in via IPoIB but go out via > gigabit ethernet and another route altogether. > > > What I'd like to clarify, however: rfc2581 explicitly states that in > > some cases it might be OK to generate ACKs less frequently than > > every second full-sized segment. Given Matt's measurements, TCP on > > top of IP over InfiniBand on Linux seems to hit one of these cases. > > Do you agree to that? > > I disagree with Linux changing its behavior. It would be great to > turn off congestion control completely over local gigabit networks, > but that isn't determinable in any way, so we don't do that. > > The IPoIB situation is no different, you can set all the bits you want > in incoming packets, the barrier to doing this remains the same. > > It hurts performance if any packet drop occurs because it will require > an extra round trip for recovery to begin to be triggered at the > sender. > > The network is a black box, routes to and from a destination are > arbitrary, and so is packet rewriting and reflection, so being able to > say "this all occurs on IPoIB" is simply infeasible. > > I don't know how else to say this, we simply cannot special case IPoIB > or any other topology type. David is right. If you care about performance, you are already using SDP or verbs layer for the transport anyway. If I am going to be doing IPoIB, it's because eventually I expect the packet might get off the IB network and onto some other network and go halfway across the country. ^ permalink raw reply [flat|nested] 35+ messages in thread
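For reference, the dst-cache test Michael floats in the quoted exchange would amount to something like the sketch below; as Dave points out, a cached route only describes the first hop of the outgoing leg, and says nothing about later hops or about the path the peer's packets take back.

#include <linux/if_arp.h>
#include <net/dst.h>
#include <net/sock.h>

/* Sketch: does the socket's cached route currently point at an IPoIB
 * interface?  Returns 0 when there is no cached route, in which case
 * the caller would simply not stretch ACKs.  Assumes it is called with
 * the socket lock held. */
static int outgoing_dev_is_ipoib(struct sock *sk)
{
        struct dst_entry *dst = __sk_dst_get(sk);

        return dst && dst->dev && dst->dev->type == ARPHRD_INFINIBAND;
}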
end of thread, other threads: [~2006-04-27  4:13 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-03-06 22:34 TSO and IPoIB performance degradation Michael S. Tsirkin
2006-03-06 22:40 ` David S. Miller
2006-03-06 22:50 ` Stephen Hemminger
2006-03-07  3:13 ` Shirley Ma
2006-03-07 21:44 ` Matt Leininger
2006-03-07 21:49 ` Stephen Hemminger
2006-03-07 21:53 ` Michael S. Tsirkin
2006-03-08  0:11 ` Matt Leininger
2006-03-08  0:18 ` David S. Miller
2006-03-08  1:17 ` Roland Dreier
2006-03-08  1:23 ` David S. Miller
2006-03-08  1:34 ` Roland Dreier
2006-03-08 12:53 ` Michael S. Tsirkin
2006-03-08 20:53 ` David S. Miller
2006-03-09 23:48 ` David S. Miller
2006-03-10  0:10 ` Michael S. Tsirkin
2006-03-10  0:38 ` Michael S. Tsirkin
2006-03-10  7:18 ` David S. Miller
2006-03-10  0:21 ` Rick Jones
2006-03-10  7:23 ` David S. Miller
2006-03-10 17:44 ` Rick Jones
2006-03-20  9:06 ` Michael S. Tsirkin
2006-03-20  9:55 ` David S. Miller
2006-03-20 10:22 ` Michael S. Tsirkin
2006-03-20 10:37 ` David S. Miller
2006-03-20 11:27 ` Michael S. Tsirkin
2006-03-20 11:47 ` Arjan van de Ven
2006-03-20 11:49 ` Lennert Buytenhek
2006-03-20 11:53 ` Arjan van de Ven
2006-03-20 13:35 ` Michael S. Tsirkin
2006-03-20 12:04 ` Michael S. Tsirkin
2006-03-20 15:09 ` Benjamin LaHaise
2006-03-20 18:58 ` Rick Jones
2006-03-20 23:00 ` David S. Miller
2006-04-27  4:13 ` Troy Benjegerdes