TCP rx window autotuning harmful at LAN context

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* TCP rx window autotuning harmful at LAN context
@ 2009-03-09 11:25 Marian Ďurkovič
  2009-03-09 18:01 ` John Heffner
  2009-03-11  9:02 ` Rémi Denis-Courmont
  0 siblings, 2 replies; 30+ messages in thread
From: Marian Ďurkovič @ 2009-03-09 11:25 UTC (permalink / raw)
  To: netdev

Hi all,

  based on multiple user complaints about poor LAN performance with 
TCP window autotuning on receiver side we conducted several tests at
our university to verify whether these complaints are valid. Unfortunately,
our results confirmed, that the present implementation indeed behaves 
erratically in LAN context and causes serious harm to LAN operation.

  The behaviour could be descibed as "spiraling death" syndrome. While
TCP with constant and decently sized rx window natively reduces transmission
rate when RTT increases, autotuning performs exactly the opposite - as a
response to increased RTT it increases the rx window size (which in turn
again increases RTT...) As this happens again and again, the result is
complete waste of all available buffers at sending host or at the bottleneck
point, resulting in upto 267 msec (!) latency in LAN context (with 100 Mbps
ethernet connection, default txqueuelen=1000, MTU=1500 and sky2 driver).  
Needles to say that this means the LAN is almost unusable.

   With autotuning disabled, the same situation results in just 5 msec
latency and still full 100 Mpbs link utilization, since with 64 kB rx window
the TCP transmission is solely controlled by RTT without ever going into
congestion avoidance mode since there are no packet drops.

   As rx window autotuning is enabled in all recent kernels and with 1 GB
of RAM the maximum tcp_rmem becomes 4 MB, this problem is spreading rapidly
and we believe it needs urgent attention. As demontrated above, such huge
rx window (which is at least 100*BDP of the example above) does not deliver
any performance gain but instead it seriously harms other hosts and/or
applications. It should also be noted, that host with autotuning enabled
steals an unfair share of the total available bandwidth, which might look
like a "better" performing TCP stack at first sight - however such behaviour
is not appropriate (RFC2914, section 3.2).

   The possible solution to the above problem could be e.g. to limit RTT
measuring only to the initial phase of the TCP connection and computing
the BDP from this value for the whole lifetime of the connection. With
such modification, increases in RTT due to buffering at the bottleneck 
point will again reduce transmission rate and the "spiraling death" 
syndrome will no longer exist.

    Thanks kind regards,

--------------------------------------------------------------------------
----                                                                  ----
----   Marian Ďurkovič                       network  manager         ----
----                                                                  ----
----   Slovak Technical University           Tel: +421 2 571 041 81   ----
----   Computer Centre, Nám. Slobody 17      Fax: +421 2 524 94 351   ----
----   812 43 Bratislava, Slovak Republic    E-mail/sip: md@bts.sk    ----
----                                                                  ----
--------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-09 11:25 TCP rx window autotuning harmful at LAN context Marian Ďurkovič
@ 2009-03-09 18:01 ` John Heffner
  2009-03-09 20:05   ` Marian Ďurkovič
       [not found]   ` <20090309195906.M50328@bts.sk>
  2009-03-11  9:02 ` Rémi Denis-Courmont
  1 sibling, 2 replies; 30+ messages in thread
From: John Heffner @ 2009-03-09 18:01 UTC (permalink / raw)
  To: Marian Ďurkovič; +Cc: netdev

On Mon, Mar 9, 2009 at 4:25 AM, Marian Ďurkovič <md@bts.sk> wrote:
>   As rx window autotuning is enabled in all recent kernels and with 1 GB
> of RAM the maximum tcp_rmem becomes 4 MB, this problem is spreading rapidly
> and we believe it needs urgent attention. As demontrated above, such huge
> rx window (which is at least 100*BDP of the example above) does not deliver
> any performance gain but instead it seriously harms other hosts and/or
> applications. It should also be noted, that host with autotuning enabled
> steals an unfair share of the total available bandwidth, which might look
> like a "better" performing TCP stack at first sight - however such behaviour
> is not appropriate (RFC2914, section 3.2).

It's well known that "standard" TCP fills all available drop-tail
buffers, and that this behavior is not desirable.

The situation you describe is exactly what congestion control (the
topic of RFC2914) should fix.  It is not the role of receive window
(flow control).  It is really the sender's job to detect and react to
this, not the receiver's.  (We have had this discussion before on
netdev.)  There are a number of delay-based congestion control
algorithms that have been implemented and are available in Linux, but
all have proved problematic in many cases, and has not been suitable
to enable widely.  This is still an active research topic.

Another option in LANs is to enable AQM.  In Linux, you can configure
the bottleneck interface qdisc to be any of a number of RED-like early
droppers.  Most commercial routers also offer the ability to configure
AQM on interfaces, though most do not enable by default.

  -John

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-09 18:01 ` John Heffner
@ 2009-03-09 20:05   ` Marian Ďurkovič
  2009-03-09 20:24     ` Stephen Hemminger
  2009-03-10  0:09     ` David Miller
       [not found]   ` <20090309195906.M50328@bts.sk>
  1 sibling, 2 replies; 30+ messages in thread
From: Marian Ďurkovič @ 2009-03-09 20:05 UTC (permalink / raw)
  To: netdev

On Mon, 9 Mar 2009 11:01:52 -0700, John Heffner wrote
> On Mon, Mar 9, 2009 at 4:25 AM, Marian Ďurkovič <md@bts.sk> wrote:
> >   As rx window autotuning is enabled in all recent kernels and with 1 GB
> > of RAM the maximum tcp_rmem becomes 4 MB, this problem is spreading
> >   rapidly
> > and we believe it needs urgent attention. As demontrated above, such
> >   huge
> > rx window (which is at least 100*BDP of the example above) does not
> >   deliver
> > any performance gain but instead it seriously harms other hosts and/or
> > applications. It should also be noted, that host with autotuning enabled
> > steals an unfair share of the total available bandwidth, which might
> > look
> > like a "better" performing TCP stack at first sight - however such
> > behaviour
> > is not appropriate (RFC2914, section 3.2).
>
> It's well known that "standard" TCP fills all available drop-tail
> buffers, and that this behavior is not desirable.

Well, in practice that was always limited by receive window size, which
was by default 64 kB on most operating systems. So this undesirable behavior
was limited to hosts where receive window was manually increased to huge
values.

Today, the real effect of autotuning is the same as changing the receive window
size to 4 MB on *all* hosts, since there's no mechanism to prevent it from
growing the window to maximum even for low RTT paths.

> The situation you describe is exactly what congestion control (the
> topic of RFC2914) should fix.  It is not the role of receive window
> (flow control).  It is really the sender's job to detect and react to
> this, not the receiver's.  (We have had this discussion before on
> netdev.)

It's not of high importance whose job it is according to pure theory.
What matters is, that autotuning introduced serious problem at LAN context
by disabling any possibility to properly react to increasing RTT. Again,
it's not important whether this functionality was there by design or by
coincidence, but it was holding the system well-balanced for many years.

Now, as autotuning is enabled by default in stock kernel, this problem is
spreading into LANs without users even knowing what's going on. Therefore
I'd like to suggest to look for a decent fix which could be implemented
in relatively short time frame. My proposal is this:

- measure RTT during the initial phase of TCP connection (first X segments)
- compute maximal receive window size depending on measured RTT using
  configurable constant representing the bandwidth part of BDP
- let autotuning do its work upto that limit.

  With kind regards,

        M. 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-09 20:05   ` Marian Ďurkovič
@ 2009-03-09 20:24     ` Stephen Hemminger
  2009-03-10  0:09     ` David Miller
  1 sibling, 0 replies; 30+ messages in thread
From: Stephen Hemminger @ 2009-03-09 20:24 UTC (permalink / raw)
  To: Marian Ďurkovič; +Cc: netdev

On Mon, 9 Mar 2009 21:05:05 +0100
Marian Ďurkovič <md@bts.sk> wrote:

> On Mon, 9 Mar 2009 11:01:52 -0700, John Heffner wrote
> > On Mon, Mar 9, 2009 at 4:25 AM, Marian Ďurkovič <md@bts.sk> wrote:
> > >   As rx window autotuning is enabled in all recent kernels and with 1 GB
> > > of RAM the maximum tcp_rmem becomes 4 MB, this problem is spreading
> > >   rapidly
> > > and we believe it needs urgent attention. As demontrated above, such
> > >   huge
> > > rx window (which is at least 100*BDP of the example above) does not
> > >   deliver
> > > any performance gain but instead it seriously harms other hosts and/or
> > > applications. It should also be noted, that host with autotuning enabled
> > > steals an unfair share of the total available bandwidth, which might
> > > look
> > > like a "better" performing TCP stack at first sight - however such
> > > behaviour
> > > is not appropriate (RFC2914, section 3.2).
> >
> > It's well known that "standard" TCP fills all available drop-tail
> > buffers, and that this behavior is not desirable.
> 
> Well, in practice that was always limited by receive window size, which
> was by default 64 kB on most operating systems. So this undesirable behavior
> was limited to hosts where receive window was manually increased to huge
> values.
> 
> Today, the real effect of autotuning is the same as changing the receive window
> size to 4 MB on *all* hosts, since there's no mechanism to prevent it from
> growing the window to maximum even for low RTT paths.
> 
> > The situation you describe is exactly what congestion control (the
> > topic of RFC2914) should fix.  It is not the role of receive window
> > (flow control).  It is really the sender's job to detect and react to
> > this, not the receiver's.  (We have had this discussion before on
> > netdev.)
> 
> It's not of high importance whose job it is according to pure theory.
> What matters is, that autotuning introduced serious problem at LAN context
> by disabling any possibility to properly react to increasing RTT. Again,
> it's not important whether this functionality was there by design or by
> coincidence, but it was holding the system well-balanced for many years.
> 
> Now, as autotuning is enabled by default in stock kernel, this problem is
> spreading into LANs without users even knowing what's going on. Therefore
> I'd like to suggest to look for a decent fix which could be implemented
> in relatively short time frame. My proposal is this:
> 
> - measure RTT during the initial phase of TCP connection (first X segments)
> - compute maximal receive window size depending on measured RTT using
>   configurable constant representing the bandwidth part of BDP
> - let autotuning do its work upto that limit.
> 
>   With kind regards,
> 
>         M. 

So you have broken infrastructure or senders and you want to blame
the receiver? The receiver is not responsible for flow control in TCP.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-09 20:05   ` Marian Ďurkovič
  2009-03-09 20:24     ` Stephen Hemminger
@ 2009-03-10  0:09     ` David Miller
  2009-03-10  0:34       ` Rick Jones
  2009-03-11 10:03       ` Andi Kleen
  1 sibling, 2 replies; 30+ messages in thread
From: David Miller @ 2009-03-10  0:09 UTC (permalink / raw)
  To: md; +Cc: netdev

From: Marian Ďurkovič <md@bts.sk>
Date: Mon, 9 Mar 2009 21:05:05 +0100

> Well, in practice that was always limited by receive window size, which
> was by default 64 kB on most operating systems. So this undesirable behavior
> was limited to hosts where receive window was manually increased to huge
> values.

You say "was" as if this was a recent change.  Linux has been doing
receive buffer autotuning for at least 5 years if not longer.

> Today, the real effect of autotuning is the same as changing the
> receive window size to 4 MB on *all* hosts, since there's no
> mechanism to prevent it from growing the window to maximum even for
> low RTT paths.

There is, on the sender side (congestion control) and at the
intermediate bottleneck routers (active queue management).

You are pointing the blame at the wrong area, as both John and Stephen
are trying to tell you.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-10  0:09     ` David Miller
@ 2009-03-10  0:34       ` Rick Jones
  2009-03-10  3:55         ` John Heffner
  2009-03-11 10:03       ` Andi Kleen
  1 sibling, 1 reply; 30+ messages in thread
From: Rick Jones @ 2009-03-10  0:34 UTC (permalink / raw)
  To: David Miller; +Cc: md, netdev

If I recall correctly, when I have asked about this behaviour in the past, I was 
told that the autotuning receiver would always try to offer the sender 2X what 
the receiver thought the sender's cwnd happened to be.  Is my recollection 
incorrect, or is this then:

[root@dl5855 ~]# netperf -t omni -H sut42 -- -k foo -s 128K
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to sut42.west (10.208.0.45) port 
0 AF_INET
THROUGHPUT=941.30
LSS_SIZE_REQ=131072
LSS_SIZE=262142
LSS_SIZE_END=262142
RSR_SIZE_REQ=-1
RSR_SIZE=87380
RSR_SIZE_END=3900000

not intended behaviour?  LSS == Local Socket Send; RSR == Remote Socket Receive. 
  dl5855 is running RHEL 5.2 (2.6.18-92.el5) sut42 is running a nf-next-2.6 about 
two or three weeks old with some of the 32-core scaling patches applied 
(2.6.29-rc5-nfnextconntrack)

I'm assuming that by setting the SO_SNDBUF on the netperf (sending) side to 
128K/256K that will be the limit on what it will ever put out onto the connection 
at one time, but by the end of the 10 second test over the local GbE LAN the 
receiver's autotuned SO_RCVBUF has grown to 3900000.

rick jones

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-10  0:34       ` Rick Jones
@ 2009-03-10  3:55         ` John Heffner
  2009-03-10 17:20           ` Rick Jones
  0 siblings, 1 reply; 30+ messages in thread
From: John Heffner @ 2009-03-10  3:55 UTC (permalink / raw)
  To: Rick Jones; +Cc: David Miller, md, netdev

On Mon, Mar 9, 2009 at 5:34 PM, Rick Jones <rick.jones2@hp.com> wrote:
> If I recall correctly, when I have asked about this behaviour in the past, I
> was told that the autotuning receiver would always try to offer the sender
> 2X what the receiver thought the sender's cwnd happened to be.  Is my
> recollection incorrect, or is this then:
>
> [root@dl5855 ~]# netperf -t omni -H sut42 -- -k foo -s 128K
> OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to sut42.west (10.208.0.45)
> port 0 AF_INET
> THROUGHPUT=941.30
> LSS_SIZE_REQ=131072
> LSS_SIZE=262142
> LSS_SIZE_END=262142
> RSR_SIZE_REQ=-1
> RSR_SIZE=87380
> RSR_SIZE_END=3900000
>
> not intended behaviour?  LSS == Local Socket Send; RSR == Remote Socket
> Receive.  dl5855 is running RHEL 5.2 (2.6.18-92.el5) sut42 is running a
> nf-next-2.6 about two or three weeks old with some of the 32-core scaling
> patches applied (2.6.29-rc5-nfnextconntrack)
>
> I'm assuming that by setting the SO_SNDBUF on the netperf (sending) side to
> 128K/256K that will be the limit on what it will ever put out onto the
> connection at one time, but by the end of the 10 second test over the local
> GbE LAN the receiver's autotuned SO_RCVBUF has grown to 3900000.


Hi Rick,

(Pretty sure we went over this already, but once more..)  The receiver
does not size to twice cwnd.  It sizes to twice the amount of data
that the application read in one RTT.  In the common case of a path
bottleneck and a receiving application that always keeps up, this
equals 2*cwnd, but the distinction is very important to understanding
its behavior in other cases.

In your test where you limit sndbuf to 256k, you will find that you
did not fill up the bottleneck queues, and you did not get a
significantly increased RTT, which are the negative effects we want to
avoid.  The large receive window caused no trouble at all.

  -John

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-10  3:55         ` John Heffner
@ 2009-03-10 17:20           ` Rick Jones
  0 siblings, 0 replies; 30+ messages in thread
From: Rick Jones @ 2009-03-10 17:20 UTC (permalink / raw)
  To: John Heffner; +Cc: David Miller, md, netdev

> (Pretty sure we went over this already, but once more..) 

Sometimes I am but dense north by northwest, but I am also occasionally simply 
dense regardless of the direction :)

> The receiver does not size to twice cwnd.  It sizes to twice the amount of
> data that the application read in one RTT.  In the common case of a path 
> bottleneck and a receiving application that always keeps up, this equals
> 2*cwnd, but the distinction is very important to understanding its behavior in
> other cases.
> 
> In your test where you limit sndbuf to 256k, you will find that you
> did not fill up the bottleneck queues, and you did not get a
> significantly increased RTT, which are the negative effects we want to
> avoid.  The large receive window caused no trouble at all.

What is the definition of "significantly" here?

With my 256K capped SO_SNDBUF ping seems to report like this:

[root@dl5855 ~]# ping sut42
PING sut42.west (10.208.0.45) 56(84) bytes of data.
64 bytes from sut42.west (10.208.0.45): icmp_seq=1 ttl=64 time=1.58 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=2 ttl=64 time=0.126 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=3 ttl=64 time=0.103 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=4 ttl=64 time=0.102 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=5 ttl=64 time=0.104 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=6 ttl=64 time=0.100 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=7 ttl=64 time=0.140 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=8 ttl=64 time=0.103 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=9 ttl=64 time=11.3 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=10 ttl=64 time=10.3 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=11 ttl=64 time=7.42 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=12 ttl=64 time=4.51 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=13 ttl=64 time=1.56 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=14 ttl=64 time=4.47 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=15 ttl=64 time=4.63 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=16 ttl=64 time=1.66 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=17 ttl=64 time=7.65 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=18 ttl=64 time=4.73 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=19 ttl=64 time=0.135 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=20 ttl=64 time=0.116 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=21 ttl=64 time=0.102 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=22 ttl=64 time=0.102 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=23 ttl=64 time=0.098 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=24 ttl=64 time=0.104 ms

FWIW, when I uncap the SO_SNDBUF, the RTTs start to look like this instead:

[root@dl5855 ~]# ping sut42
PING sut42.west (10.208.0.45) 56(84) bytes of data.
64 bytes from sut42.west (10.208.0.45): icmp_seq=1 ttl=64 time=0.183 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=2 ttl=64 time=0.107 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=3 ttl=64 time=0.100 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=4 ttl=64 time=0.117 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=5 ttl=64 time=0.103 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=6 ttl=64 time=0.099 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=7 ttl=64 time=0.123 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=8 ttl=64 time=26.2 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=9 ttl=64 time=24.3 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=10 ttl=64 time=26.3 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=11 ttl=64 time=26.4 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=12 ttl=64 time=26.3 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=13 ttl=64 time=26.2 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=14 ttl=64 time=26.6 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=15 ttl=64 time=26.2 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=16 ttl=64 time=26.5 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=17 ttl=64 time=26.3 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=18 ttl=64 time=0.126 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=19 ttl=64 time=0.119 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=20 ttl=64 time=0.120 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=21 ttl=64 time=0.097 ms

And then when I cap both sides to 64K requested/128K and still get link-rate the 
pings look like:

[root@dl5855 ~]# ping sut42
PING sut42.west (10.208.0.45) 56(84) bytes of data.
64 bytes from sut42.west (10.208.0.45): icmp_seq=1 ttl=64 time=0.161 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=2 ttl=64 time=0.104 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=3 ttl=64 time=0.103 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=4 ttl=64 time=0.101 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=5 ttl=64 time=0.106 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=6 ttl=64 time=0.102 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=7 ttl=64 time=0.753 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=8 ttl=64 time=0.594 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=9 ttl=64 time=0.789 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=10 ttl=64 time=0.566 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=11 ttl=64 time=0.587 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=12 ttl=64 time=0.635 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=13 ttl=64 time=0.729 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=14 ttl=64 time=0.613 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=15 ttl=64 time=0.609 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=16 ttl=64 time=0.655 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=17 ttl=64 time=0.152 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=18 ttl=64 time=0.106 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=19 ttl=64 time=0.100 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=20 ttl=64 time=0.106 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=21 ttl=64 time=0.122 ms

None of the above "absolves" the sender of course, but I still get wrapped around 
the axle of handing so much rope to senders when we know 99 times out of ten they 
are going to hang themselves with it.

rick jones

Netperf cannot tell me bytes received per RTT, but it can tell me the average 
bytes per recv() call.  I'm not sure if that is a sufficient approximation but 
here are those three netperf runs re-run with remote_bytes_per_recv added to the 
output:

[root@dl5855 ~]# netperf -t omni -H sut42 -- -k foo -s 64K -S 64K
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to sut42.west (10.208.0.45) port 
0 AF_INET
THROUGHPUT=941.07
LSS_SIZE_REQ=65536
LSS_SIZE=131072
LSS_SIZE_END=131072
RSR_SIZE_REQ=65536
RSR_SIZE=131072
RSR_SIZE_END=131072
REMOTE_BYTES_PER_RECV=8178.43
[root@dl5855 ~]# netperf -t omni -H sut42 -- -k foo -s 128K
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to sut42.west (10.208.0.45) port 
0 AF_INET
THROUGHPUT=941.31
LSS_SIZE_REQ=131072
LSS_SIZE=262142
LSS_SIZE_END=262142
RSR_SIZE_REQ=-1
RSR_SIZE=87380
RSR_SIZE_END=4194304
REMOTE_BYTES_PER_RECV=8005.97
[root@dl5855 ~]# netperf -t omni -H sut42 -- -k foo
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to sut42.west (10.208.0.45) port 
0 AF_INET
THROUGHPUT=941.33
LSS_SIZE_REQ=-1
LSS_SIZE=16384
LSS_SIZE_END=4194304
RSR_SIZE_REQ=-1
RSR_SIZE=87380
RSR_SIZE_END=4194304
REMOTE_BYTES_PER_RECV=8055.89

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-10  0:09     ` David Miller
  2009-03-10  0:34       ` Rick Jones
@ 2009-03-11 10:03       ` Andi Kleen
  2009-03-11 11:03         ` Marian Ďurkovič
  2009-03-11 13:30         ` David Miller
  1 sibling, 2 replies; 30+ messages in thread
From: Andi Kleen @ 2009-03-11 10:03 UTC (permalink / raw)
  To: David Miller; +Cc: md, netdev

David Miller <davem@davemloft.net> writes:

> From: Marian ÄŽurkoviÄ <md@bts.sk>
> Date: Mon, 9 Mar 2009 21:05:05 +0100
>
>> Well, in practice that was always limited by receive window size, which
>> was by default 64 kB on most operating systems. So this undesirable behavior
>> was limited to hosts where receive window was manually increased to huge
>> values.
>
> You say "was" as if this was a recent change.  Linux has been doing
> receive buffer autotuning for at least 5 years if not longer.

I think his point was the only now does it become a visible problem
as >= 1GB of memory is wide spread, which leads to 4MB rx buffer sizes.

Perhaps this points to the default buffer sizing heuristics to 
be too aggressive for >= 1GB?

Perhaps something like this patch? Marian, does that help?

-Andi

TCP: Lower per socket RX buffer sizing threshold 

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 net/ipv4/tcp.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6.28-test/net/ipv4/tcp.c
===================================================================
--- linux-2.6.28-test.orig/net/ipv4/tcp.c	2009-02-09 11:06:52.000000000 +0100
+++ linux-2.6.28-test/net/ipv4/tcp.c	2009-03-11 11:01:53.000000000 +0100
@@ -2757,9 +2757,9 @@
 	sysctl_tcp_mem[1] = limit;
 	sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;
 
-	/* Set per-socket limits to no more than 1/128 the pressure threshold */
-	limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 7);
-	max_share = min(4UL*1024*1024, limit);
+	/* Set per-socket limits to no more than 1/256 the pressure threshold */
+	limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 8);
+	max_share = min(2UL*1024*1024, limit);
 
 	sysctl_tcp_wmem[0] = SK_MEM_QUANTUM;
 	sysctl_tcp_wmem[1] = 16*1024;


-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-11 10:03       ` Andi Kleen
@ 2009-03-11 11:03         ` Marian Ďurkovič
  2009-03-11 13:30         ` David Miller
  1 sibling, 0 replies; 30+ messages in thread
From: Marian Ďurkovič @ 2009-03-11 11:03 UTC (permalink / raw)
  To: Andi Kleen; +Cc: netdev

On Wed, Mar 11, 2009 at 11:03:35AM +0100, Andi Kleen wrote:
> > You say "was" as if this was a recent change.  Linux has been doing
> > receive buffer autotuning for at least 5 years if not longer.
> 
> I think his point was the only now does it become a visible problem
> as >= 1GB of memory is wide spread, which leads to 4MB rx buffer sizes.

Yes, exactly! We run into this after number of workstations were upgraded
at once to a new hardware with 2GB of RAM.

> Perhaps this points to the default buffer sizing heuristics to 
> be too aggressive for >= 1GB?
> 
> Perhaps something like this patch? Marian, does that help?

Sure - as it lowers the maximum from 4MB to 2MB, the net result is that
RTTs at 100 Mbps immediately went down from 267 msec into:

--- x.x.x.x ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 8992ms
rtt min/avg/max/mdev = 134.417/134.770/134.911/0.315 ms

Still this is too high for 100 Mpbs network, since the RTTs with 64 KB static
rx buffer look like this (with no performance penalty):

--- x.x.x.x ping statistics --
10 packets transmitted, 10 received, 0% packet loss, time 9000ms
rtt min/avg/max/mdev = 5.163/5.355/5.476/0.102 ms

I.e. the patch significantly helps as expected, however having one static
limit for all NIC speeds as well as for the whole range of RTTs is suboptimal 
by principle.

   Thanks & kind regards,

       M.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-11 10:03       ` Andi Kleen
  2009-03-11 11:03         ` Marian Ďurkovič
@ 2009-03-11 13:30         ` David Miller
  2009-03-11 15:01           ` Andi Kleen
  1 sibling, 1 reply; 30+ messages in thread
From: David Miller @ 2009-03-11 13:30 UTC (permalink / raw)
  To: andi; +Cc: md, netdev

From: Andi Kleen <andi@firstfloor.org>
Date: Wed, 11 Mar 2009 11:03:35 +0100

> Perhaps this points to the default buffer sizing heuristics to 
> be too aggressive for >= 1GB?

It's necessary Andi, you can't fill a connection on a trans-
continental connection without at least a 4MB receive buffer.

Did you read the commit message of the change that increased
the limit?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-11 13:30         ` David Miller
@ 2009-03-11 15:01           ` Andi Kleen
  2009-03-11 14:56             ` Marian Ďurkovič
  2009-03-11 15:34             ` John Heffner
  0 siblings, 2 replies; 30+ messages in thread
From: Andi Kleen @ 2009-03-11 15:01 UTC (permalink / raw)
  To: David Miller; +Cc: andi, md, netdev

On Wed, Mar 11, 2009 at 06:30:58AM -0700, David Miller wrote:
> From: Andi Kleen <andi@firstfloor.org>
> Date: Wed, 11 Mar 2009 11:03:35 +0100
> 
> > Perhaps this points to the default buffer sizing heuristics to 
> > be too aggressive for >= 1GB?
> 
> It's necessary Andi, you can't fill a connection on a trans-
> continental connection without at least a 4MB receive buffer.

Seems pretty arbitary to me. It's the value for a given bandwidth*latency
product, but why not half or twice the bandwidth? I don't think
that number is written in stone like you claim.

Anyways it was just a test patch and it indeeds seems to address
the problem at least partly.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-11 15:01           ` Andi Kleen
@ 2009-03-11 14:56             ` Marian Ďurkovič
  2009-03-11 15:34             ` John Heffner
  1 sibling, 0 replies; 30+ messages in thread
From: Marian Ďurkovič @ 2009-03-11 14:56 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David Miller, netdev

On Wed, Mar 11, 2009 at 04:01:49PM +0100, Andi Kleen wrote:
> On Wed, Mar 11, 2009 at 06:30:58AM -0700, David Miller wrote:
> > From: Andi Kleen <andi@firstfloor.org>
> > Date: Wed, 11 Mar 2009 11:03:35 +0100
> > 
> > > Perhaps this points to the default buffer sizing heuristics to 
> > > be too aggressive for >= 1GB?
> > 
> > It's necessary Andi, you can't fill a connection on a trans-
> > continental connection without at least a 4MB receive buffer.
> 
> Seems pretty arbitary to me. It's the value for a given bandwidth*latency
> product, but why not half or twice the bandwidth? I don't think
> that number is written in stone like you claim.

Besides being arbitrary, it's also incorrect. The defaults at
tcp.c are setting both tcp_wmem and tcp_rmem to 4 MB ignoring
the fact, that it results in 4MB send buffer but only 3 MB 
receive buffer due to other defaults (tcp_adv_win_scale=2).
 
Indeed, 3MB*(1538/1448)/100Mbps is equal to 267.3 msec
- i.e. exactly the latency we're seeing.

   With kind regards,

         M.




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-11 15:01           ` Andi Kleen
  2009-03-11 14:56             ` Marian Ďurkovič
@ 2009-03-11 15:34             ` John Heffner
  1 sibling, 0 replies; 30+ messages in thread
From: John Heffner @ 2009-03-11 15:34 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David Miller, md, netdev

On Wed, Mar 11, 2009 at 8:01 AM, Andi Kleen <andi@firstfloor.org> wrote:
> On Wed, Mar 11, 2009 at 06:30:58AM -0700, David Miller wrote:
>> From: Andi Kleen <andi@firstfloor.org>
>> Date: Wed, 11 Mar 2009 11:03:35 +0100
>>
>> > Perhaps this points to the default buffer sizing heuristics to
>> > be too aggressive for >= 1GB?
>>
>> It's necessary Andi, you can't fill a connection on a trans-
>> continental connection without at least a 4MB receive buffer.
>
> Seems pretty arbitary to me. It's the value for a given bandwidth*latency
> product, but why not half or twice the bandwidth? I don't think
> that number is written in stone like you claim.

It is of course just a number, though not exactly arbitrary -- it's
approximately the required value for transcontinental 100 Mbps paths.
Choosing the value is a matter of engineering trade-offs, and seemed
like a reasonable cap at this time.

Any cap so much lower that it would give a small bound for LAN
latencies would bring us back to the bad old days where you couldn't
get anything more than 10 Mbps on the wide area.

  -John

^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <20090309195906.M50328@bts.sk>]

* Re: TCP rx window autotuning harmful at LAN context
       [not found]   ` <20090309195906.M50328@bts.sk>
@ 2009-03-09 20:23     ` John Heffner
  2009-03-09 20:33       ` Stephen Hemminger
                         ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: John Heffner @ 2009-03-09 20:23 UTC (permalink / raw)
  To: Marian Ďurkovič; +Cc: netdev

On Mon, Mar 9, 2009 at 1:02 PM, Marian Ďurkovič <md@bts.sk> wrote:
> On Mon, 9 Mar 2009 11:01:52 -0700, John Heffner wrote
>> On Mon, Mar 9, 2009 at 4:25 AM, Marian Ďurkovič <md@bts.sk> wrote:
>> >   As rx window autotuning is enabled in all recent kernels and with 1 GB
>> > of RAM the maximum tcp_rmem becomes 4 MB, this problem is spreading rapidly
>> > and we believe it needs urgent attention. As demontrated above, such huge
>> > rx window (which is at least 100*BDP of the example above) does not deliver
>> > any performance gain but instead it seriously harms other hosts and/or
>> > applications. It should also be noted, that host with autotuning enabled
>> > steals an unfair share of the total available bandwidth, which might look
>> > like a "better" performing TCP stack at first sight - however such behaviour
>> > is not appropriate (RFC2914, section 3.2).
>>
>> It's well known that "standard" TCP fills all available drop-tail
>> buffers, and that this behavior is not desirable.
>
> Well, in practice that was always limited by receive window size, which
> was by default 64 kB on most operating systems. So this undesirable behavior
> was limited to hosts where receive window was manually increased to huge values.
>
> Today, the real effect of autotuning is the same as changing the receive window
> size to 4 MB on *all* hosts, since there's no mechanism to prevent it from
> growing the window to maximum even for low RTT paths.
>
>> The situation you describe is exactly what congestion control (the
>> topic of RFC2914) should fix.  It is not the role of receive window
>> (flow control).  It is really the sender's job to detect and react to
>> this, not the receiver's.  (We have had this discussion before on
>> netdev.)
>
> It's not of high importance whose job it is according to pure theory.
> What matters is, that autotuning introduced serious problem at LAN context
> by disabling any possibility to properly react to increasing RTT. Again,
> it's not important whether this functionality was there by design or by
> coincidence, but it was holding the system well-balanced for many years.

This is not a theoretical exercise, but one in good system design.
This "well-balanced" system was really broken all along, and
autotuning has exposed this.

A drop-tail queue size of 1000 packets on a local interface is
questionable, and I think this is the real source of your problem.
This change was introduced a few years ago on most drivers --
generally used to be 100 by default.  This was partly because TCP
slow-start has problems when a drop-tail queue is smaller than the
BDP.  (Limited slow-start is meant to address this problem, but
requires tuning to the right value.)  Again, using AQM is likely the
best solution.


> Now, as autotuning is enabled by default in stock kernel, this problem is
> spreading into LANs without users even knowing what's going on. Therefore
> I'd like to suggest to look for a decent fix which could be implemented
> in relatively short time frame. My proposal is this:
>
> - measure RTT during the initial phase of TCP connection (first X segments)
> - compute maximal receive window size depending on measured RTT using
>  configurable constant representing the bandwidth part of BDP
> - let autotuning do its work upto that limit.

Let's take this proposal, and try it instead at the sender side, as
part of congestion control.  Would this proposal make sense in that
position?  Would you seriously consider it there?

(As a side note, this is in fact what happens if you disable
timestamps, since TCP cannot get an updated measurement of RTT without
timestamps, only a lower bound.  However, I consider this a limitation
not a feature.)

  -John

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-09 20:23     ` John Heffner
@ 2009-03-09 20:33       ` Stephen Hemminger
  2009-03-09 23:52       ` David Miller
       [not found]       ` <20090310104956.GA81181@bts.sk>
  2 siblings, 0 replies; 30+ messages in thread
From: Stephen Hemminger @ 2009-03-09 20:33 UTC (permalink / raw)
  To: John Heffner; +Cc: Marian Ďurkovič, netdev

On Mon, 9 Mar 2009 13:23:15 -0700
John Heffner <johnwheffner@gmail.com> wrote:

> On Mon, Mar 9, 2009 at 1:02 PM, Marian Ďurkovič <md@bts.sk> wrote:
> > On Mon, 9 Mar 2009 11:01:52 -0700, John Heffner wrote
> >> On Mon, Mar 9, 2009 at 4:25 AM, Marian Ďurkovič <md@bts.sk> wrote:
> >> >   As rx window autotuning is enabled in all recent kernels and with 1 GB
> >> > of RAM the maximum tcp_rmem becomes 4 MB, this problem is spreading rapidly
> >> > and we believe it needs urgent attention. As demontrated above, such huge
> >> > rx window (which is at least 100*BDP of the example above) does not deliver
> >> > any performance gain but instead it seriously harms other hosts and/or
> >> > applications. It should also be noted, that host with autotuning enabled
> >> > steals an unfair share of the total available bandwidth, which might look
> >> > like a "better" performing TCP stack at first sight - however such behaviour
> >> > is not appropriate (RFC2914, section 3.2).
> >>
> >> It's well known that "standard" TCP fills all available drop-tail
> >> buffers, and that this behavior is not desirable.
> >
> > Well, in practice that was always limited by receive window size, which
> > was by default 64 kB on most operating systems. So this undesirable behavior
> > was limited to hosts where receive window was manually increased to huge values.
> >
> > Today, the real effect of autotuning is the same as changing the receive window
> > size to 4 MB on *all* hosts, since there's no mechanism to prevent it from
> > growing the window to maximum even for low RTT paths.
> >
> >> The situation you describe is exactly what congestion control (the
> >> topic of RFC2914) should fix.  It is not the role of receive window
> >> (flow control).  It is really the sender's job to detect and react to
> >> this, not the receiver's.  (We have had this discussion before on
> >> netdev.)
> >
> > It's not of high importance whose job it is according to pure theory.
> > What matters is, that autotuning introduced serious problem at LAN context
> > by disabling any possibility to properly react to increasing RTT. Again,
> > it's not important whether this functionality was there by design or by
> > coincidence, but it was holding the system well-balanced for many years.
> 
> This is not a theoretical exercise, but one in good system design.
> This "well-balanced" system was really broken all along, and
> autotuning has exposed this.
> 
> A drop-tail queue size of 1000 packets on a local interface is
> questionable, and I think this is the real source of your problem.
> This change was introduced a few years ago on most drivers --
> generally used to be 100 by default.  This was partly because TCP
> slow-start has problems when a drop-tail queue is smaller than the
> BDP.  (Limited slow-start is meant to address this problem, but
> requires tuning to the right value.)  Again, using AQM is likely the
> best solution.

By default, sky2 queue is 511 pkts which is 6.2ms on @ 1G.
Probably, should be half that by default. Also there is
software transmit queue as well, which could be 0 unless some
form of AQM is being done.

> 
> > Now, as autotuning is enabled by default in stock kernel, this problem is
> > spreading into LANs without users even knowing what's going on. Therefore
> > I'd like to suggest to look for a decent fix which could be implemented
> > in relatively short time frame. My proposal is this:
> >
> > - measure RTT during the initial phase of TCP connection (first X segments)
> > - compute maximal receive window size depending on measured RTT using
> >  configurable constant representing the bandwidth part of BDP
> > - let autotuning do its work upto that limit.
> 
> Let's take this proposal, and try it instead at the sender side, as
> part of congestion control.  Would this proposal make sense in that
> position?  Would you seriously consider it there?
> 
> (As a side note, this is in fact what happens if you disable
> timestamps, since TCP cannot get an updated measurement of RTT without
> timestamps, only a lower bound.  However, I consider this a limitation
> not a feature.)
> 
>   -John
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-09 20:23     ` John Heffner
  2009-03-09 20:33       ` Stephen Hemminger
@ 2009-03-09 23:52       ` David Miller
  2009-03-10  0:09         ` John Heffner
       [not found]       ` <20090310104956.GA81181@bts.sk>
  2 siblings, 1 reply; 30+ messages in thread
From: David Miller @ 2009-03-09 23:52 UTC (permalink / raw)
  To: johnwheffner; +Cc: md, netdev

From: John Heffner <johnwheffner@gmail.com>
Date: Mon, 9 Mar 2009 13:23:15 -0700

> A drop-tail queue size of 1000 packets on a local interface is
> questionable, and I think this is the real source of your problem.

Are you suggested we decrease it? :-)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-09 23:52       ` David Miller
@ 2009-03-10  0:09         ` John Heffner
  2009-03-10  5:19           ` Eric Dumazet
  0 siblings, 1 reply; 30+ messages in thread
From: John Heffner @ 2009-03-10  0:09 UTC (permalink / raw)
  To: David Miller; +Cc: md, netdev

On Mon, Mar 9, 2009 at 4:52 PM, David Miller <davem@davemloft.net> wrote:
> From: John Heffner <johnwheffner@gmail.com>
> Date: Mon, 9 Mar 2009 13:23:15 -0700
>
>> A drop-tail queue size of 1000 packets on a local interface is
>> questionable, and I think this is the real source of your problem.
>
> Are you suggested we decrease it? :-)

I am not so bold. :-D  (And note the drop-tail prefix.)

A long queue with AQM would probably be best, but would require
careful testing before enabling by default.  It would almost certainly
cause pain for some.

And, for the vast majority of people for whom the local interface is
not the bottleneck, it makes no difference.  It hurts worst for
someone doing bulk transfer with a GigE device in 100 Mbps (or worse,
10-Mbps) mode, where 1000 pkts is a long time, while simultaneously
doing something latency-sensitive.  I suspect this is the case Marian
is experiencing.

  -John

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-10  0:09         ` John Heffner
@ 2009-03-10  5:19           ` Eric Dumazet
  0 siblings, 0 replies; 30+ messages in thread
From: Eric Dumazet @ 2009-03-10  5:19 UTC (permalink / raw)
  To: John Heffner; +Cc: David Miller, md, netdev

John Heffner a écrit :
> On Mon, Mar 9, 2009 at 4:52 PM, David Miller <davem@davemloft.net> wrote:
>> From: John Heffner <johnwheffner@gmail.com>
>> Date: Mon, 9 Mar 2009 13:23:15 -0700
>>
>>> A drop-tail queue size of 1000 packets on a local interface is
>>> questionable, and I think this is the real source of your problem.
>> Are you suggested we decrease it? :-)
> 
> I am not so bold. :-D  (And note the drop-tail prefix.)
> 
> A long queue with AQM would probably be best, but would require
> careful testing before enabling by default.  It would almost certainly
> cause pain for some.
> 
> And, for the vast majority of people for whom the local interface is
> not the bottleneck, it makes no difference.  It hurts worst for
> someone doing bulk transfer with a GigE device in 100 Mbps (or worse,
> 10-Mbps) mode, where 1000 pkts is a long time, while simultaneously
> doing something latency-sensitive.  I suspect this is the case Marian
> is experiencing.
> 

Interesting stuff indeed.

Could you tell us more about AQM ?




^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <20090310104956.GA81181@bts.sk>]

* Re: TCP rx window autotuning harmful at LAN context
       [not found]       ` <20090310104956.GA81181@bts.sk>
@ 2009-03-10 11:30         ` David Miller
  2009-03-10 11:46           ` Marian Ďurkovič
  0 siblings, 1 reply; 30+ messages in thread
From: David Miller @ 2009-03-10 11:30 UTC (permalink / raw)
  To: md; +Cc: johnwheffner, netdev

From: Marian Ďurkovič <md@bts.sk>
Date: Tue, 10 Mar 2009 11:49:56 +0100

> Sender does not have the relevant info to implement this - it might be
> connected by 10 GE to the highspeed backbone.

Yes, the sender does indeed have this information, and using it is
exactly what congestion control algorithms such as VEGAS try to do.

They look at both round trip times and bandwith as they increase the
send congestion window.  And if round trips increase without a
corresponding increase in bandwidth, they stop increasing.  This is
because in such a situation we can infer that we're just consuming
more queue space at some intermediate router/switch rather than using
more of the available bandwidth.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-10 11:30         ` David Miller
@ 2009-03-10 11:46           ` Marian Ďurkovič
  2009-03-10 15:23             ` John Heffner
  0 siblings, 1 reply; 30+ messages in thread
From: Marian Ďurkovič @ 2009-03-10 11:46 UTC (permalink / raw)
  To: David Miller; +Cc: johnwheffner, netdev

On Tue, Mar 10, 2009 at 04:30:19AM -0700, David Miller wrote:
> From: Marian Ďurkovič <md@bts.sk>
> Date: Tue, 10 Mar 2009 11:49:56 +0100
> 
> > Sender does not have the relevant info to implement this - it might be
> > connected by 10 GE to the highspeed backbone.
> 
> Yes, the sender does indeed have this information, and using it is
> exactly what congestion control algorithms such as VEGAS try to do.
> 
> They look at both round trip times and bandwith as they increase the
> send congestion window.  And if round trips increase without a
> corresponding increase in bandwidth, they stop increasing.

Yes, but that's actual bandwidth between sender and receiver, not
the hard BW limit of the receiver's NIC. My intention is just to introduce
some safety belt preventing autotuning to increase the rx window
into MB ranges when RTT is very low.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-10 11:46           ` Marian Ďurkovič
@ 2009-03-10 15:23             ` John Heffner
  2009-03-10 16:00               ` Marian Ďurkovič
  0 siblings, 1 reply; 30+ messages in thread
From: John Heffner @ 2009-03-10 15:23 UTC (permalink / raw)
  To: Marian Ďurkovič; +Cc: David Miller, netdev

On Tue, Mar 10, 2009 at 4:46 AM, Marian Ďurkovič <md@bts.sk> wrote:
> On Tue, Mar 10, 2009 at 04:30:19AM -0700, David Miller wrote:
>> From: Marian Ďurkovič <md@bts.sk>
>> Date: Tue, 10 Mar 2009 11:49:56 +0100
>>
>> > Sender does not have the relevant info to implement this - it might be
>> > connected by 10 GE to the highspeed backbone.
>>
>> Yes, the sender does indeed have this information, and using it is
>> exactly what congestion control algorithms such as VEGAS try to do.
>>
>> They look at both round trip times and bandwith as they increase the
>> send congestion window.  And if round trips increase without a
>> corresponding increase in bandwidth, they stop increasing.
>
> Yes, but that's actual bandwidth between sender and receiver, not
> the hard BW limit of the receiver's NIC. My intention is just to introduce
> some safety belt preventing autotuning to increase the rx window
> into MB ranges when RTT is very low.

Nowhere in our proposal do you use NIC bandwidth.  What you proposed
can be done easily at the sender.

  -John

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-10 15:23             ` John Heffner
@ 2009-03-10 16:00               ` Marian Ďurkovič
  2009-03-10 16:18                 ` David Miller
  0 siblings, 1 reply; 30+ messages in thread
From: Marian Ďurkovič @ 2009-03-10 16:00 UTC (permalink / raw)
  To: John Heffner; +Cc: David Miller, netdev

> > Yes, but that's actual bandwidth between sender and receiver, not
> > the hard BW limit of the receiver's NIC. My intention is just to introduce
> > some safety belt preventing autotuning to increase the rx window
> > into MB ranges when RTT is very low.
> 
> Nowhere in our proposal do you use NIC bandwidth.  What you proposed
> can be done easily at the sender.

Only if you *absolutely* trust the sender to do everything correctly.
That's never the case on global scale - some senders are buggy, some teribly
outdated, some incorrectly configured, some using different congestion
control scheme...

Again, autotuning in its present form removes all safety at the receiver
side and allows senders to easily bring LANs down. IMHO we need to fix
this before the problem spreads even more.

   Thanks & kind regards,

         M.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-10 16:00               ` Marian Ďurkovič
@ 2009-03-10 16:18                 ` David Miller
  2009-03-11  8:29                   ` Marian Ďurkovič
  0 siblings, 1 reply; 30+ messages in thread
From: David Miller @ 2009-03-10 16:18 UTC (permalink / raw)
  To: md; +Cc: johnwheffner, netdev

From: Marian Ďurkovič <md@bts.sk>
Date: Tue, 10 Mar 2009 17:00:40 +0100

> Again, autotuning in its present form removes all safety at the
> receiver side and allows senders to easily bring LANs down. IMHO we
> need to fix this before the problem spreads even more.

There are both global system-wide and socket local limits to how much
memory can be consumed by TCP receive data.  If things get beyond the
configured limits, we back off.  You could modify those if you
personally wish.

It's really good that you brought up this issue.

And it's really good that you've explained your own personal
workaround for this issue.

But it's not good that you want to impose your choosen workaround on
everyone else.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-10 16:18                 ` David Miller
@ 2009-03-11  8:29                   ` Marian Ďurkovič
  2009-03-11  8:41                     ` David Miller
  0 siblings, 1 reply; 30+ messages in thread
From: Marian Ďurkovič @ 2009-03-11  8:29 UTC (permalink / raw)
  To: David Miller; +Cc: johnwheffner, netdev

On Tue, Mar 10, 2009 at 09:18:16AM -0700, David Miller wrote:
> There are both global system-wide and socket local limits to how much
> memory can be consumed by TCP receive data.  If things get beyond the
> configured limits, we back off.  You could modify those if you
> personally wish.
> 
> It's really good that you brought up this issue.
> 
> And it's really good that you've explained your own personal
> workaround for this issue.

Beg your pardon - "personal" ?! Is our university the only place where
people use Linux on workstations with 100 Mbps ethernet connection?
Isn't the stock kernel supposed to work decently for them - or should
they all become TCP experts and fiddle with various parameters in order
not to cause harm to other applications or the whole LAN just by starting
a single bulk transfer?

For the last time: setting TCP window to BDP is well-known and generally
accepted practice. Autotuning does NOT respect it, and for 100 Mpbs 
connections at LAN context it might set the rx window somewhere between
100*BDP and 300*BDP. Since the BDP formula obviously applies also in 
reverse direction, i.e.

delay=window/bandwith

setting insanely huge window results in insanely increased LAN latencies
(upto buffer limits). Is this really something noone cares about ?! 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-11  8:29                   ` Marian Ďurkovič
@ 2009-03-11  8:41                     ` David Miller
  2009-03-11  9:05                       ` Marian Ďurkovič
  2009-03-11  9:11                       ` Eric Dumazet
  0 siblings, 2 replies; 30+ messages in thread
From: David Miller @ 2009-03-11  8:41 UTC (permalink / raw)
  To: md; +Cc: johnwheffner, netdev

From: Marian Ďurkovič <md@bts.sk>
Date: Wed, 11 Mar 2009 09:29:20 +0100

> For the last time:

Thankfully...

> setting TCP window to BDP is well-known and generally accepted
> practice. Autotuning does NOT respect it, and for 100 Mpbs
> connections at LAN context it might set the rx window somewhere
> between 100*BDP and 300*BDP. Since the BDP formula obviously applies
> also in reverse direction, i.e.

It's the congestion control algorithm on the sender making this
happen, not window autosizing.  The window autosizing is only
providing for flow control.  It's the congestion control algorithm
that is deciding to send more and more into a path where only
latency (and not bandwidth) is increasing with larger congestion
window values.

John has tried to explain this to you, and now I have also made an
effort.  So please stop ignoring what the real issue is here.

You also could use Active Queue Management.  But I doubt you would
bother even testing such a thing to let us know how well that works in
your situation.  You've already decided how you are willing to handle
this issue, so it's a fait accompli.

It's seems to be not even a matter for discussion for you, so that's
why this thread will likely go nowhere if it's entirely up to you.

> delay=window/bandwith
> 
> setting insanely huge window results in insanely increased LAN latencies
> (upto buffer limits). Is this really something noone cares about ?! 

Let me clue you in about something you may not be aware of.

If you don't auto-tune and let the RX socket buffer increase up
to a few megabytes, you cannot fully utilize the link on real
trans-continental connections people are using over the internet
today.

So your suggestion would be a huge step backwards.

This is why you keep being told that what you're asking us to do
is not appropriate.

You can't even talk 100Mbit between New York and San Francisco
without appropriately sized large RX buffers, and RX autotuning
is the only way to achieve that now.

Similarly for west coast US to anywhere in the Asia Pacific region.

So the world is much bigger than your little university where you've
decided to oversubscribe your network, and there are many other issues
to consider besides your specific localized problem.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-11  8:41                     ` David Miller
@ 2009-03-11  9:05                       ` Marian Ďurkovič
  2009-03-11  9:11                       ` Eric Dumazet
  1 sibling, 0 replies; 30+ messages in thread
From: Marian Ďurkovič @ 2009-03-11  9:05 UTC (permalink / raw)
  To: David Miller; +Cc: johnwheffner, netdev

> Let me clue you in about something you may not be aware of.
> 
> If you don't auto-tune and let the RX socket buffer increase up
> to a few megabytes, you cannot fully utilize the link on real
> trans-continental connections people are using over the internet
> today.
> 
> So your suggestion would be a huge step backwards.

Are you kidding or treating anyone else but you a complete idiot?
I never said autotuning should be disabled !

What I proposed is to limit the maximum autotuned buffer size to:

NIC full bandwidth * RTT measured during initial phase of TCP connection

This would for 100 Mbps connection become:

at RTT 5 msec 64 kB
at RTT 50 msec 640 kB
at RTT 200 msec 2,56 MB

With 1 Gbps connection this will become:

at RTT 5 msec 640 kB
at RTT 50 msec 6,4 MB
at RTT 200 msec 25,6 MB (if your hardlimit is that big).

In fact this will IMHO work much better than today, since you'll be able
to use even larger hardlimits (not 4 MB but e.g. 16 MB if you wish) and
still be protected from overflowing all buffers at your LAN or any other
low RTT paths.

> So the world is much bigger than your little university where you've
> decided to oversubscribe your network, and there are many other issues
> to consider besides your specific localized problem.

Please spare such junk for yourself and please start talking about technical
matters.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-11  8:41                     ` David Miller
  2009-03-11  9:05                       ` Marian Ďurkovič
@ 2009-03-11  9:11                       ` Eric Dumazet
  2009-03-11 13:25                         ` David Miller
  1 sibling, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2009-03-11  9:11 UTC (permalink / raw)
  To: David Miller; +Cc: md, johnwheffner, netdev

David Miller a écrit :
> From: Marian Ďurkovič <md@bts.sk>
> Date: Wed, 11 Mar 2009 09:29:20 +0100
> 
>> For the last time:
> 
> Thankfully...
> 
>> setting TCP window to BDP is well-known and generally accepted
>> practice. Autotuning does NOT respect it, and for 100 Mpbs
>> connections at LAN context it might set the rx window somewhere
>> between 100*BDP and 300*BDP. Since the BDP formula obviously applies
>> also in reverse direction, i.e.
> 
> It's the congestion control algorithm on the sender making this
> happen, not window autosizing.  The window autosizing is only
> providing for flow control.  It's the congestion control algorithm
> that is deciding to send more and more into a path where only
> latency (and not bandwidth) is increasing with larger congestion
> window values.
> 
> John has tried to explain this to you, and now I have also made an
> effort.  So please stop ignoring what the real issue is here.
> 
> You also could use Active Queue Management.  But I doubt you would
> bother even testing such a thing to let us know how well that works in
> your situation.  You've already decided how you are willing to handle
> this issue, so it's a fait accompli.
> 

I am interested to know how use AQM in practice.

Isnt it a matter of :

Using RED on linux hosts, with 'ecn' flag to mark packets instead
 of droping them if possible.

Using ECN enabled clients and servers. (Assuming most trafic is TCP)
Last time I checked, windows XP doesnt have ECN support. Am I wrong ?

Then in the Marian case, it has many senders that might send data to
one target. Active Queue Management wont triger at sender level,
so we need ECN capable routers that are able to use ECN to mark packets,
because only these routers will notice a queue congestion ?

Or maybe my focus on ECN is not relevant, since it may be marginal and
only save some percent of bandwidth ?

> It's seems to be not even a matter for discussion for you, so that's
> why this thread will likely go nowhere if it's entirely up to you.
> 
>> delay=window/bandwith
>>
>> setting insanely huge window results in insanely increased LAN latencies
>> (upto buffer limits). Is this really something noone cares about ?! 
> 
> Let me clue you in about something you may not be aware of.
> 
> If you don't auto-tune and let the RX socket buffer increase up
> to a few megabytes, you cannot fully utilize the link on real
> trans-continental connections people are using over the internet
> today.
> 
> So your suggestion would be a huge step backwards.
> 
> This is why you keep being told that what you're asking us to do
> is not appropriate.
> 
> You can't even talk 100Mbit between New York and San Francisco
> without appropriately sized large RX buffers, and RX autotuning
> is the only way to achieve that now.
> 
> Similarly for west coast US to anywhere in the Asia Pacific region.
> 
> So the world is much bigger than your little university where you've
> decided to oversubscribe your network, and there are many other issues
> to consider besides your specific localized problem.




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-11  9:11                       ` Eric Dumazet
@ 2009-03-11 13:25                         ` David Miller
  0 siblings, 0 replies; 30+ messages in thread
From: David Miller @ 2009-03-11 13:25 UTC (permalink / raw)
  To: dada1; +Cc: md, johnwheffner, netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Wed, 11 Mar 2009 10:11:10 +0100

> I am interested to know how use AQM in practice.

You just need RED, no need for ECN or anything like that.

RED will drop randomly when a certain percentage of the backlog queue
is consumed, and then behave like tail-drop after the next
configured threshold is reached.

It prevents TCPs from synchronizing, which is what happens with
pure tail-drop routers.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: TCP rx window autotuning harmful at LAN context
  2009-03-09 11:25 TCP rx window autotuning harmful at LAN context Marian Ďurkovič
  2009-03-09 18:01 ` John Heffner
@ 2009-03-11  9:02 ` Rémi Denis-Courmont
  1 sibling, 0 replies; 30+ messages in thread
From: Rémi Denis-Courmont @ 2009-03-11  9:02 UTC (permalink / raw)
  To: ext Marian Ďurkovič; +Cc: netdev@vger.kernel.org

On Monday 09 March 2009 13:25:21 ext Marian Ďurkovič wrote:
>   The behaviour could be descibed as "spiraling death" syndrome. While
> TCP with constant and decently sized rx window natively reduces
> transmission rate when RTT increases, autotuning performs exactly the
> opposite - as a response to increased RTT it increases the rx window size
> (which in turn again increases RTT...) As this happens again and again, the
> result is complete waste of all available buffers at sending host or at the
> bottleneck point, resulting in upto 267 msec (!) latency in LAN context
> (with 100 Mbps ethernet connection, default txqueuelen=1000, MTU=1500 and
> sky2 driver). Needles to say that this means the LAN is almost unusable.

This is very likely a stupid question, but anyway...

Is this with all applications, or only some pathological ones (one of which we 
both wrote code for, alright) with abnormally large send buffers?

-- 
Rémi Denis-Courmont
Maemo Software, Nokia Devices R&D


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2009-03-11 15:34 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-03-09 11:25 TCP rx window autotuning harmful at LAN context Marian Ďurkovič
2009-03-09 18:01 ` John Heffner
2009-03-09 20:05   ` Marian Ďurkovič
2009-03-09 20:24     ` Stephen Hemminger
2009-03-10  0:09     ` David Miller
2009-03-10  0:34       ` Rick Jones
2009-03-10  3:55         ` John Heffner
2009-03-10 17:20           ` Rick Jones
2009-03-11 10:03       ` Andi Kleen
2009-03-11 11:03         ` Marian Ďurkovič
2009-03-11 13:30         ` David Miller
2009-03-11 15:01           ` Andi Kleen
2009-03-11 14:56             ` Marian Ďurkovič
2009-03-11 15:34             ` John Heffner
     [not found]   ` <20090309195906.M50328@bts.sk>
2009-03-09 20:23     ` John Heffner
2009-03-09 20:33       ` Stephen Hemminger
2009-03-09 23:52       ` David Miller
2009-03-10  0:09         ` John Heffner
2009-03-10  5:19           ` Eric Dumazet
     [not found]       ` <20090310104956.GA81181@bts.sk>
2009-03-10 11:30         ` David Miller
2009-03-10 11:46           ` Marian Ďurkovič
2009-03-10 15:23             ` John Heffner
2009-03-10 16:00               ` Marian Ďurkovič
2009-03-10 16:18                 ` David Miller
2009-03-11  8:29                   ` Marian Ďurkovič
2009-03-11  8:41                     ` David Miller
2009-03-11  9:05                       ` Marian Ďurkovič
2009-03-11  9:11                       ` Eric Dumazet
2009-03-11 13:25                         ` David Miller
2009-03-11  9:02 ` Rémi Denis-Courmont

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).