* TCP rx window autotuning harmful at LAN context
@ 2009-03-09 11:25 Marian Ďurkovič
2009-03-09 18:01 ` John Heffner
2009-03-11 9:02 ` Rémi Denis-Courmont
0 siblings, 2 replies; 30+ messages in thread
From: Marian Ďurkovič @ 2009-03-09 11:25 UTC (permalink / raw)
To: netdev
Hi all,
Based on multiple user complaints about poor LAN performance with
TCP window autotuning on the receiver side, we conducted several tests at
our university to verify whether these complaints are valid. Unfortunately,
our results confirmed that the present implementation indeed behaves
erratically in a LAN context and causes serious harm to LAN operation.
The behaviour could be described as a "spiraling death" syndrome. While
TCP with a constant, decently sized rx window naturally reduces its
transmission rate when RTT increases, autotuning does exactly the opposite:
in response to an increased RTT it increases the rx window size (which in
turn increases RTT again...). As this repeats, the result is a complete
waste of all available buffers at the sending host or at the bottleneck
point, resulting in up to 267 msec (!) latency in a LAN context (with a
100 Mbps Ethernet connection, default txqueuelen=1000, MTU=1500 and the
sky2 driver). Needless to say, this means the LAN is almost unusable.
With autotuning disabled, the same situation results in just 5 msec
latency and still full 100 Mbps link utilization, since with a 64 kB rx
window the TCP transmission is controlled solely by the RTT, never entering
congestion-avoidance mode because there are no packet drops.
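For reference, the arithmetic behind these figures can be sketched as
follows (the 100 Mbps rate, ~5 msec idle RTT and the few-MB window are
taken from this report; the snippet is only a rough back-of-the-envelope
check, not a measurement):

/* rough check: BDP of a 100 Mbps / 5 msec path vs. the queueing delay
 * caused by keeping ~3 MB in flight on the same link */
#include <stdio.h>

int main(void)
{
	double rate_bps   = 100e6;  /* link rate in bits per second   */
	double base_rtt_s = 0.005;  /* idle LAN RTT reported above    */
	double big_win_b  = 3e6;    /* roughly 3 MB of in-flight data */

	printf("BDP: %.0f bytes (a ~64 kB window already covers it)\n",
	       rate_bps * base_rtt_s / 8.0);
	printf("Draining ~3 MB adds ~%.0f msec of queueing delay\n",
	       big_win_b * 8.0 / rate_bps * 1000.0);
	return 0;
}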
As rx window autotuning is enabled in all recent kernels, and with 1 GB
of RAM the maximum tcp_rmem becomes 4 MB, this problem is spreading rapidly
and we believe it needs urgent attention. As demonstrated above, such a huge
rx window (at least 100*BDP in the example above) does not deliver any
performance gain; instead it seriously harms other hosts and/or
applications. It should also be noted that a host with autotuning enabled
steals an unfair share of the total available bandwidth, which might look
like a "better" performing TCP stack at first sight - however, such
behaviour is not appropriate (RFC 2914, section 3.2).
A possible solution to the above problem could be, for example, to limit
RTT measurement to the initial phase of the TCP connection and to compute
the BDP from this value for the whole lifetime of the connection. With such
a modification, increases in RTT due to buffering at the bottleneck point
will again reduce the transmission rate, and the "spiraling death" syndrome
will no longer occur.
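For illustration only, a minimal sketch of what such a clamp might look
like (the structure, helper and the assumed 100 Mbps constant are invented
for this example and are not existing kernel code):

#include <stdint.h>
#include <stdio.h>

#define ASSUMED_LINK_BPS 100000000ULL  /* configurable bandwidth assumption */

struct conn_est {
	uint32_t init_rtt_us;    /* RTT measured over the first few segments */
	uint32_t rcv_wnd_clamp;  /* upper bound handed to autotuning         */
};

/* Freeze the clamp from the initial RTT; autotuning would still grow the
 * window as usual, just never beyond this value. */
static void set_rcv_wnd_clamp(struct conn_est *c)
{
	uint64_t bdp = ASSUMED_LINK_BPS / 8 * c->init_rtt_us / 1000000ULL;

	c->rcv_wnd_clamp = (uint32_t)bdp;
}

int main(void)
{
	struct conn_est lan = { .init_rtt_us = 5000 };    /* ~5 msec path   */
	struct conn_est wan = { .init_rtt_us = 200000 };  /* ~200 msec path */

	set_rcv_wnd_clamp(&lan);
	set_rcv_wnd_clamp(&wan);
	printf("clamp: LAN %u bytes, WAN %u bytes\n",
	       lan.rcv_wnd_clamp, wan.rcv_wnd_clamp);
	return 0;
}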
Thanks and kind regards,
--------------------------------------------------------------------------
---- ----
---- Marian Ďurkovič network manager ----
---- ----
---- Slovak Technical University Tel: +421 2 571 041 81 ----
---- Computer Centre, Nám. Slobody 17 Fax: +421 2 524 94 351 ----
---- 812 43 Bratislava, Slovak Republic E-mail/sip: md@bts.sk ----
---- ----
--------------------------------------------------------------------------
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: TCP rx window autotuning harmful at LAN context 2009-03-09 11:25 TCP rx window autotuning harmful at LAN context Marian Ďurkovič @ 2009-03-09 18:01 ` John Heffner 2009-03-09 20:05 ` Marian Ďurkovič [not found] ` <20090309195906.M50328@bts.sk> 2009-03-11 9:02 ` Rémi Denis-Courmont 1 sibling, 2 replies; 30+ messages in thread From: John Heffner @ 2009-03-09 18:01 UTC (permalink / raw) To: Marian Ďurkovič; +Cc: netdev On Mon, Mar 9, 2009 at 4:25 AM, Marian Ďurkovič <md@bts.sk> wrote: > As rx window autotuning is enabled in all recent kernels and with 1 GB > of RAM the maximum tcp_rmem becomes 4 MB, this problem is spreading rapidly > and we believe it needs urgent attention. As demontrated above, such huge > rx window (which is at least 100*BDP of the example above) does not deliver > any performance gain but instead it seriously harms other hosts and/or > applications. It should also be noted, that host with autotuning enabled > steals an unfair share of the total available bandwidth, which might look > like a "better" performing TCP stack at first sight - however such behaviour > is not appropriate (RFC2914, section 3.2). It's well known that "standard" TCP fills all available drop-tail buffers, and that this behavior is not desirable. The situation you describe is exactly what congestion control (the topic of RFC2914) should fix. It is not the role of receive window (flow control). It is really the sender's job to detect and react to this, not the receiver's. (We have had this discussion before on netdev.) There are a number of delay-based congestion control algorithms that have been implemented and are available in Linux, but all have proved problematic in many cases, and has not been suitable to enable widely. This is still an active research topic. Another option in LANs is to enable AQM. In Linux, you can configure the bottleneck interface qdisc to be any of a number of RED-like early droppers. Most commercial routers also offer the ability to configure AQM on interfaces, though most do not enable by default. -John ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: TCP rx window autotuning harmful at LAN context 2009-03-09 18:01 ` John Heffner @ 2009-03-09 20:05 ` Marian Ďurkovič 2009-03-09 20:24 ` Stephen Hemminger 2009-03-10 0:09 ` David Miller [not found] ` <20090309195906.M50328@bts.sk> 1 sibling, 2 replies; 30+ messages in thread From: Marian Ďurkovič @ 2009-03-09 20:05 UTC (permalink / raw) To: netdev On Mon, 9 Mar 2009 11:01:52 -0700, John Heffner wrote > On Mon, Mar 9, 2009 at 4:25 AM, Marian Ďurkovič <md@bts.sk> wrote: > > As rx window autotuning is enabled in all recent kernels and with 1 GB > > of RAM the maximum tcp_rmem becomes 4 MB, this problem is spreading > > rapidly > > and we believe it needs urgent attention. As demontrated above, such > > huge > > rx window (which is at least 100*BDP of the example above) does not > > deliver > > any performance gain but instead it seriously harms other hosts and/or > > applications. It should also be noted, that host with autotuning enabled > > steals an unfair share of the total available bandwidth, which might > > look > > like a "better" performing TCP stack at first sight - however such > > behaviour > > is not appropriate (RFC2914, section 3.2). > > It's well known that "standard" TCP fills all available drop-tail > buffers, and that this behavior is not desirable. Well, in practice that was always limited by receive window size, which was by default 64 kB on most operating systems. So this undesirable behavior was limited to hosts where receive window was manually increased to huge values. Today, the real effect of autotuning is the same as changing the receive window size to 4 MB on *all* hosts, since there's no mechanism to prevent it from growing the window to maximum even for low RTT paths. > The situation you describe is exactly what congestion control (the > topic of RFC2914) should fix. It is not the role of receive window > (flow control). It is really the sender's job to detect and react to > this, not the receiver's. (We have had this discussion before on > netdev.) It's not of high importance whose job it is according to pure theory. What matters is, that autotuning introduced serious problem at LAN context by disabling any possibility to properly react to increasing RTT. Again, it's not important whether this functionality was there by design or by coincidence, but it was holding the system well-balanced for many years. Now, as autotuning is enabled by default in stock kernel, this problem is spreading into LANs without users even knowing what's going on. Therefore I'd like to suggest to look for a decent fix which could be implemented in relatively short time frame. My proposal is this: - measure RTT during the initial phase of TCP connection (first X segments) - compute maximal receive window size depending on measured RTT using configurable constant representing the bandwidth part of BDP - let autotuning do its work upto that limit. With kind regards, M. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: TCP rx window autotuning harmful at LAN context 2009-03-09 20:05 ` Marian Ďurkovič @ 2009-03-09 20:24 ` Stephen Hemminger 2009-03-10 0:09 ` David Miller 1 sibling, 0 replies; 30+ messages in thread From: Stephen Hemminger @ 2009-03-09 20:24 UTC (permalink / raw) To: Marian Ďurkovič; +Cc: netdev On Mon, 9 Mar 2009 21:05:05 +0100 Marian Ďurkovič <md@bts.sk> wrote: > On Mon, 9 Mar 2009 11:01:52 -0700, John Heffner wrote > > On Mon, Mar 9, 2009 at 4:25 AM, Marian Ďurkovič <md@bts.sk> wrote: > > > As rx window autotuning is enabled in all recent kernels and with 1 GB > > > of RAM the maximum tcp_rmem becomes 4 MB, this problem is spreading > > > rapidly > > > and we believe it needs urgent attention. As demontrated above, such > > > huge > > > rx window (which is at least 100*BDP of the example above) does not > > > deliver > > > any performance gain but instead it seriously harms other hosts and/or > > > applications. It should also be noted, that host with autotuning enabled > > > steals an unfair share of the total available bandwidth, which might > > > look > > > like a "better" performing TCP stack at first sight - however such > > > behaviour > > > is not appropriate (RFC2914, section 3.2). > > > > It's well known that "standard" TCP fills all available drop-tail > > buffers, and that this behavior is not desirable. > > Well, in practice that was always limited by receive window size, which > was by default 64 kB on most operating systems. So this undesirable behavior > was limited to hosts where receive window was manually increased to huge > values. > > Today, the real effect of autotuning is the same as changing the receive window > size to 4 MB on *all* hosts, since there's no mechanism to prevent it from > growing the window to maximum even for low RTT paths. > > > The situation you describe is exactly what congestion control (the > > topic of RFC2914) should fix. It is not the role of receive window > > (flow control). It is really the sender's job to detect and react to > > this, not the receiver's. (We have had this discussion before on > > netdev.) > > It's not of high importance whose job it is according to pure theory. > What matters is, that autotuning introduced serious problem at LAN context > by disabling any possibility to properly react to increasing RTT. Again, > it's not important whether this functionality was there by design or by > coincidence, but it was holding the system well-balanced for many years. > > Now, as autotuning is enabled by default in stock kernel, this problem is > spreading into LANs without users even knowing what's going on. Therefore > I'd like to suggest to look for a decent fix which could be implemented > in relatively short time frame. My proposal is this: > > - measure RTT during the initial phase of TCP connection (first X segments) > - compute maximal receive window size depending on measured RTT using > configurable constant representing the bandwidth part of BDP > - let autotuning do its work upto that limit. > > With kind regards, > > M. So you have broken infrastructure or senders and you want to blame the receiver? The receiver is not responsible for flow control in TCP. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: TCP rx window autotuning harmful at LAN context 2009-03-09 20:05 ` Marian Ďurkovič 2009-03-09 20:24 ` Stephen Hemminger @ 2009-03-10 0:09 ` David Miller 2009-03-10 0:34 ` Rick Jones 2009-03-11 10:03 ` Andi Kleen 1 sibling, 2 replies; 30+ messages in thread From: David Miller @ 2009-03-10 0:09 UTC (permalink / raw) To: md; +Cc: netdev From: Marian Ďurkovič <md@bts.sk> Date: Mon, 9 Mar 2009 21:05:05 +0100 > Well, in practice that was always limited by receive window size, which > was by default 64 kB on most operating systems. So this undesirable behavior > was limited to hosts where receive window was manually increased to huge > values. You say "was" as if this was a recent change. Linux has been doing receive buffer autotuning for at least 5 years if not longer. > Today, the real effect of autotuning is the same as changing the > receive window size to 4 MB on *all* hosts, since there's no > mechanism to prevent it from growing the window to maximum even for > low RTT paths. There is, on the sender side (congestion control) and at the intermediate bottleneck routers (active queue management). You are pointing the blame at the wrong area, as both John and Stephen are trying to tell you. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: TCP rx window autotuning harmful at LAN context 2009-03-10 0:09 ` David Miller @ 2009-03-10 0:34 ` Rick Jones 2009-03-10 3:55 ` John Heffner 2009-03-11 10:03 ` Andi Kleen 1 sibling, 1 reply; 30+ messages in thread From: Rick Jones @ 2009-03-10 0:34 UTC (permalink / raw) To: David Miller; +Cc: md, netdev If I recall correctly, when I have asked about this behaviour in the past, I was told that the autotuning receiver would always try to offer the sender 2X what the receiver thought the sender's cwnd happened to be. Is my recollection incorrect, or is this then: [root@dl5855 ~]# netperf -t omni -H sut42 -- -k foo -s 128K OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to sut42.west (10.208.0.45) port 0 AF_INET THROUGHPUT=941.30 LSS_SIZE_REQ=131072 LSS_SIZE=262142 LSS_SIZE_END=262142 RSR_SIZE_REQ=-1 RSR_SIZE=87380 RSR_SIZE_END=3900000 not intended behaviour? LSS == Local Socket Send; RSR == Remote Socket Receive. dl5855 is running RHEL 5.2 (2.6.18-92.el5) sut42 is running a nf-next-2.6 about two or three weeks old with some of the 32-core scaling patches applied (2.6.29-rc5-nfnextconntrack) I'm assuming that by setting the SO_SNDBUF on the netperf (sending) side to 128K/256K that will be the limit on what it will ever put out onto the connection at one time, but by the end of the 10 second test over the local GbE LAN the receiver's autotuned SO_RCVBUF has grown to 3900000. rick jones ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: TCP rx window autotuning harmful at LAN context 2009-03-10 0:34 ` Rick Jones @ 2009-03-10 3:55 ` John Heffner 2009-03-10 17:20 ` Rick Jones 0 siblings, 1 reply; 30+ messages in thread From: John Heffner @ 2009-03-10 3:55 UTC (permalink / raw) To: Rick Jones; +Cc: David Miller, md, netdev On Mon, Mar 9, 2009 at 5:34 PM, Rick Jones <rick.jones2@hp.com> wrote: > If I recall correctly, when I have asked about this behaviour in the past, I > was told that the autotuning receiver would always try to offer the sender > 2X what the receiver thought the sender's cwnd happened to be. Is my > recollection incorrect, or is this then: > > [root@dl5855 ~]# netperf -t omni -H sut42 -- -k foo -s 128K > OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to sut42.west (10.208.0.45) > port 0 AF_INET > THROUGHPUT=941.30 > LSS_SIZE_REQ=131072 > LSS_SIZE=262142 > LSS_SIZE_END=262142 > RSR_SIZE_REQ=-1 > RSR_SIZE=87380 > RSR_SIZE_END=3900000 > > not intended behaviour? LSS == Local Socket Send; RSR == Remote Socket > Receive. dl5855 is running RHEL 5.2 (2.6.18-92.el5) sut42 is running a > nf-next-2.6 about two or three weeks old with some of the 32-core scaling > patches applied (2.6.29-rc5-nfnextconntrack) > > I'm assuming that by setting the SO_SNDBUF on the netperf (sending) side to > 128K/256K that will be the limit on what it will ever put out onto the > connection at one time, but by the end of the 10 second test over the local > GbE LAN the receiver's autotuned SO_RCVBUF has grown to 3900000. Hi Rick, (Pretty sure we went over this already, but once more..) The receiver does not size to twice cwnd. It sizes to twice the amount of data that the application read in one RTT. In the common case of a path bottleneck and a receiving application that always keeps up, this equals 2*cwnd, but the distinction is very important to understanding its behavior in other cases. In your test where you limit sndbuf to 256k, you will find that you did not fill up the bottleneck queues, and you did not get a significantly increased RTT, which are the negative effects we want to avoid. The large receive window caused no trouble at all. -John ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: TCP rx window autotuning harmful at LAN context 2009-03-10 3:55 ` John Heffner @ 2009-03-10 17:20 ` Rick Jones 0 siblings, 0 replies; 30+ messages in thread From: Rick Jones @ 2009-03-10 17:20 UTC (permalink / raw) To: John Heffner; +Cc: David Miller, md, netdev > (Pretty sure we went over this already, but once more..) Sometimes I am but dense north by northwest, but I am also occasionally simply dense regardless of the direction :) > The receiver does not size to twice cwnd. It sizes to twice the amount of > data that the application read in one RTT. In the common case of a path > bottleneck and a receiving application that always keeps up, this equals > 2*cwnd, but the distinction is very important to understanding its behavior in > other cases. > > In your test where you limit sndbuf to 256k, you will find that you > did not fill up the bottleneck queues, and you did not get a > significantly increased RTT, which are the negative effects we want to > avoid. The large receive window caused no trouble at all. What is the definition of "significantly" here? With my 256K capped SO_SNDBUF ping seems to report like this: [root@dl5855 ~]# ping sut42 PING sut42.west (10.208.0.45) 56(84) bytes of data. 64 bytes from sut42.west (10.208.0.45): icmp_seq=1 ttl=64 time=1.58 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=2 ttl=64 time=0.126 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=3 ttl=64 time=0.103 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=4 ttl=64 time=0.102 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=5 ttl=64 time=0.104 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=6 ttl=64 time=0.100 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=7 ttl=64 time=0.140 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=8 ttl=64 time=0.103 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=9 ttl=64 time=11.3 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=10 ttl=64 time=10.3 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=11 ttl=64 time=7.42 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=12 ttl=64 time=4.51 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=13 ttl=64 time=1.56 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=14 ttl=64 time=4.47 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=15 ttl=64 time=4.63 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=16 ttl=64 time=1.66 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=17 ttl=64 time=7.65 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=18 ttl=64 time=4.73 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=19 ttl=64 time=0.135 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=20 ttl=64 time=0.116 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=21 ttl=64 time=0.102 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=22 ttl=64 time=0.102 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=23 ttl=64 time=0.098 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=24 ttl=64 time=0.104 ms FWIW, when I uncap the SO_SNDBUF, the RTTs start to look like this instead: [root@dl5855 ~]# ping sut42 PING sut42.west (10.208.0.45) 56(84) bytes of data. 
64 bytes from sut42.west (10.208.0.45): icmp_seq=1 ttl=64 time=0.183 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=2 ttl=64 time=0.107 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=3 ttl=64 time=0.100 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=4 ttl=64 time=0.117 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=5 ttl=64 time=0.103 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=6 ttl=64 time=0.099 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=7 ttl=64 time=0.123 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=8 ttl=64 time=26.2 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=9 ttl=64 time=24.3 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=10 ttl=64 time=26.3 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=11 ttl=64 time=26.4 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=12 ttl=64 time=26.3 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=13 ttl=64 time=26.2 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=14 ttl=64 time=26.6 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=15 ttl=64 time=26.2 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=16 ttl=64 time=26.5 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=17 ttl=64 time=26.3 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=18 ttl=64 time=0.126 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=19 ttl=64 time=0.119 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=20 ttl=64 time=0.120 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=21 ttl=64 time=0.097 ms And then when I cap both sides to 64K requested/128K and still get link-rate the pings look like: [root@dl5855 ~]# ping sut42 PING sut42.west (10.208.0.45) 56(84) bytes of data. 64 bytes from sut42.west (10.208.0.45): icmp_seq=1 ttl=64 time=0.161 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=2 ttl=64 time=0.104 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=3 ttl=64 time=0.103 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=4 ttl=64 time=0.101 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=5 ttl=64 time=0.106 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=6 ttl=64 time=0.102 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=7 ttl=64 time=0.753 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=8 ttl=64 time=0.594 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=9 ttl=64 time=0.789 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=10 ttl=64 time=0.566 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=11 ttl=64 time=0.587 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=12 ttl=64 time=0.635 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=13 ttl=64 time=0.729 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=14 ttl=64 time=0.613 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=15 ttl=64 time=0.609 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=16 ttl=64 time=0.655 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=17 ttl=64 time=0.152 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=18 ttl=64 time=0.106 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=19 ttl=64 time=0.100 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=20 ttl=64 time=0.106 ms 64 bytes from sut42.west (10.208.0.45): icmp_seq=21 ttl=64 time=0.122 ms None of the above "absolves" the sender of course, but I still get wrapped around the axle of handing so much rope to senders when we know 99 times out of ten they are going to hang themselves with it. rick jones Netperf cannot tell me bytes received per RTT, but it can tell me the average bytes per recv() call. 
I'm not sure if that is a sufficient approximation but here are those three netperf runs re-run with remote_bytes_per_recv added to the output: [root@dl5855 ~]# netperf -t omni -H sut42 -- -k foo -s 64K -S 64K OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to sut42.west (10.208.0.45) port 0 AF_INET THROUGHPUT=941.07 LSS_SIZE_REQ=65536 LSS_SIZE=131072 LSS_SIZE_END=131072 RSR_SIZE_REQ=65536 RSR_SIZE=131072 RSR_SIZE_END=131072 REMOTE_BYTES_PER_RECV=8178.43 [root@dl5855 ~]# netperf -t omni -H sut42 -- -k foo -s 128K OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to sut42.west (10.208.0.45) port 0 AF_INET THROUGHPUT=941.31 LSS_SIZE_REQ=131072 LSS_SIZE=262142 LSS_SIZE_END=262142 RSR_SIZE_REQ=-1 RSR_SIZE=87380 RSR_SIZE_END=4194304 REMOTE_BYTES_PER_RECV=8005.97 [root@dl5855 ~]# netperf -t omni -H sut42 -- -k foo OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to sut42.west (10.208.0.45) port 0 AF_INET THROUGHPUT=941.33 LSS_SIZE_REQ=-1 LSS_SIZE=16384 LSS_SIZE_END=4194304 RSR_SIZE_REQ=-1 RSR_SIZE=87380 RSR_SIZE_END=4194304 REMOTE_BYTES_PER_RECV=8055.89 ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: TCP rx window autotuning harmful at LAN context 2009-03-10 0:09 ` David Miller 2009-03-10 0:34 ` Rick Jones @ 2009-03-11 10:03 ` Andi Kleen 2009-03-11 11:03 ` Marian Ďurkovič 2009-03-11 13:30 ` David Miller 1 sibling, 2 replies; 30+ messages in thread From: Andi Kleen @ 2009-03-11 10:03 UTC (permalink / raw) To: David Miller; +Cc: md, netdev David Miller <davem@davemloft.net> writes: > From: Marian ÄurkoviÄ <md@bts.sk> > Date: Mon, 9 Mar 2009 21:05:05 +0100 > >> Well, in practice that was always limited by receive window size, which >> was by default 64 kB on most operating systems. So this undesirable behavior >> was limited to hosts where receive window was manually increased to huge >> values. > > You say "was" as if this was a recent change. Linux has been doing > receive buffer autotuning for at least 5 years if not longer. I think his point was the only now does it become a visible problem as >= 1GB of memory is wide spread, which leads to 4MB rx buffer sizes. Perhaps this points to the default buffer sizing heuristics to be too aggressive for >= 1GB? Perhaps something like this patch? Marian, does that help? -Andi TCP: Lower per socket RX buffer sizing threshold Signed-off-by: Andi Kleen <ak@linux.intel.com> --- net/ipv4/tcp.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) Index: linux-2.6.28-test/net/ipv4/tcp.c =================================================================== --- linux-2.6.28-test.orig/net/ipv4/tcp.c 2009-02-09 11:06:52.000000000 +0100 +++ linux-2.6.28-test/net/ipv4/tcp.c 2009-03-11 11:01:53.000000000 +0100 @@ -2757,9 +2757,9 @@ sysctl_tcp_mem[1] = limit; sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2; - /* Set per-socket limits to no more than 1/128 the pressure threshold */ - limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 7); - max_share = min(4UL*1024*1024, limit); + /* Set per-socket limits to no more than 1/256 the pressure threshold */ + limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 8); + max_share = min(2UL*1024*1024, limit); sysctl_tcp_wmem[0] = SK_MEM_QUANTUM; sysctl_tcp_wmem[1] = 16*1024; -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: TCP rx window autotuning harmful at LAN context 2009-03-11 10:03 ` Andi Kleen @ 2009-03-11 11:03 ` Marian Ďurkovič 2009-03-11 13:30 ` David Miller 1 sibling, 0 replies; 30+ messages in thread From: Marian Ďurkovič @ 2009-03-11 11:03 UTC (permalink / raw) To: Andi Kleen; +Cc: netdev On Wed, Mar 11, 2009 at 11:03:35AM +0100, Andi Kleen wrote: > > You say "was" as if this was a recent change. Linux has been doing > > receive buffer autotuning for at least 5 years if not longer. > > I think his point was the only now does it become a visible problem > as >= 1GB of memory is wide spread, which leads to 4MB rx buffer sizes. Yes, exactly! We run into this after number of workstations were upgraded at once to a new hardware with 2GB of RAM. > Perhaps this points to the default buffer sizing heuristics to > be too aggressive for >= 1GB? > > Perhaps something like this patch? Marian, does that help? Sure - as it lowers the maximum from 4MB to 2MB, the net result is that RTTs at 100 Mbps immediately went down from 267 msec into: --- x.x.x.x ping statistics --- 10 packets transmitted, 10 received, 0% packet loss, time 8992ms rtt min/avg/max/mdev = 134.417/134.770/134.911/0.315 ms Still this is too high for 100 Mpbs network, since the RTTs with 64 KB static rx buffer look like this (with no performance penalty): --- x.x.x.x ping statistics -- 10 packets transmitted, 10 received, 0% packet loss, time 9000ms rtt min/avg/max/mdev = 5.163/5.355/5.476/0.102 ms I.e. the patch significantly helps as expected, however having one static limit for all NIC speeds as well as for the whole range of RTTs is suboptimal by principle. Thanks & kind regards, M. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: TCP rx window autotuning harmful at LAN context 2009-03-11 10:03 ` Andi Kleen 2009-03-11 11:03 ` Marian Ďurkovič @ 2009-03-11 13:30 ` David Miller 2009-03-11 15:01 ` Andi Kleen 1 sibling, 1 reply; 30+ messages in thread From: David Miller @ 2009-03-11 13:30 UTC (permalink / raw) To: andi; +Cc: md, netdev From: Andi Kleen <andi@firstfloor.org> Date: Wed, 11 Mar 2009 11:03:35 +0100 > Perhaps this points to the default buffer sizing heuristics to > be too aggressive for >= 1GB? It's necessary Andi, you can't fill a connection on a trans- continental connection without at least a 4MB receive buffer. Did you read the commit message of the change that increased the limit? ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: TCP rx window autotuning harmful at LAN context 2009-03-11 13:30 ` David Miller @ 2009-03-11 15:01 ` Andi Kleen 2009-03-11 14:56 ` Marian Ďurkovič 2009-03-11 15:34 ` John Heffner 0 siblings, 2 replies; 30+ messages in thread From: Andi Kleen @ 2009-03-11 15:01 UTC (permalink / raw) To: David Miller; +Cc: andi, md, netdev On Wed, Mar 11, 2009 at 06:30:58AM -0700, David Miller wrote: > From: Andi Kleen <andi@firstfloor.org> > Date: Wed, 11 Mar 2009 11:03:35 +0100 > > > Perhaps this points to the default buffer sizing heuristics to > > be too aggressive for >= 1GB? > > It's necessary Andi, you can't fill a connection on a trans- > continental connection without at least a 4MB receive buffer. Seems pretty arbitary to me. It's the value for a given bandwidth*latency product, but why not half or twice the bandwidth? I don't think that number is written in stone like you claim. Anyways it was just a test patch and it indeeds seems to address the problem at least partly. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: TCP rx window autotuning harmful at LAN context 2009-03-11 15:01 ` Andi Kleen @ 2009-03-11 14:56 ` Marian Ďurkovič 2009-03-11 15:34 ` John Heffner 1 sibling, 0 replies; 30+ messages in thread From: Marian Ďurkovič @ 2009-03-11 14:56 UTC (permalink / raw) To: Andi Kleen; +Cc: David Miller, netdev On Wed, Mar 11, 2009 at 04:01:49PM +0100, Andi Kleen wrote: > On Wed, Mar 11, 2009 at 06:30:58AM -0700, David Miller wrote: > > From: Andi Kleen <andi@firstfloor.org> > > Date: Wed, 11 Mar 2009 11:03:35 +0100 > > > > > Perhaps this points to the default buffer sizing heuristics to > > > be too aggressive for >= 1GB? > > > > It's necessary Andi, you can't fill a connection on a trans- > > continental connection without at least a 4MB receive buffer. > > Seems pretty arbitary to me. It's the value for a given bandwidth*latency > product, but why not half or twice the bandwidth? I don't think > that number is written in stone like you claim. Besides being arbitrary, it's also incorrect. The defaults at tcp.c are setting both tcp_wmem and tcp_rmem to 4 MB ignoring the fact, that it results in 4MB send buffer but only 3 MB receive buffer due to other defaults (tcp_adv_win_scale=2). Indeed, 3MB*(1538/1448)/100Mbps is equal to 267.3 msec - i.e. exactly the latency we're seeing. With kind regards, M. ^ permalink raw reply [flat|nested] 30+ messages in thread
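The 267.3 msec figure above can be reproduced with a short sketch
(constants as given in that message: 4 MB tcp_rmem, tcp_adv_win_scale=2
leaving 3/4 of the buffer as window, 1538 wire bytes per 1448-byte payload,
100 Mbps link; "MB" is treated as 2^20 bytes):

#include <stdio.h>

int main(void)
{
	double rmem   = 4.0 * 1024 * 1024;        /* tcp_rmem[2]            */
	double window = rmem - rmem / 4.0;        /* tcp_adv_win_scale=2    */
	double wire   = window * 1538.0 / 1448.0; /* per-segment overhead   */
	double delay  = wire * 8.0 / 100e6;       /* drain time at 100 Mbps */

	printf("effective window: %.0f bytes (~3 MB)\n", window);
	printf("worst-case queueing delay: %.1f msec\n", delay * 1000.0);
	return 0;
}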
* Re: TCP rx window autotuning harmful at LAN context 2009-03-11 15:01 ` Andi Kleen 2009-03-11 14:56 ` Marian Ďurkovič @ 2009-03-11 15:34 ` John Heffner 1 sibling, 0 replies; 30+ messages in thread From: John Heffner @ 2009-03-11 15:34 UTC (permalink / raw) To: Andi Kleen; +Cc: David Miller, md, netdev On Wed, Mar 11, 2009 at 8:01 AM, Andi Kleen <andi@firstfloor.org> wrote: > On Wed, Mar 11, 2009 at 06:30:58AM -0700, David Miller wrote: >> From: Andi Kleen <andi@firstfloor.org> >> Date: Wed, 11 Mar 2009 11:03:35 +0100 >> >> > Perhaps this points to the default buffer sizing heuristics to >> > be too aggressive for >= 1GB? >> >> It's necessary Andi, you can't fill a connection on a trans- >> continental connection without at least a 4MB receive buffer. > > Seems pretty arbitary to me. It's the value for a given bandwidth*latency > product, but why not half or twice the bandwidth? I don't think > that number is written in stone like you claim. It is of course just a number, though not exactly arbitrary -- it's approximately the required value for transcontinental 100 Mbps paths. Choosing the value is a matter of engineering trade-offs, and seemed like a reasonable cap at this time. Any cap so much lower that it would give a small bound for LAN latencies would bring us back to the bad old days where you couldn't get anything more than 10 Mbps on the wide area. -John ^ permalink raw reply [flat|nested] 30+ messages in thread
[parent not found: <20090309195906.M50328@bts.sk>]
* Re: TCP rx window autotuning harmful at LAN context [not found] ` <20090309195906.M50328@bts.sk> @ 2009-03-09 20:23 ` John Heffner 2009-03-09 20:33 ` Stephen Hemminger ` (2 more replies) 0 siblings, 3 replies; 30+ messages in thread From: John Heffner @ 2009-03-09 20:23 UTC (permalink / raw) To: Marian Ďurkovič; +Cc: netdev On Mon, Mar 9, 2009 at 1:02 PM, Marian Ďurkovič <md@bts.sk> wrote: > On Mon, 9 Mar 2009 11:01:52 -0700, John Heffner wrote >> On Mon, Mar 9, 2009 at 4:25 AM, Marian Ďurkovič <md@bts.sk> wrote: >> > As rx window autotuning is enabled in all recent kernels and with 1 GB >> > of RAM the maximum tcp_rmem becomes 4 MB, this problem is spreading rapidly >> > and we believe it needs urgent attention. As demontrated above, such huge >> > rx window (which is at least 100*BDP of the example above) does not deliver >> > any performance gain but instead it seriously harms other hosts and/or >> > applications. It should also be noted, that host with autotuning enabled >> > steals an unfair share of the total available bandwidth, which might look >> > like a "better" performing TCP stack at first sight - however such behaviour >> > is not appropriate (RFC2914, section 3.2). >> >> It's well known that "standard" TCP fills all available drop-tail >> buffers, and that this behavior is not desirable. > > Well, in practice that was always limited by receive window size, which > was by default 64 kB on most operating systems. So this undesirable behavior > was limited to hosts where receive window was manually increased to huge values. > > Today, the real effect of autotuning is the same as changing the receive window > size to 4 MB on *all* hosts, since there's no mechanism to prevent it from > growing the window to maximum even for low RTT paths. > >> The situation you describe is exactly what congestion control (the >> topic of RFC2914) should fix. It is not the role of receive window >> (flow control). It is really the sender's job to detect and react to >> this, not the receiver's. (We have had this discussion before on >> netdev.) > > It's not of high importance whose job it is according to pure theory. > What matters is, that autotuning introduced serious problem at LAN context > by disabling any possibility to properly react to increasing RTT. Again, > it's not important whether this functionality was there by design or by > coincidence, but it was holding the system well-balanced for many years. This is not a theoretical exercise, but one in good system design. This "well-balanced" system was really broken all along, and autotuning has exposed this. A drop-tail queue size of 1000 packets on a local interface is questionable, and I think this is the real source of your problem. This change was introduced a few years ago on most drivers -- generally used to be 100 by default. This was partly because TCP slow-start has problems when a drop-tail queue is smaller than the BDP. (Limited slow-start is meant to address this problem, but requires tuning to the right value.) Again, using AQM is likely the best solution. > Now, as autotuning is enabled by default in stock kernel, this problem is > spreading into LANs without users even knowing what's going on. Therefore > I'd like to suggest to look for a decent fix which could be implemented > in relatively short time frame. 
My proposal is this: > > - measure RTT during the initial phase of TCP connection (first X segments) > - compute maximal receive window size depending on measured RTT using > configurable constant representing the bandwidth part of BDP > - let autotuning do its work upto that limit. Let's take this proposal, and try it instead at the sender side, as part of congestion control. Would this proposal make sense in that position? Would you seriously consider it there? (As a side note, this is in fact what happens if you disable timestamps, since TCP cannot get an updated measurement of RTT without timestamps, only a lower bound. However, I consider this a limitation not a feature.) -John ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: TCP rx window autotuning harmful at LAN context 2009-03-09 20:23 ` John Heffner @ 2009-03-09 20:33 ` Stephen Hemminger 2009-03-09 23:52 ` David Miller [not found] ` <20090310104956.GA81181@bts.sk> 2 siblings, 0 replies; 30+ messages in thread From: Stephen Hemminger @ 2009-03-09 20:33 UTC (permalink / raw) To: John Heffner; +Cc: Marian Ďurkovič, netdev On Mon, 9 Mar 2009 13:23:15 -0700 John Heffner <johnwheffner@gmail.com> wrote: > On Mon, Mar 9, 2009 at 1:02 PM, Marian Ďurkovič <md@bts.sk> wrote: > > On Mon, 9 Mar 2009 11:01:52 -0700, John Heffner wrote > >> On Mon, Mar 9, 2009 at 4:25 AM, Marian Ďurkovič <md@bts.sk> wrote: > >> > As rx window autotuning is enabled in all recent kernels and with 1 GB > >> > of RAM the maximum tcp_rmem becomes 4 MB, this problem is spreading rapidly > >> > and we believe it needs urgent attention. As demontrated above, such huge > >> > rx window (which is at least 100*BDP of the example above) does not deliver > >> > any performance gain but instead it seriously harms other hosts and/or > >> > applications. It should also be noted, that host with autotuning enabled > >> > steals an unfair share of the total available bandwidth, which might look > >> > like a "better" performing TCP stack at first sight - however such behaviour > >> > is not appropriate (RFC2914, section 3.2). > >> > >> It's well known that "standard" TCP fills all available drop-tail > >> buffers, and that this behavior is not desirable. > > > > Well, in practice that was always limited by receive window size, which > > was by default 64 kB on most operating systems. So this undesirable behavior > > was limited to hosts where receive window was manually increased to huge values. > > > > Today, the real effect of autotuning is the same as changing the receive window > > size to 4 MB on *all* hosts, since there's no mechanism to prevent it from > > growing the window to maximum even for low RTT paths. > > > >> The situation you describe is exactly what congestion control (the > >> topic of RFC2914) should fix. It is not the role of receive window > >> (flow control). It is really the sender's job to detect and react to > >> this, not the receiver's. (We have had this discussion before on > >> netdev.) > > > > It's not of high importance whose job it is according to pure theory. > > What matters is, that autotuning introduced serious problem at LAN context > > by disabling any possibility to properly react to increasing RTT. Again, > > it's not important whether this functionality was there by design or by > > coincidence, but it was holding the system well-balanced for many years. > > This is not a theoretical exercise, but one in good system design. > This "well-balanced" system was really broken all along, and > autotuning has exposed this. > > A drop-tail queue size of 1000 packets on a local interface is > questionable, and I think this is the real source of your problem. > This change was introduced a few years ago on most drivers -- > generally used to be 100 by default. This was partly because TCP > slow-start has problems when a drop-tail queue is smaller than the > BDP. (Limited slow-start is meant to address this problem, but > requires tuning to the right value.) Again, using AQM is likely the > best solution. By default, sky2 queue is 511 pkts which is 6.2ms on @ 1G. Probably, should be half that by default. Also there is software transmit queue as well, which could be 0 unless some form of AQM is being done. 
> > > Now, as autotuning is enabled by default in stock kernel, this problem is > > spreading into LANs without users even knowing what's going on. Therefore > > I'd like to suggest to look for a decent fix which could be implemented > > in relatively short time frame. My proposal is this: > > > > - measure RTT during the initial phase of TCP connection (first X segments) > > - compute maximal receive window size depending on measured RTT using > > configurable constant representing the bandwidth part of BDP > > - let autotuning do its work upto that limit. > > Let's take this proposal, and try it instead at the sender side, as > part of congestion control. Would this proposal make sense in that > position? Would you seriously consider it there? > > (As a side note, this is in fact what happens if you disable > timestamps, since TCP cannot get an updated measurement of RTT without > timestamps, only a lower bound. However, I consider this a limitation > not a feature.) > > -John > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: TCP rx window autotuning harmful at LAN context 2009-03-09 20:23 ` John Heffner 2009-03-09 20:33 ` Stephen Hemminger @ 2009-03-09 23:52 ` David Miller 2009-03-10 0:09 ` John Heffner [not found] ` <20090310104956.GA81181@bts.sk> 2 siblings, 1 reply; 30+ messages in thread From: David Miller @ 2009-03-09 23:52 UTC (permalink / raw) To: johnwheffner; +Cc: md, netdev From: John Heffner <johnwheffner@gmail.com> Date: Mon, 9 Mar 2009 13:23:15 -0700 > A drop-tail queue size of 1000 packets on a local interface is > questionable, and I think this is the real source of your problem. Are you suggested we decrease it? :-) ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: TCP rx window autotuning harmful at LAN context 2009-03-09 23:52 ` David Miller @ 2009-03-10 0:09 ` John Heffner 2009-03-10 5:19 ` Eric Dumazet 0 siblings, 1 reply; 30+ messages in thread From: John Heffner @ 2009-03-10 0:09 UTC (permalink / raw) To: David Miller; +Cc: md, netdev On Mon, Mar 9, 2009 at 4:52 PM, David Miller <davem@davemloft.net> wrote: > From: John Heffner <johnwheffner@gmail.com> > Date: Mon, 9 Mar 2009 13:23:15 -0700 > >> A drop-tail queue size of 1000 packets on a local interface is >> questionable, and I think this is the real source of your problem. > > Are you suggested we decrease it? :-) I am not so bold. :-D (And note the drop-tail prefix.) A long queue with AQM would probably be best, but would require careful testing before enabling by default. It would almost certainly cause pain for some. And, for the vast majority of people for whom the local interface is not the bottleneck, it makes no difference. It hurts worst for someone doing bulk transfer with a GigE device in 100 Mbps (or worse, 10-Mbps) mode, where 1000 pkts is a long time, while simultaneously doing something latency-sensitive. I suspect this is the case Marian is experiencing. -John ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: TCP rx window autotuning harmful at LAN context 2009-03-10 0:09 ` John Heffner @ 2009-03-10 5:19 ` Eric Dumazet 0 siblings, 0 replies; 30+ messages in thread From: Eric Dumazet @ 2009-03-10 5:19 UTC (permalink / raw) To: John Heffner; +Cc: David Miller, md, netdev John Heffner a écrit : > On Mon, Mar 9, 2009 at 4:52 PM, David Miller <davem@davemloft.net> wrote: >> From: John Heffner <johnwheffner@gmail.com> >> Date: Mon, 9 Mar 2009 13:23:15 -0700 >> >>> A drop-tail queue size of 1000 packets on a local interface is >>> questionable, and I think this is the real source of your problem. >> Are you suggested we decrease it? :-) > > I am not so bold. :-D (And note the drop-tail prefix.) > > A long queue with AQM would probably be best, but would require > careful testing before enabling by default. It would almost certainly > cause pain for some. > > And, for the vast majority of people for whom the local interface is > not the bottleneck, it makes no difference. It hurts worst for > someone doing bulk transfer with a GigE device in 100 Mbps (or worse, > 10-Mbps) mode, where 1000 pkts is a long time, while simultaneously > doing something latency-sensitive. I suspect this is the case Marian > is experiencing. > Interesting stuff indeed. Could you tell us more about AQM ? ^ permalink raw reply [flat|nested] 30+ messages in thread
[parent not found: <20090310104956.GA81181@bts.sk>]
* Re: TCP rx window autotuning harmful at LAN context [not found] ` <20090310104956.GA81181@bts.sk> @ 2009-03-10 11:30 ` David Miller 2009-03-10 11:46 ` Marian Ďurkovič 0 siblings, 1 reply; 30+ messages in thread From: David Miller @ 2009-03-10 11:30 UTC (permalink / raw) To: md; +Cc: johnwheffner, netdev From: Marian Ďurkovič <md@bts.sk> Date: Tue, 10 Mar 2009 11:49:56 +0100 > Sender does not have the relevant info to implement this - it might be > connected by 10 GE to the highspeed backbone. Yes, the sender does indeed have this information, and using it is exactly what congestion control algorithms such as VEGAS try to do. They look at both round trip times and bandwith as they increase the send congestion window. And if round trips increase without a corresponding increase in bandwidth, they stop increasing. This is because in such a situation we can infer that we're just consuming more queue space at some intermediate router/switch rather than using more of the available bandwidth. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: TCP rx window autotuning harmful at LAN context 2009-03-10 11:30 ` David Miller @ 2009-03-10 11:46 ` Marian Ďurkovič 2009-03-10 15:23 ` John Heffner 0 siblings, 1 reply; 30+ messages in thread From: Marian Ďurkovič @ 2009-03-10 11:46 UTC (permalink / raw) To: David Miller; +Cc: johnwheffner, netdev On Tue, Mar 10, 2009 at 04:30:19AM -0700, David Miller wrote: > From: Marian Ďurkovič <md@bts.sk> > Date: Tue, 10 Mar 2009 11:49:56 +0100 > > > Sender does not have the relevant info to implement this - it might be > > connected by 10 GE to the highspeed backbone. > > Yes, the sender does indeed have this information, and using it is > exactly what congestion control algorithms such as VEGAS try to do. > > They look at both round trip times and bandwith as they increase the > send congestion window. And if round trips increase without a > corresponding increase in bandwidth, they stop increasing. Yes, but that's actual bandwidth between sender and receiver, not the hard BW limit of the receiver's NIC. My intention is just to introduce some safety belt preventing autotuning to increase the rx window into MB ranges when RTT is very low. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: TCP rx window autotuning harmful at LAN context 2009-03-10 11:46 ` Marian Ďurkovič @ 2009-03-10 15:23 ` John Heffner 2009-03-10 16:00 ` Marian Ďurkovič 0 siblings, 1 reply; 30+ messages in thread From: John Heffner @ 2009-03-10 15:23 UTC (permalink / raw) To: Marian Ďurkovič; +Cc: David Miller, netdev On Tue, Mar 10, 2009 at 4:46 AM, Marian Ďurkovič <md@bts.sk> wrote: > On Tue, Mar 10, 2009 at 04:30:19AM -0700, David Miller wrote: >> From: Marian Ďurkovič <md@bts.sk> >> Date: Tue, 10 Mar 2009 11:49:56 +0100 >> >> > Sender does not have the relevant info to implement this - it might be >> > connected by 10 GE to the highspeed backbone. >> >> Yes, the sender does indeed have this information, and using it is >> exactly what congestion control algorithms such as VEGAS try to do. >> >> They look at both round trip times and bandwith as they increase the >> send congestion window. And if round trips increase without a >> corresponding increase in bandwidth, they stop increasing. > > Yes, but that's actual bandwidth between sender and receiver, not > the hard BW limit of the receiver's NIC. My intention is just to introduce > some safety belt preventing autotuning to increase the rx window > into MB ranges when RTT is very low. Nowhere in our proposal do you use NIC bandwidth. What you proposed can be done easily at the sender. -John ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: TCP rx window autotuning harmful at LAN context 2009-03-10 15:23 ` John Heffner @ 2009-03-10 16:00 ` Marian Ďurkovič 2009-03-10 16:18 ` David Miller 0 siblings, 1 reply; 30+ messages in thread From: Marian Ďurkovič @ 2009-03-10 16:00 UTC (permalink / raw) To: John Heffner; +Cc: David Miller, netdev > > Yes, but that's actual bandwidth between sender and receiver, not > > the hard BW limit of the receiver's NIC. My intention is just to introduce > > some safety belt preventing autotuning to increase the rx window > > into MB ranges when RTT is very low. > > Nowhere in our proposal do you use NIC bandwidth. What you proposed > can be done easily at the sender. Only if you *absolutely* trust the sender to do everything correctly. That's never the case on global scale - some senders are buggy, some teribly outdated, some incorrectly configured, some using different congestion control scheme... Again, autotuning in its present form removes all safety at the receiver side and allows senders to easily bring LANs down. IMHO we need to fix this before the problem spreads even more. Thanks & kind regards, M. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: TCP rx window autotuning harmful at LAN context 2009-03-10 16:00 ` Marian Ďurkovič @ 2009-03-10 16:18 ` David Miller 2009-03-11 8:29 ` Marian Ďurkovič 0 siblings, 1 reply; 30+ messages in thread From: David Miller @ 2009-03-10 16:18 UTC (permalink / raw) To: md; +Cc: johnwheffner, netdev From: Marian Ďurkovič <md@bts.sk> Date: Tue, 10 Mar 2009 17:00:40 +0100 > Again, autotuning in its present form removes all safety at the > receiver side and allows senders to easily bring LANs down. IMHO we > need to fix this before the problem spreads even more. There are both global system-wide and socket local limits to how much memory can be consumed by TCP receive data. If things get beyond the configured limits, we back off. You could modify those if you personally wish. It's really good that you brought up this issue. And it's really good that you've explained your own personal workaround for this issue. But it's not good that you want to impose your choosen workaround on everyone else. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: TCP rx window autotuning harmful at LAN context 2009-03-10 16:18 ` David Miller @ 2009-03-11 8:29 ` Marian Ďurkovič 2009-03-11 8:41 ` David Miller 0 siblings, 1 reply; 30+ messages in thread From: Marian Ďurkovič @ 2009-03-11 8:29 UTC (permalink / raw) To: David Miller; +Cc: johnwheffner, netdev On Tue, Mar 10, 2009 at 09:18:16AM -0700, David Miller wrote: > There are both global system-wide and socket local limits to how much > memory can be consumed by TCP receive data. If things get beyond the > configured limits, we back off. You could modify those if you > personally wish. > > It's really good that you brought up this issue. > > And it's really good that you've explained your own personal > workaround for this issue. Beg your pardon - "personal" ?! Is our university the only place where people use Linux on workstations with 100 Mbps ethernet connection? Isn't the stock kernel supposed to work decently for them - or should they all become TCP experts and fiddle with various parameters in order not to cause harm to other applications or the whole LAN just by starting a single bulk transfer? For the last time: setting TCP window to BDP is well-known and generally accepted practice. Autotuning does NOT respect it, and for 100 Mpbs connections at LAN context it might set the rx window somewhere between 100*BDP and 300*BDP. Since the BDP formula obviously applies also in reverse direction, i.e. delay=window/bandwith setting insanely huge window results in insanely increased LAN latencies (upto buffer limits). Is this really something noone cares about ?! ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: TCP rx window autotuning harmful at LAN context 2009-03-11 8:29 ` Marian Ďurkovič @ 2009-03-11 8:41 ` David Miller 2009-03-11 9:05 ` Marian Ďurkovič 2009-03-11 9:11 ` Eric Dumazet 0 siblings, 2 replies; 30+ messages in thread From: David Miller @ 2009-03-11 8:41 UTC (permalink / raw) To: md; +Cc: johnwheffner, netdev From: Marian Ďurkovič <md@bts.sk> Date: Wed, 11 Mar 2009 09:29:20 +0100 > For the last time: Thankfully... > setting TCP window to BDP is well-known and generally accepted > practice. Autotuning does NOT respect it, and for 100 Mpbs > connections at LAN context it might set the rx window somewhere > between 100*BDP and 300*BDP. Since the BDP formula obviously applies > also in reverse direction, i.e. It's the congestion control algorithm on the sender making this happen, not window autosizing. The window autosizing is only providing for flow control. It's the congestion control algorithm that is deciding to send more and more into a path where only latency (and not bandwidth) is increasing with larger congestion window values. John has tried to explain this to you, and now I have also made an effort. So please stop ignoring what the real issue is here. You also could use Active Queue Management. But I doubt you would bother even testing such a thing to let us know how well that works in your situation. You've already decided how you are willing to handle this issue, so it's a fait accompli. It's seems to be not even a matter for discussion for you, so that's why this thread will likely go nowhere if it's entirely up to you. > delay=window/bandwith > > setting insanely huge window results in insanely increased LAN latencies > (upto buffer limits). Is this really something noone cares about ?! Let me clue you in about something you may not be aware of. If you don't auto-tune and let the RX socket buffer increase up to a few megabytes, you cannot fully utilize the link on real trans-continental connections people are using over the internet today. So your suggestion would be a huge step backwards. This is why you keep being told that what you're asking us to do is not appropriate. You can't even talk 100Mbit between New York and San Francisco without appropriately sized large RX buffers, and RX autotuning is the only way to achieve that now. Similarly for west coast US to anywhere in the Asia Pacific region. So the world is much bigger than your little university where you've decided to oversubscribe your network, and there are many other issues to consider besides your specific localized problem. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: TCP rx window autotuning harmful at LAN context 2009-03-11 8:41 ` David Miller @ 2009-03-11 9:05 ` Marian Ďurkovič 2009-03-11 9:11 ` Eric Dumazet 1 sibling, 0 replies; 30+ messages in thread From: Marian Ďurkovič @ 2009-03-11 9:05 UTC (permalink / raw) To: David Miller; +Cc: johnwheffner, netdev > Let me clue you in about something you may not be aware of. > > If you don't auto-tune and let the RX socket buffer increase up > to a few megabytes, you cannot fully utilize the link on real > trans-continental connections people are using over the internet > today. > > So your suggestion would be a huge step backwards. Are you kidding or treating anyone else but you a complete idiot? I never said autotuning should be disabled ! What I proposed is to limit the maximum autotuned buffer size to: NIC full bandwidth * RTT measured during initial phase of TCP connection This would for 100 Mbps connection become: at RTT 5 msec 64 kB at RTT 50 msec 640 kB at RTT 200 msec 2,56 MB With 1 Gbps connection this will become: at RTT 5 msec 640 kB at RTT 50 msec 6,4 MB at RTT 200 msec 25,6 MB (if your hardlimit is that big). In fact this will IMHO work much better than today, since you'll be able to use even larger hardlimits (not 4 MB but e.g. 16 MB if you wish) and still be protected from overflowing all buffers at your LAN or any other low RTT paths. > So the world is much bigger than your little university where you've > decided to oversubscribe your network, and there are many other issues > to consider besides your specific localized problem. Please spare such junk for yourself and please start talking about technical matters. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: TCP rx window autotuning harmful at LAN context 2009-03-11 8:41 ` David Miller 2009-03-11 9:05 ` Marian Ďurkovič @ 2009-03-11 9:11 ` Eric Dumazet 2009-03-11 13:25 ` David Miller 1 sibling, 1 reply; 30+ messages in thread From: Eric Dumazet @ 2009-03-11 9:11 UTC (permalink / raw) To: David Miller; +Cc: md, johnwheffner, netdev David Miller a écrit : > From: Marian Ďurkovič <md@bts.sk> > Date: Wed, 11 Mar 2009 09:29:20 +0100 > >> For the last time: > > Thankfully... > >> setting TCP window to BDP is well-known and generally accepted >> practice. Autotuning does NOT respect it, and for 100 Mpbs >> connections at LAN context it might set the rx window somewhere >> between 100*BDP and 300*BDP. Since the BDP formula obviously applies >> also in reverse direction, i.e. > > It's the congestion control algorithm on the sender making this > happen, not window autosizing. The window autosizing is only > providing for flow control. It's the congestion control algorithm > that is deciding to send more and more into a path where only > latency (and not bandwidth) is increasing with larger congestion > window values. > > John has tried to explain this to you, and now I have also made an > effort. So please stop ignoring what the real issue is here. > > You also could use Active Queue Management. But I doubt you would > bother even testing such a thing to let us know how well that works in > your situation. You've already decided how you are willing to handle > this issue, so it's a fait accompli. > I am interested to know how use AQM in practice. Isnt it a matter of : Using RED on linux hosts, with 'ecn' flag to mark packets instead of droping them if possible. Using ECN enabled clients and servers. (Assuming most trafic is TCP) Last time I checked, windows XP doesnt have ECN support. Am I wrong ? Then in the Marian case, it has many senders that might send data to one target. Active Queue Management wont triger at sender level, so we need ECN capable routers that are able to use ECN to mark packets, because only these routers will notice a queue congestion ? Or maybe my focus on ECN is not relevant, since it may be marginal and only save some percent of bandwidth ? > It's seems to be not even a matter for discussion for you, so that's > why this thread will likely go nowhere if it's entirely up to you. > >> delay=window/bandwith >> >> setting insanely huge window results in insanely increased LAN latencies >> (upto buffer limits). Is this really something noone cares about ?! > > Let me clue you in about something you may not be aware of. > > If you don't auto-tune and let the RX socket buffer increase up > to a few megabytes, you cannot fully utilize the link on real > trans-continental connections people are using over the internet > today. > > So your suggestion would be a huge step backwards. > > This is why you keep being told that what you're asking us to do > is not appropriate. > > You can't even talk 100Mbit between New York and San Francisco > without appropriately sized large RX buffers, and RX autotuning > is the only way to achieve that now. > > Similarly for west coast US to anywhere in the Asia Pacific region. > > So the world is much bigger than your little university where you've > decided to oversubscribe your network, and there are many other issues > to consider besides your specific localized problem. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: TCP rx window autotuning harmful at LAN context 2009-03-11 9:11 ` Eric Dumazet @ 2009-03-11 13:25 ` David Miller 0 siblings, 0 replies; 30+ messages in thread From: David Miller @ 2009-03-11 13:25 UTC (permalink / raw) To: dada1; +Cc: md, johnwheffner, netdev From: Eric Dumazet <dada1@cosmosbay.com> Date: Wed, 11 Mar 2009 10:11:10 +0100 > I am interested to know how use AQM in practice. You just need RED, no need for ECN or anything like that. RED will drop randomly when a certain percentage of the backlog queue is consumed, and then behave like tail-drop after the next configured threshold is reached. It prevents TCPs from synchronizing, which is what happens with pure tail-drop routers. ^ permalink raw reply [flat|nested] 30+ messages in thread
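As a rough illustration of the RED behaviour described above (textbook
form with invented example thresholds - this is not the kernel's sch_red
implementation): packets are queued while the average queue stays below a
lower threshold, dropped outright above an upper threshold, and dropped
with a probability that grows with the average queue size in between:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct red {
	double avg;     /* EWMA of the queue length, in packets */
	double w;       /* EWMA weight                          */
	double min_th;  /* start of random early dropping       */
	double max_th;  /* above this, behave like tail-drop    */
	double max_p;   /* drop probability at max_th           */
};

static bool red_drop(struct red *r, unsigned int qlen)
{
	double p;

	r->avg = (1.0 - r->w) * r->avg + r->w * qlen;
	if (r->avg < r->min_th)
		return false;                  /* enqueue           */
	if (r->avg >= r->max_th)
		return true;                   /* hard drop         */
	p = r->max_p * (r->avg - r->min_th) / (r->max_th - r->min_th);
	return (double)rand() / RAND_MAX < p;  /* early random drop */
}

int main(void)
{
	struct red r = { .avg = 0, .w = 0.02, .min_th = 50,
			 .max_th = 150, .max_p = 0.1 };
	unsigned int q, drops = 0;

	srand(1);
	for (q = 0; q < 300; q++)              /* queue ramping up  */
		drops += red_drop(&r, q);
	printf("early drops while the queue ramps to 300: %u\n", drops);
	return 0;
}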
* Re: TCP rx window autotuning harmful at LAN context 2009-03-09 11:25 TCP rx window autotuning harmful at LAN context Marian Ďurkovič 2009-03-09 18:01 ` John Heffner @ 2009-03-11 9:02 ` Rémi Denis-Courmont 1 sibling, 0 replies; 30+ messages in thread From: Rémi Denis-Courmont @ 2009-03-11 9:02 UTC (permalink / raw) To: ext Marian Ďurkovič; +Cc: netdev@vger.kernel.org On Monday 09 March 2009 13:25:21 ext Marian Ďurkovič wrote: > The behaviour could be descibed as "spiraling death" syndrome. While > TCP with constant and decently sized rx window natively reduces > transmission rate when RTT increases, autotuning performs exactly the > opposite - as a response to increased RTT it increases the rx window size > (which in turn again increases RTT...) As this happens again and again, the > result is complete waste of all available buffers at sending host or at the > bottleneck point, resulting in upto 267 msec (!) latency in LAN context > (with 100 Mbps ethernet connection, default txqueuelen=1000, MTU=1500 and > sky2 driver). Needles to say that this means the LAN is almost unusable. This is very likely a stupid question, but anyway... Is this with all applications, or only some pathological ones (one of which we both wrote code for, alright) with abnormally large send buffers? -- Rémi Denis-Courmont Maemo Software, Nokia Devices R&D ^ permalink raw reply [flat|nested] 30+ messages in thread