netdev.vger.kernel.org archive mirror
* Re: Socket buffer sizes with autotuning
@ 2008-04-23 23:29 Jerry Chu
  2008-04-24 16:32 ` John Heffner
  2008-04-25  7:05 ` David Miller
  0 siblings, 2 replies; 56+ messages in thread
From: Jerry Chu @ 2008-04-23 23:29 UTC (permalink / raw)
  To: netdev

I've been seeing the same problem here and am trying to fix it.
My fix is to not count those pkts still in the host queue as "prior_in_flight"
when feeding the latter to tcp_cong_avoid(). This should cause the
tcp_is_cwnd_limited() test to fail when the previous in_flight build-up
is entirely due to the large host queue, and stop cwnd from growing beyond
what's really necessary.
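
Roughly, the idea is something like the following sketch (illustration only,
not the actual patch; tcp_packets_in_host_queue() is a made-up placeholder
for however the host-queue occupancy would actually be tracked):

#include <net/tcp.h>

/*
 * Sketch only, not the actual patch.  tcp_packets_in_host_queue() is a
 * hypothetical helper that would count this socket's segments already
 * handed to the qdisc/driver but not yet put on the wire.
 */
static u32 tcp_effective_in_flight(struct sock *sk)
{
        struct tcp_sock *tp = tcp_sk(sk);
        u32 in_flight = tcp_packets_in_flight(tp);
        u32 host_queued = tcp_packets_in_host_queue(sk);  /* hypothetical */

        /*
         * Segments sitting in the host queue are not really in flight on
         * the network path.  Counting them keeps tcp_is_cwnd_limited()
         * returning true even when the only "congestion" is the local
         * queue, so cwnd grows to cover the host queue as well.
         */
        return in_flight > host_queued ? in_flight - host_queued : 0;
}

/* tcp_ack() would then use this value as prior_in_flight when calling
 * tcp_cong_avoid(). */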

"sysctl_tcp_tso_win_divisor causes cwnd" also unnecessarily inflates
cwnd quite a bit when TSO is enabled.
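
For context, a simplified sketch of the GSO/TSO branch of that check (not
the exact code; see tcp_is_cwnd_limited() in net/ipv4/tcp_cong.c for the
real thing):

/*
 * Simplified sketch of the cwnd-limited test when GSO/TSO is in use;
 * not the exact kernel code.
 */
static int cwnd_limited_sketch(unsigned int in_flight,
                               unsigned int snd_cwnd,
                               unsigned int win_divisor)
{
        unsigned int left;

        if (in_flight >= snd_cwnd)
                return 1;               /* truly window-limited */

        left = snd_cwnd - in_flight;
        if (win_divisor)
                /*
                 * With the default divisor of 3, the sender still counts
                 * as "limited" whenever less than a third of cwnd is
                 * unused, so cwnd keeps growing even though a chunk of
                 * it is never needed on the wire.
                 */
                return left * win_divisor < snd_cwnd;

        return left <= 3;               /* roughly: small leftover counts */
}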

Jerry

From: David Miller <davem@davemloft.net>

> Date: Tue, Apr 22, 2008 at 8:59 PM
> Subject: Re: Socket buffer sizes with autotuning
> To: johnwheffner@gmail.com
> Cc: rick.jones2@hp.com, netdev@vger.kernel.org
>
> From: "John Heffner" <johnwheffner@gmail.com>
> Date: Tue, 22 Apr 2008 19:17:39 -0700
>
> > On Tue, Apr 22, 2008 at 5:38 PM, Rick Jones <rick.jones2@hp.com> wrote:
> > >  oslowest:~# netstat -an | grep ESTAB
> > >  ...
> > >  tcp    0 2760560 10.208.0.1:40500    10.208.0.45:42049   ESTABLISHED
> > >  ...
> > >
> > >  Is this expected behaviour?
> >
>
> > What is your interface txqueuelen and mtu?  If you have a very large
> > interface queue, TCP will happily fill it up unless you are using a
> > delay-based congestion controller.
>
>
> Yes, that's the fundamental problem with loss based congestion
> control.  If there are any queues in the path, TCP will fill them up.
>
> Vegas and other similar techniques are able to avoid this, but come
> with the fundamental flaw that it's easy to get them into situations
> where they do not respond to increases in pipe space adequately, and
> thus underperform compared to loss based algorithms.

[parent not found: <d1c2719f0804241829s1bc3f41ejf7ebbff73ed96578@mail.gmail.com>]
* Socket buffer sizes with autotuning
@ 2008-04-23  0:38 Rick Jones
  2008-04-23  2:17 ` John Heffner
  0 siblings, 1 reply; 56+ messages in thread
From: Rick Jones @ 2008-04-23  0:38 UTC (permalink / raw)
  To: Linux Network Development list

One of the issues with netperf and linux is that netperf only snaps the 
socket buffer size at the beginning of the connection.  This of course 
does not catch what the socket buffer size might become over the 
lifetime of the connection.  So, in the in-development "omni" tests I've 
added code that when running on Linux will snap the socket buffer sizes 
at both the beginning and end of the data connection.  I was a trifle 
surprised at some of what I saw with a 1G connection between systems - 
when autoscaling/ranging/tuning/whatever was active (netperf taking 
defaults and not calling setsockopt()) I was seeing the socket buffer 
size at the end of the connection up at 4MB:

sut34:~/netperf2_trunk# netperf -l 1 -t omni -H oslowest -- -d 4 -o bar 
-s -1 -S -1 -m ,16K
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to oslowest.raj 
(10.208.0.1) port 0 AF_INET
Throughput,Direction,Local Release,Local Recv Socket Size 
Requested,Local Recv Socket Size Initial,Local Recv Socket Size 
Final,Remote Release,Remote Send Socket Size Requested,Remote Send 
Socket Size Initial,Remote Send Socket Size Final
940.52,Receive,2.6.25-raj,-1,87380,4194304,2.6.18-5-mckinley,-1,16384,4194304

Which was the limit of the autotuning:

net.ipv4.tcp_wmem = 16384       16384   4194304
net.ipv4.tcp_rmem = 16384       87380   4194304
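
For reference, the end-of-connection "snap" is just a getsockopt() on the
data socket, along the lines of this minimal sketch (not netperf's actual
source):

#include <stdio.h>
#include <sys/socket.h>

/* Minimal sketch (not netperf's actual code): read the current send and
 * receive buffer sizes of an open data socket fd, the way the omni test
 * snaps them at the start and end of the connection. */
static void snap_socket_buffers(int fd)
{
        int sndbuf = 0, rcvbuf = 0;
        socklen_t len = sizeof(sndbuf);

        getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len);
        len = sizeof(rcvbuf);
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len);
        printf("SO_SNDBUF=%d SO_RCVBUF=%d\n", sndbuf, rcvbuf);
}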

The test above is basically the omni version of a TCP_MAERTS test from a 
2.6.18 system to a 2.6.25 system (kernel bits grabbed about 40 minutes 
ago from http://www.kernel.org/hg/linux-2.6).  The receiving system on 
which the 2.6.25 bits were compiled and run started life as a Debian 
Lenny/Testing system.  The sender is iirc Debian Etch.

It seemed odd to me that one would need a 4MB socket buffer to get 
link-rate on gigabit, so I ran a quick set of tests to confirm in my 
mind that indeed, a much smaller socket buffer was sufficient:

sut34:~/netperf2_trunk# HDR="-P 1"; for i in -1 32K 64K 128K 256K 512K; 
do netperf -l 20 -t omni -H oslowest $HDR -- -d 4 -o bar -s $i -S $i -m 
,16K; HDR="-P 0"; done
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to oslowest.raj 
(10.208.0.1) port 0 AF_INET
Throughput,Direction,Local Release,Local Recv Socket Size 
Requested,Local Recv Socket Size Initial,Local Recv Socket Size 
Final,Remote Release,Remote Send Socket Size Requested,Remote Send 
Socket Size Initial,Remote Send Socket Size Final
941.38,Receive,2.6.25-raj,-1,87380,4194304,2.6.18-5-mckinley,-1,16384,4194304
939.29,Receive,2.6.25-raj,32768,65536,65536,2.6.18-5-mckinley,32768,65536,65536
940.28,Receive,2.6.25-raj,65536,131072,131072,2.6.18-5-mckinley,65536,131072,131072
940.96,Receive,2.6.25-raj,131072,262142,262142,2.6.18-5-mckinley,131072,253952,253952
940.99,Receive,2.6.25-raj,262144,262142,262142,2.6.18-5-mckinley,262144,253952,253952
940.98,Receive,2.6.25-raj,524288,262142,262142,2.6.18-5-mckinley,524288,253952,253952
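
(As a rough sanity check on those numbers, and assuming a LAN round-trip
time on the order of a millisecond: 1 Gbit/s is roughly 125 MB/s, so the
bandwidth-delay product is about 125 MB/s * 0.001 s = ~125 KB.  That squares
with the 128K-256K settings above already hitting link rate, and suggests
the 4MB autotuned value is far beyond what the path itself needs.)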

And then I decided to let the receiver autotune while the sender was 
either autotuning or fixed (simulating something other than Linux sending 
I suppose):
sut34:~/netperf2_trunk# HDR="-P 1"; for i in -1 32K 64K 128K 256K 512K; 
do netperf -l 20 -t omni -H oslowest $HDR -- -d 4 -o bar -s -1 -S $i -m 
,16K; HDR="-P 0"; done
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to oslowest.raj 
(10.208.0.1) port 0 AF_INET
Throughput,Direction,Local Release,Local Recv Socket Size 
Requested,Local Recv Socket Size Initial,Local Recv Socket Size 
Final,Remote Release,Remote Send Socket Size Requested,Remote Send 
Socket Size Initial,Remote Send Socket Size Final
941.38,Receive,2.6.25-raj,-1,87380,4194304,2.6.18-5-mckinley,-1,16384,4194304
941.34,Receive,2.6.25-raj,-1,87380,1337056,2.6.18-5-mckinley,32768,65536,65536
941.35,Receive,2.6.25-raj,-1,87380,1814576,2.6.18-5-mckinley,65536,131072,131072
941.38,Receive,2.6.25-raj,-1,87380,2645664,2.6.18-5-mckinley,131072,253952,253952
941.39,Receive,2.6.25-raj,-1,87380,2649728,2.6.18-5-mckinley,262144,253952,253952
941.38,Receive,2.6.25-raj,-1,87380,2653792,2.6.18-5-mckinley,524288,253952,253952

Finally, to see what was going on on the wire (in case it was simply the 
socket buffer getting larger and not also the window) I took a packet 
trace on the sender to look at the window updates coming back, and sure 
enough, by the end of the connection (wscale = 7) the advertised window 
was huge:

17:10:00.522200 IP sut34.raj.53459 > oslowest.raj.37322: S 
3334965237:3334965237(0) win 5840 <mss 1460,sackOK,timestamp 4294921737 
0,nop,wscale 7>
17:10:00.522214 IP oslowest.raj.37322 > sut34.raj.53459: S 
962695631:962695631(0) ack 3334965238 win 5792 <mss 
1460,sackOK,timestamp 3303630187 4294921737,nop,wscale 7>

...

17:10:01.554698 IP sut34.raj.53459 > oslowest.raj.37322: . ack 121392225 
win 24576 <nop,nop,timestamp 4294921995 3303630438>
17:10:01.554706 IP sut34.raj.53459 > oslowest.raj.37322: . ack 121395121 
win 24576 <nop,nop,timestamp 4294921995 3303630438>
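
(With the wscale of 7 negotiated above, "win 24576" in those ACKs works out
to 24576 << 7 = 3145728 bytes, i.e. roughly a 3MB advertised window.)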

I also checked (during a different connection, autotuning at both ends) 
how much was actually queued at the sender, and it was indeed rather large:

oslowest:~# netstat -an | grep ESTAB
...
tcp    0 2760560 10.208.0.1:40500    10.208.0.45:42049   ESTABLISHED
...

Is this expected behaviour?

rick jones


Thread overview: 56+ messages
2008-04-23 23:29 Socket buffer sizes with autotuning Jerry Chu
2008-04-24 16:32 ` John Heffner
2008-04-25  0:49   ` Jerry Chu
2008-04-25  6:46     ` David Miller
2008-04-25 21:29       ` Jerry Chu
2008-04-25 21:35         ` David Miller
2008-04-28 18:30       ` Jerry Chu
2008-04-28 19:21         ` John Heffner
2008-04-28 20:44           ` Jerry Chu
2008-04-28 23:22             ` [PATCH 1/2] [NET]: Allow send-limited cwnd to grow up to max_burst when gso disabled John Heffner
2008-04-28 23:22               ` [PATCH 2/2] [NET]: Limit cwnd growth when deferring for GSO John Heffner
     [not found]           ` <d1c2719f0804281338j3984cf2bga31def0c2c1192a1@mail.gmail.com>
2008-04-28 23:28             ` Socket buffer sizes with autotuning John Heffner
2008-04-28 23:35               ` David Miller
2008-04-29  2:20               ` Jerry Chu
2008-04-25  7:05 ` David Miller
2008-05-07  3:57   ` Jerry Chu
2008-05-07  4:27     ` David Miller
2008-05-07 18:36       ` Jerry Chu
2008-05-07 21:18         ` David Miller
2008-05-08  1:37           ` Jerry Chu
2008-05-08  1:43             ` David Miller
2008-05-08  3:33               ` Jerry Chu
2008-05-12 22:22                 ` Jerry Chu
2008-05-12 22:29                   ` David Miller
2008-05-12 22:31                     ` David Miller
2008-05-13  3:56                       ` Jerry Chu
2008-05-13  3:58                         ` David Miller
2008-05-13  4:00                           ` Jerry Chu
2008-05-13  4:02                             ` David Miller
2008-05-17  1:13                               ` Jerry Chu
2008-05-17  1:29                                 ` David Miller
2008-05-17  1:47                                   ` Jerry Chu
2008-05-12 22:58                     ` Jerry Chu
2008-05-12 23:01                       ` David Miller
2008-05-07  4:28     ` David Miller
2008-05-07 18:54       ` Jerry Chu
2008-05-07 21:20         ` David Miller
2008-05-08  0:16           ` Jerry Chu
     [not found] <d1c2719f0804241829s1bc3f41ejf7ebbff73ed96578@mail.gmail.com>
2008-04-25  7:06 ` Andi Kleen
2008-04-25  7:28   ` David Miller
2008-04-25  7:48     ` Andi Kleen
  -- strict thread matches above, loose matches on Subject: below --
2008-04-23  0:38 Rick Jones
2008-04-23  2:17 ` John Heffner
2008-04-23  3:59   ` David Miller
2008-04-23 16:32     ` Rick Jones
2008-04-23 16:58       ` John Heffner
2008-04-23 17:24         ` Rick Jones
2008-04-23 17:41           ` John Heffner
2008-04-23 17:46             ` Rick Jones
2008-04-24 22:21     ` Andi Kleen
2008-04-24 22:39       ` John Heffner
2008-04-25  1:28       ` David Miller
     [not found]       ` <65634d660804242234w66455bedve44801a98e3de9d9@mail.gmail.com>
2008-04-25  6:36         ` David Miller
2008-04-25  7:42           ` Tom Herbert
2008-04-25  7:46             ` David Miller
2008-04-28 17:51               ` Tom Herbert
