From: Rick Jones
Subject: data vs overhead bytes, netperf aggregate RR and retransmissions
Date: Tue, 02 Aug 2011 14:39:36 -0700
Message-ID: <4E386E98.1090606@hp.com>
To: netdev@vger.kernel.org

Folks -

Those who have looked at the "runemomniagg2.sh" script I have up on netperf.org will know that one of the tests I often run is an aggregate, burst-mode, single-byte TCP_RR test. I ramp up how many transactions any one instance of netperf will have in flight at any one time (e.g. 1, 4, 16, 64, 256), and also the number of concurrent netperf processes (e.g. 1, 2, 4, 8, 12, 24). Rather than simply dump burst-size transactions into the connection at once, netperf walks it up - first two transactions in flight, then, after they complete, three, then four, all in a somewhat slow-start-ish way. I usually run this sort of test with TCP_NODELAY set, to try to guesstimate the maximum PPS (with the occasional sanity check against ethtool stats - a sketch of that check appears at the end of this note).

I did some of that testing just recently, from one system to two others over a 1 GbE link, all three systems running a 2.6.38-derived kernel (Ubuntu 11.04) on Intel 82576 chips:

$ ethtool -i eth0
driver: igb
version: 2.1.0-k2
firmware-version: 1.8-2
bus-info: 0000:05:00.0

One of the things fixed recently in netperf (top-of-trunk, beyond 2.5.0) is that reporting of per-connection TCP retransmissions actually works. Looking at that, I noticed a bunch of retransmissions at the 256 burst level with 24 concurrent netperfs. I figured it was simple overload of, say, the switch, or of the one active port on the SUT (I do have one system talking to two, so perhaps some incast). Burst 64 had retransmissions as well; burst 16 and below did not. That pattern repeated at 12 concurrent netperfs, and at 8, 4, 2, and even 1 - yes, a single netperf aggregate TCP_RR test with a burst of 64 was reporting TCP retransmissions. No incast issues there, and the network was otherwise clean.

I went to try to narrow it down further:

# for b in 32 40 48 56 64 256; do ./netperf -t TCP_RR -l 30 -H mumble.181 -P 0 -- -r 1 -b $b -D -o throughput,burst_size,local_transport_retrans,remote_transport_retrans,lss_size_end,lsr_size_end,rss_size_end,rsr_size_end; done
206950.58,32,0,0,129280,87380,137360,87380
247000.30,40,0,0,121200,87380,137360,87380
254820.14,48,1,14,129280,88320,137360,87380
248496.06,56,33,35,125240,101200,121200,101200
278683.05,64,42,10,161600,114080,145440,117760
259422.46,256,2157,2027,133320,469200,137360,471040

and noticed the seeming correlation between the appearance of retransmissions (columns 3 and 4) and the growth of the receive socket buffers (columns 6 and 8).
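For what it's worth, the same correlation can be watched live from outside netperf with ss. A minimal sketch, assuming an iproute2 ss with -i support, and with mumble.181 standing in for the real remote as above:

# once a second, dump per-connection TCP details (-i includes the
# retransmit counts) for the connections headed at the system under test
while sleep 1; do
    ss -tin dst mumble.181
done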
Certainly, there was never anywhere near 86K of *actual* data outstanding. But if the inbound DMA buffers were 2048 bytes in size, 48 of them - 49 actually, as the "burst" is added to the one transaction netperf has in flight by default - would fill an 87380-byte receive buffer: 49 * 2048 = 100352 bytes of buffer consumed for just 49 bytes of data. So, very nearly, would burst 40 (41 * 2048 = 83968), though there is a race between netperf/netserver emptying the socket and packets arriving. On a lark, I set an explicit, larger socket buffer size:

# for b in 32 40 48 56 64 256; do ./netperf -t TCP_RR -l 30 -H mumble.181 -P 0 -- -s 128K -S 128K -r 1 -b $b -D -o throughput,burst_size,local_transport_retrans,remote_transport_retrans,lss_size_end,lsr_size_end,rss_size_end,rsr_size_end; done
201903.06,32,0,0,262144,262144,262144,262144
266204.05,40,0,0,262144,262144,262144,262144
253596.15,48,0,0,262144,262144,262144,262144
264811.65,56,0,0,262144,262144,262144,262144
254421.20,64,0,0,262144,262144,262144,262144
252563.16,256,4172,9677,262144,262144,262144,262144

Poof - the retransmissions up through burst 64 are gone, though at 256 they are quite high indeed. Giving still more space takes care of that too:

# for b in 256; do ./netperf -t TCP_RR -l 30 -H 15.184.83.181 -P 0 -- -s 1M -S 1M -r 1 -b $b -D -o throughput,burst_size,local_transport_retrans,remote_transport_retrans,lss_size_end,lsr_size_end,rss_size_end,rsr_size_end; done
248218.69,256,0,0,2097152,2097152,2097152,2097152

Is this simply a case of "Doctor! Doctor! It hurts when I do *this*!" "Well, don't do that!", or does it suggest that perhaps the receive socket buffers aren't growing quite fast enough on inbound, and/or that collapsing buffers isn't sufficiently effective? It does seem rather strange that one could overfill the socket buffer with so few actual data bytes.

happy benchmarking,

rick jones

BTW, if I make the MTU 9000 bytes on both sides and go back to auto-tuning, only the burst 256 retransmissions remain, and the receive socket buffers don't grow until then either:

# for b in 32 40 48 56 64 256; do ./netperf -t TCP_RR -l 30 -H 15.184.83.181 -P 0 -- -r 1 -b $b -D -o throughput,burst_size,local_transport_retrans,remote_transport_retrans,lss_size_end,lsr_size_end,rss_size_end,rsr_size_end; done
198724.66,32,0,0,28560,87380,28560,87380
242936.45,40,0,0,28560,87380,28560,87380
272157.95,48,0,0,28560,87380,28560,87380
283002.29,56,0,0,1009120,87380,1047200,87380
272489.02,64,0,0,971040,87380,971040,87380
277626.55,256,72,1285,971040,106704,971040,87696

And it would seem a great deal of the send socket buffer growth goes away too.
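For completeness, the jumbo-frame setup behind that BTW is just the usual MTU change on both ends (eth0 here is illustrative, and the switch ports must be configured for jumbo frames as well):

# set and confirm a 9000-byte MTU on the interface under test
ip link set dev eth0 mtu 9000
ip link show dev eth0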
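And for anyone preferring to adjust the auto-tuning bounds system-wide rather than make explicit -s/-S (setsockopt) settings as in the runs above, the knobs are net.ipv4.tcp_rmem and net.ipv4.tcp_wmem. A sketch, with purely illustrative values:

# show the current min/default/max auto-tuning bounds, in bytes
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
# raise the ceiling the receive buffer may auto-tune up to; the middle
# (default) value is where an un-setsockopt()ed socket starts
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"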
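Finally, the "sanity check against ethtool stats" mentioned at the top is nothing fancier than sampling the NIC's packet counters across an interval. A sketch using igb's tx_packets counter name (other drivers name their stats differently):

# estimate transmit PPS by sampling the ethtool counter 10 seconds apart
t0=$(ethtool -S eth0 | awk '/ tx_packets:/ {print $2}')
sleep 10
t1=$(ethtool -S eth0 | awk '/ tx_packets:/ {print $2}')
echo "tx pps: $(( (t1 - t0) / 10 ))"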