From: Rick Jones
Subject: Just one more byte, it is wafer thin...
Date: Wed, 20 Jul 2011 16:28:32 -0700
Message-ID: <4E2764A0.90003@hp.com>
To: netdev@vger.kernel.org

One of the netperf scripts I run from time to time is the packet_byte_script (doc/examples/packet_byte_script in the netperf source tree, though I tweaked it locally to use omni output selectors). The goal of that script is to measure the incremental cost of sending another byte and/or another TCP segment. Among other things, it runs RR tests where the request or response size is incremented: it starts at 1 byte, doubles until it would exceed the MSS, then does 1MSS, 1MSS+1, 2MSS, 2MSS+1, and 3MSS, 3MSS+1.

I recently ran it between a pair of dual-processor X5650-based systems with 10GbE NICs based on the Mellanox MT26438 running as a 10GbE interface. The kernel is 2.6.38-8-server (maverick) and the driver info is:

# ethtool -i eth2
driver: mlx4_en (HP_0200000003)
version: 1.5.1.6 (August 2010)
firmware-version: 2.7.9294
bus-info: 0000:05:00.0

(Yes, that HP_mumble does broach the possibility of a local fubar. I'd try a pure upstream myself, but the systems at my disposal are somewhat locked-down; I'm hoping someone with a "pure" environment can reproduce the result, or not.)

The full output can be seen at:

ftp://ftp.netperf.org/netperf/misc/sl390_NC543i_mlx4_en_1.5.1.6_Ubuntu_11.04_A5800_56C_to_same_pab_1500mtu_20110719.csv

I wasn't entirely sure what TSO and LRO/GRO would mean for the script; at first I thought I wouldn't get the +1 trip down the stack. Still, the transaction rates all looked reasonably "sane" until the 3MSS to 3MSS+1 transition, when the transaction rate dropped by something like 70% - and stayed there as the request size was increased further in other testing.

I looked at tcpdump traces on the sending and receiving sides - LRO/GRO had coalesced segments into the full request size. On the sending side, though, I was seeing one segment of 3MSS and one of one byte. At first I thought that perhaps something was fubar with cwnd, but looking at traces for 2MSS(+1) and 1MSS(+1) I saw that this is just what TSO does - it only sends integer multiples of the MSS in a TSO send. So, while that does interesting things to the service demand for a given transaction size, it probably wasn't the culprit. It would seem that the adaptive-rx was.
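For reference, with the 1500-byte MTU in use here the MSS works out to 1448 bytes (with TCP timestamps), so the 4344 and 4345 request sizes in the runs below are just 3*MSS and 3*MSS+1. Something along these lines walks the same size ladder - not the actual packet_byte_script, just a sketch of the request sizes it steps through:

# sketch only - not the real packet_byte_script; assumes MSS=1448
MSS=1448
SIZES="1"
s=2
while [ $s -lt $MSS ]; do SIZES="$SIZES $s"; s=$((s * 2)); done        # 1, 2, 4, ... 1024
for m in 1 2 3; do SIZES="$SIZES $((m * MSS)) $((m * MSS + 1))"; done  # N*MSS, N*MSS+1
HDR="-P 1"
for r in $SIZES; do
    netperf -H mumble.3.21 -t TCP_RR $HDR -- -r ${r},1
    HDR="-P 0"
done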
Previously, the coalescing settings on the receiver (netserver side) were:

# ethtool -c eth2
Coalesce parameters for eth2:
Adaptive RX: on  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 400000
pkt-rate-high: 450000

rx-usecs: 16
rx-frames: 44
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 0
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 0

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 128
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

and netperf would look like:

# HDR="-P 1";for r in 4344 4345; do netperf -H mumble.3.21 -t TCP_RR $HDR -- -r ${r},1; HDR="-P 0"; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to mumble.3.21 (mumble.3.21) port 0 AF_INET : histogram : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  4344     1       10.00    10030.37
16384  87380
16384  87380  4345     1       10.00    3406.62
16384  87380

When I switched adaptive rx off via ethtool, the drop largely went away:

# HDR="-P 1";for r in 4344 4345; do netperf -H mumble.3.21 -t TCP_RR $HDR -- -r ${r},1; HDR="-P 0"; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to mumble.3.21 (mumble.3.21) port 0 AF_INET : histogram : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  4344     1       10.00    11167.48
16384  87380
16384  87380  4345     1       10.00    10460.02
16384  87380

Now, at 11000 transactions per second, even with the request being 4 packets, that is still < 55000 packets per second, so presumably everything should have stayed at "_low", right?

Just for grins, I put adaptive coalescing back on and set rx-usecs-high to 64, then ran those two points again:

# HDR="-P 1";for r in 4344 4345; do netperf -H mumble.3.21 -t TCP_RR $HDR -- -r ${r},1; HDR="-P 0"; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to mumble.3.21 (mumble.3.21) port 0 AF_INET : histogram : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  4344     1       10.00    11143.07
16384  87380
16384  87380  4345     1       10.00    5790.48
16384  87380

and just to be completely pedantic about it, set rx-usecs-high to 0:

# HDR="-P 1";for r in 4344 4345; do netperf -H mumble.3.21 -t TCP_RR $HDR -- -r ${r},1; HDR="-P 0"; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to mumble.3.21 (mumble.3.21) port 0 AF_INET : histogram : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  4344     1       10.00    14274.03
16384  87380
16384  87380  4345     1       10.00    13697.11
16384  87380

and got a somewhat unexpected result - I've no idea why both of them went up that time; perhaps it was sensing "high" occasionally even in the 4344-byte request case. Still, is this suggesting that perhaps the adaptive bits are being a bit too aggressive about sensing high? Over what interval is that measurement supposed to be happening?

rick jones
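P.S. In case someone wants to reproduce, the receiver-side knob twiddling was done with ethtool -C, along the lines of:

# ethtool -C eth2 adaptive-rx off                   <- adaptive rx off entirely
# ethtool -C eth2 adaptive-rx on rx-usecs-high 64   <- back on, "high" holdoff capped at 64 usec
# ethtool -C eth2 rx-usecs-high 0                   <- adaptive rx still on, "high" holdoff at 0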