* Just one more byte, it is wafer thin...
From: Rick Jones @ 2011-07-20 23:28 UTC (permalink / raw)
To: netdev
One of the netperf scripts I run from time to time is the
packet_byte_script (doc/examples/packet_byte_script in the netperf
source tree, though I tweaked it locally to use omni output selectors).
The goal of that script is to measure the incremental cost of sending
another byte and/or another TCP segment. Among other things, it runs RR
tests where the request or response size is incremented. It starts at 1
byte, doubles until it would exceed the MSS, then does 1MSS, 1MSS+1,
2MSS, 2MSS+1 and 3MSS, 3MSS+1.
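The RR portion of it boils down to something like the following sketch
(this is not the actual packet_byte_script - that one also walks the
response size and carries more options - and the MSS of 1448 bytes is
an assumption based on a 1500-byte MTU with TCP timestamps):

MSS=1448                     # assumed: 1500-byte MTU, IPv4 + TCP timestamps
sizes=""
s=1
while [ $s -le $MSS ]; do    # 1, 2, 4, ... doubling up to the MSS
    sizes="$sizes $s"
    s=$(( s * 2 ))
done
for m in 1 2 3; do           # then 1MSS, 1MSS+1, 2MSS, 2MSS+1, 3MSS, 3MSS+1
    sizes="$sizes $(( m * MSS )) $(( m * MSS + 1 ))"
done
for r in $sizes; do
    netperf -P 0 -H mumble.3.21 -t TCP_RR -- -r ${r},1
done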
I recently ran it between a pair of dual-processor X5650-based systems
with NICs based on the Mellanox MT26438, running as 10GbE interfaces.
The kernel is 2.6.38-8-server (maverick) and the driver info is:
# ethtool -i eth2
driver: mlx4_en (HP_0200000003)
version: 1.5.1.6 (August 2010)
firmware-version: 2.7.9294
bus-info: 0000:05:00.0
(Yes, that HP_mumble does broach the possibility of a local fubar. I'd
try a pure upstream driver myself, but the systems at my disposal are
somewhat locked down; I'm hoping someone with a "pure" environment can
reproduce the result, or not.)
The full output can be seen at:
ftp://ftp.netperf.org/netperf/misc/sl390_NC543i_mlx4_en_1.5.1.6_Ubuntu_11.04_A5800_56C_to_same_pab_1500mtu_20110719.csv
I wasn't entirely sure what TSO and LRO/GRO would mean for the script;
at first I thought I wouldn't get the +1 trip down the stack. The
transaction rates all looked reasonably "sane" until the 3MSS to 3MSS+1
transition, where the transaction rate dropped by something like 70%,
and stayed there as the request size was increased further in other
testing.
I looked at a tcpdump trace on the sending and receiving sides - LRO/GRO
had coalesced segments into the full request size. On the sending side,
though, I was seeing one segment of 3MSS and one of one byte. At first
I thought that perhaps something was fubar with cwnd, but looking at
traces for 2MSS(+1) and 1MSS(+1) I saw that this is just what TSO does -
it only sends integer multiples of the MSS as TSO. So, while that does
interesting things to the service demand for a given transaction size,
it probably wasn't the culprit.
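For concreteness - my arithmetic, not something taken from the traces:
with a 1500-byte MTU and TCP timestamps the MSS here should be

  1500 - 20 (IPv4) - 20 (TCP) - 12 (timestamps) = 1448 bytes
  3 * 1448 = 4344, and 3MSS+1 = 4345

which is where the 4344- and 4345-byte request sizes below come from,
and why a 4345-byte request leaves the sender as one 4344-byte TSO send
(three MSS-sized segments on the wire) followed by a lone 1-byte
segment.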
It would seem that adaptive-rx was the culprit. Previously, the
coalescing settings on the receiver (netserver) side were:
# ethtool -c eth2
Coalesce parameters for eth2:
Adaptive RX: on TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 400000
pkt-rate-high: 450000
rx-usecs: 16
rx-frames: 44
rx-usecs-irq: 0
rx-frames-irq: 0
tx-usecs: 0
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 0
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 128
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0
and netperf would look like:
# HDR="-P 1";for r in 4344 4345; do netperf -H mumble.3.21 -t TCP_RR
$HDR -- -r ${r},1; HDR="-P 0"; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to mumble.3.21 (mumble.3.21) port 0 AF_INET : histogram : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  4344     1       10.00    10030.37
16384  87380
16384  87380  4345     1       10.00    3406.62
16384  87380
When I switched adaptive RX off via ethtool, the drop largely went away:
# HDR="-P 1";for r in 4344 4345; do netperf -H mumble.3.21 -t TCP_RR
$HDR -- -r ${r},1; HDR="-P 0"; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to mumble.3.21 (mumble.3.21) port 0 AF_INET : histogram : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  4344     1       10.00    11167.48
16384  87380
16384  87380  4345     1       10.00    10460.02
16384  87380
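(For reference, that toggle is an ethtool -C operation. I didn't paste
the exact command from my history, but it would have been along the
lines of:

# ethtool -C eth2 adaptive-rx off   # reconstructed, not copied from history

run on the netserver side, where those coalescing settings live.)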
Now, at ~11000 transactions per second, even with the request being 4
packets, that is still < 55000 packets per second, well below the
pkt-rate-low of 400000, so presumably everything should have stayed at
"_low", right? Just for grins, I put adaptive coalescing on again, set
rx-usecs-high to 64, and ran those two points again:
# HDR="-P 1";for r in 4344 4345; do netperf -H mumble.3.21 -t TCP_RR
$HDR -- -r ${r},1; HDR="-P 0"; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to mumble.3.21 (mumble.3.21) port 0 AF_INET : histogram : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  4344     1       10.00    11143.07
16384  87380
16384  87380  4345     1       10.00    5790.48
16384  87380
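(Again reconstructing rather than pasting from history, that setting
would have been applied with something like:

# ethtool -C eth2 adaptive-rx on rx-usecs-high 64   # reconstructed

and the rx-usecs-high 0 case below likewise.)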
and just to be completely pedantic about it, set rx-usecs-high to 0:
# HDR="-P 1";for r in 4344 4345; do netperf -H mumble.3.21 -t TCP_RR
$HDR -- -r ${r},1; HDR="-P 0"; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to mumble.3.21 (mumble.3.21) port 0 AF_INET : histogram : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  4344     1       10.00    14274.03
16384  87380
16384  87380  4345     1       10.00    13697.11
16384  87380
and got a somewhat unexpected result - I've no idea why they both then
went up - perhaps it was sensing "high" occasionally even in the
4344-byte request case. Still, is this suggesting that perhaps the
adaptive bits are being a bit too aggressive about sensing high? Over
what interval is that measurement supposed to be happening?
rick jones
* Re: Just one more byte, it is wafer thin...
From: Rick Jones @ 2011-07-21 0:52 UTC (permalink / raw)
To: netdev
On 07/20/2011 04:28 PM, Rick Jones wrote:
> and got a somewhat unexpected result - I've no idea why they both then
> went up - perhaps it was sensing "high" occasionally even in the
> 4344-byte request case.
That would seem to be the case. With the coalescing settings back at
their defaults, I ./configure'd netperf with --enable-demo and had it
print out interim results every 250 milliseconds (or so):
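(The rebuild itself was just the usual dance, paraphrased here rather
than pasted from history:

# cd ~/netperf-2.5.0
# ./configure --enable-demo
# make

give or take an install step and whatever other configure options were
already in use.)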
root@use111814x:~/netperf-2.5.0# HDR="-P 1";for r in 4344 4345; do
netperf -D 0.25 -H mumble.3.21 -t TCP_RR $HDR -- -r ${r},1; HDR="-P 0"; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to mumble.3.21 (mumble.3.21) port 0 AF_INET : histogram : demo : first
burst 0
Interim result: 5332.90 Trans/s over 0.28 seconds ending at 1311209347.312
Interim result: 6867.20 Trans/s over 0.25 seconds ending at 1311209347.562
Interim result: 14475.52 Trans/s over 0.25 seconds ending at 1311209347.813
Interim result: 14513.50 Trans/s over 0.25 seconds ending at 1311209348.063
Interim result: 14528.00 Trans/s over 0.25 seconds ending at 1311209348.313
Interim result: 8245.53 Trans/s over 0.44 seconds ending at 1311209348.753
Interim result: 13523.73 Trans/s over 0.25 seconds ending at 1311209349.003
Interim result: 13310.17 Trans/s over 0.26 seconds ending at 1311209349.259
Interim result: 8303.74 Trans/s over 0.40 seconds ending at 1311209349.660
Interim result: 14202.24 Trans/s over 0.25 seconds ending at 1311209349.910
Interim result: 8124.76 Trans/s over 0.44 seconds ending at 1311209350.347
Interim result: 14495.59 Trans/s over 0.25 seconds ending at 1311209350.597
Interim result: 14505.91 Trans/s over 0.25 seconds ending at 1311209350.847
Interim result: 13338.19 Trans/s over 0.27 seconds ending at 1311209351.119
Interim result: 7280.44 Trans/s over 0.46 seconds ending at 1311209351.577
Interim result: 14002.71 Trans/s over 0.25 seconds ending at 1311209351.827
Interim result: 6661.47 Trans/s over 0.53 seconds ending at 1311209352.353
Interim result: 4069.30 Trans/s over 0.41 seconds ending at 1311209352.762
Interim result: 10444.77 Trans/s over 0.35 seconds ending at 1311209353.110
Interim result: 9013.21 Trans/s over 0.29 seconds ending at 1311209353.399
Interim result: 6480.59 Trans/s over 0.35 seconds ending at 1311209353.747
Interim result: 13245.09 Trans/s over 0.25 seconds ending at 1311209353.997
Interim result: 12205.48 Trans/s over 0.30 seconds ending at 1311209354.294
Interim result: 5592.64 Trans/s over 0.55 seconds ending at 1311209354.840
Interim result: 6142.67 Trans/s over 0.59 seconds ending at 1311209355.430
Interim result: 11084.00 Trans/s over 0.25 seconds ending at 1311209355.680
Interim result: 14511.18 Trans/s over 0.25 seconds ending at 1311209355.930
Interim result: 14475.35 Trans/s over 0.25 seconds ending at 1311209356.181
Interim result: 7893.58 Trans/s over 0.46 seconds ending at 1311209356.639
Interim result: 14176.00 Trans/s over 0.25 seconds ending at 1311209356.889
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  4344     1       10.00    9907.27
16384  87380
Now the 4345-byte request:
Interim result: 8712.99 Trans/s over 0.37 seconds ending at 1311209357.406
Interim result: 3344.24 Trans/s over 0.65 seconds ending at 1311209358.057
Interim result: 3495.28 Trans/s over 0.25 seconds ending at 1311209358.308
Interim result: 3457.05 Trans/s over 0.25 seconds ending at 1311209358.561
Interim result: 3315.55 Trans/s over 0.26 seconds ending at 1311209358.821
Interim result: 3340.47 Trans/s over 0.25 seconds ending at 1311209359.072
Interim result: 3343.81 Trans/s over 0.25 seconds ending at 1311209359.322
Interim result: 3373.45 Trans/s over 0.25 seconds ending at 1311209359.572
Interim result: 3292.31 Trans/s over 0.26 seconds ending at 1311209359.828
Interim result: 3328.17 Trans/s over 0.25 seconds ending at 1311209360.079
Interim result: 3373.07 Trans/s over 0.25 seconds ending at 1311209360.329
Interim result: 3431.75 Trans/s over 0.25 seconds ending at 1311209360.579
Interim result: 3324.45 Trans/s over 0.26 seconds ending at 1311209360.837
Interim result: 3347.82 Trans/s over 0.25 seconds ending at 1311209361.087
Interim result: 3327.10 Trans/s over 0.25 seconds ending at 1311209361.338
Interim result: 3337.22 Trans/s over 0.25 seconds ending at 1311209361.589
Interim result: 3444.56 Trans/s over 0.25 seconds ending at 1311209361.839
Interim result: 3336.91 Trans/s over 0.26 seconds ending at 1311209362.097
Interim result: 3323.07 Trans/s over 0.25 seconds ending at 1311209362.348
Interim result: 3422.15 Trans/s over 0.25 seconds ending at 1311209362.598
Interim result: 3327.81 Trans/s over 0.26 seconds ending at 1311209362.855
Interim result: 3312.43 Trans/s over 0.25 seconds ending at 1311209363.106
Interim result: 3346.22 Trans/s over 0.25 seconds ending at 1311209363.356
Interim result: 3426.75 Trans/s over 0.25 seconds ending at 1311209363.606
Interim result: 3304.44 Trans/s over 0.26 seconds ending at 1311209363.866
Interim result: 3466.26 Trans/s over 0.25 seconds ending at 1311209364.116
Interim result: 3299.97 Trans/s over 0.26 seconds ending at 1311209364.379
Interim result: 3360.99 Trans/s over 0.25 seconds ending at 1311209364.629
Interim result: 3402.76 Trans/s over 0.25 seconds ending at 1311209364.879
Interim result: 3389.28 Trans/s over 0.25 seconds ending at 1311209365.130
Interim result: 3360.94 Trans/s over 0.25 seconds ending at 1311209365.382
Interim result: 3319.58 Trans/s over 0.25 seconds ending at 1311209365.635
Interim result: 3440.41 Trans/s over 0.25 seconds ending at 1311209365.886
Interim result: 3386.75 Trans/s over 0.25 seconds ending at 1311209366.140
Interim result: 3337.23 Trans/s over 0.25 seconds ending at 1311209366.393
Interim result: 3329.40 Trans/s over 0.25 seconds ending at 1311209366.644
Interim result: 3328.29 Trans/s over 0.25 seconds ending at 1311209366.894
16384  87380  4345     1       10.00    3560.55
16384  87380
> Still, is this suggesting that perhaps the adaptive
> bits are being a bit too aggressive about sensing high? Over what
> interval is that measurement supposed to be happening?
>
> rick jones
* Re: Just one more byte, it is wafer thin...
From: Rick Jones @ 2011-07-21 22:28 UTC (permalink / raw)
To: netdev; +Cc: Eli Cohen, Yevgeny Petrilin
On 07/20/2011 04:28 PM, Rick Jones wrote:
> and just to be completely pedantic about it, set rx-usecs-high to 0:
>
> # HDR="-P 1";for r in 4344 4345; do netperf -H mumble.3.21 -t TCP_RR
> $HDR -- -r ${r},1; HDR="-P 0"; done
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
> to mumble.3.21 (mumble.3.21) port 0 AF_INET : histogram : first burst 0
> Local /Remote
> Socket Size   Request  Resp.   Elapsed  Trans.
> Send   Recv   Size     Size    Time     Rate
> bytes  Bytes  bytes    bytes   secs.    per sec
>
> 16384  87380  4344     1       10.00    14274.03
> 16384  87380
> 16384  87380  4345     1       10.00    13697.11
> 16384  87380
>
> and got a somewhat unexpected result - I've no idea why they both then
> went up - perhaps it was sensing "high" occasionally even in the
> 4344-byte request case. Still, is this suggesting that perhaps the
> adaptive bits are being a bit too aggressive about sensing high? Over
> what interval is that measurement supposed to be happening?
So, from a 2.6.38 tree, in drivers/net/mlx4/en_netdev.c:
/* Apply auto-moderation only when packet rate exceeds a rate that
 * it matters */
if (rate > MLX4_EN_RX_RATE_THRESH) {
        /* If tx and rx packet rates are not balanced, assume that
         * traffic is mainly BW bound and apply maximum moderation.
         * Otherwise, moderate according to packet rate */
        if (2 * tx_pkt_diff > 3 * rx_pkt_diff &&
            rx_pkt_diff / rx_byte_diff <
            MLX4_EN_SMALL_PKT_SIZE)
                moder_time = priv->rx_usecs_low;
        else if (2 * rx_pkt_diff > 3 * tx_pkt_diff)
                moder_time = priv->rx_usecs_high;
        else {
                if (rate < priv->pkt_rate_low)
                        moder_time = priv->rx_usecs_low;
                else if (rate > priv->pkt_rate_high)
                        moder_time = priv->rx_usecs_high;
                else
                        moder_time = (rate - priv->pkt_rate_low) *
                                (priv->rx_usecs_high - priv->rx_usecs_low) /
                                (priv->pkt_rate_high - priv->pkt_rate_low) +
                                priv->rx_usecs_low;
        }
} else {
        /* When packet rate is low, use default moderation rather than
         * 0 to prevent interrupt storms if traffic suddenly increases */
        moder_time = priv->rx_usecs;
}
It would seem that the "assume" in that comment is what is involved
here. The TCP_RR test I was running (or NFS read, or NFS write, or I
suspect SMB/CIFS reads and writes, etc.) will have either the request
much larger than the response or vice versa. That leaves the tx and rx
packet rates decidedly not balanced even when the traffic is not BW
bound, particularly when there are not all that many requests
outstanding at one time. And it becomes even more unbalanced when
LRO/GRO stretches the ACKs.
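As a hedged back-of-the-envelope - these are my assumed per-interval
packet counts, not anything read out of the driver: at the netserver,
each 4345-byte request arrives as roughly 4 wire packets (three
MSS-sized segments plus the 1-byte tail), while the netserver sends
back somewhere around 1 to 2 packets per transaction (the 1-byte
response plus whatever ACKs don't get piggybacked). For N transactions
per sampling interval that gives roughly:

  rx_pkt_diff ~ 4N, tx_pkt_diff ~ 2N
  2 * rx_pkt_diff ~ 8N > 3 * tx_pkt_diff ~ 6N

so, assuming the overall rate clears MLX4_EN_RX_RATE_THRESH, the
"2 * rx_pkt_diff > 3 * tx_pkt_diff" branch fires and moder_time jumps
straight to rx_usecs_high - 128 usecs with the defaults shown earlier -
even though ~10000 transactions per second is nowhere near the 400000
pkt-rate-low. And if the netserver really is sending about two packets
per transaction, the 4344-byte request (three wire packets) sits right
at the 2:3 boundary (6N vs 6N) rather than over it, which might be part
of why exactly one more byte was enough to tip things.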
rick jones