* Just one more byte, it is wafer thin...
From: Rick Jones @ 2011-07-20 23:28 UTC (permalink / raw)
To: netdev
One of the netperf scripts I run from time to time is the
packet_byte_script (doc/examples/packet_byte_script in the netperf
source tree, though I tweaked it locally to use omni output selectors).
The goal of that script is to measure the incremental cost of sending
another byte and/or another TCP segment. Among other things, it runs RR
tests where the request or response size is incremented. It starts at 1
byte, doubles until it would exceed the MSS, then does 1MSS, 1MSS+1,
2MSS, 2MSS+1 and 3MSS, 3MSS+1.
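The RR portion of it boils down to something like the following sketch
(this is not the actual packet_byte_script - that one also walks the
response size and carries more options - and the MSS of 1448 bytes is
an assumption based on a 1500-byte MTU with TCP timestamps):

MSS=1448                     # assumed: 1500-byte MTU, IPv4 + TCP timestamps
sizes=""
s=1
while [ $s -le $MSS ]; do    # 1, 2, 4, ... doubling up to the MSS
    sizes="$sizes $s"
    s=$(( s * 2 ))
done
for m in 1 2 3; do           # then 1MSS, 1MSS+1, 2MSS, 2MSS+1, 3MSS, 3MSS+1
    sizes="$sizes $(( m * MSS )) $(( m * MSS + 1 ))"
done
for r in $sizes; do
    netperf -P 0 -H mumble.3.21 -t TCP_RR -- -r ${r},1
done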
I recently ran it between a pair of dual-processor X5650-based systems
with NICs based on the Mellanox MT26438, running as 10GbE interfaces.
The kernel is 2.6.38-8-server (maverick) and the driver info is:
# ethtool -i eth2
driver: mlx4_en (HP_0200000003)
version: 1.5.1.6 (August 2010)
firmware-version: 2.7.9294
bus-info: 0000:05:00.0
(Yes, that HP_mumble does broach the possibility of a local fubar. I'd
try a pure upstream driver myself, but the systems at my disposal are
somewhat locked down; I'm hoping someone with a "pure" environment can
reproduce the result, or not.)
The full output can be seen at:
ftp://ftp.netperf.org/netperf/misc/sl390_NC543i_mlx4_en_1.5.1.6_Ubuntu_11.04_A5800_56C_to_same_pab_1500mtu_20110719.csv
I wasn't entirely sure what TSO and LRO/GRO would mean for the script;
at first I thought I wouldn't get the +1 trip down the stack. The
transaction rates all looked reasonably "sane" until the 3MSS to 3MSS+1
transition, where the transaction rate dropped by something like 70%,
and stayed there as the request size was increased further in other
testing.
I looked at a tcpdump trace on the sending and receiving sides - LRO/GRO
had coalesced segments into the full request size. On the sending side,
though, I was seeing one segment of 3MSS and one of one byte. At first
I thought that perhaps something was fubar with cwnd, but looking at
traces for 2MSS(+1) and 1MSS(+1) I saw that this is just what TSO does -
it only sends integer multiples of the MSS as TSO. So, while that does
interesting things to the service demand for a given transaction size,
it probably wasn't the culprit.
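For concreteness - my arithmetic, not something taken from the traces:
with a 1500-byte MTU and TCP timestamps the MSS here should be

  1500 - 20 (IPv4) - 20 (TCP) - 12 (timestamps) = 1448 bytes
  3 * 1448 = 4344, and 3MSS+1 = 4345

which is where the 4344- and 4345-byte request sizes below come from,
and why a 4345-byte request leaves the sender as one 4344-byte TSO send
(three MSS-sized segments on the wire) followed by a lone 1-byte
segment.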
It would seem that adaptive-rx was the culprit. Previously, the
coalescing settings on the receiver (netserver) side were:
# ethtool -c eth2
Coalesce parameters for eth2:
Adaptive RX: on TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 400000
pkt-rate-high: 450000
rx-usecs: 16
rx-frames: 44
rx-usecs-irq: 0
rx-frames-irq: 0
tx-usecs: 0
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 0
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 128
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0
and netperf would look like:
# HDR="-P 1";for r in 4344 4345; do netperf -H mumble.3.21 -t TCP_RR
$HDR -- -r ${r},1; HDR="-P 0"; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to mumble.3.21 (mumble.3.21) port 0 AF_INET : histogram : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  4344     1       10.00    10030.37
16384  87380
16384  87380  4345     1       10.00    3406.62
16384  87380
When I switched adaptive RX off via ethtool, the drop largely went away:
# HDR="-P 1";for r in 4344 4345; do netperf -H mumble.3.21 -t TCP_RR
$HDR -- -r ${r},1; HDR="-P 0"; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to mumble.3.21 (mumble.3.21) port 0 AF_INET : histogram : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  4344     1       10.00    11167.48
16384  87380
16384  87380  4345     1       10.00    10460.02
16384  87380
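(For reference, that toggle is an ethtool -C operation. I didn't paste
the exact command from my history, but it would have been along the
lines of:

# ethtool -C eth2 adaptive-rx off   # reconstructed, not copied from history

run on the netserver side, where those coalescing settings live.)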
Now, at ~11000 transactions per second, even with the request being 4
packets, that is still < 55000 packets per second, well below the
pkt-rate-low of 400000, so presumably everything should have stayed at
"_low", right? Just for grins, I put adaptive coalescing on again, set
rx-usecs-high to 64, and ran those two points again:
# HDR="-P 1";for r in 4344 4345; do netperf -H mumble.3.21 -t TCP_RR
$HDR -- -r ${r},1; HDR="-P 0"; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to mumble.3.21 (mumble.3.21) port 0 AF_INET : histogram : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  4344     1       10.00    11143.07
16384  87380
16384  87380  4345     1       10.00    5790.48
16384  87380
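(Again reconstructing rather than pasting from history, that setting
would have been applied with something like:

# ethtool -C eth2 adaptive-rx on rx-usecs-high 64   # reconstructed

and the rx-usecs-high 0 case below likewise.)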
and just to be completely pedantic about it, set rx-usecs-high to 0:
# HDR="-P 1";for r in 4344 4345; do netperf -H mumble.3.21 -t TCP_RR
$HDR -- -r ${r},1; HDR="-P 0"; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to mumble.3.21 (mumble.3.21) port 0 AF_INET : histogram : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  4344     1       10.00    14274.03
16384  87380
16384  87380  4345     1       10.00    13697.11
16384  87380
and got a somewhat unexpected result - I've no idea why they both then
went up - perhaps it was sensing "high" occasionally even in the
4344-byte request case. Still, is this suggesting that perhaps the
adaptive bits are being a bit too aggressive about sensing high? Over
what interval is that measurement supposed to be happening?
rick jones
* Re: Just one more byte, it is wafer thin...
From: Rick Jones @ 2011-07-21 0:52 UTC (permalink / raw)
To: netdev
On 07/20/2011 04:28 PM, Rick Jones wrote:
> and got a somewhat unexpected result - I've no idea why they both then
> went up - perhaps it was sensing "high" occasionally even in the
> 4344-byte request case.
That would seem to be the case. With the coalescing settings back at
their defaults, I ./configure'd netperf with --enable-demo and had it
print out interim results every 250 milliseconds (or so):
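(The rebuild itself was just the usual dance, paraphrased here rather
than pasted from history:

# cd ~/netperf-2.5.0
# ./configure --enable-demo
# make

give or take an install step and whatever other configure options were
already in use.)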
root@use111814x:~/netperf-2.5.0# HDR="-P 1";for r in 4344 4345; do
netperf -D 0.25 -H mumble.3.21 -t TCP_RR $HDR -- -r ${r},1; HDR="-P 0"; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to mumble.3.21 (mumble.3.21) port 0 AF_INET : histogram : demo : first
burst 0
Interim result: 5332.90 Trans/s over 0.28 seconds ending at 1311209347.312
Interim result: 6867.20 Trans/s over 0.25 seconds ending at 1311209347.562
Interim result: 14475.52 Trans/s over 0.25 seconds ending at 1311209347.813
Interim result: 14513.50 Trans/s over 0.25 seconds ending at 1311209348.063
Interim result: 14528.00 Trans/s over 0.25 seconds ending at 1311209348.313
Interim result: 8245.53 Trans/s over 0.44 seconds ending at 1311209348.753
Interim result: 13523.73 Trans/s over 0.25 seconds ending at 1311209349.003
Interim result: 13310.17 Trans/s over 0.26 seconds ending at 1311209349.259
Interim result: 8303.74 Trans/s over 0.40 seconds ending at 1311209349.660
Interim result: 14202.24 Trans/s over 0.25 seconds ending at 1311209349.910
Interim result: 8124.76 Trans/s over 0.44 seconds ending at 1311209350.347
Interim result: 14495.59 Trans/s over 0.25 seconds ending at 1311209350.597
Interim result: 14505.91 Trans/s over 0.25 seconds ending at 1311209350.847
Interim result: 13338.19 Trans/s over 0.27 seconds ending at 1311209351.119
Interim result: 7280.44 Trans/s over 0.46 seconds ending at 1311209351.577
Interim result: 14002.71 Trans/s over 0.25 seconds ending at 1311209351.827
Interim result: 6661.47 Trans/s over 0.53 seconds ending at 1311209352.353
Interim result: 4069.30 Trans/s over 0.41 seconds ending at 1311209352.762
Interim result: 10444.77 Trans/s over 0.35 seconds ending at 1311209353.110
Interim result: 9013.21 Trans/s over 0.29 seconds ending at 1311209353.399
Interim result: 6480.59 Trans/s over 0.35 seconds ending at 1311209353.747
Interim result: 13245.09 Trans/s over 0.25 seconds ending at 1311209353.997
Interim result: 12205.48 Trans/s over 0.30 seconds ending at 1311209354.294
Interim result: 5592.64 Trans/s over 0.55 seconds ending at 1311209354.840
Interim result: 6142.67 Trans/s over 0.59 seconds ending at 1311209355.430
Interim result: 11084.00 Trans/s over 0.25 seconds ending at 1311209355.680
Interim result: 14511.18 Trans/s over 0.25 seconds ending at 1311209355.930
Interim result: 14475.35 Trans/s over 0.25 seconds ending at 1311209356.181
Interim result: 7893.58 Trans/s over 0.46 seconds ending at 1311209356.639
Interim result: 14176.00 Trans/s over 0.25 seconds ending at 1311209356.889
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  4344     1       10.00    9907.27
16384  87380
Now the 4345-byte request:
Interim result: 8712.99 Trans/s over 0.37 seconds ending at 1311209357.406
Interim result: 3344.24 Trans/s over 0.65 seconds ending at 1311209358.057
Interim result: 3495.28 Trans/s over 0.25 seconds ending at 1311209358.308
Interim result: 3457.05 Trans/s over 0.25 seconds ending at 1311209358.561
Interim result: 3315.55 Trans/s over 0.26 seconds ending at 1311209358.821
Interim result: 3340.47 Trans/s over 0.25 seconds ending at 1311209359.072
Interim result: 3343.81 Trans/s over 0.25 seconds ending at 1311209359.322
Interim result: 3373.45 Trans/s over 0.25 seconds ending at 1311209359.572
Interim result: 3292.31 Trans/s over 0.26 seconds ending at 1311209359.828
Interim result: 3328.17 Trans/s over 0.25 seconds ending at 1311209360.079
Interim result: 3373.07 Trans/s over 0.25 seconds ending at 1311209360.329
Interim result: 3431.75 Trans/s over 0.25 seconds ending at 1311209360.579
Interim result: 3324.45 Trans/s over 0.26 seconds ending at 1311209360.837
Interim result: 3347.82 Trans/s over 0.25 seconds ending at 1311209361.087
Interim result: 3327.10 Trans/s over 0.25 seconds ending at 1311209361.338
Interim result: 3337.22 Trans/s over 0.25 seconds ending at 1311209361.589
Interim result: 3444.56 Trans/s over 0.25 seconds ending at 1311209361.839
Interim result: 3336.91 Trans/s over 0.26 seconds ending at 1311209362.097
Interim result: 3323.07 Trans/s over 0.25 seconds ending at 1311209362.348
Interim result: 3422.15 Trans/s over 0.25 seconds ending at 1311209362.598
Interim result: 3327.81 Trans/s over 0.26 seconds ending at 1311209362.855
Interim result: 3312.43 Trans/s over 0.25 seconds ending at 1311209363.106
Interim result: 3346.22 Trans/s over 0.25 seconds ending at 1311209363.356
Interim result: 3426.75 Trans/s over 0.25 seconds ending at 1311209363.606
Interim result: 3304.44 Trans/s over 0.26 seconds ending at 1311209363.866
Interim result: 3466.26 Trans/s over 0.25 seconds ending at 1311209364.116
Interim result: 3299.97 Trans/s over 0.26 seconds ending at 1311209364.379
Interim result: 3360.99 Trans/s over 0.25 seconds ending at 1311209364.629
Interim result: 3402.76 Trans/s over 0.25 seconds ending at 1311209364.879
Interim result: 3389.28 Trans/s over 0.25 seconds ending at 1311209365.130
Interim result: 3360.94 Trans/s over 0.25 seconds ending at 1311209365.382
Interim result: 3319.58 Trans/s over 0.25 seconds ending at 1311209365.635
Interim result: 3440.41 Trans/s over 0.25 seconds ending at 1311209365.886
Interim result: 3386.75 Trans/s over 0.25 seconds ending at 1311209366.140
Interim result: 3337.23 Trans/s over 0.25 seconds ending at 1311209366.393
Interim result: 3329.40 Trans/s over 0.25 seconds ending at 1311209366.644
Interim result: 3328.29 Trans/s over 0.25 seconds ending at 1311209366.894
16384  87380  4345     1       10.00    3560.55
16384  87380
> Still, is this suggesting that perhaps the adaptive
> bits are being a bit too aggressive about sensing high? Over what
> interval is that measurement supposed to be happening?
>
> rick jones
* Re: Just one more byte, it is wafer thin...
From: Rick Jones @ 2011-07-21 22:28 UTC (permalink / raw)
To: netdev; +Cc: Eli Cohen, Yevgeny Petrilin
On 07/20/2011 04:28 PM, Rick Jones wrote:
> and just to be completely pedantic about it, set rx-usecs-high to 0:
>
> # HDR="-P 1";for r in 4344 4345; do netperf -H mumble.3.21 -t TCP_RR
> $HDR -- -r ${r},1; HDR="-P 0"; done
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
> to mumble.3.21 (mumble.3.21) port 0 AF_INET : histogram : first burst 0
> Local /Remote
> Socket Size   Request  Resp.   Elapsed  Trans.
> Send   Recv   Size     Size    Time     Rate
> bytes  Bytes  bytes    bytes   secs.    per sec
>
> 16384  87380  4344     1       10.00    14274.03
> 16384  87380
> 16384  87380  4345     1       10.00    13697.11
> 16384  87380
>
> and got a somewhat unexpected result - I've no idea why they both then
> went up - perhaps it was sensing "high" occasionally even in the
> 4344-byte request case. Still, is this suggesting that perhaps the
> adaptive bits are being a bit too aggressive about sensing high? Over
> what interval is that measurement supposed to be happening?
So, from a 2.6.38 tree, in drivers/net/mlx4/en_netdev.c:
/* Apply auto-moderation only when packet rate exceeds a rate that
 * it matters */
if (rate > MLX4_EN_RX_RATE_THRESH) {
        /* If tx and rx packet rates are not balanced, assume that
         * traffic is mainly BW bound and apply maximum moderation.
         * Otherwise, moderate according to packet rate */
        if (2 * tx_pkt_diff > 3 * rx_pkt_diff &&
            rx_pkt_diff / rx_byte_diff <
            MLX4_EN_SMALL_PKT_SIZE)
                moder_time = priv->rx_usecs_low;
        else if (2 * rx_pkt_diff > 3 * tx_pkt_diff)
                moder_time = priv->rx_usecs_high;
        else {
                if (rate < priv->pkt_rate_low)
                        moder_time = priv->rx_usecs_low;
                else if (rate > priv->pkt_rate_high)
                        moder_time = priv->rx_usecs_high;
                else
                        moder_time = (rate - priv->pkt_rate_low) *
                                (priv->rx_usecs_high - priv->rx_usecs_low) /
                                (priv->pkt_rate_high - priv->pkt_rate_low) +
                                priv->rx_usecs_low;
        }
} else {
        /* When packet rate is low, use default moderation rather than
         * 0 to prevent interrupt storms if traffic suddenly increases */
        moder_time = priv->rx_usecs;
}
It would seem that the "assume" in that comment is what is involved
here. The TCP_RR test I was running (or NFS read, or NFS write, or I
suspect SMB/CIFS reads and writes, etc.) will have either the request
much larger than the response or vice versa. That leaves the tx and rx
packet rates decidedly not balanced even when the traffic is not BW
bound, particularly when there are not all that many requests
outstanding at one time. And it becomes even more unbalanced when
LRO/GRO stretches the ACKs.
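As a hedged back-of-the-envelope - these are my assumed per-interval
packet counts, not anything read out of the driver: at the netserver,
each 4345-byte request arrives as roughly 4 wire packets (three
MSS-sized segments plus the 1-byte tail), while the netserver sends
back somewhere around 1 to 2 packets per transaction (the 1-byte
response plus whatever ACKs don't get piggybacked). For N transactions
per sampling interval that gives roughly:

  rx_pkt_diff ~ 4N, tx_pkt_diff ~ 2N
  2 * rx_pkt_diff ~ 8N > 3 * tx_pkt_diff ~ 6N

so, assuming the overall rate clears MLX4_EN_RX_RATE_THRESH, the
"2 * rx_pkt_diff > 3 * tx_pkt_diff" branch fires and moder_time jumps
straight to rx_usecs_high - 128 usecs with the defaults shown earlier -
even though ~10000 transactions per second is nowhere near the 400000
pkt-rate-low. And if the netserver really is sending about two packets
per transaction, the 4344-byte request (three wire packets) sits right
at the 2:3 boundary (6N vs 6N) rather than over it, which might be part
of why exactly one more byte was enough to tip things.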
rick jones