* e1000 (?) jumbo frames performance issue
@ 2005-05-05 16:28 Michael Iatrou
2005-05-05 20:17 ` Rick Jones
0 siblings, 1 reply; 9+ messages in thread
From: Michael Iatrou @ 2005-05-05 16:28 UTC (permalink / raw)
To: netdev
[-- Attachment #1: Type: text/plain, Size: 500 bytes --]
Hi,
I did several benchmarks using Intel e1000 NIC and it seems there is a
network throughput problem for MTU > 12000 (e1000 supports up to 16110
MTU).
Configuration:
Two identical PCs, connected back to back, Intel Xeon 2.8GHz (SMP/SMT
disabled), 512MB RAM, e1000 (82546EB)
Linux 2.6.11.7
netperf 2.3pl1
http://members.hellug.gr/iatrou/plain_ip_mtu.png
http://members.hellug.gr/iatrou/plain_ip_mtu.dat
--
Michael Iatrou
Electrical and Computer Engineering Dept.
University of Patras, Greece
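
A minimal sketch of how the MTU-setting step of such a 1500-to-16110 sweep
could be driven programmatically (the SIOCSIFMTU ioctl; the interface name
eth1 is an assumption, and the thread does not say how the sweep was
actually scripted):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

/* Set the MTU on one interface via the SIOCSIFMTU ioctl (needs root /
 * CAP_NET_ADMIN); the same change has to be made on the peer as well. */
static int set_mtu(const char *ifname, int mtu)
{
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);  /* any socket fd will do */

    if (fd < 0)
        return -1;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_mtu = mtu;
    if (ioctl(fd, SIOCSIFMTU, &ifr) < 0) {
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}

int main(void)
{
    /* 1500..16100 in steps of 50, matching the attached data points */
    for (int mtu = 1500; mtu <= 16100; mtu += 50) {
        if (set_mtu("eth1", mtu) < 0) {
            perror("SIOCSIFMTU");
            return 1;
        }
        printf("MTU %d set; run the netperf TCP_STREAM test here\n", mtu);
    }
    return 0;
}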
[-- Attachment #2: plain_ip_mtu.dat --]
[-- Type: text/plain, Size: 3639 bytes --]
1500 936.14
1550 938.45
1600 927.26
1650 943.06
1700 942.91
1750 943.94
1800 947.95
1850 945.14
1900 952.91
1950 948.15
2000 948.55
2050 949.63
2100 953.98
2150 958.59
2200 959.39
2250 959.48
2300 960.57
2350 960.78
2400 953.26
2450 956.50
2500 964.36
2550 961.98
2600 964.61
2650 963.79
2700 966.36
2750 967.56
2800 968.12
2850 963.65
2900 969.22
2950 968.97
3000 967.69
3050 968.75
3100 970.95
3150 967.95
3200 972.19
3250 972.61
3300 973.01
3350 973.41
3400 973.78
3450 974.15
3500 974.52
3550 974.88
3600 975.23
3650 975.56
3700 975.89
3750 976.20
3800 976.52
3850 976.82
3900 977.11
3950 977.39
4000 977.68
4050 977.95
4100 978.23
4150 978.48
4200 978.74
4250 978.97
4300 979.21
4350 979.46
4400 979.68
4450 979.91
4500 980.13
4550 980.36
4600 980.58
4650 980.77
4700 980.97
4750 981.17
4800 981.37
4850 981.56
4900 981.75
4950 981.92
5000 982.10
5050 982.28
5100 982.47
5150 982.61
5200 982.78
5250 982.96
5300 983.11
5350 983.27
5400 983.43
5450 983.58
5500 983.71
5550 983.87
5600 984.01
5650 984.15
5700 984.27
5750 984.42
5800 984.55
5850 984.67
5900 984.83
5950 984.94
6000 985.06
6050 985.21
6100 985.29
6150 985.43
6200 985.54
6250 985.66
6300 985.79
6350 985.91
6400 986.03
6450 986.13
6500 986.23
6550 986.34
6600 986.44
6650 986.55
6700 986.65
6750 986.75
6800 986.83
6850 986.95
6900 987.04
6950 987.13
7000 987.22
7050 987.31
7100 987.40
7150 987.48
7200 987.57
7250 987.66
7300 987.74
7350 987.81
7400 987.89
7450 987.98
7500 987.74
7550 987.81
7600 987.20
7650 987.58
7700 986.59
7750 985.17
7800 985.46
7850 983.50
7900 983.88
7950 982.86
8000 953.35
8050 952.76
8100 947.34
8150 947.85
8200 942.18
8250 989.13
8300 989.21
8350 989.27
8400 989.33
8450 989.40
8500 989.47
8550 989.52
8600 989.58
8650 989.64
8700 989.71
8750 989.76
8800 989.82
8850 989.88
8900 989.93
8950 989.99
9000 990.04
9050 990.10
9100 990.16
9150 990.21
9200 990.26
9250 990.31
9300 990.37
9350 990.41
9400 990.47
9450 990.51
9500 990.57
9550 990.61
9600 990.66
9650 990.70
9700 990.76
9750 990.81
9800 990.86
9850 990.89
9900 990.94
9950 990.98
10000 991.04
10050 991.08
10100 991.12
10150 991.17
10200 991.21
10250 991.23
10300 991.30
10350 991.34
10400 991.38
10450 991.43
10500 991.46
10550 991.50
10600 991.55
10650 991.58
10700 991.63
10750 991.67
10800 991.70
10850 991.74
10900 991.78
10950 991.81
11000 991.86
11050 991.89
11100 991.92
11150 991.97
11200 992.00
11250 992.03
11300 992.07
11350 992.11
11400 992.13
11450 992.17
11500 992.21
11550 992.23
11600 992.16
11650 992.12
11700 991.89
11750 991.62
11800 991.79
11850 991.50
11900 991.77
11950 990.46
12000 990.52
12050 989.66
12100 968.77
12150 967.79
12200 970.55
12250 955.71
12300 952.36
12350 947.52
12400 945.08
12450 946.48
12500 944.09
12550 945.16
12600 942.07
12650 940.04
12700 938.17
12750 936.29
12800 933.29
12850 930.13
12900 927.75
12950 925.10
13000 924.25
13050 922.31
13100 917.67
13150 915.92
13200 913.99
13250 912.00
13300 908.68
13350 905.72
13400 904.99
13450 903.24
13500 900.91
13550 898.57
13600 897.07
13650 895.17
13700 892.79
13750 890.03
13800 888.41
13850 887.29
13900 886.30
13950 884.43
14000 882.25
14050 880.46
14100 878.22
14150 876.24
14200 875.41
14250 872.58
14300 873.01
14350 870.16
14400 868.45
14450 866.98
14500 864.76
14550 863.05
14600 862.22
14650 860.05
14700 858.71
14750 857.23
14800 856.31
14850 854.37
14900 851.15
14950 849.92
15000 849.54
15050 848.40
15100 847.32
15150 845.17
15200 844.45
15250 842.98
15300 841.14
15350 839.62
15400 838.34
15450 837.14
15500 836.43
15550 834.64
15600 833.64
15650 832.08
15700 830.72
15750 829.29
15800 828.07
15850 826.93
15900 825.75
15950 825.10
16000 823.32
16050 822.36
16100 819.77
* Re: e1000 (?) jumbo frames performance issue
2005-05-05 16:28 e1000 (?) jumbo frames performance issue Michael Iatrou
@ 2005-05-05 20:17 ` Rick Jones
2005-05-05 21:33 ` David S. Miller
2005-05-05 21:55 ` Michael Iatrou
0 siblings, 2 replies; 9+ messages in thread
From: Rick Jones @ 2005-05-05 20:17 UTC (permalink / raw)
To: Michael Iatrou; +Cc: netdev
Michael Iatrou wrote:
> Hi,
> I did several benchmarks using Intel e1000 NIC and it seems there is a
> network throughput problem for MTU > 12000 (e1000 supports up to 16110
> MTU).
>
> Configuration:
> Two identical PCs, connected back to back, Intel Xeon 2.8GHz (SMP/SMT
> disabled), 512MB RAM, e1000 (82546EB)
>
> Linux 2.6.11.7
> netperf 2.3pl1
What settings, if any, did you use for -s, -S and in particular -m in netperf?
I seem to recall that some of the stack defaults for SO_SNDBUF (IIRC) would
result in netperf sending 16KB at a time into the connection - once you sent the
MTU above 16K you may have started running into issues with Nagle and delayed
ACK? You could try some tests adding a test-specific -D to disable Nagle, or -C
to set TCP_CORK, or use -m to set the send size to say, 32KB.
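
For reference, the netperf flags above correspond to ordinary socket
options; a minimal sketch of the equivalent setsockopt() calls, with
illustrative values rather than anything taken from netperf's sources:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Apply rough equivalents of netperf's -D (TCP_NODELAY), -C (TCP_CORK,
 * Linux-specific) and a larger send buffer to an existing TCP socket.
 * Values are illustrative; normally you would pick either TCP_NODELAY
 * or TCP_CORK, not both. */
static int tune_socket(int fd, int use_cork)
{
    int one = 1;
    int sndbuf = 256 * 1024;   /* example size, not a value from the thread */

    if (use_cork) {
        if (setsockopt(fd, IPPROTO_TCP, TCP_CORK, &one, sizeof(one)) < 0)
            return -1;
    } else {
        if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
            return -1;
    }
    /* Grow SO_SNDBUF so a ~16K MTU is not close to the whole send buffer */
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
        return -1;
    return 0;
}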
It might be good to add CPU utilization figures - for 2.3pl1 that means editing
the makefile to add a -DUSE_PROC_STAT and recompiling. Or you can grab netperf
2.4.0-rc3 from:
ftp://ftp.cup.hp.com/dist/networking/benchmarks/netperf/experimental/
if you cannot find it elsewhere, and that will (try to) compile-in the right CPU
utilization mechanism automagically.
rick jones
* Re: e1000 (?) jumbo frames performance issue
2005-05-05 20:17 ` Rick Jones
@ 2005-05-05 21:33 ` David S. Miller
2005-05-05 21:54 ` Rick Jones
2005-05-05 21:55 ` Michael Iatrou
1 sibling, 1 reply; 9+ messages in thread
From: David S. Miller @ 2005-05-05 21:33 UTC (permalink / raw)
To: Rick Jones; +Cc: m.iatrou, netdev
On Thu, 05 May 2005 13:17:31 -0700
Rick Jones <rick.jones2@hp.com> wrote:
> I seem to recall that some of the stack defaults for SO_SNDBUF (IIRC) would
> result in netperf sending 16KB at a time into the connection - once you sent the
> MTU above 16K you may have started running into issues with Nagle and delayed
> ACK? You could try some tests adding a test-specific -D to disable Nagle, or -C
> to set TCP_CORK, or use -m to set the send size to say, 32KB.
Yes, for one don't expect reasonable behavior if the MTU is near to or less
than the send buffer size in use.
Also, many of Nagle's notions start to fall apart at such high MTU settings.
For example, all of Nagle (even with Minshall's modifications) basically define
"small packet" as anything smaller than 1 MSS.
So something to look into (besides increasing your send buffer size with jacking
up the MTU so large) is changing Nagle to use some constant. Perhaps something like
512 bytes or smaller, or even 128 bytes or smaller.
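
A sketch of what such a constant-threshold variant of the small-packet
test could look like; this is illustrative C, not the kernel's
tcp_nagle_check(), and the 512-byte cutoff is just the figure floated
above:

#include <stdbool.h>
#include <stdint.h>

/* Shape of the Nagle "small packet" test with the MSS comparison replaced
 * by a fixed constant, as suggested above.  Purely illustrative: names and
 * values are placeholders. */
#define NAGLE_SMALL_PKT 512   /* hypothetical cutoff instead of 1 MSS */

static bool nagle_should_delay(uint32_t chunk_len, bool unacked_data,
                               bool nodelay_set)
{
    if (nodelay_set)                    /* TCP_NODELAY bypasses Nagle */
        return false;
    /* classic rule would compare chunk_len against the MSS here */
    return unacked_data && chunk_len < NAGLE_SMALL_PKT;
}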
* Re: e1000 (?) jumbo frames performance issue
2005-05-05 21:33 ` David S. Miller
@ 2005-05-05 21:54 ` Rick Jones
2005-05-05 22:17 ` David S. Miller
0 siblings, 1 reply; 9+ messages in thread
From: Rick Jones @ 2005-05-05 21:54 UTC (permalink / raw)
To: netdev; +Cc: m.iatrou
David S. Miller wrote:
> On Thu, 05 May 2005 13:17:31 -0700
> Rick Jones <rick.jones2@hp.com> wrote:
>
>
>>I seem to recall that some of the stack defaults for SO_SNDBUF (IIRC) would
>>result in netperf sending 16KB at a time into the connection - once you sent the
>>MTU above 16K you may have started running into issues with Nagle and delayed
>>ACK? You could try some tests adding a test-specific -D to disable Nagle, or -C
>>to set TCP_CORK, or use -m to set the send size to say, 32KB.
>
>
> Yes, for one don't expect reasonable behavior if the MTU is near to or less
> than the send buffer size in use.
>
> Also, many of Nagle's notions start to fall apart at such high MTU settings.
> For example, all of Nagle (even with Minshall's modifications) basically define
> "small packet" as anything smaller than 1 MSS.
>
> So something to look into (besides increasing your send buffer size with jacking
> up the MTU so large) is changing Nagle to use some constant. Perhaps something like
> 512 bytes or smaller, or even 128 bytes or smaller.
IMO 128 is too small - 54 bytes of header to only 128 bytes of data seems
"worthy" of encountering Nagle by default. If not 1460, then 536 feels nice - I
would guess it likely was a common MSS "back in the day" when Nagle first
proposed the algorithm/heuristic - assuming of course that the intent of the
algorithm was to try to get the average header/header+data ratio to something
around 0.9 (although IIRC, none of a 537 byte send would be delayed by Nagle
since it was the size of the user's send being >= the MSS, so make that ~0.45 ?)
rick jones
* Re: e1000 (?) jumbo frames performance issue
2005-05-05 20:17 ` Rick Jones
2005-05-05 21:33 ` David S. Miller
@ 2005-05-05 21:55 ` Michael Iatrou
2005-05-05 22:26 ` Michael Iatrou
2005-05-06 16:18 ` Rick Jones
1 sibling, 2 replies; 9+ messages in thread
From: Michael Iatrou @ 2005-05-05 21:55 UTC (permalink / raw)
To: Rick Jones; +Cc: netdev
On Thursday 05 May 2005 23:17, Rick Jones wrote:
> What settings, if any, did you use for -s, -S and in particular -m in
> netperf?
-s 0 -S 0 -m 16384
For both ends:
/proc/sys/net/core/wmem_max: 16777216
/proc/sys/net/core/rmem_max: 16777216
/proc/sys/net/ipv4/tcp_rmem: 16384 349520 16777216
/proc/sys/net/ipv4/tcp_wmem: 16384 262144 16777216
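
Since the question is what send buffer the stack actually hands out with
-s 0 -S 0, one way to check is to read the options back on a connected
socket; a minimal sketch, not something from the thread:

#include <stdio.h>
#include <sys/socket.h>

/* Read back the buffer sizes the stack actually chose for a connected
 * TCP socket -- i.e. what netperf ends up with when run with -s 0 -S 0.
 * Sketch only; error checking trimmed. */
static void show_socket_buffers(int fd)
{
    int sndbuf = 0, rcvbuf = 0;
    socklen_t len = sizeof(sndbuf);

    getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len);
    len = sizeof(rcvbuf);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len);
    printf("effective SO_SNDBUF=%d SO_RCVBUF=%d\n", sndbuf, rcvbuf);
}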
> I seem to recall that some of the stack defaults for SO_SNDBUF (IIRC) would
> result in netperf sending 16KB at a time into the connection - once you
> sent the MTU above 16K you may have started running into issues with Nagle
> and delayed ACK?
The problem first appears at 12KB...
> You could try some tests adding a test-specific -D to
> disable Nagle, or -C to set TCP_CORK, or use -m to set the send size to
> say, 32KB.
I've already tested -m 32KB and it's the same as 16KB.
I will try -D and -C too.
> It might be good to add CPU utilization figures - for 2.3pl1 that means
> editing the makefile to add a -DUSE_PROC_STAT and recompiling. Or you can
> grab netperf 2.4.0-rc3 from:
>
> ftp://ftp.cup.hp.com/dist/networking/benchmarks/netperf/experimental/
>
> if you cannot find it elsewhere, and that will (try to) compile-in the
> right CPU utilization mechanism automagically.
I already did my own CPU usage instrumentation (based on info from
/proc/stat -- the latest netperf does the same thing, doesn't it?) and it
seems the system has plenty of idle time (up to 50% if I recall correctly).
--
Michael Iatrou
Electrical and Computer Engineering Dept.
University of Patras, Greece
* Re: e1000 (?) jumbo frames performance issue
2005-05-05 21:54 ` Rick Jones
@ 2005-05-05 22:17 ` David S. Miller
2005-05-05 23:24 ` Rick Jones
0 siblings, 1 reply; 9+ messages in thread
From: David S. Miller @ 2005-05-05 22:17 UTC (permalink / raw)
To: Rick Jones; +Cc: netdev, m.iatrou
On Thu, 05 May 2005 14:54:43 -0700
Rick Jones <rick.jones2@hp.com> wrote:
> assuming of course that the intent of the algorithm was to try to get the average header/header+data ratio to something
> around 0.9 (although IIRC, none of a 537 byte send would be delayed by Nagle
> since it was the size of the user's send being >= the MSS, so make that ~0.45 ?)
It tries to hold smaller packets back in hopes to get some more sendmsg()
calls which will bunch up some more data before all outstanding data is
ACK'd.
It's meant for terminal protocols and other chatty sequences.
It was not designed with 16K MSS frame sizes in mind.
* Re: e1000 (?) jumbo frames performance issue
2005-05-05 21:55 ` Michael Iatrou
@ 2005-05-05 22:26 ` Michael Iatrou
2005-05-06 16:18 ` Rick Jones
1 sibling, 0 replies; 9+ messages in thread
From: Michael Iatrou @ 2005-05-05 22:26 UTC (permalink / raw)
To: Rick Jones; +Cc: netdev
On Friday 06 May 2005 00:55, Michael Iatrou wrote:
> > You could try some tests adding a test-specific -D to
> > disable Nagle, or -C to set TCP_CORK, or use -m to set the send size to
> > say, 32KB.
>
> I've already tested -m 32KB and it's the same as 16KB.
> I will try -D and -C too.
Done, (almost) nothing changed.
--
Michael Iatrou
Electrical and Computer Engineering Dept.
University of Patras, Greece
* Re: e1000 (?) jumbo frames performance issue
2005-05-05 22:17 ` David S. Miller
@ 2005-05-05 23:24 ` Rick Jones
0 siblings, 0 replies; 9+ messages in thread
From: Rick Jones @ 2005-05-05 23:24 UTC (permalink / raw)
To: netdev; +Cc: m.iatrou
David S. Miller wrote:
> On Thu, 05 May 2005 14:54:43 -0700 Rick Jones <rick.jones2@hp.com> wrote:
>
>
>> assuming of course that the intent of the algorithm was to try to get the
>> average header/header+data ratio to something around 0.9 (although IIRC,
>> none of a 537 byte send would be delayed by Nagle since it was the size of
>> the user's send being >= the MSS, so make that ~0.45 ?)
>
>
> It tries to hold smaller packets back in hopes to get some more sendmsg()
> calls which will bunch up some more data before all outstanding data is
> ACK'd.
I think we may be saying _nearly_ the same thing, although I would call that
smaller user sends. Nothing I've read (and remembered) suggested that a user
send of MSS+1 bytes should have that last byte delayed. That's where I then got
that handwaving math of 0.45 instead of 0.9.
My bringing up the ratio of header to header+data comes from stuff like this in
rfc896:
<begin>
The small-packet problem
There is a special problem associated with small packets. When
TCP is used for the transmission of single-character messages
originating at a keyboard, the typical result is that 41 byte
packets (one byte of data, 40 bytes of header) are transmitted
for each byte of useful data. This 4000% overhead is annoying
but tolerable on lightly loaded networks. On heavily loaded networks,
however, the congestion resulting from this overhead can result in lost
datagrams and retransmissions, as well as excessive propagation time
caused by congestion in switching nodes and gateways. In practice,
throughput may drop so low that TCP connections are aborted.
<end>
The reason I make the "user send" versus packet distinction comes from stuff
like this:
<begin>
The solution is to inhibit the sending of new TCP segments when
new outgoing data arrives from the user if any previously
transmitted data on the connection remains unacknowledged.
<end>
I do acknowledge though that there have been stacks that interpreted Nagle on a
segment by segment basis rather than a user send by user send basis. I just
don't think that they were correct :)
>
> It's meant for terminal protocols and other chatty sequences.
>
He included an FTP example with 512 byte sends which leads me to believe it was
meant for more than just terminal protocols:
<begin>
We use our scheme for all TCP connections, not just Telnet connections.
Let us see what happens for a file transfer data connection using our
technique. The two extreme cases will again be considered.
As before, we first consider the Ethernet case. The user is now
writing data to TCP in 512 byte blocks as fast as TCP will accept
them. The user's first write to TCP will start things going; our
first datagram will be 512+40 bytes or 552 bytes long. The
user's second write to TCP will not cause a send but will cause
the block to be buffered.
<end>
What I'd forgotten is that the original RFC had no explicit discussion of checks
against the MSS. It _seems_ that the first reference to that is in rfc898,
which was a writeup of meeting notes:
<begin>
Congestion Control -- FACC - Nagle
Postel: This was a discussion of the situation leading to the
ideas presented in RFC 896, and how the policies described there
improved overall performance.
Muuss:
First principle of congestion control:
DON'T DROP PACKETS (unless absolutely necessary)
Second principle:
Hosts must behave themselves (or else)
Enemies list -
1. TOPS-20 TCP from DEC
2. VAX/UNIX 4.2 from Berkeley
Third principle:
Memory won't help (beyond a certain point).
The small packet problem: Big packets are good, small are bad
(big = 576).
Suggested fix: Rule: When the user writes to TCP, initiate a send
only if there are NO outstanding packets on the connection. [good
for TELNET, at least] (or if you fill a segment). No change when
Acks come back. Assumption is that there is a pipe-like buffer
between the user and the TCP.
<end>
with that parenthetical "(or if you fill a segment)" comment. It is interesting
how they define "big = 576" :)
It seems the full-sized segment bit gets formalized in 1122:
<begin>
A TCP SHOULD implement the Nagle Algorithm [TCP:9] to
coalesce short segments. However, there MUST be a way for
an application to disable the Nagle algorithm on an
individual connection. In all cases, sending data is also
subject to the limitation imposed by the Slow Start
algorithm (Section 4.2.2.15).
DISCUSSION:
The Nagle algorithm is generally as follows:
If there is unacknowledged data (i.e., SND.NXT >
SND.UNA), then the sending TCP buffers all user
data (regardless of the PSH bit), until the
outstanding data has been acknowledged or until
the TCP can send a full-sized segment (Eff.snd.MSS
bytes; see Section 4.2.2.6).
<end>
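
Read as a send-side predicate, the 1122 wording amounts to roughly the
following sketch (names here are placeholders for illustration, not
kernel data structures):

#include <stdbool.h>
#include <stdint.h>

struct conn_state {
    uint32_t snd_nxt;         /* next sequence number to send */
    uint32_t snd_una;         /* oldest unacknowledged sequence number */
    uint32_t eff_snd_mss;     /* effective send MSS */
    uint32_t queued_bytes;    /* user data buffered but not yet sent */
};

static bool nagle_may_send(const struct conn_state *c)
{
    bool unacked = c->snd_nxt != c->snd_una;
    /* send immediately if nothing is in flight, or once a full-sized
     * segment's worth of data has accumulated */
    return !unacked || c->queued_bytes >= c->eff_snd_mss;
}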
> It was not designed with 16K MSS frame sizes in mind.
I certainly agree that those frame sizes were probably far from their minds at
the time and that basing the decision on the ratio of header overhead is well
within the spirit.
rick jones
* Re: e1000 (?) jumbo frames performance issue
2005-05-05 21:55 ` Michael Iatrou
2005-05-05 22:26 ` Michael Iatrou
@ 2005-05-06 16:18 ` Rick Jones
1 sibling, 0 replies; 9+ messages in thread
From: Rick Jones @ 2005-05-06 16:18 UTC (permalink / raw)
To: Michael Iatrou; +Cc: netdev
>
>>It might be good to add CPU utilization figures - for 2.3pl1 that means
>>editing the makefile to add a -DUSE_PROC_STAT and recompiling. Or you can
>>grab netperf 2.4.0-rc3 from:
>>
>>ftp://ftp.cup.hp.com/dist/networking/benchmarks/netperf/experimental/
>>
>>if you cannot find it elsewhere, and that will (try to) compile-in the
>>right CPU utilization mechanism automagically.
>
>
> I already did my own CPU usage instrumentation (based on info from
> /proc/stat -- the latest netperf does the same thing, doesn't it?) and it
> seems the system has plenty of idle time (up to 50% if I recall correctly).
IIRC you stated that the boxes were UP?
If changing netperf settings didn't affect much, then kernel profiles and/or
packet traces may be in order.
rick