* e1000 (?) jumbo frames performance issue
@ 2005-05-05 16:28 Michael Iatrou
  2005-05-05 20:17 ` Rick Jones
  0 siblings, 1 reply; 9+ messages in thread

From: Michael Iatrou @ 2005-05-05 16:28 UTC (permalink / raw)
To: netdev

[-- Attachment #1: Type: text/plain, Size: 500 bytes --]

Hi,

I did several benchmarks using an Intel e1000 NIC and it seems there is a
network throughput problem for MTU > 12000 (the e1000 supports an MTU of up
to 16110).

Configuration:
Two identical PCs, connected back to back, Intel Xeon 2.8GHz (SMP/SMT
disabled), 512MB RAM, e1000 (82546EB)

Linux 2.6.11.7
netperf 2.3pl1

http://members.hellug.gr/iatrou/plain_ip_mtu.png
http://members.hellug.gr/iatrou/plain_ip_mtu.dat

--
 Michael Iatrou
 Electrical and Computer Engineering Dept.
 University of Patras, Greece

[-- Attachment #2: plain_ip_mtu.dat --]
[-- Type: text/plain, Size: 3639 bytes --]

[Attached data: MTU (bytes) vs. TCP throughput (Mbit/s), measured in
50-byte MTU steps from 1500 to 16100. Throughput rises from ~936 Mbit/s at
MTU 1500 to a plateau around 988-992 Mbit/s, dips briefly near MTU
8000-8200, peaks at ~992 Mbit/s around MTU 11550, then falls off steadily
above ~11600 down to ~820 Mbit/s at MTU 16100. Excerpt:]

 1500   936.14
 2000   948.55
 3000   967.69
 4000   977.68
 5000   982.10
 6000   985.06
 7000   987.22
 8000   953.35
 9000   990.04
10000   991.04
11000   991.86
12000   990.52
13000   924.25
14000   882.25
15000   849.54
16000   823.32
16100   819.77

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: e1000 (?) jumbo frames performance issue
  2005-05-05 16:28 e1000 (?) jumbo frames performance issue Michael Iatrou
@ 2005-05-05 20:17 ` Rick Jones
  2005-05-05 21:33   ` David S. Miller
  2005-05-05 21:55   ` Michael Iatrou
  0 siblings, 2 replies; 9+ messages in thread

From: Rick Jones @ 2005-05-05 20:17 UTC (permalink / raw)
To: Michael Iatrou; +Cc: netdev

Michael Iatrou wrote:
> Hi,
> I did several benchmarks using an Intel e1000 NIC and it seems there is a
> network throughput problem for MTU > 12000 (the e1000 supports an MTU of
> up to 16110).
>
> Configuration:
> Two identical PCs, connected back to back, Intel Xeon 2.8GHz (SMP/SMT
> disabled), 512MB RAM, e1000 (82546EB)
>
> Linux 2.6.11.7
> netperf 2.3pl1

What settings, if any, did you use for -s, -S and in particular -m in
netperf?

I seem to recall that some of the stack defaults for SO_SNDBUF (IIRC)
would result in netperf sending 16KB at a time into the connection - once
you set the MTU above 16K you may have started running into issues with
Nagle and delayed ACK? You could try some tests adding a test-specific -D
to disable Nagle, or -C to set TCP_CORK, or use -m to set the send size
to, say, 32KB.

It might be good to add CPU utilization figures - for 2.3pl1 that means
editing the makefile to add a -DUSE_PROC_STAT and recompiling. Or you can
grab netperf 2.4.0-rc3 from:

ftp://ftp.cup.hp.com/dist/networking/benchmarks/netperf/experimental/

if you cannot find it elsewhere, and that will (try to) compile-in the
right CPU utilization mechanism automagically.

rick jones

^ permalink raw reply	[flat|nested] 9+ messages in thread
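As a point of reference, the test-specific -D and -C options mentioned
above correspond to the TCP_NODELAY and TCP_CORK socket options. A minimal
sketch of what that looks like at the socket level (illustrative only, not
netperf's own source; tune_tcp_socket() is a made-up helper):

    /* Illustrative only: roughly what netperf's -D and -C map to. */
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    static void tune_tcp_socket(int fd, int disable_nagle, int cork)
    {
            int one = 1;

            if (disable_nagle)   /* netperf test-specific -D */
                    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
                               &one, sizeof(one));

            if (cork)            /* netperf test-specific -C (Linux TCP_CORK) */
                    setsockopt(fd, IPPROTO_TCP, TCP_CORK,
                               &one, sizeof(one));
    }

With TCP_NODELAY set, sub-MSS segments go out immediately instead of
waiting behind unacknowledged data; TCP_CORK does the opposite, holding
partial segments back until a full segment accumulates or the cork is
removed.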
* Re: e1000 (?) jumbo frames performance issue
  2005-05-05 20:17 ` Rick Jones
@ 2005-05-05 21:33 ` David S. Miller
  2005-05-05 21:54   ` Rick Jones
  2005-05-05 21:55 ` Michael Iatrou
  1 sibling, 1 reply; 9+ messages in thread

From: David S. Miller @ 2005-05-05 21:33 UTC (permalink / raw)
To: Rick Jones; +Cc: m.iatrou, netdev

On Thu, 05 May 2005 13:17:31 -0700
Rick Jones <rick.jones2@hp.com> wrote:

> I seem to recall that some of the stack defaults for SO_SNDBUF (IIRC)
> would result in netperf sending 16KB at a time into the connection - once
> you set the MTU above 16K you may have started running into issues with
> Nagle and delayed ACK? You could try some tests adding a test-specific -D
> to disable Nagle, or -C to set TCP_CORK, or use -m to set the send size
> to, say, 32KB.

Yes, for one, don't expect reasonable behavior if the send buffer size in
use is close to (or smaller than) the MTU.

Also, many of Nagle's notions start to fall apart at such high MTU
settings. For example, all of Nagle (even with Minshall's modifications)
basically defines a "small packet" as anything smaller than 1 MSS.

So something to look into (besides increasing your send buffer size when
jacking up the MTU so large) is changing Nagle to use some constant.
Perhaps something like 512 bytes or smaller, or even 128 bytes or smaller.

^ permalink raw reply	[flat|nested] 9+ messages in thread
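For concreteness, the change being floated might look something like the
following simplified sketch (C-style pseudologic, not the actual
tcp_output.c code; NAGLE_SMALL_BYTES is a hypothetical constant):

    #define NAGLE_SMALL_BYTES 512   /* or 128, per the suggestion above */

    /* Classic Nagle/Minshall treats anything under one MSS as "small",
     * which at a ~16K MSS covers almost every application write.  The
     * idea here is to tie "small" to a fixed byte count instead. */
    static int payload_is_small(unsigned int len, unsigned int mss_now,
                                int use_constant)
    {
            if (use_constant)
                    return len < NAGLE_SMALL_BYTES;  /* proposed */
            return len < mss_now;                    /* classic rule */
    }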
* Re: e1000 (?) jumbo frames performance issue
  2005-05-05 21:33 ` David S. Miller
@ 2005-05-05 21:54   ` Rick Jones
  2005-05-05 22:17     ` David S. Miller
  0 siblings, 1 reply; 9+ messages in thread

From: Rick Jones @ 2005-05-05 21:54 UTC (permalink / raw)
To: netdev; +Cc: m.iatrou

David S. Miller wrote:
> On Thu, 05 May 2005 13:17:31 -0700
> Rick Jones <rick.jones2@hp.com> wrote:
>
>> I seem to recall that some of the stack defaults for SO_SNDBUF (IIRC)
>> would result in netperf sending 16KB at a time into the connection -
>> once you set the MTU above 16K you may have started running into issues
>> with Nagle and delayed ACK? You could try some tests adding a
>> test-specific -D to disable Nagle, or -C to set TCP_CORK, or use -m to
>> set the send size to, say, 32KB.
>
> Yes, for one, don't expect reasonable behavior if the send buffer size in
> use is close to (or smaller than) the MTU.
>
> Also, many of Nagle's notions start to fall apart at such high MTU
> settings. For example, all of Nagle (even with Minshall's modifications)
> basically defines a "small packet" as anything smaller than 1 MSS.
>
> So something to look into (besides increasing your send buffer size when
> jacking up the MTU so large) is changing Nagle to use some constant.
> Perhaps something like 512 bytes or smaller, or even 128 bytes or
> smaller.

IMO 128 is too small - 54 bytes of header to only 128 bytes of data seems
"worthy" of encountering Nagle by default. If not 1460, then 536 feels
nice - I would guess it was likely a common MSS "back in the day" when
Nagle first proposed the algorithm/heuristic - assuming, of course, that
the intent of the algorithm was to get the average data/(header+data)
ratio to something around 0.9 (although IIRC none of a 537-byte send would
be delayed by Nagle, since what mattered was the size of the user's send
being >= the MSS, so make that ~0.45?).

rick jones

^ permalink raw reply	[flat|nested] 9+ messages in thread
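Spelling that arithmetic out, assuming a 536-byte MSS, 40 bytes of TCP/IP
header, and reading the ratio as data over header-plus-data:

    \frac{536}{536 + 40} \approx 0.93,
    \qquad
    \frac{1}{2}\left(\frac{536}{576} + \frac{1}{41}\right) \approx 0.48

That is, a full 536-byte segment is about 93% data, while a 537-byte user
send that goes out as a 536-byte segment plus an undelayed 1-byte trailer
averages out near the ~0.45 mentioned above.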
* Re: e1000 (?) jumbo frames performance issue
  2005-05-05 21:54 ` Rick Jones
@ 2005-05-05 22:17   ` David S. Miller
  2005-05-05 23:24     ` Rick Jones
  0 siblings, 1 reply; 9+ messages in thread

From: David S. Miller @ 2005-05-05 22:17 UTC (permalink / raw)
To: Rick Jones; +Cc: netdev, m.iatrou

On Thu, 05 May 2005 14:54:43 -0700
Rick Jones <rick.jones2@hp.com> wrote:

> assuming, of course, that the intent of the algorithm was to get the
> average data/(header+data) ratio to something around 0.9 (although IIRC
> none of a 537-byte send would be delayed by Nagle, since what mattered
> was the size of the user's send being >= the MSS, so make that ~0.45?)

It tries to hold smaller packets back in the hope of getting some more
sendmsg() calls that will bunch up more data before all outstanding data
is ACK'd.

It's meant for terminal protocols and other chatty sequences.

It was not designed with 16K MSS frame sizes in mind.

^ permalink raw reply	[flat|nested] 9+ messages in thread
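In compact form, the rule being described is roughly the following (a
simplified sketch of the RFC 896 / RFC 1122 behavior, not the Linux
implementation):

    enum verdict { SEND_NOW, DEFER };

    /* queued  = bytes the application has written but TCP has not sent
     * mss     = current effective segment size
     * unacked = nonzero if previously sent data is still awaiting an ACK */
    static enum verdict nagle_check(unsigned int queued, unsigned int mss,
                                    int unacked, int nodelay)
    {
            if (nodelay)            /* TCP_NODELAY: never defer */
                    return SEND_NOW;
            if (queued >= mss)      /* a full segment can be sent */
                    return SEND_NOW;
            if (!unacked)           /* nothing in flight: send the runt */
                    return SEND_NOW;
            return DEFER;           /* wait for an ACK or more data */
    }

With a ~16K MSS, far more application writes fall below one full segment
than at a 1500-byte MTU, so the deferral path is taken much more often.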
* Re: e1000 (?) jumbo frames performance issue
  2005-05-05 22:17 ` David S. Miller
@ 2005-05-05 23:24   ` Rick Jones
  0 siblings, 0 replies; 9+ messages in thread

From: Rick Jones @ 2005-05-05 23:24 UTC (permalink / raw)
To: netdev; +Cc: m.iatrou

David S. Miller wrote:
> On Thu, 05 May 2005 14:54:43 -0700 Rick Jones <rick.jones2@hp.com> wrote:
>
>> assuming, of course, that the intent of the algorithm was to get the
>> average data/(header+data) ratio to something around 0.9 (although IIRC
>> none of a 537-byte send would be delayed by Nagle, since what mattered
>> was the size of the user's send being >= the MSS, so make that ~0.45?)
>
> It tries to hold smaller packets back in the hope of getting some more
> sendmsg() calls that will bunch up more data before all outstanding data
> is ACK'd.

I think we may be saying _nearly_ the same thing, although I would call
those smaller user sends. Nothing I've read (and remembered) suggested
that a user send of MSS+1 bytes should have that last byte delayed. That's
where I then got that handwaving math of 0.45 instead of 0.9.

My bringing up the ratio of header to header+data comes from stuff like
this in RFC 896:

<begin>
The small-packet problem

There is a special problem associated with small packets. When TCP is used
for the transmission of single-character messages originating at a
keyboard, the typical result is that 41 byte packets (one byte of data, 40
bytes of header) are transmitted for each byte of useful data. This 4000%
overhead is annoying but tolerable on lightly loaded networks. On heavily
loaded networks, however, the congestion resulting from this overhead can
result in lost datagrams and retransmissions, as well as excessive
propagation time caused by congestion in switching nodes and gateways. In
practice, throughput may drop so low that TCP connections are aborted.
<end>

The reason I make the "user send" versus packet distinction comes from
stuff like this:

<begin>
The solution is to inhibit the sending of new TCP segments when new
outgoing data arrives from the user if any previously transmitted data on
the connection remains unacknowledged.
<end>

I do acknowledge, though, that there have been stacks that interpreted
Nagle on a segment-by-segment basis rather than a user-send-by-user-send
basis. I just don't think that they were correct :)

> It's meant for terminal protocols and other chatty sequences.

He included an FTP example with 512-byte sends, which leads me to believe
it was meant for more than just terminal protocols:

<begin>
We use our scheme for all TCP connections, not just Telnet connections.
Let us see what happens for a file transfer data connection using our
technique. The two extreme cases will again be considered.

As before, we first consider the Ethernet case. The user is now writing
data to TCP in 512 byte blocks as fast as TCP will accept them. The user's
first write to TCP will start things going; our first datagram will be
512+40 bytes or 552 bytes long. The user's second write to TCP will not
cause a send but will cause the block to be buffered.
<end>

What I'd forgotten is that the original RFC had no explicit discussion of
checks against the MSS. It _seems_ that the first reference to that is in
RFC 898, which was a writeup of meeting notes:

<begin>
Congestion Control -- FACC - Nagle

Postel: This was a discussion of the situation leading to the ideas
presented in RFC 896, and how the policies described there improved
overall performance.

Muuss: First principle of congestion control: DON'T DROP PACKETS (unless
absolutely necessary)

Second principle: Hosts must behave themselves (or else)
Enemies list -
  1. TOPS-20 TCP from DEC
  2. VAX/UNIX 4.2 from Berkeley

Third principle: Memory won't help (beyond a certain point).

The small packet problem: Big packets are good, small are bad (big = 576).

Suggested fix:
Rule: When the user writes to TCP, initiate a send only if there are NO
outstanding packets on the connection. [good for TELNET, at least] (or if
you fill a segment). No change when Acks come back.

Assumption is that there is a pipe-like buffer between the user and the
TCP.
<end>

with that parenthetical "(or if you fill a segment)" comment. It is
interesting how they define "big = 576" :)

It seems the full-sized segment bit gets formalized in RFC 1122:

<begin>
A TCP SHOULD implement the Nagle Algorithm [TCP:9] to coalesce short
segments. However, there MUST be a way for an application to disable the
Nagle algorithm on an individual connection. In all cases, sending data is
also subject to the limitation imposed by the Slow Start algorithm
(Section 4.2.2.15).

DISCUSSION:
The Nagle algorithm is generally as follows: If there is unacknowledged
data (i.e., SND.NXT > SND.UNA), then the sending TCP buffers all user data
(regardless of the PSH bit), until the outstanding data has been
acknowledged or until the TCP can send a full-sized segment (Eff.snd.MSS
bytes; see Section 4.2.2.6).
<end>

> It was not designed with 16K MSS frame sizes in mind.

I certainly agree that those frame sizes were probably far from their
minds at the time, and that basing the decision on the ratio of header
overhead is well within the spirit.

rick jones

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: e1000 (?) jumbo frames performance issue
  2005-05-05 20:17 ` Rick Jones
  2005-05-05 21:33   ` David S. Miller
@ 2005-05-05 21:55   ` Michael Iatrou
  2005-05-05 22:26     ` Michael Iatrou
  2005-05-06 16:18     ` Rick Jones
  1 sibling, 2 replies; 9+ messages in thread

From: Michael Iatrou @ 2005-05-05 21:55 UTC (permalink / raw)
To: Rick Jones; +Cc: netdev

When the date was Thursday 05 May 2005 23:17, Rick Jones wrote:

> What settings, if any, did you use for -s, -S and in particular -m in
> netperf?

-s 0 -S 0 -m 16384

For both ends:
/proc/sys/net/core/wmem_max:  16777216
/proc/sys/net/core/rmem_max:  16777216
/proc/sys/net/ipv4/tcp_rmem:  16384 349520 16777216
/proc/sys/net/ipv4/tcp_wmem:  16384 262144 16777216

> I seem to recall that some of the stack defaults for SO_SNDBUF (IIRC)
> would result in netperf sending 16KB at a time into the connection - once
> you set the MTU above 16K you may have started running into issues with
> Nagle and delayed ACK?

The problem first appears at 12KB...

> You could try some tests adding a test-specific -D to disable Nagle, or
> -C to set TCP_CORK, or use -m to set the send size to, say, 32KB.

I've already tested -m 32KB and it's the same as with 16KB.
I will try -D and -C too.

> It might be good to add CPU utilization figures - for 2.3pl1 that means
> editing the makefile to add a -DUSE_PROC_STAT and recompiling. Or you can
> grab netperf 2.4.0-rc3 from:
>
> ftp://ftp.cup.hp.com/dist/networking/benchmarks/netperf/experimental/
>
> if you cannot find it elsewhere, and that will (try to) compile-in the
> right CPU utilization mechanism automagically.

I already did a custom CPU usage instrumentation (based on info
from /proc/stat -- the latest netperf does the same thing, doesn't it?)
and it seems that the system has plenty of idle time (up to 50%, if I
recall correctly).

--
 Michael Iatrou
 Electrical and Computer Engineering Dept.
 University of Patras, Greece

^ permalink raw reply	[flat|nested] 9+ messages in thread
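A minimal sketch of the kind of /proc/stat sampling described above might
look like this (illustrative only; it assumes the 2.6-era "cpu user nice
system idle iowait irq softirq" line, and kernels that print fewer fields
simply leave the trailing counters at zero):

    #include <stdio.h>
    #include <unistd.h>

    static void sample(unsigned long long *total, unsigned long long *idle)
    {
            unsigned long long v[7] = {0};
            FILE *f = fopen("/proc/stat", "r");
            int i;

            if (f) {
                    fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu",
                           &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6]);
                    fclose(f);
            }
            *total = 0;
            for (i = 0; i < 7; i++)
                    *total += v[i];
            *idle = v[3];
    }

    int main(void)
    {
            unsigned long long t0, i0, t1, i1;

            sample(&t0, &i0);
            sleep(10);                      /* measurement window */
            sample(&t1, &i1);

            printf("idle: %.1f%%\n",
                   100.0 * (i1 - i0) / (double)(t1 - t0));
            return 0;
    }

If netperf is built with its /proc/stat-based method, it reads the same
counters, so the two idle figures should roughly agree.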
* Re: e1000 (?) jumbo frames performance issue
  2005-05-05 21:55 ` Michael Iatrou
@ 2005-05-05 22:26   ` Michael Iatrou
  0 siblings, 0 replies; 9+ messages in thread

From: Michael Iatrou @ 2005-05-05 22:26 UTC (permalink / raw)
To: Rick Jones; +Cc: netdev

When the date was Friday 06 May 2005 00:55, Michael Iatrou wrote:

> > You could try some tests adding a test-specific -D to disable Nagle, or
> > -C to set TCP_CORK, or use -m to set the send size to, say, 32KB.
>
> I've already tested -m 32KB and it's the same as with 16KB.
> I will try -D and -C too.

Done, (almost) nothing changed.

--
 Michael Iatrou
 Electrical and Computer Engineering Dept.
 University of Patras, Greece

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: e1000 (?) jumbo frames performance issue
  2005-05-05 21:55 ` Michael Iatrou
  2005-05-05 22:26   ` Michael Iatrou
@ 2005-05-06 16:18   ` Rick Jones
  0 siblings, 0 replies; 9+ messages in thread

From: Rick Jones @ 2005-05-06 16:18 UTC (permalink / raw)
To: Michael Iatrou; +Cc: netdev

>> It might be good to add CPU utilization figures - for 2.3pl1 that means
>> editing the makefile to add a -DUSE_PROC_STAT and recompiling. Or you
>> can grab netperf 2.4.0-rc3 from:
>>
>> ftp://ftp.cup.hp.com/dist/networking/benchmarks/netperf/experimental/
>>
>> if you cannot find it elsewhere, and that will (try to) compile-in the
>> right CPU utilization mechanism automagically.
>
> I already did a custom CPU usage instrumentation (based on info
> from /proc/stat -- the latest netperf does the same thing, doesn't it?)
> and it seems that the system has plenty of idle time (up to 50%, if I
> recall correctly).

IIRC you stated that the boxes were UP (uniprocessor)? If changing netperf
settings didn't affect much, then kernel profiles and/or packet traces may
be in order.

rick

^ permalink raw reply	[flat|nested] 9+ messages in thread
end of thread, other threads: [~2005-05-06 16:18 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-05-05 16:28 e1000 (?) jumbo frames performance issue Michael Iatrou
2005-05-05 20:17 ` Rick Jones
2005-05-05 21:33   ` David S. Miller
2005-05-05 21:54     ` Rick Jones
2005-05-05 22:17       ` David S. Miller
2005-05-05 23:24         ` Rick Jones
2005-05-05 21:55   ` Michael Iatrou
2005-05-05 22:26     ` Michael Iatrou
2005-05-06 16:18     ` Rick Jones