netdev.vger.kernel.org archive mirror
* e1000 (?) jumbo frames performance issue
@ 2005-05-05 16:28 Michael Iatrou
  2005-05-05 20:17 ` Rick Jones
  0 siblings, 1 reply; 9+ messages in thread
From: Michael Iatrou @ 2005-05-05 16:28 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 500 bytes --]

Hi,
I did several benchmarks using an Intel e1000 NIC and it seems there is a
network throughput problem for MTU > 12000 (the e1000 supports an MTU of up
to 16110).

Configuration:
Two identical PCs, connected back to back, Intel Xeon 2.8GHz (SMP/SMT
disabled), 512MB RAM, e1000 (82546EB)

Linux 2.6.11.7
netperf 2.3pl1

http://members.hellug.gr/iatrou/plain_ip_mtu.png
http://members.hellug.gr/iatrou/plain_ip_mtu.dat
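
Each data point was taken after changing the interface MTU and rerunning
netperf.  For reference, a minimal sketch of the MTU change done
programmatically (the function name and the eth0 interface are just
illustrative; it is the equivalent of "ifconfig eth0 mtu <N>"):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

/* Set the MTU of an interface via the SIOCSIFMTU ioctl. */
static int set_mtu(const char *ifname, int mtu)
{
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_mtu = mtu;
    if (ioctl(fd, SIOCSIFMTU, &ifr) < 0) {
        perror("SIOCSIFMTU");
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}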

-- 
 Michael Iatrou
 Electrical and Computer Engineering Dept.
 University of Patras, Greece

[-- Attachment #2: plain_ip_mtu.dat --]
[-- Type: text/plain, Size: 3639 bytes --]

1500	936.14
1550	938.45
1600	927.26
1650	943.06
1700	942.91
1750	943.94
1800	947.95
1850	945.14
1900	952.91
1950	948.15
2000	948.55
2050	949.63
2100	953.98
2150	958.59
2200	959.39
2250	959.48
2300	960.57
2350	960.78
2400	953.26
2450	956.50
2500	964.36
2550	961.98
2600	964.61
2650	963.79
2700	966.36
2750	967.56
2800	968.12
2850	963.65
2900	969.22
2950	968.97
3000	967.69
3050	968.75
3100	970.95
3150	967.95
3200	972.19
3250	972.61
3300	973.01
3350	973.41
3400	973.78
3450	974.15
3500	974.52
3550	974.88
3600	975.23
3650	975.56
3700	975.89
3750	976.20
3800	976.52
3850	976.82
3900	977.11
3950	977.39
4000	977.68
4050	977.95
4100	978.23
4150	978.48
4200	978.74
4250	978.97
4300	979.21
4350	979.46
4400	979.68
4450	979.91
4500	980.13
4550	980.36
4600	980.58
4650	980.77
4700	980.97
4750	981.17
4800	981.37
4850	981.56
4900	981.75
4950	981.92
5000	982.10
5050	982.28
5100	982.47
5150	982.61
5200	982.78
5250	982.96
5300	983.11
5350	983.27
5400	983.43
5450	983.58
5500	983.71
5550	983.87
5600	984.01
5650	984.15
5700	984.27
5750	984.42
5800	984.55
5850	984.67
5900	984.83
5950	984.94
6000	985.06
6050	985.21
6100	985.29
6150	985.43
6200	985.54
6250	985.66
6300	985.79
6350	985.91
6400	986.03
6450	986.13
6500	986.23
6550	986.34
6600	986.44
6650	986.55
6700	986.65
6750	986.75
6800	986.83
6850	986.95
6900	987.04
6950	987.13
7000	987.22
7050	987.31
7100	987.40
7150	987.48
7200	987.57
7250	987.66
7300	987.74
7350	987.81
7400	987.89
7450	987.98
7500	987.74
7550	987.81
7600	987.20
7650	987.58
7700	986.59
7750	985.17
7800	985.46
7850	983.50
7900	983.88
7950	982.86
8000	953.35
8050	952.76
8100	947.34
8150	947.85
8200	942.18
8250	989.13
8300	989.21
8350	989.27
8400	989.33
8450	989.40
8500	989.47
8550	989.52
8600	989.58
8650	989.64
8700	989.71
8750	989.76
8800	989.82
8850	989.88
8900	989.93
8950	989.99
9000	990.04
9050	990.10
9100	990.16
9150	990.21
9200	990.26
9250	990.31
9300	990.37
9350	990.41
9400	990.47
9450	990.51
9500	990.57
9550	990.61
9600	990.66
9650	990.70
9700	990.76
9750	990.81
9800	990.86
9850	990.89
9900	990.94
9950	990.98
10000	991.04
10050	991.08
10100	991.12
10150	991.17
10200	991.21
10250	991.23
10300	991.30
10350	991.34
10400	991.38
10450	991.43
10500	991.46
10550	991.50
10600	991.55
10650	991.58
10700	991.63
10750	991.67
10800	991.70
10850	991.74
10900	991.78
10950	991.81
11000	991.86
11050	991.89
11100	991.92
11150	991.97
11200	992.00
11250	992.03
11300	992.07
11350	992.11
11400	992.13
11450	992.17
11500	992.21
11550	992.23
11600	992.16
11650	992.12
11700	991.89
11750	991.62
11800	991.79
11850	991.50
11900	991.77
11950	990.46
12000	990.52
12050	989.66
12100	968.77
12150	967.79
12200	970.55
12250	955.71
12300	952.36
12350	947.52
12400	945.08
12450	946.48
12500	944.09
12550	945.16
12600	942.07
12650	940.04
12700	938.17
12750	936.29
12800	933.29
12850	930.13
12900	927.75
12950	925.10
13000	924.25
13050	922.31
13100	917.67
13150	915.92
13200	913.99
13250	912.00
13300	908.68
13350	905.72
13400	904.99
13450	903.24
13500	900.91
13550	898.57
13600	897.07
13650	895.17
13700	892.79
13750	890.03
13800	888.41
13850	887.29
13900	886.30
13950	884.43
14000	882.25
14050	880.46
14100	878.22
14150	876.24
14200	875.41
14250	872.58
14300	873.01
14350	870.16
14400	868.45
14450	866.98
14500	864.76
14550	863.05
14600	862.22
14650	860.05
14700	858.71
14750	857.23
14800	856.31
14850	854.37
14900	851.15
14950	849.92
15000	849.54
15050	848.40
15100	847.32
15150	845.17
15200	844.45
15250	842.98
15300	841.14
15350	839.62
15400	838.34
15450	837.14
15500	836.43
15550	834.64
15600	833.64
15650	832.08
15700	830.72
15750	829.29
15800	828.07
15850	826.93
15900	825.75
15950	825.10
16000	823.32
16050	822.36
16100	819.77


* Re: e1000 (?) jumbo frames performance issue
  2005-05-05 16:28 e1000 (?) jumbo frames performance issue Michael Iatrou
@ 2005-05-05 20:17 ` Rick Jones
  2005-05-05 21:33   ` David S. Miller
  2005-05-05 21:55   ` Michael Iatrou
  0 siblings, 2 replies; 9+ messages in thread
From: Rick Jones @ 2005-05-05 20:17 UTC (permalink / raw)
  To: Michael Iatrou; +Cc: netdev

Michael Iatrou wrote:
> Hi,
> I did several benchmarks using Intel e1000 NIC and it seems there is a
> network throughput problem for MTU > 12000 (e1000 supports up to 16110
> MTU).
> 
> Configuration:
> Two identical PCs, connected back to back, Intel Xeon 2.8GHz (SMP/SMT
> disabled), 512MB RAM, e1000 (82546EB)
> 
> Linux 2.6.11.7
> netperf 2.3pl1

What settings, if any, did you use for -s, -S and in particular -m in netperf?

I seem to recall that some of the stack defaults for SO_SNDBUF (IIRC) would 
result in netperf sending 16KB at a time into the connection - once you set the 
MTU above 16K you may have started running into issues with Nagle and delayed 
ACK?  You could try some tests adding a test-specific -D to disable Nagle, or -C 
to set TCP_CORK, or use -m to set the send size to, say, 32KB.
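
For reference, those test-specific options boil down to setsockopt() calls on
the data socket - a minimal sketch, assuming fd is the already-connected TCP
socket (function names are just illustrative):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* -D: disable Nagle, so sub-MSS sends go out immediately. */
static int disable_nagle(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}

/* -C: cork the connection, so only full frames go out until uncorked. */
static int set_cork(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &one, sizeof(one));
}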

It might be good to add CPU utilization figures - for 2.3pl1 that means editing 
the makefile to add a -DUSE_PROC_STAT and recompiling.  Or you can grab netperf 
2.4.0-rc3 from:

ftp://ftp.cup.hp.com/dist/networking/benchmarks/netperf/experimental/

if you cannot find it elsewhere, and that will (try to) compile-in the right CPU 
utilization mechanism automagically.
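
If you want to sanity-check the CPU numbers by hand, a rough sketch of the
/proc/stat method - sample the aggregate "cpu" line before and after the run
(field order assumed to be user nice system idle; later kernels append more
fields):

#include <stdio.h>

struct cpu_sample {
    unsigned long long user, nice, system, idle;
};

static int read_cpu(struct cpu_sample *s)
{
    FILE *f = fopen("/proc/stat", "r");
    int n;

    if (!f)
        return -1;
    n = fscanf(f, "cpu %llu %llu %llu %llu",
               &s->user, &s->nice, &s->system, &s->idle);
    fclose(f);
    return (n == 4) ? 0 : -1;
}

/* Fraction of time spent idle between two samples. */
static double idle_fraction(const struct cpu_sample *a,
                            const struct cpu_sample *b)
{
    unsigned long long busy = (b->user - a->user) + (b->nice - a->nice) +
                              (b->system - a->system);
    unsigned long long idle = b->idle - a->idle;

    return (double)idle / (double)(busy + idle);
}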

rick jones


* Re: e1000 (?) jumbo frames performance issue
  2005-05-05 20:17 ` Rick Jones
@ 2005-05-05 21:33   ` David S. Miller
  2005-05-05 21:54     ` Rick Jones
  2005-05-05 21:55   ` Michael Iatrou
  1 sibling, 1 reply; 9+ messages in thread
From: David S. Miller @ 2005-05-05 21:33 UTC (permalink / raw)
  To: Rick Jones; +Cc: m.iatrou, netdev

On Thu, 05 May 2005 13:17:31 -0700
Rick Jones <rick.jones2@hp.com> wrote:

> I seem to recall that some of the stack defaults for SO_SNDBUF (IIRC) would 
> result in netperf sending 16KB at a time into the connection - once you sent the 
> MTU above 16K you may have started running into issues with Nagle and delayed 
> ACK?  You could try some tests adding a test-specific -D to disable Nagle, or -C 
> to set TCP_CORK, or use -m to set the send size to say, 32KB.

Yes, for one, don't expect reasonable behavior if the MTU is close to or exceeds
the send buffer size in use.

Also, many of Nagle's notions start to fall apart at such high MTU settings.
For example, all variants of Nagle (even with Minshall's modifications) basically
define "small packet" as anything smaller than 1 MSS.

So something to look into (besides increasing your send buffer size when jacking
up the MTU so large) is changing Nagle to use some constant.  Perhaps something
like 512 bytes or smaller, or even 128 bytes or smaller.
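
Roughly, as an illustration of the idea (this is a sketch, not the actual
tcp_output.c logic; the names are made up):

#define NAGLE_SMALL_BYTES	512	/* proposed constant; could be 128 */

static int nagle_should_delay(unsigned int len, unsigned int mss,
			      int data_unacked, int nodelay)
{
	if (nodelay || !data_unacked)
		return 0;			/* always ok to send */
#ifdef NAGLE_CONSTANT_THRESHOLD
	return len < NAGLE_SMALL_BYTES;		/* "small" = below a fixed size */
#else
	return len < mss;			/* classic: "small" = sub-MSS, which
						 * gets silly with a 16K MSS */
#endif
}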


* Re: e1000 (?) jumbo frames performance issue
  2005-05-05 21:33   ` David S. Miller
@ 2005-05-05 21:54     ` Rick Jones
  2005-05-05 22:17       ` David S. Miller
  0 siblings, 1 reply; 9+ messages in thread
From: Rick Jones @ 2005-05-05 21:54 UTC (permalink / raw)
  To: netdev; +Cc: m.iatrou

David S. Miller wrote:
> On Thu, 05 May 2005 13:17:31 -0700
> Rick Jones <rick.jones2@hp.com> wrote:
> 
> 
>>I seem to recall that some of the stack defaults for SO_SNDBUF (IIRC) would 
>>result in netperf sending 16KB at a time into the connection - once you sent the 
>>MTU above 16K you may have started running into issues with Nagle and delayed 
>>ACK?  You could try some tests adding a test-specific -D to disable Nagle, or -C 
>>to set TCP_CORK, or use -m to set the send size to say, 32KB.
> 
> 
> Yes, for one don't expect reasonable behavior if the MTU is near to or less
> than the send buffer size in use.
> 
> Also, many of Nagle's notions start to fall apart at such high MTU settings.
> For example, all of Nagle (even with Minshall's modifications) basically define
> "small packet" as anything smaller than 1 MSS.
> 
> So something to look into (besides increasing your send buffer size with jacking
> up the MTU so large) is changing Nagle to use some constant.  Perhaps something like
> 512 bytes or smaller, or even 128 bytes or smaller.

IMO 128 is too small - 54 bytes of header to only 128 bytes of data seems 
"worthy" of encountering Nagle by default.  If not 1460, then 536 feels nice - I 
would guess it was likely a common MSS "back in the day" when Nagle first 
proposed the algorithm/heuristic - assuming of course that the intent of the 
algorithm was to try to get the average data/(header+data) ratio to something 
around 0.9 (although IIRC, none of a 537-byte send would be delayed by Nagle 
since what mattered was the size of the user's send being >= the MSS, so make 
that ~0.45?)
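
Spelling out that handwaving one possible way, with the RFC 896 numbers
(40 bytes of TCP/IP header, 536-byte MSS) - a throwaway snippet just to show
the arithmetic:

#include <stdio.h>

int main(void)
{
	double full = 536.0 / (536 + 40);  /* ~0.93 -> the "around 0.9" */
	double tail = 1.0 / (1 + 40);      /* the 1-byte tail of a 537-byte send */

	printf("full 536-byte segment:   %.2f\n", full);
	printf("537 bytes sent as 536+1: %.2f\n", (full + tail) / 2);  /* ~0.48 -> "~0.45" */
	return 0;
}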

rick jones


* Re: e1000 (?) jumbo frames performance issue
  2005-05-05 20:17 ` Rick Jones
  2005-05-05 21:33   ` David S. Miller
@ 2005-05-05 21:55   ` Michael Iatrou
  2005-05-05 22:26     ` Michael Iatrou
  2005-05-06 16:18     ` Rick Jones
  1 sibling, 2 replies; 9+ messages in thread
From: Michael Iatrou @ 2005-05-05 21:55 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev

On Thursday 05 May 2005 23:17, Rick Jones wrote:


> What settings, if any, did you use for -s, -S and in particular -m in
> netperf?

-s 0 -S 0 -m 16384

For both ends:

/proc/sys/net/core/wmem_max:   16777216
/proc/sys/net/core/rmem_max:   16777216
/proc/sys/net/ipv4/tcp_rmem:   16384    349520  16777216
/proc/sys/net/ipv4/tcp_wmem:   16384    262144  16777216

> I seem to recall that some of the stack defaults for SO_SNDBUF (IIRC) would
> result in netperf sending 16KB at a time into the connection - once you
> sent the MTU above 16K you may have started running into issues with Nagle
> and delayed ACK?  

The problem first appears at 12KB...

> You could try some tests adding a test-specific -D to 
> disable Nagle, or -C to set TCP_CORK, or use -m to set the send size to
> say, 32KB.

I've already tested -m 32KB and it's the same as 16KB. 
I will try -D and -C too.

> It might be good to add CPU utilization figures - for 2.3pl1 that means
> editing the makefile to add a -DUSE_PROC_STAT and recompiling.  Or you can
> grab netperf 2.4.0-rc3 from:
>
> ftp://ftp.cup.hp.com/dist/networking/benchmarks/netperf/experimental/
>
> if you cannot find it elsewhere, and that will (try to) compile-in the
> right CPU utilization mechanism automagically.

I already did custom CPU usage instrumentation (based on info 
from /proc/stat -- the latest netperf does the same thing, doesn't it?) and 
it seems that the system has plenty of idle time (up to 50%, if I recall correctly).

-- 
 Michael Iatrou
 Electrical and Computer Engineering Dept.
 University of Patras, Greece


* Re: e1000 (?) jumbo frames performance issue
  2005-05-05 21:54     ` Rick Jones
@ 2005-05-05 22:17       ` David S. Miller
  2005-05-05 23:24         ` Rick Jones
  0 siblings, 1 reply; 9+ messages in thread
From: David S. Miller @ 2005-05-05 22:17 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev, m.iatrou

On Thu, 05 May 2005 14:54:43 -0700
Rick Jones <rick.jones2@hp.com> wrote:

> assuming of course that the intent of the algorithm was to try to get the average header/header+data ratio to something 
> around 0.9 (although IIRC, none of a 537 byte send would  be delayed by Nagle 
> since it was the size of the user's send being >= the MSS, so make that ~0.45 ?)

It tries to hold smaller packets back in the hope of getting some more sendmsg()
calls, which will bunch up some more data before all outstanding data is
ACK'd.

It's meant for terminal protocols and other chatty sequences.

It was not designed with 16K MSS frame sizes in mind.


* Re: e1000 (?) jumbo frames performance issue
  2005-05-05 21:55   ` Michael Iatrou
@ 2005-05-05 22:26     ` Michael Iatrou
  2005-05-06 16:18     ` Rick Jones
  1 sibling, 0 replies; 9+ messages in thread
From: Michael Iatrou @ 2005-05-05 22:26 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev

On Friday 06 May 2005 00:55, Michael Iatrou wrote:

> > You could try some tests adding a test-specific -D to
> > disable Nagle, or -C to set TCP_CORK, or use -m to set the send size to
> > say, 32KB.
>
> I 've already tested -m 32KB and its the same as 16KB.
> I will try -D and -C too.

Done, (almost) nothing changed.

-- 
 Michael Iatrou
 Electrical and Computer Engineering Dept.
 University of Patras, Greece


* Re: e1000 (?) jumbo frames performance issue
  2005-05-05 22:17       ` David S. Miller
@ 2005-05-05 23:24         ` Rick Jones
  0 siblings, 0 replies; 9+ messages in thread
From: Rick Jones @ 2005-05-05 23:24 UTC (permalink / raw)
  To: netdev; +Cc: m.iatrou

David S. Miller wrote:
> On Thu, 05 May 2005 14:54:43 -0700 Rick Jones <rick.jones2@hp.com> wrote:
> 
> 
>> assuming of course that the intent of the algorithm was to try to get the
>> average header/header+data ratio to something around 0.9 (although IIRC,
>> none of a 537 byte send would  be delayed by Nagle since it was the size of
>> the user's send being >= the MSS, so make that ~0.45 ?)
> 
> 
> It tries to hold smaller packets back in hopes to get some more sendmsg() 
> calls which will bunch up some more data before all outstanding data is 
> ACK'd.

I think we may be saying _nearly_ the same thing, although I would call that
smaller user sends.  Nothing I've read (and remembered) suggested that a user 
send of MSS+1 bytes should have that last byte delayed.  That's where I then got 
that handwaving math of 0.45 instead of 0.9.

My bringing up the ratio of header to header+data comes from stuff like this in 
RFC 896:

<begin>
                    The small-packet problem

There is a special problem associated with small  packets.   When
TCP  is  used  for  the transmission of single-character messages
originating at a keyboard, the typical result  is  that  41  byte
packets  (one  byte  of data, 40 bytes of header) are transmitted
for each byte of useful data.  This 4000%  overhead  is  annoying
but tolerable on lightly loaded networks.  On heavily loaded net-
works, however, the congestion resulting from this  overhead  can
result  in  lost datagrams and retransmissions, as well as exces-
sive propagation time caused by congestion in switching nodes and
gateways.   In practice, throughput may drop so low that TCP con-
nections are aborted.
<end>

The reason I make the "user send" versus packet distinction comes from stuff 
like this:

<begin>
The solution is to inhibit the sending of new TCP  segments  when
new  outgoing  data  arrives  from  the  user  if  any previously
transmitted data on the connection remains unacknowledged.
<end>

I do acknowledge though that there have been stacks that interpreted Nagle on a 
segment by segment basis rather than a user send by user send basis.  I just 
don't think that they were correct :)

> 
> It's meant for terminal protocols and other chatty sequences.
> 

He included an FTP example with 512-byte sends, which leads me to believe it was 
meant for more than just terminal protocols:

<begin>
We use our scheme for all TCP connections, not just  Telnet  con-
nections.   Let us see what happens for a file transfer data con-
nection using our technique. The two extreme cases will again  be
considered.

As before, we first consider the Ethernet case.  The user is  now
writing data to TCP in 512 byte blocks as fast as TCP will accept
them.  The user's first write to TCP will start things going; our
first  datagram  will  be  512+40  bytes  or 552 bytes long.  The
user's second write to TCP will not cause a send but  will  cause
the  block  to  be buffered.
<end>

What I'd forgotten is that the original RFC had no explicit discussion of checks 
against the MSS.  It _seems_ that the first reference to that is in RFC 898, 
which was a writeup of meeting notes:

<begin>
Congestion Control -- FACC - Nagle

       Postel:  This was a discussion of the situation leading to the
       ideas presented in RFC 896, and how the policies described there
       improved overall performance.


       Muuss:

       First principle of congestion control:

          DON'T DROP PACKETS (unless absolutely necessary)

       Second principle:

          Hosts must behave themselves (or else)

          Enemies list -

             1.  TOPS-20 TCP from DEC
             2.  VAX/UNIX 4.2 from Berkeley

       Third principle:

          Memory won't help (beyond a certain point).

          The small packet problem: Big packets are good, small are bad
          (big = 576).

       Suggested fix: Rule: When the user writes to TCP, initiate a send
       only if there are NO outstanding packets on the connection. [good
       for TELNET, at least] (or if you fill a segment). No change when
       Acks come back. Assumption is that there is a pipe-like buffer
       between the user and the TCP.
<end>

with that parenthetical "(or if you fill a segment)" comment.  It is interesting 
how they define "big = 576" :)

It seems the full-sized segment bit gets formalized in RFC 1122:

<begin>
             A TCP SHOULD implement the Nagle Algorithm [TCP:9] to
             coalesce short segments.  However, there MUST be a way for
             an application to disable the Nagle algorithm on an
             individual connection.  In all cases, sending data is also
             subject to the limitation imposed by the Slow Start
             algorithm (Section 4.2.2.15).

             DISCUSSION:
                  The Nagle algorithm is generally as follows:

                       If there is unacknowledged data (i.e., SND.NXT >
                        SND.UNA), then the sending TCP buffers all user
                       data (regardless of the PSH bit), until the
                       outstanding data has been acknowledged or until
                       the TCP can send a full-sized segment (Eff.snd.MSS
                       bytes; see Section 4.2.2.6).
<end>


> It was not designed with 16K MSS frame sizes in mind.

I certainly agree that those frame sizes were probably far from their minds at 
the time and that basing the decision on the ratio of header overhead is well 
within the spirit.

rick jones


* Re: e1000 (?) jumbo frames performance issue
  2005-05-05 21:55   ` Michael Iatrou
  2005-05-05 22:26     ` Michael Iatrou
@ 2005-05-06 16:18     ` Rick Jones
  1 sibling, 0 replies; 9+ messages in thread
From: Rick Jones @ 2005-05-06 16:18 UTC (permalink / raw)
  To: Michael Iatrou; +Cc: netdev


> 
>>It might be good to add CPU utilization figures - for 2.3pl1 that means
>>editing the makefile to add a -DUSE_PROC_STAT and recompiling.  Or you can
>>grab netperf 2.4.0-rc3 from:
>>
>>ftp://ftp.cup.hp.com/dist/networking/benchmarks/netperf/experimental/
>>
>>if you cannot find it elsewhere, and that will (try to) compile-in the
>>right CPU utilization mechanism automagically.
> 
> 
> I already did a custom CPU usage instrumentation (based on infos 
> from /proc/stat -- the latest netperf does the same thing, doesn't it?) and 
> it seems that system has plenty of idle time (up to 50% if I recall correct)

IIRC you stated that the boxes were UP?

If changing netperf settings didn't affect much, then kernel profiles and/or 
packet traces may be in order.

rick

