* e1000 full-duplex TCP performance well below wire speed
@ 2008-01-30 12:23 Bruce Allen
2008-01-30 17:36 ` Brandeburg, Jesse
2008-01-30 19:17 ` Ben Greear
0 siblings, 2 replies; 32+ messages in thread
From: Bruce Allen @ 2008-01-30 12:23 UTC (permalink / raw)
To: netdev; +Cc: Carsten Aulbert, Henning Fehrmann, Bruce Allen
[-- Attachment #1: Type: TEXT/PLAIN, Size: 1769 bytes --]
(Pádraig Brady has suggested that I post this to Netdev. It was
originally posted to LKML here: http://lkml.org/lkml/2008/1/30/141 )
Dear NetDev,
We've connected a pair of modern high-performance boxes with integrated copper
Gb/s Intel NICs, with an ethernet crossover cable, and have run some netperf
full duplex TCP tests. The transfer rates are well below wire speed. We're
reporting this as a kernel bug, because we expect a vanilla kernel with default
settings to give wire speed (or close to wire speed) performance in this case.
We DO see wire speed in simplex transfers. The behavior has been verified on
multiple machines with identical hardware.
Details:
Kernel version: 2.6.23.12
ethernet NIC: Intel 82573L
ethernet driver: e1000 version 7.3.20-k2
motherboard: Supermicro PDSML-LN2+ (one quad core Intel Xeon X3220, Intel 3000
chipset, 8GB memory)
The test was done with various mtu sizes ranging from 1500 to 9000, with
ethernet flow control switched on and off, and using reno and cubic as a TCP
congestion control.
The behavior depends on the setup. In one test we used cubic congestion
control, flow control off. The transfer rate in one direction was above 0.9Gb/s
while in the other direction it was 0.6 to 0.8 Gb/s. After 15-20s the rates
flipped. Perhaps the two streams are fighting for resources. (The performance of
a full duplex stream should be close to 1Gb/s in both directions.) A graph of
the transfer speed as a function of time is here:
https://n0.aei.uni-hannover.de/networktest/node19-new20-noflow.jpg
Red shows transmit and green shows receive (please ignore other plots):
We're happy to do additional testing, if that would help, and very grateful for
any advice!
Bruce Allen
Carsten Aulbert
Henning Fehrmann
^ permalink raw reply [flat|nested] 32+ messages in thread
* RE: e1000 full-duplex TCP performance well below wire speed
2008-01-30 12:23 e1000 full-duplex TCP performance well below wire speed Bruce Allen
@ 2008-01-30 17:36 ` Brandeburg, Jesse
2008-01-30 18:45 ` Rick Jones
2008-01-30 23:07 ` Bruce Allen
2008-01-30 19:17 ` Ben Greear
1 sibling, 2 replies; 32+ messages in thread
From: Brandeburg, Jesse @ 2008-01-30 17:36 UTC (permalink / raw)
To: Bruce Allen, netdev; +Cc: Carsten Aulbert, Henning Fehrmann, Bruce Allen
Bruce Allen wrote:
> Details:
> Kernel version: 2.6.23.12
> ethernet NIC: Intel 82573L
> ethernet driver: e1000 version 7.3.20-k2
> motherboard: Supermicro PDSML-LN2+ (one quad core Intel Xeon X3220,
> Intel 3000 chipset, 8GB memory)
Hi Bruce,
The 82573L (a client NIC, regardless of the class of machine it is in)
only has a x1 connection which does introduce some latency since the
slot is only capable of about 2Gb/s data total, which includes overhead
of descriptors and other transactions. As you approach the maximum of
the slot it gets more and more difficult to get wire speed in a
bidirectional test.
> The test was done with various mtu sizes ranging from 1500 to 9000,
> with ethernet flow control switched on and off, and using reno and
> cubic as a TCP congestion control.
As asked in LKML thread, please post the exact netperf command used to
start the client/server, whether or not you're using irqbalanced (aka
irqbalance) and what cat /proc/interrupts looks like (you ARE using MSI,
right?)
I've recently discovered that particularly with the most recent kernels
if you specify any socket options (-- -SX -sY) to netperf it does worse
than if it just lets the kernel auto-tune.
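For example, something along these lines (the 262144-byte values are just placeholders; the point is whether the test-specific -s/-S socket buffer options appear at all):

  # let the kernel auto-tune the socket buffers (no -s/-S)
  netperf -t TCP_STREAM -H <hostname> -l 20

  # force fixed socket buffer sizes; on recent kernels this tends to do worse
  netperf -t TCP_STREAM -H <hostname> -l 20 -- -s 262144 -S 262144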
> The behavior depends on the setup. In one test we used cubic
> congestion control, flow control off. The transfer rate in one
> direction was above 0.9Gb/s while in the other direction it was 0.6
> to 0.8 Gb/s. After 15-20s the rates flipped. Perhaps the two streams
> are fighting for resources. (The performance of a full duplex stream
> should be close to 1Gb/s in both directions.) A graph of the
> transfer speed as a function of time is here:
> https://n0.aei.uni-hannover.de/networktest/node19-new20-noflow.jpg
> Red shows transmit and green shows receive (please ignore other
> plots):
One other thing you can try with e1000 is disabling the dynamic
interrupt moderation by loading the driver with
InterruptThrottleRate=8000,8000,... (the number of commas depends on
your number of ports) which might help in your particular benchmark.
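For example, assuming the driver is built as a module and you can take the interfaces down for a moment, something like:

  # reload e1000 with interrupt moderation pinned at 8000 ints/s per port
  modprobe -r e1000
  modprobe e1000 InterruptThrottleRate=8000,8000

(Or put the equivalent "options e1000 InterruptThrottleRate=8000,8000" line in your modprobe configuration so it survives reboots.)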
just for completeness can you post the dump of ethtool -e eth0 and lspci
-vvv?
Thanks,
Jesse
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: e1000 full-duplex TCP performance well below wire speed
2008-01-30 17:36 ` Brandeburg, Jesse
@ 2008-01-30 18:45 ` Rick Jones
2008-01-30 23:15 ` Bruce Allen
2008-01-31 11:35 ` Carsten Aulbert
2008-01-30 23:07 ` Bruce Allen
1 sibling, 2 replies; 32+ messages in thread
From: Rick Jones @ 2008-01-30 18:45 UTC (permalink / raw)
To: Brandeburg, Jesse
Cc: Bruce Allen, netdev, Carsten Aulbert, Henning Fehrmann,
Bruce Allen
> As asked in LKML thread, please post the exact netperf command used to
> start the client/server, whether or not you're using irqbalanced (aka
> irqbalance) and what cat /proc/interrupts looks like (you ARE using MSI,
> right?)
In particular, it would be good to know if you are doing two concurrent
streams, or if you are using the "burst mode" TCP_RR with large
request/response sizes method which then is only using one connection.
> I've recently discovered that particularly with the most recent kernels
> if you specify any socket options (-- -SX -sY) to netperf it does worse
> than if it just lets the kernel auto-tune.
That is the bit where explicit setsockopts are capped by core [rw]mem
sysctls but the autotuning is not correct?
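For reference, the limits in question are the ones visible via sysctl, e.g.:

  # explicit setsockopt() sizes are clamped to these core limits
  sysctl net.core.rmem_max net.core.wmem_max
  # autotuning instead works within the per-socket TCP ranges (min default max)
  sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem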
rick jones
BTW, a bit of netperf news - the "omni" (two routines to measure it all)
tests seem to be more or less working now in top of trunk netperf. It
of course still needs work/polish, but if folks would like to play with
them, I'd love the feedback. Output is a bit different from classic
netperf, and includes an option to emit the results as csv
(test-specific -o presently) rather than "human readable" (test-specific
-O). You get the omni stuff via ./configure --enable-omni and use
"omni" as the test name. No docs yet; for options and their effects
you need to look at scan_omni_args in src/nettest_omni.c.
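A rough sketch of the build and a first run (the host name is a placeholder; the binary lands in src/ in the netperf2 tree):

  # build netperf with the omni tests compiled in
  ./configure --enable-omni && make

  # run an omni test; test-specific -O is "human readable", -o emits csv
  src/netperf -t omni -H <hostname> -- -O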
One other addition in the omni tests is retrieving not just the initial
SO_*BUF sizes, but also the final SO_*BUF sizes so one can see where
autotuning took things just based on netperf output.
If the general consensus is that the overhead of the omni stuff isn't
too dear (there are more conditionals in the mainline than with classic
netperf) I will convert the classic netperf tests to use the omni code.
BTW, don't have a heart attack when you see the quantity of current csv
output - I do plan on being able to let the user specify what values
should be included :)
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: e1000 full-duplex TCP performance well below wire speed
2008-01-30 12:23 e1000 full-duplex TCP performance well below wire speed Bruce Allen
2008-01-30 17:36 ` Brandeburg, Jesse
@ 2008-01-30 19:17 ` Ben Greear
2008-01-30 22:33 ` Bruce Allen
1 sibling, 1 reply; 32+ messages in thread
From: Ben Greear @ 2008-01-30 19:17 UTC (permalink / raw)
To: Bruce Allen; +Cc: netdev, Carsten Aulbert, Henning Fehrmann, Bruce Allen
Bruce Allen wrote:
> (Pádraig Brady has suggested that I post this to Netdev. It was
> originally posted to LKML here: http://lkml.org/lkml/2008/1/30/141 )
>
>
> Dear NetDev,
>
> We've connected a pair of modern high-performance boxes with integrated
> copper Gb/s Intel NICs, with an ethernet crossover cable, and have run
> some netperf full duplex TCP tests. The transfer rates are well below
> wire speed. We're reporting this as a kernel bug, because we expect a
> vanilla kernel with default settings to give wire speed (or close to
> wire speed) performance in this case. We DO see wire speed in simplex
> transfers. The behavior has been verified on multiple machines with
> identical hardware.
Try using NICs in the pci-e slots. We have better
luck there, as you usually have more lanes and/or higher
quality NIC chipsets available in this case.
Try a UDP test to make sure the NIC can actually handle the throughput.
Look at the actual link usage as reported by the ethernet driver so that you
take all of the ACKs and other overhead into account (see the snippet below).
Try the same test using 10G hardware (CX4 NICs are quite affordable
these days, and we drove a 2-port 10G NIC based on the Intel ixgbe
chipset at around 4Gbps on two ports, full duplex, using pktgen).
As in around 16Gbps throughput across the busses. That may also give you an idea
if the bottleneck is hardware or software related.
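For the link-usage suggestion above, the driver-level byte counters can be snapshotted before and after a run with ethtool (assuming eth1 is the test interface); they include ACKs and all other on-the-wire overhead:

  ethtool -S eth1 | egrep 'rx_bytes|tx_bytes'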
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: e1000 full-duplex TCP performance well below wire speed
2008-01-30 19:17 ` Ben Greear
@ 2008-01-30 22:33 ` Bruce Allen
0 siblings, 0 replies; 32+ messages in thread
From: Bruce Allen @ 2008-01-30 22:33 UTC (permalink / raw)
To: Ben Greear; +Cc: netdev, Carsten Aulbert, Henning Fehrmann, Bruce Allen
Hi Ben,
Thank you for the suggestions and questions.
>> We've connected a pair of modern high-performance boxes with integrated
>> copper Gb/s Intel NICs, with an ethernet crossover cable, and have run some
>> netperf full duplex TCP tests. The transfer rates are well below wire
>> speed. We're reporting this as a kernel bug, because we expect a vanilla
>> kernel with default settings to give wire speed (or close to wire speed)
>> performance in this case. We DO see wire speed in simplex transfers. The
>> behavior has been verified on multiple machines with identical hardware.
>
> Try using NICs in the pci-e slots. We have better luck there, as you
> usually have more lanes and/or higher quality NIC chipsets available in
> this case.
It's a good idea. We can try this, though it will take a little time to
organize.
> Try a UDP test to make sure the NIC can actually handle the throughput.
I should have mentioned this in my original post -- we already did this.
We can run UDP wire speed full duplex (over 900 Mb/s in each direction, at
the same time). So the problem stems from TCP or is aggravated by TCP.
It's not a hardware limitation.
> Look at the actual link usage as reported by the ethernet driver so that
> you take all of the ACKS and other overhead into account.
OK. We'll report on this as soon as possible.
> Try the same test using 10G hardware (CX4 NICs are quite affordable
> these days, and we drove a 2-port 10G NIC based on the Intel ixgbe
> chipset at around 4Gbps on two ports, full duplex, using pktgen). As in
> around 16Gbps throughput across the busses. That may also give you an
> idea if the bottleneck is hardware or software related.
OK. That will take more time to organize.
Cheers,
Bruce
^ permalink raw reply [flat|nested] 32+ messages in thread
* RE: e1000 full-duplex TCP performance well below wire speed
2008-01-30 17:36 ` Brandeburg, Jesse
2008-01-30 18:45 ` Rick Jones
@ 2008-01-30 23:07 ` Bruce Allen
2008-01-31 5:43 ` Brandeburg, Jesse
2008-01-31 9:17 ` Andi Kleen
1 sibling, 2 replies; 32+ messages in thread
From: Bruce Allen @ 2008-01-30 23:07 UTC (permalink / raw)
To: Brandeburg, Jesse; +Cc: netdev, Carsten Aulbert, Henning Fehrmann, Bruce Allen
Hi Jesse,
It's good to be talking directly to one of the e1000 developers and
maintainers. Although at this point I am starting to think that the
issue may be TCP stack related and nothing to do with the NIC. Am I
correct that these are quite distinct parts of the kernel?
> The 82573L (a client NIC, regardless of the class of machine it is in)
> only has a x1 connection which does introduce some latency since the
> slot is only capable of about 2Gb/s data total, which includes overhead
> of descriptors and other transactions. As you approach the maximum of
> the slot it gets more and more difficult to get wire speed in a
> bidirectional test.
According to the Intel datasheet, the PCI-e x1 connection is 2Gb/s in each
direction. So we only need to get up to 50% of peak to saturate a
full-duplex wire-speed link. I hope that the overhead is not a factor of
two.
Important note: we ARE able to get full duplex wire speed (over 900 Mb/s
simultaneously in both directions) using UDP. The problems occur only with
TCP connections.
>> The test was done with various mtu sizes ranging from 1500 to 9000,
>> with ethernet flow control switched on and off, and using reno and
>> cubic as a TCP congestion control.
>
> As asked in LKML thread, please post the exact netperf command used to
> start the client/server, whether or not you're using irqbalanced (aka
> irqbalance) and what cat /proc/interrupts looks like (you ARE using MSI,
> right?)
I have to wait until Carsten or Henning wake up tomorrow (now 23:38 in
Germany). So we'll provide this info in ~10 hours.
I assume that the interrupt load is distributed among all four cores --
the default affinity is 0xff, and I also assume that there is some type of
interrupt aggregation taking place in the driver. If the CPUs were not
able to service the interrupts fast enough, I assume that we would also
see loss of performance with UDP testing.
> I've recently discovered that particularly with the most recent kernels
> if you specify any socket options (-- -SX -sY) to netperf it does worse
> than if it just lets the kernel auto-tune.
I am pretty sure that no socket options were specified, but again need to
wait until Carsten or Henning come back on-line.
>> The behavior depends on the setup. In one test we used cubic
>> congestion control, flow control off. The transfer rate in one
>> direction was above 0.9Gb/s while in the other direction it was 0.6
>> to 0.8 Gb/s. After 15-20s the rates flipped. Perhaps the two streams
>> are fighting for resources. (The performance of a full duplex stream
>> should be close to 1Gb/s in both directions.) A graph of the
>> transfer speed as a function of time is here:
>> https://n0.aei.uni-hannover.de/networktest/node19-new20-noflow.jpg
>> Red shows transmit and green shows receive (please ignore other
>> plots):
> One other thing you can try with e1000 is disabling the dynamic
> interrupt moderation by loading the driver with
> InterruptThrottleRate=8000,8000,... (the number of commas depends on
> your number of ports) which might help in your particular benchmark.
OK. Is 'dynamic interrupt moderation' another name for 'interrupt
aggregation'? Meaning that if more than one interrupt is generated in a
given time interval, then they are replaced by a single interrupt?
> just for completeness can you post the dump of ethtool -e eth0 and lspci
> -vvv?
Yup, we'll give that info also.
Thanks again!
Cheers,
Bruce
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: e1000 full-duplex TCP performance well below wire speed
2008-01-30 18:45 ` Rick Jones
@ 2008-01-30 23:15 ` Bruce Allen
2008-01-31 11:35 ` Carsten Aulbert
1 sibling, 0 replies; 32+ messages in thread
From: Bruce Allen @ 2008-01-30 23:15 UTC (permalink / raw)
To: Rick Jones
Cc: Brandeburg, Jesse, netdev, Carsten Aulbert, Henning Fehrmann,
Bruce Allen
Hi Rick,
First off, thanks for netperf. I've used it a lot and find it an extremely
useful tool.
>> As asked in LKML thread, please post the exact netperf command used to
>> start the client/server, whether or not you're using irqbalanced (aka
>> irqbalance) and what cat /proc/interrupts looks like (you ARE using MSI,
>> right?)
>
> In particular, it would be good to know if you are doing two concurrent
> streams, or if you are using the "burst mode" TCP_RR with large
> request/response sizes method which then is only using one connection.
I'm not sure -- must wait for Henning and Carsten to respond tomorrow.
Cheers,
Bruce
^ permalink raw reply [flat|nested] 32+ messages in thread
* RE: e1000 full-duplex TCP performance well below wire speed
2008-01-30 23:07 ` Bruce Allen
@ 2008-01-31 5:43 ` Brandeburg, Jesse
2008-01-31 8:31 ` Bruce Allen
` (2 more replies)
2008-01-31 9:17 ` Andi Kleen
1 sibling, 3 replies; 32+ messages in thread
From: Brandeburg, Jesse @ 2008-01-31 5:43 UTC (permalink / raw)
To: Bruce Allen; +Cc: netdev, Carsten Aulbert, Henning Fehrmann, Bruce Allen
Bruce Allen wrote:
> Hi Jesse,
>
> It's good to be talking directly to one of the e1000 developers and
> maintainers. Although at this point I am starting to think that the
> issue may be TCP stack related and nothing to do with the NIC. Am I
> correct that these are quite distinct parts of the kernel?
Yes, quite.
> Important note: we ARE able to get full duplex wire speed (over 900
> Mb/s simultaneously in both directions) using UDP. The problems occur
> only with TCP connections.
That eliminates bus bandwidth issues, probably, but small packets take
up a lot of extra descriptors, bus bandwidth, CPU, and cache resources.
>>> The test was done with various mtu sizes ranging from 1500 to 9000,
>>> with ethernet flow control switched on and off, and using reno and
>>> cubic as a TCP congestion control.
>>
>> As asked in LKML thread, please post the exact netperf command used
>> to start the client/server, whether or not you're using irqbalanced
>> (aka irqbalance) and what cat /proc/interrupts looks like (you ARE
>> using MSI, right?)
>
> I have to wait until Carsten or Henning wake up tomorrow (now 23:38 in
> Germany). So we'll provide this info in ~10 hours.
I would suggest you try TCP_RR with a command line something like this:
netperf -t TCP_RR -H <hostname> -C -c -- -b 4 -r 64K
I think you'll have to compile netperf with burst mode support enabled.
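If I remember the configure option correctly, building it looks roughly like this (check ./configure --help if the flag name has changed):

  # build netperf with the "burst mode" (-b) option compiled in
  ./configure --enable-burst && make

  # then run the suggested test
  src/netperf -t TCP_RR -H <hostname> -C -c -- -b 4 -r 64K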
> I assume that the interrupt load is distributed among all four cores
> -- the default affinity is 0xff, and I also assume that there is some
> type of interrupt aggregation taking place in the driver. If the
> CPUs were not able to service the interrupts fast enough, I assume
> that we would also see loss of performance with UDP testing.
>
>> One other thing you can try with e1000 is disabling the dynamic
>> interrupt moderation by loading the driver with
>> InterruptThrottleRate=8000,8000,... (the number of commas depends on
>> your number of ports) which might help in your particular benchmark.
>
> OK. Is 'dynamic interrupt moderation' another name for 'interrupt
> aggregation'? Meaning that if more than one interrupt is generated
> in a given time interval, then they are replaced by a single
> interrupt?
Yes, InterruptThrottleRate=8000 means there will be no more than 8000
ints/second from that adapter, and if interrupts are generated faster
than that they are "aggregated."
Interestingly since you are interested in ultra low latency, and may be
willing to give up some cpu for it during bulk transfers you should try
InterruptThrottleRate=1 (can generate up to 70000 ints/s)
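To see what rate you are actually getting, you can sample /proc/interrupts over a second, roughly like this (assuming eth1 appears on exactly one line; the awk grabs the CPU0 column, add the other columns if the load is spread out):

  before=$(awk '/eth1/ {print $2}' /proc/interrupts); sleep 1
  after=$(awk '/eth1/ {print $2}' /proc/interrupts)
  echo "$((after - before)) ints/s"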
>> just for completeness can you post the dump of ethtool -e eth0 and
>> lspci -vvv?
>
> Yup, we'll give that info also.
>
> Thanks again!
Welcome, it's an interesting discussion. Hope we can come to a good
conclusion.
Jesse
^ permalink raw reply [flat|nested] 32+ messages in thread
* RE: e1000 full-duplex TCP performance well below wire speed
2008-01-31 5:43 ` Brandeburg, Jesse
@ 2008-01-31 8:31 ` Bruce Allen
2008-01-31 18:08 ` Kok, Auke
2008-01-31 15:12 ` Carsten Aulbert
2008-01-31 15:18 ` Carsten Aulbert
2 siblings, 1 reply; 32+ messages in thread
From: Bruce Allen @ 2008-01-31 8:31 UTC (permalink / raw)
To: Brandeburg, Jesse; +Cc: netdev, Carsten Aulbert, Henning Fehrmann, Bruce Allen
Hi Jesse,
>> It's good to be talking directly to one of the e1000 developers and
>> maintainers. Although at this point I am starting to think that the
>> issue may be TCP stack related and nothing to do with the NIC. Am I
>> correct that these are quite distinct parts of the kernel?
>
> Yes, quite.
OK. I hope that there is also someone knowledgable about the TCP stack
who is following this thread. (Perhaps you also know this part of the
kernel, but I am assuming that your expertise is on the e1000/NIC bits.)
>> Important note: we ARE able to get full duplex wire speed (over 900
>> Mb/s simultaneously in both directions) using UDP. The problems occur
>> only with TCP connections.
>
> That eliminates bus bandwidth issues, probably, but small packets take
> up a lot of extra descriptors, bus bandwidth, CPU, and cache resources.
I see. Your concern is the extra ACK packets associated with TCP. Even
though these represent a small volume of data (around 5% with MTU=1500, and
less at larger MTU) they double the number of packets that must be handled
by the system compared to UDP transmission at the same data rate. Is that
correct?
>> I have to wait until Carsten or Henning wake up tomorrow (now 23:38 in
>> Germany). So we'll provide this info in ~10 hours.
>
> I would suggest you try TCP_RR with a command line something like this:
> netperf -t TCP_RR -H <hostname> -C -c -- -b 4 -r 64K
>
> I think you'll have to compile netperf with burst mode support enabled.
I just saw Carsten a few minutes ago. He has to take part in a
'Baubesprechung' meeting this morning, after which he will start answering
the technical questions and doing additional testing as suggested by you
and others. If you are on the US west coast, he should have some answers
and results posted by Thursday morning Pacific time.
>> I assume that the interrupt load is distributed among all four cores
>> -- the default affinity is 0xff, and I also assume that there is some
>> type of interrupt aggregation taking place in the driver. If the
>> CPUs were not able to service the interrupts fast enough, I assume
>> that we would also see loss of performance with UDP testing.
>>
>>> One other thing you can try with e1000 is disabling the dynamic
>>> interrupt moderation by loading the driver with
>>> InterruptThrottleRate=8000,8000,... (the number of commas depends on
>>> your number of ports) which might help in your particular benchmark.
>>
>> OK. Is 'dynamic interrupt moderation' another name for 'interrupt
>> aggregation'? Meaning that if more than one interrupt is generated
>> in a given time interval, then they are replaced by a single
>> interrupt?
>
> Yes, InterruptThrottleRate=8000 means there will be no more than 8000
> ints/second from that adapter, and if interrupts are generated faster
> than that they are "aggregated."
>
> Interestingly since you are interested in ultra low latency, and may be
> willing to give up some cpu for it during bulk transfers you should try
> InterruptThrottleRate=1 (can generate up to 70000 ints/s)
I'm not sure it's quite right to say that we are interested in ultra low
latency. Most of our network transfers involve bulk data movement (a few
MB or more). We don't care so much about low latency (meaning how long it
takes the FIRST byte of data to travel from sender to receiver). We care
about aggregate bandwidth: once the pipe is full, how fast can data be
moved through it. So we don't care so much if getting the pipe full takes
20 us or 50 us. We just want the data to flow fast once the pipe IS full.
> Welcome, its an interesting discussion. Hope we can come to a good
> conclusion.
Thank you. Carsten will post more info and answers later today.
Cheers,
Bruce
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: e1000 full-duplex TCP performance well below wire speed
2008-01-30 23:07 ` Bruce Allen
2008-01-31 5:43 ` Brandeburg, Jesse
@ 2008-01-31 9:17 ` Andi Kleen
2008-01-31 9:59 ` Bruce Allen
2008-01-31 16:09 ` Carsten Aulbert
1 sibling, 2 replies; 32+ messages in thread
From: Andi Kleen @ 2008-01-31 9:17 UTC (permalink / raw)
To: Bruce Allen
Cc: Brandeburg, Jesse, netdev, Carsten Aulbert, Henning Fehrmann,
Bruce Allen
Bruce Allen <ballen@gravity.phys.uwm.edu> writes:
>
> Important note: we ARE able to get full duplex wire speed (over 900
> Mb/s simultaneously in both directions) using UDP. The problems occur
> only with TCP connections.
Another issue with full duplex TCP not mentioned yet is that if TSO is used
the output will be somewhat bursty and might cause problems with the
TCP ACK clock of the other direction because the ACKs would need
to squeeze in between full TSO bursts.
You could try disabling TSO with ethtool.
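For example (eth1 here stands for whichever interface is under test):

  # show the current offload settings, then turn TSO off for the test
  ethtool -k eth1
  ethtool -K eth1 tso off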
-Andi
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: e1000 full-duplex TCP performance well below wire speed
2008-01-31 9:17 ` Andi Kleen
@ 2008-01-31 9:59 ` Bruce Allen
2008-01-31 16:09 ` Carsten Aulbert
1 sibling, 0 replies; 32+ messages in thread
From: Bruce Allen @ 2008-01-31 9:59 UTC (permalink / raw)
To: Andi Kleen
Cc: Brandeburg, Jesse, netdev, Carsten Aulbert, Henning Fehrmann,
Bruce Allen
Hi Andi!
>> Important note: we ARE able to get full duplex wire speed (over 900
>> Mb/s simultaneously in both directions) using UDP. The problems occur
>> only with TCP connections.
>
> Another issue with full duplex TCP not mentioned yet is that if TSO is used
> the output will be somewhat bursty and might cause problems with the
> TCP ACK clock of the other direction because the ACKs would need
> to squeeze in between full TSO bursts.
>
> You could try disabling TSO with ethtool.
Noted. We'll try this also.
Cheers,
Bruce
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: e1000 full-duplex TCP performance well below wire speed
2008-01-30 18:45 ` Rick Jones
2008-01-30 23:15 ` Bruce Allen
@ 2008-01-31 11:35 ` Carsten Aulbert
2008-01-31 17:55 ` Rick Jones
1 sibling, 1 reply; 32+ messages in thread
From: Carsten Aulbert @ 2008-01-31 11:35 UTC (permalink / raw)
To: Rick Jones
Cc: Brandeburg, Jesse, Bruce Allen, netdev, Henning Fehrmann,
Bruce Allen
Good morning (my TZ),
I'll try to answer all questions; however, if I miss something big,
please point my nose to it again.
Rick Jones wrote:
>> As asked in LKML thread, please post the exact netperf command used to
>> start the client/server, whether or not you're using irqbalanced (aka
>> irqbalance) and what cat /proc/interrupts looks like (you ARE using MSI,
>> right?)
>
netperf was used without any special tuning parameters. Usually we start
two processes on two hosts which start (almost) simultaneously, last for
20-60 seconds and simply use UDP_STREAM (works well) and TCP_STREAM, i.e.
on 192.168.0.202: netperf -H 192.168.2.203 -t TCP_STREAM -l 20
on 192.168.0.203: netperf -H 192.168.2.202 -t TCP_STREAM -l 20
192.168.0.20[23] here is on eth0, which cannot do jumbo frames, so we
use the 192.168.2.x addresses on eth1 for the tests over a range of MTUs.
The server is started on both nodes with the start-stop-daemon and no
special parameters I'm aware of.
/proc/interrupts shows PCI-MSI-edge, so I think YES.
> In particular, it would be good to know if you are doing two concurrent
> streams, or if you are using the "burst mode" TCP_RR with large
> request/response sizes method which then is only using one connection.
>
As outlined above: Two concurrent streams right now. If you think TCP_RR
should be better I'm happy to rerun some tests.
More in other emails.
I'll wade through them slowly.
Carsten
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: e1000 full-duplex TCP performance well below wire speed
2008-01-31 5:43 ` Brandeburg, Jesse
2008-01-31 8:31 ` Bruce Allen
@ 2008-01-31 15:12 ` Carsten Aulbert
2008-01-31 17:20 ` Brandeburg, Jesse
2008-01-31 18:03 ` Rick Jones
2008-01-31 15:18 ` Carsten Aulbert
2 siblings, 2 replies; 32+ messages in thread
From: Carsten Aulbert @ 2008-01-31 15:12 UTC (permalink / raw)
To: Brandeburg, Jesse; +Cc: Bruce Allen, netdev, Henning Fehrmann, Bruce Allen
[-- Attachment #1: Type: text/plain, Size: 5687 bytes --]
Hi all, slowly crawling through the mails.
Brandeburg, Jesse wrote:
>>>> The test was done with various mtu sizes ranging from 1500 to 9000,
>>>> with ethernet flow control switched on and off, and using reno and
>>>> cubic as a TCP congestion control.
>>> As asked in LKML thread, please post the exact netperf command used
>>> to start the client/server, whether or not you're using irqbalanced
>>> (aka irqbalance) and what cat /proc/interrupts looks like (you ARE
>>> using MSI, right?)
We are using MSI; /proc/interrupts looks like:
n0003:~# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:    6536963          0          0          0   IO-APIC-edge      timer
  1:          2          0          0          0   IO-APIC-edge      i8042
  3:          1          0          0          0   IO-APIC-edge      serial
  8:          0          0          0          0   IO-APIC-edge      rtc
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 14:      32321          0          0          0   IO-APIC-edge      libata
 15:          0          0          0          0   IO-APIC-edge      libata
 16:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb5
 18:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
 19:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 23:          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2
378:   17234866          0          0          0   PCI-MSI-edge      eth1
379:     129826          0          0          0   PCI-MSI-edge      eth0
NMI:          0          0          0          0
LOC:    6537181    6537326    6537149    6537052
ERR:          0
(sorry for the line break).
What we don't understand is why only core0 gets the interrupts, since
the affinity is set to f:
# cat /proc/irq/378/smp_affinity
f
Right now, irqbalance is not running, though I can give it a shot if
people think this will make a difference.
> I would suggest you try TCP_RR with a command line something like this:
> netperf -t TCP_RR -H <hostname> -C -c -- -b 4 -r 64K
I did that and the results can be found here:
https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTest
The results with netperf running like
netperf -t TCP_STREAM -H <host> -l 20
can be found here:
https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTestNetperf1
I reran the tests with
netperf -t <test> -H <host> -l 20 -c -C
or in the case of TCP_RR with the suggested burst settings -b 4 -r 64k
> Yes, InterruptThrottleRate=8000 means there will be no more than 8000
> ints/second from that adapter, and if interrupts are generated faster
> than that they are "aggregated."
>
> Interestingly since you are interested in ultra low latency, and may be
> willing to give up some cpu for it during bulk transfers you should try
> InterruptThrottleRate=1 (can generate up to 70000 ints/s)
>
On the web page you'll see that there are about 4000 interrupts/s for
most tests and up to 20,000/s for the TCP_RR test. Shall I change the
throttle rate?
>>> just for completeness can you post the dump of ethtool -e eth0 and
>>> lspci -vvv?
>> Yup, we'll give that info also.
n0002:~# ethtool -e eth1
Offset Values
------ ------
0x0000 00 30 48 93 94 2d 20 0d 46 f7 57 00 ff ff ff ff
0x0010 ff ff ff ff 6b 02 9a 10 d9 15 9a 10 86 80 df 80
0x0020 00 00 00 20 54 7e 00 00 00 10 da 00 04 00 00 27
0x0030 c9 6c 50 31 32 07 0b 04 84 29 00 00 00 c0 06 07
0x0040 08 10 00 00 04 0f ff 7f 01 4d ff ff ff ff ff ff
0x0050 14 00 1d 00 14 00 1d 00 af aa 1e 00 00 00 1d 00
0x0060 00 01 00 40 1e 12 ff ff ff ff ff ff ff ff ff ff
0x0070 ff ff ff ff ff ff ff ff ff ff ff ff ff ff cf 2f
lspci -vvv for this card:
0e:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
Subsystem: Super Micro Computer Inc Unknown device 109a
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 378
Region 0: Memory at ee200000 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at 5000 [size=32]
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
Address: 00000000fee0f00c Data: 41b9
Capabilities: [e0] Express Endpoint IRQ 0
Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag-
Device: Latency L0s <512ns, L1 <64us
Device: AtnBtn- AtnInd- PwrInd-
Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
Link: Supported Speed 2.5Gb/s, Width x1, ASPM unknown, Port 0
Link: Latency L0s <128ns, L1 <64us
Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
Link: Speed 2.5Gb/s, Width x1
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Device Serial Number 2d-94-93-ff-ff-48-30-00
(all lspci-vvv output as attachment)
Thanks a lot, open for suggestions
Carsten
[-- Attachment #2: lspci --]
[-- Type: text/plain, Size: 16537 bytes --]
00:00.0 Host bridge: Intel Corporation E7230 Memory Controller Hub (rev c0)
Subsystem: Super Micro Computer Inc Unknown device 8580
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ >SERR- <PERR-
Latency: 0
Capabilities: [e0] Vendor Specific Information
00:01.0 PCI bridge: Intel Corporation E7230 PCI Express Root Port (rev c0) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0, Cache Line Size: 32 bytes
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
I/O behind bridge: 0000f000-00000fff
Memory behind bridge: fff00000-000fffff
Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity+ SERR- NoISA+ VGA- MAbort- >Reset+ FastB2B-
Capabilities: [88] Subsystem: Super Micro Computer Inc Unknown device 8580
Capabilities: [80] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [90] Message Signalled Interrupts: Mask- 64bit- Queue=0/0 Enable+
Address: fee0f00c Data: 4159
Capabilities: [a0] Express Root Port (Slot+) IRQ 0
Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag-
Device: Latency L0s <64ns, L1 <1us
Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
Device: MaxPayload 128 bytes, MaxReadReq 128 bytes
Link: Supported Speed 2.5Gb/s, Width x8, ASPM L0s, Port 2
Link: Latency L0s <1us, L1 <4us
Link: ASPM Disabled RCB 64 bytes Disabled CommClk- ExtSynch-
Link: Speed 2.5Gb/s, Width x0
Slot: AtnBtn- PwrCtrl- MRL- AtnInd- PwrInd- HotPlug- Surpise-
Slot: Number 0, PowerLimit 0.000000
Slot: Enabled AtnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq-
Slot: AttnInd Off, PwrInd On, Power-
Root: Correctable- Non-Fatal- Fatal- PME-
Capabilities: [100] Virtual Channel
Capabilities: [140] Unknown (5)
00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 1 (rev 01) (prog-if 00 [Normal decode])
Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0, Cache Line Size: 32 bytes
Bus: primary=00, secondary=09, subordinate=09, sec-latency=0
I/O behind bridge: 0000f000-00000fff
Memory behind bridge: fff00000-000fffff
Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
Capabilities: [40] Express Root Port (Slot+) IRQ 0
Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag-
Device: Latency L0s unlimited, L1 unlimited
Device: Errors: Correctable- Non-Fatal- Fatal+ Unsupported-
Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
Device: MaxPayload 128 bytes, MaxReadReq 128 bytes
Link: Supported Speed 2.5Gb/s, Width x4, ASPM L0s L1, Port 1
Link: Latency L0s <256ns, L1 <4us
Link: ASPM Disabled RCB 64 bytes CommClk+ ExtSynch-
Link: Speed 2.5Gb/s, Width x0
Slot: AtnBtn- PwrCtrl- MRL- AtnInd- PwrInd- HotPlug+ Surpise+
Slot: Number 0, PowerLimit 0.000000
Slot: Enabled AtnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq-
Slot: AttnInd Unknown, PwrInd Unknown, Power-
Root: Correctable- Non-Fatal- Fatal- PME-
Capabilities: [80] Message Signalled Interrupts: Mask- 64bit- Queue=0/0 Enable+
Address: fee0f00c Data: 4161
Capabilities: [90] Subsystem: Super Micro Computer Inc Unknown device 8580
Capabilities: [a0] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100] Virtual Channel
Capabilities: [180] Unknown (5)
00:1c.4 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 5 (rev 01) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0, Cache Line Size: 32 bytes
Bus: primary=00, secondary=0d, subordinate=0d, sec-latency=0
I/O behind bridge: 00004000-00004fff
Memory behind bridge: ee100000-ee1fffff
Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR- NoISA+ VGA- MAbort- >Reset- FastB2B-
Capabilities: [40] Express Root Port (Slot+) IRQ 0
Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag-
Device: Latency L0s unlimited, L1 unlimited
Device: Errors: Correctable- Non-Fatal- Fatal+ Unsupported-
Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
Device: MaxPayload 128 bytes, MaxReadReq 128 bytes
Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s L1, Port 5
Link: Latency L0s <256ns, L1 <4us
Link: ASPM Disabled RCB 64 bytes CommClk+ ExtSynch-
Link: Speed 2.5Gb/s, Width x1
Slot: AtnBtn- PwrCtrl- MRL- AtnInd- PwrInd- HotPlug+ Surpise+
Slot: Number 5, PowerLimit 10.000000
Slot: Enabled AtnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq-
Slot: AttnInd Unknown, PwrInd Unknown, Power-
Root: Correctable- Non-Fatal- Fatal- PME-
Capabilities: [80] Message Signalled Interrupts: Mask- 64bit- Queue=0/0 Enable+
Address: fee0f00c Data: 4169
Capabilities: [90] Subsystem: Super Micro Computer Inc Unknown device 8580
Capabilities: [a0] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100] Virtual Channel
Capabilities: [180] Unknown (5)
00:1c.5 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 6 (rev 01) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0, Cache Line Size: 32 bytes
Bus: primary=00, secondary=0e, subordinate=0e, sec-latency=0
I/O behind bridge: 00005000-00005fff
Memory behind bridge: ee200000-ee2fffff
Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR- NoISA+ VGA- MAbort- >Reset- FastB2B-
Capabilities: [40] Express Root Port (Slot+) IRQ 0
Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag-
Device: Latency L0s unlimited, L1 unlimited
Device: Errors: Correctable- Non-Fatal- Fatal+ Unsupported-
Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
Device: MaxPayload 128 bytes, MaxReadReq 128 bytes
Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s L1, Port 6
Link: Latency L0s <256ns, L1 <4us
Link: ASPM Disabled RCB 64 bytes CommClk+ ExtSynch-
Link: Speed 2.5Gb/s, Width x1
Slot: AtnBtn- PwrCtrl- MRL- AtnInd- PwrInd- HotPlug+ Surpise+
Slot: Number 6, PowerLimit 10.000000
Slot: Enabled AtnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq-
Slot: AttnInd Unknown, PwrInd Unknown, Power-
Root: Correctable- Non-Fatal- Fatal- PME-
Capabilities: [80] Message Signalled Interrupts: Mask- 64bit- Queue=0/0 Enable+
Address: fee0f00c Data: 4171
Capabilities: [90] Subsystem: Super Micro Computer Inc Unknown device 8580
Capabilities: [a0] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100] Virtual Channel
Capabilities: [180] Unknown (5)
00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #1 (rev 01) (prog-if 00 [UHCI])
Subsystem: Super Micro Computer Inc Unknown device 8580
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin A routed to IRQ 23
Region 4: I/O ports at 3000 [size=32]
00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #2 (rev 01) (prog-if 00 [UHCI])
Subsystem: Super Micro Computer Inc Unknown device 8580
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin B routed to IRQ 19
Region 4: I/O ports at 3020 [size=32]
00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #3 (rev 01) (prog-if 00 [UHCI])
Subsystem: Super Micro Computer Inc Unknown device 8580
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin C routed to IRQ 18
Region 4: I/O ports at 3040 [size=32]
00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #4 (rev 01) (prog-if 00 [UHCI])
Subsystem: Super Micro Computer Inc Unknown device 8580
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin D routed to IRQ 16
Region 4: I/O ports at 3060 [size=32]
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI Controller (rev 01) (prog-if 20 [EHCI])
Subsystem: Super Micro Computer Inc Unknown device 8580
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin A routed to IRQ 23
Region 0: Memory at ee000000 (32-bit, non-prefetchable) [size=1K]
Capabilities: [50] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [58] Debug port
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1) (prog-if 01 [Subtractive decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Bus: primary=00, secondary=0f, subordinate=0f, sec-latency=32
I/O behind bridge: 00006000-00006fff
Memory behind bridge: ee300000-ee3fffff
Prefetchable memory behind bridge: 00000000ef000000-00000000efffffff
Secondary status: 66MHz- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
BridgeCtl: Parity- SERR- NoISA+ VGA+ MAbort- >Reset- FastB2B-
Capabilities: [50] Subsystem: Super Micro Computer Inc Unknown device 8580
00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01)
Subsystem: Super Micro Computer Inc Unknown device 8580
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Capabilities: [e0] Vendor Specific Information
00:1f.2 IDE interface: Intel Corporation 82801GB/GR/GH (ICH7 Family) Serial ATA Storage Controller IDE (rev 01) (prog-if 8a [Master SecP PriP])
Subsystem: Super Micro Computer Inc Unknown device 8580
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin B routed to IRQ 19
Region 0: I/O ports at 01f0 [size=8]
Region 1: I/O ports at 03f4 [size=1]
Region 2: I/O ports at 0170 [size=8]
Region 3: I/O ports at 0374 [size=1]
Region 4: I/O ports at 30a0 [size=16]
Region 5: Memory at f5000000 (32-bit, non-prefetchable) [size=1K]
Capabilities: [70] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 01)
Subsystem: Super Micro Computer Inc Unknown device 8580
Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Interrupt: pin B routed to IRQ 19
Region 4: I/O ports at 1100 [size=32]
0d:00.0 Ethernet controller: Intel Corporation 82573E Gigabit Ethernet Controller (Copper) (rev 03)
Subsystem: Super Micro Computer Inc Unknown device 108c
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 379
Region 0: Memory at ee100000 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at 4000 [size=32]
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
Address: 00000000fee0f00c Data: 41b1
Capabilities: [e0] Express Endpoint IRQ 0
Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag-
Device: Latency L0s <512ns, L1 <64us
Device: AtnBtn- AtnInd- PwrInd-
Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
Link: Supported Speed 2.5Gb/s, Width x1, ASPM unknown, Port 0
Link: Latency L0s <128ns, L1 <64us
Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
Link: Speed 2.5Gb/s, Width x1
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Device Serial Number 2c-94-93-ff-ff-48-30-00
0e:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
Subsystem: Super Micro Computer Inc Unknown device 109a
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 378
Region 0: Memory at ee200000 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at 5000 [size=32]
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
Address: 00000000fee0f00c Data: 41b9
Capabilities: [e0] Express Endpoint IRQ 0
Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag-
Device: Latency L0s <512ns, L1 <64us
Device: AtnBtn- AtnInd- PwrInd-
Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
Link: Supported Speed 2.5Gb/s, Width x1, ASPM unknown, Port 0
Link: Latency L0s <128ns, L1 <64us
Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
Link: Speed 2.5Gb/s, Width x1
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Device Serial Number 2d-94-93-ff-ff-48-30-00
0f:00.0 VGA compatible controller: XGI - Xabre Graphics Inc Volari Z7 (prog-if 00 [VGA])
Subsystem: Super Micro Computer Inc Unknown device 8580
Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
BIST result: 00
Region 0: Memory at ef000000 (32-bit, prefetchable) [size=16M]
Region 1: Memory at ee300000 (32-bit, non-prefetchable) [size=256K]
Region 2: I/O ports at 6000 [size=128]
Capabilities: [40] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: e1000 full-duplex TCP performance well below wire speed
2008-01-31 5:43 ` Brandeburg, Jesse
2008-01-31 8:31 ` Bruce Allen
2008-01-31 15:12 ` Carsten Aulbert
@ 2008-01-31 15:18 ` Carsten Aulbert
2 siblings, 0 replies; 32+ messages in thread
From: Carsten Aulbert @ 2008-01-31 15:18 UTC (permalink / raw)
To: Brandeburg, Jesse; +Cc: Bruce Allen, netdev, Henning Fehrmann, Bruce Allen
Brief question I forgot to ask:
Right now we are using the "old" version 7.3.20-k2. To save some effort
on your end, shall we upgrade this to 7.6.15 or should our version be
good enough?
Thanks
Carsten
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: e1000 full-duplex TCP performance well below wire speed
2008-01-31 9:17 ` Andi Kleen
2008-01-31 9:59 ` Bruce Allen
@ 2008-01-31 16:09 ` Carsten Aulbert
2008-01-31 18:15 ` Kok, Auke
1 sibling, 1 reply; 32+ messages in thread
From: Carsten Aulbert @ 2008-01-31 16:09 UTC (permalink / raw)
To: Andi Kleen
Cc: Bruce Allen, Brandeburg, Jesse, netdev, Henning Fehrmann,
Bruce Allen
Hi Andi,
Andi Kleen wrote:
> Another issue with full duplex TCP not mentioned yet is that if TSO is used
> the output will be somewhat bursty and might cause problems with the
> TCP ACK clock of the other direction because the ACKs would need
> to squeeze in between full TSO bursts.
>
> You could try disabling TSO with ethtool.
I just tried that:
https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTestNetperf3
It seems that the numbers do get better (the sweet spot seems to be MTU 6000,
with 914 MBit/s and 927 MBit/s); however, for other settings the results
vary a lot, so I'm not sure how large the statistical fluctuations are.
Next test I'll try if it makes sense to enlarge the ring buffers.
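Presumably that means something like the following (4096 is only an example value and must not exceed the maximums that ethtool -g reports):

  ethtool -g eth1                   # show current and maximum RX/TX ring sizes
  ethtool -G eth1 rx 4096 tx 4096   # enlarge both rings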
Thanks
Carsten
^ permalink raw reply [flat|nested] 32+ messages in thread
* RE: e1000 full-duplex TCP performance well below wire speed
2008-01-31 15:12 ` Carsten Aulbert
@ 2008-01-31 17:20 ` Brandeburg, Jesse
2008-01-31 17:27 ` Carsten Aulbert
2008-01-31 18:03 ` Rick Jones
1 sibling, 1 reply; 32+ messages in thread
From: Brandeburg, Jesse @ 2008-01-31 17:20 UTC (permalink / raw)
To: Carsten Aulbert; +Cc: Bruce Allen, netdev, Henning Fehrmann, Bruce Allen
Carsten Aulbert wrote:
> We are using MSI; /proc/interrupts looks like:
> n0003:~# cat /proc/interrupts
> 378:   17234866          0          0          0   PCI-MSI-edge      eth1
> 379:     129826          0          0          0   PCI-MSI-edge      eth0
> (sorry for the line break).
>
> What we don't understand is why only core0 gets the interrupts, since
> the affinity is set to f:
> # cat /proc/irq/378/smp_affinity
> f
without CONFIG_IRQBALANCE set, and no irqbalance daemon running, this is
expected. Seems it is also dependent upon your system hardware.
> Right now, irqbalance is not running, though I can give it a shot if
> people think this will make a difference.
probably won't make much of a difference if you only have a single
interrupt source generating interrupts. If you are using both adapters
simultaneously, please use smp_affinity or turn on irqbalance.
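Setting the affinity by hand would look something like this, using the IRQ numbers from your /proc/interrupts above (the particular CPUs chosen are just an example):

  # pin eth1 (IRQ 378) to CPU1 and eth0 (IRQ 379) to CPU2
  echo 2 > /proc/irq/378/smp_affinity
  echo 4 > /proc/irq/379/smp_affinity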
>> I would suggest you try TCP_RR with a command line something like
>> this: netperf -t TCP_RR -H <hostname> -C -c -- -b 4 -r 64K
>
> I did that and the results can be found here:
> https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTest
Seems something went wrong and all you ran was the 1-byte tests, when
it should have been 64K in both directions (request/response).
> The results with netperf running like
> netperf -t TCP_STREAM -H <host> -l 20
> can be found here:
> https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTestNetperf1
> I reran the tests with
> netperf -t <test> -H <host> -l 20 -c -C
> or in the case of TCP_RR with the suggested burst settings -b 4 -r 64k
I get:
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to foo (134.134.3.121) port 0 AF_INET : first burst 4
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem    S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local    remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr    us/Tr

16384  87380  65536   65536  10.00   1565.34  14.17  27.18  362.220  347.243
16384  87380
>> Yes, InterruptThrottleRate=8000 means there will be no more than 8000
>> ints/second from that adapter, and if interrupts are generated faster
>> than that they are "aggregated."
>>
>> Interestingly since you are interested in ultra low latency, and may
>> be willing to give up some cpu for it during bulk transfers you
>> should try InterruptThrottleRate=1 (can generate up to 70000 ints/s)
>>
>
> On the web page you'll see that there are about 4000 interrupts/s for
> most tests and up to 20,000/s for the TCP_RR test. Shall I change the
> throttle rate?
That's the auto-tuning. I suggest just InterruptThrottleRate=4000 or
8000 if all you're concerned about is bulk traffic performance.
>>>> just for completeness can you post the dump of ethtool -e eth0 and
>>>> lspci -vvv?
>>> Yup, we'll give that info also.
>
> n0002:~# ethtool -e eth1
> Offset Values
> ------ ------
> 0x0000 00 30 48 93 94 2d 20 0d 46 f7 57 00 ff ff ff ff
> 0x0010 ff ff ff ff 6b 02 9a 10 d9 15 9a 10 86 80 df 80
> 0x0020 00 00 00 20 54 7e 00 00 00 10 da 00 04 00 00 27
> 0x0030 c9 6c 50 31 32 07 0b 04 84 29 00 00 00 c0 06 07
> 0x0040 08 10 00 00 04 0f ff 7f 01 4d ff ff ff ff ff ff
> 0x0050 14 00 1d 00 14 00 1d 00 af aa 1e 00 00 00 1d 00
> 0x0060 00 01 00 40 1e 12 ff ff ff ff ff ff ff ff ff ff
> 0x0070 ff ff ff ff ff ff ff ff ff ff ff ff ff ff cf 2f
this looks fine.
> lspci -vvv for this card:
> 0e:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
> Subsystem: Super Micro Computer Inc Unknown device 109a
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
> Latency: 0, Cache Line Size: 64 bytes
> Interrupt: pin A routed to IRQ 378
> Region 0: Memory at ee200000 (32-bit, non-prefetchable) [size=128K]
> Region 2: I/O ports at 5000 [size=32]
> Capabilities: [c8] Power Management version 2
> Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
> Status: D0 PME-Enable- DSel=0 DScale=1 PME-
> Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
> Address: 00000000fee0f00c Data: 41b9
> Capabilities: [e0] Express Endpoint IRQ 0
> Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag-
> Device: Latency L0s <512ns, L1 <64us
> Device: AtnBtn- AtnInd- PwrInd-
> Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
> Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
> Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
> Link: Supported Speed 2.5Gb/s, Width x1, ASPM unknown, Port 0
> Link: Latency L0s <128ns, L1 <64us
> Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
> Link: Speed 2.5Gb/s, Width x1
> Capabilities: [100] Advanced Error Reporting
> Capabilities: [140] Device Serial Number 2d-94-93-ff-ff-48-30-00
This also looks good: no ASPM, MSI enabled.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: e1000 full-duplex TCP performance well below wire speed
2008-01-31 17:20 ` Brandeburg, Jesse
@ 2008-01-31 17:27 ` Carsten Aulbert
2008-01-31 17:33 ` Brandeburg, Jesse
2008-01-31 18:11 ` running aggregate netperf TCP_RR " Rick Jones
0 siblings, 2 replies; 32+ messages in thread
From: Carsten Aulbert @ 2008-01-31 17:27 UTC (permalink / raw)
To: Brandeburg, Jesse; +Cc: Bruce Allen, netdev, Henning Fehrmann, Bruce Allen
Hi all,
Brandeburg, Jesse wrote:
>>> I would suggest you try TCP_RR with a command line something like
>>> this: netperf -t TCP_RR -H <hostname> -C -c -- -b 4 -r 64K
>> I did that and the results can be found here:
>> https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTest
>
> seems something went wrong and all you ran was the 1 byte tests, where
> it should have been 64K both directions (request/response).
>
Yes, shell-quoting got me there. I'll re-run the tests, so please don't
look at the TCP_RR results too closely. I think I'll be able to run
maybe one or two more tests today; the rest will follow tomorrow.
Thanks for bearing with me
Carsten
PS: Am I right that the TCP_RR tests should only be run on a single node
at a time, not on both ends simultaneously?
^ permalink raw reply [flat|nested] 32+ messages in thread
* RE: e1000 full-duplex TCP performance well below wire speed
2008-01-31 17:27 ` Carsten Aulbert
@ 2008-01-31 17:33 ` Brandeburg, Jesse
2008-01-31 18:11 ` running aggregate netperf TCP_RR " Rick Jones
1 sibling, 0 replies; 32+ messages in thread
From: Brandeburg, Jesse @ 2008-01-31 17:33 UTC (permalink / raw)
To: Carsten Aulbert; +Cc: Bruce Allen, netdev, Henning Fehrmann, Bruce Allen
Carsten Aulbert wrote:
> PS: Am I right that the TCP_RR tests should only be run on a single
> node at a time, not on both ends simultaneously?
Yes. It is a request/response test, so perform the bidirectional
test with a single node starting the test.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: e1000 full-duplex TCP performance well below wire speed
2008-01-31 11:35 ` Carsten Aulbert
@ 2008-01-31 17:55 ` Rick Jones
2008-02-01 19:57 ` Carsten Aulbert
0 siblings, 1 reply; 32+ messages in thread
From: Rick Jones @ 2008-01-31 17:55 UTC (permalink / raw)
To: Carsten Aulbert
Cc: Brandeburg, Jesse, Bruce Allen, netdev, Henning Fehrmann,
Bruce Allen
> netperf was used without any special tuning parameters. Usually we start
> two processes on two hosts which start (almost) simultaneously, last for
> 20-60 seconds and simply use UDP_STREAM (works well) and TCP_STREAM, i.e.
>
> on 192.168.0.202: netperf -H 192.168.2.203 -t TCP_STREAM -l 20
> on 192.168.0.203: netperf -H 192.168.2.202 -t TCP_STREAM -l 20
>
> 192.168.0.20[23] here is on eth0 which cannot do jumbo frames, thus we
> use the .2. part for eth1 for a range of mtus.
>
> The server is started on both nodes with the start-stop-daemon and no
> special parameters I'm aware of.
So long as you are relying on external (netperf-relative) means to
report the throughput, those command lines would be fine. I wouldn't be
comfortable relying on the sum of the netperf-reported throughputs with
those command lines though. Netperf2 has no test synchronization, so two
separate commands, particularly those initiated on different systems,
are subject to skew errors. 99 times out of ten they might be epsilon,
but I get a _little_ paranoid there.
There are three alternatives:
1) use netperf4. not as convenient for "quick" testing at present, but
it has explicit test synchronization, so you "know" that the numbers
presented are from when all connections were actively transferring data
2) use the aforementioned "burst" TCP_RR test. This is then a single
netperf with data flowing both ways on a single connection so no issue
of skew, but perhaps an issue of being one connection and so one process
on each end.
3) start both tests from the same system and follow the suggestions
contained in :
<http://www.netperf.org/svn/netperf2/tags/netperf-2.4.4/doc/netperf.html>
particularly:
<http://www.netperf.org/svn/netperf2/tags/netperf-2.4.4/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance>
and use a combination of TCP_STREAM and TCP_MAERTS (STREAM backwards) tests.
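A rough sketch of option 3, assuming a netserver is already running on the
remote host and both tests are launched from the local one (addresses and the
60-second duration are placeholders):

  netperf -H 192.168.2.203 -t TCP_STREAM -l 60 &
  netperf -H 192.168.2.203 -t TCP_MAERTS -l 60 &
  wait   # each instance prints its own throughput when it finishes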
happy benchmarking,
rick jones
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: e1000 full-duplex TCP performance well below wire speed
2008-01-31 15:12 ` Carsten Aulbert
2008-01-31 17:20 ` Brandeburg, Jesse
@ 2008-01-31 18:03 ` Rick Jones
1 sibling, 0 replies; 32+ messages in thread
From: Rick Jones @ 2008-01-31 18:03 UTC (permalink / raw)
To: Carsten Aulbert
Cc: Brandeburg, Jesse, Bruce Allen, netdev, Henning Fehrmann,
Bruce Allen
Carsten Aulbert wrote:
> Hi all, slowly crawling through the mails.
>
> Brandeburg, Jesse wrote:
>
>>>>> The test was done with various mtu sizes ranging from 1500 to 9000,
>>>>> with ethernet flow control switched on and off, and using reno and
>>>>> cubic as a TCP congestion control.
>>>>
>>>> As asked in LKML thread, please post the exact netperf command used
>>>> to start the client/server, whether or not you're using irqbalanced
>>>> (aka irqbalance) and what cat /proc/interrupts looks like (you ARE
>>>> using MSI, right?)
>
>
> We are using MSI, /proc/interrupts look like:
> n0003:~# cat /proc/interrupts
> CPU0 CPU1 CPU2 CPU3
> 0: 6536963 0 0 0 IO-APIC-edge timer
> 1: 2 0 0 0 IO-APIC-edge i8042
> 3: 1 0 0 0 IO-APIC-edge serial
> 8: 0 0 0 0 IO-APIC-edge rtc
> 9: 0 0 0 0 IO-APIC-fasteoi acpi
> 14: 32321 0 0 0 IO-APIC-edge libata
> 15: 0 0 0 0 IO-APIC-edge libata
> 16: 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb5
> 18: 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb4
> 19: 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb3
> 23: 0 0 0 0 IO-APIC-fasteoi ehci_hcd:usb1, uhci_hcd:usb2
> 378: 17234866 0 0 0 PCI-MSI-edge eth1
> 379: 129826 0 0 0 PCI-MSI-edge eth0
> NMI: 0 0 0 0
> LOC: 6537181 6537326 6537149 6537052
> ERR: 0
>
> (sorry for the line break).
>
> What we don't understand is why only core0 gets the interrupts, since
> the affinity is set to f:
> # cat /proc/irq/378/smp_affinity
> f
>
> Right now, irqbalance is not running, though I can give it a shot if
> people think this will make a difference.
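For what it's worth, with irqbalance off and a wide-open mask the MSI
interrupts tend to land on a single core, as seen above; to steer them by hand
you can write a single-CPU mask (IRQ 378 is taken from the dump above, the CPU
choice is arbitrary):

  echo 2 > /proc/irq/378/smp_affinity   # bitmask 2 = CPU1
  grep eth1 /proc/interrupts            # verify where the counts now increase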
>
>> I would suggest you try TCP_RR with a command line something like this:
>> netperf -t TCP_RR -H <hostname> -C -c -- -b 4 -r 64K
>
>
> I did that and the results can be found here:
> https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTest
For convenience, 2.4.4 (perhaps earlier; I can never remember when I've
added things :) allows the output format for a TCP_RR test to be set to
the same as a _STREAM or _MAERTS test. And if you add a -v 2 you
will get the "each way" values and the average round-trip latency:
raj@tardy:~/netperf2_trunk$ src/netperf -t TCP_RR -H oslowest.cup -f m
-v 2 -- -r 64K -b 4
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
oslowest.cup.hp.com (16.89.84.17) port 0 AF_INET : first burst 4
Local /Remote
Socket Size Request Resp. Elapsed
Send Recv Size Size Time Throughput
bytes Bytes bytes bytes secs. 10^6bits/sec
16384 87380 65536 65536 10.01 105.63
16384 87380
Alignment Offset RoundTrip Trans Throughput
Local Remote Local Remote Latency Rate 10^6bits/s
Send Recv Send Recv usec/Tran per sec Outbound Inbound
8 0 0 0 49635.583 100.734 52.814 52.814
raj@tardy:~/netperf2_trunk$
(this was a WAN test :)
rick jones
One of these days I may tweak netperf further so that if the CPU utilization
method for either end doesn't require calibration, CPU utilization will
always be done on that end. People's thoughts on that tweak would be
most welcome...
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: e1000 full-duplex TCP performance well below wire speed
2008-01-31 8:31 ` Bruce Allen
@ 2008-01-31 18:08 ` Kok, Auke
2008-01-31 18:38 ` Rick Jones
2008-01-31 19:13 ` Bruce Allen
0 siblings, 2 replies; 32+ messages in thread
From: Kok, Auke @ 2008-01-31 18:08 UTC (permalink / raw)
To: Bruce Allen
Cc: Brandeburg, Jesse, netdev, Carsten Aulbert, Henning Fehrmann,
Bruce Allen
Bruce Allen wrote:
> Hi Jesse,
>
>>> It's good to be talking directly to one of the e1000 developers and
>>> maintainers. Although at this point I am starting to think that the
>>> issue may be TCP stack related and nothing to do with the NIC. Am I
>>> correct that these are quite distinct parts of the kernel?
>>
>> Yes, quite.
>
> OK. I hope that there is also someone knowledgable about the TCP stack
> who is following this thread. (Perhaps you also know this part of the
> kernel, but I am assuming that your expertise is on the e1000/NIC bits.)
>
>>> Important note: we ARE able to get full duplex wire speed (over 900
>>> Mb/s simultaneously in both directions) using UDP. The problems occur
>>> only with TCP connections.
>>
>> That eliminates bus bandwidth issues, probably, but small packets take
>> up a lot of extra descriptors, bus bandwidth, CPU, and cache resources.
>
> I see. Your concern is the extra ACK packets associated with TCP. Even
> though these represent a small volume of data (around 5% with MTU=1500,
> and less at larger MTU), they double the number of packets that must be
> handled by the system compared to UDP transmission at the same data
> rate. Is that correct?
A lot of people tend to forget that the PCI Express bus has enough bandwidth at
first glance - 2.5Gbit/s for 1Gbit of traffic - but apart from the data going over it
there is significant overhead: each packet requires transmit, cleanup and
buffer transactions, and there are many irq register clears per second (slow
ioreads/writes). The transactions double for TCP ACK processing, and this all
accumulates and starts to introduce latency, higher CPU utilization, etc.
Auke
^ permalink raw reply [flat|nested] 32+ messages in thread
* running aggregate netperf TCP_RR Re: e1000 full-duplex TCP performance well below wire speed
2008-01-31 17:27 ` Carsten Aulbert
2008-01-31 17:33 ` Brandeburg, Jesse
@ 2008-01-31 18:11 ` Rick Jones
1 sibling, 0 replies; 32+ messages in thread
From: Rick Jones @ 2008-01-31 18:11 UTC (permalink / raw)
To: Carsten Aulbert
Cc: Brandeburg, Jesse, Bruce Allen, netdev, Henning Fehrmann,
Bruce Allen
> PS: Am I right that the TCP_RR tests should only be run on a single node
> at a time, not on both ends simultaneously?
It depends on what you want to measure. In this specific case, since the
goal is to saturate the link in both directions, it is unlikely you will
need a second instance running, and if you do, going to a
TCP_STREAM+TCP_MAERTS pair might be indicated.
If one is measuring aggregate small transaction (perhaps packet)
performance, then there can be times when running multiple, concurrent,
aggregate TCP_RR tests is indicated.
Also, from time to time you may want to experiment with the value you
use with -b - the value necessary to get to saturation may not always be
the same - particularly as you switch from link to link and from LAN to
WAN and all those familiar bandwidthXdelay considerations.
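A quick way to explore that is to sweep the burst size; a rough sketch (the
peer address is a placeholder, and -b needs a netperf built with burst mode
enabled):

  host=192.168.2.203              # remote end running netserver
  for b in 1 2 4 8 16 32; do
      netperf -t TCP_RR -H $host -f m -v 2 -- -b $b -r 64K
  done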
happy benchmarking,
rick jones
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: e1000 full-duplex TCP performance well below wire speed
2008-01-31 16:09 ` Carsten Aulbert
@ 2008-01-31 18:15 ` Kok, Auke
0 siblings, 0 replies; 32+ messages in thread
From: Kok, Auke @ 2008-01-31 18:15 UTC (permalink / raw)
To: Carsten Aulbert
Cc: Andi Kleen, Bruce Allen, Brandeburg, Jesse, netdev,
Henning Fehrmann, Bruce Allen
Carsten Aulbert wrote:
> Hi Andi,
>
> Andi Kleen wrote:
>> Another issue with full duplex TCP not mentioned yet is that if TSO is
>> used the output will be somewhat bursty and might cause problems with
>> the TCP ACK clock of the other direction because the ACKs would need
>> to squeeze in between full TSO bursts.
>>
>> You could try disabling TSO with ethtool.
>
> I just tried that:
>
> https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTestNetperf3
>
> It seems that the numbers do get better (the sweet spot seems to be MTU 6000
> with 914 MBit/s and 927 MBit/s); however, for other settings the results
> vary a lot, so I'm not sure how large the statistical fluctuations are.
>
> Next test I'll try if it makes sense to enlarge the ring buffers.
Sometimes it may help if the system (CPU) is laggy or busy a lot, so that the card
has more buffers available (and can thus go longer without servicing).
Usually (if your system responds quickly) it's better to use *smaller* ring sizes,
as this reduces cache footprint; hence the small default value.
So, unless the ethtool -S ethX output indicates that your system is too busy
(rx_no_buffer_count increases), I would not recommend increasing the ring size.
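A rough sketch of that check, plus the ring-size change only if it turns out
to be needed (eth1 and the sizes are just examples):

  ethtool -S eth1 | grep rx_no_buffer_count   # should stay at or near zero
  ethtool -g eth1                             # current vs. maximum ring sizes
  ethtool -G eth1 rx 1024 tx 1024             # only if the counter keeps climbing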
Auke
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: e1000 full-duplex TCP performance well below wire speed
2008-01-31 18:08 ` Kok, Auke
@ 2008-01-31 18:38 ` Rick Jones
2008-01-31 18:47 ` Kok, Auke
2008-01-31 19:13 ` Bruce Allen
1 sibling, 1 reply; 32+ messages in thread
From: Rick Jones @ 2008-01-31 18:38 UTC (permalink / raw)
To: Kok, Auke
Cc: Bruce Allen, Brandeburg, Jesse, netdev, Carsten Aulbert,
Henning Fehrmann, Bruce Allen
> A lot of people tend to forget that the pci-express bus has enough bandwidth on
> first glance - 2.5gbit/sec for 1gbit of traffic, but apart from data going over it
> there is significant overhead going on: each packet requires transmit, cleanup and
> buffer transactions, and there are many irq register clears per second (slow
> ioread/writes). The transactions double for TCP ack processing, and this all
> accumulates and starts to introduce latency, higher cpu utilization etc...
Sounds like tools to show PCI* bus utilization would be helpful...
rick jones
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: e1000 full-duplex TCP performance well below wire speed
2008-01-31 18:38 ` Rick Jones
@ 2008-01-31 18:47 ` Kok, Auke
2008-01-31 19:07 ` Rick Jones
0 siblings, 1 reply; 32+ messages in thread
From: Kok, Auke @ 2008-01-31 18:47 UTC (permalink / raw)
To: Rick Jones
Cc: Bruce Allen, Brandeburg, Jesse, netdev, Carsten Aulbert,
Henning Fehrmann, Bruce Allen
Rick Jones wrote:
>> A lot of people tend to forget that the pci-express bus has enough
>> bandwidth at first glance - 2.5gbit/sec for 1gbit of traffic - but apart
>> from data going over it there is significant overhead going on: each
>> packet requires transmit, cleanup and buffer transactions, and there are
>> many irq register clears per second (slow ioread/writes). The
>> transactions double for TCP ack processing, and this all accumulates and
>> starts to introduce latency, higher cpu utilization etc...
>
> Sounds like tools to show PCI* bus utilization would be helpful...
That would be a hardware profiling thing, highly dependent on the part sticking
out of the slot, the vendor's bus implementation, etc. Perhaps Intel has some
tools for this already, but I personally do not know of any :/
Auke
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: e1000 full-duplex TCP performance well below wire speed
2008-01-31 18:47 ` Kok, Auke
@ 2008-01-31 19:07 ` Rick Jones
0 siblings, 0 replies; 32+ messages in thread
From: Rick Jones @ 2008-01-31 19:07 UTC (permalink / raw)
To: Kok, Auke
Cc: Bruce Allen, Brandeburg, Jesse, netdev, Carsten Aulbert,
Henning Fehrmann, Bruce Allen
>>Sounds like tools to show PCI* bus utilization would be helpful...
>
>
> that would be a hardware profiling thing and highly dependent on the part sticking
> out of the slot, vendor bus implementation etc... Perhaps Intel has some tools for
> this already but I personally do not know of any :/
A small matter of getting specs for the various LBAs (is that the correct
term? - lower bus adapters) and then abstracting them a la the CPU perf
counters, as done by, say, perfmon and then used by PAPI :)
rick jones
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: e1000 full-duplex TCP performance well below wire speed
2008-01-31 18:08 ` Kok, Auke
2008-01-31 18:38 ` Rick Jones
@ 2008-01-31 19:13 ` Bruce Allen
2008-01-31 19:32 ` Kok, Auke
1 sibling, 1 reply; 32+ messages in thread
From: Bruce Allen @ 2008-01-31 19:13 UTC (permalink / raw)
To: Kok, Auke
Cc: Brandeburg, Jesse, netdev, Carsten Aulbert, Henning Fehrmann,
Bruce Allen
Hi Auke,
>>>> Important note: we ARE able to get full duplex wire speed (over 900
>>>> Mb/s simultaneously in both directions) using UDP. The problems occur
>>>> only with TCP connections.
>>>
>>> That eliminates bus bandwidth issues, probably, but small packets take
>>> up a lot of extra descriptors, bus bandwidth, CPU, and cache resources.
>>
>> I see. Your concern is the extra ACK packets associated with TCP. Even
>> though these represent a small volume of data (around 5% with MTU=1500,
>> and less at larger MTU) they double the number of packets that must be
>> handled by the system compared to UDP transmission at the same data
>> rate. Is that correct?
>
> A lot of people tend to forget that the pci-express bus has enough
> bandwidth at first glance - 2.5gbit/sec for 1gbit of traffic, but apart
> from data going over it there is significant overhead going on: each
> packet requires transmit, cleanup and buffer transactions, and there are
> many irq register clears per second (slow ioread/writes). The
> transactions double for TCP ack processing, and this all accumulates and
> starts to introduce latency, higher cpu utilization etc...
Based on the discussion in this thread, I am inclined to believe that lack
of PCI-e bus bandwidth is NOT the issue. The theory is that the extra
packet handling associated with TCP acknowledgements is pushing the PCI-e
x1 bus past its limits. However, the evidence seems to show otherwise:
(1) Bill Fink has reported the same problem on a NIC with a 133 MHz 64-bit
PCI connection. That connection can transfer data at 8Gb/s.
(2) If the theory is right, then doubling the MTU from 1500 to 3000 should
have significantly reduced the problem, since it drops the number of ACKs
by a factor of two. Similarly, going from MTU 1500 to MTU 9000 should reduce
the number of ACKs by a factor of six, practically eliminating the problem.
But changing the MTU size does not help.
(3) The interrupt counts are quite reasonable. Broadcom NICs without
interrupt aggregation generate an order of magnitude more irq/s and this
doesn't prevent wire speed performance there.
(4) The CPUs on the system are largely idle. There are plenty of
computing resources available.
(5) I don't think that the overhead will increase the bandwidth needed by
more than a factor of two. Of course you and the other e1000 developers
are the experts, but the dominant bus cost should be copying data buffers
across the bus. Everything else is minimal in comparison.
Intel insiders: isn't there some simple instrumentation available (which
reads registers or statistics counters on the PCI-e interface chip) to tell
us statistics such as how many bits have moved over the link in each
direction? This plus some accurate timing would make it easy to see if the
TCP case is saturating the PCI-e bus. Then the theory could be addressed
with data rather than with opinions.
Cheers,
Bruce
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: e1000 full-duplex TCP performance well below wire speed
2008-01-31 19:13 ` Bruce Allen
@ 2008-01-31 19:32 ` Kok, Auke
2008-01-31 19:48 ` Bruce Allen
0 siblings, 1 reply; 32+ messages in thread
From: Kok, Auke @ 2008-01-31 19:32 UTC (permalink / raw)
To: Bruce Allen
Cc: Brandeburg, Jesse, netdev, Carsten Aulbert, Henning Fehrmann,
Bruce Allen
Bruce Allen wrote:
> Hi Auke,
>
>>>>> Important note: we ARE able to get full duplex wire speed (over 900
>>>>> Mb/s simultaneously in both directions) using UDP. The problems occur
>>>>> only with TCP connections.
>>>>
>>>> That eliminates bus bandwidth issues, probably, but small packets take
>>>> up a lot of extra descriptors, bus bandwidth, CPU, and cache resources.
>>>
>>> I see. Your concern is the extra ACK packets associated with TCP. Even
>>> those these represent a small volume of data (around 5% with MTU=1500,
>>> and less at larger MTU) they double the number of packets that must be
>>> handled by the system compared to UDP transmission at the same data
>>> rate. Is that correct?
>>
>> A lot of people tend to forget that the pci-express bus has enough
>> bandwidth at first glance - 2.5gbit/sec for 1gbit of traffic, but
>> apart from data going over it there is significant overhead going on:
>> each packet requires transmit, cleanup and buffer transactions, and
>> there are many irq register clears per second (slow ioread/writes).
>> The transactions double for TCP ack processing, and this all
>> accumulates and starts to introduce latency, higher cpu utilization
>> etc...
>
> Based on the discussion in this thread, I am inclined to believe that
> lack of PCI-e bus bandwidth is NOT the issue. The theory is that the
> extra packet handling associated with TCP acknowledgements are pushing
> the PCI-e x1 bus past its limits. However the evidence seems to show
> otherwise:
>
> (1) Bill Fink has reported the same problem on a NIC with a 133 MHz
> 64-bit PCI connection. That connection can transfer data at 8Gb/s.
That was even a PCI-X connection, which is known to have extremely good latency
numbers, IIRC better than PCI-e (?), which could account for a lot of the
latency-induced lower performance...
Also, 82573's are _not_ a server part and were not designed for this usage;
82546's are, and that really does make a difference. 82573's are full of
power-saving features, and all of that does make a difference even with some
of them turned off. It's not for nothing that these 82573's are used in a ton
of laptops from Toshiba, Lenovo, etc. A lot of this has to do with the card's
internal clock timings, as usual.
So, you'd really have to compare the 82546 to an 82571 card to be fair. You get
what you pay for, so to speak.
> (2) If the theory is right, then doubling the MTU from 1500 to 3000
> should have significantly reduce the problem, since it drops the number
> of ACK's by two. Similarly, going from MTU 1500 to MTU 9000 should
> reduce the number of ACK's by a factor of six, practically eliminating
> the problem. But changing the MTU size does not help.
>
> (3) The interrupt counts are quite reasonable. Broadcom NICs without
> interrupt aggregation generate an order of magnitude more irq/s and this
> doesn't prevent wire speed performance there.
>
> (4) The CPUs on the system are largely idle. There are plenty of
> computing resources available.
>
> (5) I don't think that the overhead will increase the bandwidth needed
> by more than a factor of two. Of course you and the other e1000
> developers are the experts, but the dominant bus cost should be copying
> data buffers across the bus. Everything else in minimal in comparison.
>
> Intel insiders: isn't there some simple instrumentation available (which
> read registers or statistics counters on the PCI-e interface chip) to
> tell us statistics such as how many bits have moved over the link in
> each direction? This plus some accurate timing would make it easy to see
> if the TCP case is saturating the PCI-e bus. Then the theory addressed
> with data rather than with opinions.
The only tools we have are expensive bus analyzers. As said in the thread with
Rick Jones, there might be some tools available from Intel for this, but I
have never seen them.
Auke
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: e1000 full-duplex TCP performance well below wire speed
2008-01-31 19:32 ` Kok, Auke
@ 2008-01-31 19:48 ` Bruce Allen
2008-02-01 6:27 ` Bill Fink
0 siblings, 1 reply; 32+ messages in thread
From: Bruce Allen @ 2008-01-31 19:48 UTC (permalink / raw)
To: Kok, Auke
Cc: Brandeburg, Jesse, netdev, Carsten Aulbert, Henning Fehrmann,
Bruce Allen
Hi Auke,
>> Based on the discussion in this thread, I am inclined to believe that
>> lack of PCI-e bus bandwidth is NOT the issue. The theory is that the
>> extra packet handling associated with TCP acknowledgements are pushing
>> the PCI-e x1 bus past its limits. However the evidence seems to show
>> otherwise:
>>
>> (1) Bill Fink has reported the same problem on a NIC with a 133 MHz
>> 64-bit PCI connection. That connection can transfer data at 8Gb/s.
>
> That was even a PCI-X connection, which is known to have extremely good latency
> numbers, IIRC better than PCI-e? (?) which could account for a lot of the
> latency-induced lower performance...
>
> also, 82573's are _not_ a serverpart and were not designed for this
> usage. 82546's are and that really does make a difference.
I'm confused. It DOESN'T make a difference! Using 'server grade' 82546's
on a PCI-X bus, Bill Fink reports the SAME loss of throughput with TCP
full duplex that we see on a 'consumer grade' 82573 attached to a PCI-e x1
bus.
Just like us, when Bill goes from TCP to UDP, he gets wire speed back.
Cheers,
Bruce
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: e1000 full-duplex TCP performance well below wire speed
2008-01-31 19:48 ` Bruce Allen
@ 2008-02-01 6:27 ` Bill Fink
2008-02-01 7:54 ` Bruce Allen
0 siblings, 1 reply; 32+ messages in thread
From: Bill Fink @ 2008-02-01 6:27 UTC (permalink / raw)
To: Bruce Allen
Cc: Kok, Auke, Brandeburg, Jesse, netdev, Carsten Aulbert,
Henning Fehrmann, Bruce Allen
On Thu, 31 Jan 2008, Bruce Allen wrote:
> >> Based on the discussion in this thread, I am inclined to believe that
> >> lack of PCI-e bus bandwidth is NOT the issue. The theory is that the
> >> extra packet handling associated with TCP acknowledgements are pushing
> >> the PCI-e x1 bus past its limits. However the evidence seems to show
> >> otherwise:
> >>
> >> (1) Bill Fink has reported the same problem on a NIC with a 133 MHz
> >> 64-bit PCI connection. That connection can transfer data at 8Gb/s.
> >
> > That was even a PCI-X connection, which is known to have extremely good latency
> > numbers, IIRC better than PCI-e? (?) which could account for a lot of the
> > latency-induced lower performance...
> >
> > also, 82573's are _not_ a serverpart and were not designed for this
> > usage. 82546's are and that really does make a difference.
>
> I'm confused. It DOESN'T make a difference! Using 'server grade' 82546's
> on a PCI-X bus, Bill Fink reports the SAME loss of throughput with TCP
> full duplex that we see on a 'consumer grade' 82573 attached to a PCI-e x1
> bus.
>
> Just like us, when Bill goes from TCP to UDP, he gets wire speed back.
Good. I thought it was just me who was confused by Auke's reply. :-)
Yes, I get the same type of reduced TCP performance behavior on a
bidirectional test that Bruce has seen, even though I'm using the
better 82546 GigE NIC on a faster 64-bit/133-MHz PCI-X bus. I also
don't think bus bandwidth is an issue, but I am curious if there
are any known papers on typical PCI-X/PCI-E bus overhead on network
transfers, either bulk data transfers with large packets or more
transaction- or video-based applications using smaller packets.
I started musing if once one side's transmitter got the upper hand,
it might somehow defer the processing of received packets, causing
the resultant ACKs to be delayed and thus further slowing down the
other end's transmitter. I began to wonder if the txqueuelen could
have an effect on the TCP performance behavior. I normally have
the txqueuelen set to 10000 for 10-GigE testing, so decided to run
a test with txqueuelen set to 200 (actually settled on this value
through some experimentation). Here is a typical result:
[bill@chance4 ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79 & nuttcp -f-beta -Irx -r -w2m 192.168.6.79
tx: 1120.6345 MB / 10.07 sec = 933.4042 Mbps 12 %TX 9 %RX 0 retrans
rx: 1104.3081 MB / 10.09 sec = 917.7365 Mbps 12 %TX 11 %RX 0 retrans
This is significantly better, but there was more variability in the
results. The above was with TSO enabled. I also then ran a test
with TSO disabled, with the following typical result:
[bill@chance4 ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79 & nuttcp -f-beta -Irx -r -w2m 192.168.6.79
tx: 1119.4749 MB / 10.05 sec = 934.2922 Mbps 13 %TX 9 %RX 0 retrans
rx: 1131.7334 MB / 10.05 sec = 944.8437 Mbps 15 %TX 12 %RX 0 retrans
This was a little better yet and getting closer to expected results.
Jesse Brandeburg mentioned in another post that there were known
performance issues with the version of the e1000 driver I'm using.
I recognized that the kernel/driver versions I was using were rather
old, but it was what I had available to do a quick test with. Those
particular systems are in a remote location so I have to be careful
with messing with their network drivers. I do have some other test
systems at work that I might be able to try with newer kernels
and/or drivers or maybe even with other vendor's GigE NICs, but
I won't be back to work until early next week sometime.
-Bill
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: e1000 full-duplex TCP performance well below wire speed
2008-02-01 6:27 ` Bill Fink
@ 2008-02-01 7:54 ` Bruce Allen
0 siblings, 0 replies; 32+ messages in thread
From: Bruce Allen @ 2008-02-01 7:54 UTC (permalink / raw)
To: Bill Fink
Cc: Kok, Auke, Brandeburg, Jesse, netdev, Carsten Aulbert,
Henning Fehrmann, Bruce Allen
Hi Bill,
> I started musing if once one side's transmitter got the upper hand, it
> might somehow defer the processing of received packets, causing the
> resultant ACKs to be delayed and thus further slowing down the other
> end's transmitter. I began to wonder if the txqueuelen could have an
> effect on the TCP performance behavior. I normally have the txqueuelen
> set to 10000 for 10-GigE testing, so decided to run a test with
> txqueuelen set to 200 (actually settled on this value through some
> experimentation). Here is a typical result:
>
> [bill@chance4 ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79 & nuttcp -f-beta -Irx -r -w2m 192.168.6.79
> tx: 1120.6345 MB / 10.07 sec = 933.4042 Mbps 12 %TX 9 %RX 0 retrans
> rx: 1104.3081 MB / 10.09 sec = 917.7365 Mbps 12 %TX 11 %RX 0 retrans
>
> This is significantly better, but there was more variability in the
> results. The above was with TSO enabled. I also then ran a test
> with TSO disabled, with the following typical result:
>
> [bill@chance4 ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79 & nuttcp -f-beta -Irx -r -w2m 192.168.6.79
> tx: 1119.4749 MB / 10.05 sec = 934.2922 Mbps 13 %TX 9 %RX 0 retrans
> rx: 1131.7334 MB / 10.05 sec = 944.8437 Mbps 15 %TX 12 %RX 0 retrans
>
> This was a little better yet and getting closer to expected results.
We'll also try changing txqueuelen. I have not looked, but I suppose that
this is set to the default value of 1000. We'd be delighted to see
full-duplex performance that was consistent and greater than 900 Mb/s x 2.
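A quick sketch of how we would check and change it (eth1 and the value 200
follow Bill's experiment and are just starting points):

  ifconfig eth1 | grep txqueuelen      # the default is typically 1000
  ifconfig eth1 txqueuelen 200         # lower it for the next test run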
> I do have some other test systems at work that I might be able to try
> with newer kernels and/or drivers or maybe even with other vendor's GigE
> NICs, but I won't be back to work until early next week sometime.
Bill, we'd be happy to give you root access to a couple of our systems
here if you want to do additional testing. We can put the latest drivers
on them (and reboot if/as needed). If you want to do this, please just
send an ssh public key to Carsten.
Cheers,
Bruce
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: e1000 full-duplex TCP performance well below wire speed
2008-01-31 17:55 ` Rick Jones
@ 2008-02-01 19:57 ` Carsten Aulbert
0 siblings, 0 replies; 32+ messages in thread
From: Carsten Aulbert @ 2008-02-01 19:57 UTC (permalink / raw)
To: Rick Jones
Cc: Brandeburg, Jesse, Bruce Allen, netdev, Henning Fehrmann,
Bruce Allen
Hi all
Rick Jones wrote:
> 2) use the aforementioned "burst" TCP_RR test. This is then a single
> netperf with data flowing both ways on a single connection so no issue
> of skew, but perhaps an issue of being one connection and so one process
> on each end.
Since our major goal is to establish a reliable way to test duplex
connections, this looks like a very good choice. Right now we just run
this as a back-to-back test (a cable connecting two hosts), but we want to
move to a high-performance network with up to three switches between
hosts. For this we want to have a stable test.
I doubt that I will be able to finish the tests tonight, but I'll post a
follow-up by Monday at the latest.
Have a nice weekend, and thanks a lot for all the suggestions so far!
Cheers
Carsten
^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2008-02-01 19:58 UTC | newest]
Thread overview: 32+ messages
2008-01-30 12:23 e1000 full-duplex TCP performance well below wire speed Bruce Allen
2008-01-30 17:36 ` Brandeburg, Jesse
2008-01-30 18:45 ` Rick Jones
2008-01-30 23:15 ` Bruce Allen
2008-01-31 11:35 ` Carsten Aulbert
2008-01-31 17:55 ` Rick Jones
2008-02-01 19:57 ` Carsten Aulbert
2008-01-30 23:07 ` Bruce Allen
2008-01-31 5:43 ` Brandeburg, Jesse
2008-01-31 8:31 ` Bruce Allen
2008-01-31 18:08 ` Kok, Auke
2008-01-31 18:38 ` Rick Jones
2008-01-31 18:47 ` Kok, Auke
2008-01-31 19:07 ` Rick Jones
2008-01-31 19:13 ` Bruce Allen
2008-01-31 19:32 ` Kok, Auke
2008-01-31 19:48 ` Bruce Allen
2008-02-01 6:27 ` Bill Fink
2008-02-01 7:54 ` Bruce Allen
2008-01-31 15:12 ` Carsten Aulbert
2008-01-31 17:20 ` Brandeburg, Jesse
2008-01-31 17:27 ` Carsten Aulbert
2008-01-31 17:33 ` Brandeburg, Jesse
2008-01-31 18:11 ` running aggregate netperf TCP_RR " Rick Jones
2008-01-31 18:03 ` Rick Jones
2008-01-31 15:18 ` Carsten Aulbert
2008-01-31 9:17 ` Andi Kleen
2008-01-31 9:59 ` Bruce Allen
2008-01-31 16:09 ` Carsten Aulbert
2008-01-31 18:15 ` Kok, Auke
2008-01-30 19:17 ` Ben Greear
2008-01-30 22:33 ` Bruce Allen