* Achieved 10Gbit/s bidirectional routing
@ 2009-07-15 16:50 Jesper Dangaard Brouer
2009-07-16 3:22 ` Bill Fink
0 siblings, 1 reply; 7+ messages in thread
From: Jesper Dangaard Brouer @ 2009-07-15 16:50 UTC (permalink / raw)
To: netdev@vger.kernel.org
Cc: David S. Miller, Robert Olsson, Waskiewicz Jr, Peter P,
Ronciak, John, jesse.brandeburg, Stephen Hemminger,
Linux Kernel Mailing List
I'm giving a talk at LinuxCon, about 10Gbit/s routing on standard
hardware running Linux.
http://linuxcon.linuxfoundation.org/meetings/1585
https://events.linuxfoundation.org/lc09o17
I'm getting some really good 10Gbit/s bidirectional routing results
with Intel's latest 82599 chip. (I got two pre-release engineering
samples directly from Intel; thanks, Peter.)
Using a Core i7-920, and tuning the memory according to the RAM's
X.M.P. settings (DDR3-1600 MHz); note that this also increases the QPI
to 6.4 GT/s. (Motherboard: ASUS P6T6 WS Revolution)
With big 1514-byte packets, I can basically do 10Gbit/s wirespeed
bidirectional routing.
Notice that bidirectional routing means we actually have to move
approximately 40Gbit/s through memory and in and out of the interfaces.
Formatted quick view using 'ifstat -b'
eth31-in eth31-out eth32-in eth32-out
9.57 + 9.52 + 9.51 + 9.60 = 38.20 Gbit/s
9.60 + 9.55 + 9.52 + 9.62 = 38.29 Gbit/s
9.61 + 9.53 + 9.52 + 9.62 = 38.28 Gbit/s
9.61 + 9.53 + 9.54 + 9.62 = 38.30 Gbit/s
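
For reference, summing those per-interface columns into the collective
figure is straightforward; a minimal sketch, assuming 'ifstat -b'
reports plain Kbit/s columns (the helper name and sample line below are
purely illustrative, not from my actual setup):

#!/usr/bin/env python
# Minimal sketch: sum the per-interface Kbit/s columns of one
# 'ifstat -b' sample line into a collective Gbit/s figure.
# Assumes plain numeric columns in interface order (in/out pairs).

def collective_gbits(ifstat_line):
    kbits = [float(col) for col in ifstat_line.split()]
    return sum(kbits) / 1e6   # Kbit/s -> Gbit/s

if __name__ == "__main__":
    sample = "9570000 9520000 9510000 9600000"   # hypothetical sample line
    print("collective throughput: %.2f Gbit/s" % collective_gbits(sample))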
[Adding an extra NIC]
Another observation is that I'm hitting some kind of bottleneck on the
PCI-express switch. Adding an extra NIC in a PCIe slot connected to
the same PCIe switch does not scale beyond 40Gbit/s collective
throughput.
But I happened to have a special motherboard, the ASUS P6T6 WS
Revolution, which has an additional PCIe switch chip, NVIDIA's NF200.
Connecting two dual-port 10GbE NICs via two different PCI-express
switch chips makes things scale again! I have achieved a collective
throughput of 66.25 Gbit/s. This result is also influenced by the fact
that my pktgen machines cannot keep up, and that I'm getting closer to
the memory bandwidth limits.
FYI: I found a really good reference explaining the PCI-express
architecture, written by Intel:
http://download.intel.com/design/intarch/papers/321071.pdf
I'm not sure how to explain the PCI-express chip bottleneck I'm
seeing, but my guess is that I'm limited by the number of outstanding
packets/DMA transfers and the latency of the DMA operations.
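
To make that guess concrete, here is a rough back-of-envelope sketch of
the theory (the numbers below are hypothetical placeholders, not values
from any X58 or NF200 datasheet): with only N DMA reads in flight,
throughput is capped at N * request_size / completion_latency,
regardless of how wide the link is.

#!/usr/bin/env python
# Back-of-envelope sketch of the outstanding-DMA-reads theory.
# All numbers below are hypothetical placeholders, not datasheet values.

def dma_read_ceiling_gbps(outstanding, request_bytes, latency_ns):
    # bytes in flight per completion round-trip, converted to Gbit/s
    bytes_per_sec = outstanding * request_bytes / (latency_ns * 1e-9)
    return bytes_per_sec * 8 / 1e9

# e.g. 8 outstanding 256-byte reads at ~400 ns completion latency
print("ceiling: %.1f Gbit/s" % dma_read_ceiling_gbps(8, 256, 400))   # ~41 Gbit/s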
Does anyone have datasheets on the X58 and NVIDIA's NF200 PCI-express
chips that can tell me the number of outstanding transfers they
support?
--
Med venlig hilsen / Best regards
Jesper Brouer
ComX Networks A/S
Linux Network developer
Cand. Scient Datalog / MSc.
Author of http://adsl-optimizer.dk
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: Achieved 10Gbit/s bidirectional routing 2009-07-15 16:50 Achieved 10Gbit/s bidirectional routing Jesper Dangaard Brouer @ 2009-07-16 3:22 ` Bill Fink 2009-07-16 9:39 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 7+ messages in thread From: Bill Fink @ 2009-07-16 3:22 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: netdev@vger.kernel.org, David S. Miller, Robert Olsson, Waskiewicz Jr, Peter P, Ronciak, John, jesse.brandeburg, Stephen Hemminger, Linux Kernel Mailing List On Wed, 15 Jul 2009, Jesper Dangaard Brouer wrote: > I'm giving a talk at LinuxCon, about 10Gbit/s routing on standard > hardware running Linux. > > http://linuxcon.linuxfoundation.org/meetings/1585 > https://events.linuxfoundation.org/lc09o17 > > I'm getting some really good 10Gbit/s bidirectional routing results > with Intels latest 82599 chip. (I got two pre-release engineering > samples directly from Intel, thanks Peter) > > Using a Core i7-920, and tuning the memory according to the RAMs > X.M.P. settings DDR3-1600MHz, notice this also increases the QPI to > 6.4GT/s. (Motherboard P6T6 WS revolution) > > With big 1514 bytes packets, I can basically do 10Gbit/s wirespeed > bidirectional routing. > > Notice bidirectional routing means that we actually has to move approx > 40Gbit/s through memory and in-and-out of the interfaces. > > Formatted quick view using 'ifstat -b' > > eth31-in eth31-out eth32-in eth32-out > 9.57 + 9.52 + 9.51 + 9.60 = 38.20 Gbit/s > 9.60 + 9.55 + 9.52 + 9.62 = 38.29 Gbit/s > 9.61 + 9.53 + 9.52 + 9.62 = 38.28 Gbit/s > 9.61 + 9.53 + 9.54 + 9.62 = 38.30 Gbit/s > > [Adding an extra NIC] > > Another observation is that I'm hitting some kind of bottleneck on the > PCI-express switch. Adding an extra NIC in a PCIe slot connected to > the same PCIe switch, does not scale beyond 40Gbit/s collective > throughput. > > But, I happened to have a special motherboard ASUS P6T6 WS revolution, > which has an additional PCIe switch chip NVIDIA's NF200. > > Connecting two dual port 10GbE NICs via two different PCI-express > switch chips, makes things scale again! I have achieved a collective > throughput of 66.25 Gbit/s. This results is also influenced by my > pktgen machines cannot keep up, and I'm getting closer to the memory > bandwidth limits. > > FYI: I found a really good reference explaining the PCI-express > architecture, written by Intel: > > http://download.intel.com/design/intarch/papers/321071.pdf > > I'm not sure how to explain the PCI-express chip bottleneck I'm > seeing, but my guess is that I'm limited by the number of outstanding > packets/DMA-transfers and the latency for the DMA operations. > > Does any one have datasheets on the X58 and NVIDIA's NF200 PCI-express > chips, that can tell me the number of outstanding transfers they > support? We've achieved 70 Gbps aggregate unidirectional TCP performance from one P6T6 based system to another. We figured out in our case that we were being limited by the interconnect between the Intel X58 and Nvidia N200 chips. The first 2 PCIe 2.0 slots are directly off the Intel X58 and get the full 40 Gbps throughput from the dual-port Myricom 10-GigE NICs we have installed in them. But the other 3 PCIe 2.0 slots are on the Nvidia N200 chip, and I discovered through googling that the link between the X58 and N200 chips only operates at PCIe x16 _1.0_ speed, which limits the possible aggregate throughput of the last 3 PCIe 2.0 slots to only 32 Gbps. 
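
As a quick sanity check on that 32 Gbps figure, the standard PCIe
arithmetic (a minimal sketch, not vendor data): a PCIe 1.0 lane runs at
2.5 GT/s with 8b/10b encoding, so an x16 gen1 link carries roughly
32 Gbit/s of payload data per direction before TLP/DLLP overhead,
versus about 64 Gbit/s for x16 gen2.

#!/usr/bin/env python
# Sketch: usable PCIe data rate = lanes * per-lane GT/s * 8b/10b
# efficiency, ignoring TLP/DLLP protocol overhead.

def pcie_gbps(lanes, gt_per_s, encoding=8.0 / 10.0):
    return lanes * gt_per_s * encoding

print("x16 gen1: %.0f Gbit/s" % pcie_gbps(16, 2.5))   # 32 Gbit/s
print("x16 gen2: %.0f Gbit/s" % pcie_gbps(16, 5.0))   # 64 Gbit/s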
This was clearly seen in our nuttcp testing:

[root@i7raid-1 ~]# ./nuttcp-6.2.6 -In2 -xc0/0 -p5001 192.168.1.11 & ./nuttcp-6.2.6 -In3 -xc0/0 -p5002 192.168.2.11 & ./nuttcp-6.2.6 -In4 -xc1/1 -p5003 192.168.3.11 & ./nuttcp-6.2.6 -In5 -xc1/1 -p5004 192.168.4.11 & ./nuttcp-6.2.6 -In6 -xc2/2 -p5005 192.168.5.11 & ./nuttcp-6.2.6 -In7 -xc2/2 -p5006 192.168.6.11 & ./nuttcp-6.2.6 -In8 -xc3/3 -p5007 192.168.7.11 & ./nuttcp-6.2.6 -In9 -xc3/3 -p5008 192.168.8.11
n2: 11505.2648 MB / 10.09 sec = 9566.2298 Mbps 37 %TX 55 %RX 0 retrans 0.10 msRTT
n3: 11727.4489 MB / 10.02 sec = 9815.7570 Mbps 39 %TX 44 %RX 0 retrans 0.10 msRTT
n4: 11770.1250 MB / 10.07 sec = 9803.9901 Mbps 39 %TX 51 %RX 0 retrans 0.10 msRTT
n5: 11837.9320 MB / 10.05 sec = 9876.5725 Mbps 39 %TX 47 %RX 0 retrans 0.10 msRTT
n6: 9096.8125 MB / 10.09 sec = 7559.3310 Mbps 30 %TX 32 %RX 0 retrans 0.10 msRTT
n7: 9100.1211 MB / 10.10 sec = 7559.7790 Mbps 30 %TX 44 %RX 0 retrans 0.10 msRTT
n8: 9095.6179 MB / 10.10 sec = 7557.9983 Mbps 31 %TX 33 %RX 0 retrans 0.10 msRTT
n9: 9075.5472 MB / 10.08 sec = 7551.0234 Mbps 31 %TX 33 %RX 0 retrans 0.11 msRTT

This used 4 dual-port Myricom 10-GigE NICs. We also tested with
a fifth dual-port 10-GigE NIC, but the aggregate throughput stayed
at about 70 Gbps, due to the performance bottleneck between the
X58 and N200 chips.

-Bill

^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Achieved 10Gbit/s bidirectional routing 2009-07-16 3:22 ` Bill Fink @ 2009-07-16 9:39 ` Jesper Dangaard Brouer 2009-07-16 15:38 ` Bill Fink 0 siblings, 1 reply; 7+ messages in thread From: Jesper Dangaard Brouer @ 2009-07-16 9:39 UTC (permalink / raw) To: Bill Fink Cc: netdev@vger.kernel.org, David S. Miller, Robert Olsson, Waskiewicz Jr, Peter P, Ronciak, John, jesse.brandeburg, Stephen Hemminger, Linux Kernel Mailing List On Wed, 2009-07-15 at 23:22 -0400, Bill Fink wrote: > On Wed, 15 Jul 2009, Jesper Dangaard Brouer wrote: > > > I'm giving a talk at LinuxCon, about 10Gbit/s routing on standard > > hardware running Linux. > > > > http://linuxcon.linuxfoundation.org/meetings/1585 > > https://events.linuxfoundation.org/lc09o17 > > > > I'm getting some really good 10Gbit/s bidirectional routing results > > with Intels latest 82599 chip. (I got two pre-release engineering > > samples directly from Intel, thanks Peter) > > > > Using a Core i7-920, and tuning the memory according to the RAMs > > X.M.P. settings DDR3-1600MHz, notice this also increases the QPI to > > 6.4GT/s. (Motherboard P6T6 WS revolution) > > > > With big 1514 bytes packets, I can basically do 10Gbit/s wirespeed > > bidirectional routing. > > > > Notice bidirectional routing means that we actually has to move approx > > 40Gbit/s through memory and in-and-out of the interfaces. > > > > Formatted quick view using 'ifstat -b' > > > > eth31-in eth31-out eth32-in eth32-out > > 9.57 + 9.52 + 9.51 + 9.60 = 38.20 Gbit/s > > 9.60 + 9.55 + 9.52 + 9.62 = 38.29 Gbit/s > > 9.61 + 9.53 + 9.52 + 9.62 = 38.28 Gbit/s > > 9.61 + 9.53 + 9.54 + 9.62 = 38.30 Gbit/s > > > > [Adding an extra NIC] > > > > Another observation is that I'm hitting some kind of bottleneck on the > > PCI-express switch. Adding an extra NIC in a PCIe slot connected to > > the same PCIe switch, does not scale beyond 40Gbit/s collective > > throughput. Correcting my self, according to Bill's info below. It does not scale when adding an extra NIC to the same NVIDIA NF200 PCIe switch chip (reason explained below by Bill) > > But, I happened to have a special motherboard ASUS P6T6 WS revolution, > > which has an additional PCIe switch chip NVIDIA's NF200. > > > > Connecting two dual port 10GbE NICs via two different PCI-express > > switch chips, makes things scale again! I have achieved a collective > > throughput of 66.25 Gbit/s. This results is also influenced by my > > pktgen machines cannot keep up, and I'm getting closer to the memory > > bandwidth limits. > > > > FYI: I found a really good reference explaining the PCI-express > > architecture, written by Intel: > > > > http://download.intel.com/design/intarch/papers/321071.pdf > > > > I'm not sure how to explain the PCI-express chip bottleneck I'm > > seeing, but my guess is that I'm limited by the number of outstanding > > packets/DMA-transfers and the latency for the DMA operations. > > > > Does any one have datasheets on the X58 and NVIDIA's NF200 PCI-express > > chips, that can tell me the number of outstanding transfers they > > support? > > We've achieved 70 Gbps aggregate unidirectional TCP performance from > one P6T6 based system to another. We figured out in our case that > we were being limited by the interconnect between the Intel X58 and > Nvidia N200 chips. The first 2 PCIe 2.0 slots are directly off the > Intel X58 and get the full 40 Gbps throughput from the dual-port > Myricom 10-GigE NICs we have installed in them. 
> But the other
> 3 PCIe 2.0 slots are on the Nvidia N200 chip, and I discovered
> through googling that the link between the X58 and N200 chips
> only operates at PCIe x16 _1.0_ speed, which limits the possible
> aggregate throughput of the last 3 PCIe 2.0 slots to only 32 Gbps.

This definitely explains the bottlenecks I have seen! Thanks!

Yes, it seems to scale when installing the two NICs in the first two
slots, both connected to the X58. By overclocking the RAM and CPU a
bit, I can match my pktgen machines' speed, which gives a collective
throughput of 67.95 Gbit/s.

 eth33          eth34          eth31          eth32
 in     out     in     out     in     out     in     out
 7.54 + 9.58 +  9.56 + 7.56 +  7.33 + 9.53 +  9.50 + 7.35 = 67.95 Gbit/s

Now I just need a faster generator machine, to find the next bottleneck ;-)

> This was clearly seen in our nuttcp testing:
>
> [root@i7raid-1 ~]# ./nuttcp-6.2.6 -In2 -xc0/0 -p5001 192.168.1.11 & ./nuttcp-6.2.6 -In3 -xc0/0 -p5002 192.168.2.11 & ./nuttcp-6.2.6 -In4 -xc1/1 -p5003 192.168.3.11 & ./nuttcp-6.2.6 -In5 -xc1/1 -p5004 192.168.4.11 & ./nuttcp-6.2.6 -In6 -xc2/2 -p5005 192.168.5.11 & ./nuttcp-6.2.6 -In7 -xc2/2 -p5006 192.168.6.11 & ./nuttcp-6.2.6 -In8 -xc3/3 -p5007 192.168.7.11 & ./nuttcp-6.2.6 -In9 -xc3/3 -p5008 192.168.8.11
> n2: 11505.2648 MB / 10.09 sec = 9566.2298 Mbps 37 %TX 55 %RX 0 retrans 0.10 msRTT
> n3: 11727.4489 MB / 10.02 sec = 9815.7570 Mbps 39 %TX 44 %RX 0 retrans 0.10 msRTT
> n4: 11770.1250 MB / 10.07 sec = 9803.9901 Mbps 39 %TX 51 %RX 0 retrans 0.10 msRTT
> n5: 11837.9320 MB / 10.05 sec = 9876.5725 Mbps 39 %TX 47 %RX 0 retrans 0.10 msRTT
> n6: 9096.8125 MB / 10.09 sec = 7559.3310 Mbps 30 %TX 32 %RX 0 retrans 0.10 msRTT
> n7: 9100.1211 MB / 10.10 sec = 7559.7790 Mbps 30 %TX 44 %RX 0 retrans 0.10 msRTT
> n8: 9095.6179 MB / 10.10 sec = 7557.9983 Mbps 31 %TX 33 %RX 0 retrans 0.10 msRTT
> n9: 9075.5472 MB / 10.08 sec = 7551.0234 Mbps 31 %TX 33 %RX 0 retrans 0.11 msRTT
>
> This used 4 dual-port Myricom 10-GigE NICs. We also tested with
> a fifth dual-port 10-GigE NIC, but the aggregate throughput stayed
> at about 70 Gbps, due to the performance bottleneck between the
> X58 and N200 chips.

These are also excellent results!

Thanks a lot Bill !!!

--
Med venlig hilsen / Best regards
Jesper Brouer
ComX Networks A/S
Linux Network developer
Cand. Scient Datalog / MSc.
Author of http://adsl-optimizer.dk
LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Achieved 10Gbit/s bidirectional routing 2009-07-16 9:39 ` Jesper Dangaard Brouer @ 2009-07-16 15:38 ` Bill Fink 2009-07-17 20:35 ` Willy Tarreau 0 siblings, 1 reply; 7+ messages in thread From: Bill Fink @ 2009-07-16 15:38 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: netdev@vger.kernel.org, David S. Miller, Robert Olsson, Waskiewicz Jr, Peter P, Ronciak, John, jesse.brandeburg, Stephen Hemminger, Linux Kernel Mailing List On Thu, 16 Jul 2009, Jesper Dangaard Brouer wrote: > On Wed, 2009-07-15 at 23:22 -0400, Bill Fink wrote: > > On Wed, 15 Jul 2009, Jesper Dangaard Brouer wrote: > > > > > I'm giving a talk at LinuxCon, about 10Gbit/s routing on standard > > > hardware running Linux. > > > > > > http://linuxcon.linuxfoundation.org/meetings/1585 > > > https://events.linuxfoundation.org/lc09o17 > > > > > > I'm getting some really good 10Gbit/s bidirectional routing results > > > with Intels latest 82599 chip. (I got two pre-release engineering > > > samples directly from Intel, thanks Peter) > > > > > > Using a Core i7-920, and tuning the memory according to the RAMs > > > X.M.P. settings DDR3-1600MHz, notice this also increases the QPI to > > > 6.4GT/s. (Motherboard P6T6 WS revolution) > > > > > > With big 1514 bytes packets, I can basically do 10Gbit/s wirespeed > > > bidirectional routing. > > > > > > Notice bidirectional routing means that we actually has to move approx > > > 40Gbit/s through memory and in-and-out of the interfaces. > > > > > > Formatted quick view using 'ifstat -b' > > > > > > eth31-in eth31-out eth32-in eth32-out > > > 9.57 + 9.52 + 9.51 + 9.60 = 38.20 Gbit/s > > > 9.60 + 9.55 + 9.52 + 9.62 = 38.29 Gbit/s > > > 9.61 + 9.53 + 9.52 + 9.62 = 38.28 Gbit/s > > > 9.61 + 9.53 + 9.54 + 9.62 = 38.30 Gbit/s > > > > > > [Adding an extra NIC] > > > > > > Another observation is that I'm hitting some kind of bottleneck on the > > > PCI-express switch. Adding an extra NIC in a PCIe slot connected to > > > the same PCIe switch, does not scale beyond 40Gbit/s collective > > > throughput. > > Correcting my self, according to Bill's info below. > > It does not scale when adding an extra NIC to the same NVIDIA NF200 PCIe > switch chip (reason explained below by Bill) > > > > > But, I happened to have a special motherboard ASUS P6T6 WS revolution, > > > which has an additional PCIe switch chip NVIDIA's NF200. > > > > > > Connecting two dual port 10GbE NICs via two different PCI-express > > > switch chips, makes things scale again! I have achieved a collective > > > throughput of 66.25 Gbit/s. This results is also influenced by my > > > pktgen machines cannot keep up, and I'm getting closer to the memory > > > bandwidth limits. > > > > > > FYI: I found a really good reference explaining the PCI-express > > > architecture, written by Intel: > > > > > > http://download.intel.com/design/intarch/papers/321071.pdf > > > > > > I'm not sure how to explain the PCI-express chip bottleneck I'm > > > seeing, but my guess is that I'm limited by the number of outstanding > > > packets/DMA-transfers and the latency for the DMA operations. > > > > > > Does any one have datasheets on the X58 and NVIDIA's NF200 PCI-express > > > chips, that can tell me the number of outstanding transfers they > > > support? > > > > We've achieved 70 Gbps aggregate unidirectional TCP performance from > > one P6T6 based system to another. We figured out in our case that > > we were being limited by the interconnect between the Intel X58 and > > Nvidia N200 chips. 
The first 2 PCIe 2.0 slots are directly off the > > Intel X58 and get the full 40 Gbps throughput from the dual-port > > Myricom 10-GigE NICs we have installed in them. But the other > > 3 PCIe 2.0 slots are on the Nvidia N200 chip, and I discovered > > through googling that the link between the X58 and N200 chips > > only operates at PCIe x16 _1.0_ speed, which limits the possible > > aggregate throughput of the last 3 PCIe 2.0 slots to only 32 Gbps. > > This definitly explains the bottlenecks I have seen! Thanks! > > Yes, it seems to scale when installing the two NICs in the first two > slots, both connected to the X58. If overclocking the RAM and CPU a > bit, I can match my pktgen machines speed which gives a collective > throughput of 67.95 Gbit/s. > > eth33 eth34 eth31 eth32 > in out in out in out in out > 7.54 + 9.58 + 9.56 + 7.56 + 7.33 + 9.53 + 9.50 + 7.35 = 67.95 Gbit/s > > Now I just need a faster generator machine, to find the next bottleneck ;-) > > > > This was clearly seen in our nuttcp testing: > > > > [root@i7raid-1 ~]# ./nuttcp-6.2.6 -In2 -xc0/0 -p5001 192.168.1.11 & ./nuttcp-6.2.6 -In3 -xc0/0 -p5002 192.168.2.11 & ./nuttcp-6.2.6 -In4 -xc1/1 -p5003 192.168.3.11 & ./nuttcp-6.2.6 -In5 -xc1/1 -p5004 192.168.4.11 & ./nuttcp-6.2.6 -In6 -xc2/2 -p5005 192.168.5.11 & ./nuttcp-6.2.6 -In7 -xc2/2 -p5006 192.168.6.11 & ./nuttcp-6.2.6 -In8 -xc3/3 -p5007 192.168.7.11 & ./nuttcp-6.2.6 -In9 -xc3/3 -p5008 192.168.8.11 > > n2: 11505.2648 MB / 10.09 sec = 9566.2298 Mbps 37 %TX 55 %RX 0 retrans 0.10 msRTT > > n3: 11727.4489 MB / 10.02 sec = 9815.7570 Mbps 39 %TX 44 %RX 0 retrans 0.10 msRTT > > n4: 11770.1250 MB / 10.07 sec = 9803.9901 Mbps 39 %TX 51 %RX 0 retrans 0.10 msRTT > > n5: 11837.9320 MB / 10.05 sec = 9876.5725 Mbps 39 %TX 47 %RX 0 retrans 0.10 msRTT > > n6: 9096.8125 MB / 10.09 sec = 7559.3310 Mbps 30 %TX 32 %RX 0 retrans 0.10 msRTT > > n7: 9100.1211 MB / 10.10 sec = 7559.7790 Mbps 30 %TX 44 %RX 0 retrans 0.10 msRTT > > n8: 9095.6179 MB / 10.10 sec = 7557.9983 Mbps 31 %TX 33 %RX 0 retrans 0.10 msRTT > > n9: 9075.5472 MB / 10.08 sec = 7551.0234 Mbps 31 %TX 33 %RX 0 retrans 0.11 msRTT > > > > This used 4 dual-port Myricom 10-GigE NICs. We also tested with > > a fifth dual-port 10-GigE NIC, but the aggregate throughput stayed > > at about 70 Gbps, due to the performance bottleneck between the > > X58 and N200 chips. > > This is also very excellent results! > > Thanks a lot Bill !!! 
We also achieved nearly 80 Gbps in bidirectional TCP tests (40 Gbps
simultaneously in each direction):

[root@i7raid-1 ~]# ./nuttcp-6.2.6 -In2 -xc0/0 -p5001 192.168.1.11 & ./nuttcp-6.2.6 -In3 -r -xc0/0 -p5002 192.168.2.11 & ./nuttcp-6.2.6 -In4 -xc1/1 -p5003 192.168.3.11 & ./nuttcp-6.2.6 -In5 -r -xc1/1 -p5004 192.168.4.11 & ./nuttcp-6.2.6 -In6 -xc2/2 -p5005 192.168.5.11 & ./nuttcp-6.2.6 -In7 -r -xc2/2 -p5006 192.168.6.11 & ./nuttcp-6.2.6 -In8 -xc3/3 -p5007 192.168.7.11 & ./nuttcp-6.2.6 -In9 -r -xc3/3 -p5008 192.168.8.11
n2: 11542.6250 MB / 10.07 sec = 9619.9920 Mbps 44 %TX 51 %RX 0 retrans 0.12 msRTT
n3: 11543.7143 MB / 10.06 sec = 9622.2153 Mbps 41 %TX 49 %RX 0 retrans 0.15 msRTT
n4: 11622.8125 MB / 10.05 sec = 9701.0296 Mbps 43 %TX 51 %RX 0 retrans 0.10 msRTT
n5: 11523.6875 MB / 10.03 sec = 9638.8883 Mbps 43 %TX 50 %RX 0 retrans 0.15 msRTT
n6: 11608.0141 MB / 10.04 sec = 9695.7388 Mbps 43 %TX 50 %RX 0 retrans 0.10 msRTT
n7: 11580.1250 MB / 10.04 sec = 9679.3910 Mbps 43 %TX 50 %RX 0 retrans 0.13 msRTT
n8: 11608.0000 MB / 10.06 sec = 9678.7596 Mbps 42 %TX 50 %RX 0 retrans 0.10 msRTT
n9: 11553.3750 MB / 10.05 sec = 9643.7296 Mbps 45 %TX 50 %RX 0 retrans 0.11 msRTT

This was using 2 dual-port 10-GigE NICs in the first two PCIe 2.0 slots.
We are using an Intel i7 965 quad-core 3.2 GHz Nehalem processor
(overclocked to 3.4 GHz) and 2000 MHz DDR3 memory. Adding an additional
dual-port 10-GigE NIC on the Nvidia N200 chip does only marginally
better, as it appears we are basically CPU limited at this point for
this test (the sum of the TX and RX CPU utilization for each pair of
10-GigE interfaces is about 93%).

-Bill

^ permalink raw reply [flat|nested] 7+ messages in thread
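
One quick way to double-check that "CPU limited" observation from the
raw nuttcp lines is to sum the %TX and %RX columns per stream; a small
sketch (the parser below is a hypothetical helper, not part of nuttcp):

#!/usr/bin/env python
# Sketch: pull throughput and TX+RX CPU utilization out of nuttcp result
# lines like "n2: ... = 9619.9920 Mbps 44 %TX 51 %RX ...".
import re

LINE_RE = re.compile(r"(n\d+):.*=\s*([\d.]+) Mbps\s+(\d+) %TX\s+(\d+) %RX")

def cpu_per_stream(nuttcp_output):
    for m in LINE_RE.finditer(nuttcp_output):
        yield m.group(1), float(m.group(2)), int(m.group(3)) + int(m.group(4))

sample = "n2: 11542.6250 MB / 10.07 sec = 9619.9920 Mbps 44 %TX 51 %RX 0 retrans 0.12 msRTT"
for name, mbps, cpu in cpu_per_stream(sample):
    print("%s: %.0f Mbps, TX+RX CPU = %d%%" % (name, mbps, cpu))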
* Re: Achieved 10Gbit/s bidirectional routing 2009-07-16 15:38 ` Bill Fink @ 2009-07-17 20:35 ` Willy Tarreau 2009-07-17 23:38 ` Bill Fink 2009-07-18 7:14 ` Jesper Dangaard Brouer 0 siblings, 2 replies; 7+ messages in thread From: Willy Tarreau @ 2009-07-17 20:35 UTC (permalink / raw) To: Bill Fink Cc: Jesper Dangaard Brouer, netdev@vger.kernel.org, David S. Miller, Robert Olsson, Waskiewicz Jr, Peter P, Ronciak, John, jesse.brandeburg, Stephen Hemminger, Linux Kernel Mailing List On Thu, Jul 16, 2009 at 11:38:27AM -0400, Bill Fink wrote: > On Thu, 16 Jul 2009, Jesper Dangaard Brouer wrote: > > > On Wed, 2009-07-15 at 23:22 -0400, Bill Fink wrote: > > > On Wed, 15 Jul 2009, Jesper Dangaard Brouer wrote: > > > > > > > I'm giving a talk at LinuxCon, about 10Gbit/s routing on standard > > > > hardware running Linux. > > > > > > > > http://linuxcon.linuxfoundation.org/meetings/1585 > > > > https://events.linuxfoundation.org/lc09o17 > > > > > > > > I'm getting some really good 10Gbit/s bidirectional routing results > > > > with Intels latest 82599 chip. (I got two pre-release engineering > > > > samples directly from Intel, thanks Peter) > > > > > > > > Using a Core i7-920, and tuning the memory according to the RAMs > > > > X.M.P. settings DDR3-1600MHz, notice this also increases the QPI to > > > > 6.4GT/s. (Motherboard P6T6 WS revolution) > > > > > > > > With big 1514 bytes packets, I can basically do 10Gbit/s wirespeed > > > > bidirectional routing. > > > > > > > > Notice bidirectional routing means that we actually has to move approx > > > > 40Gbit/s through memory and in-and-out of the interfaces. > > > > > > > > Formatted quick view using 'ifstat -b' > > > > > > > > eth31-in eth31-out eth32-in eth32-out > > > > 9.57 + 9.52 + 9.51 + 9.60 = 38.20 Gbit/s > > > > 9.60 + 9.55 + 9.52 + 9.62 = 38.29 Gbit/s > > > > 9.61 + 9.53 + 9.52 + 9.62 = 38.28 Gbit/s > > > > 9.61 + 9.53 + 9.54 + 9.62 = 38.30 Gbit/s > > > > > > > > [Adding an extra NIC] > > > > > > > > Another observation is that I'm hitting some kind of bottleneck on the > > > > PCI-express switch. Adding an extra NIC in a PCIe slot connected to > > > > the same PCIe switch, does not scale beyond 40Gbit/s collective > > > > throughput. > > > > Correcting my self, according to Bill's info below. > > > > It does not scale when adding an extra NIC to the same NVIDIA NF200 PCIe > > switch chip (reason explained below by Bill) > > > > > > > > But, I happened to have a special motherboard ASUS P6T6 WS revolution, > > > > which has an additional PCIe switch chip NVIDIA's NF200. > > > > > > > > Connecting two dual port 10GbE NICs via two different PCI-express > > > > switch chips, makes things scale again! I have achieved a collective > > > > throughput of 66.25 Gbit/s. This results is also influenced by my > > > > pktgen machines cannot keep up, and I'm getting closer to the memory > > > > bandwidth limits. > > > > > > > > FYI: I found a really good reference explaining the PCI-express > > > > architecture, written by Intel: > > > > > > > > http://download.intel.com/design/intarch/papers/321071.pdf > > > > > > > > I'm not sure how to explain the PCI-express chip bottleneck I'm > > > > seeing, but my guess is that I'm limited by the number of outstanding > > > > packets/DMA-transfers and the latency for the DMA operations. > > > > > > > > Does any one have datasheets on the X58 and NVIDIA's NF200 PCI-express > > > > chips, that can tell me the number of outstanding transfers they > > > > support? 
> > > > > > We've achieved 70 Gbps aggregate unidirectional TCP performance from > > > one P6T6 based system to another. We figured out in our case that > > > we were being limited by the interconnect between the Intel X58 and > > > Nvidia N200 chips. The first 2 PCIe 2.0 slots are directly off the > > > Intel X58 and get the full 40 Gbps throughput from the dual-port > > > Myricom 10-GigE NICs we have installed in them. But the other > > > 3 PCIe 2.0 slots are on the Nvidia N200 chip, and I discovered > > > through googling that the link between the X58 and N200 chips > > > only operates at PCIe x16 _1.0_ speed, which limits the possible > > > aggregate throughput of the last 3 PCIe 2.0 slots to only 32 Gbps. > > > > This definitly explains the bottlenecks I have seen! Thanks! > > > > Yes, it seems to scale when installing the two NICs in the first two > > slots, both connected to the X58. If overclocking the RAM and CPU a > > bit, I can match my pktgen machines speed which gives a collective > > throughput of 67.95 Gbit/s. > > > > eth33 eth34 eth31 eth32 > > in out in out in out in out > > 7.54 + 9.58 + 9.56 + 7.56 + 7.33 + 9.53 + 9.50 + 7.35 = 67.95 Gbit/s > > > > Now I just need a faster generator machine, to find the next bottleneck ;-) > > > > > > > This was clearly seen in our nuttcp testing: > > > > > > [root@i7raid-1 ~]# ./nuttcp-6.2.6 -In2 -xc0/0 -p5001 192.168.1.11 & ./nuttcp-6.2.6 -In3 -xc0/0 -p5002 192.168.2.11 & ./nuttcp-6.2.6 -In4 -xc1/1 -p5003 192.168.3.11 & ./nuttcp-6.2.6 -In5 -xc1/1 -p5004 192.168.4.11 & ./nuttcp-6.2.6 -In6 -xc2/2 -p5005 192.168.5.11 & ./nuttcp-6.2.6 -In7 -xc2/2 -p5006 192.168.6.11 & ./nuttcp-6.2.6 -In8 -xc3/3 -p5007 192.168.7.11 & ./nuttcp-6.2.6 -In9 -xc3/3 -p5008 192.168.8.11 > > > n2: 11505.2648 MB / 10.09 sec = 9566.2298 Mbps 37 %TX 55 %RX 0 retrans 0.10 msRTT > > > n3: 11727.4489 MB / 10.02 sec = 9815.7570 Mbps 39 %TX 44 %RX 0 retrans 0.10 msRTT > > > n4: 11770.1250 MB / 10.07 sec = 9803.9901 Mbps 39 %TX 51 %RX 0 retrans 0.10 msRTT > > > n5: 11837.9320 MB / 10.05 sec = 9876.5725 Mbps 39 %TX 47 %RX 0 retrans 0.10 msRTT > > > n6: 9096.8125 MB / 10.09 sec = 7559.3310 Mbps 30 %TX 32 %RX 0 retrans 0.10 msRTT > > > n7: 9100.1211 MB / 10.10 sec = 7559.7790 Mbps 30 %TX 44 %RX 0 retrans 0.10 msRTT > > > n8: 9095.6179 MB / 10.10 sec = 7557.9983 Mbps 31 %TX 33 %RX 0 retrans 0.10 msRTT > > > n9: 9075.5472 MB / 10.08 sec = 7551.0234 Mbps 31 %TX 33 %RX 0 retrans 0.11 msRTT > > > > > > This used 4 dual-port Myricom 10-GigE NICs. We also tested with > > > a fifth dual-port 10-GigE NIC, but the aggregate throughput stayed > > > at about 70 Gbps, due to the performance bottleneck between the > > > X58 and N200 chips. > > > > This is also very excellent results! > > > > Thanks a lot Bill !!! 
>
> We also achieved nearly 80 Gbps in bidirectional TCP tests (40 Gbps
> simultaneously in each direction):
>
> [root@i7raid-1 ~]# ./nuttcp-6.2.6 -In2 -xc0/0 -p5001 192.168.1.11 & ./nuttcp-6.2.6 -In3 -r -xc0/0 -p5002 192.168.2.11 & ./nuttcp-6.2.6 -In4 -xc1/1 -p5003 192.168.3.11 & ./nuttcp-6.2.6 -In5 -r -xc1/1 -p5004 192.168.4.11 & ./nuttcp-6.2.6 -In6 -xc2/2 -p5005 192.168.5.11 & ./nuttcp-6.2.6 -In7 -r -xc2/2 -p5006 192.168.6.11 & ./nuttcp-6.2.6 -In8 -xc3/3 -p5007 192.168.7.11 & ./nuttcp-6.2.6 -In9 -r -xc3/3 -p5008 192.168.8.11
> n2: 11542.6250 MB / 10.07 sec = 9619.9920 Mbps 44 %TX 51 %RX 0 retrans 0.12 msRTT
> n3: 11543.7143 MB / 10.06 sec = 9622.2153 Mbps 41 %TX 49 %RX 0 retrans 0.15 msRTT
> n4: 11622.8125 MB / 10.05 sec = 9701.0296 Mbps 43 %TX 51 %RX 0 retrans 0.10 msRTT
> n5: 11523.6875 MB / 10.03 sec = 9638.8883 Mbps 43 %TX 50 %RX 0 retrans 0.15 msRTT
> n6: 11608.0141 MB / 10.04 sec = 9695.7388 Mbps 43 %TX 50 %RX 0 retrans 0.10 msRTT
> n7: 11580.1250 MB / 10.04 sec = 9679.3910 Mbps 43 %TX 50 %RX 0 retrans 0.13 msRTT
> n8: 11608.0000 MB / 10.06 sec = 9678.7596 Mbps 42 %TX 50 %RX 0 retrans 0.10 msRTT
> n9: 11553.3750 MB / 10.05 sec = 9643.7296 Mbps 45 %TX 50 %RX 0 retrans 0.11 msRTT
>
> This was using 2 dual-port 10-GigE NICs in the first two PCIe 2.0 slots.
> We are using an Intel i7 965 quad-core 3.2 GHz Nehalem processor
> (overclocked to 3.4 GHz) and 2000 MHz DDR3 memory. Adding an additional
> dual-port 10-GigE NIC on the Nvidia N200 chip does only marginally
> better, as it appears we are basically CPU limited at this point for
> this test (the sum of the TX and RX CPU utilization for each pair of
> 10-GigE interfaces is about 93%).

Hey guys, those are really nice numbers. Since TCP splicing appeared in
the kernel (once we got it fixed), I achieved 10 Gbps of HTTP proxying
using haproxy with very low CPU usage (about 20% of a Core2Duo 2.66 GHz).

Before buying the machines, I had been wandering around with the NICs
donated by Myricom in order to try to find a machine capable of
supporting this. My conclusion was that a lot of machines had
difficulties getting above 3.5, 4.7 and 6.5 Gbps of output traffic
(those 3 numbers were always the same, depending on the chipsets).
There clearly was a bandwidth limitation imposed by the chipset.

So I waited for the X38 and AM780FX chipsets to become available and
bought 3 machines (1 C2D, 1 AMD X2, 1 AMD X4). Those ones have no
problem with 10 Gbps of forwarded traffic (20 Gbps of total bus
bandwidth), even with 1500-byte frames, but I don't know how high they
can go; maybe they will saturate slightly above that.

Unfortunately, I only have 5 NICs in 3 machines and no switch (and CX4
is hard to find these days), so I'm probably stuck at 10 Gbps max.

Interestingly, I had the impression that forwarding data with TCP
splicing costs less CPU than IP forwarding, because the NICs can do
LRO.

Also, I know a French service provider who uses haproxy on Core i7
machines and who has already reached 5 Gbps of sustained traffic
with recent Intel dual-port NICs (though I'm not sure exactly which
ones). This is with very little CPU usage too, less than 2-3% user
and 15% system+softirq. On previous machines (quad-core Xeons), it
was impossible to go beyond 3 Gbps; it looked like the chipset was
the limiting factor too (though I don't precisely remember which
one it was).

I really blamed the NICs because this guy's machine was about 4 times
more powerful than mine, but apparently it was just a chipset issue.
I also happen to have a customer who recently received a few Sun NXGE
NICs, mounted in Sun x2100-m2 machines using an NVIDIA chipset, which I
tested OK at 10 Gbps with my myri10GE NICs. I'll try to see if I can
run some tests there, as Davem once said those NICs are really good too.

All in all, I find it really cool that our beloved OS scales that
well with the hardware :-)

Regards,
Willy

^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Achieved 10Gbit/s bidirectional routing 2009-07-17 20:35 ` Willy Tarreau @ 2009-07-17 23:38 ` Bill Fink 2009-07-18 7:14 ` Jesper Dangaard Brouer 1 sibling, 0 replies; 7+ messages in thread From: Bill Fink @ 2009-07-17 23:38 UTC (permalink / raw) To: Willy Tarreau Cc: Jesper Dangaard Brouer, netdev@vger.kernel.org, David S. Miller, Robert Olsson, Waskiewicz Jr, Peter P, Ronciak, John, jesse.brandeburg, Stephen Hemminger, Linux Kernel Mailing List On Fri, 17 Jul 2009, Willy Tarreau wrote: > On Thu, Jul 16, 2009 at 11:38:27AM -0400, Bill Fink wrote: > > > We also achieved nearly 80 Gbps in bidirectional TCP tests (40 Gbps > > simultaneously in each direction): > > > > [root@i7raid-1 ~]# ./nuttcp-6.2.6 -In2 -xc0/0 -p5001 192.168.1.11 & ./nuttcp-6.2.6 -In3 -r -xc0/0 -p5002 192.168.2.11 & ./nuttcp-6.2.6 -In4 -xc1/1 -p5003 192.168.3.11 & ./nuttcp-6.2.6 -In5 -r -xc1/1 -p5004 192.168.4.11 & ./nuttcp-6.2.6 -In6 -xc2/2 -p5005 192.168.5.11 & ./nuttcp-6.2.6 -In7 -r -xc2/2 -p5006 192.168.6.11 & ./nuttcp-6.2.6 -In8 -xc3/3 -p5007 192.168.7.11 & ./nuttcp-6.2.6 -In9 -r -xc3/3 -p5008 192.168.8.11 > > n2: 11542.6250 MB / 10.07 sec = 9619.9920 Mbps 44 %TX 51 %RX 0 retrans 0.12 msRTT > > n3: 11543.7143 MB / 10.06 sec = 9622.2153 Mbps 41 %TX 49 %RX 0 retrans 0.15 msRTT > > n4: 11622.8125 MB / 10.05 sec = 9701.0296 Mbps 43 %TX 51 %RX 0 retrans 0.10 msRTT > > n5: 11523.6875 MB / 10.03 sec = 9638.8883 Mbps 43 %TX 50 %RX 0 retrans 0.15 msRTT > > n6: 11608.0141 MB / 10.04 sec = 9695.7388 Mbps 43 %TX 50 %RX 0 retrans 0.10 msRTT > > n7: 11580.1250 MB / 10.04 sec = 9679.3910 Mbps 43 %TX 50 %RX 0 retrans 0.13 msRTT > > n8: 11608.0000 MB / 10.06 sec = 9678.7596 Mbps 42 %TX 50 %RX 0 retrans 0.10 msRTT > > n9: 11553.3750 MB / 10.05 sec = 9643.7296 Mbps 45 %TX 50 %RX 0 retrans 0.11 msRTT > > > > This was using 2 dual-port 10-GigE NICs in the first two PCIe 2.0 slots. > > We are using an Intel i7 965 quad-core 3.2 GHz Nehalem processor > > (overclocked to 3.4 GHz) and 2000 MHz DDR3 memory. Adding an additional > > dual-port 10-GigE NIC on the Nvidia N200 chip does only marginally > > better, as it appears we are basically CPU limited at this point for > > this test (the sum of the TX and RX CPU utilization for each pair of > > 10-GigE interfaces is about 93%). > > Hey guys, those are really nice numbers. Since TCP splicing appeared in the > kernel (once we got it fixed), I achieved 10 Gbps of HTTP proxying using > haproxy with very low CPU usage (about 20% of a Core2Duo 2.66 GHz). > > Before buying the machines, I had been wandering around with the NICs > donated by Myricom in order to try to find a machine capable of supporting > this. My conclusion was that a lot of machines had difficulties getting > above 3.5, 4.7 and 6.5 Gbps of output traffic (those 3 numbers were always > the same, depending on the chipsets). There clearly was a bandwidth > limitation imposed by the chipset. > > So I waited for the X38 and AM780FX chipsets to become available and > bought 3 machines (1 C2D, 1 AMD X2, 1 AMD X4). Those ones have no problem > with 10 Gbps of forwarded traffic (20 Gbps of total bus bandwidth), even > with 1500 bytes frames, but I don't know how high they can go, maybe > they will saturate slightly above. > > Unfortunately, I only have 5 NICs in 3 machines and no switch (and CX4 > is hard to find these days), so I'm probably stuck at 10 Gbps max. > > Interestingly, I had the impression that forwarding data with TCP > splicing costs less CPU than IP forwarding, because the NICs can do > LRO. 
> > Also, I know a french service provider who uses haproxy on Core i7 > machines and who has already reached 5 Gbps of sustained traffic > with recent intel dual-port NICs (though I'm not sure exactly which > ones). This is with very little CPU usage too, less than 2-3% user > and 15% system+softirq. On previous machines (quad core xeons), it > was impossible to go beyond 3 Gbps, it looked like the chipset was > the limitating factor too (though I don't precisely remember which > one it was). > > I really blamed the NICs because this guys machine was about 4 times > more powerful than mine, but apparently it was just a chipset issue. > > I also happen to have a customer who recently received a few Sun NXGE, > mounted in Sun x2100-m2 using an nvidia chipset which I tested OK at > 10 Gbps with my myri10GE NICs. I'll try to see if I can run some tests > there, as Davem once said those NICs are really good too. > > All in all, I find it really cool that our beloved OS scales that > well with the hardware :-) Yes, I am quite impressed that the Linux kernel and TCP/IP network stack performs amazingly well at these multi-10-GigE speeds. I was especially interested in Jesper's IP forwarding results, as we haven't tested that yet ourselves, and one of the intended applications of these systems is as a multi-10-GigE firewall, so that's looking very encouraging at this point. -Bill ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Achieved 10Gbit/s bidirectional routing 2009-07-17 20:35 ` Willy Tarreau 2009-07-17 23:38 ` Bill Fink @ 2009-07-18 7:14 ` Jesper Dangaard Brouer 1 sibling, 0 replies; 7+ messages in thread From: Jesper Dangaard Brouer @ 2009-07-18 7:14 UTC (permalink / raw) To: Willy Tarreau Cc: Bill Fink, netdev@vger.kernel.org, David S. Miller, Robert Olsson, Waskiewicz Jr, Peter P, Ronciak, John, jesse.brandeburg, Stephen Hemminger, Linux Kernel Mailing List On Fri, 2009-07-17 at 22:35 +0200, Willy Tarreau wrote: > On Thu, Jul 16, 2009 at 11:38:27AM -0400, Bill Fink wrote: > > On Thu, 16 Jul 2009, Jesper Dangaard Brouer wrote: > > > > > On Wed, 2009-07-15 at 23:22 -0400, Bill Fink wrote: > > > > On Wed, 15 Jul 2009, Jesper Dangaard Brouer wrote: > > > > > > > > > I'm giving a talk at LinuxCon, about 10Gbit/s routing on standard > > > > > hardware running Linux. > > > > > > > > > > http://linuxcon.linuxfoundation.org/meetings/1585 > > > > > https://events.linuxfoundation.org/lc09o17 > > > > > > > > > > I'm getting some really good 10Gbit/s bidirectional routing results > > > > > with Intels latest 82599 chip. (I got two pre-release engineering > > > > > samples directly from Intel, thanks Peter) > > > > > > > > > > Using a Core i7-920, and tuning the memory according to the RAMs > > > > > X.M.P. settings DDR3-1600MHz, notice this also increases the QPI to > > > > > 6.4GT/s. (Motherboard P6T6 WS revolution) > > > > > > > > > > With big 1514 bytes packets, I can basically do 10Gbit/s wirespeed > > > > > bidirectional routing. > > > > > > > > > > Notice bidirectional routing means that we actually has to move approx > > > > > 40Gbit/s through memory and in-and-out of the interfaces. > > > > > > > > > > Formatted quick view using 'ifstat -b' > > > > > > > > > > eth31-in eth31-out eth32-in eth32-out > > > > > 9.57 + 9.52 + 9.51 + 9.60 = 38.20 Gbit/s > > > > > 9.60 + 9.55 + 9.52 + 9.62 = 38.29 Gbit/s > > > > > 9.61 + 9.53 + 9.52 + 9.62 = 38.28 Gbit/s > > > > > 9.61 + 9.53 + 9.54 + 9.62 = 38.30 Gbit/s > > > > > > > > > > [Adding an extra NIC] > > > > > > > > > > Another observation is that I'm hitting some kind of bottleneck on the > > > > > PCI-express switch. Adding an extra NIC in a PCIe slot connected to > > > > > the same PCIe switch, does not scale beyond 40Gbit/s collective > > > > > throughput. > > > > > > Correcting my self, according to Bill's info below. > > > > > > It does not scale when adding an extra NIC to the same NVIDIA NF200 PCIe > > > switch chip (reason explained below by Bill) > > > > > > > > > > > But, I happened to have a special motherboard ASUS P6T6 WS revolution, > > > > > which has an additional PCIe switch chip NVIDIA's NF200. > > > > > > > > > > Connecting two dual port 10GbE NICs via two different PCI-express > > > > > switch chips, makes things scale again! I have achieved a collective > > > > > throughput of 66.25 Gbit/s. This results is also influenced by my > > > > > pktgen machines cannot keep up, and I'm getting closer to the memory > > > > > bandwidth limits. > > > > > > > > > > FYI: I found a really good reference explaining the PCI-express > > > > > architecture, written by Intel: > > > > > > > > > > http://download.intel.com/design/intarch/papers/321071.pdf > > > > > > > > > > I'm not sure how to explain the PCI-express chip bottleneck I'm > > > > > seeing, but my guess is that I'm limited by the number of outstanding > > > > > packets/DMA-transfers and the latency for the DMA operations. 
> > > > > > > > > > Does any one have datasheets on the X58 and NVIDIA's NF200 PCI-express > > > > > chips, that can tell me the number of outstanding transfers they > > > > > support? > > > > > > > > We've achieved 70 Gbps aggregate unidirectional TCP performance from > > > > one P6T6 based system to another. We figured out in our case that > > > > we were being limited by the interconnect between the Intel X58 and > > > > Nvidia N200 chips. The first 2 PCIe 2.0 slots are directly off the > > > > Intel X58 and get the full 40 Gbps throughput from the dual-port > > > > Myricom 10-GigE NICs we have installed in them. But the other > > > > 3 PCIe 2.0 slots are on the Nvidia N200 chip, and I discovered > > > > through googling that the link between the X58 and N200 chips > > > > only operates at PCIe x16 _1.0_ speed, which limits the possible > > > > aggregate throughput of the last 3 PCIe 2.0 slots to only 32 Gbps. > > > > > > This definitly explains the bottlenecks I have seen! Thanks! > > > > > > Yes, it seems to scale when installing the two NICs in the first two > > > slots, both connected to the X58. If overclocking the RAM and CPU a > > > bit, I can match my pktgen machines speed which gives a collective > > > throughput of 67.95 Gbit/s. > > > > > > eth33 eth34 eth31 eth32 > > > in out in out in out in out > > > 7.54 + 9.58 + 9.56 + 7.56 + 7.33 + 9.53 + 9.50 + 7.35 = 67.95 Gbit/s > > > > > > Now I just need a faster generator machine, to find the next bottleneck ;-) > > > > > > > > > > This was clearly seen in our nuttcp testing: > > > > > > > > [root@i7raid-1 ~]# ./nuttcp-6.2.6 -In2 -xc0/0 -p5001 192.168.1.11 & ./nuttcp-6.2.6 -In3 -xc0/0 -p5002 192.168.2.11 & ./nuttcp-6.2.6 -In4 -xc1/1 -p5003 192.168.3.11 & ./nuttcp-6.2.6 -In5 -xc1/1 -p5004 192.168.4.11 & ./nuttcp-6.2.6 -In6 -xc2/2 -p5005 192.168.5.11 & ./nuttcp-6.2.6 -In7 -xc2/2 -p5006 192.168.6.11 & ./nuttcp-6.2.6 -In8 -xc3/3 -p5007 192.168.7.11 & ./nuttcp-6.2.6 -In9 -xc3/3 -p5008 192.168.8.11 > > > > n2: 11505.2648 MB / 10.09 sec = 9566.2298 Mbps 37 %TX 55 %RX 0 retrans 0.10 msRTT > > > > n3: 11727.4489 MB / 10.02 sec = 9815.7570 Mbps 39 %TX 44 %RX 0 retrans 0.10 msRTT > > > > n4: 11770.1250 MB / 10.07 sec = 9803.9901 Mbps 39 %TX 51 %RX 0 retrans 0.10 msRTT > > > > n5: 11837.9320 MB / 10.05 sec = 9876.5725 Mbps 39 %TX 47 %RX 0 retrans 0.10 msRTT > > > > n6: 9096.8125 MB / 10.09 sec = 7559.3310 Mbps 30 %TX 32 %RX 0 retrans 0.10 msRTT > > > > n7: 9100.1211 MB / 10.10 sec = 7559.7790 Mbps 30 %TX 44 %RX 0 retrans 0.10 msRTT > > > > n8: 9095.6179 MB / 10.10 sec = 7557.9983 Mbps 31 %TX 33 %RX 0 retrans 0.10 msRTT > > > > n9: 9075.5472 MB / 10.08 sec = 7551.0234 Mbps 31 %TX 33 %RX 0 retrans 0.11 msRTT > > > > > > > > This used 4 dual-port Myricom 10-GigE NICs. We also tested with > > > > a fifth dual-port 10-GigE NIC, but the aggregate throughput stayed > > > > at about 70 Gbps, due to the performance bottleneck between the > > > > X58 and N200 chips. > > > > > > This is also very excellent results! > > > > > > Thanks a lot Bill !!! 
> > > > We also achieved nearly 80 Gbps in bidirectional TCP tests (40 Gbps > > simultaneously in each direction): > > > > [root@i7raid-1 ~]# ./nuttcp-6.2.6 -In2 -xc0/0 -p5001 192.168.1.11 & ./nuttcp-6.2.6 -In3 -r -xc0/0 -p5002 192.168.2.11 & ./nuttcp-6.2.6 -In4 -xc1/1 -p5003 192.168.3.11 & ./nuttcp-6.2.6 -In5 -r -xc1/1 -p5004 192.168.4.11 & ./nuttcp-6.2.6 -In6 -xc2/2 -p5005 192.168.5.11 & ./nuttcp-6.2.6 -In7 -r -xc2/2 -p5006 192.168.6.11 & ./nuttcp-6.2.6 -In8 -xc3/3 -p5007 192.168.7.11 & ./nuttcp-6.2.6 -In9 -r -xc3/3 -p5008 192.168.8.11 > > n2: 11542.6250 MB / 10.07 sec = 9619.9920 Mbps 44 %TX 51 %RX 0 retrans 0.12 msRTT > > n3: 11543.7143 MB / 10.06 sec = 9622.2153 Mbps 41 %TX 49 %RX 0 retrans 0.15 msRTT > > n4: 11622.8125 MB / 10.05 sec = 9701.0296 Mbps 43 %TX 51 %RX 0 retrans 0.10 msRTT > > n5: 11523.6875 MB / 10.03 sec = 9638.8883 Mbps 43 %TX 50 %RX 0 retrans 0.15 msRTT > > n6: 11608.0141 MB / 10.04 sec = 9695.7388 Mbps 43 %TX 50 %RX 0 retrans 0.10 msRTT > > n7: 11580.1250 MB / 10.04 sec = 9679.3910 Mbps 43 %TX 50 %RX 0 retrans 0.13 msRTT > > n8: 11608.0000 MB / 10.06 sec = 9678.7596 Mbps 42 %TX 50 %RX 0 retrans 0.10 msRTT > > n9: 11553.3750 MB / 10.05 sec = 9643.7296 Mbps 45 %TX 50 %RX 0 retrans 0.11 msRTT > > > > This was using 2 dual-port 10-GigE NICs in the first two PCIe 2.0 slots. > > We are using an Intel i7 965 quad-core 3.2 GHz Nehalem processor > > (overclocked to 3.4 GHz) and 2000 MHz DDR3 memory. Adding an additional > > dual-port 10-GigE NIC on the Nvidia N200 chip does only marginally > > better, as it appears we are basically CPU limited at this point for > > this test (the sum of the TX and RX CPU utilization for each pair of > > 10-GigE interfaces is about 93%). > > Hey guys, those are really nice numbers. Since TCP splicing appeared in the > kernel (once we got it fixed), I achieved 10 Gbps of HTTP proxying using > haproxy with very low CPU usage (about 20% of a Core2Duo 2.66 GHz). Nice, but I think we have a bug with the measured CPU usage. Eric Dumazet did a fix, but also pointed out that in a later mail, at I seem like it not fixed completely yet... > Before buying the machines, I had been wandering around with the NICs > donated by Myricom in order to try to find a machine capable of supporting > this. My conclusion was that a lot of machines had difficulties getting > above 3.5, 4.7 and 6.5 Gbps of output traffic (those 3 numbers were always > the same, depending on the chipsets). There clearly was a bandwidth > limitation imposed by the chipset. > > So I waited for the X38 and AM780FX chipsets to become available and > bought 3 machines (1 C2D, 1 AMD X2, 1 AMD X4). Those ones have no problem > with 10 Gbps of forwarded traffic (20 Gbps of total bus bandwidth), even > with 1500 bytes frames, but I don't know how high they can go, maybe > they will saturate slightly above. My experience is also that the AMDs can easily do 10Gbit/s forwarding, but doing bidirectional they suffer... > Unfortunately, I only have 5 NICs in 3 machines and no switch (and CX4 > is hard to find these days), so I'm probably stuck at 10 Gbps max. We are a fiber company, so I'm using our spare 10G optics, but I'm limited by our supply of SFP+ currently. I'll be getting two 6 port 10GbE NIC using PCIe2 x16 82599, in august, so it will be interesting how high we can go! :-) > Interestingly, I had the impression that forwarding data with TCP > splicing costs less CPU than IP forwarding, because the NICs can do > LRO. 
> Also, I know a French service provider who uses haproxy on Core i7
> machines and who has already reached 5 Gbps of sustained traffic
> with recent Intel dual-port NICs (though I'm not sure exactly which
> ones). This is with very little CPU usage too, less than 2-3% user
> and 15% system+softirq. On previous machines (quad-core Xeons), it
> was impossible to go beyond 3 Gbps; it looked like the chipset was
> the limiting factor too (though I don't precisely remember which
> one it was).
>
> I really blamed the NICs because this guy's machine was about 4 times
> more powerful than mine, but apparently it was just a chipset issue.
>
> I also happen to have a customer who recently received a few Sun NXGE
> NICs, mounted in Sun x2100-m2 machines using an NVIDIA chipset, which I
> tested OK at 10 Gbps with my myri10GE NICs. I'll try to see if I can
> run some tests there, as Davem once said those NICs are really good too.

The Sun NIU NIC has to use several hardware queues to achieve 10GbE.
I'm currently using these as generators, and that's one of my limiting
factors.

> All in all, I find it really cool that our beloved OS scales that
> well with the hardware :-)

Yes, it's really amazing how well the Linux net stack scales. I think
the primary thanks for this effort goes to DaveM's multiqueue changes
and Eric Dumazet's tuning.

ps. I'll be offline until Tuesday.

--
Med venlig hilsen / Best regards
Jesper Brouer
ComX Networks A/S
Linux Network developer
Cand. Scient Datalog / MSc.
Author of http://adsl-optimizer.dk
LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply [flat|nested] 7+ messages in thread