* NAPI, rx_no_buffer_count, e1000, r8169 and other actors
@ 2008-06-15 20:24 Denys Fedoryshchenko
2008-06-15 20:57 ` Francois Romieu
2008-06-15 23:46 ` Ben Hutchings
0 siblings, 2 replies; 7+ messages in thread
From: Denys Fedoryshchenko @ 2008-06-15 20:24 UTC (permalink / raw)
To: netdev
Hi,

Since I am using PC routers for my network and I am reaching significant
(for me, at least) traffic levels, I have started noticing minor problems,
so all of this is about networking performance in my case.

For example:
A Sun server, AMD based (two CPUs - AMD Opteron(tm) Processor 248), with an
e1000 connected over PCI-X ([ 4.919249] e1000: 0000:01:01.0: e1000_probe:
(PCI-X:100MHz:64-bit) 00:14:4f:20:89:f4).
All traffic is processed over eth0 with 5 VLANs, a 1-second average of around
110-200 Mbps of traffic. The host is also running conntrack (max 1000000
entries; when the packet loss happens there are around 256k entries) and has
around 1300 routes (FIB_TRIE). What worries me is this: OK, I buy time by
increasing the RX descriptors from 256 to 4096, but how much time do I buy?
If it "cracks" at 100 Mbps RX, does it mean, interpolating the descriptor
increase from 256 to 4096 (4 times), that I cannot process more than 400 Mbps
RX?
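
As a rough cross-check of that reasoning, here is the back-of-the-envelope
arithmetic (a sketch only; the ~500-byte average packet size is an assumption,
and the rate is the upper end of the figures above):

  echo $(( 200000000 / (500 * 8) ))    # -> 50000 packets/s at 200 Mbit/s, i.e. ~50 packets per ms
  echo "scale=2; 256 / 50" | bc        # -> 5.12 ms for a 256-entry RX ring to fill
  echo "scale=2; 4096 / 50" | bc       # -> 81.92 ms for a 4096-entry ring

So the ring size bounds how long the poll may be delayed before packets are
dropped; it does not by itself raise the rate the CPU can sustain.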
The CPU is not that busy after all... maybe there is some parameter I can
change to force NAPI to poll the interface more often?

I tried nice, changing the realtime priority to FIFO, and making the kernel
preemptible... no luck, except for increasing the descriptors.
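
For reference, the knobs involved look roughly like this (a sketch; ethtool
-g/-G only works where the driver implements the ring operations, and
net.core.netdev_budget bounds packets per softirq pass rather than how often
polling happens):

  ethtool -g eth0                        # show current and maximum RX/TX ring sizes
  ethtool -G eth0 rx 4096                # grow the RX ring towards the hardware maximum
  sysctl net.core.netdev_budget          # max packets handled in one NET_RX softirq run
  sysctl -w net.core.netdev_budget=600   # raising it lets one softirq pass drain more, at some latency cost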
Router-Dora ~ # mpstat -P ALL 1
Linux 2.6.26-rc6-git2-build-0029 (Router-Dora) 06/15/08
22:51:02     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
22:51:03     all    1.00    0.00    0.00    0.00    2.50   29.00    0.00   67.50  12927.00
22:51:03       0    2.00    0.00    0.00    0.00    4.00   59.00    0.00   35.00  11935.00
22:51:03       1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    993.00
22:51:03       2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
PID PPID USER STAT VSZ %MEM %CPU COMMAND
1544 1 root S 5824 0.2 0.0 /usr/sbin/snmpd -c /config/snmpd.conf
1530 1 squid S 2880 0.1 0.0 /usr/sbin/ripd -d
1524 1 squid S 2740 0.1 0.0 /usr/sbin/zebra -d
1 0 root S 2384 0.1 0.0 /bin/sh /init
1576 1115 root S 2384 0.1 0.0 /sbin/getty 38400 tty1
1577 1115 root S 2384 0.1 0.0 /sbin/getty 38400 tty2
1581 1115 root S 2384 0.1 0.0 /sbin/getty 38400 tty3
I have another host running, a Core 2 Duo with e1000e + 3 x e100, also with
conntrack, the same kernel configuration and a similar amount of traffic but
a higher load (ifb + plenty of shapers running) - almost no errors with the
default settings.
Linux 2.6.26-rc6-git2-build-0029 (Kup) 06/16/08
07:00:27     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
07:00:28     all    0.00    0.00    0.50    0.00    4.00   31.50    0.00   64.00  32835.00
07:00:29     all    0.00    0.00    0.50    0.00    2.50   29.00    0.00   68.00  33164.36
The third host has an r8169 (PCI! This is important, since it seems I am
running out of PCI capacity) with about 400 Mbit/s of combined rx+tx load,
plus an e1000e interface with around 200 Mbps of load. What worries me is the
interrupt rate, which seems to be generated by the Realtek card... is there
any way to bring it down?

There is also some packet loss, around 0.0005% (I would prefer a clean
zero :-) ). No NAT, no shapers, same kernel configuration.
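
(For reference, interrupt moderation is normally inspected and adjusted like
this - a sketch only; whether the settings are honoured depends entirely on
the driver, and as it turns out later in the thread the r8169 driver of this
era does not expose them:

  ethtool -c eth0                 # current interrupt-coalescing settings, if any
  ethtool -C eth0 rx-usecs 100    # delay RX interrupts by up to 100 us to batch packets
)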
17:36:51     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
17:36:52     all    0.50    0.00    0.50    0.00    1.49   32.34    0.00   65.17  88993.07
17:36:53     all    0.00    0.00    0.50    0.00    0.50   32.00    0.00   67.00  88655.00
17:36:54     all    0.00    0.00    0.50    0.00    1.49   31.84    0.00   66.17  89484.00
MegaRouter-KARAM ~ # cat /proc/interrupts ;sleep 10;cat /proc/interrupts
CPU0 CPU1
0: 806263699 0 IO-APIC-edge timer
1: 2 0 IO-APIC-edge i8042
9: 0 0 IO-APIC-fasteoi acpi
12: 5 0 IO-APIC-edge i8042
16: 0 0 IO-APIC-fasteoi uhci_hcd:usb3
18: 0 0 IO-APIC-fasteoi ehci_hcd:usb1, uhci_hcd:usb7
19: 0 0 IO-APIC-fasteoi uhci_hcd:usb6
21: 1191830952 0 IO-APIC-fasteoi uhci_hcd:usb4, eth0
23: 1245 0 IO-APIC-fasteoi ehci_hcd:usb2, uhci_hcd:usb5
217: 3 1584682152 PCI-MSI-edge eth1
NMI: 806263639 806263443 Non-maskable interrupts
LOC: 0 806263442 Local timer interrupts
RES: 99130 71199 Rescheduling interrupts
CAL: 62651 3871 function call interrupts
TLB: 239 187 TLB shootdowns
TRM: 0 0 Thermal event interrupts
SPU: 0 0 Spurious interrupts
ERR: 0
MIS: 0
CPU0 CPU1
0: 806273702 0 IO-APIC-edge timer
1: 2 0 IO-APIC-edge i8042
9: 0 0 IO-APIC-fasteoi acpi
12: 5 0 IO-APIC-edge i8042
16: 0 0 IO-APIC-fasteoi uhci_hcd:usb3
18: 0 0 IO-APIC-fasteoi ehci_hcd:usb1, uhci_hcd:usb7
19: 0 0 IO-APIC-fasteoi uhci_hcd:usb6
21: 1192549139 0 IO-APIC-fasteoi uhci_hcd:usb4, eth0
23: 1245 0 IO-APIC-fasteoi ehci_hcd:usb2, uhci_hcd:usb5
217: 3 1584840861 PCI-MSI-edge eth1
NMI: 806273642 806273446 Non-maskable interrupts
LOC: 0 806273445 Local timer interrupts
RES: 99130 71199 Rescheduling interrupts
CAL: 62653 3871 function call interrupts
TLB: 239 187 TLB shootdowns
TRM: 0 0 Thermal event interrupts
SPU: 0 0 Spurious interrupts
ERR: 0
MIS: 0
--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.
^ permalink raw reply [flat|nested] 7+ messages in thread

* Re: NAPI, rx_no_buffer_count, e1000, r8169 and other actors
2008-06-15 20:24 NAPI, rx_no_buffer_count, e1000, r8169 and other actors Denys Fedoryshchenko
@ 2008-06-15 20:57 ` Francois Romieu
2008-06-15 21:32 ` Denys Fedoryshchenko
2008-06-15 21:32 ` Denys Fedoryshchenko
2008-06-15 23:46 ` Ben Hutchings
1 sibling, 2 replies; 7+ messages in thread
From: Francois Romieu @ 2008-06-15 20:57 UTC (permalink / raw)
To: Denys Fedoryshchenko; +Cc: netdev

Denys Fedoryshchenko <denys@visp.net.lb> :
[...]
> Third host r8169 (PCI! This is important, seems i am running out of PCI
> capacity), 400Mbit/s rx+tx summary load, e1000e interface also - around

400 rx + 400 tx or 200 rx + 200 tx ?
Can you specify the packet rate and the cpu ?

> 200Mbps load. What is worrying me - interrupts rate, it seems generated by
> realtek card... is there any way to drop it down?
>
> Also some packetloss, around 0.0005% (i prefer to have clear zero :-)) ). No
> nat, no shapers, same kernel configuration.

Can you send an ethtool -S (+ ifconfig) of the 8169 if it misses
packets as well as the lines of dmesg which relate to the r8169
driver ?

--
Ueimor

^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: NAPI, rx_no_buffer_count, e1000, r8169 and other actors
2008-06-15 20:57 ` Francois Romieu
@ 2008-06-15 21:32 ` Denys Fedoryshchenko
2008-06-15 21:32 ` Denys Fedoryshchenko
1 sibling, 0 replies; 7+ messages in thread
From: Denys Fedoryshchenko @ 2008-06-15 21:32 UTC (permalink / raw)
To: Francois Romieu; +Cc: netdev

On Sun, 15 Jun 2008 22:57:10 +0200, Francois Romieu wrote
> Denys Fedoryshchenko <denys@visp.net.lb> :
> [...]
> > Third host r8169 (PCI! This is important, seems i am running out of PCI
> > capacity), 400Mbit/s rx+tx summary load, e1000e interface also - around
>
> 400 rx + 400 tx or 200 rx + 200 tx ?
> Can you specify the packet rate and the cpu ?

On this host 275 Mbps TX right now, 152 Mbps RX. After 3 minute uptime:

eth0      Link encap:Ethernet  HWaddr 00:18:F8:0B:46:A6
          inet addr:192.168.20.10  Bcast:0.0.0.0  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:9510755 errors:0 dropped:400 overruns:0 frame:0
          TX packets:9601889 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:10000
          RX bytes:3768549053 (3.5 GiB)  TX bytes:2251698126 (2.0 GiB)
          Interrupt:21 Base address:0x4000

MegaRouter-KARAM ~ # ethtool -S eth0
NIC statistics:
     tx_packets: 10336831
     rx_packets: 10191781
     tx_errors: 0
     rx_errors: 0
     rx_missed: 436
     align_errors: 0
     tx_single_collisions: 0
     tx_multi_collisions: 0
     unicast: 10183249
     broadcast: 971
     multicast: 7561
     tx_aborted: 0
     tx_underrun: 0

MegaRouter-KARAM ~ # mpstat -P ALL 1
Linux 2.6.26-rc6-git2-build-0029 (MegaRouter-KARAM)   06/16/08

00:32:08     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
00:32:09     all    0.50    0.00    1.49    0.00    1.49   27.23    0.00   69.31  76659.41
00:32:09       0    1.01    0.00    0.00    0.00    0.00   43.43    0.00   55.56  61549.50
00:32:09       1    0.00    0.00    1.98    0.00    2.97   10.89    0.00   84.16  15102.97
00:32:09       2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00

>
> > 200Mbps load. What is worrying me - interrupts rate, it seems generated by
> > realtek card... is there any way to drop it down?
> >
> > Also some packetloss, around 0.0005% (i prefer to have clear zero :-)) ). No
> > nat, no shapers, same kernel configuration.
>
> Can you send an ethtool -S (+ ifconfig) of the 8169 if it misses
> packets as well as the lines of dmesg which relate to the r8169
> driver ?
>
> --
> Ueimor

--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.

^ permalink raw reply [flat|nested] 7+ messages in thread
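
(A quick cross-check of the counters shown above - a sketch; it assumes
rx_missed is the relevant drop counter for this NIC, i.e. frames the chip
could not buffer in time:

  echo "scale=6; 436 * 100 / 10191781" | bc   # rx_missed vs rx_packets -> about .004278 %
  echo "scale=6; 400 * 100 / 9510755" | bc    # ifconfig dropped vs RX packets -> about .004205 %
)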
* Re: NAPI, rx_no_buffer_count, e1000, r8169 and other actors
2008-06-15 20:57 ` Francois Romieu
2008-06-15 21:32 ` Denys Fedoryshchenko
@ 2008-06-15 21:32 ` Denys Fedoryshchenko
1 sibling, 0 replies; 7+ messages in thread
From: Denys Fedoryshchenko @ 2008-06-15 21:32 UTC (permalink / raw)
To: Francois Romieu; +Cc: netdev

Very sorry, forgot dmesg:

[    3.070955] r8169 Gigabit Ethernet driver 2.2LK-NAPI loaded
[    3.070972] ACPI: PCI Interrupt 0000:07:00.0[A] -> GSI 21 (level, low) -> IRQ 21
[    3.071582] eth0: RTL8110s at 0xf8894000, 00:18:f8:0b:46:a6, XID 04000000 IRQ 21

On Sun, 15 Jun 2008 22:57:10 +0200, Francois Romieu wrote
> Denys Fedoryshchenko <denys@visp.net.lb> :
> [...]
> > Third host r8169 (PCI! This is important, seems i am running out of PCI
> > capacity), 400Mbit/s rx+tx summary load, e1000e interface also - around
>
> 400 rx + 400 tx or 200 rx + 200 tx ?
> Can you specify the packet rate and the cpu ?
>
> > 200Mbps load. What is worrying me - interrupts rate, it seems generated by
> > realtek card... is there any way to drop it down?
> >
> > Also some packetloss, around 0.0005% (i prefer to have clear zero :-)) ). No
> > nat, no shapers, same kernel configuration.
>
> Can you send an ethtool -S (+ ifconfig) of the 8169 if it misses
> packets as well as the lines of dmesg which relate to the r8169
> driver ?
>
> --
> Ueimor

--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.

^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: NAPI, rx_no_buffer_count, e1000, r8169 and other actors
2008-06-15 20:24 NAPI, rx_no_buffer_count, e1000, r8169 and other actors Denys Fedoryshchenko
2008-06-15 20:57 ` Francois Romieu
@ 2008-06-15 23:46 ` Ben Hutchings
2008-06-16 2:59 ` Stephen Hemminger
1 sibling, 1 reply; 7+ messages in thread
From: Ben Hutchings @ 2008-06-15 23:46 UTC (permalink / raw)
To: Denys Fedoryshchenko; +Cc: netdev

Denys Fedoryshchenko wrote:
> Hi
>
> Since i am using PC routers for my network, and i reach significant numbers
> (for me significant) i start noticing minor problems. So all this talk about
> networking performance in my case.
>
> For example.
> Sun server, AMD based (two CPU - AMD Opteron(tm) Processor 248).
> e1000 connected over PCI-X ([ 4.919249] e1000: 0000:01:01.0: e1000_probe:
> (PCI-X:100MHz:64-bit) 00:14:4f:20:89:f4)
>
> All traffic processed over eth0, 5 VLAN, 1 second average around 110-200Mbps

Currently TX checksum offload does not work for VLAN devices, which may
be a serious performance hit if there is a lot of traffic routed between
VLANs. This should change in 2.6.27 for some drivers, which I think will
include e1000.

> of traffic. Host running also conntrack (max 1000000 entries, when packetloss
> happen - around 256k entries). Around 1300 routes (FIB_TRIE) running. What is
> worrying me, that ok, i win time by increasing rx descriptors from 256 to
> 4096, but how much time i win? if it "cracks" on 100 Mbps RX, it means by
> interpolating descriptors increase from 256 to 4096 (4 times), i cannot
> process more than 400Mbps RX?

Increasing the RX descriptor ring size should give the driver and stack
more time to catch up after handling some packets that take unusually
long. It may also allow you to increase interrupt moderation, which
will reduce the per-packet cost.

> The CPU is not so busy after all... maybe there is a way to change some
> parameter to force NAPI poll interface more often?

NAPI polling is not time-based, except indirectly through interrupt
moderation.

> I tried nice, changing realtime priority to FIFO, changing kernel to
> preemptible... no luck, except increasing descriptors.
>
> Router-Dora ~ # mpstat -P ALL 1
> Linux 2.6.26-rc6-git2-build-0029 (Router-Dora) 06/15/08
>
> 22:51:02     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
> 22:51:03     all    1.00    0.00    0.00    0.00    2.50   29.00    0.00   67.50  12927.00
> 22:51:03       0    2.00    0.00    0.00    0.00    4.00   59.00    0.00   35.00  11935.00
> 22:51:03       1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    993.00
> 22:51:03       2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00

You might do better with a NIC that supports MSI-X. This allows the use of
two RX queues with their own IRQs, each handled by a different processor.
As it is, one CPU is completely idle. However, I don't know how well the
other work of routing scales to multiple processors.

[...]
> I have another host running, Core 2 Duo, e1000e+3 x e100, also conntrack, same
> kernel configuration and similar amount of traffic, higher load (ifb + plenty
> of shapers running) - almost no errors on default settings.
> Linux 2.6.26-rc6-git2-build-0029 (Kup) 06/16/08
>
> 07:00:27     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
> 07:00:28     all    0.00    0.00    0.50    0.00    4.00   31.50    0.00   64.00  32835.00
> 07:00:29     all    0.00    0.00    0.50    0.00    2.50   29.00    0.00   68.00  33164.36
>
> Third host r8169 (PCI! This is important, seems i am running out of PCI
> capacity),

Gigabit Ethernet on plain old PCI is not ideal. If each card has a
separate route to the south bridge then you might be able to get a fair
fraction of a gigabit between them though.

> 400Mbit/s rx+tx summary load, e1000e interface also - around
> 200Mbps load. What is worrying me - interrupts rate, it seems generated by
> realtek card... is there any way to drop it down?
[...]

ethtool -C lets you change interrupt moderation. I don't know anything
about this driver or NIC's capabilities, but it does seem to be in the
cheapest GbE cards so I wouldn't expect outstanding performance.

Ben.

--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.

^ permalink raw reply [flat|nested] 7+ messages in thread
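
(Following up on the MSI-X / idle-CPU remark above: without multiqueue
hardware, the only real lever is IRQ and process placement. A minimal sketch -
IRQ 21 is taken from the /proc/interrupts dump earlier in the thread, adjust
to the machine at hand:

  echo 1 > /proc/irq/21/smp_affinity     # bitmask: keep eth0's IRQ and its softirq work on CPU0
  taskset -pc 1 $(pidof snmpd)           # park userspace daemons on the other CPU

With a single RX queue this only chooses which CPU carries the load; actually
splitting it across CPUs is exactly what the MSI-X suggestion would buy.)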
* Re: NAPI, rx_no_buffer_count, e1000, r8169 and other actors
2008-06-15 23:46 ` Ben Hutchings
@ 2008-06-16 2:59 ` Stephen Hemminger
2008-06-16 4:05 ` Denys Fedoryshchenko
0 siblings, 1 reply; 7+ messages in thread
From: Stephen Hemminger @ 2008-06-16 2:59 UTC (permalink / raw)
To: Ben Hutchings; +Cc: Denys Fedoryshchenko, netdev

On Mon, 16 Jun 2008 00:46:22 +0100
Ben Hutchings <bhutchings@solarflare.com> wrote:

> Denys Fedoryshchenko wrote:
> > Hi
> >
> > Since i am using PC routers for my network, and i reach significant numbers
> > (for me significant) i start noticing minor problems. So all this talk about
> > networking performance in my case.
> >
> > For example.
> > Sun server, AMD based (two CPU - AMD Opteron(tm) Processor 248).
> > e1000 connected over PCI-X ([ 4.919249] e1000: 0000:01:01.0: e1000_probe:
> > (PCI-X:100MHz:64-bit) 00:14:4f:20:89:f4)
> >
> > All traffic processed over eth0, 5 VLAN, 1 second average around 110-200Mbps
>
> Currently TX checksum offload does not work for VLAN devices, which may
> be a serious performance hit if there is a lot of traffic routed between
> VLANs. This should change in 2.6.27 for some drivers, which I think will
> include e1000.
>
> > of traffic. Host running also conntrack (max 1000000 entries, when packetloss
> > happen - around 256k entries). Around 1300 routes (FIB_TRIE) running. What is
> > worrying me, that ok, i win time by increasing rx descriptors from 256 to
> > 4096, but how much time i win? if it "cracks" on 100 Mbps RX, it means by
> > interpolating descriptors increase from 256 to 4096 (4 times), i cannot
> > process more than 400Mbps RX?

You are CPU limited because of the overhead of firewalling. When this
happens packets get backlogged.

> Increasing the RX descriptor ring size should give the driver and stack
> more time to catch up after handling some packets that take unusually
> long. It may also allow you to increase interrupt moderation, which
> will reduce the per-packet cost.

No, if the receive side is CPU limited, you just end up eating more memory.
A bigger queue may actually make performance worse (fewer cache hits).

> > The CPU is not so busy after all... maybe there is a way to change some
> > parameter to force NAPI poll interface more often?
>
> NAPI polling is not time-based, except indirectly through interrupt
> moderation.

How are you measuring CPU? You need to do something like measure the
available cycles left for applications. Don't believe top or other
measures that may not reflect I/O overhead and bus usage.

> > I tried nice, changing realtime priority to FIFO, changing kernel to
> > preemptible... no luck, except increasing descriptors.
> >
> > Router-Dora ~ # mpstat -P ALL 1
> > Linux 2.6.26-rc6-git2-build-0029 (Router-Dora) 06/15/08
> >
> > 22:51:02     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
> > 22:51:03     all    1.00    0.00    0.00    0.00    2.50   29.00    0.00   67.50  12927.00
> > 22:51:03       0    2.00    0.00    0.00    0.00    4.00   59.00    0.00   35.00  11935.00
> > 22:51:03       1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    993.00
> > 22:51:03       2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
>
> You might do better with a NIC that supports MSI-X. This allows the use of
> two RX queues with their own IRQs, each handled by a different processor.
> As it is, one CPU is completely idle. However, I don't know how well the
> other work of routing scales to multiple processors.

Routing and firewalling should scale well. The deadlock is probably going
to be some hot lock like the transmit lock.

> [...]
> > I have another host running, Core 2 Duo, e1000e+3 x e100, also conntrack, same
> > kernel configuration and similar amount of traffic, higher load (ifb + plenty
> > of shapers running) - almost no errors on default settings.
> > Linux 2.6.26-rc6-git2-build-0029 (Kup) 06/16/08
> >
> > 07:00:27     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
> > 07:00:28     all    0.00    0.00    0.50    0.00    4.00   31.50    0.00   64.00  32835.00
> > 07:00:29     all    0.00    0.00    0.50    0.00    2.50   29.00    0.00   68.00  33164.36
> >
> > Third host r8169 (PCI! This is important, seems i am running out of PCI
> > capacity),
>
> Gigabit Ethernet on plain old PCI is not ideal. If each card has a
> separate route to the south bridge then you might be able to get a fair
> fraction of a gigabit between them though.
>
> > 400Mbit/s rx+tx summary load, e1000e interface also - around
> > 200Mbps load. What is worrying me - interrupts rate, it seems generated by
> > realtek card... is there any way to drop it down?
> [...]
>
> ethtool -C lets you change interrupt moderation. I don't know anything
> about this driver or NIC's capabilities but it does seem to be in the
> cheapest GbE cards so I wouldn't expect outstanding performance.
>
> Ben.

The bigger issue is available memory bandwidth. Different processors
and busses have different overheads. PCI is much worse than PCI-express,
and CPUs with integrated memory controllers do much better than CPUs
with a separate memory controller (like Core 2).

^ permalink raw reply [flat|nested] 7+ messages in thread
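
(On the point about measuring CPU: one crude way to see the cycles genuinely
left over for applications is a low-priority soak loop, run once on an idle
box and once under peak traffic; the relative drop in iterations approximates
what the packet path really costs, including overhead that top/mpstat can
misattribute. A bash sketch, with an arbitrary 10-second window:

  nice -n 19 bash -c 'n=0; SECONDS=0; while (( SECONDS < 10 )); do (( n++ )); done; echo "$n iterations in 10 s"'
)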
* Re: NAPI, rx_no_buffer_count, e1000, r8169 and other actors
2008-06-16 2:59 ` Stephen Hemminger
@ 2008-06-16 4:05 ` Denys Fedoryshchenko
0 siblings, 0 replies; 7+ messages in thread
From: Denys Fedoryshchenko @ 2008-06-16 4:05 UTC (permalink / raw)
To: Stephen Hemminger, Ben Hutchings; +Cc: netdev

On Sun, 15 Jun 2008 19:59:18 -0700, Stephen Hemminger wrote
> On Mon, 16 Jun 2008 00:46:22 +0100
> Ben Hutchings <bhutchings@solarflare.com> wrote:
>
> > Denys Fedoryshchenko wrote:
> > > Hi
> > >
> > > Since i am using PC routers for my network, and i reach significant numbers
> > > (for me significant) i start noticing minor problems. So all this talk about
> > > networking performance in my case.
> > >
> > > For example.
> > > Sun server, AMD based (two CPU - AMD Opteron(tm) Processor 248).
> > > e1000 connected over PCI-X ([ 4.919249] e1000: 0000:01:01.0: e1000_probe:
> > > (PCI-X:100MHz:64-bit) 00:14:4f:20:89:f4)
> > >
> > > All traffic processed over eth0, 5 VLAN, 1 second average around 110-200Mbps
> >
> > Currently TX checksum offload does not work for VLAN devices, which may
> > be a serious performance hit if there is a lot of traffic routed between
> > VLANs. This should change in 2.6.27 for some drivers, which I think will
> > include e1000.

Probably it matters for weak CPUs, or, in my case, for a really large
amount of traffic.

>
> > > of traffic. Host running also conntrack (max 1000000 entries, when packetloss
> > > happen - around 256k entries). Around 1300 routes (FIB_TRIE) running. What is
> > > worrying me, that ok, i win time by increasing rx descriptors from 256 to
> > > 4096, but how much time i win? if it "cracks" on 100 Mbps RX, it means by
> > > interpolating descriptors increase from 256 to 4096 (4 times), i cannot
> > > process more than 400Mbps RX?
>
> You are CPU limited because of the overhead of firewalling. When this
> happens packets get backlogged.

I tried to increase net.core.netdev_max_backlog; it doesn't help and doesn't
change anything at all.

But it looks like this: if I have 200 Mbps RX with an average packet of 500
bytes, I have a 50 Kpps rate. The RX descriptor ring is 256 packets and about
50 packets arrive each millisecond, so if a poll is delayed by more than
~5 ms I miss packets - or if it doesn't complete all packets in one softirq
cycle. Probably I understand something (or everything) wrong.

But firewalling should not be a big deal, since I am not using anything
"heavy" like L7 filtering. I will try to optimize the rules, like I did once
with u32 hashing, so that most packets will not pass through a long chain.
And there are only around 29 rules in filter, 63 in NAT and 20 in mangle -
not much, I guess.

> > Increasing the RX descriptor ring size should give the driver and stack
> > more time to catch up after handling some packets that take unusually
> > long. It may also allow you to increase interrupt moderation, which
> > will reduce the per-packet cost.
>
> No, if the receive side is CPU limited, you just end up eating more memory.
> A bigger queue may actually make performance worse (fewer cache hits).

That's a very good point. The e1000 / AMD host has a cache size of 1024 KB,
and both Core 2 Duo routers have 4096 KB (shared?).

> > > The CPU is not so busy after all... maybe there is a way to change some
> > > parameter to force NAPI poll interface more often?
> >
> > NAPI polling is not time-based, except indirectly through interrupt
> > moderation.
>
> How are you measuring CPU? You need to do something like measure the
> available cycles left for applications. Don't believe top or other
> measures that may not reflect I/O overhead and bus usage.

Probably mpstat gives correct results? I never use top, other than to find a
clear CPU-hogging userspace app.

Router-Dora ~ # mpstat 1
Linux 2.6.26-rc6-git2-build-0029 (Router-Dora) 06/16/08

06:31:19     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
06:31:20     all    0.00    0.00    0.00    0.00    1.51    8.04    0.00   90.45  13570.30
06:31:21     all    0.00    0.00    0.00    0.00    2.49    9.95    0.00   87.56  13986.00
06:31:22     all    0.00    0.00    0.50    0.00    2.49    9.45    0.00   87.56  14364.00

>
> > > I tried nice, changing realtime priority to FIFO, changing kernel to
> > > preemptible... no luck, except increasing descriptors.
> > >
> > > Router-Dora ~ # mpstat -P ALL 1
> > > Linux 2.6.26-rc6-git2-build-0029 (Router-Dora) 06/15/08
> > >
> > > 22:51:02     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
> > > 22:51:03     all    1.00    0.00    0.00    0.00    2.50   29.00    0.00   67.50  12927.00
> > > 22:51:03       0    2.00    0.00    0.00    0.00    4.00   59.00    0.00   35.00  11935.00
> > > 22:51:03       1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    993.00
> > > 22:51:03       2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> >
> > You might do better with a NIC that supports MSI-X. This allows the use of
> > two RX queues with their own IRQs, each handled by a different processor.
> > As it is, one CPU is completely idle. However, I don't know how well the
> > other work of routing scales to multiple processors.
>
> Routing and firewalling should scale well. The deadlock is probably going
> to be some hot lock like the transmit lock.

I tried to change the TX queue length. If I make it too small, it will just
drop packets _silently_ - they are not shown in netstat -s nor in the
ifconfig stats. That is what I reported before.

>
> > [...]
> > > I have another host running, Core 2 Duo, e1000e+3 x e100, also conntrack, same
> > > kernel configuration and similar amount of traffic, higher load (ifb + plenty
> > > of shapers running) - almost no errors on default settings.
> > > Linux 2.6.26-rc6-git2-build-0029 (Kup) 06/16/08
> > >
> > > 07:00:27     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
> > > 07:00:28     all    0.00    0.00    0.50    0.00    4.00   31.50    0.00   64.00  32835.00
> > > 07:00:29     all    0.00    0.00    0.50    0.00    2.50   29.00    0.00   68.00  33164.36
> > >
> > > Third host r8169 (PCI! This is important, seems i am running out of PCI
> > > capacity),
> >
> > Gigabit Ethernet on plain old PCI is not ideal. If each card has a
> > separate route to the south bridge then you might be able to get a fair
> > fraction of a gigabit between them though.

I think in this case the r8169 is routed over a PCI-to-PCI-Express bridge,
the other card is PCI Express, and there is nothing else on PCI (other than
an IDE controller which is not used at all). Yes, it is bad, but it should
still give 133 Mbyte/s (1064 Mbit/s). Yes, I know there is overhead, but can
I probably expect a 500-800 Mbps total bandwidth limit? (Rough bus numbers
below.)

> >
> > > 400Mbit/s rx+tx summary load, e1000e interface also - around
> > > 200Mbps load. What is worrying me - interrupts rate, it seems generated by
> > > realtek card... is there any way to drop it down?
> > [...]
> >
> > ethtool -C lets you change interrupt moderation. I don't know anything
> > about this driver or NIC's capabilities but it does seem to be in the
> > cheapest GbE cards so I wouldn't expect outstanding performance.
> >
> > Ben.

Well, the Realtek 8169 doesn't support changing the ring, and doesn't
support changing the coalescing parameters. By the way, e1000 also doesn't
support -C, but e1000e does. Is it a new way of forcing people to buy newer
adapters? :-)

>
> The bigger issue is available memory bandwidth. Different processors
> and busses have different overheads. PCI is much worse than PCI-express,
> and CPUs with integrated memory controllers do much better than CPUs
> with a separate memory controller (like Core 2).

Yes, but in my case the Core 2 does a heavier job much better, probably
because of the larger cache or some voodoo magic.

The biggest issue is that in this country it is not possible to find a
PCI-Express network adapter - even a Realtek 8169. It is just unbelievable
that the WHOLE country has a very limited stock of PCI-Express adapters; a
few PCI-Express R8169s were lying on the shelf of the local Apple dealer a
month ago, and I remembered them too late.

--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.

^ permalink raw reply [flat|nested] 7+ messages in thread
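
(The raw numbers behind the PCI estimate in the last message, for
completeness - a sketch; it assumes a plain 32-bit/~33 MHz slot and ignores
arbitration and DMA-burst overhead, which usually eats a sizeable share:

  echo $(( 33000000 * 4 * 8 ))   # -> 1056000000 bit/s, ~1.05 Gbit/s theoretical, shared by RX and TX

so the quoted 500-800 Mbit/s of usable throughput is a plausible ceiling for
an r8169 doing rx+tx on that bus.)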
end of thread, other threads:[~2008-06-16 4:05 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-06-15 20:24 NAPI, rx_no_buffer_count, e1000, r8169 and other actors Denys Fedoryshchenko
2008-06-15 20:57 ` Francois Romieu
2008-06-15 21:32   ` Denys Fedoryshchenko
2008-06-15 21:32   ` Denys Fedoryshchenko
2008-06-15 23:46 ` Ben Hutchings
2008-06-16  2:59   ` Stephen Hemminger
2008-06-16  4:05     ` Denys Fedoryshchenko