* NAPI, rx_no_buffer_count, e1000, r8169 and other actors
@ 2008-06-15 20:24 Denys Fedoryshchenko
  2008-06-15 20:57 ` Francois Romieu
  2008-06-15 23:46 ` Ben Hutchings
  0 siblings, 2 replies; 7+ messages in thread
From: Denys Fedoryshchenko @ 2008-06-15 20:24 UTC (permalink / raw)
  To: netdev

Hi

Since I am using PC routers in my network, and I am reaching significant numbers
(significant for me), I have started noticing minor problems. Hence all this talk
about networking performance in my case.

For example:
Sun server, AMD based (two CPUs - AMD Opteron(tm) Processor 248).
e1000 connected over PCI-X ([    4.919249] e1000: 0000:01:01.0: e1000_probe:
(PCI-X:100MHz:64-bit) 00:14:4f:20:89:f4)

All traffic is processed over eth0 with 5 VLANs; the 1-second average is around
110-200 Mbps of traffic. The host is also running conntrack (max 1000000 entries;
around 256k entries when the packet loss happens) and around 1300 routes (FIB_TRIE).
What is worrying me: OK, I buy time by increasing RX descriptors from 256 to
4096, but how much time do I win? If it "cracks" at 100 Mbps RX, does that mean,
by extrapolating the descriptor increase from 256 to 4096, that I cannot process
more than about 400 Mbps RX?
The CPU is not that busy after all... maybe there is a way to change some
parameter to force NAPI to poll the interface more often?
I tried nice, changing the realtime priority to FIFO, making the kernel
preemptible... no luck; nothing helped except increasing the descriptors.
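
For what it's worth, the extrapolation can be sketched numerically (note that 256 -> 4096 is actually a 16x increase; the 500-byte average packet size is an assumption, and the linear scaling itself is exactly what is in doubt here):

```python
# Back-of-envelope: how long a full-rate burst takes to fill an empty RX ring.
# Assumes 500-byte average packets; purely illustrative numbers.

AVG_PKT_BYTES = 500

def ring_fill_time_ms(ring_size, mbps):
    """Milliseconds until a stalled poll overflows an RX ring at a given rate."""
    pps = mbps * 1_000_000 / (AVG_PKT_BYTES * 8)   # packets per second
    return ring_size / pps * 1000

for ring in (256, 4096):
    print(f"{ring:>4} descriptors: {ring_fill_time_ms(ring, 100):.2f} ms at 100 Mbps")
# The tolerance to a stalled poll scales linearly with ring size (16x here),
# but only if the stall duration, not the CPU, is the limiting factor.
```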

Router-Dora ~ # mpstat -P ALL 1
Linux 2.6.26-rc6-git2-build-0029 (Router-Dora)  06/15/08

22:51:02     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
22:51:03     all    1.00    0.00    0.00    0.00    2.50   29.00    0.00   67.50  12927.00
22:51:03       0    2.00    0.00    0.00    0.00    4.00   59.00    0.00   35.00  11935.00
22:51:03       1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    993.00
22:51:03       2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00

  PID  PPID USER     STAT   VSZ %MEM %CPU COMMAND
 1544     1 root     S     5824  0.2  0.0 /usr/sbin/snmpd -c /config/snmpd.conf
 1530     1 squid    S     2880  0.1  0.0 /usr/sbin/ripd -d
 1524     1 squid    S     2740  0.1  0.0 /usr/sbin/zebra -d
    1     0 root     S     2384  0.1  0.0 /bin/sh /init
 1576  1115 root     S     2384  0.1  0.0 /sbin/getty 38400 tty1
 1577  1115 root     S     2384  0.1  0.0 /sbin/getty 38400 tty2
 1581  1115 root     S     2384  0.1  0.0 /sbin/getty 38400 tty3



I have another host running: Core 2 Duo, e1000e + 3 x e100, also conntrack, the same
kernel configuration and a similar amount of traffic, but higher load (ifb + plenty
of shapers running) - almost no errors with default settings.
Linux 2.6.26-rc6-git2-build-0029 (Kup)  06/16/08

07:00:27     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
07:00:28     all    0.00    0.00    0.50    0.00    4.00   31.50    0.00   64.00  32835.00
07:00:29     all    0.00    0.00    0.50    0.00    2.50   29.00    0.00   68.00  33164.36

Third host: r8169 (PCI! This is important; it seems I am running out of PCI
capacity), around 400 Mbit/s rx+tx aggregate load, plus an e1000e interface with
around 200 Mbps load. What is worrying me is the interrupt rate; it seems to be
generated by the Realtek card... is there any way to bring it down?
There is also some packet loss, around 0.0005% (I'd prefer a clean zero :-) ). No
NAT, no shapers, same kernel configuration.

17:36:51     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
17:36:52     all    0.50    0.00    0.50    0.00    1.49   32.34    0.00   65.17  88993.07
17:36:53     all    0.00    0.00    0.50    0.00    0.50   32.00    0.00   67.00  88655.00
17:36:54     all    0.00    0.00    0.50    0.00    1.49   31.84    0.00   66.17  89484.00




MegaRouter-KARAM ~ # cat /proc/interrupts ;sleep 10;cat /proc/interrupts
           CPU0       CPU1
  0:  806263699          0   IO-APIC-edge      timer
  1:          2          0   IO-APIC-edge      i8042
  9:          0          0   IO-APIC-fasteoi   acpi
 12:          5          0   IO-APIC-edge      i8042
 16:          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 18:          0          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb7
 19:          0          0   IO-APIC-fasteoi   uhci_hcd:usb6
 21: 1191830952          0   IO-APIC-fasteoi   uhci_hcd:usb4, eth0
 23:       1245          0   IO-APIC-fasteoi   ehci_hcd:usb2, uhci_hcd:usb5
217:          3 1584682152   PCI-MSI-edge      eth1
NMI:  806263639  806263443   Non-maskable interrupts
LOC:          0  806263442   Local timer interrupts
RES:      99130      71199   Rescheduling interrupts
CAL:      62651       3871   function call interrupts
TLB:        239        187   TLB shootdowns
TRM:          0          0   Thermal event interrupts
SPU:          0          0   Spurious interrupts
ERR:          0
MIS:          0
           CPU0       CPU1
  0:  806273702          0   IO-APIC-edge      timer
  1:          2          0   IO-APIC-edge      i8042
  9:          0          0   IO-APIC-fasteoi   acpi
 12:          5          0   IO-APIC-edge      i8042
 16:          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 18:          0          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb7
 19:          0          0   IO-APIC-fasteoi   uhci_hcd:usb6
 21: 1192549139          0   IO-APIC-fasteoi   uhci_hcd:usb4, eth0
 23:       1245          0   IO-APIC-fasteoi   ehci_hcd:usb2, uhci_hcd:usb5
217:          3 1584840861   PCI-MSI-edge      eth1
NMI:  806273642  806273446   Non-maskable interrupts
LOC:          0  806273445   Local timer interrupts
RES:      99130      71199   Rescheduling interrupts
CAL:      62653       3871   function call interrupts
TLB:        239        187   TLB shootdowns
TRM:          0          0   Thermal event interrupts
SPU:          0          0   Spurious interrupts
ERR:          0
MIS:          0


--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.



* Re: NAPI, rx_no_buffer_count, e1000, r8169 and other actors
  2008-06-15 20:24 NAPI, rx_no_buffer_count, e1000, r8169 and other actors Denys Fedoryshchenko
@ 2008-06-15 20:57 ` Francois Romieu
  2008-06-15 21:32   ` Denys Fedoryshchenko
  2008-06-15 21:32   ` Denys Fedoryshchenko
  2008-06-15 23:46 ` Ben Hutchings
  1 sibling, 2 replies; 7+ messages in thread
From: Francois Romieu @ 2008-06-15 20:57 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: netdev

Denys Fedoryshchenko <denys@visp.net.lb> :
[...]
> Third host r8169 (PCI! This is important, seems i am running out of PCI
> capacity), 400Mbit/s rx+tx summary load, e1000e interface also - around

400 rx + 400 tx or 200 rx + 200 tx ?
Can you specify the packet rate and the cpu ?

> 200Mbps load. What is worrying me - interrupts rate, it seems generated by
> realtek card... is there any way to drop it down? 
>
> Also some packetloss, around 0.0005% (i prefer to have clear zero :-)) ). No
> nat, no shapers, same kernel configuration.

Can you send an ethtool -S (+ ifconfig) of the 8169 if it misses packets 
as well as the lines of dmesg which relate to the r8169 driver ?

-- 
Ueimor


* Re: NAPI, rx_no_buffer_count, e1000, r8169 and other actors
  2008-06-15 20:57 ` Francois Romieu
@ 2008-06-15 21:32   ` Denys Fedoryshchenko
  2008-06-15 21:32   ` Denys Fedoryshchenko
  1 sibling, 0 replies; 7+ messages in thread
From: Denys Fedoryshchenko @ 2008-06-15 21:32 UTC (permalink / raw)
  To: Francois Romieu; +Cc: netdev

On Sun, 15 Jun 2008 22:57:10 +0200, Francois Romieu wrote
> Denys Fedoryshchenko <denys@visp.net.lb> :
> [...]
> > Third host r8169 (PCI! This is important, seems i am running out of PCI
> > capacity), 400Mbit/s rx+tx summary load, e1000e interface also - around
> 
> 400 rx + 400 tx or 200 rx + 200 tx ?
> Can you specify the packet rate and the cpu ?

On this host: 275 Mbps TX right now, 152 Mbps RX.
After 3 minutes of uptime:
eth0      Link encap:Ethernet  HWaddr 00:18:F8:0B:46:A6
          inet addr:192.168.20.10  Bcast:0.0.0.0  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:9510755 errors:0 dropped:400 overruns:0 frame:0
          TX packets:9601889 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:10000
          RX bytes:3768549053 (3.5 GiB)  TX bytes:2251698126 (2.0 GiB)
          Interrupt:21 Base address:0x4000

MegaRouter-KARAM ~ # ethtool -S eth0
NIC statistics:
     tx_packets: 10336831
     rx_packets: 10191781
     tx_errors: 0
     rx_errors: 0
     rx_missed: 436
     align_errors: 0
     tx_single_collisions: 0
     tx_multi_collisions: 0
     unicast: 10183249
     broadcast: 971
     multicast: 7561
     tx_aborted: 0
     tx_underrun: 0

MegaRouter-KARAM ~ # mpstat -P ALL 1
Linux 2.6.26-rc6-git2-build-0029 (MegaRouter-KARAM)     06/16/08

00:32:08     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
00:32:09     all    0.50    0.00    1.49    0.00    1.49   27.23    0.00   69.31  76659.41
00:32:09       0    1.01    0.00    0.00    0.00    0.00   43.43    0.00   55.56  61549.50
00:32:09       1    0.00    0.00    1.98    0.00    2.97   10.89    0.00   84.16  15102.97
00:32:09       2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00



> 
> > 200Mbps load. What is worrying me - interrupts rate, it seems generated by
> > realtek card... is there any way to drop it down? 
> >
> > Also some packetloss, around 0.0005% (i prefer to have clear zero :-)) ). No
> > nat, no shapers, same kernel configuration.
> 
> Can you send an ethtool -S (+ ifconfig) of the 8169 if it misses 
> packets as well as the lines of dmesg which relate to the r8169 
> driver ?
> 
> -- 
> Ueimor


--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.



* Re: NAPI, rx_no_buffer_count, e1000, r8169 and other actors
  2008-06-15 20:57 ` Francois Romieu
  2008-06-15 21:32   ` Denys Fedoryshchenko
@ 2008-06-15 21:32   ` Denys Fedoryshchenko
  1 sibling, 0 replies; 7+ messages in thread
From: Denys Fedoryshchenko @ 2008-06-15 21:32 UTC (permalink / raw)
  To: Francois Romieu; +Cc: netdev

Very sorry, I forgot the dmesg:
[    3.070955] r8169 Gigabit Ethernet driver 2.2LK-NAPI loaded
[    3.070972] ACPI: PCI Interrupt 0000:07:00.0[A] -> GSI 21 (level, low) ->
IRQ 21
[    3.071582] eth0: RTL8110s at 0xf8894000, 00:18:f8:0b:46:a6, XID 04000000
IRQ 21


On Sun, 15 Jun 2008 22:57:10 +0200, Francois Romieu wrote
> Denys Fedoryshchenko <denys@visp.net.lb> :
> [...]
> > Third host r8169 (PCI! This is important, seems i am running out of PCI
> > capacity), 400Mbit/s rx+tx summary load, e1000e interface also - around
> 
> 400 rx + 400 tx or 200 rx + 200 tx ?
> Can you specify the packet rate and the cpu ?
> 
> > 200Mbps load. What is worrying me - interrupts rate, it seems generated by
> > realtek card... is there any way to drop it down? 
> >
> > Also some packetloss, around 0.0005% (i prefer to have clear zero :-)) ). No
> > nat, no shapers, same kernel configuration.
> 
> Can you send an ethtool -S (+ ifconfig) of the 8169 if it misses 
> packets as well as the lines of dmesg which relate to the r8169 
> driver ?
> 
> -- 
> Ueimor


--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.



* Re: NAPI, rx_no_buffer_count, e1000, r8169 and other actors
  2008-06-15 20:24 NAPI, rx_no_buffer_count, e1000, r8169 and other actors Denys Fedoryshchenko
  2008-06-15 20:57 ` Francois Romieu
@ 2008-06-15 23:46 ` Ben Hutchings
  2008-06-16  2:59   ` Stephen Hemminger
  1 sibling, 1 reply; 7+ messages in thread
From: Ben Hutchings @ 2008-06-15 23:46 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: netdev

Denys Fedoryshchenko wrote:
> Hi
> 
> Since i am using PC routers for my network, and i reach significant numbers
> (for me significant) i start noticing minor problems. So all this talk about
> networking performance in my case.
> 
> For example.
> Sun server, AMD based (two CPU -  AMD Opteron(tm) Processor 248).
> e1000 connected over PCI-X ([    4.919249] e1000: 0000:01:01.0: e1000_probe:
> (PCI-X:100MHz:64-bit) 00:14:4f:20:89:f4)
> 
> All traffic processed over eth0, 5 VLAN, 1 second average around 110-200Mbps

Currently TX checksum offload does not work for VLAN devices, which may
be a serious performance hit if there is a lot of traffic routed between
VLANs.  This should change in 2.6.27 for some drivers, which I think will
include e1000.

> of traffic. Host running also conntrack (max 1000000 entries, when packetloss
> happen - around 256k entries). Around 1300 routes (FIB_TRIE) running. What is
> worrying me, that ok, i win time by increasing rx descriptors from 256 to
> 4096, but how much time i win? if it "cracks" on 100 Mbps RX, it means by
> interpolating descriptors increase from 256 to 4096 (4 times), i cannot
> process more than 400Mbps RX?

Increasing the RX descriptor ring size should give the driver and stack
more time to catch up after handling some packets that take unusually
long.  It may also allow you to increase interrupt moderation, which
will reduce the per-packet cost.
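
To put illustrative numbers on the per-packet-cost point (the per-interrupt overhead below is a made-up placeholder, not a measurement of these hosts):

```python
# Interrupt moderation amortizes a fixed per-interrupt cost over many packets.
# IRQ_OVERHEAD_US is an invented figure purely to show the scaling.

PKT_RATE_PPS = 50_000     # e.g. ~200 Mbps of 500-byte packets
IRQ_OVERHEAD_US = 10.0    # hypothetical fixed cost per interrupt

def irq_cpu_share(max_irq_per_sec):
    """Fraction of one CPU spent purely on interrupt entry/exit."""
    irqs = min(max_irq_per_sec, PKT_RATE_PPS)  # at most one IRQ per packet
    return irqs * IRQ_OVERHEAD_US / 1_000_000

for rate in (50_000, 8_000, 1_000):
    print(f"{rate:>6} irq/s -> {irq_cpu_share(rate):.1%} of a CPU on IRQ overhead")
```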

> The CPU is not so busy after all... maybe there is a way to change some
> parameter to force NAPI poll interface more often?

NAPI polling is not time-based, except indirectly through interrupt
moderation.

> I tried nice, changing realtime priority to FIFO, changing kernel to
> preemptible... no luck, except increasing descriptors.
> 
> Router-Dora ~ # mpstat -P ALL 1
> Linux 2.6.26-rc6-git2-build-0029 (Router-Dora)  06/15/08
> 
> 22:51:02     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal  
> %idle    intr/s
> 22:51:03     all    1.00    0.00    0.00    0.00    2.50   29.00    0.00  
> 67.50  12927.00
> 22:51:03       0    2.00    0.00    0.00    0.00    4.00   59.00    0.00  
> 35.00  11935.00
> 22:51:03       1    0.00    0.00    0.00    0.00    0.00    0.00    0.00 
> 100.00    993.00
> 22:51:03       2    0.00    0.00    0.00    0.00    0.00    0.00    0.00   
> 0.00      0.00
 
You might do better with a NIC that supports MSI-X.  This allows the use of
two RX queues with their own IRQs, each handled by a different processor.
As it is, one CPU is completely idle.  However, I don't know how well the
other work of routing scales to multiple processors.

[...]
> I have another host running, Core 2 Duo, e1000e+3 x e100, also conntrack, same
> kernel configuration and similar amount of traffic, higher load (ifb + plenty
> of shapers running) - almost no errors on default settings.
> Linux 2.6.26-rc6-git2-build-0029 (Kup)  06/16/08
> 
> 07:00:27     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal  
> %idle    intr/s
> 07:00:28     all    0.00    0.00    0.50    0.00    4.00   31.50    0.00  
> 64.00  32835.00
> 07:00:29     all    0.00    0.00    0.50    0.00    2.50   29.00    0.00  
> 68.00  33164.36
> 
> Third host r8169 (PCI! This is important, seems i am running out of PCI
> capacity),

Gigabit Ethernet on plain old PCI is not ideal.  If each card has a
separate route to the south bridge then you might be able to get a fair
fraction of a gigabit between them though.

> 400Mbit/s rx+tx summary load, e1000e interface also - around
> 200Mbps load. What is worrying me - interrupts rate, it seems generated by
> realtek card... is there any way to drop it down? 
[...]

ethtool -C lets you change interrupt moderation.  I don't know anything
about this driver's or NIC's capabilities, but it does seem to be in the
cheapest GbE cards, so I wouldn't expect outstanding performance.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.


* Re: NAPI, rx_no_buffer_count, e1000, r8169 and other actors
  2008-06-15 23:46 ` Ben Hutchings
@ 2008-06-16  2:59   ` Stephen Hemminger
  2008-06-16  4:05     ` Denys Fedoryshchenko
  0 siblings, 1 reply; 7+ messages in thread
From: Stephen Hemminger @ 2008-06-16  2:59 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Denys Fedoryshchenko, netdev

On Mon, 16 Jun 2008 00:46:22 +0100
Ben Hutchings <bhutchings@solarflare.com> wrote:

> Denys Fedoryshchenko wrote:
> > Hi
> > 
> > Since i am using PC routers for my network, and i reach significant numbers
> > (for me significant) i start noticing minor problems. So all this talk about
> > networking performance in my case.
> > 
> > For example.
> > Sun server, AMD based (two CPU -  AMD Opteron(tm) Processor 248).
> > e1000 connected over PCI-X ([    4.919249] e1000: 0000:01:01.0: e1000_probe:
> > (PCI-X:100MHz:64-bit) 00:14:4f:20:89:f4)
> > 
> > All traffic processed over eth0, 5 VLAN, 1 second average around 110-200Mbps
> 
> Currently TX checksum offload does not work for VLAN devices, which may
> be a serious performance hit if there is a lot of traffic routed between
> VLANs.  This should change in 2.6.27 for some drivers, which I think will
> include e1000.
> 
> > of traffic. Host running also conntrack (max 1000000 entries, when packetloss
> > happen - around 256k entries). Around 1300 routes (FIB_TRIE) running. What is
> > worrying me, that ok, i win time by increasing rx descriptors from 256 to
> > 4096, but how much time i win? if it "cracks" on 100 Mbps RX, it means by
> > interpolating descriptors increase from 256 to 4096 (4 times), i cannot
> > process more than 400Mbps RX?

You are CPU limited because of the overhead of firewalling. When this happens
packets get backlogged.

> Increasing the RX descriptor ring size should give the driver and stack
> more time to catch up after handling some packets that take unusually
> long.  It may also allow you to increase interrupt moderation, which
> will reduce the per-packet cost.

No: if the receive side is CPU limited, you just end up eating more memory.
A bigger queue may actually make performance worse (fewer cache hits).

> > The CPU is not so busy after all... maybe there is a way to change some
> > parameter to force NAPI poll interface more often?
> 
> NAPI polling is not time-based, except indirectly though interrupt
> moderation.

How are you measuring CPU? You need to do something like measure the available
cycles left for applications. Don't believe top or other measures that may
not reflect I/O overhead and bus usage. 

> > I tried nice, changing realtime priority to FIFO, changing kernel to
> > preemptible... no luck, except increasing descriptors.
> > 
> > Router-Dora ~ # mpstat -P ALL 1
> > Linux 2.6.26-rc6-git2-build-0029 (Router-Dora)  06/15/08
> > 
> > 22:51:02     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal  
> > %idle    intr/s
> > 22:51:03     all    1.00    0.00    0.00    0.00    2.50   29.00    0.00  
> > 67.50  12927.00
> > 22:51:03       0    2.00    0.00    0.00    0.00    4.00   59.00    0.00  
> > 35.00  11935.00
> > 22:51:03       1    0.00    0.00    0.00    0.00    0.00    0.00    0.00 
> > 100.00    993.00
> > 22:51:03       2    0.00    0.00    0.00    0.00    0.00    0.00    0.00   
> > 0.00      0.00
>  
> You might do better with a NIC that supports MSI-X.  This allows the use of
> two RX queues with their own IRQs, each handled by a different processor.
> As it is, one CPU is completely idle.  However, I don't know how well the
> other work of routing scales to multiple processors.


Routing and firewalling should scale well. The bottleneck is probably going
to be some hot lock like the transmit lock.

> [...]
> > I have another host running, Core 2 Duo, e1000e+3 x e100, also conntrack, same
> > kernel configuration and similar amount of traffic, higher load (ifb + plenty
> > of shapers running) - almost no errors on default settings.
> > Linux 2.6.26-rc6-git2-build-0029 (Kup)  06/16/08
> > 
> > 07:00:27     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal  
> > %idle    intr/s
> > 07:00:28     all    0.00    0.00    0.50    0.00    4.00   31.50    0.00  
> > 64.00  32835.00
> > 07:00:29     all    0.00    0.00    0.50    0.00    2.50   29.00    0.00  
> > 68.00  33164.36
> > 
> > Third host r8169 (PCI! This is important, seems i am running out of PCI
> > capacity),
> 
> Gigabit Ethernet on plain old PCI is not ideal.  If each card has a
> separate route to the south bridge then you might be able to get a fair
> fraction of a gigabit between them though.
> 
> > 400Mbit/s rx+tx summary load, e1000e interface also - around
> > 200Mbps load. What is worrying me - interrupts rate, it seems generated by
> > realtek card... is there any way to drop it down? 
> [...]
> 
> ethtool -C lets you change interrupt moderation.  I don't know anything
> about this driver or NIC's capabilities but it does seem to be in the
> cheapest GbE cards so I wouldn't expect outstanding performance.
> 
> Ben.
> 

The bigger issue is available memory bandwidth. Different processors
and busses have different overheads. PCI is much worse than PCI Express,
and CPUs with integrated memory controllers do much better than CPUs
with a separate memory controller (like the Core 2).



* Re: NAPI, rx_no_buffer_count, e1000, r8169 and other actors
  2008-06-16  2:59   ` Stephen Hemminger
@ 2008-06-16  4:05     ` Denys Fedoryshchenko
  0 siblings, 0 replies; 7+ messages in thread
From: Denys Fedoryshchenko @ 2008-06-16  4:05 UTC (permalink / raw)
  To: Stephen Hemminger, Ben Hutchings; +Cc: netdev

On Sun, 15 Jun 2008 19:59:18 -0700, Stephen Hemminger wrote
> On Mon, 16 Jun 2008 00:46:22 +0100
> Ben Hutchings <bhutchings@solarflare.com> wrote:
> 
> > Denys Fedoryshchenko wrote:
> > > Hi
> > > 
> > > Since i am using PC routers for my network, and i reach significant numbers
> > > (for me significant) i start noticing minor problems. So all this talk about
> > > networking performance in my case.
> > > 
> > > For example.
> > > Sun server, AMD based (two CPU -  AMD Opteron(tm) Processor 248).
> > > e1000 connected over PCI-X ([    4.919249] e1000: 0000:01:01.0: e1000_probe:
> > > (PCI-X:100MHz:64-bit) 00:14:4f:20:89:f4)
> > > 
> > > All traffic processed over eth0, 5 VLAN, 1 second average around 110-200Mbps
> > 
> > Currently TX checksum offload does not work for VLAN devices, which may
> > be a serious performance hit if there is a lot of traffic routed between
> > VLANs.  This should change in 2.6.27 for some drivers, which I think will
> > include e1000.

Probably that is only relevant for weak CPUs, or, in my case, a really large amount of traffic.

> > 
> > > of traffic. Host running also conntrack (max 1000000 entries, when packetloss
> > > happen - around 256k entries). Around 1300 routes (FIB_TRIE) running. What is
> > > worrying me, that ok, i win time by increasing rx descriptors from 256 to
> > > 4096, but how much time i win? if it "cracks" on 100 Mbps RX, it means by
> > > interpolating descriptors increase from 256 to 4096 (4 times), i cannot
> > > process more than 400Mbps RX?
> 
> You are CPU limited because of the overhead of firewalling. When 
> this happens packets get backlogged.

I tried increasing net.core.netdev_max_backlog; it doesn't help, and it
doesn't change anything at all.
But it looks like this: if I have 200 Mbps RX with an average packet of 500 bytes,
I have a 50 Kpps rate. The RX descriptor ring is 256 packets, and each 1 ms about
50 packets arrive. So if a poll is late by just more than ~5 ms, I miss packets.
Or if it doesn't complete all packets in one softirq cycle.
Probably I understand something (or everything) wrong.
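
The arithmetic above can be checked directly (same assumptions: 200 Mbps RX, 500-byte average packets, a 256-descriptor ring):

```python
# Verifying the back-of-envelope numbers from the paragraph above.
rx_mbps = 200
avg_pkt_bytes = 500
ring_descriptors = 256

pps = rx_mbps * 1_000_000 / (avg_pkt_bytes * 8)   # packets per second
pkts_per_ms = pps / 1000                          # packets arriving per ms
stall_budget_ms = ring_descriptors / pkts_per_ms  # time before a full ring drops

print(f"{pps:.0f} pps, {pkts_per_ms:.0f} pkts/ms, "
      f"ring overflows after {stall_budget_ms:.2f} ms")
```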

But firewalling should not be a big deal, since I am not using anything "heavy"
like L7 filtering. But I will try to optimize the rules, as I once did with u32
hashing, so that most packets do not traverse a "long chain". And there are around 29
rules in filter, 63 in nat, 20 in mangle; that's not much, I guess.
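
A crude linear-traversal model of why shortening the chain helps (the per-rule cost is an invented placeholder; real per-rule cost varies with the match types used):

```python
# Every packet pays for each rule it tests until it matches, so the cost grows
# linearly with chain length. COST_PER_RULE_NS is hypothetical.

COST_PER_RULE_NS = 50
PPS = 50_000              # ~200 Mbps of 500-byte packets

def chain_cpu_share(rules_traversed):
    """Fraction of one CPU spent walking iptables rules at this packet rate."""
    return PPS * rules_traversed * COST_PER_RULE_NS / 1e9

for rules in (112, 20, 5):  # 29 + 63 + 20 = 112 worst case, then after hashing
    print(f"{rules:>3} rules traversed -> {chain_cpu_share(rules):.2%} of a CPU")
```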

> 
> > Increasing the RX descriptor ring size should give the driver and stack
> > more time to catch up after handling some packets that take unusually
> > long.  It may also allow you to increase interrupt moderation, which
> > will reduce the per-packet cost.
> 
> No if the receive side is CPU limited, you just end up eating more memory.
> A bigger queue may actually make performance worse (less cache hits).
That's a very good idea.
e1000 / AMD - cache size: 1024 KB
Both Core 2 Duo routers - 4096 KB (shared?)

> 
> > > The CPU is not so busy after all... maybe there is a way to change some
> > > parameter to force NAPI poll interface more often?
> > 
> > NAPI polling is not time-based, except indirectly though interrupt
> > moderation.
> 
> How are you measuring CPU? You need to do something like measure the 
> available cycles left for applications. Don't believe top or other 
> measures that may not reflect I/O overhead and bus usage.

Probably mpstat gives correct results? I never use top, other than to find
an obvious CPU-hogging userspace app.

Router-Dora ~ # mpstat 1
Linux 2.6.26-rc6-git2-build-0029 (Router-Dora)  06/16/08

06:31:19     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
06:31:20     all    0.00    0.00    0.00    0.00    1.51    8.04    0.00   90.45  13570.30
06:31:21     all    0.00    0.00    0.00    0.00    2.49    9.95    0.00   87.56  13986.00
06:31:22     all    0.00    0.00    0.50    0.00    2.49    9.45    0.00   87.56  14364.00


> 
> > > I tried nice, changing realtime priority to FIFO, changing kernel to
> > > preemptible... no luck, except increasing descriptors.
> > > 
> > > Router-Dora ~ # mpstat -P ALL 1
> > > Linux 2.6.26-rc6-git2-build-0029 (Router-Dora)  06/15/08
> > > 
> > > 22:51:02     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal  
> > > %idle    intr/s
> > > 22:51:03     all    1.00    0.00    0.00    0.00    2.50   29.00    0.00  
> > > 67.50  12927.00
> > > 22:51:03       0    2.00    0.00    0.00    0.00    4.00   59.00    0.00  
> > > 35.00  11935.00
> > > 22:51:03       1    0.00    0.00    0.00    0.00    0.00    0.00    0.00 
> > > 100.00    993.00
> > > 22:51:03       2    0.00    0.00    0.00    0.00    0.00    0.00    0.00   
> > > 0.00      0.00
> >  
> > You might do better with a NIC that supports MSI-X.  This allows the use of
> > two RX queues with their own IRQs, each handled by a different processor.
> > As it is, one CPU is completely idle.  However, I don't know how well the
> > other work of routing scales to multiple processors.
> 
> Routing and firewalling should scale well. The deadlock is probably going
> to be some hot lock like the transmit lock.

I tried changing the TX queue length. If I make it too small, it just drops
packets _silently_: the drops are shown neither in netstat -s nor in the ifconfig
stats. That is what I reported before.

> 
> > [...]
> > > I have another host running, Core 2 Duo, e1000e+3 x e100, also conntrack, same
> > > kernel configuration and similar amount of traffic, higher load (ifb + plenty
> > > of shapers running) - almost no errors on default settings.
> > > Linux 2.6.26-rc6-git2-build-0029 (Kup)  06/16/08
> > > 
> > > 07:00:27     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal  
> > > %idle    intr/s
> > > 07:00:28     all    0.00    0.00    0.50    0.00    4.00   31.50    0.00  
> > > 64.00  32835.00
> > > 07:00:29     all    0.00    0.00    0.50    0.00    2.50   29.00    0.00  
> > > 68.00  33164.36
> > > 
> > > Third host r8169 (PCI! This is important, seems i am running out of PCI
> > > capacity),
> > 
> > Gigabit Ethernet on plain old PCI is not ideal.  If each card has a
> > separate route to the south bridge then you might be able to get a fair
> > fraction of a gigabit between them though.

I think in this case the r8169 is routed over a PCI-to-PCI-Express bridge; the other
card is PCI Express, and nothing else is on PCI other than an IDE controller, which
is not used at all. Yes, it is bad, but it should still offer 133 MByte/s
(1064 Mbit/s). Yes, I know there is overhead, but can I expect perhaps a
500-800 Mbps total bandwidth limit?
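
To sanity-check the bus arithmetic above (the 50-60% practical-efficiency range is a rough rule of thumb for conventional PCI, not a spec or measured value):

```python
# Shared conventional-PCI budget for the r8169 setup above. Efficiency losses
# come from bus arbitration, burst limits, and other devices on the bus.

pci_mbyte_s = 133                 # 32-bit / 33 MHz PCI, theoretical peak
pci_mbit_s = pci_mbyte_s * 8      # 1064 Mbit/s, as stated above

for efficiency in (0.5, 0.6):
    usable = pci_mbit_s * efficiency
    print(f"at {efficiency:.0%} bus efficiency: ~{usable:.0f} Mbit/s usable")
# RX and TX both cross the same bus, so this budget is shared between the
# two directions -- roughly consistent with the 500-800 Mbps guess.
```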

> > 
> > > 400Mbit/s rx+tx summary load, e1000e interface also - around
> > > 200Mbps load. What is worrying me - interrupts rate, it seems generated by
> > > realtek card... is there any way to drop it down? 
> > [...]
> > 
> > ethtool -C lets you change interrupt moderation.  I don't know anything
> > about this driver or NIC's capabilities but it does seem to be in the
> > cheapest GbE cards so I wouldn't expect outstanding performance.
> > 
> > Ben.
Well, the Realtek 8169 doesn't support changing the ring size, and doesn't support
changing the coalescing parameters. By the way, e1000 also doesn't support -C, but
e1000e does. Is it a new way of forcing people to buy newer adapters? :-)

> >
> 
> The bigger issues is available memory bandwidth. Different processors
> and busses have different overheads. PCI is much worse than PCI-
> express, and CPU's with integrated memory controllers do much better 
> than CPU's with separate memory controller (like Core 2).
Yes, but in my case the Core 2 does a heavier job much better, probably because of
the larger cache or some voodoo magic.

The biggest issue: in this country it is not possible to find a PCI-Express
network adapter, even a Realtek 8169. It is just incredible that the WHOLE country
has a very limited stock of PCI-Express adapters; a few PCI-Express R8169 cards
were lying on the shelf of a local Apple dealer a month ago, and I remembered
them too late.

--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.


