From: Stephen Hemminger <shemminger@vyatta.com>
To: Ben Hutchings <bhutchings@solarflare.com>
Cc: Denys Fedoryshchenko <denys@visp.net.lb>, netdev@vger.kernel.org
Subject: Re: NAPI, rx_no_buffer_count, e1000, r8169 and other actors
Date: Sun, 15 Jun 2008 19:59:18 -0700
Message-ID: <20080615195918.210fe19f@extreme>
In-Reply-To: <20080615234620.GC2835@solarflare.com>

On Mon, 16 Jun 2008 00:46:22 +0100
Ben Hutchings <bhutchings@solarflare.com> wrote:

> Denys Fedoryshchenko wrote:
> > Hi
> > 
> > Since I am using PC routers for my network and am reaching numbers that are
> > significant (for me), I have started noticing minor problems. Hence all this
> > talk about networking performance in my case.
> > 
> > For example.
> > Sun server, AMD based (two CPUs - AMD Opteron(tm) Processor 248).
> > e1000 connected over PCI-X ([    4.919249] e1000: 0000:01:01.0: e1000_probe:
> > (PCI-X:100MHz:64-bit) 00:14:4f:20:89:f4)
> > 
> > All traffic is processed over eth0 (5 VLANs); the 1-second average is around 110-200 Mbps
> 
> Currently TX checksum offload does not work for VLAN devices, which may
> be a serious performance hit if there is a lot of traffic routed between
> VLANs.  This should change in 2.6.27 for some drivers, which I think will
> include e1000.
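
In passing: what offloads are actually enabled on a given device can be
checked with ethtool -k. A sketch, assuming a VLAN device named eth0.100
(the name is hypothetical):

  ethtool -k eth0        # offload state of the physical device
  ethtool -k eth0.100    # the VLAN device; on 2.6.26 expect to see
                         # tx checksumming reported as off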
> 
> > of traffic.  The host also runs conntrack (max 1000000 entries; around 256k
> > entries when the packet loss happens) and about 1300 routes (FIB_TRIE).
> > What worries me: OK, I win time by increasing RX descriptors from 256 to
> > 4096, but how much time do I win?  If it "cracks" at 100 Mbps RX, does
> > interpolating over the descriptor increase from 256 to 4096 (16x) mean I
> > cannot process more than 1600 Mbps RX?

You are CPU limited because of the overhead of firewalling. When this happens,
packets get backlogged.
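
You can see this directly (a sketch, assuming the interface is eth0):
rx_no_buffer_count counts RX descriptor exhaustion, and softnet_stat
shows when the NAPI softirq ran out of budget:

  ethtool -S eth0 | grep rx_no_buffer_count
  # Per-CPU rows, in hex: column 1 is packets processed, column 2 packets
  # dropped, column 3 (time_squeeze) how often net_rx_action stopped with
  # work still pending - a sign of being CPU limited.
  cat /proc/net/softnet_stat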

> Increasing the RX descriptor ring size should give the driver and stack
> more time to catch up after handling some packets that take unusually
> long.  It may also allow you to increase interrupt moderation, which
> will reduce the per-packet cost.

No. If the receive side is CPU limited, you just end up eating more memory.
A bigger queue may actually make performance worse (fewer cache hits).
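
Current and maximum ring sizes can be checked at runtime; a sketch,
assuming eth0 and a driver that implements the ethtool ring ops:

  ethtool -g eth0          # show preset maximums and current settings
  ethtool -G eth0 rx 512   # 512 is a hypothetical value, not a recommendation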

> > The CPU is not so busy after all... maybe there is a way to change some
> > parameter to force NAPI to poll the interface more often?
> 
> NAPI polling is not time-based, except indirectly through interrupt
> moderation.

How are you measuring CPU? You need to do something like measure the available
cycles left for applications. Don't believe top or other measures that may
not reflect I/O overhead and bus usage. 
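
One crude way to do that (my own sketch, not a standard tool): run a
nice-19 busy loop and compare its iteration count on the idle machine
against the count under peak traffic; the drop approximates cycles lost
to interrupt and softirq work that top can misattribute.

  nice -n 19 sh -c '
    i=0; end=$(( $(date +%s) + 10 ))
    while [ $(date +%s) -lt $end ]; do i=$((i + 1)); done
    echo "$i iterations in 10 seconds"
  '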

> > I tried nice, changing the real-time priority to FIFO, making the kernel
> > preemptible... no luck, except for increasing descriptors.
> > 
> > Router-Dora ~ # mpstat -P ALL 1
> > Linux 2.6.26-rc6-git2-build-0029 (Router-Dora)  06/15/08
> > 
> > 22:51:02     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
> > 22:51:03     all    1.00    0.00    0.00    0.00    2.50   29.00    0.00   67.50  12927.00
> > 22:51:03       0    2.00    0.00    0.00    0.00    4.00   59.00    0.00   35.00  11935.00
> > 22:51:03       1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    993.00
> > 22:51:03       2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
>  
> You might do better with a NIC that supports MSI-X.  This allows the use of
> two RX queues with their own IRQs, each handled by a different processor.
> As it is, one CPU is completely idle.  However, I don't know how well the
> other work of routing scales to multiple processors.
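
Whether IRQs are actually spread across CPUs can be seen, and steered,
from userspace; a sketch, assuming eth0 and IRQ number 24 (the IRQ
number is hypothetical - read it from /proc/interrupts first):

  grep eth0 /proc/interrupts           # per-CPU interrupt counts
  echo 2 > /proc/irq/24/smp_affinity   # CPU bitmask: bind IRQ 24 to CPU1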


Routing and firewalling should scale well. The bottleneck is probably going
to be some hot lock like the transmit lock.
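
If you want to identify the lock, a kernel built with CONFIG_LOCK_STAT=y
can report contention directly; a sketch:

  echo 0 > /proc/lock_stat              # clear any existing counts
  echo 1 > /proc/sys/kernel/lock_stat   # start collecting
  sleep 10                              # sample while under load
  head -50 /proc/lock_stat              # contention counts per lock
  echo 0 > /proc/sys/kernel/lock_stat   # stop; collection adds overhead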

> [...]
> > I have another host running: Core 2 Duo, e1000e + 3 x e100, also conntrack,
> > the same kernel configuration and a similar amount of traffic, higher load
> > (ifb + plenty of shapers running) - almost no errors at default settings.
> > Linux 2.6.26-rc6-git2-build-0029 (Kup)  06/16/08
> > 
> > 07:00:27     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
> > 07:00:28     all    0.00    0.00    0.50    0.00    4.00   31.50    0.00   64.00  32835.00
> > 07:00:29     all    0.00    0.00    0.50    0.00    2.50   29.00    0.00   68.00  33164.36
> > 
> > Third host: r8169 (PCI! This is important; it seems I am running out of PCI
> > capacity),
> 
> Gigabit Ethernet on plain old PCI is not ideal.  If each card has a
> separate route to the south bridge then you might be able to get a fair
> fraction of a gigabit between them though.
> 
> > 400 Mbit/s combined RX+TX load; the e1000e interface also carries around
> > 200 Mbps.  What worries me is the interrupt rate, which seems to be generated
> > by the Realtek card... is there any way to bring it down?
> [...]
> 
> ethtool -C lets you change interrupt moderation.  I don't know anything
> about this driver or NIC's capabilities but it does seem to be in the
> cheapest GbE cards so I wouldn't expect outstanding performance.
> 
> Ben.
> 
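
For reference, the moderation settings Ben mentions can be inspected and
changed like this; a sketch, assuming eth0 and a driver that implements
the coalescing ops (the r8169 driver of this era may not):

  ethtool -c eth0                # show current coalescing settings
  ethtool -C eth0 rx-usecs 100   # hypothetical value: delay the IRQ up
                                 # to 100us to batch received packets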

The bigger issue is available memory bandwidth. Different processors
and buses have different overheads. PCI is much worse than PCI Express,
and CPUs with an integrated memory controller do much better than CPUs
with a separate memory controller (like the Core 2).
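
For a rough sense of scale (theoretical peaks; note that every routed
packet crosses the bus twice, once for RX and once for TX):

  PCI   32-bit/33 MHz:   33e6 Hz x 4 bytes ~= 133 MB/s (~1.06 Gbit/s), shared by all devices
  PCI-X 64-bit/100 MHz: 100e6 Hz x 8 bytes  = 800 MB/s (~6.4 Gbit/s), shared
  PCIe 1.0 x1:          250 MB/s per direction per lane, not shared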

