From: Stephen Hemminger
Subject: Re: NAPI, rx_no_buffer_count, e1000, r8169 and other actors
Date: Sun, 15 Jun 2008 19:59:18 -0700
Message-ID: <20080615195918.210fe19f@extreme>
References: <20080615200013.M67401@visp.net.lb> <20080615234620.GC2835@solarflare.com>
In-Reply-To: <20080615234620.GC2835@solarflare.com>
To: Ben Hutchings
Cc: Denys Fedoryshchenko, netdev@vger.kernel.org
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

On Mon, 16 Jun 2008 00:46:22 +0100
Ben Hutchings wrote:

> Denys Fedoryshchenko wrote:
> > Hi
> >
> > Since I am using PC routers for my network, and I am reaching
> > significant numbers (significant for me), I have started noticing
> > minor problems. So all this talk is about networking performance
> > in my case.
> >
> > For example: a Sun server, AMD based (two CPUs, AMD Opteron(tm)
> > Processor 248), with an e1000 connected over PCI-X
> > ([    4.919249] e1000: 0000:01:01.0: e1000_probe:
> > (PCI-X:100MHz:64-bit) 00:14:4f:20:89:f4).
> >
> > All traffic is processed over eth0, 5 VLANs, with a 1-second
> > average of around 110-200 Mbps
>
> Currently TX checksum offload does not work for VLAN devices, which
> may be a serious performance hit if there is a lot of traffic routed
> between VLANs. This should change in 2.6.27 for some drivers, which
> I think will include e1000.
>
> > of traffic. The host is also running conntrack (max 1000000
> > entries; when packet loss happens, around 256k entries are in use)
> > and around 1300 routes (FIB_TRIE). What is worrying me: OK, I buy
> > time by increasing RX descriptors from 256 to 4096, but how much
> > time do I buy? If it "cracks" at 100 Mbps RX, does extrapolating
> > the descriptor increase from 256 to 4096 mean I cannot process
> > more than 400 Mbps RX?

You are CPU limited because of the overhead of firewalling. When this
happens, packets get backlogged.

> Increasing the RX descriptor ring size should give the driver and
> stack more time to catch up after handling some packets that take
> unusually long. It may also allow you to increase interrupt
> moderation, which will reduce the per-packet cost.

No; if the receive side is CPU limited, you just end up eating more
memory. A bigger queue may actually make performance worse (fewer
cache hits).

> > The CPU is not so busy after all... maybe there is a way to change
> > some parameter to force NAPI to poll the interface more often?
>
> NAPI polling is not time-based, except indirectly through interrupt
> moderation.

How are you measuring CPU? You need to do something like measure the
cycles left available for applications. Don't believe top or other
measures that may not reflect I/O overhead and bus usage.
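A minimal sketch of one way to do that (untested and purely
illustrative, not something from this thread): a soaker that counts
busy-loop iterations per second. Pin it to each CPU at nice 19, first
on the idle box for a baseline and then under traffic; the drop in
loops/sec approximates the cycles the stack and I/O overhead are
really eating, which top tends to hide.

#include <stdio.h>
#include <time.h>

/* Illustrative cycle-soaker sketch, added for this discussion. */
static double now_sec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	for (;;) {
		volatile unsigned long loops = 0;
		double start = now_sec();

		/* spin for one second, counting iterations */
		while (now_sec() - start < 1.0)
			loops++;
		printf("%lu loops/sec\n", loops);
		fflush(stdout);
	}
	return 0;
}

Build with something like "gcc -O2 -o soak soak.c" (older glibc may
need -lrt for clock_gettime) and run as "taskset -c 0 nice -n 19
./soak".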
> > I tried nice, changing the realtime priority to FIFO, changing the
> > kernel to preemptible... no luck, except for increasing
> > descriptors.
> >
> > Router-Dora ~ # mpstat -P ALL 1
> > Linux 2.6.26-rc6-git2-build-0029 (Router-Dora)   06/15/08
> >
> > 22:51:02     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
> > 22:51:03     all    1.00    0.00    0.00    0.00    2.50   29.00    0.00   67.50  12927.00
> > 22:51:03       0    2.00    0.00    0.00    0.00    4.00   59.00    0.00   35.00  11935.00
> > 22:51:03       1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    993.00
> > 22:51:03       2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
>
> You might do better with a NIC that supports MSI-X. This allows the
> use of two RX queues with their own IRQs, each handled by a different
> processor. As it is, one CPU is completely idle. However, I don't
> know how well the other work of routing scales to multiple
> processors.

Routing and firewalling should scale well. The bottleneck is probably
going to be some hot lock like the transmit lock.

> [...]
> > I have another host running: Core 2 Duo, e1000e + 3 x e100, also
> > conntrack, the same kernel configuration and a similar amount of
> > traffic, but higher load (ifb plus plenty of shapers running);
> > almost no errors with default settings.
> >
> > Linux 2.6.26-rc6-git2-build-0029 (Kup)   06/16/08
> >
> > 07:00:27     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
> > 07:00:28     all    0.00    0.00    0.50    0.00    4.00   31.50    0.00   64.00  32835.00
> > 07:00:29     all    0.00    0.00    0.50    0.00    2.50   29.00    0.00   68.00  33164.36
> >
> > A third host has an r8169 (PCI! This is important; it seems I am
> > running out of PCI capacity),
>
> Gigabit Ethernet on plain old PCI is not ideal. If each card has a
> separate route to the south bridge, then you might be able to get a
> fair fraction of a gigabit between them though.
>
> > with 400 Mbit/s RX+TX total load, and also an e1000e interface
> > with around 200 Mbps load. What is worrying me is the interrupt
> > rate; it seems to be generated by the Realtek card... is there any
> > way to bring it down?
> [...]
>
> ethtool -C lets you change interrupt moderation. I don't know
> anything about this driver or the NIC's capabilities, but it does
> seem to be the chip in the cheapest GbE cards, so I wouldn't expect
> outstanding performance.
>
> Ben.

The bigger issue is available memory bandwidth. Different processors
and buses have different overheads. PCI is much worse than PCI
Express, and CPUs with integrated memory controllers do much better
than CPUs with a separate memory controller (like the Core 2).
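For reference, the ring size and interrupt moderation knobs mentioned
above are both driven from ethtool; roughly like this (values are
purely illustrative, option support and limits vary by driver, and
e1000 caps the RX ring at 4096 descriptors):

	ethtool -g eth0              # show current/max RX ring size
	ethtool -G eth0 rx 4096      # grow the RX descriptor ring
	ethtool -c eth0              # show interrupt coalescing settings
	ethtool -C eth0 rx-usecs 100 # raise RX interrupt moderation

None of this helps, of course, if the CPU is already saturated.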