Netdev List
* Please pull 'fixes-jgarzik' branch of wireless-2.6
From: John W. Linville @ 2008-01-10 19:49 UTC (permalink / raw)
  To: jeff; +Cc: netdev, linux-wireless

Jeff,

A couple more fixes for 2.6.24.  The one from Mattias Nissler is already
in your upstream tree...FYI.

Let me know if there are problems!

Thanks,

John

---

Individual patches available here:

	http://www.kernel.org/pub/linux/kernel/people/linville/wireless-2.6/fixes-jgarzik/

---

The following changes since commit 3ce54450461bad18bbe1f9f5aa3ecd2f8e8d1235:
  Linus Torvalds (1):
        Linux 2.6.24-rc7

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git fixes-jgarzik

Ivo van Doorn (1):
      rt2x00: Corectly initialize rt2500usb MAC

Mattias Nissler (1):
      rt2x00: Allow rt61 to catch up after a missing tx report

 drivers/net/wireless/rt2x00/rt2500usb.c |    2 +-
 drivers/net/wireless/rt2x00/rt61pci.c   |   13 +++++++++++++
 2 files changed, 14 insertions(+), 1 deletions(-)
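
The rt2500usb change listed above is a one-character pointer fix: the function receives a pointer to the MAC address, and the old call passed the address of that pointer variable instead. A minimal userspace sketch of the difference follows; `fake_multiwrite` and `config_mac_addr` are hypothetical stand-ins, not the driver's functions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6

/* Hypothetical stand-in for a register multiwrite: copy len bytes into regs. */
static void fake_multiwrite(uint8_t *regs, const void *value, size_t len)
{
	memcpy(regs, value, len);
}

/* Post-fix behaviour: 'mac' already points at the address bytes. */
static void config_mac_addr(uint8_t *regs, const uint8_t *mac)
{
	fake_multiwrite(regs, mac, MAC_LEN);
	/*
	 * The pre-fix call passed &mac instead, i.e. the location of the
	 * pointer variable itself, so the pointer's own bytes were written
	 * rather than the MAC address.
	 */
}
```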

diff --git a/drivers/net/wireless/rt2x00/rt2500usb.c b/drivers/net/wireless/rt2x00/rt2500usb.c
index 50775f9..18b1f91 100644
--- a/drivers/net/wireless/rt2x00/rt2500usb.c
+++ b/drivers/net/wireless/rt2x00/rt2500usb.c
@@ -257,7 +257,7 @@ static const struct rt2x00debug rt2500usb_rt2x00debug = {
 static void rt2500usb_config_mac_addr(struct rt2x00_dev *rt2x00dev,
 				      __le32 *mac)
 {
-	rt2500usb_register_multiwrite(rt2x00dev, MAC_CSR2, &mac,
+	rt2500usb_register_multiwrite(rt2x00dev, MAC_CSR2, mac,
 				      (3 * sizeof(__le16)));
 }
 
diff --git a/drivers/net/wireless/rt2x00/rt61pci.c b/drivers/net/wireless/rt2x00/rt61pci.c
index 01dbef1..0d9436d 100644
--- a/drivers/net/wireless/rt2x00/rt61pci.c
+++ b/drivers/net/wireless/rt2x00/rt61pci.c
@@ -1738,6 +1738,7 @@ static void rt61pci_txdone(struct rt2x00_dev *rt2x00dev)
 {
 	struct data_ring *ring;
 	struct data_entry *entry;
+	struct data_entry *entry_done;
 	struct data_desc *txd;
 	u32 word;
 	u32 reg;
@@ -1791,6 +1792,18 @@ static void rt61pci_txdone(struct rt2x00_dev *rt2x00dev)
 		    !rt2x00_get_field32(word, TXD_W0_VALID))
 			return;
 
+		entry_done = rt2x00_get_data_entry_done(ring);
+		while (entry != entry_done) {
+			/* Catch up. Just report any entries we missed as
+			 * failed. */
+			WARNING(rt2x00dev,
+				"TX status report missed for entry %p\n",
+				entry_done);
+			rt2x00pci_txdone(rt2x00dev, entry_done, TX_FAIL_OTHER,
+					 0);
+			entry_done = rt2x00_get_data_entry_done(ring);
+		}
+
 		/*
 		 * Obtain the status about this packet.
 		 */
-- 
John W. Linville
linville@tuxdriver.com
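
The rt61pci catch-up logic in the patch above can be sketched outside the kernel as follows. This is only an illustration under simplifying assumptions: `tx_ring`, `ring_complete` and `ring_txdone` are hypothetical names, the ring is a fixed array, and `reported` is assumed to be a valid index below `RING_SIZE`.

```c
#include <stddef.h>

#define RING_SIZE 8

enum tx_status { TX_PENDING = 0, TX_OK, TX_FAILED };

struct tx_ring {
	enum tx_status status[RING_SIZE];
	size_t done;	/* index of the oldest entry not yet completed */
};

/* Complete the oldest outstanding entry and advance the done index. */
static void ring_complete(struct tx_ring *ring, enum tx_status st)
{
	ring->status[ring->done] = st;
	ring->done = (ring->done + 1) % RING_SIZE;
}

/*
 * Handle a status report for entry 'reported'. Any older entries whose
 * reports never arrived are completed as failed first. Returns how many
 * entries were missed.
 */
static size_t ring_txdone(struct tx_ring *ring, size_t reported,
			  enum tx_status st)
{
	size_t missed = 0;

	while (ring->done != reported) {
		ring_complete(ring, TX_FAILED);
		missed++;
	}
	ring_complete(ring, st);
	return missed;
}
```

If reports arrive for entries 0 and then 3, entries 1 and 2 are marked failed before entry 3 is completed, which mirrors the "catch up" loop in the patch.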


* debian iproute2 patches branch rebased.
From: Andreas Henriksson @ 2008-01-10 19:54 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev

Hello Stephen!

I've rebased the patches branch we carry in debian on top of the new
080108 release of iproute2.

See patches branch of git://git.debian.org/git/collab-maint/pkg-iproute

I've dropped one of the patches you picked up[1], so there's now one of the
old ones left and a new manpage for routel/routef.
(Any reason you didn't pull the actual commit we served you with git?)

The old remaining patch fixes the infinite loop in ip route flush exactly the
same way you fixed the same problem in ip neigh flush[2].
An additional patch (not available in Debian), created at Patrick McHardy's
request, will follow in a separate mail. It makes the maximum number of
rounds configurable (0 means retry forever, so you can restore the old
behaviour).
Patrick and I disagree on what the default should be[3]. He thinks the 'ip
route flush' aka 'loop forever' behaviour should stay, while I vote for the
'ip neigh flush' behaviour of bailing out after N attempts.
IMNSHO looping infinitely is an *insane* default, especially since this is a
tool used in bootup scripts....

[1]: See commit ea5dd59c03b36fe2acec8f03a8d7a2f7b7036b04
[2]: See commit 660818498d0f5a3f52c05355a3e82c23f670fcc1
     Where the comment seems to be wrong about "Limit ip route flush...",
     since it's actually "ip neigh flush" that's being modified.
[3]: Read thread from here on:
     http://www.spinics.net/lists/netdev/msg44920.html
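
The configurable limit described above (bail out after N rounds, with 0 meaning retry forever) could look roughly like this. It is a sketch, not the follow-up patch: `flush_with_limit`, `flush_pass_fn` and `countdown_pass` are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical callback: run one flush pass, return true when nothing is left. */
typedef bool (*flush_pass_fn)(void *ctx);

/*
 * Repeat flush passes until one reports completion.
 * max_rounds == 0 means retry forever (the old behaviour).
 * Returns 0 when the flush completed, 1 when it gave up.
 */
static int flush_with_limit(flush_pass_fn pass, void *ctx,
			    unsigned int max_rounds)
{
	unsigned int round = 0;

	while (max_rounds == 0 || round < max_rounds) {
		if (pass(ctx))
			return 0;
		round++;
	}
	return 1;
}

/* Toy pass for demonstration: removes one item per call. */
static bool countdown_pass(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0)
		(*remaining)--;
	return *remaining == 0;
}
```

With a limit of 10, a flush that needs 3 passes succeeds, while one that needs 50 returns 1 after 10 passes; with a limit of 0 the same flush runs to completion.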


commit 1eef590948f81b5c84e8450d5c95dd73744b4278
Author: Andreas Henriksson <andreas@fatal.se>
Date:   Thu Jan 3 16:48:56 2008 +0100

    Add routel and routef man page.

diff --git a/Makefile b/Makefile
index de04176..723eb5d 100644
--- a/Makefile
+++ b/Makefile
@@ -56,6 +56,7 @@ install: all
 	ln -sf lnstat.8  $(DESTDIR)$(MANDIR)/man8/rtstat.8
 	ln -sf lnstat.8  $(DESTDIR)$(MANDIR)/man8/ctstat.8
 	ln -sf rtacct.8  $(DESTDIR)$(MANDIR)/man8/nstat.8
+	ln -sf routel.8  $(DESTDIR)$(MANDIR)/man8/routef.8
 	install -m 0755 -d $(DESTDIR)$(MANDIR)/man3
 	install -m 0644 $(shell find man/man3 -maxdepth 1 -type f) $(DESTDIR)$(MANDIR)/man3
 
diff --git a/man/man8/routel.8 b/man/man8/routel.8
new file mode 100644
index 0000000..cdf8f55
--- /dev/null
+++ b/man/man8/routel.8
@@ -0,0 +1,32 @@
+.TH "ROUTEL" "8" "3 Jan, 2008" "iproute2" "Linux"
+.SH "NAME"
+.LP 
+routel \- list routes with pretty output format
+.br
+routef \- flush routes
+.SH "SYNTAX"
+.LP 
+routel [\fItablenr\fP [\fIraw ip args...\fP]]
+.br 
+routef
+.SH "DESCRIPTION"
+.LP 
+These programs are a set of helper scripts you can use instead of raw iproute2 commands.
+.br
+The routel script will list routes in a format that some might consider easier to interpret than the ip route list equivalent.
+.br
+The routef script does not take any arguments and will simply flush the routing table down the drain. Beware! This means deleting all routes which will make your network unusable!
+
+.SH "FILES"
+.LP 
+\fI/usr/bin/routef\fP 
+.br 
+\fI/usr/bin/routel\fP 
+.SH "AUTHORS"
+.LP 
+The routel script was written by Stephen R. van den Berg <srb@cuci.nl>, 1999/04/18 and donated to the public domain.
+.br
+This manual page was written by Andreas Henriksson  <andreas@fatal.se>, for the Debian GNU/Linux system.
+.SH "SEE ALSO"
+.LP 
+ip(8)

commit 1d1dab5826d1a9091e0bb2cf832f0785dc2add63
Author: Daniel Silverstone <daniel.silverstone@ubuntu.com>
Date:   Fri Oct 19 13:32:24 2007 +0200

    Avoid infinite loop in ip addr flush.
    
    Fix "ip addr flush" the same way "ip neigh flush" was previously fixed,
    by bailing out if the flush hasn't completed after MAX_ROUNDS (10) tries.

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index d1c6620..34379d0 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -34,6 +34,8 @@
 #include "ll_map.h"
 #include "ip_common.h"
 
+#define MAX_ROUNDS 10
+
 static struct
 {
 	int ifindex;
@@ -667,7 +669,7 @@ int ipaddr_list_or_flush(int argc, char **argv, int flush)
 		filter.flushp = 0;
 		filter.flushe = sizeof(flushb);
 
-		for (;;) {
+		while (round < MAX_ROUNDS) {
 			if (rtnl_wilddump_request(&rth, filter.family, RTM_GETADDR) < 0) {
 				perror("Cannot send dump request");
 				exit(1);
@@ -694,6 +696,8 @@ int ipaddr_list_or_flush(int argc, char **argv, int flush)
 				fflush(stdout);
 			}
 		}
+		fprintf(stderr, "*** Flush remains incomplete after %d rounds. ***\n", MAX_ROUNDS); fflush(stderr);
+		return 1;
 	}
 
 	if (filter.family != AF_PACKET) {



-- 
Regards,
Andreas Henriksson


* Re: RFC: igb: Intel 82575 gigabit ethernet driver (take #2)
From: Kok, Auke @ 2008-01-10 19:57 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Kok, Auke, NetDev, Arjan van de Ven, Jesse Brandeburg,
	Ronciak, John
In-Reply-To: <47684AA9.9050409@pobox.com>

Jeff Garzik wrote:
> Looks pretty decent.  Main comments (style mostly, driver operation path
> seems sound):

thanks again for the comments. I am about to send an updated patch just before my
vacation; before I do, let me quickly touch on your comments below:

> * kill the bitfields and unions [in descriptor structs].  they are not
> endian-safe as presented, generate poor code, and are otherwise
> undesirable.

that bitfield was unused and so I removed the code. I don't see any more bitfields
at all now in this driver.
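
Jeff's endianness point can be illustrated with explicit masks on a word assembled byte by byte, which behaves the same on any host. The layout below is hypothetical and is not the 82575 descriptor format.

```c
#include <stdint.h>

/* Hypothetical descriptor layout, for illustration only. */
#define DESC_LEN_MASK	0x0000ffffu	/* bits 0-15: buffer length */
#define DESC_DONE	0x00010000u	/* bit 16: descriptor done */

/* Assemble a little-endian 32-bit word; independent of host byte order. */
static uint32_t desc_le32_to_cpu(const uint8_t b[4])
{
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

static uint32_t desc_len(uint32_t word)
{
	return word & DESC_LEN_MASK;
}

static int desc_done(uint32_t word)
{
	return (word & DESC_DONE) != 0;
}
```

A C bitfield over the same bytes would depend on the compiler's bit allocation order, which is what makes it unsafe for hardware-defined layouts.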

> * the basic operations are too verbose:  E1000_READ_REG(hw, REGISTER) is
> far more readable as ER32(REGISTER), following the style of other
> drivers.  Furthermore, the "E1000_" prefix, in addition to being overly
> redundant (used in each register read/write), it is also incorrect,
> because this is not E1000...

partially I agree, and I refined the register writes to remove the need for the
"hw" part.

However the hardware *is* e1000, we ended up making a new driver since it just
does not make sense to add all of this infrastructure for older chipsets anymore.

renaming everything (from e1000_ to igb_) would just make life really hard for us
when looking up historical diffs, history etc. and, most importantly, when
comparing with e1000/e1000e when we encounter an issue that might affect the
other drivers. For now it is easier to just leave these alone.

I however do not rule out that we change this at a later stage ...

> * in general, rename everything with "e1000_" prefix.  this will
> eliminate plenty of human confusion in the long run.

I'm doing this for all functions, which solves the namespace collisions. The
"e1000" specific static structs (which are the same in igb as they are in e1000,
e1000e) as well as the registers (ditto) I'll keep unchanged for now.

> * API:   unless you have chips in the lab that will require an API hook,
> don't create one.  For example, a direct call to
> e1000_acquire_nvm_82575() should replace all ->acquire_nvm() hooks....
> if there are no chips in pipeline GUARANTEED to have a different
> ->acquire_nvm() feature.

Noted

Note also that there are already far fewer hooks than there are in e1000e. We
already made an effort to scrub as many as we could.


Cheers,

Auke



* Re: [PROCFS] [NETNS] issue with /proc/net entries
From: Eric W. Biederman @ 2008-01-10 19:55 UTC (permalink / raw)
  To: Benjamin Thery; +Cc: netdev, linux-kernel
In-Reply-To: <478654E1.5050501@bull.net>

Benjamin Thery <benjamin.thery@bull.net> writes:

> Hi Eric,
>
> While testing the current network namespace stuff merged in net-2.6.25,
> I bumped into the following problem with the /proc/net/ entries.
> It doesn't always display the actual data of the current namespace,
> but sometime displays data from other namespaces.
>
> I bisected the problem to the commit:
> "proc: remove/Fix proc generic d_revalidate"
> 3790ee4bd86396558eedd86faac1052cb782e4e1
>
> The problem: If a process in a particular network namespace changes
> current directory to /proc/net, then processes in other network
> namespaces trying to look at /proc/net entries will see data from the
> first namespace (the one with CWD /proc/net). (See test case below).
>
> As you comments in the commit suggest, you seem to be aware of some
> issues when CONFIG_NET_NS=y. Is it one of these corner cases you
> identified? Any idea on how we can fix it?

Yes.  It isn't especially hard.  I have most of it in my queue;
I just need to get the silly patches out of there.

Essentially we need to fix the caching of proc_generic entries
so that we can have a proper d_revalidate implementation.

To get d_revalidate and the caching correct for /proc/net will take
just a bit more work.  We need to make /proc/net a symlink
to something like /proc/self/net so that we don't get excess
revalidates when switching between different processes.

Otherwise we can't properly handle the case you have described,
where being in the directory causes the wrong version of /proc/net
to show up.  Changing the contents of the dentry for /proc/net
should only happen during unshare, not when we switch between
processes, or else we get into the "d_revalidate leaks mount points"
problem again.

We also need to check whether something is mounted on top of
us before we drop the dentry.  But if we don't even try until
we know the dentry is invalid it should not be too bad.

Eric


* Re: [Bugme-new] [Bug 9721] New: wake on lan fails with sky2 module
From: supersud501 @ 2008-01-10 19:35 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Andrew Morton, netdev, linux-acpi
In-Reply-To: <20080109205210.1f8a83bb@deepthought>



Stephen Hemminger wrote:
> On Wed, 9 Jan 2008 16:03:00 -0800
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
>> (switched to email.  Please respond via emailed reply-to-all, not via the
>> bugzilla web interface).
>>
>> On Wed,  9 Jan 2008 13:05:34 -0800 (PST)
>> bugme-daemon@bugzilla.kernel.org wrote:
>>
>>> http://bugzilla.kernel.org/show_bug.cgi?id=9721
>>>
>>>            Summary: wake on lan fails with sky2 module
>>>            Product: ACPI
>>>            Version: 2.5
>>>      KernelVersion: 2.6.24-rc7
>>>           Platform: All
>>>         OS/Version: Linux
>>>               Tree: Mainline
>>>             Status: NEW
>>>           Severity: normal
>>>           Priority: P1
>>>          Component: Power-Sleep-Wake
>>>         AssignedTo: acpi_power-sleep-wake@kernel-bugs.osdl.org
>>>         ReportedBy: supersud501@yahoo.de
>> This post-2.6.23 regression was assigned to ACPI but is quite possibly a
>> net driver problem?
>>
>>> Latest working kernel version: 2.6.23.12
>>> Earliest failing kernel version: 2.6.24-rc6 (not tested earlier kernel,
>>> 2.6.24-rc7 still failing)
>>> Distribution: Ubuntu 8.04 (but Kernel build from Kernel.org and system modifiet
>>> to make wake on lan work, i.e. network cards are not shutted down on poweroff)
>>> Hardware Environment: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit
>>> Ethernet Controller (rev 20) onboard Asus P5W DH motherboard, uses module SKY2
>>> Software Environment:
>>> Problem Description:
>>>
>>> When enabling wake on lan with: 'ethtool -s eth0 wol' i get the following
>>> status:
>>>
>>> 21:56:29 ~ # sudo ethtool eth0
>>> Settings for eth0:
>>>         Supported ports: [ TP ]
>>>         Supported link modes:   10baseT/Half 10baseT/Full 
>>>                                 100baseT/Half 100baseT/Full 
>>>                                 1000baseT/Half 1000baseT/Full 
>>>         Supports auto-negotiation: Yes
>>>         Advertised link modes:  10baseT/Half 10baseT/Full 
>>>                                 100baseT/Half 100baseT/Full 
>>>                                 1000baseT/Half 1000baseT/Full 
>>>         Advertised auto-negotiation: Yes
>>>         Speed: 100Mb/s
>>>         Duplex: Full
>>>         Port: Twisted Pair
>>>         PHYAD: 0
>>>         Transceiver: internal
>>>         Auto-negotiation: on
>>>         Supports Wake-on: pg
>>>         Wake-on: g    <---- wol enabled
>>>         Current message level: 0x000000ff (255)
>>>         Link detected: yes
>>>
>>> but after shutting down the pc doesn't wake up when magic packet is sent.
>>>
>>> the status lights of the network card are still on (so the card seems to be
>>> online).
>>>
>>> same system with only changed kernel to 2.6.23.12 and same procedure like
>>> above: wake on lan works.
>>>
>>> Steps to reproduce: enable wol on your network card using SKY2 module and it
>>> doesn't work too?
>>>
>>> if you need more information, just tell me, it's my first bug report.
>>> regards
>>>
> 
> 
> Wake from power off works on 2.6.24-rc7 for me.
> Wake from suspend doesn't because Network Manager, HAL, or some other
> user space tool gets confused.
> 
> I just rechecked it with Fujitsu Lifebook, which has sky2 (88E8055).
> There many variations of this chip, and it maybe chip specific problem
> or ACPI/BIOS issues.  If you don't enable Wake on Lan in BIOS, the
> driver can't do it for you. Also, check how you are shutting down.
> 
> Also since the device has to restart the PHY, it could be a switch
> issue if you have some fancy pants switch doing intrusion detection
> or something, but I doubt that.
> 
> Is it a clean or fast shutdown, most distributions mark network
> devices as down on shutdown, but if the distribution does something 
> stupid like remove the driver module, then the driver is unable to setup Wake On Lan.
> The wake on lan setup is done in one place in the driver, add
> a printk to see if it is ever called.
> 
> 

I only tried wake from shutdown (poweroff), and like i wrote, on the 
same system with kernel 2.6.23.12 (nothing changed but vmlinuz and 
initrd, with the same kernel config on 2.6.24-rc6/7 (make oldconfig, 
default answer to all questions)), it works. so it seems to me like a 
problem in the kernel.

every wake-up setting (wake up by pci-device, rtc-alarm, modem ...) in 
bios is also enabled, otherwise it couldn't work in 2.6.23.12 (and windows).
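
For anyone reproducing this, the magic packet payload is 6 bytes of 0xFF followed by the target MAC address repeated 16 times. A minimal sketch of building it follows; `wol_build_payload` is a hypothetical helper, and sending the buffer (typically as a UDP broadcast) is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WOL_MAC_LEN	6
#define WOL_REPEATS	16
#define WOL_PAYLOAD_LEN	(WOL_MAC_LEN + WOL_REPEATS * WOL_MAC_LEN)	/* 102 */

/* Fill 'out' with the magic packet payload for 'mac'; returns its length. */
static size_t wol_build_payload(uint8_t out[WOL_PAYLOAD_LEN],
				const uint8_t mac[WOL_MAC_LEN])
{
	size_t i;

	/* Synchronisation stream: six 0xFF bytes. */
	memset(out, 0xff, WOL_MAC_LEN);

	/* Sixteen back-to-back copies of the target MAC address. */
	for (i = 0; i < WOL_REPEATS; i++)
		memcpy(out + WOL_MAC_LEN + i * WOL_MAC_LEN, mac, WOL_MAC_LEN);

	return WOL_PAYLOAD_LEN;
}
```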

if you say your sky2 card works, it might be an ACPI problem not related 
to sky2 as i thought - when i am at home i'll try to start my pc with 
a timer (--> /proc/acpi/alarm) from kernel 2.6.24-rc7 to check if 
ACPI wakeup works and report back (if that is any help in finding the 
source of my problem).

and regarding "printk" i'll try to find out what you mean (my first 
steps into kernel debugging :) - i think you mean adding a line in the 
source to print out something when the function is called)

regards


* RE: [ipw3945-devel] [PATCH 2/5] iwlwifi: iwl3945 synchronize interrupt and tasklet for down iwlwifi
From: Chatre, Reinette @ 2008-01-10 19:10 UTC (permalink / raw)
  To: Joonwoo Park, Zhu, Yi, netdev
  Cc: linux-wireless, lkml, ipw3945-devel
In-Reply-To: <11998765481610-git-send-email-joonwpark81-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Joonwoo Park wrote:

> --- a/drivers/net/wireless/iwlwifi/iwl3945-base.c
> +++ b/drivers/net/wireless/iwlwifi/iwl3945-base.c
> @@ -6262,6 +6262,10 @@ static void __iwl_down(struct iwl_priv *priv)
> 	/* tell the device to stop sending interrupts */
> 	iwl_disable_interrupts(priv);
> 
> +	/* synchronize irq and tasklet */
> +	synchronize_irq(priv->pci_dev->irq);
> +	tasklet_kill(&priv->irq_tasklet);
> +

Could synchronize_irq() be moved into iwl_disable_interrupts()? I am
also wondering whether we could call tasklet_kill() before
iwl_disable_interrupts(), thus preventing the tasklet from being
scheduled while we are going down.

Reinette


* Re: questions on NAPI processing latency and dropped network packets
From: Kok, Auke @ 2008-01-10 19:01 UTC (permalink / raw)
  To: Rick Jones; +Cc: Chris Friesen, netdev, linux-kernel
In-Reply-To: <478666F4.50405@hp.com>

Rick Jones wrote:
>> 1) Interrupts are being processed on both cpus:
>>
>> root@base0-0-0-13-0-11-1:/root> cat /proc/interrupts
>>            CPU0       CPU1
>>  30:    1703756    4530785  U3-MPIC Level     eth0
> 
> IIRC none of the e1000 driven cards are multi-queue

the pci-express variants are, but the functionality is almost always disabled (and
relatively new anyway).

even with multiqueue, you can still have only a single irq line (which of course
mostly defeats the purpose).

>, so while the above 
> shows that interrupts from eth0 have been processed on both CPUs at
> various points in the past, it doesn't necessarily mean that they are
> being processed on both CPUs at the same time right?

they never will be: an irq can only be processed on one cpu at a time anyway.
clearly the irq here has been migrated at least ONCE from one of the cpus to the
other. unfortunately you can't see from /proc/interrupts whether this happens
frequently or not, or how many times it happened before.
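
One way to get a rough idea is to sample the per-CPU counters twice and compare them. A minimal sketch of parsing one /proc/interrupts line follows; `parse_irq_counts` is a hypothetical helper and assumes the counters are the leading decimal fields after the "NN:" label.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse the per-CPU counters from one /proc/interrupts line.
 * Stores up to max_cpus values in counts and returns how many were found.
 * Parsing stops at the first field that does not start with a digit
 * (normally the interrupt controller name).
 */
static int parse_irq_counts(const char *line, unsigned long *counts,
			    int max_cpus)
{
	const char *p = strchr(line, ':');
	int n = 0;

	if (!p)
		return 0;
	p++;

	while (n < max_cpus) {
		char *end;

		while (*p == ' ' || *p == '\t')
			p++;
		if (!isdigit((unsigned char)*p))
			break;
		counts[n++] = strtoul(p, &end, 10);
		p = end;
	}
	return n;
}
```

Taking two samples a few seconds apart and subtracting them shows which CPU handled the interrupts during that window, though it still cannot show how often the irq moved in between.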

Auke


* Re: questions on NAPI processing latency and dropped network packets
From: Kok, Auke @ 2008-01-10 18:26 UTC (permalink / raw)
  To: Chris Friesen; +Cc: netdev, linux-kernel
In-Reply-To: <47865FFD.2010407@nortel.com>

Chris Friesen wrote:
> Kok, Auke wrote:
> 
>> You're using 2.6.10... you can always replace the e1000 module with the
>> out-of-tree version from e1000.sf.net, this might help a bit - the
>> version in the
>> 2.6.10 kernel is very very old.
> 
> Do you have any reason to believe this would improve things?  It seems
> like the problem lies in the NAPI/softirq code rather than in the e1000
> driver itself, no?

your real issue is that your userspace app is hogging the CPU. While networking is
not really cpu intensive, it does require that the CPU is given ample time at
frequent intervals to run cleanups and prevent FIFO issues.

alternatively, you can increase your rx/tx ring descriptor count (with ethtool),
which basically lets the hardware go unserviced for a longer period, since there
are more buffers available and the card can keep going longer while userspace is
hogging the CPU.
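
As a back-of-the-envelope illustration of why a larger ring helps, the time a ring can absorb traffic without being serviced is roughly the descriptor count divided by the packet rate. The sketch below assumes one descriptor per packet; `ring_cover_us` and the numbers used with it are illustrative, not measurements.

```c
#include <stdint.h>

/*
 * Microseconds of traffic a ring of 'descriptors' buffers can absorb at
 * 'pps' packets per second, assuming one descriptor per packet.
 */
static uint64_t ring_cover_us(uint32_t descriptors, uint32_t pps)
{
	if (pps == 0)
		return UINT64_MAX;	/* no traffic: the ring never fills */
	return (uint64_t)descriptors * 1000000u / pps;
}
```

At 100,000 packets per second, 256 descriptors cover about 2.5 ms while 4096 descriptors cover about 41 ms.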

>> it also appears that your app is eating up CPU time. perhaps setting
>> the app to a
>> nicer nice level might mitigate things a bit.
> 
> If we're not handling the softirq work from ksoftirqd how would changing
> scheduler settings affect anything?

correct, it might not.

>> Also turn off the in-kernel irq
>> mitigation, it just causes cache misses and you really need the
>> network irq to sit
>> on a single cpu at most (if not all) the time to get the best
>> performance. Use the
>> userspace irqbalance daemon instead to achieve this.
> 
> Using userspace irqbalance would be some effort to test and deploy
> properly.  However, as a quick test I tried setting the irq affinity for
> this device and it didn't help.

irqbalance is a simple userspace app that drops into any system seamlessly and
does the best job all around - often it even beats manual tuning of smp_affinity ;)

Auke


* RE: ipip tunnel code (IPV4)
From: Templin, Fred L @ 2008-01-10 17:58 UTC (permalink / raw)
  To: Andy Johnson, netdev
In-Reply-To: <147a89290801100634t7854a101w203de150982b0284@mail.gmail.com>

Andy,

> -----Original Message-----
> From: Andy Johnson [mailto:johnsonzjo@gmail.com] 
> Sent: Thursday, January 10, 2008 6:35 AM
> To: netdev@vger.kernel.org
> Subject: ipip tunnel code (IPV4)
> 
> Hello,
> 
> I am trying to learn the IPV4 ipip tunnel code  (net/ipv4/ipip.c)
> and I have two little questions about
> semantics of variables:
> 
> ipip_fb_tunnel_init - what does "fb" stand for ?
> 
> In tunnels_wc   : what does "wc" stand for ?

Similar names occur in net/ipv6/sit.c, which is the
IPv6-in-IPv4 analog of ipip.c. I am 90% certain that
"wc" stands for "wildcard" - it is used for selecting
the default tunnel interface when no other tunnel
interfaces match a specific (src, dst) pair.

In that light, I assume "fb" stands for something like
"fallback" although I am not certain. It would seem to
fit though, because the "fallback" tunnel interface is
the one that is selected by a "wildcard" match.
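
The selection described above can be sketched as a most-specific-match lookup. This is a simplified illustration of the idea only, not the hash-table code in ipip.c; `struct tunnel` and `tunnel_lookup` are hypothetical, and an address of 0 stands for "wildcard".

```c
#include <stddef.h>
#include <stdint.h>

struct tunnel {
	uint32_t remote;	/* 0 = match any remote address */
	uint32_t local;		/* 0 = match any local address */
	const char *name;
};

/*
 * Return the most specific tunnel matching (remote, local): an exact
 * pair beats a remote-only match, which beats a local-only match,
 * which beats the all-wildcard fallback entry.
 */
static const struct tunnel *tunnel_lookup(const struct tunnel *t, size_t n,
					  uint32_t remote, uint32_t local)
{
	const struct tunnel *best = NULL;
	int best_score = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		int score = 0;

		if (t[i].remote) {
			if (t[i].remote != remote)
				continue;
			score += 2;
		}
		if (t[i].local) {
			if (t[i].local != local)
				continue;
			score += 1;
		}
		if (score > best_score) {
			best = &t[i];
			best_score = score;
		}
	}
	return best;
}
```

A packet whose (src, dst) pair matches no configured tunnel falls through to the all-wildcard entry, which plays the role of the fallback device.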

Would be interested if anyone could confirm or correct
my assumptions.

Thanks - Fred
fred.l.templin@boeing.com 
 
> Regards,
> Andy


* Re: questions on NAPI processing latency and dropped network packets
From: Rick Jones @ 2008-01-10 18:41 UTC (permalink / raw)
  To: Chris Friesen; +Cc: netdev, linux-kernel
In-Reply-To: <478654C3.60806@nortel.com>

> 1) Interrupts are being processed on both cpus:
> 
> root@base0-0-0-13-0-11-1:/root> cat /proc/interrupts
>            CPU0       CPU1
>  30:    1703756    4530785  U3-MPIC Level     eth0

IIRC none of the e1000 driven cards are multi-queue, so while the above 
shows that interrupts from eth0 have been processed on both CPUs at 
various points in the past, it doesn't necessarily mean that they are 
being processed on both CPUs at the same time right?

rick jones


* Re: e1000 performance issue in 4 simultaneous links
From: Rick Jones @ 2008-01-10 18:37 UTC (permalink / raw)
  To: Breno Leitao; +Cc: bhutchings, Linux Network Development list
In-Reply-To: <1199986291.8931.62.camel@cafe>

> I also tried to increase my interface MTU to 9000, but I am afraid that
> netperf only transmits packets with less than 1500. Still investigating.

It may seem like picking a tiny nit, but netperf never transmits 
packets.  It only provides buffers of specified size to the stack. It is 
then the stack which transmits and determines the size of the packets on 
the network.

Drifting a bit more...

While there are settings, conditions and known stack behaviours where 
one can be confident of the packet size on the network based on the 
options passed to netperf, generally speaking one should not ass-u-me a 
direct relationship between the options one passes to netperf and the 
size of the packets on the network.

And for JumboFrames to be effective it must be set on both ends, 
otherwise the TCP MSS exchange will result in the smaller of the two 
MTUs "winning" as it were.
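
As a small worked example, assuming IPv4 and TCP headers without options (20 bytes each), the segment size that ends up in use is bounded by the smaller MTU. `mss_for_mtu` and `effective_mss` are hypothetical helper names.

```c
#include <stdint.h>

#define IPV4_HDR_LEN	20	/* IPv4 header without options */
#define TCP_HDR_LEN	20	/* TCP header without options */

/* MSS a host would advertise for a given interface MTU. */
static uint32_t mss_for_mtu(uint32_t mtu)
{
	return mtu - IPV4_HDR_LEN - TCP_HDR_LEN;
}

/* The connection is limited by the smaller of the two advertised values. */
static uint32_t effective_mss(uint32_t mtu_a, uint32_t mtu_b)
{
	uint32_t a = mss_for_mtu(mtu_a);
	uint32_t b = mss_for_mtu(mtu_b);

	return a < b ? a : b;
}
```

With one end at MTU 9000 and the other at 1500, the result is 1460 bytes, the same as if jumbo frames had not been configured at all.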

>>single CPU this can become a bottleneck.  Does the test system have
>>multiple CPUs?  Are IRQs for the multiple NICs balanced across
>>multiple CPUs?
> 
> Yes, this machine has 8 ppc 1.9Ghz CPUs. And the IRQs are balanced
> across the CPUs, as I see in /proc/interrupts: 

That suggests to me anyway that the dreaded irqbalanced is running, 
shuffling the interrupts as you go.  Not often a happy place for running 
netperf when one wants consistent results.

> 
> # cat /proc/interrupts 
>            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
>  16:        940        760       1047        904        993        777        975        813   XICS      Level     IPI
>  18:          4          3          4          1          3          6          8          3   XICS      Level     hvc_console
>  19:          0          0          0          0          0          0          0          0   XICS      Level     RAS_EPOW
> 273:      10728      10850      10937      10833      10884      10788      10868      10776   XICS      Level     eth4
> 275:          0          0          0          0          0          0          0          0   XICS      Level     ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3
> 277:     234933     230275     229770     234048     235906     229858     229975     233859   XICS      Level     eth6
> 278:     266225     267606     262844     265985     268789     266869     263110     267422   XICS      Level     eth7
> 279:        893        919        857        909        867        917        894        881   XICS      Level     eth0
> 305:     439246     439117     438495     436072     438053     440111     438973     438951   XICS      Level     eth0 Neterion Xframe II 10GbE network adapter
> 321:       3268       3088       3143       3113       3305       2982       3326       3084   XICS      Level     ipr
> 323:     268030     273207     269710     271338     270306     273258     270872     273281   XICS      Level     eth16
> 324:     215012     221102     219494     216732     216531     220460     219718     218654   XICS      Level     eth17
> 325:       7103       3580       7246       3475       7132       3394       7258       3435   XICS      Level     pata_pdc2027x
> BAD:       4216

IMO, what you want (in the absence of multi-queue NICs) is one CPU 
taking the interrupts of one port/interface, and each port/interface's 
interrupts going to a separate CPU.  So, something that looks roughly 
like this concocted example:

            CPU0     CPU1      CPU2     CPU3
   1:       1234        0         0        0   eth0
   2:          0     1234         0        0   eth1
   3:          0        0      1234        0   eth2
   4:          0        0         0     1234   eth3

which you should be able to achieve via the method I think someone else 
has already mentioned about echoing values into 
/proc/irq/<irq>/smp_affinity  - after you have slain the dreaded 
irqbalance daemon.
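
The values echoed into smp_affinity are hexadecimal CPU bitmasks, with bit N selecting CPU N. A minimal sketch of computing the mask string for a single CPU follows; `affinity_mask_for_cpu` is a hypothetical helper and only handles the first 32 CPUs.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Write the hex bitmask that selects only 'cpu' into buf.
 * Returns the number of characters written, or -1 if cpu is out of
 * range for this simple 32-bit sketch.
 */
static int affinity_mask_for_cpu(unsigned int cpu, char *buf, size_t len)
{
	if (cpu >= 32)
		return -1;
	return snprintf(buf, len, "%x", 1u << cpu);
}
```

For the concocted layout above, eth0 through eth3 would get the masks 1, 2, 4 and 8 respectively.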

rick jones


* [PATCH]  New driver "sfc" for Solarstorm SFC4000 controller - 4th attempt
From: Robert Stonehouse @ 2008-01-10 18:29 UTC (permalink / raw)
  To: jgarzik, netdev; +Cc: spope, linux-net-drivers

This is a resubmission of a new driver for Solarflare network controllers.

The driver supports several types of PHY (10Gbase-T, XFP, CX4) on six
different 10G and 1G boards.

Hardware based on this network controller is now available from SMC as
part numbers SMC10GPCIe-XFP and SMC10GPCIe-10BT.

The previous thread was:
  http://marc.info/?l=linux-netdev&m=119825632209357&w=2


Thanks to the people who looked at the previous patches. We have addressed
the following from comments received after the 3rd submission:
 - Kerneldoc style comment
 - Kconfig changes
 - Reduced size slightly

I am also sending a request to linux-mtd@lists.infradead.org for review of
the MTD part of the driver.


Previous reviewers have noted that the driver is quite large (but it
would not be the largest network driver by source or compiled module
size). I think it is a reasonable size for a driver that supports a
fully featured NIC, across a range of MACs, PHYs and silicon
revisions.

One aspect that is worth mentioning is that the NIC has no firmware.
A benefit is no dreaded binary blob!  A downside is that more support
code is needed but this tends to be around initialisation and is
readable commented C.

To give a small breakdown of the sizes of the different driver parts:
                                 (wc output)
 Core control/datapath         | 5001  16405 139467  = efx.c rx.c tx.c
 Controller HW support         | 3653  11823 107554  = falcon.c
 HW defs                       | 1588   4838  47050  = falcon_hwdefs.h
 board support                 | 1848   7105  52455
 MAC support                   | 1623   4977  51007
 PHY support                   | 2196   7904  67711
 Headers                       | 4565  20645 162402
 Self test code                |  863   3088  24981
 Ethtool support               |  751   2144  22845
 MTD code (separate module)    | 1021   3200  26944
 Debugfs Code (KConfig option) |  863   2543  24896


Are there further review comments that we need to address before it can be
merged?


The patch (against net-2.6.25) is at:
     https://support.solarflare.com/netdev/4/net-2.6.25-sfc-2.2.0038.patch

The new files may also be downloaded as a tarball:
     https://support.solarflare.com/netdev/4/net-2.6.25-sfc-2.2.0038.tgz

And for verification there is:
     https://support.solarflare.com/netdev/4/MD5SUMS

Regards

-- 
Rob Stonehouse


* Re: e1000 performance issue in 4 simultaneous links
From: Kok, Auke @ 2008-01-10 18:18 UTC (permalink / raw)
  To: Breno Leitao; +Cc: bhutchings, NetDev
In-Reply-To: <1199986291.8931.62.camel@cafe>

Breno Leitao wrote:
> On Thu, 2008-01-10 at 16:36 +0000, Ben Hutchings wrote:
>>> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
>>> of transfer rate. If I run 4 netperf against 4 different interfaces, I
>>> get around 720 * 10^6 bits/sec.
>> <snip>
>>
>> I take it that's the average for individual interfaces, not the
>> aggregate?
> Right, each of these results are for individual interfaces. Otherwise,
> we'd have a huge problem. :-)
> 
>> This can be mitigated by interrupt moderation and NAPI
>> polling, jumbo frames (MTU >1500) and/or Large Receive Offload (LRO).
>> I don't think e1000 hardware does LRO, but the driver could presumably
>> be changed use Linux's software LRO.
> Without using these "features" and keeping the MTU as 1500, do you think
> we could get a better performance than this one?
> 
> I also tried to increase my interface MTU to 9000, but I am afraid that
> netperf only transmits packets with less than 1500. Still investigating.
> 
>> single CPU this can become a bottleneck.  Does the test system have
>> multiple CPUs?  Are IRQs for the multiple NICs balanced across
>> multiple CPUs?
> Yes, this machine has 8 ppc 1.9Ghz CPUs. And the IRQs are balanced
> across the CPUs, as I see in /proc/interrupts: 


which is wrong and hurts performance. you want your ethernet irqs to stick to a
CPU for long periods to prevent cache thrashing.

please disable the in-kernel irq balancing code and use the userspace `irqbalance`
daemon.

Gee I should put that in my signature, I already wrote that twice today :)

Auke

> 
> # cat /proc/interrupts 
>            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
>  16:        940        760       1047        904        993        777        975        813   XICS      Level     IPI
>  18:          4          3          4          1          3          6          8          3   XICS      Level     hvc_console
>  19:          0          0          0          0          0          0          0          0   XICS      Level     RAS_EPOW
> 273:      10728      10850      10937      10833      10884      10788      10868      10776   XICS      Level     eth4
> 275:          0          0          0          0          0          0          0          0   XICS      Level     ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3
> 277:     234933     230275     229770     234048     235906     229858     229975     233859   XICS      Level     eth6
> 278:     266225     267606     262844     265985     268789     266869     263110     267422   XICS      Level     eth7
> 279:        893        919        857        909        867        917        894        881   XICS      Level     eth0
> 305:     439246     439117     438495     436072     438053     440111     438973     438951   XICS      Level     eth0 Neterion Xframe II 10GbE network adapter
> 321:       3268       3088       3143       3113       3305       2982       3326       3084   XICS      Level     ipr
> 323:     268030     273207     269710     271338     270306     273258     270872     273281   XICS      Level     eth16
> 324:     215012     221102     219494     216732     216531     220460     219718     218654   XICS      Level     eth17
> 325:       7103       3580       7246       3475       7132       3394       7258       3435   XICS      Level     pata_pdc2027x
> BAD:       4216
> 
> Thanks,
> 



* Re: SMP code / network stack
From: Kok, Auke @ 2008-01-10 18:31 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jeba Anandhan, Eric Dumazet, netdev,
	matthew.hattersley
In-Reply-To: <20080110174657.GL22437@ghostprotocols.net>

Arnaldo Carvalho de Melo wrote:
> Em Thu, Jan 10, 2008 at 03:26:59PM +0000, Jeba Anandhan escreveu:
>> Hi Eric,
>> Thanks for the reply. I have one more doubt. For example, if we have 2
>> processor and 4 ethernet cards. Only CPU0 does all work through 8 cards.
>> If we set the affinity to each ethernet card as CPU number, will it be
>> efficient?.
>>
>> Will this be default behavior?
>>
>> # cat /proc/interrupts 
>>            CPU0       CPU1       
>>   0:   11472559   74291833    IO-APIC-edge  timer
>>   2:          0          0          XT-PIC  cascade
>>   8:          0          1    IO-APIC-edge  rtc
>>  81:          0          0   IO-APIC-level  ohci_hcd
>>  97: 1830022231        847   IO-APIC-level  ehci_hcd, eth0
>>  97: 3830012232        847   IO-APIC-level  ehci_hcd, eth1
>>  97: 5830052231        847   IO-APIC-level  ehci_hcd, eth2
>>  97: 6830032213        847   IO-APIC-level  ehci_hcd, eth3

another thing to try: if you don't need usb2 support, remove the ehci_hcd module -
this will give slightly less overhead servicing irq's in your system.

I take it that you have no MSI support in these ethernet cards?

Auke


* Re: e1000 performance issue in 4 simultaneous links
From: Rick Jones @ 2008-01-10 18:26 UTC (permalink / raw)
  To: Breno Leitao; +Cc: netdev
In-Reply-To: <1199981839.8931.35.camel@cafe>

Many many things to check when running netperf :)

*) Are the cards on the same or separate PCImumble bus, and what sort of bus?

*) is the two interface performance two interfaces on the same four-port 
card, or an interface from each of the two four-port cards?

*) is there a dreaded (IMO) irqbalance daemon running?  one of the very 
first things I do when running netperf is terminate the irqbalance 
daemon with as extreme a prejudice as I can.

*) what is the distribution of interrupts from the interfaces to the 
CPUs?  if you've tried to set that manually, the dreaded irqbalance 
daemon will come along shortly thereafter and ruin everything.

*) what does netperf say about the overall CPU utilization of the 
system(s) when the tests are running?

*) what does top say about the utilization of any single CPU in the 
system(s) when the tests are running?

*) are you using the global -T option to spread the netperf/netserver 
processes across the CPUs, or leaving that all up to the 
stack/scheduler/etc?

I suspect there could be more but that is what comes to mind thus far as 
far as things I often check when running netperf.

rick jones



* Re: questions on NAPI processing latency and dropped network packets
From: James Chapman @ 2008-01-10 18:25 UTC (permalink / raw)
  To: Chris Friesen; +Cc: netdev, linux-kernel
In-Reply-To: <478654C3.60806@nortel.com>

Chris Friesen wrote:
> Hi all,
> 
> I've got an issue that's popped up with a deployed system running
> 2.6.10.  I'm looking for some help figuring out why incoming network
> packets aren't being processed fast enough.
> 
> After a recent userspace app change, we've started seeing packets being
> dropped by the ethernet hardware (e1000, NAPI is enabled).

What's changed in your application? Any real-time threads in there?

From the top output below, looks like SigtranServices is consuming all
your CPU...

> The
> error/dropped/fifo counts are going up in ethtool:
> 
>      rx_packets: 32180834
>      rx_bytes: 5480756958
>      rx_errors: 862506
>      rx_dropped: 771345
>      rx_length_errors: 0
>      rx_over_errors: 0
>      rx_crc_errors: 0
>      rx_frame_errors: 0
>      rx_fifo_errors: 91161
>      rx_missed_errors: 91161
> 
> This link is receiving roughly 13K packets/sec, and we're dropping
> roughly 51 packets/sec due to fifo errors.
> 
> Increasing the rx descriptor ring size from 256 up to around 3000 or so
> seems to make the problem stop, but it seems to me that this is just a
> workaround for the latency in processing the incoming packets.
> 
> So, I'm looking for some suggestions on how to fix this or to figure out
> where the latency is coming from.
> 
> Some additional information:
> 
> 
> 1) Interrupts are being processed on both cpus:
> 
> root@base0-0-0-13-0-11-1:/root> cat /proc/interrupts
>            CPU0       CPU1
>  30:    1703756    4530785  U3-MPIC Level     eth0
> 
> 
> 
> 
> 2) "top" shows a fair amount of time processing softirqs, but very
> little time in ksoftirqd (or is that a sampling artifact?).
> 
> 
> Tasks: 79 total, 1 running, 78 sleeping, 0 stopped, 0 zombie
> Cpu0: 23.6% us, 30.9% sy, 0.0% ni, 36.9% id, 0.0% wa, 0.3% hi, 8.3% si
> Cpu1: 30.4% us, 24.1% sy, 0.0% ni, 5.9% id, 0.0% wa, 0.7% hi, 38.9% si
> Mem:  4007812k total, 2199148k used,  1808664k free,     0k buffers
> Swap:   0k total,       0k used,      0k free,   219844k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5375 root      15   0 2682m 1.8g 6640 S 99.9 46.7  31:17.68 SigtranServices
>  7696 root      17   0  6952 3212 1192 S  7.3  0.1   0:15.75 schedmon.ppc210
>  7859 root      16   0  2688 1228  964 R  0.7  0.0   0:00.04 top
>  2956 root       8  -8 18940 7436 5776 S  0.3  0.2   0:01.35 blademtc
>     1 root      16   0  1660  620  532 S  0.0  0.0   0:30.62 init
>     2 root      RT   0     0    0    0 S  0.0  0.0   0:00.01 migration/0
>     3 root      15   0     0    0    0 S  0.0  0.0   0:00.55 ksoftirqd/0
>     4 root      RT   0     0    0    0 S  0.0  0.0   0:00.01 migration/1
>     5 root      15   0     0    0    0 S  0.0  0.0   0:00.43 ksoftirqd/1
> 
> 
> 3) /proc/sys/net/core/netdev_max_backlog is set to the default of 300
> 
> 
> So...anyone have any ideas/suggestions?
> 
> Thanks,
> 
> Chris

-- 
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development



* Re: questions on NAPI processing latency and dropped network packets
From: Chris Friesen @ 2008-01-10 18:12 UTC (permalink / raw)
  To: Kok, Auke; +Cc: netdev, linux-kernel
In-Reply-To: <478657C1.8040107@intel.com>

Kok, Auke wrote:

> You're using 2.6.10... you can always replace the e1000 module with the
> out-of-tree version from e1000.sf.net, this might help a bit - the version in the
> 2.6.10 kernel is very very old.

Do you have any reason to believe this would improve things?  It seems 
like the problem lies in the NAPI/softirq code rather than in the e1000 
driver itself, no?

> it also appears that your app is eating up CPU time. perhaps setting the app to a
> nicer nice level might mitigate things a bit.

If we're not handling the softirq work from ksoftirqd how would changing 
scheduler settings affect anything?

> Also turn off the in-kernel irq
> mitigation, it just causes cache misses and you really need the network irq to sit
> on a single cpu most (if not all) of the time to get the best performance. Use the
> userspace irqbalance daemon instead to achieve this.

Using userspace irqbalance would be some effort to test and deploy 
properly.  However, as a quick test I tried setting the irq affinity for 
this device and it didn't help.

One thing that might be of interest is that it seems to be bursty rather 
than gradual.  Here are some timestamps (in seconds) along with the 
number of overruns on eth0:

6552.15  overruns:260097
6552.69  overruns:260097
6553.32  overruns:260097
6553.83  overruns:260097
6554.35  overruns:260097
6554.87  overruns:260097
6555.41  overruns:260097
6555.94  overruns:260097
6556.51  overruns:260097
6557.07  overruns:260282
6557.58  overruns:260282
6558.23  overruns:260282
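
To make the burstiness easier to see, the samples above can be reduced to
the intervals in which the counter actually moved. This is only a sketch:
the helper name find_bursts is made up for illustration, and the data is
just the twelve samples listed above.

```python
# (timestamp in seconds, cumulative overrun count) as listed above
SAMPLES = [
    (6552.15, 260097), (6552.69, 260097), (6553.32, 260097),
    (6553.83, 260097), (6554.35, 260097), (6554.87, 260097),
    (6555.41, 260097), (6555.94, 260097), (6556.51, 260097),
    (6557.07, 260282), (6557.58, 260282), (6558.23, 260282),
]


def find_bursts(samples):
    """Return (t_start, t_end, new_overruns) for each interval where the counter moved."""
    bursts = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if c1 > c0:
            bursts.append((t0, t1, c1 - c0))
    return bursts


for t0, t1, n in find_bursts(SAMPLES):
    print(f"{n} overruns between t={t0} and t={t1}")
```

On this data all 185 new overruns land in a single interval of roughly half
a second, with no movement in the four seconds before it.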


Chris


* Re: SMP code / network stack
From: Arnaldo Carvalho de Melo @ 2008-01-10 17:46 UTC (permalink / raw)
  To: Jeba Anandhan; +Cc: Eric Dumazet, netdev, matthew.hattersley
In-Reply-To: <1199978819.29856.43.camel@vglwks010.vgl2.office.vaioni.com>

Em Thu, Jan 10, 2008 at 03:26:59PM +0000, Jeba Anandhan escreveu:
> Hi Eric,
> Thanks for the reply. I have one more doubt. For example, if we have 2
> processor and 4 ethernet cards. Only CPU0 does all work through 8 cards.
> If we set the affinity to each ethernet card as CPU number, will it be
> efficient?.
> 
> Will this be default behavior?
> 
> # cat /proc/interrupts 
>            CPU0       CPU1       
>   0:   11472559   74291833    IO-APIC-edge  timer
>   2:          0          0          XT-PIC  cascade
>   8:          0          1    IO-APIC-edge  rtc
>  81:          0          0   IO-APIC-level  ohci_hcd
>  97: 1830022231        847   IO-APIC-level  ehci_hcd, eth0
>  97: 3830012232        847   IO-APIC-level  ehci_hcd, eth1
>  97: 5830052231        847   IO-APIC-level  ehci_hcd, eth2
>  97: 6830032213        847   IO-APIC-level  ehci_hcd, eth3
> #sleep 10
> 
> # cat /proc/interrupts 
>            CPU0       CPU1       
>   0:   11472559   74291833    IO-APIC-edge  timer
>   2:          0          0          XT-PIC  cascade
>   8:          0          1    IO-APIC-edge  rtc
>  81:          0          0   IO-APIC-level  ohci_hcd
>  97: 2031409801        847   IO-APIC-level  ehci_hcd, eth0
>  97: 4813981390        847   IO-APIC-level  ehci_hcd, eth1
>  97: 7123982139        847   IO-APIC-level  ehci_hcd, eth2
>  97: 8030193010        847   IO-APIC-level  ehci_hcd, eth3
> 
> 
> Instead of the above mentioned ,if we set the affinity for eth2 and
> eth3.
> the output will be
> 
> # cat /proc/interrupts 
>            CPU0       CPU1       
>   0:   11472559   74291833    IO-APIC-edge  timer
>   2:          0          0          XT-PIC  cascade
>   8:          0          1    IO-APIC-edge  rtc
>  81:          0          0   IO-APIC-level  ohci_hcd
>  97: 1830022231        847   IO-APIC-level  ehci_hcd, eth0
>  97: 3830012232        847   IO-APIC-level  ehci_hcd, eth1
>  97: 5830052231        923   IO-APIC-level  ehci_hcd, eth2
>  97: 6830032213        1230   IO-APIC-level  ehci_hcd, eth3
> #sleep 10
> 
> # cat /proc/interrupts 
>            CPU0       CPU1       
>   0:   11472559   74291833    IO-APIC-edge  timer
>   2:          0          0          XT-PIC  cascade
>   8:          0          1    IO-APIC-edge  rtc
>  81:          0          0   IO-APIC-level  ohci_hcd
>  97: 2300022231        847   IO-APIC-level  ehci_hcd, eth0
>  97: 4010212232        847   IO-APIC-level  ehci_hcd, eth1
>  97: 5830052231        1847   IO-APIC-level  ehci_hcd, eth2
>  97: 6830032213        2337   IO-APIC-level  ehci_hcd, eth3
> 
> In this case, will the performance improves?.

ps ax | grep irqbalance

tells what?

If it is enabled please try:

service irqbalance stop
chkconfig irqbalance off

Then reset the smp_affinity entries to ff and try again.
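
For reference, the value written to /proc/irq/<N>/smp_affinity is a hex
bitmask with one bit per CPU, so "ff" means CPUs 0-7. A minimal sketch of
how such a mask is built follows; the function name affinity_mask is
hypothetical, and it assumes a machine with at most 32 CPUs (larger systems
use comma-separated 32-bit groups, which this does not produce).

```python
def affinity_mask(cpus):
    """Build the hex bitmask string for a collection of CPU indices."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU index must be non-negative")
        mask |= 1 << cpu          # bit N set means CPU N may service the IRQ
    return format(mask, "x")


print(affinity_mask(range(8)))    # all of CPU0..CPU7
print(affinity_mask([0]))         # CPU0 only
print(affinity_mask([1]))         # CPU1 only
```

Pinning one NIC to CPU0 and another to CPU1 would then mean writing the
second and third values to the respective smp_affinity files.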

http://www.irqbalance.org/

- Arnaldo


* Re: questions on NAPI processing latency and dropped network packets
From: Kok, Auke @ 2008-01-10 17:37 UTC (permalink / raw)
  To: Chris Friesen; +Cc: netdev, linux-kernel
In-Reply-To: <478654C3.60806@nortel.com>

Chris Friesen wrote:
> Hi all,
> 
> I've got an issue that's popped up with a deployed system running
> 2.6.10.  I'm looking for some help figuring out why incoming network
> packets aren't being processed fast enough.
> 
> After a recent userspace app change, we've started seeing packets being
> dropped by the ethernet hardware (e1000, NAPI is enabled).  The
> error/dropped/fifo counts are going up in ethtool:
> 
>      rx_packets: 32180834
>      rx_bytes: 5480756958
>      rx_errors: 862506
>      rx_dropped: 771345
>      rx_length_errors: 0
>      rx_over_errors: 0
>      rx_crc_errors: 0
>      rx_frame_errors: 0
>      rx_fifo_errors: 91161
>      rx_missed_errors: 91161
> 
> This link is receiving roughly 13K packets/sec, and we're dropping
> roughly 51 packets/sec due to fifo errors.
> 
> Increasing the rx descriptor ring size from 256 up to around 3000 or so
> seems to make the problem stop, but it seems to me that this is just a
> workaround for the latency in processing the incoming packets.
> 
> So, I'm looking for some suggestions on how to fix this or to figure out
> where the latency is coming from.
> 
> Some additional information:
> 
> 
> 1) Interrupts are being processed on both cpus:
> 
> root@base0-0-0-13-0-11-1:/root> cat /proc/interrupts
>            CPU0       CPU1
>  30:    1703756    4530785  U3-MPIC Level     eth0
> 
> 
> 
> 
> 2) "top" shows a fair amount of time processing softirqs, but very
> little time in ksoftirqd (or is that a sampling artifact?).
> 
> 
> Tasks: 79 total, 1 running, 78 sleeping, 0 stopped, 0 zombie
> Cpu0: 23.6% us, 30.9% sy, 0.0% ni, 36.9% id, 0.0% wa, 0.3% hi, 8.3% si
> Cpu1: 30.4% us, 24.1% sy, 0.0% ni, 5.9% id, 0.0% wa, 0.7% hi, 38.9% si
> Mem:  4007812k total, 2199148k used,  1808664k free,     0k buffers
> Swap:   0k total,       0k used,      0k free,   219844k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5375 root      15   0 2682m 1.8g 6640 S 99.9 46.7  31:17.68 SigtranServices
>  7696 root      17   0  6952 3212 1192 S  7.3  0.1   0:15.75 schedmon.ppc210
>  7859 root      16   0  2688 1228  964 R  0.7  0.0   0:00.04 top
>  2956 root       8  -8 18940 7436 5776 S  0.3  0.2   0:01.35 blademtc
>     1 root      16   0  1660  620  532 S  0.0  0.0   0:30.62 init
>     2 root      RT   0     0    0    0 S  0.0  0.0   0:00.01 migration/0
>     3 root      15   0     0    0    0 S  0.0  0.0   0:00.55 ksoftirqd/0
>     4 root      RT   0     0    0    0 S  0.0  0.0   0:00.01 migration/1
>     5 root      15   0     0    0    0 S  0.0  0.0   0:00.43 ksoftirqd/1
> 
> 
> 3) /proc/sys/net/core/netdev_max_backlog is set to the default of 300
> 
> 
> So...anyone have any ideas/suggestions?

You're using 2.6.10... you can always replace the e1000 module with the
out-of-tree version from e1000.sf.net, this might help a bit - the version in the
2.6.10 kernel is very very old.

it also appears that your app is eating up CPU time. perhaps setting the app to a
nicer nice level might mitigate things a bit. Also turn off the in-kernel irq
mitigation, it just causes cache misses and you really need the network irq to sit
on a single cpu most (if not all) of the time to get the best performance. Use the
userspace irqbalance daemon instead to achieve this.

Auke



* Re: [PATCH 2.6.23+] ingress classify to [nf]mark
From: Patrick McHardy @ 2008-01-10 17:29 UTC (permalink / raw)
  To: mahatma; +Cc: netdev
In-Reply-To: <47866C69.3080904@bspu.unibel.by>

Dzianis Kahanovich wrote:
> --- linux-2.6.23-gentoo-r2/net/sched/sch_ingress.c
> +++ linux-2.6.23-gentoo-r2.fixed/net/sched/sch_ingress.c
> @@ -161,2 +161,5 @@
>              skb->tc_index = TC_H_MIN(res.classid);
> +#ifdef CONFIG_NET_SCH_INGRESS_TC2MARK
> +            skb->mark = (skb->mark&(res.classid>>16))|TC_H_MIN(res.classid);
> +#endif
>          default:


Behaviour like this shouldn't depend on compile-time options.



* Re: e1000 performance issue in 4 simultaneous links
From: Breno Leitao @ 2008-01-10 17:31 UTC (permalink / raw)
  To: bhutchings
In-Reply-To: <20080110163626.GJ3544@solarflare.com>

On Thu, 2008-01-10 at 16:36 +0000, Ben Hutchings wrote:
> > When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> > of transfer rate. If I run 4 netperf against 4 different interfaces, I
> > get around 720 * 10^6 bits/sec.
> <snip>
> 
> I take it that's the average for individual interfaces, not the
> aggregate?
Right, each of these results are for individual interfaces. Otherwise,
we'd have a huge problem. :-)
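
Spelled out as arithmetic, using only the two figures quoted above (the
function name summarize is made up for illustration):

```python
def summarize(single_mbps, per_link_mbps, links):
    """Return (aggregate Mbit/s, per-link rate as a fraction of the single-link rate)."""
    aggregate = per_link_mbps * links
    ratio = per_link_mbps / single_mbps
    return aggregate, ratio


aggregate, ratio = summarize(single_mbps=940.95, per_link_mbps=720.0, links=4)
print(f"aggregate ~ {aggregate:.0f} Mbit/s, each link at {ratio:.1%} of the single-link rate")
```

So the aggregate is about 2880 * 10^6 bits/sec, with each link running at
roughly three quarters of what a single link achieves on its own.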

> This can be mitigated by interrupt moderation and NAPI
> polling, jumbo frames (MTU >1500) and/or Large Receive Offload (LRO).
> I don't think e1000 hardware does LRO, but the driver could presumably
> be changed to use Linux's software LRO.
Without using these "features" and keeping the MTU as 1500, do you think
we could get a better performance than this one?

I also tried to increase my interface MTU to 9000, but I am afraid that
netperf only transmits packets with less than 1500. Still investigating.

> single CPU this can become a bottleneck.  Does the test system have
> multiple CPUs?  Are IRQs for the multiple NICs balanced across
> multiple CPUs?
Yes, this machine has 8 ppc 1.9Ghz CPUs. And the IRQs are balanced
across the CPUs, as I see in /proc/interrupts: 

# cat /proc/interrupts 
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
 16:        940        760       1047        904        993        777        975        813   XICS      Level     IPI
 18:          4          3          4          1          3          6          8          3   XICS      Level     hvc_console
 19:          0          0          0          0          0          0          0          0   XICS      Level     RAS_EPOW
273:      10728      10850      10937      10833      10884      10788      10868      10776   XICS      Level     eth4
275:          0          0          0          0          0          0          0          0   XICS      Level     ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3
277:     234933     230275     229770     234048     235906     229858     229975     233859   XICS      Level     eth6
278:     266225     267606     262844     265985     268789     266869     263110     267422   XICS      Level     eth7
279:        893        919        857        909        867        917        894        881   XICS      Level     eth0
305:     439246     439117     438495     436072     438053     440111     438973     438951   XICS      Level     eth0 Neterion Xframe II 10GbE network adapter
321:       3268       3088       3143       3113       3305       2982       3326       3084   XICS      Level     ipr
323:     268030     273207     269710     271338     270306     273258     270872     273281   XICS      Level     eth16
324:     215012     221102     219494     216732     216531     220460     219718     218654   XICS      Level     eth17
325:       7103       3580       7246       3475       7132       3394       7258       3435   XICS      Level     pata_pdc2027x
BAD:       4216
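
How evenly an IRQ is spread can be read off this table mechanically. The
following is a rough sketch, not a general /proc/interrupts parser: the
helper names are hypothetical, the sample holds two rows copied from the
output above, and rows without one counter per CPU (such as "BAD:") are
simply skipped.

```python
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
277:     234933     230275     229770     234048     235906     229858     229975     233859   XICS      Level     eth6
278:     266225     267606     262844     265985     268789     266869     263110     267422   XICS      Level     eth7
BAD:       4216
"""


def parse_interrupts(text):
    """Map IRQ id -> (per-CPU counts, description) for rows with one counter per CPU."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())              # header row: CPU0 CPU1 ...
    table = {}
    for ln in lines[1:]:
        irq, _, rest = ln.partition(":")
        fields = rest.split()
        if len(fields) < ncpu or not all(f.isdigit() for f in fields[:ncpu]):
            continue                          # summary rows carry a single counter
        counts = [int(f) for f in fields[:ncpu]]
        table[irq.strip()] = (counts, " ".join(fields[ncpu:]))
    return table


def max_share(counts):
    """Fraction of this IRQ's interrupts handled by its busiest CPU."""
    total = sum(counts)
    return max(counts) / total if total else 0.0


for irq, (counts, label) in parse_interrupts(SAMPLE).items():
    print(f"IRQ {irq} ({label}): busiest CPU handled {max_share(counts):.1%}")
```

With 8 CPUs a perfectly even spread is 12.5% per CPU; an IRQ pinned to one
CPU would show a share close to 100% instead.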

Thanks,

-- 
Breno Leitao <leitao@linux.vnet.ibm.com>



* [PATCH 2.6.23+] ingress classify to [nf]mark
From: Dzianis Kahanovich @ 2008-01-10 19:05 UTC (permalink / raw)
  To: netdev

Map "classid x:y" to "mark = (mark & x) | y" (so "classid :y" acts like "-j MARK --set-mark y", etc).

--- linux-2.6.23-gentoo-r2/net/sched/Kconfig
+++ linux-2.6.23-gentoo-r2.fixed/net/sched/Kconfig
@@ -222,6 +222,16 @@
  	  To compile this code as a module, choose M here: the
  	  module will be called sch_ingress.

+config NET_SCH_INGRESS_TC2MARK
+	bool "ingress classify -> mark"
+	depends on NET_SCH_INGRESS && NET_CLS_ACT
+	---help---
+	  This enables access to "mark" value via "classid"
+	  Example: set "tc filter ... flowid|classid 1:2"
+	  eq "netfilter mark" mark=mark&1|2
+	
+	  But classid may be undefined (?) - use "flowid :0".
+
  comment "Classification"

  config NET_CLS
--- linux-2.6.23-gentoo-r2/net/sched/sch_ingress.c
+++ linux-2.6.23-gentoo-r2.fixed/net/sched/sch_ingress.c
@@ -161,2 +161,5 @@
  			skb->tc_index = TC_H_MIN(res.classid);
+#ifdef CONFIG_NET_SCH_INGRESS_TC2MARK
+			skb->mark = (skb->mark&(res.classid>>16))|TC_H_MIN(res.classid);
+#endif
  		default:
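
The bit arithmetic of the added line can be modelled outside the kernel as
follows. This is an illustrative sketch only: the Python helper names are
made up, tc_h_min mirrors the kernel's TC_H_MIN() macro (low 16 bits of the
handle), and the major and minor numbers of a classid are taken as 16-bit
values.

```python
def make_classid(major, minor):
    """Pack a tc handle 'major:minor' into its 32-bit form."""
    return ((major & 0xFFFF) << 16) | (minor & 0xFFFF)


def tc_h_min(classid):
    """Low 16 bits of the handle, as TC_H_MIN() does in the kernel."""
    return classid & 0xFFFF


def new_mark(mark, classid):
    """Model of: skb->mark = (skb->mark & (res.classid >> 16)) | TC_H_MIN(res.classid)."""
    return (mark & (classid >> 16)) | tc_h_min(classid)


print(hex(new_mark(0xFF, make_classid(0x1, 0x2))))     # classid 1:2 -> (mark & 1) | 2
print(hex(new_mark(0xFF, make_classid(0x0, 0x5))))     # classid :5  -> mark becomes 5
print(hex(new_mark(0x10, make_classid(0xFFFF, 0x0))))  # classid ffff:0 keeps the low 16 bits
```

One consequence visible in this model: since the major number is only 16
bits wide, the AND step can never preserve the upper 16 bits of the mark.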


-- 
WBR,
Denis Kaganovich,  mahatma@eu.by  http://mahatma.bspu.unibel.by


* [PROCFS] [NETNS] issue with /proc/net entries
From: Benjamin Thery @ 2008-01-10 17:24 UTC (permalink / raw)
  To: ebiederm; +Cc: netdev, linux-kernel

Hi Eric,

While testing the current network namespace stuff merged in net-2.6.25,
I bumped into the following problem with the /proc/net/ entries.
It doesn't always display the actual data of the current namespace,
but sometimes displays data from other namespaces.

I bisected the problem to the commit:
"proc: remove/Fix proc generic d_revalidate"
3790ee4bd86396558eedd86faac1052cb782e4e1

The problem: If a process in a particular network namespace changes
current directory to /proc/net, then processes in other network
namespaces trying to look at /proc/net entries will see data from the
first namespace (the one with CWD /proc/net). (See test case below).

As your comments in the commit suggest, you seem to be aware of some
issues when CONFIG_NET_NS=y. Is it one of these corner cases you
identified? Any idea on how we can fix it?

Thanks.

Benjamin


Test case:
----------
(1) Shell 1, in init namespace:
$ cat /proc/net/dev
lo ...
eth0 ...

(2) Shell 2, in another network namespace
$ cat /proc/net/dev
lo ...

(3) Shell 1
$ cd /proc/net
$ cat dev
lo ...
eth0 ...

(4) Shell 2
$ cat /proc/net/dev
lo ...
eth0 ...

Argh, lo + eth0 in child namespace.... the device list of init netns
is displayed in /proc/net/dev of child namespace :-(

(5) Shell 1
$ cd /

(6) Shell 2
$ cat /proc/net/dev
lo ...

Back to normality.
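
To compare the two shells without eyeballing the output, the interface names
can be extracted from /proc/net/dev with a few lines of script. This is a
sketch under assumptions: the helper name device_names is made up, the
sample text is invented to match the usual two-header-line layout of the
file, and the counters in it are not real measurements.

```python
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1234      10    0    0    0     0          0         0     1234      10    0    0    0     0       0          0
  eth0:    5678      20    0    0    0     0          0         0     4321      15    0    0    0     0       0          0
"""


def device_names(proc_net_dev_text):
    """Return the interface names listed in the contents of /proc/net/dev."""
    names = []
    for line in proc_net_dev_text.splitlines()[2:]:    # skip the two header lines
        name, sep, _ = line.partition(":")
        if sep:                                        # only "name: counters..." rows
            names.append(name.strip())
    return names


print(device_names(SAMPLE))
```

Running the same function on the real file in each shell (for example via
open("/proc/net/dev").read()) at steps (2) and (4) should give different
lists if the bug is present.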


-- 
B e n j a m i n   T h e r y  - BULL/DT/Open Software R&D

    http://www.bull.com


* questions on NAPI processing latency and dropped network packets
From: Chris Friesen @ 2008-01-10 17:24 UTC (permalink / raw)
  To: netdev, linux-kernel

Hi all,

I've got an issue that's popped up with a deployed system running 
2.6.10.  I'm looking for some help figuring out why incoming network 
packets aren't being processed fast enough.

After a recent userspace app change, we've started seeing packets being 
dropped by the ethernet hardware (e1000, NAPI is enabled).  The 
error/dropped/fifo counts are going up in ethtool:

      rx_packets: 32180834
      rx_bytes: 5480756958
      rx_errors: 862506
      rx_dropped: 771345
      rx_length_errors: 0
      rx_over_errors: 0
      rx_crc_errors: 0
      rx_frame_errors: 0
      rx_fifo_errors: 91161
      rx_missed_errors: 91161

This link is receiving roughly 13K packets/sec, and we're dropping 
roughly 51 packets/sec due to fifo errors.

Increasing the rx descriptor ring size from 256 up to around 3000 or so 
seems to make the problem stop, but it seems to me that this is just a 
workaround for the latency in processing the incoming packets.
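
As a back-of-the-envelope check of these numbers, the ring size can be
converted into the time the NIC can keep receiving while nothing drains the
ring. This is a sketch only: it assumes a perfectly steady arrival rate of
13,000 packets/sec (real traffic is unlikely to be that smooth), and the
function name ring_headroom_ms is made up for illustration.

```python
def ring_headroom_ms(ring_entries, packets_per_sec):
    """Milliseconds of steady arrivals that a completely empty RX ring can absorb."""
    return ring_entries / packets_per_sec * 1000.0


RATE_PPS = 13_000   # "roughly 13K packets/sec"
DROP_PPS = 51       # "roughly 51 packets/sec" dropped

for entries in (256, 3000):
    ms = ring_headroom_ms(entries, RATE_PPS)
    print(f"{entries:5d} descriptors ~ {ms:6.1f} ms of headroom")

print(f"drop fraction ~ {DROP_PPS / RATE_PPS:.2%}")
```

Under that assumption 256 descriptors cover about 20 ms and 3000 cover about
230 ms, which would fit with occasional stalls in RX processing lasting tens
of milliseconds rather than a sustained shortfall.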

So, I'm looking for some suggestions on how to fix this or to figure out 
where the latency is coming from.

Some additional information:


1) Interrupts are being processed on both cpus:

root@base0-0-0-13-0-11-1:/root> cat /proc/interrupts
            CPU0       CPU1
  30:    1703756    4530785  U3-MPIC Level     eth0




2) "top" shows a fair amount of time processing softirqs, but very 
little time in ksoftirqd (or is that a sampling artifact?).


Tasks: 79 total, 1 running, 78 sleeping, 0 stopped, 0 zombie
Cpu0: 23.6% us, 30.9% sy, 0.0% ni, 36.9% id, 0.0% wa, 0.3% hi, 8.3% si
Cpu1: 30.4% us, 24.1% sy, 0.0% ni, 5.9% id, 0.0% wa, 0.7% hi, 38.9% si
Mem:  4007812k total, 2199148k used,  1808664k free,     0k buffers
Swap:   0k total,       0k used,      0k free,   219844k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  5375 root      15   0 2682m 1.8g 6640 S 99.9 46.7  31:17.68 SigtranServices
  7696 root      17   0  6952 3212 1192 S  7.3  0.1   0:15.75 schedmon.ppc210
  7859 root      16   0  2688 1228  964 R  0.7  0.0   0:00.04 top
  2956 root       8  -8 18940 7436 5776 S  0.3  0.2   0:01.35 blademtc
     1 root      16   0  1660  620  532 S  0.0  0.0   0:30.62 init
     2 root      RT   0     0    0    0 S  0.0  0.0   0:00.01 migration/0
     3 root      15   0     0    0    0 S  0.0  0.0   0:00.55 ksoftirqd/0
     4 root      RT   0     0    0    0 S  0.0  0.0   0:00.01 migration/1
     5 root      15   0     0    0    0 S  0.0  0.0   0:00.43 ksoftirqd/1


3) /proc/sys/net/core/netdev_max_backlog is set to the default of 300


So...anyone have any ideas/suggestions?

Thanks,

Chris


* Re: e1000 performance issue in 4 simultaneous links
From: Jeba Anandhan @ 2008-01-10 16:51 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Breno Leitao, netdev
In-Reply-To: <20080110163626.GJ3544@solarflare.com>

Ben,
I am facing a performance issue when we try to bond multiple
interfaces into a virtual interface. It could be related to this thread.
My questions are:
*) When we use multiple NICs, will the overall system throughput be the
sum of the individual links' XX bits/sec?
*) Which factors improve performance when we have multiple
interfaces? [e.g. tuning parameters in /proc]

Breno, 
I hope this thread will be helpful for the performance issue which I have
with bonding driver.

Jeba
On Thu, 2008-01-10 at 16:36 +0000, Ben Hutchings wrote:
> Breno Leitao wrote:
> > Hello, 
> > 
> > I've perceived that there is a performance issue when running netperf
> > against 4 e1000 links connected end-to-end to another machine with 4
> > e1000 interfaces. 
> > 
> > I have 2 4-port interfaces on my machine, but the test is just
> > considering 2 port for each interfaces card.
> > 
> > When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> > of transfer rate. If I run 4 netperf against 4 different interfaces, I
> > get around 720 * 10^6 bits/sec.
> <snip>
> 
> I take it that's the average for individual interfaces, not the
> aggregate?  RX processing for multi-gigabits per second can be quite
> expensive.  This can be mitigated by interrupt moderation and NAPI
> polling, jumbo frames (MTU >1500) and/or Large Receive Offload (LRO).
> I don't think e1000 hardware does LRO, but the driver could presumably
> be changed to use Linux's software LRO.
> 
> Even with these optimisations, if all RX processing is done on a
> single CPU this can become a bottleneck.  Does the test system have
> multiple CPUs?  Are IRQs for the multiple NICs balanced across
> multiple CPUs?
> 
> Ben.
> 


