* RE: ipip tunnel code (IPV4)
From: Templin, Fred L @ 2008-01-10 17:58 UTC (permalink / raw)
To: Andy Johnson, netdev
In-Reply-To: <147a89290801100634t7854a101w203de150982b0284@mail.gmail.com>
Andy,
> -----Original Message-----
> From: Andy Johnson [mailto:johnsonzjo@gmail.com]
> Sent: Thursday, January 10, 2008 6:35 AM
> To: netdev@vger.kernel.org
> Subject: ipip tunnel code (IPV4)
>
> Hello,
>
> I am trying to learn the IPV4 ipip tunnel code (net/ipv4/ipip.c)
> and I have two little questions about
> semantics of variables:
>
> ipip_fb_tunnel_init - what does "fb" stand for ?
>
> In tunnels_wc : what does "wc" stand for ?
Similar names occur in net/ipv6/sit.c, which is the
IPv6-in-IPv4 analog of ipip.c. I am 90% certain that
"wc" stands for "wildcard" - it is used for selecting
the default tunnel interface when no other tunnel
interfaces match a specific (src, dst) pair.
In that light, I assume "fb" stands for something like
"fallback" although I am not certain. It would seem to
fit though, because the "fallback" tunnel interface is
the one that is selected by a "wildcard" match.
Would be interested if anyone could confirm or correct
my assumptions.
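To make the "wildcard"/"fallback" reading concrete, here is a minimal
Python sketch of the lookup order as I understand it. It is a simplified
model, not the kernel code: the Tunnel class, the flat table and the
device names (including "tunl0" for the fallback device) are assumptions
made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

ANY = None  # stands for an unset (wildcard) address


@dataclass(frozen=True)
class Tunnel:
    name: str
    remote: Optional[str] = ANY
    local: Optional[str] = ANY


def lookup(tunnels, remote, local):
    """Return the most specific tunnel for a packet from `remote` to `local`."""
    # Try the most specific key first, then progressively looser ones.
    for want_remote, want_local in ((remote, local), (remote, ANY),
                                    (ANY, local), (ANY, ANY)):
        for tunnel in tunnels:
            if tunnel.remote == want_remote and tunnel.local == want_local:
                return tunnel
    return None


tunnels = [
    Tunnel("tunl0"),  # fallback device: both ends are wildcards
    Tunnel("tun1", remote="192.0.2.1", local="198.51.100.1"),
    Tunnel("tun2", remote="192.0.2.2"),
]

print(lookup(tunnels, "192.0.2.1", "198.51.100.1").name)
print(lookup(tunnels, "203.0.113.5", "198.51.100.1").name)
```

In this model a packet that matches no configured (src, dst) pair ends
up on the wildcard entry, which is the role I am guessing the "fb"
device plays.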
Thanks - Fred
fred.l.templin@boeing.com
> Regards,
> Andy
* Re: questions on NAPI processing latency and dropped network packets
From: Rick Jones @ 2008-01-10 18:41 UTC (permalink / raw)
To: Chris Friesen; +Cc: netdev, linux-kernel
In-Reply-To: <478654C3.60806@nortel.com>
> 1) Interrupts are being processed on both cpus:
>
> root@base0-0-0-13-0-11-1:/root> cat /proc/interrupts
> CPU0 CPU1
> 30: 1703756 4530785 U3-MPIC Level eth0
IIRC none of the e1000 driven cards are multi-queue, so while the above
shows that interrupts from eth0 have been processed on both CPUs at
various points in the past, it doesn't necessarily mean that they are
being processed on both CPUs at the same time, right?
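One way to check is to take two snapshots of /proc/interrupts a few
seconds apart and look at the per-CPU deltas rather than the lifetime
totals. A rough Python sketch follows; the parsing assumes the usual
"IRQ: count count ... description" layout, and the second snapshot is
made up for illustration:

```python
def parse_interrupts(text):
    """Map IRQ label -> list of per-CPU counts from /proc/interrupts text."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        # Rows such as "BAD: 4216" may carry fewer columns than there are CPUs.
        if len(fields) < ncpus or not all(f.isdigit() for f in fields[:ncpus]):
            continue
        counts[label.strip()] = [int(f) for f in fields[:ncpus]]
    return counts


def per_cpu_delta(before, after):
    """Interrupts handled by each CPU between two snapshots."""
    return {irq: [a - b for a, b in zip(after[irq], before[irq])]
            for irq in before if irq in after}


first = """CPU0 CPU1
30: 1703756 4530785 U3-MPIC Level eth0"""
# Hypothetical second snapshot, taken a little later.
second = """CPU0 CPU1
30: 1703756 4543785 U3-MPIC Level eth0"""

print(per_cpu_delta(parse_interrupts(first), parse_interrupts(second)))
```

On a live system the two strings would come from reading
/proc/interrupts twice. A delta that is zero on one CPU would say the
interrupts are currently landing on the other one only.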
rick jones
* Re: e1000 performance issue in 4 simultaneous links
From: Rick Jones @ 2008-01-10 18:37 UTC (permalink / raw)
To: Breno Leitao; +Cc: bhutchings, Linux Network Development list
In-Reply-To: <1199986291.8931.62.camel@cafe>
> I also tried to increase my interface MTU to 9000, but I am afraid that
> netperf only transmits packets with less than 1500. Still investigating.
It may seem like picking a tiny nit, but netperf never transmits
packets. It only provides buffers of specified size to the stack. It is
then the stack which transmits and determines the size of the packets on
the network.
Drifting a bit more...
While there are settings, conditions and known stack behaviours where
one can be confident of the packet size on the network based on the
options passed to netperf, generally speaking one should not ass-u-me a
direct relationship between the options one passes to netperf and the
size of the packets on the network.
And for jumbo frames to be effective the larger MTU must be set on both
ends; otherwise the TCP MSS exchange will result in the smaller of the
two MTUs "winning", as it were.
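As a back-of-the-envelope illustration of that last point, here is a
small Python sketch. It assumes IPv4 and TCP headers of 20 bytes each
with no options, and it ignores path MTU discovery and segmentation
offloads; the function names are mine:

```python
IPV4_HEADER = 20  # bytes, no IP options
TCP_HEADER = 20   # bytes, no TCP options


def advertised_mss(mtu):
    """MSS a host would advertise for a given interface MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def effective_mss(mtu_a, mtu_b):
    """Neither side sends more than the peer advertised, so the smaller wins."""
    return min(advertised_mss(mtu_a), advertised_mss(mtu_b))


def min_segments(send_size, mss):
    """Lower bound on the segments needed to carry one send buffer."""
    return -(-send_size // mss)  # ceiling division


print(effective_mss(9000, 1500))  # jumbo frames on one end only
print(effective_mss(9000, 9000))  # jumbo frames on both ends
print(min_segments(16384, effective_mss(9000, 1500)))
```

The send size handed to netperf (16384 here, picked arbitrarily) only
sets how much data is given to the stack per call; how that data is cut
into packets is up to the stack.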
>>single CPU this can become a bottleneck. Does the test system have
>>multiple CPUs? Are IRQs for the multiple NICs balanced across
>>multiple CPUs?
>
> Yes, this machine has 8 ppc 1.9Ghz CPUs. And the IRQs are balanced
> across the CPUs, as I see in /proc/interrupts:
That suggests, to me anyway, that the dreaded irqbalance daemon is running,
shuffling the interrupts as you go. That is not often a happy place for
running netperf when one wants consistent results.
>
> # cat /proc/interrupts
> CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
> 16: 940 760 1047 904 993 777 975 813 XICS Level IPI
> 18: 4 3 4 1 3 6 8 3 XICS Level hvc_console
> 19: 0 0 0 0 0 0 0 0 XICS Level RAS_EPOW
> 273: 10728 10850 10937 10833 10884 10788 10868 10776 XICS Level eth4
> 275: 0 0 0 0 0 0 0 0 XICS Level ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3
> 277: 234933 230275 229770 234048 235906 229858 229975 233859 XICS Level eth6
> 278: 266225 267606 262844 265985 268789 266869 263110 267422 XICS Level eth7
> 279: 893 919 857 909 867 917 894 881 XICS Level eth0
> 305: 439246 439117 438495 436072 438053 440111 438973 438951 XICS Level eth0 Neterion Xframe II 10GbE network adapter
> 321: 3268 3088 3143 3113 3305 2982 3326 3084 XICS Level ipr
> 323: 268030 273207 269710 271338 270306 273258 270872 273281 XICS Level eth16
> 324: 215012 221102 219494 216732 216531 220460 219718 218654 XICS Level eth17
> 325: 7103 3580 7246 3475 7132 3394 7258 3435 XICS Level pata_pdc2027x
> BAD: 4216
IMO, what you want (in the absence of multi-queue NICs) is one CPU
taking the interrupts of one port/interface, and each port/interface's
interrupts going to a separate CPU. So, something that looks roughly
like this concocted example:
      CPU0    CPU1    CPU2    CPU3
1:    1234       0       0       0   eth0
2:       0    1234       0       0   eth1
3:       0       0    1234       0   eth2
4:       0       0       0    1234   eth3
which you should be able to achieve via the method I think someone else
has already mentioned: echoing values into
/proc/irq/<irq>/smp_affinity, after you have slain the dreaded
irqbalance daemon.
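For what it is worth, the values to echo are hexadecimal CPU bitmasks,
with bit N standing for CPU N. A small Python sketch that prints the
commands for a layout like the concocted one; the IRQ numbers 1-4 are
the made-up ones from that example, not real ones, and the helper names
are mine:

```python
def affinity_mask(cpu):
    """Hex bitmask selecting a single CPU, as written to smp_affinity."""
    return format(1 << cpu, "x")


def pin_commands(irq_to_cpu):
    """Shell commands that pin each IRQ to its own CPU."""
    return ["echo %s > /proc/irq/%d/smp_affinity" % (affinity_mask(cpu), irq)
            for irq, cpu in sorted(irq_to_cpu.items())]


# IRQ number -> CPU, matching the concocted table (eth0..eth3 on CPU0..CPU3).
layout = {1: 0, 2: 1, 3: 2, 4: 3}
for command in pin_commands(layout):
    print(command)
```

The commands need to be run as root, and the real IRQ numbers have to be
read from /proc/interrupts on the system in question.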
rick jones
* [PATCH] New driver "sfc" for Solarstorm SFC4000 controller - 4th attempt
From: Robert Stonehouse @ 2008-01-10 18:29 UTC (permalink / raw)
To: jgarzik, netdev; +Cc: spope, linux-net-drivers
This is a resubmission of a new driver for Solarflare network controllers.
The driver supports several types of PHY (10Gbase-T, XFP, CX4) on six
different 10G and 1G boards.
Hardware based on this network controller is now available from SMC as
part numbers SMC10GPCIe-XFP and SMC10GPCIe-10BT.
The previous thread was:
http://marc.info/?l=linux-netdev&m=119825632209357&w=2
Thanks to the people who looked at the previous patches. We have addressed
the following from comments received after the 3rd submission:
- Kerneldoc style comment
- Kconfig changes
- Reduced size slightly
I am also sending a request to linux-mtd@lists.infradead.org for review of
the MTD part of the driver.
Previous reviewers have noted that the driver is quite large (but it
would not be the largest network driver by source or compiled module
size). I think it is a reasonable size for a driver that supports a
fully featured NIC, across a range of MACs, PHYs and silicon
revisions.
One aspect that is worth mentioning is that the NIC has no firmware.
A benefit is no dreaded binary blob! A downside is that more support
code is needed but this tends to be around initialisation and is
readable commented C.
To give a small breakdown of the sizes of the different driver parts
(wc output):
                              |  lines  words   bytes
Core control/datapath         |   5001  16405  139467 = efx.c rx.c tx.c
Controller HW support         |   3653  11823  107554 = falcon.c
HW defs                       |   1588   4838   47050 = falcon_hwdefs.h
board support                 |   1848   7105   52455
MAC support                   |   1623   4977   51007
PHY support                   |   2196   7904   67711
Headers                       |   4565  20645  162402
Self test code                |    863   3088   24981
Ethtool support               |    751   2144   22845
MTD code (separate module)    |   1021   3200   26944
Debugfs Code (KConfig option) |    863   2543   24896
Are there further review comments that we need to address before it can be
merged?
The patch (against net-2.6.25) is at:
https://support.solarflare.com/netdev/4/net-2.6.25-sfc-2.2.0038.patch
The new files may also be downloaded as a tarball:
https://support.solarflare.com/netdev/4/net-2.6.25-sfc-2.2.0038.tgz
And for verification there is:
https://support.solarflare.com/netdev/4/MD5SUMS
Regards
--
Rob Stonehouse
* Re: e1000 performance issue in 4 simultaneous links
From: Kok, Auke @ 2008-01-10 18:18 UTC (permalink / raw)
To: Breno Leitao; +Cc: bhutchings, NetDev
In-Reply-To: <1199986291.8931.62.camel@cafe>
Breno Leitao wrote:
> On Thu, 2008-01-10 at 16:36 +0000, Ben Hutchings wrote:
>>> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
>>> of transfer rate. If I run 4 netperf against 4 different interfaces, I
>>> get around 720 * 10^6 bits/sec.
>> <snip>
>>
>> I take it that's the average for individual interfaces, not the
>> aggregate?
> Right, each of these results are for individual interfaces. Otherwise,
> we'd have a huge problem. :-)
>
>> This can be mitigated by interrupt moderation and NAPI
>> polling, jumbo frames (MTU >1500) and/or Large Receive Offload (LRO).
>> I don't think e1000 hardware does LRO, but the driver could presumably
>> be changed use Linux's software LRO.
> Without using these "features" and keeping the MTU as 1500, do you think
> we could get a better performance than this one?
>
> I also tried to increase my interface MTU to 9000, but I am afraid that
> netperf only transmits packets with less than 1500. Still investigating.
>
>> single CPU this can become a bottleneck. Does the test system have
>> multiple CPUs? Are IRQs for the multiple NICs balanced across
>> multiple CPUs?
> Yes, this machine has 8 ppc 1.9Ghz CPUs. And the IRQs are balanced
> across the CPUs, as I see in /proc/interrupts:
...which is wrong and hurts performance. You want your ethernet IRQs to stick to
one CPU for long periods to prevent cache thrashing.
Please disable the in-kernel IRQ balancing code and use the userspace `irqbalance`
daemon.
Gee I should put that in my signature, I already wrote that twice today :)
Auke
>
> # cat /proc/interrupts
> CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
> 16: 940 760 1047 904 993 777 975 813 XICS Level IPI
> 18: 4 3 4 1 3 6 8 3 XICS Level hvc_console
> 19: 0 0 0 0 0 0 0 0 XICS Level RAS_EPOW
> 273: 10728 10850 10937 10833 10884 10788 10868 10776 XICS Level eth4
> 275: 0 0 0 0 0 0 0 0 XICS Level ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3
> 277: 234933 230275 229770 234048 235906 229858 229975 233859 XICS Level eth6
> 278: 266225 267606 262844 265985 268789 266869 263110 267422 XICS Level eth7
> 279: 893 919 857 909 867 917 894 881 XICS Level eth0
> 305: 439246 439117 438495 436072 438053 440111 438973 438951 XICS Level eth0 Neterion Xframe II 10GbE network adapter
> 321: 3268 3088 3143 3113 3305 2982 3326 3084 XICS Level ipr
> 323: 268030 273207 269710 271338 270306 273258 270872 273281 XICS Level eth16
> 324: 215012 221102 219494 216732 216531 220460 219718 218654 XICS Level eth17
> 325: 7103 3580 7246 3475 7132 3394 7258 3435 XICS Level pata_pdc2027x
> BAD: 4216
>
> Thanks,
>
* Re: SMP code / network stack
From: Kok, Auke @ 2008-01-10 18:31 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, Jeba Anandhan, Eric Dumazet, netdev,
matthew.hattersley
In-Reply-To: <20080110174657.GL22437@ghostprotocols.net>
Arnaldo Carvalho de Melo wrote:
> Em Thu, Jan 10, 2008 at 03:26:59PM +0000, Jeba Anandhan escreveu:
>> Hi Eric,
>> Thanks for the reply. I have one more doubt. For example, if we have 2
>> processor and 4 ethernet cards. Only CPU0 does all work through 8 cards.
>> If we set the affinity to each ethernet card as CPU number, will it be
>> efficient?.
>>
>> Will this be default behavior?
>>
>> # cat /proc/interrupts
>> CPU0 CPU1
>> 0: 11472559 74291833 IO-APIC-edge timer
>> 2: 0 0 XT-PIC cascade
>> 8: 0 1 IO-APIC-edge rtc
>> 81: 0 0 IO-APIC-level ohci_hcd
>> 97: 1830022231 847 IO-APIC-level ehci_hcd, eth0
>> 97: 3830012232 847 IO-APIC-level ehci_hcd, eth1
>> 97: 5830052231 847 IO-APIC-level ehci_hcd, eth2
>> 97: 6830032213 847 IO-APIC-level ehci_hcd, eth3
Another thing to try: if you don't need USB 2.0 support, remove the ehci_hcd module.
This will slightly reduce the overhead of servicing IRQs in your system.
I take it that you have no MSI support in these ethernet cards?
Auke
* Re: e1000 performance issue in 4 simultaneous links
From: Rick Jones @ 2008-01-10 18:26 UTC (permalink / raw)
To: Breno Leitao; +Cc: netdev
In-Reply-To: <1199981839.8931.35.camel@cafe>
Many many things to check when running netperf :)
*) are the cards on the same or separate PCImumble bus, and what sort of bus?
*) is the two-interface performance measured with two interfaces on the
same four-port card, or with an interface from each of the two four-port
cards?
*) is there a dreaded (IMO) irqbalance daemon running? one of the very
first things I do when running netperf is terminate the irqbalance
daemon with as extreme a prejudice as I can.
*) what is the distribution of interrupts from the interfaces to the
CPUs? if you've tried to set that manually, the dreaded irqbalance
daemon will come along shortly thereafter and ruin everything.
*) what does netperf say about the overall CPU utilization of the
system(s) when the tests are running?
*) what does top say about the utilization of any single CPU in the
system(s) when the tests are running?
*) are you using the global -T option to spread the netperf/netserver
processes across the CPUs, or leaving that all up to the
stack/scheduler/etc?
I suspect there could be more, but that is what comes to mind thus far
as far as things I often check when running netperf.
rick jones
* Re: questions on NAPI processing latency and dropped network packets
From: James Chapman @ 2008-01-10 18:25 UTC (permalink / raw)
To: Chris Friesen; +Cc: netdev, linux-kernel
In-Reply-To: <478654C3.60806@nortel.com>
Chris Friesen wrote:
> Hi all,
>
> I've got an issue that's popped up with a deployed system running
> 2.6.10. I'm looking for some help figuring out why incoming network
> packets aren't being processed fast enough.
>
> After a recent userspace app change, we've started seeing packets being
> dropped by the ethernet hardware (e1000, NAPI is enabled).
What's changed in your application? Any real-time threads in there?
From the top output below, it looks like SigtranServices is consuming all
your CPU...
> The
> error/dropped/fifo counts are going up in ethtool:
>
> rx_packets: 32180834
> rx_bytes: 5480756958
> rx_errors: 862506
> rx_dropped: 771345
> rx_length_errors: 0
> rx_over_errors: 0
> rx_crc_errors: 0
> rx_frame_errors: 0
> rx_fifo_errors: 91161
> rx_missed_errors: 91161
>
> This link is receiving roughly 13K packets/sec, and we're dropping
> roughly 51 packets/sec due to fifo errors.
>
> Increasing the rx descriptor ring size from 256 up to around 3000 or so
> seems to make the problem stop, but it seems to me that this is just a
> workaround for the latency in processing the incoming packets.
>
> So, I'm looking for some suggestions on how to fix this or to figure out
> where the latency is coming from.
>
> Some additional information:
>
>
> 1) Interrupts are being processed on both cpus:
>
> root@base0-0-0-13-0-11-1:/root> cat /proc/interrupts
> CPU0 CPU1
> 30: 1703756 4530785 U3-MPIC Level eth0
>
>
>
>
> 2) "top" shows a fair amount of time processing softirqs, but very
> little time in ksoftirqd (or is that a sampling artifact?).
>
>
> Tasks: 79 total, 1 running, 78 sleeping, 0 stopped, 0 zombie
> Cpu0: 23.6% us, 30.9% sy, 0.0% ni, 36.9% id, 0.0% wa, 0.3% hi, 8.3% si
> Cpu1: 30.4% us, 24.1% sy, 0.0% ni, 5.9% id, 0.0% wa, 0.7% hi, 38.9% si
> Mem: 4007812k total, 2199148k used, 1808664k free, 0k buffers
> Swap: 0k total, 0k used, 0k free, 219844k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5375 root 15 0 2682m 1.8g 6640 S 99.9 46.7 31:17.68
> SigtranServices
> 7696 root 17 0 6952 3212 1192 S 7.3 0.1 0:15.75
> schedmon.ppc210
> 7859 root 16 0 2688 1228 964 R 0.7 0.0 0:00.04 top
> 2956 root 8 -8 18940 7436 5776 S 0.3 0.2 0:01.35 blademtc
> 1 root 16 0 1660 620 532 S 0.0 0.0 0:30.62 init
> 2 root RT 0 0 0 0 S 0.0 0.0 0:00.01 migration/0
> 3 root 15 0 0 0 0 S 0.0 0.0 0:00.55 ksoftirqd/0
> 4 root RT 0 0 0 0 S 0.0 0.0 0:00.01 migration/1
> 5 root 15 0 0 0 0 S 0.0 0.0 0:00.43 ksoftirqd/1
>
>
> 3) /proc/sys/net/core/netdev_max_backlog is set to the default of 300
>
>
> So...anyone have any ideas/suggestions?
>
> Thanks,
>
> Chris
--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development
* Re: questions on NAPI processing latency and dropped network packets
From: Chris Friesen @ 2008-01-10 18:12 UTC (permalink / raw)
To: Kok, Auke; +Cc: netdev, linux-kernel
In-Reply-To: <478657C1.8040107@intel.com>
Kok, Auke wrote:
> You're using 2.6.10... you can always replace the e1000 module with the
> out-of-tree version from e1000.sf.net, this might help a bit - the version in the
> 2.6.10 kernel is very very old.
Do you have any reason to believe this would improve things? It seems
like the problem lies in the NAPI/softirq code rather than in the e1000
driver itself, no?
> it also appears that your app is eating up CPU time. perhaps setting the app to a
> nicer nice level might mitigate things a bit.
If we're not handling the softirq work from ksoftirqd how would changing
scheduler settings affect anything?
> Also turn off the in-kernel irq
> mitigation, it just causes cache misses and you really need the network irq to sit
> on a single cpu at most (if not all) the time to get the best performance. Use the
> userspace irqbalance daemon instead to achieve this.
Using userspace irqbalance would be some effort to test and deploy
properly. However, as a quick test I tried setting the irq affinity for
this device and it didn't help.
One thing that might be of interest is that it seems to be bursty rather
than gradual. Here are some timestamps (in seconds) along with the
number of overruns on eth0:
6552.15 overruns:260097
6552.69 overruns:260097
6553.32 overruns:260097
6553.83 overruns:260097
6554.35 overruns:260097
6554.87 overruns:260097
6555.41 overruns:260097
6555.94 overruns:260097
6556.51 overruns:260097
6557.07 overruns:260282
6557.58 overruns:260282
6558.23 overruns:260282
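To make the burstiness explicit, here is a quick Python sketch that
turns the samples above into per-interval increments. The data is copied
from the list above; the helper name is just for illustration:

```python
samples = """\
6552.15 overruns:260097
6552.69 overruns:260097
6553.32 overruns:260097
6553.83 overruns:260097
6554.35 overruns:260097
6554.87 overruns:260097
6555.41 overruns:260097
6555.94 overruns:260097
6556.51 overruns:260097
6557.07 overruns:260282
6557.58 overruns:260282
6558.23 overruns:260282
"""


def overrun_bursts(text):
    """Return (start, end, increase) for each interval where the counter grew."""
    points = []
    for line in text.splitlines():
        if not line.strip():
            continue
        stamp, counter = line.split()
        points.append((float(stamp), int(counter.split(":")[1])))
    return [(t0, t1, c1 - c0)
            for (t0, c0), (t1, c1) in zip(points, points[1:]) if c1 > c0]


print(overrun_bursts(samples))
```

All 185 overruns in this window land in a single interval of roughly
half a second; the other ten intervals show none.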
Chris
* Re: SMP code / network stack
From: Arnaldo Carvalho de Melo @ 2008-01-10 17:46 UTC (permalink / raw)
To: Jeba Anandhan; +Cc: Eric Dumazet, netdev, matthew.hattersley
In-Reply-To: <1199978819.29856.43.camel@vglwks010.vgl2.office.vaioni.com>
Em Thu, Jan 10, 2008 at 03:26:59PM +0000, Jeba Anandhan escreveu:
> Hi Eric,
> Thanks for the reply. I have one more doubt. For example, if we have 2
> processor and 4 ethernet cards. Only CPU0 does all work through 8 cards.
> If we set the affinity to each ethernet card as CPU number, will it be
> efficient?.
>
> Will this be default behavior?
>
> # cat /proc/interrupts
> CPU0 CPU1
> 0: 11472559 74291833 IO-APIC-edge timer
> 2: 0 0 XT-PIC cascade
> 8: 0 1 IO-APIC-edge rtc
> 81: 0 0 IO-APIC-level ohci_hcd
> 97: 1830022231 847 IO-APIC-level ehci_hcd, eth0
> 97: 3830012232 847 IO-APIC-level ehci_hcd, eth1
> 97: 5830052231 847 IO-APIC-level ehci_hcd, eth2
> 97: 6830032213 847 IO-APIC-level ehci_hcd, eth3
> #sleep 10
>
> # cat /proc/interrupts
> CPU0 CPU1
> 0: 11472559 74291833 IO-APIC-edge timer
> 2: 0 0 XT-PIC cascade
> 8: 0 1 IO-APIC-edge rtc
> 81: 0 0 IO-APIC-level ohci_hcd
> 97: 2031409801 847 IO-APIC-level ehci_hcd, eth0
> 97: 4813981390 847 IO-APIC-level ehci_hcd, eth1
> 97: 7123982139 847 IO-APIC-level ehci_hcd, eth2
> 97: 8030193010 847 IO-APIC-level ehci_hcd, eth3
>
>
> Instead of the above mentioned ,if we set the affinity for eth2 and
> eth3.
> the output will be
>
> # cat /proc/interrupts
> CPU0 CPU1
> 0: 11472559 74291833 IO-APIC-edge timer
> 2: 0 0 XT-PIC cascade
> 8: 0 1 IO-APIC-edge rtc
> 81: 0 0 IO-APIC-level ohci_hcd
> 97: 1830022231 847 IO-APIC-level ehci_hcd, eth0
> 97: 3830012232 847 IO-APIC-level ehci_hcd, eth1
> 97: 5830052231 923 IO-APIC-level ehci_hcd, eth2
> 97: 6830032213 1230 IO-APIC-level ehci_hcd, eth3
> #sleep 10
>
> # cat /proc/interrupts
> CPU0 CPU1
> 0: 11472559 74291833 IO-APIC-edge timer
> 2: 0 0 XT-PIC cascade
> 8: 0 1 IO-APIC-edge rtc
> 81: 0 0 IO-APIC-level ohci_hcd
> 97: 2300022231 847 IO-APIC-level ehci_hcd, eth0
> 97: 4010212232 847 IO-APIC-level ehci_hcd, eth1
> 97: 5830052231 1847 IO-APIC-level ehci_hcd, eth2
> 97: 6830032213 2337 IO-APIC-level ehci_hcd, eth3
>
> In this case, will the performance improves?.
ps ax | grep irqbalance
tells what?
If it is enabled please try:
service irqbalance stop
chkconfig irqbalance off
Then reset the smp_affinity entries to ff and try again.
http://www.irqbalance.org/
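In case the mask format is unfamiliar: smp_affinity holds a hexadecimal
bitmask of the CPUs allowed to service the IRQ. A small Python sketch of
how to read such a value (the function name is made up):

```python
def cpus_in_mask(mask_hex):
    """List the CPU numbers enabled in an smp_affinity-style hex mask."""
    # Wide masks are printed as comma-separated groups; drop the commas.
    value = int(mask_hex.replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


print(cpus_in_mask("ff"))  # CPU0 through CPU7 allowed
print(cpus_in_mask("1"))   # CPU0 only
print(cpus_in_mask("2"))   # CPU1 only
```

On a two-CPU machine only bits 0 and 1 correspond to CPUs that exist, so
ff there simply means "any CPU".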
- Arnaldo
* Re: questions on NAPI processing latency and dropped network packets
From: Kok, Auke @ 2008-01-10 17:37 UTC (permalink / raw)
To: Chris Friesen; +Cc: netdev, linux-kernel
In-Reply-To: <478654C3.60806@nortel.com>
Chris Friesen wrote:
> Hi all,
>
> I've got an issue that's popped up with a deployed system running
> 2.6.10. I'm looking for some help figuring out why incoming network
> packets aren't being processed fast enough.
>
> After a recent userspace app change, we've started seeing packets being
> dropped by the ethernet hardware (e1000, NAPI is enabled). The
> error/dropped/fifo counts are going up in ethtool:
>
> rx_packets: 32180834
> rx_bytes: 5480756958
> rx_errors: 862506
> rx_dropped: 771345
> rx_length_errors: 0
> rx_over_errors: 0
> rx_crc_errors: 0
> rx_frame_errors: 0
> rx_fifo_errors: 91161
> rx_missed_errors: 91161
>
> This link is receiving roughly 13K packets/sec, and we're dropping
> roughly 51 packets/sec due to fifo errors.
>
> Increasing the rx descriptor ring size from 256 up to around 3000 or so
> seems to make the problem stop, but it seems to me that this is just a
> workaround for the latency in processing the incoming packets.
>
> So, I'm looking for some suggestions on how to fix this or to figure out
> where the latency is coming from.
>
> Some additional information:
>
>
> 1) Interrupts are being processed on both cpus:
>
> root@base0-0-0-13-0-11-1:/root> cat /proc/interrupts
> CPU0 CPU1
> 30: 1703756 4530785 U3-MPIC Level eth0
>
>
>
>
> 2) "top" shows a fair amount of time processing softirqs, but very
> little time in ksoftirqd (or is that a sampling artifact?).
>
>
> Tasks: 79 total, 1 running, 78 sleeping, 0 stopped, 0 zombie
> Cpu0: 23.6% us, 30.9% sy, 0.0% ni, 36.9% id, 0.0% wa, 0.3% hi, 8.3% si
> Cpu1: 30.4% us, 24.1% sy, 0.0% ni, 5.9% id, 0.0% wa, 0.7% hi, 38.9% si
> Mem: 4007812k total, 2199148k used, 1808664k free, 0k buffers
> Swap: 0k total, 0k used, 0k free, 219844k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5375 root 15 0 2682m 1.8g 6640 S 99.9 46.7 31:17.68
> SigtranServices
> 7696 root 17 0 6952 3212 1192 S 7.3 0.1 0:15.75
> schedmon.ppc210
> 7859 root 16 0 2688 1228 964 R 0.7 0.0 0:00.04 top
> 2956 root 8 -8 18940 7436 5776 S 0.3 0.2 0:01.35 blademtc
> 1 root 16 0 1660 620 532 S 0.0 0.0 0:30.62 init
> 2 root RT 0 0 0 0 S 0.0 0.0 0:00.01 migration/0
> 3 root 15 0 0 0 0 S 0.0 0.0 0:00.55 ksoftirqd/0
> 4 root RT 0 0 0 0 S 0.0 0.0 0:00.01 migration/1
> 5 root 15 0 0 0 0 S 0.0 0.0 0:00.43 ksoftirqd/1
>
>
> 3) /proc/sys/net/core/netdev_max_backlog is set to the default of 300
>
>
> So...anyone have any ideas/suggestions?
You're using 2.6.10... you can always replace the e1000 module with the
out-of-tree version from e1000.sf.net, this might help a bit - the version in the
2.6.10 kernel is very very old.
It also appears that your app is eating up CPU time. Perhaps setting the app to a
nicer nice level might mitigate things a bit. Also turn off the in-kernel IRQ
balancing; it just causes cache misses, and you really need the network IRQ to sit
on a single CPU most (if not all) of the time to get the best performance. Use the
userspace irqbalance daemon instead to achieve this.
Auke
* Re: [PATCH 2.6.23+] ingress classify to [nf]mark
From: Patrick McHardy @ 2008-01-10 17:29 UTC (permalink / raw)
To: mahatma; +Cc: netdev
In-Reply-To: <47866C69.3080904@bspu.unibel.by>
Dzianis Kahanovich wrote:
> --- linux-2.6.23-gentoo-r2/net/sched/sch_ingress.c
> +++ linux-2.6.23-gentoo-r2.fixed/net/sched/sch_ingress.c
> @@ -161,2 +161,5 @@
> skb->tc_index = TC_H_MIN(res.classid);
> +#ifdef CONFIG_NET_SCH_INGRESS_TC2MARK
> + skb->mark =
> (skb->mark&(res.classid>>16))|TC_H_MIN(res.classid);
> +#endif
> default:
Behaviour like this shouldn't depend on compile-time options.
* Re: e1000 performance issue in 4 simultaneous links
From: Breno Leitao @ 2008-01-10 17:31 UTC (permalink / raw)
To: bhutchings
In-Reply-To: <20080110163626.GJ3544@solarflare.com>
On Thu, 2008-01-10 at 16:36 +0000, Ben Hutchings wrote:
> > When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> > of transfer rate. If I run 4 netperf against 4 different interfaces, I
> > get around 720 * 10^6 bits/sec.
> <snip>
>
> I take it that's the average for individual interfaces, not the
> aggregate?
Right, each of these results is for an individual interface. Otherwise,
we'd have a huge problem. :-)
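To put numbers on it, a quick Python calculation using the figures
quoted above. It treats the single-interface result as the per-link
ceiling, which is an assumption:

```python
single_link = 940.95   # 10^6 bits/sec, one netperf on one interface
per_link_of_4 = 720.0  # 10^6 bits/sec, each of four concurrent netperfs
links = 4

aggregate = links * per_link_of_4
ideal = links * single_link
efficiency = 100 * aggregate / ideal

print("aggregate: %.1f x 10^6 bits/sec" % aggregate)
print("ideal:     %.1f x 10^6 bits/sec" % ideal)
print("scaling efficiency: %.1f%%" % efficiency)
```

So the four links together move roughly 2880 x 10^6 bits/sec, about
three quarters of what perfect scaling would give.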
> This can be mitigated by interrupt moderation and NAPI
> polling, jumbo frames (MTU >1500) and/or Large Receive Offload (LRO).
> I don't think e1000 hardware does LRO, but the driver could presumably
> be changed use Linux's software LRO.
Without using these "features" and keeping the MTU at 1500, do you think
we could get better performance than this?
I also tried to increase my interface MTU to 9000, but I am afraid that
netperf only transmits packets smaller than 1500 bytes. Still investigating.
> single CPU this can become a bottleneck. Does the test system have
> multiple CPUs? Are IRQs for the multiple NICs balanced across
> multiple CPUs?
Yes, this machine has 8 ppc 1.9 GHz CPUs. And the IRQs are balanced
across the CPUs, as I see in /proc/interrupts:
# cat /proc/interrupts
        CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
 16:     940     760    1047     904     993     777     975     813  XICS Level IPI
 18:       4       3       4       1       3       6       8       3  XICS Level hvc_console
 19:       0       0       0       0       0       0       0       0  XICS Level RAS_EPOW
273:   10728   10850   10937   10833   10884   10788   10868   10776  XICS Level eth4
275:       0       0       0       0       0       0       0       0  XICS Level ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3
277:  234933  230275  229770  234048  235906  229858  229975  233859  XICS Level eth6
278:  266225  267606  262844  265985  268789  266869  263110  267422  XICS Level eth7
279:     893     919     857     909     867     917     894     881  XICS Level eth0
305:  439246  439117  438495  436072  438053  440111  438973  438951  XICS Level eth0 Neterion Xframe II 10GbE network adapter
321:    3268    3088    3143    3113    3305    2982    3326    3084  XICS Level ipr
323:  268030  273207  269710  271338  270306  273258  270872  273281  XICS Level eth16
324:  215012  221102  219494  216732  216531  220460  219718  218654  XICS Level eth17
325:    7103    3580    7246    3475    7132    3394    7258    3435  XICS Level pata_pdc2027x
BAD:    4216
Thanks,
--
Breno Leitao <leitao@linux.vnet.ibm.com>
* [PATCH 2.6.23+] ingress classify to [nf]mark
From: Dzianis Kahanovich @ 2008-01-10 19:05 UTC (permalink / raw)
To: netdev
Map "classid x:y" to "mark = (mark & x) | y" (so "classid :y" is equivalent to
"-j MARK --set-mark y", etc).
--- linux-2.6.23-gentoo-r2/net/sched/Kconfig
+++ linux-2.6.23-gentoo-r2.fixed/net/sched/Kconfig
@@ -222,6 +222,16 @@
To compile this code as a module, choose M here: the
module will be called sch_ingress.
+config NET_SCH_INGRESS_TC2MARK
+ bool "ingress classify -> mark"
+ depends on NET_SCH_INGRESS && NET_CLS_ACT
+ ---help---
+ This enables access to "mark" value via "classid"
+ Example: set "tc filter ... flowid|classid 1:2"
+ eq "netfilter mark" mark=mark&1|2
+
+ But classid may be undefined (?) - use "flowid :0".
+
comment "Classification"
config NET_CLS
--- linux-2.6.23-gentoo-r2/net/sched/sch_ingress.c
+++ linux-2.6.23-gentoo-r2.fixed/net/sched/sch_ingress.c
@@ -161,2 +161,5 @@
skb->tc_index = TC_H_MIN(res.classid);
+#ifdef CONFIG_NET_SCH_INGRESS_TC2MARK
+ skb->mark = (skb->mark&(res.classid>>16))|TC_H_MIN(res.classid);
+#endif
default:
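
The same computation as a small Python sketch, for anyone who wants to
play with values. tc_h_min() and the 16/16-bit major:minor split mirror
the kernel's TC_H_MIN macro; the helper names classid() and new_mark()
are made up for this example:

```python
def tc_h_min(handle):
    """Minor (low 16 bits) part of a tc handle, like the kernel's TC_H_MIN()."""
    return handle & 0xFFFF


def classid(major, minor):
    """Pack "major:minor" into a 32-bit handle."""
    return ((major & 0xFFFF) << 16) | (minor & 0xFFFF)


def new_mark(mark, handle):
    """mark = (mark & major) | minor, as in the patched sch_ingress.c line."""
    return (mark & (handle >> 16)) | tc_h_min(handle)


print(hex(new_mark(0xFF, classid(1, 2))))  # "classid 1:2"
print(hex(new_mark(0xFF, classid(0, 5))))  # "classid :5", like --set-mark 5
```

With a major of 0 the old mark is masked away completely, which is why
"classid :y" behaves like a plain set-mark.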
--
WBR,
Denis Kaganovich, mahatma@eu.by http://mahatma.bspu.unibel.by
* [PROCFS] [NETNS] issue with /proc/net entries
From: Benjamin Thery @ 2008-01-10 17:24 UTC (permalink / raw)
To: ebiederm; +Cc: netdev, linux-kernel
Hi Eric,
While testing the current network namespace stuff merged in net-2.6.25,
I bumped into the following problem with the /proc/net/ entries.
It doesn't always display the actual data of the current namespace,
but sometimes displays data from other namespaces.
I bisected the problem to the commit:
"proc: remove/Fix proc generic d_revalidate"
3790ee4bd86396558eedd86faac1052cb782e4e1
The problem: If a process in a particular network namespace changes
current directory to /proc/net, then processes in other network
namespaces trying to look at /proc/net entries will see data from the
first namespace (the one with CWD /proc/net). (See test case below).
As your comments in the commit suggest, you seem to be aware of some
issues when CONFIG_NET_NS=y. Is this one of the corner cases you
identified? Any idea how we can fix it?
Thanks.
Benjamin
Test case:
----------
(1) Shell 1, in init namespace:
$ cat /proc/net/dev
lo ...
eth0 ...
(2) Shell 2, in another network namespace
$ cat /proc/net/dev
lo ...
(3) Shell 1
$ cd /proc/net
$ cat dev
lo ...
eth0 ...
(4) Shell 2
$ cat /proc/net/dev
lo ...
eth0 ...
Argh, lo + eth0 in child namespace.... the device list of init netns
is displayed in /proc/net/dev of child namespace :-(
(5) Shell 1
$ cd /
(6) Shell 2
$ cat /proc/net/dev
lo ...
Back to normality.
--
B e n j a m i n T h e r y - BULL/DT/Open Software R&D
http://www.bull.com
* questions on NAPI processing latency and dropped network packets
From: Chris Friesen @ 2008-01-10 17:24 UTC (permalink / raw)
To: netdev, linux-kernel
Hi all,
I've got an issue that's popped up with a deployed system running
2.6.10. I'm looking for some help figuring out why incoming network
packets aren't being processed fast enough.
After a recent userspace app change, we've started seeing packets being
dropped by the ethernet hardware (e1000, NAPI is enabled). The
error/dropped/fifo counts are going up in ethtool:
rx_packets: 32180834
rx_bytes: 5480756958
rx_errors: 862506
rx_dropped: 771345
rx_length_errors: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 91161
rx_missed_errors: 91161
This link is receiving roughly 13K packets/sec, and we're dropping
roughly 51 packets/sec due to fifo errors.
Increasing the rx descriptor ring size from 256 up to around 3000 or so
seems to make the problem stop, but it seems to me that this is just a
workaround for the latency in processing the incoming packets.
So, I'm looking for some suggestions on how to fix this or to figure out
where the latency is coming from.
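As a rough sanity check on the ring size numbers, here is a small Python
calculation. It assumes packets arrive evenly at the average rate, which
bursty traffic will not do, so treat the results as upper bounds on the
stall the ring can absorb:

```python
packet_rate = 13000.0  # packets/sec received on the link
drop_rate = 51.0       # packets/sec lost to fifo errors


def ring_headroom_ms(descriptors, rate=packet_rate):
    """Time an rx ring of this size takes to fill if nothing drains it."""
    return 1000.0 * descriptors / rate


print("drop ratio: %.2f%%" % (100 * drop_rate / packet_rate))
for ring in (256, 3000):
    print("%4d descriptors: ~%.1f ms of headroom" % (ring, ring_headroom_ms(ring)))
```

Under that assumption, packet processing only has to stall for about
20 ms to overflow a 256-entry ring, while roughly 3000 descriptors would
ride out a stall of over 200 ms.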
Some additional information:
1) Interrupts are being processed on both cpus:
root@base0-0-0-13-0-11-1:/root> cat /proc/interrupts
CPU0 CPU1
30: 1703756 4530785 U3-MPIC Level eth0
2) "top" shows a fair amount of time processing softirqs, but very
little time in ksoftirqd (or is that a sampling artifact?).
Tasks: 79 total, 1 running, 78 sleeping, 0 stopped, 0 zombie
Cpu0: 23.6% us, 30.9% sy, 0.0% ni, 36.9% id, 0.0% wa, 0.3% hi, 8.3% si
Cpu1: 30.4% us, 24.1% sy, 0.0% ni, 5.9% id, 0.0% wa, 0.7% hi, 38.9% si
Mem: 4007812k total, 2199148k used, 1808664k free, 0k buffers
Swap: 0k total, 0k used, 0k free, 219844k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5375 root      15   0 2682m 1.8g 6640 S 99.9 46.7  31:17.68 SigtranServices
 7696 root      17   0  6952 3212 1192 S  7.3  0.1   0:15.75 schedmon.ppc210
 7859 root      16   0  2688 1228  964 R  0.7  0.0   0:00.04 top
 2956 root       8  -8 18940 7436 5776 S  0.3  0.2   0:01.35 blademtc
    1 root      16   0  1660  620  532 S  0.0  0.0   0:30.62 init
    2 root      RT   0     0    0    0 S  0.0  0.0   0:00.01 migration/0
    3 root      15   0     0    0    0 S  0.0  0.0   0:00.55 ksoftirqd/0
    4 root      RT   0     0    0    0 S  0.0  0.0   0:00.01 migration/1
    5 root      15   0     0    0    0 S  0.0  0.0   0:00.43 ksoftirqd/1
3) /proc/sys/net/core/netdev_max_backlog is set to the default of 300
So...anyone have any ideas/suggestions?
Thanks,
Chris
* Re: e1000 performance issue in 4 simultaneous links
From: Jeba Anandhan @ 2008-01-10 16:51 UTC (permalink / raw)
To: Ben Hutchings; +Cc: Breno Leitao, netdev
In-Reply-To: <20080110163626.GJ3544@solarflare.com>
Ben,
I am facing a performance issue when bonding multiple interfaces into
a virtual interface. It could be related to this thread.
My questions are:
*) When we use multiple NICs, will the overall system throughput be the
sum of the individual links' XX bits/sec?
*) What factors improve performance when we have multiple
interfaces? [ e.g. tuning parameters in /proc ]
Breno,
I hope this thread will be helpful for the performance issue which I have
with the bonding driver.
Jeba
On Thu, 2008-01-10 at 16:36 +0000, Ben Hutchings wrote:
> Breno Leitao wrote:
> > Hello,
> >
> > I've perceived that there is a performance issue when running netperf
> > against 4 e1000 links connected end-to-end to another machine with 4
> > e1000 interfaces.
> >
> > I have two 4-port interface cards in my machine, but the test uses only
> > 2 ports on each card.
> >
> > When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> > of transfer rate. If I run 4 netperf against 4 different interfaces, I
> > get around 720 * 10^6 bits/sec.
> <snip>
>
> I take it that's the average for individual interfaces, not the
> aggregate? RX processing for multi-gigabits per second can be quite
> expensive. This can be mitigated by interrupt moderation and NAPI
> polling, jumbo frames (MTU >1500) and/or Large Receive Offload (LRO).
> I don't think e1000 hardware does LRO, but the driver could presumably
> be changed to use Linux's software LRO.
>
> Even with these optimisations, if all RX processing is done on a
> single CPU this can become a bottleneck. Does the test system have
> multiple CPUs? Are IRQs for the multiple NICs balanced across
> multiple CPUs?
>
> Ben.
>
^ permalink raw reply
* Re: e1000 performance issue in 4 simultaneous links
From: Ben Hutchings @ 2008-01-10 16:36 UTC (permalink / raw)
To: Breno Leitao; +Cc: netdev
In-Reply-To: <1199981839.8931.35.camel@cafe>
Breno Leitao wrote:
> Hello,
>
> I've perceived that there is a performance issue when running netperf
> against 4 e1000 links connected end-to-end to another machine with 4
> e1000 interfaces.
>
> I have two 4-port interface cards in my machine, but the test uses only
> 2 ports on each card.
>
> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> of transfer rate. If I run 4 netperf against 4 different interfaces, I
> get around 720 * 10^6 bits/sec.
<snip>
I take it that's the average for individual interfaces, not the
aggregate? RX processing for multi-gigabits per second can be quite
expensive. This can be mitigated by interrupt moderation and NAPI
polling, jumbo frames (MTU >1500) and/or Large Receive Offload (LRO).
I don't think e1000 hardware does LRO, but the driver could presumably
be changed to use Linux's software LRO.
Even with these optimisations, if all RX processing is done on a
single CPU this can become a bottleneck. Does the test system have
multiple CPUs? Are IRQs for the multiple NICs balanced across
multiple CPUs?
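On Linux, IRQ affinity is normally controlled by writing a hexadecimal CPU bitmask to /proc/irq/<N>/smp_affinity. As a rough sketch of what "balanced across multiple CPUs" could look like, the helper below builds such masks and assigns IRQs round-robin. The function names and the IRQ numbers in the usage note are made up for illustration.

```python
def cpu_mask(cpus):
    """Return the hex bitmask string selecting the given CPU numbers."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def spread_irqs(irqs, n_cpus):
    """Round-robin assignment: the i-th IRQ is pinned to CPU i % n_cpus.

    Returns {irq: mask_string}; writing the masks to /proc is left to
    the caller (it needs root and is not done here).
    """
    return {irq: cpu_mask([i % n_cpus]) for i, irq in enumerate(irqs)}
```

For example, `spread_irqs([30, 31, 32, 33], 2)` yields masks "1", "2", "1", "2", i.e. two NICs per CPU on a two-CPU box.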
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
^ permalink raw reply
* e1000 performance issue in 4 simultaneous links
From: Breno Leitao @ 2008-01-10 16:17 UTC (permalink / raw)
To: netdev
Hello,
I've perceived that there is a performance issue when running netperf
against 4 e1000 links connected end-to-end to another machine with 4
e1000 interfaces.
I have two 4-port interface cards in my machine, but the test uses only
2 ports on each card.
When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
of transfer rate. If I run 4 netperf against 4 different interfaces, I
get around 720 * 10^6 bits/sec.
If I run the same test against 2 interfaces I get a 940 * 10^6 bits/sec
transfer rate also, and if I run it against 3 interfaces I get around
850 * 10^6 bits/sec performance.
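To make the scaling easier to see, here is a small sketch that turns the figures above into aggregate throughput and scaling efficiency. It assumes the quoted numbers are per-interface averages (the message does not say so explicitly); the function names are illustrative only.

```python
# Per-interface rates in units of 10^6 bits/sec, as reported above.
PER_LINK_MBPS = {1: 940.95, 2: 940.0, 3: 850.0, 4: 720.0}


def aggregate_mbps(links, per_link):
    """Total throughput if every link runs at the per-link rate."""
    return links * per_link


def scaling_efficiency(links, per_link, single_link):
    """Fraction of ideal linear scaling relative to the single-link rate."""
    return aggregate_mbps(links, per_link) / (links * single_link)


for links, rate in sorted(PER_LINK_MBPS.items()):
    eff = scaling_efficiency(links, rate, PER_LINK_MBPS[1])
    print(links, aggregate_mbps(links, rate), round(eff, 3))
```

Under that assumption the aggregate still grows with each added link, but the per-link share falls once a third and fourth interface are active.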
I got these results using the upstream netdev-2.6 branch kernel plus
David Miller's set of 7 NAPI patches[1]. With kernel 2.6.23.12 the result
is a bit worse: the transfer rate was around 600 * 10^6
bits/sec.
[1] http://marc.info/?l=linux-netdev&m=119977075917488&w=2
PS: I am not using a switch between the interfaces (they are
connected end-to-end) and the connections are independent.
--
Breno Leitao <leitao@linux.vnet.ibm.com>
^ permalink raw reply
* Re: virtio_net and SMP guests
From: Christian Borntraeger @ 2008-01-10 15:51 UTC (permalink / raw)
To: virtualization; +Cc: kvm-devel, Anthony Liguori, Dor Laor, netdev
In-Reply-To: <200801101639.15995.borntraeger@de.ibm.com>
Am Donnerstag, 10. Januar 2008 schrieb Christian Borntraeger:
> Am Donnerstag, 10. Januar 2008 schrieb Christian Borntraeger:
> > Am Dienstag, 18. Dezember 2007 schrieb Rusty Russell:
> > > To me this points to doing interrupt suppression a different way.
> > > If we have a ->disable_cb() virtio function, and call it before we
> > > call netif_rx_schedule, does that fix it?
> >
> > The fix looks good and I agree with it.
> >
> > There is one problem that I have been trying to track down for some
> > days: the following BUG_ON triggers:
> >
> > static void vring_disable_cb(struct virtqueue *_vq)
> > {
> > 	struct vring_virtqueue *vq = to_vvq(_vq);
> >
> > 	START_USE(vq);
> > ---->	BUG_ON(vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT);
> > 	vq->vring.avail->flags |= VRING_AVAIL_F_NO_INTERRUPT;
> > 	END_USE(vq);
> > }
>
> Ok, I found it:
>
> static int virtnet_open(struct net_device *dev)
> {
> 	struct virtnet_info *vi = netdev_priv(dev);
> 	try_fill_recv(vi);
> 	/* If we didn't even get one input buffer, we're useless. */
> 	if (vi->num == 0)
> 		return -ENOMEM;
> ---> interrupt for new packet
> 		static void skb_recv_done(struct virtqueue *rvq)
> 		{
> 			struct virtnet_info *vi = rvq->vdev->priv;
> 			/* Suppress further interrupts. */
> 			rvq->vq_ops->disable_cb(rvq);
> 			netif_rx_schedule(vi->dev, &vi->napi);
> 		}
> 		- poll is not yet possible, no softirq
> 		- return from interrupt
> 	napi_enable(&vi->napi);
> 	vi->rvq->vq_ops->disable_cb(vi->rvq);
> ---> BUG: it's already disabled
Btw. this problem also happens on single processor guests.
What about the following patch:
---
drivers/net/virtio_net.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
Index: kvm/drivers/net/virtio_net.c
===================================================================
--- kvm.orig/drivers/net/virtio_net.c
+++ kvm/drivers/net/virtio_net.c
@@ -179,9 +179,12 @@ static void try_fill_recv(struct virtnet
 static void skb_recv_done(struct virtqueue *rvq)
 {
 	struct virtnet_info *vi = rvq->vdev->priv;
-	/* Suppress further interrupts. */
-	rvq->vq_ops->disable_cb(rvq);
-	netif_rx_schedule(vi->dev, &vi->napi);
+	/* Schedule NAPI, Suppress further interrupts if successful. */
+
+	if (netif_rx_schedule_prep(vi->dev, &vi->napi)) {
+		rvq->vq_ops->disable_cb(rvq);
+		__netif_rx_schedule(vi->dev, &vi->napi);
+	}
 }
 
 static int virtnet_poll(struct napi_struct *napi, int budget)
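The ordering problem and the effect of the patch can be sketched with a toy model outside the kernel. This is only an illustration: `ToyVirtqueue`, `ToyNapi` and the helper functions are invented names, and the model assumes that a NAPI context which has not been enabled yet behaves as if it were already scheduled, so a schedule attempt fails.

```python
class ToyVirtqueue:
    """Models only the 'callbacks disabled' flag and its BUG_ON."""

    def __init__(self):
        self.no_interrupt = False

    def disable_cb(self):
        # Stand-in for the BUG_ON: disabling twice is treated as a bug.
        if self.no_interrupt:
            raise AssertionError("callbacks already disabled")
        self.no_interrupt = True


class ToyNapi:
    """Assumption: a disabled NAPI context holds the 'scheduled' bit."""

    def __init__(self):
        self.sched = True

    def enable(self):
        self.sched = False

    def schedule_prep(self):
        if self.sched:
            return False  # cannot schedule: disabled or already pending
        self.sched = True
        return True


def recv_done_old(vq, napi):
    vq.disable_cb()          # always suppress callbacks
    napi.schedule_prep()     # result ignored, scheduling may fail


def recv_done_new(vq, napi):
    if napi.schedule_prep():  # suppress only if scheduling worked
        vq.disable_cb()


def open_with_early_interrupt(recv_done):
    vq, napi = ToyVirtqueue(), ToyNapi()
    recv_done(vq, napi)       # interrupt arrives before napi is enabled
    napi.enable()
    try:
        vq.disable_cb()       # what the open path does next
    except AssertionError:
        return "BUG"
    return "ok"
```

In this model `open_with_early_interrupt(recv_done_old)` hits the simulated BUG, while the variant that checks the schedule result first does not.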
^ permalink raw reply
* Re: [Bugme-new] [Bug 9719] New: when a system is configured as a bridge, and at the same time configured to have multipath weighted route, with one leg goes thru NAT and another without NAT, the nat path will intermittently get packets leaking out using internal IP without being SNAT-ted
From: Patrick McHardy @ 2008-01-10 15:41 UTC (permalink / raw)
To: mingching.tiew; +Cc: Andrew Morton, bugme-daemon, netdev
In-Reply-To: <20080109152813.83fb8168.akpm@linux-foundation.org>
Andrew Morton wrote:
>> Distribution: iptables 1.4.0 was used with kernel 2.6.23 and iptables 1.3.8
>> with 2.6.22.15
>> Hardware Environment: 3 interfaces, 2 interfaces bridged to form br0, and
>> another connects to internet using pppoe.
>> Software Environment: bridge, multipath routing
>> Problem Description: when a system is configured as a bridge with an IP assigned
>> to the br0 interface, and at the same time is configured with a multipath
>> weighted default route where one of the default routes is NAT-ed and the other
>> is not, then the NAT-ed interface will occasionally leak packets that still
>> carry private (un-SNAT-ted) source IPs.
That is most likely because the route changes over time (when the cache
is flushed) and the NAT mappings for the connection have been set up on
a different interface. The way to properly do this is to add routing
rules based on fwmark and use CONNMARK to bind a connection to one of
the interfaces after the initial multipath routing decision.
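The idea of binding a connection to the interface chosen for its first packet can be illustrated with a toy model. This is not iptables or kernel code: `PinnedRouter`, `make_alternating_picker` and the interface names are hypothetical, and the alternating picker merely stands in for a multipath decision that changes over time.

```python
import itertools


def make_alternating_picker(interfaces):
    """Stand-in for a multipath decision that changes on every lookup."""
    it = itertools.cycle(interfaces)
    return lambda: next(it)


class PinnedRouter:
    """Remembers the first routing decision per connection (like a mark)."""

    def __init__(self, pick_interface):
        self.pick_interface = pick_interface
        self.connmark = {}  # connection tuple -> chosen interface

    def route(self, conn):
        if conn not in self.connmark:
            # Only the first packet goes through the multipath decision.
            self.connmark[conn] = self.pick_interface()
        return self.connmark[conn]


picker = make_alternating_picker(["ppp0", "eth2"])
router = PinnedRouter(picker)
conn = ("10.0.0.5", 40000, "192.0.2.1", 80)
print([router.route(conn) for _ in range(4)])  # same interface each time
```

Without the stored mark, each lookup in this model would alternate between the two interfaces, which corresponds to packets of an established NAT-ed connection leaving through the wrong leg.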
^ permalink raw reply
* Re: No idea about shaping trough many pc
From: Lennart Sorensen @ 2008-01-10 15:38 UTC (permalink / raw)
To: Badalian Vyacheslav; +Cc: netdev
In-Reply-To: <4785E01B.3080900@bigtelecom.ru>
On Thu, Jan 10, 2008 at 12:06:35PM +0300, Badalian Vyacheslav wrote:
> Hello all.
> I have been trying for more than 2 months to resolve a problem with my
> shaping. Maybe you can help me?
>
> Scheme:
>                       +---------------------+
>                +----- | Shaping PC 1        | --------+
>               /       +---------------------+          \
> +--------+   /        +---------------------+           \   +--------+
> | Cisco  | +--------- | Shaping PC N        | -----------+--| CISCO  |
> +--------+   \        +---------------------+           /   +--------+
>               \       +---------------------+          /
>                +----- | Shaping PC 20       | --------+
>                       +---------------------+
>
> Network: over 10k users. Total bandwidth to the Internet is more than 1 Gb/s.
> All computers run BGP with multipath turned on.
> The Cisco can't do per-packet load sharing (that would resolve all my
> problems =((( ). Only by DST IP, SRC IP, or +Layer 4.
> OK. Each user must get a speed of 1 Mb/s.
> Let's look at the variants:
> 1. Create a rule per user = (1 Mb/s / N computers). If the user uses N
> connections all is great, but if he uses 1 connection his speed is
> 1 Mb/s / N, which does not look good. All would be great if the Cisco
> could do PER PACKET load sharing =(
> 2. Create a rule per user = 1 Mb/s. If the user uses 1 connection all is
> great, but if he uses N connections his speed is much more than the
> intended limit =(
>
> Why do I use 20 PCs? Because 1 PC normally forwards 100-150 Mb/s... at
> which point it has 100% CPU usage in software interrupts...
I have managed forwarding of 600Mbps using about 15% CPU load on a
500MHz Geode LX, using 4 100Mbit pcnet32 interfaces and a small tweak to
how the NAPI is implemented on it. Adding traffic shaping and such to
the processing would certainly increase the CPU load, but hopefully not
by much. The reason I didn't get more than 600Mbps was that the PCI bus
is now full.
> Any idea how to resolve this problem?
>
> In my dreams (feature request to netdev ;) ):
> Take one PC and call it MASTER TC. All 20 PCs synchronize statistics with
> the MASTER and share common rules and statistics. Then I use variant 2 and
> will be happy... but is that not realistic? =(
> Are there maybe other variants?
Well, not sure about synchronizing and all that. I still think if I can
manage 600Mbps forwarding rate using a slow poke Geode then a modern CPU
like a Q6600 with a number of PCIe gig ports should be able to do quite
a lot.
The tweak I did was to add a timer to the driver that I can activate
whenever I finish emptying the receive queue. When the timer expires it
adds the port back to the NAPI queue, and when it is called again the
poll will either process whatever packets arrived during the delay, or
it will actually unmask the IRQ and go back to IRQ mode. The delay I
use is 1 jiffy, and I run with 1000HZ and set the queues to 256 packets,
since 1ms at 100Mbps can provide at most about 200 packets (64byte worst
case). I simply check whenever I empty the queue how many packets I
just processed. If greater than 0, I enable the timer to expire on the
next jiffy and leave the port masked after removing port from napi
polling, and if it was 0 then I must have been called again after the
timer expired and still had no packets to process in which case I unmask
the IRQ and don't enable the timer. I had to change the HZ to 1000
since at 250 or 100 I wouldn't be able to handle the worst case number
of packets (the pcnet32 has a maximum of 512 packets in a queue).
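The queue sizing can be checked with a quick back-of-the-envelope calculation. This sketch assumes minimum-size 64-byte Ethernet frames plus 20 bytes of per-frame overhead on the wire (8 bytes preamble/SFD and 12 bytes inter-frame gap); `max_packets_per_tick` is just an illustrative name.

```python
def max_packets_per_tick(link_bps, hz, frame_bytes=64, overhead_bytes=20):
    """Worst-case number of frames arriving during one timer tick.

    link_bps: line rate in bits per second
    hz: timer frequency (ticks per second)
    """
    bits_per_frame = (frame_bytes + overhead_bytes) * 8
    frames_per_second = link_bps / bits_per_frame
    return frames_per_second / hz


for hz in (1000, 250, 100):
    print(hz, round(max_packets_per_tick(100e6, hz), 1))
```

Under these assumptions a 100Mbit link delivers roughly 149 frames per tick at 1000HZ, which fits in a 256-entry queue, whereas at 250HZ or 100HZ the worst case is above the 512-entry maximum mentioned above.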
With NAPI the normal behaviour is that whenever you empty the receive
queue, you reenable IRQs, but it doesn't take that fast a CPU to
actually empty the queue all the time and then you end up with the
overhead of masking IRQs every time you receive packets and process them,
and then the overhead of unmasking the IRQ, only to get an IRQ for the
next packet within a fraction of a millisecond. With the delay until
the next jiffy for unmasking the IRQ you end up causing a potential lag
on processing packets of up to 1ms, although on average less than that,
but the IRQ load drops dramatically and the overhead of managing the IRQ
masking and the IRQ handler goes away. In the case of this system the
CPU load dropped from 90% at 500Mbps to 15% at 600Mbps, and the
interrupt rate dropped from one IRQ every couple of packets, to one IRQ
at the start of each burst of packets.
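The decision logic described above can be modelled in a few lines to see why the IRQ count drops to one per burst. This is a simplified sketch, not the pcnet32 change itself: time is divided into 1-jiffy windows, all packets of a window are assumed to be handled by a single poll, and `poll_step` and `count_irqs` are made-up names.

```python
def poll_step(packets_processed):
    """Decide what to do once the receive queue has been emptied.

    Returns (arm_timer, unmask_irq).
    """
    if packets_processed > 0:
        return (True, False)   # stay masked, look again next jiffy
    return (False, True)       # idle: go back to interrupt mode


def count_irqs(arrivals_per_jiffy):
    """Count IRQs taken for a sequence of per-jiffy packet arrivals."""
    irqs = 0
    irq_unmasked = True
    for n in arrivals_per_jiffy:
        if irq_unmasked:
            if n == 0:
                continue           # nothing arrives, nothing happens
            irqs += 1              # first packet of a burst raises an IRQ
            irq_unmasked = False   # handler masks the IRQ and polls
        arm_timer, unmask = poll_step(n)
        if unmask:
            irq_unmasked = True
    return irqs


print(count_irqs([5, 3, 4, 0, 0, 2, 1, 0]))
```

With the arrival pattern shown, the model takes one IRQ for each of the two bursts, at the price of up to one jiffy of added latency before an idle port is switched back to interrupt mode.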
I believe some gigabit Ethernet ports and most 10Gig ports have the ability
to do delayed IRQ where they wait for a certain number of packets before
generating an IRQ, which is pretty much what I tried to emulate with my
tweak and it sure works amazingly well.
--
Len Sorensen
^ permalink raw reply
* [MACVLAN]: Prevent nesting macvlan devices
From: Patrick McHardy @ 2008-01-10 15:32 UTC (permalink / raw)
To: David S. Miller; +Cc: Linux Netdev List
[-- Attachment #1: Type: text/plain, Size: 0 bytes --]
[-- Attachment #2: 02.diff --]
[-- Type: text/x-patch, Size: 1217 bytes --]
[MACVLAN]: Prevent nesting macvlan devices
Don't allow to nest macvlan devices since it will cause lockdep warnings and
isn't really useful for anything.
Signed-off-by: Patrick McHardy <kaber@trash.net>
---
commit 80a76fbde679793a17482a3dd842386801fca66b
tree 07f67e78ac0ae505a5de81e7e770a1b7d597f120
parent 4d14fded63dcaf9d5dcf78e2a8ea3f5de2c29eb9
author Patrick McHardy <kaber@trash.net> Thu, 10 Jan 2008 16:25:01 +0100
committer Patrick McHardy <kaber@trash.net> Thu, 10 Jan 2008 16:25:01 +0100
drivers/net/macvlan.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 2e4bcd5..e8dc2f4 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -384,6 +384,13 @@ static int macvlan_newlink(struct net_device *dev,
 	if (lowerdev == NULL)
 		return -ENODEV;
 
+	/* Don't allow macvlans on top of other macvlans - its not really
+	 * wrong, but lockdep can't handle it and its not useful for anything
+	 * you couldn't do directly on top of the real device.
+	 */
+	if (lowerdev->rtnl_link_ops == dev->rtnl_link_ops)
+		return -ENODEV;
+
 	if (!tb[IFLA_MTU])
 		dev->mtu = lowerdev->mtu;
 	else if (dev->mtu > lowerdev->mtu)
^ permalink raw reply related
* [VLAN]: nested VLAN: fix lockdep's recursive locking warning
From: Patrick McHardy @ 2008-01-10 15:32 UTC (permalink / raw)
To: David S. Miller; +Cc: Linux Netdev List
[-- Attachment #1: Type: text/plain, Size: 0 bytes --]
[-- Attachment #2: 01.diff --]
[-- Type: text/x-patch, Size: 1540 bytes --]
[VLAN]: nested VLAN: fix lockdep's recursive locking warning
Allow vlans nesting other vlans without lockdep's warnings (max. 2 levels
i.e. parent + child). Thanks to Patrick McHardy for pointing a bug in the
first version of this patch.
Reported-by: Benny Amorsen
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
---
commit 4d14fded63dcaf9d5dcf78e2a8ea3f5de2c29eb9
tree 2f0792e8240151b1e5437b05130d1f569175f572
parent e2474f60798c97f5c05d29a906045dd1f416ba7f
author Jarek Poplawski <jarkao2@gmail.com> Thu, 10 Jan 2008 16:25:00 +0100
committer Patrick McHardy <kaber@trash.net> Thu, 10 Jan 2008 16:25:00 +0100
net/8021q/vlan.c | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletions(-)
diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 4add9bd..032bf44 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -323,6 +323,7 @@ static const struct header_ops vlan_header_ops = {
 static int vlan_dev_init(struct net_device *dev)
 {
 	struct net_device *real_dev = VLAN_DEV_INFO(dev)->real_dev;
+	int subclass = 0;
 
 	/* IFF_BROADCAST|IFF_MULTICAST; ??? */
 	dev->flags = real_dev->flags & ~IFF_UP;
@@ -349,7 +350,11 @@ static int vlan_dev_init(struct net_device *dev)
 		dev->hard_start_xmit = vlan_dev_hard_start_xmit;
 	}
 
-	lockdep_set_class(&dev->_xmit_lock, &vlan_netdev_xmit_lock_key);
+	if (real_dev->priv_flags & IFF_802_1Q_VLAN)
+		subclass = 1;
+
+	lockdep_set_class_and_subclass(&dev->_xmit_lock,
+				&vlan_netdev_xmit_lock_key, subclass);
 	return 0;
 }
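Why one extra subclass covers exactly two levels can be seen with a toy lock-class checker. This is a loose illustration of the idea, not of lockdep's real implementation: `ToyLockdep` and `xmit_through` are invented names, and the only rule modelled is that taking a lock whose (key, subclass) pair is already held triggers a warning.

```python
class ToyLockdep:
    """Warns when a lock class that is already held is taken again."""

    def __init__(self):
        self.held = []

    def acquire(self, key, subclass=0):
        cls = (key, subclass)
        warned = cls in self.held
        self.held.append(cls)
        return warned


def xmit_through(depth, use_subclass):
    """Take the xmit lock of `depth` stacked VLAN devices, top one first.

    Level 1 is a VLAN directly on a real device; higher levels sit on
    another VLAN and get subclass 1 when use_subclass is set, mirroring
    the check on the lower device in the patch. Returns True if the toy
    checker warned.
    """
    checker = ToyLockdep()
    warned = False
    for level in range(depth, 0, -1):
        subclass = 1 if (use_subclass and level > 1) else 0
        warned |= checker.acquire("vlan_xmit", subclass)
    return warned


print(xmit_through(2, False), xmit_through(2, True), xmit_through(3, True))
```

In this model two stacked VLANs warn without the subclass and stay quiet with it, while a third level warns again, which matches the "max. 2 levels" note in the changelog.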
^ permalink raw reply related
* Re: [PATCH take2] Re: Nested VLAN causes recursive locking error
From: Patrick McHardy @ 2008-01-10 15:31 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: Benny Amorsen, Chuck Ebbert, netdev
In-Reply-To: <20080102234107.GA6902@ami.dom.local>
Jarek Poplawski wrote:
> As a matter of fact I started to doubt it's a real problem: 2 vlan
> headers in the row - is it working?
Yes, apparently some people are using this.
> Anyway, as Patrick pointed, the previous patch was a bit buggy, and
> deeper nesting needs a little more (if it's can work too...). So,
> here is something minimal.
>
> Patrick, if you think about something else, then of course don't care
> about this patch.
No, this seems fine, thanks. Even better would be a way to get
the last lockdep subclass through lockdep somehow, but I couldn't
find a clean way for this. So I've applied your patch and also
fixed macvlan.
^ permalink raw reply