Netdev List

* Re: "src" attribute ignored for IPv6 (preferred source address selection)
From: Daniel Roesen @ 2010-10-17  1:12 UTC (permalink / raw)
  To: netdev
In-Reply-To: <4C8215C8.8000908@httrack.com>

On Sat, Sep 04, 2010 at 11:47:52AM +0200, Xavier Roche wrote:
> In an attempt to change the source address in various conditions (network 
> prefix, outgoing protocol, user owner id, ..), it appeared that the src 
> attribute of an IPv6 route command (ip -6 route add .. src ..) was simply 
> ignored.
>
> The outgoing interface can be selected, but not the preferred source 
> address (see (1)), which remains always the same within an interface.
>
> Is this a bug? Or a known limitation of the kernel?

Well, at least it was a known limitation six years ago, when I
discovered the same:

http://lkml.indiana.edu/hypermail/linux/kernel/0409.0/1768.html

Unfortunately I don't have the time (and most probably not the necessary
kernel knowhow) to hack this up myself, but I'm still interested. :-)

Best regards,
Daniel

-- 
CLUE-RIPE -- Jabber: dr@cluenet.de -- dr@IRCnet -- PGP: 0xA85C8AA0

^ permalink raw reply

* r8169: high CPU load (softirq) bottlenecking throughput
From: Lasse Makholm @ 2010-10-16 21:54 UTC (permalink / raw)
  To: netdev

Hello netdev people,

I'm seeing what seems to be an unreasonably high CPU utilization from
the r8169 driver on an Intel Atom board.

Serving a 1GB blob repeatedly through Apache basically maxes out one
CPU core completely...

Here is a typical snapshot of the CPU load:

11:17:23 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
11:17:24 PM  all    0.00    0.00    6.54    0.00    0.00   17.68    0.00    0.00   75.79
11:17:24 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
11:17:24 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
11:17:24 PM    2    0.00    0.00   27.00    0.00    0.00   73.00    0.00    0.00    0.00
11:17:24 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

There is no disk IO while this is going on (the 1GB blob is cached),
and the machine is otherwise idle.

Throughput tops out at around 50MB/s but the receiving end and the
switch are easily able to get 100MB/s from another box.

The board is a D945GCLF2D with an Atom 330 (2 x 1.6 GHz), so
presumably it should be able to pump out 100MB/s easily...

For SSH-tunneled traffic, this is a disaster because both encryption
and I/O land on the same CPU... I'm currently maxing out at 10 - 20
MB/s with SSH-tunneled traffic.

The kernel is currently 2.6.36-020636rc8-generic (Ubuntu daily
snapshot) but I've tried a handful of others between here and Ubuntu's
2.6.32-24....

lspci knows the following about the NIC:

makholm@korovyov:~$ lspci -vv -s 01:00.0
01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)
	Subsystem: Intel Corporation Device 0001
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 40
	Region 0: I/O ports at 1000 [size=256]
	Region 2: Memory at 88100000 (64-bit, non-prefetchable) [size=4K]
	Region 4: Memory at 88000000 (64-bit, prefetchable) [size=64K]
	Expansion ROM at 88020000 [disabled] [size=128K]
	Capabilities: <access denied>
	Kernel driver in use: r8169
	Kernel modules: r8169

makholm@korovyov:~$

I have a D-Link DGE-528T (also uses the r8169 driver) that shows
similar behavior in the Atom box. Trying that card in an Athlon II
box also shows a noticeable CPU load from the r8169 driver, though the
Athlon obviously copes much better...

Any tips on how to improve the situation? I'd be happy to provide more
info and/or test some patches etc. if needed...

Thanks
/Lasse

^ permalink raw reply

* How to change delayed ACKs?
From: Fred . @ 2010-10-16 21:57 UTC (permalink / raw)
  To: linux-net

I play online games.
Someone told me about Leatrix Latency Fix.
http://www.wowinterface.com/downloads/info13581-LeatrixLatencyFix.html

ACKs are delayed. The latency fix works by changing a setting to
decrease that delay.

On Windows it modifies 'TCPAckFrequency' in the registry.

On FreeBSD and Mac OS X:
sudo sysctl -w net.inet.tcp.delayed_ack=0

This key doesn't seem to exist on Linux.
error: "net.inet.tcp.delayed_ack" is an unknown key

Is there any way to reduce the delay of delayed ACKs in order to
increase performance in online gaming?

P.S. I'm not subscribed to the mailing list.
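
A minimal sketch of one per-socket approach on Linux, assuming the
application controls its own sockets: the TCP_QUICKACK socket option
asks the stack to send ACKs immediately instead of delaying them. The
helper name enable_quickack is hypothetical. Note that, per tcp(7), the
flag is not permanent, so an application would typically set it again
after each receive.

```python
import socket


def enable_quickack(sock: socket.socket) -> bool:
    """Try to turn on TCP_QUICKACK for one TCP socket.

    Returns True if the option was applied, False if the platform does
    not expose it (it is Linux-specific) or the kernel rejected it.
    """
    opt = getattr(socket, "TCP_QUICKACK", None)
    if opt is None:
        return False
    try:
        sock.setsockopt(socket.IPPROTO_TCP, opt, 1)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print("quickack applied:", enable_quickack(s))
    s.close()
```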

^ permalink raw reply

* Re: tbf/htb qdisc limitations
From: Jarek Poplawski @ 2010-10-16 20:58 UTC (permalink / raw)
  To: Bill Fink; +Cc: Eric Dumazet, Rick Jones, Steven Brudenell, netdev
In-Reply-To: <20101016005106.35e4cc8d.billfink@mindspring.com>

On Sat, Oct 16, 2010 at 12:51:06AM -0400, Bill Fink wrote:
> On Sat, 16 Oct 2010, Jarek Poplawski wrote:
> 
> > On Fri, Oct 15, 2010 at 05:37:46PM -0400, Bill Fink wrote:
> > ...
> > > i7test7% tc -s -d qdisc show dev eth2
> > > qdisc prio 1: root refcnt 33 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
> > >  Sent 11028687119 bytes 1223828 pkt (dropped 293, overlimits 0 requeues 0) 
> > >  backlog 0b 0p requeues 0 
> > > qdisc tbf 10: parent 1:1 rate 8900Mbit burst 1112500b/64 mpu 0b lat 4295.0s 
> > >  Sent 11028687077 bytes 1223827 pkt (dropped 293, overlimits 593 requeues 0) 
> > >  backlog 0b 0p requeues 0 
> > > 
> > > I'm not sure how you can have so many dropped but not have
> > > any TCP retransmissions (or not show up as requeues).  But
> > > there's probably something basic I just don't understand
> > > about how all this stuff works.
> > 
> > Me either, but it seems higher "limit" might help with these drops.
> 
> You were of course correct about the higher limit helping.
> I finally upgraded the field system to 2.6.35, and did some
> testing on the real data path of interest, which has an RTT
> of about 29 ms.  I set up a rate limit of 8 Gbps using the
> following commands:
> 
> tc qdisc add dev eth2 root handle 1: prio
> tc qdisc add dev eth2 parent 1:1 handle 10: tbf rate 8000mbit limit 35000000 burst 20000 mtu 9000
> tc filter add dev eth2 protocol ip parent 1: prio 1 u32 match ip protocol 6 0xff match ip dst 192.168.1.23 flowid 10:1
> 
> hecn-i7sl1% nuttcp -T10 -i1 -w50m 192.168.1.23
>   676.3750 MB /   1.00 sec = 5673.4646 Mbps     0 retrans
>   948.5625 MB /   1.00 sec = 7957.1508 Mbps     0 retrans
>   948.8125 MB /   1.00 sec = 7959.5902 Mbps     0 retrans
>   948.3750 MB /   1.00 sec = 7955.5382 Mbps     0 retrans
>   949.0000 MB /   1.00 sec = 7960.6696 Mbps     0 retrans
>   948.7500 MB /   1.00 sec = 7958.7873 Mbps     0 retrans
>   948.6875 MB /   1.00 sec = 7958.0959 Mbps     0 retrans
>   948.6250 MB /   1.00 sec = 7957.4205 Mbps     0 retrans
>   948.7500 MB /   1.00 sec = 7958.7237 Mbps     0 retrans
>   948.4375 MB /   1.00 sec = 7956.3648 Mbps     0 retrans
> 
>  9270.5625 MB /  10.09 sec = 7707.7457 Mbps 24 %TX 36 %RX 0 retrans 29.38 msRTT
> 
> hecn-i7sl1% tc -s -d qdisc show dev eth2
> qdisc prio 1: root refcnt 33 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 9779476756 bytes 1084943 pkt (dropped 0, overlimits 0 requeues 0) 
>  backlog 0b 0p requeues 0 
> qdisc tbf 10: parent 1:1 rate 8000Mbit burst 19000b/64 mpu 0b lat 35.0ms 
>  Sent 9779476756 bytes 1084943 pkt (dropped 0, overlimits 1831360 requeues 0) 
>  backlog 0b 0p requeues 0 
> 
> No drops!
> 
> BTW the effective rate limit seems to be a very coarse adjustment
> at these speeds.  I was seeing some data path issues at 8.9 Gbps
> so I tried setting slightly lower rates such as 8.8 Gbps, 8.7 Gbps,
> etc, but they still gave me an effective rate limit of about 8.9 Gbps.
> It wasn't until I got down to a setting of 8 Gbps that I actually
> got an effective rate limit of 8 Gbps.
> 
> Also the man page for tbf seems to be wrong/misleading about
> the burst parameter.  It states:
> 
> 	"If your buffer is too small, packets may be dropped because more
> 	tokens arrive per timer tick than fit in your bucket.  The minimum
> 	buffer size can be calculated by dividing the rate by HZ."
> 
> According to that, with a rate of 8 Gbps and HZ=1000, the minimum
> burst should be 1000000 bytes.  But my testing shows that a burst
> of just 20000 works just fine.  That's only 2 9000-byte packets
> or about 20 usec of traffic at the 8 Gbps rate.  Using too large
> a value for burst can actually be harmful as it allows the traffic
> to temporarily exceed the desired rate limit.

As I mentioned before, it could work, but your config is really on
the edge. Anyway, if a buffer smaller than the minimum size is needed,
something else is definitely wrong. (Btw, this size can matter less
with high resolution timers.) You could check whether my iproute patch
"tc_core: Use double in tc_core_time2tick()" (not merged) helps
here. While googling for this patch I found this page, which might be
interesting to you (besides the link to the thread with the patch at
the end, take 1 or 2, shouldn't matter):

http://code.google.com/p/pspacer/wiki/HTBon10GbE
 
If it doesn't help, reconsider hfsc.

Thanks,
Jarek P.
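
The figures quoted in this exchange can be checked with a few lines of
arithmetic. This is only a sketch: the inputs are the values given in
the thread (rate 8000mbit, HZ=1000, burst 20000, limit 35000000, RTT
about 29 ms), and the variable names are made up for illustration.

```python
# Values taken from the thread.
RATE_BITS_PER_S = 8_000_000_000  # tbf "rate 8000mbit"
HZ = 1000                        # timer frequency assumed in the thread
BURST_BYTES = 20_000             # tbf "burst 20000"
LIMIT_BYTES = 35_000_000         # tbf "limit 35000000"
RTT_MS = 29                      # "RTT of about 29 ms"

rate_bytes_per_s = RATE_BITS_PER_S // 8

# Man page rule of thumb: minimum burst = rate / HZ.
min_burst_bytes = rate_bytes_per_s // HZ

# How long the configured burst lasts at the configured rate.
burst_us = BURST_BYTES * 1_000_000 / rate_bytes_per_s

# How long a full queue of `limit` bytes takes to drain at that rate.
limit_drain_ms = LIMIT_BYTES * 1000 / rate_bytes_per_s

# Bandwidth-delay product of the path, for comparison with `limit`.
bdp_bytes = rate_bytes_per_s * RTT_MS // 1000

if __name__ == "__main__":
    print("minimum burst per man page:", min_burst_bytes, "bytes")
    print("configured burst lasts:", burst_us, "usec")
    print("full queue drains in:", limit_drain_ms, "ms")
    print("bandwidth-delay product:", bdp_bytes, "bytes")
```

The 35 ms drain time is roughly what tc reports as "lat 35.0ms", and
the limit of 35000000 bytes sits somewhat above the 29000000-byte
bandwidth-delay product, which fits with the observation that the
larger limit removed the drops.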

^ permalink raw reply

* Re: [Bugme-new] [Bug 19692] New: linux-2.6.36-rc5 crash with gianfar ethernet at full line rate traffic
From: Jarek Poplawski @ 2010-10-16 19:48 UTC (permalink / raw)
  To: emin ak
  Cc: Andrew Morton, netdev, bugzilla-daemon, bugme-daemon,
	Anton Vorontsov
In-Reply-To: <AANLkTinAj=9-RqNA6xKc_5ONz2hPMRGM8uJ2Xy7XJVdr@mail.gmail.com>

On Sat, Oct 16, 2010 at 02:14:01AM +0300, emin ak wrote:
> Hi Jarek.
> Sorry for the delayed answer. As I promised, I started the tests on
> Monday (in spite of dealing with The Monday Syndrome :). I applied
> your patch, and after two billion packets (approximately four hours)
> the kernel crashed with an skb_over_panic error similar to the first
> type. To check whether the patch had really failed, I reran the same
> test. That time, surprisingly, it didn't crash for two days with the
> same kernel. But this situation has occurred before. I think it is
> sometimes because of the randomness of the applied ethernet traffic,
> and mostly because I can't always apply full line rate random traffic
> to my target device because of a wrong test setup (switch / hardware
> packet generator settings etc.); then it takes three or more days to
> crash the kernel, and sometimes it never crashes. My device crashes
> only when I can apply full line rate random traffic. Before informing
> you and the list of the (official) test results, I want to be sure
> they are correct. So please let me run more tests on the target
> device for a few more days. After that I hope to come back with good
> news!

Hi Emin,
Sorry for being impatient. Actually, waiting is no problem for me. I
simply didn't know how reproducible this bug is, and was a bit
misled by your forecast of a one-day result (and terrified btw seeing
the words Monday and immediately side by side ;-). So, a few days or
weeks, no problem, it's all up to you.

On the other hand, it looks like there might be something else/more,
so I'd suggest adding this last skb_over_panic to bugzilla too.

Thanks,
Jarek P.

^ permalink raw reply

* Re: openvswitch/flow WAS ( Re: [rfc] Merging the Open vSwitch datapath
From: Jesse Gross @ 2010-10-16 19:33 UTC (permalink / raw)
  To: hadi; +Cc: Ben Pfaff, netdev, ovs-team
In-Reply-To: <1287228959.3664.72.camel@bigi>

On Sat, Oct 16, 2010 at 4:35 AM, jamal <hadi@cyberus.ca> wrote:
> On Fri, 2010-10-15 at 14:35 -0700, Jesse Gross wrote:
>
>>
>> You're right, at a high level, it appears that there is a bit of an
>> overlap between bridging, tc, and Open vSwitch.
>
> It looks like openvswitch rides on top of openflow, correct?
> earlier i was looking at  openflow/datapath but gleaning
> openvswitch/datapath it still looks conceptually the same
> at the lower level.

Yes, Open vSwitch supports the OpenFlow protocol.  However, the Open
vSwitch kernel portion is completely different from the OpenFlow
reference implementation datapath and in fact does not speak OpenFlow
at the kernel level.  You brought up the point of keeping the kernel
simple and making policy decisions in userspace.  I completely agree
and, in fact, that is the reason why Open vSwitch is designed the way
it is.

I think it might be helpful if I gave a high level overview of packet
processing:

When a packet is received, the relevant fields from the packet are
extracted and matched against a hash table.  The most interesting part
is actually what happens when the packets don't match a hash entry:
they get sent up to userspace.  It is userspace that makes a policy
decision about the traffic and then pushes down a flow entry for
future packets to match.  Some of the things that those decisions can
be based on include: OpenFlow rules, wildcarded entries, normal L2
learning, etc.  From then on, packets in that flow can be processed on
the fast path in the kernel with minimal overhead, while still getting
the benefit of the knowledge of userspace.
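
The flow described above can be sketched as a toy model. This is not
Open vSwitch code: the class Datapath, the key fields chosen in
extract_key, and the l2_policy callback are all hypothetical and only
illustrate the "exact-match table in the fast path, policy decision on
a miss" idea.

```python
from typing import Callable, Dict, Hashable, List, Tuple

FlowKey = Tuple[Hashable, ...]
Actions = List[str]


class Datapath:
    """Toy fast path: exact-match flow table with a slow-path upcall."""

    def __init__(self, upcall: Callable[[dict], Actions]) -> None:
        self.flow_table: Dict[FlowKey, Actions] = {}
        self.upcall = upcall  # stands in for the userspace policy decision
        self.misses = 0

    @staticmethod
    def extract_key(packet: dict) -> FlowKey:
        # Hypothetical choice of fields; a real key covers many more.
        return (packet.get("in_port"), packet.get("eth_src"),
                packet.get("eth_dst"))

    def receive(self, packet: dict) -> Actions:
        key = self.extract_key(packet)
        actions = self.flow_table.get(key)
        if actions is None:
            # Miss: ask "userspace", then install a flow entry so that
            # later packets of the same flow stay on the fast path.
            self.misses += 1
            actions = self.upcall(packet)
            self.flow_table[key] = actions
        return actions


def l2_policy(packet: dict) -> Actions:
    """Hypothetical policy: known destination goes to port 2, else flood."""
    return ["output:2"] if packet.get("eth_dst") == "bb" else ["flood"]


if __name__ == "__main__":
    dp = Datapath(l2_policy)
    pkt = {"in_port": 1, "eth_src": "aa", "eth_dst": "bb"}
    print(dp.receive(pkt), dp.receive(pkt), "misses:", dp.misses)
```

Only the first packet of a flow reaches the policy callback; the second
lookup is served from the table.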

So I think that we are actually in agreement on quite a number of
points: the kernel should be kept as simple as possible, the control
plane should be abstracted out and handled in userspace, and it should
be possible to map the control rules (from OpenFlow or anywhere
really) onto a simpler set of primitives for handling packets.

So with those goals in mind, here's what is needed:
1. Packet field extraction and classification.  Realistically speaking
a new, specialized classifier would probably be needed, as you
mention.
2. A mechanism to send/receive packets to/from userspace.  This is an
important component that Open vSwitch adds to the pipeline.  This will
probably expand in the future to suit different applications, like the
security processing that I talked about.
3. Output actions.  A few exist today, at least some new ones will
need to be added.

So in reality, all of the major components of Open vSwitch are actually
not present in the kernel today.  I know the argument could be made
that certain parts can be replicated in different ways but that's
back to the simplicity point that I was making earlier.  The u32
classifier isn't well suited for these types of rules and neither is
pedit.  If we're going to add the needed components either way, let's
not make everyone's lives more complicated by mixing everything
together.

^ permalink raw reply

* Re: [PATCH -next] net: move MII outside of NET_ETHERNET, fix kconfig warning
From: David Miller @ 2010-10-16 18:58 UTC (permalink / raw)
  To: randy.dunlap; +Cc: linux-kernel, netdev, akpm, jgarzik
In-Reply-To: <20101013181859.fc3a96fd.randy.dunlap@oracle.com>

From: Randy Dunlap <randy.dunlap@oracle.com>
Date: Wed, 13 Oct 2010 18:18:59 -0700

> From: Randy Dunlap <randy.dunlap@oracle.com>
> 
> We have USB, PCMCIA, and gigabit ethernet drivers that select
> MII even though NET_ETHERNET is not enabled, so make MII not
> be dependent on NET_ETHERNET.  It is still dependent on NET
> and NETDEVICES.
> 
> Fixes kconfig unmet dependency warning (shortened, was very long string):
 ...
> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
> Acked-by: Jeff Garzik <jgarzik@pobox.com> [2006-NOV-30]

Applied, and I added your infiniband patch too.

Thanks Randy.

^ permalink raw reply

* Re: [PATCH net-next 2/2] stmmac: make function tables const
From: David Miller @ 2010-10-16 18:57 UTC (permalink / raw)
  To: shemminger; +Cc: peppe.cavallaro, deepak.sikri, netdev
In-Reply-To: <20101013175125.6f6a64db@nehalam>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Wed, 13 Oct 2010 17:51:25 -0700

> These tables only contain function pointers.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 1/2]: stmmac: make ethtool functions local
From: David Miller @ 2010-10-16 18:57 UTC (permalink / raw)
  To: shemminger; +Cc: peppe.cavallaro, deepak.sikri, netdev
In-Reply-To: <20101013175031.41c29349@nehalam>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Wed, 13 Oct 2010 17:50:31 -0700

> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] tipc: cleanup function namespace
From: David Miller @ 2010-10-16 18:56 UTC (permalink / raw)
  To: shemminger; +Cc: paul.gortmaker, nhorman, netdev, allan.stephens
In-Reply-To: <20101013162035.0c2e8123@nehalam>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Wed, 13 Oct 2010 16:20:35 -0700

> Do some cleanups of TIPC based on make namespacecheck
>   1. Don't export unused symbols
>   2. Eliminate dead code
>   3. Make functions and variables local
>   4. Rename buf_acquire to tipc_buf_acquire since it is used in several files
> 
> Compile tested only.
> This may break out-of-tree kernel modules that depend on TIPC routines.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

I really think we can and should do this now, so I've
applied this to net-next-2.6

Thanks everyone.

^ permalink raw reply

* Re: [PATCH] via-velocity: forced 1000 Mbps mode support.
From: David Miller @ 2010-10-16 18:56 UTC (permalink / raw)
  To: romieu; +Cc: davidlv.linux, netdev, DavidLv, ShirleyHu, AndersMa, rseguier
In-Reply-To: <20101013192605.GA5157@electric-eye.fr.zoreil.com>

From: Francois Romieu <romieu@fr.zoreil.com>
Date: Wed, 13 Oct 2010 21:26:05 +0200

> Full duplex only. Half duplex 1000 Mbps is not supported.
> 
> Signed-off-by: David Lv <DavidLv@viatech.com.cn>
> Acked-by: Francois Romieu <romieu@fr.zoreil.com>
> Tested-by: Seguier Regis <rseguier@e-teleport.net>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] fib: avoid false sharing on fib_table_hash
From: David Miller @ 2010-10-16 18:56 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1286994123.3876.1137.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 13 Oct 2010 20:22:03 +0200

> While doing profile analysis, I found fib_table_hash was sometimes in a
> cache line shared by a possibly often written kernel structure.
> 
> (CONFIG_IP_ROUTE_MULTIPATH || !CONFIG_IPV6_MULTIPLE_TABLES)
> 
> It's hard to detect because it is not easily reproducible.
> 
> Make sure we allocate a full cache line to keep this shared in all cpus'
> caches.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] fib_trie: use fls() instead of open coded loop
From: David Miller @ 2010-10-16 18:55 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, Robert.Olsson
In-Reply-To: <1286988971.3876.974.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 13 Oct 2010 18:56:11 +0200

> fib_table_lookup() might use fls() to speedup an open coded loop.
> 
> Noticed while doing a profile analysis.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.
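
As a rough illustration of the idea (not the loop that was actually in
fib_table_lookup()), fls() returns the 1-based position of the most
significant set bit, with fls(0) == 0. A bit-by-bit loop and a single
"find last set" operation give the same answer; in this sketch Python's
int.bit_length() stands in for the kernel helper, and the function
names are made up.

```python
def fls_loop(x: int) -> int:
    """Open-coded version: shift right until the value is exhausted."""
    pos = 0
    while x:
        pos += 1
        x >>= 1
    return pos


def fls(x: int) -> int:
    """Single-operation version; bit_length() has fls() semantics."""
    return x.bit_length()


if __name__ == "__main__":
    for value in (0, 1, 2, 0x80, 0x80000000, 0xFFFFFFFF):
        print(hex(value), fls_loop(value), fls(value))
```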

^ permalink raw reply

* Re: [PATCH net-next] fib: remove a useless synchronize_rcu() call
From: David Miller @ 2010-10-16 18:55 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1286980984.3876.679.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 13 Oct 2010 16:43:04 +0200

> fib_nl_delrule() calls synchronize_rcu() for no apparent reason,
> while rtnl is held.
> 
> I suspect it was done to avoid an atomic_inc_not_zero() in
> fib_rules_lookup(), which commit 7fa7cb7109d07 added anyway.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] fib6: use FIB_LOOKUP_NOREF in fib6_rule_lookup()
From: David Miller @ 2010-10-16 18:55 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1286973940.3876.423.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 13 Oct 2010 14:45:40 +0200

> Avoid two atomic ops on found rule in fib6_rule_lookup()
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply

* Re: [patch -next] pch_gbe: fix if condition in set_settings()
From: David Miller @ 2010-10-16 18:55 UTC (permalink / raw)
  To: error27; +Cc: netdev, masa-korg, kernel-janitors
In-Reply-To: <20101013093619.GC6060@bicker>

From: Dan Carpenter <error27@gmail.com>
Date: Wed, 13 Oct 2010 11:36:19 +0200

> There were no curly braces in this if condition so it always enabled full
> duplex.
> 
> And ecmd->speed is an unsigned short so it is never equal to -1.  The
> effect is that mii_ethtool_sset() fails with -EINVAL and an error is
> printed to dmesg.
> 
> Signed-off-by: Dan Carpenter <error27@gmail.com>

Applied.
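
The second point in the description can be illustrated with a small
sketch: once -1 has been stored in a 16-bit unsigned field it reads
back as 65535, so a later comparison against -1 can never be true. The
helper as_u16 is hypothetical and only mimics the truncation a C
unsigned short would apply.

```python
def as_u16(value: int) -> int:
    """Mimic storing a value into a C unsigned short (16 bits)."""
    return value & 0xFFFF


# What a 16-bit unsigned field holds after "-1" is assigned to it.
speed = as_u16(-1)

if __name__ == "__main__":
    print("stored value:", speed)         # the truncated field content
    print("equals -1?  ", speed == -1)    # the comparison the patch removed
```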

^ permalink raw reply

* Re: [PATCH] dnet: mark methods static and annotate for correct endianness
From: David Miller @ 2010-10-16 18:55 UTC (permalink / raw)
  To: harvey.harrison; +Cc: netdev
In-Reply-To: <1286958034-19372-1-git-send-email-harvey.harrison@gmail.com>

From: Harvey Harrison <harvey.harrison@gmail.com>
Date: Wed, 13 Oct 2010 01:20:34 -0700

> There don't appear to be bugs with the endianness handling here, just get the
> annotations right to keep sparse happy.
> 
> Suppresses the following sparse warnings:
 ...
> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH] cxgb4vf: make single bit signed bitfields unsigned
From: David Miller @ 2010-10-16 18:55 UTC (permalink / raw)
  To: harvey.harrison; +Cc: netdev
In-Reply-To: <1286956346-17147-1-git-send-email-harvey.harrison@gmail.com>

From: Harvey Harrison <harvey.harrison@gmail.com>
Date: Wed, 13 Oct 2010 00:52:26 -0700

> Single bit signed bitfields don't make a lot of sense, noticed by sparse:
 ...
> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>

Applied.
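
A short sketch of why sparse complains: on the usual two's-complement
compilers a signed bitfield that is one bit wide can only hold 0 and
-1, so assigning 1 to it reads back as -1. The helper below is
hypothetical and simply sign-extends a raw field of a given width.

```python
def read_signed_bitfield(raw: int, width: int) -> int:
    """Interpret the low `width` bits of raw as a two's-complement value."""
    raw &= (1 << width) - 1
    sign_bit = 1 << (width - 1)
    return raw - (1 << width) if raw & sign_bit else raw


def read_unsigned_bitfield(raw: int, width: int) -> int:
    """Interpret the low `width` bits of raw as an unsigned value."""
    return raw & ((1 << width) - 1)


if __name__ == "__main__":
    print("signed 1-bit field holding 1:  ", read_signed_bitfield(1, 1))
    print("unsigned 1-bit field holding 1:", read_unsigned_bitfield(1, 1))
```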

^ permalink raw reply

* Re: [PATCH net-next] net: allocate skbs on local node
From: David Miller @ 2010-10-16 18:54 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, mchan, eilong, akpm, hch, cl
In-Reply-To: <1286859925.30423.184.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 12 Oct 2010 07:05:25 +0200

> [PATCH net-next] net: allocate skbs on local node
> 
> commit b30973f877 (node-aware skb allocation) spread a wrong habit of
> allocating net drivers skbs on a given memory node : The one closest to
> the NIC hardware. This is wrong because as soon as we try to scale
> network stack, we need to use many cpus to handle traffic and hit
> slub/slab management on cross-node allocations/frees when these cpus
> have to alloc/free skbs bound to a central node.
> 
> skbs allocated in RX path are ephemeral, they have a very short
> lifetime : Extra cost to maintain NUMA affinity is too expensive. What
> appeared as a nice idea four years ago is in fact a bad one.
> 
> In 2010, NIC hardware is multiqueue, or we use RPS to spread the load,
> and two 10Gb NICs might deliver more than 28 million packets per second,
> needing all the available cpus.
> 
> Cost of cross-node handling in network and vm stacks outweighs the
> small benefit hardware had when doing its DMA transfer in its 'local'
> memory node at RX time. Even trying to differentiate the two allocations
> done for one skb (the sk_buff on local node, the data part on NIC
> hardware node) is not enough to bring good performance.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.
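
The packet rate quoted in the description can be sanity-checked under
one assumption that the message itself does not state: minimum-size
Ethernet frames (64 bytes) plus 8 bytes of preamble and a 12-byte
inter-frame gap, i.e. 84 bytes on the wire per packet.

```python
LINK_BITS_PER_S = 10_000_000_000   # one 10Gb NIC
FRAME_BYTES = 64                   # assumed: minimum Ethernet frame
PREAMBLE_BYTES = 8                 # assumed: preamble + start delimiter
IFG_BYTES = 12                     # assumed: inter-frame gap
NICS = 2                           # "two 10Gb NIC" from the description

wire_bits_per_packet = (FRAME_BYTES + PREAMBLE_BYTES + IFG_BYTES) * 8
per_nic_pps = LINK_BITS_PER_S // wire_bits_per_packet
total_pps = NICS * per_nic_pps

if __name__ == "__main__":
    print("packets per second per NIC:", per_nic_pps)
    print("packets per second total:  ", total_pps)
```

Under these assumptions the total comes out a little under 30 million
packets per second, in line with "more than 28 million".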

^ permalink raw reply

* Re: [PATCH] net: introduce alloc_skb_order0
From: David Miller @ 2010-10-16 18:53 UTC (permalink / raw)
  To: eric.dumazet; +Cc: sgruszka, romieu, netdev
In-Reply-To: <1286831867.30423.80.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 11 Oct 2010 23:17:47 +0200

> [PATCH net-next] r8169: use 50% less ram for RX ring
> 
> Using standard skb allocations in r8169 leads to order-3 allocations (if
> PAGE_SIZE=4096), because NIC needs 16383 bytes, and skb overhead makes
> this bigger than 16384 -> 32768 bytes per "skb"
> 
> Using kmalloc() lets us reduce the memory requirements of one r8169 nic
> by 4Mbytes. (256 frames * 16Kbytes). This is fine since a hardware bug
> requires us to copy incoming frames, so we build the real skb when doing
> this copy.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.
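
The memory figures in the description follow from power-of-two
rounding. A sketch of the arithmetic, assuming allocations of this size
are rounded up to the next power of two; SKB_OVERHEAD is a placeholder
value, since all that matters here is that it pushes the request past
16384 bytes.

```python
PAGE_SIZE = 4096        # from the description
RX_BUF_BYTES = 16383    # what the NIC needs per frame
RING_ENTRIES = 256      # frames in the RX ring
SKB_OVERHEAD = 300      # placeholder: any value > 1 gives the same result


def round_up_pow2(n: int) -> int:
    """Smallest power of two that is >= n."""
    return 1 << (n - 1).bit_length()


skb_alloc = round_up_pow2(RX_BUF_BYTES + SKB_OVERHEAD)  # skb-based buffer
raw_alloc = round_up_pow2(RX_BUF_BYTES)                 # plain kmalloc()

# Allocation order: log2 of the number of pages in the skb-based buffer.
skb_order = (skb_alloc // PAGE_SIZE).bit_length() - 1

saved_bytes = RING_ENTRIES * (skb_alloc - raw_alloc)

if __name__ == "__main__":
    print("skb-based buffer:", skb_alloc, "bytes, order", skb_order)
    print("kmalloc() buffer:", raw_alloc, "bytes")
    print("saved per ring:  ", saved_bytes // (1024 * 1024), "MiB")
```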

^ permalink raw reply

* Re: [PATCH linux-2.6 v2] IPv6: Temp addresses are immediately deleted.
From: Glenn Wurster @ 2010-10-16 18:42 UTC (permalink / raw)
  To: David Miller
  Cc: kuznet, pekkas, jmorris, yoshfuji, kaber, shemminger,
	eric.dumazet, herbert, ebiederm, netdev, linux-kernel
In-Reply-To: <20100928.233028.112599117.davem@davemloft.net>

On September 29, 2010 02:30:28 am David Miller wrote:
> From: Glenn Wurster <gwurster@scs.carleton.ca>
> Date: Mon, 27 Sep 2010 13:10:10 -0400
> 
> > There is a bug in the interaction between ipv6_create_tempaddr and
> > addrconf_verify.  Because ipv6_create_tempaddr uses the cstamp and tstamp
> > from the public address in creating a private address, if we have not
> > received a router advertisement in a while, tstamp + temp_valid_lft might
> > be < now.  If this happens, the new address is created inside
> > ipv6_create_tempaddr, then the loop within addrconf_verify starts again
> > and the address is immediately deleted.  We are left with no temporary
> > addresses on the interface, and no more will be created until the public
> > IP address is updated.  To avoid this, set the expiry time to be the
> > minimum of the time left on the public address or the config option PLUS
> > the current age of the public interface.
> > 
> > Version 2, now with 100% fewer line wraps.  Thanks to David Miller for
> > pointing out the line wrapping issue.
> > 
> > Signed-off-by: Glenn Wurster <gwurster@scs.carleton.ca>
> 
> This can only happen if we apply your other patch, which I showed
> was incorrect as per RFCs.
> 
> We only create temporary address when public addresses are created,
> and this is the point where we are handling a router advertisement
> with non-zero Valid Lifetime.
> 
> Therefore I'm not applying this patch either.

No, the first patch was to create a temporary address if none exists.  Like 
Brian Haley pointed out, that patch accommodates the case where we set 
use_tempaddr to a non-zero value after the interface had been brought up.

This patch accommodates the case where the router is only broadcasting 
advertisements every x seconds, and yet the user has set the valid_lft to be 
something less than x.  In this setup, the condition I mentioned in the patch 
description happens, where the new temporary address is created, but the last 
modification time on that temporary address is set to the time of the last 
router advertisement, which was more than valid_lft seconds ago.  In this 
case, the temporary address is immediately deleted, and we are left with no 
temporary address on the interface.  Furthermore, because all temporary 
addresses get deleted by the time the next router advertisement arrives, we 
are left with not being able to use temporary addresses until we move 
networks.

I tested this patch alone, and it works as intended, allowing temporary 
addresses to continue to be created and deleted between received router 
advertisements.

You can easily test the bug by setting tmp_valid_lft to 60 and then running 
radvd.  The defaults for radvd seem to be a minimum retransmit on unsolicited 
router advertisements of 200 seconds (http://linux.die.net/man/5/radvd.conf), 
much higher than the 60 seconds it is going to take for the temporary address 
to expire. 
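
The scenario can be sketched with plain numbers. This is a model of the
description, not the kernel code: the function names are hypothetical,
the public address lifetime of 86400 seconds is an assumed value, and
the "after" formula is one reading of the patch description (minimum of
the public address lifetime and the config option plus the current
age).

```python
def is_expired(tstamp: int, valid_lft: int, now: int) -> bool:
    """Expiry check as described: tstamp + valid_lft < now."""
    return tstamp + valid_lft < now


def temp_lft_before(public_valid_lft: int, cfg_temp_valid_lft: int) -> int:
    """Lifetime given to a new temporary address without the patch."""
    return min(public_valid_lft, cfg_temp_valid_lft)


def temp_lft_after(public_valid_lft: int, cfg_temp_valid_lft: int,
                   age: int) -> int:
    """Lifetime with the patch: the config value is extended by the age."""
    return min(public_valid_lft, cfg_temp_valid_lft + age)


# Last router advertisement at t=0; a new temporary address is created
# at t=100 and inherits tstamp=0 from the public address.
RA_TSTAMP = 0
NOW = 100
PUBLIC_VALID_LFT = 86400   # assumed lifetime of the public address
TEMP_VALID_LFT = 60        # the test setting suggested in the message

age = NOW - RA_TSTAMP
expired_before = is_expired(
    RA_TSTAMP, temp_lft_before(PUBLIC_VALID_LFT, TEMP_VALID_LFT), NOW)
expired_after = is_expired(
    RA_TSTAMP, temp_lft_after(PUBLIC_VALID_LFT, TEMP_VALID_LFT, age), NOW)

if __name__ == "__main__":
    print("deleted immediately without the patch:", expired_before)
    print("deleted immediately with the patch:   ", expired_after)
```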

Glenn.

^ permalink raw reply

* 2.6.36-rc7: net/bridge causes temporary network I/O lockups [2]
From: Patrick Ringl @ 2010-10-16 18:15 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, herbert, bridge

Hi,

okay I narrowed down the issue. I watched all function calls of the 
'bridge' module with the help of a small systemtap probe of mine. I 
first traced a timespan where the issue did not occur, then one where it 
did and composed an intersection of these two:

br_fdb_cleanup
br_flood
br_flood_forward
br_ip4_multicast_add_group
br_ip4_multicast_alloc_query
br_ip4_multicast_leave_group
br_ip6_multicast_alloc_query
br_mdb_get
br_multicast_alloc_query
br_multicast_flood
br_multicast_forward
br_multicast_ipv4_rcv
br_multicast_port_query_expired
br_multicast_query_expired
br_multicast_rcv
__br_multicast_send_query
br_multicast_send_query

igmp_hdr
ip_hdrlen
ipv6_addr_copy
ipv6_addr_set
ipv6_eth_mc_map
ipv6_hdr

maybe_deliver
netdev_alloc_skb
netdev_alloc_skb_ip_align

skb_checksum_complete
__skb_pull
__skb_push
skb_reserve
skb_reset_transport_header
skb_set_network_header
skb_set_transport_header

These are the function calls that are exclusively called during the 
'nonfunctional'-timespan.

This again gave me the idea to use tcpdump and watch out for igmp and 
v6. Well, and that is also where the issue is coming from.

Once a multicast membership query (igmp) arrives, a multicast listener 
query (icmpv6) is sent.
From my understanding of the bridge code, br_flood will propagate the 
packet to all nodes (simple multicast) and this is also where things 
stop working. Systemtap itself, and thus in my case the function calls of 
the bridge module, are not delayed, but something must be wrong in the 
multicast handling of the bridge interface, since as pointed out in my 
previous email with 2.6.32 everything is working fine.

Can anyone reconfirm this issue, or give a helping hand in how to 
proceed further?

PS: Herbert, I've seen your changes for 2.6.34 which I think are 
responsible for this behavior (even 2.6.33 works fine here. Anything 
containing your multicast-related fixes breaks here).
Could you specifically take a look into it and/or tell me how I can help 
you?

PPS: Again please CC back to me, since I am not subscribed

regards,
Patrick

^ permalink raw reply

* RE: [PATCH] vmxnet3: remove set_flag_le{16,64} functions
From: Shreyas Bhatewara @ 2010-10-16 15:57 UTC (permalink / raw)
  To: Harvey Harrison; +Cc: netdev@vger.kernel.org, shemminger@vyatta.com
In-Reply-To: <1287214392-8963-1-git-send-email-harvey.harrison@gmail.com>

________________________________________
From: Harvey Harrison [harvey.harrison@gmail.com]
Sent: Saturday, October 16, 2010 12:33 AM
To: Shreyas Bhatewara
Cc: netdev@vger.kernel.org; shemminger@vyatta.com
Subject: [PATCH] vmxnet3: remove set_flag_le{16,64} functions

Opencode the flag setting in the few places it was being done.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
---
@@ -1617,8 +1599,7 @@ vmxnet3_vlan_rx_register(struct net_device *netdev, struct vlan_group *grp)
                                               VMXNET3_CMD_UPDATE_VLAN_FILTERS);

                        /* update FEATURES to device */
-                       reset_flag_le64(&devRead->misc.uptFeatures,
-                                       UPT1_F_RXVLAN);
+                       devRead->misc.uptFeatures |= cpu_to_le64(UPT1_F_RXVLAN);
                        VMXNET3_WRITE_BAR1_REG(adapter, VMXNET3_REG_CMD,
                                               VMXNET3_CMD_UPDATE_FEATURE);
                }


This hunk is wrong, UPT1_F_RXVLAN should be reset here.

I fail to understand why set/reset functions were not good enough and why this opencoding is required. I agree with your earlier suggestion about replacing two conversions with *data |= cpu_to_le16(flag); though.
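
The objection to the hunk comes down to the difference between setting
and clearing a bit. A sketch with plain integers: the flag value used
for UPT1_F_RXVLAN is a placeholder, and the helper names are made up.
In C, the open-coded form of a reset would be along the lines of
"x &= ~cpu_to_le64(flag)", whereas the hunk uses "|=", which sets the
flag.

```python
UPT1_F_RXVLAN = 0x0004  # placeholder value for illustration only


def set_flag(features: int, flag: int) -> int:
    """Open-coded 'set': what the |= in the hunk does."""
    return features | flag


def reset_flag(features: int, flag: int) -> int:
    """Open-coded 'reset': what the original reset_flag_le64() call did."""
    return features & ~flag


if __name__ == "__main__":
    features = 0x000F
    print("after |=  :", hex(set_flag(features, UPT1_F_RXVLAN)))
    print("after &= ~:", hex(reset_flag(features, UPT1_F_RXVLAN)))
```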


Regards.
Shreyas

^ permalink raw reply

* Re: Linux Plumbers Conference: User-visible Network Issues Mini-Conf
From: Matt Domsch @ 2010-10-16 14:04 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20101016.011834.59671697.davem@davemloft.net>

On Sat, Oct 16, 2010 at 01:18:34AM -0700, David Miller wrote:
> Half a month before the event is not the time to be making
> this kind of formal plea with core networking people.
> 
> If someone isn't already attending LPC, at this point there is next to
> zero chance they are going to be able to make plans to do so in time.

Indeed, and I apologize for the tardiness.  The list of topics is,
with the exception of one which was dropped, the same as in my note of
Sept 17, when I made the same plea, but I fully admit I should have
finalized this list earlier.

-- 
Matt Domsch
Technology Strategist
Dell | Office of the CTO

^ permalink raw reply
