* Re: [PATCH net-next-2.6] net: fix a lockdep rcu warning in __sk_dst_set()
From: David Miller @ 2010-04-27 19:42 UTC (permalink / raw)
To: paulmck; +Cc: eric.dumazet, netdev
In-Reply-To: <20100427161716.GB2424@linux.vnet.ibm.com>
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Date: Tue, 27 Apr 2010 09:17:16 -0700
> On Tue, Apr 27, 2010 at 08:40:43AM +0200, Eric Dumazet wrote:
>> __sk_dst_set() might be called while no state can be integrated in a
>> rcu_dereference_check() condition.
>>
>> So use rcu_dereference_raw() to shutup lockdep warnings (if
>> CONFIG_PROVE_RCU is set)
>
> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
I've applied this to net-next-2.6, thanks!
^ permalink raw reply
* Re: [net-next-2.6 PATCH 1/2] Add ndo_set_vf_port_profile
From: Arnd Bergmann @ 2010-04-27 19:38 UTC (permalink / raw)
To: Anirban Chakraborty
Cc: Scott Feldman, Rose, Gregory V, David Miller,
netdev@vger.kernel.org, chrisw@redhat.com, Williams, Mitch A
In-Reply-To: <8966E338-1C9C-43D9-B6A3-A44349E7EE18@qlogic.com>
On Tuesday 27 April 2010 19:33:04 Anirban Chakraborty wrote:
> On Apr 27, 2010, at 5:35 AM, Arnd Bergmann wrote:
> > Anything that ties port profiles to VFs seems fundamentally flawed AFAICT,
> > at least when we want to extend this to adapters that don't do it in firmware.
>
> Correct me if I am wrong. Shouldn't the port profile be tied to the physical NICs which are essentially
> PCI functions (be it PF or VF)? I'd think that a port profile would have configuration settings for all the
> physical NICs (PF/VF) of a specific physical port of the adapter. I liked the idea of querying the device
> for number of VFs as it will cover both SR-IOV and non SR-IOV PCI functions.
Yes, the port profile association is tied to whoever owns the link to the switch.
That can be a regular NIC, an SR-IOV PF, an ethernet bonding device or an S-component
implementing provider S-VLANs on top of any of these.
Usually it will be the same as a physical link, but in case of bonding it is two
physical links while in case of S-VLAN, you have multiple instances that each
have their own set of port profile association. If S-VLAN is implemented by
the NIC, that may be a VF.
Querying a PF for the number of VFs attached to it is a useful thing, but this
is independent of port profiles. Consider this (artificially complex) setup:
- eth0 is the PF of an SR-IOV NIC
- eth1 is a regular single-channel NIC
- vf0 is a VF of eth0, used by a guest using PCI passthrough mode on S-VLAN 2
- vf1 is a VF of eth0 owned by the host on S-VLAN 3
- vf1.23 is a VLAN port for VLAN 23 in S-VLAN 3
- br0 is a bridge connected to vf1
- br23 is a bridge VLAN device for br0
- vf2 is a VF of eth0 owned by the host on S-VLAN 4
- eth1.5 is a software vlan device for S-VLAN 4
- bond0 combines eth1.5 and vf2
- bond0.24 is a VLAN port for VLAN24 on bond0
- tap0 is a guest connected to br0 in trunk mode
- tap1 is a guest connected to br23 in access mode
- macvtap0 is a VEPA mode guest on bond0
- macvtap1 is a private mode guest on bond0.24
This means you have a total of five guests running, on vf0, tap0, tap1,
macvtap0 and macvtap1. Querying the number of VFs on eth0 will return '2',
for vf0 and vf1. What you are interested in however is which guests are
associated. Querying every single interface in the system will tell you
eth0: one guest (vf0)
vf1: two guests (tap0 and tap1)
bond0: two guests (macvtap0 and macvtap1)
Arnd
^ permalink raw reply
* Re: [PATCH] bnx2x: add support for receive hashing
From: Eric Dumazet @ 2010-04-27 19:30 UTC (permalink / raw)
To: eilong
Cc: Rick Jones, David Miller, therbert@google.com,
netdev@vger.kernel.org
In-Reply-To: <1272393060.30392.2.camel@lb-tlvb-eilong.il.broadcom.com>
Le mardi 27 avril 2010 à 21:31 +0300, Eilon Greenstein a écrit :
> Though the thread is going in a different direction now, I just wanted
> to clarify two things:
> - yes, the 57710 and 57711 only handle the IP (src+dst) for UDP toeplitz
> hash. We all agree that it is much better to address the UDP ports as
> well, but I think Rick Jones explained the process very well - thank you
> Rick. Just to add one more (lame) excuse: the HW was designed before new
> NAPI was introduced and it complies with the requirements from Redmond
> - the next generation (57712) which we already sample does (finally)
> support it. We are working on a patch series to enhance the bnx2x to
> support this device now.
>
Thanks Eilon !
^ permalink raw reply
* Re: [PATCH 0/4] net: ipmr netlink interface for route dumping
From: Patrick McHardy @ 2010-04-27 18:41 UTC (permalink / raw)
To: David Miller; +Cc: netdev
In-Reply-To: <20100427.100345.241441437.davem@davemloft.net>
David Miller wrote:
> Whoa, there are three of you now?!?!?!
>
> :-)
>
That would be nice, I'd have my two clones do all the work :)
Not sure what happened, some mishandling of git send-email
apparently :)
^ permalink raw reply
* Re: vlan performance issue on outgoing traffic
From: Brandeburg, Jesse @ 2010-04-27 18:32 UTC (permalink / raw)
To: R. Weinedel; +Cc: netdev@vger.kernel.org
In-Reply-To: <4BD4C037.2070003@yahoo.de>
On Sun, 25 Apr 2010, R. Weinedel wrote:
> hallo,
>
> I have an performance issue with vlan interfaces on an Debian Lenny
> server. The problem occurs only on outgoing traffic from the vlan
> interfaces. They use only half of the available bandwidth - (490 Mbit/s
> measured with iperf ). Incoming traffic is handled @ 950 Mbit/s and is
> fine. The issue remains even with no switch and an direct connection
> between pc and server on the same nic. Removing (on server) the vlans
> from eth0 and configure one net on eth0 results in full speed (950
> Mbit/s) in both directions. Even another nic (onboard nvidia3 - mod
> forcedeth) couldn't solve it. I tested only in the same networking
> segment (vlan) without the need for ip forwarding or NAT, but the issue
> occurs on all my vlan's.
>
> All values were taken with iperf between the server and an ubuntu 9.04
> workstation (and vice versa). I have controlled (w. ethtool / stats from
> switch) that all connection was 1000-BaseT/full duplex. It looks like
> some kind of trafficshaping to me, but i don't use tc, qos,tos nor other
> priority handling.
> The network ist quite simple: One Server, one switch and then the
> workstations. No need for cascading or using (r)stp.
>
> Here some information about my network:
>
> Switch: Netgear GSM7224 Layer 2 managed switch, FW 6.2.0.14
> (independent, issue remains on direct connection).
>
> Server: Debian Lenny, kernel 2.6.26-2,
This version of the kernel doesn't support offloads for vlan adapters,
which is probably causing most of your decrease in throughput due to
either exhausting socket buffer size, or because of the round trip time
being so much more relevant when not sending large bursts using TSO.
Sometimes the flood of ACK packets causes higher cpu which could reduce
your throughput also.
The newer kernels will have a major impact on your setup due to a patch
that enabled pass through of hardware offloads to the vlan device's
offload advertisement.
The commit id of the patch is 5fb13570543f4ae022996c9d7c0c099c8abf22dd,
you can view it at:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5fb13570543f4ae022996c9d7c0c099c8abf22dd
> NIC: Intel Corporation 82541PI Gigabit Ethernet Con. (e1000 module).
This PCI adapter is bandwidth limited on the PCI bus, and so will be even
more sensitive to offload on (TSO) vs offload off.
> # ethtool eth0
> Settings for eth0:
> Supported ports: [ TP ]
> Supported link modes: 10baseT/Half 10baseT/Full
> 100baseT/Half 100baseT/Full
> 1000baseT/Full
> Supports auto-negotiation: Yes
> Advertised link modes: 1000baseT/Full
> Advertised auto-negotiation: Yes
> Speed: 1000Mb/s
> Duplex: Full
> Port: Twisted Pair
> PHYAD: 0
> Transceiver: internal
> Auto-negotiation: on
> Supports Wake-on: umbg
> Wake-on: g
> Current message level: 0x00000007 (7)
> Link detected: yes
>
> 8021q:
> filename: /lib/modules/2.6.26-2-686/kernel/net/8021q/8021q.ko
> version: 1.8
> license: GPL
> alias: rtnl-link-vlan
> srcversion: A61E1168F65EE335A91D4E1
> depends:
> vermagic: 2.6.26-2-686 SMP mod_unload modversions 686
>
> VLAN: #/proc/net/vlan/config
> VLAN Dev name | VLAN ID
> Name-Type: VLAN_NAME_TYPE_RAW_PLUS_VID_NO_PAD
> eth0.5 | 5 | eth0
> eth0.101 | 101 | eth0
> eth0.90 | 90 | eth0
>
> IFCONFIG:
> eth0 Link encap:Ethernet Hardware Adresse 00:0e:0c:bc:43:43
> inet6-Adresse: fe80::20e:cff:febc:4343/64
> Gültigkeitsbereich:Verbindung
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metrik:1
> RX packets:28140829 errors:0 dropped:218 overruns:0 frame:0
> TX packets:44994420 errors:0 dropped:0 overruns:0 carrier:0
> Kollisionen:0 Sendewarteschlangenlänge:1000
> RX bytes:3472864138 (3.2 GiB) TX bytes:3908682627 (3.6 GiB)
>
> eth0.5 Link encap:Ethernet Hardware Adresse 00:0e:0c:bc:43:43
> inet Adresse:XXX.YYY.5.1 Bcast:XXX.YYY.5.255 Maske:255.255.255.0
> inet6-Adresse: fe80::20e:cff:febc:4343/64
> Gültigkeitsbereich:Verbindung
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metrik:1
> RX packets:77807 errors:0 dropped:0 overruns:0 frame:0
> TX packets:69699 errors:0 dropped:0 overruns:0 carrier:0
> Kollisionen:0 Sendewarteschlangenlänge:0
> RX bytes:57578233 (54.9 MiB) TX bytes:7782844 (7.4 MiB)
>
> eth0.90 Link encap:Ethernet Hardware Adresse 00:0e:0c:bc:43:43
> inet Adresse:XXX.YYY.90.1 Bcast:XXX.YYY.90.255
> Maske:255.255.255.0
> inet6-Adresse: fe80::20e:cff:febc:4343/64
> Gültigkeitsbereich:Verbindung
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metrik:1
> RX packets:457850 errors:0 dropped:0 overruns:0 frame:0
> TX packets:913988 errors:0 dropped:0 overruns:0 carrier:0
> Kollisionen:0 Sendewarteschlangenlänge:0
> RX bytes:23824841 (22.7 MiB) TX bytes:1311485281 (1.2 GiB)
>
> eth0.101 Link encap:Ethernet Hardware Adresse 00:0e:0c:bc:43:43
> inet Adresse:XXX.YYY.101.1 Bcast:XXX.YYY.101.255
> Maske:255.255.255.0
> inet6-Adresse: fe80::20e:cff:febc:4343/64
> Gültigkeitsbereich:Verbindung
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metrik:1
> RX packets:24856818 errors:0 dropped:0 overruns:0 frame:0
> TX packets:41608593 errors:0 dropped:0 overruns:0 carrier:0
> Kollisionen:0 Sendewarteschlangenlänge:0
> RX bytes:423116676 (403.5 MiB) TX bytes:3855703636 (3.5 GiB)
>
> ROUTE: #route -n
> Ziel Router Genmask Flags Metric Ref Use
> Iface
> XXX.YYY.101.0 0.0.0.0 255.255.255.0 U 0 0 0
> eth0.101
> XXX.YYY.5.0 0.0.0.0 255.255.255.0 U 0 0 0
> eth0.5
> XXX.YYY.90.0 0.0.0.0 255.255.255.0 U 0 0 0
> eth0.90
> 0.0.0.0 192.168.5.4 0.0.0.0 UG 0 0 0
> eth0.5
>
> Can someone give me a hint, where my search for an solution should be
> going on ?
>
> Many thanks !
> Regards
> Ralf Weinedel
> Falkensee/Germany
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply
* Re: [PATCH] bnx2x: add support for receive hashing
From: Eilon Greenstein @ 2010-04-27 18:31 UTC (permalink / raw)
To: Rick Jones, David Miller, therbert@google.com,
eric.dumazet@gmail.com
Cc: netdev@vger.kernel.org
In-Reply-To: <4BD601C3.5030108@hp.com>
On Mon, 2010-04-26 at 14:12 -0700, Rick Jones wrote:
> David Miller wrote:
> > From: Rick Jones <rick.jones2@hp.com>
> > Date: Mon, 26 Apr 2010 13:48:22 -0700
> >
> >>Do not confuse explanation with endorsement.
> >
> > Ok, fair enough.
> >
> > But I don't see even the "other perspective" argument being even
> > valid. Big shops still use UDP and it has to scale.
>
> Preface - I too think it is massively stupid to ignore anything but TCP/IPv4,
> and unwise to ignore IPv6 and so on, but there is a very real reason why one of
> my email signatures reads:
>
> "The road to hell is paved with business decisions"
>
> > Or have they made multicast magically start working with TCP so
> > they can us it to do trades on the NASDAQ?
>
> No. How many NIC chips can NASDAQ be expected to move? 0.1%? or even 1% of the
> NIC chip market?
>
> How many more NIC chips are in places where someone says "You sold me on
> iSCSI/FCoE/whatnot, why can't I get 'link-rate' to/from iSCSI storage/whatnot?!"
>
> The NIC designer is there with his finance guys breathing down his neck shouting
> "ROI Uber Alles!" and "Your budget is only this many monetary units!" The
> system designers at the system vendors are hearing the same things from their
> own finance guys, have certain schedules, which then has them going to the NIC
> firms, who want to sell chips to the system guys "You have to be ready to ship
> by this date and your chip has to sell for no more than this."
>
> Lather, rinse, repeat a few times and you get compromises on top of compromises.
>
> Sometimes I think it is a wonder any of it actually works at all...
>
> rick jones
Though the thread is going in a different direction now, I just wanted
to clarify two things:
- yes, the 57710 and 57711 only handle the IP (src+dst) for UDP toeplitz
hash. We all agree that it is much better to address the UDP ports as
well, but I think Rick Jones explained the process very well - thank you
Rick. Just to add one more (lame) excuse: the HW was designed before new
NAPI was introduced and it complies with the requirements from Redmond
- the next generation (57712) which we already sample does (finally)
support it. We are working on a patch series to enhance the bnx2x to
support this device now.
Eilon
^ permalink raw reply
* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Miles Lane @ 2010-04-27 17:58 UTC (permalink / raw)
To: paulmck
Cc: Eric W. Biederman, Vivek Goyal, Eric Paris, Lai Jiangshan,
Ingo Molnar, Peter Zijlstra, LKML, nauman, eric.dumazet, netdev,
Jens Axboe, Gui Jianfeng, Li Zefan, Johannes Berg
In-Reply-To: <20100427162201.GA5826@linux.vnet.ibm.com>
On Tue, Apr 27, 2010 at 12:22 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Mon, Apr 26, 2010 at 09:27:44PM -0700, Paul E. McKenney wrote:
>> On Mon, Apr 26, 2010 at 11:35:10AM -0700, Eric W. Biederman wrote:
>> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>> >
>> > > Eric Dumazet traced these down to a commit from Eric Biederman.
>> > >
>> > > If I don't hear from Eric Biederman in a few days, I will attempt a
>> > > patch, but it would be more likely to be correct coming from someone
>> > > with a better understanding of the code. ;-)
>> >
>> > I already replied.
>> >
>> > http://lkml.org/lkml/2010/4/21/420
>>
>> You did indeed!!! This experience is giving me an even better appreciation
>> of the maintainers' ability to keep all their patches straight!
>>
>> I will put together something based on your suggestion.
>
> How about the following?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit 85fa42bd568ab99c375f018761ae6345249942cd
> Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Date: Mon Apr 26 21:40:05 2010 -0700
>
> net: suppress RCU lockdep false positive in twsk_net()
>
> Calls to twsk_net() are in some cases protected by reference counting
> as an alternative to RCU protection. Cases covered by reference counts
> include __inet_twsk_kill(), inet_twsk_free(), inet_twdr_do_twkill_work(),
> inet_twdr_twcal_tick(), and tcp_timewait_state_process(). RCU is used
> by inet_twsk_purge(). Locking is used by established_get_first()
> and established_get_next(). Finally, __inet_twsk_hashdance() is an
> initialization case.
>
> It appears to be non-trivial to locate the appropriate locks and
> reference counts from within twsk_net(), so used rcu_dereference_raw().
>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>
> diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
> index 79f67ea..a066fdd 100644
> --- a/include/net/inet_timewait_sock.h
> +++ b/include/net/inet_timewait_sock.h
> @@ -224,7 +224,9 @@ static inline
> struct net *twsk_net(const struct inet_timewait_sock *twsk)
> {
> #ifdef CONFIG_NET_NS
> - return rcu_dereference(twsk->tw_net);
> + return rcu_dereference_raw(twsk->tw_net); /* protected by locking, */
> + /* reference counting, */
> + /* initialization, or RCU. */
> #else
> return &init_net;
> #endif
>
Worked for me. Thanks!
Miles
^ permalink raw reply
* Re: [PATCH] bnx2x: add support for receive hashing
From: Eric Dumazet @ 2010-04-27 17:37 UTC (permalink / raw)
To: David Miller; +Cc: bmb, therbert, netdev, rick.jones2
In-Reply-To: <20100427.102038.57469310.davem@davemloft.net>
Le mardi 27 avril 2010 à 10:20 -0700, David Miller a écrit :
>
> Indeed, a huge issue, in that we haven't converted the UDP hash over
> to RCU yet :-)
>
I am not sure what you mean, UDP hash _is_ RCU converted ;)
> But because of the transient bind nature of UDP there are still a bunch
> of cases that won't even cure.
> --
We might use the ticket spinlock paradigm to let writers go in parallel
and let the user the socket lock
Instead of having the bh_lock_sock() to protect receive_queue *and*
backlog, writers get a unique slot in a table, that 'user' can handle
later.
Or serialize writers (before they try to bh_lock_sock()) with a
dedicated lock, so that user has 50% chances to get the sock lock,
contending with at most one writer.
^ permalink raw reply
* Re: [PATCH] bnx2x: add support for receive hashing
From: Eric Dumazet @ 2010-04-27 17:36 UTC (permalink / raw)
To: Tom Herbert; +Cc: David Miller, bmb, netdev, rick.jones2
In-Reply-To: <g2k65634d661004271031r2eb2000bxc30013009509c410@mail.gmail.com>
Le mardi 27 avril 2010 à 10:31 -0700, Tom Herbert a écrit :
> This is the problem that we are addressing with so_reuseport!
How standard applications are protected against a DDOS ?
^ permalink raw reply
* Re: [net-next-2.6 PATCH 1/2] Add ndo_set_vf_port_profile
From: Anirban Chakraborty @ 2010-04-27 17:33 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Scott Feldman, Rose, Gregory V, David Miller,
netdev@vger.kernel.org, chrisw@redhat.com, Williams, Mitch A
In-Reply-To: <201004271435.25480.arnd@arndb.de>
On Apr 27, 2010, at 5:35 AM, Arnd Bergmann wrote:
> On Tuesday 27 April 2010, Scott Feldman wrote:
>>> Yes, I believe that's there today:
>>>
>>> NLA_PUT_U32(skb, IFLA_NUM_VF, dev_num_vf(dev->dev.parent));
>>>
>>> The number of VFs is returned in RTM_GETLINK. But, it's only returned if:
>>>
>>> if (dev->netdev_ops->ndo_get_vf_config && dev->dev.parent)
>>>
>>> For my proposal, I'll need to return IFLA_NUM_VF unconditionally so callers
>>> can get num VFs.
>>
>> Hmmm...seems IFLA_NUM_VF assumes a PCI device supporting SR-IOV when it uses
>> dev_num_vf(). I think a better option would have been to query the device
>> for the number of VFs, without assuming SR-IOV or even PCI.
>>
>> I see a ndo_get_num_vf() coming...
>
> Shouldn't the number of registered port profiles be totally independent of
> the number of virtual functions?
>
> Any of the VFs could multiplex multiple guests using macvlan, which means you
> need to register each guest separately, not each VF.
>
> Anything that ties port profiles to VFs seems fundamentally flawed AFAICT,
> at least when we want to extend this to adapters that don't do it in firmware.
Correct me if I am wrong. Shouldn't the port profile be tied to the physical NICs which are essentially
PCI functions (be it PF or VF)? I'd think that a port profile would have configuration settings for all the
physical NICs (PF/VF) of a specific physical port of the adapter. I liked the idea of querying the device
for number of VFs as it will cover both SR-IOV and non SR-IOV PCI functions.
thanks,
-Anirban
^ permalink raw reply
* Re: [patch v2] sctp: cleanup: remove duplicate assignment
From: Vlad Yasevich @ 2010-04-27 17:32 UTC (permalink / raw)
To: David Miller
Cc: error27, sri, yjwei, cdischino, linux-sctp, netdev,
kernel-janitors
In-Reply-To: <20100427.095823.98890165.davem@davemloft.net>
David Miller wrote:
> From: Vlad Yasevich <vladislav.yasevich@hp.com>
> Date: Tue, 27 Apr 2010 10:32:34 -0400
>
>>
>> Dan Carpenter wrote:
>>> This assignment isn't needed because we did it earlier already.
>>>
>>> Also another reason to delete the assignment is because it triggers a
>>> Smatch warning about checking for NULL pointers after a dereference.
>>>
>>> Reported-by: Vlad Yasevich <vladislav.yasevich@hp.com>
>>> Signed-off-by: Dan Carpenter <error27@gmail.com>
>> Thanks. I'll take this one.
>
> And when will I get this from you? :-)
By the end of the week. I am trying to get all the testing finished. :)
-vlad
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply
* Re: [PATCH] bnx2x: add support for receive hashing
From: Tom Herbert @ 2010-04-27 17:31 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, bmb, netdev, rick.jones2
In-Reply-To: <1272388439.2295.369.camel@edumazet-laptop>
> So we have a BIG problem :
>
> All cpus are fighting to get the socket lock,
> and very litle progress is done.
>
> Note this problem has nothing to do with RPS, we could have
> it with multiqueue as well.
>
This is the problem that we are addressing with so_reuseport!
> Oh well...
>
>
>
>
^ permalink raw reply
* Re: [PATCH] bnx2x: add support for receive hashing
From: David Miller @ 2010-04-27 17:20 UTC (permalink / raw)
To: eric.dumazet; +Cc: bmb, therbert, netdev, rick.jones2
In-Reply-To: <1272388439.2295.369.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 27 Apr 2010 19:13:59 +0200
> So we have a BIG problem :
>
> All cpus are fighting to get the socket lock,
> and very litle progress is done.
>
> Note this problem has nothing to do with RPS, we could have
> it with multiqueue as well.
>
> Oh well...
Indeed, a huge issue, in that we haven't converted the UDP hash over
to RCU yet :-)
But because of the transient bind nature of UDP there are still a bunch
of cases that won't even cure.
^ permalink raw reply
* Re: [PATCH] cxgb3: Wait longer for control packets on initialization
From: David Miller @ 2010-04-27 17:18 UTC (permalink / raw)
To: divy; +Cc: adetsch, netdev
In-Reply-To: <4BD648FC.80602@chelsio.com>
From: Divy Le Ray <divy@chelsio.com>
Date: Mon, 26 Apr 2010 19:16:28 -0700
> Andre Detsch wrote:
>> In some Power7 platforms, when using VIOS (Virtual I/O Server), we
>> need to wait longer for control packets to finish transfer during
>> initialization.
>> Without this change, initialization may fail prematurely.
>>
>> Signed-off-by: Wen Xiong <wenxiong@us.ibm.com>
>> Signed-off-by: Andre Detsch <adetsch@br.ibm.com>
>>
>
> Acked-by: Divy Le Ray <divy@chelsio.com>
Applied.
^ permalink raw reply
* Re: [net-2.6 PATCH] ixgbe: Power down PHY during driver resets
From: David Miller @ 2010-04-27 17:18 UTC (permalink / raw)
To: jeffrey.t.kirsher; +Cc: netdev, gospo, peter.p.waskiewicz.jr
In-Reply-To: <20100427103814.23338.47637.stgit@localhost.localdomain>
From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 27 Apr 2010 03:38:15 -0700
> From: Peter Waskiewicz <peter.p.waskiewicz.jr@intel.com>
>
> The PHY laser is still on during driver init. It's allowing
> garbage to hit our FIFO, which eventually can cause the entire
> device to die. Power down the laser while setting up the device,
> and re-enable the laser before getting link.
>
> Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Applied.
^ permalink raw reply
* Re: [net-2.6 PATCH] e1000e: enable/disable ASPM L0s and L1 and ERT according to hardware errata
From: David Miller @ 2010-04-27 17:18 UTC (permalink / raw)
To: jeffrey.t.kirsher; +Cc: netdev, gospo, mjg, bruce.w.allan
In-Reply-To: <20100427133232.25490.92973.stgit@localhost.localdomain>
From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 27 Apr 2010 06:33:04 -0700
> From: Bruce Allan <bruce.w.allan@intel.com>
>
> Prompted by a previous patch submitted by Matthew Garret <mjg@redhat.com>,
> further digging into errata documentation reveals the current enabling or
> disabling of ASPM L0s and L1 states for certain parts supported by this
> driver are incorrect. 82571 and 82572 should always disable L1. For
> standard frames, 82573/82574/82583 can enable L1 but L0s must be disabled,
> and for jumbo frames 82573/82574 must disable L1. This allows for some
> parts to enable L1 in certain configurations leading to better power
> savings.
>
> Also according to the same errata, Early Receive (ERT) should be disabled
> on 82573 when using jumbo frames.
>
> Cc: Matthew Garret <mjg@redhat.com>
> Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Applied.
^ permalink raw reply
* Re: linux-next: build failure after merge of the final tree (net tree related)
From: David Miller @ 2010-04-27 17:18 UTC (permalink / raw)
To: sfr; +Cc: netdev, linux-next, linux-kernel, yoshfuji
In-Reply-To: <20100427.093430.258110898.davem@davemloft.net>
From: David Miller <davem@davemloft.net>
Date: Tue, 27 Apr 2010 09:34:30 -0700 (PDT)
> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Tue, 27 Apr 2010 15:25:16 +1000
>
>> After merging the bkl-ioctl tree, today's linux-next build (powerpc
>> ppc44x_defconfig) failed like this:
>>
>> net/bridge/br_multicast.c: In function 'br_ip6_multicast_alloc_query':
>> net/bridge/br_multicast.c:469: error: implicit declaration of function 'csum_ipv6_magic'
>>
>> Introduced by commit 08b202b6726459626c73ecfa08fcdc8c3efc76c2 ("bridge
>> br_multicast: IPv6 MLD support") from the net tree.
>>
>> csum_ipv6_magic is declared in net/ip6_checksum.h ...
>
> Bummer, powerpc is one of the few platforms that doesn't get the header
> file implicitly so you always trip over this whereas we never see it in
> x86 and sparc64 builds :-)
>
> I'll fix this, thanks!
I just committed the following for this:
bridge: Fix build of ipv6 multicast code.
Based upon a report from Stephen Rothwell:
--------------------
net/bridge/br_multicast.c: In function 'br_ip6_multicast_alloc_query':
net/bridge/br_multicast.c:469: error: implicit declaration of function 'csum_ipv6_magic'
Introduced by commit 08b202b6726459626c73ecfa08fcdc8c3efc76c2 ("bridge
br_multicast: IPv6 MLD support") from the net tree.
csum_ipv6_magic is declared in net/ip6_checksum.h ...
--------------------
Signed-off-by: David S. Miller <davem@davemloft.net>
---
net/bridge/br_multicast.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index e481dbd..2048ef0 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -28,6 +28,7 @@
#include <net/ipv6.h>
#include <net/mld.h>
#include <net/addrconf.h>
+#include <net/ip6_checksum.h>
#endif
#include "br_private.h"
--
1.7.0.4
^ permalink raw reply related
* Re: [PATCH] bnx2x: add support for receive hashing
From: Eric Dumazet @ 2010-04-27 17:13 UTC (permalink / raw)
To: David Miller; +Cc: bmb, therbert, netdev, rick.jones2
In-Reply-To: <20100427.095108.68126984.davem@davemloft.net>
Le mardi 27 avril 2010 à 09:51 -0700, David Miller a écrit :
> From: Brian Bloniarz <bmb@athenacr.com>
> Date: Tue, 27 Apr 2010 09:37:11 -0400
>
> > David Miller wrote:
> >> How damn hard is it to add two 16-bit ports to the hash regardless of
> >> protocol?
> >>
> > Come to think of it, for UDP the hash must ignore
> > the srcport and srcaddr, because a single bound
> > socket is going to wildcard both those fields.
>
> For load distribution we don't care if the local socket is wildcard
> bounded on source.
>
> It's going to be fully specified in the packet, and that's enough.
>
> Sure, for full RFS some amends might be necessary in this area, but
> for RPS and adapter based hw steering, using all of the ports is
> entirely sufficient.
Well well well...
I was doing some pktgen tests, with :
pgset "src_min 192.168.0.10"
pgset "src_max 192.168.0.110"
pgset "dst_min 192.168.0.2"
pgset "dst_max 192.168.0.2"
pgset "udp_dst_min 4000"
pgset "udp_dst_max 4000"
So I simulate 100 remote IPS bombarding a single port on target machine.
pktgen injects about 930.000 pps
sofirq of my target received on cpu0, and RPS spread packets to 7 other
cpus.
And my receiver is stuck (he can read about 50 pps !!!)
As soon as I disable rps, my receiver can catch 850.000 pps
RPS OFF: perf top of cpu 0
------------------------------------------------------------------------------------------------------------------------------
PerfTop: 1001 irqs/sec kernel:100.0% [1000Hz cycles], (all, cpu: 0)
------------------------------------------------------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ ______________________ _______
385.00 10.2% __udp4_lib_lookup vmlinux
322.00 8.5% ip_route_input vmlinux
312.00 8.3% sock_queue_rcv_skb vmlinux
262.00 6.9% do_raw_spin_lock vmlinux
251.00 6.6% __alloc_skb vmlinux
239.00 6.3% sock_put vmlinux
207.00 5.5% eth_type_trans vmlinux
202.00 5.4% __slab_alloc vmlinux
159.00 4.2% __kmalloc_track_caller vmlinux
149.00 3.9% __sk_mem_schedule vmlinux
125.00 3.3% kmem_cache_alloc vmlinux
116.00 3.1% ipt_do_table vmlinux
115.00 3.0% do_raw_read_lock vmlinux
71.00 1.9% tg3_poll_work vmlinux
65.00 1.7% __netdev_alloc_skb vmlinux
64.00 1.7% skb_pull vmlinux
58.00 1.5% ip_rcv vmlinux
58.00 1.5% __slab_free vmlinux
53.00 1.4% udp_queue_rcv_skb vmlinux
47.00 1.2% nf_iterate vmlinux
44.00 1.2% __netif_receive_skb vmlinux
29.00 0.8% sock_def_readable vmlinux
28.00 0.7% do_raw_spin_unlock vmlinux
26.00 0.7% kfree vmlinux
25.00 0.7% __udp4_lib_rcv vmlinux
24.00 0.6% ip_rcv_finish vmlinux
24.00 0.6% __list_add vmlinux
RPS, on, a perf top of a slave CPU :
------------------------------------------------------------------------------------------------------------------------------
PerfTop: 1000 irqs/sec kernel:100.0% [1000Hz cycles], (all, cpu: 1)
------------------------------------------------------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ ___________________ _______
2411.00 62.0% do_raw_spin_lock vmlinux
690.00 17.7% delay_tsc vmlinux
234.00 6.0% __udp4_lib_lookup vmlinux
174.00 4.5% sock_put vmlinux
72.00 1.9% ip_rcv vmlinux
51.00 1.3% __netif_receive_skb vmlinux
43.00 1.1% do_raw_spin_unlock vmlinux
39.00 1.0% __delay vmlinux
38.00 1.0% sock_queue_rcv_skb vmlinux
36.00 0.9% udp_queue_rcv_skb vmlinux
31.00 0.8% ip_route_input vmlinux
15.00 0.4% __slab_free vmlinux
12.00 0.3% ipt_do_table vmlinux
11.00 0.3% skb_release_data vmlinux
7.00 0.2% kfree vmlinux
5.00 0.1% nf_iterate vmlinux
So we have a BIG problem :
All cpus are fighting to get the socket lock,
and very litle progress is done.
Note this problem has nothing to do with RPS, we could have
it with multiqueue as well.
Oh well...
^ permalink raw reply
* [PATCH net-next] bridge: multicast router list manipulation
From: Stephen Hemminger @ 2010-04-27 17:13 UTC (permalink / raw)
To: Herbert Xu; +Cc: David S. Miller, netdev
In-Reply-To: <E1NlbuX-00021m-Bq@gondolin.me.apana.org.au>
I prefer that the hlist be only accessed through the hlist macro
objects. Explicit twiddling of links (especially with RCU) exposes
the code to future bugs.
Compile tested only.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
--- a/net/bridge/br_multicast.c 2010-04-27 09:54:02.180531924 -0700
+++ b/net/bridge/br_multicast.c 2010-04-27 10:07:19.188688664 -0700
@@ -1041,21 +1041,21 @@ static int br_ip6_multicast_mld2_report(
static void br_multicast_add_router(struct net_bridge *br,
struct net_bridge_port *port)
{
- struct hlist_node *p;
- struct hlist_node **h;
+ struct net_bridge_port *p;
+ struct hlist_node *n, *last = NULL;
- for (h = &br->router_list.first;
- (p = *h) &&
- (unsigned long)container_of(p, struct net_bridge_port, rlist) >
- (unsigned long)port;
- h = &p->next)
- ;
-
- port->rlist.pprev = h;
- port->rlist.next = p;
- rcu_assign_pointer(*h, &port->rlist);
- if (p)
- p->pprev = &port->rlist.next;
+ hlist_for_each_entry(p, n, &br->router_list, rlist) {
+ if ((unsigned long) port >= (unsigned long) p) {
+ hlist_add_before_rcu(n, &port->rlist);
+ return;
+ }
+ last = n;
+ }
+
+ if (last)
+ hlist_add_after_rcu(last, &port->rlist);
+ else
+ hlist_add_head_rcu(&port->rlist, &br->router_list);
}
static void br_multicast_mark_router(struct net_bridge *br,
^ permalink raw reply
* [PATCH net-next] bridge: use is_multicast_ether_addr
From: Stephen Hemminger @ 2010-04-27 17:13 UTC (permalink / raw)
To: Herbert Xu, David S. Miller; +Cc: netdev
In-Reply-To: <E1NlbuW-00021d-8E@gondolin.me.apana.org.au>
Use existing inline function.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
--- a/net/bridge/br_device.c 2010-04-27 09:49:30.059258391 -0700
+++ b/net/bridge/br_device.c 2010-04-27 09:50:21.439878721 -0700
@@ -36,7 +36,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *
skb_reset_mac_header(skb);
skb_pull(skb, ETH_HLEN);
- if (dest[0] & 1) {
+ if (is_multicast_ether_addr(dest)) {
if (br_multicast_rcv(br, NULL, skb))
goto out;
^ permalink raw reply
* Re: [PATCH] bnx2x: add support for receive hashing
From: David Miller @ 2010-04-27 17:06 UTC (permalink / raw)
To: bmb; +Cc: therbert, eric.dumazet, netdev, rick.jones2
In-Reply-To: <4BD71890.2050606@athenacr.com>
From: Brian Bloniarz <bmb@athenacr.com>
Date: Tue, 27 Apr 2010 13:02:08 -0400
> Maybe I'm misunderstanding... won't it distribute the
> packet handling load to multiple cores, but then all
> those cores will contend trying to deliver those packets
> to the single socket?
>
> I was assuming that this'd be a net loss over just doing
> all the protocol handling on a single core. I haven't
> done any benchmarks yet.
Whether it's a new loss depends upon the application.
Also, on the non-application side f.e. a router or firewall, this is
exactly the behavior you want.
^ permalink raw reply
* Re: [PATCH 0/4] net: ipmr netlink interface for route dumping
From: David Miller @ 2010-04-27 17:03 UTC (permalink / raw)
To: Patrick, McHardy, kaber; +Cc: netdev
In-Reply-To: <1272374785-3858-1-git-send-email-kaber@trash.net>
Whoa, there are three of you now?!?!?!
:-)
^ permalink raw reply
* Re: [PATCH] bnx2x: add support for receive hashing
From: Brian Bloniarz @ 2010-04-27 17:02 UTC (permalink / raw)
To: David Miller; +Cc: therbert, eric.dumazet, netdev, rick.jones2
In-Reply-To: <20100427.095108.68126984.davem@davemloft.net>
David Miller wrote:
> From: Brian Bloniarz <bmb@athenacr.com>
> Date: Tue, 27 Apr 2010 09:37:11 -0400
>
>> David Miller wrote:
>>> How damn hard is it to add two 16-bit ports to the hash regardless of
>>> protocol?
>>>
>> Come to think of it, for UDP the hash must ignore
>> the srcport and srcaddr, because a single bound
>> socket is going to wildcard both those fields.
>
> For load distribution we don't care if the local socket is wildcard
> bounded on source.
>
> It's going to be fully specified in the packet, and that's enough.
Maybe I'm misunderstanding... won't it distribute the
packet handling load to multiple cores, but then all
those cores will contend trying to deliver those packets
to the single socket?
I was assuming that this'd be a net loss over just doing
all the protocol handling on a single core. I haven't
done any benchmarks yet.
^ permalink raw reply
* Re: [patch v2] sctp: cleanup: remove duplicate assignment
From: David Miller @ 2010-04-27 16:58 UTC (permalink / raw)
To: vladislav.yasevich
Cc: error27, sri, yjwei, cdischino, linux-sctp, netdev,
kernel-janitors
In-Reply-To: <4BD6F582.9030804@hp.com>
From: Vlad Yasevich <vladislav.yasevich@hp.com>
Date: Tue, 27 Apr 2010 10:32:34 -0400
>
>
> Dan Carpenter wrote:
>> This assignment isn't needed because we did it earlier already.
>>
>> Also another reason to delete the assignment is because it triggers a
>> Smatch warning about checking for NULL pointers after a dereference.
>>
>> Reported-by: Vlad Yasevich <vladislav.yasevich@hp.com>
>> Signed-off-by: Dan Carpenter <error27@gmail.com>
>
> Thanks. I'll take this one.
And when will I get this from you? :-)
^ permalink raw reply
* Re: [net-2.6 PATCH] ixgbevf: Fix link speed display
From: David Miller @ 2010-04-27 16:57 UTC (permalink / raw)
To: jeffrey.t.kirsher; +Cc: netdev, gospo, gregory.v.rose
In-Reply-To: <20100427103834.23338.22213.stgit@localhost.localdomain>
Not appropriate this late in the -RC series.
I don't see this in the regression list, and reported link speed being
incorrect is not a catastropic crash and/or failure.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox