Netdev List


* Re: [PATCH net-next-2.6] net: fix a lockdep rcu warning in __sk_dst_set()
From: David Miller @ 2010-04-27 19:42 UTC (permalink / raw)
  To: paulmck; +Cc: eric.dumazet, netdev
In-Reply-To: <20100427161716.GB2424@linux.vnet.ibm.com>

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Date: Tue, 27 Apr 2010 09:17:16 -0700

> On Tue, Apr 27, 2010 at 08:40:43AM +0200, Eric Dumazet wrote:
>> __sk_dst_set() might be called while no state can be integrated in a
>> rcu_dereference_check() condition.
>> 
>> So use rcu_dereference_raw() to shut up lockdep warnings (if
>> CONFIG_PROVE_RCU is set)
> 
> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

I've applied this to net-next-2.6, thanks!


* Re: [net-next-2.6 PATCH 1/2] Add ndo_set_vf_port_profile
From: Arnd Bergmann @ 2010-04-27 19:38 UTC (permalink / raw)
  To: Anirban Chakraborty
  Cc: Scott Feldman, Rose, Gregory V, David Miller,
	netdev@vger.kernel.org, chrisw@redhat.com, Williams, Mitch A
In-Reply-To: <8966E338-1C9C-43D9-B6A3-A44349E7EE18@qlogic.com>

On Tuesday 27 April 2010 19:33:04 Anirban Chakraborty wrote:
> On Apr 27, 2010, at 5:35 AM, Arnd Bergmann wrote:
> > Anything that ties port profiles to VFs seems fundamentally flawed AFAICT,
> > at least when we want to extend this to adapters that don't do it in firmware.
> 
> Correct me if I am wrong. Shouldn't the port profile be tied to the physical NICs which are essentially
> PCI functions (be it PF or VF)? I'd think that a port profile would have configuration settings for all the
> physical NICs (PF/VF) of a specific physical port of the adapter. I liked the idea of querying the device
> for number of VFs as it will cover both SR-IOV and non SR-IOV PCI functions.

Yes, the port profile association is tied to whoever owns the link to the switch.
That can be a regular NIC, an SR-IOV PF, an ethernet bonding device or an S-component
implementing provider S-VLANs on top of any of these.

Usually it will be the same as a physical link, but in case of bonding it is two
physical links while in case of S-VLAN, you have multiple instances that each
have their own set of port profile association. If S-VLAN is implemented by
the NIC, that may be a VF.

Querying a PF for the number of VFs attached to it is a useful thing, but this
is independent of port profiles. Consider this (artificially complex) setup:

- eth0 is the PF of an SR-IOV NIC
- eth1 is a regular single-channel NIC
- vf0 is a VF of eth0, used by a guest using PCI passthrough mode on S-VLAN 2
- vf1 is a VF of eth0 owned by the host on S-VLAN 3
- vf1.23 is a VLAN port for VLAN 23 in S-VLAN 3
- br0 is a bridge connected to vf1
- br23 is a bridge VLAN device for br0
- vf2 is a VF of eth0 owned by the host on S-VLAN 4
- eth1.5 is a software vlan device for S-VLAN 4
- bond0 combines eth1.5 and vf2
- bond0.24 is a VLAN port for VLAN24 on bond0
- tap0 is a guest connected to br0 in trunk mode
- tap1 is a guest connected to br23 in access mode
- macvtap0 is a VEPA mode guest on bond0
- macvtap1 is a private mode guest on bond0.24

This means you have a total of five guests running, on vf0, tap0, tap1,
macvtap0 and macvtap1. Querying the number of VFs on eth0 will return '3',
for vf0, vf1 and vf2. What you are interested in however is which guests are
associated. Querying every single interface in the system will tell you

eth0: one guest (vf0)
vf1: two guests (tap0 and tap1)
bond0: two guests (macvtap0 and macvtap1)
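
The per-interface result above can be reproduced with a small model of the
stacking. This is only a sketch in Python, not kernel or netlink code: the
names STACKED_ON, LINK_OWNERS and link_owner() are made up for illustration,
and the set of link owners is copied from the result listed in this mail
rather than derived from any real API.

```python
from collections import defaultdict

# Device stacking from the example setup: device -> device it sits on.
# Only the edges needed to resolve the five guests are listed.
STACKED_ON = {
    "vf0": "eth0",          # VF passed through to a guest
    "br0": "vf1",
    "br23": "br0",
    "bond0.24": "bond0",
    "tap0": "br0",
    "tap1": "br23",
    "macvtap0": "bond0",
    "macvtap1": "bond0.24",
}

# Devices that own a link to the switch and therefore hold the port
# profile associations (assumption taken from the listing in the mail).
LINK_OWNERS = {"eth0", "vf1", "bond0"}

GUESTS = ["vf0", "tap0", "tap1", "macvtap0", "macvtap1"]


def link_owner(dev):
    """Walk down the device stack until a link owner is reached."""
    while dev not in LINK_OWNERS:
        dev = STACKED_ON[dev]
    return dev


def guests_per_owner():
    """Group the guests by the device that owns their link to the switch."""
    result = defaultdict(list)
    for guest in GUESTS:
        result[link_owner(guest)].append(guest)
    return dict(result)


for owner, guests in guests_per_owner().items():
    print(f"{owner}: {len(guests)} guest(s) ({', '.join(guests)})")
```

The point of the model is that the grouping follows the stacking of
interfaces, not the VF count of eth0.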

	Arnd


* Re: [PATCH] bnx2x: add support for receive hashing
From: Eric Dumazet @ 2010-04-27 19:30 UTC (permalink / raw)
  To: eilong
  Cc: Rick Jones, David Miller, therbert@google.com,
	netdev@vger.kernel.org
In-Reply-To: <1272393060.30392.2.camel@lb-tlvb-eilong.il.broadcom.com>

On Tuesday, 27 April 2010 at 21:31 +0300, Eilon Greenstein wrote:

> Though the thread is going in a different direction now, I just wanted
> to clarify two things:
> - yes, the 57710 and 57711 only handle the IP (src+dst) for UDP toeplitz
> hash. We all agree that it is much better to address the UDP ports as
> well, but I think Rick Jones explained the process very well - thank you
> Rick. Just to add one more (lame) excuse: the HW was designed before new
> NAPI was introduced and it complies with the requirements from Redmond
> - the next generation (57712) which we already sample does (finally)
> support it. We are working on a patch series to enhance the bnx2x to
> support this device now.
> 

Thanks Eilon !




* Re: [PATCH 0/4] net: ipmr netlink interface for route dumping
From: Patrick McHardy @ 2010-04-27 18:41 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20100427.100345.241441437.davem@davemloft.net>

David Miller wrote:
> Whoa, there are three of you now?!?!?!
> 
> :-)
> 

That would be nice, I'd have my two clones do all the work :)

Not sure what happened, some mishandling of git send-email
apparently :)


* Re: vlan performance issue on outgoing traffic
From: Brandeburg, Jesse @ 2010-04-27 18:32 UTC (permalink / raw)
  To: R. Weinedel; +Cc: netdev@vger.kernel.org
In-Reply-To: <4BD4C037.2070003@yahoo.de>

On Sun, 25 Apr 2010, R. Weinedel wrote:

> Hello,
> 
> I have a performance issue with VLAN interfaces on a Debian Lenny
> server. The problem occurs only on outgoing traffic from the VLAN
> interfaces. They use only half of the available bandwidth (490 Mbit/s
> measured with iperf). Incoming traffic is handled at 950 Mbit/s and is
> fine. The issue remains even with no switch and a direct connection
> between PC and server on the same NIC. Removing the VLANs from eth0
> (on the server) and configuring a single network on eth0 results in
> full speed (950 Mbit/s) in both directions. Even another NIC (onboard
> nvidia3, forcedeth module) couldn't solve it. I tested only within the
> same network segment (VLAN), without the need for IP forwarding or NAT,
> but the issue occurs on all my VLANs.
> 
> All values were taken with iperf between the server and an Ubuntu 9.04
> workstation (and vice versa). I have checked (with ethtool and the stats
> from the switch) that all connections were 1000BASE-T/full duplex. It
> looks like some kind of traffic shaping to me, but I don't use tc, QoS,
> ToS or any other priority handling.
 
> The network is quite simple: one server, one switch and then the
> workstations. No need for cascading or using (R)STP.
> 
> Here some information about my network:
> 
> Switch: Netgear GSM7224 Layer 2 managed switch, FW 6.2.0.14
> (independent, issue remains on direct connection).
> 
> Server: Debian Lenny, kernel 2.6.26-2,

This version of the kernel doesn't support offloads for vlan adapters, 
which is probably causing most of your decrease in throughput due to 
either exhausting socket buffer size, or because of the round trip time 
being so much more relevant when not sending large bursts using TSO.  
Sometimes the flood of ACK packets causes higher CPU usage, which could
also reduce your throughput.

The newer kernels will have a major impact on your setup due to a patch 
that enabled pass through of hardware offloads to the vlan device's 
offload advertisement.
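
To give a feel for why losing TSO on the vlan device matters, here is a rough
back-of-the-envelope calculation in Python. It is only a sketch: the MSS and
the maximum TSO send size are assumptions, not values measured on the
reporter's system, and real limits depend on the driver and kernel.

```python
# Rough arithmetic only; every constant below is an assumption.
THROUGHPUT_BPS = 950e6   # rate the reporter sees on eth0 without VLANs
MSS = 1448               # assumed TCP payload bytes per 1500-byte frame
TSO_MAX_BYTES = 65536    # assumed upper bound for one TSO send


def sends_per_second(payload_bytes_per_send):
    """How many times per second the stack must hand data to the driver."""
    return THROUGHPUT_BPS / 8 / payload_bytes_per_send


# Number of full-sized segments the NIC can cut out of one large send.
segments_per_tso_send = TSO_MAX_BYTES // MSS

sends_without_tso = sends_per_second(MSS)
sends_with_tso = sends_per_second(segments_per_tso_send * MSS)

print(f"segments per TSO send: {segments_per_tso_send}")
print(f"sends/s without TSO:   {sends_without_tso:,.0f}")
print(f"sends/s with TSO:      {sends_with_tso:,.0f}")
```

Under these assumptions the stack does the per-packet transmit work 45 times
less often with TSO, which is the kind of difference that shows up as lost
throughput on a slow PCI adapter.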

The commit id of the patch is 5fb13570543f4ae022996c9d7c0c099c8abf22dd, 
you can view it at:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5fb13570543f4ae022996c9d7c0c099c8abf22dd

 
> NIC: Intel Corporation 82541PI Gigabit Ethernet Con. (e1000 module).

This PCI adapter is bandwidth limited on the PCI bus, and so will be even 
more sensitive to offload on (TSO) vs offload off.

> # ethtool eth0
> Settings for eth0:
>         Supported ports: [ TP ]
>         Supported link modes:   10baseT/Half 10baseT/Full
>                                 100baseT/Half 100baseT/Full
>                                 1000baseT/Full
>         Supports auto-negotiation: Yes
>         Advertised link modes:  1000baseT/Full
>         Advertised auto-negotiation: Yes
>         Speed: 1000Mb/s
>         Duplex: Full
>         Port: Twisted Pair
>         PHYAD: 0
>         Transceiver: internal
>         Auto-negotiation: on
>         Supports Wake-on: umbg
>         Wake-on: g
>         Current message level: 0x00000007 (7)
>         Link detected: yes
> 
> 8021q:
> filename:       /lib/modules/2.6.26-2-686/kernel/net/8021q/8021q.ko
> version:        1.8
> license:        GPL
> alias:          rtnl-link-vlan
> srcversion:     A61E1168F65EE335A91D4E1
> depends:
> vermagic:       2.6.26-2-686 SMP mod_unload modversions 686
> 
> VLAN: #/proc/net/vlan/config
> VLAN Dev name    | VLAN ID
> Name-Type: VLAN_NAME_TYPE_RAW_PLUS_VID_NO_PAD
> eth0.5         | 5  | eth0
> eth0.101       | 101  | eth0
> eth0.90        | 90  | eth0
> 
> IFCONFIG:
> eth0      Link encap:Ethernet  Hardware Adresse 00:0e:0c:bc:43:43
>           inet6-Adresse: fe80::20e:cff:febc:4343/64 Gültigkeitsbereich:Verbindung
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metrik:1
>           RX packets:28140829 errors:0 dropped:218 overruns:0 frame:0
>           TX packets:44994420 errors:0 dropped:0 overruns:0 carrier:0
>           Kollisionen:0 Sendewarteschlangenlänge:1000
>           RX bytes:3472864138 (3.2 GiB)  TX bytes:3908682627 (3.6 GiB)
> 
> eth0.5    Link encap:Ethernet  Hardware Adresse 00:0e:0c:bc:43:43
>           inet Adresse:XXX.YYY.5.1  Bcast:XXX.YYY.5.255  Maske:255.255.255.0
>           inet6-Adresse: fe80::20e:cff:febc:4343/64 Gültigkeitsbereich:Verbindung
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metrik:1
>           RX packets:77807 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:69699 errors:0 dropped:0 overruns:0 carrier:0
>           Kollisionen:0 Sendewarteschlangenlänge:0
>           RX bytes:57578233 (54.9 MiB)  TX bytes:7782844 (7.4 MiB)
> 
> eth0.90   Link encap:Ethernet  Hardware Adresse 00:0e:0c:bc:43:43
>           inet Adresse:XXX.YYY.90.1  Bcast:XXX.YYY.90.255  Maske:255.255.255.0
>           inet6-Adresse: fe80::20e:cff:febc:4343/64 Gültigkeitsbereich:Verbindung
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metrik:1
>           RX packets:457850 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:913988 errors:0 dropped:0 overruns:0 carrier:0
>           Kollisionen:0 Sendewarteschlangenlänge:0
>           RX bytes:23824841 (22.7 MiB)  TX bytes:1311485281 (1.2 GiB)
> 
> eth0.101  Link encap:Ethernet  Hardware Adresse 00:0e:0c:bc:43:43
>           inet Adresse:XXX.YYY.101.1  Bcast:XXX.YYY.101.255  Maske:255.255.255.0
>           inet6-Adresse: fe80::20e:cff:febc:4343/64 Gültigkeitsbereich:Verbindung
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metrik:1
>           RX packets:24856818 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:41608593 errors:0 dropped:0 overruns:0 carrier:0
>           Kollisionen:0 Sendewarteschlangenlänge:0
>           RX bytes:423116676 (403.5 MiB)  TX bytes:3855703636 (3.5 GiB)
> 
> ROUTE: #route -n
> Ziel            Router          Genmask         Flags Metric Ref    Use Iface
> XXX.YYY.101.0   0.0.0.0         255.255.255.0   U     0      0        0 eth0.101
> XXX.YYY.5.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0.5
> XXX.YYY.90.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0.90
> 0.0.0.0         192.168.5.4     0.0.0.0         UG    0      0        0 eth0.5
> 
> Can someone give me a hint on where I should continue searching for a
> solution?
> 
> Many thanks !
> Regards
> Ralf Weinedel
> Falkensee/Germany
> 
> 


* Re: [PATCH] bnx2x: add support for receive hashing
From: Eilon Greenstein @ 2010-04-27 18:31 UTC (permalink / raw)
  To: Rick Jones, David Miller, therbert@google.com,
	eric.dumazet@gmail.com
  Cc: netdev@vger.kernel.org
In-Reply-To: <4BD601C3.5030108@hp.com>

On Mon, 2010-04-26 at 14:12 -0700, Rick Jones wrote:
> David Miller wrote:
> > From: Rick Jones <rick.jones2@hp.com>
> > Date: Mon, 26 Apr 2010 13:48:22 -0700
> > 
> >>Do not confuse explanation with endorsement.
> > 
> > Ok, fair enough.
> > 
> > But I don't see even the "other perspective" argument being even
> > valid.  Big shops still use UDP and it has to scale.
> 
> Preface - I too think it is massively stupid to ignore anything but TCP/IPv4, 
> and unwise to ignore IPv6 and so on, but there is a very real reason why one of 
> my email signatures reads:
> 
> "The road to hell is paved with business decisions"
> 
> > Or have they made multicast magically start working with TCP so
> > they can use it to do trades on the NASDAQ?
> 
> No. How many NIC chips can NASDAQ be expected to move? 0.1%? or even 1% of the 
> NIC chip market?
> 
> How many more NIC chips are in places where someone says "You sold me on 
> iSCSI/FCoE/whatnot, why can't I get 'link-rate'  to/from iSCSI storage/whatnot?!"
> 
> The NIC designer is there with his finance guys breathing down his neck shouting 
> "ROI Uber Alles!" and "Your budget is only this many monetary units!"  The 
> system designers at the system vendors are hearing the same things from their 
> own finance guys, have certain schedules, which then has them going to the NIC 
> firms, who want to sell chips to the system guys "You have to be ready to ship 
> by this date and your chip has to sell for no more than this."
> 
> Lather, rinse, repeat a few times and you get compromises on top of compromises.
> 
> Sometimes I think it is a wonder any of it actually works at all...
> 
> rick jones

Though the thread is going in a different direction now, I just wanted
to clarify two things:
- yes, the 57710 and 57711 only handle the IP (src+dst) for UDP toeplitz
hash. We all agree that it is much better to address the UDP ports as
well, but I think Rick Jones explained the process very well - thank you
Rick. Just to add one more (lame) excuse: the HW was designed before new
NAPI was introduced and it complies with the requirements from Redmond
- the next generation (57712) which we already sample does (finally)
support it. We are working on a patch series to enhance the bnx2x to
support this device now.
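
For readers following the IP-only versus IP+ports point, here is a small
Python sketch of a Toeplitz hash. It is an illustration, not the bnx2x or
firmware implementation: the key is just an example 40-byte value, the helper
names are made up, and the input layout (source address, destination address,
source port, destination port) is the one assumed for this sketch.

```python
import socket
import struct

# Example 40-byte key; any key of this length works for the illustration.
RSS_KEY = bytes.fromhex(
    "6d5a56da255b0ec24167253d43a38fb0"
    "d0ca2bcbae7b30b477cb2da38030f20c"
    "6a42b73bbeac01fa"
)


def toeplitz_hash(key, data):
    """XOR together the 32-bit key window at every set bit of the input."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i, byte in enumerate(data):
        for bit in range(8):
            if byte & (0x80 >> bit):
                # Window of 32 key bits starting at this input bit position.
                shift = key_bits - 32 - (i * 8 + bit)
                result ^= (key_int >> shift) & 0xFFFFFFFF
    return result


def hash_ip_only(src, dst):
    """Hash over source and destination address only (2-tuple)."""
    data = socket.inet_aton(src) + socket.inet_aton(dst)
    return toeplitz_hash(RSS_KEY, data)


def hash_with_ports(src, dst, sport, dport):
    """Hash over addresses plus UDP ports (4-tuple)."""
    data = socket.inet_aton(src) + socket.inet_aton(dst)
    data += struct.pack("!HH", sport, dport)
    return toeplitz_hash(RSS_KEY, data)


# Two UDP flows between the same hosts, differing only in source port.
print(hex(hash_ip_only("192.168.0.10", "192.168.0.2")))
print(hex(hash_with_ports("192.168.0.10", "192.168.0.2", 1000, 4000)))
print(hex(hash_with_ports("192.168.0.10", "192.168.0.2", 1001, 4000)))
```

With the 2-tuple hash every UDP flow between the same two hosts gets the same
value and so the same queue; with the ports included the two flows above hash
differently.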

Eilon





* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Miles Lane @ 2010-04-27 17:58 UTC (permalink / raw)
  To: paulmck
  Cc: Eric W. Biederman, Vivek Goyal, Eric Paris, Lai Jiangshan,
	Ingo Molnar, Peter Zijlstra, LKML, nauman, eric.dumazet, netdev,
	Jens Axboe, Gui Jianfeng, Li Zefan, Johannes Berg
In-Reply-To: <20100427162201.GA5826@linux.vnet.ibm.com>

On Tue, Apr 27, 2010 at 12:22 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Mon, Apr 26, 2010 at 09:27:44PM -0700, Paul E. McKenney wrote:
>> On Mon, Apr 26, 2010 at 11:35:10AM -0700, Eric W. Biederman wrote:
>> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>> >
>> > > Eric Dumazet traced these down to a commit from Eric Biederman.
>> > >
>> > > If I don't hear from Eric Biederman in a few days, I will attempt a
>> > > patch, but it would be more likely to be correct coming from someone
>> > > with a better understanding of the code.  ;-)
>> >
>> > I already replied.
>> >
>> > http://lkml.org/lkml/2010/4/21/420
>>
>> You did indeed!!!  This experience is giving me an even better appreciation
>> of the maintainers' ability to keep all their patches straight!
>>
>> I will put together something based on your suggestion.
>
> How about the following?
>
>                                                        Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit 85fa42bd568ab99c375f018761ae6345249942cd
> Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Date:   Mon Apr 26 21:40:05 2010 -0700
>
>    net: suppress RCU lockdep false positive in twsk_net()
>
>    Calls to twsk_net() are in some cases protected by reference counting
>    as an alternative to RCU protection.  Cases covered by reference counts
>    include __inet_twsk_kill(), inet_twsk_free(), inet_twdr_do_twkill_work(),
>    inet_twdr_twcal_tick(), and tcp_timewait_state_process().  RCU is used
>    by inet_twsk_purge().  Locking is used by established_get_first()
>    and established_get_next().  Finally, __inet_twsk_hashdance() is an
>    initialization case.
>
>    It appears to be non-trivial to locate the appropriate locks and
>    reference counts from within twsk_net(), so used rcu_dereference_raw().
>
>    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>
> diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
> index 79f67ea..a066fdd 100644
> --- a/include/net/inet_timewait_sock.h
> +++ b/include/net/inet_timewait_sock.h
> @@ -224,7 +224,9 @@ static inline
>  struct net *twsk_net(const struct inet_timewait_sock *twsk)
>  {
>  #ifdef CONFIG_NET_NS
> -       return rcu_dereference(twsk->tw_net);
> +       return rcu_dereference_raw(twsk->tw_net); /* protected by locking, */
> +                                                 /* reference counting, */
> +                                                 /* initialization, or RCU. */
>  #else
>        return &init_net;
>  #endif
>

Worked for me.  Thanks!

           Miles


* Re: [PATCH] bnx2x: add support for receive hashing
From: Eric Dumazet @ 2010-04-27 17:37 UTC (permalink / raw)
  To: David Miller; +Cc: bmb, therbert, netdev, rick.jones2
In-Reply-To: <20100427.102038.57469310.davem@davemloft.net>

On Tuesday, 27 April 2010 at 10:20 -0700, David Miller wrote:

> 
> Indeed, a huge issue, in that we haven't converted the UDP hash over
> to RCU yet :-)
> 

I am not sure what you mean, UDP hash _is_ RCU converted ;)

> But because of the transient bind nature of UDP there are still a bunch
> of cases that won't even cure.
> --

We might use the ticket spinlock paradigm to let writers go in parallel
and let the user take the socket lock.

Instead of having the bh_lock_sock() to protect receive_queue *and*
backlog, writers get a unique slot in a table, that 'user' can handle
later.

Or serialize writers (before they try to bh_lock_sock()) with a
dedicated lock, so that the user has a 50% chance of getting the sock
lock, contending with at most one writer.
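
The second idea can be modelled in userspace roughly as follows. This is a
Python sketch with threading locks standing in for kernel spinlocks; the
class and the names writer_gate, writer_enqueue() and user_dequeue_all() are
hypothetical and exist only for this illustration.

```python
import threading


class Sock:
    """Toy socket: one lock for the queue plus a gate that serializes writers."""

    def __init__(self):
        self.sock_lock = threading.Lock()    # stand-in for the socket lock
        self.writer_gate = threading.Lock()  # hypothetical writers-only lock
        self.queue = []

    def writer_enqueue(self, pkt):
        # Writers queue up on the gate first, so at most one of them is
        # ever competing with the user for sock_lock.
        with self.writer_gate:
            with self.sock_lock:
                self.queue.append(pkt)

    def user_dequeue_all(self):
        # The user never touches the gate and only takes sock_lock.
        with self.sock_lock:
            drained, self.queue = self.queue, []
            return drained


def run_writers(sock, writers=8, packets=1000):
    """Start several writer threads and wait until all have finished."""
    def work(cpu):
        for seq in range(packets):
            sock.writer_enqueue((cpu, seq))

    threads = [threading.Thread(target=work, args=(n,)) for n in range(writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The model only shows the lock ordering; it says nothing about the cost of the
extra lock on a real machine.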


* Re: [PATCH] bnx2x: add support for receive hashing
From: Eric Dumazet @ 2010-04-27 17:36 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, bmb, netdev, rick.jones2
In-Reply-To: <g2k65634d661004271031r2eb2000bxc30013009509c410@mail.gmail.com>

On Tuesday, 27 April 2010 at 10:31 -0700, Tom Herbert wrote:

> This is the problem that we are addressing with so_reuseport!

How are standard applications protected against a DDoS?





* Re: [net-next-2.6 PATCH 1/2] Add ndo_set_vf_port_profile
From: Anirban Chakraborty @ 2010-04-27 17:33 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Scott Feldman, Rose, Gregory V, David Miller,
	netdev@vger.kernel.org, chrisw@redhat.com, Williams, Mitch A
In-Reply-To: <201004271435.25480.arnd@arndb.de>


On Apr 27, 2010, at 5:35 AM, Arnd Bergmann wrote:

> On Tuesday 27 April 2010, Scott Feldman wrote:
>>> Yes, I believe that's there today:
>>> 
>>>    NLA_PUT_U32(skb, IFLA_NUM_VF, dev_num_vf(dev->dev.parent));
>>> 
>>> The number of VFs is returned in RTM_GETLINK.  But, it's only returned if:
>>> 
>>>    if (dev->netdev_ops->ndo_get_vf_config && dev->dev.parent)
>>> 
>>> For my proposal, I'll need to return IFLA_NUM_VF unconditionally so callers
>>> can get num VFs.
>> 
>> Hmmm...seems IFLA_NUM_VF assumes a PCI device supporting SR-IOV when it uses
>> dev_num_vf().  I think a better option would have been to query the device
>> for the number of VFs, without assuming SR-IOV or even PCI.
>> 
>> I see a ndo_get_num_vf() coming...
> 
> Shouldn't the number of registered port profiles be totally independent of
> the number of virtual functions?
> 
> Any of the VFs could multiplex multiple guests using macvlan, which means you
> need to register each guest separately, not each VF.
> 
> Anything that ties port profiles to VFs seems fundamentally flawed AFAICT,
> at least when we want to extend this to adapters that don't do it in firmware.

Correct me if I am wrong. Shouldn't the port profile be tied to the physical NICs which are essentially
PCI functions (be it PF or VF)? I'd think that a port profile would have configuration settings for all the
physical NICs (PF/VF) of a specific physical port of the adapter. I liked the idea of querying the device
for number of VFs as it will cover both SR-IOV and non SR-IOV PCI functions.

thanks,
-Anirban


* Re: [patch v2] sctp: cleanup: remove duplicate assignment
From: Vlad Yasevich @ 2010-04-27 17:32 UTC (permalink / raw)
  To: David Miller
  Cc: error27, sri, yjwei, cdischino, linux-sctp, netdev,
	kernel-janitors
In-Reply-To: <20100427.095823.98890165.davem@davemloft.net>



David Miller wrote:
> From: Vlad Yasevich <vladislav.yasevich@hp.com>
> Date: Tue, 27 Apr 2010 10:32:34 -0400
> 
>>
>> Dan Carpenter wrote:
>>> This assignment isn't needed because we did it earlier already.
>>>
>>> Also another reason to delete the assignment is because it triggers a
>>> Smatch warning about checking for NULL pointers after a dereference.
>>>
>>> Reported-by: Vlad Yasevich <vladislav.yasevich@hp.com>
>>> Signed-off-by: Dan Carpenter <error27@gmail.com>
>> Thanks.  I'll take this one.
> 
> And when will I get this from you? :-)

By the end of the week.  I am trying to get all the testing finished. :)

-vlad



* Re: [PATCH] bnx2x: add support for receive hashing
From: Tom Herbert @ 2010-04-27 17:31 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, bmb, netdev, rick.jones2
In-Reply-To: <1272388439.2295.369.camel@edumazet-laptop>

> So we have a BIG problem :
>
> All cpus are fighting to get the socket lock,
> and very little progress is made.
>
> Note this problem has nothing to do with RPS, we could have
> it with multiqueue as well.
>

This is the problem that we are addressing with so_reuseport!

> Oh well...
>
>
>
>


* Re: [PATCH] bnx2x: add support for receive hashing
From: David Miller @ 2010-04-27 17:20 UTC (permalink / raw)
  To: eric.dumazet; +Cc: bmb, therbert, netdev, rick.jones2
In-Reply-To: <1272388439.2295.369.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 27 Apr 2010 19:13:59 +0200

> So we have a BIG problem :
> 
> All cpus are fighting to get the socket lock,
> and very little progress is made.
> 
> Note this problem has nothing to do with RPS, we could have 
> it with multiqueue as well.
> 
> Oh well...

Indeed, a huge issue, in that we haven't converted the UDP hash over
to RCU yet :-)

But because of the transient bind nature of UDP there are still a bunch
of cases that won't even cure.


* Re: [PATCH] cxgb3: Wait longer for control packets on initialization
From: David Miller @ 2010-04-27 17:18 UTC (permalink / raw)
  To: divy; +Cc: adetsch, netdev
In-Reply-To: <4BD648FC.80602@chelsio.com>

From: Divy Le Ray <divy@chelsio.com>
Date: Mon, 26 Apr 2010 19:16:28 -0700

> Andre Detsch wrote:
>> In some Power7 platforms, when using VIOS (Virtual I/O Server), we
>> need to wait longer for control packets to finish transfer during
>> initialization.
>> Without this change, initialization may fail prematurely.
>>
>> Signed-off-by: Wen Xiong <wenxiong@us.ibm.com>
>> Signed-off-by: Andre Detsch <adetsch@br.ibm.com>
>>   
> 
> Acked-by: Divy Le Ray <divy@chelsio.com>

Applied.


* Re: [net-2.6 PATCH] ixgbe: Power down PHY during driver resets
From: David Miller @ 2010-04-27 17:18 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, peter.p.waskiewicz.jr
In-Reply-To: <20100427103814.23338.47637.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 27 Apr 2010 03:38:15 -0700

> From: Peter Waskiewicz <peter.p.waskiewicz.jr@intel.com>
> 
> The PHY laser is still on during driver init.  It's allowing
> garbage to hit our FIFO, which eventually can cause the entire
> device to die.  Power down the laser while setting up the device,
> and re-enable the laser before getting link.
> 
> Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.


* Re: [net-2.6 PATCH] e1000e: enable/disable ASPM L0s and L1 and ERT according to hardware errata
From: David Miller @ 2010-04-27 17:18 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, mjg, bruce.w.allan
In-Reply-To: <20100427133232.25490.92973.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 27 Apr 2010 06:33:04 -0700

> From: Bruce Allan <bruce.w.allan@intel.com>
> 
> Prompted by a previous patch submitted by Matthew Garret <mjg@redhat.com>,
> further digging into errata documentation reveals the current enabling or
> disabling of ASPM L0s and L1 states for certain parts supported by this
> driver are incorrect.  82571 and 82572 should always disable L1.  For
> standard frames, 82573/82574/82583 can enable L1 but L0s must be disabled,
> and for jumbo frames 82573/82574 must disable L1.  This allows for some
> parts to enable L1 in certain configurations leading to better power
> savings.
> 
> Also according to the same errata, Early Receive (ERT) should be disabled
> on 82573 when using jumbo frames.
> 
> Cc: Matthew Garret <mjg@redhat.com>
> Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.


* Re: linux-next: build failure after merge of the final tree (net tree related)
From: David Miller @ 2010-04-27 17:18 UTC (permalink / raw)
  To: sfr; +Cc: netdev, linux-next, linux-kernel, yoshfuji
In-Reply-To: <20100427.093430.258110898.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Tue, 27 Apr 2010 09:34:30 -0700 (PDT)

> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Tue, 27 Apr 2010 15:25:16 +1000
> 
>> After merging the bkl-ioctl tree, today's linux-next build (powerpc
>> ppc44x_defconfig) failed like this:
>> 
>> net/bridge/br_multicast.c: In function 'br_ip6_multicast_alloc_query':
>> net/bridge/br_multicast.c:469: error: implicit declaration of function 'csum_ipv6_magic'
>> 
>> Introduced by commit 08b202b6726459626c73ecfa08fcdc8c3efc76c2 ("bridge
>> br_multicast: IPv6 MLD support") from the net tree.
>> 
>> csum_ipv6_magic is declared in net/ip6_checksum.h ...
> 
> Bummer, powerpc is one of the few platforms that doesn't get the header
> file implicitly so you always trip over this whereas we never see it in
> x86 and sparc64 builds :-)
> 
> I'll fix this, thanks!

I just committed the following for this:

bridge: Fix build of ipv6 multicast code.

Based upon a report from Stephen Rothwell:

--------------------
net/bridge/br_multicast.c: In function 'br_ip6_multicast_alloc_query':
net/bridge/br_multicast.c:469: error: implicit declaration of function 'csum_ipv6_magic'

Introduced by commit 08b202b6726459626c73ecfa08fcdc8c3efc76c2 ("bridge
br_multicast: IPv6 MLD support") from the net tree.

csum_ipv6_magic is declared in net/ip6_checksum.h ...
--------------------

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/bridge/br_multicast.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index e481dbd..2048ef0 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -28,6 +28,7 @@
 #include <net/ipv6.h>
 #include <net/mld.h>
 #include <net/addrconf.h>
+#include <net/ip6_checksum.h>
 #endif
 
 #include "br_private.h"
-- 
1.7.0.4


* Re: [PATCH] bnx2x: add support for receive hashing
From: Eric Dumazet @ 2010-04-27 17:13 UTC (permalink / raw)
  To: David Miller; +Cc: bmb, therbert, netdev, rick.jones2
In-Reply-To: <20100427.095108.68126984.davem@davemloft.net>

On Tuesday, 27 April 2010 at 09:51 -0700, David Miller wrote:
> From: Brian Bloniarz <bmb@athenacr.com>
> Date: Tue, 27 Apr 2010 09:37:11 -0400
> 
> > David Miller wrote:
> >> How damn hard is it to add two 16-bit ports to the hash regardless of
> >> protocol?
> >>   
> > Come to think of it, for UDP the hash must ignore
> > the srcport and srcaddr, because a single bound
> > socket is going to wildcard both those fields.
> 
> For load distribution we don't care if the local socket is wildcard
> bound on source.
> 
> It's going to be fully specified in the packet, and that's enough.
> 
> Sure, for full RFS some amends might be necessary in this area, but
> for RPS and adapter based hw steering, using all of the ports is
> entirely sufficient.

Well well well...

I was doing some pktgen tests, with :

 pgset "src_min 192.168.0.10"
 pgset "src_max 192.168.0.110"
 pgset "dst_min 192.168.0.2"
 pgset "dst_max 192.168.0.2"
 pgset "udp_dst_min 4000"
 pgset "udp_dst_max 4000"

So I simulate about 100 remote IPs bombarding a single port on the target
machine. pktgen injects about 930,000 pps.

The softirq on my target is handled on cpu0, and RPS spreads packets to 7
other cpus.

And my receiver is stuck (it can read only about 50 pps!)


As soon as I disable RPS, my receiver can catch 850,000 pps.


RPS OFF: perf top of cpu 0

------------------------------------------------------------------------------------------------------------------------------
   PerfTop:    1001 irqs/sec  kernel:100.0% [1000Hz cycles],  (all, cpu: 0)
------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function               DSO
             _______ _____ ______________________ _______

              385.00 10.2% __udp4_lib_lookup      vmlinux
              322.00  8.5% ip_route_input         vmlinux
              312.00  8.3% sock_queue_rcv_skb     vmlinux
              262.00  6.9% do_raw_spin_lock       vmlinux
              251.00  6.6% __alloc_skb            vmlinux
              239.00  6.3% sock_put               vmlinux
              207.00  5.5% eth_type_trans         vmlinux
              202.00  5.4% __slab_alloc           vmlinux
              159.00  4.2% __kmalloc_track_caller vmlinux
              149.00  3.9% __sk_mem_schedule      vmlinux
              125.00  3.3% kmem_cache_alloc       vmlinux
              116.00  3.1% ipt_do_table           vmlinux
              115.00  3.0% do_raw_read_lock       vmlinux
               71.00  1.9% tg3_poll_work          vmlinux
               65.00  1.7% __netdev_alloc_skb     vmlinux
               64.00  1.7% skb_pull               vmlinux
               58.00  1.5% ip_rcv                 vmlinux
               58.00  1.5% __slab_free            vmlinux
               53.00  1.4% udp_queue_rcv_skb      vmlinux
               47.00  1.2% nf_iterate             vmlinux
               44.00  1.2% __netif_receive_skb    vmlinux
               29.00  0.8% sock_def_readable      vmlinux
               28.00  0.7% do_raw_spin_unlock     vmlinux
               26.00  0.7% kfree                  vmlinux
               25.00  0.7% __udp4_lib_rcv         vmlinux
               24.00  0.6% ip_rcv_finish          vmlinux
               24.00  0.6% __list_add             vmlinux


RPS ON: perf top of a slave CPU

------------------------------------------------------------------------------------------------------------------------------
   PerfTop:    1000 irqs/sec  kernel:100.0% [1000Hz cycles],  (all, cpu: 1)
------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function            DSO
             _______ _____ ___________________ _______

             2411.00 62.0% do_raw_spin_lock    vmlinux
              690.00 17.7% delay_tsc           vmlinux
              234.00  6.0% __udp4_lib_lookup   vmlinux
              174.00  4.5% sock_put            vmlinux
               72.00  1.9% ip_rcv              vmlinux
               51.00  1.3% __netif_receive_skb vmlinux
               43.00  1.1% do_raw_spin_unlock  vmlinux
               39.00  1.0% __delay             vmlinux
               38.00  1.0% sock_queue_rcv_skb  vmlinux
               36.00  0.9% udp_queue_rcv_skb   vmlinux
               31.00  0.8% ip_route_input      vmlinux
               15.00  0.4% __slab_free         vmlinux
               12.00  0.3% ipt_do_table        vmlinux
               11.00  0.3% skb_release_data    vmlinux
                7.00  0.2% kfree               vmlinux
                5.00  0.1% nf_iterate          vmlinux

So we have a BIG problem:

All CPUs are fighting to get the socket lock (62% of the samples on
this slave CPU are in do_raw_spin_lock), and very little progress is made.

Note this problem has nothing to do with RPS, we could have
it with multiqueue as well.

Oh well...




^ permalink raw reply

* [PATCH net-next] bridge: multicast router list manipulation
From: Stephen Hemminger @ 2010-04-27 17:13 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David S. Miller, netdev
In-Reply-To: <E1NlbuX-00021m-Bq@gondolin.me.apana.org.au>

I prefer that the hlist be only accessed through the hlist macro
objects. Explicit twiddling of links (especially with RCU) exposes
the code to future bugs.

Compile tested only.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>


--- a/net/bridge/br_multicast.c	2010-04-27 09:54:02.180531924 -0700
+++ b/net/bridge/br_multicast.c	2010-04-27 10:07:19.188688664 -0700
@@ -1041,21 +1041,21 @@ static int br_ip6_multicast_mld2_report(
 static void br_multicast_add_router(struct net_bridge *br,
 				    struct net_bridge_port *port)
 {
-	struct hlist_node *p;
-	struct hlist_node **h;
+	struct net_bridge_port *p;
+	struct hlist_node *n, *last = NULL;
 
-	for (h = &br->router_list.first;
-	     (p = *h) &&
-	     (unsigned long)container_of(p, struct net_bridge_port, rlist) >
-	     (unsigned long)port;
-	     h = &p->next)
-		;
-
-	port->rlist.pprev = h;
-	port->rlist.next = p;
-	rcu_assign_pointer(*h, &port->rlist);
-	if (p)
-		p->pprev = &port->rlist.next;
+	hlist_for_each_entry(p, n, &br->router_list, rlist) {
+		if ((unsigned long) port >= (unsigned long) p) {
+			hlist_add_before_rcu(&port->rlist, n);
+			return;
+		}
+		last = n;
+	}
+
+	if (last)
+		hlist_add_after_rcu(last, &port->rlist);
+	else
+		hlist_add_head_rcu(&port->rlist, &br->router_list);
 }
 
 static void br_multicast_mark_router(struct net_bridge *br,
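
The idea in the patch above (insert into a sorted hlist through helper
functions only, never by touching ->next / ->pprev at the call site) can be
sketched in userspace as follows. This is only an illustration under stated
assumptions: the hlist helpers are minimal non-RCU stand-ins written for this
sketch, struct item and sorted_insert() are hypothetical names, and an integer
key replaces the port pointer value that the bridge code compares.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-ins for the kernel hlist types (no RCU here). */
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

/* Link n as the new first element of h. */
static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
	n->next = h->first;
	if (h->first)
		h->first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/* Link n immediately after prev, which must already be on a list. */
static void hlist_add_after(struct hlist_node *prev, struct hlist_node *n)
{
	n->next = prev->next;
	n->pprev = &prev->next;
	if (n->next)
		n->next->pprev = &n->next;
	prev->next = n;
}

/* Hypothetical element type standing in for struct net_bridge_port. */
struct item {
	unsigned long key;
	struct hlist_node node;
};

#define item_of(ptr) \
	((struct item *)((char *)(ptr) - offsetof(struct item, node)))

/* Keep the list in descending key order using only the helpers above. */
static void sorted_insert(struct hlist_head *head, struct item *it)
{
	struct hlist_node *n, *last = NULL;

	assert(head && it);

	/* Stop at the first element that should follow the new one. */
	for (n = head->first; n; n = n->next) {
		if (it->key >= item_of(n)->key)
			break;
		last = n;
	}

	if (last)
		hlist_add_after(last, &it->node);
	else
		hlist_add_head(&it->node, head);
}
```

Remembering the last node visited and inserting after it covers both the
"insert in the middle" and "append at the tail" cases with one call, so no
separate insert-before helper is needed in this sketch.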


^ permalink raw reply

* [PATCH net-next] bridge: use is_multicast_ether_addr
From: Stephen Hemminger @ 2010-04-27 17:13 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller; +Cc: netdev
In-Reply-To: <E1NlbuW-00021d-8E@gondolin.me.apana.org.au>

Use existing inline function.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--- a/net/bridge/br_device.c	2010-04-27 09:49:30.059258391 -0700
+++ b/net/bridge/br_device.c	2010-04-27 09:50:21.439878721 -0700
@@ -36,7 +36,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *
 	skb_reset_mac_header(skb);
 	skb_pull(skb, ETH_HLEN);

-	if (dest[0] & 1) {
+	if (is_multicast_ether_addr(dest)) {
 		if (br_multicast_rcv(br, NULL, skb))
 			goto out;

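For readers unfamiliar with the helper, the check it replaces is the
group bit: the least significant bit of the first octet of an Ethernet
address. A minimal userspace sketch of that test follows; the function name
is deliberately suffixed with _sketch because this is an illustration, not
the kernel's definition from etherdevice.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace sketch: an Ethernet address is a group (multicast) address
 * when the low bit of its first octet is set. Broadcast (ff:ff:ff:ff:ff:ff)
 * also has that bit set, so it counts as multicast here.
 */
static bool is_multicast_ether_addr_sketch(const uint8_t *addr)
{
	assert(addr != NULL);
	return (addr[0] & 0x01) != 0;
}
```

Naming the test makes the intent of the branch in br_dev_xmit() visible
without a comment, which is the point of the one-line change above.
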
^ permalink raw reply

* Re: [PATCH] bnx2x: add support for receive hashing
From: David Miller @ 2010-04-27 17:06 UTC (permalink / raw)
  To: bmb; +Cc: therbert, eric.dumazet, netdev, rick.jones2
In-Reply-To: <4BD71890.2050606@athenacr.com>

From: Brian Bloniarz <bmb@athenacr.com>
Date: Tue, 27 Apr 2010 13:02:08 -0400

> Maybe I'm misunderstanding... won't it distribute the
> packet handling load to multiple cores, but then all
> those cores will contend trying to deliver those packets
> to the single socket?
> 
> I was assuming that this'd be a net loss over just doing
> all the protocol handling on a single core. I haven't
> done any benchmarks yet.

Whether it's a net loss depends upon the application.

Also, on the non-application side f.e. a router or firewall, this is
exactly the behavior you want.

^ permalink raw reply

* Re: [PATCH 0/4] net: ipmr netlink interface for route dumping
From: David Miller @ 2010-04-27 17:03 UTC (permalink / raw)
  To: Patrick, McHardy, kaber; +Cc: netdev
In-Reply-To: <1272374785-3858-1-git-send-email-kaber@trash.net>


Whoa, there are three of you now?!?!?!

:-)

^ permalink raw reply

* Re: [PATCH] bnx2x: add support for receive hashing
From: Brian Bloniarz @ 2010-04-27 17:02 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, eric.dumazet, netdev, rick.jones2
In-Reply-To: <20100427.095108.68126984.davem@davemloft.net>

David Miller wrote:
> From: Brian Bloniarz <bmb@athenacr.com>
> Date: Tue, 27 Apr 2010 09:37:11 -0400
> 
>> David Miller wrote:
>>> How damn hard is it to add two 16-bit ports to the hash regardless of
>>> protocol?
>>>   
>> Come to think of it, for UDP the hash must ignore
>> the srcport and srcaddr, because a single bound
>> socket is going to wildcard both those fields.
> 
> For load distribution we don't care if the local socket is wildcard
> bound on source.
> 
> It's going to be fully specified in the packet, and that's enough.

Maybe I'm misunderstanding... won't it distribute the
packet handling load to multiple cores, but then all
those cores will contend trying to deliver those packets
to the single socket?

I was assuming that this'd be a net loss over just doing
all the protocol handling on a single core. I haven't
done any benchmarks yet.
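
The hashing being discussed in the quoted text can be sketched as follows:
hash the full 4-tuple found in the packet and use the result to pick a CPU.
This is a sketch under assumptions, not the hash used by bnx2x or by RPS;
struct flow4, mix32(), flow_hash() and flow_to_cpu() are hypothetical names,
and the mixing step is a generic 32-bit integer finalizer chosen only to keep
the example short.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 4-tuple as it appears in a received IPv4 packet. */
struct flow4 {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

/* Generic invertible 32-bit mixer (an assumption, not the kernel's jhash). */
static uint32_t mix32(uint32_t h)
{
	h ^= h >> 16;
	h *= 0x85ebca6bu;
	h ^= h >> 13;
	h *= 0xc2b2ae35u;
	h ^= h >> 16;
	return h;
}

/* Fold both addresses and both 16-bit ports into one 32-bit value. */
static uint32_t flow_hash(const struct flow4 *f)
{
	uint32_t ports = ((uint32_t)f->sport << 16) | f->dport;
	uint32_t h = mix32(f->saddr);

	h = mix32(h ^ f->daddr);
	h = mix32(h ^ ports);
	return h;
}

/* Map a flow to one of ncpus processing CPUs. */
static unsigned int flow_to_cpu(const struct flow4 *f, unsigned int ncpus)
{
	assert(ncpus > 0);
	return flow_hash(f) % ncpus;
}
```

Two senders that differ only in source port get different hash values even
when they target the same wildcard-bound UDP socket. That spreads the
protocol processing across CPUs, and it is also why the delivery step can
end up contending on that one socket's lock.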

^ permalink raw reply

* Re: [patch v2] sctp: cleanup: remove duplicate assignment
From: David Miller @ 2010-04-27 16:58 UTC (permalink / raw)
  To: vladislav.yasevich
  Cc: error27, sri, yjwei, cdischino, linux-sctp, netdev,
	kernel-janitors
In-Reply-To: <4BD6F582.9030804@hp.com>

From: Vlad Yasevich <vladislav.yasevich@hp.com>
Date: Tue, 27 Apr 2010 10:32:34 -0400

> 
> 
> Dan Carpenter wrote:
>> This assignment isn't needed because we did it earlier already.
>> 
>> Also another reason to delete the assignment is because it triggers a
>> Smatch warning about checking for NULL pointers after a dereference.
>> 
>> Reported-by: Vlad Yasevich <vladislav.yasevich@hp.com>
>> Signed-off-by: Dan Carpenter <error27@gmail.com>
> 
> Thanks.  I'll take this one.

And when will I get this from you? :-)

^ permalink raw reply

* Re: [net-2.6 PATCH] ixgbevf: Fix link speed display
From: David Miller @ 2010-04-27 16:57 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, gregory.v.rose
In-Reply-To: <20100427103834.23338.22213.stgit@localhost.localdomain>

Not appropriate this late in the -RC series.

I don't see this in the regression list, and reported link speed being
incorrect is not a catastrophic crash and/or failure.

^ permalink raw reply
