Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: Receive issues with bonding and vlans
From: John Fastabend @ 2010-05-03 21:17 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: Leech, Christopher, netdev@vger.kernel.org, Andy Gospodarek,
	Patrick McHardy, bonding-devel@lists.sourceforge.net
In-Reply-To: <11804.1272911110@death.nxdomain.ibm.com>

Jay Vosburgh wrote:
> John Fastabend <john.r.fastabend@intel.com> wrote:
> 
>> Jay Vosburgh wrote:
>>> Chris Leech <christopher.leech@intel.com> wrote:
>>>
>>>> On Mon, Apr 12, 2010 at 04:10:51PM -0700, Jay Vosburgh wrote:
>>>>>    Is the FCoE supposed to run over the inactive bonding slave?  Or
>>>>> am I misunderstanding what you're saying?  I had thought the LLDP, et
>>>>> al, special case in bonding was to permit, essentially, path discovery,
>>>>> not necessarily active use of the inactive slave.
>>>> That's what I'm trying to do, yes.  Mostly because it's a setup that
>>>> would work if you removed the FCoE traffic from the network data path,
>>>> and only converged at the driver level and below.  It's possible that
>>>> the answer is "don't do that".
>>>      So, basically, you want the bond to act like usual for "regular"
>>> ethernet traffic, but act like the slaves are independent from the bond
>>> for the magic FCoE traffic, right?
>>>
>>>      I'm not really sure if that's a "don't do that" or not.
>>>
>>>>>    Not that this is necessarily bad; the "drop stuff on inactive
>>>>> slaves" is really there for duplicate suppression, but it also can
>>>>> uncover network topology issues, e.g., network layouts that won't work
>>>>> if the devices fail, but appear to work during testing because the
>>>>> "inactive" slave still receives traffic (it hasn't really failed).
>>>>>
>>>>>> The problem is that it doesn't work for hardware accelerated VLAN
>>>>>> devices, because the VLAN receive paths have their own
>>>>>> skb_bond_should_drop calls that were not updated.
>>>>>>
>>>>> >From what I can tell, VLAN receives always end up going through
>>>>>> netif_receive_skb anyway, so skb_bond_should_drop gets called twice if
>>>>>> the frame isn't dropped the first time.  I think the bonding checks in
>>>>>> __vlan_hwaccel_rx and vlan_gro_common should just be removed.
>>>>>    I'm not so sure.  The checks in __vlan_hwaccel_rx are done with
>>>>> the original receiving device in skb->dev; by the time the packet gets
>>>>> to netif_receive_skb, the original slave the packet was received on has
>>>>> been lost (and replaced with the VLAN device).  Various things are
>>>>> interested in that, in particular the "arp_validate" and the "inactive
>>>>> slave drop" logic for bonding depend on knowing the real device the
>>>>> packet arrived on.
>>>>>
>>>>>    I note that the vlan accel logic doesn't change skb_iif to the
>>>>> VLAN device; it remains as the original device.  I suppose one
>>>>> alternative would be to convert the bonding drop, et al, logic to use
>>>>> skb_iif instead of skb->dev; if that works, then I think the VLAN core
>>>>> would not need to call skb_bond_should_drop, which in turn would be a
>>>>> bit more complicated as it would have to look up the dev from the
>>>>> skb_iif.  There's already some code in bonding that takes advantage of
>>>>> this property of the VLANs, so maybe this is the way to go.
>>>> Thanks, I'll take another look and see if I can come up with something
>>>> better.
>>>      I looked at the skb_bond_should_drop stuff a bit more after I
>>> wrote that; it's not as easy as I had suspected.  The big sticking point
>>> is that currently the test in netif_receive_skb (now __netif_receive_skb
>>> in net-next-2.6) is on skb->dev->master to identify packets arriving on
>>> slaves of bonding.  The VLAN skb->dev has ->master set to NULL.  Doing
>>> that test against skb->skb_iif would be much more expensive, as it would
>>> require a device lookup for every packet.
>>>
>>>      So, I suspect that something has to happen in the VLAN
>>> acceleration path, although I don't know exactly what.  I don't know if
>>> it would be possible to flag the packets in some special way to indicate
>>> that they're "bonding slave" packets, or if it's better to keep the
>>> current structure and just fix the calls somehow.
>>>
>>>      -J
>>>
>>>
>> Jay,
>>
>> It should be OK to allow packets to be received on the VLAN if it is not
>> explicitly in the bond?
> 
>         Lemme see if I have this straight, because all of these special
> cases are making my brain hurt.  This one is for a configuration like this:
> 
>         bond0 ----- eth0
>                    /
>         vlan.xxx -/
> 
>         I.e., a VLAN configured directly atop an ethernet device, said
> ethernet also being a slave to bonding.  Is that correct?
> 

Yes, this is the correct scenario that we are considering.

>         Extrapolating from the ASCII art in a prior message in this
> discussion, would this configuration really be something like this:
> 
>         vlan.xxx -\
>                    \
>         bond0 ----- eth1
>         bond0 ----- eth0
>                    /
>         vlan.xxx -/
> 
>         I.e., two slaves to bonding, with VLAN xxx configured atop both
> of the slaves?  Or would the eth0 and eth1 use discrete VLAN ids?  The
> reason I ask is in regards to duplicate suppression.  The whole reason
> the "inactive" slave drops (most) incoming packets is to eliminate
> duplicates when the switch floods traffic to both slave ports.
> 

These vlan ids could be the same or discrete I think both configurations 
should be valid.

>         This is a bit tricky, because it's not really about broadcasts /
> multicasts so much, but about traffic that the switch sends to all ports
> because the switch's MAC address table isn't up to date with the
> destination MAC of the traffic (which is a transient condition, so this
> would only happen for perhaps one second or so).  That would result in
> duplicate unicast packets being received by the bond (back in the day
> before bonding had the "drop inactive traffic" logic).
> 
>         So if the same VLAN is configured atop the two slaves, I wonder
> if that will open a window for the duplicate unicast packet problem.

OK, this does appear to open a window for duplicated unicast packets. 
By only allowing handlers with exact matches at least this issue is less 
obvious and we are assuming the packet handler can deal with this 
duplication.  This seems to be the current assumption made. The same 
issue exists today for real device in the following setup,

vlan --> bond0 --> eth

Specifically for FCoE we use the san mac address so it wouldn't be an 
issue here.  The expectation being that the switch will only ever use 
the correct san mac on the port.

> 
>         If the VLAN ids are different, then I'll assume this is some
> kind of device mapper magic, doing the load balancing elsewhere.

Correct device mapper handles load balancing and failover for both 
cases, when the vlan ids are different and when they are the same.

> 
>> Or if we want to be more paranoid deliver packets only to handlers with
>> exact matches for the device. For non vlan devices we deliver skb's to
>> packet handlers that match exactly even on inactive slaves so doing this
>> on vlan devices as well makes sense and shouldn't cause any unexpected
>> problems.
> 
>         Yah, the whole concept of "inactive" slave is pretty mutated
> now; it's kind of become the "active-backup with semi-manual load
> balance for clever protocols, oh, and duplicate suppression" mode.
> 
>> Also on a somewhat unrelated note I suspect null_or_orig and null_or_bond
>> are not working as expected in __netif_receive_skb().  At least the
>> comment 'deliver only exact match' could be inaccurate.
> 
>         I don't think this is unrelated at all.  This code (the packet
> to device lookup stuff in __netif_receive_skb) has been modified pretty
> extensively lately for various bonding-related special cases, and I
> think it is getting hard to follow.  Whatever comments are there need to
> be accurate, and, honestly, I think this code needs comments to explain
> what exactly is supposed to happen for these special cases.
>

Agreed.  This should be cleaned up and some explanations added.  The 
current behavior in active-backup mode is receiving packets on the 
bonded real device in active mode fails but putting that same real 
device in an inactive state will cause it to receive packets.  This is 
an inconsistency, which should probably be fixed by initializing 
null_or_bond to orig_dev.  And also renaming it orig_or_bond at that point.

>> Here's a patch to illustrate what I'm thinking compile tested only.  If
>> this sounds reasonable I'll work up an official patch.
>>
>>
>> [PATCH] net: allow vlans on bonded real net_devices
>>
>> For converged I/O it is reasonable to use dm_multipathing to provice
>> failover and load balancing for storage traffic and then use bonding
>> for the LAN failover and load balancing.
>>
>> Currently this works if the multipathed devices are using the real
>> net_device and those devices are enslaved to a bonded device what
>> does not work is creating a vlan on the real device and trying to
>> use it outside the bond for multipathing.
>>
>> This patch adds logic so that if the skb is destined for a vlan
>> that is not in the bond the skb will not be dropped.
>>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>> ---
>>
>> net/8021q/vlan_core.c |   31 +++++++++++++++++++++----------
>> net/core/dev.c        |   11 ++++++++---
>> 2 files changed, 29 insertions(+), 13 deletions(-)
>>
>> diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
>> index c584a0a..3bce0c3 100644
>> --- a/net/8021q/vlan_core.c
>> +++ b/net/8021q/vlan_core.c
>> @@ -8,18 +8,24 @@
>> int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp,
>>                     u16 vlan_tci, int polling)
>> {
>> +      struct net_device *vlan_dev;
>> +
>>       if (netpoll_rx(skb))
>>               return NET_RX_DROP;
>>
>> -      if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
>> +      vlan_dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
>> +
>> +      if (!vlan_dev)
>> +              goto drop;
>> +
>> +      if ((vlan_dev->priv_flags & IFF_BONDING ||
>> +          vlan_dev_real_dev(vlan_dev)->flags & IFF_MASTER) &&
>> +          skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
> 
>         I'm not sure this will do the right thing if the VLAN device
> itself is a slave to bonding, e.g., bond0 ---> vlan.xxx ---> eth0.  In
> that case, eth0's dev->master is NULL, and the vlan_dev (vlan.xxx's dev)
> doesn't have IFF_MASTER (but does have IFF_SLAVE and IFF_BONDING, I
> believe).
> 

correct, vlan_dev does have IFF_BONDING and IFF_SLAVE here and doesn't 
have IFF_MASTER.


>         I think this will result in all incoming traffic being accepted
> on such a configuration (leading to duplicates, as described above).
> 
>         I suspect, but have not tested, that something like this might
> do what you're looking for:
> 
>         if ((vlan_dev->priv_flags & IFF_BONDING ||
>             vlan_dev_real_dev(vlan_dev)->flags & (IFF_MASTER | IFF_SLAVE)) &&
>             skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
> 
>         I.e., if the VLAN device is either a MASTER (configured above
> the bond) or a slave (configured below the bond) do the duplicate
> suppresion.

Here are the three basic cases I see,

#1. vlanx --> bond0 --> ethx

In this case vlanx does not have IFF_BONDING set and real_dev is ethx 
with IFF_SLAVE set.  ethx has master dev->bond0 so this should work. 
And shows why we need the IFF_SLAVE bit as you pointed out and I dropped.

#2. bond  --> vlanx --> ethx

This case is broke, skb->dev->master is NULL so we would never drop this 
pkt.  As it exists today I suspect this is broken as well.

#3 bond0 --> ethx
    vlanx --> -|

Here is the case where adding the IFF_SLAVE bit doesn't work as I hoped. 
  We don't want to run skb_bond_should_drop here.

So I think there needs to be a bit of logic here to determine if we need 
to check skb_bond_should_drop with the vlan device or with the 
skb->dev->master. Something like might do:

should_drop_dev = vlan_dev->master ? vlan_dev->master : skb->dev->master

This should fix case #2 without breaking case #1.  And the case I want 
to allow is still not resolved.  I'll think about this some more maybe 
this logic can be fixed for all cases.


> 
>>               goto drop;
>>
>>       skb->skb_iif = skb->dev->ifindex;
>>       __vlan_hwaccel_put_tag(skb, vlan_tci);
>> -      skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
>> -
>> -      if (!skb->dev)
>> -              goto drop;
>> +      skb->dev = vlan_dev;
>>
>>       return (polling ? netif_receive_skb(skb) : netif_rx(skb));
>>
>> @@ -82,16 +88,21 @@ vlan_gro_common(struct napi_struct *napi, struct
>> vlan_group *grp,
>>               unsigned int vlan_tci, struct sk_buff *skb)
>> {
>>       struct sk_buff *p;
>> +      struct net_device *vlan_dev;
>>
>> -      if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
>> +      vlan_dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
>> +
>> +      if (!vlan_dev)
>> +              goto drop;
>> +
>> +      if ((vlan_dev->priv_flags & IFF_BONDING ||
>> +          vlan_dev_real_dev(vlan_dev)->flags & IFF_MASTER) &&
>> +          skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
>>               goto drop;
>>
>>       skb->skb_iif = skb->dev->ifindex;
>>       __vlan_hwaccel_put_tag(skb, vlan_tci);
>> -      skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
>> -
>> -      if (!skb->dev)
>> -              goto drop;
>> +      skb->dev = vlan_dev;
>>
>>       for (p = napi->gro_list; p; p = p->next) {
>>               NAPI_GRO_CB(p)->same_flow =
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 100dcbd..9ea4550 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -2780,6 +2780,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>       struct net_device *master;
>>       struct net_device *null_or_orig;
>>       struct net_device *null_or_bond;
>> +      struct net_device *real_dev;
>>       int ret = NET_RX_DROP;
>>       __be16 type;
>>
>> @@ -2853,9 +2854,13 @@ ncls:
>>        * handler may have to adjust skb->dev and orig_dev.
>>        */
>>       null_or_bond = NULL;
>> -      if ((skb->dev->priv_flags & IFF_802_1Q_VLAN) &&
>> -          (vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING)) {
>> -              null_or_bond = vlan_dev_real_dev(skb->dev);
>> +      if ((skb->dev->priv_flags & IFF_802_1Q_VLAN)) {
>> +              real_dev = vlan_dev_real_dev(skb->dev);
>> +              if (real_dev->priv_flags & IFF_BONDING)
>> +                      null_or_bond = vlan_dev_real_dev(skb->dev);
>> +              if (!(skb->dev->priv_flags & IFF_BONDING) &&
>> +                  real_dev->priv_flags & IFF_SLAVE_INACTIVE)
>> +                      null_or_orig = skb->dev;
>>       }
>>
>>       type = skb->protocol;
> 
>         Is this another way of accomplishing what I had suggested at the
> end of this message:
> 
> http://marc.info/?l=linux-netdev&m=127111386702765&w=2
> 
>         The patch part of my suggestion was as follows:
> 

I think we need the code you suggested either way, or initialize 
null_or_bond to orig_dev as I suggested above.

This logic was to deliver the skb only to exact matches for this case,

bond0 ---> eth0
vlanx ---> -|

Here vlanx is not in a bond and the real_dev is an inactive slave. I'll 
rethink this, but I believe only delivering this packet to handlers with 
exact matches is a good idea.  At least it is consistent with the non 
vlan case.

>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index b98ddc6..cc665bb 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -2735,7 +2735,7 @@ ncls:
>>                       &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
>>               if (ptype->type == type && (ptype->dev == null_or_orig ||
>>                    ptype->dev == skb->dev || ptype->dev == orig_dev ||
>> -                   ptype->dev == null_or_bond)) {
>> +                   (null_or_bond && (ptype->dev == null_or_bond))) {
>>                       if (pt_prev)
>>                               ret = deliver_skb(skb, pt_prev, orig_dev);
>>                       pt_prev = ptype;
>>
>>
>>       I haven't tested this, but the theory is to only test against
>> null_or_bond if null_or_bond isn't NULL, which is only the case for VLAN
>> traffic over bonding.
> 
>         Chris Leech said "that should do it" but I don't recall seeing
> if it actually did so in practice.
> 
>         Or is your change meant to fix something else?
> 

The missing piece with just this bit of code is if its dropped in the 
vlan_gro_common or __vlan_hwaccel_rx it never gets to the 
netif_receive_skb path.

Also null_or_bond wouldn't be set to the vlan dev that we want so I 
don't think this gets us there.

At this point I think there are two bug fixes that need to be made, one 
to address null_or_bond and another to check the correct net_device in 
case #1 and #2 above.

I'll try to put together another RFC patch series with all your feedback 
this evening.  With good comments to hopefully explain what is going and 
at least make it clear where things will work and not work.  Thanks for 
all the good feedback!

John.

>         -J
> 
> ---
>         -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com


^ permalink raw reply

* Re: [stable] ixgbe: Fix return of invalid txq
From: David Miller @ 2010-05-03 21:14 UTC (permalink / raw)
  To: jesse.brandeburg; +Cc: stable, netdev, linux-kernel, brandon
In-Reply-To: <alpine.WNT.2.00.1005031401100.4376@jbrandeb-desk1.amr.corp.intel.com>

From: "Brandeburg, Jesse" <jesse.brandeburg@intel.com>
Date: Mon, 3 May 2010 14:02:25 -0700 (Pacific Daylight Time)

> putting davem on to: line...  sorry for not including on first email.

I'm fine with this.

^ permalink raw reply

* Re: [patch 01/13] KS8851: Fix ks8851 snl transmit problem
From: David Miller @ 2010-05-03 21:13 UTC (permalink / raw)
  To: Tristram.Ha; +Cc: ben, netdev, support
In-Reply-To: <14385191E87B904DBD836449AA30269D6DE671@MORGANITE.micrel.com>

From: "Ha, Tristram" <Tristram.Ha@Micrel.Com>
Date: Mon, 3 May 2010 14:11:41 -0700

> I thought the transmit check workqueue reschedules itself when the
> buffer available is still not enough, and this is the part you objected.

If it reschedules itself, it runs immediately.  That will just hog a cpu
endlessly until the TX packets start to be transmitted by the chip.  That's
just as bad a polling in a loop.

You need to use an hrtimer or similar.

^ permalink raw reply

* RE: [patch 01/13] KS8851: Fix ks8851 snl transmit problem
From: Ha, Tristram @ 2010-05-03 21:11 UTC (permalink / raw)
  To: David Miller; +Cc: ben, netdev, support
In-Reply-To: <20100503.130316.56362236.davem@davemloft.net>

David Miller wrote:
> From: "Ha, Tristram" <Tristram.Ha@Micrel.Com>
> Date: Mon, 3 May 2010 12:06:21 -0700
> 
>> The transmit done interrupt in the KSZ8851 chips is not required for
>> normal operation.  Turning it off actually will improve transmit
>> performance because the system will not be interrupted every time a
>> packet is sent.
> 
> But you only trigger this workqueue when you notice in
->ndo_start_xmit() that you're out of
> space. 
> 
> This makes the chip sit idle with no packets to send until the
workqueue executes asynchronously
> to the initial transmit path which noticed the queue was full. 
> 
> That doesn't make any sense to me.  If anything you should at least
try to purge the TX queue
> and make space directly in the ->ndo_start_xmit() handler.  And if
that fails trigger an hrtimer
> to poll the TX state.  
> 
> Without some kind of timer based polling mechanism, if the workqueue
finds the TX queue is still
> full, what's going to do more checks later?  You will no longer get
->ndo_start_xmit() calls
> because the queue has been marked full, so nothing will trigger the
workqueue to run any more.  

I thought the transmit check workqueue reschedules itself when the
buffer available is still not enough, and this is the part you objected.

The buffer has about 8K.  Suppose 5 1514-byte packets are sent and the
buffer is not enough to hold the next packet and so the transmit queue
is stopped.  A workqueue is scheduled to read the buffer available
register.  As previous packets should have been sent the buffer size
read is big enough and so the transmit queue is restarted.  This happens
very often as the network bandwidth is only 10 Mbps but the CPU speed is
very fast.

I should also read the buffer available register during receive
interrupt processing so that this situation does not trigger in slow
traffic.

This roundabout way of using workqueue is because the register cannot be
read directly inside ndo_start_xmit().

Actually in other driver with similar transmit buffer available
operation I just return NETDEV_TX_BUSY and let the kernel stack calls
ndo_start_xmit again and again.  Supposedly it is bad for the whole
system but the overall transmit throughput is much better than when the
transmit done interrupt is enabled.

There should be a way to trigger the transmit interrupt after certain
number of packets are sent.  That will improve the performance and allay
your concern.  As I am not the engineer who worked on the Micrel KSZ8851
chip during its development, I am not quite aware of its functions and
limitations.  I will try the new code and submit a better patch if
possible.

^ permalink raw reply

* Re: Fun with if_bridge.h and br_private.h
From: Arnd Bergmann @ 2010-05-03 21:02 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: paulmck, netdev
In-Reply-To: <20100503133613.0f2a61f7@nehalam>

On Monday 03 May 2010 22:36:13 Stephen Hemminger wrote:
> > In file included from net/core/dev.c:104:
> > include/linux/if_bridge.h:106: warning: "struct net_bridge_port" declared inside parameter list
> > include/linux/if_bridge.h:106: warning: its scope is only this definition or declaration, which is probably not what you want
> > net/core/dev.c:2331: error: conflicting types for "br_handle_frame_hook"
> > include/linux/if_bridge.h:105: error: previous declaration of "br_handle_frame_hook" was here
> > net/core/dev.c:2333: error: conflicting types for "br_handle_frame_hook"
> > include/linux/if_bridge.h:105: error: previous declaration of "br_handle_frame_hook" was here
> > 
> > This happens because net/bridge/br_private.h includes if_bridge.h before
> > it defines net_bridge_port.
> > 
> > Any thoughts on how best to allow handle_bridge() see the definition
> > of struct net_bridge_port?
> > 
> 
> Why not make it a void *, there is no reason to make core code depend
> on br_private.h.

Ah, right. That's actually how I changed the definition of br_port to
start with. Sorry Paul, I had totally forgotten about this.
Not sure if we also need to change the br_handle_frame_hook prototype,
I think the forward declaration for struct net_bridge_port that I had
in my long patch was actually sufficient.

	Arnd

^ permalink raw reply

* Re: [stable] ixgbe: Fix return of invalid txq
From: Brandeburg, Jesse @ 2010-05-03 21:02 UTC (permalink / raw)
  To: stable@kernel.org, davem
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	brandon@ifup.org
In-Reply-To: <alpine.WNT.2.00.1005031333200.4376@jbrandeb-desk1.amr.corp.intel.com>

putting davem on to: line...  sorry for not including on first email.

On Mon, 3 May 2010, Brandeburg, Jesse wrote:

> Please consider commit fdd3d631cddad20ad9d3e1eb7dbf26825a8a121f for 
> inclusion in 2.6.32.y (it is already in 2.6.33.y)
> 
> Here is the commit message, it fixes a panic on machines with a larger 
> number of cpus than ixgbe has tx queues (64).
> 
> commit fdd3d631cddad20ad9d3e1eb7dbf26825a8a121f
> Author: Krishna Kumar <krkumar2@in.ibm.com>
> Date:   Wed Feb 3 13:13:10 2010 +0000
> 
>     ixgbe: Fix return of invalid txq
> 
>     a developer had complained of getting lots of warnings:
> 
>     "eth16 selects TX queue 98, but real number of TX queues is 64"
>     
> 	http://www.mail-archive.com/e1000-devel@lists.sourceforge.net/msg02200.html
> 
>     As there was no follow up on that bug, I am submitting this
>     patch assuming that the other return points will not return
>     invalid txq's, and also that this fixes the bug (not tested).
> 
>     Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
>     Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>     Acked-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
>     Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> 

^ permalink raw reply

* Re: OOP in ip_cmsg_recv (net-next)
From: Stephen Hemminger @ 2010-05-03 21:00 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1272906266.2226.77.camel@edumazet-laptop>

On Mon, 03 May 2010 19:04:26 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le lundi 03 mai 2010 à 09:47 -0700, Stephen Hemminger a écrit :
> > I am getting occasional NULL pointer references with net-next kernel.
> > No test, just usual stuff (like DNS).
> > 
> > This is a new regression in net-next only.
> > 
> > 
> > [  674.929685] BUG: unable to handle kernel NULL pointer dereference at 0000000000000322
> > [  674.929691] IP: [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
> > [  674.929699] PGD 1bce2b067 PUD 1b80af067 PMD 0 
> > [  674.929704] Oops: 0000 [#1] SMP 
> > [  674.929708] last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:08/ATK0110:00/hwmon/hwmon0/temp2_label
> > [  674.929712] CPU 2 
> > [  674.929713] Modules linked in: autofs4 binfmt_misc ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc kvm_intel kvm radeon ttm drm_kms_helper drm i2c_algo_bit snd_hda_codec_analog ipv6 snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd asus_atk0110 soundcore psmouse snd_page_alloc serio_raw usbhid mvsas libsas floppy scsi_transport_sas sky2 e1000e
> > [  674.929764] 
> > [  674.929767] Pid: 4358, comm: dnsmasq Not tainted 2.6.34-rc6-net #121 P6T DELUXE/System Product Name
> > [  674.929770] RIP: 0010:[<ffffffff813e97c1>]  [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
> > [  674.929776] RSP: 0018:ffff8801bce27ac8  EFLAGS: 00010246
> > [  674.929778] RAX: 0000000000000000 RBX: ffff8801bde62500 RCX: 0000000000000000
> > [  674.929781] RDX: ffff8801bce27e48 RSI: ffff8801bde62500 RDI: ffff8801bce27f18
> > [  674.929784] RBP: ffff8801bce27b48 R08: 0000000000000640 R09: 0000000000000000
> > [  674.929787] R10: 0000000000000020 R11: 0000000000000246 R12: ffff8801bce27f18
> > [  674.929789] R13: ffff8801bce27f18 R14: 0000000000000000 R15: ffff8801bdbe8850
> > [  674.929793] FS:  00007fe37fbfd700(0000) GS:ffff880001e40000(0000) knlGS:0000000000000000
> > [  674.929796] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  674.929798] CR2: 0000000000000322 CR3: 00000001bce5c000 CR4: 00000000000006e0
> > [  674.929801] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [  674.929804] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [  674.929807] Process dnsmasq (pid: 4358, threadinfo ffff8801bce26000, task ffff8801bda54560)
> > [  674.929810] Stack:
> > [  674.929811]  0000000000000134 000000000000012c ffff8801bce27b48 ffffffff813b065b
> > [  674.929816] <0> ffff8801bce27b08 ffffffff8123ce8e ffff8801bdbe8800 ffff8801bce27dc8
> > [  674.929821] <0> ffff8801bce27b18 ffffffff81464612 ffff8801bce27b48 000000005eba1e95
> > [  674.929827] Call Trace:
> > [  674.929834]  [<ffffffff813b065b>] ? skb_copy_datagram_iovec+0x5b/0x2c0
> > [  674.929840]  [<ffffffff8123ce8e>] ? do_raw_spin_unlock+0x5e/0xb0
> > [  674.929845]  [<ffffffff81464612>] ? _raw_spin_unlock_bh+0x12/0x20
> > [  674.929850]  [<ffffffff8140cf01>] udp_recvmsg+0x291/0x2b0
> > [  674.929856]  [<ffffffff81045190>] ? default_wake_function+0x0/0x10
> > [  674.929860]  [<ffffffff8141403a>] inet_recvmsg+0x4a/0x80
> > [  674.929866]  [<ffffffff813a3d2b>] sock_recvmsg+0xeb/0x120
> > [  674.929872]  [<ffffffff814388c0>] ? unix_dgram_sendmsg+0x5b0/0x630
> > [  674.929878]  [<ffffffff81119e12>] ? link_path_walk+0x502/0xaf0
> > [  674.929882]  [<ffffffff813a3728>] ? sock_aio_write+0x138/0x150
> > [  674.929888]  [<ffffffff810ca88d>] ? find_get_page+0x1d/0xc0
> > [  674.929892]  [<ffffffff813af8a3>] ? verify_iovec+0x93/0x100
> > [  674.929897]  [<ffffffff813a52bc>] __sys_recvmsg+0x14c/0x2d0
> > [  674.929902]  [<ffffffff813a56f4>] sys_recvmsg+0x44/0x80
> > [  674.929908]  [<ffffffff81008f42>] system_call_fastpath+0x16/0x1b
> > [  674.929910] Code: c4 80 48 89 5d e0 4c 89 6d f0 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31 c0 4c 89 65 e8 4c 89 75 f8 49 89 fd 48 8b 46 18 48 89 f3 <44> 0f b7 a0 22 03 00 00 41 f6 c4 01 74 4b 48 8b 46 58 8b 96 c4 
> > [  674.929955] RIP  [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
> > [  674.929959]  RSP <ffff8801bce27ac8>
> > [  674.929961] CR2: 0000000000000322
> > [  674.929964] ---[ end trace 443be32e81365554 ]---
> > [  674.929966] BUG: unable to handle kernel NULL pointer dereference at 0000000000000322
> > [  674.929972] IP: [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
> > [  674.929979] PGD 1bb9c7067 PUD 1bd5d3067 PMD 0 
> > [  674.929985] Oops: 0000 [#2] SMP 
> > [  674.929989] last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:08/ATK0110:00/hwmon/hwmon0/temp2_label
> > [  674.929994] CPU 7 
> > [  674.929997] Modules linked in: autofs4 binfmt_misc ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc kvm_intel kvm radeon ttm drm_kms_helper drm i2c_algo_bit snd_hda_codec_analog ipv6 snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd asus_atk0110 soundcore psmouse snd_page_alloc serio_raw usbhid mvsas libsas floppy scsi_transport_sas sky2 e1000e
> > [  674.930067] 
> > [  674.930072] Pid: 4525, comm: dnsmasq Tainted: G      D    2.6.34-rc6-net #121 P6T DELUXE/System Product Name
> > [  674.930077] RIP: 0010:[<ffffffff813e97c1>]  [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
> > [  674.930084] RSP: 0018:ffff8801bcf03ac8  EFLAGS: 00010246
> > [  674.930088] RAX: 0000000000000000 RBX: ffff8801b746c500 RCX: 0000000000000000
> > [  674.930092] RDX: ffff8801bcf03e48 RSI: ffff8801b746c500 RDI: ffff8801bcf03f18
> > [  674.930097] RBP: ffff8801bcf03b48 R08: 0000000000000640 R09: 0000000000000000
> > [  674.930101] R10: 0000000000000020 R11: 0000000000000246 R12: ffff8801bcf03f18
> > [  674.930105] R13: ffff8801bcf03f18 R14: 0000000000000000 R15: ffff8801bd430850
> > [  674.930110] FS:  00007f42211eb700(0000) GS:ffff880001ee0000(0000) knlGS:0000000000000000
> > [  674.930114] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  674.930118] CR2: 0000000000000322 CR3: 00000001bb96b000 CR4: 00000000000006e0
> > [  674.930122] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [  674.930127] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [  674.930132] Process dnsmasq (pid: 4525, threadinfo ffff8801bcf02000, task ffff8801bd52ae40)
> > [  674.930135] Stack:
> > [  674.930137]  0000000000000134 000000000000012c ffff8801bcf03b48 ffffffff813b065b
> > [  674.930144] <0> ffff8801bcf03b08 ffffffff8123ce8e ffff8801bd430800 ffff8801bcf03dc8
> > [  674.930152] <0> ffff8801bcf03b18 ffffffff81464612 ffff8801bcf03b48 0000000003fe9d95
> > [  674.930160] Call Trace:
> > [  674.930167]  [<ffffffff813b065b>] ? skb_copy_datagram_iovec+0x5b/0x2c0
> > [  674.930174]  [<ffffffff8123ce8e>] ? do_raw_spin_unlock+0x5e/0xb0
> > [  674.930180]  [<ffffffff81464612>] ? _raw_spin_unlock_bh+0x12/0x20
> > [  674.930187]  [<ffffffff8140cf01>] udp_recvmsg+0x291/0x2b0
> > [  674.930193]  [<ffffffff8141403a>] inet_recvmsg+0x4a/0x80
> > [  674.930199]  [<ffffffff813a3d2b>] sock_recvmsg+0xeb/0x120
> > [  674.930206]  [<ffffffff814388c0>] ? unix_dgram_sendmsg+0x5b0/0x630
> > [  674.930212]  [<ffffffff8123cf34>] ? do_raw_spin_lock+0x54/0x150
> > [  674.930218]  [<ffffffff813af8a3>] ? verify_iovec+0x93/0x100
> > [  674.930224]  [<ffffffff813a52bc>] __sys_recvmsg+0x14c/0x2d0
> > [  674.930231]  [<ffffffff813a56f4>] sys_recvmsg+0x44/0x80
> > [  674.930238]  [<ffffffff81008f42>] system_call_fastpath+0x16/0x1b
> > [  674.930241] Code: c4 80 48 89 5d e0 4c 89 6d f0 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31 c0 4c 89 65 e8 4c 89 75 f8 49 89 fd 48 8b 46 18 48 89 f3 <44> 0f b7 a0 22 03 00 00 41 f6 c4 01 74 4b 48 8b 46 58 8b 96 c4 
> > [  674.930307] RIP  [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
> > [  674.930313]  RSP <ffff8801bcf03ac8>
> > [  674.930315] CR2: 0000000000000322
> > [  674.930319] ---[ end trace 443be32e81365555 ]---
> > [  674.930322] BUG: unable to handle kernel NULL pointer dereference at 0000000000000322
> > [  674.930327] IP: [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
> > [  674.930332] PGD 1b97f1067 PUD 1bb827067 PMD 0 
> > [  674.930338] Oops: 0000 [#3] SMP 
> > [  674.930341] last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:08/ATK0110:00/hwmon/hwmon0/temp2_label
> > [  674.930345] CPU 3 
> > [  674.930347] Modules linked in: autofs4 binfmt_misc ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc kvm_intel kvm radeon ttm drm_kms_helper drm i2c_algo_bit snd_hda_codec_analog ipv6 snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd asus_atk0110 soundcore psmouse snd_page_alloc serio_raw usbhid mvsas libsas floppy scsi_transport_sas sky2 e1000e
> > [  674.930396] 
> > [  674.930401] Pid: 4561, comm: dnsmasq Tainted: G      D    2.6.34-rc6-net #121 P6T DELUXE/System Product Name
> > [  674.930405] RIP: 0010:[<ffffffff813e97c1>]  [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
> > [  674.930413] RSP: 0018:ffff8801bcd95ac8  EFLAGS: 00010246
> > [  674.930417] RAX: 0000000000000000 RBX: ffff8801b746cb00 RCX: 0000000000000000
> > [  674.930421] RDX: ffff8801bcd95e48 RSI: ffff8801b746cb00 RDI: ffff8801bcd95f18
> > [  674.930425] RBP: ffff8801bcd95b48 R08: 0000000000000640 R09: 0000000000000000
> > [  674.930429] R10: 0000000000000020 R11: 0000000000000246 R12: ffff8801bcd95f18
> > [  674.930433] R13: ffff8801bcd95f18 R14: 0000000000000000 R15: ffff8801b6bf8c50
> > [  674.930439] FS:  00007fc947627700(0000) GS:ffff880001e60000(0000) knlGS:0000000000000000
> > [  674.930443] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  674.930447] CR2: 0000000000000322 CR3: 00000001b9654000 CR4: 00000000000006e0
> > [  674.930451] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [  674.930455] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [  674.930460] Process dnsmasq (pid: 4561, threadinfo ffff8801bcd94000, task ffff8801bd5b1720)
> > [  674.930464] Stack:
> > [  674.930466]  0000000000000134 000000000000012c ffff8801bcd95b48 ffffffff813b065b
> > [  674.930473] <0> ffff8801bcd95b08 ffffffff8123ce8e ffff8801b6bf8c00 ffff8801bcd95dc8
> > [  674.930481] <0> ffff8801bcd95b18 ffffffff81464612 ffff8801bcd95b48 000000008ae6d276
> > [  674.930490] Call Trace:
> > [  674.930496]  [<ffffffff813b065b>] ? skb_copy_datagram_iovec+0x5b/0x2c0
> > [  674.930503]  [<ffffffff8123ce8e>] ? do_raw_spin_unlock+0x5e/0xb0
> > [  674.930509]  [<ffffffff81464612>] ? _raw_spin_unlock_bh+0x12/0x20
> > [  674.930516]  [<ffffffff8140cf01>] udp_recvmsg+0x291/0x2b0
> > [  674.930522]  [<ffffffff8141403a>] inet_recvmsg+0x4a/0x80
> > [  674.930529]  [<ffffffff813a3d2b>] sock_recvmsg+0xeb/0x120
> > [  674.930537]  [<ffffffff810704e2>] ? finish_wait+0x62/0x80
> > [  674.930543]  [<ffffffff814623f3>] ? __wait_on_bit_lock+0x73/0xb0
> > [  674.930550]  [<ffffffff81070390>] ? wake_bit_function+0x0/0x40
> > [  674.930556]  [<ffffffff813af8a3>] ? verify_iovec+0x93/0x100
> > [  674.930562]  [<ffffffff813a52bc>] __sys_recvmsg+0x14c/0x2d0
> > [  674.930569]  [<ffffffff813a56f4>] sys_recvmsg+0x44/0x80
> > [  674.930576]  [<ffffffff81008f42>] system_call_fastpath+0x16/0x1b
> > [  674.930579] Code: c4 80 48 89 5d e0 4c 89 6d f0 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31 c0 4c 89 65 e8 4c 89 75 f8 49 89 fd 48 8b 46 18 48 89 f3 <44> 0f b7 a0 22 03 00 00 41 f6 c4 01 74 4b 48 8b 46 58 8b 96 c4 
> > [  674.930636] RIP  [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
> > [  674.930641]  RSP <ffff8801bcd95ac8>
> > [  674.930642] CR2: 0000000000000322
> > [  674.930645] ---[ end trace 443be32e81365556 ]---
> > [  674.930647] BUG: unable to handle kernel NULL pointer dereference at 0000000000000322
> > [  674.930653] IP: [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
> > [  674.930660] PGD 1bcdbc067 PUD 1bbc3c067 PMD 0 
> > [  674.930666] Oops: 0000 [#4] SMP 
> > [  674.930669] last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:08/ATK0110:00/hwmon/hwmon0/temp2_label
> > [  674.930672] CPU 4 
> > [  674.930673] Modules linked in: autofs4 binfmt_misc ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc kvm_intel kvm radeon ttm drm_kms_helper drm i2c_algo_bit snd_hda_codec_analog ipv6 snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd asus_atk0110 soundcore psmouse snd_page_alloc serio_raw usbhid mvsas libsas floppy scsi_transport_sas sky2 e1000e
> > [  674.930712] 
> > [  674.930715] Pid: 4488, comm: dnsmasq Tainted: G      D    2.6.34-rc6-net #121 P6T DELUXE/System Product Name
> > [  674.930718] RIP: 0010:[<ffffffff813e97c1>]  [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
> > [  674.930723] RSP: 0018:ffff8801bcd93ac8  EFLAGS: 00010246
> > [  674.930725] RAX: 0000000000000000 RBX: ffff8801b746cf00 RCX: 0000000000000000
> > [  674.930727] RDX: ffff8801bcd93e48 RSI: ffff8801b746cf00 RDI: ffff8801bcd93f18
> > [  674.930730] RBP: ffff8801bcd93b48 R08: 0000000000000640 R09: 0000000000000000
> > [  674.930732] R10: 0000000000000020 R11: 0000000000000246 R12: ffff8801bcd93f18
> > [  674.930735] R13: ffff8801bcd93f18 R14: 0000000000000000 R15: ffff8801b6bf8450
> > [  674.930738] FS:  00007f4ccbd68700(0000) GS:ffff880001e80000(0000) knlGS:0000000000000000
> > [  674.930741] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  674.930743] CR2: 0000000000000322 CR3: 00000001bb81d000 CR4: 00000000000006e0
> > [  674.930745] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [  674.930748] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [  674.930751] Process dnsmasq (pid: 4488, threadinfo ffff8801bcd92000, task ffff8801bde2dc80)
> > [  674.930753] Stack:
> > [  674.930754]  0000000000000134 000000000000012c ffff8801bcd93b48 ffffffff813b065b
> > [  674.930758] <0> ffff8801bcd93b08 ffffffff8123ce8e ffff8801b6bf8400 ffff8801bcd93dc8
> > [  674.930763] <0> ffff8801bcd93b18 ffffffff81464612 ffff8801bcd93b48 00000000d5628d65
> > [  674.930768] Call Trace:
> > [  674.930773]  [<ffffffff813b065b>] ? skb_copy_datagram_iovec+0x5b/0x2c0
> > [  674.930778]  [<ffffffff8123ce8e>] ? do_raw_spin_unlock+0x5e/0xb0
> > [  674.930783]  [<ffffffff81464612>] ? _raw_spin_unlock_bh+0x12/0x20
> > [  674.930787]  [<ffffffff8140cf01>] udp_recvmsg+0x291/0x2b0
> > [  674.930792]  [<ffffffff8141403a>] inet_recvmsg+0x4a/0x80
> > [  674.930796]  [<ffffffff813a3d2b>] sock_recvmsg+0xeb/0x120
> > [  674.930801]  [<ffffffff814388c0>] ? unix_dgram_sendmsg+0x5b0/0x630
> > [  674.930806]  [<ffffffff81119e12>] ? link_path_walk+0x502/0xaf0
> > [  674.930810]  [<ffffffff813a3728>] ? sock_aio_write+0x138/0x150
> > [  674.930815]  [<ffffffff810ca88d>] ? find_get_page+0x1d/0xc0
> > [  674.930819]  [<ffffffff813af8a3>] ? verify_iovec+0x93/0x100
> > [  674.930823]  [<ffffffff813a52bc>] __sys_recvmsg+0x14c/0x2d0
> > [  674.930828]  [<ffffffff813a56f4>] sys_recvmsg+0x44/0x80
> > [  674.930833]  [<ffffffff81008f42>] system_call_fastpath+0x16/0x1b
> > [  674.930835] Code: c4 80 48 89 5d e0 4c 89 6d f0 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31 c0 4c 89 65 e8 4c 89 75 f8 49 89 fd 48 8b 46 18 48 89 f3 <44> 0f b7 a0 22 03 00 00 41 f6 c4 01 74 4b 48 8b 46 58 8b 96 c4 
> > [  674.930880] RIP  [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
> > [  674.930884]  RSP <ffff8801bcd93ac8>
> > [  674.930886] CR2: 0000000000000322
> > [  674.930889] ---[ end trace 443be32e81365557 ]---
> 
> Hmm, skb->sk is NULL
> 
> void ip_cmsg_recv(struct msghdr *msg, struct sk_buff *skb)
> {
> 	struct inet_sock *inet = inet_sk(skb->sk);
> 	unsigned flags = inet->cmsg_flags; // CRASH
> 
> 
> So a skb_free_datagram_locked() is at fault here...
> 
> commit 4b0b72f7dd617b13abd1b04c947e15873e011a24 probably
> 
> OK, the skb_orphan() should not be done at this point, if we are not the
> only user (and last user)
> 
> Oh well, sorry for the regression ;)
> 
> 
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index 95b851f..88949b0 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -230,12 +230,8 @@ EXPORT_SYMBOL(skb_free_datagram);
>  void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
>  {
>  	lock_sock_bh(sk);
> -	skb_orphan(skb);
> -	sk_mem_reclaim_partial(sk);
> +	skb_free_datagram(sk, skb);
>  	unlock_sock_bh(sk);
> -
> -	/* skb is now orphaned, might be freed outside of locked section */
> -	consume_skb(skb);
>  }
>  EXPORT_SYMBOL(skb_free_datagram_locked);

This works great for me. No messages for several hours.

-- 

^ permalink raw reply

* [stable] ixgbe: Fix return of invalid txq
From: Brandeburg, Jesse @ 2010-05-03 20:56 UTC (permalink / raw)
  To: stable; +Cc: netdev, linux-kernel, brandon, jesse.brandeburg

Please consider commit fdd3d631cddad20ad9d3e1eb7dbf26825a8a121f for 
inclusion in 2.6.32.y (it is already in 2.6.33.y)

Here is the commit message, it fixes a panic on machines with a larger 
number of cpus than ixgbe has tx queues (64).

commit fdd3d631cddad20ad9d3e1eb7dbf26825a8a121f
Author: Krishna Kumar <krkumar2@in.ibm.com>
Date:   Wed Feb 3 13:13:10 2010 +0000

    ixgbe: Fix return of invalid txq

    a developer had complained of getting lots of warnings:

    "eth16 selects TX queue 98, but real number of TX queues is 64"
    
	http://www.mail-archive.com/e1000-devel@lists.sourceforge.net/msg02200.html

    As there was no follow up on that bug, I am submitting this
    patch assuming that the other return points will not return
    invalid txq's, and also that this fixes the bug (not tested).

    Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
    Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
    Acked-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>


^ permalink raw reply

* Re: [PATCH net-next-2.6] net: if6_get_next() fix
From: Eric Dumazet @ 2010-05-03 20:50 UTC (permalink / raw)
  To: David Miller
  Cc: paulmck, shemminger, Valdis.Kletnieks, akpm, peterz, kaber,
	linux-kernel, netfilter-devel, netdev
In-Reply-To: <20100503.132442.200112018.davem@davemloft.net>

Paul, David, here the patch I was thinking about :

Feel free to split it in two parts if you like, I am too tired and must
sleep now ;)

Thanks

[PATCH net-next-2.6] net: rcu fixes

Add hlist_for_each_entry_rcu_bh() and
hlist_for_each_entry_continue_rcu_bh() macros, and use them in
ipv6_get_ifaddr(), if6_get_first() and if6_get_next() to fix lockdeps
warnings.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/rculist.h |   29 +++++++++++++++++++++++++++++
 net/ipv6/addrconf.c     |   16 ++++++++--------
 2 files changed, 37 insertions(+), 8 deletions(-)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 004908b..4ec3b38 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -429,6 +429,23 @@ static inline void hlist_add_after_rcu(struct hlist_node *prev,
 		pos = rcu_dereference_raw(pos->next))
 
 /**
+ * hlist_for_each_entry_rcu_bh - iterate over rcu list of given type
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_node to use as a loop cursor.
+ * @head:	the head for your list.
+ * @member:	the name of the hlist_node within the struct.
+ *
+ * This list-traversal primitive may safely run concurrently with
+ * the _rcu list-mutation primitives such as hlist_add_head_rcu()
+ * as long as the traversal is guarded by rcu_read_lock().
+ */
+#define hlist_for_each_entry_rcu_bh(tpos, pos, head, member)		 \
+	for (pos = rcu_dereference_bh((head)->first);			 \
+		pos && ({ prefetch(pos->next); 1; }) &&			 \
+		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; }); \
+		pos = rcu_dereference_bh(pos->next))
+
+/**
  * hlist_for_each_entry_continue_rcu - iterate over a hlist continuing after current point
  * @tpos:	the type * to use as a loop cursor.
  * @pos:	the &struct hlist_node to use as a loop cursor.
@@ -440,6 +457,18 @@ static inline void hlist_add_after_rcu(struct hlist_node *prev,
 	     ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });  \
 	     pos = rcu_dereference(pos->next))
 
+/**
+ * hlist_for_each_entry_continue_rcu_bh - iterate over a hlist continuing after current point
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_node to use as a loop cursor.
+ * @member:	the name of the hlist_node within the struct.
+ */
+#define hlist_for_each_entry_continue_rcu_bh(tpos, pos, member)		\
+	for (pos = rcu_dereference_bh((pos)->next);			\
+	     pos && ({ prefetch(pos->next); 1; }) &&			\
+	     ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });  \
+	     pos = rcu_dereference_bh(pos->next))
+
 
 #endif	/* __KERNEL__ */
 #endif
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 34d2d64..3984f52 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1346,7 +1346,7 @@ struct inet6_ifaddr *ipv6_get_ifaddr(struct net *net, const struct in6_addr *add
 	struct hlist_node *node;
 
 	rcu_read_lock_bh();
-	hlist_for_each_entry_rcu(ifp, node, &inet6_addr_lst[hash], addr_lst) {
+	hlist_for_each_entry_rcu_bh(ifp, node, &inet6_addr_lst[hash], addr_lst) {
 		if (!net_eq(dev_net(ifp->idev->dev), net))
 			continue;
 		if (ipv6_addr_equal(&ifp->addr, addr)) {
@@ -2959,7 +2959,7 @@ static struct inet6_ifaddr *if6_get_first(struct seq_file *seq)
 
 	for (state->bucket = 0; state->bucket < IN6_ADDR_HSIZE; ++state->bucket) {
 		struct hlist_node *n;
-		hlist_for_each_entry_rcu(ifa, n, &inet6_addr_lst[state->bucket],
+		hlist_for_each_entry_rcu_bh(ifa, n, &inet6_addr_lst[state->bucket],
 					 addr_lst)
 			if (net_eq(dev_net(ifa->idev->dev), net))
 				return ifa;
@@ -2974,12 +2974,12 @@ static struct inet6_ifaddr *if6_get_next(struct seq_file *seq,
 	struct net *net = seq_file_net(seq);
 	struct hlist_node *n = &ifa->addr_lst;
 
-	hlist_for_each_entry_continue_rcu(ifa, n, addr_lst)
+	hlist_for_each_entry_continue_rcu_bh(ifa, n, addr_lst)
 		if (net_eq(dev_net(ifa->idev->dev), net))
 			return ifa;
 
 	while (++state->bucket < IN6_ADDR_HSIZE) {
-		hlist_for_each_entry(ifa, n,
+		hlist_for_each_entry_rcu_bh(ifa, n,
 				     &inet6_addr_lst[state->bucket], addr_lst) {
 			if (net_eq(dev_net(ifa->idev->dev), net))
 				return ifa;
@@ -3000,7 +3000,7 @@ static struct inet6_ifaddr *if6_get_idx(struct seq_file *seq, loff_t pos)
 }
 
 static void *if6_seq_start(struct seq_file *seq, loff_t *pos)
-	__acquires(rcu)
+	__acquires(rcu_bh)
 {
 	rcu_read_lock_bh();
 	return if6_get_idx(seq, *pos);
@@ -3016,7 +3016,7 @@ static void *if6_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 }
 
 static void if6_seq_stop(struct seq_file *seq, void *v)
-	__releases(rcu)
+	__releases(rcu_bh)
 {
 	rcu_read_unlock_bh();
 }
@@ -3093,7 +3093,7 @@ int ipv6_chk_home_addr(struct net *net, struct in6_addr *addr)
 	unsigned int hash = ipv6_addr_hash(addr);
 
 	rcu_read_lock_bh();
-	hlist_for_each_entry_rcu(ifp, n, &inet6_addr_lst[hash], addr_lst) {
+	hlist_for_each_entry_rcu_bh(ifp, n, &inet6_addr_lst[hash], addr_lst) {
 		if (!net_eq(dev_net(ifp->idev->dev), net))
 			continue;
 		if (ipv6_addr_equal(&ifp->addr, addr) &&
@@ -3127,7 +3127,7 @@ static void addrconf_verify(unsigned long foo)
 
 	for (i = 0; i < IN6_ADDR_HSIZE; i++) {
 restart:
-		hlist_for_each_entry_rcu(ifp, node,
+		hlist_for_each_entry_rcu_bh(ifp, node,
 					 &inet6_addr_lst[i], addr_lst) {
 			unsigned long age;
 



^ permalink raw reply related

* Re: [PATCH] ixgb and e1000: Use new function for copybreak tests
From: Jeff Kirsher @ 2010-05-03 20:43 UTC (permalink / raw)
  To: Joe Perches
  Cc: e1000-devel, netdev, Bruce Allan, Jesse Brandeburg, LKML,
	John Ronciak
In-Reply-To: <1272847591.30040.15.camel@Joe-Laptop.home>

On Sun, May 2, 2010 at 17:46, Joe Perches <joe@perches.com> wrote:
> There appears to be an off-by-1 defect in the maximum packet size
> copied when copybreak is speified in these modules.
>
> The copybreak module params are specified as:
> "Maximum size of packet that is copied to a new buffer on receive"
>
> The tests are changed from "< copybreak" to "<= copybreak"
> and moved into new static functions for readability.
>
> Signed-off-by: Joe Perches <joe@perches.com>
> ---
>  drivers/net/e1000/e1000_main.c |   47 ++++++++++++++++++++---------------
>  drivers/net/ixgb/ixgb_main.c   |   52 +++++++++++++++++++++++----------------
>  2 files changed, 58 insertions(+), 41 deletions(-)
>

Thanks Joe, I have added the patch to my queue of patches.

-- 
Cheers,
Jeff

------------------------------------------------------------------------------
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel&#174; Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply

* Re: Fun with if_bridge.h and br_private.h
From: Stephen Hemminger @ 2010-05-03 20:36 UTC (permalink / raw)
  To: paulmck; +Cc: netdev, arnd
In-Reply-To: <20100503191256.GA18110@linux.vnet.ibm.com>

On Mon, 3 May 2010 12:12:56 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> Hello, Stephen,
> 
> I need some help with #include issues surrounding bridge.
> 
> Arnd has been putting together a sparse-based RCU-checking tool
> that can detect pointer access that should have been protected by an
> rcu_dereference() or rcu_assign_pointer().  On of the side-effects of
> his approach is that rcu_dereference() can no longer handle pointers
> to incomplete types, as Arnd's approach uses the address-space feature
> of sparse, which in turn must tag the type pointed to rather than the
> pointer itself.  And this in turn requires rcu_dereference() and friends
> to see the pointed-to type as well as the pointer type.
> 
> One place affected by this is handle_bridge() in net/core/dev.c, which
> invokes rcu_dereference() on skb->dev->br_port, which is of type struct
> net_bridge_port, which is defined in net/bridge/br_private.h.  This is,
> well, private, but is included into a couple places in netfilter and atm,
> so I tried including it into net/core/dev.c.
> 
> This still left me with forward references:
> 
> In file included from net/core/dev.c:104:
> include/linux/if_bridge.h:106: warning: "struct net_bridge_port" declared inside parameter list
> include/linux/if_bridge.h:106: warning: its scope is only this definition or declaration, which is probably not what you want
> net/core/dev.c:2331: error: conflicting types for "br_handle_frame_hook"
> include/linux/if_bridge.h:105: error: previous declaration of "br_handle_frame_hook" was here
> net/core/dev.c:2333: error: conflicting types for "br_handle_frame_hook"
> include/linux/if_bridge.h:105: error: previous declaration of "br_handle_frame_hook" was here
> 
> This happens because net/bridge/br_private.h includes if_bridge.h before
> it defines net_bridge_port.
> 
> Any thoughts on how best to allow handle_bridge() see the definition
> of struct net_bridge_port?
> 

Why not make it a void *, there is no reason to make core code depend
on br_private.h.

^ permalink raw reply

* Re: [PATCH] [RFC] C/R: inet4 and inet6 unicast routes (v2)
From: jamal @ 2010-05-03 20:34 UTC (permalink / raw)
  To: Dan Smith; +Cc: Daniel Lezcano, containers, Vlad Yasevich, netdev, David Miller
In-Reply-To: <87y6g1xe0h.fsf@caffeine.danplanet.com>

On Mon, 2010-05-03 at 07:21 -0700, Dan Smith wrote:

> The benefits of doing what we can in userspace are well-understood and
> arguing for doing so where it makes sense is, of course, a good idea.
> 
> However, it seems to me that the rtnl interface provides us a
> reasonable layer of isolation between us and such changes.  Am I
> wrong?  

I may not have made my point earlier:
Let me give you an example by looking at your migration attributes..

-----
+            __be32 inet4_len;          /* mask length (bits)*/
+            __u32  inet4_met;          /* metric            */
+            __be32 inet4_dst;          /* route address     */
+            __be32 inet4_gwy;          /* gateway address   */
-----

At some point i had a discussion with some folks on netdev where it
seemed valueable to add a fwmark to the route. If such is made, I dont
see what the motivation for whoever is codifying to add it to your
attributes so you can migrate the fwmark. One good motivation is to make
sure the main route code fails to compile if your attributes dont get
modified - this could happen if you re-use the same data structures as
the kernel etc.

> The rtnl messages appear to be rather generic and timeless,
> and in most cases have a significant amount of flexibility with
> respect to allowing advanced attributes to be ignored (which implies
> taking the default).

True - but you still need to worry about compat issues etc i.e when you
migrate to a remote kernel they better have the same features and kernel
config.. I am assuming this is not hard to impose on an admin.
Doing things in user space allows for doing more interesting things like
negotiating on capabilities etc

> In many other areas of C/R we're not so lucky and don't have a
> well-defined interface for dumping that information out of the
> kernel...

Maybe the answer is to start by formalizing that, not sure.

cheers,
jamal

^ permalink raw reply

* Re: [PATCH v2 1/2] ppp_generic: pull 2 bytes so that PPP_PROTO(skb) is valid
From: David Miller @ 2010-05-03 20:27 UTC (permalink / raw)
  To: simon; +Cc: netdev, paulus, linux-ppp
In-Reply-To: <4BDF2FD5.2030509@simon.arlott.org.uk>

From: Simon Arlott <simon@fire.lp0.eu>
Date: Mon, 03 May 2010 21:19:33 +0100

> In ppp_input(), PPP_PROTO(skb) may refer to invalid data in the skb.
> 
> If this happens and (proto >= 0xc000 || proto == PPP_CCPFRAG) then
> the packet is passed directly to pppd.
> 
> This occurs frequently when using PPPoE with an interface MTU
> greater than 1500 because the skb is more likely to be non-linear.
> 
> The next 2 bytes need to be pulled in ppp_input(). The pull of 2
> bytes in ppp_receive_frame() has been removed as it is no longer
> required.
> 
> Signed-off-by: Simon Arlott <simon@fire.lp0.eu>

Applied.

^ permalink raw reply

* Re: [PATCH v2 2/2] ppp_generic: handle non-linear skbs when passing them to pppd
From: David Miller @ 2010-05-03 20:27 UTC (permalink / raw)
  To: simon; +Cc: netdev, paulus, linux-ppp
In-Reply-To: <4BDF300B.1040103@simon.arlott.org.uk>

From: Simon Arlott <simon@fire.lp0.eu>
Date: Mon, 03 May 2010 21:20:27 +0100

> Frequently when using PPPoE with an interface MTU greater than 1500,
> the skb is likely to be non-linear. If the skb needs to be passed to
> pppd then the skb data must be read correctly.
> 
> The previous commit fixes an issue with accidentally sending skbs
> to pppd based on an invalid read of the protocol type. When that
> error occurred pppd was reading invalid skb data too.
> 
> Signed-off-by: Simon Arlott <simon@fire.lp0.eu>

Applied.

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: if6_get_next() fix
From: David Miller @ 2010-05-03 20:24 UTC (permalink / raw)
  To: eric.dumazet
  Cc: paulmck, shemminger, Valdis.Kletnieks, akpm, peterz, kaber,
	linux-kernel, netfilter-devel, netdev
In-Reply-To: <1272917635.2407.106.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 03 May 2010 22:13:55 +0200

> Le lundi 03 mai 2010 à 13:09 -0700, David Miller a écrit :
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Mon, 03 May 2010 21:46:47 +0200
>> 
>> > Then, net-next-2.6 doesnt yet have your commit Paul to relax
>> > hlist_for_each_entry_rcu(), so its a bit difficult to continue the work.
>> 
>> Is that in Linus's tree yet?  If it propagates there I can make it
>> propagate to net-next-2.6, you just have to tell me you need it.
> 
> 
> Hmm, it seems it's already in net-next-2.6, sorry for the confusion.

No problem, just let me know in the future if something upstream needs
to propagate.

^ permalink raw reply

* [PATCH v2 2/2] ppp_generic: handle non-linear skbs when passing them to pppd
From: Simon Arlott @ 2010-05-03 20:20 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, paulus, linux-ppp
In-Reply-To: <4BDF2FD5.2030509@simon.arlott.org.uk>

Frequently when using PPPoE with an interface MTU greater than 1500,
the skb is likely to be non-linear. If the skb needs to be passed to
pppd then the skb data must be read correctly.

The previous commit fixes an issue with accidentally sending skbs
to pppd based on an invalid read of the protocol type. When that
error occurred pppd was reading invalid skb data too.

Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
---
 drivers/net/ppp_generic.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ppp_generic.c b/drivers/net/ppp_generic.c
index 75e8903..8518a2e 100644
--- a/drivers/net/ppp_generic.c
+++ b/drivers/net/ppp_generic.c
@@ -405,6 +405,7 @@ static ssize_t ppp_read(struct file *file, char __user *buf,
 	DECLARE_WAITQUEUE(wait, current);
 	ssize_t ret;
 	struct sk_buff *skb = NULL;
+	struct iovec iov;
 
 	ret = count;
 
@@ -448,7 +449,9 @@ static ssize_t ppp_read(struct file *file, char __user *buf,
 	if (skb->len > count)
 		goto outf;
 	ret = -EFAULT;
-	if (copy_to_user(buf, skb->data, skb->len))
+	iov.iov_base = buf;
+	iov.iov_len = count;
+	if (skb_copy_datagram_iovec(skb, 0, &iov, skb->len))
 		goto outf;
 	ret = skb->len;
 
-- 
1.7.0.4

-- 
Simon Arlott

^ permalink raw reply related

* [PATCH v2 1/2] ppp_generic: pull 2 bytes so that PPP_PROTO(skb) is valid
From: Simon Arlott @ 2010-05-03 20:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, paulus, linux-ppp

In ppp_input(), PPP_PROTO(skb) may refer to invalid data in the skb.

If this happens and (proto >= 0xc000 || proto == PPP_CCPFRAG) then
the packet is passed directly to pppd.

This occurs frequently when using PPPoE with an interface MTU
greater than 1500 because the skb is more likely to be non-linear.

The next 2 bytes need to be pulled in ppp_input(). The pull of 2
bytes in ppp_receive_frame() has been removed as it is no longer
required.

Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
---
 drivers/net/ppp_generic.c |   29 ++++++++++++++++++-----------
 1 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ppp_generic.c b/drivers/net/ppp_generic.c
index 6e281bc..75e8903 100644
--- a/drivers/net/ppp_generic.c
+++ b/drivers/net/ppp_generic.c
@@ -1567,13 +1567,22 @@ ppp_input(struct ppp_channel *chan, struct sk_buff *skb)
 	struct channel *pch = chan->ppp;
 	int proto;
 
-	if (!pch || skb->len == 0) {
+	if (!pch) {
 		kfree_skb(skb);
 		return;
 	}
 
-	proto = PPP_PROTO(skb);
 	read_lock_bh(&pch->upl);
+	if (!pskb_may_pull(skb, 2)) {
+		kfree_skb(skb);
+		if (pch->ppp) {
+			++pch->ppp->dev->stats.rx_length_errors;
+			ppp_receive_error(pch->ppp);
+		}
+		goto done;
+	}
+
+	proto = PPP_PROTO(skb);
 	if (!pch->ppp || proto >= 0xc000 || proto == PPP_CCPFRAG) {
 		/* put it on the channel queue */
 		skb_queue_tail(&pch->file.rq, skb);
@@ -1585,6 +1594,8 @@ ppp_input(struct ppp_channel *chan, struct sk_buff *skb)
 	} else {
 		ppp_do_recv(pch->ppp, skb, pch);
 	}
+
+done:
 	read_unlock_bh(&pch->upl);
 }
 
@@ -1617,7 +1628,8 @@ ppp_input_error(struct ppp_channel *chan, int code)
 static void
 ppp_receive_frame(struct ppp *ppp, struct sk_buff *skb, struct channel *pch)
 {
-	if (pskb_may_pull(skb, 2)) {
+	/* note: a 0-length skb is used as an error indication */
+	if (skb->len > 0) {
 #ifdef CONFIG_PPP_MULTILINK
 		/* XXX do channel-level decompression here */
 		if (PPP_PROTO(skb) == PPP_MP)
@@ -1625,15 +1637,10 @@ ppp_receive_frame(struct ppp *ppp, struct sk_buff *skb, struct channel *pch)
 		else
 #endif /* CONFIG_PPP_MULTILINK */
 			ppp_receive_nonmp_frame(ppp, skb);
-		return;
+	} else {
+		kfree_skb(skb);
+		ppp_receive_error(ppp);
 	}
-
-	if (skb->len > 0)
-		/* note: a 0-length skb is used as an error indication */
-		++ppp->dev->stats.rx_length_errors;
-
-	kfree_skb(skb);
-	ppp_receive_error(ppp);
 }
 
 static void
-- 
1.7.0.4

-- 
Simon Arlott

^ permalink raw reply related

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: jamal @ 2010-05-03 20:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andi Kleen, David Miller, xiaosuo, therbert, shemminger, netdev,
	lenb, arjan
In-Reply-To: <1272838104.2173.166.camel@edumazet-laptop>

On Mon, 2010-05-03 at 00:08 +0200, Eric Dumazet wrote:

> 
> Test I did this week with Jamal.
> 
> We first set a "ee" rps mask, because all NIC interrupts were handled by
> CPU0, and Jamal thought like you, that not using cpu4 would give better
> performance.
> 
> But using "fe" mask gave me a bonus, from ~700.000 pps to ~800.000 pps
> 

I am seeing the opposite with my machine (Nehalem):
with ee i get 99.4% and fe i get 94.2% whereas non-rps
is about 98.1%.


cheers,
jamal

PS:- sorry dont have time to collect a lot more data - tommorow i could
do more.



^ permalink raw reply

* Re: [PATCH net-next-2.6] net: if6_get_next() fix
From: Eric Dumazet @ 2010-05-03 20:13 UTC (permalink / raw)
  To: David Miller
  Cc: paulmck, shemminger, Valdis.Kletnieks, akpm, peterz, kaber,
	linux-kernel, netfilter-devel, netdev
In-Reply-To: <20100503.130902.172620141.davem@davemloft.net>

Le lundi 03 mai 2010 à 13:09 -0700, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 03 May 2010 21:46:47 +0200
> 
> > Then, net-next-2.6 doesnt yet have your commit Paul to relax
> > hlist_for_each_entry_rcu(), so its a bit difficult to continue the work.
> 
> Is that in Linus's tree yet?  If it propagates there I can make it
> propagate to net-next-2.6, you just have to tell me you need it.


Hmm, it seems it's already in net-next-2.6, sorry for the confusion.

commit 3120438ad68601f341e61e7cb1323b0e1a6ca367
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Mon Feb 22 17:04:48 2010 -0800

    rcu: Disable lockdep checking in RCU list-traversal primitives
    
    The theory is that use of bare rcu_dereference() is more prone
    to error than use of the RCU list-traversal primitives.
    Therefore, disable lockdep RCU read-side critical-section
    checking in these primitives for the time being.  Once all of
    the rcu_dereference() uses have been dealt with, it may be time
    to re-enable lockdep checking for the RCU list-traversal
    primitives.
    


--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: jamal @ 2010-05-03 20:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272714966.14499.37.camel@bigi>

On Sat, 2010-05-01 at 07:56 -0400, jamal wrote:
> On Sat, 2010-05-01 at 13:42 +0200, Eric Dumazet wrote:
> 
> > But, whole point of epoll is to not change interest each time you get an
> > event.
> > 
> > Without EV_PERSIST, you need two more syscalls per recvfrom()
> > 
> > epoll_wait()
> >  epoll_ctl(REMOVE)
> >  epoll_ctl(ADD)
> >  recvfrom()
> > 
> > Even poll() would be faster in your case
> > 
> > poll(one fd)
> > recvfrom()
> > 
> 
> This is true - but my goal was/is to replicate the regression i was
> seeing[1]. 
> I will try with PERSIST next opportunity. If it gets better
> then it is something that needs documentation in the doc Tom
> promised ;->

I tried it with PERSIST and today's net-next and you are right:
rps was better compared with (99.4% vs 98.1% of 750Kpps).
If however i removed the PERSIST i.e both rps and non-rps
have two extra syscalls, again rps performed worse (93.2% vs 97.8%
of 750Kpps). Eric, I know the answer is not to do the non-PERSIST mode
for rps ;-> But lets just ignore that for a sec:
what the heck is going on? I would expect the degradation to be the same
for both non-rps. 
I also wanna do the broken record reminder that kernels before net-next
of Apr14 were doing about 97% (as opposed to 93% currently for same
test).

cheers,
jamal

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: if6_get_next() fix
From: David Miller @ 2010-05-03 20:09 UTC (permalink / raw)
  To: eric.dumazet
  Cc: paulmck, shemminger, Valdis.Kletnieks, akpm, peterz, kaber,
	linux-kernel, netfilter-devel, netdev
In-Reply-To: <1272916007.2407.75.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 03 May 2010 21:46:47 +0200

> Then, net-next-2.6 doesnt yet have your commit Paul to relax
> hlist_for_each_entry_rcu(), so its a bit difficult to continue the work.

Is that in Linus's tree yet?  If it propagates there I can make it
propagate to net-next-2.6, you just have to tell me you need it.

^ permalink raw reply

* Re: [RFC] network driver skb allocations
From: David Miller @ 2010-05-03 20:06 UTC (permalink / raw)
  To: bhutchings; +Cc: eric.dumazet, netdev
In-Reply-To: <1272916166.27948.62.camel@localhost>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Mon, 03 May 2010 20:49:26 +0100

> On Mon, 2010-05-03 at 19:06 +0200, Eric Dumazet wrote:
> [...]
>> Current logic for drivers is to :
>> 
>> allocate skbs (sk_buff + data) and put them in a ring buffer.
> 
> Not all of them.

In particular NIU always allocates SKBs at the time that it passes the
packet up to the stack.

^ permalink raw reply

* Re: [patch 01/13] KS8851: Fix ks8851 snl transmit problem
From: David Miller @ 2010-05-03 20:04 UTC (permalink / raw)
  To: Tristram.Ha; +Cc: ben, netdev, support
In-Reply-To: <14385191E87B904DBD836449AA30269D6DE63A@MORGANITE.micrel.com>

And btw, your commit message should have explained all of the things
you're telling me here.  Rather than just mention some vague "problem".

Your commit messages should explain everything about why you're
making the change, not just say what changes are being made.

^ permalink raw reply

* Re: [patch 01/13] KS8851: Fix ks8851 snl transmit problem
From: David Miller @ 2010-05-03 20:03 UTC (permalink / raw)
  To: Tristram.Ha; +Cc: ben, netdev, support
In-Reply-To: <14385191E87B904DBD836449AA30269D6DE63A@MORGANITE.micrel.com>

From: "Ha, Tristram" <Tristram.Ha@Micrel.Com>
Date: Mon, 3 May 2010 12:06:21 -0700

> The transmit done interrupt in the KSZ8851 chips is not required for
> normal operation.  Turning it off actually will improve transmit
> performance because the system will not be interrupted every time a
> packet is sent.

But you only trigger this workqueue when you notice in ->ndo_start_xmit()
that you're out of space.

This makes the chip sit idle with no packets to send until the workqueue
executes asynchronously to the initial transmit path which noticed the
queue was full.

That doesn't make any sense to me.  If anything you should at least try
to purge the TX queue and make space directly in the ->ndo_start_xmit()
handler.  And if that fails trigger an hrtimer to poll the TX state.

Without some kind of timer based polling mechanism, if the workqueue
finds the TX queue is still full, what's going to do more checks
later?  You will no longer get ->ndo_start_xmit() calls because the
queue has been marked full, so nothing will trigger the workqueue to
run any more.

^ permalink raw reply

* [PATCH] dm9601: fix phy/eeprom write routine
From: Peter Korsgaard @ 2010-05-03 20:01 UTC (permalink / raw)
  To: davem, netdev, michael.planes; +Cc: Peter Korsgaard, stable

Use correct bit positions in DM_SHARED_CTRL register for writes.

Michael Planes recently encountered a 'KY-RS9600 USB-LAN converter', which
came with a driver CD containing a Linux driver. This driver turns out to
be a copy of dm9601.c with symbols renamed and my copyright stripped.
That aside, it did contain 1 functional change in dm_write_shared_word(),
and after checking the datasheet the original value was indeed wrong
(read versus write bits).

On Michaels HW, this change bumps receive speed from ~30KB/s to ~900KB/s.
On other devices the difference is less spectacular, but still significant
(~30%).

Reported-by: Michael Planes <michael.planes@free.fr>
CC: stable@kernel.org
Signed-off-by: Peter Korsgaard <jacmet@sunsite.dk>
---
 drivers/net/usb/dm9601.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/usb/dm9601.c b/drivers/net/usb/dm9601.c
index 04b2810..5dfed92 100644
--- a/drivers/net/usb/dm9601.c
+++ b/drivers/net/usb/dm9601.c
@@ -240,7 +240,7 @@ static int dm_write_shared_word(struct usbnet *dev, int phy, u8 reg, __le16 valu
 		goto out;

 	dm_write_reg(dev, DM_SHARED_ADDR, phy ? (reg | 0x40) : reg);
-	dm_write_reg(dev, DM_SHARED_CTRL, phy ? 0x1c : 0x14);
+	dm_write_reg(dev, DM_SHARED_CTRL, phy ? 0x1a : 0x12);

 	for (i = 0; i < DM_TIMEOUT; i++) {
 		u8 tmp;
-- 
1.7.0

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox