Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH 2/2] bluetooth: static lock key fix
From: David Miller @ 2009-10-20  2:37 UTC (permalink / raw)
  To: hidave.darkstar-Re5JQEeQqe8AvxtiuMwx3w
  Cc: marcel-kz+m5ild9QBg9hUCZPvPmw, oliver-fJ+pQTUTwRTk1uMJSBkQmQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-bluetooth-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20091019062830.GB4102-4/PLUo9XfK+SVgrV+fD4Uw@public.gmane.org>

From: Dave Young <hidave.darkstar-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Date: Mon, 19 Oct 2009 14:28:30 +0800

> When shutdown ppp connection, lockdep waring about non-static key
> will happen, it is caused by the lock is not initialized properly
> at that time.
> 
> Fix with tuning the lock/skb_queue_head init order
 ...
> Reported-by: Oliver Hartkopp <oliver-fJ+pQTUTwRTk1uMJSBkQmQ@public.gmane.org>
> Signed-off-by: Dave Young <hidave.darkstar-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Tested-by: Oliver Hartkopp <oliver-fJ+pQTUTwRTk1uMJSBkQmQ@public.gmane.org>

Applied, thanks Dave.

^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Cyrill Gorcunov @ 2009-10-20  3:04 UTC (permalink / raw)
  To: Denys Fedoryschenko
  Cc: Michal Ostrowski, Eric Dumazet, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <200910200308.57381.denys@visp.net.lb>

On 10/20/09, Denys Fedoryschenko <denys@visp.net.lb> wrote:
> On Tuesday 20 October 2009 00:22:39 Michal Ostrowski wrote:
>> I'm assuming that there was a race in us sending patches at nearly the
>> same
>> time I'm convinced now that the flush_lock can die, and the patch I sent
>> out kills it.
> o_O
>
> I am drowning in patches. Just let me know which one to test :-)
>
Oh ;) Try out latest Michal's patch (and then mine). I'll continue
digg this issue at next spare time slot. Thanks!

^ permalink raw reply

* pktgen and spin_lock_bh in xmit path
From: Ben Greear @ 2009-10-20  3:38 UTC (permalink / raw)
  To: NetDev

I'm having strange issues when running pktgen on 10G interfaces while 
also running
pktgen on mac-vlans on that interface, when the mac-vlan pktgen threads 
are on a different
CPU.

First, lockdep gives up and says that things are not properly 
annotated.  I believe this is because
the macvlan tx path will lock it's txq and will also lock the 
lower-dev's txq.  To fix this, perhaps
we need some new lockdep aware primitives for netdev txq locking?

Second, is using _bh() locking really sufficient if we have pktgen 
writing to a physical device
and also have other pktgen threads writing to that same device though 
mac-vlans?   I'm seeing
deadlocks spinning on the _bh() lock in pktgen as well as strange 
corruptions, so I think there
must be *some* problem somewhere, I just don't know quite what it is yet.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com> 
Candela Technologies Inc  http://www.candelatech.com



^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Eric Dumazet @ 2009-10-20  3:42 UTC (permalink / raw)
  To: Michal Ostrowski
  Cc: Cyrill Gorcunov, Denys Fedoryschenko, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <e6d1cecd0910191354o16d023d2lbe2761a97e88acea@mail.gmail.com>

Michal Ostrowski a écrit :
> Access of po->pppoe_dev is guarded by sk->sk_state & PPPOX_CONNECTED,
> and all use cases now rely on the socket lock.  Because of this, the
> ref-count on the namespace held by the socket object suffices to hold
> the namespace in existence and so we don't need to ref-count the
> namespace in PPPoE. The flush_lock is gone.
> 

Seems good !

But can we use lock_sock() in __pppoe_xmit() context ?




^ permalink raw reply

* Re: pktgen and spin_lock_bh in xmit path
From: Eric Dumazet @ 2009-10-20  3:48 UTC (permalink / raw)
  To: Ben Greear; +Cc: NetDev
In-Reply-To: <4ADD309B.1040505@candelatech.com>

Ben Greear a écrit :
> I'm having strange issues when running pktgen on 10G interfaces while
> also running
> pktgen on mac-vlans on that interface, when the mac-vlan pktgen threads
> are on a different
> CPU.
> 
> First, lockdep gives up and says that things are not properly
> annotated.  I believe this is because
> the macvlan tx path will lock it's txq and will also lock the
> lower-dev's txq.  To fix this, perhaps
> we need some new lockdep aware primitives for netdev txq locking?
> 
> Second, is using _bh() locking really sufficient if we have pktgen
> writing to a physical device
> and also have other pktgen threads writing to that same device though
> mac-vlans?   I'm seeing
> deadlocks spinning on the _bh() lock in pktgen as well as strange
> corruptions, so I think there
> must be *some* problem somewhere, I just don't know quite what it is yet.
> 

Could you please give us a copy if your pktgen scripts ?

^ permalink raw reply

* Re: [PATCH] net: Fix IP_MULTICAST_IF
From: David Miller @ 2009-10-20  3:59 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <4ADC96D6.4000909@gmail.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 19 Oct 2009 18:41:58 +0200

> ipv4/ipv6 setsockopt(IP_MULTICAST_IF) have dubious __dev_get_by_index() calls.

Dubious, how so?

Yes, I know RTNL/dev_base_lock, but it's not using what it gets
back at all.

It's testing existence, a boolean, it doesn't dereference the
'dev' it gets back at all.

This code is intentional and perfectly fine.

^ permalink raw reply

* Re: [PATCH] net: Fix IP_MULTICAST_IF
From: Eric Dumazet @ 2009-10-20  4:07 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20091019.205948.193706797.davem@davemloft.net>

David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 19 Oct 2009 18:41:58 +0200
> 
>> ipv4/ipv6 setsockopt(IP_MULTICAST_IF) have dubious __dev_get_by_index() calls.
> 
> Dubious, how so?
> 
> Yes, I know RTNL/dev_base_lock, but it's not using what it gets
> back at all.
> 
> It's testing existence, a boolean, it doesn't dereference the
> 'dev' it gets back at all.
> 
> This code is intentional and perfectly fine.

If this was intentional, something changed and made the prereq false.

Final target might be fine, but an element in the chain, before target
could be deleted while reader scans hash chain.

/* Device list removal */
static void unlist_netdevice(struct net_device *dev)
{
        ASSERT_RTNL();

        /* Unlink dev from the device chain */
        write_lock_bh(&dev_base_lock);
        list_del(&dev->dev_list);
        hlist_del(&dev->name_hlist);
        hlist_del(&dev->index_hlist);   <<< HERE >>>
        write_unlock_bh(&dev_base_lock);
}


static inline void hlist_del(struct hlist_node *n)
{
        __hlist_del(n);
        n->next = LIST_POISON1;   <<< HERE >>>
        n->pprev = LIST_POISON2;
}
include/linux/poison.h:#define LIST_POISON1  ((void *) 0x00100100)

reader tries to pass over this delete net_device, doing a dev->index_hlist->next

#define hlist_for_each(pos, head) \
        for (pos = (head)->first; pos && ({ prefetch(pos->next); 1; }); \
             pos = pos->next)

So it should visit a nice memory location ?


^ permalink raw reply

* Re: [PATCH] net: Fix IP_MULTICAST_IF
From: Eric Dumazet @ 2009-10-20  4:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <4ADD3794.8030906@gmail.com>

Eric Dumazet a écrit :
> David Miller a écrit :
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Mon, 19 Oct 2009 18:41:58 +0200
>>
>>> ipv4/ipv6 setsockopt(IP_MULTICAST_IF) have dubious __dev_get_by_index() calls.
>> Dubious, how so?
>>
>> Yes, I know RTNL/dev_base_lock, but it's not using what it gets
>> back at all.
>>
>> It's testing existence, a boolean, it doesn't dereference the
>> 'dev' it gets back at all.
>>
>> This code is intentional and perfectly fine.
> 
> If this was intentional, something changed and made the prereq false.
> 
> Final target might be fine, but an element in the chain, before target
> could be deleted while reader scans hash chain.
> 

BTW, even an insertion can crash a lockless reader, since reader could see a corrupt
 n->next (hlist_add_head() has no barrier between n->next = first and h->first = n;)

static inline void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
        struct hlist_node *first = h->first;
        n->next = first;
        if (first)
                first->pprev = &n->next;
        h->first = n;
        n->pprev = &h->first;
}


^ permalink raw reply

* Re: [PATCH 4/4 v3] net: Fix for dst_negative_advice
From: David Miller @ 2009-10-20  4:17 UTC (permalink / raw)
  To: krkumar2; +Cc: shemminger, dada1, herbert, netdev
In-Reply-To: <OF47BDCCCB.33D1775A-ON65257654.00178901-65257654.00192AA2@in.ibm.com>

From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Mon, 19 Oct 2009 10:04:53 +0530

> Should I resubmit with the changed order?

I took care of this.

I put the patch that actually makes dev.c use sk_tx_queue_mapping
last in the set.

All applied, thanks everyone.

^ permalink raw reply

* Re: [PATCH] net: Fix IP_MULTICAST_IF
From: David Miller @ 2009-10-20  4:20 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <4ADD3794.8030906@gmail.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 20 Oct 2009 06:07:48 +0200

> Final target might be fine, but an element in the chain, before target
> could be deleted while reader scans hash chain.
 ...
> So it should visit a nice memory location ?

It should hit a NULL eventually and deterministically even if an
unlink happens at the same time..... unless the object gets free'd
meanwhile, hmmm...

This code is definitely intentional, I remember when I added it to
the tree, Alexey wrote it :-)

^ permalink raw reply

* Re: [PATCH] net: Fix IP_MULTICAST_IF
From: David Miller @ 2009-10-20  4:21 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <4ADD3982.2040100@gmail.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 20 Oct 2009 06:16:02 +0200

> BTW, even an insertion can crash a lockless reader, since reader
>  could see a corrupt n->next (hlist_add_head() has no barrier
>  between n->next = first and h->first = n;)

Ok, now that convinces it for me, I'll apply your patch, thanks!

^ permalink raw reply

* Re: [PATCH] net: Fix IP_MULTICAST_IF
From: Eric Dumazet @ 2009-10-20  4:23 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20091019.212018.79580287.davem@davemloft.net>

David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 20 Oct 2009 06:07:48 +0200
> 
>> Final target might be fine, but an element in the chain, before target
>> could be deleted while reader scans hash chain.
>  ...
>> So it should visit a nice memory location ?
> 
> It should hit a NULL eventually and deterministically even if an
> unlink happens at the same time..... unless the object gets free'd
> meanwhile, hmmm...
> 
> This code is definitely intentional, I remember when I added it to
> the tree, Alexey wrote it :-)

I wonder if the whole thing could use RCU somehow, since some workloads hit
this dev_base_lock rwlock pretty hard...


^ permalink raw reply

* Re: [PATCH] net: Fix IP_MULTICAST_IF
From: David Miller @ 2009-10-20  4:28 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <4ADD3B5A.1080905@gmail.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 20 Oct 2009 06:23:54 +0200

> I wonder if the whole thing could use RCU somehow, since some
> workloads hit this dev_base_lock rwlock pretty hard...

True, but for now we'll put your fix in :-)

^ permalink raw reply

* Re: pktgen and spin_lock_bh in xmit path
From: Ben Greear @ 2009-10-20  4:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: NetDev
In-Reply-To: <4ADD32FA.6030409@gmail.com>

Eric Dumazet wrote:
> Ben Greear a écrit :
>   
>> I'm having strange issues when running pktgen on 10G interfaces while
>> also running
>> pktgen on mac-vlans on that interface, when the mac-vlan pktgen threads
>> are on a different
>> CPU.
>>
>> First, lockdep gives up and says that things are not properly
>> annotated.  I believe this is because
>> the macvlan tx path will lock it's txq and will also lock the
>> lower-dev's txq.  To fix this, perhaps
>> we need some new lockdep aware primitives for netdev txq locking?
>>
>> Second, is using _bh() locking really sufficient if we have pktgen
>> writing to a physical device
>> and also have other pktgen threads writing to that same device though
>> mac-vlans?   I'm seeing
>> deadlocks spinning on the _bh() lock in pktgen as well as strange
>> corruptions, so I think there
>> must be *some* problem somewhere, I just don't know quite what it is yet.
>>
>>     
>
> Could you please give us a copy if your pktgen scripts ?
>   
I'm driving it with another program, and my pktgen is a bit hacked, but 
the basic idea is:

1 pktgen connection on cpu 0 running as fast as it can (trying for 
10Gbps, but getting maybe 3-4),
  running between two 10G ports (intel 82599).
  Multi-pkt is set to 10,000 on each side.
3 pairs of mac-vlans on each of the two physical 10G ports.
 3 pktgen 'connections' between these..each are running at about 1Gbps.
 These 3 pktgen connections are on CPU 4.
 Multi-pkt is set to 1 since multi-pkt is a very bad idea on virtual 
devices.

1514 byte pkts.  No IPs on the interfaces, using ToS in pktgen, but 
nothing else is configured to
care.

The two physical ports are cabled together directly with a fibre cable.

All pktgen connections are full duplex (both sides transmitting to each 
other..and I have
rx logic to gather stats on received pkts as well).  With no kernel 
debugging, this can run right at 10Gbps bi-directional,
with lockdep it gets around 5-6Gbps in each direction.

The lockup often occurs near starting/stopping pktgen, but also happens 
while just normally
running under load, usually within 10 minutes.

I tried and failed to reproduce this on a 1G network, but maybe I'm just 
not getting (un)lucky,
didn't try for too long.

Among other things, it appears as if the mac-vlan interfaces sometimes 
become locked to transmit
by pktgen, but a raw socket in user-space can send on them fine.  I'm 
going to add some debugging
for this particular issue tomorrow to try to figure out why that happens.

Please note I have the rest of my network patches applied (but not using 
any proprietary modules),
so it could easily be something I've caused.  I think fixing lockdep to 
work with the txq _bh locks
would be a good first step to fixing this..

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com> 
Candela Technologies Inc  http://www.candelatech.com



^ permalink raw reply

* Re: [PATCH v2] can: provide library functions for skb allocation
From: David Miller @ 2009-10-20  4:54 UTC (permalink / raw)
  To: wg; +Cc: netdev, socketcan-core, haas, anantgole, mkl
In-Reply-To: <4ADB6243.2040109@grandegger.com>

From: Wolfgang Grandegger <wg@grandegger.com>
Date: Sun, 18 Oct 2009 20:45:23 +0200

> This patch makes the private functions alloc_can_skb() and
> alloc_can_err_skb() of the at91_can driver public and adapts all
> drivers to use these. While making the patch I realized, that
> the skb's are *not* setup consistently. It's now done as shown
> below:
> 
>   skb->protocol = htons(ETH_P_CAN);
>   skb->pkt_type = PACKET_BROADCAST;
>   skb->ip_summed = CHECKSUM_UNNECESSARY;
>   *cf = (struct can_frame *)skb_put(skb, sizeof(struct can_frame));
>   memset(*cf, 0, sizeof(struct can_frame));
> 
> The frame is zeroed out to avoid uninitialized data to be passed to
> user space. Some drivers or library code did not set "pkt_type" or 
> "ip_summed". Also,  "__constant_htons()" should not be used for
> runtime invocations, as pointed out by David Miller.
> 
> Signed-off-by: Wolfgang Grandegger <wg@grandegger.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next V2 1/3] iwmc3200top: Add Intel Wireless MultiCom 3200 top driver.
From: David Miller @ 2009-10-20  4:54 UTC (permalink / raw)
  To: tomas.winkler
  Cc: linville, netdev, linux-wireless, linux-mmc, yi.zhu,
	inaky.perez-gonzalez, cindy.h.kao, guy.cohen, ron.rindjunsky
In-Reply-To: <1255806576-26869-1-git-send-email-tomas.winkler@intel.com>

From: Tomas Winkler <tomas.winkler@intel.com>
Date: Sat, 17 Oct 2009 21:09:34 +0200

> This patch adds Intel Wireless MultiCom 3200 top driver.
> IWMC3200 is 4Wireless Com CHIP (GPS/BT/WiFi/WiMAX).
> Top driver is responsible for device initialization and firmware download.
> Firmware handled by top is responsible for top itself and
> as well as bluetooth and GPS coms. (Wifi and WiMax provide their own firmware)
> In addition top driver is used to retrieve firmware logs
> and supports other debugging features
> 
> Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>

Applied to net-next-2.6

^ permalink raw reply

* Re: [PATCH net-next V2 2/3] iwmc3200wifi: select IWMC3200TOP in Kconfig
From: David Miller @ 2009-10-20  4:55 UTC (permalink / raw)
  To: tomas.winkler
  Cc: linville, netdev, linux-wireless, linux-mmc, yi.zhu,
	inaky.perez-gonzalez, cindy.h.kao, guy.cohen, ron.rindjunsky
In-Reply-To: <1255806576-26869-2-git-send-email-tomas.winkler@intel.com>

From: Tomas Winkler <tomas.winkler@intel.com>
Date: Sat, 17 Oct 2009 21:09:35 +0200

> iwmc3200wifi requires iwmc3200top  for its operation
> 
> Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
> Acked-by: Zhu Yi <yi.zhu@intel.com>

Applied to net-next-2.6

^ permalink raw reply

* Re: PATCH [net-next-2.6] IP: Cleanups
From: David Miller @ 2009-10-20  4:55 UTC (permalink / raw)
  To: john.dykstra1; +Cc: netdev
In-Reply-To: <1255980690.20673.5.camel@Maple>

From: John Dykstra <john.dykstra1@gmail.com>
Date: Mon, 19 Oct 2009 14:31:30 -0500

> Use symbols instead of magic constants while checking PMTU discovery
> setsockopt.
> 
> Remove redundant test in ip_rt_frag_needed() (done by caller).
> 
> Signed-off-by: John Dykstra <john.dykstra1@gmail.com>

Applied to net-next-2.6, thanks.

^ permalink raw reply

* Re: [PATCH net-next V2 3/3] i2400m-sdio: select IWMC3200TOP in Kconfig
From: David Miller @ 2009-10-20  4:55 UTC (permalink / raw)
  To: tomas.winkler
  Cc: linville, netdev, linux-wireless, linux-mmc, yi.zhu,
	inaky.perez-gonzalez, cindy.h.kao, guy.cohen, ron.rindjunsky
In-Reply-To: <1255806576-26869-3-git-send-email-tomas.winkler@intel.com>

From: Tomas Winkler <tomas.winkler@intel.com>
Date: Sat, 17 Oct 2009 21:09:36 +0200

> i2400m-sdio requires iwmc3200top for its operation
> 
> Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
> Acked-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>

Applied to net-next-2.6, thanks!

^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Cyrill Gorcunov @ 2009-10-20  5:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michal Ostrowski, Denys Fedoryschenko, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <4ADD31A2.4030702@gmail.com>

On 10/20/09, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Michal Ostrowski a écrit :
>> Access of po->pppoe_dev is guarded by sk->sk_state & PPPOX_CONNECTED,
>> and all use cases now rely on the socket lock.  Because of this, the
>> ref-count on the namespace held by the socket object suffices to hold
>> the namespace in existence and so we don't need to ref-count the
>> namespace in PPPoE. The flush_lock is gone.
>>
>
> Seems good !
>
> But can we use lock_sock() in __pppoe_xmit() context ?
>

Eric, most probably i miss something, but how lock sock protect us
from mtu changed via sysfs. This action calls change mtu notifier
which doesn't care about sockets at all...

^ permalink raw reply

* [PATCH net-next-2.6] net: Introduce dev_get_by_index_rcu()
From: Eric Dumazet @ 2009-10-20  5:03 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20091019.212855.179405364.davem@davemloft.net>

David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 20 Oct 2009 06:23:54 +0200
> 
>> I wonder if the whole thing could use RCU somehow, since some
>> workloads hit this dev_base_lock rwlock pretty hard...
> 
> True, but for now we'll put your fix in :-)

[PATCH net-next-2.6] net: Introduce dev_get_by_index_rcu()

Some workloads hit dev_base_lock rwlock pretty hard.
We can use RCU lookups to avoid touching this rwlock.

netdevices are already freed after a RCU grace period, so this patch
adds no penalty at device dismantle time.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/netdevice.h |    1
 net/core/dev.c            |   40 ++++++++++++++++++++++++++++++------
 2 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8380009..4eda680 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1127,6 +1127,7 @@ extern void		netdev_resync_ops(struct net_device *dev);
 extern int call_netdevice_notifiers(unsigned long val, struct net_device *dev);
 extern struct net_device	*dev_get_by_index(struct net *net, int ifindex);
 extern struct net_device	*__dev_get_by_index(struct net *net, int ifindex);
+extern struct net_device	*dev_get_by_index_rcu(struct net *net, int ifindex);
 extern int		dev_restart(struct net_device *dev);
 #ifdef CONFIG_NETPOLL_TRAP
 extern int		netpoll_trap(void);
diff --git a/net/core/dev.c b/net/core/dev.c
index 28b0b9e..cb011b7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -217,12 +217,15 @@ static int list_netdevice(struct net_device *dev)
 	write_lock_bh(&dev_base_lock);
 	list_add_tail(&dev->dev_list, &net->dev_base_head);
 	hlist_add_head(&dev->name_hlist, dev_name_hash(net, dev->name));
-	hlist_add_head(&dev->index_hlist, dev_index_hash(net, dev->ifindex));
+	hlist_add_head_rcu(&dev->index_hlist,
+			   dev_index_hash(net, dev->ifindex));
 	write_unlock_bh(&dev_base_lock);
 	return 0;
 }
 
-/* Device list removal */
+/* Device list removal
+ * caller must respect a RCU grace period before freeing/reusing dev
+ */
 static void unlist_netdevice(struct net_device *dev)
 {
 	ASSERT_RTNL();
@@ -231,7 +234,7 @@ static void unlist_netdevice(struct net_device *dev)
 	write_lock_bh(&dev_base_lock);
 	list_del(&dev->dev_list);
 	hlist_del(&dev->name_hlist);
-	hlist_del(&dev->index_hlist);
+	hlist_del_rcu(&dev->index_hlist);
 	write_unlock_bh(&dev_base_lock);
 }
 
@@ -649,6 +652,31 @@ struct net_device *__dev_get_by_index(struct net *net, int ifindex)
 }
 EXPORT_SYMBOL(__dev_get_by_index);
 
+/**
+ *	dev_get_by_index_rcu - find a device by its ifindex
+ *	@net: the applicable net namespace
+ *	@ifindex: index of device
+ *
+ *	Search for an interface by index. Returns %NULL if the device
+ *	is not found or a pointer to the device. The device has not
+ *	had its reference counter increased so the caller must be careful
+ *	about locking. The caller must hold RCU lock.
+ */
+
+struct net_device *dev_get_by_index_rcu(struct net *net, int ifindex)
+{
+	struct hlist_node *p;
+	struct net_device *dev;
+	struct hlist_head *head = dev_index_hash(net, ifindex);
+
+	hlist_for_each_entry_rcu(dev, p, head, index_hlist)
+		if (dev->ifindex == ifindex)
+			return dev;
+
+	return NULL;
+}
+EXPORT_SYMBOL(dev_get_by_index_rcu);
+
 
 /**
  *	dev_get_by_index - find a device by its ifindex
@@ -665,11 +693,11 @@ struct net_device *dev_get_by_index(struct net *net, int ifindex)
 {
 	struct net_device *dev;
 
-	read_lock(&dev_base_lock);
-	dev = __dev_get_by_index(net, ifindex);
+	rcu_read_lock();
+	dev = dev_get_by_index_rcu(net, ifindex);
 	if (dev)
 		dev_hold(dev);
-	read_unlock(&dev_base_lock);
+	rcu_read_unlock();
 	return dev;
 }
 EXPORT_SYMBOL(dev_get_by_index);

^ permalink raw reply related

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Eric Dumazet @ 2009-10-20  5:05 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Michal Ostrowski, Denys Fedoryschenko, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <aa79d98a0910192202j4ea9f189g2ff719d57aa5a5eb@mail.gmail.com>

Cyrill Gorcunov a écrit :
> On 10/20/09, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> Michal Ostrowski a écrit :
>>> Access of po->pppoe_dev is guarded by sk->sk_state & PPPOX_CONNECTED,
>>> and all use cases now rely on the socket lock.  Because of this, the
>>> ref-count on the namespace held by the socket object suffices to hold
>>> the namespace in existence and so we don't need to ref-count the
>>> namespace in PPPoE. The flush_lock is gone.
>>>
>> Seems good !
>>
>> But can we use lock_sock() in __pppoe_xmit() context ?
>>
> 
> Eric, most probably i miss something, but how lock sock protect us
> from mtu changed via sysfs. This action calls change mtu notifier
> which doesn't care about sockets at all...

This ultimately calls pppoe_flush_dev() and this function
takes care of taking appropriate sock_locks() on each sockets ?

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: Introduce dev_get_by_index_rcu()
From: Stephen Hemminger @ 2009-10-20  5:06 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <4ADD44B0.8030204@gmail.com>

On Tue, 20 Oct 2009 07:03:44 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> David Miller a écrit :
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Tue, 20 Oct 2009 06:23:54 +0200
> > 
> >> I wonder if the whole thing could use RCU somehow, since some
> >> workloads hit this dev_base_lock rwlock pretty hard...
> > 
> > True, but for now we'll put your fix in :-)
> 
> [PATCH net-next-2.6] net: Introduce dev_get_by_index_rcu()
> 
> Some workloads hit dev_base_lock rwlock pretty hard.
> We can use RCU lookups to avoid touching this rwlock.
> 
> netdevices are already freed after a RCU grace period, so this patch
> adds no penalty at device dismantle time.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

All usage dev_base_lock should be replaceable by using combination of rtnl_mutex
and RCU?

^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Cyrill Gorcunov @ 2009-10-20  5:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michal Ostrowski, Denys Fedoryschenko, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <4ADD4518.8020909@gmail.com>

On 10/20/09, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Cyrill Gorcunov a écrit :
>> On 10/20/09, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> Michal Ostrowski a écrit :
>>>> Access of po->pppoe_dev is guarded by sk->sk_state & PPPOX_CONNECTED,
>>>> and all use cases now rely on the socket lock.  Because of this, the
>>>> ref-count on the namespace held by the socket object suffices to hold
>>>> the namespace in existence and so we don't need to ref-count the
>>>> namespace in PPPoE. The flush_lock is gone.
>>>>
>>> Seems good !
>>>
>>> But can we use lock_sock() in __pppoe_xmit() context ?
>>>
>>
>> Eric, most probably i miss something, but how lock sock protect us
>> from mtu changed via sysfs. This action calls change mtu notifier
>> which doesn't care about sockets at all...
>
> This ultimately calls pppoe_flush_dev() and this function
> takes care of taking appropriate sock_locks() on each sockets ?
>
This hold and lock socks but set pppoe_dev to null as well. I'll back
later. And i need to reread the code.

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: Introduce dev_get_by_index_rcu()
From: Eric Dumazet @ 2009-10-20  5:18 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, netdev
In-Reply-To: <20091020140632.79efb738@s6510>

Stephen Hemminger a écrit :
> On Tue, 20 Oct 2009 07:03:44 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
>> David Miller a écrit :
>>> From: Eric Dumazet <eric.dumazet@gmail.com>
>>> Date: Tue, 20 Oct 2009 06:23:54 +0200
>>>
>>>> I wonder if the whole thing could use RCU somehow, since some
>>>> workloads hit this dev_base_lock rwlock pretty hard...
>>> True, but for now we'll put your fix in :-)
>> [PATCH net-next-2.6] net: Introduce dev_get_by_index_rcu()
>>
>> Some workloads hit dev_base_lock rwlock pretty hard.
>> We can use RCU lookups to avoid touching this rwlock.
>>
>> netdevices are already freed after a RCU grace period, so this patch
>> adds no penalty at device dismantle time.
>>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> All usage dev_base_lock should be replaceable by using combination of rtnl_mutex
> and RCU?

Yes probably, but I believe we should make step-by-step patches ?

1) __dev_get_by_index() is faster than dev_get_by_index_rcu()

2) I am not sure holding RTNL means we also have rcu_lock() implied.

However dev_ifname() could use rcu_lock() in the same patch, 
here is an updated version.

[PATCH net-next-2.6] net: Introduce dev_get_by_index_rcu()

Some workloads hit dev_base_lock rwlock pretty hard.
We can use RCU lookups to avoid touching this rwlock.

netdevices are already freed after a RCU grace period, so this patch
adds no penalty at device dismantle time.

dev_ifname() converted to dev_get_by_index_rcu()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/netdevice.h |    1
 net/core/dev.c            |   48 ++++++++++++++++++++++++++++--------
 2 files changed, 39 insertions(+), 10 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8380009..4eda680 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1127,6 +1127,7 @@ extern void		netdev_resync_ops(struct net_device *dev);
 extern int call_netdevice_notifiers(unsigned long val, struct net_device *dev);
 extern struct net_device	*dev_get_by_index(struct net *net, int ifindex);
 extern struct net_device	*__dev_get_by_index(struct net *net, int ifindex);
+extern struct net_device	*dev_get_by_index_rcu(struct net *net, int ifindex);
 extern int		dev_restart(struct net_device *dev);
 #ifdef CONFIG_NETPOLL_TRAP
 extern int		netpoll_trap(void);
diff --git a/net/core/dev.c b/net/core/dev.c
index 28b0b9e..4564596 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -217,12 +217,15 @@ static int list_netdevice(struct net_device *dev)
 	write_lock_bh(&dev_base_lock);
 	list_add_tail(&dev->dev_list, &net->dev_base_head);
 	hlist_add_head(&dev->name_hlist, dev_name_hash(net, dev->name));
-	hlist_add_head(&dev->index_hlist, dev_index_hash(net, dev->ifindex));
+	hlist_add_head_rcu(&dev->index_hlist,
+			   dev_index_hash(net, dev->ifindex));
 	write_unlock_bh(&dev_base_lock);
 	return 0;
 }
 
-/* Device list removal */
+/* Device list removal
+ * caller must respect a RCU grace period before freeing/reusing dev
+ */
 static void unlist_netdevice(struct net_device *dev)
 {
 	ASSERT_RTNL();
@@ -231,7 +234,7 @@ static void unlist_netdevice(struct net_device *dev)
 	write_lock_bh(&dev_base_lock);
 	list_del(&dev->dev_list);
 	hlist_del(&dev->name_hlist);
-	hlist_del(&dev->index_hlist);
+	hlist_del_rcu(&dev->index_hlist);
 	write_unlock_bh(&dev_base_lock);
 }
 
@@ -649,6 +652,31 @@ struct net_device *__dev_get_by_index(struct net *net, int ifindex)
 }
 EXPORT_SYMBOL(__dev_get_by_index);
 
+/**
+ *	dev_get_by_index_rcu - find a device by its ifindex
+ *	@net: the applicable net namespace
+ *	@ifindex: index of device
+ *
+ *	Search for an interface by index. Returns %NULL if the device
+ *	is not found or a pointer to the device. The device has not
+ *	had its reference counter increased so the caller must be careful
+ *	about locking. The caller must hold RCU lock.
+ */
+
+struct net_device *dev_get_by_index_rcu(struct net *net, int ifindex)
+{
+	struct hlist_node *p;
+	struct net_device *dev;
+	struct hlist_head *head = dev_index_hash(net, ifindex);
+
+	hlist_for_each_entry_rcu(dev, p, head, index_hlist)
+		if (dev->ifindex == ifindex)
+			return dev;
+
+	return NULL;
+}
+EXPORT_SYMBOL(dev_get_by_index_rcu);
+
 
 /**
  *	dev_get_by_index - find a device by its ifindex
@@ -665,11 +693,11 @@ struct net_device *dev_get_by_index(struct net *net, int ifindex)
 {
 	struct net_device *dev;
 
-	read_lock(&dev_base_lock);
-	dev = __dev_get_by_index(net, ifindex);
+	rcu_read_lock();
+	dev = dev_get_by_index_rcu(net, ifindex);
 	if (dev)
 		dev_hold(dev);
-	read_unlock(&dev_base_lock);
+	rcu_read_unlock();
 	return dev;
 }
 EXPORT_SYMBOL(dev_get_by_index);
@@ -2930,15 +2958,15 @@ static int dev_ifname(struct net *net, struct ifreq __user *arg)
 	if (copy_from_user(&ifr, arg, sizeof(struct ifreq)))
 		return -EFAULT;
 
-	read_lock(&dev_base_lock);
-	dev = __dev_get_by_index(net, ifr.ifr_ifindex);
+	rcu_read_lock();
+	dev = dev_get_by_index_rcu(net, ifr.ifr_ifindex);
 	if (!dev) {
-		read_unlock(&dev_base_lock);
+		rcu_read_unlock();
 		return -ENODEV;
 	}
 
 	strcpy(ifr.ifr_name, dev->name);
-	read_unlock(&dev_base_lock);
+	rcu_read_unlock();
 
 	if (copy_to_user(arg, &ifr, sizeof(struct ifreq)))
 		return -EFAULT;


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox