Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next-2.6] phonet: use call_rcu for phonet device free
From: Eric Dumazet @ 2010-06-07 13:49 UTC (permalink / raw)
  To: Rémi Denis-Courmont; +Cc: Jiri Pirko, netdev, davem
In-Reply-To: <474f08fed4a406e929af3d4142d3e185@chewa.net>

Le lundi 07 juin 2010 à 15:43 +0200, Rémi Denis-Courmont a écrit :
> On Mon, 7 Jun 2010 15:27:39 +0200, Jiri Pirko <jpirko@redhat.com> wrote:
> > Use call_rcu rather than synchronize_rcu.
> > 
> > Signed-off-by: Jiri Pirko <jpirko@redhat.com>
> 
> This looks fine to me, but what is the goal here? The RCU documentation
> seems to imply that synchronize_rcu() is preferable over call_rcu() when at
> all possible.
> 

Thats not exactly that.

synchronize_rcu() is easier, in respect of memory use.
But its drawback is current thread is blocked for several milli seconds.

In the end, call_rcu() is more scalable.

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>



^ permalink raw reply

* Re: [PATCH net-next-2.6] phonet: use call_rcu for phonet device free
From: Rémi Denis-Courmont @ 2010-06-07 13:43 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, davem
In-Reply-To: <20100607132738.GB2730@psychotron.lab.eng.brq.redhat.com>


On Mon, 7 Jun 2010 15:27:39 +0200, Jiri Pirko <jpirko@redhat.com> wrote:
> Use call_rcu rather than synchronize_rcu.
> 
> Signed-off-by: Jiri Pirko <jpirko@redhat.com>

This looks fine to me, but what is the goal here? The RCU documentation
seems to imply that synchronize_rcu() is preferable over call_rcu() when at
all possible.

-- 
Rémi Denis-Courmont
http://www.remlab.net
http://fi.linkedin.com/in/remidenis


^ permalink raw reply

* Re: Virtual device and ARP table
From: Eric Dumazet @ 2010-06-07 13:30 UTC (permalink / raw)
  To: Christophe Jelger; +Cc: netdev
In-Reply-To: <4C0CEE3E.50100@unibas.ch>

Le lundi 07 juin 2010 à 15:03 +0200, Christophe Jelger a écrit :

> 
> Eric: thanks for the forward to the netdev list. Regarding the code, I 
> of course welcome any help but didn't want to pollute the list with 
> unsollicited code: I can of course of course send it directly to anyone 
> who is willing to help (I can easily reproduce the problem on different 
> machines).
> 

Christophe,

Unless patch is really huge, its ok to send it on netdev, with a RFC
label, so that only people with free time take a look, eventually.

[RFC] lunar: ....



^ permalink raw reply

* [PATCH net-next-2.6] phonet: use call_rcu for phonet device free
From: Jiri Pirko @ 2010-06-07 13:27 UTC (permalink / raw)
  To: netdev; +Cc: davem, remi.denis-courmont

Use call_rcu rather than synchronize_rcu.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
---
 include/net/phonet/pn_dev.h |    1 +
 net/phonet/pn_dev.c         |   15 +++++++++++----
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/include/net/phonet/pn_dev.h b/include/net/phonet/pn_dev.h
index d7b989c..2d16783 100644
--- a/include/net/phonet/pn_dev.h
+++ b/include/net/phonet/pn_dev.h
@@ -34,6 +34,7 @@ struct phonet_device {
 	struct list_head list;
 	struct net_device *netdev;
 	DECLARE_BITMAP(addrs, 64);
+	struct rcu_head	rcu;
 };
 
 int phonet_device_init(void);
diff --git a/net/phonet/pn_dev.c b/net/phonet/pn_dev.c
index c33da65..b18e48f 100644
--- a/net/phonet/pn_dev.c
+++ b/net/phonet/pn_dev.c
@@ -162,6 +162,14 @@ int phonet_address_add(struct net_device *dev, u8 addr)
 	return err;
 }
 
+static void phonet_device_rcu_free(struct rcu_head *head)
+{
+	struct phonet_device *pnd;
+
+	pnd = container_of(head, struct phonet_device, rcu);
+	kfree(pnd);
+}
+
 int phonet_address_del(struct net_device *dev, u8 addr)
 {
 	struct phonet_device_list *pndevs = phonet_device_list(dev_net(dev));
@@ -179,10 +187,9 @@ int phonet_address_del(struct net_device *dev, u8 addr)
 		pnd = NULL;
 	mutex_unlock(&pndevs->lock);
 
-	if (pnd) {
-		synchronize_rcu();
-		kfree(pnd);
-	}
+	if (pnd)
+		call_rcu(&pnd->rcu, phonet_device_rcu_free);
+
 	return err;
 }
 
-- 
1.7.0.1


^ permalink raw reply related

* [PATCH net-next-2.6] ipv4: avoid two atomic ops in ip_rt_redirect()
From: Eric Dumazet @ 2010-06-07 13:23 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

in_dev_get() -> __in_dev_get_rcu() in a rcu protected function.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/route.c |    9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 7b8eacd..0c30534 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1343,11 +1343,12 @@ static void rt_del(unsigned hash, struct rtable *rt)
 	spin_unlock_bh(rt_hash_lock_addr(hash));
 }
 
+/* called in rcu_read_lock() section */
 void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
 		    __be32 saddr, struct net_device *dev)
 {
 	int i, k;
-	struct in_device *in_dev = in_dev_get(dev);
+	struct in_device *in_dev = __in_dev_get_rcu(dev);
 	struct rtable *rth, **rthp;
 	__be32  skeys[2] = { saddr, 0 };
 	int  ikeys[2] = { dev->ifindex, 0 };
@@ -1383,7 +1384,6 @@ void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
 
 			rthp=&rt_hash_table[hash].chain;
 
-			rcu_read_lock();
 			while ((rth = rcu_dereference(*rthp)) != NULL) {
 				struct rtable *rt;
 
@@ -1405,12 +1405,10 @@ void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
 					break;
 
 				dst_hold(&rth->u.dst);
-				rcu_read_unlock();
 
 				rt = dst_alloc(&ipv4_dst_ops);
 				if (rt == NULL) {
 					ip_rt_put(rth);
-					in_dev_put(in_dev);
 					return;
 				}
 
@@ -1463,12 +1461,10 @@ void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
 					ip_rt_put(rt);
 				goto do_next;
 			}
-			rcu_read_unlock();
 		do_next:
 			;
 		}
 	}
-	in_dev_put(in_dev);
 	return;
 
 reject_redirect:
@@ -1479,7 +1475,6 @@ reject_redirect:
 		       &old_gw, dev->name, &new_gw,
 		       &saddr, &daddr);
 #endif
-	in_dev_put(in_dev);
 }
 
 static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst)



^ permalink raw reply related

* [PATCH net-next-2.6] igmp: avoid two atomic ops in igmp_rcv()
From: Eric Dumazet @ 2010-06-07 13:17 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

in_dev_get() -> __in_dev_get_rcu() in a rcu protected function.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/igmp.c |   10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 250cb5e..3294f54 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -916,18 +916,19 @@ static void igmp_heard_query(struct in_device *in_dev, struct sk_buff *skb,
 	read_unlock(&in_dev->mc_list_lock);
 }
 
+/* called in rcu_read_lock() section */
 int igmp_rcv(struct sk_buff *skb)
 {
 	/* This basically follows the spec line by line -- see RFC1112 */
 	struct igmphdr *ih;
-	struct in_device *in_dev = in_dev_get(skb->dev);
+	struct in_device *in_dev = __in_dev_get_rcu(skb->dev);
 	int len = skb->len;
 
 	if (in_dev == NULL)
 		goto drop;
 
 	if (!pskb_may_pull(skb, sizeof(struct igmphdr)))
-		goto drop_ref;
+		goto drop;
 
 	switch (skb->ip_summed) {
 	case CHECKSUM_COMPLETE:
@@ -937,7 +938,7 @@ int igmp_rcv(struct sk_buff *skb)
 	case CHECKSUM_NONE:
 		skb->csum = 0;
 		if (__skb_checksum_complete(skb))
-			goto drop_ref;
+			goto drop;
 	}
 
 	ih = igmp_hdr(skb);
@@ -957,7 +958,6 @@ int igmp_rcv(struct sk_buff *skb)
 		break;
 	case IGMP_PIM:
 #ifdef CONFIG_IP_PIMSM_V1
-		in_dev_put(in_dev);
 		return pim_rcv_v1(skb);
 #endif
 	case IGMPV3_HOST_MEMBERSHIP_REPORT:
@@ -971,8 +971,6 @@ int igmp_rcv(struct sk_buff *skb)
 		break;
 	}
 
-drop_ref:
-	in_dev_put(in_dev);
 drop:
 	kfree_skb(skb);
 	return 0;



^ permalink raw reply related

* Re: Virtual device and ARP table
From: Christophe Jelger @ 2010-06-07 13:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1275913320.2545.53.camel@edumazet-laptop>

Eric Dumazet wrote:
> Le lundi 07 juin 2010 à 12:21 +0200, Christophe Jelger a écrit :
>> Hello,
>>
>> I am currently "resurrecting" a Linux module (called LUNAR) which I 
>> co-developed in 2007 and I'm having a weird kernel crash. This code 
>> basically used to work fine up to 2.6.18 which was the latest version 
>> before we stopped our development. I quickly ported it to 2.6.{31,32}: 
>> it compiles fine and loads fine, but it crashes/hangs the kernel when 
>> it's really being used.
>>
>> The module is a virtual device used for MANET routing: with the current 
>> version, it basically "captures" DNS requests sent to the virtual 
>> interface --> this triggers the sending of a fake DNS reply (see below) 
>> and the creation of an ARP table entry for the destination (the MANET 
>> route is built at the same time). Packets can then be sent to the 
>> destination.
>>
>> The problem I'm having is that the kernel quickly hangs after I create a 
>> new ARP entry (actually only if it's being used). If the entry I create 
>> is set to NUD_PERMANENT, then everything works fine! I use 
>> __neigh_lookup_errno to lookup/create the entry and neigh_lookup to 
>> set/update the MAC address. Note that the ARP entry is created without 
>> problem, but typically even just doing a userspace "arp -a" command can 
>> crash the kernel (it also hangs the userspace command!). Doing "arp -na" 
>> usually does NOT crash the kernel!
>>
>> I guess the problem comes from a combination of ARP + DNS 
>> lookups/replies. Note that my kernel module has its own internal fake 
>> DNS server which captures lookups and sends replies directly back to the 
>> stack. What is amazing: if the ARP entry I create is set to 
>> NUD_PERMANENT, then I don't get any crash (however I cannot develop my 
>> module with permanent ARP entries).
>>
>> I'm wondering if there were any major changes to the neighbor and arp 
>> code (between 2.6.18 and 2.6.31) that are somehow causing this problem ?...
>>
>> Any hint is very welcome.
>>
>> thanks in advance,
>> Christophe
>>
>> PS: I can easily reproduce the problem, and was trying to debug with 
>> qemu and gdb server but so fra no success to clearly identify the 
>> problem. Last point: it seems the kernel does not really "crash" but 
>> rather ends up in some unstable state and maybe in a loop.
>> --
> 
> Hi Christophe
> 
> You should ask these kind of questions on netdev instead of lkml.
> 
> And of course, post your patch, or send us a crystal ball ;)
> 
> Yes, many things changed between 2.6.18 and 2.6.34
> 

Eric: thanks for the forward to the netdev list. Regarding the code, I 
of course welcome any help but didn't want to pollute the list with 
unsollicited code: I can of course of course send it directly to anyone 
who is willing to help (I can easily reproduce the problem on different 
machines).

Christophe




^ permalink raw reply

* [PATCH net-next-2.6] ip: Router Alert RCU conversion
From: Eric Dumazet @ 2010-06-07 13:12 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

Straightforward conversion to RCU.

One rwlock becomes a spinlock, and is static.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/ip.h       |    2 +-
 net/ipv4/ip_input.c    |   11 +++--------
 net/ipv4/ip_sockglue.c |   23 ++++++++++++++---------
 3 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 452f229..9982c97 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -62,10 +62,10 @@ struct ip_ra_chain {
 	struct ip_ra_chain	*next;
 	struct sock		*sk;
 	void			(*destructor)(struct sock *);
+	struct rcu_head		rcu;
 };
 
 extern struct ip_ra_chain *ip_ra_chain;
-extern rwlock_t ip_ra_lock;
 
 /* IP flags. */
 #define IP_CE		0x8000		/* Flag: "Congestion"		*/
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index d52c9da..08a3b12 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -146,7 +146,7 @@
 #include <linux/netlink.h>
 
 /*
- *	Process Router Attention IP option
+ *	Process Router Attention IP option (RFC 2113)
  */
 int ip_call_ra_chain(struct sk_buff *skb)
 {
@@ -155,8 +155,7 @@ int ip_call_ra_chain(struct sk_buff *skb)
 	struct sock *last = NULL;
 	struct net_device *dev = skb->dev;
 
-	read_lock(&ip_ra_lock);
-	for (ra = ip_ra_chain; ra; ra = ra->next) {
+	for (ra = rcu_dereference(ip_ra_chain); ra; ra = rcu_dereference(ra->next)) {
 		struct sock *sk = ra->sk;
 
 		/* If socket is bound to an interface, only report
@@ -167,10 +166,8 @@ int ip_call_ra_chain(struct sk_buff *skb)
 		     sk->sk_bound_dev_if == dev->ifindex) &&
 		    net_eq(sock_net(sk), dev_net(dev))) {
 			if (ip_hdr(skb)->frag_off & htons(IP_MF | IP_OFFSET)) {
-				if (ip_defrag(skb, IP_DEFRAG_CALL_RA_CHAIN)) {
-					read_unlock(&ip_ra_lock);
+				if (ip_defrag(skb, IP_DEFRAG_CALL_RA_CHAIN))
 					return 1;
-				}
 			}
 			if (last) {
 				struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);
@@ -183,10 +180,8 @@ int ip_call_ra_chain(struct sk_buff *skb)
 
 	if (last) {
 		raw_rcv(last, skb);
-		read_unlock(&ip_ra_lock);
 		return 1;
 	}
-	read_unlock(&ip_ra_lock);
 	return 0;
 }
 
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index ce23178..08b9519 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -239,7 +239,12 @@ int ip_cmsg_send(struct net *net, struct msghdr *msg, struct ipcm_cookie *ipc)
    sent to multicast group to reach destination designated router.
  */
 struct ip_ra_chain *ip_ra_chain;
-DEFINE_RWLOCK(ip_ra_lock);
+static DEFINE_SPINLOCK(ip_ra_lock);
+
+static void ip_ra_free_rcu(struct rcu_head *head)
+{
+	kfree(container_of(head, struct ip_ra_chain, rcu));
+}
 
 int ip_ra_control(struct sock *sk, unsigned char on,
 		  void (*destructor)(struct sock *))
@@ -251,35 +256,35 @@ int ip_ra_control(struct sock *sk, unsigned char on,
 
 	new_ra = on ? kmalloc(sizeof(*new_ra), GFP_KERNEL) : NULL;
 
-	write_lock_bh(&ip_ra_lock);
+	spin_lock_bh(&ip_ra_lock);
 	for (rap = &ip_ra_chain; (ra = *rap) != NULL; rap = &ra->next) {
 		if (ra->sk == sk) {
 			if (on) {
-				write_unlock_bh(&ip_ra_lock);
+				spin_unlock_bh(&ip_ra_lock);
 				kfree(new_ra);
 				return -EADDRINUSE;
 			}
-			*rap = ra->next;
-			write_unlock_bh(&ip_ra_lock);
+			rcu_assign_pointer(*rap, ra->next);
+			spin_unlock_bh(&ip_ra_lock);
 
 			if (ra->destructor)
 				ra->destructor(sk);
 			sock_put(sk);
-			kfree(ra);
+			call_rcu(&ra->rcu, ip_ra_free_rcu);
 			return 0;
 		}
 	}
 	if (new_ra == NULL) {
-		write_unlock_bh(&ip_ra_lock);
+		spin_unlock_bh(&ip_ra_lock);
 		return -ENOBUFS;
 	}
 	new_ra->sk = sk;
 	new_ra->destructor = destructor;
 
 	new_ra->next = ra;
-	*rap = new_ra;
+	rcu_assign_pointer(*rap, new_ra);
 	sock_hold(sk);
-	write_unlock_bh(&ip_ra_lock);
+	spin_unlock_bh(&ip_ra_lock);
 
 	return 0;
 }



^ permalink raw reply related

* Re: [Patch 2/2] mlx4: add dynamic LRO disable support
From: Cong Wang @ 2010-06-07 13:15 UTC (permalink / raw)
  To: Stanislaw Gruszka; +Cc: Ben Hutchings, netdev, herbert.xu, nhorman, davem
In-Reply-To: <20100607130038.11581e28@dhcp-lab-109.englab.brq.redhat.com>

On 06/07/10 19:00, Stanislaw Gruszka wrote:
> On Mon, 07 Jun 2010 16:51:49 +0800
> Cong Wang<amwang@redhat.com>  wrote:
>
>>> Now that I look at the patch again, I see you're using a static (i.e.
>>> global) variable to 'back up' the non-zero (enabled) value of num_lro.
>>> This is introducing a bug!  The correct value is apparently set in
>>> mlx4_en_get_profile(); you would need to replicate that.
>>>
>>
>> Oh, probably, but unfortunately 'num_lro' is static so only visible
>> in en_main.c.
>
> So just remove "static" and make it global :-)
>

Not that easy, it is defined with a macro which is also used
by other parameters. :-/


^ permalink raw reply

* Re: [v5 Patch 1/3] netpoll: add generic support for bridge and bonding devices
From: Andy Gospodarek @ 2010-06-07 13:03 UTC (permalink / raw)
  To: Cong Wang
  Cc: Andy Gospodarek, Jay Vosburgh, Flavio Leitner, linux-kernel,
	Matt Mackall, netdev, bridge, Andy Gospodarek, Neil Horman,
	Jeff Moyer, Stephen Hemminger, bonding-devel, David Miller
In-Reply-To: <4C0CC29D.9070507@redhat.com>

On Mon, Jun 07, 2010 at 05:57:49PM +0800, Cong Wang wrote:
> On 06/05/10 03:18, Andy Gospodarek wrote:
>> On Wed, Jun 02, 2010 at 06:04:45PM +0800, Cong Wang wrote:
>>> On 06/02/10 02:42, Jay Vosburgh wrote:
>>>> Cong Wang<amwang@redhat.com>   wrote:
>>>>
>>>>> On 06/01/10 03:08, Flavio Leitner wrote:
>>>>>> On Mon, May 31, 2010 at 01:56:52PM +0800, Cong Wang wrote:
>>>>>>> Hi, Flavio,
>>>>>>>
>>>>>>> Please use the attached patch instead, try to see if it solves
>>>>>>> all your problems.
>>>>>>
>>>>>> I tried and it hangs. No backtraces this time.
>>>>>> The bond_change_active_slave() prints before NETDEV_BONDING_FAILOVER
>>>>>> notification, so I think it won't work.
>>>>>
>>>>> Ah, I thought the same.
>>>>>
>>>>>>
>>>>>> Please, correct if I'm wrong, but when a failover happens with your
>>>>>> patch applied, the netconsole would be disabled forever even with
>>>>>> another healthy slave, right?
>>>>>>
>>>>>
>>>>> Yes, this is an easy solution, because bonding has several modes,
>>>>> it is complex to make netpoll work in different modes.
>>>>
>>>> 	If I understand correctly, the root cause of the problem with
>>>> netconsole and bonding is that bonding is, ultimately, performing
>>>> printks with a write lock held, and when netconsole recursively calls
>>>> into bonding to send the printk over the netconsole, there is a deadlock
>>>> (when the bonding xmit function attempts to acquire the same lock for
>>>> read).
>>>
>>>
>>> Yes.
>>>
>>>>
>>>> 	You're trying to avoid the deadlock by shutting off netconsole
>>>> (permanently, it looks like) for one problem case: a failover, which
>>>> does some printks with a write lock held.
>>>>
>>>> 	This doesn't look to me like a complete solution, there are
>>>> other cases in bonding that will do printk with write locks held.  I
>>>> suspect those will also hang netconsole as things exist today, and won't
>>>> be affected by your patch below.
>>>
>>>
>>> I can expect that, bonding modes are complex.
>>>
>>>>
>>>> 	For example:
>>>>
>>>> 	The sysfs functions to set the primary (bonding_store_primary)
>>>> or active (bonding_store_active_slave) options: a pr_info is called to
>>>> provide a log message of the results.  These could be tested by setting
>>>> the primary or active options via sysfs, e.g.,
>>>>
>>>> echo eth0>   /sys/class/net/bond0/bonding/primary
>>>> echo eth0>   /sys/class/net/bond0/bonding/active
>>>>
>>>> 	If the kernel is defined with DEBUG, there are a few pr_debug
>>>> calls within write_locks (bond_del_vlan, for example).
>>>>
>>>> 	If the slave's underlying device driver's ndo_vlan_rx_register
>>>> or ndo_vlan_rx_kill_vid functions call printk (and it looks like some do
>>>> for error cases, e.g., igbvf, ehea, enic), those would also presumably
>>>> deadlock (because bonding holds its write_lock when calling the ndo_
>>>> vlan functions).
>>>>
>>>> 	It also appears that (with the patch below) some nominally
>>>> normal usage patterns will immediately disable netconsole.  The one that
>>>> comes to mind is if the primary= option is set (to "eth1" for this
>>>> example), but that slave not enslaved first (the slaves are added, say,
>>>> eth0 then eth1).  In that situation, when the primary slave (eth1 here)
>>>> is added, the first thing that will happen is a failover, and that will
>>>> disable netconsole.
>>>>
>>>
>>> Thanks for your detailed explanation!
>>>
>>> This is why I said bonding is complex. I guess we would have to adjust
>>> netpoll code for different bonding cases, one solution seems not fix all.
>>> I am not sure how much work to do, since I am not familiar with bonding
>>> code. Maybe Andy can help?
>>>
>>
>> Sorry I've been silent until now.  This does seem quite similar to a
>> problem I've previously encountered when dealing with bonding+netpoll on
>> some old 2.6.9-based kernels.  There is no guarantee the methods used
>> there will apply here, but I'll talk about them anyway.
>>
>> As Flavio noticed, recursive calls into the bond transmit routines were
>> not a good idea.  I discovered the same and worked around this issue by
>> checking to see if we could take the bond->lock for writing before
>> continuing.  If we could not get, I wanted to signal that this should be
>> queued for transmission later.  Based on the flow of netpoll_send_skb
>> (or possibly for another reason that is escaping me right now) I added
>> one of these checks in bond_poll_controller too.  These aren't the
>> prettiest fixes, but seemed to work well for me when I did this work in
>> the past.  I realize the differences are not that great compared to some
>> of the patches posted by Flavio, but I think they are worth trying.
>
>
> Hmm, I still feel like this way is ugly, although it may work.
> I guess David doesn't like it either.
>

Notice how I referred to it as a work-around? :)

It certainly isn't a great way to resolve the issue, but I wanted to
offer my opinon on the issue since you asked.

> Anyway, Flavio, could you try the following patch as well?
>
> Thanks a lot!
>
>>
>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>> index ef60244..d7b9b99 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -1290,6 +1290,12 @@ static bool slaves_support_netpoll(struct net_device *bond_dev)
>>   static void bond_poll_controller(struct net_device *bond_dev)
>>   {
>>   	struct net_device *dev = bond_dev->npinfo->netpoll->real_dev;
>> +	struct bonding *bond = netdev_priv(bond_dev);
>> +
>> +	if (!write_trylock(&bond->lock))
>> +		return;
>> +	write_unlock(&bond->lock);
>> +
>>   	if (dev != bond_dev)
>>   		netpoll_poll_dev(dev);
>>   }
>> @@ -4418,7 +4424,11 @@ static void bond_set_xmit_hash_policy(struct bonding *bond)
>>
>>   static netdev_tx_t bond_start_xmit(struct sk_buff *skb, struct net_device *dev)
>>   {
>> -	const struct bonding *bond = netdev_priv(dev);
>> +	struct bonding *bond = netdev_priv(dev);
>> +
>> +	if (!write_trylock(&bond->lock))
>> +		return NETDEV_TX_BUSY;
>> +	write_unlock(&bond->lock);
>>
>>   	switch (bond->params.mode) {
>>   	case BOND_MODE_ROUNDROBIN:
>>
>> The other key to all of this is to make sure that queuing is done
>> correctly now that we expect to queue these frames and have them sent at
>> some point when there is a member of the bond that is actually capable
>> of sending them out.
>>
>> The new style of sending queued skbs in a workqueue is much better than
>> what was done in the 2.6.9 timeframe, but careful attention should still
>> be paid to txq lock and which processor is the owner.  Returning
>> something other than NETDEV_TX_OK from bond_start_xmit and checking for
>> locks being held there should also help with any deadlocks that show up
>> while running in queue_process (though they would not be recursive).
>>
>> I'm not in a good spot to test this right now, but I can take a look at
>> next week and we can try and track down any of the other deadlocks that
>> currently exist as I suspect this will not resolve all of the issues.
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to classes
From: Kay Sievers @ 2010-06-07 12:54 UTC (permalink / raw)
  To: Johannes Berg; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <1275914205.29978.10.camel@jlt3.sipsolutions.net>

On Mon, Jun 7, 2010 at 14:36, Johannes Berg <johannes@sipsolutions.net> wrote:
> On Mon, 2010-06-07 at 14:26 +0200, Kay Sievers wrote:
>> On Mon, Jun 7, 2010 at 13:41, Johannes Berg <johannes@sipsolutions.net> wrote:
>> > (mind you, I think we probably need to have the bus/driver assignment,
>> > but I wanted to try out your suggestion first)
>>
>> Class device can never have a driver. And unregistered drivers can
>> not be used, what you try to do here. That all should just be removed or
>> properly registered with the core if needed.
>
> Yeah but it shouldn't influence the operation?

It does. You can not use "dead static" drivers. They need to be
properly registered, and can only work for bus devices. Class devices
are not allowed to have drivers.

This worked only because the core ignored the value you assigned
behind its back. :)

>> Here is something that seems to work for me.
>
> I see you remove the driver. Does this mean that in sysfs these devices
> wouldn't have a driver symlink? ISTR that we needed that for userspace,
> but I'm not entirely sure, and I don't have all the relevant userspace
> in my test setup.

Yeah, if you need the driver, you need to use bus device, as you do
now. The driver needs to be registered with the core, and the probe
function needs to tell to bind to the devices while registered. You
can not manually assign this.

Kay

^ permalink raw reply

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to  classes
From: Johannes Berg @ 2010-06-07 12:36 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <1275913563.1823.1.camel@yio.site>

On Mon, 2010-06-07 at 14:26 +0200, Kay Sievers wrote:
> On Mon, Jun 7, 2010 at 13:41, Johannes Berg <johannes@sipsolutions.net> wrote:
> > (mind you, I think we probably need to have the bus/driver assignment,
> > but I wanted to try out your suggestion first)
> 
> Class device can never have a driver. And unregistered drivers can
> not be used, what you try to do here. That all should just be removed or
> properly registered with the core if needed.

Yeah but it shouldn't influence the operation? Anyway I see from your
patch that I should have assigned the release function and that would've
helped I guess.

> > So I removed bus/driver assignment from the above code just to try it
> > out, and got
> >
> > Device 'hwsim0' does not have a release() function, it is broken and
> > must be fixed.
> 
> The driver core expects the resources of the device to be freed on
> release. You can provide an empty release function because you do this
> from a global list of devices.

Ok, makes sense.

> > This has evolved far too much for me right now. Can we just apply the
> > initial patch from Eric and be happier for a while? I can't justify
> > spending this much time on it right now. Alternatively, you could look
> > at hwsim too, since it's all virtual, nothing special is required, I do
> > testing in a virtual machine ...
> >
> > current patch is at
> > http://johannes.sipsolutions.net/patches/kernel/all/2010-06-07-11:41/hwsim-bus.patch
> 
> Here is something that seems to work for me.

I see you remove the driver. Does this mean that in sysfs these devices
wouldn't have a driver symlink? ISTR that we needed that for userspace,
but I'm not entirely sure, and I don't have all the relevant userspace
in my test setup.

johannes

^ permalink raw reply

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to  classes
From: Kay Sievers @ 2010-06-07 12:26 UTC (permalink / raw)
  To: Johannes Berg; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <1275910905.29978.7.camel@jlt3.sipsolutions.net>

On Mon, Jun 7, 2010 at 13:41, Johannes Berg <johannes@sipsolutions.net> wrote:
> (mind you, I think we probably need to have the bus/driver assignment,
> but I wanted to try out your suggestion first)

Class device can never have a driver. And unregistered drivers can
not be used, what you try to do here. That all should just be removed or
properly registered with the core if needed.

> So I removed bus/driver assignment from the above code just to try it
> out, and got
>
> Device 'hwsim0' does not have a release() function, it is broken and
> must be fixed.

The driver core expects the resources of the device to be freed on
release. You can provide an empty release function because you do this
from a global list of devices.

> This has evolved far too much for me right now. Can we just apply the
> initial patch from Eric and be happier for a while? I can't justify
> spending this much time on it right now. Alternatively, you could look
> at hwsim too, since it's all virtual, nothing special is required, I do
> testing in a virtual machine ...
>
> current patch is at
> http://johannes.sipsolutions.net/patches/kernel/all/2010-06-07-11:41/hwsim-bus.patch

Here is something that seems to work for me.

Cheers,
Kay



diff --git a/drivers/net/wireless/mac80211_hwsim.c b/drivers/net/wireless/mac80211_hwsim.c
index 6f8cb3e..d210ce6 100644
--- a/drivers/net/wireless/mac80211_hwsim.c
+++ b/drivers/net/wireless/mac80211_hwsim.c
@@ -194,7 +194,13 @@ static inline void hwsim_clear_sta_magic(struct ieee80211_sta *sta)
 	sp->magic = 0;
 }
 
-static struct class *hwsim_class;
+static struct bus_type hwsim_bus = {
+	.name = "mac80211_hwsim",
+};
+
+static struct device hwsim_parent = {
+	.init_name = "mac80211_hwsim",
+};
 
 static struct net_device *hwsim_mon; /* global monitor netdev */
 
@@ -1071,14 +1077,10 @@ static void mac80211_hwsim_free(void)
 		device_unregister(data->dev);
 		ieee80211_free_hw(data->hw);
 	}
-	class_destroy(hwsim_class);
+	device_unregister(&hwsim_parent);
+	bus_unregister(&hwsim_bus);
 }
 
-
-static struct device_driver mac80211_hwsim_driver = {
-	.name = "mac80211_hwsim"
-};
-
 static const struct net_device_ops hwsim_netdev_ops = {
 	.ndo_start_xmit 	= hwsim_mon_xmit,
 	.ndo_change_mtu		= eth_change_mtu,
@@ -1231,6 +1233,9 @@ DEFINE_SIMPLE_ATTRIBUTE(hwsim_fops_group,
 			hwsim_fops_group_read, hwsim_fops_group_write,
 			"%llx\n");
 
+/* all devices are kept in out own list and the ressources are freed on exit */
+static void hwsim_dev_release(struct device* dev) {}
+
 static int __init init_mac80211_hwsim(void)
 {
 	int i, err = 0;
@@ -1251,9 +1256,15 @@ static int __init init_mac80211_hwsim(void)
 	spin_lock_init(&hwsim_radio_lock);
 	INIT_LIST_HEAD(&hwsim_radios);
 
-	hwsim_class = class_create(THIS_MODULE, "mac80211_hwsim");
-	if (IS_ERR(hwsim_class))
-		return PTR_ERR(hwsim_class);
+	err = bus_register(&hwsim_bus);
+	if (err)
+		return err;
+
+	err = device_register(&hwsim_parent);
+	if (err) {
+		bus_unregister(&hwsim_bus);
+		return err;
+	}
 
 	memset(addr, 0, ETH_ALEN);
 	addr[0] = 0x02;
@@ -1271,16 +1282,24 @@ static int __init init_mac80211_hwsim(void)
 		data = hw->priv;
 		data->hw = hw;
 
-		data->dev = device_create(hwsim_class, NULL, 0, hw,
-					  "hwsim%d", i);
-		if (IS_ERR(data->dev)) {
-			printk(KERN_DEBUG
-			       "mac80211_hwsim: device_create "
-			       "failed (%ld)\n", PTR_ERR(data->dev));
+		data->dev = kzalloc(sizeof(struct device), GFP_KERNEL);
+		if (!data->dev) {
 			err = -ENOMEM;
 			goto failed_drvdata;
 		}
-		data->dev->driver = &mac80211_hwsim_driver;
+
+		dev_set_name(data->dev, "hwsim%d", i);
+		data->dev->parent = &hwsim_parent;
+		data->dev->bus = &hwsim_bus;
+		data->dev->release = hwsim_dev_release;
+
+		err = device_register(data->dev);
+		if (err) {
+			printk(KERN_DEBUG
+			       "mac80211_hwsim: device_register failed (%d)\n",
+			       err);
+			goto failed_drvdata;
+		}
 
 		SET_IEEE80211_DEV(hw, data->dev);
 		addr[3] = i >> 8;




^ permalink raw reply related

* Re: Virtual device and ARP table
From: Eric Dumazet @ 2010-06-07 12:22 UTC (permalink / raw)
  To: Christophe Jelger; +Cc: linux-kernel, netdev
In-Reply-To: <4C0CC810.7030501@unibas.ch>

Le lundi 07 juin 2010 à 12:21 +0200, Christophe Jelger a écrit :
> Hello,
> 
> I am currently "resurrecting" a Linux module (called LUNAR) which I 
> co-developed in 2007 and I'm having a weird kernel crash. This code 
> basically used to work fine up to 2.6.18 which was the latest version 
> before we stopped our development. I quickly ported it to 2.6.{31,32}: 
> it compiles fine and loads fine, but it crashes/hangs the kernel when 
> it's really being used.
> 
> The module is a virtual device used for MANET routing: with the current 
> version, it basically "captures" DNS requests sent to the virtual 
> interface --> this triggers the sending of a fake DNS reply (see below) 
> and the creation of an ARP table entry for the destination (the MANET 
> route is built at the same time). Packets can then be sent to the 
> destination.
> 
> The problem I'm having is that the kernel quickly hangs after I create a 
> new ARP entry (actually only if it's being used). If the entry I create 
> is set to NUD_PERMANENT, then everything works fine! I use 
> __neigh_lookup_errno to lookup/create the entry and neigh_lookup to 
> set/update the MAC address. Note that the ARP entry is created without 
> problem, but typically even just doing a userspace "arp -a" command can 
> crash the kernel (it also hangs the userspace command!). Doing "arp -na" 
> usually does NOT crash the kernel!
> 
> I guess the problem comes from a combination of ARP + DNS 
> lookups/replies. Note that my kernel module has its own internal fake 
> DNS server which captures lookups and sends replies directly back to the 
> stack. What is amazing: if the ARP entry I create is set to 
> NUD_PERMANENT, then I don't get any crash (however I cannot develop my 
> module with permanent ARP entries).
> 
> I'm wondering if there were any major changes to the neighbor and arp 
> code (between 2.6.18 and 2.6.31) that are somehow causing this problem ?...
> 
> Any hint is very welcome.
> 
> thanks in advance,
> Christophe
> 
> PS: I can easily reproduce the problem, and was trying to debug with 
> qemu and gdb server but so fra no success to clearly identify the 
> problem. Last point: it seems the kernel does not really "crash" but 
> rather ends up in some unstable state and maybe in a loop.
> --

Hi Christophe

You should ask these kind of questions on netdev instead of lkml.

And of course, post your patch, or send us a crystal ball ;)

Yes, many things changed between 2.6.18 and 2.6.34

^ permalink raw reply

* Re: [PATCH net-next-2.6] macvlan: use call_rcu for port free
From: Eric Dumazet @ 2010-06-07 12:15 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, davem, kaber
In-Reply-To: <20100607113629.GA2730@psychotron.lab.eng.brq.redhat.com>

Le lundi 07 juin 2010 à 13:36 +0200, Jiri Pirko a écrit :
> Use call_rcu rather than synchronize_rcu.
> 
> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
> ---
>  drivers/net/macvlan.c |   12 ++++++++++--
>  1 files changed, 10 insertions(+), 2 deletions(-)
> 

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>



^ permalink raw reply

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to  classes
From: Johannes Berg @ 2010-06-07 11:41 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <AANLkTikMDvpEzrtFVue8gStudKONxuQmVx9B2w1JM1vO@mail.gmail.com>

On Mon, 2010-06-07 at 13:05 +0200, Kay Sievers wrote:

> > +               data->dev = kzalloc(sizeof(struct device), GFP_KERNEL);
> > +               if (!data->dev) {
> >                        err = -ENOMEM;
> >                        goto failed_drvdata;
> >                }
> > +
> > +               dev_set_name(data->dev, "hwsim%d", i);
> > +               data->dev->bus = &hwsim_bus;
> >                data->dev->driver = &mac80211_hwsim_driver;
> >
> > +               err = device_register(data->dev);
> > +               if (err) {
> > +                       printk(KERN_DEBUG
> > +                              "mac80211_hwsim: device_register failed (%d)\n",
> > +                              err);
> > +                       goto failed_drvdata;
> > +               }
> >
> > (ignore the pluses, snipped from a patch) but it ran into a null ptr
> > deref?
> 
> Oh, I see. It's probably something nobody ever did before. You try to
> register a bus device which has no parent. Seems that's something
> nobody ever expected to happen. :)
> 
> Your driver/subsystem is completely virtual, does not depend on any
> hardware, right? If we create a virtual parent, like:
>   parent = kzalloc(sizeof(struct device), GFP_KERNEL);
>   dev_set_name(parent, "mac80211_hwsim");
>   device_register(parent);
> 
> An in your code:
>   data->dev->parent = parent;
> 
> That should give you a /sys/devices/mac80211_hwsim/ directory where
> all the devices you create should show up.

So that seemed equivalent to my code except for the .bus/.driver
assignments and the two-level hierarchy of course. 

(mind you, I think we probably need to have the bus/driver assignment,
but I wanted to try out your suggestion first)

So I removed bus/driver assignment from the above code just to try it
out, and got

Device 'hwsim0' does not have a release() function, it is broken and
must be fixed.

This has evolved far too much for me right now. Can we just apply the
initial patch from Eric and be happier for a while? I can't justify
spending this much time on it right now. Alternatively, you could look
at hwsim too, since it's all virtual, nothing special is required, I do
testing in a virtual machine ...

current patch is at
http://johannes.sipsolutions.net/patches/kernel/all/2010-06-07-11:41/hwsim-bus.patch

johannes

^ permalink raw reply

* BUG: double spinlock in "drivers/net/3c505.c"
From: Alexander Strakh @ 2010-06-07 11:17 UTC (permalink / raw)
  To: Philip Blundell
  Cc: Craig Southeren, Andrew Tridgell, Alan Cox, netdev, linux-kernel

	KERNEL_VERSION: 2.6.35-rc1
        SUBJECT: duble spinlock  in function elp_start_xmit
        SUBSCRIBE:
        In driver drivers/net/3c505.c in function elp_start_xmit:

1. In line 1075 we have first spinlock. In the next line we called 
check_3c505_dma:

1070 static netdev_tx_t elp_start_xmit(struct sk_buff *skb, struct net_device 
*dev)
1071 {
1072         unsigned long flags;
1073         elp_device *adapter = netdev_priv(dev);
1074
1075         spin_lock_irqsave(&adapter->lock, flags);
1076         check_3c505_dma(dev);

2. In function check_3c505_dma we have second spinloock at line 301.

 293 static inline void check_3c505_dma(struct net_device *dev)
 294 {
 295         elp_device *adapter = netdev_priv(dev);
 296         if (adapter->dmaing && time_after(jiffies, adapter-
>current_dma.start_time + 10)) {
 297                 unsigned long flags, f;
 298                 pr_err("%s: DMA %s timed out, %d bytes left\n", dev-
>name,
 299                         adapter->current_dma.direction ? "download" : 
"upload",
 300                         get_dma_residue(dev->dma));
 301                 spin_lock_irqsave(&adapter->lock, flags);


Found by Linux Device Drivers Verification Project (Svace Detector)

^ permalink raw reply

* [PATCH net-next-2.6] macvlan: use call_rcu for port free
From: Jiri Pirko @ 2010-06-07 11:36 UTC (permalink / raw)
  To: netdev; +Cc: davem, kaber

Use call_rcu rather than synchronize_rcu.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
---
 drivers/net/macvlan.c |   12 ++++++++++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 53422ce..59c3155 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -37,6 +37,7 @@ struct macvlan_port {
 	struct net_device	*dev;
 	struct hlist_head	vlan_hash[MACVLAN_HASH_SIZE];
 	struct list_head	vlans;
+	struct rcu_head		rcu;
 };
 
 static struct macvlan_dev *macvlan_hash_lookup(const struct macvlan_port *port,
@@ -540,14 +541,21 @@ static int macvlan_port_create(struct net_device *dev)
 	return err;
 }
 
+static void macvlan_port_rcu_free(struct rcu_head *head)
+{
+	struct macvlan_port *port;
+
+	port = container_of(head, struct macvlan_port, rcu);
+	kfree(port);
+}
+
 static void macvlan_port_destroy(struct net_device *dev)
 {
 	struct macvlan_port *port = dev->macvlan_port;
 
 	netdev_rx_handler_unregister(dev);
 	rcu_assign_pointer(dev->macvlan_port, NULL);
-	synchronize_rcu();
-	kfree(port);
+	call_rcu(&port->rcu, macvlan_port_rcu_free);
 }
 
 static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
-- 
1.7.0.1


^ permalink raw reply related

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to classes
From: Kay Sievers @ 2010-06-07 11:05 UTC (permalink / raw)
  To: Johannes Berg; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <1275905686.29978.3.camel@jlt3.sipsolutions.net>

On Mon, Jun 7, 2010 at 12:14, Johannes Berg <johannes@sipsolutions.net> wrote:
> On Mon, 2010-06-07 at 11:53 +0200, Kay Sievers wrote:
>
>> > Can you please tell me then how to device_create() without a class? I
>> > cannot seem to create devices without a class at all, even using manual
>> > allocation (yuck) and device_register crashes the kernel.
>>
>> Right, this "convenience API" does not exist for buses. It's not doing
>> much, just allocates a "struct device" and fills in the few values and
>> calls device_register().
>>
>> Does your device create a device node? If not, device_create() should
>> not be used anyway, because the corresponding device_destroy() will
>> not do anything.
>
> No, it doesn't need a dev node. I tried this:
>
> +               data->dev = kzalloc(sizeof(struct device), GFP_KERNEL);
> +               if (!data->dev) {
>                        err = -ENOMEM;
>                        goto failed_drvdata;
>                }
> +
> +               dev_set_name(data->dev, "hwsim%d", i);
> +               data->dev->bus = &hwsim_bus;
>                data->dev->driver = &mac80211_hwsim_driver;
>
> +               err = device_register(data->dev);
> +               if (err) {
> +                       printk(KERN_DEBUG
> +                              "mac80211_hwsim: device_register failed (%d)\n",
> +                              err);
> +                       goto failed_drvdata;
> +               }
>
> (ignore the pluses, snipped from a patch) but it ran into a null ptr
> deref?

Oh, I see. It's probably something nobody ever did before. You try to
register a bus device which has no parent. Seems that's something
nobody ever expected to happen. :)

Your driver/subsystem is completely virtual, does not depend on any
hardware, right? If we create a virtual parent, like:
  parent = kzalloc(sizeof(struct device), GFP_KERNEL);
  dev_set_name(parent, "mac80211_hwsim");
  device_register(parent);

An in your code:
  data->dev->parent = parent;

That should give you a /sys/devices/mac80211_hwsim/ directory where
all the devices you create should show up.

If that works as expected, we should probably add something like:
   struct device *device_virtual_parent(const char *name);
which will allow you to create such a parent device in the
/sys/device/virtual/ directory.

Let me know if the above hack with the virtual parent works, then we
can check what do add to the core.

Thanks,
Kay

^ permalink raw reply

* Re: [Patch 2/2] mlx4: add dynamic LRO disable support
From: Stanislaw Gruszka @ 2010-06-07 11:00 UTC (permalink / raw)
  To: Cong Wang; +Cc: Ben Hutchings, netdev, herbert.xu, nhorman, davem
In-Reply-To: <4C0CB325.2040704@redhat.com>

On Mon, 07 Jun 2010 16:51:49 +0800
Cong Wang <amwang@redhat.com> wrote:

> > Now that I look at the patch again, I see you're using a static (i.e.
> > global) variable to 'back up' the non-zero (enabled) value of num_lro.
> > This is introducing a bug!  The correct value is apparently set in
> > mlx4_en_get_profile(); you would need to replicate that.
> >
> 
> Oh, probably, but unfortunately 'num_lro' is static so only visible
> in en_main.c.

So just remove "static" and make it global :-)

Stanislaw

^ permalink raw reply

* Re: [PATCH] tcp: Fix slowness in read /proc/net/tcp
From: Eric Dumazet @ 2010-06-07 10:32 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Tom Herbert, davem, netdev, Yakov Lerner
In-Reply-To: <87eigjgrcw.fsf@basil.nowhere.org>

Le lundi 07 juin 2010 à 10:49 +0200, Andi Kleen a écrit :
> Eric Dumazet <eric.dumazet@gmail.com> writes:
> >
> > BTW, another problem of /proc/net/tcp is the buffer size used by netstat
> > utility : 1024 bytes instead of PAGE_SIZE, making O(N^2) behavior even
> > more palpable.
> 
> That would be easily fixable in netstat.
> 
> But I'm not sure net-tools is still maintained as a separate package,
> afaik the standard procedure is to submit a patch to a big distribution
> and the others take it from there.

Well, it seems this is fixed in latest netstat, sorry for the noise (I
still have machines with RHEL 4)




^ permalink raw reply

* [PATCH v3] netdev:bfin_mac: reclaim and free tx skb as soon as possible after transfer
From: sonic zhang @ 2010-06-07 10:38 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, uclinux-dist-devel

>From 40560ae9e8db42e2d2259b791ace160534c9a0f2 Mon Sep 17 00:00:00 2001
From: Sonic Zhang <sonic.zhang@analog.com>
Date: Thu, 3 Jun 2010 11:44:33 +0800
Subject: [PATCH v3] netdev:bfin_mac: reclaim and free tx skb as soon as possible after transfer

SKBs hold onto resources that can't be held indefinitely, such as TCP
socket references and netfilter conntrack state.  So if a packet is left
in TX ring for a long time, there might be a TCP socket that cannot be
closed and freed up.

Current blackfin EMAC driver always reclaim and free used tx skbs in future
transfers. The problem is that future transfer may not come as soon as
possible. This patch start a timer after transfer to reclaim and free skb.
There is nearly no performance drop with this patch.

TX interrupt is not enabled because of a strange behavior of the Blackfin EMAC.
If EMAC TX transfer control is turned on, endless TX interrupts are triggered
no matter if TX DMA is enabled or not. Since DMA walks down the ring automatically,
TX transfer control can't be turned off in the middle. The only way is to disable
TX interrupt completely.


Signed-off-by: Sonic Zhang <sonic.zhang@analog.com>
---
 drivers/net/bfin_mac.c |  114 ++++++++++++++++++++++++++++++------------------
 drivers/net/bfin_mac.h |    5 ++
 2 files changed, 77 insertions(+), 42 deletions(-)

diff --git a/drivers/net/bfin_mac.c b/drivers/net/bfin_mac.c
index 368f333..6f0755f 100644
--- a/drivers/net/bfin_mac.c
+++ b/drivers/net/bfin_mac.c
@@ -922,61 +922,73 @@ static void bfin_mac_hwtstamp_init(struct net_device *netdev)
 # define bfin_tx_hwtstamp(dev, skb)
 #endif
 
-static void adjust_tx_list(void)
+static inline void _tx_reclaim_skb(void)
+{
+	do {
+		tx_list_head->desc_a.config &= ~DMAEN;
+		tx_list_head->status.status_word = 0;
+		if (tx_list_head->skb) {
+			dev_kfree_skb(tx_list_head->skb);
+			tx_list_head->skb = NULL;
+		}
+		tx_list_head = tx_list_head->next;
+
+	} while (tx_list_head->status.status_word != 0);
+}
+
+static void tx_reclaim_skb(struct bfin_mac_local *lp)
 {
 	int timeout_cnt = MAX_TIMEOUT_CNT;
 
-	if (tx_list_head->status.status_word != 0 &&
-	    current_tx_ptr != tx_list_head) {
-		goto adjust_head;	/* released something, just return; */
-	}
+	if (tx_list_head->status.status_word != 0)
+		_tx_reclaim_skb();
 
-	/*
-	 * if nothing released, check wait condition
-	 * current's next can not be the head,
-	 * otherwise the dma will not stop as we want
-	 */
-	if (current_tx_ptr->next->next == tx_list_head) {
+	if (current_tx_ptr->next == tx_list_head) {
 		while (tx_list_head->status.status_word == 0) {
+			/* slow down polling to avoid too many queue stop. */
 			udelay(10);
-			if (tx_list_head->status.status_word != 0 ||
-			    !(bfin_read_DMA2_IRQ_STATUS() & DMA_RUN)) {
-				goto adjust_head;
-			}
-			if (timeout_cnt-- < 0) {
-				printk(KERN_ERR DRV_NAME
-				": wait for adjust tx list head timeout\n");
+			/* reclaim skb if DMA is not running. */
+			if (!(bfin_read_DMA2_IRQ_STATUS() & DMA_RUN))
+				break;
+			if (timeout_cnt-- < 0)
 				break;
-			}
-		}
-		if (tx_list_head->status.status_word != 0) {
-			goto adjust_head;
 		}
+
+		if (timeout_cnt >= 0)
+			_tx_reclaim_skb();
+		else
+			netif_stop_queue(lp->ndev);
 	}
 
-	return;
+	if (current_tx_ptr->next != tx_list_head &&
+		netif_queue_stopped(lp->ndev))
+		netif_wake_queue(lp->ndev);
+
+	if (tx_list_head != current_tx_ptr) {
+		/* shorten the timer interval if tx queue is stopped */
+		if (netif_queue_stopped(lp->ndev))
+			lp->tx_reclaim_timer.expires =
+				jiffies + (TX_RECLAIM_JIFFIES >> 4);
+		else
+			lp->tx_reclaim_timer.expires =
+				jiffies + TX_RECLAIM_JIFFIES;
+
+		mod_timer(&lp->tx_reclaim_timer,
+			lp->tx_reclaim_timer.expires);
+	}
 
-adjust_head:
-	do {
-		tx_list_head->desc_a.config &= ~DMAEN;
-		tx_list_head->status.status_word = 0;
-		if (tx_list_head->skb) {
-			dev_kfree_skb(tx_list_head->skb);
-			tx_list_head->skb = NULL;
-		} else {
-			printk(KERN_ERR DRV_NAME
-			       ": no sk_buff in a transmitted frame!\n");
-		}
-		tx_list_head = tx_list_head->next;
-	} while (tx_list_head->status.status_word != 0 &&
-		 current_tx_ptr != tx_list_head);
 	return;
+}
 
+static void tx_reclaim_skb_timeout(unsigned long lp)
+{
+	tx_reclaim_skb((struct bfin_mac_local *)lp);
 }
 
 static int bfin_mac_hard_start_xmit(struct sk_buff *skb,
 				struct net_device *dev)
 {
+	struct bfin_mac_local *lp = netdev_priv(dev);
 	u16 *data;
 	u32 data_align = (unsigned long)(skb->data) & 0x3;
 	union skb_shared_tx *shtx = skb_tx(skb);
@@ -1009,8 +1021,6 @@ static int bfin_mac_hard_start_xmit(struct sk_buff *skb,
 			skb->len);
 		current_tx_ptr->desc_a.start_addr =
 			(u32)current_tx_ptr->packet;
-		if (current_tx_ptr->status.status_word != 0)
-			current_tx_ptr->status.status_word = 0;
 		blackfin_dcache_flush_range(
 			(u32)current_tx_ptr->packet,
 			(u32)(current_tx_ptr->packet + skb->len + 2));
@@ -1022,6 +1032,9 @@ static int bfin_mac_hard_start_xmit(struct sk_buff *skb,
 	 */
 	SSYNC();
 
+	/* always clear status buffer before start tx dma */
+	current_tx_ptr->status.status_word = 0;
+
 	/* enable this packet's dma */
 	current_tx_ptr->desc_a.config |= DMAEN;
 
@@ -1037,13 +1050,14 @@ static int bfin_mac_hard_start_xmit(struct sk_buff *skb,
 	bfin_write_EMAC_OPMODE(bfin_read_EMAC_OPMODE() | TE);
 
 out:
-	adjust_tx_list();
-
 	bfin_tx_hwtstamp(dev, skb);
 
 	current_tx_ptr = current_tx_ptr->next;
 	dev->stats.tx_packets++;
 	dev->stats.tx_bytes += (skb->len);
+
+	tx_reclaim_skb(lp);
+
 	return NETDEV_TX_OK;
 }
 
@@ -1167,8 +1181,11 @@ real_rx:
 #ifdef CONFIG_NET_POLL_CONTROLLER
 static void bfin_mac_poll(struct net_device *dev)
 {
+	struct bfin_mac_local *lp = netdev_priv(dev);
+
 	disable_irq(IRQ_MAC_RX);
 	bfin_mac_interrupt(IRQ_MAC_RX, dev);
+	tx_reclaim_skb(lp);
 	enable_irq(IRQ_MAC_RX);
 }
 #endif				/* CONFIG_NET_POLL_CONTROLLER */
@@ -1232,12 +1249,20 @@ static int bfin_mac_enable(void)
 /* Our watchdog timed out. Called by the networking layer */
 static void bfin_mac_timeout(struct net_device *dev)
 {
+	struct bfin_mac_local *lp = netdev_priv(dev);
+
 	pr_debug("%s: %s\n", dev->name, __func__);
 
 	bfin_mac_disable();
 
+	if (timer_pending(&lp->tx_reclaim_timer))
+		del_timer(&(lp->tx_reclaim_timer));
+
 	/* reset tx queue */
-	tx_list_tail = tx_list_head->next;
+	current_tx_ptr = tx_list_head;
+
+	if (netif_queue_stopped(lp->ndev))
+		netif_wake_queue(lp->ndev);
 
 	bfin_mac_enable();
 
@@ -1430,6 +1455,7 @@ static int __devinit bfin_mac_probe(struct platform_device *pdev)
 	SET_NETDEV_DEV(ndev, &pdev->dev);
 	platform_set_drvdata(pdev, ndev);
 	lp = netdev_priv(ndev);
+	lp->ndev = ndev;
 
 	/* Grab the MAC address in the MAC */
 	*(__le32 *) (&(ndev->dev_addr[0])) = cpu_to_le32(bfin_read_EMAC_ADDRLO());
@@ -1485,6 +1511,10 @@ static int __devinit bfin_mac_probe(struct platform_device *pdev)
 	ndev->netdev_ops = &bfin_mac_netdev_ops;
 	ndev->ethtool_ops = &bfin_mac_ethtool_ops;
 
+	init_timer(&lp->tx_reclaim_timer);
+	lp->tx_reclaim_timer.data = (unsigned long)lp;
+	lp->tx_reclaim_timer.function = tx_reclaim_skb_timeout;
+
 	spin_lock_init(&lp->lock);
 
 	/* now, enable interrupts */
diff --git a/drivers/net/bfin_mac.h b/drivers/net/bfin_mac.h
index 1ae7b82..04e4050 100644
--- a/drivers/net/bfin_mac.h
+++ b/drivers/net/bfin_mac.h
@@ -13,9 +13,12 @@
 #include <linux/net_tstamp.h>
 #include <linux/clocksource.h>
 #include <linux/timecompare.h>
+#include <linux/timer.h>
 
 #define BFIN_MAC_CSUM_OFFLOAD
 
+#define TX_RECLAIM_JIFFIES (HZ / 5)
+
 struct dma_descriptor {
 	struct dma_descriptor *next_dma_desc;
 	unsigned long start_addr;
@@ -68,6 +71,8 @@ struct bfin_mac_local {
 
 	int wol;		/* Wake On Lan */
 	int irq_wake_requested;
+	struct timer_list tx_reclaim_timer;
+	struct net_device *ndev;
 
 	/* MII and PHY stuffs */
 	int old_link;          /* used by bf537_adjust_link */
-- 
1.6.0




^ permalink raw reply related

* Re: [PATCH v2] netdev:bfin_mac: reclaim and free tx skb as soon as possible after transfer
From: Sonic Zhang @ 2010-06-07 10:26 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, uclinux-dist-devel
In-Reply-To: <1275904680.2545.44.camel@edumazet-laptop>

On Mon, Jun 7, 2010 at 5:58 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le vendredi 04 juin 2010 à 12:44 +0800, Sonic Zhang a écrit :
>
>>
>> Yes, you are right. dev_kfree_skb_irq() queues used skb to the
>> complete queue. But, it is actually freed in the other soft irq
>> NET_TX_SOFTIRQ.
>
> I guess you didnt understood my mail, so I'll re-explain :
>
> dev_kfree_skb_irq() queues skb only if packet is not already orphaned.
>
> As most packets are now orphaned (in recent kernels where your patch
> applies), dev_kfree_skb_irq() can free packet immediately, with no
> NET_TX_SOFTIRQ overhead.
>
>

OK. I didn't notice the recent change in kernel to orphan most
packets. You are right, my second reason is out of date.

Thanks

Sonic

>
>

^ permalink raw reply

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to  classes
From: Johannes Berg @ 2010-06-07 10:14 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <AANLkTin_qjU0-xXfMDYeYWRd0BmvT4OIdK4S_GQ3ZYUi@mail.gmail.com>

On Mon, 2010-06-07 at 11:53 +0200, Kay Sievers wrote:

> > Can you please tell me then how to device_create() without a class? I
> > cannot seem to create devices without a class at all, even using manual
> > allocation (yuck) and device_register crashes the kernel.
> 
> Right, this "convenience API" does not exist for buses. It's not doing
> much, just allocates a "struct device" and fills in the few values and
> calls device_register().
> 
> Does your device create a device node? If not, device_create() should
> not be used anyway, because the corresponding device_destroy() will
> not do anything.

No, it doesn't need a dev node. I tried this:

+               data->dev = kzalloc(sizeof(struct device), GFP_KERNEL);
+               if (!data->dev) {
                        err = -ENOMEM;
                        goto failed_drvdata;
                }
+
+               dev_set_name(data->dev, "hwsim%d", i);
+               data->dev->bus = &hwsim_bus;
                data->dev->driver = &mac80211_hwsim_driver;
 
+               err = device_register(data->dev);
+               if (err) {
+                       printk(KERN_DEBUG
+                              "mac80211_hwsim: device_register failed (%d)\n",
+                              err);
+                       goto failed_drvdata;
+               }

(ignore the pluses, snipped from a patch) but it ran into a null ptr
deref?

johannes


^ permalink raw reply

* Re: RX/close vcc race with solos/atmtcp/usbatm/he
From: David Woodhouse @ 2010-06-07 10:02 UTC (permalink / raw)
  To: linux-atm-general; +Cc: netdev, Nathan Williams
In-Reply-To: <1274872584.20576.13579.camel@macbook.infradead.org>

On Wed, 2010-05-26 at 12:16 +0100, David Woodhouse wrote:
> I've had this crash reported to me...
> 
> [18842.727906] EIP: [<e082f490>] br2684_push+0x19/0x234 [br2684]
> SS:ESP 0068:dfb89d14 

Nathan, did you manage to get your customer to confirm that this fixes
the problem? It'd be useful to get this into 2.6.35 and -stable.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox