Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] bonding: fix to rejoin multicast groups immediately
From: David Miller @ 2010-10-05 20:25 UTC (permalink / raw)
  To: fbl; +Cc: netdev
In-Reply-To: <20101005143430.GA13811@redhat.com>

From: Flavio Leitner <fbl@redhat.com>
Date: Tue, 5 Oct 2010 11:34:30 -0300

> On Tue, Oct 05, 2010 at 12:13:38AM -0700, David Miller wrote:
>> From: Flavio Leitner <fleitner@redhat.com>
>> Date: Wed, 29 Sep 2010 04:12:07 -0300
>> 
>> > It should rejoin multicast groups immediately when
>> > the failover happens to restore the multicast traffic.
>> > 
>> > Signed-off-by: Flavio Leitner <fleitner@redhat.com>
>> 
>> I suspect the IGMPv3 handling via a delayed action, as is currently
>> implemented, is on purpose and is done so to follow the specification
>> of the IGMPv3 RFCs.
>> 
>> Therefore you have to explain why your new behavior is so desirable
>> and in particular why something as undesirable as violating the RFCs
>> is therefore warranted.
> 
> That patch only changes the behavior for bonding during a link
> failure, so if we have a bonding in active-backup or any other mode
> with current-active-slave, the initialization will happen just fine
> following IGMP specs.
> 
> However, neither the backup slave interface nor the backup switch
> connected to backup slave knows about mcast. Thus when a link failure
> happens, we shouldn't rely on timers to not stay out of the mcast
> group losing traffic.
> 
> E.g. The V1 specs says that we shouldn't send any membership report
> if it has been one in the last minute because that means the switch
> is notified and the system will receive mcast traffic for that group. 
> Therefore, if it sees one and a link failure happens right after that,
> the backup slave will send another membership report only one minute
> later. During this time the system loses traffic.

All of this valuable information belongs in the commit log message.

Please resubmit your patch with all of the necessary contextual
information and reasoning explaining in the commit message.

Thanks.

^ permalink raw reply

* Re: [net-2.6 PATCH] MAINTAINERS: update Intel LAN Ethernet info
From: David Miller @ 2010-10-05 20:25 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, linux-kernel, gospo, bphilips
In-Reply-To: <20101005111438.22968.15130.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 05 Oct 2010 04:15:17 -0700

> - Add ixgbevf and docs files to the maintainers file
> 
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [PATCH 1/3] ixgbevf.txt: Update ixgbevf documentation
From: David Miller @ 2010-10-05 20:28 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: rdunlap, netdev, linux-doc, gospo, bphilips
In-Reply-To: <20101005111643.23000.38976.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 05 Oct 2010 04:16:44 -0700

> Update the documentation for the ixgbevf (ixgbe virtual
> function driver).
> 
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [PATCH 2/3] e1000.txt: Update e1000 documentation
From: David Miller @ 2010-10-05 20:28 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: rdunlap, netdev, linux-doc, gospo, bphilips
In-Reply-To: <20101005111704.23000.60972.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 05 Oct 2010 04:17:05 -0700

> Updated the e1000 networking driver documentation.
> 
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [PATCH 3/3] e1000e.txt: Add e1000e documentation
From: David Miller @ 2010-10-05 20:28 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: rdunlap, netdev, linux-doc, gospo, bphilips
In-Reply-To: <20101005111726.23000.73997.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 05 Oct 2010 04:17:27 -0700

> Adds documentation for the e1000e networking driver.
> 
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* [PATCH net-next] fib: RCU conversion of fib_lookup()
From: Eric Dumazet @ 2010-10-05 20:41 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

fib_lookup() converted to be called in RCU protected context, no
reference taken and released on a contended cache line (fib_clntref)

fib_table_lookup() and fib_semantic_match() get an additional parameter.

struct fib_info gets an rcu_head field, and is freed after an rcu grace
period.

Stress test :
(Sending 160.000.000 UDP frames on same neighbour,
IP route cache disabled, dual E5540 @2.53GHz,
32bit kernel, FIB_HASH) (about same results for FIB_TRIE)

Before patch :

real	1m31.199s
user	0m13.761s
sys	23m24.780s

After patch:

real	1m5.375s
user	0m14.997s
sys	15m50.115s

Before patch Profile :

13044.00 15.4% __ip_route_output_key vmlinux
 8438.00 10.0% dst_destroy           vmlinux
 5983.00  7.1% fib_semantic_match    vmlinux
 5410.00  6.4% fib_rules_lookup      vmlinux
 4803.00  5.7% neigh_lookup          vmlinux
 4420.00  5.2% _raw_spin_lock        vmlinux
 3883.00  4.6% rt_set_nexthop        vmlinux
 3261.00  3.9% _raw_read_lock        vmlinux
 2794.00  3.3% fib_table_lookup      vmlinux
 2374.00  2.8% neigh_resolve_output  vmlinux
 2153.00  2.5% dst_alloc             vmlinux
 1502.00  1.8% _raw_read_lock_bh     vmlinux
 1484.00  1.8% kmem_cache_alloc      vmlinux
 1407.00  1.7% eth_header            vmlinux
 1406.00  1.7% ipv4_dst_destroy      vmlinux
 1298.00  1.5% __copy_from_user_ll   vmlinux
 1174.00  1.4% dev_queue_xmit        vmlinux
 1000.00  1.2% ip_output             vmlinux

After patch Profile :

13712.00 15.8% dst_destroy             vmlinux
 8548.00  9.9% __ip_route_output_key   vmlinux
 7017.00  8.1% neigh_lookup            vmlinux
 4554.00  5.3% fib_semantic_match      vmlinux
 4067.00  4.7% _raw_read_lock          vmlinux
 3491.00  4.0% dst_alloc               vmlinux
 3186.00  3.7% neigh_resolve_output    vmlinux
 3103.00  3.6% fib_table_lookup        vmlinux
 2098.00  2.4% _raw_read_lock_bh       vmlinux
 2081.00  2.4% kmem_cache_alloc        vmlinux
 2013.00  2.3% _raw_spin_lock          vmlinux
 1763.00  2.0% __copy_from_user_ll     vmlinux
 1763.00  2.0% ip_output               vmlinux
 1761.00  2.0% ipv4_dst_destroy        vmlinux
 1631.00  1.9% eth_header              vmlinux
 1440.00  1.7% _raw_read_unlock_bh     vmlinux

Reference results, if IP route cache is enabled :

real	0m29.718s
user	0m10.845s
sys	7m37.341s


25213.00 29.5% __ip_route_output_key   vmlinux
 9011.00 10.5% dst_release             vmlinux
 4817.00  5.6% ip_push_pending_frames  vmlinux
 4232.00  5.0% ip_finish_output        vmlinux
 3940.00  4.6% udp_sendmsg             vmlinux
 3730.00  4.4% __copy_from_user_ll     vmlinux
 3716.00  4.4% ip_route_output_flow    vmlinux
 2451.00  2.9% __xfrm_lookup           vmlinux
 2221.00  2.6% ip_append_data          vmlinux
 1718.00  2.0% _raw_spin_lock_bh       vmlinux
 1655.00  1.9% __alloc_skb             vmlinux
 1572.00  1.8% sock_wfree              vmlinux
 1345.00  1.6% kfree                   vmlinux


Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/fib_rules.h  |    2 +
 include/net/ip_fib.h     |   17 ++--------
 net/core/fib_rules.c     |    3 +
 net/ipv4/fib_frontend.c  |   27 ++++++++--------
 net/ipv4/fib_hash.c      |    5 +--
 net/ipv4/fib_lookup.h    |    2 -
 net/ipv4/fib_rules.c     |    3 +
 net/ipv4/fib_semantics.c |   21 ++++++++++---
 net/ipv4/fib_trie.c      |   10 +++---
 net/ipv4/route.c         |   59 +++++++++++++++----------------------
 10 files changed, 72 insertions(+), 77 deletions(-)

diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
index ac2fd00..106f309 100644
--- a/include/net/fib_rules.h
+++ b/include/net/fib_rules.h
@@ -31,6 +31,8 @@ struct fib_lookup_arg {
 	void			*lookup_ptr;
 	void			*result;
 	struct fib_rule		*rule;
+	int			flags;
+#define FIB_LOOKUP_NOREF	1
 };
 
 struct fib_rules_ops {
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index c93f94e..ba3666d 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -86,6 +86,7 @@ struct fib_info {
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 	int			fib_power;
 #endif
+	struct rcu_head		rcu;
 	struct fib_nh		fib_nh[0];
 #define fib_dev		fib_nh[0].nh_dev
 };
@@ -148,7 +149,7 @@ struct fib_table {
 };
 
 extern int fib_table_lookup(struct fib_table *tb, const struct flowi *flp,
-			    struct fib_result *res);
+			    struct fib_result *res, int fib_flags);
 extern int fib_table_insert(struct fib_table *, struct fib_config *);
 extern int fib_table_delete(struct fib_table *, struct fib_config *);
 extern int fib_table_dump(struct fib_table *table, struct sk_buff *skb,
@@ -185,11 +186,11 @@ static inline int fib_lookup(struct net *net, const struct flowi *flp,
 	struct fib_table *table;
 
 	table = fib_get_table(net, RT_TABLE_LOCAL);
-	if (!fib_table_lookup(table, flp, res))
+	if (!fib_table_lookup(table, flp, res, FIB_LOOKUP_NOREF))
 		return 0;
 
 	table = fib_get_table(net, RT_TABLE_MAIN);
-	if (!fib_table_lookup(table, flp, res))
+	if (!fib_table_lookup(table, flp, res, FIB_LOOKUP_NOREF))
 		return 0;
 	return -ENETUNREACH;
 }
@@ -254,16 +255,6 @@ static inline void fib_info_put(struct fib_info *fi)
 		free_fib_info(fi);
 }
 
-static inline void fib_res_put(struct fib_result *res)
-{
-	if (res->fi)
-		fib_info_put(res->fi);
-#ifdef CONFIG_IP_MULTIPLE_TABLES
-	if (res->r)
-		fib_rule_put(res->r);
-#endif
-}
-
 #ifdef CONFIG_PROC_FS
 extern int __net_init  fib_proc_init(struct net *net);
 extern void __net_exit fib_proc_exit(struct net *net);
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index cfb7d25..21698f8 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -225,7 +225,8 @@ jumped:
 			err = ops->action(rule, fl, flags, arg);
 
 		if (err != -EAGAIN) {
-			if (likely(atomic_inc_not_zero(&rule->refcnt))) {
+			if ((arg->flags & FIB_LOOKUP_NOREF) ||
+			    likely(atomic_inc_not_zero(&rule->refcnt))) {
 				arg->rule = rule;
 				goto out;
 			}
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index b05c23b..919f2ad 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -168,8 +168,11 @@ struct net_device *__ip_dev_find(struct net *net, __be32 addr, bool devref)
 	struct fib_result res = { 0 };
 	struct net_device *dev = NULL;
 
-	if (fib_lookup(net, &fl, &res))
+	rcu_read_lock();
+	if (fib_lookup(net, &fl, &res)) {
+		rcu_read_unlock();
 		return NULL;
+	}
 	if (res.type != RTN_LOCAL)
 		goto out;
 	dev = FIB_RES_DEV(res);
@@ -177,7 +180,7 @@ struct net_device *__ip_dev_find(struct net *net, __be32 addr, bool devref)
 	if (dev && devref)
 		dev_hold(dev);
 out:
-	fib_res_put(&res);
+	rcu_read_unlock();
 	return dev;
 }
 EXPORT_SYMBOL(__ip_dev_find);
@@ -207,11 +210,12 @@ static inline unsigned __inet_dev_addr_type(struct net *net,
 	local_table = fib_get_table(net, RT_TABLE_LOCAL);
 	if (local_table) {
 		ret = RTN_UNICAST;
-		if (!fib_table_lookup(local_table, &fl, &res)) {
+		rcu_read_lock();
+		if (!fib_table_lookup(local_table, &fl, &res, FIB_LOOKUP_NOREF)) {
 			if (!dev || dev == res.fi->fib_dev)
 				ret = res.type;
-			fib_res_put(&res);
 		}
+		rcu_read_unlock();
 	}
 	return ret;
 }
@@ -235,6 +239,7 @@ EXPORT_SYMBOL(inet_dev_addr_type);
  * - figure out what "logical" interface this packet arrived
  *   and calculate "specific destination" address.
  * - check, that packet arrived from expected physical interface.
+ * called with rcu_read_lock()
  */
 int fib_validate_source(__be32 src, __be32 dst, u8 tos, int oif,
 			struct net_device *dev, __be32 *spec_dst,
@@ -259,7 +264,6 @@ int fib_validate_source(__be32 src, __be32 dst, u8 tos, int oif,
 	struct net *net;
 
 	no_addr = rpf = accept_local = 0;
-	rcu_read_lock();
 	in_dev = __in_dev_get_rcu(dev);
 	if (in_dev) {
 		no_addr = in_dev->ifa_list == NULL;
@@ -268,7 +272,6 @@ int fib_validate_source(__be32 src, __be32 dst, u8 tos, int oif,
 		if (mark && !IN_DEV_SRC_VMARK(in_dev))
 			fl.mark = 0;
 	}
-	rcu_read_unlock();
 
 	if (in_dev == NULL)
 		goto e_inval;
@@ -278,7 +281,7 @@ int fib_validate_source(__be32 src, __be32 dst, u8 tos, int oif,
 		goto last_resort;
 	if (res.type != RTN_UNICAST) {
 		if (res.type != RTN_LOCAL || !accept_local)
-			goto e_inval_res;
+			goto e_inval;
 	}
 	*spec_dst = FIB_RES_PREFSRC(res);
 	fib_combine_itag(itag, &res);
@@ -299,10 +302,8 @@ int fib_validate_source(__be32 src, __be32 dst, u8 tos, int oif,
 #endif
 	if (dev_match) {
 		ret = FIB_RES_NH(res).nh_scope >= RT_SCOPE_HOST;
-		fib_res_put(&res);
 		return ret;
 	}
-	fib_res_put(&res);
 	if (no_addr)
 		goto last_resort;
 	if (rpf == 1)
@@ -315,7 +316,6 @@ int fib_validate_source(__be32 src, __be32 dst, u8 tos, int oif,
 			*spec_dst = FIB_RES_PREFSRC(res);
 			ret = FIB_RES_NH(res).nh_scope >= RT_SCOPE_HOST;
 		}
-		fib_res_put(&res);
 	}
 	return ret;
 
@@ -326,8 +326,6 @@ last_resort:
 	*itag = 0;
 	return 0;
 
-e_inval_res:
-	fib_res_put(&res);
 e_inval:
 	return -EINVAL;
 e_rpf:
@@ -873,15 +871,16 @@ static void nl_fib_lookup(struct fib_result_nl *frn, struct fib_table *tb)
 		local_bh_disable();
 
 		frn->tb_id = tb->tb_id;
-		frn->err = fib_table_lookup(tb, &fl, &res);
+		rcu_read_lock();
+		frn->err = fib_table_lookup(tb, &fl, &res, FIB_LOOKUP_NOREF);
 
 		if (!frn->err) {
 			frn->prefixlen = res.prefixlen;
 			frn->nh_sel = res.nh_sel;
 			frn->type = res.type;
 			frn->scope = res.scope;
-			fib_res_put(&res);
 		}
+		rcu_read_unlock();
 		local_bh_enable();
 	}
 }
diff --git a/net/ipv4/fib_hash.c b/net/ipv4/fib_hash.c
index 4ed7e0d..83cca68 100644
--- a/net/ipv4/fib_hash.c
+++ b/net/ipv4/fib_hash.c
@@ -244,7 +244,8 @@ fn_new_zone(struct fn_hash *table, int z)
 }
 
 int fib_table_lookup(struct fib_table *tb,
-		     const struct flowi *flp, struct fib_result *res)
+		     const struct flowi *flp, struct fib_result *res,
+		     int fib_flags)
 {
 	int err;
 	struct fn_zone *fz;
@@ -264,7 +265,7 @@ int fib_table_lookup(struct fib_table *tb,
 
 			err = fib_semantic_match(&f->fn_alias,
 						 flp, res,
-						 fz->fz_order);
+						 fz->fz_order, fib_flags);
 			if (err <= 0)
 				goto out;
 		}
diff --git a/net/ipv4/fib_lookup.h b/net/ipv4/fib_lookup.h
index 637b133..b9c9a9f 100644
--- a/net/ipv4/fib_lookup.h
+++ b/net/ipv4/fib_lookup.h
@@ -22,7 +22,7 @@ struct fib_alias {
 /* Exported by fib_semantics.c */
 extern int fib_semantic_match(struct list_head *head,
 			      const struct flowi *flp,
-			      struct fib_result *res, int prefixlen);
+			      struct fib_result *res, int prefixlen, int fib_flags);
 extern void fib_release_info(struct fib_info *);
 extern struct fib_info *fib_create_info(struct fib_config *cfg);
 extern int fib_nh_match(struct fib_config *cfg, struct fib_info *fi);
diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
index 3230052..7981a24 100644
--- a/net/ipv4/fib_rules.c
+++ b/net/ipv4/fib_rules.c
@@ -57,6 +57,7 @@ int fib_lookup(struct net *net, struct flowi *flp, struct fib_result *res)
 {
 	struct fib_lookup_arg arg = {
 		.result = res,
+		.flags = FIB_LOOKUP_NOREF,
 	};
 	int err;
 
@@ -94,7 +95,7 @@ static int fib4_rule_action(struct fib_rule *rule, struct flowi *flp,
 	if (!tbl)
 		goto errout;
 
-	err = fib_table_lookup(tbl, flp, (struct fib_result *) arg->result);
+	err = fib_table_lookup(tbl, flp, (struct fib_result *) arg->result, arg->flags);
 	if (err > 0)
 		err = -EAGAIN;
 errout:
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index ba52f39..0f80dfc 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -148,6 +148,13 @@ static const struct
 
 /* Release a nexthop info record */
 
+static void free_fib_info_rcu(struct rcu_head *head)
+{
+	struct fib_info *fi = container_of(head, struct fib_info, rcu);
+
+	kfree(fi);
+}
+
 void free_fib_info(struct fib_info *fi)
 {
 	if (fi->fib_dead == 0) {
@@ -161,7 +168,7 @@ void free_fib_info(struct fib_info *fi)
 	} endfor_nexthops(fi);
 	fib_info_cnt--;
 	release_net(fi->fib_net);
-	kfree(fi);
+	call_rcu(&fi->rcu, free_fib_info_rcu);
 }
 
 void fib_release_info(struct fib_info *fi)
@@ -553,6 +560,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 			nh->nh_scope = RT_SCOPE_LINK;
 			return 0;
 		}
+		rcu_read_lock();
 		{
 			struct flowi fl = {
 				.nl_u = {
@@ -568,8 +576,10 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 			if (fl.fl4_scope < RT_SCOPE_LINK)
 				fl.fl4_scope = RT_SCOPE_LINK;
 			err = fib_lookup(net, &fl, &res);
-			if (err)
+			if (err) {
+				rcu_read_unlock();
 				return err;
+			}
 		}
 		err = -EINVAL;
 		if (res.type != RTN_UNICAST && res.type != RTN_LOCAL)
@@ -585,7 +595,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 			goto out;
 		err = 0;
 out:
-		fib_res_put(&res);
+		rcu_read_unlock();
 		return err;
 	} else {
 		struct in_device *in_dev;
@@ -879,7 +889,7 @@ failure:
 
 /* Note! fib_semantic_match intentionally uses  RCU list functions. */
 int fib_semantic_match(struct list_head *head, const struct flowi *flp,
-		       struct fib_result *res, int prefixlen)
+		       struct fib_result *res, int prefixlen, int fib_flags)
 {
 	struct fib_alias *fa;
 	int nh_sel = 0;
@@ -943,7 +953,8 @@ out_fill_res:
 	res->type = fa->fa_type;
 	res->scope = fa->fa_scope;
 	res->fi = fa->fa_info;
-	atomic_inc(&res->fi->fib_clntref);
+	if (!(fib_flags & FIB_LOOKUP_NOREF))
+		atomic_inc(&res->fi->fib_clntref);
 	return 0;
 }
 
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index a96e5ec..271c89b 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1342,7 +1342,7 @@ err:
 /* should be called with rcu_read_lock */
 static int check_leaf(struct trie *t, struct leaf *l,
 		      t_key key,  const struct flowi *flp,
-		      struct fib_result *res)
+		      struct fib_result *res, int fib_flags)
 {
 	struct leaf_info *li;
 	struct hlist_head *hhead = &l->list;
@@ -1356,7 +1356,7 @@ static int check_leaf(struct trie *t, struct leaf *l,
 		if (l->key != (key & ntohl(mask)))
 			continue;
 
-		err = fib_semantic_match(&li->falh, flp, res, plen);
+		err = fib_semantic_match(&li->falh, flp, res, plen, fib_flags);
 
 #ifdef CONFIG_IP_FIB_TRIE_STATS
 		if (err <= 0)
@@ -1372,7 +1372,7 @@ static int check_leaf(struct trie *t, struct leaf *l,
 }
 
 int fib_table_lookup(struct fib_table *tb, const struct flowi *flp,
-		     struct fib_result *res)
+		     struct fib_result *res, int fib_flags)
 {
 	struct trie *t = (struct trie *) tb->tb_data;
 	int ret;
@@ -1399,7 +1399,7 @@ int fib_table_lookup(struct fib_table *tb, const struct flowi *flp,
 
 	/* Just a leaf? */
 	if (IS_LEAF(n)) {
-		ret = check_leaf(t, (struct leaf *)n, key, flp, res);
+		ret = check_leaf(t, (struct leaf *)n, key, flp, res, fib_flags);
 		goto found;
 	}
 
@@ -1424,7 +1424,7 @@ int fib_table_lookup(struct fib_table *tb, const struct flowi *flp,
 		}
 
 		if (IS_LEAF(n)) {
-			ret = check_leaf(t, (struct leaf *)n, key, flp, res);
+			ret = check_leaf(t, (struct leaf *)n, key, flp, res, fib_flags);
 			if (ret > 0)
 				goto backtrace;
 			goto found;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 04e0df8..7864d0c 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1773,12 +1773,15 @@ void ip_rt_get_source(u8 *addr, struct rtable *rt)
 
 	if (rt->fl.iif == 0)
 		src = rt->rt_src;
-	else if (fib_lookup(dev_net(rt->dst.dev), &rt->fl, &res) == 0) {
-		src = FIB_RES_PREFSRC(res);
-		fib_res_put(&res);
-	} else
-		src = inet_select_addr(rt->dst.dev, rt->rt_gateway,
+	else {
+		rcu_read_lock();
+		if (fib_lookup(dev_net(rt->dst.dev), &rt->fl, &res) == 0)
+			src = FIB_RES_PREFSRC(res);
+		else
+			src = inet_select_addr(rt->dst.dev, rt->rt_gateway,
 					RT_SCOPE_UNIVERSE);
+		rcu_read_unlock();
+	}
 	memcpy(addr, &src, 4);
 }
 
@@ -2081,6 +2084,7 @@ static int ip_mkroute_input(struct sk_buff *skb,
  *	Such approach solves two big problems:
  *	1. Not simplex devices are handled properly.
  *	2. IP spoofing attempts are filtered with 100% of guarantee.
+ *	called with rcu_read_lock()
  */
 
 static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
@@ -2102,7 +2106,6 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	unsigned	hash;
 	__be32		spec_dst;
 	int		err = -EINVAL;
-	int		free_res = 0;
 	struct net    * net = dev_net(dev);
 
 	/* IP on this device is disabled. */
@@ -2134,12 +2137,12 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	/*
 	 *	Now we are ready to route packet.
 	 */
-	if ((err = fib_lookup(net, &fl, &res)) != 0) {
+	err = fib_lookup(net, &fl, &res);
+	if (err != 0) {
 		if (!IN_DEV_FORWARD(in_dev))
 			goto e_hostunreach;
 		goto no_route;
 	}
-	free_res = 1;
 
 	RT_CACHE_STAT_INC(in_slow_tot);
 
@@ -2148,8 +2151,8 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 
 	if (res.type == RTN_LOCAL) {
 		err = fib_validate_source(saddr, daddr, tos,
-					     net->loopback_dev->ifindex,
-					     dev, &spec_dst, &itag, skb->mark);
+					  net->loopback_dev->ifindex,
+					  dev, &spec_dst, &itag, skb->mark);
 		if (err < 0)
 			goto martian_source_keep_err;
 		if (err)
@@ -2164,9 +2167,6 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 		goto martian_destination;
 
 	err = ip_mkroute_input(skb, &res, &fl, in_dev, daddr, saddr, tos);
-done:
-	if (free_res)
-		fib_res_put(&res);
 out:	return err;
 
 brd_input:
@@ -2226,7 +2226,7 @@ local_input:
 	rth->rt_type	= res.type;
 	hash = rt_hash(daddr, saddr, fl.iif, rt_genid(net));
 	err = rt_intern_hash(hash, rth, NULL, skb, fl.iif);
-	goto done;
+	goto out;
 
 no_route:
 	RT_CACHE_STAT_INC(in_no_route);
@@ -2249,21 +2249,21 @@ martian_destination:
 
 e_hostunreach:
 	err = -EHOSTUNREACH;
-	goto done;
+	goto out;
 
 e_inval:
 	err = -EINVAL;
-	goto done;
+	goto out;
 
 e_nobufs:
 	err = -ENOBUFS;
-	goto done;
+	goto out;
 
 martian_source:
 	err = -EINVAL;
 martian_source_keep_err:
 	ip_handle_martian_source(dev, in_dev, skb, daddr, saddr);
-	goto done;
+	goto out;
 }
 
 int ip_route_input_common(struct sk_buff *skb, __be32 daddr, __be32 saddr,
@@ -2349,6 +2349,7 @@ skip_cache:
 }
 EXPORT_SYMBOL(ip_route_input_common);
 
+/* called with rcu_read_lock() */
 static int __mkroute_output(struct rtable **result,
 			    struct fib_result *res,
 			    const struct flowi *fl,
@@ -2373,18 +2374,13 @@ static int __mkroute_output(struct rtable **result,
 	if (dev_out->flags & IFF_LOOPBACK)
 		flags |= RTCF_LOCAL;
 
-	rcu_read_lock();
 	in_dev = __in_dev_get_rcu(dev_out);
-	if (!in_dev) {
-		rcu_read_unlock();
+	if (!in_dev)
 		return -EINVAL;
-	}
+
 	if (res->type == RTN_BROADCAST) {
 		flags |= RTCF_BROADCAST | RTCF_LOCAL;
-		if (res->fi) {
-			fib_info_put(res->fi);
-			res->fi = NULL;
-		}
+		res->fi = NULL;
 	} else if (res->type == RTN_MULTICAST) {
 		flags |= RTCF_MULTICAST | RTCF_LOCAL;
 		if (!ip_check_mc(in_dev, oldflp->fl4_dst, oldflp->fl4_src,
@@ -2394,10 +2390,8 @@ static int __mkroute_output(struct rtable **result,
 		 * default one, but do not gateway in this case.
 		 * Yes, it is hack.
 		 */
-		if (res->fi && res->prefixlen < 4) {
-			fib_info_put(res->fi);
+		if (res->fi && res->prefixlen < 4)
 			res->fi = NULL;
-		}
 	}
 
 
@@ -2467,6 +2461,7 @@ static int __mkroute_output(struct rtable **result,
 	return 0;
 }
 
+/* called with rcu_read_lock() */
 static int ip_mkroute_output(struct rtable **rp,
 			     struct fib_result *res,
 			     const struct flowi *fl,
@@ -2509,7 +2504,6 @@ static int ip_route_output_slow(struct net *net, struct rtable **rp,
 	struct fib_result res;
 	unsigned int flags = 0;
 	struct net_device *dev_out = NULL;
-	int free_res = 0;
 	int err;
 
 
@@ -2636,15 +2630,12 @@ static int ip_route_output_slow(struct net *net, struct rtable **rp,
 		err = -ENETUNREACH;
 		goto out;
 	}
-	free_res = 1;
 
 	if (res.type == RTN_LOCAL) {
 		if (!fl.fl4_src)
 			fl.fl4_src = fl.fl4_dst;
 		dev_out = net->loopback_dev;
 		fl.oif = dev_out->ifindex;
-		if (res.fi)
-			fib_info_put(res.fi);
 		res.fi = NULL;
 		flags |= RTCF_LOCAL;
 		goto make_route;
@@ -2668,8 +2659,6 @@ static int ip_route_output_slow(struct net *net, struct rtable **rp,
 make_route:
 	err = ip_mkroute_output(rp, &res, &fl, oldflp, dev_out, flags);
 
-	if (free_res)
-		fib_res_put(&res);
 out:	return err;
 }
 



^ permalink raw reply related

* Re: [net-next PATCH] igb: update adapter stats when reading /proc/net/dev.
From: Eric Dumazet @ 2010-10-05 21:01 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: David S. Miller, netdev, Jeff Kirsher
In-Reply-To: <1286291947.2796.387.camel@edumazet-laptop>

Le mardi 05 octobre 2010 à 17:19 +0200, Eric Dumazet a écrit :
> Le mardi 05 octobre 2010 à 16:53 +0200, Jesper Dangaard Brouer a écrit :
> 
> > Its already racy, because "ethtool -S" reads out the stats immediately,
> > and thus races with the timer.
> > 
> > See: igb_ethtool.c
> >  igb_get_ethtool_stats() invoke igb_update_stats(adapter);
> > 
> 
> You would be surprised how many bugs are waiting to be found and
> fixed ;)
> 
> 

I took a look at igb stats, and it appears they also are racy on 32bit
arches. igb uses u64 counters, with no synchronization between
producers(writers) and consumers(readers).

Some fixes are needed ;)




^ permalink raw reply

* Re: [net-next PATCH] igb: update adapter stats when reading /proc/net/dev.
From: Jesper Dangaard Brouer @ 2010-10-05 21:16 UTC (permalink / raw)
  To: Eric Dumazet, Alexander Duyck
  Cc: Jesper Dangaard Brouer, David S. Miller, netdev, Jeff Kirsher
In-Reply-To: <1286312479.2593.35.camel@edumazet-laptop>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1133 bytes --]

On Tue, 5 Oct 2010, Eric Dumazet wrote:

> Le mardi 05 octobre 2010 à 17:19 +0200, Eric Dumazet a écrit :
>> Le mardi 05 octobre 2010 à 16:53 +0200, Jesper Dangaard Brouer a écrit :
>>
>>> Its already racy, because "ethtool -S" reads out the stats immediately,
>>> and thus races with the timer.
>>>
>>> See: igb_ethtool.c
>>>  igb_get_ethtool_stats() invoke igb_update_stats(adapter);
>>>
>>
>> You would be surprised how many bugs are waiting to be found and
>> fixed ;)
>
> I took a look at igb stats, and it appears they also are racy on 32bit
> arches. igb uses u64 counters, with no synchronization between
> producers(writers) and consumers(readers).

Are you saying, that we need more than a simple mutex in 
igb_update_stats()?


> Some fixes are needed ;)

The question is then if Intel wants to fix it, or let it be up to you or 
me?

Cheers,
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------

^ permalink raw reply

* Re: BUG - qdev - partial loss of network connectivity
From: Leszek Urbanski @ 2010-10-05 21:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Anthony Liguori, qemu-devel-qX2TKyscuCcdnm+yROfE0A,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
In-Reply-To: <20100927213203.GA28089-4cL5GMJPxME@public.gmane.org>

<20100927213203.GA28089-4cL5GMJPxME@public.gmane.org>; from Leszek Urbanski on Mon, Sep 27, 2010 at 23:32:03 +0200

> > > > >It's vanilla 2.6.32.22, but I also reproduced this on Debian's 2.6.32-23
> > > > >(based on 2.6.32.21).
> > > > >
> > > > >If offload is the only difference, I'll play with different offload
> > > > >options and check which one causes it.
> > > > >   
> > > > 
> > > > It's not technically the only difference but it's the most likely 
> > > > culprit IMHO.
> > > 
> > > udp fragmentation offload is definitely the culprit.
> > 
> > I see. Most likely guest bug - won't be the first bug around UFO.
> > If so pls copy netdev linux-nfs and virtualization.
> > Do you see anything in dmesg? Can try 2.6.36-rc5?
> 
> (for reference: first post is at:
> http://lists.nongnu.org/archive/html/qemu-devel/2010-09/msg01685.html )
> 
> I can't reproduce it on 2.6.36-rc5. Do you have an idea which patch may have

This one definitely fixes it: http://patchwork.ozlabs.org/patch/55643/

(Herbert Xu: udp: Fix bogus UFO packet generation)

The patch works with 2.6.32.24 too - it should probably be backported to
2.6.32.x.


-- 
Leszek "Tygrys" Urbanski, SCSA, SCNA
 "Unix-to-Unix Copy Program;" said PDP-1. "You will never find a more
  wretched hive of bugs and flamers. We must be cautious." -- DECWARS
     http://cygnus.moo.pl/ -- Cygnus High Altitude Balloon
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next] net: add a core netdev->rx_dropped counter
From: David Miller @ 2010-10-05 21:48 UTC (permalink / raw)
  To: eric.dumazet; +Cc: jesse, rl, netdev, kaber
In-Reply-To: <1285916815.2705.152.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 01 Oct 2010 09:06:55 +0200

> [PATCH net-next] net: add a core netdev->rx_dropped counter
> 
> In various situations, a device provides a packet to our stack and we
> drop it before it enters protocol stack :
> - softnet backlog full (accounted in /proc/net/softnet_stat)
> - bad vlan tag (not accounted)
> - unknown/unregistered protocol (not accounted)
> 
> We can handle a per-device counter of such dropped frames at core level,
> and automatically adds it to the device provided stats (rx_dropped), so
> that standard tools can be used (ifconfig, ip link, cat /proc/net/dev)
> 
> This is a generalization of commit 8990f468a (net: rx_dropped
> accounting), thus reverting it.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Looks good to me, applied, thanks.

^ permalink raw reply

* Re: [Patch] Limit sysctl_tcp_mem and sysctl_udp_mem initializers to prevent integer overflows.
From: David Miller @ 2010-10-05 21:50 UTC (permalink / raw)
  To: holt; +Cc: w, linux-kernel, netdev, kuznet, pekkas, jmorris, yoshfuji, kaber
In-Reply-To: <20101002112419.248437367@gulag1.americas.sgi.com>

From: Robin Holt <holt@sgi.com>
Date: Sat, 02 Oct 2010 06:24:06 -0500

> Subject: [Patch] Limit sysctl_tcp_mem and sysctl_udp_mem initializers to prevent integer overflows.
> 
> On a 16TB x86_64 machine, sysctl_tcp_mem[2], sysctl_udp_mem[2], and
> sysctl_sctp_mem[2] can integer overflow.  Set limit such that they are
> maximized without overflowing.
> 
> Signed-off-by: Robin Holt <holt@sgi.com>

Robin please resubmit this with the SCTP bits included.

^ permalink raw reply

* Re: [Patch] Limit sysctl_tcp_mem and sysctl_udp_mem initializers to prevent integer overflows.
From: David Miller @ 2010-10-05 21:50 UTC (permalink / raw)
  To: eric.dumazet
  Cc: holt, akpm, w, linux-kernel, netdev, kuznet, pekkas, jmorris,
	yoshfuji, kaber
In-Reply-To: <1286025736.2582.1827.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 02 Oct 2010 15:22:16 +0200

> [PATCH] net: avoid limits overflow
> 
> Robin Holt tried to boot a 16TB machine and found some limits were
> reached : sysctl_tcp_mem[2], sysctl_udp_mem[2]
> 
> We can switch infrastructure to use long "instead" of "int", now
> atomic_long_t primitives are available for free.
> 
> Reported-by: Robin Holt <holt@sgi.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Eric please resubmit this when the sysctl fix is resolved.

Thanks.

^ permalink raw reply

* Re: [PATCH net-next] net neigh: neigh_delete() and neigh_add() changes
From: David Miller @ 2010-10-05 21:55 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1286202456.18293.293.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 04 Oct 2010 16:27:36 +0200

> neigh_delete() and neigh_add() dont need to touch device refcount,
> we hold RTNL when calling them, so device cannot disappear under us.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] net neigh: RCU conversion of neigh hash table
From: David Miller @ 2010-10-05 21:55 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1286208944.18293.349.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 04 Oct 2010 18:15:44 +0200

> [PATCH net-next] net neigh: RCU conversion of neigh hash table
> 
> Instead of storing hash_buckets, hash_mask and hash_rnd in "struct
> neigh_table", a new structure is defined :
> 
> struct neigh_hash_table {
>        struct neighbour        **hash_buckets;
>        unsigned int            hash_mask;
>        __u32                   hash_rnd;
>        struct rcu_head         rcu;
> };
> 
> And "struct neigh_table" has an RCU protected pointer to such a
> neigh_hash_table.
> 
> This means the signature of (*hash)() function changed: We need to add a
> third parameter with the actual hash_rnd value, since this is not
> anymore a neigh_table field.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Also applied.

Looking good so far! :)

^ permalink raw reply

* Re: [PATCH] AF_UNIX: Implement SO_TIMESTAMP and SO_TIMETAMPNS on Unix sockets
From: David Miller @ 2010-10-05 21:55 UTC (permalink / raw)
  To: eric.dumazet
  Cc: alban.crequy, lennart, shemminger, gorcunov, adobriyan,
	linux-kernel, netdev
In-Reply-To: <1286220052.2381.70.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 04 Oct 2010 21:20:52 +0200

> Le lundi 04 octobre 2010 à 19:48 +0100, Alban Crequy a écrit :
>> Userspace applications can already request to receive timestamps with:
>> setsockopt(sockfd, SOL_SOCKET, SO_TIMESTAMP, ...)
>> 
>> Although setsockopt() returns zero (success), timestamps are not added to the
>> ancillary data. This patch fixes that on SOCK_DGRAM and SOCK_SEQPACKET Unix
>> sockets.
>> 
>> Signed-off-by: Alban Crequy <alban.crequy@collabora.co.uk>
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH] skge: add quirk to limit DMA
From: Stephen Hemminger @ 2010-10-05 21:58 UTC (permalink / raw)
  To: David Miller; +Cc: sgruszka, netdev, luya
In-Reply-To: <20101005.001800.13732127.davem@davemloft.net>

On Tue, 05 Oct 2010 00:18:00 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Stanislaw Gruszka <sgruszka@redhat.com>
> Date: Wed, 29 Sep 2010 11:33:23 +0200
> 
> > Skge devices installed on some Gigabyte motherboards are not able to
> > perform 64 dma correctly due to board PCI implementation, so limit
> > DMA to 32bit if such boards are detected.
> > 
> > Bug was reported here:
> > https://bugzilla.redhat.com/show_bug.cgi?id=447489
> > 
> > Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
> > Tested-by: Luya Tshimbalanga <luya@fedoraproject.org>
> 
> Stephen?  Can I get an ACK or some kind of other status on this?

I was hoping to find the hardware somewhere to dig deeper into
this. But until I know more please apply the patch. There are
two possibilities that still exist, 1) it is true for all devices
on this motherboard (in which cases it should be a PCI quirk),
2) it is a driver bug. The test was going to be putting a skge
pci card in a slot on the MB.


^ permalink raw reply

* Re: [PATCH] SIW: Module initialization
From: Stephen Hemminger @ 2010-10-05 22:02 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Bernard Metzler, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <AANLkTik5=i+A5_OpU0rVyfYi=ibgS9UWMu82vMCmM=PN-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, 5 Oct 2010 12:57:21 +0200
Bart Van Assche <bvanassche-HInyCGIudOg@public.gmane.org> wrote:

> > + * TODO: Dynamic device management (network device registration/removal).  
> 
> The current implementation is such that one siw device is created for
> each network device found at kernel module load time. That means that
> you force the user to load the siw kernel module after all other
> kernel modules that register a network device. I'm not sure that's a
> good idea.

Then device should be controlled by a netlink (rtnl_link_ops) style
interface see vlan_netlink.c. Using netlink is extensible and provides
a cleaner interface than all these other parameterization methods.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] bonding: fix to rejoin multicast groups immediately
From: Flavio Leitner @ 2010-10-05 22:04 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20101005.132525.186293128.davem@davemloft.net>

On Tue, Oct 05, 2010 at 01:25:25PM -0700, David Miller wrote:
> From: Flavio Leitner <fbl@redhat.com>
> Date: Tue, 5 Oct 2010 11:34:30 -0300
> 
> > On Tue, Oct 05, 2010 at 12:13:38AM -0700, David Miller wrote:
> >> From: Flavio Leitner <fleitner@redhat.com>
> >> Date: Wed, 29 Sep 2010 04:12:07 -0300
> >> 
> >> > It should rejoin multicast groups immediately when
> >> > the failover happens to restore the multicast traffic.
> >> > 
> >> > Signed-off-by: Flavio Leitner <fleitner@redhat.com>
> >> 
> >> I suspect the IGMPv3 handling via a delayed action, as is currently
> >> implemented, is on purpose and is done so to follow the specification
> >> of the IGMPv3 RFCs.
> >> 
> >> Therefore you have to explain why your new behavior is so desirable
> >> and in particular why something as undesirable as violating the RFCs
> >> is therefore warranted.
> > 
> > That patch only changes the behavior for bonding during a link
> > failure, so if we have a bonding in active-backup or any other mode
> > with current-active-slave, the initialization will happen just fine
> > following IGMP specs.
> > 
> > However, neither the backup slave interface nor the backup switch
> > connected to backup slave knows about mcast. Thus when a link failure
> > happens, we shouldn't rely on timers to not stay out of the mcast
> > group losing traffic.
> > 
> > E.g. The V1 specs says that we shouldn't send any membership report
> > if it has been one in the last minute because that means the switch
> > is notified and the system will receive mcast traffic for that group. 
> > Therefore, if it sees one and a link failure happens right after that,
> > the backup slave will send another membership report only one minute
> > later. During this time the system loses traffic.
> 
> All of this valuable information belongs in the commit log message.
> 
> Please resubmit your patch with all of the necessary contextual
> information and reasoning explaining in the commit message.

Sure. I'll post replying to another thread so that the patch
can be applied in correct order.

thanks for reviewing it
-- 
Flavio

^ permalink raw reply

* Re: [PATCH v2] bonding: rejoin multicast groups on VLANs
From: Flavio Leitner @ 2010-10-05 22:07 UTC (permalink / raw)
  To: Andy Gospodarek; +Cc: netdev, bonding-devel, Jay Vosburgh
In-Reply-To: <20101004132409.GA7497@gospo.rdu.redhat.com>

On Mon, Oct 04, 2010 at 09:24:10AM -0400, Andy Gospodarek wrote:
> On Thu, Sep 30, 2010 at 05:45:24PM -0300, Flavio Leitner wrote:
> > It fixes bonding to rejoin multicast groups added
> > to VLAN devices on top of bonding when a failover
> > happens.
> > 
> > Signed-off-by: Flavio Leitner <fleitner@redhat.com>
> > ---
> >  drivers/net/bonding/bond_main.c |   61 +++++++++++++++++++++++++++++++++-----
> >  drivers/net/bonding/bonding.h   |    1 +
> >  2 files changed, 54 insertions(+), 8 deletions(-)
> > 
> [...]
> > @@ -944,7 +979,9 @@ static void bond_mc_swap(struct bonding *bond, struct slave *new_active,
> >  
> >  		netdev_for_each_mc_addr(ha, bond->dev)
> >  			dev_mc_add(new_active->dev, ha->addr);
> > -		bond_resend_igmp_join_requests(bond);
> > +
> > +		/* rejoin multicast groups */
> > +		queue_delayed_work(bond->wq, &bond->mcast_work, 0);
> >  	}
> >  }
> >  
> 
> I was hoping that one patch would just make the changes so
> retransmission was supported on VLANs and the second patch would queue
> the work and add the tunable for multiple retransmissions, but I guess
> it wasn't clear enough.
> 
> I felt like that would divide the patches up into the bug-fix (VLANs +
> multicast not working) and the new feature (multiple retransmissions
> from the workqueue).

I'm not seeing how to that because the first patch changes to send
the membership directly and not using timers anymore, so the call
trace usually is:

 write_lock_bh(&bond->curr_slave_lock); <-- hold lock
 bond_select_active_slave() 
  ->bond_change_active_slave()
     ->bond_mc_swap()
        ->tx membership reports
 

then trying to send the IGMP packet directly, it will end up at:
  bond_start_xmit()
   ->bond_xmit_activebackup() (mode=1, for example)
      ->read_lock(&bond->lock);
      ->read_lock(&bond->curr_slave_lock); <-- deadlock


> I will test these out this morning and make sure things look good.

I have a better patch sequence which I'm planning post replying to
this thread as soon as I finish up testing them. It is still using
workqueue though.

--
Flavio

^ permalink raw reply

* Re: [PATCH] bonding: fix to rejoin multicast groups immediately
From: David Miller @ 2010-10-05 22:09 UTC (permalink / raw)
  To: fbl; +Cc: netdev
In-Reply-To: <20101005220436.GA19931@redhat.com>

From: Flavio Leitner <fbl@redhat.com>
Date: Tue, 5 Oct 2010 19:04:36 -0300

> Sure. I'll post replying to another thread so that the patch
> can be applied in correct order.
> 
> thanks for reviewing it

No problem, thank you.

^ permalink raw reply

* Re: [PATCH] skge: add quirk to limit DMA
From: David Miller @ 2010-10-05 22:09 UTC (permalink / raw)
  To: shemminger; +Cc: sgruszka, netdev, luya
In-Reply-To: <20101006065845.5759dbeb@s6510>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Wed, 6 Oct 2010 06:58:45 +0900

> On Tue, 05 Oct 2010 00:18:00 -0700 (PDT)
> David Miller <davem@davemloft.net> wrote:
> 
>> From: Stanislaw Gruszka <sgruszka@redhat.com>
>> Date: Wed, 29 Sep 2010 11:33:23 +0200
>> 
>> > Skge devices installed on some Gigabyte motherboards are not able to
>> > perform 64 dma correctly due to board PCI implementation, so limit
>> > DMA to 32bit if such boards are detected.
>> > 
>> > Bug was reported here:
>> > https://bugzilla.redhat.com/show_bug.cgi?id=447489
>> > 
>> > Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
>> > Tested-by: Luya Tshimbalanga <luya@fedoraproject.org>
>> 
>> Stephen?  Can I get an ACK or some kind of other status on this?
> 
> I was hoping to find the hardware somewhere to dig deeper into
> this. But until I know more please apply the patch. There are
> two possibilities that still exist, 1) it is true for all devices
> on this motherboard (in which cases it should be a PCI quirk),
> 2) it is a driver bug. The test was going to be putting a skge
> pci card in a slot on the MB.

Ok, thanks Stephen, I'll apply his patch for now.

^ permalink raw reply

* [PATCH] iwlwifi: fix iwlwifi scanning corner cases
From: Florian Mickler @ 2010-10-05 22:21 UTC (permalink / raw)
  To: linville-2XuSBdqkA4R54TAoqtyWWQ
  Cc: stable-DgEjT+Ai2ygdnm+yROfE0A,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	wey-yi.w.guy-ral2JQCrhuEAvxtiuMwx3w,
	reinette.chatre-ral2JQCrhuEAvxtiuMwx3w,
	ilw-VuQAYsv1563Yd54FQh9/CA, johannes.berg-ral2JQCrhuEAvxtiuMwx3w,
	ben.m.cahill-ral2JQCrhuEAvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	sgruszka-H+wXaHxf7aLQT0dZR+AlfA, Florian Mickler
In-Reply-To: <20101005085717.GA18012-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Stanislaw Gruszka pointed out, that commit
"iwl3945: queue the right work if the scan needs to be aborted"
has an awkward definition of "right". Specifically the abort_scan work
doesn't notify the generic wireless stack that the scan was aborted.

In order to get rid of the warning in bug
https://bugzilla.kernel.org/show_bug.cgi?id=17722
we inform ieee80211_scan_completed that we are aborting
the scan by setting the apropriate status bit in request_scan and 
pass it into ieee80211_scan_completed. 

Signed-off-by: Florian Mickler <florian-sVu6HhrpSfRAfugRpC6u6w@public.gmane.org>
---

Hi John!

Here is the fix that Stanislaw described. 
(Yes, it is in a brown paper bag.)  Please wait for him to review this. 

Another option would be to just revert my previous patch and live with the 
warning until the scanning rework hit's mainline.

Regards,
Flo




 drivers/net/wireless/iwlwifi/iwl-agn-lib.c  |    4 +++-
 drivers/net/wireless/iwlwifi/iwl-scan.c     |    6 ++++--
 drivers/net/wireless/iwlwifi/iwl3945-base.c |    3 ++-
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/net/wireless/iwlwifi/iwl-agn-lib.c b/drivers/net/wireless/iwlwifi/iwl-agn-lib.c
index 8fd00a6..2d26767 100644
--- a/drivers/net/wireless/iwlwifi/iwl-agn-lib.c
+++ b/drivers/net/wireless/iwlwifi/iwl-agn-lib.c
@@ -1410,8 +1410,10 @@ void iwlagn_request_scan(struct iwl_priv *priv, struct ieee80211_vif *vif)
 	*/
 	clear_bit(STATUS_SCAN_HW, &priv->status);
 	clear_bit(STATUS_SCANNING, &priv->status);
+
 	/* inform mac80211 scan aborted */
-	queue_work(priv->workqueue, &priv->abort_scan);
+	set_bit(STATUS_SCAN_ABORTING, &priv->status);
+	queue_work(priv->workqueue, &priv->scan_completed);
 }
 
 int iwlagn_manage_ibss_station(struct iwl_priv *priv,
diff --git a/drivers/net/wireless/iwlwifi/iwl-scan.c b/drivers/net/wireless/iwlwifi/iwl-scan.c
index a4b3663..fedf384 100644
--- a/drivers/net/wireless/iwlwifi/iwl-scan.c
+++ b/drivers/net/wireless/iwlwifi/iwl-scan.c
@@ -552,8 +552,10 @@ void iwl_bg_scan_completed(struct work_struct *work)
 	 * into driver again into functions that will attempt to take
 	 * mutex.
 	 */
-	if (!internal)
-		ieee80211_scan_completed(priv->hw, false);
+	if (!internal) {
+		bool aborted = test_bit(STATUS_SCAN_ABORTING, &priv->status);
+		ieee80211_scan_completed(priv->hw, aborted);
+	}
 }
 EXPORT_SYMBOL(iwl_bg_scan_completed);
 
diff --git a/drivers/net/wireless/iwlwifi/iwl3945-base.c b/drivers/net/wireless/iwlwifi/iwl3945-base.c
index d31661c..da10588 100644
--- a/drivers/net/wireless/iwlwifi/iwl3945-base.c
+++ b/drivers/net/wireless/iwlwifi/iwl3945-base.c
@@ -3018,7 +3018,8 @@ void iwl3945_request_scan(struct iwl_priv *priv, struct ieee80211_vif *vif)
 	clear_bit(STATUS_SCANNING, &priv->status);
 
 	/* inform mac80211 scan aborted */
-	queue_work(priv->workqueue, &priv->abort_scan);
+	set_bit(STATUS_SCAN_ABORTING, &priv->status);
+	queue_work(priv->workqueue, &priv->scan_completed);
 }
 
 static void iwl3945_bg_restart(struct work_struct *data)
-- 
1.7.3

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH] iwl3945: queue the right work if the scan needs to be aborted
From: Florian Mickler @ 2010-10-05 22:29 UTC (permalink / raw)
  To: stable-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Stanislaw Gruszka,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Guy, Wey-Yi, Chatre, Reinette, Intel Linux Wireless,
	John W. Linville, Berg, Johannes, Cahill, Ben M,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20101005104357.GB18833-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Tue, 5 Oct 2010 12:43:58 +0200
Stanislaw Gruszka <sgruszka-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

> > > 
> > > However, I do not think we should go with that to -stable (below
> > > 2.6.36). IIRC warnings showed up in current 2.6.36-rc, because of
> > > some other changes in the code.
> > > 
> > > Stanislaw
Hello stable,

scrap this. I handed out a brown paper bag bug fix and if that doesn't
get into 2.6.36, it probably should go into 2.6.36.y. But let's see...

Regards,
Flo
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [net-next PATCH] igb: update adapter stats when reading /proc/net/dev.
From: Jeff Kirsher @ 2010-10-05 22:34 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, Alexander Duyck, Jesper Dangaard Brouer,
	David S. Miller, netdev, Carolyn Wyborny
In-Reply-To: <Pine.LNX.4.64.1010052313370.30765@ask.diku.dk>

On Tue, Oct 5, 2010 at 14:16, Jesper Dangaard Brouer <hawk@diku.dk> wrote:
> On Tue, 5 Oct 2010, Eric Dumazet wrote:
>
>> Le mardi 05 octobre 2010 à 17:19 +0200, Eric Dumazet a écrit :
>>>
>>> Le mardi 05 octobre 2010 à 16:53 +0200, Jesper Dangaard Brouer a écrit :
>>>
>>>> Its already racy, because "ethtool -S" reads out the stats immediately,
>>>> and thus races with the timer.
>>>>
>>>> See: igb_ethtool.c
>>>>  igb_get_ethtool_stats() invoke igb_update_stats(adapter);
>>>>
>>>
>>> You would be surprised how many bugs are waiting to be found and
>>> fixed ;)
>>
>> I took a look at igb stats, and it appears they also are racy on 32bit
>> arches. igb uses u64 counters, with no synchronization between
>> producers(writers) and consumers(readers).
>
> Are you saying, that we need more than a simple mutex in igb_update_stats()?
>
>
>> Some fixes are needed ;)
>
> The question is then if Intel wants to fix it, or let it be up to you or me?
>

I will make sure that Carolyn and Alex know about the issue.  But,
feel free to submit a patch if you guys have the time.

-- 
Cheers,
Jeff

^ permalink raw reply

* RE: ixgbe: normalize frag_list usage
From: Duyck, Alexander H @ 2010-10-05 22:45 UTC (permalink / raw)
  To: David Miller
  Cc: Kirsher, Jeffrey T, Brandeburg, Jesse, Allan, Bruce W,
	netdev@vger.kernel.org
In-Reply-To: <20101004.113258.212671454.davem@davemloft.net>

>-----Original Message-----
>From: David Miller [mailto:davem@davemloft.net]
>Sent: Monday, October 04, 2010 11:33 AM
>To: Duyck, Alexander H
>Cc: Kirsher, Jeffrey T; Brandeburg, Jesse; Allan, Bruce W;
>netdev@vger.kernel.org
>Subject: Re: ixgbe: normalize frag_list usage
>
>From: Alexander Duyck <alexander.h.duyck@intel.com>
>Date: Mon, 04 Oct 2010 11:04:18 -0700
>
>> This will not work for RSC due to the fact that it assumes only one
>> RSC context is active per ring and that is not the case.  There can
>be
>> multiple RSC combined flows interlaced on the ring.
>
>Thanks for looking at this Alexander.
>
>I must have mis-understood what the current code is doing.
>
>It looked like RSC packets always show up in-order in the RX ring.
>
>And that they are scanned for by simply combining any sequence of
>non-EOP packets up to and including the final EOP one into a frag
>list.
>
>Are the RSC packets are identified via something else entirely?

Dave,

The patch below is kind of what I had in mind for a way to do RSC and maintain
the pointer scheme you are looking for.  Consider this patch an RFC for now
since I based this off of Jeff's internal testing tree and so it would need
some modification to apply cleanly to net-next.

The only deviation from a standard frag list is that we are appending the tail
before it actually has any data in it so we have to break things into three
steps.  One to add the tail to the end of the frag list, one to merge the
length from the tail into the head, and finally one to clean-up the head->prev
pointer.

If this works for you I will send it to Jeff to add to his tree for testing.

Thanks,

Alex

---

ixgbe: change RSC to use a normalize frag_list usage

From: Alexander Duyck <alexander.h.duyck@intel.com>

This change drops the RSC queue approach and instead creates a normalized
frag_list skb but the tail is kept active and regularly merged into the
host SKB every time it is completed.  In order to identify the tail skb
as a tail we set the next pointer to the head skb, and the head skb prev
pointer to the tail.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 drivers/net/ixgbe/ixgbe.h      |    2 -
 drivers/net/ixgbe/ixgbe_main.c |   96 +++++++++++++++++++++++-----------------
 2 files changed, 56 insertions(+), 42 deletions(-)


diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index b9a1e70..2c7a296 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -470,7 +470,7 @@ enum ixbge_state_t {
 
 struct ixgbe_rsc_cb {
 	dma_addr_t dma;
-	u16 skb_cnt;
+	u16 append_cnt;
 	bool delay_unmap;
 };
 #define IXGBE_RSC_CB(skb) ((struct ixgbe_rsc_cb *)(skb)->cb)
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 221d5ef..c3774b9 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1251,34 +1251,46 @@ static inline u16 ixgbe_get_hlen(union ixgbe_adv_rx_desc *rx_desc)
 }
 
 /**
- * ixgbe_transform_rsc_queue - change rsc queue into a full packet
- * @skb: pointer to the last skb in the rsc queue
+ * ixgbe_merge_active_tail - merge active tail into frag_list skb
+ * @tail: pointer to active tail in frag_list
  *
- * This function changes a queue full of hw rsc buffers into a completed
- * packet.  It uses the ->prev pointers to find the first packet and then
- * turns it into the frag list owner.
+ * This function merges the length and data of an active tail into the
+ * skb containing the frag_list.  It resets the tail's pointer to the head,
+ * but it leaves the heads pointer to tail intact.
  **/
-static inline struct sk_buff *ixgbe_transform_rsc_queue(struct sk_buff *skb)
+static inline struct sk_buff *ixgbe_merge_active_tail(struct sk_buff *tail)
 {
-	unsigned int frag_list_size = 0;
-	unsigned int skb_cnt = 1;
-
-	while (skb->prev) {
-		struct sk_buff *prev = skb->prev;
-		frag_list_size += skb->len;
-		skb->prev = NULL;
-		skb = prev;
-		skb_cnt++;
+	if (tail->next) {
+		struct sk_buff *head = tail->next;
+		head->len += tail->len;
+		head->data_len += tail->len;
+		head->truesize += tail->len;
+		tail->next = NULL;
+		return head;
 	}
+	return tail;
+}
 
-	skb_shinfo(skb)->frag_list = skb->next;
-	skb->next = NULL;
-	skb->len += frag_list_size;
-	skb->data_len += frag_list_size;
-	skb->truesize += frag_list_size;
-	IXGBE_RSC_CB(skb)->skb_cnt = skb_cnt;
-
-	return skb;
+/**
+ * ixgbe_append_active_tail - add empty skb to the end of a frag_list
+ * @head: skb that will host the frag_list
+ * @tail: skb to add to the end of the frag_list
+ *
+ * This function adds a yet to be completed skb to the end of a frag_list.
+ * It is an in-between step while we are still waiting for the data to be
+ * placed into the yet to be completed skb.  A back pointer from tail to
+ * head is placed in tail->next so we can later merge it into the host skb.
+ * The head->prev pointer is used to track the tail of the frag_list.
+ **/
+static inline void ixgbe_append_active_tail(struct sk_buff *head,
+					    struct sk_buff *tail)
+{
+	if (skb_shinfo(head)->frag_list)
+		head->prev->next = tail;
+	else
+		skb_shinfo(head)->frag_list = tail;
+	head->prev = tail;
+	tail->next = head;
 }
 
 static inline bool ixgbe_get_rsc_state(union ixgbe_adv_rx_desc *rx_desc)
@@ -1328,7 +1340,7 @@ static void ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			u16 hlen;
 			if (pkt_is_rsc &&
 			    !(staterr & IXGBE_RXD_STAT_EOP) &&
-			    !skb->prev) {
+			    !skb->next) {
 				/*
 				 * When HWRSC is enabled, delay unmapping
 				 * of the first packet. It carries the
@@ -1397,6 +1409,8 @@ static void ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			next_buffer = &rx_ring->rx_buffer_info[i];
 		}
 
+		skb = ixgbe_merge_active_tail(skb);
+
 		if (!(staterr & IXGBE_RXD_STAT_EOP)) {
 			if (ring_is_ps_enabled(rx_ring)) {
 				rx_buffer_info->skb = next_buffer->skb;
@@ -1404,15 +1418,16 @@ static void ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 				next_buffer->skb = skb;
 				next_buffer->dma = 0;
 			} else {
-				skb->next = next_buffer->skb;
-				skb->next->prev = skb;
+				ixgbe_append_active_tail(skb,
+							 next_buffer->skb);
+				IXGBE_RSC_CB(skb)->append_cnt++;
 			}
 			rx_ring->rx_stats.non_eop_descs++;
 			goto next_desc;
 		}
 
 		if (skb->prev) {
-			skb = ixgbe_transform_rsc_queue(skb);
+			skb->prev = NULL;
 			/* if we got here without RSC the packet is invalid */
 			if (!pkt_is_rsc) {
 				__pskb_trim(skb, 0);
@@ -1437,7 +1452,7 @@ static void ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 					skb_shinfo(skb)->nr_frags;
 			else
 				rx_ring->rx_stats.rsc_count +=
-					IXGBE_RSC_CB(skb)->skb_cnt;
+					IXGBE_RSC_CB(skb)->append_cnt + 1;
 			rx_ring->rx_stats.rsc_flush++;
 		}
 
@@ -3907,19 +3922,18 @@ static void ixgbe_clean_rx_ring(struct ixgbe_ring *rx_ring)
 		if (rx_buffer_info->skb) {
 			struct sk_buff *skb = rx_buffer_info->skb;
 			rx_buffer_info->skb = NULL;
-			do {
-				struct sk_buff *this = skb;
-				if (IXGBE_RSC_CB(this)->delay_unmap) {
-					dma_unmap_single(dev,
-							 IXGBE_RSC_CB(this)->dma,
-							 rx_ring->rx_buf_len,
-							 DMA_FROM_DEVICE);
-					IXGBE_RSC_CB(this)->dma = 0;
-					IXGBE_RSC_CB(skb)->delay_unmap = false;
-				}
-				skb = skb->prev;
-				dev_kfree_skb(this);
-			} while (skb);
+			/* We need to clean up RSC frag lists */
+			skb = ixgbe_merge_active_tail(skb);
+			skb->prev = NULL;
+			if (IXGBE_RSC_CB(skb)->delay_unmap) {
+				dma_unmap_single(dev,
+						 IXGBE_RSC_CB(skb)->dma,
+						 rx_ring->rx_buf_len,
+						 DMA_FROM_DEVICE);
+				IXGBE_RSC_CB(skb)->dma = 0;
+				IXGBE_RSC_CB(skb)->delay_unmap = false;
+			}
+			dev_kfree_skb(skb);
 		}
 		if (!rx_buffer_info->page)
 			continue;

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox