Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] ipv4: remove ip_rt_secret timer
From: Neil Horman @ 2010-05-06 19:54 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, davem, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <1273170813.2222.10.camel@edumazet-laptop>

On Thu, May 06, 2010 at 08:33:33PM +0200, Eric Dumazet wrote:
> Le jeudi 06 mai 2010 à 14:02 -0400, Neil Horman a écrit :
> > On Thu, May 06, 2010 at 07:32:35PM +0200, Eric Dumazet wrote:
> > > Le jeudi 06 mai 2010 à 13:16 -0400, Neil Horman a écrit :
> > > > A while back there was a discussion regarding the rt_secret_interval timer.
> > > > Given that we've had the ability to do emergency route cache rebuilds for awhile
> > > > now, based on a statistical analysis of the various hash chain lengths in the
> > > > cache, the use of the flush timer is somewhat redundant.  This patch removes the
> > > > rt_secret_interval sysctl, allowing us to rely solely on the statistical
> > > > analysis mechanism to determine the need for route cache flushes.
> > > > 
> > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > > 
> > > > 
> > > 
> > > Nice cleanup try Neil, but this gives to attackers more time to hit the
> > > cache (infinite time should be enough as a matter of fact ;) )
> > > 
> > Not sure I follow what your complaint is.  I get that this gives attackers
> > plenty of time to try to attack the cache, but thats rather the point of the
> > statistics gathering for the cache, and why I don't think we need the secret
> > timer any more.   With the statistical analysis we do on the route cache every
> > gc cycle, we can tell if an attacker has guessed our rt_genid value, and is
> > making any chains in the cache abnormally long.  Thats when we do the rebuild,
> > modifying the rt_genid, forcing the attacker to re-discover it (which should be
> > difficult).  Theres no need to change this periodically if you're not being
> > attacked.
> >  
> > > Hints : 
> > > 
> > > - What is the initial value of rt_genid ?
> > > 
> > > - How/When is it changed (full 32 bits are changed or small
> > > perturbations ? check rt_cache_invalidate())
> > > 
> > /*
> >  * Pertubation of rt_genid by a small quantity [1..256]
> >  * Using 8 bits of shuffling ensure we can call rt_cache_invalidate()
> >  * many times (2^24) without giving recent rt_genid.
> >  * Jenkins hash is strong enough that litle changes of rt_genid are OK.
> >  */
> > static void rt_cache_invalidate(struct net *net)
> > {
> >         unsigned char shuffle;
> > 
> >         get_random_bytes(&shuffle, sizeof(shuffle));
> >         atomic_add(shuffle + 1U, &net->ipv4.rt_genid);
> > }
> > 
> > Clearly, its small changes.  To paraphrase the comment, Changes to rt_genid are
> > small enough to be confident that we don't repetatively use a gen_id often, but
> > sufficiently random that attackers cannot easily guess the next gen_id based on
> > the current value.  Both the timer and the statistics code use this invalidation
> > technique previously, and the latter continues to do so.
> > 
> > I've not changed anything regarding how we
> > invalidate, only when we choose to invalidate.  Invalidation can lead to
> > slowdowns during routing, since it send frames through the slow path after an
> > invalidation.  It behooves us to avoid preforming this invalidation without
> > need, and since we have a mechanism in place to do that invalidation specfically
> > when we need to, lets get rid of the code that handles that, and make it a bit
> > cleaner.  If there are users that feel strongly that they need to defend against
> > potential attacks by periodically changing their rt_genid, its still possible.
> > Its as simple as putting:
> > echo -1 > /proc/sys/net/ipv4/route/flush
> > in a cron job.
> > 
> 
> I have some customers that will simply kill me if their routing cache is
> disabled by a smart attack, slowing down their server by a 4x factor.
> 
> I know its possible, it has been done.
> 
Ok, I can understand that, but I don't think a single user profile needs to
dictate policy here.  the statistics code has a configurable threshold for where
to disable caching, so I would think you can just set that to be high.  And if
you need even more than that, you can do as I suggested, adding a:
echo -1 > /proc/sys/net/ipv4/route/flush
to a cron job on a short interval.  That will preform the _exact_ same operation
that the in-kernel timer did previously.  And the other 99% of our users don't
have to suffer a periodic cache invalidation when they don't need it.

> For a quiet machine possible rt_genid values range are known from
> attacker, and hash size is also known. Thats really too easy for the bad
> guys...
> 
Well, ok, I can buy your argument, but I think the problem you are describing is
orthogonoal to what my change here does.  If its too easy for attackers to guess
our genid secret, then we need to make it more complex to guess, not force
everyone to change it more frequently.

> Neil, I think your cleanup should stay a cleanup for the moment, or you
> must make sure rt_genid initial value is not 0 (read your patch
> again...)
> 
Yeah, I get that we start the gen_id at 0, that doesn't come from this change,
its always that way in the net_alloc initalization.  I certainly don't have a
problem during inialization starting with a randomization of that value.  I'll
post an updated patch shortly.

> I agree we dont need anymore the complex timer logic. We could keep the
> secret_interval (default to 0 if you really want) and force a
> rt_cache_invalidate() call once in a while from the periodic
> rt_check_expire() for example.
> 
Doing that doesn't solve my aim however, which is to avoid performing rt_genid
updates when no one is attacking you at all.  I completely agree that we can
start the gen_id at some random value (by forcing an initial invalidation),
however.  Beyond that however, if someone is managing to guess our secret value,
then we need to make our secret value more complex to determine.  Perhaps given
the reduction in the number of times we need to iterate our gen_id with the
timer gone, we can use something more heavyweight to determine the the hash
secret (the cprng perhaps?).

Regards
Neil
> 
> 
> 

^ permalink raw reply

* [PATCH (v4+)] bridge: update sysfs link names if port device names have changed
From: Stephen Hemminger @ 2010-05-06 19:50 UTC (permalink / raw)
  To: David Miller; +Cc: Simon Arlott, netdev
In-Reply-To: <4BE31A67.4090007@simon.arlott.org.uk>


From: Simon Arlott <simon@fire.lp0.eu>

Links for each port are created in sysfs using the device
name, but this could be changed after being added to the
bridge.

As well as being unable to remove interfaces after this
occurs (because userspace tools don't recognise the new
name, and the kernel won't recognise the old name), adding
another interface with the old name to the bridge will
cause an error trying to create the sysfs link.

This fixes the problem by listening for NETDEV_CHANGENAME
notifications and renaming the link.

https://bugzilla.kernel.org/show_bug.cgi?id=12743

Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>

---
Modified to apply to net-next and fix checkpatch warnings -- Stephen

 fs/sysfs/symlink.c       |    1 +
 net/bridge/br_if.c       |    2 +-
 net/bridge/br_notify.c   |    7 +++++++
 net/bridge/br_private.h  |    6 ++++++
 net/bridge/br_sysfs_if.c |   32 +++++++++++++++++++++++++++-----
 5 files changed, 42 insertions(+), 6 deletions(-)

--- a/fs/sysfs/symlink.c	2010-04-20 08:22:12.000000000 -0700
+++ b/fs/sysfs/symlink.c	2010-05-06 12:41:36.488153157 -0700
@@ -261,3 +261,4 @@ const struct inode_operations sysfs_syml
 
 EXPORT_SYMBOL_GPL(sysfs_create_link);
 EXPORT_SYMBOL_GPL(sysfs_remove_link);
+EXPORT_SYMBOL_GPL(sysfs_rename_link);
--- a/net/bridge/br_if.c	2010-05-06 12:24:59.000000000 -0700
+++ b/net/bridge/br_if.c	2010-05-06 12:41:36.488153157 -0700
@@ -133,7 +133,7 @@ static void del_nbp(struct net_bridge_po
 	struct net_bridge *br = p->br;
 	struct net_device *dev = p->dev;
 
-	sysfs_remove_link(br->ifobj, dev->name);
+	sysfs_remove_link(br->ifobj, p->sysfs_name);
 
 	dev_set_promiscuity(dev, -1);
 
--- a/net/bridge/br_notify.c	2010-05-06 12:24:30.000000000 -0700
+++ b/net/bridge/br_notify.c	2010-05-06 12:43:27.776801235 -0700
@@ -34,6 +34,7 @@ static int br_device_event(struct notifi
 	struct net_device *dev = ptr;
 	struct net_bridge_port *p = dev->br_port;
 	struct net_bridge *br;
+	int err;
 
 	/* not a port of a bridge */
 	if (p == NULL)
@@ -83,6 +84,12 @@ static int br_device_event(struct notifi
 		br_del_if(br, dev);
 		break;
 
+	case NETDEV_CHANGENAME:
+		err = br_sysfs_renameif(p);
+		if (err)
+			return notifier_from_errno(err);
+		break;
+
 	case NETDEV_PRE_TYPE_CHANGE:
 		/* Forbid underlaying device to change its type. */
 		return NOTIFY_BAD;
--- a/net/bridge/br_private.h	2010-05-06 08:43:18.000000000 -0700
+++ b/net/bridge/br_private.h	2010-05-06 12:41:36.488153157 -0700
@@ -139,6 +139,10 @@ struct net_bridge_port
 	struct hlist_head		mglist;
 	struct hlist_node		rlist;
 #endif
+
+#ifdef CONFIG_SYSFS
+	char				sysfs_name[IFNAMSIZ];
+#endif
 };
 
 struct br_cpu_netstats {
@@ -455,6 +459,7 @@ extern void br_ifinfo_notify(int event, 
 /* br_sysfs_if.c */
 extern const struct sysfs_ops brport_sysfs_ops;
 extern int br_sysfs_addif(struct net_bridge_port *p);
+extern int br_sysfs_renameif(struct net_bridge_port *p);
 
 /* br_sysfs_br.c */
 extern int br_sysfs_addbr(struct net_device *dev);
@@ -463,6 +468,7 @@ extern void br_sysfs_delbr(struct net_de
 #else
 
 #define br_sysfs_addif(p)	(0)
+#define br_sysfs_renameif(p)	(0)
 #define br_sysfs_addbr(dev)	(0)
 #define br_sysfs_delbr(dev)	do { } while(0)
 #endif /* CONFIG_SYSFS */
--- a/net/bridge/br_sysfs_if.c	2010-05-06 12:24:30.000000000 -0700
+++ b/net/bridge/br_sysfs_if.c	2010-05-06 12:48:07.747112472 -0700
@@ -246,7 +246,7 @@ const struct sysfs_ops brport_sysfs_ops 
 /*
  * Add sysfs entries to ethernet device added to a bridge.
  * Creates a brport subdirectory with bridge attributes.
- * Puts symlink in bridge's brport subdirectory
+ * Puts symlink in bridge's brif subdirectory
  */
 int br_sysfs_addif(struct net_bridge_port *p)
 {
@@ -257,15 +257,37 @@ int br_sysfs_addif(struct net_bridge_por
 	err = sysfs_create_link(&p->kobj, &br->dev->dev.kobj,
 				SYSFS_BRIDGE_PORT_LINK);
 	if (err)
-		goto out2;
+		return err;
 
 	for (a = brport_attrs; *a; ++a) {
 		err = sysfs_create_file(&p->kobj, &((*a)->attr));
 		if (err)
-			goto out2;
+			return err;
 	}
 
-	err = sysfs_create_link(br->ifobj, &p->kobj, p->dev->name);
-out2:
+	strlcpy(p->sysfs_name, p->dev->name, IFNAMSIZ);
+	return sysfs_create_link(br->ifobj, &p->kobj, p->sysfs_name);
+}
+
+/* Rename bridge's brif symlink */
+int br_sysfs_renameif(struct net_bridge_port *p)
+{
+	struct net_bridge *br = p->br;
+	int err;
+
+	/* If a rename fails, the rollback will cause another
+	 * rename call with the existing name.
+	 */
+	if (!strncmp(p->sysfs_name, p->dev->name, IFNAMSIZ))
+		return 0;
+
+	err = sysfs_rename_link(br->ifobj, &p->kobj,
+				p->sysfs_name, p->dev->name);
+	if (err)
+		netdev_notice(br->dev, "unable to rename link %s to %s",
+			      p->sysfs_name, p->dev->name);
+	else
+		strlcpy(p->sysfs_name, p->dev->name, IFNAMSIZ);
+
 	return err;
 }

^ permalink raw reply

* Re: [PATCH (v4)] bridge: update sysfs link names if port device names have changed
From: Stephen Hemminger @ 2010-05-06 19:40 UTC (permalink / raw)
  To: Simon Arlott; +Cc: netdev
In-Reply-To: <4BE31A67.4090007@simon.arlott.org.uk>

On Thu, 06 May 2010 20:37:11 +0100
Simon Arlott <simon@fire.lp0.eu> wrote:

> Links for each port are created in sysfs using the device
> name, but this could be changed after being added to the
> bridge.
> 
> As well as being unable to remove interfaces after this
> occurs (because userspace tools don't recognise the new
> name, and the kernel won't recognise the old name), adding
> another interface with the old name to the bridge will
> cause an error trying to create the sysfs link.
> 
> This fixes the problem by listening for NETDEV_CHANGENAME
> notifications and renaming the link.
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=12743
> 
> Signed-off-by: Simon Arlott <simon@fire.lp0.eu>

Looks good
Acked-by: Stephen Hemminger <shemminger@vyatta.com>


-- 

^ permalink raw reply

* [PATCH 2/2] bridge netfilter: use net_ratelimit
From: Stephen Hemminger @ 2010-05-06 19:38 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

Remove routine br_dnat_complain which just implemented own version
of net_ratelimit() and use netdev_warn.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--- a/net/bridge/br_netfilter.c	2010-05-06 12:32:23.427786161 -0700
+++ b/net/bridge/br_netfilter.c	2010-05-06 12:33:37.826565965 -0700
@@ -253,17 +253,6 @@ static int br_nf_pre_routing_finish_ipv6
 	return 0;
 }
 
-static void __br_dnat_complain(void)
-{
-	static unsigned long last_complaint;
-
-	if (jiffies - last_complaint >= 5 * HZ) {
-		printk(KERN_WARNING "Performing cross-bridge DNAT requires IP "
-		       "forwarding to be enabled\n");
-		last_complaint = jiffies;
-	}
-}
-
 /* This requires some explaining. If DNAT has taken place,
  * we will need to fix up the destination Ethernet address,
  * and this is a tricky process.
@@ -382,8 +371,12 @@ static int br_nf_pre_routing_finish(stru
 				/* we are sure that forwarding is disabled, so printing
 				 * this message is no problem. Note that the packet could
 				 * still have a martian destination address, in which case
-				 * the packet could be dropped even if forwarding were enabled */
-				__br_dnat_complain();
+				 * the packet could be dropped even if forwarding were enabled
+				 */
+				if (net_ratelimit())
+					netdev_warn(dev, "Performing cross-bridge DNAT "
+						    "requires IP forwarding to be enabled\n");
+
 				dst_release((struct dst_entry *)rt);
 			}
 free_skb:

^ permalink raw reply

* [PATCH (v4)] bridge: update sysfs link names if port device names have changed
From: Simon Arlott @ 2010-05-06 19:37 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20100506121107.1868ad07@nehalam>

Links for each port are created in sysfs using the device
name, but this could be changed after being added to the
bridge.

As well as being unable to remove interfaces after this
occurs (because userspace tools don't recognise the new
name, and the kernel won't recognise the old name), adding
another interface with the old name to the bridge will
cause an error trying to create the sysfs link.

This fixes the problem by listening for NETDEV_CHANGENAME
notifications and renaming the link.

https://bugzilla.kernel.org/show_bug.cgi?id=12743

Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
---
Updated notifier return value and error message logging.
Changed to use strlcpy instead of strncpy.

 fs/sysfs/symlink.c       |    1 +
 net/bridge/br_if.c       |    2 +-
 net/bridge/br_notify.c   |    7 +++++++
 net/bridge/br_private.h  |    6 ++++++
 net/bridge/br_sysfs_if.c |   29 +++++++++++++++++++++++++++--
 5 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/fs/sysfs/symlink.c b/fs/sysfs/symlink.c
index b93ec51..942f239 100644
--- a/fs/sysfs/symlink.c
+++ b/fs/sysfs/symlink.c
@@ -261,3 +261,4 @@ const struct inode_operations sysfs_symlink_inode_operations = {
 
 EXPORT_SYMBOL_GPL(sysfs_create_link);
 EXPORT_SYMBOL_GPL(sysfs_remove_link);
+EXPORT_SYMBOL_GPL(sysfs_rename_link);
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 0b6b1f2..17175a5 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -132,7 +132,7 @@ static void del_nbp(struct net_bridge_port *p)
 	struct net_bridge *br = p->br;
 	struct net_device *dev = p->dev;
 
-	sysfs_remove_link(br->ifobj, dev->name);
+	sysfs_remove_link(br->ifobj, p->sysfs_name);
 
 	dev_set_promiscuity(dev, -1);
 
diff --git a/net/bridge/br_notify.c b/net/bridge/br_notify.c
index 763a3ec..c3b0c80 100644
--- a/net/bridge/br_notify.c
+++ b/net/bridge/br_notify.c
@@ -34,6 +34,7 @@ static int br_device_event(struct notifier_block *unused, unsigned long event, v
 	struct net_device *dev = ptr;
 	struct net_bridge_port *p = dev->br_port;
 	struct net_bridge *br;
+	int err;
 
 	/* not a port of a bridge */
 	if (p == NULL)
@@ -82,6 +83,12 @@ static int br_device_event(struct notifier_block *unused, unsigned long event, v
 	case NETDEV_UNREGISTER:
 		br_del_if(br, dev);
 		break;
+
+	case NETDEV_CHANGENAME:
+		err = br_sysfs_renameif(p);
+		if (err)
+			return notifier_from_errno(err);
+		break;
 	}
 
 	/* Events that may cause spanning tree to refresh */
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 846d7d1..58f1dd7 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -128,6 +128,10 @@ struct net_bridge_port
 	struct hlist_head		mglist;
 	struct hlist_node		rlist;
 #endif
+
+#ifdef CONFIG_SYSFS
+	char				sysfs_name[IFNAMSIZ];
+#endif
 };
 
 struct net_bridge
@@ -433,6 +437,7 @@ extern void br_ifinfo_notify(int event, struct net_bridge_port *port);
 /* br_sysfs_if.c */
 extern const struct sysfs_ops brport_sysfs_ops;
 extern int br_sysfs_addif(struct net_bridge_port *p);
+extern int br_sysfs_renameif(struct net_bridge_port *p);
 
 /* br_sysfs_br.c */
 extern int br_sysfs_addbr(struct net_device *dev);
@@ -441,6 +446,7 @@ extern void br_sysfs_delbr(struct net_device *dev);
 #else
 
 #define br_sysfs_addif(p)	(0)
+#define br_sysfs_renameif(p)	(0)
 #define br_sysfs_addbr(dev)	(0)
 #define br_sysfs_delbr(dev)	do { } while(0)
 #endif /* CONFIG_SYSFS */
diff --git a/net/bridge/br_sysfs_if.c b/net/bridge/br_sysfs_if.c
index 0b99164..51b6b52 100644
--- a/net/bridge/br_sysfs_if.c
+++ b/net/bridge/br_sysfs_if.c
@@ -246,7 +246,7 @@ const struct sysfs_ops brport_sysfs_ops = {
 /*
  * Add sysfs entries to ethernet device added to a bridge.
  * Creates a brport subdirectory with bridge attributes.
- * Puts symlink in bridge's brport subdirectory
+ * Puts symlink in bridge's brif subdirectory
  */
 int br_sysfs_addif(struct net_bridge_port *p)
 {
@@ -265,7 +265,32 @@ int br_sysfs_addif(struct net_bridge_port *p)
 			goto out2;
 	}
 
-	err = sysfs_create_link(br->ifobj, &p->kobj, p->dev->name);
+	strlcpy(p->sysfs_name, p->dev->name, IFNAMSIZ);
+	err = sysfs_create_link(br->ifobj, &p->kobj, p->sysfs_name);
 out2:
 	return err;
 }
+
+/* Rename bridge's brif symlink */
+int br_sysfs_renameif(struct net_bridge_port *p)
+{
+	struct net_bridge *br = p->br;
+	int err;
+
+	/* If a rename fails, the rollback will cause another
+	 * rename call with the existing name.
+	 */
+	if (!strncmp(p->sysfs_name, p->dev->name, IFNAMSIZ))
+		return 0;
+
+	err = sysfs_rename_link(br->ifobj, &p->kobj,
+		p->sysfs_name, p->dev->name);
+	if (err) {
+		netdev_notice(br->dev, "unable to rename sysfs link %s to %s (%d)",
+			p->sysfs_name, p->dev->name, err);
+	} else {
+		strlcpy(p->sysfs_name, p->dev->name, IFNAMSIZ);
+	}
+
+	return err;
+}
-- 
1.7.0.4

-- 
Simon Arlott

^ permalink raw reply related

* Re: kernel panic when using netns+bridges+tc(netem)
From: Eric Dumazet @ 2010-05-06 19:15 UTC (permalink / raw)
  To: Martín Ferrari; +Cc: Arnd Bergmann, netdev, Mathieu Lacage, David Miller
In-Reply-To: <h2vb9800b71005060423rc0f4bd3g4109e90f798d5157@mail.gmail.com>

Le jeudi 06 mai 2010 à 13:23 +0200, Martín Ferrari a écrit :
> Eric,
> 
> On Thu, May 6, 2010 at 08:59, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >> > Could you please try following patch ?
> > David, this is a stable candidate, once tested and acked, thanks !
> >
> > [PATCH] veth: Dont kfree_skb() after dev_forward_skb()
> 
> I have been running a bunch of benchmarks and it didn't cause any
> trouble. Seems the fix is good
> 
> Thanks a lot!!
> 

Thanks for testing and report !



^ permalink raw reply

* Re: [PATCH (v2)] bridge: update sysfs link names if port device names have changed
From: Stephen Hemminger @ 2010-05-06 19:11 UTC (permalink / raw)
  To: Simon Arlott; +Cc: netdev
In-Reply-To: <4BE31218.6010709@simon.arlott.org.uk>

On Thu, 06 May 2010 20:01:44 +0100
Simon Arlott <simon@fire.lp0.eu> wrote:

> +
> +	case NETDEV_CHANGENAME:
> +		err = br_sysfs_renameif(p);
> +		if (err)
> +			return NOTIFY_BAD;
> +		break;

I think you want:
 if (err)
	return notifier_from_errno(err);


+	err = sysfs_rename_link(br->ifobj, &p->kobj,
+		p->sysfs_name, p->dev->name);
+	if (err) {
+		printk(KERN_ERR "%s: unable to rename sysfs link %s to %s (%d)",
+			br->dev->name, p->sysfs_name, p->dev->name, err);

This should not be KERN_ERR but KERN_NOTICE, and use new wrapper macros.

	if (err)
		netdev_notice(br->dev, "unable to rename sysfs link %s to %s".
			 p->sysfs_name, p->dev->name, err)

^ permalink raw reply

* [PATCH (v2)] bridge: update sysfs link names if port device names have changed
From: Simon Arlott @ 2010-05-06 19:01 UTC (permalink / raw)
  To: shemminger; +Cc: netdev
In-Reply-To: <4BE30B3D.9000600@simon.arlott.org.uk>

Links for each port are created in sysfs using the device
name, but this could be changed after being added to the
bridge.

As well as being unable to remove interfaces after this
occurs (because userspace tools don't recognise the new
name, and the kernel won't recognise the old name), adding
another interface with the old name to the bridge will
cause an error trying to create the sysfs link.

This fixes the problem by listening for NETDEV_CHANGENAME
notifications and renaming the link.

https://bugzilla.kernel.org/show_bug.cgi?id=12743

Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
---
Updated to force rename rollback if sysfs_rename_link fails.

 fs/sysfs/symlink.c       |    1 +
 net/bridge/br_if.c       |    2 +-
 net/bridge/br_notify.c   |    7 +++++++
 net/bridge/br_private.h  |    5 +++++
 net/bridge/br_sysfs_if.c |   29 +++++++++++++++++++++++++++--
 5 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/fs/sysfs/symlink.c b/fs/sysfs/symlink.c
index b93ec51..942f239 100644
--- a/fs/sysfs/symlink.c
+++ b/fs/sysfs/symlink.c
@@ -261,3 +261,4 @@ const struct inode_operations sysfs_symlink_inode_operations = {
 
 EXPORT_SYMBOL_GPL(sysfs_create_link);
 EXPORT_SYMBOL_GPL(sysfs_remove_link);
+EXPORT_SYMBOL_GPL(sysfs_rename_link);
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 0b6b1f2..17175a5 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -132,7 +132,7 @@ static void del_nbp(struct net_bridge_port *p)
 	struct net_bridge *br = p->br;
 	struct net_device *dev = p->dev;
 
-	sysfs_remove_link(br->ifobj, dev->name);
+	sysfs_remove_link(br->ifobj, p->sysfs_name);
 
 	dev_set_promiscuity(dev, -1);
 
diff --git a/net/bridge/br_notify.c b/net/bridge/br_notify.c
index 763a3ec..576a1b0 100644
--- a/net/bridge/br_notify.c
+++ b/net/bridge/br_notify.c
@@ -34,6 +34,7 @@ static int br_device_event(struct notifier_block *unused, unsigned long event, v
 	struct net_device *dev = ptr;
 	struct net_bridge_port *p = dev->br_port;
 	struct net_bridge *br;
+	int err;
 
 	/* not a port of a bridge */
 	if (p == NULL)
@@ -82,6 +83,12 @@ static int br_device_event(struct notifier_block *unused, unsigned long event, v
 	case NETDEV_UNREGISTER:
 		br_del_if(br, dev);
 		break;
+
+	case NETDEV_CHANGENAME:
+		err = br_sysfs_renameif(p);
+		if (err)
+			return NOTIFY_BAD;
+		break;
 	}
 
 	/* Events that may cause spanning tree to refresh */
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 846d7d1..4309bd8 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -96,6 +96,9 @@ struct net_bridge_port
 {
 	struct net_bridge		*br;
 	struct net_device		*dev;
+#ifdef CONFIG_SYSFS
+	char				sysfs_name[IFNAMSIZ];
+#endif
 	struct list_head		list;
 
 	/* STP */
@@ -433,6 +436,7 @@ extern void br_ifinfo_notify(int event, struct net_bridge_port *port);
 /* br_sysfs_if.c */
 extern const struct sysfs_ops brport_sysfs_ops;
 extern int br_sysfs_addif(struct net_bridge_port *p);
+extern int br_sysfs_renameif(struct net_bridge_port *p);
 
 /* br_sysfs_br.c */
 extern int br_sysfs_addbr(struct net_device *dev);
@@ -441,6 +445,7 @@ extern void br_sysfs_delbr(struct net_device *dev);
 #else
 
 #define br_sysfs_addif(p)	(0)
+#define br_sysfs_renameif(p)	(0)
 #define br_sysfs_addbr(dev)	(0)
 #define br_sysfs_delbr(dev)	do { } while(0)
 #endif /* CONFIG_SYSFS */
diff --git a/net/bridge/br_sysfs_if.c b/net/bridge/br_sysfs_if.c
index 0b99164..a7997a7 100644
--- a/net/bridge/br_sysfs_if.c
+++ b/net/bridge/br_sysfs_if.c
@@ -246,7 +246,7 @@ const struct sysfs_ops brport_sysfs_ops = {
 /*
  * Add sysfs entries to ethernet device added to a bridge.
  * Creates a brport subdirectory with bridge attributes.
- * Puts symlink in bridge's brport subdirectory
+ * Puts symlink in bridge's brif subdirectory
  */
 int br_sysfs_addif(struct net_bridge_port *p)
 {
@@ -265,7 +265,32 @@ int br_sysfs_addif(struct net_bridge_port *p)
 			goto out2;
 	}
 
-	err = sysfs_create_link(br->ifobj, &p->kobj, p->dev->name);
+	strncpy(p->sysfs_name, p->dev->name, IFNAMSIZ);
+	err = sysfs_create_link(br->ifobj, &p->kobj, p->sysfs_name);
 out2:
 	return err;
 }
+
+/* Rename bridge's brif symlink */
+int br_sysfs_renameif(struct net_bridge_port *p)
+{
+	struct net_bridge *br = p->br;
+	int err;
+
+	/* If a rename fails, the rollback will cause another
+	 * rename call with the existing name.
+	 */
+	if (!strncmp(p->sysfs_name, p->dev->name, IFNAMSIZ))
+		return 0;
+
+	err = sysfs_rename_link(br->ifobj, &p->kobj,
+		p->sysfs_name, p->dev->name);
+	if (err) {
+		printk(KERN_ERR "%s: unable to rename sysfs link %s to %s (%d)",
+			br->dev->name, p->sysfs_name, p->dev->name, err);
+	} else {
+		strncpy(p->sysfs_name, p->dev->name, IFNAMSIZ);
+	}
+
+	return err;
+}
-- 
1.7.0.4

-- 
Simon Arlott

^ permalink raw reply related

* Re: [PATCH] bridge: update sysfs link names if port device names have changed
From: Stephen Hemminger @ 2010-05-06 18:53 UTC (permalink / raw)
  To: Simon Arlott; +Cc: netdev
In-Reply-To: <4BE30B3D.9000600@simon.arlott.org.uk>

On Thu, 06 May 2010 19:32:29 +0100
Simon Arlott <simon@fire.lp0.eu> wrote:

> Links for each port are created in sysfs using the device
> name, but this could be changed after being added to the
> bridge.
> 
> As well as being unable to remove interfaces after this
> occurs (because userspace tools don't recognise the new
> name, and the kernel won't recognise the old name), adding
> another interface with the old name to the bridge will
> cause an error trying to create the sysfs link.
> 
> This fixes the problem by listening for NETDEV_CHANGENAME
> notifications and renaming the link.
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=12743
> 
> Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
> ---
>  fs/sysfs/symlink.c       |    1 +
>  net/bridge/br_if.c       |    2 +-
>  net/bridge/br_notify.c   |    4 ++++
>  net/bridge/br_private.h  |    5 +++++
>  net/bridge/br_sysfs_if.c |   19 +++++++++++++++++--
>  5 files changed, 28 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/sysfs/symlink.c b/fs/sysfs/symlink.c
> index b93ec51..942f239 100644
> --- a/fs/sysfs/symlink.c
> +++ b/fs/sysfs/symlink.c
> @@ -261,3 +261,4 @@ const struct inode_operations sysfs_symlink_inode_operations = {
>  
>  EXPORT_SYMBOL_GPL(sysfs_create_link);
>  EXPORT_SYMBOL_GPL(sysfs_remove_link);
> +EXPORT_SYMBOL_GPL(sysfs_rename_link);
> diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
> index 0b6b1f2..17175a5 100644
> --- a/net/bridge/br_if.c
> +++ b/net/bridge/br_if.c
> @@ -132,7 +132,7 @@ static void del_nbp(struct net_bridge_port *p)
>  	struct net_bridge *br = p->br;
>  	struct net_device *dev = p->dev;
>  
> -	sysfs_remove_link(br->ifobj, dev->name);
> +	sysfs_remove_link(br->ifobj, p->sysfs_name);
>  
>  	dev_set_promiscuity(dev, -1);
>  
> diff --git a/net/bridge/br_notify.c b/net/bridge/br_notify.c
> index 763a3ec..9004406 100644
> --- a/net/bridge/br_notify.c
> +++ b/net/bridge/br_notify.c
> @@ -82,6 +82,10 @@ static int br_device_event(struct notifier_block *unused, unsigned long event, v
>  	case NETDEV_UNREGISTER:
>  		br_del_if(br, dev);
>  		break;
> +
> +	case NETDEV_CHANGENAME:
> +		br_sysfs_renameif(p);
> +		break;
>  	}
>  
>  	/* Events that may cause spanning tree to refresh */
> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
> index 846d7d1..dcbe744 100644
> --- a/net/bridge/br_private.h
> +++ b/net/bridge/br_private.h
> @@ -96,6 +96,9 @@ struct net_bridge_port
>  {
>  	struct net_bridge		*br;
>  	struct net_device		*dev;
> +#ifdef CONFIG_SYSFS
> +	char				sysfs_name[IFNAMSIZ];
> +#endif
>  	struct list_head		list;

Ok, but the sysfs_name should go at end of net_bridge_port structure
since it is only used in special case code. And putting in middle
of structure kills cache locality.

^ permalink raw reply

* [PATCH -next 2/3 v2] bnx2: Add prefetches to rx path.
From: Michael Chan @ 2010-05-06 18:58 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1273172293-10241-1-git-send-email-mchan@broadcom.com>

Add prefetches of the skb and the next rx descriptor to speed up rx path.

Use prefetchw() for the skb [suggested by Eric Dumazet].

The rx descriptor is in skb->data which is mapped for streaming mode DMA.
Eric Dumazet pointed out that we should not prefetch the data before
dma_sync.  So we prefetch only if dma_sync is no_op on the system.

Signed-off-by: Michael Chan <mchan@broadcom.com>
---
 drivers/net/bnx2.c |   16 +++++++++++++---
 drivers/net/bnx2.h |    1 +
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 320526b..667f419 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -2719,6 +2719,7 @@ bnx2_alloc_rx_skb(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index)
 	}
 
 	rx_buf->skb = skb;
+	rx_buf->desc = (struct l2_fhdr *) skb->data;
 	dma_unmap_addr_set(rx_buf, mapping, mapping);
 
 	rxbd->rx_bd_haddr_hi = (u64) mapping >> 32;
@@ -2941,6 +2942,7 @@ bnx2_reuse_rx_skb(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr,
 	rxr->rx_prod_bseq += bp->rx_buf_use_size;
 
 	prod_rx_buf->skb = skb;
+	prod_rx_buf->desc = (struct l2_fhdr *) skb->data;
 
 	if (cons == prod)
 		return;
@@ -3074,6 +3076,7 @@ bnx2_rx_int(struct bnx2 *bp, struct bnx2_napi *bnapi, int budget)
 	u16 hw_cons, sw_cons, sw_ring_cons, sw_prod, sw_ring_prod;
 	struct l2_fhdr *rx_hdr;
 	int rx_pkt = 0, pg_ring_used = 0;
+	struct pci_dev *pdev = bp->pdev;
 
 	hw_cons = bnx2_get_hw_rx_cons(bnapi);
 	sw_cons = rxr->rx_cons;
@@ -3086,7 +3089,7 @@ bnx2_rx_int(struct bnx2 *bp, struct bnx2_napi *bnapi, int budget)
 	while (sw_cons != hw_cons) {
 		unsigned int len, hdr_len;
 		u32 status;
-		struct sw_bd *rx_buf;
+		struct sw_bd *rx_buf, *next_rx_buf;
 		struct sk_buff *skb;
 		dma_addr_t dma_addr;
 		u16 vtag = 0;
@@ -3097,7 +3100,14 @@ bnx2_rx_int(struct bnx2 *bp, struct bnx2_napi *bnapi, int budget)
 
 		rx_buf = &rxr->rx_buf_ring[sw_ring_cons];
 		skb = rx_buf->skb;
+		prefetchw(skb);
 
+		if (!get_dma_ops(&pdev->dev)->sync_single_for_cpu) {
+			next_rx_buf =
+				&rxr->rx_buf_ring[
+					RX_RING_IDX(NEXT_RX_BD(sw_cons))];
+			prefetch(next_rx_buf->desc);
+		}
 		rx_buf->skb = NULL;
 
 		dma_addr = dma_unmap_addr(rx_buf, mapping);
@@ -3106,7 +3116,7 @@ bnx2_rx_int(struct bnx2 *bp, struct bnx2_napi *bnapi, int budget)
 			BNX2_RX_OFFSET + BNX2_RX_COPY_THRESH,
 			PCI_DMA_FROMDEVICE);
 
-		rx_hdr = (struct l2_fhdr *) skb->data;
+		rx_hdr = rx_buf->desc;
 		len = rx_hdr->l2_fhdr_pkt_len;
 		status = rx_hdr->l2_fhdr_status;
 
@@ -5764,7 +5774,7 @@ bnx2_run_loopback(struct bnx2 *bp, int loopback_mode)
 	rx_buf = &rxr->rx_buf_ring[rx_start_idx];
 	rx_skb = rx_buf->skb;
 
-	rx_hdr = (struct l2_fhdr *) rx_skb->data;
+	rx_hdr = rx_buf->desc;
 	skb_reserve(rx_skb, BNX2_RX_OFFSET);
 
 	pci_dma_sync_single_for_cpu(bp->pdev,
diff --git a/drivers/net/bnx2.h b/drivers/net/bnx2.h
index ab34a5d..dd35bd0 100644
--- a/drivers/net/bnx2.h
+++ b/drivers/net/bnx2.h
@@ -6551,6 +6551,7 @@ struct l2_fhdr {
 
 struct sw_bd {
 	struct sk_buff		*skb;
+	struct l2_fhdr		*desc;
 	DEFINE_DMA_UNMAP_ADDR(mapping);
 };
 
-- 
1.6.4.GIT



^ permalink raw reply related

* [PATCH -next 1/3 v2] bnx2: Add GRO support.
From: Michael Chan @ 2010-05-06 18:58 UTC (permalink / raw)
  To: davem; +Cc: netdev

And turn on NETIF_F_GRO by default [requested by DaveM].

Signed-off-by: Michael Chan <mchan@broadcom.com>
---
 drivers/net/bnx2.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index ab26bbc..320526b 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -3207,10 +3207,10 @@ bnx2_rx_int(struct bnx2 *bp, struct bnx2_napi *bnapi, int budget)
 
 #ifdef BCM_VLAN
 		if (hw_vlan)
-			vlan_hwaccel_receive_skb(skb, bp->vlgrp, vtag);
+			vlan_gro_receive(&bnapi->napi, bp->vlgrp, vtag, skb);
 		else
 #endif
-			netif_receive_skb(skb);
+			napi_gro_receive(&bnapi->napi, skb);
 
 		rx_pkt++;
 
@@ -8296,7 +8296,7 @@ bnx2_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	memcpy(dev->dev_addr, bp->mac_addr, 6);
 	memcpy(dev->perm_addr, bp->mac_addr, 6);
 
-	dev->features |= NETIF_F_IP_CSUM | NETIF_F_SG;
+	dev->features |= NETIF_F_IP_CSUM | NETIF_F_SG | NETIF_F_GRO;
 	vlan_features_add(dev, NETIF_F_IP_CSUM | NETIF_F_SG);
 	if (CHIP_NUM(bp) == CHIP_NUM_5709) {
 		dev->features |= NETIF_F_IPV6_CSUM;
-- 
1.6.4.GIT



^ permalink raw reply related

* Re: [PATCH] net: deliver skbs on inactive slaves to exact matches
From: John Fastabend @ 2010-05-06 18:37 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: bonding-devel@lists.sourceforge.net, netdev@vger.kernel.org,
	Leech, Christopher, andy@greyhouse.net, kaber@trash.net
In-Reply-To: <25360.1273168864@death.nxdomain.ibm.com>

Jay Vosburgh wrote:
> John Fastabend <john.r.fastabend@intel.com> wrote:
> 
>> Currently, the accelerated receive path for VLAN's will
>> drop packets if the real device is an inactive slave and
>> is not one of the special pkts tested for in
>> skb_bond_should_drop().  This behavior is different then
>> the non-accelerated path and for pkts over a bonded vlan.
>>
>> For example,
>>
>> vlanx -> bond0 -> ethx
>>
>> will be dropped in the vlan path and not delivered to any
>> packet handlers.  However,
>>
>> bond0 -> vlanx -> ethx
>>
>> will be delivered to handlers that match the exact dev,
>> because the VLAN path checks the real_dev which is not a
>> slave and netif_recv_skb() doesn't drop frames but only
>> delivers them to exact matches.
>>
>> This patch adds a pkt_type PACKET_DROP which is now used
>> to identify skbs that would previously been dropped and
>> allows the skb to continue to skb_netif_recv().  Here we
>> add logic to check for PACKET_DROP and if so only deliver
>> to handlers that match exactly.  IMHO this is more
>> consistent and gives pkt handlers a way to identify skbs
>> that come from inactive slaves.
> 
> 	I was looking at this some yesterday and this morning, trying to
> figure out if a packet going up the IP protocol stack with pkt_type ==
> PACKET_DROP would break anything.  For example:
> 
> net/ipv4/ip_options.c:
> int ip_options_rcv_srr(struct sk_buff *skb)
> {
> [...]
>         if (!opt->srr)
>                 return 0;
> 
>         if (skb->pkt_type != PACKET_HOST)
>                 return -EINVAL;
> 
> 	or:
> 
> net/ipv4/tcp_ipv4.c:
> int tcp_v4_rcv(struct sk_buff *skb)
> {
>         const struct iphdr *iph;
>         struct tcphdr *th;
>         struct sock *sk;
>         int ret;
>         struct net *net = dev_net(skb->dev);
> 
>         if (skb->pkt_type != PACKET_HOST)
>                 goto discard_it;
> 
> 	Is pkt_type == PACKET_DROP instead going to break things for
> these, or the other similar cases?
> 
> 	I think what you're looking for is really the usual PACKET_HOST,
> PACKET_OTHERHOST, et al, upper protocol behavior, with the exception
> that at the __netif_receive_skb level, wildcard matches in the ptypes
> are not done.  Setting the pkt_type to PACKET_DROP loses the _HOST,
> _OTHERHOST, etc, information, but sends the packet up the stack anyway,
> and I'm not sure that's really the right thing to do.

Because we wouldn't be doing wildcard matches the skb shouldn't be 
passed to the IP protocol stack.  Really what we want is the ip_rcv() to 
drop the skb in this case,

net/ipv4/ip_input.c
int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct 
packet_type *pt, struct net_device *orig_dev)
{
[...]

if (skb->pkt_type == PACKET_OTHERHOST ||
     skb->pkt_type == PACKET_DROP)
	goto drop;


We do lose some information about the packet _HOST, _OTHERHOST, etc, but 
we also gain something namely that the packet was received on an 
inactive slave.  Currently, we pass these frames up the stack (for non 
wildcard matches) without any indication that they were received on an 
inactive slave.

What we need is a way for the VLAN path to flag skbs so that wildcard 
matches in the ptyes are not done.  Also I think it may be valuable for 
the packet handler to be able to determine if the skb is coming from an 
inactive slave.  I think using the pkt_type field might be OK any ideas 
for a better field?

Thanks,
John

> 
> 	-J
> 
>> This allows a third case to function which is important for
>> doing multipath with FCoE traffic while LAN traffic bonded,
>>
>> bond0 -> ethx
>>          |
>> vlanx -> --
>>
>> Here the vlan is not in bond0 but the FCoE handler can still
>> receive the skb.  Previously these skbs were dropped.
>>
>> I have tested the following 4 configurations in failover modes
>> and load balancing modes and have not seen any duplicate packets
>> or unexpected bahavior.
>>
>> # bond0 -> ethx
>>
>> # vlanx -> bond0 -> ethx
>>
>> # bond0 -> vlanx -> ethx
>>
>> # bond0 -> ethx
>>            |
>>  vlanx -> --
>>
>> Also this removes the PACKET_FASTROUTE define which was not being
>> used.
>>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>> ---
>>
>> include/linux/if_packet.h |    2 +-
>> net/8021q/vlan_core.c     |    4 ++--
>> net/core/dev.c            |   25 ++++++++++++++++++-------
>> 3 files changed, 21 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/linux/if_packet.h b/include/linux/if_packet.h
>> index 6ac23ef..9d079fa 100644
>> --- a/include/linux/if_packet.h
>> +++ b/include/linux/if_packet.h
>> @@ -28,7 +28,7 @@ struct sockaddr_ll {
>> #define PACKET_OUTGOING		4		/* Outgoing of any type */
>> /* These ones are invisible by user level */
>> #define PACKET_LOOPBACK		5		/* MC/BRD frame looped back */
>> -#define PACKET_FASTROUTE	6		/* Fastrouted frame	*/
>> +#define PACKET_DROP		6		/* Drop packet 		*/
>>
>> /* Packet socket options */
>>
>> diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
>> index c584a0a..4510e08 100644
>> --- a/net/8021q/vlan_core.c
>> +++ b/net/8021q/vlan_core.c
>> @@ -12,7 +12,7 @@ int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp,
>> 		return NET_RX_DROP;
>>
>> 	if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
>> -		goto drop;
>> +		skb->pkt_type = PACKET_DROP;
>>
>> 	skb->skb_iif = skb->dev->ifindex;
>> 	__vlan_hwaccel_put_tag(skb, vlan_tci);
>> @@ -84,7 +84,7 @@ vlan_gro_common(struct napi_struct *napi, struct vlan_group *grp,
>> 	struct sk_buff *p;
>>
>> 	if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
>> -		goto drop;
>> +		skb->pkt_type = PACKET_DROP;
>>
>> 	skb->skb_iif = skb->dev->ifindex;
>> 	__vlan_hwaccel_put_tag(skb, vlan_tci);
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 36d53be..cefac4f 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -2776,7 +2776,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>> 	struct net_device *orig_dev;
>> 	struct net_device *master;
>> 	struct net_device *null_or_orig;
>> -	struct net_device *null_or_bond;
>> +	struct net_device *dev_or_bond;
>> 	int ret = NET_RX_DROP;
>> 	__be16 type;
>>
>> @@ -2793,13 +2793,24 @@ static int __netif_receive_skb(struct sk_buff *skb)
>> 	if (!skb->skb_iif)
>> 		skb->skb_iif = skb->dev->ifindex;
>>
>> +	/*
>> +	 * bonding note: skbs received on inactive slaves should only
>> +	 * be delivered to pkt handlers that are exact matches.  Also
>> +	 * the pkt_type field will be marked PACKET_DROP.  If packet
>> +	 * handlers are sensitive to duplicate packets these skbs will
>> +	 * need to be dropped at the handler.  The vlan accel path may
>> +	 * have already set PACKET_DROP.
>> +	 */
>> 	null_or_orig = NULL;
>> 	orig_dev = skb->dev;
>> 	master = ACCESS_ONCE(orig_dev->master);
>> -	if (master) {
>> -		if (skb_bond_should_drop(skb, master))
>> +	if (skb->pkt_type == PACKET_DROP)
>> +		null_or_orig = orig_dev;
>> +	else if (master) {
>> +		if (skb_bond_should_drop(skb, master)) {
>> +			skb->pkt_type = PACKET_DROP;
>> 			null_or_orig = orig_dev; /* deliver only exact match */
>> -		else
>> +		} else
>> 			skb->dev = master;
>> 	}
>>
>> @@ -2849,10 +2860,10 @@ ncls:
>> 	 * device that may have registered for a specific ptype.  The
>> 	 * handler may have to adjust skb->dev and orig_dev.
>> 	 */
>> -	null_or_bond = NULL;
>> +	dev_or_bond = skb->dev;
>> 	if ((skb->dev->priv_flags & IFF_802_1Q_VLAN) &&
>> 	    (vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING)) {
>> -		null_or_bond = vlan_dev_real_dev(skb->dev);
>> +		dev_or_bond = vlan_dev_real_dev(skb->dev);
>> 	}
>>
>> 	type = skb->protocol;
>> @@ -2860,7 +2871,7 @@ ncls:
>> 			&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
>> 		if (ptype->type == type && (ptype->dev == null_or_orig ||
>> 		     ptype->dev == skb->dev || ptype->dev == orig_dev ||
>> -		     ptype->dev == null_or_bond)) {
>> +		     ptype->dev == dev_or_bond)) {
>> 			if (pt_prev)
>> 				ret = deliver_skb(skb, pt_prev, orig_dev);
>> 			pt_prev = ptype;
>>
> 
> ---
> 	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com


^ permalink raw reply

* Re: [PATCH] ipv4: remove ip_rt_secret timer
From: Eric Dumazet @ 2010-05-06 18:33 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, davem, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <20100506180233.GB5063@hmsreliant.think-freely.org>

Le jeudi 06 mai 2010 à 14:02 -0400, Neil Horman a écrit :
> On Thu, May 06, 2010 at 07:32:35PM +0200, Eric Dumazet wrote:
> > Le jeudi 06 mai 2010 à 13:16 -0400, Neil Horman a écrit :
> > > A while back there was a discussion regarding the rt_secret_interval timer.
> > > Given that we've had the ability to do emergency route cache rebuilds for awhile
> > > now, based on a statistical analysis of the various hash chain lengths in the
> > > cache, the use of the flush timer is somewhat redundant.  This patch removes the
> > > rt_secret_interval sysctl, allowing us to rely solely on the statistical
> > > analysis mechanism to determine the need for route cache flushes.
> > > 
> > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > 
> > > 
> > 
> > Nice cleanup try Neil, but this gives to attackers more time to hit the
> > cache (infinite time should be enough as a matter of fact ;) )
> > 
> Not sure I follow what your complaint is.  I get that this gives attackers
> plenty of time to try to attack the cache, but thats rather the point of the
> statistics gathering for the cache, and why I don't think we need the secret
> timer any more.   With the statistical analysis we do on the route cache every
> gc cycle, we can tell if an attacker has guessed our rt_genid value, and is
> making any chains in the cache abnormally long.  Thats when we do the rebuild,
> modifying the rt_genid, forcing the attacker to re-discover it (which should be
> difficult).  Theres no need to change this periodically if you're not being
> attacked.
>  
> > Hints : 
> > 
> > - What is the initial value of rt_genid ?
> > 
> > - How/When is it changed (full 32 bits are changed or small
> > perturbations ? check rt_cache_invalidate())
> > 
> /*
>  * Pertubation of rt_genid by a small quantity [1..256]
>  * Using 8 bits of shuffling ensure we can call rt_cache_invalidate()
>  * many times (2^24) without giving recent rt_genid.
>  * Jenkins hash is strong enough that litle changes of rt_genid are OK.
>  */
> static void rt_cache_invalidate(struct net *net)
> {
>         unsigned char shuffle;
> 
>         get_random_bytes(&shuffle, sizeof(shuffle));
>         atomic_add(shuffle + 1U, &net->ipv4.rt_genid);
> }
> 
> Clearly, its small changes.  To paraphrase the comment, Changes to rt_genid are
> small enough to be confident that we don't repetatively use a gen_id often, but
> sufficiently random that attackers cannot easily guess the next gen_id based on
> the current value.  Both the timer and the statistics code use this invalidation
> technique previously, and the latter continues to do so.
> 
> I've not changed anything regarding how we
> invalidate, only when we choose to invalidate.  Invalidation can lead to
> slowdowns during routing, since it send frames through the slow path after an
> invalidation.  It behooves us to avoid preforming this invalidation without
> need, and since we have a mechanism in place to do that invalidation specfically
> when we need to, lets get rid of the code that handles that, and make it a bit
> cleaner.  If there are users that feel strongly that they need to defend against
> potential attacks by periodically changing their rt_genid, its still possible.
> Its as simple as putting:
> echo -1 > /proc/sys/net/ipv4/route/flush
> in a cron job.
> 

I have some customers that will simply kill me if their routing cache is
disabled by a smart attack, slowing down their server by a 4x factor.

I know its possible, it has been done.

For a quiet machine possible rt_genid values range are known from
attacker, and hash size is also known. Thats really too easy for the bad
guys...

Neil, I think your cleanup should stay a cleanup for the moment, or you
must make sure rt_genid initial value is not 0 (read your patch
again...)

I agree we dont need anymore the complex timer logic. We could keep the
secret_interval (default to 0 if you really want) and force a
rt_cache_invalidate() call once in a while from the periodic
rt_check_expire() for example.




^ permalink raw reply

* [PATCH] bridge: update sysfs link names if port device names have changed
From: Simon Arlott @ 2010-05-06 18:32 UTC (permalink / raw)
  To: shemminger; +Cc: netdev

Links for each port are created in sysfs using the device
name, but this could be changed after being added to the
bridge.

As well as being unable to remove interfaces after this
occurs (because userspace tools don't recognise the new
name, and the kernel won't recognise the old name), adding
another interface with the old name to the bridge will
cause an error trying to create the sysfs link.

This fixes the problem by listening for NETDEV_CHANGENAME
notifications and renaming the link.

https://bugzilla.kernel.org/show_bug.cgi?id=12743

Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
---
 fs/sysfs/symlink.c       |    1 +
 net/bridge/br_if.c       |    2 +-
 net/bridge/br_notify.c   |    4 ++++
 net/bridge/br_private.h  |    5 +++++
 net/bridge/br_sysfs_if.c |   19 +++++++++++++++++--
 5 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/fs/sysfs/symlink.c b/fs/sysfs/symlink.c
index b93ec51..942f239 100644
--- a/fs/sysfs/symlink.c
+++ b/fs/sysfs/symlink.c
@@ -261,3 +261,4 @@ const struct inode_operations sysfs_symlink_inode_operations = {
 
 EXPORT_SYMBOL_GPL(sysfs_create_link);
 EXPORT_SYMBOL_GPL(sysfs_remove_link);
+EXPORT_SYMBOL_GPL(sysfs_rename_link);
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 0b6b1f2..17175a5 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -132,7 +132,7 @@ static void del_nbp(struct net_bridge_port *p)
 	struct net_bridge *br = p->br;
 	struct net_device *dev = p->dev;
 
-	sysfs_remove_link(br->ifobj, dev->name);
+	sysfs_remove_link(br->ifobj, p->sysfs_name);
 
 	dev_set_promiscuity(dev, -1);
 
diff --git a/net/bridge/br_notify.c b/net/bridge/br_notify.c
index 763a3ec..9004406 100644
--- a/net/bridge/br_notify.c
+++ b/net/bridge/br_notify.c
@@ -82,6 +82,10 @@ static int br_device_event(struct notifier_block *unused, unsigned long event, v
 	case NETDEV_UNREGISTER:
 		br_del_if(br, dev);
 		break;
+
+	case NETDEV_CHANGENAME:
+		br_sysfs_renameif(p);
+		break;
 	}
 
 	/* Events that may cause spanning tree to refresh */
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 846d7d1..dcbe744 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -96,6 +96,9 @@ struct net_bridge_port
 {
 	struct net_bridge		*br;
 	struct net_device		*dev;
+#ifdef CONFIG_SYSFS
+	char				sysfs_name[IFNAMSIZ];
+#endif
 	struct list_head		list;
 
 	/* STP */
@@ -433,6 +436,7 @@ extern void br_ifinfo_notify(int event, struct net_bridge_port *port);
 /* br_sysfs_if.c */
 extern const struct sysfs_ops brport_sysfs_ops;
 extern int br_sysfs_addif(struct net_bridge_port *p);
+extern void br_sysfs_renameif(struct net_bridge_port *p);
 
 /* br_sysfs_br.c */
 extern int br_sysfs_addbr(struct net_device *dev);
@@ -441,6 +445,7 @@ extern void br_sysfs_delbr(struct net_device *dev);
 #else
 
 #define br_sysfs_addif(p)	(0)
+#define br_sysfs_renameif(p)	do { } while(0)
 #define br_sysfs_addbr(dev)	(0)
 #define br_sysfs_delbr(dev)	do { } while(0)
 #endif /* CONFIG_SYSFS */
diff --git a/net/bridge/br_sysfs_if.c b/net/bridge/br_sysfs_if.c
index 0b99164..6702d7d 100644
--- a/net/bridge/br_sysfs_if.c
+++ b/net/bridge/br_sysfs_if.c
@@ -246,7 +246,7 @@ const struct sysfs_ops brport_sysfs_ops = {
 /*
  * Add sysfs entries to ethernet device added to a bridge.
  * Creates a brport subdirectory with bridge attributes.
- * Puts symlink in bridge's brport subdirectory
+ * Puts symlink in bridge's brif subdirectory
  */
 int br_sysfs_addif(struct net_bridge_port *p)
 {
@@ -265,7 +265,22 @@ int br_sysfs_addif(struct net_bridge_port *p)
 			goto out2;
 	}
 
-	err = sysfs_create_link(br->ifobj, &p->kobj, p->dev->name);
+	strncpy(p->sysfs_name, p->dev->name, IFNAMSIZ);
+	err = sysfs_create_link(br->ifobj, &p->kobj, p->sysfs_name);
 out2:
 	return err;
 }
+
+/* Rename bridge's brif symlink */
+void br_sysfs_renameif(struct net_bridge_port *p)
+{
+	struct net_bridge *br = p->br;
+	int err;
+
+	err = sysfs_rename_link(br->ifobj, &p->kobj,
+		p->sysfs_name, p->dev->name);
+	if (err)
+		printk(KERN_ERR "%s: unable to rename sysfs link %s to %s (%d)",
+			br->dev->name, p->sysfs_name, p->dev->name, err);
+	strncpy(p->sysfs_name, p->dev->name, IFNAMSIZ);
+}
-- 
1.7.0.4

-- 
Simon Arlott

^ permalink raw reply related

* Re: 2.6.34-rc5-mmotm0428 - 2 RCU whinges in mac80211
From: Valdis.Kletnieks @ 2010-05-06 18:29 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Andrew Morton, David S. Miller, John W. Linville, linux-kernel,
	netdev, linux-wireless
In-Reply-To: <1273168696.23199.0.camel@jlt3.sipsolutions.net>

[-- Attachment #1: Type: text/plain, Size: 947 bytes --]

On Thu, 06 May 2010 19:58:16 +0200, Johannes Berg said:
> On Thu, 2010-05-06 at 13:35 -0400, Valdis.Kletnieks@vt.edu wrote:
> > Spotted these in my dmesg, hopefully somebody is interested...
> > 
> > [   54.076863] wlan0: deauthenticating from 00:11:20:a4:4c:11 by local choice (reason=3)
> > [   54.080580] 
> > [   54.080581] ===================================================
> > [   54.080584] [ INFO: suspicious rcu_dereference_check() usage. ]
> > [   54.080586] ---------------------------------------------------
> > [   54.080589] net/mac80211/sta_info.c:858 invoked rcu_dereference_check() without protection!
> 
> I'm pretty sure I fixed those so there should be a patch on its way to
> or in mainline somewhere.

Oh,  OK.  I didn't remember seeing them mentioned before, but I must have
missed somebody else's post about it due to the lkml firehose effect. I've
moved these to my "whinge if it isn't fixed in the next -mmotm" queue.


[-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply

* Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3
From: Pankaj Thakkar @ 2010-05-06 18:04 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Dmitry Torokhov, pv-drivers@vmware.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, Christoph Hellwig
In-Reply-To: <20100506081933.GF10044@redhat.com>

On Thu, May 06, 2010 at 01:19:33AM -0700, Gleb Natapov wrote:
> Overhead of interpreting bytecode plugin is written in. Or are you
> saying plugin is x86 assembly (32bit or 64bit btw?) and other arches
> will have to have in kernel x86 emulator to use the plugin (like some
> of them had for vgabios)? 
> 

Plugin is x86 or x64 machine code. You write the plugin in C and compile it using gcc/ld to get the object file, we map the relevant sections only to the OS space. 

NPA is a way of enabling passthrough of SR-IOV NICs with live migration support on ESX Hypervisor which runs only on x86/x64 hardware. It only supports x86/x64 guest OS. So we don't have to worry about other architectures. If NPA approach needs to be extended and adopted by other hypervisors then we have to take care of that. Today we have two plugins images per VF (one for 32-bit, one for 64-bit).

^ permalink raw reply

* Re: [PATCH] ipv4: remove ip_rt_secret timer
From: Neil Horman @ 2010-05-06 18:02 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, davem, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <1273167155.2853.49.camel@edumazet-laptop>

On Thu, May 06, 2010 at 07:32:35PM +0200, Eric Dumazet wrote:
> Le jeudi 06 mai 2010 à 13:16 -0400, Neil Horman a écrit :
> > A while back there was a discussion regarding the rt_secret_interval timer.
> > Given that we've had the ability to do emergency route cache rebuilds for awhile
> > now, based on a statistical analysis of the various hash chain lengths in the
> > cache, the use of the flush timer is somewhat redundant.  This patch removes the
> > rt_secret_interval sysctl, allowing us to rely solely on the statistical
> > analysis mechanism to determine the need for route cache flushes.
> > 
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > 
> > 
> 
> Nice cleanup try Neil, but this gives to attackers more time to hit the
> cache (infinite time should be enough as a matter of fact ;) )
> 
Not sure I follow what your complaint is.  I get that this gives attackers
plenty of time to try to attack the cache, but thats rather the point of the
statistics gathering for the cache, and why I don't think we need the secret
timer any more.   With the statistical analysis we do on the route cache every
gc cycle, we can tell if an attacker has guessed our rt_genid value, and is
making any chains in the cache abnormally long.  Thats when we do the rebuild,
modifying the rt_genid, forcing the attacker to re-discover it (which should be
difficult).  Theres no need to change this periodically if you're not being
attacked.
 
> Hints : 
> 
> - What is the initial value of rt_genid ?
> 
> - How/When is it changed (full 32 bits are changed or small
> perturbations ? check rt_cache_invalidate())
> 
/*
 * Pertubation of rt_genid by a small quantity [1..256]
 * Using 8 bits of shuffling ensure we can call rt_cache_invalidate()
 * many times (2^24) without giving recent rt_genid.
 * Jenkins hash is strong enough that litle changes of rt_genid are OK.
 */
static void rt_cache_invalidate(struct net *net)
{
        unsigned char shuffle;

        get_random_bytes(&shuffle, sizeof(shuffle));
        atomic_add(shuffle + 1U, &net->ipv4.rt_genid);
}

Clearly, its small changes.  To paraphrase the comment, Changes to rt_genid are
small enough to be confident that we don't repetatively use a gen_id often, but
sufficiently random that attackers cannot easily guess the next gen_id based on
the current value.  Both the timer and the statistics code use this invalidation
technique previously, and the latter continues to do so.

I've not changed anything regarding how we
invalidate, only when we choose to invalidate.  Invalidation can lead to
slowdowns during routing, since it send frames through the slow path after an
invalidation.  It behooves us to avoid preforming this invalidation without
need, and since we have a mechanism in place to do that invalidation specfically
when we need to, lets get rid of the code that handles that, and make it a bit
cleaner.  If there are users that feel strongly that they need to defend against
potential attacks by periodically changing their rt_genid, its still possible.
Its as simple as putting:
echo -1 > /proc/sys/net/ipv4/route/flush
in a cron job.

Regards
Neil


^ permalink raw reply

* Re: [PATCH] net: deliver skbs on inactive slaves to exact matches
From: Jay Vosburgh @ 2010-05-06 18:01 UTC (permalink / raw)
  To: John Fastabend; +Cc: bonding-devel, netdev, christopher.leech, andy, kaber
In-Reply-To: <20100506172438.26339.93520.stgit@jf-dev2-dcblab>

John Fastabend <john.r.fastabend@intel.com> wrote:

>Currently, the accelerated receive path for VLAN's will
>drop packets if the real device is an inactive slave and
>is not one of the special pkts tested for in
>skb_bond_should_drop().  This behavior is different then
>the non-accelerated path and for pkts over a bonded vlan.
>
>For example,
>
>vlanx -> bond0 -> ethx
>
>will be dropped in the vlan path and not delivered to any
>packet handlers.  However,
>
>bond0 -> vlanx -> ethx
>
>will be delivered to handlers that match the exact dev,
>because the VLAN path checks the real_dev which is not a
>slave and netif_recv_skb() doesn't drop frames but only
>delivers them to exact matches.
>
>This patch adds a pkt_type PACKET_DROP which is now used
>to identify skbs that would previously been dropped and
>allows the skb to continue to skb_netif_recv().  Here we
>add logic to check for PACKET_DROP and if so only deliver
>to handlers that match exactly.  IMHO this is more
>consistent and gives pkt handlers a way to identify skbs
>that come from inactive slaves.

	I was looking at this some yesterday and this morning, trying to
figure out if a packet going up the IP protocol stack with pkt_type ==
PACKET_DROP would break anything.  For example:

net/ipv4/ip_options.c:
int ip_options_rcv_srr(struct sk_buff *skb)
{
[...]
        if (!opt->srr)
                return 0;

        if (skb->pkt_type != PACKET_HOST)
                return -EINVAL;

	or:

net/ipv4/tcp_ipv4.c:
int tcp_v4_rcv(struct sk_buff *skb)
{
        const struct iphdr *iph;
        struct tcphdr *th;
        struct sock *sk;
        int ret;
        struct net *net = dev_net(skb->dev);

        if (skb->pkt_type != PACKET_HOST)
                goto discard_it;

	Is pkt_type == PACKET_DROP instead going to break things for
these, or the other similar cases?

	I think what you're looking for is really the usual PACKET_HOST,
PACKET_OTHERHOST, et al, upper protocol behavior, with the exception
that at the __netif_receive_skb level, wildcard matches in the ptypes
are not done.  Setting the pkt_type to PACKET_DROP loses the _HOST,
_OTHERHOST, etc, information, but sends the packet up the stack anyway,
and I'm not sure that's really the right thing to do.

	-J

>This allows a third case to function which is important for
>doing multipath with FCoE traffic while LAN traffic bonded,
>
>bond0 -> ethx
>          |
>vlanx -> --
>
>Here the vlan is not in bond0 but the FCoE handler can still
>receive the skb.  Previously these skbs were dropped.
>
>I have tested the following 4 configurations in failover modes
>and load balancing modes and have not seen any duplicate packets
>or unexpected bahavior.
>
># bond0 -> ethx
>
># vlanx -> bond0 -> ethx
>
># bond0 -> vlanx -> ethx
>
># bond0 -> ethx
>            |
>  vlanx -> --
>
>Also this removes the PACKET_FASTROUTE define which was not being
>used.
>
>Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>---
>
> include/linux/if_packet.h |    2 +-
> net/8021q/vlan_core.c     |    4 ++--
> net/core/dev.c            |   25 ++++++++++++++++++-------
> 3 files changed, 21 insertions(+), 10 deletions(-)
>
>diff --git a/include/linux/if_packet.h b/include/linux/if_packet.h
>index 6ac23ef..9d079fa 100644
>--- a/include/linux/if_packet.h
>+++ b/include/linux/if_packet.h
>@@ -28,7 +28,7 @@ struct sockaddr_ll {
> #define PACKET_OUTGOING		4		/* Outgoing of any type */
> /* These ones are invisible by user level */
> #define PACKET_LOOPBACK		5		/* MC/BRD frame looped back */
>-#define PACKET_FASTROUTE	6		/* Fastrouted frame	*/
>+#define PACKET_DROP		6		/* Drop packet 		*/
>
> /* Packet socket options */
>
>diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
>index c584a0a..4510e08 100644
>--- a/net/8021q/vlan_core.c
>+++ b/net/8021q/vlan_core.c
>@@ -12,7 +12,7 @@ int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp,
> 		return NET_RX_DROP;
>
> 	if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
>-		goto drop;
>+		skb->pkt_type = PACKET_DROP;
>
> 	skb->skb_iif = skb->dev->ifindex;
> 	__vlan_hwaccel_put_tag(skb, vlan_tci);
>@@ -84,7 +84,7 @@ vlan_gro_common(struct napi_struct *napi, struct vlan_group *grp,
> 	struct sk_buff *p;
>
> 	if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
>-		goto drop;
>+		skb->pkt_type = PACKET_DROP;
>
> 	skb->skb_iif = skb->dev->ifindex;
> 	__vlan_hwaccel_put_tag(skb, vlan_tci);
>diff --git a/net/core/dev.c b/net/core/dev.c
>index 36d53be..cefac4f 100644
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -2776,7 +2776,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
> 	struct net_device *orig_dev;
> 	struct net_device *master;
> 	struct net_device *null_or_orig;
>-	struct net_device *null_or_bond;
>+	struct net_device *dev_or_bond;
> 	int ret = NET_RX_DROP;
> 	__be16 type;
>
>@@ -2793,13 +2793,24 @@ static int __netif_receive_skb(struct sk_buff *skb)
> 	if (!skb->skb_iif)
> 		skb->skb_iif = skb->dev->ifindex;
>
>+	/*
>+	 * bonding note: skbs received on inactive slaves should only
>+	 * be delivered to pkt handlers that are exact matches.  Also
>+	 * the pkt_type field will be marked PACKET_DROP.  If packet
>+	 * handlers are sensitive to duplicate packets these skbs will
>+	 * need to be dropped at the handler.  The vlan accel path may
>+	 * have already set PACKET_DROP.
>+	 */
> 	null_or_orig = NULL;
> 	orig_dev = skb->dev;
> 	master = ACCESS_ONCE(orig_dev->master);
>-	if (master) {
>-		if (skb_bond_should_drop(skb, master))
>+	if (skb->pkt_type == PACKET_DROP)
>+		null_or_orig = orig_dev;
>+	else if (master) {
>+		if (skb_bond_should_drop(skb, master)) {
>+			skb->pkt_type = PACKET_DROP;
> 			null_or_orig = orig_dev; /* deliver only exact match */
>-		else
>+		} else
> 			skb->dev = master;
> 	}
>
>@@ -2849,10 +2860,10 @@ ncls:
> 	 * device that may have registered for a specific ptype.  The
> 	 * handler may have to adjust skb->dev and orig_dev.
> 	 */
>-	null_or_bond = NULL;
>+	dev_or_bond = skb->dev;
> 	if ((skb->dev->priv_flags & IFF_802_1Q_VLAN) &&
> 	    (vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING)) {
>-		null_or_bond = vlan_dev_real_dev(skb->dev);
>+		dev_or_bond = vlan_dev_real_dev(skb->dev);
> 	}
>
> 	type = skb->protocol;
>@@ -2860,7 +2871,7 @@ ncls:
> 			&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
> 		if (ptype->type == type && (ptype->dev == null_or_orig ||
> 		     ptype->dev == skb->dev || ptype->dev == orig_dev ||
>-		     ptype->dev == null_or_bond)) {
>+		     ptype->dev == dev_or_bond)) {
> 			if (pt_prev)
> 				ret = deliver_skb(skb, pt_prev, orig_dev);
> 			pt_prev = ptype;
>

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* Re: 2.6.34-rc5-mmotm0428 - 2 RCU whinges in mac80211
From: Johannes Berg @ 2010-05-06 17:58 UTC (permalink / raw)
  To: Valdis.Kletnieks-PjAqaU27lzQ
  Cc: Andrew Morton, David S. Miller, John W. Linville,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <5820.1273167313@localhost>

On Thu, 2010-05-06 at 13:35 -0400, Valdis.Kletnieks-PjAqaU27lzQ@public.gmane.org wrote:
> Spotted these in my dmesg, hopefully somebody is interested...
> 
> [   54.076863] wlan0: deauthenticating from 00:11:20:a4:4c:11 by local choice (reason=3)
> [   54.080580] 
> [   54.080581] ===================================================
> [   54.080584] [ INFO: suspicious rcu_dereference_check() usage. ]
> [   54.080586] ---------------------------------------------------
> [   54.080589] net/mac80211/sta_info.c:858 invoked rcu_dereference_check() without protection!

I'm pretty sure I fixed those so there should be a patch on its way to
or in mainline somewhere.

johannes

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* RE: ixgbe and mac-vlans problem
From: Tantilov, Emil S @ 2010-05-06 17:51 UTC (permalink / raw)
  To: Ben Greear; +Cc: Arnd Bergmann, NetDev, Patrick McHardy
In-Reply-To: <4BE2ECEE.7070200@candelatech.com>

[-- Attachment #1: Type: text/plain, Size: 1210 bytes --]

Ben Greear wrote:
> On 04/30/2010 03:26 PM, Tantilov, Emil S wrote:
>> Ben Greear wrote:
>>> On 04/30/2010 02:13 PM, Tantilov, Emil S wrote:
> 
>>>> I ran a quick test in my setup with 82599 and was able to pass
>>>> traffic on all 50 mac-vlans without issues. This is on net-next.
>>> 
>>> For an 82599 system, I can get 127 mac-vlans working out of 500
>>> created. 
>>> 
>>> That NIC also does not go PROMISC with lots (500) of mac-vlans.
>>> 
>>> Once I put it in promisc mode manually, it works fine.
>>> 
>>> So, I think whatever logic is supposed to put the NIC into promisc
>>> mode when it overflows it's lookup tables isn't working for ixgbe
>>> in 2.6.31.12.
>> 
>> Yeah, you're right. I was able to repro it.
>> 
>> We'll look into it.
> 
> I'd be happy to test out a patch if you have one available.
> 
> If you don't expect to have one soon, please let me know and
> I'll add work-arounds to my code to throw ixgbe NICs into PROMISC
> mode manually.
> 
> Thanks,
> Ben

Hi Ben,

We do have a patch in testing (see attached). It may not apply cleanly as it is on top of some other patches currently in validation. Let me know if it works for you.

Thanks,
Emil

[-- Attachment #2: ixgbe_macvlan_v5.patch --]
[-- Type: application/octet-stream, Size: 2680 bytes --]

Introduce uc_set_promisc flag to better handle the enabling of promisc when
the number of RARs is exceeded.

Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>

 drivers/net/ixgbe/ixgbe_common.c |    5 ++++-
 drivers/net/ixgbe/ixgbe_main.c   |    8 ++++----
 drivers/net/ixgbe/ixgbe_type.h   |    1 +
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_common.c b/drivers/net/ixgbe/ixgbe_common.c
index ee42fd6..49775b6 100644
--- a/drivers/net/ixgbe/ixgbe_common.c
+++ b/drivers/net/ixgbe/ixgbe_common.c
@@ -1397,14 +1397,17 @@ s32 ixgbe_update_uc_addr_list_generic(struct ixgbe_hw *hw,
 			fctrl = IXGBE_READ_REG(hw, IXGBE_FCTRL);
 			fctrl |= IXGBE_FCTRL_UPE;
 			IXGBE_WRITE_REG(hw, IXGBE_FCTRL, fctrl);
+			hw->addr_ctrl.uc_set_promisc = true;
 		}
 	} else {
 		/* only disable if set by overflow, not by user */
-		if (old_promisc_setting && !hw->addr_ctrl.user_set_promisc) {
+		if ((old_promisc_setting && hw->addr_ctrl.uc_set_promisc) &&
+		   !(hw->addr_ctrl.user_set_promisc)){
 			hw_dbg(hw, " Leaving address overflow promisc mode\n");
 			fctrl = IXGBE_READ_REG(hw, IXGBE_FCTRL);
 			fctrl &= ~IXGBE_FCTRL_UPE;
 			IXGBE_WRITE_REG(hw, IXGBE_FCTRL, fctrl);
+			hw->addr_ctrl.uc_set_promisc = false;
 		}
 	}
 
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 577ac72..7bf3b40 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -2970,8 +2970,8 @@ void ixgbe_set_rx_mode(struct net_device *netdev)
 
 	fctrl = IXGBE_READ_REG(hw, IXGBE_FCTRL);
 
-	if (netdev->flags & IFF_PROMISC) {
-		hw->addr_ctrl.user_set_promisc = 1;
+	if (netdev->flags & IFF_PROMISC){
+		hw->addr_ctrl.user_set_promisc = true;	
 		fctrl |= (IXGBE_FCTRL_UPE | IXGBE_FCTRL_MPE);
 		/* don't hardware filter vlans in promisc mode */
 		ixgbe_vlan_filter_disable(adapter);
@@ -2979,11 +2979,11 @@ void ixgbe_set_rx_mode(struct net_device *netdev)
 		if (netdev->flags & IFF_ALLMULTI) {
 			fctrl |= IXGBE_FCTRL_MPE;
 			fctrl &= ~IXGBE_FCTRL_UPE;
-		} else {
+		} else if (!hw->addr_ctrl.uc_set_promisc) {
 			fctrl &= ~(IXGBE_FCTRL_UPE | IXGBE_FCTRL_MPE);
 		}
 		ixgbe_vlan_filter_enable(adapter);
-		hw->addr_ctrl.user_set_promisc = 0;
+		hw->addr_ctrl.user_set_promisc = false;	
 	}
 
 	IXGBE_WRITE_REG(hw, IXGBE_FCTRL, fctrl);
diff --git a/drivers/net/ixgbe/ixgbe_type.h b/drivers/net/ixgbe/ixgbe_type.h
index 1c89cbb..38f26bb 100644
--- a/drivers/net/ixgbe/ixgbe_type.h
+++ b/drivers/net/ixgbe/ixgbe_type.h
@@ -2285,6 +2285,7 @@ struct ixgbe_addr_filter_info {
 	u32 mc_addr_in_rar_count;
 	u32 mta_in_use;
 	u32 overflow_promisc;
+	bool uc_set_promisc;
 	bool user_set_promisc;
 };
 

^ permalink raw reply related

* Re: Receive issues with bonding and vlans
From: Jay Vosburgh @ 2010-05-06 17:42 UTC (permalink / raw)
  To: John Fastabend
  Cc: Leech, Christopher, netdev@vger.kernel.org, Andy Gospodarek,
	Patrick McHardy, bonding-devel@lists.sourceforge.net
In-Reply-To: <4BE103BE.3040805@intel.com>

John Fastabend <john.r.fastabend@intel.com> wrote:

>Jay Vosburgh wrote:
>> John Fastabend <john.r.fastabend@intel.com> wrote:
[...]
>>> #3 bond0 --> ethx
>>>   vlanx --> -|
>>>
>>> Here is the case where adding the IFF_SLAVE bit doesn't work as I
>>> hoped. We don't want to run skb_bond_should_drop here.
>>
>>         Yes, this is tricky because the VLAN device will copy the
>> dev->flags from the device it's placed atop, so the VLAN will inherit
>> the ethx's IFF_SLAVE flag.  This happens regardless of the setup order
>> (enslave ethX, then add VLAN, or vice versa).
>>
>
>This doesn't appear to be true, adding a VLAN on ethx then enslave ethx
>doesn't set the IFF_SLAVE flag on the VLAN.  Unless I am missing
>something.

	I tried this again, and yes, the vlan device inherits the flags
of the device at the time the vlan is added.

	I think I was confused because the vlan device doesn't lose
IFF_SLAVE if the underlying ethX is taken out of the bond.  I suspect
both of these behaviors are because netdev_set_master doesn't do a
notifier call (just an rtmsg_ifinfo) when it changes dev->flags outside
of dev_set_flags.

	I don't think the vlan device should pick up IFF_SLAVE, though,
when the vlan device itself is not a slave, so that part seems correct.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* 2.6.34-rc5-mmotm0428 - 2 RCU whinges in mac80211
From: Valdis.Kletnieks-PjAqaU27lzQ @ 2010-05-06 17:35 UTC (permalink / raw)
  To: Andrew Morton, Johannes Berg, David S. Miller, John W. Linville
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 5105 bytes --]

Spotted these in my dmesg, hopefully somebody is interested...

[   54.076863] wlan0: deauthenticating from 00:11:20:a4:4c:11 by local choice (reason=3)
[   54.080580] 
[   54.080581] ===================================================
[   54.080584] [ INFO: suspicious rcu_dereference_check() usage. ]
[   54.080586] ---------------------------------------------------
[   54.080589] net/mac80211/sta_info.c:858 invoked rcu_dereference_check() without protection!
[   54.080591] 
[   54.080591] other info that might help us debug this:
[   54.080592] 
[   54.080594] 
[   54.080595] rcu_scheduler_active = 1, debug_locks = 1
[   54.080597] no locks held by hald/3362.
[   54.080599] 
[   54.080599] stack backtrace:
[   54.080602] Pid: 3362, comm: hald Not tainted 2.6.34-rc5-mmotm0428 #1
[   54.080604] Call Trace:
[   54.080607]  <IRQ>  [<ffffffff81064eb9>] lockdep_rcu_dereference+0x9d/0xa5
[   54.080619]  [<ffffffff815649f9>] ieee80211_find_sta_by_hw+0x46/0x10f
[   54.080623]  [<ffffffff81564ad9>] ieee80211_find_sta+0x17/0x19
[   54.080628]  [<ffffffff8136e1ee>] iwlagn_tx_queue_reclaim+0xe7/0x1bd
[   54.080632]  [<ffffffff81371da3>] iwlagn_rx_reply_tx+0x4e9/0x5a6
[   54.080638]  [<ffffffff8136501f>] iwl_rx_handle+0x268/0x3fe
[   54.080642]  [<ffffffff813664ea>] iwl_irq_tasklet+0x2d3/0x3e4
[   54.080647]  [<ffffffff8103e337>] tasklet_action+0x79/0xd7
[   54.080651]  [<ffffffff8103fd54>] __do_softirq+0x15a/0x2a2
[   54.080656]  [<ffffffff8100358c>] call_softirq+0x1c/0x34
[   54.080659]  [<ffffffff81004ad8>] do_softirq+0x44/0xf0
[   54.080663]  [<ffffffff8103f3ba>] irq_exit+0x4a/0xb3
[   54.080667]  [<ffffffff81004211>] do_IRQ+0xa7/0xbe
[   54.080671]  [<ffffffff8159cd93>] ret_from_intr+0x0/0xf
[   54.080673]  <EOI> 
[   54.114898] wlan0: authenticate with 00:11:20:a4:4c:11 (try 1)
[   54.122103] wlan0: authenticated
[   54.122472] wlan0: associate with 00:11:20:a4:4c:11 (try 1)
[   54.128267] wlan0: RX AssocResp from 00:11:20:a4:4c:11 (capab=0x31 status=0 aid=10)
[   54.128271] wlan0: associated
[   54.132990] ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
[   54.133333] cfg80211: Calling CRDA for country: US
[   54.135752] 
[   54.135753] ===================================================
[   54.135756] [ INFO: suspicious rcu_dereference_check() usage. ]
[   54.135758] ---------------------------------------------------
[   54.135760] net/mac80211/sta_info.c:858 invoked rcu_dereference_check() without protection!
[   54.135763] 
[   54.135763] other info that might help us debug this:
[   54.135764] 
[   54.135766] 
[   54.135767] rcu_scheduler_active = 1, debug_locks = 1
[   54.135769] 1 lock held by udevd/2933:
[   54.135771]  #0:  (policy_rwlock){.+.+..}, at: [<ffffffff811d4f60>] security_compute_av+0x29/0x259
[   54.135781] 
[   54.135782] stack backtrace:
[   54.135785] Pid: 2933, comm: udevd Not tainted 2.6.34-rc5-mmotm0428 #1
[   54.135787] Call Trace:
[   54.135790]  <IRQ>  [<ffffffff81064eb9>] lockdep_rcu_dereference+0x9d/0xa5
[   54.135800]  [<ffffffff81564a49>] ieee80211_find_sta_by_hw+0x96/0x10f
[   54.135804]  [<ffffffff81564ad9>] ieee80211_find_sta+0x17/0x19
[   54.135809]  [<ffffffff8136e1ee>] iwlagn_tx_queue_reclaim+0xe7/0x1bd
[   54.135813]  [<ffffffff81371da3>] iwlagn_rx_reply_tx+0x4e9/0x5a6
[   54.135819]  [<ffffffff8136501f>] iwl_rx_handle+0x268/0x3fe
[   54.135823]  [<ffffffff813664ea>] iwl_irq_tasklet+0x2d3/0x3e4
[   54.135828]  [<ffffffff8103e337>] tasklet_action+0x79/0xd7
[   54.135832]  [<ffffffff8103fd54>] __do_softirq+0x15a/0x2a2
[   54.135837]  [<ffffffff8100358c>] call_softirq+0x1c/0x34
[   54.135841]  [<ffffffff81004ad8>] do_softirq+0x44/0xf0
[   54.135844]  [<ffffffff8103f3ba>] irq_exit+0x4a/0xb3
[   54.135847]  [<ffffffff81004211>] do_IRQ+0xa7/0xbe
[   54.135852]  [<ffffffff8159cd93>] ret_from_intr+0x0/0xf
[   54.135854]  <EOI>  [<ffffffff811ce48b>] ? avtab_search_node+0x6c/0x7b
[   54.135862]  [<ffffffff811d4cdb>] context_struct_compute_av+0x136/0x271
[   54.135867]  [<ffffffff811d506b>] security_compute_av+0x134/0x259
[   54.135872]  [<ffffffff811c2560>] avc_has_perm_noaudit+0x22e/0x537
[   54.135876]  [<ffffffff811c289f>] avc_has_perm+0x36/0x69
[   54.135880]  [<ffffffff810c52cd>] ? __do_fault+0x254/0x3f1
[   54.135885]  [<ffffffff811c4bca>] current_has_perm+0x3a/0x3f
[   54.135888]  [<ffffffff811c4c7d>] selinux_task_create+0x17/0x19
[   54.135893]  [<ffffffff811beced>] security_task_create+0x11/0x13
[   54.135898]  [<ffffffff81036dae>] copy_process+0x8d/0x11ed
[   54.135901]  [<ffffffff810c5434>] ? __do_fault+0x3bb/0x3f1
[   54.135905]  [<ffffffff8107c26b>] ? rcu_read_lock+0x0/0x35
[   54.135909]  [<ffffffff810380b2>] do_fork+0x1a4/0x3b5
[   54.135913]  [<ffffffff8107c2c1>] ? rcu_read_unlock+0x21/0x23
[   54.135917]  [<ffffffff8107dc56>] ? audit_filter_syscall+0xb4/0xc8
[   54.135921]  [<ffffffff81059044>] ? up_read+0x1e/0x36
[   54.135924]  [<ffffffff8105d760>] ? current_kernel_time+0x28/0x50
[   54.135929]  [<ffffffff81009705>] sys_clone+0x23/0x25
[   54.135933]  [<ffffffff810029d3>] stub_clone+0x13/0x20
[   54.135936]  [<ffffffff8100266b>] ? system_call_fastpath+0x16/0x1b



[-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply

* Re: [PATCH] ipv4: remove ip_rt_secret timer
From: Eric Dumazet @ 2010-05-06 17:32 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, davem, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <20100506171639.GA5063@hmsreliant.think-freely.org>

Le jeudi 06 mai 2010 à 13:16 -0400, Neil Horman a écrit :
> A while back there was a discussion regarding the rt_secret_interval timer.
> Given that we've had the ability to do emergency route cache rebuilds for awhile
> now, based on a statistical analysis of the various hash chain lengths in the
> cache, the use of the flush timer is somewhat redundant.  This patch removes the
> rt_secret_interval sysctl, allowing us to rely solely on the statistical
> analysis mechanism to determine the need for route cache flushes.
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> 
> 

Nice cleanup try Neil, but this gives to attackers more time to hit the
cache (infinite time should be enough as a matter of fact ;) )

Hints : 

- What is the initial value of rt_genid ?

- How/When is it changed (full 32 bits are changed or small
perturbations ? check rt_cache_invalidate())

Thanks

>  include/net/netns/ipv4.h |    1 
>  net/ipv4/route.c         |  108 -----------------------------------------------
>  2 files changed, 2 insertions(+), 107 deletions(-)
> 
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index ae07fee..d68c3f1 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -55,7 +55,6 @@ struct netns_ipv4 {
>  	int sysctl_rt_cache_rebuild_count;
>  	int current_rt_cache_rebuild_count;
>  
> -	struct timer_list rt_secret_timer;
>  	atomic_t rt_genid;
>  
>  #ifdef CONFIG_IP_MROUTE
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index a947428..ffd3da1 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -129,7 +129,6 @@ static int ip_rt_gc_elasticity __read_mostly	= 8;
>  static int ip_rt_mtu_expires __read_mostly	= 10 * 60 * HZ;
>  static int ip_rt_min_pmtu __read_mostly		= 512 + 20 + 20;
>  static int ip_rt_min_advmss __read_mostly	= 256;
> -static int ip_rt_secret_interval __read_mostly	= 10 * 60 * HZ;
>  static int rt_chain_length_max __read_mostly	= 20;
>  
>  static struct delayed_work expires_work;
> @@ -918,32 +917,11 @@ void rt_cache_flush_batch(void)
>  	rt_do_flush(!in_softirq());
>  }
>  
> -/*
> - * We change rt_genid and let gc do the cleanup
> - */
> -static void rt_secret_rebuild(unsigned long __net)
> -{
> -	struct net *net = (struct net *)__net;
> -	rt_cache_invalidate(net);
> -	mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
> -}
> -
> -static void rt_secret_rebuild_oneshot(struct net *net)
> -{
> -	del_timer_sync(&net->ipv4.rt_secret_timer);
> -	rt_cache_invalidate(net);
> -	if (ip_rt_secret_interval)
> -		mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
> -}
> -
>  static void rt_emergency_hash_rebuild(struct net *net)
>  {
> -	if (net_ratelimit()) {
> +	if (net_ratelimit())
>  		printk(KERN_WARNING "Route hash chain too long!\n");
> -		printk(KERN_WARNING "Adjust your secret_interval!\n");
> -	}
> -
> -	rt_secret_rebuild_oneshot(net);
> +	rt_cache_invalidate(net);
>  }
>  
>  /*
> @@ -3101,48 +3079,6 @@ static int ipv4_sysctl_rtcache_flush(ctl_table *__ctl, int write,
>  	return -EINVAL;
>  }
>  
> -static void rt_secret_reschedule(int old)
> -{
> -	struct net *net;
> -	int new = ip_rt_secret_interval;
> -	int diff = new - old;
> -
> -	if (!diff)
> -		return;
> -
> -	rtnl_lock();
> -	for_each_net(net) {
> -		int deleted = del_timer_sync(&net->ipv4.rt_secret_timer);
> -		long time;
> -
> -		if (!new)
> -			continue;
> -
> -		if (deleted) {
> -			time = net->ipv4.rt_secret_timer.expires - jiffies;
> -
> -			if (time <= 0 || (time += diff) <= 0)
> -				time = 0;
> -		} else
> -			time = new;
> -
> -		mod_timer(&net->ipv4.rt_secret_timer, jiffies + time);
> -	}
> -	rtnl_unlock();
> -}
> -
> -static int ipv4_sysctl_rt_secret_interval(ctl_table *ctl, int write,
> -					  void __user *buffer, size_t *lenp,
> -					  loff_t *ppos)
> -{
> -	int old = ip_rt_secret_interval;
> -	int ret = proc_dointvec_jiffies(ctl, write, buffer, lenp, ppos);
> -
> -	rt_secret_reschedule(old);
> -
> -	return ret;
> -}
> -
>  static ctl_table ipv4_route_table[] = {
>  	{
>  		.procname	= "gc_thresh",
> @@ -3251,13 +3187,6 @@ static ctl_table ipv4_route_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec,
>  	},
> -	{
> -		.procname	= "secret_interval",
> -		.data		= &ip_rt_secret_interval,
> -		.maxlen		= sizeof(int),
> -		.mode		= 0644,
> -		.proc_handler	= ipv4_sysctl_rt_secret_interval,
> -	},
>  	{ }
>  };
>  
> @@ -3337,36 +3266,6 @@ static __net_initdata struct pernet_operations sysctl_route_ops = {
>  #endif
>  
> 
> -static __net_init int rt_secret_timer_init(struct net *net)
> -{
> -	atomic_set(&net->ipv4.rt_genid,
> -			(int) ((num_physpages ^ (num_physpages>>8)) ^
> -			(jiffies ^ (jiffies >> 7))));
> -
> -	net->ipv4.rt_secret_timer.function = rt_secret_rebuild;
> -	net->ipv4.rt_secret_timer.data = (unsigned long)net;
> -	init_timer_deferrable(&net->ipv4.rt_secret_timer);
> -
> -	if (ip_rt_secret_interval) {
> -		net->ipv4.rt_secret_timer.expires =
> -			jiffies + net_random() % ip_rt_secret_interval +
> -			ip_rt_secret_interval;
> -		add_timer(&net->ipv4.rt_secret_timer);
> -	}
> -	return 0;
> -}
> -
> -static __net_exit void rt_secret_timer_exit(struct net *net)
> -{
> -	del_timer_sync(&net->ipv4.rt_secret_timer);
> -}
> -
> -static __net_initdata struct pernet_operations rt_secret_timer_ops = {
> -	.init = rt_secret_timer_init,
> -	.exit = rt_secret_timer_exit,
> -};
> -
> -
>  #ifdef CONFIG_NET_CLS_ROUTE
>  struct ip_rt_acct __percpu *ip_rt_acct __read_mostly;
>  #endif /* CONFIG_NET_CLS_ROUTE */
> @@ -3424,9 +3323,6 @@ int __init ip_rt_init(void)
>  	schedule_delayed_work(&expires_work,
>  		net_random() % ip_rt_gc_interval + ip_rt_gc_interval);
>  
> -	if (register_pernet_subsys(&rt_secret_timer_ops))
> -		printk(KERN_ERR "Unable to setup rt_secret_timer\n");
> -
>  	if (ip_rt_proc_init())
>  		printk(KERN_ERR "Unable to create route proc files\n");
>  #ifdef CONFIG_XFRM
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



^ permalink raw reply

* [PATCH] net: deliver skbs on inactive slaves to exact matches
From: John Fastabend @ 2010-05-06 17:24 UTC (permalink / raw)
  To: bonding-devel, netdev
  Cc: john.r.fastabend, christopher.leech, andy, kaber, fubar

Currently, the accelerated receive path for VLAN's will
drop packets if the real device is an inactive slave and
is not one of the special pkts tested for in
skb_bond_should_drop().  This behavior is different then
the non-accelerated path and for pkts over a bonded vlan.

For example,

vlanx -> bond0 -> ethx

will be dropped in the vlan path and not delivered to any
packet handlers.  However,

bond0 -> vlanx -> ethx

will be delivered to handlers that match the exact dev,
because the VLAN path checks the real_dev which is not a
slave and netif_recv_skb() doesn't drop frames but only
delivers them to exact matches.

This patch adds a pkt_type PACKET_DROP which is now used
to identify skbs that would previously been dropped and
allows the skb to continue to skb_netif_recv().  Here we
add logic to check for PACKET_DROP and if so only deliver
to handlers that match exactly.  IMHO this is more
consistent and gives pkt handlers a way to identify skbs
that come from inactive slaves.

This allows a third case to function which is important for
doing multipath with FCoE traffic while LAN traffic bonded,

bond0 -> ethx
          |
vlanx -> --

Here the vlan is not in bond0 but the FCoE handler can still
receive the skb.  Previously these skbs were dropped.

I have tested the following 4 configurations in failover modes
and load balancing modes and have not seen any duplicate packets
or unexpected bahavior.

# bond0 -> ethx

# vlanx -> bond0 -> ethx

# bond0 -> vlanx -> ethx

# bond0 -> ethx
            |
  vlanx -> --

Also this removes the PACKET_FASTROUTE define which was not being
used.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

 include/linux/if_packet.h |    2 +-
 net/8021q/vlan_core.c     |    4 ++--
 net/core/dev.c            |   25 ++++++++++++++++++-------
 3 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/include/linux/if_packet.h b/include/linux/if_packet.h
index 6ac23ef..9d079fa 100644
--- a/include/linux/if_packet.h
+++ b/include/linux/if_packet.h
@@ -28,7 +28,7 @@ struct sockaddr_ll {
 #define PACKET_OUTGOING		4		/* Outgoing of any type */
 /* These ones are invisible by user level */
 #define PACKET_LOOPBACK		5		/* MC/BRD frame looped back */
-#define PACKET_FASTROUTE	6		/* Fastrouted frame	*/
+#define PACKET_DROP		6		/* Drop packet 		*/
 
 /* Packet socket options */
 
diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
index c584a0a..4510e08 100644
--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -12,7 +12,7 @@ int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp,
 		return NET_RX_DROP;
 
 	if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
-		goto drop;
+		skb->pkt_type = PACKET_DROP;
 
 	skb->skb_iif = skb->dev->ifindex;
 	__vlan_hwaccel_put_tag(skb, vlan_tci);
@@ -84,7 +84,7 @@ vlan_gro_common(struct napi_struct *napi, struct vlan_group *grp,
 	struct sk_buff *p;
 
 	if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
-		goto drop;
+		skb->pkt_type = PACKET_DROP;
 
 	skb->skb_iif = skb->dev->ifindex;
 	__vlan_hwaccel_put_tag(skb, vlan_tci);
diff --git a/net/core/dev.c b/net/core/dev.c
index 36d53be..cefac4f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2776,7 +2776,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
 	struct net_device *orig_dev;
 	struct net_device *master;
 	struct net_device *null_or_orig;
-	struct net_device *null_or_bond;
+	struct net_device *dev_or_bond;
 	int ret = NET_RX_DROP;
 	__be16 type;
 
@@ -2793,13 +2793,24 @@ static int __netif_receive_skb(struct sk_buff *skb)
 	if (!skb->skb_iif)
 		skb->skb_iif = skb->dev->ifindex;
 
+	/*
+	 * bonding note: skbs received on inactive slaves should only
+	 * be delivered to pkt handlers that are exact matches.  Also
+	 * the pkt_type field will be marked PACKET_DROP.  If packet
+	 * handlers are sensitive to duplicate packets these skbs will
+	 * need to be dropped at the handler.  The vlan accel path may
+	 * have already set PACKET_DROP.
+	 */
 	null_or_orig = NULL;
 	orig_dev = skb->dev;
 	master = ACCESS_ONCE(orig_dev->master);
-	if (master) {
-		if (skb_bond_should_drop(skb, master))
+	if (skb->pkt_type == PACKET_DROP)
+		null_or_orig = orig_dev;
+	else if (master) {
+		if (skb_bond_should_drop(skb, master)) {
+			skb->pkt_type = PACKET_DROP;
 			null_or_orig = orig_dev; /* deliver only exact match */
-		else
+		} else
 			skb->dev = master;
 	}
 
@@ -2849,10 +2860,10 @@ ncls:
 	 * device that may have registered for a specific ptype.  The
 	 * handler may have to adjust skb->dev and orig_dev.
 	 */
-	null_or_bond = NULL;
+	dev_or_bond = skb->dev;
 	if ((skb->dev->priv_flags & IFF_802_1Q_VLAN) &&
 	    (vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING)) {
-		null_or_bond = vlan_dev_real_dev(skb->dev);
+		dev_or_bond = vlan_dev_real_dev(skb->dev);
 	}
 
 	type = skb->protocol;
@@ -2860,7 +2871,7 @@ ncls:
 			&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
 		if (ptype->type == type && (ptype->dev == null_or_orig ||
 		     ptype->dev == skb->dev || ptype->dev == orig_dev ||
-		     ptype->dev == null_or_bond)) {
+		     ptype->dev == dev_or_bond)) {
 			if (pt_prev)
 				ret = deliver_skb(skb, pt_prev, orig_dev);
 			pt_prev = ptype;


^ permalink raw reply related

* [PATCH] ipv4: remove ip_rt_secret timer
From: Neil Horman @ 2010-05-06 17:16 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuznet, jmorris, yoshfuji, kaber, nhorman

A while back there was a discussion regarding the rt_secret_interval timer.
Given that we've had the ability to do emergency route cache rebuilds for awhile
now, based on a statistical analysis of the various hash chain lengths in the
cache, the use of the flush timer is somewhat redundant.  This patch removes the
rt_secret_interval sysctl, allowing us to rely solely on the statistical
analysis mechanism to determine the need for route cache flushes.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>


 include/net/netns/ipv4.h |    1 
 net/ipv4/route.c         |  108 -----------------------------------------------
 2 files changed, 2 insertions(+), 107 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index ae07fee..d68c3f1 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -55,7 +55,6 @@ struct netns_ipv4 {
 	int sysctl_rt_cache_rebuild_count;
 	int current_rt_cache_rebuild_count;
 
-	struct timer_list rt_secret_timer;
 	atomic_t rt_genid;
 
 #ifdef CONFIG_IP_MROUTE
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a947428..ffd3da1 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -129,7 +129,6 @@ static int ip_rt_gc_elasticity __read_mostly	= 8;
 static int ip_rt_mtu_expires __read_mostly	= 10 * 60 * HZ;
 static int ip_rt_min_pmtu __read_mostly		= 512 + 20 + 20;
 static int ip_rt_min_advmss __read_mostly	= 256;
-static int ip_rt_secret_interval __read_mostly	= 10 * 60 * HZ;
 static int rt_chain_length_max __read_mostly	= 20;
 
 static struct delayed_work expires_work;
@@ -918,32 +917,11 @@ void rt_cache_flush_batch(void)
 	rt_do_flush(!in_softirq());
 }
 
-/*
- * We change rt_genid and let gc do the cleanup
- */
-static void rt_secret_rebuild(unsigned long __net)
-{
-	struct net *net = (struct net *)__net;
-	rt_cache_invalidate(net);
-	mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
-}
-
-static void rt_secret_rebuild_oneshot(struct net *net)
-{
-	del_timer_sync(&net->ipv4.rt_secret_timer);
-	rt_cache_invalidate(net);
-	if (ip_rt_secret_interval)
-		mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
-}
-
 static void rt_emergency_hash_rebuild(struct net *net)
 {
-	if (net_ratelimit()) {
+	if (net_ratelimit())
 		printk(KERN_WARNING "Route hash chain too long!\n");
-		printk(KERN_WARNING "Adjust your secret_interval!\n");
-	}
-
-	rt_secret_rebuild_oneshot(net);
+	rt_cache_invalidate(net);
 }
 
 /*
@@ -3101,48 +3079,6 @@ static int ipv4_sysctl_rtcache_flush(ctl_table *__ctl, int write,
 	return -EINVAL;
 }
 
-static void rt_secret_reschedule(int old)
-{
-	struct net *net;
-	int new = ip_rt_secret_interval;
-	int diff = new - old;
-
-	if (!diff)
-		return;
-
-	rtnl_lock();
-	for_each_net(net) {
-		int deleted = del_timer_sync(&net->ipv4.rt_secret_timer);
-		long time;
-
-		if (!new)
-			continue;
-
-		if (deleted) {
-			time = net->ipv4.rt_secret_timer.expires - jiffies;
-
-			if (time <= 0 || (time += diff) <= 0)
-				time = 0;
-		} else
-			time = new;
-
-		mod_timer(&net->ipv4.rt_secret_timer, jiffies + time);
-	}
-	rtnl_unlock();
-}
-
-static int ipv4_sysctl_rt_secret_interval(ctl_table *ctl, int write,
-					  void __user *buffer, size_t *lenp,
-					  loff_t *ppos)
-{
-	int old = ip_rt_secret_interval;
-	int ret = proc_dointvec_jiffies(ctl, write, buffer, lenp, ppos);
-
-	rt_secret_reschedule(old);
-
-	return ret;
-}
-
 static ctl_table ipv4_route_table[] = {
 	{
 		.procname	= "gc_thresh",
@@ -3251,13 +3187,6 @@ static ctl_table ipv4_route_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
-	{
-		.procname	= "secret_interval",
-		.data		= &ip_rt_secret_interval,
-		.maxlen		= sizeof(int),
-		.mode		= 0644,
-		.proc_handler	= ipv4_sysctl_rt_secret_interval,
-	},
 	{ }
 };
 
@@ -3337,36 +3266,6 @@ static __net_initdata struct pernet_operations sysctl_route_ops = {
 #endif
 
 
-static __net_init int rt_secret_timer_init(struct net *net)
-{
-	atomic_set(&net->ipv4.rt_genid,
-			(int) ((num_physpages ^ (num_physpages>>8)) ^
-			(jiffies ^ (jiffies >> 7))));
-
-	net->ipv4.rt_secret_timer.function = rt_secret_rebuild;
-	net->ipv4.rt_secret_timer.data = (unsigned long)net;
-	init_timer_deferrable(&net->ipv4.rt_secret_timer);
-
-	if (ip_rt_secret_interval) {
-		net->ipv4.rt_secret_timer.expires =
-			jiffies + net_random() % ip_rt_secret_interval +
-			ip_rt_secret_interval;
-		add_timer(&net->ipv4.rt_secret_timer);
-	}
-	return 0;
-}
-
-static __net_exit void rt_secret_timer_exit(struct net *net)
-{
-	del_timer_sync(&net->ipv4.rt_secret_timer);
-}
-
-static __net_initdata struct pernet_operations rt_secret_timer_ops = {
-	.init = rt_secret_timer_init,
-	.exit = rt_secret_timer_exit,
-};
-
-
 #ifdef CONFIG_NET_CLS_ROUTE
 struct ip_rt_acct __percpu *ip_rt_acct __read_mostly;
 #endif /* CONFIG_NET_CLS_ROUTE */
@@ -3424,9 +3323,6 @@ int __init ip_rt_init(void)
 	schedule_delayed_work(&expires_work,
 		net_random() % ip_rt_gc_interval + ip_rt_gc_interval);
 
-	if (register_pernet_subsys(&rt_secret_timer_ops))
-		printk(KERN_ERR "Unable to setup rt_secret_timer\n");
-
 	if (ip_rt_proc_init())
 		printk(KERN_ERR "Unable to create route proc files\n");
 #ifdef CONFIG_XFRM

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox