Netdev List


* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Nicolas de Pesloüan @ 2011-02-19 13:18 UTC
  To: Jiri Pirko
  Cc: Jay Vosburgh, David Miller, kaber, eric.dumazet, netdev,
	shemminger, andy
In-Reply-To: <20110219112842.GE2782@psychotron.redhat.com>

Le 19/02/2011 12:28, Jiri Pirko a écrit :
> Sat, Feb 19, 2011 at 12:08:31PM CET, jpirko@redhat.com wrote:
>> Sat, Feb 19, 2011 at 11:56:23AM CET, nicolas.2p.debian@gmail.com wrote:
>>> Le 19/02/2011 09:05, Jiri Pirko a écrit :
>>>> This patch converts bonding to use rx_handler. Results in cleaner
>>>> __netif_receive_skb() with much less exceptions needed. Also
>>>> bond-specific work is moved into bond code.
>>>>
>>>> Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>>>>
>>>> v1->v2:
>>>>          using skb_iif instead of new input_dev to remember original
>>>> 	device
>>>> v2->v3:
>>>> 	set orig_dev = skb->dev if skb_iif is set
>>>>
>>>
>>> Why do we need to let the rx_handlers call netif_rx() or __netif_receive_skb()?
>>>
>>> Bonding used to be handled with very little overhead, simply replacing
>>> skb->dev with skb->dev->master. Time has passed and we eventually
>>> added a lot of special processing for bonding into __netif_receive_skb(),
>>> but the overhead remained very light.
>>>
>>> Calling netif_rx() (or __netif_receive_skb()) to allow nesting would probably lead to some overhead.
>>>
>>> Can't we, instead, loop inside __netif_receive_skb(), and deliver
>>> whatever needs to be delivered, to whoever needs it, inside the loop?
>>>
>>> rx_handler = rcu_dereference(skb->dev->rx_handler);
>>> while (rx_handler) {
>>> 	/* ...  */
>>> 	orig_dev = skb->dev;
>>> 	skb = rx_handler(skb);
>>> 	/* ... */
>>> 	rx_handler = (skb->dev != orig_dev) ? rcu_dereference(skb->dev->rx_handler) : NULL;
>>> }
>>>
>>> This would reduce the overhead, while still allowing nesting: vlan on
>>> top of bonding, bridge on top of bonding, ...
>>
>> I see your point. Makes sense to me. But the loop would have to include
>> at least processing of ptype_all too. I'm going to cook a follow-up
>> patch.
>>
>
> DRAFT (doesn't modify rx_handlers):
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 4ebf7fe..e5dba47 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3115,6 +3115,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>   {
>   	struct packet_type *ptype, *pt_prev;
>   	rx_handler_func_t *rx_handler;
> +	struct net_device *dev;
>   	struct net_device *orig_dev;
>   	struct net_device *null_or_dev;
>   	int ret = NET_RX_DROP;
> @@ -3129,7 +3130,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
>   	if (netpoll_receive_skb(skb))
>   		return NET_RX_DROP;
>
> -	__this_cpu_inc(softnet_data.processed);
> +	skb->skb_iif = skb->dev->ifindex;
> +	orig_dev = skb->dev;

orig_dev should be set inside the loop, to reflect "previously crossed device", while following the 
path:

eth0 -> bond0 -> br0.

First step inside loop:

orig_dev = eth0
skb->dev = bond0 (at the end of the loop).

Second step inside loop:

orig_dev = bond0
skb->dev = br0 (at the end of the loop).

This would allow for exact match delivery to bond0 if someone binds there.

> +
>   	skb_reset_network_header(skb);
>   	skb_reset_transport_header(skb);
>   	skb->mac_len = skb->network_header - skb->mac_header;
> @@ -3138,12 +3141,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
>
>   	rcu_read_lock();
>
> -	if (!skb->skb_iif) {
> -		skb->skb_iif = skb->dev->ifindex;
> -		orig_dev = skb->dev;
> -	} else {
> -		orig_dev = dev_get_by_index_rcu(dev_net(skb->dev), skb->skb_iif);
> -	}

I like the fact that it removes the above part.

> +another_round:
> +	__this_cpu_inc(softnet_data.processed);
> +	dev = skb->dev;
>
>   #ifdef CONFIG_NET_CLS_ACT
>   	if (skb->tc_verd & TC_NCLS) {
> @@ -3153,7 +3153,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>   #endif
>
>   	list_for_each_entry_rcu(ptype, &ptype_all, list) {
> -		if (!ptype->dev || ptype->dev == skb->dev) {
> +		if (!ptype->dev || ptype->dev == dev) {
>   			if (pt_prev)
>   				ret = deliver_skb(skb, pt_prev, orig_dev);
>   			pt_prev = ptype;

Inside the loop, we should only do exact match delivery, for &ptype_all and for 
&ptype_base[ntohs(type) & PTYPE_HASH_MASK]:

         list_for_each_entry_rcu(ptype, &ptype_all, list) {
-               if (!ptype->dev || ptype->dev == dev) {
+               if (ptype->dev == dev) {
                         if (pt_prev)
                                 ret = deliver_skb(skb, pt_prev, orig_dev);
                         pt_prev = ptype;
                 }
         }


         list_for_each_entry_rcu(ptype,
                         &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
                 if (ptype->type == type &&
-                   (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
+                   (ptype->dev == skb->dev)) {
                         if (pt_prev)
                                 ret = deliver_skb(skb, pt_prev, orig_dev);
                         pt_prev = ptype;
                 }
         }

After leaving the loop, we can do wildcard delivery, if skb is not NULL.

         list_for_each_entry_rcu(ptype, &ptype_all, list) {
-               if (!ptype->dev || ptype->dev == dev) {
+               if (!ptype->dev) {
                         if (pt_prev)
                                 ret = deliver_skb(skb, pt_prev, orig_dev);
                         pt_prev = ptype;
                }
         }


         list_for_each_entry_rcu(ptype,
                         &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
-               if (ptype->type == type &&
-                   (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
+               if (ptype->type == type && !ptype->dev) {
                         if (pt_prev)
                                 ret = deliver_skb(skb, pt_prev, orig_dev);
                         pt_prev = ptype;
                 }
         }

This would reduce the number of tests inside the list_for_each_entry_rcu() loops. And because we 
match only ptype->dev == dev inside the loop and !ptype->dev outside the loop, this should avoid 
duplicate delivery.
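
A minimal user-space sketch of this split, assuming simplified stand-ins for
the kernel structures: the "upper" field, the "delivered" counter and the
receive() helper are hypothetical and exist only for illustration. Each round
serves handlers bound to the current device, and wildcard handlers are served
once after the last round.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for the kernel structures. */
struct net_device {
	const char *name;
	struct net_device *upper;	/* device an rx_handler would hand the skb to */
};

struct packet_type {
	struct net_device *dev;		/* NULL means wildcard */
	int delivered;			/* how many times this handler saw the packet */
};

/*
 * Walk the device stack starting at dev. Each round delivers only to
 * handlers bound to the current device; wildcard handlers are served
 * once, after the last round. Returns the number of rounds.
 */
static int receive(struct net_device *dev, struct packet_type *pt, size_t n)
{
	int rounds = 0;
	size_t i;

	assert(dev != NULL);

	while (dev) {
		for (i = 0; i < n; i++)
			if (pt[i].dev == dev)	/* exact match only */
				pt[i].delivered++;
		rounds++;
		dev = dev->upper;	/* stands in for rx_handler changing skb->dev */
	}

	for (i = 0; i < n; i++)
		if (!pt[i].dev)		/* wildcard only */
			pt[i].delivered++;

	return rounds;
}
```

With the path eth0 -> bond0 -> br0 and one handler per device plus one
wildcard handler, every handler is reached exactly once. Using the test
"!ptype->dev || ptype->dev == dev" in every round would instead hand the
packet to the wildcard handler once per round.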

Also, for performance reasons, exact match protocol handler lists might be moved from ptype_base or
ptype_all to a per net_device list. That way, the list_for_each_entry_rcu() inside the loop could be
empty if no protocol handler binds to the current dev.

Inside loop:

         list_for_each_entry_rcu(ptype, &dev->ptype_all, list) {
                 if (pt_prev)
                         ret = deliver_skb(skb, pt_prev, orig_dev);
                 pt_prev = ptype;
         }

         list_for_each_entry_rcu(ptype,
                         &dev->ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
                 if (ptype->type == type) {
                         if (pt_prev)
                                 ret = deliver_skb(skb, pt_prev, orig_dev);
                         pt_prev = ptype;
                 }
         }

Outside loop:

         list_for_each_entry_rcu(ptype, &ptype_all, list) {
                 if (pt_prev)
                         ret = deliver_skb(skb, pt_prev, orig_dev);
                 pt_prev = ptype;
         }


         list_for_each_entry_rcu(ptype,
                         &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
                 if (ptype->type == type) {
                         if (pt_prev)
                                 ret = deliver_skb(skb, pt_prev, orig_dev);
                         pt_prev = ptype;
                 }
         }

This would require several changes to ptype_all and ptype_base handling, but should be faster.
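
To illustrate why a per-device list could be cheaper, here is a small
user-space sketch. The "ptype_specific" field and the deliver_exact() helper
are assumptions made for this example, not existing kernel names. It counts
how many list nodes are walked for one round on a device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-device handler list; the field names are assumptions. */
struct packet_type {
	unsigned short type;
	struct packet_type *next;
};

struct net_device {
	struct packet_type *ptype_specific;	/* handlers bound to this device only */
};

/*
 * Count the handlers that would receive a packet of the given type on dev.
 * *examined reports how many list nodes were walked to find them.
 */
static int deliver_exact(const struct net_device *dev, unsigned short type,
			 int *examined)
{
	const struct packet_type *pt;
	int matches = 0;

	assert(dev != NULL && examined != NULL);
	*examined = 0;
	for (pt = dev->ptype_specific; pt; pt = pt->next) {
		(*examined)++;
		if (pt->type == type)
			matches++;
	}
	return matches;
}
```

A device with no bound handler costs zero iterations per round, whereas a
shared list has to be walked in full on every round just to compare
ptype->dev.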

> @@ -3167,7 +3167,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>   ncls:
>   #endif
>
> -	rx_handler = rcu_dereference(skb->dev->rx_handler);
> +	rx_handler = rcu_dereference(dev->rx_handler);
>   	if (rx_handler) {
>   		if (pt_prev) {
>   			ret = deliver_skb(skb, pt_prev, orig_dev);
> @@ -3176,6 +3176,8 @@ ncls:
>   		skb = rx_handler(skb);
>   		if (!skb)
>   			goto out;
> +		if (dev != skb->dev)

I would use "if (skb->dev != dev)" for clarity, because skb->dev is expected to have changed, not dev.

> +			goto another_round;
>   	}
>
>   	if (vlan_tx_tag_present(skb)) {
>

	Nicolas.


* [PATCH 1/1] Fix "(unregistered net_device): Features changed" message
From: Michał Mirosław @ 2011-02-19 12:46 UTC
  To: netdev; +Cc: Ben Hutchings, David Miller
In-Reply-To: <20110219122804.GA28326@rere.qmqm.pl>

Fix netdev_update_features() messages at register time by moving
the call further down in register_netdevice(). When
netdev->reg_state != NETREG_REGISTERED, netdev_name() returns
"(unregistered net_device)" even if the dev's name is already filled in.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
---
 net/core/dev.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 4f69439..5d8e13e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5478,8 +5478,6 @@ int register_netdevice(struct net_device *dev)
 	if (!(dev->wanted_features & NETIF_F_SG))
 		dev->wanted_features &= ~NETIF_F_GSO;
 
-	netdev_update_features(dev);
-
 	/* Enable GRO and NETIF_F_HIGHDMA for vlans by default,
 	 * vlan_dev_init() will do the dev->features check, so these features
 	 * are enabled only if supported by underlying device.
@@ -5496,6 +5494,8 @@ int register_netdevice(struct net_device *dev)
 		goto err_uninit;
 	dev->reg_state = NETREG_REGISTERED;
 
+	netdev_update_features(dev);
+
 	/*
 	 *	Default initial state at registry is that the
 	 *	device is present.
-- 
1.7.2.3



* Re: [PATCH v6 0/9] net: Unified offload configuration
From: Michał Mirosław @ 2011-02-19 12:28 UTC
  To: David Miller; +Cc: bhutchings, netdev
In-Reply-To: <20110218.120617.71104636.davem@davemloft.net>

On Fri, Feb 18, 2011 at 12:06:17PM -0800, David Miller wrote:
> From: Ben Hutchings <bhutchings@solarflare.com>
> Date: Fri, 18 Feb 2011 14:29:31 +0000
> 
> > On Fri, 2011-02-18 at 15:22 +0100, Michał Mirosław wrote:
> >> On Thu, Feb 17, 2011 at 02:56:11PM -0800, David Miller wrote:
> > [...]
> >> > Please get rid of that annoying message spit out by netif_features_change(),
> >> > it's just spam.  If we want notifications for stuff like this, use a
> >> > non-unicast netlink message so those who want to hear it can do so.
> >> You mean netdev_update_features() "Features changed" message? Is it ok
> >> to just demote it to DEBUG level or you want to remove it altogether?
> >> What about netdev_fix_features() messages?
> > I think you need to emit these messages at 'error' severity when fixing
> > up features for a newly-added device, but at 'debug' later on.
> I get one several minutes after every boot for a completely
> unregistered device for some reason:
> 
> [119704.730965] (unregistered net_device): Features changed: 0x00011065 -> 0x00015065

Hmm. That's because netdev_update_features() gets called before changing
netdev->reg_state. I wonder if moving the feature update just after
"dev->reg_state = NETREG_REGISTERED;" would be the correct fix.

Best Regards,
Michał Mirosław
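
The ordering problem described in this message can be modelled in user space
as follows. This is a simplified sketch: the structure keeps only the two
fields involved, and the enum lists only the two states needed here.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; only the states and fields used in this sketch. */
enum netreg_state { NETREG_UNINITIALIZED = 0, NETREG_REGISTERED };

struct net_device {
	char name[16];
	enum netreg_state reg_state;
};

/*
 * Name used as the log prefix: the real name is hidden until the
 * device has reached NETREG_REGISTERED, even if dev->name is set.
 */
static const char *netdev_name(const struct net_device *dev)
{
	assert(dev != NULL);
	if (dev->reg_state != NETREG_REGISTERED)
		return "(unregistered net_device)";
	return dev->name;
}
```

Any message printed before reg_state is updated therefore carries the
placeholder prefix, which is why issuing the feature update after the
assignment would make the log show the real device name.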


* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Jiri Pirko @ 2011-02-19 11:28 UTC
  To: Nicolas de Pesloüan
  Cc: Jay Vosburgh, David Miller, kaber, eric.dumazet, netdev,
	shemminger, andy
In-Reply-To: <20110219110830.GD2782@psychotron.redhat.com>

Sat, Feb 19, 2011 at 12:08:31PM CET, jpirko@redhat.com wrote:
>Sat, Feb 19, 2011 at 11:56:23AM CET, nicolas.2p.debian@gmail.com wrote:
>>Le 19/02/2011 09:05, Jiri Pirko a écrit :
>>>This patch converts bonding to use rx_handler. Results in cleaner
>>>__netif_receive_skb() with much less exceptions needed. Also
>>>bond-specific work is moved into bond code.
>>>
>>>Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>>>
>>>v1->v2:
>>>         using skb_iif instead of new input_dev to remember original
>>>	device
>>>v2->v3:
>>>	set orig_dev = skb->dev if skb_iif is set
>>>
>>
>>Why do we need to let the rx_handlers call netif_rx() or __netif_receive_skb()?
>>
>>Bonding used to be handled with very little overhead, simply replacing
>>skb->dev with skb->dev->master. Time has passed and we eventually
>>added a lot of special processing for bonding into __netif_receive_skb(),
>>but the overhead remained very light.
>>
>>Calling netif_rx() (or __netif_receive_skb()) to allow nesting would probably lead to some overhead.
>>
>>Can't we, instead, loop inside __netif_receive_skb(), and deliver
>>whatever needs to be delivered, to whoever needs it, inside the loop?
>>
>>rx_handler = rcu_dereference(skb->dev->rx_handler);
>>while (rx_handler) {
>>	/* ...  */
>>	orig_dev = skb->dev;
>>	skb = rx_handler(skb);
>>	/* ... */
>>	rx_handler = (skb->dev != orig_dev) ? rcu_dereference(skb->dev->rx_handler) : NULL;
>>}
>>
>>This would reduce the overhead, while still allowing nesting: vlan on
>>top of bonding, bridge on top of bonding, ...
>
>I see your point. Makes sense to me. But the loop would have to include
>at least processing of ptype_all too. I'm going to cook a follow-up
>patch.
>

DRAFT (doesn't modify rx_handlers):

diff --git a/net/core/dev.c b/net/core/dev.c
index 4ebf7fe..e5dba47 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3115,6 +3115,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
 {
 	struct packet_type *ptype, *pt_prev;
 	rx_handler_func_t *rx_handler;
+	struct net_device *dev;
 	struct net_device *orig_dev;
 	struct net_device *null_or_dev;
 	int ret = NET_RX_DROP;
@@ -3129,7 +3130,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
 	if (netpoll_receive_skb(skb))
 		return NET_RX_DROP;
 
-	__this_cpu_inc(softnet_data.processed);
+	skb->skb_iif = skb->dev->ifindex;
+	orig_dev = skb->dev;
+
 	skb_reset_network_header(skb);
 	skb_reset_transport_header(skb);
 	skb->mac_len = skb->network_header - skb->mac_header;
@@ -3138,12 +3141,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
 
 	rcu_read_lock();
 
-	if (!skb->skb_iif) {
-		skb->skb_iif = skb->dev->ifindex;
-		orig_dev = skb->dev;
-	} else {
-		orig_dev = dev_get_by_index_rcu(dev_net(skb->dev), skb->skb_iif);
-	}
+another_round:
+	__this_cpu_inc(softnet_data.processed);
+	dev = skb->dev;
 
 #ifdef CONFIG_NET_CLS_ACT
 	if (skb->tc_verd & TC_NCLS) {
@@ -3153,7 +3153,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
 #endif
 
 	list_for_each_entry_rcu(ptype, &ptype_all, list) {
-		if (!ptype->dev || ptype->dev == skb->dev) {
+		if (!ptype->dev || ptype->dev == dev) {
 			if (pt_prev)
 				ret = deliver_skb(skb, pt_prev, orig_dev);
 			pt_prev = ptype;
@@ -3167,7 +3167,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
 ncls:
 #endif
 
-	rx_handler = rcu_dereference(skb->dev->rx_handler);
+	rx_handler = rcu_dereference(dev->rx_handler);
 	if (rx_handler) {
 		if (pt_prev) {
 			ret = deliver_skb(skb, pt_prev, orig_dev);
@@ -3176,6 +3176,8 @@ ncls:
 		skb = rx_handler(skb);
 		if (!skb)
 			goto out;
+		if (dev != skb->dev)
+			goto another_round;
 	}
 
 	if (vlan_tx_tag_present(skb)) {


* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Jiri Pirko @ 2011-02-19 11:08 UTC
  To: Nicolas de Pesloüan
  Cc: Jay Vosburgh, David Miller, kaber, eric.dumazet, netdev,
	shemminger, andy
In-Reply-To: <4D5FA1D7.4050801@gmail.com>

Sat, Feb 19, 2011 at 11:56:23AM CET, nicolas.2p.debian@gmail.com wrote:
>Le 19/02/2011 09:05, Jiri Pirko a écrit :
>>This patch converts bonding to use rx_handler. Results in cleaner
>>__netif_receive_skb() with much less exceptions needed. Also
>>bond-specific work is moved into bond code.
>>
>>Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>>
>>v1->v2:
>>         using skb_iif instead of new input_dev to remember original
>>	device
>>v2->v3:
>>	set orig_dev = skb->dev if skb_iif is set
>>
>
>Why do we need to let the rx_handlers call netif_rx() or __netif_receive_skb()?
>
>Bonding used to be handled with very little overhead, simply replacing
>skb->dev with skb->dev->master. Time has passed and we eventually
>added a lot of special processing for bonding into __netif_receive_skb(),
>but the overhead remained very light.
>
>Calling netif_rx() (or __netif_receive_skb()) to allow nesting would probably lead to some overhead.
>
>Can't we, instead, loop inside __netif_receive_skb(), and deliver
>whatever needs to be delivered, to whoever needs it, inside the loop?
>
>rx_handler = rcu_dereference(skb->dev->rx_handler);
>while (rx_handler) {
>	/* ...  */
>	orig_dev = skb->dev;
>	skb = rx_handler(skb);
>	/* ... */
>	rx_handler = (skb->dev != orig_dev) ? rcu_dereference(skb->dev->rx_handler) : NULL;
>}
>
>This would reduce the overhead, while still allowing nesting: vlan on
>top of bonding, bridge on top of bonding, ...

I see your point. Makes sense to me. But the loop would have to include
at least processing of ptype_all too. I'm going to cook a follow-up
patch.

>
>That way, we can probably keep the list of crossed devices inside a
>local array, and call deliver_skb() with the current "orig_dev" when
>appropriate. No need to overload sk_buff nor to use a global
>variable.
>
>Of course, this might be a very simplistic view.
>
>Any comments?
>
>	Nicolas.


* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Nicolas de Pesloüan @ 2011-02-19 10:56 UTC
  To: Jiri Pirko
  Cc: Jay Vosburgh, David Miller, kaber, eric.dumazet, netdev,
	shemminger, andy
In-Reply-To: <20110219080523.GB2782@psychotron.redhat.com>

Le 19/02/2011 09:05, Jiri Pirko a écrit :
> This patch converts bonding to use rx_handler. Results in cleaner
> __netif_receive_skb() with much less exceptions needed. Also
> bond-specific work is moved into bond code.
>
> Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>
> v1->v2:
>          using skb_iif instead of new input_dev to remember original
> 	device
> v2->v3:
> 	set orig_dev = skb->dev if skb_iif is set
>

Why do we need to let the rx_handlers call netif_rx() or __netif_receive_skb()?

Bonding used to be handled with very little overhead, simply replacing skb->dev with skb->dev->master.
Time has passed and we eventually added a lot of special processing for bonding into
__netif_receive_skb(), but the overhead remained very light.

Calling netif_rx() (or __netif_receive_skb()) to allow nesting would probably lead to some overhead.

Can't we, instead, loop inside __netif_receive_skb(), and deliver whatever needs to be delivered, to
whoever needs it, inside the loop?

rx_handler = rcu_dereference(skb->dev->rx_handler);
while (rx_handler) {
	/* ...  */
	orig_dev = skb->dev;
	skb = rx_handler(skb);
	/* ... */
	rx_handler = (skb->dev != orig_dev) ? rcu_dereference(skb->dev->rx_handler) : NULL;
}

This would reduce the overhead, while still allowing nesting: vlan on top of bonding, bridge on top
of bonding, ...

That way, we can probably keep the list of crossed devices inside a local array, and call 
deliver_skb() with the current "orig_dev" when appropriate. No need to overload sk_buff nor to use a 
global variable.
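
As a rough user-space sketch of the idea (not kernel code): the structures
below are cut down to the fields the loop touches, and MAX_NEST, to_master()
and receive_loop() are hypothetical names introduced for this example. The
loop keeps calling rx_handlers while skb->dev changes and records each device
the skb has left in a local array.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NEST 8	/* arbitrary nesting bound chosen for this sketch */

struct net_device;
struct sk_buff { struct net_device *dev; };

typedef struct sk_buff *(*rx_handler_func_t)(struct sk_buff *skb);

struct net_device {
	const char *name;
	rx_handler_func_t rx_handler;
	struct net_device *master;	/* upper device, if any */
};

/* Example handler: re-target the skb to the upper device, as bonding does. */
static struct sk_buff *to_master(struct sk_buff *skb)
{
	if (skb->dev->master)
		skb->dev = skb->dev->master;
	return skb;
}

/*
 * Run rx_handlers in a loop instead of re-entering the receive path.
 * crossed[] records each device the skb left; returns how many were recorded.
 */
static size_t receive_loop(struct sk_buff *skb,
			   struct net_device *crossed[MAX_NEST])
{
	rx_handler_func_t rx_handler;
	struct net_device *orig_dev;
	size_t n = 0;

	assert(skb != NULL && skb->dev != NULL);
	rx_handler = skb->dev->rx_handler;
	while (rx_handler && n < MAX_NEST) {
		orig_dev = skb->dev;		/* previously crossed device */
		skb = rx_handler(skb);
		if (!skb)
			break;			/* handler consumed the packet */
		if (skb->dev == orig_dev)
			break;			/* device unchanged: stop looping */
		crossed[n++] = orig_dev;
		rx_handler = skb->dev->rx_handler;
	}
	return n;
}
```

For eth0 enslaved to bond0, itself attached to br0, the loop runs two
handlers and leaves skb->dev pointing at br0, with eth0 and bond0 recorded in
the array in that order.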

Of course, this might be a very simplistic view.

Any comments?

	Nicolas.


* [PATCH] ipvs: use hlist instead of list
From: Changli Gao @ 2011-02-19 10:05 UTC
  To: Simon Horman
  Cc: David S. Miller, Patrick McHardy, Wensong Zhang, Julian Anastasov,
	netdev, lvs-devel, netfilter-devel, Changli Gao

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
---
 include/net/ip_vs.h             |    2 -
 net/netfilter/ipvs/ip_vs_conn.c |   52 ++++++++++++++++++++++------------------
 2 files changed, 30 insertions(+), 24 deletions(-)
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index e80ffb7..2078a47 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -494,7 +494,7 @@ struct ip_vs_conn_param {
  *	IP_VS structure allocated for each dynamically scheduled connection
  */
 struct ip_vs_conn {
-	struct list_head        c_list;         /* hashed list heads */
+	struct hlist_node	c_list;         /* hashed list heads */
 #ifdef CONFIG_NET_NS
 	struct net              *net;           /* Name space */
 #endif
diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index 83233fe..9c2a517 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -59,7 +59,7 @@ static int ip_vs_conn_tab_mask __read_mostly;
 /*
  *  Connection hash table: for input and output packets lookups of IPVS
  */
-static struct list_head *ip_vs_conn_tab __read_mostly;
+static struct hlist_head *ip_vs_conn_tab __read_mostly;
 
 /*  SLAB cache for IPVS connections */
 static struct kmem_cache *ip_vs_conn_cachep __read_mostly;
@@ -201,7 +201,7 @@ static inline int ip_vs_conn_hash(struct ip_vs_conn *cp)
 	spin_lock(&cp->lock);
 
 	if (!(cp->flags & IP_VS_CONN_F_HASHED)) {
-		list_add(&cp->c_list, &ip_vs_conn_tab[hash]);
+		hlist_add_head(&cp->c_list, &ip_vs_conn_tab[hash]);
 		cp->flags |= IP_VS_CONN_F_HASHED;
 		atomic_inc(&cp->refcnt);
 		ret = 1;
@@ -234,7 +234,7 @@ static inline int ip_vs_conn_unhash(struct ip_vs_conn *cp)
 	spin_lock(&cp->lock);
 
 	if (cp->flags & IP_VS_CONN_F_HASHED) {
-		list_del(&cp->c_list);
+		hlist_del(&cp->c_list);
 		cp->flags &= ~IP_VS_CONN_F_HASHED;
 		atomic_dec(&cp->refcnt);
 		ret = 1;
@@ -259,12 +259,13 @@ __ip_vs_conn_in_get(const struct ip_vs_conn_param *p)
 {
 	unsigned hash;
 	struct ip_vs_conn *cp;
+	struct hlist_node *n;
 
 	hash = ip_vs_conn_hashkey_param(p, false);
 
 	ct_read_lock(hash);
 
-	list_for_each_entry(cp, &ip_vs_conn_tab[hash], c_list) {
+	hlist_for_each_entry(cp, n, &ip_vs_conn_tab[hash], c_list) {
 		if (cp->af == p->af &&
 		    p->cport == cp->cport && p->vport == cp->vport &&
 		    ip_vs_addr_equal(p->af, p->caddr, &cp->caddr) &&
@@ -345,12 +346,13 @@ struct ip_vs_conn *ip_vs_ct_in_get(const struct ip_vs_conn_param *p)
 {
 	unsigned hash;
 	struct ip_vs_conn *cp;
+	struct hlist_node *n;
 
 	hash = ip_vs_conn_hashkey_param(p, false);
 
 	ct_read_lock(hash);
 
-	list_for_each_entry(cp, &ip_vs_conn_tab[hash], c_list) {
+	hlist_for_each_entry(cp, n, &ip_vs_conn_tab[hash], c_list) {
 		if (!ip_vs_conn_net_eq(cp, p->net))
 			continue;
 		if (p->pe_data && p->pe->ct_match) {
@@ -394,6 +396,7 @@ struct ip_vs_conn *ip_vs_conn_out_get(const struct ip_vs_conn_param *p)
 {
 	unsigned hash;
 	struct ip_vs_conn *cp, *ret=NULL;
+	struct hlist_node *n;
 
 	/*
 	 *	Check for "full" addressed entries
@@ -402,7 +405,7 @@ struct ip_vs_conn *ip_vs_conn_out_get(const struct ip_vs_conn_param *p)
 
 	ct_read_lock(hash);
 
-	list_for_each_entry(cp, &ip_vs_conn_tab[hash], c_list) {
+	hlist_for_each_entry(cp, n, &ip_vs_conn_tab[hash], c_list) {
 		if (cp->af == p->af &&
 		    p->vport == cp->cport && p->cport == cp->dport &&
 		    ip_vs_addr_equal(p->af, p->vaddr, &cp->caddr) &&
@@ -818,7 +821,7 @@ ip_vs_conn_new(const struct ip_vs_conn_param *p,
 		return NULL;
 	}
 
-	INIT_LIST_HEAD(&cp->c_list);
+	INIT_HLIST_NODE(&cp->c_list);
 	setup_timer(&cp->timer, ip_vs_conn_expire, (unsigned long)cp);
 	ip_vs_conn_net_set(cp, p->net);
 	cp->af		   = p->af;
@@ -894,8 +897,8 @@ ip_vs_conn_new(const struct ip_vs_conn_param *p,
  */
 #ifdef CONFIG_PROC_FS
 struct ip_vs_iter_state {
-	struct seq_net_private p;
-	struct list_head *l;
+	struct seq_net_private	p;
+	struct hlist_head	*l;
 };
 
 static void *ip_vs_conn_array(struct seq_file *seq, loff_t pos)
@@ -903,13 +906,14 @@ static void *ip_vs_conn_array(struct seq_file *seq, loff_t pos)
 	int idx;
 	struct ip_vs_conn *cp;
 	struct ip_vs_iter_state *iter = seq->private;
+	struct hlist_node *n;
 
 	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
 		ct_read_lock_bh(idx);
-		list_for_each_entry(cp, &ip_vs_conn_tab[idx], c_list) {
+		hlist_for_each_entry(cp, n, &ip_vs_conn_tab[idx], c_list) {
 			if (pos-- == 0) {
 				iter->l = &ip_vs_conn_tab[idx];
-			return cp;
+				return cp;
 			}
 		}
 		ct_read_unlock_bh(idx);
@@ -930,7 +934,8 @@ static void *ip_vs_conn_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 {
 	struct ip_vs_conn *cp = v;
 	struct ip_vs_iter_state *iter = seq->private;
-	struct list_head *e, *l = iter->l;
+	struct hlist_node *e;
+	struct hlist_head *l = iter->l;
 	int idx;
 
 	++*pos;
@@ -938,15 +943,15 @@ static void *ip_vs_conn_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 		return ip_vs_conn_array(seq, 0);
 
 	/* more on same hash chain? */
-	if ((e = cp->c_list.next) != l)
-		return list_entry(e, struct ip_vs_conn, c_list);
+	if ((e = cp->c_list.next))
+		return hlist_entry(e, struct ip_vs_conn, c_list);
 
 	idx = l - ip_vs_conn_tab;
 	ct_read_unlock_bh(idx);
 
 	while (++idx < ip_vs_conn_tab_size) {
 		ct_read_lock_bh(idx);
-		list_for_each_entry(cp, &ip_vs_conn_tab[idx], c_list) {
+		hlist_for_each_entry(cp, e, &ip_vs_conn_tab[idx], c_list) {
 			iter->l = &ip_vs_conn_tab[idx];
 			return cp;
 		}
@@ -959,7 +964,7 @@ static void *ip_vs_conn_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 static void ip_vs_conn_seq_stop(struct seq_file *seq, void *v)
 {
 	struct ip_vs_iter_state *iter = seq->private;
-	struct list_head *l = iter->l;
+	struct hlist_head *l = iter->l;
 
 	if (l)
 		ct_read_unlock_bh(l - ip_vs_conn_tab);
@@ -1148,13 +1153,14 @@ void ip_vs_random_dropentry(struct net *net)
 	 */
 	for (idx = 0; idx < (ip_vs_conn_tab_size>>5); idx++) {
 		unsigned hash = net_random() & ip_vs_conn_tab_mask;
+		struct hlist_node *n;
 
 		/*
 		 *  Lock is actually needed in this loop.
 		 */
 		ct_write_lock_bh(hash);
 
-		list_for_each_entry(cp, &ip_vs_conn_tab[hash], c_list) {
+		hlist_for_each_entry(cp, n, &ip_vs_conn_tab[hash], c_list) {
 			if (cp->flags & IP_VS_CONN_F_TEMPLATE)
 				/* connection template */
 				continue;
@@ -1202,12 +1208,14 @@ static void ip_vs_conn_flush(struct net *net)
 
 flush_again:
 	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
+		struct hlist_node *n;
+
 		/*
 		 *  Lock is actually needed in this loop.
 		 */
 		ct_write_lock_bh(idx);
 
-		list_for_each_entry(cp, &ip_vs_conn_tab[idx], c_list) {
+		hlist_for_each_entry(cp, n, &ip_vs_conn_tab[idx], c_list) {
 			if (!ip_vs_conn_net_eq(cp, net))
 				continue;
 			IP_VS_DBG(4, "del connection\n");
@@ -1265,8 +1273,7 @@ int __init ip_vs_conn_init(void)
 	/*
 	 * Allocate the connection hash table and initialize its list heads
 	 */
-	ip_vs_conn_tab = vmalloc(ip_vs_conn_tab_size *
-				 sizeof(struct list_head));
+	ip_vs_conn_tab = vmalloc(ip_vs_conn_tab_size * sizeof(*ip_vs_conn_tab));
 	if (!ip_vs_conn_tab)
 		return -ENOMEM;
 
@@ -1286,9 +1293,8 @@ int __init ip_vs_conn_init(void)
 	IP_VS_DBG(0, "Each connection entry needs %Zd bytes at least\n",
 		  sizeof(struct ip_vs_conn));
 
-	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
-		INIT_LIST_HEAD(&ip_vs_conn_tab[idx]);
-	}
+	for (idx = 0; idx < ip_vs_conn_tab_size; idx++)
+		INIT_HLIST_HEAD(&ip_vs_conn_tab[idx]);
 
 	for (idx = 0; idx < CT_LOCKARRAY_SIZE; idx++)  {
 		rwlock_init(&__ip_vs_conntbl_lock_array[idx].l);


* [PATCH v2] ipvs: unify the formula to estimate the overhead of processing connections
From: Changli Gao @ 2011-02-19  9:32 UTC
  To: Simon Horman
  Cc: David S. Miller, Patrick McHardy, Wensong Zhang, Julian Anastasov,
	netdev, lvs-devel, netfilter-devel, Changli Gao

lc and wlc use the same formula (activeconns * 256 + inactconns), but lblc
and lblcr use another one (activeconns * 50 + inactconns). There is no
reason to use two different formulas for the lc variants.

With this patch, all the lc variants use the formula used by lc.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
---
v2: use ip_vs_dest_conn_overhead() instead.
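
For readers who want to try the arithmetic outside the kernel, a small
user-space sketch follows. It assumes plain integers in place of atomic_t,
and the struct and helper names (dest, conn_overhead, less_loaded) are
hypothetical. The comparison uses cross-multiplication, as the schedulers do,
so no division is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Plain-integer stand-in for struct ip_vs_dest; the kernel uses atomic_t. */
struct dest {
	unsigned int activeconns;
	unsigned int inactconns;
	unsigned int weight;
};

/* activeconns * 256 + inactconns, written as a shift as in the patch. */
static unsigned int conn_overhead(const struct dest *d)
{
	assert(d != NULL);
	return (d->activeconns << 8) + d->inactconns;
}

/*
 * Return nonzero if a is less loaded than b, i.e.
 * overhead(a) / a->weight < overhead(b) / b->weight, rewritten as
 * overhead(a) * b->weight < overhead(b) * a->weight.
 * Both weights are assumed to be positive.
 */
static int less_loaded(const struct dest *a, const struct dest *b)
{
	return (unsigned long long)conn_overhead(a) * b->weight <
	       (unsigned long long)conn_overhead(b) * a->weight;
}
```

For example, a destination with 2 active and 10 inactive connections has an
overhead of 2 * 256 + 10 = 522, and one with 3 active connections has 768.
If the second has weight 2 and the first weight 1, the second is the less
loaded of the two (768 * 1 < 522 * 2).
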
 include/net/ip_vs.h              |   14 ++++++++++++++
 net/netfilter/ipvs/ip_vs_lblc.c  |   13 +++----------
 net/netfilter/ipvs/ip_vs_lblcr.c |   25 +++++++------------------
 net/netfilter/ipvs/ip_vs_lc.c    |   18 +-----------------
 net/netfilter/ipvs/ip_vs_wlc.c   |   20 ++------------------
 5 files changed, 27 insertions(+), 63 deletions(-)
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 5d75fea..e80ffb7 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -1241,6 +1241,20 @@ static inline void ip_vs_conn_drop_conntrack(struct ip_vs_conn *cp)
 /* CONFIG_IP_VS_NFCT */
 #endif
 
+static inline unsigned int
+ip_vs_dest_conn_overhead(struct ip_vs_dest *dest)
+{
+	/*
+	 * We think the overhead of processing active connections is 256
+	 * times higher than that of inactive connections in average. (This
+	 * 256 times might not be accurate, we will change it later) We
+	 * use the following formula to estimate the overhead now:
+	 *		  dest->activeconns*256 + dest->inactconns
+	 */
+	return (atomic_read(&dest->activeconns) << 8) +
+		atomic_read(&dest->inactconns);
+}
+
 #endif /* __KERNEL__ */
 
 #endif	/* _NET_IP_VS_H */
diff --git a/net/netfilter/ipvs/ip_vs_lblc.c b/net/netfilter/ipvs/ip_vs_lblc.c
index 00b5ffa..58ae403 100644
--- a/net/netfilter/ipvs/ip_vs_lblc.c
+++ b/net/netfilter/ipvs/ip_vs_lblc.c
@@ -389,12 +389,7 @@ __ip_vs_lblc_schedule(struct ip_vs_service *svc)
 	int loh, doh;
 
 	/*
-	 * We think the overhead of processing active connections is fifty
-	 * times higher than that of inactive connections in average. (This
-	 * fifty times might not be accurate, we will change it later.) We
-	 * use the following formula to estimate the overhead:
-	 *                dest->activeconns*50 + dest->inactconns
-	 * and the load:
+	 * We use the following formula to estimate the load:
 	 *                (dest overhead) / dest->weight
 	 *
 	 * Remember -- no floats in kernel mode!!!
@@ -410,8 +405,7 @@ __ip_vs_lblc_schedule(struct ip_vs_service *svc)
 			continue;
 		if (atomic_read(&dest->weight) > 0) {
 			least = dest;
-			loh = atomic_read(&least->activeconns) * 50
-				+ atomic_read(&least->inactconns);
+			loh = ip_vs_dest_conn_overhead(least);
 			goto nextstage;
 		}
 	}
@@ -425,8 +419,7 @@ __ip_vs_lblc_schedule(struct ip_vs_service *svc)
 		if (dest->flags & IP_VS_DEST_F_OVERLOAD)
 			continue;
 
-		doh = atomic_read(&dest->activeconns) * 50
-			+ atomic_read(&dest->inactconns);
+		doh = ip_vs_dest_conn_overhead(dest);
 		if (loh * atomic_read(&dest->weight) >
 		    doh * atomic_read(&least->weight)) {
 			least = dest;
diff --git a/net/netfilter/ipvs/ip_vs_lblcr.c b/net/netfilter/ipvs/ip_vs_lblcr.c
index bfa25f1..2ddefe8 100644
--- a/net/netfilter/ipvs/ip_vs_lblcr.c
+++ b/net/netfilter/ipvs/ip_vs_lblcr.c
@@ -178,8 +178,7 @@ static inline struct ip_vs_dest *ip_vs_dest_set_min(struct ip_vs_dest_set *set)
 
 		if ((atomic_read(&least->weight) > 0)
 		    && (least->flags & IP_VS_DEST_F_AVAILABLE)) {
-			loh = atomic_read(&least->activeconns) * 50
-				+ atomic_read(&least->inactconns);
+			loh = ip_vs_dest_conn_overhead(least);
 			goto nextstage;
 		}
 	}
@@ -192,8 +191,7 @@ static inline struct ip_vs_dest *ip_vs_dest_set_min(struct ip_vs_dest_set *set)
 		if (dest->flags & IP_VS_DEST_F_OVERLOAD)
 			continue;
 
-		doh = atomic_read(&dest->activeconns) * 50
-			+ atomic_read(&dest->inactconns);
+		doh = ip_vs_dest_conn_overhead(dest);
 		if ((loh * atomic_read(&dest->weight) >
 		     doh * atomic_read(&least->weight))
 		    && (dest->flags & IP_VS_DEST_F_AVAILABLE)) {
@@ -228,8 +226,7 @@ static inline struct ip_vs_dest *ip_vs_dest_set_max(struct ip_vs_dest_set *set)
 	list_for_each_entry(e, &set->list, list) {
 		most = e->dest;
 		if (atomic_read(&most->weight) > 0) {
-			moh = atomic_read(&most->activeconns) * 50
-				+ atomic_read(&most->inactconns);
+			moh = ip_vs_dest_conn_overhead(most);
 			goto nextstage;
 		}
 	}
@@ -239,8 +236,7 @@ static inline struct ip_vs_dest *ip_vs_dest_set_max(struct ip_vs_dest_set *set)
   nextstage:
 	list_for_each_entry(e, &set->list, list) {
 		dest = e->dest;
-		doh = atomic_read(&dest->activeconns) * 50
-			+ atomic_read(&dest->inactconns);
+		doh = ip_vs_dest_conn_overhead(dest);
 		/* moh/mw < doh/dw ==> moh*dw < doh*mw, where mw,dw>0 */
 		if ((moh * atomic_read(&dest->weight) <
 		     doh * atomic_read(&most->weight))
@@ -563,12 +559,7 @@ __ip_vs_lblcr_schedule(struct ip_vs_service *svc)
 	int loh, doh;
 
 	/*
-	 * We think the overhead of processing active connections is fifty
-	 * times higher than that of inactive connections in average. (This
-	 * fifty times might not be accurate, we will change it later.) We
-	 * use the following formula to estimate the overhead:
-	 *                dest->activeconns*50 + dest->inactconns
-	 * and the load:
+	 * We use the following formula to estimate the load:
 	 *                (dest overhead) / dest->weight
 	 *
 	 * Remember -- no floats in kernel mode!!!
@@ -585,8 +576,7 @@ __ip_vs_lblcr_schedule(struct ip_vs_service *svc)
 
 		if (atomic_read(&dest->weight) > 0) {
 			least = dest;
-			loh = atomic_read(&least->activeconns) * 50
-				+ atomic_read(&least->inactconns);
+			loh = ip_vs_dest_conn_overhead(least);
 			goto nextstage;
 		}
 	}
@@ -600,8 +590,7 @@ __ip_vs_lblcr_schedule(struct ip_vs_service *svc)
 		if (dest->flags & IP_VS_DEST_F_OVERLOAD)
 			continue;
 
-		doh = atomic_read(&dest->activeconns) * 50
-			+ atomic_read(&dest->inactconns);
+		doh = ip_vs_dest_conn_overhead(dest);
 		if (loh * atomic_read(&dest->weight) >
 		    doh * atomic_read(&least->weight)) {
 			least = dest;
diff --git a/net/netfilter/ipvs/ip_vs_lc.c b/net/netfilter/ipvs/ip_vs_lc.c
index 4f69db1..160cb80 100644
--- a/net/netfilter/ipvs/ip_vs_lc.c
+++ b/net/netfilter/ipvs/ip_vs_lc.c
@@ -22,22 +22,6 @@
 
 #include <net/ip_vs.h>
 
-
-static inline unsigned int
-ip_vs_lc_dest_overhead(struct ip_vs_dest *dest)
-{
-	/*
-	 * We think the overhead of processing active connections is 256
-	 * times higher than that of inactive connections in average. (This
-	 * 256 times might not be accurate, we will change it later) We
-	 * use the following formula to estimate the overhead now:
-	 *		  dest->activeconns*256 + dest->inactconns
-	 */
-	return (atomic_read(&dest->activeconns) << 8) +
-		atomic_read(&dest->inactconns);
-}
-
-
 /*
  *	Least Connection scheduling
  */
@@ -62,7 +46,7 @@ ip_vs_lc_schedule(struct ip_vs_service *svc, const struct sk_buff *skb)
 		if ((dest->flags & IP_VS_DEST_F_OVERLOAD) ||
 		    atomic_read(&dest->weight) == 0)
 			continue;
-		doh = ip_vs_lc_dest_overhead(dest);
+		doh = ip_vs_dest_conn_overhead(dest);
 		if (!least || doh < loh) {
 			least = dest;
 			loh = doh;
diff --git a/net/netfilter/ipvs/ip_vs_wlc.c b/net/netfilter/ipvs/ip_vs_wlc.c
index bbddfdb..db751f5 100644
--- a/net/netfilter/ipvs/ip_vs_wlc.c
+++ b/net/netfilter/ipvs/ip_vs_wlc.c
@@ -27,22 +27,6 @@
 
 #include <net/ip_vs.h>
 
-
-static inline unsigned int
-ip_vs_wlc_dest_overhead(struct ip_vs_dest *dest)
-{
-	/*
-	 * We think the overhead of processing active connections is 256
-	 * times higher than that of inactive connections in average. (This
-	 * 256 times might not be accurate, we will change it later) We
-	 * use the following formula to estimate the overhead now:
-	 *		  dest->activeconns*256 + dest->inactconns
-	 */
-	return (atomic_read(&dest->activeconns) << 8) +
-		atomic_read(&dest->inactconns);
-}
-
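
As an illustration of the float-free comparison the schedulers above rely on, here is a minimal userspace sketch. It assumes the shared ip_vs_dest_conn_overhead() helper keeps the activeconns*256 + inactconns weighting of the removed lc/wlc helpers. struct dest_model and both function names are hypothetical stand-ins, not kernel API.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct ip_vs_dest:
 * plain ints instead of atomic_t, only the fields used here.
 */
struct dest_model {
	int activeconns;
	int inactconns;
	int weight;
};

/* Assumed weighting: an active connection costs 256 inactive ones. */
static unsigned int dest_conn_overhead_model(const struct dest_model *d)
{
	return ((unsigned int)d->activeconns << 8) +
		(unsigned int)d->inactconns;
}

/* Returns true when cand carries less load than least.
 * load = overhead / weight, so
 *   loh/lw > doh/dw  <==>  loh*dw > doh*lw   (lw, dw > 0)
 * which needs no division and no floats.
 */
static bool dest_is_less_loaded(const struct dest_model *least,
				const struct dest_model *cand)
{
	unsigned long long loh = dest_conn_overhead_model(least);
	unsigned long long doh = dest_conn_overhead_model(cand);

	return loh * (unsigned long long)cand->weight >
	       doh * (unsigned long long)least->weight;
}
```

With these definitions, a destination of weight 2 holding 15 active connections counts as less loaded than one of weight 1 holding 10, because 2560*2 > 3840*1.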

^ permalink raw reply related

* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Eric Dumazet @ 2011-02-19  9:22 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jay Vosburgh, David Miller, kaber, netdev, shemminger,
	nicolas.2p.debian, andy
In-Reply-To: <20110219085816.GC2782@psychotron.redhat.com>

On Saturday, 19 February 2011 at 09:58 +0100, Jiri Pirko wrote:
> Sat, Feb 19, 2011 at 09:37:55AM CET, eric.dumazet@gmail.com wrote:
> >On Saturday, 19 February 2011 at 09:05 +0100, Jiri Pirko wrote:
> >> This patch converts bonding to use rx_handler. This results in a cleaner
> >> __netif_receive_skb() with far fewer exceptions needed. Also,
> >> bond-specific work is moved into the bond code.
> >> 
> >> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
> >> 
> >> v1->v2:
> >>         using skb_iif instead of new input_dev to remember original
> >> 	device
> >> v2->v3:
> >> 	set orig_dev = skb->dev if skb_iif is set
> >> 
> >
> >Seems much better ;)
> >
> >Do you have some performance numbers ?
> 
> I don't. I can surely obtain some. What's the best way to measure this?
> 

Hmm, since it's the receive path:

Two machines: one sending a flood (pktgen), one receiving it. Check/count
how many frames hit the destination, before and after the patch.




^ permalink raw reply

* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Jiri Pirko @ 2011-02-19  8:58 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jay Vosburgh, David Miller, kaber, netdev, shemminger,
	nicolas.2p.debian, andy
In-Reply-To: <1298104675.8559.22.camel@edumazet-laptop>

Sat, Feb 19, 2011 at 09:37:55AM CET, eric.dumazet@gmail.com wrote:
>On Saturday, 19 February 2011 at 09:05 +0100, Jiri Pirko wrote:
>> This patch converts bonding to use rx_handler. This results in a cleaner
>> __netif_receive_skb() with far fewer exceptions needed. Also,
>> bond-specific work is moved into the bond code.
>> 
>> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
>> 
>> v1->v2:
>>         using skb_iif instead of new input_dev to remember original
>> 	device
>> v2->v3:
>> 	set orig_dev = skb->dev if skb_iif is set
>> 
>
>Seems much better ;)
>
>Do you have some performance numbers ?

I don't. I can surely obtain some. What's the best way to measure this?

>
>
>

^ permalink raw reply

* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Eric Dumazet @ 2011-02-19  8:37 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jay Vosburgh, David Miller, kaber, netdev, shemminger,
	nicolas.2p.debian, andy
In-Reply-To: <20110219080523.GB2782@psychotron.redhat.com>

On Saturday, 19 February 2011 at 09:05 +0100, Jiri Pirko wrote:
> This patch converts bonding to use rx_handler. This results in a cleaner
> __netif_receive_skb() with far fewer exceptions needed. Also,
> bond-specific work is moved into the bond code.
> 
> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
> 
> v1->v2:
>         using skb_iif instead of new input_dev to remember original
> 	device
> v2->v3:
> 	set orig_dev = skb->dev if skb_iif is set
> 

Seems much better ;)

Do you have some performance numbers ?




^ permalink raw reply

* [PATCH] tcp: fix inet_twsk_deschedule()
From: Eric Dumazet @ 2011-02-19  8:35 UTC (permalink / raw)
  To: Eric W. Biederman, David Miller
  Cc: Arnaldo Carvalho de Melo, Linus Torvalds, Michal Hocko,
	Ingo Molnar, linux-mm, LKML, netdev, Pavel Emelyanov,
	Daniel Lezcano
In-Reply-To: <m1sjvl2i3q.fsf@fess.ebiederm.org>

On Friday, 18 February 2011 at 12:38 -0800, Eric W. Biederman wrote:
> Arnaldo Carvalho de Melo <acme@redhat.com> writes:
> 
> > On Fri, Feb 18, 2011 at 05:01:28PM -0200, Arnaldo Carvalho de Melo wrote:
> >> On Fri, Feb 18, 2011 at 10:48:18AM -0800, Linus Torvalds wrote:
> >> > This seems to be a fairly straightforward bug.
> >> > 
> >> > In net/ipv4/inet_timewait_sock.c we have this:
> >> > 
> >> >   /* These are always called from BH context.  See callers in
> >> >    * tcp_input.c to verify this.
> >> >    */
> >> > 
> >> >   /* This is for handling early-kills of TIME_WAIT sockets. */
> >> >   void inet_twsk_deschedule(struct inet_timewait_sock *tw,
> >> >                             struct inet_timewait_death_row *twdr)
> >> >   {
> >> >           spin_lock(&twdr->death_lock);
> >> >           ..
> >> > 
> >> > and the intention is clearly that that spin_lock is BH-safe because
> >> > it's called from BH context.
> >> > 
> >> > Except that clearly isn't true. It's called from a worker thread:
> >> > 
> >> > > stack backtrace:
> >> > > Pid: 10833, comm: kworker/u:1 Not tainted 2.6.38-rc4-359399.2010AroraKernelBeta.fc14.x86_64 #1
> >> > > Call Trace:
> >> > >  [<ffffffff81460e69>] ? inet_twsk_deschedule+0x29/0xa0
> >> > >  [<ffffffff81460fd6>] ? inet_twsk_purge+0xf6/0x180
> >> > >  [<ffffffff81460f10>] ? inet_twsk_purge+0x30/0x180
> >> > >  [<ffffffff814760fc>] ? tcp_sk_exit_batch+0x1c/0x20
> >> > >  [<ffffffff8141c1d3>] ? ops_exit_list.clone.0+0x53/0x60
> >> > >  [<ffffffff8141c520>] ? cleanup_net+0x100/0x1b0
> >> > >  [<ffffffff81068c47>] ? process_one_work+0x187/0x4b0
> >> > >  [<ffffffff81068be1>] ? process_one_work+0x121/0x4b0
> >> > >  [<ffffffff8141c420>] ? cleanup_net+0x0/0x1b0
> >> > >  [<ffffffff8106a65c>] ? worker_thread+0x15c/0x330
> >> > 
> >> > so it can deadlock with a BH happening at the same time, afaik.
> >> > 
> >> > The code (and comment) is all from 2005, it looks like the BH->worker
> >> > thread has broken the code. But somebody who knows that code better
> >> > should take a deeper look at it.
> >> > 
> >> > Added acme to the cc, since the code is attributed to him back in 2005
> >> > ;). Although I don't know how active he's been in networking lately
> >> > (seems to be all perf-related). Whatever, it can't hurt.
> >> 
> >> Original code is ANK's, I just made it possible to use with DCCP, and
> >> yeah, the smiley is appropriate, something 6 years old and the world
> >> around it changing continually... well, thanks for the git blame ;-)
> >
> > But yeah, your analysis seems correct, with the bug being introduced by
> > one of these "world around it changing continually" issues: networking
> > namespaces broke the rules of the game in their cleanup_net() routine.
> > Adding Pavel to the CC list since it doesn't hurt ;-)
> 
> Which probably gets the bug back around to me.
> 
> I guess this must be one of those ipv4 cases where the cleanup
> simply did not exist in the rmmod sense, so we had to invent it.
> 
> I think that was Daniel who did the time wait sockets.  I do remember
> they were a real pain.
> 
> Would a bh_disable be sufficient?  I guess I should stop remembering and
> look at the code now.
> 

Here is the patch to fix the problem.

Daniel's commit d315492b1a6ba29d (netns : fix kernel panic in timewait
socket destruction) was OK (it did use local_bh_disable()).

The problem comes from commit 575f4cd5a5b6394577
(net: Use rcu lookups in inet_twsk_purge.), added in 2.6.33.

Thanks !

[PATCH] tcp: fix inet_twsk_deschedule()

Eric W. Biederman reported a lockdep splat in inet_twsk_deschedule().

This is caused by inet_twsk_purge() being run from process context:
commit 575f4cd5a5b6394577 (net: Use rcu lookups in inet_twsk_purge.)
removed the BH disabling that was necessary.

Add the BH disabling back, but fine-grained: right before calling
inet_twsk_deschedule(), instead of around the whole function.

With help from Linus Torvalds and Eric W. Biederman

Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Daniel Lezcano <daniel.lezcano@free.fr>
CC: Pavel Emelyanov <xemul@openvz.org>
CC: Arnaldo Carvalho de Melo <acme@redhat.com>
CC: stable <stable@kernel.org> (# 2.6.33+)
---
 net/ipv4/inet_timewait_sock.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index c5af909..3c8dfa1 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -505,7 +505,9 @@ restart:
 			}
 
 			rcu_read_unlock();
+			local_bh_disable();
 			inet_twsk_deschedule(tw, twdr);
+			local_bh_enable();
 			inet_twsk_put(tw);
 			goto restart_rcu;
 		}
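
The reasoning behind the fix can be modelled in userspace. This is only a sketch under stated assumptions: struct cpu_model and the model_* helpers are hypothetical, and they merely mimic the rule that a lock also taken from BH context must not be held in process context while BHs are enabled on that CPU.

```c
#include <stdbool.h>

/* Hypothetical model of one CPU's state; this is not kernel code. */
struct cpu_model {
	int  bh_disable_depth;	/* models local_bh_disable() nesting */
	bool in_softirq;	/* true while a modelled BH handler runs */
};

/* Models local_bh_disable(): BHs cannot run until the depth is 0 again. */
static void model_bh_disable(struct cpu_model *c)
{
	c->bh_disable_depth++;
}

/* Models local_bh_enable(). */
static void model_bh_enable(struct cpu_model *c)
{
	c->bh_disable_depth--;
}

/* A BH can interrupt the current context only when BHs are enabled
 * and we are not already running inside one on this CPU.
 */
static bool model_bh_can_interrupt(const struct cpu_model *c)
{
	return !c->in_softirq && c->bh_disable_depth == 0;
}

/* A plain spin_lock() on a lock that BH handlers also take is unsafe
 * whenever a BH could interrupt the holder: the handler would then
 * spin forever on a lock held by the context it interrupted.
 */
static bool model_plain_spin_lock_unsafe(const struct cpu_model *c)
{
	return model_bh_can_interrupt(c);
}
```

In this model the cleanup_net() worker corresponds to a state with depth 0 and in_softirq false, which is reported as unsafe. Wrapping the call in model_bh_disable()/model_bh_enable() makes it safe, which mirrors what the two added lines do around inet_twsk_deschedule().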



^ permalink raw reply related

* [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Jiri Pirko @ 2011-02-19  8:05 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: David Miller, kaber, eric.dumazet, netdev, shemminger,
	nicolas.2p.debian, andy
In-Reply-To: <21593.1298070371@death>

This patch converts bonding to use rx_handler. This results in a cleaner
__netif_receive_skb() with far fewer exceptions needed. Also,
bond-specific work is moved into the bond code.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>

v1->v2:
        using skb_iif instead of new input_dev to remember original
	device
v2->v3:
	set orig_dev = skb->dev if skb_iif is set

---
 drivers/net/bonding/bond_main.c |   75 ++++++++++++++++++++++++-
 net/core/dev.c                  |  120 +++++++++-----------------------------
 2 files changed, 103 insertions(+), 92 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 77e3c6a..a856a11 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1423,6 +1423,68 @@ static void bond_setup_by_slave(struct net_device *bond_dev,
 	bond->setup_by_slave = 1;
 }
 
+/* On bonding slaves other than the currently active slave, suppress
+ * duplicates except for 802.3ad ETH_P_SLOW, alb non-mcast/bcast, and
+ * ARP on active-backup slaves with arp_validate enabled.
+ */
+static bool bond_should_deliver_exact_match(struct sk_buff *skb,
+					    struct net_device *slave_dev,
+					    struct net_device *bond_dev)
+{
+	if (slave_dev->priv_flags & IFF_SLAVE_INACTIVE) {
+		if (slave_dev->priv_flags & IFF_SLAVE_NEEDARP &&
+		    skb->protocol == __cpu_to_be16(ETH_P_ARP))
+			return false;
+
+		if (bond_dev->priv_flags & IFF_MASTER_ALB &&
+		    skb->pkt_type != PACKET_BROADCAST &&
+		    skb->pkt_type != PACKET_MULTICAST)
+			return false;
+
+		if (bond_dev->priv_flags & IFF_MASTER_8023AD &&
+		    skb->protocol == __cpu_to_be16(ETH_P_SLOW))
+			return false;
+
+		return true;
+	}
+	return false;
+}
+
+static struct sk_buff *bond_handle_frame(struct sk_buff *skb)
+{
+	struct net_device *slave_dev;
+	struct net_device *bond_dev;
+
+	skb = skb_share_check(skb, GFP_ATOMIC);
+	if (unlikely(!skb))
+		return NULL;
+	slave_dev = skb->dev;
+	bond_dev = ACCESS_ONCE(slave_dev->master);
+	if (unlikely(!bond_dev))
+		return skb;
+
+	if (bond_dev->priv_flags & IFF_MASTER_ARPMON)
+		slave_dev->last_rx = jiffies;
+
+	if (bond_should_deliver_exact_match(skb, slave_dev, bond_dev)) {
+		skb->deliver_no_wcard = 1;
+		return skb;
+	}
+
+	skb->dev = bond_dev;
+
+	if (bond_dev->priv_flags & IFF_MASTER_ALB &&
+	    bond_dev->priv_flags & IFF_BRIDGE_PORT &&
+	    skb->pkt_type == PACKET_HOST) {
+		u16 *dest = (u16 *) eth_hdr(skb)->h_dest;
+
+		memcpy(dest, bond_dev->dev_addr, ETH_ALEN);
+	}
+
+	netif_rx(skb);
+	return NULL;
+}
+
 /* enslave device <slave> to bond device <master> */
 int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 {
@@ -1599,11 +1661,17 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 		pr_debug("Error %d calling netdev_set_bond_master\n", res);
 		goto err_restore_mac;
 	}
+	res = netdev_rx_handler_register(slave_dev, bond_handle_frame, NULL);
+	if (res) {
+		pr_debug("Error %d calling netdev_rx_handler_register\n", res);
+		goto err_unset_master;
+	}
+
 	/* open the slave since the application closed it */
 	res = dev_open(slave_dev);
 	if (res) {
 		pr_debug("Opening slave %s failed\n", slave_dev->name);
-		goto err_unset_master;
+		goto err_unreg_rxhandler;
 	}
 
 	new_slave->dev = slave_dev;
@@ -1811,6 +1879,9 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 err_close:
 	dev_close(slave_dev);
 
+err_unreg_rxhandler:
+	netdev_rx_handler_unregister(slave_dev);
+
 err_unset_master:
 	netdev_set_bond_master(slave_dev, NULL);
 
@@ -1992,6 +2063,7 @@ int bond_release(struct net_device *bond_dev, struct net_device *slave_dev)
 		netif_addr_unlock_bh(bond_dev);
 	}
 
+	netdev_rx_handler_unregister(slave_dev);
 	netdev_set_bond_master(slave_dev, NULL);
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
@@ -2114,6 +2186,7 @@ static int bond_release_all(struct net_device *bond_dev)
 			netif_addr_unlock_bh(bond_dev);
 		}
 
+		netdev_rx_handler_unregister(slave_dev);
 		netdev_set_bond_master(slave_dev, NULL);
 
 		/* close slave before restoring its mac address */
diff --git a/net/core/dev.c b/net/core/dev.c
index 4f69439..4ebf7fe 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3092,54 +3092,23 @@ void netdev_rx_handler_unregister(struct net_device *dev)
 }
 EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
 
-static inline void skb_bond_set_mac_by_master(struct sk_buff *skb,
-					      struct net_device *master)
+static void vlan_on_bond_hook(struct sk_buff *skb)
 {
-	if (skb->pkt_type == PACKET_HOST) {
-		u16 *dest = (u16 *) eth_hdr(skb)->h_dest;
-
-		memcpy(dest, master->dev_addr, ETH_ALEN);
-	}
-}
-
-/* On bonding slaves other than the currently active slave, suppress
- * duplicates except for 802.3ad ETH_P_SLOW, alb non-mcast/bcast, and
- * ARP on active-backup slaves with arp_validate enabled.
- */
-static int __skb_bond_should_drop(struct sk_buff *skb,
-				  struct net_device *master)
-{
-	struct net_device *dev = skb->dev;
-
-	if (master->priv_flags & IFF_MASTER_ARPMON)
-		dev->last_rx = jiffies;
-
-	if ((master->priv_flags & IFF_MASTER_ALB) &&
-	    (master->priv_flags & IFF_BRIDGE_PORT)) {
-		/* Do address unmangle. The local destination address
-		 * will be always the one master has. Provides the right
-		 * functionality in a bridge.
-		 */
-		skb_bond_set_mac_by_master(skb, master);
-	}
-
-	if (dev->priv_flags & IFF_SLAVE_INACTIVE) {
-		if ((dev->priv_flags & IFF_SLAVE_NEEDARP) &&
-		    skb->protocol == __cpu_to_be16(ETH_P_ARP))
-			return 0;
-
-		if (master->priv_flags & IFF_MASTER_ALB) {
-			if (skb->pkt_type != PACKET_BROADCAST &&
-			    skb->pkt_type != PACKET_MULTICAST)
-				return 0;
-		}
-		if (master->priv_flags & IFF_MASTER_8023AD &&
-		    skb->protocol == __cpu_to_be16(ETH_P_SLOW))
-			return 0;
+	/*
+	 * Make sure ARP frames received on VLAN interfaces stacked on
+	 * bonding interfaces still make their way to any base bonding
+	 * device that may have registered for a specific ptype.
+	 */
+	if (skb->dev->priv_flags & IFF_802_1Q_VLAN &&
+	    vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING &&
+	    skb->protocol == htons(ETH_P_ARP)) {
+		struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);
 
-		return 1;
+		if (!skb2)
+			return;
+		skb2->dev = vlan_dev_real_dev(skb->dev);
+		netif_rx(skb2);
 	}
-	return 0;
 }
 
 static int __netif_receive_skb(struct sk_buff *skb)
@@ -3147,8 +3116,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
 	struct packet_type *ptype, *pt_prev;
 	rx_handler_func_t *rx_handler;
 	struct net_device *orig_dev;
-	struct net_device *null_or_orig;
-	struct net_device *orig_or_bond;
+	struct net_device *null_or_dev;
 	int ret = NET_RX_DROP;
 	__be16 type;
 
@@ -3161,33 +3129,6 @@ static int __netif_receive_skb(struct sk_buff *skb)
 	if (netpoll_receive_skb(skb))
 		return NET_RX_DROP;
 
-	if (!skb->skb_iif)
-		skb->skb_iif = skb->dev->ifindex;
-
-	/*
-	 * bonding note: skbs received on inactive slaves should only
-	 * be delivered to pkt handlers that are exact matches.  Also
-	 * the deliver_no_wcard flag will be set.  If packet handlers
-	 * are sensitive to duplicate packets these skbs will need to
-	 * be dropped at the handler.
-	 */
-	null_or_orig = NULL;
-	orig_dev = skb->dev;
-	if (skb->deliver_no_wcard)
-		null_or_orig = orig_dev;
-	else if (netif_is_bond_slave(orig_dev)) {
-		struct net_device *bond_master = ACCESS_ONCE(orig_dev->master);
-
-		if (likely(bond_master)) {
-			if (__skb_bond_should_drop(skb, bond_master)) {
-				skb->deliver_no_wcard = 1;
-				/* deliver only exact match */
-				null_or_orig = orig_dev;
-			} else
-				skb->dev = bond_master;
-		}
-	}
-
 	__this_cpu_inc(softnet_data.processed);
 	skb_reset_network_header(skb);
 	skb_reset_transport_header(skb);
@@ -3197,6 +3138,13 @@ static int __netif_receive_skb(struct sk_buff *skb)
 
 	rcu_read_lock();
 
+	if (!skb->skb_iif) {
+		skb->skb_iif = skb->dev->ifindex;
+		orig_dev = skb->dev;
+	} else {
+		orig_dev = dev_get_by_index_rcu(dev_net(skb->dev), skb->skb_iif);
+	}
+
 #ifdef CONFIG_NET_CLS_ACT
 	if (skb->tc_verd & TC_NCLS) {
 		skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
@@ -3205,8 +3153,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
 #endif
 
 	list_for_each_entry_rcu(ptype, &ptype_all, list) {
-		if (ptype->dev == null_or_orig || ptype->dev == skb->dev ||
-		    ptype->dev == orig_dev) {
+		if (!ptype->dev || ptype->dev == skb->dev) {
 			if (pt_prev)
 				ret = deliver_skb(skb, pt_prev, orig_dev);
 			pt_prev = ptype;
@@ -3220,7 +3167,6 @@ static int __netif_receive_skb(struct sk_buff *skb)
 ncls:
 #endif
 
-	/* Handle special case of bridge or macvlan */
 	rx_handler = rcu_dereference(skb->dev->rx_handler);
 	if (rx_handler) {
 		if (pt_prev) {
@@ -3244,24 +3190,16 @@ ncls:
 			goto out;
 	}
 
-	/*
-	 * Make sure frames received on VLAN interfaces stacked on
-	 * bonding interfaces still make their way to any base bonding
-	 * device that may have registered for a specific ptype.  The
-	 * handler may have to adjust skb->dev and orig_dev.
-	 */
-	orig_or_bond = orig_dev;
-	if ((skb->dev->priv_flags & IFF_802_1Q_VLAN) &&
-	    (vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING)) {
-		orig_or_bond = vlan_dev_real_dev(skb->dev);
-	}
+	vlan_on_bond_hook(skb);
+
+	/* deliver only exact match when indicated */
+	null_or_dev = skb->deliver_no_wcard ? skb->dev : NULL;
 
 	type = skb->protocol;
 	list_for_each_entry_rcu(ptype,
 			&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
-		if (ptype->type == type && (ptype->dev == null_or_orig ||
-		     ptype->dev == skb->dev || ptype->dev == orig_dev ||
-		     ptype->dev == orig_or_bond)) {
+		if (ptype->type == type &&
+		    (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
 			if (pt_prev)
 				ret = deliver_skb(skb, pt_prev, orig_dev);
 			pt_prev = ptype;
-- 
1.7.3.4
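
To make the delivery rule easier to follow, here is a minimal userspace sketch of the two decisions the patch combines: whether a frame on a slave is restricted to exact-match handlers, and which packet handlers then match. The M_* flag values, struct frame_model and the model_* functions are hypothetical stand-ins chosen for illustration. Only the EtherType numbers are the real ones.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits standing in for the priv_flags used above. */
enum {
	M_SLAVE_INACTIVE = 1 << 0,
	M_SLAVE_NEEDARP  = 1 << 1,
	M_MASTER_ALB     = 1 << 2,
	M_MASTER_8023AD  = 1 << 3,
};

/* EtherTypes in host order (the kernel compares in network order). */
enum { M_ETH_P_IP = 0x0800, M_ETH_P_ARP = 0x0806, M_ETH_P_SLOW = 0x8809 };

enum pkt_model { M_PKT_HOST, M_PKT_BROADCAST, M_PKT_MULTICAST };

struct frame_model {
	unsigned int slave_flags;
	unsigned int bond_flags;
	unsigned int protocol;
	enum pkt_model pkt_type;
};

/* Mirrors the structure of bond_should_deliver_exact_match(): only
 * frames on an inactive slave are restricted, with three exceptions.
 */
static bool model_exact_match_only(const struct frame_model *f)
{
	if (!(f->slave_flags & M_SLAVE_INACTIVE))
		return false;
	if ((f->slave_flags & M_SLAVE_NEEDARP) && f->protocol == M_ETH_P_ARP)
		return false;
	if ((f->bond_flags & M_MASTER_ALB) &&
	    f->pkt_type != M_PKT_BROADCAST &&
	    f->pkt_type != M_PKT_MULTICAST)
		return false;
	if ((f->bond_flags & M_MASTER_8023AD) && f->protocol == M_ETH_P_SLOW)
		return false;
	return true;
}

/* Mirrors the ptype_base test: a wildcard handler (dev == NULL) matches
 * only when deliver_no_wcard is clear; a bound handler matches its device.
 */
static bool model_ptype_matches(const void *ptype_dev, const void *skb_dev,
				bool deliver_no_wcard)
{
	const void *null_or_dev = deliver_no_wcard ? skb_dev : NULL;

	return ptype_dev == null_or_dev || ptype_dev == skb_dev;
}
```

In this model an IP frame arriving on an inactive slave is marked exact-match-only, so a wildcard handler no longer sees it, while a handler bound to that slave still does.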


^ permalink raw reply related

* Re: [patch net-next-2.6 V2] net: convert bonding to use rx_handler
From: Jiri Pirko @ 2011-02-19  7:44 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: David Miller, kaber, eric.dumazet, netdev, shemminger,
	nicolas.2p.debian, andy
In-Reply-To: <21593.1298070371@death>

Sat, Feb 19, 2011 at 12:06:11AM CET, fubar@us.ibm.com wrote:
>Jiri Pirko <jpirko@redhat.com> wrote:
>
>>This patch converts bonding to use rx_handler. This results in a cleaner
>>__netif_receive_skb() with far fewer exceptions needed. Also, bond-specific
>>work is moved into the bond code.
>>
>>Signed-off-by: Jiri Pirko <jpirko@redhat.com>
>>
>>v1->v2:
>>	using skb_iif instead of new input_dev to remember original device
>>
>>---
>> drivers/net/bonding/bond_main.c |   75 ++++++++++++++++++++++++++-
>> net/core/dev.c                  |  111 ++++++++-------------------------------
>> 2 files changed, 97 insertions(+), 89 deletions(-)
>>
>>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>>index 77e3c6a..a856a11 100644
>>--- a/drivers/net/bonding/bond_main.c
>>+++ b/drivers/net/bonding/bond_main.c
>>@@ -1423,6 +1423,68 @@ static void bond_setup_by_slave(struct net_device *bond_dev,
>> 	bond->setup_by_slave = 1;
>> }
>>
>>+/* On bonding slaves other than the currently active slave, suppress
>>+ * duplicates except for 802.3ad ETH_P_SLOW, alb non-mcast/bcast, and
>>+ * ARP on active-backup slaves with arp_validate enabled.
>>+ */
>>+static bool bond_should_deliver_exact_match(struct sk_buff *skb,
>>+					    struct net_device *slave_dev,
>>+					    struct net_device *bond_dev)
>>+{
>>+	if (slave_dev->priv_flags & IFF_SLAVE_INACTIVE) {
>>+		if (slave_dev->priv_flags & IFF_SLAVE_NEEDARP &&
>>+		    skb->protocol == __cpu_to_be16(ETH_P_ARP))
>>+			return false;
>>+
>>+		if (bond_dev->priv_flags & IFF_MASTER_ALB &&
>>+		    skb->pkt_type != PACKET_BROADCAST &&
>>+		    skb->pkt_type != PACKET_MULTICAST)
>>+			return false;
>>+
>>+		if (bond_dev->priv_flags & IFF_MASTER_8023AD &&
>>+		    skb->protocol == __cpu_to_be16(ETH_P_SLOW))
>>+			return false;
>
>	Since this is all in the bonding code now, it should be possible
>to do away with using priv_flags for all (or at least most) of this.
>Perhaps in a follow-on patch.

A follow-on patch is exactly where I intended to do this.

>
>>+
>>+		return true;
>>+	}
>>+	return false;
>>+}
>>+
>>+static struct sk_buff *bond_handle_frame(struct sk_buff *skb)
>>+{
>>+	struct net_device *slave_dev;
>>+	struct net_device *bond_dev;
>>+
>>+	skb = skb_share_check(skb, GFP_ATOMIC);
>>+	if (unlikely(!skb))
>>+		return NULL;
>>+	slave_dev = skb->dev;
>>+	bond_dev = ACCESS_ONCE(slave_dev->master);
>>+	if (unlikely(!bond_dev))
>>+		return skb;
>>+
>>+	if (bond_dev->priv_flags & IFF_MASTER_ARPMON)
>>+		slave_dev->last_rx = jiffies;
>
>	The last_rx field could probably move into bonding as well,
>although it looks like there are a couple of drivers using last_rx for
>something (more than just setting it).

I'll leave this to a follow-on patch as well.

>
>>+	if (bond_should_deliver_exact_match(skb, slave_dev, bond_dev)) {
>>+		skb->deliver_no_wcard = 1;
>>+		return skb;
>>+	}
>>+
>>+	skb->dev = bond_dev;
>>+
>>+	if (bond_dev->priv_flags & IFF_MASTER_ALB &&
>>+	    bond_dev->priv_flags & IFF_BRIDGE_PORT &&
>>+	    skb->pkt_type == PACKET_HOST) {
>>+		u16 *dest = (u16 *) eth_hdr(skb)->h_dest;
>>+
>>+		memcpy(dest, bond_dev->dev_addr, ETH_ALEN);
>>+	}
>>+
>>+	netif_rx(skb);
>>+	return NULL;
>>+}
>>+
>> /* enslave device <slave> to bond device <master> */
>> int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
>> {
>>@@ -1599,11 +1661,17 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
>> 		pr_debug("Error %d calling netdev_set_bond_master\n", res);
>> 		goto err_restore_mac;
>> 	}
>>+	res = netdev_rx_handler_register(slave_dev, bond_handle_frame, NULL);
>>+	if (res) {
>>+		pr_debug("Error %d calling netdev_rx_handler_register\n", res);
>>+		goto err_unset_master;
>>+	}
>>+
>> 	/* open the slave since the application closed it */
>> 	res = dev_open(slave_dev);
>> 	if (res) {
>> 		pr_debug("Opening slave %s failed\n", slave_dev->name);
>>-		goto err_unset_master;
>>+		goto err_unreg_rxhandler;
>> 	}
>>
>> 	new_slave->dev = slave_dev;
>>@@ -1811,6 +1879,9 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
>> err_close:
>> 	dev_close(slave_dev);
>>
>>+err_unreg_rxhandler:
>>+	netdev_rx_handler_unregister(slave_dev);
>>+
>> err_unset_master:
>> 	netdev_set_bond_master(slave_dev, NULL);
>>
>>@@ -1992,6 +2063,7 @@ int bond_release(struct net_device *bond_dev, struct net_device *slave_dev)
>> 		netif_addr_unlock_bh(bond_dev);
>> 	}
>>
>>+	netdev_rx_handler_unregister(slave_dev);
>> 	netdev_set_bond_master(slave_dev, NULL);
>>
>> #ifdef CONFIG_NET_POLL_CONTROLLER
>>@@ -2114,6 +2186,7 @@ static int bond_release_all(struct net_device *bond_dev)
>> 			netif_addr_unlock_bh(bond_dev);
>> 		}
>>
>>+		netdev_rx_handler_unregister(slave_dev);
>> 		netdev_set_bond_master(slave_dev, NULL);
>>
>> 		/* close slave before restoring its mac address */
>>diff --git a/net/core/dev.c b/net/core/dev.c
>>index 4f69439..580cff1 100644
>>--- a/net/core/dev.c
>>+++ b/net/core/dev.c
>>@@ -3092,63 +3092,31 @@ void netdev_rx_handler_unregister(struct net_device *dev)
>> }
>> EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
>>
>>-static inline void skb_bond_set_mac_by_master(struct sk_buff *skb,
>>-					      struct net_device *master)
>>+static void vlan_on_bond_hook(struct sk_buff *skb)
>> {
>>-	if (skb->pkt_type == PACKET_HOST) {
>>-		u16 *dest = (u16 *) eth_hdr(skb)->h_dest;
>>-
>>-		memcpy(dest, master->dev_addr, ETH_ALEN);
>>-	}
>>-}
>>-
>>-/* On bonding slaves other than the currently active slave, suppress
>>- * duplicates except for 802.3ad ETH_P_SLOW, alb non-mcast/bcast, and
>>- * ARP on active-backup slaves with arp_validate enabled.
>>- */
>>-static int __skb_bond_should_drop(struct sk_buff *skb,
>>-				  struct net_device *master)
>>-{
>>-	struct net_device *dev = skb->dev;
>>-
>>-	if (master->priv_flags & IFF_MASTER_ARPMON)
>>-		dev->last_rx = jiffies;
>>-
>>-	if ((master->priv_flags & IFF_MASTER_ALB) &&
>>-	    (master->priv_flags & IFF_BRIDGE_PORT)) {
>>-		/* Do address unmangle. The local destination address
>>-		 * will be always the one master has. Provides the right
>>-		 * functionality in a bridge.
>>-		 */
>>-		skb_bond_set_mac_by_master(skb, master);
>>-	}
>>-
>>-	if (dev->priv_flags & IFF_SLAVE_INACTIVE) {
>>-		if ((dev->priv_flags & IFF_SLAVE_NEEDARP) &&
>>-		    skb->protocol == __cpu_to_be16(ETH_P_ARP))
>>-			return 0;
>>-
>>-		if (master->priv_flags & IFF_MASTER_ALB) {
>>-			if (skb->pkt_type != PACKET_BROADCAST &&
>>-			    skb->pkt_type != PACKET_MULTICAST)
>>-				return 0;
>>-		}
>>-		if (master->priv_flags & IFF_MASTER_8023AD &&
>>-		    skb->protocol == __cpu_to_be16(ETH_P_SLOW))
>>-			return 0;
>>+	/*
>>+	 * Make sure ARP frames received on VLAN interfaces stacked on
>>+	 * bonding interfaces still make their way to any base bonding
>>+	 * device that may have registered for a specific ptype.
>>+	 */
>>+	if (skb->dev->priv_flags & IFF_802_1Q_VLAN &&
>>+	    vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING &&
>>+	    skb->protocol == htons(ETH_P_ARP)) {
>>+		struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);
>>
>>-		return 1;
>>+		if (!skb2)
>>+			return;
>>+		skb2->dev = vlan_dev_real_dev(skb->dev);
>>+		netif_rx(skb2);
>> 	}
>>-	return 0;
>> }
>>
>> static int __netif_receive_skb(struct sk_buff *skb)
>> {
>> 	struct packet_type *ptype, *pt_prev;
>> 	rx_handler_func_t *rx_handler;
>>+	struct net_device *null_or_dev;
>> 	struct net_device *orig_dev;
>>-	struct net_device *null_or_orig;
>>-	struct net_device *orig_or_bond;
>> 	int ret = NET_RX_DROP;
>> 	__be16 type;
>>
>>@@ -3164,30 +3132,6 @@ static int __netif_receive_skb(struct sk_buff *skb)
>> 	if (!skb->skb_iif)
>> 		skb->skb_iif = skb->dev->ifindex;
>>
>>-	/*
>>-	 * bonding note: skbs received on inactive slaves should only
>>-	 * be delivered to pkt handlers that are exact matches.  Also
>>-	 * the deliver_no_wcard flag will be set.  If packet handlers
>>-	 * are sensitive to duplicate packets these skbs will need to
>>-	 * be dropped at the handler.
>>-	 */
>>-	null_or_orig = NULL;
>>-	orig_dev = skb->dev;
>>-	if (skb->deliver_no_wcard)
>>-		null_or_orig = orig_dev;
>>-	else if (netif_is_bond_slave(orig_dev)) {
>>-		struct net_device *bond_master = ACCESS_ONCE(orig_dev->master);
>>-
>>-		if (likely(bond_master)) {
>>-			if (__skb_bond_should_drop(skb, bond_master)) {
>>-				skb->deliver_no_wcard = 1;
>>-				/* deliver only exact match */
>>-				null_or_orig = orig_dev;
>>-			} else
>>-				skb->dev = bond_master;
>>-		}
>>-	}
>>-
>> 	__this_cpu_inc(softnet_data.processed);
>> 	skb_reset_network_header(skb);
>> 	skb_reset_transport_header(skb);
>>@@ -3196,6 +3140,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>> 	pt_prev = NULL;
>>
>> 	rcu_read_lock();
>>+	orig_dev = dev_get_by_index_rcu(dev_net(skb->dev), skb->skb_iif);
>
>	Aren't most packets going to have orig_dev == skb->dev at this
>point?  Can this be combined with the skb_iif test a few lines above
>this in __netif_receive_skb, looking something like:
>
>	if (!skb->skb_iif) {
>		skb->skb_iif = skb->dev->ifindex;
>		orig_dev = skb->dev;
>	} else {
>		orig_dev = dev_get_by_index_rcu(...);
>	}
>
>	Presumably moving the whole thing down inside the rcu_read_lock.

Yep, that's reasonable. Thanks.
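
For reference, the suggestion above can be sketched in userspace roughly as follows. This is an assumption-laden model, not the kernel code: struct dev_model, struct skb_model and both functions are hypothetical, and the linear search merely stands in for dev_get_by_index_rcu().

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct net_device and struct sk_buff. */
struct dev_model {
	int ifindex;
};

struct skb_model {
	int skb_iif;			/* 0 until the first pass records it */
	const struct dev_model *dev;	/* current device, may be rewritten */
};

/* Stand-in for the by-index lookup: returns NULL if not found. */
static const struct dev_model *
model_dev_by_index(const struct dev_model *devs, size_t n, int ifindex)
{
	for (size_t i = 0; i < n; i++)
		if (devs[i].ifindex == ifindex)
			return &devs[i];
	return NULL;
}

/* First pass: skb_iif is unset, so the current device is the original
 * one and no lookup is needed. Later passes (after an rx_handler changed
 * skb->dev) recover the original device from the recorded index.
 */
static const struct dev_model *
model_orig_dev(struct skb_model *skb, const struct dev_model *devs, size_t n)
{
	if (!skb->skb_iif) {
		skb->skb_iif = skb->dev->ifindex;
		return skb->dev;
	}
	return model_dev_by_index(devs, n, skb->skb_iif);
}
```

The common case (a frame seen for the first time) takes the cheap branch, and only re-injected frames pay for the lookup.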

>
>	VLAN packets should come through here twice, but the first time
>through is before the call to vlan_hwaccel_do_receive, so skb->dev
>hasn't been set to the VLAN's dev yet.
>
>	Unless, of course, you find a place to store the orig_dev.
>
>	-J
>
>> #ifdef CONFIG_NET_CLS_ACT
>> 	if (skb->tc_verd & TC_NCLS) {
>>@@ -3205,8 +3150,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>> #endif
>>
>> 	list_for_each_entry_rcu(ptype, &ptype_all, list) {
>>-		if (ptype->dev == null_or_orig || ptype->dev == skb->dev ||
>>-		    ptype->dev == orig_dev) {
>>+		if (!ptype->dev || ptype->dev == skb->dev) {
>> 			if (pt_prev)
>> 				ret = deliver_skb(skb, pt_prev, orig_dev);
>> 			pt_prev = ptype;
>>@@ -3220,7 +3164,6 @@ static int __netif_receive_skb(struct sk_buff *skb)
>> ncls:
>> #endif
>>
>>-	/* Handle special case of bridge or macvlan */
>> 	rx_handler = rcu_dereference(skb->dev->rx_handler);
>> 	if (rx_handler) {
>> 		if (pt_prev) {
>>@@ -3244,24 +3187,16 @@ ncls:
>> 			goto out;
>> 	}
>>
>>-	/*
>>-	 * Make sure frames received on VLAN interfaces stacked on
>>-	 * bonding interfaces still make their way to any base bonding
>>-	 * device that may have registered for a specific ptype.  The
>>-	 * handler may have to adjust skb->dev and orig_dev.
>>-	 */
>>-	orig_or_bond = orig_dev;
>>-	if ((skb->dev->priv_flags & IFF_802_1Q_VLAN) &&
>>-	    (vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING)) {
>>-		orig_or_bond = vlan_dev_real_dev(skb->dev);
>>-	}
>>+	vlan_on_bond_hook(skb);
>>+
>>+	/* deliver only exact match when indicated */
>>+	null_or_dev = skb->deliver_no_wcard ? skb->dev : NULL;
>>
>> 	type = skb->protocol;
>> 	list_for_each_entry_rcu(ptype,
>> 			&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
>>-		if (ptype->type == type && (ptype->dev == null_or_orig ||
>>-		     ptype->dev == skb->dev || ptype->dev == orig_dev ||
>>-		     ptype->dev == orig_or_bond)) {
>>+		if (ptype->type == type &&
>>+		    (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
>> 			if (pt_prev)
>> 				ret = deliver_skb(skb, pt_prev, orig_dev);
>> 			pt_prev = ptype;
>>-- 
>>1.7.3.4
>>
>
>---
>	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com


* [ANNOUNCE] NET Test Tools
From: David Miller @ 2011-02-19  5:36 UTC (permalink / raw)
  To: netdev; +Cc: eric.dumazet, pablo, robert.olsson

I've made a GIT repository at:

	git://git.kernel.org/pub/scm/linux/kernel/git/davem/net_test_tools.git

that contains little tools like the udpflood program I posted the other
day.

It requires libmnl be installed.  I'm happy to take patches with the
strict rule that adding autoconf is not allowed.

Also in there is a new tool, "route_bench", that allows benchmarking
the ipv4 route lookup path.  Amusingly, it's slower than udpflood
because we have to sink the rtnetlink replies with recvmsg(); maybe
there is some way to optimize that?

The tool allows all kinds of iteration through the various keys of a
routing lookup.

davem@maramba:~/src/GIT/route_bench$ route_bench -h
usage: route_bench [ -o ] [ -l count ]
                [ -s src_ip ] [ -a src_ip_stride ] [ -b src_ip_limit ]
                [ -d dst_ip ] [ -e dst_ip_stride ] [ -f dst_ip_limit ]
                [ -i iif ] [ -x iif_stride ] [ -y iif_limit ]
                [ -m mark ] [ -n mark_stride ] [ -p mark_limit ]
                [ -t tos ] [ -q tos_stride ] [ -r tos_limit ]

For example, the following does 100000 lookups, iterating over destination
addresses 10.8.0.2 to 10.8.0.254 in increments of 7.

davem@maramba:~/src/GIT/route_bench$ route_bench -l 100000 -d 10.8.0.2 -e 7 -f 10.8.0.254
Bench: count(100000) saddr[0x00000000] daddr[0x0a080002] mark[0x0] iif[0x0]
Result: 0m2.342s
davem@maramba:~/src/GIT/route_bench$ 

That is with the routing cache removed on a Niagara 2+ machine.
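
As a rough reading of the numbers above: 2.342 s over 100000 lookups is
about 23.4 microseconds per lookup, and stepping from 10.8.0.2 to
10.8.0.254 in increments of 7 visits 37 distinct addresses before the
limit is passed.  A minimal sketch of that arithmetic follows; the helper
names are hypothetical and not part of route_bench, and treating the
limit as inclusive is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Average cost of one lookup, in microseconds. */
double usec_per_lookup(double total_seconds, unsigned long count)
{
	assert(count > 0);
	return total_seconds * 1e6 / (double)count;
}

/*
 * Number of distinct values visited when stepping from first to limit
 * by stride.  Assumption: the limit itself may be visited (inclusive).
 */
uint32_t stride_steps(uint32_t first, uint32_t limit, uint32_t stride)
{
	assert(stride > 0 && limit >= first);
	return (limit - first) / stride + 1;
}
```

With the values from the example run, usec_per_lookup(2.342, 100000)
gives roughly 23.42 and stride_steps(0x0a080002, 0x0a0800fe, 7) gives 37.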

You can use the "-o" option to control how the program handles the case where
it has been asked to iterate over several keys.  By default, all keys iterate
at the same time.  With "-o" specified we iterate only the highest priority
key until it wraps, then we iterate over the next key, and so on and so
forth.  The priority is src --> dst --> iif --> mark --> tos.
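
The two iteration modes described above can be sketched roughly as
follows.  This is only a model under stated assumptions, not the code in
the repository: struct bench_key and the bench_* functions are
hypothetical names, the keys are assumed to be stored in priority order
(src, dst, iif, mark, tos), and "-o" is read as odometer-style carry from
one key to the next.

```c
#include <stddef.h>
#include <stdint.h>

/* One iterated lookup key (hypothetical model type). */
struct bench_key {
	uint32_t start;		/* first value */
	uint32_t cur;		/* current value, start <= cur <= limit */
	uint32_t stride;	/* increment per step; 0 means fixed */
	uint32_t limit;		/* last permitted value, inclusive */
};

/* Advance one key; returns 1 if it wrapped back to its start value. */
int bench_key_step(struct bench_key *k)
{
	if (k->stride == 0)
		return 1;	/* fixed key: always hands the carry on */
	if (k->limit - k->cur < k->stride) {
		k->cur = k->start;
		return 1;
	}
	k->cur += k->stride;
	return 0;
}

/* Default mode: every key advances on every lookup. */
void bench_step_all(struct bench_key *keys, size_t n)
{
	for (size_t i = 0; i < n; i++)
		bench_key_step(&keys[i]);
}

/* "-o" mode: advance keys[0]; only when it wraps, move on to keys[1], ... */
void bench_step_ordered(struct bench_key *keys, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!bench_key_step(&keys[i]))
			break;
}
```

In the default mode the number of distinct key combinations is bounded
by the longest single key cycle, while the ordered mode walks the full
cross product of all iterated keys.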

Note that if "iif" is zero (its default value) we do an output lookup,
else we do an input route lookup.

I want to add randomization to the iteration, as well as threading support.

Enjoy.


* Re: IGMP and rwlock: Dead ocurred again on TILEPro
From: Cypher Wu @ 2011-02-19  4:07 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: David Miller, xiyou.wangcong, linux-kernel, eric.dumazet, netdev
In-Reply-To: <4D5EE9E9.2000407@tilera.com>

On Sat, Feb 19, 2011 at 5:51 AM, Chris Metcalf <cmetcalf@tilera.com> wrote:
> On 2/17/2011 10:16 PM, Cypher Wu wrote:
>> On Fri, Feb 18, 2011 at 7:18 AM, Chris Metcalf <cmetcalf@tilera.com> wrote:
>>> The interrupt architecture on Tile allows a write to a special-purpose
>>> register to put you into a "critical section" where no interrupts or faults
>>> are delivered.  So we just need to bracket the read_lock operations with
>>> two SPR writes; each takes six machine cycles, so we're only adding 12
>>> cycles to the total cost of taking or releasing a read lock on an rwlock
>> I agree that just locking out interrupts for read operations should be
>> enough, but read_unlock() is also a place where we should lock out
>> interrupts, right? If an interrupt occurs while it holds lock-val after
>> the TNS, a deadlock can still occur.
>
> Correct; that's what I meant by "read_lock operations".  This includes lock,
> trylock, and unlock.
>
>> When will you release that patch? Time is tight, so I may have to
>> fix it up myself.
>
> I heard from one of our support folks that you were asking through that
> channel, so I asked him to go ahead and give you the spinlock sources
> directly.  I will be spending time next week syncing up our internal tree
> with the public git repository so you'll see it on LKML at that time.
>
>> 1. If we use SPR_INTERRUPT_CRITICAL_SECTION it will disable all the
>> interrupts which claim 'CM', is that right? Should we save
>> its original value and restore it later?
>
> We don't need to save and restore, since INTERRUPT_CRITICAL_SECTION is
> almost always zero except in very specific situations.
>
>> 2. Should we lock out interrupts for the whole operation of
>> read_lock()/read_unlock(), or should we leave the interrupt critical
>> section when it runs into __raw_read_lock_slow(), before it has to
>> delay_backoff() for some time, and re-enter the interrupt critical
>> section before the TNS?
>
> Correct, the fix only holds the critical section around the tns and the
> write-back, not during the delay_backoff().
>
>> By the way, other RISC platforms, say ARM and MIPS, use store
>> conditional rather than TNS with a temporary value for lock-val; does
>> Fx have similar instructions?
>
> TILEPro does not have anything more than test-and-set; TILE-Gx (the 64-bit
> processor) has a full array of atomic instructions.
>
>> Adding those two SPR writes should be fine, but might it delay
>> interrupts a little more than other platforms' read_lock()?
>
> A little, but I think it's in the noise relative to the basic cost of
> read_lock in the absence of full-fledged atomic instructions.
>
>> Another question: what does NMI in the former mail mean?
>
> Non-maskable interrupt, such as performance counter interrupts.
>
> --
> Chris Metcalf, Tilera Corp.
> http://www.tilera.com
>
>
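
To make the shape of the fix discussed above concrete, here is a
userspace model of the read-lock path, with the critical section held
only around the tns and the write-back and dropped during the backoff.
This is only a sketch under assumptions, not the Tilera sources: C11
atomic_exchange() stands in for the tns instruction, a plain flag stands
in for SPR_INTERRUPT_CRITICAL_SECTION, backoff() is an empty placeholder
for delay_backoff(), and the RD_COUNT_* values (reader count in the top
byte) are assumed.  All function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RD_COUNT_SHIFT	24	/* assumed: reader count in the top byte */
#define RD_COUNT_WIDTH	8

int in_critical;	/* stand-in for SPR_INTERRUPT_CRITICAL_SECTION */

void enter_critical(void) { assert(!in_critical); in_critical = 1; }
void leave_critical(void) { in_critical = 0; }

/* Stand-in for tns: fetch the word and leave 1 behind, atomically. */
uint32_t tns(_Atomic uint32_t *p) { return atomic_exchange(p, 1); }

/* Placeholder for delay_backoff(); a real one would spin for a while. */
void backoff(unsigned int iterations) { (void)iterations; }

void model_read_lock(_Atomic uint32_t *lock)
{
	unsigned int iterations = 0;

	for (;;) {
		uint32_t val;

		enter_critical();
		val = tns(lock);
		if ((uint32_t)(val << RD_COUNT_WIDTH) == 0) {
			/* No writer, no tns in flight: write back count + 1. */
			atomic_store(lock, val + (1u << RD_COUNT_SHIFT));
			leave_critical();
			return;
		}
		if (!(val & 1))
			atomic_store(lock, val);	/* we own the word: restore it */
		leave_critical();
		backoff(iterations++);	/* interrupts can be taken here */
	}
}

void model_read_unlock(_Atomic uint32_t *lock)
{
	unsigned int iterations = 0;

	for (;;) {
		uint32_t val;

		enter_critical();
		val = tns(lock);
		if (!(val & 1)) {
			atomic_store(lock, val - (1u << RD_COUNT_SHIFT));
			leave_critical();
			return;
		}
		leave_critical();
		backoff(iterations++);
	}
}
```

The point of the model is the window between tns() and the write-back:
while the word holds 1, nobody else can make progress on the lock, so an
interrupt handler that takes the same lock on the same core in that
window would spin forever.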

I've got your source code, thank you very much.

There are still two more questions:
1. Why do we merge the inlined code and the *_slow code into non-inlined
functions?
2. I've seen the use of 'mb()' in the unlock operation, but we don't use
it in the lock operation.

I've released a temporary version with that modification at our
customer's request, since they want to run a long-duration test through
this weekend. I'd appreciate it if you could comment on my
modifications:

*** /opt/TileraMDE-2.1.0.98943/tilepro/src/sys/linux/include/asm-tile/spinlock_32.h
    2010-04-02 11:07:47.000000000 +0800
--- include/asm-tile/spinlock_32.h      2011-02-18 17:09:40.000000000 +0800
***************
*** 12,17 ****
--- 12,27 ----
   *   more details.
   *
   * 32-bit SMP spinlocks.
+  *
+  *
+  * The use of the TNS instruction causes a race between system calls
+  * and interrupts, so we have to lock out interrupts while we are
+  * trying the lock value.  However, write_lock() is exclusive: if it
+  * is really needed in interrupt context, system calls must use
+  * write_lock_irqsave(), so it needs no interrupt locking here.
+  * Spinlocks are also exclusive, so we don't need to handle them.
+  *
+  * Modified by Cyberman Wu on Feb 18th, 2011.
   */

  #ifndef _ASM_TILE_SPINLOCK_32_H
*************** void __raw_read_unlock_slow(raw_rwlock_t
*** 86,91 ****
--- 96,114 ----
  void __raw_write_lock_slow(raw_rwlock_t *, u32);
  void __raw_write_unlock_slow(raw_rwlock_t *, u32);

+
+ static inline void __raw_read_lock_enter_critical(void)
+ {
+       BUG_ON(__insn_mfspr(SPR_INTERRUPT_CRITICAL_SECTION));
+       __insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 1);
+ }
+
+ static inline void __raw_read_lock_leave_critical(void)
+ {
+       __insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 0);
+ }
+
+
  /**
   * __raw_read_can_lock() - would read_trylock() succeed?
   */
*************** static inline int __raw_write_can_lock(r
*** 107,121 ****
   */
  static inline void __raw_read_lock(raw_rwlock_t *rwlock)
  {
!       u32 val = __insn_tns((int *)&rwlock->lock);
        if (unlikely(val << _RD_COUNT_WIDTH)) {
  #ifdef __TILECC__
  #pragma frequency_hint NEVER
  #endif
                __raw_read_lock_slow(rwlock, val);
                return;
        }
        rwlock->lock = val + (1 << _RD_COUNT_SHIFT);
  }

  /**
--- 130,148 ----
   */
  static inline void __raw_read_lock(raw_rwlock_t *rwlock)
  {
!     u32 val;
!       __raw_read_lock_enter_critical();
!       /*u32 */val = __insn_tns((int *)&rwlock->lock);
        if (unlikely(val << _RD_COUNT_WIDTH)) {
  #ifdef __TILECC__
  #pragma frequency_hint NEVER
  #endif
                __raw_read_lock_slow(rwlock, val);
+               __raw_read_lock_leave_critical();
                return;
        }
        rwlock->lock = val + (1 << _RD_COUNT_SHIFT);
+       __raw_read_lock_leave_critical();
  }

  /**
*************** static inline void __raw_write_lock(raw_
*** 140,154 ****
  static inline int __raw_read_trylock(raw_rwlock_t *rwlock)
  {
        int locked;
!       u32 val = __insn_tns((int *)&rwlock->lock);
        if (unlikely(val & 1)) {
  #ifdef __TILECC__
  #pragma frequency_hint NEVER
  #endif
!               return __raw_read_trylock_slow(rwlock);
        }
        locked = (val << _RD_COUNT_WIDTH) == 0;
        rwlock->lock = val + (locked << _RD_COUNT_SHIFT);
        return locked;
  }

--- 167,187 ----
  static inline int __raw_read_trylock(raw_rwlock_t *rwlock)
  {
        int locked;
!     u32 val;
!       __raw_read_lock_enter_critical();
!       /*u32 */val = __insn_tns((int *)&rwlock->lock);
        if (unlikely(val & 1)) {
  #ifdef __TILECC__
  #pragma frequency_hint NEVER
  #endif
!               // return __raw_read_trylock_slow(rwlock);
!               locked =__raw_read_trylock_slow(rwlock);
!               __raw_read_lock_leave_critical();
!               return locked;
        }
        locked = (val << _RD_COUNT_WIDTH) == 0;
        rwlock->lock = val + (locked << _RD_COUNT_SHIFT);
+       __raw_read_lock_leave_critical();
        return locked;
  }

*************** static inline void __raw_read_unlock(raw
*** 184,198 ****
--- 217,234 ----
  {
        u32 val;
        mb();
+       __raw_read_lock_enter_critical();
        val = __insn_tns((int *)&rwlock->lock);
        if (unlikely(val & 1)) {
  #ifdef __TILECC__
  #pragma frequency_hint NEVER
  #endif
                __raw_read_unlock_slow(rwlock);
+               __raw_read_lock_leave_critical();
                return;
        }
        rwlock->lock = val - (1 << _RD_COUNT_SHIFT);
+       __raw_read_lock_leave_critical();
  }



--- /opt/TileraMDE-2.1.0.98943/tilepro/src/sys/linux/arch/tile/lib/spinlock_32.c
       2010-04-02 11:08:02.000000000 +0800
+++ arch/tile/lib/spinlock_32.c 2011-02-18 16:05:31.000000000 +0800
@@ -98,7 +98,18 @@ static inline u32 get_rwlock(raw_rwlock_
 #ifdef __TILECC__
 #pragma frequency_hint NEVER
 #endif
+            /*
+             * get_rwlock() now has to be called in an Interrupt
+             * Critical Section, so it can't be called from
+             * these __raw_write_xxx() functions anymore!
+             *
+             * We leave Interrupt Critical Section for making
+             * interrupt delay minimal.
+             * Is that really needed???
+             */
+            __raw_read_lock_leave_critical();
                        delay_backoff(iterations++);
+            __raw_read_lock_enter_critical();
                        continue;
                }
                return val;
@@ -152,7 +163,14 @@ void __raw_read_lock_slow(raw_rwlock_t *
        do {
                if (!(val & 1))
                        rwlock->lock = val;
+        /*
+         * We leave Interrupt Critical Section for making
+         * interrupt delay minimal.
+         * Is that really needed???
+         */
+        __raw_read_lock_leave_critical();
                delay_backoff(iterations++);
+        __raw_read_lock_enter_critical();
                val = __insn_tns((int *)&rwlock->lock);
        } while ((val << RD_COUNT_WIDTH) != 0);
        rwlock->lock = val + (1 << RD_COUNT_SHIFT);
@@ -166,23 +184,30 @@ void __raw_write_lock_slow(raw_rwlock_t
         * when we compare them.
         */
        u32 my_ticket_;
+       u32 iterations = 0;

-       /* Take out the next ticket; this will also stop would-be readers. */
-       if (val & 1)
-               val = get_rwlock(rwlock);
-       rwlock->lock = __insn_addb(val, 1 << WR_NEXT_SHIFT);
+       /*
+        * Wait until there are no readers, then bump up the next
+        * field and capture the ticket value.
+        */
+       for (;;) {
+               if (!(val & 1)) {
+                       if ((val >> RD_COUNT_SHIFT) == 0)
+                               break;
+                       rwlock->lock = val;
+               }
+               delay_backoff(iterations++);
+               val = __insn_tns((int *)&rwlock->lock);
+       }

-       /* Extract my ticket value from the original word. */
+       /* Take out the next ticket and extract my ticket value. */
+       rwlock->lock = __insn_addb(val, 1 << WR_NEXT_SHIFT);
        my_ticket_ = val >> WR_NEXT_SHIFT;

-       /*
-        * Wait until the "current" field matches our ticket, and
-        * there are no remaining readers.
-        */
+       /* Wait until the "current" field matches our ticket. */
        for (;;) {
                u32 curr_ = val >> WR_CURR_SHIFT;
-               u32 readers = val >> RD_COUNT_SHIFT;
-               u32 delta = ((my_ticket_ - curr_) & WR_MASK) + !!readers;
+               u32 delta = ((my_ticket_ - curr_) & WR_MASK);
                if (likely(delta == 0))
                        break;


-- 
Cyberman Wu


* [PATCH] ipvs: unify the formula to estimate the overhead of processing connections
From: Changli Gao @ 2011-02-19  1:18 UTC (permalink / raw)
  To: Simon Horman
  Cc: David S. Miller, Patrick McHardy, Wensong Zhang, Julian Anastasov,
	netdev, lvs-devel, netfilter-devel, Changli Gao

lc and wlc use the same formula, but lblc and lblcr use another one. There
is no reason for using two different formulas for the lc variants.

The formula used by lc is used by all the lc variants in this patch.
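
For illustration only, here is a userspace sketch of how the two
formulas can rank the same pair of destinations differently.  The
struct and function names are hypothetical stand-ins (plain ints instead
of atomic_t); only the two formulas and the cross-multiplied comparison
are taken from the code touched by this patch.

```c
#include <stdbool.h>

/* Userspace stand-in for the fields of struct ip_vs_dest used here. */
struct dest_model {
	int activeconns;
	int inactconns;
	int weight;
};

/* Formula used by lc/wlc: activeconns*256 + inactconns. */
unsigned int dest_overhead(const struct dest_model *d)
{
	return ((unsigned int)d->activeconns << 8) +
		(unsigned int)d->inactconns;
}

/* Formula previously used by lblc/lblcr: activeconns*50 + inactconns. */
unsigned int dest_overhead_old(const struct dest_model *d)
{
	return (unsigned int)d->activeconns * 50 +
		(unsigned int)d->inactconns;
}

/*
 * True when cand carries less load than least, with load defined as
 * overhead / weight.  Cross-multiplying avoids division and floats:
 *   loh/lw > doh/dw  <=>  loh*dw > doh*lw   (for lw, dw > 0)
 */
bool dest_less_loaded(const struct dest_model *cand,
		      const struct dest_model *least)
{
	return (unsigned long long)dest_overhead(least) *
		(unsigned long long)cand->weight >
	       (unsigned long long)dest_overhead(cand) *
		(unsigned long long)least->weight;
}
```

With equal weights, a destination with 10 active and 0 inactive
connections against one with 9 active and 100 inactive scores 500 vs 550
under the old lblc formula but 2560 vs 2404 under the lc formula, so the
preferred destination flips.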

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
---
 include/net/ip_vs.h              |   14 ++++++++++++++
 net/netfilter/ipvs/ip_vs_lblc.c  |   13 +++----------
 net/netfilter/ipvs/ip_vs_lblcr.c |   25 +++++++------------------
 net/netfilter/ipvs/ip_vs_lc.c    |   16 ----------------
 net/netfilter/ipvs/ip_vs_wlc.c   |   20 ++------------------
 5 files changed, 26 insertions(+), 62 deletions(-)
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 5d75fea..d9dac3b 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -1241,6 +1241,20 @@ static inline void ip_vs_conn_drop_conntrack(struct ip_vs_conn *cp)
 /* CONFIG_IP_VS_NFCT */
 #endif
 
+static inline unsigned int
+ip_vs_lc_dest_overhead(struct ip_vs_dest *dest)
+{
+	/*
+	 * We think the overhead of processing active connections is 256
+	 * times higher than that of inactive connections in average. (This
+	 * 256 times might not be accurate, we will change it later) We
+	 * use the following formula to estimate the overhead now:
+	 *		  dest->activeconns*256 + dest->inactconns
+	 */
+	return (atomic_read(&dest->activeconns) << 8) +
+		atomic_read(&dest->inactconns);
+}
+
 #endif /* __KERNEL__ */
 
 #endif	/* _NET_IP_VS_H */
diff --git a/net/netfilter/ipvs/ip_vs_lblc.c b/net/netfilter/ipvs/ip_vs_lblc.c
index 00b5ffa..bb6f7a7 100644
--- a/net/netfilter/ipvs/ip_vs_lblc.c
+++ b/net/netfilter/ipvs/ip_vs_lblc.c
@@ -389,12 +389,7 @@ __ip_vs_lblc_schedule(struct ip_vs_service *svc)
 	int loh, doh;
 
 	/*
-	 * We think the overhead of processing active connections is fifty
-	 * times higher than that of inactive connections in average. (This
-	 * fifty times might not be accurate, we will change it later.) We
-	 * use the following formula to estimate the overhead:
-	 *                dest->activeconns*50 + dest->inactconns
-	 * and the load:
+	 * We use the following formula to estimate the load:
 	 *                (dest overhead) / dest->weight
 	 *
 	 * Remember -- no floats in kernel mode!!!
@@ -410,8 +405,7 @@ __ip_vs_lblc_schedule(struct ip_vs_service *svc)
 			continue;
 		if (atomic_read(&dest->weight) > 0) {
 			least = dest;
-			loh = atomic_read(&least->activeconns) * 50
-				+ atomic_read(&least->inactconns);
+			loh = ip_vs_lc_dest_overhead(least);
 			goto nextstage;
 		}
 	}
@@ -425,8 +419,7 @@ __ip_vs_lblc_schedule(struct ip_vs_service *svc)
 		if (dest->flags & IP_VS_DEST_F_OVERLOAD)
 			continue;
 
-		doh = atomic_read(&dest->activeconns) * 50
-			+ atomic_read(&dest->inactconns);
+		doh = ip_vs_lc_dest_overhead(dest);
 		if (loh * atomic_read(&dest->weight) >
 		    doh * atomic_read(&least->weight)) {
 			least = dest;
diff --git a/net/netfilter/ipvs/ip_vs_lblcr.c b/net/netfilter/ipvs/ip_vs_lblcr.c
index bfa25f1..d5b57e2 100644
--- a/net/netfilter/ipvs/ip_vs_lblcr.c
+++ b/net/netfilter/ipvs/ip_vs_lblcr.c
@@ -178,8 +178,7 @@ static inline struct ip_vs_dest *ip_vs_dest_set_min(struct ip_vs_dest_set *set)
 
 		if ((atomic_read(&least->weight) > 0)
 		    && (least->flags & IP_VS_DEST_F_AVAILABLE)) {
-			loh = atomic_read(&least->activeconns) * 50
-				+ atomic_read(&least->inactconns);
+			loh = ip_vs_lc_dest_overhead(least);
 			goto nextstage;
 		}
 	}
@@ -192,8 +191,7 @@ static inline struct ip_vs_dest *ip_vs_dest_set_min(struct ip_vs_dest_set *set)
 		if (dest->flags & IP_VS_DEST_F_OVERLOAD)
 			continue;
 
-		doh = atomic_read(&dest->activeconns) * 50
-			+ atomic_read(&dest->inactconns);
+		doh = ip_vs_lc_dest_overhead(dest);
 		if ((loh * atomic_read(&dest->weight) >
 		     doh * atomic_read(&least->weight))
 		    && (dest->flags & IP_VS_DEST_F_AVAILABLE)) {
@@ -228,8 +226,7 @@ static inline struct ip_vs_dest *ip_vs_dest_set_max(struct ip_vs_dest_set *set)
 	list_for_each_entry(e, &set->list, list) {
 		most = e->dest;
 		if (atomic_read(&most->weight) > 0) {
-			moh = atomic_read(&most->activeconns) * 50
-				+ atomic_read(&most->inactconns);
+			moh = ip_vs_lc_dest_overhead(most);
 			goto nextstage;
 		}
 	}
@@ -239,8 +236,7 @@ static inline struct ip_vs_dest *ip_vs_dest_set_max(struct ip_vs_dest_set *set)
   nextstage:
 	list_for_each_entry(e, &set->list, list) {
 		dest = e->dest;
-		doh = atomic_read(&dest->activeconns) * 50
-			+ atomic_read(&dest->inactconns);
+		doh = ip_vs_lc_dest_overhead(dest);
 		/* moh/mw < doh/dw ==> moh*dw < doh*mw, where mw,dw>0 */
 		if ((moh * atomic_read(&dest->weight) <
 		     doh * atomic_read(&most->weight))
@@ -563,12 +559,7 @@ __ip_vs_lblcr_schedule(struct ip_vs_service *svc)
 	int loh, doh;
 
 	/*
-	 * We think the overhead of processing active connections is fifty
-	 * times higher than that of inactive connections in average. (This
-	 * fifty times might not be accurate, we will change it later.) We
-	 * use the following formula to estimate the overhead:
-	 *                dest->activeconns*50 + dest->inactconns
-	 * and the load:
+	 * We use the following formula to estimate the load:
 	 *                (dest overhead) / dest->weight
 	 *
 	 * Remember -- no floats in kernel mode!!!
@@ -585,8 +576,7 @@ __ip_vs_lblcr_schedule(struct ip_vs_service *svc)
 
 		if (atomic_read(&dest->weight) > 0) {
 			least = dest;
-			loh = atomic_read(&least->activeconns) * 50
-				+ atomic_read(&least->inactconns);
+			loh = ip_vs_lc_dest_overhead(least);
 			goto nextstage;
 		}
 	}
@@ -600,8 +590,7 @@ __ip_vs_lblcr_schedule(struct ip_vs_service *svc)
 		if (dest->flags & IP_VS_DEST_F_OVERLOAD)
 			continue;
 
-		doh = atomic_read(&dest->activeconns) * 50
-			+ atomic_read(&dest->inactconns);
+		doh = ip_vs_lc_dest_overhead(dest);
 		if (loh * atomic_read(&dest->weight) >
 		    doh * atomic_read(&least->weight)) {
 			least = dest;
diff --git a/net/netfilter/ipvs/ip_vs_lc.c b/net/netfilter/ipvs/ip_vs_lc.c
index 4f69db1..d8e2975 100644
--- a/net/netfilter/ipvs/ip_vs_lc.c
+++ b/net/netfilter/ipvs/ip_vs_lc.c
@@ -22,22 +22,6 @@
 
 #include <net/ip_vs.h>
 
-
-static inline unsigned int
-ip_vs_lc_dest_overhead(struct ip_vs_dest *dest)
-{
-	/*
-	 * We think the overhead of processing active connections is 256
-	 * times higher than that of inactive connections in average. (This
-	 * 256 times might not be accurate, we will change it later) We
-	 * use the following formula to estimate the overhead now:
-	 *		  dest->activeconns*256 + dest->inactconns
-	 */
-	return (atomic_read(&dest->activeconns) << 8) +
-		atomic_read(&dest->inactconns);
-}
-
-
 /*
  *	Least Connection scheduling
  */
diff --git a/net/netfilter/ipvs/ip_vs_wlc.c b/net/netfilter/ipvs/ip_vs_wlc.c
index bbddfdb..45cbaaf 100644
--- a/net/netfilter/ipvs/ip_vs_wlc.c
+++ b/net/netfilter/ipvs/ip_vs_wlc.c
@@ -27,22 +27,6 @@
 
 #include <net/ip_vs.h>
 
-
-static inline unsigned int
-ip_vs_wlc_dest_overhead(struct ip_vs_dest *dest)
-{
-	/*
-	 * We think the overhead of processing active connections is 256
-	 * times higher than that of inactive connections in average. (This
-	 * 256 times might not be accurate, we will change it later) We
-	 * use the following formula to estimate the overhead now:
-	 *		  dest->activeconns*256 + dest->inactconns
-	 */
-	return (atomic_read(&dest->activeconns) << 8) +
-		atomic_read(&dest->inactconns);
-}
-


* [net-2.6 PATCH v2] net: dcb: match dcb_app protocol field with 802.1Qaz spec
From: John Fastabend @ 2011-02-18 23:30 UTC (permalink / raw)
  To: davem; +Cc: john.r.fastabend, shmulikr, netdev

The dcb_app protocol field is a __u32; however, the 802.1Qaz
specification defines it as a 16 bit field. This patch brings
the structure in line with the spec, making it a __u16.
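
As a side note on the layout, a small userspace sketch with fixed-width
types standing in for __u8/__u16/__u32 (the struct names here are
hypothetical).  On ABIs that align a 32-bit integer to 4 bytes, the old
layout occupies 12 bytes including padding, while the new one packs into
4 bytes with the protocol field at offset 2.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout before the patch: a 32-bit protocol between two bytes. */
struct dcb_app_old {
	uint8_t  selector;
	uint32_t protocol;
	uint8_t  priority;
};

/* Layout after the patch: 16-bit protocol, moved to the end. */
struct dcb_app_new {
	uint8_t  selector;
	uint8_t  priority;
	uint16_t protocol;
};
```

Moving the field after priority is what lets the two single-byte
members share the first 16 bits without padding.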

CC: Shmulik Ravid <shmulikr@broadcom.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

 include/linux/dcbnl.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/dcbnl.h b/include/linux/dcbnl.h
index 68cd248..66900e3 100644
--- a/include/linux/dcbnl.h
+++ b/include/linux/dcbnl.h
@@ -101,8 +101,8 @@ struct ieee_pfc {
  */
 struct dcb_app {
 	__u8	selector;
-	__u32	protocol;
 	__u8	priority;
+	__u16	protocol;
 };
 
 struct dcbmsg {



* Re: ixgbe: 82599 and Westmere with HT
From: Brandeburg, Jesse @ 2011-02-18 23:18 UTC (permalink / raw)
  To: Andrew Dickinson; +Cc: netdev@vger.kernel.org
In-Reply-To: <AANLkTi=1d8aJoFU7gsNy-sYjJomGp-Xz4K+ddKSOk-ir@mail.gmail.com>

On Wed, 16 Feb 2011, Andrew Dickinson wrote:
> I've got a dual Westmere board (X5675) with an X520 card with dual
> 10G.  I see 24-cores exposed to me and the ixgbe driver exposes 24 tx
> and 24 rx interrupts per NIC.  I then pin the interrupts to cores for
> each NIC (each interrupt gets its own core, standard stuff).

we have a script for that in our standalone driver package at sourceforge 
called set_irq_affinity.sh, fyi.
 
> Anyway.... I'm only seeing RX interrupts on 16 of the 24 cores (random
> src/dest pairs across a /16 each, so I should be getting good flow
> hashing).  Did I miss some magic somewhere?

the way our hardware works is by first looking for a flow director table 
match for the flow (it only supports TCP flows, not UDP or plain IP), and 
then it falls back to using RSS (which only supports 16 queues).
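
a very rough model of that two-stage decision, to show why traffic that
never hits a flow director entry can only land on 16 queues.  This is a
sketch under assumptions and not driver or hardware code: the table
layout, the names, and the use of a simple modulo in place of the real
RSS hash and indirection table are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define RSS_QUEUES 16	/* from above: RSS only supports 16 queues */

/* Hypothetical flow director entry: flow key -> receive queue. */
struct fdir_entry {
	uint32_t flow_hash;
	unsigned int queue;
};

unsigned int pick_rx_queue(const struct fdir_entry *table, size_t n,
			   uint32_t flow_hash, int is_tcp)
{
	/* Stage 1: flow director, TCP flows only. */
	if (is_tcp)
		for (size_t i = 0; i < n; i++)
			if (table[i].flow_hash == flow_hash)
				return table[i].queue;

	/* Stage 2: RSS fallback, limited to the first 16 queues. */
	return flow_hash % RSS_QUEUES;
}
```

With an empty table every packet takes the second stage, so queues 16
to 23 never see receive traffic no matter how well the flows hash.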
 
> I'm running 2.6.32.4, perhaps this has been fixed upstream.  If not,
> any thoughts on how to make this work?

if the adapter port in question isn't transmitting any packets then the 
flow director will not have any transmits to use to set up the symmetric 
"receive flows".  The latest drivers from sourceforge have the fix to 
update the flow director tables with every transmit, and net-next has it 
too now[1].

hope this helps, 
Jesse

[1] 
http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git;a=commit;h=69830529b26e6dc9582a4b65ab88f40f050cf94e





* Re: [patch net-next-2.6 V2] net: convert bonding to use rx_handler
From: Jay Vosburgh @ 2011-02-18 23:06 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: David Miller, kaber, eric.dumazet, netdev, shemminger,
	nicolas.2p.debian, andy
In-Reply-To: <20110218205832.GE2602@psychotron.redhat.com>

Jiri Pirko <jpirko@redhat.com> wrote:

>This patch converts bonding to use rx_handler. Results in cleaner
>__netif_receive_skb() with much less exceptions needed. Also bond-specific
>work is moved into bond code.
>
>Signed-off-by: Jiri Pirko <jpirko@redhat.com>
>
>v1->v2:
>	using skb_iif instead of new input_dev to remember original device
>
>---
> drivers/net/bonding/bond_main.c |   75 ++++++++++++++++++++++++++-
> net/core/dev.c                  |  111 ++++++++-------------------------------
> 2 files changed, 97 insertions(+), 89 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 77e3c6a..a856a11 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1423,6 +1423,68 @@ static void bond_setup_by_slave(struct net_device *bond_dev,
> 	bond->setup_by_slave = 1;
> }
>
>+/* On bonding slaves other than the currently active slave, suppress
>+ * duplicates except for 802.3ad ETH_P_SLOW, alb non-mcast/bcast, and
>+ * ARP on active-backup slaves with arp_validate enabled.
>+ */
>+static bool bond_should_deliver_exact_match(struct sk_buff *skb,
>+					    struct net_device *slave_dev,
>+					    struct net_device *bond_dev)
>+{
>+	if (slave_dev->priv_flags & IFF_SLAVE_INACTIVE) {
>+		if (slave_dev->priv_flags & IFF_SLAVE_NEEDARP &&
>+		    skb->protocol == __cpu_to_be16(ETH_P_ARP))
>+			return false;
>+
>+		if (bond_dev->priv_flags & IFF_MASTER_ALB &&
>+		    skb->pkt_type != PACKET_BROADCAST &&
>+		    skb->pkt_type != PACKET_MULTICAST)
>+				return false;
>+
>+		if (bond_dev->priv_flags & IFF_MASTER_8023AD &&
>+		    skb->protocol == __cpu_to_be16(ETH_P_SLOW))
>+			return false;

	Since this is all in the bonding code now, it should be possible
to do away with using priv_flags for all (or at least most) of this.
Perhaps in a follow-on patch.

>+
>+		return true;
>+	}
>+	return false;
>+}
>+
>+static struct sk_buff *bond_handle_frame(struct sk_buff *skb)
>+{
>+	struct net_device *slave_dev;
>+	struct net_device *bond_dev;
>+
>+	skb = skb_share_check(skb, GFP_ATOMIC);
>+	if (unlikely(!skb))
>+		return NULL;
>+	slave_dev = skb->dev;
>+	bond_dev = ACCESS_ONCE(slave_dev->master);
>+	if (unlikely(!bond_dev))
>+		return skb;
>+
>+	if (bond_dev->priv_flags & IFF_MASTER_ARPMON)
>+		slave_dev->last_rx = jiffies;

	The last_rx field could probably move into bonding as well,
although it looks like there are a couple of drivers using last_rx for
something (more than just setting it).

>+	if (bond_should_deliver_exact_match(skb, slave_dev, bond_dev)) {
>+		skb->deliver_no_wcard = 1;
>+		return skb;
>+	}
>+
>+	skb->dev = bond_dev;
>+
>+	if (bond_dev->priv_flags & IFF_MASTER_ALB &&
>+	    bond_dev->priv_flags & IFF_BRIDGE_PORT &&
>+	    skb->pkt_type == PACKET_HOST) {
>+		u16 *dest = (u16 *) eth_hdr(skb)->h_dest;
>+
>+		memcpy(dest, bond_dev->dev_addr, ETH_ALEN);
>+	}
>+
>+	netif_rx(skb);
>+	return NULL;
>+}
>+
> /* enslave device <slave> to bond device <master> */
> int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
> {
>@@ -1599,11 +1661,17 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
> 		pr_debug("Error %d calling netdev_set_bond_master\n", res);
> 		goto err_restore_mac;
> 	}
>+	res = netdev_rx_handler_register(slave_dev, bond_handle_frame, NULL);
>+	if (res) {
>+		pr_debug("Error %d calling netdev_rx_handler_register\n", res);
>+		goto err_unset_master;
>+	}
>+
> 	/* open the slave since the application closed it */
> 	res = dev_open(slave_dev);
> 	if (res) {
> 		pr_debug("Opening slave %s failed\n", slave_dev->name);
>-		goto err_unset_master;
>+		goto err_unreg_rxhandler;
> 	}
>
> 	new_slave->dev = slave_dev;
>@@ -1811,6 +1879,9 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
> err_close:
> 	dev_close(slave_dev);
>
>+err_unreg_rxhandler:
>+	netdev_rx_handler_unregister(slave_dev);
>+
> err_unset_master:
> 	netdev_set_bond_master(slave_dev, NULL);
>
>@@ -1992,6 +2063,7 @@ int bond_release(struct net_device *bond_dev, struct net_device *slave_dev)
> 		netif_addr_unlock_bh(bond_dev);
> 	}
>
>+	netdev_rx_handler_unregister(slave_dev);
> 	netdev_set_bond_master(slave_dev, NULL);
>
> #ifdef CONFIG_NET_POLL_CONTROLLER
>@@ -2114,6 +2186,7 @@ static int bond_release_all(struct net_device *bond_dev)
> 			netif_addr_unlock_bh(bond_dev);
> 		}
>
>+		netdev_rx_handler_unregister(slave_dev);
> 		netdev_set_bond_master(slave_dev, NULL);
>
> 		/* close slave before restoring its mac address */
>diff --git a/net/core/dev.c b/net/core/dev.c
>index 4f69439..580cff1 100644
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -3092,63 +3092,31 @@ void netdev_rx_handler_unregister(struct net_device *dev)
> }
> EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
>
>-static inline void skb_bond_set_mac_by_master(struct sk_buff *skb,
>-					      struct net_device *master)
>+static void vlan_on_bond_hook(struct sk_buff *skb)
> {
>-	if (skb->pkt_type == PACKET_HOST) {
>-		u16 *dest = (u16 *) eth_hdr(skb)->h_dest;
>-
>-		memcpy(dest, master->dev_addr, ETH_ALEN);
>-	}
>-}
>-
>-/* On bonding slaves other than the currently active slave, suppress
>- * duplicates except for 802.3ad ETH_P_SLOW, alb non-mcast/bcast, and
>- * ARP on active-backup slaves with arp_validate enabled.
>- */
>-static int __skb_bond_should_drop(struct sk_buff *skb,
>-				  struct net_device *master)
>-{
>-	struct net_device *dev = skb->dev;
>-
>-	if (master->priv_flags & IFF_MASTER_ARPMON)
>-		dev->last_rx = jiffies;
>-
>-	if ((master->priv_flags & IFF_MASTER_ALB) &&
>-	    (master->priv_flags & IFF_BRIDGE_PORT)) {
>-		/* Do address unmangle. The local destination address
>-		 * will be always the one master has. Provides the right
>-		 * functionality in a bridge.
>-		 */
>-		skb_bond_set_mac_by_master(skb, master);
>-	}
>-
>-	if (dev->priv_flags & IFF_SLAVE_INACTIVE) {
>-		if ((dev->priv_flags & IFF_SLAVE_NEEDARP) &&
>-		    skb->protocol == __cpu_to_be16(ETH_P_ARP))
>-			return 0;
>-
>-		if (master->priv_flags & IFF_MASTER_ALB) {
>-			if (skb->pkt_type != PACKET_BROADCAST &&
>-			    skb->pkt_type != PACKET_MULTICAST)
>-				return 0;
>-		}
>-		if (master->priv_flags & IFF_MASTER_8023AD &&
>-		    skb->protocol == __cpu_to_be16(ETH_P_SLOW))
>-			return 0;
>+	/*
>+	 * Make sure ARP frames received on VLAN interfaces stacked on
>+	 * bonding interfaces still make their way to any base bonding
>+	 * device that may have registered for a specific ptype.
>+	 */
>+	if (skb->dev->priv_flags & IFF_802_1Q_VLAN &&
>+	    vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING &&
>+	    skb->protocol == htons(ETH_P_ARP)) {
>+		struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);
>
>-		return 1;
>+		if (!skb2)
>+			return;
>+		skb2->dev = vlan_dev_real_dev(skb->dev);
>+		netif_rx(skb2);
> 	}
>-	return 0;
> }
>
> static int __netif_receive_skb(struct sk_buff *skb)
> {
> 	struct packet_type *ptype, *pt_prev;
> 	rx_handler_func_t *rx_handler;
>+	struct net_device *null_or_dev;
> 	struct net_device *orig_dev;
>-	struct net_device *null_or_orig;
>-	struct net_device *orig_or_bond;
> 	int ret = NET_RX_DROP;
> 	__be16 type;
>
>@@ -3164,30 +3132,6 @@ static int __netif_receive_skb(struct sk_buff *skb)
> 	if (!skb->skb_iif)
> 		skb->skb_iif = skb->dev->ifindex;
>
>-	/*
>-	 * bonding note: skbs received on inactive slaves should only
>-	 * be delivered to pkt handlers that are exact matches.  Also
>-	 * the deliver_no_wcard flag will be set.  If packet handlers
>-	 * are sensitive to duplicate packets these skbs will need to
>-	 * be dropped at the handler.
>-	 */
>-	null_or_orig = NULL;
>-	orig_dev = skb->dev;
>-	if (skb->deliver_no_wcard)
>-		null_or_orig = orig_dev;
>-	else if (netif_is_bond_slave(orig_dev)) {
>-		struct net_device *bond_master = ACCESS_ONCE(orig_dev->master);
>-
>-		if (likely(bond_master)) {
>-			if (__skb_bond_should_drop(skb, bond_master)) {
>-				skb->deliver_no_wcard = 1;
>-				/* deliver only exact match */
>-				null_or_orig = orig_dev;
>-			} else
>-				skb->dev = bond_master;
>-		}
>-	}
>-
> 	__this_cpu_inc(softnet_data.processed);
> 	skb_reset_network_header(skb);
> 	skb_reset_transport_header(skb);
>@@ -3196,6 +3140,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
> 	pt_prev = NULL;
>
> 	rcu_read_lock();
>+	orig_dev = dev_get_by_index_rcu(dev_net(skb->dev), skb->skb_iif);

	Aren't most packets going to have orig_dev == skb->dev at this
point?  Can this be combined with the skb_iif test a few lines above
this in __netif_receive_skb, looking something like:

	if (!skb->skb_iif) {
		skb->skb_iif = skb->dev->ifindex;
		orig_dev = skb->dev;
	} else {
		orig_dev = dev_get_by_index_rcu(...);
	}

	Presumably moving the whole thing down inside the rcu_read_lock.

	VLAN packets should come through here twice, but the first time
through is before the call to vlan_hwaccel_do_receive, so skb->dev
hasn't been set to the VLAN's dev yet.

	Unless, of course, you find a place to store the orig_dev.

	-J

> #ifdef CONFIG_NET_CLS_ACT
> 	if (skb->tc_verd & TC_NCLS) {
>@@ -3205,8 +3150,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
> #endif
>
> 	list_for_each_entry_rcu(ptype, &ptype_all, list) {
>-		if (ptype->dev == null_or_orig || ptype->dev == skb->dev ||
>-		    ptype->dev == orig_dev) {
>+		if (!ptype->dev || ptype->dev == skb->dev) {
> 			if (pt_prev)
> 				ret = deliver_skb(skb, pt_prev, orig_dev);
> 			pt_prev = ptype;
>@@ -3220,7 +3164,6 @@ static int __netif_receive_skb(struct sk_buff *skb)
> ncls:
> #endif
>
>-	/* Handle special case of bridge or macvlan */
> 	rx_handler = rcu_dereference(skb->dev->rx_handler);
> 	if (rx_handler) {
> 		if (pt_prev) {
>@@ -3244,24 +3187,16 @@ ncls:
> 			goto out;
> 	}
>
>-	/*
>-	 * Make sure frames received on VLAN interfaces stacked on
>-	 * bonding interfaces still make their way to any base bonding
>-	 * device that may have registered for a specific ptype.  The
>-	 * handler may have to adjust skb->dev and orig_dev.
>-	 */
>-	orig_or_bond = orig_dev;
>-	if ((skb->dev->priv_flags & IFF_802_1Q_VLAN) &&
>-	    (vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING)) {
>-		orig_or_bond = vlan_dev_real_dev(skb->dev);
>-	}
>+	vlan_on_bond_hook(skb);
>+
>+	/* deliver only exact match when indicated */
>+	null_or_dev = skb->deliver_no_wcard ? skb->dev : NULL;
>
> 	type = skb->protocol;
> 	list_for_each_entry_rcu(ptype,
> 			&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
>-		if (ptype->type == type && (ptype->dev == null_or_orig ||
>-		     ptype->dev == skb->dev || ptype->dev == orig_dev ||
>-		     ptype->dev == orig_or_bond)) {
>+		if (ptype->type == type &&
>+		    (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
> 			if (pt_prev)
> 				ret = deliver_skb(skb, pt_prev, orig_dev);
> 			pt_prev = ptype;
>-- 
>1.7.3.4
>

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* Re: [PATCH] bonding: bond_select_queue off by one
From: Ben Hutchings @ 2011-02-18 23:06 UTC (permalink / raw)
  To: Andy Gospodarek; +Cc: Jay Vosburgh, Phil Oester, netdev
In-Reply-To: <20110218224958.GC11864@gospo.rdu.redhat.com>

On Fri, 2011-02-18 at 17:49 -0500, Andy Gospodarek wrote:
> On Thu, Feb 17, 2011 at 08:41:48PM -0800, Jay Vosburgh wrote:
> > Phil Oester <kernel@linuxace.com> wrote:
> > 
> > >The bonding driver's bond_select_queue function simply returns
> > >skb->queue_mapping.  However queue_mapping could be == 16
> > >for queue #16.  This causes the following message to be flooded
> > >to syslog:
> > >
> > >kernel: bondx selects TX queue 16, but real number of TX queues is 16
> > >
> > >ndo_select_queue wants a zero-based number, so bonding driver needs
> > >to subtract one to return the proper queue number.  Also fix grammar in
> > >a comment while in the vicinity.
> > 
> > 	Andy, can you comment on this?
> > 
> > 	If memory serves, the omission of queue ID zero was on purpose;
> > is this patch going to break any of the functionality added by:
> > 
> > commit bb1d912323d5dd50e1079e389f4e964be14f0ae3
> > Author: Andy Gospodarek <andy@greyhouse.net>
> > Date:   Wed Jun 2 08:40:18 2010 +0000
> > 
> >     bonding: allow user-controlled output slave selection
> > 
> 
> My original intent was that a queue_mapping == 0 would indicate that the
> mode's default transmit routine would be used.  We could still operate
> under this assumption, however.  I think the patch below will work.
> 
> > Ben Hutchings <bhutchings@solarflare.com> wrote:
> > 
> > >This looks basically correct, but it should use the proper functions:
> > >
> > >	skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
> > 
> > 	As Ben points out, skb_rx_queue_recorded, skb_record_rx_queue,
> > et al, do the offset by one internally, but the bond_slave_override
> > function is comparing the slave's queue_id to the skb->queue_mapping.
> > 
> > 	That makes me wonder if this patch is going to mess things up,
> > and if bond_slave_override should also use the skb_rx_queue_recorded, et
> > al, functions.
> > 
> 
> They could use them, but I really dislike using functions with 'rx'
> in the name for options that are clearly for transmit.

This isn't an option for transmit, it is a record of the result of RX
hashing (or steering).  It may or may not then be used to select a TX
queue.

[...]
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -2194,6 +2194,21 @@ static inline bool skb_rx_queue_recorded(const struct sk_buff *skb)
>         return skb->queue_mapping != 0;
>  }
>  
> +static inline void skb_record_tx_queue(struct sk_buff *skb, u16 tx_queue)
> +{
> +       skb->queue_mapping = tx_queue + 1;
> +}
> +
> +static inline u16 skb_get_tx_queue(const struct sk_buff *skb)
> +{
> +       return skb->queue_mapping - 1;
> +}
> +
> +static inline bool skb_tx_queue_recorded(const struct sk_buff *skb)
> +{
> +       return skb->queue_mapping != 0;
> +}
> +
[...]

This is nonsense.  After the TX queue has been selected, it's recorded
in queue_mapping *without* the offset (skb_set_queue_mapping()).
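
To make the mismatch concrete, here is a minimal userspace sketch of the
two conventions, assuming a cut-down stand-in for struct sk_buff.  All
model_* names are hypothetical and exist only for this illustration; they
mirror the offset arithmetic described above, not the real kernel
definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct sk_buff: only the field under discussion. */
struct model_skb {
	uint16_t queue_mapping;
};

/* RX side: stored with a +1 offset so that 0 can mean "nothing recorded". */
static inline void model_record_rx_queue(struct model_skb *skb, uint16_t rx_queue)
{
	skb->queue_mapping = rx_queue + 1;
}

static inline bool model_rx_queue_recorded(const struct model_skb *skb)
{
	return skb->queue_mapping != 0;
}

static inline uint16_t model_get_rx_queue(const struct model_skb *skb)
{
	return skb->queue_mapping - 1;
}

/* TX side: the selected queue is stored as-is, with no offset. */
static inline void model_set_queue_mapping(struct model_skb *skb, uint16_t queue)
{
	skb->queue_mapping = queue;
}

static inline uint16_t model_get_queue_mapping(const struct model_skb *skb)
{
	return skb->queue_mapping;
}

/* The getter proposed in the quoted patch: it subtracts one on read. */
static inline uint16_t model_proposed_get_tx_queue(const struct model_skb *skb)
{
	return skb->queue_mapping - 1;
}
```

Under these assumptions, a value stored by model_set_queue_mapping(skb, 3)
and read back through model_proposed_get_tx_queue() comes out as 2, which
is the kind of disagreement the objection above is about.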

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH] bonding: bond_select_queue off by one
From: Andy Gospodarek @ 2011-02-18 22:49 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: Andy Gospodarek, Phil Oester, netdev
In-Reply-To: <22094.1298004108@death>

On Thu, Feb 17, 2011 at 08:41:48PM -0800, Jay Vosburgh wrote:
> Phil Oester <kernel@linuxace.com> wrote:
> 
> >The bonding driver's bond_select_queue function simply returns
> >skb->queue_mapping.  However queue_mapping could be == 16
> >for queue #16.  This causes the following message to be flooded
> >to syslog:
> >
> >kernel: bondx selects TX queue 16, but real number of TX queues is 16
> >
> >ndo_select_queue wants a zero-based number, so bonding driver needs
> >to subtract one to return the proper queue number.  Also fix grammar in
> >a comment while in the vicinity.
> 
> 	Andy, can you comment on this?
> 
> 	If memory serves, the omission of queue ID zero was on purpose;
> is this patch going to break any of the functionality added by:
> 
> commit bb1d912323d5dd50e1079e389f4e964be14f0ae3
> Author: Andy Gospodarek <andy@greyhouse.net>
> Date:   Wed Jun 2 08:40:18 2010 +0000
> 
>     bonding: allow user-controlled output slave selection
> 

My original intent was that a queue_mapping == 0 would indicate that the
mode's default transmit routine would be used.  We could still operate
under this assumption, however.  I think the patch below will work.
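
As a rough illustration of that convention (not the driver code itself),
the following userspace sketch treats a mapping of 0 as "use the mode's
default transmit routine" and otherwise looks for a slave with a matching
queue id.  struct model_slave and model_slave_override() are hypothetical
names made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down slave description for illustration only. */
struct model_slave {
	uint16_t queue_id;	/* 0 means no queue has been assigned */
	int link_up;		/* non-zero when the slave can transmit */
};

/*
 * Return the index of the slave that should carry the frame, or -1 when
 * the bonding mode's default transmit routine should be used instead.
 */
static inline int model_slave_override(const struct model_slave *slaves,
				       size_t n, uint16_t queue_mapping)
{
	size_t i;

	/* A mapping of 0 means no override was requested. */
	if (queue_mapping == 0)
		return -1;

	for (i = 0; i < n; i++) {
		if (slaves[i].queue_id == queue_mapping) {
			/* First match wins; a down slave falls back to default. */
			return slaves[i].link_up ? (int)i : -1;
		}
	}
	return -1;
}
```

With slaves whose queue ids are {0, 2, 3} and the third one down, a
mapping of 2 selects the second slave, while mappings of 0, 3 and 5 all
fall back to the default policy.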

> Ben Hutchings <bhutchings@solarflare.com> wrote:
> 
> >This looks basically correct, but it should use the proper functions:
> >
> >	skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
> 
> 	As Ben points out, skb_rx_queue_recorded, skb_record_rx_queue,
> et al, do the offset by one internally, but the bond_slave_override
> function is comparing the slave's queue_id to the skb->queue_mapping.
> 
> 	That makes me wonder if this patch is going to mess things up,
> and if bond_slave_override should also use the skb_rx_queue_recorded, et
> al, functions.
> 

They could use them, but I really dislike using functions with 'rx'
in the name for options that are clearly for transmit.  I would rather
get rid of access to queue_id or queue_mapping in the code in question.

How about something like this?  I have not fully tested this, but will
and would appreciate feedback from Phil or anyone else.

Signed-off-by: Andy Gospodarek <andy@greyhouse.net>

---
 drivers/net/bonding/bond_main.c  |   11 ++++++-----
 drivers/net/bonding/bond_sysfs.c |    2 +-
 drivers/net/bonding/bonding.h    |   16 ++++++++++++++++
 include/linux/skbuff.h           |   15 +++++++++++++++
 net/core/dev.c                   |    4 ++--
 5 files changed, 40 insertions(+), 8 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 77e3c6a..02d8161 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4511,20 +4511,21 @@ static inline int bond_slave_override(struct bonding *bond,
 
 	read_lock(&bond->lock);
 
-	if (!BOND_IS_OK(bond) || !skb->queue_mapping)
+	if (!BOND_IS_OK(bond) || !skb_tx_queue_recorded(skb))
 		goto out;
 
 	/* Find out if any slaves have the same mapping as this skb. */
 	bond_for_each_slave(bond, check_slave, i) {
-		if (check_slave->queue_id == skb->queue_mapping) {
+		if (slave_tx_queue_recorded(check_slave) &&
+		    slave_get_tx_queue(check_slave) == skb_get_tx_queue(skb)) {
 			slave = check_slave;
 			break;
 		}
 	}
 
 	/* If the slave isn't UP, use default transmit policy. */
-	if (slave && slave->queue_id && IS_UP(slave->dev) &&
-	    (slave->link == BOND_LINK_UP)) {
+	if (slave && slave_tx_queue_recorded(slave) &&
+	    IS_UP(slave->dev) && (slave->link == BOND_LINK_UP)) {
 		res = bond_dev_queue_xmit(bond, skb, slave->dev);
 	}
 
@@ -4541,7 +4542,7 @@ static u16 bond_select_queue(struct net_device *dev, struct sk_buff *skb)
 	 * skb_tx_hash and will put the skbs in the queue we expect on their
 	 * way down to the bonding driver.
 	 */
-	return skb->queue_mapping;
+	return skb_tx_queue_recorded(skb) ? skb_get_tx_queue(skb) : 0;
 }
 
 static netdev_tx_t bond_start_xmit(struct sk_buff *skb, struct net_device *dev)
diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 72bb0f6..b3bc092 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1499,7 +1499,7 @@ static ssize_t bonding_store_queue_id(struct device *d,
 	/* Check buffer length, valid ifname and queue id */
 	if (strlen(buffer) > IFNAMSIZ ||
 	    !dev_valid_name(buffer) ||
-	    qid > bond->params.tx_queues)
+	    qid >= bond->params.tx_queues)
 		goto err_no_cmd;
 
 	/* Get the pointer to that interface if it exists */
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 31fe980..75b5798 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -422,4 +422,20 @@ static inline void bond_unregister_ipv6_notifier(void)
 }
 #endif
 
+static inline void slave_record_tx_queue(struct slave *slave, u16 tx_queue)
+{
+	slave->queue_id = tx_queue + 1;
+}
+
+static inline u16 slave_get_tx_queue(const struct slave *slave)
+{
+	return slave->queue_id - 1;
+}
+
+static inline bool slave_tx_queue_recorded(const struct slave *slave)
+{
+	return slave->queue_id != 0;
+}
+
+
 #endif /* _LINUX_BONDING_H */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 31f02d0..49d101c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2194,6 +2194,21 @@ static inline bool skb_rx_queue_recorded(const struct sk_buff *skb)
 	return skb->queue_mapping != 0;
 }
 
+static inline void skb_record_tx_queue(struct sk_buff *skb, u16 tx_queue)
+{
+	skb->queue_mapping = tx_queue + 1;
+}
+
+static inline u16 skb_get_tx_queue(const struct sk_buff *skb)
+{
+	return skb->queue_mapping - 1;
+}
+
+static inline bool skb_tx_queue_recorded(const struct sk_buff *skb)
+{
+	return skb->queue_mapping != 0;
+}
+
 extern u16 __skb_tx_hash(const struct net_device *dev,
 			 const struct sk_buff *skb,
 			 unsigned int num_tx_queues);
diff --git a/net/core/dev.c b/net/core/dev.c
index a413276..50aa490 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2211,8 +2211,8 @@ u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
 	u16 qoffset = 0;
 	u16 qcount = num_tx_queues;
 
-	if (skb_rx_queue_recorded(skb)) {
-		hash = skb_get_rx_queue(skb);
+	if (skb_tx_queue_recorded(skb)) {
+		hash = skb_get_tx_queue(skb);
 		while (unlikely(hash >= num_tx_queues))
 			hash -= num_tx_queues;
 		return hash;

^ permalink raw reply related

* [v2 PATCH 2/2] ipv4: Implement __ip_dev_find using new interface address hash.
From: David Miller @ 2011-02-18 22:49 UTC (permalink / raw)
  To: netdev


Much quicker than going through the FIB tables.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/devinet.c      |   33 +++++++++++++++++++++++++++++++++
 net/ipv4/fib_frontend.c |   40 ----------------------------------------
 2 files changed, 33 insertions(+), 40 deletions(-)

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 2fe5076..ee144a4 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -125,6 +125,39 @@ static void inet_hash_remove(struct in_ifaddr *ifa)
 	spin_unlock(&inet_addr_hash_lock);
 }
 
+/**
+ * __ip_dev_find - find the first device with a given source address.
+ * @net: the net namespace
+ * @addr: the source address
+ * @devref: if true, take a reference on the found device
+ *
+ * If a caller uses devref=false, it should be protected by RCU, or RTNL
+ */
+struct net_device *__ip_dev_find(struct net *net, __be32 addr, bool devref)
+{
+	unsigned int hash = inet_addr_hash(net, addr);
+	struct net_device *result = NULL;
+	struct in_ifaddr *ifa;
+	struct hlist_node *node;
+
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(ifa, node, &inet_addr_lst[hash], hash) {
+		struct net_device *dev = ifa->ifa_dev->dev;
+
+		if (!net_eq(dev_net(dev), net))
+			continue;
+		if (ifa->ifa_address == addr) {
+			result = dev;
+			break;
+		}
+	}
+	if (result && devref)
+		dev_hold(result);
+	rcu_read_unlock();
+	return result;
+}
+EXPORT_SYMBOL(__ip_dev_find);
+
 static void rtmsg_ifa(int event, struct in_ifaddr *, struct nlmsghdr *, u32);
 
 static BLOCKING_NOTIFIER_HEAD(inetaddr_chain);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 2a49c06..ad0778a 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -132,46 +132,6 @@ static void fib_flush(struct net *net)
 		rt_cache_flush(net, -1);
 }
 
-/**
- * __ip_dev_find - find the first device with a given source address.
- * @net: the net namespace
- * @addr: the source address
- * @devref: if true, take a reference on the found device
- *
- * If a caller uses devref=false, it should be protected by RCU, or RTNL
- */
-struct net_device *__ip_dev_find(struct net *net, __be32 addr, bool devref)
-{
-	struct flowi fl = {
-		.fl4_dst = addr,
-	};
-	struct fib_result res = { 0 };
-	struct net_device *dev = NULL;
-	struct fib_table *local_table;
-
-#ifdef CONFIG_IP_MULTIPLE_TABLES
-	res.r = NULL;
-#endif
-
-	rcu_read_lock();
-	local_table = fib_get_table(net, RT_TABLE_LOCAL);
-	if (!local_table ||
-	    fib_table_lookup(local_table, &fl, &res, FIB_LOOKUP_NOREF)) {
-		rcu_read_unlock();
-		return NULL;
-	}
-	if (res.type != RTN_LOCAL)
-		goto out;
-	dev = FIB_RES_DEV(res);
-
-	if (dev && devref)
-		dev_hold(dev);
-out:
-	rcu_read_unlock();
-	return dev;
-}
-EXPORT_SYMBOL(__ip_dev_find);
-
 /*
  * Find address type as if only "dev" was present in the system. If
  * on_dev is NULL then all interfaces are taken into consideration.
-- 
1.7.4.1


^ permalink raw reply related

* [v2 PATCH 1/2] ipv4: Add hash table of interface addresses.
From: David Miller @ 2011-02-18 22:49 UTC (permalink / raw)
  To: netdev


This will be used to optimize __ip_dev_find() and friends.

With help from Eric Dumazet.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/linux/inetdevice.h |    1 +
 net/ipv4/devinet.c         |   45 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+), 0 deletions(-)

diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index ae8fdc5..5f81466 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -144,6 +144,7 @@ static inline void ipv4_devconf_setall(struct in_device *in_dev)
 #define IN_DEV_ARP_NOTIFY(in_dev)	IN_DEV_MAXCONF((in_dev), ARP_NOTIFY)
 
 struct in_ifaddr {
+	struct hlist_node	hash;
 	struct in_ifaddr	*ifa_next;
 	struct in_device	*ifa_dev;
 	struct rcu_head		rcu_head;
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 748cb5b..2fe5076 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -51,6 +51,7 @@
 #include <linux/inetdevice.h>
 #include <linux/igmp.h>
 #include <linux/slab.h>
+#include <linux/hash.h>
 #ifdef CONFIG_SYSCTL
 #include <linux/sysctl.h>
 #endif
@@ -92,6 +93,38 @@ static const struct nla_policy ifa_ipv4_policy[IFA_MAX+1] = {
 	[IFA_LABEL]     	= { .type = NLA_STRING, .len = IFNAMSIZ - 1 },
 };
 
+/* inet_addr_hash's shifting is dependent upon this IN4_ADDR_HSIZE
+ * value.  So if you change this define, make appropriate changes to
+ * inet_addr_hash as well.
+ */
+#define IN4_ADDR_HSIZE	256
+static struct hlist_head inet_addr_lst[IN4_ADDR_HSIZE];
+static DEFINE_SPINLOCK(inet_addr_hash_lock);
+
+static inline unsigned int inet_addr_hash(struct net *net, __be32 addr)
+{
+	u32 val = (__force u32) addr ^ hash_ptr(net, 8);
+
+	return ((val ^ (val >> 8) ^ (val >> 16) ^ (val >> 24)) &
+		(IN4_ADDR_HSIZE - 1));
+}
+
+static void inet_hash_insert(struct net *net, struct in_ifaddr *ifa)
+{
+	unsigned int hash = inet_addr_hash(net, ifa->ifa_address);
+
+	spin_lock(&inet_addr_hash_lock);
+	hlist_add_head_rcu(&ifa->hash, &inet_addr_lst[hash]);
+	spin_unlock(&inet_addr_hash_lock);
+}
+
+static void inet_hash_remove(struct in_ifaddr *ifa)
+{
+	spin_lock(&inet_addr_hash_lock);
+	hlist_del_init_rcu(&ifa->hash);
+	spin_unlock(&inet_addr_hash_lock);
+}
+
 static void rtmsg_ifa(int event, struct in_ifaddr *, struct nlmsghdr *, u32);
 
 static BLOCKING_NOTIFIER_HEAD(inetaddr_chain);
@@ -265,6 +298,7 @@ static void __inet_del_ifa(struct in_device *in_dev, struct in_ifaddr **ifap,
 			}
 
 			if (!do_promote) {
+				inet_hash_remove(ifa);
 				*ifap1 = ifa->ifa_next;
 
 				rtmsg_ifa(RTM_DELADDR, ifa, nlh, pid);
@@ -281,6 +315,7 @@ static void __inet_del_ifa(struct in_device *in_dev, struct in_ifaddr **ifap,
 	/* 2. Unlink it */
 
 	*ifap = ifa1->ifa_next;
+	inet_hash_remove(ifa1);
 
 	/* 3. Announce address deletion */
 
@@ -368,6 +403,8 @@ static int __inet_insert_ifa(struct in_ifaddr *ifa, struct nlmsghdr *nlh,
 	ifa->ifa_next = *ifap;
 	*ifap = ifa;
 
+	inet_hash_insert(dev_net(in_dev->dev), ifa);
+
 	/* Send message first, then call notifier.
 	   Notifier will trigger FIB update, so that
 	   listeners of netlink will know about new ifaddr */
@@ -521,6 +558,7 @@ static struct in_ifaddr *rtm_to_ifaddr(struct net *net, struct nlmsghdr *nlh)
 	if (tb[IFA_ADDRESS] == NULL)
 		tb[IFA_ADDRESS] = tb[IFA_LOCAL];
 
+	INIT_HLIST_NODE(&ifa->hash);
 	ifa->ifa_prefixlen = ifm->ifa_prefixlen;
 	ifa->ifa_mask = inet_make_mask(ifm->ifa_prefixlen);
 	ifa->ifa_flags = ifm->ifa_flags;
@@ -728,6 +766,7 @@ int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 		if (!ifa) {
 			ret = -ENOBUFS;
 			ifa = inet_alloc_ifa();
 			if (!ifa)
 				break;
+			INIT_HLIST_NODE(&ifa->hash);
 			if (colon)
@@ -1069,6 +1108,7 @@ static int inetdev_event(struct notifier_block *this, unsigned long event,
 			struct in_ifaddr *ifa = inet_alloc_ifa();
 
 			if (ifa) {
+				INIT_HLIST_NODE(&ifa->hash);
 				ifa->ifa_local =
 				  ifa->ifa_address = htonl(INADDR_LOOPBACK);
 				ifa->ifa_prefixlen = 8;
@@ -1710,6 +1750,11 @@ static struct rtnl_af_ops inet_af_ops = {
 
 void __init devinet_init(void)
 {
+	int i;
+
+	for (i = 0; i < IN4_ADDR_HSIZE; i++)
+		INIT_HLIST_HEAD(&inet_addr_lst[i]);
+
 	register_pernet_subsys(&devinet_ops);
 
 	register_gifconf(PF_INET, inet_gifconf);
-- 
1.7.4.1
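
For readers who want to try the bucket computation outside the kernel,
here is a small userspace sketch of the XOR fold used by inet_addr_hash()
in the patch above.  It assumes the per-namespace contribution
(hash_ptr(net, 8) in the patch) is passed in as a plain integer;
model_addr_hash(), net_salt and MODEL_ADDR_HSIZE are hypothetical names
for this example only.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors IN4_ADDR_HSIZE from the patch; must stay a power of two. */
#define MODEL_ADDR_HSIZE 256

/*
 * Fold a 32-bit address into a bucket index.  net_salt stands in for the
 * per-namespace value mixed in by the patch; it is an assumption here.
 */
static inline unsigned int model_addr_hash(uint32_t net_salt, uint32_t addr)
{
	uint32_t val = addr ^ net_salt;

	/* XOR the four bytes together, then keep the low 8 bits. */
	return (val ^ (val >> 8) ^ (val >> 16) ^ (val >> 24)) &
		(MODEL_ADDR_HSIZE - 1);
}
```

Because the fold XORs all four bytes of the value together, the bucket
does not depend on the byte order in which the address is held.  The
shifts cover exactly 8 bits per step, which is why the comment in the
patch ties the function to a table size of 256.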


^ permalink raw reply related
