Netdev List
 help / color / mirror / Atom feed
* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Nicolas de Pesloüan @ 2011-02-20 12:12 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jay Vosburgh, David Miller, kaber, eric.dumazet, netdev,
	shemminger, andy, Fischer, Anna
In-Reply-To: <20110220103609.GA2750@psychotron.redhat.com>

Le 20/02/2011 11:36, Jiri Pirko a écrit :
> Sat, Feb 19, 2011 at 09:27:37PM CET, nicolas.2p.debian@gmail.com wrote:
>> Le 19/02/2011 14:46, Jiri Pirko a écrit :
>>> Sat, Feb 19, 2011 at 02:18:00PM CET, nicolas.2p.debian@gmail.com wrote:
>>>> Le 19/02/2011 12:28, Jiri Pirko a écrit :
>>>>> Sat, Feb 19, 2011 at 12:08:31PM CET, jpirko@redhat.com wrote:
>>>>>> Sat, Feb 19, 2011 at 11:56:23AM CET, nicolas.2p.debian@gmail.com wrote:
>>>>>>> Le 19/02/2011 09:05, Jiri Pirko a écrit :
>>>>>>>> This patch converts bonding to use rx_handler. Results in cleaner
>>>>>>>> __netif_receive_skb() with much less exceptions needed. Also
>>>>>>>> bond-specific work is moved into bond code.
>>>>>>>>
>>>>>>>> Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>>>>>>>>
>>>>>>>> v1->v2:
>>>>>>>>          using skb_iif instead of new input_dev to remember original
>>>>>>>> 	device
>>>>>>>> v2->v3:
>>>>>>>> 	set orig_dev = skb->dev if skb_iif is set
>>>>>>>>
>>>>>>>
>>>>>>> Why do we need to let the rx_handlers call netif_rx() or __netif_receive_skb()?
>>>>>>>
>>>>>>> Bonding used to be handled with very few overhead, simply replacing
>>>>>>> skb->dev with skb->dev->master. Time has passed and we eventually
>>>>>>> added many special processing for bonding into __netif_receive_skb(),
>>>>>>> but the overhead remained very light.
>>>>>>>
>>>>>>> Calling netif_rx() (or __netif_receive_skb()) to allow nesting would probably lead to some overhead.
>>>>>>>
>>>>>>> Can't we, instead, loop inside __netif_receive_skb(), and deliver
>>>>>>> whatever need to be delivered, to whoever need, inside the loop ?
>>>>>>>
>>>>>>> rx_handler = rcu_dereference(skb->dev->rx_handler);
>>>>>>> while (rx_handler) {
>>>>>>> 	/* ...  */
>>>>>>> 	orig_dev = skb->dev;
>>>>>>> 	skb = rx_handler(skb);
>>>>>>> 	/* ... */
>>>>>>> 	rx_handler = (skb->dev != orig_dev) ? rcu_dereference(skb->dev->rx_handler) : NULL;
>>>>>>> }
>>>>>>>
>>>>>>> This would reduce the overhead, while still allowing nesting: vlan on
>>>>>>> top on bonding, bridge on top on bonding, ...
>>>>>>
>>>>>> I see your point. Makes sense to me. But the loop would have to include
>>>>>> at least processing of ptype_all too. I'm going to cook a follow-up
>>>>>> patch.
>>>>>>
>>>>>
>>>>> DRAFT (doesn't modify rx_handlers):
>>>>>
>>>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>>>> index 4ebf7fe..e5dba47 100644
>>>>> --- a/net/core/dev.c
>>>>> +++ b/net/core/dev.c
>>>>> @@ -3115,6 +3115,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>>>   {
>>>>>   	struct packet_type *ptype, *pt_prev;
>>>>>   	rx_handler_func_t *rx_handler;
>>>>> +	struct net_device *dev;
>>>>>   	struct net_device *orig_dev;
>>>>>   	struct net_device *null_or_dev;
>>>>>   	int ret = NET_RX_DROP;
>>>>> @@ -3129,7 +3130,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>>>   	if (netpoll_receive_skb(skb))
>>>>>   		return NET_RX_DROP;
>>>>>
>>>>> -	__this_cpu_inc(softnet_data.processed);
>>>>> +	skb->skb_iif = skb->dev->ifindex;
>>>>> +	orig_dev = skb->dev;
>>>>
>>>> orig_dev should be set inside the loop, to reflect "previously
>>>> crossed device", while following the path:
>>>>
>>>> eth0 ->   bond0 ->   br0.
>>>>
>>>> First step inside loop:
>>>>
>>>> orig_dev = eth0
>>>> skb->dev = bond0 (at the end of the loop).
>>>>
>>>> Second step inside loop:
>>>>
>>>> orig_dev = bond0
>>>> skb->dev = br0 (et the end of the loop).
>>>>
>>>> This would allow for exact match delivery to bond0 if someone bind there.
>>>>
>>>>> +
>>>>>   	skb_reset_network_header(skb);
>>>>>   	skb_reset_transport_header(skb);
>>>>>   	skb->mac_len = skb->network_header - skb->mac_header;
>>>>> @@ -3138,12 +3141,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>>>
>>>>>   	rcu_read_lock();
>>>>>
>>>>> -	if (!skb->skb_iif) {
>>>>> -		skb->skb_iif = skb->dev->ifindex;
>>>>> -		orig_dev = skb->dev;
>>>>> -	} else {
>>>>> -		orig_dev = dev_get_by_index_rcu(dev_net(skb->dev), skb->skb_iif);
>>>>> -	}
>>>>
>>>> I like the fact that it removes the above part.
>>>>
>>>>> +another_round:
>>>>> +	__this_cpu_inc(softnet_data.processed);
>>>>> +	dev = skb->dev;
>>>>>
>>>>>   #ifdef CONFIG_NET_CLS_ACT
>>>>>   	if (skb->tc_verd&    TC_NCLS) {
>>>>> @@ -3153,7 +3153,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>>>   #endif
>>>>>
>>>>>   	list_for_each_entry_rcu(ptype,&ptype_all, list) {
>>>>> -		if (!ptype->dev || ptype->dev == skb->dev) {
>>>>> +		if (!ptype->dev || ptype->dev == dev) {
>>>>>   			if (pt_prev)
>>>>>   				ret = deliver_skb(skb, pt_prev, orig_dev);
>>>>>   			pt_prev = ptype;
>>>>
>>>> Inside the loop, we should only do exact match delivery, for
>>>> &ptype_all and for&ptype_base[ntohs(type)&   PTYPE_HASH_MASK]:
>>>>
>>>>         list_for_each_entry_rcu(ptype,&ptype_all, list) {
>>>> -               if (!ptype->dev || ptype->dev == dev) {
>>>> +               if (ptype->dev == dev) {
>>>>                         if (pt_prev)
>>>>                                 ret = deliver_skb(skb, pt_prev, orig_dev);
>>>>                         pt_prev = ptype;
>>>>                 }
>>>>         }
>>>>
>>>>
>>>>         list_for_each_entry_rcu(ptype,
>>>>                         &ptype_base[ntohs(type)&   PTYPE_HASH_MASK], list) {
>>>>                 if (ptype->type == type&&
>>>> -                   (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
>>>> +                   (ptype->dev == skb->dev)) {
>>>>                         if (pt_prev)
>>>>                                 ret = deliver_skb(skb, pt_prev, orig_dev);
>>>>                         pt_prev = ptype;
>>>>                 }
>>>>         }
>>>>
>>>> After leaving the loop, we can do wilcard delivery, if skb is not NULL.
>>>>
>>>>         list_for_each_entry_rcu(ptype,&ptype_all, list) {
>>>> -               if (!ptype->dev || ptype->dev == dev) {
>>>> +               if (!ptype->dev) {
>>>>                         if (pt_prev)
>>>>                                 ret = deliver_skb(skb, pt_prev, orig_dev);
>>>>                         pt_prev = ptype;
>>>>                }
>>>>         }
>>>>
>>>>
>>>>         list_for_each_entry_rcu(ptype,
>>>>                         &ptype_base[ntohs(type)&   PTYPE_HASH_MASK], list) {
>>>> -               if (ptype->type == type&&
>>>> -                   (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
>>>> +		if (ptype->type == type&&   !ptype->dev) {
>>>>                         if (pt_prev)
>>>>                                 ret = deliver_skb(skb, pt_prev, orig_dev);
>>>>                         pt_prev = ptype;
>>>>                 }
>>>>         }
>>>>
>>>> This would reduce the number of tests inside the
>>>> list_for_each_entry_rcu() loops. And because we match only ptype->dev
>>>> == dev inside the loop and !ptype->dev outside the loop, this should
>>>> avoid duplicate delivery.
>>>
>>> Would you care to put this into patch so I can see the whole picture?
>>> Thanks.
>>
>> Here is what I have in mind. It is based on your previous DRAFT patch, and don't modify rx_handlers yet.
>>
>> Only compile tested !!
>>
>> I don't know if every pieces are at the right place. I wonder what to
>> do with CONFIG_NET_CLS_ACT part, that currently is between ptype_all
>> and ptype_base processing.
>>
>> Anyway, the general idea is there.
>>
>> 	Nicolas.
>>
>> net/core/dev.c |   70 ++++++++++++++++++++++++++++++++++++++++++++++++--------
>> 1 files changed, 60 insertions(+), 10 deletions(-)
>>
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index e5dba47..7e007a9 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -3117,7 +3117,6 @@ static int __netif_receive_skb(struct sk_buff *skb)
>> 	rx_handler_func_t *rx_handler;
>> 	struct net_device *dev;
>> 	struct net_device *orig_dev;
>> -	struct net_device *null_or_dev;
>> 	int ret = NET_RX_DROP;
>> 	__be16 type;
>>
>> @@ -3130,9 +3129,6 @@ static int __netif_receive_skb(struct sk_buff *skb)
>> 	if (netpoll_receive_skb(skb))
>> 		return NET_RX_DROP;
>>
>> -	skb->skb_iif = skb->dev->ifindex;
>> -	orig_dev = skb->dev;
>> -
>> 	skb_reset_network_header(skb);
>> 	skb_reset_transport_header(skb);
>> 	skb->mac_len = skb->network_header - skb->mac_header;
>> @@ -3143,6 +3139,8 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>
>> another_round:
>> 	__this_cpu_inc(softnet_data.processed);
>> +	skb->skb_iif = skb->dev->ifindex;
>> +	orig_dev = skb->dev;
> orig_dev should be set at the end of the loop. Now you are going to have
> it always the same as dev and skb->dev.
>

Yes, you are right.

I thinking about all this, I wonder what the protocol handlers expect as the orig_dev value ?

Lest imagine the following configuration: eth0 -> bond0 -> br0.

What does a protocol handler listening on br0 expect for orig_dev ? bond0 or eth0 ? Current 
implementation give eth0, but I think bond0 should be the right value, for proper nesting.

>> 	dev = skb->dev;
>>
>> #ifdef CONFIG_NET_CLS_ACT
>> @@ -3152,8 +3150,13 @@ another_round:
>> 	}
>> #endif
>>
>> +	/*
>> +	 * Deliver to ptype_all protocol handlers that match current dev.
>> +	 * This happens before rx_handler is given a chance to change skb->dev.
>> +	 */
>> +
>> 	list_for_each_entry_rcu(ptype,&ptype_all, list) {
>> -		if (!ptype->dev || ptype->dev == dev) {
>> +		if (ptype->dev == dev) {
>> 			if (pt_prev)
>> 				ret = deliver_skb(skb, pt_prev, orig_dev);
>> 			pt_prev = ptype;
>> @@ -3167,6 +3170,31 @@ another_round:
>> ncls:
>> #endif
>>
>> +	/*
>> +	 * Deliver to ptype_base protocol handlers that match current dev.
>> +	 * This happens before rx_handler is given a chance to change skb->dev.
>> +	 */
>> +
>> +	type = skb->protocol;
>> +	list_for_each_entry_rcu(ptype,
>> +			&ptype_base[ntohs(type)&  PTYPE_HASH_MASK], list) {
>> +		if (ptype->type == type&&  ptype->dev == skb->dev) {
>> +			if (pt_prev)
>> +				ret = deliver_skb(skb, pt_prev, orig_dev);
>> +			pt_prev = ptype;
>> +		}
>> +	}
>
> I'm not sure it is ok to deliver ptype_base here. See comment above
> ptype_head() (I'm not sure I understand that correctly)

Anyway, all this is probably plain wrong: Delivering the skb to protocol handlers while still 
changing the skb is guaranteed to cause strange behaviors.

If we want to be able to deliver the skb to different protocol handlers and give all of them the 
right values for dev->skb and orig_dev (or previous_dev), we might end up with copying the skb. I 
hate the idea, but currently can't find a cleaner way to do so.

We first need to clarify what orig_dev should be, as stated above.

>> +
>> +	/*
>> +	 * Call rx_handler for current device.
>> +	 * If rx_handler return NULL, skip wilcard protocol handler delivery.
>> +	 * Else, if skb->dev changed, restart the whole delivery process, to
>> +	 * allow for device nesting.
>> +	 *
>> +	 * Warning:
>> +	 * rx_handlers must kfree_skb(skb) if they return NULL.
> Well this is not true. They can return NULL and call netif_rx as they
> have before. No changes necessary I believe.

I don't really know. This needs to be double checked, anyway.

>> +	 */
>> +
>> 	rx_handler = rcu_dereference(dev->rx_handler);
>> 	if (rx_handler) {
>> 		if (pt_prev) {
>> @@ -3176,10 +3204,15 @@ ncls:
>> 		skb = rx_handler(skb);
>> 		if (!skb)
>> 			goto out;
>> -		if (dev != skb->dev)
>> +		if (skb->dev != dev)
>> 			goto another_round;
>> 	}
>>
>> +	/*
>> +	 * FIXME: The part below should use rx_handler instead of being hard
>> +	 * coded here.
> I'm not sure it is doable atm. For bridge and bond it should not be a
> problem, but for macvlan, there is possible to have macvlans and vlans
> on the same dev. This possibility should persist.
> /me scratches head on the idea to have multiple rx_handlers although it
> was his original idea....

I think your original proposal of having several rx_handlers per device was right.

At the time you introduced the rx_handler system, only bridge and macvlan used it. Even if using 
bridge and macvlan on the same base device might be useless, this is not true for every possible 
rx_handler configuration.

Now that we want to move bonding and vlan to the rx_handler system, it becomes obvious that we need 
several rx_handlers per device. At least, vlan should properly mix with bridge. And who know what 
would be the fifth rx_handler...

>> +	 */
>> +
>> 	if (vlan_tx_tag_present(skb)) {
>> 		if (pt_prev) {
>> 			ret = deliver_skb(skb, pt_prev, orig_dev);
>> @@ -3192,16 +3225,33 @@ ncls:
>> 			goto out;
>> 	}
>>
>> +	/*
>> +	 * FIXME: Can't this be moved into the rx_handler for bonding,
>> +	 * or into a futur rx_handler for vlan?
> This hook is something I do not like at all :/ But anyway if should be in vlan
> part I think.

Yes, and in order for the future rx_handler for vlan to properly handle it, it needs to know the 
device just below it, not the pure original device. Hence, my question about the exact meaning of 
orig_dev...

	Nicolas.

>> +	 */
>> +
>> 	vlan_on_bond_hook(skb);
>>
>> -	/* deliver only exact match when indicated */
>> -	null_or_dev = skb->deliver_no_wcard ? skb->dev : NULL;
>> +	/*
>> +	 * Deliver to wildcard ptype_all protocol handlers.
>> +	 */
>> +
>> +	list_for_each_entry_rcu(ptype,&ptype_all, list) {
>> +		if (!ptype->dev) {
>> +			if (pt_prev)
>> +				ret = deliver_skb(skb, pt_prev, orig_dev);
>> +			pt_prev = ptype;
>> +		}
>> +	}
>> +
>> +	/*
>> +	 * Deliver to wildcard ptype_all protocol handlers.
>> +	 */
>>
>> 	type = skb->protocol;
>> 	list_for_each_entry_rcu(ptype,
>> 			&ptype_base[ntohs(type)&  PTYPE_HASH_MASK], list) {
>> -		if (ptype->type == type&&
>> -		    (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
>> +		if (ptype->type == type&&  !ptype->dev) {
>> 			if (pt_prev)
>> 				ret = deliver_skb(skb, pt_prev, orig_dev);
>> 			pt_prev = ptype;
>> --
>> 1.7.2.3
>>
>>
>>

^ permalink raw reply

* Re: [PATCH] net: fix unreg list corruption in dev_deactivate()
From: Eric Dumazet @ 2011-02-20 12:11 UTC (permalink / raw)
  To: Stanislaw Gruszka; +Cc: netdev, Octavian Purdila, David S. Miller
In-Reply-To: <20110220113429.GA27047@localhost.localdomain>

Le dimanche 20 février 2011 à 12:34 +0100, Stanislaw Gruszka a écrit :
> Patch fix issue introduced by 443457242beb6716b43db4d62fe148eab5515505 
> "net: factorize sync-rcu call in unregister_netdevice_many". It manifest
> on my system by following warning when removing usb wireless device.
> 
> [ 3539.368139] WARNING: at lib/list_debug.c:53 __list_del_entry+0x62/0x71()
> [ 3539.368149] list_del corruption. prev->next should be f035e05c, but was f1ce670c
> [ 3539.368242] Call Trace:
> [ 3539.368254]  [<c04393d7>] ? warn_slowpath_common+0x6a/0x7f
> [ 3539.368262]  [<c05bd062>] ? __list_del_entry+0x62/0x71
> [ 3539.368269]  [<c043945f>] ? warn_slowpath_fmt+0x2b/0x2f
> [ 3539.368276]  [<c05bd062>] ? __list_del_entry+0x62/0x71
> [ 3539.368286]  [<c06f6d06>] ? unregister_netdevice_queue+0x41/0x6e
> [ 3539.368322]  [<fa1ee998>] ? ieee80211_remove_interfaces+0x7b/0x9a [mac80211]
> [ 3539.368348]  [<fa1e208a>] ? ieee80211_unregister_hw+0x48/0xf9 [mac80211]
> [ 3539.368363]  [<fa223903>] ? rt2x00lib_remove_dev+0x76/0xd1 [rt2x00lib]
> [ 3539.368372]  [<fa2770b1>] ? rt2x00usb_disconnect+0x29/0x8c [rt2x00usb]
> [ 3539.368382]  [<c069ef8c>] ? usb_unbind_interface+0x48/0xfd
> 
> I'm no longer seeing warning with patch applied.
> 
> Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl>
> ---
> I did not try review related code. I think someone who understand it,
> should audit it carefully to exclude similar issues. Adding
> dev->unreg_list to various local list, when device will not gonna be
> destroyed looks really fishy.
> 
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 34dc598..1bc6980 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -839,6 +839,7 @@ void dev_deactivate(struct net_device *dev)
>  
>  	list_add(&dev->unreg_list, &single);
>  	dev_deactivate_many(&single);
> +	list_del(&single);
>  }
>  
>  static void dev_init_scheduler_queue(struct net_device *dev,


Hmm, you should read Eric B patch, he already addressed this problem a
few hours ago.

A full audit _is_ needed.

https://lkml.org/lkml/2011/2/20/4




^ permalink raw reply

* [PATCH] net: fix unreg list corruption in dev_deactivate()
From: Stanislaw Gruszka @ 2011-02-20 11:34 UTC (permalink / raw)
  To: netdev; +Cc: Octavian Purdila, Eric Dumazet, David S. Miller

Patch fix issue introduced by 443457242beb6716b43db4d62fe148eab5515505 
"net: factorize sync-rcu call in unregister_netdevice_many". It manifest
on my system by following warning when removing usb wireless device.

[ 3539.368139] WARNING: at lib/list_debug.c:53 __list_del_entry+0x62/0x71()
[ 3539.368149] list_del corruption. prev->next should be f035e05c, but was f1ce670c
[ 3539.368242] Call Trace:
[ 3539.368254]  [<c04393d7>] ? warn_slowpath_common+0x6a/0x7f
[ 3539.368262]  [<c05bd062>] ? __list_del_entry+0x62/0x71
[ 3539.368269]  [<c043945f>] ? warn_slowpath_fmt+0x2b/0x2f
[ 3539.368276]  [<c05bd062>] ? __list_del_entry+0x62/0x71
[ 3539.368286]  [<c06f6d06>] ? unregister_netdevice_queue+0x41/0x6e
[ 3539.368322]  [<fa1ee998>] ? ieee80211_remove_interfaces+0x7b/0x9a [mac80211]
[ 3539.368348]  [<fa1e208a>] ? ieee80211_unregister_hw+0x48/0xf9 [mac80211]
[ 3539.368363]  [<fa223903>] ? rt2x00lib_remove_dev+0x76/0xd1 [rt2x00lib]
[ 3539.368372]  [<fa2770b1>] ? rt2x00usb_disconnect+0x29/0x8c [rt2x00usb]
[ 3539.368382]  [<c069ef8c>] ? usb_unbind_interface+0x48/0xfd

I'm no longer seeing warning with patch applied.

Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl>
---
I did not try review related code. I think someone who understand it,
should audit it carefully to exclude similar issues. Adding
dev->unreg_list to various local list, when device will not gonna be
destroyed looks really fishy.

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 34dc598..1bc6980 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -839,6 +839,7 @@ void dev_deactivate(struct net_device *dev)
 
 	list_add(&dev->unreg_list, &single);
 	dev_deactivate_many(&single);
+	list_del(&single);
 }
 
 static void dev_init_scheduler_queue(struct net_device *dev,

^ permalink raw reply related

* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Jiri Pirko @ 2011-02-20 10:36 UTC (permalink / raw)
  To: Nicolas de Pesloüan
  Cc: Jay Vosburgh, David Miller, kaber, eric.dumazet, netdev,
	shemminger, andy
In-Reply-To: <4D6027B9.6050108@gmail.com>

Sat, Feb 19, 2011 at 09:27:37PM CET, nicolas.2p.debian@gmail.com wrote:
>Le 19/02/2011 14:46, Jiri Pirko a écrit :
>>Sat, Feb 19, 2011 at 02:18:00PM CET, nicolas.2p.debian@gmail.com wrote:
>>>Le 19/02/2011 12:28, Jiri Pirko a écrit :
>>>>Sat, Feb 19, 2011 at 12:08:31PM CET, jpirko@redhat.com wrote:
>>>>>Sat, Feb 19, 2011 at 11:56:23AM CET, nicolas.2p.debian@gmail.com wrote:
>>>>>>Le 19/02/2011 09:05, Jiri Pirko a écrit :
>>>>>>>This patch converts bonding to use rx_handler. Results in cleaner
>>>>>>>__netif_receive_skb() with much less exceptions needed. Also
>>>>>>>bond-specific work is moved into bond code.
>>>>>>>
>>>>>>>Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>>>>>>>
>>>>>>>v1->v2:
>>>>>>>         using skb_iif instead of new input_dev to remember original
>>>>>>>	device
>>>>>>>v2->v3:
>>>>>>>	set orig_dev = skb->dev if skb_iif is set
>>>>>>>
>>>>>>
>>>>>>Why do we need to let the rx_handlers call netif_rx() or __netif_receive_skb()?
>>>>>>
>>>>>>Bonding used to be handled with very few overhead, simply replacing
>>>>>>skb->dev with skb->dev->master. Time has passed and we eventually
>>>>>>added many special processing for bonding into __netif_receive_skb(),
>>>>>>but the overhead remained very light.
>>>>>>
>>>>>>Calling netif_rx() (or __netif_receive_skb()) to allow nesting would probably lead to some overhead.
>>>>>>
>>>>>>Can't we, instead, loop inside __netif_receive_skb(), and deliver
>>>>>>whatever need to be delivered, to whoever need, inside the loop ?
>>>>>>
>>>>>>rx_handler = rcu_dereference(skb->dev->rx_handler);
>>>>>>while (rx_handler) {
>>>>>>	/* ...  */
>>>>>>	orig_dev = skb->dev;
>>>>>>	skb = rx_handler(skb);
>>>>>>	/* ... */
>>>>>>	rx_handler = (skb->dev != orig_dev) ? rcu_dereference(skb->dev->rx_handler) : NULL;
>>>>>>}
>>>>>>
>>>>>>This would reduce the overhead, while still allowing nesting: vlan on
>>>>>>top on bonding, bridge on top on bonding, ...
>>>>>
>>>>>I see your point. Makes sense to me. But the loop would have to include
>>>>>at least processing of ptype_all too. I'm going to cook a follow-up
>>>>>patch.
>>>>>
>>>>
>>>>DRAFT (doesn't modify rx_handlers):
>>>>
>>>>diff --git a/net/core/dev.c b/net/core/dev.c
>>>>index 4ebf7fe..e5dba47 100644
>>>>--- a/net/core/dev.c
>>>>+++ b/net/core/dev.c
>>>>@@ -3115,6 +3115,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>>  {
>>>>  	struct packet_type *ptype, *pt_prev;
>>>>  	rx_handler_func_t *rx_handler;
>>>>+	struct net_device *dev;
>>>>  	struct net_device *orig_dev;
>>>>  	struct net_device *null_or_dev;
>>>>  	int ret = NET_RX_DROP;
>>>>@@ -3129,7 +3130,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>>  	if (netpoll_receive_skb(skb))
>>>>  		return NET_RX_DROP;
>>>>
>>>>-	__this_cpu_inc(softnet_data.processed);
>>>>+	skb->skb_iif = skb->dev->ifindex;
>>>>+	orig_dev = skb->dev;
>>>
>>>orig_dev should be set inside the loop, to reflect "previously
>>>crossed device", while following the path:
>>>
>>>eth0 ->  bond0 ->  br0.
>>>
>>>First step inside loop:
>>>
>>>orig_dev = eth0
>>>skb->dev = bond0 (at the end of the loop).
>>>
>>>Second step inside loop:
>>>
>>>orig_dev = bond0
>>>skb->dev = br0 (et the end of the loop).
>>>
>>>This would allow for exact match delivery to bond0 if someone bind there.
>>>
>>>>+
>>>>  	skb_reset_network_header(skb);
>>>>  	skb_reset_transport_header(skb);
>>>>  	skb->mac_len = skb->network_header - skb->mac_header;
>>>>@@ -3138,12 +3141,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>>
>>>>  	rcu_read_lock();
>>>>
>>>>-	if (!skb->skb_iif) {
>>>>-		skb->skb_iif = skb->dev->ifindex;
>>>>-		orig_dev = skb->dev;
>>>>-	} else {
>>>>-		orig_dev = dev_get_by_index_rcu(dev_net(skb->dev), skb->skb_iif);
>>>>-	}
>>>
>>>I like the fact that it removes the above part.
>>>
>>>>+another_round:
>>>>+	__this_cpu_inc(softnet_data.processed);
>>>>+	dev = skb->dev;
>>>>
>>>>  #ifdef CONFIG_NET_CLS_ACT
>>>>  	if (skb->tc_verd&   TC_NCLS) {
>>>>@@ -3153,7 +3153,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>>  #endif
>>>>
>>>>  	list_for_each_entry_rcu(ptype,&ptype_all, list) {
>>>>-		if (!ptype->dev || ptype->dev == skb->dev) {
>>>>+		if (!ptype->dev || ptype->dev == dev) {
>>>>  			if (pt_prev)
>>>>  				ret = deliver_skb(skb, pt_prev, orig_dev);
>>>>  			pt_prev = ptype;
>>>
>>>Inside the loop, we should only do exact match delivery, for
>>>&ptype_all and for&ptype_base[ntohs(type)&  PTYPE_HASH_MASK]:
>>>
>>>        list_for_each_entry_rcu(ptype,&ptype_all, list) {
>>>-               if (!ptype->dev || ptype->dev == dev) {
>>>+               if (ptype->dev == dev) {
>>>                        if (pt_prev)
>>>                                ret = deliver_skb(skb, pt_prev, orig_dev);
>>>                        pt_prev = ptype;
>>>                }
>>>        }
>>>
>>>
>>>        list_for_each_entry_rcu(ptype,
>>>                        &ptype_base[ntohs(type)&  PTYPE_HASH_MASK], list) {
>>>                if (ptype->type == type&&
>>>-                   (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
>>>+                   (ptype->dev == skb->dev)) {
>>>                        if (pt_prev)
>>>                                ret = deliver_skb(skb, pt_prev, orig_dev);
>>>                        pt_prev = ptype;
>>>                }
>>>        }
>>>
>>>After leaving the loop, we can do wilcard delivery, if skb is not NULL.
>>>
>>>        list_for_each_entry_rcu(ptype,&ptype_all, list) {
>>>-               if (!ptype->dev || ptype->dev == dev) {
>>>+               if (!ptype->dev) {
>>>                        if (pt_prev)
>>>                                ret = deliver_skb(skb, pt_prev, orig_dev);
>>>                        pt_prev = ptype;
>>>               }
>>>        }
>>>
>>>
>>>        list_for_each_entry_rcu(ptype,
>>>                        &ptype_base[ntohs(type)&  PTYPE_HASH_MASK], list) {
>>>-               if (ptype->type == type&&
>>>-                   (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
>>>+		if (ptype->type == type&&  !ptype->dev) {
>>>                        if (pt_prev)
>>>                                ret = deliver_skb(skb, pt_prev, orig_dev);
>>>                        pt_prev = ptype;
>>>                }
>>>        }
>>>
>>>This would reduce the number of tests inside the
>>>list_for_each_entry_rcu() loops. And because we match only ptype->dev
>>>== dev inside the loop and !ptype->dev outside the loop, this should
>>>avoid duplicate delivery.
>>
>>Would you care to put this into patch so I can see the whole picture?
>>Thanks.
>
>Here is what I have in mind. It is based on your previous DRAFT patch, and don't modify rx_handlers yet.
>
>Only compile tested !!
>
>I don't know if every pieces are at the right place. I wonder what to
>do with CONFIG_NET_CLS_ACT part, that currently is between ptype_all
>and ptype_base processing.
>
>Anyway, the general idea is there.
>
>	Nicolas.
>
> net/core/dev.c |   70 ++++++++++++++++++++++++++++++++++++++++++++++++--------
> 1 files changed, 60 insertions(+), 10 deletions(-)
>
>diff --git a/net/core/dev.c b/net/core/dev.c
>index e5dba47..7e007a9 100644
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -3117,7 +3117,6 @@ static int __netif_receive_skb(struct sk_buff *skb)
> 	rx_handler_func_t *rx_handler;
> 	struct net_device *dev;
> 	struct net_device *orig_dev;
>-	struct net_device *null_or_dev;
> 	int ret = NET_RX_DROP;
> 	__be16 type;
>
>@@ -3130,9 +3129,6 @@ static int __netif_receive_skb(struct sk_buff *skb)
> 	if (netpoll_receive_skb(skb))
> 		return NET_RX_DROP;
>
>-	skb->skb_iif = skb->dev->ifindex;
>-	orig_dev = skb->dev;
>-
> 	skb_reset_network_header(skb);
> 	skb_reset_transport_header(skb);
> 	skb->mac_len = skb->network_header - skb->mac_header;
>@@ -3143,6 +3139,8 @@ static int __netif_receive_skb(struct sk_buff *skb)
>
> another_round:
> 	__this_cpu_inc(softnet_data.processed);
>+	skb->skb_iif = skb->dev->ifindex;
>+	orig_dev = skb->dev;
orig_dev should be set at the end of the loop. Now you are going to have
it always the same as dev and skb->dev.

> 	dev = skb->dev;
>
> #ifdef CONFIG_NET_CLS_ACT
>@@ -3152,8 +3150,13 @@ another_round:
> 	}
> #endif
>
>+	/*
>+	 * Deliver to ptype_all protocol handlers that match current dev.
>+	 * This happens before rx_handler is given a chance to change skb->dev.
>+	 */
>+
> 	list_for_each_entry_rcu(ptype, &ptype_all, list) {
>-		if (!ptype->dev || ptype->dev == dev) {
>+		if (ptype->dev == dev) {
> 			if (pt_prev)
> 				ret = deliver_skb(skb, pt_prev, orig_dev);
> 			pt_prev = ptype;
>@@ -3167,6 +3170,31 @@ another_round:
> ncls:
> #endif
>
>+	/*
>+	 * Deliver to ptype_base protocol handlers that match current dev.
>+	 * This happens before rx_handler is given a chance to change skb->dev.
>+	 */
>+
>+	type = skb->protocol;
>+	list_for_each_entry_rcu(ptype,
>+			&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
>+		if (ptype->type == type && ptype->dev == skb->dev) {
>+			if (pt_prev)
>+				ret = deliver_skb(skb, pt_prev, orig_dev);
>+			pt_prev = ptype;
>+		}
>+	}

I'm not sure it is ok to deliver ptype_base here. See comment above
ptype_head() (I'm not sure I understand that correctly)

>+
>+	/*
>+	 * Call rx_handler for current device.
>+	 * If rx_handler return NULL, skip wilcard protocol handler delivery.
>+	 * Else, if skb->dev changed, restart the whole delivery process, to
>+	 * allow for device nesting.
>+	 *
>+	 * Warning:
>+	 * rx_handlers must kfree_skb(skb) if they return NULL.
Well this is not true. They can return NULL and call netif_rx as they
have before. No changes necessary I believe.

>+	 */
>+
> 	rx_handler = rcu_dereference(dev->rx_handler);
> 	if (rx_handler) {
> 		if (pt_prev) {
>@@ -3176,10 +3204,15 @@ ncls:
> 		skb = rx_handler(skb);
> 		if (!skb)
> 			goto out;
>-		if (dev != skb->dev)
>+		if (skb->dev != dev)
> 			goto another_round;
> 	}
>
>+	/*
>+	 * FIXME: The part below should use rx_handler instead of being hard
>+	 * coded here.
I'm not sure it is doable atm. For bridge and bond it should not be a
problem, but for macvlan, there is possible to have macvlans and vlans
on the same dev. This possibility should persist.
/me scratches head on the idea to have multiple rx_handlers although it
was his original idea....


>+	 */
>+
> 	if (vlan_tx_tag_present(skb)) {
> 		if (pt_prev) {
> 			ret = deliver_skb(skb, pt_prev, orig_dev);
>@@ -3192,16 +3225,33 @@ ncls:
> 			goto out;
> 	}
>
>+	/*
>+	 * FIXME: Can't this be moved into the rx_handler for bonding,
>+	 * or into a futur rx_handler for vlan?
This hook is something I do not like at all :/ But anyway if should be in vlan
part I think.

>+	 */
>+
> 	vlan_on_bond_hook(skb);
>
>-	/* deliver only exact match when indicated */
>-	null_or_dev = skb->deliver_no_wcard ? skb->dev : NULL;
>+	/*
>+	 * Deliver to wildcard ptype_all protocol handlers.
>+	 */
>+
>+	list_for_each_entry_rcu(ptype, &ptype_all, list) {
>+		if (!ptype->dev) {
>+			if (pt_prev)
>+				ret = deliver_skb(skb, pt_prev, orig_dev);
>+			pt_prev = ptype;
>+		}
>+	}
>+
>+	/*
>+	 * Deliver to wildcard ptype_all protocol handlers.
>+	 */
>
> 	type = skb->protocol;
> 	list_for_each_entry_rcu(ptype,
> 			&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
>-		if (ptype->type == type &&
>-		    (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
>+		if (ptype->type == type && !ptype->dev) {
> 			if (pt_prev)
> 				ret = deliver_skb(skb, pt_prev, orig_dev);
> 			pt_prev = ptype;
>-- 
>1.7.2.3
>
>
>

^ permalink raw reply

* [PATCH net-next] sctp: fix compile warnings in sctp_tsnmap_num_gabs
From: Shan Wei @ 2011-02-20  7:57 UTC (permalink / raw)
  To: Vlad Yasevich, David Miller, Network-Maillist, SCTP-Maillist

net/sctp/tsnmap.c: In function ‘sctp_tsnmap_num_gabs’:
net/sctp/tsnmap.c:347: warning: ‘start’ may be used uninitialized in this function
net/sctp/tsnmap.c:347: warning: ‘end’ may be used uninitialized in this function

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
---
 net/sctp/tsnmap.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/sctp/tsnmap.c b/net/sctp/tsnmap.c
index 747d541..f1e40ce 100644
--- a/net/sctp/tsnmap.c
+++ b/net/sctp/tsnmap.c
@@ -344,7 +344,7 @@ __u16 sctp_tsnmap_num_gabs(struct sctp_tsnmap *map,
 
 	/* Refresh the gap ack information. */
 	if (sctp_tsnmap_has_gap(map)) {
-		__u16 start, end;
+		__u16 start = 0, end = 0;
 		sctp_tsnmap_iter_init(map, &iter);
 		while (sctp_tsnmap_next_gap_ack(map, &iter,
 						&start,
-- 
1.6.3.3


^ permalink raw reply related

* [PATCH] tcp: Remove debug macro of TCP_CHECK_TIMER
From: Shan Wei @ 2011-02-20  7:55 UTC (permalink / raw)
  To: David Miller, Network-Maillist, kuznet, pekkas, jmorris,
	Patrick McHardy


Now, TCP_CHECK_TIMER is not used for debuging, it does nothing.
And, it has been there for several years, maybe 6 years.

Remove it to keep code clearer.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
---
 include/net/tcp.h    |    2 --
 net/ipv4/tcp.c       |    9 ---------
 net/ipv4/tcp_ipv4.c  |    5 -----
 net/ipv4/tcp_timer.c |    3 ---
 net/ipv6/tcp_ipv6.c  |    4 ----
 5 files changed, 0 insertions(+), 23 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index adfe6db..cda30ea 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1068,8 +1068,6 @@ static inline int tcp_paws_reject(const struct tcp_options_received *rx_opt,
 	return 1;
 }
 
-#define TCP_CHECK_TIMER(sk) do { } while (0)
-
 static inline void tcp_mib_init(struct net *net)
 {
 	/* See RFC 2012 */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f9867d2..a17a5a7 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -873,9 +873,7 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
 					flags);
 
 	lock_sock(sk);
-	TCP_CHECK_TIMER(sk);
 	res = do_tcp_sendpages(sk, &page, offset, size, flags);
-	TCP_CHECK_TIMER(sk);
 	release_sock(sk);
 	return res;
 }
@@ -916,7 +914,6 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	long timeo;
 
 	lock_sock(sk);
-	TCP_CHECK_TIMER(sk);
 
 	flags = msg->msg_flags;
 	timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
@@ -1104,7 +1101,6 @@ wait_for_memory:
 out:
 	if (copied)
 		tcp_push(sk, flags, mss_now, tp->nonagle);
-	TCP_CHECK_TIMER(sk);
 	release_sock(sk);
 	return copied;
 
@@ -1123,7 +1119,6 @@ do_error:
 		goto out;
 out_err:
 	err = sk_stream_error(sk, flags, err);
-	TCP_CHECK_TIMER(sk);
 	release_sock(sk);
 	return err;
 }
@@ -1415,8 +1410,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 
 	lock_sock(sk);
 
-	TCP_CHECK_TIMER(sk);
-
 	err = -ENOTCONN;
 	if (sk->sk_state == TCP_LISTEN)
 		goto out;
@@ -1767,12 +1760,10 @@ skip_copy:
 	/* Clean up data we have read: This will do ACK frames. */
 	tcp_cleanup_rbuf(sk, copied);
 
-	TCP_CHECK_TIMER(sk);
 	release_sock(sk);
 	return copied;
 
 out:
-	TCP_CHECK_TIMER(sk);
 	release_sock(sk);
 	return err;
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index e2b9be2..ef5a90b 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1556,12 +1556,10 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
 
 	if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
 		sock_rps_save_rxhash(sk, skb->rxhash);
-		TCP_CHECK_TIMER(sk);
 		if (tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len)) {
 			rsk = sk;
 			goto reset;
 		}
-		TCP_CHECK_TIMER(sk);
 		return 0;
 	}
 
@@ -1583,13 +1581,10 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
 	} else
 		sock_rps_save_rxhash(sk, skb->rxhash);
 
-
-	TCP_CHECK_TIMER(sk);
 	if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb), skb->len)) {
 		rsk = sk;
 		goto reset;
 	}
-	TCP_CHECK_TIMER(sk);
 	return 0;
 
 reset:
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 74a6aa0..ecd44b0 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -259,7 +259,6 @@ static void tcp_delack_timer(unsigned long data)
 		tcp_send_ack(sk);
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKS);
 	}
-	TCP_CHECK_TIMER(sk);
 
 out:
 	if (tcp_memory_pressure)
@@ -481,7 +480,6 @@ static void tcp_write_timer(unsigned long data)
 		tcp_probe_timer(sk);
 		break;
 	}
-	TCP_CHECK_TIMER(sk);
 
 out:
 	sk_mem_reclaim(sk);
@@ -589,7 +587,6 @@ static void tcp_keepalive_timer (unsigned long data)
 		elapsed = keepalive_time_when(tp) - elapsed;
 	}
 
-	TCP_CHECK_TIMER(sk);
 	sk_mem_reclaim(sk);
 
 resched:
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index d6954e3..1d0ab55 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1636,10 +1636,8 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
 		opt_skb = skb_clone(skb, GFP_ATOMIC);
 
 	if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
-		TCP_CHECK_TIMER(sk);
 		if (tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len))
 			goto reset;
-		TCP_CHECK_TIMER(sk);
 		if (opt_skb)
 			goto ipv6_pktoptions;
 		return 0;
@@ -1667,10 +1665,8 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
 		}
 	}
 
-	TCP_CHECK_TIMER(sk);
 	if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb), skb->len))
 		goto reset;
-	TCP_CHECK_TIMER(sk);
 	if (opt_skb)
 		goto ipv6_pktoptions;
 	return 0;
-- 
1.6.3.3

^ permalink raw reply related

* [PATCH]tcp: document tcp_max_ssthresh (Limited Slow-Start)
From: Shan Wei @ 2011-02-20  7:52 UTC (permalink / raw)
  To: Ilpo Järvinen, David Miller, Network-Maillist, jheffner

From: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

Base on Ilpo's patch about documenting tcp_max_ssthresh.
(see http://marc.info/?l=linux-netdev&m=117950581307310&w=2)

According to errata of RFC3742, fix the number of segments increased 
during RTT time. 

Just to state the occasion to use this parameter, But
about how to set parameter value, maybe some others can do it. 


Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
---
 Documentation/networking/ip-sysctl.txt |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index ac3b4a7..ea78aac 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -280,6 +280,17 @@ tcp_max_orphans - INTEGER
 	more aggressively. Let me to remind again: each orphan eats
 	up to ~64K of unswappable memory.
 
+tcp_max_ssthresh - INTEGER
+	Limited Slow-Start for TCP with large congestion windows (cwnd) defined in
+	RFC3742. Limited slow-start is a mechanism to limit growth of the cwnd
+	on the region where cwnd is larger than tcp_max_ssthresh. TCP increases cwnd
+	by at most tcp_max_ssthresh segments, and by at least tcp_max_ssthresh/2
+	segments per RTT when the cwnd is above tcp_max_ssthresh.
+	If TCP connection increased cwnd to thousands (or tens of thousands) segments,
+	and thousands of packets were being dropped during slow-start, you can set
+	tcp_max_ssthresh to improve performance for new TCP connection.
+	Default: 0 (off)
+
 tcp_max_syn_backlog - INTEGER
 	Maximal number of remembered connection requests, which are
 	still did not receive an acknowledgment from connecting client.
-- 
1.6.3.3

^ permalink raw reply related

* THANKS AND TREAT AS URGENT PLEASE
From: Mr Dubeam Solion @ 2011-02-20  4:47 UTC (permalink / raw)



Dear Friend,

Good day to you, I know that my message will come to you as a surprise, but never mind, I am Mr.Solion DUBEAM, the credit officer in BOA Bank of Africa here in my country Burkina-Faso west Africa, In my department here in the bank I discover an abandon sum of $12.5Million United State Dollars, that belong to one of our biggest customer here in this bank, who died years ago in a plane crash with his family, I contacted you so that you will help me see that the total sum of $12.5Million will be transfer into your account in your country.

And after the successful transfer i will come over to your country to meet you, and we shall share 50% for you while 50% will be for me, if you agree to help me, I will like you to get back to me immediately

You should call me as soon as you reply my mail so that I can check my mailing box and give you the details about this deal +226 78708420

Thanks and have a nice day.
Mr.Solion DUBEAM
tele: +226 78708420


      

^ permalink raw reply

* Re: [PATCH 0/2] netfilter: netfilter fixes for 2.6.38
From: David Miller @ 2011-02-20  3:01 UTC (permalink / raw)
  To: kaber; +Cc: netfilter-devel, netdev
In-Reply-To: <1298130065-14205-1-git-send-email-kaber@trash.net>

From: kaber@trash.net
Date: Sat, 19 Feb 2011 16:41:03 +0100

> the following patches for two netfilter bugs:
> 
> - an oops in nfnetlink_log in combination with TPROXY when a socket
>   in TIME-WAIT state is assigned to skb->sk, patch from Florian Westphal
> 
> - incorrect printing of the MAC header in the ip6t_LOG target,
>   from Joerg Marx
> 
> Please apply or pull from:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6.git master

Pulled, thanks Patrick.

^ permalink raw reply

* Re: [net-2.6 PATCH v2] net: dcb: match dcb_app protocol field with 802.1Qaz spec
From: David Miller @ 2011-02-20  3:00 UTC (permalink / raw)
  To: john.r.fastabend; +Cc: shmulikr, netdev
In-Reply-To: <20110218233017.7419.32447.stgit@jf-dev1-dcblab>

From: John Fastabend <john.r.fastabend@intel.com>
Date: Fri, 18 Feb 2011 15:30:17 -0800

> The dcb_app protocol field is a __u32 however the 802.1Qaz
> specification defines it as a 16 bit field. This patch brings
> the structure inline with the spec making it a __u16.
> 
> CC: Shmulik Ravid <shmulikr@broadcom.com>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>

Applied.

^ permalink raw reply

* Re: [PATCH] tcp: fix inet_twsk_deschedule()
From: David Miller @ 2011-02-20  2:59 UTC (permalink / raw)
  To: eric.dumazet
  Cc: ebiederm, acme, torvalds, mhocko, mingo, linux-mm, linux-kernel,
	netdev, xemul, daniel.lezcano
In-Reply-To: <1298104556.8559.21.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 19 Feb 2011 09:35:56 +0100

> [PATCH] tcp: fix inet_twsk_deschedule()
> 
> Eric W. Biederman reported a lockdep splat in inet_twsk_deschedule()
> 
> This is caused by inet_twsk_purge(), run from process context,
> and commit 575f4cd5a5b6394577 (net: Use rcu lookups in inet_twsk_purge.)
> removed the BH disabling that was necessary.
> 
> Add the BH disabling but fine grained, right before calling
> inet_twsk_deschedule(), instead of whole function.
> 
> With help from Linus Torvalds and Eric W. Biederman
> 
> Reported-by: Eric W. Biederman <ebiederm@xmission.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Daniel Lezcano <daniel.lezcano@free.fr>
> CC: Pavel Emelyanov <xemul@openvz.org>
> CC: Arnaldo Carvalho de Melo <acme@redhat.com>
> CC: stable <stable@kernel.org> (# 2.6.33+)

Applied, thanks Eric.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH] connector: Convert char *name to const char *name
From: Joe Perches @ 2011-02-19 23:45 UTC (permalink / raw)
  To: Javier Martinez Canillas, Evgeniy Polyakov
  Cc: Greg Kroah-Hartman, devel, K. Y. Srinivasan, netdev
In-Reply-To: <1298151834-13141-1-git-send-email-martinez.javier@gmail.com>

Allow more const declarations.

Signed-off-by: Joe Perches <joe@perches.com>

---

Better to change the declarations and uses as this argument
is not modified.

On Sat, 2011-02-19 at 22:43 +0100, Javier Martinez Canillas wrote:
> cn_add_callback() defines is second argument as a char * but a
> const char * was supplied instead. So the compiler generates a  
> compile warning.

 drivers/connector/cn_queue.c  |    7 ++++---
 drivers/connector/connector.c |    2 +-
 include/linux/connector.h     |    9 ++++++---
 3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/connector/cn_queue.c b/drivers/connector/cn_queue.c
index 81270d2..55653ab 100644
--- a/drivers/connector/cn_queue.c
+++ b/drivers/connector/cn_queue.c
@@ -48,7 +48,7 @@ void cn_queue_wrapper(struct work_struct *work)
 }
 
 static struct cn_callback_entry *
-cn_queue_alloc_callback_entry(char *name, struct cb_id *id,
+cn_queue_alloc_callback_entry(const char *name, struct cb_id *id,
 			      void (*callback)(struct cn_msg *, struct netlink_skb_parms *))
 {
 	struct cn_callback_entry *cbq;
@@ -78,7 +78,8 @@ int cn_cb_equal(struct cb_id *i1, struct cb_id *i2)
 	return ((i1->idx == i2->idx) && (i1->val == i2->val));
 }
 
-int cn_queue_add_callback(struct cn_queue_dev *dev, char *name, struct cb_id *id,
+int cn_queue_add_callback(struct cn_queue_dev *dev, const char *name,
+			  struct cb_id *id,
 			  void (*callback)(struct cn_msg *, struct netlink_skb_parms *))
 {
 	struct cn_callback_entry *cbq, *__cbq;
@@ -135,7 +136,7 @@ void cn_queue_del_callback(struct cn_queue_dev *dev, struct cb_id *id)
 	}
 }
 
-struct cn_queue_dev *cn_queue_alloc_dev(char *name, struct sock *nls)
+struct cn_queue_dev *cn_queue_alloc_dev(const char *name, struct sock *nls)
 {
 	struct cn_queue_dev *dev;
 
diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c
index 05117f1..f7554de 100644
--- a/drivers/connector/connector.c
+++ b/drivers/connector/connector.c
@@ -205,7 +205,7 @@ static void cn_rx_skb(struct sk_buff *__skb)
  *
  * May sleep.
  */
-int cn_add_callback(struct cb_id *id, char *name,
+int cn_add_callback(struct cb_id *id, const char *name,
 		    void (*callback)(struct cn_msg *, struct netlink_skb_parms *))
 {
 	int err;
diff --git a/include/linux/connector.h b/include/linux/connector.h
index 2e9759c..a9edf24 100644
--- a/include/linux/connector.h
+++ b/include/linux/connector.h
@@ -129,14 +129,17 @@ struct cn_dev {
 	struct cn_queue_dev *cbdev;
 };
 
-int cn_add_callback(struct cb_id *, char *, void (*callback) (struct cn_msg *, struct netlink_skb_parms *));
+int cn_add_callback(struct cb_id *id, const char *name,
+		    void (*callback)(struct cn_msg *, struct netlink_skb_parms *));
 void cn_del_callback(struct cb_id *);
 int cn_netlink_send(struct cn_msg *, u32, gfp_t);
 
-int cn_queue_add_callback(struct cn_queue_dev *dev, char *name, struct cb_id *id, void (*callback)(struct cn_msg *, struct netlink_skb_parms *));
+int cn_queue_add_callback(struct cn_queue_dev *dev, const char *name,
+			  struct cb_id *id,
+			  void (*callback)(struct cn_msg *, struct netlink_skb_parms *));
 void cn_queue_del_callback(struct cn_queue_dev *dev, struct cb_id *id);
 
-struct cn_queue_dev *cn_queue_alloc_dev(char *name, struct sock *);
+struct cn_queue_dev *cn_queue_alloc_dev(const char *name, struct sock *);
 void cn_queue_free_dev(struct cn_queue_dev *dev);
 
 int cn_cb_equal(struct cb_id *, struct cb_id *);



^ permalink raw reply related

* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Nicolas de Pesloüan @ 2011-02-19 20:27 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jay Vosburgh, David Miller, kaber, eric.dumazet, netdev,
	shemminger, andy
In-Reply-To: <20110219134645.GF2782@psychotron.redhat.com>

Le 19/02/2011 14:46, Jiri Pirko a écrit :
> Sat, Feb 19, 2011 at 02:18:00PM CET, nicolas.2p.debian@gmail.com wrote:
>> Le 19/02/2011 12:28, Jiri Pirko a écrit :
>>> Sat, Feb 19, 2011 at 12:08:31PM CET, jpirko@redhat.com wrote:
>>>> Sat, Feb 19, 2011 at 11:56:23AM CET, nicolas.2p.debian@gmail.com wrote:
>>>>> Le 19/02/2011 09:05, Jiri Pirko a écrit :
>>>>>> This patch converts bonding to use rx_handler. Results in cleaner
>>>>>> __netif_receive_skb() with much less exceptions needed. Also
>>>>>> bond-specific work is moved into bond code.
>>>>>>
>>>>>> Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>>>>>>
>>>>>> v1->v2:
>>>>>>          using skb_iif instead of new input_dev to remember original
>>>>>> 	device
>>>>>> v2->v3:
>>>>>> 	set orig_dev = skb->dev if skb_iif is set
>>>>>>
>>>>>
>>>>> Why do we need to let the rx_handlers call netif_rx() or __netif_receive_skb()?
>>>>>
>>>>> Bonding used to be handled with very few overhead, simply replacing
>>>>> skb->dev with skb->dev->master. Time has passed and we eventually
>>>>> added many special processing for bonding into __netif_receive_skb(),
>>>>> but the overhead remained very light.
>>>>>
>>>>> Calling netif_rx() (or __netif_receive_skb()) to allow nesting would probably lead to some overhead.
>>>>>
>>>>> Can't we, instead, loop inside __netif_receive_skb(), and deliver
>>>>> whatever need to be delivered, to whoever need, inside the loop ?
>>>>>
>>>>> rx_handler = rcu_dereference(skb->dev->rx_handler);
>>>>> while (rx_handler) {
>>>>> 	/* ...  */
>>>>> 	orig_dev = skb->dev;
>>>>> 	skb = rx_handler(skb);
>>>>> 	/* ... */
>>>>> 	rx_handler = (skb->dev != orig_dev) ? rcu_dereference(skb->dev->rx_handler) : NULL;
>>>>> }
>>>>>
>>>>> This would reduce the overhead, while still allowing nesting: vlan on
>>>>> top on bonding, bridge on top on bonding, ...
>>>>
>>>> I see your point. Makes sense to me. But the loop would have to include
>>>> at least processing of ptype_all too. I'm going to cook a follow-up
>>>> patch.
>>>>
>>>
>>> DRAFT (doesn't modify rx_handlers):
>>>
>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>> index 4ebf7fe..e5dba47 100644
>>> --- a/net/core/dev.c
>>> +++ b/net/core/dev.c
>>> @@ -3115,6 +3115,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>   {
>>>   	struct packet_type *ptype, *pt_prev;
>>>   	rx_handler_func_t *rx_handler;
>>> +	struct net_device *dev;
>>>   	struct net_device *orig_dev;
>>>   	struct net_device *null_or_dev;
>>>   	int ret = NET_RX_DROP;
>>> @@ -3129,7 +3130,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>   	if (netpoll_receive_skb(skb))
>>>   		return NET_RX_DROP;
>>>
>>> -	__this_cpu_inc(softnet_data.processed);
>>> +	skb->skb_iif = skb->dev->ifindex;
>>> +	orig_dev = skb->dev;
>>
>> orig_dev should be set inside the loop, to reflect "previously
>> crossed device", while following the path:
>>
>> eth0 ->  bond0 ->  br0.
>>
>> First step inside loop:
>>
>> orig_dev = eth0
>> skb->dev = bond0 (at the end of the loop).
>>
>> Second step inside loop:
>>
>> orig_dev = bond0
>> skb->dev = br0 (et the end of the loop).
>>
>> This would allow for exact match delivery to bond0 if someone bind there.
>>
>>> +
>>>   	skb_reset_network_header(skb);
>>>   	skb_reset_transport_header(skb);
>>>   	skb->mac_len = skb->network_header - skb->mac_header;
>>> @@ -3138,12 +3141,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>
>>>   	rcu_read_lock();
>>>
>>> -	if (!skb->skb_iif) {
>>> -		skb->skb_iif = skb->dev->ifindex;
>>> -		orig_dev = skb->dev;
>>> -	} else {
>>> -		orig_dev = dev_get_by_index_rcu(dev_net(skb->dev), skb->skb_iif);
>>> -	}
>>
>> I like the fact that it removes the above part.
>>
>>> +another_round:
>>> +	__this_cpu_inc(softnet_data.processed);
>>> +	dev = skb->dev;
>>>
>>>   #ifdef CONFIG_NET_CLS_ACT
>>>   	if (skb->tc_verd&   TC_NCLS) {
>>> @@ -3153,7 +3153,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>   #endif
>>>
>>>   	list_for_each_entry_rcu(ptype,&ptype_all, list) {
>>> -		if (!ptype->dev || ptype->dev == skb->dev) {
>>> +		if (!ptype->dev || ptype->dev == dev) {
>>>   			if (pt_prev)
>>>   				ret = deliver_skb(skb, pt_prev, orig_dev);
>>>   			pt_prev = ptype;
>>
>> Inside the loop, we should only do exact match delivery, for
>> &ptype_all and for&ptype_base[ntohs(type)&  PTYPE_HASH_MASK]:
>>
>>         list_for_each_entry_rcu(ptype,&ptype_all, list) {
>> -               if (!ptype->dev || ptype->dev == dev) {
>> +               if (ptype->dev == dev) {
>>                         if (pt_prev)
>>                                 ret = deliver_skb(skb, pt_prev, orig_dev);
>>                         pt_prev = ptype;
>>                 }
>>         }
>>
>>
>>         list_for_each_entry_rcu(ptype,
>>                         &ptype_base[ntohs(type)&  PTYPE_HASH_MASK], list) {
>>                 if (ptype->type == type&&
>> -                   (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
>> +                   (ptype->dev == skb->dev)) {
>>                         if (pt_prev)
>>                                 ret = deliver_skb(skb, pt_prev, orig_dev);
>>                         pt_prev = ptype;
>>                 }
>>         }
>>
>> After leaving the loop, we can do wilcard delivery, if skb is not NULL.
>>
>>         list_for_each_entry_rcu(ptype,&ptype_all, list) {
>> -               if (!ptype->dev || ptype->dev == dev) {
>> +               if (!ptype->dev) {
>>                         if (pt_prev)
>>                                 ret = deliver_skb(skb, pt_prev, orig_dev);
>>                         pt_prev = ptype;
>>                }
>>         }
>>
>>
>>         list_for_each_entry_rcu(ptype,
>>                         &ptype_base[ntohs(type)&  PTYPE_HASH_MASK], list) {
>> -               if (ptype->type == type&&
>> -                   (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
>> +		if (ptype->type == type&&  !ptype->dev) {
>>                         if (pt_prev)
>>                                 ret = deliver_skb(skb, pt_prev, orig_dev);
>>                         pt_prev = ptype;
>>                 }
>>         }
>>
>> This would reduce the number of tests inside the
>> list_for_each_entry_rcu() loops. And because we match only ptype->dev
>> == dev inside the loop and !ptype->dev outside the loop, this should
>> avoid duplicate delivery.
>
> Would you care to put this into patch so I can see the whole picture?
> Thanks.

Here is what I have in mind. It is based on your previous DRAFT patch, and don't modify rx_handlers yet.

Only compile tested !!

I don't know if every pieces are at the right place. I wonder what to do with CONFIG_NET_CLS_ACT 
part, that currently is between ptype_all and ptype_base processing.

Anyway, the general idea is there.

	Nicolas.

  net/core/dev.c |   70 ++++++++++++++++++++++++++++++++++++++++++++++++--------
  1 files changed, 60 insertions(+), 10 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index e5dba47..7e007a9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3117,7 +3117,6 @@ static int __netif_receive_skb(struct sk_buff *skb)
  	rx_handler_func_t *rx_handler;
  	struct net_device *dev;
  	struct net_device *orig_dev;
-	struct net_device *null_or_dev;
  	int ret = NET_RX_DROP;
  	__be16 type;

@@ -3130,9 +3129,6 @@ static int __netif_receive_skb(struct sk_buff *skb)
  	if (netpoll_receive_skb(skb))
  		return NET_RX_DROP;

-	skb->skb_iif = skb->dev->ifindex;
-	orig_dev = skb->dev;
-
  	skb_reset_network_header(skb);
  	skb_reset_transport_header(skb);
  	skb->mac_len = skb->network_header - skb->mac_header;
@@ -3143,6 +3139,8 @@ static int __netif_receive_skb(struct sk_buff *skb)

  another_round:
  	__this_cpu_inc(softnet_data.processed);
+	skb->skb_iif = skb->dev->ifindex;
+	orig_dev = skb->dev;
  	dev = skb->dev;

  #ifdef CONFIG_NET_CLS_ACT
@@ -3152,8 +3150,13 @@ another_round:
  	}
  #endif

+	/*
+	 * Deliver to ptype_all protocol handlers that match current dev.
+	 * This happens before rx_handler is given a chance to change skb->dev.
+	 */
+
  	list_for_each_entry_rcu(ptype, &ptype_all, list) {
-		if (!ptype->dev || ptype->dev == dev) {
+		if (ptype->dev == dev) {
  			if (pt_prev)
  				ret = deliver_skb(skb, pt_prev, orig_dev);
  			pt_prev = ptype;
@@ -3167,6 +3170,31 @@ another_round:
  ncls:
  #endif

+	/*
+	 * Deliver to ptype_base protocol handlers that match current dev.
+	 * This happens before rx_handler is given a chance to change skb->dev.
+	 */
+
+	type = skb->protocol;
+	list_for_each_entry_rcu(ptype,
+			&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
+		if (ptype->type == type && ptype->dev == skb->dev) {
+			if (pt_prev)
+				ret = deliver_skb(skb, pt_prev, orig_dev);
+			pt_prev = ptype;
+		}
+	}
+
+	/*
+	 * Call rx_handler for current device.
+	 * If rx_handler return NULL, skip wilcard protocol handler delivery.
+	 * Else, if skb->dev changed, restart the whole delivery process, to
+	 * allow for device nesting.
+	 *
+	 * Warning:
+	 * rx_handlers must kfree_skb(skb) if they return NULL.
+	 */
+
  	rx_handler = rcu_dereference(dev->rx_handler);
  	if (rx_handler) {
  		if (pt_prev) {
@@ -3176,10 +3204,15 @@ ncls:
  		skb = rx_handler(skb);
  		if (!skb)
  			goto out;
-		if (dev != skb->dev)
+		if (skb->dev != dev)
  			goto another_round;
  	}

+	/*
+	 * FIXME: The part below should use rx_handler instead of being hard
+	 * coded here.
+	 */
+
  	if (vlan_tx_tag_present(skb)) {
  		if (pt_prev) {
  			ret = deliver_skb(skb, pt_prev, orig_dev);
@@ -3192,16 +3225,33 @@ ncls:
  			goto out;
  	}

+	/*
+	 * FIXME: Can't this be moved into the rx_handler for bonding,
+	 * or into a futur rx_handler for vlan?
+	 */
+
  	vlan_on_bond_hook(skb);

-	/* deliver only exact match when indicated */
-	null_or_dev = skb->deliver_no_wcard ? skb->dev : NULL;
+	/*
+	 * Deliver to wildcard ptype_all protocol handlers.
+	 */
+
+	list_for_each_entry_rcu(ptype, &ptype_all, list) {
+		if (!ptype->dev) {
+			if (pt_prev)
+				ret = deliver_skb(skb, pt_prev, orig_dev);
+			pt_prev = ptype;
+		}
+	}
+
+	/*
+	 * Deliver to wildcard ptype_all protocol handlers.
+	 */

  	type = skb->protocol;
  	list_for_each_entry_rcu(ptype,
  			&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
-		if (ptype->type == type &&
-		    (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
+		if (ptype->type == type && !ptype->dev) {
  			if (pt_prev)
  				ret = deliver_skb(skb, pt_prev, orig_dev);
  			pt_prev = ptype;
-- 
1.7.2.3




^ permalink raw reply related

* Business Proposal for you!!!
From: Mr Peter Wong @ 2011-02-19 16:15 UTC (permalink / raw)


Dear Friend,

I am Mr.Peter Wong  Non Executive Director of Hang Seng Bank Ltd, Hong
Kong.I have a deceased client funds in my bank of $44.5MUSD and i need you
to front  as beneficiary,your benefit is 50% of the total funds.If you are
interested  contact me for more information via peter.wong133@hotmail.com

Sincerely
Mr.Peter



^ permalink raw reply

* [PATCH] ping: fix building on older systems
From: Mike Frysinger @ 2011-02-19 17:56 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki; +Cc: netdev

The SO_MARK define is somewhat recent (linux-2.6.25), so check for its
existence before we try to use it.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
---
 ping_common.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/ping_common.c b/ping_common.c
index 82320b1..b4b26b2 100644
--- a/ping_common.c
+++ b/ping_common.c
@@ -475,6 +475,7 @@ void setup(int icmp_sock)
 			fprintf(stderr, "Warning: no SO_TIMESTAMP support, falling back to SIOCGSTAMP\n");
 	}
 #endif
+#ifdef SO_MARK
 	if (options & F_MARK) {
 		if (setsockopt(icmp_sock, SOL_SOCKET, SO_MARK,
 				&mark, sizeof(mark)) == -1) {
@@ -484,6 +485,7 @@ void setup(int icmp_sock)
 			fprintf(stderr, "Warning: Failed to set mark %d\n", mark);
 		}
 	}
+#endif
 
 	/* Set some SNDTIMEO to prevent blocking forever
 	 * on sends, when device is too slow or stalls. Just put limit
-- 
1.7.4.1


^ permalink raw reply related

* [PATCH 3/3] tracepath: fix typo in -l error message
From: Mike Frysinger @ 2011-02-19 17:51 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki; +Cc: netdev, Jeroen Roovers
In-Reply-To: <1298137889-23969-1-git-send-email-vapier@gentoo.org>

From: Jeroen Roovers <jer@gentoo.org>

Signed-off-by: Jeroen Roovers <jer@gentoo.org>
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
---
 tracepath.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/tracepath.c b/tracepath.c
index 0a14b1b..90e1b28 100644
--- a/tracepath.c
+++ b/tracepath.c
@@ -306,7 +306,7 @@ main(int argc, char **argv)
 			break;
 		case 'l':
 			if ((mtu = atoi(optarg)) <= overhead) {
-				fprintf(stderr, "Error: length must be >= %d\n", overhead);
+				fprintf(stderr, "Error: length must be > %d\n", overhead);
 				exit(1);
 			}
 			break;
-- 
1.7.4.1


^ permalink raw reply related

* [PATCH 2/3] fix up strict-aliasing warnings
From: Mike Frysinger @ 2011-02-19 17:51 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki; +Cc: netdev
In-Reply-To: <1298137889-23969-1-git-send-email-vapier@gentoo.org>

Current build of some tools results in gcc warning about strict-aliasing
violations.  So change those freaky casts to memcpy's.  When the pointer
types work out, gcc will optimize this away anyways.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
---
 ping6.c      |   13 +++++++++----
 tracepath.c  |    2 +-
 tracepath6.c |    2 +-
 3 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/ping6.c b/ping6.c
index c5ff881..86f9216 100644
--- a/ping6.c
+++ b/ping6.c
@@ -1104,18 +1104,21 @@ int build_niquery(__u8 *_nih)
 {
 	struct ni_hdr *nih;
 	int cc;
+	__u16 this_nonce;
 
 	nih = (struct ni_hdr *)_nih;
 	nih->ni_cksum = 0;
 
-	CLR(ntohs((*(__u16*)(nih->ni_nonce))) % mx_dup_ck);
+	memcpy(&this_nonce, &nih->ni_nonce, sizeof(this_nonce));
+	CLR(ntohs(this_nonce) % mx_dup_ck);
 
 	nih->ni_type = ICMPV6_NI_QUERY;
 	cc = sizeof(*nih);
 	datalen = 0;
 
 	memcpy(nih->ni_nonce, ni_nonce, sizeof(nih->ni_nonce));
-	*(__u16*)(nih->ni_nonce) = htons(ntransmitted + 1);
+	this_nonce = htons(ntransmitted + 1);
+	memcpy(&nih->ni_nonce, &this_nonce, sizeof(this_nonce));
 
 	nih->ni_code = ni_subject_type;
 	nih->ni_qtype = htons(ni_query);
@@ -1331,7 +1334,7 @@ parse_reply(struct msghdr *msg, int cc, void *addr, struct timeval *tv)
 #endif
 			if (c->cmsg_len < CMSG_LEN(sizeof(int)))
 				continue;
-			hops = *(int*)CMSG_DATA(c);
+			memcpy(&hops, CMSG_DATA(c), sizeof(int));
 		}
 	}
 
@@ -1355,7 +1358,9 @@ parse_reply(struct msghdr *msg, int cc, void *addr, struct timeval *tv)
 			return 0;
 	} else if (icmph->icmp6_type == ICMPV6_NI_REPLY) {
 		struct ni_hdr *nih = (struct ni_hdr *)icmph;
-		__u16 seq = ntohs(*(__u16 *)nih->ni_nonce);
+		__u16 seq;
+		memcpy(&seq, &nih->ni_nonce, sizeof(seq));
+		seq = ntohs(seq);
 		if (memcmp(&nih->ni_nonce[2], &ni_nonce[2], sizeof(ni_nonce) - sizeof(__u16)))
 			return 1;
 		if (gather_statistics((__u8*)icmph, sizeof(*icmph), cc,
diff --git a/tracepath.c b/tracepath.c
index ca84a69..0a14b1b 100644
--- a/tracepath.c
+++ b/tracepath.c
@@ -142,7 +142,7 @@ restart:
 			if (cmsg->cmsg_type == IP_RECVERR) {
 				e = (struct sock_extended_err *) CMSG_DATA(cmsg);
 			} else if (cmsg->cmsg_type == IP_TTL) {
-				rethops = *(int*)CMSG_DATA(cmsg);
+				memcpy(&rethops, CMSG_DATA(cmsg), sizeof(int));
 			} else {
 				printf("cmsg:%d\n ", cmsg->cmsg_type);
 			}
diff --git a/tracepath6.c b/tracepath6.c
index 5c2db8f..77a3563 100644
--- a/tracepath6.c
+++ b/tracepath6.c
@@ -170,7 +170,7 @@ restart:
 #ifdef IPV6_2292HOPLIMIT
 			case IPV6_2292HOPLIMIT:
 #endif
-				rethops = *(int*)CMSG_DATA(cmsg);
+				memcpy(&rethops, CMSG_DATA(cmsg), sizeof(int));
 				break;
 			default:
 				printf("cmsg6:%d\n ", cmsg->cmsg_type);
-- 
1.7.4.1


^ permalink raw reply related

* [PATCH 1/3] tracepath: re-use printf return in print_host
From: Mike Frysinger @ 2011-02-19 17:51 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki; +Cc: netdev

The printf funcs take an int for field widths, not a size_t.  Also, since
the printf funcs already return the length of chars displayed, use that
value instead of re-calculating the length with strlen.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
---
 tracepath.c  |   11 ++++-------
 tracepath6.c |   11 ++++-------
 2 files changed, 8 insertions(+), 14 deletions(-)

diff --git a/tracepath.c b/tracepath.c
index 81c22e9..ca84a69 100644
--- a/tracepath.c
+++ b/tracepath.c
@@ -68,13 +68,10 @@ void data_wait(int fd)
 
 void print_host(const char *a, const char *b, int both)
 {
-	size_t plen = 0;
-	printf("%s", a);
-	plen = strlen(a);
-	if (both) {
-		printf(" (%s)", b);
-		plen += strlen(b) + 3;
-	}
+	int plen;
+	plen = printf("%s", a);
+	if (both)
+		plen += printf(" (%s)", b);
 	if (plen >= HOST_COLUMN_SIZE)
 		plen = HOST_COLUMN_SIZE - 1;
 	printf("%*s", HOST_COLUMN_SIZE - plen, "");
diff --git a/tracepath6.c b/tracepath6.c
index 5cc7424..5c2db8f 100644
--- a/tracepath6.c
+++ b/tracepath6.c
@@ -80,13 +80,10 @@ void data_wait(int fd)
 
 void print_host(const char *a, const char *b, int both)
 {
-	size_t plen = 0;
-	printf("%s", a);
-	plen = strlen(a);
-	if (both) {
-		printf(" (%s)", b);
-		plen += strlen(b) + 3;
-	}
+	int plen;
+	plen = printf("%s", a);
+	if (both)
+		plen += printf(" (%s)", b);
 	if (plen >= HOST_COLUMN_SIZE)
 		plen = HOST_COLUMN_SIZE - 1;
 	printf("%*s", HOST_COLUMN_SIZE - plen, "");
-- 
1.7.4.1


^ permalink raw reply related

* [PATCH 0/2] netfilter: netfilter fixes for 2.6.38
From: kaber @ 2011-02-19 15:41 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev

Hi Dave,

the following patches for two netfilter bugs:

- an oops in nfnetlink_log in combination with TPROXY when a socket
  in TIME-WAIT state is assigned to skb->sk, patch from Florian Westphal

- incorrect printing of the MAC header in the ip6t_LOG target,
  from Joerg Marx

Please apply or pull from:

git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6.git master

Thanks!


^ permalink raw reply

* [PATCH 2/2] netfilter: ip6t_LOG: fix a flaw in printing the MAC
From: kaber @ 2011-02-19 15:41 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev
In-Reply-To: <1298130065-14205-1-git-send-email-kaber@trash.net>

From: Joerg Marx <joerg.marx@secunet.com>

The flaw was in skipping the second byte in MAC header due to increasing
the pointer AND indexed access starting at '1'.

Signed-off-by: Joerg Marx <joerg.marx@secunet.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/ipv6/netfilter/ip6t_LOG.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/netfilter/ip6t_LOG.c b/net/ipv6/netfilter/ip6t_LOG.c
index 09c8889..de33803 100644
--- a/net/ipv6/netfilter/ip6t_LOG.c
+++ b/net/ipv6/netfilter/ip6t_LOG.c
@@ -410,7 +410,7 @@ fallback:
 		if (p != NULL) {
 			sb_add(m, "%02x", *p++);
 			for (i = 1; i < len; i++)
-				sb_add(m, ":%02x", p[i]);
+				sb_add(m, ":%02x", *p++);
 		}
 		sb_add(m, " ");
 
-- 
1.7.2.3


^ permalink raw reply related

* [PATCH 1/2] netfilter: tproxy: do not assign timewait sockets to skb->sk
From: kaber @ 2011-02-19 15:41 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev
In-Reply-To: <1298130065-14205-1-git-send-email-kaber@trash.net>

From: Florian Westphal <fwestphal@astaro.com>

Assigning a socket in timewait state to skb->sk can trigger
kernel oops, e.g. in nfnetlink_log, which does:

if (skb->sk) {
        read_lock_bh(&skb->sk->sk_callback_lock);
        if (skb->sk->sk_socket && skb->sk->sk_socket->file) ...

in the timewait case, accessing sk->sk_callback_lock and sk->sk_socket
is invalid.

Either all of these spots will need to add a test for sk->sk_state != TCP_TIME_WAIT,
or xt_TPROXY must not assign a timewait socket to skb->sk.

This does the latter.

If a TW socket is found, assign the tproxy nfmark, but skip the skb->sk assignment,
thus mimicking behaviour of a '-m socket .. -j MARK/ACCEPT' re-routing rule.

The 'SYN to TW socket' case is left unchanged -- we try to redirect to the
listener socket.

Cc: Balazs Scheidler <bazsi@balabit.hu>
Cc: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Florian Westphal <fwestphal@astaro.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 include/net/netfilter/nf_tproxy_core.h |   12 +-----------
 net/netfilter/nf_tproxy_core.c         |   27 ++++++++++++---------------
 net/netfilter/xt_TPROXY.c              |   22 ++++++++++++++++++++--
 net/netfilter/xt_socket.c              |   13 +++++++++++--
 4 files changed, 44 insertions(+), 30 deletions(-)

diff --git a/include/net/netfilter/nf_tproxy_core.h b/include/net/netfilter/nf_tproxy_core.h
index cd85b3b..e505358 100644
--- a/include/net/netfilter/nf_tproxy_core.h
+++ b/include/net/netfilter/nf_tproxy_core.h
@@ -201,18 +201,8 @@ nf_tproxy_get_sock_v6(struct net *net, const u8 protocol,
 }
 #endif
 
-static inline void
-nf_tproxy_put_sock(struct sock *sk)
-{
-	/* TIME_WAIT inet sockets have to be handled differently */
-	if ((sk->sk_protocol == IPPROTO_TCP) && (sk->sk_state == TCP_TIME_WAIT))
-		inet_twsk_put(inet_twsk(sk));
-	else
-		sock_put(sk);
-}
-
 /* assign a socket to the skb -- consumes sk */
-int
+void
 nf_tproxy_assign_sock(struct sk_buff *skb, struct sock *sk);
 
 #endif
diff --git a/net/netfilter/nf_tproxy_core.c b/net/netfilter/nf_tproxy_core.c
index 4d87bef..474d621 100644
--- a/net/netfilter/nf_tproxy_core.c
+++ b/net/netfilter/nf_tproxy_core.c
@@ -28,26 +28,23 @@ nf_tproxy_destructor(struct sk_buff *skb)
 	skb->destructor = NULL;
 
 	if (sk)
-		nf_tproxy_put_sock(sk);
+		sock_put(sk);
 }
 
 /* consumes sk */
-int
+void
 nf_tproxy_assign_sock(struct sk_buff *skb, struct sock *sk)
 {
-	bool transparent = (sk->sk_state == TCP_TIME_WAIT) ?
-				inet_twsk(sk)->tw_transparent :
-				inet_sk(sk)->transparent;
-
-	if (transparent) {
-		skb_orphan(skb);
-		skb->sk = sk;
-		skb->destructor = nf_tproxy_destructor;
-		return 1;
-	} else
-		nf_tproxy_put_sock(sk);
-
-	return 0;
+	/* assigning tw sockets complicates things; most
+	 * skb->sk->X checks would have to test sk->sk_state first */
+	if (sk->sk_state == TCP_TIME_WAIT) {
+		inet_twsk_put(inet_twsk(sk));
+		return;
+	}
+
+	skb_orphan(skb);
+	skb->sk = sk;
+	skb->destructor = nf_tproxy_destructor;
 }
 EXPORT_SYMBOL_GPL(nf_tproxy_assign_sock);
 
diff --git a/net/netfilter/xt_TPROXY.c b/net/netfilter/xt_TPROXY.c
index 640678f..dcfd57e 100644
--- a/net/netfilter/xt_TPROXY.c
+++ b/net/netfilter/xt_TPROXY.c
@@ -33,6 +33,20 @@
 #include <net/netfilter/nf_tproxy_core.h>
 #include <linux/netfilter/xt_TPROXY.h>
 
+static bool tproxy_sk_is_transparent(struct sock *sk)
+{
+	if (sk->sk_state != TCP_TIME_WAIT) {
+		if (inet_sk(sk)->transparent)
+			return true;
+		sock_put(sk);
+	} else {
+		if (inet_twsk(sk)->tw_transparent)
+			return true;
+		inet_twsk_put(inet_twsk(sk));
+	}
+	return false;
+}
+
 static inline __be32
 tproxy_laddr4(struct sk_buff *skb, __be32 user_laddr, __be32 daddr)
 {
@@ -141,7 +155,7 @@ tproxy_tg4(struct sk_buff *skb, __be32 laddr, __be16 lport,
 					   skb->dev, NFT_LOOKUP_LISTENER);
 
 	/* NOTE: assign_sock consumes our sk reference */
-	if (sk && nf_tproxy_assign_sock(skb, sk)) {
+	if (sk && tproxy_sk_is_transparent(sk)) {
 		/* This should be in a separate target, but we don't do multiple
 		   targets on the same rule yet */
 		skb->mark = (skb->mark & ~mark_mask) ^ mark_value;
@@ -149,6 +163,8 @@ tproxy_tg4(struct sk_buff *skb, __be32 laddr, __be16 lport,
 		pr_debug("redirecting: proto %hhu %pI4:%hu -> %pI4:%hu, mark: %x\n",
 			 iph->protocol, &iph->daddr, ntohs(hp->dest),
 			 &laddr, ntohs(lport), skb->mark);
+
+		nf_tproxy_assign_sock(skb, sk);
 		return NF_ACCEPT;
 	}
 
@@ -306,7 +322,7 @@ tproxy_tg6_v1(struct sk_buff *skb, const struct xt_action_param *par)
 					   par->in, NFT_LOOKUP_LISTENER);
 
 	/* NOTE: assign_sock consumes our sk reference */
-	if (sk && nf_tproxy_assign_sock(skb, sk)) {
+	if (sk && tproxy_sk_is_transparent(sk)) {
 		/* This should be in a separate target, but we don't do multiple
 		   targets on the same rule yet */
 		skb->mark = (skb->mark & ~tgi->mark_mask) ^ tgi->mark_value;
@@ -314,6 +330,8 @@ tproxy_tg6_v1(struct sk_buff *skb, const struct xt_action_param *par)
 		pr_debug("redirecting: proto %hhu %pI6:%hu -> %pI6:%hu, mark: %x\n",
 			 tproto, &iph->saddr, ntohs(hp->source),
 			 laddr, ntohs(lport), skb->mark);
+
+		nf_tproxy_assign_sock(skb, sk);
 		return NF_ACCEPT;
 	}
 
diff --git a/net/netfilter/xt_socket.c b/net/netfilter/xt_socket.c
index 00d6ae8..9cc4635 100644
--- a/net/netfilter/xt_socket.c
+++ b/net/netfilter/xt_socket.c
@@ -35,6 +35,15 @@
 #include <net/netfilter/nf_conntrack.h>
 #endif
 
+static void
+xt_socket_put_sk(struct sock *sk)
+{
+	if (sk->sk_state == TCP_TIME_WAIT)
+		inet_twsk_put(inet_twsk(sk));
+	else
+		sock_put(sk);
+}
+
 static int
 extract_icmp4_fields(const struct sk_buff *skb,
 		    u8 *protocol,
@@ -164,7 +173,7 @@ socket_match(const struct sk_buff *skb, struct xt_action_param *par,
 				       (sk->sk_state == TCP_TIME_WAIT &&
 					inet_twsk(sk)->tw_transparent));
 
-		nf_tproxy_put_sock(sk);
+		xt_socket_put_sk(sk);
 
 		if (wildcard || !transparent)
 			sk = NULL;
@@ -298,7 +307,7 @@ socket_mt6_v1(const struct sk_buff *skb, struct xt_action_param *par)
 				       (sk->sk_state == TCP_TIME_WAIT &&
 					inet_twsk(sk)->tw_transparent));
 
-		nf_tproxy_put_sock(sk);
+		xt_socket_put_sk(sk);
 
 		if (wildcard || !transparent)
 			sk = NULL;
-- 
1.7.2.3


^ permalink raw reply related

* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Nicolas de Pesloüan @ 2011-02-19 14:32 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jay Vosburgh, David Miller, kaber, eric.dumazet, netdev,
	shemminger, andy
In-Reply-To: <20110219134645.GF2782@psychotron.redhat.com>

Le 19/02/2011 14:46, Jiri Pirko a écrit :
> Sat, Feb 19, 2011 at 02:18:00PM CET, nicolas.2p.debian@gmail.com wrote:

[snip]
>> Inside the loop, we should only do exact match delivery, for
>> &ptype_all and for&ptype_base[ntohs(type)&  PTYPE_HASH_MASK]:
>>
>>         list_for_each_entry_rcu(ptype,&ptype_all, list) {
>> -               if (!ptype->dev || ptype->dev == dev) {
>> +               if (ptype->dev == dev) {
>>                         if (pt_prev)
>>                                 ret = deliver_skb(skb, pt_prev, orig_dev);
>>                         pt_prev = ptype;
>>                 }
>>         }
>>
>>
>>         list_for_each_entry_rcu(ptype,
>>                         &ptype_base[ntohs(type)&  PTYPE_HASH_MASK], list) {
>>                 if (ptype->type == type&&
>> -                   (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
>> +                   (ptype->dev == skb->dev)) {
>>                         if (pt_prev)
>>                                 ret = deliver_skb(skb, pt_prev, orig_dev);
>>                         pt_prev = ptype;
>>                 }
>>         }
>>
>> After leaving the loop, we can do wilcard delivery, if skb is not NULL.
>>
>>         list_for_each_entry_rcu(ptype,&ptype_all, list) {
>> -               if (!ptype->dev || ptype->dev == dev) {
>> +               if (!ptype->dev) {
>>                         if (pt_prev)
>>                                 ret = deliver_skb(skb, pt_prev, orig_dev);
>>                         pt_prev = ptype;
>>                }
>>         }
>>
>>
>>         list_for_each_entry_rcu(ptype,
>>                         &ptype_base[ntohs(type)&  PTYPE_HASH_MASK], list) {
>> -               if (ptype->type == type&&
>> -                   (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
>> +		if (ptype->type == type&&  !ptype->dev) {
>>                         if (pt_prev)
>>                                 ret = deliver_skb(skb, pt_prev, orig_dev);
>>                         pt_prev = ptype;
>>                 }
>>         }
>>
>> This would reduce the number of tests inside the
>> list_for_each_entry_rcu() loops. And because we match only ptype->dev
>> == dev inside the loop and !ptype->dev outside the loop, this should
>> avoid duplicate delivery.
>
> Would you care to put this into patch so I can see the whole picture?
> Thanks.

I will try.

	Nicolas.

^ permalink raw reply

* Advice on network driver design
From: Felix Radensky @ 2011-02-19 13:37 UTC (permalink / raw)
  To: netdev@vger.kernel.org

Hi,

I'm in the process of designing a network driver for a custom
hardware and would like to get some advice from linux network
gurus.

The host platform is Freescale P2020. The custom hardware is
FPGA with several TX FIFOs, single RX FIFO and a set of registers.
FPGA is connected to CPU via PCI-E. Host CPU DMA controller is used
to get packets to/from FIFOs. Each FIFO has its set of events, 
generating interrupts, which can be enabled and disabled. Status 
register reflects the current state of events, the bit in status 
register is cleared by FPGA when event is handled. Reads or writes to 
status register have no impact on its contents.

The device driver should support 80Mbit/sec of traffic in each direction.

So far I have TX side working. I'm using Linux dmaengine APIs to
transfer packets to FIFOs. The DMA completion interrupt is handled
by per-fifo work queue.

My question is about RX. Would such design benefit from NAPI ?
If my understanding of NAPI is correct, it runs in softirq context,
so I cannot do any DMA work in dev->poll(). If I were to use NAPI,
I should probably disable RX interrupts, do all DMA work in some
work queue, keep RX packets in a list and only then call dev->poll().
Is that correct ?

Any other advice and how to write an efficient driver for this 
hardware is most welcome. I can influence FPGA design to some degree, 
so if you think FPGA should be changed to improve things, please let 
me know.

Thanks a lot in advance.

Felix.


^ permalink raw reply

* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Jiri Pirko @ 2011-02-19 13:46 UTC (permalink / raw)
  To: Nicolas de Pesloüan
  Cc: Jay Vosburgh, David Miller, kaber, eric.dumazet, netdev,
	shemminger, andy
In-Reply-To: <4D5FC308.9020507@gmail.com>

Sat, Feb 19, 2011 at 02:18:00PM CET, nicolas.2p.debian@gmail.com wrote:
>Le 19/02/2011 12:28, Jiri Pirko a écrit :
>>Sat, Feb 19, 2011 at 12:08:31PM CET, jpirko@redhat.com wrote:
>>>Sat, Feb 19, 2011 at 11:56:23AM CET, nicolas.2p.debian@gmail.com wrote:
>>>>Le 19/02/2011 09:05, Jiri Pirko a écrit :
>>>>>This patch converts bonding to use rx_handler. Results in cleaner
>>>>>__netif_receive_skb() with much less exceptions needed. Also
>>>>>bond-specific work is moved into bond code.
>>>>>
>>>>>Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>>>>>
>>>>>v1->v2:
>>>>>         using skb_iif instead of new input_dev to remember original
>>>>>	device
>>>>>v2->v3:
>>>>>	set orig_dev = skb->dev if skb_iif is set
>>>>>
>>>>
>>>>Why do we need to let the rx_handlers call netif_rx() or __netif_receive_skb()?
>>>>
>>>>Bonding used to be handled with very few overhead, simply replacing
>>>>skb->dev with skb->dev->master. Time has passed and we eventually
>>>>added many special processing for bonding into __netif_receive_skb(),
>>>>but the overhead remained very light.
>>>>
>>>>Calling netif_rx() (or __netif_receive_skb()) to allow nesting would probably lead to some overhead.
>>>>
>>>>Can't we, instead, loop inside __netif_receive_skb(), and deliver
>>>>whatever need to be delivered, to whoever need, inside the loop ?
>>>>
>>>>rx_handler = rcu_dereference(skb->dev->rx_handler);
>>>>while (rx_handler) {
>>>>	/* ...  */
>>>>	orig_dev = skb->dev;
>>>>	skb = rx_handler(skb);
>>>>	/* ... */
>>>>	rx_handler = (skb->dev != orig_dev) ? rcu_dereference(skb->dev->rx_handler) : NULL;
>>>>}
>>>>
>>>>This would reduce the overhead, while still allowing nesting: vlan on
>>>>top on bonding, bridge on top on bonding, ...
>>>
>>>I see your point. Makes sense to me. But the loop would have to include
>>>at least processing of ptype_all too. I'm going to cook a follow-up
>>>patch.
>>>
>>
>>DRAFT (doesn't modify rx_handlers):
>>
>>diff --git a/net/core/dev.c b/net/core/dev.c
>>index 4ebf7fe..e5dba47 100644
>>--- a/net/core/dev.c
>>+++ b/net/core/dev.c
>>@@ -3115,6 +3115,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>  {
>>  	struct packet_type *ptype, *pt_prev;
>>  	rx_handler_func_t *rx_handler;
>>+	struct net_device *dev;
>>  	struct net_device *orig_dev;
>>  	struct net_device *null_or_dev;
>>  	int ret = NET_RX_DROP;
>>@@ -3129,7 +3130,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>  	if (netpoll_receive_skb(skb))
>>  		return NET_RX_DROP;
>>
>>-	__this_cpu_inc(softnet_data.processed);
>>+	skb->skb_iif = skb->dev->ifindex;
>>+	orig_dev = skb->dev;
>
>orig_dev should be set inside the loop, to reflect "previously
>crossed device", while following the path:
>
>eth0 -> bond0 -> br0.
>
>First step inside loop:
>
>orig_dev = eth0
>skb->dev = bond0 (at the end of the loop).
>
>Second step inside loop:
>
>orig_dev = bond0
>skb->dev = br0 (et the end of the loop).
>
>This would allow for exact match delivery to bond0 if someone bind there.
>
>>+
>>  	skb_reset_network_header(skb);
>>  	skb_reset_transport_header(skb);
>>  	skb->mac_len = skb->network_header - skb->mac_header;
>>@@ -3138,12 +3141,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>
>>  	rcu_read_lock();
>>
>>-	if (!skb->skb_iif) {
>>-		skb->skb_iif = skb->dev->ifindex;
>>-		orig_dev = skb->dev;
>>-	} else {
>>-		orig_dev = dev_get_by_index_rcu(dev_net(skb->dev), skb->skb_iif);
>>-	}
>
>I like the fact that it removes the above part.
>
>>+another_round:
>>+	__this_cpu_inc(softnet_data.processed);
>>+	dev = skb->dev;
>>
>>  #ifdef CONFIG_NET_CLS_ACT
>>  	if (skb->tc_verd&  TC_NCLS) {
>>@@ -3153,7 +3153,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>  #endif
>>
>>  	list_for_each_entry_rcu(ptype,&ptype_all, list) {
>>-		if (!ptype->dev || ptype->dev == skb->dev) {
>>+		if (!ptype->dev || ptype->dev == dev) {
>>  			if (pt_prev)
>>  				ret = deliver_skb(skb, pt_prev, orig_dev);
>>  			pt_prev = ptype;
>
>Inside the loop, we should only do exact match delivery, for
>&ptype_all and for &ptype_base[ntohs(type) & PTYPE_HASH_MASK]:
>
>        list_for_each_entry_rcu(ptype, &ptype_all, list) {
>-               if (!ptype->dev || ptype->dev == dev) {
>+               if (ptype->dev == dev) {
>                        if (pt_prev)
>                                ret = deliver_skb(skb, pt_prev, orig_dev);
>                        pt_prev = ptype;
>                }
>        }
>
>
>        list_for_each_entry_rcu(ptype,
>                        &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
>                if (ptype->type == type &&
>-                   (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
>+                   (ptype->dev == skb->dev)) {
>                        if (pt_prev)
>                                ret = deliver_skb(skb, pt_prev, orig_dev);
>                        pt_prev = ptype;
>                }
>        }
>
>After leaving the loop, we can do wilcard delivery, if skb is not NULL.
>
>        list_for_each_entry_rcu(ptype, &ptype_all, list) {
>-               if (!ptype->dev || ptype->dev == dev) {
>+               if (!ptype->dev) {
>                        if (pt_prev)
>                                ret = deliver_skb(skb, pt_prev, orig_dev);
>                        pt_prev = ptype;
>               }
>        }
>
>
>        list_for_each_entry_rcu(ptype,
>                        &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
>-               if (ptype->type == type &&
>-                   (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
>+		if (ptype->type == type && !ptype->dev) {
>                        if (pt_prev)
>                                ret = deliver_skb(skb, pt_prev, orig_dev);
>                        pt_prev = ptype;
>                }
>        }
>
>This would reduce the number of tests inside the
>list_for_each_entry_rcu() loops. And because we match only ptype->dev
>== dev inside the loop and !ptype->dev outside the loop, this should
>avoid duplicate delivery.

Would you care to put this into patch so I can see the whole picture?
Thanks.


^ permalink raw reply

* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Nicolas de Pesloüan @ 2011-02-19 13:18 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jay Vosburgh, David Miller, kaber, eric.dumazet, netdev,
	shemminger, andy
In-Reply-To: <20110219112842.GE2782@psychotron.redhat.com>

Le 19/02/2011 12:28, Jiri Pirko a écrit :
> Sat, Feb 19, 2011 at 12:08:31PM CET, jpirko@redhat.com wrote:
>> Sat, Feb 19, 2011 at 11:56:23AM CET, nicolas.2p.debian@gmail.com wrote:
>>> Le 19/02/2011 09:05, Jiri Pirko a écrit :
>>>> This patch converts bonding to use rx_handler. Results in cleaner
>>>> __netif_receive_skb() with much less exceptions needed. Also
>>>> bond-specific work is moved into bond code.
>>>>
>>>> Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>>>>
>>>> v1->v2:
>>>>          using skb_iif instead of new input_dev to remember original
>>>> 	device
>>>> v2->v3:
>>>> 	set orig_dev = skb->dev if skb_iif is set
>>>>
>>>
>>> Why do we need to let the rx_handlers call netif_rx() or __netif_receive_skb()?
>>>
>>> Bonding used to be handled with very few overhead, simply replacing
>>> skb->dev with skb->dev->master. Time has passed and we eventually
>>> added many special processing for bonding into __netif_receive_skb(),
>>> but the overhead remained very light.
>>>
>>> Calling netif_rx() (or __netif_receive_skb()) to allow nesting would probably lead to some overhead.
>>>
>>> Can't we, instead, loop inside __netif_receive_skb(), and deliver
>>> whatever need to be delivered, to whoever need, inside the loop ?
>>>
>>> rx_handler = rcu_dereference(skb->dev->rx_handler);
>>> while (rx_handler) {
>>> 	/* ...  */
>>> 	orig_dev = skb->dev;
>>> 	skb = rx_handler(skb);
>>> 	/* ... */
>>> 	rx_handler = (skb->dev != orig_dev) ? rcu_dereference(skb->dev->rx_handler) : NULL;
>>> }
>>>
>>> This would reduce the overhead, while still allowing nesting: vlan on
>>> top on bonding, bridge on top on bonding, ...
>>
>> I see your point. Makes sense to me. But the loop would have to include
>> at least processing of ptype_all too. I'm going to cook a follow-up
>> patch.
>>
>
> DRAFT (doesn't modify rx_handlers):
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 4ebf7fe..e5dba47 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3115,6 +3115,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>   {
>   	struct packet_type *ptype, *pt_prev;
>   	rx_handler_func_t *rx_handler;
> +	struct net_device *dev;
>   	struct net_device *orig_dev;
>   	struct net_device *null_or_dev;
>   	int ret = NET_RX_DROP;
> @@ -3129,7 +3130,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
>   	if (netpoll_receive_skb(skb))
>   		return NET_RX_DROP;
>
> -	__this_cpu_inc(softnet_data.processed);
> +	skb->skb_iif = skb->dev->ifindex;
> +	orig_dev = skb->dev;

orig_dev should be set inside the loop, to reflect "previously crossed device", while following the 
path:

eth0 -> bond0 -> br0.

First step inside loop:

orig_dev = eth0
skb->dev = bond0 (at the end of the loop).

Second step inside loop:

orig_dev = bond0
skb->dev = br0 (et the end of the loop).

This would allow for exact match delivery to bond0 if someone bind there.

> +
>   	skb_reset_network_header(skb);
>   	skb_reset_transport_header(skb);
>   	skb->mac_len = skb->network_header - skb->mac_header;
> @@ -3138,12 +3141,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
>
>   	rcu_read_lock();
>
> -	if (!skb->skb_iif) {
> -		skb->skb_iif = skb->dev->ifindex;
> -		orig_dev = skb->dev;
> -	} else {
> -		orig_dev = dev_get_by_index_rcu(dev_net(skb->dev), skb->skb_iif);
> -	}

I like the fact that it removes the above part.

> +another_round:
> +	__this_cpu_inc(softnet_data.processed);
> +	dev = skb->dev;
>
>   #ifdef CONFIG_NET_CLS_ACT
>   	if (skb->tc_verd&  TC_NCLS) {
> @@ -3153,7 +3153,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>   #endif
>
>   	list_for_each_entry_rcu(ptype,&ptype_all, list) {
> -		if (!ptype->dev || ptype->dev == skb->dev) {
> +		if (!ptype->dev || ptype->dev == dev) {
>   			if (pt_prev)
>   				ret = deliver_skb(skb, pt_prev, orig_dev);
>   			pt_prev = ptype;

Inside the loop, we should only do exact match delivery, for &ptype_all and for 
&ptype_base[ntohs(type) & PTYPE_HASH_MASK]:

         list_for_each_entry_rcu(ptype, &ptype_all, list) {
-               if (!ptype->dev || ptype->dev == dev) {
+               if (ptype->dev == dev) {
                         if (pt_prev)
                                 ret = deliver_skb(skb, pt_prev, orig_dev);
                         pt_prev = ptype;
                 }
         }


         list_for_each_entry_rcu(ptype,
                         &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
                 if (ptype->type == type &&
-                   (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
+                   (ptype->dev == skb->dev)) {
                         if (pt_prev)
                                 ret = deliver_skb(skb, pt_prev, orig_dev);
                         pt_prev = ptype;
                 }
         }

After leaving the loop, we can do wilcard delivery, if skb is not NULL.

         list_for_each_entry_rcu(ptype, &ptype_all, list) {
-               if (!ptype->dev || ptype->dev == dev) {
+               if (!ptype->dev) {
                         if (pt_prev)
                                 ret = deliver_skb(skb, pt_prev, orig_dev);
                         pt_prev = ptype;
                }
         }


         list_for_each_entry_rcu(ptype,
                         &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
-               if (ptype->type == type &&
-                   (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
+		if (ptype->type == type && !ptype->dev) {
                         if (pt_prev)
                                 ret = deliver_skb(skb, pt_prev, orig_dev);
                         pt_prev = ptype;
                 }
         }

This would reduce the number of tests inside the list_for_each_entry_rcu() loops. And because we 
match only ptype->dev == dev inside the loop and !ptype->dev outside the loop, this should avoid 
duplicate delivery.

Also, for performance reason, exact match protocol handler lists might be moved from ptype_base or 
ptype_all to a per net_device list. That way, the list_for_each_entry_rcu() inside the loop could be 
empty if no protocol handler bind on the current dev.

inside loop:

         list_for_each_entry_rcu(ptype, dev->ptype_all, list) {
                 if (pt_prev)
			ret = deliver_skb(skb, pt_prev, orig_dev);
                 pt_prev = ptype;
         }

         list_for_each_entry_rcu(ptype,
                         dev->ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
                 if (ptype->type == type) {
                         if (pt_prev)
                                 ret = deliver_skb(skb, pt_prev, orig_dev);
                         pt_prev = ptype;
                 }
         }

Outside loop :

         list_for_each_entry_rcu(ptype, &ptype_all, list) {
                 if (pt_prev)
                         ret = deliver_skb(skb, pt_prev, orig_dev);
                 pt_prev = ptype;
         }


         list_for_each_entry_rcu(ptype,
                         &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
		if (ptype->type == type) {
                         if (pt_prev)
                                 ret = deliver_skb(skb, pt_prev, orig_dev);
                         pt_prev = ptype;
                 }
         }

This would require several changes into ptype_all and ptype_base handling, but should be faster.

> @@ -3167,7 +3167,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>   ncls:
>   #endif
>
> -	rx_handler = rcu_dereference(skb->dev->rx_handler);
> +	rx_handler = rcu_dereference(dev->rx_handler);
>   	if (rx_handler) {
>   		if (pt_prev) {
>   			ret = deliver_skb(skb, pt_prev, orig_dev);
> @@ -3176,6 +3176,8 @@ ncls:
>   		skb = rx_handler(skb);
>   		if (!skb)
>   			goto out;
> +		if (dev != skb->dev)

I would use "if (skb->dev != dev)" for clarity, because skb->dev is expected to have changed, not dev.

> +			goto another_round;
>   	}
>
>   	if (vlan_tx_tag_present(skb)) {
>

	Nicolas.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox