Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next 1/4] bonding: update the primary when slave name changed
From: Ding Tianhong @ 2014-01-10 11:55 UTC (permalink / raw)
  To: Veaceslav Falico; +Cc: Jay Vosburgh, David S. Miller, Netdev
In-Reply-To: <20140110111143.GB4132@redhat.com>

On 2014/1/10 19:11, Veaceslav Falico wrote:
> On Fri, Jan 10, 2014 at 07:05:34PM +0800, Ding Tianhong wrote:
>> On 2014/1/10 15:44, Veaceslav Falico wrote:
>>> If it's not the primray
>>> slave, and we don't have one - select it as a new primary and, again, see
>>> if we need to select a new active slave.
>>
>> I don't think so , I think if it is not the primary slave and we don't have one,
>> no need to do anything, just a normal slave change its name.
> 
> If primary == "my_eth0", you have 2 slaves - "eth0" and "eth1", thus null
> primary_slave, and rename eth0 to my_eth0 - then you need to set
> primary_slave to my_eth0.
> 
> If primary == "my_eth0", you have 2 slaves - "my_eth0" and "eth1", thus
> primary_slave == dev with name "my_eth0", and rename "my_eth0" to "eth0" -
> then you must set primary_slave to NULL.
> 
> And after either of these you must see if you need to re-select the active
> slave, as it might have been forced by the primary_slave, which has been
> modified.
> 
> You might also want to add some pr_info() about adding/removing
> primary_slave, as the user to be aware.
> 

Ok thanks.

Regards
Ding

> 

^ permalink raw reply

* Re: [PATCH net-next v3 2/9] xen-netback: Change TX path from grant copy to mapping
From: Wei Liu @ 2014-01-10 11:45 UTC (permalink / raw)
  To: Zoltan Kiss
  Cc: Wei Liu, ian.campbell, xen-devel, netdev, linux-kernel,
	jonathan.davies
In-Reply-To: <52CFDAEC.5080708@citrix.com>

On Fri, Jan 10, 2014 at 11:35:08AM +0000, Zoltan Kiss wrote:
[...]
> 
> >>@@ -920,6 +852,18 @@ static int xenvif_tx_check_gop(struct xenvif *vif,
> >>  	err = gop->status;
> >>  	if (unlikely(err))
> >>  		xenvif_idx_release(vif, pending_idx, XEN_NETIF_RSP_ERROR);
> >>+	else {
> >>+		if (vif->grant_tx_handle[pending_idx] !=
> >>+			NETBACK_INVALID_HANDLE) {
> >>+			netdev_err(vif->dev,
> >>+				"Stale mapped handle! pending_idx %x handle %x\n",
> >>+				pending_idx, vif->grant_tx_handle[pending_idx]);
> >>+			BUG();
> >>+		}
> >>+		set_phys_to_machine(idx_to_pfn(vif, pending_idx),
> >>+			FOREIGN_FRAME(gop->dev_bus_addr >> PAGE_SHIFT));
> >
> >What happens when you don't have this?
> Your frags will be filled with garbage. I don't understand exactly
> what this function does, someone might want to enlighten us? I've
> took it's usage from classic kernel.
> Also, it might be worthwhile to check the return value and BUG if
> it's false, but I don't know what exactly that return value means.
> 

This is actually part of gnttab_map_refs. As you're using hypercall
directly this becomes very fragile.

So the right thing to do is to fix gnttab_map_refs.

> >
> >>  		if (skb_is_nonlinear(skb) && skb_headlen(skb) < PKT_PROT_LEN) {
> >>  			int target = min_t(int, skb->len, PKT_PROT_LEN);
> >>@@ -1581,6 +1541,8 @@ static int xenvif_tx_submit(struct xenvif *vif)
> >>  		if (checksum_setup(vif, skb)) {
> >>  			netdev_dbg(vif->dev,
> >>  				   "Can't setup checksum in net_tx_action\n");
> >>+			if (skb_shinfo(skb)->destructor_arg)
> >>+				skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
> >
> >Do you still care setting the flag even if this skb is not going to be
> >delivered? If so can you state clearly the reason just like the
> >following hunk?
> Of course, otherwise the pages wouldn't be sent back to the guest.
> I've added a comment.
> 

OK, Thanks! That means whenever SKB leaves netback we need to add this
flag.

> >>@@ -1715,7 +1685,7 @@ static inline void xenvif_tx_dealloc_action(struct xenvif *vif)
> >>  int xenvif_tx_action(struct xenvif *vif, int budget)
> >>  {
> >>  	unsigned nr_gops;
> >>-	int work_done;
> >>+	int work_done, ret;
> >>
> >>  	if (unlikely(!tx_work_todo(vif)))
> >>  		return 0;
> >>@@ -1725,7 +1695,10 @@ int xenvif_tx_action(struct xenvif *vif, int budget)
> >>  	if (nr_gops == 0)
> >>  		return 0;
> >>
> >>-	gnttab_batch_copy(vif->tx_copy_ops, nr_gops);
> >>+	ret = HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,
> >>+			vif->tx_map_ops,
> >>+			nr_gops);
> >
> >Why do you need to replace gnttab_batch_copy with hypercall? In the
> >ideal situation gnttab_batch_copy should behave the same as directly
> >hypercall but it also handles GNTST_eagain for you.
> 
> I don't need gnttab_batch_copy at all, I'm using the grant mapping
> hypercall here.
> 

Oops, my bad! Ignore that one.

Wei.

> Regards,
> 
> Zoli

^ permalink raw reply

* [PATCH net] bonding: reset the slave's mtu when its be changed
From: Ding Tianhong @ 2014-01-10 11:32 UTC (permalink / raw)
  To: Jay Vosburgh, Veaceslav Falico, Netdev, David S. Miller

All slave should have the same mtu with mastet's, and the bond do it when
enslave the slave, but the user could change the slave's mtu, it will cause
the master and slave have different mtu, althrough in AB mode, it does not
matter if the slave is not the current slave, but in other mode, it is incorrect,
so reset the slave's mtu like the master set.

Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
---
 drivers/net/bonding/bond_main.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 398e299..e7b5bcf 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2882,18 +2882,17 @@ static int bond_slave_netdev_event(unsigned long event,
 		 */
 		break;
 	case NETDEV_CHANGEMTU:
-		/*
-		 * TODO: Should slaves be allowed to
-		 * independently alter their MTU?  For
-		 * an active-backup bond, slaves need
-		 * not be the same type of device, so
-		 * MTUs may vary.  For other modes,
-		 * slaves arguably should have the
-		 * same MTUs. To do this, we'd need to
-		 * take over the slave's change_mtu
-		 * function for the duration of their
-		 * servitude.
+		/* All slave should have the same mtu
+		 * as master.
 		 */
+		if (slave->dev->mtu != bond->dev->mtu) {
+			int res;
+			slave->original_mtu = slave->dev->mtu;
+			res = dev_set_mtu(slave->dev, bond->dev->mtu);
+			if (res)
+				pr_debug("Error %d calling dev_set_mtu for slave %s\n",
+					 res, slave->dev->name);
+		}
 		break;
 	case NETDEV_CHANGENAME:
 		/*
--
1.8.0

^ permalink raw reply related

* Re: [PATCH net-next v3 2/9] xen-netback: Change TX path from grant copy to mapping
From: Zoltan Kiss @ 2014-01-10 11:35 UTC (permalink / raw)
  To: Wei Liu; +Cc: ian.campbell, xen-devel, netdev, linux-kernel, jonathan.davies
In-Reply-To: <20140109153015.GF12164@zion.uk.xensource.com>

On 09/01/14 15:30, Wei Liu wrote:
> On Wed, Jan 08, 2014 at 12:10:11AM +0000, Zoltan Kiss wrote:
>> v3:
>> - delete a surplus checking from tx_action
>> - remove stray line
>> - squash xenvif_idx_unmap changes into the first patch
>> - init spinlocks
>> - call map hypercall directly instead of gnttab_map_refs()
>
> I suppose this is to avoid touching m2p override as well, just as
> previous patch uses unmap hypercall directly.

Yes.

>> --- a/drivers/net/xen-netback/interface.c
>> +++ b/drivers/net/xen-netback/interface.c
>> @@ -122,7 +122,9 @@ static int xenvif_start_xmit(struct sk_buff *skb, struct net_device *dev)
>>   	BUG_ON(skb->dev != dev);
>>
>>   	/* Drop the packet if vif is not ready */
>> -	if (vif->task == NULL || !xenvif_schedulable(vif))
>> +	if (vif->task == NULL ||
>> +		vif->dealloc_task == NULL ||
>> +		!xenvif_schedulable(vif))
>
> Indentation.
Fixed, and the later ones as well

>> @@ -920,6 +852,18 @@ static int xenvif_tx_check_gop(struct xenvif *vif,
>>   	err = gop->status;
>>   	if (unlikely(err))
>>   		xenvif_idx_release(vif, pending_idx, XEN_NETIF_RSP_ERROR);
>> +	else {
>> +		if (vif->grant_tx_handle[pending_idx] !=
>> +			NETBACK_INVALID_HANDLE) {
>> +			netdev_err(vif->dev,
>> +				"Stale mapped handle! pending_idx %x handle %x\n",
>> +				pending_idx, vif->grant_tx_handle[pending_idx]);
>> +			BUG();
>> +		}
>> +		set_phys_to_machine(idx_to_pfn(vif, pending_idx),
>> +			FOREIGN_FRAME(gop->dev_bus_addr >> PAGE_SHIFT));
>
> What happens when you don't have this?
Your frags will be filled with garbage. I don't understand exactly what 
this function does, someone might want to enlighten us? I've took it's 
usage from classic kernel.
Also, it might be worthwhile to check the return value and BUG if it's 
false, but I don't know what exactly that return value means.

>
>>   		if (skb_is_nonlinear(skb) && skb_headlen(skb) < PKT_PROT_LEN) {
>>   			int target = min_t(int, skb->len, PKT_PROT_LEN);
>> @@ -1581,6 +1541,8 @@ static int xenvif_tx_submit(struct xenvif *vif)
>>   		if (checksum_setup(vif, skb)) {
>>   			netdev_dbg(vif->dev,
>>   				   "Can't setup checksum in net_tx_action\n");
>> +			if (skb_shinfo(skb)->destructor_arg)
>> +				skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
>
> Do you still care setting the flag even if this skb is not going to be
> delivered? If so can you state clearly the reason just like the
> following hunk?
Of course, otherwise the pages wouldn't be sent back to the guest. I've 
added a comment.

>> @@ -1715,7 +1685,7 @@ static inline void xenvif_tx_dealloc_action(struct xenvif *vif)
>>   int xenvif_tx_action(struct xenvif *vif, int budget)
>>   {
>>   	unsigned nr_gops;
>> -	int work_done;
>> +	int work_done, ret;
>>
>>   	if (unlikely(!tx_work_todo(vif)))
>>   		return 0;
>> @@ -1725,7 +1695,10 @@ int xenvif_tx_action(struct xenvif *vif, int budget)
>>   	if (nr_gops == 0)
>>   		return 0;
>>
>> -	gnttab_batch_copy(vif->tx_copy_ops, nr_gops);
>> +	ret = HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,
>> +			vif->tx_map_ops,
>> +			nr_gops);
>
> Why do you need to replace gnttab_batch_copy with hypercall? In the
> ideal situation gnttab_batch_copy should behave the same as directly
> hypercall but it also handles GNTST_eagain for you.

I don't need gnttab_batch_copy at all, I'm using the grant mapping 
hypercall here.

Regards,

Zoli

^ permalink raw reply

* Re: [net-next 14/16] i40e: enable PTP
From: Richard Cochran @ 2014-01-10 11:26 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: davem, Jacob Keller, netdev, gospo, sassmann, Jesse Brandeburg
In-Reply-To: <1389344336-1558-15-git-send-email-jeffrey.t.kirsher@intel.com>

On Fri, Jan 10, 2014 at 12:58:54AM -0800, Jeff Kirsher wrote:

> +/**
> + * i40e_ptp_enable - Enable/disable ancillary features of the PHC subsystem
> + * @ptp: The PTP clock structure
> + * @rq: The requested feature to change
> + * @on: Enable/disable flag
> + *
> + * The XL710 does not support any of the ancillary features of the PHY
> + * subsystem, so this function may just return.
> + **/

Typo, PHC subsystem? Otherwise, the driver looks good to me.

Acked-by: Richard Cochran <richardcochran@gmail.com>

^ permalink raw reply

* Re: [PATCH net-next 1/4] bonding: update the primary when slave name changed
From: Veaceslav Falico @ 2014-01-10 11:11 UTC (permalink / raw)
  To: Ding Tianhong; +Cc: Jay Vosburgh, David S. Miller, Netdev
In-Reply-To: <52CFD3FE.9000805@huawei.com>

On Fri, Jan 10, 2014 at 07:05:34PM +0800, Ding Tianhong wrote:
>On 2014/1/10 15:44, Veaceslav Falico wrote:
>> If it's not the primray
>> slave, and we don't have one - select it as a new primary and, again, see
>> if we need to select a new active slave.
>
>I don't think so , I think if it is not the primary slave and we don't have one,
>no need to do anything, just a normal slave change its name.

If primary == "my_eth0", you have 2 slaves - "eth0" and "eth1", thus null
primary_slave, and rename eth0 to my_eth0 - then you need to set
primary_slave to my_eth0.

If primary == "my_eth0", you have 2 slaves - "my_eth0" and "eth1", thus
primary_slave == dev with name "my_eth0", and rename "my_eth0" to "eth0" -
then you must set primary_slave to NULL.

And after either of these you must see if you need to re-select the active
slave, as it might have been forced by the primary_slave, which has been
modified.

You might also want to add some pr_info() about adding/removing
primary_slave, as the user to be aware.

^ permalink raw reply

* Re: [PATCH net-next 1/4] bonding: update the primary when slave name changed
From: Ding Tianhong @ 2014-01-10 11:05 UTC (permalink / raw)
  To: Veaceslav Falico; +Cc: Jay Vosburgh, David S. Miller, Netdev
In-Reply-To: <20140110074433.GA26273@redhat.com>

On 2014/1/10 15:44, Veaceslav Falico wrote:
> On Fri, Jan 10, 2014 at 12:20:45PM +0800, Ding Tianhong wrote:
>> On 2014/1/9 20:30, Veaceslav Falico wrote:
>>> On Thu, Jan 09, 2014 at 08:23:58PM +0800, Ding Tianhong wrote:
>>>> On 2014/1/9 19:46, Veaceslav Falico wrote:
>>>>> On Thu, Jan 09, 2014 at 07:20:36PM +0800, Ding Tianhong wrote:
>>>>>> If the primary_slave's name changed, but the bond->prams.primay was
>>>>>> still using the old name, it is conflict with the meaning of the
>>>>>> primary, so update the primary when the slave change its name.
>>>>>
>>>>> Nope, the bonding parameter, which is set by the user, shouldn't change
>>>>> because of an interface name change.
>>>>>
>>>> Yes, I know what you mean, but it is not bug fix, just make it more better,
>>>> do not you feel it strange that the primary was different with primary_slave's name?
>>>
>>> Yep, that's an issue - that's why there is the TODO. We shouldn't, though,
>>> change the primary param, but rather check if the slave (that changed name)
>>> is (already not) eligible for primary_slave.
>>>
>>
>> Ok,So,summarize your and my opinion, I think there are two ways to fix this:
>>
>> 1. just like my patch said.
> 
> No, the primary string is user-set, and it should *not* be changed by
> kernel.
> 
>> 2. check if the primary is not the primary_slave, make the primary_slave = NULL, this means
>>   the primary_slave is no valid.
> 
> Check the slave that changed name - if it's the primary slave, remove it,
> and see if we need to select the new active slave. 

Ok, agree.

> If it's not the primray
> slave, and we don't have one - select it as a new primary and, again, see
> if we need to select a new active slave.

I don't think so , I think if it is not the primary slave and we don't have one,
no need to do anything, just a normal slave change its name.

Regards
Ding

> 
>> 3. ?? did you have any better ideas?
>>
>> Regards
>> Ding
>>

>>>>
>>>
>>> .
>>>
>>
>>
> 
> .
> 

^ permalink raw reply

* [PATCH v4 net-next 3/3] bonding: fix __get_active_agg() RCU logic
From: Veaceslav Falico @ 2014-01-10 10:59 UTC (permalink / raw)
  To: netdev; +Cc: Veaceslav Falico, dingtianhong, Jay Vosburgh, Andy Gospodarek
In-Reply-To: <1389351585-19615-1-git-send-email-vfalico@redhat.com>

Currently, the implementation is meaningless - once again, we take the
slave structure and use it after we've exited RCU critical section.

Fix this by removing the rcu_read_lock() from __get_active_agg(), and
ensuring that all its callers are holding RCU.

Fixes: be79bd048 ("bonding: add RCU for bond_3ad_state_machine_handler()")
CC: dingtianhong@huawei.com
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
---

Notes:
    v2 -> v3:
    Use the RCU primitives.
    
    v1 -> v2:
    Don't use RCU primitives as we can hold RTNL.

 drivers/net/bonding/bond_3ad.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index b49f421..cce1f1b 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -678,6 +678,8 @@ static u32 __get_agg_bandwidth(struct aggregator *aggregator)
 /**
  * __get_active_agg - get the current active aggregator
  * @aggregator: the aggregator we're looking at
+ *
+ * Caller must hold RCU lock.
  */
 static struct aggregator *__get_active_agg(struct aggregator *aggregator)
 {
@@ -685,13 +687,9 @@ static struct aggregator *__get_active_agg(struct aggregator *aggregator)
 	struct list_head *iter;
 	struct slave *slave;
 
-	rcu_read_lock();
 	bond_for_each_slave_rcu(bond, slave, iter)
-		if (SLAVE_AD_INFO(slave).aggregator.is_active) {
-			rcu_read_unlock();
+		if (SLAVE_AD_INFO(slave).aggregator.is_active)
 			return &(SLAVE_AD_INFO(slave).aggregator);
-		}
-	rcu_read_unlock();
 
 	return NULL;
 }
@@ -1499,11 +1497,11 @@ static void ad_agg_selection_logic(struct aggregator *agg)
 	struct slave *slave;
 	struct port *port;
 
+	rcu_read_lock();
 	origin = agg;
 	active = __get_active_agg(agg);
 	best = (active && agg_device_up(active)) ? active : NULL;
 
-	rcu_read_lock();
 	bond_for_each_slave_rcu(bond, slave, iter) {
 		agg = &(SLAVE_AD_INFO(slave).aggregator);
 
-- 
1.8.4

^ permalink raw reply related

* [PATCH v4 net-next 2/3] bonding: fix __get_first_agg RCU usage
From: Veaceslav Falico @ 2014-01-10 10:59 UTC (permalink / raw)
  To: netdev; +Cc: Veaceslav Falico, dingtianhong, Jay Vosburgh, Andy Gospodarek
In-Reply-To: <1389351585-19615-1-git-send-email-vfalico@redhat.com>

Currently, the RCU read lock usage is just wrong - it gets the slave struct
under RCU and continues to use it when RCU lock is released.

However, it's still safe to do this cause we didn't need the
rcu_read_lock() initially - all of the __get_first_agg() callers are either
holding RCU read lock or the RTNL lock, so that we can't sync while in it.

Fixes: be79bd048 ("bonding: add RCU for bond_3ad_state_machine_handler()")
CC: dingtianhong@huawei.com
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
---

Notes:
    v3 -> v4: add rcu_read_lock() to silence lockdep.
    
    v2 -> v3:
    Use the rcu primitives.
    
    v1 -> v2:
    Don't use RCU primitives as we can hold RTNL.

 drivers/net/bonding/bond_3ad.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index da0d7c5..b49f421 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -143,11 +143,13 @@ static inline struct bonding *__get_bond_by_port(struct port *port)
  *
  * Return the aggregator of the first slave in @bond, or %NULL if it can't be
  * found.
+ * The caller must hold RCU or RTNL lock.
  */
 static inline struct aggregator *__get_first_agg(struct port *port)
 {
 	struct bonding *bond = __get_bond_by_port(port);
 	struct slave *first_slave;
+	struct aggregator *agg;
 
 	/* If there's no bond for this port, or bond has no slaves */
 	if (bond == NULL)
@@ -155,9 +157,10 @@ static inline struct aggregator *__get_first_agg(struct port *port)
 
 	rcu_read_lock();
 	first_slave = bond_first_slave_rcu(bond);
+	agg = first_slave ? &(SLAVE_AD_INFO(first_slave).aggregator) : NULL;
 	rcu_read_unlock();
 
-	return first_slave ? &(SLAVE_AD_INFO(first_slave).aggregator) : NULL;
+	return agg;
 }
 
 /**
-- 
1.8.4

^ permalink raw reply related

* [PATCH v4 net-next 1/3] bonding: fix bond_3ad_set_carrier() RCU usage
From: Veaceslav Falico @ 2014-01-10 10:59 UTC (permalink / raw)
  To: netdev; +Cc: Veaceslav Falico, dingtianhong, Jay Vosburgh, Andy Gospodarek
In-Reply-To: <1389351585-19615-1-git-send-email-vfalico@redhat.com>

Currently, its usage is just plainly wrong. It first gets a slave under
RCU, and, after releasing the RCU lock, continues to use it - whilst it can
be freed.

Fix this by ensuring that bond_3ad_set_carrier() holds RCU till it uses its
slave (or its agg).

Fixes: be79bd048ab ("bonding: add RCU for bond_3ad_state_machine_handler()")
CC: dingtianhong@huawei.com
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
---

Notes:
    v3 -> v4:
    Remove the useless goto out.
    
    v2 -> v3:
    Just wrap RCU for the whole usage of our slave.
    
    v1 -> v2:
    Don't use _rcu primitives as we can be called under RTNL too.
    
    v2 -> v3:
    Just wrap RCU for the whole usage of our slave.
    
    v1 -> v2:
    Don't use _rcu primitives as we can be called under RTNL too.
    
    v1 -> v2:
    Don't use _rcu primitives as we can be called under RTNL too.

 drivers/net/bonding/bond_3ad.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index 29db1ca..da0d7c5 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -2327,32 +2327,32 @@ int bond_3ad_set_carrier(struct bonding *bond)
 {
 	struct aggregator *active;
 	struct slave *first_slave;
+	int ret = 1;
 
 	rcu_read_lock();
 	first_slave = bond_first_slave_rcu(bond);
-	rcu_read_unlock();
-	if (!first_slave)
-		return 0;
+	if (!first_slave) {
+		ret = 0;
+		goto out;
+	}
 	active = __get_active_agg(&(SLAVE_AD_INFO(first_slave).aggregator));
 	if (active) {
 		/* are enough slaves available to consider link up? */
 		if (active->num_of_ports < bond->params.min_links) {
 			if (netif_carrier_ok(bond->dev)) {
 				netif_carrier_off(bond->dev);
-				return 1;
+				goto out;
 			}
 		} else if (!netif_carrier_ok(bond->dev)) {
 			netif_carrier_on(bond->dev);
-			return 1;
+			goto out;
 		}
-		return 0;
-	}
-
-	if (netif_carrier_ok(bond->dev)) {
+	} else if (netif_carrier_ok(bond->dev)) {
 		netif_carrier_off(bond->dev);
-		return 1;
 	}
-	return 0;
+out:
+	rcu_read_unlock();
+	return ret;
 }
 
 /**
-- 
1.8.4

^ permalink raw reply related

* [PATCH v4 net-next 0/3] bonding: fix bond_3ad RCU usage
From: Veaceslav Falico @ 2014-01-10 10:59 UTC (permalink / raw)
  To: netdev; +Cc: dingtianhong, Jay Vosburgh, Andy Gospodarek, Veaceslav Falico

While digging through bond_3ad.c I've found that the RCU usage there is
just wrong - it's used as a kind of mutex/spinlock instead of RCU.

v3->v4: remove useless goto and wrap __get_first_agg() in proper RCU.

v2->v3: make bond_3ad_set_carrier() use RCU read lock for the whole
function, so that all other functions will be protected by RCU as well.
This way we can use _rcu variants everywhere.

v1->v2: use generic primitives instead of _rcu ones cause we can hold RTNL
lock without RCU one, which is still safe.

This patchset is on top of bond_3ad.c cleanup:
http://www.spinics.net/lists/netdev/msg265447.html

CC: dingtianhong@huawei.com
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: netdev@vger.kernel.org
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>

^ permalink raw reply

* Re: [PATCH v3 net-next 2/3] bonding: fix __get_first_agg RCU usage
From: Veaceslav Falico @ 2014-01-10 10:53 UTC (permalink / raw)
  To: Ding Tianhong; +Cc: netdev, Jay Vosburgh, Andy Gospodarek
In-Reply-To: <52CFCEDD.6030202@huawei.com>

On Fri, Jan 10, 2014 at 06:43:41PM +0800, Ding Tianhong wrote:
>On 2014/1/10 17:18, Veaceslav Falico wrote:
>> -	rcu_read_lock();
>>  	first_slave = bond_first_slave_rcu(bond);
>> -	rcu_read_unlock();
>>
>I am afraid the lockdep check will calling some warming:
>bond_3ad_unbind_slave -> __get_first_agg -> bond_first_slave_rcu -> netdev_lower_get_first_private_rcu -> list_first_or_null_rcu
>
>but the bond_3ad_unbind_slave is not in RCU.

Yep, right, I'm always colliding with my next patchset which removes it
completely, so it doesn't whine.

Will resend.

>
>Regards
>Ding
>>  	return first_slave ? &(SLAVE_AD_INFO(first_slave).aggregator) : NULL;
>>  }
>>
>
>

^ permalink raw reply

* Re: [PATCH v3 net-next 3/3] bonding: fix __get_active_agg() RCU logic
From: Ding Tianhong @ 2014-01-10 10:48 UTC (permalink / raw)
  To: Veaceslav Falico, netdev; +Cc: Jay Vosburgh, Andy Gospodarek
In-Reply-To: <1389345523-5497-4-git-send-email-vfalico@redhat.com>

On 2014/1/10 17:18, Veaceslav Falico wrote:
> Currently, the implementation is meaningless - once again, we take the
> slave structure and use it after we've exited RCU critical section.
> 
> Fix this by removing the rcu_read_lock() from __get_active_agg(), and
> ensuring that all its callers are holding RCU.
> 
> Fixes: be79bd048 ("bonding: add RCU for bond_3ad_state_machine_handler()")
> CC: dingtianhong@huawei.com
> CC: Jay Vosburgh <fubar@us.ibm.com>
> CC: Andy Gospodarek <andy@greyhouse.net>
> Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
> ---
> 
> Notes:
>     v2 -> v3:
>     Use the RCU primitives.
>     
>     v1 -> v2:
>     Don't use RCU primitives as we can hold RTNL.
> 
>  drivers/net/bonding/bond_3ad.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
> index 27dac0e..112afa8 100644
> --- a/drivers/net/bonding/bond_3ad.c
> +++ b/drivers/net/bonding/bond_3ad.c
> @@ -674,6 +674,8 @@ static u32 __get_agg_bandwidth(struct aggregator *aggregator)
>  /**
>   * __get_active_agg - get the current active aggregator
>   * @aggregator: the aggregator we're looking at
> + *
> + * Caller must hold RCU lock.
>   */
>  static struct aggregator *__get_active_agg(struct aggregator *aggregator)
>  {
> @@ -681,13 +683,9 @@ static struct aggregator *__get_active_agg(struct aggregator *aggregator)
>  	struct list_head *iter;
>  	struct slave *slave;
>  
> -	rcu_read_lock();
>  	bond_for_each_slave_rcu(bond, slave, iter)
> -		if (SLAVE_AD_INFO(slave).aggregator.is_active) {
> -			rcu_read_unlock();
> +		if (SLAVE_AD_INFO(slave).aggregator.is_active)
>  			return &(SLAVE_AD_INFO(slave).aggregator);
> -		}
> -	rcu_read_unlock();
>  
>  	return NULL;
>  }
> @@ -1495,11 +1493,11 @@ static void ad_agg_selection_logic(struct aggregator *agg)
>  	struct slave *slave;
>  	struct port *port;
>  
> +	rcu_read_lock();
>  	origin = agg;
>  	active = __get_active_agg(agg);
>  	best = (active && agg_device_up(active)) ? active : NULL;
>  
> -	rcu_read_lock();
>  	bond_for_each_slave_rcu(bond, slave, iter) {
>  		agg = &(SLAVE_AD_INFO(slave).aggregator);
>  
> 
Great.

Acked-by: Ding Tianhong <dingtianhong@huawei.com>

^ permalink raw reply

* Re: [PATCH v3 net-next 2/3] bonding: fix __get_first_agg RCU usage
From: Ding Tianhong @ 2014-01-10 10:43 UTC (permalink / raw)
  To: Veaceslav Falico, netdev; +Cc: Jay Vosburgh, Andy Gospodarek
In-Reply-To: <1389345523-5497-3-git-send-email-vfalico@redhat.com>

On 2014/1/10 17:18, Veaceslav Falico wrote:
> Currently, the RCU read lock usage is just wrong - it gets the slave struct
> under RCU and continues to use it when RCU lock is released.
> 
> However, it's still safe to do this cause we didn't need the
> rcu_read_lock() initially - all of the __get_first_agg() callers are either
> holding RCU read lock or the RTNL lock, so that we can't sync while in it.
> 
> So, remove the useless rcu locking and add a comment.
> 
> Fixes: be79bd048 ("bonding: add RCU for bond_3ad_state_machine_handler()")
> CC: dingtianhong@huawei.com
> CC: Jay Vosburgh <fubar@us.ibm.com>
> CC: Andy Gospodarek <andy@greyhouse.net>
> Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
> ---
> 
> Notes:
>     v2 -> v3:
>     Use the rcu primitives.
>     
>     v1 -> v2:
>     Don't use RCU primitives as we can hold RTNL.
> 
>  drivers/net/bonding/bond_3ad.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
> index 9ff55eb..27dac0e 100644
> --- a/drivers/net/bonding/bond_3ad.c
> +++ b/drivers/net/bonding/bond_3ad.c
> @@ -143,6 +143,7 @@ static inline struct bonding *__get_bond_by_port(struct port *port)
>   *
>   * Return the aggregator of the first slave in @bond, or %NULL if it can't be
>   * found.
> + * The caller must hold RCU lock.
>   */
>  static inline struct aggregator *__get_first_agg(struct port *port)
>  {
> @@ -153,9 +154,7 @@ static inline struct aggregator *__get_first_agg(struct port *port)
>  	if (bond == NULL)
>  		return NULL;
>  
> -	rcu_read_lock();
>  	first_slave = bond_first_slave_rcu(bond);
> -	rcu_read_unlock();
>  
I am afraid the lockdep check will calling some warming:
bond_3ad_unbind_slave -> __get_first_agg -> bond_first_slave_rcu -> netdev_lower_get_first_private_rcu -> list_first_or_null_rcu

but the bond_3ad_unbind_slave is not in RCU.

Regards
Ding
>  	return first_slave ? &(SLAVE_AD_INFO(first_slave).aggregator) : NULL;
>  }
> 

^ permalink raw reply

* RE: [PATCH v2 3/4] net: make tcp_cleanup_rbuf private
From: David Laight @ 2014-01-10 10:38 UTC (permalink / raw)
  To: 'David Miller', dan.j.williams@intel.com
  Cc: ncardwell@google.com, dmaengine@vger.kernel.org,
	yoshfuji@linux-ipv6.org, netdev@vger.kernel.org,
	jmorris@namei.org, kuznet@ms2.inr.ac.ru, kaber@trash.net,
	linux-kernel@vger.kernel.org
In-Reply-To: <20140109.154232.65818281679170677.davem@davemloft.net>

From: David Miller
...
> > On Thu, Jan 9, 2014 at 12:26 PM, Neal Cardwell <ncardwell@google.com> wrote:
> >> On Thu, Jan 9, 2014 at 3:16 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> >>> net_dma was the only external user so this can become local to tcp.c
> >>> again.
> >> ...
> >>> -void tcp_cleanup_rbuf(struct sock *sk, int copied)
> >>> +static void cleanup_rbuf(struct sock *sk, int copied)
> >>
> >> I would vote to keep the tcp_ prefix. In the TCP code base that is the
> >> more common idiom, even for internal/static TCP functions, and
> >> personally I find it easier to read and work with in stack traces,
> >> etc. My 2 cents.
> >>
> >
> > Ok.  It was cleanup_rbuf() in a former life, but one vote for leaving
> > the name as is is enough for me.
> 
> You can make that two votes :)

I suspect DM adds 2000 votes :-)

Keeping the prefix makes it easier to grep for, and makes it more
obvious that it isn't a generic function (when being called).

	David

^ permalink raw reply

* Re: [net-next 05/16] i40e: fix long lines
From: Joe Perches @ 2014-01-10 10:35 UTC (permalink / raw)
  To: David Laight
  Cc: Jeff Kirsher, davem@davemloft.net, Mitch Williams,
	netdev@vger.kernel.org, gospo@redhat.com, sassmann@redhat.com,
	Jesse Brandeburg
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D457950@AcuExch.aculab.com>

On Fri, 2014-01-10 at 10:02 +0000, David Laight wrote:
> > Another way this could be changed is to put the
> > return value on a separate line like:
> > 
> > i40e_status
> > i40e_aq_set_vsi_unicast_promiscuous(struct i40e_hw *hw, u16 seid, bool set,
> > 				    struct i40e_asq_cmd_details *cmd_details)
> 
> Personally I prefer that so I can find the definition by grepping
> for '^function_name' - but it doesn't seem to be done in any linux
> source files.

That form is used in a lot of files.
$ git grep -E "^[a-z_]+\("

^ permalink raw reply

* Re: [PATCH v3 net-next 1/3] bonding: fix bond_3ad_set_carrier() RCU usage
From: Ding Tianhong @ 2014-01-10 10:34 UTC (permalink / raw)
  To: Veaceslav Falico, netdev; +Cc: Jay Vosburgh, Andy Gospodarek
In-Reply-To: <1389345523-5497-2-git-send-email-vfalico@redhat.com>

On 2014/1/10 17:18, Veaceslav Falico wrote:
> Currently, its usage is just plainly wrong. It first gets a slave under
> RCU, and, after releasing the RCU lock, continues to use it - whilst it can
> be freed.
> 
> Fix this by ensuring that bond_3ad_set_carrier() holds RCU till it uses its
> slave (or its agg).
> 
> Fixes: be79bd048ab ("bonding: add RCU for bond_3ad_state_machine_handler()")
> CC: dingtianhong@huawei.com
> CC: Jay Vosburgh <fubar@us.ibm.com>
> CC: Andy Gospodarek <andy@greyhouse.net>
> Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
> ---
> 
> Notes:
>     v2 -> v3:
>     Just wrap RCU for the whole usage of our slave.
>     
>     v1 -> v2:
>     Don't use _rcu primitives as we can be called under RTNL too.
>     
>     v1 -> v2:
>     Don't use _rcu primitives as we can be called under RTNL too.
> 
>  drivers/net/bonding/bond_3ad.c | 23 ++++++++++++-----------
>  1 file changed, 12 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
> index 29db1ca..9ff55eb 100644
> --- a/drivers/net/bonding/bond_3ad.c
> +++ b/drivers/net/bonding/bond_3ad.c
> @@ -2327,32 +2327,33 @@ int bond_3ad_set_carrier(struct bonding *bond)
>  {
>  	struct aggregator *active;
>  	struct slave *first_slave;
> +	int ret = 1;
>  
>  	rcu_read_lock();
>  	first_slave = bond_first_slave_rcu(bond);
> -	rcu_read_unlock();
> -	if (!first_slave)
> -		return 0;
> +	if (!first_slave) {
> +		ret = 0;
> +		goto out;
> +	}
>  	active = __get_active_agg(&(SLAVE_AD_INFO(first_slave).aggregator));
>  	if (active) {
>  		/* are enough slaves available to consider link up? */
>  		if (active->num_of_ports < bond->params.min_links) {
>  			if (netif_carrier_ok(bond->dev)) {
>  				netif_carrier_off(bond->dev);
> -				return 1;
> +				goto out;
>  			}
>  		} else if (!netif_carrier_ok(bond->dev)) {
>  			netif_carrier_on(bond->dev);
> -			return 1;
> +			goto out;
>  		}
> -		return 0;
> -	}
> -
> -	if (netif_carrier_ok(bond->dev)) {
> +	} else if (netif_carrier_ok(bond->dev)) {
>  		netif_carrier_off(bond->dev);
> -		return 1;
> +		goto out;
no need for this line, but it is not a big issue.

Regards
Ding

>  	}
> -	return 0;
> +out:
> +	rcu_read_unlock();
> +	return ret;
>  }
>  
>  /**
> 

^ permalink raw reply

* RE: [PATCH -next] qlcnic: fix compiler warning
From: David Laight @ 2014-01-10 10:15 UTC (permalink / raw)
  To: 'Shahed Shaikh', Martin Kaiser, Himanshu Madhani,
	Rajesh Borundia
  Cc: linux-kernel, trivial@kernel.org, netdev
In-Reply-To: <262CB373A6D1F14F9B81E82F74F77D5A46F52B8E@avmb2.qlogic.org>

From: Shahed Shaikh
> Adding netdev.
> 
> > -----Original Message-----
> > From: Martin Kaiser,,, [mailto:martin@reykholt.kaiser.cx] On Behalf Of
> > Martin Kaiser
> > Sent: Thursday, January 09, 2014 9:29 PM
> > To: Himanshu Madhani; Rajesh Borundia
> > Cc: linux-kernel; trivial@kernel.org
> > Subject: [PATCH -next] qlcnic: fix compiler warning
> >
> > Add an explicit cast to fix the following warning (seen on Debian Wheezy, gcc
> > 4.7.2)
> >
> > CC [M]  drivers/net/wireless/rtlwifi/rtl8192ce/trx.o
> >     drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c: In function
> > qlcnic_send_filter:
> >     drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c:349:3: warning:
> >     passing argument 2 of ether_addr_equal from incompatible pointer type
> > [enabled by default]
> >     In file included from include/linux/if_vlan.h:16:0,
> >     from drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c:9:
> >     include/linux/etherdevice.h:244:20: note: expected const u8 * but
> > argument is of type u64 *
> >
> 
> If I am not wrong, this patch should go to David's net-next tree.
> 
> > Signed-off-by: Martin Kaiser <martin@kaiser.cx>
> 
> Acked-by: Shahed Shaikh <shahed.shaikh@qlogic.com>
> 
> Thanks,
> Shahed
> > ---
> >  drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c |    2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c
> > b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c
> > index 6373f60..3557154 100644
> > --- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c
> > +++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c
> > @@ -346,7 +346,7 @@ static void qlcnic_send_filter(struct qlcnic_adapter
> > *adapter,
> >  	head = &(adapter->fhash.fhead[hindex]);
> >
> >  	hlist_for_each_entry_safe(tmp_fil, n, head, fnode) {
> > -		if (ether_addr_equal(tmp_fil->faddr, &src_addr) &&
> > +		if (ether_addr_equal(tmp_fil->faddr, (const u8 *)&src_addr)
> > &&
> >  		    tmp_fil->vlan_id == vlan_id) {
> >  			if (jiffies > (QLCNIC_READD_AGE * HZ + tmp_fil-
> > >ftime))
> >  				qlcnic_change_filter(adapter, &src_addr,
> > --
> > 1.7.10.4

A quick look at the code seems to imply that this code is somewhat buggy.
'src_addr' is defined a u64 even though it is a 6 byte mac address.
On big-endian systems the wrong bytes get accessed all over the place.

Also when ether_addr_equal() in inlined the code ends up accessing src_addr
as 32bit and 16bit data - which don't have to be synchronised against
and writes to the 64bit item.

If might be that ether_addr_equal() (etc) should have a gcc asm
statement with a "memory" constraint against the mac address buffers.

	David


^ permalink raw reply

* RE: [net-next 05/16] i40e: fix long lines
From: David Laight @ 2014-01-10 10:02 UTC (permalink / raw)
  To: 'Joe Perches', Jeff Kirsher
  Cc: davem@davemloft.net, Mitch Williams, netdev@vger.kernel.org,
	gospo@redhat.com, sassmann@redhat.com, Jesse Brandeburg
In-Reply-To: <1389344869.24222.31.camel@joe-AO722>

> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c
> b/drivers/net/ethernet/intel/i40e/i40e_common.c
> []
> > @@ -681,7 +681,8 @@ aq_add_vsi_exit:
> >   * @cmd_details: pointer to command details structure or NULL
> >   **/
> >  i40e_status i40e_aq_set_vsi_unicast_promiscuous(struct i40e_hw *hw,
> > -				u16 seid, bool set, struct i40e_asq_cmd_details *cmd_details)
> > +				u16 seid, bool set,
> > +				struct i40e_asq_cmd_details *cmd_details)
> 
> Another way this could be changed is to put the
> return value on a separate line like:
> 
> i40e_status
> i40e_aq_set_vsi_unicast_promiscuous(struct i40e_hw *hw, u16 seid, bool set,
> 				    struct i40e_asq_cmd_details *cmd_details)

Personally I prefer that so I can find the definition by grepping
for '^function_name' - but it doesn't seem to be done in any linux
source files.

> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_prototype.h
> b/drivers/net/ethernet/intel/i40e/i40e_prototype.h
> []
> > @@ -93,9 +93,9 @@ i40e_status i40e_aq_set_vsi_broadcast(struct i40e_hw *hw,
> >  				u16 vsi_id, bool set_filter,
> >  				struct i40e_asq_cmd_details *cmd_details);
> >  i40e_status i40e_aq_set_vsi_unicast_promiscuous(struct i40e_hw *hw,
> > -				u16 vsi_id, bool set, struct i40e_asq_cmd_details *cmd_details);
> > +		u16 vsi_id, bool set, struct i40e_asq_cmd_details *cmd_details);
> 
> i40e_status
> i40e_aq_set_vsi_unicast_promiscuous(struct i40e_hw *hw, u16 vsi_id, bool set,
> 				    struct i40e_asq_cmd_details *cmd_details);
> 
> etc...
> 
> but once you use extremely long 35+ characters
> function names, 80 columns gets a bit silly too.

Again this only a real problem when the coding stand requires that
continuation lines have the parameters lined up under the '('.
If they are indented by 4 spaces (as many of the BSDs do) then you
do stand a chance of laying out a call in a reasonable number of
vertical lines.

My Actual preferred layout is to exclude tabs, indent by 4 spaces
and double-indent continuation lines.
I'll also start continuation lines (of expressions) with an operator
to make it even clearer they are continuation lines.
When I'm scan-reading code I tend to concentrate on the LHS of each line.
- so that is where the information need to be.

Not that I expect Linux to change!
(Other code bases have much worse layout rules.)

	David

^ permalink raw reply

* [PATCH net-next v2] ipv6: move IPV6_TCLASS_SHIFT into ipv6.h
From: roy.qing.li @ 2014-01-10 10:03 UTC (permalink / raw)
  To: netdev

From: Li RongQing <roy.qing.li@gmail.com>

Two places defined IPV6_TCLASS_SHIFT, so we should move it into ipv6.h,
and use this macro as possible.

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
---
 Sorry, version 1 has bug in fib6_rule_match it should be frist
mask flowlabel with IPV6_TCLASS_MASK, then call ntohl 

 include/net/ipv6.h       |    1 +
 net/ipv6/fib6_rules.c    |    4 +++-
 net/ipv6/ip6_gre.c       |    2 --
 net/ipv6/ip6_tunnel.c    |    2 --
 net/ipv6/ipv6_sockglue.c |    4 +++-
 5 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 12079c6..d819f7a 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -239,6 +239,7 @@ struct ip6_flowlabel {
 #define IPV6_FLOWINFO_MASK	cpu_to_be32(0x0FFFFFFF)
 #define IPV6_FLOWLABEL_MASK	cpu_to_be32(0x000FFFFF)
 #define IPV6_TCLASS_MASK (IPV6_FLOWINFO_MASK & ~IPV6_FLOWLABEL_MASK)
+#define IPV6_TCLASS_SHIFT	20
 
 struct ipv6_fl_socklist {
 	struct ipv6_fl_socklist	__rcu	*next;
diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c
index 3fd0a57..d50c5ca 100644
--- a/net/ipv6/fib6_rules.c
+++ b/net/ipv6/fib6_rules.c
@@ -150,6 +150,8 @@ static int fib6_rule_match(struct fib_rule *rule, struct flowi *fl, int flags)
 {
 	struct fib6_rule *r = (struct fib6_rule *) rule;
 	struct flowi6 *fl6 = &fl->u.ip6;
+	u8 tclass = ntohl(fl6->flowlabel & IPV6_TCLASS_MASK) >>
+		    IPV6_TCLASS_SHIFT;
 
 	if (r->dst.plen &&
 	    !ipv6_prefix_equal(&fl6->daddr, &r->dst.addr, r->dst.plen))
@@ -169,7 +171,7 @@ static int fib6_rule_match(struct fib_rule *rule, struct flowi *fl, int flags)
 			return 0;
 	}
 
-	if (r->tclass && r->tclass != ((ntohl(fl6->flowlabel) >> 20) & 0xff))
+	if (r->tclass && r->tclass != tclass)
 		return 0;
 
 	return 1;
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index e7a440d..f3ffb43 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -61,8 +61,6 @@ static bool log_ecn_error = true;
 module_param(log_ecn_error, bool, 0644);
 MODULE_PARM_DESC(log_ecn_error, "Log packets received with corrupted ECN");
 
-#define IPV6_TCLASS_SHIFT 20
-
 #define HASH_SIZE_SHIFT  5
 #define HASH_SIZE (1 << HASH_SIZE_SHIFT)
 
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 1e5e240..5db8d31 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -69,8 +69,6 @@ MODULE_ALIAS_NETDEV("ip6tnl0");
 #define IP6_TNL_TRACE(x...) do {;} while(0)
 #endif
 
-#define IPV6_TCLASS_SHIFT 20
-
 #define HASH_SIZE_SHIFT  5
 #define HASH_SIZE (1 << HASH_SIZE_SHIFT)
 
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index af0ecb9..1c9aa35 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -1019,7 +1019,9 @@ static int do_ipv6_getsockopt(struct sock *sk, int level, int optname,
 				put_cmsg(&msg, SOL_IPV6, IPV6_HOPLIMIT, sizeof(hlim), &hlim);
 			}
 			if (np->rxopt.bits.rxtclass) {
-				int tclass = ntohl(np->rcv_flowinfo & IPV6_TCLASS_MASK) >> 20;
+				int tclass = ntohl(np->rcv_flowinfo &
+						   IPV6_TCLASS_MASK) >>
+					     IPV6_TCLASS_SHIFT;
 				put_cmsg(&msg, SOL_IPV6, IPV6_TCLASS, sizeof(tclass), &tclass);
 			}
 			if (np->rxopt.bits.rxoinfo) {
-- 
1.7.10.4

^ permalink raw reply related

* [RFC PATCH 12/12] net: sched: make tc_action safe to walk under RCU
From: John Fastabend @ 2014-01-10  9:44 UTC (permalink / raw)
  To: xiyou.wangcong, jhs, eric.dumazet; +Cc: netdev, davem
In-Reply-To: <20140110092041.7193.5952.stgit@nitbit.x32>

This patch makes tc_actions safe to walk from classifiers under RCU.
Notice that each action act() callback defined in each act_*.c
already does its own locking. This is needed even in the current
infrastructure for the case where users add/remove actions via
'tc actions' and reference them via index from the classifiers.

There are a couple interesting pieces here (i.e. need careful review)
In tcf_exts_exec() the call to tcf_action_exec follows a list_empty
check. However although this is occurring under RCU there is no
guarantee that the list is not empty when tcf_action_exec is called.
This patch fixes up return values from tcf_action_exec() so that
packets wont be dropped on the floor when this occurs. Hopefully
its rare and by using RCU we sort of make this assumption.

Second there is a suspect usage of list_splice_init_rcu() in the
tcf_exts_change() routine. Notice how it is used twice in succession
and the second init works on the src tcf_exts. There is probably a
better way to accomplish that.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/act_api.h |    1 +
 include/net/pkt_cls.h |   12 ++++++++++--
 net/sched/act_api.c   |   18 +++++++++---------
 net/sched/cls_api.c   |   15 ++++++++-------
 4 files changed, 28 insertions(+), 18 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 77d5d81..93f248e 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -81,6 +81,7 @@ struct tc_action {
 	__u32			type; /* for backward compat(TCA_OLD_COMPAT) */
 	__u32			order;
 	struct list_head	list;
+	struct rcu_head		rcu;
 };
 
 #define TCA_CAP_NONE 0
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 50ea079..53b6e50 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -63,7 +63,7 @@ tcf_unbind_filter(struct tcf_proto *tp, struct tcf_result *r)
 struct tcf_exts {
 #ifdef CONFIG_NET_CLS_ACT
 	__u32	type; /* for backward compat(TCA_OLD_COMPAT) */
-	struct list_head actions;
+	struct list_head __rcu actions;
 #endif
 	/* Map to export classifier specific extension TLV types to the
 	 * generic extensions API. Unsupported extensions must be set to 0.
@@ -76,7 +76,7 @@ static inline void tcf_exts_init(struct tcf_exts *exts, int action, int police)
 {
 #ifdef CONFIG_NET_CLS_ACT
 	exts->type = 0;
-	INIT_LIST_HEAD(&exts->actions);
+	INIT_LIST_HEAD_RCU(&exts->actions);
 #endif
 	exts->action = action;
 	exts->police = police;
@@ -88,6 +88,8 @@ static inline void tcf_exts_init(struct tcf_exts *exts, int action, int police)
  *
  * Returns 1 if a predicative extension is present, i.e. an extension which
  * might cause further actions and thus overrule the regular tcf_result.
+ *
+ * This check is only valid if done under RTNL.
  */
 static inline int
 tcf_exts_is_predicative(struct tcf_exts *exts)
@@ -128,6 +130,12 @@ tcf_exts_exec(struct sk_buff *skb, struct tcf_exts *exts,
 	       struct tcf_result *res)
 {
 #ifdef CONFIG_NET_CLS_ACT
+	/* This check is racy but this is OK, if the list is emptied
+	 * before walking the chain of actions the return value has
+	 * been updated to return zero. This way packets will not be
+	 * dropped when action list deletion occurs after the empty
+	 * check but before execution
+	 */ 
 	if (!list_empty(&exts->actions))
 		return tcf_action_exec(skb, &exts->actions, res);
 #endif
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index dce2b6e..410e69c 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -345,14 +345,14 @@ int tcf_action_exec(struct sk_buff *skb, const struct list_head *actions,
 		    struct tcf_result *res)
 {
 	const struct tc_action *a;
-	int ret = -1;
+	int ret = 0;
 
 	if (skb->tc_verd & TC_NCLS) {
 		skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
 		ret = TC_ACT_OK;
 		goto exec_done;
 	}
-	list_for_each_entry(a, actions, list) {
+	list_for_each_entry_rcu(a, actions, list) {
 repeat:
 		if (a->ops) {
 			ret = a->ops->act(skb, a, res);
@@ -380,13 +380,13 @@ void tcf_action_destroy(struct list_head *actions, int bind)
 		if (a->ops) {
 			if (a->ops->cleanup(a, bind) == ACT_P_DELETED)
 				module_put(a->ops->owner);
-			list_del(&a->list);
-			kfree(a);
+			list_del_rcu(&a->list);
+			kfree_rcu(a, rcu);
 		} else {
 			/*FIXME: Remove later - catch insertion bugs*/
 			WARN(1, "tcf_action_destroy: BUG? destroying NULL ops\n");
-			list_del(&a->list);
-			kfree(a);
+			list_del_rcu(&a->list);
+			kfree_rcu(a, rcu);
 		}
 	}
 }
@@ -559,7 +559,7 @@ int tcf_action_init(struct net *net, struct nlattr *nla,
 			goto err;
 		}
 		act->order = i;
-		list_add_tail(&act->list, actions);
+		list_add_tail_rcu(&act->list, actions);
 	}
 	return 0;
 
@@ -708,8 +708,8 @@ static void cleanup_a(struct list_head *actions)
 	struct tc_action *a, *tmp;
 
 	list_for_each_entry_safe(a, tmp, actions, list) {
-		list_del(&a->list);
-		kfree(a);
+		list_del_rcu(&a->list);
+		kfree_rcu(a, rcu);
 	}
 }
 
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 86c3678..af87be5 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -496,7 +496,7 @@ void tcf_exts_destroy(struct tcf_proto *tp, struct tcf_exts *exts)
 {
 #ifdef CONFIG_NET_CLS_ACT
 	tcf_action_destroy(&exts->actions, TCA_ACT_UNBIND);
-	INIT_LIST_HEAD(&exts->actions);
+	INIT_LIST_HEAD_RCU(&exts->actions);
 #endif
 }
 EXPORT_SYMBOL(tcf_exts_destroy);
@@ -508,7 +508,7 @@ int tcf_exts_validate(struct net *net, struct tcf_proto *tp, struct nlattr **tb,
 	{
 		struct tc_action *act;
 
-		INIT_LIST_HEAD(&exts->actions);
+		INIT_LIST_HEAD_RCU(&exts->actions);
 		if (exts->police && tb[exts->police]) {
 			act = tcf_action_init_1(net, tb[exts->police], rate_tlv,
 						"police", TCA_ACT_NOREPLACE,
@@ -517,7 +517,7 @@ int tcf_exts_validate(struct net *net, struct tcf_proto *tp, struct nlattr **tb,
 				return PTR_ERR(act);
 
 			act->type = exts->type = TCA_OLD_COMPAT;
-			list_add(&act->list, &exts->actions);
+			list_add_rcu(&act->list, &exts->actions);
 		} else if (exts->action && tb[exts->action]) {
 			int err;
 			err = tcf_action_init(net, tb[exts->action], rate_tlv,
@@ -537,16 +537,17 @@ int tcf_exts_validate(struct net *net, struct tcf_proto *tp, struct nlattr **tb,
 }
 EXPORT_SYMBOL(tcf_exts_validate);
 
+/* It is not safe to use src->actions after this due to _init_rcu usage
+ * INIT_LIST_HEAD_RCU() is called on src->actions
+ */
 void tcf_exts_change(struct tcf_proto *tp, struct tcf_exts *dst,
 		     struct tcf_exts *src)
 {
 #ifdef CONFIG_NET_CLS_ACT
 	if (!list_empty(&src->actions)) {
 		LIST_HEAD(tmp);
-		tcf_tree_lock(tp);
-		list_splice_init(&dst->actions, &tmp);
-		list_splice(&src->actions, &dst->actions);
-		tcf_tree_unlock(tp);
+		list_splice_init_rcu(&dst->actions, &tmp, synchronize_rcu);
+		list_splice_init_rcu(&src->actions, &dst->actions, synchronize_rcu);
 		tcf_action_destroy(&tmp, TCA_ACT_UNBIND);
 	}
 #endif

^ permalink raw reply related

* [RFC PATCH 11/12] net: make cls_bpf rcu safe
From: John Fastabend @ 2014-01-10  9:43 UTC (permalink / raw)
  To: xiyou.wangcong, jhs, eric.dumazet; +Cc: netdev, davem
In-Reply-To: <20140110092041.7193.5952.stgit@nitbit.x32>

This patch makes the cls_bpf classifier RCU safe. The tcf_lock
was being used to protect a list of cls_bpf_prog now this list
is RCU safe and updates occur with rcu_replace.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 net/sched/cls_bpf.c |   79 ++++++++++++++++++++++++++-------------------------
 1 file changed, 40 insertions(+), 39 deletions(-)

diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 00a5a58..e69adc330 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -25,8 +25,9 @@ MODULE_AUTHOR("Daniel Borkmann <dborkman@redhat.com>");
 MODULE_DESCRIPTION("TC BPF based classifier");
 
 struct cls_bpf_head {
-	struct list_head plist;
+	struct list_head __rcu plist;
 	u32 hgen;
+	struct rcu_head rcu;
 };
 
 struct cls_bpf_prog {
@@ -37,6 +38,8 @@ struct cls_bpf_prog {
 	struct list_head link;
 	u32 handle;
 	u16 bpf_len;
+	struct tcf_proto *tp;
+	struct rcu_head rcu;
 };
 
 static const struct nla_policy bpf_policy[TCA_BPF_MAX + 1] = {
@@ -49,11 +52,11 @@ static const struct nla_policy bpf_policy[TCA_BPF_MAX + 1] = {
 static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 			    struct tcf_result *res)
 {
-	struct cls_bpf_head *head = tp->root;
+	struct cls_bpf_head *head = rcu_dereference(tp->root);
 	struct cls_bpf_prog *prog;
 	int ret;
 
-	list_for_each_entry(prog, &head->plist, link) {
+	list_for_each_entry_rcu(prog, &head->plist, link) {
 		int filter_res = SK_RUN_FILTER(prog->filter, skb);
 
 		if (filter_res == 0)
@@ -81,8 +84,8 @@ static int cls_bpf_init(struct tcf_proto *tp)
 	if (head == NULL)
 		return -ENOBUFS;
 
-	INIT_LIST_HEAD(&head->plist);
-	tp->root = head;
+	INIT_LIST_HEAD_RCU(&head->plist);
+	rcu_assign_pointer(tp->root, head);
 
 	return 0;
 }
@@ -98,6 +101,13 @@ static void cls_bpf_delete_prog(struct tcf_proto *tp, struct cls_bpf_prog *prog)
 	kfree(prog);
 }
 
+void __cls_bpf_delete_prog(struct rcu_head *rcu)
+{
+	struct cls_bpf_prog *prog = container_of(rcu, struct cls_bpf_prog, rcu);
+
+	cls_bpf_delete_prog(prog->tp, prog);
+}
+
 static int cls_bpf_delete(struct tcf_proto *tp, unsigned long arg)
 {
 	struct cls_bpf_head *head = tp->root;
@@ -105,11 +115,8 @@ static int cls_bpf_delete(struct tcf_proto *tp, unsigned long arg)
 
 	list_for_each_entry(prog, &head->plist, link) {
 		if (prog == todel) {
-			tcf_tree_lock(tp);
-			list_del(&prog->link);
-			tcf_tree_unlock(tp);
-
-			cls_bpf_delete_prog(tp, prog);
+			list_del_rcu(&prog->link);
+			call_rcu(&prog->rcu, __cls_bpf_delete_prog);
 			return 0;
 		}
 	}
@@ -123,11 +130,12 @@ static void cls_bpf_destroy(struct tcf_proto *tp)
 	struct cls_bpf_prog *prog, *tmp;
 
 	list_for_each_entry_safe(prog, tmp, &head->plist, link) {
-		list_del(&prog->link);
-		cls_bpf_delete_prog(tp, prog);
+		list_del_rcu(&prog->link);
+		call_rcu(&prog->rcu, __cls_bpf_delete_prog);
 	}
 
-	kfree(head);
+	rcu_assign_pointer(tp->root, NULL);
+	kfree_rcu(head, rcu);
 }
 
 static unsigned long cls_bpf_get(struct tcf_proto *tp, u32 handle)
@@ -158,10 +166,10 @@ static int cls_bpf_modify_existing(struct net *net, struct tcf_proto *tp,
 				   unsigned long base, struct nlattr **tb,
 				   struct nlattr *est)
 {
-	struct sock_filter *bpf_ops, *bpf_old;
+	struct sock_filter *bpf_ops;
 	struct tcf_exts exts;
 	struct sock_fprog tmp;
-	struct sk_filter *fp, *fp_old;
+	struct sk_filter *fp;
 	u16 bpf_size, bpf_len;
 	u32 classid;
 	int ret;
@@ -197,24 +205,13 @@ static int cls_bpf_modify_existing(struct net *net, struct tcf_proto *tp,
 	if (ret)
 		goto errout_free;
 
-	tcf_tree_lock(tp);
-	fp_old = prog->filter;
-	bpf_old = prog->bpf_ops;
-
 	prog->bpf_len = bpf_len;
 	prog->bpf_ops = bpf_ops;
 	prog->filter = fp;
 	prog->res.classid = classid;
-	tcf_tree_unlock(tp);
 
 	tcf_bind_filter(tp, &prog->res, base);
 	tcf_exts_change(tp, &prog->exts, &exts);
-
-	if (fp_old)
-		sk_unattached_filter_destroy(fp_old);
-	if (bpf_old)
-		kfree(bpf_old);
-
 	return 0;
 
 errout_free:
@@ -245,8 +242,9 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
 			  unsigned long *arg)
 {
 	struct cls_bpf_head *head = tp->root;
-	struct cls_bpf_prog *prog = (struct cls_bpf_prog *) *arg;
+	struct cls_bpf_prog *oldprog = (struct cls_bpf_prog *) *arg;
 	struct nlattr *tb[TCA_BPF_MAX + 1];
+	struct cls_bpf_prog *prog;
 	int ret;
 
 	if (tca[TCA_OPTIONS] == NULL)
@@ -256,18 +254,19 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
 	if (ret < 0)
 		return ret;
 
-	if (prog != NULL) {
-		if (handle && prog->handle != handle)
-			return -EINVAL;
-		return cls_bpf_modify_existing(net, tp, prog, base, tb,
-					       tca[TCA_RATE]);
-	}
-
 	prog = kzalloc(sizeof(*prog), GFP_KERNEL);
 	if (prog == NULL)
 		return -ENOBUFS;
 
 	tcf_exts_init(&prog->exts, TCA_BPF_ACT, TCA_BPF_POLICE);
+
+	if (oldprog) {
+		if (handle && oldprog->handle != handle) {
+			ret = -EINVAL;
+			goto errout;
+		}
+	}
+
 	if (handle == 0)
 		prog->handle = cls_bpf_grab_new_handle(tp, head);
 	else
@@ -281,15 +280,17 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
 	if (ret < 0)
 		goto errout;
 
-	tcf_tree_lock(tp);
-	list_add(&prog->link, &head->plist);
-	tcf_tree_unlock(tp);
+	if (oldprog) {
+		list_replace_rcu(&prog->link, &oldprog->link);
+		call_rcu(&oldprog->rcu, __cls_bpf_delete_prog);
+	} else {
+		list_add_rcu(&prog->link, &head->plist);
+	}
 
 	*arg = (unsigned long) prog;
-
 	return 0;
 errout:
-	if (*arg == 0UL && prog)
+	if (prog)
 		kfree(prog);
 
 	return ret;

^ permalink raw reply related

* [RFC PATCH 10/12] net: sched: rcu'ify cls_rsvp
From: John Fastabend @ 2014-01-10  9:43 UTC (permalink / raw)
  To: xiyou.wangcong, jhs, eric.dumazet; +Cc: netdev, davem
In-Reply-To: <20140110092041.7193.5952.stgit@nitbit.x32>

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 net/sched/cls_rsvp.h |  152 ++++++++++++++++++++++++++++----------------------
 1 file changed, 86 insertions(+), 66 deletions(-)

diff --git a/net/sched/cls_rsvp.h b/net/sched/cls_rsvp.h
index 4f25c2a..ebdbd45 100644
--- a/net/sched/cls_rsvp.h
+++ b/net/sched/cls_rsvp.h
@@ -70,31 +70,34 @@ struct rsvp_head {
 	u32			tmap[256/32];
 	u32			hgenerator;
 	u8			tgenerator;
-	struct rsvp_session	*ht[256];
+	struct rsvp_session __rcu *ht[256];
+	struct rcu_head		rcu;
 };
 
 struct rsvp_session {
-	struct rsvp_session	*next;
-	__be32			dst[RSVP_DST_LEN];
-	struct tc_rsvp_gpi 	dpi;
-	u8			protocol;
-	u8			tunnelid;
+	struct rsvp_session __rcu	*next;
+	__be32				dst[RSVP_DST_LEN];
+	struct tc_rsvp_gpi		dpi;
+	u8				protocol;
+	u8				tunnelid;
 	/* 16 (src,sport) hash slots, and one wildcard source slot */
-	struct rsvp_filter	*ht[16 + 1];
+	struct rsvp_filter __rcu	*ht[16 + 1];
+	struct rcu_head			rcu;
 };
 
 
 struct rsvp_filter {
-	struct rsvp_filter	*next;
-	__be32			src[RSVP_DST_LEN];
-	struct tc_rsvp_gpi	spi;
-	u8			tunnelhdr;
+	struct rsvp_filter __rcu	*next;
+	__be32				src[RSVP_DST_LEN];
+	struct tc_rsvp_gpi		spi;
+	u8				tunnelhdr;
 
-	struct tcf_result	res;
-	struct tcf_exts		exts;
+	struct tcf_result		res;
+	struct tcf_exts			exts;
 
-	u32			handle;
-	struct rsvp_session	*sess;
+	u32				handle;
+	struct rsvp_session		*sess;
+	struct rcu_head			rcu;
 };
 
 static inline unsigned int hash_dst(__be32 *dst, u8 protocol, u8 tunnelid)
@@ -128,7 +131,7 @@ static inline unsigned int hash_src(__be32 *src)
 static int rsvp_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 			 struct tcf_result *res)
 {
-	struct rsvp_session **sht = ((struct rsvp_head *)tp->root)->ht;
+	struct rsvp_head *head = rcu_dereference_bh(tp->root);
 	struct rsvp_session *s;
 	struct rsvp_filter *f;
 	unsigned int h1, h2;
@@ -169,7 +172,8 @@ restart:
 	h1 = hash_dst(dst, protocol, tunnelid);
 	h2 = hash_src(src);
 
-	for (s = sht[h1]; s; s = s->next) {
+	for (s = rcu_dereference_bh(head->ht[h1]); s;
+	     s = rcu_dereference_bh(s->next)) {
 		if (dst[RSVP_DST_LEN-1] == s->dst[RSVP_DST_LEN - 1] &&
 		    protocol == s->protocol &&
 		    !(s->dpi.mask &
@@ -181,7 +185,8 @@ restart:
 #endif
 		    tunnelid == s->tunnelid) {
 
-			for (f = s->ht[h2]; f; f = f->next) {
+			for (f = rcu_dereference_bh(s->ht[h2]); f;
+			     f = rcu_dereference_bh(f->next)) {
 				if (src[RSVP_DST_LEN-1] == f->src[RSVP_DST_LEN - 1] &&
 				    !(f->spi.mask & (*(u32 *)(xprt + f->spi.offset) ^ f->spi.key))
 #if RSVP_DST_LEN == 4
@@ -205,7 +210,8 @@ matched:
 			}
 
 			/* And wildcard bucket... */
-			for (f = s->ht[16]; f; f = f->next) {
+			for (f = rcu_dereference_bh(s->ht[16]); f;
+			     f = rcu_dereference_bh(f->next)) {
 				*res = f->res;
 				RSVP_APPLY_RESULT();
 				goto matched;
@@ -218,7 +224,7 @@ matched:
 
 static unsigned long rsvp_get(struct tcf_proto *tp, u32 handle)
 {
-	struct rsvp_session **sht = ((struct rsvp_head *)tp->root)->ht;
+	struct rsvp_head *head = rtnl_dereference(tp->root);
 	struct rsvp_session *s;
 	struct rsvp_filter *f;
 	unsigned int h1 = handle & 0xFF;
@@ -227,8 +233,10 @@ static unsigned long rsvp_get(struct tcf_proto *tp, u32 handle)
 	if (h2 > 16)
 		return 0;
 
-	for (s = sht[h1]; s; s = s->next) {
-		for (f = s->ht[h2]; f; f = f->next) {
+	for (s = rtnl_dereference(head->ht[h1]); s;
+	     s = rtnl_dereference(s->next)) {
+		for (f = rtnl_dereference(s->ht[h2]); f;
+		     f = rtnl_dereference(f->next)) {
 			if (f->handle == handle)
 				return (unsigned long)f;
 		}
@@ -246,7 +254,7 @@ static int rsvp_init(struct tcf_proto *tp)
 
 	data = kzalloc(sizeof(struct rsvp_head), GFP_KERNEL);
 	if (data) {
-		tp->root = data;
+		rcu_assign_pointer(tp->root, data);
 		return 0;
 	}
 	return -ENOBUFS;
@@ -257,53 +265,55 @@ rsvp_delete_filter(struct tcf_proto *tp, struct rsvp_filter *f)
 {
 	tcf_unbind_filter(tp, &f->res);
 	tcf_exts_destroy(tp, &f->exts);
-	kfree(f);
+	kfree_rcu(f, rcu);
 }
 
 static void rsvp_destroy(struct tcf_proto *tp)
 {
-	struct rsvp_head *data = xchg(&tp->root, NULL);
-	struct rsvp_session **sht;
+	struct rsvp_head *data = rtnl_dereference(tp->root);
 	int h1, h2;
 
 	if (data == NULL)
 		return;
 
-	sht = data->ht;
+	rcu_assign_pointer(tp->root, NULL);
 
 	for (h1 = 0; h1 < 256; h1++) {
 		struct rsvp_session *s;
 
-		while ((s = sht[h1]) != NULL) {
-			sht[h1] = s->next;
+		while ((s = rtnl_dereference(data->ht[h1])) != NULL) {
+			rcu_assign_pointer(data->ht[h1], s->next);
 
 			for (h2 = 0; h2 <= 16; h2++) {
 				struct rsvp_filter *f;
 
-				while ((f = s->ht[h2]) != NULL) {
-					s->ht[h2] = f->next;
+				while ((f = rtnl_dereference(s->ht[h2])) != NULL) {
+					rcu_assign_pointer(s->ht[h2], f->next);
 					rsvp_delete_filter(tp, f);
 				}
 			}
-			kfree(s);
+			kfree_rcu(s, rcu);
 		}
 	}
-	kfree(data);
+	rcu_assign_pointer(tp->root, NULL);
+	kfree_rcu(data, rcu);
 }
 
 static int rsvp_delete(struct tcf_proto *tp, unsigned long arg)
 {
-	struct rsvp_filter **fp, *f = (struct rsvp_filter *)arg;
+	struct rsvp_head *head = rtnl_dereference(tp->root);
+	struct rsvp_filter *nfp, *f = (struct rsvp_filter *)arg;
+	struct rsvp_filter __rcu **fp;
 	unsigned int h = f->handle;
-	struct rsvp_session **sp;
-	struct rsvp_session *s = f->sess;
+	struct rsvp_session __rcu **sp;
+	struct rsvp_session *nsp, *s = f->sess;
 	int i;
 
-	for (fp = &s->ht[(h >> 8) & 0xFF]; *fp; fp = &(*fp)->next) {
-		if (*fp == f) {
-			tcf_tree_lock(tp);
+	fp = &s->ht[(h >> 8) & 0xFF];
+	for (nfp = rtnl_dereference(*fp); nfp;
+	     fp = &nfp->next, nfp = rtnl_dereference(*fp)) {
+		if (nfp == f) {
 			*fp = f->next;
-			tcf_tree_unlock(tp);
 			rsvp_delete_filter(tp, f);
 
 			/* Strip tree */
@@ -313,14 +323,12 @@ static int rsvp_delete(struct tcf_proto *tp, unsigned long arg)
 					return 0;
 
 			/* OK, session has no flows */
-			for (sp = &((struct rsvp_head *)tp->root)->ht[h & 0xFF];
-			     *sp; sp = &(*sp)->next) {
-				if (*sp == s) {
-					tcf_tree_lock(tp);
+			sp = &head->ht[h & 0xFF];
+			for (nsp = rtnl_dereference(*sp); nsp;
+			     sp = &nsp->next, nsp = rtnl_dereference(*sp)) {
+				if (nsp == s) {
 					*sp = s->next;
-					tcf_tree_unlock(tp);
-
-					kfree(s);
+					kfree_rcu(s, rcu);
 					return 0;
 				}
 			}
@@ -333,7 +341,7 @@ static int rsvp_delete(struct tcf_proto *tp, unsigned long arg)
 
 static unsigned int gen_handle(struct tcf_proto *tp, unsigned salt)
 {
-	struct rsvp_head *data = tp->root;
+	struct rsvp_head *data = rtnl_dereference(tp->root);
 	int i = 0xFFFF;
 
 	while (i-- > 0) {
@@ -361,7 +369,7 @@ static int tunnel_bts(struct rsvp_head *data)
 
 static void tunnel_recycle(struct rsvp_head *data)
 {
-	struct rsvp_session **sht = data->ht;
+	struct rsvp_session __rcu **sht = data->ht;
 	u32 tmap[256/32];
 	int h1, h2;
 
@@ -369,11 +377,13 @@ static void tunnel_recycle(struct rsvp_head *data)
 
 	for (h1 = 0; h1 < 256; h1++) {
 		struct rsvp_session *s;
-		for (s = sht[h1]; s; s = s->next) {
+		for (s = rtnl_dereference(sht[h1]); s;
+		     s = rtnl_dereference(s->next)) {
 			for (h2 = 0; h2 <= 16; h2++) {
 				struct rsvp_filter *f;
 
-				for (f = s->ht[h2]; f; f = f->next) {
+				for (f = rtnl_dereference(s->ht[h2]); f;
+				     f = rtnl_dereference(f->next)) {
 					if (f->tunnelhdr == 0)
 						continue;
 					data->tgenerator = f->res.classid;
@@ -417,9 +427,11 @@ static int rsvp_change(struct net *net, struct sk_buff *in_skb,
 		       struct nlattr **tca,
 		       unsigned long *arg)
 {
-	struct rsvp_head *data = tp->root;
-	struct rsvp_filter *f, **fp;
-	struct rsvp_session *s, **sp;
+	struct rsvp_head *data = rtnl_dereference(tp->root);
+	struct rsvp_filter *f, *nfp;
+	struct rsvp_filter __rcu **fp;
+	struct rsvp_session *nsp, *s;
+	struct rsvp_session __rcu **sp;
 	struct tc_rsvp_pinfo *pinfo = NULL;
 	struct nlattr *opt = tca[TCA_OPTIONS];
 	struct nlattr *tb[TCA_RSVP_MAX + 1];
@@ -499,7 +511,9 @@ static int rsvp_change(struct net *net, struct sk_buff *in_skb,
 			goto errout;
 	}
 
-	for (sp = &data->ht[h1]; (s = *sp) != NULL; sp = &s->next) {
+	for (sp = &data->ht[h1];
+	     (s = rtnl_dereference(*sp)) != NULL;
+	     sp = &s->next) {
 		if (dst[RSVP_DST_LEN-1] == s->dst[RSVP_DST_LEN-1] &&
 		    pinfo && pinfo->protocol == s->protocol &&
 		    memcmp(&pinfo->dpi, &s->dpi, sizeof(s->dpi)) == 0 &&
@@ -521,12 +535,14 @@ insert:
 
 			tcf_exts_change(tp, &f->exts, &e);
 
-			for (fp = &s->ht[h2]; *fp; fp = &(*fp)->next)
-				if (((*fp)->spi.mask & f->spi.mask) != f->spi.mask)
+			fp = &s->ht[h2];
+			for (nfp = rtnl_dereference(*fp); nfp;
+			     fp = &nfp->next, nfp = rtnl_dereference(*fp))
+				if ((nfp->spi.mask & f->spi.mask) != f->spi.mask)
 					break;
-			f->next = *fp;
+			rcu_assign_pointer(f->next, nfp);
 			wmb();
-			*fp = f;
+			rcu_assign_pointer(*fp, f);
 
 			*arg = (unsigned long)f;
 			return 0;
@@ -546,13 +562,15 @@ insert:
 		s->protocol = pinfo->protocol;
 		s->tunnelid = pinfo->tunnelid;
 	}
-	for (sp = &data->ht[h1]; *sp; sp = &(*sp)->next) {
-		if (((*sp)->dpi.mask&s->dpi.mask) != s->dpi.mask)
+	sp = &data->ht[h1];
+	for (nsp = rtnl_dereference(*sp); nsp;
+	     sp = &nsp->next, nsp = rtnl_dereference(*sp)) {
+		if ((nsp->dpi.mask & s->dpi.mask) != s->dpi.mask)
 			break;
 	}
-	s->next = *sp;
+	rcu_assign_pointer(s->next, nsp);
 	wmb();
-	*sp = s;
+	rcu_assign_pointer(*sp, s);
 
 	goto insert;
 
@@ -565,7 +583,7 @@ errout2:
 
 static void rsvp_walk(struct tcf_proto *tp, struct tcf_walker *arg)
 {
-	struct rsvp_head *head = tp->root;
+	struct rsvp_head *head = rtnl_dereference(tp->root);
 	unsigned int h, h1;
 
 	if (arg->stop)
@@ -574,11 +592,13 @@ static void rsvp_walk(struct tcf_proto *tp, struct tcf_walker *arg)
 	for (h = 0; h < 256; h++) {
 		struct rsvp_session *s;
 
-		for (s = head->ht[h]; s; s = s->next) {
+		for (s = rtnl_dereference(head->ht[h]); s;
+		     s = rtnl_dereference(s->next)) {
 			for (h1 = 0; h1 <= 16; h1++) {
 				struct rsvp_filter *f;
 
-				for (f = s->ht[h1]; f; f = f->next) {
+				for (f = rtnl_dereference(s->ht[h1]); f;
+				     f = rtnl_dereference(f->next)) {
 					if (arg->count < arg->skip) {
 						arg->count++;
 						continue;

^ permalink raw reply related

* [RFC PATCH 09/12] net: sched: make cls_u32 lockless
From: John Fastabend @ 2014-01-10  9:42 UTC (permalink / raw)
  To: xiyou.wangcong, jhs, eric.dumazet; +Cc: netdev, davem
In-Reply-To: <20140110092041.7193.5952.stgit@nitbit.x32>

Make cls_u32 classifier safe to run without holding lock. This patch
converts statistics that are kept in read section u32_classify into
per cpu counters.

This patch needs careful review to be sure the internal data
structures are being handled. With regards to testing simply creating
a u32 filter is not enough the actual structure need to be created
and destroyed under stress for a valid test case.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 net/sched/cls_u32.c |  258 +++++++++++++++++++++++++++++++++------------------
 1 file changed, 169 insertions(+), 89 deletions(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 20f2fb7..1d68d08 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -36,6 +36,7 @@
 #include <linux/kernel.h>
 #include <linux/string.h>
 #include <linux/errno.h>
+#include <linux/percpu.h>
 #include <linux/rtnetlink.h>
 #include <linux/skbuff.h>
 #include <net/netlink.h>
@@ -43,40 +44,46 @@
 #include <net/pkt_cls.h>
 
 struct tc_u_knode {
-	struct tc_u_knode	*next;
+	struct tc_u_knode __rcu	*next;
 	u32			handle;
-	struct tc_u_hnode	*ht_up;
+	struct tc_u_hnode __rcu	*ht_up;
 	struct tcf_exts		exts;
 #ifdef CONFIG_NET_CLS_IND
 	char                     indev[IFNAMSIZ];
 #endif
 	u8			fshift;
 	struct tcf_result	res;
-	struct tc_u_hnode	*ht_down;
+	struct tc_u_hnode __rcu	*ht_down;
 #ifdef CONFIG_CLS_U32_PERF
-	struct tc_u32_pcnt	*pf;
+	struct tc_u32_pcnt __percpu *pf;
 #endif
 #ifdef CONFIG_CLS_U32_MARK
-	struct tc_u32_mark	mark;
+	u32			val;
+	u32			mask;
+	u32 __percpu		*pcpu_success;
 #endif
+	struct tcf_proto	*tp;
+	struct rcu_head		rcu;
 	struct tc_u32_sel	sel;
 };
 
 struct tc_u_hnode {
-	struct tc_u_hnode	*next;
+	struct tc_u_hnode __rcu	*next;
 	u32			handle;
 	u32			prio;
 	struct tc_u_common	*tp_c;
 	int			refcnt;
 	unsigned int		divisor;
-	struct tc_u_knode	*ht[1];
+	struct tc_u_knode __rcu	*ht[1];
+	struct rcu_head		rcu;
 };
 
 struct tc_u_common {
-	struct tc_u_hnode	*hlist;
+	struct tc_u_hnode __rcu	*hlist;
 	struct Qdisc		*q;
 	int			refcnt;
 	u32			hgenerator;
+	struct rcu_head		rcu;
 };
 
 static inline unsigned int u32_hash_fold(__be32 key,
@@ -95,7 +102,7 @@ static int u32_classify(struct sk_buff *skb, const struct tcf_proto *tp, struct
 		unsigned int	  off;
 	} stack[TC_U32_MAXDEPTH];
 
-	struct tc_u_hnode *ht = (struct tc_u_hnode *)tp->root;
+	struct tc_u_hnode *ht = (struct tc_u_hnode *) rcu_dereference_bh(tp->root);
 	unsigned int off = skb_network_offset(skb);
 	struct tc_u_knode *n;
 	int sdepth = 0;
@@ -103,27 +110,31 @@ static int u32_classify(struct sk_buff *skb, const struct tcf_proto *tp, struct
 	int sel = 0;
 #ifdef CONFIG_CLS_U32_PERF
 	int j;
+	struct tc_u32_pcnt *pf;
 #endif
 	int i, r;
 
 next_ht:
-	n = ht->ht[sel];
+	n = rcu_dereference_bh(ht->ht[sel]);
 
 next_knode:
 	if (n) {
 		struct tc_u32_key *key = n->sel.keys;
 
 #ifdef CONFIG_CLS_U32_PERF
-		n->pf->rcnt += 1;
+		pf = this_cpu_ptr(n->pf);
+		pf->rcnt += 1;
 		j = 0;
 #endif
 
 #ifdef CONFIG_CLS_U32_MARK
-		if ((skb->mark & n->mark.mask) != n->mark.val) {
-			n = n->next;
+		if ((skb->mark & n->mask) != n->val) {
+			n = rcu_dereference_bh(n->next);
 			goto next_knode;
 		} else {
-			n->mark.success++;
+			u32 *success = this_cpu_ptr(n->pcpu_success);
+
+			*success = *success + 1;
 		}
 #endif
 
@@ -138,37 +149,41 @@ next_knode:
 			if (!data)
 				goto out;
 			if ((*data ^ key->val) & key->mask) {
-				n = n->next;
+				n = rcu_dereference_bh(n->next);
 				goto next_knode;
 			}
 #ifdef CONFIG_CLS_U32_PERF
-			n->pf->kcnts[j] += 1;
+			pf = this_cpu_ptr(n->pf);
+			pf->kcnts[j] += 1;
 			j++;
 #endif
 		}
-		if (n->ht_down == NULL) {
+
+		ht = rcu_dereference_bh(n->ht_down);
+		if (!ht) {
 check_terminal:
 			if (n->sel.flags & TC_U32_TERMINAL) {
 
 				*res = n->res;
 #ifdef CONFIG_NET_CLS_IND
 				if (!tcf_match_indev(skb, n->indev)) {
-					n = n->next;
+					n = rcu_dereference_bh(n->next);
 					goto next_knode;
 				}
 #endif
 #ifdef CONFIG_CLS_U32_PERF
-				n->pf->rhit += 1;
+				pf = this_cpu_ptr(n->pf);
+				pf->rhit += 1;
 #endif
 				r = tcf_exts_exec(skb, &n->exts, res);
 				if (r < 0) {
-					n = n->next;
+					n = rcu_dereference_bh(n->next);
 					goto next_knode;
 				}
 
 				return r;
 			}
-			n = n->next;
+			n = rcu_dereference_bh(n->next);
 			goto next_knode;
 		}
 
@@ -179,7 +194,7 @@ check_terminal:
 		stack[sdepth].off = off;
 		sdepth++;
 
-		ht = n->ht_down;
+		ht = rcu_dereference_bh(n->ht_down);
 		sel = 0;
 		if (ht->divisor) {
 			__be32 *data, hdata;
@@ -221,7 +236,7 @@ check_terminal:
 	/* POP */
 	if (sdepth--) {
 		n = stack[sdepth].knode;
-		ht = n->ht_up;
+		ht = rcu_dereference_bh(n->ht_up);
 		off = stack[sdepth].off;
 		goto check_terminal;
 	}
@@ -238,7 +253,9 @@ u32_lookup_ht(struct tc_u_common *tp_c, u32 handle)
 {
 	struct tc_u_hnode *ht;
 
-	for (ht = tp_c->hlist; ht; ht = ht->next)
+	for (ht = rtnl_dereference(tp_c->hlist);
+	     ht;
+	     ht = rtnl_dereference(ht->next))
 		if (ht->handle == handle)
 			break;
 
@@ -255,7 +272,9 @@ u32_lookup_key(struct tc_u_hnode *ht, u32 handle)
 	if (sel > ht->divisor)
 		goto out;
 
-	for (n = ht->ht[sel]; n; n = n->next)
+	for (n = rtnl_dereference(ht->ht[sel]);
+	     n;
+	     n = rtnl_dereference(n->next))
 		if (n->handle == handle)
 			break;
 out:
@@ -269,7 +288,7 @@ static unsigned long u32_get(struct tcf_proto *tp, u32 handle)
 	struct tc_u_common *tp_c = tp->data;
 
 	if (TC_U32_HTID(handle) == TC_U32_ROOT)
-		ht = tp->root;
+		ht = rtnl_dereference(tp->root);
 	else
 		ht = u32_lookup_ht(tp_c, TC_U32_HTID(handle));
 
@@ -290,6 +309,9 @@ static u32 gen_new_htid(struct tc_u_common *tp_c)
 {
 	int i = 0x800;
 
+	/* hgenerator only used inside rtnl lock it is safe to increment
+	 * without read _copy_ update semantics
+	 */
 	do {
 		if (++tp_c->hgenerator == 0x7FF)
 			tp_c->hgenerator = 1;
@@ -325,11 +347,11 @@ static int u32_init(struct tcf_proto *tp)
 	}
 
 	tp_c->refcnt++;
-	root_ht->next = tp_c->hlist;
-	tp_c->hlist = root_ht;
+	rcu_assign_pointer(root_ht->next, tp_c->hlist);
+	rcu_assign_pointer(tp_c->hlist, root_ht);
 	root_ht->tp_c = tp_c;
 
-	tp->root = root_ht;
+	rcu_assign_pointer(tp->root, root_ht);
 	tp->data = tp_c;
 	return 0;
 }
@@ -341,25 +363,33 @@ static int u32_destroy_key(struct tcf_proto *tp, struct tc_u_knode *n)
 	if (n->ht_down)
 		n->ht_down->refcnt--;
 #ifdef CONFIG_CLS_U32_PERF
-	kfree(n->pf);
+	free_percpu(n->pf);
 #endif
 	kfree(n);
 	return 0;
 }
 
+void __u32_delete_key(struct rcu_head *rcu)
+{
+	struct tc_u_knode *key = container_of(rcu, struct tc_u_knode, rcu);
+
+	u32_destroy_key(key->tp, key);
+}
+
 static int u32_delete_key(struct tcf_proto *tp, struct tc_u_knode *key)
 {
-	struct tc_u_knode **kp;
+	struct tc_u_knode __rcu **kp;
+	struct tc_u_knode *pkp;
 	struct tc_u_hnode *ht = key->ht_up;
 
 	if (ht) {
-		for (kp = &ht->ht[TC_U32_HASH(key->handle)]; *kp; kp = &(*kp)->next) {
-			if (*kp == key) {
-				tcf_tree_lock(tp);
-				*kp = key->next;
-				tcf_tree_unlock(tp);
+		kp = &ht->ht[TC_U32_HASH(key->handle)];
+		for (pkp = rtnl_dereference(*kp); pkp;
+		     kp = &pkp->next, pkp = rtnl_dereference(*kp)) {
+			if (pkp == key) {
+				rcu_assign_pointer(*kp, key->next);
 
-				u32_destroy_key(tp, key);
+				call_rcu(&key->rcu, __u32_delete_key);
 				return 0;
 			}
 		}
@@ -368,16 +398,15 @@ static int u32_delete_key(struct tcf_proto *tp, struct tc_u_knode *key)
 	return 0;
 }
 
-static void u32_clear_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht)
+static void u32_clear_hnode(struct tc_u_hnode *ht)
 {
 	struct tc_u_knode *n;
 	unsigned int h;
 
 	for (h = 0; h <= ht->divisor; h++) {
-		while ((n = ht->ht[h]) != NULL) {
-			ht->ht[h] = n->next;
-
-			u32_destroy_key(tp, n);
+		while ((n = rtnl_dereference(ht->ht[h])) != NULL) {
+			rcu_assign_pointer(ht->ht[h], rtnl_dereference(n->next));
+			call_rcu(&n->rcu, __u32_delete_key);
 		}
 	}
 }
@@ -385,28 +414,31 @@ static void u32_clear_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht)
 static int u32_destroy_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht)
 {
 	struct tc_u_common *tp_c = tp->data;
-	struct tc_u_hnode **hn;
+	struct tc_u_hnode __rcu **hn;
+	struct tc_u_hnode *phn;
 
 	WARN_ON(ht->refcnt);
 
-	u32_clear_hnode(tp, ht);
+	u32_clear_hnode(ht);
 
-	for (hn = &tp_c->hlist; *hn; hn = &(*hn)->next) {
-		if (*hn == ht) {
-			*hn = ht->next;
-			kfree(ht);
+	hn = &tp_c->hlist;
+	for (phn = rtnl_dereference(*hn);
+	     phn;
+	     hn = &phn->next, phn = rtnl_dereference(*hn)) {
+		if (phn == ht) {
+			rcu_assign_pointer(*hn, ht->next);
+			kfree_rcu(ht, rcu);
 			return 0;
 		}
 	}
 
-	WARN_ON(1);
 	return -ENOENT;
 }
 
 static void u32_destroy(struct tcf_proto *tp)
 {
 	struct tc_u_common *tp_c = tp->data;
-	struct tc_u_hnode *root_ht = tp->root;
+	struct tc_u_hnode *root_ht = rtnl_dereference(tp->root);
 
 	WARN_ON(root_ht == NULL);
 
@@ -418,17 +450,16 @@ static void u32_destroy(struct tcf_proto *tp)
 
 		tp->q->u32_node = NULL;
 
-		for (ht = tp_c->hlist; ht; ht = ht->next) {
+		for (ht = rtnl_dereference(tp_c->hlist);
+		     ht;
+		     ht = rtnl_dereference(ht->next)) {
 			ht->refcnt--;
-			u32_clear_hnode(tp, ht);
+			u32_clear_hnode(ht);
 		}
 
-		while ((ht = tp_c->hlist) != NULL) {
-			tp_c->hlist = ht->next;
-
-			WARN_ON(ht->refcnt != 0);
-
-			kfree(ht);
+		while ((ht = rtnl_dereference(tp_c->hlist)) != NULL) {
+			rcu_assign_pointer(tp_c->hlist, ht->next);
+			kfree_rcu(ht, rcu);
 		}
 
 		kfree(tp_c);
@@ -440,6 +471,7 @@ static void u32_destroy(struct tcf_proto *tp)
 static int u32_delete(struct tcf_proto *tp, unsigned long arg)
 {
 	struct tc_u_hnode *ht = (struct tc_u_hnode *)arg;
+	struct tc_u_hnode *root_ht = rtnl_dereference(tp->root);
 
 	if (ht == NULL)
 		return 0;
@@ -447,7 +479,7 @@ static int u32_delete(struct tcf_proto *tp, unsigned long arg)
 	if (TC_U32_KEY(ht->handle))
 		return u32_delete_key(tp, (struct tc_u_knode *)ht);
 
-	if (tp->root == ht)
+	if (root_ht == ht)
 		return -EINVAL;
 
 	if (ht->refcnt == 1) {
@@ -465,7 +497,9 @@ static u32 gen_new_kid(struct tc_u_hnode *ht, u32 handle)
 	struct tc_u_knode *n;
 	unsigned int i = 0x7FF;
 
-	for (n = ht->ht[TC_U32_HASH(handle)]; n; n = n->next)
+	for (n = rtnl_dereference(ht->ht[TC_U32_HASH(handle)]);
+	     n;
+	     n = rtnl_dereference(n->next))
 		if (i < TC_U32_NODE(n->handle))
 			i = TC_U32_NODE(n->handle);
 	i++;
@@ -512,10 +546,8 @@ static int u32_set_parms(struct net *net, struct tcf_proto *tp,
 			ht_down->refcnt++;
 		}
 
-		tcf_tree_lock(tp);
-		ht_old = n->ht_down;
-		n->ht_down = ht_down;
-		tcf_tree_unlock(tp);
+		ht_old = rtnl_dereference(n->ht_down);
+		rcu_assign_pointer(n->ht_down, ht_down);
 
 		if (ht_old)
 			ht_old->refcnt--;
@@ -527,6 +559,7 @@ static int u32_set_parms(struct net *net, struct tcf_proto *tp,
 
 #ifdef CONFIG_NET_CLS_IND
 	if (tb[TCA_U32_INDEV]) {
+		/* TODO make this a pointer update */
 		err = tcf_change_indev(tp, n->indev, tb[TCA_U32_INDEV]);
 		if (err < 0)
 			goto errout;
@@ -590,8 +623,8 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
 		ht->divisor = divisor;
 		ht->handle = handle;
 		ht->prio = tp->prio;
-		ht->next = tp_c->hlist;
-		tp_c->hlist = ht;
+		rcu_assign_pointer(ht->next, tp_c->hlist);
+		rcu_assign_pointer(tp_c->hlist, ht);
 		*arg = (unsigned long)ht;
 		return 0;
 	}
@@ -599,7 +632,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
 	if (tb[TCA_U32_HASH]) {
 		htid = nla_get_u32(tb[TCA_U32_HASH]);
 		if (TC_U32_HTID(htid) == TC_U32_ROOT) {
-			ht = tp->root;
+			ht = rtnl_dereference(tp->root);
 			htid = ht->handle;
 		} else {
 			ht = u32_lookup_ht(tp->data, TC_U32_HTID(htid));
@@ -607,7 +640,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
 				return -EINVAL;
 		}
 	} else {
-		ht = tp->root;
+		ht = rtnl_dereference(tp->root);
 		htid = ht->handle;
 	}
 
@@ -631,8 +664,9 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
 		return -ENOBUFS;
 
 #ifdef CONFIG_CLS_U32_PERF
-	n->pf = kzalloc(sizeof(struct tc_u32_pcnt) + s->nkeys*sizeof(u64), GFP_KERNEL);
-	if (n->pf == NULL) {
+	n->pf = __alloc_percpu(sizeof(struct tc_u32_pcnt) + s->nkeys * sizeof(u64),
+			       __alignof__(struct tc_u32_pcnt));
+	if (!n->pf) {
 		kfree(n);
 		return -ENOBUFS;
 	}
@@ -643,34 +677,39 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
 	n->handle = handle;
 	n->fshift = s->hmask ? ffs(ntohl(s->hmask)) - 1 : 0;
 	tcf_exts_init(&n->exts, TCA_U32_ACT, TCA_U32_POLICE);
+	n->tp = tp;
 
 #ifdef CONFIG_CLS_U32_MARK
+	n->pcpu_success = alloc_percpu(u32);
+
 	if (tb[TCA_U32_MARK]) {
 		struct tc_u32_mark *mark;
 
 		mark = nla_data(tb[TCA_U32_MARK]);
-		memcpy(&n->mark, mark, sizeof(struct tc_u32_mark));
-		n->mark.success = 0;
+		n->val = mark->val;
+		n->mask = mark->mask;
 	}
 #endif
 
 	err = u32_set_parms(net, tp, base, ht, n, tb, tca[TCA_RATE]);
 	if (err == 0) {
-		struct tc_u_knode **ins;
-		for (ins = &ht->ht[TC_U32_HASH(handle)]; *ins; ins = &(*ins)->next)
-			if (TC_U32_NODE(handle) < TC_U32_NODE((*ins)->handle))
+		struct tc_u_knode __rcu **ins;
+		struct tc_u_knode *pins;
+
+		ins = &ht->ht[TC_U32_HASH(handle)];
+		for (pins = rtnl_dereference(*ins); pins;
+		     ins = &pins->next, pins = rtnl_dereference(*ins))
+			if (TC_U32_NODE(handle) < TC_U32_NODE(pins->handle))
 				break;
 
-		n->next = *ins;
-		tcf_tree_lock(tp);
-		*ins = n;
-		tcf_tree_unlock(tp);
+		rcu_assign_pointer(n->next, pins);
+		rcu_assign_pointer(*ins, n);
 
 		*arg = (unsigned long)n;
 		return 0;
 	}
 #ifdef CONFIG_CLS_U32_PERF
-	kfree(n->pf);
+	free_percpu(n->pf);
 #endif
 	kfree(n);
 	return err;
@@ -686,7 +725,9 @@ static void u32_walk(struct tcf_proto *tp, struct tcf_walker *arg)
 	if (arg->stop)
 		return;
 
-	for (ht = tp_c->hlist; ht; ht = ht->next) {
+	for (ht = rtnl_dereference(tp_c->hlist);
+	     ht;
+	     ht = rtnl_dereference(ht->next)) {
 		if (ht->prio != tp->prio)
 			continue;
 		if (arg->count >= arg->skip) {
@@ -697,7 +738,9 @@ static void u32_walk(struct tcf_proto *tp, struct tcf_walker *arg)
 		}
 		arg->count++;
 		for (h = 0; h <= ht->divisor; h++) {
-			for (n = ht->ht[h]; n; n = n->next) {
+			for (n = rtnl_dereference(ht->ht[h]);
+			     n;
+			     n = rtnl_dereference(n->next)) {
 				if (arg->count < arg->skip) {
 					arg->count++;
 					continue;
@@ -716,6 +759,7 @@ static int u32_dump(struct tcf_proto *tp, unsigned long fh,
 		     struct sk_buff *skb, struct tcmsg *t)
 {
 	struct tc_u_knode *n = (struct tc_u_knode *)fh;
+	struct tc_u_hnode *ht_up, *ht_down;
 	struct nlattr *nest;
 
 	if (n == NULL)
@@ -734,11 +778,18 @@ static int u32_dump(struct tcf_proto *tp, unsigned long fh,
 		if (nla_put_u32(skb, TCA_U32_DIVISOR, divisor))
 			goto nla_put_failure;
 	} else {
+#ifdef CONFIG_CLS_U32_PERF
+		struct tc_u32_pcnt *gpf;
+#endif
+		int cpu;
+
 		if (nla_put(skb, TCA_U32_SEL,
 			    sizeof(n->sel) + n->sel.nkeys*sizeof(struct tc_u32_key),
 			    &n->sel))
 			goto nla_put_failure;
-		if (n->ht_up) {
+
+		ht_up = rtnl_dereference(n->ht_up);
+		if (ht_up) {
 			u32 htid = n->handle & 0xFFFFF000;
 			if (nla_put_u32(skb, TCA_U32_HASH, htid))
 				goto nla_put_failure;
@@ -746,14 +797,24 @@ static int u32_dump(struct tcf_proto *tp, unsigned long fh,
 		if (n->res.classid &&
 		    nla_put_u32(skb, TCA_U32_CLASSID, n->res.classid))
 			goto nla_put_failure;
-		if (n->ht_down &&
-		    nla_put_u32(skb, TCA_U32_LINK, n->ht_down->handle))
+
+		ht_down = rtnl_dereference(n->ht_down);
+		if (ht_down &&
+		    nla_put_u32(skb, TCA_U32_LINK, ht_down->handle))
 			goto nla_put_failure;
 
 #ifdef CONFIG_CLS_U32_MARK
-		if ((n->mark.val || n->mark.mask) &&
-		    nla_put(skb, TCA_U32_MARK, sizeof(n->mark), &n->mark))
-			goto nla_put_failure;
+		if ((n->val || n->mask)) {
+			struct tc_u32_mark mark = {.val = n->val,
+						   .mask = n->mask,
+						   .success = 0};
+
+			for_each_possible_cpu(cpu)
+				mark.success += *per_cpu_ptr(n->pcpu_success, cpu);
+
+			if (nla_put(skb, TCA_U32_MARK, sizeof(mark), &mark))
+				goto nla_put_failure;
+		}
 #endif
 
 		if (tcf_exts_dump(skb, &n->exts) < 0)
@@ -765,10 +826,29 @@ static int u32_dump(struct tcf_proto *tp, unsigned long fh,
 			goto nla_put_failure;
 #endif
 #ifdef CONFIG_CLS_U32_PERF
+		gpf = kzalloc(sizeof(struct tc_u32_pcnt) +
+					n->sel.nkeys * sizeof(u64),
+			      GFP_KERNEL);
+		if (!gpf)
+			goto nla_put_failure;
+
+		for_each_possible_cpu(cpu) {
+			int i;
+			struct tc_u32_pcnt *pf = per_cpu_ptr(n->pf, cpu);
+
+			gpf->rcnt += pf->rcnt;
+			gpf->rhit += pf->rhit;
+			for (i = 0; i < n->sel.nkeys; i++)
+				gpf->kcnts[i] += pf->kcnts[i];
+		}
+
 		if (nla_put(skb, TCA_U32_PCNT,
 			    sizeof(struct tc_u32_pcnt) + n->sel.nkeys*sizeof(u64),
-			    n->pf))
+			    gpf)) {
+			kfree(gpf);
 			goto nla_put_failure;
+		}
+		kfree(gpf);
 #endif
 	}
 

^ permalink raw reply related

* [RFC PATCH 08/12] net: sched: RCU cls_tcindex
From: John Fastabend @ 2014-01-10  9:42 UTC (permalink / raw)
  To: xiyou.wangcong, jhs, eric.dumazet; +Cc: netdev, davem
In-Reply-To: <20140110092041.7193.5952.stgit@nitbit.x32>

Make cls_tcindex RCU safe.

This patch addds a new RCU routine rcu_dereference_bh_rtnl() to check
caller either holds the rcu read lock or RTNL. This is needed to
handle the case where tcindex_lookup() is being called in both cases.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/linux/rtnetlink.h |   10 +
 net/sched/cls_tcindex.c   |  327 ++++++++++++++++++++++++++-------------------
 2 files changed, 202 insertions(+), 135 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 939428a..6008b22 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -37,6 +37,16 @@ extern int lockdep_rtnl_is_held(void);
 	rcu_dereference_check(p, lockdep_rtnl_is_held())
 
 /**
+ * rcu_dereference_bh_rtnl - rcu_dereference_bh with debug checking
+ * @p: The pointer to read, prior to dereference
+ *
+ * Do an rcu_dereference_bh(p), but check caller either holds rcu_read_lock_bh()
+ * or RTNL. Note : Please prefer rtnl_dereference() or rcu_dereference_bh()
+ */
+#define rcu_dereference_bh_rtnl(p)				\
+	rcu_dereference_bh_check(p, lockdep_rtnl_is_held())
+
+/**
  * rtnl_dereference - fetch RCU pointer when updates are prevented by RTNL
  * @p: The pointer to read, prior to dereferencing
  *
diff --git a/net/sched/cls_tcindex.c b/net/sched/cls_tcindex.c
index ffad187..19eb24e 100644
--- a/net/sched/cls_tcindex.c
+++ b/net/sched/cls_tcindex.c
@@ -24,7 +24,7 @@
 #define DEFAULT_HASH_SIZE	64	/* optimized for diffserv */
 
 
-#define	PRIV(tp)	((struct tcindex_data *) (tp)->root)
+#define	PRIV(tp)	((struct tcindex_data *) rtnl_dereference((tp)->root))
 
 
 struct tcindex_filter_result {
@@ -35,19 +35,22 @@ struct tcindex_filter_result {
 struct tcindex_filter {
 	u16 key;
 	struct tcindex_filter_result result;
-	struct tcindex_filter *next;
+	struct tcindex_filter __rcu *next;
+	struct rcu_head rcu;
 };
 
 
 struct tcindex_data {
 	struct tcindex_filter_result *perfect; /* perfect hash; NULL if none */
-	struct tcindex_filter **h; /* imperfect hash; only used if !perfect;
-				      NULL if unused */
+	struct tcindex_filter __rcu **h; /* imperfect hash; only used if
+					    !perfect NULL if unused */
+	struct rcu_head rcu;
+	struct tcf_proto *tp;
 	u16 mask;		/* AND key with mask */
-	int shift;		/* shift ANDed key to the right */
-	int hash;		/* hash table size; 0 if undefined */
-	int alloc_hash;		/* allocated size */
-	int fall_through;	/* 0: only classify if explicit match */
+	u32 shift;		/* shift ANDed key to the right */
+	u32 hash;		/* hash table size; 0 if undefined */
+	u32 alloc_hash;		/* allocated size */
+	u32 fall_through;	/* 0: only classify if explicit match */
 };
 
 static inline int
@@ -59,13 +62,18 @@ tcindex_filter_is_set(struct tcindex_filter_result *r)
 static struct tcindex_filter_result *
 tcindex_lookup(struct tcindex_data *p, u16 key)
 {
-	struct tcindex_filter *f;
+	if (p->perfect) {
+		struct tcindex_filter_result *f = p->perfect + key;
+
+		return tcindex_filter_is_set(f) ? f : NULL;
+	} else if (p->h) {
+		struct tcindex_filter __rcu **fp;
+		struct tcindex_filter *f;
 
-	if (p->perfect)
-		return tcindex_filter_is_set(p->perfect + key) ?
-			p->perfect + key : NULL;
-	else if (p->h) {
-		for (f = p->h[key % p->hash]; f; f = f->next)
+		fp = &p->h[key % p->hash];
+		for (f = rcu_dereference_bh_rtnl(*fp);
+		     f;
+		     fp = &f->next, f = rcu_dereference_bh_rtnl(*fp))
 			if (f->key == key)
 				return &f->result;
 	}
@@ -77,7 +85,7 @@ tcindex_lookup(struct tcindex_data *p, u16 key)
 static int tcindex_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 			    struct tcf_result *res)
 {
-	struct tcindex_data *p = PRIV(tp);
+	struct tcindex_data *p = rcu_dereference(tp->root);
 	struct tcindex_filter_result *f;
 	int key = (skb->tc_index & p->mask) >> p->shift;
 
@@ -132,49 +140,113 @@ static int tcindex_init(struct tcf_proto *tp)
 	p->hash = DEFAULT_HASH_SIZE;
 	p->fall_through = 1;
 
-	tp->root = p;
+	rcu_assign_pointer(tp->root, p);
 	return 0;
 }
 
+static void tcindex_walk(struct tcf_proto *tp, struct tcf_walker *walker)
+{
+	struct tcindex_data *p = PRIV(tp);
+	struct tcindex_filter *f, *next;
+	int i;
+
+	pr_debug("tcindex_walk(tp %p,walker %p),p %p\n", tp, walker, p);
+	if (p->perfect) {
+		for (i = 0; i < p->hash; i++) {
+			if (!p->perfect[i].res.class)
+				continue;
+			if (walker->count >= walker->skip) {
+				if (walker->fn(tp,
+				    (unsigned long) (p->perfect+i), walker)
+				     < 0) {
+					walker->stop = 1;
+					return;
+				}
+			}
+			walker->count++;
+		}
+	}
+	if (!p->h)
+		return;
+	for (i = 0; i < p->hash; i++) {
+		for (f = rtnl_dereference(p->h[i]); f; f = next) {
+			next = rtnl_dereference(f->next);
+			if (walker->count >= walker->skip) {
+				if (walker->fn(tp, (unsigned long) &f->result,
+				    walker) < 0) {
+					walker->stop = 1;
+					return;
+				}
+			}
+			walker->count++;
+		}
+	}
+}
 
 static int
-__tcindex_delete(struct tcf_proto *tp, unsigned long arg, int lock)
+tcindex_delete(struct tcf_proto *tp, unsigned long arg)
 {
 	struct tcindex_data *p = PRIV(tp);
 	struct tcindex_filter_result *r = (struct tcindex_filter_result *) arg;
+	struct tcindex_filter __rcu **walk;
 	struct tcindex_filter *f = NULL;
 
-	pr_debug("tcindex_delete(tp %p,arg 0x%lx),p %p,f %p\n", tp, arg, p, f);
+	pr_debug("tcindex_delete(tp %p,arg 0x%lx),p %p\n", tp, arg, p);
 	if (p->perfect) {
 		if (!r->res.class)
 			return -ENOENT;
 	} else {
 		int i;
-		struct tcindex_filter **walk = NULL;
 
-		for (i = 0; i < p->hash; i++)
-			for (walk = p->h+i; *walk; walk = &(*walk)->next)
-				if (&(*walk)->result == r)
+		for (i = 0; i < p->hash; i++) {
+			walk = p->h + i;
+			for (f = rtnl_dereference(*walk); f;
+			     walk = &f->next, f = rtnl_dereference(*walk)) {
+				if (&f->result == r)
 					goto found;
+			}
+		}
 		return -ENOENT;
 
 found:
-		f = *walk;
-		if (lock)
-			tcf_tree_lock(tp);
-		*walk = f->next;
-		if (lock)
-			tcf_tree_unlock(tp);
+		rcu_assign_pointer(*walk, rtnl_dereference(f->next));
 	}
 	tcf_unbind_filter(tp, &r->res);
 	tcf_exts_destroy(tp, &r->exts);
-	kfree(f);
+	if (f)
+		kfree_rcu(f, rcu);
 	return 0;
 }
 
-static int tcindex_delete(struct tcf_proto *tp, unsigned long arg)
+static int tcindex_destroy_element(struct tcf_proto *tp,
+				   unsigned long arg,
+				   struct tcf_walker *walker)
+{
+	return tcindex_delete(tp, arg);
+}
+
+static void __tcindex_destroy(struct rcu_head *head)
 {
-	return __tcindex_delete(tp, arg, 1);
+	struct tcindex_data *p = container_of(head, struct tcindex_data, rcu);
+
+	kfree(p->perfect);
+	kfree(p->h);
+	kfree(p);
+}
+
+static void tcindex_destroy(struct tcf_proto *tp)
+{
+	struct tcindex_data *p = PRIV(tp);
+	struct tcf_walker walker;
+
+	pr_debug("%s(tp %p),p %p\n", __func__, tp, p);
+	walker.count = 0;
+	walker.skip = 0;
+	walker.fn = &tcindex_destroy_element;
+	tcindex_walk(tp, &walker);
+
+	rcu_assign_pointer(tp->root, NULL);
+	call_rcu(&p->rcu, __tcindex_destroy);
 }
 
 static inline int
@@ -191,6 +263,14 @@ static const struct nla_policy tcindex_policy[TCA_TCINDEX_MAX + 1] = {
 	[TCA_TCINDEX_CLASSID]		= { .type = NLA_U32 },
 };
 
+static void __tcindex_partial_destroy(struct rcu_head *head)
+{
+	struct tcindex_data *p = container_of(head, struct tcindex_data, rcu);
+
+	kfree(p->perfect);
+	kfree(p);
+}
+
 static int
 tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
 		  u32 handle, struct tcindex_data *p,
@@ -200,7 +280,7 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
 	int err, balloc = 0;
 	struct tcindex_filter_result new_filter_result, *old_r = r;
 	struct tcindex_filter_result cr;
-	struct tcindex_data cp;
+	struct tcindex_data *cp, *oldp;
 	struct tcindex_filter *f = NULL; /* make gcc behave */
 	struct tcf_exts e;
 
@@ -209,7 +289,27 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
 	if (err < 0)
 		return err;
 
-	memcpy(&cp, p, sizeof(cp));
+	/* tcindex_data attributes must look atomic to classifier/lookup so
+	 * allocate new tcindex data and RCU assign it onto root. Keeping
+	 * perfect hash and hash pointers from old data.
+	 */
+	cp = kzalloc(sizeof(cp), GFP_KERNEL);
+	if (!cp)
+		return -ENOMEM;
+
+	cp->mask = p->mask;
+	cp->shift = p->shift;
+	cp->hash = p->hash;
+	cp->alloc_hash = p->alloc_hash;
+	cp->fall_through = p->fall_through;
+	cp->tp = tp;
+
+	if (p->perfect) {
+		cp->perfect = kcalloc(cp->hash, sizeof(*r), GFP_KERNEL);
+		memcpy(cp->perfect, p->perfect, sizeof(*r) * cp->hash);
+	}
+	cp->h = p->h;
+
 	memset(&new_filter_result, 0, sizeof(new_filter_result));
 	tcf_exts_init(&new_filter_result.exts, TCA_TCINDEX_ACT, TCA_TCINDEX_POLICE);
 
@@ -221,71 +321,82 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
 	}
 
 	if (tb[TCA_TCINDEX_HASH])
-		cp.hash = nla_get_u32(tb[TCA_TCINDEX_HASH]);
+		cp->hash = nla_get_u32(tb[TCA_TCINDEX_HASH]);
 
 	if (tb[TCA_TCINDEX_MASK])
-		cp.mask = nla_get_u16(tb[TCA_TCINDEX_MASK]);
+		cp->mask = nla_get_u16(tb[TCA_TCINDEX_MASK]);
 
 	if (tb[TCA_TCINDEX_SHIFT])
-		cp.shift = nla_get_u32(tb[TCA_TCINDEX_SHIFT]);
+		cp->shift = nla_get_u32(tb[TCA_TCINDEX_SHIFT]);
 
 	err = -EBUSY;
+
 	/* Hash already allocated, make sure that we still meet the
 	 * requirements for the allocated hash.
 	 */
-	if (cp.perfect) {
-		if (!valid_perfect_hash(&cp) ||
-		    cp.hash > cp.alloc_hash)
+	if (cp->perfect) {
+		if (!valid_perfect_hash(cp) ||
+		    cp->hash > cp->alloc_hash)
 			goto errout;
-	} else if (cp.h && cp.hash != cp.alloc_hash)
+	} else if (cp->h && cp->hash != cp->alloc_hash)
 		goto errout;
 
 	err = -EINVAL;
 	if (tb[TCA_TCINDEX_FALL_THROUGH])
-		cp.fall_through = nla_get_u32(tb[TCA_TCINDEX_FALL_THROUGH]);
+		cp->fall_through = nla_get_u32(tb[TCA_TCINDEX_FALL_THROUGH]);
 
-	if (!cp.hash) {
+	if (!cp->hash) {
 		/* Hash not specified, use perfect hash if the upper limit
 		 * of the hashing index is below the threshold.
 		 */
-		if ((cp.mask >> cp.shift) < PERFECT_HASH_THRESHOLD)
-			cp.hash = (cp.mask >> cp.shift) + 1;
+		if ((cp->mask >> cp->shift) < PERFECT_HASH_THRESHOLD)
+			cp->hash = (cp->mask >> cp->shift) + 1;
 		else
-			cp.hash = DEFAULT_HASH_SIZE;
+			cp->hash = DEFAULT_HASH_SIZE;
 	}
 
-	if (!cp.perfect && !cp.h)
-		cp.alloc_hash = cp.hash;
+	if (!cp->perfect && cp->h)
+		cp->alloc_hash = cp->hash;
 
 	/* Note: this could be as restrictive as if (handle & ~(mask >> shift))
 	 * but then, we'd fail handles that may become valid after some future
 	 * mask change. While this is extremely unlikely to ever matter,
 	 * the check below is safer (and also more backwards-compatible).
 	 */
-	if (cp.perfect || valid_perfect_hash(&cp))
-		if (handle >= cp.alloc_hash)
+	if (cp->perfect || valid_perfect_hash(cp))
+		if (handle >= cp->alloc_hash)
 			goto errout;
 
 
 	err = -ENOMEM;
-	if (!cp.perfect && !cp.h) {
-		if (valid_perfect_hash(&cp)) {
-			cp.perfect = kcalloc(cp.hash, sizeof(*r), GFP_KERNEL);
-			if (!cp.perfect)
+	if (!cp->perfect && !cp->h) {
+		if (valid_perfect_hash(cp)) {
+			struct tcindex_filter_result *perfect;
+
+			perfect = kcalloc(cp->hash, sizeof(*r), GFP_KERNEL);
+			if (!perfect)
 				goto errout;
+			cp->perfect = perfect;
 			balloc = 1;
 		} else {
-			cp.h = kcalloc(cp.hash, sizeof(f), GFP_KERNEL);
-			if (!cp.h)
+			struct tcindex_filter __rcu **hash;
+
+			hash = kcalloc(cp->hash,
+				       sizeof(struct tcindex_filter *),
+				       GFP_KERNEL);
+
+			if (!hash)
 				goto errout;
+
+			cp->h = hash;
 			balloc = 2;
 		}
 	}
 
-	if (cp.perfect)
-		r = cp.perfect + handle;
+	if (cp->perfect)
+		r = cp->perfect + handle;
 	else
-		r = tcindex_lookup(&cp, handle) ? : &new_filter_result;
+		r = tcindex_lookup(cp, handle) ? : &new_filter_result;
 
 	if (r == &new_filter_result) {
 		f = kzalloc(sizeof(*f), GFP_KERNEL);
@@ -300,33 +411,41 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
 
 	tcf_exts_change(tp, &cr.exts, &e);
 
-	tcf_tree_lock(tp);
 	if (old_r && old_r != r)
 		memset(old_r, 0, sizeof(*old_r));
 
-	memcpy(p, &cp, sizeof(cp));
+	oldp = p;
 	memcpy(r, &cr, sizeof(cr));
+	rcu_assign_pointer(tp->root, cp);
 
 	if (r == &new_filter_result) {
-		struct tcindex_filter **fp;
+		struct tcindex_filter *nfp;
+		struct tcindex_filter __rcu **fp;
 
 		f->key = handle;
 		f->result = new_filter_result;
 		f->next = NULL;
-		for (fp = p->h+(handle % p->hash); *fp; fp = &(*fp)->next)
-			/* nothing */;
-		*fp = f;
+
+		fp = p->h + (handle % p->hash);
+		for (nfp = rtnl_dereference(*fp);
+		     nfp;
+		     fp = &nfp->next, nfp = rtnl_dereference(*fp))
+				/* nothing */;
+
+		rcu_assign_pointer(*fp, f);
 	}
-	tcf_tree_unlock(tp);
 
+	if (oldp)
+		call_rcu(&oldp->rcu, __tcindex_partial_destroy);
 	return 0;
 
 errout_alloc:
 	if (balloc == 1)
-		kfree(cp.perfect);
+		kfree(cp->perfect);
 	else if (balloc == 2)
-		kfree(cp.h);
+		kfree(cp->h);
 errout:
+	kfree(cp);
 	tcf_exts_destroy(tp, &e);
 	return err;
 }
@@ -357,71 +476,6 @@ tcindex_change(struct net *net, struct sk_buff *in_skb,
 				 tca[TCA_RATE]);
 }
 
-
-static void tcindex_walk(struct tcf_proto *tp, struct tcf_walker *walker)
-{
-	struct tcindex_data *p = PRIV(tp);
-	struct tcindex_filter *f, *next;
-	int i;
-
-	pr_debug("tcindex_walk(tp %p,walker %p),p %p\n", tp, walker, p);
-	if (p->perfect) {
-		for (i = 0; i < p->hash; i++) {
-			if (!p->perfect[i].res.class)
-				continue;
-			if (walker->count >= walker->skip) {
-				if (walker->fn(tp,
-				    (unsigned long) (p->perfect+i), walker)
-				     < 0) {
-					walker->stop = 1;
-					return;
-				}
-			}
-			walker->count++;
-		}
-	}
-	if (!p->h)
-		return;
-	for (i = 0; i < p->hash; i++) {
-		for (f = p->h[i]; f; f = next) {
-			next = f->next;
-			if (walker->count >= walker->skip) {
-				if (walker->fn(tp, (unsigned long) &f->result,
-				    walker) < 0) {
-					walker->stop = 1;
-					return;
-				}
-			}
-			walker->count++;
-		}
-	}
-}
-
-
-static int tcindex_destroy_element(struct tcf_proto *tp,
-    unsigned long arg, struct tcf_walker *walker)
-{
-	return __tcindex_delete(tp, arg, 0);
-}
-
-
-static void tcindex_destroy(struct tcf_proto *tp)
-{
-	struct tcindex_data *p = PRIV(tp);
-	struct tcf_walker walker;
-
-	pr_debug("tcindex_destroy(tp %p),p %p\n", tp, p);
-	walker.count = 0;
-	walker.skip = 0;
-	walker.fn = &tcindex_destroy_element;
-	tcindex_walk(tp, &walker);
-	kfree(p->perfect);
-	kfree(p->h);
-	kfree(p);
-	tp->root = NULL;
-}
-
-
 static int tcindex_dump(struct tcf_proto *tp, unsigned long fh,
     struct sk_buff *skb, struct tcmsg *t)
 {
@@ -448,15 +502,18 @@ static int tcindex_dump(struct tcf_proto *tp, unsigned long fh,
 		nla_nest_end(skb, nest);
 	} else {
 		if (p->perfect) {
-			t->tcm_handle = r-p->perfect;
+			t->tcm_handle = r - p->perfect;
 		} else {
 			struct tcindex_filter *f;
+			struct tcindex_filter __rcu **fp;
 			int i;
 
 			t->tcm_handle = 0;
 			for (i = 0; !t->tcm_handle && i < p->hash; i++) {
-				for (f = p->h[i]; !t->tcm_handle && f;
-				     f = f->next) {
+				fp = &p->h[i];
+				for (f = rtnl_dereference(*fp);
+				     !t->tcm_handle && f;
+				     fp = &f->next, f = rtnl_dereference(*fp)) {
 					if (&f->result == r)
 						t->tcm_handle = f->key;
 				}

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox