* [PATCH 1/7] xfrm: remove policy lock when accessing policy->walk.dead
2010-03-29 14:12 [PATCH 0/7] caching bundles, iteration 2 Timo Teras
@ 2010-03-29 14:12 ` Timo Teras
2010-03-29 14:43 ` Herbert Xu
2010-03-29 14:12 ` [PATCH 2/7] flow: structurize flow cache Timo Teras
` (5 subsequent siblings)
6 siblings, 1 reply; 35+ messages in thread
From: Timo Teras @ 2010-03-29 14:12 UTC (permalink / raw)
To: netdev; +Cc: Herbert Xu, Timo Teras
All of the code considers ->dead as a hint that the cached policy
needs to get refreshed. The read side can just drop the read lock
without any side effects.
The write side needs to make sure that it is written exactly
once. The only possible race is at xfrm_policy_kill(). This is fixed
by checking the result of __xfrm_policy_unlink() when needed. It will
always succeed if the policy object is looked up from the hash
list (so some checks are removed), but it needs to be checked if
we are trying to unlink a policy via a reference (appropriate
checks added).
Since policy->walk.dead is written exactly once, it no longer
needs to be protected with a write lock.
Signed-off-by: Timo Teras <timo.teras@iki.fi>
---
net/xfrm/xfrm_policy.c | 30 +++++++-----------------------
net/xfrm/xfrm_user.c | 6 +-----
2 files changed, 8 insertions(+), 28 deletions(-)
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 843e066..595d347 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -156,7 +156,7 @@ static void xfrm_policy_timer(unsigned long data)
read_lock(&xp->lock);
- if (xp->walk.dead)
+ if (unlikely(xp->walk.dead))
goto out;
dir = xfrm_policy_id2dir(xp->index);
@@ -297,17 +297,7 @@ static DECLARE_WORK(xfrm_policy_gc_work, xfrm_policy_gc_task);
static void xfrm_policy_kill(struct xfrm_policy *policy)
{
- int dead;
-
- write_lock_bh(&policy->lock);
- dead = policy->walk.dead;
policy->walk.dead = 1;
- write_unlock_bh(&policy->lock);
-
- if (unlikely(dead)) {
- WARN_ON(1);
- return;
- }
spin_lock_bh(&xfrm_policy_gc_lock);
hlist_add_head(&policy->bydst, &xfrm_policy_gc_list);
@@ -776,7 +766,6 @@ xfrm_policy_flush_secctx_check(struct net *net, u8 type, struct xfrm_audit *audi
int xfrm_policy_flush(struct net *net, u8 type, struct xfrm_audit *audit_info)
{
int dir, err = 0, cnt = 0;
- struct xfrm_policy *dp;
write_lock_bh(&xfrm_policy_lock);
@@ -794,10 +783,9 @@ int xfrm_policy_flush(struct net *net, u8 type, struct xfrm_audit *audit_info)
&net->xfrm.policy_inexact[dir], bydst) {
if (pol->type != type)
continue;
- dp = __xfrm_policy_unlink(pol, dir);
+ __xfrm_policy_unlink(pol, dir);
write_unlock_bh(&xfrm_policy_lock);
- if (dp)
- cnt++;
+ cnt++;
xfrm_audit_policy_delete(pol, 1, audit_info->loginuid,
audit_info->sessionid,
@@ -816,10 +804,9 @@ int xfrm_policy_flush(struct net *net, u8 type, struct xfrm_audit *audit_info)
bydst) {
if (pol->type != type)
continue;
- dp = __xfrm_policy_unlink(pol, dir);
+ __xfrm_policy_unlink(pol, dir);
write_unlock_bh(&xfrm_policy_lock);
- if (dp)
- cnt++;
+ cnt++;
xfrm_audit_policy_delete(pol, 1,
audit_info->loginuid,
@@ -1132,7 +1119,7 @@ int xfrm_sk_policy_insert(struct sock *sk, int dir, struct xfrm_policy *pol)
__xfrm_policy_link(pol, XFRM_POLICY_MAX+dir);
}
if (old_pol)
- __xfrm_policy_unlink(old_pol, XFRM_POLICY_MAX+dir);
+ old_pol = __xfrm_policy_unlink(old_pol, XFRM_POLICY_MAX+dir);
write_unlock_bh(&xfrm_policy_lock);
if (old_pol) {
@@ -1737,11 +1724,8 @@ restart:
goto error;
}
- for (pi = 0; pi < npols; pi++) {
- read_lock_bh(&pols[pi]->lock);
+ for (pi = 0; pi < npols; pi++)
pol_dead |= pols[pi]->walk.dead;
- read_unlock_bh(&pols[pi]->lock);
- }
write_lock_bh(&policy->lock);
if (unlikely(pol_dead || stale_bundle(dst))) {
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 6106b72..778a6ec 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -1766,13 +1766,9 @@ static int xfrm_add_pol_expire(struct sk_buff *skb, struct nlmsghdr *nlh,
if (xp == NULL)
return -ENOENT;
- read_lock(&xp->lock);
- if (xp->walk.dead) {
- read_unlock(&xp->lock);
+ if (unlikely(xp->walk.dead))
goto out;
- }
- read_unlock(&xp->lock);
err = 0;
if (up->hard) {
uid_t loginuid = NETLINK_CB(skb).loginuid;
--
1.6.3.3
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [PATCH 1/7] xfrm: remove policy lock when accessing policy->walk.dead
2010-03-29 14:12 ` [PATCH 1/7] xfrm: remove policy lock when accessing policy->walk.dead Timo Teras
@ 2010-03-29 14:43 ` Herbert Xu
2010-03-30 4:55 ` Timo Teräs
0 siblings, 1 reply; 35+ messages in thread
From: Herbert Xu @ 2010-03-29 14:43 UTC (permalink / raw)
To: Timo Teras; +Cc: netdev
On Mon, Mar 29, 2010 at 05:12:38PM +0300, Timo Teras wrote:
>
> @@ -1132,7 +1119,7 @@ int xfrm_sk_policy_insert(struct sock *sk, int dir, struct xfrm_policy *pol)
> __xfrm_policy_link(pol, XFRM_POLICY_MAX+dir);
> }
> if (old_pol)
> - __xfrm_policy_unlink(old_pol, XFRM_POLICY_MAX+dir);
> + old_pol = __xfrm_policy_unlink(old_pol, XFRM_POLICY_MAX+dir);
> write_unlock_bh(&xfrm_policy_lock);
>
> if (old_pol) {
So when can this actually fail?
Otherwise this patch looks good to me.
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/7] xfrm: remove policy lock when accessing policy->walk.dead
2010-03-29 14:43 ` Herbert Xu
@ 2010-03-30 4:55 ` Timo Teräs
2010-03-30 11:53 ` Herbert Xu
0 siblings, 1 reply; 35+ messages in thread
From: Timo Teräs @ 2010-03-30 4:55 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev
Herbert Xu wrote:
> On Mon, Mar 29, 2010 at 05:12:38PM +0300, Timo Teras wrote:
>> @@ -1132,7 +1119,7 @@ int xfrm_sk_policy_insert(struct sock *sk, int dir, struct xfrm_policy *pol)
>> __xfrm_policy_link(pol, XFRM_POLICY_MAX+dir);
>> }
>> if (old_pol)
>> - __xfrm_policy_unlink(old_pol, XFRM_POLICY_MAX+dir);
>> + old_pol = __xfrm_policy_unlink(old_pol, XFRM_POLICY_MAX+dir);
>> write_unlock_bh(&xfrm_policy_lock);
>>
>> if (old_pol) {
>
> So when can this actually fail?
Considering that the socket reference is received from the sk->sk_policy,
and the hash bucket we use is "XFRM_POLICY_MAX+dir", it's non-obvious if
it can fail or not.
It would look like the timer can kill a policy and unlink it, but it
would still be found from sk_policy.
It probably doesn't really make sense to insert a per-socket policy that
expires. But in case someone does something like that, I'd think we
need the above just to be sure.
Considering this, xfrm_sk_policy_lookup() should probably check the
dead flag, and clean up sk_policy if it was killed by a timer.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/7] xfrm: remove policy lock when accessing policy->walk.dead
2010-03-30 4:55 ` Timo Teräs
@ 2010-03-30 11:53 ` Herbert Xu
2010-03-30 12:04 ` Timo Teräs
0 siblings, 1 reply; 35+ messages in thread
From: Herbert Xu @ 2010-03-30 11:53 UTC (permalink / raw)
To: Timo Teräs; +Cc: netdev
On Tue, Mar 30, 2010 at 07:55:07AM +0300, Timo Teräs wrote:
> Herbert Xu wrote:
>> On Mon, Mar 29, 2010 at 05:12:38PM +0300, Timo Teras wrote:
>>> @@ -1132,7 +1119,7 @@ int xfrm_sk_policy_insert(struct sock *sk, int dir, struct xfrm_policy *pol)
>>> __xfrm_policy_link(pol, XFRM_POLICY_MAX+dir);
>>> }
>>> if (old_pol)
>>> - __xfrm_policy_unlink(old_pol, XFRM_POLICY_MAX+dir);
>>> + old_pol = __xfrm_policy_unlink(old_pol, XFRM_POLICY_MAX+dir);
>>> write_unlock_bh(&xfrm_policy_lock);
>>> if (old_pol) {
>>
>> So when can this actually fail?
>
> Considering that the socket reference is received from the sk->sk_policy,
> and the hash bucket we use is "XFRM_POLICY_MAX+dir", it's non-obvious if
> it can fail or not.
>
> It would look like the timer can kill a policy and unlink it, but it
> would still be found from sk_policy.
Socket policies cannot expire.
In fact, they probably shouldn't even be on the bydst or any other
hash table. I think the only reason they're there at all is because
the hash table was added to __xfrm_policy_link which happens to be
used by socket policies.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/7] xfrm: remove policy lock when accessing policy->walk.dead
2010-03-30 11:53 ` Herbert Xu
@ 2010-03-30 12:04 ` Timo Teräs
2010-03-30 12:14 ` Herbert Xu
2010-03-30 14:01 ` Timo Teräs
0 siblings, 2 replies; 35+ messages in thread
From: Timo Teräs @ 2010-03-30 12:04 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev
Herbert Xu wrote:
> On Tue, Mar 30, 2010 at 07:55:07AM +0300, Timo Teräs wrote:
>> Herbert Xu wrote:
>>> On Mon, Mar 29, 2010 at 05:12:38PM +0300, Timo Teras wrote:
>>>> @@ -1132,7 +1119,7 @@ int xfrm_sk_policy_insert(struct sock *sk, int dir, struct xfrm_policy *pol)
>>>> __xfrm_policy_link(pol, XFRM_POLICY_MAX+dir);
>>>> }
>>>> if (old_pol)
>>>> - __xfrm_policy_unlink(old_pol, XFRM_POLICY_MAX+dir);
>>>> + old_pol = __xfrm_policy_unlink(old_pol, XFRM_POLICY_MAX+dir);
>>>> write_unlock_bh(&xfrm_policy_lock);
>>>> if (old_pol) {
>>> So when can this actually fail?
>> Considering that the socket reference is received from the sk->sk_policy,
>> and the hash bucket we use is "XFRM_POLICY_MAX+dir", it's non-obvious if
>> it can fail or not.
>>
>> It would look like the timer can kill a policy and unlink it, but it
>> would still be found from sk_policy.
>
> Socket policies cannot expire.
Was not aware of that. The above is not needed then.
> In fact, they probably shouldn't even be on the bydst or any other
> hash table. I think the only reason they're there at all is because
> the hash table was added to __xfrm_policy_link which happens to be
> used by socket policies.
I think it's hashed so socket policies are included in the policy
db dumps and counts.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/7] xfrm: remove policy lock when accessing policy->walk.dead
2010-03-30 12:04 ` Timo Teräs
@ 2010-03-30 12:14 ` Herbert Xu
2010-03-30 12:21 ` Timo Teräs
2010-03-30 14:01 ` Timo Teräs
1 sibling, 1 reply; 35+ messages in thread
From: Herbert Xu @ 2010-03-30 12:14 UTC (permalink / raw)
To: Timo Teräs; +Cc: netdev
On Tue, Mar 30, 2010 at 03:04:01PM +0300, Timo Teräs wrote:
>> I think it's hashed so socket policies are included in the policy
> db dumps and counts.
No we use a linked list for that.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/7] xfrm: remove policy lock when accessing policy->walk.dead
2010-03-30 12:14 ` Herbert Xu
@ 2010-03-30 12:21 ` Timo Teräs
2010-03-30 12:23 ` Herbert Xu
0 siblings, 1 reply; 35+ messages in thread
From: Timo Teräs @ 2010-03-30 12:21 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev
Herbert Xu wrote:
> On Tue, Mar 30, 2010 at 03:04:01PM +0300, Timo Teräs wrote:
>>> I think it's hashed so socket policies are included in the policy
>> db dumps and counts.
>
> No we use a linked list for that.
Ah, yes. It's just all done together in __xfrm_policy_link().
Hmm... is it possible to modify/delete per-socket policies from
userland via xfrm_user? That would be also another race why
we'd need to check the unlinking result.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/7] xfrm: remove policy lock when accessing policy->walk.dead
2010-03-30 12:21 ` Timo Teräs
@ 2010-03-30 12:23 ` Herbert Xu
2010-03-30 12:41 ` Timo Teräs
0 siblings, 1 reply; 35+ messages in thread
From: Herbert Xu @ 2010-03-30 12:23 UTC (permalink / raw)
To: Timo Teräs; +Cc: netdev
On Tue, Mar 30, 2010 at 03:21:19PM +0300, Timo Teräs wrote:
>
> Hmm... is it possible to modify/delete per-socket policies from
> userland via xfrm_user? That would be also another race why
> we'd need to check the unlinking result.
How would the user be able to specify the socket?
The answer is no.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 35+ messages in thread* Re: [PATCH 1/7] xfrm: remove policy lock when accessing policy->walk.dead
2010-03-30 12:23 ` Herbert Xu
@ 2010-03-30 12:41 ` Timo Teräs
2010-03-30 12:48 ` Herbert Xu
0 siblings, 1 reply; 35+ messages in thread
From: Timo Teräs @ 2010-03-30 12:41 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev
Herbert Xu wrote:
> On Tue, Mar 30, 2010 at 03:21:19PM +0300, Timo Teräs wrote:
>> Hmm... is it possible to modify/delete per-socket policies from
>> userland via xfrm_user? That would be also another race why
>> we'd need to check the unlinking result.
>
> How would the user be able to specify the socket?
>
> The answer is no.
I thought it was possible since they can be dumped.
But yes, I now see that the policy direction check does not
allow the per-socket policies to be specified.
So it'd make more sense to nuke the hashes entirely for
per-socket policies?
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/7] xfrm: remove policy lock when accessing policy->walk.dead
2010-03-30 12:41 ` Timo Teräs
@ 2010-03-30 12:48 ` Herbert Xu
2010-03-30 13:33 ` Timo Teräs
0 siblings, 1 reply; 35+ messages in thread
From: Herbert Xu @ 2010-03-30 12:48 UTC (permalink / raw)
To: Timo Teräs; +Cc: netdev
On Tue, Mar 30, 2010 at 03:41:02PM +0300, Timo Teräs wrote:
>
> So it'd make more sense to nuke the hashes entirely for
> per-socket policies?
Absolutely.
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/7] xfrm: remove policy lock when accessing policy->walk.dead
2010-03-30 12:48 ` Herbert Xu
@ 2010-03-30 13:33 ` Timo Teräs
2010-03-30 14:30 ` Herbert Xu
2010-03-30 14:37 ` Herbert Xu
0 siblings, 2 replies; 35+ messages in thread
From: Timo Teräs @ 2010-03-30 13:33 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev
Herbert Xu wrote:
> On Tue, Mar 30, 2010 at 03:41:02PM +0300, Timo Teräs wrote:
>> So it'd make more sense to nuke the hashes entirely for
>> per-socket policies?
>
> Absolutely.
I checked now the xfrm_user, and mostly it seems to prevent
modification to per-socket policies.
The only exception is XFRM_MSG_POLEXPIRE handler
xfrm_add_pol_expire(). It calls xfrm_policy_byid() without
verifying the direction, and can thus complete successfully on
a per-socket policy. This can actually result in per-socket
policy deletion via netlink.
I guess the proper thing is to add the direction check there.
It also seems that the by-index hash is also used when
generating new index. It's to double check that the index
is unique. So deleting the by-index hash from per-socket
policies seems tricky.
Removing bydst hashing should be trivial.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/7] xfrm: remove policy lock when accessing policy->walk.dead
2010-03-30 13:33 ` Timo Teräs
@ 2010-03-30 14:30 ` Herbert Xu
2010-03-30 14:34 ` Herbert Xu
2010-03-30 14:37 ` Herbert Xu
1 sibling, 1 reply; 35+ messages in thread
From: Herbert Xu @ 2010-03-30 14:30 UTC (permalink / raw)
To: Timo Teräs; +Cc: netdev
On Tue, Mar 30, 2010 at 04:33:52PM +0300, Timo Teräs wrote:
> Herbert Xu wrote:
>> On Tue, Mar 30, 2010 at 03:41:02PM +0300, Timo Teräs wrote:
>>> So it'd make more sense to nuke the hashes entirely for
>>> per-socket policies?
>>
>> Absolutely.
>
> I checked now the xfrm_user, and mostly it seems to prevent
> modification to per-socket policies.
>
> The only exception is XFRM_MSG_POLEXPIRE handler
> xfrm_add_pol_expire(). It calls xfrm_policy_byid() without
> verifying the direction, and can thus complete successfully on
> a per-socket policy. This can actually result in per-socket
> policy deletion via netlink.
That shouldn't be possible since the directions used by socket
policies cannot be set through xfrm_user.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/7] xfrm: remove policy lock when accessing policy->walk.dead
2010-03-30 14:30 ` Herbert Xu
@ 2010-03-30 14:34 ` Herbert Xu
0 siblings, 0 replies; 35+ messages in thread
From: Herbert Xu @ 2010-03-30 14:34 UTC (permalink / raw)
To: Timo Teräs; +Cc: netdev
On Tue, Mar 30, 2010 at 10:30:48PM +0800, Herbert Xu wrote:
> On Tue, Mar 30, 2010 at 04:33:52PM +0300, Timo Teräs wrote:
> > Herbert Xu wrote:
> >> On Tue, Mar 30, 2010 at 03:41:02PM +0300, Timo Teräs wrote:
> >>> So it'd make more sense to nuke the hashes entirely for
> >>> per-socket policies?
> >>
> >> Absolutely.
> >
> > I checked now the xfrm_user, and mostly it seems to prevent
> > modification to per-socket policies.
> >
> > The only exception is XFRM_MSG_POLEXPIRE handler
> > xfrm_add_pol_expire(). It calls xfrm_policy_byid() without
> > verifying the direction, and can thus complete successfully on
> > a per-socket policy. This can actually result in per-socket
> > policy deletion via netlink.
>
> That shouldn't be possible since the directions used by socket
> policies cannot be set through xfrm_user.
If there is a missing direction check then we should add it.
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/7] xfrm: remove policy lock when accessing policy->walk.dead
2010-03-30 13:33 ` Timo Teräs
2010-03-30 14:30 ` Herbert Xu
@ 2010-03-30 14:37 ` Herbert Xu
1 sibling, 0 replies; 35+ messages in thread
From: Herbert Xu @ 2010-03-30 14:37 UTC (permalink / raw)
To: Timo Teräs; +Cc: netdev
On Tue, Mar 30, 2010 at 04:33:52PM +0300, Timo Teräs wrote:
>
> It also seems that the by-index hash is also used when
> generating new index. It's to double check that the index
> is unique. So deleting the by-index hash from per-socket
> policies seems tricky.
>
> Removing bydst hashing should be trivial.
Yes that makes sense.
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/7] xfrm: remove policy lock when accessing policy->walk.dead
2010-03-30 12:04 ` Timo Teräs
2010-03-30 12:14 ` Herbert Xu
@ 2010-03-30 14:01 ` Timo Teräs
2010-03-30 14:29 ` Herbert Xu
1 sibling, 1 reply; 35+ messages in thread
From: Timo Teräs @ 2010-03-30 14:01 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev
Timo Teräs wrote:
> Herbert Xu wrote:
>> On Tue, Mar 30, 2010 at 07:55:07AM +0300, Timo Teräs wrote:
>>> Herbert Xu wrote:
>>>> On Mon, Mar 29, 2010 at 05:12:38PM +0300, Timo Teras wrote:
>>>>> @@ -1132,7 +1119,7 @@ int xfrm_sk_policy_insert(struct sock *sk,
>>>>> int dir, struct xfrm_policy *pol)
>>>>> __xfrm_policy_link(pol, XFRM_POLICY_MAX+dir);
>>>>> }
>>>>> if (old_pol)
>>>>> - __xfrm_policy_unlink(old_pol, XFRM_POLICY_MAX+dir);
>>>>> + old_pol = __xfrm_policy_unlink(old_pol, XFRM_POLICY_MAX+dir);
>>>>> write_unlock_bh(&xfrm_policy_lock);
>>>>> if (old_pol) {
>>>> So when can this actually fail?
>>> Considering that the socket reference is received from the
>>> sk->sk_policy,
>>> and the hash bucket we use is "XFRM_POLICY_MAX+dir", it's non-obvious if
>>> it can fail or not.
>>>
>>> It would look like the timer can kill a policy and unlink it, but it
>>> would still be found from sk_policy.
>>
>> Socket policies cannot expire.
>
> Was not aware of that. The above is not needed then.
Since the exported function xfrm_policy_byid() can result in deletion
of a socket policy, it's safer to leave this change in. This can be
even triggered via xfrm_user since it does not check 'dir' for the
policy expired message it handles. Any custom module could do similar
harm.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/7] xfrm: remove policy lock when accessing policy->walk.dead
2010-03-30 14:01 ` Timo Teräs
@ 2010-03-30 14:29 ` Herbert Xu
2010-03-30 15:36 ` Timo Teräs
0 siblings, 1 reply; 35+ messages in thread
From: Herbert Xu @ 2010-03-30 14:29 UTC (permalink / raw)
To: Timo Teräs; +Cc: netdev
On Tue, Mar 30, 2010 at 05:01:47PM +0300, Timo Teräs wrote:
>
> Since the exported function xfrm_policy_byid() can result in deletion
> of a socket policy, it's safer to leave this change in. This can be
> even triggered via xfrm_user since it does not check 'dir' for the
> policy expired message it handles. Any custom module could do similar
> harm.
That's a bug. Socket policies should never be deleted by anyone
other than the socket owner through a setsockopt.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/7] xfrm: remove policy lock when accessing policy->walk.dead
2010-03-30 14:29 ` Herbert Xu
@ 2010-03-30 15:36 ` Timo Teräs
2010-03-31 0:43 ` Herbert Xu
0 siblings, 1 reply; 35+ messages in thread
From: Timo Teräs @ 2010-03-30 15:36 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev
Herbert Xu wrote:
> On Tue, Mar 30, 2010 at 05:01:47PM +0300, Timo Teräs wrote:
>> Since the exported function xfrm_policy_byid() can result in deletion
>> of a socket policy, it's safer to leave this change in. This can be
>> even triggered via xfrm_user since it does not check 'dir' for the
>> policy expired message it handles. Any custom module could do similar
>> harm.
>
> That's a bug. Socket policies should never be deleted by anyone
> other than the socket owner through a setsockopt.
Ok. I can remove the change to xfrm_sk_policy_insert() if you want.
I only added it because it's non-trivial to figure out if there's
any code path that could race. It's a great help for the reader of the
code to see that it's correct even if it's not strictly needed.
My initial feeling is that the change produces better code as the
original 'old_pol' does not have to get stored and restored.
I'm fine either way.
I will also include a patch to fix the missing 'dir' check in
xfrm_user.
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH 2/7] flow: structurize flow cache
2010-03-29 14:12 [PATCH 0/7] caching bundles, iteration 2 Timo Teras
2010-03-29 14:12 ` [PATCH 1/7] xfrm: remove policy lock when accessing policy->walk.dead Timo Teras
@ 2010-03-29 14:12 ` Timo Teras
2010-03-30 12:01 ` Herbert Xu
2010-03-29 14:12 ` [PATCH 3/7] flow: allocate hash table for online cpus only Timo Teras
` (4 subsequent siblings)
6 siblings, 1 reply; 35+ messages in thread
From: Timo Teras @ 2010-03-29 14:12 UTC (permalink / raw)
To: netdev; +Cc: Herbert Xu, Timo Teras
Group all per-cpu data to one structure instead of having many
globals. Also prepare the internals so that we can have multiple
instances of the flow cache if needed.
Only the kmem_cache is left as a global as all flow caches share
the same element size, and benefit from using a common cache.
Signed-off-by: Timo Teras <timo.teras@iki.fi>
---
net/core/flow.c | 223 +++++++++++++++++++++++++++++--------------------------
1 files changed, 119 insertions(+), 104 deletions(-)
diff --git a/net/core/flow.c b/net/core/flow.c
index 9601587..1d27ca6 100644
--- a/net/core/flow.c
+++ b/net/core/flow.c
@@ -35,104 +35,105 @@ struct flow_cache_entry {
atomic_t *object_ref;
};
-atomic_t flow_cache_genid = ATOMIC_INIT(0);
-
-static u32 flow_hash_shift;
-#define flow_hash_size (1 << flow_hash_shift)
-static DEFINE_PER_CPU(struct flow_cache_entry **, flow_tables) = { NULL };
-
-#define flow_table(cpu) (per_cpu(flow_tables, cpu))
-
-static struct kmem_cache *flow_cachep __read_mostly;
-
-static int flow_lwm, flow_hwm;
-
-struct flow_percpu_info {
- int hash_rnd_recalc;
- u32 hash_rnd;
- int count;
+struct flow_cache_percpu {
+ struct flow_cache_entry ** hash_table;
+ int hash_count;
+ u32 hash_rnd;
+ int hash_rnd_recalc;
+ struct tasklet_struct flush_tasklet;
};
-static DEFINE_PER_CPU(struct flow_percpu_info, flow_hash_info) = { 0 };
-
-#define flow_hash_rnd_recalc(cpu) \
- (per_cpu(flow_hash_info, cpu).hash_rnd_recalc)
-#define flow_hash_rnd(cpu) \
- (per_cpu(flow_hash_info, cpu).hash_rnd)
-#define flow_count(cpu) \
- (per_cpu(flow_hash_info, cpu).count)
-
-static struct timer_list flow_hash_rnd_timer;
-
-#define FLOW_HASH_RND_PERIOD (10 * 60 * HZ)
struct flow_flush_info {
- atomic_t cpuleft;
- struct completion completion;
+ struct flow_cache * cache;
+ atomic_t cpuleft;
+ struct completion completion;
};
-static DEFINE_PER_CPU(struct tasklet_struct, flow_flush_tasklets) = { NULL };
-#define flow_flush_tasklet(cpu) (&per_cpu(flow_flush_tasklets, cpu))
+struct flow_cache {
+ u32 hash_shift;
+ unsigned long order;
+ struct flow_cache_percpu * percpu;
+ struct notifier_block hotcpu_notifier;
+ int low_watermark;
+ int high_watermark;
+ struct timer_list rnd_timer;
+};
+
+atomic_t flow_cache_genid = ATOMIC_INIT(0);
+static struct flow_cache flow_cache_global;
+static struct kmem_cache *flow_cachep;
+
+#define flow_cache_hash_size(cache) (1 << (cache)->hash_shift)
+#define FLOW_HASH_RND_PERIOD (10 * 60 * HZ)
static void flow_cache_new_hashrnd(unsigned long arg)
{
+ struct flow_cache *fc = (void *) arg;
int i;
for_each_possible_cpu(i)
- flow_hash_rnd_recalc(i) = 1;
+ per_cpu_ptr(fc->percpu, i)->hash_rnd_recalc = 1;
- flow_hash_rnd_timer.expires = jiffies + FLOW_HASH_RND_PERIOD;
- add_timer(&flow_hash_rnd_timer);
+ fc->rnd_timer.expires = jiffies + FLOW_HASH_RND_PERIOD;
+ add_timer(&fc->rnd_timer);
}
-static void flow_entry_kill(int cpu, struct flow_cache_entry *fle)
+static void flow_entry_kill(struct flow_cache *fc,
+ struct flow_cache_percpu *fcp,
+ struct flow_cache_entry *fle)
{
if (fle->object)
atomic_dec(fle->object_ref);
kmem_cache_free(flow_cachep, fle);
- flow_count(cpu)--;
+ fcp->hash_count--;
}
-static void __flow_cache_shrink(int cpu, int shrink_to)
+static void __flow_cache_shrink(struct flow_cache *fc,
+ struct flow_cache_percpu *fcp,
+ int shrink_to)
{
struct flow_cache_entry *fle, **flp;
int i;
- for (i = 0; i < flow_hash_size; i++) {
+ for (i = 0; i < flow_cache_hash_size(fc); i++) {
int k = 0;
- flp = &flow_table(cpu)[i];
+ flp = &fcp->hash_table[i];
while ((fle = *flp) != NULL && k < shrink_to) {
k++;
flp = &fle->next;
}
while ((fle = *flp) != NULL) {
*flp = fle->next;
- flow_entry_kill(cpu, fle);
+ flow_entry_kill(fc, fcp, fle);
}
}
}
-static void flow_cache_shrink(int cpu)
+static void flow_cache_shrink(struct flow_cache *fc,
+ struct flow_cache_percpu *fcp)
{
- int shrink_to = flow_lwm / flow_hash_size;
+ int shrink_to = fc->low_watermark / flow_cache_hash_size(fc);
- __flow_cache_shrink(cpu, shrink_to);
+ __flow_cache_shrink(fc, fcp, shrink_to);
}
-static void flow_new_hash_rnd(int cpu)
+static void flow_new_hash_rnd(struct flow_cache *fc,
+ struct flow_cache_percpu *fcp)
{
- get_random_bytes(&flow_hash_rnd(cpu), sizeof(u32));
- flow_hash_rnd_recalc(cpu) = 0;
-
- __flow_cache_shrink(cpu, 0);
+ get_random_bytes(&fcp->hash_rnd, sizeof(u32));
+ fcp->hash_rnd_recalc = 0;
+ __flow_cache_shrink(fc, fcp, 0);
}
-static u32 flow_hash_code(struct flowi *key, int cpu)
+static u32 flow_hash_code(struct flow_cache *fc,
+ struct flow_cache_percpu *fcp,
+ struct flowi *key)
{
u32 *k = (u32 *) key;
- return (jhash2(k, (sizeof(*key) / sizeof(u32)), flow_hash_rnd(cpu)) &
- (flow_hash_size - 1));
+ return (jhash2(k, (sizeof(*key) / sizeof(u32)), fcp->hash_rnd)
+ & (flow_cache_hash_size(fc) - 1));
}
#if (BITS_PER_LONG == 64)
@@ -168,24 +169,25 @@ static int flow_key_compare(struct flowi *key1, struct flowi *key2)
void *flow_cache_lookup(struct net *net, struct flowi *key, u16 family, u8 dir,
flow_resolve_t resolver)
{
+ struct flow_cache *fc = &flow_cache_global;
+ struct flow_cache_percpu *fcp;
struct flow_cache_entry *fle, **head;
unsigned int hash;
- int cpu;
local_bh_disable();
- cpu = smp_processor_id();
+ fcp = per_cpu_ptr(fc->percpu, smp_processor_id());
fle = NULL;
/* Packet really early in init? Making flow_cache_init a
* pre-smp initcall would solve this. --RR */
- if (!flow_table(cpu))
+ if (!fcp->hash_table)
goto nocache;
- if (flow_hash_rnd_recalc(cpu))
- flow_new_hash_rnd(cpu);
- hash = flow_hash_code(key, cpu);
+ if (fcp->hash_rnd_recalc)
+ flow_new_hash_rnd(fc, fcp);
+ hash = flow_hash_code(fc, fcp, key);
- head = &flow_table(cpu)[hash];
+ head = &fcp->hash_table[hash];
for (fle = *head; fle; fle = fle->next) {
if (fle->family == family &&
fle->dir == dir &&
@@ -204,8 +206,8 @@ void *flow_cache_lookup(struct net *net, struct flowi *key, u16 family, u8 dir,
}
if (!fle) {
- if (flow_count(cpu) > flow_hwm)
- flow_cache_shrink(cpu);
+ if (fcp->hash_count > fc->high_watermark)
+ flow_cache_shrink(fc, fcp);
fle = kmem_cache_alloc(flow_cachep, GFP_ATOMIC);
if (fle) {
@@ -215,7 +217,7 @@ void *flow_cache_lookup(struct net *net, struct flowi *key, u16 family, u8 dir,
fle->dir = dir;
memcpy(&fle->key, key, sizeof(*key));
fle->object = NULL;
- flow_count(cpu)++;
+ fcp->hash_count++;
}
}
@@ -249,14 +251,15 @@ nocache:
static void flow_cache_flush_tasklet(unsigned long data)
{
struct flow_flush_info *info = (void *)data;
+ struct flow_cache *fc = info->cache;
+ struct flow_cache_percpu *fcp;
int i;
- int cpu;
- cpu = smp_processor_id();
- for (i = 0; i < flow_hash_size; i++) {
+ fcp = per_cpu_ptr(fc->percpu, smp_processor_id());
+ for (i = 0; i < flow_cache_hash_size(fc); i++) {
struct flow_cache_entry *fle;
- fle = flow_table(cpu)[i];
+ fle = fcp->hash_table[i];
for (; fle; fle = fle->next) {
unsigned genid = atomic_read(&flow_cache_genid);
@@ -272,7 +275,6 @@ static void flow_cache_flush_tasklet(unsigned long data)
complete(&info->completion);
}
-static void flow_cache_flush_per_cpu(void *) __attribute__((__unused__));
static void flow_cache_flush_per_cpu(void *data)
{
struct flow_flush_info *info = data;
@@ -280,8 +282,7 @@ static void flow_cache_flush_per_cpu(void *data)
struct tasklet_struct *tasklet;
cpu = smp_processor_id();
-
- tasklet = flow_flush_tasklet(cpu);
+ tasklet = &per_cpu_ptr(info->cache->percpu, cpu)->flush_tasklet;
tasklet->data = (unsigned long)info;
tasklet_schedule(tasklet);
}
@@ -294,6 +295,7 @@ void flow_cache_flush(void)
/* Don't want cpus going down or up during this. */
get_online_cpus();
mutex_lock(&flow_flush_sem);
+ info.cache = &flow_cache_global;
atomic_set(&info.cpuleft, num_online_cpus());
init_completion(&info.completion);
@@ -307,62 +309,75 @@ void flow_cache_flush(void)
put_online_cpus();
}
-static void __init flow_cache_cpu_prepare(int cpu)
+static void __init flow_cache_cpu_prepare(struct flow_cache *fc,
+ struct flow_cache_percpu *fcp)
{
- struct tasklet_struct *tasklet;
- unsigned long order;
-
- for (order = 0;
- (PAGE_SIZE << order) <
- (sizeof(struct flow_cache_entry *)*flow_hash_size);
- order++)
- /* NOTHING */;
-
- flow_table(cpu) = (struct flow_cache_entry **)
- __get_free_pages(GFP_KERNEL|__GFP_ZERO, order);
- if (!flow_table(cpu))
- panic("NET: failed to allocate flow cache order %lu\n", order);
-
- flow_hash_rnd_recalc(cpu) = 1;
- flow_count(cpu) = 0;
-
- tasklet = flow_flush_tasklet(cpu);
- tasklet_init(tasklet, flow_cache_flush_tasklet, 0);
+ fcp->hash_table = (struct flow_cache_entry **)
+ __get_free_pages(GFP_KERNEL|__GFP_ZERO, fc->order);
+ if (!fcp->hash_table)
+ panic("NET: failed to allocate flow cache order %lu\n", fc->order);
+
+ fcp->hash_rnd_recalc = 1;
+ fcp->hash_count = 0;
+ tasklet_init(&fcp->flush_tasklet, flow_cache_flush_tasklet, 0);
}
static int flow_cache_cpu(struct notifier_block *nfb,
unsigned long action,
void *hcpu)
{
+ struct flow_cache *fc = container_of(nfb, struct flow_cache, hotcpu_notifier);
+ int cpu = (unsigned long) hcpu;
+ struct flow_cache_percpu *fcp = per_cpu_ptr(fc->percpu, cpu);
+
if (action == CPU_DEAD || action == CPU_DEAD_FROZEN)
- __flow_cache_shrink((unsigned long)hcpu, 0);
+ __flow_cache_shrink(fc, fcp, 0);
return NOTIFY_OK;
}
-static int __init flow_cache_init(void)
+static int flow_cache_init(struct flow_cache *fc)
{
+ unsigned long order;
int i;
- flow_cachep = kmem_cache_create("flow_cache",
- sizeof(struct flow_cache_entry),
- 0, SLAB_PANIC,
- NULL);
- flow_hash_shift = 10;
- flow_lwm = 2 * flow_hash_size;
- flow_hwm = 4 * flow_hash_size;
+ fc->hash_shift = 10;
+ fc->low_watermark = 2 * flow_cache_hash_size(fc);
+ fc->high_watermark = 4 * flow_cache_hash_size(fc);
+
+ for (order = 0;
+ (PAGE_SIZE << order) <
+ (sizeof(struct flow_cache_entry *)*flow_cache_hash_size(fc));
+ order++)
+ /* NOTHING */;
+ fc->order = order;
+ fc->percpu = alloc_percpu(struct flow_cache_percpu);
- setup_timer(&flow_hash_rnd_timer, flow_cache_new_hashrnd, 0);
- flow_hash_rnd_timer.expires = jiffies + FLOW_HASH_RND_PERIOD;
- add_timer(&flow_hash_rnd_timer);
+ setup_timer(&fc->rnd_timer, flow_cache_new_hashrnd,
+ (unsigned long) fc);
+ fc->rnd_timer.expires = jiffies + FLOW_HASH_RND_PERIOD;
+ add_timer(&fc->rnd_timer);
for_each_possible_cpu(i)
- flow_cache_cpu_prepare(i);
+ flow_cache_cpu_prepare(fc, per_cpu_ptr(fc->percpu, i));
+
+ fc->hotcpu_notifier = (struct notifier_block){
+ .notifier_call = flow_cache_cpu,
+ };
+ register_hotcpu_notifier(&fc->hotcpu_notifier);
- hotcpu_notifier(flow_cache_cpu, 0);
return 0;
}
-module_init(flow_cache_init);
+static int __init flow_cache_init_global(void)
+{
+ flow_cachep = kmem_cache_create("flow_cache",
+ sizeof(struct flow_cache_entry),
+ 0, SLAB_PANIC, NULL);
+
+ return flow_cache_init(&flow_cache_global);
+}
+
+module_init(flow_cache_init_global);
EXPORT_SYMBOL(flow_cache_genid);
EXPORT_SYMBOL(flow_cache_lookup);
--
1.6.3.3
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [PATCH 2/7] flow: structurize flow cache
2010-03-29 14:12 ` [PATCH 2/7] flow: structurize flow cache Timo Teras
@ 2010-03-30 12:01 ` Herbert Xu
2010-03-30 12:02 ` Timo Teräs
0 siblings, 1 reply; 35+ messages in thread
From: Herbert Xu @ 2010-03-30 12:01 UTC (permalink / raw)
To: Timo Teras; +Cc: netdev
On Mon, Mar 29, 2010 at 05:12:39PM +0300, Timo Teras wrote:
> Group all per-cpu data to one structure instead of having many
> globals. Also prepare the internals so that we can have multiple
> instances of the flow cache if needed.
Grouping everything together looks sane. But adding the flow
cache pointer to everything seems to be unnecessary. Why would
we need multiple flow caches?
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 2/7] flow: structurize flow cache
2010-03-30 12:01 ` Herbert Xu
@ 2010-03-30 12:02 ` Timo Teräs
2010-03-30 12:15 ` Herbert Xu
0 siblings, 1 reply; 35+ messages in thread
From: Timo Teräs @ 2010-03-30 12:02 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev
Herbert Xu wrote:
> On Mon, Mar 29, 2010 at 05:12:39PM +0300, Timo Teras wrote:
>> Group all per-cpu data to one structure instead of having many
>> globals. Also prepare the internals so that we can have multiple
>> instances of the flow cache if needed.
>
> Grouping everything together looks sane. But adding the flow
> cache pointer to everything seems to be unnecessary. Why would
> we need multiple flow caches?
Just a precaution, since it is doable cheaply.
If we want a generic flow cache, we might want to have multiple
instances in the future. It might also make sense to have a per-net
flow cache.
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH 3/7] flow: allocate hash table for online cpus only
2010-03-29 14:12 [PATCH 0/7] caching bundles, iteration 2 Timo Teras
2010-03-29 14:12 ` [PATCH 1/7] xfrm: remove policy lock when accessing policy->walk.dead Timo Teras
2010-03-29 14:12 ` [PATCH 2/7] flow: structurize flow cache Timo Teras
@ 2010-03-29 14:12 ` Timo Teras
2010-03-30 12:12 ` Herbert Xu
2010-03-29 14:12 ` [PATCH 4/7] flow: delayed deletion of flow cache entries Timo Teras
` (3 subsequent siblings)
6 siblings, 1 reply; 35+ messages in thread
From: Timo Teras @ 2010-03-29 14:12 UTC (permalink / raw)
To: netdev; +Cc: Herbert Xu, Timo Teras
Instead of unconditionally allocating hash table for all possible
cpu's, allocate it only for online cpu's and release related
memory if cpu goes down.
Signed-off-by: Timo Teras <timo.teras@iki.fi>
---
net/core/flow.c | 43 ++++++++++++++++++++++++++++++-------------
1 files changed, 30 insertions(+), 13 deletions(-)
diff --git a/net/core/flow.c b/net/core/flow.c
index 1d27ca6..104078d 100644
--- a/net/core/flow.c
+++ b/net/core/flow.c
@@ -309,36 +309,49 @@ void flow_cache_flush(void)
put_online_cpus();
}
-static void __init flow_cache_cpu_prepare(struct flow_cache *fc,
- struct flow_cache_percpu *fcp)
+static void __cpuinit flow_cache_cpu_prepare(struct flow_cache *fc,
+ struct flow_cache_percpu *fcp)
{
fcp->hash_table = (struct flow_cache_entry **)
__get_free_pages(GFP_KERNEL|__GFP_ZERO, fc->order);
- if (!fcp->hash_table)
- panic("NET: failed to allocate flow cache order %lu\n", fc->order);
-
fcp->hash_rnd_recalc = 1;
fcp->hash_count = 0;
tasklet_init(&fcp->flush_tasklet, flow_cache_flush_tasklet, 0);
}
-static int flow_cache_cpu(struct notifier_block *nfb,
- unsigned long action,
- void *hcpu)
+static int __cpuinit flow_cache_cpu(struct notifier_block *nfb,
+ unsigned long action,
+ void *hcpu)
{
struct flow_cache *fc = container_of(nfb, struct flow_cache, hotcpu_notifier);
int cpu = (unsigned long) hcpu;
struct flow_cache_percpu *fcp = per_cpu_ptr(fc->percpu, cpu);
- if (action == CPU_DEAD || action == CPU_DEAD_FROZEN)
- __flow_cache_shrink(fc, fcp, 0);
+ switch (action) {
+ case CPU_UP_PREPARE:
+ case CPU_UP_PREPARE_FROZEN:
+ flow_cache_cpu_prepare(fc, fcp);
+ if (!fcp->hash_table)
+ return NOTIFY_BAD;
+ break;
+ case CPU_UP_CANCELED:
+ case CPU_UP_CANCELED_FROZEN:
+ case CPU_DEAD:
+ case CPU_DEAD_FROZEN:
+ if (fcp->hash_table) {
+ __flow_cache_shrink(fc, fcp, 0);
+ free_pages((unsigned long) fcp->hash_table, fc->order);
+ fcp->hash_table = NULL;
+ }
+ break;
+ }
return NOTIFY_OK;
}
static int flow_cache_init(struct flow_cache *fc)
{
unsigned long order;
- int i;
+ int i, r;
fc->hash_shift = 10;
fc->low_watermark = 2 * flow_cache_hash_size(fc);
@@ -357,8 +370,12 @@ static int flow_cache_init(struct flow_cache *fc)
fc->rnd_timer.expires = jiffies + FLOW_HASH_RND_PERIOD;
add_timer(&fc->rnd_timer);
- for_each_possible_cpu(i)
- flow_cache_cpu_prepare(fc, per_cpu_ptr(fc->percpu, i));
+ for_each_online_cpu(i) {
+ r = flow_cache_cpu(&fc->hotcpu_notifier,
+ CPU_UP_PREPARE, (void*) i);
+ if (r != NOTIFY_OK)
+ panic("NET: failed to allocate flow cache order %lu\n", order);
+ }
fc->hotcpu_notifier = (struct notifier_block){
.notifier_call = flow_cache_cpu,
--
1.6.3.3
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [PATCH 3/7] flow: allocate hash table for online cpus only
2010-03-29 14:12 ` [PATCH 3/7] flow: allocate hash table for online cpus only Timo Teras
@ 2010-03-30 12:12 ` Herbert Xu
2010-03-31 12:32 ` Rusty Russell
0 siblings, 1 reply; 35+ messages in thread
From: Herbert Xu @ 2010-03-30 12:12 UTC (permalink / raw)
To: Timo Teras; +Cc: netdev, Rusty Russell
On Mon, Mar 29, 2010 at 05:12:40PM +0300, Timo Teras wrote:
> Instead of unconditionally allocating hash table for all possible
> cpu's, allocate it only for online cpu's and release related
> memory if cpu goes down.
>
> Signed-off-by: Timo Teras <timo.teras@iki.fi>
Hmm that's where we started but then Rusty changed it back in 2004:
commit 0a32dc4d8e83c48f7535d66731eb35d1916b39a8
Author: Rusty Russell <rusty@rustcorp.com.au>
Date: Wed Jan 21 18:14:37 2004 -0800
[NET]: Simplify net/flow.c per-cpu handling.
The cpu handling in net/core/flow.c is complex: it tries to allocate
flow cache as each CPU comes up. It might as well allocate them for
each possible CPU at boot.
So I'd like to hear his opinion on changing it back again.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 3/7] flow: allocate hash table for online cpus only
2010-03-30 12:12 ` Herbert Xu
@ 2010-03-31 12:32 ` Rusty Russell
2010-03-31 13:27 ` Timo Teräs
0 siblings, 1 reply; 35+ messages in thread
From: Rusty Russell @ 2010-03-31 12:32 UTC (permalink / raw)
To: Herbert Xu; +Cc: Timo Teras, netdev
On Tue, 30 Mar 2010 10:42:55 pm Herbert Xu wrote:
> On Mon, Mar 29, 2010 at 05:12:40PM +0300, Timo Teras wrote:
> > Instead of unconditionally allocating hash table for all possible
> > cpu's, allocate it only for online cpu's and release related
> > memory if cpu goes down.
> >
> > Signed-off-by: Timo Teras <timo.teras@iki.fi>
>
> Hmm that's where we started but then Rusty changed it back in 2004:
>
> commit 0a32dc4d8e83c48f7535d66731eb35d1916b39a8
> Author: Rusty Russell <rusty@rustcorp.com.au>
> Date: Wed Jan 21 18:14:37 2004 -0800
>
> [NET]: Simplify net/flow.c per-cpu handling.
>
> The cpu handling in net/core/flow.c is complex: it tries to allocate
> flow cache as each CPU comes up. It might as well allocate them for
> each possible CPU at boot.
>
> So I'd like to hear his opinion on changing it back again.
It was pretty unique at the time, it no longer is, so the arguments are less
compelling IMHO.
However, we can now use a dynamic percpu variable and get it as a real
per-cpu thing (which currently means it *will* be for every available cpu,
not just online ones). Haven't thought about it, but that change might be
worth considering instead?
Thanks!
Rusty.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 3/7] flow: allocate hash table for online cpus only
2010-03-31 12:32 ` Rusty Russell
@ 2010-03-31 13:27 ` Timo Teräs
0 siblings, 0 replies; 35+ messages in thread
From: Timo Teräs @ 2010-03-31 13:27 UTC (permalink / raw)
To: Rusty Russell; +Cc: Herbert Xu, netdev
Rusty Russell wrote:
> On Tue, 30 Mar 2010 10:42:55 pm Herbert Xu wrote:
>> On Mon, Mar 29, 2010 at 05:12:40PM +0300, Timo Teras wrote:
>>> Instead of unconditionally allocating hash table for all possible
>>> cpu's, allocate it only for online cpu's and release related
>>> memory if cpu goes down.
>>>
>>> Signed-off-by: Timo Teras <timo.teras@iki.fi>
>> Hmm that's where we started but then Rusty changed it back in 2004:
>>
>> So I'd like to hear his opinion on changing it back again.
>
> It was pretty unique at the time, it no longer is, so the arguments are less
> compelling IMHO.
>
> However, we can now use a dynamic percpu variable and get it as a real
> per-cpu thing (which currently means it *will* be for every available cpu,
> not just online ones). Haven't thought about it, but that change might be
> worth considering instead?
I did convert most of the static percpu variables to a struct which
is allocated dynamically using alloc_percpu. See:
http://marc.info/?l=linux-netdev&m=127003066905912&w=2
This patch is on top of that, to avoid allocating the larger hash
table unconditionally, as the number of possible cpus can be large.
If you take a look at the actual patch to add back the hash allocation
for only 'online' cpu's, it's not that complicated IMHO:
http://marc.info/?l=linux-netdev&m=126987200927472&w=2
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH 4/7] flow: delayed deletion of flow cache entries
2010-03-29 14:12 [PATCH 0/7] caching bundles, iteration 2 Timo Teras
` (2 preceding siblings ...)
2010-03-29 14:12 ` [PATCH 3/7] flow: allocate hash table for online cpus only Timo Teras
@ 2010-03-29 14:12 ` Timo Teras
2010-03-30 12:22 ` Herbert Xu
2010-03-29 14:12 ` [PATCH 5/7] flow: virtualize get and entry deletion methods Timo Teras
` (2 subsequent siblings)
6 siblings, 1 reply; 35+ messages in thread
From: Timo Teras @ 2010-03-29 14:12 UTC (permalink / raw)
To: netdev; +Cc: Herbert Xu, Timo Teras
Speed up lookups by freeing flow cache entries later. This is also in
preparation to have virtual entry destructor that might do more
work.
As the gc_list is more effective with a doubly linked list, the flow cache
is converted to use the common hlist and list macros where appropriate.
Signed-off-by: Timo Teras <timo.teras@iki.fi>
---
net/core/flow.c | 112 ++++++++++++++++++++++++++++++++++++++----------------
1 files changed, 79 insertions(+), 33 deletions(-)
diff --git a/net/core/flow.c b/net/core/flow.c
index 104078d..760f93d 100644
--- a/net/core/flow.c
+++ b/net/core/flow.c
@@ -26,7 +26,10 @@
#include <linux/security.h>
struct flow_cache_entry {
- struct flow_cache_entry *next;
+ union {
+ struct hlist_node hlist;
+ struct list_head gc_list;
+ } u;
u16 family;
u8 dir;
u32 genid;
@@ -36,7 +39,7 @@ struct flow_cache_entry {
};
struct flow_cache_percpu {
- struct flow_cache_entry ** hash_table;
+ struct hlist_head * hash_table;
int hash_count;
u32 hash_rnd;
int hash_rnd_recalc;
@@ -63,6 +66,9 @@ atomic_t flow_cache_genid = ATOMIC_INIT(0);
static struct flow_cache flow_cache_global;
static struct kmem_cache *flow_cachep;
+static DEFINE_SPINLOCK(flow_cache_gc_lock);
+static LIST_HEAD(flow_cache_gc_list);
+
#define flow_cache_hash_size(cache) (1 << (cache)->hash_shift)
#define FLOW_HASH_RND_PERIOD (10 * 60 * HZ)
@@ -78,36 +84,62 @@ static void flow_cache_new_hashrnd(unsigned long arg)
add_timer(&fc->rnd_timer);
}
-static void flow_entry_kill(struct flow_cache *fc,
- struct flow_cache_percpu *fcp,
- struct flow_cache_entry *fle)
+static void flow_entry_kill(struct flow_cache_entry *fle)
{
if (fle->object)
atomic_dec(fle->object_ref);
kmem_cache_free(flow_cachep, fle);
- fcp->hash_count--;
}
+static void flow_cache_gc_task(struct work_struct *work)
+{
+ struct list_head gc_list;
+ struct flow_cache_entry *fce, *n;
+
+ INIT_LIST_HEAD(&gc_list);
+ spin_lock_bh(&flow_cache_gc_lock);
+ list_splice_tail_init(&flow_cache_gc_list, &gc_list);
+ spin_unlock_bh(&flow_cache_gc_lock);
+
+ list_for_each_entry_safe(fce, n, &gc_list, u.gc_list)
+ flow_entry_kill(fce);
+}
+static DECLARE_WORK(flow_cache_gc_work, flow_cache_gc_task);
+
static void __flow_cache_shrink(struct flow_cache *fc,
struct flow_cache_percpu *fcp,
int shrink_to)
{
- struct flow_cache_entry *fle, **flp;
- int i;
+ struct flow_cache_entry *fce;
+ struct hlist_node *entry, *tmp;
+ struct list_head gc_list;
+ int i, deleted = 0;
+ INIT_LIST_HEAD(&gc_list);
for (i = 0; i < flow_cache_hash_size(fc); i++) {
- int k = 0;
-
- flp = &fcp->hash_table[i];
- while ((fle = *flp) != NULL && k < shrink_to) {
- k++;
- flp = &fle->next;
- }
- while ((fle = *flp) != NULL) {
- *flp = fle->next;
- flow_entry_kill(fc, fcp, fle);
+ int saved = 0;
+
+ hlist_for_each_entry_safe(fce, entry, tmp,
+ &fcp->hash_table[i], u.hlist) {
+ if (saved < shrink_to) {
+ saved++;
+ } else {
+ deleted++;
+ hlist_del(&fce->u.hlist);
+ list_add_tail(&fce->u.gc_list, &gc_list);
+ }
}
}
+
+ if (deleted) {
+ fcp->hash_count -= deleted;
+
+ spin_lock_bh(&flow_cache_gc_lock);
+ list_splice_tail(&gc_list, &flow_cache_gc_list);
+ spin_unlock_bh(&flow_cache_gc_lock);
+
+ schedule_work(&flow_cache_gc_work);
+ }
}
static void flow_cache_shrink(struct flow_cache *fc,
@@ -171,7 +203,8 @@ void *flow_cache_lookup(struct net *net, struct flowi *key, u16 family, u8 dir,
{
struct flow_cache *fc = &flow_cache_global;
struct flow_cache_percpu *fcp;
- struct flow_cache_entry *fle, **head;
+ struct flow_cache_entry *fle;
+ struct hlist_node *entry;
unsigned int hash;
local_bh_disable();
@@ -187,8 +220,7 @@ void *flow_cache_lookup(struct net *net, struct flowi *key, u16 family, u8 dir,
flow_new_hash_rnd(fc, fcp);
hash = flow_hash_code(fc, fcp, key);
- head = &fcp->hash_table[hash];
- for (fle = *head; fle; fle = fle->next) {
+ hlist_for_each_entry(fle, entry, &fcp->hash_table[hash], u.hlist) {
if (fle->family == family &&
fle->dir == dir &&
flow_key_compare(key, &fle->key) == 0) {
@@ -211,12 +243,12 @@ void *flow_cache_lookup(struct net *net, struct flowi *key, u16 family, u8 dir,
fle = kmem_cache_alloc(flow_cachep, GFP_ATOMIC);
if (fle) {
- fle->next = *head;
- *head = fle;
fle->family = family;
fle->dir = dir;
memcpy(&fle->key, key, sizeof(*key));
fle->object = NULL;
+
+ hlist_add_head(&fle->u.hlist, &fcp->hash_table[hash]);
fcp->hash_count++;
}
}
@@ -253,24 +285,38 @@ static void flow_cache_flush_tasklet(unsigned long data)
struct flow_flush_info *info = (void *)data;
struct flow_cache *fc = info->cache;
struct flow_cache_percpu *fcp;
- int i;
+ struct flow_cache_entry *fle;
+ struct hlist_node *entry, *tmp;
+ struct list_head gc_list;
+ int i, deleted = 0;
+ unsigned genid;
+ INIT_LIST_HEAD(&gc_list);
fcp = per_cpu_ptr(fc->percpu, smp_processor_id());
for (i = 0; i < flow_cache_hash_size(fc); i++) {
- struct flow_cache_entry *fle;
-
- fle = fcp->hash_table[i];
- for (; fle; fle = fle->next) {
- unsigned genid = atomic_read(&flow_cache_genid);
+ hlist_for_each_entry_safe(fle, entry, tmp,
+ &fcp->hash_table[i], u.hlist) {
+ genid = atomic_read(&flow_cache_genid);
if (!fle->object || fle->genid == genid)
continue;
- fle->object = NULL;
- atomic_dec(fle->object_ref);
+ deleted++;
+ hlist_del(&fle->u.hlist);
+ list_add_tail(&fle->u.gc_list, &gc_list);
}
}
+ if (deleted) {
+ fcp->hash_count -= deleted;
+
+ spin_lock_bh(&flow_cache_gc_lock);
+ list_splice_tail(&gc_list, &flow_cache_gc_list);
+ spin_unlock_bh(&flow_cache_gc_lock);
+
+ schedule_work(&flow_cache_gc_work);
+ }
+
if (atomic_dec_and_test(&info->cpuleft))
complete(&info->completion);
}
@@ -312,7 +358,7 @@ void flow_cache_flush(void)
static void __cpuinit flow_cache_cpu_prepare(struct flow_cache *fc,
struct flow_cache_percpu *fcp)
{
- fcp->hash_table = (struct flow_cache_entry **)
+ fcp->hash_table = (struct hlist_head *)
__get_free_pages(GFP_KERNEL|__GFP_ZERO, fc->order);
fcp->hash_rnd_recalc = 1;
fcp->hash_count = 0;
@@ -359,7 +405,7 @@ static int flow_cache_init(struct flow_cache *fc)
for (order = 0;
(PAGE_SIZE << order) <
- (sizeof(struct flow_cache_entry *)*flow_cache_hash_size(fc));
+ (sizeof(struct hlist_head)*flow_cache_hash_size(fc));
order++)
/* NOTHING */;
fc->order = order;
--
1.6.3.3
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [PATCH 4/7] flow: delayed deletion of flow cache entries
2010-03-29 14:12 ` [PATCH 4/7] flow: delayed deletion of flow cache entries Timo Teras
@ 2010-03-30 12:22 ` Herbert Xu
2010-03-30 12:32 ` Timo Teräs
0 siblings, 1 reply; 35+ messages in thread
From: Herbert Xu @ 2010-03-30 12:22 UTC (permalink / raw)
To: Timo Teras; +Cc: netdev
On Mon, Mar 29, 2010 at 05:12:41PM +0300, Timo Teras wrote:
> Speed up lookups by freeing flow cache entries later. This is also in
> preparation to have virtual entry destructor that might do more
> work.
So how does this speed up lookups exactly?
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 4/7] flow: delayed deletion of flow cache entries
2010-03-30 12:22 ` Herbert Xu
@ 2010-03-30 12:32 ` Timo Teräs
2010-03-30 12:36 ` Herbert Xu
0 siblings, 1 reply; 35+ messages in thread
From: Timo Teräs @ 2010-03-30 12:32 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev
Herbert Xu wrote:
> On Mon, Mar 29, 2010 at 05:12:41PM +0300, Timo Teras wrote:
>> Speed up lookups by freeing flow cache entries later. This is also in
>> preparation to have virtual entry destructor that might do more
>> work.
>
> So how does this speed up lookups exactly?
If flow cache regeneration or shrinking is triggered in lookup,
it would previously free it in place. Now that is deferred. But
yes, it's more useful after the next patches that call the
virtual destructor. Should have explained this better.
As said in the general description, patches 4-7 go together
and still have some problem cases. But they should show where I'm
trying to go.
I'd be interested to hear if the idea of patches 4-7 is good
or whether we could do things somehow better.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 4/7] flow: delayed deletion of flow cache entries
2010-03-30 12:32 ` Timo Teräs
@ 2010-03-30 12:36 ` Herbert Xu
2010-03-30 12:43 ` Timo Teräs
0 siblings, 1 reply; 35+ messages in thread
From: Herbert Xu @ 2010-03-30 12:36 UTC (permalink / raw)
To: Timo Teräs; +Cc: netdev
On Tue, Mar 30, 2010 at 03:32:58PM +0300, Timo Teräs wrote:
> If flow cache regeneration or shrinking is triggered in lookup,
> it would previously free it in place. Now that is deferred. But
> yes, it's more useful after the next patches that call the
> virtual destructor. Should have explained this better.
Any chance you can refactor them so that this comes after the
virtual get/put patch?
That way we can evaluate this on its own merit rather than being
a prerequisite for the more important stuff.
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 4/7] flow: delayed deletion of flow cache entries
2010-03-30 12:36 ` Herbert Xu
@ 2010-03-30 12:43 ` Timo Teräs
0 siblings, 0 replies; 35+ messages in thread
From: Timo Teräs @ 2010-03-30 12:43 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev
Herbert Xu wrote:
> On Tue, Mar 30, 2010 at 03:32:58PM +0300, Timo Teräs wrote:
>> If flow cache regeneration or shrinking is triggered in lookup,
>> it would previously free it in place. Now that is deferred. But
>> yes, it's more useful after the next patches that call the
>> virtual destructor. Should have explained this better.
>
> Any chance you can refactor them so that this comes after the
> virtual get/put patch?
>
> That way we can evaluate this on its own merit rather than being
> a prerequisite for the more important stuff.
I thought it's not good to have possible speed regressions even
temporarily in the tree, so I figured this should go first.
But sure, I'll refactor this to be a later commit for the next
iteration.
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH 5/7] flow: virtualize get and entry deletion methods
2010-03-29 14:12 [PATCH 0/7] caching bundles, iteration 2 Timo Teras
` (3 preceding siblings ...)
2010-03-29 14:12 ` [PATCH 4/7] flow: delayed deletion of flow cache entries Timo Teras
@ 2010-03-29 14:12 ` Timo Teras
2010-03-29 14:12 ` [PATCH 6/7] xfrm: cache bundles instead of policies for outgoing flows Timo Teras
2010-03-29 14:12 ` [PATCH 7/7] xfrm: remove policy garbage collection Timo Teras
6 siblings, 0 replies; 35+ messages in thread
From: Timo Teras @ 2010-03-29 14:12 UTC (permalink / raw)
To: netdev; +Cc: Herbert Xu, Timo Teras
This allows validating the cached object before returning it.
It also allows destroying the object properly if the last reference
was held in the flow cache. This is also a preparation for caching
bundles in the flow cache.
In return for virtualizing the methods, we save on:
- not having to regenerate the whole flow cache on policy removal:
each flow matching a killed policy gets refreshed as the getter
function notices it smartly.
- we do not have to call flow_cache_flush from policy gc, since the
flow cache now properly deletes the object if it had any references
Signed-off-by: Timo Teras <timo.teras@iki.fi>
---
include/net/flow.h | 18 ++++++--
include/net/xfrm.h | 2 +
net/core/flow.c | 112 +++++++++++++++++++++++++++---------------------
net/xfrm/xfrm_policy.c | 111 +++++++++++++++++++++++++++++++----------------
4 files changed, 153 insertions(+), 90 deletions(-)
diff --git a/include/net/flow.h b/include/net/flow.h
index 809970b..f462325 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -86,11 +86,21 @@ struct flowi {
struct net;
struct sock;
-typedef int (*flow_resolve_t)(struct net *net, struct flowi *key, u16 family,
- u8 dir, void **objp, atomic_t **obj_refp);
-extern void *flow_cache_lookup(struct net *net, struct flowi *key, u16 family,
- u8 dir, flow_resolve_t resolver);
+struct flow_cache_entry_ops {
+ struct flow_cache_entry_ops ** (*get)(struct flow_cache_entry_ops **);
+ int (*check)(struct flow_cache_entry_ops **);
+ void (*delete)(struct flow_cache_entry_ops **);
+};
+
+typedef struct flow_cache_entry_ops **(*flow_resolve_t)(
+ struct net *net, struct flowi *key, u16 family,
+ u8 dir, struct flow_cache_entry_ops **old_ops);
+
+extern struct flow_cache_entry_ops **flow_cache_lookup(
+ struct net *net, struct flowi *key, u16 family,
+ u8 dir, flow_resolve_t resolver);
+
extern void flow_cache_flush(void);
extern atomic_t flow_cache_genid;
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index d74e080..cb8934b 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -19,6 +19,7 @@
#include <net/route.h>
#include <net/ipv6.h>
#include <net/ip6_fib.h>
+#include <net/flow.h>
#include <linux/interrupt.h>
@@ -481,6 +482,7 @@ struct xfrm_policy {
atomic_t refcnt;
struct timer_list timer;
+ struct flow_cache_entry_ops *fc_ops;
u32 priority;
u32 index;
struct xfrm_mark mark;
diff --git a/net/core/flow.c b/net/core/flow.c
index 760f93d..0245455 100644
--- a/net/core/flow.c
+++ b/net/core/flow.c
@@ -30,12 +30,11 @@ struct flow_cache_entry {
struct hlist_node hlist;
struct list_head gc_list;
} u;
- u16 family;
- u8 dir;
- u32 genid;
- struct flowi key;
- void *object;
- atomic_t *object_ref;
+ u16 family;
+ u8 dir;
+ u32 genid;
+ struct flowi key;
+ struct flow_cache_entry_ops **ops;
};
struct flow_cache_percpu {
@@ -84,10 +83,19 @@ static void flow_cache_new_hashrnd(unsigned long arg)
add_timer(&fc->rnd_timer);
}
+static int flow_entry_valid(struct flow_cache_entry *fle)
+{
+ if (atomic_read(&flow_cache_genid) != fle->genid)
+ return 0;
+ if (fle->ops && !(*fle->ops)->check(fle->ops))
+ return 0;
+ return 1;
+}
+
static void flow_entry_kill(struct flow_cache_entry *fle)
{
- if (fle->object)
- atomic_dec(fle->object_ref);
+ if (fle->ops)
+ (*fle->ops)->delete(fle->ops);
kmem_cache_free(flow_cachep, fle);
}
@@ -121,7 +129,7 @@ static void __flow_cache_shrink(struct flow_cache *fc,
hlist_for_each_entry_safe(fce, entry, tmp,
&fcp->hash_table[i], u.hlist) {
- if (saved < shrink_to) {
+ if (saved < shrink_to && flow_entry_valid(fce)) {
saved++;
} else {
deleted++;
@@ -198,19 +206,22 @@ static int flow_key_compare(struct flowi *key1, struct flowi *key2)
return 0;
}
-void *flow_cache_lookup(struct net *net, struct flowi *key, u16 family, u8 dir,
- flow_resolve_t resolver)
+struct flow_cache_entry_ops **flow_cache_lookup(
+ struct net *net, struct flowi *key, u16 family, u8 dir,
+ flow_resolve_t resolver)
{
struct flow_cache *fc = &flow_cache_global;
struct flow_cache_percpu *fcp;
struct flow_cache_entry *fle;
struct hlist_node *entry;
+ struct flow_cache_entry_ops **ops;
unsigned int hash;
local_bh_disable();
fcp = per_cpu_ptr(fc->percpu, smp_processor_id());
fle = NULL;
+ ops = NULL;
/* Packet really early in init? Making flow_cache_init a
* pre-smp initcall would solve this. --RR */
if (!fcp->hash_table)
@@ -221,32 +232,46 @@ void *flow_cache_lookup(struct net *net, struct flowi *key, u16 family, u8 dir,
hash = flow_hash_code(fc, fcp, key);
hlist_for_each_entry(fle, entry, &fcp->hash_table[hash], u.hlist) {
- if (fle->family == family &&
- fle->dir == dir &&
- flow_key_compare(key, &fle->key) == 0) {
- if (fle->genid == atomic_read(&flow_cache_genid)) {
- void *ret = fle->object;
+ if (fle->family != family ||
+ fle->dir != dir ||
+ flow_key_compare(key, &fle->key) != 0)
+ continue;
+
+ ops = fle->ops;
+ if (fle->genid != atomic_read(&flow_cache_genid)) {
+ if (ops)
+ (*ops)->delete(ops);
+ fle->ops = NULL;
+ ops = NULL;
+ break;
+ }
- if (ret)
- atomic_inc(fle->object_ref);
- local_bh_enable();
+ if (!ops) {
+ local_bh_enable();
+ return NULL;
+ }
- return ret;
- }
- break;
+ ops = (*ops)->get(ops);
+ if (ops) {
+ local_bh_enable();
+ return ops;
}
+
+ ops = fle->ops;
+ break;
}
if (!fle) {
if (fcp->hash_count > fc->high_watermark)
flow_cache_shrink(fc, fcp);
+ ops = NULL;
fle = kmem_cache_alloc(flow_cachep, GFP_ATOMIC);
if (fle) {
fle->family = family;
fle->dir = dir;
memcpy(&fle->key, key, sizeof(*key));
- fle->object = NULL;
+ fle->ops = NULL;
hlist_add_head(&fle->u.hlist, &fcp->hash_table[hash]);
fcp->hash_count++;
@@ -254,30 +279,22 @@ void *flow_cache_lookup(struct net *net, struct flowi *key, u16 family, u8 dir,
}
nocache:
- {
- int err;
- void *obj;
- atomic_t *obj_ref;
-
- err = resolver(net, key, family, dir, &obj, &obj_ref);
-
- if (fle && !err) {
- fle->genid = atomic_read(&flow_cache_genid);
-
- if (fle->object)
- atomic_dec(fle->object_ref);
-
- fle->object = obj;
- fle->object_ref = obj_ref;
- if (obj)
- atomic_inc(fle->object_ref);
+ ops = resolver(net, key, family, dir, ops);
+ if (fle) {
+ fle->genid = atomic_read(&flow_cache_genid);
+ if (IS_ERR(ops)) {
+ fle->genid--;
+ fle->ops = NULL;
+ } else {
+ fle->ops = ops;
}
- local_bh_enable();
-
- if (err)
- obj = ERR_PTR(err);
- return obj;
+ } else {
+ if (ops && !IS_ERR(ops))
+ (*ops)->delete(ops);
}
+ local_bh_enable();
+
+ return ops;
}
static void flow_cache_flush_tasklet(unsigned long data)
@@ -289,16 +306,13 @@ static void flow_cache_flush_tasklet(unsigned long data)
struct hlist_node *entry, *tmp;
struct list_head gc_list;
int i, deleted = 0;
- unsigned genid;
INIT_LIST_HEAD(&gc_list);
fcp = per_cpu_ptr(fc->percpu, smp_processor_id());
for (i = 0; i < flow_cache_hash_size(fc); i++) {
hlist_for_each_entry_safe(fle, entry, tmp,
&fcp->hash_table[i], u.hlist) {
- genid = atomic_read(&flow_cache_genid);
-
- if (!fle->object || fle->genid == genid)
+ if (flow_entry_valid(fle))
continue;
deleted++;
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 595d347..cd9f2bc 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -216,6 +216,36 @@ expired:
xfrm_pol_put(xp);
}
+static struct flow_cache_entry_ops **xfrm_policy_get_fce(
+ struct flow_cache_entry_ops **ops)
+{
+ struct xfrm_policy *pol = container_of(ops, struct xfrm_policy, fc_ops);
+
+ if (unlikely(pol->walk.dead))
+ ops = NULL;
+ else
+ xfrm_pol_hold(pol);
+
+ return ops;
+}
+
+static int xfrm_policy_check_fce(struct flow_cache_entry_ops **ops)
+{
+ struct xfrm_policy *pol = container_of(ops, struct xfrm_policy, fc_ops);
+
+ return !pol->walk.dead;
+}
+
+static void xfrm_policy_delete_fce(struct flow_cache_entry_ops **ops)
+{
+ xfrm_pol_put(container_of(ops, struct xfrm_policy, fc_ops));
+}
+
+static struct flow_cache_entry_ops xfrm_policy_fc_ops __read_mostly = {
+ .get = xfrm_policy_get_fce,
+ .check = xfrm_policy_check_fce,
+ .delete = xfrm_policy_delete_fce,
+};
/* Allocate xfrm_policy. Not used here, it is supposed to be used by pfkeyv2
* SPD calls.
@@ -236,6 +266,7 @@ struct xfrm_policy *xfrm_policy_alloc(struct net *net, gfp_t gfp)
atomic_set(&policy->refcnt, 1);
setup_timer(&policy->timer, xfrm_policy_timer,
(unsigned long)policy);
+ policy->fc_ops = &xfrm_policy_fc_ops;
}
return policy;
}
@@ -269,9 +300,6 @@ static void xfrm_policy_gc_kill(struct xfrm_policy *policy)
if (del_timer(&policy->timer))
atomic_dec(&policy->refcnt);
- if (atomic_read(&policy->refcnt) > 1)
- flow_cache_flush();
-
xfrm_pol_put(policy);
}
@@ -661,10 +689,8 @@ struct xfrm_policy *xfrm_policy_bysel_ctx(struct net *net, u32 mark, u8 type,
}
write_unlock_bh(&xfrm_policy_lock);
- if (ret && delete) {
- atomic_inc(&flow_cache_genid);
+ if (ret && delete)
xfrm_policy_kill(ret);
- }
return ret;
}
EXPORT_SYMBOL(xfrm_policy_bysel_ctx);
@@ -703,10 +729,8 @@ struct xfrm_policy *xfrm_policy_byid(struct net *net, u32 mark, u8 type,
}
write_unlock_bh(&xfrm_policy_lock);
- if (ret && delete) {
- atomic_inc(&flow_cache_genid);
+ if (ret && delete)
xfrm_policy_kill(ret);
- }
return ret;
}
EXPORT_SYMBOL(xfrm_policy_byid);
@@ -822,7 +846,6 @@ int xfrm_policy_flush(struct net *net, u8 type, struct xfrm_audit *audit_info)
}
if (!cnt)
err = -ESRCH;
- atomic_inc(&flow_cache_genid);
out:
write_unlock_bh(&xfrm_policy_lock);
return err;
@@ -976,32 +999,35 @@ fail:
return ret;
}
-static int xfrm_policy_lookup(struct net *net, struct flowi *fl, u16 family,
- u8 dir, void **objp, atomic_t **obj_refp)
+static struct flow_cache_entry_ops **xfrm_policy_lookup(
+ struct net *net, struct flowi *fl, u16 family,
+ u8 dir, struct flow_cache_entry_ops **old_ops)
{
struct xfrm_policy *pol;
- int err = 0;
+
+ if (old_ops)
+ xfrm_pol_put(container_of(old_ops, struct xfrm_policy, fc_ops));
#ifdef CONFIG_XFRM_SUB_POLICY
pol = xfrm_policy_lookup_bytype(net, XFRM_POLICY_TYPE_SUB, fl, family, dir);
- if (IS_ERR(pol)) {
- err = PTR_ERR(pol);
- pol = NULL;
- }
- if (pol || err)
- goto end;
+ if (IS_ERR(pol))
+ return (void *) pol;
+ if (pol)
+ goto found;
#endif
pol = xfrm_policy_lookup_bytype(net, XFRM_POLICY_TYPE_MAIN, fl, family, dir);
- if (IS_ERR(pol)) {
- err = PTR_ERR(pol);
- pol = NULL;
- }
-#ifdef CONFIG_XFRM_SUB_POLICY
-end:
-#endif
- if ((*objp = (void *) pol) != NULL)
- *obj_refp = &pol->refcnt;
- return err;
+ if (IS_ERR(pol))
+ return (void *) pol;
+ if (pol)
+ goto found;
+ return NULL;
+
+found:
+ /* Resolver returns two references:
+ * one for cache and one for caller of flow_cache_lookup() */
+ xfrm_pol_hold(pol);
+
+ return &pol->fc_ops;
}
static inline int policy_to_flow_dir(int dir)
@@ -1091,8 +1117,6 @@ int xfrm_policy_delete(struct xfrm_policy *pol, int dir)
pol = __xfrm_policy_unlink(pol, dir);
write_unlock_bh(&xfrm_policy_lock);
if (pol) {
- if (dir < XFRM_POLICY_MAX)
- atomic_inc(&flow_cache_genid);
xfrm_policy_kill(pol);
return 0;
}
@@ -1575,18 +1599,24 @@ restart:
}
if (!policy) {
+ struct flow_cache_entry_ops **ops;
+
/* To accelerate a bit... */
if ((dst_orig->flags & DST_NOXFRM) ||
!net->xfrm.policy_count[XFRM_POLICY_OUT])
goto nopol;
- policy = flow_cache_lookup(net, fl, dst_orig->ops->family,
- dir, xfrm_policy_lookup);
- err = PTR_ERR(policy);
- if (IS_ERR(policy)) {
+ ops = flow_cache_lookup(net, fl, dst_orig->ops->family,
+ dir, xfrm_policy_lookup);
+ err = PTR_ERR(ops);
+ if (IS_ERR(ops)) {
XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTPOLERROR);
goto dropdst;
}
+ if (ops)
+ policy = container_of(ops, struct xfrm_policy, fc_ops);
+ else
+ policy = NULL;
}
if (!policy)
@@ -1936,9 +1966,16 @@ int __xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb,
}
}
- if (!pol)
- pol = flow_cache_lookup(net, &fl, family, fl_dir,
+ if (!pol) {
+ struct flow_cache_entry_ops **ops;
+
+ ops = flow_cache_lookup(net, &fl, family, fl_dir,
xfrm_policy_lookup);
+ if (IS_ERR(ops))
+ pol = (void *) ops;
+ else if (ops)
+ pol = container_of(ops, struct xfrm_policy, fc_ops);
+ }
if (IS_ERR(pol)) {
XFRM_INC_STATS(net, LINUX_MIB_XFRMINPOLERROR);
--
1.6.3.3
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 6/7] xfrm: cache bundles instead of policies for outgoing flows
2010-03-29 14:12 [PATCH 0/7] caching bundles, iteration 2 Timo Teras
` (4 preceding siblings ...)
2010-03-29 14:12 ` [PATCH 5/7] flow: virtualize get and entry deletion methods Timo Teras
@ 2010-03-29 14:12 ` Timo Teras
2010-03-29 14:12 ` [PATCH 7/7] xfrm: remove policy garbage collection Timo Teras
6 siblings, 0 replies; 35+ messages in thread
From: Timo Teras @ 2010-03-29 14:12 UTC (permalink / raw)
To: netdev; +Cc: Herbert Xu, Timo Teras
__xfrm_lookup() is called for each packet transmitted out of the
system. xfrm_find_bundle() does a linear search, which can kill
system performance depending on how many bundles are required
per policy.
This modifies __xfrm_lookup() to store bundles directly in the
flow cache. If we do not get a hit, we just create a new bundle
instead of doing the slow search. This means that we can now get
multiple xfrm_dst's for the same flow (on a per-cpu basis).
BUGGY: Currently leaks bundles if a per-socket policy
requires a bundle.
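As a rough stand-alone sketch of the new fast path (all names here,
struct flow_key, lookup(), make_bundle(), are hypothetical and not the
kernel API): the flow key indexes a small cache of bundles, a miss
resolves policies and builds a fresh bundle in place, and no linear
walk over a per-policy bundle list is needed:

/* Illustration only: a toy "bundle cache keyed by flow".  The real code
 * keeps one such table per CPU inside the generic flow cache, and the
 * validity flag stands in for stale_bundle()/genid comparisons. */
#include <stdio.h>

#define CACHE_SIZE 16

struct flow_key {
	unsigned int saddr, daddr;
	unsigned short dport;
};

struct bundle {
	struct flow_key key;
	int valid;		/* stand-in for stale_bundle()/genid checks */
	int num_xfrms;		/* 0 would mean "flow passes untransformed" */
};

static struct bundle cache[CACHE_SIZE];

static unsigned int hash(const struct flow_key *k)
{
	return (k->saddr ^ k->daddr ^ k->dport) % CACHE_SIZE;
}

static int key_equal(const struct flow_key *a, const struct flow_key *b)
{
	return a->saddr == b->saddr && a->daddr == b->daddr &&
	       a->dport == b->dport;
}

/* Stand-in for policy resolution + xfrm_bundle_create(). */
static void make_bundle(struct bundle *b, const struct flow_key *k)
{
	b->key = *k;
	b->valid = 1;
	b->num_xfrms = 1;
	printf("created new bundle for flow %u -> %u\n", k->saddr, k->daddr);
}

static struct bundle *lookup(const struct flow_key *k)
{
	struct bundle *b = &cache[hash(k)];

	/* Hit only if the slot holds a still-valid bundle for this key;
	 * otherwise rebuild it in place instead of walking a list. */
	if (!b->valid || !key_equal(&b->key, k))
		make_bundle(b, k);
	return b;
}

int main(void)
{
	struct flow_key k = { .saddr = 1, .daddr = 2, .dport = 443 };

	lookup(&k);	/* miss: resolves policies, builds the bundle */
	lookup(&k);	/* hit: no policy walk, no linear bundle search */
	return 0;
}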
Signed-off-by: Timo Teras <timo.teras@iki.fi>
---
include/net/flow.h | 6 +-
include/net/xfrm.h | 10 +-
net/core/flow.c | 4 +-
net/ipv4/xfrm4_policy.c | 22 --
net/ipv6/xfrm6_policy.c | 31 --
net/xfrm/xfrm_policy.c | 697 ++++++++++++++++++++++++-----------------------
6 files changed, 370 insertions(+), 400 deletions(-)
diff --git a/include/net/flow.h b/include/net/flow.h
index f462325..9e97dfa 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -94,12 +94,12 @@ struct flow_cache_entry_ops {
};
typedef struct flow_cache_entry_ops **(*flow_resolve_t)(
- struct net *net, struct flowi *key, u16 family,
- u8 dir, struct flow_cache_entry_ops **old_ops);
+ struct net *net, struct flowi *key, u16 family, u8 dir,
+ struct flow_cache_entry_ops **old_ops, void *ctx);
extern struct flow_cache_entry_ops **flow_cache_lookup(
struct net *net, struct flowi *key, u16 family,
- u8 dir, flow_resolve_t resolver);
+ u8 dir, flow_resolve_t resolver, void *ctx);
extern void flow_cache_flush(void);
extern atomic_t flow_cache_genid;
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index cb8934b..6ce593f 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -267,7 +267,6 @@ struct xfrm_policy_afinfo {
xfrm_address_t *saddr,
xfrm_address_t *daddr);
int (*get_saddr)(struct net *net, xfrm_address_t *saddr, xfrm_address_t *daddr);
- struct dst_entry *(*find_bundle)(struct flowi *fl, struct xfrm_policy *policy);
void (*decode_session)(struct sk_buff *skb,
struct flowi *fl,
int reverse);
@@ -483,13 +482,13 @@ struct xfrm_policy {
struct timer_list timer;
struct flow_cache_entry_ops *fc_ops;
+ atomic_t genid;
u32 priority;
u32 index;
struct xfrm_mark mark;
struct xfrm_selector selector;
struct xfrm_lifetime_cfg lft;
struct xfrm_lifetime_cur curlft;
- struct dst_entry *bundles;
struct xfrm_policy_walk_entry walk;
u8 type;
u8 action;
@@ -879,11 +878,15 @@ struct xfrm_dst {
struct rt6_info rt6;
} u;
struct dst_entry *route;
+ struct flow_cache_entry_ops *fc_ops;
+ struct xfrm_policy *pols[XFRM_POLICY_TYPE_MAX];
+ int num_pols, num_xfrms;
#ifdef CONFIG_XFRM_SUB_POLICY
struct flowi *origin;
struct xfrm_selector *partner;
#endif
- u32 genid;
+ u32 xfrm_genid;
+ u32 policy_genid;
u32 route_mtu_cached;
u32 child_mtu_cached;
u32 route_cookie;
@@ -893,6 +896,7 @@ struct xfrm_dst {
#ifdef CONFIG_XFRM
static inline void xfrm_dst_destroy(struct xfrm_dst *xdst)
{
+ xfrm_pols_put(xdst->pols, xdst->num_pols);
dst_release(xdst->route);
if (likely(xdst->u.dst.xfrm))
xfrm_state_put(xdst->u.dst.xfrm);
diff --git a/net/core/flow.c b/net/core/flow.c
index 0245455..d3a27cc 100644
--- a/net/core/flow.c
+++ b/net/core/flow.c
@@ -208,7 +208,7 @@ static int flow_key_compare(struct flowi *key1, struct flowi *key2)
struct flow_cache_entry_ops **flow_cache_lookup(
struct net *net, struct flowi *key, u16 family, u8 dir,
- flow_resolve_t resolver)
+ flow_resolve_t resolver, void *ctx)
{
struct flow_cache *fc = &flow_cache_global;
struct flow_cache_percpu *fcp;
@@ -279,7 +279,7 @@ struct flow_cache_entry_ops **flow_cache_lookup(
}
nocache:
- ops = resolver(net, key, family, dir, ops);
+ ops = resolver(net, key, family, dir, ops, ctx);
if (fle) {
fle->genid = atomic_read(&flow_cache_genid);
if (IS_ERR(ops)) {
diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index e4a1483..1705476 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -59,27 +59,6 @@ static int xfrm4_get_saddr(struct net *net,
return 0;
}
-static struct dst_entry *
-__xfrm4_find_bundle(struct flowi *fl, struct xfrm_policy *policy)
-{
- struct dst_entry *dst;
-
- read_lock_bh(&policy->lock);
- for (dst = policy->bundles; dst; dst = dst->next) {
- struct xfrm_dst *xdst = (struct xfrm_dst *)dst;
- if (xdst->u.rt.fl.oif == fl->oif && /*XXX*/
- xdst->u.rt.fl.fl4_dst == fl->fl4_dst &&
- xdst->u.rt.fl.fl4_src == fl->fl4_src &&
- xdst->u.rt.fl.fl4_tos == fl->fl4_tos &&
- xfrm_bundle_ok(policy, xdst, fl, AF_INET, 0)) {
- dst_clone(dst);
- break;
- }
- }
- read_unlock_bh(&policy->lock);
- return dst;
-}
-
static int xfrm4_get_tos(struct flowi *fl)
{
return fl->fl4_tos;
@@ -259,7 +238,6 @@ static struct xfrm_policy_afinfo xfrm4_policy_afinfo = {
.dst_ops = &xfrm4_dst_ops,
.dst_lookup = xfrm4_dst_lookup,
.get_saddr = xfrm4_get_saddr,
- .find_bundle = __xfrm4_find_bundle,
.decode_session = _decode_session4,
.get_tos = xfrm4_get_tos,
.init_path = xfrm4_init_path,
diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index ae18165..8c452fd 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -67,36 +67,6 @@ static int xfrm6_get_saddr(struct net *net,
return 0;
}
-static struct dst_entry *
-__xfrm6_find_bundle(struct flowi *fl, struct xfrm_policy *policy)
-{
- struct dst_entry *dst;
-
- /* Still not clear if we should set fl->fl6_{src,dst}... */
- read_lock_bh(&policy->lock);
- for (dst = policy->bundles; dst; dst = dst->next) {
- struct xfrm_dst *xdst = (struct xfrm_dst*)dst;
- struct in6_addr fl_dst_prefix, fl_src_prefix;
-
- ipv6_addr_prefix(&fl_dst_prefix,
- &fl->fl6_dst,
- xdst->u.rt6.rt6i_dst.plen);
- ipv6_addr_prefix(&fl_src_prefix,
- &fl->fl6_src,
- xdst->u.rt6.rt6i_src.plen);
- if (ipv6_addr_equal(&xdst->u.rt6.rt6i_dst.addr, &fl_dst_prefix) &&
- ipv6_addr_equal(&xdst->u.rt6.rt6i_src.addr, &fl_src_prefix) &&
- xfrm_bundle_ok(policy, xdst, fl, AF_INET6,
- (xdst->u.rt6.rt6i_dst.plen != 128 ||
- xdst->u.rt6.rt6i_src.plen != 128))) {
- dst_clone(dst);
- break;
- }
- }
- read_unlock_bh(&policy->lock);
- return dst;
-}
-
static int xfrm6_get_tos(struct flowi *fl)
{
return 0;
@@ -291,7 +261,6 @@ static struct xfrm_policy_afinfo xfrm6_policy_afinfo = {
.dst_ops = &xfrm6_dst_ops,
.dst_lookup = xfrm6_dst_lookup,
.get_saddr = xfrm6_get_saddr,
- .find_bundle = __xfrm6_find_bundle,
.decode_session = _decode_session6,
.get_tos = xfrm6_get_tos,
.init_path = xfrm6_init_path,
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index cd9f2bc..11fb48d 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -50,6 +50,7 @@ static DEFINE_SPINLOCK(xfrm_policy_gc_lock);
static struct xfrm_policy_afinfo *xfrm_policy_get_afinfo(unsigned short family);
static void xfrm_policy_put_afinfo(struct xfrm_policy_afinfo *afinfo);
static void xfrm_init_pmtu(struct dst_entry *dst);
+static int stale_bundle(struct dst_entry *dst);
static struct xfrm_policy *__xfrm_policy_unlink(struct xfrm_policy *pol,
int dir);
@@ -278,8 +279,6 @@ void xfrm_policy_destroy(struct xfrm_policy *policy)
{
BUG_ON(!policy->walk.dead);
- BUG_ON(policy->bundles);
-
if (del_timer(&policy->timer))
BUG();
@@ -290,12 +289,7 @@ EXPORT_SYMBOL(xfrm_policy_destroy);
static void xfrm_policy_gc_kill(struct xfrm_policy *policy)
{
- struct dst_entry *dst;
-
- while ((dst = policy->bundles) != NULL) {
- policy->bundles = dst->next;
- dst_free(dst);
- }
+ atomic_inc(&policy->genid);
if (del_timer(&policy->timer))
atomic_dec(&policy->refcnt);
@@ -573,7 +567,6 @@ int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl)
struct xfrm_policy *delpol;
struct hlist_head *chain;
struct hlist_node *entry, *newpos;
- struct dst_entry *gc_list;
u32 mark = policy->mark.v & policy->mark.m;
write_lock_bh(&xfrm_policy_lock);
@@ -624,33 +617,11 @@ int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl)
schedule_work(&net->xfrm.policy_hash_work);
read_lock_bh(&xfrm_policy_lock);
- gc_list = NULL;
entry = &policy->bydst;
- hlist_for_each_entry_continue(policy, entry, bydst) {
- struct dst_entry *dst;
-
- write_lock(&policy->lock);
- dst = policy->bundles;
- if (dst) {
- struct dst_entry *tail = dst;
- while (tail->next)
- tail = tail->next;
- tail->next = gc_list;
- gc_list = dst;
-
- policy->bundles = NULL;
- }
- write_unlock(&policy->lock);
- }
+ hlist_for_each_entry_continue(policy, entry, bydst)
+ atomic_inc(&policy->genid);
read_unlock_bh(&xfrm_policy_lock);
- while (gc_list) {
- struct dst_entry *dst = gc_list;
-
- gc_list = dst->next;
- dst_free(dst);
- }
-
return 0;
}
EXPORT_SYMBOL(xfrm_policy_insert);
@@ -999,30 +970,35 @@ fail:
return ret;
}
+static struct xfrm_policy *__xfrm_policy_lookup(
+ struct net *net, struct flowi *fl,
+ u16 family, u8 dir)
+{
+#ifdef CONFIG_XFRM_SUB_POLICY
+ struct xfrm_policy *pol;
+
+ pol = xfrm_policy_lookup_bytype(net, XFRM_POLICY_TYPE_SUB, fl, family, dir);
+ if (pol != NULL)
+ return pol;
+#endif
+ return xfrm_policy_lookup_bytype(net, XFRM_POLICY_TYPE_MAIN, fl, family, dir);
+}
+
static struct flow_cache_entry_ops **xfrm_policy_lookup(
struct net *net, struct flowi *fl, u16 family,
- u8 dir, struct flow_cache_entry_ops **old_ops)
+ u8 dir, struct flow_cache_entry_ops **old_ops, void *ctx)
{
struct xfrm_policy *pol;
if (old_ops)
xfrm_pol_put(container_of(old_ops, struct xfrm_policy, fc_ops));
-#ifdef CONFIG_XFRM_SUB_POLICY
- pol = xfrm_policy_lookup_bytype(net, XFRM_POLICY_TYPE_SUB, fl, family, dir);
+ pol = __xfrm_policy_lookup(net, fl, family, dir);
if (IS_ERR(pol))
return (void *) pol;
- if (pol)
- goto found;
-#endif
- pol = xfrm_policy_lookup_bytype(net, XFRM_POLICY_TYPE_MAIN, fl, family, dir);
- if (IS_ERR(pol))
- return (void *) pol;
- if (pol)
- goto found;
- return NULL;
+ if (!pol)
+ return NULL;
-found:
/* Resolver returns two references:
* one for cache and one for caller of flow_cache_lookup() */
xfrm_pol_hold(pol);
@@ -1311,18 +1287,6 @@ xfrm_tmpl_resolve(struct xfrm_policy **pols, int npols, struct flowi *fl,
* still valid.
*/
-static struct dst_entry *
-xfrm_find_bundle(struct flowi *fl, struct xfrm_policy *policy, unsigned short family)
-{
- struct dst_entry *x;
- struct xfrm_policy_afinfo *afinfo = xfrm_policy_get_afinfo(family);
- if (unlikely(afinfo == NULL))
- return ERR_PTR(-EINVAL);
- x = afinfo->find_bundle(fl, policy);
- xfrm_policy_put_afinfo(afinfo);
- return x;
-}
-
static inline int xfrm_get_tos(struct flowi *fl, int family)
{
struct xfrm_policy_afinfo *afinfo = xfrm_policy_get_afinfo(family);
@@ -1338,6 +1302,53 @@ static inline int xfrm_get_tos(struct flowi *fl, int family)
return tos;
}
+static struct flow_cache_entry_ops **xfrm_bundle_get_fce(
+ struct flow_cache_entry_ops **ops)
+{
+ struct xfrm_dst *xdst = container_of(ops, struct xfrm_dst, fc_ops);
+ struct dst_entry *dst = (struct dst_entry*) xdst;
+
+ if (xdst->route == NULL) {
+ /* Dummy bundle - if it has xfrms we were not
+ * able to build bundle as template resolution failed.
+ * It means we need to try again resolving. */
+ if (xdst->num_xfrms > 0)
+ return NULL;
+ } else {
+ /* Real bundle */
+ if (stale_bundle(dst))
+ return NULL;
+ }
+
+ dst_hold(dst);
+ return ops;
+}
+
+static int xfrm_bundle_check_fce(struct flow_cache_entry_ops **ops)
+{
+ struct xfrm_dst *xdst = container_of(ops, struct xfrm_dst, fc_ops);
+ struct dst_entry *dst = (struct dst_entry*) xdst;
+
+ if (xdst->route && stale_bundle(dst))
+ return 0;
+
+ return 1;
+}
+
+static void xfrm_bundle_delete_fce(struct flow_cache_entry_ops **ops)
+{
+ struct xfrm_dst *xdst = container_of(ops, struct xfrm_dst, fc_ops);
+ struct dst_entry *dst = (struct dst_entry*) xdst;
+
+ dst_free(dst);
+}
+
+static struct flow_cache_entry_ops xfrm_bundle_fc_ops __read_mostly = {
+ .get = xfrm_bundle_get_fce,
+ .check = xfrm_bundle_check_fce,
+ .delete = xfrm_bundle_delete_fce,
+};
+
static inline struct xfrm_dst *xfrm_alloc_dst(struct net *net, int family)
{
struct xfrm_policy_afinfo *afinfo = xfrm_policy_get_afinfo(family);
@@ -1360,9 +1371,10 @@ static inline struct xfrm_dst *xfrm_alloc_dst(struct net *net, int family)
BUG();
}
xdst = dst_alloc(dst_ops) ?: ERR_PTR(-ENOBUFS);
-
xfrm_policy_put_afinfo(afinfo);
+ xdst->fc_ops = &xfrm_bundle_fc_ops;
+
return xdst;
}
@@ -1400,6 +1412,7 @@ static inline int xfrm_fill_dst(struct xfrm_dst *xdst, struct net_device *dev,
return err;
}
+
/* Allocate chain of dst_entry's, attach known xfrm's, calculate
* all the metrics... Shortly, bundle a bundle.
*/
@@ -1463,7 +1476,7 @@ static struct dst_entry *xfrm_bundle_create(struct xfrm_policy *policy,
dst_hold(dst);
dst1->xfrm = xfrm[i];
- xdst->genid = xfrm[i]->genid;
+ xdst->xfrm_genid = xfrm[i]->genid;
dst1->obsolete = -1;
dst1->flags |= DST_HOST;
@@ -1556,7 +1569,187 @@ xfrm_dst_update_origin(struct dst_entry *dst, struct flowi *fl)
#endif
}
-static int stale_bundle(struct dst_entry *dst);
+static int xfrm_resolve_policies(struct xfrm_policy *main_policy,
+ struct flowi *fl, u16 family,
+ struct xfrm_policy **pols,
+ int *num_pols, int *num_xfrms)
+{
+ int i;
+
+ if (!main_policy) {
+ *num_pols = 0;
+ *num_xfrms = 0;
+ return 0;
+ }
+ if (IS_ERR(main_policy))
+ return PTR_ERR(main_policy);
+
+ pols[0] = main_policy;
+ *num_pols = 1;
+ *num_xfrms = pols[0]->xfrm_nr;
+
+#ifdef CONFIG_XFRM_SUB_POLICY
+ if (pols[0] && pols[0]->action == XFRM_POLICY_ALLOW &&
+ pols[0]->type != XFRM_POLICY_TYPE_MAIN) {
+ pols[1] = xfrm_policy_lookup_bytype(xp_net(main_policy),
+ XFRM_POLICY_TYPE_MAIN,
+ fl, family,
+ XFRM_POLICY_OUT);
+ if (pols[1]) {
+ if (IS_ERR(pols[1])) {
+ xfrm_pols_put(pols, *num_pols);
+ return PTR_ERR(pols[1]);
+ }
+ (*num_pols) ++;
+ (*num_xfrms) += pols[1]->xfrm_nr;
+ }
+ }
+#endif
+ for (i = 0; i < *num_pols; i++) {
+ if (pols[i]->action != XFRM_POLICY_ALLOW) {
+ *num_xfrms = -1;
+ break;
+ }
+ }
+
+ return 0;
+
+}
+
+static struct xfrm_dst *xfrm_resolve_and_create_bundle(
+ struct xfrm_policy **pols, int num_pols,
+ struct flowi *fl, u16 family, struct dst_entry *dst_orig)
+{
+ struct net *net = xp_net(pols[0]);
+ struct xfrm_state *xfrm[XFRM_MAX_DEPTH];
+ struct dst_entry *dst;
+ struct xfrm_dst *xdst;
+ int err;
+
+ /* Try to instantiate a bundle */
+ err = xfrm_tmpl_resolve(pols, num_pols, fl, xfrm, family);
+ if (err < 0) {
+ if (err != -EAGAIN)
+ XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTPOLERROR);
+ return ERR_PTR(err);
+ }
+
+ dst = xfrm_bundle_create(pols[0], xfrm, err, fl, dst_orig);
+ if (IS_ERR(dst)) {
+ XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTBUNDLEGENERROR);
+ return (void*) dst;
+ }
+
+ xdst = (struct xfrm_dst *)dst;
+ xdst->num_xfrms = err;
+ if (num_pols > 1)
+ err = xfrm_dst_update_parent(dst, &pols[1]->selector);
+ else
+ err = xfrm_dst_update_origin(dst, fl);
+ if (unlikely(err)) {
+ dst_free(dst);
+ XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTBUNDLECHECKERROR);
+ return ERR_PTR(err);
+ }
+
+ xdst->num_pols = num_pols;
+ memcpy(xdst->pols, pols, sizeof(struct xfrm_policy*) * num_pols);
+ xdst->policy_genid = atomic_read(&pols[0]->genid);
+
+ return xdst;
+}
+
+static struct flow_cache_entry_ops **xfrm_bundle_lookup(
+ struct net *net, struct flowi *fl, u16 family, u8 dir,
+ struct flow_cache_entry_ops **old_ops, void *ctx)
+{
+ struct dst_entry *dst_orig = (struct dst_entry *)ctx;
+ struct xfrm_policy *pols[XFRM_POLICY_TYPE_MAX];
+ struct xfrm_dst *xdst, *new_xdst;
+ int num_pols = 0, num_xfrms = 0, i, err, pol_dead;
+
+ /* Check if the policies from old bundle are usable */
+ xdst = NULL;
+ if (old_ops) {
+ xdst = container_of(old_ops, struct xfrm_dst, fc_ops);
+ num_pols = xdst->num_pols;
+ num_xfrms = xdst->num_xfrms;
+ pol_dead = 0;
+ for (i = 0; i < num_pols; i++) {
+ pols[i] = xdst->pols[i];
+ pol_dead |= pols[i]->walk.dead;
+ }
+ if (pol_dead) {
+ dst_free((struct dst_entry*) xdst);
+ xdst = NULL;
+ num_pols = 0;
+ num_xfrms = 0;
+ old_ops = NULL;
+ }
+ }
+
+ /* Resolve policies to use if we couldn't get them from
+ * previous cache entry */
+ if (xdst == NULL) {
+ pols[0] = __xfrm_policy_lookup(net, fl, family, dir);
+ err = xfrm_resolve_policies(pols[0], fl, family,
+ pols, &num_pols, &num_xfrms);
+ if (err < 0)
+ goto inc_error;
+ if (num_pols == 0)
+ return NULL;
+ if (num_xfrms <= 0)
+ goto make_dummy_bundle;
+ }
+
+ new_xdst = xfrm_resolve_and_create_bundle(pols, num_pols, fl, family, dst_orig);
+ if (IS_ERR(new_xdst)) {
+ err = PTR_ERR(new_xdst);
+ if (err != -EAGAIN)
+ goto error;
+ if (old_ops == NULL)
+ goto make_dummy_bundle;
+ dst_hold((struct dst_entry*) xdst);
+ return old_ops;
+ }
+
+ /* Kill the previous bundle */
+ if (xdst) {
+ /* The policies were stolen for newly generated bundle */
+ xdst->num_pols = 0;
+ dst_free((struct dst_entry*)xdst);
+ }
+
+ /* Flow cache does not have reference, it dst_free()'s,
+ * but we do need to return one reference for original caller */
+ dst_hold((struct dst_entry*)new_xdst);
+ return &new_xdst->fc_ops;
+
+make_dummy_bundle:
+ /* We found policies, but there's no bundles to instantiate:
+ * either because the policy blocks, has no transformations or
+ * we could not build template (no xfrm_states).*/
+ xdst = xfrm_alloc_dst(net, family);
+ if (IS_ERR(xdst)) {
+ xfrm_pols_put(pols, num_pols);
+ return (void*) xdst;
+ }
+ xdst->num_pols = num_pols;
+ xdst->num_xfrms = num_xfrms;
+ memcpy(xdst->pols, pols, sizeof(struct xfrm_policy*) * num_pols);
+
+ dst_hold((struct dst_entry*)xdst);
+ return &xdst->fc_ops;
+
+inc_error:
+ XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTPOLERROR);
+error:
+ if (xdst != NULL)
+ dst_free((struct dst_entry*) xdst);
+ else
+ xfrm_pols_put(pols, num_pols);
+ return ERR_PTR(err);
+}
/* Main function: finds/creates a bundle for given flow.
*
@@ -1566,248 +1759,138 @@ static int stale_bundle(struct dst_entry *dst);
int __xfrm_lookup(struct net *net, struct dst_entry **dst_p, struct flowi *fl,
struct sock *sk, int flags)
{
- struct xfrm_policy *policy;
- struct xfrm_policy *pols[XFRM_POLICY_TYPE_MAX];
- int npols;
- int pol_dead;
- int xfrm_nr;
- int pi;
- struct xfrm_state *xfrm[XFRM_MAX_DEPTH];
- struct dst_entry *dst, *dst_orig = *dst_p;
- int nx = 0;
- int err;
- u32 genid;
- u16 family;
+ struct flow_cache_entry_ops **ops;
+ struct xfrm_dst *xdst;
+ struct xfrm_policy *pol;
+ struct dst_entry *dst, *dst_orig = *dst_p, *route;
+ u16 family = dst_orig->ops->family;
u8 dir = policy_to_flow_dir(XFRM_POLICY_OUT);
+ int err, num_pols, num_xfrms;
restart:
- genid = atomic_read(&flow_cache_genid);
- policy = NULL;
- for (pi = 0; pi < ARRAY_SIZE(pols); pi++)
- pols[pi] = NULL;
- npols = 0;
- pol_dead = 0;
- xfrm_nr = 0;
+ dst = NULL;
+ xdst = NULL;
+ route = NULL;
if (sk && sk->sk_policy[XFRM_POLICY_OUT]) {
- policy = xfrm_sk_policy_lookup(sk, XFRM_POLICY_OUT, fl);
- err = PTR_ERR(policy);
- if (IS_ERR(policy)) {
- XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTPOLERROR);
+ struct xfrm_policy *pols[XFRM_POLICY_TYPE_MAX];
+
+ pol = xfrm_sk_policy_lookup(sk, XFRM_POLICY_OUT, fl);
+ err = xfrm_resolve_policies(pol, fl, family,
+ pols, &num_pols, &num_xfrms);
+ if (err < 0)
goto dropdst;
+
+ if (num_pols) {
+ if (num_xfrms <= 0)
+ goto no_transform;
+
+ xdst = xfrm_resolve_and_create_bundle(
+ pols, num_pols, fl,
+ family, dst_orig);
+ if (IS_ERR(xdst)) {
+ xfrm_pols_put(pols, num_pols);
+ err = PTR_ERR(xdst);
+ goto dropdst;
+ }
+ route = xdst->route;
}
}
- if (!policy) {
- struct flow_cache_entry_ops **ops;
-
+ if (xdst == NULL) {
/* To accelerate a bit... */
if ((dst_orig->flags & DST_NOXFRM) ||
!net->xfrm.policy_count[XFRM_POLICY_OUT])
goto nopol;
- ops = flow_cache_lookup(net, fl, dst_orig->ops->family,
- dir, xfrm_policy_lookup);
- err = PTR_ERR(ops);
+ ops = flow_cache_lookup(net, fl, family, dir,
+ xfrm_bundle_lookup, dst_orig);
+ if (ops == NULL)
+ goto nopol;
if (IS_ERR(ops)) {
- XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTPOLERROR);
+ err = PTR_ERR(ops);
goto dropdst;
}
- if (ops)
- policy = container_of(ops, struct xfrm_policy, fc_ops);
- else
- policy = NULL;
- }
+ xdst = container_of(ops, struct xfrm_dst, fc_ops);
+
+ num_pols = xdst->num_pols;
+ num_xfrms = xdst->num_xfrms;
+ pol = xdst->pols[0];
+ route = xdst->route;
+ }
+
+ dst = (struct dst_entry *) xdst;
+ if (route == NULL && num_xfrms > 0) {
+ dst_release(dst);
+
+ /* The only case when xfrm_bundle_lookup() returns a
+ * bundle with null route, is when the template could
+ * not be resolved. It means policies are there, but
+ * bundle could not be created, since we don't yet
+ * have the xfrm_state's. We need to wait for KM to
+ * negotiate new SA's or bail out with error.*/
+ if (net->xfrm.sysctl_larval_drop) {
+ /* EREMOTE tells the caller to generate
+ * a one-shot blackhole route. */
+ XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTNOSTATES);
+ return -EREMOTE;
+ }
+ if (flags & XFRM_LOOKUP_WAIT) {
+ DECLARE_WAITQUEUE(wait, current);
- if (!policy)
- goto nopol;
+ add_wait_queue(&net->xfrm.km_waitq, &wait);
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule();
+ set_current_state(TASK_RUNNING);
+ remove_wait_queue(&net->xfrm.km_waitq, &wait);
- family = dst_orig->ops->family;
- pols[0] = policy;
- npols ++;
- xfrm_nr += pols[0]->xfrm_nr;
+ if (!signal_pending(current))
+ goto restart;
- err = -ENOENT;
- if ((flags & XFRM_LOOKUP_ICMP) && !(policy->flags & XFRM_POLICY_ICMP))
+ err = -ERESTART;
+ } else
+ err = -EAGAIN;
+
+ XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTNOSTATES);
goto error;
+ }
- policy->curlft.use_time = get_seconds();
+no_transform:
+ if (num_pols == 0)
+ goto nopol;
- switch (policy->action) {
- default:
- case XFRM_POLICY_BLOCK:
+ if ((flags & XFRM_LOOKUP_ICMP) &&
+ !(pol->flags & XFRM_POLICY_ICMP)) {
+ err = -ENOENT;
+ goto error;
+ }
+
+ pol->curlft.use_time = get_seconds();
+ if (num_xfrms < 0) {
/* Prohibit the flow */
XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTPOLBLOCK);
err = -EPERM;
goto error;
-
- case XFRM_POLICY_ALLOW:
-#ifndef CONFIG_XFRM_SUB_POLICY
- if (policy->xfrm_nr == 0) {
- /* Flow passes not transformed. */
- xfrm_pol_put(policy);
- return 0;
- }
-#endif
-
- /* Try to find matching bundle.
- *
- * LATER: help from flow cache. It is optional, this
- * is required only for output policy.
- */
- dst = xfrm_find_bundle(fl, policy, family);
- if (IS_ERR(dst)) {
- XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTBUNDLECHECKERROR);
- err = PTR_ERR(dst);
- goto error;
- }
-
- if (dst)
- break;
-
-#ifdef CONFIG_XFRM_SUB_POLICY
- if (pols[0]->type != XFRM_POLICY_TYPE_MAIN) {
- pols[1] = xfrm_policy_lookup_bytype(net,
- XFRM_POLICY_TYPE_MAIN,
- fl, family,
- XFRM_POLICY_OUT);
- if (pols[1]) {
- if (IS_ERR(pols[1])) {
- XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTPOLERROR);
- err = PTR_ERR(pols[1]);
- goto error;
- }
- if (pols[1]->action == XFRM_POLICY_BLOCK) {
- XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTPOLBLOCK);
- err = -EPERM;
- goto error;
- }
- npols ++;
- xfrm_nr += pols[1]->xfrm_nr;
- }
- }
-
- /*
- * Because neither flowi nor bundle information knows about
- * transformation template size. On more than one policy usage
- * we can realize whether all of them is bypass or not after
- * they are searched. See above not-transformed bypass
- * is surrounded by non-sub policy configuration, too.
- */
- if (xfrm_nr == 0) {
- /* Flow passes not transformed. */
- xfrm_pols_put(pols, npols);
- return 0;
- }
-
-#endif
- nx = xfrm_tmpl_resolve(pols, npols, fl, xfrm, family);
-
- if (unlikely(nx<0)) {
- err = nx;
- if (err == -EAGAIN && net->xfrm.sysctl_larval_drop) {
- /* EREMOTE tells the caller to generate
- * a one-shot blackhole route.
- */
- XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTNOSTATES);
- xfrm_pol_put(policy);
- return -EREMOTE;
- }
- if (err == -EAGAIN && (flags & XFRM_LOOKUP_WAIT)) {
- DECLARE_WAITQUEUE(wait, current);
-
- add_wait_queue(&net->xfrm.km_waitq, &wait);
- set_current_state(TASK_INTERRUPTIBLE);
- schedule();
- set_current_state(TASK_RUNNING);
- remove_wait_queue(&net->xfrm.km_waitq, &wait);
-
- nx = xfrm_tmpl_resolve(pols, npols, fl, xfrm, family);
-
- if (nx == -EAGAIN && signal_pending(current)) {
- XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTNOSTATES);
- err = -ERESTART;
- goto error;
- }
- if (nx == -EAGAIN ||
- genid != atomic_read(&flow_cache_genid)) {
- xfrm_pols_put(pols, npols);
- goto restart;
- }
- err = nx;
- }
- if (err < 0) {
- XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTNOSTATES);
- goto error;
- }
- }
- if (nx == 0) {
- /* Flow passes not transformed. */
- xfrm_pols_put(pols, npols);
- return 0;
- }
-
- dst = xfrm_bundle_create(policy, xfrm, nx, fl, dst_orig);
- err = PTR_ERR(dst);
- if (IS_ERR(dst)) {
- XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTBUNDLEGENERROR);
- goto error;
- }
-
- for (pi = 0; pi < npols; pi++)
- pol_dead |= pols[pi]->walk.dead;
-
- write_lock_bh(&policy->lock);
- if (unlikely(pol_dead || stale_bundle(dst))) {
- /* Wow! While we worked on resolving, this
- * policy has gone. Retry. It is not paranoia,
- * we just cannot enlist new bundle to dead object.
- * We can't enlist stable bundles either.
- */
- write_unlock_bh(&policy->lock);
- dst_free(dst);
-
- if (pol_dead)
- XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTPOLDEAD);
- else
- XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTBUNDLECHECKERROR);
- err = -EHOSTUNREACH;
- goto error;
- }
-
- if (npols > 1)
- err = xfrm_dst_update_parent(dst, &pols[1]->selector);
- else
- err = xfrm_dst_update_origin(dst, fl);
- if (unlikely(err)) {
- write_unlock_bh(&policy->lock);
- dst_free(dst);
- XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTBUNDLECHECKERROR);
- goto error;
- }
-
- dst->next = policy->bundles;
- policy->bundles = dst;
- dst_hold(dst);
- write_unlock_bh(&policy->lock);
+ } else if (num_xfrms > 0) {
+ /* Flow transformed */
+ *dst_p = dst;
+ dst_release(dst_orig);
+ } else {
+ /* Flow passes untransformed */
+ dst_release(dst);
}
- *dst_p = dst;
- dst_release(dst_orig);
- xfrm_pols_put(pols, npols);
return 0;
+nopol:
+ if (!(flags & XFRM_LOOKUP_ICMP))
+ return 0;
+ err = -ENOENT;
error:
- xfrm_pols_put(pols, npols);
+ dst_release(dst);
dropdst:
dst_release(dst_orig);
*dst_p = NULL;
return err;
-
-nopol:
- err = -ENOENT;
- if (flags & XFRM_LOOKUP_ICMP)
- goto dropdst;
- return 0;
}
EXPORT_SYMBOL(__xfrm_lookup);
@@ -1970,7 +2053,7 @@ int __xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb,
struct flow_cache_entry_ops **ops;
ops = flow_cache_lookup(net, &fl, family, fl_dir,
- xfrm_policy_lookup);
+ xfrm_policy_lookup, NULL);
if (IS_ERR(ops))
pol = (void *) ops;
else if (ops)
@@ -2159,69 +2242,9 @@ static struct dst_entry *xfrm_negative_advice(struct dst_entry *dst)
return dst;
}
-static void prune_one_bundle(struct xfrm_policy *pol, int (*func)(struct dst_entry *), struct dst_entry **gc_list_p)
-{
- struct dst_entry *dst, **dstp;
-
- write_lock(&pol->lock);
- dstp = &pol->bundles;
- while ((dst=*dstp) != NULL) {
- if (func(dst)) {
- *dstp = dst->next;
- dst->next = *gc_list_p;
- *gc_list_p = dst;
- } else {
- dstp = &dst->next;
- }
- }
- write_unlock(&pol->lock);
-}
-
-static void xfrm_prune_bundles(struct net *net, int (*func)(struct dst_entry *))
-{
- struct dst_entry *gc_list = NULL;
- int dir;
-
- read_lock_bh(&xfrm_policy_lock);
- for (dir = 0; dir < XFRM_POLICY_MAX * 2; dir++) {
- struct xfrm_policy *pol;
- struct hlist_node *entry;
- struct hlist_head *table;
- int i;
-
- hlist_for_each_entry(pol, entry,
- &net->xfrm.policy_inexact[dir], bydst)
- prune_one_bundle(pol, func, &gc_list);
-
- table = net->xfrm.policy_bydst[dir].table;
- for (i = net->xfrm.policy_bydst[dir].hmask; i >= 0; i--) {
- hlist_for_each_entry(pol, entry, table + i, bydst)
- prune_one_bundle(pol, func, &gc_list);
- }
- }
- read_unlock_bh(&xfrm_policy_lock);
-
- while (gc_list) {
- struct dst_entry *dst = gc_list;
- gc_list = dst->next;
- dst_free(dst);
- }
-}
-
-static int unused_bundle(struct dst_entry *dst)
-{
- return !atomic_read(&dst->__refcnt);
-}
-
static void __xfrm_garbage_collect(struct net *net)
{
- xfrm_prune_bundles(net, unused_bundle);
-}
-
-static int xfrm_flush_bundles(struct net *net)
-{
- xfrm_prune_bundles(net, stale_bundle);
- return 0;
+ flow_cache_flush();
}
static void xfrm_init_pmtu(struct dst_entry *dst)
@@ -2281,7 +2304,9 @@ int xfrm_bundle_ok(struct xfrm_policy *pol, struct xfrm_dst *first,
return 0;
if (dst->xfrm->km.state != XFRM_STATE_VALID)
return 0;
- if (xdst->genid != dst->xfrm->genid)
+ if (xdst->xfrm_genid != dst->xfrm->genid)
+ return 0;
+ if (xdst->policy_genid != atomic_read(&xdst->pols[0]->genid))
return 0;
if (strict && fl &&
@@ -2442,11 +2467,9 @@ static void xfrm_policy_put_afinfo(struct xfrm_policy_afinfo *afinfo)
static int xfrm_dev_event(struct notifier_block *this, unsigned long event, void *ptr)
{
- struct net_device *dev = ptr;
-
switch (event) {
case NETDEV_DOWN:
- xfrm_flush_bundles(dev_net(dev));
+ flow_cache_flush();
}
return NOTIFY_DONE;
}
@@ -2778,7 +2801,6 @@ static int xfrm_policy_migrate(struct xfrm_policy *pol,
struct xfrm_migrate *m, int num_migrate)
{
struct xfrm_migrate *mp;
- struct dst_entry *dst;
int i, j, n = 0;
write_lock_bh(&pol->lock);
@@ -2803,10 +2825,7 @@ static int xfrm_policy_migrate(struct xfrm_policy *pol,
sizeof(pol->xfrm_vec[i].saddr));
pol->xfrm_vec[i].encap_family = mp->new_family;
/* flush bundles */
- while ((dst = pol->bundles) != NULL) {
- pol->bundles = dst->next;
- dst_free(dst);
- }
+ atomic_inc(&pol->genid);
}
}
--
1.6.3.3
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 7/7] xfrm: remove policy garbage collection
2010-03-29 14:12 [PATCH 0/7] caching bundles, iteration 2 Timo Teras
` (5 preceding siblings ...)
2010-03-29 14:12 ` [PATCH 6/7] xfrm: cache bundles instead of policies for outgoing flows Timo Teras
@ 2010-03-29 14:12 ` Timo Teras
6 siblings, 0 replies; 35+ messages in thread
From: Timo Teras @ 2010-03-29 14:12 UTC (permalink / raw)
To: netdev; +Cc: Herbert Xu, Timo Teras
Policies are now properly reference counted and destroyed from
all code paths. The delayed gc is just overhead now and can
be removed.
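A minimal refcounting sketch of why that works (illustrative names only,
struct policy, policy_kill(), policy_put(), not the kernel API): once
every holder drops its own reference, there is nothing left for a
separate gc pass to do:

/* Illustration only: "kill" just marks the object dead and drops the
 * references the killer is responsible for; whoever holds the last
 * reference (e.g. the flow cache) frees the object from its own path. */
#include <stdio.h>

struct policy {
	int refcnt;
	int dead;
	int timer_pending;	/* a pending timer holds one reference */
};

static void policy_put(struct policy *p)
{
	if (--p->refcnt == 0)
		printf("last reference gone, destroying policy\n");
}

static void policy_kill(struct policy *p)
{
	p->dead = 1;			/* cached users notice via check()/get() */
	if (p->timer_pending) {		/* cancel the timer, drop its reference */
		p->timer_pending = 0;
		policy_put(p);
	}
	policy_put(p);			/* drop the unlink/caller reference */
}

int main(void)
{
	/* one ref from the lookup path, one from the timer, one from the cache */
	struct policy p = { .refcnt = 3, .dead = 0, .timer_pending = 1 };

	policy_kill(&p);	/* drops the timer and caller references */
	policy_put(&p);		/* flow cache eviction drops the last one */
	return 0;
}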
Signed-off-by: Timo Teras <timo.teras@iki.fi>
---
net/xfrm/xfrm_policy.c | 39 +++++----------------------------------
1 files changed, 5 insertions(+), 34 deletions(-)
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 11fb48d..0ae5768 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -44,9 +44,6 @@ static struct xfrm_policy_afinfo *xfrm_policy_afinfo[NPROTO];
static struct kmem_cache *xfrm_dst_cache __read_mostly;
-static HLIST_HEAD(xfrm_policy_gc_list);
-static DEFINE_SPINLOCK(xfrm_policy_gc_lock);
-
static struct xfrm_policy_afinfo *xfrm_policy_get_afinfo(unsigned short family);
static void xfrm_policy_put_afinfo(struct xfrm_policy_afinfo *afinfo);
static void xfrm_init_pmtu(struct dst_entry *dst);
@@ -287,32 +284,6 @@ void xfrm_policy_destroy(struct xfrm_policy *policy)
}
EXPORT_SYMBOL(xfrm_policy_destroy);
-static void xfrm_policy_gc_kill(struct xfrm_policy *policy)
-{
- atomic_inc(&policy->genid);
-
- if (del_timer(&policy->timer))
- atomic_dec(&policy->refcnt);
-
- xfrm_pol_put(policy);
-}
-
-static void xfrm_policy_gc_task(struct work_struct *work)
-{
- struct xfrm_policy *policy;
- struct hlist_node *entry, *tmp;
- struct hlist_head gc_list;
-
- spin_lock_bh(&xfrm_policy_gc_lock);
- gc_list.first = xfrm_policy_gc_list.first;
- INIT_HLIST_HEAD(&xfrm_policy_gc_list);
- spin_unlock_bh(&xfrm_policy_gc_lock);
-
- hlist_for_each_entry_safe(policy, entry, tmp, &gc_list, bydst)
- xfrm_policy_gc_kill(policy);
-}
-static DECLARE_WORK(xfrm_policy_gc_work, xfrm_policy_gc_task);
-
/* Rule must be locked. Release descentant resources, announce
* entry dead. The rule must be unlinked from lists to the moment.
*/
@@ -321,11 +292,12 @@ static void xfrm_policy_kill(struct xfrm_policy *policy)
{
policy->walk.dead = 1;
- spin_lock_bh(&xfrm_policy_gc_lock);
- hlist_add_head(&policy->bydst, &xfrm_policy_gc_list);
- spin_unlock_bh(&xfrm_policy_gc_lock);
+ atomic_inc(&policy->genid);
- schedule_work(&xfrm_policy_gc_work);
+ if (del_timer(&policy->timer))
+ xfrm_pol_put(policy);
+
+ xfrm_pol_put(policy);
}
static unsigned int xfrm_policy_hashmax __read_mostly = 1 * 1024 * 1024;
@@ -2575,7 +2547,6 @@ static void xfrm_policy_fini(struct net *net)
audit_info.sessionid = -1;
audit_info.secid = 0;
xfrm_policy_flush(net, XFRM_POLICY_TYPE_MAIN, &audit_info);
- flush_work(&xfrm_policy_gc_work);
WARN_ON(!list_empty(&net->xfrm.policy_all));
--
1.6.3.3
^ permalink raw reply related [flat|nested] 35+ messages in thread