* xfrm_state locking regression...
From: David Miller @ 2008-09-03 2:51 UTC (permalink / raw)
To: netdev; +Cc: timo.teras
I've been hitting a lockup on my machine, and the way to hit it is
fairly straightforward.
1) Set up an IPSEC connection using openswan.
2) Use something like XAUTH, where openswan is not able to refresh the
   IPSEC policy when it times out, because it has no way to prompt the
   user for the XAUTH password again.
3) After the timeout, try to send a packet over the IPSEC connection;
   a simple DNS lookup over it does the trick for me.
4) Shut down the IPSEC daemon.
At this point, on SMP, we'll wedge in __xfrm_state_destroy() trying
to obtain xfrm_state_lock.
This problem seems to have been introduced by:
commit 4c563f7669c10a12354b72b518c2287ffc6ebfb3
Author: Timo Teras <timo.teras@iki.fi>
Date: Thu Feb 28 21:31:08 2008 -0800
[XFRM]: Speed up xfrm_policy and xfrm_state walking
(BTW, I want to point out how hesitant I was at the time to apply
this patch. It touches a lot of delicate areas...)
This is what added the xfrm_state_lock grabbing in __xfrm_state_destroy().
Such new locking cannot work properly without also updating a lot of
other code in net/xfrm/xfrm_state.c.
The new locking means that it is absolutely illegal to do an
xfrm_state_put() while holding xfrm_state_lock. Yet we do this
all over the place!
Asynchronously, another context can drop the reference count down to
one; our xfrm_state_put() call then drops it to zero, calls into
__xfrm_state_destroy(), and tries to take xfrm_state_lock (again),
causing the deadlock.
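To make the failure mode concrete, here is a minimal sketch of the
pattern (illustrative only, not a literal excerpt from
net/xfrm/xfrm_state.c):

	spin_lock_bh(&xfrm_state_lock);
	/* ... table manipulation ... */
	xfrm_state_put(x);	/* suppose this drops the last reference:  */
				/*   -> __xfrm_state_destroy(x)            */
				/*      -> spin_lock_bh(&xfrm_state_lock)  */
				/* we already hold the lock, so on SMP     */
				/* this spins forever                      */
	spin_unlock_bh(&xfrm_state_lock);	/* never reached */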
I'm not exactly sure how to fix this currently.
I guess we can add a local pointer variable to the relevant functions
"struct xfrm_state *to_put", which is initialized to NULL, and we
replace the put calls with assignments to to_put. In the "out" path,
after we drop the xfrm_state_lock, we do a put if "to_put" is
non-NULL.
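In outline, the transformation looks like this (a sketch of the
pattern only; "error_case" is a placeholder, the real changes are in
the diff below):

	struct xfrm_state *to_put = NULL;

	spin_lock_bh(&xfrm_state_lock);
	/* ... */
	if (error_case) {
		to_put = x;	/* was: xfrm_state_put(x) under the lock */
		goto out;
	}
	/* ... */
out:
	spin_unlock_bh(&xfrm_state_lock);
	if (to_put)
		xfrm_state_put(to_put);	/* safe: lock no longer held */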
Something like the patch below, but frankly this is some ugly stuff.
I double checked xfrm_policy.c, since it has the potential to have
similar issues due to the above changeset, but it appears to be
OK as best I can tell.
ipsec: Fix deadlock in xfrm_state management.
Ever since commit 4c563f7669c10a12354b72b518c2287ffc6ebfb3
("[XFRM]: Speed up xfrm_policy and xfrm_state walking") it is
illegal to call __xfrm_state_destroy (and thus xfrm_state_put())
with xfrm_state_lock held. If we do, we'll deadlock since we
have the lock already and __xfrm_state_destroy() tries to take
it again.
Fix this by pushing the xfrm_state_put() calls after the lock
is dropped.
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 4c6914e..98f20cc 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -780,11 +780,13 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
 {
 	unsigned int h;
 	struct hlist_node *entry;
-	struct xfrm_state *x, *x0;
+	struct xfrm_state *x, *x0, *to_put;
 	int acquire_in_progress = 0;
 	int error = 0;
 	struct xfrm_state *best = NULL;
 
+	to_put = NULL;
+
 	spin_lock_bh(&xfrm_state_lock);
 	h = xfrm_dst_hash(daddr, saddr, tmpl->reqid, family);
 	hlist_for_each_entry(x, entry, xfrm_state_bydst+h, bydst) {
@@ -833,7 +835,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
 		if (tmpl->id.spi &&
 		    (x0 = __xfrm_state_lookup(daddr, tmpl->id.spi,
 					      tmpl->id.proto, family)) != NULL) {
-			xfrm_state_put(x0);
+			to_put = x0;
 			error = -EEXIST;
 			goto out;
 		}
@@ -849,7 +851,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
 		error = security_xfrm_state_alloc_acquire(x, pol->security, fl->secid);
 		if (error) {
 			x->km.state = XFRM_STATE_DEAD;
-			xfrm_state_put(x);
+			to_put = x;
 			x = NULL;
 			goto out;
 		}
@@ -870,7 +872,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
 			xfrm_hash_grow_check(x->bydst.next != NULL);
 		} else {
 			x->km.state = XFRM_STATE_DEAD;
-			xfrm_state_put(x);
+			to_put = x;
 			x = NULL;
 			error = -ESRCH;
 		}
@@ -881,6 +883,8 @@ out:
 	else
 		*err = acquire_in_progress ? -EAGAIN : error;
 	spin_unlock_bh(&xfrm_state_lock);
+	if (to_put)
+		xfrm_state_put(to_put);
 	return x;
 }
 
@@ -1067,18 +1071,20 @@ static struct xfrm_state *__xfrm_find_acq_byseq(u32 seq);
 
 int xfrm_state_add(struct xfrm_state *x)
 {
-	struct xfrm_state *x1;
+	struct xfrm_state *x1, *to_put;
 	int family;
 	int err;
 	int use_spi = xfrm_id_proto_match(x->id.proto, IPSEC_PROTO_ANY);
 
 	family = x->props.family;
 
+	to_put = NULL;
+
 	spin_lock_bh(&xfrm_state_lock);
 
 	x1 = __xfrm_state_locate(x, use_spi, family);
 	if (x1) {
-		xfrm_state_put(x1);
+		to_put = x1;
 		x1 = NULL;
 		err = -EEXIST;
 		goto out;
@@ -1088,7 +1094,7 @@ int xfrm_state_add(struct xfrm_state *x)
 		x1 = __xfrm_find_acq_byseq(x->km.seq);
 		if (x1 && ((x1->id.proto != x->id.proto) ||
 		    xfrm_addr_cmp(&x1->id.daddr, &x->id.daddr, family))) {
-			xfrm_state_put(x1);
+			to_put = x1;
 			x1 = NULL;
 		}
 	}
@@ -1110,6 +1116,9 @@ out:
 		xfrm_state_put(x1);
 	}
 
+	if (to_put)
+		xfrm_state_put(to_put);
+
 	return err;
 }
 EXPORT_SYMBOL(xfrm_state_add);
@@ -1269,10 +1278,12 @@ EXPORT_SYMBOL(xfrm_state_migrate);
 
 int xfrm_state_update(struct xfrm_state *x)
 {
-	struct xfrm_state *x1;
+	struct xfrm_state *x1, *to_put;
 	int err;
 	int use_spi = xfrm_id_proto_match(x->id.proto, IPSEC_PROTO_ANY);
 
+	to_put = NULL;
+
 	spin_lock_bh(&xfrm_state_lock);
 	x1 = __xfrm_state_locate(x, use_spi, x->props.family);
 
@@ -1281,7 +1292,7 @@ int xfrm_state_update(struct xfrm_state *x)
 		goto out;
 
 	if (xfrm_state_kern(x1)) {
-		xfrm_state_put(x1);
+		to_put = x1;
 		err = -EEXIST;
 		goto out;
 	}
@@ -1295,6 +1306,9 @@ int xfrm_state_update(struct xfrm_state *x)
 out:
 	spin_unlock_bh(&xfrm_state_lock);
 
+	if (to_put)
+		xfrm_state_put(to_put);
+
 	if (err)
 		return err;
@@ -1483,11 +1497,13 @@ EXPORT_SYMBOL(xfrm_get_acqseq);
 int xfrm_alloc_spi(struct xfrm_state *x, u32 low, u32 high)
 {
 	unsigned int h;
-	struct xfrm_state *x0;
+	struct xfrm_state *x0, *to_put;
 	int err = -ENOENT;
 	__be32 minspi = htonl(low);
 	__be32 maxspi = htonl(high);
 
+	to_put = NULL;
+
 	spin_lock_bh(&x->lock);
 
 	if (x->km.state == XFRM_STATE_DEAD)
 		goto unlock;
@@ -1501,7 +1517,7 @@ int xfrm_alloc_spi(struct xfrm_state *x, u32 low, u32 high)
 	if (minspi == maxspi) {
 		x0 = xfrm_state_lookup(&x->id.daddr, minspi, x->id.proto, x->props.family);
 		if (x0) {
-			xfrm_state_put(x0);
+			to_put = x0;
 			goto unlock;
 		}
 		x->id.spi = minspi;
@@ -1514,7 +1530,7 @@ int xfrm_alloc_spi(struct xfrm_state *x, u32 low, u32 high)
 				x->id.spi = htonl(spi);
 				break;
 			}
-			xfrm_state_put(x0);
+			to_put = x0;
 		}
 	}
 	if (x->id.spi) {
@@ -1529,6 +1545,9 @@ int xfrm_alloc_spi(struct xfrm_state *x, u32 low, u32 high)
 unlock:
 	spin_unlock_bh(&x->lock);
 
+	if (to_put)
+		xfrm_state_put(to_put);
+
 	return err;
 }
 EXPORT_SYMBOL(xfrm_alloc_spi);
* Re: xfrm_state locking regression...
From: David Miller @ 2008-09-03 3:00 UTC (permalink / raw)
To: netdev; +Cc: timo.teras

From: David Miller <davem@davemloft.net>
Date: Tue, 02 Sep 2008 19:51:48 -0700 (PDT)

> @@ -1483,11 +1497,13 @@ EXPORT_SYMBOL(xfrm_get_acqseq);
> int xfrm_alloc_spi(struct xfrm_state *x, u32 low, u32 high)
> {
 ...
> @@ -1514,7 +1530,7 @@ int xfrm_alloc_spi(struct xfrm_state *x, u32 low, u32 high)
> 				x->id.spi = htonl(spi);
> 				break;
> 			}
> -			xfrm_state_put(x0);
> +			to_put = x0;
> 		}
> 	}
> 	if (x->id.spi) {

This part of my patch is buggy; we loop here, so we would need to
remember more state in order to do the puts correctly.

But luckily this function doesn't need the to_put logic at all: it
grabs x->lock rather than xfrm_state_lock, so all of the changes to
this function can simply be removed :-)

I'm going to do some testing on this and commit it unless more
problems are found.

ipsec: Fix deadlock in xfrm_state management.

Ever since commit 4c563f7669c10a12354b72b518c2287ffc6ebfb3
("[XFRM]: Speed up xfrm_policy and xfrm_state walking") it is
illegal to call __xfrm_state_destroy (and thus xfrm_state_put())
with xfrm_state_lock held.  If we do, we'll deadlock since we
have the lock already and __xfrm_state_destroy() tries to take
it again.

Fix this by pushing the xfrm_state_put() calls after the lock
is dropped.

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 4c6914e..7bd62f6 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -780,11 +780,13 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
 {
 	unsigned int h;
 	struct hlist_node *entry;
-	struct xfrm_state *x, *x0;
+	struct xfrm_state *x, *x0, *to_put;
 	int acquire_in_progress = 0;
 	int error = 0;
 	struct xfrm_state *best = NULL;
 
+	to_put = NULL;
+
 	spin_lock_bh(&xfrm_state_lock);
 	h = xfrm_dst_hash(daddr, saddr, tmpl->reqid, family);
 	hlist_for_each_entry(x, entry, xfrm_state_bydst+h, bydst) {
@@ -833,7 +835,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
 		if (tmpl->id.spi &&
 		    (x0 = __xfrm_state_lookup(daddr, tmpl->id.spi,
 					      tmpl->id.proto, family)) != NULL) {
-			xfrm_state_put(x0);
+			to_put = x0;
 			error = -EEXIST;
 			goto out;
 		}
@@ -849,7 +851,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
 		error = security_xfrm_state_alloc_acquire(x, pol->security, fl->secid);
 		if (error) {
 			x->km.state = XFRM_STATE_DEAD;
-			xfrm_state_put(x);
+			to_put = x;
 			x = NULL;
 			goto out;
 		}
@@ -870,7 +872,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
 			xfrm_hash_grow_check(x->bydst.next != NULL);
 		} else {
 			x->km.state = XFRM_STATE_DEAD;
-			xfrm_state_put(x);
+			to_put = x;
 			x = NULL;
 			error = -ESRCH;
 		}
@@ -881,6 +883,8 @@ out:
 	else
 		*err = acquire_in_progress ? -EAGAIN : error;
 	spin_unlock_bh(&xfrm_state_lock);
+	if (to_put)
+		xfrm_state_put(to_put);
 	return x;
 }
 
@@ -1067,18 +1071,20 @@ static struct xfrm_state *__xfrm_find_acq_byseq(u32 seq);
 
 int xfrm_state_add(struct xfrm_state *x)
 {
-	struct xfrm_state *x1;
+	struct xfrm_state *x1, *to_put;
 	int family;
 	int err;
 	int use_spi = xfrm_id_proto_match(x->id.proto, IPSEC_PROTO_ANY);
 
 	family = x->props.family;
 
+	to_put = NULL;
+
 	spin_lock_bh(&xfrm_state_lock);
 
 	x1 = __xfrm_state_locate(x, use_spi, family);
 	if (x1) {
-		xfrm_state_put(x1);
+		to_put = x1;
 		x1 = NULL;
 		err = -EEXIST;
 		goto out;
@@ -1088,7 +1094,7 @@ int xfrm_state_add(struct xfrm_state *x)
 		x1 = __xfrm_find_acq_byseq(x->km.seq);
 		if (x1 && ((x1->id.proto != x->id.proto) ||
 		    xfrm_addr_cmp(&x1->id.daddr, &x->id.daddr, family))) {
-			xfrm_state_put(x1);
+			to_put = x1;
 			x1 = NULL;
 		}
 	}
@@ -1110,6 +1116,9 @@ out:
 		xfrm_state_put(x1);
 	}
 
+	if (to_put)
+		xfrm_state_put(to_put);
+
 	return err;
 }
 EXPORT_SYMBOL(xfrm_state_add);
@@ -1269,10 +1278,12 @@ EXPORT_SYMBOL(xfrm_state_migrate);
 
 int xfrm_state_update(struct xfrm_state *x)
 {
-	struct xfrm_state *x1;
+	struct xfrm_state *x1, *to_put;
 	int err;
 	int use_spi = xfrm_id_proto_match(x->id.proto, IPSEC_PROTO_ANY);
 
+	to_put = NULL;
+
 	spin_lock_bh(&xfrm_state_lock);
 	x1 = __xfrm_state_locate(x, use_spi, x->props.family);
 
@@ -1281,7 +1292,7 @@ int xfrm_state_update(struct xfrm_state *x)
 		goto out;
 
 	if (xfrm_state_kern(x1)) {
-		xfrm_state_put(x1);
+		to_put = x1;
 		err = -EEXIST;
 		goto out;
 	}
@@ -1295,6 +1306,9 @@ int xfrm_state_update(struct xfrm_state *x)
 out:
 	spin_unlock_bh(&xfrm_state_lock);
 
+	if (to_put)
+		xfrm_state_put(to_put);
+
 	if (err)
 		return err;
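The reason xfrm_alloc_spi() gets away without the deferral can be
sketched as follows (illustrative only; the point is that the
per-state x->lock and the global xfrm_state_lock are distinct locks):

	spin_lock_bh(&x->lock);	/* per-state lock, not the global one */
	/* ... */
	xfrm_state_put(x0);	/* even if this drops the last reference, */
				/* __xfrm_state_destroy() takes           */
				/* xfrm_state_lock, which is NOT held     */
				/* here, so there is no self-deadlock     */
	spin_unlock_bh(&x->lock);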
* Re: xfrm_state locking regression...
From: Herbert Xu @ 2008-09-03 5:01 UTC (permalink / raw)
To: David Miller; +Cc: netdev, timo.teras

David Miller <davem@davemloft.net> wrote:
>
> This is what added the xfrm_state_lock grabbing in __xfrm_state_destroy().

This is simply bogus.  We should remove the entry from the list
when the state is deleted, not when it's destroyed.  That is, we
should be deleting x->all in places like __xfrm_state_delete instead
of __xfrm_state_destroy.

The fact that we're adding x->all in __xfrm_state_insert means
that we should be deleting it from its counter-part, which is
__xfrm_state_delete.

The same thing applies to the policies of course.  Once a policy
or state is deleted there is no reason why it should show up in a
walk just because some dst is holding onto it.

> ipsec: Fix deadlock in xfrm_state management.
>
> Ever since commit 4c563f7669c10a12354b72b518c2287ffc6ebfb3
> ("[XFRM]: Speed up xfrm_policy and xfrm_state walking") it is
> illegal to call __xfrm_state_destroy (and thus xfrm_state_put())
> with xfrm_state_lock held.  If we do, we'll deadlock since we
> have the lock already and __xfrm_state_destroy() tries to take
> it again.
>
> Fix this by pushing the xfrm_state_put() calls after the lock
> is dropped.
>
> Signed-off-by: David S. Miller <davem@davemloft.net>

However, moving things out of critical sections is always a good
idea.  So I think your patch is an improvement regardless of why
it came about :)

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
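In code terms, the suggestion amounts to moving the unlink (a sketch
against the functions as they looked at the time, not a complete
patch):

	int __xfrm_state_delete(struct xfrm_state *x)
	{
		/* ... */
		spin_lock(&xfrm_state_lock);
		list_del(&x->all);	/* unlink for walks here ... */
		hlist_del(&x->bydst);
		hlist_del(&x->bysrc);
		/* ... */
	}

	void __xfrm_state_destroy(struct xfrm_state *x)
	{
		/* ... and no longer here, so destruction never
		 * needs to take xfrm_state_lock at all */
	}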
* Re: xfrm_state locking regression...
From: Timo Teräs @ 2008-09-03 5:07 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev

Herbert Xu wrote:
> David Miller <davem@davemloft.net> wrote:
>> This is what added the xfrm_state_lock grabbing in __xfrm_state_destroy().
>
> This is simply bogus.  We should remove the entry from the list
> when the state is deleted, not when it's destroyed.  That is, we
> should be deleting x->all in places like __xfrm_state_delete instead
> of __xfrm_state_destroy.
>
> The fact that we're adding x->all in __xfrm_state_insert means
> that we should be deleting it from its counter-part, which is
> __xfrm_state_delete.
>
> The same thing applies to the policies of course.  Once a policy
> or state is deleted there is no reason why it should show up in a
> walk just because some dst is holding onto it.

But then the caller of xfrm_state_walk() can still be holding a
reference to the entry, and dumping of the whole chain fails, because
the entry is no longer part of the chain when dumping continues later.

The walking code skips entries which are dead, so the entry is only
kept temporarily to make a full traversal of the list possible.

>> ipsec: Fix deadlock in xfrm_state management.
>>
>> Ever since commit 4c563f7669c10a12354b72b518c2287ffc6ebfb3
>> ("[XFRM]: Speed up xfrm_policy and xfrm_state walking") it is
>> illegal to call __xfrm_state_destroy (and thus xfrm_state_put())
>> with xfrm_state_lock held.  If we do, we'll deadlock since we
>> have the lock already and __xfrm_state_destroy() tries to take
>> it again.
>>
>> Fix this by pushing the xfrm_state_put() calls after the lock
>> is dropped.
>>
>> Signed-off-by: David S. Miller <davem@davemloft.net>
>
> However, moving things out of critical sections is always a good
> idea.  So I think your patch is an improvement regardless of why
> it came about :)

Right.  Also, alternatively, xfrm_state_all could be protected by a
different lock than xfrm_state_lock.

- Timo
* Re: xfrm_state locking regression...
From: Herbert Xu @ 2008-09-03 5:23 UTC (permalink / raw)
To: Timo Teräs; +Cc: David Miller, netdev

On Wed, Sep 03, 2008 at 08:07:33AM +0300, Timo Teräs wrote:
>
> But then the caller of xfrm_state_walk() can still be holding a
> reference to the entry, and dumping of the whole chain fails, because
> the entry is no longer part of the chain when dumping continues later.
>
> The walking code skips entries which are dead, so the entry is only
> kept temporarily to make a full traversal of the list possible.

Ah, I see your point now.

Actually, another thing we should do is to postpone the unlinking
into the GC process.  This doesn't really need to be immediate.

And here is a patch to fix a couple of changes in user-visible
behaviour.

ipsec: Restore larval states and socket policies in dump

The commit 4c563f7669c10a12354b72b518c2287ffc6ebfb3 ("[XFRM]:
Speed up xfrm_policy and xfrm_state walking") inadvertently removed
larval states and socket policies from netlink dumps.  This patch
restores them.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 841b32a..5138845 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1077,6 +1077,7 @@ static void __xfrm_policy_link(struct xfrm_policy *pol, int dir)
 	struct hlist_head *chain = policy_hash_bysel(&pol->selector,
 						     pol->family, dir);
 
+	list_add_tail(&pol->bytype, &xfrm_policy_bytype[pol->type]);
 	hlist_add_head(&pol->bydst, chain);
 	hlist_add_head(&pol->byidx, xfrm_policy_byidx+idx_hash(pol->index));
 	xfrm_policy_count[dir]++;
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 4c6914e..89551b6 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -856,6 +856,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
 
 		if (km_query(x, tmpl, pol) == 0) {
 			x->km.state = XFRM_STATE_ACQ;
+			list_add_tail(&x->all, &xfrm_state_all);
 			hlist_add_head(&x->bydst, xfrm_state_bydst+h);
 			h = xfrm_src_hash(daddr, saddr, family);
 			hlist_add_head(&x->bysrc, xfrm_state_bysrc+h);
@@ -1051,6 +1052,7 @@ static struct xfrm_state *__find_acq_core(unsigned short family, u8 mode, u32 re
 		xfrm_state_hold(x);
 		x->timer.expires = jiffies + sysctl_xfrm_acq_expires*HZ;
 		add_timer(&x->timer);
+		list_add_tail(&x->all, &xfrm_state_all);
 		hlist_add_head(&x->bydst, xfrm_state_bydst+h);
 		h = xfrm_src_hash(daddr, saddr, family);
 		hlist_add_head(&x->bysrc, xfrm_state_bysrc+h);

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: xfrm_state locking regression...
From: Timo Teräs @ 2008-09-03 5:39 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev

Herbert Xu wrote:
> On Wed, Sep 03, 2008 at 08:07:33AM +0300, Timo Teräs wrote:
>> But then the caller of xfrm_state_walk() can still be holding a
>> reference to the entry, and dumping of the whole chain fails, because
>> the entry is no longer part of the chain when dumping continues later.
>>
>> The walking code skips entries which are dead, so the entry is only
>> kept temporarily to make a full traversal of the list possible.
>
> Ah, I see your point now.
>
> Actually, another thing we should do is to postpone the unlinking
> into the GC process.  This doesn't really need to be immediate.

Yes, that might be a good idea.

> And here is a patch to fix a couple of changes in user-visible
> behaviour.
>
> ipsec: Restore larval states and socket policies in dump
>
> The commit 4c563f7669c10a12354b72b518c2287ffc6ebfb3 ("[XFRM]:
> Speed up xfrm_policy and xfrm_state walking") inadvertently removed
> larval states and socket policies from netlink dumps.  This patch
> restores them.

Oh, I seem to have missed that there were multiple places where
the list insertion happens.  Would it make sense to add a function
for inserting the entries into the lists?

- Timo
* Re: xfrm_state locking regression...
From: Herbert Xu @ 2008-09-03 5:40 UTC (permalink / raw)
To: Timo Teräs; +Cc: David Miller, netdev

On Wed, Sep 03, 2008 at 08:39:18AM +0300, Timo Teräs wrote:
>
> Oh, I seem to have missed that there were multiple places where
> the list insertion happens.  Would it make sense to add a function
> for inserting the entries into the lists?

Sure, grep for the places that change xfrm_state_num and you've
got your function :)

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: xfrm_state locking regression...
From: David Miller @ 2008-09-09 12:25 UTC (permalink / raw)
To: herbert; +Cc: timo.teras, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 3 Sep 2008 15:23:30 +1000

> ipsec: Restore larval states and socket policies in dump
>
> The commit 4c563f7669c10a12354b72b518c2287ffc6ebfb3 ("[XFRM]:
> Speed up xfrm_policy and xfrm_state walking") inadvertently removed
> larval states and socket policies from netlink dumps.  This patch
> restores them.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied to net-2.6, thanks Herbert.
* Re: xfrm_state locking regression...
From: Herbert Xu @ 2008-09-03 5:39 UTC (permalink / raw)
To: Timo Teräs; +Cc: David Miller, netdev

On Wed, Sep 03, 2008 at 08:07:33AM +0300, Timo Teräs wrote:
>
> Right.  Also, alternatively, xfrm_state_all could be protected by a
> different lock than xfrm_state_lock.

Yes, that's a good idea.  Here's a patch to do that for states.
I'm still thinking about whether policy destruction should be
restructured or not, so I'm not patching that.

ipsec: Move state dump list unlinking into GC

This patch moves the unlinking of x->all (for dumping) into the
GC and changes the lock to xfrm_cfg_mutex.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 89551b6..f29923a 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -380,6 +380,10 @@ static void xfrm_put_mode(struct xfrm_mode *mode)
 
 static void xfrm_state_gc_destroy(struct xfrm_state *x)
 {
+	mutex_lock(&xfrm_cfg_mutex);
+	list_del(&x->all);
+	mutex_unlock(&xfrm_cfg_mutex);
+
 	del_timer_sync(&x->timer);
 	del_timer_sync(&x->rtimer);
 	kfree(x->aalg);
@@ -540,10 +544,6 @@ void __xfrm_state_destroy(struct xfrm_state *x)
 {
 	WARN_ON(x->km.state != XFRM_STATE_DEAD);
 
-	spin_lock_bh(&xfrm_state_lock);
-	list_del(&x->all);
-	spin_unlock_bh(&xfrm_state_lock);
-
 	spin_lock_bh(&xfrm_state_gc_lock);
 	hlist_add_head(&x->bydst, &xfrm_state_gc_list);
 	spin_unlock_bh(&xfrm_state_gc_lock);

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: xfrm_state locking regression...
From: Timo Teräs @ 2008-09-03 5:45 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev

Herbert Xu wrote:
> On Wed, Sep 03, 2008 at 08:07:33AM +0300, Timo Teräs wrote:
>> Right.  Also, alternatively, xfrm_state_all could be protected by a
>> different lock than xfrm_state_lock.
>
> Yes, that's a good idea.  Here's a patch to do that for states.
> I'm still thinking about whether policy destruction should be
> restructured or not, so I'm not patching that.
>
> ipsec: Move state dump list unlinking into GC
>
> This patch moves the unlinking of x->all (for dumping) into the
> GC and changes the lock to xfrm_cfg_mutex.

Shouldn't this also change the locking in all places where
x->all is used?
* Re: xfrm_state locking regression...
From: Herbert Xu @ 2008-09-03 5:50 UTC (permalink / raw)
To: Timo Teräs; +Cc: David Miller, netdev

On Wed, Sep 03, 2008 at 08:45:01AM +0300, Timo Teräs wrote:
>
>> ipsec: Move state dump list unlinking into GC
>>
>> This patch moves the unlinking of x->all (for dumping) into the
>> GC and changes the lock to xfrm_cfg_mutex.
>
> Shouldn't this also change the locking in all places where
> x->all is used?

You're right.  I'd forgotten about the larval states for some
reason, probably because I just added them to the list :)

So let's keep xfrm_state_lock and just move it.

ipsec: Move state dump list unlinking into GC

This patch moves the unlinking of x->all (for dumping) into the
GC so that we know for sure what locks we hold when we unlink the
entry.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 89551b6..0244325 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -380,6 +380,10 @@ static void xfrm_put_mode(struct xfrm_mode *mode)
 
 static void xfrm_state_gc_destroy(struct xfrm_state *x)
 {
+	spin_lock_bh(&xfrm_state_lock);
+	list_del(&x->all);
+	spin_unlock_bh(&xfrm_state_lock);
+
 	del_timer_sync(&x->timer);
 	del_timer_sync(&x->rtimer);
 	kfree(x->aalg);
@@ -540,10 +544,6 @@ void __xfrm_state_destroy(struct xfrm_state *x)
 {
 	WARN_ON(x->km.state != XFRM_STATE_DEAD);
 
-	spin_lock_bh(&xfrm_state_lock);
-	list_del(&x->all);
-	spin_unlock_bh(&xfrm_state_lock);
-
 	spin_lock_bh(&xfrm_state_gc_lock);
 	hlist_add_head(&x->bydst, &xfrm_state_gc_list);
 	spin_unlock_bh(&xfrm_state_gc_lock);

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: xfrm_state locking regression...
From: David Miller @ 2008-09-03 6:14 UTC (permalink / raw)
To: herbert; +Cc: timo.teras, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 3 Sep 2008 15:50:41 +1000

> On Wed, Sep 03, 2008 at 08:45:01AM +0300, Timo Teräs wrote:
>>
>>> ipsec: Move state dump list unlinking into GC
>>>
>>> This patch moves the unlinking of x->all (for dumping) into the
>>> GC and changes the lock to xfrm_cfg_mutex.
>>
>> Shouldn't this also change the locking in all places where
>> x->all is used?
>
> You're right.  I'd forgotten about the larval states for some
> reason, probably because I just added them to the list :)
>
> So let's keep xfrm_state_lock and just move it.

How about we un-crap the reference counting?

Every list an object lives on is an external reference, and that is
the rule that is violated by this ->all list handling.

And that's why we need to take a single shared lock twice just to get
rid of an xfrm_state object now.
* Re: xfrm_state locking regression...
From: Timo Teräs @ 2008-09-03 6:27 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev

David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Wed, 3 Sep 2008 15:50:41 +1000
>
>> So let's keep xfrm_state_lock and just move it.
>
> How about we un-crap the reference counting?
>
> Every list an object lives on is an external reference, and that is
> the rule that is violated by this ->all list handling.
>
> And that's why we need to take a single shared lock twice just to get
> rid of an xfrm_state object now.

Well, it's just another list keeping a reference, like ->bydst,
->bysrc and ->byspi.  The actual number of external references is
still correct (the walking code calls _hold() when it returns while
keeping an external pointer).

The difference is that a node should not be unlinked from ->all until
all other references are gone.  For the other lists the unlinking can
be done earlier, since they are used only for lookups.

Are there any other good ways to enumerate the list entries while
keeping a temporary "iterator"?  The previous method was crap too.

- Timo
* Re: xfrm_state locking regression...
From: David Miller @ 2008-09-03 6:35 UTC (permalink / raw)
To: timo.teras; +Cc: herbert, netdev

From: Timo Teräs <timo.teras@iki.fi>
Date: Wed, 03 Sep 2008 09:27:47 +0300

> Well, it's just another list keeping a reference, like ->bydst,
> ->bysrc and ->byspi.  The actual number of external references is
> still correct (the walking code calls _hold() when it returns while
> keeping an external pointer).

->bydst, ->bysrc, and ->byspi are counted as a single external
reference because:

1) They are controlled as a group

2) Doing 3 atomic operations is more expensive than one

I know because I did that conversion from 3 refcount operations down
to 1 and I timed it with stress tests, which showed that it made a
huge performance difference for the control path of our IPSEC stack.

> The difference is that a node should not be unlinked from ->all until
> all other references are gone.  For the other lists the unlinking can
> be done earlier, since they are used only for lookups.

Once there are no list references, there cannot be any other references.
So in fact it seems to me that unlinking when the xfrm_state is removed
from those other lists makes perfect sense.

If __xfrm_state_delete sets the state to DEAD, and you skip xfrm_state
objects marked DEAD, why does the ->all list reference have to survive
past __xfrm_state_delete()?

It seems the perfect place to do the ->all removal.

> Are there any other good ways to enumerate the list entries while
> keeping a temporary "iterator"?  The previous method was crap too.

At least the old stuff was self-consistent and only needed one central
lock grab to destroy an object.
* Re: xfrm_state locking regression...
From: Timo Teräs @ 2008-09-03 6:45 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev

David Miller wrote:
> From: Timo Teräs <timo.teras@iki.fi>
> Date: Wed, 03 Sep 2008 09:27:47 +0300
>
>> Well, it's just another list keeping a reference, like ->bydst,
>> ->bysrc and ->byspi.  The actual number of external references is
>> still correct (the walking code calls _hold() when it returns while
>> keeping an external pointer).
>
> ->bydst, ->bysrc, and ->byspi are counted as a single external
> reference because:
>
> 1) They are controlled as a group
>
> 2) Doing 3 atomic operations is more expensive than one
>
> I know because I did that conversion from 3 refcount operations down
> to 1 and I timed it with stress tests, which showed that it made a
> huge performance difference for the control path of our IPSEC stack.

I was a bit confused about what you meant by "external reference".
But yes, in this sense it's adding a new external reference.

>> The difference is that a node should not be unlinked from ->all until
>> all other references are gone.  For the other lists the unlinking can
>> be done earlier, since they are used only for lookups.
>
> Once there are no list references, there cannot be any other references.
> So in fact it seems to me that unlinking when the xfrm_state is removed
> from those other lists makes perfect sense.
>
> If __xfrm_state_delete sets the state to DEAD, and you skip xfrm_state
> objects marked DEAD, why does the ->all list reference have to survive
> past __xfrm_state_delete()?
>
> It seems the perfect place to do the ->all removal.

1. xfrm_state_walk() is called; it returns but holds an entry, since
   the walk was interrupted temporarily (e.g. the netlink buffer
   filled up).

2. xfrm_state_delete() is called on the entry that xfrm_state_walk()
   is keeping a pointer to, and the entry is unlinked.

3. xfrm_state_walk() is called again; it tries to resume the list
   walk, but whoops, the entry was unlinked, and kaboom.

>> Are there any other good ways to enumerate the list entries while
>> keeping a temporary "iterator"?  The previous method was crap too.
>
> At least the old stuff was self-consistent and only needed one central
> lock grab to destroy an object.

Yes, but the dumping code produced crap.  It could dump the same entry
multiple times, miss entries, and was dog slow.  With it there was no
possibility of keeping userland in sync with the kernel SPD/SAD,
because entries were lost.

- Timo
* Re: xfrm_state locking regression...
From: David Miller @ 2008-09-03 6:47 UTC (permalink / raw)
To: timo.teras; +Cc: herbert, netdev

From: Timo Teräs <timo.teras@iki.fi>
Date: Wed, 03 Sep 2008 09:45:48 +0300

> David Miller wrote:
>> Once there are no list references, there cannot be any other references.
>> So in fact it seems to me that unlinking when the xfrm_state is removed
>> from those other lists makes perfect sense.
>>
>> If __xfrm_state_delete sets the state to DEAD, and you skip xfrm_state
>> objects marked DEAD, why does the ->all list reference have to survive
>> past __xfrm_state_delete()?
>>
>> It seems the perfect place to do the ->all removal.
>
> 1. xfrm_state_walk() is called; it returns but holds an entry, since
>    the walk was interrupted temporarily (e.g. the netlink buffer
>    filled up).
>
> 2. xfrm_state_delete() is called on the entry that xfrm_state_walk()
>    is keeping a pointer to, and the entry is unlinked.
>
> 3. xfrm_state_walk() is called again; it tries to resume the list
>    walk, but whoops, the entry was unlinked, and kaboom.

Get creative, use a key of some sort to continue the walk, that's what
other netlink'ish subsystems use.

> Yes, but the dumping code produced crap.  It could dump the same entry
> multiple times, miss entries, and was dog slow.  With it there was no
> possibility of keeping userland in sync with the kernel SPD/SAD,
> because entries were lost.

I'd rather see an entry twice in a dump than have my IPSEC gateway
lock up, or run slower because we take a lock twice as often as
necessary during object destruction.
* Re: xfrm_state locking regression...
From: Timo Teräs @ 2008-09-03 7:14 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev

David Miller wrote:
> Get creative, use a key of some sort to continue the walk, that's what
> other netlink'ish subsystems use.

Ok.  ctnetlink_dump_table() seems to iterate the hash list and keeps
a unique id of where it left off.  If that entry was deleted, it
starts again from that hash bucket.

We could iterate the SPD by idx.  By SPI would be nice for the SAD,
but not all entries have an SPI.  I guess we could use either the
bysrc or the bydst hash and memorize the xfrm_id.  Restart from the
first hash bucket entry if the memorized entry gets deleted, and
restart from the beginning if the hash gets resized.

This way we could avoid the additional locking and guarantee that all
entries are dumped in reasonable time.

>> Yes, but the dumping code produced crap.  It could dump the same entry
>> multiple times, miss entries, and was dog slow.  With it there was no
>> possibility of keeping userland in sync with the kernel SPD/SAD,
>> because entries were lost.
>
> I'd rather see an entry twice in a dump than have my IPSEC gateway
> lock up, or run slower because we take a lock twice as often as
> necessary during object destruction.

Seeing an entry twice is not a problem.  Not seeing an entry at all
was the real problem.

Also, listing the SAD could take tens of seconds on a modern system.
That's not nice either.

- Timo
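A keyed resume along these lines might look roughly like the
following (purely hypothetical sketch: walk->last_key, x->key and
dump_one() are made-up names, and nothing of this exact shape was
proposed or merged):

	/* instead of holding a pointer into the list, remember a
	 * unique key and skip past it when the dump resumes */
	hlist_for_each_entry(x, entry, xfrm_state_bydst + h, bydst) {
		if (x->key <= walk->last_key)
			continue;		/* dumped in an earlier pass */
		if (dump_one(x, data) < 0) {	/* netlink buffer full */
			walk->last_key = x->key;
			return -ENOBUFS;	/* resume from the key later */
		}
	}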
* Re: xfrm_state locking regression...
From: Herbert Xu @ 2008-09-05 11:55 UTC (permalink / raw)
To: David Miller; +Cc: timo.teras, netdev

On Tue, Sep 02, 2008 at 11:47:23PM -0700, David Miller wrote:
>
> Get creative, use a key of some sort to continue the walk, that's what
> other netlink'ish subsystems use.

OK, let's see if this is creative enough :)

The next step after this patch is to kill x->all altogether and go
back to the old walking method (plus the saved state).

ipsec: Use RCU-like construct for saved state within a walk

Now that we save states within a walk we need synchronisation
so that the list the saved state is on doesn't disappear from
under us.

As it stands this is done by keeping the state on the list which
is bad because it gets in the way of the management of the state
life-cycle.

An alternative is to make our own pseudo-RCU system where we use
counters to indicate which state can't be freed immediately as
it may be referenced by an ongoing walk when that resumes.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 2933d74..4bb9499 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -120,9 +120,11 @@ extern struct mutex xfrm_cfg_mutex;
 /* Full description of state of transformer. */
 struct xfrm_state
 {
-	/* Note: bydst is re-used during gc */
 	struct list_head	all;
-	struct hlist_node	bydst;
+	union {
+		struct list_head	gclist;
+		struct hlist_node	bydst;
+	};
 	struct hlist_node	bysrc;
 	struct hlist_node	byspi;
 
@@ -1286,16 +1288,9 @@ static inline void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto)
 	walk->count = 0;
 }
 
-static inline void xfrm_state_walk_done(struct xfrm_state_walk *walk)
-{
-	if (walk->state != NULL) {
-		xfrm_state_put(walk->state);
-		walk->state = NULL;
-	}
-}
-
 extern int xfrm_state_walk(struct xfrm_state_walk *walk,
 			   int (*func)(struct xfrm_state *, int, void*), void *);
+extern void xfrm_state_walk_done(struct xfrm_state_walk *walk);
 extern struct xfrm_state *xfrm_state_alloc(void);
 extern struct xfrm_state *xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
 					  struct flowi *fl, struct xfrm_tmpl *tmpl,
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 7bd62f6..ec9d1e5 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -59,6 +59,11 @@ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024;
 static unsigned int xfrm_state_num;
 static unsigned int xfrm_state_genid;
 
+/* Counter indicating ongoing walk, protected by xfrm_state_lock. */
+static unsigned long xfrm_state_walk_ongoing;
+/* Counter indicating walk completion, protected by xfrm_cfg_mutex. */
+static unsigned long xfrm_state_walk_completed;
+
 static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned int family);
 static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo);
 
@@ -191,7 +196,7 @@ static DEFINE_RWLOCK(xfrm_state_afinfo_lock);
 static struct xfrm_state_afinfo *xfrm_state_afinfo[NPROTO];
 
 static struct work_struct xfrm_state_gc_work;
-static HLIST_HEAD(xfrm_state_gc_list);
+static LIST_HEAD(xfrm_state_gc_list);
 static DEFINE_SPINLOCK(xfrm_state_gc_lock);
 
 int __xfrm_state_delete(struct xfrm_state *x);
@@ -403,17 +408,23 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x)
 
 static void xfrm_state_gc_task(struct work_struct *data)
 {
-	struct xfrm_state *x;
-	struct hlist_node *entry, *tmp;
-	struct hlist_head gc_list;
+	struct xfrm_state *x, *tmp;
+	static LIST_HEAD(gc_list);
+	unsigned long completed;
 
 	spin_lock_bh(&xfrm_state_gc_lock);
-	gc_list.first = xfrm_state_gc_list.first;
-	INIT_HLIST_HEAD(&xfrm_state_gc_list);
+	list_splice_tail_init(&xfrm_state_gc_list, &gc_list);
 	spin_unlock_bh(&xfrm_state_gc_lock);
 
-	hlist_for_each_entry_safe(x, entry, tmp, &gc_list, bydst)
+	mutex_lock(&xfrm_cfg_mutex);
+	completed = xfrm_state_walk_completed;
+	mutex_unlock(&xfrm_cfg_mutex);
+
+	list_for_each_entry_safe(x, tmp, &gc_list, gclist) {
+		if ((long)(x->lastused - completed) > 0)
+			break;
 		xfrm_state_gc_destroy(x);
+	}
 
 	wake_up(&km_waitq);
 }
@@ -540,12 +551,8 @@ void __xfrm_state_destroy(struct xfrm_state *x)
 {
 	WARN_ON(x->km.state != XFRM_STATE_DEAD);
 
-	spin_lock_bh(&xfrm_state_lock);
-	list_del(&x->all);
-	spin_unlock_bh(&xfrm_state_lock);
-
 	spin_lock_bh(&xfrm_state_gc_lock);
-	hlist_add_head(&x->bydst, &xfrm_state_gc_list);
+	list_add_tail(&x->gclist, &xfrm_state_gc_list);
 	spin_unlock_bh(&xfrm_state_gc_lock);
 	schedule_work(&xfrm_state_gc_work);
 }
@@ -558,6 +565,8 @@ int __xfrm_state_delete(struct xfrm_state *x)
 	if (x->km.state != XFRM_STATE_DEAD) {
 		x->km.state = XFRM_STATE_DEAD;
 		spin_lock(&xfrm_state_lock);
+		x->lastused = xfrm_state_walk_ongoing;
+		list_del_rcu(&x->all);
 		hlist_del(&x->bydst);
 		hlist_del(&x->bysrc);
 		if (x->id.spi)
@@ -1572,6 +1581,7 @@ int xfrm_state_walk(struct xfrm_state_walk *walk,
 		if (err) {
 			xfrm_state_hold(last);
 			walk->state = last;
+			xfrm_state_walk_ongoing++;
 			goto out;
 		}
 	}
@@ -1586,12 +1596,26 @@ int xfrm_state_walk(struct xfrm_state_walk *walk,
 		err = func(last, 0, data);
 out:
 	spin_unlock_bh(&xfrm_state_lock);
-	if (old != NULL)
+	if (old != NULL) {
 		xfrm_state_put(old);
+		xfrm_state_walk_completed++;
+		schedule_work(&xfrm_state_gc_work);
+	}
 	return err;
 }
 EXPORT_SYMBOL(xfrm_state_walk);
 
+void xfrm_state_walk_done(struct xfrm_state_walk *walk)
+{
+	if (walk->state != NULL) {
+		xfrm_state_put(walk->state);
+		walk->state = NULL;
+		xfrm_state_walk_completed++;
+		schedule_work(&xfrm_state_gc_work);
+	}
+}
+EXPORT_SYMBOL(xfrm_state_walk_done);
+
 
 void xfrm_replay_notify(struct xfrm_state *x, int event)
 {

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
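The lifetime rule the patch implements can be spelled out with a
small worked example (x->lastused is reused here as a generation
stamp):

	/* Each interrupted walk bumps xfrm_state_walk_ongoing; each
	 * walk that finishes or is resumed bumps
	 * xfrm_state_walk_completed.  A deleted state records the
	 * current ongoing count, and the GC only frees it once every
	 * walk that was in flight at deletion time has completed:
	 *
	 *   ongoing = 5, completed = 3
	 *   state deleted now     -> x->lastused = 5
	 *   GC: (long)(5 - 3) > 0 -> keep x, 2 walks outstanding
	 *   both walks finish     -> completed = 5
	 *   GC: (long)(5 - 5) > 0 -> false, x is destroyed
	 *
	 * The signed subtraction keeps the test correct across
	 * counter wraparound, in the style of time_after(). */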
* Re: xfrm_state locking regression...
From: David Miller @ 2008-09-09 0:09 UTC (permalink / raw)
To: herbert; +Cc: timo.teras, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 5 Sep 2008 21:55:07 +1000

> @@ -403,17 +408,23 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x)
>
>  static void xfrm_state_gc_task(struct work_struct *data)
>  {
> -	struct xfrm_state *x;
> -	struct hlist_node *entry, *tmp;
> -	struct hlist_head gc_list;
> +	struct xfrm_state *x, *tmp;
> +	static LIST_HEAD(gc_list);
> +	unsigned long completed;
>
>  	spin_lock_bh(&xfrm_state_gc_lock);
> -	gc_list.first = xfrm_state_gc_list.first;
> -	INIT_HLIST_HEAD(&xfrm_state_gc_list);
> +	list_splice_tail_init(&xfrm_state_gc_list, &gc_list);
>  	spin_unlock_bh(&xfrm_state_gc_lock);
>
> -	hlist_for_each_entry_safe(x, entry, tmp, &gc_list, bydst)
> +	mutex_lock(&xfrm_cfg_mutex);
> +	completed = xfrm_state_walk_completed;
> +	mutex_unlock(&xfrm_cfg_mutex);
> +
> +	list_for_each_entry_safe(x, tmp, &gc_list, gclist) {
> +		if ((long)(x->lastused - completed) > 0)
> +			break;
>  		xfrm_state_gc_destroy(x);
> +	}
>
>  	wake_up(&km_waitq);
>  }

What happens to all of the entries on the local gc_list which you
will be skipping when you break out of that last loop?

It looks like they will never get processed.
* Re: xfrm_state locking regression...
From: Herbert Xu @ 2008-09-09 0:18 UTC (permalink / raw)
To: David Miller; +Cc: timo.teras, netdev

On Mon, Sep 08, 2008 at 05:09:07PM -0700, David Miller wrote:
>
> What happens to all of the entries on the local gc_list which you
> will be skipping when you break out of that last loop?
>
> It looks like they will never get processed.

I made gc_list static, so they'll be processed when the GC task
runs again.  Also, whenever completed is incremented we schedule
the GC task.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: xfrm_state locking regression...
From: David Miller @ 2008-09-09 0:20 UTC (permalink / raw)
To: herbert; +Cc: timo.teras, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 9 Sep 2008 10:18:28 +1000

> On Mon, Sep 08, 2008 at 05:09:07PM -0700, David Miller wrote:
>> What happens to all of the entries on the local gc_list which you
>> will be skipping when you break out of that last loop?
>>
>> It looks like they will never get processed.
>
> I made gc_list static, so they'll be processed when the GC task
> runs again.  Also, whenever completed is incremented we schedule
> the GC task.

Thanks, I didn't catch that "static" bit :-)
* Re: xfrm_state locking regression...
From: David Miller @ 2008-09-09 0:25 UTC (permalink / raw)
To: herbert; +Cc: timo.teras, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 5 Sep 2008 21:55:07 +1000

> ipsec: Use RCU-like construct for saved state within a walk

Ok, now that I finally understand how this works I like it.

The only comment I would make is that maybe it's a bit excessive
to trigger the GC worker every time we walk the states.

Instead you could expose the GC task's local list and go:

	xfrm_state_walk_completed++;
	smp_wmb();
	if (!list_empty(&xfrm_state_gc_work_list))
		schedule_work(&xfrm_state_gc_work);

in xfrm_state_walk_done().
* Re: xfrm_state locking regression...
From: Herbert Xu @ 2008-09-09 14:33 UTC (permalink / raw)
To: David Miller; +Cc: timo.teras, netdev

On Mon, Sep 08, 2008 at 05:25:13PM -0700, David Miller wrote:
>
> The only comment I would make is that maybe it's a bit excessive
> to trigger the GC worker every time we walk the states.

Good point!

I've avoided the memory barrier by simply extending the mutexed
section in the GC to cover the list splicing.  Here's the updated
patch:

ipsec: Use RCU-like construct for saved state within a walk

Now that we save states within a walk we need synchronisation
so that the list the saved state is on doesn't disappear from
under us.

As it stands this is done by keeping the state on the list which
is bad because it gets in the way of the management of the state
life-cycle.

An alternative is to make our own pseudo-RCU system where we use
counters to indicate which state can't be freed immediately as
it may be referenced by an ongoing walk when that resumes.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 2933d74..4bb9499 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -120,9 +120,11 @@ extern struct mutex xfrm_cfg_mutex;
 /* Full description of state of transformer. */
 struct xfrm_state
 {
-	/* Note: bydst is re-used during gc */
 	struct list_head	all;
-	struct hlist_node	bydst;
+	union {
+		struct list_head	gclist;
+		struct hlist_node	bydst;
+	};
 	struct hlist_node	bysrc;
 	struct hlist_node	byspi;
 
@@ -1286,16 +1288,9 @@ static inline void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto)
 	walk->count = 0;
 }
 
-static inline void xfrm_state_walk_done(struct xfrm_state_walk *walk)
-{
-	if (walk->state != NULL) {
-		xfrm_state_put(walk->state);
-		walk->state = NULL;
-	}
-}
-
 extern int xfrm_state_walk(struct xfrm_state_walk *walk,
 			   int (*func)(struct xfrm_state *, int, void*), void *);
+extern void xfrm_state_walk_done(struct xfrm_state_walk *walk);
 extern struct xfrm_state *xfrm_state_alloc(void);
 extern struct xfrm_state *xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
 					  struct flowi *fl, struct xfrm_tmpl *tmpl,
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 7bd62f6..d90f936 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -59,6 +59,11 @@ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024;
 static unsigned int xfrm_state_num;
 static unsigned int xfrm_state_genid;
 
+/* Counter indicating ongoing walk, protected by xfrm_state_lock. */
+static unsigned long xfrm_state_walk_ongoing;
+/* Counter indicating walk completion, protected by xfrm_cfg_mutex. */
+static unsigned long xfrm_state_walk_completed;
+
 static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned int family);
 static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo);
 
@@ -191,7 +196,8 @@ static DEFINE_RWLOCK(xfrm_state_afinfo_lock);
 static struct xfrm_state_afinfo *xfrm_state_afinfo[NPROTO];
 
 static struct work_struct xfrm_state_gc_work;
-static HLIST_HEAD(xfrm_state_gc_list);
+static LIST_HEAD(xfrm_state_gc_leftovers);
+static LIST_HEAD(xfrm_state_gc_list);
 static DEFINE_SPINLOCK(xfrm_state_gc_lock);
 
 int __xfrm_state_delete(struct xfrm_state *x);
@@ -403,17 +409,22 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x)
 
 static void xfrm_state_gc_task(struct work_struct *data)
 {
-	struct xfrm_state *x;
-	struct hlist_node *entry, *tmp;
-	struct hlist_head gc_list;
+	struct xfrm_state *x, *tmp;
+	unsigned long completed;
 
+	mutex_lock(&xfrm_cfg_mutex);
 	spin_lock_bh(&xfrm_state_gc_lock);
-	gc_list.first = xfrm_state_gc_list.first;
-	INIT_HLIST_HEAD(&xfrm_state_gc_list);
+	list_splice_tail_init(&xfrm_state_gc_list, &xfrm_state_gc_leftovers);
 	spin_unlock_bh(&xfrm_state_gc_lock);
 
-	hlist_for_each_entry_safe(x, entry, tmp, &gc_list, bydst)
+	completed = xfrm_state_walk_completed;
+	mutex_unlock(&xfrm_cfg_mutex);
+
+	list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) {
+		if ((long)(x->lastused - completed) > 0)
+			break;
 		xfrm_state_gc_destroy(x);
+	}
 
 	wake_up(&km_waitq);
 }
@@ -540,12 +551,8 @@ void __xfrm_state_destroy(struct xfrm_state *x)
 {
 	WARN_ON(x->km.state != XFRM_STATE_DEAD);
 
-	spin_lock_bh(&xfrm_state_lock);
-	list_del(&x->all);
-	spin_unlock_bh(&xfrm_state_lock);
-
 	spin_lock_bh(&xfrm_state_gc_lock);
-	hlist_add_head(&x->bydst, &xfrm_state_gc_list);
+	list_add_tail(&x->gclist, &xfrm_state_gc_list);
 	spin_unlock_bh(&xfrm_state_gc_lock);
 	schedule_work(&xfrm_state_gc_work);
 }
@@ -558,6 +565,8 @@ int __xfrm_state_delete(struct xfrm_state *x)
 	if (x->km.state != XFRM_STATE_DEAD) {
 		x->km.state = XFRM_STATE_DEAD;
 		spin_lock(&xfrm_state_lock);
+		x->lastused = xfrm_state_walk_ongoing;
+		list_del_rcu(&x->all);
 		hlist_del(&x->bydst);
 		hlist_del(&x->bysrc);
 		if (x->id.spi)
@@ -1572,6 +1581,7 @@ int xfrm_state_walk(struct xfrm_state_walk *walk,
 		if (err) {
 			xfrm_state_hold(last);
 			walk->state = last;
+			xfrm_state_walk_ongoing++;
 			goto out;
 		}
 	}
@@ -1586,12 +1596,28 @@ int xfrm_state_walk(struct xfrm_state_walk *walk,
 		err = func(last, 0, data);
 out:
 	spin_unlock_bh(&xfrm_state_lock);
-	if (old != NULL)
+	if (old != NULL) {
 		xfrm_state_put(old);
+		xfrm_state_walk_completed++;
+		if (!list_empty(&xfrm_state_gc_leftovers))
+			schedule_work(&xfrm_state_gc_work);
+	}
 	return err;
 }
 EXPORT_SYMBOL(xfrm_state_walk);
 
+void xfrm_state_walk_done(struct xfrm_state_walk *walk)
+{
+	if (walk->state != NULL) {
+		xfrm_state_put(walk->state);
+		walk->state = NULL;
+		xfrm_state_walk_completed++;
+		if (!list_empty(&xfrm_state_gc_leftovers))
+			schedule_work(&xfrm_state_gc_work);
+	}
+}
+EXPORT_SYMBOL(xfrm_state_walk_done);
+
 
 void xfrm_replay_notify(struct xfrm_state *x, int event)
 {

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: xfrm_state locking regression...
From: David Miller @ 2008-09-09 20:20 UTC (permalink / raw)
To: herbert; +Cc: timo.teras, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 10 Sep 2008 00:33:12 +1000

> I've avoided the memory barrier by simply extending the mutexed
> section in the GC to cover the list splicing.  Here's the updated
> patch:
>
> ipsec: Use RCU-like construct for saved state within a walk

Looks good.

I'll be able to merge this next time I pull net-2.6 into net-next-2.6
because of your larval/socket policy fix which touched similar areas.
* Re: xfrm_state locking regression...
From: David Miller @ 2008-09-10 3:01 UTC (permalink / raw)
To: herbert; +Cc: timo.teras, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 10 Sep 2008 00:33:12 +1000

> ipsec: Use RCU-like construct for saved state within a walk

Applied, thanks Herbert.
* Re: xfrm_state locking regression...
From: Paul E. McKenney @ 2008-09-11 21:24 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, timo.teras, netdev

On Wed, Sep 10, 2008 at 12:33:12AM +1000, Herbert Xu wrote:
> On Mon, Sep 08, 2008 at 05:25:13PM -0700, David Miller wrote:
>>
>> The only comment I would make is that maybe it's a bit excessive
>> to trigger the GC worker every time we walk the states.
>
> Good point!
>
> I've avoided the memory barrier by simply extending the mutexed
> section in the GC to cover the list splicing.  Here's the updated
> patch:
>
> ipsec: Use RCU-like construct for saved state within a walk
>
> Now that we save states within a walk we need synchronisation
> so that the list the saved state is on doesn't disappear from
> under us.
>
> As it stands this is done by keeping the state on the list which
> is bad because it gets in the way of the management of the state
> life-cycle.
>
> An alternative is to make our own pseudo-RCU system where we use
> counters to indicate which state can't be freed immediately as
> it may be referenced by an ongoing walk when that resumes.

There is only one reader at a time, right?  Otherwise, I don't see
how the increments and reads of xfrm_state_walk_completed line up.

							Thanx, Paul
> the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-11 21:24 ` Paul E. McKenney @ 2008-09-11 22:00 ` David Miller 2008-09-11 23:22 ` Paul E. McKenney 2008-09-12 16:08 ` Herbert Xu 0 siblings, 2 replies; 95+ messages in thread From: David Miller @ 2008-09-11 22:00 UTC (permalink / raw) To: paulmck; +Cc: herbert, timo.teras, netdev From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Date: Thu, 11 Sep 2008 14:24:59 -0700 > On Wed, Sep 10, 2008 at 12:33:12AM +1000, Herbert Xu wrote: > > On Mon, Sep 08, 2008 at 05:25:13PM -0700, David Miller wrote: > > > > > > The only comment I would make is that maybe it's a bit excessive > > > to trigger the GC worker every time we walk the states. > > > > Good point! > > > > I've avoided the memory barrier by simply extending the mutexed > > section in the GC to cover the list splicing. Here's the updated > > patch: > > > > ipsec: Use RCU-like construct for saved state within a walk > > > > Now that we save states within a walk we need synchronisation > > so that the list the saved state is on doesn't disappear from > > under us. > > > > As it stands this is done by keeping the state on the list which > > is bad because it gets in the way of the management of the state > > life-cycle. > > > > An alternative is to make our own pseudo-RCU system where we use > > counters to indicate which state can't be freed immediately as > > it may be referenced by an ongoing walk when that resumes. > > There is only one reader at a time, right? Otherwise, I don't see how > the increments and reads of xfrm_state_walk_completed line up. The RTNL semaphore is held across the modifications of the counters. ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-11 22:00 ` David Miller @ 2008-09-11 23:22 ` Paul E. McKenney 2008-09-12 16:08 ` Herbert Xu 1 sibling, 0 replies; 95+ messages in thread From: Paul E. McKenney @ 2008-09-11 23:22 UTC (permalink / raw) To: David Miller; +Cc: herbert, timo.teras, netdev On Thu, Sep 11, 2008 at 03:00:06PM -0700, David Miller wrote: > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> > Date: Thu, 11 Sep 2008 14:24:59 -0700 > > > On Wed, Sep 10, 2008 at 12:33:12AM +1000, Herbert Xu wrote: > > > On Mon, Sep 08, 2008 at 05:25:13PM -0700, David Miller wrote: > > > > > > > > The only comment I would make is that maybe it's a bit excessive > > > > to trigger the GC worker every time we walk the states. > > > > > > Good point! > > > > > > I've avoided the memory barrier by simply extending the mutexed > > > section in the GC to cover the list splicing. Here's the updated > > > patch: > > > > > > ipsec: Use RCU-like construct for saved state within a walk > > > > > > Now that we save states within a walk we need synchronisation > > > so that the list the saved state is on doesn't disappear from > > > under us. > > > > > > As it stands this is done by keeping the state on the list which > > > is bad because it gets in the way of the management of the state > > > life-cycle. > > > > > > An alternative is to make our own pseudo-RCU system where we use > > > counters to indicate which state can't be freed immediately as > > > it may be referenced by an ongoing walk when that resumes. > > > > There is only one reader at a time, right? Otherwise, I don't see how > > the increments and reads of xfrm_state_walk_completed line up. > > The RTNL semaphore is held across the modifications of the counters. Got it, thank you, sorry for the noise! Thanx, Paul ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-11 22:00 ` David Miller 2008-09-11 23:22 ` Paul E. McKenney @ 2008-09-12 16:08 ` Herbert Xu 2008-09-12 17:37 ` Paul E. McKenney 1 sibling, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-12 16:08 UTC (permalink / raw) To: David Miller; +Cc: paulmck, timo.teras, netdev On Thu, Sep 11, 2008 at 03:00:06PM -0700, David Miller wrote: > > > There is only one reader at a time, right? Otherwise, I don't see how > > the increments and reads of xfrm_state_walk_completed line up. > > The RTNL semaphore is held across the modifications of the counters. s/RTNL/xfrm_cfg_mutex/ but otherwise what Dave said :) Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-12 16:08 ` Herbert Xu @ 2008-09-12 17:37 ` Paul E. McKenney 0 siblings, 0 replies; 95+ messages in thread From: Paul E. McKenney @ 2008-09-12 17:37 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, timo.teras, netdev On Sat, Sep 13, 2008 at 02:08:25AM +1000, Herbert Xu wrote: > On Thu, Sep 11, 2008 at 03:00:06PM -0700, David Miller wrote: > > > > > There is only one reader at a time, right? Otherwise, I don't see how > > > the increments and reads of xfrm_state_walk_completed line up. > > > > The RTNL semaphore is held across the modifications of the counters. > > s/RTNL/xfrm_cfg_mutex/ > > but otherwise what Dave said :) Thank you for the updated information, and I guess I don't feel quite so bad about the false alarm. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-09 14:33 ` Herbert Xu ` (2 preceding siblings ...) 2008-09-11 21:24 ` Paul E. McKenney @ 2008-09-21 12:29 ` Timo Teräs 2008-09-21 15:21 ` Timo Teräs 3 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-21 12:29 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Herbert Xu wrote: > On Mon, Sep 08, 2008 at 05:25:13PM -0700, David Miller wrote: > I've avoided the memory barrier by simply extending the mutexed > section in the GC to cover the list splicing. Here's the updated > patch: > > ipsec: Use RCU-like construct for saved state within a walk > > Now that we save states within a walk we need synchronisation > so that the list the saved state is on doesn't disappear from > under us. > > As it stands this is done by keeping the state on the list which > is bad because it gets in the way of the management of the state > life-cycle. > > An alternative is to make our own pseudo-RCU system where we use > counters to indicate which state can't be freed immediately as > it may be referenced by an ongoing walk when that resumes. Doesn't this logic fail if: 1. completed = ongoing 2. 1st walk started and iterator kept (a->lastused = ongoing, ongoing++) 3. 2nd walk started and iterator kept (b->lastused = ongoing, ongoing++) 4. 2nd walk finished (completed++) 5. gc triggered: a gets deleted since a->lastused == completed 6. 1st walk continued, but freed memory is accessed as a was deleted Though currently it has no effect, since xfrm_state_hold/_put are still called when keeping the iterator, so the entries won't actually get garbage collected anyway. So the completed/ongoing counting is basically useless. Or am I missing something? Wouldn't it be enough to do the list_del_rcu on delete? And just keep the reference as before? - Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
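To make the numbered scenario above concrete, here is a minimal self-contained userspace model of the counter scheme (illustrative C only, not kernel code; it compresses "a walk parks and a state it may still reach is deleted" into a single step, exactly as the scenario does):

#include <assert.h>
#include <stdio.h>

static unsigned long ongoing;	/* models xfrm_state_walk_ongoing */
static unsigned long completed;	/* models xfrm_state_walk_completed */

struct state { unsigned long lastused; };

/* The GC in the patch destroys x once (long)(x->lastused - completed)
 * is no longer positive. */
static int gc_would_free(struct state *x)
{
	return (long)(x->lastused - completed) <= 0;
}

int main(void)
{
	struct state a, b;

	a.lastused = ongoing++;	/* step 2: 1st walk parks, a is deleted */
	b.lastused = ongoing++;	/* step 3: 2nd walk parks, b is deleted */
	completed++;		/* step 4: 2nd walk finishes first */

	/* step 5: the GC now considers 'a' freeable even though the
	 * 1st walk has not resumed yet -- the flaw in the scheme */
	assert(gc_would_free(&a));
	printf("gc would free 'a': yes\n");
	return 0;
}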
* Re: xfrm_state locking regression... 2008-09-21 12:29 ` Timo Teräs @ 2008-09-21 15:21 ` Timo Teräs 2008-09-22 11:42 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-21 15:21 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Timo Teräs wrote: > Herbert Xu wrote: >> On Mon, Sep 08, 2008 at 05:25:13PM -0700, David Miller wrote: >> I've avoided the memory barrier by simply extending the mutexed >> section in the GC to cover the list splicing. Here's the updated >> patch: >> >> ipsec: Use RCU-like construct for saved state within a walk >> >> Now that we save states within a walk we need synchronisation >> so that the list the saved state is on doesn't disappear from >> under us. >> >> As it stands this is done by keeping the state on the list which >> is bad because it gets in the way of the management of the state >> life-cycle. >> >> An alternative is to make our own pseudo-RCU system where we use >> counters to indicate which state can't be freed immediately as >> it may be referenced by an ongoing walk when that resumes. > > Doesn't this logic fail if: > 1. completed = ongoing > 2. 1st walk started and iterator kept (a->lastused = ongoing, ongoing++) > 3. 2nd walk started and iterator kept (b->lastused = ongoing, ongoing++) > 4. 2nd walk finished (completed++) > 5. gc triggered: a gets deleted since a->lastused == completed > 6. 1st walk continued, but freed memory is accessed as a was deleted > > Though currently it has no effect, since xfrm_state_hold/_put > are still called when keeping the iterator, so the entries won't > actually get garbage collected anyway. So the completed/ongoing > counting is basically useless. Or am I missing something? Obviously I missed the fact that it prevents deletion of the entries that the list node held by the iterator might still point to. But the flaw still exists: the entries that iterator->next points to can be deleted if there is a walk that takes a long time while other walks complete fast in the meantime. > Wouldn't it be enough to do the list_del_rcu on delete? And > just keep the reference as before? So if we do list_del_rcu() on delete, could we also xfrm_state_hold() the entry pointed to by that list entry? And then on GC we could xfrm_state_put() the next entry. - Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-21 15:21 ` Timo Teräs @ 2008-09-22 11:42 ` Herbert Xu 2008-09-22 13:01 ` Timo Teräs 2008-09-23 2:48 ` David Miller 0 siblings, 2 replies; 95+ messages in thread From: Herbert Xu @ 2008-09-22 11:42 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev On Sun, Sep 21, 2008 at 06:21:27PM +0300, Timo Teräs wrote: > > But the flaw still exists: the entries that iterator->next points to > can be deleted if there is a walk that takes a long time while other > walks complete fast in the meantime. Thanks, I guess that's why I'm not paid to work on RCU :) > So if we do list_del_rcu() on delete, could we also xfrm_state_hold() > the entry pointed to by that list entry? And then on GC we could > xfrm_state_put() the next entry. Unfortunately it's not that simple since we'll be in the same bind if the entry after the next entry gets deleted as well as the next entry. The only other solution is to go back to the original scheme where we keep the list intact until GC. However, nobody has come up with a way of doing that that allows the SA creation path to run locklessly with respect to the dumping path. So I think we'll have to stick with the quasi-RCU solution for now. Here's a patch to fix the flaw that you discovered. ipsec: Fix xfrm_state_walk race As discovered by Timo Teräs, the current xfrm_state_walk scheme is racy because if a second dump finishes before the first, we may free xfrm states that the first dump would walk over later. This patch fixes this by storing the dumps in a list in order to calculate the correct completion counter which cures this problem. I've expanded netlink_cb in order to accommodate the extra state related to this. It shouldn't be a big deal since netlink_cb is kmalloced for each dump and we're just increasing it by 4 or 8 bytes. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> diff --git a/include/linux/netlink.h b/include/linux/netlink.h index 9ff1b54..cbba776 100644 --- a/include/linux/netlink.h +++ b/include/linux/netlink.h @@ -220,7 +220,7 @@ struct netlink_callback int (*dump)(struct sk_buff * skb, struct netlink_callback *cb); int (*done)(struct netlink_callback *cb); int family; - long args[6]; + long args[7]; }; struct netlink_notify diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 4bb9499..48630b2 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -1246,6 +1246,8 @@ struct xfrm6_tunnel { }; struct xfrm_state_walk { + struct list_head list; + unsigned long genid; struct xfrm_state *state; int count; u8 proto; @@ -1281,13 +1283,7 @@ static inline void xfrm6_fini(void) extern int xfrm_proc_init(void); #endif -static inline void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto) -{ - walk->proto = proto; - walk->state = NULL; - walk->count = 0; -} - +extern void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto); extern int xfrm_state_walk(struct xfrm_state_walk *walk, int (*func)(struct xfrm_state *, int, void*), void *); extern void xfrm_state_walk_done(struct xfrm_state_walk *walk); diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index fe84a46..f7805c5 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -64,6 +64,9 @@ static unsigned long xfrm_state_walk_ongoing; /* Counter indicating walk completion, protected by xfrm_cfg_mutex. */ static unsigned long xfrm_state_walk_completed; +/* List of outstanding state walks used to set the completed counter. 
*/ +static LIST_HEAD(xfrm_state_walks); + static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned int family); static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo); @@ -1582,7 +1585,6 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, if (err) { xfrm_state_hold(last); walk->state = last; - xfrm_state_walk_ongoing++; goto out; } } @@ -1597,25 +1599,44 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, err = func(last, 0, data); out: spin_unlock_bh(&xfrm_state_lock); - if (old != NULL) { + if (old != NULL) xfrm_state_put(old); - xfrm_state_walk_completed++; - if (!list_empty(&xfrm_state_gc_leftovers)) - schedule_work(&xfrm_state_gc_work); - } return err; } EXPORT_SYMBOL(xfrm_state_walk); +void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto) +{ + walk->proto = proto; + walk->state = NULL; + walk->count = 0; + list_add_tail(&walk->list, &xfrm_state_walks); + walk->genid = ++xfrm_state_walk_ongoing; +} +EXPORT_SYMBOL(xfrm_state_walk_init); + void xfrm_state_walk_done(struct xfrm_state_walk *walk) { + struct list_head *prev; + if (walk->state != NULL) { xfrm_state_put(walk->state); walk->state = NULL; - xfrm_state_walk_completed++; - if (!list_empty(&xfrm_state_gc_leftovers)) - schedule_work(&xfrm_state_gc_work); } + + prev = walk->list.prev; + list_del(&walk->list); + + if (prev != &xfrm_state_walks) { + list_entry(prev, struct xfrm_state_walk, list)->genid = + walk->genid; + return; + } + + xfrm_state_walk_completed = walk->genid; + + if (!list_empty(&xfrm_state_gc_leftovers)) + schedule_work(&xfrm_state_gc_work); } EXPORT_SYMBOL(xfrm_state_walk_done); Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply related [flat|nested] 95+ messages in thread
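To see how the list-based bookkeeping above fixes the completion ordering, here is a self-contained userspace model of the init/done logic (a hand-rolled doubly linked list and simplified names standing in for the kernel structures; illustrative only):

#include <stdio.h>

struct walk { unsigned long genid; struct walk *prev, *next; };

static struct walk head = { 0, &head, &head };	/* models xfrm_state_walks */
static unsigned long ongoing, completed;

static void walk_init(struct walk *w)
{
	w->prev = head.prev;		/* add at the tail, like list_add_tail */
	w->next = &head;
	head.prev->next = w;
	head.prev = w;
	w->genid = ++ongoing;
}

static void walk_done(struct walk *w)
{
	struct walk *prev = w->prev;

	w->prev->next = w->next;	/* unlink, like list_del */
	w->next->prev = w->prev;

	if (prev != &head)
		prev->genid = w->genid;	/* an older walk is still running */
	else
		completed = w->genid;	/* oldest walk done: safe to advance */
}

int main(void)
{
	struct walk w1, w2;

	walk_init(&w1);		/* genid 1 */
	walk_init(&w2);		/* genid 2 */
	walk_done(&w2);		/* completed stays 0: w1 still outstanding */
	printf("completed after 2nd walk: %lu\n", completed);
	walk_done(&w1);		/* completed jumps straight to 2 */
	printf("completed after 1st walk: %lu\n", completed);
	return 0;
}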
* Re: xfrm_state locking regression... 2008-09-22 11:42 ` Herbert Xu @ 2008-09-22 13:01 ` Timo Teräs 2008-09-22 23:50 ` Herbert Xu 2008-09-23 2:48 ` David Miller 1 sibling, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-22 13:01 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Herbert Xu wrote: > On Sun, Sep 21, 2008 at 06:21:27PM +0300, Timo Teräs wrote: >> So if we do list_del_rcu() on delete, could we also xfrm_state_hold() >> the entry pointed to by that list entry. And then on GC we could >> xfrm_state_put() the next entry. > > Unfortunately it's not that simple since we'll be in the same > bind if the entry after the next entry gets deleted as well as > the next entry. Well, I was thinking that we hold the next pointer. And when continuing the dump, we can first skip all entries that are marked as dead (each next pointer is valid since each of the next pointers are held once). When we find the first valid entry to dump we _put() the originally held entry. That would recursively _put() all the next entries which were held. > The only other solution is to go back to the original scheme > where we keep the list intact until GC. However, nobody has > come up with a way of doing that that allows the SA creation > path to run locklessly with respect to the dumping path. > > So I think we'll have to stick with the quasi-RCU solution for > now. Here's a patch to fix the flaw that you discovered. But yes, this would work as well. Not sure which one would be faster. I guess the holding of individual entries would be at least more memory efficient. Cheers, Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-22 13:01 ` Timo Teräs @ 2008-09-22 23:50 ` Herbert Xu 2008-09-23 4:53 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-22 23:50 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev On Mon, Sep 22, 2008 at 04:01:14PM +0300, Timo Teräs wrote: > > > Unfortunately it's not that simple since we'll be in the same > > bind if the entry after the next entry gets deleted as well as > > the next entry. > > Well, I was thinking that we hold the next pointer. And when > continuing the dump, we can first skip all entries that are marked > as dead (each next pointer is valid since each of the next pointers > are held once). When we find the first valid entry to dump we > _put() the originally held entry. That would recursively _put() all > the next entries which were held. No that doesn't work. Let's say we store the entry X in walk->state, and we hold X as well as X->next. Now X, X->next, and X->next->next get deleted from the list. What'll happen is that X and X->next will stick around but X->next->next will be freed. So when we resume from X we'll dump X and X->next correctly, but then hit X->next->next and be in the same shithole. Really, the only way to do what you want here is to hold everything from X to the very end of the list. The problem with that is that you can no longer drop the reference counts when we resume. You can see why even if we only hold X and X->next. Should X->next be deleted from the list, when we resume from X we'll no longer be able to drop the reference count on the original X->next since it now points to a new entry. > Not sure which one would be faster. I guess the holding of > individual entries would be at least more memory efficient. I think this is a lot more efficient than storing every single entry from X to the end of the list :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-22 23:50 ` Herbert Xu @ 2008-09-23 4:53 ` Timo Teräs 2008-09-23 4:59 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-23 4:53 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Herbert Xu wrote: > On Mon, Sep 22, 2008 at 04:01:14PM +0300, Timo Teräs wrote: >>> Unfortunately it's not that simple since we'll be in the same >>> bind if the entry after the next entry gets deleted as well as >>> the next entry. >> Well, I was thinking that we hold the next pointer. And when >> continuing the dump, we can first skip all entries that are marked >> as dead (each next pointer is valid since each of the next pointers >> are held once). When we find the first valid entry to dump we >> _put() the originally held entry. That would recursively _put() all >> the next entries which were held. > > No that doesn't work. Let's say we store the entry X in walk->state, > and we hold X as well as X->next. Now X, X->next, and X->next->next > get deleted from the list. What'll happen is that X and X->next > will stick around but X->next->next will be freed. So when we > resume from X we'll dump X and X->next correctly, but then hit > X->next->next and be in the same shithole. I think it would work. Here are the scenarios: We hold X as dumping is interrupted there. X->next points statically to some non-deleted entry and is held. Now, if X->next gets deleted, it's marked dead and X->next->next is held too. Thus when there are multiple deleted entries in a chain, the whole chain is held recursively/iteratively. When walking is continued on X, the first thing we do is skip all dead entries from X; after that we put X, and that would trigger put() for all X->next:s which were held. If X->next is not deleted, and X->next->next gets deleted, the X->next list structure is updated correctly by list_del_rcu and the entry can actually be freed even if the walking didn't iterate that entry (it would be skipped anyway as it's marked dead on deletion). So the idea was to hold X->next from the deletion function, not from the walking function. That is, we always hold deleted->next when there are ongoing walks. And on final _put() we _put() the ->next entry. I think that would work. - Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
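For concreteness, a self-contained userspace model of the hold-the-next-on-delete idea just described (a plain integer refcount standing in for xfrm_state_hold()/xfrm_state_put(); the next pointer is captured at delete time rather than read from a real list, so all names here are illustrative):

#include <stdio.h>
#include <stdlib.h>

struct node {
	char name;
	int refcnt;		/* stands in for the state refcount */
	struct node *held_next;	/* reference taken at delete time */
};

static void put(struct node *n);

static void destroy(struct node *n)
{
	struct node *next = n->held_next;

	printf("destroy %c\n", n->name);
	free(n);
	if (next)
		put(next);	/* drop the hold taken by del() */
}

static void put(struct node *n)
{
	if (--n->refcnt == 0)
		destroy(n);
}

static void del(struct node *n, struct node *next)
{
	n->held_next = next;	/* keep n->next valid for parked walks */
	if (next)
		next->refcnt++;
}

static struct node *mk(char name)
{
	struct node *n = calloc(1, sizeof(*n));

	n->name = name;
	n->refcnt = 1;		/* the deleting path's own reference */
	return n;
}

int main(void)
{
	struct node *a = mk('a'), *b = mk('b'), *c = mk('c');

	a->refcnt++;		/* a parked walk holds a */
	del(a, b);		/* a, b and c all deleted while the walk sleeps */
	del(b, c);
	del(c, NULL);
	put(a);			/* the deleting path drops its references... */
	put(b);
	put(c);			/* ...but the held chain keeps everything alive */
	put(a);			/* the walk resumes: the whole chain is freed, once each */
	return 0;
}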
* Re: xfrm_state locking regression... 2008-09-23 4:53 ` Timo Teräs @ 2008-09-23 4:59 ` Herbert Xu 2008-09-23 5:17 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-23 4:59 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev On Tue, Sep 23, 2008 at 07:53:18AM +0300, Timo Teräs wrote: > > So the idea was to hold X->next from deletion function, not from > the walking function. That would be, we always hold deleted->next when > there are ongoing walks. And on final _put() we _put() the ->next > entry. > > I think that would work. Right, this would work. However, now you're again slowing down the relatively fast path of state deletion by moving extra work there from the non-critical dump path. When we optimise code for time, we don't necessarily try to minimise the work overall. We instead try to minimise the amount of work on the critical paths. It's OK to let the other paths such as dumping do a bit more work if it improves the fast paths. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 4:59 ` Herbert Xu @ 2008-09-23 5:17 ` Timo Teräs 2008-09-23 5:22 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-23 5:17 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Herbert Xu wrote: > On Tue, Sep 23, 2008 at 07:53:18AM +0300, Timo Teräs wrote: >> So the idea was to hold X->next from the deletion function, not from >> the walking function. That is, we always hold deleted->next when >> there are ongoing walks. And on final _put() we _put() the ->next >> entry. >> >> I think that would work. > > Right, this would work. > > However, now you're again slowing down the relatively fast path of > state deletion by moving extra work there from the non-critical > dump path. The extra step there would be a hold call. The recursive _put on a _put of some entry can happen only on the dump path, as otherwise the ->next entry is first held in state delete but would be immediately _put again via the final _put of _delete(). So basically one additional atomic_inc() and one atomic_dec() on the normal _delete() path. > When we optimise code for time, we don't necessarily try to minimise > the work overall. We instead try to minimise the amount of work on > the critical paths. It's OK to let the other paths such as dumping > do a bit more work if it improves the fast paths. Not sure about the overall penalty, but yes I know it would have some penalty. But at least there would be no locking. In particular, the thing that can happen with your approach is that if a userland process gets suspended during dump processing, it would prevent the garbage collection of all entries for all eternity until that process is continued or killed. So this would allow deletion and GC of entries even during walking. But this was just a thought. Maybe it's not worth trying. Cheers, Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 5:17 ` Timo Teräs @ 2008-09-23 5:22 ` Herbert Xu 2008-09-23 6:25 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-23 5:22 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev On Tue, Sep 23, 2008 at 08:17:14AM +0300, Timo Teräs wrote: > > The extra step there wold be a hold call. The recursive _put on a > _put of some entry can happen only on dump path. As otherwise the > ->next entry is first held in state delete, but would be immediately > _put on the _put as the final step of _delete(). > > So basically one additional atomic_inc() and one atomic_dec() on the > normal _delete() path. Can you post a patch? If this can be done locklessly then yes it would probably be a good way to go. However, I'm not sure whether I understand your lockless proposal yet. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 5:22 ` Herbert Xu @ 2008-09-23 6:25 ` Timo Teräs 2008-09-23 6:47 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-23 6:25 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Herbert Xu wrote: > On Tue, Sep 23, 2008 at 08:17:14AM +0300, Timo Teräs wrote: >> The extra step there wold be a hold call. The recursive _put on a >> _put of some entry can happen only on dump path. As otherwise the >> ->next entry is first held in state delete, but would be immediately >> _put on the _put as the final step of _delete(). >> >> So basically one additional atomic_inc() and one atomic_dec() on the >> normal _delete() path. > > Can you post a patch? If this can be done locklessly then yes > it would probably be a good way to go. However, I'm not sure > whether I understand your lockless proposal yet. Something like this. I just compile tested, so I'm not sure if it actually works. This basically reverts the GC changes. The interesting bits are in xfrm_state_delete(). It will _hold() the ->next entry unless we were at last entry. This will make sure that when an entry is in XFRM_STATE_DEAD the all.next is valid all the way until the entry is destroyed. The corresponding put is in _destroy(). Added it as a final thing to do so hopefully compiler optimizes tail recursion to iteration. (Seemed to do that in my case.) And finally, _walk() was right all along. It returns to dumping and first skips all dead entries. And just before return, when all dead entries are already skipped the originally held entry is _put(). That _put call (or the one in _walk_done()) will result all dead entries that were held, to be iteratively put. - Timo diff --git a/include/linux/netlink.h b/include/linux/netlink.h index cbba776..9ff1b54 100644 --- a/include/linux/netlink.h +++ b/include/linux/netlink.h @@ -220,7 +220,7 @@ struct netlink_callback int (*dump)(struct sk_buff * skb, struct netlink_callback *cb); int (*done)(struct netlink_callback *cb); int family; - long args[7]; + long args[6]; }; struct netlink_notify diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 48630b2..d5cba7b 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -122,7 +122,7 @@ struct xfrm_state { struct list_head all; union { - struct list_head gclist; + struct hlist_node gclist; struct hlist_node bydst; }; struct hlist_node bysrc; @@ -1246,8 +1246,6 @@ struct xfrm6_tunnel { }; struct xfrm_state_walk { - struct list_head list; - unsigned long genid; struct xfrm_state *state; int count; u8 proto; diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 053970e..f45f006 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -59,14 +59,6 @@ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024; static unsigned int xfrm_state_num; static unsigned int xfrm_state_genid; -/* Counter indicating ongoing walk, protected by xfrm_state_lock. */ -static unsigned long xfrm_state_walk_ongoing; -/* Counter indicating walk completion, protected by xfrm_cfg_mutex. */ -static unsigned long xfrm_state_walk_completed; - -/* List of outstanding state walks used to set the completed counter. 
*/ -static LIST_HEAD(xfrm_state_walks); - static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned int family); static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo); @@ -199,8 +191,7 @@ static DEFINE_RWLOCK(xfrm_state_afinfo_lock); static struct xfrm_state_afinfo *xfrm_state_afinfo[NPROTO]; static struct work_struct xfrm_state_gc_work; -static LIST_HEAD(xfrm_state_gc_leftovers); -static LIST_HEAD(xfrm_state_gc_list); +static HLIST_HEAD(xfrm_state_gc_list); static DEFINE_SPINLOCK(xfrm_state_gc_lock); int __xfrm_state_delete(struct xfrm_state *x); @@ -412,23 +403,17 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x) static void xfrm_state_gc_task(struct work_struct *data) { - struct xfrm_state *x, *tmp; - unsigned long completed; + struct xfrm_state *x; + struct hlist_node *entry, *tmp; + struct hlist_head gc_list; - mutex_lock(&xfrm_cfg_mutex); spin_lock_bh(&xfrm_state_gc_lock); - list_splice_tail_init(&xfrm_state_gc_list, &xfrm_state_gc_leftovers); + gc_list.first = xfrm_state_gc_list.first; + INIT_HLIST_HEAD(&xfrm_state_gc_list); spin_unlock_bh(&xfrm_state_gc_lock); - completed = xfrm_state_walk_completed; - mutex_unlock(&xfrm_cfg_mutex); - - list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) { - if ((long)(x->lastused - completed) > 0) - break; - list_del(&x->gclist); + hlist_for_each_entry_safe(x, entry, tmp, &gc_list, gclist) xfrm_state_gc_destroy(x); - } wake_up(&km_waitq); } @@ -553,12 +538,19 @@ EXPORT_SYMBOL(xfrm_state_alloc); void __xfrm_state_destroy(struct xfrm_state *x) { + struct xfrm_state *next = NULL; + WARN_ON(x->km.state != XFRM_STATE_DEAD); + if (x->all.next != &xfrm_state_all) + next = container_of(x->all.next, struct xfrm_state, all); spin_lock_bh(&xfrm_state_gc_lock); - list_add_tail(&x->gclist, &xfrm_state_gc_list); + hlist_add_head(&x->gclist, &xfrm_state_gc_list); spin_unlock_bh(&xfrm_state_gc_lock); schedule_work(&xfrm_state_gc_work); + + if (next != NULL) + xfrm_state_put(next); } EXPORT_SYMBOL(__xfrm_state_destroy); @@ -569,8 +561,10 @@ int __xfrm_state_delete(struct xfrm_state *x) if (x->km.state != XFRM_STATE_DEAD) { x->km.state = XFRM_STATE_DEAD; spin_lock(&xfrm_state_lock); - x->lastused = xfrm_state_walk_ongoing; list_del_rcu(&x->all); + if (x->all.next != &xfrm_state_all) + xfrm_state_hold(container_of(x->all.next, + struct xfrm_state, all)); hlist_del(&x->bydst); hlist_del(&x->bysrc); if (x->id.spi) @@ -1612,33 +1606,15 @@ void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto) walk->proto = proto; walk->state = NULL; walk->count = 0; - list_add_tail(&walk->list, &xfrm_state_walks); - walk->genid = ++xfrm_state_walk_ongoing; } EXPORT_SYMBOL(xfrm_state_walk_init); void xfrm_state_walk_done(struct xfrm_state_walk *walk) { - struct list_head *prev; - if (walk->state != NULL) { xfrm_state_put(walk->state); walk->state = NULL; } - - prev = walk->list.prev; - list_del(&walk->list); - - if (prev != &xfrm_state_walks) { - list_entry(prev, struct xfrm_state_walk, list)->genid = - walk->genid; - return; - } - - xfrm_state_walk_completed = walk->genid; - - if (!list_empty(&xfrm_state_gc_leftovers)) - schedule_work(&xfrm_state_gc_work); } EXPORT_SYMBOL(xfrm_state_walk_done); ^ permalink raw reply related [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 6:25 ` Timo Teräs @ 2008-09-23 6:47 ` Herbert Xu 2008-09-23 6:56 ` Timo Teräs 2008-09-23 9:39 ` Timo Teräs 0 siblings, 2 replies; 95+ messages in thread From: Herbert Xu @ 2008-09-23 6:47 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev On Tue, Sep 23, 2008 at 09:25:16AM +0300, Timo Teräs wrote: > > Something like this. I just compile tested, so I'm not sure if it > actually works. It is quite clever :) > The corresponding put is in _destroy(). Added it as a final thing > to do so hopefully compiler optimizes tail recursion to iteration. > (Seemed to do that in my case.) Yeah we should open code this to be absolutely sure that we don't get killed by a stupid compiler. However, I think this is still open to the same problem that my patch had, i.e., if a dumper goes to sleep during the dump we may prevent entries from being freed indefinitely. Yes your idea is better in that we may only withhold say a fraction (depending on the order of deletion it could be anywhere from none to every entry) of the entries instead of all of them, but fundamentally the same issue is still there. Considering the fact that dumps require root privileges I'm not convinced as of yet that this is worth it. Hmm, could we perhaps go back to your original scheme of keeping everything on the list and see if we can use RCU to make it lockless instead? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
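For illustration, here is one way the tail call in Timo's __xfrm_state_destroy() above could be open-coded as a loop. This is only a sketch, assuming the era's usual definition of xfrm_state_put() as an atomic_dec_and_test() wrapper around __xfrm_state_destroy(); it was never posted as a patch:

void __xfrm_state_destroy(struct xfrm_state *x)
{
	struct xfrm_state *next;

	for (;;) {
		WARN_ON(x->km.state != XFRM_STATE_DEAD);

		/* capture the held neighbour before x is queued for gc */
		next = NULL;
		if (x->all.next != &xfrm_state_all)
			next = container_of(x->all.next,
					    struct xfrm_state, all);

		spin_lock_bh(&xfrm_state_gc_lock);
		hlist_add_head(&x->gclist, &xfrm_state_gc_list);
		spin_unlock_bh(&xfrm_state_gc_lock);
		schedule_work(&xfrm_state_gc_work);

		/* drop the hold taken at delete time; loop instead of
		 * letting xfrm_state_put() recurse back into us */
		if (next == NULL || !atomic_dec_and_test(&next->refcnt))
			break;
		x = next;
	}
}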
* Re: xfrm_state locking regression... 2008-09-23 6:47 ` Herbert Xu @ 2008-09-23 6:56 ` Timo Teräs 2008-09-23 9:39 ` Timo Teräs 1 sibling, 0 replies; 95+ messages in thread From: Timo Teräs @ 2008-09-23 6:56 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Herbert Xu wrote: > On Tue, Sep 23, 2008 at 09:25:16AM +0300, Timo Teräs wrote: >> Something like this. I just compile tested, so I'm not sure if it >> actually works. > > It is quite clever :) Thanks :) And actually, we can make _delete() do normal list_del() and not hold the next pointer if there are no running dumps. This is the common case anyway. Then there would be no penalty at all. >> The corresponding put is in _destroy(). Added it as a final thing >> to do so hopefully compiler optimizes tail recursion to iteration. >> (Seemed to do that in my case.) > > Yeah we should open code this to be absolutely sure that we don't > get killed by a stupid compiler. Right. The patch was just to show the idea. Will do a proper patch with this and the above mentioned optimization if you think we should change to this until we get really cool lockless dumping. > However, I think this is still open to the same problem that my > patch had, i.e., if a dumper goes to sleep during the dump we may > prevent entries from being freed indefinitely. Yes your idea > is better in that we may only withhold say a fraction (depending > on the order of deletion it could be anywhere from none to every > entry) of the entries instead of all of them, but fundamentally > the same issue is still there. > > Considering the fact that dumps require root privileges I'm not > convinced as of yet that this is worth it. You are right. It can keep memory allocated. And in worst case you end up keeping all the deleted entries (entries are deleted in the order they appear in all list). But I would not be to worried about it either. > Hmm, could we perhaps go back to your original scheme of keeping > everything on the list and see if we can use RCU to make it lockless > instead? That would be ideal of course. - Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 6:47 ` Herbert Xu 2008-09-23 6:56 ` Timo Teräs @ 2008-09-23 9:39 ` Timo Teräs 2008-09-23 11:24 ` Herbert Xu 1 sibling, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-23 9:39 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Herbert Xu wrote: > However, I think this is still open to the same problem that my > patch had, i.e., if a dumper goes to sleep during the dump we may > prevent entries from being freed indefinitely. Yes your idea > is better in that we may only withhold say a fraction (depending > on the order of deletion it could be anywhere from none to every > entry) of the entries instead of all of them, but fundamentally > the same issue is still there. > > Considering the fact that dumps require root privileges I'm not > convinced as of yet that this is worth it. > > Hmm, could we perhaps go back to your original scheme of keeping > everything on the list and see if we can use RCU to make it lockless > instead? How about this: we keep the list of walks as your latest patch does. When walking is interrupted, we do not hold the entry, we just store the pointer in the walk iterator. When the entry is deleted from the lists we go through the walk contexts, and if someone is pointing to the entry being deleted, we just update it to the next entry. The list of walkers can be protected by xfrm_state_lock. That needs to be taken anyway for accessing/modifying the other lists. Most of the time there are no walks active, so the penalty for _destroy() is minimal. This would make it possible to reclaim the deleted entries right away. Does this sound better? Cheers, Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 9:39 ` Timo Teräs @ 2008-09-23 11:24 ` Herbert Xu 2008-09-23 12:08 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-23 11:24 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev On Tue, Sep 23, 2008 at 12:39:51PM +0300, Timo Teräs wrote: > > This would make it possibly to reclaim the deleted entries right > away. > > Does this sound better? Yep this sounds pretty good to me. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 11:24 ` Herbert Xu @ 2008-09-23 12:08 ` Timo Teräs 2008-09-23 12:14 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-23 12:08 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Herbert Xu wrote: > On Tue, Sep 23, 2008 at 12:39:51PM +0300, Timo Teräs wrote: >> This would make it possible to reclaim the deleted entries right >> away. >> >> Does this sound better? > > Yep this sounds pretty good to me. > > Thanks, So the patch would look something like this. Compile tested. Will test later when it becomes possible to reboot my box. Cheers, Timo ipsec: Fix up xfrm_state_walk.state on node deletion Now that we track xfrm_state_walks, it makes more sense to fix up the state pointer on deletion of the node if needed. This allows all entries to be deleted immediately, instead of possibly waiting for userland. Also fixes the locking of the xfrm_state_walks list handling. Signed-off-by: Timo Teras <timo.teras@iki.fi> --- include/linux/netlink.h | 2 +- include/net/xfrm.h | 3 +- net/xfrm/xfrm_state.c | 77 +++++++++++++++++----------------------------- 3 files changed, 31 insertions(+), 51 deletions(-) diff --git a/include/linux/netlink.h b/include/linux/netlink.h index cbba776..9ff1b54 100644 --- a/include/linux/netlink.h +++ b/include/linux/netlink.h @@ -220,7 +220,7 @@ struct netlink_callback int (*dump)(struct sk_buff * skb, struct netlink_callback *cb); int (*done)(struct netlink_callback *cb); int family; - long args[7]; + long args[6]; }; struct netlink_notify diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 48630b2..7f787c7 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -122,7 +122,7 @@ struct xfrm_state { struct list_head all; union { - struct list_head gclist; + struct hlist_node gclist; struct hlist_node bydst; }; struct hlist_node bysrc; @@ -1247,7 +1247,6 @@ struct xfrm6_tunnel { struct xfrm_state_walk { struct list_head list; - unsigned long genid; struct xfrm_state *state; int count; u8 proto; diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 053970e..636f7ee 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -59,11 +59,6 @@ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024; static unsigned int xfrm_state_num; static unsigned int xfrm_state_genid; -/* Counter indicating ongoing walk, protected by xfrm_state_lock. */ -static unsigned long xfrm_state_walk_ongoing; -/* Counter indicating walk completion, protected by xfrm_cfg_mutex. */ -static unsigned long xfrm_state_walk_completed; - /* List of outstanding state walks used to set the completed counter. 
*/ static LIST_HEAD(xfrm_state_walks); @@ -199,8 +194,7 @@ static DEFINE_RWLOCK(xfrm_state_afinfo_lock); static struct xfrm_state_afinfo *xfrm_state_afinfo[NPROTO]; static struct work_struct xfrm_state_gc_work; -static LIST_HEAD(xfrm_state_gc_leftovers); -static LIST_HEAD(xfrm_state_gc_list); +static HLIST_HEAD(xfrm_state_gc_list); static DEFINE_SPINLOCK(xfrm_state_gc_lock); int __xfrm_state_delete(struct xfrm_state *x); @@ -412,23 +406,16 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x) static void xfrm_state_gc_task(struct work_struct *data) { - struct xfrm_state *x, *tmp; - unsigned long completed; + struct xfrm_state *x; + struct hlist_node *entry, *tmp; + struct hlist_head gc_list; - mutex_lock(&xfrm_cfg_mutex); spin_lock_bh(&xfrm_state_gc_lock); - list_splice_tail_init(&xfrm_state_gc_list, &xfrm_state_gc_leftovers); + hlist_move_list(&xfrm_state_gc_list, &gc_list); spin_unlock_bh(&xfrm_state_gc_lock); - completed = xfrm_state_walk_completed; - mutex_unlock(&xfrm_cfg_mutex); - - list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) { - if ((long)(x->lastused - completed) > 0) - break; - list_del(&x->gclist); + hlist_for_each_entry_safe(x, entry, tmp, &gc_list, gclist) xfrm_state_gc_destroy(x); - } wake_up(&km_waitq); } @@ -556,7 +543,7 @@ void __xfrm_state_destroy(struct xfrm_state *x) WARN_ON(x->km.state != XFRM_STATE_DEAD); spin_lock_bh(&xfrm_state_gc_lock); - list_add_tail(&x->gclist, &xfrm_state_gc_list); + hlist_add_head(&x->gclist, &xfrm_state_gc_list); spin_unlock_bh(&xfrm_state_gc_lock); schedule_work(&xfrm_state_gc_work); } @@ -564,13 +551,22 @@ EXPORT_SYMBOL(__xfrm_state_destroy); int __xfrm_state_delete(struct xfrm_state *x) { + struct xfrm_state_walk *walk; + struct xfrm_state *next; int err = -ESRCH; if (x->km.state != XFRM_STATE_DEAD) { x->km.state = XFRM_STATE_DEAD; spin_lock(&xfrm_state_lock); - x->lastused = xfrm_state_walk_ongoing; - list_del_rcu(&x->all); + if (list_is_last(&x->all, &xfrm_state_all)) + next = NULL; + else + next = container_of(x->all.next, struct xfrm_state, all); + list_for_each_entry(walk, &xfrm_state_walks, list) { + if (walk->state == x) + walk->state = next; + } + list_del(&x->all); hlist_del(&x->bydst); hlist_del(&x->bysrc); if (x->id.spi) @@ -1566,15 +1562,16 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, int (*func)(struct xfrm_state *, int, void*), void *data) { - struct xfrm_state *old, *x, *last = NULL; + struct xfrm_state *x, *last = NULL; int err = 0; if (walk->state == NULL && walk->count != 0) return 0; - old = x = walk->state; - walk->state = NULL; spin_lock_bh(&xfrm_state_lock); + x = walk->state; + walk->state = NULL; + if (x == NULL) x = list_first_entry(&xfrm_state_all, struct xfrm_state, all); list_for_each_entry_from(x, &xfrm_state_all, all) { @@ -1585,7 +1582,6 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, if (last) { err = func(last, walk->count, data); if (err) { - xfrm_state_hold(last); walk->state = last; goto out; } @@ -1601,8 +1597,7 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, err = func(last, 0, data); out: spin_unlock_bh(&xfrm_state_lock); - if (old != NULL) - xfrm_state_put(old); + return err; } EXPORT_SYMBOL(xfrm_state_walk); @@ -1612,33 +1607,19 @@ void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto) walk->proto = proto; walk->state = NULL; walk->count = 0; + + spin_lock_bh(&xfrm_state_lock); list_add_tail(&walk->list, &xfrm_state_walks); - walk->genid = ++xfrm_state_walk_ongoing; + spin_unlock_bh(&xfrm_state_lock); } EXPORT_SYMBOL(xfrm_state_walk_init); 
void xfrm_state_walk_done(struct xfrm_state_walk *walk) { - struct list_head *prev; - - if (walk->state != NULL) { - xfrm_state_put(walk->state); - walk->state = NULL; - } - - prev = walk->list.prev; + spin_lock_bh(&xfrm_state_lock); list_del(&walk->list); - - if (prev != &xfrm_state_walks) { - list_entry(prev, struct xfrm_state_walk, list)->genid = - walk->genid; - return; - } - - xfrm_state_walk_completed = walk->genid; - - if (!list_empty(&xfrm_state_gc_leftovers)) - schedule_work(&xfrm_state_gc_work); + spin_unlock_bh(&xfrm_state_lock); + walk->state = NULL; } EXPORT_SYMBOL(xfrm_state_walk_done); -- 1.5.4.3 ^ permalink raw reply related [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 12:08 ` Timo Teräs @ 2008-09-23 12:14 ` Herbert Xu 2008-09-23 12:25 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-23 12:14 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev On Tue, Sep 23, 2008 at 03:08:08PM +0300, Timo Teräs wrote: > > + list_for_each_entry(walk, &xfrm_state_walks, list) { > + if (walk->state == x) > + walk->state = next; > + } Just make a new list in xfrm_state that has all the dumpers sitting on it. In fact all you need is a hlist, or even just a pointer. Walking through an unbounded list on the fast path is not good :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 12:14 ` Herbert Xu @ 2008-09-23 12:25 ` Timo Teräs 2008-09-23 12:56 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-23 12:25 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Herbert Xu wrote: > On Tue, Sep 23, 2008 at 03:08:08PM +0300, Timo Teräs wrote: >> + list_for_each_entry(walk, &xfrm_state_walks, list) { >> + if (walk->state == x) >> + walk->state = next; >> + } > > Just make a new list in xfrm_state that has all the dumpers sitting > on it. In fact all you need is a hlist, or even just a pointer. There can be a lot of xfrm_state structs. As in thousands. So it will take more memory then. hlist would be enough, so it'd be a 4/8 byte addition. A single pointer would not be enough as we can have multiple walks on the same entry. > Walking through an unbounded list on the fast path is not good :) I was thinking the same thing. That's why I wasn't so sure if this is better. In practice there aren't many walks active. Maybe we could limit the maximum number of simultaneous walks? But in real life, there are only one or two walkers active, if any. So I would not consider a global update too expensive. But if you think it might become a problem, I'd add an hlist to xfrm_state of walkers referencing that entry. - Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 12:25 ` Timo Teräs @ 2008-09-23 12:56 ` Herbert Xu 2008-09-23 13:01 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-23 12:56 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev On Tue, Sep 23, 2008 at 03:25:41PM +0300, Timo Teräs wrote: > > There can be a lot of xfrm_state structs. As in thousands. So it > will take more memory then. hlist would be enough, so it'd be a > 4/8 bytes addition. Single pointer would not be enough as we can > have multiple walks on the same entry. Right, we need doubly linked lists in the walkers to ease unlinking. But only a single pointer (i.e., an hlist) is needed in xfrm_state. > I was thinking about the same. That's why I wasn't so sure if this is > better. In practice there aren't many walks active. Maybe we limit > the maximum simultaneous walks? That sounds like a hack :) > But in real life, there is only a one or two walkers if any active. > So I would not consider a global update too expensive. But if you think > it might become a problem, I'd add an hlist to xfrm_state of walkers > referencing that entry. Well in real life dumps don't tend to get suspended either :) In any case, a suspended dump is only going to pin down some memory, while a very long list walk may kill the box. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
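For reference, a data-layout sketch of the hlist-in-xfrm_state idea being discussed. It was never posted as a patch in this thread, so every name below is hypothetical; the point is that deletion would only touch the walkers parked on the dying entry:

/* hypothetical node embedded in struct xfrm_state_walk: */
struct xfrm_state_walk {
	struct hlist_node by_state;	/* linkage on state->walkers */
	struct xfrm_state *state;	/* where the dump is parked */
	int count;
	u8 proto;
};

/* hypothetical single-pointer addition to struct xfrm_state: */
/*	struct hlist_head walkers; */

/* on delete, with xfrm_state_lock held, move the parked walkers
 * from the dying entry to its successor (or to NULL at the tail): */
static void xfrm_state_fixup_walkers(struct xfrm_state *x,
				     struct xfrm_state *next)
{
	struct xfrm_state_walk *w;
	struct hlist_node *pos, *tmp;

	hlist_for_each_entry_safe(w, pos, tmp, &x->walkers, by_state) {
		hlist_del(&w->by_state);
		w->state = next;
		if (next)
			hlist_add_head(&w->by_state, &next->walkers);
	}
}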
* Re: xfrm_state locking regression... 2008-09-23 12:56 ` Herbert Xu @ 2008-09-23 13:01 ` Timo Teräs 2008-09-23 13:07 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-23 13:01 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Herbert Xu wrote: > On Tue, Sep 23, 2008 at 03:25:41PM +0300, Timo Teräs wrote: >> There can be a lot of xfrm_state structs. As in thousands. So it >> will take more memory then. hlist would be enough, so it'd be a >> 4/8 byte addition. A single pointer would not be enough as we can >> have multiple walks on the same entry. > > Right, we need doubly linked lists in the walkers to ease unlinking. > But only a single pointer (i.e., an hlist) is needed in xfrm_state. Right. >> I was thinking the same thing. That's why I wasn't so sure if this is >> better. In practice there aren't many walks active. Maybe we could limit >> the maximum number of simultaneous walks? > > That sounds like a hack :) Yes :) >> But in real life, there are only one or two walkers active, if any. >> So I would not consider a global update too expensive. But if you think >> it might become a problem, I'd add an hlist to xfrm_state of walkers >> referencing that entry. > > Well in real life dumps don't tend to get suspended either :) > In any case, a suspended dump is only going to pin down some > memory, while a very long list walk may kill the box. So, what to do? 1. Go back to: list_del_rcu, xfrm_state_hold(all.next) on delete and xfrm_state_put(all.next) on destruct. 2. Add per-entry hlist of walkers currently referencing it. 3. Use the global walker list. 1 can keep memory allocated until userland wakes up. 2 & 3 can make the delete of that entry slow if there are many walkers suspended. Btw. the current stuff in net-next is broken. There's no locking for the xfrm_state_walks list handling. Cheers, Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 13:01 ` Timo Teräs @ 2008-09-23 13:07 ` Herbert Xu 2008-09-23 13:30 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-23 13:07 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev On Tue, Sep 23, 2008 at 04:01:29PM +0300, Timo Teräs wrote: > > So, what to do? > 1. Go back to: list_del_rcu, xfrm_state_hold(all.next) on delete and > xfrm_state_put(all.next) on destruct. > 2. Add per-entry hlist of walkers currently referencing it. > 3. Use the global walker list. > > 1 can keep memory allocated until userland wakes up. 2 & 3 can make > the delete of that entry slow if there's many walkers suspended. I'd cross 3 off the list because 2 is just so much better :) I'd slightly lean towards 2 but now that you mention it yes even that is vulnerable to loads of dumpers sitting on the same entry. So SELINUX folks wouldn't like that :) > Btw. the current stuff in net-next is broken. There's no locking > for xfrm_state_walkers list handling. What about xfrm_cfg_mutex? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 13:07 ` Herbert Xu @ 2008-09-23 13:30 ` Timo Teräs 2008-09-23 13:32 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-23 13:30 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Herbert Xu wrote: > On Tue, Sep 23, 2008 at 04:01:29PM +0300, Timo Teräs wrote: >> So, what to do? >> 1. Go back to: list_del_rcu, xfrm_state_hold(all.next) on delete and >> xfrm_state_put(all.next) on destruct. >> 2. Add per-entry hlist of walkers currently referencing it. >> 3. Use the global walker list. >> >> 1 can keep memory allocated until userland wakes up. 2 & 3 can make >> the delete of that entry slow if there are many walkers suspended. > > I'd cross 3 off the list because 2 is just so much better :) > > I'd slightly lean towards 2 but now that you mention it yes even > that is vulnerable to loads of dumpers sitting on the same entry. > So SELINUX folks wouldn't like that :) Umm... right. It's a tricky problem. Can't think of a perfect solution atm. But I guess 3 is the best in the general case, while in worst-case scenarios 1 performs better. I have no strong opinion either way, so whatever you want, I'm happy to provide a patch for it. >> Btw. the current stuff in net-next is broken. There's no locking >> for the xfrm_state_walks list handling. > > What about xfrm_cfg_mutex? It's used only in xfrm_state_gc_task. xfrm_state_walk_{init,done} touch the xfrm_state_walks list without proper locking. At least in the version I'm looking at (= net-next-2.6 via the git web interface). - Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 13:30 ` Timo Teräs @ 2008-09-23 13:32 ` Herbert Xu 2008-09-23 13:46 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-23 13:32 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev On Tue, Sep 23, 2008 at 04:30:06PM +0300, Timo Teräs wrote: > > I have no strong opinion either way. So what ever you want, I'm > happy to provide a patch for. Let's just leave it as it is for now. > > What about xfrm_cfg_mutex? > > It's used only in xfrm_state_gc_task. xfrm_state_walk_{init,done} > touch xfrm_state_walks list without locking properly. At least in > the version I'm looking (= net-next-2.6 via git web interface). Their callers should be holding it. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 13:32 ` Herbert Xu @ 2008-09-23 13:46 ` Timo Teräs 2008-09-24 4:23 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-23 13:46 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Herbert Xu wrote: > On Tue, Sep 23, 2008 at 04:30:06PM +0300, Timo Teräs wrote: >> I have no strong opinion either way. So what ever you want, I'm >> happy to provide a patch for. > > Let's just leave it as it is for now. Well, 1 is an improvement over the current implementation, so I think it'd be better than not doing anything. >>> What about xfrm_cfg_mutex? >> It's used only in xfrm_state_gc_task. xfrm_state_walk_{init,done} >> touch xfrm_state_walks list without locking properly. At least in >> the version I'm looking (= net-next-2.6 via git web interface). > > Their callers should be holding it. Ah, the other layers take it at least on _walk_init paths. But _walk_done can be called from recv() syscalls. The af_key implementation does not take xfrm_cfg_mutex there. I don't think xfrm_user does that either as it does not pass cb_mutex to netlink_kernel_create. So at least the _state_walk_done path is unsafe as-is, I think. - Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
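The shape the fix eventually takes in the patches further down is to make the walk termination self-contained: the walker lives on a list protected by xfrm_state_lock, and _walk_done takes that lock itself instead of assuming the caller holds xfrm_cfg_mutex. A minimal sketch of that direction, with the lock/unlock pairing spelled out:

	void xfrm_state_walk_done(struct xfrm_state_walk *walk)
	{
		if (list_empty(&walk->all))
			return;			/* walk already ran to completion */
		spin_lock_bh(&xfrm_state_lock);
		list_del(&walk->all);		/* unlink the walker from the dump list */
		spin_unlock_bh(&xfrm_state_lock);
	}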
* Re: xfrm_state locking regression... 2008-09-23 13:46 ` Timo Teräs @ 2008-09-24 4:23 ` Herbert Xu 2008-09-24 5:14 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-24 4:23 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev, jamal On Tue, Sep 23, 2008 at 04:46:31PM +0300, Timo Teräs wrote: > > Ah, the other layers take it at least on _walk_init paths. But > _walk_done can be called from recv() syscalls. The af_key > implementation does not take xfrm_cfg_mutex there. I don't think > xfrm_user does that either as it does not pass cb_mutex to > netlink_kernel_create. So at least the _state_walk_done path > is unsafe as-is, I think. OK we'd need to fix that up. However, I've got a new idea :) Let's put the dumpers on the list directly and then we won't have to deal any of this crap about states going away. Warning: compile tested only! ipsec: Put dumpers on the dump list As it is we go to extraordinary lengths to ensure that states don't go away while dumpers go to sleep. It's much easier if we just put the dumpers themselves on the list since they can't go away while they're going. I've also changed the order of addition on new states to prevent a never-ending dump. Finally the obsolete last optimisation is now gone. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> diff --git a/include/linux/netlink.h b/include/linux/netlink.h index cbba776..9ff1b54 100644 --- a/include/linux/netlink.h +++ b/include/linux/netlink.h @@ -220,7 +220,7 @@ struct netlink_callback int (*dump)(struct sk_buff * skb, struct netlink_callback *cb); int (*done)(struct netlink_callback *cb); int family; - long args[7]; + long args[6]; }; struct netlink_notify diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 48630b2..17f9494 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -120,9 +120,8 @@ extern struct mutex xfrm_cfg_mutex; /* Full description of state of transformer. */ struct xfrm_state { - struct list_head all; union { - struct list_head gclist; + struct hlist_node gclist; struct hlist_node bydst; }; struct hlist_node bysrc; @@ -138,6 +137,8 @@ struct xfrm_state /* Key manger bits */ struct { + /* These two fields correspond to xfrm_state_walk. */ + struct list_head all; u8 state; u8 dying; u32 seq; @@ -1246,11 +1247,11 @@ struct xfrm6_tunnel { }; struct xfrm_state_walk { - struct list_head list; - unsigned long genid; - struct xfrm_state *state; - int count; + /* These two fields correspond to xfrm_state. */ + struct list_head all; + u8 state; u8 proto; + int count; }; struct xfrm_policy_walk { diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index f7805c5..ff3bb24 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -59,14 +59,6 @@ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024; static unsigned int xfrm_state_num; static unsigned int xfrm_state_genid; -/* Counter indicating ongoing walk, protected by xfrm_state_lock. */ -static unsigned long xfrm_state_walk_ongoing; -/* Counter indicating walk completion, protected by xfrm_cfg_mutex. */ -static unsigned long xfrm_state_walk_completed; - -/* List of outstanding state walks used to set the completed counter. 
*/ -static LIST_HEAD(xfrm_state_walks); - static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned int family); static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo); @@ -199,8 +191,7 @@ static DEFINE_RWLOCK(xfrm_state_afinfo_lock); static struct xfrm_state_afinfo *xfrm_state_afinfo[NPROTO]; static struct work_struct xfrm_state_gc_work; -static LIST_HEAD(xfrm_state_gc_leftovers); -static LIST_HEAD(xfrm_state_gc_list); +static HLIST_HEAD(xfrm_state_gc_list); static DEFINE_SPINLOCK(xfrm_state_gc_lock); int __xfrm_state_delete(struct xfrm_state *x); @@ -412,23 +403,16 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x) static void xfrm_state_gc_task(struct work_struct *data) { - struct xfrm_state *x, *tmp; - unsigned long completed; + struct xfrm_state *x; + struct hlist_node *entry, *tmp; + struct hlist_head gc_list; - mutex_lock(&xfrm_cfg_mutex); spin_lock_bh(&xfrm_state_gc_lock); - list_splice_tail_init(&xfrm_state_gc_list, &xfrm_state_gc_leftovers); + hlist_move_list(&xfrm_state_gc_list, &gc_list); spin_unlock_bh(&xfrm_state_gc_lock); - completed = xfrm_state_walk_completed; - mutex_unlock(&xfrm_cfg_mutex); - - list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) { - if ((long)(x->lastused - completed) > 0) - break; - list_del(&x->all); + hlist_for_each_entry_safe(x, entry, tmp, &gc_list, gclist) xfrm_state_gc_destroy(x); - } wake_up(&km_waitq); } @@ -529,7 +513,7 @@ struct xfrm_state *xfrm_state_alloc(void) if (x) { atomic_set(&x->refcnt, 1); atomic_set(&x->tunnel_users, 0); - INIT_LIST_HEAD(&x->all); + INIT_LIST_HEAD(&x->km.all); INIT_HLIST_NODE(&x->bydst); INIT_HLIST_NODE(&x->bysrc); INIT_HLIST_NODE(&x->byspi); @@ -556,7 +540,7 @@ void __xfrm_state_destroy(struct xfrm_state *x) WARN_ON(x->km.state != XFRM_STATE_DEAD); spin_lock_bh(&xfrm_state_gc_lock); - list_add_tail(&x->gclist, &xfrm_state_gc_list); + hlist_add_head(&x->gclist, &xfrm_state_gc_list); spin_unlock_bh(&xfrm_state_gc_lock); schedule_work(&xfrm_state_gc_work); } @@ -569,8 +553,7 @@ int __xfrm_state_delete(struct xfrm_state *x) if (x->km.state != XFRM_STATE_DEAD) { x->km.state = XFRM_STATE_DEAD; spin_lock(&xfrm_state_lock); - x->lastused = xfrm_state_walk_ongoing; - list_del_rcu(&x->all); + list_del(&x->km.all); hlist_del(&x->bydst); hlist_del(&x->bysrc); if (x->id.spi) @@ -939,7 +922,7 @@ static void __xfrm_state_insert(struct xfrm_state *x) x->genid = ++xfrm_state_genid; - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); h = xfrm_dst_hash(&x->id.daddr, &x->props.saddr, x->props.reqid, x->props.family); @@ -1564,79 +1547,54 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, int (*func)(struct xfrm_state *, int, void*), void *data) { - struct xfrm_state *old, *x, *last = NULL; + struct xfrm_state *x; int err = 0; - if (walk->state == NULL && walk->count != 0) - return 0; - - old = x = walk->state; - walk->state = NULL; spin_lock_bh(&xfrm_state_lock); - if (x == NULL) - x = list_first_entry(&xfrm_state_all, struct xfrm_state, all); - list_for_each_entry_from(x, &xfrm_state_all, all) { + if (list_empty(&walk->all)) + x = list_first_entry(&xfrm_state_all, struct xfrm_state, km.all); + else + x = list_entry(&walk->all, struct xfrm_state, km.all); + list_for_each_entry_from(x, &xfrm_state_all, km.all) { if (x->km.state == XFRM_STATE_DEAD) continue; if (!xfrm_id_proto_match(x->id.proto, walk->proto)) continue; - if (last) { - err = func(last, walk->count, data); - if (err) { - xfrm_state_hold(last); - walk->state = last; - goto out; - } + err = func(x, 
walk->count, data); + if (err) { + list_move_tail(&walk->all, &x->km.all); + goto out; } - last = x; walk->count++; } if (walk->count == 0) { err = -ENOENT; goto out; } - if (last) - err = func(last, 0, data); + list_del_init(&walk->all); out: spin_unlock_bh(&xfrm_state_lock); - if (old != NULL) - xfrm_state_put(old); return err; } EXPORT_SYMBOL(xfrm_state_walk); void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto) { + INIT_LIST_HEAD(&walk->all); walk->proto = proto; - walk->state = NULL; + walk->state = XFRM_STATE_DEAD; walk->count = 0; - list_add_tail(&walk->list, &xfrm_state_walks); - walk->genid = ++xfrm_state_walk_ongoing; } EXPORT_SYMBOL(xfrm_state_walk_init); void xfrm_state_walk_done(struct xfrm_state_walk *walk) { - struct list_head *prev; - - if (walk->state != NULL) { - xfrm_state_put(walk->state); - walk->state = NULL; - } - - prev = walk->list.prev; - list_del(&walk->list); - - if (prev != &xfrm_state_walks) { - list_entry(prev, struct xfrm_state_walk, list)->genid = - walk->genid; + if (list_empty(&walk->all)) return; - } - - xfrm_state_walk_completed = walk->genid; - if (!list_empty(&xfrm_state_gc_leftovers)) - schedule_work(&xfrm_state_gc_work); + spin_lock_bh(&xfrm_state_lock); + list_del(&walk->all); + spin_unlock_bh(&xfrm_state_lock); } EXPORT_SYMBOL(xfrm_state_walk_done); Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply related [flat|nested] 95+ messages in thread
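The heart of the patch above: struct xfrm_state_walk and the km member of struct xfrm_state begin with the same two fields, so walkers and real states can sit on the same xfrm_state_all list. A walker marks itself with XFRM_STATE_DEAD, which every traversal already skips, so nothing else ever treats it as a state. Schematically (a restatement of the layout above, not extra code):

	/* every node on xfrm_state_all starts with this shape */
	struct list_head all;	/* the list linkage */
	u8 state;		/* XFRM_STATE_DEAD for walkers (and dead states) */

	/* so a traversal can simply do: */
	if (x->km.state == XFRM_STATE_DEAD)
		continue;	/* either a parked walker or a dying state */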
* Re: xfrm_state locking regression... 2008-09-24 4:23 ` Herbert Xu @ 2008-09-24 5:14 ` Timo Teräs 2008-09-24 5:15 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-24 5:14 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev, jamal Herbert Xu wrote: > On Tue, Sep 23, 2008 at 04:46:31PM +0300, Timo Teräs wrote: >> Ah, the other layers take it at least on _walk_init paths. But >> _walk_done can be called from recv() syscalls. The af_key >> implementation does not take xfrm_cfg_mutex there. I don't think >> xfrm_user does that either as it does not pass cb_mutex to >> netlink_kernel_create. So at least the _state_walk_done path >> is unsafe as-is, I think. > > OK we'd need to fix that up. > > However, I've got a new idea :) Let's put the dumpers on the > list directly and then we won't have to deal with any of this crap > about states going away. Now this is an interesting idea... I like this a lot. My only comment is that there really should be a struct for the shared stuff. Otherwise it's bound to break at some point. Cheers, Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-24 5:14 ` Timo Teräs @ 2008-09-24 5:15 ` Herbert Xu 2008-09-24 5:46 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-24 5:15 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev, jamal On Wed, Sep 24, 2008 at 08:14:11AM +0300, Timo Teräs wrote: > > My only comment is that there really should be a struct for the > shared stuff. Otherwise it's bound to break at some point. That's why I put comments there. If people change it without reading the comments, they'll ignore the struct too :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-24 5:15 ` Herbert Xu @ 2008-09-24 5:46 ` Timo Teräs 2008-09-24 5:55 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-24 5:46 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev, jamal Herbert Xu wrote: > On Wed, Sep 24, 2008 at 08:14:11AM +0300, Timo Teräs wrote: >> My only comment is that there really should be a struct for the >> shared stuff. Otherwise it's bound to break at some point. > > That's why I put comments there. If people change it without > reading the comments, they'll ignore the struct too :) Well, it's also because in the dump routine the entries are cast to xfrm_state, even if it was a walker entry. This is just wrong, though it probably works since only the specific entries are used. There should be some intermediate struct which we use to iterate in the dumping routine, plus a check whether it is an iterator or a real entry (can still be based on the state thing). And only after that cast to struct xfrm_state. It would make the dumping routine more readable. Cheers, Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-24 5:46 ` Timo Teräs @ 2008-09-24 5:55 ` Herbert Xu 2008-09-24 6:04 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-24 5:55 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev, jamal On Wed, Sep 24, 2008 at 08:46:09AM +0300, Timo Teräs wrote: > > Well, it's also because in the dump routine the entries > are cast to xfrm_state, even if it was a walker entry. This > is just wrong, though it probably works since only the specific > entries are used. No it works because all walker entries set the state to DEAD so they're skipped by everybody else. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-24 5:55 ` Herbert Xu @ 2008-09-24 6:04 ` Timo Teräs 2008-09-24 6:13 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-24 6:04 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev, jamal Herbert Xu wrote: > On Wed, Sep 24, 2008 at 08:46:09AM +0300, Timo Teräs wrote: >> Well, it's also because in the dump routine the entries >> are cast to xfrm_state, even if it was a walker entry. This >> is just wrong, though it probably works since only the specific >> entries are used. > > No it works because all walker entries set the state to DEAD > so they're skipped by everybody else. Yes, I know. I was pointing out the fact that the walker function iterates using struct xfrm_state. So temporarily, while it is iterating through the walker entry, we get a struct xfrm_state pointer which points to some place before the struct xfrm_state_walk. Now since the km.state is checked first, those are skipped. I find it very confusing to let the code say "iterate through a list of struct xfrm_state" when it is not such a list. It is a list of struct xfrm_state or struct xfrm_state_walk. So I'd use some intermediate struct so the code can say e.g. "iterate through a list of struct xfrm_state_dump_entry" or whatever. Or at least add a comment to the dumping function to say that we have struct xfrm_state, but in matter of fact it can be also struct xfrm_state_walk pointer with displacement, so we better check km.state first. Cheers, Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
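What Timo is asking for, and what his reworked patch further down does, is to iterate over the shared prefix type and convert with container_of() only once the node is known to be a real state (a sketch matching that later patch):

	struct xfrm_state_walk *w;
	struct xfrm_state *state;

	list_for_each_entry_from(w, &xfrm_state_all, all) {
		if (w->state == XFRM_STATE_DEAD)
			continue;	/* parked walker or dead state: skip */
		/* only now is the node known to be a full xfrm_state */
		state = container_of(w, struct xfrm_state, km);
		...
	}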
* Re: xfrm_state locking regression... 2008-09-24 6:04 ` Timo Teräs @ 2008-09-24 6:13 ` Herbert Xu 2008-09-24 6:20 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-24 6:13 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev, jamal On Wed, Sep 24, 2008 at 09:04:30AM +0300, Timo Teräs wrote: > > Or at least add a comment to the dumping function to say that > we have struct xfrm_state, but in matter of fact it can be > also struct xfrm_state_walk pointer with displacement, so we > better check km.state first. Which is exactly what we do. The first thing we check in the loop is km.state. I really don't see your problem. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-24 6:13 ` Herbert Xu @ 2008-09-24 6:20 ` Timo Teräs 2008-09-24 6:21 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-24 6:20 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev, jamal Herbert Xu wrote: > On Wed, Sep 24, 2008 at 09:04:30AM +0300, Timo Teräs wrote: >> Or at least add a comment to the dumping function to say that >> we have struct xfrm_state, but in matter of fact it can be >> also struct xfrm_state_walk pointer with displacement, so we >> better check km.state first. > > Which is exactly what we do. The first thing we check in the > loop is km.state. I really don't see your problem. Yes. I'm not complaining that the code does not work. Just saying that the code easily misleads the reader. And in this kind of non-obvious places we should have some comments. Or make the code more readable by adding the intermediate struct. Cheers, Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-24 6:20 ` Timo Teräs @ 2008-09-24 6:21 ` Herbert Xu 2008-09-24 7:29 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-24 6:21 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev, jamal On Wed, Sep 24, 2008 at 09:20:10AM +0300, Timo Teräs wrote: > > Just saying that the code easily misleads the reader. And in > this kind of non-obvious places we should have some comments. > Or make the code more readable by adding the intermediate struct. Fair enough. Feel free to reformat the patch and add the struct to make it more bullet-proof. I'm sorry I've got some paid work to do now :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-24 6:21 ` Herbert Xu @ 2008-09-24 7:29 ` Timo Teräs 2008-09-24 7:54 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-24 7:29 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev, jamal Herbert Xu wrote: > On Wed, Sep 24, 2008 at 09:20:10AM +0300, Timo Teräs wrote: >> Just saying that the code easily misleads the reader. And in >> this kind of non-obvious places we should have some comments. >> Or make the code more readable by adding the intermediate struct. > > Fair enough. Feel free to reformat the patch and add the struct > to make it more bullet-proof. I'm sorry I've got some paid work > to do now :) Ok, here's one try. I also updated the list_add() calls introduced in commit "ipsec: Restore larval states and socket policies in dump". Same warning still applies: only compile tested. We should probably do similar fix for the SPD dumping. I can make a patch for that later today. - Timo ipsec: Put dumpers on the dump list Based on Herbert Xu's patch and idea. Improved the original patch to apply cleanly and modified iteration code to be more readable by using a common struct for entries in all list. As it is we go to extraordinary lengths to ensure that states don't go away while dumpers go to sleep. It's much easier if we just put the dumpers themselves on the list since they can't go away while they're going. I've also changed the order of addition on new states to prevent a never-ending dump. Finally the obsolete last optimisation is now gone. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Timo Teras <timo.teras@iki.fi> diff --git a/include/linux/netlink.h b/include/linux/netlink.h index cbba776..9ff1b54 100644 --- a/include/linux/netlink.h +++ b/include/linux/netlink.h @@ -220,7 +220,7 @@ struct netlink_callback int (*dump)(struct sk_buff * skb, struct netlink_callback *cb); int (*done)(struct netlink_callback *cb); int family; - long args[7]; + long args[6]; }; struct netlink_notify diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 48630b2..becaa1e 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -117,12 +117,21 @@ extern struct mutex xfrm_cfg_mutex; metrics. Plus, it will be made via sk->sk_dst_cache. Solved. */ +struct xfrm_state_walk { + struct list_head all; + u8 state; + union { + u8 dying; + u8 proto; + }; + u32 seq; +}; + /* Full description of state of transformer. */ struct xfrm_state { - struct list_head all; union { - struct list_head gclist; + struct hlist_node gclist; struct hlist_node bydst; }; struct hlist_node bysrc; @@ -136,12 +145,8 @@ struct xfrm_state u32 genid; - /* Key manger bits */ - struct { - u8 state; - u8 dying; - u32 seq; - } km; + /* Key manager bits */ + struct xfrm_state_walk km; /* Parameters of this state. */ struct { @@ -1245,14 +1250,6 @@ struct xfrm6_tunnel { int priority; }; -struct xfrm_state_walk { - struct list_head list; - unsigned long genid; - struct xfrm_state *state; - int count; - u8 proto; -}; - struct xfrm_policy_walk { struct xfrm_policy *policy; int count; diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 053970e..f2deae0 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -59,14 +59,6 @@ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024; static unsigned int xfrm_state_num; static unsigned int xfrm_state_genid; -/* Counter indicating ongoing walk, protected by xfrm_state_lock. 
*/ -static unsigned long xfrm_state_walk_ongoing; -/* Counter indicating walk completion, protected by xfrm_cfg_mutex. */ -static unsigned long xfrm_state_walk_completed; - -/* List of outstanding state walks used to set the completed counter. */ -static LIST_HEAD(xfrm_state_walks); - static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned int family); static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo); @@ -199,8 +191,7 @@ static DEFINE_RWLOCK(xfrm_state_afinfo_lock); static struct xfrm_state_afinfo *xfrm_state_afinfo[NPROTO]; static struct work_struct xfrm_state_gc_work; -static LIST_HEAD(xfrm_state_gc_leftovers); -static LIST_HEAD(xfrm_state_gc_list); +static HLIST_HEAD(xfrm_state_gc_list); static DEFINE_SPINLOCK(xfrm_state_gc_lock); int __xfrm_state_delete(struct xfrm_state *x); @@ -412,23 +403,16 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x) static void xfrm_state_gc_task(struct work_struct *data) { - struct xfrm_state *x, *tmp; - unsigned long completed; + struct xfrm_state *x; + struct hlist_node *entry, *tmp; + struct hlist_head gc_list; - mutex_lock(&xfrm_cfg_mutex); spin_lock_bh(&xfrm_state_gc_lock); - list_splice_tail_init(&xfrm_state_gc_list, &xfrm_state_gc_leftovers); + hlist_move_list(&xfrm_state_gc_list, &gc_list); spin_unlock_bh(&xfrm_state_gc_lock); - completed = xfrm_state_walk_completed; - mutex_unlock(&xfrm_cfg_mutex); - - list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) { - if ((long)(x->lastused - completed) > 0) - break; - list_del(&x->gclist); + hlist_for_each_entry_safe(x, entry, tmp, &gc_list, gclist) xfrm_state_gc_destroy(x); - } wake_up(&km_waitq); } @@ -529,7 +513,7 @@ struct xfrm_state *xfrm_state_alloc(void) if (x) { atomic_set(&x->refcnt, 1); atomic_set(&x->tunnel_users, 0); - INIT_LIST_HEAD(&x->all); + INIT_LIST_HEAD(&x->km.all); INIT_HLIST_NODE(&x->bydst); INIT_HLIST_NODE(&x->bysrc); INIT_HLIST_NODE(&x->byspi); @@ -556,7 +540,7 @@ void __xfrm_state_destroy(struct xfrm_state *x) WARN_ON(x->km.state != XFRM_STATE_DEAD); spin_lock_bh(&xfrm_state_gc_lock); - list_add_tail(&x->gclist, &xfrm_state_gc_list); + hlist_add_head(&x->gclist, &xfrm_state_gc_list); spin_unlock_bh(&xfrm_state_gc_lock); schedule_work(&xfrm_state_gc_work); } @@ -569,8 +553,7 @@ int __xfrm_state_delete(struct xfrm_state *x) if (x->km.state != XFRM_STATE_DEAD) { x->km.state = XFRM_STATE_DEAD; spin_lock(&xfrm_state_lock); - x->lastused = xfrm_state_walk_ongoing; - list_del_rcu(&x->all); + list_del(&x->km.all); hlist_del(&x->bydst); hlist_del(&x->bysrc); if (x->id.spi) @@ -871,7 +854,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr, if (km_query(x, tmpl, pol) == 0) { x->km.state = XFRM_STATE_ACQ; - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); hlist_add_head(&x->bydst, xfrm_state_bydst+h); h = xfrm_src_hash(daddr, saddr, family); hlist_add_head(&x->bysrc, xfrm_state_bysrc+h); @@ -940,7 +923,7 @@ static void __xfrm_state_insert(struct xfrm_state *x) x->genid = ++xfrm_state_genid; - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); h = xfrm_dst_hash(&x->id.daddr, &x->props.saddr, x->props.reqid, x->props.family); @@ -1069,7 +1052,7 @@ static struct xfrm_state *__find_acq_core(unsigned short family, u8 mode, u32 re xfrm_state_hold(x); x->timer.expires = jiffies + sysctl_xfrm_acq_expires*HZ; add_timer(&x->timer); - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); hlist_add_head(&x->bydst, xfrm_state_bydst+h); h = 
xfrm_src_hash(daddr, saddr, family); hlist_add_head(&x->bysrc, xfrm_state_bysrc+h); @@ -940,7 +923,7 @@ static void __xfrm_state_insert(struct xfrm_state *x) x->genid = ++xfrm_state_genid; - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); h = xfrm_dst_hash(&x->id.daddr, &x->props.saddr, x->props.reqid, x->props.family); @@ -1069,7 +1052,7 @@ static struct xfrm_state *__find_acq_core(unsigned short family, u8 mode, u32 re xfrm_state_hold(x); x->timer.expires = jiffies + sysctl_xfrm_acq_expires*HZ; add_timer(&x->timer); - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); hlist_add_head(&x->bydst, xfrm_state_bydst+h); h = xfrm_src_hash(daddr, saddr, family); hlist_add_head(&x->bysrc, xfrm_state_bysrc+h); @@ -1566,79 +1549,56 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, int (*func)(struct xfrm_state *, int, void*), void *data) { - struct xfrm_state *old, *x, *last = NULL; + struct xfrm_state *state; + struct xfrm_state_walk *x; int err = 0; - if (walk->state == NULL && walk->count != 0) - return 0; - - old = x = walk->state; - walk->state = NULL; spin_lock_bh(&xfrm_state_lock); - if (x == NULL) - x = list_first_entry(&xfrm_state_all, struct xfrm_state, all); + if (list_empty(&walk->all)) + x = list_first_entry(&xfrm_state_all, struct xfrm_state_walk, all); + else + x = list_entry(&walk->all, struct xfrm_state_walk, all); list_for_each_entry_from(x, &xfrm_state_all, all) { - if (x->km.state == XFRM_STATE_DEAD) + if (x->state == XFRM_STATE_DEAD) continue; - if (!xfrm_id_proto_match(x->id.proto, walk->proto)) + state = container_of(x, struct xfrm_state, km); + if (!xfrm_id_proto_match(state->id.proto, walk->proto)) continue; - if (last) { - err = func(last, walk->count, data); - if (err) { - xfrm_state_hold(last); - walk->state = last; - goto out; - } + err = func(state, walk->seq, data); + if (err) { + list_move_tail(&walk->all, &x->all); + goto out; } - last = x; - walk->count++; + walk->seq++; } - if (walk->count == 0) { + if (walk->seq == 0) { err = -ENOENT; goto out; } - if (last) - err = func(last, 0, data); + list_del_init(&walk->all); out: spin_unlock_bh(&xfrm_state_lock); - if (old != NULL) - xfrm_state_put(old); return err; } EXPORT_SYMBOL(xfrm_state_walk); void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto) { + INIT_LIST_HEAD(&walk->all); walk->proto = proto; - walk->state = NULL; - walk->count = 0; - list_add_tail(&walk->list, &xfrm_state_walks); - walk->genid = ++xfrm_state_walk_ongoing; + walk->state = XFRM_STATE_DEAD; + walk->seq = 0; } EXPORT_SYMBOL(xfrm_state_walk_init); void xfrm_state_walk_done(struct xfrm_state_walk *walk) { - struct list_head *prev; - - if (walk->state != NULL) { - xfrm_state_put(walk->state); - walk->state = NULL; - } - - prev = walk->list.prev; - list_del(&walk->list); - - if (prev != &xfrm_state_walks) { - list_entry(prev, struct xfrm_state_walk, list)->genid = - walk->genid; + if (list_empty(&walk->all)) return; - } - - xfrm_state_walk_completed = walk->genid; - if (!list_empty(&xfrm_state_gc_leftovers)) - schedule_work(&xfrm_state_gc_work); + spin_lock_bh(&xfrm_state_lock); + list_del(&walk->all); + spin_unlock_bh(&xfrm_state_lock); } EXPORT_SYMBOL(xfrm_state_walk_done); ^ permalink raw reply related [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-24 7:29 ` Timo Teräs @ 2008-09-24 7:54 ` Herbert Xu 2008-09-24 13:18 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-24 7:54 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev, jamal On Wed, Sep 24, 2008 at 10:29:15AM +0300, Timo Teräs wrote: > > Same warning still applies: only compile tested. Looks good to me, I especially like the count/seq bit. But I'm sure Dave would appreciate it if we actually ran this too :) Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-24 7:54 ` Herbert Xu @ 2008-09-24 13:18 ` Timo Teräs 2008-09-24 14:08 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-24 13:18 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev, jamal Herbert Xu wrote: > On Wed, Sep 24, 2008 at 10:29:15AM +0300, Timo Teräs wrote: >> Same warning still applies: only compile tested. > > Looks good to me, I especially like the count/seq bit. But I'm > sure Dave would appreciate it if we actually ran this too :) Gave it a test spin and found one bug. The walking routine will get called one extra time when it has successfully dumped everything so we need to check for that in the beginning; the earlier version would start all over from the beginning, resulting in never-ending dumps. So the patch is now as follows. I have an almost ready patch for xfrm_policy dumping too. Will finish it and send tomorrow. - Timo ipsec: Put xfrm_state dumpers on the dump list Based on Herbert Xu's patch and idea. Improved the original patch to apply cleanly and modified iteration code to be more readable by using a common struct for entries in all list. As it is we go to extraordinary lengths to ensure that states don't go away while dumpers go to sleep. It's much easier if we just put the dumpers themselves on the list since they can't go away while they're going. I've also changed the order of addition on new states to prevent a never-ending dump. Finally the obsolete last optimisation is now gone. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Timo Teras <timo.teras@iki.fi> --- include/linux/netlink.h | 2 +- include/net/xfrm.h | 29 ++++++------- net/xfrm/xfrm_state.c | 109 +++++++++++++++------------------------------- 3 files changed, 50 insertions(+), 90 deletions(-) diff --git a/include/linux/netlink.h b/include/linux/netlink.h index cbba776..9ff1b54 100644 --- a/include/linux/netlink.h +++ b/include/linux/netlink.h @@ -220,7 +220,7 @@ struct netlink_callback int (*dump)(struct sk_buff * skb, struct netlink_callback *cb); int (*done)(struct netlink_callback *cb); int family; - long args[7]; + long args[6]; }; struct netlink_notify diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 48630b2..becaa1e 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -117,12 +117,21 @@ extern struct mutex xfrm_cfg_mutex; metrics. Plus, it will be made via sk->sk_dst_cache. Solved. */ +struct xfrm_state_walk { + struct list_head all; + u8 state; + union { + u8 dying; + u8 proto; + }; + u32 seq; +}; + /* Full description of state of transformer. */ struct xfrm_state { - struct list_head all; union { - struct list_head gclist; + struct hlist_node gclist; struct hlist_node bydst; }; struct hlist_node bysrc; @@ -136,12 +145,8 @@ struct xfrm_state u32 genid; - /* Key manger bits */ - struct { - u8 state; - u8 dying; - u32 seq; - } km; + /* Key manager bits */ + struct xfrm_state_walk km; /* Parameters of this state. 
*/ struct { @@ -1245,14 +1250,6 @@ struct xfrm6_tunnel { int priority; }; -struct xfrm_state_walk { - struct list_head list; - unsigned long genid; - struct xfrm_state *state; - int count; - u8 proto; -}; - struct xfrm_policy_walk { struct xfrm_policy *policy; int count; diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 053970e..747fd8c 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -59,14 +59,6 @@ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024; static unsigned int xfrm_state_num; static unsigned int xfrm_state_genid; -/* Counter indicating ongoing walk, protected by xfrm_state_lock. */ -static unsigned long xfrm_state_walk_ongoing; -/* Counter indicating walk completion, protected by xfrm_cfg_mutex. */ -static unsigned long xfrm_state_walk_completed; - -/* List of outstanding state walks used to set the completed counter. */ -static LIST_HEAD(xfrm_state_walks); - static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned int family); static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo); @@ -199,8 +191,7 @@ static DEFINE_RWLOCK(xfrm_state_afinfo_lock); static struct xfrm_state_afinfo *xfrm_state_afinfo[NPROTO]; static struct work_struct xfrm_state_gc_work; -static LIST_HEAD(xfrm_state_gc_leftovers); -static LIST_HEAD(xfrm_state_gc_list); +static HLIST_HEAD(xfrm_state_gc_list); static DEFINE_SPINLOCK(xfrm_state_gc_lock); int __xfrm_state_delete(struct xfrm_state *x); @@ -412,23 +403,16 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x) static void xfrm_state_gc_task(struct work_struct *data) { - struct xfrm_state *x, *tmp; - unsigned long completed; + struct xfrm_state *x; + struct hlist_node *entry, *tmp; + struct hlist_head gc_list; - mutex_lock(&xfrm_cfg_mutex); spin_lock_bh(&xfrm_state_gc_lock); - list_splice_tail_init(&xfrm_state_gc_list, &xfrm_state_gc_leftovers); + hlist_move_list(&xfrm_state_gc_list, &gc_list); spin_unlock_bh(&xfrm_state_gc_lock); - completed = xfrm_state_walk_completed; - mutex_unlock(&xfrm_cfg_mutex); - - list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) { - if ((long)(x->lastused - completed) > 0) - break; - list_del(&x->gclist); + hlist_for_each_entry_safe(x, entry, tmp, &gc_list, gclist) xfrm_state_gc_destroy(x); - } wake_up(&km_waitq); } @@ -529,7 +513,7 @@ struct xfrm_state *xfrm_state_alloc(void) if (x) { atomic_set(&x->refcnt, 1); atomic_set(&x->tunnel_users, 0); - INIT_LIST_HEAD(&x->all); + INIT_LIST_HEAD(&x->km.all); INIT_HLIST_NODE(&x->bydst); INIT_HLIST_NODE(&x->bysrc); INIT_HLIST_NODE(&x->byspi); @@ -556,7 +540,7 @@ void __xfrm_state_destroy(struct xfrm_state *x) WARN_ON(x->km.state != XFRM_STATE_DEAD); spin_lock_bh(&xfrm_state_gc_lock); - list_add_tail(&x->gclist, &xfrm_state_gc_list); + hlist_add_head(&x->gclist, &xfrm_state_gc_list); spin_unlock_bh(&xfrm_state_gc_lock); schedule_work(&xfrm_state_gc_work); } @@ -569,8 +553,7 @@ int __xfrm_state_delete(struct xfrm_state *x) if (x->km.state != XFRM_STATE_DEAD) { x->km.state = XFRM_STATE_DEAD; spin_lock(&xfrm_state_lock); - x->lastused = xfrm_state_walk_ongoing; - list_del_rcu(&x->all); + list_del(&x->km.all); hlist_del(&x->bydst); hlist_del(&x->bysrc); if (x->id.spi) @@ -871,7 +854,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr, if (km_query(x, tmpl, pol) == 0) { x->km.state = XFRM_STATE_ACQ; - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); hlist_add_head(&x->bydst, xfrm_state_bydst+h); h = xfrm_src_hash(daddr, saddr, family); 
hlist_add_head(&x->bysrc, xfrm_state_bysrc+h); @@ -940,7 +923,7 @@ static void __xfrm_state_insert(struct xfrm_state *x) x->genid = ++xfrm_state_genid; - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); h = xfrm_dst_hash(&x->id.daddr, &x->props.saddr, x->props.reqid, x->props.family); @@ -1069,7 +1052,7 @@ static struct xfrm_state *__find_acq_core(unsigned short family, u8 mode, u32 re xfrm_state_hold(x); x->timer.expires = jiffies + sysctl_xfrm_acq_expires*HZ; add_timer(&x->timer); - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); hlist_add_head(&x->bydst, xfrm_state_bydst+h); h = xfrm_src_hash(daddr, saddr, family); hlist_add_head(&x->bysrc, xfrm_state_bysrc+h); @@ -1566,79 +1549,59 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, int (*func)(struct xfrm_state *, int, void*), void *data) { - struct xfrm_state *old, *x, *last = NULL; + struct xfrm_state *state; + struct xfrm_state_walk *x; int err = 0; - if (walk->state == NULL && walk->count != 0) + if (walk->seq != 0 && list_empty(&walk->all)) return 0; - old = x = walk->state; - walk->state = NULL; spin_lock_bh(&xfrm_state_lock); - if (x == NULL) - x = list_first_entry(&xfrm_state_all, struct xfrm_state, all); + if (list_empty(&walk->all)) + x = list_first_entry(&xfrm_state_all, struct xfrm_state_walk, all); + else + x = list_entry(&walk->all, struct xfrm_state_walk, all); list_for_each_entry_from(x, &xfrm_state_all, all) { - if (x->km.state == XFRM_STATE_DEAD) + if (x->state == XFRM_STATE_DEAD) continue; - if (!xfrm_id_proto_match(x->id.proto, walk->proto)) + state = container_of(x, struct xfrm_state, km); + if (!xfrm_id_proto_match(state->id.proto, walk->proto)) continue; - if (last) { - err = func(last, walk->count, data); - if (err) { - xfrm_state_hold(last); - walk->state = last; - goto out; - } + err = func(state, walk->seq, data); + if (err) { + list_move_tail(&walk->all, &x->all); + goto out; } - last = x; - walk->count++; + walk->seq++; } - if (walk->count == 0) { + if (walk->seq == 0) { err = -ENOENT; goto out; } - if (last) - err = func(last, 0, data); + list_del_init(&walk->all); out: spin_unlock_bh(&xfrm_state_lock); - if (old != NULL) - xfrm_state_put(old); return err; } EXPORT_SYMBOL(xfrm_state_walk); void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto) { + INIT_LIST_HEAD(&walk->all); walk->proto = proto; - walk->state = NULL; - walk->count = 0; - list_add_tail(&walk->list, &xfrm_state_walks); - walk->genid = ++xfrm_state_walk_ongoing; + walk->state = XFRM_STATE_DEAD; + walk->seq = 0; } EXPORT_SYMBOL(xfrm_state_walk_init); void xfrm_state_walk_done(struct xfrm_state_walk *walk) { - struct list_head *prev; - - if (walk->state != NULL) { - xfrm_state_put(walk->state); - walk->state = NULL; - } - - prev = walk->list.prev; - list_del(&walk->list); - - if (prev != &xfrm_state_walks) { - list_entry(prev, struct xfrm_state_walk, list)->genid = - walk->genid; + if (list_empty(&walk->all)) return; - } - - xfrm_state_walk_completed = walk->genid; - if (!list_empty(&xfrm_state_gc_leftovers)) - schedule_work(&xfrm_state_gc_work); + spin_lock_bh(&xfrm_state_lock); + list_del(&walk->all); + spin_unlock_bh(&xfrm_state_lock); } EXPORT_SYMBOL(xfrm_state_walk_done); ^ permalink raw reply related [flat|nested] 95+ messages in thread
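The fix for that bug is the new guard at the top of xfrm_state_walk() in the patch above: a completed dump leaves the walker unlinked from the list while its seq is nonzero, so a spurious extra call can be detected and refused instead of restarting from the head of the list:

	if (walk->seq != 0 && list_empty(&walk->all))
		return 0;	/* dump already completed; don't start over */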
* Re: xfrm_state locking regression... 2008-09-24 13:18 ` Timo Teräs @ 2008-09-24 14:08 ` Herbert Xu 2008-09-25 6:03 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-24 14:08 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev, jamal On Wed, Sep 24, 2008 at 04:18:37PM +0300, Timo Teräs wrote: > > Gave it a test spin and found one bug. The walking routine will > get called one extra time when it has successfully dumped > everything so we need to check for that in the beginning; the > earlier version would start all over from the beginning, resulting > in never-ending dumps. Aha, that's why the if statement was there :) Thanks for all the work! -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-24 14:08 ` Herbert Xu @ 2008-09-25 6:03 ` Timo Teräs 2008-09-25 7:57 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-25 6:03 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev, jamal Herbert Xu wrote: > On Wed, Sep 24, 2008 at 04:18:37PM +0300, Timo Teräs wrote: >> Gave it a test spin and found one bug. The walking routine will >> get called one extra time when it has successfully dumped >> everything so we need to check for that in the beginning; the >> earlier version would start all over from the beginning resulting >> never ending dumps. > > Aha, that's why the if statement was there :) Yup. I forgot to test pf_key support. And remembered why the "last" thingy was in the walking routines. In pf_key dumps, the dump is terminated when an entry with seq zero is received; that's why the final entry received special treatment. And now that breaks. The first entry is zero so pf_key dumps only one entry. Not sure if we should fix this in pf_key or in the walking routines. Cheers, Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
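For context, this is how a PF_KEY dump ends from userland's point of view (a sketch of the RFC 2367 convention described above; pfkey_fd is an already-open PF_KEY socket and handle_entry() is an assumed helper):

	#include <linux/pfkeyv2.h>
	#include <sys/socket.h>

	char buf[8192];
	for (;;) {
		ssize_t len = recv(pfkey_fd, buf, sizeof(buf), 0);
		struct sadb_msg *hdr = (struct sadb_msg *)buf;

		if (len < (ssize_t)sizeof(*hdr))
			break;		/* error or truncated message */
		handle_entry(hdr);
		if (hdr->sadb_msg_seq == 0)
			break;		/* seq 0 marks the final entry */
	}

With seq now counting up from zero, the very first reply already carries seq 0, so the dump appears to end after one entry, which is exactly the breakage described above.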
* Re: xfrm_state locking regression... 2008-09-25 6:03 ` Timo Teräs @ 2008-09-25 7:57 ` Herbert Xu 2008-09-25 8:42 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-25 7:57 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev, jamal On Thu, Sep 25, 2008 at 09:03:21AM +0300, Timo Teräs wrote: > > I forgot to test pf_key support. And remembered why the "last" > thingy was in the walking routines. In pf_key dumps, the dump > is terminated when an entry with seq zero is received; that's > why the final entry received special treatment. And now that > breaks. The first entry is zero so pf_key dumps only one entry. > > Not sure if we should fix this in pf_key or in the walking routines. This is part of the user-space ABI (and an RFC) so we can't change it. I've put the last stuff back and fixed up the error handling on the very last entry. ipsec: Put xfrm_state dumpers on the dump list Based on Herbert Xu's patch and idea. Timo improved the original patch to apply cleanly and modified iteration code to be more readable by using a common struct for entries in all list. As it is we go to extraordinary lengths to ensure that states don't go away while dumpers go to sleep. It's much easier if we just put the dumpers themselves on the list since they can't go away while they're going. I've also changed the order of addition on new states to prevent a never-ending dump. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Timo Teras <timo.teras@iki.fi> diff --git a/include/linux/netlink.h b/include/linux/netlink.h index cbba776..9ff1b54 100644 --- a/include/linux/netlink.h +++ b/include/linux/netlink.h @@ -220,7 +220,7 @@ struct netlink_callback int (*dump)(struct sk_buff * skb, struct netlink_callback *cb); int (*done)(struct netlink_callback *cb); int family; - long args[7]; + long args[6]; }; struct netlink_notify diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 48630b2..becaa1e 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -117,12 +117,21 @@ extern struct mutex xfrm_cfg_mutex; metrics. Plus, it will be made via sk->sk_dst_cache. Solved. */ +struct xfrm_state_walk { + struct list_head all; + u8 state; + union { + u8 dying; + u8 proto; + }; + u32 seq; +}; + /* Full description of state of transformer. */ struct xfrm_state { - struct list_head all; union { - struct list_head gclist; + struct hlist_node gclist; struct hlist_node bydst; }; struct hlist_node bysrc; @@ -136,12 +145,8 @@ struct xfrm_state u32 genid; - /* Key manger bits */ - struct { - u8 state; - u8 dying; - u32 seq; - } km; + /* Key manager bits */ + struct xfrm_state_walk km; /* Parameters of this state. */ struct { @@ -1245,14 +1250,6 @@ struct xfrm6_tunnel { int priority; }; -struct xfrm_state_walk { - struct list_head list; - unsigned long genid; - struct xfrm_state *state; - int count; - u8 proto; -}; - struct xfrm_policy_walk { struct xfrm_policy *policy; int count; diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 053970e..5b98f6f 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -59,14 +59,6 @@ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024; static unsigned int xfrm_state_num; static unsigned int xfrm_state_genid; -/* Counter indicating ongoing walk, protected by xfrm_state_lock. */ -static unsigned long xfrm_state_walk_ongoing; -/* Counter indicating walk completion, protected by xfrm_cfg_mutex. 
*/ -static unsigned long xfrm_state_walk_completed; - -/* List of outstanding state walks used to set the completed counter. */ -static LIST_HEAD(xfrm_state_walks); - static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned int family); static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo); @@ -199,8 +191,7 @@ static DEFINE_RWLOCK(xfrm_state_afinfo_lock); static struct xfrm_state_afinfo *xfrm_state_afinfo[NPROTO]; static struct work_struct xfrm_state_gc_work; -static LIST_HEAD(xfrm_state_gc_leftovers); -static LIST_HEAD(xfrm_state_gc_list); +static HLIST_HEAD(xfrm_state_gc_list); static DEFINE_SPINLOCK(xfrm_state_gc_lock); int __xfrm_state_delete(struct xfrm_state *x); @@ -412,23 +403,16 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x) static void xfrm_state_gc_task(struct work_struct *data) { - struct xfrm_state *x, *tmp; - unsigned long completed; + struct xfrm_state *x; + struct hlist_node *entry, *tmp; + struct hlist_head gc_list; - mutex_lock(&xfrm_cfg_mutex); spin_lock_bh(&xfrm_state_gc_lock); - list_splice_tail_init(&xfrm_state_gc_list, &xfrm_state_gc_leftovers); + hlist_move_list(&xfrm_state_gc_list, &gc_list); spin_unlock_bh(&xfrm_state_gc_lock); - completed = xfrm_state_walk_completed; - mutex_unlock(&xfrm_cfg_mutex); - - list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) { - if ((long)(x->lastused - completed) > 0) - break; - list_del(&x->gclist); + hlist_for_each_entry_safe(x, entry, tmp, &gc_list, gclist) xfrm_state_gc_destroy(x); - } wake_up(&km_waitq); } @@ -529,7 +513,7 @@ struct xfrm_state *xfrm_state_alloc(void) if (x) { atomic_set(&x->refcnt, 1); atomic_set(&x->tunnel_users, 0); - INIT_LIST_HEAD(&x->all); + INIT_LIST_HEAD(&x->km.all); INIT_HLIST_NODE(&x->bydst); INIT_HLIST_NODE(&x->bysrc); INIT_HLIST_NODE(&x->byspi); @@ -556,7 +540,7 @@ void __xfrm_state_destroy(struct xfrm_state *x) WARN_ON(x->km.state != XFRM_STATE_DEAD); spin_lock_bh(&xfrm_state_gc_lock); - list_add_tail(&x->gclist, &xfrm_state_gc_list); + hlist_add_head(&x->gclist, &xfrm_state_gc_list); spin_unlock_bh(&xfrm_state_gc_lock); schedule_work(&xfrm_state_gc_work); } @@ -569,8 +553,7 @@ int __xfrm_state_delete(struct xfrm_state *x) if (x->km.state != XFRM_STATE_DEAD) { x->km.state = XFRM_STATE_DEAD; spin_lock(&xfrm_state_lock); - x->lastused = xfrm_state_walk_ongoing; - list_del_rcu(&x->all); + list_del(&x->km.all); hlist_del(&x->bydst); hlist_del(&x->bysrc); if (x->id.spi) @@ -871,7 +854,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr, if (km_query(x, tmpl, pol) == 0) { x->km.state = XFRM_STATE_ACQ; - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); hlist_add_head(&x->bydst, xfrm_state_bydst+h); h = xfrm_src_hash(daddr, saddr, family); hlist_add_head(&x->bysrc, xfrm_state_bysrc+h); @@ -940,7 +923,7 @@ static void __xfrm_state_insert(struct xfrm_state *x) x->genid = ++xfrm_state_genid; - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); h = xfrm_dst_hash(&x->id.daddr, &x->props.saddr, x->props.reqid, x->props.family); @@ -1069,7 +1052,7 @@ static struct xfrm_state *__find_acq_core(unsigned short family, u8 mode, u32 re xfrm_state_hold(x); x->timer.expires = jiffies + sysctl_xfrm_acq_expires*HZ; add_timer(&x->timer); - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); hlist_add_head(&x->bydst, xfrm_state_bydst+h); h = xfrm_src_hash(daddr, saddr, family); hlist_add_head(&x->bysrc, xfrm_state_bysrc+h); @@ -1566,79 +1549,72 @@ int 
xfrm_state_walk(struct xfrm_state_walk *walk, int (*func)(struct xfrm_state *, int, void*), void *data) { - struct xfrm_state *old, *x, *last = NULL; + struct xfrm_state *state, *last = NULL; + struct xfrm_state_walk *x; int err = 0; - if (walk->state == NULL && walk->count != 0) + if (walk->seq != 0 && list_empty(&walk->all)) return 0; - old = x = walk->state; - walk->state = NULL; spin_lock_bh(&xfrm_state_lock); - if (x == NULL) - x = list_first_entry(&xfrm_state_all, struct xfrm_state, all); + if (list_empty(&walk->all)) + x = list_first_entry(&xfrm_state_all, struct xfrm_state_walk, all); + else + x = list_entry(&walk->all, struct xfrm_state_walk, all); list_for_each_entry_from(x, &xfrm_state_all, all) { - if (x->km.state == XFRM_STATE_DEAD) + if (x->state == XFRM_STATE_DEAD) continue; - if (!xfrm_id_proto_match(x->id.proto, walk->proto)) + state = container_of(x, struct xfrm_state, km); + if (!xfrm_id_proto_match(state->id.proto, walk->proto)) continue; + if (last) { - err = func(last, walk->count, data); - if (err) { - xfrm_state_hold(last); - walk->state = last; - goto out; - } + err = func(last, walk->seq, data); + if (err) + goto save_dump; } - last = x; - walk->count++; + + last = state; + walk->seq++; } - if (walk->count == 0) { + if (walk->seq == 0) { err = -ENOENT; goto out; } - if (last) - err = func(last, 0, data); + + list_del_init(&walk->all); + if (!last) + goto out; + + err = func(last, 0, data); + if (err) { +save_dump: + list_move_tail(&walk->all, &last->km.all); + } + out: spin_unlock_bh(&xfrm_state_lock); - if (old != NULL) - xfrm_state_put(old); return err; } EXPORT_SYMBOL(xfrm_state_walk); void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto) { + INIT_LIST_HEAD(&walk->all); walk->proto = proto; - walk->state = NULL; - walk->count = 0; - list_add_tail(&walk->list, &xfrm_state_walks); - walk->genid = ++xfrm_state_walk_ongoing; + walk->state = XFRM_STATE_DEAD; + walk->seq = 0; } EXPORT_SYMBOL(xfrm_state_walk_init); void xfrm_state_walk_done(struct xfrm_state_walk *walk) { - struct list_head *prev; - - if (walk->state != NULL) { - xfrm_state_put(walk->state); - walk->state = NULL; - } - - prev = walk->list.prev; - list_del(&walk->list); - - if (prev != &xfrm_state_walks) { - list_entry(prev, struct xfrm_state_walk, list)->genid = - walk->genid; + if (list_empty(&walk->all)) return; - } - xfrm_state_walk_completed = walk->genid; - - if (!list_empty(&xfrm_state_gc_leftovers)) - schedule_work(&xfrm_state_gc_work); + spin_lock_bh(&xfrm_state_lock); + list_del(&walk->all); + spin_unlock_bh(&xfrm_state_lock); } EXPORT_SYMBOL(xfrm_state_walk_done); Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply related [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-25 7:57 ` Herbert Xu @ 2008-09-25 8:42 ` Timo Teräs 2008-09-25 8:56 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-25 8:42 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev, jamal Herbert Xu wrote: > On Thu, Sep 25, 2008 at 09:03:21AM +0300, Timo Teräs wrote: >> I forgot to test pf_key support. And remembered why the "last" >> thingy was in the walking routines. In pf_key dumps, the dump >> is terminated when an entry with seq zero is received; that's >> why the final entry received special treatment. And now that >> breaks. The first entry is zero so pf_key dumps only one entry. >> >> Not sure if we should fix this in pf_key or in the walking routines. > > This is part of the user-space ABI (and an RFC) so we can't change > it. I've put the last stuff back and fixed up the error handling > on the very last entry. I was just figuring out whether we need to change the seq in the pf_key.c kernel code or in xfrm_state.c. But I guess it's good if the walking works as previously. Now there is a new problem: if dumping is interrupted and meanwhile all the remaining entries get deleted, there is no entry to dump with seq zero when dumping is eventually continued. Since xfrm_user does not care about the seq as an end-of-message marker, this problem appears only in af_key. Maybe we should instead make the af_key.c code cache one entry before sending it to user space. We could then get rid of the last hack inside the xfrm core walking code. Cheers, Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-25 8:42 ` Timo Teräs @ 2008-09-25 8:56 ` Herbert Xu 2008-09-25 9:01 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-25 8:56 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev, jamal On Thu, Sep 25, 2008 at 11:42:15AM +0300, Timo Teräs wrote: > > Maybe we should instead make the af_key.c code cache one entry before > sending it to user space. We could then get rid of the last hack > inside the xfrm core walking code. Yes, that should work. But you want to cache the skb instead of the state, since the latter can die. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-25 8:56 ` Herbert Xu @ 2008-09-25 9:01 ` Timo Teräs 2008-09-25 9:49 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-25 9:01 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev, jamal Herbert Xu wrote: > On Thu, Sep 25, 2008 at 11:42:15AM +0300, Timo Teräs wrote: >> Maybe we should instead make the af_key.c code cache one entry before >> sending it to user space. We could then get rid of the last hack >> inside the xfrm core walking code. > > Yes, that should work. But you want to cache the skb instead > of the state, since the latter can die. Exactly: keep the converted entry in an skb, since we have to allocate that anyway. And when there are no more entries coming from the enumerator, the sequence number of the final skb is adjusted to zero. Not sure if I can finish it today, but I think I should have the time by tomorrow. Unless you have time to do it before that, I'll post a patch later this week. Will update the similar xfrm_policy patch for that too, before sending it. - Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
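The hold-back scheme just described, as it appears in the patch that follows: dump_sa()/dump_sp() never broadcast the entry they have just built; they send the previously cached skb and cache the new one, so pfkey_do_dump() can patch the final cached skb before flushing it:

	/* in the dump callback: ship the previous entry, hold back this one */
	if (pfk->dump.skb)
		pfkey_broadcast(pfk->dump.skb, GFP_ATOMIC, BROADCAST_ONE, &pfk->sk);
	pfk->dump.skb = out_skb;

	/* in pfkey_do_dump(), once the walk reports completion: */
	hdr = (struct sadb_msg *) pfk->dump.skb->data;
	hdr->sadb_msg_seq = 0;		/* RFC 2367 end-of-dump marker */
	hdr->sadb_msg_errno = rc;

This lets the in-kernel walk hand out seq values 1, 2, 3, ... while userland still sees the terminating zero only on the very last message.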
* Re: xfrm_state locking regression... 2008-09-25 9:01 ` Timo Teräs @ 2008-09-25 9:49 ` Herbert Xu 2008-09-25 12:12 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-25 9:49 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev, jamal On Thu, Sep 25, 2008 at 12:01:30PM +0300, Timo Teräs wrote: > > Not sure if I can finish it today, but I think I should have the > time by tomorrow. Unless you have time to do it before that, I'll > post a patch later this week. Will update the similar xfrm_policy > patch for that too, before sending it. Thanks! I've got some other work to do so feel free to post this tomorrow. -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-25 9:49 ` Herbert Xu @ 2008-09-25 12:12 ` Timo Teräs 2008-09-25 12:36 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-25 12:12 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev, jamal Herbert Xu wrote: > On Thu, Sep 25, 2008 at 12:01:30PM +0300, Timo Teräs wrote: >> Not sure if I can finish it today, but I think I should have the >> time by tomorrow. Unless you have time to do it before that, I'll >> post a patch later this week. Will update the similar xfrm_policy >> patch for that too, before sending it. > > Thanks! I've got some other work to do so feel free to post this > tomorrow. It was simpler than I thought (and I thought I would be more busy with other things). Anyway, here goes. This is a single patch that modifies both xfrm_state and xfrm_policy dumping. This was to make the af_key patch nicer (the final skb patching/sending is common code). Also, since the af_key implementation is the only place where this seq number is used, we might consider removing it from the core walking code. It's pretty af_key specific too. The xfrm_policy patch moves xfrm_policy.dead to xfrm_policy_walk_entry, which is the struct shared by walkers and the real entries. I did some testing to ensure that it works with xfrm and af_key as expected (with entries added by hand and dumping them). ipsec: Put dumpers on the dump list Herbert Xu came up with the idea and the original patch to make xfrm_state dump list contain also dumpers: As it is we go to extraordinary lengths to ensure that states don't go away while dumpers go to sleep. It's much easier if we just put the dumpers themselves on the list since they can't go away while they're going. I've also changed the order of addition on new states to prevent a never-ending dump. Timo Teräs improved the patch to apply cleanly to latest tree, modified iteration code to be more readable by using a common struct for entries in the list, implemented the same idea for xfrm_policy dumping and moved the af_key specific "last" entry caching to af_key. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Timo Teras <timo.teras@iki.fi> diff --git a/include/linux/netlink.h b/include/linux/netlink.h index cbba776..9ff1b54 100644 --- a/include/linux/netlink.h +++ b/include/linux/netlink.h @@ -220,7 +220,7 @@ struct netlink_callback int (*dump)(struct sk_buff * skb, struct netlink_callback *cb); int (*done)(struct netlink_callback *cb); int family; - long args[7]; + long args[6]; }; struct netlink_notify diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 48630b2..b98d205 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -117,12 +117,21 @@ extern struct mutex xfrm_cfg_mutex; metrics. Plus, it will be made via sk->sk_dst_cache. Solved. */ +struct xfrm_state_walk { + struct list_head all; + u8 state; + union { + u8 dying; + u8 proto; + }; + u32 seq; +}; + /* Full description of state of transformer. */ struct xfrm_state { - struct list_head all; union { - struct list_head gclist; + struct hlist_node gclist; struct hlist_node bydst; }; struct hlist_node bysrc; @@ -136,12 +145,8 @@ struct xfrm_state u32 genid; - /* Key manger bits */ - struct { - u8 state; - u8 dying; - u32 seq; - } km; + /* Key manager bits */ + struct xfrm_state_walk km; /* Parameters of this state. 
*/ struct { @@ -449,10 +454,20 @@ struct xfrm_tmpl #define XFRM_MAX_DEPTH 6 +struct xfrm_policy_walk_entry { + struct list_head all; + u8 dead; +}; + +struct xfrm_policy_walk { + struct xfrm_policy_walk_entry walk; + u8 type; + u32 seq; +}; + struct xfrm_policy { struct xfrm_policy *next; - struct list_head bytype; struct hlist_node bydst; struct hlist_node byidx; @@ -467,13 +482,12 @@ struct xfrm_policy struct xfrm_lifetime_cfg lft; struct xfrm_lifetime_cur curlft; struct dst_entry *bundles; - u16 family; + struct xfrm_policy_walk_entry walk; u8 type; u8 action; u8 flags; - u8 dead; u8 xfrm_nr; - /* XXX 1 byte hole, try to pack */ + u16 family; struct xfrm_sec_ctx *security; struct xfrm_tmpl xfrm_vec[XFRM_MAX_DEPTH]; }; @@ -1245,20 +1259,6 @@ struct xfrm6_tunnel { int priority; }; -struct xfrm_state_walk { - struct list_head list; - unsigned long genid; - struct xfrm_state *state; - int count; - u8 proto; -}; - -struct xfrm_policy_walk { - struct xfrm_policy *policy; - int count; - u8 type, cur_type; -}; - extern void xfrm_init(void); extern void xfrm4_init(void); extern void xfrm_state_init(void); @@ -1410,24 +1410,10 @@ static inline int xfrm4_udp_encap_rcv(struct sock *sk, struct sk_buff *skb) struct xfrm_policy *xfrm_policy_alloc(gfp_t gfp); -static inline void xfrm_policy_walk_init(struct xfrm_policy_walk *walk, u8 type) -{ - walk->cur_type = XFRM_POLICY_TYPE_MAIN; - walk->type = type; - walk->policy = NULL; - walk->count = 0; -} - -static inline void xfrm_policy_walk_done(struct xfrm_policy_walk *walk) -{ - if (walk->policy != NULL) { - xfrm_pol_put(walk->policy); - walk->policy = NULL; - } -} - +extern void xfrm_policy_walk_init(struct xfrm_policy_walk *walk, u8 type); extern int xfrm_policy_walk(struct xfrm_policy_walk *walk, int (*func)(struct xfrm_policy *, int, int, void*), void *); +extern void xfrm_policy_walk_done(struct xfrm_policy_walk *walk); int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl); struct xfrm_policy *xfrm_policy_bysel_ctx(u8 type, int dir, struct xfrm_selector *sel, diff --git a/net/key/af_key.c b/net/key/af_key.c index d628df9..9fd4e68 100644 --- a/net/key/af_key.c +++ b/net/key/af_key.c @@ -58,6 +58,7 @@ struct pfkey_sock { struct xfrm_policy_walk policy; struct xfrm_state_walk state; } u; + struct sk_buff *skb; } dump; }; @@ -73,20 +74,6 @@ static int pfkey_can_dump(struct sock *sk) return 0; } -static int pfkey_do_dump(struct pfkey_sock *pfk) -{ - int rc; - - rc = pfk->dump.dump(pfk); - if (rc == -ENOBUFS) - return 0; - - pfk->dump.done(pfk); - pfk->dump.dump = NULL; - pfk->dump.done = NULL; - return rc; -} - static void pfkey_sock_destruct(struct sock *sk) { skb_queue_purge(&sk->sk_receive_queue); @@ -310,6 +297,33 @@ static int pfkey_broadcast(struct sk_buff *skb, gfp_t allocation, return err; } +static int pfkey_do_dump(struct pfkey_sock *pfk) +{ + struct sadb_msg *hdr; + int rc; + + rc = pfk->dump.dump(pfk); + if (rc == -ENOBUFS) + return 0; + + if (pfk->dump.skb) { + if (!pfkey_can_dump(&pfk->sk)) + return 0; + + hdr = (struct sadb_msg *) pfk->dump.skb->data; + hdr->sadb_msg_seq = 0; + hdr->sadb_msg_errno = rc; + pfkey_broadcast(pfk->dump.skb, GFP_ATOMIC, BROADCAST_ONE, + &pfk->sk); + pfk->dump.skb = NULL; + } + + pfk->dump.done(pfk); + pfk->dump.dump = NULL; + pfk->dump.done = NULL; + return rc; +} + static inline void pfkey_hdr_dup(struct sadb_msg *new, struct sadb_msg *orig) { *new = *orig; @@ -1736,9 +1750,14 @@ static int dump_sa(struct xfrm_state *x, int count, void *ptr) out_hdr->sadb_msg_satype = 
pfkey_proto2satype(x->id.proto); out_hdr->sadb_msg_errno = 0; out_hdr->sadb_msg_reserved = 0; - out_hdr->sadb_msg_seq = count; + out_hdr->sadb_msg_seq = count + 1; out_hdr->sadb_msg_pid = pfk->dump.msg_pid; - pfkey_broadcast(out_skb, GFP_ATOMIC, BROADCAST_ONE, &pfk->sk); + + if (pfk->dump.skb) + pfkey_broadcast(pfk->dump.skb, GFP_ATOMIC, BROADCAST_ONE, + &pfk->sk); + pfk->dump.skb = out_skb; + return 0; } @@ -2237,7 +2256,7 @@ static int pfkey_spdadd(struct sock *sk, struct sk_buff *skb, struct sadb_msg *h return 0; out: - xp->dead = 1; + xp->walk.dead = 1; xfrm_policy_destroy(xp); return err; } @@ -2575,9 +2594,14 @@ static int dump_sp(struct xfrm_policy *xp, int dir, int count, void *ptr) out_hdr->sadb_msg_type = SADB_X_SPDDUMP; out_hdr->sadb_msg_satype = SADB_SATYPE_UNSPEC; out_hdr->sadb_msg_errno = 0; - out_hdr->sadb_msg_seq = count; + out_hdr->sadb_msg_seq = count + 1; out_hdr->sadb_msg_pid = pfk->dump.msg_pid; - pfkey_broadcast(out_skb, GFP_ATOMIC, BROADCAST_ONE, &pfk->sk); + + if (pfk->dump.skb) + pfkey_broadcast(pfk->dump.skb, GFP_ATOMIC, BROADCAST_ONE, + &pfk->sk); + pfk->dump.skb = out_skb; + return 0; } diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index ef9ccbc..b7ec080 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -46,7 +46,7 @@ EXPORT_SYMBOL(xfrm_cfg_mutex); static DEFINE_RWLOCK(xfrm_policy_lock); -static struct list_head xfrm_policy_bytype[XFRM_POLICY_TYPE_MAX]; +static struct list_head xfrm_policy_all; unsigned int xfrm_policy_count[XFRM_POLICY_MAX*2]; EXPORT_SYMBOL(xfrm_policy_count); @@ -164,7 +164,7 @@ static void xfrm_policy_timer(unsigned long data) read_lock(&xp->lock); - if (xp->dead) + if (xp->walk.dead) goto out; dir = xfrm_policy_id2dir(xp->index); @@ -236,7 +236,7 @@ struct xfrm_policy *xfrm_policy_alloc(gfp_t gfp) policy = kzalloc(sizeof(struct xfrm_policy), gfp); if (policy) { - INIT_LIST_HEAD(&policy->bytype); + INIT_LIST_HEAD(&policy->walk.all); INIT_HLIST_NODE(&policy->bydst); INIT_HLIST_NODE(&policy->byidx); rwlock_init(&policy->lock); @@ -252,17 +252,13 @@ EXPORT_SYMBOL(xfrm_policy_alloc); void xfrm_policy_destroy(struct xfrm_policy *policy) { - BUG_ON(!policy->dead); + BUG_ON(!policy->walk.dead); BUG_ON(policy->bundles); if (del_timer(&policy->timer)) BUG(); - write_lock_bh(&xfrm_policy_lock); - list_del(&policy->bytype); - write_unlock_bh(&xfrm_policy_lock); - security_xfrm_policy_free(policy->security); kfree(policy); } @@ -310,8 +306,8 @@ static void xfrm_policy_kill(struct xfrm_policy *policy) int dead; write_lock_bh(&policy->lock); - dead = policy->dead; - policy->dead = 1; + dead = policy->walk.dead; + policy->walk.dead = 1; write_unlock_bh(&policy->lock); if (unlikely(dead)) { @@ -609,6 +605,7 @@ int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl) if (delpol) { hlist_del(&delpol->bydst); hlist_del(&delpol->byidx); + list_del(&delpol->walk.all); xfrm_policy_count[dir]--; } policy->index = delpol ? 
delpol->index : xfrm_gen_index(policy->type, dir); @@ -617,7 +614,7 @@ int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl) policy->curlft.use_time = 0; if (!mod_timer(&policy->timer, jiffies + HZ)) xfrm_pol_hold(policy); - list_add_tail(&policy->bytype, &xfrm_policy_bytype[policy->type]); + list_add(&policy->walk.all, &xfrm_policy_all); write_unlock_bh(&xfrm_policy_lock); if (delpol) @@ -684,6 +681,7 @@ struct xfrm_policy *xfrm_policy_bysel_ctx(u8 type, int dir, } hlist_del(&pol->bydst); hlist_del(&pol->byidx); + list_del(&pol->walk.all); xfrm_policy_count[dir]--; } ret = pol; @@ -727,6 +725,7 @@ struct xfrm_policy *xfrm_policy_byid(u8 type, int dir, u32 id, int delete, } hlist_del(&pol->bydst); hlist_del(&pol->byidx); + list_del(&pol->walk.all); xfrm_policy_count[dir]--; } ret = pol; @@ -840,6 +839,7 @@ int xfrm_policy_flush(u8 type, struct xfrm_audit *audit_info) continue; hlist_del(&pol->bydst); hlist_del(&pol->byidx); + list_del(&pol->walk.all); write_unlock_bh(&xfrm_policy_lock); xfrm_audit_policy_delete(pol, 1, @@ -867,60 +867,68 @@ int xfrm_policy_walk(struct xfrm_policy_walk *walk, int (*func)(struct xfrm_policy *, int, int, void*), void *data) { - struct xfrm_policy *old, *pol, *last = NULL; + struct xfrm_policy *pol; + struct xfrm_policy_walk_entry *x; int error = 0; if (walk->type >= XFRM_POLICY_TYPE_MAX && walk->type != XFRM_POLICY_TYPE_ANY) return -EINVAL; - if (walk->policy == NULL && walk->count != 0) + if (list_empty(&walk->walk.all) && walk->seq != 0) return 0; - old = pol = walk->policy; - walk->policy = NULL; - read_lock_bh(&xfrm_policy_lock); - - for (; walk->cur_type < XFRM_POLICY_TYPE_MAX; walk->cur_type++) { - if (walk->type != walk->cur_type && - walk->type != XFRM_POLICY_TYPE_ANY) + write_lock_bh(&xfrm_policy_lock); + if (list_empty(&walk->walk.all)) + x = list_first_entry(&xfrm_policy_all, struct xfrm_policy_walk_entry, all); + else + x = list_entry(&walk->walk.all, struct xfrm_policy_walk_entry, all); + list_for_each_entry_from(x, &xfrm_policy_all, all) { + if (x->dead) continue; - - if (pol == NULL) { - pol = list_first_entry(&xfrm_policy_bytype[walk->cur_type], - struct xfrm_policy, bytype); - } - list_for_each_entry_from(pol, &xfrm_policy_bytype[walk->cur_type], bytype) { - if (pol->dead) - continue; - if (last) { - error = func(last, xfrm_policy_id2dir(last->index), - walk->count, data); - if (error) { - xfrm_pol_hold(last); - walk->policy = last; - goto out; - } - } - last = pol; - walk->count++; + pol = container_of(x, struct xfrm_policy, walk); + if (walk->type != XFRM_POLICY_TYPE_ANY && + walk->type != pol->type) + continue; + error = func(pol, xfrm_policy_id2dir(pol->index), + walk->seq, data); + if (error) { + list_move_tail(&walk->walk.all, &x->all); + goto out; } - pol = NULL; + walk->seq++; } - if (walk->count == 0) { + if (walk->seq == 0) { error = -ENOENT; goto out; } - if (last) - error = func(last, xfrm_policy_id2dir(last->index), 0, data); + list_del_init(&walk->walk.all); out: - read_unlock_bh(&xfrm_policy_lock); - if (old != NULL) - xfrm_pol_put(old); + write_unlock_bh(&xfrm_policy_lock); return error; } EXPORT_SYMBOL(xfrm_policy_walk); +void xfrm_policy_walk_init(struct xfrm_policy_walk *walk, u8 type) +{ + INIT_LIST_HEAD(&walk->walk.all); + walk->walk.dead = 1; + walk->type = type; + walk->seq = 0; +} +EXPORT_SYMBOL(xfrm_policy_walk_init); + +void xfrm_policy_walk_done(struct xfrm_policy_walk *walk) +{ + if (list_empty(&walk->walk.all)) + return; + + write_lock_bh(&xfrm_policy_lock); + list_del(&walk->walk.all); + 
write_unlock_bh(&xfrm_policy_lock); +} +EXPORT_SYMBOL(xfrm_policy_walk_done); + /* * Find policy to apply to this flow. * @@ -1077,7 +1085,7 @@ static void __xfrm_policy_link(struct xfrm_policy *pol, int dir) struct hlist_head *chain = policy_hash_bysel(&pol->selector, pol->family, dir); - list_add_tail(&pol->bytype, &xfrm_policy_bytype[pol->type]); + list_add(&pol->walk.all, &xfrm_policy_all); hlist_add_head(&pol->bydst, chain); hlist_add_head(&pol->byidx, xfrm_policy_byidx+idx_hash(pol->index)); xfrm_policy_count[dir]++; @@ -1095,6 +1103,7 @@ static struct xfrm_policy *__xfrm_policy_unlink(struct xfrm_policy *pol, hlist_del(&pol->bydst); hlist_del(&pol->byidx); + list_del(&pol->walk.all); xfrm_policy_count[dir]--; return pol; @@ -1720,7 +1729,7 @@ restart: for (pi = 0; pi < npols; pi++) { read_lock_bh(&pols[pi]->lock); - pol_dead |= pols[pi]->dead; + pol_dead |= pols[pi]->walk.dead; read_unlock_bh(&pols[pi]->lock); } @@ -2415,9 +2424,7 @@ static void __init xfrm_policy_init(void) panic("XFRM: failed to allocate bydst hash\n"); } - for (dir = 0; dir < XFRM_POLICY_TYPE_MAX; dir++) - INIT_LIST_HEAD(&xfrm_policy_bytype[dir]); - + INIT_LIST_HEAD(&xfrm_policy_all); INIT_WORK(&xfrm_policy_gc_work, xfrm_policy_gc_task); register_netdevice_notifier(&xfrm_dev_notifier); } @@ -2601,7 +2608,7 @@ static int xfrm_policy_migrate(struct xfrm_policy *pol, int i, j, n = 0; write_lock_bh(&pol->lock); - if (unlikely(pol->dead)) { + if (unlikely(pol->walk.dead)) { /* target policy has been deleted */ write_unlock_bh(&pol->lock); return -ENOENT; diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 053970e..747fd8c 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -59,14 +59,6 @@ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024; static unsigned int xfrm_state_num; static unsigned int xfrm_state_genid; -/* Counter indicating ongoing walk, protected by xfrm_state_lock. */ -static unsigned long xfrm_state_walk_ongoing; -/* Counter indicating walk completion, protected by xfrm_cfg_mutex. */ -static unsigned long xfrm_state_walk_completed; - -/* List of outstanding state walks used to set the completed counter. 
*/ -static LIST_HEAD(xfrm_state_walks); - static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned int family); static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo); @@ -199,8 +191,7 @@ static DEFINE_RWLOCK(xfrm_state_afinfo_lock); static struct xfrm_state_afinfo *xfrm_state_afinfo[NPROTO]; static struct work_struct xfrm_state_gc_work; -static LIST_HEAD(xfrm_state_gc_leftovers); -static LIST_HEAD(xfrm_state_gc_list); +static HLIST_HEAD(xfrm_state_gc_list); static DEFINE_SPINLOCK(xfrm_state_gc_lock); int __xfrm_state_delete(struct xfrm_state *x); @@ -412,23 +403,16 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x) static void xfrm_state_gc_task(struct work_struct *data) { - struct xfrm_state *x, *tmp; - unsigned long completed; + struct xfrm_state *x; + struct hlist_node *entry, *tmp; + struct hlist_head gc_list; - mutex_lock(&xfrm_cfg_mutex); spin_lock_bh(&xfrm_state_gc_lock); - list_splice_tail_init(&xfrm_state_gc_list, &xfrm_state_gc_leftovers); + hlist_move_list(&xfrm_state_gc_list, &gc_list); spin_unlock_bh(&xfrm_state_gc_lock); - completed = xfrm_state_walk_completed; - mutex_unlock(&xfrm_cfg_mutex); - - list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) { - if ((long)(x->lastused - completed) > 0) - break; - list_del(&x->gclist); + hlist_for_each_entry_safe(x, entry, tmp, &gc_list, gclist) xfrm_state_gc_destroy(x); - } wake_up(&km_waitq); } @@ -529,7 +513,7 @@ struct xfrm_state *xfrm_state_alloc(void) if (x) { atomic_set(&x->refcnt, 1); atomic_set(&x->tunnel_users, 0); - INIT_LIST_HEAD(&x->all); + INIT_LIST_HEAD(&x->km.all); INIT_HLIST_NODE(&x->bydst); INIT_HLIST_NODE(&x->bysrc); INIT_HLIST_NODE(&x->byspi); @@ -556,7 +540,7 @@ void __xfrm_state_destroy(struct xfrm_state *x) WARN_ON(x->km.state != XFRM_STATE_DEAD); spin_lock_bh(&xfrm_state_gc_lock); - list_add_tail(&x->gclist, &xfrm_state_gc_list); + hlist_add_head(&x->gclist, &xfrm_state_gc_list); spin_unlock_bh(&xfrm_state_gc_lock); schedule_work(&xfrm_state_gc_work); } @@ -569,8 +553,7 @@ int __xfrm_state_delete(struct xfrm_state *x) if (x->km.state != XFRM_STATE_DEAD) { x->km.state = XFRM_STATE_DEAD; spin_lock(&xfrm_state_lock); - x->lastused = xfrm_state_walk_ongoing; - list_del_rcu(&x->all); + list_del(&x->km.all); hlist_del(&x->bydst); hlist_del(&x->bysrc); if (x->id.spi) @@ -871,7 +854,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr, if (km_query(x, tmpl, pol) == 0) { x->km.state = XFRM_STATE_ACQ; - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); hlist_add_head(&x->bydst, xfrm_state_bydst+h); h = xfrm_src_hash(daddr, saddr, family); hlist_add_head(&x->bysrc, xfrm_state_bysrc+h); @@ -940,7 +923,7 @@ static void __xfrm_state_insert(struct xfrm_state *x) x->genid = ++xfrm_state_genid; - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); h = xfrm_dst_hash(&x->id.daddr, &x->props.saddr, x->props.reqid, x->props.family); @@ -1069,7 +1052,7 @@ static struct xfrm_state *__find_acq_core(unsigned short family, u8 mode, u32 re xfrm_state_hold(x); x->timer.expires = jiffies + sysctl_xfrm_acq_expires*HZ; add_timer(&x->timer); - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); hlist_add_head(&x->bydst, xfrm_state_bydst+h); h = xfrm_src_hash(daddr, saddr, family); hlist_add_head(&x->bysrc, xfrm_state_bysrc+h); @@ -1566,79 +1549,59 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, int (*func)(struct xfrm_state *, int, void*), void *data) { - struct xfrm_state *old, *x, *last = 
NULL; + struct xfrm_state *state; + struct xfrm_state_walk *x; int err = 0; - if (walk->state == NULL && walk->count != 0) + if (walk->seq != 0 && list_empty(&walk->all)) return 0; - old = x = walk->state; - walk->state = NULL; spin_lock_bh(&xfrm_state_lock); - if (x == NULL) - x = list_first_entry(&xfrm_state_all, struct xfrm_state, all); + if (list_empty(&walk->all)) + x = list_first_entry(&xfrm_state_all, struct xfrm_state_walk, all); + else + x = list_entry(&walk->all, struct xfrm_state_walk, all); list_for_each_entry_from(x, &xfrm_state_all, all) { - if (x->km.state == XFRM_STATE_DEAD) + if (x->state == XFRM_STATE_DEAD) continue; - if (!xfrm_id_proto_match(x->id.proto, walk->proto)) + state = container_of(x, struct xfrm_state, km); + if (!xfrm_id_proto_match(state->id.proto, walk->proto)) continue; - if (last) { - err = func(last, walk->count, data); - if (err) { - xfrm_state_hold(last); - walk->state = last; - goto out; - } + err = func(state, walk->seq, data); + if (err) { + list_move_tail(&walk->all, &x->all); + goto out; } - last = x; - walk->count++; + walk->seq++; } - if (walk->count == 0) { + if (walk->seq == 0) { err = -ENOENT; goto out; } - if (last) - err = func(last, 0, data); + list_del_init(&walk->all); out: spin_unlock_bh(&xfrm_state_lock); - if (old != NULL) - xfrm_state_put(old); return err; } EXPORT_SYMBOL(xfrm_state_walk); void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto) { + INIT_LIST_HEAD(&walk->all); walk->proto = proto; - walk->state = NULL; - walk->count = 0; - list_add_tail(&walk->list, &xfrm_state_walks); - walk->genid = ++xfrm_state_walk_ongoing; + walk->state = XFRM_STATE_DEAD; + walk->seq = 0; } EXPORT_SYMBOL(xfrm_state_walk_init); void xfrm_state_walk_done(struct xfrm_state_walk *walk) { - struct list_head *prev; - - if (walk->state != NULL) { - xfrm_state_put(walk->state); - walk->state = NULL; - } - - prev = walk->list.prev; - list_del(&walk->list); - - if (prev != &xfrm_state_walks) { - list_entry(prev, struct xfrm_state_walk, list)->genid = - walk->genid; + if (list_empty(&walk->all)) return; - } - - xfrm_state_walk_completed = walk->genid; - if (!list_empty(&xfrm_state_gc_leftovers)) - schedule_work(&xfrm_state_gc_work); + spin_lock_bh(&xfrm_state_lock); + list_del(&walk->all); + spin_unlock_bh(&xfrm_state_lock); } EXPORT_SYMBOL(xfrm_state_walk_done); diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c index 04c4150..76f75df 100644 --- a/net/xfrm/xfrm_user.c +++ b/net/xfrm/xfrm_user.c @@ -1102,7 +1102,7 @@ static struct xfrm_policy *xfrm_policy_construct(struct xfrm_userpolicy_info *p, return xp; error: *errp = err; - xp->dead = 1; + xp->walk.dead = 1; xfrm_policy_destroy(xp); return NULL; } @@ -1595,7 +1595,7 @@ static int xfrm_add_pol_expire(struct sk_buff *skb, struct nlmsghdr *nlh, return -ENOENT; read_lock(&xp->lock); - if (xp->dead) { + if (xp->walk.dead) { read_unlock(&xp->lock); goto out; } ^ permalink raw reply related [flat|nested] 95+ messages in thread
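The mechanism, stripped of the xfrm details: the walker is itself a node on the same list as the real entries, and when a pass cannot continue it is parked in front of the first entry that did not fit, so the next pass resumes exactly there. Walker nodes are skipped during iteration (in the patch this falls out of the walker's state field being initialized to XFRM_STATE_DEAD). A minimal userspace sketch of the pattern, with a toy list implementation and illustrative names rather than the kernel's list.h:

#include <stdio.h>

/* Toy doubly-linked list, standing in for the kernel's list.h. */
struct node { struct node *prev, *next; int is_walker; int id; };

static void list_init(struct node *n) { n->prev = n->next = n; }

static void list_del(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

static void list_add_before(struct node *n, struct node *pos)
{
	n->prev = pos->prev;
	n->next = pos;
	pos->prev->next = n;
	pos->prev = n;
}

/* Visit up to 'budget' entries, then park the walker in front of the
 * first entry that did not fit; a later call resumes from the walker.
 * Returns nonzero while there is more to dump. */
static int walk(struct node *head, struct node *walker, int budget)
{
	struct node *pos, *next;
	int n = 0;

	if (walker->next != walker) {		/* parked: resume there */
		pos = walker->next;
		list_del(walker);
	} else {				/* fresh walk           */
		pos = head->next;
	}

	for (; pos != head; pos = next) {
		next = pos->next;
		if (pos->is_walker)		/* skip other walkers   */
			continue;
		if (n == budget) {
			list_add_before(walker, pos);
			return 1;
		}
		printf("visit %d\n", pos->id);
		n++;
	}
	return 0;
}

int main(void)
{
	struct node head, walker = { 0, 0, 1, -1 }, e[5];
	int i;

	list_init(&head);
	list_init(&walker);
	for (i = 0; i < 5; i++) {
		e[i].is_walker = 0;
		e[i].id = i;
		list_add_before(&e[i], &head);	/* add at tail */
	}
	while (walk(&head, &walker, 2))		/* 2 entries per pass */
		printf("-- walker parked, resuming --\n");
	return 0;
}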
* Re: xfrm_state locking regression... 2008-09-25 12:12 ` Timo Teräs @ 2008-09-25 12:36 ` Timo Teräs 2008-09-26 2:08 ` Herbert Xu 2008-10-01 10:07 ` David Miller 0 siblings, 2 replies; 95+ messages in thread From: Timo Teräs @ 2008-09-25 12:36 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev, jamal Timo Teräs wrote: > Herbert Xu wrote: >> Thanks! I've got some other work to do so feel free to post this >> tomorrow. > > It was simpler than I thought (and I thought I would be more busy > with other things). Anyway, here goes. This is a single patch that > modifies both xfrm_state and xfrm_policy dumping. This was to get > af_key patch nicer (the final skb patching/sending is common code). One more thing I noticed. If the af_key socket is closed while dumping is ongoing the current code can leak stuff. We should call the .done. And also in the new patch, release the stored skb. The right place to do that would be pfkey_release, I assume? - Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
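What that cleanup might look like, hedged accordingly: the dump.* fields follow the patch above, pfkey_terminate_dump() is a hypothetical helper name, and calling it from pfkey_release() is exactly the open question here, not merged code.

/* Hypothetical helper; would run once per socket from pfkey_release(),
 * before the final sock_put(). */
static void pfkey_terminate_dump(struct pfkey_sock *pfk)
{
	if (pfk->dump.dump) {
		if (pfk->dump.skb) {
			kfree_skb(pfk->dump.skb);  /* stored partial message */
			pfk->dump.skb = NULL;
		}
		pfk->dump.done(pfk);  /* e.g. unlinks the parked walker */
		pfk->dump.dump = NULL;
		pfk->dump.done = NULL;
	}
}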
* Re: xfrm_state locking regression... 2008-09-25 12:36 ` Timo Teräs @ 2008-09-26 2:08 ` Herbert Xu 2008-10-01 10:07 ` David Miller 1 sibling, 0 replies; 95+ messages in thread From: Herbert Xu @ 2008-09-26 2:08 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev, jamal On Thu, Sep 25, 2008 at 03:36:04PM +0300, Timo Teräs wrote: > > The right place to do that would be pfkey_release, I assume? Yes that sounds right. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-25 12:36 ` Timo Teräs 2008-09-26 2:08 ` Herbert Xu @ 2008-10-01 10:07 ` David Miller 2008-10-01 14:05 ` Herbert Xu 1 sibling, 1 reply; 95+ messages in thread From: David Miller @ 2008-10-01 10:07 UTC (permalink / raw) To: timo.teras; +Cc: herbert, netdev, hadi From: Timo Teräs <timo.teras@iki.fi> Date: Thu, 25 Sep 2008 15:36:04 +0300 > Timo Teräs wrote: > > Herbert Xu wrote: > >> Thanks! I've got some other work to do so feel free to post this > >> tomorrow. > > > > It was simpler than I thought (and I thought I would be more busy > > with other things). Anyway, here goes. This is a single patch that > > modifies both xfrm_state and xfrm_policy dumping. This was to get > > af_key patch nicer (the final skb patching/sending is common code). > > One more thing I noticed. If the af_key socket is closed while > dumping is ongoing the current code can leak stuff. We should call > the .done. And also in the new patch, release the stored skb. > > The right place to do that would be pfkey_release, I assume? I reread this thread, and the most recent patch seems good and even tested :-) But it seems that you want to make some more tweaks, right? Please post explicitly the patch you want me to apply to net-next-2.6 once you reach that point, thanks! ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-10-01 10:07 ` David Miller @ 2008-10-01 14:05 ` Herbert Xu 0 siblings, 0 replies; 95+ messages in thread From: Herbert Xu @ 2008-10-01 14:05 UTC (permalink / raw) To: David Miller; +Cc: timo.teras, netdev, hadi On Wed, Oct 01, 2008 at 03:07:53AM -0700, David Miller wrote: > > > The right place to do that would be pfkey_release, I assume? > > I reread this thread, and the most recent patch seems good > and even tested :-) But it seems that you want to make some > more tweaks, right? > > Please post explicitly the patch you want me to apply to > net-next-2.6 once you reach that point, thanks! Yes I think Timo is still touching it up. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-22 11:42 ` Herbert Xu 2008-09-22 13:01 ` Timo Teräs @ 2008-09-23 2:48 ` David Miller 1 sibling, 0 replies; 95+ messages in thread From: David Miller @ 2008-09-23 2:48 UTC (permalink / raw) To: herbert; +Cc: timo.teras, netdev From: Herbert Xu <herbert@gondor.apana.org.au> Date: Mon, 22 Sep 2008 19:42:56 +0800 > ipsec: Fix xfrm_state_walk race > > As discovered by Timo Teräs, the current xfrm_state_walk scheme > is racy because if a second dump finishes before the first, we > may free xfrm states that the first dump would walk over later. > > This patch fixes this by storing the dumps in a list in order > to calculate the correct completion counter which cures this > problem. > > I've expanded netlink_cb in order to accommodate the extra state > related to this. It shouldn't be a big deal since netlink_cb > is kmalloced for each dump and we're just increasing it by 4 or > 8 bytes. > > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Looks good, applied to net-next-2.6 ^ permalink raw reply [flat|nested] 95+ messages in thread
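The case this cures is out-of-order completion: if dump #2 finishes before dump #1, the completed counter must not jump past #1, or states parked for #1 could be freed under it. With the dumps themselves on a list, a finishing walk that still has an older neighbour hands its generation number backwards instead of publishing it. A userspace toy of that hand-off (illustrative names):

#include <stdio.h>

struct walk { struct walk *prev, *next; unsigned long genid; };

static struct walk walks = { &walks, &walks, 0 };	/* list head */
static unsigned long walk_ongoing, walk_completed;

static void walk_start(struct walk *w)
{
	w->genid = ++walk_ongoing;
	w->prev = walks.prev;			/* add at tail */
	w->next = &walks;
	walks.prev->next = w;
	walks.prev = w;
}

static void walk_done(struct walk *w)
{
	struct walk *prev = w->prev;

	w->prev->next = w->next;		/* unlink */
	w->next->prev = w->prev;

	if (prev != &walks)
		prev->genid = w->genid;		/* older walk inherits it */
	else
		walk_completed = w->genid;	/* oldest walk: publish   */
	printf("completed = %lu\n", walk_completed);
}

int main(void)
{
	struct walk a, b;

	walk_start(&a);		/* genid 1 */
	walk_start(&b);		/* genid 2 */
	walk_done(&b);		/* out of order: completed stays 0    */
	walk_done(&a);		/* oldest done: completed jumps to 2  */
	return 0;
}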
* Re: xfrm_state locking regression... 2008-09-03 5:50 ` Herbert Xu 2008-09-03 6:14 ` David Miller @ 2008-09-10 3:04 ` David Miller 2008-09-10 3:15 ` Herbert Xu 1 sibling, 1 reply; 95+ messages in thread From: David Miller @ 2008-09-10 3:04 UTC (permalink / raw) To: herbert; +Cc: timo.teras, netdev From: Herbert Xu <herbert@gondor.apana.org.au> Date: Wed, 3 Sep 2008 15:50:41 +1000 > ipsec: Move state dump list unlinking into GC > > This patch moves the unlinking of x->all (for dumping) into the > GC so that we know for sure what locks we hold when we unlink the > entry. > > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> After your patch I just applied, things are a bit different now. Are you going to respin this or is it not relevant any longer? > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c > index 89551b6..0244325 100644 > --- a/net/xfrm/xfrm_state.c > +++ b/net/xfrm/xfrm_state.c > @@ -380,6 +380,10 @@ static void xfrm_put_mode(struct xfrm_mode *mode) > > static void xfrm_state_gc_destroy(struct xfrm_state *x) > { > + spin_lock_bh(&xfrm_state_lock); > + list_del(&x->all); > + spin_unlock_bh(&xfrm_state_lock); > + > del_timer_sync(&x->timer); > del_timer_sync(&x->rtimer); > kfree(x->aalg); > @@ -540,10 +544,6 @@ void __xfrm_state_destroy(struct xfrm_state *x) > { > WARN_ON(x->km.state != XFRM_STATE_DEAD); > > - spin_lock_bh(&xfrm_state_lock); > - list_del(&x->all); > - spin_unlock_bh(&xfrm_state_lock); > - > spin_lock_bh(&xfrm_state_gc_lock); > hlist_add_head(&x->bydst, &xfrm_state_gc_list); > spin_unlock_bh(&xfrm_state_gc_lock); > > Thanks, > -- > Visit Openswan at http://www.openswan.org/ > Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> > Home Page: http://gondor.apana.org.au/~herbert/ > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-10 3:04 ` David Miller @ 2008-09-10 3:15 ` Herbert Xu 2008-09-10 3:22 ` David Miller 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-10 3:15 UTC (permalink / raw) To: David Miller; +Cc: timo.teras, netdev On Tue, Sep 09, 2008 at 08:04:00PM -0700, David Miller wrote: > From: Herbert Xu <herbert@gondor.apana.org.au> > Date: Wed, 3 Sep 2008 15:50:41 +1000 > > > ipsec: Move state dump list unlinking into GC > > > > This patch moves the unlinking of x->all (for dumping) into the > > GC so that we know for sure what locks we hold when we unlink the > > entry. > > > > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> > > After your patch I just applied, things are a bit different now. > > Are you going to respin this or is it not relevant any longer? This patch is now redundant. What could be done though is to remove x->all and restore the original walking by hashing. This would eliminate a list node from xfrm_state, and would also restore the original dump ordering. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-10 3:15 ` Herbert Xu @ 2008-09-10 3:22 ` David Miller 2008-09-10 3:23 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: David Miller @ 2008-09-10 3:22 UTC (permalink / raw) To: herbert; +Cc: timo.teras, netdev From: Herbert Xu <herbert@gondor.apana.org.au> Date: Wed, 10 Sep 2008 13:15:59 +1000 > This patch is now redundant. Ok. > What could be done though is to remove x->all and restore the > original walking by hashing. This would eliminate a list node > from xfrm_state, and would also restore the original dump ordering. Yes I remember you mentioning that. If you're too busy I can try to spin this idea into a patch. ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-10 3:22 ` David Miller @ 2008-09-10 3:23 ` Herbert Xu 2008-09-10 3:38 ` David Miller 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-10 3:23 UTC (permalink / raw) To: David Miller; +Cc: timo.teras, netdev On Tue, Sep 09, 2008 at 08:22:15PM -0700, David Miller wrote: > > > What could be done though is to remove x->all and restore the > > original walking by hashing. This would eliminate a list node > > from xfrm_state, and would also restore the original dump ordering. > > Yes I remember you mentioning that. > > If you're too busy I can try to spin this idea into a patch. That would be very much appreciated. Thanks Dave! -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-10 3:23 ` Herbert Xu @ 2008-09-10 3:38 ` David Miller 2008-09-10 4:01 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: David Miller @ 2008-09-10 3:38 UTC (permalink / raw) To: herbert; +Cc: timo.teras, netdev From: Herbert Xu <herbert@gondor.apana.org.au> Date: Wed, 10 Sep 2008 13:23:04 +1000 > On Tue, Sep 09, 2008 at 08:22:15PM -0700, David Miller wrote: > > > > > What could be done though is to remove x->all and restore the > > > original walking by hashing. This would eliminate a list node > > > from xfrm_state, and would also restore the original dump ordering. > > > > Yes I remember you mentioning that. > > > > If you're too busy I can try to spin this idea into a patch. > > That would be very much appreciated. Thanks Dave! No problem. It might be a little bit of a chore because this new walker design is intimately tied to the af_key non-atomic dump changes. ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-10 3:38 ` David Miller @ 2008-09-10 4:01 ` Herbert Xu 2008-09-10 4:06 ` David Miller 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-10 4:01 UTC (permalink / raw) To: David Miller; +Cc: timo.teras, netdev On Tue, Sep 09, 2008 at 08:38:08PM -0700, David Miller wrote: > > No problem. It might be a little bit of a chore because this new > walker design is intimately tied to the af_key non-atomic dump > changes. Actually I was mistaken as to how the original dump worked. I'd thought that it actually kept track of which bucket it was in and resumed from that bucket. However in reality it only had a global counter and would always start walking from the beginning up until the counted value. So it isn't as easy as just copying the old code across :) While it is possible to restore the old order and save a little bit of memory, I certainly don't think this is very urgent. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
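In other words, the old resume cookie was just a count of matching entries already emitted, and every continuation re-walks from the front to skip that many: correct in the absence of churn, but O(n) per continuation and O(n^2) for a full dump, the cost Timo cites later in the thread. A sketch of that style (illustrative):

#include <stdio.h>

/* Resume-by-count: each continuation re-walks from the front and skips
 * the entries already dumped.  The skip must count entries (in the
 * kernel, entries matching the requested proto), not index into an
 * array, because the set can change between calls. */
static int dump_chunk(const int *tab, int n, int *count, int budget)
{
	int i, skipped = 0, emitted = 0;

	for (i = 0; i < n; i++) {
		if (skipped++ < *count)		/* O(n) skip every call */
			continue;
		if (emitted == budget)
			return 1;		/* more left to dump    */
		printf("entry %d\n", tab[i]);
		emitted++;
		(*count)++;
	}
	return 0;				/* dump complete        */
}

int main(void)
{
	int tab[] = { 10, 20, 30, 40, 50 };
	int count = 0;

	while (dump_chunk(tab, 5, &count, 2))	/* 2 entries per pass */
		;
	return 0;
}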
* Re: xfrm_state locking regression... 2008-09-10 4:01 ` Herbert Xu @ 2008-09-10 4:06 ` David Miller 2008-09-10 4:22 ` Herbert Xu 2008-09-10 5:16 ` Timo Teräs 0 siblings, 2 replies; 95+ messages in thread From: David Miller @ 2008-09-10 4:06 UTC (permalink / raw) To: herbert; +Cc: timo.teras, netdev From: Herbert Xu <herbert@gondor.apana.org.au> Date: Wed, 10 Sep 2008 14:01:07 +1000 > On Tue, Sep 09, 2008 at 08:38:08PM -0700, David Miller wrote: > > > > No problem. It might be a little bit of a chore because this new > > walker design is intimately tied to the af_key non-atomic dump > > changes. > > Actually I was mistaken as to how the original dump worked. I'd > thought that it actually kept track of which bucket it was in and > resumed from that bucket. However in reality it only had a global > counter and would always start walking from the beginning up until > the counted value. So it isn't as easy as just copying the old > code across :) Only AF_KEY gave an error, and this ended the dump. This was one of Timo's goals, to make AF_KEY continue where it left off in subsequent dump calls done by the user, when we hit the socket limit. > While it is possible to restore the old order and save a little > bit of memory, I certainly don't think this is very urgent. Too late, I already implemented it :-) I'm about to give the following patch a test: ipsec: Restore hash based xfrm_state dumping. Get rid of ->all member of struct xfrm_state, and just use a hash iteration like we used before. This shrinks the size of struct xfrm_state, and also restores the dump ordering of 2.6.25 and previous. Signed-off-by: David S. Miller <davem@davemloft.net> diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 4bb9499..ba0e47c 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -120,7 +120,6 @@ extern struct mutex xfrm_cfg_mutex; /* Full description of state of transformer. */ struct xfrm_state { - struct list_head all; union { struct list_head gclist; struct hlist_node bydst; @@ -1247,6 +1246,7 @@ struct xfrm6_tunnel { struct xfrm_state_walk { struct xfrm_state *state; + unsigned int chain; int count; u8 proto; }; @@ -1283,9 +1283,10 @@ extern int xfrm_proc_init(void); static inline void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto) { - walk->proto = proto; walk->state = NULL; + walk->chain = 0; walk->count = 0; + walk->proto = proto; } extern int xfrm_state_walk(struct xfrm_state_walk *walk, diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index aaafcee..ccf8492 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -50,7 +50,6 @@ static DEFINE_SPINLOCK(xfrm_state_lock); * Main use is finding SA after policy selected tunnel or transport mode. * Also, it can be used by ah/esp icmp error handler to find offending SA. 
*/ -static LIST_HEAD(xfrm_state_all); static struct hlist_head *xfrm_state_bydst __read_mostly; static struct hlist_head *xfrm_state_bysrc __read_mostly; static struct hlist_head *xfrm_state_byspi __read_mostly; @@ -525,7 +524,6 @@ struct xfrm_state *xfrm_state_alloc(void) if (x) { atomic_set(&x->refcnt, 1); atomic_set(&x->tunnel_users, 0); - INIT_LIST_HEAD(&x->all); INIT_HLIST_NODE(&x->bydst); INIT_HLIST_NODE(&x->bysrc); INIT_HLIST_NODE(&x->byspi); @@ -566,7 +564,6 @@ int __xfrm_state_delete(struct xfrm_state *x) x->km.state = XFRM_STATE_DEAD; spin_lock(&xfrm_state_lock); x->lastused = xfrm_state_walk_ongoing; - list_del_rcu(&x->all); hlist_del(&x->bydst); hlist_del(&x->bysrc); if (x->id.spi) @@ -867,7 +864,6 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr, if (km_query(x, tmpl, pol) == 0) { x->km.state = XFRM_STATE_ACQ; - list_add_tail(&x->all, &xfrm_state_all); hlist_add_head(&x->bydst, xfrm_state_bydst+h); h = xfrm_src_hash(daddr, saddr, family); hlist_add_head(&x->bysrc, xfrm_state_bysrc+h); @@ -936,8 +932,6 @@ static void __xfrm_state_insert(struct xfrm_state *x) x->genid = ++xfrm_state_genid; - list_add_tail(&x->all, &xfrm_state_all); - h = xfrm_dst_hash(&x->id.daddr, &x->props.saddr, x->props.reqid, x->props.family); hlist_add_head(&x->bydst, xfrm_state_bydst+h); @@ -1065,7 +1059,6 @@ static struct xfrm_state *__find_acq_core(unsigned short family, u8 mode, u32 re xfrm_state_hold(x); x->timer.expires = jiffies + sysctl_xfrm_acq_expires*HZ; add_timer(&x->timer); - list_add_tail(&x->all, &xfrm_state_all); hlist_add_head(&x->bydst, xfrm_state_bydst+h); h = xfrm_src_hash(daddr, saddr, family); hlist_add_head(&x->bysrc, xfrm_state_bysrc+h); @@ -1563,7 +1556,7 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, void *data) { struct xfrm_state *old, *x, *last = NULL; - int err = 0; + int i, err = 0; if (walk->state == NULL && walk->count != 0) return 0; @@ -1571,24 +1564,29 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, old = x = walk->state; walk->state = NULL; spin_lock_bh(&xfrm_state_lock); - if (x == NULL) - x = list_first_entry(&xfrm_state_all, struct xfrm_state, all); - list_for_each_entry_from(x, &xfrm_state_all, all) { - if (x->km.state == XFRM_STATE_DEAD) - continue; - if (!xfrm_id_proto_match(x->id.proto, walk->proto)) - continue; - if (last) { - err = func(last, walk->count, data); - if (err) { - xfrm_state_hold(last); - walk->state = last; - xfrm_state_walk_ongoing++; - goto out; + for (i = walk->chain; i <= xfrm_state_hmask; i++) { + struct hlist_head *h = xfrm_state_bydst + i; + struct hlist_node *entry; + + if (!x) + x = hlist_entry(h->first, struct xfrm_state, bydst); + hlist_for_each_entry_from(x, entry, bydst) { + if (!xfrm_id_proto_match(x->id.proto, walk->proto)) + continue; + if (last) { + err = func(last, walk->count, data); + if (err) { + xfrm_state_hold(last); + walk->state = last; + walk->chain = i; + xfrm_state_walk_ongoing++; + goto out; + } } + last = x; + walk->count++; } - last = x; - walk->count++; + x = NULL; } if (walk->count == 0) { err = -ENOENT; ^ permalink raw reply related [flat|nested] 95+ messages in thread
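The resume cookie in this version is the (chain, state) pair: finish the saved chain from the saved node, then carry on through the remaining buckets. A userspace toy of the same two-level resume (toy hash table; as the replies below show, the cookie is only meaningful while the table is not resized):

#include <stdio.h>

#define NBUCKETS 4

struct node { struct node *next; int id; };

/* Resume cookie: which chain we stopped in, and where within it. */
struct walkpos { int chain; struct node *node; };

static int walk(struct node *tab[], struct walkpos *w, int budget)
{
	int i, emitted = 0;

	for (i = w->chain; i < NBUCKETS; i++) {
		struct node *n = w->node ? w->node : tab[i];

		w->node = NULL;
		for (; n; n = n->next) {
			if (emitted == budget) {
				w->chain = i;	/* park mid-chain */
				w->node = n;
				return 1;
			}
			printf("bucket %d: id %d\n", i, n->id);
			emitted++;
		}
	}
	return 0;
}

int main(void)
{
	struct node n1 = { NULL, 1 }, n2 = { &n1, 2 }, n3 = { NULL, 3 };
	struct node *tab[NBUCKETS] = { &n2, NULL, &n3, NULL };
	struct walkpos w = { 0, NULL };

	while (walk(tab, &w, 2))
		;
	return 0;
}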
* Re: xfrm_state locking regression... 2008-09-10 4:06 ` David Miller @ 2008-09-10 4:22 ` Herbert Xu 2008-09-10 4:24 ` David Miller 2008-09-10 5:16 ` Timo Teräs 1 sibling, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-10 4:22 UTC (permalink / raw) To: David Miller; +Cc: timo.teras, netdev On Tue, Sep 09, 2008 at 09:06:54PM -0700, David Miller wrote: > > Only AF_KEY gave an error, and this ended the dump. This was > one of Timo's goals, to make AF_KEY continue where it left > off in subsequent dump calls done by the user, when we hit the > socket limit. I see. > > While it is possible to restore the old order and save a little > > bit of memory, I certainly don't think this is very urgent. > > Too late, I already implemented it :-) Awesome :) Your patch is much simpler than I thought. > diff --git a/include/net/xfrm.h b/include/net/xfrm.h > index 4bb9499..ba0e47c 100644 > --- a/include/net/xfrm.h > +++ b/include/net/xfrm.h > @@ -120,7 +120,6 @@ extern struct mutex xfrm_cfg_mutex; > /* Full description of state of transformer. */ > struct xfrm_state > { > - struct list_head all; > union { > struct list_head gclist; > struct hlist_node bydst; This union now needs to move across to bysrc or byspi since bydst needs to be preserved for walking even after a node is added to the GC. > @@ -566,7 +564,6 @@ int __xfrm_state_delete(struct xfrm_state *x) > x->km.state = XFRM_STATE_DEAD; > spin_lock(&xfrm_state_lock); > x->lastused = xfrm_state_walk_ongoing; > - list_del_rcu(&x->all); > hlist_del(&x->bydst); We need hlist_del_rcu on bydst for the same reason. > @@ -1571,24 +1564,29 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, > old = x = walk->state; > walk->state = NULL; > spin_lock_bh(&xfrm_state_lock); > - if (x == NULL) > - x = list_first_entry(&xfrm_state_all, struct xfrm_state, all); > - list_for_each_entry_from(x, &xfrm_state_all, all) { > - if (x->km.state == XFRM_STATE_DEAD) > - continue; > - if (!xfrm_id_proto_match(x->id.proto, walk->proto)) > - continue; > - if (last) { > - err = func(last, walk->count, data); > - if (err) { > - xfrm_state_hold(last); > - walk->state = last; > - xfrm_state_walk_ongoing++; > - goto out; > + for (i = walk->chain; i <= xfrm_state_hmask; i++) { > + struct hlist_head *h = xfrm_state_bydst + i; > + struct hlist_node *entry; > + > + if (!x) > + x = hlist_entry(h->first, struct xfrm_state, bydst); > + hlist_for_each_entry_from(x, entry, bydst) { > + if (!xfrm_id_proto_match(x->id.proto, walk->proto)) > + continue; > + if (last) { > + err = func(last, walk->count, data); > + if (err) { This is a bug in the original patch. When we get an error on the very last node we'll lose that node instead of dumping it later. I wonder if we could just get rid of last altogether... Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
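The union point, made concrete: with { gclist, bydst } sharing storage, queueing a dead state for GC scribbles over the bydst linkage that a parked walker may still be standing on. A small demonstration that storing through one union member clobbers the other:

#include <stdio.h>

struct links { void *next, *prev; };

struct state {
	union {
		struct links gclist;	/* written when queued for GC    */
		struct links bydst;	/* ...still in use by the walker */
	};
};

int main(void)
{
	struct state s;
	int anchor;

	s.bydst.next = &anchor;		/* walker-visible chain pointer */
	s.gclist.next = NULL;		/* state queued for GC          */
	printf("bydst.next is now %p\n", s.bydst.next);	/* clobbered */
	return 0;
}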
* Re: xfrm_state locking regression... 2008-09-10 4:22 ` Herbert Xu @ 2008-09-10 4:24 ` David Miller 2008-09-10 4:48 ` David Miller 0 siblings, 1 reply; 95+ messages in thread From: David Miller @ 2008-09-10 4:24 UTC (permalink / raw) To: herbert; +Cc: timo.teras, netdev From: Herbert Xu <herbert@gondor.apana.org.au> Date: Wed, 10 Sep 2008 14:22:47 +1000 > On Tue, Sep 09, 2008 at 09:06:54PM -0700, David Miller wrote: > > @@ -120,7 +120,6 @@ extern struct mutex xfrm_cfg_mutex; > > /* Full description of state of transformer. */ > > struct xfrm_state > > { > > - struct list_head all; > > union { > > struct list_head gclist; > > struct hlist_node bydst; > > This union now needs to move across to bysrc or byspi since bydst > needs to be preserved for walking even after a node is added to > the GC. That would explain the crash I just got :) > > @@ -566,7 +564,6 @@ int __xfrm_state_delete(struct xfrm_state *x) > > x->km.state = XFRM_STATE_DEAD; > > spin_lock(&xfrm_state_lock); > > x->lastused = xfrm_state_walk_ongoing; > > - list_del_rcu(&x->all); > > hlist_del(&x->bydst); > > We need hlist_del_rcu on bydst for the same reason. Ok, gotcha. > This is a bug in the original patch. When we get an error on > the very last node we'll lose that node instead of dumping it > later. I wonder if we could just get rid of last altogether... I'll take a look at this too after I resolve and test the above issues. Thanks. ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-10 4:24 ` David Miller @ 2008-09-10 4:48 ` David Miller 2008-09-10 4:52 ` David Miller 2008-09-10 4:53 ` Herbert Xu 0 siblings, 2 replies; 95+ messages in thread From: David Miller @ 2008-09-10 4:48 UTC (permalink / raw) To: herbert; +Cc: timo.teras, netdev From: David Miller <davem@davemloft.net> Date: Tue, 09 Sep 2008 21:24:26 -0700 (PDT) > From: Herbert Xu <herbert@gondor.apana.org.au> > Date: Wed, 10 Sep 2008 14:22:47 +1000 > > > On Tue, Sep 09, 2008 at 09:06:54PM -0700, David Miller wrote: > > > @@ -120,7 +120,6 @@ extern struct mutex xfrm_cfg_mutex; > > > /* Full description of state of transformer. */ > > > struct xfrm_state > > > { > > > - struct list_head all; > > > union { > > > struct list_head gclist; > > > struct hlist_node bydst; > > > > This union now needs to move across to bysrc or byspi since bydst > > needs to be preserved for walking even after a node is added to > > the GC. > > That would explain the crash I just got :) No, it doesn't explain the crash/hang. And it happens even without my patch. I get a hang in del_timer_sync() and it's from a workqueue so I suspect your patch :-) I can't see what does the list delete from the GC leftover list? Entries seem to stay on there forever. We even kfree() the object. The list iteration is: list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) { if ((long)(x->lastused - completed) > 0) break; xfrm_state_gc_destroy(x); } And xfrm_state_gc_destroy() doesn't do a list_del(&x->gclist) or similar. ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-10 4:48 ` David Miller @ 2008-09-10 4:52 ` David Miller 2008-09-10 4:53 ` Herbert Xu 1 sibling, 0 replies; 95+ messages in thread From: David Miller @ 2008-09-10 4:52 UTC (permalink / raw) To: herbert; +Cc: timo.teras, netdev From: David Miller <davem@davemloft.net> Date: Tue, 09 Sep 2008 21:48:29 -0700 (PDT) > The list iteration is: > > list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) { > if ((long)(x->lastused - completed) > 0) > break; > xfrm_state_gc_destroy(x); > } > > And xfrm_state_gc_destroy() doesn't do a list_del(&x->gclist) or similar. I'm now testing the following fix: diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index aaafcee..abbe270 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -423,6 +423,7 @@ static void xfrm_state_gc_task(struct work_struct *data) list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) { if ((long)(x->lastused - completed) > 0) break; + list_del(&x->gclist); xfrm_state_gc_destroy(x); } ^ permalink raw reply related [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-10 4:48 ` David Miller 2008-09-10 4:52 ` David Miller @ 2008-09-10 4:53 ` Herbert Xu 2008-09-10 5:21 ` David Miller 1 sibling, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-10 4:53 UTC (permalink / raw) To: David Miller; +Cc: timo.teras, netdev On Tue, Sep 09, 2008 at 09:48:29PM -0700, David Miller wrote: > > I can't see what does the list delete from the GC leftover list? Entries > seem to stay on there forever. We even kfree() the object. Doh! Yes there's supposed to be a list_del in there. Here's a fixed patch: ipsec: Use RCU-like construct for saved state within a walk Now that we save states within a walk we need synchronisation so that the list the saved state is on doesn't disappear from under us. As it stands this is done by keeping the state on the list which is bad because it gets in the way of the management of the state life-cycle. An alternative is to make our own pseudo-RCU system where we use counters to indicate which state can't be freed immediately as it may be referenced by an ongoing walk when that resumes. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 2933d74..4bb9499 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -120,9 +120,11 @@ extern struct mutex xfrm_cfg_mutex; /* Full description of state of transformer. */ struct xfrm_state { - /* Note: bydst is re-used during gc */ struct list_head all; - struct hlist_node bydst; + union { + struct list_head gclist; + struct hlist_node bydst; + }; struct hlist_node bysrc; struct hlist_node byspi; @@ -1286,16 +1288,9 @@ static inline void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto) walk->count = 0; } -static inline void xfrm_state_walk_done(struct xfrm_state_walk *walk) -{ - if (walk->state != NULL) { - xfrm_state_put(walk->state); - walk->state = NULL; - } -} - extern int xfrm_state_walk(struct xfrm_state_walk *walk, int (*func)(struct xfrm_state *, int, void*), void *); +extern void xfrm_state_walk_done(struct xfrm_state_walk *walk); extern struct xfrm_state *xfrm_state_alloc(void); extern struct xfrm_state *xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr, struct flowi *fl, struct xfrm_tmpl *tmpl, diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 7bd62f6..fe84a46 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -59,6 +59,11 @@ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024; static unsigned int xfrm_state_num; static unsigned int xfrm_state_genid; +/* Counter indicating ongoing walk, protected by xfrm_state_lock. */ +static unsigned long xfrm_state_walk_ongoing; +/* Counter indicating walk completion, protected by xfrm_cfg_mutex. 
*/ +static unsigned long xfrm_state_walk_completed; + static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned int family); static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo); @@ -191,7 +196,8 @@ static DEFINE_RWLOCK(xfrm_state_afinfo_lock); static struct xfrm_state_afinfo *xfrm_state_afinfo[NPROTO]; static struct work_struct xfrm_state_gc_work; -static HLIST_HEAD(xfrm_state_gc_list); +static LIST_HEAD(xfrm_state_gc_leftovers); +static LIST_HEAD(xfrm_state_gc_list); static DEFINE_SPINLOCK(xfrm_state_gc_lock); int __xfrm_state_delete(struct xfrm_state *x); @@ -403,17 +409,23 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x) static void xfrm_state_gc_task(struct work_struct *data) { - struct xfrm_state *x; - struct hlist_node *entry, *tmp; - struct hlist_head gc_list; + struct xfrm_state *x, *tmp; + unsigned long completed; + mutex_lock(&xfrm_cfg_mutex); spin_lock_bh(&xfrm_state_gc_lock); - gc_list.first = xfrm_state_gc_list.first; - INIT_HLIST_HEAD(&xfrm_state_gc_list); + list_splice_tail_init(&xfrm_state_gc_list, &xfrm_state_gc_leftovers); spin_unlock_bh(&xfrm_state_gc_lock); - hlist_for_each_entry_safe(x, entry, tmp, &gc_list, bydst) + completed = xfrm_state_walk_completed; + mutex_unlock(&xfrm_cfg_mutex); + + list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) { + if ((long)(x->lastused - completed) > 0) + break; + list_del(&x->gclist); xfrm_state_gc_destroy(x); + } wake_up(&km_waitq); } @@ -540,12 +552,8 @@ void __xfrm_state_destroy(struct xfrm_state *x) { WARN_ON(x->km.state != XFRM_STATE_DEAD); - spin_lock_bh(&xfrm_state_lock); - list_del(&x->all); - spin_unlock_bh(&xfrm_state_lock); - spin_lock_bh(&xfrm_state_gc_lock); - hlist_add_head(&x->bydst, &xfrm_state_gc_list); + list_add_tail(&x->gclist, &xfrm_state_gc_list); spin_unlock_bh(&xfrm_state_gc_lock); schedule_work(&xfrm_state_gc_work); } @@ -558,6 +566,8 @@ int __xfrm_state_delete(struct xfrm_state *x) if (x->km.state != XFRM_STATE_DEAD) { x->km.state = XFRM_STATE_DEAD; spin_lock(&xfrm_state_lock); + x->lastused = xfrm_state_walk_ongoing; + list_del_rcu(&x->all); hlist_del(&x->bydst); hlist_del(&x->bysrc); if (x->id.spi) @@ -1572,6 +1582,7 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, if (err) { xfrm_state_hold(last); walk->state = last; + xfrm_state_walk_ongoing++; goto out; } } @@ -1586,12 +1597,28 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, err = func(last, 0, data); out: spin_unlock_bh(&xfrm_state_lock); - if (old != NULL) + if (old != NULL) { xfrm_state_put(old); + xfrm_state_walk_completed++; + if (!list_empty(&xfrm_state_gc_leftovers)) + schedule_work(&xfrm_state_gc_work); + } return err; } EXPORT_SYMBOL(xfrm_state_walk); +void xfrm_state_walk_done(struct xfrm_state_walk *walk) +{ + if (walk->state != NULL) { + xfrm_state_put(walk->state); + walk->state = NULL; + xfrm_state_walk_completed++; + if (!list_empty(&xfrm_state_gc_leftovers)) + schedule_work(&xfrm_state_gc_work); + } +} +EXPORT_SYMBOL(xfrm_state_walk_done); + void xfrm_replay_notify(struct xfrm_state *x, int event) { Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply related [flat|nested] 95+ messages in thread
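The pseudo-RCU scheme in miniature: a state deleted while some walk is suspended records the current ongoing generation, and the GC refuses to free it until the completed counter has caught up — the shape of an RCU grace period, with walk completions standing in for quiescent states. A userspace toy (illustrative names):

#include <stdio.h>

/* A state deleted while walk N was suspended must survive until every
 * walk up to N has finished. */
static unsigned long walk_ongoing;	/* bumped when a walk suspends */
static unsigned long walk_completed;	/* bumped when a walk finishes */

struct state { unsigned long lastused; int freed; };

static void state_delete(struct state *x)
{
	x->lastused = walk_ongoing;	/* newest walk that may see us */
}

static void gc(struct state *x)
{
	if (x->freed)
		return;
	if ((long)(x->lastused - walk_completed) <= 0) {
		x->freed = 1;
		printf("freed (lastused=%lu, completed=%lu)\n",
		       x->lastused, walk_completed);
	} else {
		printf("deferred (lastused=%lu, completed=%lu)\n",
		       x->lastused, walk_completed);
	}
}

int main(void)
{
	struct state x = { 0, 0 };

	walk_ongoing++;		/* dump #1 suspends mid-walk   */
	state_delete(&x);	/* state goes away under it    */
	gc(&x);			/* deferred: #1 may revisit x  */
	walk_completed++;	/* dump #1 finishes            */
	gc(&x);			/* now safe to free            */
	return 0;
}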
* Re: xfrm_state locking regression... 2008-09-10 4:53 ` Herbert Xu @ 2008-09-10 5:21 ` David Miller 0 siblings, 0 replies; 95+ messages in thread From: David Miller @ 2008-09-10 5:21 UTC (permalink / raw) To: herbert; +Cc: timo.teras, netdev From: Herbert Xu <herbert@gondor.apana.org.au> Date: Wed, 10 Sep 2008 14:53:50 +1000 > On Tue, Sep 09, 2008 at 09:48:29PM -0700, David Miller wrote: > > > > I can't see what does the list delete from the GC leftover list? Entries > > seem to stay on there forever. We even kfree() the object. > > Doh! Yes there's supposed to be a list_del in there. Your original change was already in net-next-2.6 so I committed the one-liner fix. Thanks. ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-10 4:06 ` David Miller 2008-09-10 4:22 ` Herbert Xu @ 2008-09-10 5:16 ` Timo Teräs 2008-09-10 5:23 ` David Miller 1 sibling, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-10 5:16 UTC (permalink / raw) To: David Miller; +Cc: herbert, netdev David Miller wrote: > From: Herbert Xu <herbert@gondor.apana.org.au> > Date: Wed, 10 Sep 2008 14:01:07 +1000 > >> On Tue, Sep 09, 2008 at 08:38:08PM -0700, David Miller wrote: >>> No problem. It might be a little bit of a chore because this new >>> walker design is intimately tied to the af_key non-atomic dump >>> changes. >> Actually I was mistaken as to how the original dump worked. I'd >> thought that it actually kept track of which bucket it was in and >> resumed from that bucket. However in reality it only had a global >> counter and would always start walking from the beginning up until >> the counted value. So it isn't as easy as just copying the old >> code across :) > > Only AF_KEY gave an error, and this ended the dump. This was > one of Timo's goals, to make AF_KEY continue where it left > off in subsequent dump calls done by the user, when we hit the > socket limit. Yes, that was one of the goals. The other goal was to speed up dumping in netlink from O(n^2) to O(n), and to fix the problem that netlink might fail to dump some entries (if an entry was removed from the beginning of the hash while dumping was ongoing). > ipsec: Restore hash based xfrm_state dumping. > > Get rid of ->all member of struct xfrm_state, and just use a hash > iteration like we used before. > > This shrinks the size of struct xfrm_state, and also restores the > dump ordering of 2.6.25 and previous. I think bad things will happen if the hash gets resized between xfrm_state_walk() calls. Since it assumes that the walk->state entry has remained in walk->chain bucket. The dumping order shouldn't really make any difference. But reducing struct xfrm_state is definitely good. The only downside is that when an entry is inserted to the beginning of the hash while dump is ongoing, it won't be dumped at all. But that is not really a problem since you get a notification about that entry separately. Cheers, Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
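Concretely: bucket membership is hash & mask, so changing the mask scatters every chain, and a saved (chain, entry) cookie points into a layout that no longer exists. A few hash values under two masks (illustrative):

#include <stdio.h>

static unsigned int bucket(unsigned int hash, unsigned int mask)
{
	return hash & mask;
}

int main(void)
{
	unsigned int hashes[] = { 3, 9, 12, 17 };
	unsigned int i;

	for (i = 0; i < 4; i++)
		printf("hash %2u: bucket %2u (mask 7) -> %2u (mask 15)\n",
		       hashes[i], bucket(hashes[i], 7), bucket(hashes[i], 15));

	/* hash 9 moves from chain 1 to chain 9: a walk parked in chain 1
	 * before the resize will never see it again, while a walk parked
	 * in a later chain may see entries twice. */
	return 0;
}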
* Re: xfrm_state locking regression... 2008-09-10 5:16 ` Timo Teräs @ 2008-09-10 5:23 ` David Miller 2008-09-10 5:46 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: David Miller @ 2008-09-10 5:23 UTC (permalink / raw) To: timo.teras; +Cc: herbert, netdev From: Timo Teräs <timo.teras@iki.fi> Date: Wed, 10 Sep 2008 08:16:12 +0300 > David Miller wrote: > > ipsec: Restore hash based xfrm_state dumping. > > > > Get rid of ->all member of struct xfrm_state, and just use a hash > > iteration like we used before. > > > > This shrinks the size of struct xfrm_state, and also restores the > > dump ordering of 2.6.25 and previous. > > I think bad things will happen if the hash gets resized between > xfrm_state_walk() calls. Since it assumes that the walk->state > entry has remained in walk->chain bucket. > > The dumping order shouldn't really make any difference. But > reducing struct xfrm_state is definitely good. The only downside > is that when an entry is inserted to the beginning of the hash > while dump is ongoing, it won't be dumped at all. But that is > not really a problem since you get a notification about that > entry separately. That's a good point, the hash resizing case. If the hash grows, we'll see some entries again. If it shrinks, we might skip some instead, which is very undesirable. Actually, thinking about it some more, both possibilities (skipping and dups) seem to be possible in both situations (growing and shrinking). Ehm... I think we should put this patch and idea aside for a little bit. :-) ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-10 5:23 ` David Miller @ 2008-09-10 5:46 ` Herbert Xu 0 siblings, 0 replies; 95+ messages in thread From: Herbert Xu @ 2008-09-10 5:46 UTC (permalink / raw) To: David Miller; +Cc: timo.teras, netdev On Tue, Sep 09, 2008 at 10:23:30PM -0700, David Miller wrote: > > That's a good point, the hash resizing case. > > If the hash grows, we'll see some entries again. > > If it shrinks, we might skip some instead, which is very undesirable. > > Actually, thinking about it some more, both possibilities (skipping > and dups) seem to be possible in both situations (growing and > shrinking). > > Ehm... I think we should put this patch and idea aside for a little bit. > :-) Yeah the hash resizing thing really kills this approach. In fact it seems that the original walker was written before resizing was added. I definitely agree that ordering isn't as important as showing the entries reliably. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-03 5:07 ` Timo Teräs 2008-09-03 5:23 ` Herbert Xu 2008-09-03 5:39 ` Herbert Xu @ 2008-09-03 6:10 ` David Miller 2 siblings, 0 replies; 95+ messages in thread From: David Miller @ 2008-09-03 6:10 UTC (permalink / raw) To: timo.teras; +Cc: herbert, netdev From: Timo Teräs <timo.teras@iki.fi> Date: Wed, 03 Sep 2008 08:07:33 +0300 > Also alternatively the xfrm_state_all could be protected by a > different lock than xfrm_state_lock. Before your changes the state destroy was protected by no central locks at all. There was no need: once the reference count goes to zero, there are no external references to this object. But this is exactly what you added, an external reference that is not reference counted. This is yet another reason all of Herbert's objections to your location for ->all list add and removal are sound. So we now have to take a centralized lock twice when getting rid of an XFRM state object, which is potentially a scalability regression. ^ permalink raw reply [flat|nested] 95+ messages in thread
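The structural point: a plain refcount lets the final put free the object without any shared locks, since nothing else can reach it; the global all-states list reintroduces a reachable, uncounted pointer, so the zero-refcount path must now take the shared lock to sever it. A userspace toy of that obligation (illustrative; atomic refcounting elided):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* The global 'all' list holds a pointer to every object but takes no
 * reference of its own, so the last obj_put() can no longer just
 * free(): it must first take the shared list lock and unlink. */

struct obj { int refcnt; struct obj *next, **pprev; };

static pthread_mutex_t all_lock = PTHREAD_MUTEX_INITIALIZER;
static struct obj *all_list;

static struct obj *obj_new(void)
{
	struct obj *o = calloc(1, sizeof(*o));

	o->refcnt = 1;
	pthread_mutex_lock(&all_lock);
	o->next = all_list;
	o->pprev = &all_list;
	if (all_list)
		all_list->pprev = &o->next;
	all_list = o;
	pthread_mutex_unlock(&all_lock);
	return o;
}

static void obj_put(struct obj *o)
{
	if (--o->refcnt)
		return;
	pthread_mutex_lock(&all_lock);	/* the added centralized cost */
	*o->pprev = o->next;
	if (o->next)
		o->next->pprev = o->pprev;
	pthread_mutex_unlock(&all_lock);
	free(o);
}

int main(void)
{
	struct obj *o = obj_new();

	obj_put(o);	/* refcount hits zero: unlink under lock, free */
	printf("final put had to take the shared list lock\n");
	return 0;
}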