* xfrm_state locking regression...
From: David Miller @ 2008-09-03 2:51 UTC (permalink / raw)
To: netdev; +Cc: timo.teras
I've been hitting a lockup on my machine, and the way to hit it is
fairly straightforward.
1) Set up an IPSEC connection using openswan.
2) Use something like XAUTH, where openswan is not able to refresh the
   IPSEC policy when it times out, because it has no way to prompt the
   user for the XAUTH password again.
3) After the timeout, try to send a packet over the IPSEC connection;
   a simple DNS lookup over it does the trick for me.
4) Shut down the IPSEC daemon.
At this point, on SMP, we'll wedge in __xfrm_state_destroy() trying
to obtain xfrm_state_lock.
This problem seems to have been introduced by:
commit 4c563f7669c10a12354b72b518c2287ffc6ebfb3
Author: Timo Teras <timo.teras@iki.fi>
Date: Thu Feb 28 21:31:08 2008 -0800
[XFRM]: Speed up xfrm_policy and xfrm_state walking
(BTW, I want to point out how hesitant I was at the time to apply
this patch. It touches a lot of delicate areas...)
This is what added the xfrm_state_lock grabbing in __xfrm_state_destroy().
Such new locking cannot work properly without also updating a lot of
other code in net/xfrm/xfrm_state.c.
The new locking means that it is absolutely illegal to do an
xfrm_state_put() while holding xfrm_state_lock. Yet we do this
all over the place!
Asynchronously, another context can drop the reference count down to
one; our xfrm_state_put() call then drops it to zero, calls into
__xfrm_state_destroy(), and tries to take xfrm_state_lock (again),
causing the deadlock.
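To make the failure mode concrete, here is a minimal sketch of the
pattern (illustrative only, not a literal excerpt from
net/xfrm/xfrm_state.c):

	spin_lock_bh(&xfrm_state_lock);
	/* ... table manipulation ... */
	xfrm_state_put(x);	/* suppose this drops the last reference:  */
				/*   -> __xfrm_state_destroy(x)            */
				/*      -> spin_lock_bh(&xfrm_state_lock)  */
				/* we already hold the lock, so on SMP     */
				/* this spins forever                      */
	spin_unlock_bh(&xfrm_state_lock);	/* never reached */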
I'm not exactly sure how to fix this currently.
I guess we can add a local pointer variable to the relevant functions
"struct xfrm_state *to_put", which is initialized to NULL, and we
replace the put calls with assignments to to_put. In the "out" path,
after we drop the xfrm_state_lock, we do a put if "to_put" is
non-NULL.
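In outline, the transformation looks like this (a sketch of the
pattern only; "error_case" is a placeholder, the real changes are in
the diff below):

	struct xfrm_state *to_put = NULL;

	spin_lock_bh(&xfrm_state_lock);
	/* ... */
	if (error_case) {
		to_put = x;	/* was: xfrm_state_put(x) under the lock */
		goto out;
	}
	/* ... */
out:
	spin_unlock_bh(&xfrm_state_lock);
	if (to_put)
		xfrm_state_put(to_put);	/* safe: lock no longer held */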
Something like the patch below, but frankly this is some ugly stuff.
I double checked xfrm_policy.c, since it has the potential to have
similar issues due to the above changeset, but it appears to be
OK as best I can tell.
ipsec: Fix deadlock in xfrm_state management.
Ever since commit 4c563f7669c10a12354b72b518c2287ffc6ebfb3
("[XFRM]: Speed up xfrm_policy and xfrm_state walking") it is
illegal to call __xfrm_state_destroy (and thus xfrm_state_put())
with xfrm_state_lock held. If we do, we'll deadlock since we
have the lock already and __xfrm_state_destroy() tries to take
it again.
Fix this by pushing the xfrm_state_put() calls after the lock
is dropped.
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 4c6914e..98f20cc 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -780,11 +780,13 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
 {
 	unsigned int h;
 	struct hlist_node *entry;
-	struct xfrm_state *x, *x0;
+	struct xfrm_state *x, *x0, *to_put;
 	int acquire_in_progress = 0;
 	int error = 0;
 	struct xfrm_state *best = NULL;
 
+	to_put = NULL;
+
 	spin_lock_bh(&xfrm_state_lock);
 	h = xfrm_dst_hash(daddr, saddr, tmpl->reqid, family);
 	hlist_for_each_entry(x, entry, xfrm_state_bydst+h, bydst) {
@@ -833,7 +835,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
 		if (tmpl->id.spi &&
 		    (x0 = __xfrm_state_lookup(daddr, tmpl->id.spi,
 					      tmpl->id.proto, family)) != NULL) {
-			xfrm_state_put(x0);
+			to_put = x0;
 			error = -EEXIST;
 			goto out;
 		}
@@ -849,7 +851,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
 		error = security_xfrm_state_alloc_acquire(x, pol->security, fl->secid);
 		if (error) {
 			x->km.state = XFRM_STATE_DEAD;
-			xfrm_state_put(x);
+			to_put = x;
 			x = NULL;
 			goto out;
 		}
@@ -870,7 +872,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
 			xfrm_hash_grow_check(x->bydst.next != NULL);
 		} else {
 			x->km.state = XFRM_STATE_DEAD;
-			xfrm_state_put(x);
+			to_put = x;
 			x = NULL;
 			error = -ESRCH;
 		}
@@ -881,6 +883,8 @@ out:
 	else
 		*err = acquire_in_progress ? -EAGAIN : error;
 	spin_unlock_bh(&xfrm_state_lock);
+	if (to_put)
+		xfrm_state_put(to_put);
 	return x;
 }
 
@@ -1067,18 +1071,20 @@ static struct xfrm_state *__xfrm_find_acq_byseq(u32 seq);
 
 int xfrm_state_add(struct xfrm_state *x)
 {
-	struct xfrm_state *x1;
+	struct xfrm_state *x1, *to_put;
 	int family;
 	int err;
 	int use_spi = xfrm_id_proto_match(x->id.proto, IPSEC_PROTO_ANY);
 
 	family = x->props.family;
 
+	to_put = NULL;
+
 	spin_lock_bh(&xfrm_state_lock);
 
 	x1 = __xfrm_state_locate(x, use_spi, family);
 	if (x1) {
-		xfrm_state_put(x1);
+		to_put = x1;
 		x1 = NULL;
 		err = -EEXIST;
 		goto out;
@@ -1088,7 +1094,7 @@ int xfrm_state_add(struct xfrm_state *x)
 		x1 = __xfrm_find_acq_byseq(x->km.seq);
 		if (x1 && ((x1->id.proto != x->id.proto) ||
 		    xfrm_addr_cmp(&x1->id.daddr, &x->id.daddr, family))) {
-			xfrm_state_put(x1);
+			to_put = x1;
 			x1 = NULL;
 		}
 	}
@@ -1110,6 +1116,9 @@ out:
 		xfrm_state_put(x1);
 	}
 
+	if (to_put)
+		xfrm_state_put(to_put);
+
 	return err;
 }
 EXPORT_SYMBOL(xfrm_state_add);
@@ -1269,10 +1278,12 @@ EXPORT_SYMBOL(xfrm_state_migrate);
 
 int xfrm_state_update(struct xfrm_state *x)
 {
-	struct xfrm_state *x1;
+	struct xfrm_state *x1, *to_put;
 	int err;
 	int use_spi = xfrm_id_proto_match(x->id.proto, IPSEC_PROTO_ANY);
 
+	to_put = NULL;
+
 	spin_lock_bh(&xfrm_state_lock);
 	x1 = __xfrm_state_locate(x, use_spi, x->props.family);
 
@@ -1281,7 +1292,7 @@ int xfrm_state_update(struct xfrm_state *x)
 		goto out;
 
 	if (xfrm_state_kern(x1)) {
-		xfrm_state_put(x1);
+		to_put = x1;
 		err = -EEXIST;
 		goto out;
 	}
@@ -1295,6 +1306,9 @@ int xfrm_state_update(struct xfrm_state *x)
 out:
 	spin_unlock_bh(&xfrm_state_lock);
 
+	if (to_put)
+		xfrm_state_put(to_put);
+
 	if (err)
 		return err;
@@ -1483,11 +1497,13 @@ EXPORT_SYMBOL(xfrm_get_acqseq);
 int xfrm_alloc_spi(struct xfrm_state *x, u32 low, u32 high)
 {
 	unsigned int h;
-	struct xfrm_state *x0;
+	struct xfrm_state *x0, *to_put;
 	int err = -ENOENT;
 	__be32 minspi = htonl(low);
 	__be32 maxspi = htonl(high);
 
+	to_put = NULL;
+
 	spin_lock_bh(&x->lock);
 
 	if (x->km.state == XFRM_STATE_DEAD)
 		goto unlock;
@@ -1501,7 +1517,7 @@ int xfrm_alloc_spi(struct xfrm_state *x, u32 low, u32 high)
 	if (minspi == maxspi) {
 		x0 = xfrm_state_lookup(&x->id.daddr, minspi, x->id.proto, x->props.family);
 		if (x0) {
-			xfrm_state_put(x0);
+			to_put = x0;
 			goto unlock;
 		}
 		x->id.spi = minspi;
@@ -1514,7 +1530,7 @@ int xfrm_alloc_spi(struct xfrm_state *x, u32 low, u32 high)
 				x->id.spi = htonl(spi);
 				break;
 			}
-			xfrm_state_put(x0);
+			to_put = x0;
 		}
 	}
 	if (x->id.spi) {
@@ -1529,6 +1545,9 @@ int xfrm_alloc_spi(struct xfrm_state *x, u32 low, u32 high)
 unlock:
 	spin_unlock_bh(&x->lock);
 
+	if (to_put)
+		xfrm_state_put(to_put);
+
 	return err;
 }
 EXPORT_SYMBOL(xfrm_alloc_spi);
* Re: xfrm_state locking regression...
From: David Miller @ 2008-09-03 3:00 UTC (permalink / raw)
To: netdev; +Cc: timo.teras

From: David Miller <davem@davemloft.net>
Date: Tue, 02 Sep 2008 19:51:48 -0700 (PDT)

> @@ -1483,11 +1497,13 @@ EXPORT_SYMBOL(xfrm_get_acqseq);
> int xfrm_alloc_spi(struct xfrm_state *x, u32 low, u32 high)
> {
 ...
> @@ -1514,7 +1530,7 @@ int xfrm_alloc_spi(struct xfrm_state *x, u32 low, u32 high)
> 				x->id.spi = htonl(spi);
> 				break;
> 			}
> -			xfrm_state_put(x0);
> +			to_put = x0;
> 		}
> 	}
> 	if (x->id.spi) {

This part of my patch is buggy; we loop here, so we would need to
remember more state in order to do the puts correctly.

But luckily this function doesn't need the to_put logic at all: it
grabs x->lock rather than xfrm_state_lock, so all of the changes to
this function can simply be removed :-)

I'm going to do some testing on this and commit it unless more
problems are found.

ipsec: Fix deadlock in xfrm_state management.

Ever since commit 4c563f7669c10a12354b72b518c2287ffc6ebfb3
("[XFRM]: Speed up xfrm_policy and xfrm_state walking") it is
illegal to call __xfrm_state_destroy (and thus xfrm_state_put())
with xfrm_state_lock held.  If we do, we'll deadlock since we
have the lock already and __xfrm_state_destroy() tries to take
it again.

Fix this by pushing the xfrm_state_put() calls after the lock
is dropped.

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 4c6914e..7bd62f6 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -780,11 +780,13 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
 {
 	unsigned int h;
 	struct hlist_node *entry;
-	struct xfrm_state *x, *x0;
+	struct xfrm_state *x, *x0, *to_put;
 	int acquire_in_progress = 0;
 	int error = 0;
 	struct xfrm_state *best = NULL;
 
+	to_put = NULL;
+
 	spin_lock_bh(&xfrm_state_lock);
 	h = xfrm_dst_hash(daddr, saddr, tmpl->reqid, family);
 	hlist_for_each_entry(x, entry, xfrm_state_bydst+h, bydst) {
@@ -833,7 +835,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
 		if (tmpl->id.spi &&
 		    (x0 = __xfrm_state_lookup(daddr, tmpl->id.spi,
 					      tmpl->id.proto, family)) != NULL) {
-			xfrm_state_put(x0);
+			to_put = x0;
 			error = -EEXIST;
 			goto out;
 		}
@@ -849,7 +851,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
 		error = security_xfrm_state_alloc_acquire(x, pol->security, fl->secid);
 		if (error) {
 			x->km.state = XFRM_STATE_DEAD;
-			xfrm_state_put(x);
+			to_put = x;
 			x = NULL;
 			goto out;
 		}
@@ -870,7 +872,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
 			xfrm_hash_grow_check(x->bydst.next != NULL);
 		} else {
 			x->km.state = XFRM_STATE_DEAD;
-			xfrm_state_put(x);
+			to_put = x;
 			x = NULL;
 			error = -ESRCH;
 		}
@@ -881,6 +883,8 @@ out:
 	else
 		*err = acquire_in_progress ? -EAGAIN : error;
 	spin_unlock_bh(&xfrm_state_lock);
+	if (to_put)
+		xfrm_state_put(to_put);
 	return x;
 }
 
@@ -1067,18 +1071,20 @@ static struct xfrm_state *__xfrm_find_acq_byseq(u32 seq);
 
 int xfrm_state_add(struct xfrm_state *x)
 {
-	struct xfrm_state *x1;
+	struct xfrm_state *x1, *to_put;
 	int family;
 	int err;
 	int use_spi = xfrm_id_proto_match(x->id.proto, IPSEC_PROTO_ANY);
 
 	family = x->props.family;
 
+	to_put = NULL;
+
 	spin_lock_bh(&xfrm_state_lock);
 
 	x1 = __xfrm_state_locate(x, use_spi, family);
 	if (x1) {
-		xfrm_state_put(x1);
+		to_put = x1;
 		x1 = NULL;
 		err = -EEXIST;
 		goto out;
@@ -1088,7 +1094,7 @@ int xfrm_state_add(struct xfrm_state *x)
 		x1 = __xfrm_find_acq_byseq(x->km.seq);
 		if (x1 && ((x1->id.proto != x->id.proto) ||
 		    xfrm_addr_cmp(&x1->id.daddr, &x->id.daddr, family))) {
-			xfrm_state_put(x1);
+			to_put = x1;
 			x1 = NULL;
 		}
 	}
@@ -1110,6 +1116,9 @@ out:
 		xfrm_state_put(x1);
 	}
 
+	if (to_put)
+		xfrm_state_put(to_put);
+
 	return err;
 }
 EXPORT_SYMBOL(xfrm_state_add);
@@ -1269,10 +1278,12 @@ EXPORT_SYMBOL(xfrm_state_migrate);
 
 int xfrm_state_update(struct xfrm_state *x)
 {
-	struct xfrm_state *x1;
+	struct xfrm_state *x1, *to_put;
 	int err;
 	int use_spi = xfrm_id_proto_match(x->id.proto, IPSEC_PROTO_ANY);
 
+	to_put = NULL;
+
 	spin_lock_bh(&xfrm_state_lock);
 	x1 = __xfrm_state_locate(x, use_spi, x->props.family);
 
@@ -1281,7 +1292,7 @@ int xfrm_state_update(struct xfrm_state *x)
 		goto out;
 
 	if (xfrm_state_kern(x1)) {
-		xfrm_state_put(x1);
+		to_put = x1;
 		err = -EEXIST;
 		goto out;
 	}
@@ -1295,6 +1306,9 @@ int xfrm_state_update(struct xfrm_state *x)
 out:
 	spin_unlock_bh(&xfrm_state_lock);
 
+	if (to_put)
+		xfrm_state_put(to_put);
+
 	if (err)
 		return err;
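The reason xfrm_alloc_spi() gets away without the deferral can be
sketched as follows (illustrative only; the point is that the
per-state x->lock and the global xfrm_state_lock are distinct locks):

	spin_lock_bh(&x->lock);	/* per-state lock, not the global one */
	/* ... */
	xfrm_state_put(x0);	/* even if this drops the last reference, */
				/* __xfrm_state_destroy() takes           */
				/* xfrm_state_lock, which is NOT held     */
				/* here, so there is no self-deadlock     */
	spin_unlock_bh(&x->lock);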
* Re: xfrm_state locking regression...
From: Herbert Xu @ 2008-09-03 5:01 UTC (permalink / raw)
To: David Miller; +Cc: netdev, timo.teras

David Miller <davem@davemloft.net> wrote:
>
> This is what added the xfrm_state_lock grabbing in __xfrm_state_destroy().

This is simply bogus.  We should remove the entry from the list
when the state is deleted, not when it's destroyed.  That is, we
should be deleting x->all in places like __xfrm_state_delete instead
of __xfrm_state_destroy.

The fact that we're adding x->all in __xfrm_state_insert means
that we should be deleting it from its counter-part, which is
__xfrm_state_delete.

The same thing applies to the policies of course.  Once a policy
or state is deleted there is no reason why it should show up in a
walk just because some dst is holding onto it.

> ipsec: Fix deadlock in xfrm_state management.
>
> Ever since commit 4c563f7669c10a12354b72b518c2287ffc6ebfb3
> ("[XFRM]: Speed up xfrm_policy and xfrm_state walking") it is
> illegal to call __xfrm_state_destroy (and thus xfrm_state_put())
> with xfrm_state_lock held.  If we do, we'll deadlock since we
> have the lock already and __xfrm_state_destroy() tries to take
> it again.
>
> Fix this by pushing the xfrm_state_put() calls after the lock
> is dropped.
>
> Signed-off-by: David S. Miller <davem@davemloft.net>

However, moving things out of critical sections is always a good
idea.  So I think your patch is an improvement regardless of why
it came about :)

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
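In code terms, the suggestion amounts to moving the unlink (a sketch
against the functions as they looked at the time, not a complete
patch):

	int __xfrm_state_delete(struct xfrm_state *x)
	{
		/* ... */
		spin_lock(&xfrm_state_lock);
		list_del(&x->all);	/* unlink for walks here ... */
		hlist_del(&x->bydst);
		hlist_del(&x->bysrc);
		/* ... */
	}

	void __xfrm_state_destroy(struct xfrm_state *x)
	{
		/* ... and no longer here, so destruction never
		 * needs to take xfrm_state_lock at all */
	}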
* Re: xfrm_state locking regression...
From: Timo Teräs @ 2008-09-03 5:07 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev

Herbert Xu wrote:
> David Miller <davem@davemloft.net> wrote:
>> This is what added the xfrm_state_lock grabbing in __xfrm_state_destroy().
>
> This is simply bogus.  We should remove the entry from the list
> when the state is deleted, not when it's destroyed.  That is, we
> should be deleting x->all in places like __xfrm_state_delete instead
> of __xfrm_state_destroy.
>
> The fact that we're adding x->all in __xfrm_state_insert means
> that we should be deleting it from its counter-part, which is
> __xfrm_state_delete.
>
> The same thing applies to the policies of course.  Once a policy
> or state is deleted there is no reason why it should show up in a
> walk just because some dst is holding onto it.

But then the caller of xfrm_state_walk() can still be holding a
reference to the entry, and dumping of the whole chain fails, because
the entry is no longer part of the chain when dumping continues later.

The walking code skips entries which are dead, so the entry is only
kept temporarily to make a full traversal of the list possible.

>> ipsec: Fix deadlock in xfrm_state management.
>>
>> Ever since commit 4c563f7669c10a12354b72b518c2287ffc6ebfb3
>> ("[XFRM]: Speed up xfrm_policy and xfrm_state walking") it is
>> illegal to call __xfrm_state_destroy (and thus xfrm_state_put())
>> with xfrm_state_lock held.  If we do, we'll deadlock since we
>> have the lock already and __xfrm_state_destroy() tries to take
>> it again.
>>
>> Fix this by pushing the xfrm_state_put() calls after the lock
>> is dropped.
>>
>> Signed-off-by: David S. Miller <davem@davemloft.net>
>
> However, moving things out of critical sections is always a good
> idea.  So I think your patch is an improvement regardless of why
> it came about :)

Right.  Also, alternatively, xfrm_state_all could be protected by a
different lock than xfrm_state_lock.

- Timo
* Re: xfrm_state locking regression...
From: Herbert Xu @ 2008-09-03 5:23 UTC (permalink / raw)
To: Timo Teräs; +Cc: David Miller, netdev

On Wed, Sep 03, 2008 at 08:07:33AM +0300, Timo Teräs wrote:
>
> But then the caller of xfrm_state_walk() can still be holding a
> reference to the entry, and dumping of the whole chain fails, because
> the entry is no longer part of the chain when dumping continues later.
>
> The walking code skips entries which are dead, so the entry is only
> kept temporarily to make a full traversal of the list possible.

Ah, I see your point now.

Actually, another thing we should do is to postpone the unlinking
into the GC process.  This doesn't really need to be immediate.

And here is a patch to fix a couple of changes in user-visible
behaviour.

ipsec: Restore larval states and socket policies in dump

The commit 4c563f7669c10a12354b72b518c2287ffc6ebfb3 ("[XFRM]:
Speed up xfrm_policy and xfrm_state walking") inadvertently removed
larval states and socket policies from netlink dumps.  This patch
restores them.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 841b32a..5138845 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1077,6 +1077,7 @@ static void __xfrm_policy_link(struct xfrm_policy *pol, int dir)
 	struct hlist_head *chain = policy_hash_bysel(&pol->selector,
 						     pol->family, dir);
 
+	list_add_tail(&pol->bytype, &xfrm_policy_bytype[pol->type]);
 	hlist_add_head(&pol->bydst, chain);
 	hlist_add_head(&pol->byidx, xfrm_policy_byidx+idx_hash(pol->index));
 	xfrm_policy_count[dir]++;
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 4c6914e..89551b6 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -856,6 +856,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
 
 		if (km_query(x, tmpl, pol) == 0) {
 			x->km.state = XFRM_STATE_ACQ;
+			list_add_tail(&x->all, &xfrm_state_all);
 			hlist_add_head(&x->bydst, xfrm_state_bydst+h);
 			h = xfrm_src_hash(daddr, saddr, family);
 			hlist_add_head(&x->bysrc, xfrm_state_bysrc+h);
@@ -1051,6 +1052,7 @@ static struct xfrm_state *__find_acq_core(unsigned short family, u8 mode, u32 re
 		xfrm_state_hold(x);
 		x->timer.expires = jiffies + sysctl_xfrm_acq_expires*HZ;
 		add_timer(&x->timer);
+		list_add_tail(&x->all, &xfrm_state_all);
 		hlist_add_head(&x->bydst, xfrm_state_bydst+h);
 		h = xfrm_src_hash(daddr, saddr, family);
 		hlist_add_head(&x->bysrc, xfrm_state_bysrc+h);

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: xfrm_state locking regression...
From: Timo Teräs @ 2008-09-03 5:39 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev

Herbert Xu wrote:
> On Wed, Sep 03, 2008 at 08:07:33AM +0300, Timo Teräs wrote:
>> But then the caller of xfrm_state_walk() can still be holding a
>> reference to the entry, and dumping of the whole chain fails, because
>> the entry is no longer part of the chain when dumping continues later.
>>
>> The walking code skips entries which are dead, so the entry is only
>> kept temporarily to make a full traversal of the list possible.
>
> Ah, I see your point now.
>
> Actually, another thing we should do is to postpone the unlinking
> into the GC process.  This doesn't really need to be immediate.

Yes, that might be a good idea.

> And here is a patch to fix a couple of changes in user-visible
> behaviour.
>
> ipsec: Restore larval states and socket policies in dump
>
> The commit 4c563f7669c10a12354b72b518c2287ffc6ebfb3 ("[XFRM]:
> Speed up xfrm_policy and xfrm_state walking") inadvertently removed
> larval states and socket policies from netlink dumps.  This patch
> restores them.

Oh, I seem to have missed that there were multiple places where
the list insertion happens.  Would it make sense to add a function
for inserting the entries into the lists?

- Timo
* Re: xfrm_state locking regression...
From: Herbert Xu @ 2008-09-03 5:40 UTC (permalink / raw)
To: Timo Teräs; +Cc: David Miller, netdev

On Wed, Sep 03, 2008 at 08:39:18AM +0300, Timo Teräs wrote:
>
> Oh, I seem to have missed that there were multiple places where
> the list insertion happens.  Would it make sense to add a function
> for inserting the entries into the lists?

Sure, grep for the places that change xfrm_state_num and you've
got your function :)

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: xfrm_state locking regression...
From: David Miller @ 2008-09-09 12:25 UTC (permalink / raw)
To: herbert; +Cc: timo.teras, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 3 Sep 2008 15:23:30 +1000

> ipsec: Restore larval states and socket policies in dump
>
> The commit 4c563f7669c10a12354b72b518c2287ffc6ebfb3 ("[XFRM]:
> Speed up xfrm_policy and xfrm_state walking") inadvertently removed
> larval states and socket policies from netlink dumps.  This patch
> restores them.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied to net-2.6, thanks Herbert.
* Re: xfrm_state locking regression...
From: Herbert Xu @ 2008-09-03 5:39 UTC (permalink / raw)
To: Timo Teräs; +Cc: David Miller, netdev

On Wed, Sep 03, 2008 at 08:07:33AM +0300, Timo Teräs wrote:
>
> Right.  Also, alternatively, xfrm_state_all could be protected by a
> different lock than xfrm_state_lock.

Yes, that's a good idea.  Here's a patch to do that for states.
I'm still thinking about whether policy destruction should be
restructured or not, so I'm not patching that.

ipsec: Move state dump list unlinking into GC

This patch moves the unlinking of x->all (for dumping) into the
GC and changes the lock to xfrm_cfg_mutex.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 89551b6..f29923a 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -380,6 +380,10 @@ static void xfrm_put_mode(struct xfrm_mode *mode)
 
 static void xfrm_state_gc_destroy(struct xfrm_state *x)
 {
+	mutex_lock(&xfrm_cfg_mutex);
+	list_del(&x->all);
+	mutex_unlock(&xfrm_cfg_mutex);
+
 	del_timer_sync(&x->timer);
 	del_timer_sync(&x->rtimer);
 	kfree(x->aalg);
@@ -540,10 +544,6 @@ void __xfrm_state_destroy(struct xfrm_state *x)
 {
 	WARN_ON(x->km.state != XFRM_STATE_DEAD);
 
-	spin_lock_bh(&xfrm_state_lock);
-	list_del(&x->all);
-	spin_unlock_bh(&xfrm_state_lock);
-
 	spin_lock_bh(&xfrm_state_gc_lock);
 	hlist_add_head(&x->bydst, &xfrm_state_gc_list);
 	spin_unlock_bh(&xfrm_state_gc_lock);

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: xfrm_state locking regression...
From: Timo Teräs @ 2008-09-03 5:45 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev

Herbert Xu wrote:
> On Wed, Sep 03, 2008 at 08:07:33AM +0300, Timo Teräs wrote:
>> Right.  Also, alternatively, xfrm_state_all could be protected by a
>> different lock than xfrm_state_lock.
>
> Yes, that's a good idea.  Here's a patch to do that for states.
> I'm still thinking about whether policy destruction should be
> restructured or not, so I'm not patching that.
>
> ipsec: Move state dump list unlinking into GC
>
> This patch moves the unlinking of x->all (for dumping) into the
> GC and changes the lock to xfrm_cfg_mutex.

Shouldn't this also change the locking in all places where
x->all is used?
* Re: xfrm_state locking regression...
From: Herbert Xu @ 2008-09-03 5:50 UTC (permalink / raw)
To: Timo Teräs; +Cc: David Miller, netdev

On Wed, Sep 03, 2008 at 08:45:01AM +0300, Timo Teräs wrote:
>
>> ipsec: Move state dump list unlinking into GC
>>
>> This patch moves the unlinking of x->all (for dumping) into the
>> GC and changes the lock to xfrm_cfg_mutex.
>
> Shouldn't this also change the locking in all places where
> x->all is used?

You're right.  I'd forgotten about the larval states for some
reason, probably because I just added them to the list :)

So let's keep xfrm_state_lock and just move it.

ipsec: Move state dump list unlinking into GC

This patch moves the unlinking of x->all (for dumping) into the
GC so that we know for sure what locks we hold when we unlink the
entry.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 89551b6..0244325 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -380,6 +380,10 @@ static void xfrm_put_mode(struct xfrm_mode *mode)
 
 static void xfrm_state_gc_destroy(struct xfrm_state *x)
 {
+	spin_lock_bh(&xfrm_state_lock);
+	list_del(&x->all);
+	spin_unlock_bh(&xfrm_state_lock);
+
 	del_timer_sync(&x->timer);
 	del_timer_sync(&x->rtimer);
 	kfree(x->aalg);
@@ -540,10 +544,6 @@ void __xfrm_state_destroy(struct xfrm_state *x)
 {
 	WARN_ON(x->km.state != XFRM_STATE_DEAD);
 
-	spin_lock_bh(&xfrm_state_lock);
-	list_del(&x->all);
-	spin_unlock_bh(&xfrm_state_lock);
-
 	spin_lock_bh(&xfrm_state_gc_lock);
 	hlist_add_head(&x->bydst, &xfrm_state_gc_list);
 	spin_unlock_bh(&xfrm_state_gc_lock);

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: xfrm_state locking regression...
From: David Miller @ 2008-09-03 6:14 UTC (permalink / raw)
To: herbert; +Cc: timo.teras, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 3 Sep 2008 15:50:41 +1000

> On Wed, Sep 03, 2008 at 08:45:01AM +0300, Timo Teräs wrote:
>>
>>> ipsec: Move state dump list unlinking into GC
>>>
>>> This patch moves the unlinking of x->all (for dumping) into the
>>> GC and changes the lock to xfrm_cfg_mutex.
>>
>> Shouldn't this also change the locking in all places where
>> x->all is used?
>
> You're right.  I'd forgotten about the larval states for some
> reason, probably because I just added them to the list :)
>
> So let's keep xfrm_state_lock and just move it.

How about we un-crap the reference counting?

Every list an object lives on is an external reference, and that is
the rule that is violated by this ->all list handling.

And that's why we need to take a single shared lock twice just to get
rid of an xfrm_state object now.
* Re: xfrm_state locking regression...
From: Timo Teräs @ 2008-09-03 6:27 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev

David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Wed, 3 Sep 2008 15:50:41 +1000
>
>> So let's keep xfrm_state_lock and just move it.
>
> How about we un-crap the reference counting?
>
> Every list an object lives on is an external reference, and that is
> the rule that is violated by this ->all list handling.
>
> And that's why we need to take a single shared lock twice just to get
> rid of an xfrm_state object now.

Well, it's just another list keeping a reference, like ->bydst,
->bysrc and ->byspi.  The actual number of external references is
still correct (the walking code calls _hold() when it returns while
keeping an external pointer).

The difference is that a node should not be unlinked from ->all until
all other references are gone.  For the other lists the unlinking can
be done earlier, since they are used only for lookups.

Are there any other good ways to enumerate the list entries while
keeping a temporary "iterator"?  The previous method was crap too.

- Timo
* Re: xfrm_state locking regression...
From: David Miller @ 2008-09-03 6:35 UTC (permalink / raw)
To: timo.teras; +Cc: herbert, netdev

From: Timo Teräs <timo.teras@iki.fi>
Date: Wed, 03 Sep 2008 09:27:47 +0300

> Well, it's just another list keeping a reference, like ->bydst,
> ->bysrc and ->byspi.  The actual number of external references is
> still correct (the walking code calls _hold() when it returns while
> keeping an external pointer).

->bydst, ->bysrc, and ->byspi are counted as a single external
reference because:

1) They are controlled as a group

2) Doing 3 atomic operations is more expensive than one

I know because I did that conversion from 3 refcount operations down
to 1 and I timed it with stress tests, which showed that it made a
huge performance difference for the control path of our IPSEC stack.

> The difference is that a node should not be unlinked from ->all until
> all other references are gone.  For the other lists the unlinking can
> be done earlier, since they are used only for lookups.

Once there are no list references, there cannot be any other references.
So in fact it seems to me that unlinking when the xfrm_state is removed
from those other lists makes perfect sense.

If __xfrm_state_delete sets the state to DEAD, and you skip xfrm_state
objects marked DEAD, why does the ->all list reference have to survive
past __xfrm_state_delete()?

It seems the perfect place to do the ->all removal.

> Are there any other good ways to enumerate the list entries while
> keeping a temporary "iterator"?  The previous method was crap too.

At least the old stuff was self-consistent and only needed one central
lock grab to destroy an object.
* Re: xfrm_state locking regression...
From: Timo Teräs @ 2008-09-03 6:45 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev

David Miller wrote:
> From: Timo Teräs <timo.teras@iki.fi>
> Date: Wed, 03 Sep 2008 09:27:47 +0300
>
>> Well, it's just another list keeping a reference, like ->bydst,
>> ->bysrc and ->byspi.  The actual number of external references is
>> still correct (the walking code calls _hold() when it returns while
>> keeping an external pointer).
>
> ->bydst, ->bysrc, and ->byspi are counted as a single external
> reference because:
>
> 1) They are controlled as a group
>
> 2) Doing 3 atomic operations is more expensive than one
>
> I know because I did that conversion from 3 refcount operations down
> to 1 and I timed it with stress tests, which showed that it made a
> huge performance difference for the control path of our IPSEC stack.

I was a bit confused about what you meant by "external reference".
But yes, in this sense it's adding a new external reference.

>> The difference is that a node should not be unlinked from ->all until
>> all other references are gone.  For the other lists the unlinking can
>> be done earlier, since they are used only for lookups.
>
> Once there are no list references, there cannot be any other references.
> So in fact it seems to me that unlinking when the xfrm_state is removed
> from those other lists makes perfect sense.
>
> If __xfrm_state_delete sets the state to DEAD, and you skip xfrm_state
> objects marked DEAD, why does the ->all list reference have to survive
> past __xfrm_state_delete()?
>
> It seems the perfect place to do the ->all removal.

1. xfrm_state_walk() is called; it returns but holds an entry, since
   the walk was interrupted temporarily (e.g. the netlink buffer
   filled up).

2. xfrm_state_delete() is called on the entry that xfrm_state_walk()
   is keeping a pointer to, and the entry is unlinked.

3. xfrm_state_walk() is called again; it tries to resume the list
   walk, but whoops, the entry was unlinked, and kaboom.

>> Are there any other good ways to enumerate the list entries while
>> keeping a temporary "iterator"?  The previous method was crap too.
>
> At least the old stuff was self-consistent and only needed one central
> lock grab to destroy an object.

Yes, but the dumping code produced crap.  It could dump the same entry
multiple times, miss entries, and was dog slow.  With it there was no
possibility of keeping userland in sync with the kernel SPD/SAD,
because entries were lost.

- Timo
* Re: xfrm_state locking regression...
From: David Miller @ 2008-09-03 6:47 UTC (permalink / raw)
To: timo.teras; +Cc: herbert, netdev

From: Timo Teräs <timo.teras@iki.fi>
Date: Wed, 03 Sep 2008 09:45:48 +0300

> David Miller wrote:
>> Once there are no list references, there cannot be any other references.
>> So in fact it seems to me that unlinking when the xfrm_state is removed
>> from those other lists makes perfect sense.
>>
>> If __xfrm_state_delete sets the state to DEAD, and you skip xfrm_state
>> objects marked DEAD, why does the ->all list reference have to survive
>> past __xfrm_state_delete()?
>>
>> It seems the perfect place to do the ->all removal.
>
> 1. xfrm_state_walk() is called; it returns but holds an entry, since
>    the walk was interrupted temporarily (e.g. the netlink buffer
>    filled up).
>
> 2. xfrm_state_delete() is called on the entry that xfrm_state_walk()
>    is keeping a pointer to, and the entry is unlinked.
>
> 3. xfrm_state_walk() is called again; it tries to resume the list
>    walk, but whoops, the entry was unlinked, and kaboom.

Get creative, use a key of some sort to continue the walk, that's what
other netlink'ish subsystems use.

> Yes, but the dumping code produced crap.  It could dump the same entry
> multiple times, miss entries, and was dog slow.  With it there was no
> possibility of keeping userland in sync with the kernel SPD/SAD,
> because entries were lost.

I'd rather see an entry twice in a dump than have my IPSEC gateway
lock up, or run slower because we take a lock twice as often as
necessary during object destruction.
* Re: xfrm_state locking regression...
From: Timo Teräs @ 2008-09-03 7:14 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev

David Miller wrote:
> Get creative, use a key of some sort to continue the walk, that's what
> other netlink'ish subsystems use.

Ok.  ctnetlink_dump_table() seems to iterate the hash list and keeps
a unique id of where it left off.  If that entry was deleted, it
starts again from that hash bucket.

We could iterate the SPD by idx.  By SPI would be nice for the SAD,
but not all entries have an SPI.  I guess we could use either the
bysrc or the bydst hash and memorize the xfrm_id.  Restart from the
first hash bucket entry if the memorized entry gets deleted, and
restart from the beginning if the hash gets resized.

This way we could avoid the additional locking and guarantee that all
entries are dumped in reasonable time.

>> Yes, but the dumping code produced crap.  It could dump the same entry
>> multiple times, miss entries, and was dog slow.  With it there was no
>> possibility of keeping userland in sync with the kernel SPD/SAD,
>> because entries were lost.
>
> I'd rather see an entry twice in a dump than have my IPSEC gateway
> lock up, or run slower because we take a lock twice as often as
> necessary during object destruction.

Seeing an entry twice is not a problem.  Not seeing an entry at all
was the real problem.

Also, listing the SAD could take tens of seconds on a modern system.
That's not nice either.

- Timo
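A keyed resume along these lines might look roughly like the
following (purely hypothetical sketch: walk->last_key, x->key and
dump_one() are made-up names, and nothing of this exact shape was
proposed or merged):

	/* instead of holding a pointer into the list, remember a
	 * unique key and skip past it when the dump resumes */
	hlist_for_each_entry(x, entry, xfrm_state_bydst + h, bydst) {
		if (x->key <= walk->last_key)
			continue;		/* dumped in an earlier pass */
		if (dump_one(x, data) < 0) {	/* netlink buffer full */
			walk->last_key = x->key;
			return -ENOBUFS;	/* resume from the key later */
		}
	}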
* Re: xfrm_state locking regression...
From: Herbert Xu @ 2008-09-05 11:55 UTC (permalink / raw)
To: David Miller; +Cc: timo.teras, netdev

On Tue, Sep 02, 2008 at 11:47:23PM -0700, David Miller wrote:
>
> Get creative, use a key of some sort to continue the walk, that's what
> other netlink'ish subsystems use.

OK, let's see if this is creative enough :)

The next step after this patch is to kill x->all altogether and go
back to the old walking method (plus the saved state).

ipsec: Use RCU-like construct for saved state within a walk

Now that we save states within a walk we need synchronisation
so that the list the saved state is on doesn't disappear from
under us.

As it stands this is done by keeping the state on the list which
is bad because it gets in the way of the management of the state
life-cycle.

An alternative is to make our own pseudo-RCU system where we use
counters to indicate which state can't be freed immediately as
it may be referenced by an ongoing walk when that resumes.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 2933d74..4bb9499 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -120,9 +120,11 @@ extern struct mutex xfrm_cfg_mutex;
 /* Full description of state of transformer. */
 struct xfrm_state
 {
-	/* Note: bydst is re-used during gc */
 	struct list_head	all;
-	struct hlist_node	bydst;
+	union {
+		struct list_head	gclist;
+		struct hlist_node	bydst;
+	};
 	struct hlist_node	bysrc;
 	struct hlist_node	byspi;
 
@@ -1286,16 +1288,9 @@ static inline void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto)
 	walk->count = 0;
 }
 
-static inline void xfrm_state_walk_done(struct xfrm_state_walk *walk)
-{
-	if (walk->state != NULL) {
-		xfrm_state_put(walk->state);
-		walk->state = NULL;
-	}
-}
-
 extern int xfrm_state_walk(struct xfrm_state_walk *walk,
 			   int (*func)(struct xfrm_state *, int, void*), void *);
+extern void xfrm_state_walk_done(struct xfrm_state_walk *walk);
 extern struct xfrm_state *xfrm_state_alloc(void);
 extern struct xfrm_state *xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
 					  struct flowi *fl, struct xfrm_tmpl *tmpl,
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 7bd62f6..ec9d1e5 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -59,6 +59,11 @@ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024;
 static unsigned int xfrm_state_num;
 static unsigned int xfrm_state_genid;
 
+/* Counter indicating ongoing walk, protected by xfrm_state_lock. */
+static unsigned long xfrm_state_walk_ongoing;
+/* Counter indicating walk completion, protected by xfrm_cfg_mutex. */
+static unsigned long xfrm_state_walk_completed;
+
 static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned int family);
 static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo);
 
@@ -191,7 +196,7 @@ static DEFINE_RWLOCK(xfrm_state_afinfo_lock);
 static struct xfrm_state_afinfo *xfrm_state_afinfo[NPROTO];
 
 static struct work_struct xfrm_state_gc_work;
-static HLIST_HEAD(xfrm_state_gc_list);
+static LIST_HEAD(xfrm_state_gc_list);
 static DEFINE_SPINLOCK(xfrm_state_gc_lock);
 
 int __xfrm_state_delete(struct xfrm_state *x);
@@ -403,17 +408,23 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x)
 
 static void xfrm_state_gc_task(struct work_struct *data)
 {
-	struct xfrm_state *x;
-	struct hlist_node *entry, *tmp;
-	struct hlist_head gc_list;
+	struct xfrm_state *x, *tmp;
+	static LIST_HEAD(gc_list);
+	unsigned long completed;
 
 	spin_lock_bh(&xfrm_state_gc_lock);
-	gc_list.first = xfrm_state_gc_list.first;
-	INIT_HLIST_HEAD(&xfrm_state_gc_list);
+	list_splice_tail_init(&xfrm_state_gc_list, &gc_list);
 	spin_unlock_bh(&xfrm_state_gc_lock);
 
-	hlist_for_each_entry_safe(x, entry, tmp, &gc_list, bydst)
+	mutex_lock(&xfrm_cfg_mutex);
+	completed = xfrm_state_walk_completed;
+	mutex_unlock(&xfrm_cfg_mutex);
+
+	list_for_each_entry_safe(x, tmp, &gc_list, gclist) {
+		if ((long)(x->lastused - completed) > 0)
+			break;
 		xfrm_state_gc_destroy(x);
+	}
 
 	wake_up(&km_waitq);
 }
@@ -540,12 +551,8 @@ void __xfrm_state_destroy(struct xfrm_state *x)
 {
 	WARN_ON(x->km.state != XFRM_STATE_DEAD);
 
-	spin_lock_bh(&xfrm_state_lock);
-	list_del(&x->all);
-	spin_unlock_bh(&xfrm_state_lock);
-
 	spin_lock_bh(&xfrm_state_gc_lock);
-	hlist_add_head(&x->bydst, &xfrm_state_gc_list);
+	list_add_tail(&x->gclist, &xfrm_state_gc_list);
 	spin_unlock_bh(&xfrm_state_gc_lock);
 	schedule_work(&xfrm_state_gc_work);
 }
@@ -558,6 +565,8 @@ int __xfrm_state_delete(struct xfrm_state *x)
 	if (x->km.state != XFRM_STATE_DEAD) {
 		x->km.state = XFRM_STATE_DEAD;
 		spin_lock(&xfrm_state_lock);
+		x->lastused = xfrm_state_walk_ongoing;
+		list_del_rcu(&x->all);
 		hlist_del(&x->bydst);
 		hlist_del(&x->bysrc);
 		if (x->id.spi)
@@ -1572,6 +1581,7 @@ int xfrm_state_walk(struct xfrm_state_walk *walk,
 		if (err) {
 			xfrm_state_hold(last);
 			walk->state = last;
+			xfrm_state_walk_ongoing++;
 			goto out;
 		}
 	}
@@ -1586,12 +1596,26 @@ int xfrm_state_walk(struct xfrm_state_walk *walk,
 		err = func(last, 0, data);
 out:
 	spin_unlock_bh(&xfrm_state_lock);
-	if (old != NULL)
+	if (old != NULL) {
 		xfrm_state_put(old);
+		xfrm_state_walk_completed++;
+		schedule_work(&xfrm_state_gc_work);
+	}
 	return err;
 }
 EXPORT_SYMBOL(xfrm_state_walk);
 
+void xfrm_state_walk_done(struct xfrm_state_walk *walk)
+{
+	if (walk->state != NULL) {
+		xfrm_state_put(walk->state);
+		walk->state = NULL;
+		xfrm_state_walk_completed++;
+		schedule_work(&xfrm_state_gc_work);
+	}
+}
+EXPORT_SYMBOL(xfrm_state_walk_done);
+
 
 void xfrm_replay_notify(struct xfrm_state *x, int event)
 {

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
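The lifetime rule the patch implements can be spelled out with a
small worked example (x->lastused is reused here as a generation
stamp):

	/* Each interrupted walk bumps xfrm_state_walk_ongoing; each
	 * walk that finishes or is resumed bumps
	 * xfrm_state_walk_completed.  A deleted state records the
	 * current ongoing count, and the GC only frees it once every
	 * walk that was in flight at deletion time has completed:
	 *
	 *   ongoing = 5, completed = 3
	 *   state deleted now     -> x->lastused = 5
	 *   GC: (long)(5 - 3) > 0 -> keep x, 2 walks outstanding
	 *   both walks finish     -> completed = 5
	 *   GC: (long)(5 - 5) > 0 -> false, x is destroyed
	 *
	 * The signed subtraction keeps the test correct across
	 * counter wraparound, in the style of time_after(). */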
* Re: xfrm_state locking regression...
From: David Miller @ 2008-09-09 0:09 UTC (permalink / raw)
To: herbert; +Cc: timo.teras, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 5 Sep 2008 21:55:07 +1000

> @@ -403,17 +408,23 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x)
>
>  static void xfrm_state_gc_task(struct work_struct *data)
>  {
> -	struct xfrm_state *x;
> -	struct hlist_node *entry, *tmp;
> -	struct hlist_head gc_list;
> +	struct xfrm_state *x, *tmp;
> +	static LIST_HEAD(gc_list);
> +	unsigned long completed;
>
>  	spin_lock_bh(&xfrm_state_gc_lock);
> -	gc_list.first = xfrm_state_gc_list.first;
> -	INIT_HLIST_HEAD(&xfrm_state_gc_list);
> +	list_splice_tail_init(&xfrm_state_gc_list, &gc_list);
>  	spin_unlock_bh(&xfrm_state_gc_lock);
>
> -	hlist_for_each_entry_safe(x, entry, tmp, &gc_list, bydst)
> +	mutex_lock(&xfrm_cfg_mutex);
> +	completed = xfrm_state_walk_completed;
> +	mutex_unlock(&xfrm_cfg_mutex);
> +
> +	list_for_each_entry_safe(x, tmp, &gc_list, gclist) {
> +		if ((long)(x->lastused - completed) > 0)
> +			break;
>  		xfrm_state_gc_destroy(x);
> +	}
>
>  	wake_up(&km_waitq);
>  }

What happens to all of the entries on the local gc_list which you
will be skipping when you break out of that last loop?

It looks like they will never get processed.
* Re: xfrm_state locking regression...
From: Herbert Xu @ 2008-09-09 0:18 UTC (permalink / raw)
To: David Miller; +Cc: timo.teras, netdev

On Mon, Sep 08, 2008 at 05:09:07PM -0700, David Miller wrote:
>
> What happens to all of the entries on the local gc_list which you
> will be skipping when you break out of that last loop?
>
> It looks like they will never get processed.

I made gc_list static, so they'll be processed when the GC task
runs again.  Also, whenever completed is incremented we schedule
the GC task.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: xfrm_state locking regression...
From: David Miller @ 2008-09-09 0:20 UTC (permalink / raw)
To: herbert; +Cc: timo.teras, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 9 Sep 2008 10:18:28 +1000

> On Mon, Sep 08, 2008 at 05:09:07PM -0700, David Miller wrote:
>> What happens to all of the entries on the local gc_list which you
>> will be skipping when you break out of that last loop?
>>
>> It looks like they will never get processed.
>
> I made gc_list static, so they'll be processed when the GC task
> runs again.  Also, whenever completed is incremented we schedule
> the GC task.

Thanks, I didn't catch that "static" bit :-)
* Re: xfrm_state locking regression...
From: David Miller @ 2008-09-09 0:25 UTC (permalink / raw)
To: herbert; +Cc: timo.teras, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 5 Sep 2008 21:55:07 +1000

> ipsec: Use RCU-like construct for saved state within a walk

Ok, now that I finally understand how this works I like it.

The only comment I would make is that maybe it's a bit excessive
to trigger the GC worker every time we walk the states.

Instead you could expose the GC task's local list and go:

	xfrm_state_walk_completed++;
	smp_wmb();
	if (!list_empty(&xfrm_state_gc_work_list))
		schedule_work(&xfrm_state_gc_work);

in xfrm_state_walk_done().
* Re: xfrm_state locking regression...
From: Herbert Xu @ 2008-09-09 14:33 UTC (permalink / raw)
To: David Miller; +Cc: timo.teras, netdev

On Mon, Sep 08, 2008 at 05:25:13PM -0700, David Miller wrote:
>
> The only comment I would make is that maybe it's a bit excessive
> to trigger the GC worker every time we walk the states.

Good point!

I've avoided the memory barrier by simply extending the mutexed
section in the GC to cover the list splicing.  Here's the updated
patch:

ipsec: Use RCU-like construct for saved state within a walk

Now that we save states within a walk we need synchronisation
so that the list the saved state is on doesn't disappear from
under us.

As it stands this is done by keeping the state on the list which
is bad because it gets in the way of the management of the state
life-cycle.

An alternative is to make our own pseudo-RCU system where we use
counters to indicate which state can't be freed immediately as
it may be referenced by an ongoing walk when that resumes.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 2933d74..4bb9499 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -120,9 +120,11 @@ extern struct mutex xfrm_cfg_mutex;
 /* Full description of state of transformer. */
 struct xfrm_state
 {
-	/* Note: bydst is re-used during gc */
 	struct list_head	all;
-	struct hlist_node	bydst;
+	union {
+		struct list_head	gclist;
+		struct hlist_node	bydst;
+	};
 	struct hlist_node	bysrc;
 	struct hlist_node	byspi;
 
@@ -1286,16 +1288,9 @@ static inline void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto)
 	walk->count = 0;
 }
 
-static inline void xfrm_state_walk_done(struct xfrm_state_walk *walk)
-{
-	if (walk->state != NULL) {
-		xfrm_state_put(walk->state);
-		walk->state = NULL;
-	}
-}
-
 extern int xfrm_state_walk(struct xfrm_state_walk *walk,
 			   int (*func)(struct xfrm_state *, int, void*), void *);
+extern void xfrm_state_walk_done(struct xfrm_state_walk *walk);
 extern struct xfrm_state *xfrm_state_alloc(void);
 extern struct xfrm_state *xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
 					  struct flowi *fl, struct xfrm_tmpl *tmpl,
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 7bd62f6..d90f936 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -59,6 +59,11 @@ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024;
 static unsigned int xfrm_state_num;
 static unsigned int xfrm_state_genid;
 
+/* Counter indicating ongoing walk, protected by xfrm_state_lock. */
+static unsigned long xfrm_state_walk_ongoing;
+/* Counter indicating walk completion, protected by xfrm_cfg_mutex. */
+static unsigned long xfrm_state_walk_completed;
+
 static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned int family);
 static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo);
 
@@ -191,7 +196,8 @@ static DEFINE_RWLOCK(xfrm_state_afinfo_lock);
 static struct xfrm_state_afinfo *xfrm_state_afinfo[NPROTO];
 
 static struct work_struct xfrm_state_gc_work;
-static HLIST_HEAD(xfrm_state_gc_list);
+static LIST_HEAD(xfrm_state_gc_leftovers);
+static LIST_HEAD(xfrm_state_gc_list);
 static DEFINE_SPINLOCK(xfrm_state_gc_lock);
 
 int __xfrm_state_delete(struct xfrm_state *x);
@@ -403,17 +409,22 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x)
 
 static void xfrm_state_gc_task(struct work_struct *data)
 {
-	struct xfrm_state *x;
-	struct hlist_node *entry, *tmp;
-	struct hlist_head gc_list;
+	struct xfrm_state *x, *tmp;
+	unsigned long completed;
 
+	mutex_lock(&xfrm_cfg_mutex);
 	spin_lock_bh(&xfrm_state_gc_lock);
-	gc_list.first = xfrm_state_gc_list.first;
-	INIT_HLIST_HEAD(&xfrm_state_gc_list);
+	list_splice_tail_init(&xfrm_state_gc_list, &xfrm_state_gc_leftovers);
 	spin_unlock_bh(&xfrm_state_gc_lock);
 
-	hlist_for_each_entry_safe(x, entry, tmp, &gc_list, bydst)
+	completed = xfrm_state_walk_completed;
+	mutex_unlock(&xfrm_cfg_mutex);
+
+	list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) {
+		if ((long)(x->lastused - completed) > 0)
+			break;
 		xfrm_state_gc_destroy(x);
+	}
 
 	wake_up(&km_waitq);
 }
@@ -540,12 +551,8 @@ void __xfrm_state_destroy(struct xfrm_state *x)
 {
 	WARN_ON(x->km.state != XFRM_STATE_DEAD);
 
-	spin_lock_bh(&xfrm_state_lock);
-	list_del(&x->all);
-	spin_unlock_bh(&xfrm_state_lock);
-
 	spin_lock_bh(&xfrm_state_gc_lock);
-	hlist_add_head(&x->bydst, &xfrm_state_gc_list);
+	list_add_tail(&x->gclist, &xfrm_state_gc_list);
 	spin_unlock_bh(&xfrm_state_gc_lock);
 	schedule_work(&xfrm_state_gc_work);
 }
@@ -558,6 +565,8 @@ int __xfrm_state_delete(struct xfrm_state *x)
 	if (x->km.state != XFRM_STATE_DEAD) {
 		x->km.state = XFRM_STATE_DEAD;
 		spin_lock(&xfrm_state_lock);
+		x->lastused = xfrm_state_walk_ongoing;
+		list_del_rcu(&x->all);
 		hlist_del(&x->bydst);
 		hlist_del(&x->bysrc);
 		if (x->id.spi)
@@ -1572,6 +1581,7 @@ int xfrm_state_walk(struct xfrm_state_walk *walk,
 		if (err) {
 			xfrm_state_hold(last);
 			walk->state = last;
+			xfrm_state_walk_ongoing++;
 			goto out;
 		}
 	}
@@ -1586,12 +1596,28 @@ int xfrm_state_walk(struct xfrm_state_walk *walk,
 		err = func(last, 0, data);
 out:
 	spin_unlock_bh(&xfrm_state_lock);
-	if (old != NULL)
+	if (old != NULL) {
 		xfrm_state_put(old);
+		xfrm_state_walk_completed++;
+		if (!list_empty(&xfrm_state_gc_leftovers))
+			schedule_work(&xfrm_state_gc_work);
+	}
 	return err;
 }
 EXPORT_SYMBOL(xfrm_state_walk);
 
+void xfrm_state_walk_done(struct xfrm_state_walk *walk)
+{
+	if (walk->state != NULL) {
+		xfrm_state_put(walk->state);
+		walk->state = NULL;
+		xfrm_state_walk_completed++;
+		if (!list_empty(&xfrm_state_gc_leftovers))
+			schedule_work(&xfrm_state_gc_work);
+	}
+}
+EXPORT_SYMBOL(xfrm_state_walk_done);
+
 
 void xfrm_replay_notify(struct xfrm_state *x, int event)
 {

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: xfrm_state locking regression...
From: David Miller @ 2008-09-09 20:20 UTC (permalink / raw)
To: herbert; +Cc: timo.teras, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 10 Sep 2008 00:33:12 +1000

> I've avoided the memory barrier by simply extending the mutexed
> section in the GC to cover the list splicing.  Here's the updated
> patch:
>
> ipsec: Use RCU-like construct for saved state within a walk

Looks good.

I'll be able to merge this next time I pull net-2.6 into net-next-2.6
because of your larval/socket policy fix which touched similar areas.
* Re: xfrm_state locking regression...
From: David Miller @ 2008-09-10 3:01 UTC (permalink / raw)
To: herbert; +Cc: timo.teras, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 10 Sep 2008 00:33:12 +1000

> ipsec: Use RCU-like construct for saved state within a walk

Applied, thanks Herbert.
* Re: xfrm_state locking regression...
From: Paul E. McKenney @ 2008-09-11 21:24 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, timo.teras, netdev

On Wed, Sep 10, 2008 at 12:33:12AM +1000, Herbert Xu wrote:
> On Mon, Sep 08, 2008 at 05:25:13PM -0700, David Miller wrote:
>>
>> The only comment I would make is that maybe it's a bit excessive
>> to trigger the GC worker every time we walk the states.
>
> Good point!
>
> I've avoided the memory barrier by simply extending the mutexed
> section in the GC to cover the list splicing.  Here's the updated
> patch:
>
> ipsec: Use RCU-like construct for saved state within a walk
>
> Now that we save states within a walk we need synchronisation
> so that the list the saved state is on doesn't disappear from
> under us.
>
> As it stands this is done by keeping the state on the list which
> is bad because it gets in the way of the management of the state
> life-cycle.
>
> An alternative is to make our own pseudo-RCU system where we use
> counters to indicate which state can't be freed immediately as
> it may be referenced by an ongoing walk when that resumes.

There is only one reader at a time, right?  Otherwise, I don't see
how the increments and reads of xfrm_state_walk_completed line up.

							Thanx, Paul
> the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-11 21:24 ` Paul E. McKenney @ 2008-09-11 22:00 ` David Miller 2008-09-11 23:22 ` Paul E. McKenney 2008-09-12 16:08 ` Herbert Xu 0 siblings, 2 replies; 95+ messages in thread From: David Miller @ 2008-09-11 22:00 UTC (permalink / raw) To: paulmck; +Cc: herbert, timo.teras, netdev From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Date: Thu, 11 Sep 2008 14:24:59 -0700 > On Wed, Sep 10, 2008 at 12:33:12AM +1000, Herbert Xu wrote: > > On Mon, Sep 08, 2008 at 05:25:13PM -0700, David Miller wrote: > > > > > > The only comment I would make is that maybe it's a bit excessive > > > to trigger the GC worker every time we walk the states. > > > > Good point! > > > > I've avoided the memory barrier by simply extending the mutexed > > section in the GC to cover the list splicing. Here's the updated > > patch: > > > > ipsec: Use RCU-like construct for saved state within a walk > > > > Now that we save states within a walk we need synchronisation > > so that the list the saved state is on doesn't disappear from > > under us. > > > > As it stands this is done by keeping the state on the list which > > is bad because it gets in the way of the management of the state > > life-cycle. > > > > An alternative is to make our own pseudo-RCU system where we use > > counters to indicate which state can't be freed immediately as > > it may be referenced by an ongoing walk when that resumes. > > There is only one reader at a time, right? Otherwise, I don't see how > the increments and reads of xfrm_state_walk_completed line up. The RTNL semaphore is held across the modifications of the counters. ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-11 22:00 ` David Miller @ 2008-09-11 23:22 ` Paul E. McKenney 2008-09-12 16:08 ` Herbert Xu 1 sibling, 0 replies; 95+ messages in thread From: Paul E. McKenney @ 2008-09-11 23:22 UTC (permalink / raw) To: David Miller; +Cc: herbert, timo.teras, netdev On Thu, Sep 11, 2008 at 03:00:06PM -0700, David Miller wrote: > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> > Date: Thu, 11 Sep 2008 14:24:59 -0700 > > > On Wed, Sep 10, 2008 at 12:33:12AM +1000, Herbert Xu wrote: > > > On Mon, Sep 08, 2008 at 05:25:13PM -0700, David Miller wrote: > > > > > > > > The only comment I would make is that maybe it's a bit excessive > > > > to trigger the GC worker every time we walk the states. > > > > > > Good point! > > > > > > I've avoided the memory barrier by simply extending the mutexed > > > section in the GC to cover the list splicing. Here's the updated > > > patch: > > > > > > ipsec: Use RCU-like construct for saved state within a walk > > > > > > Now that we save states within a walk we need synchronisation > > > so that the list the saved state is on doesn't disappear from > > > under us. > > > > > > As it stands this is done by keeping the state on the list which > > > is bad because it gets in the way of the management of the state > > > life-cycle. > > > > > > An alternative is to make our own pseudo-RCU system where we use > > > counters to indicate which state can't be freed immediately as > > > it may be referenced by an ongoing walk when that resumes. > > > > There is only one reader at a time, right? Otherwise, I don't see how > > the increments and reads of xfrm_state_walk_completed line up. > > The RTNL semaphore is held across the modifications of the counters. Got it, thank you, sorry for the noise! Thanx, Paul ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-11 22:00 ` David Miller 2008-09-11 23:22 ` Paul E. McKenney @ 2008-09-12 16:08 ` Herbert Xu 2008-09-12 17:37 ` Paul E. McKenney 1 sibling, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-12 16:08 UTC (permalink / raw) To: David Miller; +Cc: paulmck, timo.teras, netdev On Thu, Sep 11, 2008 at 03:00:06PM -0700, David Miller wrote: > > > There is only one reader at a time, right? Otherwise, I don't see how > > the increments and reads of xfrm_state_walk_completed line up. > > The RTNL semaphore is held across the modifications of the counters. s/RTNL/xfrm_cfg_mutex/ but otherwise what Dave said :) Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-12 16:08 ` Herbert Xu @ 2008-09-12 17:37 ` Paul E. McKenney 0 siblings, 0 replies; 95+ messages in thread From: Paul E. McKenney @ 2008-09-12 17:37 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, timo.teras, netdev On Sat, Sep 13, 2008 at 02:08:25AM +1000, Herbert Xu wrote: > On Thu, Sep 11, 2008 at 03:00:06PM -0700, David Miller wrote: > > > > > There is only one reader at a time, right? Otherwise, I don't see how > > > the increments and reads of xfrm_state_walk_completed line up. > > > > The RTNL semaphore is held across the modifications of the counters. > > s/RTNL/xfrm_cfg_mutex/ > > but otherwise what Dave said :) Thank you for the updated information, and I guess I don't feel quite so bad about the false alarm. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-09 14:33 ` Herbert Xu ` (2 preceding siblings ...) 2008-09-11 21:24 ` Paul E. McKenney @ 2008-09-21 12:29 ` Timo Teräs 2008-09-21 15:21 ` Timo Teräs 3 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-21 12:29 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Herbert Xu wrote: > On Mon, Sep 08, 2008 at 05:25:13PM -0700, David Miller wrote: > I've avoided the memory barrier by simply extending the mutexed > section in the GC to cover the list splicing. Here's the updated > patch: > > ipsec: Use RCU-like construct for saved state within a walk > > Now that we save states within a walk we need synchronisation > so that the list the saved state is on doesn't disappear from > under us. > > As it stands this is done by keeping the state on the list which > is bad because it gets in the way of the management of the state > life-cycle. > > An alternative is to make our own pseudo-RCU system where we use > counters to indicate which state can't be freed immediately as > it may be referenced by an ongoing walk when that resumes. Doesn't this logic fail if: 1. completed = ongoing 2. 1st walk started and iterator kept (a->lastused = ongoing, ongoing++) 3. 2nd walk started and iterator kept (b->lastused = ongoing, ongoing++) 4. 2nd walk finished (completed++) 5. gc triggered: a gets deleted since a->lastused == completed 6. 1st walk continued, but freed memory is accessed as a was deleted Though currently it has no effect, since xfrm_state_hold/_put are still called when keeping the iterator, so the entries won't actually get garbage collected anyway. So the completed/ongoing counting is basically useless. Or am I missing something? Wouldn't it be enough to do the list_del_rcu on delete? And just keep the reference as before? - Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
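To make the numbered scenario above concrete, here is a minimal self-contained userspace model of the counter scheme (illustrative C only, not kernel code; it compresses "a walk parks and a state it may still reach is deleted" into a single step, exactly as the scenario does):

#include <assert.h>
#include <stdio.h>

static unsigned long ongoing;	/* models xfrm_state_walk_ongoing */
static unsigned long completed;	/* models xfrm_state_walk_completed */

struct state { unsigned long lastused; };

/* The GC in the patch destroys x once (long)(x->lastused - completed)
 * is no longer positive. */
static int gc_would_free(struct state *x)
{
	return (long)(x->lastused - completed) <= 0;
}

int main(void)
{
	struct state a, b;

	a.lastused = ongoing++;	/* step 2: 1st walk parks, a is deleted */
	b.lastused = ongoing++;	/* step 3: 2nd walk parks, b is deleted */
	completed++;		/* step 4: 2nd walk finishes first */

	/* step 5: the GC now considers 'a' freeable even though the
	 * 1st walk has not resumed yet -- the flaw in the scheme */
	assert(gc_would_free(&a));
	printf("gc would free 'a': yes\n");
	return 0;
}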
* Re: xfrm_state locking regression... 2008-09-21 12:29 ` Timo Teräs @ 2008-09-21 15:21 ` Timo Teräs 2008-09-22 11:42 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-21 15:21 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Timo Teräs wrote: > Herbert Xu wrote: >> On Mon, Sep 08, 2008 at 05:25:13PM -0700, David Miller wrote: >> I've avoided the memory barrier by simply extending the mutexed >> section in the GC to cover the list splicing. Here's the updated >> patch: >> >> ipsec: Use RCU-like construct for saved state within a walk >> >> Now that we save states within a walk we need synchronisation >> so that the list the saved state is on doesn't disappear from >> under us. >> >> As it stands this is done by keeping the state on the list which >> is bad because it gets in the way of the management of the state >> life-cycle. >> >> An alternative is to make our own pseudo-RCU system where we use >> counters to indicate which state can't be freed immediately as >> it may be referenced by an ongoing walk when that resumes. > > Doesn't this logic fail if: > 1. completed = ongoing > 2. 1st walk started and iterator kept (a->lastused = ongoing, ongoing++) > 3. 2nd walk started and iterator kept (b->lastused = ongoing, ongoing++) > 4. 2nd walk finished (completed++) > 5. gc triggered: a gets deleted since a->lastused == completed > 6. 1st walk continued, but freed memory is accessed as a was deleted > > Though currently it has no effect, since xfrm_state_hold/_put > are still called when keeping the iterator, so the entries won't > actually get garbage collected anyway. So the completed/ongoing > counting is basically useless. Or am I missing something? Obviously I missed the fact that it prevents deletion of the entries that the list node held by the iterator might still point to. But the flaw still exists: the entries that iterator->next points to can be deleted if there is a walk that takes a long time while other walks complete fast in the meantime. > Wouldn't it be enough to do the list_del_rcu on delete? And > just keep the reference as before? So if we do list_del_rcu() on delete, could we also xfrm_state_hold() the entry pointed to by that list entry? And then on GC we could xfrm_state_put() the next entry. - Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-21 15:21 ` Timo Teräs @ 2008-09-22 11:42 ` Herbert Xu 2008-09-22 13:01 ` Timo Teräs 2008-09-23 2:48 ` David Miller 0 siblings, 2 replies; 95+ messages in thread From: Herbert Xu @ 2008-09-22 11:42 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev On Sun, Sep 21, 2008 at 06:21:27PM +0300, Timo Teräs wrote: > > But the flaw still exists: the entries that iterator->next points to > can be deleted if there is a walk that takes a long time while other > walks complete fast in the meantime. Thanks, I guess that's why I'm not paid to work on RCU :) > So if we do list_del_rcu() on delete, could we also xfrm_state_hold() > the entry pointed to by that list entry? And then on GC we could > xfrm_state_put() the next entry. Unfortunately it's not that simple since we'll be in the same bind if the entry after the next entry gets deleted as well as the next entry. The only other solution is to go back to the original scheme where we keep the list intact until GC. However, nobody has come up with a way of doing that that allows the SA creation path to run locklessly with respect to the dumping path. So I think we'll have to stick with the quasi-RCU solution for now. Here's a patch to fix the flaw that you discovered. ipsec: Fix xfrm_state_walk race As discovered by Timo Teräs, the current xfrm_state_walk scheme is racy because if a second dump finishes before the first, we may free xfrm states that the first dump would walk over later. This patch fixes this by storing the dumps in a list in order to calculate the correct completion counter which cures this problem. I've expanded netlink_cb in order to accommodate the extra state related to this. It shouldn't be a big deal since netlink_cb is kmalloced for each dump and we're just increasing it by 4 or 8 bytes. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> diff --git a/include/linux/netlink.h b/include/linux/netlink.h index 9ff1b54..cbba776 100644 --- a/include/linux/netlink.h +++ b/include/linux/netlink.h @@ -220,7 +220,7 @@ struct netlink_callback int (*dump)(struct sk_buff * skb, struct netlink_callback *cb); int (*done)(struct netlink_callback *cb); int family; - long args[6]; + long args[7]; }; struct netlink_notify diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 4bb9499..48630b2 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -1246,6 +1246,8 @@ struct xfrm6_tunnel { }; struct xfrm_state_walk { + struct list_head list; + unsigned long genid; struct xfrm_state *state; int count; u8 proto; @@ -1281,13 +1283,7 @@ static inline void xfrm6_fini(void) extern int xfrm_proc_init(void); #endif -static inline void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto) -{ - walk->proto = proto; - walk->state = NULL; - walk->count = 0; -} - +extern void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto); extern int xfrm_state_walk(struct xfrm_state_walk *walk, int (*func)(struct xfrm_state *, int, void*), void *); extern void xfrm_state_walk_done(struct xfrm_state_walk *walk); diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index fe84a46..f7805c5 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -64,6 +64,9 @@ static unsigned long xfrm_state_walk_ongoing; /* Counter indicating walk completion, protected by xfrm_cfg_mutex. */ static unsigned long xfrm_state_walk_completed; +/* List of outstanding state walks used to set the completed counter. 
*/ +static LIST_HEAD(xfrm_state_walks); + static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned int family); static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo); @@ -1582,7 +1585,6 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, if (err) { xfrm_state_hold(last); walk->state = last; - xfrm_state_walk_ongoing++; goto out; } } @@ -1597,25 +1599,44 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, err = func(last, 0, data); out: spin_unlock_bh(&xfrm_state_lock); - if (old != NULL) { + if (old != NULL) xfrm_state_put(old); - xfrm_state_walk_completed++; - if (!list_empty(&xfrm_state_gc_leftovers)) - schedule_work(&xfrm_state_gc_work); - } return err; } EXPORT_SYMBOL(xfrm_state_walk); +void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto) +{ + walk->proto = proto; + walk->state = NULL; + walk->count = 0; + list_add_tail(&walk->list, &xfrm_state_walks); + walk->genid = ++xfrm_state_walk_ongoing; +} +EXPORT_SYMBOL(xfrm_state_walk_init); + void xfrm_state_walk_done(struct xfrm_state_walk *walk) { + struct list_head *prev; + if (walk->state != NULL) { xfrm_state_put(walk->state); walk->state = NULL; - xfrm_state_walk_completed++; - if (!list_empty(&xfrm_state_gc_leftovers)) - schedule_work(&xfrm_state_gc_work); } + + prev = walk->list.prev; + list_del(&walk->list); + + if (prev != &xfrm_state_walks) { + list_entry(prev, struct xfrm_state_walk, list)->genid = + walk->genid; + return; + } + + xfrm_state_walk_completed = walk->genid; + + if (!list_empty(&xfrm_state_gc_leftovers)) + schedule_work(&xfrm_state_gc_work); } EXPORT_SYMBOL(xfrm_state_walk_done); Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply related [flat|nested] 95+ messages in thread
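To see how the list-based bookkeeping above fixes the completion ordering, here is a self-contained userspace model of the init/done logic (a hand-rolled doubly linked list and simplified names standing in for the kernel structures; illustrative only):

#include <stdio.h>

struct walk { unsigned long genid; struct walk *prev, *next; };

static struct walk head = { 0, &head, &head };	/* models xfrm_state_walks */
static unsigned long ongoing, completed;

static void walk_init(struct walk *w)
{
	w->prev = head.prev;		/* add at the tail, like list_add_tail */
	w->next = &head;
	head.prev->next = w;
	head.prev = w;
	w->genid = ++ongoing;
}

static void walk_done(struct walk *w)
{
	struct walk *prev = w->prev;

	w->prev->next = w->next;	/* unlink, like list_del */
	w->next->prev = w->prev;

	if (prev != &head)
		prev->genid = w->genid;	/* an older walk is still running */
	else
		completed = w->genid;	/* oldest walk done: safe to advance */
}

int main(void)
{
	struct walk w1, w2;

	walk_init(&w1);		/* genid 1 */
	walk_init(&w2);		/* genid 2 */
	walk_done(&w2);		/* completed stays 0: w1 still outstanding */
	printf("completed after 2nd walk: %lu\n", completed);
	walk_done(&w1);		/* completed jumps straight to 2 */
	printf("completed after 1st walk: %lu\n", completed);
	return 0;
}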
* Re: xfrm_state locking regression... 2008-09-22 11:42 ` Herbert Xu @ 2008-09-22 13:01 ` Timo Teräs 2008-09-22 23:50 ` Herbert Xu 2008-09-23 2:48 ` David Miller 1 sibling, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-22 13:01 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Herbert Xu wrote: > On Sun, Sep 21, 2008 at 06:21:27PM +0300, Timo Teräs wrote: >> So if we do list_del_rcu() on delete, could we also xfrm_state_hold() >> the entry pointed to by that list entry. And then on GC we could >> xfrm_state_put() the next entry. > > Unfortunately it's not that simple since we'll be in the same > bind if the entry after the next entry gets deleted as well as > the next entry. Well, I was thinking that we hold the next pointer. And when continuing the dump, we can first skip all entries that are marked as dead (each next pointer is valid since each of the next pointers are held once). When we find the first valid entry to dump we _put() the originally held entry. That would recursively _put() all the next entries which were held. > The only other solution is to go back to the original scheme > where we keep the list intact until GC. However, nobody has > come up with a way of doing that that allows the SA creation > path to run locklessly with respect to the dumping path. > > So I think we'll have to stick with the quasi-RCU solution for > now. Here's a patch to fix the flaw that you discovered. But yes, this would work as well. Not sure which one would be faster. I guess the holding of individual entries would be at least more memory efficient. Cheers, Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-22 13:01 ` Timo Teräs @ 2008-09-22 23:50 ` Herbert Xu 2008-09-23 4:53 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-22 23:50 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev On Mon, Sep 22, 2008 at 04:01:14PM +0300, Timo Teräs wrote: > > > Unfortunately it's not that simple since we'll be in the same > > bind if the entry after the next entry gets deleted as well as > > the next entry. > > Well, I was thinking that we hold the next pointer. And when > continuing the dump, we can first skip all entries that are marked > as dead (each next pointer is valid since each of the next pointers > are held once). When we find the first valid entry to dump we > _put() the originally held entry. That would recursively _put() all > the next entries which were held. No that doesn't work. Let's say we store the entry X in walk->state, and we hold X as well as X->next. Now X, X->next, and X->next->next get deleted from the list. What'll happen is that X and X->next will stick around but X->next->next will be freed. So when we resume from X we'll dump X and X->next correctly, but then hit X->next->next and be in the same shithole. Really, the only way to do what you want here is to hold everything from X to the very end of the list. The problem with that is that you can no longer drop the reference counts when we resume. You can see why even if we only hold X and X->next. Should X->next be deleted from the list, when we resume from X we'll no longer be able to drop the reference count on the original X->next since it now points to a new entry. > Not sure which one would be faster. I guess the holding of > individual entries would be at least more memory efficient. I think this is a lot more efficient than storing every single entry from X to the end of the list :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-22 23:50 ` Herbert Xu @ 2008-09-23 4:53 ` Timo Teräs 2008-09-23 4:59 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-23 4:53 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Herbert Xu wrote: > On Mon, Sep 22, 2008 at 04:01:14PM +0300, Timo Teräs wrote: >>> Unfortunately it's not that simple since we'll be in the same >>> bind if the entry after the next entry gets deleted as well as >>> the next entry. >> Well, I was thinking that we hold the next pointer. And when >> continuing the dump, we can first skip all entries that are marked >> as dead (each next pointer is valid since each of the next pointers >> are held once). When we find the first valid entry to dump we >> _put() the originally held entry. That would recursively _put() all >> the next entries which were held. > > No that doesn't work. Let's say we store the entry X in walk->state, > and we hold X as well as X->next. Now X, X->next, and X->next->next > get deleted from the list. What'll happen is that X and X->next > will stick around but X->next->next will be freed. So when we > resume from X we'll dump X and X->next correctly, but then hit > X->next->next and be in the same shithole. I think it would work. Here are the scenarios: We hold X as dumping is interrupted there. X->next points statically to some non-deleted entry and is held. Now, if X->next gets deleted, it's marked dead and X->next->next is held too. Thus when there are multiple deleted entries in a chain, the whole chain is held recursively/iteratively. When walking is continued on X, the first thing we do is skip all dead entries from X; after that we put X, and that would trigger put() for all X->next:s which were held. If X->next is not deleted, and X->next->next gets deleted, the X->next list structure is updated correctly by list_del_rcu and the entry can actually be freed even if the walking didn't iterate that entry (it would be skipped anyway as it's marked dead on deletion). So the idea was to hold X->next from the deletion function, not from the walking function. That is, we always hold deleted->next when there are ongoing walks. And on final _put() we _put() the ->next entry. I think that would work. - Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
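For concreteness, a self-contained userspace model of the hold-the-next-on-delete idea just described (a plain integer refcount standing in for xfrm_state_hold()/xfrm_state_put(); the next pointer is captured at delete time rather than read from a real list, so all names here are illustrative):

#include <stdio.h>
#include <stdlib.h>

struct node {
	char name;
	int refcnt;		/* stands in for the state refcount */
	struct node *held_next;	/* reference taken at delete time */
};

static void put(struct node *n);

static void destroy(struct node *n)
{
	struct node *next = n->held_next;

	printf("destroy %c\n", n->name);
	free(n);
	if (next)
		put(next);	/* drop the hold taken by del() */
}

static void put(struct node *n)
{
	if (--n->refcnt == 0)
		destroy(n);
}

static void del(struct node *n, struct node *next)
{
	n->held_next = next;	/* keep n->next valid for parked walks */
	if (next)
		next->refcnt++;
}

static struct node *mk(char name)
{
	struct node *n = calloc(1, sizeof(*n));

	n->name = name;
	n->refcnt = 1;		/* the deleting path's own reference */
	return n;
}

int main(void)
{
	struct node *a = mk('a'), *b = mk('b'), *c = mk('c');

	a->refcnt++;		/* a parked walk holds a */
	del(a, b);		/* a, b and c all deleted while the walk sleeps */
	del(b, c);
	del(c, NULL);
	put(a);			/* the deleting path drops its references... */
	put(b);
	put(c);			/* ...but the held chain keeps everything alive */
	put(a);			/* the walk resumes: the whole chain is freed, once each */
	return 0;
}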
* Re: xfrm_state locking regression... 2008-09-23 4:53 ` Timo Teräs @ 2008-09-23 4:59 ` Herbert Xu 2008-09-23 5:17 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-23 4:59 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev On Tue, Sep 23, 2008 at 07:53:18AM +0300, Timo Teräs wrote: > > So the idea was to hold X->next from deletion function, not from > the walking function. That would be, we always hold deleted->next when > there are ongoing walks. And on final _put() we _put() the ->next > entry. > > I think that would work. Right, this would work. However, now you're again slowing down the relatively fast path of state deletion by moving extra work there from the non-critical dump path. When we optimise code for time, we don't necessarily try to minimise the work overall. We instead try to minimise the amount of work on the critical paths. It's OK to let the other paths such as dumping do a bit more work if it improves the fast paths. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 4:59 ` Herbert Xu @ 2008-09-23 5:17 ` Timo Teräs 2008-09-23 5:22 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-23 5:17 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Herbert Xu wrote: > On Tue, Sep 23, 2008 at 07:53:18AM +0300, Timo Teräs wrote: >> So the idea was to hold X->next from the deletion function, not from >> the walking function. That is, we always hold deleted->next when >> there are ongoing walks. And on final _put() we _put() the ->next >> entry. >> >> I think that would work. > > Right, this would work. > > However, now you're again slowing down the relatively fast path of > state deletion by moving extra work there from the non-critical > dump path. The extra step there would be a hold call. The recursive _put on a _put of some entry can happen only on the dump path, as otherwise the ->next entry is first held in state delete but would be immediately _put again via the final _put of _delete(). So basically one additional atomic_inc() and one atomic_dec() on the normal _delete() path. > When we optimise code for time, we don't necessarily try to minimise > the work overall. We instead try to minimise the amount of work on > the critical paths. It's OK to let the other paths such as dumping > do a bit more work if it improves the fast paths. Not sure about the overall penalty, but yes I know it would have some penalty. But at least there would be no locking. In particular, the thing that can happen with your approach is that if a userland process gets suspended during dump processing, it would prevent the garbage collection of all entries for all eternity until that process is continued or killed. So this would allow deletion and GC of entries even during walking. But this was just a thought. Maybe it's not worth trying. Cheers, Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 5:17 ` Timo Teräs @ 2008-09-23 5:22 ` Herbert Xu 2008-09-23 6:25 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-23 5:22 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev On Tue, Sep 23, 2008 at 08:17:14AM +0300, Timo Teräs wrote: > > The extra step there wold be a hold call. The recursive _put on a > _put of some entry can happen only on dump path. As otherwise the > ->next entry is first held in state delete, but would be immediately > _put on the _put as the final step of _delete(). > > So basically one additional atomic_inc() and one atomic_dec() on the > normal _delete() path. Can you post a patch? If this can be done locklessly then yes it would probably be a good way to go. However, I'm not sure whether I understand your lockless proposal yet. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 5:22 ` Herbert Xu @ 2008-09-23 6:25 ` Timo Teräs 2008-09-23 6:47 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-23 6:25 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Herbert Xu wrote: > On Tue, Sep 23, 2008 at 08:17:14AM +0300, Timo Teräs wrote: >> The extra step there wold be a hold call. The recursive _put on a >> _put of some entry can happen only on dump path. As otherwise the >> ->next entry is first held in state delete, but would be immediately >> _put on the _put as the final step of _delete(). >> >> So basically one additional atomic_inc() and one atomic_dec() on the >> normal _delete() path. > > Can you post a patch? If this can be done locklessly then yes > it would probably be a good way to go. However, I'm not sure > whether I understand your lockless proposal yet. Something like this. I just compile tested, so I'm not sure if it actually works. This basically reverts the GC changes. The interesting bits are in xfrm_state_delete(). It will _hold() the ->next entry unless we were at last entry. This will make sure that when an entry is in XFRM_STATE_DEAD the all.next is valid all the way until the entry is destroyed. The corresponding put is in _destroy(). Added it as a final thing to do so hopefully compiler optimizes tail recursion to iteration. (Seemed to do that in my case.) And finally, _walk() was right all along. It returns to dumping and first skips all dead entries. And just before return, when all dead entries are already skipped the originally held entry is _put(). That _put call (or the one in _walk_done()) will result all dead entries that were held, to be iteratively put. - Timo diff --git a/include/linux/netlink.h b/include/linux/netlink.h index cbba776..9ff1b54 100644 --- a/include/linux/netlink.h +++ b/include/linux/netlink.h @@ -220,7 +220,7 @@ struct netlink_callback int (*dump)(struct sk_buff * skb, struct netlink_callback *cb); int (*done)(struct netlink_callback *cb); int family; - long args[7]; + long args[6]; }; struct netlink_notify diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 48630b2..d5cba7b 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -122,7 +122,7 @@ struct xfrm_state { struct list_head all; union { - struct list_head gclist; + struct hlist_node gclist; struct hlist_node bydst; }; struct hlist_node bysrc; @@ -1246,8 +1246,6 @@ struct xfrm6_tunnel { }; struct xfrm_state_walk { - struct list_head list; - unsigned long genid; struct xfrm_state *state; int count; u8 proto; diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 053970e..f45f006 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -59,14 +59,6 @@ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024; static unsigned int xfrm_state_num; static unsigned int xfrm_state_genid; -/* Counter indicating ongoing walk, protected by xfrm_state_lock. */ -static unsigned long xfrm_state_walk_ongoing; -/* Counter indicating walk completion, protected by xfrm_cfg_mutex. */ -static unsigned long xfrm_state_walk_completed; - -/* List of outstanding state walks used to set the completed counter. 
*/ -static LIST_HEAD(xfrm_state_walks); - static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned int family); static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo); @@ -199,8 +191,7 @@ static DEFINE_RWLOCK(xfrm_state_afinfo_lock); static struct xfrm_state_afinfo *xfrm_state_afinfo[NPROTO]; static struct work_struct xfrm_state_gc_work; -static LIST_HEAD(xfrm_state_gc_leftovers); -static LIST_HEAD(xfrm_state_gc_list); +static HLIST_HEAD(xfrm_state_gc_list); static DEFINE_SPINLOCK(xfrm_state_gc_lock); int __xfrm_state_delete(struct xfrm_state *x); @@ -412,23 +403,17 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x) static void xfrm_state_gc_task(struct work_struct *data) { - struct xfrm_state *x, *tmp; - unsigned long completed; + struct xfrm_state *x; + struct hlist_node *entry, *tmp; + struct hlist_head gc_list; - mutex_lock(&xfrm_cfg_mutex); spin_lock_bh(&xfrm_state_gc_lock); - list_splice_tail_init(&xfrm_state_gc_list, &xfrm_state_gc_leftovers); + gc_list.first = xfrm_state_gc_list.first; + INIT_HLIST_HEAD(&xfrm_state_gc_list); spin_unlock_bh(&xfrm_state_gc_lock); - completed = xfrm_state_walk_completed; - mutex_unlock(&xfrm_cfg_mutex); - - list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) { - if ((long)(x->lastused - completed) > 0) - break; - list_del(&x->gclist); + hlist_for_each_entry_safe(x, entry, tmp, &gc_list, gclist) xfrm_state_gc_destroy(x); - } wake_up(&km_waitq); } @@ -553,12 +538,19 @@ EXPORT_SYMBOL(xfrm_state_alloc); void __xfrm_state_destroy(struct xfrm_state *x) { + struct xfrm_state *next = NULL; + WARN_ON(x->km.state != XFRM_STATE_DEAD); + if (x->all.next != &xfrm_state_all) + next = container_of(x->all.next, struct xfrm_state, all); spin_lock_bh(&xfrm_state_gc_lock); - list_add_tail(&x->gclist, &xfrm_state_gc_list); + hlist_add_head(&x->gclist, &xfrm_state_gc_list); spin_unlock_bh(&xfrm_state_gc_lock); schedule_work(&xfrm_state_gc_work); + + if (next != NULL) + xfrm_state_put(next); } EXPORT_SYMBOL(__xfrm_state_destroy); @@ -569,8 +561,10 @@ int __xfrm_state_delete(struct xfrm_state *x) if (x->km.state != XFRM_STATE_DEAD) { x->km.state = XFRM_STATE_DEAD; spin_lock(&xfrm_state_lock); - x->lastused = xfrm_state_walk_ongoing; list_del_rcu(&x->all); + if (x->all.next != &xfrm_state_all) + xfrm_state_hold(container_of(x->all.next, + struct xfrm_state, all)); hlist_del(&x->bydst); hlist_del(&x->bysrc); if (x->id.spi) @@ -1612,33 +1606,15 @@ void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto) walk->proto = proto; walk->state = NULL; walk->count = 0; - list_add_tail(&walk->list, &xfrm_state_walks); - walk->genid = ++xfrm_state_walk_ongoing; } EXPORT_SYMBOL(xfrm_state_walk_init); void xfrm_state_walk_done(struct xfrm_state_walk *walk) { - struct list_head *prev; - if (walk->state != NULL) { xfrm_state_put(walk->state); walk->state = NULL; } - - prev = walk->list.prev; - list_del(&walk->list); - - if (prev != &xfrm_state_walks) { - list_entry(prev, struct xfrm_state_walk, list)->genid = - walk->genid; - return; - } - - xfrm_state_walk_completed = walk->genid; - - if (!list_empty(&xfrm_state_gc_leftovers)) - schedule_work(&xfrm_state_gc_work); } EXPORT_SYMBOL(xfrm_state_walk_done); ^ permalink raw reply related [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 6:25 ` Timo Teräs @ 2008-09-23 6:47 ` Herbert Xu 2008-09-23 6:56 ` Timo Teräs 2008-09-23 9:39 ` Timo Teräs 0 siblings, 2 replies; 95+ messages in thread From: Herbert Xu @ 2008-09-23 6:47 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev On Tue, Sep 23, 2008 at 09:25:16AM +0300, Timo Teräs wrote: > > Something like this. I just compile tested, so I'm not sure if it > actually works. It is quite clever :) > The corresponding put is in _destroy(). Added it as a final thing > to do so hopefully compiler optimizes tail recursion to iteration. > (Seemed to do that in my case.) Yeah we should open code this to be absolutely sure that we don't get killed by a stupid compiler. However, I think this is still open to the same problem that my patch had, i.e., if a dumper goes to sleep during the dump we may prevent entries from being freed indefinitely. Yes your idea is better in that we may only withhold say a fraction (depending on the order of deletion it could be anywhere from none to every entry) of the entries instead of all of them, but fundamentally the same issue is still there. Considering the fact that dumps require root privileges I'm not convinced as of yet that this is worth it. Hmm, could we perhaps go back to your original scheme of keeping everything on the list and see if we can use RCU to make it lockless instead? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
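For illustration, here is one way the tail call in Timo's __xfrm_state_destroy() above could be open-coded as a loop. This is only a sketch, assuming the era's usual definition of xfrm_state_put() as an atomic_dec_and_test() wrapper around __xfrm_state_destroy(); it was never posted as a patch:

void __xfrm_state_destroy(struct xfrm_state *x)
{
	struct xfrm_state *next;

	for (;;) {
		WARN_ON(x->km.state != XFRM_STATE_DEAD);

		/* capture the held neighbour before x is queued for gc */
		next = NULL;
		if (x->all.next != &xfrm_state_all)
			next = container_of(x->all.next,
					    struct xfrm_state, all);

		spin_lock_bh(&xfrm_state_gc_lock);
		hlist_add_head(&x->gclist, &xfrm_state_gc_list);
		spin_unlock_bh(&xfrm_state_gc_lock);
		schedule_work(&xfrm_state_gc_work);

		/* drop the hold taken at delete time; loop instead of
		 * letting xfrm_state_put() recurse back into us */
		if (next == NULL || !atomic_dec_and_test(&next->refcnt))
			break;
		x = next;
	}
}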
* Re: xfrm_state locking regression... 2008-09-23 6:47 ` Herbert Xu @ 2008-09-23 6:56 ` Timo Teräs 2008-09-23 9:39 ` Timo Teräs 1 sibling, 0 replies; 95+ messages in thread From: Timo Teräs @ 2008-09-23 6:56 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Herbert Xu wrote: > On Tue, Sep 23, 2008 at 09:25:16AM +0300, Timo Teräs wrote: >> Something like this. I just compile tested, so I'm not sure if it >> actually works. > > It is quite clever :) Thanks :) And actually, we can make _delete() do normal list_del() and not hold the next pointer if there are no running dumps. This is the common case anyway. Then there would be no penalty at all. >> The corresponding put is in _destroy(). Added it as a final thing >> to do so hopefully compiler optimizes tail recursion to iteration. >> (Seemed to do that in my case.) > > Yeah we should open code this to be absolutely sure that we don't > get killed by a stupid compiler. Right. The patch was just to show the idea. Will do a proper patch with this and the above mentioned optimization if you think we should change to this until we get really cool lockless dumping. > However, I think this is still open to the same problem that my > patch had, i.e., if a dumper goes to sleep during the dump we may > prevent entries from being freed indefinitely. Yes your idea > is better in that we may only withhold say a fraction (depending > on the order of deletion it could be anywhere from none to every > entry) of the entries instead of all of them, but fundamentally > the same issue is still there. > > Considering the fact that dumps require root privileges I'm not > convinced as of yet that this is worth it. You are right. It can keep memory allocated. And in worst case you end up keeping all the deleted entries (entries are deleted in the order they appear in all list). But I would not be to worried about it either. > Hmm, could we perhaps go back to your original scheme of keeping > everything on the list and see if we can use RCU to make it lockless > instead? That would be ideal of course. - Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 6:47 ` Herbert Xu 2008-09-23 6:56 ` Timo Teräs @ 2008-09-23 9:39 ` Timo Teräs 2008-09-23 11:24 ` Herbert Xu 1 sibling, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-23 9:39 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Herbert Xu wrote: > However, I think this is still open to the same problem that my > patch had, i.e., if a dumper goes to sleep during the dump we may > prevent entries from being freed indefinitely. Yes your idea > is better in that we may only withhold say a fraction (depending > on the order of deletion it could be anywhere from none to every > entry) of the entries instead of all of them, but fundamentally > the same issue is still there. > > Considering the fact that dumps require root privileges I'm not > convinced as of yet that this is worth it. > > Hmm, could we perhaps go back to your original scheme of keeping > everything on the list and see if we can use RCU to make it lockless > instead? How about this: we keep the list of walks as your latest patch does. When walking is interrupted, we do not hold the entry, we just store the pointer in the walk iterator. When the entry is deleted from the lists we go through the walk contexts, and if someone is pointing to the entry being deleted, we just update it to the next entry. The list of walkers can be protected by xfrm_state_lock. That needs to be taken anyway for accessing/modifying the other lists. Most of the time there are no walks active, so the penalty for _destroy() is minimal. This would make it possible to reclaim the deleted entries right away. Does this sound better? Cheers, Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 9:39 ` Timo Teräs @ 2008-09-23 11:24 ` Herbert Xu 2008-09-23 12:08 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-23 11:24 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev On Tue, Sep 23, 2008 at 12:39:51PM +0300, Timo Teräs wrote: > > This would make it possibly to reclaim the deleted entries right > away. > > Does this sound better? Yep this sounds pretty good to me. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 11:24 ` Herbert Xu @ 2008-09-23 12:08 ` Timo Teräs 2008-09-23 12:14 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-23 12:08 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Herbert Xu wrote: > On Tue, Sep 23, 2008 at 12:39:51PM +0300, Timo Teräs wrote: >> This would make it possible to reclaim the deleted entries right >> away. >> >> Does this sound better? > > Yep this sounds pretty good to me. > > Thanks, So the patch would look something like this. Compile tested. Will test later when it becomes possible to reboot my box. Cheers, Timo ipsec: Fix up xfrm_state_walk.state on node deletion Now that we track xfrm_state_walks, it makes more sense to fix up the state pointer on deletion of the node if needed. This allows all entries to be deleted immediately, instead of possibly waiting for userland. Also fixes the locking of the xfrm_state_walks list handling. Signed-off-by: Timo Teras <timo.teras@iki.fi> --- include/linux/netlink.h | 2 +- include/net/xfrm.h | 3 +- net/xfrm/xfrm_state.c | 77 +++++++++++++++++----------------------------- 3 files changed, 31 insertions(+), 51 deletions(-) diff --git a/include/linux/netlink.h b/include/linux/netlink.h index cbba776..9ff1b54 100644 --- a/include/linux/netlink.h +++ b/include/linux/netlink.h @@ -220,7 +220,7 @@ struct netlink_callback int (*dump)(struct sk_buff * skb, struct netlink_callback *cb); int (*done)(struct netlink_callback *cb); int family; - long args[7]; + long args[6]; }; struct netlink_notify diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 48630b2..7f787c7 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -122,7 +122,7 @@ struct xfrm_state { struct list_head all; union { - struct list_head gclist; + struct hlist_node gclist; struct hlist_node bydst; }; struct hlist_node bysrc; @@ -1247,7 +1247,6 @@ struct xfrm6_tunnel { struct xfrm_state_walk { struct list_head list; - unsigned long genid; struct xfrm_state *state; int count; u8 proto; diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 053970e..636f7ee 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -59,11 +59,6 @@ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024; static unsigned int xfrm_state_num; static unsigned int xfrm_state_genid; -/* Counter indicating ongoing walk, protected by xfrm_state_lock. */ -static unsigned long xfrm_state_walk_ongoing; -/* Counter indicating walk completion, protected by xfrm_cfg_mutex. */ -static unsigned long xfrm_state_walk_completed; - /* List of outstanding state walks used to set the completed counter. 
*/ static LIST_HEAD(xfrm_state_walks); @@ -199,8 +194,7 @@ static DEFINE_RWLOCK(xfrm_state_afinfo_lock); static struct xfrm_state_afinfo *xfrm_state_afinfo[NPROTO]; static struct work_struct xfrm_state_gc_work; -static LIST_HEAD(xfrm_state_gc_leftovers); -static LIST_HEAD(xfrm_state_gc_list); +static HLIST_HEAD(xfrm_state_gc_list); static DEFINE_SPINLOCK(xfrm_state_gc_lock); int __xfrm_state_delete(struct xfrm_state *x); @@ -412,23 +406,16 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x) static void xfrm_state_gc_task(struct work_struct *data) { - struct xfrm_state *x, *tmp; - unsigned long completed; + struct xfrm_state *x; + struct hlist_node *entry, *tmp; + struct hlist_head gc_list; - mutex_lock(&xfrm_cfg_mutex); spin_lock_bh(&xfrm_state_gc_lock); - list_splice_tail_init(&xfrm_state_gc_list, &xfrm_state_gc_leftovers); + hlist_move_list(&xfrm_state_gc_list, &gc_list); spin_unlock_bh(&xfrm_state_gc_lock); - completed = xfrm_state_walk_completed; - mutex_unlock(&xfrm_cfg_mutex); - - list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) { - if ((long)(x->lastused - completed) > 0) - break; - list_del(&x->gclist); + hlist_for_each_entry_safe(x, entry, tmp, &gc_list, gclist) xfrm_state_gc_destroy(x); - } wake_up(&km_waitq); } @@ -556,7 +543,7 @@ void __xfrm_state_destroy(struct xfrm_state *x) WARN_ON(x->km.state != XFRM_STATE_DEAD); spin_lock_bh(&xfrm_state_gc_lock); - list_add_tail(&x->gclist, &xfrm_state_gc_list); + hlist_add_head(&x->gclist, &xfrm_state_gc_list); spin_unlock_bh(&xfrm_state_gc_lock); schedule_work(&xfrm_state_gc_work); } @@ -564,13 +551,22 @@ EXPORT_SYMBOL(__xfrm_state_destroy); int __xfrm_state_delete(struct xfrm_state *x) { + struct xfrm_state_walk *walk; + struct xfrm_state *next; int err = -ESRCH; if (x->km.state != XFRM_STATE_DEAD) { x->km.state = XFRM_STATE_DEAD; spin_lock(&xfrm_state_lock); - x->lastused = xfrm_state_walk_ongoing; - list_del_rcu(&x->all); + if (list_is_last(&x->all, &xfrm_state_all)) + next = NULL; + else + next = container_of(x->all.next, struct xfrm_state, all); + list_for_each_entry(walk, &xfrm_state_walks, list) { + if (walk->state == x) + walk->state = next; + } + list_del(&x->all); hlist_del(&x->bydst); hlist_del(&x->bysrc); if (x->id.spi) @@ -1566,15 +1562,16 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, int (*func)(struct xfrm_state *, int, void*), void *data) { - struct xfrm_state *old, *x, *last = NULL; + struct xfrm_state *x, *last = NULL; int err = 0; if (walk->state == NULL && walk->count != 0) return 0; - old = x = walk->state; - walk->state = NULL; spin_lock_bh(&xfrm_state_lock); + x = walk->state; + walk->state = NULL; + if (x == NULL) x = list_first_entry(&xfrm_state_all, struct xfrm_state, all); list_for_each_entry_from(x, &xfrm_state_all, all) { @@ -1585,7 +1582,6 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, if (last) { err = func(last, walk->count, data); if (err) { - xfrm_state_hold(last); walk->state = last; goto out; } @@ -1601,8 +1597,7 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, err = func(last, 0, data); out: spin_unlock_bh(&xfrm_state_lock); - if (old != NULL) - xfrm_state_put(old); + return err; } EXPORT_SYMBOL(xfrm_state_walk); @@ -1612,33 +1607,19 @@ void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto) walk->proto = proto; walk->state = NULL; walk->count = 0; + + spin_lock_bh(&xfrm_state_lock); list_add_tail(&walk->list, &xfrm_state_walks); - walk->genid = ++xfrm_state_walk_ongoing; + spin_unlock_bh(&xfrm_state_lock); } EXPORT_SYMBOL(xfrm_state_walk_init); 
void xfrm_state_walk_done(struct xfrm_state_walk *walk) { - struct list_head *prev; - - if (walk->state != NULL) { - xfrm_state_put(walk->state); - walk->state = NULL; - } - - prev = walk->list.prev; + spin_lock_bh(&xfrm_state_lock); list_del(&walk->list); - - if (prev != &xfrm_state_walks) { - list_entry(prev, struct xfrm_state_walk, list)->genid = - walk->genid; - return; - } - - xfrm_state_walk_completed = walk->genid; - - if (!list_empty(&xfrm_state_gc_leftovers)) - schedule_work(&xfrm_state_gc_work); + spin_unlock_bh(&xfrm_state_lock); + walk->state = NULL; } EXPORT_SYMBOL(xfrm_state_walk_done); -- 1.5.4.3 ^ permalink raw reply related [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 12:08 ` Timo Teräs @ 2008-09-23 12:14 ` Herbert Xu 2008-09-23 12:25 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-23 12:14 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev On Tue, Sep 23, 2008 at 03:08:08PM +0300, Timo Teräs wrote: > > + list_for_each_entry(walk, &xfrm_state_walks, list) { > + if (walk->state == x) > + walk->state = next; > + } Just make a new list in xfrm_state that has all the dumpers sitting on it. In fact all you need is a hlist, or even just a pointer. Walking through an unbounded list on the fast path is not good :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 12:14 ` Herbert Xu @ 2008-09-23 12:25 ` Timo Teräs 2008-09-23 12:56 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-23 12:25 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Herbert Xu wrote: > On Tue, Sep 23, 2008 at 03:08:08PM +0300, Timo Teräs wrote: >> + list_for_each_entry(walk, &xfrm_state_walks, list) { >> + if (walk->state == x) >> + walk->state = next; >> + } > > Just make a new list in xfrm_state that has all the dumpers sitting > on it. In fact all you need is a hlist, or even just a pointer. There can be a lot of xfrm_state structs. As in thousands. So it will take more memory then. hlist would be enough, so it'd be a 4/8 byte addition. A single pointer would not be enough as we can have multiple walks on the same entry. > Walking through an unbounded list on the fast path is not good :) I was thinking the same thing. That's why I wasn't so sure if this is better. In practice there aren't many walks active. Maybe we could limit the maximum number of simultaneous walks? But in real life, there are only one or two walkers active, if any. So I would not consider a global update too expensive. But if you think it might become a problem, I'd add an hlist to xfrm_state of walkers referencing that entry. - Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 12:25 ` Timo Teräs @ 2008-09-23 12:56 ` Herbert Xu 2008-09-23 13:01 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-23 12:56 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev On Tue, Sep 23, 2008 at 03:25:41PM +0300, Timo Teräs wrote: > > There can be a lot of xfrm_state structs. As in thousands. So it > will take more memory then. hlist would be enough, so it'd be a > 4/8 bytes addition. Single pointer would not be enough as we can > have multiple walks on the same entry. Right, we need doubly linked lists in the walkers to ease unlinking. But only a single pointer (i.e., an hlist) is needed in xfrm_state. > I was thinking about the same. That's why I wasn't so sure if this is > better. In practice there aren't many walks active. Maybe we limit > the maximum simultaneous walks? That sounds like a hack :) > But in real life, there is only a one or two walkers if any active. > So I would not consider a global update too expensive. But if you think > it might become a problem, I'd add an hlist to xfrm_state of walkers > referencing that entry. Well in real life dumps don't tend to get suspended either :) In any case, a suspended dump is only going to pin down some memory, while a very long list walk may kill the box. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
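For reference, a data-layout sketch of the hlist-in-xfrm_state idea being discussed. It was never posted as a patch in this thread, so every name below is hypothetical; the point is that deletion would only touch the walkers parked on the dying entry:

/* hypothetical node embedded in struct xfrm_state_walk: */
struct xfrm_state_walk {
	struct hlist_node by_state;	/* linkage on state->walkers */
	struct xfrm_state *state;	/* where the dump is parked */
	int count;
	u8 proto;
};

/* hypothetical single-pointer addition to struct xfrm_state: */
/*	struct hlist_head walkers; */

/* on delete, with xfrm_state_lock held, move the parked walkers
 * from the dying entry to its successor (or to NULL at the tail): */
static void xfrm_state_fixup_walkers(struct xfrm_state *x,
				     struct xfrm_state *next)
{
	struct xfrm_state_walk *w;
	struct hlist_node *pos, *tmp;

	hlist_for_each_entry_safe(w, pos, tmp, &x->walkers, by_state) {
		hlist_del(&w->by_state);
		w->state = next;
		if (next)
			hlist_add_head(&w->by_state, &next->walkers);
	}
}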
* Re: xfrm_state locking regression... 2008-09-23 12:56 ` Herbert Xu @ 2008-09-23 13:01 ` Timo Teräs 2008-09-23 13:07 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-23 13:01 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Herbert Xu wrote: > On Tue, Sep 23, 2008 at 03:25:41PM +0300, Timo Teräs wrote: >> There can be a lot of xfrm_state structs. As in thousands. So it >> will take more memory then. hlist would be enough, so it'd be a >> 4/8 byte addition. A single pointer would not be enough as we can >> have multiple walks on the same entry. > > Right, we need doubly linked lists in the walkers to ease unlinking. > But only a single pointer (i.e., an hlist) is needed in xfrm_state. Right. >> I was thinking the same thing. That's why I wasn't so sure if this is >> better. In practice there aren't many walks active. Maybe we could limit >> the maximum number of simultaneous walks? > > That sounds like a hack :) Yes :) >> But in real life, there are only one or two walkers active, if any. >> So I would not consider a global update too expensive. But if you think >> it might become a problem, I'd add an hlist to xfrm_state of walkers >> referencing that entry. > > Well in real life dumps don't tend to get suspended either :) > In any case, a suspended dump is only going to pin down some > memory, while a very long list walk may kill the box. So, what to do? 1. Go back to: list_del_rcu, xfrm_state_hold(all.next) on delete and xfrm_state_put(all.next) on destruct. 2. Add per-entry hlist of walkers currently referencing it. 3. Use the global walker list. 1 can keep memory allocated until userland wakes up. 2 & 3 can make the delete of that entry slow if there are many walkers suspended. Btw. the current stuff in net-next is broken. There's no locking for the xfrm_state_walks list handling. Cheers, Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 13:01 ` Timo Teräs @ 2008-09-23 13:07 ` Herbert Xu 2008-09-23 13:30 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-23 13:07 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev On Tue, Sep 23, 2008 at 04:01:29PM +0300, Timo Teräs wrote: > > So, what to do? > 1. Go back to: list_del_rcu, xfrm_state_hold(all.next) on delete and > xfrm_state_put(all.next) on destruct. > 2. Add per-entry hlist of walkers currently referencing it. > 3. Use the global walker list. > > 1 can keep memory allocated until userland wakes up. 2 & 3 can make > the delete of that entry slow if there's many walkers suspended. I'd cross 3 off the list because 2 is just so much better :) I'd slightly lean towards 2 but now that you mention it yes even that is vulnerable to loads of dumpers sitting on the same entry. So SELINUX folks wouldn't like that :) > Btw. the current stuff in net-next is broken. There's no locking > for xfrm_state_walkers list handling. What about xfrm_cfg_mutex? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 13:07 ` Herbert Xu @ 2008-09-23 13:30 ` Timo Teräs 2008-09-23 13:32 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-23 13:30 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Herbert Xu wrote: > On Tue, Sep 23, 2008 at 04:01:29PM +0300, Timo Teräs wrote: >> So, what to do? >> 1. Go back to: list_del_rcu, xfrm_state_hold(all.next) on delete and >> xfrm_state_put(all.next) on destruct. >> 2. Add per-entry hlist of walkers currently referencing it. >> 3. Use the global walker list. >> >> 1 can keep memory allocated until userland wakes up. 2 & 3 can make >> the delete of that entry slow if there are many walkers suspended. > > I'd cross 3 off the list because 2 is just so much better :) > > I'd slightly lean towards 2 but now that you mention it yes even > that is vulnerable to loads of dumpers sitting on the same entry. > So SELINUX folks wouldn't like that :) Umm... right. It's a tricky problem. Can't think of a perfect solution atm. But I guess 3 is the best in the general case, while in worst-case scenarios 1 performs better. I have no strong opinion either way, so whatever you want, I'm happy to provide a patch for it. >> Btw. the current stuff in net-next is broken. There's no locking >> for the xfrm_state_walks list handling. > > What about xfrm_cfg_mutex? It's used only in xfrm_state_gc_task. xfrm_state_walk_{init,done} touch the xfrm_state_walks list without proper locking. At least in the version I'm looking at (= net-next-2.6 via the git web interface). - Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 13:30 ` Timo Teräs @ 2008-09-23 13:32 ` Herbert Xu 2008-09-23 13:46 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-23 13:32 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev On Tue, Sep 23, 2008 at 04:30:06PM +0300, Timo Teräs wrote: > > I have no strong opinion either way. So what ever you want, I'm > happy to provide a patch for. Let's just leave it as it is for now. > > What about xfrm_cfg_mutex? > > It's used only in xfrm_state_gc_task. xfrm_state_walk_{init,done} > touch xfrm_state_walks list without locking properly. At least in > the version I'm looking (= net-next-2.6 via git web interface). Their callers should be holding it. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-23 13:32 ` Herbert Xu @ 2008-09-23 13:46 ` Timo Teräs 2008-09-24 4:23 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-23 13:46 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev Herbert Xu wrote: > On Tue, Sep 23, 2008 at 04:30:06PM +0300, Timo Teräs wrote: >> I have no strong opinion either way. So what ever you want, I'm >> happy to provide a patch for. > > Let's just leave it as it is for now. Well, 1 is an improvement over the current implementation, so I think it'd be better than not doing anything. >>> What about xfrm_cfg_mutex? >> It's used only in xfrm_state_gc_task. xfrm_state_walk_{init,done} >> touch xfrm_state_walks list without locking properly. At least in >> the version I'm looking (= net-next-2.6 via git web interface). > > Their callers should be holding it. Ah, the other layers take it at least on _walk_init paths. But _walk_done can be called from recv() syscalls. The af_key implementation does not take xfrm_cfg_mutex there. I don't think xfrm_user does that either as it does not pass cb_mutex to netlink_kernel_create. So at least the _state_walk_done path is unsafe as-is, I think. - Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
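The shape the fix eventually takes in the patches further down is to make the walk termination self-contained: the walker lives on a list protected by xfrm_state_lock, and _walk_done takes that lock itself instead of assuming the caller holds xfrm_cfg_mutex. A minimal sketch of that direction, with the lock/unlock pairing spelled out:

	void xfrm_state_walk_done(struct xfrm_state_walk *walk)
	{
		if (list_empty(&walk->all))
			return;			/* walk already ran to completion */
		spin_lock_bh(&xfrm_state_lock);
		list_del(&walk->all);		/* unlink the walker from the dump list */
		spin_unlock_bh(&xfrm_state_lock);
	}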
* Re: xfrm_state locking regression... 2008-09-23 13:46 ` Timo Teräs @ 2008-09-24 4:23 ` Herbert Xu 2008-09-24 5:14 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-24 4:23 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev, jamal On Tue, Sep 23, 2008 at 04:46:31PM +0300, Timo Teräs wrote: > > Ah, the other layers take it at least on _walk_init paths. But > _walk_done can be called from recv() syscalls. The af_key > implementation does not take xfrm_cfg_mutex there. I don't think > xfrm_user does that either as it does not pass cb_mutex to > netlink_kernel_create. So at least the _state_walk_done path > is unsafe as-is, I think. OK we'd need to fix that up. However, I've got a new idea :) Let's put the dumpers on the list directly and then we won't have to deal any of this crap about states going away. Warning: compile tested only! ipsec: Put dumpers on the dump list As it is we go to extraordinary lengths to ensure that states don't go away while dumpers go to sleep. It's much easier if we just put the dumpers themselves on the list since they can't go away while they're going. I've also changed the order of addition on new states to prevent a never-ending dump. Finally the obsolete last optimisation is now gone. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> diff --git a/include/linux/netlink.h b/include/linux/netlink.h index cbba776..9ff1b54 100644 --- a/include/linux/netlink.h +++ b/include/linux/netlink.h @@ -220,7 +220,7 @@ struct netlink_callback int (*dump)(struct sk_buff * skb, struct netlink_callback *cb); int (*done)(struct netlink_callback *cb); int family; - long args[7]; + long args[6]; }; struct netlink_notify diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 48630b2..17f9494 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -120,9 +120,8 @@ extern struct mutex xfrm_cfg_mutex; /* Full description of state of transformer. */ struct xfrm_state { - struct list_head all; union { - struct list_head gclist; + struct hlist_node gclist; struct hlist_node bydst; }; struct hlist_node bysrc; @@ -138,6 +137,8 @@ struct xfrm_state /* Key manger bits */ struct { + /* These two fields correspond to xfrm_state_walk. */ + struct list_head all; u8 state; u8 dying; u32 seq; @@ -1246,11 +1247,11 @@ struct xfrm6_tunnel { }; struct xfrm_state_walk { - struct list_head list; - unsigned long genid; - struct xfrm_state *state; - int count; + /* These two fields correspond to xfrm_state. */ + struct list_head all; + u8 state; u8 proto; + int count; }; struct xfrm_policy_walk { diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index f7805c5..ff3bb24 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -59,14 +59,6 @@ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024; static unsigned int xfrm_state_num; static unsigned int xfrm_state_genid; -/* Counter indicating ongoing walk, protected by xfrm_state_lock. */ -static unsigned long xfrm_state_walk_ongoing; -/* Counter indicating walk completion, protected by xfrm_cfg_mutex. */ -static unsigned long xfrm_state_walk_completed; - -/* List of outstanding state walks used to set the completed counter. 
*/ -static LIST_HEAD(xfrm_state_walks); - static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned int family); static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo); @@ -199,8 +191,7 @@ static DEFINE_RWLOCK(xfrm_state_afinfo_lock); static struct xfrm_state_afinfo *xfrm_state_afinfo[NPROTO]; static struct work_struct xfrm_state_gc_work; -static LIST_HEAD(xfrm_state_gc_leftovers); -static LIST_HEAD(xfrm_state_gc_list); +static HLIST_HEAD(xfrm_state_gc_list); static DEFINE_SPINLOCK(xfrm_state_gc_lock); int __xfrm_state_delete(struct xfrm_state *x); @@ -412,23 +403,16 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x) static void xfrm_state_gc_task(struct work_struct *data) { - struct xfrm_state *x, *tmp; - unsigned long completed; + struct xfrm_state *x; + struct hlist_node *entry, *tmp; + struct hlist_head gc_list; - mutex_lock(&xfrm_cfg_mutex); spin_lock_bh(&xfrm_state_gc_lock); - list_splice_tail_init(&xfrm_state_gc_list, &xfrm_state_gc_leftovers); + hlist_move_list(&xfrm_state_gc_list, &gc_list); spin_unlock_bh(&xfrm_state_gc_lock); - completed = xfrm_state_walk_completed; - mutex_unlock(&xfrm_cfg_mutex); - - list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) { - if ((long)(x->lastused - completed) > 0) - break; - list_del(&x->all); + hlist_for_each_entry_safe(x, entry, tmp, &gc_list, gclist) xfrm_state_gc_destroy(x); - } wake_up(&km_waitq); } @@ -529,7 +513,7 @@ struct xfrm_state *xfrm_state_alloc(void) if (x) { atomic_set(&x->refcnt, 1); atomic_set(&x->tunnel_users, 0); - INIT_LIST_HEAD(&x->all); + INIT_LIST_HEAD(&x->km.all); INIT_HLIST_NODE(&x->bydst); INIT_HLIST_NODE(&x->bysrc); INIT_HLIST_NODE(&x->byspi); @@ -556,7 +540,7 @@ void __xfrm_state_destroy(struct xfrm_state *x) WARN_ON(x->km.state != XFRM_STATE_DEAD); spin_lock_bh(&xfrm_state_gc_lock); - list_add_tail(&x->gclist, &xfrm_state_gc_list); + hlist_add_head(&x->gclist, &xfrm_state_gc_list); spin_unlock_bh(&xfrm_state_gc_lock); schedule_work(&xfrm_state_gc_work); } @@ -569,8 +553,7 @@ int __xfrm_state_delete(struct xfrm_state *x) if (x->km.state != XFRM_STATE_DEAD) { x->km.state = XFRM_STATE_DEAD; spin_lock(&xfrm_state_lock); - x->lastused = xfrm_state_walk_ongoing; - list_del_rcu(&x->all); + list_del(&x->km.all); hlist_del(&x->bydst); hlist_del(&x->bysrc); if (x->id.spi) @@ -939,7 +922,7 @@ static void __xfrm_state_insert(struct xfrm_state *x) x->genid = ++xfrm_state_genid; - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); h = xfrm_dst_hash(&x->id.daddr, &x->props.saddr, x->props.reqid, x->props.family); @@ -1564,79 +1547,54 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, int (*func)(struct xfrm_state *, int, void*), void *data) { - struct xfrm_state *old, *x, *last = NULL; + struct xfrm_state *x; int err = 0; - if (walk->state == NULL && walk->count != 0) - return 0; - - old = x = walk->state; - walk->state = NULL; spin_lock_bh(&xfrm_state_lock); - if (x == NULL) - x = list_first_entry(&xfrm_state_all, struct xfrm_state, all); - list_for_each_entry_from(x, &xfrm_state_all, all) { + if (list_empty(&walk->all)) + x = list_first_entry(&xfrm_state_all, struct xfrm_state, km.all); + else + x = list_entry(&walk->all, struct xfrm_state, km.all); + list_for_each_entry_from(x, &xfrm_state_all, km.all) { if (x->km.state == XFRM_STATE_DEAD) continue; if (!xfrm_id_proto_match(x->id.proto, walk->proto)) continue; - if (last) { - err = func(last, walk->count, data); - if (err) { - xfrm_state_hold(last); - walk->state = last; - goto out; - } + err = func(x, 
walk->count, data); + if (err) { + list_move_tail(&walk->all, &x->km.all); + goto out; } - last = x; walk->count++; } if (walk->count == 0) { err = -ENOENT; goto out; } - if (last) - err = func(last, 0, data); + list_del_init(&walk->all); out: spin_unlock_bh(&xfrm_state_lock); - if (old != NULL) - xfrm_state_put(old); return err; } EXPORT_SYMBOL(xfrm_state_walk); void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto) { + INIT_LIST_HEAD(&walk->all); walk->proto = proto; - walk->state = NULL; + walk->state = XFRM_STATE_DEAD; walk->count = 0; - list_add_tail(&walk->list, &xfrm_state_walks); - walk->genid = ++xfrm_state_walk_ongoing; } EXPORT_SYMBOL(xfrm_state_walk_init); void xfrm_state_walk_done(struct xfrm_state_walk *walk) { - struct list_head *prev; - - if (walk->state != NULL) { - xfrm_state_put(walk->state); - walk->state = NULL; - } - - prev = walk->list.prev; - list_del(&walk->list); - - if (prev != &xfrm_state_walks) { - list_entry(prev, struct xfrm_state_walk, list)->genid = - walk->genid; + if (list_empty(&walk->all)) return; - } - - xfrm_state_walk_completed = walk->genid; - if (!list_empty(&xfrm_state_gc_leftovers)) - schedule_work(&xfrm_state_gc_work); + spin_lock_bh(&xfrm_state_lock); + list_del(&walk->all); + spin_unlock_bh(&xfrm_state_lock); } EXPORT_SYMBOL(xfrm_state_walk_done); Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply related [flat|nested] 95+ messages in thread
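The heart of the patch above: struct xfrm_state_walk and the km member of struct xfrm_state begin with the same two fields, so walkers and real states can sit on the same xfrm_state_all list. A walker marks itself with XFRM_STATE_DEAD, which every traversal already skips, so nothing else ever treats it as a state. Schematically (a restatement of the layout above, not extra code):

	/* every node on xfrm_state_all starts with this shape */
	struct list_head all;	/* the list linkage */
	u8 state;		/* XFRM_STATE_DEAD for walkers (and dead states) */

	/* so a traversal can simply do: */
	if (x->km.state == XFRM_STATE_DEAD)
		continue;	/* either a parked walker or a dying state */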
* Re: xfrm_state locking regression... 2008-09-24 4:23 ` Herbert Xu @ 2008-09-24 5:14 ` Timo Teräs 2008-09-24 5:15 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-24 5:14 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev, jamal Herbert Xu wrote: > On Tue, Sep 23, 2008 at 04:46:31PM +0300, Timo Teräs wrote: >> Ah, the other layers take it at least on _walk_init paths. But >> _walk_done can be called from recv() syscalls. The af_key >> implementation does not take xfrm_cfg_mutex there. I don't think >> xfrm_user does that either as it does not pass cb_mutex to >> netlink_kernel_create. So at least the _state_walk_done path >> is unsafe as-is, I think. > > OK we'd need to fix that up. > > However, I've got a new idea :) Let's put the dumpers on the > list directly and then we won't have to deal with any of this crap > about states going away. Now this is an interesting idea... I like this a lot. My only comment is that there really should be a struct for the shared stuff. Otherwise it's bound to break at some point. Cheers, Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-24 5:14 ` Timo Teräs @ 2008-09-24 5:15 ` Herbert Xu 2008-09-24 5:46 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-24 5:15 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev, jamal On Wed, Sep 24, 2008 at 08:14:11AM +0300, Timo Teräs wrote: > > My only comment is that there really should be a struct for the > shared stuff. Otherwise it's bound to break at some point. That's why I put comments there. If people change it without reading the comments, they'll ignore the struct too :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-24 5:15 ` Herbert Xu @ 2008-09-24 5:46 ` Timo Teräs 2008-09-24 5:55 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-24 5:46 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev, jamal Herbert Xu wrote: > On Wed, Sep 24, 2008 at 08:14:11AM +0300, Timo Teräs wrote: >> My only comment is that there really should be a struct for the >> shared stuff. Otherwise it's bound to break at some point. > > That's why I put comments there. If people change it without > reading the comments, they'll ignore the struct too :) Well, it's also because in the dump routine the entries are cast to xfrm_state, even if it was a walker entry. This is just wrong, though it probably works since only the specific entries are used. There should be some intermediate struct which we use to iterate in the dumping routine, plus a check whether it is an iterator or a real entry (can still be based on the state thing). And only after that cast to struct xfrm_state. It would make the dumping routine more readable. Cheers, Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-24 5:46 ` Timo Teräs @ 2008-09-24 5:55 ` Herbert Xu 2008-09-24 6:04 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-24 5:55 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev, jamal On Wed, Sep 24, 2008 at 08:46:09AM +0300, Timo Teräs wrote: > > Well, it's also because in the dump routine the entries > are cast to xfrm_state, even if it was a walker entry. This > is just wrong, though it probably works since only the specific > entries are used. No it works because all walker entries set the state to DEAD so they're skipped by everybody else. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-24 5:55 ` Herbert Xu @ 2008-09-24 6:04 ` Timo Teräs 2008-09-24 6:13 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-24 6:04 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev, jamal Herbert Xu wrote: > On Wed, Sep 24, 2008 at 08:46:09AM +0300, Timo Teräs wrote: >> Well, it's also because in the dump routine the entries >> are cast to xfrm_state, even if it was a walker entry. This >> is just wrong, though it probably works since only the specific >> entries are used. > > No it works because all walker entries set the state to DEAD > so they're skipped by everybody else. Yes, I know. I was pointing out the fact that the walker function iterates using struct xfrm_state. So temporarily, while it is iterating through the walker entry, we get a struct xfrm_state pointer which points to some place before the struct xfrm_state_walk. Now since the km.state is checked first, those are skipped. I find it very confusing to let the code say "iterate through a list of struct xfrm_state" when it is not such a list. It is a list of struct xfrm_state or struct xfrm_state_walk. So I'd use some intermediate struct so the code can say e.g. "iterate through a list of struct xfrm_state_dump_entry" or whatever. Or at least add a comment to the dumping function to say that we have struct xfrm_state, but in matter of fact it can be also struct xfrm_state_walk pointer with displacement, so we better check km.state first. Cheers, Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
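What Timo is asking for, and what his reworked patch further down does, is to iterate over the shared prefix type and convert with container_of() only once the node is known to be a real state (a sketch matching that later patch):

	struct xfrm_state_walk *w;
	struct xfrm_state *state;

	list_for_each_entry_from(w, &xfrm_state_all, all) {
		if (w->state == XFRM_STATE_DEAD)
			continue;	/* parked walker or dead state: skip */
		/* only now is the node known to be a full xfrm_state */
		state = container_of(w, struct xfrm_state, km);
		...
	}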
* Re: xfrm_state locking regression... 2008-09-24 6:04 ` Timo Teräs @ 2008-09-24 6:13 ` Herbert Xu 2008-09-24 6:20 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-24 6:13 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev, jamal On Wed, Sep 24, 2008 at 09:04:30AM +0300, Timo Teräs wrote: > > Or at least add a comment to the dumping function to say that > we have struct xfrm_state, but in matter of fact it can be > also struct xfrm_state_walk pointer with displacement, so we > better check km.state first. Which is exactly what we do. The first thing we check in the loop is km.state. I really don't see your problem. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-24 6:13 ` Herbert Xu @ 2008-09-24 6:20 ` Timo Teräs 2008-09-24 6:21 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-24 6:20 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev, jamal Herbert Xu wrote: > On Wed, Sep 24, 2008 at 09:04:30AM +0300, Timo Teräs wrote: >> Or at least add a comment to the dumping function to say that >> we have struct xfrm_state, but in matter of fact it can be >> also struct xfrm_state_walk pointer with displacement, so we >> better check km.state first. > > Which is exactly what we do. The first thing we check in the > loop is km.state. I really don't see your problem. Yes. I'm not complaining that the code does not work. Just saying that the code easily misleads the reader. And in this kind of non-obvious places we should have some comments. Or make the code more readable by adding the intermediate struct. Cheers, Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-24 6:20 ` Timo Teräs @ 2008-09-24 6:21 ` Herbert Xu 2008-09-24 7:29 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-24 6:21 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev, jamal On Wed, Sep 24, 2008 at 09:20:10AM +0300, Timo Teräs wrote: > > Just saying that the code easily misleads the reader. And in > this kind of non-obvious places we should have some comments. > Or make the code more readable by adding the intermediate struct. Fair enough. Feel free to reformat the patch and add the struct to make it more bullet-proof. I'm sorry I've got some paid work to do now :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-24 6:21 ` Herbert Xu @ 2008-09-24 7:29 ` Timo Teräs 2008-09-24 7:54 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-24 7:29 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev, jamal Herbert Xu wrote: > On Wed, Sep 24, 2008 at 09:20:10AM +0300, Timo Teräs wrote: >> Just saying that the code easily misleads the reader. And in >> this kind of non-obvious places we should have some comments. >> Or make the code more readable by adding the intermediate struct. > > Fair enough. Feel free to reformat the patch and add the struct > to make it more bullet-proof. I'm sorry I've got some paid work > to do now :) Ok, here's one try. I also updated the list_add() calls introduced in commit "ipsec: Restore larval states and socket policies in dump". Same warning still applies: only compile tested. We should probably do similar fix for the SPD dumping. I can make a patch for that later today. - Timo ipsec: Put dumpers on the dump list Based on Herbert Xu's patch and idea. Improved the original patch to apply cleanly and modified iteration code to be more readable by using a common struct for entries in all list. As it is we go to extraordinary lengths to ensure that states don't go away while dumpers go to sleep. It's much easier if we just put the dumpers themselves on the list since they can't go away while they're going. I've also changed the order of addition on new states to prevent a never-ending dump. Finally the obsolete last optimisation is now gone. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Timo Teras <timo.teras@iki.fi> diff --git a/include/linux/netlink.h b/include/linux/netlink.h index cbba776..9ff1b54 100644 --- a/include/linux/netlink.h +++ b/include/linux/netlink.h @@ -220,7 +220,7 @@ struct netlink_callback int (*dump)(struct sk_buff * skb, struct netlink_callback *cb); int (*done)(struct netlink_callback *cb); int family; - long args[7]; + long args[6]; }; struct netlink_notify diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 48630b2..becaa1e 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -117,12 +117,21 @@ extern struct mutex xfrm_cfg_mutex; metrics. Plus, it will be made via sk->sk_dst_cache. Solved. */ +struct xfrm_state_walk { + struct list_head all; + u8 state; + union { + u8 dying; + u8 proto; + }; + u32 seq; +}; + /* Full description of state of transformer. */ struct xfrm_state { - struct list_head all; union { - struct list_head gclist; + struct hlist_node gclist; struct hlist_node bydst; }; struct hlist_node bysrc; @@ -136,12 +145,8 @@ struct xfrm_state u32 genid; - /* Key manger bits */ - struct { - u8 state; - u8 dying; - u32 seq; - } km; + /* Key manager bits */ + struct xfrm_state_walk km; /* Parameters of this state. */ struct { @@ -1245,14 +1250,6 @@ struct xfrm6_tunnel { int priority; }; -struct xfrm_state_walk { - struct list_head list; - unsigned long genid; - struct xfrm_state *state; - int count; - u8 proto; -}; - struct xfrm_policy_walk { struct xfrm_policy *policy; int count; diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 053970e..f2deae0 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -59,14 +59,6 @@ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024; static unsigned int xfrm_state_num; static unsigned int xfrm_state_genid; -/* Counter indicating ongoing walk, protected by xfrm_state_lock. 
*/ -static unsigned long xfrm_state_walk_ongoing; -/* Counter indicating walk completion, protected by xfrm_cfg_mutex. */ -static unsigned long xfrm_state_walk_completed; - -/* List of outstanding state walks used to set the completed counter. */ -static LIST_HEAD(xfrm_state_walks); - static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned int family); static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo); @@ -199,8 +191,7 @@ static DEFINE_RWLOCK(xfrm_state_afinfo_lock); static struct xfrm_state_afinfo *xfrm_state_afinfo[NPROTO]; static struct work_struct xfrm_state_gc_work; -static LIST_HEAD(xfrm_state_gc_leftovers); -static LIST_HEAD(xfrm_state_gc_list); +static HLIST_HEAD(xfrm_state_gc_list); static DEFINE_SPINLOCK(xfrm_state_gc_lock); int __xfrm_state_delete(struct xfrm_state *x); @@ -412,23 +403,16 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x) static void xfrm_state_gc_task(struct work_struct *data) { - struct xfrm_state *x, *tmp; - unsigned long completed; + struct xfrm_state *x; + struct hlist_node *entry, *tmp; + struct hlist_head gc_list; - mutex_lock(&xfrm_cfg_mutex); spin_lock_bh(&xfrm_state_gc_lock); - list_splice_tail_init(&xfrm_state_gc_list, &xfrm_state_gc_leftovers); + hlist_move_list(&xfrm_state_gc_list, &gc_list); spin_unlock_bh(&xfrm_state_gc_lock); - completed = xfrm_state_walk_completed; - mutex_unlock(&xfrm_cfg_mutex); - - list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) { - if ((long)(x->lastused - completed) > 0) - break; - list_del(&x->gclist); + hlist_for_each_entry_safe(x, entry, tmp, &gc_list, gclist) xfrm_state_gc_destroy(x); - } wake_up(&km_waitq); } @@ -529,7 +513,7 @@ struct xfrm_state *xfrm_state_alloc(void) if (x) { atomic_set(&x->refcnt, 1); atomic_set(&x->tunnel_users, 0); - INIT_LIST_HEAD(&x->all); + INIT_LIST_HEAD(&x->km.all); INIT_HLIST_NODE(&x->bydst); INIT_HLIST_NODE(&x->bysrc); INIT_HLIST_NODE(&x->byspi); @@ -556,7 +540,7 @@ void __xfrm_state_destroy(struct xfrm_state *x) WARN_ON(x->km.state != XFRM_STATE_DEAD); spin_lock_bh(&xfrm_state_gc_lock); - list_add_tail(&x->gclist, &xfrm_state_gc_list); + hlist_add_head(&x->gclist, &xfrm_state_gc_list); spin_unlock_bh(&xfrm_state_gc_lock); schedule_work(&xfrm_state_gc_work); } @@ -569,8 +553,7 @@ int __xfrm_state_delete(struct xfrm_state *x) if (x->km.state != XFRM_STATE_DEAD) { x->km.state = XFRM_STATE_DEAD; spin_lock(&xfrm_state_lock); - x->lastused = xfrm_state_walk_ongoing; - list_del_rcu(&x->all); + list_del(&x->km.all); hlist_del(&x->bydst); hlist_del(&x->bysrc); if (x->id.spi) @@ -871,7 +854,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr, if (km_query(x, tmpl, pol) == 0) { x->km.state = XFRM_STATE_ACQ; - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); hlist_add_head(&x->bydst, xfrm_state_bydst+h); h = xfrm_src_hash(daddr, saddr, family); hlist_add_head(&x->bysrc, xfrm_state_bysrc+h); @@ -940,7 +923,7 @@ static void __xfrm_state_insert(struct xfrm_state *x) x->genid = ++xfrm_state_genid; - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); h = xfrm_dst_hash(&x->id.daddr, &x->props.saddr, x->props.reqid, x->props.family); @@ -1069,7 +1052,7 @@ static struct xfrm_state *__find_acq_core(unsigned short family, u8 mode, u32 re xfrm_state_hold(x); x->timer.expires = jiffies + sysctl_xfrm_acq_expires*HZ; add_timer(&x->timer); - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); hlist_add_head(&x->bydst, xfrm_state_bydst+h); h = 
xfrm_src_hash(daddr, saddr, family); hlist_add_head(&x->bysrc, xfrm_state_bysrc+h); @@ -940,7 +923,7 @@ static void __xfrm_state_insert(struct xfrm_state *x) x->genid = ++xfrm_state_genid; - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); h = xfrm_dst_hash(&x->id.daddr, &x->props.saddr, x->props.reqid, x->props.family); @@ -1069,7 +1052,7 @@ static struct xfrm_state *__find_acq_core(unsigned short family, u8 mode, u32 re xfrm_state_hold(x); x->timer.expires = jiffies + sysctl_xfrm_acq_expires*HZ; add_timer(&x->timer); - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); hlist_add_head(&x->bydst, xfrm_state_bydst+h); h = xfrm_src_hash(daddr, saddr, family); hlist_add_head(&x->bysrc, xfrm_state_bysrc+h); @@ -1566,79 +1549,56 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, int (*func)(struct xfrm_state *, int, void*), void *data) { - struct xfrm_state *old, *x, *last = NULL; + struct xfrm_state *state; + struct xfrm_state_walk *x; int err = 0; - if (walk->state == NULL && walk->count != 0) - return 0; - - old = x = walk->state; - walk->state = NULL; spin_lock_bh(&xfrm_state_lock); - if (x == NULL) - x = list_first_entry(&xfrm_state_all, struct xfrm_state, all); + if (list_empty(&walk->all)) + x = list_first_entry(&xfrm_state_all, struct xfrm_state_walk, all); + else + x = list_entry(&walk->all, struct xfrm_state_walk, all); list_for_each_entry_from(x, &xfrm_state_all, all) { - if (x->km.state == XFRM_STATE_DEAD) + if (x->state == XFRM_STATE_DEAD) continue; - if (!xfrm_id_proto_match(x->id.proto, walk->proto)) + state = container_of(x, struct xfrm_state, km); + if (!xfrm_id_proto_match(state->id.proto, walk->proto)) continue; - if (last) { - err = func(last, walk->count, data); - if (err) { - xfrm_state_hold(last); - walk->state = last; - goto out; - } + err = func(state, walk->seq, data); + if (err) { + list_move_tail(&walk->all, &x->all); + goto out; } - last = x; - walk->count++; + walk->seq++; } - if (walk->count == 0) { + if (walk->seq == 0) { err = -ENOENT; goto out; } - if (last) - err = func(last, 0, data); + list_del_init(&walk->all); out: spin_unlock_bh(&xfrm_state_lock); - if (old != NULL) - xfrm_state_put(old); return err; } EXPORT_SYMBOL(xfrm_state_walk); void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto) { + INIT_LIST_HEAD(&walk->all); walk->proto = proto; - walk->state = NULL; - walk->count = 0; - list_add_tail(&walk->list, &xfrm_state_walks); - walk->genid = ++xfrm_state_walk_ongoing; + walk->state = XFRM_STATE_DEAD; + walk->seq = 0; } EXPORT_SYMBOL(xfrm_state_walk_init); void xfrm_state_walk_done(struct xfrm_state_walk *walk) { - struct list_head *prev; - - if (walk->state != NULL) { - xfrm_state_put(walk->state); - walk->state = NULL; - } - - prev = walk->list.prev; - list_del(&walk->list); - - if (prev != &xfrm_state_walks) { - list_entry(prev, struct xfrm_state_walk, list)->genid = - walk->genid; + if (list_empty(&walk->all)) return; - } - - xfrm_state_walk_completed = walk->genid; - if (!list_empty(&xfrm_state_gc_leftovers)) - schedule_work(&xfrm_state_gc_work); + spin_lock_bh(&xfrm_state_lock); + list_del(&walk->all); + spin_unlock_bh(&xfrm_state_lock); } EXPORT_SYMBOL(xfrm_state_walk_done); ^ permalink raw reply related [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-24 7:29 ` Timo Teräs @ 2008-09-24 7:54 ` Herbert Xu 2008-09-24 13:18 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-24 7:54 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev, jamal On Wed, Sep 24, 2008 at 10:29:15AM +0300, Timo Teräs wrote: > > Same warning still applies: only compile tested. Looks good to me, I especially like the count/seq bit. But I'm sure Dave would appreciate it if we actually ran this too :) Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-24 7:54 ` Herbert Xu @ 2008-09-24 13:18 ` Timo Teräs 2008-09-24 14:08 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-24 13:18 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev, jamal Herbert Xu wrote: > On Wed, Sep 24, 2008 at 10:29:15AM +0300, Timo Teräs wrote: >> Same warning still applies: only compile tested. > > Looks good to me, I especially like the count/seq bit. But I'm > sure Dave would appreciate it if we actually ran this too :) Gave it a test spin and found one bug. The walking routine will get called one extra time when it has successfully dumped everything so we need to check for that in the beginning; the earlier version would start all over from the beginning, resulting in never-ending dumps. So the patch is now as follows. I have an almost ready patch for xfrm_policy dumping too. Will finish it and send tomorrow. - Timo ipsec: Put xfrm_state dumpers on the dump list Based on Herbert Xu's patch and idea. Improved the original patch to apply cleanly and modified iteration code to be more readable by using a common struct for entries in all list. As it is we go to extraordinary lengths to ensure that states don't go away while dumpers go to sleep. It's much easier if we just put the dumpers themselves on the list since they can't go away while they're going. I've also changed the order of addition on new states to prevent a never-ending dump. Finally the obsolete last optimisation is now gone. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Timo Teras <timo.teras@iki.fi> --- include/linux/netlink.h | 2 +- include/net/xfrm.h | 29 ++++++------- net/xfrm/xfrm_state.c | 109 +++++++++++++++------------------------------- 3 files changed, 50 insertions(+), 90 deletions(-) diff --git a/include/linux/netlink.h b/include/linux/netlink.h index cbba776..9ff1b54 100644 --- a/include/linux/netlink.h +++ b/include/linux/netlink.h @@ -220,7 +220,7 @@ struct netlink_callback int (*dump)(struct sk_buff * skb, struct netlink_callback *cb); int (*done)(struct netlink_callback *cb); int family; - long args[7]; + long args[6]; }; struct netlink_notify diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 48630b2..becaa1e 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -117,12 +117,21 @@ extern struct mutex xfrm_cfg_mutex; metrics. Plus, it will be made via sk->sk_dst_cache. Solved. */ +struct xfrm_state_walk { + struct list_head all; + u8 state; + union { + u8 dying; + u8 proto; + }; + u32 seq; +}; + /* Full description of state of transformer. */ struct xfrm_state { - struct list_head all; union { - struct list_head gclist; + struct hlist_node gclist; struct hlist_node bydst; }; struct hlist_node bysrc; @@ -136,12 +145,8 @@ struct xfrm_state u32 genid; - /* Key manger bits */ - struct { - u8 state; - u8 dying; - u32 seq; - } km; + /* Key manager bits */ + struct xfrm_state_walk km; /* Parameters of this state. 
*/ struct { @@ -1245,14 +1250,6 @@ struct xfrm6_tunnel { int priority; }; -struct xfrm_state_walk { - struct list_head list; - unsigned long genid; - struct xfrm_state *state; - int count; - u8 proto; -}; - struct xfrm_policy_walk { struct xfrm_policy *policy; int count; diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 053970e..747fd8c 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -59,14 +59,6 @@ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024; static unsigned int xfrm_state_num; static unsigned int xfrm_state_genid; -/* Counter indicating ongoing walk, protected by xfrm_state_lock. */ -static unsigned long xfrm_state_walk_ongoing; -/* Counter indicating walk completion, protected by xfrm_cfg_mutex. */ -static unsigned long xfrm_state_walk_completed; - -/* List of outstanding state walks used to set the completed counter. */ -static LIST_HEAD(xfrm_state_walks); - static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned int family); static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo); @@ -199,8 +191,7 @@ static DEFINE_RWLOCK(xfrm_state_afinfo_lock); static struct xfrm_state_afinfo *xfrm_state_afinfo[NPROTO]; static struct work_struct xfrm_state_gc_work; -static LIST_HEAD(xfrm_state_gc_leftovers); -static LIST_HEAD(xfrm_state_gc_list); +static HLIST_HEAD(xfrm_state_gc_list); static DEFINE_SPINLOCK(xfrm_state_gc_lock); int __xfrm_state_delete(struct xfrm_state *x); @@ -412,23 +403,16 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x) static void xfrm_state_gc_task(struct work_struct *data) { - struct xfrm_state *x, *tmp; - unsigned long completed; + struct xfrm_state *x; + struct hlist_node *entry, *tmp; + struct hlist_head gc_list; - mutex_lock(&xfrm_cfg_mutex); spin_lock_bh(&xfrm_state_gc_lock); - list_splice_tail_init(&xfrm_state_gc_list, &xfrm_state_gc_leftovers); + hlist_move_list(&xfrm_state_gc_list, &gc_list); spin_unlock_bh(&xfrm_state_gc_lock); - completed = xfrm_state_walk_completed; - mutex_unlock(&xfrm_cfg_mutex); - - list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) { - if ((long)(x->lastused - completed) > 0) - break; - list_del(&x->gclist); + hlist_for_each_entry_safe(x, entry, tmp, &gc_list, gclist) xfrm_state_gc_destroy(x); - } wake_up(&km_waitq); } @@ -529,7 +513,7 @@ struct xfrm_state *xfrm_state_alloc(void) if (x) { atomic_set(&x->refcnt, 1); atomic_set(&x->tunnel_users, 0); - INIT_LIST_HEAD(&x->all); + INIT_LIST_HEAD(&x->km.all); INIT_HLIST_NODE(&x->bydst); INIT_HLIST_NODE(&x->bysrc); INIT_HLIST_NODE(&x->byspi); @@ -556,7 +540,7 @@ void __xfrm_state_destroy(struct xfrm_state *x) WARN_ON(x->km.state != XFRM_STATE_DEAD); spin_lock_bh(&xfrm_state_gc_lock); - list_add_tail(&x->gclist, &xfrm_state_gc_list); + hlist_add_head(&x->gclist, &xfrm_state_gc_list); spin_unlock_bh(&xfrm_state_gc_lock); schedule_work(&xfrm_state_gc_work); } @@ -569,8 +553,7 @@ int __xfrm_state_delete(struct xfrm_state *x) if (x->km.state != XFRM_STATE_DEAD) { x->km.state = XFRM_STATE_DEAD; spin_lock(&xfrm_state_lock); - x->lastused = xfrm_state_walk_ongoing; - list_del_rcu(&x->all); + list_del(&x->km.all); hlist_del(&x->bydst); hlist_del(&x->bysrc); if (x->id.spi) @@ -871,7 +854,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr, if (km_query(x, tmpl, pol) == 0) { x->km.state = XFRM_STATE_ACQ; - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); hlist_add_head(&x->bydst, xfrm_state_bydst+h); h = xfrm_src_hash(daddr, saddr, family); 
hlist_add_head(&x->bysrc, xfrm_state_bysrc+h); @@ -940,7 +923,7 @@ static void __xfrm_state_insert(struct xfrm_state *x) x->genid = ++xfrm_state_genid; - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); h = xfrm_dst_hash(&x->id.daddr, &x->props.saddr, x->props.reqid, x->props.family); @@ -1069,7 +1052,7 @@ static struct xfrm_state *__find_acq_core(unsigned short family, u8 mode, u32 re xfrm_state_hold(x); x->timer.expires = jiffies + sysctl_xfrm_acq_expires*HZ; add_timer(&x->timer); - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); hlist_add_head(&x->bydst, xfrm_state_bydst+h); h = xfrm_src_hash(daddr, saddr, family); hlist_add_head(&x->bysrc, xfrm_state_bysrc+h); @@ -1566,79 +1549,59 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, int (*func)(struct xfrm_state *, int, void*), void *data) { - struct xfrm_state *old, *x, *last = NULL; + struct xfrm_state *state; + struct xfrm_state_walk *x; int err = 0; - if (walk->state == NULL && walk->count != 0) + if (walk->seq != 0 && list_empty(&walk->all)) return 0; - old = x = walk->state; - walk->state = NULL; spin_lock_bh(&xfrm_state_lock); - if (x == NULL) - x = list_first_entry(&xfrm_state_all, struct xfrm_state, all); + if (list_empty(&walk->all)) + x = list_first_entry(&xfrm_state_all, struct xfrm_state_walk, all); + else + x = list_entry(&walk->all, struct xfrm_state_walk, all); list_for_each_entry_from(x, &xfrm_state_all, all) { - if (x->km.state == XFRM_STATE_DEAD) + if (x->state == XFRM_STATE_DEAD) continue; - if (!xfrm_id_proto_match(x->id.proto, walk->proto)) + state = container_of(x, struct xfrm_state, km); + if (!xfrm_id_proto_match(state->id.proto, walk->proto)) continue; - if (last) { - err = func(last, walk->count, data); - if (err) { - xfrm_state_hold(last); - walk->state = last; - goto out; - } + err = func(state, walk->seq, data); + if (err) { + list_move_tail(&walk->all, &x->all); + goto out; } - last = x; - walk->count++; + walk->seq++; } - if (walk->count == 0) { + if (walk->seq == 0) { err = -ENOENT; goto out; } - if (last) - err = func(last, 0, data); + list_del_init(&walk->all); out: spin_unlock_bh(&xfrm_state_lock); - if (old != NULL) - xfrm_state_put(old); return err; } EXPORT_SYMBOL(xfrm_state_walk); void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto) { + INIT_LIST_HEAD(&walk->all); walk->proto = proto; - walk->state = NULL; - walk->count = 0; - list_add_tail(&walk->list, &xfrm_state_walks); - walk->genid = ++xfrm_state_walk_ongoing; + walk->state = XFRM_STATE_DEAD; + walk->seq = 0; } EXPORT_SYMBOL(xfrm_state_walk_init); void xfrm_state_walk_done(struct xfrm_state_walk *walk) { - struct list_head *prev; - - if (walk->state != NULL) { - xfrm_state_put(walk->state); - walk->state = NULL; - } - - prev = walk->list.prev; - list_del(&walk->list); - - if (prev != &xfrm_state_walks) { - list_entry(prev, struct xfrm_state_walk, list)->genid = - walk->genid; + if (list_empty(&walk->all)) return; - } - - xfrm_state_walk_completed = walk->genid; - if (!list_empty(&xfrm_state_gc_leftovers)) - schedule_work(&xfrm_state_gc_work); + spin_lock_bh(&xfrm_state_lock); + list_del(&walk->all); + spin_unlock_bh(&xfrm_state_lock); } EXPORT_SYMBOL(xfrm_state_walk_done); ^ permalink raw reply related [flat|nested] 95+ messages in thread
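The fix for that bug is the new guard at the top of xfrm_state_walk() in the patch above: a completed dump leaves the walker unlinked from the list while its seq is nonzero, so a spurious extra call can be detected and refused instead of restarting from the head of the list:

	if (walk->seq != 0 && list_empty(&walk->all))
		return 0;	/* dump already completed; don't start over */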
* Re: xfrm_state locking regression... 2008-09-24 13:18 ` Timo Teräs @ 2008-09-24 14:08 ` Herbert Xu 2008-09-25 6:03 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-24 14:08 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev, jamal On Wed, Sep 24, 2008 at 04:18:37PM +0300, Timo Teräs wrote: > > Gave it a test spin and found one bug. The walking routine will > get called one extra time when it has successfully dumped > everything so we need to check for that in the beginning; the > earlier version would start all over from the beginning, resulting > in never-ending dumps. Aha, that's why the if statement was there :) Thanks for all the work! -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-24 14:08 ` Herbert Xu @ 2008-09-25 6:03 ` Timo Teräs 2008-09-25 7:57 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-25 6:03 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev, jamal Herbert Xu wrote: > On Wed, Sep 24, 2008 at 04:18:37PM +0300, Timo Teräs wrote: >> Gave it a test spin and found one bug. The walking routine will >> get called one extra time when it has successfully dumped >> everything so we need to check for that in the beginning; the >> earlier version would start all over from the beginning resulting >> never ending dumps. > > Aha, that's why the if statement was there :) Yup. I forgot to test pf_key support. And remembered why the "last" thingy was in the walking routines. In pf_key dumps, the dump is terminated when an entry with seq zero is received; that's why the final entry received special treatment. And now that breaks. The first entry is zero so pf_key dumps only one entry. Not sure if we should fix this in pf_key or in the walking routines. Cheers, Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
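For context, this is how a PF_KEY dump ends from userland's point of view (a sketch of the RFC 2367 convention described above; pfkey_fd is an already-open PF_KEY socket and handle_entry() is an assumed helper):

	#include <linux/pfkeyv2.h>
	#include <sys/socket.h>

	char buf[8192];
	for (;;) {
		ssize_t len = recv(pfkey_fd, buf, sizeof(buf), 0);
		struct sadb_msg *hdr = (struct sadb_msg *)buf;

		if (len < (ssize_t)sizeof(*hdr))
			break;		/* error or truncated message */
		handle_entry(hdr);
		if (hdr->sadb_msg_seq == 0)
			break;		/* seq 0 marks the final entry */
	}

With seq now counting up from zero, the very first reply already carries seq 0, so the dump appears to end after one entry, which is exactly the breakage described above.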
* Re: xfrm_state locking regression... 2008-09-25 6:03 ` Timo Teräs @ 2008-09-25 7:57 ` Herbert Xu 2008-09-25 8:42 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-25 7:57 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev, jamal On Thu, Sep 25, 2008 at 09:03:21AM +0300, Timo Teräs wrote: > > I forgot to test pf_key support. And remembered why the "last" > thingy was in the walking routines. In pf_key dumps, the dump > is terminated when an entry with seq zero is received; that's > why the final entry received special treatment. And now that > breaks. The first entry is zero so pf_key dumps only one entry. > > Not sure if we should fix this in pf_key or in the walking routines. This is part of the user-space ABI (and an RFC) so we can't change it. I've put the last stuff back and fixed up the error handling on the very last entry. ipsec: Put xfrm_state dumpers on the dump list Based on Herbert Xu's patch and idea. Timo improved the original patch to apply cleanly and modified iteration code to be more readable by using a common struct for entries in all list. As it is we go to extraordinary lengths to ensure that states don't go away while dumpers go to sleep. It's much easier if we just put the dumpers themselves on the list since they can't go away while they're going. I've also changed the order of addition on new states to prevent a never-ending dump. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Timo Teras <timo.teras@iki.fi> diff --git a/include/linux/netlink.h b/include/linux/netlink.h index cbba776..9ff1b54 100644 --- a/include/linux/netlink.h +++ b/include/linux/netlink.h @@ -220,7 +220,7 @@ struct netlink_callback int (*dump)(struct sk_buff * skb, struct netlink_callback *cb); int (*done)(struct netlink_callback *cb); int family; - long args[7]; + long args[6]; }; struct netlink_notify diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 48630b2..becaa1e 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -117,12 +117,21 @@ extern struct mutex xfrm_cfg_mutex; metrics. Plus, it will be made via sk->sk_dst_cache. Solved. */ +struct xfrm_state_walk { + struct list_head all; + u8 state; + union { + u8 dying; + u8 proto; + }; + u32 seq; +}; + /* Full description of state of transformer. */ struct xfrm_state { - struct list_head all; union { - struct list_head gclist; + struct hlist_node gclist; struct hlist_node bydst; }; struct hlist_node bysrc; @@ -136,12 +145,8 @@ struct xfrm_state u32 genid; - /* Key manger bits */ - struct { - u8 state; - u8 dying; - u32 seq; - } km; + /* Key manager bits */ + struct xfrm_state_walk km; /* Parameters of this state. */ struct { @@ -1245,14 +1250,6 @@ struct xfrm6_tunnel { int priority; }; -struct xfrm_state_walk { - struct list_head list; - unsigned long genid; - struct xfrm_state *state; - int count; - u8 proto; -}; - struct xfrm_policy_walk { struct xfrm_policy *policy; int count; diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 053970e..5b98f6f 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -59,14 +59,6 @@ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024; static unsigned int xfrm_state_num; static unsigned int xfrm_state_genid; -/* Counter indicating ongoing walk, protected by xfrm_state_lock. */ -static unsigned long xfrm_state_walk_ongoing; -/* Counter indicating walk completion, protected by xfrm_cfg_mutex. 
*/ -static unsigned long xfrm_state_walk_completed; - -/* List of outstanding state walks used to set the completed counter. */ -static LIST_HEAD(xfrm_state_walks); - static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned int family); static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo); @@ -199,8 +191,7 @@ static DEFINE_RWLOCK(xfrm_state_afinfo_lock); static struct xfrm_state_afinfo *xfrm_state_afinfo[NPROTO]; static struct work_struct xfrm_state_gc_work; -static LIST_HEAD(xfrm_state_gc_leftovers); -static LIST_HEAD(xfrm_state_gc_list); +static HLIST_HEAD(xfrm_state_gc_list); static DEFINE_SPINLOCK(xfrm_state_gc_lock); int __xfrm_state_delete(struct xfrm_state *x); @@ -412,23 +403,16 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x) static void xfrm_state_gc_task(struct work_struct *data) { - struct xfrm_state *x, *tmp; - unsigned long completed; + struct xfrm_state *x; + struct hlist_node *entry, *tmp; + struct hlist_head gc_list; - mutex_lock(&xfrm_cfg_mutex); spin_lock_bh(&xfrm_state_gc_lock); - list_splice_tail_init(&xfrm_state_gc_list, &xfrm_state_gc_leftovers); + hlist_move_list(&xfrm_state_gc_list, &gc_list); spin_unlock_bh(&xfrm_state_gc_lock); - completed = xfrm_state_walk_completed; - mutex_unlock(&xfrm_cfg_mutex); - - list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) { - if ((long)(x->lastused - completed) > 0) - break; - list_del(&x->gclist); + hlist_for_each_entry_safe(x, entry, tmp, &gc_list, gclist) xfrm_state_gc_destroy(x); - } wake_up(&km_waitq); } @@ -529,7 +513,7 @@ struct xfrm_state *xfrm_state_alloc(void) if (x) { atomic_set(&x->refcnt, 1); atomic_set(&x->tunnel_users, 0); - INIT_LIST_HEAD(&x->all); + INIT_LIST_HEAD(&x->km.all); INIT_HLIST_NODE(&x->bydst); INIT_HLIST_NODE(&x->bysrc); INIT_HLIST_NODE(&x->byspi); @@ -556,7 +540,7 @@ void __xfrm_state_destroy(struct xfrm_state *x) WARN_ON(x->km.state != XFRM_STATE_DEAD); spin_lock_bh(&xfrm_state_gc_lock); - list_add_tail(&x->gclist, &xfrm_state_gc_list); + hlist_add_head(&x->gclist, &xfrm_state_gc_list); spin_unlock_bh(&xfrm_state_gc_lock); schedule_work(&xfrm_state_gc_work); } @@ -569,8 +553,7 @@ int __xfrm_state_delete(struct xfrm_state *x) if (x->km.state != XFRM_STATE_DEAD) { x->km.state = XFRM_STATE_DEAD; spin_lock(&xfrm_state_lock); - x->lastused = xfrm_state_walk_ongoing; - list_del_rcu(&x->all); + list_del(&x->km.all); hlist_del(&x->bydst); hlist_del(&x->bysrc); if (x->id.spi) @@ -871,7 +854,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr, if (km_query(x, tmpl, pol) == 0) { x->km.state = XFRM_STATE_ACQ; - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); hlist_add_head(&x->bydst, xfrm_state_bydst+h); h = xfrm_src_hash(daddr, saddr, family); hlist_add_head(&x->bysrc, xfrm_state_bysrc+h); @@ -940,7 +923,7 @@ static void __xfrm_state_insert(struct xfrm_state *x) x->genid = ++xfrm_state_genid; - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); h = xfrm_dst_hash(&x->id.daddr, &x->props.saddr, x->props.reqid, x->props.family); @@ -1069,7 +1052,7 @@ static struct xfrm_state *__find_acq_core(unsigned short family, u8 mode, u32 re xfrm_state_hold(x); x->timer.expires = jiffies + sysctl_xfrm_acq_expires*HZ; add_timer(&x->timer); - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); hlist_add_head(&x->bydst, xfrm_state_bydst+h); h = xfrm_src_hash(daddr, saddr, family); hlist_add_head(&x->bysrc, xfrm_state_bysrc+h); @@ -1566,79 +1549,72 @@ int 
xfrm_state_walk(struct xfrm_state_walk *walk, int (*func)(struct xfrm_state *, int, void*), void *data) { - struct xfrm_state *old, *x, *last = NULL; + struct xfrm_state *state, *last = NULL; + struct xfrm_state_walk *x; int err = 0; - if (walk->state == NULL && walk->count != 0) + if (walk->seq != 0 && list_empty(&walk->all)) return 0; - old = x = walk->state; - walk->state = NULL; spin_lock_bh(&xfrm_state_lock); - if (x == NULL) - x = list_first_entry(&xfrm_state_all, struct xfrm_state, all); + if (list_empty(&walk->all)) + x = list_first_entry(&xfrm_state_all, struct xfrm_state_walk, all); + else + x = list_entry(&walk->all, struct xfrm_state_walk, all); list_for_each_entry_from(x, &xfrm_state_all, all) { - if (x->km.state == XFRM_STATE_DEAD) + if (x->state == XFRM_STATE_DEAD) continue; - if (!xfrm_id_proto_match(x->id.proto, walk->proto)) + state = container_of(x, struct xfrm_state, km); + if (!xfrm_id_proto_match(state->id.proto, walk->proto)) continue; + if (last) { - err = func(last, walk->count, data); - if (err) { - xfrm_state_hold(last); - walk->state = last; - goto out; - } + err = func(last, walk->seq, data); + if (err) + goto save_dump; } - last = x; - walk->count++; + + last = state; + walk->seq++; } - if (walk->count == 0) { + if (walk->seq == 0) { err = -ENOENT; goto out; } - if (last) - err = func(last, 0, data); + + list_del_init(&walk->all); + if (!last) + goto out; + + err = func(last, 0, data); + if (err) { +save_dump: + list_move_tail(&walk->all, &last->km.all); + } + out: spin_unlock_bh(&xfrm_state_lock); - if (old != NULL) - xfrm_state_put(old); return err; } EXPORT_SYMBOL(xfrm_state_walk); void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto) { + INIT_LIST_HEAD(&walk->all); walk->proto = proto; - walk->state = NULL; - walk->count = 0; - list_add_tail(&walk->list, &xfrm_state_walks); - walk->genid = ++xfrm_state_walk_ongoing; + walk->state = XFRM_STATE_DEAD; + walk->seq = 0; } EXPORT_SYMBOL(xfrm_state_walk_init); void xfrm_state_walk_done(struct xfrm_state_walk *walk) { - struct list_head *prev; - - if (walk->state != NULL) { - xfrm_state_put(walk->state); - walk->state = NULL; - } - - prev = walk->list.prev; - list_del(&walk->list); - - if (prev != &xfrm_state_walks) { - list_entry(prev, struct xfrm_state_walk, list)->genid = - walk->genid; + if (list_empty(&walk->all)) return; - } - xfrm_state_walk_completed = walk->genid; - - if (!list_empty(&xfrm_state_gc_leftovers)) - schedule_work(&xfrm_state_gc_work); + spin_lock_bh(&xfrm_state_lock); + list_del(&walk->all); + spin_unlock_bh(&xfrm_state_lock); } EXPORT_SYMBOL(xfrm_state_walk_done); Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply related [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-25 7:57 ` Herbert Xu @ 2008-09-25 8:42 ` Timo Teräs 2008-09-25 8:56 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-25 8:42 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev, jamal Herbert Xu wrote: > On Thu, Sep 25, 2008 at 09:03:21AM +0300, Timo Teräs wrote: >> I forgot to test pf_key support. And remembered why the "last" >> thingy was in the walking routines. In pf_key dumps, the dump >> is terminated when an entry with seq zero is received; that's >> why the final entry received special treatment. And now that >> breaks. The first entry is zero so pf_key dumps only one entry. >> >> Not sure if we should fix this in pf_key or in the walking routines. > > This is part of the user-space ABI (and an RFC) so we can't change > it. I've put the last stuff back and fixed up the error handling > on the very last entry. I was just figuring out whether we need to change the seq in the pf_key.c kernel code or in xfrm_state.c. But I guess it's good if the walking works as previously. Now there is a new problem: if dumping is interrupted and meanwhile all the remaining entries get deleted, there is no entry to dump with seq zero when dumping is eventually continued. Since xfrm_user does not care about the seq as an end-of-message marker, this problem appears only in af_key. Maybe we should instead make the af_key.c code cache one entry before sending it to user space. We could then get rid of the last hack inside the xfrm core walking code. Cheers, Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-25 8:42 ` Timo Teräs @ 2008-09-25 8:56 ` Herbert Xu 2008-09-25 9:01 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-25 8:56 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev, jamal On Thu, Sep 25, 2008 at 11:42:15AM +0300, Timo Teräs wrote: > > Maybe we should instead make the af_key.c code cache one entry before > sending it to user space. We could then get rid of the last hack > inside the xfrm core walking code. Yes, that should work. But you want to cache the skb instead of the state, since the latter can die. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-25 8:56 ` Herbert Xu @ 2008-09-25 9:01 ` Timo Teräs 2008-09-25 9:49 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-25 9:01 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev, jamal Herbert Xu wrote: > On Thu, Sep 25, 2008 at 11:42:15AM +0300, Timo Teräs wrote: >> Maybe we should instead make the af_key.c code cache one entry before >> sending it to user space. We could then get rid of the last hack >> inside the xfrm core walking code. > > Yes, that should work. But you want to cache the skb instead > of the state, since the latter can die. Exactly: keep the converted entry in an skb, since we have to allocate that anyway. And when there are no more entries coming from the enumerator, the sequence number of the final skb is adjusted to zero. Not sure if I can finish it today, but I think I should have the time by tomorrow. Unless you have time to do it before that, I'll post a patch later this week. Will update the similar xfrm_policy patch for that too, before sending it. - Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
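The hold-back scheme just described, as it appears in the patch that follows: dump_sa()/dump_sp() never broadcast the entry they have just built; they send the previously cached skb and cache the new one, so pfkey_do_dump() can patch the final cached skb before flushing it:

	/* in the dump callback: ship the previous entry, hold back this one */
	if (pfk->dump.skb)
		pfkey_broadcast(pfk->dump.skb, GFP_ATOMIC, BROADCAST_ONE, &pfk->sk);
	pfk->dump.skb = out_skb;

	/* in pfkey_do_dump(), once the walk reports completion: */
	hdr = (struct sadb_msg *) pfk->dump.skb->data;
	hdr->sadb_msg_seq = 0;		/* RFC 2367 end-of-dump marker */
	hdr->sadb_msg_errno = rc;

This lets the in-kernel walk hand out seq values 1, 2, 3, ... while userland still sees the terminating zero only on the very last message.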
* Re: xfrm_state locking regression... 2008-09-25 9:01 ` Timo Teräs @ 2008-09-25 9:49 ` Herbert Xu 2008-09-25 12:12 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-25 9:49 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev, jamal On Thu, Sep 25, 2008 at 12:01:30PM +0300, Timo Teräs wrote: > > Not sure if I can finish it today, but I think I should have the > time by tomorrow. Unless you have time to do it before that, I'll > post a patch later this week. Will update the similar xfrm_policy > patch for that too, before sending it. Thanks! I've got some other work to do so feel free to post this tomorrow. -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-25 9:49 ` Herbert Xu @ 2008-09-25 12:12 ` Timo Teräs 2008-09-25 12:36 ` Timo Teräs 0 siblings, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-25 12:12 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev, jamal Herbert Xu wrote: > On Thu, Sep 25, 2008 at 12:01:30PM +0300, Timo Teräs wrote: >> Not sure if I can finish it today, but I think I should have the >> time by tomorrow. Unless you have time to do it before that, I'll >> post a patch later this week. Will update the similar xfrm_policy >> patch for that too, before sending it. > > Thanks! I've got some other work to do so feel free to post this > tomorrow. It was simpler than I thought (and I thought I would be more busy with other things). Anyway, here goes. This is a single patch that modifies both xfrm_state and xfrm_policy dumping. This was to make the af_key patch nicer (the final skb patching/sending is common code). Also, since the af_key implementation is the only place where this seq number is used, we might consider removing it from the core walking code. It's pretty af_key specific too. The xfrm_policy patch moves xfrm_policy.dead to xfrm_policy_walk_entry, which is the struct shared by walkers and the real entries. I did some testing to ensure that it works with xfrm and af_key as expected (with entries added by hand and dumping them). ipsec: Put dumpers on the dump list Herbert Xu came up with the idea and the original patch to make xfrm_state dump list contain also dumpers: As it is we go to extraordinary lengths to ensure that states don't go away while dumpers go to sleep. It's much easier if we just put the dumpers themselves on the list since they can't go away while they're going. I've also changed the order of addition on new states to prevent a never-ending dump. Timo Teräs improved the patch to apply cleanly to latest tree, modified iteration code to be more readable by using a common struct for entries in the list, implemented the same idea for xfrm_policy dumping and moved the af_key specific "last" entry caching to af_key. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Timo Teras <timo.teras@iki.fi> diff --git a/include/linux/netlink.h b/include/linux/netlink.h index cbba776..9ff1b54 100644 --- a/include/linux/netlink.h +++ b/include/linux/netlink.h @@ -220,7 +220,7 @@ struct netlink_callback int (*dump)(struct sk_buff * skb, struct netlink_callback *cb); int (*done)(struct netlink_callback *cb); int family; - long args[7]; + long args[6]; }; struct netlink_notify diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 48630b2..b98d205 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -117,12 +117,21 @@ extern struct mutex xfrm_cfg_mutex; metrics. Plus, it will be made via sk->sk_dst_cache. Solved. */ +struct xfrm_state_walk { + struct list_head all; + u8 state; + union { + u8 dying; + u8 proto; + }; + u32 seq; +}; + /* Full description of state of transformer. */ struct xfrm_state { - struct list_head all; union { - struct list_head gclist; + struct hlist_node gclist; struct hlist_node bydst; }; struct hlist_node bysrc; @@ -136,12 +145,8 @@ struct xfrm_state u32 genid; - /* Key manger bits */ - struct { - u8 state; - u8 dying; - u32 seq; - } km; + /* Key manager bits */ + struct xfrm_state_walk km; /* Parameters of this state. 
*/ struct { @@ -449,10 +454,20 @@ struct xfrm_tmpl #define XFRM_MAX_DEPTH 6 +struct xfrm_policy_walk_entry { + struct list_head all; + u8 dead; +}; + +struct xfrm_policy_walk { + struct xfrm_policy_walk_entry walk; + u8 type; + u32 seq; +}; + struct xfrm_policy { struct xfrm_policy *next; - struct list_head bytype; struct hlist_node bydst; struct hlist_node byidx; @@ -467,13 +482,12 @@ struct xfrm_policy struct xfrm_lifetime_cfg lft; struct xfrm_lifetime_cur curlft; struct dst_entry *bundles; - u16 family; + struct xfrm_policy_walk_entry walk; u8 type; u8 action; u8 flags; - u8 dead; u8 xfrm_nr; - /* XXX 1 byte hole, try to pack */ + u16 family; struct xfrm_sec_ctx *security; struct xfrm_tmpl xfrm_vec[XFRM_MAX_DEPTH]; }; @@ -1245,20 +1259,6 @@ struct xfrm6_tunnel { int priority; }; -struct xfrm_state_walk { - struct list_head list; - unsigned long genid; - struct xfrm_state *state; - int count; - u8 proto; -}; - -struct xfrm_policy_walk { - struct xfrm_policy *policy; - int count; - u8 type, cur_type; -}; - extern void xfrm_init(void); extern void xfrm4_init(void); extern void xfrm_state_init(void); @@ -1410,24 +1410,10 @@ static inline int xfrm4_udp_encap_rcv(struct sock *sk, struct sk_buff *skb) struct xfrm_policy *xfrm_policy_alloc(gfp_t gfp); -static inline void xfrm_policy_walk_init(struct xfrm_policy_walk *walk, u8 type) -{ - walk->cur_type = XFRM_POLICY_TYPE_MAIN; - walk->type = type; - walk->policy = NULL; - walk->count = 0; -} - -static inline void xfrm_policy_walk_done(struct xfrm_policy_walk *walk) -{ - if (walk->policy != NULL) { - xfrm_pol_put(walk->policy); - walk->policy = NULL; - } -} - +extern void xfrm_policy_walk_init(struct xfrm_policy_walk *walk, u8 type); extern int xfrm_policy_walk(struct xfrm_policy_walk *walk, int (*func)(struct xfrm_policy *, int, int, void*), void *); +extern void xfrm_policy_walk_done(struct xfrm_policy_walk *walk); int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl); struct xfrm_policy *xfrm_policy_bysel_ctx(u8 type, int dir, struct xfrm_selector *sel, diff --git a/net/key/af_key.c b/net/key/af_key.c index d628df9..9fd4e68 100644 --- a/net/key/af_key.c +++ b/net/key/af_key.c @@ -58,6 +58,7 @@ struct pfkey_sock { struct xfrm_policy_walk policy; struct xfrm_state_walk state; } u; + struct sk_buff *skb; } dump; }; @@ -73,20 +74,6 @@ static int pfkey_can_dump(struct sock *sk) return 0; } -static int pfkey_do_dump(struct pfkey_sock *pfk) -{ - int rc; - - rc = pfk->dump.dump(pfk); - if (rc == -ENOBUFS) - return 0; - - pfk->dump.done(pfk); - pfk->dump.dump = NULL; - pfk->dump.done = NULL; - return rc; -} - static void pfkey_sock_destruct(struct sock *sk) { skb_queue_purge(&sk->sk_receive_queue); @@ -310,6 +297,33 @@ static int pfkey_broadcast(struct sk_buff *skb, gfp_t allocation, return err; } +static int pfkey_do_dump(struct pfkey_sock *pfk) +{ + struct sadb_msg *hdr; + int rc; + + rc = pfk->dump.dump(pfk); + if (rc == -ENOBUFS) + return 0; + + if (pfk->dump.skb) { + if (!pfkey_can_dump(&pfk->sk)) + return 0; + + hdr = (struct sadb_msg *) pfk->dump.skb->data; + hdr->sadb_msg_seq = 0; + hdr->sadb_msg_errno = rc; + pfkey_broadcast(pfk->dump.skb, GFP_ATOMIC, BROADCAST_ONE, + &pfk->sk); + pfk->dump.skb = NULL; + } + + pfk->dump.done(pfk); + pfk->dump.dump = NULL; + pfk->dump.done = NULL; + return rc; +} + static inline void pfkey_hdr_dup(struct sadb_msg *new, struct sadb_msg *orig) { *new = *orig; @@ -1736,9 +1750,14 @@ static int dump_sa(struct xfrm_state *x, int count, void *ptr) out_hdr->sadb_msg_satype = 
pfkey_proto2satype(x->id.proto); out_hdr->sadb_msg_errno = 0; out_hdr->sadb_msg_reserved = 0; - out_hdr->sadb_msg_seq = count; + out_hdr->sadb_msg_seq = count + 1; out_hdr->sadb_msg_pid = pfk->dump.msg_pid; - pfkey_broadcast(out_skb, GFP_ATOMIC, BROADCAST_ONE, &pfk->sk); + + if (pfk->dump.skb) + pfkey_broadcast(pfk->dump.skb, GFP_ATOMIC, BROADCAST_ONE, + &pfk->sk); + pfk->dump.skb = out_skb; + return 0; } @@ -2237,7 +2256,7 @@ static int pfkey_spdadd(struct sock *sk, struct sk_buff *skb, struct sadb_msg *h return 0; out: - xp->dead = 1; + xp->walk.dead = 1; xfrm_policy_destroy(xp); return err; } @@ -2575,9 +2594,14 @@ static int dump_sp(struct xfrm_policy *xp, int dir, int count, void *ptr) out_hdr->sadb_msg_type = SADB_X_SPDDUMP; out_hdr->sadb_msg_satype = SADB_SATYPE_UNSPEC; out_hdr->sadb_msg_errno = 0; - out_hdr->sadb_msg_seq = count; + out_hdr->sadb_msg_seq = count + 1; out_hdr->sadb_msg_pid = pfk->dump.msg_pid; - pfkey_broadcast(out_skb, GFP_ATOMIC, BROADCAST_ONE, &pfk->sk); + + if (pfk->dump.skb) + pfkey_broadcast(pfk->dump.skb, GFP_ATOMIC, BROADCAST_ONE, + &pfk->sk); + pfk->dump.skb = out_skb; + return 0; } diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index ef9ccbc..b7ec080 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -46,7 +46,7 @@ EXPORT_SYMBOL(xfrm_cfg_mutex); static DEFINE_RWLOCK(xfrm_policy_lock); -static struct list_head xfrm_policy_bytype[XFRM_POLICY_TYPE_MAX]; +static struct list_head xfrm_policy_all; unsigned int xfrm_policy_count[XFRM_POLICY_MAX*2]; EXPORT_SYMBOL(xfrm_policy_count); @@ -164,7 +164,7 @@ static void xfrm_policy_timer(unsigned long data) read_lock(&xp->lock); - if (xp->dead) + if (xp->walk.dead) goto out; dir = xfrm_policy_id2dir(xp->index); @@ -236,7 +236,7 @@ struct xfrm_policy *xfrm_policy_alloc(gfp_t gfp) policy = kzalloc(sizeof(struct xfrm_policy), gfp); if (policy) { - INIT_LIST_HEAD(&policy->bytype); + INIT_LIST_HEAD(&policy->walk.all); INIT_HLIST_NODE(&policy->bydst); INIT_HLIST_NODE(&policy->byidx); rwlock_init(&policy->lock); @@ -252,17 +252,13 @@ EXPORT_SYMBOL(xfrm_policy_alloc); void xfrm_policy_destroy(struct xfrm_policy *policy) { - BUG_ON(!policy->dead); + BUG_ON(!policy->walk.dead); BUG_ON(policy->bundles); if (del_timer(&policy->timer)) BUG(); - write_lock_bh(&xfrm_policy_lock); - list_del(&policy->bytype); - write_unlock_bh(&xfrm_policy_lock); - security_xfrm_policy_free(policy->security); kfree(policy); } @@ -310,8 +306,8 @@ static void xfrm_policy_kill(struct xfrm_policy *policy) int dead; write_lock_bh(&policy->lock); - dead = policy->dead; - policy->dead = 1; + dead = policy->walk.dead; + policy->walk.dead = 1; write_unlock_bh(&policy->lock); if (unlikely(dead)) { @@ -609,6 +605,7 @@ int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl) if (delpol) { hlist_del(&delpol->bydst); hlist_del(&delpol->byidx); + list_del(&delpol->walk.all); xfrm_policy_count[dir]--; } policy->index = delpol ? 
delpol->index : xfrm_gen_index(policy->type, dir); @@ -617,7 +614,7 @@ int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl) policy->curlft.use_time = 0; if (!mod_timer(&policy->timer, jiffies + HZ)) xfrm_pol_hold(policy); - list_add_tail(&policy->bytype, &xfrm_policy_bytype[policy->type]); + list_add(&policy->walk.all, &xfrm_policy_all); write_unlock_bh(&xfrm_policy_lock); if (delpol) @@ -684,6 +681,7 @@ struct xfrm_policy *xfrm_policy_bysel_ctx(u8 type, int dir, } hlist_del(&pol->bydst); hlist_del(&pol->byidx); + list_del(&pol->walk.all); xfrm_policy_count[dir]--; } ret = pol; @@ -727,6 +725,7 @@ struct xfrm_policy *xfrm_policy_byid(u8 type, int dir, u32 id, int delete, } hlist_del(&pol->bydst); hlist_del(&pol->byidx); + list_del(&pol->walk.all); xfrm_policy_count[dir]--; } ret = pol; @@ -840,6 +839,7 @@ int xfrm_policy_flush(u8 type, struct xfrm_audit *audit_info) continue; hlist_del(&pol->bydst); hlist_del(&pol->byidx); + list_del(&pol->walk.all); write_unlock_bh(&xfrm_policy_lock); xfrm_audit_policy_delete(pol, 1, @@ -867,60 +867,68 @@ int xfrm_policy_walk(struct xfrm_policy_walk *walk, int (*func)(struct xfrm_policy *, int, int, void*), void *data) { - struct xfrm_policy *old, *pol, *last = NULL; + struct xfrm_policy *pol; + struct xfrm_policy_walk_entry *x; int error = 0; if (walk->type >= XFRM_POLICY_TYPE_MAX && walk->type != XFRM_POLICY_TYPE_ANY) return -EINVAL; - if (walk->policy == NULL && walk->count != 0) + if (list_empty(&walk->walk.all) && walk->seq != 0) return 0; - old = pol = walk->policy; - walk->policy = NULL; - read_lock_bh(&xfrm_policy_lock); - - for (; walk->cur_type < XFRM_POLICY_TYPE_MAX; walk->cur_type++) { - if (walk->type != walk->cur_type && - walk->type != XFRM_POLICY_TYPE_ANY) + write_lock_bh(&xfrm_policy_lock); + if (list_empty(&walk->walk.all)) + x = list_first_entry(&xfrm_policy_all, struct xfrm_policy_walk_entry, all); + else + x = list_entry(&walk->walk.all, struct xfrm_policy_walk_entry, all); + list_for_each_entry_from(x, &xfrm_policy_all, all) { + if (x->dead) continue; - - if (pol == NULL) { - pol = list_first_entry(&xfrm_policy_bytype[walk->cur_type], - struct xfrm_policy, bytype); - } - list_for_each_entry_from(pol, &xfrm_policy_bytype[walk->cur_type], bytype) { - if (pol->dead) - continue; - if (last) { - error = func(last, xfrm_policy_id2dir(last->index), - walk->count, data); - if (error) { - xfrm_pol_hold(last); - walk->policy = last; - goto out; - } - } - last = pol; - walk->count++; + pol = container_of(x, struct xfrm_policy, walk); + if (walk->type != XFRM_POLICY_TYPE_ANY && + walk->type != pol->type) + continue; + error = func(pol, xfrm_policy_id2dir(pol->index), + walk->seq, data); + if (error) { + list_move_tail(&walk->walk.all, &x->all); + goto out; } - pol = NULL; + walk->seq++; } - if (walk->count == 0) { + if (walk->seq == 0) { error = -ENOENT; goto out; } - if (last) - error = func(last, xfrm_policy_id2dir(last->index), 0, data); + list_del_init(&walk->walk.all); out: - read_unlock_bh(&xfrm_policy_lock); - if (old != NULL) - xfrm_pol_put(old); + write_unlock_bh(&xfrm_policy_lock); return error; } EXPORT_SYMBOL(xfrm_policy_walk); +void xfrm_policy_walk_init(struct xfrm_policy_walk *walk, u8 type) +{ + INIT_LIST_HEAD(&walk->walk.all); + walk->walk.dead = 1; + walk->type = type; + walk->seq = 0; +} +EXPORT_SYMBOL(xfrm_policy_walk_init); + +void xfrm_policy_walk_done(struct xfrm_policy_walk *walk) +{ + if (list_empty(&walk->walk.all)) + return; + + write_lock_bh(&xfrm_policy_lock); + list_del(&walk->walk.all); + 
write_unlock_bh(&xfrm_policy_lock); +} +EXPORT_SYMBOL(xfrm_policy_walk_done); + /* * Find policy to apply to this flow. * @@ -1077,7 +1085,7 @@ static void __xfrm_policy_link(struct xfrm_policy *pol, int dir) struct hlist_head *chain = policy_hash_bysel(&pol->selector, pol->family, dir); - list_add_tail(&pol->bytype, &xfrm_policy_bytype[pol->type]); + list_add(&pol->walk.all, &xfrm_policy_all); hlist_add_head(&pol->bydst, chain); hlist_add_head(&pol->byidx, xfrm_policy_byidx+idx_hash(pol->index)); xfrm_policy_count[dir]++; @@ -1095,6 +1103,7 @@ static struct xfrm_policy *__xfrm_policy_unlink(struct xfrm_policy *pol, hlist_del(&pol->bydst); hlist_del(&pol->byidx); + list_del(&pol->walk.all); xfrm_policy_count[dir]--; return pol; @@ -1720,7 +1729,7 @@ restart: for (pi = 0; pi < npols; pi++) { read_lock_bh(&pols[pi]->lock); - pol_dead |= pols[pi]->dead; + pol_dead |= pols[pi]->walk.dead; read_unlock_bh(&pols[pi]->lock); } @@ -2415,9 +2424,7 @@ static void __init xfrm_policy_init(void) panic("XFRM: failed to allocate bydst hash\n"); } - for (dir = 0; dir < XFRM_POLICY_TYPE_MAX; dir++) - INIT_LIST_HEAD(&xfrm_policy_bytype[dir]); - + INIT_LIST_HEAD(&xfrm_policy_all); INIT_WORK(&xfrm_policy_gc_work, xfrm_policy_gc_task); register_netdevice_notifier(&xfrm_dev_notifier); } @@ -2601,7 +2608,7 @@ static int xfrm_policy_migrate(struct xfrm_policy *pol, int i, j, n = 0; write_lock_bh(&pol->lock); - if (unlikely(pol->dead)) { + if (unlikely(pol->walk.dead)) { /* target policy has been deleted */ write_unlock_bh(&pol->lock); return -ENOENT; diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 053970e..747fd8c 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -59,14 +59,6 @@ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024; static unsigned int xfrm_state_num; static unsigned int xfrm_state_genid; -/* Counter indicating ongoing walk, protected by xfrm_state_lock. */ -static unsigned long xfrm_state_walk_ongoing; -/* Counter indicating walk completion, protected by xfrm_cfg_mutex. */ -static unsigned long xfrm_state_walk_completed; - -/* List of outstanding state walks used to set the completed counter. 
*/ -static LIST_HEAD(xfrm_state_walks); - static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned int family); static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo); @@ -199,8 +191,7 @@ static DEFINE_RWLOCK(xfrm_state_afinfo_lock); static struct xfrm_state_afinfo *xfrm_state_afinfo[NPROTO]; static struct work_struct xfrm_state_gc_work; -static LIST_HEAD(xfrm_state_gc_leftovers); -static LIST_HEAD(xfrm_state_gc_list); +static HLIST_HEAD(xfrm_state_gc_list); static DEFINE_SPINLOCK(xfrm_state_gc_lock); int __xfrm_state_delete(struct xfrm_state *x); @@ -412,23 +403,16 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x) static void xfrm_state_gc_task(struct work_struct *data) { - struct xfrm_state *x, *tmp; - unsigned long completed; + struct xfrm_state *x; + struct hlist_node *entry, *tmp; + struct hlist_head gc_list; - mutex_lock(&xfrm_cfg_mutex); spin_lock_bh(&xfrm_state_gc_lock); - list_splice_tail_init(&xfrm_state_gc_list, &xfrm_state_gc_leftovers); + hlist_move_list(&xfrm_state_gc_list, &gc_list); spin_unlock_bh(&xfrm_state_gc_lock); - completed = xfrm_state_walk_completed; - mutex_unlock(&xfrm_cfg_mutex); - - list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) { - if ((long)(x->lastused - completed) > 0) - break; - list_del(&x->gclist); + hlist_for_each_entry_safe(x, entry, tmp, &gc_list, gclist) xfrm_state_gc_destroy(x); - } wake_up(&km_waitq); } @@ -529,7 +513,7 @@ struct xfrm_state *xfrm_state_alloc(void) if (x) { atomic_set(&x->refcnt, 1); atomic_set(&x->tunnel_users, 0); - INIT_LIST_HEAD(&x->all); + INIT_LIST_HEAD(&x->km.all); INIT_HLIST_NODE(&x->bydst); INIT_HLIST_NODE(&x->bysrc); INIT_HLIST_NODE(&x->byspi); @@ -556,7 +540,7 @@ void __xfrm_state_destroy(struct xfrm_state *x) WARN_ON(x->km.state != XFRM_STATE_DEAD); spin_lock_bh(&xfrm_state_gc_lock); - list_add_tail(&x->gclist, &xfrm_state_gc_list); + hlist_add_head(&x->gclist, &xfrm_state_gc_list); spin_unlock_bh(&xfrm_state_gc_lock); schedule_work(&xfrm_state_gc_work); } @@ -569,8 +553,7 @@ int __xfrm_state_delete(struct xfrm_state *x) if (x->km.state != XFRM_STATE_DEAD) { x->km.state = XFRM_STATE_DEAD; spin_lock(&xfrm_state_lock); - x->lastused = xfrm_state_walk_ongoing; - list_del_rcu(&x->all); + list_del(&x->km.all); hlist_del(&x->bydst); hlist_del(&x->bysrc); if (x->id.spi) @@ -871,7 +854,7 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr, if (km_query(x, tmpl, pol) == 0) { x->km.state = XFRM_STATE_ACQ; - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); hlist_add_head(&x->bydst, xfrm_state_bydst+h); h = xfrm_src_hash(daddr, saddr, family); hlist_add_head(&x->bysrc, xfrm_state_bysrc+h); @@ -940,7 +923,7 @@ static void __xfrm_state_insert(struct xfrm_state *x) x->genid = ++xfrm_state_genid; - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); h = xfrm_dst_hash(&x->id.daddr, &x->props.saddr, x->props.reqid, x->props.family); @@ -1069,7 +1052,7 @@ static struct xfrm_state *__find_acq_core(unsigned short family, u8 mode, u32 re xfrm_state_hold(x); x->timer.expires = jiffies + sysctl_xfrm_acq_expires*HZ; add_timer(&x->timer); - list_add_tail(&x->all, &xfrm_state_all); + list_add(&x->km.all, &xfrm_state_all); hlist_add_head(&x->bydst, xfrm_state_bydst+h); h = xfrm_src_hash(daddr, saddr, family); hlist_add_head(&x->bysrc, xfrm_state_bysrc+h); @@ -1566,79 +1549,59 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, int (*func)(struct xfrm_state *, int, void*), void *data) { - struct xfrm_state *old, *x, *last = 
NULL; + struct xfrm_state *state; + struct xfrm_state_walk *x; int err = 0; - if (walk->state == NULL && walk->count != 0) + if (walk->seq != 0 && list_empty(&walk->all)) return 0; - old = x = walk->state; - walk->state = NULL; spin_lock_bh(&xfrm_state_lock); - if (x == NULL) - x = list_first_entry(&xfrm_state_all, struct xfrm_state, all); + if (list_empty(&walk->all)) + x = list_first_entry(&xfrm_state_all, struct xfrm_state_walk, all); + else + x = list_entry(&walk->all, struct xfrm_state_walk, all); list_for_each_entry_from(x, &xfrm_state_all, all) { - if (x->km.state == XFRM_STATE_DEAD) + if (x->state == XFRM_STATE_DEAD) continue; - if (!xfrm_id_proto_match(x->id.proto, walk->proto)) + state = container_of(x, struct xfrm_state, km); + if (!xfrm_id_proto_match(state->id.proto, walk->proto)) continue; - if (last) { - err = func(last, walk->count, data); - if (err) { - xfrm_state_hold(last); - walk->state = last; - goto out; - } + err = func(state, walk->seq, data); + if (err) { + list_move_tail(&walk->all, &x->all); + goto out; } - last = x; - walk->count++; + walk->seq++; } - if (walk->count == 0) { + if (walk->seq == 0) { err = -ENOENT; goto out; } - if (last) - err = func(last, 0, data); + list_del_init(&walk->all); out: spin_unlock_bh(&xfrm_state_lock); - if (old != NULL) - xfrm_state_put(old); return err; } EXPORT_SYMBOL(xfrm_state_walk); void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto) { + INIT_LIST_HEAD(&walk->all); walk->proto = proto; - walk->state = NULL; - walk->count = 0; - list_add_tail(&walk->list, &xfrm_state_walks); - walk->genid = ++xfrm_state_walk_ongoing; + walk->state = XFRM_STATE_DEAD; + walk->seq = 0; } EXPORT_SYMBOL(xfrm_state_walk_init); void xfrm_state_walk_done(struct xfrm_state_walk *walk) { - struct list_head *prev; - - if (walk->state != NULL) { - xfrm_state_put(walk->state); - walk->state = NULL; - } - - prev = walk->list.prev; - list_del(&walk->list); - - if (prev != &xfrm_state_walks) { - list_entry(prev, struct xfrm_state_walk, list)->genid = - walk->genid; + if (list_empty(&walk->all)) return; - } - - xfrm_state_walk_completed = walk->genid; - if (!list_empty(&xfrm_state_gc_leftovers)) - schedule_work(&xfrm_state_gc_work); + spin_lock_bh(&xfrm_state_lock); + list_del(&walk->all); + spin_unlock_bh(&xfrm_state_lock); } EXPORT_SYMBOL(xfrm_state_walk_done); diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c index 04c4150..76f75df 100644 --- a/net/xfrm/xfrm_user.c +++ b/net/xfrm/xfrm_user.c @@ -1102,7 +1102,7 @@ static struct xfrm_policy *xfrm_policy_construct(struct xfrm_userpolicy_info *p, return xp; error: *errp = err; - xp->dead = 1; + xp->walk.dead = 1; xfrm_policy_destroy(xp); return NULL; } @@ -1595,7 +1595,7 @@ static int xfrm_add_pol_expire(struct sk_buff *skb, struct nlmsghdr *nlh, return -ENOENT; read_lock(&xp->lock); - if (xp->dead) { + if (xp->walk.dead) { read_unlock(&xp->lock); goto out; } ^ permalink raw reply related [flat|nested] 95+ messages in thread
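The mechanism, stripped of the xfrm details: the walker is itself a node on the same list as the real entries, and when a pass cannot continue it is parked in front of the first entry that did not fit, so the next pass resumes exactly there. Walker nodes are skipped during iteration (in the patch this falls out of the walker's state field being initialized to XFRM_STATE_DEAD). A minimal userspace sketch of the pattern, with a toy list implementation and illustrative names rather than the kernel's list.h:

#include <stdio.h>

/* Toy doubly-linked list, standing in for the kernel's list.h. */
struct node { struct node *prev, *next; int is_walker; int id; };

static void list_init(struct node *n) { n->prev = n->next = n; }

static void list_del(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

static void list_add_before(struct node *n, struct node *pos)
{
	n->prev = pos->prev;
	n->next = pos;
	pos->prev->next = n;
	pos->prev = n;
}

/* Visit up to 'budget' entries, then park the walker in front of the
 * first entry that did not fit; a later call resumes from the walker.
 * Returns nonzero while there is more to dump. */
static int walk(struct node *head, struct node *walker, int budget)
{
	struct node *pos, *next;
	int n = 0;

	if (walker->next != walker) {		/* parked: resume there */
		pos = walker->next;
		list_del(walker);
	} else {				/* fresh walk           */
		pos = head->next;
	}

	for (; pos != head; pos = next) {
		next = pos->next;
		if (pos->is_walker)		/* skip other walkers   */
			continue;
		if (n == budget) {
			list_add_before(walker, pos);
			return 1;
		}
		printf("visit %d\n", pos->id);
		n++;
	}
	return 0;
}

int main(void)
{
	struct node head, walker = { 0, 0, 1, -1 }, e[5];
	int i;

	list_init(&head);
	list_init(&walker);
	for (i = 0; i < 5; i++) {
		e[i].is_walker = 0;
		e[i].id = i;
		list_add_before(&e[i], &head);	/* add at tail */
	}
	while (walk(&head, &walker, 2))		/* 2 entries per pass */
		printf("-- walker parked, resuming --\n");
	return 0;
}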
* Re: xfrm_state locking regression... 2008-09-25 12:12 ` Timo Teräs @ 2008-09-25 12:36 ` Timo Teräs 2008-09-26 2:08 ` Herbert Xu 2008-10-01 10:07 ` David Miller 0 siblings, 2 replies; 95+ messages in thread From: Timo Teräs @ 2008-09-25 12:36 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, netdev, jamal Timo Teräs wrote: > Herbert Xu wrote: >> Thanks! I've got some other work to do so feel free to post this >> tomorrow. > > It was simpler than I thought (and I thought I would be more busy > with other things). Anyway, here goes. This is a single patch that > modifies both xfrm_state and xfrm_policy dumping. This was to get > af_key patch nicer (the final skb patching/sending is common code). One more thing I noticed. If the af_key socket is closed while dumping is ongoing the current code can leak stuff. We should call the .done. And also in the new patch, release the stored skb. The right place to do that would be pfkey_release, I assume? - Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
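What that cleanup might look like, hedged accordingly: the dump.* fields follow the patch above, pfkey_terminate_dump() is a hypothetical helper name, and calling it from pfkey_release() is exactly the open question here, not merged code.

/* Hypothetical helper; would run once per socket from pfkey_release(),
 * before the final sock_put(). */
static void pfkey_terminate_dump(struct pfkey_sock *pfk)
{
	if (pfk->dump.dump) {
		if (pfk->dump.skb) {
			kfree_skb(pfk->dump.skb);  /* stored partial message */
			pfk->dump.skb = NULL;
		}
		pfk->dump.done(pfk);  /* e.g. unlinks the parked walker */
		pfk->dump.dump = NULL;
		pfk->dump.done = NULL;
	}
}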
* Re: xfrm_state locking regression... 2008-09-25 12:36 ` Timo Teräs @ 2008-09-26 2:08 ` Herbert Xu 2008-10-01 10:07 ` David Miller 1 sibling, 0 replies; 95+ messages in thread From: Herbert Xu @ 2008-09-26 2:08 UTC (permalink / raw) To: Timo Teräs; +Cc: David Miller, netdev, jamal On Thu, Sep 25, 2008 at 03:36:04PM +0300, Timo Teräs wrote: > > The right place to do that would be pfkey_release, I assume? Yes that sounds right. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-25 12:36 ` Timo Teräs 2008-09-26 2:08 ` Herbert Xu @ 2008-10-01 10:07 ` David Miller 2008-10-01 14:05 ` Herbert Xu 1 sibling, 1 reply; 95+ messages in thread From: David Miller @ 2008-10-01 10:07 UTC (permalink / raw) To: timo.teras; +Cc: herbert, netdev, hadi From: Timo Teräs <timo.teras@iki.fi> Date: Thu, 25 Sep 2008 15:36:04 +0300 > Timo Teräs wrote: > > Herbert Xu wrote: > >> Thanks! I've got some other work to do so feel free to post this > >> tomorrow. > > > > It was simpler than I thought (and I thought I would be more busy > > with other things). Anyway, here goes. This is a single patch that > > modifies both xfrm_state and xfrm_policy dumping. This was to get > > af_key patch nicer (the final skb patching/sending is common code). > > One more thing I noticed. If the af_key socket is closed while > dumping is ongoing the current code can leak stuff. We should call > the .done. And also in the new patch, release the stored skb. > > The right place to do that would be pfkey_release, I assume? I reread this thread, and the most recent patch seems good and even tested :-) But it seems that you want to make some more tweaks, right? Please post explicitly the patch you want me to apply to net-next-2.6 once you reach that point, thanks! ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-10-01 10:07 ` David Miller @ 2008-10-01 14:05 ` Herbert Xu 0 siblings, 0 replies; 95+ messages in thread From: Herbert Xu @ 2008-10-01 14:05 UTC (permalink / raw) To: David Miller; +Cc: timo.teras, netdev, hadi On Wed, Oct 01, 2008 at 03:07:53AM -0700, David Miller wrote: > > > The right place to do that would be pfkey_release, I assume? > > I reread this thread, and the most recent patch seems good > and even tested :-) But it seems that you want to make some > more tweaks, right? > > Please post explicitly the patch you want me to apply to > net-next-2.6 once you reach that point, thanks! Yes I think Timo is still touching it up. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-22 11:42 ` Herbert Xu 2008-09-22 13:01 ` Timo Teräs @ 2008-09-23 2:48 ` David Miller 1 sibling, 0 replies; 95+ messages in thread From: David Miller @ 2008-09-23 2:48 UTC (permalink / raw) To: herbert; +Cc: timo.teras, netdev From: Herbert Xu <herbert@gondor.apana.org.au> Date: Mon, 22 Sep 2008 19:42:56 +0800 > ipsec: Fix xfrm_state_walk race > > As discovered by Timo Teräs, the current xfrm_state_walk scheme > is racy because if a second dump finishes before the first, we > may free xfrm states that the first dump would walk over later. > > This patch fixes this by storing the dumps in a list in order > to calculate the correct completion counter which cures this > problem. > > I've expanded netlink_cb in order to accommodate the extra state > related to this. It shouldn't be a big deal since netlink_cb > is kmalloced for each dump and we're just increasing it by 4 or > 8 bytes. > > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Looks good, applied to net-next-2.6 ^ permalink raw reply [flat|nested] 95+ messages in thread
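The case this cures is out-of-order completion: if dump #2 finishes before dump #1, the completed counter must not jump past #1, or states parked for #1 could be freed under it. With the dumps themselves on a list, a finishing walk that still has an older neighbour hands its generation number backwards instead of publishing it. A userspace toy of that hand-off (illustrative names):

#include <stdio.h>

struct walk { struct walk *prev, *next; unsigned long genid; };

static struct walk walks = { &walks, &walks, 0 };	/* list head */
static unsigned long walk_ongoing, walk_completed;

static void walk_start(struct walk *w)
{
	w->genid = ++walk_ongoing;
	w->prev = walks.prev;			/* add at tail */
	w->next = &walks;
	walks.prev->next = w;
	walks.prev = w;
}

static void walk_done(struct walk *w)
{
	struct walk *prev = w->prev;

	w->prev->next = w->next;		/* unlink */
	w->next->prev = w->prev;

	if (prev != &walks)
		prev->genid = w->genid;		/* older walk inherits it */
	else
		walk_completed = w->genid;	/* oldest walk: publish   */
	printf("completed = %lu\n", walk_completed);
}

int main(void)
{
	struct walk a, b;

	walk_start(&a);		/* genid 1 */
	walk_start(&b);		/* genid 2 */
	walk_done(&b);		/* out of order: completed stays 0    */
	walk_done(&a);		/* oldest done: completed jumps to 2  */
	return 0;
}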
* Re: xfrm_state locking regression... 2008-09-03 5:50 ` Herbert Xu 2008-09-03 6:14 ` David Miller @ 2008-09-10 3:04 ` David Miller 2008-09-10 3:15 ` Herbert Xu 1 sibling, 1 reply; 95+ messages in thread From: David Miller @ 2008-09-10 3:04 UTC (permalink / raw) To: herbert; +Cc: timo.teras, netdev From: Herbert Xu <herbert@gondor.apana.org.au> Date: Wed, 3 Sep 2008 15:50:41 +1000 > ipsec: Move state dump list unlinking into GC > > This patch moves the unlinking of x->all (for dumping) into the > GC so that we know for sure what locks we hold when we unlink the > entry. > > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> After your patch I just applied, things are a bit different now. Are you going to respin this or is it not relevant any longer? > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c > index 89551b6..0244325 100644 > --- a/net/xfrm/xfrm_state.c > +++ b/net/xfrm/xfrm_state.c > @@ -380,6 +380,10 @@ static void xfrm_put_mode(struct xfrm_mode *mode) > > static void xfrm_state_gc_destroy(struct xfrm_state *x) > { > + spin_lock_bh(&xfrm_state_lock); > + list_del(&x->all); > + spin_unlock_bh(&xfrm_state_lock); > + > del_timer_sync(&x->timer); > del_timer_sync(&x->rtimer); > kfree(x->aalg); > @@ -540,10 +544,6 @@ void __xfrm_state_destroy(struct xfrm_state *x) > { > WARN_ON(x->km.state != XFRM_STATE_DEAD); > > - spin_lock_bh(&xfrm_state_lock); > - list_del(&x->all); > - spin_unlock_bh(&xfrm_state_lock); > - > spin_lock_bh(&xfrm_state_gc_lock); > hlist_add_head(&x->bydst, &xfrm_state_gc_list); > spin_unlock_bh(&xfrm_state_gc_lock); > > Thanks, > -- > Visit Openswan at http://www.openswan.org/ > Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> > Home Page: http://gondor.apana.org.au/~herbert/ > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-10 3:04 ` David Miller @ 2008-09-10 3:15 ` Herbert Xu 2008-09-10 3:22 ` David Miller 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-10 3:15 UTC (permalink / raw) To: David Miller; +Cc: timo.teras, netdev On Tue, Sep 09, 2008 at 08:04:00PM -0700, David Miller wrote: > From: Herbert Xu <herbert@gondor.apana.org.au> > Date: Wed, 3 Sep 2008 15:50:41 +1000 > > > ipsec: Move state dump list unlinking into GC > > > > This patch moves the unlinking of x->all (for dumping) into the > > GC so that we know for sure what locks we hold when we unlink the > > entry. > > > > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> > > After your patch I just applied, things are a bit different now. > > Are you going to respin this or is it not relevant any longer? This patch is now redundant. What could be done though is to remove x->all and restore the original walking by hashing. This would eliminate a list node from xfrm_state, and would also restore the original dump ordering. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-10 3:15 ` Herbert Xu @ 2008-09-10 3:22 ` David Miller 2008-09-10 3:23 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: David Miller @ 2008-09-10 3:22 UTC (permalink / raw) To: herbert; +Cc: timo.teras, netdev From: Herbert Xu <herbert@gondor.apana.org.au> Date: Wed, 10 Sep 2008 13:15:59 +1000 > This patch is now redundant. Ok. > What could be done though is to remove x->all and restore the > original walking by hashing. This would eliminate a list node > from xfrm_state, and would also restore the original dump ordering. Yes I remember you mentioning that. If you're too busy I can try to spin this idea into a patch. ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-10 3:22 ` David Miller @ 2008-09-10 3:23 ` Herbert Xu 2008-09-10 3:38 ` David Miller 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-10 3:23 UTC (permalink / raw) To: David Miller; +Cc: timo.teras, netdev On Tue, Sep 09, 2008 at 08:22:15PM -0700, David Miller wrote: > > > What could be done though is to remove x->all and restore the > > original walking by hashing. This would eliminate a list node > > from xfrm_state, and would also restore the original dump ordering. > > Yes I remember you mentioning that. > > If you're too busy I can try to spin this idea into a patch. That would be very much appreciated. Thanks Dave! -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-10 3:23 ` Herbert Xu @ 2008-09-10 3:38 ` David Miller 2008-09-10 4:01 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: David Miller @ 2008-09-10 3:38 UTC (permalink / raw) To: herbert; +Cc: timo.teras, netdev From: Herbert Xu <herbert@gondor.apana.org.au> Date: Wed, 10 Sep 2008 13:23:04 +1000 > On Tue, Sep 09, 2008 at 08:22:15PM -0700, David Miller wrote: > > > > > What could be done though is to remove x->all and restore the > > > original walking by hashing. This would eliminate a list node > > > from xfrm_state, and would also restore the original dump ordering. > > > > Yes I remember you mentioning that. > > > > If you're too busy I can try to spin this idea into a patch. > > That would be very much appreciated. Thanks Dave! No problem. It might be a little bit of a chore because this new walker design is intimately tied to the af_key non-atomic dump changes. ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-10 3:38 ` David Miller @ 2008-09-10 4:01 ` Herbert Xu 2008-09-10 4:06 ` David Miller 0 siblings, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-10 4:01 UTC (permalink / raw) To: David Miller; +Cc: timo.teras, netdev On Tue, Sep 09, 2008 at 08:38:08PM -0700, David Miller wrote: > > No problem. It might be a little bit of a chore because this new > walker design is intimately tied to the af_key non-atomic dump > changes. Actually I was mistaken as to how the original dump worked. I'd thought that it actually kept track of which bucket it was in and resumed from that bucket. However in reality it only had a global counter and would always start walking from the beginning up until the counted value. So it isn't as easy as just copying the old code across :) While it is possible to restore the old order and save a little bit of memory, I certainly don't think this is very urgent. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
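In other words, the old resume cookie was just a count of matching entries already emitted, and every continuation re-walks from the front to skip that many: correct in the absence of churn, but O(n) per continuation and O(n^2) for a full dump, the cost Timo cites later in the thread. A sketch of that style (illustrative):

#include <stdio.h>

/* Resume-by-count: each continuation re-walks from the front and skips
 * the entries already dumped.  The skip must count entries (in the
 * kernel, entries matching the requested proto), not index into an
 * array, because the set can change between calls. */
static int dump_chunk(const int *tab, int n, int *count, int budget)
{
	int i, skipped = 0, emitted = 0;

	for (i = 0; i < n; i++) {
		if (skipped++ < *count)		/* O(n) skip every call */
			continue;
		if (emitted == budget)
			return 1;		/* more left to dump    */
		printf("entry %d\n", tab[i]);
		emitted++;
		(*count)++;
	}
	return 0;				/* dump complete        */
}

int main(void)
{
	int tab[] = { 10, 20, 30, 40, 50 };
	int count = 0;

	while (dump_chunk(tab, 5, &count, 2))	/* 2 entries per pass */
		;
	return 0;
}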
* Re: xfrm_state locking regression... 2008-09-10 4:01 ` Herbert Xu @ 2008-09-10 4:06 ` David Miller 2008-09-10 4:22 ` Herbert Xu 2008-09-10 5:16 ` Timo Teräs 0 siblings, 2 replies; 95+ messages in thread From: David Miller @ 2008-09-10 4:06 UTC (permalink / raw) To: herbert; +Cc: timo.teras, netdev From: Herbert Xu <herbert@gondor.apana.org.au> Date: Wed, 10 Sep 2008 14:01:07 +1000 > On Tue, Sep 09, 2008 at 08:38:08PM -0700, David Miller wrote: > > > > No problem. It might be a little bit of a chore because this new > > walker design is intimately tied to the af_key non-atomic dump > > changes. > > Actually I was mistaken as to how the original dump worked. I'd > thought that it actually kept track of which bucket it was in and > resumed from that bucket. However in reality it only had a global > counter and would always start walking from the beginning up until > the counted value. So it isn't as easy as just copying the old > code across :) Only AF_KEY gave an error, and this ended the dump. This was one of Timo's goals, to make AF_KEY continue where it left off in subsequent dump calls done by the user, when we hit the socket limit. > While it is possible to restore the old order and save a little > bit of memory, I certainly don't think this is very urgent. Too late, I already implemented it :-) I'm about to give the following patch a test: ipsec: Restore hash based xfrm_state dumping. Get rid of ->all member of struct xfrm_state, and just use a hash iteration like we used before. This shrinks the size of struct xfrm_state, and also restores the dump ordering of 2.6.25 and previous. Signed-off-by: David S. Miller <davem@davemloft.net> diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 4bb9499..ba0e47c 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -120,7 +120,6 @@ extern struct mutex xfrm_cfg_mutex; /* Full description of state of transformer. */ struct xfrm_state { - struct list_head all; union { struct list_head gclist; struct hlist_node bydst; @@ -1247,6 +1246,7 @@ struct xfrm6_tunnel { struct xfrm_state_walk { struct xfrm_state *state; + unsigned int chain; int count; u8 proto; }; @@ -1283,9 +1283,10 @@ extern int xfrm_proc_init(void); static inline void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto) { - walk->proto = proto; walk->state = NULL; + walk->chain = 0; walk->count = 0; + walk->proto = proto; } extern int xfrm_state_walk(struct xfrm_state_walk *walk, diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index aaafcee..ccf8492 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -50,7 +50,6 @@ static DEFINE_SPINLOCK(xfrm_state_lock); * Main use is finding SA after policy selected tunnel or transport mode. * Also, it can be used by ah/esp icmp error handler to find offending SA. 
*/ -static LIST_HEAD(xfrm_state_all); static struct hlist_head *xfrm_state_bydst __read_mostly; static struct hlist_head *xfrm_state_bysrc __read_mostly; static struct hlist_head *xfrm_state_byspi __read_mostly; @@ -525,7 +524,6 @@ struct xfrm_state *xfrm_state_alloc(void) if (x) { atomic_set(&x->refcnt, 1); atomic_set(&x->tunnel_users, 0); - INIT_LIST_HEAD(&x->all); INIT_HLIST_NODE(&x->bydst); INIT_HLIST_NODE(&x->bysrc); INIT_HLIST_NODE(&x->byspi); @@ -566,7 +564,6 @@ int __xfrm_state_delete(struct xfrm_state *x) x->km.state = XFRM_STATE_DEAD; spin_lock(&xfrm_state_lock); x->lastused = xfrm_state_walk_ongoing; - list_del_rcu(&x->all); hlist_del(&x->bydst); hlist_del(&x->bysrc); if (x->id.spi) @@ -867,7 +864,6 @@ xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr, if (km_query(x, tmpl, pol) == 0) { x->km.state = XFRM_STATE_ACQ; - list_add_tail(&x->all, &xfrm_state_all); hlist_add_head(&x->bydst, xfrm_state_bydst+h); h = xfrm_src_hash(daddr, saddr, family); hlist_add_head(&x->bysrc, xfrm_state_bysrc+h); @@ -936,8 +932,6 @@ static void __xfrm_state_insert(struct xfrm_state *x) x->genid = ++xfrm_state_genid; - list_add_tail(&x->all, &xfrm_state_all); - h = xfrm_dst_hash(&x->id.daddr, &x->props.saddr, x->props.reqid, x->props.family); hlist_add_head(&x->bydst, xfrm_state_bydst+h); @@ -1065,7 +1059,6 @@ static struct xfrm_state *__find_acq_core(unsigned short family, u8 mode, u32 re xfrm_state_hold(x); x->timer.expires = jiffies + sysctl_xfrm_acq_expires*HZ; add_timer(&x->timer); - list_add_tail(&x->all, &xfrm_state_all); hlist_add_head(&x->bydst, xfrm_state_bydst+h); h = xfrm_src_hash(daddr, saddr, family); hlist_add_head(&x->bysrc, xfrm_state_bysrc+h); @@ -1563,7 +1556,7 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, void *data) { struct xfrm_state *old, *x, *last = NULL; - int err = 0; + int i, err = 0; if (walk->state == NULL && walk->count != 0) return 0; @@ -1571,24 +1564,29 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, old = x = walk->state; walk->state = NULL; spin_lock_bh(&xfrm_state_lock); - if (x == NULL) - x = list_first_entry(&xfrm_state_all, struct xfrm_state, all); - list_for_each_entry_from(x, &xfrm_state_all, all) { - if (x->km.state == XFRM_STATE_DEAD) - continue; - if (!xfrm_id_proto_match(x->id.proto, walk->proto)) - continue; - if (last) { - err = func(last, walk->count, data); - if (err) { - xfrm_state_hold(last); - walk->state = last; - xfrm_state_walk_ongoing++; - goto out; + for (i = walk->chain; i <= xfrm_state_hmask; i++) { + struct hlist_head *h = xfrm_state_bydst + i; + struct hlist_node *entry; + + if (!x) + x = hlist_entry(h->first, struct xfrm_state, bydst); + hlist_for_each_entry_from(x, entry, bydst) { + if (!xfrm_id_proto_match(x->id.proto, walk->proto)) + continue; + if (last) { + err = func(last, walk->count, data); + if (err) { + xfrm_state_hold(last); + walk->state = last; + walk->chain = i; + xfrm_state_walk_ongoing++; + goto out; + } } + last = x; + walk->count++; } - last = x; - walk->count++; + x = NULL; } if (walk->count == 0) { err = -ENOENT; ^ permalink raw reply related [flat|nested] 95+ messages in thread
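The resume cookie in this version is the (chain, state) pair: finish the saved chain from the saved node, then carry on through the remaining buckets. A userspace toy of the same two-level resume (toy hash table; as the replies below show, the cookie is only meaningful while the table is not resized):

#include <stdio.h>

#define NBUCKETS 4

struct node { struct node *next; int id; };

/* Resume cookie: which chain we stopped in, and where within it. */
struct walkpos { int chain; struct node *node; };

static int walk(struct node *tab[], struct walkpos *w, int budget)
{
	int i, emitted = 0;

	for (i = w->chain; i < NBUCKETS; i++) {
		struct node *n = w->node ? w->node : tab[i];

		w->node = NULL;
		for (; n; n = n->next) {
			if (emitted == budget) {
				w->chain = i;	/* park mid-chain */
				w->node = n;
				return 1;
			}
			printf("bucket %d: id %d\n", i, n->id);
			emitted++;
		}
	}
	return 0;
}

int main(void)
{
	struct node n1 = { NULL, 1 }, n2 = { &n1, 2 }, n3 = { NULL, 3 };
	struct node *tab[NBUCKETS] = { &n2, NULL, &n3, NULL };
	struct walkpos w = { 0, NULL };

	while (walk(tab, &w, 2))
		;
	return 0;
}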
* Re: xfrm_state locking regression... 2008-09-10 4:06 ` David Miller @ 2008-09-10 4:22 ` Herbert Xu 2008-09-10 4:24 ` David Miller 2008-09-10 5:16 ` Timo Teräs 1 sibling, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-10 4:22 UTC (permalink / raw) To: David Miller; +Cc: timo.teras, netdev On Tue, Sep 09, 2008 at 09:06:54PM -0700, David Miller wrote: > > Only AF_KEY gave an error, and this ended the dump. This was > one of Timo's goals, to make AF_KEY continue where it left > off in subsequent dump calls done by the user, when we hit the > socket limit. I see. > > While it is possible to restore the old order and save a little > > bit of memory, I certainly don't think this is very urgent. > > Too late, I already implemented it :-) Awesome :) Your patch is much simpler than I thought. > diff --git a/include/net/xfrm.h b/include/net/xfrm.h > index 4bb9499..ba0e47c 100644 > --- a/include/net/xfrm.h > +++ b/include/net/xfrm.h > @@ -120,7 +120,6 @@ extern struct mutex xfrm_cfg_mutex; > /* Full description of state of transformer. */ > struct xfrm_state > { > - struct list_head all; > union { > struct list_head gclist; > struct hlist_node bydst; This union now needs to move across to bysrc or byspi since bydst needs to be preserved for walking even after a node is added to the GC. > @@ -566,7 +564,6 @@ int __xfrm_state_delete(struct xfrm_state *x) > x->km.state = XFRM_STATE_DEAD; > spin_lock(&xfrm_state_lock); > x->lastused = xfrm_state_walk_ongoing; > - list_del_rcu(&x->all); > hlist_del(&x->bydst); We need hlist_del_rcu on bydst for the same reason. > @@ -1571,24 +1564,29 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, > old = x = walk->state; > walk->state = NULL; > spin_lock_bh(&xfrm_state_lock); > - if (x == NULL) > - x = list_first_entry(&xfrm_state_all, struct xfrm_state, all); > - list_for_each_entry_from(x, &xfrm_state_all, all) { > - if (x->km.state == XFRM_STATE_DEAD) > - continue; > - if (!xfrm_id_proto_match(x->id.proto, walk->proto)) > - continue; > - if (last) { > - err = func(last, walk->count, data); > - if (err) { > - xfrm_state_hold(last); > - walk->state = last; > - xfrm_state_walk_ongoing++; > - goto out; > + for (i = walk->chain; i <= xfrm_state_hmask; i++) { > + struct hlist_head *h = xfrm_state_bydst + i; > + struct hlist_node *entry; > + > + if (!x) > + x = hlist_entry(h->first, struct xfrm_state, bydst); > + hlist_for_each_entry_from(x, entry, bydst) { > + if (!xfrm_id_proto_match(x->id.proto, walk->proto)) > + continue; > + if (last) { > + err = func(last, walk->count, data); > + if (err) { This is a bug in the original patch. When we get an error on the very last node we'll lose that node instead of dumping it later. I wonder if we could just get rid of last altogether... Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
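The union point, made concrete: with { gclist, bydst } sharing storage, queueing a dead state for GC scribbles over the bydst linkage that a parked walker may still be standing on. A small demonstration that storing through one union member clobbers the other:

#include <stdio.h>

struct links { void *next, *prev; };

struct state {
	union {
		struct links gclist;	/* written when queued for GC    */
		struct links bydst;	/* ...still in use by the walker */
	};
};

int main(void)
{
	struct state s;
	int anchor;

	s.bydst.next = &anchor;		/* walker-visible chain pointer */
	s.gclist.next = NULL;		/* state queued for GC          */
	printf("bydst.next is now %p\n", s.bydst.next);	/* clobbered */
	return 0;
}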
* Re: xfrm_state locking regression... 2008-09-10 4:22 ` Herbert Xu @ 2008-09-10 4:24 ` David Miller 2008-09-10 4:48 ` David Miller 0 siblings, 1 reply; 95+ messages in thread From: David Miller @ 2008-09-10 4:24 UTC (permalink / raw) To: herbert; +Cc: timo.teras, netdev From: Herbert Xu <herbert@gondor.apana.org.au> Date: Wed, 10 Sep 2008 14:22:47 +1000 > On Tue, Sep 09, 2008 at 09:06:54PM -0700, David Miller wrote: > > @@ -120,7 +120,6 @@ extern struct mutex xfrm_cfg_mutex; > > /* Full description of state of transformer. */ > > struct xfrm_state > > { > > - struct list_head all; > > union { > > struct list_head gclist; > > struct hlist_node bydst; > > This union now needs to move across to bysrc or byspi since bydst > needs to be preserved for walking even after a node is added to > the GC. That would explain the crash I just got :) > > @@ -566,7 +564,6 @@ int __xfrm_state_delete(struct xfrm_state *x) > > x->km.state = XFRM_STATE_DEAD; > > spin_lock(&xfrm_state_lock); > > x->lastused = xfrm_state_walk_ongoing; > > - list_del_rcu(&x->all); > > hlist_del(&x->bydst); > > We need hlist_del_rcu on bydst for the same reason. Ok, gotcha. > This is a bug in the original patch. When we get an error on > the very last node we'll lose that node instead of dumping it > later. I wonder if we could just get rid of last altogether... I'll take a look at this too after I resolve and test the above issues. Thanks. ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-10 4:24 ` David Miller @ 2008-09-10 4:48 ` David Miller 2008-09-10 4:52 ` David Miller 2008-09-10 4:53 ` Herbert Xu 0 siblings, 2 replies; 95+ messages in thread From: David Miller @ 2008-09-10 4:48 UTC (permalink / raw) To: herbert; +Cc: timo.teras, netdev From: David Miller <davem@davemloft.net> Date: Tue, 09 Sep 2008 21:24:26 -0700 (PDT) > From: Herbert Xu <herbert@gondor.apana.org.au> > Date: Wed, 10 Sep 2008 14:22:47 +1000 > > > On Tue, Sep 09, 2008 at 09:06:54PM -0700, David Miller wrote: > > > @@ -120,7 +120,6 @@ extern struct mutex xfrm_cfg_mutex; > > > /* Full description of state of transformer. */ > > > struct xfrm_state > > > { > > > - struct list_head all; > > > union { > > > struct list_head gclist; > > > struct hlist_node bydst; > > > > This union now needs to move across to bysrc or byspi since bydst > > needs to be preserved for walking even after a node is added to > > the GC. > > That would explain the crash I just got :) No, it doesn't explain the crash/hang. And it happens even without my patch. I get a hang in del_timer_sync() and it's from a workqueue so I suspect your patch :-) I can't see what does the list delete from the GC leftover list? Entries seem to stay on there forever. We even kfree() the object. The list iteration is: list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) { if ((long)(x->lastused - completed) > 0) break; xfrm_state_gc_destroy(x); } And xfrm_state_gc_destroy() doesn't do a list_del(&x->gclist) or similar. ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-10 4:48 ` David Miller @ 2008-09-10 4:52 ` David Miller 2008-09-10 4:53 ` Herbert Xu 1 sibling, 0 replies; 95+ messages in thread From: David Miller @ 2008-09-10 4:52 UTC (permalink / raw) To: herbert; +Cc: timo.teras, netdev From: David Miller <davem@davemloft.net> Date: Tue, 09 Sep 2008 21:48:29 -0700 (PDT) > The list iteration is: > > list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) { > if ((long)(x->lastused - completed) > 0) > break; > xfrm_state_gc_destroy(x); > } > > And xfrm_state_gc_destroy() doesn't do a list_del(&x->gclist) or similar. I'm now testing the following fix: diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index aaafcee..abbe270 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -423,6 +423,7 @@ static void xfrm_state_gc_task(struct work_struct *data) list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) { if ((long)(x->lastused - completed) > 0) break; + list_del(&x->gclist); xfrm_state_gc_destroy(x); } ^ permalink raw reply related [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-10 4:48 ` David Miller 2008-09-10 4:52 ` David Miller @ 2008-09-10 4:53 ` Herbert Xu 2008-09-10 5:21 ` David Miller 1 sibling, 1 reply; 95+ messages in thread From: Herbert Xu @ 2008-09-10 4:53 UTC (permalink / raw) To: David Miller; +Cc: timo.teras, netdev On Tue, Sep 09, 2008 at 09:48:29PM -0700, David Miller wrote: > > I can't see what does the list delete from the GC leftover list? Entries > seem to stay on there forever. We even kfree() the object. Doh! Yes there's supposed to be a list_del in there. Here's a fixed patch: ipsec: Use RCU-like construct for saved state within a walk Now that we save states within a walk we need synchronisation so that the list the saved state is on doesn't disappear from under us. As it stands this is done by keeping the state on the list which is bad because it gets in the way of the management of the state life-cycle. An alternative is to make our own pseudo-RCU system where we use counters to indicate which state can't be freed immediately as it may be referenced by an ongoing walk when that resumes. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 2933d74..4bb9499 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -120,9 +120,11 @@ extern struct mutex xfrm_cfg_mutex; /* Full description of state of transformer. */ struct xfrm_state { - /* Note: bydst is re-used during gc */ struct list_head all; - struct hlist_node bydst; + union { + struct list_head gclist; + struct hlist_node bydst; + }; struct hlist_node bysrc; struct hlist_node byspi; @@ -1286,16 +1288,9 @@ static inline void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto) walk->count = 0; } -static inline void xfrm_state_walk_done(struct xfrm_state_walk *walk) -{ - if (walk->state != NULL) { - xfrm_state_put(walk->state); - walk->state = NULL; - } -} - extern int xfrm_state_walk(struct xfrm_state_walk *walk, int (*func)(struct xfrm_state *, int, void*), void *); +extern void xfrm_state_walk_done(struct xfrm_state_walk *walk); extern struct xfrm_state *xfrm_state_alloc(void); extern struct xfrm_state *xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr, struct flowi *fl, struct xfrm_tmpl *tmpl, diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 7bd62f6..fe84a46 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -59,6 +59,11 @@ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024; static unsigned int xfrm_state_num; static unsigned int xfrm_state_genid; +/* Counter indicating ongoing walk, protected by xfrm_state_lock. */ +static unsigned long xfrm_state_walk_ongoing; +/* Counter indicating walk completion, protected by xfrm_cfg_mutex. 
*/ +static unsigned long xfrm_state_walk_completed; + static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned int family); static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo); @@ -191,7 +196,8 @@ static DEFINE_RWLOCK(xfrm_state_afinfo_lock); static struct xfrm_state_afinfo *xfrm_state_afinfo[NPROTO]; static struct work_struct xfrm_state_gc_work; -static HLIST_HEAD(xfrm_state_gc_list); +static LIST_HEAD(xfrm_state_gc_leftovers); +static LIST_HEAD(xfrm_state_gc_list); static DEFINE_SPINLOCK(xfrm_state_gc_lock); int __xfrm_state_delete(struct xfrm_state *x); @@ -403,17 +409,23 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x) static void xfrm_state_gc_task(struct work_struct *data) { - struct xfrm_state *x; - struct hlist_node *entry, *tmp; - struct hlist_head gc_list; + struct xfrm_state *x, *tmp; + unsigned long completed; + mutex_lock(&xfrm_cfg_mutex); spin_lock_bh(&xfrm_state_gc_lock); - gc_list.first = xfrm_state_gc_list.first; - INIT_HLIST_HEAD(&xfrm_state_gc_list); + list_splice_tail_init(&xfrm_state_gc_list, &xfrm_state_gc_leftovers); spin_unlock_bh(&xfrm_state_gc_lock); - hlist_for_each_entry_safe(x, entry, tmp, &gc_list, bydst) + completed = xfrm_state_walk_completed; + mutex_unlock(&xfrm_cfg_mutex); + + list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) { + if ((long)(x->lastused - completed) > 0) + break; + list_del(&x->gclist); xfrm_state_gc_destroy(x); + } wake_up(&km_waitq); } @@ -540,12 +552,8 @@ void __xfrm_state_destroy(struct xfrm_state *x) { WARN_ON(x->km.state != XFRM_STATE_DEAD); - spin_lock_bh(&xfrm_state_lock); - list_del(&x->all); - spin_unlock_bh(&xfrm_state_lock); - spin_lock_bh(&xfrm_state_gc_lock); - hlist_add_head(&x->bydst, &xfrm_state_gc_list); + list_add_tail(&x->gclist, &xfrm_state_gc_list); spin_unlock_bh(&xfrm_state_gc_lock); schedule_work(&xfrm_state_gc_work); } @@ -558,6 +566,8 @@ int __xfrm_state_delete(struct xfrm_state *x) if (x->km.state != XFRM_STATE_DEAD) { x->km.state = XFRM_STATE_DEAD; spin_lock(&xfrm_state_lock); + x->lastused = xfrm_state_walk_ongoing; + list_del_rcu(&x->all); hlist_del(&x->bydst); hlist_del(&x->bysrc); if (x->id.spi) @@ -1572,6 +1582,7 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, if (err) { xfrm_state_hold(last); walk->state = last; + xfrm_state_walk_ongoing++; goto out; } } @@ -1586,12 +1597,28 @@ int xfrm_state_walk(struct xfrm_state_walk *walk, err = func(last, 0, data); out: spin_unlock_bh(&xfrm_state_lock); - if (old != NULL) + if (old != NULL) { xfrm_state_put(old); + xfrm_state_walk_completed++; + if (!list_empty(&xfrm_state_gc_leftovers)) + schedule_work(&xfrm_state_gc_work); + } return err; } EXPORT_SYMBOL(xfrm_state_walk); +void xfrm_state_walk_done(struct xfrm_state_walk *walk) +{ + if (walk->state != NULL) { + xfrm_state_put(walk->state); + walk->state = NULL; + xfrm_state_walk_completed++; + if (!list_empty(&xfrm_state_gc_leftovers)) + schedule_work(&xfrm_state_gc_work); + } +} +EXPORT_SYMBOL(xfrm_state_walk_done); + void xfrm_replay_notify(struct xfrm_state *x, int event) { Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply related [flat|nested] 95+ messages in thread
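The pseudo-RCU scheme in miniature: a state deleted while some walk is suspended records the current ongoing generation, and the GC refuses to free it until the completed counter has caught up — the shape of an RCU grace period, with walk completions standing in for quiescent states. A userspace toy (illustrative names):

#include <stdio.h>

/* A state deleted while walk N was suspended must survive until every
 * walk up to N has finished. */
static unsigned long walk_ongoing;	/* bumped when a walk suspends */
static unsigned long walk_completed;	/* bumped when a walk finishes */

struct state { unsigned long lastused; int freed; };

static void state_delete(struct state *x)
{
	x->lastused = walk_ongoing;	/* newest walk that may see us */
}

static void gc(struct state *x)
{
	if (x->freed)
		return;
	if ((long)(x->lastused - walk_completed) <= 0) {
		x->freed = 1;
		printf("freed (lastused=%lu, completed=%lu)\n",
		       x->lastused, walk_completed);
	} else {
		printf("deferred (lastused=%lu, completed=%lu)\n",
		       x->lastused, walk_completed);
	}
}

int main(void)
{
	struct state x = { 0, 0 };

	walk_ongoing++;		/* dump #1 suspends mid-walk   */
	state_delete(&x);	/* state goes away under it    */
	gc(&x);			/* deferred: #1 may revisit x  */
	walk_completed++;	/* dump #1 finishes            */
	gc(&x);			/* now safe to free            */
	return 0;
}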
* Re: xfrm_state locking regression... 2008-09-10 4:53 ` Herbert Xu @ 2008-09-10 5:21 ` David Miller 0 siblings, 0 replies; 95+ messages in thread From: David Miller @ 2008-09-10 5:21 UTC (permalink / raw) To: herbert; +Cc: timo.teras, netdev From: Herbert Xu <herbert@gondor.apana.org.au> Date: Wed, 10 Sep 2008 14:53:50 +1000 > On Tue, Sep 09, 2008 at 09:48:29PM -0700, David Miller wrote: > > > > I can't see what does the list delete from the GC leftover list? Entries > > seem to stay on there forever. We even kfree() the object. > > Doh! Yes there's supposed to be a list_del in there. Your original change was already in net-next-2.6 so I committed the one-liner fix. Thanks. ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-10 4:06 ` David Miller 2008-09-10 4:22 ` Herbert Xu @ 2008-09-10 5:16 ` Timo Teräs 2008-09-10 5:23 ` David Miller 1 sibling, 1 reply; 95+ messages in thread From: Timo Teräs @ 2008-09-10 5:16 UTC (permalink / raw) To: David Miller; +Cc: herbert, netdev David Miller wrote: > From: Herbert Xu <herbert@gondor.apana.org.au> > Date: Wed, 10 Sep 2008 14:01:07 +1000 > >> On Tue, Sep 09, 2008 at 08:38:08PM -0700, David Miller wrote: >>> No problem. It might be a little bit of a chore because this new >>> walker design is intimately tied to the af_key non-atomic dump >>> changes. >> Actually I was mistaken as to how the original dump worked. I'd >> thought that it actually kept track of which bucket it was in and >> resumed from that bucket. However in reality it only had a global >> counter and would always start walking from the beginning up until >> the counted value. So it isn't as easy as just copying the old >> code across :) > > Only AF_KEY gave an error, and this ended the dump. This was > one of Timo's goals, to make AF_KEY continue where it left > off in subsequent dump calls done by the user, when we hit the > socket limit. Yes, that was one of the goals. The other goal was to speed up dumping in netlink from O(n^2) to O(n), and to fix the problem that netlink might fail to dump some entries (if an entry was removed from the beginning of the hash while dumping was ongoing). > ipsec: Restore hash based xfrm_state dumping. > > Get rid of ->all member of struct xfrm_state, and just use a hash > iteration like we used before. > > This shrinks the size of struct xfrm_state, and also restores the > dump ordering of 2.6.25 and previous. I think bad things will happen if the hash gets resized between xfrm_state_walk() calls. Since it assumes that the walk->state entry has remained in walk->chain bucket. The dumping order shouldn't really make any difference. But reducing struct xfrm_state is definitely good. The only downside is that when an entry is inserted to the beginning of the hash while dump is ongoing, it won't be dumped at all. But that is not really a problem since you get a notification about that entry separately. Cheers, Timo ^ permalink raw reply [flat|nested] 95+ messages in thread
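Concretely: bucket membership is hash & mask, so changing the mask scatters every chain, and a saved (chain, entry) cookie points into a layout that no longer exists. A few hash values under two masks (illustrative):

#include <stdio.h>

static unsigned int bucket(unsigned int hash, unsigned int mask)
{
	return hash & mask;
}

int main(void)
{
	unsigned int hashes[] = { 3, 9, 12, 17 };
	unsigned int i;

	for (i = 0; i < 4; i++)
		printf("hash %2u: bucket %2u (mask 7) -> %2u (mask 15)\n",
		       hashes[i], bucket(hashes[i], 7), bucket(hashes[i], 15));

	/* hash 9 moves from chain 1 to chain 9: a walk parked in chain 1
	 * before the resize will never see it again, while a walk parked
	 * in a later chain may see entries twice. */
	return 0;
}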
* Re: xfrm_state locking regression... 2008-09-10 5:16 ` Timo Teräs @ 2008-09-10 5:23 ` David Miller 2008-09-10 5:46 ` Herbert Xu 0 siblings, 1 reply; 95+ messages in thread From: David Miller @ 2008-09-10 5:23 UTC (permalink / raw) To: timo.teras; +Cc: herbert, netdev From: Timo Teräs <timo.teras@iki.fi> Date: Wed, 10 Sep 2008 08:16:12 +0300 > David Miller wrote: > > ipsec: Restore hash based xfrm_state dumping. > > > > Get rid of ->all member of struct xfrm_state, and just use a hash > > iteration like we used before. > > > > This shrinks the size of struct xfrm_state, and also restores the > > dump ordering of 2.6.25 and previous. > > I think bad things will happen if the hash gets resized between > xfrm_state_walk() calls. Since it assumes that the walk->state > entry has remained in walk->chain bucket. > > The dumping order shouldn't really make any difference. But > reducing struct xfrm_state is definitely good. The only downside > is that when an entry is inserted to the beginning of the hash > while dump is ongoing, it won't be dumped at all. But that is > not really a problem since you get a notification about that > entry separately. That's a good point, the hash resizing case. If the hash grows, we'll see some entries again. If it shrinks, we might skip some instead, which is very undesirable. Actually, thinking about it some more, both possibilities (skipping and dups) seem to be possible in both situations (growing and shrinking). Ehm... I think we should put this patch and idea aside for a little bit. :-) ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-10 5:23 ` David Miller @ 2008-09-10 5:46 ` Herbert Xu 0 siblings, 0 replies; 95+ messages in thread From: Herbert Xu @ 2008-09-10 5:46 UTC (permalink / raw) To: David Miller; +Cc: timo.teras, netdev On Tue, Sep 09, 2008 at 10:23:30PM -0700, David Miller wrote: > > That's a good point, the hash resizing case. > > If the hash grows, we'll see some entries again. > > If it shrinks, we might skip some instead, which is very undesirable. > > Actually, thinking about it some more, both possibilities (skipping > and dups) seem to be possible in both situations (growing and > shrinking). > > Ehm... I think we should put this patch and idea aside for a little bit. > :-) Yeah the hash resizing thing really kills this approach. In fact it seems that the original walker was written before resizing was added. I definitely agree that ordering isn't as important as showing the entries reliably. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: xfrm_state locking regression... 2008-09-03 5:07 ` Timo Teräs 2008-09-03 5:23 ` Herbert Xu 2008-09-03 5:39 ` Herbert Xu @ 2008-09-03 6:10 ` David Miller 2 siblings, 0 replies; 95+ messages in thread From: David Miller @ 2008-09-03 6:10 UTC (permalink / raw) To: timo.teras; +Cc: herbert, netdev From: Timo Teräs <timo.teras@iki.fi> Date: Wed, 03 Sep 2008 08:07:33 +0300 > Also alternatively the xfrm_state_all could be protected by a > different lock than xfrm_state_lock. Before your changes the state destroy was protected by no central locks at all. There was no need: once the reference count goes to zero, there are no external references to this object. But this is exactly what you added, an external reference that is not reference counted. This is yet another reason all of Herbert's objections to your location for ->all list add and removal are sound. So we now have to take a centralized lock twice when getting rid of an XFRM state object, which is potentially a scalability regression. ^ permalink raw reply [flat|nested] 95+ messages in thread
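The structural point: a plain refcount lets the final put free the object without any shared locks, since nothing else can reach it; the global all-states list reintroduces a reachable, uncounted pointer, so the zero-refcount path must now take the shared lock to sever it. A userspace toy of that obligation (illustrative; atomic refcounting elided):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* The global 'all' list holds a pointer to every object but takes no
 * reference of its own, so the last obj_put() can no longer just
 * free(): it must first take the shared list lock and unlink. */

struct obj { int refcnt; struct obj *next, **pprev; };

static pthread_mutex_t all_lock = PTHREAD_MUTEX_INITIALIZER;
static struct obj *all_list;

static struct obj *obj_new(void)
{
	struct obj *o = calloc(1, sizeof(*o));

	o->refcnt = 1;
	pthread_mutex_lock(&all_lock);
	o->next = all_list;
	o->pprev = &all_list;
	if (all_list)
		all_list->pprev = &o->next;
	all_list = o;
	pthread_mutex_unlock(&all_lock);
	return o;
}

static void obj_put(struct obj *o)
{
	if (--o->refcnt)
		return;
	pthread_mutex_lock(&all_lock);	/* the added centralized cost */
	*o->pprev = o->next;
	if (o->next)
		o->next->pprev = o->pprev;
	pthread_mutex_unlock(&all_lock);
	free(o);
}

int main(void)
{
	struct obj *o = obj_new();

	obj_put(o);	/* refcount hits zero: unlink under lock, free */
	printf("final put had to take the shared list lock\n");
	return 0;
}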