* [PATCH] xfrm: implement basic garbage collection for bundles
@ 2010-03-20 12:15 Timo Teras
2010-03-20 12:32 ` Herbert Xu
0 siblings, 1 reply; 8+ messages in thread
From: Timo Teras @ 2010-03-20 12:15 UTC (permalink / raw)
To: netdev; +Cc: Timo Teras, Neil Horman, Herbert Xu
The dst core calls garbage collection only from dst_alloc when
the dst entry threshold is exceeded. The xfrm core currently
checks bundles only on the NETDEV_DOWN event.

Previously this was not a big problem, since the xfrm gc threshold
was small and bundles were being generated all the time due to
another bug, so garbage collection ran frequently.

Since commit a33bc5c15154c835aae26f16e6a3a7d9ad4acb45
("xfrm: select sane defaults for xfrm[4|6] gc_thresh") we can have
large gc threshold sizes (>200000 on machines with a normal amount
of memory), so garbage collection does not get triggered under
normal circumstances. This can result in an enormous number of
stale bundles. Furthermore, each of these stale bundles keeps a
reference to ipv4/ipv6 rtable entries which have already been
garbage collected and put on the dst core's "destroy free'd dst's"
list. This list can grow very large, and the dst core's periodic
job can bring even a fast machine to its knees.
Signed-off-by: Timo Teras <timo.teras@iki.fi>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
---
net/xfrm/xfrm_policy.c | 20 +++++++++++++++++++-
1 files changed, 19 insertions(+), 1 deletions(-)
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 843e066..f3f3d36 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -44,6 +44,9 @@ static struct xfrm_policy_afinfo *xfrm_policy_afinfo[NPROTO];
static struct kmem_cache *xfrm_dst_cache __read_mostly;
+static int xfrm_gc_interval __read_mostly = 5 * 60 * HZ;
+static struct delayed_work expires_work;
+
static HLIST_HEAD(xfrm_policy_gc_list);
static DEFINE_SPINLOCK(xfrm_policy_gc_lock);
@@ -2203,6 +2206,16 @@ static int xfrm_flush_bundles(struct net *net)
return 0;
}
+static void xfrm_gc_worker_func(struct work_struct *work)
+{
+ struct net *net;
+
+ for_each_net(net)
+ xfrm_flush_bundles(net);
+
+ schedule_delayed_work(&expires_work, xfrm_gc_interval);
+}
+
static void xfrm_init_pmtu(struct dst_entry *dst)
{
do {
@@ -2498,8 +2511,13 @@ static int __net_init xfrm_policy_init(struct net *net)
INIT_LIST_HEAD(&net->xfrm.policy_all);
INIT_WORK(&net->xfrm.policy_hash_work, xfrm_hash_resize);
- if (net_eq(net, &init_net))
+ if (net_eq(net, &init_net)) {
register_netdevice_notifier(&xfrm_dev_notifier);
+
+ INIT_DELAYED_WORK_DEFERRABLE(&expires_work, xfrm_gc_worker_func);
+ schedule_delayed_work(&expires_work,
+ net_random() % xfrm_gc_interval + xfrm_gc_interval);
+ }
return 0;
out_bydst:
--
1.6.3.3
* Re: [PATCH] xfrm: implement basic garbage collection for bundles
2010-03-20 12:15 [PATCH] xfrm: implement basic garbage collection for bundles Timo Teras
@ 2010-03-20 12:32 ` Herbert Xu
2010-03-20 12:42 ` Timo Teräs
2010-03-20 13:26 ` Neil Horman
0 siblings, 2 replies; 8+ messages in thread
From: Herbert Xu @ 2010-03-20 12:32 UTC (permalink / raw)
To: Timo Teras; +Cc: netdev, Neil Horman
On Sat, Mar 20, 2010 at 02:15:41PM +0200, Timo Teras wrote:
> The dst core calls garbage collection only from dst_alloc when
> the dst entry threshold is exceeded. The xfrm core currently
> checks bundles only on the NETDEV_DOWN event.
>
> Previously this was not a big problem, since the xfrm gc threshold
> was small and bundles were being generated all the time due to
> another bug, so garbage collection ran frequently.
>
> Since commit a33bc5c15154c835aae26f16e6a3a7d9ad4acb45
> ("xfrm: select sane defaults for xfrm[4|6] gc_thresh") we can have
> large gc threshold sizes (>200000 on machines with a normal amount
> of memory), so garbage collection does not get triggered under
> normal circumstances. This can result in an enormous number of
> stale bundles. Furthermore, each of these stale bundles keeps a
> reference to ipv4/ipv6 rtable entries which have already been
> garbage collected and put on the dst core's "destroy free'd dst's"
> list. This list can grow very large, and the dst core's periodic
> job can bring even a fast machine to its knees.
So why do we need this larger threshold in the first place? Neil?
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH] xfrm: implement basic garbage collection for bundles
2010-03-20 12:32 ` Herbert Xu
@ 2010-03-20 12:42 ` Timo Teräs
2010-03-20 12:49 ` Herbert Xu
2010-03-20 13:26 ` Neil Horman
1 sibling, 1 reply; 8+ messages in thread
From: Timo Teräs @ 2010-03-20 12:42 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev, Neil Horman
Herbert Xu wrote:
> On Sat, Mar 20, 2010 at 02:15:41PM +0200, Timo Teras wrote:
>> The dst core calls garbage collection only from dst_alloc when
>> the dst entry threshold is exceeded. The xfrm core currently
>> checks bundles only on the NETDEV_DOWN event.
>>
>> Previously this was not a big problem, since the xfrm gc threshold
>> was small and bundles were being generated all the time due to
>> another bug, so garbage collection ran frequently.
>>
>> Since commit a33bc5c15154c835aae26f16e6a3a7d9ad4acb45
>> ("xfrm: select sane defaults for xfrm[4|6] gc_thresh") we can have
>> large gc threshold sizes (>200000 on machines with a normal amount
>> of memory), so garbage collection does not get triggered under
>> normal circumstances. This can result in an enormous number of
>> stale bundles. Furthermore, each of these stale bundles keeps a
>> reference to ipv4/ipv6 rtable entries which have already been
>> garbage collected and put on the dst core's "destroy free'd dst's"
>> list. This list can grow very large, and the dst core's periodic
>> job can bring even a fast machine to its knees.
>
> So why do we need this larger threshold in the first place? Neil?
Actually it looks like on the ipv6 side the gc_thresh is something
more normal. On the ipv4 side it's insanely big. The 1/2 ratio is
not what the ipv4 rtable uses for its own gc_thresh; it looks like
it uses a 1/16 ratio, which yields a much better value.

But even if we put the gc_thresh back to 1024 or a similar size,
it is still a good thing to do some basic gc on xfrm bundles so
that the underlying rtable dst's can be freed before they end up
in the dst core list.
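For illustration only (the hash table size below is a made-up
example, not what any particular machine computes):

/* Illustration: with a route hash table of, say, 2^19 buckets
 * (plausible on a box with a few GB of RAM), the two ratios give
 * very different default thresholds; the 1/2 case also lines up
 * with the ">200000" figure from the patch description.
 */
#define EXAMPLE_RT_HASH_SIZE	(1 << 19)			/* 524288 buckets */
#define GC_THRESH_HALF		(EXAMPLE_RT_HASH_SIZE / 2)	/* 262144: gc practically never triggers */
#define GC_THRESH_SIXTEENTH	(EXAMPLE_RT_HASH_SIZE / 16)	/*  32768: closer to what rtable uses itself */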
- Timo
* Re: [PATCH] xfrm: implement basic garbage collection for bundles
2010-03-20 12:42 ` Timo Teräs
@ 2010-03-20 12:49 ` Herbert Xu
2010-03-20 12:54 ` Timo Teräs
0 siblings, 1 reply; 8+ messages in thread
From: Herbert Xu @ 2010-03-20 12:49 UTC (permalink / raw)
To: Timo Teräs; +Cc: netdev, Neil Horman
On Sat, Mar 20, 2010 at 02:42:02PM +0200, Timo Teräs wrote:
>
> But even if we put the gc_thresh back to 1024 or a similar size,
> it is still a good thing to do some basic gc on xfrm bundles so
> that the underlying rtable dst's can be freed before they end up
> in the dst core list.
But I'm not sure if a timer-based one really makes sense, though.
If you're worried about stale entries preventing IPv4/IPv6 rt
entries from being collected, perhaps we should invoke the xfrm
GC process from the IPv4/IPv6 GC function?
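Something along these lines, perhaps; xfrm_bundle_gc_kick() and its
call site below are invented names for this sketch, and only
schedule_delayed_work() and the expires_work item from your patch
are real:

/* Sketch: let the IPv4/IPv6 route GC nudge the xfrm bundle GC worker
 * from the patch, so stale bundles drop their rtable references when
 * route GC actually runs instead of on a fixed 5 minute period.
 */
void xfrm_bundle_gc_kick(void)
{
	/* No-op if the worker is already queued; otherwise run it soon. */
	schedule_delayed_work(&expires_work, 0);
}
EXPORT_SYMBOL(xfrm_bundle_gc_kick);

/* ... and in the IPv4 route GC path (net/ipv4/route.c), roughly: */
static int rt_garbage_collect(struct dst_ops *ops)
{
	/* ... existing route GC work elided ... */
	xfrm_bundle_gc_kick();
	/* ... */
	return 0;
}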
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH] xfrm: implement basic garbage collection for bundles
2010-03-20 12:49 ` Herbert Xu
@ 2010-03-20 12:54 ` Timo Teräs
2010-03-20 13:13 ` Herbert Xu
0 siblings, 1 reply; 8+ messages in thread
From: Timo Teräs @ 2010-03-20 12:54 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev, Neil Horman
Herbert Xu wrote:
> On Sat, Mar 20, 2010 at 02:42:02PM +0200, Timo Teräs wrote:
>> But even if we put the gc_thresh back to 1024 or a similar size,
>> it is still a good thing to do some basic gc on xfrm bundles so
>> that the underlying rtable dst's can be freed before they end up
>> in the dst core list.
>
> But I'm not sure if a timer-based one really makes sense, though.
>
> If you're worried about stale entries preventing IPv4/IPv6 rt
> entries from being collected, perhaps we should invoke the xfrm
> GC process from the IPv4/IPv6 GC function?
Those are more or less timer-based too. I guess it would be
better to put it in the dst core's function then. It can see
hanging refs, and call the xfrm gc in that case.

To minimize even that, it would help a lot if xfrm_bundle_ok
could release (or swap to dummies) the referenced dst->route
and dst->child entries. xfrm_bundle_ok is mostly called by
xfrm_find_bundle, which runs it regularly for all bundles
before new ones are created.
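Roughly like this, say; the helper name below is made up, and it
only drops the ->route references of a bundle that has already been
found stale, leaving the child chain alone:

/* Sketch: once xfrm_bundle_ok() has decided a bundle is stale, walk
 * the xfrm_dst chain and drop the rtable/rt6_info references held via
 * ->route, so those routes can be freed without waiting for the
 * bundle itself to be garbage collected.
 */
static void xfrm_bundle_release_routes(struct xfrm_dst *first)
{
	struct dst_entry *dst;

	for (dst = &first->u.dst; dst && dst->xfrm; dst = dst->child) {
		struct xfrm_dst *xdst = (struct xfrm_dst *)dst;

		if (xdst->route) {
			dst_release(xdst->route);	/* inner IPv4/IPv6 route */
			xdst->route = NULL;
		}
	}
}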
- Timo
* Re: [PATCH] xfrm: implement basic garbage collection for bundles
2010-03-20 12:54 ` Timo Teräs
@ 2010-03-20 13:13 ` Herbert Xu
2010-03-20 13:34 ` Timo Teräs
0 siblings, 1 reply; 8+ messages in thread
From: Herbert Xu @ 2010-03-20 13:13 UTC (permalink / raw)
To: Timo Teräs; +Cc: netdev, Neil Horman
On Sat, Mar 20, 2010 at 02:54:31PM +0200, Timo Teräs wrote:
>
> Those are more or less timer-based too. I guess it would be
> better to put it in the dst core's function then. It can see
> hanging refs, and call the xfrm gc in that case.
The IPv4 one appears to be invoked from dst_alloc just like
xfrm.
> To minimize even that, it would help a lot if xfrm_bundle_ok
> could release (or swap to dummies) the referenced dst->route
> and dst->child entries. xfrm_bundle_ok is mostly called by
> xfrm_find_bundle, which runs it regularly for all bundles
> before new ones are created.
Yes, I agree that's what we should do once your other patch is
applied.

So every top-level xfrm_dst that's not referenced by some user
should sit in the flow cache. When it's needed and we find it
to be stale, we kick it out of the cache and free it along with
the rest of its constituents.
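In rough pseudo-C, with all of the helper names below invented just
to show the shape of the check-and-evict path:

/* Sketch: a cached top-level bundle is revalidated on every lookup;
 * a stale one is evicted from the flow cache, freed (its constituents
 * go with it once the references drop), and rebuilt.
 */
static struct dst_entry *xfrm_bundle_lookup_sketch(struct net *net,
						   struct xfrm_policy *pol,
						   struct flowi *fl, u16 family)
{
	struct xfrm_dst *xdst;

	xdst = flow_cache_lookup_bundle(net, fl, family);	/* hypothetical */
	if (xdst && !xfrm_bundle_ok(pol, xdst, fl, family, 0)) {
		flow_cache_evict(&xdst->u.dst);			/* hypothetical */
		dst_free(&xdst->u.dst);
		xdst = NULL;
	}
	if (!xdst)
		xdst = xfrm_bundle_create_for(net, pol, fl, family);	/* hypothetical */

	return xdst ? &xdst->u.dst : NULL;
}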
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH] xfrm: implement basic garbage collection for bundles
2010-03-20 13:13 ` Herbert Xu
@ 2010-03-20 13:34 ` Timo Teräs
0 siblings, 0 replies; 8+ messages in thread
From: Timo Teräs @ 2010-03-20 13:34 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev, Neil Horman
Herbert Xu wrote:
> On Sat, Mar 20, 2010 at 02:54:31PM +0200, Timo Teräs wrote:
>> Those are more or less timer-based too. I guess it would be
>> better to put it in the dst core's function then. It can see
>> hanging refs, and call the xfrm gc in that case.
>
> The IPv4 one appears to be invoked from dst_alloc just like
> xfrm.
Yes. But under normal circumstances that's not used. It also
has a separate timer-based gc that goes through the buckets
periodically. That is, e.g., how the expired pmtu routes are
kicked out and how routes with an old genid are purged before
we start to reach the gc_thresh limit. This happens in
rt_worker_func(), which is called every ip_rt_gc_interval
(defaults to one minute).
>> To minimize even that, it would help a lot if xfrm_bundle_ok
>> could release (or swap to dummies) the referenced dst->route
>> and dst->child entries. xfrm_bundle_ok is mostly called by
>> xfrm_find_bundle, which runs it regularly for all bundles
>> before new ones are created.
>
> Yes, I agree that's what we should do once your other patch is
> applied.
>
> So every top-level xfrm_dst that's not referenced by some user
> should sit in the flow cache. When it's needed and we find it
> to be stale, we kick it out of the cache and free it along with
> the rest of its constituents.
Right. OK, just to get a road map of what I should do, in which
order, and how.

xfrm gc:
- should the dst core call it?
- should xfrm do it periodically?
- or do we rely on getting new bundles and let
  xfrm[46]_find_bundle do the cleanup?
- should xfrm_bundle_ok() release the inner dst's right away
  instead of waiting for any of the above to happen?

caching of bundles instead of policies for outgoing traffic:
- should the flow cache be per-netns?
- since it will now have two types of objects, would it make
  sense to have virtual put and get callbacks instead of an
  atomic_t pointer (rough sketch after this list)? This way we
  can BUG_ON() if the refcount goes to zero unexpectedly (or
  call the appropriate destructor) and call the real _put
  function anyway (e.g. dst_release does WARN_ON stuff); the
  _get function can also do additional checking of whether the
  object is still valid; this way the flow cache resolver
  output is always a valid object.
- resolver to have "resolve_fast" and "resolve_slow": if fast
  fails, slow gets called with BHs enabled so it can sleep,
  since bundle creation needs that.
- it would then probably be useful to have a dummy xfrm_dst for
  the policy reference when the policy forbids traffic?
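Very roughly, the common object header could look like the sketch
below; all of the names, including the flo member assumed to be
embedded in xfrm_dst, are made up:

/* Sketch: a common flow cache object header with per-type get/put,
 * so the cache can hold both policies and bundles and always hand
 * the resolver a validated, referenced object.
 */
struct flow_cache_object;

struct flow_cache_object_ops {
	/* Validate and take a reference; return NULL if no longer usable. */
	struct flow_cache_object *(*get)(struct flow_cache_object *obj);
	/* Drop a reference; can BUG_ON()/WARN_ON() on underflow. */
	void (*put)(struct flow_cache_object *obj);
};

struct flow_cache_object {
	const struct flow_cache_object_ops *ops;
};

/* Example user: a bundle, assuming struct xfrm_dst grows a "flo" header. */
static struct flow_cache_object *xfrm_bundle_flo_get(struct flow_cache_object *flo)
{
	struct xfrm_dst *xdst = container_of(flo, struct xfrm_dst, flo);

	if (xfrm_bundle_stale(xdst))		/* hypothetical staleness check */
		return NULL;			/* force a rebuild */
	dst_hold(&xdst->u.dst);			/* dst_release() already does the WARN_ON */
	return flo;
}

static void xfrm_bundle_flo_put(struct flow_cache_object *flo)
{
	struct xfrm_dst *xdst = container_of(flo, struct xfrm_dst, flo);

	dst_release(&xdst->u.dst);
}

static const struct flow_cache_object_ops xfrm_bundle_flo_ops = {
	.get	= xfrm_bundle_flo_get,
	.put	= xfrm_bundle_flo_put,
};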
- Timo
* Re: [PATCH] xfrm: implement basic garbage collection for bundles
2010-03-20 12:32 ` Herbert Xu
2010-03-20 12:42 ` Timo Teräs
@ 2010-03-20 13:26 ` Neil Horman
1 sibling, 0 replies; 8+ messages in thread
From: Neil Horman @ 2010-03-20 13:26 UTC (permalink / raw)
To: Herbert Xu; +Cc: Timo Teras, netdev
On Sat, Mar 20, 2010 at 08:32:48PM +0800, Herbert Xu wrote:
> On Sat, Mar 20, 2010 at 02:15:41PM +0200, Timo Teras wrote:
> > The dst core calls garbage collection only from dst_alloc when
> > the dst entry threshold is exceeded. The xfrm core currently
> > checks bundles only on the NETDEV_DOWN event.
> >
> > Previously this was not a big problem, since the xfrm gc threshold
> > was small and bundles were being generated all the time due to
> > another bug, so garbage collection ran frequently.
> >
> > Since commit a33bc5c15154c835aae26f16e6a3a7d9ad4acb45
> > ("xfrm: select sane defaults for xfrm[4|6] gc_thresh") we can have
> > large gc threshold sizes (>200000 on machines with a normal amount
> > of memory), so garbage collection does not get triggered under
> > normal circumstances. This can result in an enormous number of
> > stale bundles. Furthermore, each of these stale bundles keeps a
> > reference to ipv4/ipv6 rtable entries which have already been
> > garbage collected and put on the dst core's "destroy free'd dst's"
> > list. This list can grow very large, and the dst core's periodic
> > job can bring even a fast machine to its knees.
>
> So why do we need this larger threshold in the first place? Neil?
My initial reasoning is here:
http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git;a=commit;h=a33bc5c15154c835aae26f16e6a3a7d9ad4acb45
I'd had a bug claiming that ipsec didn't scale to thousands of SAs, and the
reason turned out to be that the xfrm code capped its entry threshold at 1024
entries:
https://bugzilla.redhat.com/show_bug.cgi?id=253053
I exported the gc thresholds via sysctls, and then used the above commit to
select sane values, reasoning that we should be able to support as many tunnels
as routes when possible.
If it's too high, I have no qualms about lowering it, given that those who
need it can bump it up higher via the sysctl.
Regards
Neil
> --
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
>
>