netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Timo Teräs" <timo.teras@iki.fi>
To: Herbert Xu <herbert@gondor.apana.org.au>
Cc: netdev@vger.kernel.org
Subject: Re: [PATCH] xfrm: cache bundle lookup results in flow cache
Date: Thu, 18 Mar 2010 21:30:58 +0200	[thread overview]
Message-ID: <4BA27F72.40901@iki.fi> (raw)
In-Reply-To: <4BA0FBB6.10208@iki.fi>

Timo Teräs wrote:
> Herbert Xu wrote:
>> On Wed, Mar 17, 2010 at 04:16:21PM +0200, Timo Teräs wrote:
>>> The problem is if I have multipoint gre1 and policy that says
>>> "encrypt all gre in transport mode".
>>>
>>> Thus for each public address, I get one bundle. But the
>>> xfrm_lookup() is called for each packet because ipgre_tunnel_xmit()
>>> calls ip_route_output_key() on per-packet basis.
>>>
>>> For my use-case it makes a huge difference.
>>
>> But if your traffic switches between those tunnels on each packet,
>> we're back to square one, right?
> 
> Not to my understanding. Why would it change?

Here's how things go to my understanding.

When we are forwarding packets, each packet goes through
__xfrm_route_forward(). Or if we are sending from ip_gre (like in
my case), the packets go through ip_route_output_key().

Both of these call xfrm_lookup() to get the xfrm_dst instead
of the real rtable dst. This is done on per-packet basis.
Basically, the xfrm_dst is never kept referenced directly
on either code path. Instead it needs to be xfrm'ed per-packet.

Now, these created xfrm_dst gets cached in policy->bundles
with ref count zero and not deleted until garbage collection
is required. That's how xfrm_find_bundle tries to reuse them
(but it does O(n) lookup).

On the gre+esp case (should apply to any forward path too)
the caching of bundle speeds up the output path considerably
as there can be a lot of xfrm_dst's. Especially if it's a
wildcard transport mode policy which basically get's bundles
for each destination. So there can be a lot of xfrm_dst's
all valid, since they refer to unique xfrm_state on
per-destination basis.

Now, even more, since xfrm_dst needs to be regenerated if
the underlying ipv4 rtable entry has expired there can be
a lot of bundles. So the linear search is a major performance
killer.

Btw. it looks like the xfrm_dst garbage collection is broke.
It's only garbage collected if a network device goes down,
or dst_alloc calls it after gc threshold is exceeded. Since
gc threshold got dynamic not long ago, it can be very big.
This causes stale xfrm_dst's to be kept alive, and what is
worse their inner rtable dst is kept referenced. But as they
were expired and dst_free'd the dst core "free still referenced"
dst's goes through that list over and over again and will
kill the whole system performance when the list grows long.

I think as a minimum we should add 'do stale_bundle' check
to all xfrm_dst's every n minutes or so.

>>> Then we cannot maintain policy use time. But if it's not a
>>> requirement, we could drop the policy from cache.
>>
>> I don't see why we can't maintain the policy use time if we did
>> this, all you need is a back-pointer from the top xfrm_dst.
> 
> Sure.

Actually no. As the pmtu case showed, it's more likely that
xfrm_dst needs to be regenerated, but the policy stays the
same since policy db isn't touched that often. If we keep
them separately we can almost most of the time avoid doing
policy lookup which is also O(n). Also the currently cache
entry validation is needs to check policy's bundles_genid
before allowing touching of xfrm_dst. Otherwise we would have
to keep global bundle_genid, and we'd lose the parent pointer
on cache miss.

Caching bundles be another win too. Since if do cache entries
like this, we could track how many cache miss xfrm_dst's
we've had and use that decide when to trigger xfrm_dst
garbage collector instead of (or in addition to) fixed timer.

- Timo

  parent reply	other threads:[~2010-03-18 19:31 UTC|newest]

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-03-15 12:20 [PATCH] xfrm: cache bundle lookup results in flow cache Timo Teras
2010-03-17 13:07 ` Herbert Xu
2010-03-17 14:16   ` Timo Teräs
2010-03-17 14:58     ` Herbert Xu
2010-03-17 15:56       ` Timo Teräs
2010-03-17 16:32         ` Timo Teräs
2010-03-18 19:30         ` Timo Teräs [this message]
2010-03-19  0:31           ` Herbert Xu
2010-03-19  5:48             ` Timo Teräs
2010-03-19  6:03               ` Herbert Xu
2010-03-19  6:21                 ` Timo Teräs
2010-03-19  7:17                   ` Herbert Xu
2010-03-19  7:27                   ` Timo Teräs
2010-03-19  0:32           ` Herbert Xu
2010-03-19  7:20 ` Herbert Xu
2010-03-19  7:48   ` Timo Teräs
2010-03-19  8:29     ` Herbert Xu
2010-03-19  8:37       ` Timo Teräs
2010-03-19  8:47         ` Herbert Xu
2010-03-19  9:12           ` Timo Teräs
2010-03-19  9:32             ` Herbert Xu
2010-03-19  9:53               ` Timo Teräs
2010-03-20 15:17                 ` Herbert Xu
2010-03-20 16:26                   ` Timo Teräs
2010-03-21  0:46                     ` Herbert Xu
2010-03-21  7:34                       ` Timo Teräs
2010-03-21  8:31                         ` Timo Teräs
2010-03-22  3:52                           ` Herbert Xu
2010-03-22 18:03                             ` Timo Teräs
2010-03-23  7:28                               ` Timo Teräs
2010-03-23  7:42                                 ` Herbert Xu
2010-03-23  9:19                                   ` Timo Teräs
2010-03-23  9:41                                     ` Herbert Xu
2010-03-22  1:26                   ` David Miller
2010-03-22  1:28                   ` David Miller
2010-03-22  1:32                     ` Herbert Xu
2010-03-22  1:36                       ` David Miller
2010-03-22  1:40                         ` Herbert Xu
2010-03-22  3:12                           ` David Miller
2010-03-22  3:52                             ` Herbert Xu
2010-03-22 18:31                               ` Timo Teräs

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4BA27F72.40901@iki.fi \
    --to=timo.teras@iki.fi \
    --cc=herbert@gondor.apana.org.au \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).