All of lore.kernel.org
 help / color / mirror / Atom feed
From: Eric Dumazet <dada1@cosmosbay.com>
To: Herbert Xu <herbert@gondor.apana.org.au>
Cc: davem@davemloft.net, netdev@oss.sgi.com,
	Robert.Olsson@data.slu.se, hadi@cyberus.ca
Subject: Re: Get rid of rt_check_expire and rt_garbage_collect
Date: Sat, 02 Apr 2005 15:58:55 +0200	[thread overview]
Message-ID: <424EA51F.6000300@cosmosbay.com> (raw)
In-Reply-To: <20050402112304.GA11321@gondor.apana.org.au>

Herbert Xu a écrit :
> On Sat, Apr 02, 2005 at 11:21:30AM +0200, Eric Dumazet wrote:
> 
>>Well, I began my work because of the overflow bug in rt_check_expire()...
>>Then I realize this function could not work as expected. On a loaded 
>>machine, one timer tick is 1 ms.
>>During this time, number of chains that are scanned is ridiculous.
>>With the standard timer of 60 second, fact is rt_check_expire() is useless.
> 
> 
> I see.  What we've got here is a scalability problem with respect
> to the number of hash buckets.  As the number of buckets increases,
> the amount of work the timer GC has to perform inreases proportionally.
> 
> Since the timer GC parameters are fixed, this will eventually break.
> 
> Rather than changing the timer GC so that it runs more often to keep
> up with the large routing cache, we should get out of this by reducing
> the amount of work we have to do.
> 
> Imagine an ideal balanced hash table with 2.6 million entries.  That
> is, all incoming/outgoing packets belong to flows that are already in
> the hash table.  Imagine also that there is no PMTU/link failure taking
> place so all entries are valid forever.
> 
> In this state there is absolutely no need to execute the timer GC.
> 
> Let's remove one of those assumptions and allow there to be entries
> which need to expire after a set period.
> 
> Instead of having the timer GC clean them up, we can move the expire
> check to the place where the entries are used.  That is, we make
> ip_route_input/ip_route_output/ipv4_dst_check check whether the
> entry has expired.
> 
> On the face of it we're doing more work since every routing cache
> hit will need to check the validity of the dst.  However, because
> it's a single subtraction it is actually pretty cheap.  There is
> also no additional cache miss compared to doing it in the timer
> GC since we have to read the dst anyway.
> 
> Let's go one step further and make the routing cache come to life.
> Now there are new entries coming in and we need to remove old ones
> in order to make room for them.
> 
> That task is currently carried out by the timer GC in rt_check_expire
> and on demand by rt_garbage_collect.  Either way we have to walk the
> entire routing cache looking for entries to get rid of.
> 
> This is quite expensive when the routing cache is large.  However,
> there is a better way.
> 
> The reason we keep a cap on the routing cache (for a given hash size)
> is so that individual chains do not degenerate into long linked lists.
> 
> In other words, we don't really care about how many entries there are
> in the routing cache.  But we do care about how long each hash chain
> is.
> 
> So instead of walking the entire routing cache to keep the number of
> entries down, what we should do is keep each hash chain as short as
> possible.
> 
> Assuming that the hash function is good, this should achieve the
> same end result.
> 
> Here is how it can be done: Every time a routing entry is inserted into
> a hash chain, we perform GC on that chain unconditionally.
> 
> It might seem that we're doing more work again.  However, as before
> because we're traversing the chain anyway, it is very cheap to perform
> the GC operations which mainly involve the checks in rt_may_expire.
> 
> OK that's enough thinking and it's time to write some code to see
> whether this is all bullshit :)
> 
> Cheers,

Well, it may work if you dont care about memory used.

# grep dst /proc/slabinfo
ip_dst_cache      2825575 2849590    384   10    1 : tunables   54   27    8 : slabdata 284959 284959      0

On this machine, route cache takes 1.1 GB of ram... impressive.

Then if the network load decrease (or completely stop), only a timer driven gc could purge the cache.

So rt_check_expire() is *needed*

You are right saying that gc parameters are fixed, thus gc breaks at high load.


Eric

  reply	other threads:[~2005-04-02 13:58 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-03-15 16:13 [PATCH] no more rwlock_t inside tcp_ehash_bucket Eric Dumazet
2005-03-15 18:32 ` David S. Miller
2005-03-16 10:47   ` [BUG] overflow in net/ipv4/route.c rt_check_expire() Eric Dumazet
2005-03-16 18:05     ` [PATCH] reduce sizeof(struct inet_peer) from 128 to 64 bytes on 64bits architectures Eric Dumazet
2005-03-16 22:10       ` David S. Miller
2005-03-16 22:09     ` [BUG] overflow in net/ipv4/route.c rt_check_expire() David S. Miller
2005-03-17 19:52       ` Eric Dumazet
2005-04-01  6:13         ` David S. Miller
2005-04-01 14:39           ` Eric Dumazet
2005-04-01 15:53             ` Robert Olsson
2005-04-01 16:34               ` Eric Dumazet
2005-04-01 17:26                 ` Robert Olsson
2005-04-01 20:28             ` David S. Miller
2005-04-01 21:05               ` Eric Dumazet
2005-04-01 21:08                 ` David S. Miller
2005-04-01 21:43                   ` Eric Dumazet
2005-04-01 22:34                     ` David S. Miller
2005-04-01 23:21                       ` Eric Dumazet
2005-04-01 23:54                         ` David S. Miller
2005-04-02  8:21                         ` Herbert Xu
2005-04-02  9:21                           ` Eric Dumazet
2005-04-02 11:23                             ` Get rid of rt_check_expire and rt_garbage_collect Herbert Xu
2005-04-02 13:58                               ` Eric Dumazet [this message]
2005-04-02 14:03                               ` Robert Olsson
2005-04-02 21:05                               ` jamal
2005-04-03  7:38                                 ` Herbert Xu
2005-04-03  7:41                                 ` Herbert Xu
2005-04-02 13:48                             ` [BUG] overflow in net/ipv4/route.c rt_check_expire() Robert Olsson
2005-04-02 14:10                               ` Eric Dumazet
2005-04-02 14:46                                 ` Robert Olsson
2005-04-02 20:47                                 ` jamal
2005-04-02 19:32                               ` Herbert Xu
2005-04-02 19:55                                 ` David S. Miller
2005-04-03  7:43                                   ` Herbert Xu
2005-04-03 19:57                                     ` Robert Olsson
2005-04-03 21:45                                       ` Herbert Xu
2005-04-04 10:27                                         ` Robert Olsson
2005-04-04 10:38                                           ` Herbert Xu
2005-04-04 12:29                                             ` Robert Olsson
2005-04-03 19:36                                 ` Robert Olsson
2005-04-03 21:43                                   ` Herbert Xu
2005-04-04 10:38                                     ` Robert Olsson
2005-04-04 10:48                                       ` Herbert Xu
2005-04-04 13:17                                         ` Robert Olsson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=424EA51F.6000300@cosmosbay.com \
    --to=dada1@cosmosbay.com \
    --cc=Robert.Olsson@data.slu.se \
    --cc=davem@davemloft.net \
    --cc=hadi@cyberus.ca \
    --cc=herbert@gondor.apana.org.au \
    --cc=netdev@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.