* [RFC] Limit the size of the IPV4 route hash.
@ 2004-12-10 19:00 Robin Holt
2004-12-10 19:48 ` David S. Miller
0 siblings, 1 reply; 10+ messages in thread
From: Robin Holt @ 2004-12-10 19:00 UTC (permalink / raw)
To: yoshfuji, akpm, hirofumi, davem, torvalds, dipankar, laforge,
bunk, herbert, paulmck
Cc: netdev, linux-kernel, Greg Banks
I have sent a couple of emails concerning the IPv4 route hash size in
the past week with no response. I am now sending to everyone who has
made changes to the net/ipv4/route.c file in the last six months in the
hope of getting some direction. Sorry for the wide net, but I do not
know how to proceed.
My first post asked for direction on the maximum size of the route
cache. The link is here (NOTE: I never saw this come back from the
netdev list):
What is a reasonable upper limit to the rt_hash_table.
http://marc.theaimsgroup.com/?l=linux-kernel&m=110244057617765&w=2
I then did some testing/experimenting with systems that are in production,
determined the size calculation is definitely too large and then came
to the following conclusion:
Limit the route hash size.
http://marc.theaimsgroup.com/?l=linux-kernel&m=110260977405809&w=2
In the second, I included the patch, but did not intend this to be a
patch submission. Sorry for the Signed-off-by.
Where do I go from here? I hate to just submit this as a patch without
any other discussion. I have checked the route cache size on many
machines, and they have all been in the 30-100 range, except for some
ISP machines serving web pages, where I have seen three machines with a
cache size of up to 800 entries, and one university email server where
they have set the secret_interval to 86,400 and the cache has peaked at
18,434 entries.
With those sizes noted, a hash size of one page does not appear to have
any negative impact for any of them except the email server. For that
machine, they have already reviewed the code and decided to adjust
tunable values, so I cannot believe they would be upset at having to
provide an rhash_entries= option on the boot line.
Are there any benchmarks I am supposed to run prior to asking for this
patch to be incorporated?
Any guidance would be greatly appreciated. Thank you for your attention.
Again, sorry for the wide net.
Robin Holt
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC] Limit the size of the IPV4 route hash.
2004-12-10 19:00 [RFC] Limit the size of the IPV4 route hash Robin Holt
@ 2004-12-10 19:48 ` David S. Miller
2004-12-10 21:00 ` Robin Holt
0 siblings, 1 reply; 10+ messages in thread
From: David S. Miller @ 2004-12-10 19:48 UTC (permalink / raw)
To: Robin Holt
Cc: yoshfuji, akpm, hirofumi, torvalds, dipankar, laforge, bunk,
herbert, paulmck, netdev, linux-kernel, gnb
On Fri, 10 Dec 2004 13:00:25 -0600
Robin Holt <holt@sgi.com> wrote:
> I then did some testing/experimenting with systems that are in production,
> determined the size calculation is definitely too large and then came
> to the following conclusion:
>
> Limit the route hash size.
> http://marc.theaimsgroup.com/?l=linux-kernel&m=110260977405809&w=2
>
> In the second, I included the patch, but did not intend this to be a
> patch submission. Sorry for the Signed-off-by.
>
> Where do I go from here? I hate to just submit this as a patch without
> any other discussion.
Sometimes we have to just sit and be content with the fact that
nobody is inspired by our work enough to respond. :-) It usually
means that people aren't too thrilled with your patch, but don't
feel any impetus to mention why.
I can definitely say that just forcing it to use 1 page is wrong.
Even ignoring your tests, your test was on a system that has 16K
PAGE_SIZE. Other systems use 4K and 8K (and other) PAGE_SIZE
values. This is why we make our calculations relative to PAGE_SHIFT.
Also, 1 page even in your case is (assuming you are on a 64-bit platform,
you didn't mention) going to give us 1024 hash chains. A reasonably
busy web server will easily be talking to more than 1K unique hosts at
a given point in time. This is especially true as slow long distance
connections bunch up.
Alexey Kuznetsov needed to use more than one page on his lowly
i386 router in Russia, and this was circa 7 or 8 years ago.
People are pretty happy with the current algorithm, and in fact
most people ask us to increase its value, not decrease it :-)
* Re: [RFC] Limit the size of the IPV4 route hash.
2004-12-10 19:48 ` David S. Miller
@ 2004-12-10 21:00 ` Robin Holt
2004-12-10 21:06 ` David S. Miller
2004-12-10 21:09 ` Andrew Morton
0 siblings, 2 replies; 10+ messages in thread
From: Robin Holt @ 2004-12-10 21:00 UTC (permalink / raw)
To: David S. Miller
Cc: Robin Holt, yoshfuji, akpm, hirofumi, torvalds, dipankar, laforge,
bunk, herbert, paulmck, netdev, linux-kernel, gnb
>
> I can definitely say that just forcing it to use 1 page is wrong.
> Even ignoring your tests, your test was on a system that has 16K
> PAGE_SIZE. Other systems use 4K and 8K (and other) PAGE_SIZE
> values. This is why we make our calculations relative to PAGE_SHIFT.
I picked 1 page because it was not some magic small number that I would
need to justify. I also hoped that it would incite someone to respond.
>
> Also, 1 page even in your case is (assuming you are on a 64-bit platform,
> you didn't mention) going to give us 1024 hash chains. A reasonably
> busy web server will easily be talking to more than 1K unique hosts at
> a given point in time. This is especially true as slow long distance
> connections bunch up.
But 1k hosts is not the limit with a 16k page. There are 1k buckets,
but each is a list. A reasonably well designed hash will scale to greater
than one item per bucket. Additionally, for the small percentage of web
servers with enough network traffic that they will be affected by the
depth of the entries, they can set rhash_entries for their specific needs.
>
> Alexey Kuznetsov needed to use more than one page on his lowly
> i386 router in Russia, and this was circa 7 or 8 years ago.
And now he, for his special case, could set rhash_entries to increase
the size.
I realize I have a special case which highlighted the problem. My case
shows that not putting an upper limit or at least a drastically aggressive
non-linear growth cap does cause issues. For the really large system,
we were seeing a size of 512MB for the hash which was limited because
that was the largest amount of memory available on a single node. I can
not ever imagine this being a reasonable limit. Not with 512 cpus and
1024 network adapters could I envision that this level of hashing would
actually be advantageous given all the other lock contention that will
be seen.
Can we agree that a linear calculation based on num_physpages is probably
not the best algorithm. If so, should we make it a linear to a limit or
a logarithmically decreasing size to a limit? How do we determine that
limit point?
* Re: [RFC] Limit the size of the IPV4 route hash.
2004-12-10 21:00 ` Robin Holt
@ 2004-12-10 21:06 ` David S. Miller
2004-12-10 21:09 ` Andrew Morton
1 sibling, 0 replies; 10+ messages in thread
From: David S. Miller @ 2004-12-10 21:06 UTC (permalink / raw)
To: Robin Holt
Cc: holt, yoshfuji, akpm, hirofumi, torvalds, dipankar, laforge, bunk,
herbert, paulmck, netdev, linux-kernel, gnb
On Fri, 10 Dec 2004 15:00:06 -0600
Robin Holt <holt@sgi.com> wrote:
> > Also, 1 page even in your case is (assuming you are on a 64-bit platform,
> > you didn't mention) going to give us 1024 hash chains. A reasonably
> > busy web server will easily be talking to more than 1K unique hosts at
> > a given point in time. This is especially true as slow long distance
> > connections bunch up.
>
> But 1k hosts is not the limit with a 16k page. There are 1k buckets,
> but each is a list. A reasonably well designed hash will scale to greater
> than one item per bucket. Additionally, for the small percentage of web
> servers with enough network traffic that they will be affected by the
> depth of the entries, they can set rhash_entries for their specific needs.
We want to aim for a depth of 1 in each chain, so that, assuming the
hash is decent, we'll achieve O(1) lookup complexity. That is why we
want the number of chains to be at least as large as the number of
active routing cache entries we'll work with.
> I realize I have a special case which highlighted the problem. My case
> shows that not putting an upper limit or at least a drastically aggressive
> non-linear growth cap does cause issues. For the really large system,
> we were seeing a size of 512MB for the hash which was limited because
> that was the largest amount of memory available on a single node.
That's true, 512MB is just too much. So let's define some reasonable
default cap like 16MB or 32MB, and keep it overridable via
rhash_entries as it is currently.
* Re: [RFC] Limit the size of the IPV4 route hash.
2004-12-10 21:00 ` Robin Holt
2004-12-10 21:06 ` David S. Miller
@ 2004-12-10 21:09 ` Andrew Morton
2004-12-10 23:27 ` Robin Holt
1 sibling, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2004-12-10 21:09 UTC (permalink / raw)
To: Robin Holt
Cc: davem, holt, yoshfuji, hirofumi, torvalds, dipankar, laforge,
bunk, herbert, paulmck, netdev, linux-kernel, gnb
Robin Holt <holt@sgi.com> wrote:
>
> I realize I have a special case which highlighted the problem. My case
> shows that not putting an upper limit or at least a drastically aggressive
> non-linear growth cap does cause issues. For the really large system,
> we were seeing a size of 512MB for the hash which was limited because
> that was the largest amount of memory available on a single node. I can
> not ever imagine this being a reasonable limit. Not with 512 cpus and
> 1024 network adapters could I envision that this level of hashing would
> actually be advantageous given all the other lock contention that will
> be seen.
Half a gig for the hashtable does seem a bit nutty.
> Can we agree that a linear calculation based on num_physpages is probably
> not the best algorithm. If so, should we make it a linear to a limit or
> a logarithmically decreasing size to a limit? How do we determine that
> limit point?
An initial default of N + M * log2(num_physpages) would probably give a
saner result.
The big risk is that someone has a too-small table for some specific
application and their machine runs more slowly than it should, but they
never notice. I wonder if it would be possible to put a little once-only
printk into the routing code: "warning route-cache chain exceeded 100
entries: consider using the rhash_entries boot option".
* Re: [RFC] Limit the size of the IPV4 route hash.
2004-12-10 21:09 ` Andrew Morton
@ 2004-12-10 23:27 ` Robin Holt
2004-12-10 23:38 ` Andrew Morton
0 siblings, 1 reply; 10+ messages in thread
From: Robin Holt @ 2004-12-10 23:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Robin Holt, davem, yoshfuji, hirofumi, torvalds, dipankar,
laforge, bunk, herbert, paulmck, netdev, linux-kernel, gnb
On Fri, Dec 10, 2004 at 01:09:47PM -0800, Andrew Morton wrote:
> Robin Holt <holt@sgi.com> wrote:
> > Can we agree that a linear calculation based on num_physpages is probably
> > not the best algorithm. If so, should we make it a linear to a limit or
> > a logarithmically decreasing size to a limit? How do we determine that
> > limit point?
>
> An initial default of N + M * log2(num_physpages) would probably give a
> saner result.
>
> The big risk is that someone has a too-small table for some specific
> application and their machine runs more slowly than it should, but they
> never notice. I wonder if it would be possible to put a little once-only
> printk into the routing code: "warning route-cache chain exceeded 100
> entries: consider using the rhash_entries boot option".
Since the hash gets flushed every 10 seconds, what if we kept track of
the maximum depth reached and when we reach a certain threshold, just
allocate a larger hash and replace the old with the new. I do like the
printk idea so the admin can prevent inconsistent performance early in
the run cycle for the system. We could even scale the hash size up based
upon demand.
Thanks,
Robin
* Re: [RFC] Limit the size of the IPV4 route hash.
2004-12-10 23:38 ` Andrew Morton
@ 2004-12-10 23:37 ` Robin Holt
2004-12-10 23:40 ` Robin Holt
0 siblings, 1 reply; 10+ messages in thread
From: Robin Holt @ 2004-12-10 23:37 UTC (permalink / raw)
To: Andrew Morton
Cc: Robin Holt, davem, yoshfuji, hirofumi, torvalds, dipankar,
laforge, bunk, herbert, paulmck, netdev, linux-kernel, gnb
On Fri, Dec 10, 2004 at 03:38:48PM -0800, Andrew Morton wrote:
> Robin Holt <holt@sgi.com> wrote:
> >
> > > The big risk is that someone has a too-small table for some specific
> > > application and their machine runs more slowly than it should, but they
> > > never notice. I wonder if it would be possible to put a little once-only
> > > printk into the routing code: "warning route-cache chain exceeded 100
> > > entries: consider using the rhash_entries boot option".
> >
> > Since the hash gets flushed every 10 seconds, what if we kept track of
> > the maximum depth reached and when we reach a certain threshold, just
> > allocate a larger hash and replace the old with the new. I do like the
> > printk idea so the admin can prevent inconsistent performance early in
> > the run cycle for the system. We could even scale the hash size up based
> > upon demand.
>
> Once the system has been running for a while, the possibility of allocating
> a decent number of physically-contiguous pages is basically zero.
>
> If we were to dynamically size it we'd need to either use new data
> structure (slower) or use vmalloc() (slower and can fragment vmalloc
> space).
Why do they need to be physically contiguous? It is a hash correct?
* Re: [RFC] Limit the size of the IPV4 route hash.
2004-12-10 23:27 ` Robin Holt
@ 2004-12-10 23:38 ` Andrew Morton
2004-12-10 23:37 ` Robin Holt
0 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2004-12-10 23:38 UTC (permalink / raw)
To: Robin Holt
Cc: holt, davem, yoshfuji, hirofumi, torvalds, dipankar, laforge,
bunk, herbert, paulmck, netdev, linux-kernel, gnb
Robin Holt <holt@sgi.com> wrote:
>
> > The big risk is that someone has a too-small table for some specific
> > application and their machine runs more slowly than it should, but they
> > never notice. I wonder if it would be possible to put a little once-only
> > printk into the routing code: "warning route-cache chain exceeded 100
> > entries: consider using the rhash_entries boot option".
>
> Since the hash gets flushed every 10 seconds, what if we kept track of
> the maximum depth reached and when we reach a certain threshold, just
> allocate a larger hash and replace the old with the new. I do like the
> printk idea so the admin can prevent inconsistent performance early in
> the run cycle for the system. We could even scale the hash size up based
> upon demand.
Once the system has been running for a while, the possibility of allocating
a decent number of physically-contiguous pages is basically zero.
If we were to dynamically size it we'd need to either use new data
structure (slower) or use vmalloc() (slower and can fragment vmalloc
space).
* Re: [RFC] Limit the size of the IPV4 route hash.
2004-12-10 23:37 ` Robin Holt
@ 2004-12-10 23:40 ` Robin Holt
2004-12-13 0:55 ` David S. Miller
0 siblings, 1 reply; 10+ messages in thread
From: Robin Holt @ 2004-12-10 23:40 UTC (permalink / raw)
To: Robin Holt
Cc: Andrew Morton, davem, yoshfuji, hirofumi, torvalds, dipankar,
laforge, bunk, herbert, paulmck, netdev, linux-kernel, gnb
On Fri, Dec 10, 2004 at 05:37:00PM -0600, Robin Holt wrote:
> On Fri, Dec 10, 2004 at 03:38:48PM -0800, Andrew Morton wrote:
> > Robin Holt <holt@sgi.com> wrote:
> > >
> > > > The big risk is that someone has a too-small table for some specific
> > > > application and their machine runs more slowly than it should, but they
> > > > never notice. I wonder if it would be possible to put a little once-only
> > > > printk into the routing code: "warning route-cache chain exceeded 100
> > > > entries: consider using the rhash_entries boot option".
> > >
> > > Since the hash gets flushed every 10 seconds, what if we kept track of
> > > the maximum depth reached and when we reach a certain threshold, just
> > > allocate a larger hash and replace the old with the new. I do like the
> > > printk idea so the admin can prevent inconsistent performance early in
> > > the run cycle for the system. We could even scale the hash size up based
> > > upon demand.
> >
> > Once the system has been running for a while, the possibility of allocating
> > a decent number of physically-contiguous pages is basically zero.
> >
> > If we were to dynamically size it we'd need to either use new data
> > structure (slower) or use vmalloc() (slower and can fragment vmalloc
> > space).
>
> Why do they need to be physically contiguous? It is a hash correct?
Sorry, I was asleep at the wheel. I failed to even grok your second
paragraph. I will fall back to agreeing with the printk to let the admin
know that something is amiss.
Should we possibly modify the output of /proc/net/rt_cache (or whatever
its name is) to include the hash bucket so people can watch to see how
many bucket collisions their system has?
* Re: [RFC] Limit the size of the IPV4 route hash.
2004-12-10 23:40 ` Robin Holt
@ 2004-12-13 0:55 ` David S. Miller
0 siblings, 0 replies; 10+ messages in thread
From: David S. Miller @ 2004-12-13 0:55 UTC (permalink / raw)
To: Robin Holt
Cc: holt, akpm, yoshfuji, hirofumi, torvalds, dipankar, laforge, bunk,
herbert, paulmck, netdev, linux-kernel, gnb
On Fri, 10 Dec 2004 17:40:37 -0600
Robin Holt <holt@sgi.com> wrote:
> Sorry, I was asleep at the wheel. I failed to even grok your second
> paragraph. I will fall back to agreeing with the printk to let the admin
> know that something is amiss.
>
> Should we possibly modify the output of /proc/net/rt_cache (or whatever
> its name is) to include the hash bucket so people can watch to see how
> many bucket collisions their system has?
I think there are rt stats for this already added by Robert Olsson.
One thing not mentioned, besides the physically contiguous issue,
is the fact that the locking would need to be changed quite a bit
in order to allow runtime reallocation of the hash table. The current
code is certainly not ready for it. It might just work to run the
hash freeing via RCU, but I'm not quite sure.
Thread overview: 10+ messages
2004-12-10 19:00 [RFC] Limit the size of the IPV4 route hash Robin Holt
2004-12-10 19:48 ` David S. Miller
2004-12-10 21:00 ` Robin Holt
2004-12-10 21:06 ` David S. Miller
2004-12-10 21:09 ` Andrew Morton
2004-12-10 23:27 ` Robin Holt
2004-12-10 23:38 ` Andrew Morton
2004-12-10 23:37 ` Robin Holt
2004-12-10 23:40 ` Robin Holt
2004-12-13 0:55 ` David S. Miller