netdev.vger.kernel.org archive mirror
* Re: [PATCH] limit rt cache size
       [not found] <44D75EF8.1070901@sw.ru>
@ 2006-08-07 16:48 ` Alexey Kuznetsov
  2006-08-08  3:42   ` David Miller
  0 siblings, 1 reply; 28+ messages in thread
From: Alexey Kuznetsov @ 2006-08-07 16:48 UTC (permalink / raw)
  To: Kirill Korotaev; +Cc: davem, netdev

Hello!

> During OpenVZ stress testing we found that UDP traffic with
> random src can cause excessive rt hash growth,
> finally leading to OOM and kernel panics.
> 
> It was found that for a 4GB i686 system (having 1048576 total pages and
>  225280 normal zone pages) the kernel allocates the following route hash:
> syslog: IP route cache hash table entries: 262144 (order: 8, 1048576 bytes)
> => ip_rt_max_size = 4194304 entries, i.e.
> max rt size is 4194304 * 256 bytes = 1 GB of RAM > normal_zone

Grrr... Indeed.

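For the record, the arithmetic above as a minimal userspace sketch
(256 bytes per entry is only an approximation of the dst entry size):

#include <stdio.h>

int main(void)
{
	unsigned long buckets     = 262144;        /* from the boot log above */
	unsigned long max_entries = buckets * 16;  /* route.c: ip_rt_max_size = 16 * buckets */
	unsigned long entry_bytes = 256;           /* rough size of one cached route */

	printf("ip_rt_max_size = %lu entries, worst case ~%lu MB of lowmem\n",
	       max_entries, (max_entries * entry_bytes) >> 20);
	return 0;
}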

> Attached the patch which removes HASH_HIGHMEM flag from
> alloc_large_system_hash() call. However, I'm not sure whether
> it should be removed as well for TCP tcp_hashinfo.ehash and
> tcp_hashinfo.bhash (as those are probably limited by number of files?).

The patch looks OK. But I am not sure either.

To be honest, I do not understand the purpose of the HASH_HIGHMEM flag.
At first sight, the hash table eats low memory, and the objects hashed in this
table also eat low memory. Why is its size calculated from total memory?
But given that this flag is used only by tcp.c and route.c, both of which
feed on low memory, I must be missing something important.

Let's ask people on netdev.


As for the routing cache size, that looks like another bug.
route.c should not force rt_max_size = 16*rt_hash_size.
I think it should consult available memory and limit rt_max_size
to some reasonable value, even if the hash size is too high.



> --- ./net/ipv4/route.c.xrt	2006-07-14 19:08:33.000000000 +0400
> +++ ./net/ipv4/route.c	2006-08-07 18:25:37.000000000 +0400
> @@ -3149,7 +3149,7 @@ int __init ip_rt_init(void)
>  					rhash_entries,
>  					(num_physpages >= 128 * 1024) ?
>  					15 : 17,
> -					HASH_HIGHMEM,
> +					0,
>  					&rt_hash_log,
>  					&rt_hash_mask,
>  					0);



* Re: [PATCH] limit rt cache size
  2006-08-07 16:48 ` [PATCH] limit rt cache size Alexey Kuznetsov
@ 2006-08-08  3:42   ` David Miller
  2006-08-08  5:11     ` Andi Kleen
                       ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: David Miller @ 2006-08-08  3:42 UTC (permalink / raw)
  To: kuznet; +Cc: dev, netdev

From: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Date: Mon, 7 Aug 2006 20:48:42 +0400

> The patch looks OK. But I am not sure either.
> 
> To be honest, I do not understand the purpose of the HASH_HIGHMEM flag.
> At first sight, the hash table eats low memory, and the objects hashed in
> this table also eat low memory. Why is its size calculated from total memory?
> But given that this flag is used only by tcp.c and route.c, both of which
> feed on low memory, I must be missing something important.
> 
> Let's ask people on netdev.

It is not so hard to check the history of the change to see where these
things come from :-) If we study the output of the command:

	git whatchanged net/ipv4/route.c

we quickly discover this GIT commit:

424c4b70cc4ff3930ee36a2ef7b204e4d704fd26

[IPV4]: Use the fancy alloc_large_system_hash() function for route hash table

- rt hash table allocated using alloc_large_system_hash() function.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

And it is clear that old code used num_physpages, which counts low
memory only.  This shows clearly that Eric's usage of the HASH_HIGHMEM
flag here is erroneous.  So we should remove it.

Look!  This thing even uses num_physpages in current code to compute
the "scale" argument to alloc_large_system_hash() :)))

> As for the routing cache size, that looks like another bug.
> route.c should not force rt_max_size = 16*rt_hash_size.
> I think it should consult available memory and limit rt_max_size
> to some reasonable value, even if the hash size is too high.

Sure.  This current setting of 16*rt_hash_size is meant to
try to limit hash chain lengths, I guess.  2.4.x does the same
thing.  Note also that by basing it upon the number of routing cache
hash chains, it is effectively consulting available memory.
This is why, when the hash table sizing is crap, so is the rt_max_size
calculation.  Fix one and you fix them both :)

Once the HASH_HIGHMEM flag is removed, assuming system has > 128K of
memory, what we get is:

	hash_chains = lowmem / 128K
	rt_max_size = ((lowmem / 128K) * 16) == lowmem / 8K

So we allow one routing cache entry for each 8K of lowmem we have :)

So for now it is probably sufficient to just get rid of the
HASH_HIGHMEM flag here.  Later we can try changing this multiplier
of "16" to something like "8" or even "4".
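
For reference, the sizing at the end of ip_rt_init() is roughly this
(paraphrased from net/ipv4/route.c, not an exact quote):

	/* both the GC threshold and the hard cap are derived from the
	 * bucket count, so a bad bucket count breaks both of them */
	ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);
	ip_rt_max_size = (rt_hash_mask + 1) * 16;   /* the "16" discussed above */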



* Re: [PATCH] limit rt cache size
  2006-08-08  3:42   ` David Miller
@ 2006-08-08  5:11     ` Andi Kleen
  2006-08-08  6:18       ` David Miller
  2006-08-08 20:37       ` akepner
  2006-08-08  8:17     ` Kirill Korotaev
  2006-08-08  8:57     ` Eric Dumazet
  2 siblings, 2 replies; 28+ messages in thread
From: Andi Kleen @ 2006-08-08  5:11 UTC (permalink / raw)
  To: David Miller; +Cc: kuznet, dev, netdev


> So for now it is probably sufficient to just get rid of the
> HASH_HIGHMEM flag here.  Later we can try changing this multiplier
> of "16" to something like "8" or even "4".

The hash sizing code needs far more tweaks. iirc it can still allocate several GB 
hash tables on large memory systems (i've seen that once in the boot log
of a 2TB system).  Even on smaller systems it is usually too much.

IMHO there needs to be a maximum size (maybe related to the sum of 
caches of all CPUs in the system?)

Best would be to fix this for all large system hashes together.

-Andi


* Re: [PATCH] limit rt cache size
  2006-08-08  5:11     ` Andi Kleen
@ 2006-08-08  6:18       ` David Miller
  2006-08-08  6:53         ` Andi Kleen
  2006-08-08 20:37       ` akepner
  1 sibling, 1 reply; 28+ messages in thread
From: David Miller @ 2006-08-08  6:18 UTC (permalink / raw)
  To: ak; +Cc: kuznet, dev, netdev

From: Andi Kleen <ak@suse.de>
Date: Tue, 8 Aug 2006 07:11:06 +0200

> The hash sizing code needs far more tweaks. iirc it can still
> allocate several GB hash tables on large memory systems (i've seen
> that once in the boot log of a 2TB system).  Even on smaller systems
> it is usually too much.

There is already a limit parameter to alloc_large_system_hash(); the
fact that the routing cache passes in zero (which causes the limit to
be 1/16 of all system memory) is just a bug. :-)

In passing, this immediately suggests a fix to alloc_large_system_hash():
when the limit is given as 0, it should follow the same HASH_HIGHMEM
rules that are applied to the "scale" arg.  It currently goes:

	if (max == 0) {
		max = ((unsigned long long)nr_all_pages << PAGE_SHIFT) >> 4;
		do_div(max, bucketsize);
	}

Whereas it should probably go:

	if (max == 0) {
		max = (flags & HASH_HIGHMEM) ? nr_all_pages : nr_kernel_pages;
		max = (max << PAGE_SHIFT) >> 4;
		do_div(max, bucketsize);
	}

or something like that.



* Re: [PATCH] limit rt cache size
  2006-08-08  6:18       ` David Miller
@ 2006-08-08  6:53         ` Andi Kleen
  2006-08-08  7:01           ` David Miller
  0 siblings, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2006-08-08  6:53 UTC (permalink / raw)
  To: David Miller; +Cc: kuznet, dev, netdev


> 
> Whereas it should probably go:
> 
> 	if (max == 0) {
> 		max = (flags & HASH_HIGHMEM) ? nr_all_pages : nr_kernel_pages;
> 		max = (max << PAGE_SHIFT) >> 4;
> 		do_div(max, bucketsize);
> 	}
> 
> or something like that.

That's still too big. Consider a 2TB machine, with all memory in LOWMEM.

Or even a smaller system. At some point it doesn't make sense
anymore to go to a larger table, even if you have the memory,
and a larger table is actually costly because walking the whole
table (netstat, route, etc.) will take longer and longer.

It needs some additional limit: either a fixed maximum or something dynamic
like the CPU cache sizes (but there is currently no portable way to get that).

-Andi



* Re: [PATCH] limit rt cache size
  2006-08-08  6:53         ` Andi Kleen
@ 2006-08-08  7:01           ` David Miller
  2006-08-08 12:54             ` Kirill Korotaev
  0 siblings, 1 reply; 28+ messages in thread
From: David Miller @ 2006-08-08  7:01 UTC (permalink / raw)
  To: ak; +Cc: kuznet, dev, netdev

From: Andi Kleen <ak@suse.de>
Date: Tue, 8 Aug 2006 08:53:03 +0200

> That's still too big. Consider a 2TB machine, with all memory in LOWMEM.

Andi, I agree with you, route.c should pass in a suitable limit.
I'm just suggesting a fix for a separate problem.


* Re: [PATCH] limit rt cache size
  2006-08-08  3:42   ` David Miller
  2006-08-08  5:11     ` Andi Kleen
@ 2006-08-08  8:17     ` Kirill Korotaev
  2006-08-08  8:34       ` David Miller
  2006-08-08  8:57     ` Eric Dumazet
  2 siblings, 1 reply; 28+ messages in thread
From: Kirill Korotaev @ 2006-08-08  8:17 UTC (permalink / raw)
  To: David Miller; +Cc: kuznet, netdev

David Miller wrote:
> we quickly discover this GIT commit:
> 
> 424c4b70cc4ff3930ee36a2ef7b204e4d704fd26
> 
> [IPV4]: Use the fancy alloc_large_system_hash() function for route hash table
> 
> - rt hash table allocated using alloc_large_system_hash() function.
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> And it is clear that old code used num_physpages, which counts low
> memory only.  This shows clearly that Eric's usage of the HASH_HIGHMEM
> flag here is erroneous.  So we should remove it.
At least for i686, num_physpages includes highmem, so IMHO this bug has been there for years:

./arch/i386/mm/init.c:
static void __init set_max_mapnr_init(void)
{
#ifdef CONFIG_HIGHMEM
        num_physpages = highend_pfn;
#else
        num_physpages = max_low_pfn;
#endif
}

> Look!  This thing even uses num_physpages in current code to compute
> the "scale" argument to alloc_large_system_hash() :)))
The same bug here? :) The good thing is that it only selects a scale of 15
or 17; there is no other possible choice here :)))

>>As for the routing cache size, that looks like another bug.
>>route.c should not force rt_max_size = 16*rt_hash_size.
>>I think it should consult available memory and limit rt_max_size
>>to some reasonable value, even if the hash size is too high.
> 
> 
> Sure.  This current setting of 16*rt_hash_size is meant to
> try to limit hash chain lengths, I guess.  2.4.x does the same
> thing.  Note also that by basing it upon the number of routing cache
> hash chains, it is effectively consulting available memory.
> This is why, when the hash table sizing is crap, so is the rt_max_size
> calculation.  Fix one and you fix them both :)
IMHO limiting the chain length to 16 is not that bad, but to avoid such
"features" it probably should be fixed :)

> Once the HASH_HIGHMEM flag is removed, assuming system has > 128K of
> memory, what we get is:
> 
> 	hash_chains = lowmem / 128K
> 	rt_max_size = ((lowmem / 128K) * 16) == lowmem / 8K
> 
> So we allow one routing cache entry for each 8K of lowmem we have :)
> 
> So for now it is probably sufficient to just get rid of the
> HASH_HIGHMEM flag here.  Later we can try changing this multiplier
> of "16" to something like "8" or even "4".
Should we remove it for the TCP hashes too?

Thanks,
Kirill


* Re: [PATCH] limit rt cache size
  2006-08-08  8:17     ` Kirill Korotaev
@ 2006-08-08  8:34       ` David Miller
  0 siblings, 0 replies; 28+ messages in thread
From: David Miller @ 2006-08-08  8:34 UTC (permalink / raw)
  To: dev; +Cc: kuznet, netdev

From: Kirill Korotaev <dev@sw.ru>
Date: Tue, 08 Aug 2006 12:17:57 +0400

> At least for i686, num_physpages includes highmem, so IMHO this bug
> has been there for years:

Correct, I misread the x86 code.


* Re: [PATCH] limit rt cache size
  2006-08-08  3:42   ` David Miller
  2006-08-08  5:11     ` Andi Kleen
  2006-08-08  8:17     ` Kirill Korotaev
@ 2006-08-08  8:57     ` Eric Dumazet
  2006-08-08  9:12       ` David Miller
  2 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2006-08-08  8:57 UTC (permalink / raw)
  To: David Miller; +Cc: kuznet, dev, netdev

On Tuesday 08 August 2006 05:42, David Miller wrote:
> From: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
> Date: Mon, 7 Aug 2006 20:48:42 +0400
>
> > The patch looks OK. But I am not sure either.
> >
> > To be honest, I do not understand the purpose of the HASH_HIGHMEM flag.
> > At first sight, the hash table eats low memory, and the objects hashed in
> > this table also eat low memory. Why is its size calculated from total memory?
> > But given that this flag is used only by tcp.c and route.c, both of which
> > feed on low memory, I must be missing something important.
> >
> > Let's ask people on netdev.
>
> It is not so hard to check the history of the change to see where these
> things come from :-) If we study the output of the command:
>
> 	git whatchanged net/ipv4/route.c
>
> we quickly discover this GIT commit:
>
> 424c4b70cc4ff3930ee36a2ef7b204e4d704fd26
>
> [IPV4]: Use the fancy alloc_large_system_hash() function for route hash
> table
>
> - rt hash table allocated using alloc_large_system_hash() function.
>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
>
> And it is clear that old code used num_physpages, which counts low
> memory only.  This shows clearly that Eric's usage of the HASH_HIGHMEM
> flag here is erroneous.  So we should remove it.

Yes, probably.

If I recall correctly, I blindly copied the code from net/ipv4/tcp.c (the TCP
ehash table allocation). I was not aware of this HASH_HIGHMEM part.

As route allocations are SLAB_ATOMIC, while TCP sockets are allocated
SLAB_KERNEL, it makes sense to size the route hash table according to
nr_kernel_pages instead of nr_all_pages.

For TCP, an OOM is OK since sock_alloc_inode() should return NULL, and this
should be handled fine.

I think we had a discussion about being able to dynamically resize the route
hash table (or the TCP hash table) using RCU. Has anyone worked on this?
For most current machines (RAM size >= 1GB), the default hash table sizes are
just insane for 99% of uses.


>
> Look!  This thing even uses num_physpages in current code to compute
> the "scale" argument to alloc_large_system_hash() :)))
>
> > As for the routing cache size, that looks like another bug.
> > route.c should not force rt_max_size = 16*rt_hash_size.
> > I think it should consult available memory and limit rt_max_size
> > to some reasonable value, even if the hash size is too high.
>
> Sure.  This current setting of 16*rt_hash_size is meant to
> try to limit hash chain lengths, I guess.  2.4.x does the same
> thing.  Note also that by basing it upon the number of routing cache
> hash chains, it is effectively consulting available memory.
> This is why, when the hash table sizing is crap, so is the rt_max_size
> calculation.  Fix one and you fix them both :)
>
> Once the HASH_HIGHMEM flag is removed, assuming system has > 128K of
> memory, what we get is:
>
> 	hash_chains = lowmem / 128K
> 	rt_max_size = ((lowmem / 128K) * 16) == lowmem / 8K
>
> So we allow one routing cache entry for each 8K of lowmem we have :)
>
> So for now it is probably sufficient to just get rid of the
> HASH_HIGHMEM flag here.  Later we can try changing this multiplier
> of "16" to something like "8" or even "4".


* Re: [PATCH] limit rt cache size
  2006-08-08  8:57     ` Eric Dumazet
@ 2006-08-08  9:12       ` David Miller
  0 siblings, 0 replies; 28+ messages in thread
From: David Miller @ 2006-08-08  9:12 UTC (permalink / raw)
  To: dada1; +Cc: kuznet, dev, netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Tue, 8 Aug 2006 10:57:31 +0200

> I think we had a discussion about being able to dynamically resize
> the route hash table (or the TCP hash table) using RCU. Has anyone worked
> on this?  For most current machines (RAM size >= 1GB), the default
> hash table sizes are just insane for 99% of uses.

Specifically we were talking about TCP recently, but yes the
same rings true for the routing cache as well.


* Re: [PATCH] limit rt cache size
  2006-08-08  7:01           ` David Miller
@ 2006-08-08 12:54             ` Kirill Korotaev
  2006-08-08 12:58               ` Andi Kleen
  0 siblings, 1 reply; 28+ messages in thread
From: Kirill Korotaev @ 2006-08-08 12:54 UTC (permalink / raw)
  To: David Miller; +Cc: ak, kuznet, netdev

David Miller wrote:
> From: Andi Kleen <ak@suse.de>
> Date: Tue, 8 Aug 2006 08:53:03 +0200
> 
> 
>>That's still too big. Consider a 2TB machine, with all memory in LOWMEM.
> 
> 
> Andi I agree with you, route.c should pass in a suitable limit.
> I'm just suggesting a fix for a seperate problem.

So, summing up, we have the following issues IMHO:
1a) rt hash size should be calculated based on lowmem size, not total size
1b) rt hash size should have some upper limit (requested by Andi Kleen for huge mem systems)
2a) max number of rt hash entries should be calculated based on low memory, not as
   rt_hash_chains*16.
2b) when CONFIG_DEBUG_SLAB and CONFIG_DEBUG_PAGE_ALLOC are ON, 2a) should be corrected
   taking into account the _real_ rtable entry size (one page instead of 256 bytes!!!).
3) should we limit the TCP ehash and bhash sizes the same way?

If I haven't missed something I will prepare a patch for 1-2) and
a separate patch for 3). A rough sketch of how 1a/1b/2a might fit together
is below.
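
Just to make 1a/1b/2a concrete, here is an illustrative (untested) sketch of
what ip_rt_init() could look like; rt_hash_max_limit is a made-up name for
the upper bound from 1b, and the nr_kernel_pages/8 bound is only one possible
choice:

	rt_hash_table = (struct rt_hash_bucket *)
		alloc_large_system_hash("IP route cache",
					sizeof(struct rt_hash_bucket),
					rhash_entries,
					(num_physpages >= 128 * 1024) ?
					15 : 17,
					0,                  /* 1a: no HASH_HIGHMEM */
					&rt_hash_log,
					&rt_hash_mask,
					rt_hash_max_limit); /* 1b: explicit upper limit */

	ip_rt_max_size = (rt_hash_mask + 1) * 16;
	if (ip_rt_max_size > nr_kernel_pages / 8)   /* 2a: bound by lowmem */
		ip_rt_max_size = nr_kernel_pages / 8;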

Thanks,
Kirill


* Re: [PATCH] limit rt cache size
  2006-08-08 12:54             ` Kirill Korotaev
@ 2006-08-08 12:58               ` Andi Kleen
  0 siblings, 0 replies; 28+ messages in thread
From: Andi Kleen @ 2006-08-08 12:58 UTC (permalink / raw)
  To: Kirill Korotaev; +Cc: David Miller, kuznet, netdev


> 3) should we limit the TCP ehash and bhash sizes the same way?

Yes - or, even better, all hashes handled by alloc_large_system_hash().

-Andi


* Re: [PATCH] limit rt cache size
  2006-08-08  5:11     ` Andi Kleen
  2006-08-08  6:18       ` David Miller
@ 2006-08-08 20:37       ` akepner
  2006-08-08 23:23         ` Andi Kleen
  1 sibling, 1 reply; 28+ messages in thread
From: akepner @ 2006-08-08 20:37 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David Miller, kuznet, dev, netdev

On Tue, 8 Aug 2006, Andi Kleen wrote:

>
> The hash sizing code needs far more tweaks. iirc it can still allocate
> several GB hash tables on large memory systems (i've seen that once in 
> the boot log  of a 2TB system).  Even on smaller systems it is usually 
> too much.

Yes. Linear growth with memory is not a good algorithm for
sizing hashes on a machine with ~TB of memory... SGI has recommended
booting with "thash_entries=2097152", but that's often much
bigger than needed, too.

>
> IMHO there needs to be a maximum size (maybe related to the sum of
> caches of all CPUs in the system?)
>
> Best would be to fix this for all large system hashes together.

How about using an algorithm like this: up to a certain "size"
(memory size, cache size, ...), scale the hash tables linearly;
but for larger sizes, scale logarithmically (or approximately
logarithmically).

Code that implements this is below. (It's not perfect, but mostly
works. Just trying to get the idea across....) "struct hash_sizer"
has fields that fix the linear/logarithmic crossover point, and
size_system_hash() recommends the number of hash table entries
based on the amount of memory.


#include <stdio.h>

unsigned long nr_all_pages;
#define PAGE_SHIFT 12
#define PAGE_SIZE (1UL << PAGE_SHIFT)

static int 
long_log2 (unsigned long x)
{
 	int r = 0;
 	for (x >>= 1; x > 0; x >>= 1)
 		r++;
 	return r;
}

/*
  * scale linearly 64Kibuckets/GiByte up to 1 GiByte,
  * and (sort of) logarithmically thereafter
  */
struct hash_sizer {
 	unsigned long buckets; /* power of 2 */
 	unsigned long bytes;   /* multiple of PAGE_SIZE */
} tcp_hash_sizer = {
 	(1UL << 16), /* 64 Kibuckets */
 	(1UL << 30)  /* GiByte */
};

unsigned long 
size_system_hash (struct hash_sizer *hs) 
{
 	unsigned long size;
 	unsigned long long slog;
 	unsigned long scale, tmp;
 	int log2, i = 0;

 	if ( nr_all_pages < (hs->bytes/PAGE_SIZE)) {
 		/* linear scaling */
 		unsigned long long tmp = nr_all_pages;
 		tmp *= hs->buckets;
 		tmp /= (hs->bytes/PAGE_SIZE);
 		return (unsigned long)tmp;
 	}

 	/* logarithmic scaling */
 	log2 = long_log2(nr_all_pages);
 	scale  = (1 << log2);

 	slog = log2 * scale;

 	tmp = nr_all_pages;
 	tmp &= ~(1 << log2);

 	while (tmp) {

 		i++;
 		tmp <<= 1;
 		if (tmp & (1 << log2))
 			slog += (scale / (1 << i));
 	}

 	/* slog/scale is approximately log2(nr_all_pages)
 	 */

 	size = (hs->buckets * slog)/scale;
 	size -= hs->buckets * (long_log2(hs->bytes) - PAGE_SHIFT);
 	size += hs->buckets;

 	return size;

}

int 
main(void) 
{

 	unsigned long hash_size;

 	printf("# nr_all_pages memory_size [bytes] hash_size\n");

 	for (nr_all_pages = 1; nr_all_pages < (1UL << 26);
 		//nr_all_pages <<= 1) {
 		nr_all_pages += 7919) {

 		hash_size = size_system_hash(&tcp_hash_sizer);
 		printf("  %lu          %llu                %lu\n",
 			nr_all_pages,
 			(unsigned long long)(nr_all_pages)*PAGE_SIZE,
 			hash_size);
 	}

 	return 0;
}
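
(With these parameters, that works out to 64 Kibuckets for a 1 GiByte machine
and roughly 704 Kibuckets for a 1 TiByte machine, versus 64 Mibuckets if the
linear scaling were continued all the way up.)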


-- 
Arthur




* Re: [PATCH] limit rt cache size
  2006-08-08 20:37       ` akepner
@ 2006-08-08 23:23         ` Andi Kleen
  2006-08-09  0:06           ` akepner
  2006-08-09  0:11           ` David Miller
  0 siblings, 2 replies; 28+ messages in thread
From: Andi Kleen @ 2006-08-08 23:23 UTC (permalink / raw)
  To: akepner; +Cc: David Miller, kuznet, dev, netdev


> >
> > IMHO there needs to be a maximum size (maybe related to the sum of
> > caches of all CPUs in the system?)
> >
> > Best would be to fix this for all large system hashes together.
> 
> How about using an algorithm like this: up to a certain "size"
> (memory size, cache size,...), scale the hash tables linearly; 
> but for larger sizes, scale logarithmically (or approximately
> logarithmically)

I don't think it makes any sense to continue scaling at all after
some point - you won't get shorter hash chains anymore, and the
large hash tables actually cause problems: e.g. there are situations
where we walk the complete tables, and that takes longer and longer.

Also does a 1TB machine really need bigger hash tables than a 100GB one?

The problem is to find out what a good boundary is.

-Andi


* Re: [PATCH] limit rt cache size
  2006-08-08 23:23         ` Andi Kleen
@ 2006-08-09  0:06           ` akepner
  2006-08-09  0:11           ` David Miller
  1 sibling, 0 replies; 28+ messages in thread
From: akepner @ 2006-08-09  0:06 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David Miller, kuznet, dev, netdev

On Wed, 9 Aug 2006, Andi Kleen wrote:

> ....
> I don't think it makes any sense to continue scaling at all after
> some point - you won't get shorter hash chains anymore, and the
> large hash tables actually cause problems: e.g. there are situations
> where we walk the complete tables, and that takes longer and longer.
>

Logarithmic growth is _much_ slower than linear growth, so
scaling becomes quite restrained after the linear/logarithmic
boundary is exceeded.

> Also does a 1TB machine really need bigger hash tables than a 100GB one?
>

Maybe - depends on what the machine is used for, I suppose.

> The problem is to find out what a good boundary is.

True, and every choice involves some arbitrariness. But
the arbitrariness in my example is at least well contained
(2 values). Anyway, I think there are good arguments for log
scaling these kinds of things after some point.

-- 
Arthur



* Re: [PATCH] limit rt cache size
  2006-08-09  0:11           ` David Miller
@ 2006-08-09  0:11             ` akepner
  2006-08-09  0:22               ` David Miller
                                 ` (2 more replies)
  2006-08-09  0:24             ` Andi Kleen
  1 sibling, 3 replies; 28+ messages in thread
From: akepner @ 2006-08-09  0:11 UTC (permalink / raw)
  To: David Miller; +Cc: ak, kuznet, dev, netdev

On Tue, 8 Aug 2006, David Miller wrote:

> From: Andi Kleen <ak@suse.de>
> Date: Wed, 9 Aug 2006 01:23:01 +0200
>
>> The problem is to find out what a good boundary is.
>
> The more I think about this the more I lean towards
> two conclusions:
>
> 1) dynamic table growth is the only reasonable way to
>   handle this and not waste memory in all cases
> ....

Definitely that's the ideal way to go.

But there's a lot of state to update (more or less
atomically, too) in the TCP hashes. Looks tricky to
do that without hurting performance, especially since
you'll probably want to resize the tables when you've
discovered they're full and busy....

-- 
Arthur




* Re: [PATCH] limit rt cache size
  2006-08-08 23:23         ` Andi Kleen
  2006-08-09  0:06           ` akepner
@ 2006-08-09  0:11           ` David Miller
  2006-08-09  0:11             ` akepner
  2006-08-09  0:24             ` Andi Kleen
  1 sibling, 2 replies; 28+ messages in thread
From: David Miller @ 2006-08-09  0:11 UTC (permalink / raw)
  To: ak; +Cc: akepner, kuznet, dev, netdev

From: Andi Kleen <ak@suse.de>
Date: Wed, 9 Aug 2006 01:23:01 +0200

> The problem is to find out what a good boundary is.

The more I think about this the more I lean towards
two conclusions:

1) dynamic table growth is the only reasonable way to
   handle this and not waste memory in all cases

2) for cases where we haven't implemented dynamic
   table growth, specifying a proper limit argument
   to the hash table allocation is a sufficient
   solution for the time being


* Re: [PATCH] limit rt cache size
  2006-08-09  0:11             ` akepner
@ 2006-08-09  0:22               ` David Miller
  2006-08-09  1:02               ` Andi Kleen
  2006-08-09  8:05               ` Kirill Korotaev
  2 siblings, 0 replies; 28+ messages in thread
From: David Miller @ 2006-08-09  0:22 UTC (permalink / raw)
  To: akepner; +Cc: ak, kuznet, dev, netdev

From: akepner@sgi.com
Date: Tue, 8 Aug 2006 17:11:29 -0700 (PDT)

> But there's a lot of state to update (more or less
> atomically, too) in the TCP hashes. Looks tricky to
> do that without hurting performance, especially since
> you'll probably want to resize the tables when you've
> discovered they're full and busy....

There have been two failed attempts at this :)

For the time being I think the routing cache can be
dynamically sized with less effort, and I might take
a stab at that today.


* Re: [PATCH] limit rt cache size
  2006-08-09  0:11           ` David Miller
  2006-08-09  0:11             ` akepner
@ 2006-08-09  0:24             ` Andi Kleen
  2006-08-09  0:32               ` David Miller
  2006-08-09  8:09               ` Kirill Korotaev
  1 sibling, 2 replies; 28+ messages in thread
From: Andi Kleen @ 2006-08-09  0:24 UTC (permalink / raw)
  To: David Miller; +Cc: akepner, kuznet, dev, netdev

On Wednesday 09 August 2006 02:11, David Miller wrote:
> From: Andi Kleen <ak@suse.de>
> Date: Wed, 9 Aug 2006 01:23:01 +0200
> 
> > The problem is to find out what a good boundary is.
> 
> The more I think about this the more I lean towards
> two conclusions:
> 
> 1) dynamic table growth is the only reasonable way to
>    handle this and not waste memory in all cases

Yes, but even with dynamic growth you still need some upper boundary 
(otherwise a DOS could eat all your memory). And it would need
to be figured out what it is.

BTW does dynamic shrink after a load spike make sense too?

> 2) for cases where we haven't implemented dynamic
>    table growth, specifying a proper limit argument
>    to the hash table allocation is a sufficient
>    solution for the time being

Agreed, just we don't know what the proper limits are. 

I guess it would need someone running quite a lot of benchmarks.
Anyone volunteering? @)

Or do some cheesy default and document the options to change
it clearly and wait for feedback from users on what works for
them?

-Andi


* Re: [PATCH] limit rt cache size
  2006-08-09  0:24             ` Andi Kleen
@ 2006-08-09  0:32               ` David Miller
  2006-08-09  8:09               ` Kirill Korotaev
  1 sibling, 0 replies; 28+ messages in thread
From: David Miller @ 2006-08-09  0:32 UTC (permalink / raw)
  To: ak; +Cc: akepner, kuznet, dev, netdev

From: Andi Kleen <ak@suse.de>
Date: Wed, 9 Aug 2006 02:24:04 +0200

> Yes, but even with dynamic growth you still need some upper boundary 
> (otherwise a DOS could eat all your memory). And it would need
> to be figured out what it is.

Absolutely.  Otherwise the GC'ing of the routing cache would
break horribly.

> BTW does dynamic shrink after a load spike make sense too?

I'm not sure about that bit yet.

> Or do some cheesy default and document the options to change
> it clearly and wait for feedback from users on what works for
> them?

I think this makes sense.  An initial upper limit of 1 million hash
table slots seems reasonable at first.  That's 8MB on a 64-bit
machine.


* Re: [PATCH] limit rt cache size
  2006-08-09  0:11             ` akepner
  2006-08-09  0:22               ` David Miller
@ 2006-08-09  1:02               ` Andi Kleen
  2006-08-09 16:16                 ` akepner
  2006-08-09  8:05               ` Kirill Korotaev
  2 siblings, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2006-08-09  1:02 UTC (permalink / raw)
  To: akepner; +Cc: David Miller, kuznet, dev, netdev


> But there's a lot of state to update (more or less
> atomically, too) 

Why does it need to be atomic? It might be enough
to just check a flag and poll for it in the readers and then redo the 
lookup.
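
Roughly like this (a deliberately simplified sketch, not real kernel code;
chain_lookup(), the table structs and the generation counter are all made up,
and the barriers/RCU details needed to keep the old table valid during a
resize are omitted): readers sample a generation count around the walk and
redo the lookup if a resizer bumped it in between.

	struct entry { unsigned long key; struct entry *next; };
	struct bucket { struct entry *chain; };
	struct table { struct bucket *buckets; unsigned int mask; };

	/* hypothetical helper: walk one hash chain in the given table */
	static struct entry *chain_lookup(struct table *t, unsigned long key)
	{
		struct entry *e = t->buckets[key & t->mask].chain;

		while (e && e->key != key)
			e = e->next;
		return e;
	}

	/* reader side: redo the walk if the table changed underneath us */
	struct entry *lookup(struct table * volatile *active,
			     volatile unsigned int *generation,
			     unsigned long key)
	{
		struct entry *e;
		unsigned int gen;

		do {
			gen = *generation;              /* sample before the walk */
			e = chain_lookup(*active, key); /* walk the current table */
		} while (gen != *generation);           /* table replaced: redo   */

		return e;
	}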

> in the TCP hashes. Looks tricky to 
> do that without hurting performance, especially since
> you'll probably want to resize the tables when you've
> discovered they're full and busy....

There will be some hiccup, but as long as the downtime
is limited it shouldn't be too bad.

-Andi

 


* Re: [PATCH] limit rt cache size
  2006-08-09  0:11             ` akepner
  2006-08-09  0:22               ` David Miller
  2006-08-09  1:02               ` Andi Kleen
@ 2006-08-09  8:05               ` Kirill Korotaev
  2 siblings, 0 replies; 28+ messages in thread
From: Kirill Korotaev @ 2006-08-09  8:05 UTC (permalink / raw)
  To: akepner; +Cc: David Miller, ak, kuznet, netdev

>> 1) dynamic table growth is the only reasonable way to
>>   handle this and not waste memory in all cases
>> ....
> 
> 
> Definitely that's the ideal way to go.
> 
> But there's a lot of state to update (more or less
> atomically, too) in the TCP hashes. Looks tricky to
> do that without hurting performance, especially since
> you'll probably want to resize the tables when you've
> discovered they're full and busy....
and the memory is fragmented too! :/

Kirill



* Re: [PATCH] limit rt cache size
  2006-08-09  0:24             ` Andi Kleen
  2006-08-09  0:32               ` David Miller
@ 2006-08-09  8:09               ` Kirill Korotaev
  2006-08-09  8:53                 ` Eric Dumazet
  1 sibling, 1 reply; 28+ messages in thread
From: Kirill Korotaev @ 2006-08-09  8:09 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David Miller, akepner, kuznet, netdev

>>2) for cases where we haven't implemented dynamic
>>   table growth, specifying a proper limit argument
>>   to the hash table allocation is a sufficient
>>   solution for the time being
> 
> 
> Agreed, just we don't know what the proper limits are. 
> 
> I guess it would need someone running quite a lot of benchmarks.
> Anyone volunteering? @)
In my original post I noted how it is quite easy to consume
a whole 1 GB of RAM on an i686 PC (and that's only 4,194,304 entries);
it looks like with more IP addresses it is not that hard to
consume much more memory.

> Or do some cheesy default and document the options to change
> it clearly and wait for feedback from users on what works for
> them?

Kirill


* Re: [PATCH] limit rt cache size
  2006-08-09  8:09               ` Kirill Korotaev
@ 2006-08-09  8:53                 ` Eric Dumazet
  2006-08-09  9:22                   ` David Miller
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2006-08-09  8:53 UTC (permalink / raw)
  To: Kirill Korotaev; +Cc: Andi Kleen, David Miller, akepner, kuznet, netdev

On Wednesday 09 August 2006 10:09, Kirill Korotaev wrote:
> >>2) for cases where we haven't implemented dynamic
> >>   table growth, specifying a proper limit argument
> >>   to the hash table allocation is a sufficient
> >>   solution for the time being
> >
> > Agreed, just we don't know what the proper limits are.
> >
> > I guess it would need someone running quite a lot of benchmarks.
> > Anyone volunteering? @)
>
> In my original post I noted how it is quite easy to consume
> a whole 1 GB of RAM on an i686 PC (and that's only 4,194,304 entries);
> it looks like with more IP addresses it is not that hard to
> consume much more memory.

If MAX_ORDER = 11, we have a max hash table of 8 MB: 2097152 slots.
But even 2097152 dst entries need 139810 pages (560 MB of lowmem), so 16 times
that needs... too much RAM.

Probably a test like this is necessary:

if (ip_rt_max_size > (nr_kernel_pages/8))
    ip_rt_max_size = (nr_kernel_pages/8);

Eric


* Re: [PATCH] limit rt cache size
  2006-08-09  8:53                 ` Eric Dumazet
@ 2006-08-09  9:22                   ` David Miller
  0 siblings, 0 replies; 28+ messages in thread
From: David Miller @ 2006-08-09  9:22 UTC (permalink / raw)
  To: dada1; +Cc: dev, ak, akepner, kuznet, netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Wed, 9 Aug 2006 10:53:18 +0200

> If MAX_ORDER = 11, we have a max hash table of 8 MB: 2097152 slots.
> But even 2097152 dst entries need 139810 pages (560 MB of lowmem), so 16 times
> that needs... too much RAM.
> 
> Probably a test like this is necessary :
> 
> if (ip_rt_max_size > (nr_kernel_pages/8))
>     ip_rt_max_size = (nr_kernel_pages/8);

Perhaps the 8MB limit in my patches is not so arbitrary :-)

But it should be based upon nr_kernel_pages for sure, and
we'd need to kill the __meminitdata tag on that thing if
we started to use it like this.


* Re: [PATCH] limit rt cache size
  2006-08-09  1:02               ` Andi Kleen
@ 2006-08-09 16:16                 ` akepner
  2006-08-09 16:32                   ` Andi Kleen
  0 siblings, 1 reply; 28+ messages in thread
From: akepner @ 2006-08-09 16:16 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David Miller, kuznet, dev, netdev

On Wed, 9 Aug 2006, Andi Kleen wrote:

>
>> But there's alot of state to update (more or less
>> atomically, too)
>
> Why does it need to be atomic? It might be enough
> to just check a flag and poll for it in the readers and then redo the
> lookup.
> ....

(I qualified "atomic" with "more or less" :-)

Sure, something like that could work.

>
> There will be some hickup, but as long as the downtime
> is limited it shouldn't be too bad.
>

Benchmarks are in order....

-- 
Arthur



* Re: [PATCH] limit rt cache size
  2006-08-09 16:16                 ` akepner
@ 2006-08-09 16:32                   ` Andi Kleen
  2006-08-10  0:02                     ` David Miller
  0 siblings, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2006-08-09 16:32 UTC (permalink / raw)
  To: akepner; +Cc: David Miller, kuznet, dev, netdev


> > There will be some hiccup, but as long as the downtime
> > is limited it shouldn't be too bad.
> >
> 
> Benchmarks are in order....

One issue I forgot earlier and Kirill pointed out is that the
reallocation would require vmalloc because memory will be too fragmented
to get a big piece of physical memory. So it would add TLB pressure. 

Can't think of a good way around that.

You might have been right to be sceptical.

-Andi


* Re: [PATCH] limit rt cache size
  2006-08-09 16:32                   ` Andi Kleen
@ 2006-08-10  0:02                     ` David Miller
  0 siblings, 0 replies; 28+ messages in thread
From: David Miller @ 2006-08-10  0:02 UTC (permalink / raw)
  To: ak; +Cc: akepner, kuznet, dev, netdev

From: Andi Kleen <ak@suse.de>
Date: Wed, 9 Aug 2006 18:32:26 +0200

> One issue I forgot earlier and Kirill pointed out is that the
> reallocation would require vmalloc because memory will be too fragmented
> to get a big piece of physical memory. So it would add TLB pressure. 
> 
> Can't think of a good way around that.
> 
> You might have been right to be sceptical.

For NUMA this already happens; is it a problem in practice?


