* [PATCH] net: Saner thash_entries default with much memory
From: Jean Delvare @ 2007-10-26 15:21 UTC (permalink / raw)
To: netdev; +Cc: Andi Kleen
On systems with a very large amount of memory, the heuristics in
alloc_large_system_hash() result in a very large TCP established hash
table: 16 million entries for a 128 GB ia64 system. This makes
reading from /proc/net/tcp pretty slow (well over a second) and as a
result netstat is slow on these machines. I know that /proc/net/tcp is
deprecated in favor of tcp_diag; however, at the moment netstat only
knows of the former.
I am skeptical that such a large TCP established hash is often needed.
Just because a system has a lot of memory doesn't imply that it will
have several million concurrent TCP connections. Thus I believe
that we should put an arbitrary upper limit on the size of the TCP
established hash by default. Users who really need a bigger hash can
always use the thash_entries boot parameter to get more.
I propose 2 million entries as the arbitrary upper limit. This
makes /proc/net/tcp reasonably fast on the system in question (0.2 s)
while still being large enough for me to be confident that network
performance won't suffer.
This is just one way to limit the hash size; there are others. I am not
familiar enough with the TCP code to decide which is best, so I
would welcome proposals for alternatives.
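For reference, the patch below works by passing a non-zero value as the
last argument of alloc_large_system_hash(). As a rough sketch (the exact
prototype in the 2.6.24-rc sources may differ slightly), that last
argument is an upper bound on the number of entries:

/* mm/page_alloc.c -- approximate prototype, for illustration only */
void *__init alloc_large_system_hash(const char *tablename,
				     unsigned long bucketsize,
				     unsigned long numentries, /* 0: size from amount of memory */
				     int scale,
				     int flags,
				     unsigned int *_hash_shift,
				     unsigned int *_hash_mask,
				     unsigned long limit);     /* 0: no upper bound */

So 'thash_entries ? 0 : 2 * 1024 * 1024' caps the automatically sized
table at 2 million entries, while an explicit thash_entries= boot
parameter remains unconstrained.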
Signed-off-by: Jean Delvare <jdelvare@suse.de>
---
net/ipv4/tcp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- linux-2.6.24-rc1.orig/net/ipv4/tcp.c 2007-10-24 09:59:58.000000000 +0200
+++ linux-2.6.24-rc1/net/ipv4/tcp.c 2007-10-26 16:26:41.000000000 +0200
@@ -2453,7 +2453,7 @@ void __init tcp_init(void)
0,
&tcp_hashinfo.ehash_size,
NULL,
- 0);
+ thash_entries ? 0 : 2 * 1024 * 1024);
tcp_hashinfo.ehash_size = 1 << tcp_hashinfo.ehash_size;
for (i = 0; i < tcp_hashinfo.ehash_size; i++) {
rwlock_init(&tcp_hashinfo.ehash[i].lock);
--
Jean Delvare
Suse L3
* Re: [PATCH] net: Saner thash_entries default with much memory
From: Andi Kleen @ 2007-10-26 15:34 UTC (permalink / raw)
To: Jean Delvare; +Cc: netdev, Andi Kleen
On Fri, Oct 26, 2007 at 05:21:31PM +0200, Jean Delvare wrote:
> I know that /proc/net/tcp is
> deprecated in favor of tcp_diag; however, at the moment netstat only
> knows of the former.
Even tcp_diag will be slow when all slots are dumped. It's a fundamental
problem of the data structure. /proc just has a slightly higher constant
factor overhead, that's all.
Also there are some tricks to make it a little faster (e.g. the old
patches that avoid taking the lock for empty buckets), but again that
only patches constant factors, not the fundamental scaling issue.
> I propose 2 million entries as the arbitrary upper limit. This
It's probably still far too large.
-Andi
* Re: [PATCH] net: Saner thash_entries default with much memory
From: akepner @ 2007-10-26 15:55 UTC (permalink / raw)
To: Jean Delvare; +Cc: netdev, Andi Kleen, Robert.Olsson
On Fri, Oct 26, 2007 at 05:21:31PM +0200, Jean Delvare wrote:
> ....
> This is just one way to limit the hash size; there are others. I am not
> familiar enough with the TCP code to decide which is best, so I
> would welcome proposals for alternatives.
>
Yeah, the existing way of sizing them is very bad on large systems
and hardcoding a size sucks (though that's what we do on large
Altix systems now).
IIRC there was some talk about using a different data structure
here - something that wouldn't require a fixed size and that
could maintain reasonably fast lookup even when it got large.
Robert, has work been done to use TRASH to address this problem?
Other ideas?
--
Arthur
* Re: [PATCH] net: Saner thash_entries default with much memory
From: David Miller @ 2007-10-30 7:57 UTC (permalink / raw)
To: ak; +Cc: jdelvare, netdev
From: Andi Kleen <ak@suse.de>
Date: Fri, 26 Oct 2007 17:34:17 +0200
> On Fri, Oct 26, 2007 at 05:21:31PM +0200, Jean Delvare wrote:
> > I propose 2 million entries as the arbitrary upper limit. This
>
> It's probably still far too large.
I agree. Perhaps a better number is something on the order of
(512 * 1024) so I think I'll check in a variant of Jean's patch
with just the limit decreased like that.
Using just some back of the envelope calculations, on UP 64-bit
systems each socket uses about 2424 bytes minimum of memory (this is
the sum of tcp_sock, inode, dentry, socket, and file on sparc64 UP).
This is an underestimate because it does not even consider things like
allocator overhead.
Next, machines that service that many sockets typically have them
mostly with full transmit queues talking to a very slow receiver at
the other end. So let's estimate that on average each socket consumes
about 64K of retransmit queue data.
I think this is an extremely conservative estimate because it doesn't
even consider overhead coming from struct sk_buff and related state.
So for (512 * 1024) of established sockets we consume roughly 35GB of
memory, this is '((2424 + (64 * 1024)) * (512 * 1024))'.
So to me (512 * 1024) is a very reasonable limit and (with lockdep
and spinlock debugging disabled) this makes the EHASH table consume
8MB on UP 64-bit and ~12MB on SMP 64-bit systems.
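To make this easier to check, here is a small stand-alone sketch of the
same estimate (the per-socket figure is the one quoted above, and the
per-bucket sizes of 16 and 24 bytes are simply inferred from the 8MB and
12MB numbers, not read out of the actual struct layout):

#include <stdio.h>

int main(void)
{
	unsigned long long per_sock = 2424 + 64 * 1024; /* socket structs + ~64K retransmit queue */
	unsigned long long entries  = 512 * 1024;       /* proposed ehash limit */

	/* memory needed to actually fill the table with established sockets */
	printf("socket memory: ~%llu GB\n", per_sock * entries / 1000000000ULL);

	/* the ehash table itself, UP vs SMP bucket sizes on 64-bit */
	printf("ehash table: %llu MB (UP), %llu MB (SMP)\n",
	       entries * 16 / (1024 * 1024),
	       entries * 24 / (1024 * 1024));
	return 0;
}

With those assumptions it prints roughly 35 GB of socket memory against
8 MB (UP) and 12 MB (SMP) for the table itself.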
Thanks.
* Re: [PATCH] net: Saner thash_entries default with much memory
From: Jean Delvare @ 2007-10-30 13:18 UTC (permalink / raw)
To: David Miller; +Cc: ak, netdev
Hi David,
On Tuesday 30 October 2007, David Miller wrote:
> From: Andi Kleen <ak@suse.de>
> Date: Fri, 26 Oct 2007 17:34:17 +0200
>
> > On Fri, Oct 26, 2007 at 05:21:31PM +0200, Jean Delvare wrote:
> > > I propose 2 million entries as the arbitrary upper limit. This
> >
> > It's probably still far too large.
>
> I agree. Perhaps a better number is something on the order of
> (512 * 1024) so I think I'll check in a variant of Jean's patch
> with just the limit decreased like that.
That's fine with me. I originally proposed an admittedly high
limit value to increase the chances of it being accepted. I am not
familiar enough with networking to know what a more reasonable
limit would be, so I'm leaving it to the experts.
> Using just some back of the envelope calculations, on UP 64-bit
> systems each socket uses about 2424 bytes minimum of memory (this is
> the sum of tcp_sock, inode, dentry, socket, and file on sparc64 UP).
> This is an underestimate because it does not even consider things like
> allocator overhead.
>
> Next, machines that service that many sockets typically have them
> mostly with full transmit queues talking to a very slow receiver at
> the other end. So let's estimate that on average each socket consumes
> about 64K of retransmit queue data.
>
> I think this is an extremely conservative estimate because it doesn't
> even consider overhead coming from struct sk_buff and related state.
>
> So for (512 * 1024) of established sockets we consume roughly 35GB of
> memory, this is '((2424 + (64 * 1024)) * (512 * 1024))'.
>
> So to me (512 * 1024) is a very reasonable limit and (with lockdep
> and spinlock debugging disabled) this makes the EHASH table consume
> 8MB on UP 64-bit and ~12MB on SMP 64-bit systems.
OK, let's go with (512 * 1024) then. Want me to send an updated patch?
Thanks,
--
Jean Delvare
Suse L3
* Re: [PATCH] net: Saner thash_entries default with much memory
From: Andi Kleen @ 2007-10-30 20:42 UTC (permalink / raw)
To: David Miller; +Cc: jdelvare, netdev
> Next, machines that service that many sockets typically have them
> mostly with full transmit queues talking to a very slow receiver at
> the other end.
Not sure -- there are likely use cases with lots of idle but connected
sockets.
Also the constraint here is not really how many sockets are served,
but how well the hash function manages to spread them in the table. I don't
have good data on that.
But still, (512 * 1024) sounds reasonable: e.g. in the lots-of-idle-sockets
case you're probably fine with the hash chains having more than one entry
in the worst case, because a small working set will fit in cache, and as
long as the chains do not end up very long, walking a short in-cache list
will still be fast enough.
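(For a rough sense of scale: with the table capped at 512 * 1024 buckets,
even 2 million established connections would average only
2 * 1024 * 1024 / (512 * 1024) = 4 entries per chain, assuming the hash
spreads them evenly.)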
> So to me (512 * 1024) is a very reasonable limit and (with lockdep
> and spinlock debugging disabled) this makes the EHASH table consume
> 8MB on UP 64-bit and ~12MB on SMP 64-bit systems.
I still have my doubts that it makes sense to have a separate lock for each
bucket. It would probably be better to just divide the hash value by a
factor again and then use that to index a smaller lock-only table.
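Something along these lines, as a purely illustrative sketch (the names
ehash_locks/EHASH_LOCK_SZ are made up here, not existing symbols; the
array would be rwlock_init()'d at boot, which is not shown):

#include <linux/spinlock.h>

/* One lock covers a whole group of hash buckets instead of one lock
 * per bucket.  Must be a power of two so we can mask instead of mod. */
#define EHASH_LOCK_SZ	256

static rwlock_t ehash_locks[EHASH_LOCK_SZ];

static inline rwlock_t *ehash_lock_for(unsigned int hash)
{
	/* reuse the bucket hash, just fold it into the smaller table */
	return &ehash_locks[hash & (EHASH_LOCK_SZ - 1)];
}

The hash table itself can stay as large as it needs to be; only the lock
array shrinks.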
-Andi
* Re: [PATCH] net: Saner thash_entries default with much memory
From: David Miller @ 2007-10-30 21:11 UTC (permalink / raw)
To: jdelvare; +Cc: ak, netdev
From: Jean Delvare <jdelvare@suse.de>
Date: Tue, 30 Oct 2007 14:18:27 +0100
> OK, let's go with (512 * 1024) then. Want me to send an updated patch?
Why submit a patch that's already in Linus's tree :-)
* Re: [PATCH] net: Saner thash_entries default with much memory
From: David Miller @ 2007-10-30 21:46 UTC (permalink / raw)
To: ak; +Cc: jdelvare, netdev
From: Andi Kleen <ak@suse.de>
Date: Tue, 30 Oct 2007 21:42:05 +0100
> I still have my doubts that it makes sense to have a separate lock for each
> bucket. It would probably be better to just divide the hash value by a
> factor again and then use that to index a smaller lock-only table.
Yes, and that's why we do it this way in the routing cache hashes.