public inbox for linux-kernel@vger.kernel.org
* RE: Limit hash table size
@ 2004-02-06  0:10 Chen, Kenneth W
  2004-02-06  0:23 ` Andrew Morton
  0 siblings, 1 reply; 38+ messages in thread
From: Chen, Kenneth W @ 2004-02-06  0:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-ia64

Andrew,

Will you merge the changes in the network area first, while I'm working
on the solution suggested here for inode and dentry?  The 2GB TCP hash
is the biggest problem for us right now.

- Ken


-----Original Message-----
From: Andrew Morton [mailto:akpm@osdl.org] 
Sent: Thursday, February 05, 2004 3:58 PM
To: Chen, Kenneth W
Cc: linux-kernel@vger.kernel.org; linux-ia64@vger.kernel.org
Subject: Re: Limit hash table size

Ken, I remain unhappy with this patch.  If a big box has 500 million
dentries or inodes in cache (which is possible), those hash chains will
be more than 200 entries long on average.  It will be very slow.

We need to do something smarter.  At least, for machines which do not
have the ia64 proliferation-of-zones problem.

Maybe we should leave the sizing of these tables as-is, and add some
hook which allows the architecture to scale them back.

^ permalink raw reply	[flat|nested] 38+ messages in thread
* RE: Limit hash table size
@ 2004-02-18  0:45 Chen, Kenneth W
  0 siblings, 0 replies; 38+ messages in thread
From: Chen, Kenneth W @ 2004-02-18  0:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-ia64

[-- Attachment #1: Type: text/plain, Size: 791 bytes --]

Updates to Documentation/kernel-parameters.txt for the 4 new boot time
parameters.

- Ken


-----Original Message-----
From: Chen, Kenneth W 
Sent: Tuesday, February 17, 2004 4:17 PM
To: 'Andrew Morton'
Cc: linux-kernel@vger.kernel.org; linux-ia64@vger.kernel.org
Subject: RE: Limit hash table size

> I think it would be better to leave things as they are,
> with a boot option to scale the tables down.

OK, here is one that adds boot time parameters only and leaves everything
else untouched.

> The below patch addresses the inode and dentry caches.
> Need to think a bit more about the networking ones.

Don't know what happened to my mailer; the mailing list archive shows the
complete patch, including the network part.  Hope this time you receive
the whole thing.

[-- Attachment #2: knl_param.patch --]
[-- Type: application/octet-stream, Size: 1234 bytes --]

diff -Nur linux-2.6.3-rc4/Documentation/kernel-parameters.txt linux-2.6.3-rc4.hash/Documentation/kernel-parameters.txt
--- linux-2.6.3-rc4/Documentation/kernel-parameters.txt	2004-02-16 18:23:44.000000000 -0800
+++ linux-2.6.3-rc4.hash/Documentation/kernel-parameters.txt	2004-02-17 16:41:58.000000000 -0800
@@ -292,6 +292,9 @@
 
 	devfs=		[DEVFS]
 			See Documentation/filesystems/devfs/boot-options.
+
+	dhash_entries=	[KNL]
+			Set number of hash buckets for dentry cache.
  
 	digi=		[HW,SERIAL]
 			IO parameters + enable/disable command.
@@ -424,6 +427,9 @@
 	idle=		[HW]
 			Format: idle=poll or idle=halt
  
+	ihash_entries=	[KNL]
+			Set number of hash buckets for inode cache.
+
 	in2000=		[HW,SCSI]
 			See header of drivers/scsi/in2000.c.
 
@@ -873,6 +879,9 @@
 
 	resume=		[SWSUSP] Specify the partition device for software suspension
 
+	rhash_entries=	[KNL,NET]
+			Set number of hash buckets for route cache
+ 
 	riscom8=	[HW,SERIAL]
 			Format: <io_board1>[,<io_board2>[,...<io_boardN>]]
 
@@ -1135,6 +1144,9 @@
 	tgfx_2=		See Documentation/input/joystick-parport.txt.
 	tgfx_3=
 
+	thash_entries=	[KNL,NET]
+			Set number of hash buckets for TCP connection
+
 	tipar=		[HW]
 			See header of drivers/char/tipar.c.
 

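With all four parameters documented, the tables can be sized from the boot
loader in one go.  A hypothetical example (the entry counts here are purely
illustrative, not recommendations), e.g. appended to a GRUB kernel line:

```
kernel /vmlinuz-2.6.3 root=/dev/sda1 ro \
	dhash_entries=262144 ihash_entries=131072 \
	rhash_entries=65536 thash_entries=131072
```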
* RE: Limit hash table size
@ 2004-02-18  0:16 Chen, Kenneth W
  0 siblings, 0 replies; 38+ messages in thread
From: Chen, Kenneth W @ 2004-02-18  0:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-ia64

[-- Attachment #1: Type: text/plain, Size: 478 bytes --]

> I think it would be better to leave things as they are,
> with a boot option to scale the tables down.

OK, here is one that adds boot time parameters only and leaves everything
else untouched.

> The below patch addresses the inode and dentry caches.
> Need to think a bit more about the networking ones.

Don't know what happened to my mailer; the mailing list archive shows the
complete patch, including the network part.  Hope this time you receive
the whole thing.

[-- Attachment #2: hash5.patch --]
[-- Type: application/octet-stream, Size: 4055 bytes --]

diff -Nur linux-2.6.2-rc1/fs/dcache.c linux-2.6.2-rc1.ken/fs/dcache.c
--- linux-2.6.2-rc1/fs/dcache.c	2004-01-20 19:49:25.000000000 -0800
+++ linux-2.6.2-rc1.ken/fs/dcache.c	2004-02-17 15:31:35.000000000 -0800
@@ -1527,6 +1527,16 @@
 	return ino;
 }
 
+static __initdata unsigned long dhash_entries;
+static int __init set_dhash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	dhash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("dhash_entries=", set_dhash_entries);
+
 static void __init dcache_init(unsigned long mempages)
 {
 	struct hlist_head *d;
@@ -1552,11 +1562,13 @@
 	
 	set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);
 
-#if PAGE_SHIFT < 13
-	mempages >>= (13 - PAGE_SHIFT);
-#endif
-	mempages *= sizeof(struct hlist_head);
-	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
+	if (!dhash_entries) {
+		dhash_entries = PAGE_SHIFT < 13 ?
+				mempages >> (13 - PAGE_SHIFT) :
+				mempages << (PAGE_SHIFT - 13);
+	}
+	dhash_entries *= sizeof(struct hlist_head);
+	for (order = 0; ((1UL << order) << PAGE_SHIFT) < dhash_entries; order++)
 		;
 
 	do {
diff -Nur linux-2.6.2-rc1/fs/inode.c linux-2.6.2-rc1.ken/fs/inode.c
--- linux-2.6.2-rc1/fs/inode.c	2004-01-20 19:50:41.000000000 -0800
+++ linux-2.6.2-rc1.ken/fs/inode.c	2004-02-17 15:30:51.000000000 -0800
@@ -1327,6 +1327,16 @@
 		wake_up_all(wq);
 }
 
+static __initdata unsigned long ihash_entries;
+static int __init set_ihash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	ihash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("ihash_entries=", set_ihash_entries);
+
 /*
  * Initialize the waitqueues and inode hash table.
  */
@@ -1340,9 +1350,13 @@
 	for (i = 0; i < ARRAY_SIZE(i_wait_queue_heads); i++)
 		init_waitqueue_head(&i_wait_queue_heads[i].wqh);
 
-	mempages >>= (14 - PAGE_SHIFT);
-	mempages *= sizeof(struct hlist_head);
-	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
+	if (!ihash_entries) {
+		ihash_entries = PAGE_SHIFT < 14 ?
+				mempages >> (14 - PAGE_SHIFT) :
+				mempages << (PAGE_SHIFT - 14);
+	}
+	ihash_entries *= sizeof(struct hlist_head);
+	for (order = 0; ((1UL << order) << PAGE_SHIFT) < ihash_entries; order++)
 		;
 
 	do {
diff -Nur linux-2.6.2-rc1/net/ipv4/route.c linux-2.6.2-rc1.ken/net/ipv4/route.c
--- linux-2.6.2-rc1/net/ipv4/route.c	2004-01-20 19:50:41.000000000 -0800
+++ linux-2.6.2-rc1.ken/net/ipv4/route.c	2004-02-17 15:43:59.000000000 -0800
@@ -2717,6 +2717,16 @@
 #endif /* CONFIG_PROC_FS */
 #endif /* CONFIG_NET_CLS_ROUTE */
 
+static __initdata unsigned long rhash_entries;
+static int __init set_rhash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	rhash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("rhash_entries=", set_rhash_entries);
+
 int __init ip_rt_init(void)
 {
 	int i, order, goal, rc = 0;
@@ -2742,8 +2752,10 @@
 	if (!ipv4_dst_ops.kmem_cachep)
 		panic("IP: failed to allocate ip_dst_cache\n");
 
-	goal = num_physpages >> (26 - PAGE_SHIFT);
-
+	if (!rhash_entries)
+		goal = num_physpages >> (26 - PAGE_SHIFT);
+	else
+		goal = (rhash_entries * sizeof(struct rt_hash_bucket)) >> PAGE_SHIFT;
 	for (order = 0; (1UL << order) < goal; order++)
 		/* NOTHING */;
 
diff -Nur linux-2.6.2-rc1/net/ipv4/tcp.c linux-2.6.2-rc1.ken/net/ipv4/tcp.c
--- linux-2.6.2-rc1/net/ipv4/tcp.c	2004-01-20 19:49:36.000000000 -0800
+++ linux-2.6.2-rc1.ken/net/ipv4/tcp.c	2004-02-17 15:45:25.000000000 -0800
@@ -2569,6 +2569,16 @@
 extern void __skb_cb_too_small_for_tcp(int, int);
 extern void tcpdiag_init(void);
 
+static __initdata unsigned long thash_entries;
+static int __init set_thash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	thash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("thash_entries=", set_thash_entries);
+
 void __init tcp_init(void)
 {
 	struct sk_buff *skb = NULL;
@@ -2610,6 +2620,8 @@
 	else
 		goal = num_physpages >> (23 - PAGE_SHIFT);
 
+	if (thash_entries)
+		goal = (thash_entries * sizeof(struct tcp_ehash_bucket)) >> PAGE_SHIFT;
 	for (order = 0; (1UL << order) < goal; order++)
 		;
 	do {

* RE: Limit hash table size
@ 2004-02-17 22:24 Chen, Kenneth W
  2004-02-17 23:24 ` Andrew Morton
  0 siblings, 1 reply; 38+ messages in thread
From: Chen, Kenneth W @ 2004-02-17 22:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-ia64

[-- Attachment #1: Type: text/plain, Size: 959 bytes --]

OK, here is another revision on top of what has been discussed.  It adds
4 boot time parameters so users can override the default sizes as needed
to suit special needs.  I will send a separate patch for
kernel-parameters.txt if everyone is OK with this one.

- Ken

-----Original Message-----
From: Andrew Morton [mailto:akpm@osdl.org] 
Sent: Thursday, February 05, 2004 3:58 PM
To: Chen, Kenneth W
Cc: linux-kernel@vger.kernel.org; linux-ia64@vger.kernel.org
Subject: Re: Limit hash table size

Ken, I remain unhappy with this patch.  If a big box has 500 million
dentries or inodes in cache (which is possible), those hash chains will
be more than 200 entries long on average.  It will be very slow.

We need to do something smarter.  At least, for machines which do not
have the ia64 proliferation-of-zones problem.

Maybe we should leave the sizing of these tables as-is, and add some
hook which allows the architecture to scale them back.

[-- Attachment #2: hash4.patch --]
[-- Type: application/octet-stream, Size: 4596 bytes --]

diff -Nur linux-2.6.3-rc4/fs/dcache.c linux-2.6.3-rc4.hash/fs/dcache.c
--- linux-2.6.3-rc4/fs/dcache.c	2004-02-16 18:21:54.000000000 -0800
+++ linux-2.6.3-rc4.hash/fs/dcache.c	2004-02-17 14:03:21.000000000 -0800
@@ -49,6 +49,7 @@
  */
 #define D_HASHBITS     d_hash_shift
 #define D_HASHMASK     d_hash_mask
+#define D_HASHMAX	(2*1024*1024UL)	/* max number of entries */
 
 static unsigned int d_hash_mask;
 static unsigned int d_hash_shift;
@@ -1531,6 +1532,16 @@
 	return ino;
 }
 
+static __initdata unsigned long dhash_entries;
+static int __init set_dhash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	dhash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("dhash_entries=", set_dhash_entries);
+
 static void __init dcache_init(unsigned long mempages)
 {
 	struct hlist_head *d;
@@ -1556,11 +1567,14 @@
 	
 	set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);
 
-#if PAGE_SHIFT < 13
-	mempages >>= (13 - PAGE_SHIFT);
-#endif
-	mempages *= sizeof(struct hlist_head);
-	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
+	if (!dhash_entries) {
+		dhash_entries = PAGE_SHIFT < 13 ?
+				mempages >> (13 - PAGE_SHIFT) :
+				mempages << (PAGE_SHIFT - 13);
+		dhash_entries = min(D_HASHMAX, dhash_entries);
+	}
+	dhash_entries *= sizeof(struct hlist_head);
+	for (order = 0; ((1UL << order) << PAGE_SHIFT) < dhash_entries; order++)
 		;
 
 	do {
diff -Nur linux-2.6.3-rc4/fs/inode.c linux-2.6.3-rc4.hash/fs/inode.c
--- linux-2.6.3-rc4/fs/inode.c	2004-02-16 18:23:36.000000000 -0800
+++ linux-2.6.3-rc4.hash/fs/inode.c	2004-02-17 14:03:21.000000000 -0800
@@ -53,6 +53,7 @@
  */
 #define I_HASHBITS	i_hash_shift
 #define I_HASHMASK	i_hash_mask
+#define I_HASHMAX	(2*1024*1024UL)	/* max number of entries */
 
 static unsigned int i_hash_mask;
 static unsigned int i_hash_shift;
@@ -1327,6 +1328,16 @@
 		wake_up_all(wq);
 }
 
+static __initdata unsigned long ihash_entries;
+static int __init set_ihash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	ihash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("ihash_entries=", set_ihash_entries);
+
 /*
  * Initialize the waitqueues and inode hash table.
  */
@@ -1340,9 +1351,14 @@
 	for (i = 0; i < ARRAY_SIZE(i_wait_queue_heads); i++)
 		init_waitqueue_head(&i_wait_queue_heads[i].wqh);
 
-	mempages >>= (14 - PAGE_SHIFT);
-	mempages *= sizeof(struct hlist_head);
-	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
+	if (!ihash_entries) {
+		ihash_entries = PAGE_SHIFT < 14 ?
+				mempages >> (14 - PAGE_SHIFT) :
+				mempages << (PAGE_SHIFT - 14);
+		ihash_entries = min(I_HASHMAX, ihash_entries);
+	}
+	ihash_entries *= sizeof(struct hlist_head);
+	for (order = 0; ((1UL << order) << PAGE_SHIFT) < ihash_entries; order++)
 		;
 
 	do {
diff -Nur linux-2.6.3-rc4/net/ipv4/route.c linux-2.6.3-rc4.hash/net/ipv4/route.c
--- linux-2.6.3-rc4/net/ipv4/route.c	2004-02-16 18:23:37.000000000 -0800
+++ linux-2.6.3-rc4.hash/net/ipv4/route.c	2004-02-17 14:03:21.000000000 -0800
@@ -2717,6 +2717,16 @@
 #endif /* CONFIG_PROC_FS */
 #endif /* CONFIG_NET_CLS_ROUTE */
 
+static __initdata unsigned long rhash_entries;
+static int __init set_rhash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	rhash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("rhash_entries=", set_rhash_entries);
+
 int __init ip_rt_init(void)
 {
 	int i, order, goal, rc = 0;
@@ -2743,7 +2753,10 @@
 		panic("IP: failed to allocate ip_dst_cache\n");
 
 	goal = num_physpages >> (26 - PAGE_SHIFT);
-
+	if (!rhash_entries)
+		goal = min(10, goal);
+	else
+		goal = (rhash_entries * sizeof(struct rt_hash_bucket)) >> PAGE_SHIFT;
 	for (order = 0; (1UL << order) < goal; order++)
 		/* NOTHING */;
 
diff -Nur linux-2.6.3-rc4/net/ipv4/tcp.c linux-2.6.3-rc4.hash/net/ipv4/tcp.c
--- linux-2.6.3-rc4/net/ipv4/tcp.c	2004-02-16 18:22:05.000000000 -0800
+++ linux-2.6.3-rc4.hash/net/ipv4/tcp.c	2004-02-17 14:03:21.000000000 -0800
@@ -2570,6 +2570,16 @@
 extern void __skb_cb_too_small_for_tcp(int, int);
 extern void tcpdiag_init(void);
 
+static __initdata unsigned long thash_entries;
+static int __init set_thash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	thash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("thash_entries=", set_thash_entries);
+
 void __init tcp_init(void)
 {
 	struct sk_buff *skb = NULL;
@@ -2611,6 +2621,10 @@
 	else
 		goal = num_physpages >> (23 - PAGE_SHIFT);
 
+	if (!thash_entries)
+		goal = min(10, goal);
+	else 
+		goal = (thash_entries * sizeof(struct tcp_ehash_bucket)) >> PAGE_SHIFT;
 	for (order = 0; (1UL << order) < goal; order++)
 		;
 	do {

* Re: Limit hash table size
@ 2004-02-06  6:32 Manfred Spraul
  0 siblings, 0 replies; 38+ messages in thread
From: Manfred Spraul @ 2004-02-06  6:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Chen, Kenneth W

Andrew wrote:

>Maybe we should leave the sizing of these tables as-is, and add some hook
>which allows the architecture to scale them back.
Architecture or administrator?
I think a boot parameter is the better solution: the admin knows whether
his system is a compute node or a file server.

--
    Manfred



[parent not found: <B05667366EE6204181EABE9C1B1C0EB5802441@scsmsx401.sc.intel.com.suse.lists.linux.kernel>]
* RE: Limit hash table size
@ 2004-01-14 22:31 Chen, Kenneth W
  2004-01-18 14:25 ` Anton Blanchard
  0 siblings, 1 reply; 38+ messages in thread
From: Chen, Kenneth W @ 2004-01-14 22:31 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: Linux Kernel Mailing List, linux-ia64, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 1587 bytes --]

Anton Blanchard wrote:
> Well x86 isn't very interesting here; it's all the 64-bit archs
> that will end up with TBs of memory in the future.

To address Anton's concerns on PPC64, we have revised the patch to
enforce a maximum size based on the number of entries instead of the page
order, so differences in page size, pointer size, etc. don't affect the
final calculation.  The upper bound is capped at 2M entries.  All numbers
on x86 remain the same, as we don't want to disturb already-established
and working numbers.  See the patch at the end of the email.  It is
diff'ed relative to the 2.6.1-mm3 tree.

> But look at the horrid worst case there. My point is limiting
> the hash without any data is not a good idea. In 2.4 we raised
> MAX_ORDER on ppc64 because we spent so much time walking
> pagecache chains,

I just have to reiterate that when the hash table is made too large, we
start trading cache misses on the head-array accesses for misses on the
hash-list traversal.  Big hashes can hurt you if you don't actually use
the capacity.

> Why can't we do something like Andrew's recent min_free_kbytes
> patch and make the rate of change non-linear?  Just slow the
> increase down as we get bigger.  I agree a 2GB hashtable is
> pretty ludicrous, but a 4MB one on a 512GB machine (which
> we sell at the moment) could be too :)

It doesn't need to be over-designed.  Generally there is no
one-size-fits-all solution either.  Linear scaling has worked fine for
many years and is only now starting to tip over on large machines.  We
just need to put an upper bound on it before it runs away.

- Ken

[-- Attachment #2: hash2.patch --]
[-- Type: application/octet-stream, Size: 1769 bytes --]

diff -Nur linux-2.6.1-mm3.orig/fs/dcache.c linux-2.6.1-mm3/fs/dcache.c
--- linux-2.6.1-mm3.orig/fs/dcache.c	2004-01-14 13:48:09.000000000 -0800
+++ linux-2.6.1-mm3/fs/dcache.c	2004-01-14 14:02:02.000000000 -0800
@@ -49,6 +49,7 @@
  */
 #define D_HASHBITS     d_hash_shift
 #define D_HASHMASK     d_hash_mask
+#define D_HASHMAX	(2*1024*1024UL)	/* max number of entries */
 
 static unsigned int d_hash_mask;
 static unsigned int d_hash_shift;
@@ -1552,9 +1553,9 @@
 	
 	set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);
 
-	mempages >>= 1;
-	mempages *= sizeof(struct hlist_head);
-	for (order = 0; (order < 10) && (((1UL << order) << PAGE_SHIFT) < mempages); order++)
+	mempages = (mempages << PAGE_SHIFT) >> 13;
+	mempages = min(D_HASHMAX, mempages) * sizeof(struct hlist_head);
+	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
 		;
 
 	do {
diff -Nur linux-2.6.1-mm3.orig/fs/inode.c linux-2.6.1-mm3/fs/inode.c
--- linux-2.6.1-mm3.orig/fs/inode.c	2004-01-14 13:48:09.000000000 -0800
+++ linux-2.6.1-mm3/fs/inode.c	2004-01-14 14:01:34.000000000 -0800
@@ -53,6 +53,7 @@
  */
 #define I_HASHBITS	i_hash_shift
 #define I_HASHMASK	i_hash_mask
+#define I_HASHMAX	(2*1024*1024UL)	/* max number of entries */
 
 static unsigned int i_hash_mask;
 static unsigned int i_hash_shift;
@@ -1328,9 +1329,9 @@
 	for (i = 0; i < ARRAY_SIZE(i_wait_queue_heads); i++)
 		init_waitqueue_head(&i_wait_queue_heads[i].wqh);
 
-	mempages >>= 2;
-	mempages *= sizeof(struct hlist_head);
-	for (order = 0; (order < 10) && (((1UL << order) << PAGE_SHIFT) < mempages); order++)
+	mempages = (mempages << PAGE_SHIFT) >> 14;
+	mempages = min(I_HASHMAX, mempages) * sizeof(struct hlist_head);
+	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
 		;
 
 	do {

* RE: Limit hash table size
@ 2004-01-14 22:29 Chen, Kenneth W
  0 siblings, 0 replies; 38+ messages in thread
From: Chen, Kenneth W @ 2004-01-14 22:29 UTC (permalink / raw)
  To: Linux Kernel Mailing List, linux-ia64

Manfred Spraul wrote:
> What about making the limit configurable with a boot time
> parameter? If someone uses a 512 GB ppc64 as an nfs server,
> he might want a 2 GB inode hash.

I'm sorry, this code won't have any effect beyond the MAX_ORDER defined
for each architecture.  It's not possible to get a 2GB hash table on
PPC64, since MAX_ORDER is defined as 13 there so far, which means a 16MB
absolute upper limit enforced by the page allocator.

- Ken

* Re: Limit hash table size
@ 2004-01-12 16:50 Manfred Spraul
  0 siblings, 0 replies; 38+ messages in thread
From: Manfred Spraul @ 2004-01-12 16:50 UTC (permalink / raw)
  To: Anton Blanchard, Andrew Morton, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 460 bytes --]

>Why can't we do something like Andrew's recent min_free_kbytes patch and
>make the rate of change non-linear?  Just slow the increase down as we
>get bigger.  I agree a 2GB hashtable is pretty ludicrous, but a 4MB one
>on a 512GB machine (which we sell at the moment) could be too :)
What about making the limit configurable with a boot time parameter? If 
someone uses a 512 GB ppc64 as an nfs server, he might want a 2 GB inode 
hash.

--
    Manfred

[-- Attachment #2: patch-hash-alloc --]
[-- Type: text/plain, Size: 2266 bytes --]

// $Header$
// Kernel Version:
//  VERSION = 2
//  PATCHLEVEL = 6
//  SUBLEVEL = 0
//  EXTRAVERSION = -test11
--- 2.6/fs/inode.c	2003-11-29 09:46:34.000000000 +0100
+++ build-2.6/fs/inode.c	2003-11-29 10:19:21.000000000 +0100
@@ -1327,6 +1327,20 @@
 		wake_up_all(wq);
 }
 
+static __initdata int ihash_entries;
+
+static int __init set_ihash_entries(char *str)
+{
+	get_option(&str, &ihash_entries);
+	if (ihash_entries <= 0) {
+		ihash_entries = 0;
+		return 0;
+	}
+	return 1;
+}
+
+__setup("ihash_entries=", set_ihash_entries);
+
 /*
  * Initialize the waitqueues and inode hash table.
  */
@@ -1340,8 +1354,16 @@
 	for (i = 0; i < ARRAY_SIZE(i_wait_queue_heads); i++)
 		init_waitqueue_head(&i_wait_queue_heads[i].wqh);
 
-	mempages >>= (14 - PAGE_SHIFT);
-	mempages *= sizeof(struct hlist_head);
+	if (!ihash_entries) {
+		ihash_entries = mempages >> (14 - PAGE_SHIFT);
+		/* Limit inode hash size. Override for nfs servers
+		 * that handle lots of files.
+		 */
+		if (ihash_entries > 1024*1024)
+			ihash_entries = 1024*1024;
+	}
+
+	mempages = ihash_entries*sizeof(struct hlist_head);
 	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
 		;
 
--- 2.6/fs/dcache.c	2003-11-29 09:46:34.000000000 +0100
+++ build-2.6/fs/dcache.c	2003-11-29 10:53:15.000000000 +0100
@@ -1546,6 +1546,20 @@
 	return ino;
 }
 
+static __initdata int dhash_entries;
+
+static int __init set_dhash_entries(char *str)
+{
+	get_option(&str, &dhash_entries);
+	if (dhash_entries <= 0) {
+		dhash_entries = 0;
+		return 0;
+	}
+	return 1;
+}
+
+__setup("dhash_entries=", set_dhash_entries);
+
 static void __init dcache_init(unsigned long mempages)
 {
 	struct hlist_head *d;
@@ -1571,10 +1585,18 @@
 	
 	set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);
 
+	if (!dhash_entries) {
 #if PAGE_SHIFT < 13
-	mempages >>= (13 - PAGE_SHIFT);
+		mempages >>= (13 - PAGE_SHIFT);
 #endif
-	mempages *= sizeof(struct hlist_head);
+		dhash_entries = mempages;
+		/* 8 million entries is enough for general-purpose systems.
+		 * For file servers, override with "dhash_entries="
+		 */
+		if (dhash_entries > 8*1024*1024)
+			dhash_entries = 8*1024*1024;
+	}
+	mempages = dhash_entries*sizeof(struct hlist_head);
 	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
 		;
 

* RE: Limit hash table size
@ 2004-01-09 19:05 Chen, Kenneth W
  2004-01-12 13:32 ` Anton Blanchard
  0 siblings, 1 reply; 38+ messages in thread
From: Chen, Kenneth W @ 2004-01-09 19:05 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: Linux Kernel Mailing List, linux-ia64, Andrew Morton

Anton Blanchard wrote:

>Have you done any analysis of hash depths of large memory machines?  We
>had some extremely deep pagecache hash chains in 2.4 on a 64GB machine.
>While the radix tree should fix that, who's to say we can't get into a
>similar situation with the dcache?

We don't have any data to justify any size change for x86; that was the
main reason we limited the size by page order.


>Check out how deep some of the inode hash chains are here:
>http://www.ussg.iu.edu/hypermail/linux/kernel/0312.0/0105.html

If I read them correctly, most of the distribution is in the first 2
buckets, so it doesn't matter whether you have 100 buckets or 1 million
buckets; only the first 2 are being hammered hard.  So are we wasting
memory on the buckets that are not being used?

- Ken

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Limit hash table size
@ 2004-01-08 23:12 Chen, Kenneth W
  2004-01-09  9:25 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Chen, Kenneth W @ 2004-01-08 23:12 UTC (permalink / raw)
  To: Linux Kernel Mailing List, linux-ia64; +Cc: Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 1134 bytes --]

The issue of exceedingly large hash tables was discussed on the mailing
list a while back, but seems to have slipped through the cracks.

What we found is that it's not a problem for x86 (and most other
architectures) because __get_free_pages won't be able to allocate
anything beyond order MAX_ORDER-1 (10), which means those hash tables are
at most 4MB each (assuming a 4K page size).  However, on ia64, in order
to support a larger hugeTLB page size, MAX_ORDER is bumped up to 18,
which now means a 2GB upper limit enforced by the page allocator
(assuming a 16K page size).  PPC64 is another example that bumps up
MAX_ORDER.

Last time I checked, the tcp ehash table was taking a whopping (insane!)
2GB on one of our large machines.  The dentry and inode hash tables also
take a considerable amount of memory.

This patch just forces all of these hash tables to a max order of 10,
which limits them to 16MB each on ia64.  People can clean up other parts
of the table-size calculation later.  But minimally, this patch doesn't
change any hash sizes already in use on x86.

Andrew, would you please apply?

- Ken Chen

[-- Attachment #2: hashtable.patch --]
[-- Type: application/octet-stream, Size: 2196 bytes --]

diff -Nurp linux-2.6.0.orig/fs/dcache.c linux-2.6.0/fs/dcache.c
--- linux-2.6.0.orig/fs/dcache.c	2003-12-17 18:58:15.000000000 -0800
+++ linux-2.6.0/fs/dcache.c	2004-01-08 14:59:58.000000000 -0800
@@ -1571,11 +1571,9 @@ static void __init dcache_init(unsigned 
 	
 	set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);
 
-#if PAGE_SHIFT < 13
-	mempages >>= (13 - PAGE_SHIFT);
-#endif
+	mempages >>= 1;
 	mempages *= sizeof(struct hlist_head);
-	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
+	for (order = 0; (order < 10) && (((1UL << order) << PAGE_SHIFT) < mempages); order++)
 		;
 
 	do {
diff -Nurp linux-2.6.0.orig/fs/inode.c linux-2.6.0/fs/inode.c
--- linux-2.6.0.orig/fs/inode.c	2003-12-17 18:59:55.000000000 -0800
+++ linux-2.6.0/fs/inode.c	2004-01-08 15:00:19.000000000 -0800
@@ -1340,9 +1340,9 @@ void __init inode_init(unsigned long mem
 	for (i = 0; i < ARRAY_SIZE(i_wait_queue_heads); i++)
 		init_waitqueue_head(&i_wait_queue_heads[i].wqh);
 
-	mempages >>= (14 - PAGE_SHIFT);
+	mempages >>= 2;
 	mempages *= sizeof(struct hlist_head);
-	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
+	for (order = 0; (order < 10) && (((1UL << order) << PAGE_SHIFT) < mempages); order++)
 		;
 
 	do {
diff -Nurp linux-2.6.0.orig/net/ipv4/route.c linux-2.6.0/net/ipv4/route.c
--- linux-2.6.0.orig/net/ipv4/route.c	2003-12-17 18:59:55.000000000 -0800
+++ linux-2.6.0/net/ipv4/route.c	2004-01-08 15:01:17.000000000 -0800
@@ -2747,7 +2747,7 @@ int __init ip_rt_init(void)
 
 	goal = num_physpages >> (26 - PAGE_SHIFT);
 
-	for (order = 0; (1UL << order) < goal; order++)
+	for (order = 0; (order < 10) && ((1UL << order) < goal); order++)
 		/* NOTHING */;
 
 	do {
diff -Nurp linux-2.6.0.orig/net/ipv4/tcp.c linux-2.6.0/net/ipv4/tcp.c
--- linux-2.6.0.orig/net/ipv4/tcp.c	2003-12-17 18:58:38.000000000 -0800
+++ linux-2.6.0/net/ipv4/tcp.c	2004-01-08 15:00:42.000000000 -0800
@@ -2610,7 +2610,7 @@ void __init tcp_init(void)
 	else
 		goal = num_physpages >> (23 - PAGE_SHIFT);
 
-	for (order = 0; (1UL << order) < goal; order++)
+	for (order = 0; (order < 10) && ((1UL << order) < goal); order++)
 		;
 	do {
 		tcp_ehash_size = (1UL << order) * PAGE_SIZE /


end of thread, other threads:[~2004-02-19  7:45 UTC | newest]

Thread overview: 38+ messages
2004-02-06  0:10 Limit hash table size Chen, Kenneth W
2004-02-06  0:23 ` Andrew Morton
2004-02-09 23:12   ` Jes Sorensen
  -- strict thread matches above, loose matches on Subject: below --
2004-02-18  0:45 Chen, Kenneth W
2004-02-18  0:16 Chen, Kenneth W
2004-02-17 22:24 Chen, Kenneth W
2004-02-17 23:24 ` Andrew Morton
2004-02-06  6:32 Manfred Spraul
     [not found] <B05667366EE6204181EABE9C1B1C0EB5802441@scsmsx401.sc.intel.com.suse.lists.linux.kernel>
     [not found] ` <20040205155813.726041bd.akpm@osdl.org.suse.lists.linux.kernel>
2004-02-06  1:54   ` Andi Kleen
2004-02-05  2:38     ` Steve Lord
2004-02-06  3:12       ` Andrew Morton
2004-02-05  4:06         ` Steve Lord
2004-02-06  4:39           ` Andi Kleen
2004-02-06  4:59             ` Andrew Morton
2004-02-06  5:34             ` Maneesh Soni
2004-02-06  3:19         ` Andi Kleen
2004-02-06  3:23         ` Nick Piggin
2004-02-06  3:34           ` Andrew Morton
2004-02-06  3:38             ` Nick Piggin
2004-02-18 12:41       ` Pavel Machek
2004-02-06  3:09     ` Andrew Morton
2004-02-06  3:18       ` Andi Kleen
2004-02-06  3:30         ` Andrew Morton
2004-02-06  4:45           ` Martin J. Bligh
2004-02-06  6:22       ` Matt Mackall
2004-02-06 20:20       ` Taneli Vähäkangas
2004-02-06 20:27         ` Andrew Morton
2004-02-06 21:46           ` Taneli Vähäkangas
2004-01-14 22:31 Chen, Kenneth W
2004-01-18 14:25 ` Anton Blanchard
2004-01-14 22:29 Chen, Kenneth W
2004-01-12 16:50 Manfred Spraul
2004-01-09 19:05 Chen, Kenneth W
2004-01-12 13:32 ` Anton Blanchard
2004-01-08 23:12 Chen, Kenneth W
2004-01-09  9:25 ` Andrew Morton
2004-01-09 14:25 ` Anton Blanchard
2004-02-05 23:58 ` Andrew Morton
