public inbox for linux-kernel@vger.kernel.org
* RE: Limit hash table size
@ 2004-02-06  0:10 Chen, Kenneth W
  2004-02-06  0:23 ` Andrew Morton
  0 siblings, 1 reply; 38+ messages in thread
From: Chen, Kenneth W @ 2004-02-06  0:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-ia64

Andrew,

Will you merge the changes in the network area first, while I'm working
on the solution suggested here for inode and dentry?  The 2GB TCP hash
is the biggest problem for us right now.

- Ken


-----Original Message-----
From: Andrew Morton [mailto:akpm@osdl.org] 
Sent: Thursday, February 05, 2004 3:58 PM
To: Chen, Kenneth W
Cc: linux-kernel@vger.kernel.org; linux-ia64@vger.kernel.org
Subject: Re: Limit hash table size

Ken, I remain unhappy with this patch.  If a big box has 500 million
dentries or inodes in cache (which is possible), those hash chains will
be more than 200 entries long on average.  It will be very slow.

We need to do something smarter.  At least, for machines which do not
have the ia64 proliferation-of-zones problem.

Maybe we should leave the sizing of these tables as-is, and add some
hook which allows the architecture to scale them back.

^ permalink raw reply	[flat|nested] 38+ messages in thread
* RE: Limit hash table size
@ 2004-02-18  0:45 Chen, Kenneth W
  0 siblings, 0 replies; 38+ messages in thread
From: Chen, Kenneth W @ 2004-02-18  0:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-ia64

[-- Attachment #1: Type: text/plain, Size: 791 bytes --]

Updates to Documentation/kernel-parameters.txt for the 4 new boot time
parameters.

- Ken


-----Original Message-----
From: Chen, Kenneth W 
Sent: Tuesday, February 17, 2004 4:17 PM
To: 'Andrew Morton'
Cc: linux-kernel@vger.kernel.org; linux-ia64@vger.kernel.org
Subject: RE: Limit hash table size

> I think it would be better to leave things as they are,
> with a boot option to scale the tables down.

OK, here is one that adds boot time parameters only and leaves everything
else untouched.

> The below patch addresses the inode and dentry caches.
> Need to think a bit more about the networking ones.

Don't know what happened to my mailer; the mailing list archive shows the
complete patch, including the network part.  Hope this time you receive
the whole thing.

[-- Attachment #2: knl_param.patch --]
[-- Type: application/octet-stream, Size: 1234 bytes --]

diff -Nur linux-2.6.3-rc4/Documentation/kernel-parameters.txt linux-2.6.3-rc4.hash/Documentation/kernel-parameters.txt
--- linux-2.6.3-rc4/Documentation/kernel-parameters.txt	2004-02-16 18:23:44.000000000 -0800
+++ linux-2.6.3-rc4.hash/Documentation/kernel-parameters.txt	2004-02-17 16:41:58.000000000 -0800
@@ -292,6 +292,9 @@
 
 	devfs=		[DEVFS]
 			See Documentation/filesystems/devfs/boot-options.
+
+	dhash_entries=	[KNL]
+			Set number of hash buckets for dentry cache.
  
 	digi=		[HW,SERIAL]
 			IO parameters + enable/disable command.
@@ -424,6 +427,9 @@
 	idle=		[HW]
 			Format: idle=poll or idle=halt
  
+	ihash_entries=	[KNL]
+			Set number of hash buckets for inode cache.
+
 	in2000=		[HW,SCSI]
 			See header of drivers/scsi/in2000.c.
 
@@ -873,6 +879,9 @@
 
 	resume=		[SWSUSP] Specify the partition device for software suspension
 
+	rhash_entries=	[KNL,NET]
+			Set number of hash buckets for route cache
+ 
 	riscom8=	[HW,SERIAL]
 			Format: <io_board1>[,<io_board2>[,...<io_boardN>]]
 
@@ -1135,6 +1144,9 @@
 	tgfx_2=		See Documentation/input/joystick-parport.txt.
 	tgfx_3=
 
+	thash_entries=	[KNL,NET]
+			Set number of hash buckets for TCP connection
+
 	tipar=		[HW]
 			See header of drivers/char/tipar.c.
 

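With all four parameters documented, the tables can be sized from the boot
loader in one go.  A hypothetical example (the entry counts here are purely
illustrative, not recommendations), e.g. appended to a GRUB kernel line:

```
kernel /vmlinuz-2.6.3 root=/dev/sda1 ro \
	dhash_entries=262144 ihash_entries=131072 \
	rhash_entries=65536 thash_entries=131072
```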
* RE: Limit hash table size
@ 2004-02-18  0:16 Chen, Kenneth W
  0 siblings, 0 replies; 38+ messages in thread
From: Chen, Kenneth W @ 2004-02-18  0:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-ia64

[-- Attachment #1: Type: text/plain, Size: 478 bytes --]

> I think it would be better to leave things as they are,
> with a boot option to scale the tables down.

OK, here is one that adds boot time parameters only and leaves everything
else untouched.

> The below patch addresses the inode and dentry caches.
> Need to think a bit more about the networking ones.

Don't know what happened to my mailer; the mailing list archive shows the
complete patch, including the network part.  Hope this time you receive
the whole thing.

[-- Attachment #2: hash5.patch --]
[-- Type: application/octet-stream, Size: 4055 bytes --]

diff -Nur linux-2.6.2-rc1/fs/dcache.c linux-2.6.2-rc1.ken/fs/dcache.c
--- linux-2.6.2-rc1/fs/dcache.c	2004-01-20 19:49:25.000000000 -0800
+++ linux-2.6.2-rc1.ken/fs/dcache.c	2004-02-17 15:31:35.000000000 -0800
@@ -1527,6 +1527,16 @@
 	return ino;
 }
 
+static __initdata unsigned long dhash_entries;
+static int __init set_dhash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	dhash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("dhash_entries=", set_dhash_entries);
+
 static void __init dcache_init(unsigned long mempages)
 {
 	struct hlist_head *d;
@@ -1552,11 +1562,13 @@
 	
 	set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);
 
-#if PAGE_SHIFT < 13
-	mempages >>= (13 - PAGE_SHIFT);
-#endif
-	mempages *= sizeof(struct hlist_head);
-	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
+	if (!dhash_entries) {
+		dhash_entries = PAGE_SHIFT < 13 ?
+				mempages >> (13 - PAGE_SHIFT) :
+				mempages << (PAGE_SHIFT - 13);
+	}
+	dhash_entries *= sizeof(struct hlist_head);
+	for (order = 0; ((1UL << order) << PAGE_SHIFT) < dhash_entries; order++)
 		;
 
 	do {
diff -Nur linux-2.6.2-rc1/fs/inode.c linux-2.6.2-rc1.ken/fs/inode.c
--- linux-2.6.2-rc1/fs/inode.c	2004-01-20 19:50:41.000000000 -0800
+++ linux-2.6.2-rc1.ken/fs/inode.c	2004-02-17 15:30:51.000000000 -0800
@@ -1327,6 +1327,16 @@
 		wake_up_all(wq);
 }
 
+static __initdata unsigned long ihash_entries;
+static int __init set_ihash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	ihash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("ihash_entries=", set_ihash_entries);
+
 /*
  * Initialize the waitqueues and inode hash table.
  */
@@ -1340,9 +1350,13 @@
 	for (i = 0; i < ARRAY_SIZE(i_wait_queue_heads); i++)
 		init_waitqueue_head(&i_wait_queue_heads[i].wqh);
 
-	mempages >>= (14 - PAGE_SHIFT);
-	mempages *= sizeof(struct hlist_head);
-	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
+	if (!ihash_entries) {
+		ihash_entries = PAGE_SHIFT < 14 ?
+				mempages >> (14 - PAGE_SHIFT) :
+				mempages << (PAGE_SHIFT - 14);
+	}
+	ihash_entries *= sizeof(struct hlist_head);
+	for (order = 0; ((1UL << order) << PAGE_SHIFT) < ihash_entries; order++)
 		;
 
 	do {
diff -Nur linux-2.6.2-rc1/net/ipv4/route.c linux-2.6.2-rc1.ken/net/ipv4/route.c
--- linux-2.6.2-rc1/net/ipv4/route.c	2004-01-20 19:50:41.000000000 -0800
+++ linux-2.6.2-rc1.ken/net/ipv4/route.c	2004-02-17 15:43:59.000000000 -0800
@@ -2717,6 +2717,16 @@
 #endif /* CONFIG_PROC_FS */
 #endif /* CONFIG_NET_CLS_ROUTE */
 
+static __initdata unsigned long rhash_entries;
+static int __init set_rhash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	rhash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("rhash_entries=", set_rhash_entries);
+
 int __init ip_rt_init(void)
 {
 	int i, order, goal, rc = 0;
@@ -2742,8 +2752,10 @@
 	if (!ipv4_dst_ops.kmem_cachep)
 		panic("IP: failed to allocate ip_dst_cache\n");
 
-	goal = num_physpages >> (26 - PAGE_SHIFT);
-
+	if (!rhash_entries)
+		goal = num_physpages >> (26 - PAGE_SHIFT);
+	else
+		goal = (rhash_entries * sizeof(struct rt_hash_bucket)) >> PAGE_SHIFT;
 	for (order = 0; (1UL << order) < goal; order++)
 		/* NOTHING */;
 
diff -Nur linux-2.6.2-rc1/net/ipv4/tcp.c linux-2.6.2-rc1.ken/net/ipv4/tcp.c
--- linux-2.6.2-rc1/net/ipv4/tcp.c	2004-01-20 19:49:36.000000000 -0800
+++ linux-2.6.2-rc1.ken/net/ipv4/tcp.c	2004-02-17 15:45:25.000000000 -0800
@@ -2569,6 +2569,16 @@
 extern void __skb_cb_too_small_for_tcp(int, int);
 extern void tcpdiag_init(void);
 
+static __initdata unsigned long thash_entries;
+static int __init set_thash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	thash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("thash_entries=", set_thash_entries);
+
 void __init tcp_init(void)
 {
 	struct sk_buff *skb = NULL;
@@ -2610,6 +2620,8 @@
 	else
 		goal = num_physpages >> (23 - PAGE_SHIFT);
 
+	if (thash_entries)
+		goal = (thash_entries * sizeof(struct tcp_ehash_bucket)) >> PAGE_SHIFT;
 	for (order = 0; (1UL << order) < goal; order++)
 		;
 	do {

* RE: Limit hash table size
@ 2004-02-17 22:24 Chen, Kenneth W
  2004-02-17 23:24 ` Andrew Morton
  0 siblings, 1 reply; 38+ messages in thread
From: Chen, Kenneth W @ 2004-02-17 22:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-ia64

[-- Attachment #1: Type: text/plain, Size: 959 bytes --]

OK, here is another revision on top of what has been discussed.  It adds
4 boot time parameters so users can override the default sizes as needed
to suit special needs.  I will send a separate patch for
kernel-parameters.txt if everyone is OK with this one.

- Ken

-----Original Message-----
From: Andrew Morton [mailto:akpm@osdl.org] 
Sent: Thursday, February 05, 2004 3:58 PM
To: Chen, Kenneth W
Cc: linux-kernel@vger.kernel.org; linux-ia64@vger.kernel.org
Subject: Re: Limit hash table size

Ken, I remain unhappy with this patch.  If a big box has 500 million
dentries or inodes in cache (which is possible), those hash chains will
be more than 200 entries long on average.  It will be very slow.

We need to do something smarter.  At least, for machines which do not
have the ia64 proliferation-of-zones problem.

Maybe we should leave the sizing of these tables as-is, and add some
hook which allows the architecture to scale them back.

[-- Attachment #2: hash4.patch --]
[-- Type: application/octet-stream, Size: 4596 bytes --]

diff -Nur linux-2.6.3-rc4/fs/dcache.c linux-2.6.3-rc4.hash/fs/dcache.c
--- linux-2.6.3-rc4/fs/dcache.c	2004-02-16 18:21:54.000000000 -0800
+++ linux-2.6.3-rc4.hash/fs/dcache.c	2004-02-17 14:03:21.000000000 -0800
@@ -49,6 +49,7 @@
  */
 #define D_HASHBITS     d_hash_shift
 #define D_HASHMASK     d_hash_mask
+#define D_HASHMAX	(2*1024*1024UL)	/* max number of entries */
 
 static unsigned int d_hash_mask;
 static unsigned int d_hash_shift;
@@ -1531,6 +1532,16 @@
 	return ino;
 }
 
+static __initdata unsigned long dhash_entries;
+static int __init set_dhash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	dhash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("dhash_entries=", set_dhash_entries);
+
 static void __init dcache_init(unsigned long mempages)
 {
 	struct hlist_head *d;
@@ -1556,11 +1567,14 @@
 	
 	set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);
 
-#if PAGE_SHIFT < 13
-	mempages >>= (13 - PAGE_SHIFT);
-#endif
-	mempages *= sizeof(struct hlist_head);
-	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
+	if (!dhash_entries) {
+		dhash_entries = PAGE_SHIFT < 13 ?
+				mempages >> (13 - PAGE_SHIFT) :
+				mempages << (PAGE_SHIFT - 13);
+		dhash_entries = min(D_HASHMAX, dhash_entries);
+	}
+	dhash_entries *= sizeof(struct hlist_head);
+	for (order = 0; ((1UL << order) << PAGE_SHIFT) < dhash_entries; order++)
 		;
 
 	do {
diff -Nur linux-2.6.3-rc4/fs/inode.c linux-2.6.3-rc4.hash/fs/inode.c
--- linux-2.6.3-rc4/fs/inode.c	2004-02-16 18:23:36.000000000 -0800
+++ linux-2.6.3-rc4.hash/fs/inode.c	2004-02-17 14:03:21.000000000 -0800
@@ -53,6 +53,7 @@
  */
 #define I_HASHBITS	i_hash_shift
 #define I_HASHMASK	i_hash_mask
+#define I_HASHMAX	(2*1024*1024UL)	/* max number of entries */
 
 static unsigned int i_hash_mask;
 static unsigned int i_hash_shift;
@@ -1327,6 +1328,16 @@
 		wake_up_all(wq);
 }
 
+static __initdata unsigned long ihash_entries;
+static int __init set_ihash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	ihash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("ihash_entries=", set_ihash_entries);
+
 /*
  * Initialize the waitqueues and inode hash table.
  */
@@ -1340,9 +1351,14 @@
 	for (i = 0; i < ARRAY_SIZE(i_wait_queue_heads); i++)
 		init_waitqueue_head(&i_wait_queue_heads[i].wqh);
 
-	mempages >>= (14 - PAGE_SHIFT);
-	mempages *= sizeof(struct hlist_head);
-	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
+	if (!ihash_entries) {
+		ihash_entries = PAGE_SHIFT < 14 ?
+				mempages >> (14 - PAGE_SHIFT) :
+				mempages << (PAGE_SHIFT - 14);
+		ihash_entries = min(I_HASHMAX, ihash_entries);
+	}
+	ihash_entries *= sizeof(struct hlist_head);
+	for (order = 0; ((1UL << order) << PAGE_SHIFT) < ihash_entries; order++)
 		;
 
 	do {
diff -Nur linux-2.6.3-rc4/net/ipv4/route.c linux-2.6.3-rc4.hash/net/ipv4/route.c
--- linux-2.6.3-rc4/net/ipv4/route.c	2004-02-16 18:23:37.000000000 -0800
+++ linux-2.6.3-rc4.hash/net/ipv4/route.c	2004-02-17 14:03:21.000000000 -0800
@@ -2717,6 +2717,16 @@
 #endif /* CONFIG_PROC_FS */
 #endif /* CONFIG_NET_CLS_ROUTE */
 
+static __initdata unsigned long rhash_entries;
+static int __init set_rhash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	rhash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("rhash_entries=", set_rhash_entries);
+
 int __init ip_rt_init(void)
 {
 	int i, order, goal, rc = 0;
@@ -2743,7 +2753,10 @@
 		panic("IP: failed to allocate ip_dst_cache\n");
 
 	goal = num_physpages >> (26 - PAGE_SHIFT);
-
+	if (!rhash_entries)
+		goal = min(10, goal);
+	else
+		goal = (rhash_entries * sizeof(struct rt_hash_bucket)) >> PAGE_SHIFT;
 	for (order = 0; (1UL << order) < goal; order++)
 		/* NOTHING */;
 
diff -Nur linux-2.6.3-rc4/net/ipv4/tcp.c linux-2.6.3-rc4.hash/net/ipv4/tcp.c
--- linux-2.6.3-rc4/net/ipv4/tcp.c	2004-02-16 18:22:05.000000000 -0800
+++ linux-2.6.3-rc4.hash/net/ipv4/tcp.c	2004-02-17 14:03:21.000000000 -0800
@@ -2570,6 +2570,16 @@
 extern void __skb_cb_too_small_for_tcp(int, int);
 extern void tcpdiag_init(void);
 
+static __initdata unsigned long thash_entries;
+static int __init set_thash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	thash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("thash_entries=", set_thash_entries);
+
 void __init tcp_init(void)
 {
 	struct sk_buff *skb = NULL;
@@ -2611,6 +2621,10 @@
 	else
 		goal = num_physpages >> (23 - PAGE_SHIFT);
 
+	if (!thash_entries)
+		goal = min(10, goal);
+	else 
+		goal = (thash_entries * sizeof(struct tcp_ehash_bucket)) >> PAGE_SHIFT;
 	for (order = 0; (1UL << order) < goal; order++)
 		;
 	do {

* Re: Limit hash table size
@ 2004-02-06  6:32 Manfred Spraul
  0 siblings, 0 replies; 38+ messages in thread
From: Manfred Spraul @ 2004-02-06  6:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Chen, Kenneth W

Andrew wrote:

>Maybe we should leave the sizing of these tables as-is, and add some hook
>which allows the architecture to scale them back.
Architecture or administrator?
I think a boot parameter is the better solution: the admin knows whether
his system is a compute node or a file server.

--
    Manfred



[parent not found: <B05667366EE6204181EABE9C1B1C0EB5802441@scsmsx401.sc.intel.com.suse.lists.linux.kernel>]
* RE: Limit hash table size
@ 2004-01-14 22:31 Chen, Kenneth W
  2004-01-18 14:25 ` Anton Blanchard
  0 siblings, 1 reply; 38+ messages in thread
From: Chen, Kenneth W @ 2004-01-14 22:31 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: Linux Kernel Mailing List, linux-ia64, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 1587 bytes --]

Anton Blanchard wrote:
> Well x86 isn't very interesting here; it's all the 64-bit archs
> that will end up with TBs of memory in the future.

To address Anton's concerns on PPC64, we have revised the patch to
enforce a maximum size based on the number of entries instead of the page
order, so differences in page size, pointer size, etc. don't affect the
final calculation.  The upper bound is capped at 2M entries.  All numbers
on x86 remain the same, as we don't want to disturb already-established
and working numbers.  See the patch at the end of the email.  It is
diff'ed relative to the 2.6.1-mm3 tree.

> But look at the horrid worst case there. My point is limiting
> the hash without any data is not a good idea. In 2.4 we raised
> MAX_ORDER on ppc64 because we spent so much time walking
> pagecache chains,

I just have to reiterate that when the hash table is made too large, we
start trading cache misses on the head-array accesses for misses on the
hash-list traversal.  Big hashes can hurt you if you don't actually use
the capacity.

> Why can't we do something like Andrew's recent min_free_kbytes
> patch and make the rate of change non-linear?  Just slow the
> increase down as we get bigger.  I agree a 2GB hashtable is
> pretty ludicrous, but a 4MB one on a 512GB machine (which
> we sell at the moment) could be too :)

It doesn't need to be over-designed.  Generally there is no
one-size-fits-all solution either.  Linear scaling has worked fine for
many years and is only now starting to tip over on large machines.  We
just need to put an upper bound on it before it runs away.

- Ken

[-- Attachment #2: hash2.patch --]
[-- Type: application/octet-stream, Size: 1769 bytes --]

diff -Nur linux-2.6.1-mm3.orig/fs/dcache.c linux-2.6.1-mm3/fs/dcache.c
--- linux-2.6.1-mm3.orig/fs/dcache.c	2004-01-14 13:48:09.000000000 -0800
+++ linux-2.6.1-mm3/fs/dcache.c	2004-01-14 14:02:02.000000000 -0800
@@ -49,6 +49,7 @@
  */
 #define D_HASHBITS     d_hash_shift
 #define D_HASHMASK     d_hash_mask
+#define D_HASHMAX	(2*1024*1024UL)	/* max number of entries */
 
 static unsigned int d_hash_mask;
 static unsigned int d_hash_shift;
@@ -1552,9 +1553,9 @@
 	
 	set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);
 
-	mempages >>= 1;
-	mempages *= sizeof(struct hlist_head);
-	for (order = 0; (order < 10) && (((1UL << order) << PAGE_SHIFT) < mempages); order++)
+	mempages = (mempages << PAGE_SHIFT) >> 13;
+	mempages = min(D_HASHMAX, mempages) * sizeof(struct hlist_head);
+	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
 		;
 
 	do {
diff -Nur linux-2.6.1-mm3.orig/fs/inode.c linux-2.6.1-mm3/fs/inode.c
--- linux-2.6.1-mm3.orig/fs/inode.c	2004-01-14 13:48:09.000000000 -0800
+++ linux-2.6.1-mm3/fs/inode.c	2004-01-14 14:01:34.000000000 -0800
@@ -53,6 +53,7 @@
  */
 #define I_HASHBITS	i_hash_shift
 #define I_HASHMASK	i_hash_mask
+#define I_HASHMAX	(2*1024*1024UL)	/* max number of entries */
 
 static unsigned int i_hash_mask;
 static unsigned int i_hash_shift;
@@ -1328,9 +1329,9 @@
 	for (i = 0; i < ARRAY_SIZE(i_wait_queue_heads); i++)
 		init_waitqueue_head(&i_wait_queue_heads[i].wqh);
 
-	mempages >>= 2;
-	mempages *= sizeof(struct hlist_head);
-	for (order = 0; (order < 10) && (((1UL << order) << PAGE_SHIFT) < mempages); order++)
+	mempages = (mempages << PAGE_SHIFT) >> 14;
+	mempages = min(I_HASHMAX, mempages) * sizeof(struct hlist_head);
+	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
 		;
 
 	do {

* RE: Limit hash table size
@ 2004-01-14 22:29 Chen, Kenneth W
  0 siblings, 0 replies; 38+ messages in thread
From: Chen, Kenneth W @ 2004-01-14 22:29 UTC (permalink / raw)
  To: Linux Kernel Mailing List, linux-ia64

Manfred Spraul wrote:
> What about making the limit configurable with a boot time
> parameter? If someone uses a 512 GB ppc64 as an nfs server,
> he might want a 2 GB inode hash.

I'm sorry, this code won't have any effect beyond the MAX_ORDER defined
for each architecture.  It's not possible to get a 2GB hash table on
PPC64, since MAX_ORDER is defined as 13 there so far, which means a 16MB
absolute upper limit enforced by the page allocator.

- Ken

* Re: Limit hash table size
@ 2004-01-12 16:50 Manfred Spraul
  0 siblings, 0 replies; 38+ messages in thread
From: Manfred Spraul @ 2004-01-12 16:50 UTC (permalink / raw)
  To: Anton Blanchard, Andrew Morton, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 460 bytes --]

>Why can't we do something like Andrew's recent min_free_kbytes patch and
>make the rate of change non-linear?  Just slow the increase down as we
>get bigger.  I agree a 2GB hashtable is pretty ludicrous, but a 4MB one
>on a 512GB machine (which we sell at the moment) could be too :)
What about making the limit configurable with a boot time parameter? If 
someone uses a 512 GB ppc64 as an nfs server, he might want a 2 GB inode 
hash.

--
    Manfred

[-- Attachment #2: patch-hash-alloc --]
[-- Type: text/plain, Size: 2266 bytes --]

// $Header$
// Kernel Version:
//  VERSION = 2
//  PATCHLEVEL = 6
//  SUBLEVEL = 0
//  EXTRAVERSION = -test11
--- 2.6/fs/inode.c	2003-11-29 09:46:34.000000000 +0100
+++ build-2.6/fs/inode.c	2003-11-29 10:19:21.000000000 +0100
@@ -1327,6 +1327,20 @@
 		wake_up_all(wq);
 }
 
+static __initdata int ihash_entries;
+
+static int __init set_ihash_entries(char *str)
+{
+	get_option(&str, &ihash_entries);
+	if (ihash_entries <= 0) {
+		ihash_entries = 0;
+		return 0;
+	}
+	return 1;
+}
+
+__setup("ihash_entries=", set_ihash_entries);
+
 /*
  * Initialize the waitqueues and inode hash table.
  */
@@ -1340,8 +1354,16 @@
 	for (i = 0; i < ARRAY_SIZE(i_wait_queue_heads); i++)
 		init_waitqueue_head(&i_wait_queue_heads[i].wqh);
 
-	mempages >>= (14 - PAGE_SHIFT);
-	mempages *= sizeof(struct hlist_head);
+	if (!ihash_entries) {
+		ihash_entries = mempages >> (14 - PAGE_SHIFT);
+		/* Limit inode hash size. Override for nfs servers
+		 * that handle lots of files.
+		 */
+		if (ihash_entries > 1024*1024)
+			ihash_entries = 1024*1024;
+	}
+
+	mempages = ihash_entries*sizeof(struct hlist_head);
 	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
 		;
 
--- 2.6/fs/dcache.c	2003-11-29 09:46:34.000000000 +0100
+++ build-2.6/fs/dcache.c	2003-11-29 10:53:15.000000000 +0100
@@ -1546,6 +1546,20 @@
 	return ino;
 }
 
+static __initdata int dhash_entries;
+
+static int __init set_dhash_entries(char *str)
+{
+	get_option(&str, &dhash_entries);
+	if (dhash_entries <= 0) {
+		dhash_entries = 0;
+		return 0;
+	}
+	return 1;
+}
+
+__setup("dhash_entries=", set_dhash_entries);
+
 static void __init dcache_init(unsigned long mempages)
 {
 	struct hlist_head *d;
@@ -1571,10 +1585,18 @@
 	
 	set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);
 
+	if (!dhash_entries) {
 #if PAGE_SHIFT < 13
-	mempages >>= (13 - PAGE_SHIFT);
+		mempages >>= (13 - PAGE_SHIFT);
 #endif
-	mempages *= sizeof(struct hlist_head);
+		dhash_entries = mempages;
+		/* 8 million entries is enough for general-purpose systems.
+		 * For file servers, override with "dhash_entries="
+		 */
+		if (dhash_entries > 8*1024*1024)
+			dhash_entries = 8*1024*1024;
+	}
+	mempages = dhash_entries*sizeof(struct hlist_head);
 	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
 		;
 

* RE: Limit hash table size
@ 2004-01-09 19:05 Chen, Kenneth W
  2004-01-12 13:32 ` Anton Blanchard
  0 siblings, 1 reply; 38+ messages in thread
From: Chen, Kenneth W @ 2004-01-09 19:05 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: Linux Kernel Mailing List, linux-ia64, Andrew Morton

Anton Blanchard wrote:

>Have you done any analysis of hash depths of large memory machines?  We
>had some extremely deep pagecache hash chains in 2.4 on a 64GB machine.
>While the radix tree should fix that, who's to say we can't get into a
>similar situation with the dcache?

We don't have any data to justify any size change for x86; that was the
main reason we limited the size by page order.


>Check out how deep some of the inode hash chains are here:
>http://www.ussg.iu.edu/hypermail/linux/kernel/0312.0/0105.html

If I read them correctly, most of the distribution is in the first 2
buckets, so it doesn't matter whether you have 100 buckets or 1 million
buckets; only the first 2 are being hammered hard.  So are we wasting
memory on the buckets that are not being used?

- Ken

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Limit hash table size
@ 2004-01-08 23:12 Chen, Kenneth W
  2004-01-09  9:25 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Chen, Kenneth W @ 2004-01-08 23:12 UTC (permalink / raw)
  To: Linux Kernel Mailing List, linux-ia64; +Cc: Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 1134 bytes --]

The issue of exceedingly large hash tables was discussed on the mailing
list a while back, but seems to have slipped through the cracks.

What we found is that it's not a problem for x86 (and most other
architectures) because __get_free_pages won't be able to allocate
anything beyond order MAX_ORDER-1 (10), which means those hash tables are
at most 4MB each (assuming a 4K page size).  However, on ia64, in order
to support a larger hugeTLB page size, MAX_ORDER is bumped up to 18,
which now means a 2GB upper limit enforced by the page allocator
(assuming a 16K page size).  PPC64 is another example that bumps up
MAX_ORDER.

Last time I checked, the tcp ehash table was taking a whopping (insane!)
2GB on one of our large machines.  The dentry and inode hash tables also
take a considerable amount of memory.

This patch just forces all of these hash tables to a max order of 10,
which limits them to 16MB each on ia64.  People can clean up other parts
of the table-size calculation later.  But minimally, this patch doesn't
change any hash sizes already in use on x86.

Andrew, would you please apply?

- Ken Chen

[-- Attachment #2: hashtable.patch --]
[-- Type: application/octet-stream, Size: 2196 bytes --]

diff -Nurp linux-2.6.0.orig/fs/dcache.c linux-2.6.0/fs/dcache.c
--- linux-2.6.0.orig/fs/dcache.c	2003-12-17 18:58:15.000000000 -0800
+++ linux-2.6.0/fs/dcache.c	2004-01-08 14:59:58.000000000 -0800
@@ -1571,11 +1571,9 @@ static void __init dcache_init(unsigned 
 	
 	set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);
 
-#if PAGE_SHIFT < 13
-	mempages >>= (13 - PAGE_SHIFT);
-#endif
+	mempages >>= 1;
 	mempages *= sizeof(struct hlist_head);
-	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
+	for (order = 0; (order < 10) && (((1UL << order) << PAGE_SHIFT) < mempages); order++)
 		;
 
 	do {
diff -Nurp linux-2.6.0.orig/fs/inode.c linux-2.6.0/fs/inode.c
--- linux-2.6.0.orig/fs/inode.c	2003-12-17 18:59:55.000000000 -0800
+++ linux-2.6.0/fs/inode.c	2004-01-08 15:00:19.000000000 -0800
@@ -1340,9 +1340,9 @@ void __init inode_init(unsigned long mem
 	for (i = 0; i < ARRAY_SIZE(i_wait_queue_heads); i++)
 		init_waitqueue_head(&i_wait_queue_heads[i].wqh);
 
-	mempages >>= (14 - PAGE_SHIFT);
+	mempages >>= 2;
 	mempages *= sizeof(struct hlist_head);
-	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
+	for (order = 0; (order < 10) && (((1UL << order) << PAGE_SHIFT) < mempages); order++)
 		;
 
 	do {
diff -Nurp linux-2.6.0.orig/net/ipv4/route.c linux-2.6.0/net/ipv4/route.c
--- linux-2.6.0.orig/net/ipv4/route.c	2003-12-17 18:59:55.000000000 -0800
+++ linux-2.6.0/net/ipv4/route.c	2004-01-08 15:01:17.000000000 -0800
@@ -2747,7 +2747,7 @@ int __init ip_rt_init(void)
 
 	goal = num_physpages >> (26 - PAGE_SHIFT);
 
-	for (order = 0; (1UL << order) < goal; order++)
+	for (order = 0; (order < 10) && ((1UL << order) < goal); order++)
 		/* NOTHING */;
 
 	do {
diff -Nurp linux-2.6.0.orig/net/ipv4/tcp.c linux-2.6.0/net/ipv4/tcp.c
--- linux-2.6.0.orig/net/ipv4/tcp.c	2003-12-17 18:58:38.000000000 -0800
+++ linux-2.6.0/net/ipv4/tcp.c	2004-01-08 15:00:42.000000000 -0800
@@ -2610,7 +2610,7 @@ void __init tcp_init(void)
 	else
 		goal = num_physpages >> (23 - PAGE_SHIFT);
 
-	for (order = 0; (1UL << order) < goal; order++)
+	for (order = 0; (order < 10) && ((1UL << order) < goal); order++)
 		;
 	do {
 		tcp_ehash_size = (1UL << order) * PAGE_SIZE /


end of thread, other threads:[~2004-02-19  7:45 UTC | newest]

Thread overview: 38+ messages
2004-02-06  0:10 Limit hash table size Chen, Kenneth W
2004-02-06  0:23 ` Andrew Morton
2004-02-09 23:12   ` Jes Sorensen
  -- strict thread matches above, loose matches on Subject: below --
2004-02-18  0:45 Chen, Kenneth W
2004-02-18  0:16 Chen, Kenneth W
2004-02-17 22:24 Chen, Kenneth W
2004-02-17 23:24 ` Andrew Morton
2004-02-06  6:32 Manfred Spraul
     [not found] <B05667366EE6204181EABE9C1B1C0EB5802441@scsmsx401.sc.intel.com.suse.lists.linux.kernel>
     [not found] ` <20040205155813.726041bd.akpm@osdl.org.suse.lists.linux.kernel>
2004-02-06  1:54   ` Andi Kleen
2004-02-05  2:38     ` Steve Lord
2004-02-06  3:12       ` Andrew Morton
2004-02-05  4:06         ` Steve Lord
2004-02-06  4:39           ` Andi Kleen
2004-02-06  4:59             ` Andrew Morton
2004-02-06  5:34             ` Maneesh Soni
2004-02-06  3:19         ` Andi Kleen
2004-02-06  3:23         ` Nick Piggin
2004-02-06  3:34           ` Andrew Morton
2004-02-06  3:38             ` Nick Piggin
2004-02-18 12:41       ` Pavel Machek
2004-02-06  3:09     ` Andrew Morton
2004-02-06  3:18       ` Andi Kleen
2004-02-06  3:30         ` Andrew Morton
2004-02-06  4:45           ` Martin J. Bligh
2004-02-06  6:22       ` Matt Mackall
2004-02-06 20:20       ` Taneli Vähäkangas
2004-02-06 20:27         ` Andrew Morton
2004-02-06 21:46           ` Taneli Vähäkangas
2004-01-14 22:31 Chen, Kenneth W
2004-01-18 14:25 ` Anton Blanchard
2004-01-14 22:29 Chen, Kenneth W
2004-01-12 16:50 Manfred Spraul
2004-01-09 19:05 Chen, Kenneth W
2004-01-12 13:32 ` Anton Blanchard
2004-01-08 23:12 Chen, Kenneth W
2004-01-09  9:25 ` Andrew Morton
2004-01-09 14:25 ` Anton Blanchard
2004-02-05 23:58 ` Andrew Morton
