linux-mm.kvack.org archive mirror
* [PATCH 1/2] mm/large system hash: use vmalloc for size > MAX_ORDER when !hashdist
@ 2019-06-05 14:48 Nicholas Piggin
  2019-06-05 14:48 ` [PATCH 2/2] mm/large system hash: clear hashdist when only one node with memory is booted Nicholas Piggin
  2019-06-05 21:22 ` [PATCH 1/2] mm/large system hash: use vmalloc for size > MAX_ORDER when !hashdist Andrew Morton
  0 siblings, 2 replies; 4+ messages in thread
From: Nicholas Piggin @ 2019-06-05 14:48 UTC (permalink / raw)
  To: linux-mm; +Cc: Nicholas Piggin, linux-kernel, Andrew Morton, Linus Torvalds

The kernel currently clamps large system hashes to MAX_ORDER when
hashdist is not set, which is rather arbitrary.

vmalloc space is limited on 32-bit machines, but this change should not
use much more of it, because small physical memory already limits the
system hash sizes on those machines.

Include "vmalloc" or "linear" in the kernel log message.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---

This is a better solution than the previous one for the case of !NUMA
systems running on CONFIG_NUMA kernels: we can clear the default
hashdist early and have everything allocated out of the linear map.

I will post the hugepage vmap series later, but it's quite
independent of this improvement.

 mm/page_alloc.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d66bc8abe0af..15f46be7d210 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7966,6 +7966,7 @@ void *__init alloc_large_system_hash(const char *tablename,
 	unsigned long log2qty, size;
 	void *table = NULL;
 	gfp_t gfp_flags;
+	bool virt;
 
 	/* allow the kernel cmdline to have a say */
 	if (!numentries) {
@@ -8022,6 +8023,7 @@ void *__init alloc_large_system_hash(const char *tablename,
 
 	gfp_flags = (flags & HASH_ZERO) ? GFP_ATOMIC | __GFP_ZERO : GFP_ATOMIC;
 	do {
+		virt = false;
 		size = bucketsize << log2qty;
 		if (flags & HASH_EARLY) {
 			if (flags & HASH_ZERO)
@@ -8029,26 +8031,26 @@ void *__init alloc_large_system_hash(const char *tablename,
 			else
 				table = memblock_alloc_raw(size,
 							   SMP_CACHE_BYTES);
-		} else if (hashdist) {
+		} else if (get_order(size) >= MAX_ORDER || hashdist) {
 			table = __vmalloc(size, gfp_flags, PAGE_KERNEL);
+			virt = true;
 		} else {
 			/*
 			 * If bucketsize is not a power-of-two, we may free
 			 * some pages at the end of hash table which
 			 * alloc_pages_exact() automatically does
 			 */
-			if (get_order(size) < MAX_ORDER) {
-				table = alloc_pages_exact(size, gfp_flags);
-				kmemleak_alloc(table, size, 1, gfp_flags);
-			}
+			table = alloc_pages_exact(size, gfp_flags);
+			kmemleak_alloc(table, size, 1, gfp_flags);
 		}
 	} while (!table && size > PAGE_SIZE && --log2qty);
 
 	if (!table)
 		panic("Failed to allocate %s hash table\n", tablename);
 
-	pr_info("%s hash table entries: %ld (order: %d, %lu bytes)\n",
-		tablename, 1UL << log2qty, ilog2(size) - PAGE_SHIFT, size);
+	pr_info("%s hash table entries: %ld (order: %d, %lu bytes, %s)\n",
+		tablename, 1UL << log2qty, ilog2(size) - PAGE_SHIFT, size,
+		virt ? "vmalloc" : "linear");
 
 	if (_hash_shift)
 		*_hash_shift = log2qty;
-- 
2.20.1



* [PATCH 2/2] mm/large system hash: clear hashdist when only one node with memory is booted
  2019-06-05 14:48 [PATCH 1/2] mm/large system hash: use vmalloc for size > MAX_ORDER when !hashdist Nicholas Piggin
@ 2019-06-05 14:48 ` Nicholas Piggin
  2019-06-05 21:22 ` [PATCH 1/2] mm/large system hash: use vmalloc for size > MAX_ORDER when !hashdist Andrew Morton
  1 sibling, 0 replies; 4+ messages in thread
From: Nicholas Piggin @ 2019-06-05 14:48 UTC (permalink / raw)
  To: linux-mm; +Cc: Nicholas Piggin, linux-kernel, Andrew Morton, Linus Torvalds

CONFIG_NUMA on 64-bit CPUs currently enables hashdist unconditionally,
even when booting on single-node machines. This causes the large system
hashes to be allocated with vmalloc and mapped with small pages.

This change clears hashdist if only one node has come up with memory.

This results in the important large inode and dentry hashes using
memblock allocations. All other hashes stay within 4MB in size up to
about 128GB of RAM, which allows them to be allocated from the linear
map on most non-NUMA images.

Other big hashes like futex and TCP should eventually be moved over to
the same style of allocation as those vfs caches that use HASH_EARLY if
!hashdist, so they don't exceed MAX_ORDER on very large non-NUMA images.

This brings dTLB misses for a `git diff` over the Linux kernel tree from
~45,000 down to ~8,000 on a Kaby Lake KVM guest with an 8MB dentry hash
and mitigations=off (performance is in the noise, under 1% difference;
the page tables are likely to be well cached for this workload).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 mm/page_alloc.c | 31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 15f46be7d210..cd944f48be9a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7519,10 +7519,28 @@ static int page_alloc_cpu_dead(unsigned int cpu)
 	return 0;
 }
 
+#ifdef CONFIG_NUMA
+int hashdist = HASHDIST_DEFAULT;
+
+static int __init set_hashdist(char *str)
+{
+	if (!str)
+		return 0;
+	hashdist = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("hashdist=", set_hashdist);
+#endif
+
 void __init page_alloc_init(void)
 {
 	int ret;
 
+#ifdef CONFIG_NUMA
+	if (num_node_state(N_MEMORY) == 1)
+		hashdist = 0;
+#endif
+
 	ret = cpuhp_setup_state_nocalls(CPUHP_PAGE_ALLOC_DEAD,
 					"mm/page_alloc:dead", NULL,
 					page_alloc_cpu_dead);
@@ -7907,19 +7925,6 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
 	return ret;
 }
 
-#ifdef CONFIG_NUMA
-int hashdist = HASHDIST_DEFAULT;
-
-static int __init set_hashdist(char *str)
-{
-	if (!str)
-		return 0;
-	hashdist = simple_strtoul(str, &str, 0);
-	return 1;
-}
-__setup("hashdist=", set_hashdist);
-#endif
-
 #ifndef __HAVE_ARCH_RESERVED_KERNEL_PAGES
 /*
  * Returns the number of pages that arch has reserved but
-- 
2.20.1



* Re: [PATCH 1/2] mm/large system hash: use vmalloc for size > MAX_ORDER when !hashdist
  2019-06-05 14:48 [PATCH 1/2] mm/large system hash: use vmalloc for size > MAX_ORDER when !hashdist Nicholas Piggin
  2019-06-05 14:48 ` [PATCH 2/2] mm/large system hash: clear hashdist when only one node with memory is booted Nicholas Piggin
@ 2019-06-05 21:22 ` Andrew Morton
  2019-06-06  2:27   ` Nicholas Piggin
  1 sibling, 1 reply; 4+ messages in thread
From: Andrew Morton @ 2019-06-05 21:22 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: linux-mm, linux-kernel, Linus Torvalds

On Thu,  6 Jun 2019 00:48:13 +1000 Nicholas Piggin <npiggin@gmail.com> wrote:

> The kernel currently clamps large system hashes to MAX_ORDER when
> hashdist is not set, which is rather arbitrary.
> 
> vmalloc space is limited on 32-bit machines, but this change should not
> use much more of it, because small physical memory already limits the
> system hash sizes on those machines.
> 
> Include "vmalloc" or "linear" in the kernel log message.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> 
> This is a better solution than the previous one for the case of !NUMA
> systems running on CONFIG_NUMA kernels: we can clear the default
> hashdist early and have everything allocated out of the linear map.
> 
> I will post the hugepage vmap series later, but it's quite
> independent of this improvement.
> 
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7966,6 +7966,7 @@ void *__init alloc_large_system_hash(const char *tablename,
>  	unsigned long log2qty, size;
>  	void *table = NULL;
>  	gfp_t gfp_flags;
> +	bool virt;
>  
>  	/* allow the kernel cmdline to have a say */
>  	if (!numentries) {
> @@ -8022,6 +8023,7 @@ void *__init alloc_large_system_hash(const char *tablename,
>  
>  	gfp_flags = (flags & HASH_ZERO) ? GFP_ATOMIC | __GFP_ZERO : GFP_ATOMIC;
>  	do {
> +		virt = false;
>  		size = bucketsize << log2qty;
>  		if (flags & HASH_EARLY) {
>  			if (flags & HASH_ZERO)
> @@ -8029,26 +8031,26 @@ void *__init alloc_large_system_hash(const char *tablename,
>  			else
>  				table = memblock_alloc_raw(size,
>  							   SMP_CACHE_BYTES);
> -		} else if (hashdist) {
> +		} else if (get_order(size) >= MAX_ORDER || hashdist) {
>  			table = __vmalloc(size, gfp_flags, PAGE_KERNEL);
> +			virt = true;
>  		} else {
>  			/*
>  			 * If bucketsize is not a power-of-two, we may free
>  			 * some pages at the end of hash table which
>  			 * alloc_pages_exact() automatically does
>  			 */
> -			if (get_order(size) < MAX_ORDER) {
> -				table = alloc_pages_exact(size, gfp_flags);
> -				kmemleak_alloc(table, size, 1, gfp_flags);
> -			}
> +			table = alloc_pages_exact(size, gfp_flags);
> +			kmemleak_alloc(table, size, 1, gfp_flags);
>  		}
>  	} while (!table && size > PAGE_SIZE && --log2qty);
>  
>  	if (!table)
>  		panic("Failed to allocate %s hash table\n", tablename);
>  
> -	pr_info("%s hash table entries: %ld (order: %d, %lu bytes)\n",
> -		tablename, 1UL << log2qty, ilog2(size) - PAGE_SHIFT, size);
> +	pr_info("%s hash table entries: %ld (order: %d, %lu bytes, %s)\n",
> +		tablename, 1UL << log2qty, ilog2(size) - PAGE_SHIFT, size,
> +		virt ? "vmalloc" : "linear");

Could remove `bool virt' and use is_vmalloc_addr() in the printk?
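
i.e., something like this untested sketch, dropping `virt' and asking
the address itself:

	pr_info("%s hash table entries: %ld (order: %d, %lu bytes, %s)\n",
		tablename, 1UL << log2qty, ilog2(size) - PAGE_SHIFT, size,
		is_vmalloc_addr(table) ? "vmalloc" : "linear");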



* Re: [PATCH 1/2] mm/large system hash: use vmalloc for size > MAX_ORDER when !hashdist
  2019-06-05 21:22 ` [PATCH 1/2] mm/large system hash: use vmalloc for size > MAX_ORDER when !hashdist Andrew Morton
@ 2019-06-06  2:27   ` Nicholas Piggin
  0 siblings, 0 replies; 4+ messages in thread
From: Nicholas Piggin @ 2019-06-06  2:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, Linus Torvalds

Andrew Morton's on June 6, 2019 7:22 am:
> On Thu,  6 Jun 2019 00:48:13 +1000 Nicholas Piggin <npiggin@gmail.com> wrote:
> 
>> The kernel currently clamps large system hashes to MAX_ORDER when
>> hashdist is not set, which is rather arbitrary.
>> 
>> vmalloc space is limited on 32-bit machines, but this change should not
>> use much more of it, because small physical memory already limits the
>> system hash sizes on those machines.
>> 
>> Include "vmalloc" or "linear" in the kernel log message.
>> 
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>> 
>> This is a better solution than the previous one for the case of !NUMA
>> systems running on CONFIG_NUMA kernels: we can clear the default
>> hashdist early and have everything allocated out of the linear map.
>> 
>> I will post the hugepage vmap series later, but it's quite
>> independent of this improvement.
>> 
>> ...
>>
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -7966,6 +7966,7 @@ void *__init alloc_large_system_hash(const char *tablename,
>>  	unsigned long log2qty, size;
>>  	void *table = NULL;
>>  	gfp_t gfp_flags;
>> +	bool virt;
>>  
>>  	/* allow the kernel cmdline to have a say */
>>  	if (!numentries) {
>> @@ -8022,6 +8023,7 @@ void *__init alloc_large_system_hash(const char *tablename,
>>  
>>  	gfp_flags = (flags & HASH_ZERO) ? GFP_ATOMIC | __GFP_ZERO : GFP_ATOMIC;
>>  	do {
>> +		virt = false;
>>  		size = bucketsize << log2qty;
>>  		if (flags & HASH_EARLY) {
>>  			if (flags & HASH_ZERO)
>> @@ -8029,26 +8031,26 @@ void *__init alloc_large_system_hash(const char *tablename,
>>  			else
>>  				table = memblock_alloc_raw(size,
>>  							   SMP_CACHE_BYTES);
>> -		} else if (hashdist) {
>> +		} else if (get_order(size) >= MAX_ORDER || hashdist) {
>>  			table = __vmalloc(size, gfp_flags, PAGE_KERNEL);
>> +			virt = true;
>>  		} else {
>>  			/*
>>  			 * If bucketsize is not a power-of-two, we may free
>>  			 * some pages at the end of hash table which
>>  			 * alloc_pages_exact() automatically does
>>  			 */
>> -			if (get_order(size) < MAX_ORDER) {
>> -				table = alloc_pages_exact(size, gfp_flags);
>> -				kmemleak_alloc(table, size, 1, gfp_flags);
>> -			}
>> +			table = alloc_pages_exact(size, gfp_flags);
>> +			kmemleak_alloc(table, size, 1, gfp_flags);
>>  		}
>>  	} while (!table && size > PAGE_SIZE && --log2qty);
>>  
>>  	if (!table)
>>  		panic("Failed to allocate %s hash table\n", tablename);
>>  
>> -	pr_info("%s hash table entries: %ld (order: %d, %lu bytes)\n",
>> -		tablename, 1UL << log2qty, ilog2(size) - PAGE_SHIFT, size);
>> +	pr_info("%s hash table entries: %ld (order: %d, %lu bytes, %s)\n",
>> +		tablename, 1UL << log2qty, ilog2(size) - PAGE_SHIFT, size,
>> +		virt ? "vmalloc" : "linear");
> 
> Could remove `bool virt' and use is_vmalloc_addr() in the printk?
> 

It can run before mem_init(), and it looks like some arches set
VMALLOC_START/END (high_memory) there (e.g., x86-32, ppc32), so
is_vmalloc_addr() wouldn't be reliable that early.
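
For reference, is_vmalloc_addr() is (roughly) just a direct range check
against those:

static inline bool is_vmalloc_addr(const void *x)
{
#ifdef CONFIG_MMU
	unsigned long addr = (unsigned long)x;

	return addr >= VMALLOC_START && addr < VMALLOC_END;
#else
	return false;
#endif
}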

Thanks,
Nick


