From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-yb0-f199.google.com (mail-yb0-f199.google.com [209.85.213.199]) by kanga.kvack.org (Postfix) with ESMTP id A207B280753 for ; Sat, 20 May 2017 13:07:03 -0400 (EDT)
Received: by mail-yb0-f199.google.com with SMTP id g96so40713203ybi.11 for ; Sat, 20 May 2017 10:07:03 -0700 (PDT)
Received: from userp1040.oracle.com (userp1040.oracle.com. [156.151.31.81]) by mx.google.com with ESMTPS id l5si4074887ybb.277.2017.05.20.10.07.02 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 20 May 2017 10:07:02 -0700 (PDT)
From: Pavel Tatashin
Subject: [v4 0/1] mm: Adaptive hash table scaling
Date: Sat, 20 May 2017 13:06:52 -0400
Message-Id: <1495300013-653283-1-git-send-email-pasha.tatashin@oracle.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mhocko@kernel.org

Changes from v3 - v4:
- Fixed an issue with 32-bit overflow (adapt is ull now instead of ul)
- Added changes suggested by Michal Hocko: use high_limit instead of
  a new flag to determine that we should use this new scaling.

Pavel Tatashin (1):
  mm: Adaptive hash table scaling

 mm/page_alloc.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

--
2.13.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-qk0-f197.google.com (mail-qk0-f197.google.com [209.85.220.197]) by kanga.kvack.org (Postfix) with ESMTP id 4A7CE280753 for ; Sat, 20 May 2017 13:07:06 -0400 (EDT)
Received: by mail-qk0-f197.google.com with SMTP id c75so39021352qka.7 for ; Sat, 20 May 2017 10:07:06 -0700 (PDT)
Received: from aserp1040.oracle.com (aserp1040.oracle.com.
[141.146.126.69]) by mx.google.com with ESMTPS id b52si12806196qta.156.2017.05.20.10.07.05 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 20 May 2017 10:07:05 -0700 (PDT)
From: Pavel Tatashin
Subject: [v4 1/1] mm: Adaptive hash table scaling
Date: Sat, 20 May 2017 13:06:53 -0400
Message-Id: <1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com>
In-Reply-To: <1495300013-653283-1-git-send-email-pasha.tatashin@oracle.com>
References: <1495300013-653283-1-git-send-email-pasha.tatashin@oracle.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mhocko@kernel.org

Allow hash tables to scale with memory but at a slower pace: when HASH_ADAPT
is provided, every time memory quadruples the sizes of hash tables will only
double instead of quadrupling as well. This algorithm starts working only
when memory size reaches a certain point, currently set to 64G.

This is an example of dentry hash table size, before and after, for various
memory configurations:

MEMORY    SCALE        HASH_SIZE
        old  new      old      new
    8G   13   13       8M       8M
   16G   13   13      16M      16M
   32G   13   13      32M      32M
   64G   13   13      64M      64M
  128G   13   14     128M      64M
  256G   13   14     256M     128M
  512G   13   15     512M     128M
 1024G   13   15    1024M     256M
 2048G   13   16    2048M     256M
 4096G   13   16    4096M     512M
 8192G   13   17    8192M     512M
16384G   13   17   16384M    1024M
32768G   13   18   32768M    1024M
65536G   13   18   65536M    2048M

Signed-off-by: Pavel Tatashin
---
 mm/page_alloc.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8afa63e81e73..15bba5c325a5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7169,6 +7169,17 @@ static unsigned long __init arch_reserved_kernel_pages(void)
 #endif

 /*
+ * Adaptive scale is meant to reduce sizes of hash tables on large memory
+ * machines. As memory size is increased the scale is also increased but at
+ * slower pace. Starting from ADAPT_SCALE_BASE (64G), every time memory
+ * quadruples the scale is increased by one, which means the size of hash table
+ * only doubles, instead of quadrupling as well.
+ */
+#define ADAPT_SCALE_BASE	(64ull << 30)
+#define ADAPT_SCALE_SHIFT	2
+#define ADAPT_SCALE_NPAGES	(ADAPT_SCALE_BASE >> PAGE_SHIFT)
+
+/*
  * allocate a large system hash table from bootmem
  * - it is assumed that the hash table must contain an exact power-of-2
  *   quantity of entries
@@ -7199,6 +7210,14 @@ void *__init alloc_large_system_hash(const char *tablename,
 		if (PAGE_SHIFT < 20)
 			numentries = round_up(numentries, (1<<20)/PAGE_SIZE);

+		if (!high_limit) {
+			unsigned long long adapt;
+
+			for (adapt = ADAPT_SCALE_NPAGES; adapt < numentries;
+			     adapt <<= ADAPT_SCALE_SHIFT)
+				scale++;
+		}
+
 		/* limit to 1 bucket per 2^scale bytes of low memory */
 		if (scale > PAGE_SHIFT)
 			numentries >>= (scale - PAGE_SHIFT);
--
2.13.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pg0-f72.google.com (mail-pg0-f72.google.com [74.125.83.72]) by kanga.kvack.org (Postfix) with ESMTP id 5CEA2280753 for ; Sat, 20 May 2017 22:07:31 -0400 (EDT)
Received: by mail-pg0-f72.google.com with SMTP id t126so91357563pgc.9 for ; Sat, 20 May 2017 19:07:31 -0700 (PDT)
Received: from mga04.intel.com (mga04.intel.com.
[192.55.52.120]) by mx.google.com with ESMTPS id m63si13021513pfa.331.2017.05.20.19.07.30 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 20 May 2017 19:07:30 -0700 (PDT)
From: Andi Kleen
Subject: Re: [v4 1/1] mm: Adaptive hash table scaling
References: <1495300013-653283-1-git-send-email-pasha.tatashin@oracle.com> <1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com>
Date: Sat, 20 May 2017 19:07:29 -0700
In-Reply-To: <1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com> (Pavel Tatashin's message of "Sat, 20 May 2017 13:06:53 -0400")
Message-ID: <87h90faroe.fsf@firstfloor.org>
MIME-Version: 1.0
Content-Type: text/plain
Sender: owner-linux-mm@kvack.org
List-ID:
To: Pavel Tatashin
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mhocko@kernel.org

Pavel Tatashin writes:

> Allow hash tables to scale with memory but at slower pace, when HASH_ADAPT
> is provided every time memory quadruples the sizes of hash tables will only
> double instead of quadrupling as well. This algorithm starts working only
> when memory size reaches a certain point, currently set to 64G.
>
> This is an example of dentry hash table size, before and after, for various
> memory configurations:

IMHO the scale is still too aggressive. I find it very unlikely
that a 1TB machine really needs 256MB of hash table because
the number of used files is unlikely to scale directly with memory.

Perhaps should just cap it at some large size, e.g. 32M

-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-it0-f72.google.com (mail-it0-f72.google.com [209.85.214.72]) by kanga.kvack.org (Postfix) with ESMTP id F34D1280850 for ; Sun, 21 May 2017 08:58:48 -0400 (EDT)
Received: by mail-it0-f72.google.com with SMTP id l145so62539168ita.14 for ; Sun, 21 May 2017 05:58:48 -0700 (PDT)
Received: from aserp1040.oracle.com (aserp1040.oracle.com. [141.146.126.69]) by mx.google.com with ESMTPS id x127si29862850itb.55.2017.05.21.05.58.47 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 21 May 2017 05:58:48 -0700 (PDT)
Subject: Re: [v4 1/1] mm: Adaptive hash table scaling
References: <1495300013-653283-1-git-send-email-pasha.tatashin@oracle.com> <1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com> <87h90faroe.fsf@firstfloor.org>
From: Pasha Tatashin
Message-ID:
Date: Sun, 21 May 2017 08:58:25 -0400
MIME-Version: 1.0
In-Reply-To: <87h90faroe.fsf@firstfloor.org>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andi Kleen
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mhocko@kernel.org

Hi Andi,

Thank you for looking at this. As I mentioned earlier, I would not want to
impose a cap. However, if you think that, for example, dcache needs a cap,
there is already a mechanism for that via the high_limit argument, so the
client can be changed to provide that cap. However, this particular patch
addresses the scaling problem for everyone by making it scale with memory at
a slower pace.

Thank you,
Pasha

On 05/20/2017 10:07 PM, Andi Kleen wrote:
> Pavel Tatashin writes:
>
>> Allow hash tables to scale with memory but at slower pace, when HASH_ADAPT
>> is provided every time memory quadruples the sizes of hash tables will only
>> double instead of quadrupling as well. This algorithm starts working only
>> when memory size reaches a certain point, currently set to 64G.
>>
>> This is an example of dentry hash table size, before and after, for various
>> memory configurations:
>
> IMHO the scale is still too aggressive. I find it very unlikely
> that a 1TB machine really needs 256MB of hash table because
> the number of used files is unlikely to scale directly with memory.
>
> Perhaps should just cap it at some large size, e.g. 32M
>
> -Andi
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: email@kvack.org
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wm0-f72.google.com (mail-wm0-f72.google.com [74.125.82.72]) by kanga.kvack.org (Postfix) with ESMTP id C3C02280850 for ; Sun, 21 May 2017 12:35:09 -0400 (EDT)
Received: by mail-wm0-f72.google.com with SMTP id w79so20587636wme.7 for ; Sun, 21 May 2017 09:35:09 -0700 (PDT)
Received: from one.firstfloor.org (one.firstfloor.org. [193.170.194.197]) by mx.google.com with ESMTPS id n21si9714103wrn.252.2017.05.21.09.35.07 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 21 May 2017 09:35:07 -0700 (PDT)
Date: Sun, 21 May 2017 09:35:06 -0700
From: Andi Kleen
Subject: Re: [v4 1/1] mm: Adaptive hash table scaling
Message-ID: <20170521163506.GA8096@two.firstfloor.org>
References: <1495300013-653283-1-git-send-email-pasha.tatashin@oracle.com> <1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com> <87h90faroe.fsf@firstfloor.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-ID:
To: Pasha Tatashin
Cc: Andi Kleen , akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mhocko@kernel.org

On Sun, May 21, 2017 at 08:58:25AM -0400, Pasha Tatashin wrote:
> Hi Andi,
>
> Thank you for looking at this. I mentioned earlier, I would not want to
> impose a cap. However, if you think that for example dcache needs a cap,
> there is already a mechanism for that via high_limit argument, so the client

Lots of arguments are not the solution. Today this only affects a few
high-end systems, but we'll see many more large-memory systems in the
future. We don't want to have all these users either waste their memory
or apply magic arguments.

> can be changed to provide that cap. However, this particular patch addresses
> scaling problem for everyone by making it scale with memory at a slower
> pace.

Yes, your patch goes in the right direction and should be applied. It just
could be even more aggressive.

Long term, probably all these hash tables need to be converted to rhash
to dynamically resize.

-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pf0-f198.google.com (mail-pf0-f198.google.com [209.85.192.198]) by kanga.kvack.org (Postfix) with ESMTP id A1A31831F4 for ; Mon, 22 May 2017 02:17:58 -0400 (EDT)
Received: by mail-pf0-f198.google.com with SMTP id p86so116614527pfl.12 for ; Sun, 21 May 2017 23:17:58 -0700 (PDT)
Received: from ozlabs.org (ozlabs.org. [103.22.144.67]) by mx.google.com with ESMTPS id b76si16301470pfd.382.2017.05.21.23.17.57 for (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Sun, 21 May 2017 23:17:57 -0700 (PDT)
From: Michael Ellerman
Subject: Re: [v4 1/1] mm: Adaptive hash table scaling
In-Reply-To: <1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com>
References: <1495300013-653283-1-git-send-email-pasha.tatashin@oracle.com> <1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com>
Date: Mon, 22 May 2017 16:17:54 +1000
Message-ID: <87inkts9d9.fsf@concordia.ellerman.id.au>
MIME-Version: 1.0
Content-Type: text/plain
Sender: owner-linux-mm@kvack.org
List-ID:
To: Pavel Tatashin , akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mhocko@kernel.org

Pavel Tatashin writes:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8afa63e81e73..15bba5c325a5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7169,6 +7169,17 @@ static unsigned long __init arch_reserved_kernel_pages(void)
>  #endif
>
>  /*
> + * Adaptive scale is meant to reduce sizes of hash tables on large memory
> + * machines. As memory size is increased the scale is also increased but at
> + * slower pace. Starting from ADAPT_SCALE_BASE (64G), every time memory
> + * quadruples the scale is increased by one, which means the size of hash table
> + * only doubles, instead of quadrupling as well.
> + */
> +#define ADAPT_SCALE_BASE	(64ull << 30)
> +#define ADAPT_SCALE_SHIFT	2
> +#define ADAPT_SCALE_NPAGES	(ADAPT_SCALE_BASE >> PAGE_SHIFT)
> +
> +/*
>   * allocate a large system hash table from bootmem
>   * - it is assumed that the hash table must contain an exact power-of-2
>   *   quantity of entries
> @@ -7199,6 +7210,14 @@ void *__init alloc_large_system_hash(const char *tablename,
>  		if (PAGE_SHIFT < 20)
>  			numentries = round_up(numentries, (1<<20)/PAGE_SIZE);
>
> +		if (!high_limit) {
> +			unsigned long long adapt;
> +
> +			for (adapt = ADAPT_SCALE_NPAGES; adapt < numentries;
> +			     adapt <<= ADAPT_SCALE_SHIFT)
> +				scale++;
> +		}

This still doesn't work for me. The scale++ is overflowing according to
UBSAN (line 7221).

It looks like numentries is 194560.

00000950  68 0a 50 49 44 20 68 61  73 68 20 74 61 62 6c 65  |h.PID hash table|
00000960  20 65 6e 74 72 69 65 73  3a 20 34 30 39 36 20 28  | entries: 4096 (|
00000970  6f 72 64 65 72 3a 20 32  2c 20 31 36 33 38 34 20  |order: 2, 16384 |
00000980  62 79 74 65 73 29 0a 61  6c 6c 6f 63 5f 6c 61 72  |bytes).alloc_lar|
00000990  67 65 5f 73 79 73 74 65  6d 5f 68 61 73 68 3a 20  |ge_system_hash: |
000009a0  6e 75 6d 65 6e 74 72 69  65 73 20 31 39 34 35 36  |numentries 19456|
000009b0  30 0a 61 6c 6c 6f 63 5f  6c 61 72 67 65 5f 73 79  |0.alloc_large_sy|
000009c0  73 74 65 6d 5f 68 61 73  68 3a 20 61 64 61 70 74  |stem_hash: adapt|
000009d0  20 30 0a 3d 3d 3d 3d 3d  3d 3d 3d 3d 3d 3d 3d 3d  | 0.=============|
000009e0  3d 3d 3d 3d 3d 3d 3d 3d  3d 3d 3d 3d 3d 3d 3d 3d  |================|
*
00000a20  3d 3d 3d 0a 55 42 53 41  4e 3a 20 55 6e 64 65 66  |===.UBSAN: Undef|
00000a30  69 6e 65 64 20 62 65 68  61 76 69 6f 75 72 20 69  |ined behaviour i|
00000a40  6e 20 2e 2e 2f 6d 6d 2f  70 61 67 65 5f 61 6c 6c  |n ../mm/page_all|
00000a50  6f 63 2e 63 3a 37 32 32  31 3a 31 30 0a 73 69 67  |oc.c:7221:10.sig|
00000a60  6e 65 64 20 69 6e 74 65  67 65 72 20 6f 76 65 72  |ned integer over|
00000a70  66 6c 6f 77 3a 0a 32 31  34 37 34 38 33 36 34 37  |flow:.2147483647|
00000a80  20 2b 20 31 20 63 61 6e  6e 6f 74 20 62 65 20 72  | + 1 cannot be r|
00000a90  65 70 72 65 73 65 6e 74  65 64 20 69 6e 20 74 79  |epresented in ty|
00000aa0  70 65 20 27 69 6e 74 20  5b 34 5d 27 0a 43 50 55  |pe 'int [4]'.CPU|
00000ab0  3a 20 30 20 50 49 44 3a  20 30 20 43 6f 6d 6d 3a  |: 0 PID: 0 Comm:|
00000ac0  20 73 77 61 70 70 65 72  20 4e 6f 74 20 74 61 69  | swapper Not tai|
00000ad0  6e 74 65 64 20 34 2e 31  32 2e 30 2d 72 63 31 2d  |nted 4.12.0-rc1-|
00000ae0  67 63 63 2d 36 2e 33 2e  31 2d 30 30 31 38 32 2d  |gcc-6.3.1-00182-|
00000af0  67 36 37 64 30 36 38 37  32 32 34 61 39 2d 64 69  |g67d0687224a9-di|
00000b00  72 74 79 20 23 38 0a 43  61 6c 6c 20 54 72 61 63  |rty #8.Call Trac|
00000b10  65 3a 0a 5b 63 30 65 30  35 65 61 30 5d 20 5b 63  |e:.[c0e05ea0] [c|
00000b20  30 34 37 38 38 63 34 5d  20 75 62 73 61 6e 5f 65  |04788c4] ubsan_e|
00000b30  70 69 6c 6f 67 75 65 2b  30 78 31 38 2f 30 78 34  |pilogue+0x18/0x4|
00000b40  63 20 28 75 6e 72 65 6c  69 61 62 6c 65 29 0a 5b  |c (unreliable).[|
00000b50  63 30 65 30 35 65 62 30  5d 20 5b 63 30 34 37 39  |c0e05eb0] [c0479|
00000b60  32 36 30 5d 20 68 61 6e  64 6c 65 5f 6f 76 65 72  |260] handle_over|
00000b70  66 6c 6f 77 2b 30 78 62  63 2f 30 78 64 63 0a 5b  |flow+0xbc/0xdc.[|
00000b80  63 30 65 30 35 66 33 30  5d 20 5b 63 30 61 62 39  |c0e05f30] [c0ab9|
00000b90  38 66 38 5d 20 61 6c 6c  6f 63 5f 6c 61 72 67 65  |8f8] alloc_large|
00000ba0  5f 73 79 73 74 65 6d 5f  68 61 73 68 2b 30 78 65  |_system_hash+0xe|
00000bb0  34 2f 30 78 35 65 63 0a  5b 63 30 65 30 35 66 39  |4/0x5ec.[c0e05f9|
00000bc0  30 5d 20 5b 63 30 61 62  65 30 30 30 5d 20 76 66  |0] [c0abe000] vf|
00000bd0  73 5f 63 61 63 68 65 73  5f 69 6e 69 74 5f 65 61  |s_caches_init_ea|
00000be0  72 6c 79 2b 30 78 34 63  2f 30 78 36 34 0a 5b 63  |rly+0x4c/0x64.[c|
00000bf0  30 65 30 35 66 62 30 5d  20 5b 63 30 61 61 35 32  |0e05fb0] [c0aa52|
00000c00  31 38 5d 20 73 74 61 72  74 5f 6b 65 72 6e 65 6c  |18] start_kernel|
00000c10  2b 30 78 32 33 63 2f 30  78 33 63 34 0a 5b 63 30  |+0x23c/0x3c4.[c0|
00000c20  65 30 35 66 66 30 5d 20  5b 30 30 30 30 33 34 34  |e05ff0] [0000344|
00000c30  63 5d 20 30 78 33 34 34  63 0a 3d 3d 3d 3d 3d 3d  |c] 0x344c.======|
00000c40  3d 3d 3d 3d 3d 3d 3d 3d  3d 3d 3d 3d 3d 3d 3d 3d  |================|

cheers

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wm0-f71.google.com (mail-wm0-f71.google.com [74.125.82.71]) by kanga.kvack.org (Postfix) with ESMTP id 6CFB9831F4 for ; Mon, 22 May 2017 05:29:16 -0400 (EDT)
Received: by mail-wm0-f71.google.com with SMTP id g143so24593185wme.13 for ; Mon, 22 May 2017 02:29:16 -0700 (PDT)
Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id v7si10450536wmv.91.2017.05.22.02.29.14 for (version=TLS1 cipher=AES128-SHA bits=128/128); Mon, 22 May 2017 02:29:15 -0700 (PDT)
Date: Mon, 22 May 2017 11:29:10 +0200
From: Michal Hocko
Subject: Re: [v4 1/1] mm: Adaptive hash table scaling
Message-ID: <20170522092910.GD8509@dhcp22.suse.cz>
References: <1495300013-653283-1-git-send-email-pasha.tatashin@oracle.com> <1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Pavel Tatashin
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Sat 20-05-17 13:06:53, Pavel Tatashin wrote:
[...]
>  /*
> + * Adaptive scale is meant to reduce sizes of hash tables on large memory
> + * machines. As memory size is increased the scale is also increased but at
> + * slower pace. Starting from ADAPT_SCALE_BASE (64G), every time memory
> + * quadruples the scale is increased by one, which means the size of hash table
> + * only doubles, instead of quadrupling as well.
> + */
> +#define ADAPT_SCALE_BASE	(64ull << 30)

I have only noticed this email today because my incoming emails stopped
syncing since Friday. But this is _definitely_ not the right approach.
64G for 32b systems is _way_ off. We have only ~1G for the kernel. I've
already proposed scaling up to 32M for 32b systems and Andi seems to be
suggesting the same. So can we fold or apply the following instead?
---

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-vk0-f70.google.com (mail-vk0-f70.google.com [209.85.213.70]) by kanga.kvack.org (Postfix) with ESMTP id B9D8D831F4 for ; Mon, 22 May 2017 09:19:14 -0400 (EDT)
Received: by mail-vk0-f70.google.com with SMTP id p85so24862647vkd.10 for ; Mon, 22 May 2017 06:19:14 -0700 (PDT)
Received: from userp1040.oracle.com (userp1040.oracle.com. [156.151.31.81]) by mx.google.com with ESMTPS id 65si7929732uaa.201.2017.05.22.06.19.13 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 22 May 2017 06:19:13 -0700 (PDT)
Subject: Re: [v4 1/1] mm: Adaptive hash table scaling
References: <1495300013-653283-1-git-send-email-pasha.tatashin@oracle.com> <1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com> <20170522092910.GD8509@dhcp22.suse.cz>
From: Pasha Tatashin
Message-ID:
Date: Mon, 22 May 2017 09:18:58 -0400
MIME-Version: 1.0
In-Reply-To: <20170522092910.GD8509@dhcp22.suse.cz>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Michal Hocko
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michael Ellerman

>
> I have only noticed this email today because my incoming emails stopped
> syncing since Friday. But this is _definitely_ not the right approach.
> 64G for 32b systems is _way_ off. We have only ~1G for the kernel. I've
> already proposed scaling up to 32M for 32b systems and Andi seems to be
> suggesting the same. So can we fold or apply the following instead?

Hi Michal,

Thank you for your suggestion. I will update the patch.

64G base for 32bit systems is not meant to be ever used, as the adaptive
scaling for 32bit systems is just not needed. 32M and 64G are going to be
exactly the same on such systems.

Here is the theoretical limit for the max hash size of entries (dentry cache
example):

size of bucket: sizeof(struct hlist_bl_head) = 4 bytes
numentries:     (1 << 32) / PAGE_SIZE = 1048576 (for 4K pages)
hash size:      4b * 1048576 = 4M

In practice it is going to be an order smaller, as the number of kernel
pages is less than (1<<32).

However, I will apply your suggestions as there seems to be a problem of
overflowing in comparing ul vs. ull as reported by Michael Ellerman, and
having a large base on 32bit systems will solve this issue. I will revert
all the quantities back to "ul".

Another approach is to make it a 64 bit only macro like this:

#if __BITS_PER_LONG > 32

#define ADAPT_SCALE_BASE	(64ull << 30)
#define ADAPT_SCALE_SHIFT	2
#define ADAPT_SCALE_NPAGES	(ADAPT_SCALE_BASE >> PAGE_SHIFT)

#define adapt_scale(high_limit, numentries, scalep)		\
	if (!(high_limit)) {					\
		unsigned long adapt;				\
		for (adapt = ADAPT_SCALE_NPAGES; adapt <	\
		     (numentries); adapt <<= ADAPT_SCALE_SHIFT)	\
			(*(scalep))++;				\
	}
#else
#define adapt_scale(high_limit, numentries, scalep)
#endif

Pasha

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wm0-f71.google.com (mail-wm0-f71.google.com [74.125.82.71]) by kanga.kvack.org (Postfix) with ESMTP id 304DE831F4 for ; Mon, 22 May 2017 09:38:37 -0400 (EDT)
Received: by mail-wm0-f71.google.com with SMTP id r203so25573156wmb.2 for ; Mon, 22 May 2017 06:38:37 -0700 (PDT)
Received: from mx1.suse.de (mx2.suse.de.
[195.135.220.15]) by mx.google.com with ESMTPS id 74si3338309wma.124.2017.05.22.06.38.35 for (version=TLS1 cipher=AES128-SHA bits=128/128); Mon, 22 May 2017 06:38:36 -0700 (PDT)
Date: Mon, 22 May 2017 15:38:34 +0200
From: Michal Hocko
Subject: Re: [v4 1/1] mm: Adaptive hash table scaling
Message-ID: <20170522133834.GL8509@dhcp22.suse.cz>
References: <1495300013-653283-1-git-send-email-pasha.tatashin@oracle.com> <1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com> <20170522092910.GD8509@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-ID:
To: Pasha Tatashin
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michael Ellerman

On Mon 22-05-17 09:18:58, Pasha Tatashin wrote:
> >
> >I have only noticed this email today because my incoming emails stopped
> >syncing since Friday. But this is _definitely_ not the right approach.
> >64G for 32b systems is _way_ off. We have only ~1G for the kernel. I've
> >already proposed scaling up to 32M for 32b systems and Andi seems to be
> >suggesting the same. So can we fold or apply the following instead?
>
> Hi Michal,
>
> Thank you for your suggestion. I will update the patch.
>
> 64G base for 32bit systems is not meant to be ever used, as the adaptive
> scaling for 32bit system is just not needed. 32M and 64G are going to be
> exactly the same on such systems.
>
> Here is theoretical limit for the max hash size of entries (dentry cache
> example):
>
> size of bucket: sizeof(struct hlist_bl_head) = 4 bytes
> numentries:     (1 << 32) / PAGE_SIZE = 1048576 (for 4K pages)
> hash size:      4b * 1048576 = 4M
>
> In practice it is going to be an order smaller, as number of kernel pages is
> less than (1<<32).

I haven't double-checked your math, but if the above is correct then I
would just go and disable the adaptive scaling for 32b altogether. More
on that below.

> However, I will apply your suggestions as there seems to be a problem of
> overflowing in comparing ul vs. ull as reported by Michael Ellerman, and
> having a large base on 32bit systems will solve this issue. I will revert
> back to "ul" all the quantities.

Yeah, that is just calling for troubles.

> Another approach is to make it a 64 bit only macro like this:
>
> #if __BITS_PER_LONG > 32
>
> #define ADAPT_SCALE_BASE	(64ull << 30)
> #define ADAPT_SCALE_SHIFT	2
> #define ADAPT_SCALE_NPAGES	(ADAPT_SCALE_BASE >> PAGE_SHIFT)
>
> #define adapt_scale(high_limit, numentries, scalep)		\
> 	if (!(high_limit)) {					\
> 		unsigned long adapt;				\
> 		for (adapt = ADAPT_SCALE_NPAGES; adapt <	\
> 		     (numentries); adapt <<= ADAPT_SCALE_SHIFT)	\
> 			(*(scalep))++;				\
> 	}
> #else
> #define adapt_scale(high_limit, numentries, scalep)
> #endif

This is just too ugly to live, really. If we do not need adaptive
scaling then just make it #if __BITS_PER_LONG around the code. I would
be fine with this. A big fat warning explaining why this is 64b only
would be appropriate.

--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-ua0-f197.google.com (mail-ua0-f197.google.com [209.85.217.197]) by kanga.kvack.org (Postfix) with ESMTP id B0848831F4 for ; Mon, 22 May 2017 09:41:18 -0400 (EDT)
Received: by mail-ua0-f197.google.com with SMTP id k4so33649037uaa.0 for ; Mon, 22 May 2017 06:41:18 -0700 (PDT)
Received: from aserp1040.oracle.com (aserp1040.oracle.com. [141.146.126.69]) by mx.google.com with ESMTPS id o132si6110044vke.89.2017.05.22.06.41.17 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 22 May 2017 06:41:17 -0700 (PDT)
Subject: Re: [v4 1/1] mm: Adaptive hash table scaling
References: <1495300013-653283-1-git-send-email-pasha.tatashin@oracle.com> <1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com> <20170522092910.GD8509@dhcp22.suse.cz> <20170522133834.GL8509@dhcp22.suse.cz>
From: Pasha Tatashin
Message-ID: <6e81aa26-e43e-6264-e2f9-547531b809f5@oracle.com>
Date: Mon, 22 May 2017 09:41:08 -0400
MIME-Version: 1.0
In-Reply-To: <20170522133834.GL8509@dhcp22.suse.cz>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Michal Hocko
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michael Ellerman

>
> This is just too ugly to live, really. If we do not need adaptive
> scaling then just make it #if __BITS_PER_LONG around the code. I would
> be fine with this. A big fat warning explaining why this is 64b only
> would be appropriate.
>

OK, let me prettify it somehow, and I will send a new patch out.

Pasha

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756280AbdETRHI (ORCPT ); Sat, 20 May 2017 13:07:08 -0400 Received: from userp1040.oracle.com ([156.151.31.81]:42103 "EHLO userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756157AbdETRHD (ORCPT ); Sat, 20 May 2017 13:07:03 -0400 From: Pavel Tatashin To: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mhocko@kernel.org Subject: [v4 0/1] mm: Adaptive hash table scaling Date: Sat, 20 May 2017 13:06:52 -0400 Message-Id: <1495300013-653283-1-git-send-email-pasha.tatashin@oracle.com> X-Mailer: git-send-email 1.7.1 X-Source-IP: aserv0021.oracle.com [141.146.126.233] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Changes from v3 - v4: - Fixed an issue with 32-bit overflow (adapt is ull now instead ul) - Added changes suggested by Michal Hocko: use high_limit instead of a new flag to determine that we should use this new scaling. 
Pavel Tatashin (1): mm: Adaptive hash table scaling mm/page_alloc.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) -- 2.13.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756377AbdETRHX (ORCPT ); Sat, 20 May 2017 13:07:23 -0400 Received: from aserp1040.oracle.com ([141.146.126.69]:25950 "EHLO aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756217AbdETRHH (ORCPT ); Sat, 20 May 2017 13:07:07 -0400 From: Pavel Tatashin To: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mhocko@kernel.org Subject: [v4 1/1] mm: Adaptive hash table scaling Date: Sat, 20 May 2017 13:06:53 -0400 Message-Id: <1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com> X-Mailer: git-send-email 1.7.1 In-Reply-To: <1495300013-653283-1-git-send-email-pasha.tatashin@oracle.com> References: <1495300013-653283-1-git-send-email-pasha.tatashin@oracle.com> X-Source-IP: userv0021.oracle.com [156.151.31.71] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Allow hash tables to scale with memory but at slower pace, when HASH_ADAPT is provided every time memory quadruples the sizes of hash tables will only double instead of quadrupling as well. This algorithm starts working only when memory size reaches a certain point, currently set to 64G. 
This is an example of dentry hash table size, before and after, for various
memory configurations:

MEMORY     SCALE         HASH_SIZE
         old  new      old      new
    8G    13   13       8M       8M
   16G    13   13      16M      16M
   32G    13   13      32M      32M
   64G    13   13      64M      64M
  128G    13   14     128M      64M
  256G    13   14     256M     128M
  512G    13   15     512M     128M
 1024G    13   15    1024M     256M
 2048G    13   16    2048M     256M
 4096G    13   16    4096M     512M
 8192G    13   17    8192M     512M
16384G    13   17   16384M    1024M
32768G    13   18   32768M    1024M
65536G    13   18   65536M    2048M

Signed-off-by: Pavel Tatashin
---
 mm/page_alloc.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8afa63e81e73..15bba5c325a5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7169,6 +7169,17 @@ static unsigned long __init arch_reserved_kernel_pages(void)
 #endif
 
 /*
+ * Adaptive scale is meant to reduce sizes of hash tables on large memory
+ * machines. As memory size is increased the scale is also increased but at
+ * slower pace. Starting from ADAPT_SCALE_BASE (64G), every time memory
+ * quadruples the scale is increased by one, which means the size of hash table
+ * only doubles, instead of quadrupling as well.
+ */
+#define ADAPT_SCALE_BASE (64ull << 30)
+#define ADAPT_SCALE_SHIFT 2
+#define ADAPT_SCALE_NPAGES (ADAPT_SCALE_BASE >> PAGE_SHIFT)
+
+/*
  * allocate a large system hash table from bootmem
  * - it is assumed that the hash table must contain an exact power-of-2
  *   quantity of entries
@@ -7199,6 +7210,14 @@ void *__init alloc_large_system_hash(const char *tablename,
                 if (PAGE_SHIFT < 20)
                         numentries = round_up(numentries, (1<<20)/PAGE_SIZE);
 
+                if (!high_limit) {
+                        unsigned long long adapt;
+
+                        for (adapt = ADAPT_SCALE_NPAGES; adapt < numentries;
+                             adapt <<= ADAPT_SCALE_SHIFT)
+                                scale++;
+                }
+
                 /* limit to 1 bucket per 2^scale bytes of low memory */
                 if (scale > PAGE_SHIFT)
                         numentries >>= (scale - PAGE_SHIFT);
-- 
2.13.0

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756234AbdEUCHe (ORCPT ); Sat, 20 May 2017 22:07:34 -0400
Received: from mga03.intel.com ([134.134.136.65]:25354 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751782AbdEUCHb (ORCPT ); Sat, 20 May 2017 22:07:31 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.38,372,1491289200"; d="scan'208";a="264484039"
From: Andi Kleen
To: Pavel Tatashin
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mhocko@kernel.org
Subject: Re: [v4 1/1] mm: Adaptive hash table scaling
References: <1495300013-653283-1-git-send-email-pasha.tatashin@oracle.com> <1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com>
Date: Sat, 20 May 2017 19:07:29 -0700
In-Reply-To: <1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com> (Pavel Tatashin's message of "Sat, 20 May 2017 13:06:53 -0400")
Message-ID: <87h90faroe.fsf@firstfloor.org>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.2 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Pavel Tatashin writes:

> Allow hash tables to scale with
memory but at slower pace, when HASH_ADAPT > is provided every time memory quadruples the sizes of hash tables will only > double instead of quadrupling as well. This algorithm starts working only > when memory size reaches a certain point, currently set to 64G. > > This is example of dentry hash table size, before and after four various > memory configurations: IMHO the scale is still too aggressive. I find it very unlikely that a 1TB machine really needs 256MB of hash table because number of used files are unlikely to directly scale with memory. Perhaps should just cap it at some large size, e.g. 32M -Andi From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756582AbdEUM6y (ORCPT ); Sun, 21 May 2017 08:58:54 -0400 Received: from aserp1040.oracle.com ([141.146.126.69]:22974 "EHLO aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753411AbdEUM6t (ORCPT ); Sun, 21 May 2017 08:58:49 -0400 Subject: Re: [v4 1/1] mm: Adaptive hash table scaling To: Andi Kleen Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mhocko@kernel.org References: <1495300013-653283-1-git-send-email-pasha.tatashin@oracle.com> <1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com> <87h90faroe.fsf@firstfloor.org> From: Pasha Tatashin Message-ID: Date: Sun, 21 May 2017 08:58:25 -0400 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.1.1 MIME-Version: 1.0 In-Reply-To: <87h90faroe.fsf@firstfloor.org> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit X-Source-IP: aserv0021.oracle.com [141.146.126.233] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Andi, Thank you for looking at this. I mentioned earlier, I would not want to impose a cap. 
However, if you think that for example dcache needs a cap, there is already a mechanism for that via high_limit argument, so the client can be changed to provide that cap. However, this particular patch addresses scaling problem for everyone by making it scale with memory at a slower pace. Thank you, Pasha On 05/20/2017 10:07 PM, Andi Kleen wrote: > Pavel Tatashin writes: > >> Allow hash tables to scale with memory but at slower pace, when HASH_ADAPT >> is provided every time memory quadruples the sizes of hash tables will only >> double instead of quadrupling as well. This algorithm starts working only >> when memory size reaches a certain point, currently set to 64G. >> >> This is example of dentry hash table size, before and after four various >> memory configurations: > > IMHO the scale is still too aggressive. I find it very unlikely > that a 1TB machine really needs 256MB of hash table because > number of used files are unlikely to directly scale with memory. > > Perhaps should just cap it at some large size, e.g. 32M > > -Andi > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . 
> Don't email: email@kvack.org > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756695AbdEUQfK (ORCPT ); Sun, 21 May 2017 12:35:10 -0400 Received: from one.firstfloor.org ([193.170.194.197]:43654 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756623AbdEUQfJ (ORCPT ); Sun, 21 May 2017 12:35:09 -0400 Date: Sun, 21 May 2017 09:35:06 -0700 From: Andi Kleen To: Pasha Tatashin Cc: Andi Kleen , akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mhocko@kernel.org Subject: Re: [v4 1/1] mm: Adaptive hash table scaling Message-ID: <20170521163506.GA8096@two.firstfloor.org> References: <1495300013-653283-1-git-send-email-pasha.tatashin@oracle.com> <1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com> <87h90faroe.fsf@firstfloor.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, May 21, 2017 at 08:58:25AM -0400, Pasha Tatashin wrote: > Hi Andi, > > Thank you for looking at this. I mentioned earlier, I would not want to > impose a cap. However, if you think that for example dcache needs a cap, > there is already a mechanism for that via high_limit argument, so the client Lots of arguments are not the solution. Today this only affects a few highend systems, but we'll see much more large memory systems in the future. We don't want to have all these users either waste their memory, or apply magic arguments. > can be changed to provide that cap. However, this particular patch addresses > scaling problem for everyone by making it scale with memory at a slower > pace. Yes your patch goes in the right direction and should be applied. Just could be even more aggressive. 
Long term probably all these hash tables need to be converted to rhash to dynamically resize. -Andi From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752198AbdEVGSA (ORCPT ); Mon, 22 May 2017 02:18:00 -0400 Received: from ozlabs.org ([103.22.144.67]:52667 "EHLO ozlabs.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752126AbdEVGR4 (ORCPT ); Mon, 22 May 2017 02:17:56 -0400 From: Michael Ellerman To: Pavel Tatashin , akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mhocko@kernel.org Subject: Re: [v4 1/1] mm: Adaptive hash table scaling In-Reply-To: <1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com> References: <1495300013-653283-1-git-send-email-pasha.tatashin@oracle.com> <1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com> User-Agent: Notmuch/0.21 (https://notmuchmail.org) Date: Mon, 22 May 2017 16:17:54 +1000 Message-ID: <87inkts9d9.fsf@concordia.ellerman.id.au> MIME-Version: 1.0 Content-Type: text/plain Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Pavel Tatashin writes: > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 8afa63e81e73..15bba5c325a5 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -7169,6 +7169,17 @@ static unsigned long __init arch_reserved_kernel_pages(void) > #endif > > /* > + * Adaptive scale is meant to reduce sizes of hash tables on large memory > + * machines. As memory size is increased the scale is also increased but at > + * slower pace. Starting from ADAPT_SCALE_BASE (64G), every time memory > + * quadruples the scale is increased by one, which means the size of hash table > + * only doubles, instead of quadrupling as well. 
> + */ > +#define ADAPT_SCALE_BASE (64ull << 30) > +#define ADAPT_SCALE_SHIFT 2 > +#define ADAPT_SCALE_NPAGES (ADAPT_SCALE_BASE >> PAGE_SHIFT) > + > +/* > * allocate a large system hash table from bootmem > * - it is assumed that the hash table must contain an exact power-of-2 > * quantity of entries > @@ -7199,6 +7210,14 @@ void *__init alloc_large_system_hash(const char *tablename, > if (PAGE_SHIFT < 20) > numentries = round_up(numentries, (1<<20)/PAGE_SIZE); > > + if (!high_limit) { > + unsigned long long adapt; > + > + for (adapt = ADAPT_SCALE_NPAGES; adapt < numentries; > + adapt <<= ADAPT_SCALE_SHIFT) > + scale++; > + } This still doesn't work for me. The scale++ is overflowing according to UBSAN (line 7221). It looks like numentries is 194560. 00000950 68 0a 50 49 44 20 68 61 73 68 20 74 61 62 6c 65 |h.PID hash table| 00000960 20 65 6e 74 72 69 65 73 3a 20 34 30 39 36 20 28 | entries: 4096 (| 00000970 6f 72 64 65 72 3a 20 32 2c 20 31 36 33 38 34 20 |order: 2, 16384 | 00000980 62 79 74 65 73 29 0a 61 6c 6c 6f 63 5f 6c 61 72 |bytes).alloc_lar| 00000990 67 65 5f 73 79 73 74 65 6d 5f 68 61 73 68 3a 20 |ge_system_hash: | 000009a0 6e 75 6d 65 6e 74 72 69 65 73 20 31 39 34 35 36 |numentries 19456| 000009b0 30 0a 61 6c 6c 6f 63 5f 6c 61 72 67 65 5f 73 79 |0.alloc_large_sy| 000009c0 73 74 65 6d 5f 68 61 73 68 3a 20 61 64 61 70 74 |stem_hash: adapt| 000009d0 20 30 0a 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d | 0.=============| 000009e0 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d |================| * 00000a20 3d 3d 3d 0a 55 42 53 41 4e 3a 20 55 6e 64 65 66 |===.UBSAN: Undef| 00000a30 69 6e 65 64 20 62 65 68 61 76 69 6f 75 72 20 69 |ined behaviour i| 00000a40 6e 20 2e 2e 2f 6d 6d 2f 70 61 67 65 5f 61 6c 6c |n ../mm/page_all| 00000a50 6f 63 2e 63 3a 37 32 32 31 3a 31 30 0a 73 69 67 |oc.c:7221:10.sig| 00000a60 6e 65 64 20 69 6e 74 65 67 65 72 20 6f 76 65 72 |ned integer over| 00000a70 66 6c 6f 77 3a 0a 32 31 34 37 34 38 33 36 34 37 |flow:.2147483647| 00000a80 20 2b 20 31 
20 63 61 6e 6e 6f 74 20 62 65 20 72 | + 1 cannot be r| 00000a90 65 70 72 65 73 65 6e 74 65 64 20 69 6e 20 74 79 |epresented in ty| 00000aa0 70 65 20 27 69 6e 74 20 5b 34 5d 27 0a 43 50 55 |pe 'int [4]'.CPU| 00000ab0 3a 20 30 20 50 49 44 3a 20 30 20 43 6f 6d 6d 3a |: 0 PID: 0 Comm:| 00000ac0 20 73 77 61 70 70 65 72 20 4e 6f 74 20 74 61 69 | swapper Not tai| 00000ad0 6e 74 65 64 20 34 2e 31 32 2e 30 2d 72 63 31 2d |nted 4.12.0-rc1-| 00000ae0 67 63 63 2d 36 2e 33 2e 31 2d 30 30 31 38 32 2d |gcc-6.3.1-00182-| 00000af0 67 36 37 64 30 36 38 37 32 32 34 61 39 2d 64 69 |g67d0687224a9-di| 00000b00 72 74 79 20 23 38 0a 43 61 6c 6c 20 54 72 61 63 |rty #8.Call Trac| 00000b10 65 3a 0a 5b 63 30 65 30 35 65 61 30 5d 20 5b 63 |e:.[c0e05ea0] [c| 00000b20 30 34 37 38 38 63 34 5d 20 75 62 73 61 6e 5f 65 |04788c4] ubsan_e| 00000b30 70 69 6c 6f 67 75 65 2b 30 78 31 38 2f 30 78 34 |pilogue+0x18/0x4| 00000b40 63 20 28 75 6e 72 65 6c 69 61 62 6c 65 29 0a 5b |c (unreliable).[| 00000b50 63 30 65 30 35 65 62 30 5d 20 5b 63 30 34 37 39 |c0e05eb0] [c0479| 00000b60 32 36 30 5d 20 68 61 6e 64 6c 65 5f 6f 76 65 72 |260] handle_over| 00000b70 66 6c 6f 77 2b 30 78 62 63 2f 30 78 64 63 0a 5b |flow+0xbc/0xdc.[| 00000b80 63 30 65 30 35 66 33 30 5d 20 5b 63 30 61 62 39 |c0e05f30] [c0ab9| 00000b90 38 66 38 5d 20 61 6c 6c 6f 63 5f 6c 61 72 67 65 |8f8] alloc_large| 00000ba0 5f 73 79 73 74 65 6d 5f 68 61 73 68 2b 30 78 65 |_system_hash+0xe| 00000bb0 34 2f 30 78 35 65 63 0a 5b 63 30 65 30 35 66 39 |4/0x5ec.[c0e05f9| 00000bc0 30 5d 20 5b 63 30 61 62 65 30 30 30 5d 20 76 66 |0] [c0abe000] vf| 00000bd0 73 5f 63 61 63 68 65 73 5f 69 6e 69 74 5f 65 61 |s_caches_init_ea| 00000be0 72 6c 79 2b 30 78 34 63 2f 30 78 36 34 0a 5b 63 |rly+0x4c/0x64.[c| 00000bf0 30 65 30 35 66 62 30 5d 20 5b 63 30 61 61 35 32 |0e05fb0] [c0aa52| 00000c00 31 38 5d 20 73 74 61 72 74 5f 6b 65 72 6e 65 6c |18] start_kernel| 00000c10 2b 30 78 32 33 63 2f 30 78 33 63 34 0a 5b 63 30 |+0x23c/0x3c4.[c0| 00000c20 65 30 35 66 66 30 5d 20 5b 30 30 30 
30 33 34 34 |e05ff0] [0000344|
00000c30 63 5d 20 30 78 33 34 34 63 0a 3d 3d 3d 3d 3d 3d |c] 0x344c.======|
00000c40 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d |================|

cheers

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758157AbdEVJ3Y (ORCPT ); Mon, 22 May 2017 05:29:24 -0400
Received: from mx2.suse.de ([195.135.220.15]:33933 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1757923AbdEVJ3U (ORCPT ); Mon, 22 May 2017 05:29:20 -0400
Date: Mon, 22 May 2017 11:29:10 +0200
From: Michal Hocko
To: Pavel Tatashin
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [v4 1/1] mm: Adaptive hash table scaling
Message-ID: <20170522092910.GD8509@dhcp22.suse.cz>
References: <1495300013-653283-1-git-send-email-pasha.tatashin@oracle.com> <1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat 20-05-17 13:06:53, Pavel Tatashin wrote:
[...]
>  /*
> + * Adaptive scale is meant to reduce sizes of hash tables on large memory
> + * machines. As memory size is increased the scale is also increased but at
> + * slower pace. Starting from ADAPT_SCALE_BASE (64G), every time memory
> + * quadruples the scale is increased by one, which means the size of hash table
> + * only doubles, instead of quadrupling as well.
> + */
> +#define ADAPT_SCALE_BASE (64ull << 30)

I have only noticed this email today because my incoming emails stopped
syncing since Friday. But this is _definitely_ not the right approach.
64G for 32b systems is _way_ off. We have only ~1G for the kernel. I've
already proposed scaling up to 32M for 32b systems and Andi seems to be
suggesting the same. So can we fold or apply the following instead?
---
>>From 6a17a022e82ac715a08a9f4707c1c29a58a2225b Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Mon, 22 May 2017 10:45:20 +0200
Subject: [PATCH] mm: fix adaptive hash table sizing for 32b systems

Guenter Roeck has noticed that many qemu boot tests fail on 32b systems and
bisected it to "mm: drop HASH_ADAPT". The patch itself only makes
HASH_ADAPT unconditional for all users, which shouldn't matter. Except
it does, because ADAPT_SCALE_BASE is 64GB, which is out of the 32b word
size, so the adapt_scale loop will never terminate and HASH_EARLY
allocations lock up with the patch, while we do not even try to use the
new hash adapt code because the early allocation succeeded.

Fix this by reducing ADAPT_SCALE_BASE down to 32MB on 32b machines.

Fixes: mm: adaptive hash table scaling
Signed-off-by: Michal Hocko
---
 mm/page_alloc.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a26e19c3e1ff..70c5fc1fb89a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7174,11 +7174,15 @@ static unsigned long __init arch_reserved_kernel_pages(void)
 /*
  * Adaptive scale is meant to reduce sizes of hash tables on large memory
  * machines. As memory size is increased the scale is also increased but at
- * slower pace. Starting from ADAPT_SCALE_BASE (64G), every time memory
- * quadruples the scale is increased by one, which means the size of hash table
- * only doubles, instead of quadrupling as well.
+ * slower pace. Starting from ADAPT_SCALE_BASE (64G on 64b systems and 32M
+ * on 32b), every time memory quadruples the scale is increased by one, which
+ * means the size of hash table only doubles, instead of quadrupling as well.
  */
+#if __BITS_PER_LONG == 64
 #define ADAPT_SCALE_BASE (64ul << 30)
+#else
+#define ADAPT_SCALE_BASE (32ul << 20)
+#endif
 #define ADAPT_SCALE_SHIFT 2
 #define ADAPT_SCALE_NPAGES (ADAPT_SCALE_BASE >> PAGE_SHIFT)
 
-- 
2.11.0

-- 
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759535AbdEVNTP (ORCPT ); Mon, 22 May 2017 09:19:15 -0400
Received: from userp1040.oracle.com ([156.151.31.81]:20633 "EHLO userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1759395AbdEVNTO (ORCPT ); Mon, 22 May 2017 09:19:14 -0400
Subject: Re: [v4 1/1] mm: Adaptive hash table scaling
To: Michal Hocko
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michael Ellerman
References: <1495300013-653283-1-git-send-email-pasha.tatashin@oracle.com> <1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com> <20170522092910.GD8509@dhcp22.suse.cz>
From: Pasha Tatashin
Message-ID:
Date: Mon, 22 May 2017 09:18:58 -0400
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.1.1
MIME-Version: 1.0
In-Reply-To: <20170522092910.GD8509@dhcp22.suse.cz>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
X-Source-IP: userv0021.oracle.com [156.151.31.71]
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

>
> I have only noticed this email today because my incoming emails stopped
> syncing since Friday. But this is _definitely_ not the right approach.
> 64G for 32b systems is _way_ off. We have only ~1G for the kernel. I've
> already proposed scaling up to 32M for 32b systems and Andi seems to be
> suggesting the same. So can we fold or apply the following instead?

Hi Michal,

Thank you for your suggestion. I will update the patch.

64G base for 32bit systems is not meant to be ever used, as the adaptive
scaling for 32bit systems is just not needed. 32M and 64G are going to be
exactly the same on such systems.

Here is the theoretical limit for the max hash size of entries (dentry cache
example):

size of bucket: sizeof(struct hlist_bl_head) = 4 bytes
numentries:     (1 << 32) / PAGE_SIZE = 1048576 (for 4K pages)
hash size:      4b * 1048576 = 4M

In practice it is going to be an order smaller, as number of kernel pages is
less than (1<<32).

However, I will apply your suggestions as there seems to be a problem of
overflowing in comparing ul vs. ull as reported by Michael Ellerman, and
having a large base on 32bit systems will solve this issue. I will revert
back to "ul" all the quantities.

Another approach is to make it a 64 bit only macro like this:

#if __BITS_PER_LONG > 32

#define ADAPT_SCALE_BASE (64ull << 30)
#define ADAPT_SCALE_SHIFT 2
#define ADAPT_SCALE_NPAGES (ADAPT_SCALE_BASE >> PAGE_SHIFT)

#define adapt_scale(high_limit, numentries, scalep) \
        if (!(high_limit)) { \
                unsigned long adapt; \
                for (adapt = ADAPT_SCALE_NPAGES; adapt < \
                     (numentries); adapt <<= ADAPT_SCALE_SHIFT) \
                        (*(scalep))++; \
        }
#else
#define adapt_scale(high_limit, numentries, scalep)
#endif

Pasha

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933818AbdEVNik (ORCPT ); Mon, 22 May 2017 09:38:40 -0400
Received: from mx2.suse.de ([195.135.220.15]:33588 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S932935AbdEVNig (ORCPT ); Mon, 22 May 2017 09:38:36 -0400
Date: Mon, 22 May 2017 15:38:34 +0200
From: Michal Hocko
To: Pasha Tatashin
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michael Ellerman
Subject: Re: [v4 1/1] mm: Adaptive hash table scaling
Message-ID: <20170522133834.GL8509@dhcp22.suse.cz>
References: <1495300013-653283-1-git-send-email-pasha.tatashin@oracle.com>
 <1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com> <20170522092910.GD8509@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon 22-05-17 09:18:58, Pasha Tatashin wrote:
> >
> >I have only noticed this email today because my incoming emails stopped
> >syncing since Friday. But this is _definitely_ not the right approach.
> >64G for 32b systems is _way_ off. We have only ~1G for the kernel. I've
> >already proposed scaling up to 32M for 32b systems and Andi seems to be
> >suggesting the same. So can we fold or apply the following instead?
>
> Hi Michal,
>
> Thank you for your suggestion. I will update the patch.
>
> 64G base for 32bit systems is not meant to be ever used, as the adaptive
> scaling for 32bit systems is just not needed. 32M and 64G are going to be
> exactly the same on such systems.
>
> Here is the theoretical limit for the max hash size of entries (dentry cache
> example):
>
> size of bucket: sizeof(struct hlist_bl_head) = 4 bytes
> numentries:     (1 << 32) / PAGE_SIZE = 1048576 (for 4K pages)
> hash size:      4b * 1048576 = 4M
>
> In practice it is going to be an order smaller, as number of kernel pages is
> less than (1<<32).

I haven't double checked your math but if the above is correct then I
would just go and disable the adaptive scaling for 32b altogether. More
on that below.

> However, I will apply your suggestions as there seems to be a problem of
> overflowing in comparing ul vs. ull as reported by Michael Ellerman, and
> having a large base on 32bit systems will solve this issue. I will revert
> back to "ul" all the quantities.

Yeah, that is just calling for troubles.

> Another approach is to make it a 64 bit only macro like this:
>
> #if __BITS_PER_LONG > 32
>
> #define ADAPT_SCALE_BASE (64ull << 30)
> #define ADAPT_SCALE_SHIFT 2
> #define ADAPT_SCALE_NPAGES (ADAPT_SCALE_BASE >> PAGE_SHIFT)
>
> #define adapt_scale(high_limit, numentries, scalep) \
>         if (!(high_limit)) { \
>                 unsigned long adapt; \
>                 for (adapt = ADAPT_SCALE_NPAGES; adapt < \
>                      (numentries); adapt <<= ADAPT_SCALE_SHIFT) \
>                         (*(scalep))++; \
>         }
> #else
> #define adapt_scale(high_limit, numentries, scalep)
> #endif

This is just too ugly to live, really. If we do not need adaptive
scaling then just make it #if __BITS_PER_LONG around the code. I would
be fine with this. A big fat warning explaining why this is 64b only
would be appropriate.
-- 
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759841AbdEVNlU (ORCPT ); Mon, 22 May 2017 09:41:20 -0400
Received: from aserp1040.oracle.com ([141.146.126.69]:17938 "EHLO aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1759803AbdEVNlS (ORCPT ); Mon, 22 May 2017 09:41:18 -0400
Subject: Re: [v4 1/1] mm: Adaptive hash table scaling
To: Michal Hocko
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michael Ellerman
References: <1495300013-653283-1-git-send-email-pasha.tatashin@oracle.com> <1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com> <20170522092910.GD8509@dhcp22.suse.cz> <20170522133834.GL8509@dhcp22.suse.cz>
From: Pasha Tatashin
Message-ID: <6e81aa26-e43e-6264-e2f9-547531b809f5@oracle.com>
Date: Mon, 22 May 2017 09:41:08 -0400
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.1.1
MIME-Version: 1.0
In-Reply-To: <20170522133834.GL8509@dhcp22.suse.cz>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
X-Source-IP: aserv0022.oracle.com [141.146.126.234]
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

>
> This is just too ugly to live, really. If we do not need adaptive
> scaling then just make it #if __BITS_PER_LONG around the code. I would
> be fine with this. A big fat warning explaining why this is 64b only
> would be appropriate.
>

OK, let me prettify it somehow, and I will send a new patch out.

Pasha