From: Balbir Singh <bsingharora@gmail.com>
To: Li Zhang <zhlcindy@gmail.com>
Cc: akpm@linux-foundation.org, Vlastimil Babka <vbabka@suse.cz>,
mgorman@techsingularity.net,
Michael Ellerman <mpe@ellerman.id.au>,
Anshuman Khandual <khandual@linux.vnet.ibm.com>,
aneesh.kumar@linux.vnet.ibm.com, linux-mm@kvack.org,
linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
Li Zhang <zhlcindy@linux.vnet.ibm.com>
Subject: Re: [PATCH 0/2] mm: Enable page parallel initialisation for Power
Date: Wed, 9 Mar 2016 15:28:31 +1100
Message-ID: <56DFA66F.2020002@gmail.com>
In-Reply-To: <CAD8of+o9zbwae-JM2EtcEnUyZAr43+jQLz1YSVZVKfda+h+Xvg@mail.gmail.com>
On 09/03/16 15:17, Li Zhang wrote:
> On Tue, Mar 8, 2016 at 10:45 PM, Balbir Singh <bsingharora@gmail.com> wrote:
>>
>> On 08/03/16 14:55, Li Zhang wrote:
>>> From: Li Zhang <zhlcindy@linux.vnet.ibm.com>
>>>
>>> Upstream has supported page parallel initialisation for X86 and the
>>> boot time is improved greatly. Some tests have been done for Power.
>>>
>>> Here are the results of my tests with different memory sizes.
>>>
>>> * 4GB memory:
>>> boot time is as follows:
>>> with patch vs without patch: 10.4s vs 24.5s
>>> boot time is improved by about 57%
>>> * 200GB memory:
>>> boot time looks the same with and without patches.
>>> boot time is about 38s
>>> * 32TB memory:
>>> boot time looks the same with and without patches
>>> boot time is about 160s.
>>> The boot time is much shorter than on X86 with 24TB memory.
>>> According to community discussion, an X86 24TB system takes about 694s.
>>>
>>> From the code's point of view, parallel initialisation improves performance by
>>> deferring memory initialisation to kswapd with N kthreads, so it should
>>> improve performance in theory.
>>>
>>> From the test results, on X86 performance is improved greatly with huge
>>> memory. But on the Power platform, it is improved greatly with less than
>>> 100GB memory. For huge memory, it is not improved greatly. But it at least
>>> saves time by using several threads, as the following information
>>> shows (32TB system log):
>>>
>>> [ 22.648169] node 9 initialised, 16607461 pages in 280ms
>>> [ 22.783772] node 3 initialised, 23937243 pages in 410ms
>>> [ 22.858877] node 6 initialised, 29179347 pages in 490ms
>>> [ 22.863252] node 2 initialised, 29179347 pages in 490ms
>>> [ 22.907545] node 0 initialised, 32049614 pages in 540ms
>>> [ 22.920891] node 15 initialised, 32212280 pages in 550ms
>>> [ 22.923236] node 4 initialised, 32306127 pages in 550ms
>>> [ 22.923384] node 12 initialised, 32314319 pages in 550ms
>>> [ 22.924754] node 8 initialised, 32314319 pages in 550ms
>>> [ 22.940780] node 13 initialised, 33353677 pages in 570ms
>>> [ 22.940796] node 11 initialised, 33353677 pages in 570ms
>>> [ 22.941700] node 5 initialised, 33353677 pages in 570ms
>>> [ 22.941721] node 10 initialised, 33353677 pages in 570ms
>>> [ 22.941876] node 7 initialised, 33353677 pages in 570ms
>>> [ 22.944946] node 14 initialised, 33353677 pages in 570ms
>>> [ 22.946063] node 1 initialised, 33345485 pages in 580ms
>>>
>>> It saves roughly 550*16 ms, although that is negligible compared with
>>> the boot time of about 160 seconds. What's more, for huge memory machines the
>>> boot time is much shorter on Power than on x86, even without the patches.
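A minimal sketch of the arithmetic behind that estimate, using the per-node times from the 32TB log above. It assumes the same work done serially would cost the sum of the per-node times, while the parallel cost is the slowest node; the helper names are hypothetical and not part of the patchset.

```c
#include <assert.h>
#include <stddef.h>

/* Per-node init times in ms, copied from the 32TB log above, in log order. */
static const unsigned node_init_ms[16] = {
    280, 410, 490, 490, 540, 550, 550, 550,
    550, 570, 570, 570, 570, 570, 570, 580,
};

/* Serial estimate (assumption): the nodes are initialised one after another. */
static unsigned sum_ms(const unsigned *ms, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += ms[i];
    return total;
}

/* Parallel estimate (assumption): all nodes run at once, the slowest one dominates. */
static unsigned max_ms(const unsigned *ms, size_t n)
{
    assert(n > 0);
    unsigned worst = ms[0];
    for (size_t i = 1; i < n; i++)
        if (ms[i] > worst)
            worst = ms[i];
    return worst;
}
```

With these numbers sum_ms() gives 8410 ms and max_ms() gives 580 ms, so under the stated assumption the saving is about 7.8 seconds out of roughly 160.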
>>>
>>> So it is still worth enabling this patchset for Power.
>>>
>>>
> Hi Balbir,
>
> Thanks for your review.
>
>> The patchset looks good, two questions
>>
>> 1. The patchset is still necessary for
>> a. systems with smaller amount of RAM?
> I think it is. So far I have tested systems with 4GB and 50GB, and
> boot time is improved on both.
> We may test more systems with different memory sizes in the future.
>> b. Theoretically it improves boot time?
> The boot time is improved only slightly for huge memory systems,
> and the gain is negligible.
> But I think it's still necessary to enable this feature.
>
>> 2. the pgdat->node_spanned_pages >> 8 sounds arbitrary
>> On a system with 2TB*16 nodes, it would initialize about 8GB before calling deferred init?
>> Don't we need at least 32GB + space for other early hash allocations?
>> BTW, My expectation was that 32TB would imply 32GB+32GB of large hash allocations early on
> pgdat->node_spanned_pages >> 8 is applied per node: each node keeps
> 1/256 of its spanned memory for early initialisation.
> On a system with 2TB * 16 nodes, that is 8GB per node, so 16*8GB = 128GB in total.
> I am not sure whether it can be reduced to >> 16 and still make sure all
> the architectures with different
> memory sizes work well. This was also mentioned in the earlier
> discussion for X86, so I chose >> 8.
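A small sketch of the shift arithmetic discussed above, assuming 64K pages (page shift 16); the result in bytes does not depend on the page size. The helper names are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Pages kept for early (non-deferred) initialisation on one node:
 * the node's spanned pages shifted right by `shift`. */
static uint64_t early_init_pages(uint64_t node_spanned_pages, unsigned shift)
{
    assert(shift < 64);  /* shifting a 64-bit value by 64 or more is undefined */
    return node_spanned_pages >> shift;
}

/* Convert a page count to bytes for a given page shift (16 means 64K pages). */
static uint64_t pages_to_bytes(uint64_t pages, unsigned page_shift)
{
    assert(page_shift < 64);
    return pages << page_shift;
}
```

For a 2TB node, early_init_pages(pages, 8) corresponds to 8GB, so 16 such nodes give 128GB. The same node with a shift of 16 would keep only 32MB.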
>
> * The code path is as follows:
>
> free_area_init_core ->
> memmap_init ->
> update_defer_init
> #define memmap_init(size, nid, zone, start_pfn) \
> memmap_init_zone((size), (nid), (zone), (start_pfn), MEMMAP_EARLY)
>
> memmap_init_zone works on one zone, but free_area_init_core walks all zones
> and so reaches the highest zone on the node. update_defer_init() then takes
> the maximum amount of memory to initialise on the highest zone of a node and
> reserves it for early initialisation.
>
> static void __paginginit free_area_init_core(struct pglist_data *pgdat)
> {
>         ...
>         for (j = 0; j < MAX_NR_ZONES; j++) {
>                 ...
>                 /* find the highest zone on a node */
>                 memmap_init(size, nid, j, zone_start_pfn);
>                 ...
>         }
> }
>
> * From the dmesg log, after applying this patchset, 123013440K (about 117GB)
> is available early, which is enough for the dentry and inode hash tables
> on this system.
>
> [ 0.000000] Memory: 123013440K/31739871232K available (8000K kernel code, 1856K rwdata, 3384K rodata, 6208K init, 2544K bss, 28531136K reserved, 0K cma-reserved)
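A hedged sketch for checking these numbers: it pulls the two K values out of a "Memory:" line like the one above so they can be converted and compared. The function name is hypothetical, and the format string only covers the fields shown in this log.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse "Memory: <avail>K/<total>K available ..." from a dmesg line.
 * Any timestamp prefix before "Memory:" is skipped.
 * Returns 1 on success, 0 if the line does not match. */
static int parse_memory_line(const char *line,
                             unsigned long long *avail_kb,
                             unsigned long long *total_kb)
{
    assert(line && avail_kb && total_kb);
    const char *p = strstr(line, "Memory:");
    if (!p)
        return 0;
    return sscanf(p, "Memory: %lluK/%lluK", avail_kb, total_kb) == 2;
}
```

For the line above, the available value divided by 1024*1024 is a little over 117GB, and total/available is about 258, which is in line with the 1/256 fraction implied by >> 8.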
>
> Thanks :)
>
Looks good! It seems the real benefit is for smaller systems. Thanks for clarifying.
Please check whether CMA is affected in any way.
Acked-by: Balbir Singh <bsingharora@gmail.com>
Balbir Singh.