* Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
@ 2018-09-07 14:44 ` Will Deacon
0 siblings, 0 replies; 29+ messages in thread
From: Will Deacon @ 2018-09-07 14:44 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Jia He, Russell King, Catalin Marinas, Mark Rutland,
Andrew Morton, Michal Hocko, Wei Yang, Kees Cook, Laura Abbott,
Vladimir Murzin, Philip Derrin, AKASHI Takahiro, James Morse,
Steve Capper, Gioh Kim, Vlastimil Babka, Mel Gorman,
Johannes Weiner, Kemi Wang, Petr Tesarik, YASUAKI ISHIMATSU,
Andrey Ryabinin, Nikolay Borisov, Daniel Jordan, Daniel Vacek,
Eugeniu Rosca, linux-arm-kernel, Linux Kernel Mailing List,
Linux-MM, Jia He
On Thu, Sep 06, 2018 at 01:24:22PM +0200, Ard Biesheuvel wrote:
> On 22 August 2018 at 05:07, Jia He <hejianet@gmail.com> wrote:
> > Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
> > where possible") optimized the loop in memmap_init_zone(). But it causes
> > possible panic bug. So Daniel Vacek reverted it later.
> >
> > But as suggested by Daniel Vacek, it is fine to using memblock to skip
> > gaps and finding next valid frame with CONFIG_HAVE_ARCH_PFN_VALID.
> >
> > More from what Daniel said:
> > "On arm and arm64, memblock is used by default. But generic version of
> > pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
> > not always return the next valid one but skips more resulting in some
> > valid frames to be skipped (as if they were invalid). And that's why
> > kernel was eventually crashing on some !arm machines."
> >
> > About the performance consideration:
> > As said by James in b92df1de5,
> > "I have tested this patch on a virtual model of a Samurai CPU with a
> > sparse memory map. The kernel boot time drops from 109 to 62 seconds."
> > Thus it would be better if we remain memblock_next_valid_pfn on arm/arm64.
> >
> > Besides we can remain memblock_next_valid_pfn, there is still some room
> > for improvement. After this set, I can see the time overhead of memmap_init
> > is reduced from 27956us to 13537us in my armv8a server(QDF2400 with 96G
> > memory, pagesize 64k). I believe arm server will benefit more if memory is
> > larger than TBs
> >
>
> OK so we can summarize the benefits of this series as follows:
> - boot time on a virtual model of a Samurai CPU drops from 109 to 62 seconds
> - boot time on a QDF2400 arm64 server with 96 GB of RAM drops by ~15
> *milliseconds*
>
> Google was not very helpful in figuring out what a Samurai CPU is and
> why we should care about the boot time of Linux running on a virtual
> model of it, and the 15 ms speedup is not that compelling either.
>
> Apologies to Jia that it took 11 revisions to reach this conclusion,
> but in /my/ opinion, tweaking the fragile memblock/pfn handling code
> for this reason is totally unjustified, and we're better off
> disregarding these patches.
Oh, we're talking about a *simulator* for the significant boot time
improvement here? I didn't realise that, so I agree that the premise of
this patch set looks pretty questionable given how much "fun" we've had
with the memmap on arm and arm64.
Will
^ permalink raw reply [flat|nested] 29+ messages in thread* Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
2018-09-07 14:44 ` Will Deacon
(?)
@ 2018-09-14 18:50 ` Eugeniu Rosca
-1 siblings, 0 replies; 29+ messages in thread
From: Eugeniu Rosca @ 2018-09-14 18:50 UTC (permalink / raw)
To: Jia He, Will Deacon, Ard Biesheuvel
Cc: Russell King, Catalin Marinas, Mark Rutland, Andrew Morton,
Michal Hocko, Wei Yang, Kees Cook, Laura Abbott, Vladimir Murzin,
Philip Derrin, AKASHI Takahiro, James Morse, Steve Capper,
Gioh Kim, Vlastimil Babka, Mel Gorman, Johannes Weiner, Kemi Wang,
Petr Tesarik, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov,
Daniel Jordan, Daniel Vacek, linux-arm-kernel,
Linux Kernel Mailing List, Linux-MM, Jia He, George G. Davis,
Vladimir Zapolskiy, Andy Lowe, linux-renesas-soc, Eugeniu Rosca,
Eugeniu Rosca
+ Renesas people
Hello Will, hello Ard,
On Fri, Sep 07, 2018 at 03:44:47PM +0100, Will Deacon wrote:
> On Thu, Sep 06, 2018 at 01:24:22PM +0200, Ard Biesheuvel wrote:
> > OK so we can summarize the benefits of this series as follows:
> > - boot time on a virtual model of a Samurai CPU drops from 109 to 62 seconds
> > - boot time on a QDF2400 arm64 server with 96 GB of RAM drops by ~15
> > *milliseconds*
> >
> > Google was not very helpful in figuring out what a Samurai CPU is and
> > why we should care about the boot time of Linux running on a virtual
> > model of it, and the 15 ms speedup is not that compelling either.
> >
> > Apologies to Jia that it took 11 revisions to reach this conclusion,
> > but in /my/ opinion, tweaking the fragile memblock/pfn handling code
> > for this reason is totally unjustified, and we're better off
> > disregarding these patches.
>
> Oh, we're talking about a *simulator* for the significant boot time
> improvement here? I didn't realise that, so I agree that the premise of
> this patch set looks pretty questionable given how much "fun" we've had
> with the memmap on arm and arm64.
>
> Will
Similar to https://lkml.org/lkml/2018/1/24/420, my measurements show that
the boot time of R-Car H3-ES2.0 Salvator-X (having 4GiB RAM) is decreased
by ~135-140ms with this patch-set applied on top of v4.19-rc3.
I agree that in the Desktop realm you would barely perceive the 140ms
difference, but saving 140ms on the automotive SoC (designed for products
which must comply with 2s-to-rear-view-camera NHTSA US regulations) *is*
significant.
FWIW, cppcheck and `checkpatch --strict` report style issues for
patches #2 and #3. I hope these can be fixed and the review process
can go on? From functional standpoint, I did some dynamic testing on
H3-Salvator-X with UBSAN/KASAN=y and didn't observe any regressions, so:
Tested-by: Eugeniu Rosca <erosca@de.adit-jv.com>
Best regards,
Eugeniu.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
@ 2018-09-14 18:50 ` Eugeniu Rosca
0 siblings, 0 replies; 29+ messages in thread
From: Eugeniu Rosca @ 2018-09-14 18:50 UTC (permalink / raw)
To: Jia He, Will Deacon, Ard Biesheuvel
Cc: Russell King, Catalin Marinas, Mark Rutland, Andrew Morton,
Michal Hocko, Wei Yang, Kees Cook, Laura Abbott, Vladimir Murzin,
Philip Derrin, AKASHI Takahiro, James Morse, Steve Capper,
Gioh Kim, Vlastimil Babka, Mel Gorman, Johannes Weiner, Kemi Wang,
Petr Tesarik, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov,
Daniel Jordan, Daniel Vacek, linux-arm-kernel,
Linux Kernel Mailing List, Linux-MM, Jia He, George G. Davis,
Vladimir Zapolskiy, Andy Lowe, linux-renesas-soc, Eugeniu Rosca,
Eugeniu Rosca
+ Renesas people
Hello Will, hello Ard,
On Fri, Sep 07, 2018 at 03:44:47PM +0100, Will Deacon wrote:
> On Thu, Sep 06, 2018 at 01:24:22PM +0200, Ard Biesheuvel wrote:
> > OK so we can summarize the benefits of this series as follows:
> > - boot time on a virtual model of a Samurai CPU drops from 109 to 62 seconds
> > - boot time on a QDF2400 arm64 server with 96 GB of RAM drops by ~15
> > *milliseconds*
> >
> > Google was not very helpful in figuring out what a Samurai CPU is and
> > why we should care about the boot time of Linux running on a virtual
> > model of it, and the 15 ms speedup is not that compelling either.
> >
> > Apologies to Jia that it took 11 revisions to reach this conclusion,
> > but in /my/ opinion, tweaking the fragile memblock/pfn handling code
> > for this reason is totally unjustified, and we're better off
> > disregarding these patches.
>
> Oh, we're talking about a *simulator* for the significant boot time
> improvement here? I didn't realise that, so I agree that the premise of
> this patch set looks pretty questionable given how much "fun" we've had
> with the memmap on arm and arm64.
>
> Will
Similar to https://lkml.org/lkml/2018/1/24/420, my measurements show that
the boot time of R-Car H3-ES2.0 Salvator-X (having 4GiB RAM) is decreased
by ~135-140ms with this patch-set applied on top of v4.19-rc3.
I agree that in the Desktop realm you would barely perceive the 140ms
difference, but saving 140ms on the automotive SoC (designed for products
which must comply with 2s-to-rear-view-camera NHTSA US regulations) *is*
significant.
FWIW, cppcheck and `checkpatch --strict` report style issues for
patches #2 and #3. I hope these can be fixed and the review process
can go on? From functional standpoint, I did some dynamic testing on
H3-Salvator-X with UBSAN/KASAN=y and didn't observe any regressions, so:
Tested-by: Eugeniu Rosca <erosca@de.adit-jv.com>
Best regards,
Eugeniu.
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
@ 2018-09-14 18:50 ` Eugeniu Rosca
0 siblings, 0 replies; 29+ messages in thread
From: Eugeniu Rosca @ 2018-09-14 18:50 UTC (permalink / raw)
To: linux-arm-kernel
+ Renesas people
Hello Will, hello Ard,
On Fri, Sep 07, 2018 at 03:44:47PM +0100, Will Deacon wrote:
> On Thu, Sep 06, 2018 at 01:24:22PM +0200, Ard Biesheuvel wrote:
> > OK so we can summarize the benefits of this series as follows:
> > - boot time on a virtual model of a Samurai CPU drops from 109 to 62 seconds
> > - boot time on a QDF2400 arm64 server with 96 GB of RAM drops by ~15
> > *milliseconds*
> >
> > Google was not very helpful in figuring out what a Samurai CPU is and
> > why we should care about the boot time of Linux running on a virtual
> > model of it, and the 15 ms speedup is not that compelling either.
> >
> > Apologies to Jia that it took 11 revisions to reach this conclusion,
> > but in /my/ opinion, tweaking the fragile memblock/pfn handling code
> > for this reason is totally unjustified, and we're better off
> > disregarding these patches.
>
> Oh, we're talking about a *simulator* for the significant boot time
> improvement here? I didn't realise that, so I agree that the premise of
> this patch set looks pretty questionable given how much "fun" we've had
> with the memmap on arm and arm64.
>
> Will
Similar to https://lkml.org/lkml/2018/1/24/420, my measurements show that
the boot time of R-Car H3-ES2.0 Salvator-X (having 4GiB RAM) is decreased
by ~135-140ms with this patch-set applied on top of v4.19-rc3.
I agree that in the Desktop realm you would barely perceive the 140ms
difference, but saving 140ms on the automotive SoC (designed for products
which must comply with 2s-to-rear-view-camera NHTSA US regulations) *is*
significant.
FWIW, cppcheck and `checkpatch --strict` report style issues for
patches #2 and #3. I hope these can be fixed and the review process
can go on? From functional standpoint, I did some dynamic testing on
H3-Salvator-X with UBSAN/KASAN=y and didn't observe any regressions, so:
Tested-by: Eugeniu Rosca <erosca@de.adit-jv.com>
Best regards,
Eugeniu.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
2018-09-07 14:44 ` Will Deacon
@ 2019-06-08 4:22 ` Hanjun Guo
-1 siblings, 0 replies; 29+ messages in thread
From: Hanjun Guo @ 2019-06-08 4:22 UTC (permalink / raw)
To: Will Deacon, Ard Biesheuvel
Cc: Mark Rutland, Michal Hocko, Catalin Marinas, Kemi Wang, Wei Yang,
Linux-MM, Eugeniu Rosca, Petr Tesarik, Nikolay Borisov,
Russell King, Daniel Jordan, AKASHI Takahiro, Mel Gorman,
Andrey Ryabinin, Laura Abbott, Daniel Vacek, Vladimir Murzin,
Kees Cook, Vlastimil Babka, Johannes Weiner, YASUAKI ISHIMATSU,
Jia He, Jia He, Gioh Kim, linux-arm-kernel, Steve Capper,
Linux Kernel Mailing List, James Morse, Philip Derrin,
Andrew Morton
Hi Ard, Will,
This week we were trying to debug an issue of time consuming in mem_init(),
and leading to this similar solution form Jia He, so I would like to bring this
thread back, please see my detail test result below.
On 2018/9/7 22:44, Will Deacon wrote:
> On Thu, Sep 06, 2018 at 01:24:22PM +0200, Ard Biesheuvel wrote:
>> On 22 August 2018 at 05:07, Jia He <hejianet@gmail.com> wrote:
>>> Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
>>> where possible") optimized the loop in memmap_init_zone(). But it causes
>>> possible panic bug. So Daniel Vacek reverted it later.
>>>
>>> But as suggested by Daniel Vacek, it is fine to using memblock to skip
>>> gaps and finding next valid frame with CONFIG_HAVE_ARCH_PFN_VALID.
>>>
>>> More from what Daniel said:
>>> "On arm and arm64, memblock is used by default. But generic version of
>>> pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
>>> not always return the next valid one but skips more resulting in some
>>> valid frames to be skipped (as if they were invalid). And that's why
>>> kernel was eventually crashing on some !arm machines."
>>>
>>> About the performance consideration:
>>> As said by James in b92df1de5,
>>> "I have tested this patch on a virtual model of a Samurai CPU with a
>>> sparse memory map. The kernel boot time drops from 109 to 62 seconds."
>>> Thus it would be better if we remain memblock_next_valid_pfn on arm/arm64.
>>>
>>> Besides we can remain memblock_next_valid_pfn, there is still some room
>>> for improvement. After this set, I can see the time overhead of memmap_init
>>> is reduced from 27956us to 13537us in my armv8a server(QDF2400 with 96G
>>> memory, pagesize 64k). I believe arm server will benefit more if memory is
>>> larger than TBs
>>>
>>
>> OK so we can summarize the benefits of this series as follows:
>> - boot time on a virtual model of a Samurai CPU drops from 109 to 62 seconds
>> - boot time on a QDF2400 arm64 server with 96 GB of RAM drops by ~15
>> *milliseconds*
>>
>> Google was not very helpful in figuring out what a Samurai CPU is and
>> why we should care about the boot time of Linux running on a virtual
>> model of it, and the 15 ms speedup is not that compelling either.
Testing this patch set on top of Kunpeng 920 based ARM64 server, with
384G memory in total, we got the time consuming below
without this patch set with this patch set
mem_init() 13310ms 1415ms
So we got about 8x speedup on this machine, which is very impressive.
The time consuming is related the memory DIMM size and where to locate those
memory DIMMs in the slots. In above case, we are using 16G memory DIMM.
We also tested 1T memory with 64G size for each memory DIMM on another ARM64
machine, the time consuming reduced from 20s to 2s (I think it's related to
firmware implementations).
>>
>> Apologies to Jia that it took 11 revisions to reach this conclusion,
>> but in /my/ opinion, tweaking the fragile memblock/pfn handling code
>> for this reason is totally unjustified, and we're better off
>> disregarding these patches.
Indeed this patch set has a bug, For exampe, if we have 3 regions which
is [a, b] [c, d] [e, f] if address of pfn is bigger than the end address of
last region, we will increase early_region_idx to count of region, which is
out of bound of the regions. Fixed by patch below,
mm/memblock.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/memblock.c b/mm/memblock.c
index 8279295..8283bf0 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1252,13 +1252,17 @@ unsigned long __init_memblock memblock_next_valid_pfn(unsigned long pfn)
if (pfn >= start_pfn && pfn < end_pfn)
return pfn;
- early_region_idx++;
+ /* try slow path */
+ if (++early_region_idx == type->cnt)
+ goto slow_path;
+
next_start_pfn = PFN_DOWN(regions[early_region_idx].base);
if (pfn >= end_pfn && pfn <= next_start_pfn)
return next_start_pfn;
}
+slow_path:
/* slow path, do the binary searching */
do {
mid = (right + left) / 2;
As the really impressive speedup on our ARM64 server system, could you reconsider
this patch set for merge? if you want more data I'm willing to clarify and give
more test.
Thanks
Hanjun
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply related [flat|nested] 29+ messages in thread* Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
@ 2019-06-08 4:22 ` Hanjun Guo
0 siblings, 0 replies; 29+ messages in thread
From: Hanjun Guo @ 2019-06-08 4:22 UTC (permalink / raw)
To: Will Deacon, Ard Biesheuvel
Cc: Mark Rutland, Michal Hocko, Catalin Marinas, Wei Yang, Linux-MM,
Jia He, Eugeniu Rosca, Petr Tesarik, Nikolay Borisov,
Russell King, Daniel Jordan, AKASHI Takahiro, Gioh Kim,
Andrey Ryabinin, Laura Abbott, Daniel Vacek, Mel Gorman,
Vladimir Murzin, Kees Cook, Philip Derrin, YASUAKI ISHIMATSU,
Jia He, Kemi Wang, Vlastimil Babka, linux-arm-kernel,
Steve Capper, Linux Kernel Mailing List, James Morse,
Johannes Weiner, Andrew Morton
Hi Ard, Will,
This week we were trying to debug an issue of time consuming in mem_init(),
and leading to this similar solution form Jia He, so I would like to bring this
thread back, please see my detail test result below.
On 2018/9/7 22:44, Will Deacon wrote:
> On Thu, Sep 06, 2018 at 01:24:22PM +0200, Ard Biesheuvel wrote:
>> On 22 August 2018 at 05:07, Jia He <hejianet@gmail.com> wrote:
>>> Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
>>> where possible") optimized the loop in memmap_init_zone(). But it causes
>>> possible panic bug. So Daniel Vacek reverted it later.
>>>
>>> But as suggested by Daniel Vacek, it is fine to using memblock to skip
>>> gaps and finding next valid frame with CONFIG_HAVE_ARCH_PFN_VALID.
>>>
>>> More from what Daniel said:
>>> "On arm and arm64, memblock is used by default. But generic version of
>>> pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
>>> not always return the next valid one but skips more resulting in some
>>> valid frames to be skipped (as if they were invalid). And that's why
>>> kernel was eventually crashing on some !arm machines."
>>>
>>> About the performance consideration:
>>> As said by James in b92df1de5,
>>> "I have tested this patch on a virtual model of a Samurai CPU with a
>>> sparse memory map. The kernel boot time drops from 109 to 62 seconds."
>>> Thus it would be better if we remain memblock_next_valid_pfn on arm/arm64.
>>>
>>> Besides we can remain memblock_next_valid_pfn, there is still some room
>>> for improvement. After this set, I can see the time overhead of memmap_init
>>> is reduced from 27956us to 13537us in my armv8a server(QDF2400 with 96G
>>> memory, pagesize 64k). I believe arm server will benefit more if memory is
>>> larger than TBs
>>>
>>
>> OK so we can summarize the benefits of this series as follows:
>> - boot time on a virtual model of a Samurai CPU drops from 109 to 62 seconds
>> - boot time on a QDF2400 arm64 server with 96 GB of RAM drops by ~15
>> *milliseconds*
>>
>> Google was not very helpful in figuring out what a Samurai CPU is and
>> why we should care about the boot time of Linux running on a virtual
>> model of it, and the 15 ms speedup is not that compelling either.
Testing this patch set on top of Kunpeng 920 based ARM64 server, with
384G memory in total, we got the time consuming below
without this patch set with this patch set
mem_init() 13310ms 1415ms
So we got about 8x speedup on this machine, which is very impressive.
The time consuming is related the memory DIMM size and where to locate those
memory DIMMs in the slots. In above case, we are using 16G memory DIMM.
We also tested 1T memory with 64G size for each memory DIMM on another ARM64
machine, the time consuming reduced from 20s to 2s (I think it's related to
firmware implementations).
>>
>> Apologies to Jia that it took 11 revisions to reach this conclusion,
>> but in /my/ opinion, tweaking the fragile memblock/pfn handling code
>> for this reason is totally unjustified, and we're better off
>> disregarding these patches.
Indeed this patch set has a bug, For exampe, if we have 3 regions which
is [a, b] [c, d] [e, f] if address of pfn is bigger than the end address of
last region, we will increase early_region_idx to count of region, which is
out of bound of the regions. Fixed by patch below,
mm/memblock.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/memblock.c b/mm/memblock.c
index 8279295..8283bf0 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1252,13 +1252,17 @@ unsigned long __init_memblock memblock_next_valid_pfn(unsigned long pfn)
if (pfn >= start_pfn && pfn < end_pfn)
return pfn;
- early_region_idx++;
+ /* try slow path */
+ if (++early_region_idx == type->cnt)
+ goto slow_path;
+
next_start_pfn = PFN_DOWN(regions[early_region_idx].base);
if (pfn >= end_pfn && pfn <= next_start_pfn)
return next_start_pfn;
}
+slow_path:
/* slow path, do the binary searching */
do {
mid = (right + left) / 2;
As the really impressive speedup on our ARM64 server system, could you reconsider
this patch set for merge? if you want more data I'm willing to clarify and give
more test.
Thanks
Hanjun
^ permalink raw reply related [flat|nested] 29+ messages in thread* Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
2019-06-08 4:22 ` Hanjun Guo
@ 2019-06-10 13:16 ` Ard Biesheuvel
-1 siblings, 0 replies; 29+ messages in thread
From: Ard Biesheuvel @ 2019-06-10 13:16 UTC (permalink / raw)
To: Hanjun Guo
Cc: Mark Rutland, Michal Hocko, Catalin Marinas, Will Deacon,
Wei Yang, Linux-MM, James Morse, Eugeniu Rosca, Petr Tesarik,
Nikolay Borisov, Jia He, Russell King, Daniel Jordan,
AKASHI Takahiro, Mel Gorman, Andrey Ryabinin, Laura Abbott,
Daniel Vacek, Vladimir Murzin, Kees Cook, Philip Derrin,
YASUAKI ISHIMATSU, Jia He, Ard Biesheuvel, Kemi Wang,
Vlastimil Babka, linux-arm-kernel, Steve Capper,
Linux Kernel Mailing List, Gioh Kim, Johannes Weiner,
Andrew Morton
On Sat, 8 Jun 2019 at 06:22, Hanjun Guo <guohanjun@huawei.com> wrote:
>
> Hi Ard, Will,
>
> This week we were trying to debug an issue of time consuming in mem_init(),
> and leading to this similar solution form Jia He, so I would like to bring this
> thread back, please see my detail test result below.
>
> On 2018/9/7 22:44, Will Deacon wrote:
> > On Thu, Sep 06, 2018 at 01:24:22PM +0200, Ard Biesheuvel wrote:
> >> On 22 August 2018 at 05:07, Jia He <hejianet@gmail.com> wrote:
> >>> Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
> >>> where possible") optimized the loop in memmap_init_zone(). But it causes
> >>> possible panic bug. So Daniel Vacek reverted it later.
> >>>
> >>> But as suggested by Daniel Vacek, it is fine to using memblock to skip
> >>> gaps and finding next valid frame with CONFIG_HAVE_ARCH_PFN_VALID.
> >>>
> >>> More from what Daniel said:
> >>> "On arm and arm64, memblock is used by default. But generic version of
> >>> pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
> >>> not always return the next valid one but skips more resulting in some
> >>> valid frames to be skipped (as if they were invalid). And that's why
> >>> kernel was eventually crashing on some !arm machines."
> >>>
> >>> About the performance consideration:
> >>> As said by James in b92df1de5,
> >>> "I have tested this patch on a virtual model of a Samurai CPU with a
> >>> sparse memory map. The kernel boot time drops from 109 to 62 seconds."
> >>> Thus it would be better if we remain memblock_next_valid_pfn on arm/arm64.
> >>>
> >>> Besides we can remain memblock_next_valid_pfn, there is still some room
> >>> for improvement. After this set, I can see the time overhead of memmap_init
> >>> is reduced from 27956us to 13537us in my armv8a server(QDF2400 with 96G
> >>> memory, pagesize 64k). I believe arm server will benefit more if memory is
> >>> larger than TBs
> >>>
> >>
> >> OK so we can summarize the benefits of this series as follows:
> >> - boot time on a virtual model of a Samurai CPU drops from 109 to 62 seconds
> >> - boot time on a QDF2400 arm64 server with 96 GB of RAM drops by ~15
> >> *milliseconds*
> >>
> >> Google was not very helpful in figuring out what a Samurai CPU is and
> >> why we should care about the boot time of Linux running on a virtual
> >> model of it, and the 15 ms speedup is not that compelling either.
>
> Testing this patch set on top of Kunpeng 920 based ARM64 server, with
> 384G memory in total, we got the time consuming below
>
> without this patch set with this patch set
> mem_init() 13310ms 1415ms
>
> So we got about 8x speedup on this machine, which is very impressive.
>
Yes, this is impressive. But does it matter in the grand scheme of
things? How much time does this system take to arrive at this point
from power on?
> The time consuming is related the memory DIMM size and where to locate those
> memory DIMMs in the slots. In above case, we are using 16G memory DIMM.
> We also tested 1T memory with 64G size for each memory DIMM on another ARM64
> machine, the time consuming reduced from 20s to 2s (I think it's related to
> firmware implementations).
>
I agree that this optimization looks good in isolation, but the fact
that you spotted a bug justifies my skepticism at the time. On the
other hand, now that we have several independent reports (from you,
but also from the Renesas folks) that the speedup is worthwhile for
real world use cases, I think it does make sense to revisit it.
So what I would like to see is the patch set being proposed again,
with the new data points added for documentation. Also, the commit
logs need to crystal clear about how the meaning of PFN validity
differs between ARM and other architectures, and why the assumptions
that the optimization is based on are guaranteed to hold.
> >>
> >> Apologies to Jia that it took 11 revisions to reach this conclusion,
> >> but in /my/ opinion, tweaking the fragile memblock/pfn handling code
> >> for this reason is totally unjustified, and we're better off
> >> disregarding these patches.
>
> Indeed this patch set has a bug, For exampe, if we have 3 regions which
> is [a, b] [c, d] [e, f] if address of pfn is bigger than the end address of
> last region, we will increase early_region_idx to count of region, which is
> out of bound of the regions. Fixed by patch below,
>
> mm/memblock.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 8279295..8283bf0 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1252,13 +1252,17 @@ unsigned long __init_memblock memblock_next_valid_pfn(unsigned long pfn)
> if (pfn >= start_pfn && pfn < end_pfn)
> return pfn;
>
> - early_region_idx++;
> + /* try slow path */
> + if (++early_region_idx == type->cnt)
> + goto slow_path;
> +
> next_start_pfn = PFN_DOWN(regions[early_region_idx].base);
>
> if (pfn >= end_pfn && pfn <= next_start_pfn)
> return next_start_pfn;
> }
>
> +slow_path:
> /* slow path, do the binary searching */
> do {
> mid = (right + left) / 2;
>
> As the really impressive speedup on our ARM64 server system, could you reconsider
> this patch set for merge? if you want more data I'm willing to clarify and give
> more test.
>
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 29+ messages in thread* Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
@ 2019-06-10 13:16 ` Ard Biesheuvel
0 siblings, 0 replies; 29+ messages in thread
From: Ard Biesheuvel @ 2019-06-10 13:16 UTC (permalink / raw)
To: Hanjun Guo
Cc: Will Deacon, Ard Biesheuvel, Mark Rutland, Michal Hocko,
Catalin Marinas, Kemi Wang, Wei Yang, Linux-MM, Eugeniu Rosca,
Petr Tesarik, Nikolay Borisov, Russell King, Daniel Jordan,
AKASHI Takahiro, Mel Gorman, Andrey Ryabinin, Laura Abbott,
Daniel Vacek, Vladimir Murzin, Kees Cook, Vlastimil Babka,
Johannes Weiner, YASUAKI ISHIMATSU, Jia He, Jia He, Gioh Kim,
linux-arm-kernel, Steve Capper, Linux Kernel Mailing List,
James Morse, Philip Derrin, Andrew Morton
On Sat, 8 Jun 2019 at 06:22, Hanjun Guo <guohanjun@huawei.com> wrote:
>
> Hi Ard, Will,
>
> This week we were trying to debug an issue of time consuming in mem_init(),
> and leading to this similar solution form Jia He, so I would like to bring this
> thread back, please see my detail test result below.
>
> On 2018/9/7 22:44, Will Deacon wrote:
> > On Thu, Sep 06, 2018 at 01:24:22PM +0200, Ard Biesheuvel wrote:
> >> On 22 August 2018 at 05:07, Jia He <hejianet@gmail.com> wrote:
> >>> Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
> >>> where possible") optimized the loop in memmap_init_zone(). But it causes
> >>> possible panic bug. So Daniel Vacek reverted it later.
> >>>
> >>> But as suggested by Daniel Vacek, it is fine to using memblock to skip
> >>> gaps and finding next valid frame with CONFIG_HAVE_ARCH_PFN_VALID.
> >>>
> >>> More from what Daniel said:
> >>> "On arm and arm64, memblock is used by default. But generic version of
> >>> pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
> >>> not always return the next valid one but skips more resulting in some
> >>> valid frames to be skipped (as if they were invalid). And that's why
> >>> kernel was eventually crashing on some !arm machines."
> >>>
> >>> About the performance consideration:
> >>> As said by James in b92df1de5,
> >>> "I have tested this patch on a virtual model of a Samurai CPU with a
> >>> sparse memory map. The kernel boot time drops from 109 to 62 seconds."
> >>> Thus it would be better if we remain memblock_next_valid_pfn on arm/arm64.
> >>>
> >>> Besides we can remain memblock_next_valid_pfn, there is still some room
> >>> for improvement. After this set, I can see the time overhead of memmap_init
> >>> is reduced from 27956us to 13537us in my armv8a server(QDF2400 with 96G
> >>> memory, pagesize 64k). I believe arm server will benefit more if memory is
> >>> larger than TBs
> >>>
> >>
> >> OK so we can summarize the benefits of this series as follows:
> >> - boot time on a virtual model of a Samurai CPU drops from 109 to 62 seconds
> >> - boot time on a QDF2400 arm64 server with 96 GB of RAM drops by ~15
> >> *milliseconds*
> >>
> >> Google was not very helpful in figuring out what a Samurai CPU is and
> >> why we should care about the boot time of Linux running on a virtual
> >> model of it, and the 15 ms speedup is not that compelling either.
>
> Testing this patch set on top of Kunpeng 920 based ARM64 server, with
> 384G memory in total, we got the time consuming below
>
> without this patch set with this patch set
> mem_init() 13310ms 1415ms
>
> So we got about 8x speedup on this machine, which is very impressive.
>
Yes, this is impressive. But does it matter in the grand scheme of
things? How much time does this system take to arrive at this point
from power on?
> The time consuming is related the memory DIMM size and where to locate those
> memory DIMMs in the slots. In above case, we are using 16G memory DIMM.
> We also tested 1T memory with 64G size for each memory DIMM on another ARM64
> machine, the time consuming reduced from 20s to 2s (I think it's related to
> firmware implementations).
>
I agree that this optimization looks good in isolation, but the fact
that you spotted a bug justifies my skepticism at the time. On the
other hand, now that we have several independent reports (from you,
but also from the Renesas folks) that the speedup is worthwhile for
real world use cases, I think it does make sense to revisit it.
So what I would like to see is the patch set being proposed again,
with the new data points added for documentation. Also, the commit
logs need to crystal clear about how the meaning of PFN validity
differs between ARM and other architectures, and why the assumptions
that the optimization is based on are guaranteed to hold.
> >>
> >> Apologies to Jia that it took 11 revisions to reach this conclusion,
> >> but in /my/ opinion, tweaking the fragile memblock/pfn handling code
> >> for this reason is totally unjustified, and we're better off
> >> disregarding these patches.
>
> Indeed this patch set has a bug, For exampe, if we have 3 regions which
> is [a, b] [c, d] [e, f] if address of pfn is bigger than the end address of
> last region, we will increase early_region_idx to count of region, which is
> out of bound of the regions. Fixed by patch below,
>
> mm/memblock.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 8279295..8283bf0 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1252,13 +1252,17 @@ unsigned long __init_memblock memblock_next_valid_pfn(unsigned long pfn)
> if (pfn >= start_pfn && pfn < end_pfn)
> return pfn;
>
> - early_region_idx++;
> + /* try slow path */
> + if (++early_region_idx == type->cnt)
> + goto slow_path;
> +
> next_start_pfn = PFN_DOWN(regions[early_region_idx].base);
>
> if (pfn >= end_pfn && pfn <= next_start_pfn)
> return next_start_pfn;
> }
>
> +slow_path:
> /* slow path, do the binary searching */
> do {
> mid = (right + left) / 2;
>
> As the really impressive speedup on our ARM64 server system, could you reconsider
> this patch set for merge? if you want more data I'm willing to clarify and give
> more test.
>
^ permalink raw reply [flat|nested] 29+ messages in thread* Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
2019-06-10 13:16 ` Ard Biesheuvel
@ 2019-06-11 15:18 ` Hanjun Guo
-1 siblings, 0 replies; 29+ messages in thread
From: Hanjun Guo @ 2019-06-11 15:18 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Mark Rutland, Michal Hocko, Catalin Marinas, Will Deacon,
Wei Yang, Linux-MM, James Morse, Eugeniu Rosca, Petr Tesarik,
Nikolay Borisov, Jia He, Russell King, Daniel Jordan,
AKASHI Takahiro, Mel Gorman, Andrey Ryabinin, Laura Abbott,
Daniel Vacek, Vladimir Murzin, Kees Cook, Philip Derrin,
YASUAKI ISHIMATSU, Jia He, Ard Biesheuvel, Kemi Wang,
Vlastimil Babka, linux-arm-kernel, Steve Capper,
Linux Kernel Mailing List, Gioh Kim, Johannes Weiner,
Andrew Morton
Hello Ard,
Thanks for the reply, please see my comments inline.
On 2019/6/10 21:16, Ard Biesheuvel wrote:
> On Sat, 8 Jun 2019 at 06:22, Hanjun Guo <guohanjun@huawei.com> wrote:
>>
>> Hi Ard, Will,
>>
>> This week we were trying to debug an issue of time consuming in mem_init(),
>> and leading to this similar solution form Jia He, so I would like to bring this
>> thread back, please see my detail test result below.
>>
>> On 2018/9/7 22:44, Will Deacon wrote:
>>> On Thu, Sep 06, 2018 at 01:24:22PM +0200, Ard Biesheuvel wrote:
>>>> On 22 August 2018 at 05:07, Jia He <hejianet@gmail.com> wrote:
>>>>> Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
>>>>> where possible") optimized the loop in memmap_init_zone(). But it causes
>>>>> possible panic bug. So Daniel Vacek reverted it later.
>>>>>
>>>>> But as suggested by Daniel Vacek, it is fine to using memblock to skip
>>>>> gaps and finding next valid frame with CONFIG_HAVE_ARCH_PFN_VALID.
>>>>>
>>>>> More from what Daniel said:
>>>>> "On arm and arm64, memblock is used by default. But generic version of
>>>>> pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
>>>>> not always return the next valid one but skips more resulting in some
>>>>> valid frames to be skipped (as if they were invalid). And that's why
>>>>> kernel was eventually crashing on some !arm machines."
>>>>>
>>>>> About the performance consideration:
>>>>> As said by James in b92df1de5,
>>>>> "I have tested this patch on a virtual model of a Samurai CPU with a
>>>>> sparse memory map. The kernel boot time drops from 109 to 62 seconds."
>>>>> Thus it would be better if we remain memblock_next_valid_pfn on arm/arm64.
>>>>>
>>>>> Besides we can remain memblock_next_valid_pfn, there is still some room
>>>>> for improvement. After this set, I can see the time overhead of memmap_init
>>>>> is reduced from 27956us to 13537us in my armv8a server(QDF2400 with 96G
>>>>> memory, pagesize 64k). I believe arm server will benefit more if memory is
>>>>> larger than TBs
>>>>>
>>>>
>>>> OK so we can summarize the benefits of this series as follows:
>>>> - boot time on a virtual model of a Samurai CPU drops from 109 to 62 seconds
>>>> - boot time on a QDF2400 arm64 server with 96 GB of RAM drops by ~15
>>>> *milliseconds*
>>>>
>>>> Google was not very helpful in figuring out what a Samurai CPU is and
>>>> why we should care about the boot time of Linux running on a virtual
>>>> model of it, and the 15 ms speedup is not that compelling either.
>>
>> Testing this patch set on top of Kunpeng 920 based ARM64 server, with
>> 384G memory in total, we got the time consuming below
>>
>> without this patch set with this patch set
>> mem_init() 13310ms 1415ms
>>
>> So we got about 8x speedup on this machine, which is very impressive.
>>
>
> Yes, this is impressive. But does it matter in the grand scheme of
> things?
It matters for this machine, because it's for storage and there is
a watchdog and the time consuming triggers the watchdog.
> How much time does this system take to arrive at this point
> from power on?
Sorry, I don't have such data, as the arch timer is not initialized
and I didn't see the time stamp at this point, but I read the cycles
from arch timer before and after the time consuming function to get
how much time consumed.
>
>> The time consuming is related the memory DIMM size and where to locate those
>> memory DIMMs in the slots. In above case, we are using 16G memory DIMM.
>> We also tested 1T memory with 64G size for each memory DIMM on another ARM64
>> machine, the time consuming reduced from 20s to 2s (I think it's related to
>> firmware implementations).
>>
>
> I agree that this optimization looks good in isolation, but the fact
> that you spotted a bug justifies my skepticism at the time. On the
> other hand, now that we have several independent reports (from you,
> but also from the Renesas folks) that the speedup is worthwhile for
> real world use cases, I think it does make sense to revisit it.
Thank you very much for taking care of this :)
>
> So what I would like to see is the patch set being proposed again,
> with the new data points added for documentation. Also, the commit
> logs need to crystal clear about how the meaning of PFN validity
> differs between ARM and other architectures, and why the assumptions
> that the optimization is based on are guaranteed to hold.
I think Jia He no longer works for HXT, if don't mind, I can repost
this patch set with Jia He's authority unchanged.
Thanks
Hanjun
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 29+ messages in thread* Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
@ 2019-06-11 15:18 ` Hanjun Guo
0 siblings, 0 replies; 29+ messages in thread
From: Hanjun Guo @ 2019-06-11 15:18 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Will Deacon, Ard Biesheuvel, Mark Rutland, Michal Hocko,
Catalin Marinas, Kemi Wang, Wei Yang, Linux-MM, Eugeniu Rosca,
Petr Tesarik, Nikolay Borisov, Russell King, Daniel Jordan,
AKASHI Takahiro, Mel Gorman, Andrey Ryabinin, Laura Abbott,
Daniel Vacek, Vladimir Murzin, Kees Cook, Vlastimil Babka,
Johannes Weiner, YASUAKI ISHIMATSU, Jia He, Jia He, Gioh Kim,
linux-arm-kernel, Steve Capper, Linux Kernel Mailing List,
James Morse, Philip Derrin, Andrew Morton
Hello Ard,
Thanks for the reply, please see my comments inline.
On 2019/6/10 21:16, Ard Biesheuvel wrote:
> On Sat, 8 Jun 2019 at 06:22, Hanjun Guo <guohanjun@huawei.com> wrote:
>>
>> Hi Ard, Will,
>>
>> This week we were trying to debug an issue of time consuming in mem_init(),
>> and leading to this similar solution form Jia He, so I would like to bring this
>> thread back, please see my detail test result below.
>>
>> On 2018/9/7 22:44, Will Deacon wrote:
>>> On Thu, Sep 06, 2018 at 01:24:22PM +0200, Ard Biesheuvel wrote:
>>>> On 22 August 2018 at 05:07, Jia He <hejianet@gmail.com> wrote:
>>>>> Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
>>>>> where possible") optimized the loop in memmap_init_zone(). But it causes
>>>>> possible panic bug. So Daniel Vacek reverted it later.
>>>>>
>>>>> But as suggested by Daniel Vacek, it is fine to using memblock to skip
>>>>> gaps and finding next valid frame with CONFIG_HAVE_ARCH_PFN_VALID.
>>>>>
>>>>> More from what Daniel said:
>>>>> "On arm and arm64, memblock is used by default. But generic version of
>>>>> pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
>>>>> not always return the next valid one but skips more resulting in some
>>>>> valid frames to be skipped (as if they were invalid). And that's why
>>>>> kernel was eventually crashing on some !arm machines."
>>>>>
>>>>> About the performance consideration:
>>>>> As said by James in b92df1de5,
>>>>> "I have tested this patch on a virtual model of a Samurai CPU with a
>>>>> sparse memory map. The kernel boot time drops from 109 to 62 seconds."
>>>>> Thus it would be better if we remain memblock_next_valid_pfn on arm/arm64.
>>>>>
>>>>> Besides we can remain memblock_next_valid_pfn, there is still some room
>>>>> for improvement. After this set, I can see the time overhead of memmap_init
>>>>> is reduced from 27956us to 13537us in my armv8a server(QDF2400 with 96G
>>>>> memory, pagesize 64k). I believe arm server will benefit more if memory is
>>>>> larger than TBs
>>>>>
>>>>
>>>> OK so we can summarize the benefits of this series as follows:
>>>> - boot time on a virtual model of a Samurai CPU drops from 109 to 62 seconds
>>>> - boot time on a QDF2400 arm64 server with 96 GB of RAM drops by ~15
>>>> *milliseconds*
>>>>
>>>> Google was not very helpful in figuring out what a Samurai CPU is and
>>>> why we should care about the boot time of Linux running on a virtual
>>>> model of it, and the 15 ms speedup is not that compelling either.
>>
>> Testing this patch set on top of Kunpeng 920 based ARM64 server, with
>> 384G memory in total, we got the time consuming below
>>
>> without this patch set with this patch set
>> mem_init() 13310ms 1415ms
>>
>> So we got about 8x speedup on this machine, which is very impressive.
>>
>
> Yes, this is impressive. But does it matter in the grand scheme of
> things?
It matters for this machine, because it's for storage and there is
a watchdog and the time consuming triggers the watchdog.
> How much time does this system take to arrive at this point
> from power on?
Sorry, I don't have such data, as the arch timer is not initialized
and I didn't see the time stamp at this point, but I read the cycles
from arch timer before and after the time consuming function to get
how much time consumed.
>
>> The time consuming is related the memory DIMM size and where to locate those
>> memory DIMMs in the slots. In above case, we are using 16G memory DIMM.
>> We also tested 1T memory with 64G size for each memory DIMM on another ARM64
>> machine, the time consuming reduced from 20s to 2s (I think it's related to
>> firmware implementations).
>>
>
> I agree that this optimization looks good in isolation, but the fact
> that you spotted a bug justifies my skepticism at the time. On the
> other hand, now that we have several independent reports (from you,
> but also from the Renesas folks) that the speedup is worthwhile for
> real world use cases, I think it does make sense to revisit it.
Thank you very much for taking care of this :)
>
> So what I would like to see is the patch set being proposed again,
> with the new data points added for documentation. Also, the commit
> logs need to crystal clear about how the meaning of PFN validity
> differs between ARM and other architectures, and why the assumptions
> that the optimization is based on are guaranteed to hold.
I think Jia He no longer works for HXT, if don't mind, I can repost
this patch set with Jia He's authority unchanged.
Thanks
Hanjun
^ permalink raw reply [flat|nested] 29+ messages in thread* Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
2019-06-11 15:18 ` Hanjun Guo
@ 2019-06-12 1:05 ` Jia He
-1 siblings, 0 replies; 29+ messages in thread
From: Jia He @ 2019-06-12 1:05 UTC (permalink / raw)
To: Hanjun Guo, Ard Biesheuvel
Cc: Mark Rutland, Michal Hocko, Catalin Marinas, Will Deacon,
Wei Yang, Linux-MM, James Morse, Eugeniu Rosca, Petr Tesarik,
Nikolay Borisov, Russell King, Daniel Jordan, AKASHI Takahiro,
Mel Gorman, Andrey Ryabinin, Laura Abbott, Daniel Vacek,
Vladimir Murzin, Kees Cook, Philip Derrin, YASUAKI ISHIMATSU,
Jia He, Ard Biesheuvel, Kemi Wang, Vlastimil Babka,
linux-arm-kernel, Steve Capper, Linux Kernel Mailing List,
Gioh Kim, Johannes Weiner, Andrew Morton
Hi Hanjun
On 2019/6/11 23:18, Hanjun Guo wrote:
> Hello Ard,
>
> Thanks for the reply, please see my comments inline.
>
> On 2019/6/10 21:16, Ard Biesheuvel wrote:
>> On Sat, 8 Jun 2019 at 06:22, Hanjun Guo <guohanjun@huawei.com> wrote:
>>> Hi Ard, Will,
>>>
>>> This week we were trying to debug an issue of time consuming in mem_init(),
>>> and leading to this similar solution form Jia He, so I would like to bring this
>>> thread back, please see my detail test result below.
>>>
>>> On 2018/9/7 22:44, Will Deacon wrote:
>>>> On Thu, Sep 06, 2018 at 01:24:22PM +0200, Ard Biesheuvel wrote:
>>>>> On 22 August 2018 at 05:07, Jia He <hejianet@gmail.com> wrote:
>>>>>> Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
>>>>>> where possible") optimized the loop in memmap_init_zone(). But it causes
>>>>>> possible panic bug. So Daniel Vacek reverted it later.
>>>>>>
>>>>>> But as suggested by Daniel Vacek, it is fine to using memblock to skip
>>>>>> gaps and finding next valid frame with CONFIG_HAVE_ARCH_PFN_VALID.
>>>>>>
>>>>>> More from what Daniel said:
>>>>>> "On arm and arm64, memblock is used by default. But generic version of
>>>>>> pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
>>>>>> not always return the next valid one but skips more resulting in some
>>>>>> valid frames to be skipped (as if they were invalid). And that's why
>>>>>> kernel was eventually crashing on some !arm machines."
>>>>>>
>>>>>> About the performance consideration:
>>>>>> As said by James in b92df1de5,
>>>>>> "I have tested this patch on a virtual model of a Samurai CPU with a
>>>>>> sparse memory map. The kernel boot time drops from 109 to 62 seconds."
>>>>>> Thus it would be better if we remain memblock_next_valid_pfn on arm/arm64.
>>>>>>
>>>>>> Besides we can remain memblock_next_valid_pfn, there is still some room
>>>>>> for improvement. After this set, I can see the time overhead of memmap_init
>>>>>> is reduced from 27956us to 13537us in my armv8a server(QDF2400 with 96G
>>>>>> memory, pagesize 64k). I believe arm server will benefit more if memory is
>>>>>> larger than TBs
>>>>>>
>>>>> OK so we can summarize the benefits of this series as follows:
>>>>> - boot time on a virtual model of a Samurai CPU drops from 109 to 62 seconds
>>>>> - boot time on a QDF2400 arm64 server with 96 GB of RAM drops by ~15
>>>>> *milliseconds*
>>>>>
>>>>> Google was not very helpful in figuring out what a Samurai CPU is and
>>>>> why we should care about the boot time of Linux running on a virtual
>>>>> model of it, and the 15 ms speedup is not that compelling either.
>>> Testing this patch set on top of Kunpeng 920 based ARM64 server, with
>>> 384G memory in total, we got the time consuming below
>>>
>>> without this patch set with this patch set
>>> mem_init() 13310ms 1415ms
>>>
>>> So we got about 8x speedup on this machine, which is very impressive.
>>>
>> Yes, this is impressive. But does it matter in the grand scheme of
>> things?
> It matters for this machine, because it's for storage and there is
> a watchdog and the time consuming triggers the watchdog.
>
>> How much time does this system take to arrive at this point
>> from power on?
> Sorry, I don't have such data, as the arch timer is not initialized
> and I didn't see the time stamp at this point, but I read the cycles
> from arch timer before and after the time consuming function to get
> how much time consumed.
>
>>> The time consuming is related the memory DIMM size and where to locate those
>>> memory DIMMs in the slots. In above case, we are using 16G memory DIMM.
>>> We also tested 1T memory with 64G size for each memory DIMM on another ARM64
>>> machine, the time consuming reduced from 20s to 2s (I think it's related to
>>> firmware implementations).
>>>
>> I agree that this optimization looks good in isolation, but the fact
>> that you spotted a bug justifies my skepticism at the time. On the
>> other hand, now that we have several independent reports (from you,
>> but also from the Renesas folks) that the speedup is worthwhile for
>> real world use cases, I think it does make sense to revisit it.
> Thank you very much for taking care of this :)
>
>> So what I would like to see is the patch set being proposed again,
>> with the new data points added for documentation. Also, the commit
>> logs need to crystal clear about how the meaning of PFN validity
>> differs between ARM and other architectures, and why the assumptions
>> that the optimization is based on are guaranteed to hold.
> I think Jia He no longer works for HXT, if don't mind, I can repost
> this patch set with Jia He's authority unchanged.
Ok, I don't mind that, thanks for your followup :)
---
Cheers,
Justin (Jia He)
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 29+ messages in thread* Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
@ 2019-06-12 1:05 ` Jia He
0 siblings, 0 replies; 29+ messages in thread
From: Jia He @ 2019-06-12 1:05 UTC (permalink / raw)
To: Hanjun Guo, Ard Biesheuvel
Cc: Will Deacon, Ard Biesheuvel, Mark Rutland, Michal Hocko,
Catalin Marinas, Kemi Wang, Wei Yang, Linux-MM, Eugeniu Rosca,
Petr Tesarik, Nikolay Borisov, Russell King, Daniel Jordan,
AKASHI Takahiro, Mel Gorman, Andrey Ryabinin, Laura Abbott,
Daniel Vacek, Vladimir Murzin, Kees Cook, Vlastimil Babka,
Johannes Weiner, YASUAKI ISHIMATSU, Jia He, Gioh Kim,
linux-arm-kernel, Steve Capper, Linux Kernel Mailing List,
James Morse, Philip Derrin, Andrew Morton
Hi Hanjun
On 2019/6/11 23:18, Hanjun Guo wrote:
> Hello Ard,
>
> Thanks for the reply, please see my comments inline.
>
> On 2019/6/10 21:16, Ard Biesheuvel wrote:
>> On Sat, 8 Jun 2019 at 06:22, Hanjun Guo <guohanjun@huawei.com> wrote:
>>> Hi Ard, Will,
>>>
>>> This week we were trying to debug an issue of time consuming in mem_init(),
>>> and leading to this similar solution form Jia He, so I would like to bring this
>>> thread back, please see my detail test result below.
>>>
>>> On 2018/9/7 22:44, Will Deacon wrote:
>>>> On Thu, Sep 06, 2018 at 01:24:22PM +0200, Ard Biesheuvel wrote:
>>>>> On 22 August 2018 at 05:07, Jia He <hejianet@gmail.com> wrote:
>>>>>> Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
>>>>>> where possible") optimized the loop in memmap_init_zone(). But it causes
>>>>>> possible panic bug. So Daniel Vacek reverted it later.
>>>>>>
>>>>>> But as suggested by Daniel Vacek, it is fine to using memblock to skip
>>>>>> gaps and finding next valid frame with CONFIG_HAVE_ARCH_PFN_VALID.
>>>>>>
>>>>>> More from what Daniel said:
>>>>>> "On arm and arm64, memblock is used by default. But generic version of
>>>>>> pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
>>>>>> not always return the next valid one but skips more resulting in some
>>>>>> valid frames to be skipped (as if they were invalid). And that's why
>>>>>> kernel was eventually crashing on some !arm machines."
>>>>>>
>>>>>> About the performance consideration:
>>>>>> As said by James in b92df1de5,
>>>>>> "I have tested this patch on a virtual model of a Samurai CPU with a
>>>>>> sparse memory map. The kernel boot time drops from 109 to 62 seconds."
>>>>>> Thus it would be better if we remain memblock_next_valid_pfn on arm/arm64.
>>>>>>
>>>>>> Besides we can remain memblock_next_valid_pfn, there is still some room
>>>>>> for improvement. After this set, I can see the time overhead of memmap_init
>>>>>> is reduced from 27956us to 13537us in my armv8a server(QDF2400 with 96G
>>>>>> memory, pagesize 64k). I believe arm server will benefit more if memory is
>>>>>> larger than TBs
>>>>>>
>>>>> OK so we can summarize the benefits of this series as follows:
>>>>> - boot time on a virtual model of a Samurai CPU drops from 109 to 62 seconds
>>>>> - boot time on a QDF2400 arm64 server with 96 GB of RAM drops by ~15
>>>>> *milliseconds*
>>>>>
>>>>> Google was not very helpful in figuring out what a Samurai CPU is and
>>>>> why we should care about the boot time of Linux running on a virtual
>>>>> model of it, and the 15 ms speedup is not that compelling either.
>>> Testing this patch set on top of Kunpeng 920 based ARM64 server, with
>>> 384G memory in total, we got the time consuming below
>>>
>>> without this patch set with this patch set
>>> mem_init() 13310ms 1415ms
>>>
>>> So we got about 8x speedup on this machine, which is very impressive.
>>>
>> Yes, this is impressive. But does it matter in the grand scheme of
>> things?
> It matters for this machine, because it's for storage and there is
> a watchdog and the time consuming triggers the watchdog.
>
>> How much time does this system take to arrive at this point
>> from power on?
> Sorry, I don't have such data, as the arch timer is not initialized
> and I didn't see the time stamp at this point, but I read the cycles
> from arch timer before and after the time consuming function to get
> how much time consumed.
>
>>> The time consuming is related the memory DIMM size and where to locate those
>>> memory DIMMs in the slots. In above case, we are using 16G memory DIMM.
>>> We also tested 1T memory with 64G size for each memory DIMM on another ARM64
>>> machine, the time consuming reduced from 20s to 2s (I think it's related to
>>> firmware implementations).
>>>
>> I agree that this optimization looks good in isolation, but the fact
>> that you spotted a bug justifies my skepticism at the time. On the
>> other hand, now that we have several independent reports (from you,
>> but also from the Renesas folks) that the speedup is worthwhile for
>> real world use cases, I think it does make sense to revisit it.
> Thank you very much for taking care of this :)
>
>> So what I would like to see is the patch set being proposed again,
>> with the new data points added for documentation. Also, the commit
>> logs need to crystal clear about how the meaning of PFN validity
>> differs between ARM and other architectures, and why the assumptions
>> that the optimization is based on are guaranteed to hold.
> I think Jia He no longer works for HXT, if don't mind, I can repost
> this patch set with Jia He's authority unchanged.
Ok, I don't mind that, thanks for your followup :)
---
Cheers,
Justin (Jia He)
^ permalink raw reply [flat|nested] 29+ messages in thread* Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
2019-06-12 1:05 ` Jia He
@ 2019-06-12 12:48 ` Hanjun Guo
-1 siblings, 0 replies; 29+ messages in thread
From: Hanjun Guo @ 2019-06-12 12:48 UTC (permalink / raw)
To: Jia He, Ard Biesheuvel
Cc: Mark Rutland, Michal Hocko, Catalin Marinas, Will Deacon,
Wei Yang, Linux-MM, James Morse, Eugeniu Rosca, Petr Tesarik,
Nikolay Borisov, Russell King, Daniel Jordan, AKASHI Takahiro,
Mel Gorman, Andrey Ryabinin, Laura Abbott, Daniel Vacek,
Vladimir Murzin, Kees Cook, Philip Derrin, YASUAKI ISHIMATSU,
Jia He, Ard Biesheuvel, Kemi Wang, Vlastimil Babka,
linux-arm-kernel, Steve Capper, Linux Kernel Mailing List,
Gioh Kim, Johannes Weiner, Andrew Morton
On 2019/6/12 9:05, Jia He wrote:
>>
>>> So what I would like to see is the patch set being proposed again,
>>> with the new data points added for documentation. Also, the commit
>>> logs need to crystal clear about how the meaning of PFN validity
>>> differs between ARM and other architectures, and why the assumptions
>>> that the optimization is based on are guaranteed to hold.
>> I think Jia He no longer works for HXT, if don't mind, I can repost
>> this patch set with Jia He's authority unchanged.
> Ok, I don't mind that, thanks for your followup :)
That's great, I will prepare a new version with Ard's comments addressed
then repost.
Thanks
Hanjun
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
@ 2019-06-12 12:48 ` Hanjun Guo
0 siblings, 0 replies; 29+ messages in thread
From: Hanjun Guo @ 2019-06-12 12:48 UTC (permalink / raw)
To: Jia He, Ard Biesheuvel
Cc: Will Deacon, Ard Biesheuvel, Mark Rutland, Michal Hocko,
Catalin Marinas, Kemi Wang, Wei Yang, Linux-MM, Eugeniu Rosca,
Petr Tesarik, Nikolay Borisov, Russell King, Daniel Jordan,
AKASHI Takahiro, Mel Gorman, Andrey Ryabinin, Laura Abbott,
Daniel Vacek, Vladimir Murzin, Kees Cook, Vlastimil Babka,
Johannes Weiner, YASUAKI ISHIMATSU, Jia He, Gioh Kim,
linux-arm-kernel, Steve Capper, Linux Kernel Mailing List,
James Morse, Philip Derrin, Andrew Morton
On 2019/6/12 9:05, Jia He wrote:
>>
>>> So what I would like to see is the patch set being proposed again,
>>> with the new data points added for documentation. Also, the commit
>>> logs need to crystal clear about how the meaning of PFN validity
>>> differs between ARM and other architectures, and why the assumptions
>>> that the optimization is based on are guaranteed to hold.
>> I think Jia He no longer works for HXT, if don't mind, I can repost
>> this patch set with Jia He's authority unchanged.
> Ok, I don't mind that, thanks for your followup :)
That's great, I will prepare a new version with Ard's comments addressed
then repost.
Thanks
Hanjun
^ permalink raw reply [flat|nested] 29+ messages in thread