From: Waiman Long <waiman.long@hp.com>
To: Daniel J Blueman <daniel@numascale.com>
Cc: Mel Gorman <mgorman@suse.de>, Linux-MM <linux-mm@kvack.org>,
	Nathan Zimmer <nzimmer@sgi.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Scott Norton <scott.norton@hp.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	LKML <linux-kernel@vger.kernel.org>,
	'Steffen Persvold' <sp@numascale.com>
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v3
Date: Tue, 28 Apr 2015 21:31:48 -0400
Message-ID: <55403484.8060906@hp.com>
In-Reply-To: <1429804437.24139.3@cpanel21.proisp.no>

On 04/23/2015 11:53 AM, Daniel J Blueman wrote:
> On Thu, Apr 23, 2015 at 6:33 PM, Mel Gorman <mgorman@suse.de> wrote:
>> The big change here is an adjustment to the topology_init path that 
>> caused
>> soft lockups on Waiman and Daniel Blue had reported it was an expensive
>> function.
>>
>> Changelog since v2
>> o Reduce overhead of topology_init
>> o Remove boot-time kernel parameter to enable/disable
>> o Enable on UMA
>>
>> Changelog since v1
>> o Always initialise low zones
>> o Typo corrections
>> o Rename parallel mem init to parallel struct page init
>> o Rebase to 4.0
> []
>
> Splendid work! On this 256c setup, topology_init now takes 185ms.
>
> This brings the kernel boot time down to 324s [1]. It turns out that 
> one memset is responsible for most of the time setting up the PUDs 
> and PMDs; adapting memset to using non-temporal writes [3] avoids 
> generating RMW cycles, bringing boot time down to 186s [2].
>
> If this is a possibility, I can split this patch and map other arch's 
> memset_nocache to memset, or change the callsite as preferred; 
> comments welcome.
>
> Thanks,
>  Daniel
>
> [1] https://resources.numascale.com/telemetry/defermem/h8qgl-defer2.txt
> [2] 
> https://resources.numascale.com/telemetry/defermem/h8qgl-defer2-nontemporal.txt
>
> -- [3]
>
> From f822139736cab8434302693c635fa146b465273c Mon Sep 17 00:00:00 2001
> From: Daniel J Blueman <daniel@numascale.com>
> Date: Thu, 23 Apr 2015 23:26:27 +0800
> Subject: [RFC] Speedup PMD setup
>
> Using non-temporal writes prevents read-modify-write cycles,
> which are much slower over large topologies.
>
> Adapt the existing memset() function into a _nocache variant and use
> when setting up PMDs during early boot to reduce boot time.
>
> Signed-off-by: Daniel J Blueman <daniel@numascale.com>
> ---
> arch/x86/include/asm/string_64.h |  3 ++
> arch/x86/lib/memset_64.S         | 90 
> ++++++++++++++++++++++++++++++++++++++++
> mm/memblock.c                    |  2 +-
> 3 files changed, 94 insertions(+), 1 deletion(-)
>

I tried your patch on my 12-TB IvyBridge-EX test machine and the bootup 
time increased from 265s to 289s (24s increase). I think my IvyBridge-EX 
box was using the optimized memset_c_e (rep stosb) code which turned out 
to perform better than the non-temporal move in your code. I think that 
may be due to the temporal moves that need to be done at the beginning 
and end of the memory range.
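
To make the head/tail point concrete, here is a minimal user-space sketch of 
a non-temporal memset. It is not the code from the patch: the name 
memset_nt_sketch() and the 16-byte SSE2 store width are assumptions made for 
illustration only. Only the aligned middle of the range can use streaming 
stores; the unaligned head and tail fall back to ordinary (temporal) stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical user-space helper, not the kernel's memset_nocache(). */
static void *memset_nt_sketch(void *dst, int c, size_t n)
{
	unsigned char *p = dst;

#if defined(__SSE2__)
	/* Head: ordinary (temporal) byte stores until p is 16-byte aligned. */
	while (n > 0 && ((uintptr_t)p & 15) != 0) {
		*p++ = (unsigned char)c;
		n--;
	}

	/* Body: non-temporal 16-byte stores that bypass the cache. */
	__m128i v = _mm_set1_epi8((char)c);
	while (n >= 16) {
		_mm_stream_si128((__m128i *)p, v);
		p += 16;
		n -= 16;
	}

	/* Order the streaming stores before any stores that follow. */
	_mm_sfence();
#endif

	/* Tail (or the whole range when SSE2 is unavailable): ordinary stores. */
	while (n > 0) {
		*p++ = (unsigned char)c;
		n--;
	}
	return dst;
}

/* Usage check: fill an unaligned sub-range and verify its neighbours. */
static void memset_nt_selfcheck(void)
{
	unsigned char buf[100];

	memset(buf, 0xAA, sizeof buf);
	memset_nt_sketch(buf + 3, 0x5C, 70);
	assert(buf[2] == 0xAA && buf[3] == 0x5C);
	assert(buf[72] == 0x5C && buf[73] == 0xAA);
}
```

For large, well-aligned ranges the head and tail are a negligible part of the 
work, so this sketch says nothing about where the 24s actually went; it only 
shows which parts of the range cannot use non-temporal stores.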

I had tried to replace clear_page() with non-temporal moves. I generally 
got only a few percentage points of improvement compared with the 
optimized clear_page_c() and clear_page_c_e() code. That is not a lot.

Anyway, I think the AMD box that you used wasn't setting the 
X86_FEATURE_REP_GOOD or X86_FEATURE_ERMS bits, resulting in poor memset 
performance. If such a feature is supported in the AMD CPU (albeit in a 
different way), you may consider sending in a patch to set those feature 
bits. Alternatively, you will need to duplicate the alternative 
instruction stuff in your memset_nocache() to make sure that it can use 
the optimized code, if appropriate.
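
As a rough user-space analogue of that selection, the sketch below checks the 
ERMS CPUID bit (leaf 7, subleaf 0, EBX bit 9) and picks "rep stosb" only when 
it is reported. The function names are hypothetical, and the check runs at 
call time rather than using the kernel's boot-time alternatives patching. 
X86_FEATURE_REP_GOOD is a flag synthesized by the kernel, not a CPUID bit, so 
it is not covered here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Returns 1 if CPUID reports ERMS (leaf 7, subleaf 0, EBX bit 9). */
static int cpu_has_erms_sketch(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return 0;
	return (ebx >> 9) & 1;
#else
	return 0;
#endif
}

/* Hypothetical dispatcher: "rep stosb" when ERMS is reported, else memset(). */
static void *memset_dispatch_sketch(void *dst, int c, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
	if (cpu_has_erms_sketch()) {
		void *d = dst;
		size_t cnt = n;

		/* Stores AL to [DI] CX times; the ABI keeps the direction flag clear. */
		__asm__ volatile("rep stosb"
				 : "+D"(d), "+c"(cnt)
				 : "a"(c)
				 : "memory");
		return dst;
	}
#endif
	return memset(dst, c, n);
}

/* Usage check: both paths must produce the same bytes as memset(). */
static void memset_dispatch_selfcheck(void)
{
	unsigned char buf[32];

	memset(buf, 0, sizeof buf);
	memset_dispatch_sketch(buf + 4, 0x11, 20);
	assert(buf[3] == 0 && buf[4] == 0x11);
	assert(buf[23] == 0x11 && buf[24] == 0);
}
```

A real memset_nocache() would make this choice once, at patch time, instead 
of issuing CPUID on every call.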

Cheers,
Longman
