From: Mel Gorman <mgorman@suse.de>
To: Waiman Long <waiman.long@hp.com>
Cc: Linux-MM <linux-mm@kvack.org>, Robin Holt <holt@sgi.com>,
	Nathan Zimmer <nzimmer@sgi.com>, Daniel Rahn <drahn@suse.com>,
	Davidlohr Bueso <dbueso@suse.com>,
	Dave Hansen <dave.hansen@intel.com>, Tom Vaden <tom.vaden@hp.com>,
	Scott Norton <scott.norton@hp.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH 0/14] Parallel memory initialisation
Date: Wed, 15 Apr 2015 16:44:15 +0100
Message-ID: <20150415154415.GH14842@suse.de>
In-Reply-To: <552E7AC5.3020703@hp.com>

On Wed, Apr 15, 2015 at 10:50:45AM -0400, Waiman Long wrote:
> On 04/15/2015 09:38 AM, Mel Gorman wrote:
> >On Wed, Apr 15, 2015 at 09:15:50AM -0400, Waiman Long wrote:
> >>><SNIP>
> >>>Patches are against 4.0-rc7.
> >>>
> >>>  Documentation/kernel-parameters.txt |   8 +
> >>>  arch/ia64/mm/numa.c                 |  19 +-
> >>>  arch/x86/Kconfig                    |   2 +
> >>>  include/linux/memblock.h            |  18 ++
> >>>  include/linux/mm.h                  |   8 +-
> >>>  include/linux/mmzone.h              |  37 +++-
> >>>  init/main.c                         |   1 +
> >>>  mm/Kconfig                          |  29 +++
> >>>  mm/bootmem.c                        |   6 +-
> >>>  mm/internal.h                       |  23 ++-
> >>>  mm/memblock.c                       |  34 ++-
> >>>  mm/mm_init.c                        |   9 +-
> >>>  mm/nobootmem.c                      |   7 +-
> >>>  mm/page_alloc.c                     | 398 +++++++++++++++++++++++++++++++-----
> >>>  mm/vmscan.c                         |   6 +-
> >>>  15 files changed, 507 insertions(+), 98 deletions(-)
> >>>
> >>I had included your patch with the 4.0 kernel and booted up a
> >>16-socket 12-TB machine. I measured the elapsed time from the elilo
> >>prompt to the availability of ssh login. Without the patch, the
> >>bootup time was 404s. It was reduced to 298s with the patch. So
> >>there was about 100s reduction in bootup time (1/4 of the total).
> >>
> >Cool, thanks for testing. Would you be able to state if this is really
> >important or not? Does booting 100s faster on a 12TB machine really
> >matter? I can then add that justification to the changelog to avoid a
> >conversation with Andrew that goes something like
> >
> >Andrew: Why are we doing this?
> >Mel:    Because we can and apparently people might want it.
> >Andrew: What's the maintenance cost of this?
> >Mel:    Magic beans
> >
> >I prefer talking to Andrew when it's harder to predict what he'll say.
> 
> Booting 100s faster is certainly something that is nice to have.
> Right now, more time is spent in the firmware POST portion of the
> bootup process than in the OS boot.

I'm not surprised. On two different 1TB machines, I've seen a POST time
of 2 minutes on one and 35 minutes on the other. No idea what it's doing
for 35 minutes... plotting world domination probably.

> So I would say this patch isn't
> really critical right now as machines with that much memory are
> relatively rare. However, if we look forward to the near future,
> some new memory technology like persistent memory is coming and
> machines with large amounts of memory (whether persistent or not)
> will become more common. This patch will certainly be useful if we
> look forward into the future.
> 

Whether persistent memory needs struct pages or not is up in the air and
I'm not getting stuck in that can of worms. 100 seconds off kernel init
time is a starting point. I can try pushing it on that basis but I
really would like to see SGI and Intel people also chime in on how it
affects their really large machines.

> >>However, there were 2 bootup problems in the dmesg log that needed
> >>to be addressed.
> >>1. There were 2 vmalloc allocation failures:
> >>[    2.284686] vmalloc: allocation failure, allocated 16578404352 of
> >>17179873280 bytes
> >>[   10.399938] vmalloc: allocation failure, allocated 7970922496 of
> >>8589938688 bytes
> >>
> >>2. There were 2 soft lockup warnings:
> >>[   57.319453] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s!
> >>[swapper/0:1]
> >>[   85.409263] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s!
> >>[swapper/0:1]
> >>
> >>Once those problems are fixed, the patch should be in pretty good
> >>shape. I have attached the dmesg log for your reference.
> >>
> >The obvious conclusion is that initialising 1G per node is not enough for
> >really large machines. Can you try this on top? It's untested but should
> >work. The low value was chosen because it happened to work and I wanted
> >to get test coverage on common hardware but broke is broke.
> >
> >diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >index f2c96d02662f..6b3bec304e35 100644
> >--- a/mm/page_alloc.c
> >+++ b/mm/page_alloc.c
> >@@ -276,9 +276,9 @@ static inline bool update_defer_init(pg_data_t *pgdat,
> >  	if (pgdat->first_deferred_pfn != ULONG_MAX)
> >  		return false;
> >
> >-	/* Initialise at least 1G per zone */
> >+	/* Initialise at least 32G per node */
> >  	(*nr_initialised)++;
> >-	if (*nr_initialised > (1UL << (30 - PAGE_SHIFT)) &&
> >+	if (*nr_initialised > (32UL << (30 - PAGE_SHIFT)) &&
> >  	(pfn & (PAGES_PER_SECTION - 1)) == 0) {
> >  		pgdat->first_deferred_pfn = pfn;
> >  		return false;
> 
> I will try this out when I can get hold of the 12-TB machine again.
> 

Thanks.
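
For readers following the arithmetic in the hunk above, here is a minimal
sketch of what the threshold expression works out to. It is not code from
the series: gib_to_pages() and should_defer() are hypothetical names, and
page_shift = 12 and 32768 pages per section are assumed x86-64 values
(4K pages, 128M sections), not something stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of pages in `gib` GiB when a page is (1 << page_shift) bytes.
 * Mirrors the (N << (30 - PAGE_SHIFT)) expression in the patch; the
 * shift is a parameter because PAGE_SHIFT is architecture-specific.
 */
static uint64_t gib_to_pages(uint64_t gib, unsigned int page_shift)
{
	assert(page_shift <= 30);
	return gib << (30 - page_shift);
}

/*
 * Hypothetical stand-in for the condition in the hunk: true once more
 * than `threshold` pages have been initialised and pfn lies on a
 * section boundary. pages_per_section must be a power of two.
 */
static int should_defer(uint64_t nr_initialised, uint64_t threshold,
			uint64_t pfn, uint64_t pages_per_section)
{
	return nr_initialised > threshold &&
	       (pfn & (pages_per_section - 1)) == 0;
}
```

Under those assumptions the old limit is 262144 pages (1G) and the new
one is 8388608 pages (32G), and deferral only starts at the first
section-aligned pfn after the limit has been passed.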

> The vmalloc allocation failures were for the following hash tables:
> - Dentry cache hash table entries
> - Inode-cache hash table entries
> 
> Those hash tables scale linearly with the amount of memory available
> in the system. So instead of hardcoding a certain value, why don't
> we make it a certain % of the total memory but bottomed out to 1G at
> the low end?
> 

Because then the question becomes which percentage is the right one, and
what happens if it's a percentage of total memory but the NUMA nodes are
not all the same size? I want to start simple until there is more data on
what these really large machines look like. If it ever fails in the
field, there is the command-line switch until a patch is available.
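
To make the unequal-node concern concrete, here is a rough sketch of a
percentage-based policy with a 1G floor, as suggested in the quoted text.
Nothing here comes from the patch series: pct_threshold_bytes() and
nodes_below_threshold() are hypothetical helpers, and the node sizes in
the note below are invented purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define GIB (1ULL << 30)

/* Hypothetical policy: pct% of total memory, never less than 1 GiB. */
static uint64_t pct_threshold_bytes(uint64_t total_bytes, unsigned int pct)
{
	uint64_t t;

	assert(pct <= 100);
	t = total_bytes / 100 * pct;
	return t < GIB ? GIB : t;
}

/*
 * Count the nodes whose size is below the per-node threshold. Such a
 * node would be initialised in full up front, so it gains nothing
 * from deferral.
 */
static unsigned int nodes_below_threshold(const uint64_t *node_bytes,
					  unsigned int nr_nodes,
					  uint64_t threshold)
{
	unsigned int i, count = 0;

	for (i = 0; i < nr_nodes; i++)
		if (node_bytes[i] < threshold)
			count++;
	return count;
}
```

With 12T of total memory, 1% is roughly 122G per node. A made-up layout
of 768G, 64G and 768G nodes would leave the 64G node entirely below that
threshold, while a 50G machine would simply hit the 1G floor. Which
percentage is "right" depends on the layout, which is the open question.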

-- 
Mel Gorman
SUSE Labs
