[RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator

From: Ingo Molnar <mingo@kernel.org>
To: Mike Travis <travis@sgi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>, Nathan Zimmer <nzimmer@sgi.com>,
	holt@sgi.com, rob@landley.net, tglx@linutronix.de,
	mingo@redhat.com, yinghai@kernel.org, akpm@linux-foundation.org,
	gregkh@linuxfoundation.org, x86@kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator
Date: Wed, 26 Jun 2013 11:22:48 +0200
Message-ID: <20130626092248.GB27025@gmail.com>
In-Reply-To: <51C9E6CD.5080508@sgi.com>


(Changed the subject, to make it more apparent what we are talking about.)

* Mike Travis <travis@sgi.com> wrote:

> On 6/25/2013 11:43 AM, H. Peter Anvin wrote:
> > On 06/25/2013 10:22 AM, Mike Travis wrote:
> >>
> >> On 6/25/2013 12:38 AM, Ingo Molnar wrote:
> >>>
> >>> * Nathan Zimmer <nzimmer@sgi.com> wrote:
> >>>
> >>>> On Sun, Jun 23, 2013 at 11:28:40AM +0200, Ingo Molnar wrote:
> >>>>>
> >>>>> That's 4.5 GB/sec initialization speed - that feels a bit slow and the 
> >>>>> boot time effect should be felt on smaller 'a couple of gigabytes' 
> >>>>> desktop boxes as well. Do we know exactly where the 2 hours of boot 
> >>>>> time on a 32 TB system is spent?
> >>>>
> >>>> There are several other spots that could be improved on a large system 
> >>>> but memory initialization is by far the biggest.
> >>>
> >>> My feeling is that deferred/on-demand initialization triggered from the 
> >>> buddy allocator is the better long term solution.
> >>
> >> I haven't caught up with all of Nathan's changes yet (just
> >> got back from vacation), but there was an option to either
> >> start the memory insertion on boot, or trigger it later
> >> using the /sys/.../memory interface.  There is also a monitor
> >> program that calculates the memory insertion rate.  This was
> >> extremely useful to determine how changes in the kernel
> >> affected the rate.
> >>
> > 
> > Sorry, I *totally* did not follow that comment.  It seemed like a
> > complete non-sequitur?
> > 
> > 	-hpa
> 
> It was I who was not following the question.  I'm still reverting
> back to "work mode".
> 
> [There is more code in a separate patch that Nate has not sent
> yet that instructs the kernel to start adding memory as early
> as possible, or not.  That way you can start the insertion process
> later and monitor its progress to determine how changes in the
> kernel affect that process.  It is controlled by a separate
> CONFIG option.]

So, just to repeat (and expand upon) the solution hpa and I suggest: 
it's not based on /sys, delayed initialization lists or any similar 
(essentially memory hot plug based) approach.

It's a transparent on-demand initialization scheme: the very early memory 
setup is initialized only in 1GB (or 2MB) steps, not in 4K steps as we do 
today.

Any subsequent split-up initialization is done on-demand, in alloc_pages() 
et al, initializing a batch of 512 (or 1024) struct page heads when an 
uninitialized portion is first encountered.

This leaves the principal logic of early init largely untouched: we still 
have the same amount of RAM during and after bootup, except that on 32 TB 
systems we don't spend ~2 hours initializing 8,589,934,592 page heads.
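
[ The figures above can be cross-checked with a small userspace C sketch.
  The helper names nr_units() and ns_per_page_head() are hypothetical and
  exist only for this illustration. Treating the full ~2 hours as page
  head initialization time is an assumption, so the per-head cost it
  yields is an upper bound, not a measurement. ]

```c
#include <assert.h>
#include <stdint.h>

#define KIB (UINT64_C(1) << 10)
#define TIB (UINT64_C(1) << 40)

/* Number of units of size unit_bytes needed to cover ram_bytes. */
static uint64_t nr_units(uint64_t ram_bytes, uint64_t unit_bytes)
{
	assert(unit_bytes != 0);
	return ram_bytes / unit_bytes;
}

/*
 * Average cost per page head in nanoseconds, ASSUMING the whole of
 * total_seconds is spent initializing nr_heads page heads.
 */
static double ns_per_page_head(uint64_t nr_heads, double total_seconds)
{
	assert(nr_heads != 0);
	return total_seconds * 1e9 / (double)nr_heads;
}
```

Calling nr_units(32 * TIB, 4 * KIB) gives the 4K page head count quoted
above, and feeding that count and 7200.0 seconds to ns_per_page_head()
gives the implied average cost per page head under the stated assumption.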

This scheme could be implemented by introducing a new PG_initialized flag, 
which is seen by an unlikely() branch in alloc_pages() and which triggers 
the on-demand initialization of pages.

[ It could probably be made zero-cost for the post-initialization state:
  we already check a bunch of rare PG_ flags, one more flag would not 
  introduce any new branch in the page allocation hot path. ]
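
[ A minimal userspace sketch of the idea, not kernel code. struct
  toy_page, mem_map[] as a small static array, init_batch() and
  toy_alloc_page() are all hypothetical stand-ins, the 'initialized'
  field plays the role of the proposed PG_initialized flag, and
  __builtin_expect() (a GCC/Clang builtin) stands in for unlikely(). ]

```c
#include <assert.h>
#include <stdbool.h>

#define INIT_BATCH 512UL              /* page heads set up per first touch */
#define NR_PAGES   (4 * INIT_BATCH)   /* tiny toy "RAM" for illustration */

/* Toy stand-in for struct page; this is NOT the kernel's layout. */
struct toy_page {
	bool initialized;   /* role of the proposed PG_initialized flag */
	int  refcount;
};

static struct toy_page mem_map[NR_PAGES];   /* zeroed: all uninitialized */
static unsigned long batches_initialized;

/* Slow path: initialize the whole aligned batch that contains pfn. */
static void init_batch(unsigned long pfn)
{
	unsigned long start = pfn - (pfn % INIT_BATCH);

	for (unsigned long i = start; i < start + INIT_BATCH; i++) {
		mem_map[i].refcount = 0;
		mem_map[i].initialized = true;
	}
	batches_initialized++;
}

/* Toy allocation entry point: hand out page pfn, initializing on demand. */
static struct toy_page *toy_alloc_page(unsigned long pfn)
{
	assert(pfn < NR_PAGES);

	/* Rare branch: only taken the first time a batch is touched. */
	if (__builtin_expect(!mem_map[pfn].initialized, 0))
		init_batch(pfn);

	mem_map[pfn].refcount = 1;
	return &mem_map[pfn];
}
```

The first toy_alloc_page() call into a batch pays for all 512 page heads
in that batch; later calls into the same batch only see the flag check.
A real implementation would also need locking and would fold the check
into an existing rare-flags test, which this sketch does not attempt.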

It's a technically different solution from what was submitted in this 
thread.

Cons:

 - it works after bootup, via GFP. If done in a simple fashion it adds one 
   more branch to the GFP fastpath. [ If done a bit more cleverly it can 
   merge into an existing unlikely() branch and become essentially 
   zero-cost for the fastpath. ]

 - it adds an initialization non-determinism to GFP, to the tune of
   initializing ~512 page heads when a region of RAM is first used.

 - initialization is done when memory is needed - not during or shortly 
   after bootup. This (slightly) increases first-use overhead. [I don't 
   think this factor is significant - and I think we'll quickly see 
   speedups to initialization, once the overhead becomes more easily 
   measurable.]

Pros:

 - it's transparent to the boot process. ('free' shows the same full
   amount of RAM all the time, there are no weird effects of RAM coming
   online asynchronously. You see all the RAM you have - etc.)

 - it helps the boot time of every single Linux system, not just large RAM
   ones. On a smallish, 4GB system memory init can take up precious
   hundreds of milliseconds, so this is a practical issue.

 - it spreads initialization overhead to later portions of the system's 
   life time: when there's typically more idle time and more parallelism
   available.

 - initialization overhead, because it's a natural part of first-time 
   memory allocation with this scheme, becomes more measurable (and thus 
   more prominently optimized) than any deferred lists processed in the 
   background.

 - as an added bonus it probably speeds up your use case even more than the
   patches you are providing: on a 32 TB system the primary initialization
   would only have to enumerate memory, allocate page heads and buddy
   bitmaps, and initialize the 1GB granular page heads: there's only 32768
   of them.
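
[ The coarse-grained counts can be checked the same way. This is a small
  C sketch with hypothetical helper names (coarse_heads(),
  batch_span_bytes()); the shift constants assume 4K base pages and
  2MB/1GB steps as discussed above. ]

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_4K 12   /* 4K base page */
#define STEP_SHIFT_2M 21   /* 2MB early-init step */
#define STEP_SHIFT_1G 30   /* 1GB early-init step */

/* Number of coarse page heads needed to cover ram_bytes at a given step. */
static uint64_t coarse_heads(uint64_t ram_bytes, unsigned int step_shift)
{
	assert(step_shift < 64);
	return ram_bytes >> step_shift;
}

/* Bytes of RAM covered by one on-demand batch of 4K page heads. */
static uint64_t batch_span_bytes(uint64_t batch_pages)
{
	return batch_pages << PAGE_SHIFT_4K;
}
```

For 32 TB, coarse_heads() with STEP_SHIFT_1G gives the 32768 heads
mentioned above. With 4K pages, a batch of 512 page heads spans exactly
one 2MB step, and a batch of 1024 spans 4MB.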

So unless I overlooked some factor this scheme would be unconditional 
goodness for everyone.

Thanks,

	Ingo