* [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator
2013-06-25 18:51 ` Mike Travis
@ 2013-06-26 9:22 ` Ingo Molnar
2013-06-26 13:28 ` Andrew Morton
0 siblings, 1 reply; 9+ messages in thread
From: Ingo Molnar @ 2013-06-26 9:22 UTC
To: Mike Travis
Cc: H. Peter Anvin, Nathan Zimmer, holt, rob, tglx, mingo, yinghai,
akpm, gregkh, x86, linux-doc, linux-kernel, Linus Torvalds,
Peter Zijlstra
(Changed the subject, to make it more apparent what we are talking about.)
* Mike Travis <travis@sgi.com> wrote:
> On 6/25/2013 11:43 AM, H. Peter Anvin wrote:
> > On 06/25/2013 10:22 AM, Mike Travis wrote:
> >>
> >> On 6/25/2013 12:38 AM, Ingo Molnar wrote:
> >>>
> >>> * Nathan Zimmer <nzimmer@sgi.com> wrote:
> >>>
> >>>> On Sun, Jun 23, 2013 at 11:28:40AM +0200, Ingo Molnar wrote:
> >>>>>
> >>>>> That's 4.5 GB/sec initialization speed - that feels a bit slow and the
> >>>>> boot time effect should be felt on smaller 'a couple of gigabytes'
> >>>>> desktop boxes as well. Do we know exactly where the 2 hours of boot
> >>>>> time on a 32 TB system is spent?
> >>>>
> >>>> There are several other spots that could be improved on a large system
> >>>> but memory initialization is by far the biggest.
> >>>
> >>> My feeling is that deferred/on-demand initialization triggered from the
> >>> buddy allocator is the better long term solution.
> >>
> >> I haven't caught up with all of Nathan's changes yet (just
> >> got back from vacation), but there was an option to either
> >> start the memory insertion on boot, or trigger it later
> >> using the /sys/.../memory interface. There is also a monitor
> >> program that calculates the memory insertion rate. This was
> >> extremely useful to determine how changes in the kernel
> >> affected the rate.
> >>
> >
> > Sorry, I *totally* did not follow that comment. It seemed like a
> > complete non-sequitur?
> >
> > -hpa
>
> It was I who was not following the question. I'm still reverting
> to "work mode".
>
> [There is more code in a separate patch that Nate has not sent
> yet that instructs the kernel to start adding memory as early
> as possible, or not. That way you can start the insertion process
> later and monitor its progress to determine how changes in the
> kernel affect that process. It is controlled by a separate
> CONFIG option.]
So, just to repeat (and expand upon) the solution hpa and I suggest:
it's not based on /sys, delayed initialization lists or any similar
(essentially memory-hotplug-based) approach.
It's a transparent on-demand initialization scheme, based on initializing
the very early memory setup in 1GB (2MB) steps only (not in the 4K steps
we use today).
Any subsequent split-up initialization is done on-demand, in alloc_pages()
et al, initializing a batch of 512 (or 1024) struct page heads when an
uninitialized portion is first encountered.
This leaves the principal logic of early init largely untouched: we still
have the same amount of RAM during and after bootup, except that on 32 TB
systems we don't spend ~2 hours initializing 8,589,934,592 page heads.
This scheme could be implemented by introducing a new PG_initialized flag,
which is seen by an unlikely() branch in alloc_pages() and which triggers
the on-demand initialization of pages.
[ It could probably be made zero-cost for the post-initialization state:
we already check a bunch of rare PG_ flags, one more flag would not
introduce any new branch in the page allocation hot path. ]
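As a rough sketch of the idea (PageInitialized() and init_page_batch()
are hypothetical names used for illustration, not existing kernel
symbols):

	/*
	 * Sketch: allocation-path hook that completes deferred
	 * struct page initialization on first use.
	 */
	static inline void ensure_pages_initialized(struct page *page)
	{
		/*
		 * Rare case: this range was only coarsely set up
		 * during early boot. Initialize a batch of ~512
		 * struct pages now, on first use - most likely on
		 * the local node.
		 */
		if (unlikely(!PageInitialized(page)))
			init_page_batch(page, 512);
	}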
It's a technically different solution from what was submitted in this
thread.
Cons:
- it works after bootup, via GFP. If done in a simple fashion it adds one
more branch to the GFP fastpath. [ If done a bit more cleverly it can
merge into an existing unlikely() branch and become essentially
zero-cost for the fastpath. ]
- it adds an initialization non-determinism to GFP, to the tune of
initializing ~512 page heads when RAM is first utilized.
- initialization is done when memory is needed - not during or shortly
after bootup. This (slightly) increases first-use overhead. [I don't
think this factor is significant - and I think we'll quickly see
speedups to initialization, once the overhead becomes more easily
measurable.]
Pros:
- it's transparent to the boot process. ('free' shows the same full
amount of RAM all the time, there are no weird effects of RAM coming
online asynchronously. You see all the RAM you have - etc.)
- it helps the boot time of every single Linux system, not just large-RAM
ones. On a smallish 4GB system, memory init can take up precious
hundreds of milliseconds, so this is a practical issue.
- it spreads initialization overhead to later portions of the system's
lifetime, when there's typically more idle time and more parallelism
available.
- initialization overhead, because it's a natural part of first-time
memory allocation with this scheme, becomes more measurable (and thus
more prominently optimized) than any deferred lists processed in the
background.
- as an added bonus it probably speeds up your use case even more than the
patches you are providing: on a 32 TB system the primary initialization
would only have to enumerate memory, allocate page heads and buddy
bitmaps, and initialize the 1GB-granular page heads: there are only
32,768 of them.
So unless I overlooked some factor, this scheme would be unconditional
goodness for everyone.
Thanks,
Ingo
* Re: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator
2013-06-26 9:22 ` [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator Ingo Molnar
@ 2013-06-26 13:28 ` Andrew Morton
2013-06-26 13:37 ` Ingo Molnar
0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2013-06-26 13:28 UTC
To: Ingo Molnar
Cc: Mike Travis, H. Peter Anvin, Nathan Zimmer, holt, rob, tglx,
mingo, yinghai, gregkh, x86, linux-doc, linux-kernel,
Linus Torvalds, Peter Zijlstra
On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar <mingo@kernel.org> wrote:
> except that on 32 TB
> systems we don't spend ~2 hours initializing 8,589,934,592 page heads.
That's about a million a second which is crazy slow - even my prehistoric desktop
is 100x faster than that.
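(For reference: 32 TB of 4K pages is 8,589,934,592 struct pages; over
~7,200 seconds that works out to roughly 1.2 million initializations
per second.)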
Where's all this time actually being spent?
* Re: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator
2013-06-26 13:28 ` Andrew Morton
@ 2013-06-26 13:37 ` Ingo Molnar
2013-06-26 15:02 ` Nathan Zimmer
2013-06-26 16:15 ` Mike Travis
0 siblings, 2 replies; 9+ messages in thread
From: Ingo Molnar @ 2013-06-26 13:37 UTC
To: Andrew Morton
Cc: Mike Travis, H. Peter Anvin, Nathan Zimmer, holt, rob, tglx,
mingo, yinghai, gregkh, x86, linux-doc, linux-kernel,
Linus Torvalds, Peter Zijlstra
* Andrew Morton <akpm@linux-foundation.org> wrote:
> On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar <mingo@kernel.org> wrote:
>
> > except that on 32 TB
> > systems we don't spend ~2 hours initializing 8,589,934,592 page heads.
>
> That's about a million a second which is crazy slow - even my
> prehistoric desktop is 100x faster than that.
>
> Where's all this time actually being spent?
See the earlier part of the thread - apparently it's spent initializing
the page heads - remote NUMA node misses from a single boot CPU, going
across a zillion cross-connects? I guess there are some other low-hanging
fruit as well - so making this easier to profile would be nice. The
profile posted was not really usable.
Btw., NUMA locality would be another advantage of on-demand
initialization: actual users of RAM tend to allocate node-local
(especially on large clusters), so any overhead will be naturally lower.
Thanks,
Ingo
* Re: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator
2013-06-26 13:37 ` Ingo Molnar
@ 2013-06-26 15:02 ` Nathan Zimmer
2013-06-26 16:15 ` Mike Travis
1 sibling, 0 replies; 9+ messages in thread
From: Nathan Zimmer @ 2013-06-26 15:02 UTC
To: Ingo Molnar
Cc: Andrew Morton, Mike Travis, H. Peter Anvin, Nathan Zimmer, holt,
rob, tglx, mingo, yinghai, gregkh, x86, linux-doc, linux-kernel,
Linus Torvalds, Peter Zijlstra
On Wed, Jun 26, 2013 at 03:37:15PM +0200, Ingo Molnar wrote:
>
> * Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar <mingo@kernel.org> wrote:
> >
> > > except that on 32 TB
> > > systems we don't spend ~2 hours initializing 8,589,934,592 page heads.
> >
> > That's about a million a second which is crazy slow - even my
> > prehistoric desktop is 100x faster than that.
> >
> > Where's all this time actually being spent?
>
> See the earlier part of the thread - apparently it's spent initializing
> the page heads - remote NUMA node misses from a single boot CPU, going
> across a zillion cross-connects? I guess there are some other low-hanging
> fruit as well - so making this easier to profile would be nice. The
> profile posted was not really usable.
>
That is correct. From what I am seeing, using crude cycle counters, there is
far more time spent on the later nodes; i.e., memory near the boot node is
initialized a lot faster than remote memory.
I think the other low-hanging fruit is currently being drowned out by the
lack of locality.
Nate
> Btw., NUMA locality would be another advantage of on-demand
> initialization: actual users of RAM tend to allocate node-local
> (especially on large clusters), so any overhead will be naturally lower.
>
> Thanks,
>
> Ingo
* Re: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator
2013-06-26 13:37 ` Ingo Molnar
2013-06-26 15:02 ` Nathan Zimmer
@ 2013-06-26 16:15 ` Mike Travis
1 sibling, 0 replies; 9+ messages in thread
From: Mike Travis @ 2013-06-26 16:15 UTC
To: Ingo Molnar
Cc: Andrew Morton, H. Peter Anvin, Nathan Zimmer, holt, rob, tglx,
mingo, yinghai, gregkh, x86, linux-doc, linux-kernel,
Linus Torvalds, Peter Zijlstra
On 6/26/2013 6:37 AM, Ingo Molnar wrote:
>
> * Andrew Morton <akpm@linux-foundation.org> wrote:
>
>> On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar <mingo@kernel.org> wrote:
>>
>>> except that on 32 TB
>>> systems we don't spend ~2 hours initializing 8,589,934,592 page heads.
>>
>> That's about a million a second which is crazy slow - even my
>> prehistoric desktop is 100x faster than that.
>>
>> Where's all this time actually being spent?
>
> See the earlier part of the thread - apparently it's spent initializing
> the page heads - remote NUMA node misses from a single boot CPU, going
> across a zillion cross-connects? I guess there are some other low-hanging
> fruit as well - so making this easier to profile would be nice. The
> profile posted was not really usable.
This is one advantage of delayed memory init. I can do it under
the profiler. I will put everything together to accomplish this
and then send a perf report.
>
> Btw., NUMA locality would be another advantage of on-demand
> initialization: actual users of RAM tend to allocate node-local
> (especially on large clusters), so any overhead will be naturally lower.
>
> Thanks,
>
> Ingo
>
* Re: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator
@ 2013-06-27 3:35 Daniel J Blueman
2013-06-28 20:37 ` Nathan Zimmer
0 siblings, 1 reply; 9+ messages in thread
From: Daniel J Blueman @ 2013-06-27 3:35 UTC
To: Andrew Morton
Cc: Mike Travis, H. Peter Anvin, Nathan Zimmer, holt, rob,
Thomas Gleixner, Ingo Molnar, yinghai, Greg KH, x86, linux-doc,
Linux Kernel, Linus Torvalds, Peter Zijlstra, Steffen Persvold
On Wednesday, June 26, 2013 9:30:02 PM UTC+8, Andrew Morton wrote:
>
> On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar <mi...@kernel.org> wrote:
>
> > except that on 32 TB
> > systems we don't spend ~2 hours initializing 8,589,934,592 page heads.
>
> That's about a million a second which is crazy slow - even my
> prehistoric desktop
> is 100x faster than that.
>
> Where's all this time actually being spent?
The complexity of a directory-lookup architecture to make the
(intrinsically unscalable) cache-coherency protocol scalable gives you a
~1us roundtrip to remote NUMA nodes.
Probably a lot of time is spent in memsets, and in RMW cycles setting
page bits; these are intrinsically synchronous, so the initialising core
can't get to 12 or so outstanding memory transactions.
Since EFI memory ranges have a flag to state whether they are zeroed (which
may be a fair assumption for memory on non-bootstrap-processor NUMA
nodes), we can probably collapse the RMWs to just writes.
A normal write will require a coherency cycle, then a fetch and a
writeback when it's evicted from the cache. For this purpose,
non-temporal writes would eliminate the cache line fetch and give a
massive increase in bandwidth. We wouldn't even need a store-fence as
the initialising core is the only one online.
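A rough userspace sketch of the non-temporal store technique (purely
illustrative - not the kernel code under discussion; assumes x86 with
SSE2):

	#include <emmintrin.h>	/* _mm_stream_si128, _mm_sfence */
	#include <stddef.h>
	#include <stdint.h>

	/*
	 * Fill a 16-byte-aligned buffer with streaming stores, which
	 * bypass the cache and avoid the read-for-ownership a normal
	 * write incurs. 'bytes' must be a multiple of 16.
	 */
	static void fill_nontemporal(void *dst, uint64_t pattern, size_t bytes)
	{
		__m128i v = _mm_set1_epi64x((long long)pattern);
		char *p = dst;
		size_t i;

		for (i = 0; i < bytes; i += 16)
			_mm_stream_si128((__m128i *)(p + i), v);

		/*
		 * Order the streaming stores before later accesses;
		 * as noted above, with only one core online even this
		 * fence could be skipped.
		 */
		_mm_sfence();
	}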
Daniel
--
Daniel J Blueman
Principal Software Engineer, Numascale Asia
* Re: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator
2013-06-27 3:35 [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator Daniel J Blueman
@ 2013-06-28 20:37 ` Nathan Zimmer
2013-06-29 7:24 ` Ingo Molnar
0 siblings, 1 reply; 9+ messages in thread
From: Nathan Zimmer @ 2013-06-28 20:37 UTC
To: Daniel J Blueman
Cc: Andrew Morton, Mike Travis, H. Peter Anvin, holt, rob,
Thomas Gleixner, Ingo Molnar, yinghai, Greg KH, x86, linux-doc,
Linux Kernel, Linus Torvalds, Peter Zijlstra, Steffen Persvold
On 06/26/2013 10:35 PM, Daniel J Blueman wrote:
> On Wednesday, June 26, 2013 9:30:02 PM UTC+8, Andrew Morton wrote:
> >
> > On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar <mi...@kernel.org>
> > wrote:
> >
> > > except that on 32 TB
> > > systems we don't spend ~2 hours initializing 8,589,934,592 page
> > > heads.
> >
> > That's about a million a second which is crazy slow - even my
> > prehistoric desktop
> > is 100x faster than that.
> >
> > Where's all this time actually being spent?
>
> The complexity of a directory-lookup architecture to make the
> (intrinsically unscalable) cache-coherency protocol scalable gives you
> a ~1us roundtrip to remote NUMA nodes.
>
> Probably a lot of time is spent in some memsets, and RMW cycles which
> are setting page bits, which are intrinsically synchronous, so the
> initialising core can't get to 12 or so outstanding memory transactions.
>
> Since EFI memory ranges have a flag to state if they are zeroed (which
> may be a fair assumption for memory on non-bootstrap processor NUMA
> nodes), we can probably collapse the RMWs to just writes.
>
> A normal write will require a coherency cycle, then a fetch and a
> writeback when it's evicted from the cache. For this purpose,
> non-temporal writes would eliminate the cache line fetch and give a
> massive increase in bandwidth. We wouldn't even need a store-fence as
> the initialising core is the only one online.
>
> Daniel
Could you elaborate a bit more, or suggest a specific area to look at?
After some experiments with trying to just set some fields in the struct
page directly, I haven't been able to produce any improvements. Of
course, there is a lot about this area that I don't have much experience with.
Nate
* Re: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator
2013-06-28 20:37 ` Nathan Zimmer
@ 2013-06-29 7:24 ` Ingo Molnar
2013-06-29 18:03 ` Nathan Zimmer
0 siblings, 1 reply; 9+ messages in thread
From: Ingo Molnar @ 2013-06-29 7:24 UTC
To: Nathan Zimmer
Cc: Daniel J Blueman, Andrew Morton, Mike Travis, H. Peter Anvin,
holt, rob, Thomas Gleixner, Ingo Molnar, yinghai, Greg KH, x86,
linux-doc, Linux Kernel, Linus Torvalds, Peter Zijlstra,
Steffen Persvold
* Nathan Zimmer <nzimmer@sgi.com> wrote:
> On 06/26/2013 10:35 PM, Daniel J Blueman wrote:
> >On Wednesday, June 26, 2013 9:30:02 PM UTC+8, Andrew Morton wrote:
> >>
> >> On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar
> >> <mi...@kernel.org> wrote:
> >>
> >> > except that on 32 TB
> >> > systems we don't spend ~2 hours initializing 8,589,934,592
> >> > page heads.
> >>
> >> That's about a million a second which is crazy slow - even my
> >> prehistoric desktop
> >> is 100x faster than that.
> >>
> >> Where's all this time actually being spent?
> >
> > The complexity of a directory-lookup architecture to make the
> > (intrinsically unscalable) cache-coherency protocol scalable gives you
> > a ~1us roundtrip to remote NUMA nodes.
> >
> > Probably a lot of time is spent in some memsets, and RMW cycles which
> > are setting page bits, which are intrinsically synchronous, so the
> > initialising core can't get to 12 or so outstanding memory
> > transactions.
> >
> > Since EFI memory ranges have a flag to state if they are zeroed (which
> > may be a fair assumption for memory on non-bootstrap processor NUMA
> > nodes), we can probably collapse the RMWs to just writes.
> >
> > A normal write will require a coherency cycle, then a fetch and a
> > writeback when it's evicted from the cache. For this purpose,
> > non-temporal writes would eliminate the cache line fetch and give a
> > massive increase in bandwidth. We wouldn't even need a store-fence as
> > the initialising core is the only one online.
>
> Could you elaborate a bit more, or suggest a specific area to look at?
>
> After some experiments with trying to just set some fields in the struct
> page directly, I haven't been able to produce any improvements. Of
> course, there is a lot about this area that I don't have much experience
> with.
Any such improvement will at most be in the 10-20% range.
I'd suggest first concentrating on the 1000-fold boot time initialization
speedup that the buddy allocator delayed initialization can offer, and
speeding up whatever remains after that stage - in a much more
development-friendly environment. (You'll be able to run 'perf record
./calloc-1TB' after bootup and get meaningful results, etc.)
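Such a test program might look roughly like this (the file name and
sizes are illustrative):

	/*
	 * calloc-1TB.c: touch one byte per page of a large calloc()ed
	 * region so that first-touch population - and, with the
	 * proposed scheme, on-demand struct page initialization -
	 * shows up under 'perf record'.
	 */
	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
		size_t bytes = 64UL << 30;	/* 64 GB - scale to the machine */
		size_t page = 4096;
		size_t i;
		char *buf = calloc(1, bytes);

		if (!buf) {
			perror("calloc");
			return 1;
		}

		for (i = 0; i < bytes; i += page)
			buf[i] = 1;

		printf("touched %zu pages\n", bytes / page);
		return 0;
	}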
Thanks,
Ingo
* Re: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator
2013-06-29 7:24 ` Ingo Molnar
@ 2013-06-29 18:03 ` Nathan Zimmer
0 siblings, 0 replies; 9+ messages in thread
From: Nathan Zimmer @ 2013-06-29 18:03 UTC
To: Ingo Molnar
Cc: Nathan Zimmer, Daniel J Blueman, Andrew Morton, Mike Travis,
H. Peter Anvin, holt, rob, Thomas Gleixner, Ingo Molnar, yinghai,
Greg KH, x86, linux-doc, Linux Kernel, Linus Torvalds,
Peter Zijlstra, Steffen Persvold
On Sat, Jun 29, 2013 at 09:24:41AM +0200, Ingo Molnar wrote:
>
> * Nathan Zimmer <nzimmer@sgi.com> wrote:
>
> > On 06/26/2013 10:35 PM, Daniel J Blueman wrote:
> > >On Wednesday, June 26, 2013 9:30:02 PM UTC+8, Andrew Morton wrote:
> > >>
> > >> On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar
> > >> <mi...@kernel.org> wrote:
> > >>
> > >> > except that on 32 TB
> > >> > systems we don't spend ~2 hours initializing 8,589,934,592
> > >> > page heads.
> > >>
> > >> That's about a million a second which is crazy slow - even my
> > >> prehistoric desktop
> > >> is 100x faster than that.
> > >>
> > >> Where's all this time actually being spent?
> > >
> > > The complexity of a directory-lookup architecture to make the
> > > (intrinsically unscalable) cache-coherency protocol scalable gives you
> > > a ~1us roundtrip to remote NUMA nodes.
> > >
> > > Probably a lot of time is spent in some memsets, and RMW cycles which
> > > are setting page bits, which are intrinsically synchronous, so the
> > > initialising core can't get to 12 or so outstanding memory
> > > transactions.
> > >
> > > Since EFI memory ranges have a flag to state if they are zeroed (which
> > > may be a fair assumption for memory on non-bootstrap processor NUMA
> > > nodes), we can probably collapse the RMWs to just writes.
> > >
> > > A normal write will require a coherency cycle, then a fetch and a
> > > writeback when it's evicted from the cache. For this purpose,
> > > non-temporal writes would eliminate the cache line fetch and give a
> > > massive increase in bandwidth. We wouldn't even need a store-fence as
> > > the initialising core is the only one online.
> >
> > Could you elaborate a bit more, or suggest a specific area to look at?
> >
> > After some experiments with trying to just set some fields in the struct
> > page directly, I haven't been able to produce any improvements. Of
> > course, there is a lot about this area that I don't have much experience
> > with.
>
> Any such improvement will at most be in the 10-20% range.
>
> I'd suggest first concentrating on the 1000-fold boot time initialization
> speedup that the buddy allocator delayed initialization can offer, and
> speeding up whatever remains after that stage - in a much more
> development-friendly environment. (You'll be able to run 'perf record
> ./calloc-1TB' after bootup and get meaningful results, etc.)
>
> Thanks,
>
> Ingo
I had been focusing on the bigger gains, but my attention had been diverted by
hope of an easy, albeit smaller, win.
I have been experimenting with the patch proper; I am just doing 2MB pages for
the moment. The improvement is vast. I'll worry about proper numbers once I
think I have a fully working patch.
Some progress is being made on the real patch. I think the memory is
being set up correctly: on aligned pages, the page is set up as normal,
plus the new PG_ flag is set.
Right now I am trying to sort out free_pages_prepare and free_pages_check.
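A rough sketch of that direction (PageInitialized(), init_single_page()
and the function itself are hypothetical names for illustration, not
the actual in-progress patch):

	/*
	 * Sketch: the free-path sanity checks must not treat a
	 * still-deferred page as a bad page, so complete its
	 * initialization before the usual flag/refcount checks run.
	 */
	static inline void fixup_deferred_page(struct page *page)
	{
		if (unlikely(!PageInitialized(page))) {
			init_single_page(page);
			SetPageInitialized(page);
		}
		/* ...existing free_pages_check() logic follows. */
	}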
Thanks,
Nate
Thread overview: 9+ messages
2013-06-27 3:35 [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator Daniel J Blueman
2013-06-28 20:37 ` Nathan Zimmer
2013-06-29 7:24 ` Ingo Molnar
2013-06-29 18:03 ` Nathan Zimmer
-- strict thread matches above, loose matches on Subject: below --
2013-06-21 16:25 [RFC 0/2] Delay initializing of large sections of memory Nathan Zimmer
2013-06-21 16:25 ` [RFC 2/2] x86_64, mm: Reinsert the absent memory Nathan Zimmer
2013-06-23 9:28 ` Ingo Molnar
2013-06-24 20:36 ` Nathan Zimmer
2013-06-25 7:38 ` Ingo Molnar
2013-06-25 17:22 ` Mike Travis
2013-06-25 18:43 ` H. Peter Anvin
2013-06-25 18:51 ` Mike Travis
2013-06-26 9:22 ` [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator Ingo Molnar
2013-06-26 13:28 ` Andrew Morton
2013-06-26 13:37 ` Ingo Molnar
2013-06-26 15:02 ` Nathan Zimmer
2013-06-26 16:15 ` Mike Travis