* Re: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator
@ 2013-06-27  3:35 Daniel J Blueman
  2013-06-28 20:37 ` Nathan Zimmer
  0 siblings, 1 reply; 9+ messages in thread

From: Daniel J Blueman @ 2013-06-27  3:35 UTC (permalink / raw)
To: Andrew Morton
Cc: Mike Travis, H. Peter Anvin, Nathan Zimmer, holt, rob,
    Thomas Gleixner, Ingo Molnar, yinghai, Greg KH, x86, linux-doc,
    Linux Kernel, Linus Torvalds, Peter Zijlstra, Steffen Persvold

On Wednesday, June 26, 2013 9:30:02 PM UTC+8, Andrew Morton wrote:
>
> On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar <mi...@kernel.org> wrote:
>
> > except that on 32 TB systems we don't spend ~2 hours initializing
> > 8,589,934,592 page heads.
>
> That's about a million a second, which is crazy slow - even my
> prehistoric desktop is 100x faster than that.
>
> Where's all this time actually being spent?

The complexity of a directory-lookup architecture to make the
(intrinsically unscalable) cache-coherency protocol scalable gives you a
~1us roundtrip to remote NUMA nodes.

A lot of time is probably spent in memsets and in the RMW cycles that set
page bits; those are intrinsically synchronous, so the initialising core
can't keep its 12 or so memory transactions outstanding.

Since EFI memory ranges have a flag stating whether they are zeroed
(which may be a fair assumption for memory on non-bootstrap-processor
NUMA nodes), we can probably collapse the RMWs to plain writes.

A normal write requires a coherency cycle, then a fetch, and a writeback
when the line is evicted from the cache. For this purpose, non-temporal
writes would eliminate the cache-line fetch and give a massive increase
in bandwidth. We wouldn't even need a store-fence, as the initialising
core is the only one online.

Daniel
-- 
Daniel J Blueman
Principal Software Engineer, Numascale Asia

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator 2013-06-27 3:35 [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator Daniel J Blueman @ 2013-06-28 20:37 ` Nathan Zimmer 2013-06-29 7:24 ` Ingo Molnar 0 siblings, 1 reply; 9+ messages in thread From: Nathan Zimmer @ 2013-06-28 20:37 UTC (permalink / raw) To: Daniel J Blueman Cc: Andrew Morton, Mike Travis, H. Peter Anvin, holt, rob, Thomas Gleixner, Ingo Molnar, yinghai, Greg KH, x86, linux-doc, Linux Kernel, Linus Torvalds, Peter Zijlstra, Steffen Persvold On 06/26/2013 10:35 PM, Daniel J Blueman wrote: > On Wednesday, June 26, 2013 9:30:02 PM UTC+8, Andrew Morton wrote: > > > > On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar <mi...@kernel.org> > wrote: > > > > > except that on 32 TB > > > systems we don't spend ~2 hours initializing 8,589,934,592 page > heads. > > > > That's about a million a second which is crazy slow - even my > prehistoric desktop > > is 100x faster than that. > > > > Where's all this time actually being spent? > > The complexity of a directory-lookup architecture to make the > (intrinsically unscalable) cache-coherency protocol scalable gives you > a ~1us roundtrip to remote NUMA nodes. > > Probably a lot of time is spent in some memsets, and RMW cycles which > are setting page bits, which are intrinsically synchronous, so the > initialising core can't get to 12 or so outstanding memory transactions. > > Since EFI memory ranges have a flag to state if they are zerod (which > may be a fair assumption for memory on non-bootstrap processor NUMA > nodes), we can probably collapse the RMWs to just writes. > > A normal write will require a coherency cycle, then a fetch and a > writeback when it's evicted from the cache. For this purpose, > non-temporal writes would eliminate the cache line fetch and give a > massive increase in bandwidth. 
> We wouldn't even need a store-fence as the initialising core is the
> only one online.
>
> Daniel

Could you elaborate a bit more? Or suggest a specific area to look at?

After some experiments with trying to just set some fields in the struct
page directly, I haven't been able to produce any improvements. Of
course, there is a lot about this area that I don't have much experience
with.

Nate

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator 2013-06-28 20:37 ` Nathan Zimmer @ 2013-06-29 7:24 ` Ingo Molnar 2013-06-29 18:03 ` Nathan Zimmer 0 siblings, 1 reply; 9+ messages in thread From: Ingo Molnar @ 2013-06-29 7:24 UTC (permalink / raw) To: Nathan Zimmer Cc: Daniel J Blueman, Andrew Morton, Mike Travis, H. Peter Anvin, holt, rob, Thomas Gleixner, Ingo Molnar, yinghai, Greg KH, x86, linux-doc, Linux Kernel, Linus Torvalds, Peter Zijlstra, Steffen Persvold * Nathan Zimmer <nzimmer@sgi.com> wrote: > On 06/26/2013 10:35 PM, Daniel J Blueman wrote: > >On Wednesday, June 26, 2013 9:30:02 PM UTC+8, Andrew Morton wrote: > >> > >> On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar > ><mi...@kernel.org> wrote: > >> > >> > except that on 32 TB > >> > systems we don't spend ~2 hours initializing 8,589,934,592 > >page heads. > >> > >> That's about a million a second which is crazy slow - even my > >prehistoric desktop > >> is 100x faster than that. > >> > >> Where's all this time actually being spent? > > > > The complexity of a directory-lookup architecture to make the > > (intrinsically unscalable) cache-coherency protocol scalable gives you > > a ~1us roundtrip to remote NUMA nodes. > > > > Probably a lot of time is spent in some memsets, and RMW cycles which > > are setting page bits, which are intrinsically synchronous, so the > > initialising core can't get to 12 or so outstanding memory > > transactions. > > > > Since EFI memory ranges have a flag to state if they are zerod (which > > may be a fair assumption for memory on non-bootstrap processor NUMA > > nodes), we can probably collapse the RMWs to just writes. > > > > A normal write will require a coherency cycle, then a fetch and a > > writeback when it's evicted from the cache. For this purpose, > > non-temporal writes would eliminate the cache line fetch and give a > > massive increase in bandwidth. 
> > We wouldn't even need a store-fence as the initialising core is the
> > only one online.
>
> Could you elaborate a bit more? or suggest a specific area to look at?
>
> After some experiments with trying to just set some fields in the
> struct page directly I haven't been able to produce any improvements.
> Of course there is lots about the area which I don't have much
> experience with.

Any such improvement will at most be in the 10-20% range.

I'd suggest first concentrating on the 1000-fold boot-time
initialization speedup that the buddy allocator's delayed initialization
can offer, and then speeding up whatever remains after that stage - in a
much more development-friendly environment. (You'll be able to run 'perf
record ./calloc-1TB' after bootup and get meaningful results, etc.)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator 2013-06-29 7:24 ` Ingo Molnar @ 2013-06-29 18:03 ` Nathan Zimmer 0 siblings, 0 replies; 9+ messages in thread From: Nathan Zimmer @ 2013-06-29 18:03 UTC (permalink / raw) To: Ingo Molnar Cc: Nathan Zimmer, Daniel J Blueman, Andrew Morton, Mike Travis, H. Peter Anvin, holt, rob, Thomas Gleixner, Ingo Molnar, yinghai, Greg KH, x86, linux-doc, Linux Kernel, Linus Torvalds, Peter Zijlstra, Steffen Persvold On Sat, Jun 29, 2013 at 09:24:41AM +0200, Ingo Molnar wrote: > > * Nathan Zimmer <nzimmer@sgi.com> wrote: > > > On 06/26/2013 10:35 PM, Daniel J Blueman wrote: > > >On Wednesday, June 26, 2013 9:30:02 PM UTC+8, Andrew Morton wrote: > > >> > > >> On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar > > ><mi...@kernel.org> wrote: > > >> > > >> > except that on 32 TB > > >> > systems we don't spend ~2 hours initializing 8,589,934,592 > > >page heads. > > >> > > >> That's about a million a second which is crazy slow - even my > > >prehistoric desktop > > >> is 100x faster than that. > > >> > > >> Where's all this time actually being spent? > > > > > > The complexity of a directory-lookup architecture to make the > > > (intrinsically unscalable) cache-coherency protocol scalable gives you > > > a ~1us roundtrip to remote NUMA nodes. > > > > > > Probably a lot of time is spent in some memsets, and RMW cycles which > > > are setting page bits, which are intrinsically synchronous, so the > > > initialising core can't get to 12 or so outstanding memory > > > transactions. > > > > > > Since EFI memory ranges have a flag to state if they are zerod (which > > > may be a fair assumption for memory on non-bootstrap processor NUMA > > > nodes), we can probably collapse the RMWs to just writes. > > > > > > A normal write will require a coherency cycle, then a fetch and a > > > writeback when it's evicted from the cache. 
For this purpose,
> > > non-temporal writes would eliminate the cache line fetch and give a
> > > massive increase in bandwidth. We wouldn't even need a store-fence
> > > as the initialising core is the only one online.
> >
> > Could you elaborate a bit more? or suggest a specific area to look at?
> >
> > After some experiments with trying to just set some fields in the
> > struct page directly I haven't been able to produce any improvements.
> > Of course there is lots about the area which I don't have much
> > experience with.
>
> Any such improvement will at most be in the 10-20% range.
>
> I'd suggest first concentrating on the 1000-fold boot time
> initialization speedup that the buddy allocator delayed initialization
> can offer, and speeding up whatever remains after that stage - in a
> much more development-friendly environment. (You'll be able to run
> 'perf record ./calloc-1TB' after bootup and get meaningful results,
> etc.)
>
> Thanks,
>
> Ingo

I had been focusing on the bigger gains, but my attention was diverted by
the hope of an easy, albeit smaller, win.

I have been experimenting with the patch proper; I am just doing 2MB
pages for the moment. The improvement is vast - I'll worry about proper
numbers once I think I have a fully working patch.

Some progress is being made on the real patch. I think the memory is
being set up correctly: on aligned pages, we set up the page as normal
plus set a new PG_ flag. Right now I am trying to sort out
free_pages_prepare and free_pages_check.

Thanks,
Nate

^ permalink raw reply	[flat|nested] 9+ messages in thread
* [RFC 0/2] Delay initializing of large sections of memory
@ 2013-06-21 16:25 Nathan Zimmer
  2013-06-21 16:25 ` [RFC 2/2] x86_64, mm: Reinsert the absent memory Nathan Zimmer
  0 siblings, 1 reply; 9+ messages in thread

From: Nathan Zimmer @ 2013-06-21 16:25 UTC (permalink / raw)
Cc: holt, travis, nzimmer, rob, tglx, mingo, hpa, yinghai, akpm,
    gregkh, x86, linux-doc, linux-kernel

This RFC patch set delays initializing large sections of memory until we
have started CPUs, which has the effect of reducing startup times on
large memory systems. On 16TB it can take over an hour to boot, and most
of that time is spent initializing memory. We avoid that bottleneck by
delaying initialization until after we have started multiple CPUs and
can initialize in a multithreaded manner. This allows us to actually
reduce boot time rather than just moving around the point of
initialization.

Mike and I have worked on this set for a while, with him doing most of
the heavy lifting, and are eager for some feedback.

Mike Travis (2):
  x86_64, mm: Delay initializing large portion of memory
  x86_64, mm: Reinsert the absent memory

 Documentation/kernel-parameters.txt |  15 ++
 arch/x86/Kconfig                    |  10 ++
 arch/x86/include/asm/e820.h         |  16 +-
 arch/x86/kernel/e820.c              | 292 +++++++++++++++++++++++++++++++++++-
 drivers/base/memory.c               |  83 ++++++++++
 include/linux/memory.h              |   5 +
 6 files changed, 413 insertions(+), 8 deletions(-)

-- 
1.8.2.1

^ permalink raw reply	[flat|nested] 9+ messages in thread
* [RFC 2/2] x86_64, mm: Reinsert the absent memory 2013-06-21 16:25 [RFC 0/2] Delay initializing of large sections of memory Nathan Zimmer @ 2013-06-21 16:25 ` Nathan Zimmer 2013-06-23 9:28 ` Ingo Molnar 0 siblings, 1 reply; 9+ messages in thread From: Nathan Zimmer @ 2013-06-21 16:25 UTC (permalink / raw) Cc: holt, travis, nzimmer, rob, tglx, mingo, hpa, yinghai, akpm, gregkh, x86, linux-doc, linux-kernel The memory we set aside in the previous patch needs to be reinserted. We start this process via late_initcall so we will have multiple cpus to do the work. Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Yinghai Lu <yinghai@kernel.org> --- arch/x86/kernel/e820.c | 129 +++++++++++++++++++++++++++++++++++++++++++++++++ drivers/base/memory.c | 83 +++++++++++++++++++++++++++++++ include/linux/memory.h | 5 ++ 3 files changed, 217 insertions(+) diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c index 3752dc5..d31039d 100644 --- a/arch/x86/kernel/e820.c +++ b/arch/x86/kernel/e820.c @@ -23,6 +23,7 @@ #ifdef CONFIG_DELAY_MEM_INIT #include <linux/memory.h> +#include <linux/delay.h> #endif #include <asm/e820.h> @@ -397,6 +398,22 @@ static u64 min_region_size; /* min size of region to slice from */ static u64 pre_region_size; /* multiply bsize for node low memory */ static u64 post_region_size; /* multiply bsize for node high memory */ +static unsigned long add_absent_work_start_time; +static unsigned long add_absent_work_stop_time; +static unsigned int add_absent_job_count; +static atomic_t add_absent_work_count; + +struct absent_work { + struct work_struct work; + struct absent_work *next; + atomic_t busy; + int cpu; + int node; + int index; +}; +static DEFINE_PER_CPU(struct absent_work, absent_work); +static 
struct absent_work *first_absent_work; + static int __init setup_delay_mem_init(char *str) { int bbits, mpnbits, minmult, premult, postmult; @@ -527,6 +544,118 @@ int __init sanitize_e820_map(struct e820entry *biosmap, int max_nr_map, } return ret; } + +/* Assign a cpu for this memory chunk and get the per_cpu absent_work struct */ +static struct absent_work *get_absent_work(int node) +{ + int cpu; + + for_each_cpu(cpu, cpumask_of_node(node)) { + struct absent_work *aws = &per_cpu(absent_work, cpu); + if (aws->node) + continue; + aws->cpu = cpu; + aws->node = node; + return aws; + } + + /* (if this becomes a problem, we can use a cpu on another node) */ + pr_crit("e820: No CPU on Node %d to schedule absent_work\n", node); + return NULL; +} + +/* Count of 'not done' processes */ +static int count_absent_work_notdone(void) +{ + struct absent_work *aws; + int notdone = 0; + + for (aws = first_absent_work; aws; aws = aws->next) + if (atomic_read(&aws->busy) < 2) + notdone++; + + return notdone; +} + +/* The absent_work thread */ +static void add_absent_memory_work(struct work_struct *work) +{ + struct absent_work *aws; + u64 phys_addr, size; + int ret; + + aws = container_of(work, struct absent_work, work); + + phys_addr = e820_absent.map[aws->index].addr; + size = e820_absent.map[aws->index].size; + ret = memory_add_absent(aws->node, phys_addr, size); + if (ret) + pr_crit("e820: Error %d adding absent memory %llx %llx (%d)\n", + ret, phys_addr, size, aws->node); + + atomic_set(&aws->busy, 2); + atomic_dec(&add_absent_work_count); + + /* if no one is waiting, then snap stop time */ + if (!count_absent_work_notdone()) + add_absent_work_stop_time = get_seconds(); +} + +/* Initialize absent_work threads */ +static int add_absent_memory(void) +{ + struct absent_work *aws = NULL; + int cpu, i; + + add_absent_work_start_time = get_seconds(); + add_absent_work_stop_time = 0; + atomic_set(&add_absent_work_count, 0); + + for_each_online_cpu(cpu) { + struct absent_work *aws = 
&per_cpu(absent_work, cpu); + aws->node = 0; + } + + /* setup each work thread */ + for (i = 0; i < e820_absent.nr_map; i++) { + u64 phys_addr = e820_absent.map[i].addr; + int node = memory_add_physaddr_to_nid(phys_addr); + + if (!node_online(node)) + continue; + + if (!aws) { + aws = get_absent_work(node); + first_absent_work = aws; + } else { + aws->next = get_absent_work(node); + aws = aws->next; + } + + if (!aws) + continue; + + INIT_WORK(&aws->work, add_absent_memory_work); + atomic_set(&aws->busy, 0); + aws->index = i; + + /* schedule absent_work thread */ + if (!schedule_work_on(aws->cpu, &aws->work)) + BUG(); + } + + + pr_info("e820: Add absent memory started\n"); + + return 0; +} + +/* Called during bootup to start adding absent_mem early */ +static int absent_memory_init(void) +{ + return add_absent_memory(); +} +late_initcall(absent_memory_init); #endif /* CONFIG_DELAY_MEM_INIT */ static int __init __append_e820_map(struct e820entry *biosmap, int nr_map) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index 14f8a69..5b4245a 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -442,6 +442,89 @@ static inline int memory_probe_init(void) } #endif +#ifdef CONFIG_DELAY_MEM_INIT +static struct memory_block *memory_get_block(u64 phys_addr, + struct memory_block *last_mem_blk) +{ + unsigned long pfn = phys_addr >> PAGE_SHIFT; + struct memory_block *mem_blk = NULL; + struct mem_section *mem_sect; + unsigned long section_nr = pfn_to_section_nr(pfn); + + if (!present_section_nr(section_nr)) + return NULL; + + mem_sect = __nr_to_section(section_nr); + mem_blk = find_memory_block_hinted(mem_sect, last_mem_blk); + return mem_blk; +} + +/* addr and size must be aligned on memory_block_size boundaries */ +int memory_add_absent(int nid, u64 phys_addr, u64 size) +{ + struct memory_block *mem = NULL; + struct page *first_page; + unsigned long block_sz; + unsigned long nr_pages; + unsigned long start_pfn; + int ret; + + block_sz = 
get_memory_block_size(); + if (phys_addr & (block_sz - 1) || size & (block_sz - 1)) + return -EINVAL; + + /* memory already present? */ + if (memory_get_block(phys_addr, NULL)) + return -EBUSY; + + ret = add_memory(nid, phys_addr, size); + if (ret) + return ret; + + /* grab first block to use for onlining process */ + mem = memory_get_block(phys_addr, NULL); + if (!mem) + return -ENOMEM; + + first_page = pfn_to_page(mem->start_section_nr << PFN_SECTION_SHIFT); + start_pfn = page_to_pfn(first_page); + nr_pages = size >> PAGE_SHIFT; + + ret = online_pages(start_pfn, nr_pages, ONLINE_KEEP); + if (ret) + return ret; + + for (;;) { + /* we already have first block from above */ + mutex_lock(&mem->state_mutex); + if (mem->state == MEM_OFFLINE) { + mem->state = MEM_ONLINE; + kobject_uevent(&mem->dev.kobj, KOBJ_ONLINE); + } + mutex_unlock(&mem->state_mutex); + + phys_addr += block_sz; + size -= block_sz; + if (!size) + break; + + mem = memory_get_block(phys_addr, mem); + if (mem) + continue; + + pr_err("memory_get_block failed at %llx\n", phys_addr); + return -EFAULT; + } + return 0; +} + +#else +static inline int start_add_absent_init(void) +{ + return 0; +} +#endif /* CONFIG_DELAY_MEM_INIT */ + #ifdef CONFIG_MEMORY_FAILURE /* * Support for offlining pages of memory diff --git a/include/linux/memory.h b/include/linux/memory.h index 85c31a8..a000c54 100644 --- a/include/linux/memory.h +++ b/include/linux/memory.h @@ -128,6 +128,11 @@ extern struct memory_block *find_memory_block(struct mem_section *); enum mem_add_context { BOOT, HOTPLUG }; #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */ +#ifdef CONFIG_DELAY_MEM_INIT +extern int memory_add_absent(int nid, u64 phys_addr, u64 size); +#endif + + #ifdef CONFIG_MEMORY_HOTPLUG #define hotplug_memory_notifier(fn, pri) ({ \ static __meminitdata struct notifier_block fn##_mem_nb =\ -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [RFC 2/2] x86_64, mm: Reinsert the absent memory 2013-06-21 16:25 ` [RFC 2/2] x86_64, mm: Reinsert the absent memory Nathan Zimmer @ 2013-06-23 9:28 ` Ingo Molnar 2013-06-24 20:36 ` Nathan Zimmer 0 siblings, 1 reply; 9+ messages in thread From: Ingo Molnar @ 2013-06-23 9:28 UTC (permalink / raw) To: Nathan Zimmer Cc: holt, travis, rob, tglx, mingo, hpa, yinghai, akpm, gregkh, x86, linux-doc, linux-kernel, Linus Torvalds, Peter Zijlstra * Nathan Zimmer <nzimmer@sgi.com> wrote: > The memory we set aside in the previous patch needs to be reinserted. > We start this process via late_initcall so we will have multiple cpus to do > the work. > > Signed-off-by: Mike Travis <travis@sgi.com> > Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> > Cc: Thomas Gleixner <tglx@linutronix.de> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: "H. Peter Anvin" <hpa@zytor.com> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Yinghai Lu <yinghai@kernel.org> > --- > arch/x86/kernel/e820.c | 129 +++++++++++++++++++++++++++++++++++++++++++++++++ > drivers/base/memory.c | 83 +++++++++++++++++++++++++++++++ > include/linux/memory.h | 5 ++ > 3 files changed, 217 insertions(+) > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c > index 3752dc5..d31039d 100644 > --- a/arch/x86/kernel/e820.c > +++ b/arch/x86/kernel/e820.c > @@ -23,6 +23,7 @@ > > #ifdef CONFIG_DELAY_MEM_INIT > #include <linux/memory.h> > +#include <linux/delay.h> > #endif > > #include <asm/e820.h> > @@ -397,6 +398,22 @@ static u64 min_region_size; /* min size of region to slice from */ > static u64 pre_region_size; /* multiply bsize for node low memory */ > static u64 post_region_size; /* multiply bsize for node high memory */ > > +static unsigned long add_absent_work_start_time; > +static unsigned long add_absent_work_stop_time; > +static unsigned int add_absent_job_count; > +static atomic_t add_absent_work_count; > + > +struct absent_work { > + struct 
work_struct work;
> +	struct absent_work *next;
> +	atomic_t busy;
> +	int cpu;
> +	int node;
> +	int index;
> +};
> +static DEFINE_PER_CPU(struct absent_work, absent_work);
> +static struct absent_work *first_absent_work;

That's 4.5 GB/sec initialization speed - that feels a bit slow, and the
boot time effect should be felt on smaller 'a couple of gigabytes'
desktop boxes as well. Do we know exactly where the 2 hours of boot time
on a 32 TB system is spent?

While you cannot profile the boot process (yet), you could try your
delayed patch and run a "perf record -g" call-graph profiling of the
late-time initialization routines. What does 'perf report' show?

Delayed initialization makes sense I guess, because 32 TB is a lot of
memory - I'm just wondering whether there are some low-hanging fruits
left in the mem init code; that code is certainly not optimized for
performance.

Plus, with a struct page size of around 64 bytes (?), 32 TB of RAM has
512 GB of struct page arrays alone. Initializing those will take quite
some time as well - and I suspect they are allocated by zeroing them
first. If that memset() exists then getting rid of it might be a good
move as well.

Yet another thing to consider would be to implement an initialization
speedup of 3 orders of magnitude: initialize at large page (2MB)
granularity and on-demand delay the initialization of the 4K-granular
struct pages [but still allocating them] - which I suspect are a good
chunk of the overhead. That way we could initialize in 2MB steps and
speed up the 2-hour bootup of 32 TB of RAM to 14 seconds...

[ The cost would be one more branch in the buddy allocator, to detect
  not-yet-initialized 2 MB chunks as we encounter them. Acceptable I
  think. ]

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [RFC 2/2] x86_64, mm: Reinsert the absent memory
  2013-06-23  9:28 ` Ingo Molnar
@ 2013-06-24 20:36 ` Nathan Zimmer
  2013-06-25  7:38   ` Ingo Molnar
  0 siblings, 1 reply; 9+ messages in thread

From: Nathan Zimmer @ 2013-06-24 20:36 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nathan Zimmer, holt, travis, rob, tglx, mingo, hpa, yinghai, akpm,
    gregkh, x86, linux-doc, linux-kernel, Linus Torvalds, Peter Zijlstra

On Sun, Jun 23, 2013 at 11:28:40AM +0200, Ingo Molnar wrote:
>
> That's 4.5 GB/sec initialization speed - that feels a bit slow and the
> boot time effect should be felt on smaller 'a couple of gigabytes'
> desktop boxes as well. Do we know exactly where the 2 hours of boot
> time on a 32 TB system is spent?

There are several other spots that could be improved on a large system,
but memory initialization is by far the biggest.

> While you cannot profile the boot process (yet), you could try your
> delayed patch and run a "perf record -g" call-graph profiling of the
> late-time initialization routines. What does 'perf report' show?

I have some data from earlier runs. memmap_init_zone was the biggest
hitter by far. Parts of it are certainly low-hanging fruit,
set_pageblock_migratetype for example. However, it seems that for a
larger system SetPageReserved will be the largest consumer of cycles. On
a 1TB system I just booted, it was around 50% of the time spent in
memmap_init_zone.

perf seems to struggle with 512 cpus, but I did get some data. It
indicates something similar to what I found in earlier experiments: lots
of time in memmap_init_zone. Some cpus are waiting on locks; this guy
seems to be representative of that.
-   0.14%  kworker/160:1  [kernel.kallsyms]  [k] mspin_lock
   + mspin_lock
   + __mutex_lock_slowpath
   - mutex_lock
      - 99.69% online_pages

> Delayed initialization makes sense I guess because 32 TB is a lot of
> memory - I'm just wondering whether there's some low hanging fruits
> left in the mem init code, that code is certainly not optimized for
> performance.
>
> Plus with a struct page size of around 64 bytes (?) 32 TB of RAM has
> 512 GB of struct page arrays alone. Initializing those will take quite
> some time as well - and I suspect they are allocated via zeroing them
> first. If that memset() exists then getting rid of it might be a good
> move as well.
>
> Yet another thing to consider would be to implement an initialization
> speedup of 3 orders of magnitude: initialize on the large page (2MB)
> granularity and on-demand delay the initialization of the 4K granular
> struct pages [but still allocating them] - which I suspect are a good
> chunk of the overhead? That way we could initialize in 2MB steps and
> speed up the 2 hours bootup of 32 TB of RAM to 14 seconds...
>
> [ The cost would be one more branch in the buddy allocator, to detect
>   not-yet-initialized 2 MB chunks as we encounter them. Acceptable I
>   think. ]
>
> Thanks,
>
> Ingo

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [RFC 2/2] x86_64, mm: Reinsert the absent memory 2013-06-24 20:36 ` Nathan Zimmer @ 2013-06-25 7:38 ` Ingo Molnar 2013-06-25 17:22 ` Mike Travis 0 siblings, 1 reply; 9+ messages in thread From: Ingo Molnar @ 2013-06-25 7:38 UTC (permalink / raw) To: Nathan Zimmer Cc: holt, travis, rob, tglx, mingo, hpa, yinghai, akpm, gregkh, x86, linux-doc, linux-kernel, Linus Torvalds, Peter Zijlstra * Nathan Zimmer <nzimmer@sgi.com> wrote: > On Sun, Jun 23, 2013 at 11:28:40AM +0200, Ingo Molnar wrote: > > > > That's 4.5 GB/sec initialization speed - that feels a bit slow and the > > boot time effect should be felt on smaller 'a couple of gigabytes' > > desktop boxes as well. Do we know exactly where the 2 hours of boot > > time on a 32 TB system is spent? > > There are other several spots that could be improved on a large system > but memory initialization is by far the biggest. My feeling is that deferred/on-demand initialization triggered from the buddy allocator is the better long term solution. That will also make it much easier to profile/test memory init performance: boot up a large system and run a simple testprogram that allocates a lot of RAM. ( It will also make people want to optimize the initialization sequence better, as it will be part of any freshly booted system's memory allocation overhead. ) Thanks, Ingo ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC 2/2] x86_64, mm: Reinsert the absent memory 2013-06-25 7:38 ` Ingo Molnar @ 2013-06-25 17:22 ` Mike Travis 2013-06-25 18:43 ` H. Peter Anvin 0 siblings, 1 reply; 9+ messages in thread From: Mike Travis @ 2013-06-25 17:22 UTC (permalink / raw) To: Ingo Molnar Cc: Nathan Zimmer, holt, rob, tglx, mingo, hpa, yinghai, akpm, gregkh, x86, linux-doc, linux-kernel, Linus Torvalds, Peter Zijlstra On 6/25/2013 12:38 AM, Ingo Molnar wrote: > > * Nathan Zimmer <nzimmer@sgi.com> wrote: > >> On Sun, Jun 23, 2013 at 11:28:40AM +0200, Ingo Molnar wrote: >>> >>> That's 4.5 GB/sec initialization speed - that feels a bit slow and the >>> boot time effect should be felt on smaller 'a couple of gigabytes' >>> desktop boxes as well. Do we know exactly where the 2 hours of boot >>> time on a 32 TB system is spent? >> >> There are other several spots that could be improved on a large system >> but memory initialization is by far the biggest. > > My feeling is that deferred/on-demand initialization triggered from the > buddy allocator is the better long term solution. I haven't caught up with all of Nathan's changes yet (just got back from vacation), but there was an option to either start the memory insertion on boot, or trigger it later using the /sys/.../memory interface. There is also a monitor program that calculates the memory insertion rate. This was extremely useful to determine how changes in the kernel affected the rate. > > That will also make it much easier to profile/test memory init > performance: boot up a large system and run a simple testprogram that > allocates a lot of RAM. > > ( It will also make people want to optimize the initialization sequence > better, as it will be part of any freshly booted system's memory > allocation overhead. ) > > Thanks, > > Ingo > ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC 2/2] x86_64, mm: Reinsert the absent memory 2013-06-25 17:22 ` Mike Travis @ 2013-06-25 18:43 ` H. Peter Anvin 2013-06-25 18:51 ` Mike Travis 0 siblings, 1 reply; 9+ messages in thread From: H. Peter Anvin @ 2013-06-25 18:43 UTC (permalink / raw) To: Mike Travis Cc: Ingo Molnar, Nathan Zimmer, holt, rob, tglx, mingo, yinghai, akpm, gregkh, x86, linux-doc, linux-kernel, Linus Torvalds, Peter Zijlstra On 06/25/2013 10:22 AM, Mike Travis wrote: > > On 6/25/2013 12:38 AM, Ingo Molnar wrote: >> >> * Nathan Zimmer <nzimmer@sgi.com> wrote: >> >>> On Sun, Jun 23, 2013 at 11:28:40AM +0200, Ingo Molnar wrote: >>>> >>>> That's 4.5 GB/sec initialization speed - that feels a bit slow and the >>>> boot time effect should be felt on smaller 'a couple of gigabytes' >>>> desktop boxes as well. Do we know exactly where the 2 hours of boot >>>> time on a 32 TB system is spent? >>> >>> There are other several spots that could be improved on a large system >>> but memory initialization is by far the biggest. >> >> My feeling is that deferred/on-demand initialization triggered from the >> buddy allocator is the better long term solution. > > I haven't caught up with all of Nathan's changes yet (just > got back from vacation), but there was an option to either > start the memory insertion on boot, or trigger it later > using the /sys/.../memory interface. There is also a monitor > program that calculates the memory insertion rate. This was > extremely useful to determine how changes in the kernel > affected the rate. > Sorry, I *totally* did not follow that comment. It seemed like a complete non-sequitur? -hpa ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC 2/2] x86_64, mm: Reinsert the absent memory 2013-06-25 18:43 ` H. Peter Anvin @ 2013-06-25 18:51 ` Mike Travis 2013-06-26 9:22 ` [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator Ingo Molnar 0 siblings, 1 reply; 9+ messages in thread From: Mike Travis @ 2013-06-25 18:51 UTC (permalink / raw) To: H. Peter Anvin Cc: Ingo Molnar, Nathan Zimmer, holt, rob, tglx, mingo, yinghai, akpm, gregkh, x86, linux-doc, linux-kernel, Linus Torvalds, Peter Zijlstra On 6/25/2013 11:43 AM, H. Peter Anvin wrote: > On 06/25/2013 10:22 AM, Mike Travis wrote: >> >> On 6/25/2013 12:38 AM, Ingo Molnar wrote: >>> >>> * Nathan Zimmer <nzimmer@sgi.com> wrote: >>> >>>> On Sun, Jun 23, 2013 at 11:28:40AM +0200, Ingo Molnar wrote: >>>>> >>>>> That's 4.5 GB/sec initialization speed - that feels a bit slow and the >>>>> boot time effect should be felt on smaller 'a couple of gigabytes' >>>>> desktop boxes as well. Do we know exactly where the 2 hours of boot >>>>> time on a 32 TB system is spent? >>>> >>>> There are other several spots that could be improved on a large system >>>> but memory initialization is by far the biggest. >>> >>> My feeling is that deferred/on-demand initialization triggered from the >>> buddy allocator is the better long term solution. >> >> I haven't caught up with all of Nathan's changes yet (just >> got back from vacation), but there was an option to either >> start the memory insertion on boot, or trigger it later >> using the /sys/.../memory interface. There is also a monitor >> program that calculates the memory insertion rate. This was >> extremely useful to determine how changes in the kernel >> affected the rate. >> > > Sorry, I *totally* did not follow that comment. It seemed like a > complete non-sequitur? > > -hpa It was I who was not following the question. I'm still reverting back to "work mode". 
[There is more code in a separate patch that Nate has not sent yet that instructs the kernel to start adding memory as early as possible, or not. That way you can start the insertion process later and monitor its progress to determine how changes in the kernel affect that process. It is controlled by a separate CONFIG option.] > > ^ permalink raw reply [flat|nested] 9+ messages in thread
* [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator 2013-06-25 18:51 ` Mike Travis @ 2013-06-26 9:22 ` Ingo Molnar 2013-06-26 13:28 ` Andrew Morton 0 siblings, 1 reply; 9+ messages in thread From: Ingo Molnar @ 2013-06-26 9:22 UTC (permalink / raw) To: Mike Travis Cc: H. Peter Anvin, Nathan Zimmer, holt, rob, tglx, mingo, yinghai, akpm, gregkh, x86, linux-doc, linux-kernel, Linus Torvalds, Peter Zijlstra (Changed the subject, to make it more apparent what we are talking about.) * Mike Travis <travis@sgi.com> wrote: > On 6/25/2013 11:43 AM, H. Peter Anvin wrote: > > On 06/25/2013 10:22 AM, Mike Travis wrote: > >> > >> On 6/25/2013 12:38 AM, Ingo Molnar wrote: > >>> > >>> * Nathan Zimmer <nzimmer@sgi.com> wrote: > >>> > >>>> On Sun, Jun 23, 2013 at 11:28:40AM +0200, Ingo Molnar wrote: > >>>>> > >>>>> That's 4.5 GB/sec initialization speed - that feels a bit slow and the > >>>>> boot time effect should be felt on smaller 'a couple of gigabytes' > >>>>> desktop boxes as well. Do we know exactly where the 2 hours of boot > >>>>> time on a 32 TB system is spent? > >>>> > >>>> There are other several spots that could be improved on a large system > >>>> but memory initialization is by far the biggest. > >>> > >>> My feeling is that deferred/on-demand initialization triggered from the > >>> buddy allocator is the better long term solution. > >> > >> I haven't caught up with all of Nathan's changes yet (just > >> got back from vacation), but there was an option to either > >> start the memory insertion on boot, or trigger it later > >> using the /sys/.../memory interface. There is also a monitor > >> program that calculates the memory insertion rate. This was > >> extremely useful to determine how changes in the kernel > >> affected the rate. > >> > > > > Sorry, I *totally* did not follow that comment. It seemed like a > > complete non-sequitur? > > > > -hpa > > It was I who was not following the question. 
> I'm still reverting
> back to "work mode".
>
> [There is more code in a separate patch that Nate has not sent
> yet that instructs the kernel to start adding memory as early
> as possible, or not. That way you can start the insertion process
> later and monitor its progress to determine how changes in the
> kernel affect that process. It is controlled by a separate
> CONFIG option.]

So, just to repeat (and expand upon) the solution hpa and I suggest: it's not based on /sys, delayed initialization lists or any similar (essentially memory hot plug based) approach.

It's a transparent on-demand initialization scheme based on only initializing the very early memory setup in 1GB (2MB) steps (not in 4K steps like we do it today). Any subsequent split-up initialization is done on-demand, in alloc_pages() et al, initializing a batch of 512 (or 1024) struct page heads when an uninitialized portion is first encountered.

This leaves the principal logic of early init largely untouched: we still have the same amount of RAM during and after bootup, except that on 32 TB systems we don't spend ~2 hours initializing 8,589,934,592 page heads.

This scheme could be implemented by introducing a new PG_initialized flag, which is seen by an unlikely() branch in alloc_pages() and which triggers the on-demand initialization of pages. [ It could probably be made zero-cost for the post-initialization state: we already check a bunch of rare PG_ flags, one more flag would not introduce any new branch in the page allocation hot path. ]

It's a technically different solution from what was submitted in this thread.

Cons:

- it works after bootup, via GFP. If done in a simple fashion it adds one more branch to the GFP fastpath. [ If done a bit more cleverly it can merge into an existing unlikely() branch and become essentially zero-cost for the fastpath. ]

- it adds an initialization non-determinism to GFP, to the tune of initializing ~512 page heads when RAM is utilized first.
- initialization is done when memory is needed - not during or shortly after bootup. This (slightly) increases first-use overhead. [I don't think this factor is significant - and I think we'll quickly see speedups to initialization, once the overhead becomes more easily measurable.]

Pros:

- it's transparent to the boot process. ('free' shows the same full amount of RAM all the time; there are no weird effects of RAM coming online asynchronously. You see all the RAM you have - etc.)

- it helps the boot time of every single Linux system, not just large-RAM ones. On a smallish, 4GB system memory init can take up precious hundreds of milliseconds, so this is a practical issue.

- it spreads initialization overhead to later portions of the system's lifetime, when there's typically more idle time and more parallelism available.

- initialization overhead, because it's a natural part of first-time memory allocation with this scheme, becomes more measurable (and thus more prominently optimized) than any deferred lists processed in the background.

- as an added bonus it probably speeds up your usecase even more than the patches you are providing: on a 32 TB system the primary initialization would only have to enumerate memory, allocate page heads and buddy bitmaps, and initialize the 1GB granular page heads: there are only 32768 of them.

So unless I overlooked some factor this scheme would be unconditional goodness for everyone.

Thanks,

Ingo

^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator 2013-06-26 9:22 ` [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator Ingo Molnar @ 2013-06-26 13:28 ` Andrew Morton 2013-06-26 13:37 ` Ingo Molnar 0 siblings, 1 reply; 9+ messages in thread From: Andrew Morton @ 2013-06-26 13:28 UTC (permalink / raw) To: Ingo Molnar Cc: Mike Travis, H. Peter Anvin, Nathan Zimmer, holt, rob, tglx, mingo, yinghai, gregkh, x86, linux-doc, linux-kernel, Linus Torvalds, Peter Zijlstra On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar <mingo@kernel.org> wrote: > except that on 32 TB > systems we don't spend ~2 hours initializing 8,589,934,592 page heads. That's about a million a second which is crazy slow - even my prehistoric desktop is 100x faster than that. Where's all this time actually being spent? ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator 2013-06-26 13:28 ` Andrew Morton @ 2013-06-26 13:37 ` Ingo Molnar 2013-06-26 15:02 ` Nathan Zimmer 2013-06-26 16:15 ` Mike Travis 0 siblings, 2 replies; 9+ messages in thread From: Ingo Molnar @ 2013-06-26 13:37 UTC (permalink / raw) To: Andrew Morton Cc: Mike Travis, H. Peter Anvin, Nathan Zimmer, holt, rob, tglx, mingo, yinghai, gregkh, x86, linux-doc, linux-kernel, Linus Torvalds, Peter Zijlstra * Andrew Morton <akpm@linux-foundation.org> wrote: > On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar <mingo@kernel.org> wrote: > > > except that on 32 TB > > systems we don't spend ~2 hours initializing 8,589,934,592 page heads. > > That's about a million a second which is crazy slow - even my > prehistoric desktop is 100x faster than that. > > Where's all this time actually being spent? See the earlier part of the thread - apparently it's spent initializing the page heads - remote NUMA node misses from a single boot CPU, going across a zillion cross-connects? I guess there's some other low hanging fruits as well - so making this easier to profile would be nice. The profile posted was not really usable. Btw., NUMA locality would be another advantage of on-demand initialization: actual users of RAM tend to allocate node-local (especially on large clusters), so any overhead will be naturally lower. Thanks, Ingo ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator 2013-06-26 13:37 ` Ingo Molnar @ 2013-06-26 15:02 ` Nathan Zimmer 2013-06-26 16:15 ` Mike Travis 0 siblings, 0 replies; 9+ messages in thread From: Nathan Zimmer @ 2013-06-26 15:02 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, Mike Travis, H. Peter Anvin, Nathan Zimmer, holt, rob, tglx, mingo, yinghai, gregkh, x86, linux-doc, linux-kernel, Linus Torvalds, Peter Zijlstra On Wed, Jun 26, 2013 at 03:37:15PM +0200, Ingo Molnar wrote: > > * Andrew Morton <akpm@linux-foundation.org> wrote: > > > On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar <mingo@kernel.org> wrote: > > > > > except that on 32 TB > > > systems we don't spend ~2 hours initializing 8,589,934,592 page heads. > > > > That's about a million a second which is crazy slow - even my > > prehistoric desktop is 100x faster than that. > > > > Where's all this time actually being spent? > > See the earlier part of the thread - apparently it's spent initializing > the page heads - remote NUMA node misses from a single boot CPU, going > across a zillion cross-connects? I guess there's some other low hanging > fruits as well - so making this easier to profile would be nice. The > profile posted was not really usable. > That is correct: from what I am seeing, using crude cycle counters, there is far more time spent on the later nodes, i.e. memory near the boot node is initialized a lot faster than remote memory. I think the other low-hanging fruits are currently being drowned out by the lack of locality. Nate > Btw., NUMA locality would be another advantage of on-demand > initialization: actual users of RAM tend to allocate node-local > (especially on large clusters), so any overhead will be naturally lower. > > Thanks, > > Ingo ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator 2013-06-26 13:37 ` Ingo Molnar 2013-06-26 15:02 ` Nathan Zimmer @ 2013-06-26 16:15 ` Mike Travis 1 sibling, 0 replies; 9+ messages in thread From: Mike Travis @ 2013-06-26 16:15 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, H. Peter Anvin, Nathan Zimmer, holt, rob, tglx, mingo, yinghai, gregkh, x86, linux-doc, linux-kernel, Linus Torvalds, Peter Zijlstra On 6/26/2013 6:37 AM, Ingo Molnar wrote: > > * Andrew Morton <akpm@linux-foundation.org> wrote: > >> On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar <mingo@kernel.org> wrote: >> >>> except that on 32 TB >>> systems we don't spend ~2 hours initializing 8,589,934,592 page heads. >> >> That's about a million a second which is crazy slow - even my >> prehistoric desktop is 100x faster than that. >> >> Where's all this time actually being spent? > > See the earlier part of the thread - apparently it's spent initializing > the page heads - remote NUMA node misses from a single boot CPU, going > across a zillion cross-connects? I guess there's some other low hanging > fruits as well - so making this easier to profile would be nice. The > profile posted was not really usable. This is one advantage of delayed memory init. I can do it under the profiler. I will put everything together to accomplish this and then send a perf report. > > Btw., NUMA locality would be another advantage of on-demand > initialization: actual users of RAM tend to allocate node-local > (especially on large clusters), so any overhead will be naturally lower. > > Thanks, > > Ingo > ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2013-06-29 18:03 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-06-27 3:35 [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator Daniel J Blueman 2013-06-28 20:37 ` Nathan Zimmer 2013-06-29 7:24 ` Ingo Molnar 2013-06-29 18:03 ` Nathan Zimmer -- strict thread matches above, loose matches on Subject: below -- 2013-06-21 16:25 [RFC 0/2] Delay initializing of large sections of memory Nathan Zimmer 2013-06-21 16:25 ` [RFC 2/2] x86_64, mm: Reinsert the absent memory Nathan Zimmer 2013-06-23 9:28 ` Ingo Molnar 2013-06-24 20:36 ` Nathan Zimmer 2013-06-25 7:38 ` Ingo Molnar 2013-06-25 17:22 ` Mike Travis 2013-06-25 18:43 ` H. Peter Anvin 2013-06-25 18:51 ` Mike Travis 2013-06-26 9:22 ` [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator Ingo Molnar 2013-06-26 13:28 ` Andrew Morton 2013-06-26 13:37 ` Ingo Molnar 2013-06-26 15:02 ` Nathan Zimmer 2013-06-26 16:15 ` Mike Travis