* Re: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator
@ 2013-06-27  3:35 Daniel J Blueman
  2013-06-28 20:37 ` Nathan Zimmer
  0 siblings, 1 reply; 9+ messages in thread

From: Daniel J Blueman @ 2013-06-27  3:35 UTC (permalink / raw)
To: Andrew Morton
Cc: Mike Travis, H. Peter Anvin, Nathan Zimmer, holt, rob,
    Thomas Gleixner, Ingo Molnar, yinghai, Greg KH, x86, linux-doc,
    Linux Kernel, Linus Torvalds, Peter Zijlstra, Steffen Persvold

On Wednesday, June 26, 2013 9:30:02 PM UTC+8, Andrew Morton wrote:
>
> On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar <mi...@kernel.org> wrote:
>
> > except that on 32 TB systems we don't spend ~2 hours initializing
> > 8,589,934,592 page heads.
>
> That's about a million a second, which is crazy slow - even my
> prehistoric desktop is 100x faster than that.
>
> Where's all this time actually being spent?

The complexity of a directory-lookup architecture to make the
(intrinsically unscalable) cache-coherency protocol scalable gives you a
~1us roundtrip to remote NUMA nodes.

A lot of time is probably spent in memsets and in the RMW cycles that set
page bits; those are intrinsically synchronous, so the initialising core
can't keep its 12 or so memory transactions outstanding.

Since EFI memory ranges have a flag stating whether they are zeroed
(which may be a fair assumption for memory on non-bootstrap-processor
NUMA nodes), we can probably collapse the RMWs to plain writes.

A normal write requires a coherency cycle, then a fetch, and a writeback
when the line is evicted from the cache. For this purpose, non-temporal
writes would eliminate the cache-line fetch and give a massive increase
in bandwidth. We wouldn't even need a store-fence, as the initialising
core is the only one online.

Daniel
-- 
Daniel J Blueman
Principal Software Engineer, Numascale Asia

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator 2013-06-27 3:35 [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator Daniel J Blueman @ 2013-06-28 20:37 ` Nathan Zimmer 2013-06-29 7:24 ` Ingo Molnar 0 siblings, 1 reply; 9+ messages in thread From: Nathan Zimmer @ 2013-06-28 20:37 UTC (permalink / raw) To: Daniel J Blueman Cc: Andrew Morton, Mike Travis, H. Peter Anvin, holt, rob, Thomas Gleixner, Ingo Molnar, yinghai, Greg KH, x86, linux-doc, Linux Kernel, Linus Torvalds, Peter Zijlstra, Steffen Persvold On 06/26/2013 10:35 PM, Daniel J Blueman wrote: > On Wednesday, June 26, 2013 9:30:02 PM UTC+8, Andrew Morton wrote: > > > > On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar <mi...@kernel.org> > wrote: > > > > > except that on 32 TB > > > systems we don't spend ~2 hours initializing 8,589,934,592 page > heads. > > > > That's about a million a second which is crazy slow - even my > prehistoric desktop > > is 100x faster than that. > > > > Where's all this time actually being spent? > > The complexity of a directory-lookup architecture to make the > (intrinsically unscalable) cache-coherency protocol scalable gives you > a ~1us roundtrip to remote NUMA nodes. > > Probably a lot of time is spent in some memsets, and RMW cycles which > are setting page bits, which are intrinsically synchronous, so the > initialising core can't get to 12 or so outstanding memory transactions. > > Since EFI memory ranges have a flag to state if they are zerod (which > may be a fair assumption for memory on non-bootstrap processor NUMA > nodes), we can probably collapse the RMWs to just writes. > > A normal write will require a coherency cycle, then a fetch and a > writeback when it's evicted from the cache. For this purpose, > non-temporal writes would eliminate the cache line fetch and give a > massive increase in bandwidth. 
> We wouldn't even need a store-fence as the initialising core is the
> only one online.
>
> Daniel

Could you elaborate a bit more? Or suggest a specific area to look at?

After some experiments with trying to just set some fields in the struct
page directly, I haven't been able to produce any improvements. Of
course, there is a lot about this area that I don't have much experience
with.

Nate

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator 2013-06-28 20:37 ` Nathan Zimmer @ 2013-06-29 7:24 ` Ingo Molnar 2013-06-29 18:03 ` Nathan Zimmer 0 siblings, 1 reply; 9+ messages in thread From: Ingo Molnar @ 2013-06-29 7:24 UTC (permalink / raw) To: Nathan Zimmer Cc: Daniel J Blueman, Andrew Morton, Mike Travis, H. Peter Anvin, holt, rob, Thomas Gleixner, Ingo Molnar, yinghai, Greg KH, x86, linux-doc, Linux Kernel, Linus Torvalds, Peter Zijlstra, Steffen Persvold * Nathan Zimmer <nzimmer@sgi.com> wrote: > On 06/26/2013 10:35 PM, Daniel J Blueman wrote: > >On Wednesday, June 26, 2013 9:30:02 PM UTC+8, Andrew Morton wrote: > >> > >> On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar > ><mi...@kernel.org> wrote: > >> > >> > except that on 32 TB > >> > systems we don't spend ~2 hours initializing 8,589,934,592 > >page heads. > >> > >> That's about a million a second which is crazy slow - even my > >prehistoric desktop > >> is 100x faster than that. > >> > >> Where's all this time actually being spent? > > > > The complexity of a directory-lookup architecture to make the > > (intrinsically unscalable) cache-coherency protocol scalable gives you > > a ~1us roundtrip to remote NUMA nodes. > > > > Probably a lot of time is spent in some memsets, and RMW cycles which > > are setting page bits, which are intrinsically synchronous, so the > > initialising core can't get to 12 or so outstanding memory > > transactions. > > > > Since EFI memory ranges have a flag to state if they are zerod (which > > may be a fair assumption for memory on non-bootstrap processor NUMA > > nodes), we can probably collapse the RMWs to just writes. > > > > A normal write will require a coherency cycle, then a fetch and a > > writeback when it's evicted from the cache. For this purpose, > > non-temporal writes would eliminate the cache line fetch and give a > > massive increase in bandwidth. 
> > We wouldn't even need a store-fence as the initialising core is the
> > only one online.
>
> Could you elaborate a bit more? or suggest a specific area to look at?
>
> After some experiments with trying to just set some fields in the
> struct page directly I haven't been able to produce any improvements.
> Of course there is lots about the area which I don't have much
> experience with.

Any such improvement will at most be in the 10-20% range.

I'd suggest first concentrating on the 1000-fold boot-time
initialization speedup that the buddy allocator's delayed initialization
can offer, and then speeding up whatever remains after that stage - in a
much more development-friendly environment. (You'll be able to run 'perf
record ./calloc-1TB' after bootup and get meaningful results, etc.)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator 2013-06-29 7:24 ` Ingo Molnar @ 2013-06-29 18:03 ` Nathan Zimmer 0 siblings, 0 replies; 9+ messages in thread From: Nathan Zimmer @ 2013-06-29 18:03 UTC (permalink / raw) To: Ingo Molnar Cc: Nathan Zimmer, Daniel J Blueman, Andrew Morton, Mike Travis, H. Peter Anvin, holt, rob, Thomas Gleixner, Ingo Molnar, yinghai, Greg KH, x86, linux-doc, Linux Kernel, Linus Torvalds, Peter Zijlstra, Steffen Persvold On Sat, Jun 29, 2013 at 09:24:41AM +0200, Ingo Molnar wrote: > > * Nathan Zimmer <nzimmer@sgi.com> wrote: > > > On 06/26/2013 10:35 PM, Daniel J Blueman wrote: > > >On Wednesday, June 26, 2013 9:30:02 PM UTC+8, Andrew Morton wrote: > > >> > > >> On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar > > ><mi...@kernel.org> wrote: > > >> > > >> > except that on 32 TB > > >> > systems we don't spend ~2 hours initializing 8,589,934,592 > > >page heads. > > >> > > >> That's about a million a second which is crazy slow - even my > > >prehistoric desktop > > >> is 100x faster than that. > > >> > > >> Where's all this time actually being spent? > > > > > > The complexity of a directory-lookup architecture to make the > > > (intrinsically unscalable) cache-coherency protocol scalable gives you > > > a ~1us roundtrip to remote NUMA nodes. > > > > > > Probably a lot of time is spent in some memsets, and RMW cycles which > > > are setting page bits, which are intrinsically synchronous, so the > > > initialising core can't get to 12 or so outstanding memory > > > transactions. > > > > > > Since EFI memory ranges have a flag to state if they are zerod (which > > > may be a fair assumption for memory on non-bootstrap processor NUMA > > > nodes), we can probably collapse the RMWs to just writes. > > > > > > A normal write will require a coherency cycle, then a fetch and a > > > writeback when it's evicted from the cache. 
For this purpose,
> > > non-temporal writes would eliminate the cache line fetch and give a
> > > massive increase in bandwidth. We wouldn't even need a store-fence
> > > as the initialising core is the only one online.
> >
> > Could you elaborate a bit more? or suggest a specific area to look at?
> >
> > After some experiments with trying to just set some fields in the
> > struct page directly I haven't been able to produce any improvements.
> > Of course there is lots about the area which I don't have much
> > experience with.
>
> Any such improvement will at most be in the 10-20% range.
>
> I'd suggest first concentrating on the 1000-fold boot time
> initialization speedup that the buddy allocator delayed initialization
> can offer, and speeding up whatever remains after that stage - in a
> much more development-friendly environment. (You'll be able to run
> 'perf record ./calloc-1TB' after bootup and get meaningful results,
> etc.)
>
> Thanks,
>
> Ingo

I had been focusing on the bigger gains, but my attention was diverted by
the hope of an easy, albeit smaller, win.

I have been experimenting with the patch proper; I am just doing 2MB
pages for the moment. The improvement is vast - I'll worry about proper
numbers once I think I have a fully working patch.

Some progress is being made on the real patch. I think the memory is
being set up correctly: on aligned pages, we set up the page as normal
plus set a new PG_ flag. Right now I am trying to sort out
free_pages_prepare and free_pages_check.

Thanks,
Nate

^ permalink raw reply	[flat|nested] 9+ messages in thread
* [RFC 0/2] Delay initializing of large sections of memory
@ 2013-06-21 16:25 Nathan Zimmer
  2013-06-21 16:25 ` [RFC 2/2] x86_64, mm: Reinsert the absent memory Nathan Zimmer
  0 siblings, 1 reply; 9+ messages in thread

From: Nathan Zimmer @ 2013-06-21 16:25 UTC (permalink / raw)
Cc: holt, travis, nzimmer, rob, tglx, mingo, hpa, yinghai, akpm,
    gregkh, x86, linux-doc, linux-kernel

This RFC patch set delays initializing large sections of memory until we
have started CPUs, which has the effect of reducing startup times on
large memory systems. On 16TB it can take over an hour to boot, and most
of that time is spent initializing memory. We avoid that bottleneck by
delaying initialization until after we have started multiple CPUs and
can initialize in a multithreaded manner. This allows us to actually
reduce boot time rather than just moving around the point of
initialization.

Mike and I have worked on this set for a while, with him doing most of
the heavy lifting, and are eager for some feedback.

Mike Travis (2):
  x86_64, mm: Delay initializing large portion of memory
  x86_64, mm: Reinsert the absent memory

 Documentation/kernel-parameters.txt |  15 ++
 arch/x86/Kconfig                    |  10 ++
 arch/x86/include/asm/e820.h         |  16 +-
 arch/x86/kernel/e820.c              | 292 +++++++++++++++++++++++++++++++++++-
 drivers/base/memory.c               |  83 ++++++++++
 include/linux/memory.h              |   5 +
 6 files changed, 413 insertions(+), 8 deletions(-)

-- 
1.8.2.1

^ permalink raw reply	[flat|nested] 9+ messages in thread
* [RFC 2/2] x86_64, mm: Reinsert the absent memory 2013-06-21 16:25 [RFC 0/2] Delay initializing of large sections of memory Nathan Zimmer @ 2013-06-21 16:25 ` Nathan Zimmer 2013-06-23 9:28 ` Ingo Molnar 0 siblings, 1 reply; 9+ messages in thread From: Nathan Zimmer @ 2013-06-21 16:25 UTC (permalink / raw) Cc: holt, travis, nzimmer, rob, tglx, mingo, hpa, yinghai, akpm, gregkh, x86, linux-doc, linux-kernel The memory we set aside in the previous patch needs to be reinserted. We start this process via late_initcall so we will have multiple cpus to do the work. Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Yinghai Lu <yinghai@kernel.org> --- arch/x86/kernel/e820.c | 129 +++++++++++++++++++++++++++++++++++++++++++++++++ drivers/base/memory.c | 83 +++++++++++++++++++++++++++++++ include/linux/memory.h | 5 ++ 3 files changed, 217 insertions(+) diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c index 3752dc5..d31039d 100644 --- a/arch/x86/kernel/e820.c +++ b/arch/x86/kernel/e820.c @@ -23,6 +23,7 @@ #ifdef CONFIG_DELAY_MEM_INIT #include <linux/memory.h> +#include <linux/delay.h> #endif #include <asm/e820.h> @@ -397,6 +398,22 @@ static u64 min_region_size; /* min size of region to slice from */ static u64 pre_region_size; /* multiply bsize for node low memory */ static u64 post_region_size; /* multiply bsize for node high memory */ +static unsigned long add_absent_work_start_time; +static unsigned long add_absent_work_stop_time; +static unsigned int add_absent_job_count; +static atomic_t add_absent_work_count; + +struct absent_work { + struct work_struct work; + struct absent_work *next; + atomic_t busy; + int cpu; + int node; + int index; +}; +static DEFINE_PER_CPU(struct absent_work, absent_work); +static 
struct absent_work *first_absent_work; + static int __init setup_delay_mem_init(char *str) { int bbits, mpnbits, minmult, premult, postmult; @@ -527,6 +544,118 @@ int __init sanitize_e820_map(struct e820entry *biosmap, int max_nr_map, } return ret; } + +/* Assign a cpu for this memory chunk and get the per_cpu absent_work struct */ +static struct absent_work *get_absent_work(int node) +{ + int cpu; + + for_each_cpu(cpu, cpumask_of_node(node)) { + struct absent_work *aws = &per_cpu(absent_work, cpu); + if (aws->node) + continue; + aws->cpu = cpu; + aws->node = node; + return aws; + } + + /* (if this becomes a problem, we can use a cpu on another node) */ + pr_crit("e820: No CPU on Node %d to schedule absent_work\n", node); + return NULL; +} + +/* Count of 'not done' processes */ +static int count_absent_work_notdone(void) +{ + struct absent_work *aws; + int notdone = 0; + + for (aws = first_absent_work; aws; aws = aws->next) + if (atomic_read(&aws->busy) < 2) + notdone++; + + return notdone; +} + +/* The absent_work thread */ +static void add_absent_memory_work(struct work_struct *work) +{ + struct absent_work *aws; + u64 phys_addr, size; + int ret; + + aws = container_of(work, struct absent_work, work); + + phys_addr = e820_absent.map[aws->index].addr; + size = e820_absent.map[aws->index].size; + ret = memory_add_absent(aws->node, phys_addr, size); + if (ret) + pr_crit("e820: Error %d adding absent memory %llx %llx (%d)\n", + ret, phys_addr, size, aws->node); + + atomic_set(&aws->busy, 2); + atomic_dec(&add_absent_work_count); + + /* if no one is waiting, then snap stop time */ + if (!count_absent_work_notdone()) + add_absent_work_stop_time = get_seconds(); +} + +/* Initialize absent_work threads */ +static int add_absent_memory(void) +{ + struct absent_work *aws = NULL; + int cpu, i; + + add_absent_work_start_time = get_seconds(); + add_absent_work_stop_time = 0; + atomic_set(&add_absent_work_count, 0); + + for_each_online_cpu(cpu) { + struct absent_work *aws = 
&per_cpu(absent_work, cpu); + aws->node = 0; + } + + /* setup each work thread */ + for (i = 0; i < e820_absent.nr_map; i++) { + u64 phys_addr = e820_absent.map[i].addr; + int node = memory_add_physaddr_to_nid(phys_addr); + + if (!node_online(node)) + continue; + + if (!aws) { + aws = get_absent_work(node); + first_absent_work = aws; + } else { + aws->next = get_absent_work(node); + aws = aws->next; + } + + if (!aws) + continue; + + INIT_WORK(&aws->work, add_absent_memory_work); + atomic_set(&aws->busy, 0); + aws->index = i; + + /* schedule absent_work thread */ + if (!schedule_work_on(aws->cpu, &aws->work)) + BUG(); + } + + + pr_info("e820: Add absent memory started\n"); + + return 0; +} + +/* Called during bootup to start adding absent_mem early */ +static int absent_memory_init(void) +{ + return add_absent_memory(); +} +late_initcall(absent_memory_init); #endif /* CONFIG_DELAY_MEM_INIT */ static int __init __append_e820_map(struct e820entry *biosmap, int nr_map) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index 14f8a69..5b4245a 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -442,6 +442,89 @@ static inline int memory_probe_init(void) } #endif +#ifdef CONFIG_DELAY_MEM_INIT +static struct memory_block *memory_get_block(u64 phys_addr, + struct memory_block *last_mem_blk) +{ + unsigned long pfn = phys_addr >> PAGE_SHIFT; + struct memory_block *mem_blk = NULL; + struct mem_section *mem_sect; + unsigned long section_nr = pfn_to_section_nr(pfn); + + if (!present_section_nr(section_nr)) + return NULL; + + mem_sect = __nr_to_section(section_nr); + mem_blk = find_memory_block_hinted(mem_sect, last_mem_blk); + return mem_blk; +} + +/* addr and size must be aligned on memory_block_size boundaries */ +int memory_add_absent(int nid, u64 phys_addr, u64 size) +{ + struct memory_block *mem = NULL; + struct page *first_page; + unsigned long block_sz; + unsigned long nr_pages; + unsigned long start_pfn; + int ret; + + block_sz = 
get_memory_block_size(); + if (phys_addr & (block_sz - 1) || size & (block_sz - 1)) + return -EINVAL; + + /* memory already present? */ + if (memory_get_block(phys_addr, NULL)) + return -EBUSY; + + ret = add_memory(nid, phys_addr, size); + if (ret) + return ret; + + /* grab first block to use for onlining process */ + mem = memory_get_block(phys_addr, NULL); + if (!mem) + return -ENOMEM; + + first_page = pfn_to_page(mem->start_section_nr << PFN_SECTION_SHIFT); + start_pfn = page_to_pfn(first_page); + nr_pages = size >> PAGE_SHIFT; + + ret = online_pages(start_pfn, nr_pages, ONLINE_KEEP); + if (ret) + return ret; + + for (;;) { + /* we already have first block from above */ + mutex_lock(&mem->state_mutex); + if (mem->state == MEM_OFFLINE) { + mem->state = MEM_ONLINE; + kobject_uevent(&mem->dev.kobj, KOBJ_ONLINE); + } + mutex_unlock(&mem->state_mutex); + + phys_addr += block_sz; + size -= block_sz; + if (!size) + break; + + mem = memory_get_block(phys_addr, mem); + if (mem) + continue; + + pr_err("memory_get_block failed at %llx\n", phys_addr); + return -EFAULT; + } + return 0; +} + +#else +static inline int start_add_absent_init(void) +{ + return 0; +} +#endif /* CONFIG_DELAY_MEM_INIT */ + #ifdef CONFIG_MEMORY_FAILURE /* * Support for offlining pages of memory diff --git a/include/linux/memory.h b/include/linux/memory.h index 85c31a8..a000c54 100644 --- a/include/linux/memory.h +++ b/include/linux/memory.h @@ -128,6 +128,11 @@ extern struct memory_block *find_memory_block(struct mem_section *); enum mem_add_context { BOOT, HOTPLUG }; #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */ +#ifdef CONFIG_DELAY_MEM_INIT +extern int memory_add_absent(int nid, u64 phys_addr, u64 size); +#endif + + #ifdef CONFIG_MEMORY_HOTPLUG #define hotplug_memory_notifier(fn, pri) ({ \ static __meminitdata struct notifier_block fn##_mem_nb =\ -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [RFC 2/2] x86_64, mm: Reinsert the absent memory 2013-06-21 16:25 ` [RFC 2/2] x86_64, mm: Reinsert the absent memory Nathan Zimmer @ 2013-06-23 9:28 ` Ingo Molnar 2013-06-24 20:36 ` Nathan Zimmer 0 siblings, 1 reply; 9+ messages in thread From: Ingo Molnar @ 2013-06-23 9:28 UTC (permalink / raw) To: Nathan Zimmer Cc: holt, travis, rob, tglx, mingo, hpa, yinghai, akpm, gregkh, x86, linux-doc, linux-kernel, Linus Torvalds, Peter Zijlstra * Nathan Zimmer <nzimmer@sgi.com> wrote: > The memory we set aside in the previous patch needs to be reinserted. > We start this process via late_initcall so we will have multiple cpus to do > the work. > > Signed-off-by: Mike Travis <travis@sgi.com> > Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> > Cc: Thomas Gleixner <tglx@linutronix.de> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: "H. Peter Anvin" <hpa@zytor.com> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Yinghai Lu <yinghai@kernel.org> > --- > arch/x86/kernel/e820.c | 129 +++++++++++++++++++++++++++++++++++++++++++++++++ > drivers/base/memory.c | 83 +++++++++++++++++++++++++++++++ > include/linux/memory.h | 5 ++ > 3 files changed, 217 insertions(+) > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c > index 3752dc5..d31039d 100644 > --- a/arch/x86/kernel/e820.c > +++ b/arch/x86/kernel/e820.c > @@ -23,6 +23,7 @@ > > #ifdef CONFIG_DELAY_MEM_INIT > #include <linux/memory.h> > +#include <linux/delay.h> > #endif > > #include <asm/e820.h> > @@ -397,6 +398,22 @@ static u64 min_region_size; /* min size of region to slice from */ > static u64 pre_region_size; /* multiply bsize for node low memory */ > static u64 post_region_size; /* multiply bsize for node high memory */ > > +static unsigned long add_absent_work_start_time; > +static unsigned long add_absent_work_stop_time; > +static unsigned int add_absent_job_count; > +static atomic_t add_absent_work_count; > + > +struct absent_work { > + struct 
work_struct work;
> +	struct absent_work *next;
> +	atomic_t busy;
> +	int cpu;
> +	int node;
> +	int index;
> +};
> +static DEFINE_PER_CPU(struct absent_work, absent_work);
> +static struct absent_work *first_absent_work;

That's 4.5 GB/sec initialization speed - that feels a bit slow, and the
boot time effect should be felt on smaller 'a couple of gigabytes'
desktop boxes as well. Do we know exactly where the 2 hours of boot time
on a 32 TB system is spent?

While you cannot profile the boot process (yet), you could try your
delayed patch and run a "perf record -g" call-graph profiling of the
late-time initialization routines. What does 'perf report' show?

Delayed initialization makes sense I guess, because 32 TB is a lot of
memory - I'm just wondering whether there are some low-hanging fruits
left in the mem init code; that code is certainly not optimized for
performance.

Plus, with a struct page size of around 64 bytes (?), 32 TB of RAM has
512 GB of struct page arrays alone. Initializing those will take quite
some time as well - and I suspect they are allocated by zeroing them
first. If that memset() exists then getting rid of it might be a good
move as well.

Yet another thing to consider would be to implement an initialization
speedup of 3 orders of magnitude: initialize at large page (2MB)
granularity and on-demand delay the initialization of the 4K-granular
struct pages [but still allocating them] - which I suspect are a good
chunk of the overhead. That way we could initialize in 2MB steps and
speed up the 2-hour bootup of 32 TB of RAM to 14 seconds...

[ The cost would be one more branch in the buddy allocator, to detect
  not-yet-initialized 2 MB chunks as we encounter them. Acceptable I
  think. ]

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [RFC 2/2] x86_64, mm: Reinsert the absent memory
  2013-06-23  9:28 ` Ingo Molnar
@ 2013-06-24 20:36 ` Nathan Zimmer
  2013-06-25  7:38   ` Ingo Molnar
  0 siblings, 1 reply; 9+ messages in thread

From: Nathan Zimmer @ 2013-06-24 20:36 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nathan Zimmer, holt, travis, rob, tglx, mingo, hpa, yinghai, akpm,
    gregkh, x86, linux-doc, linux-kernel, Linus Torvalds, Peter Zijlstra

On Sun, Jun 23, 2013 at 11:28:40AM +0200, Ingo Molnar wrote:
>
> That's 4.5 GB/sec initialization speed - that feels a bit slow and the
> boot time effect should be felt on smaller 'a couple of gigabytes'
> desktop boxes as well. Do we know exactly where the 2 hours of boot
> time on a 32 TB system is spent?

There are several other spots that could be improved on a large system,
but memory initialization is by far the biggest.

> While you cannot profile the boot process (yet), you could try your
> delayed patch and run a "perf record -g" call-graph profiling of the
> late-time initialization routines. What does 'perf report' show?

I have some data from earlier runs. memmap_init_zone was the biggest
hitter by far. Parts of it are certainly low-hanging fruit,
set_pageblock_migratetype for example. However, it seems that for a
larger system SetPageReserved will be the largest consumer of cycles. On
a 1TB system I just booted, it was around 50% of the time spent in
memmap_init_zone.

perf seems to struggle with 512 cpus, but I did get some data. It
indicates something similar to what I found in earlier experiments: lots
of time in memmap_init_zone. Some cpus are waiting on locks; this guy
seems to be representative of that.
-   0.14%  kworker/160:1  [kernel.kallsyms]  [k] mspin_lock
   + mspin_lock
   + __mutex_lock_slowpath
   - mutex_lock
      - 99.69% online_pages

> Delayed initialization makes sense I guess because 32 TB is a lot of
> memory - I'm just wondering whether there's some low hanging fruits
> left in the mem init code, that code is certainly not optimized for
> performance.
>
> Plus with a struct page size of around 64 bytes (?) 32 TB of RAM has
> 512 GB of struct page arrays alone. Initializing those will take quite
> some time as well - and I suspect they are allocated via zeroing them
> first. If that memset() exists then getting rid of it might be a good
> move as well.
>
> Yet another thing to consider would be to implement an initialization
> speedup of 3 orders of magnitude: initialize on the large page (2MB)
> granularity and on-demand delay the initialization of the 4K granular
> struct pages [but still allocating them] - which I suspect are a good
> chunk of the overhead? That way we could initialize in 2MB steps and
> speed up the 2 hours bootup of 32 TB of RAM to 14 seconds...
>
> [ The cost would be one more branch in the buddy allocator, to detect
>   not-yet-initialized 2 MB chunks as we encounter them. Acceptable I
>   think. ]
>
> Thanks,
>
> Ingo

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [RFC 2/2] x86_64, mm: Reinsert the absent memory 2013-06-24 20:36 ` Nathan Zimmer @ 2013-06-25 7:38 ` Ingo Molnar 2013-06-25 17:22 ` Mike Travis 0 siblings, 1 reply; 9+ messages in thread From: Ingo Molnar @ 2013-06-25 7:38 UTC (permalink / raw) To: Nathan Zimmer Cc: holt, travis, rob, tglx, mingo, hpa, yinghai, akpm, gregkh, x86, linux-doc, linux-kernel, Linus Torvalds, Peter Zijlstra * Nathan Zimmer <nzimmer@sgi.com> wrote: > On Sun, Jun 23, 2013 at 11:28:40AM +0200, Ingo Molnar wrote: > > > > That's 4.5 GB/sec initialization speed - that feels a bit slow and the > > boot time effect should be felt on smaller 'a couple of gigabytes' > > desktop boxes as well. Do we know exactly where the 2 hours of boot > > time on a 32 TB system is spent? > > There are other several spots that could be improved on a large system > but memory initialization is by far the biggest. My feeling is that deferred/on-demand initialization triggered from the buddy allocator is the better long term solution. That will also make it much easier to profile/test memory init performance: boot up a large system and run a simple testprogram that allocates a lot of RAM. ( It will also make people want to optimize the initialization sequence better, as it will be part of any freshly booted system's memory allocation overhead. ) Thanks, Ingo ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC 2/2] x86_64, mm: Reinsert the absent memory 2013-06-25 7:38 ` Ingo Molnar @ 2013-06-25 17:22 ` Mike Travis 2013-06-25 18:43 ` H. Peter Anvin 0 siblings, 1 reply; 9+ messages in thread From: Mike Travis @ 2013-06-25 17:22 UTC (permalink / raw) To: Ingo Molnar Cc: Nathan Zimmer, holt, rob, tglx, mingo, hpa, yinghai, akpm, gregkh, x86, linux-doc, linux-kernel, Linus Torvalds, Peter Zijlstra On 6/25/2013 12:38 AM, Ingo Molnar wrote: > > * Nathan Zimmer <nzimmer@sgi.com> wrote: > >> On Sun, Jun 23, 2013 at 11:28:40AM +0200, Ingo Molnar wrote: >>> >>> That's 4.5 GB/sec initialization speed - that feels a bit slow and the >>> boot time effect should be felt on smaller 'a couple of gigabytes' >>> desktop boxes as well. Do we know exactly where the 2 hours of boot >>> time on a 32 TB system is spent? >> >> There are other several spots that could be improved on a large system >> but memory initialization is by far the biggest. > > My feeling is that deferred/on-demand initialization triggered from the > buddy allocator is the better long term solution. I haven't caught up with all of Nathan's changes yet (just got back from vacation), but there was an option to either start the memory insertion on boot, or trigger it later using the /sys/.../memory interface. There is also a monitor program that calculates the memory insertion rate. This was extremely useful to determine how changes in the kernel affected the rate. > > That will also make it much easier to profile/test memory init > performance: boot up a large system and run a simple testprogram that > allocates a lot of RAM. > > ( It will also make people want to optimize the initialization sequence > better, as it will be part of any freshly booted system's memory > allocation overhead. ) > > Thanks, > > Ingo > ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC 2/2] x86_64, mm: Reinsert the absent memory 2013-06-25 17:22 ` Mike Travis @ 2013-06-25 18:43 ` H. Peter Anvin 2013-06-25 18:51 ` Mike Travis 0 siblings, 1 reply; 9+ messages in thread From: H. Peter Anvin @ 2013-06-25 18:43 UTC (permalink / raw) To: Mike Travis Cc: Ingo Molnar, Nathan Zimmer, holt, rob, tglx, mingo, yinghai, akpm, gregkh, x86, linux-doc, linux-kernel, Linus Torvalds, Peter Zijlstra On 06/25/2013 10:22 AM, Mike Travis wrote: > > On 6/25/2013 12:38 AM, Ingo Molnar wrote: >> >> * Nathan Zimmer <nzimmer@sgi.com> wrote: >> >>> On Sun, Jun 23, 2013 at 11:28:40AM +0200, Ingo Molnar wrote: >>>> >>>> That's 4.5 GB/sec initialization speed - that feels a bit slow and the >>>> boot time effect should be felt on smaller 'a couple of gigabytes' >>>> desktop boxes as well. Do we know exactly where the 2 hours of boot >>>> time on a 32 TB system is spent? >>> >>> There are other several spots that could be improved on a large system >>> but memory initialization is by far the biggest. >> >> My feeling is that deferred/on-demand initialization triggered from the >> buddy allocator is the better long term solution. > > I haven't caught up with all of Nathan's changes yet (just > got back from vacation), but there was an option to either > start the memory insertion on boot, or trigger it later > using the /sys/.../memory interface. There is also a monitor > program that calculates the memory insertion rate. This was > extremely useful to determine how changes in the kernel > affected the rate. > Sorry, I *totally* did not follow that comment. It seemed like a complete non-sequitur? -hpa ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC 2/2] x86_64, mm: Reinsert the absent memory 2013-06-25 18:43 ` H. Peter Anvin @ 2013-06-25 18:51 ` Mike Travis 2013-06-26 9:22 ` [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator Ingo Molnar 0 siblings, 1 reply; 9+ messages in thread From: Mike Travis @ 2013-06-25 18:51 UTC (permalink / raw) To: H. Peter Anvin Cc: Ingo Molnar, Nathan Zimmer, holt, rob, tglx, mingo, yinghai, akpm, gregkh, x86, linux-doc, linux-kernel, Linus Torvalds, Peter Zijlstra On 6/25/2013 11:43 AM, H. Peter Anvin wrote: > On 06/25/2013 10:22 AM, Mike Travis wrote: >> >> On 6/25/2013 12:38 AM, Ingo Molnar wrote: >>> >>> * Nathan Zimmer <nzimmer@sgi.com> wrote: >>> >>>> On Sun, Jun 23, 2013 at 11:28:40AM +0200, Ingo Molnar wrote: >>>>> >>>>> That's 4.5 GB/sec initialization speed - that feels a bit slow and the >>>>> boot time effect should be felt on smaller 'a couple of gigabytes' >>>>> desktop boxes as well. Do we know exactly where the 2 hours of boot >>>>> time on a 32 TB system is spent? >>>> >>>> There are other several spots that could be improved on a large system >>>> but memory initialization is by far the biggest. >>> >>> My feeling is that deferred/on-demand initialization triggered from the >>> buddy allocator is the better long term solution. >> >> I haven't caught up with all of Nathan's changes yet (just >> got back from vacation), but there was an option to either >> start the memory insertion on boot, or trigger it later >> using the /sys/.../memory interface. There is also a monitor >> program that calculates the memory insertion rate. This was >> extremely useful to determine how changes in the kernel >> affected the rate. >> > > Sorry, I *totally* did not follow that comment. It seemed like a > complete non-sequitur? > > -hpa It was I who was not following the question. I'm still reverting back to "work mode". 
[There is more code in a separate patch that Nate has not sent yet that instructs the kernel to start adding memory as early as possible, or not. That way you can start the insertion process later and monitor its progress to determine how changes in the kernel affect that process. It is controlled by a separate CONFIG option.] > > ^ permalink raw reply [flat|nested] 9+ messages in thread
* [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator 2013-06-25 18:51 ` Mike Travis @ 2013-06-26 9:22 ` Ingo Molnar 2013-06-26 13:28 ` Andrew Morton 0 siblings, 1 reply; 9+ messages in thread From: Ingo Molnar @ 2013-06-26 9:22 UTC (permalink / raw) To: Mike Travis Cc: H. Peter Anvin, Nathan Zimmer, holt, rob, tglx, mingo, yinghai, akpm, gregkh, x86, linux-doc, linux-kernel, Linus Torvalds, Peter Zijlstra (Changed the subject, to make it more apparent what we are talking about.) * Mike Travis <travis@sgi.com> wrote: > On 6/25/2013 11:43 AM, H. Peter Anvin wrote: > > On 06/25/2013 10:22 AM, Mike Travis wrote: > >> > >> On 6/25/2013 12:38 AM, Ingo Molnar wrote: > >>> > >>> * Nathan Zimmer <nzimmer@sgi.com> wrote: > >>> > >>>> On Sun, Jun 23, 2013 at 11:28:40AM +0200, Ingo Molnar wrote: > >>>>> > >>>>> That's 4.5 GB/sec initialization speed - that feels a bit slow and the > >>>>> boot time effect should be felt on smaller 'a couple of gigabytes' > >>>>> desktop boxes as well. Do we know exactly where the 2 hours of boot > >>>>> time on a 32 TB system is spent? > >>>> > >>>> There are other several spots that could be improved on a large system > >>>> but memory initialization is by far the biggest. > >>> > >>> My feeling is that deferred/on-demand initialization triggered from the > >>> buddy allocator is the better long term solution. > >> > >> I haven't caught up with all of Nathan's changes yet (just > >> got back from vacation), but there was an option to either > >> start the memory insertion on boot, or trigger it later > >> using the /sys/.../memory interface. There is also a monitor > >> program that calculates the memory insertion rate. This was > >> extremely useful to determine how changes in the kernel > >> affected the rate. > >> > > > > Sorry, I *totally* did not follow that comment. It seemed like a > > complete non-sequitur? > > > > -hpa > > It was I who was not following the question. 
> I'm still reverting
> back to "work mode".
>
> [There is more code in a separate patch that Nate has not sent
> yet that instructs the kernel to start adding memory as early
> as possible, or not. That way you can start the insertion process
> later and monitor its progress to determine how changes in the
> kernel affect that process. It is controlled by a separate
> CONFIG option.]

So, just to repeat (and expand upon) the solution hpa and I suggest: it's not based on /sys, delayed initialization lists or any similar (essentially memory hot plug based) approach.

It's a transparent on-demand initialization scheme based on only initializing the very early memory setup in 1GB (2MB) steps (not in 4K steps like we do it today). Any subsequent split-up initialization is done on-demand, in alloc_pages() et al, initializing a batch of 512 (or 1024) struct page heads when an uninitialized portion is first encountered.

This leaves the principal logic of early init largely untouched: we still have the same amount of RAM during and after bootup, except that on 32 TB systems we don't spend ~2 hours initializing 8,589,934,592 page heads.

This scheme could be implemented by introducing a new PG_initialized flag, which is seen by an unlikely() branch in alloc_pages() and which triggers the on-demand initialization of pages. [ It could probably be made zero-cost for the post-initialization state: we already check a bunch of rare PG_ flags, one more flag would not introduce any new branch in the page allocation hot path. ]

It's a technically different solution from what was submitted in this thread.

Cons:

- it works after bootup, via GFP. If done in a simple fashion it adds one more branch to the GFP fastpath. [ If done a bit more cleverly it can merge into an existing unlikely() branch and become essentially zero-cost for the fastpath. ]

- it adds an initialization non-determinism to GFP, to the tune of initializing ~512 page heads when RAM is utilized first.
- initialization is done when memory is needed - not during or shortly after bootup. This (slightly) increases first-use overhead. [I don't think this factor is significant - and I think we'll quickly see speedups to initialization, once the overhead becomes more easily measurable.]

Pros:

- it's transparent to the boot process. ('free' shows the same full amount of RAM all the time; there are no weird effects of RAM coming online asynchronously. You see all the RAM you have - etc.)

- it helps the boot time of every single Linux system, not just large-RAM ones. On a smallish, 4GB system memory init can take up precious hundreds of milliseconds, so this is a practical issue.

- it spreads initialization overhead to later portions of the system's lifetime, when there's typically more idle time and more parallelism available.

- initialization overhead, because it's a natural part of first-time memory allocation with this scheme, becomes more measurable (and thus more prominently optimized) than any deferred lists processed in the background.

- as an added bonus it probably speeds up your usecase even more than the patches you are providing: on a 32 TB system the primary initialization would only have to enumerate memory, allocate page heads and buddy bitmaps, and initialize the 1GB granular page heads: there are only 32768 of them.

So unless I overlooked some factor this scheme would be unconditional goodness for everyone.

Thanks,

Ingo

^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator 2013-06-26 9:22 ` [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator Ingo Molnar @ 2013-06-26 13:28 ` Andrew Morton 2013-06-26 13:37 ` Ingo Molnar 0 siblings, 1 reply; 9+ messages in thread From: Andrew Morton @ 2013-06-26 13:28 UTC (permalink / raw) To: Ingo Molnar Cc: Mike Travis, H. Peter Anvin, Nathan Zimmer, holt, rob, tglx, mingo, yinghai, gregkh, x86, linux-doc, linux-kernel, Linus Torvalds, Peter Zijlstra On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar <mingo@kernel.org> wrote: > except that on 32 TB > systems we don't spend ~2 hours initializing 8,589,934,592 page heads. That's about a million a second which is crazy slow - even my prehistoric desktop is 100x faster than that. Where's all this time actually being spent? ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator 2013-06-26 13:28 ` Andrew Morton @ 2013-06-26 13:37 ` Ingo Molnar 2013-06-26 15:02 ` Nathan Zimmer 2013-06-26 16:15 ` Mike Travis 0 siblings, 2 replies; 9+ messages in thread From: Ingo Molnar @ 2013-06-26 13:37 UTC (permalink / raw) To: Andrew Morton Cc: Mike Travis, H. Peter Anvin, Nathan Zimmer, holt, rob, tglx, mingo, yinghai, gregkh, x86, linux-doc, linux-kernel, Linus Torvalds, Peter Zijlstra * Andrew Morton <akpm@linux-foundation.org> wrote: > On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar <mingo@kernel.org> wrote: > > > except that on 32 TB > > systems we don't spend ~2 hours initializing 8,589,934,592 page heads. > > That's about a million a second which is crazy slow - even my > prehistoric desktop is 100x faster than that. > > Where's all this time actually being spent? See the earlier part of the thread - apparently it's spent initializing the page heads - remote NUMA node misses from a single boot CPU, going across a zillion cross-connects? I guess there's some other low hanging fruits as well - so making this easier to profile would be nice. The profile posted was not really usable. Btw., NUMA locality would be another advantage of on-demand initialization: actual users of RAM tend to allocate node-local (especially on large clusters), so any overhead will be naturally lower. Thanks, Ingo ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator 2013-06-26 13:37 ` Ingo Molnar @ 2013-06-26 15:02 ` Nathan Zimmer 2013-06-26 16:15 ` Mike Travis 0 siblings, 0 replies; 9+ messages in thread From: Nathan Zimmer @ 2013-06-26 15:02 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, Mike Travis, H. Peter Anvin, Nathan Zimmer, holt, rob, tglx, mingo, yinghai, gregkh, x86, linux-doc, linux-kernel, Linus Torvalds, Peter Zijlstra On Wed, Jun 26, 2013 at 03:37:15PM +0200, Ingo Molnar wrote: > > * Andrew Morton <akpm@linux-foundation.org> wrote: > > > On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar <mingo@kernel.org> wrote: > > > > > except that on 32 TB > > > systems we don't spend ~2 hours initializing 8,589,934,592 page heads. > > > > That's about a million a second which is crazy slow - even my > > prehistoric desktop is 100x faster than that. > > > > Where's all this time actually being spent? > > See the earlier part of the thread - apparently it's spent initializing > the page heads - remote NUMA node misses from a single boot CPU, going > across a zillion cross-connects? I guess there's some other low hanging > fruits as well - so making this easier to profile would be nice. The > profile posted was not really usable. > That is correct: from what I am seeing, using crude cycle counters, there is far more time spent on the later nodes, i.e. memory near the boot node is initialized a lot faster than remote memory. I think the other low-hanging fruits are currently being drowned out by the lack of locality. Nate > Btw., NUMA locality would be another advantage of on-demand > initialization: actual users of RAM tend to allocate node-local > (especially on large clusters), so any overhead will be naturally lower. > > Thanks, > > Ingo ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator 2013-06-26 13:37 ` Ingo Molnar 2013-06-26 15:02 ` Nathan Zimmer @ 2013-06-26 16:15 ` Mike Travis 1 sibling, 0 replies; 9+ messages in thread From: Mike Travis @ 2013-06-26 16:15 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, H. Peter Anvin, Nathan Zimmer, holt, rob, tglx, mingo, yinghai, gregkh, x86, linux-doc, linux-kernel, Linus Torvalds, Peter Zijlstra On 6/26/2013 6:37 AM, Ingo Molnar wrote: > > * Andrew Morton <akpm@linux-foundation.org> wrote: > >> On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar <mingo@kernel.org> wrote: >> >>> except that on 32 TB >>> systems we don't spend ~2 hours initializing 8,589,934,592 page heads. >> >> That's about a million a second which is crazy slow - even my >> prehistoric desktop is 100x faster than that. >> >> Where's all this time actually being spent? > > See the earlier part of the thread - apparently it's spent initializing > the page heads - remote NUMA node misses from a single boot CPU, going > across a zillion cross-connects? I guess there's some other low hanging > fruits as well - so making this easier to profile would be nice. The > profile posted was not really usable. This is one advantage of delayed memory init. I can do it under the profiler. I will put everything together to accomplish this and then send a perf report. > > Btw., NUMA locality would be another advantage of on-demand > initialization: actual users of RAM tend to allocate node-local > (especially on large clusters), so any overhead will be naturally lower. > > Thanks, > > Ingo > ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2013-06-29 18:03 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-06-27 3:35 [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator Daniel J Blueman 2013-06-28 20:37 ` Nathan Zimmer 2013-06-29 7:24 ` Ingo Molnar 2013-06-29 18:03 ` Nathan Zimmer -- strict thread matches above, loose matches on Subject: below -- 2013-06-21 16:25 [RFC 0/2] Delay initializing of large sections of memory Nathan Zimmer 2013-06-21 16:25 ` [RFC 2/2] x86_64, mm: Reinsert the absent memory Nathan Zimmer 2013-06-23 9:28 ` Ingo Molnar 2013-06-24 20:36 ` Nathan Zimmer 2013-06-25 7:38 ` Ingo Molnar 2013-06-25 17:22 ` Mike Travis 2013-06-25 18:43 ` H. Peter Anvin 2013-06-25 18:51 ` Mike Travis 2013-06-26 9:22 ` [RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator Ingo Molnar 2013-06-26 13:28 ` Andrew Morton 2013-06-26 13:37 ` Ingo Molnar 2013-06-26 15:02 ` Nathan Zimmer 2013-06-26 16:15 ` Mike Travis