Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [patch 0/5] mm: improve remapping of vmalloc regions
From: Nick Piggin @ 2006-04-20 17:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Nick Piggin, Linux Memory Management, Hugh Dickins

Hi,

I'd like some feedback about this patchset -- whether it is the right
design, and the implementation (e.g. people might dislike patch 4).

vm_insert_page and remap_pfn_range loops are really clever, bit
probably asking a bit too much of most drivers. I was able to get
rid of most of them without too much trouble.

Not tested, because I don't have any of the hardware, but it seems
compiles OK.

Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [RFC] - Kernel text replication on IA64
From: Luck, Tony @ 2006-04-20 16:41 UTC (permalink / raw)
  To: Jack Steiner; +Cc: linux-ia64, lee.schermerhorn, clameter, linux-mm
In-Reply-To: <20060420135315.GA28021@sgi.com>

On Thu, Apr 20, 2006 at 08:53:16AM -0500, Jack Steiner wrote:

> There was a question about the effects of kernel text replication last
> month.  I was curious so I resurrected an old trillian patch (Tony Luck's)
> & got it working again. Here is the preliminary patch & some data about
> the benefit.

It's not *that* old ... just from Atlas days, not from Trillian.  Google
carbon dating shows old versions of this patch from around the August
2002 time frame (against 2.4.19).

> All tests were run on a 12 cpu (Itanium2, 900 MHz, 1.5MB L3) , 6 node
> system. All cpus are idle with the exception of the cpu running the test.

Presumably results would be even better on a bigger system where the
distance from the test node to node 0 is even bigger.

On truly huge systems does node0 (or the interconnects leading to it) ever
suffer measureable slowdown from the instruction fetch traffic coming from
the other 255 (or more) nodes?

> Enabling replication reserves 1 additional DTLB entry for kernel code.
> This reduces the number of DTLB entries that is available for user code.
> There is the potential that this could impact some applications.
> Additional measurements are still needed.

Ken's recent patch to free up the DTLB that is currently used for per-cpu
data would mitigate this (though I'm sure he'll be unamused if I blow the
1.6% gain he saw on his transaction processing benchmark on this :-)

> ------------------------------------------------------
>   Cold cache. Running on node 3 of 6 node system
> 
>                          NoRep        Rep   %improvement
> null                :    0.894 :    0.812 :         9.17
> forkexit            :  521.518 :  416.467 :        20.14
> openclose           :  106.683 :   75.000 :        29.70
> pid                 :    2.577 :    2.356 :         8.58
> time                :   17.882 :   11.693 :        34.61
> gettimeofday        :   17.523 :   11.695 :        33.26

Those are some pretty nice numbers.

> ------------------------------------------------------
>    Hot cache. Running on node 3 of 6 node system
> 
>                          NoRep        Rep   %improvement
> forkexit            :  162.019 :  151.927 :         6.23
> openclose           :    8.445 :    8.128 :         3.75

These ones are good too.  But a bit surprising ... it implies that we
are still seeing significant kernel-code i-cache misses even in a
micro-benchmark tight loop.  Montecito (with the big L2 icache) should
be better here (and so see less improvement with replicated text).

> +	  Say Y if you want to eeplicate kernel text on each node of a NUMA system.

s/eeplicate/replicate/

-Tony

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 2.6.17-rc1-mm3] add migratepage address space op to shmem
From: Christoph Lameter @ 2006-04-20 16:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, Lee Schermerhorn
In-Reply-To: <1145548859.5214.9.camel@localhost.localdomain>

On Thu, 20 Apr 2006, Lee Schermerhorn wrote:

> Add migratepage address space op to shmem
> 
> Basic problem:  pages of a shared memory segment can only be
> migrated once.
> 
> In 2.6.16 through 2.6.17-rc1, shared memory mappings do not
> have a migratepage address space op.  Therefore, migrate_pages()
> falls back to default processing.  In this path, it will try to
> pageout() dirty pages.  Once a shared memory page has been migrated
> it becomes dirty, so migrate_pages() will try to page it out.  
> However, because the page count is 3 [cache + current + pte],
> pageout() will return PAGE_KEEP because is_page_cache_freeable()
> returns false.  This will abort all subsequent migrations.
> 
> This patch adds a migratepage address space op to shared memory
> segments to avoid taking the default path.  We use the "migrate_page()"
> function because it knows how to migrate dirty pages.  This allows
> shared memory segment pages to migrate, subject to other conditions
> such as # pte's referencing the page [page_mapcount(page)], when
> requested.  
> 
> I think this is safe.  If we're migrating a shared memory page,
> then we found the page via a page table, so it must be in
> memory.
> 
> Can be verified with memtoy and the shmem-mbind-test script, both
> available at:  http://free.linux.hp.com/~lts/Tools/
> 
> Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
> Index: linux-2.6.17-rc1-mm3/mm/shmem.c
> ===================================================================
> --- linux-2.6.17-rc1-mm3.orig/mm/shmem.c	2006-04-19 17:29:09.000000000 -0400
> +++ linux-2.6.17-rc1-mm3/mm/shmem.c	2006-04-19 17:29:36.000000000 -0400
> @@ -46,6 +46,8 @@
>  #include <linux/mempolicy.h>
>  #include <linux/namei.h>
>  #include <linux/ctype.h>
> +#include <linux/migrate.h>
> +
>  #include <asm/uaccess.h>
>  #include <asm/div64.h>
>  #include <asm/pgtable.h>
> @@ -2165,6 +2167,7 @@ static struct address_space_operations s
>  	.prepare_write	= shmem_prepare_write,
>  	.commit_write	= simple_commit_write,
>  #endif
> +	.migratepage	= migrate_page,
>  };
>  
>  static struct file_operations shmem_file_operations = {
> 
> 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH 2.6.17-rc1-mm3] add migratepage address space op to shmem
From: Lee Schermerhorn @ 2006-04-20 16:00 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, Christoph Lameter

Add migratepage address space op to shmem

Basic problem:  pages of a shared memory segment can only be
migrated once.

In 2.6.16 through 2.6.17-rc1, shared memory mappings do not
have a migratepage address space op.  Therefore, migrate_pages()
falls back to default processing.  In this path, it will try to
pageout() dirty pages.  Once a shared memory page has been migrated
it becomes dirty, so migrate_pages() will try to page it out.  
However, because the page count is 3 [cache + current + pte],
pageout() will return PAGE_KEEP because is_page_cache_freeable()
returns false.  This will abort all subsequent migrations.

This patch adds a migratepage address space op to shared memory
segments to avoid taking the default path.  We use the "migrate_page()"
function because it knows how to migrate dirty pages.  This allows
shared memory segment pages to migrate, subject to other conditions
such as # pte's referencing the page [page_mapcount(page)], when
requested.  

I think this is safe.  If we're migrating a shared memory page,
then we found the page via a page table, so it must be in
memory.

Can be verified with memtoy and the shmem-mbind-test script, both
available at:  http://free.linux.hp.com/~lts/Tools/

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.17-rc1-mm3/mm/shmem.c
===================================================================
--- linux-2.6.17-rc1-mm3.orig/mm/shmem.c	2006-04-19 17:29:09.000000000 -0400
+++ linux-2.6.17-rc1-mm3/mm/shmem.c	2006-04-19 17:29:36.000000000 -0400
@@ -46,6 +46,8 @@
 #include <linux/mempolicy.h>
 #include <linux/namei.h>
 #include <linux/ctype.h>
+#include <linux/migrate.h>
+
 #include <asm/uaccess.h>
 #include <asm/div64.h>
 #include <asm/pgtable.h>
@@ -2165,6 +2167,7 @@ static struct address_space_operations s
 	.prepare_write	= shmem_prepare_write,
 	.commit_write	= simple_commit_write,
 #endif
+	.migratepage	= migrate_page,
 };
 
 static struct file_operations shmem_file_operations = {


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [RFC] - Kernel text replication on IA64
From: Jack Steiner @ 2006-04-20 13:53 UTC (permalink / raw)
  To: linux-ia64; +Cc: lee.schermerhorn, clameter, linux-mm

There was a question about the effects of kernel text replication last
month.  I was curious so I resurrected an old trillian patch (Tony Luck's)
& got it working again. Here is the preliminary patch & some data about
the benefit.

This is still a work-in-progress. I have not concluded whether I think the
patch is beneficial. Please take a look. Comments are appreciated.

Note that one piece is missing from the patch. It is currently
incompatible with kprobes. That is easy to fix if we decide to go forward
with the patch.  For now, make sure that CONFIG_KPROBES is not selected.
Kdb breakpoints will not work, either.  (But, then, kdb breakpoints don't
really work anyway).

----------------

Here is a summary that shows the benefit of kernel text replication on a
few selected microbenchmarks.

The tests were run on a kernel that supports kernel text replication as a
boottime option. The first column shows the time (in usec) to run the
microbenchmark when text replication is disabled. The second column is the
same kernel but text replication was enabled at boot time.

Each test supports an option to select whether to run the test with a
"hot" or "cold" cache.

If "hot" is selected, the test is run multiple times in a tight loop.
Because the microbenchmarks have a small cache footprint, replication is
not expected to help if caches are hot. The rate of i-cache misses for
kernel code should be low.

If "cold" is selected, all caches are flushed between each iteration of
the loop. This removes any cached kernel code from the caches & will
increase the time of the next system call(s). When kernel text replication
is enabled, the caches are refilled from the local node which has a
smaller latency than when refilling from node 0 if replication is
disabled.  The cache flush time is not included in the times but does have
a small residual impact on timing (less than a usec).

All tests were run on a 12 cpu (Itanium2, 900 MHz, 1.5MB L3) , 6 node
system. All cpus are idle with the exception of the cpu running the test.

Tests run on node 0 (as expected) show no improvement when text
replication is enabled. Tests run on other nodes show a significant
improvement when replication is enabled.

Note that these are microbenchmarks. The effect on real applications has
not been determined. Applications that have small cache footprints or
applications that are mostly cpu bound in user code are not expected to
show significant improvement with kernel text replication. In addition,
the improvements when replication is enabled will increase as system size
or system activity increases.

Enabling replication reserves 1 additional DTLB entry for kernel code.
This reduces the number of DTLB entries that is available for user code.
There is the potential that this could impact some applications.
Additional measurements are still needed.



------------------------------------------------------
  Cold cache. Running on node 3 of 6 node system

                         NoRep        Rep   %improvement
null                :    0.894 :    0.812 :         9.17
forkexit            :  521.518 :  416.467 :        20.14
openclose           :  106.683 :   75.000 :        29.70
pid                 :    2.577 :    2.356 :         8.58
time                :   17.882 :   11.693 :        34.61
gettimeofday        :   17.523 :   11.695 :        33.26


------------------------------------------------------
   Hot cache. Running on node 3 of 6 node system

                         NoRep        Rep   %improvement
null                :    0.044 :    0.044 :         0.00
forkexit            :  162.019 :  151.927 :         6.23
openclose           :    8.445 :    8.128 :         3.75
pid                 :    0.067 :    0.067 :         0.00
time                :    1.110 :    1.100 :         0.90
gettimeofday        :    1.079 :    1.074 :         0.46







 arch/ia64/Kconfig              |    7 ++
 arch/ia64/kernel/head.S        |   89 +++++++++++++++++++++++++++++
 arch/ia64/kernel/mca_asm.S     |   13 +++-
 arch/ia64/kernel/setup.c       |    6 +
 arch/ia64/kernel/smpboot.c     |    2 
 arch/ia64/kernel/vmlinux.lds.S |   71 +++++++++++++----------
 arch/ia64/mm/init.c            |  125 ++++++++++++++++++++++++++++++++++++++---
 include/asm-ia64/kregs.h       |    7 +-
 include/asm-ia64/numa.h        |   10 +++
 include/asm-ia64/pgtable.h     |    1 
 include/asm-ia64/system.h      |    1 
 11 files changed, 294 insertions(+), 38 deletions(-)



Index: linux/arch/ia64/kernel/head.S
===================================================================
--- linux.orig/arch/ia64/kernel/head.S	2006-04-18 14:35:34.403444255 -0500
+++ linux/arch/ia64/kernel/head.S	2006-04-18 14:36:14.895436090 -0500
@@ -247,6 +247,20 @@ start_ap:
 	;;
 	itr.d dtr[r16]=r18
 	;;
+#ifdef CONFIG_KERNEL_TEXT_REPLICATION
+	mov r16=IA64_TR_KERNEL_DATA
+	movl r17=KERNEL_DATA_START
+	movl r18=PAGE_KERNEL
+	;;
+	or r18=r2,r18
+	mov cr.ifa=r17
+	;;
+	srlz.i
+	;;
+	itr.d dtr[r16]=r18
+	;;
+#endif
+
 	srlz.i
 
 	/*
@@ -1218,4 +1232,79 @@ tlb_purge_done:
 END(ia64_jump_to_sal)
 #endif /* CONFIG_HOTPLUG_CPU */
 
+
+#ifdef CONFIG_KERNEL_TEXT_REPLICATION
+
+#define PSR_BITS_TO_CLEAR						\
+	(IA64_PSR_I | IA64_PSR_IT | IA64_PSR_DT | IA64_PSR_RT |		\
+	 IA64_PSR_DD | IA64_PSR_SS | IA64_PSR_RI | IA64_PSR_ED |	\
+	 IA64_PSR_DFL | IA64_PSR_DFH)
+
+#define PSR_BITS_TO_SET							\
+	(IA64_PSR_BN)
+
+/*
+ * ccNUMA systems bring up all cpus running from the copy of the
+ * kernel that elilo loaded into memory.  Processors that find that
+ * they are not using the kernel text/rodata that is on their local
+ * node can use this routine to reset their TLB mappings to point
+ * at the correct copy.
+ *
+ * This is like the magic trick where you pull a table cloth out
+ * from under a table covered with plates, glasses and silverware,
+ * except in this version we slide an identical tablecloth in to
+ * replace the one we pulled out.
+ *
+ * Inputs:
+ *	in0 = virtual address of local node copy to be mapped
+ */
+
+GLOBAL_ENTRY(remap_kernel_text)
+	.prologue ASM_UNW_PRLG_RP|ASM_UNW_PRLG_PFS, ASM_UNW_PRLG_GRSAVE(8)
+	alloc loc1=ar.pfs,8,5,7,0
+	mov loc0=rp
+	.body
+	;;
+	mov loc4=ar.rsc			// save RSE configuration
+	mov ar.rsc=0			// put RSE in enforced lazy, LE mode
+	tpa in0=in0
+	;;
+	movl r16=PSR_BITS_TO_CLEAR
+	mov loc3=psr			// save processor status word
+	movl r17=PSR_BITS_TO_SET
+	;;
+	or loc3=loc3,r17
+	;;
+	andcm r16=loc3,r16		// get psr with IT, DT, and RT bits cleared
+	br.call.sptk.few rp=ia64_switch_mode_phys
+.ret3:
+	rsm psr.i | psr.ic
+	movl r24=KERNEL_START
+	movl r25=KERNEL_TR_PAGE_SHIFT<<2
+	movl r21=PAGE_KERNELRX
+	mov r22=IA64_TR_KERNEL
+	;;
+	ptr.i r24,r25			// purge old code mapping
+	;;
+	srlz.i
+	;;
+	mov cr.ifa=r24
+	mov cr.itir=r25
+	or in0=r21,in0
+	;;
+	srlz.i
+	;;
+	itr.i itr[r22]=in0
+	;;
+	srlz.i
+	;;
+	mov r16=loc3
+	br.call.sptk.few rp=ia64_switch_mode_virt // return to virtual mode
+.ret4:	mov ar.rsc=loc4			// restore RSE configuration
+	mov ar.pfs=loc1
+	mov rp=loc0
+	br.ret.sptk.few rp
+END(remap_kernel_text)
+#endif /* CONFIG_KERNEL_TEXT_REPLICATION */
+
 #endif /* CONFIG_SMP */
Index: linux/arch/ia64/kernel/smpboot.c
===================================================================
--- linux.orig/arch/ia64/kernel/smpboot.c	2006-04-18 14:35:34.403444255 -0500
+++ linux/arch/ia64/kernel/smpboot.c	2006-04-18 14:36:14.895436090 -0500
@@ -402,6 +402,8 @@ smp_callin (void)
 
 	smp_setup_percpu_timer();
 
+	check_remap_kernel_text(cpuid);
+
 	ia64_mca_cmc_vector_setup();	/* Setup vector on AP */
 
 #ifdef CONFIG_PERFMON
Index: linux/arch/ia64/mm/init.c
===================================================================
--- linux.orig/arch/ia64/mm/init.c	2006-04-18 14:35:34.407443859 -0500
+++ linux/arch/ia64/mm/init.c	2006-04-18 15:22:15.766146875 -0500
@@ -60,6 +60,88 @@ EXPORT_SYMBOL(zero_page_memmap_ptr);
 #define MAX_PGT_FREES_PER_PASS		16L
 #define PGT_FRACTION_OF_NODE_MEM	16
 
+static unsigned long free_mem_range (void *, void *);
+
+#ifdef	CONFIG_KERNEL_TEXT_REPLICATION
+/*
+ * Set ktreplicate to 0 to disable kernel text replication.
+ */
+static int ktreplicate=1;
+
+static int __init replicate_setup(char *str)
+{
+	get_option(&str, &ktreplicate);
+	return 1;
+}
+
+__setup("ktreplicate=", replicate_setup);
+
+
+/*
+ * Addresses of per-node copies of kernel text/readonly-data
+ */
+static void *kcopybase[MAX_NUMNODES];
+
+/*
+ * Remap the kernel text for this cpu if a closer copy
+ * is available.
+ */
+void
+check_remap_kernel_text(int cpuid)
+{
+        if (kcopybase[node_cpuid[cpuid].nid])
+		remap_kernel_text(kcopybase[node_cpuid[cpuid].nid]);
+}
+
+/*
+ * Make properly aligned copies of kernel text and read-only
+ * data on other nodes.
+ */
+void replicate_kernel(void)
+{
+	extern void *_start_replicate, *_end_replicate;
+	void *kstart = &_start_replicate;
+	void *kend = &_end_replicate;
+	struct page *page;
+	int nid, length, copies = 1;
+	void *addr;
+	int kloadnode;
+	int cpuid = smp_processor_id();
+
+	kloadnode = paddr_to_nid(ia64_tpa(&kcopybase[0]));
+	kcopybase[kloadnode] = ia64_imva(&_start_replicate);
+	kcopybase[node_cpuid[cpuid].nid] = ia64_imva(&_start_replicate);
+
+	if (ktreplicate) {
+		length = kend - kstart;
+		for_each_online_node(nid) {
+			if (nid == kloadnode || nr_cpus_node(nid) == 0)
+				continue;
+
+			page = alloc_pages_node(nid, GFP_KERNEL, get_order(KERNEL_TR_PAGE_SIZE));
+			if (!page) {
+				printk("Could not replicate kernel to node %d\n", nid);
+				continue;
+			}
+			addr = page_address(page);
+			free_mem_range(addr + length, addr + KERNEL_TR_PAGE_SIZE);
+			kcopybase[nid] = addr;
+			memcpy(addr, &_start_replicate, length);
+			copies++;
+		}
+		printk("Replicated kernel to %d nodes\n", copies);
+	} else {
+		printk("Kernel text replication is disabled\n");
+	}
+
+	/*
+	 * Make kernel text read-only. We do this even if replication * is disabled.
+	 */
+	check_remap_kernel_text(cpuid);
+
+}
+#endif	/* CONFIG_KERNEL_TEXT_REPLICATION */
+
 static inline long
 max_pgt_pages(void)
 {
@@ -194,22 +276,51 @@ ia64_init_addr_space (void)
 	}
 }
 
-void
-free_initmem (void)
+static unsigned long
+free_mem_range (void *addr, void *eaddr)
 {
-	unsigned long addr, eaddr;
+	unsigned long pages_freed = 0;
 
-	addr = (unsigned long) ia64_imva(__init_begin);
-	eaddr = (unsigned long) ia64_imva(__init_end);
 	while (addr < eaddr) {
 		ClearPageReserved(virt_to_page(addr));
 		init_page_count(virt_to_page(addr));
-		free_page(addr);
+		free_page((u64)addr);
 		++totalram_pages;
 		addr += PAGE_SIZE;
+		++pages_freed;
+	}
+	return pages_freed;
+}
+
+void
+free_initmem (void)
+{
+	void *addr, *eaddr;
+	unsigned long pages_freed = 0;
+	extern char __init_data_begin[], __init_data_end[];
+	int nid;
+
+#ifdef CONFIG_KERNEL_TEXT_REPLICATION
+	for_each_online_node(nid) {
+		if (!kcopybase[nid])
+			continue;
+		addr = kcopybase[nid] + ((u64) ia64_imva(&__init_begin) & (KERNEL_TR_PAGE_SIZE-1));
+		eaddr = kcopybase[nid] + ((u64) ia64_imva(&__init_end) & (KERNEL_TR_PAGE_SIZE-1));
+		pages_freed += free_mem_range(addr, eaddr);
 	}
+#else
+	addr = ia64_imva(__init_begin);
+	eaddr = ia64_imva(__init_end);
+	pages_freed += free_mem_range(addr, eaddr);
+#endif
+
+	addr = ia64_imva(__init_data_begin);
+	eaddr = ia64_imva(__init_data_end);
+	pages_freed += free_mem_range(addr, eaddr);
+
 	printk(KERN_INFO "Freeing unused kernel memory: %ldkB freed\n",
-	       (__init_end - __init_begin) >> 10);
+	       (pages_freed << PAGE_SHIFT) >> 10);
+
 }
 
 void __init
Index: linux/arch/ia64/kernel/mca_asm.S
===================================================================
--- linux.orig/arch/ia64/kernel/mca_asm.S	2006-04-18 14:35:34.403444255 -0500
+++ linux/arch/ia64/kernel/mca_asm.S	2006-04-18 14:36:14.899435695 -0500
@@ -96,8 +96,15 @@ ia64_do_tlb_purge:
 	mov r18=KERNEL_TR_PAGE_SHIFT<<2
 	;;
 	ptr.i r16, r18
-	ptr.d r16, r18
 	;;
+
+#ifdef CONFIG_KERNEL_TEXT_REPLICATION
+	movl r17=KERNEL_DATA_START
+	;;
+	ptr.d r17, r18
+	;;
+#endif
+
 	srlz.i
 	;;
 	srlz.d
@@ -192,6 +199,10 @@ ia64_reload_tr:
 	;;
         itr.i itr[r16]=r18
 	;;
+	movl r17=KERNEL_DATA_START
+	;;
+	mov cr.ifa=r17
+	;;
         itr.d dtr[r16]=r18
         ;;
 	srlz.i
Index: linux/arch/ia64/kernel/setup.c
===================================================================
--- linux.orig/arch/ia64/kernel/setup.c	2006-04-18 14:35:34.403444255 -0500
+++ linux/arch/ia64/kernel/setup.c	2006-04-18 14:36:14.899435695 -0500
@@ -887,6 +887,12 @@ check_bugs (void)
 {
 	ia64_patch_mckinley_e9((unsigned long) __start___mckinley_e9_bundles,
 			       (unsigned long) __end___mckinley_e9_bundles);
+
+	/*
+	 * This really doesn't belong here but this is the last arch-specific
+	 * callout before starting cpus. Need a better place for this.
+	 */
+	replicate_kernel();
 }
 
 static int __init run_dmi_scan(void)
Index: linux/include/asm-ia64/system.h
===================================================================
--- linux.orig/include/asm-ia64/system.h	2006-04-18 14:35:34.407443859 -0500
+++ linux/include/asm-ia64/system.h	2006-04-18 14:36:14.903435299 -0500
@@ -26,6 +26,7 @@
  * - 0xa000000000000000+3*PERCPU_PAGE_SIZE remain unmapped (guard page)
  */
 #define KERNEL_START		 (GATE_ADDR+0x100000000)
+#define KERNEL_DATA_START	 (GATE_ADDR+0x180000000)
 #define PERCPU_ADDR		(-PERCPU_PAGE_SIZE)
 
 #ifndef __ASSEMBLY__
Index: linux/arch/ia64/kernel/vmlinux.lds.S
===================================================================
--- linux.orig/arch/ia64/kernel/vmlinux.lds.S	2006-04-18 14:35:34.403444255 -0500
+++ linux/arch/ia64/kernel/vmlinux.lds.S	2006-04-18 15:38:28.645844817 -0500
@@ -39,6 +39,7 @@ SECTIONS
   code : { } :code
   . = KERNEL_START;
 
+  _start_replicate = .;
   _text = .;
   _stext = .;
 
@@ -79,8 +80,30 @@ SECTIONS
 	  __stop___mca_table = .;
 	}
 
-  /* Global data */
-  _data = .;
+  .data.patch.vtop : AT(ADDR(.data.patch.vtop) - LOAD_OFFSET)
+	{
+	  __start___vtop_patchlist = .;
+	  *(.data.patch.vtop)
+	  __end___vtop_patchlist = .;
+	}
+
+  .data.patch.mckinley_e9 : AT(ADDR(.data.patch.mckinley_e9) - LOAD_OFFSET)
+	{
+	  __start___mckinley_e9_bundles = .;
+	  *(.data.patch.mckinley_e9)
+	  __end___mckinley_e9_bundles = .;
+	}
+
+#if defined(CONFIG_IA64_GENERIC)
+  /* Machine Vector */
+  . = ALIGN(16);
+  .machvec : AT(ADDR(.machvec) - LOAD_OFFSET)
+	{
+	  machvec_start = .;
+	  *(.machvec)
+	  machvec_end = .;
+	}
+#endif
 
   /* Unwind info & table: */
   . = ALIGN(8);
@@ -98,7 +121,7 @@ SECTIONS
   .opd : AT(ADDR(.opd) - LOAD_OFFSET)
 	{ *(.opd) }
 
-  /* Initialization code and data: */
+  /* Initialization code: */
 
   . = ALIGN(PAGE_SIZE);
   __init_begin = .;
@@ -109,6 +132,21 @@ SECTIONS
 	  _einittext = .;
 	}
 
+  . = ALIGN(PAGE_SIZE);
+  __init_end = .;
+  _end_replicate = .;
+
+#ifdef CONFIG_KERNEL_TEXT_REPLICATION
+#undef LOAD_OFFSET
+#define LOAD_OFFSET	(KERNEL_DATA_START - KERNEL_TR_PAGE_SIZE)
+. = KERNEL_DATA_START + (. - KERNEL_START);
+#endif
+
+  /* Global read/write data */
+  _data = .;
+
+  /* Initialization data: */
+  __init_data_begin = .;
   .init.data : AT(ADDR(.init.data) - LOAD_OFFSET)
 	{ *(.init.data) }
 
@@ -139,31 +177,6 @@ SECTIONS
 	  __initcall_end = .;
 	}
 
-  .data.patch.vtop : AT(ADDR(.data.patch.vtop) - LOAD_OFFSET)
-	{
-	  __start___vtop_patchlist = .;
-	  *(.data.patch.vtop)
-	  __end___vtop_patchlist = .;
-	}
-
-  .data.patch.mckinley_e9 : AT(ADDR(.data.patch.mckinley_e9) - LOAD_OFFSET)
-	{
-	  __start___mckinley_e9_bundles = .;
-	  *(.data.patch.mckinley_e9)
-	  __end___mckinley_e9_bundles = .;
-	}
-
-#if defined(CONFIG_IA64_GENERIC)
-  /* Machine Vector */
-  . = ALIGN(16);
-  .machvec : AT(ADDR(.machvec) - LOAD_OFFSET)
-	{
-	  machvec_start = .;
-	  *(.machvec)
-	  machvec_end = .;
-	}
-#endif
-
    __con_initcall_start = .;
   .con_initcall.init : AT(ADDR(.con_initcall.init) - LOAD_OFFSET)
 	{ *(.con_initcall.init) }
@@ -173,7 +186,7 @@ SECTIONS
 	{ *(.security_initcall.init) }
   __security_initcall_end = .;
   . = ALIGN(PAGE_SIZE);
-  __init_end = .;
+  __init_data_end = .;
 
   /* The initial task and kernel stack */
   .data.init_task : AT(ADDR(.data.init_task) - LOAD_OFFSET)
Index: linux/include/asm-ia64/kregs.h
===================================================================
--- linux.orig/include/asm-ia64/kregs.h	2006-04-18 14:35:34.407443859 -0500
+++ linux/include/asm-ia64/kregs.h	2006-04-18 14:36:14.903435299 -0500
@@ -27,11 +27,16 @@
 /*
  * Translation registers:
  */
-#define IA64_TR_KERNEL		0	/* itr0, dtr0: maps kernel image (code & data) */
+#define IA64_TR_KERNEL		0	/* itr0, dtr0: maps kernel image (code & readonly data) */
+					/*    also maps RW data if text replication is not enabled */
 #define IA64_TR_PALCODE		1	/* itr1: maps PALcode as required by EFI */
 #define IA64_TR_PERCPU_DATA	1	/* dtr1: percpu data */
 #define IA64_TR_CURRENT_STACK	2	/* dtr2: maps kernel's memory- & register-stacks */
 
+#ifdef CONFIG_KERNEL_TEXT_REPLICATION
+#define IA64_TR_KERNEL_DATA	3	/* dtr3: maps kernel's global data */
+#endif
+
 /* Processor status register bits: */
 #define IA64_PSR_BE_BIT		1
 #define IA64_PSR_UP_BIT		2
Index: linux/include/asm-ia64/numa.h
===================================================================
--- linux.orig/include/asm-ia64/numa.h	2006-04-18 14:35:34.407443859 -0500
+++ linux/include/asm-ia64/numa.h	2006-04-18 14:36:14.903435299 -0500
@@ -71,4 +71,14 @@ extern int paddr_to_nid(unsigned long pa
 
 #endif /* CONFIG_NUMA */
 
+#ifdef CONFIG_KERNEL_TEXT_REPLICATION
+extern void remap_kernel_text(void *);
+extern void check_remap_kernel_text(int);
+extern void replicate_kernel(void);
+#else
+#define remap_kernel_text(p)
+#define check_remap_kernel_text(c)
+#define replicate_kernel()
+#endif
+
 #endif /* _ASM_IA64_NUMA_H */
Index: linux/include/asm-ia64/pgtable.h
===================================================================
--- linux.orig/include/asm-ia64/pgtable.h	2006-04-18 14:35:34.411443463 -0500
+++ linux/include/asm-ia64/pgtable.h	2006-04-18 14:36:14.903435299 -0500
@@ -146,6 +146,7 @@
 #define PAGE_COPY_EXEC	__pgprot(__ACCESS_BITS | _PAGE_PL_3 | _PAGE_AR_RX)
 #define PAGE_GATE	__pgprot(__ACCESS_BITS | _PAGE_PL_0 | _PAGE_AR_X_RX)
 #define PAGE_KERNEL	__pgprot(__DIRTY_BITS  | _PAGE_PL_0 | _PAGE_AR_RWX)
+#define PAGE_KERNELR	__pgprot(__ACCESS_BITS | _PAGE_PL_0 | _PAGE_AR_R)
 #define PAGE_KERNELRX	__pgprot(__ACCESS_BITS | _PAGE_PL_0 | _PAGE_AR_RX)
 
 # ifndef __ASSEMBLY__
Index: linux/arch/ia64/Kconfig
===================================================================
--- linux.orig/arch/ia64/Kconfig	2006-04-18 14:35:34.407443859 -0500
+++ linux/arch/ia64/Kconfig	2006-04-18 14:36:14.907434903 -0500
@@ -260,6 +260,13 @@ config NR_CPUS
 	  than 64 will cause the use of a CPU mask array, causing a small
 	  performance hit.
 
+config KERNEL_TEXT_REPLICATION
+	bool "Kernel text replication"
+	depends on NUMA
+	default off
+	help
+	  Say Y if you want to eeplicate kernel text on each node of a NUMA system.
+
 config HOTPLUG_CPU
 	bool "Support for hot-pluggable CPUs (EXPERIMENTAL)"
 	depends on SMP && EXPERIMENTAL

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [Patch: 006/006] pgdat allocation for new node add (call pgdat allocation)
From: Yasunori Goto @ 2006-04-20 10:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel ML, linux-mm
In-Reply-To: <20060420185123.EE48.Y-GOTO@jp.fujitsu.com>

This patch adds node-hot-add support to add_memory().

node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
   due to update phase is difference.)

Note:
  To make common function as much as possible, 
  there is 2 changes from v2.
    - The old add_memory(), which is defiend by each archs,
      is renamed to arch_add_memory(). New add_memory becomes
      caller of arch dependent function as a common code.
  
    - This patch changes add_memory()'s interface
        From: add_memory(start, end)
        TO  : add_memory(nid, start, end).
      It was cause of similar code that finding node id from
      physical address is inside of old add_memory() on each arch. 
      
      In addition, acpi memory hotplug driver can find node id easier.
      In v2, it must walk DSDT'S _CRS by matching physical address to
      get the handle of its memory device, then get _PXM and node id.
      Because input is just physical address. 
      However, in v3, the acpi driver can use handle to get _PXM and node id
      for the new memory device. It can pass just node id to add_memory().


Fix interface of arch_add_memory() is in next patche.

Signed-off-by: Yasunori Goto     <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 mm/memory_hotplug.c |   52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 52 insertions(+)

Index: pgdat11/mm/memory_hotplug.c
===================================================================
--- pgdat11.orig/mm/memory_hotplug.c	2006-04-20 16:36:38.000000000 +0900
+++ pgdat11/mm/memory_hotplug.c	2006-04-20 17:09:39.000000000 +0900
@@ -160,12 +160,64 @@ int online_pages(unsigned long pfn, unsi
 	return 0;
 }
 
+static pg_data_t *hotadd_new_pgdat(int nid, u64 start)
+{
+	struct pglist_data *pgdat;
+	unsigned long zones_size[MAX_NR_ZONES] = {0};
+	unsigned long zholes_size[MAX_NR_ZONES] = {0};
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+
+	pgdat = arch_alloc_nodedata(nid);
+	if (!pgdat)
+		return NULL;
+
+	arch_refresh_nodedata(nid, pgdat);
+
+	/* we can use NODE_DATA(nid) from here */
+
+	/* init node's zones as empty zones, we don't have any present pages.*/
+	free_area_init_node(nid, pgdat, zones_size, start_pfn, zholes_size);
+
+	return pgdat;
+}
+
+static void rollback_node_hotadd(int nid, pg_data_t *pgdat)
+{
+	arch_refresh_nodedata(nid, NULL);
+	arch_free_nodedata(pgdat);
+	return;
+}
+
 int add_memory(int nid, u64 start, u64 size)
 {
+	pg_data_t *pgdat = NULL;
+	int new_pgdat = 0;
 	int ret;
 
+	if (!node_online(nid)) {
+		pgdat = hotadd_new_pgdat(nid, start);
+		if (!pgdat)
+			return -ENOMEM;
+		new_pgdat = 1;
+		ret = kswapd_run(nid);
+		if (ret)
+			goto error;
+	}
+
 	/* call arch's memory hotadd */
 	ret = arch_add_memory(nid, start, size);
 
+	if (ret < 0)
+		goto error;
+
+	/* we online node here. we have no error path from here. */
+	node_set_online(nid);
+
+	return ret;
+error:
+	/* rollback pgdat allocation and others */
+	if (new_pgdat)
+		rollback_node_hotadd(nid, pgdat);
+
 	return ret;
 }

-- 
Yasunori Goto 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [Patch: 005/006] pgdat allocation for new node add (export kswapd start func)
From: Yasunori Goto @ 2006-04-20 10:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel ML, linux-mm
In-Reply-To: <20060420185123.EE48.Y-GOTO@jp.fujitsu.com>

When node is hot-added, kswapd for the node should start.
This export kswapd start function as kswapd_run() to use at add_memory().


Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 include/linux/swap.h |    2 ++
 mm/vmscan.c          |   35 ++++++++++++++++++++++++++---------
 2 files changed, 28 insertions(+), 9 deletions(-)

Index: pgdat11/mm/vmscan.c
===================================================================
--- pgdat11.orig/mm/vmscan.c	2006-04-20 11:00:07.000000000 +0900
+++ pgdat11/mm/vmscan.c	2006-04-20 11:00:33.000000000 +0900
@@ -35,6 +35,7 @@
 #include <linux/notifier.h>
 #include <linux/rwsem.h>
 #include <linux/delay.h>
+#include <linux/kthread.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -1353,20 +1354,36 @@ static int __devinit cpu_callback(struct
 }
 #endif /* CONFIG_HOTPLUG_CPU */
 
+/*
+ * This kswapd start function will be called by init and node-hot-add.
+ * On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
+ */
+int kswapd_run(int nid)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+	int ret = 0;
+
+	if (pgdat->kswapd)
+		return 0;
+
+	pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
+	if (pgdat->kswapd == ERR_PTR(-ENOMEM)) {
+		/* failure at boot is fatal */
+		BUG_ON(system_state == SYSTEM_BOOTING);
+		printk("faled to run kswapd on node %d\n",nid);
+		ret = -1;
+	}
+	return ret;
+}
+
 static int __init kswapd_init(void)
 {
-	pg_data_t *pgdat;
+	int nid;
 
 	swap_setup();
-	for_each_online_pgdat(pgdat) {
-		pid_t pid;
+	for_each_online_node(nid)
+ 		kswapd_run(nid);
 
-		pid = kernel_thread(kswapd, pgdat, CLONE_KERNEL);
-		BUG_ON(pid < 0);
-		read_lock(&tasklist_lock);
-		pgdat->kswapd = find_task_by_pid(pid);
-		read_unlock(&tasklist_lock);
-	}
 	total_memory = nr_free_pagecache_pages();
 	hotcpu_notifier(cpu_callback, 0);
 	return 0;
Index: pgdat11/include/linux/swap.h
===================================================================
--- pgdat11.orig/include/linux/swap.h	2006-04-20 11:00:07.000000000 +0900
+++ pgdat11/include/linux/swap.h	2006-04-20 11:00:33.000000000 +0900
@@ -212,6 +212,8 @@ static inline int zone_reclaim(struct zo
 }
 #endif
 
+extern int kswapd_run(int nid);
+
 #ifdef CONFIG_MMU
 /* linux/mm/shmem.c */
 extern int shmem_unuse(swp_entry_t entry, struct page *page);

-- 
Yasunori Goto 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [Patch: 004/006] pgdat allocation for new node add (refresh node_data[])
From: Yasunori Goto @ 2006-04-20 10:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel ML, linux-mm
In-Reply-To: <20060420185123.EE48.Y-GOTO@jp.fujitsu.com>

This function refresh NODE_DATA() for generic archs.
In this case, NODE_DATA(nid) == node_data[nid].
node_data[] is array of address of pgdat.
So, refresh is quite simple.

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 arch/ia64/Kconfig              |    4 ++++
 include/linux/memory_hotplug.h |   12 ++++++++++++
 2 files changed, 16 insertions(+)

Index: pgdat11/include/linux/memory_hotplug.h
===================================================================
--- pgdat11.orig/include/linux/memory_hotplug.h	2006-04-20 11:00:23.000000000 +0900
+++ pgdat11/include/linux/memory_hotplug.h	2006-04-20 11:00:28.000000000 +0900
@@ -91,6 +91,9 @@ static inline pg_data_t * arch_alloc_nod
 static inline void arch_free_nodedata(pg_data_t *pgdat)
 {
 }
+static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
+{
+}
 
 #else /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
 
@@ -114,6 +117,12 @@ static inline void arch_free_nodedata(pg
  */
 #define generic_free_nodedata(pgdat)	kfree(pgdat)
 
+extern pg_data_t *node_data[];
+static inline void generic_refresh_nodedata(int nid, pg_data_t *pgdat)
+{
+	node_data[nid] = pgdat;
+}
+
 #else /* !CONFIG_NUMA */
 
 /* never called */
@@ -125,6 +134,9 @@ static inline pg_data_t *generic_alloc_n
 static inline void generic_free_nodedata(pg_data_t *pgdat)
 {
 }
+static inline void generic_refresh_nodedata(int nid, pg_data_t *pgdat)
+{
+}
 #endif /* CONFIG_NUMA */
 #endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
 
Index: pgdat11/arch/ia64/Kconfig
===================================================================
--- pgdat11.orig/arch/ia64/Kconfig	2006-04-20 11:00:04.000000000 +0900
+++ pgdat11/arch/ia64/Kconfig	2006-04-20 11:00:28.000000000 +0900
@@ -374,6 +374,10 @@ config HAVE_ARCH_EARLY_PFN_TO_NID
 	def_bool y
 	depends on NEED_MULTIPLE_NODES
 
+config HAVE_ARCH_NODEDATA_EXTENSION
+	def_bool y
+	depends on NUMA
+
 config IA32_SUPPORT
 	bool "Support for Linux/x86 binaries"
 	help

-- 
Yasunori Goto 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [Patch: 003/006] pgdat allocation for new node add (generic alloc node_data)
From: Yasunori Goto @ 2006-04-20 10:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel ML, linux-mm
In-Reply-To: <20060420185123.EE48.Y-GOTO@jp.fujitsu.com>

For node hotplug, basically we have to allocate new pgdat.
But, there are several types of implementations of pgdat.

1. Allocate only pgdat.
   This style allocate only pgdat area.
   And its address is recorded in node_data[].
   It is most popular style.

2. Static array of pgdat
   In this case, all of pgdats are static array.
   Some archs use this style.

3. Allocate not only pgdat, but also per node data.
   To increase performance, each node has copy of some data as
   a per node data. So, this area must be allocated too.

   Ia64 is this style. Ia64 has the copies of node_data[] array
   on each per node data to increase performance.

In this series of patches, treat (1) as generic arch.

generic archs can use generic function. (2) and (3) should have
its own if necessary. 

This patch defines pgdat allocator.
Updating NODE_DATA() macro function is in other patch.

( I'll post another patch for (3).
  I don't know (2) which can use memory hotplug.
  So, there is not patch for (2). )

Signed-off-by: Yasonori Goto     <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 include/linux/memory_hotplug.h |   55 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 55 insertions(+)

Index: pgdat11/include/linux/memory_hotplug.h
===================================================================
--- pgdat11.orig/include/linux/memory_hotplug.h	2006-04-20 11:00:09.000000000 +0900
+++ pgdat11/include/linux/memory_hotplug.h	2006-04-20 11:00:23.000000000 +0900
@@ -73,6 +73,61 @@ static inline int memofy_add_physaddr_to
 }
 #endif
 
+#ifdef CONFIG_HAVE_ARCH_NODEDATA_EXTENSION
+/*
+ * For supporint node-hotadd, we have to allocate new pgdat.
+ *
+ * If an arch have generic style NODE_DATA(),
+ * node_data[nid] = kzalloc() works well . But it depends on each arch.
+ *
+ * In general, generic_alloc_nodedata() is used.
+ * Now, arch_free_nodedata() is just defined for error path of node_hot_add.
+ *
+ */
+static inline pg_data_t * arch_alloc_nodedata(int nid)
+{
+	return NULL;
+}
+static inline void arch_free_nodedata(pg_data_t *pgdat)
+{
+}
+
+#else /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
+
+#define arch_alloc_nodedata(nid)	generic_alloc_nodedata(nid)
+#define arch_free_nodedata(pgdat)	generic_free_nodedata(pgdat)
+
+#ifdef CONFIG_NUMA
+/*
+ * If ARCH_HAS_NODEDATA_EXTENSION=n, this func is used to allocate pgdat.
+ * XXX: kmalloc_node() can't work well to get new node's memory at this time.
+ *	Because, pgdat for the new node is not allocated/initialized yet itself.
+ *	To use new node's memory, more consideration will be necessary.
+ */
+#define generic_alloc_nodedata(nid)				\
+({								\
+	(pg_data_t *)kzalloc(sizeof(pg_data_t), GFP_KERNEL);	\
+})
+/*
+ * This definition is just for error path in node hotadd.
+ * For node hotremove, we have to replace this.
+ */
+#define generic_free_nodedata(pgdat)	kfree(pgdat)
+
+#else /* !CONFIG_NUMA */
+
+/* never called */
+static inline pg_data_t *generic_alloc_nodedata(int nid)
+{
+	BUG();
+	return NULL;
+}
+static inline void generic_free_nodedata(pg_data_t *pgdat)
+{
+}
+#endif /* CONFIG_NUMA */
+#endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
+
 #else /* ! CONFIG_MEMORY_HOTPLUG */
 /*
  * Stub functions for when hotplug is off

-- 
Yasunori Goto 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [Patch: 002/006] pgdat allocation for new node add (get node id by acpi)
From: Yasunori Goto @ 2006-04-20 10:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel ML, linux-mm
In-Reply-To: <20060420185123.EE48.Y-GOTO@jp.fujitsu.com>

This is to find node id from acpi's handle of memory_device in DSDT.
_PXM for the new node can be found by acpi_get_pxm()
by using new memory's handle. 
So, node id can be found by pxm_to_nid_map[].

  This patch becomes simpler than v2 of node hot-add patch.
  Because old add_memory() function doesn't have node id parameter.
  So, kernel must find its handle by physical address via DSDT again.
  But, v3 just give node id to add_memory() now.

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>

 drivers/acpi/acpi_memhotplug.c |    3 ++-
 drivers/acpi/numa.c            |   15 +++++++++++++++
 include/linux/acpi.h           |    6 ++++++
 3 files changed, 23 insertions(+), 1 deletion(-)

Index: pgdat11/drivers/acpi/acpi_memhotplug.c
===================================================================
--- pgdat11.orig/drivers/acpi/acpi_memhotplug.c	2006-04-20 11:00:09.000000000 +0900
+++ pgdat11/drivers/acpi/acpi_memhotplug.c	2006-04-20 11:00:17.000000000 +0900
@@ -215,7 +215,7 @@ static int acpi_memory_enable_device(str
 {
 	int result, num_enabled = 0;
 	struct acpi_memory_info *info;
-	int node = 0;
+	int node;
 
 	ACPI_FUNCTION_TRACE("acpi_memory_enable_device");
 
@@ -227,6 +227,7 @@ static int acpi_memory_enable_device(str
 		return result;
 	}
 
+	node = acpi_get_node(mem_device->handle);
 	/*
 	 * Tell the VM there is more memory here...
 	 * Note: Assume that this function returns zero on success
Index: pgdat11/drivers/acpi/numa.c
===================================================================
--- pgdat11.orig/drivers/acpi/numa.c	2006-04-20 11:00:04.000000000 +0900
+++ pgdat11/drivers/acpi/numa.c	2006-04-20 11:00:17.000000000 +0900
@@ -256,3 +256,18 @@ int acpi_get_pxm(acpi_handle h)
 }
 
 EXPORT_SYMBOL(acpi_get_pxm);
+
+int acpi_get_node(acpi_handle *handle)
+{
+	int pxm, node = -1;
+
+	ACPI_FUNCTION_TRACE("acpi_get_node");
+
+	pxm = acpi_get_pxm(handle);
+	if (pxm >= 0)
+		node = acpi_map_pxm_to_node(pxm);
+
+	return_VALUE(node);
+}
+
+EXPORT_SYMBOL(acpi_get_node);
Index: pgdat11/include/linux/acpi.h
===================================================================
--- pgdat11.orig/include/linux/acpi.h	2006-04-20 11:00:07.000000000 +0900
+++ pgdat11/include/linux/acpi.h	2006-04-20 11:00:17.000000000 +0900
@@ -529,12 +529,18 @@ static inline void acpi_set_cstate_limit
 
 #ifdef CONFIG_ACPI_NUMA
 int acpi_get_pxm(acpi_handle handle);
+int acpi_get_node(acpi_handle *handle);
 #else
 static inline int acpi_get_pxm(acpi_handle handle)
 {
 	return 0;
 }
+static inline int acpi_get_node(acpi_handle *handle)
+{
+	return 0;
+}
 #endif
+extern int acpi_paddr_to_node(u64 start_addr, u64 size);
 
 extern int pnpacpi_disabled;
 

-- 
Yasunori Goto 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [Patch: 001/006] pgdat allocation for new node add (specify node id)
From: Yasunori Goto @ 2006-04-20 10:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel ML, linux-mm
In-Reply-To: <20060420185123.EE48.Y-GOTO@jp.fujitsu.com>

This patch changes name of old add_memory() to arch_add_memory.
and use node id to get pgdat for the node at NODE_DATA().

Note: Powerpc's old add_memory() is defined as __devinit. However,
      add_memory() is usually called only after bootup. 
      I suppose it may be redundant. But, I'm not well known about powerpc.
      So, I keep it. (But, __meminit is better at least.)

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>

 arch/i386/mm/init.c            |    2 +-
 arch/ia64/mm/init.c            |    4 ++--
 arch/powerpc/mm/mem.c          |    9 ++++++---
 arch/x86_64/mm/init.c          |    6 +++---
 drivers/acpi/acpi_memhotplug.c |    3 ++-
 drivers/base/memory.c          |    4 +++-
 include/linux/memory_hotplug.h |   13 ++++++++++++-
 mm/memory_hotplug.c            |   10 ++++++++++
 8 files changed, 39 insertions(+), 12 deletions(-)

Index: pgdat11/arch/i386/mm/init.c
===================================================================
--- pgdat11.orig/arch/i386/mm/init.c	2006-04-20 11:00:04.000000000 +0900
+++ pgdat11/arch/i386/mm/init.c	2006-04-20 16:08:21.000000000 +0900
@@ -654,7 +654,7 @@ void __init mem_init(void)
  */
 #ifdef CONFIG_MEMORY_HOTPLUG
 #ifndef CONFIG_NEED_MULTIPLE_NODES
-int add_memory(u64 start, u64 size)
+int arch_add_memory(int nid, u64 start , u64 size)
 {
 	struct pglist_data *pgdata = &contig_page_data;
 	struct zone *zone = pgdata->node_zones + MAX_NR_ZONES-1;
Index: pgdat11/arch/ia64/mm/init.c
===================================================================
--- pgdat11.orig/arch/ia64/mm/init.c	2006-04-20 11:00:04.000000000 +0900
+++ pgdat11/arch/ia64/mm/init.c	2006-04-20 16:04:14.000000000 +0900
@@ -652,7 +652,7 @@ void online_page(struct page *page)
 	num_physpages++;
 }
 
-int add_memory(u64 start, u64 size)
+int arch_add_memory(int nid, u64 start, u64 size)
 {
 	pg_data_t *pgdat;
 	struct zone *zone;
@@ -660,7 +660,7 @@ int add_memory(u64 start, u64 size)
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
-	pgdat = NODE_DATA(0);
+	pgdat = NODE_DATA(nid);
 
 	zone = pgdat->node_zones + ZONE_NORMAL;
 	ret = __add_pages(zone, start_pfn, nr_pages);
Index: pgdat11/arch/powerpc/mm/mem.c
===================================================================
--- pgdat11.orig/arch/powerpc/mm/mem.c	2006-04-20 10:59:54.000000000 +0900
+++ pgdat11/arch/powerpc/mm/mem.c	2006-04-20 16:06:58.000000000 +0900
@@ -114,15 +114,18 @@ void online_page(struct page *page)
 	num_physpages++;
 }
 
-int __devinit add_memory(u64 start, u64 size)
+int memory_add_physaddr_to_nid(u64 start)
+{
+	return hot_add_scn_to_nid(start);
+}
+
+int __devinit arch_add_memory(in nid, u64 start, u64 size)
 {
 	struct pglist_data *pgdata;
 	struct zone *zone;
-	int nid;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
-	nid = hot_add_scn_to_nid(start);
 	pgdata = NODE_DATA(nid);
 
 	start = (unsigned long)__va(start);
Index: pgdat11/arch/x86_64/mm/init.c
===================================================================
--- pgdat11.orig/arch/x86_64/mm/init.c	2006-04-20 11:00:04.000000000 +0900
+++ pgdat11/arch/x86_64/mm/init.c	2006-04-20 16:10:38.000000000 +0900
@@ -552,9 +552,9 @@ int __add_pages(struct zone *z, unsigned
  * Memory is added always to NORMAL zone. This means you will never get
  * additional DMA/DMA32 memory.
  */
-int add_memory(u64 start, u64 size)
+int arch_add_memory(int nid, u64 start, u64 size)
 {
-	struct pglist_data *pgdat = NODE_DATA(0);
+	struct pglist_data *pgdat = NODE_DATA(nid);
 	struct zone *zone = pgdat->node_zones + MAX_NR_ZONES-2;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
@@ -571,7 +571,7 @@ error:
 	printk("%s: Problem encountered in __add_pages!\n", __func__);
 	return ret;
 }
-EXPORT_SYMBOL_GPL(add_memory);
+EXPORT_SYMBOL_GPL(arch_add_memory);
 
 int remove_memory(u64 start, u64 size)
 {
Index: pgdat11/drivers/acpi/acpi_memhotplug.c
===================================================================
--- pgdat11.orig/drivers/acpi/acpi_memhotplug.c	2006-04-20 11:00:04.000000000 +0900
+++ pgdat11/drivers/acpi/acpi_memhotplug.c	2006-04-20 16:35:24.000000000 +0900
@@ -215,6 +215,7 @@ static int acpi_memory_enable_device(str
 {
 	int result, num_enabled = 0;
 	struct acpi_memory_info *info;
+	int node = 0;
 
 	ACPI_FUNCTION_TRACE("acpi_memory_enable_device");
 
@@ -244,7 +245,7 @@ static int acpi_memory_enable_device(str
 			continue;
 		}
 
-		result = add_memory(info->start_addr, info->length);
+		result = add_memory(node, info->start_addr, info->length);
 		if (result)
 			continue;
 		info->enabled = 1;
Index: pgdat11/include/linux/memory_hotplug.h
===================================================================
--- pgdat11.orig/include/linux/memory_hotplug.h	2006-04-20 11:00:07.000000000 +0900
+++ pgdat11/include/linux/memory_hotplug.h	2006-04-20 16:35:23.000000000 +0900
@@ -63,6 +63,16 @@ extern int online_pages(unsigned long, u
 /* reasonably generic interface to expand the physical pages in a zone  */
 extern int __add_pages(struct zone *zone, unsigned long start_pfn,
 	unsigned long nr_pages);
+
+#if defined(CONFIG_NUMA)
+extern int memory_add_physaddr_to_nid(u64 start);
+#else
+static inline int memofy_add_physaddr_to_nid(u64 start)
+{
+	return 0;
+}
+#endif
+
 #else /* ! CONFIG_MEMORY_HOTPLUG */
 /*
  * Stub functions for when hotplug is off
@@ -99,7 +109,8 @@ static inline int __remove_pages(struct 
 	return -ENOSYS;
 }
 
-extern int add_memory(u64 start, u64 size);
+extern int add_memory(int nid, u64 start, u64 size);
+extern int arch_add_memory(int nid, u64 start, u64 size);
 extern int remove_memory(u64 start, u64 size);
 
 #endif /* __LINUX_MEMORY_HOTPLUG_H */
Index: pgdat11/mm/memory_hotplug.c
===================================================================
--- pgdat11.orig/mm/memory_hotplug.c	2006-04-20 11:00:07.000000000 +0900
+++ pgdat11/mm/memory_hotplug.c	2006-04-20 16:35:53.000000000 +0900
@@ -159,3 +159,13 @@ int online_pages(unsigned long pfn, unsi
 
 	return 0;
 }
+
+int add_memory(int nid, u64 start, u64 size)
+{
+	int ret;
+
+	/* call arch's memory hotadd */
+	ret = arch_add_memory(nid, start, size;
+
+	return ret;
+}
Index: pgdat11/drivers/base/memory.c
===================================================================
--- pgdat11.orig/drivers/base/memory.c	2006-04-20 10:59:54.000000000 +0900
+++ pgdat11/drivers/base/memory.c	2006-04-20 11:00:09.000000000 +0900
@@ -306,11 +306,13 @@ static ssize_t
 memory_probe_store(struct class *class, const char *buf, size_t count)
 {
 	u64 phys_addr;
+	int nid;
 	int ret;
 
 	phys_addr = simple_strtoull(buf, NULL, 0);
 
-	ret = add_memory(phys_addr, PAGES_PER_SECTION << PAGE_SHIFT);
+	nid = memory_add_physaddr_to_nid(phys_addr);
+	ret = add_memory(nid, phys_addr, PAGES_PER_SECTION << PAGE_SHIFT);
 
 	if (ret)
 		count = ret;

-- 
Yasunori Goto 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [Patch: 000/006] pgdat allocation for new node add
From: Yasunori Goto @ 2006-04-20 10:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel ML, linux-mm, Yasunori Goto

Hello.

These are parts of patches for new nodes addition v4.
When new node is added, new pgdat is allocated and initialized by this patch.
These includes...
  - specify node id at add_memory().
  - start kswapd for new node.
  - allocate pgdat and register its address to node_data[].

This set includes node_data[] updater for generic arch.
Ia64 has copies of node_data[] on each node.
But, this patch set doesn't include patches to update them.
I'll post them later.


This patch is for 2.6.17-rc1-mm3.

Please apply.

------------------------------------------------------------

Change log from v4 of node hot-add.
  - generic pgdat allocation is picked up.
  - update for 2.6.17-rc1-mm3.

V4 of post is here.
<description>
http://marc.theaimsgroup.com/?l=linux-mm&m=114258404023573&w=2
<patches>
http://marc.theaimsgroup.com/?l=linux-mm&w=2&r=1&s=memory+hotplug+node+v.4.&q=b



-- 
Yasunori Goto 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Mem-misc tests
From: Tony Huang @ 2006-04-19 16:57 UTC (permalink / raw)
  To: linux-mm

Hi,

We tried to run your mem-misc tests(e.g. shm-stress).
So, in order for the binary to run on different OS
(e.g. Red Hat Enterprise Linux 3), we need to
recompile of the files using the 'Makefile', right? We
ran into some problems recompiling the files? Is there
anything else that we should do to make the binary
(e.g. shm-stress) work? Many thanks.

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: Read/Write migration entries: Implement correct behavior in copy_one_pte
From: KAMEZAWA Hiroyuki @ 2006-04-19  3:39 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: hugh, linux-kernel, linux-mm, akpm
In-Reply-To: <Pine.LNX.4.64.0604181823590.9747@schroedinger.engr.sgi.com>

On Tue, 18 Apr 2006 18:27:28 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> On Wed, 19 Apr 2006, KAMEZAWA Hiroyuki wrote:
> 
> > > Note that this is again only a partial solution. mprotect() also has the
> > > potential of changing the write status to read. 
> > yes. in change_pte_range(). 
> > 
> > Note:
> > fork() and mprotect() both requires mm->mmap_sem.
> > So both of them is not problem when migration holds mm->mmap_sem.
> > If we does lazy migration or memory hot removing or allows migration from
> > another process, this will be problem.
> 
> Oh. We already allow migration from another process since the page may 
> be mapped by multiple mm's. Page migration will then replace the ptes in 
> *all* mm_structs that map this page with migration entries.
> 
> So we need a fix here.
> 
Ah.....yes. sorry.

In my understanding (and grep), read/write protection for anon pages 
can be changed under

- fork()
- mprotect()

all are known.

BTW, do we manage page table under move_vma() in right way ?

-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: Read/Write migration entries: Implement correct behavior in copy_one_pte
From: Christoph Lameter @ 2006-04-19  1:27 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: hugh, linux-kernel, linux-mm, akpm
In-Reply-To: <20060419095044.d7333b21.kamezawa.hiroyu@jp.fujitsu.com>

On Wed, 19 Apr 2006, KAMEZAWA Hiroyuki wrote:

> > Note that this is again only a partial solution. mprotect() also has the
> > potential of changing the write status to read. 
> yes. in change_pte_range(). 
> 
> Note:
> fork() and mprotect() both requires mm->mmap_sem.
> So both of them is not problem when migration holds mm->mmap_sem.
> If we does lazy migration or memory hot removing or allows migration from
> another process, this will be problem.

Oh. We already allow migration from another process since the page may 
be mapped by multiple mm's. Page migration will then replace the ptes in 
*all* mm_structs that map this page with migration entries.

So we need a fix here.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: Read/Write migration entries: Implement correct behavior in copy_one_pte
From: KAMEZAWA Hiroyuki @ 2006-04-19  0:50 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: hugh, linux-kernel, linux-mm, akpm
In-Reply-To: <Pine.LNX.4.64.0604181119480.7814@schroedinger.engr.sgi.com>

On Tue, 18 Apr 2006 11:21:25 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> Note that this is again only a partial solution. mprotect() also has the
> potential of changing the write status to read. 
yes. in change_pte_range(). 

Note:
fork() and mprotect() both requires mm->mmap_sem.
So both of them is not problem when migration holds mm->mmap_sem.
If we does lazy migration or memory hot removing or allows migration from
another process, this will be problem.



> Are there any additional occurrences? Would you check and fix this one as well?
> 
pte_modify() looks to be called only by mprotect(). I checked all mk_pte() but
not found no occurrences now. But I'll have to do more.


> If we cannot get to all the locations or if these fixes get too extensive
> then I think we better drop the preservation of write permissions and
> tolerate the occurrence of some useless COW after migration.
> 
yes. I agree.

-Kame
> 
> 
> Migration entries with write permission must become SWP_MIGRATION_READ
> entries if a COW mapping is processed. The migration entries from which
> the copy is being made must also become SWP_MIGRATION_READ. This mimicks
> the copying of pte for an anonymous page.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 
> Index: linux-2.6.17-rc1-mm3/mm/memory.c
> ===================================================================
> --- linux-2.6.17-rc1-mm3.orig/mm/memory.c	2006-04-18 10:58:33.000000000 -0700
> +++ linux-2.6.17-rc1-mm3/mm/memory.c	2006-04-18 11:09:23.000000000 -0700
> @@ -434,7 +434,9 @@ copy_one_pte(struct mm_struct *dst_mm, s
>  	/* pte contains position in swap or file, so copy. */
>  	if (unlikely(!pte_present(pte))) {
>  		if (!pte_file(pte)) {
> -			swap_duplicate(pte_to_swp_entry(pte));
> +			swp_entry_t entry = pte_to_swp_entry(pte);
> +
> +			swap_duplicate(entry);
>  			/* make sure dst_mm is on swapoff's mmlist. */
>  			if (unlikely(list_empty(&dst_mm->mmlist))) {
>  				spin_lock(&mmlist_lock);
> @@ -443,6 +445,19 @@ copy_one_pte(struct mm_struct *dst_mm, s
>  						 &src_mm->mmlist);
>  				spin_unlock(&mmlist_lock);
>  			}
> +			if (is_migration_entry(entry) &&
> +					is_cow_mapping(vm_flags)) {
> +				page = migration_entry_to_page(entry);
> +
> +				/*
> +				 * COW mappings require pages in both parent
> +				*  and child to be set to read.
> +				 */
> +				entry = make_migration_entry(page,
> +	`					SWP_MIGRATION_READ);
> +				pte = swp_entry_to_pte(entry);
> +				set_pte_at(src_mm, addr, src_pte, pte);
> +			}
>  		}
>  		goto out_set_pte;
>  	}
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH] slab: cleanup kmem_getpages
From: Andrew Morton @ 2006-04-18 23:38 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: dgc, linux-mm
In-Reply-To: <20060418232428.GA13570@lst.de>

Christoph Hellwig <hch@lst.de> wrote:
>
> > > +	page = alloc_pages_node(nodeid, flags, cachep->gfporder);
> > >  	if (!page)
> > >  		return NULL;
> > > -	addr = page_address(page);
> > .....
> > > +	while (nr_pages--) {
> > >  		__SetPageSlab(page);
> > >  		page++;
> > >  	}
> > > -	return addr;
> > > +	return page_address(page);
> > 
> > I think that's a bug - you return the address of the page after the
> > allocation, not the first page of the allocation.
> 
> You're right.  I wonder why this didn't show up in my testing.  Looks
> like slab will never allocate any high-order pages if your page size
> is big enough..
> 
> Andrew, please drop this for now.  I'll redo it without that bit once
> I'll get some time.

I already fixed it - it was giving me instantaneous oopses.

--- devel/mm/slab.c~slab-cleanup-kmem_getpages-fix	2006-04-15 01:00:53.000000000 -0700
+++ devel-akpm/mm/slab.c	2006-04-15 01:01:49.000000000 -0700
@@ -1492,6 +1492,7 @@ static void *kmem_getpages(struct kmem_c
 {
 	struct page *page;
 	int nr_pages;
+	int i;
 
 #ifndef CONFIG_MMU
 	/*
@@ -1510,10 +1511,8 @@ static void *kmem_getpages(struct kmem_c
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		atomic_add(nr_pages, &slab_reclaim_pages);
 	add_page_state(nr_slab, nr_pages);
-	while (nr_pages--) {
-		__SetPageSlab(page);
-		page++;
-	}
+	for (i = 0; i < nr_pages; i++)
+		__SetPageSlab(page + i);
 	return page_address(page);
 }
 
_

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH] slab: cleanup kmem_getpages
From: Christoph Hellwig @ 2006-04-18 23:24 UTC (permalink / raw)
  To: David Chinner; +Cc: Christoph Hellwig, akpm, linux-mm
In-Reply-To: <20060418232000.GL2732@melbourne.sgi.com>

On Wed, Apr 19, 2006 at 09:20:00AM +1000, David Chinner wrote:
> On Fri, Apr 14, 2006 at 08:36:18PM +0200, Christoph Hellwig wrote:
> > The last ifdef addition hit the ugliness treshold on this functions, so:
> > 
> >  - rename the varibale i to nr_pages so it's somewhat descriptive
> >  - remove the addr variable and do the page_address call at the very end
> >  - instead of ifdef'ing the whole alloc_pages_node call just make the
> >    __GFP_COMP addition to flags conditional
> >  - rewrite the __GFP_COMP comment to make sense
> ....
> > +	page = alloc_pages_node(nodeid, flags, cachep->gfporder);
> >  	if (!page)
> >  		return NULL;
> > -	addr = page_address(page);
> .....
> > +	while (nr_pages--) {
> >  		__SetPageSlab(page);
> >  		page++;
> >  	}
> > -	return addr;
> > +	return page_address(page);
> 
> I think that's a bug - you return the address of the page after the
> allocation, not the first page of the allocation.

You're right.  I wonder why this didn't show up in my testing.  Looks
like slab will never allocate any high-order pages if your page size
is big enough..

Andrew, please drop this for now.  I'll redo it without that bit once
I'll get some time.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH] slab: cleanup kmem_getpages
From: David Chinner @ 2006-04-18 23:20 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: akpm, linux-mm
In-Reply-To: <20060414183618.GA21144@lst.de>

On Fri, Apr 14, 2006 at 08:36:18PM +0200, Christoph Hellwig wrote:
> The last ifdef addition hit the ugliness treshold on this functions, so:
> 
>  - rename the varibale i to nr_pages so it's somewhat descriptive
>  - remove the addr variable and do the page_address call at the very end
>  - instead of ifdef'ing the whole alloc_pages_node call just make the
>    __GFP_COMP addition to flags conditional
>  - rewrite the __GFP_COMP comment to make sense
....
> +	page = alloc_pages_node(nodeid, flags, cachep->gfporder);
>  	if (!page)
>  		return NULL;
> -	addr = page_address(page);
.....
> +	while (nr_pages--) {
>  		__SetPageSlab(page);
>  		page++;
>  	}
> -	return addr;
> +	return page_address(page);

I think that's a bug - you return the address of the page after the
allocation, not the first page of the allocation.

Cheers,

Dave.
-- 
Dave Chinner
R&D Software Enginner
SGI Australian Software Group

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 0/7] [RFC] Sizing zones and holes in an architecture independent manner V3
From: Mel Gorman @ 2006-04-18 20:27 UTC (permalink / raw)
  To: Jack Steiner
  Cc: davej, tony.luck, linuxppc-dev, Linux Kernel Mailing List,
	bob.picco, ak, Linux Memory Management List
In-Reply-To: <20060418195644.GA30911@sgi.com>

On Tue, 18 Apr 2006, Jack Steiner wrote:

> On Tue, Apr 18, 2006 at 02:00:15PM +0100, Mel Gorman wrote:
>> This is V3 of the patchset to size zones and memory holes in an
> ...
>
>
>
> FYI, I applied the patches to a recent kernel & booted on a large
> SGI system.

Very cool, thanks for testing. I'm glad to see the assumptions related to 
the size of node_map held up for a large machine.

> No problem aside from what I assume is a very large number
> of debug messages.
>

The debug messages are all in patch 7/7 and will be dropped before I try 
and push this to -mm. The patch was included here in case a machine failed 
to boot so I could figure out what went wrong.

>
> ------
> Linux version 2.6.17-tony (steiner@attica) (gcc version 4.1.0 (SUSE Linux)) #5 SMP PREEMPT Tue Apr 18 14:01:54 CDT 2006
> EFI v1.10 by INTEL: SALsystab=0x6002c51df0 ACPI 2.0=0x6002c51ee0
> Number of logical nodes in system = 512
> Number of memory chunks in system = 512
> SAL 2.9: SGI SN2 version 1.10
> SAL Platform features: ITC_Drift
> SAL: AP wakeup using external interrupt vector 0x12
> No logical to physical processor mapping available
> ACPI: Local APIC address c0000000fee00000
> ACPI: Error parsing MADT - no IOSAPIC entries
> register_intr: No IOSAPIC for GSI 52
> 512 CPUs available, 512 CPUs total
> MCA related initialization done
> SGI SAL version 1.10
> add_active_range(0, 25168900, 25235456): New
> add_active_range(0, 25236375, 25419776): New
> add_active_range(0, 27262976, 27516927): New
> add_active_range(0, 29360128, 29614080): New
> add_active_range(1, 92277760, 92528640): New
> ...
> add_active_range(511, 34322242024, 34322242047): New
> add_active_range(511, 34322243072, 34322243417): New
> add_active_range(511, 34322243432, 34322243460): New
> add_active_range(511, 34322243488, 34322243500): New
> Virtual mem_map starts at 0xa0007e407d270000
> free_area_init_nodes(68719476736, 68719476736, 34322243584, 34322243584)
> free_area_init_nodes(): find_min_pfn = 25168900
> Dumping sorted node map
> entry 0: 0  25168900 -> 25235456
> entry 1: 0  25236375 -> 25419776
> entry 2: 0  27262976 -> 27516927
> entry 3: 0  29360128 -> 29614080
> entry 4: 1  92277760 -> 92528640
> entry 5: 1  94371840 -> 94625792
> entry 6: 1  96468992 -> 96722944
> ...
> entry 1536: 511	 34321989632 -> 34322242001
> entry 1537: 511	 34322242016 -> 34322242022
> entry 1538: 511	 34322242024 -> 34322242047
> entry 1539: 511	 34322243072 -> 34322243417
> entry 1540: 511	 34322243432 -> 34322243460
> entry 1541: 511	 34322243488 -> 34322243500
> __absent_pages_in_range(0, 25168900, 68719476736) = 3687320
> __absent_pages_in_range(0, 68719476736, 68719476736) = 0
> __absent_pages_in_range(0, 68719476736, 34322243584) = 0
> __absent_pages_in_range(0, 34322243584, 34322243584) = 0
> __absent_pages_in_range(0, 25168900, 68719476736) = 3687320
> ...
> __absent_pages_in_range(511, 34322243584, 34322243584) = 0
> __absent_pages_in_range(511, 25168900, 68719476736) = 3687485
> __absent_pages_in_range(511, 68719476736, 68719476736) = 0
> __absent_pages_in_range(511, 68719476736, 34322243584) = 0
> __absent_pages_in_range(511, 34322243584, 34322243584) = 0
> Built 512 zonelists
> Kernel command line: BOOT_IMAGE=net0:jfs/tonys	ro hashdist=1 dhash_entries=2097152 ihash_entries=2097152 rhash_entries=2097152 thash_entries=2097152 console=ttySG0,38400n8 kdb=on noprobe root=/dev/sda5
> PID hash table entries: 4096 (order: 12, 32768 bytes)
> Console: colour dummy device 80x25
> Memory: 4639378784k/4655642784k available (7021k code, 16279040k reserved, 4361k data, 736k init)
> 	(essentially the same numbers as with a kernel w/o the patches)
> McKinley Errata 9 workaround not needed; disabling it
>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 0/7] [RFC] Sizing zones and holes in an architecture independent manner V3
From: Jack Steiner @ 2006-04-18 19:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: davej, tony.luck, linuxppc-dev, linux-kernel, bob.picco, ak,
	linux-mm
In-Reply-To: <20060418130015.28928.10163.sendpatchset@skynet>

On Tue, Apr 18, 2006 at 02:00:15PM +0100, Mel Gorman wrote:
> This is V3 of the patchset to size zones and memory holes in an
...



FYI, I applied the patches to a recent kernel & booted on a large
SGI system. No problem aside from what I assume is a very large number
of debug messages. 


------
Linux version 2.6.17-tony (steiner@attica) (gcc version 4.1.0 (SUSE Linux)) #5 SMP PREEMPT Tue Apr 18 14:01:54 CDT 2006
EFI v1.10 by INTEL: SALsystab=0x6002c51df0 ACPI 2.0=0x6002c51ee0
Number of logical nodes in system = 512
Number of memory chunks in system = 512
SAL 2.9: SGI SN2 version 1.10
SAL Platform features: ITC_Drift
SAL: AP wakeup using external interrupt vector 0x12
No logical to physical processor mapping available
ACPI: Local APIC address c0000000fee00000
ACPI: Error parsing MADT - no IOSAPIC entries
register_intr: No IOSAPIC for GSI 52
512 CPUs available, 512 CPUs total
MCA related initialization done
SGI SAL version 1.10
add_active_range(0, 25168900, 25235456): New
add_active_range(0, 25236375, 25419776): New
add_active_range(0, 27262976, 27516927): New
add_active_range(0, 29360128, 29614080): New
add_active_range(1, 92277760, 92528640): New
...
add_active_range(511, 34322242024, 34322242047): New
add_active_range(511, 34322243072, 34322243417): New
add_active_range(511, 34322243432, 34322243460): New
add_active_range(511, 34322243488, 34322243500): New
Virtual mem_map starts at 0xa0007e407d270000
free_area_init_nodes(68719476736, 68719476736, 34322243584, 34322243584)
free_area_init_nodes(): find_min_pfn = 25168900
Dumping sorted node map
entry 0: 0  25168900 -> 25235456
entry 1: 0  25236375 -> 25419776
entry 2: 0  27262976 -> 27516927
entry 3: 0  29360128 -> 29614080
entry 4: 1  92277760 -> 92528640
entry 5: 1  94371840 -> 94625792
entry 6: 1  96468992 -> 96722944
...
entry 1536: 511	 34321989632 -> 34322242001
entry 1537: 511	 34322242016 -> 34322242022
entry 1538: 511	 34322242024 -> 34322242047
entry 1539: 511	 34322243072 -> 34322243417
entry 1540: 511	 34322243432 -> 34322243460
entry 1541: 511	 34322243488 -> 34322243500
__absent_pages_in_range(0, 25168900, 68719476736) = 3687320
__absent_pages_in_range(0, 68719476736, 68719476736) = 0
__absent_pages_in_range(0, 68719476736, 34322243584) = 0
__absent_pages_in_range(0, 34322243584, 34322243584) = 0
__absent_pages_in_range(0, 25168900, 68719476736) = 3687320
...
__absent_pages_in_range(511, 34322243584, 34322243584) = 0
__absent_pages_in_range(511, 25168900, 68719476736) = 3687485
__absent_pages_in_range(511, 68719476736, 68719476736) = 0
__absent_pages_in_range(511, 68719476736, 34322243584) = 0
__absent_pages_in_range(511, 34322243584, 34322243584) = 0
Built 512 zonelists
Kernel command line: BOOT_IMAGE=net0:jfs/tonys	ro hashdist=1 dhash_entries=2097152 ihash_entries=2097152 rhash_entries=2097152 thash_entries=2097152 console=ttySG0,38400n8 kdb=on noprobe root=/dev/sda5
PID hash table entries: 4096 (order: 12, 32768 bytes)
Console: colour dummy device 80x25
Memory: 4639378784k/4655642784k available (7021k code, 16279040k reserved, 4361k data, 736k init)
	(essentially the same numbers as with a kernel w/o the patches)
McKinley Errata 9 workaround not needed; disabling it

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Read/Write migration entries: Implement correct behavior in copy_one_pte
From: Christoph Lameter @ 2006-04-18 18:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: hugh, linux-kernel, linux-mm, akpm

Note that this is again only a partial solution. mprotect() also has the
potential of changing the write status to read. Are there any additional
occurrences? Would you check and fix this one as well?

If we cannot get to all the locations or if these fixes get too extensive
then I think we better drop the preservation of write permissions and
tolerate the occurrence of some useless COW after migration.



Migration entries with write permission must become SWP_MIGRATION_READ
entries if a COW mapping is processed. The migration entries from which
the copy is being made must also become SWP_MIGRATION_READ. This mimicks
the copying of pte for an anonymous page.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.17-rc1-mm3/mm/memory.c
===================================================================
--- linux-2.6.17-rc1-mm3.orig/mm/memory.c	2006-04-18 10:58:33.000000000 -0700
+++ linux-2.6.17-rc1-mm3/mm/memory.c	2006-04-18 11:09:23.000000000 -0700
@@ -434,7 +434,9 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
 		if (!pte_file(pte)) {
-			swap_duplicate(pte_to_swp_entry(pte));
+			swp_entry_t entry = pte_to_swp_entry(pte);
+
+			swap_duplicate(entry);
 			/* make sure dst_mm is on swapoff's mmlist. */
 			if (unlikely(list_empty(&dst_mm->mmlist))) {
 				spin_lock(&mmlist_lock);
@@ -443,6 +445,19 @@ copy_one_pte(struct mm_struct *dst_mm, s
 						 &src_mm->mmlist);
 				spin_unlock(&mmlist_lock);
 			}
+			if (is_migration_entry(entry) &&
+					is_cow_mapping(vm_flags)) {
+				page = migration_entry_to_page(entry);
+
+				/*
+				 * COW mappings require pages in both parent
+				*  and child to be set to read.
+				 */
+				entry = make_migration_entry(page,
+	`					SWP_MIGRATION_READ);
+				pte = swp_entry_to_pte(entry);
+				set_pte_at(src_mm, addr, src_pte, pte);
+			}
 		}
 		goto out_set_pte;
 	}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 5/5] Swapless V2: Revise main migration logic
From: Christoph Lameter @ 2006-04-18 16:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: akpm, hugh, linux-kernel, lee.schermerhorn, linux-mm, taka,
	marcelo.tosatti
In-Reply-To: <20060418180810.e947564c.kamezawa.hiroyu@jp.fujitsu.com>

On Tue, 18 Apr 2006, KAMEZAWA Hiroyuki wrote:

> This anon_vma->lock is just an optimization (for now) but complicated.
> I think restart discusstion against -mm3? will be better.

I agree.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH 7/7] Print out debugging information during initialisation
From: Mel Gorman @ 2006-04-18 13:02 UTC (permalink / raw)
  To: davej, tony.luck, linux-mm, ak, bob.picco, linux-kernel,
	linuxppc-dev
  Cc: Mel Gorman
In-Reply-To: <20060418130015.28928.10163.sendpatchset@skynet>

The zone and hole sizing code is new and unexpected problems showed up
on machines that were not covered by the pre-release tests. This patch
prints out useful information when those unexpected situations occur.

It is not expected that this patch become a permanent part of the set.


 mem_init.c |   51 +++++++++++++++++++++++++++++++++++++++++++++++----
 1 files changed, 47 insertions(+), 4 deletions(-)

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-106-breakout_mem_init/mm/mem_init.c linux-2.6.17-rc1-107-debug/mm/mem_init.c
--- linux-2.6.17-rc1-106-breakout_mem_init/mm/mem_init.c	2006-04-18 10:22:00.000000000 +0100
+++ linux-2.6.17-rc1-107-debug/mm/mem_init.c	2006-04-18 10:23:46.000000000 +0100
@@ -593,6 +593,7 @@ static __meminit void init_currently_emp
 }
 
 #ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+
 /* Note: nid == MAX_NUMNODES returns first region */
 static int __init first_active_region_index_in_nid(int nid)
 {
@@ -645,13 +646,24 @@ void __init free_bootmem_with_active_reg
 	for_each_active_range_index_in_nid(i, nid) {
 		unsigned long size_pages = 0;
 		unsigned long end_pfn = early_node_map[i].end_pfn;
-		if (early_node_map[i].start_pfn >= max_low_pfn)
+		if (early_node_map[i].start_pfn >= max_low_pfn) {
+			printk("start_pfn %lu >= %lu\n", early_node_map[i].start_pfn,
+								max_low_pfn);
 			continue;
+		}
 
-		if (end_pfn > max_low_pfn)
+		if (end_pfn > max_low_pfn) {
+			printk("end_pfn %lu going back to %lu\n", early_node_map[i].end_pfn,
+									max_low_pfn);
 			end_pfn = max_low_pfn;
+		}
 
 		size_pages = end_pfn - early_node_map[i].start_pfn;
+		printk("free_bootmem_node(%d, %lu, %lu) :::: pfn ranges (%d, %lu, %lu)\n",
+				early_node_map[i].nid,
+				PFN_PHYS(early_node_map[i].start_pfn),
+				PFN_PHYS(size_pages),
+				early_node_map[i].nid, early_node_map[i].start_pfn, end_pfn);
 		free_bootmem_node(NODE_DATA(early_node_map[i].nid),
 				PFN_PHYS(early_node_map[i].start_pfn),
 				PFN_PHYS(size_pages));
@@ -661,10 +673,15 @@ void __init free_bootmem_with_active_reg
 void __init sparse_memory_present_with_active_regions(int nid)
 {
 	unsigned int i;
-	for_each_active_range_index_in_nid(i, nid)
+	for_each_active_range_index_in_nid(i, nid) {
+		printk("memory_present(%d, %lu, %lu)\n",
+			early_node_map[i].nid,
+			early_node_map[i].start_pfn,
+			early_node_map[i].end_pfn);
 		memory_present(early_node_map[i].nid,
 				early_node_map[i].start_pfn,
 				early_node_map[i].end_pfn);
+	}
 }
 
 void __init get_pfn_range_for_nid(unsigned int nid,
@@ -722,6 +739,8 @@ unsigned long __init __absent_pages_in_r
 	unsigned long prev_end_pfn = 0, hole_pages = 0;
 	unsigned long start_pfn;
 
+	printk("__absent_pages_in_range(%d, %lu, %lu) = ", nid,
+					range_start_pfn, range_end_pfn);
 	/* Find the end_pfn of the first active range of pfns in the node */
 	i = first_active_region_index_in_nid(nid);
 	prev_end_pfn = early_node_map[i].start_pfn;
@@ -749,6 +768,8 @@ unsigned long __init __absent_pages_in_r
 		prev_end_pfn = early_node_map[i].end_pfn;
 	}
 
+	printk("%lu\n", hole_pages);
+
 	return hole_pages;
 }
 
@@ -911,6 +932,9 @@ void __init add_active_range(unsigned in
 {
 	unsigned int i;
 
+	printk("add_active_range(%d, %lu, %lu): ",
+			nid, start_pfn, end_pfn);
+
 	/* Merge with existing active regions if possible */
 	for (i = 0; early_node_map[i].end_pfn; i++) {
 		if (early_node_map[i].nid != nid)
@@ -918,12 +942,15 @@ void __init add_active_range(unsigned in
 
 		/* Skip if an existing region covers this new one */
 		if (start_pfn >= early_node_map[i].start_pfn &&
-				end_pfn <= early_node_map[i].end_pfn)
+				end_pfn <= early_node_map[i].end_pfn) {
+			printk("Existing\n");
 			return;
+		}
 
 		/* Merge forward if suitable */
 		if (start_pfn <= early_node_map[i].end_pfn &&
 				end_pfn > early_node_map[i].end_pfn) {
+			printk("Merging forward\n");
 			early_node_map[i].end_pfn = end_pfn;
 			return;
 		}
@@ -931,6 +958,7 @@ void __init add_active_range(unsigned in
 		/* Merge backward if suitable */
 		if (start_pfn < early_node_map[i].end_pfn &&
 				end_pfn >= early_node_map[i].start_pfn) {
+			printk("Merging backwards\n");
 			early_node_map[i].start_pfn = start_pfn;
 			return;
 		}
@@ -942,6 +970,7 @@ void __init add_active_range(unsigned in
 		return;
 	}
 
+	printk("New\n");
 	early_node_map[i].nid = nid;
 	early_node_map[i].start_pfn = start_pfn;
 	early_node_map[i].end_pfn = end_pfn;
@@ -971,6 +1000,14 @@ static void __init sort_node_map(void)
 
 	sort(early_node_map, num, sizeof(struct node_active_region),
 						cmp_node_active_region, NULL);
+
+	printk("Dumping sorted node map\n");
+	for (num = 0; early_node_map[num].end_pfn; num++) {
+		printk("entry %lu: %d  %lu -> %lu\n", num,
+				early_node_map[num].nid,
+				early_node_map[num].start_pfn,
+				early_node_map[num].end_pfn);
+	}
 }
 
 /* Find the lowest pfn for a node. This depends on a sorted early_node_map */
@@ -1012,6 +1049,10 @@ void __init free_area_init_nodes(unsigne
 	unsigned long nid;
 	int zone_index;
 
+	printk("free_area_init_nodes(%lu, %lu, %lu, %lu)\n",
+			arch_max_dma_pfn, arch_max_dma32_pfn,
+			arch_max_low_pfn, arch_max_high_pfn);
+
 	/* Record where the zone boundaries are */
 	memset(arch_zone_lowest_possible_pfn, 0,
 				sizeof(arch_zone_lowest_possible_pfn));
@@ -1028,6 +1069,8 @@ void __init free_area_init_nodes(unsigne
 			arch_zone_highest_possible_pfn[zone_index-1];
 	}
 
+	printk("free_area_init_nodes(): find_min_pfn = %lu\n", find_min_pfn_with_active_regions());
+
 	/* Regions in the early_node_map can be in any order */
 	sort_node_map();
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH 6/7] Break out memory initialisation code from page_alloc.c to mem_init.c
From: Mel Gorman @ 2006-04-18 13:02 UTC (permalink / raw)
  To: davej, tony.luck, linuxppc-dev, linux-kernel, bob.picco, ak,
	linux-mm
  Cc: Mel Gorman
In-Reply-To: <20060418130015.28928.10163.sendpatchset@skynet>

page_alloc.c contains a large amount of memory initialisation code. This patch
breaks out the initialisation code to a separate file to make page_alloc.c
a bit easier to read.


 Makefile     |    2 
 mem_init.c   | 1040 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 page_alloc.c | 1021 -----------------------------------------------------
 3 files changed, 1041 insertions(+), 1022 deletions(-)

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-105-ia64_use_init_nodes/mm/Makefile linux-2.6.17-rc1-106-breakout_mem_init/mm/Makefile
--- linux-2.6.17-rc1-105-ia64_use_init_nodes/mm/Makefile	2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-106-breakout_mem_init/mm/Makefile	2006-04-18 10:22:00.000000000 +0100
@@ -8,7 +8,7 @@ mmu-$(CONFIG_MMU)	:= fremap.o highmem.o 
 			   vmalloc.o
 
 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
-			   page_alloc.o page-writeback.o pdflush.o \
+			   page_alloc.o mem_init.o page-writeback.o pdflush.o \
 			   readahead.o swap.o truncate.o vmscan.o \
 			   prio_tree.o util.o mmzone.o $(mmu-y)
 
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-105-ia64_use_init_nodes/mm/mem_init.c linux-2.6.17-rc1-106-breakout_mem_init/mm/mem_init.c
--- linux-2.6.17-rc1-105-ia64_use_init_nodes/mm/mem_init.c	2006-04-12 20:09:44.000000000 +0100
+++ linux-2.6.17-rc1-106-breakout_mem_init/mm/mem_init.c	2006-04-18 10:22:00.000000000 +0100
@@ -0,0 +1,1040 @@
+/*
+ * mm/mem_init.c
+ * Initialises the architecture independant view of memory. pgdats, zones, etc
+ *
+ *  Copyright (C) 1991, 1992, 1993, 1994  Linus Torvalds
+ *  Copyright (C) 1995, Stephen Tweedie
+ *  Copyright (C) July 1999, Gerhard Wichert, Siemens AG
+ *  Copyright (C) 1999, Ingo Molnar, Red Hat
+ *  Copyright (C) 1999, 2000, Kanoj Sarcar, SGI
+ *  Copyright (C) Sept 2000, Martin J. Bligh
+ *	(lots of bits borrowed from Ingo Molnar & Andrew Morton)
+ *  Copyright (C) Apr 2006, Mel Gorman, IBM
+ *	(lots of bits taken from architecture-specific code)
+ */
+#include <linux/config.h>
+#include <linux/sort.h>
+#include <linux/pfn.h>
+#include <linux/mm.h>
+#include <linux/bootmem.h>
+#include <linux/cpuset.h>
+#include <linux/mempolicy.h>
+#include <linux/sysctl.h>
+#include <linux/swap.h>
+#include <linux/cpu.h>
+
+static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
+int percpu_pagelist_fraction;
+
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+  #ifdef CONFIG_MAX_ACTIVE_REGIONS
+    #define MAX_ACTIVE_REGIONS CONFIG_MAX_ACTIVE_REGIONS
+  #else
+    #define MAX_ACTIVE_REGIONS (MAX_NR_ZONES * MAX_NUMNODES + 1)
+  #endif
+
+  struct node_active_region __initdata early_node_map[MAX_ACTIVE_REGIONS];
+  unsigned long __initdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
+  unsigned long __initdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
+
+/*
+ * Builds allocation fallback zone lists.
+ *
+ * Add all populated zones of a node to the zonelist.
+ */
+static int __init build_zonelists_node(pg_data_t *pgdat,
+			struct zonelist *zonelist, int nr_zones, int zone_type)
+{
+	struct zone *zone;
+
+	BUG_ON(zone_type > ZONE_HIGHMEM);
+
+	do {
+		zone = pgdat->node_zones + zone_type;
+		if (populated_zone(zone)) {
+#ifndef CONFIG_HIGHMEM
+			BUG_ON(zone_type > ZONE_NORMAL);
+#endif
+			zonelist->zones[nr_zones++] = zone;
+			check_highest_zone(zone_type);
+		}
+		zone_type--;
+
+	} while (zone_type >= 0);
+	return nr_zones;
+}
+
+static inline int highest_zone(int zone_bits)
+{
+	int res = ZONE_NORMAL;
+	if (zone_bits & (__force int)__GFP_HIGHMEM)
+		res = ZONE_HIGHMEM;
+	if (zone_bits & (__force int)__GFP_DMA32)
+		res = ZONE_DMA32;
+	if (zone_bits & (__force int)__GFP_DMA)
+		res = ZONE_DMA;
+	return res;
+}
+
+#ifdef CONFIG_NUMA
+#define MAX_NODE_LOAD (num_online_nodes())
+static int __initdata node_load[MAX_NUMNODES];
+/**
+ * find_next_best_node - find the next node that should appear in a given node's fallback list
+ * @node: node whose fallback list we're appending
+ * @used_node_mask: nodemask_t of already used nodes
+ *
+ * We use a number of factors to determine which is the next node that should
+ * appear on a given node's fallback list.  The node should not have appeared
+ * already in @node's fallback list, and it should be the next closest node
+ * according to the distance array (which contains arbitrary distance values
+ * from each node to each node in the system), and should also prefer nodes
+ * with no CPUs, since presumably they'll have very little allocation pressure
+ * on them otherwise.
+ * It returns -1 if no node is found.
+ */
+static int __init find_next_best_node(int node, nodemask_t *used_node_mask)
+{
+	int n, val;
+	int min_val = INT_MAX;
+	int best_node = -1;
+
+	/* Use the local node if we haven't already */
+	if (!node_isset(node, *used_node_mask)) {
+		node_set(node, *used_node_mask);
+		return node;
+	}
+
+	for_each_online_node(n) {
+		cpumask_t tmp;
+
+		/* Don't want a node to appear more than once */
+		if (node_isset(n, *used_node_mask))
+			continue;
+
+		/* Use the distance array to find the distance */
+		val = node_distance(node, n);
+
+		/* Penalize nodes under us ("prefer the next node") */
+		val += (n < node);
+
+		/* Give preference to headless and unused nodes */
+		tmp = node_to_cpumask(n);
+		if (!cpus_empty(tmp))
+			val += PENALTY_FOR_NODE_WITH_CPUS;
+
+		/* Slight preference for less loaded node */
+		val *= (MAX_NODE_LOAD*MAX_NUMNODES);
+		val += node_load[n];
+
+		if (val < min_val) {
+			min_val = val;
+			best_node = n;
+		}
+	}
+
+	if (best_node >= 0)
+		node_set(best_node, *used_node_mask);
+
+	return best_node;
+}
+
+static void __init build_zonelists(pg_data_t *pgdat)
+{
+	int i, j, k, node, local_node;
+	int prev_node, load;
+	struct zonelist *zonelist;
+	nodemask_t used_mask;
+
+	/* initialize zonelists */
+	for (i = 0; i < GFP_ZONETYPES; i++) {
+		zonelist = pgdat->node_zonelists + i;
+		zonelist->zones[0] = NULL;
+	}
+
+	/* NUMA-aware ordering of nodes */
+	local_node = pgdat->node_id;
+	load = num_online_nodes();
+	prev_node = local_node;
+	nodes_clear(used_mask);
+	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
+		int distance = node_distance(local_node, node);
+
+		/*
+		 * If another node is sufficiently far away then it is better
+		 * to reclaim pages in a zone before going off node.
+		 */
+		if (distance > RECLAIM_DISTANCE)
+			zone_reclaim_mode = 1;
+
+		/*
+		 * We don't want to pressure a particular node.
+		 * So adding penalty to the first node in same
+		 * distance group to make it round-robin.
+		 */
+
+		if (distance != node_distance(local_node, prev_node))
+			node_load[node] += load;
+		prev_node = node;
+		load--;
+		for (i = 0; i < GFP_ZONETYPES; i++) {
+			zonelist = pgdat->node_zonelists + i;
+			for (j = 0; zonelist->zones[j] != NULL; j++);
+
+			k = highest_zone(i);
+
+	 		j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
+			zonelist->zones[j] = NULL;
+		}
+	}
+}
+
+#else	/* CONFIG_NUMA */
+
+static void __init build_zonelists(pg_data_t *pgdat)
+{
+	int i, j, k, node, local_node;
+
+	local_node = pgdat->node_id;
+	for (i = 0; i < GFP_ZONETYPES; i++) {
+		struct zonelist *zonelist;
+
+		zonelist = pgdat->node_zonelists + i;
+
+		j = 0;
+		k = highest_zone(i);
+ 		j = build_zonelists_node(pgdat, zonelist, j, k);
+ 		/*
+ 		 * Now we build the zonelist so that it contains the zones
+ 		 * of all the other nodes.
+ 		 * We don't want to pressure a particular node, so when
+ 		 * building the zones for node N, we make sure that the
+ 		 * zones coming right after the local ones are those from
+ 		 * node N+1 (modulo N)
+ 		 */
+		for (node = local_node + 1; node < MAX_NUMNODES; node++) {
+			if (!node_online(node))
+				continue;
+			j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
+		}
+		for (node = 0; node < local_node; node++) {
+			if (!node_online(node))
+				continue;
+			j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
+		}
+
+		zonelist->zones[j] = NULL;
+	}
+}
+
+#endif	/* CONFIG_NUMA */
+
+void __init build_all_zonelists(void)
+{
+	int i;
+
+	for_each_online_node(i)
+		build_zonelists(NODE_DATA(i));
+	printk("Built %i zonelists\n", num_online_nodes());
+	cpuset_init_current_mems_allowed();
+}
+
+/*
+ * Helper functions to size the waitqueue hash table.
+ * Essentially these want to choose hash table sizes sufficiently
+ * large so that collisions trying to wait on pages are rare.
+ * But in fact, the number of active page waitqueues on typical
+ * systems is ridiculously low, less than 200. So this is even
+ * conservative, even though it seems large.
+ *
+ * The constant PAGES_PER_WAITQUEUE specifies the ratio of pages to
+ * waitqueues, i.e. the size of the waitq table given the number of pages.
+ */
+#define PAGES_PER_WAITQUEUE	256
+
+static inline unsigned long wait_table_size(unsigned long pages)
+{
+	unsigned long size = 1;
+
+	pages /= PAGES_PER_WAITQUEUE;
+
+	while (size < pages)
+		size <<= 1;
+
+	/*
+	 * Once we have dozens or even hundreds of threads sleeping
+	 * on IO we've got bigger problems than wait queue collision.
+	 * Limit the size of the wait table to a reasonable size.
+	 */
+	size = min(size, 4096UL);
+
+	return max(size, 4UL);
+}
+
+/*
+ * This is an integer logarithm so that shifts can be used later
+ * to extract the more random high bits from the multiplicative
+ * hash function before the remainder is taken.
+ */
+static inline unsigned long wait_table_bits(unsigned long size)
+{
+	return ffz(~size);
+}
+
+#define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
+
+#ifndef __HAVE_ARCH_MEMMAP_INIT
+#define memmap_init(size, nid, zone, start_pfn) \
+	memmap_init_zone((size), (nid), (zone), (start_pfn))
+#endif
+
+/*
+ * Initially all pages are reserved - free ones are freed
+ * up by free_all_bootmem() once the early boot process is
+ * done. Non-atomic initialization, single-pass.
+ */
+void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
+		unsigned long start_pfn)
+{
+	struct page *page;
+	unsigned long end_pfn = start_pfn + size;
+	unsigned long pfn;
+
+	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+		if (!early_pfn_valid(pfn))
+			continue;
+		page = pfn_to_page(pfn);
+		set_page_links(page, zone, nid, pfn);
+		init_page_count(page);
+		reset_page_mapcount(page);
+		SetPageReserved(page);
+		INIT_LIST_HEAD(&page->lru);
+#ifdef WANT_PAGE_VIRTUAL
+		/* The shift won't overflow because ZONE_NORMAL is below 4G. */
+		if (!is_highmem_idx(zone))
+			set_page_address(page, __va(pfn << PAGE_SHIFT));
+#endif
+	}
+}
+
+void zone_init_free_lists(struct pglist_data *pgdat, struct zone *zone,
+				unsigned long size)
+{
+	int order;
+	for (order = 0; order < MAX_ORDER ; order++) {
+		INIT_LIST_HEAD(&zone->free_area[order].free_list);
+		zone->free_area[order].nr_free = 0;
+	}
+}
+
+#define ZONETABLE_INDEX(x, zone_nr)	((x << ZONES_SHIFT) | zone_nr)
+void zonetable_add(struct zone *zone, int nid, int zid, unsigned long pfn,
+		unsigned long size)
+{
+	unsigned long snum = pfn_to_section_nr(pfn);
+	unsigned long end = pfn_to_section_nr(pfn + size);
+
+	if (FLAGS_HAS_NODE)
+		zone_table[ZONETABLE_INDEX(nid, zid)] = zone;
+	else
+		for (; snum <= end; snum++)
+			zone_table[ZONETABLE_INDEX(snum, zid)] = zone;
+}
+
+static __meminit
+void zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
+{
+	int i;
+	struct pglist_data *pgdat = zone->zone_pgdat;
+
+	/*
+	 * The per-page waitqueue mechanism uses hashed waitqueues
+	 * per zone.
+	 */
+	zone->wait_table_size = wait_table_size(zone_size_pages);
+	zone->wait_table_bits =	wait_table_bits(zone->wait_table_size);
+	zone->wait_table = (wait_queue_head_t *)
+		alloc_bootmem_node(pgdat, zone->wait_table_size
+					* sizeof(wait_queue_head_t));
+
+	for(i = 0; i < zone->wait_table_size; ++i)
+		init_waitqueue_head(zone->wait_table + i);
+}
+
+/*
+ * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist
+ * to the value high for the pageset p.
+ */
+static void setup_pagelist_highmark(struct per_cpu_pageset *p,
+				unsigned long high)
+{
+	struct per_cpu_pages *pcp;
+
+	pcp = &p->pcp[0]; /* hot list */
+	pcp->high = high;
+	pcp->batch = max(1UL, high/4);
+	if ((high/4) > (PAGE_SHIFT * 8))
+		pcp->batch = PAGE_SHIFT * 8;
+}
+
+/*
+ * percpu_pagelist_fraction - changes the pcp->high for each zone on each
+ * cpu.  It is the fraction of total pages in each zone that a hot per cpu pagelist
+ * can have before it gets flushed back to buddy allocator.
+ */
+int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
+	struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+	struct zone *zone;
+	unsigned int cpu;
+	int ret;
+
+	ret = proc_dointvec_minmax(table, write, file, buffer, length, ppos);
+	if (!write || (ret == -EINVAL))
+		return ret;
+	for_each_zone(zone) {
+		for_each_online_cpu(cpu) {
+			unsigned long  high;
+			high = zone->present_pages / percpu_pagelist_fraction;
+			setup_pagelist_highmark(zone_pcp(zone, cpu), high);
+		}
+	}
+	return 0;
+}
+
+static int __cpuinit zone_batchsize(struct zone *zone)
+{
+	int batch;
+
+	/*
+	 * The per-cpu-pages pools are set to around 1000th of the
+	 * size of the zone.  But no more than 1/2 of a meg.
+	 *
+	 * OK, so we don't know how big the cache is.  So guess.
+	 */
+	batch = zone->present_pages / 1024;
+	if (batch * PAGE_SIZE > 512 * 1024)
+		batch = (512 * 1024) / PAGE_SIZE;
+	batch /= 4;		/* We effectively *= 4 below */
+	if (batch < 1)
+		batch = 1;
+
+	/*
+	 * Clamp the batch to a 2^n - 1 value. Having a power
+	 * of 2 value was found to be more likely to have
+	 * suboptimal cache aliasing properties in some cases.
+	 *
+	 * For example if 2 tasks are alternately allocating
+	 * batches of pages, one task can end up with a lot
+	 * of pages of one half of the possible page colors
+	 * and the other with pages of the other colors.
+	 */
+	batch = (1 << (fls(batch + batch/2)-1)) - 1;
+
+	return batch;
+}
+
+inline void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
+{
+	struct per_cpu_pages *pcp;
+
+	memset(p, 0, sizeof(*p));
+
+	pcp = &p->pcp[0];		/* hot */
+	pcp->count = 0;
+	pcp->high = 6 * batch;
+	pcp->batch = max(1UL, 1 * batch);
+	INIT_LIST_HEAD(&pcp->list);
+
+	pcp = &p->pcp[1];		/* cold*/
+	pcp->count = 0;
+	pcp->high = 2 * batch;
+	pcp->batch = max(1UL, batch/2);
+	INIT_LIST_HEAD(&pcp->list);
+}
+
+#ifdef CONFIG_NUMA
+/*
+ * Boot pageset table. One per cpu which is going to be used for all
+ * zones and all nodes. The parameters will be set in such a way
+ * that an item put on a list will immediately be handed over to
+ * the buddy list. This is safe since pageset manipulation is done
+ * with interrupts disabled.
+ *
+ * Some NUMA counter updates may also be caught by the boot pagesets.
+ *
+ * The boot_pagesets must be kept even after bootup is complete for
+ * unused processors and/or zones. They do play a role for bootstrapping
+ * hotplugged processors.
+ *
+ * zoneinfo_show() and maybe other functions do
+ * not check if the processor is online before following the pageset pointer.
+ * Other parts of the kernel may not check if the zone is available.
+ */
+static struct per_cpu_pageset boot_pageset[NR_CPUS];
+
+/*
+ * Dynamically allocate memory for the
+ * per cpu pageset array in struct zone.
+ */
+static int __cpuinit process_zones(int cpu)
+{
+	struct zone *zone, *dzone;
+
+	for_each_zone(zone) {
+
+		zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
+					 GFP_KERNEL, cpu_to_node(cpu));
+		if (!zone_pcp(zone, cpu))
+			goto bad;
+
+		setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
+
+		if (percpu_pagelist_fraction)
+			setup_pagelist_highmark(zone_pcp(zone, cpu),
+			 	(zone->present_pages / percpu_pagelist_fraction));
+	}
+
+	return 0;
+bad:
+	for_each_zone(dzone) {
+		if (dzone == zone)
+			break;
+		kfree(zone_pcp(dzone, cpu));
+		zone_pcp(dzone, cpu) = NULL;
+	}
+	return -ENOMEM;
+}
+
+static inline void free_zone_pagesets(int cpu)
+{
+	struct zone *zone;
+
+	for_each_zone(zone) {
+		struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
+
+		zone_pcp(zone, cpu) = NULL;
+		kfree(pset);
+	}
+}
+
+static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
+		unsigned long action,
+		void *hcpu)
+{
+	int cpu = (long)hcpu;
+	int ret = NOTIFY_OK;
+
+	switch (action) {
+		case CPU_UP_PREPARE:
+			if (process_zones(cpu))
+				ret = NOTIFY_BAD;
+			break;
+		case CPU_UP_CANCELED:
+		case CPU_DEAD:
+			free_zone_pagesets(cpu);
+			break;
+		default:
+			break;
+	}
+	return ret;
+}
+
+static struct notifier_block pageset_notifier =
+	{ &pageset_cpuup_callback, NULL, 0 };
+
+void __init setup_per_cpu_pageset(void)
+{
+	int err;
+
+	/* Initialize per_cpu_pageset for cpu 0.
+	 * A cpuup callback will do this for every cpu
+	 * as it comes online
+	 */
+	err = process_zones(smp_processor_id());
+	BUG_ON(err);
+	register_cpu_notifier(&pageset_notifier);
+}
+#endif
+
+static __meminit void zone_pcp_init(struct zone *zone)
+{
+	int cpu;
+	unsigned long batch = zone_batchsize(zone);
+
+	for (cpu = 0; cpu < NR_CPUS; cpu++) {
+#ifdef CONFIG_NUMA
+		/* Early boot. Slab allocator not functional yet */
+		zone_pcp(zone, cpu) = &boot_pageset[cpu];
+		setup_pageset(&boot_pageset[cpu],0);
+#else
+		setup_pageset(zone_pcp(zone,cpu), batch);
+#endif
+	}
+	if (zone->present_pages)
+		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
+			zone->name, zone->present_pages, batch);
+}
+
+static __meminit void init_currently_empty_zone(struct zone *zone,
+		unsigned long zone_start_pfn, unsigned long size)
+{
+	struct pglist_data *pgdat = zone->zone_pgdat;
+
+	zone_wait_table_init(zone, size);
+	pgdat->nr_zones = zone_idx(zone) + 1;
+
+	zone->zone_start_pfn = zone_start_pfn;
+
+	memmap_init(size, pgdat->node_id, zone_idx(zone), zone_start_pfn);
+
+	zone_init_free_lists(pgdat, zone, zone->spanned_pages);
+}
+
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+/* Note: nid == MAX_NUMNODES returns first region */
+static int __init first_active_region_index_in_nid(int nid)
+{
+	int i;
+	for (i = 0; early_node_map[i].end_pfn; i++) {
+		if (nid == MAX_NUMNODES || early_node_map[i].nid == nid)
+			return i;
+	}
+
+	return MAX_ACTIVE_REGIONS;
+}
+
+/* Note: nid == MAX_NUMNODES returns next region */
+static int __init next_active_region_index_in_nid(unsigned int index, int nid)
+{
+	for (index = index + 1; early_node_map[index].end_pfn; index++) {
+		if (nid == MAX_NUMNODES || early_node_map[index].nid == nid)
+			return index;
+	}
+
+	return MAX_ACTIVE_REGIONS;
+}
+
+#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
+int __init early_pfn_to_nid(unsigned long pfn)
+{
+	int i;
+
+	for (i = 0; early_node_map[i].end_pfn; i++) {
+		unsigned long start_pfn = early_node_map[i].start_pfn;
+		unsigned long end_pfn = early_node_map[i].end_pfn;
+
+		if ((start_pfn <= pfn) && (pfn < end_pfn))
+			return early_node_map[i].nid;
+	}
+
+	return -1;
+}
+#endif
+
+#define for_each_active_range_index_in_nid(i, nid) \
+	for (i = first_active_region_index_in_nid(nid); \
+				i != MAX_ACTIVE_REGIONS; \
+				i = next_active_region_index_in_nid(i, nid))
+
+void __init free_bootmem_with_active_regions(int nid,
+						unsigned long max_low_pfn)
+{
+	unsigned int i;
+	for_each_active_range_index_in_nid(i, nid) {
+		unsigned long size_pages = 0;
+		unsigned long end_pfn = early_node_map[i].end_pfn;
+		if (early_node_map[i].start_pfn >= max_low_pfn)
+			continue;
+
+		if (end_pfn > max_low_pfn)
+			end_pfn = max_low_pfn;
+
+		size_pages = end_pfn - early_node_map[i].start_pfn;
+		free_bootmem_node(NODE_DATA(early_node_map[i].nid),
+				PFN_PHYS(early_node_map[i].start_pfn),
+				PFN_PHYS(size_pages));
+	}
+}
+
+void __init sparse_memory_present_with_active_regions(int nid)
+{
+	unsigned int i;
+	for_each_active_range_index_in_nid(i, nid)
+		memory_present(early_node_map[i].nid,
+				early_node_map[i].start_pfn,
+				early_node_map[i].end_pfn);
+}
+
+void __init get_pfn_range_for_nid(unsigned int nid,
+			unsigned long *start_pfn, unsigned long *end_pfn)
+{
+	unsigned int i;
+	*start_pfn = -1UL;
+	*end_pfn = 0;
+
+	for_each_active_range_index_in_nid(i, nid) {
+		if (early_node_map[i].start_pfn < *start_pfn)
+			*start_pfn = early_node_map[i].start_pfn;
+
+		if (early_node_map[i].end_pfn > *end_pfn)
+			*end_pfn = early_node_map[i].end_pfn;
+	}
+
+	if (*start_pfn == -1UL) {
+		printk(KERN_WARNING "Node %u active with no memory\n", nid);
+		*start_pfn = 0;
+	}
+}
+
+unsigned long __init zone_present_pages_in_node(int nid,
+					unsigned long zone_type,
+					unsigned long *ignored)
+{
+	unsigned long node_start_pfn, node_end_pfn;
+	unsigned long zone_start_pfn, zone_end_pfn;
+
+	/* Get the start and end of the node and zone */
+	get_pfn_range_for_nid(nid, &node_start_pfn, &node_end_pfn);
+	zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
+	zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
+
+	/* Check that this node has pages within the zone's required range */
+	if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn)
+		return 0;
+
+	/* Move the zone boundaries inside the node if necessary */
+	if (zone_end_pfn > node_end_pfn)
+		zone_end_pfn = node_end_pfn;
+	if (zone_start_pfn < node_start_pfn)
+		zone_start_pfn = node_start_pfn;
+
+	/* Return the spanned pages */
+	return zone_end_pfn - zone_start_pfn;
+}
+
+unsigned long __init __absent_pages_in_range(int nid,
+				unsigned long range_start_pfn,
+				unsigned long range_end_pfn)
+{
+	int i = 0;
+	unsigned long prev_end_pfn = 0, hole_pages = 0;
+	unsigned long start_pfn;
+
+	/* Find the end_pfn of the first active range of pfns in the node */
+	i = first_active_region_index_in_nid(nid);
+	prev_end_pfn = early_node_map[i].start_pfn;
+
+	/* Find all holes for the zone within the node */
+	for (; i != MAX_ACTIVE_REGIONS;
+			i = next_active_region_index_in_nid(i, nid)) {
+
+		/* No need to continue if prev_end_pfn is outside the zone */
+		if (prev_end_pfn >= range_end_pfn)
+			break;
+
+		/* Make sure the end of the zone is not within the hole */
+		start_pfn = early_node_map[i].start_pfn;
+		if (start_pfn > range_end_pfn)
+			start_pfn = range_end_pfn;
+		if (prev_end_pfn < range_start_pfn)
+			prev_end_pfn = range_start_pfn;
+
+		/* Update the hole size cound and move on */
+		if (start_pfn > range_start_pfn) {
+			BUG_ON(prev_end_pfn > start_pfn);
+			hole_pages += start_pfn - prev_end_pfn;
+		}
+		prev_end_pfn = early_node_map[i].end_pfn;
+	}
+
+	return hole_pages;
+}
+
+unsigned long __init absent_pages_in_range(unsigned long start_pfn,
+							unsigned long end_pfn)
+{
+	return __absent_pages_in_range(MAX_NUMNODES, start_pfn, end_pfn);
+}
+
+unsigned long __init zone_absent_pages_in_node(int nid,
+					unsigned long zone_type,
+					unsigned long *ignored)
+{
+	return __absent_pages_in_range(nid,
+				arch_zone_lowest_possible_pfn[zone_type],
+				arch_zone_highest_possible_pfn[zone_type]);
+}
+#else
+static inline unsigned long zone_present_pages_in_node(int nid,
+					unsigned long zone_type,
+					unsigned long *zones_size)
+{
+	return zones_size[zone_type];
+}
+
+static inline unsigned long zone_absent_pages_in_node(int nid,
+						unsigned long zone_type,
+						unsigned long *zholes_size)
+{
+	if (!zholes_size)
+		return 0;
+
+	return zholes_size[zone_type];
+}
+#endif
+
+static void __init calculate_node_totalpages(struct pglist_data *pgdat,
+		unsigned long *zones_size, unsigned long *zholes_size)
+{
+	unsigned long realtotalpages, totalpages = 0;
+	int i;
+
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		totalpages += zone_present_pages_in_node(pgdat->node_id, i,
+								zones_size);
+	}
+	pgdat->node_spanned_pages = totalpages;
+
+	realtotalpages = totalpages;
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		realtotalpages -=
+			zone_absent_pages_in_node(pgdat->node_id, i, zholes_size);
+	}
+	pgdat->node_present_pages = realtotalpages;
+	printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id,
+							realtotalpages);
+}
+
+/*
+ * Set up the zone data structures:
+ *   - mark all pages reserved
+ *   - mark all memory queues empty
+ *   - clear the memory bitmaps
+ */
+static void __init free_area_init_core(struct pglist_data *pgdat,
+		unsigned long *zones_size, unsigned long *zholes_size)
+{
+	unsigned long j;
+	int nid = pgdat->node_id;
+	unsigned long zone_start_pfn = pgdat->node_start_pfn;
+
+	pgdat_resize_init(pgdat);
+	pgdat->nr_zones = 0;
+	init_waitqueue_head(&pgdat->kswapd_wait);
+	pgdat->kswapd_max_order = 0;
+
+	for (j = 0; j < MAX_NR_ZONES; j++) {
+		struct zone *zone = pgdat->node_zones + j;
+		unsigned long size, realsize;
+
+		size = zone_present_pages_in_node(nid, j, zones_size);
+		realsize = size - zone_absent_pages_in_node(nid, j,
+								zholes_size);
+		if (j < ZONE_HIGHMEM)
+			nr_kernel_pages += realsize;
+		nr_all_pages += realsize;
+
+		zone->spanned_pages = size;
+		zone->present_pages = realsize;
+		zone->name = zone_names[j];
+		spin_lock_init(&zone->lock);
+		spin_lock_init(&zone->lru_lock);
+		zone_seqlock_init(zone);
+		zone->zone_pgdat = pgdat;
+		zone->free_pages = 0;
+
+		zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
+
+		zone_pcp_init(zone);
+		INIT_LIST_HEAD(&zone->active_list);
+		INIT_LIST_HEAD(&zone->inactive_list);
+		zone->nr_scan_active = 0;
+		zone->nr_scan_inactive = 0;
+		zone->nr_active = 0;
+		zone->nr_inactive = 0;
+		atomic_set(&zone->reclaim_in_progress, 0);
+		if (!size)
+			continue;
+
+		zonetable_add(zone, nid, j, zone_start_pfn, size);
+		init_currently_empty_zone(zone, zone_start_pfn, size);
+		zone_start_pfn += size;
+	}
+}
+
+static void __init alloc_node_mem_map(struct pglist_data *pgdat)
+{
+	/* Skip empty nodes */
+	if (!pgdat->node_spanned_pages)
+		return;
+
+#ifdef CONFIG_FLAT_NODE_MEM_MAP
+	/* ia64 gets its own node_mem_map, before this, without bootmem */
+	if (!pgdat->node_mem_map) {
+		unsigned long size;
+		struct page *map;
+
+		size = (pgdat->node_spanned_pages + 1) * sizeof(struct page);
+		map = alloc_remap(pgdat->node_id, size);
+		if (!map)
+			map = alloc_bootmem_node(pgdat, size);
+		pgdat->node_mem_map = map;
+	}
+#ifdef CONFIG_FLATMEM
+	/*
+	 * With no DISCONTIG, the global mem_map is just set as node 0's
+	 */
+	if (pgdat == NODE_DATA(0))
+		mem_map = NODE_DATA(0)->node_mem_map;
+#endif
+#endif /* CONFIG_FLAT_NODE_MEM_MAP */
+}
+
+void __init free_area_init_node(int nid, struct pglist_data *pgdat,
+		unsigned long *zones_size, unsigned long node_start_pfn,
+		unsigned long *zholes_size)
+{
+	pgdat->node_id = nid;
+	pgdat->node_start_pfn = node_start_pfn;
+	calculate_node_totalpages(pgdat, zones_size, zholes_size);
+
+	alloc_node_mem_map(pgdat);
+
+	free_area_init_core(pgdat, zones_size, zholes_size);
+}
+
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+void __init add_active_range(unsigned int nid, unsigned long start_pfn,
+						unsigned long end_pfn)
+{
+	unsigned int i;
+
+	/* Merge with existing active regions if possible */
+	for (i = 0; early_node_map[i].end_pfn; i++) {
+		if (early_node_map[i].nid != nid)
+			continue;
+
+		/* Skip if an existing region covers this new one */
+		if (start_pfn >= early_node_map[i].start_pfn &&
+				end_pfn <= early_node_map[i].end_pfn)
+			return;
+
+		/* Merge forward if suitable */
+		if (start_pfn <= early_node_map[i].end_pfn &&
+				end_pfn > early_node_map[i].end_pfn) {
+			early_node_map[i].end_pfn = end_pfn;
+			return;
+		}
+
+		/* Merge backward if suitable */
+		if (start_pfn < early_node_map[i].end_pfn &&
+				end_pfn >= early_node_map[i].start_pfn) {
+			early_node_map[i].start_pfn = start_pfn;
+			return;
+		}
+	}
+
+	/* Leave last entry NULL, we use range.end_pfn to terminate the walk */
+	if (i >= MAX_ACTIVE_REGIONS - 1) {
+		printk(KERN_ERR "Too many memory regions, truncating\n");
+		return;
+	}
+
+	early_node_map[i].nid = nid;
+	early_node_map[i].start_pfn = start_pfn;
+	early_node_map[i].end_pfn = end_pfn;
+}
+
+/* Compare two active node_active_regions */
+static int __init cmp_node_active_region(const void *a, const void *b)
+{
+	struct node_active_region *arange = (struct node_active_region *)a;
+	struct node_active_region *brange = (struct node_active_region *)b;
+
+	/* Done this way to avoid overflows */
+	if (arange->start_pfn > brange->start_pfn)
+		return 1;
+	if (arange->start_pfn < brange->start_pfn)
+		return -1;
+
+	return 0;
+}
+
+/* sort the node_map by start_pfn */
+static void __init sort_node_map(void)
+{
+	size_t num = 0;
+	while (early_node_map[num].end_pfn)
+		num++;
+
+	sort(early_node_map, num, sizeof(struct node_active_region),
+						cmp_node_active_region, NULL);
+}
+
+/* Find the lowest pfn for a node. This depends on a sorted early_node_map */
+unsigned long __init find_min_pfn_for_node(unsigned long nid)
+{
+	int i;
+
+	/* Assuming a sorted map, the first range found has the starting pfn */
+	for_each_active_range_index_in_nid(i, nid)
+		return early_node_map[i].start_pfn;
+
+	/* nid does not exist in early_node_map */
+	printk(KERN_WARNING "Could not find start_pfn for node %lu\n", nid);
+	return 0;
+}
+
+
+unsigned long __init find_min_pfn_with_active_regions(void)
+{
+	return find_min_pfn_for_node(MAX_NUMNODES);
+}
+
+unsigned long __init find_max_pfn_with_active_regions(void)
+{
+	int i;
+	unsigned long max_pfn = -1UL;
+
+	for (i = 0; early_node_map[i].end_pfn; i++)
+		max_pfn = max(max_pfn, early_node_map[i].start_pfn);
+
+	return max_pfn;
+}
+
+void __init free_area_init_nodes(unsigned long arch_max_dma_pfn,
+				unsigned long arch_max_dma32_pfn,
+				unsigned long arch_max_low_pfn,
+				unsigned long arch_max_high_pfn)
+{
+	unsigned long nid;
+	int zone_index;
+
+	/* Record where the zone boundaries are */
+	memset(arch_zone_lowest_possible_pfn, 0,
+				sizeof(arch_zone_lowest_possible_pfn));
+	memset(arch_zone_highest_possible_pfn, 0,
+				sizeof(arch_zone_highest_possible_pfn));
+	arch_zone_lowest_possible_pfn[ZONE_DMA] =
+					find_min_pfn_with_active_regions();
+	arch_zone_highest_possible_pfn[ZONE_DMA] = arch_max_dma_pfn;
+	arch_zone_highest_possible_pfn[ZONE_DMA32] = arch_max_dma32_pfn;
+	arch_zone_highest_possible_pfn[ZONE_NORMAL] = arch_max_low_pfn;
+	arch_zone_highest_possible_pfn[ZONE_HIGHMEM] = arch_max_high_pfn;
+	for (zone_index = 1; zone_index < MAX_NR_ZONES; zone_index++) {
+		arch_zone_lowest_possible_pfn[zone_index] = 
+			arch_zone_highest_possible_pfn[zone_index-1];
+	}
+
+	/* Regions in the early_node_map can be in any order */
+	sort_node_map();
+
+	for_each_online_node(nid) {
+		pg_data_t *pgdat = NODE_DATA(nid);
+		free_area_init_node(nid, pgdat, NULL,
+				find_min_pfn_for_node(nid), NULL);
+	}
+}
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-105-ia64_use_init_nodes/mm/page_alloc.c linux-2.6.17-rc1-106-breakout_mem_init/mm/page_alloc.c
--- linux-2.6.17-rc1-105-ia64_use_init_nodes/mm/page_alloc.c	2006-04-18 10:17:49.000000000 +0100
+++ linux-2.6.17-rc1-106-breakout_mem_init/mm/page_alloc.c	2006-04-18 10:22:55.000000000 +0100
@@ -37,8 +37,6 @@
 #include <linux/nodemask.h>
 #include <linux/vmalloc.h>
 #include <linux/mempolicy.h>
-#include <linux/sort.h>
-#include <linux/pfn.h>
 
 #include <asm/tlbflush.h>
 #include "internal.h"
@@ -54,7 +52,6 @@ EXPORT_SYMBOL(node_possible_map);
 unsigned long totalram_pages __read_mostly;
 unsigned long totalhigh_pages __read_mostly;
 long nr_swap_pages;
-int percpu_pagelist_fraction;
 
 static void __free_pages_ok(struct page *page, unsigned int order);
 
@@ -80,24 +77,11 @@ EXPORT_SYMBOL(totalram_pages);
 struct zone *zone_table[1 << ZONETABLE_SHIFT] __read_mostly;
 EXPORT_SYMBOL(zone_table);
 
-static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
 int min_free_kbytes = 1024;
 
 unsigned long __initdata nr_kernel_pages;
 unsigned long __initdata nr_all_pages;
 
-#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
-  #ifdef CONFIG_MAX_ACTIVE_REGIONS
-    #define MAX_ACTIVE_REGIONS CONFIG_MAX_ACTIVE_REGIONS
-  #else
-    #define MAX_ACTIVE_REGIONS (MAX_NR_ZONES * MAX_NUMNODES + 1)
-  #endif
-
-  struct node_active_region __initdata early_node_map[MAX_ACTIVE_REGIONS];
-  unsigned long __initdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
-  unsigned long __initdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
-#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
-
 #ifdef CONFIG_DEBUG_VM
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 {
@@ -1511,985 +1495,6 @@ void show_free_areas(void)
 	show_swap_cache_info();
 }
 
-/*
- * Builds allocation fallback zone lists.
- *
- * Add all populated zones of a node to the zonelist.
- */
-static int __init build_zonelists_node(pg_data_t *pgdat,
-			struct zonelist *zonelist, int nr_zones, int zone_type)
-{
-	struct zone *zone;
-
-	BUG_ON(zone_type > ZONE_HIGHMEM);
-
-	do {
-		zone = pgdat->node_zones + zone_type;
-		if (populated_zone(zone)) {
-#ifndef CONFIG_HIGHMEM
-			BUG_ON(zone_type > ZONE_NORMAL);
-#endif
-			zonelist->zones[nr_zones++] = zone;
-			check_highest_zone(zone_type);
-		}
-		zone_type--;
-
-	} while (zone_type >= 0);
-	return nr_zones;
-}
-
-static inline int highest_zone(int zone_bits)
-{
-	int res = ZONE_NORMAL;
-	if (zone_bits & (__force int)__GFP_HIGHMEM)
-		res = ZONE_HIGHMEM;
-	if (zone_bits & (__force int)__GFP_DMA32)
-		res = ZONE_DMA32;
-	if (zone_bits & (__force int)__GFP_DMA)
-		res = ZONE_DMA;
-	return res;
-}
-
-#ifdef CONFIG_NUMA
-#define MAX_NODE_LOAD (num_online_nodes())
-static int __initdata node_load[MAX_NUMNODES];
-/**
- * find_next_best_node - find the next node that should appear in a given node's fallback list
- * @node: node whose fallback list we're appending
- * @used_node_mask: nodemask_t of already used nodes
- *
- * We use a number of factors to determine which is the next node that should
- * appear on a given node's fallback list.  The node should not have appeared
- * already in @node's fallback list, and it should be the next closest node
- * according to the distance array (which contains arbitrary distance values
- * from each node to each node in the system), and should also prefer nodes
- * with no CPUs, since presumably they'll have very little allocation pressure
- * on them otherwise.
- * It returns -1 if no node is found.
- */
-static int __init find_next_best_node(int node, nodemask_t *used_node_mask)
-{
-	int n, val;
-	int min_val = INT_MAX;
-	int best_node = -1;
-
-	/* Use the local node if we haven't already */
-	if (!node_isset(node, *used_node_mask)) {
-		node_set(node, *used_node_mask);
-		return node;
-	}
-
-	for_each_online_node(n) {
-		cpumask_t tmp;
-
-		/* Don't want a node to appear more than once */
-		if (node_isset(n, *used_node_mask))
-			continue;
-
-		/* Use the distance array to find the distance */
-		val = node_distance(node, n);
-
-		/* Penalize nodes under us ("prefer the next node") */
-		val += (n < node);
-
-		/* Give preference to headless and unused nodes */
-		tmp = node_to_cpumask(n);
-		if (!cpus_empty(tmp))
-			val += PENALTY_FOR_NODE_WITH_CPUS;
-
-		/* Slight preference for less loaded node */
-		val *= (MAX_NODE_LOAD*MAX_NUMNODES);
-		val += node_load[n];
-
-		if (val < min_val) {
-			min_val = val;
-			best_node = n;
-		}
-	}
-
-	if (best_node >= 0)
-		node_set(best_node, *used_node_mask);
-
-	return best_node;
-}
-
-static void __init build_zonelists(pg_data_t *pgdat)
-{
-	int i, j, k, node, local_node;
-	int prev_node, load;
-	struct zonelist *zonelist;
-	nodemask_t used_mask;
-
-	/* initialize zonelists */
-	for (i = 0; i < GFP_ZONETYPES; i++) {
-		zonelist = pgdat->node_zonelists + i;
-		zonelist->zones[0] = NULL;
-	}
-
-	/* NUMA-aware ordering of nodes */
-	local_node = pgdat->node_id;
-	load = num_online_nodes();
-	prev_node = local_node;
-	nodes_clear(used_mask);
-	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
-		int distance = node_distance(local_node, node);
-
-		/*
-		 * If another node is sufficiently far away then it is better
-		 * to reclaim pages in a zone before going off node.
-		 */
-		if (distance > RECLAIM_DISTANCE)
-			zone_reclaim_mode = 1;
-
-		/*
-		 * We don't want to pressure a particular node.
-		 * So adding penalty to the first node in same
-		 * distance group to make it round-robin.
-		 */
-
-		if (distance != node_distance(local_node, prev_node))
-			node_load[node] += load;
-		prev_node = node;
-		load--;
-		for (i = 0; i < GFP_ZONETYPES; i++) {
-			zonelist = pgdat->node_zonelists + i;
-			for (j = 0; zonelist->zones[j] != NULL; j++);
-
-			k = highest_zone(i);
-
-	 		j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
-			zonelist->zones[j] = NULL;
-		}
-	}
-}
-
-#else	/* CONFIG_NUMA */
-
-static void __init build_zonelists(pg_data_t *pgdat)
-{
-	int i, j, k, node, local_node;
-
-	local_node = pgdat->node_id;
-	for (i = 0; i < GFP_ZONETYPES; i++) {
-		struct zonelist *zonelist;
-
-		zonelist = pgdat->node_zonelists + i;
-
-		j = 0;
-		k = highest_zone(i);
- 		j = build_zonelists_node(pgdat, zonelist, j, k);
- 		/*
- 		 * Now we build the zonelist so that it contains the zones
- 		 * of all the other nodes.
- 		 * We don't want to pressure a particular node, so when
- 		 * building the zones for node N, we make sure that the
- 		 * zones coming right after the local ones are those from
- 		 * node N+1 (modulo N)
- 		 */
-		for (node = local_node + 1; node < MAX_NUMNODES; node++) {
-			if (!node_online(node))
-				continue;
-			j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
-		}
-		for (node = 0; node < local_node; node++) {
-			if (!node_online(node))
-				continue;
-			j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
-		}
-
-		zonelist->zones[j] = NULL;
-	}
-}
-
-#endif	/* CONFIG_NUMA */
-
-void __init build_all_zonelists(void)
-{
-	int i;
-
-	for_each_online_node(i)
-		build_zonelists(NODE_DATA(i));
-	printk("Built %i zonelists\n", num_online_nodes());
-	cpuset_init_current_mems_allowed();
-}
-
-/*
- * Helper functions to size the waitqueue hash table.
- * Essentially these want to choose hash table sizes sufficiently
- * large so that collisions trying to wait on pages are rare.
- * But in fact, the number of active page waitqueues on typical
- * systems is ridiculously low, less than 200. So this is even
- * conservative, even though it seems large.
- *
- * The constant PAGES_PER_WAITQUEUE specifies the ratio of pages to
- * waitqueues, i.e. the size of the waitq table given the number of pages.
- */
-#define PAGES_PER_WAITQUEUE	256
-
-static inline unsigned long wait_table_size(unsigned long pages)
-{
-	unsigned long size = 1;
-
-	pages /= PAGES_PER_WAITQUEUE;
-
-	while (size < pages)
-		size <<= 1;
-
-	/*
-	 * Once we have dozens or even hundreds of threads sleeping
-	 * on IO we've got bigger problems than wait queue collision.
-	 * Limit the size of the wait table to a reasonable size.
-	 */
-	size = min(size, 4096UL);
-
-	return max(size, 4UL);
-}
-
-/*
- * This is an integer logarithm so that shifts can be used later
- * to extract the more random high bits from the multiplicative
- * hash function before the remainder is taken.
- */
-static inline unsigned long wait_table_bits(unsigned long size)
-{
-	return ffz(~size);
-}
-
-#define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
-
-/*
- * Initially all pages are reserved - free ones are freed
- * up by free_all_bootmem() once the early boot process is
- * done. Non-atomic initialization, single-pass.
- */
-void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
-		unsigned long start_pfn)
-{
-	struct page *page;
-	unsigned long end_pfn = start_pfn + size;
-	unsigned long pfn;
-
-	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
-		if (!early_pfn_valid(pfn))
-			continue;
-		page = pfn_to_page(pfn);
-		set_page_links(page, zone, nid, pfn);
-		init_page_count(page);
-		reset_page_mapcount(page);
-		SetPageReserved(page);
-		INIT_LIST_HEAD(&page->lru);
-#ifdef WANT_PAGE_VIRTUAL
-		/* The shift won't overflow because ZONE_NORMAL is below 4G. */
-		if (!is_highmem_idx(zone))
-			set_page_address(page, __va(pfn << PAGE_SHIFT));
-#endif
-	}
-}
-
-void zone_init_free_lists(struct pglist_data *pgdat, struct zone *zone,
-				unsigned long size)
-{
-	int order;
-	for (order = 0; order < MAX_ORDER ; order++) {
-		INIT_LIST_HEAD(&zone->free_area[order].free_list);
-		zone->free_area[order].nr_free = 0;
-	}
-}
-
-#define ZONETABLE_INDEX(x, zone_nr)	((x << ZONES_SHIFT) | zone_nr)
-void zonetable_add(struct zone *zone, int nid, int zid, unsigned long pfn,
-		unsigned long size)
-{
-	unsigned long snum = pfn_to_section_nr(pfn);
-	unsigned long end = pfn_to_section_nr(pfn + size);
-
-	if (FLAGS_HAS_NODE)
-		zone_table[ZONETABLE_INDEX(nid, zid)] = zone;
-	else
-		for (; snum <= end; snum++)
-			zone_table[ZONETABLE_INDEX(snum, zid)] = zone;
-}
-
-#ifndef __HAVE_ARCH_MEMMAP_INIT
-#define memmap_init(size, nid, zone, start_pfn) \
-	memmap_init_zone((size), (nid), (zone), (start_pfn))
-#endif
-
-static int __cpuinit zone_batchsize(struct zone *zone)
-{
-	int batch;
-
-	/*
-	 * The per-cpu-pages pools are set to around 1000th of the
-	 * size of the zone.  But no more than 1/2 of a meg.
-	 *
-	 * OK, so we don't know how big the cache is.  So guess.
-	 */
-	batch = zone->present_pages / 1024;
-	if (batch * PAGE_SIZE > 512 * 1024)
-		batch = (512 * 1024) / PAGE_SIZE;
-	batch /= 4;		/* We effectively *= 4 below */
-	if (batch < 1)
-		batch = 1;
-
-	/*
-	 * Clamp the batch to a 2^n - 1 value. Having a power
-	 * of 2 value was found to be more likely to have
-	 * suboptimal cache aliasing properties in some cases.
-	 *
-	 * For example if 2 tasks are alternately allocating
-	 * batches of pages, one task can end up with a lot
-	 * of pages of one half of the possible page colors
-	 * and the other with pages of the other colors.
-	 */
-	batch = (1 << (fls(batch + batch/2)-1)) - 1;
-
-	return batch;
-}
-
-inline void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
-{
-	struct per_cpu_pages *pcp;
-
-	memset(p, 0, sizeof(*p));
-
-	pcp = &p->pcp[0];		/* hot */
-	pcp->count = 0;
-	pcp->high = 6 * batch;
-	pcp->batch = max(1UL, 1 * batch);
-	INIT_LIST_HEAD(&pcp->list);
-
-	pcp = &p->pcp[1];		/* cold*/
-	pcp->count = 0;
-	pcp->high = 2 * batch;
-	pcp->batch = max(1UL, batch/2);
-	INIT_LIST_HEAD(&pcp->list);
-}
-
-/*
- * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist
- * to the value high for the pageset p.
- */
-
-static void setup_pagelist_highmark(struct per_cpu_pageset *p,
-				unsigned long high)
-{
-	struct per_cpu_pages *pcp;
-
-	pcp = &p->pcp[0]; /* hot list */
-	pcp->high = high;
-	pcp->batch = max(1UL, high/4);
-	if ((high/4) > (PAGE_SHIFT * 8))
-		pcp->batch = PAGE_SHIFT * 8;
-}
-
-
-#ifdef CONFIG_NUMA
-/*
- * Boot pageset table. One per cpu which is going to be used for all
- * zones and all nodes. The parameters will be set in such a way
- * that an item put on a list will immediately be handed over to
- * the buddy list. This is safe since pageset manipulation is done
- * with interrupts disabled.
- *
- * Some NUMA counter updates may also be caught by the boot pagesets.
- *
- * The boot_pagesets must be kept even after bootup is complete for
- * unused processors and/or zones. They do play a role for bootstrapping
- * hotplugged processors.
- *
- * zoneinfo_show() and maybe other functions do
- * not check if the processor is online before following the pageset pointer.
- * Other parts of the kernel may not check if the zone is available.
- */
-static struct per_cpu_pageset boot_pageset[NR_CPUS];
-
-/*
- * Dynamically allocate memory for the
- * per cpu pageset array in struct zone.
- */
-static int __cpuinit process_zones(int cpu)
-{
-	struct zone *zone, *dzone;
-
-	for_each_zone(zone) {
-
-		zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
-					 GFP_KERNEL, cpu_to_node(cpu));
-		if (!zone_pcp(zone, cpu))
-			goto bad;
-
-		setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
-
-		if (percpu_pagelist_fraction)
-			setup_pagelist_highmark(zone_pcp(zone, cpu),
-			 	(zone->present_pages / percpu_pagelist_fraction));
-	}
-
-	return 0;
-bad:
-	for_each_zone(dzone) {
-		if (dzone == zone)
-			break;
-		kfree(zone_pcp(dzone, cpu));
-		zone_pcp(dzone, cpu) = NULL;
-	}
-	return -ENOMEM;
-}
-
-static inline void free_zone_pagesets(int cpu)
-{
-	struct zone *zone;
-
-	for_each_zone(zone) {
-		struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
-
-		zone_pcp(zone, cpu) = NULL;
-		kfree(pset);
-	}
-}
-
-static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
-		unsigned long action,
-		void *hcpu)
-{
-	int cpu = (long)hcpu;
-	int ret = NOTIFY_OK;
-
-	switch (action) {
-		case CPU_UP_PREPARE:
-			if (process_zones(cpu))
-				ret = NOTIFY_BAD;
-			break;
-		case CPU_UP_CANCELED:
-		case CPU_DEAD:
-			free_zone_pagesets(cpu);
-			break;
-		default:
-			break;
-	}
-	return ret;
-}
-
-static struct notifier_block pageset_notifier =
-	{ &pageset_cpuup_callback, NULL, 0 };
-
-void __init setup_per_cpu_pageset(void)
-{
-	int err;
-
-	/* Initialize per_cpu_pageset for cpu 0.
-	 * A cpuup callback will do this for every cpu
-	 * as it comes online
-	 */
-	err = process_zones(smp_processor_id());
-	BUG_ON(err);
-	register_cpu_notifier(&pageset_notifier);
-}
-
-#endif
-
-static __meminit
-void zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
-{
-	int i;
-	struct pglist_data *pgdat = zone->zone_pgdat;
-
-	/*
-	 * The per-page waitqueue mechanism uses hashed waitqueues
-	 * per zone.
-	 */
-	zone->wait_table_size = wait_table_size(zone_size_pages);
-	zone->wait_table_bits =	wait_table_bits(zone->wait_table_size);
-	zone->wait_table = (wait_queue_head_t *)
-		alloc_bootmem_node(pgdat, zone->wait_table_size
-					* sizeof(wait_queue_head_t));
-
-	for(i = 0; i < zone->wait_table_size; ++i)
-		init_waitqueue_head(zone->wait_table + i);
-}
-
-static __meminit void zone_pcp_init(struct zone *zone)
-{
-	int cpu;
-	unsigned long batch = zone_batchsize(zone);
-
-	for (cpu = 0; cpu < NR_CPUS; cpu++) {
-#ifdef CONFIG_NUMA
-		/* Early boot. Slab allocator not functional yet */
-		zone_pcp(zone, cpu) = &boot_pageset[cpu];
-		setup_pageset(&boot_pageset[cpu],0);
-#else
-		setup_pageset(zone_pcp(zone,cpu), batch);
-#endif
-	}
-	if (zone->present_pages)
-		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
-			zone->name, zone->present_pages, batch);
-}
-
-static __meminit void init_currently_empty_zone(struct zone *zone,
-		unsigned long zone_start_pfn, unsigned long size)
-{
-	struct pglist_data *pgdat = zone->zone_pgdat;
-
-	zone_wait_table_init(zone, size);
-	pgdat->nr_zones = zone_idx(zone) + 1;
-
-	zone->zone_start_pfn = zone_start_pfn;
-
-	memmap_init(size, pgdat->node_id, zone_idx(zone), zone_start_pfn);
-
-	zone_init_free_lists(pgdat, zone, zone->spanned_pages);
-}
-
-#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
-/* Note: nid == MAX_NUMNODES returns first region */
-static int __init first_active_region_index_in_nid(int nid)
-{
-	int i;
-	for (i = 0; early_node_map[i].end_pfn; i++) {
-		if (nid == MAX_NUMNODES || early_node_map[i].nid == nid)
-			return i;
-	}
-
-	return MAX_ACTIVE_REGIONS;
-}
-
-/* Note: nid == MAX_NUMNODES returns next region */
-static int __init next_active_region_index_in_nid(unsigned int index, int nid)
-{
-	for (index = index + 1; early_node_map[index].end_pfn; index++) {
-		if (nid == MAX_NUMNODES || early_node_map[index].nid == nid)
-			return index;
-	}
-
-	return MAX_ACTIVE_REGIONS;
-}
-
-#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
-int __init early_pfn_to_nid(unsigned long pfn)
-{
-	int i;
-
-	for (i = 0; early_node_map[i].end_pfn; i++) {
-		unsigned long start_pfn = early_node_map[i].start_pfn;
-		unsigned long end_pfn = early_node_map[i].end_pfn;
-
-		if ((start_pfn <= pfn) && (pfn < end_pfn))
-			return early_node_map[i].nid;
-	}
-
-	return -1;
-}
-#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
-
-#define for_each_active_range_index_in_nid(i, nid) \
-	for (i = first_active_region_index_in_nid(nid); \
-				i != MAX_ACTIVE_REGIONS; \
-				i = next_active_region_index_in_nid(i, nid))
-
-void __init free_bootmem_with_active_regions(int nid,
-						unsigned long max_low_pfn)
-{
-	unsigned int i;
-	for_each_active_range_index_in_nid(i, nid) {
-		unsigned long size_pages = 0;
-		unsigned long end_pfn = early_node_map[i].end_pfn;
-		if (early_node_map[i].start_pfn >= max_low_pfn)
-			continue;
-
-		if (end_pfn > max_low_pfn)
-			end_pfn = max_low_pfn;
-
-		size_pages = end_pfn - early_node_map[i].start_pfn;
-		free_bootmem_node(NODE_DATA(early_node_map[i].nid),
-				PFN_PHYS(early_node_map[i].start_pfn),
-				PFN_PHYS(size_pages));
-	}
-}
-
-void __init sparse_memory_present_with_active_regions(int nid)
-{
-	unsigned int i;
-	for_each_active_range_index_in_nid(i, nid)
-		memory_present(early_node_map[i].nid,
-				early_node_map[i].start_pfn,
-				early_node_map[i].end_pfn);
-}
-
-void __init get_pfn_range_for_nid(unsigned int nid,
-			unsigned long *start_pfn, unsigned long *end_pfn)
-{
-	unsigned int i;
-	*start_pfn = -1UL;
-	*end_pfn = 0;
-
-	for_each_active_range_index_in_nid(i, nid) {
-		if (early_node_map[i].start_pfn < *start_pfn)
-			*start_pfn = early_node_map[i].start_pfn;
-
-		if (early_node_map[i].end_pfn > *end_pfn)
-			*end_pfn = early_node_map[i].end_pfn;
-	}
-
-	if (*start_pfn == -1UL) {
-		printk(KERN_WARNING "Node %u active with no memory\n", nid);
-		*start_pfn = 0;
-	}
-}
-
-unsigned long __init zone_present_pages_in_node(int nid,
-					unsigned long zone_type,
-					unsigned long *ignored)
-{
-	unsigned long node_start_pfn, node_end_pfn;
-	unsigned long zone_start_pfn, zone_end_pfn;
-
-	/* Get the start and end of the node and zone */
-	get_pfn_range_for_nid(nid, &node_start_pfn, &node_end_pfn);
-	zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
-	zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
-
-	/* Check that this node has pages within the zone's required range */
-	if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn)
-		return 0;
-
-	/* Move the zone boundaries inside the node if necessary */
-	if (zone_end_pfn > node_end_pfn)
-		zone_end_pfn = node_end_pfn;
-	if (zone_start_pfn < node_start_pfn)
-		zone_start_pfn = node_start_pfn;
-
-	/* Return the spanned pages */
-	return zone_end_pfn - zone_start_pfn;
-}
-
-unsigned long __init __absent_pages_in_range(int nid,
-				unsigned long range_start_pfn,
-				unsigned long range_end_pfn)
-{
-	int i = 0;
-	unsigned long prev_end_pfn = 0, hole_pages = 0;
-	unsigned long start_pfn;
-
-	/* Find the end_pfn of the first active range of pfns in the node */
-	i = first_active_region_index_in_nid(nid);
-	prev_end_pfn = early_node_map[i].start_pfn;
-
-	/* Find all holes for the zone within the node */
-	for (; i != MAX_ACTIVE_REGIONS;
-			i = next_active_region_index_in_nid(i, nid)) {
-
-		/* No need to continue if prev_end_pfn is outside the zone */
-		if (prev_end_pfn >= range_end_pfn)
-			break;
-
-		/* Make sure the end of the zone is not within the hole */
-		start_pfn = early_node_map[i].start_pfn;
-		if (start_pfn > range_end_pfn)
-			start_pfn = range_end_pfn;
-		if (prev_end_pfn < range_start_pfn)
-			prev_end_pfn = range_start_pfn;
-
-		/* Update the hole size cound and move on */
-		if (start_pfn > range_start_pfn) {
-			BUG_ON(prev_end_pfn > start_pfn);
-			hole_pages += start_pfn - prev_end_pfn;
-		}
-		prev_end_pfn = early_node_map[i].end_pfn;
-	}
-
-	return hole_pages;
-}
-
-unsigned long __init absent_pages_in_range(unsigned long start_pfn,
-							unsigned long end_pfn)
-{
-	return __absent_pages_in_range(MAX_NUMNODES, start_pfn, end_pfn);
-}
-
-unsigned long __init zone_absent_pages_in_node(int nid,
-					unsigned long zone_type,
-					unsigned long *ignored)
-{
-	return __absent_pages_in_range(nid,
-				arch_zone_lowest_possible_pfn[zone_type],
-				arch_zone_highest_possible_pfn[zone_type]);
-}
-#else
-static inline unsigned long zone_present_pages_in_node(int nid,
-					unsigned long zone_type,
-					unsigned long *zones_size)
-{
-	return zones_size[zone_type];
-}
-
-static inline unsigned long zone_absent_pages_in_node(int nid,
-						unsigned long zone_type,
-						unsigned long *zholes_size)
-{
-	if (!zholes_size)
-		return 0;
-
-	return zholes_size[zone_type];
-}
-#endif
-
-static void __init calculate_node_totalpages(struct pglist_data *pgdat,
-		unsigned long *zones_size, unsigned long *zholes_size)
-{
-	unsigned long realtotalpages, totalpages = 0;
-	int i;
-
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		totalpages += zone_present_pages_in_node(pgdat->node_id, i,
-								zones_size);
-	}
-	pgdat->node_spanned_pages = totalpages;
-
-	realtotalpages = totalpages;
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		realtotalpages -=
-			zone_absent_pages_in_node(pgdat->node_id, i, zholes_size);
-	}
-	pgdat->node_present_pages = realtotalpages;
-	printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id,
-							realtotalpages);
-}
-
-/*
- * Set up the zone data structures:
- *   - mark all pages reserved
- *   - mark all memory queues empty
- *   - clear the memory bitmaps
- */
-static void __init free_area_init_core(struct pglist_data *pgdat,
-		unsigned long *zones_size, unsigned long *zholes_size)
-{
-	unsigned long j;
-	int nid = pgdat->node_id;
-	unsigned long zone_start_pfn = pgdat->node_start_pfn;
-
-	pgdat_resize_init(pgdat);
-	pgdat->nr_zones = 0;
-	init_waitqueue_head(&pgdat->kswapd_wait);
-	pgdat->kswapd_max_order = 0;
-	
-	for (j = 0; j < MAX_NR_ZONES; j++) {
-		struct zone *zone = pgdat->node_zones + j;
-		unsigned long size, realsize;
-
-		size = zone_present_pages_in_node(nid, j, zones_size);
-		realsize = size - zone_absent_pages_in_node(nid, j,
-								zholes_size);
-		if (j < ZONE_HIGHMEM)
-			nr_kernel_pages += realsize;
-		nr_all_pages += realsize;
-
-		zone->spanned_pages = size;
-		zone->present_pages = realsize;
-		zone->name = zone_names[j];
-		spin_lock_init(&zone->lock);
-		spin_lock_init(&zone->lru_lock);
-		zone_seqlock_init(zone);
-		zone->zone_pgdat = pgdat;
-		zone->free_pages = 0;
-
-		zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
-
-		zone_pcp_init(zone);
-		INIT_LIST_HEAD(&zone->active_list);
-		INIT_LIST_HEAD(&zone->inactive_list);
-		zone->nr_scan_active = 0;
-		zone->nr_scan_inactive = 0;
-		zone->nr_active = 0;
-		zone->nr_inactive = 0;
-		atomic_set(&zone->reclaim_in_progress, 0);
-		if (!size)
-			continue;
-
-		zonetable_add(zone, nid, j, zone_start_pfn, size);
-		init_currently_empty_zone(zone, zone_start_pfn, size);
-		zone_start_pfn += size;
-	}
-}
-
-static void __init alloc_node_mem_map(struct pglist_data *pgdat)
-{
-	/* Skip empty nodes */
-	if (!pgdat->node_spanned_pages)
-		return;
-
-#ifdef CONFIG_FLAT_NODE_MEM_MAP
-	/* ia64 gets its own node_mem_map, before this, without bootmem */
-	if (!pgdat->node_mem_map) {
-		unsigned long size;
-		struct page *map;
-
-		size = (pgdat->node_spanned_pages + 1) * sizeof(struct page);
-		map = alloc_remap(pgdat->node_id, size);
-		if (!map)
-			map = alloc_bootmem_node(pgdat, size);
-		pgdat->node_mem_map = map;
-	}
-#ifdef CONFIG_FLATMEM
-	/*
-	 * With no DISCONTIG, the global mem_map is just set as node 0's
-	 */
-	if (pgdat == NODE_DATA(0))
-		mem_map = NODE_DATA(0)->node_mem_map;
-#endif
-#endif /* CONFIG_FLAT_NODE_MEM_MAP */
-}
-
-void __init free_area_init_node(int nid, struct pglist_data *pgdat,
-		unsigned long *zones_size, unsigned long node_start_pfn,
-		unsigned long *zholes_size)
-{
-	pgdat->node_id = nid;
-	pgdat->node_start_pfn = node_start_pfn;
-	calculate_node_totalpages(pgdat, zones_size, zholes_size);
-
-	alloc_node_mem_map(pgdat);
-
-	free_area_init_core(pgdat, zones_size, zholes_size);
-}
-
-#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
-void __init add_active_range(unsigned int nid, unsigned long start_pfn,
-						unsigned long end_pfn)
-{
-	unsigned int i;
-
-	/* Merge with existing active regions if possible */
-	for (i = 0; early_node_map[i].end_pfn; i++) {
-		if (early_node_map[i].nid != nid)
-			continue;
-
-		/* Skip if an existing region covers this new one */
-		if (start_pfn >= early_node_map[i].start_pfn &&
-				end_pfn <= early_node_map[i].end_pfn)
-			return;
-
-		/* Merge forward if suitable */
-		if (start_pfn <= early_node_map[i].end_pfn &&
-				end_pfn > early_node_map[i].end_pfn) {
-			early_node_map[i].end_pfn = end_pfn;
-			return;
-		}
-
-		/* Merge backward if suitable */
-		if (start_pfn < early_node_map[i].end_pfn &&
-				end_pfn >= early_node_map[i].start_pfn) {
-			early_node_map[i].start_pfn = start_pfn;
-			return;
-		}
-	}
-
-	/* Leave last entry NULL, we use range.end_pfn to terminate the walk */
-	if (i >= MAX_ACTIVE_REGIONS - 1) {
-		printk(KERN_ERR "Too many memory regions, truncating\n");
-		return;
-	}
-
-	early_node_map[i].nid = nid;
-	early_node_map[i].start_pfn = start_pfn;
-	early_node_map[i].end_pfn = end_pfn;
-}
-
-/* Compare two active node_active_regions */
-static int __init cmp_node_active_region(const void *a, const void *b)
-{
-	struct node_active_region *arange = (struct node_active_region *)a;
-	struct node_active_region *brange = (struct node_active_region *)b;
-
-	/* Done this way to avoid overflows */
-	if (arange->start_pfn > brange->start_pfn)
-		return 1;
-	if (arange->start_pfn < brange->start_pfn)
-		return -1;
-
-	return 0;
-}
-
-/* sort the node_map by start_pfn */
-static void __init sort_node_map(void)
-{
-	size_t num = 0;
-	while (early_node_map[num].end_pfn)
-		num++;
-
-	sort(early_node_map, num, sizeof(struct node_active_region),
-						cmp_node_active_region, NULL);
-}
-
-/* Find the lowest pfn for a node. This depends on a sorted early_node_map */
-unsigned long __init find_min_pfn_for_node(unsigned long nid)
-{
-	int i;
-
-	/* Assuming a sorted map, the first range found has the starting pfn */
-	for_each_active_range_index_in_nid(i, nid)
-		return early_node_map[i].start_pfn;
-
-	/* nid does not exist in early_node_map */
-	printk(KERN_WARNING "Could not find start_pfn for node %lu\n", nid);
-	return 0;
-}
-
-
-unsigned long __init find_min_pfn_with_active_regions(void)
-{
-	return find_min_pfn_for_node(MAX_NUMNODES);
-}
-
-unsigned long __init find_max_pfn_with_active_regions(void)
-{
-	int i;
-	unsigned long max_pfn = -1UL;
-
-	for (i = 0; early_node_map[i].end_pfn; i++)
-		max_pfn = max(max_pfn, early_node_map[i].start_pfn);
-
-	return max_pfn;
-}
-
-void __init free_area_init_nodes(unsigned long arch_max_dma_pfn,
-				unsigned long arch_max_dma32_pfn,
-				unsigned long arch_max_low_pfn,
-				unsigned long arch_max_high_pfn)
-{
-	unsigned long nid;
-	int zone_index;
-
-	/* Record where the zone boundaries are */
-	memset(arch_zone_lowest_possible_pfn, 0,
-				sizeof(arch_zone_lowest_possible_pfn));
-	memset(arch_zone_highest_possible_pfn, 0,
-				sizeof(arch_zone_highest_possible_pfn));
-	arch_zone_lowest_possible_pfn[ZONE_DMA] =
-					find_min_pfn_with_active_regions();
-	arch_zone_highest_possible_pfn[ZONE_DMA] = arch_max_dma_pfn;
-	arch_zone_highest_possible_pfn[ZONE_DMA32] = arch_max_dma32_pfn;
-	arch_zone_highest_possible_pfn[ZONE_NORMAL] = arch_max_low_pfn;
-	arch_zone_highest_possible_pfn[ZONE_HIGHMEM] = arch_max_high_pfn;
-	for (zone_index = 1; zone_index < MAX_NR_ZONES; zone_index++) {
-		arch_zone_lowest_possible_pfn[zone_index] =
-			arch_zone_highest_possible_pfn[zone_index-1];
-	}
-
-	/* Regions in the early_node_map can be in any order */
-	sort_node_map();
-
-	for_each_online_node(nid) {
-		pg_data_t *pgdat = NODE_DATA(nid);
-		free_area_init_node(nid, pgdat, NULL,
-				find_min_pfn_for_node(nid), NULL);
-	}
-}
-#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
-
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 static bootmem_data_t contig_bootmem_data;
 struct pglist_data contig_page_data = { .bdata = &contig_bootmem_data };
@@ -2972,32 +1977,6 @@ int lowmem_reserve_ratio_sysctl_handler(
 	return 0;
 }
 
-/*
- * percpu_pagelist_fraction - changes the pcp->high for each zone on each
- * cpu.  It is the fraction of total pages in each zone that a hot per cpu pagelist
- * can have before it gets flushed back to buddy allocator.
- */
-
-int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
-	struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
-{
-	struct zone *zone;
-	unsigned int cpu;
-	int ret;
-
-	ret = proc_dointvec_minmax(table, write, file, buffer, length, ppos);
-	if (!write || (ret == -EINVAL))
-		return ret;
-	for_each_zone(zone) {
-		for_each_online_cpu(cpu) {
-			unsigned long  high;
-			high = zone->present_pages / percpu_pagelist_fraction;
-			setup_pagelist_highmark(zone_pcp(zone, cpu), high);
-		}
-	}
-	return 0;
-}
-
 __initdata int hashdist = HASHDIST_DEFAULT;
 
 #ifdef CONFIG_NUMA

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox