* Re: [Discontig-devel] Re: [Linux-ia64] discontigmem patch for 2.4.20
@ 2003-03-07 8:57 Kimi Suganuma
2003-03-07 19:36 ` Bjorn Helgaas
2003-03-12 6:59 ` Kimi Suganuma
0 siblings, 2 replies; 3+ messages in thread
From: Kimi Suganuma @ 2003-03-07 8:57 UTC (permalink / raw)
To: linux-ia64
Hi Bjorn,
Thank you for your consideration. I'll fix all the warnings.
As for the problems on the HP machines, I suspect the cause is that
the current CONFIG_NUMA kernel does not work correctly on systems
that need the VIRTUAL_MEM_MAP functionality.
I had assumed that the CONFIG_NUMA kernel did not have to work on
such systems, since you can build a working kernel simply by turning
off CONFIG_NUMA at configuration time.
However, it would be better if a CONFIG_NUMA kernel worked on all
types of machines, so I'll try to find a solution for this issue.
Best Regards,
Kimi
On Thu, 6 Mar 2003 11:31:25 -0700
Bjorn Helgaas <bjorn_helgaas@hp.com> wrote:
> > I backported the IA64 discontigmem functionality in 2.5 to 2.4.20.
> > I tested the patch on an 8-way Itanium2 NUMA server with
> > a NUMA kernel and an SMP kernel.
> >
> > David, Bjorn, please let me know whether there is any possibility
> > of taking this patch into the ia64 patch for 2.4.
>
> I'm in the process of merging this patch, but when I build a
> "generic" kernel (with CONFIG_NUMA=y and CONFIG_DISCONTIGMEM=y),
> I get many new warnings:
>
> /home/helgaas/bk/testing/include/asm/mmzone.h:62:21: warning: "virt_to_page" redefined
> /home/helgaas/bk/testing/include/asm/page.h:57:1: warning: this is the location of the previous definition
> /home/helgaas/bk/testing/include/asm/mmzone.h:71:21: warning: "page_to_phys" redefined
> /home/helgaas/bk/testing/include/asm/page.h:58:1: warning: this is the location of the previous definition
> /home/helgaas/bk/testing/include/linux/mmzone.h:229:21: warning: "numa_node_id" redefined
> /home/helgaas/bk/testing/include/asm/processor.h:206:1: warning: this is the location of the previous definition
>
> Could you look into these and send a new patch to correct them?
> Also, the resulting kernel doesn't boot (it MCAs) on HP rx2600
> and zx2000. I'll look into it in my spare time, but you can
> probably do so more efficiently.
>
> A small patch that applies on top of the previous patch would be
> easiest.
>
> Bjorn
--
suganuma <suganuma@hpc.bs1.fc.nec.co.jp>
* Re: [Discontig-devel] Re: [Linux-ia64] discontigmem patch for 2.4.20
2003-03-07 8:57 [Discontig-devel] Re: [Linux-ia64] discontigmem patch for 2.4.20 Kimi Suganuma
@ 2003-03-07 19:36 ` Bjorn Helgaas
2003-03-12 6:59 ` Kimi Suganuma
1 sibling, 0 replies; 3+ messages in thread
From: Bjorn Helgaas @ 2003-03-07 19:36 UTC (permalink / raw)
To: linux-ia64
> As for the problems on the HP machines, I suspect the cause is that
> the current CONFIG_NUMA kernel does not work correctly on systems
> that need the VIRTUAL_MEM_MAP functionality.
Ah, it looks like you enforce this in the 2.5 config files:
config VIRTUAL_MEM_MAP
bool "Enable Virtual Mem Map"
depends on !NUMA
Unfortunately, 2.4 doesn't have VIRTUAL_MEM_MAP as a config
option (it's enabled automatically if needed), so we can't do
that directly.
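One way to express the same constraint in 2.4 would be to promote VIRTUAL_MEM_MAP to an explicit option in arch/ia64/config.in and guard it on NUMA being off. This is only a sketch in the 2.4 Config.in language: CONFIG_VIRTUAL_MEM_MAP does not exist as a 2.4 option today, so the symbol and placement here are hypothetical.

```
# Hypothetical fragment for arch/ia64/config.in -- mirrors the 2.5
# "depends on !NUMA" rule using 2.4's shell-style config language.
if [ "$CONFIG_NUMA" != "y" ]; then
   bool '  Enable Virtual Mem Map' CONFIG_VIRTUAL_MEM_MAP
fi
```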
> ... it would be better if a CONFIG_NUMA kernel worked on all
> types of machines, so I'll try to find a solution for this issue.
That would be ideal. Seems like there ought to be opportunities
for unification of DISCONTIGMEM, VIRTUAL_MEM_MAP, and parts of
NUMA. The current scheme feels a little clunky, especially for
configuring a "generic" kernel. (It sounds like it isn't even
possible to configure a kernel generic enough for both SGI SNx
and HP zx1 boxes.)
Bjorn
* Re: [Discontig-devel] Re: [Linux-ia64] discontigmem patch for 2.4.20
2003-03-07 8:57 [Discontig-devel] Re: [Linux-ia64] discontigmem patch for 2.4.20 Kimi Suganuma
2003-03-07 19:36 ` Bjorn Helgaas
@ 2003-03-12 6:59 ` Kimi Suganuma
1 sibling, 0 replies; 3+ messages in thread
From: Kimi Suganuma @ 2003-03-12 6:59 UTC (permalink / raw)
To: linux-ia64
[-- Attachment #1: Type: text/plain, Size: 1303 bytes --]
Hi Bjorn, Erich,
I fixed a bug in the cpu_to_node_map[] handling and eliminated
as many of the warnings as I could. Now I'll start to think about
merging VIRTUAL_MEM_MAP and NUMA. :-)
Regards,
Kimi
On Fri, 7 Mar 2003 12:36:01 -0700
Bjorn Helgaas <bjorn_helgaas@hp.com> wrote:
> > As for the problems on the HP machines, I suspect the cause is that
> > the current CONFIG_NUMA kernel does not work correctly on systems
> > that need the VIRTUAL_MEM_MAP functionality.
>
> Ah, it looks like you enforce this in the 2.5 config files:
>
> config VIRTUAL_MEM_MAP
> bool "Enable Virtual Mem Map"
> depends on !NUMA
>
> Unfortunately, 2.4 doesn't have VIRTUAL_MEM_MAP as a config
> option (it's enabled automatically if needed), so we can't do
> that directly.
>
> > ... it would be better if a CONFIG_NUMA kernel worked on all
> > types of machines, so I'll try to find a solution for this issue.
>
> That would be ideal. Seems like there ought to be opportunities
> for unification of DISCONTIGMEM, VIRTUAL_MEM_MAP, and parts of
> NUMA. The current scheme feels a little clunky, especially for
> configuring a "generic" kernel. (It sounds like it isn't even
> possible to configure a kernel generic enough for both SGI SNx
> and HP zx1 boxes.)
>
> Bjorn
>
--
suganuma <suganuma@hpc.bs1.fc.nec.co.jp>
[-- Attachment #2: discontig-2.4.20-030312.patch --]
[-- Type: application/octet-stream, Size: 66528 bytes --]
diff -Nur linux-2.4.20-base/arch/ia64/config.in linux-2.4.20-dcm/arch/ia64/config.in
--- linux-2.4.20-base/arch/ia64/config.in Mon Mar 3 10:24:21 2003
+++ linux-2.4.20-dcm/arch/ia64/config.in Mon Mar 3 10:55:12 2003
@@ -66,6 +66,14 @@
fi
if [ "$CONFIG_IA64_GENERIC" = "y" -o "$CONFIG_IA64_DIG" = "y" -o "$CONFIG_IA64_HP_ZX1" = "y" ]; then
+ bool ' Enable NUMA support' CONFIG_NUMA
+ if [ "$CONFIG_NUMA" = "y" ]; then
+ define_bool CONFIG_DISCONTIGMEM y
+ choice 'Maximum Memory per NUMA Node' \
+ "16GB CONFIG_IA64_NODESIZE_16GB \
+ 64GB CONFIG_IA64_NODESIZE_64GB \
+ 256GB CONFIG_IA64_NODESIZE_256GB" 16GB
+ fi
bool ' Enable IA-64 Machine Check Abort' CONFIG_IA64_MCA
define_bool CONFIG_PM y
fi
diff -Nur linux-2.4.20-base/arch/ia64/kernel/acpi.c linux-2.4.20-dcm/arch/ia64/kernel/acpi.c
--- linux-2.4.20-base/arch/ia64/kernel/acpi.c Mon Mar 3 10:24:21 2003
+++ linux-2.4.20-dcm/arch/ia64/kernel/acpi.c Wed Mar 12 13:36:18 2003
@@ -8,6 +8,9 @@
* Copyright (C) 2000 Intel Corp.
* Copyright (C) 2000,2001 J.I. Lee <jung-ik.lee@intel.com>
* Copyright (C) 2001 Paul Diefenbaugh <paul.s.diefenbaugh@intel.com>
+ * Copyright (C) 2001 Jenna Hall <jenna.s.hall@intel.com>
+ * Copyright (C) 2001 Takayoshi Kochi <t-kouchi@cq.jp.nec.com>
+ * Copyright (C) 2002 Erich Focht <efocht@ess.nec.de>
*
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*
@@ -38,11 +41,13 @@
#include <linux/irq.h>
#include <linux/acpi.h>
#include <linux/efi.h>
+#include <linux/mm.h>
#include <asm/io.h>
#include <asm/iosapic.h>
#include <asm/machvec.h>
#include <asm/page.h>
#include <asm/system.h>
+#include <asm/numa.h>
#define PREFIX "ACPI: "
@@ -222,7 +227,7 @@
acpi_status
acpi_hp_csr_space(acpi_handle obj, u64 *csr_base, u64 *csr_length)
{
- int i, offset = 0;
+ int offset = 0;
acpi_status status;
acpi_buffer buf = { .length = ACPI_ALLOCATE_BUFFER,
.pointer = NULL };
@@ -559,6 +564,191 @@
}
+#ifdef CONFIG_ACPI_NUMA
+
+#define SLIT_DEBUG
+
+#define PXM_FLAG_LEN ((MAX_PXM_DOMAINS + 1)/32)
+
+static int __initdata srat_num_cpus; /* number of cpus */
+static u32 __initdata pxm_flag[PXM_FLAG_LEN];
+#define pxm_bit_set(bit) (set_bit(bit,(void *)pxm_flag))
+#define pxm_bit_test(bit) (test_bit(bit,(void *)pxm_flag))
+/* maps to convert between proximity domain and logical node ID */
+int __initdata pxm_to_nid_map[MAX_PXM_DOMAINS];
+int __initdata nid_to_pxm_map[NR_NODES];
+static struct acpi_table_slit __initdata *slit_table;
+
+/*
+ * ACPI 2.0 SLIT (System Locality Information Table)
+ * http://devresource.hp.com/devresource/Docs/TechPapers/IA64/slit.pdf
+ */
+void __init
+acpi_numa_slit_init (struct acpi_table_slit *slit)
+{
+ u32 len;
+
+ len = sizeof(struct acpi_table_header) + 8
+ + slit->localities * slit->localities;
+ if (slit->header.length != len) {
+ printk("ACPI 2.0 SLIT: size mismatch: %d expected, %d actual\n",
+ len, slit->header.length);
+ memset(numa_slit, 10, sizeof(numa_slit));
+ return;
+ }
+ slit_table = slit;
+}
+
+void __init
+acpi_numa_processor_affinity_init (struct acpi_table_processor_affinity *pa)
+{
+ /* record this node in proximity bitmap */
+ pxm_bit_set(pa->proximity_domain);
+
+ node_cpuid[srat_num_cpus].phys_id = (pa->apic_id << 8) | (pa->lsapic_eid);
+ /* nid should be overridden as logical node id later */
+ node_cpuid[srat_num_cpus].nid = pa->proximity_domain;
+ srat_num_cpus++;
+}
+
+void __init
+acpi_numa_memory_affinity_init (struct acpi_table_memory_affinity *ma)
+{
+ unsigned long paddr, size, hole_size, min_hole_size;
+ u8 pxm;
+ struct node_memblk_s *p, *q, *pend;
+
+ pxm = ma->proximity_domain;
+
+ /* fill node memory chunk structure */
+ paddr = ma->base_addr_hi;
+ paddr = (paddr << 32) | ma->base_addr_lo;
+ size = ma->length_hi;
+ size = (size << 32) | ma->length_lo;
+
+ if (num_memblks >= NR_MEMBLKS) {
+ printk("Too many mem chunks in SRAT. Ignoring %ld MBytes at %lx\n",
+ size/(1024*1024), paddr);
+ return;
+ }
+
+ /* Ignore disabled entries */
+ if (!ma->flags.enabled)
+ return;
+
+ /*
+ * When the chunk is not the first one in the node, check distance
+ * from the other chunks. When the hole is too huge ignore the chunk.
+ * This restriction should be removed when multiple chunks per node
+ * is supported.
+ */
+ pend = &node_memblk[num_memblks];
+ min_hole_size = 0;
+ for (p = &node_memblk[0]; p < pend; p++) {
+ if (p->nid != pxm)
+ continue;
+ if (p->start_paddr < paddr)
+ hole_size = paddr - (p->start_paddr + p->size);
+ else
+ hole_size = p->start_paddr - (paddr + size);
+
+ if (!min_hole_size || hole_size < min_hole_size)
+ min_hole_size = hole_size;
+ }
+
+#if 0 /* test */
+ if (min_hole_size) {
+ if (min_hole_size > size) {
+ printk("Too huge memory hole. Ignoring %ld MBytes at %lx\n",
+ size/(1024*1024), paddr);
+ return;
+ }
+ }
+#endif
+
+ /* record this node in proximity bitmap */
+ pxm_bit_set(pxm);
+
+ /* Insertion sort based on base address */
+ pend = &node_memblk[num_memblks];
+ for (p = &node_memblk[0]; p < pend; p++) {
+ if (paddr < p->start_paddr)
+ break;
+ }
+ if (p < pend) {
+ for (q = pend; q >= p; q--)
+ *(q + 1) = *q;
+ }
+ p->start_paddr = paddr;
+ p->size = size;
+ p->nid = pxm;
+ num_memblks++;
+}
+
+void __init
+acpi_numa_arch_fixup(void)
+{
+ int i, j, node_from, node_to;
+
+ /* calculate total number of nodes in system from PXM bitmap */
+ numnodes = 0; /* init total nodes in system */
+
+ memset(pxm_to_nid_map, -1, sizeof(pxm_to_nid_map));
+ memset(nid_to_pxm_map, -1, sizeof(nid_to_pxm_map));
+ for (i = 0; i < MAX_PXM_DOMAINS; i++) {
+ if (pxm_bit_test(i)) {
+ pxm_to_nid_map[i] = numnodes;
+ nid_to_pxm_map[numnodes++] = i;
+ }
+ }
+
+ /* set logical node id in memory chunk structure */
+ for (i = 0; i < num_memblks; i++)
+ node_memblk[i].nid = pxm_to_nid_map[node_memblk[i].nid];
+
+ /* assign memory bank numbers for each chunk on each node */
+ for (i = 0; i < numnodes; i++) {
+ int bank;
+
+ bank = 0;
+ for (j = 0; j < num_memblks; j++)
+ if (node_memblk[j].nid == i)
+ node_memblk[j].bank = bank++;
+ }
+
+ /* set logical node id in cpu structure */
+ for (i = 0; i < srat_num_cpus; i++)
+ node_cpuid[i].nid = pxm_to_nid_map[node_cpuid[i].nid];
+
+ printk("Number of logical nodes in system = %d\n", numnodes);
+ printk("Number of memory chunks in system = %d\n", num_memblks);
+
+ if (!slit_table) return;
+ memset(numa_slit, -1, sizeof(numa_slit));
+ for (i=0; i<slit_table->localities; i++) {
+ if (!pxm_bit_test(i))
+ continue;
+ node_from = pxm_to_nid_map[i];
+ for (j=0; j<slit_table->localities; j++) {
+ if (!pxm_bit_test(j))
+ continue;
+ node_to = pxm_to_nid_map[j];
+ node_distance(node_from, node_to) =
+ slit_table->entry[i*slit_table->localities + j];
+ }
+ }
+
+#ifdef SLIT_DEBUG
+ printk("ACPI 2.0 SLIT locality table:\n");
+ for (i = 0; i < numnodes; i++) {
+ for (j = 0; j < numnodes; j++)
+ printk("%03d ", node_distance(i,j));
+ printk("\n");
+ }
+#endif
+}
+#endif /* CONFIG_ACPI_NUMA */
+
static int __init
acpi_parse_fadt (unsigned long phys_addr, unsigned long size)
{
@@ -665,12 +855,6 @@
int __init
acpi_boot_init (char *cmdline)
{
- int result;
-
- /* Initialize the ACPI boot-time table parser */
- result = acpi_table_init(cmdline);
- if (result)
- return result;
/*
* MADT
@@ -738,6 +922,10 @@
available_cpus = 1; /* We've got at least one of these, no? */
}
smp_boot_data.cpu_count = total_cpus;
+ smp_build_cpu_map();
+#ifdef CONFIG_NUMA
+ build_cpu_to_node_map();
+#endif
#endif
/* Make boot-up look pretty */
printk("%d CPUs available, %d CPUs total\n", available_cpus, total_cpus);
diff -Nur linux-2.4.20-base/arch/ia64/kernel/setup.c linux-2.4.20-dcm/arch/ia64/kernel/setup.c
--- linux-2.4.20-base/arch/ia64/kernel/setup.c Fri Nov 29 08:53:09 2002
+++ linux-2.4.20-dcm/arch/ia64/kernel/setup.c Wed Mar 12 13:49:06 2003
@@ -34,6 +34,7 @@
#include <asm/ia32.h>
#include <asm/page.h>
+#include <asm/pgtable.h>
#include <asm/machvec.h>
#include <asm/processor.h>
#include <asm/sal.h>
@@ -49,16 +50,9 @@
# error "struct cpuinfo_ia64 too big!"
#endif
-#define MIN(a,b) ((a) < (b) ? (a) : (b))
-#define MAX(a,b) ((a) > (b) ? (a) : (b))
-
extern char _end;
-#ifdef CONFIG_NUMA
- struct cpuinfo_ia64 *boot_cpu_data;
-#else
struct cpuinfo_ia64 _cpu_data[NR_CPUS] __attribute__ ((section ("__special_page_section")));
-#endif
unsigned long ia64_cycles_per_usec;
struct ia64_boot_param *ia64_boot_param;
@@ -95,6 +89,7 @@
static struct rsvd_region rsvd_region[IA64_MAX_RSVD_REGIONS + 1];
static int num_rsvd_regions;
+#ifndef CONFIG_DISCONTIGMEM
static unsigned long bootmap_start; /* physical address where the bootmem map is located */
static int
@@ -107,18 +102,64 @@
*max_pfn = pfn;
return 0;
}
+#endif /* !CONFIG_DISCONTIGMEM */
#define IGNORE_PFN0 1 /* XXX fix me: ignore pfn 0 until TLB miss handler is updated... */
+#ifdef CONFIG_DISCONTIGMEM
/*
- * Free available memory based on the primitive map created from
- * the boot parameters. This routine does not assume the incoming
- * segments are sorted.
+ * efi_memmap_walk() knows nothing about layout of memory across nodes. Find
+ * out to which node a block of memory belongs. Ignore memory that we cannot
+ * identify, and split blocks that run across multiple nodes.
+ *
+ * Take this opportunity to round the start address up and the end address
+ * down to page boundaries.
*/
-static int
-free_available_memory (unsigned long start, unsigned long end, void *arg)
+void
+call_pernode_memory (unsigned long start, unsigned long end, void *arg)
+{
+ unsigned long rs, re;
+ void (*func)(unsigned long, unsigned long, int, int);
+ int i;
+
+ start = PAGE_ALIGN(start);
+ end &= PAGE_MASK;
+ if (start >= end)
+ return;
+
+ func = arg;
+
+ if (!num_memblks) {
+ /* this machine doesn't have SRAT, */
+ /* so call func with nid=0, bank=0 */
+ if (start < end)
+ (*func)(start, end - start, 0, 0);
+ return;
+ }
+
+ for (i = 0; i < num_memblks; i++) {
+ rs = max(start, node_memblk[i].start_paddr);
+ re = min(end, node_memblk[i].start_paddr+node_memblk[i].size);
+
+ if (rs < re)
+ (*func)(rs, re-rs, node_memblk[i].nid,
+ node_memblk[i].bank);
+ }
+}
+#endif /* CONFIG_DISCONTIGMEM */
+
+/*
+ * Filter incoming memory segments based on the primitive map created from
+ * the boot parameters. Segments contained in the map are removed from the
+ * memory ranges. A caller-specified function is called with the memory
+ * ranges that remain after filtering.
+ * This routine does not assume the incoming segments are sorted.
+ */
+int
+filter_rsvd_memory (unsigned long start, unsigned long end, void *arg)
{
unsigned long range_start, range_end, prev_start;
+ void (*func)(unsigned long, unsigned long);
int i;
#if IGNORE_PFN0
@@ -132,13 +173,18 @@
* lowest possible address(walker uses virtual)
*/
prev_start = PAGE_OFFSET;
+ func = arg;
for (i = 0; i < num_rsvd_regions; ++i) {
- range_start = MAX(start, prev_start);
- range_end = MIN(end, rsvd_region[i].start);
+ range_start = max(start, prev_start);
+ range_end = min(end, rsvd_region[i].start);
if (range_start < range_end)
- free_bootmem(__pa(range_start), range_end - range_start);
+#ifdef CONFIG_DISCONTIGMEM
+ call_pernode_memory(__pa(range_start), __pa(range_end), func);
+#else
+ (*func)(__pa(range_start), range_end - range_start);
+#endif
/* nothing more available in this segment */
if (range_end == end) return 0;
@@ -150,6 +196,7 @@
}
+#ifndef CONFIG_DISCONTIGMEM
/*
* Find a place to put the bootmap and return its starting address in bootmap_start.
* This address must be page-aligned.
@@ -171,8 +218,8 @@
free_start = PAGE_OFFSET;
for (i = 0; i < num_rsvd_regions; i++) {
- range_start = MAX(start, free_start);
- range_end = MIN(end, rsvd_region[i].start & PAGE_MASK);
+ range_start = max(start, free_start);
+ range_end = min(end, rsvd_region[i].start & PAGE_MASK);
if (range_end <= range_start) continue; /* skip over empty range */
@@ -188,6 +235,7 @@
}
return 0;
}
+#endif /* CONFIG_DISCONTIGMEM */
static void
sort_regions (struct rsvd_region *rsvd_region, int max)
@@ -252,6 +300,14 @@
sort_regions(rsvd_region, num_rsvd_regions);
+#ifdef CONFIG_DISCONTIGMEM
+ {
+ extern void discontig_mem_init(void);
+ bootmap_size = max_pfn = 0; /* stop gcc warnings */
+ discontig_mem_init();
+ }
+#else /* !CONFIG_DISCONTIGMEM */
+
/* first find highest page frame number */
max_pfn = 0;
efi_memmap_walk(find_max_pfn, &max_pfn);
@@ -268,8 +324,9 @@
bootmap_size = init_bootmem(bootmap_start >> PAGE_SHIFT, max_pfn);
/* Free all available memory, then mark bootmem-map as being in use. */
- efi_memmap_walk(free_available_memory, 0);
+ efi_memmap_walk(filter_rsvd_memory, free_bootmem);
reserve_bootmem(bootmap_start, bootmap_size);
+#endif /* !CONFIG_DISCONTIGMEM */
#ifdef CONFIG_BLK_DEV_INITRD
if (ia64_boot_param->initrd_start) {
@@ -296,6 +353,19 @@
efi_init();
+#ifdef CONFIG_ACPI_BOOT
+ /* Initialize the ACPI boot-time table parser */
+ acpi_table_init(*cmdline_p);
+
+#ifdef CONFIG_ACPI_NUMA
+ acpi_numa_init();
+#endif
+#else
+# ifdef CONFIG_SMP
+ smp_build_cpu_map(); /* happens, e.g., with the Ski simulator */
+# endif
+#endif /* CONFIG_ACPI_BOOT */
+
iomem_resource.end = ~0UL; /* FIXME probably belongs elsewhere */
find_memory();
@@ -537,40 +607,11 @@
pal_vm_info_2_u_t vmi;
unsigned int max_ctx;
struct cpuinfo_ia64 *my_cpu_data;
-#ifdef CONFIG_NUMA
- int cpu, order;
- /*
- * If NUMA is configured, the cpu_data array is not preallocated. The boot cpu
- * allocates entries for every possible cpu. As the remaining cpus come online,
- * they reallocate a new cpu_data structure on their local node. This extra work
- * is required because some boot code references all cpu_data structures
- * before the cpus are actually started.
- */
- if (!boot_cpu_data) {
- my_cpu_data = alloc_bootmem_pages_node(NODE_DATA(numa_node_id()),
- sizeof(struct cpuinfo_ia64));
- boot_cpu_data = my_cpu_data;
- my_cpu_data->cpu_data[0] = my_cpu_data;
- for (cpu = 1; cpu < NR_CPUS; ++cpu)
- my_cpu_data->cpu_data[cpu]
- = alloc_bootmem_pages_node(NODE_DATA(numa_node_id()),
- sizeof(struct cpuinfo_ia64));
- for (cpu = 1; cpu < NR_CPUS; ++cpu)
- memcpy(my_cpu_data->cpu_data[cpu]->cpu_data,
- my_cpu_data->cpu_data, sizeof(my_cpu_data->cpu_data));
- } else {
- order = get_order(sizeof(struct cpuinfo_ia64));
- my_cpu_data = page_address(alloc_pages_node(numa_node_id(), GFP_KERNEL, order));
- memcpy(my_cpu_data, boot_cpu_data->cpu_data[smp_processor_id()],
- sizeof(struct cpuinfo_ia64));
- __free_pages(virt_to_page(boot_cpu_data->cpu_data[smp_processor_id()]),
- order);
- for (cpu = 0; cpu < NR_CPUS; ++cpu)
- boot_cpu_data->cpu_data[cpu]->cpu_data[smp_processor_id()] = my_cpu_data;
- }
-#else
my_cpu_data = cpu_data(smp_processor_id());
+
+#ifdef CONFIG_DISCONTIGMEM
+ my_cpu_data->node_data = get_node_data_ptr();
#endif
/*
diff -Nur linux-2.4.20-base/arch/ia64/kernel/smpboot.c linux-2.4.20-dcm/arch/ia64/kernel/smpboot.c
--- linux-2.4.20-base/arch/ia64/kernel/smpboot.c Mon Mar 3 10:24:21 2003
+++ linux-2.4.20-dcm/arch/ia64/kernel/smpboot.c Wed Mar 12 13:34:08 2003
@@ -575,3 +575,66 @@
smp_num_cpus = 1;
}
}
+
+/*
+ * Initialize the logical CPU number to SAPICID mapping
+ */
+void __init
+smp_build_cpu_map (void)
+{
+ int sapicid, cpu, i;
+ int boot_cpu_id = hard_smp_processor_id();
+
+ for (cpu = 0; cpu < NR_CPUS; cpu++)
+ ia64_cpu_to_sapicid[cpu] = -1;
+
+ ia64_cpu_to_sapicid[0] = boot_cpu_id;
+
+ for (cpu = 1, i = 0; i < smp_boot_data.cpu_count; i++) {
+ sapicid = smp_boot_data.cpu_phys_id[i];
+ if (sapicid == -1 || sapicid == boot_cpu_id)
+ continue;
+ ia64_cpu_to_sapicid[cpu] = sapicid;
+ cpu++;
+ }
+}
+
+#ifdef CONFIG_NUMA
+
+/* on which node is each logical CPU (one cacheline even for 64 CPUs) */
+volatile char cpu_to_node_map[NR_CPUS] __cacheline_aligned;
+/* which logical CPUs are on which nodes */
+volatile unsigned long node_to_cpu_mask[MAX_NUMNODES] __cacheline_aligned;
+
+/*
+ * Build cpu to node mapping and initialize the per node cpu masks.
+ */
+void __init
+build_cpu_to_node_map (void)
+{
+ int cpu, i, node;
+
+ for(node=0; node<MAX_NUMNODES; node++)
+ node_to_cpu_mask[node] = 0;
+ for(cpu = 0; cpu < NR_CPUS; ++cpu) {
+ /*
+ * All Itanium NUMA platforms I know use ACPI, so maybe we
+ * can drop this ifdef completely. [EF]
+ */
+#ifdef CONFIG_ACPI_NUMA
+ node = -1;
+ for (i = 0; i < NR_CPUS; ++i)
+ if (cpu_physical_id(cpu) == node_cpuid[i].phys_id) {
+ node = node_cpuid[i].nid;
+ break;
+ }
+#else
+# error Fixme: Dunno how to build CPU-to-node map.
+#endif
+ cpu_to_node_map[cpu] = node;
+ if (node >= 0)
+ node_to_cpu_mask[node] |= (1UL << cpu);
+ }
+}
+
+#endif /* CONFIG_NUMA */
diff -Nur linux-2.4.20-base/arch/ia64/mm/Makefile linux-2.4.20-dcm/arch/ia64/mm/Makefile
--- linux-2.4.20-base/arch/ia64/mm/Makefile Mon Mar 3 10:24:21 2003
+++ linux-2.4.20-dcm/arch/ia64/mm/Makefile Mon Mar 3 10:55:12 2003
@@ -12,5 +12,7 @@
export-objs := init.o
obj-y := init.o fault.o tlb.o extable.o
+obj-$(CONFIG_NUMA) += numa.o
+obj-$(CONFIG_DISCONTIGMEM) += discontig.o
include $(TOPDIR)/Rules.make
diff -Nur linux-2.4.20-base/arch/ia64/mm/discontig.c linux-2.4.20-dcm/arch/ia64/mm/discontig.c
--- linux-2.4.20-base/arch/ia64/mm/discontig.c Thu Jan 1 09:00:00 1970
+++ linux-2.4.20-dcm/arch/ia64/mm/discontig.c Wed Mar 12 13:57:10 2003
@@ -0,0 +1,315 @@
+/*
+ * Copyright (c) 2000 Silicon Graphics, Inc. All rights reserved.
+ * Copyright (c) 2001 Intel Corp.
+ * Copyright (c) 2001 Tony Luck <tony.luck@intel.com>
+ * Copyright (c) 2002 NEC Corp.
+ * Copyright (c) 2002 Kimio Suganuma <k-suganuma@da.jp.nec.com>
+ */
+
+/*
+ * Platform initialization for Discontig Memory
+ */
+
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/bootmem.h>
+#include <linux/mmzone.h>
+#include <linux/acpi.h>
+#include <linux/efi.h>
+
+
+/*
+ * Round an address upward to the next multiple of GRANULE size.
+ */
+#define GRANULEROUNDUP(n) (((n)+IA64_GRANULE_SIZE-1) & ~(IA64_GRANULE_SIZE-1))
+
+static struct ia64_node_data *node_data[NR_NODES];
+static long boot_pg_data[8*NR_NODES+sizeof(pg_data_t)] __initdata;
+static pg_data_t *pg_data_ptr[NR_NODES] __initdata;
+static bootmem_data_t bdata[NR_NODES][NR_BANKS_PER_NODE+1] __initdata;
+
+extern int filter_rsvd_memory (unsigned long start, unsigned long end, void *arg);
+
+/*
+ * Return the compact node number of this cpu. Used prior to
+ * setting up the cpu_data area.
+ * Note - not fast, intended for boot use only!!
+ */
+int
+boot_get_local_nodeid(void)
+{
+ int i;
+
+ for (i = 0; i < NR_CPUS; i++)
+ if (node_cpuid[i].phys_id == hard_smp_processor_id())
+ return node_cpuid[i].nid;
+
+ /* node info missing, so nid should be 0.. */
+ return 0;
+}
+
+/*
+ * Return a pointer to the pg_data structure for a node.
+ * This function is used ONLY in early boot before the cpu_data
+ * structure is available.
+ */
+pg_data_t* __init
+boot_get_pg_data_ptr(long node)
+{
+ return pg_data_ptr[node];
+}
+
+
+/*
+ * Return a pointer to the node data for the current node.
+ * (boottime initialization only)
+ */
+struct ia64_node_data *
+get_node_data_ptr(void)
+{
+ return node_data[boot_get_local_nodeid()];
+}
+
+/*
+ * We allocate one of the bootmem_data_t structs for each piece of memory
+ * that we wish to treat as a contiguous block. Each such block must start
+ * on a BANKSIZE boundary. Multiple banks per node is not supported.
+ */
+static int __init
+build_maps(unsigned long pstart, unsigned long length, int node)
+{
+ bootmem_data_t *bdp;
+ unsigned long cstart, epfn;
+
+ bdp = pg_data_ptr[node]->bdata;
+ epfn = GRANULEROUNDUP(pstart + length) >> PAGE_SHIFT;
+ cstart = pstart & ~(BANKSIZE - 1);
+
+ if (!bdp->node_low_pfn) {
+ bdp->node_boot_start = cstart;
+ bdp->node_low_pfn = epfn;
+ } else {
+ bdp->node_boot_start = min(cstart, bdp->node_boot_start);
+ bdp->node_low_pfn = max(epfn, bdp->node_low_pfn);
+ }
+
+ min_low_pfn = min(min_low_pfn, bdp->node_boot_start>>PAGE_SHIFT);
+ max_low_pfn = max(max_low_pfn, bdp->node_low_pfn);
+
+ return 0;
+}
+
+/*
+ * Find space on each node for the bootmem map.
+ *
+ * Called by efi_memmap_walk to find boot memory on each node. Note that
+ * only blocks that are free are passed to this routine (currently filtered by
+ * free_available_memory).
+ */
+static int __init
+find_bootmap_space(unsigned long pstart, unsigned long length, int node)
+{
+ unsigned long mapsize, pages, epfn;
+ bootmem_data_t *bdp;
+
+ epfn = (pstart + length) >> PAGE_SHIFT;
+ bdp = &pg_data_ptr[node]->bdata[0];
+
+ if (pstart < bdp->node_boot_start || epfn > bdp->node_low_pfn)
+ return 0;
+
+ if (!bdp->node_bootmem_map) {
+ pages = bdp->node_low_pfn - (bdp->node_boot_start>>PAGE_SHIFT);
+ mapsize = bootmem_bootmap_pages(pages) << PAGE_SHIFT;
+ if (length > mapsize) {
+ init_bootmem_node(
+ BOOT_NODE_DATA(node),
+ pstart>>PAGE_SHIFT,
+ bdp->node_boot_start>>PAGE_SHIFT,
+ bdp->node_low_pfn);
+ }
+
+ }
+
+ return 0;
+}
+
+
+/*
+ * Free available memory to the bootmem allocator.
+ *
+ * Note that only blocks that are free are passed to this routine (currently
+ * filtered by free_available_memory).
+ *
+ */
+static int __init
+discontig_free_bootmem_node(unsigned long pstart, unsigned long length, int node)
+{
+ free_bootmem_node(BOOT_NODE_DATA(node), pstart, length);
+
+ return 0;
+}
+
+
+/*
+ * Reserve the space used by the bootmem maps.
+ */
+static void __init
+discontig_reserve_bootmem(void)
+{
+ int node;
+ unsigned long mapbase, mapsize, pages;
+ bootmem_data_t *bdp;
+
+ for (node = 0; node < numnodes; node++) {
+ bdp = BOOT_NODE_DATA(node)->bdata;
+
+ pages = bdp->node_low_pfn - (bdp->node_boot_start>>PAGE_SHIFT);
+ mapsize = bootmem_bootmap_pages(pages) << PAGE_SHIFT;
+ mapbase = __pa(bdp->node_bootmem_map);
+ reserve_bootmem_node(BOOT_NODE_DATA(node), mapbase, mapsize);
+ }
+}
+
+/*
+ * Allocate per node tables.
+ * - the pg_data structure is allocated on each node. This minimizes offnode
+ * memory references
+ * - the node data is allocated & initialized. Portions of this structure are read-only (after
+ * boot) and contain node-local pointers to useful data structures located on
+ * other nodes.
+ *
+ * We also switch to using the "real" pg_data structures at this point. Earlier in boot, we
+ * use a different structure. The only use for pg_data prior to this point in boot is to get
+ * the pointer to the bdata for the node.
+ */
+static void __init
+allocate_pernode_structures(void)
+{
+ pg_data_t *pgdat=0, *new_pgdat_list=0;
+ int node, mynode;
+
+ mynode = boot_get_local_nodeid();
+ for (node = numnodes - 1; node >= 0 ; node--) {
+ node_data[node] = alloc_bootmem_node(BOOT_NODE_DATA(node), sizeof (struct ia64_node_data));
+ pgdat = __alloc_bootmem_node(BOOT_NODE_DATA(node), sizeof(pg_data_t), SMP_CACHE_BYTES, 0);
+ pgdat->bdata = &(bdata[node][0]);
+ pg_data_ptr[node] = pgdat;
+ pgdat->node_next = new_pgdat_list;
+ new_pgdat_list = pgdat;
+ }
+
+ memcpy(node_data[mynode]->pg_data_ptrs, pg_data_ptr, sizeof(pg_data_ptr));
+ memcpy(node_data[mynode]->node_data_ptrs, node_data, sizeof(node_data));
+
+ pgdat_list = new_pgdat_list;
+}
+
+/*
+ * Called early in boot to setup the boot memory allocator, and to
+ * allocate the node-local pg_data & node-directory data structures..
+ */
+void __init
+discontig_mem_init(void)
+{
+ int node;
+
+ if (numnodes == 0) {
+ printk("node info missing!\n");
+ numnodes = 1;
+ }
+
+ for (node = 0; node < numnodes; node++) {
+ pg_data_ptr[node] = (pg_data_t*) &boot_pg_data[node];
+ pg_data_ptr[node]->bdata = &bdata[node][0];
+ }
+
+ min_low_pfn = -1;
+ max_low_pfn = 0;
+
+ efi_memmap_walk(filter_rsvd_memory, build_maps);
+ efi_memmap_walk(filter_rsvd_memory, find_bootmap_space);
+ efi_memmap_walk(filter_rsvd_memory, discontig_free_bootmem_node);
+ discontig_reserve_bootmem();
+ allocate_pernode_structures();
+}
+
+/*
+ * Initialize the paging system.
+ * - determine sizes of each node
+ * - initialize the paging system for the node
+ * - build the nodedir for the node. This contains pointers to
+ * the per-bank mem_map entries.
+ * - fix the page struct "virtual" pointers. These are bank-specific
+ * values that the paging system doesn't understand.
+ * - replicate the nodedir structure to other nodes
+ */
+
+void __init
+discontig_paging_init(void)
+{
+ int node, mynode;
+ unsigned long max_dma, zones_size[MAX_NR_ZONES];
+ unsigned long kaddr, ekaddr, bid;
+ struct page *page;
+ bootmem_data_t *bdp;
+
+ max_mapnr = 0;
+ max_dma = virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT;
+
+ mynode = boot_get_local_nodeid();
+ for (bid = 0; bid < NR_BANKS; bid++) {
+ node_data[mynode]->node_id_map[bid] = -1;
+ node_data[mynode]->bank_mem_map_base[bid] = NULL;
+ }
+
+ for (node = 0; node < numnodes; node++) {
+ long pfn, startpfn;
+
+ memset(zones_size, 0, sizeof(zones_size));
+
+ startpfn = -1;
+ bdp = BOOT_NODE_DATA(node)->bdata;
+ pfn = bdp->node_boot_start >> PAGE_SHIFT;
+ if (startpfn == -1)
+ startpfn = pfn;
+ if (pfn > max_dma)
+ zones_size[ZONE_NORMAL] += (bdp->node_low_pfn - pfn);
+ else if (bdp->node_low_pfn < max_dma)
+ zones_size[ZONE_DMA] += (bdp->node_low_pfn - pfn);
+ else {
+ zones_size[ZONE_DMA] += (max_dma - pfn);
+ zones_size[ZONE_NORMAL] += (bdp->node_low_pfn - max_dma);
+ }
+
+ free_area_init_node(node, NODE_DATA(node), NULL, zones_size, startpfn<<PAGE_SHIFT, 0);
+
+ page = NODE_DATA(node)->node_mem_map;
+
+ bdp = BOOT_NODE_DATA(node)->bdata;
+
+ kaddr = (unsigned long)__va(bdp->node_boot_start);
+ ekaddr = (unsigned long)__va(bdp->node_low_pfn << PAGE_SHIFT);
+ while (kaddr < ekaddr) {
+ if (paddr_to_nid(__pa(kaddr)) == node) {
+ bid = BANK_MEM_MAP_INDEX(kaddr);
+ node_data[mynode]->node_id_map[bid] = node;
+ node_data[mynode]->bank_mem_map_base[bid] = page;
+ printk("addr(%lx), bank(%ld) -> node(%d), page(%lx)\n", kaddr, bid, node, (unsigned long)page);
+ }
+ kaddr += BANKSIZE;
+ page += BANKSIZE/PAGE_SIZE;
+ }
+ max_mapnr = max(max_mapnr, (unsigned long)(page - mem_map));
+ }
+
+ /*
+ * Finish setting up the node data for this node, then copy it to the other nodes.
+ */
+ for (node=0; node < numnodes; node++)
+ if (mynode != node) {
+ memcpy(node_data[node], node_data[mynode], sizeof(struct ia64_node_data));
+ node_data[node]->node = node;
+ }
+}
+
diff -Nur linux-2.4.20-base/arch/ia64/mm/init.c linux-2.4.20-dcm/arch/ia64/mm/init.c
--- linux-2.4.20-base/arch/ia64/mm/init.c Mon Mar 3 10:24:21 2003
+++ linux-2.4.20-dcm/arch/ia64/mm/init.c Wed Mar 12 14:00:32 2003
@@ -16,6 +16,7 @@
#include <linux/slab.h>
#include <linux/swap.h>
#include <linux/efi.h>
+#include <linux/mmzone.h>
#include <asm/bitops.h>
#include <asm/dma.h>
@@ -38,12 +39,14 @@
unsigned long MAX_DMA_ADDRESS = PAGE_OFFSET + 0x100000000UL;
#define LARGE_GAP 0x40000000 /* Use virtual mem map if a hole is > than this */
-static unsigned long totalram_pages;
+static unsigned long totalram_pages, reserved_pages;
unsigned long vmalloc_end = VMALLOC_END_INIT;
+#ifndef CONFIG_DISCONTIGMEM
static struct page *vmem_map;
static unsigned long num_dma_physpages;
+#endif
int
do_check_pgt_cache (int low, int high)
@@ -186,41 +189,48 @@
return;
}
+#ifdef CONFIG_DISCONTIGMEM
void
show_mem(void)
{
- int i, total = 0, reserved = 0;
+ int i, reserved = 0;
int shared = 0, cached = 0;
+ pg_data_t *pgdat = pgdat_list;
printk("Mem-info:\n");
show_free_areas();
-#ifdef CONFIG_DISCONTIGMEM
- {
- pg_data_t *pgdat = pgdat_list;
-
- printk("Free swap: %6dkB\n", nr_swap_pages<<(PAGE_SHIFT-10));
- do {
- printk("Node ID: %d\n", pgdat->node_id);
- for(i = 0; i < pgdat->node_size; i++) {
- if (PageReserved(pgdat->node_mem_map+i))
- reserved++;
- else if (PageSwapCache(pgdat->node_mem_map+i))
- cached++;
- else if (page_count(pgdat->node_mem_map + i))
- shared += page_count(pgdat->node_mem_map + i) - 1;
- }
- printk("\t%d pages of RAM\n", pgdat->node_size);
- printk("\t%d reserved pages\n", reserved);
- printk("\t%d pages shared\n", shared);
- printk("\t%d pages swap cached\n", cached);
- pgdat = pgdat->node_next;
- } while (pgdat);
- printk("Total of %ld pages in page table cache\n", pgtable_cache_size);
- show_buffers();
- printk("%d free buffer pages\n", nr_free_buffer_pages());
- }
+ printk("Free swap: %6dkB\n", nr_swap_pages<<(PAGE_SHIFT-10));
+ do {
+ printk("Node ID: %d\n", pgdat->node_id);
+ for(i = 0; i < pgdat->node_size; i++) {
+ if (PageReserved(pgdat->node_mem_map+i))
+ reserved++;
+ else if (PageSwapCache(pgdat->node_mem_map+i))
+ cached++;
+ else if (page_count(pgdat->node_mem_map + i))
+ shared += page_count(pgdat->node_mem_map + i) - 1;
+ }
+ printk("\t%ld pages of RAM\n", pgdat->node_size);
+ printk("\t%d reserved pages\n", reserved);
+ printk("\t%d pages shared\n", shared);
+ printk("\t%d pages swap cached\n", cached);
+ pgdat = pgdat->node_next;
+ } while (pgdat);
+ printk("Total of %ld pages in page table cache\n", pgtable_cache_size);
+ show_buffers();
+ printk("%d free buffer pages\n", nr_free_buffer_pages());
+}
#else /* !CONFIG_DISCONTIGMEM */
+void
+show_mem(void)
+{
+ int i, total = 0, reserved = 0;
+ int shared = 0, cached = 0;
+
+ printk("Mem-info:\n");
+ show_free_areas();
+
printk("Free swap: %6dkB\n", nr_swap_pages<<(PAGE_SHIFT-10));
i = max_mapnr;
while (i-- > 0) {
@@ -240,8 +250,8 @@
printk("%d pages swap cached\n", cached);
printk("%ld pages in page table cache\n", pgtable_cache_size);
show_buffers();
-#endif /* !CONFIG_DISCONTIGMEM */
}
+#endif /* !CONFIG_DISCONTIGMEM */
/*
* This is like put_dirty_page() but installs a clean page with PAGE_GATE protection
@@ -357,6 +367,7 @@
ia64_tlb_init();
}
+#ifndef CONFIG_DISCONTIGMEM
static int
create_mem_map_page_table (u64 start, u64 end, void *arg)
{
@@ -466,6 +477,7 @@
*count += (end - start) >> PAGE_SHIFT;
return 0;
}
+#endif /* CONFIG_DISCONTIGMEM */
int
ia64_page_valid (struct page *page)
@@ -498,20 +510,28 @@
last_end = end;
return 0;
}
-#endif
+#endif /* CONFIG_DISCONTIGMEM */
/*
* Set up the page tables.
*/
+#ifdef CONFIG_DISCONTIGMEM
+void
+paging_init (void)
+{
+ extern void discontig_paging_init(void);
+
+ discontig_paging_init();
+ efi_memmap_walk(count_pages, &num_physpages);
+}
+#else /* !CONFIG_DISCONTIGMEM */
void
paging_init (void)
{
unsigned long max_dma;
unsigned long zones_size[MAX_NR_ZONES];
unsigned long zholes_size[MAX_NR_ZONES];
-#ifndef CONFIG_DISCONTIGMEM
unsigned long max_gap;
-#endif
/* initialize mem_map[] */
@@ -539,9 +559,6 @@
}
}
-#ifdef CONFIG_DISCONTIGMEM
- free_area_init_node(0, NULL, NULL, zones_size, 0, zholes_size);
-#else
max_gap = 0;
efi_memmap_walk(find_largest_hole, (u64 *)&max_gap);
@@ -562,20 +579,19 @@
free_area_init_node(0, NULL, vmem_map, zones_size, 0, zholes_size);
printk("Virtual mem_map starts at 0x%p\n", mem_map);
}
-#endif
}
+#endif /* !CONFIG_DISCONTIGMEM */
static int
-count_reserved_pages (u64 start, u64 end, void *arg)
+count_reserved_pages (u64 start, u64 end)
{
unsigned long num_reserved = 0;
- unsigned long *count = arg;
struct page *pg;
for (pg = virt_to_page((void *)start); pg < virt_to_page((void *)end); ++pg)
if (PageReserved(pg))
++num_reserved;
- *count += num_reserved;
+ reserved_pages += num_reserved;
return 0;
}
@@ -583,8 +599,11 @@
mem_init (void)
{
extern char __start_gate_section[];
- long reserved_pages, codesize, datasize, initsize;
+ long codesize, datasize, initsize;
unsigned long num_pgt_pages;
+ pg_data_t *pgdat;
+ extern int filter_rsvd_memory (unsigned long start, unsigned long end, void *arg);
+
#ifdef CONFIG_PCI
/*
@@ -595,16 +614,19 @@
platform_pci_dma_init();
#endif
+#ifndef CONFIG_DISCONTIGMEM
if (!mem_map)
BUG();
max_mapnr = max_low_pfn;
+#endif
high_memory = __va(max_low_pfn * PAGE_SIZE);
- totalram_pages += free_all_bootmem();
+ for_each_pgdat(pgdat)
+ totalram_pages += free_all_bootmem_node(pgdat);
reserved_pages = 0;
- efi_memmap_walk(count_reserved_pages, &reserved_pages);
+ efi_memmap_walk(filter_rsvd_memory, count_reserved_pages);
codesize = (unsigned long) &_etext - (unsigned long) &_stext;
datasize = (unsigned long) &_edata - (unsigned long) &_etext;
diff -Nur linux-2.4.20-base/arch/ia64/mm/numa.c linux-2.4.20-dcm/arch/ia64/mm/numa.c
--- linux-2.4.20-base/arch/ia64/mm/numa.c Thu Jan 1 09:00:00 1970
+++ linux-2.4.20-dcm/arch/ia64/mm/numa.c Wed Mar 12 13:34:10 2003
@@ -0,0 +1,46 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * This file contains NUMA specific variables and functions which can
+ * be split away from DISCONTIGMEM and are used on NUMA machines with
+ * contiguous memory.
+ *
+ * 2002/08/07 Erich Focht <efocht@ess.nec.de>
+ */
+
+#include <linux/config.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/init.h>
+#include <linux/bootmem.h>
+#include <linux/mmzone.h>
+#include <asm/numa.h>
+
+/*
+ * The following structures are usually initialized by ACPI or
+ * similar mechanisms and describe the NUMA characteristics of the machine.
+ */
+int num_memblks = 0;
+struct node_memblk_s node_memblk[NR_MEMBLKS];
+struct node_cpuid_s node_cpuid[NR_CPUS];
+/*
+ * This is a matrix with "distances" between nodes; they should be
+ * proportional to the memory access latency ratios.
+ */
+u8 numa_slit[NR_NODES * NR_NODES];
+
+/* Identify which cnode a physical address resides on */
+int
+paddr_to_nid(unsigned long paddr)
+{
+ int i;
+
+ for (i = 0; i < num_memblks; i++)
+ if (paddr >= node_memblk[i].start_paddr &&
+ paddr < node_memblk[i].start_paddr + node_memblk[i].size)
+ break;
+
+ return (i < num_memblks) ? node_memblk[i].nid : -1;
+}
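The `paddr_to_nid()` added above is a linear search over the memblk table filled in from the SRAT. A self-contained sketch of the same lookup, with a hypothetical table type standing in for `node_memblk[]`:

```c
#include <assert.h>

/* Hypothetical stand-in for node_memblk[]: each entry maps a physical
 * address range to a logical node id. */
struct memblk {
	unsigned long start, size;
	int nid;
};

/* Linear search, as in paddr_to_nid(): return the node owning paddr,
 * or -1 when the address falls in no known memory chunk. */
static int lookup_nid(const struct memblk *tbl, int n, unsigned long paddr)
{
	int i;

	for (i = 0; i < n; i++)
		if (paddr >= tbl[i].start && paddr < tbl[i].start + tbl[i].size)
			return tbl[i].nid;
	return -1;
}
```

The search is O(num_memblks), which is acceptable here because the table is small and the function is used mostly at boot.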
diff -Nur linux-2.4.20-base/drivers/acpi/Config.in linux-2.4.20-dcm/drivers/acpi/Config.in
--- linux-2.4.20-base/drivers/acpi/Config.in Mon Mar 3 10:24:21 2003
+++ linux-2.4.20-dcm/drivers/acpi/Config.in Mon Mar 3 10:55:12 2003
@@ -36,6 +36,9 @@
tristate ' Fan' CONFIG_ACPI_FAN
tristate ' Processor' CONFIG_ACPI_PROCESSOR
dep_tristate ' Thermal Zone' CONFIG_ACPI_THERMAL $CONFIG_ACPI_PROCESSOR
+ if [ "$CONFIG_NUMA" = "y" ]; then
+ bool ' NUMA support' CONFIG_ACPI_NUMA
+ fi
bool ' Debug Statements' CONFIG_ACPI_DEBUG
fi
@@ -119,6 +122,9 @@
tristate ' Fan' CONFIG_ACPI_FAN
tristate ' Processor' CONFIG_ACPI_PROCESSOR
dep_tristate ' Thermal Zone' CONFIG_ACPI_THERMAL $CONFIG_ACPI_PROCESSOR
+ if [ "$CONFIG_NUMA" = "y" ]; then
+ bool ' NUMA support' CONFIG_ACPI_NUMA
+ fi
bool ' Debug Statements' CONFIG_ACPI_DEBUG
endmenu
fi
diff -Nur linux-2.4.20-base/drivers/acpi/Makefile linux-2.4.20-dcm/drivers/acpi/Makefile
--- linux-2.4.20-base/drivers/acpi/Makefile Mon Mar 3 10:24:21 2003
+++ linux-2.4.20-dcm/drivers/acpi/Makefile Mon Mar 3 10:55:12 2003
@@ -50,6 +50,7 @@
obj-$(CONFIG_ACPI_PROCESSOR) += processor.o
obj-$(CONFIG_ACPI_THERMAL) += thermal.o
obj-$(CONFIG_ACPI_SYSTEM) += system.o
+obj-$(CONFIG_ACPI_NUMA) += numa.o
endif
include $(TOPDIR)/Rules.make
diff -Nur linux-2.4.20-base/drivers/acpi/numa.c linux-2.4.20-dcm/drivers/acpi/numa.c
--- linux-2.4.20-base/drivers/acpi/numa.c Thu Jan 1 09:00:00 1970
+++ linux-2.4.20-dcm/drivers/acpi/numa.c Mon Mar 3 10:55:12 2003
@@ -0,0 +1,186 @@
+/*
+ * acpi_numa.c - ACPI NUMA support
+ *
+ * Copyright (C) 2002 Takayoshi Kochi <t-kouchi@cq.jp.nec.com>
+ *
+ * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ *
+ */
+
+#include <linux/config.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/acpi.h>
+#include "acpi_bus.h"
+
+extern int __init acpi_table_parse_madt_family (enum acpi_table_id id, unsigned long madt_size, int entry_id, acpi_madt_entry_handler handler);
+
+void __init
+acpi_table_print_srat_entry (
+ acpi_table_entry_header *header)
+{
+ if (!header)
+ return;
+
+ switch (header->type) {
+
+ case ACPI_SRAT_PROCESSOR_AFFINITY:
+ {
+ struct acpi_table_processor_affinity *p =
+ (struct acpi_table_processor_affinity*) header;
+ printk(KERN_INFO "SRAT Processor (id[0x%02x] eid[0x%02x]) in proximity domain %d %s\n",
+ p->apic_id, p->lsapic_eid, p->proximity_domain,
+ p->flags.enabled?"enabled":"disabled");
+ }
+ break;
+
+ case ACPI_SRAT_MEMORY_AFFINITY:
+ {
+ struct acpi_table_memory_affinity *p =
+ (struct acpi_table_memory_affinity*) header;
+ printk(KERN_INFO "SRAT Memory (0x%08x%08x length 0x%08x%08x type 0x%x) in proximity domain %d %s%s\n",
+ p->base_addr_hi, p->base_addr_lo, p->length_hi, p->length_lo,
+ p->memory_type, p->proximity_domain,
+ p->flags.enabled ? "enabled" : "disabled",
+ p->flags.hot_pluggable ? " hot-pluggable" : "");
+ }
+ break;
+
+ default:
+ printk(KERN_WARNING "Found unsupported SRAT entry (type = 0x%x)\n",
+ header->type);
+ break;
+ }
+}
+
+
+static int __init
+acpi_parse_slit (unsigned long phys_addr, unsigned long size)
+{
+ struct acpi_table_slit *slit;
+ u32 localities;
+
+ if (!phys_addr || !size)
+ return -EINVAL;
+
+ slit = (struct acpi_table_slit *) __va(phys_addr);
+
+ /* downcast just for %llu vs %lu for i386/ia64 */
+ localities = (u32) slit->localities;
+
+ printk(KERN_INFO "SLIT localities %ux%u\n", localities, localities);
+
+ acpi_numa_slit_init(slit);
+
+ return 0;
+}
+
+
+static int __init
+acpi_parse_processor_affinity (acpi_table_entry_header *header)
+{
+ struct acpi_table_processor_affinity *processor_affinity = NULL;
+
+ processor_affinity = (struct acpi_table_processor_affinity*) header;
+ if (!processor_affinity)
+ return -EINVAL;
+
+ acpi_table_print_srat_entry(header);
+
+ /* let the architecture-dependent part do it */
+ acpi_numa_processor_affinity_init(processor_affinity);
+
+ return 0;
+}
+
+
+static int __init
+acpi_parse_memory_affinity (acpi_table_entry_header *header)
+{
+ struct acpi_table_memory_affinity *memory_affinity = NULL;
+
+ memory_affinity = (struct acpi_table_memory_affinity*) header;
+ if (!memory_affinity)
+ return -EINVAL;
+
+ acpi_table_print_srat_entry(header);
+
+ /* let the architecture-dependent part do it */
+ acpi_numa_memory_affinity_init(memory_affinity);
+
+ return 0;
+}
+
+
+static int __init
+acpi_parse_srat (unsigned long phys_addr, unsigned long size)
+{
+ struct acpi_table_srat *srat = NULL;
+
+ if (!phys_addr || !size)
+ return -EINVAL;
+
+ srat = (struct acpi_table_srat *) __va(phys_addr);
+
+ printk(KERN_INFO "SRAT revision %d\n", srat->table_revision);
+
+ return 0;
+}
+
+
+int __init
+acpi_table_parse_srat (
+ enum acpi_srat_entry_id id,
+ acpi_madt_entry_handler handler)
+{
+ return acpi_table_parse_madt_family(ACPI_SRAT, sizeof(struct acpi_table_srat),
+ id, handler);
+}
+
+
+int __init
+acpi_numa_init()
+{
+ int result;
+
+ /* SRAT: Static Resource Affinity Table */
+ result = acpi_table_parse(ACPI_SRAT, acpi_parse_srat);
+
+ if (result > 0) {
+ result = acpi_table_parse_srat(ACPI_SRAT_PROCESSOR_AFFINITY,
+ acpi_parse_processor_affinity);
+ result = acpi_table_parse_srat(ACPI_SRAT_MEMORY_AFFINITY,
+ acpi_parse_memory_affinity);
+ } else {
+ /* FIXME */
+ printk("Warning: acpi_table_parse(ACPI_SRAT) returned %d!\n",result);
+ }
+
+ /* SLIT: System Locality Information Table */
+ result = acpi_table_parse(ACPI_SLIT, acpi_parse_slit);
+ if (result < 1) {
+ /* FIXME */
+ printk("Warning: acpi_table_parse(ACPI_SLIT) returned %d!\n",result);
+ }
+
+ acpi_numa_arch_fixup();
+ return 0;
+}
diff -Nur linux-2.4.20-base/drivers/acpi/tables.c linux-2.4.20-dcm/drivers/acpi/tables.c
--- linux-2.4.20-base/drivers/acpi/tables.c Mon Mar 3 10:24:22 2003
+++ linux-2.4.20-dcm/drivers/acpi/tables.c Mon Mar 3 10:55:12 2003
@@ -224,11 +224,13 @@
int __init
-acpi_table_parse_madt (
+acpi_table_parse_madt_family (
enum acpi_table_id id,
+ unsigned long madt_size,
+ int entry_id,
acpi_madt_entry_handler handler)
{
- struct acpi_table_madt *madt = NULL;
+ void *madt = NULL;
acpi_table_entry_header *entry = NULL;
unsigned long count = 0;
unsigned long madt_end = 0;
@@ -240,19 +242,21 @@
/* Locate the MADT (if exists). There should only be one. */
for (i = 0; i < sdt.count; i++) {
- if (sdt.entry[i].id != ACPI_APIC)
+ if (sdt.entry[i].id != id)
continue;
- madt = (struct acpi_table_madt *)
+ madt = (void *)
__acpi_map_table(sdt.entry[i].pa, sdt.entry[i].size);
if (!madt) {
- printk(KERN_WARNING PREFIX "Unable to map MADT\n");
+ printk(KERN_WARNING PREFIX "Unable to map %s\n",
+ acpi_table_signatures[id]);
return -ENODEV;
}
break;
}
if (!madt) {
- printk(KERN_WARNING PREFIX "MADT not present\n");
+ printk(KERN_WARNING PREFIX "%s not present\n",
+ acpi_table_signatures[id]);
return -ENODEV;
}
@@ -261,18 +265,28 @@
/* Parse all entries looking for a match. */
entry = (acpi_table_entry_header *)
- ((unsigned long) madt + sizeof(struct acpi_table_madt));
+ ((unsigned long) madt + madt_size);
while (((unsigned long) entry) < madt_end) {
- if (entry->type == id) {
+ if (entry->type == entry_id) {
count++;
handler(entry);
}
entry = (acpi_table_entry_header *)
- ((unsigned long) entry += entry->length);
+ ((unsigned long) entry + entry->length);
}
return count;
+}
+
+
+int __init
+acpi_table_parse_madt (
+ enum acpi_madt_entry_id id,
+ acpi_madt_entry_handler handler)
+{
+ return acpi_table_parse_madt_family(ACPI_APIC, sizeof(struct acpi_table_madt),
+ id, handler);
}
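The generalized `acpi_table_parse_madt_family()` above walks a packed buffer of variable-length subtables, advancing by each entry's own `length` field (note the hunk also fixes the invalid lvalue cast `(unsigned long) entry += ...` to a plain addition). A minimal sketch of that walk over a fabricated header layout (not the real `acpi_table_entry_header`):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for an ACPI subtable header: a type byte and the
 * total length of the entry in bytes. */
struct entry_hdr {
	uint8_t type;
	uint8_t length;
};

/* Count entries of a given type in a packed buffer of variable-length
 * entries, stepping forward by each entry's own length. */
static int count_entries(const uint8_t *buf, size_t total, int want_type)
{
	size_t off = 0;
	int count = 0;

	while (off + sizeof(struct entry_hdr) <= total) {
		struct entry_hdr h;

		memcpy(&h, buf + off, sizeof(h));
		if (h.length == 0)	/* malformed table: avoid looping forever */
			break;
		if (h.type == want_type)
			count++;
		off += h.length;
	}
	return count;
}
```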
diff -Nur linux-2.4.20-base/include/asm-ia64/acpi.h linux-2.4.20-dcm/include/asm-ia64/acpi.h
--- linux-2.4.20-base/include/asm-ia64/acpi.h Mon Mar 3 10:24:23 2003
+++ linux-2.4.20-dcm/include/asm-ia64/acpi.h Wed Mar 12 14:17:39 2003
@@ -97,17 +97,18 @@
} while (0)
const char *acpi_get_sysname (void);
-int acpi_boot_init (char *cdline);
int acpi_request_vector (u32 int_type);
int acpi_get_prt (struct pci_vector_struct **vectors, int *count);
int acpi_get_interrupt_model (int *type);
int acpi_irq_to_vector (u32 irq);
-#ifdef CONFIG_DISCONTIGMEM
-#define NODE_ARRAY_INDEX(x) ((x) / 8) /* 8 bits/char */
-#define NODE_ARRAY_OFFSET(x) ((x) % 8) /* 8 bits/char */
-#define MAX_PXM_DOMAINS (256)
-#endif /* CONFIG_DISCONTIGMEM */
+#ifdef CONFIG_ACPI_NUMA
+#include <asm/numa.h>
+/* Proximity bitmap length; _PXM is at most 255 (8 bits) */
+#define MAX_PXM_DOMAINS (256)
+extern int __initdata pxm_to_nid_map[MAX_PXM_DOMAINS];
+extern int __initdata nid_to_pxm_map[NR_NODES];
+#endif
#endif /*__KERNEL__*/
diff -Nur linux-2.4.20-base/include/asm-ia64/mmzone.h linux-2.4.20-dcm/include/asm-ia64/mmzone.h
--- linux-2.4.20-base/include/asm-ia64/mmzone.h Thu Jan 1 09:00:00 1970
+++ linux-2.4.20-dcm/include/asm-ia64/mmzone.h Wed Mar 12 13:41:08 2003
@@ -0,0 +1,143 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (c) 2000 Silicon Graphics, Inc. All rights reserved.
+ * Copyright (c) 2002 NEC Corp.
+ * Copyright (c) 2002 Erich Focht <efocht@ess.nec.de>
+ * Copyright (c) 2002 Kimio Suganuma <k-suganuma@da.jp.nec.com>
+ */
+#ifndef _ASM_IA64_MMZONE_H
+#define _ASM_IA64_MMZONE_H
+
+#include <linux/config.h>
+#include <linux/init.h>
+
+/*
+ * Given a kaddr, find the base mem_map address for the start of the mem_map
+ * entries for the bank containing the kaddr.
+ */
+#define BANK_MEM_MAP_BASE(kaddr) local_node_data->bank_mem_map_base[BANK_MEM_MAP_INDEX(kaddr)]
+
+/*
+ * Given a kaddr, this macro returns the relative map number
+ * within the bank.
+ */
+#define BANK_MAP_NR(kaddr) (BANK_OFFSET(kaddr) >> PAGE_SHIFT)
+
+/*
+ * Given a pte, this macro returns a pointer to the page struct for the pte.
+ */
+#define pte_page(pte) virt_to_page(PAGE_OFFSET | (pte_val(pte)&_PFN_MASK))
+
+/*
+ * Determine if a kaddr is a valid memory address of memory that
+ * actually exists.
+ *
+ * The check consists of 2 parts:
+ * - verify that the address is a region 7 address & does not
+ * contain any bits that preclude it from being a valid platform
+ * memory address
+ * - verify that the chunk actually exists.
+ *
+ * Note that IO addresses are NOT considered valid addresses.
+ *
+ * Note, many platforms can simply check if kaddr exceeds a specific size.
+ * (However, this wont work on SGI platforms since IO space is embedded
+ * within the range of valid memory addresses & nodes have holes in the
+ * address range between banks).
+ */
+#define kern_addr_valid(kaddr) ({long _kav=(long)(kaddr); \
+ VALID_MEM_KADDR(_kav);})
+
+/*
+ * Given a kaddr, return a pointer to the page struct for the page.
+ * If the kaddr does not represent RAM memory that potentially exists,
+ * NULL is returned; IO addresses likewise yield NULL. Addresses in
+ * unpopulated RAM banks may
+ * return undefined results OR may panic the system.
+ *
+ */
+#define virt_to_page(kaddr) ({long _kvtp=(long)(kaddr); \
+ (VALID_MEM_KADDR(_kvtp)) \
+ ? BANK_MEM_MAP_BASE(_kvtp) + BANK_MAP_NR(_kvtp) \
+ : NULL;})
+
+/*
+ * Given a page struct entry, return the physical address that the page struct represents.
+ * Since IA64 has all memory in the DMA zone, the following works:
+ */
+#define page_to_phys(page) __pa(page_address(page))
+
+#define node_mem_map(nid) (NODE_DATA(nid)->node_mem_map)
+
+#define node_localnr(pfn, nid) ((pfn) - NODE_DATA(nid)->node_start_pfn)
+
+#define pfn_to_page(pfn) (struct page *)(node_mem_map(pfn_to_nid(pfn)) + node_localnr(pfn, pfn_to_nid(pfn)))
+
+#define pfn_to_nid(pfn) local_node_data->node_id_map[(pfn << PAGE_SHIFT) >> DIG_BANKSHIFT]
+
+#define page_to_pfn(page) (long)((page - page_zone(page)->zone_mem_map) + page_zone(page)->zone_start_pfn)
+
+
+/*
+ * pfn_valid should be made as fast as possible, and the current definition
+ * is valid for machines that are NUMA, but still contiguous, which is what
+ * is currently supported. A more generalised, but slower definition would
+ * be something like this - mbligh:
+ * ( pfn_to_pgdat(pfn) && (pfn < node_end_pfn(pfn_to_nid(pfn))) )
+ */
+#define pfn_valid(pfn) (pfn < max_low_pfn)
+extern unsigned long max_low_pfn;
+
+
+#ifdef CONFIG_NUMA
+
+/*
+ * Platform definitions for DIG platform with contiguous memory.
+ */
+#define MAX_PHYSNODE_ID 8 /* Maximum node number +1 */
+#define NR_NODES 8 /* Maximum number of nodes in SSI */
+
+#define MAX_PHYS_MEMORY (1UL << 40) /* 1 TB */
+
+/*
+ * Bank definitions.
+ * Configurable settings for DIG: 512MB/bank: 16GB/node,
+ * 2048MB/bank: 64GB/node,
+ * 8192MB/bank: 256GB/node.
+ */
+#define NR_BANKS_PER_NODE 32
+#if defined(CONFIG_IA64_NODESIZE_16GB)
+# define DIG_BANKSHIFT 29
+#elif defined(CONFIG_IA64_NODESIZE_64GB)
+# define DIG_BANKSHIFT 31
+#elif defined(CONFIG_IA64_NODESIZE_256GB)
+# define DIG_BANKSHIFT 33
+#else
+# error Unsupported bank and nodesize!
+#endif
+#define BANKSIZE (1UL << DIG_BANKSHIFT)
+#define BANK_OFFSET(addr) ((unsigned long)(addr) & (BANKSIZE-1))
+#define NR_BANKS (NR_BANKS_PER_NODE * NR_NODES)
+
+/*
+ * VALID_MEM_KADDR returns a boolean to indicate if a kaddr is
+ * potentially a valid cacheable identity mapped RAM memory address.
+ * Note that the RAM may or may not actually be present!!
+ */
+ #define VALID_MEM_KADDR(kaddr) 1
+/* #define VALID_MEM_KADDR(kaddr) (BANK_MEM_MAP_BASE(kaddr) == NULL ? NULL : 1) */
+
+/*
+ * Given a kaddr, compute the index of the memory bank that contains it
+ * (used to index node_id_map and bank_mem_map_base).
+ */
+#define BANK_MEM_MAP_INDEX(kaddr) \
+ (((unsigned long)(kaddr) & (MAX_PHYS_MEMORY-1)) >> DIG_BANKSHIFT)
+
+extern void build_cpu_to_node_map(void);
+
+#endif /* CONFIG_NUMA */
+#endif /* _ASM_IA64_MMZONE_H */
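The bank macros in the header above reduce to shift-and-mask arithmetic. A sketch using the 16GB/node configuration (`DIG_BANKSHIFT` 29, i.e. 512MB banks) and an assumed 16KB page size (`PAGE_SHIFT` 14 — an illustration choice, not taken from the patch):

```c
#include <assert.h>

#define DIG_BANKSHIFT	29			/* 512MB banks (16GB/node) */
#define BANKSIZE	(1UL << DIG_BANKSHIFT)
#define PAGE_SHIFT	14			/* assumed 16KB pages */
#define MAX_PHYS_MEMORY	(1UL << 40)		/* 1 TB */

/* Index of the memory bank containing a kaddr's physical part,
 * as in BANK_MEM_MAP_INDEX(). */
static unsigned long bank_index(unsigned long kaddr)
{
	return (kaddr & (MAX_PHYS_MEMORY - 1)) >> DIG_BANKSHIFT;
}

/* Page number relative to the start of the bank, as in BANK_MAP_NR(). */
static unsigned long bank_map_nr(unsigned long kaddr)
{
	return (kaddr & (BANKSIZE - 1)) >> PAGE_SHIFT;
}
```

Together these let `virt_to_page()` find the right per-bank `mem_map` base (via `bank_mem_map_base[]`) and then offset into it by the in-bank page number.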
diff -Nur linux-2.4.20-base/include/asm-ia64/nodedata.h linux-2.4.20-dcm/include/asm-ia64/nodedata.h
--- linux-2.4.20-base/include/asm-ia64/nodedata.h Thu Jan 1 09:00:00 1970
+++ linux-2.4.20-dcm/include/asm-ia64/nodedata.h Wed Mar 12 13:41:15 2003
@@ -0,0 +1,75 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (c) 2000 Silicon Graphics, Inc. All rights reserved.
+ * Copyright (c) 2002 NEC Corp.
+ * Copyright (c) 2002 Erich Focht <efocht@ess.nec.de>
+ * Copyright (c) 2002 Kimio Suganuma <k-suganuma@da.jp.nec.com>
+ */
+
+
+#ifndef _ASM_IA64_NODEDATA_H
+#define _ASM_IA64_NODEDATA_H
+
+
+#include <asm/mmzone.h>
+
+/*
+ * Node Data. One of these structures is located on each node of a NUMA system.
+ */
+
+struct pglist_data;
+struct ia64_node_data {
+ short node;
+ struct pglist_data *pg_data_ptrs[NR_NODES];
+ struct page *bank_mem_map_base[NR_BANKS];
+ struct ia64_node_data *node_data_ptrs[NR_NODES];
+ short node_id_map[NR_BANKS];
+};
+
+
+/*
+ * Return a pointer to the node_data structure for the executing cpu.
+ */
+#define local_node_data (local_cpu_data->node_data)
+
+
+/*
+ * Return a pointer to the node_data structure for the specified node.
+ */
+#define node_data(node) (local_node_data->node_data_ptrs[node])
+
+/*
+ * Get a pointer to the node_id/node_data for the current cpu.
+ * (boot time only)
+ */
+extern int boot_get_local_nodeid(void);
+extern struct ia64_node_data *get_node_data_ptr(void);
+
+/*
+ * Given a node id, return a pointer to the pg_data_t for the node.
+ * The following 2 macros are similar.
+ *
+ * NODE_DATA - should be used in all code not related to system
+ * initialization. It uses pernode data structures to minimize
+ * offnode memory references. However, these structure are not
+ * present during boot. This macro can be used once cpu_init
+ * completes.
+ *
+ * BOOT_NODE_DATA
+ * - should be used during system initialization
+ * prior to freeing __initdata. It does not depend on the percpu
+ * area being present.
+ *
+ * NOTE: The names of these macros are misleading but are difficult to change
+ * since they are used in generic Linux & on other architectures.
+ */
+#define NODE_DATA(nid) (local_node_data->pg_data_ptrs[nid])
+#define BOOT_NODE_DATA(nid) boot_get_pg_data_ptr((long)(nid))
+
+struct pglist_data;
+extern struct pglist_data * __init boot_get_pg_data_ptr(long);
+
+#endif /* _ASM_IA64_NODEDATA_H */
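The point of the pointer arrays in `struct ia64_node_data` is that every node keeps a *local* copy of the pointers to all nodes' structures, so `NODE_DATA(nid)` never takes an off-node reference. A toy sketch of the replication step at the end of `discontig_paging_init()` (cut-down hypothetical struct, two nodes):

```c
#include <assert.h>
#include <string.h>

#define NNODES	2
#define NBANKS	4

/* Cut-down stand-in for struct ia64_node_data: a subset of the fields,
 * chosen for illustration only. */
struct toy_node_data {
	short node;
	void *pg_data_ptrs[NNODES];
	short node_id_map[NBANKS];
};

/* Mirror the tail of discontig_paging_init(): copy the boot node's
 * finished structure to every other node, then patch the node id. */
static void replicate_node_data(struct toy_node_data *nd, int nnodes, int boot)
{
	int n;

	for (n = 0; n < nnodes; n++)
		if (n != boot) {
			memcpy(&nd[n], &nd[boot], sizeof(nd[boot]));
			nd[n].node = (short)n;
		}
}
```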
diff -Nur linux-2.4.20-base/include/asm-ia64/numa.h linux-2.4.20-dcm/include/asm-ia64/numa.h
--- linux-2.4.20-base/include/asm-ia64/numa.h Thu Jan 1 09:00:00 1970
+++ linux-2.4.20-dcm/include/asm-ia64/numa.h Wed Mar 12 13:41:15 2003
@@ -0,0 +1,70 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * This file contains NUMA specific prototypes and definitions.
+ *
+ * 2002/08/05 Erich Focht <efocht@ess.nec.de>
+ *
+ */
+#ifndef _ASM_IA64_NUMA_H
+#define _ASM_IA64_NUMA_H
+
+#ifdef CONFIG_NUMA
+
+#ifdef CONFIG_DISCONTIGMEM
+# include <asm/mmzone.h>
+# define NR_MEMBLKS (NR_BANKS)
+#else
+# define NR_NODES (8)
+# define NR_MEMBLKS (NR_NODES * 8)
+#endif
+
+#include <linux/cache.h>
+#include <linux/threads.h>
+extern volatile char cpu_to_node_map[NR_CPUS] __cacheline_aligned;
+extern volatile unsigned long node_to_cpu_mask[NR_NODES] __cacheline_aligned;
+
+/* Stuff below this line could be architecture independent */
+
+extern int num_memblks; /* total number of memory chunks */
+
+/*
+ * List of node memory chunks. Filled when parsing SRAT table to
+ * obtain information about memory nodes.
+ */
+
+struct node_memblk_s {
+ unsigned long start_paddr;
+ unsigned long size;
+ int nid; /* which logical node contains this chunk? */
+ int bank; /* which mem bank on this node */
+};
+
+struct node_cpuid_s {
+ u16 phys_id; /* id << 8 | eid */
+ int nid; /* logical node containing this CPU */
+};
+
+extern struct node_memblk_s node_memblk[NR_MEMBLKS];
+extern struct node_cpuid_s node_cpuid[NR_CPUS];
+
+/*
+ * ACPI 2.0 SLIT (System Locality Information Table)
+ * http://devresource.hp.com/devresource/Docs/TechPapers/IA64/slit.pdf
+ *
+ * This is a matrix with "distances" between nodes; they should be
+ * proportional to the memory access latency ratios.
+ */
+
+extern u8 numa_slit[NR_NODES * NR_NODES];
+#define node_distance(from,to) (numa_slit[from * numnodes + to])
+
+extern int paddr_to_nid(unsigned long paddr);
+
+#define local_nodeid (cpu_to_node_map[smp_processor_id()])
+
+#endif /* CONFIG_NUMA */
+
+#endif /* _ASM_IA64_NUMA_H */
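The SLIT is stored as a flattened N×N byte matrix indexed `from * numnodes + to`, exactly as the `node_distance()` macro above does. A sketch with a hypothetical 2-node table (the values 10/20 are illustrative; by ACPI convention the diagonal, i.e. local access, is 10):

```c
#include <assert.h>

#define NUMNODES	2

/* Flattened 2x2 SLIT: local distance 10, remote 20 (made-up values). */
static unsigned char slit[NUMNODES * NUMNODES] = {
	10, 20,
	20, 10,
};

/* Same indexing as the node_distance() macro. */
static int node_distance(int from, int to)
{
	return slit[from * NUMNODES + to];
}
```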
diff -Nur linux-2.4.20-base/include/asm-ia64/numnodes.h linux-2.4.20-dcm/include/asm-ia64/numnodes.h
--- linux-2.4.20-base/include/asm-ia64/numnodes.h Thu Jan 1 09:00:00 1970
+++ linux-2.4.20-dcm/include/asm-ia64/numnodes.h Wed Mar 12 13:41:15 2003
@@ -0,0 +1,7 @@
+#ifndef _ASM_MAX_NUMNODES_H
+#define _ASM_MAX_NUMNODES_H
+
+#include <asm/mmzone.h>
+#define MAX_NUMNODES NR_NODES
+
+#endif /* _ASM_MAX_NUMNODES_H */
diff -Nur linux-2.4.20-base/include/asm-ia64/page.h linux-2.4.20-dcm/include/asm-ia64/page.h
--- linux-2.4.20-base/include/asm-ia64/page.h Mon Mar 3 10:24:23 2003
+++ linux-2.4.20-dcm/include/asm-ia64/page.h Wed Mar 12 13:27:41 2003
@@ -52,20 +52,16 @@
*/
#define MAP_NR_DENSE(addr) (((unsigned long) (addr) - PAGE_OFFSET) >> PAGE_SHIFT)
+#ifndef CONFIG_DISCONTIGMEM
#ifdef CONFIG_IA64_GENERIC
# include <asm/machvec.h>
# define virt_to_page(kaddr) (mem_map + platform_map_nr(kaddr))
# define page_to_phys(page) ((page - mem_map) << PAGE_SHIFT)
-#elif defined (CONFIG_IA64_SGI_SN1)
-# ifndef CONFIG_DISCONTIGMEM
-# define virt_to_page(kaddr) (mem_map + MAP_NR_DENSE(kaddr))
-# define page_to_phys(page) XXX fix me
-# endif
#else
# define virt_to_page(kaddr) (mem_map + MAP_NR_DENSE(kaddr))
# define page_to_phys(page) ((page - mem_map) << PAGE_SHIFT)
#endif
-
+#endif
struct page;
extern int ia64_page_valid (struct page *);
#define VALID_PAGE(page) (((page - mem_map) < max_mapnr) && ia64_page_valid(page))
diff -Nur linux-2.4.20-base/include/asm-ia64/pgtable.h linux-2.4.20-dcm/include/asm-ia64/pgtable.h
--- linux-2.4.20-base/include/asm-ia64/pgtable.h Mon Mar 3 10:24:23 2003
+++ linux-2.4.20-dcm/include/asm-ia64/pgtable.h Wed Mar 12 13:41:38 2003
@@ -206,6 +206,15 @@
* Conversion functions: convert a page and protection to a page entry,
* and a page entry and page directory to the page they refer to.
*/
+#ifdef CONFIG_DISCONTIGMEM
+#define mk_pte(page,pgprot) \
+({ \
+ pte_t __pte; \
+ \
+ pte_val(__pte) = (unsigned long)page_address(page) - PAGE_OFFSET + pgprot_val(pgprot); \
+ __pte; \
+})
+#else
#define mk_pte(page,pgprot) \
({ \
pte_t __pte; \
@@ -213,6 +222,7 @@
pte_val(__pte) = ((page - mem_map) << PAGE_SHIFT) | pgprot_val(pgprot); \
__pte; \
})
+#endif
/* This takes a physical page address that is used by the remapping functions */
#define mk_pte_phys(physpage, pgprot) \
@@ -440,6 +450,7 @@
*/
#define pgtable_cache_init() do { } while (0)
+#ifndef CONFIG_DISCONTIGMEM
/* arch mem_map init routines are needed due to holes in a virtual mem_map */
#define HAVE_ARCH_MEMMAP_INIT
@@ -449,7 +460,7 @@
extern unsigned long arch_memmap_init (memmap_init_callback_t *callback,
struct page *start, struct page *end, int zone,
unsigned long start_paddr, int highmem);
-
+#endif /* CONFIG_DISCONTIGMEM */
# endif /* !__ASSEMBLY__ */
/*
diff -Nur linux-2.4.20-base/include/asm-ia64/processor.h linux-2.4.20-dcm/include/asm-ia64/processor.h
--- linux-2.4.20-base/include/asm-ia64/processor.h Mon Mar 3 10:24:23 2003
+++ linux-2.4.20-dcm/include/asm-ia64/processor.h Wed Mar 12 13:41:15 2003
@@ -87,6 +87,9 @@
#include <asm/rse.h>
#include <asm/unwind.h>
#include <asm/atomic.h>
+#ifdef CONFIG_NUMA
+#include <asm/nodedata.h>
+#endif
/* like above but expressed as bitfields for more efficient access: */
struct ia64_psr {
@@ -187,9 +190,8 @@
} ipi;
#endif
#ifdef CONFIG_NUMA
- void *node_directory;
- int numa_node_id;
- struct cpuinfo_ia64 *cpu_data[NR_CPUS];
+ struct ia64_node_data *node_data;
+ int nodeid;
#endif
/* Platform specific word. MUST BE LAST IN STRUCT */
__u64 platform_specific;
@@ -201,23 +203,8 @@
*/
#define local_cpu_data ((struct cpuinfo_ia64 *) PERCPU_ADDR)
-/*
- * On NUMA systems, cpu_data for each cpu is allocated during cpu_init() & is allocated on
- * the node that contains the cpu. This minimizes off-node memory references. cpu_data
- * for each cpu contains an array of pointers to the cpu_data structures of each of the
- * other cpus.
- *
- * On non-NUMA systems, cpu_data is a static array allocated at compile time. References
- * to the cpu_data of another cpu is done by direct references to the appropriate entry of
- * the array.
- */
-#ifdef CONFIG_NUMA
-# define cpu_data(cpu) local_cpu_data->cpu_data[cpu]
-# define numa_node_id() (local_cpu_data->numa_node_id)
-#else
- extern struct cpuinfo_ia64 _cpu_data[NR_CPUS];
-# define cpu_data(cpu) (&_cpu_data[cpu])
-#endif
+extern struct cpuinfo_ia64 _cpu_data[NR_CPUS];
+#define cpu_data(cpu) (&_cpu_data[cpu])
extern void identify_cpu (struct cpuinfo_ia64 *);
extern void print_cpu_info (struct cpuinfo_ia64 *);
diff -Nur linux-2.4.20-base/include/asm-ia64/smp.h linux-2.4.20-dcm/include/asm-ia64/smp.h
--- linux-2.4.20-base/include/asm-ia64/smp.h Sat Nov 10 07:26:17 2001
+++ linux-2.4.20-dcm/include/asm-ia64/smp.h Wed Mar 12 13:41:36 2003
@@ -122,6 +122,8 @@
extern int smp_call_function_single (int cpuid, void (*func) (void *info), void *info,
int retry, int wait);
+extern void smp_build_cpu_map(void);
+
#endif /* CONFIG_SMP */
#endif /* _ASM_IA64_SMP_H */
diff -Nur linux-2.4.20-base/include/asm-ia64/topology.h linux-2.4.20-dcm/include/asm-ia64/topology.h
--- linux-2.4.20-base/include/asm-ia64/topology.h Thu Jan 1 09:00:00 1970
+++ linux-2.4.20-dcm/include/asm-ia64/topology.h Wed Mar 12 14:17:42 2003
@@ -0,0 +1,63 @@
+/*
+ * linux/include/asm-ia64/topology.h
+ *
+ * Copyright (C) 2002, Erich Focht, NEC
+ *
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+#ifndef _ASM_IA64_TOPOLOGY_H
+#define _ASM_IA64_TOPOLOGY_H
+
+#include <asm/acpi.h>
+#include <asm/numa.h>
+#include <asm/smp.h>
+
+#ifdef CONFIG_NUMA
+/*
+ * Returns the number of the node containing CPU 'cpu'
+ */
+#define __cpu_to_node(cpu) (int)(cpu_to_node_map[cpu])
+
+/*
+ * Returns a bitmask of CPUs on Node 'node'.
+ */
+#define __node_to_cpu_mask(node) (node_to_cpu_mask[node])
+
+#else
+#define __cpu_to_node(cpu) (0)
+#define __node_to_cpu_mask(node) (phys_cpu_present_map)
+#endif
+
+/*
+ * Returns the number of the node containing MemBlk 'memblk'
+ */
+#ifdef CONFIG_ACPI_NUMA
+#define __memblk_to_node(memblk) (node_memblk[memblk].nid)
+#else
+#define __memblk_to_node(memblk) (memblk)
+#endif
+
+/*
+ * Returns the number of the node containing Node 'nid'.
+ * Not implemented here. Multi-level hierarchies detected with
+ * the help of node_distance().
+ */
+#define __parent_node(nid) (nid)
+
+/*
+ * Returns the number of the first CPU on Node 'node'.
+ */
+#define __node_to_first_cpu(node) (__ffs(__node_to_cpu_mask(node)))
+
+/*
+ * Returns the number of the first MemBlk on Node 'node'
+ * Should be fixed when IA64 discontigmem goes in.
+ */
+#define __node_to_memblk(node) (node)
+
+#endif /* _ASM_IA64_TOPOLOGY_H */
diff -Nur linux-2.4.20-base/include/linux/acpi.h linux-2.4.20-dcm/include/linux/acpi.h
--- linux-2.4.20-base/include/linux/acpi.h Mon Mar 3 10:24:23 2003
+++ linux-2.4.20-dcm/include/linux/acpi.h Wed Mar 12 14:17:40 2003
@@ -344,6 +344,14 @@
void acpi_table_print (struct acpi_table_header *, unsigned long);
void acpi_table_print_madt_entry (acpi_table_entry_header *);
+#ifdef CONFIG_ACPI_NUMA
+int __init acpi_numa_init(void);
+void __init acpi_numa_slit_init (struct acpi_table_slit *);
+void __init acpi_numa_processor_affinity_init (struct acpi_table_processor_affinity *);
+void __init acpi_numa_memory_affinity_init (struct acpi_table_memory_affinity *);
+void __init acpi_numa_arch_fixup(void);
+#endif
+
#endif /*CONFIG_ACPI_BOOT*/
diff -Nur linux-2.4.20-base/include/linux/mmzone.h linux-2.4.20-dcm/include/linux/mmzone.h
--- linux-2.4.20-base/include/linux/mmzone.h Mon Mar 3 10:24:23 2003
+++ linux-2.4.20-dcm/include/linux/mmzone.h Wed Mar 12 14:17:42 2003
@@ -8,6 +8,12 @@
#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/wait.h>
+#ifdef CONFIG_DISCONTIGMEM
+#include <asm/numnodes.h>
+#endif
+#ifndef MAX_NUMNODES
+#define MAX_NUMNODES 1
+#endif
/*
* Free memory management - zoned buddy allocator.
@@ -212,6 +218,15 @@
#define for_each_zone(zone) \
for(zone = pgdat_list->node_zones; zone; zone = next_zone(zone))
+#ifdef CONFIG_NUMA
+#define MAX_NR_MEMBLKS BITS_PER_LONG /* Max number of Memory Blocks */
+#else /* !CONFIG_NUMA */
+#define MAX_NR_MEMBLKS 1
+#endif /* CONFIG_NUMA */
+
+#include <asm/topology.h>
+/* Returns the number of the current Node. */
+#define numa_node_id() (__cpu_to_node(smp_processor_id()))
#ifndef CONFIG_DISCONTIGMEM
diff -Nur linux-2.4.20-base/init/main.c linux-2.4.20-dcm/init/main.c
--- linux-2.4.20-base/init/main.c Mon Mar 3 10:24:23 2003
+++ linux-2.4.20-dcm/init/main.c Wed Mar 12 13:33:32 2003
@@ -290,6 +290,7 @@
extern void setup_arch(char **);
+extern void __init build_all_zonelists(void);
extern void cpu_idle(void);
unsigned long wait_init_idle;
@@ -360,6 +361,7 @@
lock_kernel();
printk(linux_banner);
setup_arch(&command_line);
+ build_all_zonelists();
printk("Kernel command line: %s\n", saved_command_line);
parse_options(command_line);
trap_init();
diff -Nur linux-2.4.20-base/mm/page_alloc.c linux-2.4.20-dcm/mm/page_alloc.c
--- linux-2.4.20-base/mm/page_alloc.c Mon Mar 3 10:24:23 2003
+++ linux-2.4.20-dcm/mm/page_alloc.c Mon Mar 3 10:55:12 2003
@@ -586,13 +586,41 @@
/*
* Builds allocation fallback zone lists.
*/
-static inline void build_zonelists(pg_data_t *pgdat)
+static int __init build_zonelists_node(pg_data_t *pgdat, zonelist_t *zonelist, int j, int k)
{
- int i, j, k;
+ switch (k) {
+ zone_t *zone;
+ default:
+ BUG();
+ case ZONE_HIGHMEM:
+ zone = pgdat->node_zones + ZONE_HIGHMEM;
+ if (zone->memsize) {
+#ifndef CONFIG_HIGHMEM
+ BUG();
+#endif
+ zonelist->zones[j++] = zone;
+ }
+ case ZONE_NORMAL:
+ zone = pgdat->node_zones + ZONE_NORMAL;
+ if (zone->memsize)
+ zonelist->zones[j++] = zone;
+ case ZONE_DMA:
+ zone = pgdat->node_zones + ZONE_DMA;
+ if (zone->memsize)
+ zonelist->zones[j++] = zone;
+ }
+ return j;
+}
+
+static void __init build_zonelists(pg_data_t *pgdat)
+{
+ int i, j, k, node, local_node;
+
+ local_node = pgdat->node_id;
+ printk("Building zonelist for node : %d\n", local_node);
for (i = 0; i <= GFP_ZONEMASK; i++) {
zonelist_t *zonelist;
- zone_t *zone;
zonelist = pgdat->node_zonelists + i;
memset(zonelist, 0, sizeof(*zonelist));
@@ -604,31 +632,30 @@
if (i & __GFP_DMA)
k = ZONE_DMA;
- switch (k) {
- default:
- BUG();
- /*
- * fallthrough:
- */
- case ZONE_HIGHMEM:
- zone = pgdat->node_zones + ZONE_HIGHMEM;
- if (zone->memsize) {
-#ifndef CONFIG_HIGHMEM
- BUG();
-#endif
- zonelist->zones[j++] = zone;
- }
- case ZONE_NORMAL:
- zone = pgdat->node_zones + ZONE_NORMAL;
- if (zone->memsize)
- zonelist->zones[j++] = zone;
- case ZONE_DMA:
- zone = pgdat->node_zones + ZONE_DMA;
- if (zone->memsize)
- zonelist->zones[j++] = zone;
- }
+ j = build_zonelists_node(pgdat, zonelist, j, k);
+ /*
+ * Now we build the zonelist so that it contains the zones
+ * of all the other nodes.
+ * We don't want to pressure a particular node, so when
+ * building the zones for node N, we make sure that the
+ * zones coming right after the local ones are those from
+ * node N+1 (modulo N)
+ */
+ for (node = local_node + 1; node < numnodes; node++)
+ j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
+ for (node = 0; node < local_node; node++)
+ j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
+
zonelist->zones[j++] = NULL;
- }
+ }
+}
+
+void __init build_all_zonelists(void)
+{
+ int i;
+
+ for(i = 0 ; i < numnodes ; i++)
+ build_zonelists(NODE_DATA(i));
}
/*
@@ -806,6 +833,7 @@
* up by free_all_bootmem() once the early boot process is
* done. Non-atomic initialization, single-pass.
*/
+
zone_start_paddr = MEMMAP_INIT(mem_map + offset,
mem_map + offset + size,
nid * MAX_NR_ZONES + j, zone_start_paddr,
@@ -850,7 +878,6 @@
(unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
}
}
- build_zonelists(pgdat);
}
void __init free_area_init(unsigned long *zones_size)