linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC git pull] "big box" x86 changes
@ 2008-04-26 18:55 Ingo Molnar
  2008-04-26 19:05 ` Stefan Richter
                   ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Ingo Molnar @ 2008-04-26 18:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Yinghai Lu, Yinghai Lu, jbarnes


Linus,

the following tree contains the "big box" topic commits of x86.git:

   git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox.git for-linus

Most of the work has been done by Yinghai Lu who has gone through a 
heroic effort to fix all the big-box bugs that he encountered on the 
vanilla Linux kernel on his various up to 256 GB RAM test-systems. Also 
work is included from Ying Huang for those insane SGI UV boxes.

Most of these patches have been carried in x86.git since around v2.6.23 
so they have a good track record in terms of "practical" stability. They 
have not been problem-free - the tree reflects most of those iterations 
and fixes that happened in that ~6 months timeframe of testing. They do 
solve a boatload of problems and inefficiencies on those systems.

Due to the nature of this topic the commits are all across bootmem, 
driver core and other subsystems - but most of them (obviously) affect 
arch/x86 - which is why they have been carried there.

The sub-topic that looks most scary in this lot are the mmconf changes 
that start with this one:

  Robert Hancock (1):
        x86: validate against acpi motherboard resources

Note that these are not the same changes that you rejected in the past, 
Yinghai has done a number of delta patches to that to make it more 
practical and palatable ... i hope. It's still ... somewhat problematic 
and might still be rejectable.

Another one is the mmconfig enablement in the CPU on Family 10 Opterons 
is turned into an optional, DMI-driven thing in one of the later patches 
so that is a lot less scary than it looks - it's still a generic PC and 
thus tons better than all those crazy subarch hacks that people ended up 
doing.

So we need a bit of help wrt. how mergable this is, and what else is 
needed to make it mergable. These changes have been booted all across 
the x86 spectrum, small and large boxes alike, so i'd be seriously 
surprised if they caused any widespread breakage.

	Ingo

------------------>
Huang, Ying (4):
      x86, boot: add free_early to early reservation machanism
      x86, boot: add linked list of struct setup_data
      x86, boot: export linked list of struct setup_data via debugfs
      x86, boot: Document for linked list of struct setup_data

Ingo Molnar (2):
      x86: fix k8-bus_64.c build
      x86: sanity check gart for buggy device, fix

Robert Hancock (1):
      x86: validate against acpi motherboard resources

Yinghai Lu (39):
      mm: make mem_map allocation continuous
      mm: fix alloc_bootmem_core to use fast searching for all nodes
      x86: clear pci_mmcfg_virt when mmcfg get rejected
      x86: mmconf enable mcfg early
      x86_64: set cfg_size for AMD Family 10h in case MMCONFIG
      x86_64: check and enable MMCONFIG for AMD Family 10h
      x86_64: check MSR to get MMCONFIG for AMD Family 10h
      x86: if acpi=off, force setting the mmconf for fam10h
      x86: seperate mmconf for fam10h out from setup_64.c
      driver core: try parent numa_node at first before using default
      x86: skip it if Fam 10h only handle bus 0
      ide: use dev_to_node instead of pcibus_to_node
      x86: remove unneeded check in mmconf reject
      mm: offset align in alloc_bootmem()
      mm: allow reserve_bootmem() cross nodes
      x86_64: make reserve_bootmem_generic() use new reserve_bootmem()
      x86_64: fix setup_node_bootmem to support big mem excluding with memmap
      x86 pci: remove checking type for mmconfig probe
      x86: change pci_direct_conf1 back not static
      x86: get mp_bus_to_node early
      x86: use bus conf in NB conf fun1 to get bus range on, on 64-bit
      x86: multi pci root bus with different io resource range, on 64-bit
      x86/acpi: make dev_to_node return online node
      x86: double check the multi root bus with fam10h mmconf
      x86/pci: add pci=skip_isa_align command lines.
      net: use numa_node in net_devcice->dev instead of parent
      x86_64: don't need set default res if only have one root bus
      x86_64/mm: check and print vmemmap allocation continuous
      x86_64/mm: check and print vmemmap allocation continuous -fix
      acpi: get boot_cpu_id as early for k8_scan_nodes
      x86: work around io allocation overlap of HT links
      x86: agp_gart size checking for buggy device
      x86: checking aperture size order
      x86: add pci=check_enable_amd_mmconf and dmi check
      x86 PCI: call dmi_check_pciprobe()
      x86_64: allocate gart aperture from 512M
      x86: don't call pxm_to_node again
      x86: clean up aperture_64.c
      x86: reserve dma32 early for gart -fix

 Documentation/i386/boot.txt         |   26 ++
 Documentation/kernel-parameters.txt |    2 +
 arch/x86/boot/header.S              |    6 +-
 arch/x86/kernel/Makefile            |    2 +
 arch/x86/kernel/acpi/boot.c         |   70 +++++
 arch/x86/kernel/aperture_64.c       |  283 ++++++++++++------
 arch/x86/kernel/e820_64.c           |   35 ++-
 arch/x86/kernel/head64.c            |   20 ++
 arch/x86/kernel/kdebugfs.c          |  163 ++++++++++-
 arch/x86/kernel/mmconf-fam10h_64.c  |  243 +++++++++++++++
 arch/x86/kernel/pci-dma.c           |   11 +-
 arch/x86/kernel/setup_64.c          |   47 +++-
 arch/x86/mm/init_64.c               |   38 ++-
 arch/x86/mm/k8topology_64.c         |   38 +++-
 arch/x86/mm/numa_64.c               |   43 +++-
 arch/x86/pci/Makefile_32            |    1 +
 arch/x86/pci/Makefile_64            |    2 +-
 arch/x86/pci/acpi.c                 |   81 ++----
 arch/x86/pci/common.c               |   76 +++++-
 arch/x86/pci/direct.c               |    8 +-
 arch/x86/pci/fixup.c                |   17 +
 arch/x86/pci/init.c                 |   17 +-
 arch/x86/pci/irq.c                  |    4 +-
 arch/x86/pci/k8-bus_64.c            |  580 +++++++++++++++++++++++++++++++----
 arch/x86/pci/legacy.c               |    4 +-
 arch/x86/pci/mmconfig-shared.c      |  252 +++++++++++++--
 arch/x86/pci/mmconfig_32.c          |    4 +
 arch/x86/pci/mmconfig_64.c          |   22 ++-
 arch/x86/pci/mp_bus_to_node.c       |   23 ++
 arch/x86/pci/pci.h                  |    6 +-
 drivers/acpi/bus.c                  |    2 +
 drivers/base/core.c                 |   14 +-
 drivers/char/agp/amd64-agp.c        |   21 +-
 drivers/pci/probe.c                 |   21 ++-
 include/asm-x86/bootparam.h         |   14 +
 include/asm-x86/e820_64.h           |    3 +-
 include/asm-x86/pci.h               |    2 +
 include/asm-x86/topology.h          |   16 +
 include/linux/acpi.h                |    5 +
 include/linux/ide.h                 |    3 +-
 include/linux/mm.h                  |    1 +
 include/linux/pci.h                 |   11 +-
 mm/bootmem.c                        |  157 +++++++---
 mm/sparse.c                         |   37 ++-
 net/core/skbuff.c                   |    2 +-
 45 files changed, 2071 insertions(+), 362 deletions(-)
 create mode 100644 arch/x86/kernel/mmconf-fam10h_64.c
 create mode 100644 arch/x86/pci/mp_bus_to_node.c

diff --git a/Documentation/i386/boot.txt b/Documentation/i386/boot.txt
index 2eb1610..0fac346 100644
--- a/Documentation/i386/boot.txt
+++ b/Documentation/i386/boot.txt
@@ -42,6 +42,8 @@ Protocol 2.05:	(Kernel 2.6.20) Make protected mode kernel relocatable.
 Protocol 2.06:	(Kernel 2.6.22) Added a field that contains the size of
 		the boot command line
 
+Protocol 2.09:	(kernel 2.6.26) Added a field of 64-bit physical
+		pointer to single linked list of struct	setup_data.
 
 **** MEMORY LAYOUT
 
@@ -172,6 +174,8 @@ Offset	Proto	Name		Meaning
 0240/8	2.07+	hardware_subarch_data Subarchitecture-specific data
 0248/4	2.08+	payload_offset	Offset of kernel payload
 024C/4	2.08+	payload_length	Length of kernel payload
+0250/8	2.09+	setup_data	64-bit physical pointer to linked list
+				of struct setup_data
 
 (1) For backwards compatibility, if the setup_sects field contains 0, the
     real value is 4.
@@ -572,6 +576,28 @@ command line is entered using the following protocol:
 	covered by setup_move_size, so you may need to adjust this
 	field.
 
+Field name:	setup_data
+Type:		write (obligatory)
+Offset/size:	0x250/8
+Protocol:	2.09+
+
+  The 64-bit physical pointer to NULL terminated single linked list of
+  struct setup_data. This is used to define a more extensible boot
+  parameters passing mechanism. The definition of struct setup_data is
+  as follow:
+
+  struct setup_data {
+	  u64 next;
+	  u32 type;
+	  u32 len;
+	  u8  data[0];
+  };
+
+  Where, the next is a 64-bit physical pointer to the next node of
+  linked list, the next field of the last node is 0; the type is used
+  to identify the contents of data; the len is the length of data
+  field; the data holds the real payload.
+
 
 **** MEMORY LAYOUT OF THE REAL-MODE CODE
 
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index bf6303e..e0101d9 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1516,6 +1516,8 @@ and is between 256 and 4096 characters. It is defined in the file
 				This is normally done in pci_enable_device(),
 				so this option is a temporary workaround
 				for broken drivers that don't call it.
+		skip_isa_align	[X86] do not align io start addr, so can
+				handle more pci cards
 		firmware	[ARM] Do not re-enumerate the bus but instead
 				just use the configuration from the
 				bootloader. This is currently used on
diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S
index 6d2df8d..af86e43 100644
--- a/arch/x86/boot/header.S
+++ b/arch/x86/boot/header.S
@@ -120,7 +120,7 @@ _start:
 	# Part 2 of the header, from the old setup.S
 
 		.ascii	"HdrS"		# header signature
-		.word	0x0208		# header version number (>= 0x0105)
+		.word	0x0209		# header version number (>= 0x0105)
 					# or else old loadlin-1.5 will fail)
 		.globl realmode_swtch
 realmode_swtch:	.word	0, 0		# default_switch, SETUPSEG
@@ -227,6 +227,10 @@ hardware_subarch_data:	.quad 0
 payload_offset:		.long input_data
 payload_length:		.long input_data_end-input_data
 
+setup_data:		.quad 0			# 64-bit physical pointer to
+						# single linked list of
+						# struct setup_data
+
 # End of setup header #####################################################
 
 	.section ".inittext", "ax"
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 90e092d..815b650 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -99,4 +99,6 @@ ifeq ($(CONFIG_X86_64),y)
         obj-$(CONFIG_GART_IOMMU)	+= pci-gart_64.o aperture_64.o
         obj-$(CONFIG_CALGARY_IOMMU)	+= pci-calgary_64.o tce_64.o
         obj-$(CONFIG_SWIOTLB)		+= pci-swiotlb_64.o
+
+        obj-$(CONFIG_PCI_MMCONFIG)	+= mmconf-fam10h_64.o
 endif
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 977ed5c..c49ebcc 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -771,6 +771,32 @@ static void __init acpi_register_lapic_address(unsigned long address)
 		boot_cpu_physical_apicid  = GET_APIC_ID(read_apic_id());
 }
 
+static int __init early_acpi_parse_madt_lapic_addr_ovr(void)
+{
+	int count;
+
+	if (!cpu_has_apic)
+		return -ENODEV;
+
+	/*
+	 * Note that the LAPIC address is obtained from the MADT (32-bit value)
+	 * and (optionally) overriden by a LAPIC_ADDR_OVR entry (64-bit value).
+	 */
+
+	count =
+	    acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_APIC_OVERRIDE,
+				  acpi_parse_lapic_addr_ovr, 0);
+	if (count < 0) {
+		printk(KERN_ERR PREFIX
+		       "Error parsing LAPIC address override entry\n");
+		return count;
+	}
+
+	acpi_register_lapic_address(acpi_lapic_addr);
+
+	return count;
+}
+
 static int __init acpi_parse_madt_lapic_entries(void)
 {
 	int count;
@@ -901,6 +927,33 @@ static inline int acpi_parse_madt_ioapic_entries(void)
 }
 #endif	/* !CONFIG_X86_IO_APIC */
 
+static void __init early_acpi_process_madt(void)
+{
+#ifdef CONFIG_X86_LOCAL_APIC
+	int error;
+
+	if (!acpi_table_parse(ACPI_SIG_MADT, acpi_parse_madt)) {
+
+		/*
+		 * Parse MADT LAPIC entries
+		 */
+		error = early_acpi_parse_madt_lapic_addr_ovr();
+		if (!error) {
+			acpi_lapic = 1;
+			smp_found_config = 1;
+		}
+		if (error == -EINVAL) {
+			/*
+			 * Dell Precision Workstation 410, 610 come here.
+			 */
+			printk(KERN_ERR PREFIX
+			       "Invalid BIOS MADT, disabling ACPI\n");
+			disable_acpi();
+		}
+	}
+#endif
+}
+
 static void __init acpi_process_madt(void)
 {
 #ifdef CONFIG_X86_LOCAL_APIC
@@ -1233,6 +1286,23 @@ int __init acpi_boot_table_init(void)
 	return 0;
 }
 
+int __init early_acpi_boot_init(void)
+{
+	/*
+	 * If acpi_disabled, bail out
+	 * One exception: acpi=ht continues far enough to enumerate LAPICs
+	 */
+	if (acpi_disabled && !acpi_ht)
+		return 1;
+
+	/*
+	 * Process the Multiple APIC Description Table (MADT), if present
+	 */
+	early_acpi_process_madt();
+
+	return 0;
+}
+
 int __init acpi_boot_init(void)
 {
 	/*
diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
index 479926d..02f4dba 100644
--- a/arch/x86/kernel/aperture_64.c
+++ b/arch/x86/kernel/aperture_64.c
@@ -35,6 +35,18 @@ int fallback_aper_force __initdata;
 
 int fix_aperture __initdata = 1;
 
+struct bus_dev_range {
+	int bus;
+	int dev_base;
+	int dev_limit;
+};
+
+static struct bus_dev_range bus_dev_ranges[] __initdata = {
+	{ 0x00, 0x18, 0x20},
+	{ 0xff, 0x00, 0x20},
+	{ 0xfe, 0x00, 0x20}
+};
+
 static struct resource gart_resource = {
 	.name	= "GART",
 	.flags	= IORESOURCE_MEM,
@@ -55,8 +67,9 @@ static u32 __init allocate_aperture(void)
 	u32 aper_size;
 	void *p;
 
-	if (fallback_aper_order > 7)
-		fallback_aper_order = 7;
+	/* aper_size should <= 1G */
+	if (fallback_aper_order > 5)
+		fallback_aper_order = 5;
 	aper_size = (32 * 1024 * 1024) << fallback_aper_order;
 
 	/*
@@ -65,7 +78,20 @@ static u32 __init allocate_aperture(void)
 	 * memory. Unfortunately we cannot move it up because that would
 	 * make the IOMMU useless.
 	 */
-	p = __alloc_bootmem_nopanic(aper_size, aper_size, 0);
+	/*
+	 * using 512M as goal, in case kexec will load kernel_big
+	 * that will do the on position decompress, and  could overlap with
+	 * that positon with gart that is used.
+	 * sequende:
+	 * kernel_small
+	 * ==> kexec (with kdump trigger path or previous doesn't shutdown gart)
+	 * ==> kernel_small(gart area become e820_reserved)
+	 * ==> kexec (with kdump trigger path or previous doesn't shutdown gart)
+	 * ==> kerne_big (uncompressed size will be big than 64M or 128M)
+	 * so don't use 512M below as gart iommu, leave the space for kernel
+	 * code for safe
+	 */
+	p = __alloc_bootmem_nopanic(aper_size, aper_size, 512ULL<<20);
 	if (!p || __pa(p)+aper_size > 0xffffffff) {
 		printk(KERN_ERR
 			"Cannot allocate aperture memory hole (%p,%uK)\n",
@@ -83,7 +109,7 @@ static u32 __init allocate_aperture(void)
 	return (u32)__pa(p);
 }
 
-static int __init aperture_valid(u64 aper_base, u32 aper_size)
+static int __init aperture_valid(u64 aper_base, u32 aper_size, u32 min_size)
 {
 	if (!aper_base)
 		return 0;
@@ -96,8 +122,9 @@ static int __init aperture_valid(u64 aper_base, u32 aper_size)
 		printk(KERN_ERR "Aperture pointing to e820 RAM. Ignoring.\n");
 		return 0;
 	}
-	if (aper_size < 64*1024*1024) {
-		printk(KERN_ERR "Aperture too small (%d MB)\n", aper_size>>20);
+	if (aper_size < min_size) {
+		printk(KERN_ERR "Aperture too small (%d MB) than (%d MB)\n",
+				 aper_size>>20, min_size>>20);
 		return 0;
 	}
 
@@ -105,47 +132,51 @@ static int __init aperture_valid(u64 aper_base, u32 aper_size)
 }
 
 /* Find a PCI capability */
-static __u32 __init find_cap(int num, int slot, int func, int cap)
+static __u32 __init find_cap(int bus, int slot, int func, int cap)
 {
 	int bytes;
 	u8 pos;
 
-	if (!(read_pci_config_16(num, slot, func, PCI_STATUS) &
+	if (!(read_pci_config_16(bus, slot, func, PCI_STATUS) &
 						PCI_STATUS_CAP_LIST))
 		return 0;
 
-	pos = read_pci_config_byte(num, slot, func, PCI_CAPABILITY_LIST);
+	pos = read_pci_config_byte(bus, slot, func, PCI_CAPABILITY_LIST);
 	for (bytes = 0; bytes < 48 && pos >= 0x40; bytes++) {
 		u8 id;
 
 		pos &= ~3;
-		id = read_pci_config_byte(num, slot, func, pos+PCI_CAP_LIST_ID);
+		id = read_pci_config_byte(bus, slot, func, pos+PCI_CAP_LIST_ID);
 		if (id == 0xff)
 			break;
 		if (id == cap)
 			return pos;
-		pos = read_pci_config_byte(num, slot, func,
+		pos = read_pci_config_byte(bus, slot, func,
 						pos+PCI_CAP_LIST_NEXT);
 	}
 	return 0;
 }
 
 /* Read a standard AGPv3 bridge header */
-static __u32 __init read_agp(int num, int slot, int func, int cap, u32 *order)
+static __u32 __init read_agp(int bus, int slot, int func, int cap, u32 *order)
 {
 	u32 apsize;
 	u32 apsizereg;
 	int nbits;
 	u32 aper_low, aper_hi;
 	u64 aper;
+	u32 old_order;
 
-	printk(KERN_INFO "AGP bridge at %02x:%02x:%02x\n", num, slot, func);
-	apsizereg = read_pci_config_16(num, slot, func, cap + 0x14);
+	printk(KERN_INFO "AGP bridge at %02x:%02x:%02x\n", bus, slot, func);
+	apsizereg = read_pci_config_16(bus, slot, func, cap + 0x14);
 	if (apsizereg == 0xffffffff) {
 		printk(KERN_ERR "APSIZE in AGP bridge unreadable\n");
 		return 0;
 	}
 
+	/* old_order could be the value from NB gart setting */
+	old_order = *order;
+
 	apsize = apsizereg & 0xfff;
 	/* Some BIOS use weird encodings not in the AGPv3 table. */
 	if (apsize & 0xff)
@@ -155,14 +186,26 @@ static __u32 __init read_agp(int num, int slot, int func, int cap, u32 *order)
 	if ((int)*order < 0) /* < 32MB */
 		*order = 0;
 
-	aper_low = read_pci_config(num, slot, func, 0x10);
-	aper_hi = read_pci_config(num, slot, func, 0x14);
+	aper_low = read_pci_config(bus, slot, func, 0x10);
+	aper_hi = read_pci_config(bus, slot, func, 0x14);
 	aper = (aper_low & ~((1<<22)-1)) | ((u64)aper_hi << 32);
 
+	/*
+	 * On some sick chips, APSIZE is 0. It means it wants 4G
+	 * so let double check that order, and lets trust AMD NB settings:
+	 */
+	printk(KERN_INFO "Aperture from AGP @ %Lx old size %u MB\n",
+			aper, 32 << old_order);
+	if (aper + (32ULL<<(20 + *order)) > 0x100000000ULL) {
+		printk(KERN_INFO "Aperture size %u MB (APSIZE %x) is not right, using settings from NB\n",
+				32 << *order, apsizereg);
+		*order = old_order;
+	}
+
 	printk(KERN_INFO "Aperture from AGP @ %Lx size %u MB (APSIZE %x)\n",
 			aper, 32 << *order, apsizereg);
 
-	if (!aperture_valid(aper, (32*1024*1024) << *order))
+	if (!aperture_valid(aper, (32*1024*1024) << *order, 32<<20))
 		return 0;
 	return (u32)aper;
 }
@@ -182,15 +225,15 @@ static __u32 __init read_agp(int num, int slot, int func, int cap, u32 *order)
  */
 static __u32 __init search_agp_bridge(u32 *order, int *valid_agp)
 {
-	int num, slot, func;
+	int bus, slot, func;
 
 	/* Poor man's PCI discovery */
-	for (num = 0; num < 256; num++) {
+	for (bus = 0; bus < 256; bus++) {
 		for (slot = 0; slot < 32; slot++) {
 			for (func = 0; func < 8; func++) {
 				u32 class, cap;
 				u8 type;
-				class = read_pci_config(num, slot, func,
+				class = read_pci_config(bus, slot, func,
 							PCI_CLASS_REVISION);
 				if (class == 0xffffffff)
 					break;
@@ -199,17 +242,17 @@ static __u32 __init search_agp_bridge(u32 *order, int *valid_agp)
 				case PCI_CLASS_BRIDGE_HOST:
 				case PCI_CLASS_BRIDGE_OTHER: /* needed? */
 					/* AGP bridge? */
-					cap = find_cap(num, slot, func,
+					cap = find_cap(bus, slot, func,
 							PCI_CAP_ID_AGP);
 					if (!cap)
 						break;
 					*valid_agp = 1;
-					return read_agp(num, slot, func, cap,
+					return read_agp(bus, slot, func, cap,
 							order);
 				}
 
 				/* No multi-function device? */
-				type = read_pci_config_byte(num, slot, func,
+				type = read_pci_config_byte(bus, slot, func,
 							       PCI_HEADER_TYPE);
 				if (!(type & 0x80))
 					break;
@@ -249,38 +292,49 @@ void __init early_gart_iommu_check(void)
 	 * or BIOS forget to put that in reserved.
 	 * try to update e820 to make that region as reserved.
 	 */
-	int fix, num;
+	int fix, slot;
 	u32 ctl;
 	u32 aper_size = 0, aper_order = 0, last_aper_order = 0;
 	u64 aper_base = 0, last_aper_base = 0;
 	int aper_enabled = 0, last_aper_enabled = 0;
+	int i;
 
 	if (!early_pci_allowed())
 		return;
 
 	fix = 0;
-	for (num = 24; num < 32; num++) {
-		if (!early_is_k8_nb(read_pci_config(0, num, 3, 0x00)))
-			continue;
-
-		ctl = read_pci_config(0, num, 3, 0x90);
-		aper_enabled = ctl & 1;
-		aper_order = (ctl >> 1) & 7;
-		aper_size = (32 * 1024 * 1024) << aper_order;
-		aper_base = read_pci_config(0, num, 3, 0x94) & 0x7fff;
-		aper_base <<= 25;
-
-		if ((last_aper_order && aper_order != last_aper_order) ||
-		    (last_aper_base && aper_base != last_aper_base) ||
-		    (last_aper_enabled && aper_enabled != last_aper_enabled)) {
-			fix = 1;
-			break;
+	for (i = 0; i < ARRAY_SIZE(bus_dev_ranges); i++) {
+		int bus;
+		int dev_base, dev_limit;
+
+		bus = bus_dev_ranges[i].bus;
+		dev_base = bus_dev_ranges[i].dev_base;
+		dev_limit = bus_dev_ranges[i].dev_limit;
+
+		for (slot = dev_base; slot < dev_limit; slot++) {
+			if (!early_is_k8_nb(read_pci_config(bus, slot, 3, 0x00)))
+				continue;
+
+			ctl = read_pci_config(bus, slot, 3, AMD64_GARTAPERTURECTL);
+			aper_enabled = ctl & AMD64_GARTEN;
+			aper_order = (ctl >> 1) & 7;
+			aper_size = (32 * 1024 * 1024) << aper_order;
+			aper_base = read_pci_config(bus, slot, 3, AMD64_GARTAPERTUREBASE) & 0x7fff;
+			aper_base <<= 25;
+
+			if ((last_aper_order && aper_order != last_aper_order) ||
+			    (last_aper_base && aper_base != last_aper_base) ||
+			    (last_aper_enabled && aper_enabled != last_aper_enabled)) {
+				fix = 1;
+				goto out;
+			}
+			last_aper_order = aper_order;
+			last_aper_base = aper_base;
+			last_aper_enabled = aper_enabled;
 		}
-		last_aper_order = aper_order;
-		last_aper_base = aper_base;
-		last_aper_enabled = aper_enabled;
 	}
 
+out:
 	if (!fix && !aper_enabled)
 		return;
 
@@ -288,8 +342,8 @@ void __init early_gart_iommu_check(void)
 		fix = 1;
 
 	if (gart_fix_e820 && !fix && aper_enabled) {
-		if (e820_any_mapped(aper_base, aper_base + aper_size,
-				    E820_RAM)) {
+		if (!e820_all_mapped(aper_base, aper_base + aper_size,
+				    E820_RESERVED)) {
 			/* reserved it, so we can resuse it in second kernel */
 			printk(KERN_INFO "update e820 for GART\n");
 			add_memory_region(aper_base, aper_size, E820_RESERVED);
@@ -299,23 +353,35 @@ void __init early_gart_iommu_check(void)
 	}
 
 	/* different nodes have different setting, disable them all at first*/
-	for (num = 24; num < 32; num++) {
-		if (!early_is_k8_nb(read_pci_config(0, num, 3, 0x00)))
-			continue;
+	for (i = 0; i < ARRAY_SIZE(bus_dev_ranges); i++) {
+		int bus;
+		int dev_base, dev_limit;
+
+		bus = bus_dev_ranges[i].bus;
+		dev_base = bus_dev_ranges[i].dev_base;
+		dev_limit = bus_dev_ranges[i].dev_limit;
 
-		ctl = read_pci_config(0, num, 3, 0x90);
-		ctl &= ~1;
-		write_pci_config(0, num, 3, 0x90, ctl);
+		for (slot = dev_base; slot < dev_limit; slot++) {
+			if (!early_is_k8_nb(read_pci_config(bus, slot, 3, 0x00)))
+				continue;
+
+			ctl = read_pci_config(bus, slot, 3, AMD64_GARTAPERTURECTL);
+			ctl &= ~AMD64_GARTEN;
+			write_pci_config(bus, slot, 3, AMD64_GARTAPERTURECTL, ctl);
+		}
 	}
 
 }
 
+static int __initdata printed_gart_size_msg;
+
 void __init gart_iommu_hole_init(void)
 {
+	u32 agp_aper_base = 0, agp_aper_order = 0;
 	u32 aper_size, aper_alloc = 0, aper_order = 0, last_aper_order = 0;
 	u64 aper_base, last_aper_base = 0;
-	int fix, num, valid_agp = 0;
-	int node;
+	int fix, slot, valid_agp = 0;
+	int i, node;
 
 	if (gart_iommu_aperture_disabled || !fix_aperture ||
 	    !early_pci_allowed())
@@ -323,38 +389,63 @@ void __init gart_iommu_hole_init(void)
 
 	printk(KERN_INFO  "Checking aperture...\n");
 
+	if (!fallback_aper_force)
+		agp_aper_base = search_agp_bridge(&agp_aper_order, &valid_agp);
+
 	fix = 0;
 	node = 0;
-	for (num = 24; num < 32; num++) {
-		if (!early_is_k8_nb(read_pci_config(0, num, 3, 0x00)))
-			continue;
-
-		iommu_detected = 1;
-		gart_iommu_aperture = 1;
-
-		aper_order = (read_pci_config(0, num, 3, 0x90) >> 1) & 7;
-		aper_size = (32 * 1024 * 1024) << aper_order;
-		aper_base = read_pci_config(0, num, 3, 0x94) & 0x7fff;
-		aper_base <<= 25;
-
-		printk(KERN_INFO "Node %d: aperture @ %Lx size %u MB\n",
-				node, aper_base, aper_size >> 20);
-		node++;
-
-		if (!aperture_valid(aper_base, aper_size)) {
-			fix = 1;
-			break;
-		}
+	for (i = 0; i < ARRAY_SIZE(bus_dev_ranges); i++) {
+		int bus;
+		int dev_base, dev_limit;
+
+		bus = bus_dev_ranges[i].bus;
+		dev_base = bus_dev_ranges[i].dev_base;
+		dev_limit = bus_dev_ranges[i].dev_limit;
+
+		for (slot = dev_base; slot < dev_limit; slot++) {
+			if (!early_is_k8_nb(read_pci_config(bus, slot, 3, 0x00)))
+				continue;
+
+			iommu_detected = 1;
+			gart_iommu_aperture = 1;
+
+			aper_order = (read_pci_config(bus, slot, 3, AMD64_GARTAPERTURECTL) >> 1) & 7;
+			aper_size = (32 * 1024 * 1024) << aper_order;
+			aper_base = read_pci_config(bus, slot, 3, AMD64_GARTAPERTUREBASE) & 0x7fff;
+			aper_base <<= 25;
+
+			printk(KERN_INFO "Node %d: aperture @ %Lx size %u MB\n",
+					node, aper_base, aper_size >> 20);
+			node++;
+
+			if (!aperture_valid(aper_base, aper_size, 64<<20)) {
+				if (valid_agp && agp_aper_base &&
+				    agp_aper_base == aper_base &&
+				    agp_aper_order == aper_order) {
+					/* the same between two setting from NB and agp */
+					if (!no_iommu && end_pfn > MAX_DMA32_PFN && !printed_gart_size_msg) {
+						printk(KERN_ERR "you are using iommu with agp, but GART size is less than 64M\n");
+						printk(KERN_ERR "please increase GART size in your BIOS setup\n");
+						printk(KERN_ERR "if BIOS doesn't have that option, contact your HW vendor!\n");
+						printed_gart_size_msg = 1;
+					}
+				} else {
+					fix = 1;
+					goto out;
+				}
+			}
 
-		if ((last_aper_order && aper_order != last_aper_order) ||
-		    (last_aper_base && aper_base != last_aper_base)) {
-			fix = 1;
-			break;
+			if ((last_aper_order && aper_order != last_aper_order) ||
+			    (last_aper_base && aper_base != last_aper_base)) {
+				fix = 1;
+				goto out;
+			}
+			last_aper_order = aper_order;
+			last_aper_base = aper_base;
 		}
-		last_aper_order = aper_order;
-		last_aper_base = aper_base;
 	}
 
+out:
 	if (!fix && !fallback_aper_force) {
 		if (last_aper_base) {
 			unsigned long n = (32 * 1024 * 1024) << last_aper_order;
@@ -364,8 +455,10 @@ void __init gart_iommu_hole_init(void)
 		return;
 	}
 
-	if (!fallback_aper_force)
-		aper_alloc = search_agp_bridge(&aper_order, &valid_agp);
+	if (!fallback_aper_force) {
+		aper_alloc = agp_aper_base;
+		aper_order = agp_aper_order;
+	}
 
 	if (aper_alloc) {
 		/* Got the aperture from the AGP bridge */
@@ -401,16 +494,22 @@ void __init gart_iommu_hole_init(void)
 	}
 
 	/* Fix up the north bridges */
-	for (num = 24; num < 32; num++) {
-		if (!early_is_k8_nb(read_pci_config(0, num, 3, 0x00)))
-			continue;
-
-		/*
-		 * Don't enable translation yet. That is done later.
-		 * Assume this BIOS didn't initialise the GART so
-		 * just overwrite all previous bits
-		 */
-		write_pci_config(0, num, 3, 0x90, aper_order<<1);
-		write_pci_config(0, num, 3, 0x94, aper_alloc>>25);
+	for (i = 0; i < ARRAY_SIZE(bus_dev_ranges); i++) {
+		int bus;
+		int dev_base, dev_limit;
+
+		bus = bus_dev_ranges[i].bus;
+		dev_base = bus_dev_ranges[i].dev_base;
+		dev_limit = bus_dev_ranges[i].dev_limit;
+		for (slot = dev_base; slot < dev_limit; slot++) {
+			if (!early_is_k8_nb(read_pci_config(bus, slot, 3, 0x00)))
+				continue;
+
+			/* Don't enable translation yet. That is done later.
+			   Assume this BIOS didn't initialise the GART so
+			   just overwrite all previous bits */
+			write_pci_config(bus, slot, 3, AMD64_GARTAPERTURECTL, aper_order << 1);
+			write_pci_config(bus, slot, 3, AMD64_GARTAPERTUREBASE, aper_alloc >> 25);
+		}
 	}
 }
diff --git a/arch/x86/kernel/e820_64.c b/arch/x86/kernel/e820_64.c
index cbd42e5..645ee5e 100644
--- a/arch/x86/kernel/e820_64.c
+++ b/arch/x86/kernel/e820_64.c
@@ -84,14 +84,41 @@ void __init reserve_early(unsigned long start, unsigned long end, char *name)
 		strncpy(r->name, name, sizeof(r->name) - 1);
 }
 
-void __init early_res_to_bootmem(void)
+void __init free_early(unsigned long start, unsigned long end)
+{
+	struct early_res *r;
+	int i, j;
+
+	for (i = 0; i < MAX_EARLY_RES && early_res[i].end; i++) {
+		r = &early_res[i];
+		if (start == r->start && end == r->end)
+			break;
+	}
+	if (i >= MAX_EARLY_RES || !early_res[i].end)
+		panic("free_early on not reserved area: %lx-%lx!", start, end);
+
+	for (j = i + 1; j < MAX_EARLY_RES && early_res[j].end; j++)
+		;
+
+	memcpy(&early_res[i], &early_res[i + 1],
+	       (j - 1 - i) * sizeof(struct early_res));
+
+	early_res[j - 1].end = 0;
+}
+
+void __init early_res_to_bootmem(unsigned long start, unsigned long end)
 {
 	int i;
+	unsigned long final_start, final_end;
 	for (i = 0; i < MAX_EARLY_RES && early_res[i].end; i++) {
 		struct early_res *r = &early_res[i];
-		printk(KERN_INFO "early res: %d [%lx-%lx] %s\n", i,
-			r->start, r->end - 1, r->name);
-		reserve_bootmem_generic(r->start, r->end - r->start);
+		final_start = max(start, r->start);
+		final_end = min(end, r->end);
+		if (final_start >= final_end)
+			continue;
+		printk(KERN_INFO "  early res: %d [%lx-%lx] %s\n", i,
+			final_start, final_end - 1, r->name);
+		reserve_bootmem_generic(final_start, final_end - final_start);
 	}
 }
 
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index d31d6b7..e25c57b 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -11,6 +11,7 @@
 #include <linux/string.h>
 #include <linux/percpu.h>
 #include <linux/start_kernel.h>
+#include <linux/io.h>
 
 #include <asm/processor.h>
 #include <asm/proto.h>
@@ -100,6 +101,24 @@ static void __init reserve_ebda_region(void)
 	reserve_early(lowmem, 0x100000, "BIOS reserved");
 }
 
+static void __init reserve_setup_data(void)
+{
+	struct setup_data *data;
+	unsigned long pa_data;
+	char buf[32];
+
+	if (boot_params.hdr.version < 0x0209)
+		return;
+	pa_data = boot_params.hdr.setup_data;
+	while (pa_data) {
+		data = early_ioremap(pa_data, sizeof(*data));
+		sprintf(buf, "setup data %x", data->type);
+		reserve_early(pa_data, pa_data+sizeof(*data)+data->len, buf);
+		pa_data = data->next;
+		early_iounmap(data, sizeof(*data));
+	}
+}
+
 void __init x86_64_start_kernel(char * real_mode_data)
 {
 	int i;
@@ -156,6 +175,7 @@ void __init x86_64_start_kernel(char * real_mode_data)
 #endif
 
 	reserve_ebda_region();
+	reserve_setup_data();
 
 	/*
 	 * At this point everything still needed from the boot loader
diff --git a/arch/x86/kernel/kdebugfs.c b/arch/x86/kernel/kdebugfs.c
index 7335430..c032059 100644
--- a/arch/x86/kernel/kdebugfs.c
+++ b/arch/x86/kernel/kdebugfs.c
@@ -6,23 +6,171 @@
  *
  * This file is released under the GPLv2.
  */
-
 #include <linux/debugfs.h>
+#include <linux/uaccess.h>
 #include <linux/stat.h>
 #include <linux/init.h>
+#include <linux/io.h>
+#include <linux/mm.h>
 
 #include <asm/setup.h>
 
 #ifdef CONFIG_DEBUG_BOOT_PARAMS
+struct setup_data_node {
+	u64 paddr;
+	u32 type;
+	u32 len;
+};
+
+static ssize_t
+setup_data_read(struct file *file, char __user *user_buf, size_t count,
+		loff_t *ppos)
+{
+	struct setup_data_node *node = file->private_data;
+	unsigned long remain;
+	loff_t pos = *ppos;
+	struct page *pg;
+	void *p;
+	u64 pa;
+
+	if (pos < 0)
+		return -EINVAL;
+	if (pos >= node->len)
+		return 0;
+
+	if (count > node->len - pos)
+		count = node->len - pos;
+	pa = node->paddr + sizeof(struct setup_data) + pos;
+	pg = pfn_to_page((pa + count - 1) >> PAGE_SHIFT);
+	if (PageHighMem(pg)) {
+		p = ioremap_cache(pa, count);
+		if (!p)
+			return -ENXIO;
+	} else {
+		p = __va(pa);
+	}
+
+	remain = copy_to_user(user_buf, p, count);
+
+	if (PageHighMem(pg))
+		iounmap(p);
+
+	if (remain)
+		return -EFAULT;
+
+	*ppos = pos + count;
+
+	return count;
+}
+
+static int setup_data_open(struct inode *inode, struct file *file)
+{
+	file->private_data = inode->i_private;
+	return 0;
+}
+
+static const struct file_operations fops_setup_data = {
+	.read =		setup_data_read,
+	.open =		setup_data_open,
+};
+
+static int __init
+create_setup_data_node(struct dentry *parent, int no,
+		       struct setup_data_node *node)
+{
+	struct dentry *d, *type, *data;
+	char buf[16];
+	int error;
+
+	sprintf(buf, "%d", no);
+	d = debugfs_create_dir(buf, parent);
+	if (!d) {
+		error = -ENOMEM;
+		goto err_return;
+	}
+	type = debugfs_create_x32("type", S_IRUGO, d, &node->type);
+	if (!type) {
+		error = -ENOMEM;
+		goto err_dir;
+	}
+	data = debugfs_create_file("data", S_IRUGO, d, node, &fops_setup_data);
+	if (!data) {
+		error = -ENOMEM;
+		goto err_type;
+	}
+	return 0;
+
+err_type:
+	debugfs_remove(type);
+err_dir:
+	debugfs_remove(d);
+err_return:
+	return error;
+}
+
+static int __init create_setup_data_nodes(struct dentry *parent)
+{
+	struct setup_data_node *node;
+	struct setup_data *data;
+	int error, no = 0;
+	struct dentry *d;
+	struct page *pg;
+	u64 pa_data;
+
+	d = debugfs_create_dir("setup_data", parent);
+	if (!d) {
+		error = -ENOMEM;
+		goto err_return;
+	}
+
+	pa_data = boot_params.hdr.setup_data;
+
+	while (pa_data) {
+		node = kmalloc(sizeof(*node), GFP_KERNEL);
+		if (!node) {
+			error = -ENOMEM;
+			goto err_dir;
+		}
+		pg = pfn_to_page((pa_data+sizeof(*data)-1) >> PAGE_SHIFT);
+		if (PageHighMem(pg)) {
+			data = ioremap_cache(pa_data, sizeof(*data));
+			if (!data) {
+				error = -ENXIO;
+				goto err_dir;
+			}
+		} else {
+			data = __va(pa_data);
+		}
+
+		node->paddr = pa_data;
+		node->type = data->type;
+		node->len = data->len;
+		error = create_setup_data_node(d, no, node);
+		pa_data = data->next;
+
+		if (PageHighMem(pg))
+			iounmap(data);
+		if (error)
+			goto err_dir;
+		no++;
+	}
+	return 0;
+
+err_dir:
+	debugfs_remove(d);
+err_return:
+	return error;
+}
+
 static struct debugfs_blob_wrapper boot_params_blob = {
-	.data = &boot_params,
-	.size = sizeof(boot_params),
+	.data		= &boot_params,
+	.size		= sizeof(boot_params),
 };
 
 static int __init boot_params_kdebugfs_init(void)
 {
-	int error;
 	struct dentry *dbp, *version, *data;
+	int error;
 
 	dbp = debugfs_create_dir("boot_params", NULL);
 	if (!dbp) {
@@ -41,7 +189,13 @@ static int __init boot_params_kdebugfs_init(void)
 		error = -ENOMEM;
 		goto err_version;
 	}
+	error = create_setup_data_nodes(dbp);
+	if (error)
+		goto err_data;
 	return 0;
+
+err_data:
+	debugfs_remove(data);
 err_version:
 	debugfs_remove(version);
 err_dir:
@@ -61,5 +215,4 @@ static int __init arch_kdebugfs_init(void)
 
 	return error;
 }
-
 arch_initcall(arch_kdebugfs_init);
diff --git a/arch/x86/kernel/mmconf-fam10h_64.c b/arch/x86/kernel/mmconf-fam10h_64.c
new file mode 100644
index 0000000..edc5fbf
--- /dev/null
+++ b/arch/x86/kernel/mmconf-fam10h_64.c
@@ -0,0 +1,243 @@
+/*
+ * AMD Family 10h mmconfig enablement
+ */
+
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/string.h>
+#include <linux/pci.h>
+#include <linux/dmi.h>
+#include <asm/pci-direct.h>
+#include <linux/sort.h>
+#include <asm/io.h>
+#include <asm/msr.h>
+#include <asm/acpi.h>
+
+#include "../pci/pci.h"
+
+struct pci_hostbridge_probe {
+	u32 bus;
+	u32 slot;
+	u32 vendor;
+	u32 device;
+};
+
+static u64 __cpuinitdata fam10h_pci_mmconf_base;
+static int __cpuinitdata fam10h_pci_mmconf_base_status;
+
+static struct pci_hostbridge_probe pci_probes[] __cpuinitdata = {
+	{ 0, 0x18, PCI_VENDOR_ID_AMD, 0x1200 },
+	{ 0xff, 0, PCI_VENDOR_ID_AMD, 0x1200 },
+};
+
+struct range {
+	u64 start;
+	u64 end;
+};
+
+static int __cpuinit cmp_range(const void *x1, const void *x2)
+{
+	const struct range *r1 = x1;
+	const struct range *r2 = x2;
+	int start1, start2;
+
+	start1 = r1->start >> 32;
+	start2 = r2->start >> 32;
+
+	return start1 - start2;
+}
+
+/*[47:0] */
+/* need to avoid (0xfd<<32) and (0xfe<<32), ht used space */
+#define FAM10H_PCI_MMCONF_BASE (0xfcULL<<32)
+#define BASE_VALID(b) ((b != (0xfdULL << 32)) && (b != (0xfeULL << 32)))
+static void __cpuinit get_fam10h_pci_mmconf_base(void)
+{
+	int i;
+	unsigned bus;
+	unsigned slot;
+	int found;
+
+	u64 val;
+	u32 address;
+	u64 tom2;
+	u64 base = FAM10H_PCI_MMCONF_BASE;
+
+	int hi_mmio_num;
+	struct range range[8];
+
+	/* only try to get setting from BSP */
+	/* -1 or 1 */
+	if (fam10h_pci_mmconf_base_status)
+		return;
+
+	if (!early_pci_allowed())
+		goto fail;
+
+	found = 0;
+	for (i = 0; i < ARRAY_SIZE(pci_probes); i++) {
+		u32 id;
+		u16 device;
+		u16 vendor;
+
+		bus = pci_probes[i].bus;
+		slot = pci_probes[i].slot;
+		id = read_pci_config(bus, slot, 0, PCI_VENDOR_ID);
+
+		vendor = id & 0xffff;
+		device = (id>>16) & 0xffff;
+		if (pci_probes[i].vendor == vendor &&
+		    pci_probes[i].device == device) {
+			found = 1;
+			break;
+		}
+	}
+
+	if (!found)
+		goto fail;
+
+	/* SYS_CFG */
+	address = MSR_K8_SYSCFG;
+	rdmsrl(address, val);
+
+	/* TOP_MEM2 is not enabled? */
+	if (!(val & (1<<21))) {
+		tom2 = 0;
+	} else {
+		/* TOP_MEM2 */
+		address = MSR_K8_TOP_MEM2;
+		rdmsrl(address, val);
+		tom2 = val & (0xffffULL<<32);
+	}
+
+	if (base <= tom2)
+		base = tom2 + (1ULL<<32);
+
+	/*
+	 * need to check if the range is in the high mmio range that is
+	 * above 4G
+	 */
+	hi_mmio_num = 0;
+	for (i = 0; i < 8; i++) {
+		u32 reg;
+		u64 start;
+		u64 end;
+		reg = read_pci_config(bus, slot, 1, 0x80 + (i << 3));
+		if (!(reg & 3))
+			continue;
+
+		start = (((u64)reg) << 8) & (0xffULL << 32); /* 39:16 on 31:8*/
+		reg = read_pci_config(bus, slot, 1, 0x84 + (i << 3));
+		end = (((u64)reg) << 8) & (0xffULL << 32); /* 39:16 on 31:8*/
+
+		if (!end)
+			continue;
+
+		range[hi_mmio_num].start = start;
+		range[hi_mmio_num].end = end;
+		hi_mmio_num++;
+	}
+
+	if (!hi_mmio_num)
+		goto out;
+
+	/* sort the range */
+	sort(range, hi_mmio_num, sizeof(struct range), cmp_range, NULL);
+
+	if (range[hi_mmio_num - 1].end < base)
+		goto out;
+	if (range[0].start > base)
+		goto out;
+
+	/* need to find one window */
+	base = range[0].start - (1ULL << 32);
+	if ((base > tom2) && BASE_VALID(base))
+		goto out;
+	base = range[hi_mmio_num - 1].end + (1ULL << 32);
+	if ((base > tom2) && BASE_VALID(base))
+		goto out;
+	/* need to find window between ranges */
+	if (hi_mmio_num > 1)
+	for (i = 0; i < hi_mmio_num - 1; i++) {
+		if (range[i + 1].start > (range[i].end + (1ULL << 32))) {
+			base = range[i].end + (1ULL << 32);
+			if ((base > tom2) && BASE_VALID(base))
+				goto out;
+		}
+	}
+
+fail:
+	fam10h_pci_mmconf_base_status = -1;
+	return;
+out:
+	fam10h_pci_mmconf_base = base;
+	fam10h_pci_mmconf_base_status = 1;
+}
+
+void __cpuinit fam10h_check_enable_mmcfg(void)
+{
+	u64 val;
+	u32 address;
+
+	if (!(pci_probe & PCI_CHECK_ENABLE_AMD_MMCONF))
+		return;
+
+	address = MSR_FAM10H_MMIO_CONF_BASE;
+	rdmsrl(address, val);
+
+	/* try to make sure that AP's setting is identical to BSP setting */
+	if (val & FAM10H_MMIO_CONF_ENABLE) {
+		unsigned busnbits;
+		busnbits = (val >> FAM10H_MMIO_CONF_BUSRANGE_SHIFT) &
+			FAM10H_MMIO_CONF_BUSRANGE_MASK;
+
+		/* only trust the one handle 256 buses, if acpi=off */
+		if (!acpi_pci_disabled || busnbits >= 8) {
+			u64 base;
+			base = val & (0xffffULL << 32);
+			if (fam10h_pci_mmconf_base_status <= 0) {
+				fam10h_pci_mmconf_base = base;
+				fam10h_pci_mmconf_base_status = 1;
+				return;
+			} else if (fam10h_pci_mmconf_base ==  base)
+				return;
+		}
+	}
+
+	/*
+	 * if it is not enabled, try to enable it and assume only one segment
+	 * with 256 buses
+	 */
+	get_fam10h_pci_mmconf_base();
+	if (fam10h_pci_mmconf_base_status <= 0)
+		return;
+
+	printk(KERN_INFO "Enable MMCONFIG on AMD Family 10h\n");
+	val &= ~((FAM10H_MMIO_CONF_BASE_MASK<<FAM10H_MMIO_CONF_BASE_SHIFT) |
+	     (FAM10H_MMIO_CONF_BUSRANGE_MASK<<FAM10H_MMIO_CONF_BUSRANGE_SHIFT));
+	val |= fam10h_pci_mmconf_base | (8 << FAM10H_MMIO_CONF_BUSRANGE_SHIFT) |
+	       FAM10H_MMIO_CONF_ENABLE;
+	wrmsrl(address, val);
+}
+
+static int __devinit set_check_enable_amd_mmconf(const struct dmi_system_id *d)
+{
+        pci_probe |= PCI_CHECK_ENABLE_AMD_MMCONF;
+        return 0;
+}
+
+static struct dmi_system_id __devinitdata mmconf_dmi_table[] = {
+        {
+                .callback = set_check_enable_amd_mmconf,
+                .ident = "Sun Microsystems Machine",
+                .matches = {
+                        DMI_MATCH(DMI_SYS_VENDOR, "Sun Microsystems"),
+                },
+        },
+	{}
+};
+
+void __init check_enable_amd_mmconf_dmi(void)
+{
+	dmi_check_system(mmconf_dmi_table);
+}
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 388b113..50a18e4 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -77,10 +77,14 @@ void __init dma32_reserve_bootmem(void)
 	if (end_pfn <= MAX_DMA32_PFN)
 		return;
 
+	/*
+	 * check aperture_64.c allocate_aperture() for reason about
+	 * using 512M as goal
+	 */
 	align = 64ULL<<20;
 	size = round_up(dma32_bootmem_size, align);
 	dma32_bootmem_ptr = __alloc_bootmem_nopanic(size, align,
-				 __pa(MAX_DMA_ADDRESS));
+				 512ULL<<20);
 	if (dma32_bootmem_ptr)
 		dma32_bootmem_size = size;
 	else
@@ -88,7 +92,6 @@ void __init dma32_reserve_bootmem(void)
 }
 static void __init dma32_free_bootmem(void)
 {
-	int node;
 
 	if (end_pfn <= MAX_DMA32_PFN)
 		return;
@@ -96,9 +99,7 @@ static void __init dma32_free_bootmem(void)
 	if (!dma32_bootmem_ptr)
 		return;
 
-	for_each_online_node(node)
-		free_bootmem_node(NODE_DATA(node), __pa(dma32_bootmem_ptr),
-				  dma32_bootmem_size);
+	free_bootmem(__pa(dma32_bootmem_ptr), dma32_bootmem_size);
 
 	dma32_bootmem_ptr = NULL;
 	dma32_bootmem_size = 0;
diff --git a/arch/x86/kernel/setup_64.c b/arch/x86/kernel/setup_64.c
index 17bdf23..2f5c488 100644
--- a/arch/x86/kernel/setup_64.c
+++ b/arch/x86/kernel/setup_64.c
@@ -29,6 +29,7 @@
 #include <linux/crash_dump.h>
 #include <linux/root_dev.h>
 #include <linux/pci.h>
+#include <asm/pci-direct.h>
 #include <linux/efi.h>
 #include <linux/acpi.h>
 #include <linux/kallsyms.h>
@@ -40,6 +41,7 @@
 #include <linux/dmi.h>
 #include <linux/dma-mapping.h>
 #include <linux/ctype.h>
+#include <linux/sort.h>
 #include <linux/uaccess.h>
 #include <linux/init_ohci1394_dma.h>
 
@@ -190,6 +192,7 @@ contig_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
 	bootmap_size = init_bootmem(bootmap >> PAGE_SHIFT, end_pfn);
 	e820_register_active_regions(0, start_pfn, end_pfn);
 	free_bootmem_with_active_regions(0, end_pfn);
+	early_res_to_bootmem(0, end_pfn<<PAGE_SHIFT);
 	reserve_bootmem(bootmap, bootmap_size, BOOTMEM_DEFAULT);
 }
 #endif
@@ -264,6 +267,40 @@ void __attribute__((weak)) __init memory_setup(void)
        machine_specific_memory_setup();
 }
 
+static void __init parse_setup_data(void)
+{
+	struct setup_data *data;
+	unsigned long pa_data;
+
+	if (boot_params.hdr.version < 0x0209)
+		return;
+	pa_data = boot_params.hdr.setup_data;
+	while (pa_data) {
+		data = early_ioremap(pa_data, PAGE_SIZE);
+		switch (data->type) {
+		default:
+			break;
+		}
+#ifndef CONFIG_DEBUG_BOOT_PARAMS
+		free_early(pa_data, pa_data+sizeof(*data)+data->len);
+#endif
+		pa_data = data->next;
+		early_iounmap(data, PAGE_SIZE);
+	}
+}
+
+#ifdef CONFIG_PCI_MMCONFIG
+extern void __cpuinit fam10h_check_enable_mmcfg(void);
+extern void __init check_enable_amd_mmconf_dmi(void);
+#else
+void __cpuinit fam10h_check_enable_mmcfg(void)
+{
+}
+void __init check_enable_amd_mmconf_dmi(void)
+{
+}
+#endif
+
 /*
  * setup_arch - architecture-specific boot-time initializations
  *
@@ -316,6 +353,8 @@ void __init setup_arch(char **cmdline_p)
 	strlcpy(command_line, boot_command_line, COMMAND_LINE_SIZE);
 	*cmdline_p = command_line;
 
+	parse_setup_data();
+
 	parse_early_param();
 
 #ifdef CONFIG_PROVIDE_OHCI1394_DMA_INIT
@@ -397,8 +436,6 @@ void __init setup_arch(char **cmdline_p)
 	contig_initmem_init(0, end_pfn);
 #endif
 
-	early_res_to_bootmem();
-
 	dma32_reserve_bootmem();
 
 #ifdef CONFIG_ACPI_SLEEP
@@ -485,6 +522,9 @@ void __init setup_arch(char **cmdline_p)
 	conswitchp = &dummy_con;
 #endif
 #endif
+
+	/* do this before identify_cpu for boot cpu */
+	check_enable_amd_mmconf_dmi();
 }
 
 static int __cpuinit get_model_name(struct cpuinfo_x86 *c)
@@ -737,6 +777,9 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
 	/* MFENCE stops RDTSC speculation */
 	set_cpu_cap(c, X86_FEATURE_MFENCE_RDTSC);
 
+	if (c->x86 == 0x10)
+		fam10h_check_enable_mmcfg();
+
 	if (amd_apic_timer_broken())
 		disable_apic_timer = 1;
 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0cca626..e900757 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -810,7 +810,7 @@ void free_initrd_mem(unsigned long start, unsigned long end)
 void __init reserve_bootmem_generic(unsigned long phys, unsigned len)
 {
 #ifdef CONFIG_NUMA
-	int nid = phys_to_nid(phys);
+	int nid, next_nid;
 #endif
 	unsigned long pfn = phys >> PAGE_SHIFT;
 
@@ -829,10 +829,14 @@ void __init reserve_bootmem_generic(unsigned long phys, unsigned len)
 
 	/* Should check here against the e820 map to avoid double free */
 #ifdef CONFIG_NUMA
+	nid = phys_to_nid(phys);
+	next_nid = phys_to_nid(phys + len - 1);
+	if (nid == next_nid)
 	reserve_bootmem_node(NODE_DATA(nid), phys, len, BOOTMEM_DEFAULT);
-#else
-	reserve_bootmem(phys, len, BOOTMEM_DEFAULT);
+	else
 #endif
+	reserve_bootmem(phys, len, BOOTMEM_DEFAULT);
+
 	if (phys+len <= MAX_DMA_PFN*PAGE_SIZE) {
 		dma_reserve += len / PAGE_SIZE;
 		set_dma_reserve(dma_reserve);
@@ -926,6 +930,10 @@ const char *arch_vma_name(struct vm_area_struct *vma)
 /*
  * Initialise the sparsemem vmemmap using huge-pages at the PMD level.
  */
+static long __meminitdata addr_start, addr_end;
+static void __meminitdata *p_start, *p_end;
+static int __meminitdata node_start;
+
 int __meminit
 vmemmap_populate(struct page *start_page, unsigned long size, int node)
 {
@@ -960,12 +968,32 @@ vmemmap_populate(struct page *start_page, unsigned long size, int node)
 							PAGE_KERNEL_LARGE);
 			set_pmd(pmd, __pmd(pte_val(entry)));
 
-			printk(KERN_DEBUG " [%lx-%lx] PMD ->%p on node %d\n",
-				addr, addr + PMD_SIZE - 1, p, node);
+			/* check to see if we have contiguous blocks */
+			if (p_end != p || node_start != node) {
+				if (p_start)
+					printk(KERN_DEBUG " [%lx-%lx] PMD -> [%p-%p] on node %d\n",
+						addr_start, addr_end-1, p_start, p_end-1, node_start);
+				addr_start = addr;
+				node_start = node;
+				p_start = p;
+			}
+			addr_end = addr + PMD_SIZE;
+			p_end = p + PMD_SIZE;
 		} else {
 			vmemmap_verify((pte_t *)pmd, node, addr, next);
 		}
 	}
 	return 0;
 }
+
+void __meminit vmemmap_populate_print_last(void)
+{
+	if (p_start) {
+		printk(KERN_DEBUG " [%lx-%lx] PMD -> [%p-%p] on node %d\n",
+			addr_start, addr_end-1, p_start, p_end-1, node_start);
+		p_start = NULL;
+		p_end = NULL;
+		node_start = 0;
+	}
+}
 #endif
diff --git a/arch/x86/mm/k8topology_64.c b/arch/x86/mm/k8topology_64.c
index 86808e6..1f476e4 100644
--- a/arch/x86/mm/k8topology_64.c
+++ b/arch/x86/mm/k8topology_64.c
@@ -13,12 +13,15 @@
 #include <linux/nodemask.h>
 #include <asm/io.h>
 #include <linux/pci_ids.h>
+#include <linux/acpi.h>
 #include <asm/types.h>
 #include <asm/mmzone.h>
 #include <asm/proto.h>
 #include <asm/e820.h>
 #include <asm/pci-direct.h>
 #include <asm/numa.h>
+#include <asm/mpspec.h>
+#include <asm/apic.h>
 
 static __init int find_northbridge(void)
 {
@@ -44,6 +47,30 @@ static __init int find_northbridge(void)
 	return -1;
 }
 
+static __init void early_get_boot_cpu_id(void)
+{
+	/*
+	 * need to get boot_cpu_id so can use that to create apicid_to_node
+	 * in k8_scan_nodes()
+	 */
+	/*
+	 * Find possible boot-time SMP configuration:
+	 */
+	early_find_smp_config();
+#ifdef CONFIG_ACPI
+	/*
+	 * Read APIC information from ACPI tables.
+	 */
+	early_acpi_boot_init();
+#endif
+	/*
+	 * get boot-time SMP configuration:
+	 */
+	if (smp_found_config)
+		early_get_smp_config();
+	early_init_lapic_mapping();
+}
+
 int __init k8_scan_nodes(unsigned long start, unsigned long end)
 {
 	unsigned long prevbase;
@@ -56,6 +83,7 @@ int __init k8_scan_nodes(unsigned long start, unsigned long end)
 	unsigned cores;
 	unsigned bits;
 	int j;
+	unsigned apicid_base;
 
 	if (!early_pci_allowed())
 		return -1;
@@ -174,11 +202,19 @@ int __init k8_scan_nodes(unsigned long start, unsigned long end)
 	/* use the coreid bits from early_identify_cpu */
 	bits = boot_cpu_data.x86_coreid_bits;
 	cores = (1<<bits);
+	apicid_base = 0;
+	/* need to get boot_cpu_id early for system with apicid lifting */
+	early_get_boot_cpu_id();
+	if (boot_cpu_physical_apicid > 0) {
+		printk(KERN_INFO "BSP APIC ID: %02x\n",
+				 boot_cpu_physical_apicid);
+		apicid_base = boot_cpu_physical_apicid;
+	}
 
 	for (i = 0; i < 8; i++) {
 		if (nodes[i].start != nodes[i].end) {
 			nodeid = nodeids[i];
-			for (j = 0; j < cores; j++)
+			for (j = apicid_base; j < cores + apicid_base; j++)
 				apicid_to_node[(nodeid << bits) + j] = i;
 			setup_node_bootmem(i, nodes[i].start, nodes[i].end);
 		}
diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
index 9a68922..efb7483 100644
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -196,6 +196,7 @@ void __init setup_node_bootmem(int nodeid, unsigned long start,
 	unsigned long bootmap_start, nodedata_phys;
 	void *bootmap;
 	const int pgdat_size = round_up(sizeof(pg_data_t), PAGE_SIZE);
+	int nid;
 
 	start = round_up(start, ZONE_ALIGN);
 
@@ -218,9 +219,20 @@ void __init setup_node_bootmem(int nodeid, unsigned long start,
 	NODE_DATA(nodeid)->node_start_pfn = start_pfn;
 	NODE_DATA(nodeid)->node_spanned_pages = end_pfn - start_pfn;
 
-	/* Find a place for the bootmem map */
+	/*
+	 * Find a place for the bootmem map
+	 * nodedata_phys could be on other nodes by alloc_bootmem,
+	 * so need to sure bootmap_start not to be small, otherwise
+	 * early_node_mem will get that with find_e820_area instead
+	 * of alloc_bootmem, that could clash with reserved range
+	 */
 	bootmap_pages = bootmem_bootmap_pages(end_pfn - start_pfn);
-	bootmap_start = round_up(nodedata_phys + pgdat_size, PAGE_SIZE);
+	nid = phys_to_nid(nodedata_phys);
+	if (nid == nodeid)
+		bootmap_start = round_up(nodedata_phys + pgdat_size,
+					 PAGE_SIZE);
+	else
+		bootmap_start = round_up(start, PAGE_SIZE);
 	/*
 	 * SMP_CAHCE_BYTES could be enough, but init_bootmem_node like
 	 * to use that to align to PAGE_SIZE
@@ -245,10 +257,29 @@ void __init setup_node_bootmem(int nodeid, unsigned long start,
 
 	free_bootmem_with_active_regions(nodeid, end);
 
-	reserve_bootmem_node(NODE_DATA(nodeid), nodedata_phys, pgdat_size,
-			BOOTMEM_DEFAULT);
-	reserve_bootmem_node(NODE_DATA(nodeid), bootmap_start,
-			bootmap_pages<<PAGE_SHIFT, BOOTMEM_DEFAULT);
+	/*
+	 * convert early reserve to bootmem reserve earlier
+	 * otherwise early_node_mem could use early reserved mem
+	 * on previous node
+	 */
+	early_res_to_bootmem(start, end);
+
+	/*
+	 * in some case early_node_mem could use alloc_bootmem
+	 * to get range on other node, don't reserve that again
+	 */
+	if (nid != nodeid)
+		printk(KERN_INFO "    NODE_DATA(%d) on node %d\n", nodeid, nid);
+	else
+		reserve_bootmem_node(NODE_DATA(nodeid), nodedata_phys,
+					pgdat_size, BOOTMEM_DEFAULT);
+	nid = phys_to_nid(bootmap_start);
+	if (nid != nodeid)
+		printk(KERN_INFO "    bootmap(%d) on node %d\n", nodeid, nid);
+	else
+		reserve_bootmem_node(NODE_DATA(nodeid), bootmap_start,
+				 bootmap_pages<<PAGE_SHIFT, BOOTMEM_DEFAULT);
+
 #ifdef CONFIG_ACPI_NUMA
 	srat_reserve_add_area(nodeid);
 #endif
diff --git a/arch/x86/pci/Makefile_32 b/arch/x86/pci/Makefile_32
index cdd6828..e9c5caf 100644
--- a/arch/x86/pci/Makefile_32
+++ b/arch/x86/pci/Makefile_32
@@ -10,5 +10,6 @@ pci-y				+= legacy.o irq.o
 
 pci-$(CONFIG_X86_VISWS)		:= visws.o fixup.o
 pci-$(CONFIG_X86_NUMAQ)		:= numa.o irq.o
+pci-$(CONFIG_NUMA)		+= mp_bus_to_node.o
 
 obj-y				+= $(pci-y) common.o early.o
diff --git a/arch/x86/pci/Makefile_64 b/arch/x86/pci/Makefile_64
index 7d8c467..8fbd198 100644
--- a/arch/x86/pci/Makefile_64
+++ b/arch/x86/pci/Makefile_64
@@ -13,5 +13,5 @@ obj-y			+= legacy.o irq.o common.o early.o
 # mmconfig has a 64bit special
 obj-$(CONFIG_PCI_MMCONFIG) += mmconfig_64.o direct.o mmconfig-shared.o
 
-obj-$(CONFIG_NUMA)	+= k8-bus_64.o
+obj-y		+= k8-bus_64.o
 
diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
index 2664cb3..28d17a5 100644
--- a/arch/x86/pci/acpi.c
+++ b/arch/x86/pci/acpi.c
@@ -6,45 +6,6 @@
 #include <asm/numa.h>
 #include "pci.h"
 
-static int __devinit can_skip_ioresource_align(const struct dmi_system_id *d)
-{
-	pci_probe |= PCI_CAN_SKIP_ISA_ALIGN;
-	printk(KERN_INFO "PCI: %s detected, can skip ISA alignment\n", d->ident);
-	return 0;
-}
-
-static struct dmi_system_id acpi_pciprobe_dmi_table[] __devinitdata = {
-/*
- * Systems where PCI IO resource ISA alignment can be skipped
- * when the ISA enable bit in the bridge control is not set
- */
-	{
-		.callback = can_skip_ioresource_align,
-		.ident = "IBM System x3800",
-		.matches = {
-			DMI_MATCH(DMI_SYS_VENDOR, "IBM"),
-			DMI_MATCH(DMI_PRODUCT_NAME, "x3800"),
-		},
-	},
-	{
-		.callback = can_skip_ioresource_align,
-		.ident = "IBM System x3850",
-		.matches = {
-			DMI_MATCH(DMI_SYS_VENDOR, "IBM"),
-			DMI_MATCH(DMI_PRODUCT_NAME, "x3850"),
-		},
-	},
-	{
-		.callback = can_skip_ioresource_align,
-		.ident = "IBM System x3950",
-		.matches = {
-			DMI_MATCH(DMI_SYS_VENDOR, "IBM"),
-			DMI_MATCH(DMI_PRODUCT_NAME, "x3950"),
-		},
-	},
-	{}
-};
-
 struct pci_root_info {
 	char *name;
 	unsigned int res_num;
@@ -191,9 +152,10 @@ struct pci_bus * __devinit pci_acpi_scan_root(struct acpi_device *device, int do
 {
 	struct pci_bus *bus;
 	struct pci_sysdata *sd;
+	int node;
+#ifdef CONFIG_ACPI_NUMA
 	int pxm;
-
-	dmi_check_system(acpi_pciprobe_dmi_table);
+#endif
 
 	if (domain && !pci_domains_supported) {
 		printk(KERN_WARNING "PCI: Multiple domains not supported "
@@ -201,6 +163,20 @@ struct pci_bus * __devinit pci_acpi_scan_root(struct acpi_device *device, int do
 		return NULL;
 	}
 
+	node = -1;
+#ifdef CONFIG_ACPI_NUMA
+	pxm = acpi_get_pxm(device->handle);
+	if (pxm >= 0)
+		node = pxm_to_node(pxm);
+	if (node != -1)
+		set_mp_bus_to_node(busnum, node);
+	else
+#endif
+		node = get_mp_bus_to_node(busnum);
+
+	if (node != -1 && !node_online(node))
+		node = -1;
+
 	/* Allocate per-root-bus (not per bus) arch-specific data.
 	 * TODO: leak; this memory is never freed.
 	 * It's arguable whether it's worth the trouble to care.
@@ -212,13 +188,7 @@ struct pci_bus * __devinit pci_acpi_scan_root(struct acpi_device *device, int do
 	}
 
 	sd->domain = domain;
-	sd->node = -1;
-
-	pxm = acpi_get_pxm(device->handle);
-#ifdef CONFIG_ACPI_NUMA
-	if (pxm >= 0)
-		sd->node = pxm_to_node(pxm);
-#endif
+	sd->node = node;
 	/*
 	 * Maybe the desired pci bus has been already scanned. In such case
 	 * it is unnecessary to scan the pci bus with the given domain,busnum.
@@ -237,18 +207,19 @@ struct pci_bus * __devinit pci_acpi_scan_root(struct acpi_device *device, int do
 	if (!bus)
 		kfree(sd);
 
+	if (bus && node != -1) {
 #ifdef CONFIG_ACPI_NUMA
-	if (bus != NULL) {
-		if (pxm >= 0) {
-			printk("bus %d -> pxm %d -> node %d\n",
-				busnum, pxm, pxm_to_node(pxm));
-		}
-	}
+		if (pxm >= 0)
+			printk(KERN_DEBUG "bus %02x -> pxm %d -> node %d\n",
+				busnum, pxm, node);
+#else
+		printk(KERN_DEBUG "bus %02x -> node %d\n",
+			busnum, node);
 #endif
+	}
 
 	if (bus && (pci_probe & PCI_USE__CRS))
 		get_current_resources(device, busnum, domain, bus);
-	
 	return bus;
 }
 
diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index 75fcc29..9f6b117 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -90,6 +90,50 @@ static void __devinit pcibios_fixup_device_resources(struct pci_dev *dev)
 		rom_r->start = rom_r->end = rom_r->flags = 0;
 }
 
+static int __devinit can_skip_ioresource_align(const struct dmi_system_id *d)
+{
+	pci_probe |= PCI_CAN_SKIP_ISA_ALIGN;
+	printk(KERN_INFO "PCI: %s detected, can skip ISA alignment\n", d->ident);
+	return 0;
+}
+
+static struct dmi_system_id can_skip_pciprobe_dmi_table[] __devinitdata = {
+/*
+ * Systems where PCI IO resource ISA alignment can be skipped
+ * when the ISA enable bit in the bridge control is not set
+ */
+	{
+		.callback = can_skip_ioresource_align,
+		.ident = "IBM System x3800",
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "IBM"),
+			DMI_MATCH(DMI_PRODUCT_NAME, "x3800"),
+		},
+	},
+	{
+		.callback = can_skip_ioresource_align,
+		.ident = "IBM System x3850",
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "IBM"),
+			DMI_MATCH(DMI_PRODUCT_NAME, "x3850"),
+		},
+	},
+	{
+		.callback = can_skip_ioresource_align,
+		.ident = "IBM System x3950",
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "IBM"),
+			DMI_MATCH(DMI_PRODUCT_NAME, "x3950"),
+		},
+	},
+	{}
+};
+
+void __init dmi_check_skip_isa_align(void)
+{
+	dmi_check_system(can_skip_pciprobe_dmi_table);
+}
+
 /*
  *  Called after each bus is probed, but before its children
  *  are examined.
@@ -318,13 +362,16 @@ static struct dmi_system_id __devinitdata pciprobe_dmi_table[] = {
 	{}
 };
 
+void __init dmi_check_pciprobe(void)
+{
+	dmi_check_system(pciprobe_dmi_table);
+}
+
 struct pci_bus * __devinit pcibios_scan_root(int busnum)
 {
 	struct pci_bus *bus = NULL;
 	struct pci_sysdata *sd;
 
-	dmi_check_system(pciprobe_dmi_table);
-
 	while ((bus = pci_find_next_bus(bus)) != NULL) {
 		if (bus->number == busnum) {
 			/* Already scanned */
@@ -342,9 +389,14 @@ struct pci_bus * __devinit pcibios_scan_root(int busnum)
 		return NULL;
 	}
 
+	sd->node = get_mp_bus_to_node(busnum);
+
 	printk(KERN_DEBUG "PCI: Probing PCI hardware (bus %02x)\n", busnum);
+	bus = pci_scan_bus_parented(NULL, busnum, &pci_root_ops, sd);
+	if (!bus)
+		kfree(sd);
 
-	return pci_scan_bus_parented(NULL, busnum, &pci_root_ops, sd);
+	return bus;
 }
 
 extern u8 pci_cache_line_size;
@@ -420,6 +472,10 @@ char * __devinit  pcibios_setup(char *str)
 		pci_probe &= ~PCI_PROBE_MMCONF;
 		return NULL;
 	}
+	else if (!strcmp(str, "check_enable_amd_mmconf")) {
+		pci_probe |= PCI_CHECK_ENABLE_AMD_MMCONF;
+		return NULL;
+	}
 #endif
 	else if (!strcmp(str, "noacpi")) {
 		acpi_noirq_set();
@@ -453,6 +509,9 @@ char * __devinit  pcibios_setup(char *str)
 	} else if (!strcmp(str, "routeirq")) {
 		pci_routeirq = 1;
 		return NULL;
+	} else if (!strcmp(str, "skip_isa_align")) {
+		pci_probe |= PCI_CAN_SKIP_ISA_ALIGN;
+		return NULL;
 	}
 	return str;
 }
@@ -480,7 +539,7 @@ void pcibios_disable_device (struct pci_dev *dev)
 		pcibios_disable_irq(dev);
 }
 
-struct pci_bus *__devinit pci_scan_bus_with_sysdata(int busno)
+struct pci_bus *pci_scan_bus_on_node(int busno, struct pci_ops *ops, int node)
 {
 	struct pci_bus *bus = NULL;
 	struct pci_sysdata *sd;
@@ -495,10 +554,15 @@ struct pci_bus *__devinit pci_scan_bus_with_sysdata(int busno)
 		printk(KERN_ERR "PCI: OOM, skipping PCI bus %02x\n", busno);
 		return NULL;
 	}
-	sd->node = -1;
-	bus = pci_scan_bus(busno, &pci_root_ops, sd);
+	sd->node = node;
+	bus = pci_scan_bus(busno, ops, sd);
 	if (!bus)
 		kfree(sd);
 
 	return bus;
 }
+
+struct pci_bus *pci_scan_bus_with_sysdata(int busno)
+{
+	return pci_scan_bus_on_node(busno, &pci_root_ops, -1);
+}
diff --git a/arch/x86/pci/direct.c b/arch/x86/pci/direct.c
index 42f3e4c..21d1e0e 100644
--- a/arch/x86/pci/direct.c
+++ b/arch/x86/pci/direct.c
@@ -258,7 +258,8 @@ void __init pci_direct_init(int type)
 {
 	if (type == 0)
 		return;
-	printk(KERN_INFO "PCI: Using configuration type %d\n", type);
+	printk(KERN_INFO "PCI: Using configuration type %d for base access\n",
+		 type);
 	if (type == 1)
 		raw_pci_ops = &pci_direct_conf1;
 	else
@@ -275,8 +276,10 @@ int __init pci_direct_probe(void)
 	if (!region)
 		goto type2;
 
-	if (pci_check_type1())
+	if (pci_check_type1()) {
+		raw_pci_ops = &pci_direct_conf1;
 		return 1;
+	}
 	release_resource(region);
 
  type2:
@@ -290,7 +293,6 @@ int __init pci_direct_probe(void)
 		goto fail2;
 
 	if (pci_check_type2()) {
-		printk(KERN_INFO "PCI: Using configuration type 2\n");
 		raw_pci_ops = &pci_direct_conf2;
 		return 2;
 	}
diff --git a/arch/x86/pci/fixup.c b/arch/x86/pci/fixup.c
index a5ef5f5..b60b2ab 100644
--- a/arch/x86/pci/fixup.c
+++ b/arch/x86/pci/fixup.c
@@ -493,3 +493,20 @@ static void __devinit pci_siemens_interrupt_controller(struct pci_dev *dev)
 }
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_SIEMENS, 0x0015,
 			  pci_siemens_interrupt_controller);
+
+/*
+ * Regular PCI devices have 256 bytes, but AMD Family 10h Opteron ext config
+ * have 4096 bytes.  Even if the device is capable, that doesn't mean we can
+ * access it.  Maybe we don't have a way to generate extended config space
+ * accesses.   So check it
+ */
+static void fam10h_pci_cfg_space_size(struct pci_dev *dev)
+{
+	dev->cfg_size = pci_cfg_space_size_ext(dev, 0);
+}
+
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_AMD, 0x1200, fam10h_pci_cfg_space_size);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_AMD, 0x1201, fam10h_pci_cfg_space_size);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_AMD, 0x1202, fam10h_pci_cfg_space_size);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_AMD, 0x1203, fam10h_pci_cfg_space_size);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_AMD, 0x1204, fam10h_pci_cfg_space_size);
diff --git a/arch/x86/pci/init.c b/arch/x86/pci/init.c
index 3de9f9b..f098a6e 100644
--- a/arch/x86/pci/init.c
+++ b/arch/x86/pci/init.c
@@ -6,16 +6,13 @@
    in the right sequence from here. */
 static __init int pci_access_init(void)
 {
-	int type __maybe_unused = 0;
-
 #ifdef CONFIG_PCI_DIRECT
+	int type = 0;
+
 	type = pci_direct_probe();
 #endif
-#ifdef CONFIG_PCI_MMCONFIG
-	pci_mmcfg_init(type);
-#endif
-	if (raw_pci_ops)
-		return 0;
+	pci_mmcfg_early_init();
+
 #ifdef CONFIG_PCI_BIOS
 	pci_pcbios_init();
 #endif
@@ -28,10 +25,14 @@ static __init int pci_access_init(void)
 #ifdef CONFIG_PCI_DIRECT
 	pci_direct_init(type);
 #endif
-	if (!raw_pci_ops)
+	if (!raw_pci_ops && !raw_pci_ext_ops)
 		printk(KERN_ERR
 		"PCI: Fatal: No config space access function found\n");
 
+	dmi_check_pciprobe();
+
+	dmi_check_skip_isa_align();
+
 	return 0;
 }
 arch_initcall(pci_access_init);
diff --git a/arch/x86/pci/irq.c b/arch/x86/pci/irq.c
index 579745c..0908fca 100644
--- a/arch/x86/pci/irq.c
+++ b/arch/x86/pci/irq.c
@@ -136,9 +136,11 @@ static void __init pirq_peer_trick(void)
 		busmap[e->bus] = 1;
 	}
 	for(i = 1; i < 256; i++) {
+		int node;
 		if (!busmap[i] || pci_find_bus(0, i))
 			continue;
-		if (pci_scan_bus_with_sysdata(i))
+		node = get_mp_bus_to_node(i);
+		if (pci_scan_bus_on_node(i, &pci_root_ops, node))
 			printk(KERN_INFO "PCI: Discovered primary peer "
 			       "bus %02x [IRQ]\n", i);
 	}
diff --git a/arch/x86/pci/k8-bus_64.c b/arch/x86/pci/k8-bus_64.c
index 9cc813e..cfdde16 100644
--- a/arch/x86/pci/k8-bus_64.c
+++ b/arch/x86/pci/k8-bus_64.c
@@ -1,83 +1,541 @@
 #include <linux/init.h>
 #include <linux/pci.h>
+#include <asm/pci-direct.h>
 #include <asm/mpspec.h>
 #include <linux/cpumask.h>
+#include <linux/topology.h>
 
 /*
  * This discovers the pcibus <-> node mapping on AMD K8.
- *
- * RED-PEN need to call this again on PCI hotplug
- * RED-PEN empty cpus get reported wrong
+ * also get peer root bus resource for io,mmio
  */
 
-#define NODE_ID_REGISTER 0x60
-#define NODE_ID(dword) (dword & 0x07)
-#define LDT_BUS_NUMBER_REGISTER_0 0x94
-#define LDT_BUS_NUMBER_REGISTER_1 0xB4
-#define LDT_BUS_NUMBER_REGISTER_2 0xD4
-#define NR_LDT_BUS_NUMBER_REGISTERS 3
-#define SECONDARY_LDT_BUS_NUMBER(dword) ((dword >> 8) & 0xFF)
-#define SUBORDINATE_LDT_BUS_NUMBER(dword) ((dword >> 16) & 0xFF)
-#define PCI_DEVICE_ID_K8HTCONFIG 0x1100
+
+/*
+ * sub bus (transparent) will use entres from 3 to store extra from root,
+ * so need to make sure have enought slot there, increase PCI_BUS_NUM_RESOURCES?
+ */
+#define RES_NUM 16
+struct pci_root_info {
+	char name[12];
+	unsigned int res_num;
+	struct resource res[RES_NUM];
+	int bus_min;
+	int bus_max;
+	int node;
+	int link;
+};
+
+/* 4 at this time, it may become to 32 */
+#define PCI_ROOT_NR 4
+static int pci_root_num;
+static struct pci_root_info pci_root_info[PCI_ROOT_NR];
+
+#ifdef CONFIG_NUMA
+
+#define BUS_NR 256
+
+static int mp_bus_to_node[BUS_NR];
+
+void set_mp_bus_to_node(int busnum, int node)
+{
+	if (busnum >= 0 &&  busnum < BUS_NR)
+		mp_bus_to_node[busnum] = node;
+}
+
+int get_mp_bus_to_node(int busnum)
+{
+	int node = -1;
+
+	if (busnum < 0 || busnum > (BUS_NR - 1))
+		return node;
+
+	node = mp_bus_to_node[busnum];
+
+	/*
+	 * let numa_node_id to decide it later in dma_alloc_pages
+	 * if there is no ram on that node
+	 */
+	if (node != -1 && !node_online(node))
+		node = -1;
+
+	return node;
+}
+#endif
+
+void set_pci_bus_resources_arch_default(struct pci_bus *b)
+{
+	int i;
+	int j;
+	struct pci_root_info *info;
+
+	/* if only one root bus, don't need to anything */
+	if (pci_root_num < 2)
+		return;
+
+	for (i = 0; i < pci_root_num; i++) {
+		if (pci_root_info[i].bus_min == b->number)
+			break;
+	}
+
+	if (i == pci_root_num)
+		return;
+
+	info = &pci_root_info[i];
+	for (j = 0; j < info->res_num; j++) {
+		struct resource *res;
+		struct resource *root;
+
+		res = &info->res[j];
+		b->resource[j] = res;
+		if (res->flags & IORESOURCE_IO)
+			root = &ioport_resource;
+		else
+			root = &iomem_resource;
+		insert_resource(root, res);
+	}
+}
+
+#define RANGE_NUM 16
+
+struct res_range {
+	size_t start;
+	size_t end;
+};
+
+static void __init update_range(struct res_range *range, size_t start,
+				size_t end)
+{
+	int i;
+	int j;
+
+	for (j = 0; j < RANGE_NUM; j++) {
+		if (!range[j].end)
+			continue;
+
+		if (start <= range[j].start && end >= range[j].end) {
+			range[j].start = 0;
+			range[j].end = 0;
+			continue;
+		}
+
+		if (start <= range[j].start && end < range[j].end && range[j].start < end + 1) {
+			range[j].start = end + 1;
+			continue;
+		}
+
+
+		if (start > range[j].start && end >= range[j].end && range[j].end > start - 1) {
+			range[j].end = start - 1;
+			continue;
+		}
+
+		if (start > range[j].start && end < range[j].end) {
+			/* find the new spare */
+			for (i = 0; i < RANGE_NUM; i++) {
+				if (range[i].end == 0)
+					break;
+			}
+			if (i < RANGE_NUM) {
+				range[i].end = range[j].end;
+				range[i].start = end + 1;
+			} else {
+				printk(KERN_ERR "run of slot in ranges\n");
+			}
+			range[j].end = start - 1;
+			continue;
+		}
+	}
+}
+
+static void __init update_res(struct pci_root_info *info, size_t start,
+			      size_t end, unsigned long flags, int merge)
+{
+	int i;
+	struct resource *res;
+
+	if (!merge)
+		goto addit;
+
+	/* try to merge it with old one */
+	for (i = 0; i < info->res_num; i++) {
+		size_t final_start, final_end;
+		size_t common_start, common_end;
+
+		res = &info->res[i];
+		if (res->flags != flags)
+			continue;
+
+		common_start = max((size_t)res->start, start);
+		common_end = min((size_t)res->end, end);
+		if (common_start > common_end + 1)
+			continue;
+
+		final_start = min((size_t)res->start, start);
+		final_end = max((size_t)res->end, end);
+
+		res->start = final_start;
+		res->end = final_end;
+		return;
+	}
+
+addit:
+
+	/* need to add that */
+	if (info->res_num >= RES_NUM)
+		return;
+
+	res = &info->res[info->res_num];
+	res->name = info->name;
+	res->flags = flags;
+	res->start = start;
+	res->end = end;
+	res->child = NULL;
+	info->res_num++;
+}
+
+struct pci_hostbridge_probe {
+	u32 bus;
+	u32 slot;
+	u32 vendor;
+	u32 device;
+};
+
+static struct pci_hostbridge_probe pci_probes[] __initdata = {
+	{ 0, 0x18, PCI_VENDOR_ID_AMD, 0x1100 },
+	{ 0, 0x18, PCI_VENDOR_ID_AMD, 0x1200 },
+	{ 0xff, 0, PCI_VENDOR_ID_AMD, 0x1200 },
+	{ 0, 0x18, PCI_VENDOR_ID_AMD, 0x1300 },
+};
+
+static u64 __initdata fam10h_mmconf_start;
+static u64 __initdata fam10h_mmconf_end;
+static void __init get_pci_mmcfg_amd_fam10h_range(void)
+{
+	u32 address;
+	u64 base, msr;
+	unsigned segn_busn_bits;
+
+	/* assume all cpus from fam10h have mmconf */
+        if (boot_cpu_data.x86 < 0x10)
+		return;
+
+	address = MSR_FAM10H_MMIO_CONF_BASE;
+	rdmsrl(address, msr);
+
+	/* mmconfig is not enable */
+	if (!(msr & FAM10H_MMIO_CONF_ENABLE))
+		return;
+
+	base = msr & (FAM10H_MMIO_CONF_BASE_MASK<<FAM10H_MMIO_CONF_BASE_SHIFT);
+
+	segn_busn_bits = (msr >> FAM10H_MMIO_CONF_BUSRANGE_SHIFT) &
+			 FAM10H_MMIO_CONF_BUSRANGE_MASK;
+
+	fam10h_mmconf_start = base;
+	fam10h_mmconf_end = base + (1ULL<<(segn_busn_bits + 20)) - 1;
+}
 
 /**
- * fill_mp_bus_to_cpumask()
+ * early_fill_mp_bus_to_node()
+ * called before pcibios_scan_root and pci_scan_bus
  * fills the mp_bus_to_cpumask array based according to the LDT Bus Number
  * Registers found in the K8 northbridge
  */
-__init static int
-fill_mp_bus_to_cpumask(void)
+static int __init early_fill_mp_bus_info(void)
 {
-	struct pci_dev *nb_dev = NULL;
-	int i, j;
-	u32 ldtbus, nid;
-	static int lbnr[3] = {
-		LDT_BUS_NUMBER_REGISTER_0,
-		LDT_BUS_NUMBER_REGISTER_1,
-		LDT_BUS_NUMBER_REGISTER_2
-	};
-
-	while ((nb_dev = pci_get_device(PCI_VENDOR_ID_AMD,
-			PCI_DEVICE_ID_K8HTCONFIG, nb_dev))) {
-		pci_read_config_dword(nb_dev, NODE_ID_REGISTER, &nid);
-
-		for (i = 0; i < NR_LDT_BUS_NUMBER_REGISTERS; i++) {
-			pci_read_config_dword(nb_dev, lbnr[i], &ldtbus);
-			/*
-			 * if there are no busses hanging off of the current
-			 * ldt link then both the secondary and subordinate
-			 * bus number fields are set to 0.
-			 * 
-			 * RED-PEN
-			 * This is slightly broken because it assumes
- 			 * HT node IDs == Linux node ids, which is not always
-			 * true. However it is probably mostly true.
-			 */
-			if (!(SECONDARY_LDT_BUS_NUMBER(ldtbus) == 0
-				&& SUBORDINATE_LDT_BUS_NUMBER(ldtbus) == 0)) {
-				for (j = SECONDARY_LDT_BUS_NUMBER(ldtbus);
-				     j <= SUBORDINATE_LDT_BUS_NUMBER(ldtbus);
-				     j++) { 
-					struct pci_bus *bus;
-					struct pci_sysdata *sd;
-
-					long node = NODE_ID(nid);
-					/* Algorithm a bit dumb, but
- 					   it shouldn't matter here */
-					bus = pci_find_bus(0, j);
-					if (!bus)
-						continue;
-					if (!node_online(node))
-						node = 0;
-
-					sd = bus->sysdata;
-					sd->node = node;
-				}		
+	int i;
+	int j;
+	unsigned bus;
+	unsigned slot;
+	int found;
+	int node;
+	int link;
+	int def_node;
+	int def_link;
+	struct pci_root_info *info;
+	u32 reg;
+	struct resource *res;
+	size_t start;
+	size_t end;
+	struct res_range range[RANGE_NUM];
+	u64 val;
+	u32 address;
+
+#ifdef CONFIG_NUMA
+	for (i = 0; i < BUS_NR; i++)
+		mp_bus_to_node[i] = -1;
+#endif
+
+	if (!early_pci_allowed())
+		return -1;
+
+	found = 0;
+	for (i = 0; i < ARRAY_SIZE(pci_probes); i++) {
+		u32 id;
+		u16 device;
+		u16 vendor;
+
+		bus = pci_probes[i].bus;
+		slot = pci_probes[i].slot;
+		id = read_pci_config(bus, slot, 0, PCI_VENDOR_ID);
+
+		vendor = id & 0xffff;
+		device = (id>>16) & 0xffff;
+		if (pci_probes[i].vendor == vendor &&
+		    pci_probes[i].device == device) {
+			found = 1;
+			break;
+		}
+	}
+
+	if (!found)
+		return 0;
+
+	pci_root_num = 0;
+	for (i = 0; i < 4; i++) {
+		int min_bus;
+		int max_bus;
+		reg = read_pci_config(bus, slot, 1, 0xe0 + (i << 2));
+
+		/* Check if that register is enabled for bus range */
+		if ((reg & 7) != 3)
+			continue;
+
+		min_bus = (reg >> 16) & 0xff;
+		max_bus = (reg >> 24) & 0xff;
+		node = (reg >> 4) & 0x07;
+#ifdef CONFIG_NUMA
+		for (j = min_bus; j <= max_bus; j++)
+			mp_bus_to_node[j] = (unsigned char) node;
+#endif
+		link = (reg >> 8) & 0x03;
+
+		info = &pci_root_info[pci_root_num];
+		info->bus_min = min_bus;
+		info->bus_max = max_bus;
+		info->node = node;
+		info->link = link;
+		sprintf(info->name, "PCI Bus #%02x", min_bus);
+		pci_root_num++;
+	}
+
+	/* get the default node and link for left over res */
+	reg = read_pci_config(bus, slot, 0, 0x60);
+	def_node = (reg >> 8) & 0x07;
+	reg = read_pci_config(bus, slot, 0, 0x64);
+	def_link = (reg >> 8) & 0x03;
+
+	memset(range, 0, sizeof(range));
+	range[0].end = 0xffff;
+	/* io port resource */
+	for (i = 0; i < 4; i++) {
+		reg = read_pci_config(bus, slot, 1, 0xc0 + (i << 3));
+		if (!(reg & 3))
+			continue;
+
+		start = reg & 0xfff000;
+		reg = read_pci_config(bus, slot, 1, 0xc4 + (i << 3));
+		node = reg & 0x07;
+		link = (reg >> 4) & 0x03;
+		end = (reg & 0xfff000) | 0xfff;
+
+		/* find the position */
+		for (j = 0; j < pci_root_num; j++) {
+			info = &pci_root_info[j];
+			if (info->node == node && info->link == link)
+				break;
+		}
+		if (j == pci_root_num)
+			continue; /* not found */
+
+		info = &pci_root_info[j];
+		printk(KERN_DEBUG "node %d link %d: io port [%llx, %llx]\n",
+		       node, link, (u64)start, (u64)end);
+
+		/* kernel only handle 16 bit only */
+		if (end > 0xffff)
+			end = 0xffff;
+		update_res(info, start, end, IORESOURCE_IO, 1);
+		update_range(range, start, end);
+	}
+	/* add left over io port range to def node/link, [0, 0xffff] */
+	/* find the position */
+	for (j = 0; j < pci_root_num; j++) {
+		info = &pci_root_info[j];
+		if (info->node == def_node && info->link == def_link)
+			break;
+	}
+	if (j < pci_root_num) {
+		info = &pci_root_info[j];
+		for (i = 0; i < RANGE_NUM; i++) {
+			if (!range[i].end)
+				continue;
+
+			update_res(info, range[i].start, range[i].end,
+				   IORESOURCE_IO, 1);
+		}
+	}
+
+	memset(range, 0, sizeof(range));
+	/* 0xfd00000000-0xffffffffff for HT */
+	range[0].end = (0xfdULL<<32) - 1;
+
+	/* need to take out [0, TOM) for RAM*/
+	address = MSR_K8_TOP_MEM1;
+	rdmsrl(address, val);
+	end = (val & 0xffffff8000000ULL);
+	printk(KERN_INFO "TOM: %016lx aka %ldM\n", end, end>>20);
+	if (end < (1ULL<<32))
+		update_range(range, 0, end - 1);
+
+	/* get mmconfig */
+	get_pci_mmcfg_amd_fam10h_range();
+	/* need to take out mmconf range */
+	if (fam10h_mmconf_end) {
+		printk(KERN_DEBUG "Fam 10h mmconf [%llx, %llx]\n", fam10h_mmconf_start, fam10h_mmconf_end);
+		update_range(range, fam10h_mmconf_start, fam10h_mmconf_end);
+	}
+
+	/* mmio resource */
+	for (i = 0; i < 8; i++) {
+		reg = read_pci_config(bus, slot, 1, 0x80 + (i << 3));
+		if (!(reg & 3))
+			continue;
+
+		start = reg & 0xffffff00; /* 39:16 on 31:8*/
+		start <<= 8;
+		reg = read_pci_config(bus, slot, 1, 0x84 + (i << 3));
+		node = reg & 0x07;
+		link = (reg >> 4) & 0x03;
+		end = (reg & 0xffffff00);
+		end <<= 8;
+		end |= 0xffff;
+
+		/* find the position */
+		for (j = 0; j < pci_root_num; j++) {
+			info = &pci_root_info[j];
+			if (info->node == node && info->link == link)
+				break;
+		}
+		if (j == pci_root_num)
+			continue; /* not found */
+
+		info = &pci_root_info[j];
+
+		printk(KERN_DEBUG "node %d link %d: mmio [%llx, %llx]",
+		       node, link, (u64)start, (u64)end);
+		/*
+		 * some sick allocation would have range overlap with fam10h
+		 * mmconf range, so need to update start and end.
+		 */
+		if (fam10h_mmconf_end) {
+			int changed = 0;
+			u64 endx = 0;
+			if (start >= fam10h_mmconf_start &&
+			    start <= fam10h_mmconf_end) {
+				start = fam10h_mmconf_end + 1;
+				changed = 1;
+			}
+
+			if (end >= fam10h_mmconf_start &&
+			    end <= fam10h_mmconf_end) {
+				end = fam10h_mmconf_start - 1;
+				changed = 1;
+			}
+
+			if (start < fam10h_mmconf_start &&
+			    end > fam10h_mmconf_end) {
+				/* we got a hole */
+				endx = fam10h_mmconf_start - 1;
+				update_res(info, start, endx, IORESOURCE_MEM, 0);
+				update_range(range, start, endx);
+				printk(KERN_CONT " ==> [%llx, %llx]", (u64)start, endx);
+				start = fam10h_mmconf_end + 1;
+				changed = 1;
+			}
+			if (changed) {
+				if (start <= end) {
+					printk(KERN_CONT " %s [%llx, %llx]", endx?"and":"==>", (u64)start, (u64)end);
+				} else {
+					printk(KERN_CONT "%s\n", endx?"":" ==> none");
+					continue;
+				}
 			}
 		}
+
+		update_res(info, start, end, IORESOURCE_MEM, 1);
+		update_range(range, start, end);
+		printk(KERN_CONT "\n");
+	}
+
+	/* need to take out [4G, TOM2) for RAM*/
+	/* SYS_CFG */
+	address = MSR_K8_SYSCFG;
+	rdmsrl(address, val);
+	/* TOP_MEM2 is enabled? */
+	if (val & (1<<21)) {
+		/* TOP_MEM2 */
+		address = MSR_K8_TOP_MEM2;
+		rdmsrl(address, val);
+		end = (val & 0xffffff8000000ULL);
+		printk(KERN_INFO "TOM2: %016lx aka %ldM\n", end, end>>20);
+		update_range(range, 1ULL<<32, end - 1);
+	}
+
+	/*
+	 * add left over mmio range to def node/link ?
+	 * that is tricky, just record range in from start_min to 4G
+	 */
+	for (j = 0; j < pci_root_num; j++) {
+		info = &pci_root_info[j];
+		if (info->node == def_node && info->link == def_link)
+			break;
+	}
+	if (j < pci_root_num) {
+		info = &pci_root_info[j];
+
+		for (i = 0; i < RANGE_NUM; i++) {
+			if (!range[i].end)
+				continue;
+#if 0
+			/* don't use last one near 4G */
+			if (range[i].end == 0xffffffffULL)
+				continue;
+#endif
+
+			update_res(info, range[i].start, range[i].end,
+				   IORESOURCE_MEM, 1);
+		}
+	}
+
+#ifdef CONFIG_NUMA
+	for (i = 0; i < BUS_NR; i++) {
+		node = mp_bus_to_node[i];
+		if (node >= 0)
+			printk(KERN_DEBUG "bus: %02x to node: %02x\n", i, node);
+	}
+#endif
+
+	for (i = 0; i < pci_root_num; i++) {
+		int res_num;
+		int busnum;
+
+		info = &pci_root_info[i];
+		res_num = info->res_num;
+		busnum = info->bus_min;
+		printk(KERN_DEBUG "bus: [%02x,%02x] on node %x link %x\n",
+		       info->bus_min, info->bus_max, info->node, info->link);
+		for (j = 0; j < res_num; j++) {
+			res = &info->res[j];
+			printk(KERN_DEBUG "bus: %02x index %x %s: [%llx, %llx]\n",
+			       busnum, j,
+			       (res->flags & IORESOURCE_IO)?"io port":"mmio",
+			       res->start, res->end);
+		}
 	}
 
 	return 0;
 }
 
-fs_initcall(fill_mp_bus_to_cpumask);
+postcore_initcall(early_fill_mp_bus_info);
diff --git a/arch/x86/pci/legacy.c b/arch/x86/pci/legacy.c
index e041ced..a67921c 100644
--- a/arch/x86/pci/legacy.c
+++ b/arch/x86/pci/legacy.c
@@ -12,6 +12,7 @@
 static void __devinit pcibios_fixup_peer_bridges(void)
 {
 	int n, devfn;
+	long node;
 
 	if (pcibios_last_bus <= 0 || pcibios_last_bus >= 0xff)
 		return;
@@ -21,12 +22,13 @@ static void __devinit pcibios_fixup_peer_bridges(void)
 		u32 l;
 		if (pci_find_bus(0, n))
 			continue;
+		node = get_mp_bus_to_node(n);
 		for (devfn = 0; devfn < 256; devfn += 8) {
 			if (!raw_pci_read(0, n, devfn, PCI_VENDOR_ID, 2, &l) &&
 			    l != 0x0000 && l != 0xffff) {
 				DBG("Found device at %02x:%02x [%04x]\n", n, devfn, l);
 				printk(KERN_INFO "PCI: Discovered peer bus %02x\n", n);
-				pci_scan_bus_with_sysdata(n);
+				pci_scan_bus_on_node(n, &pci_root_ops, node);
 				break;
 			}
 		}
diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c
index 8d54df4..3fdee2d 100644
--- a/arch/x86/pci/mmconfig-shared.c
+++ b/arch/x86/pci/mmconfig-shared.c
@@ -28,7 +28,7 @@ static int __initdata pci_mmcfg_resources_inserted;
 static const char __init *pci_mmcfg_e7520(void)
 {
 	u32 win;
-	pci_direct_conf1.read(0, 0, PCI_DEVFN(0,0), 0xce, 2, &win);
+	raw_pci_ops->read(0, 0, PCI_DEVFN(0, 0), 0xce, 2, &win);
 
 	win = win & 0xf000;
 	if(win == 0x0000 || win == 0xf000)
@@ -53,7 +53,7 @@ static const char __init *pci_mmcfg_intel_945(void)
 
 	pci_mmcfg_config_num = 1;
 
-	pci_direct_conf1.read(0, 0, PCI_DEVFN(0,0), 0x48, 4, &pciexbar);
+	raw_pci_ops->read(0, 0, PCI_DEVFN(0, 0), 0x48, 4, &pciexbar);
 
 	/* Enable bit */
 	if (!(pciexbar & 1))
@@ -100,33 +100,102 @@ static const char __init *pci_mmcfg_intel_945(void)
 	return "Intel Corporation 945G/GZ/P/PL Express Memory Controller Hub";
 }
 
+static const char __init *pci_mmcfg_amd_fam10h(void)
+{
+	u32 low, high, address;
+	u64 base, msr;
+	int i;
+	unsigned segnbits = 0, busnbits;
+
+	if (!(pci_probe & PCI_CHECK_ENABLE_AMD_MMCONF))
+		return NULL;
+
+	address = MSR_FAM10H_MMIO_CONF_BASE;
+	if (rdmsr_safe(address, &low, &high))
+		return NULL;
+
+	msr = high;
+	msr <<= 32;
+	msr |= low;
+
+	/* mmconfig is not enable */
+	if (!(msr & FAM10H_MMIO_CONF_ENABLE))
+		return NULL;
+
+	base = msr & (FAM10H_MMIO_CONF_BASE_MASK<<FAM10H_MMIO_CONF_BASE_SHIFT);
+
+	busnbits = (msr >> FAM10H_MMIO_CONF_BUSRANGE_SHIFT) &
+			 FAM10H_MMIO_CONF_BUSRANGE_MASK;
+
+	/*
+	 * only handle bus 0 ?
+	 * need to skip it
+	 */
+	if (!busnbits)
+		return NULL;
+
+	if (busnbits > 8) {
+		segnbits = busnbits - 8;
+		busnbits = 8;
+	}
+
+	pci_mmcfg_config_num = (1 << segnbits);
+	pci_mmcfg_config = kzalloc(sizeof(pci_mmcfg_config[0]) *
+				   pci_mmcfg_config_num, GFP_KERNEL);
+	if (!pci_mmcfg_config)
+		return NULL;
+
+	for (i = 0; i < (1 << segnbits); i++) {
+		pci_mmcfg_config[i].address = base + (1<<28) * i;
+		pci_mmcfg_config[i].pci_segment = i;
+		pci_mmcfg_config[i].start_bus_number = 0;
+		pci_mmcfg_config[i].end_bus_number = (1 << busnbits) - 1;
+	}
+
+	return "AMD Family 10h NB";
+}
+
 struct pci_mmcfg_hostbridge_probe {
+	u32 bus;
+	u32 devfn;
 	u32 vendor;
 	u32 device;
 	const char *(*probe)(void);
 };
 
 static struct pci_mmcfg_hostbridge_probe pci_mmcfg_probes[] __initdata = {
-	{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_E7520_MCH, pci_mmcfg_e7520 },
-	{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82945G_HB, pci_mmcfg_intel_945 },
+	{ 0, PCI_DEVFN(0, 0), PCI_VENDOR_ID_INTEL,
+	  PCI_DEVICE_ID_INTEL_E7520_MCH, pci_mmcfg_e7520 },
+	{ 0, PCI_DEVFN(0, 0), PCI_VENDOR_ID_INTEL,
+	  PCI_DEVICE_ID_INTEL_82945G_HB, pci_mmcfg_intel_945 },
+	{ 0, PCI_DEVFN(0x18, 0), PCI_VENDOR_ID_AMD,
+	  0x1200, pci_mmcfg_amd_fam10h },
+	{ 0xff, PCI_DEVFN(0, 0), PCI_VENDOR_ID_AMD,
+	  0x1200, pci_mmcfg_amd_fam10h },
 };
 
 static int __init pci_mmcfg_check_hostbridge(void)
 {
 	u32 l;
+	u32 bus, devfn;
 	u16 vendor, device;
 	int i;
 	const char *name;
 
-	pci_direct_conf1.read(0, 0, PCI_DEVFN(0,0), 0, 4, &l);
-	vendor = l & 0xffff;
-	device = (l >> 16) & 0xffff;
+	if (!raw_pci_ops)
+		return 0;
 
 	pci_mmcfg_config_num = 0;
 	pci_mmcfg_config = NULL;
 	name = NULL;
 
 	for (i = 0; !name && i < ARRAY_SIZE(pci_mmcfg_probes); i++) {
+		bus =  pci_mmcfg_probes[i].bus;
+		devfn = pci_mmcfg_probes[i].devfn;
+		raw_pci_ops->read(0, bus, devfn, 0, 4, &l);
+		vendor = l & 0xffff;
+		device = (l >> 16) & 0xffff;
+
 		if (pci_mmcfg_probes[i].vendor == vendor &&
 		    pci_mmcfg_probes[i].device == device)
 			name = pci_mmcfg_probes[i].probe();
@@ -173,9 +242,78 @@ static void __init pci_mmcfg_insert_resources(unsigned long resource_flags)
 	pci_mmcfg_resources_inserted = 1;
 }
 
-static void __init pci_mmcfg_reject_broken(int type)
+static acpi_status __init check_mcfg_resource(struct acpi_resource *res,
+					      void *data)
+{
+	struct resource *mcfg_res = data;
+	struct acpi_resource_address64 address;
+	acpi_status status;
+
+	if (res->type == ACPI_RESOURCE_TYPE_FIXED_MEMORY32) {
+		struct acpi_resource_fixed_memory32 *fixmem32 =
+			&res->data.fixed_memory32;
+		if (!fixmem32)
+			return AE_OK;
+		if ((mcfg_res->start >= fixmem32->address) &&
+		    (mcfg_res->end < (fixmem32->address +
+				      fixmem32->address_length))) {
+			mcfg_res->flags = 1;
+			return AE_CTRL_TERMINATE;
+		}
+	}
+	if ((res->type != ACPI_RESOURCE_TYPE_ADDRESS32) &&
+	    (res->type != ACPI_RESOURCE_TYPE_ADDRESS64))
+		return AE_OK;
+
+	status = acpi_resource_to_address64(res, &address);
+	if (ACPI_FAILURE(status) ||
+	   (address.address_length <= 0) ||
+	   (address.resource_type != ACPI_MEMORY_RANGE))
+		return AE_OK;
+
+	if ((mcfg_res->start >= address.minimum) &&
+	    (mcfg_res->end < (address.minimum + address.address_length))) {
+		mcfg_res->flags = 1;
+		return AE_CTRL_TERMINATE;
+	}
+	return AE_OK;
+}
+
+static acpi_status __init find_mboard_resource(acpi_handle handle, u32 lvl,
+		void *context, void **rv)
+{
+	struct resource *mcfg_res = context;
+
+	acpi_walk_resources(handle, METHOD_NAME__CRS,
+			    check_mcfg_resource, context);
+
+	if (mcfg_res->flags)
+		return AE_CTRL_TERMINATE;
+
+	return AE_OK;
+}
+
+static int __init is_acpi_reserved(unsigned long start, unsigned long end)
+{
+	struct resource mcfg_res;
+
+	mcfg_res.start = start;
+	mcfg_res.end = end;
+	mcfg_res.flags = 0;
+
+	acpi_get_devices("PNP0C01", find_mboard_resource, &mcfg_res, NULL);
+
+	if (!mcfg_res.flags)
+		acpi_get_devices("PNP0C02", find_mboard_resource, &mcfg_res,
+				 NULL);
+
+	return mcfg_res.flags;
+}
+
+static void __init pci_mmcfg_reject_broken(int early)
 {
 	typeof(pci_mmcfg_config[0]) *cfg;
+	int i;
 
 	if ((pci_mmcfg_config_num == 0) ||
 	    (pci_mmcfg_config == NULL) ||
@@ -184,51 +322,85 @@ static void __init pci_mmcfg_reject_broken(int type)
 
 	cfg = &pci_mmcfg_config[0];
 
-	/*
-	 * Handle more broken MCFG tables on Asus etc.
-	 * They only contain a single entry for bus 0-0.
-	 */
-	if (pci_mmcfg_config_num == 1 &&
-	    cfg->pci_segment == 0 &&
-	    (cfg->start_bus_number | cfg->end_bus_number) == 0) {
-		printk(KERN_ERR "PCI: start and end of bus number is 0. "
-		       "Rejected as broken MCFG.\n");
-		goto reject;
+	for (i = 0; i < pci_mmcfg_config_num; i++) {
+		int valid = 0;
+		u32 size = (cfg->end_bus_number + 1) << 20;
+		cfg = &pci_mmcfg_config[i];
+		printk(KERN_NOTICE "PCI: MCFG configuration %d: base %lx "
+		       "segment %hu buses %u - %u\n",
+		       i, (unsigned long)cfg->address, cfg->pci_segment,
+		       (unsigned int)cfg->start_bus_number,
+		       (unsigned int)cfg->end_bus_number);
+
+		if (!early &&
+		    is_acpi_reserved(cfg->address, cfg->address + size - 1)) {
+			printk(KERN_NOTICE "PCI: MCFG area at %Lx reserved "
+			       "in ACPI motherboard resources\n",
+			       cfg->address);
+			valid = 1;
+		}
+
+		if (valid)
+			continue;
+
+		if (!early)
+			printk(KERN_ERR "PCI: BIOS Bug: MCFG area at %Lx is not"
+			       " reserved in ACPI motherboard resources\n",
+			       cfg->address);
+		/* Don't try to do this check unless configuration
+		   type 1 is available. how about type 2 ?*/
+		if (raw_pci_ops && e820_all_mapped(cfg->address,
+						  cfg->address + size - 1,
+						  E820_RESERVED)) {
+			printk(KERN_NOTICE
+			       "PCI: MCFG area at %Lx reserved in E820\n",
+			       cfg->address);
+			valid = 1;
+		}
+
+		if (!valid)
+			goto reject;
 	}
 
-	/*
-	 * Only do this check when type 1 works. If it doesn't work
-	 * assume we run on a Mac and always use MCFG
-	 */
-	if (type == 1 && !e820_all_mapped(cfg->address,
-					  cfg->address + MMCONFIG_APER_MIN,
-					  E820_RESERVED)) {
-		printk(KERN_ERR "PCI: BIOS Bug: MCFG area at %Lx is not"
-		       " E820-reserved\n", cfg->address);
-		goto reject;
-	}
 	return;
 
 reject:
 	printk(KERN_ERR "PCI: Not using MMCONFIG.\n");
+	pci_mmcfg_arch_free();
 	kfree(pci_mmcfg_config);
 	pci_mmcfg_config = NULL;
 	pci_mmcfg_config_num = 0;
 }
 
-void __init pci_mmcfg_init(int type)
-{
-	int known_bridge = 0;
+static int __initdata known_bridge;
 
+void __init __pci_mmcfg_init(int early)
+{
+	/* MMCONFIG disabled */
 	if ((pci_probe & PCI_PROBE_MMCONF) == 0)
 		return;
 
-	if (type == 1 && pci_mmcfg_check_hostbridge())
-		known_bridge = 1;
+	/* MMCONFIG already enabled */
+	if (!early && !(pci_probe & PCI_PROBE_MASK & ~PCI_PROBE_MMCONF))
+		return;
+
+	/* for late to exit */
+	if (known_bridge)
+		return;
+
+	if (early) {
+		if (pci_mmcfg_check_hostbridge())
+			known_bridge = 1;
+#if 0
+	/* check e820 in late? */
+		else
+			return;
+#endif
+	}
 
 	if (!known_bridge) {
 		acpi_table_parse(ACPI_SIG_MCFG, acpi_parse_mcfg);
-		pci_mmcfg_reject_broken(type);
+		pci_mmcfg_reject_broken(early);
 	}
 
 	if ((pci_mmcfg_config_num == 0) ||
@@ -249,6 +421,16 @@ void __init pci_mmcfg_init(int type)
 	}
 }
 
+void __init pci_mmcfg_early_init(void)
+{
+	__pci_mmcfg_init(1);
+}
+
+void __init pci_mmcfg_late_init(void)
+{
+	__pci_mmcfg_init(0);
+}
+
 static int __init pci_mmcfg_late_insert_resources(void)
 {
 	/*
diff --git a/arch/x86/pci/mmconfig_32.c b/arch/x86/pci/mmconfig_32.c
index 081816a..f3c761d 100644
--- a/arch/x86/pci/mmconfig_32.c
+++ b/arch/x86/pci/mmconfig_32.c
@@ -136,3 +136,7 @@ int __init pci_mmcfg_arch_init(void)
 	raw_pci_ext_ops = &pci_mmcfg;
 	return 1;
 }
+
+void __init pci_mmcfg_arch_free(void)
+{
+}
diff --git a/arch/x86/pci/mmconfig_64.c b/arch/x86/pci/mmconfig_64.c
index 9207fd4..a199416 100644
--- a/arch/x86/pci/mmconfig_64.c
+++ b/arch/x86/pci/mmconfig_64.c
@@ -127,7 +127,7 @@ static void __iomem * __init mcfg_ioremap(struct acpi_mcfg_allocation *cfg)
 int __init pci_mmcfg_arch_init(void)
 {
 	int i;
-	pci_mmcfg_virt = kmalloc(sizeof(*pci_mmcfg_virt) *
+	pci_mmcfg_virt = kzalloc(sizeof(*pci_mmcfg_virt) *
 				 pci_mmcfg_config_num, GFP_KERNEL);
 	if (pci_mmcfg_virt == NULL) {
 		printk(KERN_ERR "PCI: Can not allocate memory for mmconfig structures\n");
@@ -141,9 +141,29 @@ int __init pci_mmcfg_arch_init(void)
 			printk(KERN_ERR "PCI: Cannot map mmconfig aperture for "
 					"segment %d\n",
 				pci_mmcfg_config[i].pci_segment);
+			pci_mmcfg_arch_free();
 			return 0;
 		}
 	}
 	raw_pci_ext_ops = &pci_mmcfg;
 	return 1;
 }
+
+void __init pci_mmcfg_arch_free(void)
+{
+	int i;
+
+	if (pci_mmcfg_virt == NULL)
+		return;
+
+	for (i = 0; i < pci_mmcfg_config_num; ++i) {
+		if (pci_mmcfg_virt[i].virt) {
+			iounmap(pci_mmcfg_virt[i].virt);
+			pci_mmcfg_virt[i].virt = NULL;
+			pci_mmcfg_virt[i].cfg = NULL;
+		}
+	}
+
+	kfree(pci_mmcfg_virt);
+	pci_mmcfg_virt = NULL;
+}
diff --git a/arch/x86/pci/mp_bus_to_node.c b/arch/x86/pci/mp_bus_to_node.c
new file mode 100644
index 0000000..0229439
--- /dev/null
+++ b/arch/x86/pci/mp_bus_to_node.c
@@ -0,0 +1,23 @@
+#include <linux/pci.h>
+#include <linux/init.h>
+#include <linux/topology.h>
+
+#define BUS_NR 256
+
+static unsigned char mp_bus_to_node[BUS_NR];
+
+void set_mp_bus_to_node(int busnum, int node)
+{
+	if (busnum >= 0 &&  busnum < BUS_NR)
+	mp_bus_to_node[busnum] = (unsigned char) node;
+}
+
+int get_mp_bus_to_node(int busnum)
+{
+	int node;
+
+	if (busnum < 0 || busnum > (BUS_NR - 1))
+		return 0;
+	node = mp_bus_to_node[busnum];
+	return node;
+}
diff --git a/arch/x86/pci/pci.h b/arch/x86/pci/pci.h
index c4bddae..3b0db9b 100644
--- a/arch/x86/pci/pci.h
+++ b/arch/x86/pci/pci.h
@@ -26,6 +26,7 @@
 #define PCI_ASSIGN_ALL_BUSSES	0x4000
 #define PCI_CAN_SKIP_ISA_ALIGN	0x8000
 #define PCI_USE__CRS		0x10000
+#define PCI_CHECK_ENABLE_AMD_MMCONF	0x20000
 
 extern unsigned int pci_probe;
 extern unsigned long pirq_table_addr;
@@ -37,6 +38,9 @@ enum pci_bf_sort_state {
 	pci_dmi_bf,
 };
 
+extern void __init dmi_check_pciprobe(void);
+extern void __init dmi_check_skip_isa_align(void);
+
 /* pci-i386.c */
 
 extern unsigned int pcibios_max_latency;
@@ -97,11 +101,11 @@ extern struct pci_raw_ops pci_direct_conf1;
 extern int pci_direct_probe(void);
 extern void pci_direct_init(int type);
 extern void pci_pcbios_init(void);
-extern void pci_mmcfg_init(int type);
 
 /* pci-mmconfig.c */
 
 extern int __init pci_mmcfg_arch_init(void);
+extern void __init pci_mmcfg_arch_free(void);
 
 /*
  * AMD Fam10h CPUs are buggy, and cannot access MMIO config space
diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index 2d1955c..a6dbcf4 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -35,6 +35,7 @@
 #ifdef CONFIG_X86
 #include <asm/mpspec.h>
 #endif
+#include <linux/pci.h>
 #include <acpi/acpi_bus.h>
 #include <acpi/acpi_drivers.h>
 
@@ -784,6 +785,7 @@ static int __init acpi_init(void)
 	result = acpi_bus_init();
 
 	if (!result) {
+		pci_mmcfg_late_init();
 		if (!(pm_flags & PM_APM))
 			pm_flags |= PM_ACPI;
 		else {
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 9248e09..be288b5 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -787,6 +787,10 @@ int device_add(struct device *dev)
 	parent = get_device(dev->parent);
 	setup_parent(dev, parent);
 
+	/* use parent numa_node */
+	if (parent)
+		set_dev_node(dev, dev_to_node(parent));
+
 	/* first, register with generic layer. */
 	error = kobject_add(&dev->kobj, dev->kobj.parent, "%s", dev->bus_id);
 	if (error)
@@ -1306,8 +1310,11 @@ int device_move(struct device *dev, struct device *new_parent)
 	dev->parent = new_parent;
 	if (old_parent)
 		klist_remove(&dev->knode_parent);
-	if (new_parent)
+	if (new_parent) {
 		klist_add_tail(&dev->knode_parent, &new_parent->klist_children);
+		set_dev_node(dev, dev_to_node(new_parent));
+	}
+
 	if (!dev->class)
 		goto out_put;
 	error = device_move_class_links(dev, old_parent, new_parent);
@@ -1317,9 +1324,12 @@ int device_move(struct device *dev, struct device *new_parent)
 		if (!kobject_move(&dev->kobj, &old_parent->kobj)) {
 			if (new_parent)
 				klist_remove(&dev->knode_parent);
-			if (old_parent)
+			dev->parent = old_parent;
+			if (old_parent) {
 				klist_add_tail(&dev->knode_parent,
 					       &old_parent->klist_children);
+				set_dev_node(dev, dev_to_node(old_parent));
+			}
 		}
 		cleanup_glue_dir(dev, new_parent_kobj);
 		put_device(new_parent);
diff --git a/drivers/char/agp/amd64-agp.c b/drivers/char/agp/amd64-agp.c
index d8200ac..e77b321 100644
--- a/drivers/char/agp/amd64-agp.c
+++ b/drivers/char/agp/amd64-agp.c
@@ -264,11 +264,7 @@ static int __devinit aperture_valid(u64 aper, u32 size)
 		printk(KERN_ERR PFX "No aperture\n");
 		return 0;
 	}
-	if (size < 32*1024*1024) {
-		printk(KERN_ERR PFX "Aperture too small (%d MB)\n", size>>20);
-		return 0;
-	}
-       if ((u64)aper + size > 0x100000000ULL) {
+	if ((u64)aper + size > 0x100000000ULL) {
 		printk(KERN_ERR PFX "Aperture out of bounds\n");
 		return 0;
 	}
@@ -276,6 +272,10 @@ static int __devinit aperture_valid(u64 aper, u32 size)
 		printk(KERN_ERR PFX "Aperture pointing to RAM\n");
 		return 0;
 	}
+	if (size < 32*1024*1024) {
+		printk(KERN_ERR PFX "Aperture too small (%d MB)\n", size>>20);
+		return 0;
+	}
 
 	/* Request the Aperture. This catches cases when someone else
 	   already put a mapping in there - happens with some very broken BIOS
@@ -331,6 +331,17 @@ static __devinit int fix_northbridge(struct pci_dev *nb, struct pci_dev *agp,
 	pci_read_config_dword(agp, 0x10, &aper_low);
 	pci_read_config_dword(agp, 0x14, &aper_hi);
 	aper = (aper_low & ~((1<<22)-1)) | ((u64)aper_hi << 32);
+
+	/*
+	 * On some sick chips APSIZE is 0. This means it wants 4G
+	 * so let double check that order, and lets trust the AMD NB settings
+	 */
+	if (order >=0 && aper + (32ULL<<(20 + order)) > 0x100000000ULL) {
+		printk(KERN_INFO "Aperture size %u MB is not right, using settings from NB\n",
+				  32 << order);
+		order = nb_order;
+	}
+
 	printk(KERN_INFO PFX "Aperture from AGP @ %Lx size %u MB\n", aper, 32 << order);
 	if (order < 0 || !aperture_valid(aper, (32*1024*1024)<<order))
 		return -1;
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index f991359..4a55bf3 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -842,11 +842,14 @@ static void set_pcie_port_type(struct pci_dev *pdev)
  * reading the dword at 0x100 which must either be 0 or a valid extended
  * capability header.
  */
-int pci_cfg_space_size(struct pci_dev *dev)
+int pci_cfg_space_size_ext(struct pci_dev *dev, unsigned check_exp_pcix)
 {
 	int pos;
 	u32 status;
 
+	if (!check_exp_pcix)
+		goto skip;
+
 	pos = pci_find_capability(dev, PCI_CAP_ID_EXP);
 	if (!pos) {
 		pos = pci_find_capability(dev, PCI_CAP_ID_PCIX);
@@ -858,6 +861,7 @@ int pci_cfg_space_size(struct pci_dev *dev)
 			goto fail;
 	}
 
+ skip:
 	if (pci_read_config_dword(dev, 256, &status) != PCIBIOS_SUCCESSFUL)
 		goto fail;
 	if (status == 0xffffffff)
@@ -869,6 +873,11 @@ int pci_cfg_space_size(struct pci_dev *dev)
 	return PCI_CFG_SPACE_SIZE;
 }
 
+int pci_cfg_space_size(struct pci_dev *dev)
+{
+	return pci_cfg_space_size_ext(dev, 1);
+}
+
 static void pci_release_bus_bridge_dev(struct device *dev)
 {
 	kfree(dev);
@@ -964,7 +973,6 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
 	dev->dev.release = pci_release_dev;
 	pci_dev_get(dev);
 
-	set_dev_node(&dev->dev, pcibus_to_node(bus));
 	dev->dev.dma_mask = &dev->dma_mask;
 	dev->dev.dma_parms = &dev->dma_parms;
 	dev->dev.coherent_dma_mask = 0xffffffffull;
@@ -1080,6 +1088,10 @@ unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus)
 	return max;
 }
 
+void __attribute__((weak)) set_pci_bus_resources_arch_default(struct pci_bus *b)
+{
+}
+
 struct pci_bus * pci_create_bus(struct device *parent,
 		int bus, struct pci_ops *ops, void *sysdata)
 {
@@ -1119,6 +1131,9 @@ struct pci_bus * pci_create_bus(struct device *parent,
 		goto dev_reg_err;
 	b->bridge = get_device(dev);
 
+	if (!parent)
+		set_dev_node(b->bridge, pcibus_to_node(b));
+
 	b->dev.class = &pcibus_class;
 	b->dev.parent = b->bridge;
 	sprintf(b->dev.bus_id, "%04x:%02x", pci_domain_nr(b), bus);
@@ -1136,6 +1151,8 @@ struct pci_bus * pci_create_bus(struct device *parent,
 	b->resource[0] = &ioport_resource;
 	b->resource[1] = &iomem_resource;
 
+	set_pci_bus_resources_arch_default(b);
+
 	return b;
 
 dev_create_file_err:
diff --git a/include/asm-x86/bootparam.h b/include/asm-x86/bootparam.h
index 5115135..e865990 100644
--- a/include/asm-x86/bootparam.h
+++ b/include/asm-x86/bootparam.h
@@ -9,6 +9,17 @@
 #include <asm/ist.h>
 #include <video/edid.h>
 
+/* setup data types */
+#define SETUP_NONE			0
+
+/* extensible setup data list node */
+struct setup_data {
+	u64 next;
+	u32 type;
+	u32 len;
+	u8 data[0];
+};
+
 struct setup_header {
 	__u8	setup_sects;
 	__u16	root_flags;
@@ -46,6 +57,9 @@ struct setup_header {
 	__u32	cmdline_size;
 	__u32	hardware_subarch;
 	__u64	hardware_subarch_data;
+	__u32	payload_offset;
+	__u32	payload_length;
+	__u64	setup_data;
 } __attribute__((packed));
 
 struct sys_desc_table {
diff --git a/include/asm-x86/e820_64.h b/include/asm-x86/e820_64.h
index f478c57..71c4d68 100644
--- a/include/asm-x86/e820_64.h
+++ b/include/asm-x86/e820_64.h
@@ -48,7 +48,8 @@ extern struct e820map e820;
 extern void update_e820(void);
 
 extern void reserve_early(unsigned long start, unsigned long end, char *name);
-extern void early_res_to_bootmem(void);
+extern void free_early(unsigned long start, unsigned long end);
+extern void early_res_to_bootmem(unsigned long start, unsigned long end);
 
 #endif/*!__ASSEMBLY__*/
 
diff --git a/include/asm-x86/pci.h b/include/asm-x86/pci.h
index ddd8e24..30bbde0 100644
--- a/include/asm-x86/pci.h
+++ b/include/asm-x86/pci.h
@@ -19,6 +19,8 @@ struct pci_sysdata {
 };
 
 /* scan a bus after allocating a pci_sysdata for it */
+extern struct pci_bus *pci_scan_bus_on_node(int busno, struct pci_ops *ops,
+					    int node);
 extern struct pci_bus *pci_scan_bus_with_sysdata(int busno);
 
 static inline int pci_domain_nr(struct pci_bus *bus)
diff --git a/include/asm-x86/topology.h b/include/asm-x86/topology.h
index 2207326..0e6d6b0 100644
--- a/include/asm-x86/topology.h
+++ b/include/asm-x86/topology.h
@@ -193,9 +193,25 @@ extern cpumask_t cpu_coregroup_map(int cpu);
 #define topology_thread_siblings(cpu)		(per_cpu(cpu_sibling_map, cpu))
 #endif
 
+struct pci_bus;
+void set_pci_bus_resources_arch_default(struct pci_bus *b);
+
 #ifdef CONFIG_SMP
 #define mc_capable()			(boot_cpu_data.x86_max_cores > 1)
 #define smt_capable()			(smp_num_siblings > 1)
 #endif
 
+#ifdef CONFIG_NUMA
+extern int get_mp_bus_to_node(int busnum);
+extern void set_mp_bus_to_node(int busnum, int node);
+#else
+static inline int get_mp_bus_to_node(int busnum)
+{
+	return 0;
+}
+static inline void set_mp_bus_to_node(int busnum, int node)
+{
+}
+#endif
+
 #endif
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 2c7e003..41f7ce7 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -79,6 +79,7 @@ typedef int (*acpi_table_handler) (struct acpi_table_header *table);
 typedef int (*acpi_table_entry_handler) (struct acpi_subtable_header *header, const unsigned long end);
 
 char * __acpi_map_table (unsigned long phys_addr, unsigned long size);
+int early_acpi_boot_init(void);
 int acpi_boot_init (void);
 int acpi_boot_table_init (void);
 int acpi_numa_init (void);
@@ -235,6 +236,10 @@ int acpi_check_mem_region(resource_size_t start, resource_size_t n,
 
 #else	/* CONFIG_ACPI */
 
+static inline int early_acpi_boot_init(void)
+{
+	return 0;
+}
 static inline int acpi_boot_init(void)
 {
 	return 0;
diff --git a/include/linux/ide.h b/include/linux/ide.h
index f20410d..13779aa 100644
--- a/include/linux/ide.h
+++ b/include/linux/ide.h
@@ -1303,8 +1303,7 @@ static inline void ide_dump_identify(u8 *id)
 
 static inline int hwif_to_node(ide_hwif_t *hwif)
 {
-	struct pci_dev *dev = to_pci_dev(hwif->dev);
-	return hwif->dev ? pcibus_to_node(dev->bus) : -1;
+	return hwif->dev ? dev_to_node(hwif->dev) : -1;
 }
 
 static inline ide_drive_t *ide_get_paired_drive(ide_drive_t *drive)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b695875..286d315 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1229,6 +1229,7 @@ void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
 int vmemmap_populate_basepages(struct page *start_page,
 						unsigned long pages, int node);
 int vmemmap_populate(struct page *start_page, unsigned long pages, int node);
+void vmemmap_populate_print_last(void);
 
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 2924913..abc998f 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -254,7 +254,7 @@ static inline void pci_add_saved_cap(struct pci_dev *pci_dev,
 #define PCI_NUM_RESOURCES	11
 
 #ifndef PCI_BUS_NUM_RESOURCES
-#define PCI_BUS_NUM_RESOURCES	8
+#define PCI_BUS_NUM_RESOURCES	16
 #endif
 
 #define PCI_REGION_FLAG_MASK	0x0fU	/* These bits of resource flags tell us the PCI region flags */
@@ -666,6 +666,7 @@ int pci_scan_bridge(struct pci_bus *bus, struct pci_dev *dev, int max,
 
 void pci_walk_bus(struct pci_bus *top, void (*cb)(struct pci_dev *, void *),
 		  void *userdata);
+int pci_cfg_space_size_ext(struct pci_dev *dev, unsigned check_exp_pcix);
 int pci_cfg_space_size(struct pci_dev *dev);
 unsigned char pci_bus_max_busnr(struct pci_bus *bus);
 
@@ -1053,5 +1054,13 @@ extern unsigned long pci_cardbus_mem_size;
 
 extern int pcibios_add_platform_entries(struct pci_dev *dev);
 
+#ifdef CONFIG_PCI_MMCONFIG
+extern void __init pci_mmcfg_early_init(void);
+extern void __init pci_mmcfg_late_init(void);
+#else
+static inline void pci_mmcfg_early_init(void) { }
+static inline void pci_mmcfg_late_init(void) { }
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* LINUX_PCI_H */
diff --git a/mm/bootmem.c b/mm/bootmem.c
index 2ccea70..590873d 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -111,44 +111,71 @@ static unsigned long __init init_bootmem_core(pg_data_t *pgdat,
  * might be used for boot-time allocations - or it might get added
  * to the free page pool later on.
  */
-static int __init reserve_bootmem_core(bootmem_data_t *bdata,
+static int __init can_reserve_bootmem_core(bootmem_data_t *bdata,
 			unsigned long addr, unsigned long size, int flags)
 {
 	unsigned long sidx, eidx;
 	unsigned long i;
-	int ret;
+
+	BUG_ON(!size);
+
+	/* out of range, don't hold other */
+	if (addr + size < bdata->node_boot_start ||
+		PFN_DOWN(addr) > bdata->node_low_pfn)
+		return 0;
 
 	/*
-	 * round up, partially reserved pages are considered
-	 * fully reserved.
+	 * Round up to index to the range.
 	 */
+	if (addr > bdata->node_boot_start)
+		sidx= PFN_DOWN(addr - bdata->node_boot_start);
+	else
+		sidx = 0;
+
+	eidx = PFN_UP(addr + size - bdata->node_boot_start);
+	if (eidx > bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start))
+		eidx = bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start);
+
+	for (i = sidx; i < eidx; i++)
+		if (test_bit(i, bdata->node_bootmem_map)) {
+			if (flags & BOOTMEM_EXCLUSIVE)
+				return -EBUSY;
+		}
+
+	return 0;
+
+}
+static void __init reserve_bootmem_core(bootmem_data_t *bdata,
+			unsigned long addr, unsigned long size, int flags)
+{
+	unsigned long sidx, eidx;
+	unsigned long i;
+
 	BUG_ON(!size);
-	BUG_ON(PFN_DOWN(addr) >= bdata->node_low_pfn);
-	BUG_ON(PFN_UP(addr + size) > bdata->node_low_pfn);
-	BUG_ON(addr < bdata->node_boot_start);
 
-	sidx = PFN_DOWN(addr - bdata->node_boot_start);
+	/* out of range */
+	if (addr + size < bdata->node_boot_start ||
+		PFN_DOWN(addr) > bdata->node_low_pfn)
+		return;
+
+	/*
+	 * Round up to index to the range.
+	 */
+	if (addr > bdata->node_boot_start)
+		sidx= PFN_DOWN(addr - bdata->node_boot_start);
+	else
+		sidx = 0;
+
 	eidx = PFN_UP(addr + size - bdata->node_boot_start);
+	if (eidx > bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start))
+		eidx = bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start);
 
 	for (i = sidx; i < eidx; i++)
 		if (test_and_set_bit(i, bdata->node_bootmem_map)) {
 #ifdef CONFIG_DEBUG_BOOTMEM
 			printk("hm, page %08lx reserved twice.\n", i*PAGE_SIZE);
 #endif
-			if (flags & BOOTMEM_EXCLUSIVE) {
-				ret = -EBUSY;
-				goto err;
-			}
 		}
-
-	return 0;
-
-err:
-	/* unreserve memory we accidentally reserved */
-	for (i--; i >= sidx; i--)
-		clear_bit(i, bdata->node_bootmem_map);
-
-	return ret;
 }
 
 static void __init free_bootmem_core(bootmem_data_t *bdata, unsigned long addr,
@@ -206,9 +233,11 @@ void * __init
 __alloc_bootmem_core(struct bootmem_data *bdata, unsigned long size,
 	      unsigned long align, unsigned long goal, unsigned long limit)
 {
-	unsigned long offset, remaining_size, areasize, preferred;
+	unsigned long areasize, preferred;
 	unsigned long i, start = 0, incr, eidx, end_pfn;
 	void *ret;
+	unsigned long node_boot_start;
+	void *node_bootmem_map;
 
 	if (!size) {
 		printk("__alloc_bootmem_core(): zero-sized request\n");
@@ -216,70 +245,83 @@ __alloc_bootmem_core(struct bootmem_data *bdata, unsigned long size,
 	}
 	BUG_ON(align & (align-1));
 
-	if (limit && bdata->node_boot_start >= limit)
-		return NULL;
-
 	/* on nodes without memory - bootmem_map is NULL */
 	if (!bdata->node_bootmem_map)
 		return NULL;
 
+	/* bdata->node_boot_start is supposed to be (12+6)bits alignment on x86_64 ? */
+	node_boot_start = bdata->node_boot_start;
+	node_bootmem_map = bdata->node_bootmem_map;
+	if (align) {
+		node_boot_start = ALIGN(bdata->node_boot_start, align);
+		if (node_boot_start > bdata->node_boot_start)
+			node_bootmem_map = (unsigned long *)bdata->node_bootmem_map +
+			    PFN_DOWN(node_boot_start - bdata->node_boot_start)/BITS_PER_LONG;
+	}
+
+	if (limit && node_boot_start >= limit)
+		return NULL;
+
 	end_pfn = bdata->node_low_pfn;
 	limit = PFN_DOWN(limit);
 	if (limit && end_pfn > limit)
 		end_pfn = limit;
 
-	eidx = end_pfn - PFN_DOWN(bdata->node_boot_start);
-	offset = 0;
-	if (align && (bdata->node_boot_start & (align - 1UL)) != 0)
-		offset = align - (bdata->node_boot_start & (align - 1UL));
-	offset = PFN_DOWN(offset);
+	eidx = end_pfn - PFN_DOWN(node_boot_start);
 
 	/*
 	 * We try to allocate bootmem pages above 'goal'
 	 * first, then we try to allocate lower pages.
 	 */
-	if (goal && goal >= bdata->node_boot_start && PFN_DOWN(goal) < end_pfn) {
-		preferred = goal - bdata->node_boot_start;
+	preferred = 0;
+	if (goal && PFN_DOWN(goal) < end_pfn) {
+		if (goal > node_boot_start)
+			preferred = goal - node_boot_start;
 
-		if (bdata->last_success >= preferred)
+		if (bdata->last_success > node_boot_start &&
+			bdata->last_success - node_boot_start >= preferred)
 			if (!limit || (limit && limit > bdata->last_success))
-				preferred = bdata->last_success;
-	} else
-		preferred = 0;
+				preferred = bdata->last_success - node_boot_start;
+	}
 
-	preferred = PFN_DOWN(ALIGN(preferred, align)) + offset;
+	preferred = PFN_DOWN(ALIGN(preferred, align));
 	areasize = (size + PAGE_SIZE-1) / PAGE_SIZE;
 	incr = align >> PAGE_SHIFT ? : 1;
 
 restart_scan:
-	for (i = preferred; i < eidx; i += incr) {
+	for (i = preferred; i < eidx;) {
 		unsigned long j;
-		i = find_next_zero_bit(bdata->node_bootmem_map, eidx, i);
+
+		i = find_next_zero_bit(node_bootmem_map, eidx, i);
 		i = ALIGN(i, incr);
 		if (i >= eidx)
 			break;
-		if (test_bit(i, bdata->node_bootmem_map))
+		if (test_bit(i, node_bootmem_map)) {
+			i += incr;
 			continue;
+		}
 		for (j = i + 1; j < i + areasize; ++j) {
 			if (j >= eidx)
 				goto fail_block;
-			if (test_bit(j, bdata->node_bootmem_map))
+			if (test_bit(j, node_bootmem_map))
 				goto fail_block;
 		}
 		start = i;
 		goto found;
 	fail_block:
 		i = ALIGN(j, incr);
+		if (i == j)
+			i += incr;
 	}
 
-	if (preferred > offset) {
-		preferred = offset;
+	if (preferred > 0) {
+		preferred = 0;
 		goto restart_scan;
 	}
 	return NULL;
 
 found:
-	bdata->last_success = PFN_PHYS(start);
+	bdata->last_success = PFN_PHYS(start) + node_boot_start;
 	BUG_ON(start >= eidx);
 
 	/*
@@ -289,6 +331,7 @@ found:
 	 */
 	if (align < PAGE_SIZE &&
 	    bdata->last_offset && bdata->last_pos+1 == start) {
+		unsigned long offset, remaining_size;
 		offset = ALIGN(bdata->last_offset, align);
 		BUG_ON(offset > PAGE_SIZE);
 		remaining_size = PAGE_SIZE - offset;
@@ -297,14 +340,12 @@ found:
 			/* last_pos unchanged */
 			bdata->last_offset = offset + size;
 			ret = phys_to_virt(bdata->last_pos * PAGE_SIZE +
-					   offset +
-					   bdata->node_boot_start);
+					   offset + node_boot_start);
 		} else {
 			remaining_size = size - remaining_size;
 			areasize = (remaining_size + PAGE_SIZE-1) / PAGE_SIZE;
 			ret = phys_to_virt(bdata->last_pos * PAGE_SIZE +
-					   offset +
-					   bdata->node_boot_start);
+					   offset + node_boot_start);
 			bdata->last_pos = start + areasize - 1;
 			bdata->last_offset = remaining_size;
 		}
@@ -312,14 +353,14 @@ found:
 	} else {
 		bdata->last_pos = start + areasize - 1;
 		bdata->last_offset = size & ~PAGE_MASK;
-		ret = phys_to_virt(start * PAGE_SIZE + bdata->node_boot_start);
+		ret = phys_to_virt(start * PAGE_SIZE + node_boot_start);
 	}
 
 	/*
 	 * Reserve the area now:
 	 */
 	for (i = start; i < start + areasize; i++)
-		if (unlikely(test_and_set_bit(i, bdata->node_bootmem_map)))
+		if (unlikely(test_and_set_bit(i, node_bootmem_map)))
 			BUG();
 	memset(ret, 0, size);
 	return ret;
@@ -401,6 +442,11 @@ unsigned long __init init_bootmem_node(pg_data_t *pgdat, unsigned long freepfn,
 void __init reserve_bootmem_node(pg_data_t *pgdat, unsigned long physaddr,
 				 unsigned long size, int flags)
 {
+	int ret;
+
+	ret = can_reserve_bootmem_core(pgdat->bdata, physaddr, size, flags);
+	if (ret < 0)
+		return;
 	reserve_bootmem_core(pgdat->bdata, physaddr, size, flags);
 }
 
@@ -426,7 +472,16 @@ unsigned long __init init_bootmem(unsigned long start, unsigned long pages)
 int __init reserve_bootmem(unsigned long addr, unsigned long size,
 			    int flags)
 {
-	return reserve_bootmem_core(NODE_DATA(0)->bdata, addr, size, flags);
+	int ret;
+	bootmem_data_t *bdata;
+	list_for_each_entry(bdata, &bdata_list, list) {
+		ret = can_reserve_bootmem_core(bdata, addr, size, flags);
+		if (ret < 0)
+			return ret;
+	}
+	list_for_each_entry(bdata, &bdata_list, list)
+		reserve_bootmem_core(bdata, addr, size, flags);
+	return 0;
 }
 #endif /* !CONFIG_HAVE_ARCH_BOOTMEM_NODE */
 
diff --git a/mm/sparse.c b/mm/sparse.c
index 98d6b39..7e91913 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -295,6 +295,9 @@ struct page __init *sparse_early_mem_map_alloc(unsigned long pnum)
 	return NULL;
 }
 
+void __attribute__((weak)) __meminit vmemmap_populate_print_last(void)
+{
+}
 /*
  * Allocate the accumulated non-linear sections, allocate a mem_map
  * for each and record the physical to section mapping.
@@ -304,22 +307,50 @@ void __init sparse_init(void)
 	unsigned long pnum;
 	struct page *map;
 	unsigned long *usemap;
+	unsigned long **usemap_map;
+	int size;
+
+	/*
+	 * map is using big page (aka 2M in x86 64 bit)
+	 * usemap is less one page (aka 24 bytes)
+	 * so alloc 2M (with 2M align) and 24 bytes in turn will
+	 * make next 2M slip to one more 2M later.
+	 * then in big system, the memory will have a lot of holes...
+	 * here try to allocate 2M pages continously.
+	 *
+	 * powerpc need to call sparse_init_one_section right after each
+	 * sparse_early_mem_map_alloc, so allocate usemap_map at first.
+	 */
+	size = sizeof(unsigned long *) * NR_MEM_SECTIONS;
+	usemap_map = alloc_bootmem(size);
+	if (!usemap_map)
+		panic("can not allocate usemap_map\n");
 
 	for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
 		if (!present_section_nr(pnum))
 			continue;
+		usemap_map[pnum] = sparse_early_usemap_alloc(pnum);
+	}
 
-		map = sparse_early_mem_map_alloc(pnum);
-		if (!map)
+	for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
+		if (!present_section_nr(pnum))
 			continue;
 
-		usemap = sparse_early_usemap_alloc(pnum);
+		usemap = usemap_map[pnum];
 		if (!usemap)
 			continue;
 
+		map = sparse_early_mem_map_alloc(pnum);
+		if (!map)
+			continue;
+
 		sparse_init_one_section(__nr_to_section(pnum), pnum, map,
 								usemap);
 	}
+
+	vmemmap_populate_print_last();
+
+	free_bootmem(__pa(usemap_map), size);
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4fe605f..94448e6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -252,7 +252,7 @@ nodata:
 struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 		unsigned int length, gfp_t gfp_mask)
 {
-	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
+	int node = dev_to_node(&dev->dev);
 	struct sk_buff *skb;
 
 	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [RFC git pull] "big box" x86 changes
  2008-04-26 18:55 [RFC git pull] "big box" x86 changes Ingo Molnar
@ 2008-04-26 19:05 ` Stefan Richter
  2008-04-26 19:21   ` Ingo Molnar
  2008-04-26 19:12 ` Linus Torvalds
  2008-04-26 22:17 ` [RFC git pull] "big box" x86 changes Andi Kleen
  2 siblings, 1 reply; 52+ messages in thread
From: Stefan Richter @ 2008-04-26 19:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, Andrew Morton, Thomas Gleixner,
	H. Peter Anvin, Yinghai Lu, Yinghai Lu, jbarnes

Ingo Molnar wrote:
>       x86: fix k8-bus_64.c build

Is it not necessary to fold this into a previous patch, to keep the tree 
bisectable?
-- 
Stefan Richter
-=====-==--- -=-- ==-=-
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC git pull] "big box" x86 changes
  2008-04-26 18:55 [RFC git pull] "big box" x86 changes Ingo Molnar
  2008-04-26 19:05 ` Stefan Richter
@ 2008-04-26 19:12 ` Linus Torvalds
  2008-04-26 19:41   ` [git pull] "big box" x86 changes, bootmem/sparsemem Ingo Molnar
                     ` (3 more replies)
  2008-04-26 22:17 ` [RFC git pull] "big box" x86 changes Andi Kleen
  2 siblings, 4 replies; 52+ messages in thread
From: Linus Torvalds @ 2008-04-26 19:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Yinghai Lu, Yinghai Lu, jbarnes



On Sat, 26 Apr 2008, Ingo Molnar wrote:
> 
> So we need a bit of help wrt. how mergable this is, and what else is 
> needed to make it mergable. These changes have been booted all across 
> the x86 spectrum, small and large boxes alike, so i'd be seriously 
> surprised if they caused any widespread breakage.

I'd like to see this sent as perhaps four independent pulls (where 
"independent" doesn't necessarily mean that they don't depend on each 
other and have some ordering, but is more of a "this is the bootup 
changes" vs "these are bootmem-related" etc, if at all possible. 

Quite frankly, from a review standpoint, very few people should need to 
(or want to) review individual commits when there are 50 of them (they 
might only care about the changes that 3 or 4 of them make, but they get 
lost in the noise), but at the same time, if you have a single diffstat 
like this:

	45 files changed, 2071 insertions(+), 362 deletions(-)

then you're also going to have people not able to give good feedback.

In contrast, if you send out an email like this one, but that has the 
combined diff of say 10 commits in one area, and the diff is a couple of 
hundred lines of changes rather than almost 2,500 lines, then you'll 
definitely get people who are able and willing to look at four chunks like 
that.

IOW, they'd be big enough that people hopefully don't start nitpicking 
about some *totally* uninteresting small detail, but small enough that 
people can read it through without losing concentration about a quarter of 
the way in.

			Linus

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC git pull] "big box" x86 changes
  2008-04-26 19:05 ` Stefan Richter
@ 2008-04-26 19:21   ` Ingo Molnar
  0 siblings, 0 replies; 52+ messages in thread
From: Ingo Molnar @ 2008-04-26 19:21 UTC (permalink / raw)
  To: Stefan Richter
  Cc: Linus Torvalds, linux-kernel, Andrew Morton, Thomas Gleixner,
	H. Peter Anvin, Yinghai Lu, Yinghai Lu, jbarnes


* Stefan Richter <stefanr@s5r6.in-berlin.de> wrote:

> Ingo Molnar wrote:
>>       x86: fix k8-bus_64.c build
>
> Is it not necessary to fold this into a previous patch, to keep the tree 
> bisectable?

as i mentioned it i kept the tree as-is, to maintain its impression 
about its historic behavior. That way people can see how it behaved, 
what kind of breakages were observed, what time span the changes came 
in, etc. Those non-bisectable fixlets will be backmerged for a real pull 
request.

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [git pull] "big box" x86 changes, bootmem/sparsemem
  2008-04-26 19:12 ` Linus Torvalds
@ 2008-04-26 19:41   ` Ingo Molnar
  2008-04-26 19:52     ` Linus Torvalds
  2008-04-27 22:48     ` [git pull] "big box" x86 changes, bootmem/sparsemem Johannes Weiner
  2008-04-26 19:54   ` [git pull] "big box" x86 changes, boot protocol Ingo Molnar
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 52+ messages in thread
From: Ingo Molnar @ 2008-04-26 19:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Yinghai Lu, Yinghai Lu, jbarnes


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> IOW, they'd be big enough that people hopefully don't start nitpicking 
> about some *totally* uninteresting small detail, but small enough that 
> people can read it through without losing concentration about a 
> quarter of the way in.

ok. Here's the "memory management" type of changes:

   git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-bootmem.git for-linus

the other sub-trees will depend on these changes. I think these 
infrastructure and other improvements are mergable and pullable as-is.

	Ingo

------------------>
Yinghai Lu (7):
      mm: make mem_map allocation continuous
      mm: fix alloc_bootmem_core to use fast searching for all nodes
      mm: offset align in alloc_bootmem()
      mm: allow reserve_bootmem() cross nodes
      x86_64: make reserve_bootmem_generic() use new reserve_bootmem()
      x86_64: fix setup_node_bootmem to support big mem excluding with memmap
      x86_64/mm: check and print vmemmap allocation continuous

 arch/x86/kernel/e820_64.c  |   13 +++-
 arch/x86/kernel/setup_64.c |    3 +-
 arch/x86/mm/init_64.c      |   38 +++++++++--
 arch/x86/mm/numa_64.c      |   43 ++++++++++--
 include/asm-x86/e820_64.h  |    2 +-
 include/linux/mm.h         |    1 +
 mm/bootmem.c               |  157 +++++++++++++++++++++++++++++--------------
 mm/sparse.c                |   37 ++++++++++-
 8 files changed, 222 insertions(+), 72 deletions(-)

diff --git a/arch/x86/kernel/e820_64.c b/arch/x86/kernel/e820_64.c
index cbd42e5..f95fdab 100644
--- a/arch/x86/kernel/e820_64.c
+++ b/arch/x86/kernel/e820_64.c
@@ -84,14 +84,19 @@ void __init reserve_early(unsigned long start, unsigned long end, char *name)
 		strncpy(r->name, name, sizeof(r->name) - 1);
 }
 
-void __init early_res_to_bootmem(void)
+void __init early_res_to_bootmem(unsigned long start, unsigned long end)
 {
 	int i;
+	unsigned long final_start, final_end;
 	for (i = 0; i < MAX_EARLY_RES && early_res[i].end; i++) {
 		struct early_res *r = &early_res[i];
-		printk(KERN_INFO "early res: %d [%lx-%lx] %s\n", i,
-			r->start, r->end - 1, r->name);
-		reserve_bootmem_generic(r->start, r->end - r->start);
+		final_start = max(start, r->start);
+		final_end = min(end, r->end);
+		if (final_start >= final_end)
+			continue;
+		printk(KERN_INFO "  early res: %d [%lx-%lx] %s\n", i,
+			final_start, final_end - 1, r->name);
+		reserve_bootmem_generic(final_start, final_end - final_start);
 	}
 }
 
diff --git a/arch/x86/kernel/setup_64.c b/arch/x86/kernel/setup_64.c
index 17bdf23..de9c870 100644
--- a/arch/x86/kernel/setup_64.c
+++ b/arch/x86/kernel/setup_64.c
@@ -190,6 +190,7 @@ contig_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
 	bootmap_size = init_bootmem(bootmap >> PAGE_SHIFT, end_pfn);
 	e820_register_active_regions(0, start_pfn, end_pfn);
 	free_bootmem_with_active_regions(0, end_pfn);
+	early_res_to_bootmem(0, end_pfn<<PAGE_SHIFT);
 	reserve_bootmem(bootmap, bootmap_size, BOOTMEM_DEFAULT);
 }
 #endif
@@ -397,8 +398,6 @@ void __init setup_arch(char **cmdline_p)
 	contig_initmem_init(0, end_pfn);
 #endif
 
-	early_res_to_bootmem();
-
 	dma32_reserve_bootmem();
 
 #ifdef CONFIG_ACPI_SLEEP
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0cca626..e900757 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -810,7 +810,7 @@ void free_initrd_mem(unsigned long start, unsigned long end)
 void __init reserve_bootmem_generic(unsigned long phys, unsigned len)
 {
 #ifdef CONFIG_NUMA
-	int nid = phys_to_nid(phys);
+	int nid, next_nid;
 #endif
 	unsigned long pfn = phys >> PAGE_SHIFT;
 
@@ -829,10 +829,14 @@ void __init reserve_bootmem_generic(unsigned long phys, unsigned len)
 
 	/* Should check here against the e820 map to avoid double free */
 #ifdef CONFIG_NUMA
+	nid = phys_to_nid(phys);
+	next_nid = phys_to_nid(phys + len - 1);
+	if (nid == next_nid)
 	reserve_bootmem_node(NODE_DATA(nid), phys, len, BOOTMEM_DEFAULT);
-#else
-	reserve_bootmem(phys, len, BOOTMEM_DEFAULT);
+	else
 #endif
+	reserve_bootmem(phys, len, BOOTMEM_DEFAULT);
+
 	if (phys+len <= MAX_DMA_PFN*PAGE_SIZE) {
 		dma_reserve += len / PAGE_SIZE;
 		set_dma_reserve(dma_reserve);
@@ -926,6 +930,10 @@ const char *arch_vma_name(struct vm_area_struct *vma)
 /*
  * Initialise the sparsemem vmemmap using huge-pages at the PMD level.
  */
+static long __meminitdata addr_start, addr_end;
+static void __meminitdata *p_start, *p_end;
+static int __meminitdata node_start;
+
 int __meminit
 vmemmap_populate(struct page *start_page, unsigned long size, int node)
 {
@@ -960,12 +968,32 @@ vmemmap_populate(struct page *start_page, unsigned long size, int node)
 							PAGE_KERNEL_LARGE);
 			set_pmd(pmd, __pmd(pte_val(entry)));
 
-			printk(KERN_DEBUG " [%lx-%lx] PMD ->%p on node %d\n",
-				addr, addr + PMD_SIZE - 1, p, node);
+			/* check to see if we have contiguous blocks */
+			if (p_end != p || node_start != node) {
+				if (p_start)
+					printk(KERN_DEBUG " [%lx-%lx] PMD -> [%p-%p] on node %d\n",
+						addr_start, addr_end-1, p_start, p_end-1, node_start);
+				addr_start = addr;
+				node_start = node;
+				p_start = p;
+			}
+			addr_end = addr + PMD_SIZE;
+			p_end = p + PMD_SIZE;
 		} else {
 			vmemmap_verify((pte_t *)pmd, node, addr, next);
 		}
 	}
 	return 0;
 }
+
+void __meminit vmemmap_populate_print_last(void)
+{
+	if (p_start) {
+		printk(KERN_DEBUG " [%lx-%lx] PMD -> [%p-%p] on node %d\n",
+			addr_start, addr_end-1, p_start, p_end-1, node_start);
+		p_start = NULL;
+		p_end = NULL;
+		node_start = 0;
+	}
+}
 #endif
diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
index 9a68922..efb7483 100644
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -196,6 +196,7 @@ void __init setup_node_bootmem(int nodeid, unsigned long start,
 	unsigned long bootmap_start, nodedata_phys;
 	void *bootmap;
 	const int pgdat_size = round_up(sizeof(pg_data_t), PAGE_SIZE);
+	int nid;
 
 	start = round_up(start, ZONE_ALIGN);
 
@@ -218,9 +219,20 @@ void __init setup_node_bootmem(int nodeid, unsigned long start,
 	NODE_DATA(nodeid)->node_start_pfn = start_pfn;
 	NODE_DATA(nodeid)->node_spanned_pages = end_pfn - start_pfn;
 
-	/* Find a place for the bootmem map */
+	/*
+	 * Find a place for the bootmem map
+	 * nodedata_phys could be on other nodes by alloc_bootmem,
+	 * so need to sure bootmap_start not to be small, otherwise
+	 * early_node_mem will get that with find_e820_area instead
+	 * of alloc_bootmem, that could clash with reserved range
+	 */
 	bootmap_pages = bootmem_bootmap_pages(end_pfn - start_pfn);
-	bootmap_start = round_up(nodedata_phys + pgdat_size, PAGE_SIZE);
+	nid = phys_to_nid(nodedata_phys);
+	if (nid == nodeid)
+		bootmap_start = round_up(nodedata_phys + pgdat_size,
+					 PAGE_SIZE);
+	else
+		bootmap_start = round_up(start, PAGE_SIZE);
 	/*
 	 * SMP_CAHCE_BYTES could be enough, but init_bootmem_node like
 	 * to use that to align to PAGE_SIZE
@@ -245,10 +257,29 @@ void __init setup_node_bootmem(int nodeid, unsigned long start,
 
 	free_bootmem_with_active_regions(nodeid, end);
 
-	reserve_bootmem_node(NODE_DATA(nodeid), nodedata_phys, pgdat_size,
-			BOOTMEM_DEFAULT);
-	reserve_bootmem_node(NODE_DATA(nodeid), bootmap_start,
-			bootmap_pages<<PAGE_SHIFT, BOOTMEM_DEFAULT);
+	/*
+	 * convert early reserve to bootmem reserve earlier
+	 * otherwise early_node_mem could use early reserved mem
+	 * on previous node
+	 */
+	early_res_to_bootmem(start, end);
+
+	/*
+	 * in some case early_node_mem could use alloc_bootmem
+	 * to get range on other node, don't reserve that again
+	 */
+	if (nid != nodeid)
+		printk(KERN_INFO "    NODE_DATA(%d) on node %d\n", nodeid, nid);
+	else
+		reserve_bootmem_node(NODE_DATA(nodeid), nodedata_phys,
+					pgdat_size, BOOTMEM_DEFAULT);
+	nid = phys_to_nid(bootmap_start);
+	if (nid != nodeid)
+		printk(KERN_INFO "    bootmap(%d) on node %d\n", nodeid, nid);
+	else
+		reserve_bootmem_node(NODE_DATA(nodeid), bootmap_start,
+				 bootmap_pages<<PAGE_SHIFT, BOOTMEM_DEFAULT);
+
 #ifdef CONFIG_ACPI_NUMA
 	srat_reserve_add_area(nodeid);
 #endif
diff --git a/include/asm-x86/e820_64.h b/include/asm-x86/e820_64.h
index f478c57..4c6ad98 100644
--- a/include/asm-x86/e820_64.h
+++ b/include/asm-x86/e820_64.h
@@ -48,7 +48,7 @@ extern struct e820map e820;
 extern void update_e820(void);
 
 extern void reserve_early(unsigned long start, unsigned long end, char *name);
-extern void early_res_to_bootmem(void);
+extern void early_res_to_bootmem(unsigned long start, unsigned long end);
 
 #endif/*!__ASSEMBLY__*/
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b695875..286d315 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1229,6 +1229,7 @@ void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
 int vmemmap_populate_basepages(struct page *start_page,
 						unsigned long pages, int node);
 int vmemmap_populate(struct page *start_page, unsigned long pages, int node);
+void vmemmap_populate_print_last(void);
 
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/mm/bootmem.c b/mm/bootmem.c
index 2ccea70..590873d 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -111,44 +111,71 @@ static unsigned long __init init_bootmem_core(pg_data_t *pgdat,
  * might be used for boot-time allocations - or it might get added
  * to the free page pool later on.
  */
-static int __init reserve_bootmem_core(bootmem_data_t *bdata,
+static int __init can_reserve_bootmem_core(bootmem_data_t *bdata,
 			unsigned long addr, unsigned long size, int flags)
 {
 	unsigned long sidx, eidx;
 	unsigned long i;
-	int ret;
+
+	BUG_ON(!size);
+
+	/* out of range, don't hold other */
+	if (addr + size < bdata->node_boot_start ||
+		PFN_DOWN(addr) > bdata->node_low_pfn)
+		return 0;
 
 	/*
-	 * round up, partially reserved pages are considered
-	 * fully reserved.
+	 * Round up to index to the range.
 	 */
+	if (addr > bdata->node_boot_start)
+		sidx= PFN_DOWN(addr - bdata->node_boot_start);
+	else
+		sidx = 0;
+
+	eidx = PFN_UP(addr + size - bdata->node_boot_start);
+	if (eidx > bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start))
+		eidx = bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start);
+
+	for (i = sidx; i < eidx; i++)
+		if (test_bit(i, bdata->node_bootmem_map)) {
+			if (flags & BOOTMEM_EXCLUSIVE)
+				return -EBUSY;
+		}
+
+	return 0;
+
+}
+static void __init reserve_bootmem_core(bootmem_data_t *bdata,
+			unsigned long addr, unsigned long size, int flags)
+{
+	unsigned long sidx, eidx;
+	unsigned long i;
+
 	BUG_ON(!size);
-	BUG_ON(PFN_DOWN(addr) >= bdata->node_low_pfn);
-	BUG_ON(PFN_UP(addr + size) > bdata->node_low_pfn);
-	BUG_ON(addr < bdata->node_boot_start);
 
-	sidx = PFN_DOWN(addr - bdata->node_boot_start);
+	/* out of range */
+	if (addr + size < bdata->node_boot_start ||
+		PFN_DOWN(addr) > bdata->node_low_pfn)
+		return;
+
+	/*
+	 * Round up to index to the range.
+	 */
+	if (addr > bdata->node_boot_start)
+		sidx= PFN_DOWN(addr - bdata->node_boot_start);
+	else
+		sidx = 0;
+
 	eidx = PFN_UP(addr + size - bdata->node_boot_start);
+	if (eidx > bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start))
+		eidx = bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start);
 
 	for (i = sidx; i < eidx; i++)
 		if (test_and_set_bit(i, bdata->node_bootmem_map)) {
 #ifdef CONFIG_DEBUG_BOOTMEM
 			printk("hm, page %08lx reserved twice.\n", i*PAGE_SIZE);
 #endif
-			if (flags & BOOTMEM_EXCLUSIVE) {
-				ret = -EBUSY;
-				goto err;
-			}
 		}
-
-	return 0;
-
-err:
-	/* unreserve memory we accidentally reserved */
-	for (i--; i >= sidx; i--)
-		clear_bit(i, bdata->node_bootmem_map);
-
-	return ret;
 }
 
 static void __init free_bootmem_core(bootmem_data_t *bdata, unsigned long addr,
@@ -206,9 +233,11 @@ void * __init
 __alloc_bootmem_core(struct bootmem_data *bdata, unsigned long size,
 	      unsigned long align, unsigned long goal, unsigned long limit)
 {
-	unsigned long offset, remaining_size, areasize, preferred;
+	unsigned long areasize, preferred;
 	unsigned long i, start = 0, incr, eidx, end_pfn;
 	void *ret;
+	unsigned long node_boot_start;
+	void *node_bootmem_map;
 
 	if (!size) {
 		printk("__alloc_bootmem_core(): zero-sized request\n");
@@ -216,70 +245,83 @@ __alloc_bootmem_core(struct bootmem_data *bdata, unsigned long size,
 	}
 	BUG_ON(align & (align-1));
 
-	if (limit && bdata->node_boot_start >= limit)
-		return NULL;
-
 	/* on nodes without memory - bootmem_map is NULL */
 	if (!bdata->node_bootmem_map)
 		return NULL;
 
+	/* bdata->node_boot_start is supposed to be (12+6)bits alignment on x86_64 ? */
+	node_boot_start = bdata->node_boot_start;
+	node_bootmem_map = bdata->node_bootmem_map;
+	if (align) {
+		node_boot_start = ALIGN(bdata->node_boot_start, align);
+		if (node_boot_start > bdata->node_boot_start)
+			node_bootmem_map = (unsigned long *)bdata->node_bootmem_map +
+			    PFN_DOWN(node_boot_start - bdata->node_boot_start)/BITS_PER_LONG;
+	}
+
+	if (limit && node_boot_start >= limit)
+		return NULL;
+
 	end_pfn = bdata->node_low_pfn;
 	limit = PFN_DOWN(limit);
 	if (limit && end_pfn > limit)
 		end_pfn = limit;
 
-	eidx = end_pfn - PFN_DOWN(bdata->node_boot_start);
-	offset = 0;
-	if (align && (bdata->node_boot_start & (align - 1UL)) != 0)
-		offset = align - (bdata->node_boot_start & (align - 1UL));
-	offset = PFN_DOWN(offset);
+	eidx = end_pfn - PFN_DOWN(node_boot_start);
 
 	/*
 	 * We try to allocate bootmem pages above 'goal'
 	 * first, then we try to allocate lower pages.
 	 */
-	if (goal && goal >= bdata->node_boot_start && PFN_DOWN(goal) < end_pfn) {
-		preferred = goal - bdata->node_boot_start;
+	preferred = 0;
+	if (goal && PFN_DOWN(goal) < end_pfn) {
+		if (goal > node_boot_start)
+			preferred = goal - node_boot_start;
 
-		if (bdata->last_success >= preferred)
+		if (bdata->last_success > node_boot_start &&
+			bdata->last_success - node_boot_start >= preferred)
 			if (!limit || (limit && limit > bdata->last_success))
-				preferred = bdata->last_success;
-	} else
-		preferred = 0;
+				preferred = bdata->last_success - node_boot_start;
+	}
 
-	preferred = PFN_DOWN(ALIGN(preferred, align)) + offset;
+	preferred = PFN_DOWN(ALIGN(preferred, align));
 	areasize = (size + PAGE_SIZE-1) / PAGE_SIZE;
 	incr = align >> PAGE_SHIFT ? : 1;
 
 restart_scan:
-	for (i = preferred; i < eidx; i += incr) {
+	for (i = preferred; i < eidx;) {
 		unsigned long j;
-		i = find_next_zero_bit(bdata->node_bootmem_map, eidx, i);
+
+		i = find_next_zero_bit(node_bootmem_map, eidx, i);
 		i = ALIGN(i, incr);
 		if (i >= eidx)
 			break;
-		if (test_bit(i, bdata->node_bootmem_map))
+		if (test_bit(i, node_bootmem_map)) {
+			i += incr;
 			continue;
+		}
 		for (j = i + 1; j < i + areasize; ++j) {
 			if (j >= eidx)
 				goto fail_block;
-			if (test_bit(j, bdata->node_bootmem_map))
+			if (test_bit(j, node_bootmem_map))
 				goto fail_block;
 		}
 		start = i;
 		goto found;
 	fail_block:
 		i = ALIGN(j, incr);
+		if (i == j)
+			i += incr;
 	}
 
-	if (preferred > offset) {
-		preferred = offset;
+	if (preferred > 0) {
+		preferred = 0;
 		goto restart_scan;
 	}
 	return NULL;
 
 found:
-	bdata->last_success = PFN_PHYS(start);
+	bdata->last_success = PFN_PHYS(start) + node_boot_start;
 	BUG_ON(start >= eidx);
 
 	/*
@@ -289,6 +331,7 @@ found:
 	 */
 	if (align < PAGE_SIZE &&
 	    bdata->last_offset && bdata->last_pos+1 == start) {
+		unsigned long offset, remaining_size;
 		offset = ALIGN(bdata->last_offset, align);
 		BUG_ON(offset > PAGE_SIZE);
 		remaining_size = PAGE_SIZE - offset;
@@ -297,14 +340,12 @@ found:
 			/* last_pos unchanged */
 			bdata->last_offset = offset + size;
 			ret = phys_to_virt(bdata->last_pos * PAGE_SIZE +
-					   offset +
-					   bdata->node_boot_start);
+					   offset + node_boot_start);
 		} else {
 			remaining_size = size - remaining_size;
 			areasize = (remaining_size + PAGE_SIZE-1) / PAGE_SIZE;
 			ret = phys_to_virt(bdata->last_pos * PAGE_SIZE +
-					   offset +
-					   bdata->node_boot_start);
+					   offset + node_boot_start);
 			bdata->last_pos = start + areasize - 1;
 			bdata->last_offset = remaining_size;
 		}
@@ -312,14 +353,14 @@ found:
 	} else {
 		bdata->last_pos = start + areasize - 1;
 		bdata->last_offset = size & ~PAGE_MASK;
-		ret = phys_to_virt(start * PAGE_SIZE + bdata->node_boot_start);
+		ret = phys_to_virt(start * PAGE_SIZE + node_boot_start);
 	}
 
 	/*
 	 * Reserve the area now:
 	 */
 	for (i = start; i < start + areasize; i++)
-		if (unlikely(test_and_set_bit(i, bdata->node_bootmem_map)))
+		if (unlikely(test_and_set_bit(i, node_bootmem_map)))
 			BUG();
 	memset(ret, 0, size);
 	return ret;
@@ -401,6 +442,11 @@ unsigned long __init init_bootmem_node(pg_data_t *pgdat, unsigned long freepfn,
 void __init reserve_bootmem_node(pg_data_t *pgdat, unsigned long physaddr,
 				 unsigned long size, int flags)
 {
+	int ret;
+
+	ret = can_reserve_bootmem_core(pgdat->bdata, physaddr, size, flags);
+	if (ret < 0)
+		return;
 	reserve_bootmem_core(pgdat->bdata, physaddr, size, flags);
 }
 
@@ -426,7 +472,16 @@ unsigned long __init init_bootmem(unsigned long start, unsigned long pages)
 int __init reserve_bootmem(unsigned long addr, unsigned long size,
 			    int flags)
 {
-	return reserve_bootmem_core(NODE_DATA(0)->bdata, addr, size, flags);
+	int ret;
+	bootmem_data_t *bdata;
+	list_for_each_entry(bdata, &bdata_list, list) {
+		ret = can_reserve_bootmem_core(bdata, addr, size, flags);
+		if (ret < 0)
+			return ret;
+	}
+	list_for_each_entry(bdata, &bdata_list, list)
+		reserve_bootmem_core(bdata, addr, size, flags);
+	return 0;
 }
 #endif /* !CONFIG_HAVE_ARCH_BOOTMEM_NODE */
 
diff --git a/mm/sparse.c b/mm/sparse.c
index 98d6b39..7e91913 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -295,6 +295,9 @@ struct page __init *sparse_early_mem_map_alloc(unsigned long pnum)
 	return NULL;
 }
 
+void __attribute__((weak)) __meminit vmemmap_populate_print_last(void)
+{
+}
 /*
  * Allocate the accumulated non-linear sections, allocate a mem_map
  * for each and record the physical to section mapping.
@@ -304,22 +307,50 @@ void __init sparse_init(void)
 	unsigned long pnum;
 	struct page *map;
 	unsigned long *usemap;
+	unsigned long **usemap_map;
+	int size;
+
+	/*
+	 * map is using big page (aka 2M in x86 64 bit)
+	 * usemap is less one page (aka 24 bytes)
+	 * so alloc 2M (with 2M align) and 24 bytes in turn will
+	 * make next 2M slip to one more 2M later.
+	 * then in big system, the memory will have a lot of holes...
+	 * here try to allocate 2M pages continously.
+	 *
+	 * powerpc need to call sparse_init_one_section right after each
+	 * sparse_early_mem_map_alloc, so allocate usemap_map at first.
+	 */
+	size = sizeof(unsigned long *) * NR_MEM_SECTIONS;
+	usemap_map = alloc_bootmem(size);
+	if (!usemap_map)
+		panic("can not allocate usemap_map\n");
 
 	for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
 		if (!present_section_nr(pnum))
 			continue;
+		usemap_map[pnum] = sparse_early_usemap_alloc(pnum);
+	}
 
-		map = sparse_early_mem_map_alloc(pnum);
-		if (!map)
+	for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
+		if (!present_section_nr(pnum))
 			continue;
 
-		usemap = sparse_early_usemap_alloc(pnum);
+		usemap = usemap_map[pnum];
 		if (!usemap)
 			continue;
 
+		map = sparse_early_mem_map_alloc(pnum);
+		if (!map)
+			continue;
+
 		sparse_init_one_section(__nr_to_section(pnum), pnum, map,
 								usemap);
 	}
+
+	vmemmap_populate_print_last();
+
+	free_bootmem(__pa(usemap_map), size);
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [git pull] "big box" x86 changes, bootmem/sparsemem
  2008-04-26 19:41   ` [git pull] "big box" x86 changes, bootmem/sparsemem Ingo Molnar
@ 2008-04-26 19:52     ` Linus Torvalds
  2008-04-26 20:07       ` Ingo Molnar
  2008-04-26 20:08       ` [git pull] "big box" x86 changes, bootmem/sparsemem, #2 Ingo Molnar
  2008-04-27 22:48     ` [git pull] "big box" x86 changes, bootmem/sparsemem Johannes Weiner
  1 sibling, 2 replies; 52+ messages in thread
From: Linus Torvalds @ 2008-04-26 19:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Yinghai Lu, Yinghai Lu, jbarnes



On Sat, 26 Apr 2008, Ingo Molnar wrote:
>  #ifdef CONFIG_NUMA
> +	nid = phys_to_nid(phys);
> +	next_nid = phys_to_nid(phys + len - 1);
> +	if (nid == next_nid)
>  	reserve_bootmem_node(NODE_DATA(nid), phys, len, BOOTMEM_DEFAULT);
> -#else
> -	reserve_bootmem(phys, len, BOOTMEM_DEFAULT);
> +	else
>  #endif
> +	reserve_bootmem(phys, len, BOOTMEM_DEFAULT);
> +

Noticed this when just trying to read the code to see if it looks sensible 
(without looking at any real details).

Code like this is *not* acceptable.

We do proper indentations. Improperly indented code is buggy. It doesn't 
matter if the compiler might generate the same code with or without 
indentation, it's still totally unacceptable.

Having preprocessor conditionals that mix things up is not an excuse, and 
it might be an argument for not doing the conditional that way (ie maybe 
just make sure that when NUMA is not on, nid/next_nid will always be 
different, and in a way that the compiler can perhaps see statically that 
they are different - so that you can have the conditional there even with 
NUMA off, but the compiler will just fold it away?).

		Linus

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [git pull] "big box" x86 changes, boot protocol
  2008-04-26 19:12 ` Linus Torvalds
  2008-04-26 19:41   ` [git pull] "big box" x86 changes, bootmem/sparsemem Ingo Molnar
@ 2008-04-26 19:54   ` Ingo Molnar
  2008-04-26 20:39     ` Andrew Morton
  2008-04-27 11:21     ` Ian Campbell
  2008-04-26 20:24   ` [RFC git pull] "big box" x86 changes, GART Ingo Molnar
  2008-04-26 21:55   ` [git pull] "big box" x86 changes, PCI Ingo Molnar
  3 siblings, 2 replies; 52+ messages in thread
From: Ingo Molnar @ 2008-04-26 19:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Yinghai Lu, Yinghai Lu, jbarnes


Linus, please pull the following x86 changes from:

   git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-bootparam.git for-linus

these are boot parameter extensions for really large SGI UV boxes. The 
change was seen and acked by the boot protocol guys. (well, Peter that 
is ;-)

Thanks,

	Ingo

------------------>
Huang, Ying (4):
      x86, boot: add free_early to early reservation machanism
      x86, boot: add linked list of struct setup_data
      x86, boot: export linked list of struct setup_data via debugfs
      x86, boot: Document for linked list of struct setup_data

 Documentation/i386/boot.txt |   26 +++++++
 arch/x86/boot/header.S      |    6 ++-
 arch/x86/kernel/e820_64.c   |   22 ++++++
 arch/x86/kernel/head64.c    |   20 +++++
 arch/x86/kernel/kdebugfs.c  |  163 +++++++++++++++++++++++++++++++++++++++++--
 arch/x86/kernel/setup_64.c  |   24 ++++++
 include/asm-x86/bootparam.h |   14 ++++
 include/asm-x86/e820_64.h   |    1 +
 8 files changed, 270 insertions(+), 6 deletions(-)

diff --git a/Documentation/i386/boot.txt b/Documentation/i386/boot.txt
index 2eb1610..0fac346 100644
--- a/Documentation/i386/boot.txt
+++ b/Documentation/i386/boot.txt
@@ -42,6 +42,8 @@ Protocol 2.05:	(Kernel 2.6.20) Make protected mode kernel relocatable.
 Protocol 2.06:	(Kernel 2.6.22) Added a field that contains the size of
 		the boot command line
 
+Protocol 2.09:	(kernel 2.6.26) Added a field of 64-bit physical
+		pointer to single linked list of struct	setup_data.
 
 **** MEMORY LAYOUT
 
@@ -172,6 +174,8 @@ Offset	Proto	Name		Meaning
 0240/8	2.07+	hardware_subarch_data Subarchitecture-specific data
 0248/4	2.08+	payload_offset	Offset of kernel payload
 024C/4	2.08+	payload_length	Length of kernel payload
+0250/8	2.09+	setup_data	64-bit physical pointer to linked list
+				of struct setup_data
 
 (1) For backwards compatibility, if the setup_sects field contains 0, the
     real value is 4.
@@ -572,6 +576,28 @@ command line is entered using the following protocol:
 	covered by setup_move_size, so you may need to adjust this
 	field.
 
+Field name:	setup_data
+Type:		write (obligatory)
+Offset/size:	0x250/8
+Protocol:	2.09+
+
+  The 64-bit physical pointer to NULL terminated single linked list of
+  struct setup_data. This is used to define a more extensible boot
+  parameters passing mechanism. The definition of struct setup_data is
+  as follow:
+
+  struct setup_data {
+	  u64 next;
+	  u32 type;
+	  u32 len;
+	  u8  data[0];
+  };
+
+  Where, the next is a 64-bit physical pointer to the next node of
+  linked list, the next field of the last node is 0; the type is used
+  to identify the contents of data; the len is the length of data
+  field; the data holds the real payload.
+
 
 **** MEMORY LAYOUT OF THE REAL-MODE CODE
 
diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S
index 6d2df8d..af86e43 100644
--- a/arch/x86/boot/header.S
+++ b/arch/x86/boot/header.S
@@ -120,7 +120,7 @@ _start:
 	# Part 2 of the header, from the old setup.S
 
 		.ascii	"HdrS"		# header signature
-		.word	0x0208		# header version number (>= 0x0105)
+		.word	0x0209		# header version number (>= 0x0105)
 					# or else old loadlin-1.5 will fail)
 		.globl realmode_swtch
 realmode_swtch:	.word	0, 0		# default_switch, SETUPSEG
@@ -227,6 +227,10 @@ hardware_subarch_data:	.quad 0
 payload_offset:		.long input_data
 payload_length:		.long input_data_end-input_data
 
+setup_data:		.quad 0			# 64-bit physical pointer to
+						# single linked list of
+						# struct setup_data
+
 # End of setup header #####################################################
 
 	.section ".inittext", "ax"
diff --git a/arch/x86/kernel/e820_64.c b/arch/x86/kernel/e820_64.c
index cbd42e5..79f0d52 100644
--- a/arch/x86/kernel/e820_64.c
+++ b/arch/x86/kernel/e820_64.c
@@ -84,6 +84,28 @@ void __init reserve_early(unsigned long start, unsigned long end, char *name)
 		strncpy(r->name, name, sizeof(r->name) - 1);
 }
 
+void __init free_early(unsigned long start, unsigned long end)
+{
+	struct early_res *r;
+	int i, j;
+
+	for (i = 0; i < MAX_EARLY_RES && early_res[i].end; i++) {
+		r = &early_res[i];
+		if (start == r->start && end == r->end)
+			break;
+	}
+	if (i >= MAX_EARLY_RES || !early_res[i].end)
+		panic("free_early on not reserved area: %lx-%lx!", start, end);
+
+	for (j = i + 1; j < MAX_EARLY_RES && early_res[j].end; j++)
+		;
+
+	memcpy(&early_res[i], &early_res[i + 1],
+	       (j - 1 - i) * sizeof(struct early_res));
+
+	early_res[j - 1].end = 0;
+}
+
 void __init early_res_to_bootmem(void)
 {
 	int i;
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index d31d6b7..e25c57b 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -11,6 +11,7 @@
 #include <linux/string.h>
 #include <linux/percpu.h>
 #include <linux/start_kernel.h>
+#include <linux/io.h>
 
 #include <asm/processor.h>
 #include <asm/proto.h>
@@ -100,6 +101,24 @@ static void __init reserve_ebda_region(void)
 	reserve_early(lowmem, 0x100000, "BIOS reserved");
 }
 
+static void __init reserve_setup_data(void)
+{
+	struct setup_data *data;
+	unsigned long pa_data;
+	char buf[32];
+
+	if (boot_params.hdr.version < 0x0209)
+		return;
+	pa_data = boot_params.hdr.setup_data;
+	while (pa_data) {
+		data = early_ioremap(pa_data, sizeof(*data));
+		sprintf(buf, "setup data %x", data->type);
+		reserve_early(pa_data, pa_data+sizeof(*data)+data->len, buf);
+		pa_data = data->next;
+		early_iounmap(data, sizeof(*data));
+	}
+}
+
 void __init x86_64_start_kernel(char * real_mode_data)
 {
 	int i;
@@ -156,6 +175,7 @@ void __init x86_64_start_kernel(char * real_mode_data)
 #endif
 
 	reserve_ebda_region();
+	reserve_setup_data();
 
 	/*
 	 * At this point everything still needed from the boot loader
diff --git a/arch/x86/kernel/kdebugfs.c b/arch/x86/kernel/kdebugfs.c
index 7335430..c032059 100644
--- a/arch/x86/kernel/kdebugfs.c
+++ b/arch/x86/kernel/kdebugfs.c
@@ -6,23 +6,171 @@
  *
  * This file is released under the GPLv2.
  */
-
 #include <linux/debugfs.h>
+#include <linux/uaccess.h>
 #include <linux/stat.h>
 #include <linux/init.h>
+#include <linux/io.h>
+#include <linux/mm.h>
 
 #include <asm/setup.h>
 
 #ifdef CONFIG_DEBUG_BOOT_PARAMS
+struct setup_data_node {
+	u64 paddr;
+	u32 type;
+	u32 len;
+};
+
+static ssize_t
+setup_data_read(struct file *file, char __user *user_buf, size_t count,
+		loff_t *ppos)
+{
+	struct setup_data_node *node = file->private_data;
+	unsigned long remain;
+	loff_t pos = *ppos;
+	struct page *pg;
+	void *p;
+	u64 pa;
+
+	if (pos < 0)
+		return -EINVAL;
+	if (pos >= node->len)
+		return 0;
+
+	if (count > node->len - pos)
+		count = node->len - pos;
+	pa = node->paddr + sizeof(struct setup_data) + pos;
+	pg = pfn_to_page((pa + count - 1) >> PAGE_SHIFT);
+	if (PageHighMem(pg)) {
+		p = ioremap_cache(pa, count);
+		if (!p)
+			return -ENXIO;
+	} else {
+		p = __va(pa);
+	}
+
+	remain = copy_to_user(user_buf, p, count);
+
+	if (PageHighMem(pg))
+		iounmap(p);
+
+	if (remain)
+		return -EFAULT;
+
+	*ppos = pos + count;
+
+	return count;
+}
+
+static int setup_data_open(struct inode *inode, struct file *file)
+{
+	file->private_data = inode->i_private;
+	return 0;
+}
+
+static const struct file_operations fops_setup_data = {
+	.read =		setup_data_read,
+	.open =		setup_data_open,
+};
+
+static int __init
+create_setup_data_node(struct dentry *parent, int no,
+		       struct setup_data_node *node)
+{
+	struct dentry *d, *type, *data;
+	char buf[16];
+	int error;
+
+	sprintf(buf, "%d", no);
+	d = debugfs_create_dir(buf, parent);
+	if (!d) {
+		error = -ENOMEM;
+		goto err_return;
+	}
+	type = debugfs_create_x32("type", S_IRUGO, d, &node->type);
+	if (!type) {
+		error = -ENOMEM;
+		goto err_dir;
+	}
+	data = debugfs_create_file("data", S_IRUGO, d, node, &fops_setup_data);
+	if (!data) {
+		error = -ENOMEM;
+		goto err_type;
+	}
+	return 0;
+
+err_type:
+	debugfs_remove(type);
+err_dir:
+	debugfs_remove(d);
+err_return:
+	return error;
+}
+
+static int __init create_setup_data_nodes(struct dentry *parent)
+{
+	struct setup_data_node *node;
+	struct setup_data *data;
+	int error, no = 0;
+	struct dentry *d;
+	struct page *pg;
+	u64 pa_data;
+
+	d = debugfs_create_dir("setup_data", parent);
+	if (!d) {
+		error = -ENOMEM;
+		goto err_return;
+	}
+
+	pa_data = boot_params.hdr.setup_data;
+
+	while (pa_data) {
+		node = kmalloc(sizeof(*node), GFP_KERNEL);
+		if (!node) {
+			error = -ENOMEM;
+			goto err_dir;
+		}
+		pg = pfn_to_page((pa_data+sizeof(*data)-1) >> PAGE_SHIFT);
+		if (PageHighMem(pg)) {
+			data = ioremap_cache(pa_data, sizeof(*data));
+			if (!data) {
+				error = -ENXIO;
+				goto err_dir;
+			}
+		} else {
+			data = __va(pa_data);
+		}
+
+		node->paddr = pa_data;
+		node->type = data->type;
+		node->len = data->len;
+		error = create_setup_data_node(d, no, node);
+		pa_data = data->next;
+
+		if (PageHighMem(pg))
+			iounmap(data);
+		if (error)
+			goto err_dir;
+		no++;
+	}
+	return 0;
+
+err_dir:
+	debugfs_remove(d);
+err_return:
+	return error;
+}
+
 static struct debugfs_blob_wrapper boot_params_blob = {
-	.data = &boot_params,
-	.size = sizeof(boot_params),
+	.data		= &boot_params,
+	.size		= sizeof(boot_params),
 };
 
 static int __init boot_params_kdebugfs_init(void)
 {
-	int error;
 	struct dentry *dbp, *version, *data;
+	int error;
 
 	dbp = debugfs_create_dir("boot_params", NULL);
 	if (!dbp) {
@@ -41,7 +189,13 @@ static int __init boot_params_kdebugfs_init(void)
 		error = -ENOMEM;
 		goto err_version;
 	}
+	error = create_setup_data_nodes(dbp);
+	if (error)
+		goto err_data;
 	return 0;
+
+err_data:
+	debugfs_remove(data);
 err_version:
 	debugfs_remove(version);
 err_dir:
@@ -61,5 +215,4 @@ static int __init arch_kdebugfs_init(void)
 
 	return error;
 }
-
 arch_initcall(arch_kdebugfs_init);
diff --git a/arch/x86/kernel/setup_64.c b/arch/x86/kernel/setup_64.c
index 17bdf23..b04e2c0 100644
--- a/arch/x86/kernel/setup_64.c
+++ b/arch/x86/kernel/setup_64.c
@@ -264,6 +264,28 @@ void __attribute__((weak)) __init memory_setup(void)
        machine_specific_memory_setup();
 }
 
+static void __init parse_setup_data(void)
+{
+	struct setup_data *data;
+	unsigned long pa_data;
+
+	if (boot_params.hdr.version < 0x0209)
+		return;
+	pa_data = boot_params.hdr.setup_data;
+	while (pa_data) {
+		data = early_ioremap(pa_data, PAGE_SIZE);
+		switch (data->type) {
+		default:
+			break;
+		}
+#ifndef CONFIG_DEBUG_BOOT_PARAMS
+		free_early(pa_data, pa_data+sizeof(*data)+data->len);
+#endif
+		pa_data = data->next;
+		early_iounmap(data, PAGE_SIZE);
+	}
+}
+
 /*
  * setup_arch - architecture-specific boot-time initializations
  *
@@ -316,6 +338,8 @@ void __init setup_arch(char **cmdline_p)
 	strlcpy(command_line, boot_command_line, COMMAND_LINE_SIZE);
 	*cmdline_p = command_line;
 
+	parse_setup_data();
+
 	parse_early_param();
 
 #ifdef CONFIG_PROVIDE_OHCI1394_DMA_INIT
diff --git a/include/asm-x86/bootparam.h b/include/asm-x86/bootparam.h
index 5115135..e865990 100644
--- a/include/asm-x86/bootparam.h
+++ b/include/asm-x86/bootparam.h
@@ -9,6 +9,17 @@
 #include <asm/ist.h>
 #include <video/edid.h>
 
+/* setup data types */
+#define SETUP_NONE			0
+
+/* extensible setup data list node */
+struct setup_data {
+	u64 next;
+	u32 type;
+	u32 len;
+	u8 data[0];
+};
+
 struct setup_header {
 	__u8	setup_sects;
 	__u16	root_flags;
@@ -46,6 +57,9 @@ struct setup_header {
 	__u32	cmdline_size;
 	__u32	hardware_subarch;
 	__u64	hardware_subarch_data;
+	__u32	payload_offset;
+	__u32	payload_length;
+	__u64	setup_data;
 } __attribute__((packed));
 
 struct sys_desc_table {
diff --git a/include/asm-x86/e820_64.h b/include/asm-x86/e820_64.h
index f478c57..b5e02e3 100644
--- a/include/asm-x86/e820_64.h
+++ b/include/asm-x86/e820_64.h
@@ -48,6 +48,7 @@ extern struct e820map e820;
 extern void update_e820(void);
 
 extern void reserve_early(unsigned long start, unsigned long end, char *name);
+extern void free_early(unsigned long start, unsigned long end);
 extern void early_res_to_bootmem(void);
 
 #endif/*!__ASSEMBLY__*/


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [git pull] "big box" x86 changes, bootmem/sparsemem
  2008-04-26 19:52     ` Linus Torvalds
@ 2008-04-26 20:07       ` Ingo Molnar
  2008-04-26 20:08       ` [git pull] "big box" x86 changes, bootmem/sparsemem, #2 Ingo Molnar
  1 sibling, 0 replies; 52+ messages in thread
From: Ingo Molnar @ 2008-04-26 20:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Yinghai Lu, Yinghai Lu, jbarnes


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Having preprocessor conditionals that mix things up is not an excuse, 
> and it might be an argument for not doing the conditional that way (ie 
> maybe just make sure that when NUMA is not on, nid/next_nid will 
> always be different, and in a way that the compiler can perhaps see 
> statically that they are different - so that you can have the 
> conditional there even with NUMA off, but the compiler will just fold 
> it away?).

for now i cleaned it up the way below, but i also queued up a cleanup 
patch separately (second patch attached below) that removes the #ifdef. 
The current version is the tested one so i'll keep that in the tree and 
will treat the cleanup separately.

	Ingo

------------------------>
Subject: x86_64: make reserve_bootmem_generic() use new reserve_bootmem()
From: Yinghai Lu <yhlu.kernel@gmail.com>
Date: Tue, 18 Mar 2008 12:50:21 -0700

"mm: make reserve_bootmem can crossed the nodes" provides new
reserve_bootmem(), let reserve_bootmem_generic() use that.

reserve_bootmem_generic() is used to reserve initramdisk, so this way
we can make sure even when bootloader or kexec load ranges cross the
node memory boundaries, reserve_bootmem still works.

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/mm/init_64.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

Index: linux-x86.q/arch/x86/mm/init_64.c
===================================================================
--- linux-x86.q.orig/arch/x86/mm/init_64.c
+++ linux-x86.q/arch/x86/mm/init_64.c
@@ -810,7 +810,7 @@ void free_initrd_mem(unsigned long start
 void __init reserve_bootmem_generic(unsigned long phys, unsigned len)
 {
 #ifdef CONFIG_NUMA
-	int nid = phys_to_nid(phys);
+	int nid, next_nid;
 #endif
 	unsigned long pfn = phys >> PAGE_SHIFT;
 
@@ -829,10 +829,16 @@ void __init reserve_bootmem_generic(unsi
 
 	/* Should check here against the e820 map to avoid double free */
 #ifdef CONFIG_NUMA
-	reserve_bootmem_node(NODE_DATA(nid), phys, len, BOOTMEM_DEFAULT);
+	nid = phys_to_nid(phys);
+	next_nid = phys_to_nid(phys + len - 1);
+	if (nid == next_nid)
+		reserve_bootmem_node(NODE_DATA(nid), phys, len, BOOTMEM_DEFAULT);
+	else
+		reserve_bootmem(phys, len, BOOTMEM_DEFAULT);
 #else
 	reserve_bootmem(phys, len, BOOTMEM_DEFAULT);
 #endif
+
 	if (phys+len <= MAX_DMA_PFN*PAGE_SIZE) {
 		dma_reserve += len / PAGE_SIZE;
 		set_dma_reserve(dma_reserve);

------------->
Subject: x86: reserve bootmem cleanup
From: Ingo Molnar <mingo@elte.hu>
Date: Sat Apr 26 21:50:20 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/mm/init_64.c |    8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

Index: linux-x86.q/arch/x86/mm/init_64.c
===================================================================
--- linux-x86.q.orig/arch/x86/mm/init_64.c
+++ linux-x86.q/arch/x86/mm/init_64.c
@@ -810,10 +810,8 @@ void free_initrd_mem(unsigned long start
 
 void __init reserve_bootmem_generic(unsigned long phys, unsigned len)
 {
-#ifdef CONFIG_NUMA
-	int nid, next_nid;
-#endif
 	unsigned long pfn = phys >> PAGE_SHIFT;
+	int nid, next_nid;
 
 	if (pfn >= end_pfn) {
 		/*
@@ -829,16 +827,12 @@ void __init reserve_bootmem_generic(unsi
 	}
 
 	/* Should check here against the e820 map to avoid double free */
-#ifdef CONFIG_NUMA
 	nid = phys_to_nid(phys);
 	next_nid = phys_to_nid(phys + len - 1);
 	if (nid == next_nid)
 		reserve_bootmem_node(NODE_DATA(nid), phys, len, BOOTMEM_DEFAULT);
 	else
 		reserve_bootmem(phys, len, BOOTMEM_DEFAULT);
-#else
-	reserve_bootmem(phys, len, BOOTMEM_DEFAULT);
-#endif
 
 	if (phys+len <= MAX_DMA_PFN*PAGE_SIZE) {
 		dma_reserve += len / PAGE_SIZE;

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [git pull] "big box" x86 changes, bootmem/sparsemem, #2
  2008-04-26 19:52     ` Linus Torvalds
  2008-04-26 20:07       ` Ingo Molnar
@ 2008-04-26 20:08       ` Ingo Molnar
  2008-04-26 20:30         ` Linus Torvalds
  1 sibling, 1 reply; 52+ messages in thread
From: Ingo Molnar @ 2008-04-26 20:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Yinghai Lu, Yinghai Lu, jbarnes


here's the updated tree:

   git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-bootmem-v2.git for-linus

	Ingo

------------------>
Yinghai Lu (7):
      mm: make mem_map allocation continuous
      mm: fix alloc_bootmem_core to use fast searching for all nodes
      mm: offset align in alloc_bootmem()
      mm: allow reserve_bootmem() cross nodes
      x86_64: make reserve_bootmem_generic() use new reserve_bootmem()
      x86_64: fix setup_node_bootmem to support big mem excluding with memmap
      x86_64/mm: check and print vmemmap allocation continuous

 arch/x86/kernel/e820_64.c  |   13 +++-
 arch/x86/kernel/setup_64.c |    3 +-
 arch/x86/mm/init_64.c      |   38 +++++++++-
 arch/x86/mm/numa_64.c      |   42 ++++++++++--
 include/asm-x86/e820_64.h  |    2 +-
 include/linux/mm.h         |    1 +
 mm/bootmem.c               |  164 ++++++++++++++++++++++++++++++--------------
 mm/sparse.c                |   37 +++++++++-
 8 files changed, 228 insertions(+), 72 deletions(-)

diff --git a/arch/x86/kernel/e820_64.c b/arch/x86/kernel/e820_64.c
index cbd42e5..f95fdab 100644
--- a/arch/x86/kernel/e820_64.c
+++ b/arch/x86/kernel/e820_64.c
@@ -84,14 +84,19 @@ void __init reserve_early(unsigned long start, unsigned long end, char *name)
 		strncpy(r->name, name, sizeof(r->name) - 1);
 }
 
-void __init early_res_to_bootmem(void)
+void __init early_res_to_bootmem(unsigned long start, unsigned long end)
 {
 	int i;
+	unsigned long final_start, final_end;
 	for (i = 0; i < MAX_EARLY_RES && early_res[i].end; i++) {
 		struct early_res *r = &early_res[i];
-		printk(KERN_INFO "early res: %d [%lx-%lx] %s\n", i,
-			r->start, r->end - 1, r->name);
-		reserve_bootmem_generic(r->start, r->end - r->start);
+		final_start = max(start, r->start);
+		final_end = min(end, r->end);
+		if (final_start >= final_end)
+			continue;
+		printk(KERN_INFO "  early res: %d [%lx-%lx] %s\n", i,
+			final_start, final_end - 1, r->name);
+		reserve_bootmem_generic(final_start, final_end - final_start);
 	}
 }
 
diff --git a/arch/x86/kernel/setup_64.c b/arch/x86/kernel/setup_64.c
index 17bdf23..de9c870 100644
--- a/arch/x86/kernel/setup_64.c
+++ b/arch/x86/kernel/setup_64.c
@@ -190,6 +190,7 @@ contig_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
 	bootmap_size = init_bootmem(bootmap >> PAGE_SHIFT, end_pfn);
 	e820_register_active_regions(0, start_pfn, end_pfn);
 	free_bootmem_with_active_regions(0, end_pfn);
+	early_res_to_bootmem(0, end_pfn<<PAGE_SHIFT);
 	reserve_bootmem(bootmap, bootmap_size, BOOTMEM_DEFAULT);
 }
 #endif
@@ -397,8 +398,6 @@ void __init setup_arch(char **cmdline_p)
 	contig_initmem_init(0, end_pfn);
 #endif
 
-	early_res_to_bootmem();
-
 	dma32_reserve_bootmem();
 
 #ifdef CONFIG_ACPI_SLEEP
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0cca626..5fbb865 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -810,7 +810,7 @@ void free_initrd_mem(unsigned long start, unsigned long end)
 void __init reserve_bootmem_generic(unsigned long phys, unsigned len)
 {
 #ifdef CONFIG_NUMA
-	int nid = phys_to_nid(phys);
+	int nid, next_nid;
 #endif
 	unsigned long pfn = phys >> PAGE_SHIFT;
 
@@ -829,10 +829,16 @@ void __init reserve_bootmem_generic(unsigned long phys, unsigned len)
 
 	/* Should check here against the e820 map to avoid double free */
 #ifdef CONFIG_NUMA
-	reserve_bootmem_node(NODE_DATA(nid), phys, len, BOOTMEM_DEFAULT);
+	nid = phys_to_nid(phys);
+	next_nid = phys_to_nid(phys + len - 1);
+	if (nid == next_nid)
+		reserve_bootmem_node(NODE_DATA(nid), phys, len, BOOTMEM_DEFAULT);
+	else
+		reserve_bootmem(phys, len, BOOTMEM_DEFAULT);
 #else
 	reserve_bootmem(phys, len, BOOTMEM_DEFAULT);
 #endif
+
 	if (phys+len <= MAX_DMA_PFN*PAGE_SIZE) {
 		dma_reserve += len / PAGE_SIZE;
 		set_dma_reserve(dma_reserve);
@@ -926,6 +932,10 @@ const char *arch_vma_name(struct vm_area_struct *vma)
 /*
  * Initialise the sparsemem vmemmap using huge-pages at the PMD level.
  */
+static long __meminitdata addr_start, addr_end;
+static void __meminitdata *p_start, *p_end;
+static int __meminitdata node_start;
+
 int __meminit
 vmemmap_populate(struct page *start_page, unsigned long size, int node)
 {
@@ -960,12 +970,32 @@ vmemmap_populate(struct page *start_page, unsigned long size, int node)
 							PAGE_KERNEL_LARGE);
 			set_pmd(pmd, __pmd(pte_val(entry)));
 
-			printk(KERN_DEBUG " [%lx-%lx] PMD ->%p on node %d\n",
-				addr, addr + PMD_SIZE - 1, p, node);
+			/* check to see if we have contiguous blocks */
+			if (p_end != p || node_start != node) {
+				if (p_start)
+					printk(KERN_DEBUG " [%lx-%lx] PMD -> [%p-%p] on node %d\n",
+						addr_start, addr_end-1, p_start, p_end-1, node_start);
+				addr_start = addr;
+				node_start = node;
+				p_start = p;
+			}
+			addr_end = addr + PMD_SIZE;
+			p_end = p + PMD_SIZE;
 		} else {
 			vmemmap_verify((pte_t *)pmd, node, addr, next);
 		}
 	}
 	return 0;
 }
+
+void __meminit vmemmap_populate_print_last(void)
+{
+	if (p_start) {
+		printk(KERN_DEBUG " [%lx-%lx] PMD -> [%p-%p] on node %d\n",
+			addr_start, addr_end-1, p_start, p_end-1, node_start);
+		p_start = NULL;
+		p_end = NULL;
+		node_start = 0;
+	}
+}
 #endif
diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
index 9a68922..c5066d5 100644
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -196,6 +196,7 @@ void __init setup_node_bootmem(int nodeid, unsigned long start,
 	unsigned long bootmap_start, nodedata_phys;
 	void *bootmap;
 	const int pgdat_size = round_up(sizeof(pg_data_t), PAGE_SIZE);
+	int nid;
 
 	start = round_up(start, ZONE_ALIGN);
 
@@ -218,9 +219,19 @@ void __init setup_node_bootmem(int nodeid, unsigned long start,
 	NODE_DATA(nodeid)->node_start_pfn = start_pfn;
 	NODE_DATA(nodeid)->node_spanned_pages = end_pfn - start_pfn;
 
-	/* Find a place for the bootmem map */
+	/*
+	 * Find a place for the bootmem map
+	 * nodedata_phys could be on other nodes by alloc_bootmem,
+	 * so need to sure bootmap_start not to be small, otherwise
+	 * early_node_mem will get that with find_e820_area instead
+	 * of alloc_bootmem, that could clash with reserved range
+	 */
 	bootmap_pages = bootmem_bootmap_pages(end_pfn - start_pfn);
-	bootmap_start = round_up(nodedata_phys + pgdat_size, PAGE_SIZE);
+	nid = phys_to_nid(nodedata_phys);
+	if (nid == nodeid)
+		bootmap_start = round_up(nodedata_phys + pgdat_size, PAGE_SIZE);
+	else
+		bootmap_start = round_up(start, PAGE_SIZE);
 	/*
 	 * SMP_CAHCE_BYTES could be enough, but init_bootmem_node like
 	 * to use that to align to PAGE_SIZE
@@ -245,10 +256,29 @@ void __init setup_node_bootmem(int nodeid, unsigned long start,
 
 	free_bootmem_with_active_regions(nodeid, end);
 
-	reserve_bootmem_node(NODE_DATA(nodeid), nodedata_phys, pgdat_size,
-			BOOTMEM_DEFAULT);
-	reserve_bootmem_node(NODE_DATA(nodeid), bootmap_start,
-			bootmap_pages<<PAGE_SHIFT, BOOTMEM_DEFAULT);
+	/*
+	 * convert early reserve to bootmem reserve earlier
+	 * otherwise early_node_mem could use early reserved mem
+	 * on previous node
+	 */
+	early_res_to_bootmem(start, end);
+
+	/*
+	 * in some case early_node_mem could use alloc_bootmem
+	 * to get range on other node, don't reserve that again
+	 */
+	if (nid != nodeid)
+		printk(KERN_INFO "    NODE_DATA(%d) on node %d\n", nodeid, nid);
+	else
+		reserve_bootmem_node(NODE_DATA(nodeid), nodedata_phys,
+					pgdat_size, BOOTMEM_DEFAULT);
+	nid = phys_to_nid(bootmap_start);
+	if (nid != nodeid)
+		printk(KERN_INFO "    bootmap(%d) on node %d\n", nodeid, nid);
+	else
+		reserve_bootmem_node(NODE_DATA(nodeid), bootmap_start,
+				 bootmap_pages<<PAGE_SHIFT, BOOTMEM_DEFAULT);
+
 #ifdef CONFIG_ACPI_NUMA
 	srat_reserve_add_area(nodeid);
 #endif
diff --git a/include/asm-x86/e820_64.h b/include/asm-x86/e820_64.h
index f478c57..4c6ad98 100644
--- a/include/asm-x86/e820_64.h
+++ b/include/asm-x86/e820_64.h
@@ -48,7 +48,7 @@ extern struct e820map e820;
 extern void update_e820(void);
 
 extern void reserve_early(unsigned long start, unsigned long end, char *name);
-extern void early_res_to_bootmem(void);
+extern void early_res_to_bootmem(unsigned long start, unsigned long end);
 
 #endif/*!__ASSEMBLY__*/
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b695875..286d315 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1229,6 +1229,7 @@ void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
 int vmemmap_populate_basepages(struct page *start_page,
 						unsigned long pages, int node);
 int vmemmap_populate(struct page *start_page, unsigned long pages, int node);
+void vmemmap_populate_print_last(void);
 
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/mm/bootmem.c b/mm/bootmem.c
index 2ccea70..b679164 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -111,44 +111,74 @@ static unsigned long __init init_bootmem_core(pg_data_t *pgdat,
  * might be used for boot-time allocations - or it might get added
  * to the free page pool later on.
  */
-static int __init reserve_bootmem_core(bootmem_data_t *bdata,
+static int __init can_reserve_bootmem_core(bootmem_data_t *bdata,
 			unsigned long addr, unsigned long size, int flags)
 {
 	unsigned long sidx, eidx;
 	unsigned long i;
-	int ret;
+
+	BUG_ON(!size);
+
+	/* out of range, don't hold other */
+	if (addr + size < bdata->node_boot_start ||
+		PFN_DOWN(addr) > bdata->node_low_pfn)
+		return 0;
 
 	/*
-	 * round up, partially reserved pages are considered
-	 * fully reserved.
+	 * Round up to index to the range.
 	 */
+	if (addr > bdata->node_boot_start)
+		sidx= PFN_DOWN(addr - bdata->node_boot_start);
+	else
+		sidx = 0;
+
+	eidx = PFN_UP(addr + size - bdata->node_boot_start);
+	if (eidx > bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start))
+		eidx = bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start);
+
+	for (i = sidx; i < eidx; i++) {
+		if (test_bit(i, bdata->node_bootmem_map)) {
+			if (flags & BOOTMEM_EXCLUSIVE)
+				return -EBUSY;
+		}
+	}
+
+	return 0;
+
+}
+
+static void __init reserve_bootmem_core(bootmem_data_t *bdata,
+			unsigned long addr, unsigned long size, int flags)
+{
+	unsigned long sidx, eidx;
+	unsigned long i;
+
 	BUG_ON(!size);
-	BUG_ON(PFN_DOWN(addr) >= bdata->node_low_pfn);
-	BUG_ON(PFN_UP(addr + size) > bdata->node_low_pfn);
-	BUG_ON(addr < bdata->node_boot_start);
 
-	sidx = PFN_DOWN(addr - bdata->node_boot_start);
+	/* out of range */
+	if (addr + size < bdata->node_boot_start ||
+		PFN_DOWN(addr) > bdata->node_low_pfn)
+		return;
+
+	/*
+	 * Round up to index to the range.
+	 */
+	if (addr > bdata->node_boot_start)
+		sidx= PFN_DOWN(addr - bdata->node_boot_start);
+	else
+		sidx = 0;
+
 	eidx = PFN_UP(addr + size - bdata->node_boot_start);
+	if (eidx > bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start))
+		eidx = bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start);
 
-	for (i = sidx; i < eidx; i++)
+	for (i = sidx; i < eidx; i++) {
 		if (test_and_set_bit(i, bdata->node_bootmem_map)) {
 #ifdef CONFIG_DEBUG_BOOTMEM
 			printk("hm, page %08lx reserved twice.\n", i*PAGE_SIZE);
 #endif
-			if (flags & BOOTMEM_EXCLUSIVE) {
-				ret = -EBUSY;
-				goto err;
-			}
 		}
-
-	return 0;
-
-err:
-	/* unreserve memory we accidentally reserved */
-	for (i--; i >= sidx; i--)
-		clear_bit(i, bdata->node_bootmem_map);
-
-	return ret;
+	}
 }
 
 static void __init free_bootmem_core(bootmem_data_t *bdata, unsigned long addr,
@@ -206,9 +236,11 @@ void * __init
 __alloc_bootmem_core(struct bootmem_data *bdata, unsigned long size,
 	      unsigned long align, unsigned long goal, unsigned long limit)
 {
-	unsigned long offset, remaining_size, areasize, preferred;
+	unsigned long areasize, preferred;
 	unsigned long i, start = 0, incr, eidx, end_pfn;
 	void *ret;
+	unsigned long node_boot_start;
+	void *node_bootmem_map;
 
 	if (!size) {
 		printk("__alloc_bootmem_core(): zero-sized request\n");
@@ -216,70 +248,83 @@ __alloc_bootmem_core(struct bootmem_data *bdata, unsigned long size,
 	}
 	BUG_ON(align & (align-1));
 
-	if (limit && bdata->node_boot_start >= limit)
-		return NULL;
-
 	/* on nodes without memory - bootmem_map is NULL */
 	if (!bdata->node_bootmem_map)
 		return NULL;
 
+	/* bdata->node_boot_start is supposed to be (12+6)bits alignment on x86_64 ? */
+	node_boot_start = bdata->node_boot_start;
+	node_bootmem_map = bdata->node_bootmem_map;
+	if (align) {
+		node_boot_start = ALIGN(bdata->node_boot_start, align);
+		if (node_boot_start > bdata->node_boot_start)
+			node_bootmem_map = (unsigned long *)bdata->node_bootmem_map +
+			    PFN_DOWN(node_boot_start - bdata->node_boot_start)/BITS_PER_LONG;
+	}
+
+	if (limit && node_boot_start >= limit)
+		return NULL;
+
 	end_pfn = bdata->node_low_pfn;
 	limit = PFN_DOWN(limit);
 	if (limit && end_pfn > limit)
 		end_pfn = limit;
 
-	eidx = end_pfn - PFN_DOWN(bdata->node_boot_start);
-	offset = 0;
-	if (align && (bdata->node_boot_start & (align - 1UL)) != 0)
-		offset = align - (bdata->node_boot_start & (align - 1UL));
-	offset = PFN_DOWN(offset);
+	eidx = end_pfn - PFN_DOWN(node_boot_start);
 
 	/*
 	 * We try to allocate bootmem pages above 'goal'
 	 * first, then we try to allocate lower pages.
 	 */
-	if (goal && goal >= bdata->node_boot_start && PFN_DOWN(goal) < end_pfn) {
-		preferred = goal - bdata->node_boot_start;
+	preferred = 0;
+	if (goal && PFN_DOWN(goal) < end_pfn) {
+		if (goal > node_boot_start)
+			preferred = goal - node_boot_start;
 
-		if (bdata->last_success >= preferred)
+		if (bdata->last_success > node_boot_start &&
+			bdata->last_success - node_boot_start >= preferred)
 			if (!limit || (limit && limit > bdata->last_success))
-				preferred = bdata->last_success;
-	} else
-		preferred = 0;
+				preferred = bdata->last_success - node_boot_start;
+	}
 
-	preferred = PFN_DOWN(ALIGN(preferred, align)) + offset;
+	preferred = PFN_DOWN(ALIGN(preferred, align));
 	areasize = (size + PAGE_SIZE-1) / PAGE_SIZE;
 	incr = align >> PAGE_SHIFT ? : 1;
 
 restart_scan:
-	for (i = preferred; i < eidx; i += incr) {
+	for (i = preferred; i < eidx;) {
 		unsigned long j;
-		i = find_next_zero_bit(bdata->node_bootmem_map, eidx, i);
+
+		i = find_next_zero_bit(node_bootmem_map, eidx, i);
 		i = ALIGN(i, incr);
 		if (i >= eidx)
 			break;
-		if (test_bit(i, bdata->node_bootmem_map))
+		if (test_bit(i, node_bootmem_map)) {
+			i += incr;
 			continue;
+		}
 		for (j = i + 1; j < i + areasize; ++j) {
 			if (j >= eidx)
 				goto fail_block;
-			if (test_bit(j, bdata->node_bootmem_map))
+			if (test_bit(j, node_bootmem_map))
 				goto fail_block;
 		}
 		start = i;
 		goto found;
 	fail_block:
 		i = ALIGN(j, incr);
+		if (i == j)
+			i += incr;
 	}
 
-	if (preferred > offset) {
-		preferred = offset;
+	if (preferred > 0) {
+		preferred = 0;
 		goto restart_scan;
 	}
 	return NULL;
 
 found:
-	bdata->last_success = PFN_PHYS(start);
+	bdata->last_success = PFN_PHYS(start) + node_boot_start;
 	BUG_ON(start >= eidx);
 
 	/*
@@ -289,6 +334,7 @@ found:
 	 */
 	if (align < PAGE_SIZE &&
 	    bdata->last_offset && bdata->last_pos+1 == start) {
+		unsigned long offset, remaining_size;
 		offset = ALIGN(bdata->last_offset, align);
 		BUG_ON(offset > PAGE_SIZE);
 		remaining_size = PAGE_SIZE - offset;
@@ -297,14 +343,12 @@ found:
 			/* last_pos unchanged */
 			bdata->last_offset = offset + size;
 			ret = phys_to_virt(bdata->last_pos * PAGE_SIZE +
-					   offset +
-					   bdata->node_boot_start);
+					   offset + node_boot_start);
 		} else {
 			remaining_size = size - remaining_size;
 			areasize = (remaining_size + PAGE_SIZE-1) / PAGE_SIZE;
 			ret = phys_to_virt(bdata->last_pos * PAGE_SIZE +
-					   offset +
-					   bdata->node_boot_start);
+					   offset + node_boot_start);
 			bdata->last_pos = start + areasize - 1;
 			bdata->last_offset = remaining_size;
 		}
@@ -312,14 +356,14 @@ found:
 	} else {
 		bdata->last_pos = start + areasize - 1;
 		bdata->last_offset = size & ~PAGE_MASK;
-		ret = phys_to_virt(start * PAGE_SIZE + bdata->node_boot_start);
+		ret = phys_to_virt(start * PAGE_SIZE + node_boot_start);
 	}
 
 	/*
 	 * Reserve the area now:
 	 */
 	for (i = start; i < start + areasize; i++)
-		if (unlikely(test_and_set_bit(i, bdata->node_bootmem_map)))
+		if (unlikely(test_and_set_bit(i, node_bootmem_map)))
 			BUG();
 	memset(ret, 0, size);
 	return ret;
@@ -401,6 +445,11 @@ unsigned long __init init_bootmem_node(pg_data_t *pgdat, unsigned long freepfn,
 void __init reserve_bootmem_node(pg_data_t *pgdat, unsigned long physaddr,
 				 unsigned long size, int flags)
 {
+	int ret;
+
+	ret = can_reserve_bootmem_core(pgdat->bdata, physaddr, size, flags);
+	if (ret < 0)
+		return;
 	reserve_bootmem_core(pgdat->bdata, physaddr, size, flags);
 }
 
@@ -426,7 +475,18 @@ unsigned long __init init_bootmem(unsigned long start, unsigned long pages)
 int __init reserve_bootmem(unsigned long addr, unsigned long size,
 			    int flags)
 {
-	return reserve_bootmem_core(NODE_DATA(0)->bdata, addr, size, flags);
+	bootmem_data_t *bdata;
+	int ret;
+
+	list_for_each_entry(bdata, &bdata_list, list) {
+		ret = can_reserve_bootmem_core(bdata, addr, size, flags);
+		if (ret < 0)
+			return ret;
+	}
+	list_for_each_entry(bdata, &bdata_list, list)
+		reserve_bootmem_core(bdata, addr, size, flags);
+
+	return 0;
 }
 #endif /* !CONFIG_HAVE_ARCH_BOOTMEM_NODE */
 
diff --git a/mm/sparse.c b/mm/sparse.c
index 98d6b39..7e91913 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -295,6 +295,9 @@ struct page __init *sparse_early_mem_map_alloc(unsigned long pnum)
 	return NULL;
 }
 
+void __attribute__((weak)) __meminit vmemmap_populate_print_last(void)
+{
+}
 /*
  * Allocate the accumulated non-linear sections, allocate a mem_map
  * for each and record the physical to section mapping.
@@ -304,22 +307,50 @@ void __init sparse_init(void)
 	unsigned long pnum;
 	struct page *map;
 	unsigned long *usemap;
+	unsigned long **usemap_map;
+	int size;
+
+	/*
+	 * map is using big page (aka 2M in x86 64 bit)
+	 * usemap is less one page (aka 24 bytes)
+	 * so alloc 2M (with 2M align) and 24 bytes in turn will
+	 * make next 2M slip to one more 2M later.
+	 * then in big system, the memory will have a lot of holes...
+	 * here try to allocate 2M pages continously.
+	 *
+	 * powerpc need to call sparse_init_one_section right after each
+	 * sparse_early_mem_map_alloc, so allocate usemap_map at first.
+	 */
+	size = sizeof(unsigned long *) * NR_MEM_SECTIONS;
+	usemap_map = alloc_bootmem(size);
+	if (!usemap_map)
+		panic("can not allocate usemap_map\n");
 
 	for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
 		if (!present_section_nr(pnum))
 			continue;
+		usemap_map[pnum] = sparse_early_usemap_alloc(pnum);
+	}
 
-		map = sparse_early_mem_map_alloc(pnum);
-		if (!map)
+	for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
+		if (!present_section_nr(pnum))
 			continue;
 
-		usemap = sparse_early_usemap_alloc(pnum);
+		usemap = usemap_map[pnum];
 		if (!usemap)
 			continue;
 
+		map = sparse_early_mem_map_alloc(pnum);
+		if (!map)
+			continue;
+
 		sparse_init_one_section(__nr_to_section(pnum), pnum, map,
 								usemap);
 	}
+
+	vmemmap_populate_print_last();
+
+	free_bootmem(__pa(usemap_map), size);
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG




^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [RFC git pull] "big box" x86 changes, GART
  2008-04-26 19:12 ` Linus Torvalds
  2008-04-26 19:41   ` [git pull] "big box" x86 changes, bootmem/sparsemem Ingo Molnar
  2008-04-26 19:54   ` [git pull] "big box" x86 changes, boot protocol Ingo Molnar
@ 2008-04-26 20:24   ` Ingo Molnar
  2008-04-26 20:26     ` Ingo Molnar
  2008-04-26 21:55   ` [git pull] "big box" x86 changes, PCI Ingo Molnar
  3 siblings, 1 reply; 52+ messages in thread
From: Ingo Molnar @ 2008-04-26 20:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Yinghai Lu, Yinghai Lu, jbarnes


Linus, please pull the x86 "big box & GART" subtopic tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-gart.git for-linus

it also touches drivers/char/agp/amd64-agp.c.

Thanks,

	Ingo

------------------>
Pavel Machek (1):
      x86: iommu: use symbolic constants, not hardcoded numbers

Yinghai Lu (5):
      x86: agp_gart size checking for buggy device
      x86: checking aperture size order
      x86_64: allocate gart aperture from 512M
      x86: clean up aperture_64.c
      x86: reserve dma32 early for gart fix

 arch/x86/kernel/aperture_64.c |  283 +++++++++++++++++++++++++++-------------
 arch/x86/kernel/pci-dma.c     |   11 +-
 arch/x86/kernel/pci-gart_64.c |   10 +-
 drivers/char/agp/amd64-agp.c  |   46 +++----
 include/asm-x86/gart.h        |   21 +++
 5 files changed, 242 insertions(+), 129 deletions(-)

diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
index 479926d..02f4dba 100644
--- a/arch/x86/kernel/aperture_64.c
+++ b/arch/x86/kernel/aperture_64.c
@@ -35,6 +35,18 @@ int fallback_aper_force __initdata;
 
 int fix_aperture __initdata = 1;
 
+struct bus_dev_range {
+	int bus;
+	int dev_base;
+	int dev_limit;
+};
+
+static struct bus_dev_range bus_dev_ranges[] __initdata = {
+	{ 0x00, 0x18, 0x20},
+	{ 0xff, 0x00, 0x20},
+	{ 0xfe, 0x00, 0x20}
+};
+
 static struct resource gart_resource = {
 	.name	= "GART",
 	.flags	= IORESOURCE_MEM,
@@ -55,8 +67,9 @@ static u32 __init allocate_aperture(void)
 	u32 aper_size;
 	void *p;
 
-	if (fallback_aper_order > 7)
-		fallback_aper_order = 7;
+	/* aper_size should <= 1G */
+	if (fallback_aper_order > 5)
+		fallback_aper_order = 5;
 	aper_size = (32 * 1024 * 1024) << fallback_aper_order;
 
 	/*
@@ -65,7 +78,20 @@ static u32 __init allocate_aperture(void)
 	 * memory. Unfortunately we cannot move it up because that would
 	 * make the IOMMU useless.
 	 */
-	p = __alloc_bootmem_nopanic(aper_size, aper_size, 0);
+	/*
+	 * using 512M as goal, in case kexec will load kernel_big
+	 * that will do the on position decompress, and  could overlap with
+	 * that positon with gart that is used.
+	 * sequende:
+	 * kernel_small
+	 * ==> kexec (with kdump trigger path or previous doesn't shutdown gart)
+	 * ==> kernel_small(gart area become e820_reserved)
+	 * ==> kexec (with kdump trigger path or previous doesn't shutdown gart)
+	 * ==> kerne_big (uncompressed size will be big than 64M or 128M)
+	 * so don't use 512M below as gart iommu, leave the space for kernel
+	 * code for safe
+	 */
+	p = __alloc_bootmem_nopanic(aper_size, aper_size, 512ULL<<20);
 	if (!p || __pa(p)+aper_size > 0xffffffff) {
 		printk(KERN_ERR
 			"Cannot allocate aperture memory hole (%p,%uK)\n",
@@ -83,7 +109,7 @@ static u32 __init allocate_aperture(void)
 	return (u32)__pa(p);
 }
 
-static int __init aperture_valid(u64 aper_base, u32 aper_size)
+static int __init aperture_valid(u64 aper_base, u32 aper_size, u32 min_size)
 {
 	if (!aper_base)
 		return 0;
@@ -96,8 +122,9 @@ static int __init aperture_valid(u64 aper_base, u32 aper_size)
 		printk(KERN_ERR "Aperture pointing to e820 RAM. Ignoring.\n");
 		return 0;
 	}
-	if (aper_size < 64*1024*1024) {
-		printk(KERN_ERR "Aperture too small (%d MB)\n", aper_size>>20);
+	if (aper_size < min_size) {
+		printk(KERN_ERR "Aperture too small (%d MB) than (%d MB)\n",
+				 aper_size>>20, min_size>>20);
 		return 0;
 	}
 
@@ -105,47 +132,51 @@ static int __init aperture_valid(u64 aper_base, u32 aper_size)
 }
 
 /* Find a PCI capability */
-static __u32 __init find_cap(int num, int slot, int func, int cap)
+static __u32 __init find_cap(int bus, int slot, int func, int cap)
 {
 	int bytes;
 	u8 pos;
 
-	if (!(read_pci_config_16(num, slot, func, PCI_STATUS) &
+	if (!(read_pci_config_16(bus, slot, func, PCI_STATUS) &
 						PCI_STATUS_CAP_LIST))
 		return 0;
 
-	pos = read_pci_config_byte(num, slot, func, PCI_CAPABILITY_LIST);
+	pos = read_pci_config_byte(bus, slot, func, PCI_CAPABILITY_LIST);
 	for (bytes = 0; bytes < 48 && pos >= 0x40; bytes++) {
 		u8 id;
 
 		pos &= ~3;
-		id = read_pci_config_byte(num, slot, func, pos+PCI_CAP_LIST_ID);
+		id = read_pci_config_byte(bus, slot, func, pos+PCI_CAP_LIST_ID);
 		if (id == 0xff)
 			break;
 		if (id == cap)
 			return pos;
-		pos = read_pci_config_byte(num, slot, func,
+		pos = read_pci_config_byte(bus, slot, func,
 						pos+PCI_CAP_LIST_NEXT);
 	}
 	return 0;
 }
 
 /* Read a standard AGPv3 bridge header */
-static __u32 __init read_agp(int num, int slot, int func, int cap, u32 *order)
+static __u32 __init read_agp(int bus, int slot, int func, int cap, u32 *order)
 {
 	u32 apsize;
 	u32 apsizereg;
 	int nbits;
 	u32 aper_low, aper_hi;
 	u64 aper;
+	u32 old_order;
 
-	printk(KERN_INFO "AGP bridge at %02x:%02x:%02x\n", num, slot, func);
-	apsizereg = read_pci_config_16(num, slot, func, cap + 0x14);
+	printk(KERN_INFO "AGP bridge at %02x:%02x:%02x\n", bus, slot, func);
+	apsizereg = read_pci_config_16(bus, slot, func, cap + 0x14);
 	if (apsizereg == 0xffffffff) {
 		printk(KERN_ERR "APSIZE in AGP bridge unreadable\n");
 		return 0;
 	}
 
+	/* old_order could be the value from NB gart setting */
+	old_order = *order;
+
 	apsize = apsizereg & 0xfff;
 	/* Some BIOS use weird encodings not in the AGPv3 table. */
 	if (apsize & 0xff)
@@ -155,14 +186,26 @@ static __u32 __init read_agp(int num, int slot, int func, int cap, u32 *order)
 	if ((int)*order < 0) /* < 32MB */
 		*order = 0;
 
-	aper_low = read_pci_config(num, slot, func, 0x10);
-	aper_hi = read_pci_config(num, slot, func, 0x14);
+	aper_low = read_pci_config(bus, slot, func, 0x10);
+	aper_hi = read_pci_config(bus, slot, func, 0x14);
 	aper = (aper_low & ~((1<<22)-1)) | ((u64)aper_hi << 32);
 
+	/*
+	 * On some sick chips, APSIZE is 0. It means it wants 4G
+	 * so let double check that order, and lets trust AMD NB settings:
+	 */
+	printk(KERN_INFO "Aperture from AGP @ %Lx old size %u MB\n",
+			aper, 32 << old_order);
+	if (aper + (32ULL<<(20 + *order)) > 0x100000000ULL) {
+		printk(KERN_INFO "Aperture size %u MB (APSIZE %x) is not right, using settings from NB\n",
+				32 << *order, apsizereg);
+		*order = old_order;
+	}
+
 	printk(KERN_INFO "Aperture from AGP @ %Lx size %u MB (APSIZE %x)\n",
 			aper, 32 << *order, apsizereg);
 
-	if (!aperture_valid(aper, (32*1024*1024) << *order))
+	if (!aperture_valid(aper, (32*1024*1024) << *order, 32<<20))
 		return 0;
 	return (u32)aper;
 }
@@ -182,15 +225,15 @@ static __u32 __init read_agp(int num, int slot, int func, int cap, u32 *order)
  */
 static __u32 __init search_agp_bridge(u32 *order, int *valid_agp)
 {
-	int num, slot, func;
+	int bus, slot, func;
 
 	/* Poor man's PCI discovery */
-	for (num = 0; num < 256; num++) {
+	for (bus = 0; bus < 256; bus++) {
 		for (slot = 0; slot < 32; slot++) {
 			for (func = 0; func < 8; func++) {
 				u32 class, cap;
 				u8 type;
-				class = read_pci_config(num, slot, func,
+				class = read_pci_config(bus, slot, func,
 							PCI_CLASS_REVISION);
 				if (class == 0xffffffff)
 					break;
@@ -199,17 +242,17 @@ static __u32 __init search_agp_bridge(u32 *order, int *valid_agp)
 				case PCI_CLASS_BRIDGE_HOST:
 				case PCI_CLASS_BRIDGE_OTHER: /* needed? */
 					/* AGP bridge? */
-					cap = find_cap(num, slot, func,
+					cap = find_cap(bus, slot, func,
 							PCI_CAP_ID_AGP);
 					if (!cap)
 						break;
 					*valid_agp = 1;
-					return read_agp(num, slot, func, cap,
+					return read_agp(bus, slot, func, cap,
 							order);
 				}
 
 				/* No multi-function device? */
-				type = read_pci_config_byte(num, slot, func,
+				type = read_pci_config_byte(bus, slot, func,
 							       PCI_HEADER_TYPE);
 				if (!(type & 0x80))
 					break;
@@ -249,38 +292,49 @@ void __init early_gart_iommu_check(void)
 	 * or BIOS forget to put that in reserved.
 	 * try to update e820 to make that region as reserved.
 	 */
-	int fix, num;
+	int fix, slot;
 	u32 ctl;
 	u32 aper_size = 0, aper_order = 0, last_aper_order = 0;
 	u64 aper_base = 0, last_aper_base = 0;
 	int aper_enabled = 0, last_aper_enabled = 0;
+	int i;
 
 	if (!early_pci_allowed())
 		return;
 
 	fix = 0;
-	for (num = 24; num < 32; num++) {
-		if (!early_is_k8_nb(read_pci_config(0, num, 3, 0x00)))
-			continue;
-
-		ctl = read_pci_config(0, num, 3, 0x90);
-		aper_enabled = ctl & 1;
-		aper_order = (ctl >> 1) & 7;
-		aper_size = (32 * 1024 * 1024) << aper_order;
-		aper_base = read_pci_config(0, num, 3, 0x94) & 0x7fff;
-		aper_base <<= 25;
-
-		if ((last_aper_order && aper_order != last_aper_order) ||
-		    (last_aper_base && aper_base != last_aper_base) ||
-		    (last_aper_enabled && aper_enabled != last_aper_enabled)) {
-			fix = 1;
-			break;
+	for (i = 0; i < ARRAY_SIZE(bus_dev_ranges); i++) {
+		int bus;
+		int dev_base, dev_limit;
+
+		bus = bus_dev_ranges[i].bus;
+		dev_base = bus_dev_ranges[i].dev_base;
+		dev_limit = bus_dev_ranges[i].dev_limit;
+
+		for (slot = dev_base; slot < dev_limit; slot++) {
+			if (!early_is_k8_nb(read_pci_config(bus, slot, 3, 0x00)))
+				continue;
+
+			ctl = read_pci_config(bus, slot, 3, AMD64_GARTAPERTURECTL);
+			aper_enabled = ctl & AMD64_GARTEN;
+			aper_order = (ctl >> 1) & 7;
+			aper_size = (32 * 1024 * 1024) << aper_order;
+			aper_base = read_pci_config(bus, slot, 3, AMD64_GARTAPERTUREBASE) & 0x7fff;
+			aper_base <<= 25;
+
+			if ((last_aper_order && aper_order != last_aper_order) ||
+			    (last_aper_base && aper_base != last_aper_base) ||
+			    (last_aper_enabled && aper_enabled != last_aper_enabled)) {
+				fix = 1;
+				goto out;
+			}
+			last_aper_order = aper_order;
+			last_aper_base = aper_base;
+			last_aper_enabled = aper_enabled;
 		}
-		last_aper_order = aper_order;
-		last_aper_base = aper_base;
-		last_aper_enabled = aper_enabled;
 	}
 
+out:
 	if (!fix && !aper_enabled)
 		return;
 
@@ -288,8 +342,8 @@ void __init early_gart_iommu_check(void)
 		fix = 1;
 
 	if (gart_fix_e820 && !fix && aper_enabled) {
-		if (e820_any_mapped(aper_base, aper_base + aper_size,
-				    E820_RAM)) {
+		if (!e820_all_mapped(aper_base, aper_base + aper_size,
+				    E820_RESERVED)) {
 			/* reserved it, so we can resuse it in second kernel */
 			printk(KERN_INFO "update e820 for GART\n");
 			add_memory_region(aper_base, aper_size, E820_RESERVED);
@@ -299,23 +353,35 @@ void __init early_gart_iommu_check(void)
 	}
 
 	/* different nodes have different setting, disable them all at first*/
-	for (num = 24; num < 32; num++) {
-		if (!early_is_k8_nb(read_pci_config(0, num, 3, 0x00)))
-			continue;
+	for (i = 0; i < ARRAY_SIZE(bus_dev_ranges); i++) {
+		int bus;
+		int dev_base, dev_limit;
+
+		bus = bus_dev_ranges[i].bus;
+		dev_base = bus_dev_ranges[i].dev_base;
+		dev_limit = bus_dev_ranges[i].dev_limit;
 
-		ctl = read_pci_config(0, num, 3, 0x90);
-		ctl &= ~1;
-		write_pci_config(0, num, 3, 0x90, ctl);
+		for (slot = dev_base; slot < dev_limit; slot++) {
+			if (!early_is_k8_nb(read_pci_config(bus, slot, 3, 0x00)))
+				continue;
+
+			ctl = read_pci_config(bus, slot, 3, AMD64_GARTAPERTURECTL);
+			ctl &= ~AMD64_GARTEN;
+			write_pci_config(bus, slot, 3, AMD64_GARTAPERTURECTL, ctl);
+		}
 	}
 
 }
 
+static int __initdata printed_gart_size_msg;
+
 void __init gart_iommu_hole_init(void)
 {
+	u32 agp_aper_base = 0, agp_aper_order = 0;
 	u32 aper_size, aper_alloc = 0, aper_order = 0, last_aper_order = 0;
 	u64 aper_base, last_aper_base = 0;
-	int fix, num, valid_agp = 0;
-	int node;
+	int fix, slot, valid_agp = 0;
+	int i, node;
 
 	if (gart_iommu_aperture_disabled || !fix_aperture ||
 	    !early_pci_allowed())
@@ -323,38 +389,63 @@ void __init gart_iommu_hole_init(void)
 
 	printk(KERN_INFO  "Checking aperture...\n");
 
+	if (!fallback_aper_force)
+		agp_aper_base = search_agp_bridge(&agp_aper_order, &valid_agp);
+
 	fix = 0;
 	node = 0;
-	for (num = 24; num < 32; num++) {
-		if (!early_is_k8_nb(read_pci_config(0, num, 3, 0x00)))
-			continue;
-
-		iommu_detected = 1;
-		gart_iommu_aperture = 1;
-
-		aper_order = (read_pci_config(0, num, 3, 0x90) >> 1) & 7;
-		aper_size = (32 * 1024 * 1024) << aper_order;
-		aper_base = read_pci_config(0, num, 3, 0x94) & 0x7fff;
-		aper_base <<= 25;
-
-		printk(KERN_INFO "Node %d: aperture @ %Lx size %u MB\n",
-				node, aper_base, aper_size >> 20);
-		node++;
-
-		if (!aperture_valid(aper_base, aper_size)) {
-			fix = 1;
-			break;
-		}
+	for (i = 0; i < ARRAY_SIZE(bus_dev_ranges); i++) {
+		int bus;
+		int dev_base, dev_limit;
+
+		bus = bus_dev_ranges[i].bus;
+		dev_base = bus_dev_ranges[i].dev_base;
+		dev_limit = bus_dev_ranges[i].dev_limit;
+
+		for (slot = dev_base; slot < dev_limit; slot++) {
+			if (!early_is_k8_nb(read_pci_config(bus, slot, 3, 0x00)))
+				continue;
+
+			iommu_detected = 1;
+			gart_iommu_aperture = 1;
+
+			aper_order = (read_pci_config(bus, slot, 3, AMD64_GARTAPERTURECTL) >> 1) & 7;
+			aper_size = (32 * 1024 * 1024) << aper_order;
+			aper_base = read_pci_config(bus, slot, 3, AMD64_GARTAPERTUREBASE) & 0x7fff;
+			aper_base <<= 25;
+
+			printk(KERN_INFO "Node %d: aperture @ %Lx size %u MB\n",
+					node, aper_base, aper_size >> 20);
+			node++;
+
+			if (!aperture_valid(aper_base, aper_size, 64<<20)) {
+				if (valid_agp && agp_aper_base &&
+				    agp_aper_base == aper_base &&
+				    agp_aper_order == aper_order) {
+					/* the same between two setting from NB and agp */
+					if (!no_iommu && end_pfn > MAX_DMA32_PFN && !printed_gart_size_msg) {
+						printk(KERN_ERR "you are using iommu with agp, but GART size is less than 64M\n");
+						printk(KERN_ERR "please increase GART size in your BIOS setup\n");
+						printk(KERN_ERR "if BIOS doesn't have that option, contact your HW vendor!\n");
+						printed_gart_size_msg = 1;
+					}
+				} else {
+					fix = 1;
+					goto out;
+				}
+			}
 
-		if ((last_aper_order && aper_order != last_aper_order) ||
-		    (last_aper_base && aper_base != last_aper_base)) {
-			fix = 1;
-			break;
+			if ((last_aper_order && aper_order != last_aper_order) ||
+			    (last_aper_base && aper_base != last_aper_base)) {
+				fix = 1;
+				goto out;
+			}
+			last_aper_order = aper_order;
+			last_aper_base = aper_base;
 		}
-		last_aper_order = aper_order;
-		last_aper_base = aper_base;
 	}
 
+out:
 	if (!fix && !fallback_aper_force) {
 		if (last_aper_base) {
 			unsigned long n = (32 * 1024 * 1024) << last_aper_order;
@@ -364,8 +455,10 @@ void __init gart_iommu_hole_init(void)
 		return;
 	}
 
-	if (!fallback_aper_force)
-		aper_alloc = search_agp_bridge(&aper_order, &valid_agp);
+	if (!fallback_aper_force) {
+		aper_alloc = agp_aper_base;
+		aper_order = agp_aper_order;
+	}
 
 	if (aper_alloc) {
 		/* Got the aperture from the AGP bridge */
@@ -401,16 +494,22 @@ void __init gart_iommu_hole_init(void)
 	}
 
 	/* Fix up the north bridges */
-	for (num = 24; num < 32; num++) {
-		if (!early_is_k8_nb(read_pci_config(0, num, 3, 0x00)))
-			continue;
-
-		/*
-		 * Don't enable translation yet. That is done later.
-		 * Assume this BIOS didn't initialise the GART so
-		 * just overwrite all previous bits
-		 */
-		write_pci_config(0, num, 3, 0x90, aper_order<<1);
-		write_pci_config(0, num, 3, 0x94, aper_alloc>>25);
+	for (i = 0; i < ARRAY_SIZE(bus_dev_ranges); i++) {
+		int bus;
+		int dev_base, dev_limit;
+
+		bus = bus_dev_ranges[i].bus;
+		dev_base = bus_dev_ranges[i].dev_base;
+		dev_limit = bus_dev_ranges[i].dev_limit;
+		for (slot = dev_base; slot < dev_limit; slot++) {
+			if (!early_is_k8_nb(read_pci_config(bus, slot, 3, 0x00)))
+				continue;
+
+			/* Don't enable translation yet. That is done later.
+			   Assume this BIOS didn't initialise the GART so
+			   just overwrite all previous bits */
+			write_pci_config(bus, slot, 3, AMD64_GARTAPERTURECTL, aper_order << 1);
+			write_pci_config(bus, slot, 3, AMD64_GARTAPERTUREBASE, aper_alloc >> 25);
+		}
 	}
 }
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 388b113..50a18e4 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -77,10 +77,14 @@ void __init dma32_reserve_bootmem(void)
 	if (end_pfn <= MAX_DMA32_PFN)
 		return;
 
+	/*
+	 * check aperture_64.c allocate_aperture() for reason about
+	 * using 512M as goal
+	 */
 	align = 64ULL<<20;
 	size = round_up(dma32_bootmem_size, align);
 	dma32_bootmem_ptr = __alloc_bootmem_nopanic(size, align,
-				 __pa(MAX_DMA_ADDRESS));
+				 512ULL<<20);
 	if (dma32_bootmem_ptr)
 		dma32_bootmem_size = size;
 	else
@@ -88,7 +92,6 @@ void __init dma32_reserve_bootmem(void)
 }
 static void __init dma32_free_bootmem(void)
 {
-	int node;
 
 	if (end_pfn <= MAX_DMA32_PFN)
 		return;
@@ -96,9 +99,7 @@ static void __init dma32_free_bootmem(void)
 	if (!dma32_bootmem_ptr)
 		return;
 
-	for_each_online_node(node)
-		free_bootmem_node(NODE_DATA(node), __pa(dma32_bootmem_ptr),
-				  dma32_bootmem_size);
+	free_bootmem(__pa(dma32_bootmem_ptr), dma32_bootmem_size);
 
 	dma32_bootmem_ptr = NULL;
 	dma32_bootmem_size = 0;
diff --git a/arch/x86/kernel/pci-gart_64.c b/arch/x86/kernel/pci-gart_64.c
index c07455d..bffcf45 100644
--- a/arch/x86/kernel/pci-gart_64.c
+++ b/arch/x86/kernel/pci-gart_64.c
@@ -598,13 +598,13 @@ static __init int init_k8_gatt(struct agp_kern_info *info)
 		dev = k8_northbridges[i];
 		gatt_reg = __pa(gatt) >> 12;
 		gatt_reg <<= 4;
-		pci_write_config_dword(dev, 0x98, gatt_reg);
-		pci_read_config_dword(dev, 0x90, &ctl);
+		pci_write_config_dword(dev, AMD64_GARTTABLEBASE, gatt_reg);
+		pci_read_config_dword(dev, AMD64_GARTAPERTURECTL, &ctl);
 
-		ctl |= 1;
-		ctl &= ~((1<<4) | (1<<5));
+		ctl |= GARTEN;
+		ctl &= ~(DISGARTCPU | DISGARTIO);
 
-		pci_write_config_dword(dev, 0x90, ctl);
+		pci_write_config_dword(dev, AMD64_GARTAPERTURECTL, ctl);
 	}
 	flush_gart();
 
diff --git a/drivers/char/agp/amd64-agp.c b/drivers/char/agp/amd64-agp.c
index d8200ac..9c24470 100644
--- a/drivers/char/agp/amd64-agp.c
+++ b/drivers/char/agp/amd64-agp.c
@@ -16,28 +16,9 @@
 #include <asm/page.h>		/* PAGE_SIZE */
 #include <asm/e820.h>
 #include <asm/k8.h>
+#include <asm/gart.h>
 #include "agp.h"
 
-/* PTE bits. */
-#define GPTE_VALID	1
-#define GPTE_COHERENT	2
-
-/* Aperture control register bits. */
-#define GARTEN		(1<<0)
-#define DISGARTCPU	(1<<4)
-#define DISGARTIO	(1<<5)
-
-/* GART cache control register bits. */
-#define INVGART		(1<<0)
-#define GARTPTEERR	(1<<1)
-
-/* K8 On-cpu GART registers */
-#define AMD64_GARTAPERTURECTL	0x90
-#define AMD64_GARTAPERTUREBASE	0x94
-#define AMD64_GARTTABLEBASE	0x98
-#define AMD64_GARTCACHECTL	0x9c
-#define AMD64_GARTEN		(1<<0)
-
 /* NVIDIA K8 registers */
 #define NVIDIA_X86_64_0_APBASE		0x10
 #define NVIDIA_X86_64_1_APBASE1		0x50
@@ -165,7 +146,7 @@ static int amd64_fetch_size(void)
  * In a multiprocessor x86-64 system, this function gets
  * called once for each CPU.
  */
-static u64 amd64_configure (struct pci_dev *hammer, u64 gatt_table)
+static u64 amd64_configure(struct pci_dev *hammer, u64 gatt_table)
 {
 	u64 aperturebase;
 	u32 tmp;
@@ -181,7 +162,7 @@ static u64 amd64_configure (struct pci_dev *hammer, u64 gatt_table)
 	addr >>= 12;
 	tmp = (u32) addr<<4;
 	tmp &= ~0xf;
-	pci_write_config_dword (hammer, AMD64_GARTTABLEBASE, tmp);
+	pci_write_config_dword(hammer, AMD64_GARTTABLEBASE, tmp);
 
 	/* Enable GART translation for this hammer. */
 	pci_read_config_dword(hammer, AMD64_GARTAPERTURECTL, &tmp);
@@ -264,11 +245,7 @@ static int __devinit aperture_valid(u64 aper, u32 size)
 		printk(KERN_ERR PFX "No aperture\n");
 		return 0;
 	}
-	if (size < 32*1024*1024) {
-		printk(KERN_ERR PFX "Aperture too small (%d MB)\n", size>>20);
-		return 0;
-	}
-       if ((u64)aper + size > 0x100000000ULL) {
+	if ((u64)aper + size > 0x100000000ULL) {
 		printk(KERN_ERR PFX "Aperture out of bounds\n");
 		return 0;
 	}
@@ -276,6 +253,10 @@ static int __devinit aperture_valid(u64 aper, u32 size)
 		printk(KERN_ERR PFX "Aperture pointing to RAM\n");
 		return 0;
 	}
+	if (size < 32*1024*1024) {
+		printk(KERN_ERR PFX "Aperture too small (%d MB)\n", size>>20);
+		return 0;
+	}
 
 	/* Request the Aperture. This catches cases when someone else
 	   already put a mapping in there - happens with some very broken BIOS
@@ -331,6 +312,17 @@ static __devinit int fix_northbridge(struct pci_dev *nb, struct pci_dev *agp,
 	pci_read_config_dword(agp, 0x10, &aper_low);
 	pci_read_config_dword(agp, 0x14, &aper_hi);
 	aper = (aper_low & ~((1<<22)-1)) | ((u64)aper_hi << 32);
+
+	/*
+	 * On some sick chips APSIZE is 0. This means it wants 4G
+	 * so let double check that order, and lets trust the AMD NB settings
+	 */
+	if (order >=0 && aper + (32ULL<<(20 + order)) > 0x100000000ULL) {
+		printk(KERN_INFO "Aperture size %u MB is not right, using settings from NB\n",
+				  32 << order);
+		order = nb_order;
+	}
+
 	printk(KERN_INFO PFX "Aperture from AGP @ %Lx size %u MB\n", aper, 32 << order);
 	if (order < 0 || !aperture_valid(aper, (32*1024*1024)<<order))
 		return -1;
diff --git a/include/asm-x86/gart.h b/include/asm-x86/gart.h
index 90958ed..248e577 100644
--- a/include/asm-x86/gart.h
+++ b/include/asm-x86/gart.h
@@ -5,6 +5,7 @@ extern void pci_iommu_shutdown(void);
 extern void no_iommu_init(void);
 extern int force_iommu, no_iommu;
 extern int iommu_detected;
+extern int agp_amd64_init(void);
 #ifdef CONFIG_GART_IOMMU
 extern void gart_iommu_init(void);
 extern void gart_iommu_shutdown(void);
@@ -31,4 +32,24 @@ static inline void gart_iommu_shutdown(void)
 
 #endif
 
+/* PTE bits. */
+#define GPTE_VALID	1
+#define GPTE_COHERENT	2
+
+/* Aperture control register bits. */
+#define GARTEN		(1<<0)
+#define DISGARTCPU	(1<<4)
+#define DISGARTIO	(1<<5)
+
+/* GART cache control register bits. */
+#define INVGART		(1<<0)
+#define GARTPTEERR	(1<<1)
+
+/* K8 On-cpu GART registers */
+#define AMD64_GARTAPERTURECTL	0x90
+#define AMD64_GARTAPERTUREBASE	0x94
+#define AMD64_GARTTABLEBASE	0x98
+#define AMD64_GARTCACHECTL	0x9c
+#define AMD64_GARTEN		(1<<0)
+
 #endif


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [RFC git pull] "big box" x86 changes, GART
  2008-04-26 20:24   ` [RFC git pull] "big box" x86 changes, GART Ingo Molnar
@ 2008-04-26 20:26     ` Ingo Molnar
  0 siblings, 0 replies; 52+ messages in thread
From: Ingo Molnar @ 2008-04-26 20:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Yinghai Lu, Yinghai Lu, jbarnes


* Ingo Molnar <mingo@elte.hu> wrote:

> Linus, please pull the x86 "big box & GART" subtopic tree from:
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-gart.git for-linus
> 
> it also touches drivers/char/agp/amd64-agp.c.

forgot to mention that these changes are pullable standalone, without 
the MM changes. I just test-booted it on a GART box to make sure:

  PCI-DMA: using GART IOMMU.

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [git pull] "big box" x86 changes, bootmem/sparsemem, #2
  2008-04-26 20:08       ` [git pull] "big box" x86 changes, bootmem/sparsemem, #2 Ingo Molnar
@ 2008-04-26 20:30         ` Linus Torvalds
  2008-04-26 20:55           ` [git pull] "big box" x86 changes, bootmem/sparsemem, #3 Ingo Molnar
  0 siblings, 1 reply; 52+ messages in thread
From: Linus Torvalds @ 2008-04-26 20:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Yinghai Lu, Yinghai Lu, jbarnes



On Sat, 26 Apr 2008, Ingo Molnar wrote:
> 
> here's the updated tree:
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-bootmem-v2.git for-linus

Ok, this clashes with the bootparam changes..

		Linus

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [git pull] "big box" x86 changes, boot protocol
  2008-04-26 19:54   ` [git pull] "big box" x86 changes, boot protocol Ingo Molnar
@ 2008-04-26 20:39     ` Andrew Morton
  2008-04-26 21:06       ` Adrian Bunk
  2008-04-26 23:37       ` Jeremy Fitzhardinge
  2008-04-27 11:21     ` Ian Campbell
  1 sibling, 2 replies; 52+ messages in thread
From: Andrew Morton @ 2008-04-26 20:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, Thomas Gleixner, H. Peter Anvin,
	Yinghai Lu, Yinghai Lu, jbarnes

On Sat, 26 Apr 2008 21:54:07 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> 
> Linus, please pull the following x86 changes from:
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-bootparam.git for-linus
> 
> these are boot parameter extensions for really large SGI UV boxes. The 
> change was seen and acked by the boot protocol guys. (well, Peter that 
> is ;-)
> 
> ...
>
> +void __init free_early(unsigned long start, unsigned long end)
> +{
> +	struct early_res *r;
> +	int i, j;
> +
> +	for (i = 0; i < MAX_EARLY_RES && early_res[i].end; i++) {
> +		r = &early_res[i];
> +		if (start == r->start && end == r->end)
> +			break;
> +	}
> +	if (i >= MAX_EARLY_RES || !early_res[i].end)
> +		panic("free_early on not reserved area: %lx-%lx!", start, end);
> +
> +	for (j = i + 1; j < MAX_EARLY_RES && early_res[j].end; j++)
> +		;
> +
> +	memcpy(&early_res[i], &early_res[i + 1],
> +	       (j - 1 - i) * sizeof(struct early_res));

nit: memcpy() shouldn't be used for overlapping copies.  It happens to be
OK (for dst<src) in the kernel implementations.  We hope.

> +	early_res[j - 1].end = 0;
> +}
> +

> +static ssize_t
> +setup_data_read(struct file *file, char __user *user_buf, size_t count,
> +		loff_t *ppos)
> +{
> +	struct setup_data_node *node = file->private_data;
> +	unsigned long remain;
> +	loff_t pos = *ppos;
> +	struct page *pg;
> +	void *p;
> +	u64 pa;
> +
> +	if (pos < 0)
> +		return -EINVAL;
> +	if (pos >= node->len)
> +		return 0;
> +
> +	if (count > node->len - pos)
> +		count = node->len - pos;
> +	pa = node->paddr + sizeof(struct setup_data) + pos;
> +	pg = pfn_to_page((pa + count - 1) >> PAGE_SHIFT);
> +	if (PageHighMem(pg)) {
> +		p = ioremap_cache(pa, count);
> +		if (!p)
> +			return -ENXIO;
> +	} else {
> +		p = __va(pa);
> +	}
> +
> +	remain = copy_to_user(user_buf, p, count);
> +
> +	if (PageHighMem(pg))
> +		iounmap(p);
> +
> +	if (remain)
> +		return -EFAULT;
> +
> +	*ppos = pos + count;
> +
> +	return count;
> +}

nit2: a read() function should return the number of bytes copied, and
should advance the file pointer by that much.  This code fails to do this
when a partial copy_to_user() occurs.

But we've made that mistake in many places and it doesn't appear to matter.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [git pull] "big box" x86 changes, bootmem/sparsemem, #3
  2008-04-26 20:30         ` Linus Torvalds
@ 2008-04-26 20:55           ` Ingo Molnar
  0 siblings, 0 replies; 52+ messages in thread
From: Ingo Molnar @ 2008-04-26 20:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Yinghai Lu, Yinghai Lu, jbarnes


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > here's the updated tree:
> > 
> >    git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-bootmem-v2.git for-linus
> 
> Ok, this clashes with the bootparam changes..

yeah. I respun the tree against the bootparam changes, you can pull it 
from:

  git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-bootmem-v3.git for-linus

booted this specific tree up once so far. Thanks,

	Ingo

------------------>
Yinghai Lu (7):
      mm: make mem_map allocation continuous
      mm: fix alloc_bootmem_core to use fast searching for all nodes
      mm: offset align in alloc_bootmem()
      mm: allow reserve_bootmem() cross nodes
      x86_64: make reserve_bootmem_generic() use new reserve_bootmem()
      x86_64: fix setup_node_bootmem to support big mem excluding with memmap
      x86_64/mm: check and print vmemmap allocation continuous

 arch/x86/kernel/e820_64.c  |   13 +++-
 arch/x86/kernel/setup_64.c |    3 +-
 arch/x86/mm/init_64.c      |   38 +++++++++-
 arch/x86/mm/numa_64.c      |   42 ++++++++++--
 include/asm-x86/e820_64.h  |    2 +-
 include/linux/mm.h         |    1 +
 mm/bootmem.c               |  164 ++++++++++++++++++++++++++++++--------------
 mm/sparse.c                |   37 +++++++++-
 8 files changed, 228 insertions(+), 72 deletions(-)

diff --git a/arch/x86/kernel/e820_64.c b/arch/x86/kernel/e820_64.c
index 79f0d52..645ee5e 100644
--- a/arch/x86/kernel/e820_64.c
+++ b/arch/x86/kernel/e820_64.c
@@ -106,14 +106,19 @@ void __init free_early(unsigned long start, unsigned long end)
 	early_res[j - 1].end = 0;
 }
 
-void __init early_res_to_bootmem(void)
+void __init early_res_to_bootmem(unsigned long start, unsigned long end)
 {
 	int i;
+	unsigned long final_start, final_end;
 	for (i = 0; i < MAX_EARLY_RES && early_res[i].end; i++) {
 		struct early_res *r = &early_res[i];
-		printk(KERN_INFO "early res: %d [%lx-%lx] %s\n", i,
-			r->start, r->end - 1, r->name);
-		reserve_bootmem_generic(r->start, r->end - r->start);
+		final_start = max(start, r->start);
+		final_end = min(end, r->end);
+		if (final_start >= final_end)
+			continue;
+		printk(KERN_INFO "  early res: %d [%lx-%lx] %s\n", i,
+			final_start, final_end - 1, r->name);
+		reserve_bootmem_generic(final_start, final_end - final_start);
 	}
 }
 
diff --git a/arch/x86/kernel/setup_64.c b/arch/x86/kernel/setup_64.c
index b04e2c0..60e64c8 100644
--- a/arch/x86/kernel/setup_64.c
+++ b/arch/x86/kernel/setup_64.c
@@ -190,6 +190,7 @@ contig_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
 	bootmap_size = init_bootmem(bootmap >> PAGE_SHIFT, end_pfn);
 	e820_register_active_regions(0, start_pfn, end_pfn);
 	free_bootmem_with_active_regions(0, end_pfn);
+	early_res_to_bootmem(0, end_pfn<<PAGE_SHIFT);
 	reserve_bootmem(bootmap, bootmap_size, BOOTMEM_DEFAULT);
 }
 #endif
@@ -421,8 +422,6 @@ void __init setup_arch(char **cmdline_p)
 	contig_initmem_init(0, end_pfn);
 #endif
 
-	early_res_to_bootmem();
-
 	dma32_reserve_bootmem();
 
 #ifdef CONFIG_ACPI_SLEEP
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0cca626..5fbb865 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -810,7 +810,7 @@ void free_initrd_mem(unsigned long start, unsigned long end)
 void __init reserve_bootmem_generic(unsigned long phys, unsigned len)
 {
 #ifdef CONFIG_NUMA
-	int nid = phys_to_nid(phys);
+	int nid, next_nid;
 #endif
 	unsigned long pfn = phys >> PAGE_SHIFT;
 
@@ -829,10 +829,16 @@ void __init reserve_bootmem_generic(unsigned long phys, unsigned len)
 
 	/* Should check here against the e820 map to avoid double free */
 #ifdef CONFIG_NUMA
-	reserve_bootmem_node(NODE_DATA(nid), phys, len, BOOTMEM_DEFAULT);
+	nid = phys_to_nid(phys);
+	next_nid = phys_to_nid(phys + len - 1);
+	if (nid == next_nid)
+		reserve_bootmem_node(NODE_DATA(nid), phys, len, BOOTMEM_DEFAULT);
+	else
+		reserve_bootmem(phys, len, BOOTMEM_DEFAULT);
 #else
 	reserve_bootmem(phys, len, BOOTMEM_DEFAULT);
 #endif
+
 	if (phys+len <= MAX_DMA_PFN*PAGE_SIZE) {
 		dma_reserve += len / PAGE_SIZE;
 		set_dma_reserve(dma_reserve);
@@ -926,6 +932,10 @@ const char *arch_vma_name(struct vm_area_struct *vma)
 /*
  * Initialise the sparsemem vmemmap using huge-pages at the PMD level.
  */
+static long __meminitdata addr_start, addr_end;
+static void __meminitdata *p_start, *p_end;
+static int __meminitdata node_start;
+
 int __meminit
 vmemmap_populate(struct page *start_page, unsigned long size, int node)
 {
@@ -960,12 +970,32 @@ vmemmap_populate(struct page *start_page, unsigned long size, int node)
 							PAGE_KERNEL_LARGE);
 			set_pmd(pmd, __pmd(pte_val(entry)));
 
-			printk(KERN_DEBUG " [%lx-%lx] PMD ->%p on node %d\n",
-				addr, addr + PMD_SIZE - 1, p, node);
+			/* check to see if we have contiguous blocks */
+			if (p_end != p || node_start != node) {
+				if (p_start)
+					printk(KERN_DEBUG " [%lx-%lx] PMD -> [%p-%p] on node %d\n",
+						addr_start, addr_end-1, p_start, p_end-1, node_start);
+				addr_start = addr;
+				node_start = node;
+				p_start = p;
+			}
+			addr_end = addr + PMD_SIZE;
+			p_end = p + PMD_SIZE;
 		} else {
 			vmemmap_verify((pte_t *)pmd, node, addr, next);
 		}
 	}
 	return 0;
 }
+
+void __meminit vmemmap_populate_print_last(void)
+{
+	if (p_start) {
+		printk(KERN_DEBUG " [%lx-%lx] PMD -> [%p-%p] on node %d\n",
+			addr_start, addr_end-1, p_start, p_end-1, node_start);
+		p_start = NULL;
+		p_end = NULL;
+		node_start = 0;
+	}
+}
 #endif
diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
index 9a68922..c5066d5 100644
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -196,6 +196,7 @@ void __init setup_node_bootmem(int nodeid, unsigned long start,
 	unsigned long bootmap_start, nodedata_phys;
 	void *bootmap;
 	const int pgdat_size = round_up(sizeof(pg_data_t), PAGE_SIZE);
+	int nid;
 
 	start = round_up(start, ZONE_ALIGN);
 
@@ -218,9 +219,19 @@ void __init setup_node_bootmem(int nodeid, unsigned long start,
 	NODE_DATA(nodeid)->node_start_pfn = start_pfn;
 	NODE_DATA(nodeid)->node_spanned_pages = end_pfn - start_pfn;
 
-	/* Find a place for the bootmem map */
+	/*
+	 * Find a place for the bootmem map
+	 * nodedata_phys could be on other nodes by alloc_bootmem,
+	 * so need to sure bootmap_start not to be small, otherwise
+	 * early_node_mem will get that with find_e820_area instead
+	 * of alloc_bootmem, that could clash with reserved range
+	 */
 	bootmap_pages = bootmem_bootmap_pages(end_pfn - start_pfn);
-	bootmap_start = round_up(nodedata_phys + pgdat_size, PAGE_SIZE);
+	nid = phys_to_nid(nodedata_phys);
+	if (nid == nodeid)
+		bootmap_start = round_up(nodedata_phys + pgdat_size, PAGE_SIZE);
+	else
+		bootmap_start = round_up(start, PAGE_SIZE);
 	/*
 	 * SMP_CAHCE_BYTES could be enough, but init_bootmem_node like
 	 * to use that to align to PAGE_SIZE
@@ -245,10 +256,29 @@ void __init setup_node_bootmem(int nodeid, unsigned long start,
 
 	free_bootmem_with_active_regions(nodeid, end);
 
-	reserve_bootmem_node(NODE_DATA(nodeid), nodedata_phys, pgdat_size,
-			BOOTMEM_DEFAULT);
-	reserve_bootmem_node(NODE_DATA(nodeid), bootmap_start,
-			bootmap_pages<<PAGE_SHIFT, BOOTMEM_DEFAULT);
+	/*
+	 * convert early reserve to bootmem reserve earlier
+	 * otherwise early_node_mem could use early reserved mem
+	 * on previous node
+	 */
+	early_res_to_bootmem(start, end);
+
+	/*
+	 * in some case early_node_mem could use alloc_bootmem
+	 * to get range on other node, don't reserve that again
+	 */
+	if (nid != nodeid)
+		printk(KERN_INFO "    NODE_DATA(%d) on node %d\n", nodeid, nid);
+	else
+		reserve_bootmem_node(NODE_DATA(nodeid), nodedata_phys,
+					pgdat_size, BOOTMEM_DEFAULT);
+	nid = phys_to_nid(bootmap_start);
+	if (nid != nodeid)
+		printk(KERN_INFO "    bootmap(%d) on node %d\n", nodeid, nid);
+	else
+		reserve_bootmem_node(NODE_DATA(nodeid), bootmap_start,
+				 bootmap_pages<<PAGE_SHIFT, BOOTMEM_DEFAULT);
+
 #ifdef CONFIG_ACPI_NUMA
 	srat_reserve_add_area(nodeid);
 #endif
diff --git a/include/asm-x86/e820_64.h b/include/asm-x86/e820_64.h
index b5e02e3..71c4d68 100644
--- a/include/asm-x86/e820_64.h
+++ b/include/asm-x86/e820_64.h
@@ -49,7 +49,7 @@ extern void update_e820(void);
 
 extern void reserve_early(unsigned long start, unsigned long end, char *name);
 extern void free_early(unsigned long start, unsigned long end);
-extern void early_res_to_bootmem(void);
+extern void early_res_to_bootmem(unsigned long start, unsigned long end);
 
 #endif/*!__ASSEMBLY__*/
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b695875..286d315 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1229,6 +1229,7 @@ void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
 int vmemmap_populate_basepages(struct page *start_page,
 						unsigned long pages, int node);
 int vmemmap_populate(struct page *start_page, unsigned long pages, int node);
+void vmemmap_populate_print_last(void);
 
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/mm/bootmem.c b/mm/bootmem.c
index 2ccea70..b679164 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -111,44 +111,74 @@ static unsigned long __init init_bootmem_core(pg_data_t *pgdat,
  * might be used for boot-time allocations - or it might get added
  * to the free page pool later on.
  */
-static int __init reserve_bootmem_core(bootmem_data_t *bdata,
+static int __init can_reserve_bootmem_core(bootmem_data_t *bdata,
 			unsigned long addr, unsigned long size, int flags)
 {
 	unsigned long sidx, eidx;
 	unsigned long i;
-	int ret;
+
+	BUG_ON(!size);
+
+	/* out of range, don't hold other */
+	if (addr + size < bdata->node_boot_start ||
+		PFN_DOWN(addr) > bdata->node_low_pfn)
+		return 0;
 
 	/*
-	 * round up, partially reserved pages are considered
-	 * fully reserved.
+	 * Round up to index to the range.
 	 */
+	if (addr > bdata->node_boot_start)
+		sidx= PFN_DOWN(addr - bdata->node_boot_start);
+	else
+		sidx = 0;
+
+	eidx = PFN_UP(addr + size - bdata->node_boot_start);
+	if (eidx > bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start))
+		eidx = bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start);
+
+	for (i = sidx; i < eidx; i++) {
+		if (test_bit(i, bdata->node_bootmem_map)) {
+			if (flags & BOOTMEM_EXCLUSIVE)
+				return -EBUSY;
+		}
+	}
+
+	return 0;
+
+}
+
+static void __init reserve_bootmem_core(bootmem_data_t *bdata,
+			unsigned long addr, unsigned long size, int flags)
+{
+	unsigned long sidx, eidx;
+	unsigned long i;
+
 	BUG_ON(!size);
-	BUG_ON(PFN_DOWN(addr) >= bdata->node_low_pfn);
-	BUG_ON(PFN_UP(addr + size) > bdata->node_low_pfn);
-	BUG_ON(addr < bdata->node_boot_start);
 
-	sidx = PFN_DOWN(addr - bdata->node_boot_start);
+	/* out of range */
+	if (addr + size < bdata->node_boot_start ||
+		PFN_DOWN(addr) > bdata->node_low_pfn)
+		return;
+
+	/*
+	 * Round up to index to the range.
+	 */
+	if (addr > bdata->node_boot_start)
+		sidx= PFN_DOWN(addr - bdata->node_boot_start);
+	else
+		sidx = 0;
+
 	eidx = PFN_UP(addr + size - bdata->node_boot_start);
+	if (eidx > bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start))
+		eidx = bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start);
 
-	for (i = sidx; i < eidx; i++)
+	for (i = sidx; i < eidx; i++) {
 		if (test_and_set_bit(i, bdata->node_bootmem_map)) {
 #ifdef CONFIG_DEBUG_BOOTMEM
 			printk("hm, page %08lx reserved twice.\n", i*PAGE_SIZE);
 #endif
-			if (flags & BOOTMEM_EXCLUSIVE) {
-				ret = -EBUSY;
-				goto err;
-			}
 		}
-
-	return 0;
-
-err:
-	/* unreserve memory we accidentally reserved */
-	for (i--; i >= sidx; i--)
-		clear_bit(i, bdata->node_bootmem_map);
-
-	return ret;
+	}
 }
 
 static void __init free_bootmem_core(bootmem_data_t *bdata, unsigned long addr,
@@ -206,9 +236,11 @@ void * __init
 __alloc_bootmem_core(struct bootmem_data *bdata, unsigned long size,
 	      unsigned long align, unsigned long goal, unsigned long limit)
 {
-	unsigned long offset, remaining_size, areasize, preferred;
+	unsigned long areasize, preferred;
 	unsigned long i, start = 0, incr, eidx, end_pfn;
 	void *ret;
+	unsigned long node_boot_start;
+	void *node_bootmem_map;
 
 	if (!size) {
 		printk("__alloc_bootmem_core(): zero-sized request\n");
@@ -216,70 +248,83 @@ __alloc_bootmem_core(struct bootmem_data *bdata, unsigned long size,
 	}
 	BUG_ON(align & (align-1));
 
-	if (limit && bdata->node_boot_start >= limit)
-		return NULL;
-
 	/* on nodes without memory - bootmem_map is NULL */
 	if (!bdata->node_bootmem_map)
 		return NULL;
 
+	/* bdata->node_boot_start is supposed to be (12+6)bits alignment on x86_64 ? */
+	node_boot_start = bdata->node_boot_start;
+	node_bootmem_map = bdata->node_bootmem_map;
+	if (align) {
+		node_boot_start = ALIGN(bdata->node_boot_start, align);
+		if (node_boot_start > bdata->node_boot_start)
+			node_bootmem_map = (unsigned long *)bdata->node_bootmem_map +
+			    PFN_DOWN(node_boot_start - bdata->node_boot_start)/BITS_PER_LONG;
+	}
+
+	if (limit && node_boot_start >= limit)
+		return NULL;
+
 	end_pfn = bdata->node_low_pfn;
 	limit = PFN_DOWN(limit);
 	if (limit && end_pfn > limit)
 		end_pfn = limit;
 
-	eidx = end_pfn - PFN_DOWN(bdata->node_boot_start);
-	offset = 0;
-	if (align && (bdata->node_boot_start & (align - 1UL)) != 0)
-		offset = align - (bdata->node_boot_start & (align - 1UL));
-	offset = PFN_DOWN(offset);
+	eidx = end_pfn - PFN_DOWN(node_boot_start);
 
 	/*
 	 * We try to allocate bootmem pages above 'goal'
 	 * first, then we try to allocate lower pages.
 	 */
-	if (goal && goal >= bdata->node_boot_start && PFN_DOWN(goal) < end_pfn) {
-		preferred = goal - bdata->node_boot_start;
+	preferred = 0;
+	if (goal && PFN_DOWN(goal) < end_pfn) {
+		if (goal > node_boot_start)
+			preferred = goal - node_boot_start;
 
-		if (bdata->last_success >= preferred)
+		if (bdata->last_success > node_boot_start &&
+			bdata->last_success - node_boot_start >= preferred)
 			if (!limit || (limit && limit > bdata->last_success))
-				preferred = bdata->last_success;
-	} else
-		preferred = 0;
+				preferred = bdata->last_success - node_boot_start;
+	}
 
-	preferred = PFN_DOWN(ALIGN(preferred, align)) + offset;
+	preferred = PFN_DOWN(ALIGN(preferred, align));
 	areasize = (size + PAGE_SIZE-1) / PAGE_SIZE;
 	incr = align >> PAGE_SHIFT ? : 1;
 
 restart_scan:
-	for (i = preferred; i < eidx; i += incr) {
+	for (i = preferred; i < eidx;) {
 		unsigned long j;
-		i = find_next_zero_bit(bdata->node_bootmem_map, eidx, i);
+
+		i = find_next_zero_bit(node_bootmem_map, eidx, i);
 		i = ALIGN(i, incr);
 		if (i >= eidx)
 			break;
-		if (test_bit(i, bdata->node_bootmem_map))
+		if (test_bit(i, node_bootmem_map)) {
+			i += incr;
 			continue;
+		}
 		for (j = i + 1; j < i + areasize; ++j) {
 			if (j >= eidx)
 				goto fail_block;
-			if (test_bit(j, bdata->node_bootmem_map))
+			if (test_bit(j, node_bootmem_map))
 				goto fail_block;
 		}
 		start = i;
 		goto found;
 	fail_block:
 		i = ALIGN(j, incr);
+		if (i == j)
+			i += incr;
 	}
 
-	if (preferred > offset) {
-		preferred = offset;
+	if (preferred > 0) {
+		preferred = 0;
 		goto restart_scan;
 	}
 	return NULL;
 
 found:
-	bdata->last_success = PFN_PHYS(start);
+	bdata->last_success = PFN_PHYS(start) + node_boot_start;
 	BUG_ON(start >= eidx);
 
 	/*
@@ -289,6 +334,7 @@ found:
 	 */
 	if (align < PAGE_SIZE &&
 	    bdata->last_offset && bdata->last_pos+1 == start) {
+		unsigned long offset, remaining_size;
 		offset = ALIGN(bdata->last_offset, align);
 		BUG_ON(offset > PAGE_SIZE);
 		remaining_size = PAGE_SIZE - offset;
@@ -297,14 +343,12 @@ found:
 			/* last_pos unchanged */
 			bdata->last_offset = offset + size;
 			ret = phys_to_virt(bdata->last_pos * PAGE_SIZE +
-					   offset +
-					   bdata->node_boot_start);
+					   offset + node_boot_start);
 		} else {
 			remaining_size = size - remaining_size;
 			areasize = (remaining_size + PAGE_SIZE-1) / PAGE_SIZE;
 			ret = phys_to_virt(bdata->last_pos * PAGE_SIZE +
-					   offset +
-					   bdata->node_boot_start);
+					   offset + node_boot_start);
 			bdata->last_pos = start + areasize - 1;
 			bdata->last_offset = remaining_size;
 		}
@@ -312,14 +356,14 @@ found:
 	} else {
 		bdata->last_pos = start + areasize - 1;
 		bdata->last_offset = size & ~PAGE_MASK;
-		ret = phys_to_virt(start * PAGE_SIZE + bdata->node_boot_start);
+		ret = phys_to_virt(start * PAGE_SIZE + node_boot_start);
 	}
 
 	/*
 	 * Reserve the area now:
 	 */
 	for (i = start; i < start + areasize; i++)
-		if (unlikely(test_and_set_bit(i, bdata->node_bootmem_map)))
+		if (unlikely(test_and_set_bit(i, node_bootmem_map)))
 			BUG();
 	memset(ret, 0, size);
 	return ret;
@@ -401,6 +445,11 @@ unsigned long __init init_bootmem_node(pg_data_t *pgdat, unsigned long freepfn,
 void __init reserve_bootmem_node(pg_data_t *pgdat, unsigned long physaddr,
 				 unsigned long size, int flags)
 {
+	int ret;
+
+	ret = can_reserve_bootmem_core(pgdat->bdata, physaddr, size, flags);
+	if (ret < 0)
+		return;
 	reserve_bootmem_core(pgdat->bdata, physaddr, size, flags);
 }
 
@@ -426,7 +475,18 @@ unsigned long __init init_bootmem(unsigned long start, unsigned long pages)
 int __init reserve_bootmem(unsigned long addr, unsigned long size,
 			    int flags)
 {
-	return reserve_bootmem_core(NODE_DATA(0)->bdata, addr, size, flags);
+	bootmem_data_t *bdata;
+	int ret;
+
+	list_for_each_entry(bdata, &bdata_list, list) {
+		ret = can_reserve_bootmem_core(bdata, addr, size, flags);
+		if (ret < 0)
+			return ret;
+	}
+	list_for_each_entry(bdata, &bdata_list, list)
+		reserve_bootmem_core(bdata, addr, size, flags);
+
+	return 0;
 }
 #endif /* !CONFIG_HAVE_ARCH_BOOTMEM_NODE */
 
diff --git a/mm/sparse.c b/mm/sparse.c
index 98d6b39..7e91913 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -295,6 +295,9 @@ struct page __init *sparse_early_mem_map_alloc(unsigned long pnum)
 	return NULL;
 }
 
+void __attribute__((weak)) __meminit vmemmap_populate_print_last(void)
+{
+}
 /*
  * Allocate the accumulated non-linear sections, allocate a mem_map
  * for each and record the physical to section mapping.
@@ -304,22 +307,50 @@ void __init sparse_init(void)
 	unsigned long pnum;
 	struct page *map;
 	unsigned long *usemap;
+	unsigned long **usemap_map;
+	int size;
+
+	/*
+	 * map is using big page (aka 2M in x86 64 bit)
+	 * usemap is less one page (aka 24 bytes)
+	 * so alloc 2M (with 2M align) and 24 bytes in turn will
+	 * make next 2M slip to one more 2M later.
+	 * then in big system, the memory will have a lot of holes...
+	 * here try to allocate 2M pages continously.
+	 *
+	 * powerpc need to call sparse_init_one_section right after each
+	 * sparse_early_mem_map_alloc, so allocate usemap_map at first.
+	 */
+	size = sizeof(unsigned long *) * NR_MEM_SECTIONS;
+	usemap_map = alloc_bootmem(size);
+	if (!usemap_map)
+		panic("can not allocate usemap_map\n");
 
 	for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
 		if (!present_section_nr(pnum))
 			continue;
+		usemap_map[pnum] = sparse_early_usemap_alloc(pnum);
+	}
 
-		map = sparse_early_mem_map_alloc(pnum);
-		if (!map)
+	for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
+		if (!present_section_nr(pnum))
 			continue;
 
-		usemap = sparse_early_usemap_alloc(pnum);
+		usemap = usemap_map[pnum];
 		if (!usemap)
 			continue;
 
+		map = sparse_early_mem_map_alloc(pnum);
+		if (!map)
+			continue;
+
 		sparse_init_one_section(__nr_to_section(pnum), pnum, map,
 								usemap);
 	}
+
+	vmemmap_populate_print_last();
+
+	free_bootmem(__pa(usemap_map), size);
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [git pull] "big box" x86 changes, boot protocol
  2008-04-26 20:39     ` Andrew Morton
@ 2008-04-26 21:06       ` Adrian Bunk
  2008-04-26 21:10         ` H. Peter Anvin
  2008-04-26 21:11         ` Linus Torvalds
  2008-04-26 23:37       ` Jeremy Fitzhardinge
  1 sibling, 2 replies; 52+ messages in thread
From: Adrian Bunk @ 2008-04-26 21:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Linus Torvalds, linux-kernel, Thomas Gleixner,
	H. Peter Anvin, Yinghai Lu, Yinghai Lu, jbarnes

On Sat, Apr 26, 2008 at 01:39:28PM -0700, Andrew Morton wrote:
> On Sat, 26 Apr 2008 21:54:07 +0200 Ingo Molnar <mingo@elte.hu> wrote:
>...
> >
> > +void __init free_early(unsigned long start, unsigned long end)
> > +{
> > +	struct early_res *r;
> > +	int i, j;
> > +
> > +	for (i = 0; i < MAX_EARLY_RES && early_res[i].end; i++) {
> > +		r = &early_res[i];
> > +		if (start == r->start && end == r->end)
> > +			break;
> > +	}
> > +	if (i >= MAX_EARLY_RES || !early_res[i].end)
> > +		panic("free_early on not reserved area: %lx-%lx!", start, end);
> > +
> > +	for (j = i + 1; j < MAX_EARLY_RES && early_res[j].end; j++)
> > +		;
> > +
> > +	memcpy(&early_res[i], &early_res[i + 1],
> > +	       (j - 1 - i) * sizeof(struct early_res));
> 
> nit: memcpy() shouldn't be used for overlapping copies.  It happens to be
> OK (for dst<src) in the kernel implementations.  We hope.
>...

We always use the gcc builtin for memcpy() here.

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [git pull] "big box" x86 changes, boot protocol
  2008-04-26 21:06       ` Adrian Bunk
@ 2008-04-26 21:10         ` H. Peter Anvin
  2008-04-26 21:11         ` Linus Torvalds
  1 sibling, 0 replies; 52+ messages in thread
From: H. Peter Anvin @ 2008-04-26 21:10 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Andrew Morton, Ingo Molnar, Linus Torvalds, linux-kernel,
	Thomas Gleixner, Yinghai Lu, Yinghai Lu, jbarnes

Adrian Bunk wrote:
>>> +	memcpy(&early_res[i], &early_res[i + 1],
>>> +	       (j - 1 - i) * sizeof(struct early_res));

>> nit: memcpy() shouldn't be used for overlapping copies.  It happens to be
>> OK (for dst<src) in the kernel implementations.  We hope.
>> ...
> 
> We always use the gcc builtin for memcpy() here.
> 

You have to do something pretty weird for memcpy() to not work for
dst <= src even with overlap; this usually involves architectures that 
have explicit cache control instructions to establish the dst in the 
cache, if used before src is read.

This is not an issue on x86, though.

	-hpa

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [git pull] "big box" x86 changes, boot protocol
  2008-04-26 21:06       ` Adrian Bunk
  2008-04-26 21:10         ` H. Peter Anvin
@ 2008-04-26 21:11         ` Linus Torvalds
  2008-04-26 21:17           ` Ingo Molnar
  1 sibling, 1 reply; 52+ messages in thread
From: Linus Torvalds @ 2008-04-26 21:11 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Andrew Morton, Ingo Molnar, linux-kernel, Thomas Gleixner,
	H. Peter Anvin, Yinghai Lu, Yinghai Lu, jbarnes



On Sun, 27 Apr 2008, Adrian Bunk wrote:

> On Sat, Apr 26, 2008 at 01:39:28PM -0700, Andrew Morton wrote:
> > On Sat, 26 Apr 2008 21:54:07 +0200 Ingo Molnar <mingo@elte.hu> wrote:
> >...
> > > +
> > > +	memcpy(&early_res[i], &early_res[i + 1],
> > > +	       (j - 1 - i) * sizeof(struct early_res));
> > 
> > nit: memcpy() shouldn't be used for overlapping copies.  It happens to be
> > OK (for dst<src) in the kernel implementations.  We hope.
> >...
> 
> We always use the gcc builtin for memcpy() here.

It's probably hard to write a reasonable x86 memcpy() that wouldn't happen 
to do the right thing for this case, but I do agee - we should still use 
memmove() for this, just to make it clear that it does overlapping things.

		Linus

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [git pull] "big box" x86 changes, boot protocol
  2008-04-26 21:11         ` Linus Torvalds
@ 2008-04-26 21:17           ` Ingo Molnar
  0 siblings, 0 replies; 52+ messages in thread
From: Ingo Molnar @ 2008-04-26 21:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Adrian Bunk, Andrew Morton, linux-kernel, Thomas Gleixner,
	H. Peter Anvin, Yinghai Lu, Yinghai Lu, jbarnes


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > We always use the gcc builtin for memcpy() here.
> 
> It's probably hard to write a reasonable x86 memcpy() that wouldn't 
> happen to do the right thing for this case, but I do agee - we should 
> still use memmove() for this, just to make it clear that it does 
> overlapping things.

agreed, i queued up the patch below.

	Ingo

------------->
Subject: bootprotocol: cleanup
From: Ingo Molnar <mingo@elte.hu>
Date: Sat Apr 26 23:14:36 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/e820_64.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-x86.q/arch/x86/kernel/e820_64.c
===================================================================
--- linux-x86.q.orig/arch/x86/kernel/e820_64.c
+++ linux-x86.q/arch/x86/kernel/e820_64.c
@@ -100,7 +100,7 @@ void __init free_early(unsigned long sta
 	for (j = i + 1; j < MAX_EARLY_RES && early_res[j].end; j++)
 		;
 
-	memcpy(&early_res[i], &early_res[i + 1],
+	memmove(&early_res[i], &early_res[i + 1],
 	       (j - 1 - i) * sizeof(struct early_res));
 
 	early_res[j - 1].end = 0;

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [git pull] "big box" x86 changes, PCI
  2008-04-26 19:12 ` Linus Torvalds
                     ` (2 preceding siblings ...)
  2008-04-26 20:24   ` [RFC git pull] "big box" x86 changes, GART Ingo Molnar
@ 2008-04-26 21:55   ` Ingo Molnar
  2008-04-27 16:30     ` Jesse Barnes
  2008-04-28 20:34     ` Jesse Barnes
  3 siblings, 2 replies; 52+ messages in thread
From: Ingo Molnar @ 2008-04-26 21:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Yinghai Lu, Yinghai Lu, jbarnes


ok, this is the final chunk of the "big box" topic - the PCI changes. 

These are the largest, and while i tried to reduce their number it's 
still 19 commits - but it's all around the same topic. The bulk of the 
new code is in a single file. The tree can be pulled from:

   git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-pci.git for-linus

this depends on the bootmem changes which are now upstream. These 
changes too have been in linux-next for some time and the cross-arch 
(build) success rate is high, as in:

   http://www.tglx.de/autoqa-cgi/index?run=89&tree=1

all but the last few patches have been in x86.git for a longer time and 
more than 95% of the changes are for arch/x86. (see the dates of the 
patches. The "15 Feb 2008" ones are older than their timestamp - that's 
when we imported those changes into a date-aware repository.)

i booted up this tree 5 times on x86, mixed 64-bit/32-bit.

	Ingo

------------------>
Robert Hancock (1):
      x86: validate against acpi motherboard resources

Yinghai Lu (18):
      x86: clear pci_mmcfg_virt when mmcfg get rejected
      x86: mmconf enable mcfg early
      x86_64: set cfg_size for AMD Family 10h in case MMCONFIG
      x86_64: check and enable MMCONFIG for AMD Family 10h
      x86_64: check MSR to get MMCONFIG for AMD Family 10h
      x86: if acpi=off, force setting the mmconf for fam10h
      x86: seperate mmconf for fam10h out from setup_64.c
      driver core: try parent numa_node at first before using default
      x86: remove unneeded check in mmconf reject
      x86 pci: remove checking type for mmconfig probe
      x86: get mp_bus_to_node early
      x86: use bus conf in NB conf fun1 to get bus range on, on 64-bit
      x86: multi pci root bus with different io resource range, on 64-bit
      x86: double check the multi root bus with fam10h mmconf
      x86_64: don't need set default res if only have one root bus
      acpi: get boot_cpu_id as early for k8_scan_nodes
      x86: work around io allocation overlap of HT links
      x86: add pci=check_enable_amd_mmconf and dmi check

 arch/x86/kernel/Makefile           |    2 +
 arch/x86/kernel/acpi/boot.c        |   70 +++++
 arch/x86/kernel/mmconf-fam10h_64.c |  243 +++++++++++++++
 arch/x86/kernel/setup_64.c         |   20 ++
 arch/x86/mm/k8topology_64.c        |   38 +++-
 arch/x86/pci/Makefile_32           |    1 +
 arch/x86/pci/Makefile_64           |    2 +-
 arch/x86/pci/acpi.c                |   27 +-
 arch/x86/pci/common.c              |   22 ++-
 arch/x86/pci/direct.c              |    8 +-
 arch/x86/pci/fixup.c               |   17 +
 arch/x86/pci/init.c                |   13 +-
 arch/x86/pci/irq.c                 |    4 +-
 arch/x86/pci/k8-bus_64.c           |  575 ++++++++++++++++++++++++++++++++----
 arch/x86/pci/legacy.c              |    4 +-
 arch/x86/pci/mmconfig-shared.c     |  247 +++++++++++++---
 arch/x86/pci/mmconfig_32.c         |    4 +
 arch/x86/pci/mmconfig_64.c         |   22 ++-
 arch/x86/pci/mp_bus_to_node.c      |   23 ++
 arch/x86/pci/pci.h                 |    3 +-
 drivers/acpi/bus.c                 |    2 +
 drivers/base/core.c                |   14 +-
 drivers/pci/probe.c                |   21 ++-
 include/asm-x86/pci.h              |    2 +
 include/asm-x86/topology.h         |   16 +
 include/linux/acpi.h               |    5 +
 include/linux/pci.h                |   11 +-
 27 files changed, 1284 insertions(+), 132 deletions(-)
 create mode 100644 arch/x86/kernel/mmconf-fam10h_64.c
 create mode 100644 arch/x86/pci/mp_bus_to_node.c

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 90e092d..815b650 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -99,4 +99,6 @@ ifeq ($(CONFIG_X86_64),y)
         obj-$(CONFIG_GART_IOMMU)	+= pci-gart_64.o aperture_64.o
         obj-$(CONFIG_CALGARY_IOMMU)	+= pci-calgary_64.o tce_64.o
         obj-$(CONFIG_SWIOTLB)		+= pci-swiotlb_64.o
+
+        obj-$(CONFIG_PCI_MMCONFIG)	+= mmconf-fam10h_64.o
 endif
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 977ed5c..c49ebcc 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -771,6 +771,32 @@ static void __init acpi_register_lapic_address(unsigned long address)
 		boot_cpu_physical_apicid  = GET_APIC_ID(read_apic_id());
 }
 
+static int __init early_acpi_parse_madt_lapic_addr_ovr(void)
+{
+	int count;
+
+	if (!cpu_has_apic)
+		return -ENODEV;
+
+	/*
+	 * Note that the LAPIC address is obtained from the MADT (32-bit value)
+	 * and (optionally) overriden by a LAPIC_ADDR_OVR entry (64-bit value).
+	 */
+
+	count =
+	    acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_APIC_OVERRIDE,
+				  acpi_parse_lapic_addr_ovr, 0);
+	if (count < 0) {
+		printk(KERN_ERR PREFIX
+		       "Error parsing LAPIC address override entry\n");
+		return count;
+	}
+
+	acpi_register_lapic_address(acpi_lapic_addr);
+
+	return count;
+}
+
 static int __init acpi_parse_madt_lapic_entries(void)
 {
 	int count;
@@ -901,6 +927,33 @@ static inline int acpi_parse_madt_ioapic_entries(void)
 }
 #endif	/* !CONFIG_X86_IO_APIC */
 
+static void __init early_acpi_process_madt(void)
+{
+#ifdef CONFIG_X86_LOCAL_APIC
+	int error;
+
+	if (!acpi_table_parse(ACPI_SIG_MADT, acpi_parse_madt)) {
+
+		/*
+		 * Parse MADT LAPIC entries
+		 */
+		error = early_acpi_parse_madt_lapic_addr_ovr();
+		if (!error) {
+			acpi_lapic = 1;
+			smp_found_config = 1;
+		}
+		if (error == -EINVAL) {
+			/*
+			 * Dell Precision Workstation 410, 610 come here.
+			 */
+			printk(KERN_ERR PREFIX
+			       "Invalid BIOS MADT, disabling ACPI\n");
+			disable_acpi();
+		}
+	}
+#endif
+}
+
 static void __init acpi_process_madt(void)
 {
 #ifdef CONFIG_X86_LOCAL_APIC
@@ -1233,6 +1286,23 @@ int __init acpi_boot_table_init(void)
 	return 0;
 }
 
+int __init early_acpi_boot_init(void)
+{
+	/*
+	 * If acpi_disabled, bail out
+	 * One exception: acpi=ht continues far enough to enumerate LAPICs
+	 */
+	if (acpi_disabled && !acpi_ht)
+		return 1;
+
+	/*
+	 * Process the Multiple APIC Description Table (MADT), if present
+	 */
+	early_acpi_process_madt();
+
+	return 0;
+}
+
 int __init acpi_boot_init(void)
 {
 	/*
diff --git a/arch/x86/kernel/mmconf-fam10h_64.c b/arch/x86/kernel/mmconf-fam10h_64.c
new file mode 100644
index 0000000..edc5fbf
--- /dev/null
+++ b/arch/x86/kernel/mmconf-fam10h_64.c
@@ -0,0 +1,243 @@
+/*
+ * AMD Family 10h mmconfig enablement
+ */
+
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/string.h>
+#include <linux/pci.h>
+#include <linux/dmi.h>
+#include <asm/pci-direct.h>
+#include <linux/sort.h>
+#include <asm/io.h>
+#include <asm/msr.h>
+#include <asm/acpi.h>
+
+#include "../pci/pci.h"
+
+struct pci_hostbridge_probe {
+	u32 bus;
+	u32 slot;
+	u32 vendor;
+	u32 device;
+};
+
+static u64 __cpuinitdata fam10h_pci_mmconf_base;
+static int __cpuinitdata fam10h_pci_mmconf_base_status;
+
+static struct pci_hostbridge_probe pci_probes[] __cpuinitdata = {
+	{ 0, 0x18, PCI_VENDOR_ID_AMD, 0x1200 },
+	{ 0xff, 0, PCI_VENDOR_ID_AMD, 0x1200 },
+};
+
+struct range {
+	u64 start;
+	u64 end;
+};
+
+static int __cpuinit cmp_range(const void *x1, const void *x2)
+{
+	const struct range *r1 = x1;
+	const struct range *r2 = x2;
+	int start1, start2;
+
+	start1 = r1->start >> 32;
+	start2 = r2->start >> 32;
+
+	return start1 - start2;
+}
+
+/*[47:0] */
+/* need to avoid (0xfd<<32) and (0xfe<<32), ht used space */
+#define FAM10H_PCI_MMCONF_BASE (0xfcULL<<32)
+#define BASE_VALID(b) ((b != (0xfdULL << 32)) && (b != (0xfeULL << 32)))
+static void __cpuinit get_fam10h_pci_mmconf_base(void)
+{
+	int i;
+	unsigned bus;
+	unsigned slot;
+	int found;
+
+	u64 val;
+	u32 address;
+	u64 tom2;
+	u64 base = FAM10H_PCI_MMCONF_BASE;
+
+	int hi_mmio_num;
+	struct range range[8];
+
+	/* only try to get setting from BSP */
+	/* -1 or 1 */
+	if (fam10h_pci_mmconf_base_status)
+		return;
+
+	if (!early_pci_allowed())
+		goto fail;
+
+	found = 0;
+	for (i = 0; i < ARRAY_SIZE(pci_probes); i++) {
+		u32 id;
+		u16 device;
+		u16 vendor;
+
+		bus = pci_probes[i].bus;
+		slot = pci_probes[i].slot;
+		id = read_pci_config(bus, slot, 0, PCI_VENDOR_ID);
+
+		vendor = id & 0xffff;
+		device = (id>>16) & 0xffff;
+		if (pci_probes[i].vendor == vendor &&
+		    pci_probes[i].device == device) {
+			found = 1;
+			break;
+		}
+	}
+
+	if (!found)
+		goto fail;
+
+	/* SYS_CFG */
+	address = MSR_K8_SYSCFG;
+	rdmsrl(address, val);
+
+	/* TOP_MEM2 is not enabled? */
+	if (!(val & (1<<21))) {
+		tom2 = 0;
+	} else {
+		/* TOP_MEM2 */
+		address = MSR_K8_TOP_MEM2;
+		rdmsrl(address, val);
+		tom2 = val & (0xffffULL<<32);
+	}
+
+	if (base <= tom2)
+		base = tom2 + (1ULL<<32);
+
+	/*
+	 * need to check if the range is in the high mmio range that is
+	 * above 4G
+	 */
+	hi_mmio_num = 0;
+	for (i = 0; i < 8; i++) {
+		u32 reg;
+		u64 start;
+		u64 end;
+		reg = read_pci_config(bus, slot, 1, 0x80 + (i << 3));
+		if (!(reg & 3))
+			continue;
+
+		start = (((u64)reg) << 8) & (0xffULL << 32); /* 39:16 on 31:8*/
+		reg = read_pci_config(bus, slot, 1, 0x84 + (i << 3));
+		end = (((u64)reg) << 8) & (0xffULL << 32); /* 39:16 on 31:8*/
+
+		if (!end)
+			continue;
+
+		range[hi_mmio_num].start = start;
+		range[hi_mmio_num].end = end;
+		hi_mmio_num++;
+	}
+
+	if (!hi_mmio_num)
+		goto out;
+
+	/* sort the range */
+	sort(range, hi_mmio_num, sizeof(struct range), cmp_range, NULL);
+
+	if (range[hi_mmio_num - 1].end < base)
+		goto out;
+	if (range[0].start > base)
+		goto out;
+
+	/* need to find one window */
+	base = range[0].start - (1ULL << 32);
+	if ((base > tom2) && BASE_VALID(base))
+		goto out;
+	base = range[hi_mmio_num - 1].end + (1ULL << 32);
+	if ((base > tom2) && BASE_VALID(base))
+		goto out;
+	/* need to find window between ranges */
+	if (hi_mmio_num > 1)
+	for (i = 0; i < hi_mmio_num - 1; i++) {
+		if (range[i + 1].start > (range[i].end + (1ULL << 32))) {
+			base = range[i].end + (1ULL << 32);
+			if ((base > tom2) && BASE_VALID(base))
+				goto out;
+		}
+	}
+
+fail:
+	fam10h_pci_mmconf_base_status = -1;
+	return;
+out:
+	fam10h_pci_mmconf_base = base;
+	fam10h_pci_mmconf_base_status = 1;
+}
+
+void __cpuinit fam10h_check_enable_mmcfg(void)
+{
+	u64 val;
+	u32 address;
+
+	if (!(pci_probe & PCI_CHECK_ENABLE_AMD_MMCONF))
+		return;
+
+	address = MSR_FAM10H_MMIO_CONF_BASE;
+	rdmsrl(address, val);
+
+	/* try to make sure that AP's setting is identical to BSP setting */
+	if (val & FAM10H_MMIO_CONF_ENABLE) {
+		unsigned busnbits;
+		busnbits = (val >> FAM10H_MMIO_CONF_BUSRANGE_SHIFT) &
+			FAM10H_MMIO_CONF_BUSRANGE_MASK;
+
+		/* only trust the one handle 256 buses, if acpi=off */
+		if (!acpi_pci_disabled || busnbits >= 8) {
+			u64 base;
+			base = val & (0xffffULL << 32);
+			if (fam10h_pci_mmconf_base_status <= 0) {
+				fam10h_pci_mmconf_base = base;
+				fam10h_pci_mmconf_base_status = 1;
+				return;
+			} else if (fam10h_pci_mmconf_base ==  base)
+				return;
+		}
+	}
+
+	/*
+	 * if it is not enabled, try to enable it and assume only one segment
+	 * with 256 buses
+	 */
+	get_fam10h_pci_mmconf_base();
+	if (fam10h_pci_mmconf_base_status <= 0)
+		return;
+
+	printk(KERN_INFO "Enable MMCONFIG on AMD Family 10h\n");
+	val &= ~((FAM10H_MMIO_CONF_BASE_MASK<<FAM10H_MMIO_CONF_BASE_SHIFT) |
+	     (FAM10H_MMIO_CONF_BUSRANGE_MASK<<FAM10H_MMIO_CONF_BUSRANGE_SHIFT));
+	val |= fam10h_pci_mmconf_base | (8 << FAM10H_MMIO_CONF_BUSRANGE_SHIFT) |
+	       FAM10H_MMIO_CONF_ENABLE;
+	wrmsrl(address, val);
+}
+
+static int __devinit set_check_enable_amd_mmconf(const struct dmi_system_id *d)
+{
+        pci_probe |= PCI_CHECK_ENABLE_AMD_MMCONF;
+        return 0;
+}
+
+static struct dmi_system_id __devinitdata mmconf_dmi_table[] = {
+        {
+                .callback = set_check_enable_amd_mmconf,
+                .ident = "Sun Microsystems Machine",
+                .matches = {
+                        DMI_MATCH(DMI_SYS_VENDOR, "Sun Microsystems"),
+                },
+        },
+	{}
+};
+
+void __init check_enable_amd_mmconf_dmi(void)
+{
+	dmi_check_system(mmconf_dmi_table);
+}
diff --git a/arch/x86/kernel/setup_64.c b/arch/x86/kernel/setup_64.c
index 60e64c8..2f5c488 100644
--- a/arch/x86/kernel/setup_64.c
+++ b/arch/x86/kernel/setup_64.c
@@ -29,6 +29,7 @@
 #include <linux/crash_dump.h>
 #include <linux/root_dev.h>
 #include <linux/pci.h>
+#include <asm/pci-direct.h>
 #include <linux/efi.h>
 #include <linux/acpi.h>
 #include <linux/kallsyms.h>
@@ -40,6 +41,7 @@
 #include <linux/dmi.h>
 #include <linux/dma-mapping.h>
 #include <linux/ctype.h>
+#include <linux/sort.h>
 #include <linux/uaccess.h>
 #include <linux/init_ohci1394_dma.h>
 
@@ -287,6 +289,18 @@ static void __init parse_setup_data(void)
 	}
 }
 
+#ifdef CONFIG_PCI_MMCONFIG
+extern void __cpuinit fam10h_check_enable_mmcfg(void);
+extern void __init check_enable_amd_mmconf_dmi(void);
+#else
+void __cpuinit fam10h_check_enable_mmcfg(void)
+{
+}
+void __init check_enable_amd_mmconf_dmi(void)
+{
+}
+#endif
+
 /*
  * setup_arch - architecture-specific boot-time initializations
  *
@@ -508,6 +522,9 @@ void __init setup_arch(char **cmdline_p)
 	conswitchp = &dummy_con;
 #endif
 #endif
+
+	/* do this before identify_cpu for boot cpu */
+	check_enable_amd_mmconf_dmi();
 }
 
 static int __cpuinit get_model_name(struct cpuinfo_x86 *c)
@@ -760,6 +777,9 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
 	/* MFENCE stops RDTSC speculation */
 	set_cpu_cap(c, X86_FEATURE_MFENCE_RDTSC);
 
+	if (c->x86 == 0x10)
+		fam10h_check_enable_mmcfg();
+
 	if (amd_apic_timer_broken())
 		disable_apic_timer = 1;
 
diff --git a/arch/x86/mm/k8topology_64.c b/arch/x86/mm/k8topology_64.c
index 86808e6..1f476e4 100644
--- a/arch/x86/mm/k8topology_64.c
+++ b/arch/x86/mm/k8topology_64.c
@@ -13,12 +13,15 @@
 #include <linux/nodemask.h>
 #include <asm/io.h>
 #include <linux/pci_ids.h>
+#include <linux/acpi.h>
 #include <asm/types.h>
 #include <asm/mmzone.h>
 #include <asm/proto.h>
 #include <asm/e820.h>
 #include <asm/pci-direct.h>
 #include <asm/numa.h>
+#include <asm/mpspec.h>
+#include <asm/apic.h>
 
 static __init int find_northbridge(void)
 {
@@ -44,6 +47,30 @@ static __init int find_northbridge(void)
 	return -1;
 }
 
+static __init void early_get_boot_cpu_id(void)
+{
+	/*
+	 * need to get boot_cpu_id so can use that to create apicid_to_node
+	 * in k8_scan_nodes()
+	 */
+	/*
+	 * Find possible boot-time SMP configuration:
+	 */
+	early_find_smp_config();
+#ifdef CONFIG_ACPI
+	/*
+	 * Read APIC information from ACPI tables.
+	 */
+	early_acpi_boot_init();
+#endif
+	/*
+	 * get boot-time SMP configuration:
+	 */
+	if (smp_found_config)
+		early_get_smp_config();
+	early_init_lapic_mapping();
+}
+
 int __init k8_scan_nodes(unsigned long start, unsigned long end)
 {
 	unsigned long prevbase;
@@ -56,6 +83,7 @@ int __init k8_scan_nodes(unsigned long start, unsigned long end)
 	unsigned cores;
 	unsigned bits;
 	int j;
+	unsigned apicid_base;
 
 	if (!early_pci_allowed())
 		return -1;
@@ -174,11 +202,19 @@ int __init k8_scan_nodes(unsigned long start, unsigned long end)
 	/* use the coreid bits from early_identify_cpu */
 	bits = boot_cpu_data.x86_coreid_bits;
 	cores = (1<<bits);
+	apicid_base = 0;
+	/* need to get boot_cpu_id early for system with apicid lifting */
+	early_get_boot_cpu_id();
+	if (boot_cpu_physical_apicid > 0) {
+		printk(KERN_INFO "BSP APIC ID: %02x\n",
+				 boot_cpu_physical_apicid);
+		apicid_base = boot_cpu_physical_apicid;
+	}
 
 	for (i = 0; i < 8; i++) {
 		if (nodes[i].start != nodes[i].end) {
 			nodeid = nodeids[i];
-			for (j = 0; j < cores; j++)
+			for (j = apicid_base; j < cores + apicid_base; j++)
 				apicid_to_node[(nodeid << bits) + j] = i;
 			setup_node_bootmem(i, nodes[i].start, nodes[i].end);
 		}
diff --git a/arch/x86/pci/Makefile_32 b/arch/x86/pci/Makefile_32
index cdd6828..e9c5caf 100644
--- a/arch/x86/pci/Makefile_32
+++ b/arch/x86/pci/Makefile_32
@@ -10,5 +10,6 @@ pci-y				+= legacy.o irq.o
 
 pci-$(CONFIG_X86_VISWS)		:= visws.o fixup.o
 pci-$(CONFIG_X86_NUMAQ)		:= numa.o irq.o
+pci-$(CONFIG_NUMA)		+= mp_bus_to_node.o
 
 obj-y				+= $(pci-y) common.o early.o
diff --git a/arch/x86/pci/Makefile_64 b/arch/x86/pci/Makefile_64
index 7d8c467..8fbd198 100644
--- a/arch/x86/pci/Makefile_64
+++ b/arch/x86/pci/Makefile_64
@@ -13,5 +13,5 @@ obj-y			+= legacy.o irq.o common.o early.o
 # mmconfig has a 64bit special
 obj-$(CONFIG_PCI_MMCONFIG) += mmconfig_64.o direct.o mmconfig-shared.o
 
-obj-$(CONFIG_NUMA)	+= k8-bus_64.o
+obj-y		+= k8-bus_64.o
 
diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
index 2664cb3..1a9c0c6 100644
--- a/arch/x86/pci/acpi.c
+++ b/arch/x86/pci/acpi.c
@@ -191,7 +191,10 @@ struct pci_bus * __devinit pci_acpi_scan_root(struct acpi_device *device, int do
 {
 	struct pci_bus *bus;
 	struct pci_sysdata *sd;
+	int node;
+#ifdef CONFIG_ACPI_NUMA
 	int pxm;
+#endif
 
 	dmi_check_system(acpi_pciprobe_dmi_table);
 
@@ -201,6 +204,17 @@ struct pci_bus * __devinit pci_acpi_scan_root(struct acpi_device *device, int do
 		return NULL;
 	}
 
+	node = -1;
+#ifdef CONFIG_ACPI_NUMA
+	pxm = acpi_get_pxm(device->handle);
+	if (pxm >= 0)
+		node = pxm_to_node(pxm);
+	if (node != -1)
+		set_mp_bus_to_node(busnum, node);
+	else
+		node = get_mp_bus_to_node(busnum);
+#endif
+
 	/* Allocate per-root-bus (not per bus) arch-specific data.
 	 * TODO: leak; this memory is never freed.
 	 * It's arguable whether it's worth the trouble to care.
@@ -212,13 +226,7 @@ struct pci_bus * __devinit pci_acpi_scan_root(struct acpi_device *device, int do
 	}
 
 	sd->domain = domain;
-	sd->node = -1;
-
-	pxm = acpi_get_pxm(device->handle);
-#ifdef CONFIG_ACPI_NUMA
-	if (pxm >= 0)
-		sd->node = pxm_to_node(pxm);
-#endif
+	sd->node = node;
 	/*
 	 * Maybe the desired pci bus has been already scanned. In such case
 	 * it is unnecessary to scan the pci bus with the given domain,busnum.
@@ -238,9 +246,9 @@ struct pci_bus * __devinit pci_acpi_scan_root(struct acpi_device *device, int do
 		kfree(sd);
 
 #ifdef CONFIG_ACPI_NUMA
-	if (bus != NULL) {
+	if (bus) {
 		if (pxm >= 0) {
-			printk("bus %d -> pxm %d -> node %d\n",
+			printk(KERN_DEBUG "bus %02x -> pxm %d -> node %d\n",
 				busnum, pxm, pxm_to_node(pxm));
 		}
 	}
@@ -248,7 +256,6 @@ struct pci_bus * __devinit pci_acpi_scan_root(struct acpi_device *device, int do
 
 	if (bus && (pci_probe & PCI_USE__CRS))
 		get_current_resources(device, busnum, domain, bus);
-	
 	return bus;
 }
 
diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index 75fcc29..2a4d751 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -342,9 +342,14 @@ struct pci_bus * __devinit pcibios_scan_root(int busnum)
 		return NULL;
 	}
 
+	sd->node = get_mp_bus_to_node(busnum);
+
 	printk(KERN_DEBUG "PCI: Probing PCI hardware (bus %02x)\n", busnum);
+	bus = pci_scan_bus_parented(NULL, busnum, &pci_root_ops, sd);
+	if (!bus)
+		kfree(sd);
 
-	return pci_scan_bus_parented(NULL, busnum, &pci_root_ops, sd);
+	return bus;
 }
 
 extern u8 pci_cache_line_size;
@@ -420,6 +425,10 @@ char * __devinit  pcibios_setup(char *str)
 		pci_probe &= ~PCI_PROBE_MMCONF;
 		return NULL;
 	}
+	else if (!strcmp(str, "check_enable_amd_mmconf")) {
+		pci_probe |= PCI_CHECK_ENABLE_AMD_MMCONF;
+		return NULL;
+	}
 #endif
 	else if (!strcmp(str, "noacpi")) {
 		acpi_noirq_set();
@@ -480,7 +489,7 @@ void pcibios_disable_device (struct pci_dev *dev)
 		pcibios_disable_irq(dev);
 }
 
-struct pci_bus *__devinit pci_scan_bus_with_sysdata(int busno)
+struct pci_bus *pci_scan_bus_on_node(int busno, struct pci_ops *ops, int node)
 {
 	struct pci_bus *bus = NULL;
 	struct pci_sysdata *sd;
@@ -495,10 +504,15 @@ struct pci_bus *__devinit pci_scan_bus_with_sysdata(int busno)
 		printk(KERN_ERR "PCI: OOM, skipping PCI bus %02x\n", busno);
 		return NULL;
 	}
-	sd->node = -1;
-	bus = pci_scan_bus(busno, &pci_root_ops, sd);
+	sd->node = node;
+	bus = pci_scan_bus(busno, ops, sd);
 	if (!bus)
 		kfree(sd);
 
 	return bus;
 }
+
+struct pci_bus *pci_scan_bus_with_sysdata(int busno)
+{
+	return pci_scan_bus_on_node(busno, &pci_root_ops, -1);
+}
diff --git a/arch/x86/pci/direct.c b/arch/x86/pci/direct.c
index 42f3e4c..21d1e0e 100644
--- a/arch/x86/pci/direct.c
+++ b/arch/x86/pci/direct.c
@@ -258,7 +258,8 @@ void __init pci_direct_init(int type)
 {
 	if (type == 0)
 		return;
-	printk(KERN_INFO "PCI: Using configuration type %d\n", type);
+	printk(KERN_INFO "PCI: Using configuration type %d for base access\n",
+		 type);
 	if (type == 1)
 		raw_pci_ops = &pci_direct_conf1;
 	else
@@ -275,8 +276,10 @@ int __init pci_direct_probe(void)
 	if (!region)
 		goto type2;
 
-	if (pci_check_type1())
+	if (pci_check_type1()) {
+		raw_pci_ops = &pci_direct_conf1;
 		return 1;
+	}
 	release_resource(region);
 
  type2:
@@ -290,7 +293,6 @@ int __init pci_direct_probe(void)
 		goto fail2;
 
 	if (pci_check_type2()) {
-		printk(KERN_INFO "PCI: Using configuration type 2\n");
 		raw_pci_ops = &pci_direct_conf2;
 		return 2;
 	}
diff --git a/arch/x86/pci/fixup.c b/arch/x86/pci/fixup.c
index a5ef5f5..b60b2ab 100644
--- a/arch/x86/pci/fixup.c
+++ b/arch/x86/pci/fixup.c
@@ -493,3 +493,20 @@ static void __devinit pci_siemens_interrupt_controller(struct pci_dev *dev)
 }
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_SIEMENS, 0x0015,
 			  pci_siemens_interrupt_controller);
+
+/*
+ * Regular PCI devices have 256 bytes, but AMD Family 10h Opteron ext config
+ * have 4096 bytes.  Even if the device is capable, that doesn't mean we can
+ * access it.  Maybe we don't have a way to generate extended config space
+ * accesses.   So check it
+ */
+static void fam10h_pci_cfg_space_size(struct pci_dev *dev)
+{
+	dev->cfg_size = pci_cfg_space_size_ext(dev, 0);
+}
+
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_AMD, 0x1200, fam10h_pci_cfg_space_size);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_AMD, 0x1201, fam10h_pci_cfg_space_size);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_AMD, 0x1202, fam10h_pci_cfg_space_size);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_AMD, 0x1203, fam10h_pci_cfg_space_size);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_AMD, 0x1204, fam10h_pci_cfg_space_size);
diff --git a/arch/x86/pci/init.c b/arch/x86/pci/init.c
index 3de9f9b..343c363 100644
--- a/arch/x86/pci/init.c
+++ b/arch/x86/pci/init.c
@@ -6,16 +6,13 @@
    in the right sequence from here. */
 static __init int pci_access_init(void)
 {
-	int type __maybe_unused = 0;
-
 #ifdef CONFIG_PCI_DIRECT
+	int type = 0;
+
 	type = pci_direct_probe();
 #endif
-#ifdef CONFIG_PCI_MMCONFIG
-	pci_mmcfg_init(type);
-#endif
-	if (raw_pci_ops)
-		return 0;
+	pci_mmcfg_early_init();
+
 #ifdef CONFIG_PCI_BIOS
 	pci_pcbios_init();
 #endif
@@ -28,7 +25,7 @@ static __init int pci_access_init(void)
 #ifdef CONFIG_PCI_DIRECT
 	pci_direct_init(type);
 #endif
-	if (!raw_pci_ops)
+	if (!raw_pci_ops && !raw_pci_ext_ops)
 		printk(KERN_ERR
 		"PCI: Fatal: No config space access function found\n");
 
diff --git a/arch/x86/pci/irq.c b/arch/x86/pci/irq.c
index 579745c..0908fca 100644
--- a/arch/x86/pci/irq.c
+++ b/arch/x86/pci/irq.c
@@ -136,9 +136,11 @@ static void __init pirq_peer_trick(void)
 		busmap[e->bus] = 1;
 	}
 	for(i = 1; i < 256; i++) {
+		int node;
 		if (!busmap[i] || pci_find_bus(0, i))
 			continue;
-		if (pci_scan_bus_with_sysdata(i))
+		node = get_mp_bus_to_node(i);
+		if (pci_scan_bus_on_node(i, &pci_root_ops, node))
 			printk(KERN_INFO "PCI: Discovered primary peer "
 			       "bus %02x [IRQ]\n", i);
 	}
diff --git a/arch/x86/pci/k8-bus_64.c b/arch/x86/pci/k8-bus_64.c
index 9cc813e..ab6d4b1 100644
--- a/arch/x86/pci/k8-bus_64.c
+++ b/arch/x86/pci/k8-bus_64.c
@@ -1,83 +1,536 @@
 #include <linux/init.h>
 #include <linux/pci.h>
+#include <asm/pci-direct.h>
 #include <asm/mpspec.h>
 #include <linux/cpumask.h>
+#include <linux/topology.h>
 
 /*
  * This discovers the pcibus <-> node mapping on AMD K8.
- *
- * RED-PEN need to call this again on PCI hotplug
- * RED-PEN empty cpus get reported wrong
+ * also get peer root bus resource for io,mmio
  */
 
-#define NODE_ID_REGISTER 0x60
-#define NODE_ID(dword) (dword & 0x07)
-#define LDT_BUS_NUMBER_REGISTER_0 0x94
-#define LDT_BUS_NUMBER_REGISTER_1 0xB4
-#define LDT_BUS_NUMBER_REGISTER_2 0xD4
-#define NR_LDT_BUS_NUMBER_REGISTERS 3
-#define SECONDARY_LDT_BUS_NUMBER(dword) ((dword >> 8) & 0xFF)
-#define SUBORDINATE_LDT_BUS_NUMBER(dword) ((dword >> 16) & 0xFF)
-#define PCI_DEVICE_ID_K8HTCONFIG 0x1100
+
+/*
+ * sub bus (transparent) will use entres from 3 to store extra from root,
+ * so need to make sure have enought slot there, increase PCI_BUS_NUM_RESOURCES?
+ */
+#define RES_NUM 16
+struct pci_root_info {
+	char name[12];
+	unsigned int res_num;
+	struct resource res[RES_NUM];
+	int bus_min;
+	int bus_max;
+	int node;
+	int link;
+};
+
+/* 4 at this time, it may become to 32 */
+#define PCI_ROOT_NR 4
+static int pci_root_num;
+static struct pci_root_info pci_root_info[PCI_ROOT_NR];
+
+#ifdef CONFIG_NUMA
+
+#define BUS_NR 256
+
+static int mp_bus_to_node[BUS_NR];
+
+void set_mp_bus_to_node(int busnum, int node)
+{
+	if (busnum >= 0 &&  busnum < BUS_NR)
+		mp_bus_to_node[busnum] = node;
+}
+
+int get_mp_bus_to_node(int busnum)
+{
+	int node = -1;
+
+	if (busnum < 0 || busnum > (BUS_NR - 1))
+		return node;
+
+	node = mp_bus_to_node[busnum];
+
+	/*
+	 * let numa_node_id to decide it later in dma_alloc_pages
+	 * if there is no ram on that node
+	 */
+	if (node != -1 && !node_online(node))
+		node = -1;
+
+	return node;
+}
+#endif
+
+void set_pci_bus_resources_arch_default(struct pci_bus *b)
+{
+	int i;
+	int j;
+	struct pci_root_info *info;
+
+	/* if only one root bus, don't need to anything */
+	if (pci_root_num < 2)
+		return;
+
+	for (i = 0; i < pci_root_num; i++) {
+		if (pci_root_info[i].bus_min == b->number)
+			break;
+	}
+
+	if (i == pci_root_num)
+		return;
+
+	info = &pci_root_info[i];
+	for (j = 0; j < info->res_num; j++) {
+		struct resource *res;
+		struct resource *root;
+
+		res = &info->res[j];
+		b->resource[j] = res;
+		if (res->flags & IORESOURCE_IO)
+			root = &ioport_resource;
+		else
+			root = &iomem_resource;
+		insert_resource(root, res);
+	}
+}
+
+#define RANGE_NUM 16
+
+struct res_range {
+	size_t start;
+	size_t end;
+};
+
+static void __init update_range(struct res_range *range, size_t start,
+				size_t end)
+{
+	int i;
+	int j;
+
+	for (j = 0; j < RANGE_NUM; j++) {
+		if (!range[j].end)
+			continue;
+
+		if (start <= range[j].start && end >= range[j].end) {
+			range[j].start = 0;
+			range[j].end = 0;
+			continue;
+		}
+
+		if (start <= range[j].start && end < range[j].end && range[j].start < end + 1) {
+			range[j].start = end + 1;
+			continue;
+		}
+
+
+		if (start > range[j].start && end >= range[j].end && range[j].end > start - 1) {
+			range[j].end = start - 1;
+			continue;
+		}
+
+		if (start > range[j].start && end < range[j].end) {
+			/* find the new spare */
+			for (i = 0; i < RANGE_NUM; i++) {
+				if (range[i].end == 0)
+					break;
+			}
+			if (i < RANGE_NUM) {
+				range[i].end = range[j].end;
+				range[i].start = end + 1;
+			} else {
+				printk(KERN_ERR "run of slot in ranges\n");
+			}
+			range[j].end = start - 1;
+			continue;
+		}
+	}
+}
+
+static void __init update_res(struct pci_root_info *info, size_t start,
+			      size_t end, unsigned long flags, int merge)
+{
+	int i;
+	struct resource *res;
+
+	if (!merge)
+		goto addit;
+
+	/* try to merge it with old one */
+	for (i = 0; i < info->res_num; i++) {
+		size_t final_start, final_end;
+		size_t common_start, common_end;
+
+		res = &info->res[i];
+		if (res->flags != flags)
+			continue;
+
+		common_start = max((size_t)res->start, start);
+		common_end = min((size_t)res->end, end);
+		if (common_start > common_end + 1)
+			continue;
+
+		final_start = min((size_t)res->start, start);
+		final_end = max((size_t)res->end, end);
+
+		res->start = final_start;
+		res->end = final_end;
+		return;
+	}
+
+addit:
+
+	/* need to add that */
+	if (info->res_num >= RES_NUM)
+		return;
+
+	res = &info->res[info->res_num];
+	res->name = info->name;
+	res->flags = flags;
+	res->start = start;
+	res->end = end;
+	res->child = NULL;
+	info->res_num++;
+}
+
+struct pci_hostbridge_probe {
+	u32 bus;
+	u32 slot;
+	u32 vendor;
+	u32 device;
+};
+
+static struct pci_hostbridge_probe pci_probes[] __initdata = {
+	{ 0, 0x18, PCI_VENDOR_ID_AMD, 0x1100 },
+	{ 0, 0x18, PCI_VENDOR_ID_AMD, 0x1200 },
+	{ 0xff, 0, PCI_VENDOR_ID_AMD, 0x1200 },
+	{ 0, 0x18, PCI_VENDOR_ID_AMD, 0x1300 },
+};
+
+static u64 __initdata fam10h_mmconf_start;
+static u64 __initdata fam10h_mmconf_end;
+static void __init get_pci_mmcfg_amd_fam10h_range(void)
+{
+	u32 address;
+	u64 base, msr;
+	unsigned segn_busn_bits;
+
+	/* assume all cpus from fam10h have mmconf */
+        if (boot_cpu_data.x86 < 0x10)
+		return;
+
+	address = MSR_FAM10H_MMIO_CONF_BASE;
+	rdmsrl(address, msr);
+
+	/* mmconfig is not enable */
+	if (!(msr & FAM10H_MMIO_CONF_ENABLE))
+		return;
+
+	base = msr & (FAM10H_MMIO_CONF_BASE_MASK<<FAM10H_MMIO_CONF_BASE_SHIFT);
+
+	segn_busn_bits = (msr >> FAM10H_MMIO_CONF_BUSRANGE_SHIFT) &
+			 FAM10H_MMIO_CONF_BUSRANGE_MASK;
+
+	fam10h_mmconf_start = base;
+	fam10h_mmconf_end = base + (1ULL<<(segn_busn_bits + 20)) - 1;
+}
 
 /**
- * fill_mp_bus_to_cpumask()
+ * early_fill_mp_bus_to_node()
+ * called before pcibios_scan_root and pci_scan_bus
  * fills the mp_bus_to_cpumask array based according to the LDT Bus Number
  * Registers found in the K8 northbridge
  */
-__init static int
-fill_mp_bus_to_cpumask(void)
+static int __init early_fill_mp_bus_info(void)
 {
-	struct pci_dev *nb_dev = NULL;
-	int i, j;
-	u32 ldtbus, nid;
-	static int lbnr[3] = {
-		LDT_BUS_NUMBER_REGISTER_0,
-		LDT_BUS_NUMBER_REGISTER_1,
-		LDT_BUS_NUMBER_REGISTER_2
-	};
-
-	while ((nb_dev = pci_get_device(PCI_VENDOR_ID_AMD,
-			PCI_DEVICE_ID_K8HTCONFIG, nb_dev))) {
-		pci_read_config_dword(nb_dev, NODE_ID_REGISTER, &nid);
-
-		for (i = 0; i < NR_LDT_BUS_NUMBER_REGISTERS; i++) {
-			pci_read_config_dword(nb_dev, lbnr[i], &ldtbus);
-			/*
-			 * if there are no busses hanging off of the current
-			 * ldt link then both the secondary and subordinate
-			 * bus number fields are set to 0.
-			 * 
-			 * RED-PEN
-			 * This is slightly broken because it assumes
- 			 * HT node IDs == Linux node ids, which is not always
-			 * true. However it is probably mostly true.
-			 */
-			if (!(SECONDARY_LDT_BUS_NUMBER(ldtbus) == 0
-				&& SUBORDINATE_LDT_BUS_NUMBER(ldtbus) == 0)) {
-				for (j = SECONDARY_LDT_BUS_NUMBER(ldtbus);
-				     j <= SUBORDINATE_LDT_BUS_NUMBER(ldtbus);
-				     j++) { 
-					struct pci_bus *bus;
-					struct pci_sysdata *sd;
-
-					long node = NODE_ID(nid);
-					/* Algorithm a bit dumb, but
- 					   it shouldn't matter here */
-					bus = pci_find_bus(0, j);
-					if (!bus)
-						continue;
-					if (!node_online(node))
-						node = 0;
-
-					sd = bus->sysdata;
-					sd->node = node;
-				}		
+	int i;
+	int j;
+	unsigned bus;
+	unsigned slot;
+	int found;
+	int node;
+	int link;
+	int def_node;
+	int def_link;
+	struct pci_root_info *info;
+	u32 reg;
+	struct resource *res;
+	size_t start;
+	size_t end;
+	struct res_range range[RANGE_NUM];
+	u64 val;
+	u32 address;
+
+#ifdef CONFIG_NUMA
+	for (i = 0; i < BUS_NR; i++)
+		mp_bus_to_node[i] = -1;
+#endif
+
+	if (!early_pci_allowed())
+		return -1;
+
+	found = 0;
+	for (i = 0; i < ARRAY_SIZE(pci_probes); i++) {
+		u32 id;
+		u16 device;
+		u16 vendor;
+
+		bus = pci_probes[i].bus;
+		slot = pci_probes[i].slot;
+		id = read_pci_config(bus, slot, 0, PCI_VENDOR_ID);
+
+		vendor = id & 0xffff;
+		device = (id>>16) & 0xffff;
+		if (pci_probes[i].vendor == vendor &&
+		    pci_probes[i].device == device) {
+			found = 1;
+			break;
+		}
+	}
+
+	if (!found)
+		return 0;
+
+	pci_root_num = 0;
+	for (i = 0; i < 4; i++) {
+		int min_bus;
+		int max_bus;
+		reg = read_pci_config(bus, slot, 1, 0xe0 + (i << 2));
+
+		/* Check if that register is enabled for bus range */
+		if ((reg & 7) != 3)
+			continue;
+
+		min_bus = (reg >> 16) & 0xff;
+		max_bus = (reg >> 24) & 0xff;
+		node = (reg >> 4) & 0x07;
+#ifdef CONFIG_NUMA
+		for (j = min_bus; j <= max_bus; j++)
+			mp_bus_to_node[j] = (unsigned char) node;
+#endif
+		link = (reg >> 8) & 0x03;
+
+		info = &pci_root_info[pci_root_num];
+		info->bus_min = min_bus;
+		info->bus_max = max_bus;
+		info->node = node;
+		info->link = link;
+		sprintf(info->name, "PCI Bus #%02x", min_bus);
+		pci_root_num++;
+	}
+
+	/* get the default node and link for left over res */
+	reg = read_pci_config(bus, slot, 0, 0x60);
+	def_node = (reg >> 8) & 0x07;
+	reg = read_pci_config(bus, slot, 0, 0x64);
+	def_link = (reg >> 8) & 0x03;
+
+	memset(range, 0, sizeof(range));
+	range[0].end = 0xffff;
+	/* io port resource */
+	for (i = 0; i < 4; i++) {
+		reg = read_pci_config(bus, slot, 1, 0xc0 + (i << 3));
+		if (!(reg & 3))
+			continue;
+
+		start = reg & 0xfff000;
+		reg = read_pci_config(bus, slot, 1, 0xc4 + (i << 3));
+		node = reg & 0x07;
+		link = (reg >> 4) & 0x03;
+		end = (reg & 0xfff000) | 0xfff;
+
+		/* find the position */
+		for (j = 0; j < pci_root_num; j++) {
+			info = &pci_root_info[j];
+			if (info->node == node && info->link == link)
+				break;
+		}
+		if (j == pci_root_num)
+			continue; /* not found */
+
+		info = &pci_root_info[j];
+		printk(KERN_DEBUG "node %d link %d: io port [%llx, %llx]\n",
+		       node, link, (u64)start, (u64)end);
+
+		/* kernel only handle 16 bit only */
+		if (end > 0xffff)
+			end = 0xffff;
+		update_res(info, start, end, IORESOURCE_IO, 1);
+		update_range(range, start, end);
+	}
+	/* add left over io port range to def node/link, [0, 0xffff] */
+	/* find the position */
+	for (j = 0; j < pci_root_num; j++) {
+		info = &pci_root_info[j];
+		if (info->node == def_node && info->link == def_link)
+			break;
+	}
+	if (j < pci_root_num) {
+		info = &pci_root_info[j];
+		for (i = 0; i < RANGE_NUM; i++) {
+			if (!range[i].end)
+				continue;
+
+			update_res(info, range[i].start, range[i].end,
+				   IORESOURCE_IO, 1);
+		}
+	}
+
+	memset(range, 0, sizeof(range));
+	/* 0xfd00000000-0xffffffffff for HT */
+	range[0].end = (0xfdULL<<32) - 1;
+
+	/* need to take out [0, TOM) for RAM*/
+	address = MSR_K8_TOP_MEM1;
+	rdmsrl(address, val);
+	end = (val & 0xffffff8000000ULL);
+	printk(KERN_INFO "TOM: %016lx aka %ldM\n", end, end>>20);
+	if (end < (1ULL<<32))
+		update_range(range, 0, end - 1);
+
+	/* get mmconfig */
+	get_pci_mmcfg_amd_fam10h_range();
+	/* need to take out mmconf range */
+	if (fam10h_mmconf_end) {
+		printk(KERN_DEBUG "Fam 10h mmconf [%llx, %llx]\n", fam10h_mmconf_start, fam10h_mmconf_end);
+		update_range(range, fam10h_mmconf_start, fam10h_mmconf_end);
+	}
+
+	/* mmio resource */
+	for (i = 0; i < 8; i++) {
+		reg = read_pci_config(bus, slot, 1, 0x80 + (i << 3));
+		if (!(reg & 3))
+			continue;
+
+		start = reg & 0xffffff00; /* 39:16 on 31:8*/
+		start <<= 8;
+		reg = read_pci_config(bus, slot, 1, 0x84 + (i << 3));
+		node = reg & 0x07;
+		link = (reg >> 4) & 0x03;
+		end = (reg & 0xffffff00);
+		end <<= 8;
+		end |= 0xffff;
+
+		/* find the position */
+		for (j = 0; j < pci_root_num; j++) {
+			info = &pci_root_info[j];
+			if (info->node == node && info->link == link)
+				break;
+		}
+		if (j == pci_root_num)
+			continue; /* not found */
+
+		info = &pci_root_info[j];
+
+		printk(KERN_DEBUG "node %d link %d: mmio [%llx, %llx]",
+		       node, link, (u64)start, (u64)end);
+		/*
+		 * some sick allocation would have range overlap with fam10h
+		 * mmconf range, so need to update start and end.
+		 */
+		if (fam10h_mmconf_end) {
+			int changed = 0;
+			u64 endx = 0;
+			if (start >= fam10h_mmconf_start &&
+			    start <= fam10h_mmconf_end) {
+				start = fam10h_mmconf_end + 1;
+				changed = 1;
+			}
+
+			if (end >= fam10h_mmconf_start &&
+			    end <= fam10h_mmconf_end) {
+				end = fam10h_mmconf_start - 1;
+				changed = 1;
+			}
+
+			if (start < fam10h_mmconf_start &&
+			    end > fam10h_mmconf_end) {
+				/* we got a hole */
+				endx = fam10h_mmconf_start - 1;
+				update_res(info, start, endx, IORESOURCE_MEM, 0);
+				update_range(range, start, endx);
+				printk(KERN_CONT " ==> [%llx, %llx]", (u64)start, endx);
+				start = fam10h_mmconf_end + 1;
+				changed = 1;
+			}
+			if (changed) {
+				if (start <= end) {
+					printk(KERN_CONT " %s [%llx, %llx]", endx?"and":"==>", (u64)start, (u64)end);
+				} else {
+					printk(KERN_CONT "%s\n", endx?"":" ==> none");
+					continue;
+				}
 			}
 		}
+
+		update_res(info, start, end, IORESOURCE_MEM, 1);
+		update_range(range, start, end);
+		printk(KERN_CONT "\n");
+	}
+
+	/* need to take out [4G, TOM2) for RAM*/
+	/* SYS_CFG */
+	address = MSR_K8_SYSCFG;
+	rdmsrl(address, val);
+	/* TOP_MEM2 is enabled? */
+	if (val & (1<<21)) {
+		/* TOP_MEM2 */
+		address = MSR_K8_TOP_MEM2;
+		rdmsrl(address, val);
+		end = (val & 0xffffff8000000ULL);
+		printk(KERN_INFO "TOM2: %016lx aka %ldM\n", end, end>>20);
+		update_range(range, 1ULL<<32, end - 1);
+	}
+
+	/*
+	 * add left over mmio range to def node/link ?
+	 * that is tricky, just record range in from start_min to 4G
+	 */
+	for (j = 0; j < pci_root_num; j++) {
+		info = &pci_root_info[j];
+		if (info->node == def_node && info->link == def_link)
+			break;
+	}
+	if (j < pci_root_num) {
+		info = &pci_root_info[j];
+
+		for (i = 0; i < RANGE_NUM; i++) {
+			if (!range[i].end)
+				continue;
+
+			update_res(info, range[i].start, range[i].end,
+				   IORESOURCE_MEM, 1);
+		}
+	}
+
+#ifdef CONFIG_NUMA
+	for (i = 0; i < BUS_NR; i++) {
+		node = mp_bus_to_node[i];
+		if (node >= 0)
+			printk(KERN_DEBUG "bus: %02x to node: %02x\n", i, node);
+	}
+#endif
+
+	for (i = 0; i < pci_root_num; i++) {
+		int res_num;
+		int busnum;
+
+		info = &pci_root_info[i];
+		res_num = info->res_num;
+		busnum = info->bus_min;
+		printk(KERN_DEBUG "bus: [%02x,%02x] on node %x link %x\n",
+		       info->bus_min, info->bus_max, info->node, info->link);
+		for (j = 0; j < res_num; j++) {
+			res = &info->res[j];
+			printk(KERN_DEBUG "bus: %02x index %x %s: [%llx, %llx]\n",
+			       busnum, j,
+			       (res->flags & IORESOURCE_IO)?"io port":"mmio",
+			       res->start, res->end);
+		}
 	}
 
 	return 0;
 }
 
-fs_initcall(fill_mp_bus_to_cpumask);
+postcore_initcall(early_fill_mp_bus_info);
diff --git a/arch/x86/pci/legacy.c b/arch/x86/pci/legacy.c
index e041ced..a67921c 100644
--- a/arch/x86/pci/legacy.c
+++ b/arch/x86/pci/legacy.c
@@ -12,6 +12,7 @@
 static void __devinit pcibios_fixup_peer_bridges(void)
 {
 	int n, devfn;
+	long node;
 
 	if (pcibios_last_bus <= 0 || pcibios_last_bus >= 0xff)
 		return;
@@ -21,12 +22,13 @@ static void __devinit pcibios_fixup_peer_bridges(void)
 		u32 l;
 		if (pci_find_bus(0, n))
 			continue;
+		node = get_mp_bus_to_node(n);
 		for (devfn = 0; devfn < 256; devfn += 8) {
 			if (!raw_pci_read(0, n, devfn, PCI_VENDOR_ID, 2, &l) &&
 			    l != 0x0000 && l != 0xffff) {
 				DBG("Found device at %02x:%02x [%04x]\n", n, devfn, l);
 				printk(KERN_INFO "PCI: Discovered peer bus %02x\n", n);
-				pci_scan_bus_with_sysdata(n);
+				pci_scan_bus_on_node(n, &pci_root_ops, node);
 				break;
 			}
 		}
diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c
index 8d54df4..0cfebec 100644
--- a/arch/x86/pci/mmconfig-shared.c
+++ b/arch/x86/pci/mmconfig-shared.c
@@ -28,7 +28,7 @@ static int __initdata pci_mmcfg_resources_inserted;
 static const char __init *pci_mmcfg_e7520(void)
 {
 	u32 win;
-	pci_direct_conf1.read(0, 0, PCI_DEVFN(0,0), 0xce, 2, &win);
+	raw_pci_ops->read(0, 0, PCI_DEVFN(0, 0), 0xce, 2, &win);
 
 	win = win & 0xf000;
 	if(win == 0x0000 || win == 0xf000)
@@ -53,7 +53,7 @@ static const char __init *pci_mmcfg_intel_945(void)
 
 	pci_mmcfg_config_num = 1;
 
-	pci_direct_conf1.read(0, 0, PCI_DEVFN(0,0), 0x48, 4, &pciexbar);
+	raw_pci_ops->read(0, 0, PCI_DEVFN(0, 0), 0x48, 4, &pciexbar);
 
 	/* Enable bit */
 	if (!(pciexbar & 1))
@@ -100,33 +100,102 @@ static const char __init *pci_mmcfg_intel_945(void)
 	return "Intel Corporation 945G/GZ/P/PL Express Memory Controller Hub";
 }
 
+static const char __init *pci_mmcfg_amd_fam10h(void)
+{
+	u32 low, high, address;
+	u64 base, msr;
+	int i;
+	unsigned segnbits = 0, busnbits;
+
+	if (!(pci_probe & PCI_CHECK_ENABLE_AMD_MMCONF))
+		return NULL;
+
+	address = MSR_FAM10H_MMIO_CONF_BASE;
+	if (rdmsr_safe(address, &low, &high))
+		return NULL;
+
+	msr = high;
+	msr <<= 32;
+	msr |= low;
+
+	/* mmconfig is not enable */
+	if (!(msr & FAM10H_MMIO_CONF_ENABLE))
+		return NULL;
+
+	base = msr & (FAM10H_MMIO_CONF_BASE_MASK<<FAM10H_MMIO_CONF_BASE_SHIFT);
+
+	busnbits = (msr >> FAM10H_MMIO_CONF_BUSRANGE_SHIFT) &
+			 FAM10H_MMIO_CONF_BUSRANGE_MASK;
+
+	/*
+	 * only handle bus 0 ?
+	 * need to skip it
+	 */
+	if (!busnbits)
+		return NULL;
+
+	if (busnbits > 8) {
+		segnbits = busnbits - 8;
+		busnbits = 8;
+	}
+
+	pci_mmcfg_config_num = (1 << segnbits);
+	pci_mmcfg_config = kzalloc(sizeof(pci_mmcfg_config[0]) *
+				   pci_mmcfg_config_num, GFP_KERNEL);
+	if (!pci_mmcfg_config)
+		return NULL;
+
+	for (i = 0; i < (1 << segnbits); i++) {
+		pci_mmcfg_config[i].address = base + (1<<28) * i;
+		pci_mmcfg_config[i].pci_segment = i;
+		pci_mmcfg_config[i].start_bus_number = 0;
+		pci_mmcfg_config[i].end_bus_number = (1 << busnbits) - 1;
+	}
+
+	return "AMD Family 10h NB";
+}
+
 struct pci_mmcfg_hostbridge_probe {
+	u32 bus;
+	u32 devfn;
 	u32 vendor;
 	u32 device;
 	const char *(*probe)(void);
 };
 
 static struct pci_mmcfg_hostbridge_probe pci_mmcfg_probes[] __initdata = {
-	{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_E7520_MCH, pci_mmcfg_e7520 },
-	{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82945G_HB, pci_mmcfg_intel_945 },
+	{ 0, PCI_DEVFN(0, 0), PCI_VENDOR_ID_INTEL,
+	  PCI_DEVICE_ID_INTEL_E7520_MCH, pci_mmcfg_e7520 },
+	{ 0, PCI_DEVFN(0, 0), PCI_VENDOR_ID_INTEL,
+	  PCI_DEVICE_ID_INTEL_82945G_HB, pci_mmcfg_intel_945 },
+	{ 0, PCI_DEVFN(0x18, 0), PCI_VENDOR_ID_AMD,
+	  0x1200, pci_mmcfg_amd_fam10h },
+	{ 0xff, PCI_DEVFN(0, 0), PCI_VENDOR_ID_AMD,
+	  0x1200, pci_mmcfg_amd_fam10h },
 };
 
 static int __init pci_mmcfg_check_hostbridge(void)
 {
 	u32 l;
+	u32 bus, devfn;
 	u16 vendor, device;
 	int i;
 	const char *name;
 
-	pci_direct_conf1.read(0, 0, PCI_DEVFN(0,0), 0, 4, &l);
-	vendor = l & 0xffff;
-	device = (l >> 16) & 0xffff;
+	if (!raw_pci_ops)
+		return 0;
 
 	pci_mmcfg_config_num = 0;
 	pci_mmcfg_config = NULL;
 	name = NULL;
 
 	for (i = 0; !name && i < ARRAY_SIZE(pci_mmcfg_probes); i++) {
+		bus =  pci_mmcfg_probes[i].bus;
+		devfn = pci_mmcfg_probes[i].devfn;
+		raw_pci_ops->read(0, bus, devfn, 0, 4, &l);
+		vendor = l & 0xffff;
+		device = (l >> 16) & 0xffff;
+
 		if (pci_mmcfg_probes[i].vendor == vendor &&
 		    pci_mmcfg_probes[i].device == device)
 			name = pci_mmcfg_probes[i].probe();
@@ -173,9 +242,78 @@ static void __init pci_mmcfg_insert_resources(unsigned long resource_flags)
 	pci_mmcfg_resources_inserted = 1;
 }
 
-static void __init pci_mmcfg_reject_broken(int type)
+static acpi_status __init check_mcfg_resource(struct acpi_resource *res,
+					      void *data)
+{
+	struct resource *mcfg_res = data;
+	struct acpi_resource_address64 address;
+	acpi_status status;
+
+	if (res->type == ACPI_RESOURCE_TYPE_FIXED_MEMORY32) {
+		struct acpi_resource_fixed_memory32 *fixmem32 =
+			&res->data.fixed_memory32;
+		if (!fixmem32)
+			return AE_OK;
+		if ((mcfg_res->start >= fixmem32->address) &&
+		    (mcfg_res->end < (fixmem32->address +
+				      fixmem32->address_length))) {
+			mcfg_res->flags = 1;
+			return AE_CTRL_TERMINATE;
+		}
+	}
+	if ((res->type != ACPI_RESOURCE_TYPE_ADDRESS32) &&
+	    (res->type != ACPI_RESOURCE_TYPE_ADDRESS64))
+		return AE_OK;
+
+	status = acpi_resource_to_address64(res, &address);
+	if (ACPI_FAILURE(status) ||
+	   (address.address_length <= 0) ||
+	   (address.resource_type != ACPI_MEMORY_RANGE))
+		return AE_OK;
+
+	if ((mcfg_res->start >= address.minimum) &&
+	    (mcfg_res->end < (address.minimum + address.address_length))) {
+		mcfg_res->flags = 1;
+		return AE_CTRL_TERMINATE;
+	}
+	return AE_OK;
+}
+
+static acpi_status __init find_mboard_resource(acpi_handle handle, u32 lvl,
+		void *context, void **rv)
+{
+	struct resource *mcfg_res = context;
+
+	acpi_walk_resources(handle, METHOD_NAME__CRS,
+			    check_mcfg_resource, context);
+
+	if (mcfg_res->flags)
+		return AE_CTRL_TERMINATE;
+
+	return AE_OK;
+}
+
+static int __init is_acpi_reserved(unsigned long start, unsigned long end)
+{
+	struct resource mcfg_res;
+
+	mcfg_res.start = start;
+	mcfg_res.end = end;
+	mcfg_res.flags = 0;
+
+	acpi_get_devices("PNP0C01", find_mboard_resource, &mcfg_res, NULL);
+
+	if (!mcfg_res.flags)
+		acpi_get_devices("PNP0C02", find_mboard_resource, &mcfg_res,
+				 NULL);
+
+	return mcfg_res.flags;
+}
+
+static void __init pci_mmcfg_reject_broken(int early)
 {
 	typeof(pci_mmcfg_config[0]) *cfg;
+	int i;
 
 	if ((pci_mmcfg_config_num == 0) ||
 	    (pci_mmcfg_config == NULL) ||
@@ -184,51 +322,80 @@ static void __init pci_mmcfg_reject_broken(int type)
 
 	cfg = &pci_mmcfg_config[0];
 
-	/*
-	 * Handle more broken MCFG tables on Asus etc.
-	 * They only contain a single entry for bus 0-0.
-	 */
-	if (pci_mmcfg_config_num == 1 &&
-	    cfg->pci_segment == 0 &&
-	    (cfg->start_bus_number | cfg->end_bus_number) == 0) {
-		printk(KERN_ERR "PCI: start and end of bus number is 0. "
-		       "Rejected as broken MCFG.\n");
-		goto reject;
+	for (i = 0; i < pci_mmcfg_config_num; i++) {
+		int valid = 0;
+		u32 size = (cfg->end_bus_number + 1) << 20;
+		cfg = &pci_mmcfg_config[i];
+		printk(KERN_NOTICE "PCI: MCFG configuration %d: base %lx "
+		       "segment %hu buses %u - %u\n",
+		       i, (unsigned long)cfg->address, cfg->pci_segment,
+		       (unsigned int)cfg->start_bus_number,
+		       (unsigned int)cfg->end_bus_number);
+
+		if (!early &&
+		    is_acpi_reserved(cfg->address, cfg->address + size - 1)) {
+			printk(KERN_NOTICE "PCI: MCFG area at %Lx reserved "
+			       "in ACPI motherboard resources\n",
+			       cfg->address);
+			valid = 1;
+		}
+
+		if (valid)
+			continue;
+
+		if (!early)
+			printk(KERN_ERR "PCI: BIOS Bug: MCFG area at %Lx is not"
+			       " reserved in ACPI motherboard resources\n",
+			       cfg->address);
+		/* Don't try to do this check unless configuration
+		   type 1 is available. how about type 2 ?*/
+		if (raw_pci_ops && e820_all_mapped(cfg->address,
+						  cfg->address + size - 1,
+						  E820_RESERVED)) {
+			printk(KERN_NOTICE
+			       "PCI: MCFG area at %Lx reserved in E820\n",
+			       cfg->address);
+			valid = 1;
+		}
+
+		if (!valid)
+			goto reject;
 	}
 
-	/*
-	 * Only do this check when type 1 works. If it doesn't work
-	 * assume we run on a Mac and always use MCFG
-	 */
-	if (type == 1 && !e820_all_mapped(cfg->address,
-					  cfg->address + MMCONFIG_APER_MIN,
-					  E820_RESERVED)) {
-		printk(KERN_ERR "PCI: BIOS Bug: MCFG area at %Lx is not"
-		       " E820-reserved\n", cfg->address);
-		goto reject;
-	}
 	return;
 
 reject:
 	printk(KERN_ERR "PCI: Not using MMCONFIG.\n");
+	pci_mmcfg_arch_free();
 	kfree(pci_mmcfg_config);
 	pci_mmcfg_config = NULL;
 	pci_mmcfg_config_num = 0;
 }
 
-void __init pci_mmcfg_init(int type)
-{
-	int known_bridge = 0;
+static int __initdata known_bridge;
 
+void __init __pci_mmcfg_init(int early)
+{
+	/* MMCONFIG disabled */
 	if ((pci_probe & PCI_PROBE_MMCONF) == 0)
 		return;
 
-	if (type == 1 && pci_mmcfg_check_hostbridge())
-		known_bridge = 1;
+	/* MMCONFIG already enabled */
+	if (!early && !(pci_probe & PCI_PROBE_MASK & ~PCI_PROBE_MMCONF))
+		return;
+
+	/* for late to exit */
+	if (known_bridge)
+		return;
+
+	if (early) {
+		if (pci_mmcfg_check_hostbridge())
+			known_bridge = 1;
+	}
 
 	if (!known_bridge) {
 		acpi_table_parse(ACPI_SIG_MCFG, acpi_parse_mcfg);
-		pci_mmcfg_reject_broken(type);
+		pci_mmcfg_reject_broken(early);
 	}
 
 	if ((pci_mmcfg_config_num == 0) ||
@@ -249,6 +416,16 @@ void __init pci_mmcfg_init(int type)
 	}
 }
 
+void __init pci_mmcfg_early_init(void)
+{
+	__pci_mmcfg_init(1);
+}
+
+void __init pci_mmcfg_late_init(void)
+{
+	__pci_mmcfg_init(0);
+}
+
 static int __init pci_mmcfg_late_insert_resources(void)
 {
 	/*
diff --git a/arch/x86/pci/mmconfig_32.c b/arch/x86/pci/mmconfig_32.c
index 081816a..f3c761d 100644
--- a/arch/x86/pci/mmconfig_32.c
+++ b/arch/x86/pci/mmconfig_32.c
@@ -136,3 +136,7 @@ int __init pci_mmcfg_arch_init(void)
 	raw_pci_ext_ops = &pci_mmcfg;
 	return 1;
 }
+
+void __init pci_mmcfg_arch_free(void)
+{
+}
diff --git a/arch/x86/pci/mmconfig_64.c b/arch/x86/pci/mmconfig_64.c
index 9207fd4..a199416 100644
--- a/arch/x86/pci/mmconfig_64.c
+++ b/arch/x86/pci/mmconfig_64.c
@@ -127,7 +127,7 @@ static void __iomem * __init mcfg_ioremap(struct acpi_mcfg_allocation *cfg)
 int __init pci_mmcfg_arch_init(void)
 {
 	int i;
-	pci_mmcfg_virt = kmalloc(sizeof(*pci_mmcfg_virt) *
+	pci_mmcfg_virt = kzalloc(sizeof(*pci_mmcfg_virt) *
 				 pci_mmcfg_config_num, GFP_KERNEL);
 	if (pci_mmcfg_virt == NULL) {
 		printk(KERN_ERR "PCI: Can not allocate memory for mmconfig structures\n");
@@ -141,9 +141,29 @@ int __init pci_mmcfg_arch_init(void)
 			printk(KERN_ERR "PCI: Cannot map mmconfig aperture for "
 					"segment %d\n",
 				pci_mmcfg_config[i].pci_segment);
+			pci_mmcfg_arch_free();
 			return 0;
 		}
 	}
 	raw_pci_ext_ops = &pci_mmcfg;
 	return 1;
 }
+
+void __init pci_mmcfg_arch_free(void)
+{
+	int i;
+
+	if (pci_mmcfg_virt == NULL)
+		return;
+
+	for (i = 0; i < pci_mmcfg_config_num; ++i) {
+		if (pci_mmcfg_virt[i].virt) {
+			iounmap(pci_mmcfg_virt[i].virt);
+			pci_mmcfg_virt[i].virt = NULL;
+			pci_mmcfg_virt[i].cfg = NULL;
+		}
+	}
+
+	kfree(pci_mmcfg_virt);
+	pci_mmcfg_virt = NULL;
+}
diff --git a/arch/x86/pci/mp_bus_to_node.c b/arch/x86/pci/mp_bus_to_node.c
new file mode 100644
index 0000000..0229439
--- /dev/null
+++ b/arch/x86/pci/mp_bus_to_node.c
@@ -0,0 +1,23 @@
+#include <linux/pci.h>
+#include <linux/init.h>
+#include <linux/topology.h>
+
+#define BUS_NR 256
+
+static unsigned char mp_bus_to_node[BUS_NR];
+
+void set_mp_bus_to_node(int busnum, int node)
+{
+	if (busnum >= 0 &&  busnum < BUS_NR)
+	mp_bus_to_node[busnum] = (unsigned char) node;
+}
+
+int get_mp_bus_to_node(int busnum)
+{
+	int node;
+
+	if (busnum < 0 || busnum > (BUS_NR - 1))
+		return 0;
+	node = mp_bus_to_node[busnum];
+	return node;
+}
diff --git a/arch/x86/pci/pci.h b/arch/x86/pci/pci.h
index c4bddae..8ef86b5 100644
--- a/arch/x86/pci/pci.h
+++ b/arch/x86/pci/pci.h
@@ -26,6 +26,7 @@
 #define PCI_ASSIGN_ALL_BUSSES	0x4000
 #define PCI_CAN_SKIP_ISA_ALIGN	0x8000
 #define PCI_USE__CRS		0x10000
+#define PCI_CHECK_ENABLE_AMD_MMCONF	0x20000
 
 extern unsigned int pci_probe;
 extern unsigned long pirq_table_addr;
@@ -97,11 +98,11 @@ extern struct pci_raw_ops pci_direct_conf1;
 extern int pci_direct_probe(void);
 extern void pci_direct_init(int type);
 extern void pci_pcbios_init(void);
-extern void pci_mmcfg_init(int type);
 
 /* pci-mmconfig.c */
 
 extern int __init pci_mmcfg_arch_init(void);
+extern void __init pci_mmcfg_arch_free(void);
 
 /*
  * AMD Fam10h CPUs are buggy, and cannot access MMIO config space
diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index 2d1955c..a6dbcf4 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -35,6 +35,7 @@
 #ifdef CONFIG_X86
 #include <asm/mpspec.h>
 #endif
+#include <linux/pci.h>
 #include <acpi/acpi_bus.h>
 #include <acpi/acpi_drivers.h>
 
@@ -784,6 +785,7 @@ static int __init acpi_init(void)
 	result = acpi_bus_init();
 
 	if (!result) {
+		pci_mmcfg_late_init();
 		if (!(pm_flags & PM_APM))
 			pm_flags |= PM_ACPI;
 		else {
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 9248e09..be288b5 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -787,6 +787,10 @@ int device_add(struct device *dev)
 	parent = get_device(dev->parent);
 	setup_parent(dev, parent);
 
+	/* use parent numa_node */
+	if (parent)
+		set_dev_node(dev, dev_to_node(parent));
+
 	/* first, register with generic layer. */
 	error = kobject_add(&dev->kobj, dev->kobj.parent, "%s", dev->bus_id);
 	if (error)
@@ -1306,8 +1310,11 @@ int device_move(struct device *dev, struct device *new_parent)
 	dev->parent = new_parent;
 	if (old_parent)
 		klist_remove(&dev->knode_parent);
-	if (new_parent)
+	if (new_parent) {
 		klist_add_tail(&dev->knode_parent, &new_parent->klist_children);
+		set_dev_node(dev, dev_to_node(new_parent));
+	}
+
 	if (!dev->class)
 		goto out_put;
 	error = device_move_class_links(dev, old_parent, new_parent);
@@ -1317,9 +1324,12 @@ int device_move(struct device *dev, struct device *new_parent)
 		if (!kobject_move(&dev->kobj, &old_parent->kobj)) {
 			if (new_parent)
 				klist_remove(&dev->knode_parent);
-			if (old_parent)
+			dev->parent = old_parent;
+			if (old_parent) {
 				klist_add_tail(&dev->knode_parent,
 					       &old_parent->klist_children);
+				set_dev_node(dev, dev_to_node(old_parent));
+			}
 		}
 		cleanup_glue_dir(dev, new_parent_kobj);
 		put_device(new_parent);
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index f991359..4a55bf3 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -842,11 +842,14 @@ static void set_pcie_port_type(struct pci_dev *pdev)
  * reading the dword at 0x100 which must either be 0 or a valid extended
  * capability header.
  */
-int pci_cfg_space_size(struct pci_dev *dev)
+int pci_cfg_space_size_ext(struct pci_dev *dev, unsigned check_exp_pcix)
 {
 	int pos;
 	u32 status;
 
+	if (!check_exp_pcix)
+		goto skip;
+
 	pos = pci_find_capability(dev, PCI_CAP_ID_EXP);
 	if (!pos) {
 		pos = pci_find_capability(dev, PCI_CAP_ID_PCIX);
@@ -858,6 +861,7 @@ int pci_cfg_space_size(struct pci_dev *dev)
 			goto fail;
 	}
 
+ skip:
 	if (pci_read_config_dword(dev, 256, &status) != PCIBIOS_SUCCESSFUL)
 		goto fail;
 	if (status == 0xffffffff)
@@ -869,6 +873,11 @@ int pci_cfg_space_size(struct pci_dev *dev)
 	return PCI_CFG_SPACE_SIZE;
 }
 
+int pci_cfg_space_size(struct pci_dev *dev)
+{
+	return pci_cfg_space_size_ext(dev, 1);
+}
+
 static void pci_release_bus_bridge_dev(struct device *dev)
 {
 	kfree(dev);
@@ -964,7 +973,6 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
 	dev->dev.release = pci_release_dev;
 	pci_dev_get(dev);
 
-	set_dev_node(&dev->dev, pcibus_to_node(bus));
 	dev->dev.dma_mask = &dev->dma_mask;
 	dev->dev.dma_parms = &dev->dma_parms;
 	dev->dev.coherent_dma_mask = 0xffffffffull;
@@ -1080,6 +1088,10 @@ unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus)
 	return max;
 }
 
+void __attribute__((weak)) set_pci_bus_resources_arch_default(struct pci_bus *b)
+{
+}
+
 struct pci_bus * pci_create_bus(struct device *parent,
 		int bus, struct pci_ops *ops, void *sysdata)
 {
@@ -1119,6 +1131,9 @@ struct pci_bus * pci_create_bus(struct device *parent,
 		goto dev_reg_err;
 	b->bridge = get_device(dev);
 
+	if (!parent)
+		set_dev_node(b->bridge, pcibus_to_node(b));
+
 	b->dev.class = &pcibus_class;
 	b->dev.parent = b->bridge;
 	sprintf(b->dev.bus_id, "%04x:%02x", pci_domain_nr(b), bus);
@@ -1136,6 +1151,8 @@ struct pci_bus * pci_create_bus(struct device *parent,
 	b->resource[0] = &ioport_resource;
 	b->resource[1] = &iomem_resource;
 
+	set_pci_bus_resources_arch_default(b);
+
 	return b;
 
 dev_create_file_err:
diff --git a/include/asm-x86/pci.h b/include/asm-x86/pci.h
index ddd8e24..30bbde0 100644
--- a/include/asm-x86/pci.h
+++ b/include/asm-x86/pci.h
@@ -19,6 +19,8 @@ struct pci_sysdata {
 };
 
 /* scan a bus after allocating a pci_sysdata for it */
+extern struct pci_bus *pci_scan_bus_on_node(int busno, struct pci_ops *ops,
+					    int node);
 extern struct pci_bus *pci_scan_bus_with_sysdata(int busno);
 
 static inline int pci_domain_nr(struct pci_bus *bus)
diff --git a/include/asm-x86/topology.h b/include/asm-x86/topology.h
index 2207326..0e6d6b0 100644
--- a/include/asm-x86/topology.h
+++ b/include/asm-x86/topology.h
@@ -193,9 +193,25 @@ extern cpumask_t cpu_coregroup_map(int cpu);
 #define topology_thread_siblings(cpu)		(per_cpu(cpu_sibling_map, cpu))
 #endif
 
+struct pci_bus;
+void set_pci_bus_resources_arch_default(struct pci_bus *b);
+
 #ifdef CONFIG_SMP
 #define mc_capable()			(boot_cpu_data.x86_max_cores > 1)
 #define smt_capable()			(smp_num_siblings > 1)
 #endif
 
+#ifdef CONFIG_NUMA
+extern int get_mp_bus_to_node(int busnum);
+extern void set_mp_bus_to_node(int busnum, int node);
+#else
+static inline int get_mp_bus_to_node(int busnum)
+{
+	return 0;
+}
+static inline void set_mp_bus_to_node(int busnum, int node)
+{
+}
+#endif
+
 #endif
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 2c7e003..41f7ce7 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -79,6 +79,7 @@ typedef int (*acpi_table_handler) (struct acpi_table_header *table);
 typedef int (*acpi_table_entry_handler) (struct acpi_subtable_header *header, const unsigned long end);
 
 char * __acpi_map_table (unsigned long phys_addr, unsigned long size);
+int early_acpi_boot_init(void);
 int acpi_boot_init (void);
 int acpi_boot_table_init (void);
 int acpi_numa_init (void);
@@ -235,6 +236,10 @@ int acpi_check_mem_region(resource_size_t start, resource_size_t n,
 
 #else	/* CONFIG_ACPI */
 
+static inline int early_acpi_boot_init(void)
+{
+	return 0;
+}
 static inline int acpi_boot_init(void)
 {
 	return 0;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 2924913..abc998f 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -254,7 +254,7 @@ static inline void pci_add_saved_cap(struct pci_dev *pci_dev,
 #define PCI_NUM_RESOURCES	11
 
 #ifndef PCI_BUS_NUM_RESOURCES
-#define PCI_BUS_NUM_RESOURCES	8
+#define PCI_BUS_NUM_RESOURCES	16
 #endif
 
 #define PCI_REGION_FLAG_MASK	0x0fU	/* These bits of resource flags tell us the PCI region flags */
@@ -666,6 +666,7 @@ int pci_scan_bridge(struct pci_bus *bus, struct pci_dev *dev, int max,
 
 void pci_walk_bus(struct pci_bus *top, void (*cb)(struct pci_dev *, void *),
 		  void *userdata);
+int pci_cfg_space_size_ext(struct pci_dev *dev, unsigned check_exp_pcix);
 int pci_cfg_space_size(struct pci_dev *dev);
 unsigned char pci_bus_max_busnr(struct pci_bus *bus);
 
@@ -1053,5 +1054,13 @@ extern unsigned long pci_cardbus_mem_size;
 
 extern int pcibios_add_platform_entries(struct pci_dev *dev);
 
+#ifdef CONFIG_PCI_MMCONFIG
+extern void __init pci_mmcfg_early_init(void);
+extern void __init pci_mmcfg_late_init(void);
+#else
+static inline void pci_mmcfg_early_init(void) { }
+static inline void pci_mmcfg_late_init(void) { }
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* LINUX_PCI_H */

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [RFC git pull] "big box" x86 changes
  2008-04-26 18:55 [RFC git pull] "big box" x86 changes Ingo Molnar
  2008-04-26 19:05 ` Stefan Richter
  2008-04-26 19:12 ` Linus Torvalds
@ 2008-04-26 22:17 ` Andi Kleen
  2008-04-27  3:14   ` Yinghai Lu
  2 siblings, 1 reply; 52+ messages in thread
From: Andi Kleen @ 2008-04-26 22:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, Andrew Morton, Thomas Gleixner,
	H. Peter Anvin, Yinghai Lu, Yinghai Lu, jbarnes

Ingo Molnar <mingo@elte.hu> writes:
>
> Most of the work has been done by Yinghai Lu who has gone through a 
> heroic effort to fix all the big-box bugs that he encountered on the 
> vanilla Linux kernel on his various up to 256 GB RAM test-systems. Also 
> work is included from Ying Huang for those insane SGI UV boxes.

Just to make things clear for the list readers, x86-64 worked fine for years 
on 256GB systems.

Many of the problems fixed in this tree are either relatively recent
regressions or have nothing to do with large boxes or 256GB (like the mmconfig
changes)

-Andi

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [git pull] "big box" x86 changes, boot protocol
  2008-04-26 20:39     ` Andrew Morton
  2008-04-26 21:06       ` Adrian Bunk
@ 2008-04-26 23:37       ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 52+ messages in thread
From: Jeremy Fitzhardinge @ 2008-04-26 23:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Linus Torvalds, linux-kernel, Thomas Gleixner,
	H. Peter Anvin, Yinghai Lu, Yinghai Lu, jbarnes

Andrew Morton wrote:
>> +	memcpy(&early_res[i], &early_res[i + 1],
>> +	       (j - 1 - i) * sizeof(struct early_res));
>>     
>
> nit: memcpy() shouldn't be used for overlapping copies.  It happens to be
> OK (for dst<src) in the kernel implementations.  We hope.
>   

Definitely shouldn't be assumed.  At one point in the distant past I had 
a ppc memcpy which would clobber a destination cacheline before reading 
the source, so source and dest within a cacheline's distance would be 
trouble, regardless of the direction.  Arch-specific code which knows 
about the arch-specific details of memcpy might be safer, I guess, but 
its still fairly brittle.

    J

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC git pull] "big box" x86 changes
  2008-04-26 22:17 ` [RFC git pull] "big box" x86 changes Andi Kleen
@ 2008-04-27  3:14   ` Yinghai Lu
  2008-04-27  8:30     ` Andi Kleen
  2008-04-27  8:32     ` [RFC git pull] "big box" x86 changes II Andi Kleen
  0 siblings, 2 replies; 52+ messages in thread
From: Yinghai Lu @ 2008-04-27  3:14 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Linus Torvalds, linux-kernel, Andrew Morton,
	Thomas Gleixner, H. Peter Anvin, jbarnes

On Sat, Apr 26, 2008 at 3:17 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Ingo Molnar <mingo@elte.hu> writes:
>  >
>  > Most of the work has been done by Yinghai Lu who has gone through a
>  > heroic effort to fix all the big-box bugs that he encountered on the
>  > vanilla Linux kernel on his various up to 256 GB RAM test-systems. Also
>  > work is included from Ying Huang for those insane SGI UV boxes.
>
>  Just to make things clear for the list readers, x86-64 worked fine for years
>  on 256GB systems.

Are you sure? I don't know who had 256g ram with x86-64 that earlier.
Horus from newisys?

that mean 8 sockets * 8 dimms * 4G.

YH

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC git pull] "big box" x86 changes
  2008-04-27  3:14   ` Yinghai Lu
@ 2008-04-27  8:30     ` Andi Kleen
  2008-04-27  8:32     ` [RFC git pull] "big box" x86 changes II Andi Kleen
  1 sibling, 0 replies; 52+ messages in thread
From: Andi Kleen @ 2008-04-27  8:30 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Andi Kleen, Ingo Molnar, Linus Torvalds, linux-kernel,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin, jbarnes

> Are you sure? I don't know who had 256g ram with x86-64 that earlier.

IBM Summit. We tested that a couple of years ago.

I think the worst regressions you had to fix came from the vmemmap merge
BTW. It pretty much reintroduced some mistakes in memory placing that
the original code had too and which were fixed back then.

-Andi

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC git pull] "big box" x86 changes II
  2008-04-27  3:14   ` Yinghai Lu
  2008-04-27  8:30     ` Andi Kleen
@ 2008-04-27  8:32     ` Andi Kleen
  1 sibling, 0 replies; 52+ messages in thread
From: Andi Kleen @ 2008-04-27  8:32 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Andi Kleen, Ingo Molnar, Linus Torvalds, linux-kernel,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin, jbarnes


And BTW the Unisys ES7000 boxes supported that much (and more) memory
for quite some time too.

-Andi

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [git pull] "big box" x86 changes, boot protocol
  2008-04-26 19:54   ` [git pull] "big box" x86 changes, boot protocol Ingo Molnar
  2008-04-26 20:39     ` Andrew Morton
@ 2008-04-27 11:21     ` Ian Campbell
  2008-04-27 19:29       ` H. Peter Anvin
  2008-04-28 15:27       ` Ingo Molnar
  1 sibling, 2 replies; 52+ messages in thread
From: Ian Campbell @ 2008-04-27 11:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, Andrew Morton, Thomas Gleixner,
	H. Peter Anvin, Yinghai Lu, Yinghai Lu, jbarnes,
	Jeremy Fitzhardinge, Rusty Russell

[-- Attachment #1: Type: text/plain, Size: 2366 bytes --]

On Sat, 2008-04-26 at 21:54 +0200, Ingo Molnar wrote:
> diff --git a/Documentation/i386/boot.txt b/Documentation/i386/boot.txt
> index 2eb1610..0fac346 100644
> --- a/Documentation/i386/boot.txt
> +++ b/Documentation/i386/boot.txt
> @@ -42,6 +42,8 @@ Protocol 2.05:        (Kernel 2.6.20) Make protected mode kernel relocatable.
>  Protocol 2.06: (Kernel 2.6.22) Added a field that contains the size of
>                 the boot command line
>  
> +Protocol 2.09: (kernel 2.6.26) Added a field of 64-bit physical
> +               pointer to single linked list of struct setup_data.
>  

How about this. I can redo against current Linus but that will need
fixups here.

From cfed8d4043fc03ed948518076db602919b111a16 Mon Sep 17 00:00:00 2001
From: Ian Campbell <ijc@hellion.org.uk>
Date: Sun, 27 Apr 2008 12:19:11 +0100
Subject: [PATCH] Backfill x86 boot protocol documentation.

Signed-off-by: Ian Campbell <ijc@hellion.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
---
 Documentation/i386/boot.txt |   12 ++++++++++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/Documentation/i386/boot.txt b/Documentation/i386/boot.txt
index 0fac346..86fbbd0 100644
--- a/Documentation/i386/boot.txt
+++ b/Documentation/i386/boot.txt
@@ -40,9 +40,17 @@ Protocol 2.05:	(Kernel 2.6.20) Make protected mode kernel relocatable.
 		Introduce relocatable_kernel and kernel_alignment fields.
 
 Protocol 2.06:	(Kernel 2.6.22) Added a field that contains the size of
-		the boot command line
+		the boot command line.
 
-Protocol 2.09:	(kernel 2.6.26) Added a field of 64-bit physical
+Protocol 2.07:	(Kernel 2.6.24) Added paravirtualised boot protocol.
+		Introduced hardware_subarch and hardware_subarch_data
+		and KEEP_SEGMENTS flag in load_flags.
+
+Protocol 2.08:	(Kernel 2.6.26) Added crc32 checksum and ELF format
+		payload. Introduced payload_offset and payload length
+		fields to aid in locating the payload.
+
+Protocol 2.09:	(Kernel 2.6.26) Added a field of 64-bit physical
 		pointer to single linked list of struct	setup_data.
 
 **** MEMORY LAYOUT
-- 
1.5.5.1



-- 
Ian Campbell

Computer Science is the only discipline in which we view adding a new wing
to a building as being maintenance
		-- Jim Horning

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [git pull] "big box" x86 changes, PCI
  2008-04-26 21:55   ` [git pull] "big box" x86 changes, PCI Ingo Molnar
@ 2008-04-27 16:30     ` Jesse Barnes
  2008-04-28 15:38       ` Ingo Molnar
  2008-04-28 20:34     ` Jesse Barnes
  1 sibling, 1 reply; 52+ messages in thread
From: Jesse Barnes @ 2008-04-27 16:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, Andrew Morton, Thomas Gleixner,
	H. Peter Anvin, Yinghai Lu, Yinghai Lu

On Saturday, April 26, 2008 2:55 pm Ingo Molnar wrote:
> ok, this is the final chunk of the "big box" topic - the PCI changes.
>
> These are the largest, and while i tried to reduce their number it's
> still 19 commits - but it's all around the same topic. The bulk of the
> new code is in a single file. The tree can be pulled from:
>
>   
> git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-pci.
>git for-linus
>
> this depends on the bootmem changes which are now upstream. These
> changes too have been in linux-next for some time and the cross-arch
> (build) success rate is high, as in:
>
>    http://www.tglx.de/autoqa-cgi/index?run=89&tree=1
>
> all but the last few patches have been in x86.git for a longer time and
> more than 95% of the changes are for arch/x86. (see the dates of the
> patches. The "15 Feb 2008" ones are older than their timestamp - that's
> when we imported those changes into a date-aware repository.)
>
> i booted up this tree 5 times on x86, mixed 64-bit/32-bit.

Only did a quick scan, but the changes look nice so far.  I'll take a closer 
look and give it a try locally tomorrow.

On an unrelated note, can you take a look at the "PCI MSI breaks when booting 
with nosmp" thread?  I posted a patch there that unconditionally enables the 
local apic with nosmp/maxcpus=0 so that MSI will work correctly.  The other 
option of course is to disable MSI when nosmp/maxcpus=0 and local apic enable 
code alone.

Thanks,
Jesse

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [git pull] "big box" x86 changes, boot protocol
  2008-04-27 11:21     ` Ian Campbell
@ 2008-04-27 19:29       ` H. Peter Anvin
  2008-04-28 15:27       ` Ingo Molnar
  1 sibling, 0 replies; 52+ messages in thread
From: H. Peter Anvin @ 2008-04-27 19:29 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Ingo Molnar, Linus Torvalds, linux-kernel, Andrew Morton,
	Thomas Gleixner, Yinghai Lu, Yinghai Lu, jbarnes,
	Jeremy Fitzhardinge, Rusty Russell

Ian Campbell wrote:
> 
> How about this. I can redo against current Linus but that will need
> fixups here.
> 
> From cfed8d4043fc03ed948518076db602919b111a16 Mon Sep 17 00:00:00 2001
> From: Ian Campbell <ijc@hellion.org.uk>
> Date: Sun, 27 Apr 2008 12:19:11 +0100
> Subject: [PATCH] Backfill x86 boot protocol documentation.
> 
> Signed-off-by: Ian Campbell <ijc@hellion.org.uk>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Rusty Russell <rusty@rustcorp.com.au>
> Cc: Jeremy Fitzhardinge <jeremy@goop.org>
> ---
>  Documentation/i386/boot.txt |   12 ++++++++++--
>  1 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/i386/boot.txt b/Documentation/i386/boot.txt
> index 0fac346..86fbbd0 100644
> --- a/Documentation/i386/boot.txt
> +++ b/Documentation/i386/boot.txt
> @@ -40,9 +40,17 @@ Protocol 2.05:	(Kernel 2.6.20) Make protected mode kernel relocatable.
>  		Introduce relocatable_kernel and kernel_alignment fields.
>  
>  Protocol 2.06:	(Kernel 2.6.22) Added a field that contains the size of
> -		the boot command line
> +		the boot command line.
>  
> -Protocol 2.09:	(kernel 2.6.26) Added a field of 64-bit physical
> +Protocol 2.07:	(Kernel 2.6.24) Added paravirtualised boot protocol.
> +		Introduced hardware_subarch and hardware_subarch_data
> +		and KEEP_SEGMENTS flag in load_flags.
> +
> +Protocol 2.08:	(Kernel 2.6.26) Added crc32 checksum and ELF format
> +		payload. Introduced payload_offset and payload length
> +		fields to aid in locating the payload.
> +
> +Protocol 2.09:	(Kernel 2.6.26) Added a field of 64-bit physical
>  		pointer to single linked list of struct	setup_data.
>  

Acked-by: H. Peter Anvin <hpa@zytor.com>

Since this is a documentation fix, it can wait until post-window.

	-hpa

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [git pull] "big box" x86 changes, bootmem/sparsemem
  2008-04-26 19:41   ` [git pull] "big box" x86 changes, bootmem/sparsemem Ingo Molnar
  2008-04-26 19:52     ` Linus Torvalds
@ 2008-04-27 22:48     ` Johannes Weiner
  2008-04-27 23:46       ` Ingo Molnar
  2008-04-28  0:33       ` [git pull] "big box" x86 changes, bootmem/sparsemem Yinghai Lu
  1 sibling, 2 replies; 52+ messages in thread
From: Johannes Weiner @ 2008-04-27 22:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, Andrew Morton, Thomas Gleixner,
	H. Peter Anvin, Yinghai Lu, Yinghai Lu, jbarnes

Hi Ingo,

Ingo Molnar <mingo@elte.hu> writes:

> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> IOW, they'd be big enough that people hopefully don't start nitpicking 
>> about some *totally* uninteresting small detail, but small enough that 
>> people can read it through without losing concentration about a 
>> quarter of the way in.
>
> ok. Here's the "memory management" type of changes:
>
>    git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-bootmem.git for-linus
>
> the other sub-trees will depend on these changes. I think these 
> infrastructure and other improvements are mergable and pullable as-is.
>
> 	Ingo
>
> ------------------>

[...]

>       mm: allow reserve_bootmem() cross nodes

I find it sad that this goes in now.  I wrote a clean version of
reserve_bootmem() [1] and it was rejected with arguments that I did not
understand [2] and that were not further explained even though I asked
for it [3].

http://lkml.org/lkml/2008/4/16/76
http://lkml.org/lkml/2008/4/16/234
http://lkml.org/lkml/2008/4/16/250

Your comment was rather unfair, because it gave the impression you did
not read the thread before replying.  And you did not react to other
explicit questions from me.  If you find my patches to be crap, say so
and please explain WHY so I have a chance to improve.

Please, reconsider the bootmem patches; bootmem code looks really bad at
the moment.

	Hannes

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [git pull] "big box" x86 changes, bootmem/sparsemem
  2008-04-27 22:48     ` [git pull] "big box" x86 changes, bootmem/sparsemem Johannes Weiner
@ 2008-04-27 23:46       ` Ingo Molnar
  2008-04-28  0:19         ` Johannes Weiner
  2008-04-28  0:33       ` [git pull] "big box" x86 changes, bootmem/sparsemem Yinghai Lu
  1 sibling, 1 reply; 52+ messages in thread
From: Ingo Molnar @ 2008-04-27 23:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linus Torvalds, linux-kernel, Andrew Morton, Thomas Gleixner,
	H. Peter Anvin, Yinghai Lu, Yinghai Lu, jbarnes


* Johannes Weiner <hannes@saeurebad.de> wrote:

> >       mm: allow reserve_bootmem() cross nodes
> 
> I find it sad that this goes in now.  I wrote a clean version of 
> reserve_bootmem() [1] and it was rejected with arguments that I did 
> not understand [2] and that were not further explained even though I 
> asked for it [3].
> 
> http://lkml.org/lkml/2008/4/16/76
> http://lkml.org/lkml/2008/4/16/234
> http://lkml.org/lkml/2008/4/16/250

oh, i had your patches applied then undid them due to this:

   http://www.ussg.iu.edu/hypermail/linux/kernel/0804.2/0253.html

havent seen a followup to that mail so it was 'issue pending'.

so i very much agree that your changes are cleaner, i just wanted to 
have one that has all the fixes included.

Would you like to post a patch against current -git or should i extract 
the cleaner reserve_bootmem() from your previous patch?

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [git pull] "big box" x86 changes, bootmem/sparsemem
  2008-04-27 23:46       ` Ingo Molnar
@ 2008-04-28  0:19         ` Johannes Weiner
  2008-04-28  0:40           ` [patch] mm: node-setup agnostic free_bootmem() Ingo Molnar
  0 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2008-04-28  0:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, Andrew Morton, Thomas Gleixner,
	H. Peter Anvin, Yinghai Lu, Yinghai Lu, jbarnes

Hi Ingo,

Ingo Molnar <mingo@elte.hu> writes:

> * Johannes Weiner <hannes@saeurebad.de> wrote:
>
>> >       mm: allow reserve_bootmem() cross nodes
>> 
>> I find it sad that this goes in now.  I wrote a clean version of 
>> reserve_bootmem() [1] and it was rejected with arguments that I did 
>> not understand [2] and that were not further explained even though I 
>> asked for it [3].
>> 
>> http://lkml.org/lkml/2008/4/16/76
>> http://lkml.org/lkml/2008/4/16/234
>> http://lkml.org/lkml/2008/4/16/250
>
> oh, i had your patches applied then undid them due to this:
>
>    http://www.ussg.iu.edu/hypermail/linux/kernel/0804.2/0253.html
>
> havent seen a followup to that mail so it was 'issue pending'.

> so i very much agree that your changes are cleaner, i just wanted to 
> have one that has all the fixes included.

I had planned this to be another patch because there are more then one
boundary check I wanted to tighten.  I can merge them though if you
like.

> Would you like to post a patch against current -git or should i extract 
> the cleaner reserve_bootmem() from your previous patch?

I just moved and have only sporadic internet access and free time slots
available.  Would be nice if you could do it!

> 	Ingo

Thanks!

	Hannes

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [git pull] "big box" x86 changes, bootmem/sparsemem
  2008-04-27 22:48     ` [git pull] "big box" x86 changes, bootmem/sparsemem Johannes Weiner
  2008-04-27 23:46       ` Ingo Molnar
@ 2008-04-28  0:33       ` Yinghai Lu
  2008-04-28 16:58         ` Johannes Weiner
  1 sibling, 1 reply; 52+ messages in thread
From: Yinghai Lu @ 2008-04-28  0:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Linus Torvalds, linux-kernel, Andrew Morton,
	Thomas Gleixner, H. Peter Anvin, Jesse Barnes, Siddha, Suresh B

On Sun, Apr 27, 2008 at 3:48 PM, Johannes Weiner <hannes@saeurebad.de> wrote:
> Hi Ingo,
>
>
>  Ingo Molnar <mingo@elte.hu> writes:
>
>  > * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>  >
>  >> IOW, they'd be big enough that people hopefully don't start nitpicking
>  >> about some *totally* uninteresting small detail, but small enough that
>  >> people can read it through without losing concentration about a
>  >> quarter of the way in.
>  >
>  > ok. Here's the "memory management" type of changes:
>  >
>  >    git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-bootmem.git for-linus
>  >
>  > the other sub-trees will depend on these changes. I think these
>  > infrastructure and other improvements are mergable and pullable as-is.
>  >
>  >       Ingo
>  >
>  > ------------------>
>
>  [...]
>
>
>  >       mm: allow reserve_bootmem() cross nodes
>
>  I find it sad that this goes in now.  I wrote a clean version of
>  reserve_bootmem() [1] and it was rejected with arguments that I did not
>  understand [2] and that were not further explained even though I asked
>  for it [3].
>
>  http://lkml.org/lkml/2008/4/16/76
>  http://lkml.org/lkml/2008/4/16/234
>  http://lkml.org/lkml/2008/4/16/250
>
>  Your comment was rather unfair, because it gave the impression you did
>  not read the thread before replying.  And you did not react to other
>  explicit questions from me.  If you find my patches to be crap, say so
>  and please explain WHY so I have a chance to improve.

this thread is for reserve_bootmem ?

You patch is regarding to free_bootmem, and it doesn't work with intel
cross node boxes.

YH

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [patch] mm: node-setup agnostic free_bootmem()
  2008-04-28  0:19         ` Johannes Weiner
@ 2008-04-28  0:40           ` Ingo Molnar
  2008-04-28  1:48             ` Yinghai Lu
  2008-04-28 16:49             ` Johannes Weiner
  0 siblings, 2 replies; 52+ messages in thread
From: Ingo Molnar @ 2008-04-28  0:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linus Torvalds, linux-kernel, Andrew Morton, Thomas Gleixner,
	H. Peter Anvin, Yinghai Lu, Yinghai Lu, jbarnes


* Johannes Weiner <hannes@saeurebad.de> wrote:

> > so i very much agree that your changes are cleaner, i just wanted to 
> > have one that has all the fixes included.
> 
> I had planned this to be another patch because there are more then one 
> boundary check I wanted to tighten.  I can merge them though if you 
> like.

no, better to have them in separate patches.

> > Would you like to post a patch against current -git or should i 
> > extract the cleaner reserve_bootmem() from your previous patch?
> 
> I just moved and have only sporadic internet access and free time 
> slots available.  Would be nice if you could do it!

sure, find the merged patch below, against latest -git, boot-tested on 
x86. Is this what you had in mind?

	Ingo

---------------->
Subject: mm: node-setup agnostic free_bootmem()
From: Johannes Weiner <hannes@saeurebad.de>
Date: Wed, 16 Apr 2008 13:36:31 +0200

Make free_bootmem() look up the node holding the specified address
range which lets it work transparently on single-node and multi-node
configurations.

If the address range exceeds the node range, it well be marked free
across node boundaries, too.

Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
CC: Andi Kleen <andi@firstfloor.org>
CC: Yinghai Lu <yhlu.kernel@gmail.com>
CC: Yasunori Goto <y-goto@jp.fujitsu.com>
CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Christoph Lameter <clameter@sgi.com>
CC: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 mm/bootmem.c |   27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

Index: linux-x86.q/mm/bootmem.c
===================================================================
--- linux-x86.q.orig/mm/bootmem.c
+++ linux-x86.q/mm/bootmem.c
@@ -493,8 +493,31 @@ int __init reserve_bootmem(unsigned long
 void __init free_bootmem(unsigned long addr, unsigned long size)
 {
 	bootmem_data_t *bdata;
-	list_for_each_entry(bdata, &bdata_list, list)
-		free_bootmem_core(bdata, addr, size);
+	unsigned long pos = addr;
+	unsigned long partsize = size;
+
+	list_for_each_entry(bdata, &bdata_list, list) {
+		unsigned long remainder = 0;
+
+		if (pos < bdata->node_boot_start)
+			continue;
+
+		if (PFN_DOWN(pos + partsize) > bdata->node_low_pfn) {
+			remainder = PFN_DOWN(pos + partsize) - bdata->node_low_pfn;
+			partsize -= remainder;
+		}
+
+		free_bootmem_core(bdata, pos, partsize);
+
+		if (!remainder)
+			return;
+
+		pos = PFN_PHYS(bdata->node_low_pfn + 1);
+	}
+	printk(KERN_ERR "free_bootmem: request: addr=%lx, size=%lx, "
+			"state: pos=%lx, partsize=%lx\n", addr, size,
+			pos, partsize);
+	BUG();
 }
 
 unsigned long __init free_all_bootmem(void)

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] mm: node-setup agnostic free_bootmem()
  2008-04-28  0:40           ` [patch] mm: node-setup agnostic free_bootmem() Ingo Molnar
@ 2008-04-28  1:48             ` Yinghai Lu
  2008-04-28 16:54               ` Johannes Weiner
  2008-04-28 16:49             ` Johannes Weiner
  1 sibling, 1 reply; 52+ messages in thread
From: Yinghai Lu @ 2008-04-28  1:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Johannes Weiner, Linus Torvalds, linux-kernel, Andrew Morton,
	Thomas Gleixner, H. Peter Anvin, jbarnes, Siddha, Suresh B

On Sun, Apr 27, 2008 at 5:40 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
>  * Johannes Weiner <hannes@saeurebad.de> wrote:
>
>  > > so i very much agree that your changes are cleaner, i just wanted to
>  > > have one that has all the fixes included.
>  >
>  > I had planned this to be another patch because there are more then one
>  > boundary check I wanted to tighten.  I can merge them though if you
>  > like.
>
>  no, better to have them in separate patches.
>
>  > > Would you like to post a patch against current -git or should i
>  > > extract the cleaner reserve_bootmem() from your previous patch?
>  >
>  > I just moved and have only sporadic internet access and free time
>  > slots available.  Would be nice if you could do it!
>
>  sure, find the merged patch below, against latest -git, boot-tested on
>  x86. Is this what you had in mind?
>
>         Ingo
>
>  ---------------->
>  Subject: mm: node-setup agnostic free_bootmem()
>  From: Johannes Weiner <hannes@saeurebad.de>
>  Date: Wed, 16 Apr 2008 13:36:31 +0200
>
>  Make free_bootmem() look up the node holding the specified address
>  range which lets it work transparently on single-node and multi-node
>  configurations.
>
>  If the address range exceeds the node range, it well be marked free
>  across node boundaries, too.
>
>  Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
>  CC: Andi Kleen <andi@firstfloor.org>
>  CC: Yinghai Lu <yhlu.kernel@gmail.com>
>  CC: Yasunori Goto <y-goto@jp.fujitsu.com>
>  CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>  CC: Christoph Lameter <clameter@sgi.com>
>  CC: Andrew Morton <akpm@linux-foundation.org>
>  Signed-off-by: Ingo Molnar <mingo@elte.hu>
>  ---
>   mm/bootmem.c |   27 +++++++++++++++++++++++++--
>   1 file changed, 25 insertions(+), 2 deletions(-)
>
>  Index: linux-x86.q/mm/bootmem.c
>  ===================================================================
>  --- linux-x86.q.orig/mm/bootmem.c
>  +++ linux-x86.q/mm/bootmem.c
>  @@ -493,8 +493,31 @@ int __init reserve_bootmem(unsigned long
>   void __init free_bootmem(unsigned long addr, unsigned long size)
>   {
>         bootmem_data_t *bdata;
>  -       list_for_each_entry(bdata, &bdata_list, list)
>  -               free_bootmem_core(bdata, addr, size);
>  +       unsigned long pos = addr;
>  +       unsigned long partsize = size;
>  +
>  +       list_for_each_entry(bdata, &bdata_list, list) {
>  +               unsigned long remainder = 0;
>  +
>  +               if (pos < bdata->node_boot_start)
>  +                       continue;
>  +
>  +               if (PFN_DOWN(pos + partsize) > bdata->node_low_pfn) {
>  +                       remainder = PFN_DOWN(pos + partsize) - bdata->node_low_pfn;
>  +                       partsize -= remainder;
>  +               }
>  +
>  +               free_bootmem_core(bdata, pos, partsize);
>  +
>  +               if (!remainder)
>  +                       return;
>  +
>  +               pos = PFN_PHYS(bdata->node_low_pfn + 1);
>  +       }
>  +       printk(KERN_ERR "free_bootmem: request: addr=%lx, size=%lx, "
>  +                       "state: pos=%lx, partsize=%lx\n", addr, size,
>  +                       pos, partsize);
>  +       BUG();
>   }
>
>   unsigned long __init free_all_bootmem(void)
>

it will not work with cross nodes.

for example: node 0: 0-2g, 4-6g, node1: 2-4g, 6-8g.
and if ramdisk sit cross 2G boundary. you will only free the range before 2g.

YH

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [git pull] "big box" x86 changes, boot protocol
  2008-04-27 11:21     ` Ian Campbell
  2008-04-27 19:29       ` H. Peter Anvin
@ 2008-04-28 15:27       ` Ingo Molnar
  1 sibling, 0 replies; 52+ messages in thread
From: Ingo Molnar @ 2008-04-28 15:27 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Linus Torvalds, linux-kernel, Andrew Morton, Thomas Gleixner,
	H. Peter Anvin, Yinghai Lu, Yinghai Lu, jbarnes,
	Jeremy Fitzhardinge, Rusty Russell


* Ian Campbell <ijc@hellion.org.uk> wrote:

> How about this. I can redo against current Linus but that will need 
> fixups here.

no need - your patch is fine - thanks, applied.

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [git pull] "big box" x86 changes, PCI
  2008-04-27 16:30     ` Jesse Barnes
@ 2008-04-28 15:38       ` Ingo Molnar
  0 siblings, 0 replies; 52+ messages in thread
From: Ingo Molnar @ 2008-04-28 15:38 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Linus Torvalds, linux-kernel, Andrew Morton, Thomas Gleixner,
	H. Peter Anvin, Yinghai Lu, Yinghai Lu


* Jesse Barnes <jbarnes@virtuousgeek.org> wrote:

> On Saturday, April 26, 2008 2:55 pm Ingo Molnar wrote:
> > ok, this is the final chunk of the "big box" topic - the PCI changes.
> >
> > These are the largest, and while i tried to reduce their number it's
> > still 19 commits - but it's all around the same topic. The bulk of the
> > new code is in a single file. The tree can be pulled from:
> >
> >   
> > git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-pci.
> >git for-linus
> >
> > this depends on the bootmem changes which are now upstream. These
> > changes too have been in linux-next for some time and the cross-arch
> > (build) success rate is high, as in:
> >
> >    http://www.tglx.de/autoqa-cgi/index?run=89&tree=1
> >
> > all but the last few patches have been in x86.git for a longer time and
> > more than 95% of the changes are for arch/x86. (see the dates of the
> > patches. The "15 Feb 2008" ones are older than their timestamp - that's
> > when we imported those changes into a date-aware repository.)
> >
> > i booted up this tree 5 times on x86, mixed 64-bit/32-bit.
> 
> Only did a quick scan, but the changes look nice so far.  I'll take a 
> closer look and give it a try locally tomorrow.

ok, great! Let us know if you experience any problems - and please give 
us an Ack if it looks OK so that we can send the final pull request to 
Linus :-)

> On an unrelated note, can you take a look at the "PCI MSI breaks when 
> booting with nosmp" thread?  I posted a patch there that 
> unconditionally enables the local apic with nosmp/maxcpus=0 so that 
> MSI will work correctly.  The other option of course is to disable MSI 
> when nosmp/maxcpus=0 and local apic enable code alone.

yeah - great fix - your patch has been put into x86.git a few minutes 
after you posted the fix ;-) I think that's been a long-standing problem 
(== was broken forever), i noticed it happen on a number of Intel SDV 
boards myself in the past and initially suspected e1000 - it is the 
first driver that throws a fit with non-working MSI interrupts.

I never managed to track it down (there was nothing to bisect to and far 
enough in the past kernels simply stopped booting on that hardware) and 
things like acpi=off regularly dont work on prototype boards anyway. I'm 
glad you figured it out and it's on the way upstream.

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] mm: node-setup agnostic free_bootmem()
  2008-04-28  0:40           ` [patch] mm: node-setup agnostic free_bootmem() Ingo Molnar
  2008-04-28  1:48             ` Yinghai Lu
@ 2008-04-28 16:49             ` Johannes Weiner
  2008-04-29 14:25               ` Ingo Molnar
  1 sibling, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2008-04-28 16:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, Andrew Morton, Thomas Gleixner,
	H. Peter Anvin, Yinghai Lu, Yinghai Lu, jbarnes

Hi Ingo,

Ingo Molnar <mingo@elte.hu> writes:

> * Johannes Weiner <hannes@saeurebad.de> wrote:
>
>> > so i very much agree that your changes are cleaner, i just wanted to 
>> > have one that has all the fixes included.
>> 
>> I had planned this to be another patch because there are more then one 
>> boundary check I wanted to tighten.  I can merge them though if you 
>> like.
>
> no, better to have them in separate patches.

Okay.

>> > Would you like to post a patch against current -git or should i 
>> > extract the cleaner reserve_bootmem() from your previous patch?
>> 
>> I just moved and have only sporadic internet access and free time 
>> slots available.  Would be nice if you could do it!
>
> sure, find the merged patch below, against latest -git, boot-tested on 
> x86. Is this what you had in mind?
>
> 	Ingo
>
> ---------------->
> Subject: mm: node-setup agnostic free_bootmem()
> From: Johannes Weiner <hannes@saeurebad.de>
> Date: Wed, 16 Apr 2008 13:36:31 +0200
>
> Make free_bootmem() look up the node holding the specified address
> range which lets it work transparently on single-node and multi-node
> configurations.
>
> If the address range exceeds the node range, it well be marked free
> across node boundaries, too.
>
> Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
> CC: Andi Kleen <andi@firstfloor.org>
> CC: Yinghai Lu <yhlu.kernel@gmail.com>
> CC: Yasunori Goto <y-goto@jp.fujitsu.com>
> CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Christoph Lameter <clameter@sgi.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  mm/bootmem.c |   27 +++++++++++++++++++++++++--
>  1 file changed, 25 insertions(+), 2 deletions(-)
>
> Index: linux-x86.q/mm/bootmem.c
> ===================================================================
> --- linux-x86.q.orig/mm/bootmem.c
> +++ linux-x86.q/mm/bootmem.c
> @@ -493,8 +493,31 @@ int __init reserve_bootmem(unsigned long
>  void __init free_bootmem(unsigned long addr, unsigned long size)
>  {
>  	bootmem_data_t *bdata;
> -	list_for_each_entry(bdata, &bdata_list, list)
> -		free_bootmem_core(bdata, addr, size);
> +	unsigned long pos = addr;
> +	unsigned long partsize = size;
> +
> +	list_for_each_entry(bdata, &bdata_list, list) {
> +		unsigned long remainder = 0;
> +
> +		if (pos < bdata->node_boot_start)
> +			continue;
> +
> +		if (PFN_DOWN(pos + partsize) > bdata->node_low_pfn) {
> +			remainder = PFN_DOWN(pos + partsize) - bdata->node_low_pfn;
> +			partsize -= remainder;
> +		}
> +
> +		free_bootmem_core(bdata, pos, partsize);
> +
> +		if (!remainder)
> +			return;
> +
> +		pos = PFN_PHYS(bdata->node_low_pfn + 1);
> +	}
> +	printk(KERN_ERR "free_bootmem: request: addr=%lx, size=%lx, "
> +			"state: pos=%lx, partsize=%lx\n", addr, size,
> +			pos, partsize);
> +	BUG();
>  }
>  
>  unsigned long __init free_all_bootmem(void)

Yes, looks good.  But needs explicit testing, I guess.

	Hannes

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] mm: node-setup agnostic free_bootmem()
  2008-04-28  1:48             ` Yinghai Lu
@ 2008-04-28 16:54               ` Johannes Weiner
  2008-04-28 19:11                 ` Yinghai Lu
  0 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2008-04-28 16:54 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Ingo Molnar, Linus Torvalds, linux-kernel, Andrew Morton,
	Thomas Gleixner, H. Peter Anvin, jbarnes, Siddha, Suresh B

Hi Yinghai,

"Yinghai Lu" <yhlu.kernel@gmail.com> writes:

> On Sun, Apr 27, 2008 at 5:40 PM, Ingo Molnar <mingo@elte.hu> wrote:
>>
>>  * Johannes Weiner <hannes@saeurebad.de> wrote:
>>
>>  > > so i very much agree that your changes are cleaner, i just wanted to
>>  > > have one that has all the fixes included.
>>  >
>>  > I had planned this to be another patch because there are more then one
>>  > boundary check I wanted to tighten.  I can merge them though if you
>>  > like.
>>
>>  no, better to have them in separate patches.
>>
>>  > > Would you like to post a patch against current -git or should i
>>  > > extract the cleaner reserve_bootmem() from your previous patch?
>>  >
>>  > I just moved and have only sporadic internet access and free time
>>  > slots available.  Would be nice if you could do it!
>>
>>  sure, find the merged patch below, against latest -git, boot-tested on
>>  x86. Is this what you had in mind?
>>
>>         Ingo
>>
>>  ---------------->
>>  Subject: mm: node-setup agnostic free_bootmem()
>>  From: Johannes Weiner <hannes@saeurebad.de>
>>  Date: Wed, 16 Apr 2008 13:36:31 +0200
>>
>>  Make free_bootmem() look up the node holding the specified address
>>  range which lets it work transparently on single-node and multi-node
>>  configurations.
>>
>>  If the address range exceeds the node range, it well be marked free
>>  across node boundaries, too.
>>
>>  Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
>>  CC: Andi Kleen <andi@firstfloor.org>
>>  CC: Yinghai Lu <yhlu.kernel@gmail.com>
>>  CC: Yasunori Goto <y-goto@jp.fujitsu.com>
>>  CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>  CC: Christoph Lameter <clameter@sgi.com>
>>  CC: Andrew Morton <akpm@linux-foundation.org>
>>  Signed-off-by: Ingo Molnar <mingo@elte.hu>
>>  ---
>>   mm/bootmem.c |   27 +++++++++++++++++++++++++--
>>   1 file changed, 25 insertions(+), 2 deletions(-)
>>
>>  Index: linux-x86.q/mm/bootmem.c
>>  ===================================================================
>>  --- linux-x86.q.orig/mm/bootmem.c
>>  +++ linux-x86.q/mm/bootmem.c
>>  @@ -493,8 +493,31 @@ int __init reserve_bootmem(unsigned long
>>   void __init free_bootmem(unsigned long addr, unsigned long size)
>>   {
>>         bootmem_data_t *bdata;
>>  -       list_for_each_entry(bdata, &bdata_list, list)
>>  -               free_bootmem_core(bdata, addr, size);
>>  +       unsigned long pos = addr;
>>  +       unsigned long partsize = size;
>>  +
>>  +       list_for_each_entry(bdata, &bdata_list, list) {
>>  +               unsigned long remainder = 0;
>>  +
>>  +               if (pos < bdata->node_boot_start)
>>  +                       continue;
>>  +
>>  +               if (PFN_DOWN(pos + partsize) > bdata->node_low_pfn) {
>>  +                       remainder = PFN_DOWN(pos + partsize) - bdata->node_low_pfn;
>>  +                       partsize -= remainder;
>>  +               }
>>  +
>>  +               free_bootmem_core(bdata, pos, partsize);
>>  +
>>  +               if (!remainder)
>>  +                       return;
>>  +
>>  +               pos = PFN_PHYS(bdata->node_low_pfn + 1);
>>  +       }
>>  +       printk(KERN_ERR "free_bootmem: request: addr=%lx, size=%lx, "
>>  +                       "state: pos=%lx, partsize=%lx\n", addr, size,
>>  +                       pos, partsize);
>>  +       BUG();
>>   }
>>
>>   unsigned long __init free_all_bootmem(void)
>>
>
> it will not work with cross nodes.
>
> for example: node 0: 0-2g, 4-6g, node1: 2-4g, 6-8g.
> and if ramdisk sit cross 2G boundary. you will only free the range
> before 2g.

Yes, you stated that several times but this is not a technical argument:
These setups are afaik not yet supported by the kernel at all.  And you
could not explain the node layout with the patch that implements support
for these configurations.

So as long as you don't explain in technical detail why my patch won't
work, I will have to ignore your objections.

My opinion is that my patch should go in as is and the patch that adds
support for these node-setups should change bootmem accordingly, if
needed at all.

	Hannes

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [git pull] "big box" x86 changes, bootmem/sparsemem
  2008-04-28  0:33       ` [git pull] "big box" x86 changes, bootmem/sparsemem Yinghai Lu
@ 2008-04-28 16:58         ` Johannes Weiner
  0 siblings, 0 replies; 52+ messages in thread
From: Johannes Weiner @ 2008-04-28 16:58 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Ingo Molnar, Linus Torvalds, linux-kernel, Andrew Morton,
	Thomas Gleixner, H. Peter Anvin, Jesse Barnes, Siddha, Suresh B

Hi,

"Yinghai Lu" <yhlu.kernel@gmail.com> writes:

> On Sun, Apr 27, 2008 at 3:48 PM, Johannes Weiner <hannes@saeurebad.de> wrote:
>> Hi Ingo,
>>
>>
>>  Ingo Molnar <mingo@elte.hu> writes:
>>
>>  > * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>  >
>>  >> IOW, they'd be big enough that people hopefully don't start nitpicking
>>  >> about some *totally* uninteresting small detail, but small enough that
>>  >> people can read it through without losing concentration about a
>>  >> quarter of the way in.
>>  >
>>  > ok. Here's the "memory management" type of changes:
>>  >
>>  >    git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-bootmem.git for-linus
>>  >
>>  > the other sub-trees will depend on these changes. I think these
>>  > infrastructure and other improvements are mergable and pullable as-is.
>>  >
>>  >       Ingo
>>  >
>>  > ------------------>
>>
>>  [...]
>>
>>
>>  >       mm: allow reserve_bootmem() cross nodes
>>
>>  I find it sad that this goes in now.  I wrote a clean version of
>>  reserve_bootmem() [1] and it was rejected with arguments that I did not
>>  understand [2] and that were not further explained even though I asked
>>  for it [3].
>>
>>  http://lkml.org/lkml/2008/4/16/76
>>  http://lkml.org/lkml/2008/4/16/234
>>  http://lkml.org/lkml/2008/4/16/250
>>
>>  Your comment was rather unfair, because it gave the impression you did
>>  not read the thread before replying.  And you did not react to other
>>  explicit questions from me.  If you find my patches to be crap, say so
>>  and please explain WHY so I have a chance to improve.
>
> this thread is for reserve_bootmem ?

Sorry, my mistake.  I was referring to

5a982cbc7b3fe6cf72266f319286f29963c71b9e: mm: fix boundary checking in
free_bootmem_core

	Hannes

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] mm: node-setup agnostic free_bootmem()
  2008-04-28 16:54               ` Johannes Weiner
@ 2008-04-28 19:11                 ` Yinghai Lu
  2008-04-28 19:55                   ` Yinghai Lu
  0 siblings, 1 reply; 52+ messages in thread
From: Yinghai Lu @ 2008-04-28 19:11 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Linus Torvalds, linux-kernel, Andrew Morton,
	Thomas Gleixner, H. Peter Anvin, jbarnes, Siddha, Suresh B

On Mon, Apr 28, 2008 at 9:54 AM, Johannes Weiner <hannes@saeurebad.de> wrote:
> Hi Yinghai,
>
>
>
>  "Yinghai Lu" <yhlu.kernel@gmail.com> writes:
>
>  > On Sun, Apr 27, 2008 at 5:40 PM, Ingo Molnar <mingo@elte.hu> wrote:
>  >>
>  >>  * Johannes Weiner <hannes@saeurebad.de> wrote:
>  >>
>  >>  > > so i very much agree that your changes are cleaner, i just wanted to
>  >>  > > have one that has all the fixes included.
>  >>  >
>  >>  > I had planned this to be another patch because there are more then one
>  >>  > boundary check I wanted to tighten.  I can merge them though if you
>  >>  > like.
>  >>
>  >>  no, better to have them in separate patches.
>  >>
>  >>  > > Would you like to post a patch against current -git or should i
>  >>  > > extract the cleaner reserve_bootmem() from your previous patch?
>  >>  >
>  >>  > I just moved and have only sporadic internet access and free time
>  >>  > slots available.  Would be nice if you could do it!
>  >>
>  >>  sure, find the merged patch below, against latest -git, boot-tested on
>  >>  x86. Is this what you had in mind?
>  >>
>  >>         Ingo
>  >>
>  >>  ---------------->
>  >>  Subject: mm: node-setup agnostic free_bootmem()
>  >>  From: Johannes Weiner <hannes@saeurebad.de>
>  >>  Date: Wed, 16 Apr 2008 13:36:31 +0200
>  >>
>  >>  Make free_bootmem() look up the node holding the specified address
>  >>  range which lets it work transparently on single-node and multi-node
>  >>  configurations.
>  >>
>  >>  If the address range exceeds the node range, it well be marked free
>  >>  across node boundaries, too.
>  >>
>  >>  Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
>  >>  CC: Andi Kleen <andi@firstfloor.org>
>  >>  CC: Yinghai Lu <yhlu.kernel@gmail.com>
>  >>  CC: Yasunori Goto <y-goto@jp.fujitsu.com>
>  >>  CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>  >>  CC: Christoph Lameter <clameter@sgi.com>
>  >>  CC: Andrew Morton <akpm@linux-foundation.org>
>  >>  Signed-off-by: Ingo Molnar <mingo@elte.hu>
>  >>  ---
>  >>   mm/bootmem.c |   27 +++++++++++++++++++++++++--
>  >>   1 file changed, 25 insertions(+), 2 deletions(-)
>  >>
>  >>  Index: linux-x86.q/mm/bootmem.c
>  >>  ===================================================================
>  >>  --- linux-x86.q.orig/mm/bootmem.c
>  >>  +++ linux-x86.q/mm/bootmem.c
>  >>  @@ -493,8 +493,31 @@ int __init reserve_bootmem(unsigned long
>  >>   void __init free_bootmem(unsigned long addr, unsigned long size)
>  >>   {
>  >>         bootmem_data_t *bdata;
>  >>  -       list_for_each_entry(bdata, &bdata_list, list)
>  >>  -               free_bootmem_core(bdata, addr, size);
>  >>  +       unsigned long pos = addr;
>  >>  +       unsigned long partsize = size;
>  >>  +
>  >>  +       list_for_each_entry(bdata, &bdata_list, list) {
>  >>  +               unsigned long remainder = 0;
>  >>  +
>  >>  +               if (pos < bdata->node_boot_start)
>  >>  +                       continue;
>  >>  +
>  >>  +               if (PFN_DOWN(pos + partsize) > bdata->node_low_pfn) {
>  >>  +                       remainder = PFN_DOWN(pos + partsize) - bdata->node_low_pfn;
>  >>  +                       partsize -= remainder;
>  >>  +               }
>  >>  +
>  >>  +               free_bootmem_core(bdata, pos, partsize);
>  >>  +
>  >>  +               if (!remainder)
>  >>  +                       return;
>  >>  +
>  >>  +               pos = PFN_PHYS(bdata->node_low_pfn + 1);
>  >>  +       }
>  >>  +       printk(KERN_ERR "free_bootmem: request: addr=%lx, size=%lx, "
>  >>  +                       "state: pos=%lx, partsize=%lx\n", addr, size,
>  >>  +                       pos, partsize);
>  >>  +       BUG();
>  >>   }
>  >>
>  >>   unsigned long __init free_all_bootmem(void)
>  >>
>  >
>  > it will not work with cross nodes.
>  >
>  > for example: node 0: 0-2g, 4-6g, node1: 2-4g, 6-8g.
>  > and if ramdisk sit cross 2G boundary. you will only free the range
>  > before 2g.
>
>  Yes, you stated that several times but this is not a technical argument:
>  These setups are afaik not yet supported by the kernel at all.  And you
>  could not explain the node layout with the patch that implements support
>  for these configurations.

I looked at Suresh's patch, and it still only has one bdata for one node.

YH

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] mm: node-setup agnostic free_bootmem()
  2008-04-28 19:11                 ` Yinghai Lu
@ 2008-04-28 19:55                   ` Yinghai Lu
  2008-04-30 10:50                     ` Johannes Weiner
  0 siblings, 1 reply; 52+ messages in thread
From: Yinghai Lu @ 2008-04-28 19:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Linus Torvalds, linux-kernel, Andrew Morton,
	Thomas Gleixner, H. Peter Anvin, jbarnes, Siddha, Suresh B

On Mon, Apr 28, 2008 at 12:11 PM, Yinghai Lu <yhlu.kernel@gmail.com> wrote:
>
> On Mon, Apr 28, 2008 at 9:54 AM, Johannes Weiner <hannes@saeurebad.de> wrote:
>  > Hi Yinghai,
>  >
>  >
>  >
>  >  "Yinghai Lu" <yhlu.kernel@gmail.com> writes:
>  >
>  >  > On Sun, Apr 27, 2008 at 5:40 PM, Ingo Molnar <mingo@elte.hu> wrote:
>  >  >>
>  >  >>  * Johannes Weiner <hannes@saeurebad.de> wrote:
>  >  >>
>  >  >>  > > so i very much agree that your changes are cleaner, i just wanted to
>  >  >>  > > have one that has all the fixes included.
>  >  >>  >
>  >  >>  > I had planned this to be another patch because there are more then one
>  >  >>  > boundary check I wanted to tighten.  I can merge them though if you
>  >  >>  > like.
>  >  >>
>  >  >>  no, better to have them in separate patches.
>  >  >>
>  >  >>  > > Would you like to post a patch against current -git or should i
>  >  >>  > > extract the cleaner reserve_bootmem() from your previous patch?
>  >  >>  >
>  >  >>  > I just moved and have only sporadic internet access and free time
>  >  >>  > slots available.  Would be nice if you could do it!
>  >  >>
>  >  >>  sure, find the merged patch below, against latest -git, boot-tested on
>  >  >>  x86. Is this what you had in mind?
>  >  >>
>  >  >>         Ingo
>  >  >>
>  >  >>  ---------------->
>  >  >>  Subject: mm: node-setup agnostic free_bootmem()
>  >  >>  From: Johannes Weiner <hannes@saeurebad.de>
>  >  >>  Date: Wed, 16 Apr 2008 13:36:31 +0200
>  >  >>
>  >  >>  Make free_bootmem() look up the node holding the specified address
>  >  >>  range which lets it work transparently on single-node and multi-node
>  >  >>  configurations.
>  >  >>
>  >  >>  If the address range exceeds the node range, it well be marked free
>  >  >>  across node boundaries, too.
>  >  >>
>  >  >>  Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
>  >  >>  CC: Andi Kleen <andi@firstfloor.org>
>  >  >>  CC: Yinghai Lu <yhlu.kernel@gmail.com>
>  >  >>  CC: Yasunori Goto <y-goto@jp.fujitsu.com>
>  >  >>  CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>  >  >>  CC: Christoph Lameter <clameter@sgi.com>
>  >  >>  CC: Andrew Morton <akpm@linux-foundation.org>
>  >  >>  Signed-off-by: Ingo Molnar <mingo@elte.hu>
>  >  >>  ---
>  >  >>   mm/bootmem.c |   27 +++++++++++++++++++++++++--
>  >  >>   1 file changed, 25 insertions(+), 2 deletions(-)
>  >  >>
>  >  >>  Index: linux-x86.q/mm/bootmem.c
>  >  >>  ===================================================================
>  >  >>  --- linux-x86.q.orig/mm/bootmem.c
>  >  >>  +++ linux-x86.q/mm/bootmem.c
>  >  >>  @@ -493,8 +493,31 @@ int __init reserve_bootmem(unsigned long
>  >  >>   void __init free_bootmem(unsigned long addr, unsigned long size)
>  >  >>   {
>  >  >>         bootmem_data_t *bdata;
>  >  >>  -       list_for_each_entry(bdata, &bdata_list, list)
>  >  >>  -               free_bootmem_core(bdata, addr, size);
>  >  >>  +       unsigned long pos = addr;
>  >  >>  +       unsigned long partsize = size;
>  >  >>  +
>  >  >>  +       list_for_each_entry(bdata, &bdata_list, list) {
>  >  >>  +               unsigned long remainder = 0;
>  >  >>  +
>  >  >>  +               if (pos < bdata->node_boot_start)
>  >  >>  +                       continue;
>  >  >>  +
>  >  >>  +               if (PFN_DOWN(pos + partsize) > bdata->node_low_pfn) {
>  >  >>  +                       remainder = PFN_DOWN(pos + partsize) - bdata->node_low_pfn;
>  >  >>  +                       partsize -= remainder;
>  >  >>  +               }
>  >  >>  +
>  >  >>  +               free_bootmem_core(bdata, pos, partsize);
>  >  >>  +
>  >  >>  +               if (!remainder)
>  >  >>  +                       return;
>  >  >>  +
>  >  >>  +               pos = PFN_PHYS(bdata->node_low_pfn + 1);
>  >  >>  +       }
>  >  >>  +       printk(KERN_ERR "free_bootmem: request: addr=%lx, size=%lx, "
>  >  >>  +                       "state: pos=%lx, partsize=%lx\n", addr, size,
>  >  >>  +                       pos, partsize);
>  >  >>  +       BUG();
>  >  >>   }
>  >  >>
>  >  >>   unsigned long __init free_all_bootmem(void)
>  >  >>
>  >  >
>  >  > it will not work with cross nodes.
>  >  >
>  >  > for example: node 0: 0-2g, 4-6g, node1: 2-4g, 6-8g.
>  >  > and if ramdisk sit cross 2G boundary. you will only free the range
>  >  > before 2g.
>  >
>  >  Yes, you stated that several times but this is not a technical argument:
>  >  These setups are afaik not yet supported by the kernel at all.  And you
>  >  could not explain the node layout with the patch that implements support
>  >  for these configurations.
>
>  I looked at Suresh's patch, and it still only has one bdata for one node.

Suresh's patch already in the Linus tree.
commit 6ec6e0d9f2fd7cb6ca6bc3bfab5ae7b5cdd8c36f
Author: Suresh Siddha <suresh.b.siddha@intel.com>
Date:   Tue Mar 25 10:14:35 2008 -0700

    srat, x86: add support for nodes spanning other nodes

    For example, If the physical address layout on a two node system with 8 GB
    memory is something like:
    node 0: 0-2GB, 4-6GB
    node 1: 2-4GB, 6-8GB

    Current kernels fail to boot/detect this NUMA topology.

    ACPI SRAT tables can expose such a topology which needs to be supported.

    Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


YH

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [git pull] "big box" x86 changes, PCI
  2008-04-26 21:55   ` [git pull] "big box" x86 changes, PCI Ingo Molnar
  2008-04-27 16:30     ` Jesse Barnes
@ 2008-04-28 20:34     ` Jesse Barnes
  2008-04-28 22:53       ` Yinghai Lu
  2008-04-28 23:27       ` [PATCH] x86/pci: remove flag in pci_cfg_space_size_ext Yinghai Lu
  1 sibling, 2 replies; 52+ messages in thread
From: Jesse Barnes @ 2008-04-28 20:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, Andrew Morton, Thomas Gleixner,
	H. Peter Anvin, Yinghai Lu, Yinghai Lu

On Saturday, April 26, 2008 2:55 pm Ingo Molnar wrote:> @@ -184,51 +322,80 @@ static void __init pci_mmcfg_reject_broken(int type)>  >         cfg = &pci_mmcfg_config[0];>  > -       /*> -        * Handle more broken MCFG tables on Asus etc.> -        * They only contain a single entry for bus 0-0.> -        */> -       if (pci_mmcfg_config_num == 1 &&> -           cfg->pci_segment == 0 &&> -           (cfg->start_bus_number | cfg->end_bus_number) == 0) {> -               printk(KERN_ERR "PCI: start and end of bus number is 0. "> -                      "Rejected as broken MCFG.\n");> -               goto reject;> +       for (i = 0; i < pci_mmcfg_config_num; i++) {> +               int valid = 0;> +               u32 size = (cfg->end_bus_number + 1) << 20;> +               cfg = &pci_mmcfg_config[i];> +               printk(KERN_NOTICE "PCI: MCFG configuration %d: base %lx "> +                      "segment %hu buses %u - %u\n",> +                      i, (unsigned long)cfg->address, cfg->pci_segment,> +                      (unsigned int)cfg->start_bus_number,> +                      (unsigned int)cfg->end_bus_number);> +> +               if (!early &&> +                   is_acpi_reserved(cfg->address, cfg->address + size -> 1)) { +                       printk(KERN_NOTICE "PCI: MCFG area at %Lx> reserved " +                              "in ACPI motherboard> resources\n",> +                              cfg->address);> +                       valid = 1;> +               }> +> +               if (valid)> +                       continue;> +> +               if (!early)> +                       printk(KERN_ERR "PCI: BIOS Bug: MCFG area at %Lx is> not" +                              " reserved in ACPI motherboard> resources\n", +                              cfg->address);> +               /* Don't try to do this check unless configuration> +                  type 1 is available. how about type 2 ?*/> +               if (raw_pci_ops && e820_all_mapped(cfg->address,> +                                                 cfg->address + size - 1,> +                                                 E820_RESERVED)) {> +                       printk(KERN_NOTICE> +                              "PCI: MCFG area at %Lx reserved in E820\n",> +                              cfg->address);> +                       valid = 1;> +               }> +> +               if (!valid)> +                       goto reject;>         }
This loop is a bit messy, is there some way of making it clearer?  Maybe the early vs. late stuff should be split into separate routines entirely...
> @@ -842,11 +842,14 @@ static void set_pcie_port_type(struct pci_dev *pdev)>   * reading the dword at 0x100 which must either be 0 or a valid extended>   * capability header.>   */> -int pci_cfg_space_size(struct pci_dev *dev)> +int pci_cfg_space_size_ext(struct pci_dev *dev, unsigned check_exp_pcix)>  {>         int pos;>         u32 status;>  > +       if (!check_exp_pcix)> +               goto skip;> +
Rather than adding a flag to pci_cfg_space_size, you could either factor out the extended space probe into a separate routine and use it from both pci_cfg_space_size and the fixup code, or just make the fixup code do the probe & cfg_size setting by hand, moving the PCI_CFG_SPACE_SIZE and PCI_CFG_SPACE_EXP_SIZE to pci.h.
In both cases I'm just trying to avoid having a flag you pass to a routine that changes its behavior significantly enough that a new function would make things more readable.
Other than that, things look pretty good.  And the resulting kernel boots fine on my test box, which is nice...
Thanks,Jesseÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [git pull] "big box" x86 changes, PCI
  2008-04-28 20:34     ` Jesse Barnes
@ 2008-04-28 22:53       ` Yinghai Lu
  2008-04-28 23:27       ` [PATCH] x86/pci: remove flag in pci_cfg_space_size_ext Yinghai Lu
  1 sibling, 0 replies; 52+ messages in thread
From: Yinghai Lu @ 2008-04-28 22:53 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Ingo Molnar, Linus Torvalds, linux-kernel, Andrew Morton,
	Thomas Gleixner, H. Peter Anvin

On Mon, Apr 28, 2008 at 1:34 PM, Jesse Barnes <jbarnes@virtuousgeek.org> wrote:
> On Saturday, April 26, 2008 2:55 pm Ingo Molnar wrote:
>
>
> > @@ -184,51 +322,80 @@ static void __init pci_mmcfg_reject_broken(int type)
>  >
>  >         cfg = &pci_mmcfg_config[0];
>  >
>  > -       /*
>  > -        * Handle more broken MCFG tables on Asus etc.
>  > -        * They only contain a single entry for bus 0-0.
>  > -        */
>  > -       if (pci_mmcfg_config_num == 1 &&
>  > -           cfg->pci_segment == 0 &&
>  > -           (cfg->start_bus_number | cfg->end_bus_number) == 0) {
>  > -               printk(KERN_ERR "PCI: start and end of bus number is 0. "
>  > -                      "Rejected as broken MCFG.\n");
>  > -               goto reject;
>  > +       for (i = 0; i < pci_mmcfg_config_num; i++) {
>  > +               int valid = 0;
>  > +               u32 size = (cfg->end_bus_number + 1) << 20;
>  > +               cfg = &pci_mmcfg_config[i];
>  > +               printk(KERN_NOTICE "PCI: MCFG configuration %d: base %lx "
>  > +                      "segment %hu buses %u - %u\n",
>  > +                      i, (unsigned long)cfg->address, cfg->pci_segment,
>  > +                      (unsigned int)cfg->start_bus_number,
>  > +                      (unsigned int)cfg->end_bus_number);
>  > +
>  > +               if (!early &&
>  > +                   is_acpi_reserved(cfg->address, cfg->address + size -
>  > 1)) { +                       printk(KERN_NOTICE "PCI: MCFG area at %Lx
>  > reserved " +                              "in ACPI motherboard
>  > resources\n",
>  > +                              cfg->address);
>  > +                       valid = 1;
>  > +               }
>  > +
>  > +               if (valid)
>  > +                       continue;
>  > +
>  > +               if (!early)
>  > +                       printk(KERN_ERR "PCI: BIOS Bug: MCFG area at %Lx is
>  > not" +                              " reserved in ACPI motherboard
>  > resources\n", +                              cfg->address);
>  > +               /* Don't try to do this check unless configuration
>  > +                  type 1 is available. how about type 2 ?*/
>  > +               if (raw_pci_ops && e820_all_mapped(cfg->address,
>  > +                                                 cfg->address + size - 1,
>  > +                                                 E820_RESERVED)) {
>  > +                       printk(KERN_NOTICE
>  > +                              "PCI: MCFG area at %Lx reserved in E820\n",
>  > +                              cfg->address);
>  > +                       valid = 1;
>  > +               }
>  > +
>  > +               if (!valid)
>  > +                       goto reject;
>  >         }
>
>  This loop is a bit messy, is there some way of making it clearer?  Maybe the
>  early vs. late stuff should be split into separate routines entirely...

        for (i = 0; i < pci_mmcfg_config_num; i++) {
                int valid = 0;
                u32 size = (cfg->end_bus_number + 1) << 20;
                cfg = &pci_mmcfg_config[i];
                printk(KERN_NOTICE "PCI: MCFG configuration %d: base %lx "
                       "segment %hu buses %u - %u\n",
                       i, (unsigned long)cfg->address, cfg->pci_segment,
                       (unsigned int)cfg->start_bus_number,
                       (unsigned int)cfg->end_bus_number);

                if (!early &&
============================================> early check..
                    is_acpi_reserved(cfg->address, cfg->address + size - 1)) {
                        printk(KERN_NOTICE "PCI: MCFG area at %Lx reserved "
                               "in ACPI motherboard resources\n",
                               cfg->address);
                        valid = 1;
                }

                if (valid)
                        continue;

                if (!early)
=================================================> check early
                        printk(KERN_ERR "PCI: BIOS Bug: MCFG area at %Lx is not"
                               " reserved in ACPI motherboard resources\n",
                               cfg->address);
                /* Don't try to do this check unless configuration
                   type 1 is available. how about type 2 ?*/
                if (raw_pci_ops && e820_all_mapped(cfg->address,
                                                  cfg->address + size - 1,
                                                  E820_RESERVED)) {
                        printk(KERN_NOTICE
                               "PCI: MCFG area at %Lx reserved in E820\n",
                               cfg->address);
                        valid = 1;
                }

                if (!valid)
                        goto reject;
        }

        return;

only two early check ... , if split that we will get almost same
duplicated lines.

>
>
>  > @@ -842,11 +842,14 @@ static void set_pcie_port_type(struct pci_dev *pdev)
>  >   * reading the dword at 0x100 which must either be 0 or a valid extended
>  >   * capability header.
>  >   */
>  > -int pci_cfg_space_size(struct pci_dev *dev)
>  > +int pci_cfg_space_size_ext(struct pci_dev *dev, unsigned check_exp_pcix)
>  >  {
>  >         int pos;
>  >         u32 status;
>  >
>  > +       if (!check_exp_pcix)
>  > +               goto skip;
>  > +
>
>  Rather than adding a flag to pci_cfg_space_size, you could either factor out
>  the extended space probe into a separate routine and use it from both
>  pci_cfg_space_size and the fixup code, or just make the fixup code do the
>  probe & cfg_size setting by hand, moving the PCI_CFG_SPACE_SIZE and
>  PCI_CFG_SPACE_EXP_SIZE to pci.h.

by hand, will need to check if mmconf is enabled or not.
will check if can factor out it.

YH

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH] x86/pci: remove flag in pci_cfg_space_size_ext
  2008-04-28 20:34     ` Jesse Barnes
  2008-04-28 22:53       ` Yinghai Lu
@ 2008-04-28 23:27       ` Yinghai Lu
  2008-04-29 16:14         ` Jesse Barnes
  1 sibling, 1 reply; 52+ messages in thread
From: Yinghai Lu @ 2008-04-28 23:27 UTC (permalink / raw)
  To: Jesse Barnes, Ingo Molnar, linux-kernel, Andrew Morton,
	Thomas Gleixner, H. Peter Anvin
  Cc: Linus Torvalds


so let pci_cfg_space_size call it directly without flag.

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>

diff --git a/arch/x86/pci/fixup.c b/arch/x86/pci/fixup.c
index b60b2ab..ff3a6a3 100644
--- a/arch/x86/pci/fixup.c
+++ b/arch/x86/pci/fixup.c
@@ -502,7 +502,7 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_SIEMENS, 0x0015,
  */
 static void fam10h_pci_cfg_space_size(struct pci_dev *dev)
 {
-	dev->cfg_size = pci_cfg_space_size_ext(dev, 0);
+	dev->cfg_size = pci_cfg_space_size_ext(dev);
 }
 
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_AMD, 0x1200, fam10h_pci_cfg_space_size);
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 4a55bf3..3706ce7 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -842,13 +842,25 @@ static void set_pcie_port_type(struct pci_dev *pdev)
  * reading the dword at 0x100 which must either be 0 or a valid extended
  * capability header.
  */
-int pci_cfg_space_size_ext(struct pci_dev *dev, unsigned check_exp_pcix)
+int pci_cfg_space_size_ext(struct pci_dev *dev)
 {
-	int pos;
 	u32 status;
 
-	if (!check_exp_pcix)
-		goto skip;
+	if (pci_read_config_dword(dev, 256, &status) != PCIBIOS_SUCCESSFUL)
+		goto fail;
+	if (status == 0xffffffff)
+		goto fail;
+
+	return PCI_CFG_SPACE_EXP_SIZE;
+
+ fail:
+	return PCI_CFG_SPACE_SIZE;
+}
+
+int pci_cfg_space_size(struct pci_dev *dev)
+{
+	int pos;
+	u32 status;
 
 	pos = pci_find_capability(dev, PCI_CAP_ID_EXP);
 	if (!pos) {
@@ -861,23 +873,12 @@ int pci_cfg_space_size_ext(struct pci_dev *dev, unsigned check_exp_pcix)
 			goto fail;
 	}
 
- skip:
-	if (pci_read_config_dword(dev, 256, &status) != PCIBIOS_SUCCESSFUL)
-		goto fail;
-	if (status == 0xffffffff)
-		goto fail;
-
-	return PCI_CFG_SPACE_EXP_SIZE;
+	return pci_cfg_space_size_ext(dev);
 
  fail:
 	return PCI_CFG_SPACE_SIZE;
 }
 
-int pci_cfg_space_size(struct pci_dev *dev)
-{
-	return pci_cfg_space_size_ext(dev, 1);
-}
-
 static void pci_release_bus_bridge_dev(struct device *dev)
 {
 	kfree(dev);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 8f53f4b..e9b4fcb 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -667,7 +667,7 @@ int pci_scan_bridge(struct pci_bus *bus, struct pci_dev *dev, int max,
 
 void pci_walk_bus(struct pci_bus *top, void (*cb)(struct pci_dev *, void *),
 		  void *userdata);
-int pci_cfg_space_size_ext(struct pci_dev *dev, unsigned check_exp_pcix);
+int pci_cfg_space_size_ext(struct pci_dev *dev);
 int pci_cfg_space_size(struct pci_dev *dev);
 unsigned char pci_bus_max_busnr(struct pci_bus *bus);
 

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [patch] mm: node-setup agnostic free_bootmem()
  2008-04-28 16:49             ` Johannes Weiner
@ 2008-04-29 14:25               ` Ingo Molnar
  2008-04-30 10:52                 ` Johannes Weiner
  0 siblings, 1 reply; 52+ messages in thread
From: Ingo Molnar @ 2008-04-29 14:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linus Torvalds, linux-kernel, Andrew Morton, Thomas Gleixner,
	H. Peter Anvin, Yinghai Lu, Yinghai Lu, jbarnes


* Johannes Weiner <hannes@saeurebad.de> wrote:

> >  void __init free_bootmem(unsigned long addr, unsigned long size)
> >  {
> >  	bootmem_data_t *bdata;
> > -	list_for_each_entry(bdata, &bdata_list, list)
> > -		free_bootmem_core(bdata, addr, size);
> > +	unsigned long pos = addr;
> > +	unsigned long partsize = size;
> > +
> > +	list_for_each_entry(bdata, &bdata_list, list) {
> > +		unsigned long remainder = 0;
> > +
> > +		if (pos < bdata->node_boot_start)
> > +			continue;
> > +
> > +		if (PFN_DOWN(pos + partsize) > bdata->node_low_pfn) {
> > +			remainder = PFN_DOWN(pos + partsize) - bdata->node_low_pfn;
> > +			partsize -= remainder;
> > +		}
> > +
> > +		free_bootmem_core(bdata, pos, partsize);
> > +
> > +		if (!remainder)
> > +			return;
> > +
> > +		pos = PFN_PHYS(bdata->node_low_pfn + 1);
> > +	}
> > +	printk(KERN_ERR "free_bootmem: request: addr=%lx, size=%lx, "
> > +			"state: pos=%lx, partsize=%lx\n", addr, size,
> > +			pos, partsize);
> > +	BUG();
> >  }
> >  
> >  unsigned long __init free_all_bootmem(void)
> 
> Yes, looks good.  But needs explicit testing, I guess.

yep, but as Yinghai Lu has pointed it out, this removes a cross-node 
allocation fix. That fix has to be preserved in any cleanup, agreed?

in general bootmem should assume the weirdest of NUMA topologies and be 
defensive about them. Topologies will only become more complex, never 
less complex.

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] x86/pci: remove flag in pci_cfg_space_size_ext
  2008-04-28 23:27       ` [PATCH] x86/pci: remove flag in pci_cfg_space_size_ext Yinghai Lu
@ 2008-04-29 16:14         ` Jesse Barnes
  2008-04-29 22:05           ` Ingo Molnar
  0 siblings, 1 reply; 52+ messages in thread
From: Jesse Barnes @ 2008-04-29 16:14 UTC (permalink / raw)
  To: yhlu.kernel
  Cc: Ingo Molnar, linux-kernel, Andrew Morton, Thomas Gleixner,
	H. Peter Anvin, Linus Torvalds

On Monday, April 28, 2008 4:27 pm Yinghai Lu wrote:
> so let pci_cfg_space_size call it directly without flag.
>
> Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>

Thanks Yinghai.

Ingo, I assume you'll roll this in?

Jesse

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] x86/pci: remove flag in pci_cfg_space_size_ext
  2008-04-29 16:14         ` Jesse Barnes
@ 2008-04-29 22:05           ` Ingo Molnar
  2008-04-29 22:34             ` Jesse Barnes
  0 siblings, 1 reply; 52+ messages in thread
From: Ingo Molnar @ 2008-04-29 22:05 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: yhlu.kernel, linux-kernel, Andrew Morton, Thomas Gleixner,
	H. Peter Anvin, Linus Torvalds


* Jesse Barnes <jbarnes@virtuousgeek.org> wrote:

> On Monday, April 28, 2008 4:27 pm Yinghai Lu wrote:
> > so let pci_cfg_space_size call it directly without flag.
> >
> > Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
> 
> Thanks Yinghai.
> 
> Ingo, I assume you'll roll this in?

no, i sent the tested tree to Linus, which was unmodified from 3 days 
ago. Could you queue up this cleanup into the PCI tree please?

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] x86/pci: remove flag in pci_cfg_space_size_ext
  2008-04-29 22:05           ` Ingo Molnar
@ 2008-04-29 22:34             ` Jesse Barnes
  0 siblings, 0 replies; 52+ messages in thread
From: Jesse Barnes @ 2008-04-29 22:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: yhlu.kernel, linux-kernel, Andrew Morton, Thomas Gleixner,
	H. Peter Anvin, Linus Torvalds

On Tuesday, April 29, 2008 3:05 pm Ingo Molnar wrote:
> * Jesse Barnes <jbarnes@virtuousgeek.org> wrote:
> > On Monday, April 28, 2008 4:27 pm Yinghai Lu wrote:
> > > so let pci_cfg_space_size call it directly without flag.
> > >
> > > Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
> >
> > Thanks Yinghai.
> >
> > Ingo, I assume you'll roll this in?
>
> no, i sent the tested tree to Linus, which was unmodified from 3 days
> ago. Could you queue up this cleanup into the PCI tree please?

Sure, np.  Applied.  Thanks Yinghai.

Jesse

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] mm: node-setup agnostic free_bootmem()
  2008-04-28 19:55                   ` Yinghai Lu
@ 2008-04-30 10:50                     ` Johannes Weiner
  2008-04-30 16:22                       ` Yinghai Lu
  0 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2008-04-30 10:50 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Ingo Molnar, Linus Torvalds, linux-kernel, Andrew Morton,
	Thomas Gleixner, H. Peter Anvin, jbarnes, Siddha, Suresh B

Hi,

"Yinghai Lu" <yhlu.kernel@gmail.com> writes:

> On Mon, Apr 28, 2008 at 12:11 PM, Yinghai Lu <yhlu.kernel@gmail.com> wrote:
>>
>> On Mon, Apr 28, 2008 at 9:54 AM, Johannes Weiner <hannes@saeurebad.de> wrote:
>>  > Hi Yinghai,
>>  >
>>  >
>>  >
>>  >  "Yinghai Lu" <yhlu.kernel@gmail.com> writes:
>>  >
>>  >  > On Sun, Apr 27, 2008 at 5:40 PM, Ingo Molnar <mingo@elte.hu> wrote:
>>  >  >>
>>  >  >>  * Johannes Weiner <hannes@saeurebad.de> wrote:
>>  >  >>
>>  >  >>  > > so i very much agree that your changes are cleaner, i just wanted to
>>  >  >>  > > have one that has all the fixes included.
>>  >  >>  >
>>  >  >>  > I had planned this to be another patch because there are more then one
>>  >  >>  > boundary check I wanted to tighten.  I can merge them though if you
>>  >  >>  > like.
>>  >  >>
>>  >  >>  no, better to have them in separate patches.
>>  >  >>
>>  >  >>  > > Would you like to post a patch against current -git or should i
>>  >  >>  > > extract the cleaner reserve_bootmem() from your previous patch?
>>  >  >>  >
>>  >  >>  > I just moved and have only sporadic internet access and free time
>>  >  >>  > slots available.  Would be nice if you could do it!
>>  >  >>
>>  >  >>  sure, find the merged patch below, against latest -git, boot-tested on
>>  >  >>  x86. Is this what you had in mind?
>>  >  >>
>>  >  >>         Ingo
>>  >  >>
>>  >  >>  ---------------->
>>  >  >>  Subject: mm: node-setup agnostic free_bootmem()
>>  >  >>  From: Johannes Weiner <hannes@saeurebad.de>
>>  >  >>  Date: Wed, 16 Apr 2008 13:36:31 +0200
>>  >  >>
>>  >  >>  Make free_bootmem() look up the node holding the specified address
>>  >  >>  range which lets it work transparently on single-node and multi-node
>>  >  >>  configurations.
>>  >  >>
>>  >  >>  If the address range exceeds the node range, it well be marked free
>>  >  >>  across node boundaries, too.
>>  >  >>
>>  >  >>  Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
>>  >  >>  CC: Andi Kleen <andi@firstfloor.org>
>>  >  >>  CC: Yinghai Lu <yhlu.kernel@gmail.com>
>>  >  >>  CC: Yasunori Goto <y-goto@jp.fujitsu.com>
>>  >  >>  CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>  >  >>  CC: Christoph Lameter <clameter@sgi.com>
>>  >  >>  CC: Andrew Morton <akpm@linux-foundation.org>
>>  >  >>  Signed-off-by: Ingo Molnar <mingo@elte.hu>
>>  >  >>  ---
>>  >  >>   mm/bootmem.c |   27 +++++++++++++++++++++++++--
>>  >  >>   1 file changed, 25 insertions(+), 2 deletions(-)
>>  >  >>
>>  >  >>  Index: linux-x86.q/mm/bootmem.c
>>  >  >>  ===================================================================
>>  >  >>  --- linux-x86.q.orig/mm/bootmem.c
>>  >  >>  +++ linux-x86.q/mm/bootmem.c
>>  >  >>  @@ -493,8 +493,31 @@ int __init reserve_bootmem(unsigned long
>>  >  >>   void __init free_bootmem(unsigned long addr, unsigned long size)
>>  >  >>   {
>>  >  >>         bootmem_data_t *bdata;
>>  >  >>  -       list_for_each_entry(bdata, &bdata_list, list)
>>  >  >>  -               free_bootmem_core(bdata, addr, size);
>>  >  >>  +       unsigned long pos = addr;
>>  >  >>  +       unsigned long partsize = size;
>>  >  >>  +
>>  >  >>  +       list_for_each_entry(bdata, &bdata_list, list) {
>>  >  >>  +               unsigned long remainder = 0;
>>  >  >>  +
>>  >  >>  +               if (pos < bdata->node_boot_start)
>>  >  >>  +                       continue;
>>  >  >>  +
>>  >  >>  +               if (PFN_DOWN(pos + partsize) > bdata->node_low_pfn) {
>>  >  >>  +                       remainder = PFN_DOWN(pos + partsize) - bdata->node_low_pfn;
>>  >  >>  +                       partsize -= remainder;
>>  >  >>  +               }
>>  >  >>  +
>>  >  >>  +               free_bootmem_core(bdata, pos, partsize);
>>  >  >>  +
>>  >  >>  +               if (!remainder)
>>  >  >>  +                       return;
>>  >  >>  +
>>  >  >>  +               pos = PFN_PHYS(bdata->node_low_pfn + 1);
>>  >  >>  +       }
>>  >  >>  +       printk(KERN_ERR "free_bootmem: request: addr=%lx, size=%lx, "
>>  >  >>  +                       "state: pos=%lx, partsize=%lx\n", addr, size,
>>  >  >>  +                       pos, partsize);
>>  >  >>  +       BUG();
>>  >  >>   }
>>  >  >>
>>  >  >>   unsigned long __init free_all_bootmem(void)
>>  >  >>
>>  >  >
>>  >  > it will not work with cross nodes.
>>  >  >
>>  >  > for example: node 0: 0-2g, 4-6g, node1: 2-4g, 6-8g.
>>  >  > and if ramdisk sit cross 2G boundary. you will only free the range
>>  >  > before 2g.
>>  >
>>  >  Yes, you stated that several times but this is not a technical argument:
>>  >  These setups are afaik not yet supported by the kernel at all.  And you
>>  >  could not explain the node layout with the patch that implements support
>>  >  for these configurations.
>>
>>  I looked at Suresh's patch, and it still only has one bdata for one node.
>
> Suresh's patch already in the Linus tree.
> commit 6ec6e0d9f2fd7cb6ca6bc3bfab5ae7b5cdd8c36f
> Author: Suresh Siddha <suresh.b.siddha@intel.com>
> Date:   Tue Mar 25 10:14:35 2008 -0700
>
>     srat, x86: add support for nodes spanning other nodes
>
>     For example, If the physical address layout on a two node system with 8 GB
>     memory is something like:
>     node 0: 0-2GB, 4-6GB
>     node 1: 2-4GB, 6-8GB
>
>     Current kernels fail to boot/detect this NUMA topology.
>
>     ACPI SRAT tables can expose such a topology which needs to be supported.
>
>     Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
>     Signed-off-by: Ingo Molnar <mingo@elte.hu>
>     Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Okay, so we have one bdata for node 0 and one for node 1. Does that mean
that both have overlapping pfn ranges?

[1    |||||     ]
     [2    |||||     ]

Like this?  How are the ||||| represented in the bootmem maps of each bdata?

	Hannes

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] mm: node-setup agnostic free_bootmem()
  2008-04-29 14:25               ` Ingo Molnar
@ 2008-04-30 10:52                 ` Johannes Weiner
  0 siblings, 0 replies; 52+ messages in thread
From: Johannes Weiner @ 2008-04-30 10:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, Andrew Morton, Thomas Gleixner,
	H. Peter Anvin, Yinghai Lu, Yinghai Lu, jbarnes

Hi,

Ingo Molnar <mingo@elte.hu> writes:

> * Johannes Weiner <hannes@saeurebad.de> wrote:
>
>> >  void __init free_bootmem(unsigned long addr, unsigned long size)
>> >  {
>> >  	bootmem_data_t *bdata;
>> > -	list_for_each_entry(bdata, &bdata_list, list)
>> > -		free_bootmem_core(bdata, addr, size);
>> > +	unsigned long pos = addr;
>> > +	unsigned long partsize = size;
>> > +
>> > +	list_for_each_entry(bdata, &bdata_list, list) {
>> > +		unsigned long remainder = 0;
>> > +
>> > +		if (pos < bdata->node_boot_start)
>> > +			continue;
>> > +
>> > +		if (PFN_DOWN(pos + partsize) > bdata->node_low_pfn) {
>> > +			remainder = PFN_DOWN(pos + partsize) - bdata->node_low_pfn;
>> > +			partsize -= remainder;
>> > +		}
>> > +
>> > +		free_bootmem_core(bdata, pos, partsize);
>> > +
>> > +		if (!remainder)
>> > +			return;
>> > +
>> > +		pos = PFN_PHYS(bdata->node_low_pfn + 1);
>> > +	}
>> > +	printk(KERN_ERR "free_bootmem: request: addr=%lx, size=%lx, "
>> > +			"state: pos=%lx, partsize=%lx\n", addr, size,
>> > +			pos, partsize);
>> > +	BUG();
>> >  }
>> >  
>> >  unsigned long __init free_all_bootmem(void)
>> 
>> Yes, looks good.  But needs explicit testing, I guess.
>
> yep, but as Yinghai Lu has pointed it out, this removes a cross-node 
> allocation fix. That fix has to be preserved in any cleanup, agreed?

Yes, if Yinghai is right, my patch should be dropped, of course.

> in general bootmem should assume the weirdest of NUMA topologies and be 
> defensive about them. Topologies will only become more complex, never 
> less complex.

Okay.

	Hannes

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] mm: node-setup agnostic free_bootmem()
  2008-04-30 10:50                     ` Johannes Weiner
@ 2008-04-30 16:22                       ` Yinghai Lu
  2008-04-30 17:52                         ` Johannes Weiner
  0 siblings, 1 reply; 52+ messages in thread
From: Yinghai Lu @ 2008-04-30 16:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Linus Torvalds, linux-kernel, Andrew Morton,
	Thomas Gleixner, H. Peter Anvin, jbarnes, Siddha, Suresh B

On Wed, Apr 30, 2008 at 3:50 AM, Johannes Weiner <hannes@saeurebad.de> wrote:
>
> Hi,
>
>  "Yinghai Lu" <yhlu.kernel@gmail.com> writes:
>
>  > On Mon, Apr 28, 2008 at 12:11 PM, Yinghai Lu <yhlu.kernel@gmail.com> wrote:
>  >>
>  >> On Mon, Apr 28, 2008 at 9:54 AM, Johannes Weiner <hannes@saeurebad.de> wrote:
>  >>  > Hi Yinghai,
>  >>  >
>  >>  >
>  >>  >
>  >>  >  "Yinghai Lu" <yhlu.kernel@gmail.com> writes:
>  >>  >
>  >>  >  > On Sun, Apr 27, 2008 at 5:40 PM, Ingo Molnar <mingo@elte.hu> wrote:
>  >>  >  >>
>  >>  >  >>  * Johannes Weiner <hannes@saeurebad.de> wrote:
>  >>  >  >>
>  >>  >  >>  > > so i very much agree that your changes are cleaner, i just wanted to
>  >>  >  >>  > > have one that has all the fixes included.
>  >>  >  >>  >
>  >>  >  >>  > I had planned this to be another patch because there are more then one
>  >>  >  >>  > boundary check I wanted to tighten.  I can merge them though if you
>  >>  >  >>  > like.
>  >>  >  >>
>  >>  >  >>  no, better to have them in separate patches.
>  >>  >  >>
>  >>  >  >>  > > Would you like to post a patch against current -git or should i
>  >>  >  >>  > > extract the cleaner reserve_bootmem() from your previous patch?
>  >>  >  >>  >
>  >>  >  >>  > I just moved and have only sporadic internet access and free time
>  >>  >  >>  > slots available.  Would be nice if you could do it!
>  >>  >  >>
>  >>  >  >>  sure, find the merged patch below, against latest -git, boot-tested on
>  >>  >  >>  x86. Is this what you had in mind?
>  >>  >  >>
>  >>  >  >>         Ingo
>  >>  >  >>
>  >>  >  >>  ---------------->
>  >>  >  >>  Subject: mm: node-setup agnostic free_bootmem()
>  >>  >  >>  From: Johannes Weiner <hannes@saeurebad.de>
>  >>  >  >>  Date: Wed, 16 Apr 2008 13:36:31 +0200
>  >>  >  >>
>  >>  >  >>  Make free_bootmem() look up the node holding the specified address
>  >>  >  >>  range which lets it work transparently on single-node and multi-node
>  >>  >  >>  configurations.
>  >>  >  >>
>  >>  >  >>  If the address range exceeds the node range, it well be marked free
>  >>  >  >>  across node boundaries, too.
>  >>  >  >>
>  >>  >  >>  Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
>  >>  >  >>  CC: Andi Kleen <andi@firstfloor.org>
>  >>  >  >>  CC: Yinghai Lu <yhlu.kernel@gmail.com>
>  >>  >  >>  CC: Yasunori Goto <y-goto@jp.fujitsu.com>
>  >>  >  >>  CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>  >>  >  >>  CC: Christoph Lameter <clameter@sgi.com>
>  >>  >  >>  CC: Andrew Morton <akpm@linux-foundation.org>
>  >>  >  >>  Signed-off-by: Ingo Molnar <mingo@elte.hu>
>  >>  >  >>  ---
>  >>  >  >>   mm/bootmem.c |   27 +++++++++++++++++++++++++--
>  >>  >  >>   1 file changed, 25 insertions(+), 2 deletions(-)
>  >>  >  >>
>  >>  >  >>  Index: linux-x86.q/mm/bootmem.c
>  >>  >  >>  ===================================================================
>  >>  >  >>  --- linux-x86.q.orig/mm/bootmem.c
>  >>  >  >>  +++ linux-x86.q/mm/bootmem.c
>  >>  >  >>  @@ -493,8 +493,31 @@ int __init reserve_bootmem(unsigned long
>  >>  >  >>   void __init free_bootmem(unsigned long addr, unsigned long size)
>  >>  >  >>   {
>  >>  >  >>         bootmem_data_t *bdata;
>  >>  >  >>  -       list_for_each_entry(bdata, &bdata_list, list)
>  >>  >  >>  -               free_bootmem_core(bdata, addr, size);
>  >>  >  >>  +       unsigned long pos = addr;
>  >>  >  >>  +       unsigned long partsize = size;
>  >>  >  >>  +
>  >>  >  >>  +       list_for_each_entry(bdata, &bdata_list, list) {
>  >>  >  >>  +               unsigned long remainder = 0;
>  >>  >  >>  +
>  >>  >  >>  +               if (pos < bdata->node_boot_start)
>  >>  >  >>  +                       continue;
>  >>  >  >>  +
>  >>  >  >>  +               if (PFN_DOWN(pos + partsize) > bdata->node_low_pfn) {
>  >>  >  >>  +                       remainder = PFN_DOWN(pos + partsize) - bdata->node_low_pfn;
>  >>  >  >>  +                       partsize -= remainder;
>  >>  >  >>  +               }
>  >>  >  >>  +
>  >>  >  >>  +               free_bootmem_core(bdata, pos, partsize);
>  >>  >  >>  +
>  >>  >  >>  +               if (!remainder)
>  >>  >  >>  +                       return;
>  >>  >  >>  +
>  >>  >  >>  +               pos = PFN_PHYS(bdata->node_low_pfn + 1);
>  >>  >  >>  +       }
>  >>  >  >>  +       printk(KERN_ERR "free_bootmem: request: addr=%lx, size=%lx, "
>  >>  >  >>  +                       "state: pos=%lx, partsize=%lx\n", addr, size,
>  >>  >  >>  +                       pos, partsize);
>  >>  >  >>  +       BUG();
>  >>  >  >>   }
>  >>  >  >>
>  >>  >  >>   unsigned long __init free_all_bootmem(void)
>  >>  >  >>
>  >>  >  >
>  >>  >  > it will not work with cross nodes.
>  >>  >  >
>  >>  >  > for example: node 0: 0-2g, 4-6g, node1: 2-4g, 6-8g.
>  >>  >  > and if ramdisk sit cross 2G boundary. you will only free the range
>  >>  >  > before 2g.
>  >>  >
>  >>  >  Yes, you stated that several times but this is not a technical argument:
>  >>  >  These setups are afaik not yet supported by the kernel at all.  And you
>  >>  >  could not explain the node layout with the patch that implements support
>  >>  >  for these configurations.
>  >>
>  >>  I looked at Suresh's patch, and it still only has one bdata for one node.
>  >
>  > Suresh's patch already in the Linus tree.
>  > commit 6ec6e0d9f2fd7cb6ca6bc3bfab5ae7b5cdd8c36f
>  > Author: Suresh Siddha <suresh.b.siddha@intel.com>
>  > Date:   Tue Mar 25 10:14:35 2008 -0700
>  >
>  >     srat, x86: add support for nodes spanning other nodes
>  >
>  >     For example, If the physical address layout on a two node system with 8 GB
>  >     memory is something like:
>  >     node 0: 0-2GB, 4-6GB
>  >     node 1: 2-4GB, 6-8GB
>  >
>  >     Current kernels fail to boot/detect this NUMA topology.
>  >
>  >     ACPI SRAT tables can expose such a topology which needs to be supported.
>  >
>  >     Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
>  >     Signed-off-by: Ingo Molnar <mingo@elte.hu>
>  >     Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>
>  Okay, so we have one bdata for node 0 and one for node 1. Does that mean
>  that both have overlapping pfn ranges?
>
>  [1    |||||     ]
>      [2    |||||     ]
>
>  Like this?  How are the ||||| represented in the bootmem maps of each bdata?

Yes.

YH

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] mm: node-setup agnostic free_bootmem()
  2008-04-30 16:22                       ` Yinghai Lu
@ 2008-04-30 17:52                         ` Johannes Weiner
  2008-04-30 20:30                           ` Yinghai Lu
  0 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2008-04-30 17:52 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Ingo Molnar, Linus Torvalds, linux-kernel, Andrew Morton,
	Thomas Gleixner, H. Peter Anvin, jbarnes, Siddha, Suresh B

Hi,

"Yinghai Lu" <yhlu.kernel@gmail.com> writes:

> On Wed, Apr 30, 2008 at 3:50 AM, Johannes Weiner <hannes@saeurebad.de> wrote:
>>
>> Hi,
>>
>>  "Yinghai Lu" <yhlu.kernel@gmail.com> writes:
>>
>>  > On Mon, Apr 28, 2008 at 12:11 PM, Yinghai Lu <yhlu.kernel@gmail.com> wrote:
>>  >>
>>  >> On Mon, Apr 28, 2008 at 9:54 AM, Johannes Weiner <hannes@saeurebad.de> wrote:
>>  >>  > Hi Yinghai,
>>  >>  >
>>  >>  >
>>  >>  >
>>  >>  >  "Yinghai Lu" <yhlu.kernel@gmail.com> writes:
>>  >>  >
>>  >>  >  > On Sun, Apr 27, 2008 at 5:40 PM, Ingo Molnar <mingo@elte.hu> wrote:
>>  >>  >  >>
>>  >>  >  >>  * Johannes Weiner <hannes@saeurebad.de> wrote:
>>  >>  >  >>
>>  >>  >  >>  > > so i very much agree that your changes are cleaner, i just wanted to
>>  >>  >  >>  > > have one that has all the fixes included.
>>  >>  >  >>  >
>>  >>  >  >>  > I had planned this to be another patch because there are more then one
>>  >>  >  >>  > boundary check I wanted to tighten.  I can merge them though if you
>>  >>  >  >>  > like.
>>  >>  >  >>
>>  >>  >  >>  no, better to have them in separate patches.
>>  >>  >  >>
>>  >>  >  >>  > > Would you like to post a patch against current -git or should i
>>  >>  >  >>  > > extract the cleaner reserve_bootmem() from your previous patch?
>>  >>  >  >>  >
>>  >>  >  >>  > I just moved and have only sporadic internet access and free time
>>  >>  >  >>  > slots available.  Would be nice if you could do it!
>>  >>  >  >>
>>  >>  >  >>  sure, find the merged patch below, against latest -git, boot-tested on
>>  >>  >  >>  x86. Is this what you had in mind?
>>  >>  >  >>
>>  >>  >  >>         Ingo
>>  >>  >  >>
>>  >>  >  >>  ---------------->
>>  >>  >  >>  Subject: mm: node-setup agnostic free_bootmem()
>>  >>  >  >>  From: Johannes Weiner <hannes@saeurebad.de>
>>  >>  >  >>  Date: Wed, 16 Apr 2008 13:36:31 +0200
>>  >>  >  >>
>>  >>  >  >>  Make free_bootmem() look up the node holding the specified address
>>  >>  >  >>  range which lets it work transparently on single-node and multi-node
>>  >>  >  >>  configurations.
>>  >>  >  >>
>>  >>  >  >>  If the address range exceeds the node range, it well be marked free
>>  >>  >  >>  across node boundaries, too.
>>  >>  >  >>
>>  >>  >  >>  Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
>>  >>  >  >>  CC: Andi Kleen <andi@firstfloor.org>
>>  >>  >  >>  CC: Yinghai Lu <yhlu.kernel@gmail.com>
>>  >>  >  >>  CC: Yasunori Goto <y-goto@jp.fujitsu.com>
>>  >>  >  >>  CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>  >>  >  >>  CC: Christoph Lameter <clameter@sgi.com>
>>  >>  >  >>  CC: Andrew Morton <akpm@linux-foundation.org>
>>  >>  >  >>  Signed-off-by: Ingo Molnar <mingo@elte.hu>
>>  >>  >  >>  ---
>>  >>  >  >>   mm/bootmem.c |   27 +++++++++++++++++++++++++--
>>  >>  >  >>   1 file changed, 25 insertions(+), 2 deletions(-)
>>  >>  >  >>
>>  >>  >  >>  Index: linux-x86.q/mm/bootmem.c
>>  >>  >  >>  ===================================================================
>>  >>  >  >>  --- linux-x86.q.orig/mm/bootmem.c
>>  >>  >  >>  +++ linux-x86.q/mm/bootmem.c
>>  >>  >  >>  @@ -493,8 +493,31 @@ int __init reserve_bootmem(unsigned long
>>  >>  >  >>   void __init free_bootmem(unsigned long addr, unsigned long size)
>>  >>  >  >>   {
>>  >>  >  >>         bootmem_data_t *bdata;
>>  >>  >  >>  -       list_for_each_entry(bdata, &bdata_list, list)
>>  >>  >  >>  -               free_bootmem_core(bdata, addr, size);
>>  >>  >  >>  +       unsigned long pos = addr;
>>  >>  >  >>  +       unsigned long partsize = size;
>>  >>  >  >>  +
>>  >>  >  >>  +       list_for_each_entry(bdata, &bdata_list, list) {
>>  >>  >  >>  +               unsigned long remainder = 0;
>>  >>  >  >>  +
>>  >>  >  >>  +               if (pos < bdata->node_boot_start)
>>  >>  >  >>  +                       continue;
>>  >>  >  >>  +
>>  >>  >  >>  +               if (PFN_DOWN(pos + partsize) > bdata->node_low_pfn) {
>>  >>  >  >>  +                       remainder = PFN_DOWN(pos + partsize) - bdata->node_low_pfn;
>>  >>  >  >>  +                       partsize -= remainder;
>>  >>  >  >>  +               }
>>  >>  >  >>  +
>>  >>  >  >>  +               free_bootmem_core(bdata, pos, partsize);
>>  >>  >  >>  +
>>  >>  >  >>  +               if (!remainder)
>>  >>  >  >>  +                       return;
>>  >>  >  >>  +
>>  >>  >  >>  +               pos = PFN_PHYS(bdata->node_low_pfn + 1);
>>  >>  >  >>  +       }
>>  >>  >  >>  +       printk(KERN_ERR "free_bootmem: request: addr=%lx, size=%lx, "
>>  >>  >  >>  +                       "state: pos=%lx, partsize=%lx\n", addr, size,
>>  >>  >  >>  +                       pos, partsize);
>>  >>  >  >>  +       BUG();
>>  >>  >  >>   }
>>  >>  >  >>
>>  >>  >  >>   unsigned long __init free_all_bootmem(void)
>>  >>  >  >>
>>  >>  >  >
>>  >>  >  > it will not work with cross nodes.
>>  >>  >  >
>>  >>  >  > for example: node 0: 0-2g, 4-6g, node1: 2-4g, 6-8g.
>>  >>  >  > and if ramdisk sit cross 2G boundary. you will only free the range
>>  >>  >  > before 2g.
>>  >>  >
>>  >>  >  Yes, you stated that several times but this is not a technical argument:
>>  >>  >  These setups are afaik not yet supported by the kernel at all.  And you
>>  >>  >  could not explain the node layout with the patch that implements support
>>  >>  >  for these configurations.
>>  >>
>>  >>  I looked at Suresh's patch, and it still only has one bdata for one node.
>>  >
>>  > Suresh's patch already in the Linus tree.
>>  > commit 6ec6e0d9f2fd7cb6ca6bc3bfab5ae7b5cdd8c36f
>>  > Author: Suresh Siddha <suresh.b.siddha@intel.com>
>>  > Date:   Tue Mar 25 10:14:35 2008 -0700
>>  >
>>  >     srat, x86: add support for nodes spanning other nodes
>>  >
>>  >     For example, If the physical address layout on a two node system with 8 GB
>>  >     memory is something like:
>>  >     node 0: 0-2GB, 4-6GB
>>  >     node 1: 2-4GB, 6-8GB
>>  >
>>  >     Current kernels fail to boot/detect this NUMA topology.
>>  >
>>  >     ACPI SRAT tables can expose such a topology which needs to be supported.
>>  >
>>  >     Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
>>  >     Signed-off-by: Ingo Molnar <mingo@elte.hu>
>>  >     Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>>
>>  Okay, so we have one bdata for node 0 and one for node 1. Does that mean
>>  that both have overlapping pfn ranges?
>>
>>  [1    |||||     ]
>>      [2    |||||     ]
>>
>>  Like this?  How are the ||||| represented in the bootmem maps of each bdata?
>
> Yes.

Okay.  So they share the same PFNs.  Now imagine the following scenario:

node0: 0-2GB, 4-6GB
node1: 2-4GB, 6-8GB

/* Marks the range on node0 and node1 */
free_bootmem(1.5G, 2G);

/* Frees all bootmem on both nodes */
free_all_bootmem_node(NODE_DATA(0));
free_all_bootmem_node(NODE_DATA(1));

Aren't the same page descriptors send to __free_bootmem_pages() twice?

	Hannes

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] mm: node-setup agnostic free_bootmem()
  2008-04-30 17:52                         ` Johannes Weiner
@ 2008-04-30 20:30                           ` Yinghai Lu
  0 siblings, 0 replies; 52+ messages in thread
From: Yinghai Lu @ 2008-04-30 20:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Linus Torvalds, linux-kernel, Andrew Morton,
	Thomas Gleixner, H. Peter Anvin, jbarnes, Siddha, Suresh B

On Wed, Apr 30, 2008 at 10:52 AM, Johannes Weiner <hannes@saeurebad.de> wrote:
>
> Hi,
>
>  "Yinghai Lu" <yhlu.kernel@gmail.com> writes:
>
>  > On Wed, Apr 30, 2008 at 3:50 AM, Johannes Weiner <hannes@saeurebad.de> wrote:
>  >>
>  >> Hi,
>  >>
>  >>  "Yinghai Lu" <yhlu.kernel@gmail.com> writes:
>  >>
>  >>  > On Mon, Apr 28, 2008 at 12:11 PM, Yinghai Lu <yhlu.kernel@gmail.com> wrote:
>  >>  >>
>  >>  >> On Mon, Apr 28, 2008 at 9:54 AM, Johannes Weiner <hannes@saeurebad.de> wrote:
>  >>  >>  > Hi Yinghai,
>  >>  >>  >
>  >>  >>  >
>  >>  >>  >
>  >>  >>  >  "Yinghai Lu" <yhlu.kernel@gmail.com> writes:
>  >>  >>  >
>  >>  >>  >  > On Sun, Apr 27, 2008 at 5:40 PM, Ingo Molnar <mingo@elte.hu> wrote:
>  >>  >>  >  >>
>  >>  >>  >  >>  * Johannes Weiner <hannes@saeurebad.de> wrote:
>  >>  >>  >  >>
>  >>  >>  >  >>  > > so i very much agree that your changes are cleaner, i just wanted to
>  >>  >>  >  >>  > > have one that has all the fixes included.
>  >>  >>  >  >>  >
>  >>  >>  >  >>  > I had planned this to be another patch because there are more then one
>  >>  >>  >  >>  > boundary check I wanted to tighten.  I can merge them though if you
>  >>  >>  >  >>  > like.
>  >>  >>  >  >>
>  >>  >>  >  >>  no, better to have them in separate patches.
>  >>  >>  >  >>
>  >>  >>  >  >>  > > Would you like to post a patch against current -git or should i
>  >>  >>  >  >>  > > extract the cleaner reserve_bootmem() from your previous patch?
>  >>  >>  >  >>  >
>  >>  >>  >  >>  > I just moved and have only sporadic internet access and free time
>  >>  >>  >  >>  > slots available.  Would be nice if you could do it!
>  >>  >>  >  >>
>  >>  >>  >  >>  sure, find the merged patch below, against latest -git, boot-tested on
>  >>  >>  >  >>  x86. Is this what you had in mind?
>  >>  >>  >  >>
>  >>  >>  >  >>         Ingo
>  >>  >>  >  >>
>  >>  >>  >  >>  ---------------->
>  >>  >>  >  >>  Subject: mm: node-setup agnostic free_bootmem()
>  >>  >>  >  >>  From: Johannes Weiner <hannes@saeurebad.de>
>  >>  >>  >  >>  Date: Wed, 16 Apr 2008 13:36:31 +0200
>  >>  >>  >  >>
>  >>  >>  >  >>  Make free_bootmem() look up the node holding the specified address
>  >>  >>  >  >>  range which lets it work transparently on single-node and multi-node
>  >>  >>  >  >>  configurations.
>  >>  >>  >  >>
>  >>  >>  >  >>  If the address range exceeds the node range, it well be marked free
>  >>  >>  >  >>  across node boundaries, too.
>  >>  >>  >  >>
>  >>  >>  >  >>  Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
>  >>  >>  >  >>  CC: Andi Kleen <andi@firstfloor.org>
>  >>  >>  >  >>  CC: Yinghai Lu <yhlu.kernel@gmail.com>
>  >>  >>  >  >>  CC: Yasunori Goto <y-goto@jp.fujitsu.com>
>  >>  >>  >  >>  CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>  >>  >>  >  >>  CC: Christoph Lameter <clameter@sgi.com>
>  >>  >>  >  >>  CC: Andrew Morton <akpm@linux-foundation.org>
>  >>  >>  >  >>  Signed-off-by: Ingo Molnar <mingo@elte.hu>
>  >>  >>  >  >>  ---
>  >>  >>  >  >>   mm/bootmem.c |   27 +++++++++++++++++++++++++--
>  >>  >>  >  >>   1 file changed, 25 insertions(+), 2 deletions(-)
>  >>  >>  >  >>
>  >>  >>  >  >>  Index: linux-x86.q/mm/bootmem.c
>  >>  >>  >  >>  ===================================================================
>  >>  >>  >  >>  --- linux-x86.q.orig/mm/bootmem.c
>  >>  >>  >  >>  +++ linux-x86.q/mm/bootmem.c
>  >>  >>  >  >>  @@ -493,8 +493,31 @@ int __init reserve_bootmem(unsigned long
>  >>  >>  >  >>   void __init free_bootmem(unsigned long addr, unsigned long size)
>  >>  >>  >  >>   {
>  >>  >>  >  >>         bootmem_data_t *bdata;
>  >>  >>  >  >>  -       list_for_each_entry(bdata, &bdata_list, list)
>  >>  >>  >  >>  -               free_bootmem_core(bdata, addr, size);
>  >>  >>  >  >>  +       unsigned long pos = addr;
>  >>  >>  >  >>  +       unsigned long partsize = size;
>  >>  >>  >  >>  +
>  >>  >>  >  >>  +       list_for_each_entry(bdata, &bdata_list, list) {
>  >>  >>  >  >>  +               unsigned long remainder = 0;
>  >>  >>  >  >>  +
>  >>  >>  >  >>  +               if (pos < bdata->node_boot_start)
>  >>  >>  >  >>  +                       continue;
>  >>  >>  >  >>  +
>  >>  >>  >  >>  +               if (PFN_DOWN(pos + partsize) > bdata->node_low_pfn) {
>  >>  >>  >  >>  +                       remainder = PFN_DOWN(pos + partsize) - bdata->node_low_pfn;
>  >>  >>  >  >>  +                       partsize -= remainder;
>  >>  >>  >  >>  +               }
>  >>  >>  >  >>  +
>  >>  >>  >  >>  +               free_bootmem_core(bdata, pos, partsize);
>  >>  >>  >  >>  +
>  >>  >>  >  >>  +               if (!remainder)
>  >>  >>  >  >>  +                       return;
>  >>  >>  >  >>  +
>  >>  >>  >  >>  +               pos = PFN_PHYS(bdata->node_low_pfn + 1);
>  >>  >>  >  >>  +       }
>  >>  >>  >  >>  +       printk(KERN_ERR "free_bootmem: request: addr=%lx, size=%lx, "
>  >>  >>  >  >>  +                       "state: pos=%lx, partsize=%lx\n", addr, size,
>  >>  >>  >  >>  +                       pos, partsize);
>  >>  >>  >  >>  +       BUG();
>  >>  >>  >  >>   }
>  >>  >>  >  >>
>  >>  >>  >  >>   unsigned long __init free_all_bootmem(void)
>  >>  >>  >  >>
>  >>  >>  >  >
>  >>  >>  >  > it will not work with cross nodes.
>  >>  >>  >  >
>  >>  >>  >  > for example: node 0: 0-2g, 4-6g, node1: 2-4g, 6-8g.
>  >>  >>  >  > and if ramdisk sit cross 2G boundary. you will only free the range
>  >>  >>  >  > before 2g.
>  >>  >>  >
>  >>  >>  >  Yes, you stated that several times but this is not a technical argument:
>  >>  >>  >  These setups are afaik not yet supported by the kernel at all.  And you
>  >>  >>  >  could not explain the node layout with the patch that implements support
>  >>  >>  >  for these configurations.
>  >>  >>
>  >>  >>  I looked at Suresh's patch, and it still only has one bdata for one node.
>  >>  >
>  >>  > Suresh's patch already in the Linus tree.
>  >>  > commit 6ec6e0d9f2fd7cb6ca6bc3bfab5ae7b5cdd8c36f
>  >>  > Author: Suresh Siddha <suresh.b.siddha@intel.com>
>  >>  > Date:   Tue Mar 25 10:14:35 2008 -0700
>  >>  >
>  >>  >     srat, x86: add support for nodes spanning other nodes
>  >>  >
>  >>  >     For example, If the physical address layout on a two node system with 8 GB
>  >>  >     memory is something like:
>  >>  >     node 0: 0-2GB, 4-6GB
>  >>  >     node 1: 2-4GB, 6-8GB
>  >>  >
>  >>  >     Current kernels fail to boot/detect this NUMA topology.
>  >>  >
>  >>  >     ACPI SRAT tables can expose such a topology which needs to be supported.
>  >>  >
>  >>  >     Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
>  >>  >     Signed-off-by: Ingo Molnar <mingo@elte.hu>
>  >>  >     Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>  >>
>  >>  Okay, so we have one bdata for node 0 and one for node 1. Does that mean
>  >>  that both have overlapping pfn ranges?
>  >>
>  >>  [1    |||||     ]
>  >>      [2    |||||     ]
>  >>
>  >>  Like this?  How are the ||||| represented in the bootmem maps of each bdata?
>  >
>  > Yes.
>
>  Okay.  So they share the same PFNs.  Now imagine the following scenario:
>
>  node0: 0-2GB, 4-6GB
>  node1: 2-4GB, 6-8GB
>
>  /* Marks the range on node0 and node1 */
>  free_bootmem(1.5G, 2G);
>
>  /* Frees all bootmem on both nodes */
>  free_all_bootmem_node(NODE_DATA(0));
>  free_all_bootmem_node(NODE_DATA(1));
>
>  Aren't the same page descriptors send to __free_bootmem_pages() twice?

yeah, there is some problem....
may need to ask every node took another node_bootmem_not_use_map ...to
record the holes.

YH

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2008-04-30 20:30 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-04-26 18:55 [RFC git pull] "big box" x86 changes Ingo Molnar
2008-04-26 19:05 ` Stefan Richter
2008-04-26 19:21   ` Ingo Molnar
2008-04-26 19:12 ` Linus Torvalds
2008-04-26 19:41   ` [git pull] "big box" x86 changes, bootmem/sparsemem Ingo Molnar
2008-04-26 19:52     ` Linus Torvalds
2008-04-26 20:07       ` Ingo Molnar
2008-04-26 20:08       ` [git pull] "big box" x86 changes, bootmem/sparsemem, #2 Ingo Molnar
2008-04-26 20:30         ` Linus Torvalds
2008-04-26 20:55           ` [git pull] "big box" x86 changes, bootmem/sparsemem, #3 Ingo Molnar
2008-04-27 22:48     ` [git pull] "big box" x86 changes, bootmem/sparsemem Johannes Weiner
2008-04-27 23:46       ` Ingo Molnar
2008-04-28  0:19         ` Johannes Weiner
2008-04-28  0:40           ` [patch] mm: node-setup agnostic free_bootmem() Ingo Molnar
2008-04-28  1:48             ` Yinghai Lu
2008-04-28 16:54               ` Johannes Weiner
2008-04-28 19:11                 ` Yinghai Lu
2008-04-28 19:55                   ` Yinghai Lu
2008-04-30 10:50                     ` Johannes Weiner
2008-04-30 16:22                       ` Yinghai Lu
2008-04-30 17:52                         ` Johannes Weiner
2008-04-30 20:30                           ` Yinghai Lu
2008-04-28 16:49             ` Johannes Weiner
2008-04-29 14:25               ` Ingo Molnar
2008-04-30 10:52                 ` Johannes Weiner
2008-04-28  0:33       ` [git pull] "big box" x86 changes, bootmem/sparsemem Yinghai Lu
2008-04-28 16:58         ` Johannes Weiner
2008-04-26 19:54   ` [git pull] "big box" x86 changes, boot protocol Ingo Molnar
2008-04-26 20:39     ` Andrew Morton
2008-04-26 21:06       ` Adrian Bunk
2008-04-26 21:10         ` H. Peter Anvin
2008-04-26 21:11         ` Linus Torvalds
2008-04-26 21:17           ` Ingo Molnar
2008-04-26 23:37       ` Jeremy Fitzhardinge
2008-04-27 11:21     ` Ian Campbell
2008-04-27 19:29       ` H. Peter Anvin
2008-04-28 15:27       ` Ingo Molnar
2008-04-26 20:24   ` [RFC git pull] "big box" x86 changes, GART Ingo Molnar
2008-04-26 20:26     ` Ingo Molnar
2008-04-26 21:55   ` [git pull] "big box" x86 changes, PCI Ingo Molnar
2008-04-27 16:30     ` Jesse Barnes
2008-04-28 15:38       ` Ingo Molnar
2008-04-28 20:34     ` Jesse Barnes
2008-04-28 22:53       ` Yinghai Lu
2008-04-28 23:27       ` [PATCH] x86/pci: remove flag in pci_cfg_space_size_ext Yinghai Lu
2008-04-29 16:14         ` Jesse Barnes
2008-04-29 22:05           ` Ingo Molnar
2008-04-29 22:34             ` Jesse Barnes
2008-04-26 22:17 ` [RFC git pull] "big box" x86 changes Andi Kleen
2008-04-27  3:14   ` Yinghai Lu
2008-04-27  8:30     ` Andi Kleen
2008-04-27  8:32     ` [RFC git pull] "big box" x86 changes II Andi Kleen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).