* Re: kernel access of bad area, sig: 11 ( mpc852t)
From: Mark Chambers @ 2006-04-11 14:39 UTC (permalink / raw)
To: gautam borad, linuxppc-embedded
In-Reply-To: <443BB7E2.9040903@eisodus.com>
> Hi,
> Im having problem porting linux kernel 2.4.21 to our mpc852T custom
> board.The kernel
> panics randomly with sig 11.
> The board boots up fine and we also get to the prompt.When we open 3-4
> telnet sessions
> and try to run some command the kernel panics.This is completely
> random.Sometimes it
> even panics before opening the telnet session.
>
> One of the oops dump is:
>
> -----------------------------------------------------------------------------------------------------------------------------
> Oops: kernel access of bad area, sig: 11
> NIP: C0019FD0 XER: 00000000 LR: C001A06C SP: C1C33AD0 REGS: c1c33a20
> TRAP: 0300 Not tainted
> MSR: 00009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11
> DAR: 725F6578, DSISR: 0000000B
> TASK = c1c32000[48] 'insmod' Last syscall: 5
> last math 00000000 last altivec 00000000
> GPR00: 7361726D C1C33AD0 C1C32000 00000000 C0113678 C0150000 C0150000
> C014B210
> GPR08: C014B210 C012D060 00000000 725F6578 04000024 00000000 00000000
> 00000000
> GPR16: 00000000 00000000 00000000 00000000 00001032 01C33BA0 00000000
> C0000000
> GPR24: C014BE38 C0150000 C0110000 C0110000 C0140000 00000001 00000000
> 00000001
> Call backtrace:
> C001A0C8 C0016174 C0015FEC C0015CC0 C0005E38 C0004668 C1C33D10
> C0004670 C004A380 C003FFD4 C00404D8 C0040AD4 C0040FF8 C00330D8
> C00334AC C000443C 1006EF48 1001FCF0 1002023C 10003A18 100036A0
> 300591AC 00000000
> -----------------------------------------------------------------------------------------------------------------------------
>
> The call trace back is of not much help because it is different on all
> oops.
> We are using u-boot 1-1-3.
>
> Thanks in advance.
>
You almost certainly have SDRAM problems. If you have thoroughly checked
out the
complete address range statically, remember that burst accesses will not
occur until the
cache is turned on, so your problem may be with bursting. But you can also
have severe
problems like a missing address line and linux still run for a few seconds.
Mark Chambers
^ permalink raw reply
* Oops: kernel access of bad area, sig: 11 ( mpc852t)
From: gautam borad @ 2006-04-11 14:06 UTC (permalink / raw)
To: linuxppc-embedded
Hi,
Im having problem porting linux kernel 2.4.21 to our mpc852T custom
board.The kernel
panics randomly with sig 11.
The board boots up fine and we also get to the prompt.When we open 3-4
telnet sessions
and try to run some command the kernel panics.This is completely
random.Sometimes it
even panics before opening the telnet session.
One of the oops dump is:
-----------------------------------------------------------------------------------------------------------------------------
Oops: kernel access of bad area, sig: 11
NIP: C0019FD0 XER: 00000000 LR: C001A06C SP: C1C33AD0 REGS: c1c33a20
TRAP: 0300 Not tainted
MSR: 00009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11
DAR: 725F6578, DSISR: 0000000B
TASK = c1c32000[48] 'insmod' Last syscall: 5
last math 00000000 last altivec 00000000
GPR00: 7361726D C1C33AD0 C1C32000 00000000 C0113678 C0150000 C0150000
C014B210
GPR08: C014B210 C012D060 00000000 725F6578 04000024 00000000 00000000
00000000
GPR16: 00000000 00000000 00000000 00000000 00001032 01C33BA0 00000000
C0000000
GPR24: C014BE38 C0150000 C0110000 C0110000 C0140000 00000001 00000000
00000001
Call backtrace:
C001A0C8 C0016174 C0015FEC C0015CC0 C0005E38 C0004668 C1C33D10
C0004670 C004A380 C003FFD4 C00404D8 C0040AD4 C0040FF8 C00330D8
C00334AC C000443C 1006EF48 1001FCF0 1002023C 10003A18 100036A0
300591AC 00000000
-----------------------------------------------------------------------------------------------------------------------------
The call trace back is of not much help because it is different on all oops.
We are using u-boot 1-1-3.
Thanks in advance.
^ permalink raw reply
* RTC/I2C problems
From: Kenneth Poole @ 2006-04-11 14:25 UTC (permalink / raw)
To: linuxppc-embedded
[-- Attachment #1: Type: text/plain, Size: 1269 bytes --]
We had a similar problem with our 8xx implementation. The problem was
decrementer underflow. In the kernel initialization, calibrate_decr()
was called before set_dec(), so the decrementer can count down through 0
before the decrementer exception gets enabled. To fix this, we moved the
call to set_dec() from time_init() to calibrate_decr() before the
exception is enabled in the tbcsr.
Ken Poole
>Hi everybody.
>I have a problem with my RTC, i work on a MEN A12 board with MPC8245
and
>when my kernel start it takes 8 min to run. I have find the problem but
i
>don't know how to solve it. In time.c the function time_init execute
the
>following lines :
>if (__USE_RTC()) {
> /* 601 processor: dec counts down by 128 every 128ns */
> tb_ticks_per_jiffy = DECREMENTER_COUNT_601;
> /* mulhwu_scale_factor(1000000000, 1000000) is 0x418937 */
> tb_to_us = 0x0x418937;
>} else {
> ppc_md.calibrate_decr();
> tb_to_ns_scale = mulhwu(tb_to_us, 1000 << 10);
>}
>this is the ppc_md.calibrate_decr() function that takes a lot of time
and i
>don't know how to solve the problem. USE_RTC is fixed at 0 for 6xx
config
>maybe it must be at 1 ? My RTC run with SMBus (I2C). I work with ELDK
4.0
>and linux 2.6.15.
[-- Attachment #2: Type: text/html, Size: 4986 bytes --]
^ permalink raw reply
* Re: [PATCH 6/6] Break out memory initialisation code from page_alloc.c to mem_init.c
From: Nick Piggin @ 2006-04-11 11:07 UTC (permalink / raw)
To: Mel Gorman; +Cc: davej, tony.luck, ak, linux-kernel, linuxppc-dev
In-Reply-To: <20060411104147.18153.64342.sendpatchset@skynet>
Mel Gorman wrote:
> page_alloc.c contains a large amount of memory initialisation code. This patch
> breaks out the initialisation code to a separate file to make page_alloc.c
> a bit easier to read.
>
Seems like a very good idea to me.
> +/*
> + * mm/mem_init.c
> + * Initialises the architecture independant view of memory. pgdats, zones, etc
> + *
> + * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds
> + * Swap reorganised 29.12.95, Stephen Tweedie
> + * Support of BIGMEM added by Gerhard Wichert, Siemens AG, July 1999
> + * Reshaped it to be a zoned allocator, Ingo Molnar, Red Hat, 1999
> + * Discontiguous memory support, Kanoj Sarcar, SGI, Nov 1999
> + * Zone balancing, Kanoj Sarcar, SGI, Jan 2000
> + * Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002
> + * (lots of bits borrowed from Ingo Molnar & Andrew Morton)
> + * Arch-independant zone size and hole calculation, Mel Gorman, IBM, Apr 2006
> + * (lots of bits taken from architecture code)
> + */
Maybe drop the duplicated changelog? (just retain copyrights I guess)
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply
* Re: Linux kernel for MPC834X platform
From: Kumar Gala @ 2006-04-11 13:03 UTC (permalink / raw)
To: Stefan Meyer; +Cc: linuxppc-embedded
In-Reply-To: <002301c65d52$fd7c6fd0$1100a8c0@stefanlaptop>
On Apr 11, 2006, at 5:30 AM, Stefan Meyer wrote:
> Hi All
> Pardon the intrusion if this is not the correct way to get the
> relevant information, but I am relatively new to kernel hacking.
> Anyway here goes my question:
> I am getting a Freescale MPC8343 based board up and running and
> have been using the BSP (with patched 2.6.11 kernel) release by
> Freescale with great success.
> However when I tried to use the USB ISO transactions it seems to be
> non functional. I have connected a known good working codec (tested
> on my PC's usb) and tried to access it via the low level sound
> driver (/dev/dsp) but it only creates garbled noise.
> In the meanwhile, I have noticed that the official kernel release
> 2.6.17 also start supporting the MPC834X platform and would rather
> like to persue that avenue to get my problems addressed (hopefully).
> I have downloaded the latest 2.6.17 kernel and got it going with
> the configuration for MPC834x_sys, with relation to booting and
> ethernet comms.
> However there are clearly no provision in the USB driver for
> setting up the usb clocks and ports (which is of course MPC834x
> specific).
> I have tried to hack it by inserting approriate code form the
> 2.6.11 freescale kernel but with very little success.
> On inspection of the 2.6.11 kernel from Freescale I have also
> noticed that the USB core drivers were heavily modified around DMA
> buffer management, although the 2.6.17 had no such hacks in place.
> My question simply relates to the status of the MPC834x
> implementation on the 2.6.17 kernel in so far that it is
> operational and being tested on hardware platforms. Could you
> please direct me to the persons(s) who have experience with this
> platform ?
The USB support for MPC834x only exists if you build it for
ARCH=powerpc. There is a patch on the list archives for adding
support to ARCH=ppc. I don't know if there is any issue with ISO
transactions in those version, but they been shown to work with USB
ethernet and USB HD devices.
- kumar
^ permalink raw reply
* Xilinx Virtex-2 PRO FPGA ppc 405 on ML310 board
From: Vincent Winstead @ 2006-04-11 12:55 UTC (permalink / raw)
To: linuxppc-embedded
[-- Attachment #1: Type: text/plain, Size: 591 bytes --]
I'm having a real difficulty trying to get linux onto this board. So I'm finally turning to the community for help. The only people that have documented their approach to putting open source linux onto the ML310 board have used bitkeeper to download the kernel for the project, but bitkeeper isn't used any more is it? I've been going straight to kernel.org and getting kernels from there and crosscompiling them on my machine to be transported to the ppc core on the ML310. Is this wrong? Is there a patch or some kernel source that I don't know about for PPC? Thanks!
-Vincent
[-- Attachment #2: Type: text/html, Size: 656 bytes --]
^ permalink raw reply
* Re: Create permanent mapping from PCI bus to region of physical memory
From: Mark Chambers @ 2006-04-11 12:03 UTC (permalink / raw)
To: Howard, Marc; +Cc: linuxppc-embedded
In-Reply-To: <443B27E6.8070607@ovro.caltech.edu>
> You always have the hack of reserving a block of memory
> at the top of SDRAM, i.e., lie to Linux and tell it it
> has less memory. It sounds like that might be the least
> effort. Rubini has an example of how to do that.
>
I've done this with a bus-master PCI peripheral and it works
nicely. The other point about this is that the mmap() call
will work rationally with memory above the linux pool, so
it's easy to go straight to/from user land with this approach.
(and you can set caching and the guard bit and all that so
that PPC access works the way you want it to)
Mark Chambers
^ permalink raw reply
* Linux kernel for MPC834X platform
From: Stefan Meyer @ 2006-04-11 10:30 UTC (permalink / raw)
To: linuxppc-embedded
[-- Attachment #1: Type: text/plain, Size: 1663 bytes --]
Hi All
Pardon the intrusion if this is not the correct way to get the relevant information, but I am relatively new to kernel hacking.
Anyway here goes my question:
I am getting a Freescale MPC8343 based board up and running and have been using the BSP (with patched 2.6.11 kernel) release by Freescale with great success.
However when I tried to use the USB ISO transactions it seems to be non functional. I have connected a known good working codec (tested on my PC's usb) and tried to access it via the low level sound driver (/dev/dsp) but it only creates garbled noise.
In the meanwhile, I have noticed that the official kernel release 2.6.17 also start supporting the MPC834X platform and would rather like to persue that avenue to get my problems addressed (hopefully).
I have downloaded the latest 2.6.17 kernel and got it going with the configuration for MPC834x_sys, with relation to booting and ethernet comms.
However there are clearly no provision in the USB driver for setting up the usb clocks and ports (which is of course MPC834x specific).
I have tried to hack it by inserting approriate code form the 2.6.11 freescale kernel but with very little success.
On inspection of the 2.6.11 kernel from Freescale I have also noticed that the USB core drivers were heavily modified around DMA buffer management, although the 2.6.17 had no such hacks in place.
My question simply relates to the status of the MPC834x implementation on the 2.6.17 kernel in so far that it is operational and being tested on hardware platforms. Could you please direct me to the persons(s) who have experience with this platform ?
Regards
Stefan
[-- Attachment #2: Type: text/html, Size: 2730 bytes --]
^ permalink raw reply
* checking posting: ignore please
From: Stefan Meyer @ 2006-04-11 11:34 UTC (permalink / raw)
To: linuxppc-embedded
[-- Attachment #1: Type: text/plain, Size: 45 bytes --]
Just verifying that the posting is working
[-- Attachment #2: Type: text/html, Size: 377 bytes --]
^ permalink raw reply
* [PATCH 5/6] Have ia64 use add_active_range() and free_area_init_nodes
From: Mel Gorman @ 2006-04-11 10:41 UTC (permalink / raw)
To: linuxppc-dev, tony.luck, ak, linux-kernel, davej; +Cc: Mel Gorman
In-Reply-To: <20060411103946.18153.83059.sendpatchset@skynet>
Size zones and holes in an architecture independent manner for ia64.
This has only been compile-tested due to lack of a suitable test machine.
arch/ia64/Kconfig | 3 +
arch/ia64/mm/contig.c | 62 +++++-----------------------------------
arch/ia64/mm/discontig.c | 43 +++++----------------------
arch/ia64/mm/init.c | 10 ++++++
include/asm-ia64/meminit.h | 1
5 files changed, 30 insertions(+), 89 deletions(-)
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/ia64/Kconfig linux-2.6.17-rc1-105-ia64_use_init_nodes/arch/ia64/Kconfig
--- linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/ia64/Kconfig 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-105-ia64_use_init_nodes/arch/ia64/Kconfig 2006-04-10 10:55:25.000000000 +0100
@@ -352,6 +352,9 @@ config NUMA
Access). This option is for configuring high-end multiprocessor
server systems. If in doubt, say N.
+config ARCH_POPULATES_NODE_MAP
+ def_bool y
+
# VIRTUAL_MEM_MAP and FLAT_NODE_MEM_MAP are functionally equivalent.
# VIRTUAL_MEM_MAP has been retained for historical reasons.
config VIRTUAL_MEM_MAP
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/ia64/mm/contig.c linux-2.6.17-rc1-105-ia64_use_init_nodes/arch/ia64/mm/contig.c
--- linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/ia64/mm/contig.c 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-105-ia64_use_init_nodes/arch/ia64/mm/contig.c 2006-04-10 10:55:25.000000000 +0100
@@ -26,10 +26,6 @@
#include <asm/sections.h>
#include <asm/mca.h>
-#ifdef CONFIG_VIRTUAL_MEM_MAP
-static unsigned long num_dma_physpages;
-#endif
-
/**
* show_mem - display a memory statistics summary
*
@@ -212,18 +208,6 @@ count_pages (u64 start, u64 end, void *a
return 0;
}
-#ifdef CONFIG_VIRTUAL_MEM_MAP
-static int
-count_dma_pages (u64 start, u64 end, void *arg)
-{
- unsigned long *count = arg;
-
- if (start < MAX_DMA_ADDRESS)
- *count += (min(end, MAX_DMA_ADDRESS) - start) >> PAGE_SHIFT;
- return 0;
-}
-#endif
-
/*
* Set up the page tables.
*/
@@ -232,71 +216,41 @@ void __init
paging_init (void)
{
unsigned long max_dma;
- unsigned long zones_size[MAX_NR_ZONES];
#ifdef CONFIG_VIRTUAL_MEM_MAP
- unsigned long zholes_size[MAX_NR_ZONES];
+ unsigned long nid = 0;
unsigned long max_gap;
#endif
- /* initialize mem_map[] */
-
- memset(zones_size, 0, sizeof(zones_size));
-
num_physpages = 0;
efi_memmap_walk(count_pages, &num_physpages);
max_dma = virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT;
#ifdef CONFIG_VIRTUAL_MEM_MAP
- memset(zholes_size, 0, sizeof(zholes_size));
-
- num_dma_physpages = 0;
- efi_memmap_walk(count_dma_pages, &num_dma_physpages);
-
- if (max_low_pfn < max_dma) {
- zones_size[ZONE_DMA] = max_low_pfn;
- zholes_size[ZONE_DMA] = max_low_pfn - num_dma_physpages;
- } else {
- zones_size[ZONE_DMA] = max_dma;
- zholes_size[ZONE_DMA] = max_dma - num_dma_physpages;
- if (num_physpages > num_dma_physpages) {
- zones_size[ZONE_NORMAL] = max_low_pfn - max_dma;
- zholes_size[ZONE_NORMAL] =
- ((max_low_pfn - max_dma) -
- (num_physpages - num_dma_physpages));
- }
- }
-
max_gap = 0;
+ efi_memmap_walk(register_active_ranges, &nid);
efi_memmap_walk(find_largest_hole, (u64 *)&max_gap);
if (max_gap < LARGE_GAP) {
- vmem_map = (struct page *) 0;
- free_area_init_node(0, NODE_DATA(0), zones_size, 0,
- zholes_size);
+ free_area_init_nodes(max_dma, max_dma,
+ max_low_pfn, max_low_pfn);
} else {
unsigned long map_size;
/* allocate virtual_mem_map */
-
map_size = PAGE_ALIGN(max_low_pfn * sizeof(struct page));
vmalloc_end -= map_size;
vmem_map = (struct page *) vmalloc_end;
efi_memmap_walk(create_mem_map_page_table, NULL);
NODE_DATA(0)->node_mem_map = vmem_map;
- free_area_init_node(0, NODE_DATA(0), zones_size,
- 0, zholes_size);
+ free_area_init_nodes(max_dma, max_dma,
+ max_low_pfn, max_low_pfn);
printk("Virtual mem_map starts at 0x%p\n", mem_map);
}
#else /* !CONFIG_VIRTUAL_MEM_MAP */
- if (max_low_pfn < max_dma)
- zones_size[ZONE_DMA] = max_low_pfn;
- else {
- zones_size[ZONE_DMA] = max_dma;
- zones_size[ZONE_NORMAL] = max_low_pfn - max_dma;
- }
- free_area_init(zones_size);
+ add_active_range(0, 0, max_low_pfn);
+ free_area_init_nodes(max_dma, max_dma, max_low_pfn, max_low_pfn);
#endif /* !CONFIG_VIRTUAL_MEM_MAP */
zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
}
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/ia64/mm/discontig.c linux-2.6.17-rc1-105-ia64_use_init_nodes/arch/ia64/mm/discontig.c
--- linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/ia64/mm/discontig.c 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-105-ia64_use_init_nodes/arch/ia64/mm/discontig.c 2006-04-10 10:55:25.000000000 +0100
@@ -88,6 +88,9 @@ static int __init build_node_maps(unsign
min_low_pfn = min(min_low_pfn, bdp->node_boot_start>>PAGE_SHIFT);
max_low_pfn = max(max_low_pfn, bdp->node_low_pfn);
+ /* Add a known active range */
+ add_active_range(node, start, end);
+
return 0;
}
@@ -660,8 +663,7 @@ static __init int count_node_pages(unsig
void __init paging_init(void)
{
unsigned long max_dma;
- unsigned long zones_size[MAX_NR_ZONES];
- unsigned long zholes_size[MAX_NR_ZONES];
+ unsigned long max_pfn = 0;
unsigned long pfn_offset = 0;
int node;
@@ -679,46 +681,17 @@ void __init paging_init(void)
#endif
for_each_online_node(node) {
- memset(zones_size, 0, sizeof(zones_size));
- memset(zholes_size, 0, sizeof(zholes_size));
-
num_physpages += mem_data[node].num_physpages;
-
- if (mem_data[node].min_pfn >= max_dma) {
- /* All of this node's memory is above ZONE_DMA */
- zones_size[ZONE_NORMAL] = mem_data[node].max_pfn -
- mem_data[node].min_pfn;
- zholes_size[ZONE_NORMAL] = mem_data[node].max_pfn -
- mem_data[node].min_pfn -
- mem_data[node].num_physpages;
- } else if (mem_data[node].max_pfn < max_dma) {
- /* All of this node's memory is in ZONE_DMA */
- zones_size[ZONE_DMA] = mem_data[node].max_pfn -
- mem_data[node].min_pfn;
- zholes_size[ZONE_DMA] = mem_data[node].max_pfn -
- mem_data[node].min_pfn -
- mem_data[node].num_dma_physpages;
- } else {
- /* This node has memory in both zones */
- zones_size[ZONE_DMA] = max_dma -
- mem_data[node].min_pfn;
- zholes_size[ZONE_DMA] = zones_size[ZONE_DMA] -
- mem_data[node].num_dma_physpages;
- zones_size[ZONE_NORMAL] = mem_data[node].max_pfn -
- max_dma;
- zholes_size[ZONE_NORMAL] = zones_size[ZONE_NORMAL] -
- (mem_data[node].num_physpages -
- mem_data[node].num_dma_physpages);
- }
-
pfn_offset = mem_data[node].min_pfn;
#ifdef CONFIG_VIRTUAL_MEM_MAP
NODE_DATA(node)->node_mem_map = vmem_map + pfn_offset;
#endif
- free_area_init_node(node, NODE_DATA(node), zones_size,
- pfn_offset, zholes_size);
+ if (mem_data[node].max_pfn > max_pfn)
+ max_pfn = mem_data[node].max_pfn;
}
+ free_area_init_nodes(max_dma, max_dma, max_pfn, max_pfn);
+
zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
}
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/ia64/mm/init.c linux-2.6.17-rc1-105-ia64_use_init_nodes/arch/ia64/mm/init.c
--- linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/ia64/mm/init.c 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-105-ia64_use_init_nodes/arch/ia64/mm/init.c 2006-04-10 10:55:25.000000000 +0100
@@ -526,6 +526,16 @@ ia64_pfn_valid (unsigned long pfn)
EXPORT_SYMBOL(ia64_pfn_valid);
int __init
+register_active_ranges(u64 start, u64 end, void *nid)
+{
+ BUG_ON(nid == NULL);
+ BUG_ON(*(unsigned long *)nid >= MAX_NUMNODES);
+
+ add_active_range(*(unsigned long *)nid, start, end);
+ return 0;
+}
+
+int __init
find_largest_hole (u64 start, u64 end, void *arg)
{
u64 *max_gap = arg;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-104-x86_64_use_init_nodes/include/asm-ia64/meminit.h linux-2.6.17-rc1-105-ia64_use_init_nodes/include/asm-ia64/meminit.h
--- linux-2.6.17-rc1-104-x86_64_use_init_nodes/include/asm-ia64/meminit.h 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-105-ia64_use_init_nodes/include/asm-ia64/meminit.h 2006-04-10 10:55:25.000000000 +0100
@@ -56,6 +56,7 @@ extern void efi_memmap_init(unsigned lon
extern unsigned long vmalloc_end;
extern struct page *vmem_map;
extern int find_largest_hole (u64 start, u64 end, void *arg);
+ extern int register_active_ranges (u64 start, u64 end, void *arg);
extern int create_mem_map_page_table (u64 start, u64 end, void *arg);
#endif
^ permalink raw reply
* [PATCH 1/6] Introduce mechanism for registering active regions of memory
From: Mel Gorman @ 2006-04-11 10:40 UTC (permalink / raw)
To: linuxppc-dev, ak, tony.luck, linux-kernel, davej; +Cc: Mel Gorman
In-Reply-To: <20060411103946.18153.83059.sendpatchset@skynet>
This patch defines the structure to represent an active range of page
frames within a node in an architecture independent manner. Architectures
are expected to register active ranges of PFNs using add_active_range(nid,
start_pfn, end_pfn) and call free_area_init_nodes() passing the PFNs of
the end of each zone.
include/linux/mm.h | 14 +
include/linux/mmzone.h | 15 +
mm/page_alloc.c | 374 +++++++++++++++++++++++++++++++++++++++++---
3 files changed, 378 insertions(+), 25 deletions(-)
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-clean/include/linux/mm.h linux-2.6.17-rc1-101-add_free_area_init_nodes/include/linux/mm.h
--- linux-2.6.17-rc1-clean/include/linux/mm.h 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-101-add_free_area_init_nodes/include/linux/mm.h 2006-04-10 10:52:14.000000000 +0100
@@ -867,6 +867,20 @@ extern void free_area_init(unsigned long
extern void free_area_init_node(int nid, pg_data_t *pgdat,
unsigned long * zones_size, unsigned long zone_start_pfn,
unsigned long *zholes_size);
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+extern void free_area_init_nodes(unsigned long max_dma_pfn,
+ unsigned long max_dma32_pfn,
+ unsigned long max_low_pfn,
+ unsigned long max_high_pfn);
+extern void add_active_range(unsigned int nid, unsigned long start_pfn,
+ unsigned long end_pfn);
+extern void get_pfn_range_for_nid(unsigned int nid,
+ unsigned long *start_pfn, unsigned long *end_pfn);
+extern int early_pfn_to_nid(unsigned long pfn);
+extern void free_bootmem_with_active_regions(int nid,
+ unsigned long max_low_pfn);
+extern void memory_present_with_active_regions(int nid);
+#endif
extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long);
extern void setup_per_zone_pages_min(void);
extern void mem_init(void);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-clean/include/linux/mmzone.h linux-2.6.17-rc1-101-add_free_area_init_nodes/include/linux/mmzone.h
--- linux-2.6.17-rc1-clean/include/linux/mmzone.h 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-101-add_free_area_init_nodes/include/linux/mmzone.h 2006-04-10 10:52:14.000000000 +0100
@@ -271,6 +271,18 @@ struct zonelist {
struct zone *zones[MAX_NUMNODES * MAX_NR_ZONES + 1]; // NULL delimited
};
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+/*
+ * This represents an active range of physical memory. Architectures register
+ * a pfn range using add_active_range() and later initialise the nodes and
+ * free list with free_area_init_nodes()
+ */
+struct node_active_region {
+ unsigned long start_pfn;
+ unsigned long end_pfn;
+ int nid;
+};
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
/*
* The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
@@ -465,7 +477,8 @@ extern struct zone *next_zone(struct zon
#endif
-#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
+#if !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) && \
+ !defined(CONFIG_ARCH_POPULATES_NODE_MAP)
#define early_pfn_to_nid(nid) (0UL)
#endif
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-clean/mm/page_alloc.c linux-2.6.17-rc1-101-add_free_area_init_nodes/mm/page_alloc.c
--- linux-2.6.17-rc1-clean/mm/page_alloc.c 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-101-add_free_area_init_nodes/mm/page_alloc.c 2006-04-10 10:52:14.000000000 +0100
@@ -37,6 +37,8 @@
#include <linux/nodemask.h>
#include <linux/vmalloc.h>
#include <linux/mempolicy.h>
+#include <linux/sort.h>
+#include <linux/pfn.h>
#include <asm/tlbflush.h>
#include "internal.h"
@@ -84,6 +86,18 @@ int min_free_kbytes = 1024;
unsigned long __initdata nr_kernel_pages;
unsigned long __initdata nr_all_pages;
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+ #ifdef CONFIG_MAX_ACTIVE_REGIONS
+ #define MAX_ACTIVE_REGIONS CONFIG_MAX_ACTIVE_REGIONS
+ #else
+ #define MAX_ACTIVE_REGIONS (MAX_NR_ZONES * MAX_NUMNODES + 1)
+ #endif
+
+ struct node_active_region __initdata early_node_map[MAX_ACTIVE_REGIONS];
+ unsigned long __initdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
+ unsigned long __initdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
+
#ifdef CONFIG_DEBUG_VM
static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
{
@@ -1743,25 +1757,6 @@ static inline unsigned long wait_table_b
#define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
-static void __init calculate_zone_totalpages(struct pglist_data *pgdat,
- unsigned long *zones_size, unsigned long *zholes_size)
-{
- unsigned long realtotalpages, totalpages = 0;
- int i;
-
- for (i = 0; i < MAX_NR_ZONES; i++)
- totalpages += zones_size[i];
- pgdat->node_spanned_pages = totalpages;
-
- realtotalpages = totalpages;
- if (zholes_size)
- for (i = 0; i < MAX_NR_ZONES; i++)
- realtotalpages -= zholes_size[i];
- pgdat->node_present_pages = realtotalpages;
- printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id, realtotalpages);
-}
-
-
/*
* Initially all pages are reserved - free ones are freed
* up by free_all_bootmem() once the early boot process is
@@ -2048,6 +2043,214 @@ static __meminit void init_currently_emp
zone_init_free_lists(pgdat, zone, zone->spanned_pages);
}
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+static int __init first_active_region_index_in_nid(int nid)
+{
+ int i;
+ for (i = 0; early_node_map[i].end_pfn; i++) {
+ if (early_node_map[i].nid == nid)
+ return i;
+ }
+
+ return MAX_ACTIVE_REGIONS;
+}
+
+static int __init next_active_region_index_in_nid(unsigned int index, int nid)
+{
+ for (index = index + 1; early_node_map[index].end_pfn; index++) {
+ if (early_node_map[index].nid == nid)
+ return index;
+ }
+
+ return MAX_ACTIVE_REGIONS;
+}
+
+#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
+int __init early_pfn_to_nid(unsigned long pfn)
+{
+ int i;
+
+ for (i = 0; early_node_map[i].end_pfn; i++) {
+ unsigned long start_pfn = early_node_map[i].start_pfn;
+ unsigned long end_pfn = early_node_map[i].end_pfn;
+
+ if ((start_pfn <= pfn) && (pfn < end_pfn))
+ return early_node_map[i].nid;
+ }
+
+ return -1;
+}
+#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
+
+#define for_each_active_range_index_in_nid(i, nid) \
+ for (i = first_active_region_index_in_nid(nid); \
+ i != MAX_ACTIVE_REGIONS; \
+ i = next_active_region_index_in_nid(i, nid))
+
+void __init free_bootmem_with_active_regions(int nid,
+ unsigned long max_low_pfn)
+{
+ unsigned int i;
+ for_each_active_range_index_in_nid(i, nid) {
+ unsigned long size_pages = 0;
+ unsigned long end_pfn = early_node_map[i].end_pfn;
+ if (early_node_map[i].start_pfn >= max_low_pfn)
+ continue;
+
+ if (end_pfn > max_low_pfn)
+ end_pfn = max_low_pfn;
+
+ size_pages = end_pfn - early_node_map[i].start_pfn;
+ free_bootmem_node(NODE_DATA(early_node_map[i].nid),
+ PFN_PHYS(early_node_map[i].start_pfn),
+ PFN_PHYS(size_pages));
+ }
+}
+
+void __init memory_present_with_active_regions(int nid)
+{
+ unsigned int i;
+ for_each_active_range_index_in_nid(i, nid)
+ memory_present(early_node_map[i].nid,
+ early_node_map[i].start_pfn,
+ early_node_map[i].end_pfn);
+}
+
+void __init get_pfn_range_for_nid(unsigned int nid,
+ unsigned long *start_pfn, unsigned long *end_pfn)
+{
+ unsigned int i;
+ *start_pfn = -1UL;
+ *end_pfn = 0;
+
+ for_each_active_range_index_in_nid(i, nid) {
+ if (early_node_map[i].start_pfn < *start_pfn)
+ *start_pfn = early_node_map[i].start_pfn;
+
+ if (early_node_map[i].end_pfn > *end_pfn)
+ *end_pfn = early_node_map[i].end_pfn;
+ }
+
+ if (*start_pfn == -1UL) {
+ printk(KERN_WARNING "Node %u active with no memory\n", nid);
+ *start_pfn = 0;
+ }
+}
+
+unsigned long __init zone_present_pages_in_node(int nid,
+ unsigned long zone_type,
+ unsigned long *ignored)
+{
+ unsigned long node_start_pfn, node_end_pfn;
+ unsigned long zone_start_pfn, zone_end_pfn;
+
+ /* Get the start and end of the node and zone */
+ get_pfn_range_for_nid(nid, &node_start_pfn, &node_end_pfn);
+ zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
+ zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
+
+ /* Check that this node has pages within the zone's required range */
+ if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn)
+ return 0;
+
+ /* Move the zone boundaries inside the node if necessary */
+ if (zone_end_pfn > node_end_pfn)
+ zone_end_pfn = node_end_pfn;
+ if (zone_start_pfn < node_start_pfn)
+ zone_start_pfn = node_start_pfn;
+
+ /* Return the spanned pages */
+ return zone_end_pfn - zone_start_pfn;
+}
+
+static inline int __init pfn_range_in_zone(unsigned long start_pfn,
+ unsigned long end_pfn,
+ unsigned long zone_type)
+{
+ if (start_pfn < arch_zone_lowest_possible_pfn[zone_type])
+ return 0;
+
+ if (start_pfn >= arch_zone_highest_possible_pfn[zone_type])
+ return 0;
+
+ if (end_pfn < arch_zone_lowest_possible_pfn[zone_type])
+ return 0;
+
+ if (end_pfn >= arch_zone_highest_possible_pfn[zone_type])
+ return 0;
+
+ return 1;
+}
+
+unsigned long __init zone_absent_pages_in_node(int nid,
+ unsigned long zone_type,
+ unsigned long *ignored)
+{
+ int i = 0;
+ unsigned long prev_end_pfn = 0, hole_pages = 0;
+ unsigned long start_pfn;
+
+ /* Find the end_pfn of the first active range of pfns in the node */
+ i = first_active_region_index_in_nid(nid);
+ prev_end_pfn = early_node_map[i].start_pfn;
+
+ /* Find all holes for the node */
+ for (; i != MAX_ACTIVE_REGIONS;
+ i = next_active_region_index_in_nid(i, nid)) {
+
+ /* Increase the hole size if the hole is within the zone */
+ start_pfn = early_node_map[i].start_pfn;
+ if (pfn_range_in_zone(prev_end_pfn, start_pfn, zone_type)) {
+ BUG_ON(prev_end_pfn > start_pfn);
+ hole_pages += start_pfn - prev_end_pfn;
+ }
+
+ prev_end_pfn = early_node_map[i].end_pfn;
+ }
+
+ return hole_pages;
+}
+#else
+static inline unsigned long zone_present_pages_in_node(int nid,
+ unsigned long zone_type,
+ unsigned long *zones_size)
+{
+ return zones_size[zone_type];
+}
+
+static inline unsigned long zone_absent_pages_in_node(int nid,
+ unsigned long zone_type,
+ unsigned long *zholes_size)
+{
+ if (!zholes_size)
+ return 0;
+
+ return zholes_size[zone_type];
+}
+#endif
+
+static void __init calculate_node_totalpages(struct pglist_data *pgdat,
+ unsigned long *zones_size, unsigned long *zholes_size)
+{
+ unsigned long realtotalpages, totalpages = 0;
+ int i;
+
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ totalpages += zone_present_pages_in_node(pgdat->node_id, i,
+ zones_size);
+ }
+ pgdat->node_spanned_pages = totalpages;
+
+ realtotalpages = totalpages;
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ realtotalpages -=
+ zone_absent_pages_in_node(pgdat->node_id, i, zholes_size);
+ }
+ pgdat->node_present_pages = realtotalpages;
+ printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id,
+ realtotalpages);
+}
+
/*
* Set up the zone data structures:
* - mark all pages reserved
@@ -2070,10 +2273,9 @@ static void __init free_area_init_core(s
struct zone *zone = pgdat->node_zones + j;
unsigned long size, realsize;
- realsize = size = zones_size[j];
- if (zholes_size)
- realsize -= zholes_size[j];
-
+ size = zone_present_pages_in_node(nid, j, zones_size);
+ realsize = size - zone_absent_pages_in_node(nid, j,
+ zholes_size);
if (j < ZONE_HIGHMEM)
nr_kernel_pages += realsize;
nr_all_pages += realsize;
@@ -2140,13 +2342,137 @@ void __init free_area_init_node(int nid,
{
pgdat->node_id = nid;
pgdat->node_start_pfn = node_start_pfn;
- calculate_zone_totalpages(pgdat, zones_size, zholes_size);
+ calculate_node_totalpages(pgdat, zones_size, zholes_size);
alloc_node_mem_map(pgdat);
free_area_init_core(pgdat, zones_size, zholes_size);
}
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+void __init add_active_range(unsigned int nid, unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ unsigned int i;
+ unsigned long pages = end_pfn - start_pfn;
+
+ /* Merge with existing active regions if possible */
+ for (i = 0; early_node_map[i].end_pfn; i++) {
+ if (early_node_map[i].nid != nid)
+ continue;
+
+ if (early_node_map[i].end_pfn == start_pfn) {
+ early_node_map[i].end_pfn += pages;
+ return;
+ }
+
+ if (early_node_map[i].start_pfn == (start_pfn + pages)) {
+ early_node_map[i].start_pfn -= pages;
+ return;
+ }
+ }
+
+ /*
+ * Leave last entry NULL so we dont iterate off the end (we use
+ * entry.end_pfn to terminate the walk).
+ */
+ if (i >= MAX_ACTIVE_REGIONS - 1) {
+ printk(KERN_ERR "WARNING: too many memory regions in "
+ "numa code, truncating\n");
+ return;
+ }
+
+ early_node_map[i].nid = nid;
+ early_node_map[i].start_pfn = start_pfn;
+ early_node_map[i].end_pfn = end_pfn;
+}
+
+/* Compare two active node_active_regions */
+static int __init cmp_node_active_region(const void *a, const void *b)
+{
+ struct node_active_region *arange = (struct node_active_region *)a;
+ struct node_active_region *brange = (struct node_active_region *)b;
+
+ /* Done this way to avoid overflows */
+ if (arange->start_pfn > brange->start_pfn)
+ return 1;
+ if (arange->start_pfn < brange->start_pfn)
+ return -1;
+
+ return 0;
+}
+
+/* sort the node_map by start_pfn */
+static void __init sort_node_map(void)
+{
+ size_t num = 0;
+ while (early_node_map[num].end_pfn)
+ num++;
+
+ sort(early_node_map, num, sizeof(struct node_active_region),
+ cmp_node_active_region, NULL);
+}
+
+unsigned long __init find_min_pfn(void)
+{
+ int i;
+ unsigned long min_pfn = -1UL;
+
+ for (i = 0; early_node_map[i].end_pfn; i++) {
+ if (early_node_map[i].start_pfn < min_pfn)
+ min_pfn = early_node_map[i].start_pfn;
+ }
+
+ return min_pfn;
+}
+
+/* Find the lowest pfn in a node. This depends on a sorted early_node_map */
+unsigned long __init find_start_pfn_for_node(unsigned long nid)
+{
+ int i;
+
+ /* Assuming a sorted map, the first range found has the starting pfn */
+ for_each_active_range_index_in_nid(i, nid) {
+ return early_node_map[i].start_pfn;
+ }
+
+ /* nid does not exist in early_node_map */
+ printk(KERN_WARNING "Could not find start_pfn for node %lu\n", nid);
+ return 0;
+}
+
+void __init free_area_init_nodes(unsigned long arch_max_dma_pfn,
+ unsigned long arch_max_dma32_pfn,
+ unsigned long arch_max_low_pfn,
+ unsigned long arch_max_high_pfn)
+{
+ unsigned long nid;
+
+ /* Record where the zone boundaries are */
+ memset(arch_zone_lowest_possible_pfn, 0,
+ sizeof(arch_zone_lowest_possible_pfn));
+ memset(arch_zone_highest_possible_pfn, 0,
+ sizeof(arch_zone_highest_possible_pfn));
+ arch_zone_lowest_possible_pfn[ZONE_DMA] = find_min_pfn();
+ arch_zone_highest_possible_pfn[ZONE_DMA] = arch_max_dma_pfn;
+ arch_zone_lowest_possible_pfn[ZONE_DMA32] = arch_max_dma_pfn;
+ arch_zone_highest_possible_pfn[ZONE_DMA32] = arch_max_dma32_pfn;
+ arch_zone_lowest_possible_pfn[ZONE_NORMAL] = arch_max_dma32_pfn;
+ arch_zone_highest_possible_pfn[ZONE_NORMAL] = arch_max_low_pfn;
+ arch_zone_lowest_possible_pfn[ZONE_HIGHMEM] = arch_max_low_pfn;
+ arch_zone_highest_possible_pfn[ZONE_HIGHMEM] = arch_max_high_pfn;
+
+ /* Regions in the early_node_map can be in any order */
+ sort_node_map();
+
+ for_each_online_node(nid) {
+ pg_data_t *pgdat = NODE_DATA(nid);
+ free_area_init_node(nid, pgdat, NULL,
+ find_start_pfn_for_node(nid), NULL);
+ }
+}
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
+
#ifndef CONFIG_NEED_MULTIPLE_NODES
static bootmem_data_t contig_bootmem_data;
struct pglist_data contig_page_data = { .bdata = &contig_bootmem_data };
^ permalink raw reply
* [PATCH 3/6] Have x86 use add_active_range() and free_area_init_nodes
From: Mel Gorman @ 2006-04-11 10:40 UTC (permalink / raw)
To: davej, tony.luck, ak, linux-kernel, linuxppc-dev; +Cc: Mel Gorman
In-Reply-To: <20060411103946.18153.83059.sendpatchset@skynet>
Size zones and holes in an architecture independent manner for x86.
This has been boot tested on;
x86 with 4 CPUs, flatmem
x86 on NUMAQ
It needs to be boot tested on an x86 machine that uses SRAT.
Kconfig | 8 +---
kernel/setup.c | 19 +++-------
kernel/srat.c | 98 ----------------------------------------------------
mm/discontig.c | 59 ++++---------------------------
4 files changed, 17 insertions(+), 167 deletions(-)
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/i386/Kconfig linux-2.6.17-rc1-103-x86_use_init_nodes/arch/i386/Kconfig
--- linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/i386/Kconfig 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-103-x86_use_init_nodes/arch/i386/Kconfig 2006-04-10 10:53:52.000000000 +0100
@@ -563,12 +563,10 @@ config ARCH_SELECT_MEMORY_MODEL
def_bool y
depends on ARCH_SPARSEMEM_ENABLE
-source "mm/Kconfig"
+config ARCH_POPULATES_NODE_MAP
+ def_bool y
-config HAVE_ARCH_EARLY_PFN_TO_NID
- bool
- default y
- depends on NUMA
+source "mm/Kconfig"
config HIGHPTE
bool "Allocate 3rd-level pagetables from highmem"
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/i386/kernel/setup.c linux-2.6.17-rc1-103-x86_use_init_nodes/arch/i386/kernel/setup.c
--- linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/i386/kernel/setup.c 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-103-x86_use_init_nodes/arch/i386/kernel/setup.c 2006-04-10 10:53:52.000000000 +0100
@@ -1156,22 +1156,15 @@ static unsigned long __init setup_memory
void __init zone_sizes_init(void)
{
- unsigned long zones_size[MAX_NR_ZONES] = {0, 0, 0};
- unsigned int max_dma, low;
+ unsigned int max_dma;
+#ifndef CONFIG_HIGHMEM
+ unsigned long highend_pfn = max_low_pfn;
+#endif
max_dma = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
- low = max_low_pfn;
- if (low < max_dma)
- zones_size[ZONE_DMA] = low;
- else {
- zones_size[ZONE_DMA] = max_dma;
- zones_size[ZONE_NORMAL] = low - max_dma;
-#ifdef CONFIG_HIGHMEM
- zones_size[ZONE_HIGHMEM] = highend_pfn - low;
-#endif
- }
- free_area_init(zones_size);
+ add_active_range(0, 0, highend_pfn);
+ free_area_init_nodes(max_dma, max_dma, max_low_pfn, highend_pfn);
}
#else
extern unsigned long __init setup_memory(void);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/i386/kernel/srat.c linux-2.6.17-rc1-103-x86_use_init_nodes/arch/i386/kernel/srat.c
--- linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/i386/kernel/srat.c 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-103-x86_use_init_nodes/arch/i386/kernel/srat.c 2006-04-10 10:53:52.000000000 +0100
@@ -56,8 +56,6 @@ struct node_memory_chunk_s {
static struct node_memory_chunk_s node_memory_chunk[MAXCHUNKS];
static int num_memory_chunks; /* total number of memory chunks */
-static int zholes_size_init;
-static unsigned long zholes_size[MAX_NUMNODES * MAX_NR_ZONES];
extern void * boot_ioremap(unsigned long, unsigned long);
@@ -137,50 +135,6 @@ static void __init parse_memory_affinity
"enabled and removable" : "enabled" ) );
}
-#if MAX_NR_ZONES != 4
-#error "MAX_NR_ZONES != 4, chunk_to_zone requires review"
-#endif
-/* Take a chunk of pages from page frame cstart to cend and count the number
- * of pages in each zone, returned via zones[].
- */
-static __init void chunk_to_zones(unsigned long cstart, unsigned long cend,
- unsigned long *zones)
-{
- unsigned long max_dma;
- extern unsigned long max_low_pfn;
-
- int z;
- unsigned long rend;
-
- /* FIXME: MAX_DMA_ADDRESS and max_low_pfn are trying to provide
- * similarly scoped information and should be handled in a consistant
- * manner.
- */
- max_dma = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
-
- /* Split the hole into the zones in which it falls. Repeatedly
- * take the segment in which the remaining hole starts, round it
- * to the end of that zone.
- */
- memset(zones, 0, MAX_NR_ZONES * sizeof(long));
- while (cstart < cend) {
- if (cstart < max_dma) {
- z = ZONE_DMA;
- rend = (cend < max_dma)? cend : max_dma;
-
- } else if (cstart < max_low_pfn) {
- z = ZONE_NORMAL;
- rend = (cend < max_low_pfn)? cend : max_low_pfn;
-
- } else {
- z = ZONE_HIGHMEM;
- rend = cend;
- }
- zones[z] += rend - cstart;
- cstart = rend;
- }
-}
-
/*
* The SRAT table always lists ascending addresses, so can always
* assume that the first "start" address that you see is the real
@@ -233,7 +187,6 @@ static int __init acpi20_parse_srat(stru
memset(pxm_bitmap, 0, sizeof(pxm_bitmap)); /* init proximity domain bitmap */
memset(node_memory_chunk, 0, sizeof(node_memory_chunk));
- memset(zholes_size, 0, sizeof(zholes_size));
/* -1 in these maps means not available */
memset(pxm_to_nid_map, -1, sizeof(pxm_to_nid_map));
@@ -414,54 +367,3 @@ out_err:
printk("failed to get NUMA memory information from SRAT table\n");
return 0;
}
-
-/* For each node run the memory list to determine whether there are
- * any memory holes. For each hole determine which ZONE they fall
- * into.
- *
- * NOTE#1: this requires knowledge of the zone boundries and so
- * _cannot_ be performed before those are calculated in setup_memory.
- *
- * NOTE#2: we rely on the fact that the memory chunks are ordered by
- * start pfn number during setup.
- */
-static void __init get_zholes_init(void)
-{
- int nid;
- int c;
- int first;
- unsigned long end = 0;
-
- for_each_online_node(nid) {
- first = 1;
- for (c = 0; c < num_memory_chunks; c++){
- if (node_memory_chunk[c].nid == nid) {
- if (first) {
- end = node_memory_chunk[c].end_pfn;
- first = 0;
-
- } else {
- /* Record any gap between this chunk
- * and the previous chunk on this node
- * against the zones it spans.
- */
- chunk_to_zones(end,
- node_memory_chunk[c].start_pfn,
- &zholes_size[nid * MAX_NR_ZONES]);
- }
- }
- }
- }
-}
-
-unsigned long * __init get_zholes_size(int nid)
-{
- if (!zholes_size_init) {
- zholes_size_init++;
- get_zholes_init();
- }
- if (nid >= MAX_NUMNODES || !node_online(nid))
- printk("%s: nid = %d is invalid/offline. num_online_nodes = %d",
- __FUNCTION__, nid, num_online_nodes());
- return &zholes_size[nid * MAX_NR_ZONES];
-}
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/i386/mm/discontig.c linux-2.6.17-rc1-103-x86_use_init_nodes/arch/i386/mm/discontig.c
--- linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/i386/mm/discontig.c 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-103-x86_use_init_nodes/arch/i386/mm/discontig.c 2006-04-10 10:53:52.000000000 +0100
@@ -157,21 +157,6 @@ static void __init find_max_pfn_node(int
BUG();
}
-/* Find the owning node for a pfn. */
-int early_pfn_to_nid(unsigned long pfn)
-{
- int nid;
-
- for_each_node(nid) {
- if (node_end_pfn[nid] == 0)
- break;
- if (node_start_pfn[nid] <= pfn && node_end_pfn[nid] >= pfn)
- return nid;
- }
-
- return 0;
-}
-
/*
* Allocate memory for the pg_data_t for this node via a crude pre-bootmem
* method. For node zero take this from the bottom of memory, for
@@ -352,45 +337,17 @@ unsigned long __init setup_memory(void)
void __init zone_sizes_init(void)
{
int nid;
-
+ unsigned long max_dma_pfn;
for_each_online_node(nid) {
- unsigned long zones_size[MAX_NR_ZONES] = {0, 0, 0};
- unsigned long *zholes_size;
- unsigned int max_dma;
-
- unsigned long low = max_low_pfn;
- unsigned long start = node_start_pfn[nid];
- unsigned long high = node_end_pfn[nid];
-
- max_dma = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
-
- if (node_has_online_mem(nid)){
- if (start > low) {
-#ifdef CONFIG_HIGHMEM
- BUG_ON(start > high);
- zones_size[ZONE_HIGHMEM] = high - start;
-#endif
- } else {
- if (low < max_dma)
- zones_size[ZONE_DMA] = low;
- else {
- BUG_ON(max_dma > low);
- BUG_ON(low > high);
- zones_size[ZONE_DMA] = max_dma;
- zones_size[ZONE_NORMAL] = low - max_dma;
-#ifdef CONFIG_HIGHMEM
- zones_size[ZONE_HIGHMEM] = high - low;
-#endif
- }
- }
- }
-
- zholes_size = get_zholes_size(nid);
-
- free_area_init_node(nid, NODE_DATA(nid), zones_size, start,
- zholes_size);
+ if (node_has_online_mem(nid))
+ add_active_range(nid, node_start_pfn[nid],
+ node_end_pfn[nid]);
}
+
+ max_dma_pfn = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
+ free_area_init_nodes(max_dma_pfn, max_dma_pfn,
+ max_low_pfn, highend_pfn);
return;
}
^ permalink raw reply
* [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
From: Mel Gorman @ 2006-04-11 10:41 UTC (permalink / raw)
To: linuxppc-dev, davej, tony.luck, ak, linux-kernel; +Cc: Mel Gorman
In-Reply-To: <20060411103946.18153.83059.sendpatchset@skynet>
Size zones and holes in an architecture independent manner for x86_64.
This has only been boot tested on an x86_64 with NUMA and SRAT.
arch/x86_64/Kconfig | 3 ++
arch/x86_64/kernel/e820.c | 18 ++++++++++++
arch/x86_64/mm/init.c | 60 +---------------------------------------
arch/x86_64/mm/numa.c | 15 +++++-----
include/asm-x86_64/e820.h | 1
include/asm-x86_64/proto.h | 2 -
6 files changed, 32 insertions(+), 67 deletions(-)
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-103-x86_use_init_nodes/arch/x86_64/Kconfig linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/x86_64/Kconfig
--- linux-2.6.17-rc1-103-x86_use_init_nodes/arch/x86_64/Kconfig 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/x86_64/Kconfig 2006-04-10 10:54:38.000000000 +0100
@@ -73,6 +73,9 @@ config ARCH_MAY_HAVE_PC_FDC
bool
default y
+config ARCH_POPULATES_NODE_MAP
+ def_bool y
+
config DMI
bool
default y
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-103-x86_use_init_nodes/arch/x86_64/kernel/e820.c linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/x86_64/kernel/e820.c
--- linux-2.6.17-rc1-103-x86_use_init_nodes/arch/x86_64/kernel/e820.c 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/x86_64/kernel/e820.c 2006-04-10 10:54:38.000000000 +0100
@@ -220,6 +220,24 @@ e820_hole_size(unsigned long start_pfn,
return ((end - start) - ram) >> PAGE_SHIFT;
}
+/* Walk the e820 map and register active regions */
+unsigned long __init
+e820_register_active_regions(void)
+{
+ int i;
+ unsigned long start_pfn, end_pfn;
+ for (i = 0; i < e820.nr_map; i++) {
+ struct e820entry *ei = &e820.map[i];
+ if (ei->type != E820_RAM)
+ continue;
+
+ start_pfn = round_up(ei->addr, PAGE_SIZE);
+ end_pfn = round_down(ei->addr + ei->size, PAGE_SIZE);
+
+ add_active_range(0, start_pfn, end_pfn);
+ }
+}
+
/*
* Mark e820 reserved areas as busy for the resource manager.
*/
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-103-x86_use_init_nodes/arch/x86_64/mm/init.c linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/x86_64/mm/init.c
--- linux-2.6.17-rc1-103-x86_use_init_nodes/arch/x86_64/mm/init.c 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/x86_64/mm/init.c 2006-04-10 10:54:38.000000000 +0100
@@ -405,69 +405,13 @@ void __cpuinit zap_low_mappings(int cpu)
__flush_tlb_all();
}
-/* Compute zone sizes for the DMA and DMA32 zones in a node. */
-__init void
-size_zones(unsigned long *z, unsigned long *h,
- unsigned long start_pfn, unsigned long end_pfn)
-{
- int i;
- unsigned long w;
-
- for (i = 0; i < MAX_NR_ZONES; i++)
- z[i] = 0;
-
- if (start_pfn < MAX_DMA_PFN)
- z[ZONE_DMA] = MAX_DMA_PFN - start_pfn;
- if (start_pfn < MAX_DMA32_PFN) {
- unsigned long dma32_pfn = MAX_DMA32_PFN;
- if (dma32_pfn > end_pfn)
- dma32_pfn = end_pfn;
- z[ZONE_DMA32] = dma32_pfn - start_pfn;
- }
- z[ZONE_NORMAL] = end_pfn - start_pfn;
-
- /* Remove lower zones from higher ones. */
- w = 0;
- for (i = 0; i < MAX_NR_ZONES; i++) {
- if (z[i])
- z[i] -= w;
- w += z[i];
- }
-
- /* Compute holes */
- w = start_pfn;
- for (i = 0; i < MAX_NR_ZONES; i++) {
- unsigned long s = w;
- w += z[i];
- h[i] = e820_hole_size(s, w);
- }
-
- /* Add the space pace needed for mem_map to the holes too. */
- for (i = 0; i < MAX_NR_ZONES; i++)
- h[i] += (z[i] * sizeof(struct page)) / PAGE_SIZE;
-
- /* The 16MB DMA zone has the kernel and other misc mappings.
- Account them too */
- if (h[ZONE_DMA]) {
- h[ZONE_DMA] += dma_reserve;
- if (h[ZONE_DMA] >= z[ZONE_DMA]) {
- printk(KERN_WARNING
- "Kernel too large and filling up ZONE_DMA?\n");
- h[ZONE_DMA] = z[ZONE_DMA];
- }
- }
-}
-
#ifndef CONFIG_NUMA
void __init paging_init(void)
{
- unsigned long zones[MAX_NR_ZONES], holes[MAX_NR_ZONES];
-
memory_present(0, 0, end_pfn);
sparse_init();
- size_zones(zones, holes, 0, end_pfn);
- free_area_init_node(0, NODE_DATA(0), zones,
- __pa(PAGE_OFFSET) >> PAGE_SHIFT, holes);
+ e820_register_active_regions();
+ free_area_init_nodes(MAX_DMA_PFN, MAX_DMA32_PFN, end_pfn, end_pfn);
}
#endif
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-103-x86_use_init_nodes/arch/x86_64/mm/numa.c linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/x86_64/mm/numa.c
--- linux-2.6.17-rc1-103-x86_use_init_nodes/arch/x86_64/mm/numa.c 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/x86_64/mm/numa.c 2006-04-10 10:54:38.000000000 +0100
@@ -149,13 +149,12 @@ void __init setup_node_bootmem(int nodei
void __init setup_node_zones(int nodeid)
{
unsigned long start_pfn, end_pfn, memmapsize, limit;
- unsigned long zones[MAX_NR_ZONES];
- unsigned long holes[MAX_NR_ZONES];
start_pfn = node_start_pfn(nodeid);
end_pfn = node_end_pfn(nodeid);
+ add_active_range(nodeid, start_pfn, end_pfn);
- Dprintk(KERN_INFO "Setting up node %d %lx-%lx\n",
+ Dprintk(KERN_INFO "Setting up memmap for node %d %lx-%lx\n",
nodeid, start_pfn, end_pfn);
/* Try to allocate mem_map at end to not fill up precious <4GB
@@ -167,10 +166,6 @@ void __init setup_node_zones(int nodeid)
memmapsize, SMP_CACHE_BYTES,
round_down(limit - memmapsize, PAGE_SIZE),
limit);
-
- size_zones(zones, holes, start_pfn, end_pfn);
- free_area_init_node(nodeid, NODE_DATA(nodeid), zones,
- start_pfn, holes);
}
void __init numa_init_array(void)
@@ -312,12 +307,18 @@ static void __init arch_sparse_init(void
void __init paging_init(void)
{
int i;
+ unsigned long max_normal_pfn = 0;
arch_sparse_init();
for_each_online_node(i) {
setup_node_zones(i);
+ if (max_normal_pfn < node_end_pfn(i))
+ max_normal_pfn = node_end_pfn(i);
}
+
+ free_area_init_nodes(MAX_DMA_PFN, MAX_DMA32_PFN, max_normal_pfn,
+ max_normal_pfn);
}
/* [numa=off] */
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-103-x86_use_init_nodes/include/asm-x86_64/e820.h linux-2.6.17-rc1-104-x86_64_use_init_nodes/include/asm-x86_64/e820.h
--- linux-2.6.17-rc1-103-x86_use_init_nodes/include/asm-x86_64/e820.h 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-104-x86_64_use_init_nodes/include/asm-x86_64/e820.h 2006-04-10 10:54:38.000000000 +0100
@@ -53,6 +53,7 @@ extern void e820_bootmem_free(pg_data_t
extern void e820_setup_gap(void);
extern unsigned long e820_hole_size(unsigned long start_pfn,
unsigned long end_pfn);
+extern unsigned long e820_register_active_regions(void);
extern void __init parse_memopt(char *p, char **end);
extern void __init parse_memmapopt(char *p, char **end);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-103-x86_use_init_nodes/include/asm-x86_64/proto.h linux-2.6.17-rc1-104-x86_64_use_init_nodes/include/asm-x86_64/proto.h
--- linux-2.6.17-rc1-103-x86_use_init_nodes/include/asm-x86_64/proto.h 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-104-x86_64_use_init_nodes/include/asm-x86_64/proto.h 2006-04-10 10:54:38.000000000 +0100
@@ -24,8 +24,6 @@ extern void mtrr_bp_init(void);
#define mtrr_bp_init() do {} while (0)
#endif
extern void init_memory_mapping(unsigned long start, unsigned long end);
-extern void size_zones(unsigned long *z, unsigned long *h,
- unsigned long start_pfn, unsigned long end_pfn);
extern void system_call(void);
extern int kernel_syscall(void);
^ permalink raw reply
* [PATCH 2/6] Have Power use add_active_range() and free_area_init_nodes()
From: Mel Gorman @ 2006-04-11 10:40 UTC (permalink / raw)
To: davej, linuxppc-dev, tony.luck, ak, linux-kernel; +Cc: Mel Gorman
In-Reply-To: <20060411103946.18153.83059.sendpatchset@skynet>
Size zones and holes in an architecture independent manner for Power.
This has been boot tested on PPC64 with NUMA both enabled and disabled. It
has been compile tested for an older CHRP-based machine.
powerpc/Kconfig | 13 ++--
powerpc/mm/mem.c | 50 +++++----------
powerpc/mm/numa.c | 157 ++++---------------------------------------------
ppc/Kconfig | 3
ppc/mm/init.c | 21 +++---
5 files changed, 54 insertions(+), 190 deletions(-)
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-101-add_free_area_init_nodes/arch/powerpc/Kconfig linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/powerpc/Kconfig
--- linux-2.6.17-rc1-101-add_free_area_init_nodes/arch/powerpc/Kconfig 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/powerpc/Kconfig 2006-04-10 10:53:03.000000000 +0100
@@ -665,11 +665,16 @@ config ARCH_SPARSEMEM_DEFAULT
def_bool y
depends on SMP && PPC_PSERIES
-source "mm/Kconfig"
-
-config HAVE_ARCH_EARLY_PFN_TO_NID
+config ARCH_POPULATES_NODE_MAP
def_bool y
- depends on NEED_MULTIPLE_NODES
+
+# Value of 256 is MAX_LMB_REGIONS * 2
+config MAX_ACTIVE_REGIONS
+ int
+ default 256
+ depends on ARCH_POPULATES_NODE_MAP
+
+source "mm/Kconfig"
config ARCH_MEMORY_PROBE
def_bool y
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-101-add_free_area_init_nodes/arch/powerpc/mm/mem.c linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/powerpc/mm/mem.c
--- linux-2.6.17-rc1-101-add_free_area_init_nodes/arch/powerpc/mm/mem.c 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/powerpc/mm/mem.c 2006-04-10 10:53:03.000000000 +0100
@@ -252,30 +252,26 @@ void __init do_init_bootmem(void)
boot_mapsize = init_bootmem(start >> PAGE_SHIFT, total_pages);
- /* Add all physical memory to the bootmem map, mark each area
- * present.
- */
+ /* Add active regions with valid PFNs */
for (i = 0; i < lmb.memory.cnt; i++) {
unsigned long base = lmb.memory.region[i].base;
unsigned long size = lmb_size_bytes(&lmb.memory, i);
-#ifdef CONFIG_HIGHMEM
- if (base >= total_lowmem)
- continue;
- if (base + size > total_lowmem)
- size = total_lowmem - base;
-#endif
- free_bootmem(base, size);
+ add_active_range(0, base, base + size);
}
+ /* Add all physical memory to the bootmem map, mark each area
+ * present.
+ */
+ free_bootmem_with_active_regions(0, total_lowmem);
+
/* reserve the sections we're already using */
for (i = 0; i < lmb.reserved.cnt; i++)
reserve_bootmem(lmb.reserved.region[i].base,
lmb_size_bytes(&lmb.reserved, i));
/* XXX need to clip this if using highmem? */
- for (i = 0; i < lmb.memory.cnt; i++)
- memory_present(0, lmb_start_pfn(&lmb.memory, i),
- lmb_end_pfn(&lmb.memory, i));
+ memory_present_with_active_regions(0);
+
init_bootmem_done = 1;
}
@@ -284,8 +280,6 @@ void __init do_init_bootmem(void)
*/
void __init paging_init(void)
{
- unsigned long zones_size[MAX_NR_ZONES];
- unsigned long zholes_size[MAX_NR_ZONES];
unsigned long total_ram = lmb_phys_mem_size();
unsigned long top_of_ram = lmb_end_of_DRAM();
@@ -303,26 +297,18 @@ void __init paging_init(void)
top_of_ram, total_ram);
printk(KERN_INFO "Memory hole size: %ldMB\n",
(top_of_ram - total_ram) >> 20);
- /*
- * All pages are DMA-able so we put them all in the DMA zone.
- */
- memset(zones_size, 0, sizeof(zones_size));
- memset(zholes_size, 0, sizeof(zholes_size));
-
- zones_size[ZONE_DMA] = top_of_ram >> PAGE_SHIFT;
- zholes_size[ZONE_DMA] = (top_of_ram - total_ram) >> PAGE_SHIFT;
-
#ifdef CONFIG_HIGHMEM
- zones_size[ZONE_DMA] = total_lowmem >> PAGE_SHIFT;
- zones_size[ZONE_HIGHMEM] = (total_memory - total_lowmem) >> PAGE_SHIFT;
- zholes_size[ZONE_HIGHMEM] = (top_of_ram - total_ram) >> PAGE_SHIFT;
+ free_area_init_nodes(total_lowmem >> PAGE_SHIFT,
+ total_lowmem >> PAGE_SHIFT,
+ total_lowmem >> PAGE_SHIFT,
+ top_of_ram >> PAGE_SHIFT);
#else
- zones_size[ZONE_DMA] = top_of_ram >> PAGE_SHIFT;
- zholes_size[ZONE_DMA] = (top_of_ram - total_ram) >> PAGE_SHIFT;
-#endif /* CONFIG_HIGHMEM */
+ free_area_init_nodes(top_of_ram >> PAGE_SHIFT,
+ top_of_ram >> PAGE_SHIFT,
+ top_of_ram >> PAGE_SHIFT,
+ top_of_ram >> PAGE_SHIFT);
+#endif
- free_area_init_node(0, NODE_DATA(0), zones_size,
- __pa(PAGE_OFFSET) >> PAGE_SHIFT, zholes_size);
}
#endif /* ! CONFIG_NEED_MULTIPLE_NODES */
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-101-add_free_area_init_nodes/arch/powerpc/mm/numa.c linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/powerpc/mm/numa.c
--- linux-2.6.17-rc1-101-add_free_area_init_nodes/arch/powerpc/mm/numa.c 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/powerpc/mm/numa.c 2006-04-10 10:53:03.000000000 +0100
@@ -39,96 +39,6 @@ static bootmem_data_t __initdata plat_no
static int min_common_depth;
static int n_mem_addr_cells, n_mem_size_cells;
-/*
- * We need somewhere to store start/end/node for each region until we have
- * allocated the real node_data structures.
- */
-#define MAX_REGIONS (MAX_LMB_REGIONS*2)
-static struct {
- unsigned long start_pfn;
- unsigned long end_pfn;
- int nid;
-} init_node_data[MAX_REGIONS] __initdata;
-
-int __init early_pfn_to_nid(unsigned long pfn)
-{
- unsigned int i;
-
- for (i = 0; init_node_data[i].end_pfn; i++) {
- unsigned long start_pfn = init_node_data[i].start_pfn;
- unsigned long end_pfn = init_node_data[i].end_pfn;
-
- if ((start_pfn <= pfn) && (pfn < end_pfn))
- return init_node_data[i].nid;
- }
-
- return -1;
-}
-
-void __init add_region(unsigned int nid, unsigned long start_pfn,
- unsigned long pages)
-{
- unsigned int i;
-
- dbg("add_region nid %d start_pfn 0x%lx pages 0x%lx\n",
- nid, start_pfn, pages);
-
- for (i = 0; init_node_data[i].end_pfn; i++) {
- if (init_node_data[i].nid != nid)
- continue;
- if (init_node_data[i].end_pfn == start_pfn) {
- init_node_data[i].end_pfn += pages;
- return;
- }
- if (init_node_data[i].start_pfn == (start_pfn + pages)) {
- init_node_data[i].start_pfn -= pages;
- return;
- }
- }
-
- /*
- * Leave last entry NULL so we dont iterate off the end (we use
- * entry.end_pfn to terminate the walk).
- */
- if (i >= (MAX_REGIONS - 1)) {
- printk(KERN_ERR "WARNING: too many memory regions in "
- "numa code, truncating\n");
- return;
- }
-
- init_node_data[i].start_pfn = start_pfn;
- init_node_data[i].end_pfn = start_pfn + pages;
- init_node_data[i].nid = nid;
-}
-
-/* We assume init_node_data has no overlapping regions */
-void __init get_region(unsigned int nid, unsigned long *start_pfn,
- unsigned long *end_pfn, unsigned long *pages_present)
-{
- unsigned int i;
-
- *start_pfn = -1UL;
- *end_pfn = *pages_present = 0;
-
- for (i = 0; init_node_data[i].end_pfn; i++) {
- if (init_node_data[i].nid != nid)
- continue;
-
- *pages_present += init_node_data[i].end_pfn -
- init_node_data[i].start_pfn;
-
- if (init_node_data[i].start_pfn < *start_pfn)
- *start_pfn = init_node_data[i].start_pfn;
-
- if (init_node_data[i].end_pfn > *end_pfn)
- *end_pfn = init_node_data[i].end_pfn;
- }
-
- /* We didnt find a matching region, return start/end as 0 */
- if (*start_pfn == -1UL)
- *start_pfn = 0;
-}
-
static void __cpuinit map_cpu_to_node(int cpu, int node)
{
numa_cpu_lookup_table[cpu] = node;
@@ -449,8 +359,8 @@ new_range:
continue;
}
- add_region(nid, start >> PAGE_SHIFT,
- size >> PAGE_SHIFT);
+ add_active_range(nid, start >> PAGE_SHIFT,
+ (start >> PAGE_SHIFT) + (size >> PAGE_SHIFT));
if (--ranges)
goto new_range;
@@ -463,6 +373,7 @@ static void __init setup_nonnuma(void)
{
unsigned long top_of_ram = lmb_end_of_DRAM();
unsigned long total_ram = lmb_phys_mem_size();
+ unsigned long start_pfn, end_pfn;
unsigned int i;
printk(KERN_INFO "Top of RAM: 0x%lx, Total RAM: 0x%lx\n",
@@ -470,9 +381,11 @@ static void __init setup_nonnuma(void)
printk(KERN_INFO "Memory hole size: %ldMB\n",
(top_of_ram - total_ram) >> 20);
- for (i = 0; i < lmb.memory.cnt; ++i)
- add_region(0, lmb.memory.region[i].base >> PAGE_SHIFT,
- lmb_size_pages(&lmb.memory, i));
+ for (i = 0; i < lmb.memory.cnt; ++i) {
+ start_pfn = lmb.memory.region[i].base >> PAGE_SHIFT;
+ end_pfn = start_pfn + lmb_size_pages(&lmb.memory, i);
+ add_active_range(0, start_pfn, end_pfn);
+ }
node_set_online(0);
}
@@ -610,11 +523,11 @@ void __init do_init_bootmem(void)
(void *)(unsigned long)boot_cpuid);
for_each_online_node(nid) {
- unsigned long start_pfn, end_pfn, pages_present;
+ unsigned long start_pfn, end_pfn;
unsigned long bootmem_paddr;
unsigned long bootmap_pages;
- get_region(nid, &start_pfn, &end_pfn, &pages_present);
+ get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
/* Allocate the node structure node local if possible */
NODE_DATA(nid) = careful_allocation(nid,
@@ -647,19 +560,7 @@ void __init do_init_bootmem(void)
init_bootmem_node(NODE_DATA(nid), bootmem_paddr >> PAGE_SHIFT,
start_pfn, end_pfn);
- /* Add free regions on this node */
- for (i = 0; init_node_data[i].end_pfn; i++) {
- unsigned long start, end;
-
- if (init_node_data[i].nid != nid)
- continue;
-
- start = init_node_data[i].start_pfn << PAGE_SHIFT;
- end = init_node_data[i].end_pfn << PAGE_SHIFT;
-
- dbg("free_bootmem %lx %lx\n", start, end - start);
- free_bootmem_node(NODE_DATA(nid), start, end - start);
- }
+ free_bootmem_with_active_regions(nid, end_pfn);
/* Mark reserved regions on this node */
for (i = 0; i < lmb.reserved.cnt; i++) {
@@ -690,44 +591,14 @@ void __init do_init_bootmem(void)
}
}
- /* Add regions into sparsemem */
- for (i = 0; init_node_data[i].end_pfn; i++) {
- unsigned long start, end;
-
- if (init_node_data[i].nid != nid)
- continue;
-
- start = init_node_data[i].start_pfn;
- end = init_node_data[i].end_pfn;
-
- memory_present(nid, start, end);
- }
+ memory_present_with_active_regions(nid);
}
}
void __init paging_init(void)
{
- unsigned long zones_size[MAX_NR_ZONES];
- unsigned long zholes_size[MAX_NR_ZONES];
- int nid;
-
- memset(zones_size, 0, sizeof(zones_size));
- memset(zholes_size, 0, sizeof(zholes_size));
-
- for_each_online_node(nid) {
- unsigned long start_pfn, end_pfn, pages_present;
-
- get_region(nid, &start_pfn, &end_pfn, &pages_present);
-
- zones_size[ZONE_DMA] = end_pfn - start_pfn;
- zholes_size[ZONE_DMA] = zones_size[ZONE_DMA] - pages_present;
-
- dbg("free_area_init node %d %lx %lx (hole: %lx)\n", nid,
- zones_size[ZONE_DMA], start_pfn, zholes_size[ZONE_DMA]);
-
- free_area_init_node(nid, NODE_DATA(nid), zones_size, start_pfn,
- zholes_size);
- }
+ unsigned long end_pfn = lmb_end_of_DRAM();
+ free_area_init_nodes(end_pfn, end_pfn, end_pfn, end_pfn);
}
static int __init early_numa(char *p)
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-101-add_free_area_init_nodes/arch/ppc/Kconfig linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/ppc/Kconfig
--- linux-2.6.17-rc1-101-add_free_area_init_nodes/arch/ppc/Kconfig 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/ppc/Kconfig 2006-04-10 10:53:03.000000000 +0100
@@ -949,6 +949,9 @@ config NR_CPUS
config HIGHMEM
bool "High memory support"
+config ARCH_POPULATES_NODE_MAP
+ def_bool y
+
source kernel/Kconfig.hz
source kernel/Kconfig.preempt
source "mm/Kconfig"
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-101-add_free_area_init_nodes/arch/ppc/mm/init.c linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/ppc/mm/init.c
--- linux-2.6.17-rc1-101-add_free_area_init_nodes/arch/ppc/mm/init.c 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/ppc/mm/init.c 2006-04-10 10:53:03.000000000 +0100
@@ -359,8 +359,7 @@ void __init do_init_bootmem(void)
*/
void __init paging_init(void)
{
- unsigned long zones_size[MAX_NR_ZONES], i;
-
+ unsigned long start_pfn, end_pfn;
#ifdef CONFIG_HIGHMEM
map_page(PKMAP_BASE, 0, 0); /* XXX gross */
pkmap_page_table = pte_offset_kernel(pmd_offset(pgd_offset_k
@@ -371,18 +370,18 @@ void __init paging_init(void)
kmap_prot = PAGE_KERNEL;
#endif /* CONFIG_HIGHMEM */
- /*
- * All pages are DMA-able so we put them all in the DMA zone.
- */
- zones_size[ZONE_DMA] = total_lowmem >> PAGE_SHIFT;
- for (i = 1; i < MAX_NR_ZONES; i++)
- zones_size[i] = 0;
+ /* All pages are DMA-able so we put them all in the DMA zone. */
+ start_pfn = __pa(PAGE_OFFSET) >> PAGE_SHIFT;
+ end_pfn = start_pfn + (total_memory >> PAGE_SHIFT);
+ add_active_range(0, start_pfn, end_pfn);
#ifdef CONFIG_HIGHMEM
- zones_size[ZONE_HIGHMEM] = (total_memory - total_lowmem) >> PAGE_SHIFT;
+ free_area_init_nodes(total_lowmem, total_lowmem,
+ total_lowmem, total_memory);
+#else
+ free_area_init_nodes(total_memory, total_memory,
+ total_memory, total_memory);
#endif /* CONFIG_HIGHMEM */
-
- free_area_init(zones_size);
}
void __init mem_init(void)
^ permalink raw reply
* [PATCH 6/6] Break out memory initialisation code from page_alloc.c to mem_init.c
From: Mel Gorman @ 2006-04-11 10:41 UTC (permalink / raw)
To: davej, linuxppc-dev, tony.luck, ak, linux-kernel; +Cc: Mel Gorman
In-Reply-To: <20060411103946.18153.83059.sendpatchset@skynet>
page_alloc.c contains a large amount of memory initialisation code. This patch
breaks out the initialisation code to a separate file to make page_alloc.c
a bit easier to read.
Makefile | 2
mem_init.c | 1028 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
page_alloc.c | 1004 ----------------------------------------------------
3 files changed, 1029 insertions(+), 1005 deletions(-)
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-105-ia64_use_init_nodes/mm/Makefile linux-2.6.17-rc1-106-breakout_mem_init/mm/Makefile
--- linux-2.6.17-rc1-105-ia64_use_init_nodes/mm/Makefile 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-106-breakout_mem_init/mm/Makefile 2006-04-11 09:37:06.000000000 +0100
@@ -8,7 +8,7 @@ mmu-$(CONFIG_MMU) := fremap.o highmem.o
vmalloc.o
obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
- page_alloc.o page-writeback.o pdflush.o \
+ page_alloc.o mem_init.o page-writeback.o pdflush.o \
readahead.o swap.o truncate.o vmscan.o \
prio_tree.o util.o mmzone.o $(mmu-y)
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-105-ia64_use_init_nodes/mm/mem_init.c linux-2.6.17-rc1-106-breakout_mem_init/mm/mem_init.c
--- linux-2.6.17-rc1-105-ia64_use_init_nodes/mm/mem_init.c 2006-04-11 09:48:34.000000000 +0100
+++ linux-2.6.17-rc1-106-breakout_mem_init/mm/mem_init.c 2006-04-11 09:37:06.000000000 +0100
@@ -0,0 +1,1028 @@
+/*
+ * mm/mem_init.c
+ * Initialises the architecture independant view of memory. pgdats, zones, etc
+ *
+ * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds
+ * Swap reorganised 29.12.95, Stephen Tweedie
+ * Support of BIGMEM added by Gerhard Wichert, Siemens AG, July 1999
+ * Reshaped it to be a zoned allocator, Ingo Molnar, Red Hat, 1999
+ * Discontiguous memory support, Kanoj Sarcar, SGI, Nov 1999
+ * Zone balancing, Kanoj Sarcar, SGI, Jan 2000
+ * Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002
+ * (lots of bits borrowed from Ingo Molnar & Andrew Morton)
+ * Arch-independant zone size and hole calculation, Mel Gorman, IBM, Apr 2006
+ * (lots of bits taken from architecture code)
+ */
+#include <linux/config.h>
+#include <linux/sort.h>
+#include <linux/pfn.h>
+#include <linux/mm.h>
+#include <linux/bootmem.h>
+#include <linux/module.h>
+#include <linux/cpuset.h>
+#include <linux/cpu.h>
+#include <linux/cpuset.h>
+#include <linux/mempolicy.h>
+#include <linux/swap.h>
+#include <linux/sysctl.h>
+
+static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
+int percpu_pagelist_fraction;
+
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+ #ifdef CONFIG_MAX_ACTIVE_REGIONS
+ #define MAX_ACTIVE_REGIONS CONFIG_MAX_ACTIVE_REGIONS
+ #else
+ #define MAX_ACTIVE_REGIONS (MAX_NR_ZONES * MAX_NUMNODES + 1)
+ #endif
+
+ struct node_active_region __initdata early_node_map[MAX_ACTIVE_REGIONS];
+ unsigned long __initdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
+ unsigned long __initdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
+
+/*
+ * Builds allocation fallback zone lists.
+ *
+ * Add all populated zones of a node to the zonelist.
+ */
+static int __init build_zonelists_node(pg_data_t *pgdat,
+ struct zonelist *zonelist, int nr_zones, int zone_type)
+{
+ struct zone *zone;
+
+ BUG_ON(zone_type > ZONE_HIGHMEM);
+
+ do {
+ zone = pgdat->node_zones + zone_type;
+ if (populated_zone(zone)) {
+#ifndef CONFIG_HIGHMEM
+ BUG_ON(zone_type > ZONE_NORMAL);
+#endif
+ zonelist->zones[nr_zones++] = zone;
+ check_highest_zone(zone_type);
+ }
+ zone_type--;
+
+ } while (zone_type >= 0);
+ return nr_zones;
+}
+
+static inline int highest_zone(int zone_bits)
+{
+ int res = ZONE_NORMAL;
+ if (zone_bits & (__force int)__GFP_HIGHMEM)
+ res = ZONE_HIGHMEM;
+ if (zone_bits & (__force int)__GFP_DMA32)
+ res = ZONE_DMA32;
+ if (zone_bits & (__force int)__GFP_DMA)
+ res = ZONE_DMA;
+ return res;
+}
+
+#ifdef CONFIG_NUMA
+#define MAX_NODE_LOAD (num_online_nodes())
+static int __initdata node_load[MAX_NUMNODES];
+/**
+ * find_next_best_node - find the next node that should appear in a given node's fallback list
+ * @node: node whose fallback list we're appending
+ * @used_node_mask: nodemask_t of already used nodes
+ *
+ * We use a number of factors to determine which is the next node that should
+ * appear on a given node's fallback list. The node should not have appeared
+ * already in @node's fallback list, and it should be the next closest node
+ * according to the distance array (which contains arbitrary distance values
+ * from each node to each node in the system), and should also prefer nodes
+ * with no CPUs, since presumably they'll have very little allocation pressure
+ * on them otherwise.
+ * It returns -1 if no node is found.
+ */
+static int __init find_next_best_node(int node, nodemask_t *used_node_mask)
+{
+ int n, val;
+ int min_val = INT_MAX;
+ int best_node = -1;
+
+ /* Use the local node if we haven't already */
+ if (!node_isset(node, *used_node_mask)) {
+ node_set(node, *used_node_mask);
+ return node;
+ }
+
+ for_each_online_node(n) {
+ cpumask_t tmp;
+
+ /* Don't want a node to appear more than once */
+ if (node_isset(n, *used_node_mask))
+ continue;
+
+ /* Use the distance array to find the distance */
+ val = node_distance(node, n);
+
+ /* Penalize nodes under us ("prefer the next node") */
+ val += (n < node);
+
+ /* Give preference to headless and unused nodes */
+ tmp = node_to_cpumask(n);
+ if (!cpus_empty(tmp))
+ val += PENALTY_FOR_NODE_WITH_CPUS;
+
+ /* Slight preference for less loaded node */
+ val *= (MAX_NODE_LOAD*MAX_NUMNODES);
+ val += node_load[n];
+
+ if (val < min_val) {
+ min_val = val;
+ best_node = n;
+ }
+ }
+
+ if (best_node >= 0)
+ node_set(best_node, *used_node_mask);
+
+ return best_node;
+}
+
+static void __init build_zonelists(pg_data_t *pgdat)
+{
+ int i, j, k, node, local_node;
+ int prev_node, load;
+ struct zonelist *zonelist;
+ nodemask_t used_mask;
+
+ /* initialize zonelists */
+ for (i = 0; i < GFP_ZONETYPES; i++) {
+ zonelist = pgdat->node_zonelists + i;
+ zonelist->zones[0] = NULL;
+ }
+
+ /* NUMA-aware ordering of nodes */
+ local_node = pgdat->node_id;
+ load = num_online_nodes();
+ prev_node = local_node;
+ nodes_clear(used_mask);
+ while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
+ int distance = node_distance(local_node, node);
+
+ /*
+ * If another node is sufficiently far away then it is better
+ * to reclaim pages in a zone before going off node.
+ */
+ if (distance > RECLAIM_DISTANCE)
+ zone_reclaim_mode = 1;
+
+ /*
+ * We don't want to pressure a particular node.
+ * So adding penalty to the first node in same
+ * distance group to make it round-robin.
+ */
+
+ if (distance != node_distance(local_node, prev_node))
+ node_load[node] += load;
+ prev_node = node;
+ load--;
+ for (i = 0; i < GFP_ZONETYPES; i++) {
+ zonelist = pgdat->node_zonelists + i;
+ for (j = 0; zonelist->zones[j] != NULL; j++);
+
+ k = highest_zone(i);
+
+ j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
+ zonelist->zones[j] = NULL;
+ }
+ }
+}
+
+#else /* CONFIG_NUMA */
+
+static void __init build_zonelists(pg_data_t *pgdat)
+{
+ int i, j, k, node, local_node;
+
+ local_node = pgdat->node_id;
+ for (i = 0; i < GFP_ZONETYPES; i++) {
+ struct zonelist *zonelist;
+
+ zonelist = pgdat->node_zonelists + i;
+
+ j = 0;
+ k = highest_zone(i);
+ j = build_zonelists_node(pgdat, zonelist, j, k);
+ /*
+ * Now we build the zonelist so that it contains the zones
+ * of all the other nodes.
+ * We don't want to pressure a particular node, so when
+ * building the zones for node N, we make sure that the
+ * zones coming right after the local ones are those from
+ * node N+1 (modulo N)
+ */
+ for (node = local_node + 1; node < MAX_NUMNODES; node++) {
+ if (!node_online(node))
+ continue;
+ j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
+ }
+ for (node = 0; node < local_node; node++) {
+ if (!node_online(node))
+ continue;
+ j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
+ }
+
+ zonelist->zones[j] = NULL;
+ }
+}
+
+#endif /* CONFIG_NUMA */
+
+void __init build_all_zonelists(void)
+{
+ int i;
+
+ for_each_online_node(i)
+ build_zonelists(NODE_DATA(i));
+ printk("Built %i zonelists\n", num_online_nodes());
+ cpuset_init_current_mems_allowed();
+}
+
+/*
+ * Helper functions to size the waitqueue hash table.
+ * Essentially these want to choose hash table sizes sufficiently
+ * large so that collisions trying to wait on pages are rare.
+ * But in fact, the number of active page waitqueues on typical
+ * systems is ridiculously low, less than 200. So this is even
+ * conservative, even though it seems large.
+ *
+ * The constant PAGES_PER_WAITQUEUE specifies the ratio of pages to
+ * waitqueues, i.e. the size of the waitq table given the number of pages.
+ */
+#define PAGES_PER_WAITQUEUE 256
+
+static inline unsigned long wait_table_size(unsigned long pages)
+{
+ unsigned long size = 1;
+
+ pages /= PAGES_PER_WAITQUEUE;
+
+ while (size < pages)
+ size <<= 1;
+
+ /*
+ * Once we have dozens or even hundreds of threads sleeping
+ * on IO we've got bigger problems than wait queue collision.
+ * Limit the size of the wait table to a reasonable size.
+ */
+ size = min(size, 4096UL);
+
+ return max(size, 4UL);
+}
+
+/*
+ * This is an integer logarithm so that shifts can be used later
+ * to extract the more random high bits from the multiplicative
+ * hash function before the remainder is taken.
+ */
+static inline unsigned long wait_table_bits(unsigned long size)
+{
+ return ffz(~size);
+}
+
+#define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
+
+#ifndef __HAVE_ARCH_MEMMAP_INIT
+#define memmap_init(size, nid, zone, start_pfn) \
+ memmap_init_zone((size), (nid), (zone), (start_pfn))
+#endif
+
+/*
+ * Initially all pages are reserved - free ones are freed
+ * up by free_all_bootmem() once the early boot process is
+ * done. Non-atomic initialization, single-pass.
+ */
+void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
+ unsigned long start_pfn)
+{
+ struct page *page;
+ unsigned long end_pfn = start_pfn + size;
+ unsigned long pfn;
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+ if (!early_pfn_valid(pfn))
+ continue;
+ page = pfn_to_page(pfn);
+ set_page_links(page, zone, nid, pfn);
+ init_page_count(page);
+ reset_page_mapcount(page);
+ SetPageReserved(page);
+ INIT_LIST_HEAD(&page->lru);
+#ifdef WANT_PAGE_VIRTUAL
+ /* The shift won't overflow because ZONE_NORMAL is below 4G. */
+ if (!is_highmem_idx(zone))
+ set_page_address(page, __va(pfn << PAGE_SHIFT));
+#endif
+ }
+}
+
+void zone_init_free_lists(struct pglist_data *pgdat, struct zone *zone,
+ unsigned long size)
+{
+ int order;
+ for (order = 0; order < MAX_ORDER ; order++) {
+ INIT_LIST_HEAD(&zone->free_area[order].free_list);
+ zone->free_area[order].nr_free = 0;
+ }
+}
+
+#define ZONETABLE_INDEX(x, zone_nr) ((x << ZONES_SHIFT) | zone_nr)
+void zonetable_add(struct zone *zone, int nid, int zid, unsigned long pfn,
+ unsigned long size)
+{
+ unsigned long snum = pfn_to_section_nr(pfn);
+ unsigned long end = pfn_to_section_nr(pfn + size);
+
+ if (FLAGS_HAS_NODE)
+ zone_table[ZONETABLE_INDEX(nid, zid)] = zone;
+ else
+ for (; snum <= end; snum++)
+ zone_table[ZONETABLE_INDEX(snum, zid)] = zone;
+}
+
+static __meminit
+void zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
+{
+ int i;
+ struct pglist_data *pgdat = zone->zone_pgdat;
+
+ /*
+ * The per-page waitqueue mechanism uses hashed waitqueues
+ * per zone.
+ */
+ zone->wait_table_size = wait_table_size(zone_size_pages);
+ zone->wait_table_bits = wait_table_bits(zone->wait_table_size);
+ zone->wait_table = (wait_queue_head_t *)
+ alloc_bootmem_node(pgdat, zone->wait_table_size
+ * sizeof(wait_queue_head_t));
+
+ for(i = 0; i < zone->wait_table_size; ++i)
+ init_waitqueue_head(zone->wait_table + i);
+}
+
+/*
+ * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist
+ * to the value high for the pageset p.
+ */
+static void setup_pagelist_highmark(struct per_cpu_pageset *p,
+ unsigned long high)
+{
+ struct per_cpu_pages *pcp;
+
+ pcp = &p->pcp[0]; /* hot list */
+ pcp->high = high;
+ pcp->batch = max(1UL, high/4);
+ if ((high/4) > (PAGE_SHIFT * 8))
+ pcp->batch = PAGE_SHIFT * 8;
+}
+
+/*
+ * percpu_pagelist_fraction - changes the pcp->high for each zone on each
+ * cpu. It is the fraction of total pages in each zone that a hot per cpu pagelist
+ * can have before it gets flushed back to buddy allocator.
+ */
+int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
+ struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+ struct zone *zone;
+ unsigned int cpu;
+ int ret;
+
+ ret = proc_dointvec_minmax(table, write, file, buffer, length, ppos);
+ if (!write || (ret == -EINVAL))
+ return ret;
+ for_each_zone(zone) {
+ for_each_online_cpu(cpu) {
+ unsigned long high;
+ high = zone->present_pages / percpu_pagelist_fraction;
+ setup_pagelist_highmark(zone_pcp(zone, cpu), high);
+ }
+ }
+ return 0;
+}
+
+static int __cpuinit zone_batchsize(struct zone *zone)
+{
+ int batch;
+
+ /*
+ * The per-cpu-pages pools are set to around 1000th of the
+ * size of the zone. But no more than 1/2 of a meg.
+ *
+ * OK, so we don't know how big the cache is. So guess.
+ */
+ batch = zone->present_pages / 1024;
+ if (batch * PAGE_SIZE > 512 * 1024)
+ batch = (512 * 1024) / PAGE_SIZE;
+ batch /= 4; /* We effectively *= 4 below */
+ if (batch < 1)
+ batch = 1;
+
+ /*
+ * Clamp the batch to a 2^n - 1 value. Having a power
+ * of 2 value was found to be more likely to have
+ * suboptimal cache aliasing properties in some cases.
+ *
+ * For example if 2 tasks are alternately allocating
+ * batches of pages, one task can end up with a lot
+ * of pages of one half of the possible page colors
+ * and the other with pages of the other colors.
+ */
+ batch = (1 << (fls(batch + batch/2)-1)) - 1;
+
+ return batch;
+}
+
+inline void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
+{
+ struct per_cpu_pages *pcp;
+
+ memset(p, 0, sizeof(*p));
+
+ pcp = &p->pcp[0]; /* hot */
+ pcp->count = 0;
+ pcp->high = 6 * batch;
+ pcp->batch = max(1UL, 1 * batch);
+ INIT_LIST_HEAD(&pcp->list);
+
+ pcp = &p->pcp[1]; /* cold*/
+ pcp->count = 0;
+ pcp->high = 2 * batch;
+ pcp->batch = max(1UL, batch/2);
+ INIT_LIST_HEAD(&pcp->list);
+}
+
+#ifdef CONFIG_NUMA
+/*
+ * Boot pageset table. One per cpu which is going to be used for all
+ * zones and all nodes. The parameters will be set in such a way
+ * that an item put on a list will immediately be handed over to
+ * the buddy list. This is safe since pageset manipulation is done
+ * with interrupts disabled.
+ *
+ * Some NUMA counter updates may also be caught by the boot pagesets.
+ *
+ * The boot_pagesets must be kept even after bootup is complete for
+ * unused processors and/or zones. They do play a role for bootstrapping
+ * hotplugged processors.
+ *
+ * zoneinfo_show() and maybe other functions do
+ * not check if the processor is online before following the pageset pointer.
+ * Other parts of the kernel may not check if the zone is available.
+ */
+static struct per_cpu_pageset boot_pageset[NR_CPUS];
+
+/*
+ * Dynamically allocate memory for the
+ * per cpu pageset array in struct zone.
+ */
+static int __cpuinit process_zones(int cpu)
+{
+ struct zone *zone, *dzone;
+
+ for_each_zone(zone) {
+
+ zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
+ GFP_KERNEL, cpu_to_node(cpu));
+ if (!zone_pcp(zone, cpu))
+ goto bad;
+
+ setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
+
+ if (percpu_pagelist_fraction)
+ setup_pagelist_highmark(zone_pcp(zone, cpu),
+ (zone->present_pages / percpu_pagelist_fraction));
+ }
+
+ return 0;
+bad:
+ for_each_zone(dzone) {
+ if (dzone == zone)
+ break;
+ kfree(zone_pcp(dzone, cpu));
+ zone_pcp(dzone, cpu) = NULL;
+ }
+ return -ENOMEM;
+}
+
+static inline void free_zone_pagesets(int cpu)
+{
+ struct zone *zone;
+
+ for_each_zone(zone) {
+ struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
+
+ zone_pcp(zone, cpu) = NULL;
+ kfree(pset);
+ }
+}
+
+static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
+ unsigned long action,
+ void *hcpu)
+{
+ int cpu = (long)hcpu;
+ int ret = NOTIFY_OK;
+
+ switch (action) {
+ case CPU_UP_PREPARE:
+ if (process_zones(cpu))
+ ret = NOTIFY_BAD;
+ break;
+ case CPU_UP_CANCELED:
+ case CPU_DEAD:
+ free_zone_pagesets(cpu);
+ break;
+ default:
+ break;
+ }
+ return ret;
+}
+
+static struct notifier_block pageset_notifier =
+ { &pageset_cpuup_callback, NULL, 0 };
+
+void __init setup_per_cpu_pageset(void)
+{
+ int err;
+
+ /* Initialize per_cpu_pageset for cpu 0.
+ * A cpuup callback will do this for every cpu
+ * as it comes online
+ */
+ err = process_zones(smp_processor_id());
+ BUG_ON(err);
+ register_cpu_notifier(&pageset_notifier);
+}
+#endif
+
+static __meminit void zone_pcp_init(struct zone *zone)
+{
+ int cpu;
+ unsigned long batch = zone_batchsize(zone);
+
+ for (cpu = 0; cpu < NR_CPUS; cpu++) {
+#ifdef CONFIG_NUMA
+ /* Early boot. Slab allocator not functional yet */
+ zone_pcp(zone, cpu) = &boot_pageset[cpu];
+ setup_pageset(&boot_pageset[cpu],0);
+#else
+ setup_pageset(zone_pcp(zone,cpu), batch);
+#endif
+ }
+ if (zone->present_pages)
+ printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%lu\n",
+ zone->name, zone->present_pages, batch);
+}
+
+static __meminit void init_currently_empty_zone(struct zone *zone,
+ unsigned long zone_start_pfn, unsigned long size)
+{
+ struct pglist_data *pgdat = zone->zone_pgdat;
+
+ zone_wait_table_init(zone, size);
+ pgdat->nr_zones = zone_idx(zone) + 1;
+
+ zone->zone_start_pfn = zone_start_pfn;
+
+ memmap_init(size, pgdat->node_id, zone_idx(zone), zone_start_pfn);
+
+ zone_init_free_lists(pgdat, zone, zone->spanned_pages);
+}
+
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+static int __init first_active_region_index_in_nid(int nid)
+{
+ int i;
+ for (i = 0; early_node_map[i].end_pfn; i++) {
+ if (early_node_map[i].nid == nid)
+ return i;
+ }
+
+ return MAX_ACTIVE_REGIONS;
+}
+
+static int __init next_active_region_index_in_nid(unsigned int index, int nid)
+{
+ for (index = index + 1; early_node_map[index].end_pfn; index++) {
+ if (early_node_map[index].nid == nid)
+ return index;
+ }
+
+ return MAX_ACTIVE_REGIONS;
+}
+
+#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
+int __init early_pfn_to_nid(unsigned long pfn)
+{
+ int i;
+
+ for (i = 0; early_node_map[i].end_pfn; i++) {
+ unsigned long start_pfn = early_node_map[i].start_pfn;
+ unsigned long end_pfn = early_node_map[i].end_pfn;
+
+ if ((start_pfn <= pfn) && (pfn < end_pfn))
+ return early_node_map[i].nid;
+ }
+
+ return -1;
+}
+#endif
+
+#define for_each_active_range_index_in_nid(i, nid) \
+ for (i = first_active_region_index_in_nid(nid); \
+ i != MAX_ACTIVE_REGIONS; \
+ i = next_active_region_index_in_nid(i, nid))
+
+void __init free_bootmem_with_active_regions(int nid,
+ unsigned long max_low_pfn)
+{
+ unsigned int i;
+ for_each_active_range_index_in_nid(i, nid) {
+ unsigned long size_pages = 0;
+ unsigned long end_pfn = early_node_map[i].end_pfn;
+ if (early_node_map[i].start_pfn >= max_low_pfn)
+ continue;
+
+ if (end_pfn > max_low_pfn)
+ end_pfn = max_low_pfn;
+
+ size_pages = end_pfn - early_node_map[i].start_pfn;
+ free_bootmem_node(NODE_DATA(early_node_map[i].nid),
+ PFN_PHYS(early_node_map[i].start_pfn),
+ PFN_PHYS(size_pages));
+ }
+}
+
+void __init memory_present_with_active_regions(int nid)
+{
+ unsigned int i;
+ for_each_active_range_index_in_nid(i, nid)
+ memory_present(early_node_map[i].nid,
+ early_node_map[i].start_pfn,
+ early_node_map[i].end_pfn);
+}
+
+void __init get_pfn_range_for_nid(unsigned int nid,
+ unsigned long *start_pfn, unsigned long *end_pfn)
+{
+ unsigned int i;
+ *start_pfn = -1UL;
+ *end_pfn = 0;
+
+ for_each_active_range_index_in_nid(i, nid) {
+ if (early_node_map[i].start_pfn < *start_pfn)
+ *start_pfn = early_node_map[i].start_pfn;
+
+ if (early_node_map[i].end_pfn > *end_pfn)
+ *end_pfn = early_node_map[i].end_pfn;
+ }
+
+ if (*start_pfn == -1UL) {
+ printk(KERN_WARNING "Node %u active with no memory\n", nid);
+ *start_pfn = 0;
+ }
+}
+
+unsigned long __init zone_present_pages_in_node(int nid,
+ unsigned long zone_type,
+ unsigned long *ignored)
+{
+ unsigned long node_start_pfn, node_end_pfn;
+ unsigned long zone_start_pfn, zone_end_pfn;
+
+ /* Get the start and end of the node and zone */
+ get_pfn_range_for_nid(nid, &node_start_pfn, &node_end_pfn);
+ zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
+ zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
+
+ /* Check that this node has pages within the zone's required range */
+ if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn)
+ return 0;
+
+ /* Move the zone boundaries inside the node if necessary */
+ if (zone_end_pfn > node_end_pfn)
+ zone_end_pfn = node_end_pfn;
+ if (zone_start_pfn < node_start_pfn)
+ zone_start_pfn = node_start_pfn;
+
+ /* Return the spanned pages */
+ return zone_end_pfn - zone_start_pfn;
+}
+
+static inline int __init pfn_range_in_zone(unsigned long start_pfn,
+ unsigned long end_pfn,
+ unsigned long zone_type)
+{
+ if (start_pfn < arch_zone_lowest_possible_pfn[zone_type])
+ return 0;
+
+ if (start_pfn >= arch_zone_highest_possible_pfn[zone_type])
+ return 0;
+
+ if (end_pfn < arch_zone_lowest_possible_pfn[zone_type])
+ return 0;
+
+ if (end_pfn >= arch_zone_highest_possible_pfn[zone_type])
+ return 0;
+
+ return 1;
+}
+
+unsigned long __init zone_absent_pages_in_node(int nid,
+ unsigned long zone_type,
+ unsigned long *ignored)
+{
+ int i = 0;
+ unsigned long prev_end_pfn = 0, hole_pages = 0;
+ unsigned long start_pfn;
+
+ /* Find the end_pfn of the first active range of pfns in the node */
+ i = first_active_region_index_in_nid(nid);
+ prev_end_pfn = early_node_map[i].start_pfn;
+
+ /* Find all holes for the node */
+ for (; i != MAX_ACTIVE_REGIONS;
+ i = next_active_region_index_in_nid(i, nid)) {
+
+ /* Increase the hole size if the hole is within the zone */
+ start_pfn = early_node_map[i].start_pfn;
+ if (pfn_range_in_zone(prev_end_pfn, start_pfn, zone_type)) {
+ BUG_ON(prev_end_pfn > start_pfn);
+ hole_pages += start_pfn - prev_end_pfn;
+ }
+
+ prev_end_pfn = early_node_map[i].end_pfn;
+ }
+
+ return hole_pages;
+}
+#else
+static inline unsigned long zone_present_pages_in_node(int nid,
+ unsigned long zone_type,
+ unsigned long *zones_size)
+{
+ return zones_size[zone_type];
+}
+
+static inline unsigned long zone_absent_pages_in_node(int nid,
+ unsigned long zone_type,
+ unsigned long *zholes_size)
+{
+ if (!zholes_size)
+ return 0;
+
+ return zholes_size[zone_type];
+}
+#endif
+
+static void __init calculate_node_totalpages(struct pglist_data *pgdat,
+ unsigned long *zones_size, unsigned long *zholes_size)
+{
+ unsigned long realtotalpages, totalpages = 0;
+ int i;
+
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ totalpages += zone_present_pages_in_node(pgdat->node_id, i,
+ zones_size);
+ }
+ pgdat->node_spanned_pages = totalpages;
+
+ realtotalpages = totalpages;
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ realtotalpages -=
+ zone_absent_pages_in_node(pgdat->node_id, i, zholes_size);
+ }
+ pgdat->node_present_pages = realtotalpages;
+ printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id,
+ realtotalpages);
+}
+
+/*
+ * Set up the zone data structures:
+ * - mark all pages reserved
+ * - mark all memory queues empty
+ * - clear the memory bitmaps
+ */
+static void __init free_area_init_core(struct pglist_data *pgdat,
+ unsigned long *zones_size, unsigned long *zholes_size)
+{
+ unsigned long j;
+ int nid = pgdat->node_id;
+ unsigned long zone_start_pfn = pgdat->node_start_pfn;
+
+ pgdat_resize_init(pgdat);
+ pgdat->nr_zones = 0;
+ init_waitqueue_head(&pgdat->kswapd_wait);
+ pgdat->kswapd_max_order = 0;
+
+ for (j = 0; j < MAX_NR_ZONES; j++) {
+ struct zone *zone = pgdat->node_zones + j;
+ unsigned long size, realsize;
+
+ size = zone_present_pages_in_node(nid, j, zones_size);
+ realsize = size - zone_absent_pages_in_node(nid, j,
+ zholes_size);
+ if (j < ZONE_HIGHMEM)
+ nr_kernel_pages += realsize;
+ nr_all_pages += realsize;
+
+ zone->spanned_pages = size;
+ zone->present_pages = realsize;
+ zone->name = zone_names[j];
+ spin_lock_init(&zone->lock);
+ spin_lock_init(&zone->lru_lock);
+ zone_seqlock_init(zone);
+ zone->zone_pgdat = pgdat;
+ zone->free_pages = 0;
+
+ zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
+
+ zone_pcp_init(zone);
+ INIT_LIST_HEAD(&zone->active_list);
+ INIT_LIST_HEAD(&zone->inactive_list);
+ zone->nr_scan_active = 0;
+ zone->nr_scan_inactive = 0;
+ zone->nr_active = 0;
+ zone->nr_inactive = 0;
+ atomic_set(&zone->reclaim_in_progress, 0);
+ if (!size)
+ continue;
+
+ zonetable_add(zone, nid, j, zone_start_pfn, size);
+ init_currently_empty_zone(zone, zone_start_pfn, size);
+ zone_start_pfn += size;
+ }
+}
+
+static void __init alloc_node_mem_map(struct pglist_data *pgdat)
+{
+ /* Skip empty nodes */
+ if (!pgdat->node_spanned_pages)
+ return;
+
+#ifdef CONFIG_FLAT_NODE_MEM_MAP
+ /* ia64 gets its own node_mem_map, before this, without bootmem */
+ if (!pgdat->node_mem_map) {
+ unsigned long size;
+ struct page *map;
+
+ size = (pgdat->node_spanned_pages + 1) * sizeof(struct page);
+ map = alloc_remap(pgdat->node_id, size);
+ if (!map)
+ map = alloc_bootmem_node(pgdat, size);
+ pgdat->node_mem_map = map;
+ }
+#ifdef CONFIG_FLATMEM
+ /*
+ * With no DISCONTIG, the global mem_map is just set as node 0's
+ */
+ if (pgdat == NODE_DATA(0))
+ mem_map = NODE_DATA(0)->node_mem_map;
+#endif
+#endif /* CONFIG_FLAT_NODE_MEM_MAP */
+}
+
+void __init free_area_init_node(int nid, struct pglist_data *pgdat,
+ unsigned long *zones_size, unsigned long node_start_pfn,
+ unsigned long *zholes_size)
+{
+ pgdat->node_id = nid;
+ pgdat->node_start_pfn = node_start_pfn;
+ calculate_node_totalpages(pgdat, zones_size, zholes_size);
+
+ alloc_node_mem_map(pgdat);
+
+ free_area_init_core(pgdat, zones_size, zholes_size);
+}
+
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+void __init add_active_range(unsigned int nid, unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ unsigned int i;
+ unsigned long pages = end_pfn - start_pfn;
+
+ /* Merge with existing active regions if possible */
+ for (i = 0; early_node_map[i].end_pfn; i++) {
+ if (early_node_map[i].nid != nid)
+ continue;
+
+ if (early_node_map[i].end_pfn == start_pfn) {
+ early_node_map[i].end_pfn += pages;
+ return;
+ }
+
+ if (early_node_map[i].start_pfn == (start_pfn + pages)) {
+ early_node_map[i].start_pfn -= pages;
+ return;
+ }
+ }
+
+ /*
+ * Leave last entry NULL so we dont iterate off the end (we use
+ * entry.end_pfn to terminate the walk).
+ */
+ if (i >= MAX_ACTIVE_REGIONS - 1) {
+ printk(KERN_ERR "WARNING: too many memory regions in "
+ "numa code, truncating\n");
+ return;
+ }
+
+ early_node_map[i].nid = nid;
+ early_node_map[i].start_pfn = start_pfn;
+ early_node_map[i].end_pfn = end_pfn;
+}
+
+/* Compare two active node_active_regions */
+static int __init cmp_node_active_region(const void *a, const void *b)
+{
+ struct node_active_region *arange = (struct node_active_region *)a;
+ struct node_active_region *brange = (struct node_active_region *)b;
+
+ /* Done this way to avoid overflows */
+ if (arange->start_pfn > brange->start_pfn)
+ return 1;
+ if (arange->start_pfn < brange->start_pfn)
+ return -1;
+
+ return 0;
+}
+
+/* sort the node_map by start_pfn */
+static void __init sort_node_map(void)
+{
+ size_t num = 0;
+ while (early_node_map[num].end_pfn)
+ num++;
+
+ sort(early_node_map, num, sizeof(struct node_active_region),
+ cmp_node_active_region, NULL);
+}
+
+unsigned long __init find_min_pfn(void)
+{
+ int i;
+ unsigned long min_pfn = -1UL;
+
+ for (i = 0; early_node_map[i].end_pfn; i++) {
+ if (early_node_map[i].start_pfn < min_pfn)
+ min_pfn = early_node_map[i].start_pfn;
+ }
+
+ return min_pfn;
+}
+
+/* Find the lowest pfn in a node. This depends on a sorted early_node_map */
+unsigned long __init find_start_pfn_for_node(unsigned long nid)
+{
+ int i;
+
+ /* Assuming a sorted map, the first range found has the starting pfn */
+ for_each_active_range_index_in_nid(i, nid) {
+ return early_node_map[i].start_pfn;
+ }
+
+ /* nid does not exist in early_node_map */
+ printk(KERN_WARNING "Could not find start_pfn for node %lu\n", nid);
+ return 0;
+}
+
+void __init free_area_init_nodes(unsigned long arch_max_dma_pfn,
+ unsigned long arch_max_dma32_pfn,
+ unsigned long arch_max_low_pfn,
+ unsigned long arch_max_high_pfn)
+{
+ unsigned long nid;
+
+ /* Record where the zone boundaries are */
+ memset(arch_zone_lowest_possible_pfn, 0,
+ sizeof(arch_zone_lowest_possible_pfn));
+ memset(arch_zone_highest_possible_pfn, 0,
+ sizeof(arch_zone_highest_possible_pfn));
+ arch_zone_lowest_possible_pfn[ZONE_DMA] = find_min_pfn();
+ arch_zone_highest_possible_pfn[ZONE_DMA] = arch_max_dma_pfn;
+ arch_zone_lowest_possible_pfn[ZONE_DMA32] = arch_max_dma_pfn;
+ arch_zone_highest_possible_pfn[ZONE_DMA32] = arch_max_dma32_pfn;
+ arch_zone_lowest_possible_pfn[ZONE_NORMAL] = arch_max_dma32_pfn;
+ arch_zone_highest_possible_pfn[ZONE_NORMAL] = arch_max_low_pfn;
+ arch_zone_lowest_possible_pfn[ZONE_HIGHMEM] = arch_max_low_pfn;
+ arch_zone_highest_possible_pfn[ZONE_HIGHMEM] = arch_max_high_pfn;
+
+ /* Regions in the early_node_map can be in any order */
+ sort_node_map();
+
+ for_each_online_node(nid) {
+ pg_data_t *pgdat = NODE_DATA(nid);
+ free_area_init_node(nid, pgdat, NULL,
+ find_start_pfn_for_node(nid), NULL);
+ }
+}
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
+
+
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-105-ia64_use_init_nodes/mm/page_alloc.c linux-2.6.17-rc1-106-breakout_mem_init/mm/page_alloc.c
--- linux-2.6.17-rc1-105-ia64_use_init_nodes/mm/page_alloc.c 2006-04-11 09:33:15.000000000 +0100
+++ linux-2.6.17-rc1-106-breakout_mem_init/mm/page_alloc.c 2006-04-11 09:38:12.000000000 +0100
@@ -37,8 +37,6 @@
#include <linux/nodemask.h>
#include <linux/vmalloc.h>
#include <linux/mempolicy.h>
-#include <linux/sort.h>
-#include <linux/pfn.h>
#include <asm/tlbflush.h>
#include "internal.h"
@@ -54,7 +52,6 @@ EXPORT_SYMBOL(node_possible_map);
unsigned long totalram_pages __read_mostly;
unsigned long totalhigh_pages __read_mostly;
long nr_swap_pages;
-int percpu_pagelist_fraction;
static void __free_pages_ok(struct page *page, unsigned int order);
@@ -80,24 +77,11 @@ EXPORT_SYMBOL(totalram_pages);
struct zone *zone_table[1 << ZONETABLE_SHIFT] __read_mostly;
EXPORT_SYMBOL(zone_table);
-static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
int min_free_kbytes = 1024;
unsigned long __initdata nr_kernel_pages;
unsigned long __initdata nr_all_pages;
-#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
- #ifdef CONFIG_MAX_ACTIVE_REGIONS
- #define MAX_ACTIVE_REGIONS CONFIG_MAX_ACTIVE_REGIONS
- #else
- #define MAX_ACTIVE_REGIONS (MAX_NR_ZONES * MAX_NUMNODES + 1)
- #endif
-
- struct node_active_region __initdata early_node_map[MAX_ACTIVE_REGIONS];
- unsigned long __initdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
- unsigned long __initdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
-#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
-
#ifdef CONFIG_DEBUG_VM
static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
{
@@ -1511,968 +1495,6 @@ void show_free_areas(void)
show_swap_cache_info();
}
-/*
- * Builds allocation fallback zone lists.
- *
- * Add all populated zones of a node to the zonelist.
- */
-static int __init build_zonelists_node(pg_data_t *pgdat,
- struct zonelist *zonelist, int nr_zones, int zone_type)
-{
- struct zone *zone;
-
- BUG_ON(zone_type > ZONE_HIGHMEM);
-
- do {
- zone = pgdat->node_zones + zone_type;
- if (populated_zone(zone)) {
-#ifndef CONFIG_HIGHMEM
- BUG_ON(zone_type > ZONE_NORMAL);
-#endif
- zonelist->zones[nr_zones++] = zone;
- check_highest_zone(zone_type);
- }
- zone_type--;
-
- } while (zone_type >= 0);
- return nr_zones;
-}
-
-static inline int highest_zone(int zone_bits)
-{
- int res = ZONE_NORMAL;
- if (zone_bits & (__force int)__GFP_HIGHMEM)
- res = ZONE_HIGHMEM;
- if (zone_bits & (__force int)__GFP_DMA32)
- res = ZONE_DMA32;
- if (zone_bits & (__force int)__GFP_DMA)
- res = ZONE_DMA;
- return res;
-}
-
-#ifdef CONFIG_NUMA
-#define MAX_NODE_LOAD (num_online_nodes())
-static int __initdata node_load[MAX_NUMNODES];
-/**
- * find_next_best_node - find the next node that should appear in a given node's fallback list
- * @node: node whose fallback list we're appending
- * @used_node_mask: nodemask_t of already used nodes
- *
- * We use a number of factors to determine which is the next node that should
- * appear on a given node's fallback list. The node should not have appeared
- * already in @node's fallback list, and it should be the next closest node
- * according to the distance array (which contains arbitrary distance values
- * from each node to each node in the system), and should also prefer nodes
- * with no CPUs, since presumably they'll have very little allocation pressure
- * on them otherwise.
- * It returns -1 if no node is found.
- */
-static int __init find_next_best_node(int node, nodemask_t *used_node_mask)
-{
- int n, val;
- int min_val = INT_MAX;
- int best_node = -1;
-
- /* Use the local node if we haven't already */
- if (!node_isset(node, *used_node_mask)) {
- node_set(node, *used_node_mask);
- return node;
- }
-
- for_each_online_node(n) {
- cpumask_t tmp;
-
- /* Don't want a node to appear more than once */
- if (node_isset(n, *used_node_mask))
- continue;
-
- /* Use the distance array to find the distance */
- val = node_distance(node, n);
-
- /* Penalize nodes under us ("prefer the next node") */
- val += (n < node);
-
- /* Give preference to headless and unused nodes */
- tmp = node_to_cpumask(n);
- if (!cpus_empty(tmp))
- val += PENALTY_FOR_NODE_WITH_CPUS;
-
- /* Slight preference for less loaded node */
- val *= (MAX_NODE_LOAD*MAX_NUMNODES);
- val += node_load[n];
-
- if (val < min_val) {
- min_val = val;
- best_node = n;
- }
- }
-
- if (best_node >= 0)
- node_set(best_node, *used_node_mask);
-
- return best_node;
-}
-
-static void __init build_zonelists(pg_data_t *pgdat)
-{
- int i, j, k, node, local_node;
- int prev_node, load;
- struct zonelist *zonelist;
- nodemask_t used_mask;
-
- /* initialize zonelists */
- for (i = 0; i < GFP_ZONETYPES; i++) {
- zonelist = pgdat->node_zonelists + i;
- zonelist->zones[0] = NULL;
- }
-
- /* NUMA-aware ordering of nodes */
- local_node = pgdat->node_id;
- load = num_online_nodes();
- prev_node = local_node;
- nodes_clear(used_mask);
- while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
- int distance = node_distance(local_node, node);
-
- /*
- * If another node is sufficiently far away then it is better
- * to reclaim pages in a zone before going off node.
- */
- if (distance > RECLAIM_DISTANCE)
- zone_reclaim_mode = 1;
-
- /*
- * We don't want to pressure a particular node.
- * So adding penalty to the first node in same
- * distance group to make it round-robin.
- */
-
- if (distance != node_distance(local_node, prev_node))
- node_load[node] += load;
- prev_node = node;
- load--;
- for (i = 0; i < GFP_ZONETYPES; i++) {
- zonelist = pgdat->node_zonelists + i;
- for (j = 0; zonelist->zones[j] != NULL; j++);
-
- k = highest_zone(i);
-
- j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
- zonelist->zones[j] = NULL;
- }
- }
-}
-
-#else /* CONFIG_NUMA */
-
-static void __init build_zonelists(pg_data_t *pgdat)
-{
- int i, j, k, node, local_node;
-
- local_node = pgdat->node_id;
- for (i = 0; i < GFP_ZONETYPES; i++) {
- struct zonelist *zonelist;
-
- zonelist = pgdat->node_zonelists + i;
-
- j = 0;
- k = highest_zone(i);
- j = build_zonelists_node(pgdat, zonelist, j, k);
- /*
- * Now we build the zonelist so that it contains the zones
- * of all the other nodes.
- * We don't want to pressure a particular node, so when
- * building the zones for node N, we make sure that the
- * zones coming right after the local ones are those from
- * node N+1 (modulo N)
- */
- for (node = local_node + 1; node < MAX_NUMNODES; node++) {
- if (!node_online(node))
- continue;
- j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
- }
- for (node = 0; node < local_node; node++) {
- if (!node_online(node))
- continue;
- j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
- }
-
- zonelist->zones[j] = NULL;
- }
-}
-
-#endif /* CONFIG_NUMA */
-
-void __init build_all_zonelists(void)
-{
- int i;
-
- for_each_online_node(i)
- build_zonelists(NODE_DATA(i));
- printk("Built %i zonelists\n", num_online_nodes());
- cpuset_init_current_mems_allowed();
-}
-
-/*
- * Helper functions to size the waitqueue hash table.
- * Essentially these want to choose hash table sizes sufficiently
- * large so that collisions trying to wait on pages are rare.
- * But in fact, the number of active page waitqueues on typical
- * systems is ridiculously low, less than 200. So this is even
- * conservative, even though it seems large.
- *
- * The constant PAGES_PER_WAITQUEUE specifies the ratio of pages to
- * waitqueues, i.e. the size of the waitq table given the number of pages.
- */
-#define PAGES_PER_WAITQUEUE 256
-
-static inline unsigned long wait_table_size(unsigned long pages)
-{
- unsigned long size = 1;
-
- pages /= PAGES_PER_WAITQUEUE;
-
- while (size < pages)
- size <<= 1;
-
- /*
- * Once we have dozens or even hundreds of threads sleeping
- * on IO we've got bigger problems than wait queue collision.
- * Limit the size of the wait table to a reasonable size.
- */
- size = min(size, 4096UL);
-
- return max(size, 4UL);
-}
-
-/*
- * This is an integer logarithm so that shifts can be used later
- * to extract the more random high bits from the multiplicative
- * hash function before the remainder is taken.
- */
-static inline unsigned long wait_table_bits(unsigned long size)
-{
- return ffz(~size);
-}
-
-#define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
-
-/*
- * Initially all pages are reserved - free ones are freed
- * up by free_all_bootmem() once the early boot process is
- * done. Non-atomic initialization, single-pass.
- */
-void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
- unsigned long start_pfn)
-{
- struct page *page;
- unsigned long end_pfn = start_pfn + size;
- unsigned long pfn;
-
- for (pfn = start_pfn; pfn < end_pfn; pfn++) {
- if (!early_pfn_valid(pfn))
- continue;
- page = pfn_to_page(pfn);
- set_page_links(page, zone, nid, pfn);
- init_page_count(page);
- reset_page_mapcount(page);
- SetPageReserved(page);
- INIT_LIST_HEAD(&page->lru);
-#ifdef WANT_PAGE_VIRTUAL
- /* The shift won't overflow because ZONE_NORMAL is below 4G. */
- if (!is_highmem_idx(zone))
- set_page_address(page, __va(pfn << PAGE_SHIFT));
-#endif
- }
-}
-
-void zone_init_free_lists(struct pglist_data *pgdat, struct zone *zone,
- unsigned long size)
-{
- int order;
- for (order = 0; order < MAX_ORDER ; order++) {
- INIT_LIST_HEAD(&zone->free_area[order].free_list);
- zone->free_area[order].nr_free = 0;
- }
-}
-
-#define ZONETABLE_INDEX(x, zone_nr) ((x << ZONES_SHIFT) | zone_nr)
-void zonetable_add(struct zone *zone, int nid, int zid, unsigned long pfn,
- unsigned long size)
-{
- unsigned long snum = pfn_to_section_nr(pfn);
- unsigned long end = pfn_to_section_nr(pfn + size);
-
- if (FLAGS_HAS_NODE)
- zone_table[ZONETABLE_INDEX(nid, zid)] = zone;
- else
- for (; snum <= end; snum++)
- zone_table[ZONETABLE_INDEX(snum, zid)] = zone;
-}
-
-#ifndef __HAVE_ARCH_MEMMAP_INIT
-#define memmap_init(size, nid, zone, start_pfn) \
- memmap_init_zone((size), (nid), (zone), (start_pfn))
-#endif
-
-static int __cpuinit zone_batchsize(struct zone *zone)
-{
- int batch;
-
- /*
- * The per-cpu-pages pools are set to around 1000th of the
- * size of the zone. But no more than 1/2 of a meg.
- *
- * OK, so we don't know how big the cache is. So guess.
- */
- batch = zone->present_pages / 1024;
- if (batch * PAGE_SIZE > 512 * 1024)
- batch = (512 * 1024) / PAGE_SIZE;
- batch /= 4; /* We effectively *= 4 below */
- if (batch < 1)
- batch = 1;
-
- /*
- * Clamp the batch to a 2^n - 1 value. Having a power
- * of 2 value was found to be more likely to have
- * suboptimal cache aliasing properties in some cases.
- *
- * For example if 2 tasks are alternately allocating
- * batches of pages, one task can end up with a lot
- * of pages of one half of the possible page colors
- * and the other with pages of the other colors.
- */
- batch = (1 << (fls(batch + batch/2)-1)) - 1;
-
- return batch;
-}
-
-inline void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
-{
- struct per_cpu_pages *pcp;
-
- memset(p, 0, sizeof(*p));
-
- pcp = &p->pcp[0]; /* hot */
- pcp->count = 0;
- pcp->high = 6 * batch;
- pcp->batch = max(1UL, 1 * batch);
- INIT_LIST_HEAD(&pcp->list);
-
- pcp = &p->pcp[1]; /* cold*/
- pcp->count = 0;
- pcp->high = 2 * batch;
- pcp->batch = max(1UL, batch/2);
- INIT_LIST_HEAD(&pcp->list);
-}
-
-/*
- * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist
- * to the value high for the pageset p.
- */
-
-static void setup_pagelist_highmark(struct per_cpu_pageset *p,
- unsigned long high)
-{
- struct per_cpu_pages *pcp;
-
- pcp = &p->pcp[0]; /* hot list */
- pcp->high = high;
- pcp->batch = max(1UL, high/4);
- if ((high/4) > (PAGE_SHIFT * 8))
- pcp->batch = PAGE_SHIFT * 8;
-}
-
-
-#ifdef CONFIG_NUMA
-/*
- * Boot pageset table. One per cpu which is going to be used for all
- * zones and all nodes. The parameters will be set in such a way
- * that an item put on a list will immediately be handed over to
- * the buddy list. This is safe since pageset manipulation is done
- * with interrupts disabled.
- *
- * Some NUMA counter updates may also be caught by the boot pagesets.
- *
- * The boot_pagesets must be kept even after bootup is complete for
- * unused processors and/or zones. They do play a role for bootstrapping
- * hotplugged processors.
- *
- * zoneinfo_show() and maybe other functions do
- * not check if the processor is online before following the pageset pointer.
- * Other parts of the kernel may not check if the zone is available.
- */
-static struct per_cpu_pageset boot_pageset[NR_CPUS];
-
-/*
- * Dynamically allocate memory for the
- * per cpu pageset array in struct zone.
- */
-static int __cpuinit process_zones(int cpu)
-{
- struct zone *zone, *dzone;
-
- for_each_zone(zone) {
-
- zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
- GFP_KERNEL, cpu_to_node(cpu));
- if (!zone_pcp(zone, cpu))
- goto bad;
-
- setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
-
- if (percpu_pagelist_fraction)
- setup_pagelist_highmark(zone_pcp(zone, cpu),
- (zone->present_pages / percpu_pagelist_fraction));
- }
-
- return 0;
-bad:
- for_each_zone(dzone) {
- if (dzone == zone)
- break;
- kfree(zone_pcp(dzone, cpu));
- zone_pcp(dzone, cpu) = NULL;
- }
- return -ENOMEM;
-}
-
-static inline void free_zone_pagesets(int cpu)
-{
- struct zone *zone;
-
- for_each_zone(zone) {
- struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
-
- zone_pcp(zone, cpu) = NULL;
- kfree(pset);
- }
-}
-
-static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
- unsigned long action,
- void *hcpu)
-{
- int cpu = (long)hcpu;
- int ret = NOTIFY_OK;
-
- switch (action) {
- case CPU_UP_PREPARE:
- if (process_zones(cpu))
- ret = NOTIFY_BAD;
- break;
- case CPU_UP_CANCELED:
- case CPU_DEAD:
- free_zone_pagesets(cpu);
- break;
- default:
- break;
- }
- return ret;
-}
-
-static struct notifier_block pageset_notifier =
- { &pageset_cpuup_callback, NULL, 0 };
-
-void __init setup_per_cpu_pageset(void)
-{
- int err;
-
- /* Initialize per_cpu_pageset for cpu 0.
- * A cpuup callback will do this for every cpu
- * as it comes online
- */
- err = process_zones(smp_processor_id());
- BUG_ON(err);
- register_cpu_notifier(&pageset_notifier);
-}
-
-#endif
-
-static __meminit
-void zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
-{
- int i;
- struct pglist_data *pgdat = zone->zone_pgdat;
-
- /*
- * The per-page waitqueue mechanism uses hashed waitqueues
- * per zone.
- */
- zone->wait_table_size = wait_table_size(zone_size_pages);
- zone->wait_table_bits = wait_table_bits(zone->wait_table_size);
- zone->wait_table = (wait_queue_head_t *)
- alloc_bootmem_node(pgdat, zone->wait_table_size
- * sizeof(wait_queue_head_t));
-
- for(i = 0; i < zone->wait_table_size; ++i)
- init_waitqueue_head(zone->wait_table + i);
-}
-
-static __meminit void zone_pcp_init(struct zone *zone)
-{
- int cpu;
- unsigned long batch = zone_batchsize(zone);
-
- for (cpu = 0; cpu < NR_CPUS; cpu++) {
-#ifdef CONFIG_NUMA
- /* Early boot. Slab allocator not functional yet */
- zone_pcp(zone, cpu) = &boot_pageset[cpu];
- setup_pageset(&boot_pageset[cpu],0);
-#else
- setup_pageset(zone_pcp(zone,cpu), batch);
-#endif
- }
- if (zone->present_pages)
- printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%lu\n",
- zone->name, zone->present_pages, batch);
-}
-
-static __meminit void init_currently_empty_zone(struct zone *zone,
- unsigned long zone_start_pfn, unsigned long size)
-{
- struct pglist_data *pgdat = zone->zone_pgdat;
-
- zone_wait_table_init(zone, size);
- pgdat->nr_zones = zone_idx(zone) + 1;
-
- zone->zone_start_pfn = zone_start_pfn;
-
- memmap_init(size, pgdat->node_id, zone_idx(zone), zone_start_pfn);
-
- zone_init_free_lists(pgdat, zone, zone->spanned_pages);
-}
-
-#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
-static int __init first_active_region_index_in_nid(int nid)
-{
- int i;
- for (i = 0; early_node_map[i].end_pfn; i++) {
- if (early_node_map[i].nid == nid)
- return i;
- }
-
- return MAX_ACTIVE_REGIONS;
-}
-
-static int __init next_active_region_index_in_nid(unsigned int index, int nid)
-{
- for (index = index + 1; early_node_map[index].end_pfn; index++) {
- if (early_node_map[index].nid == nid)
- return index;
- }
-
- return MAX_ACTIVE_REGIONS;
-}
-
-#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
-int __init early_pfn_to_nid(unsigned long pfn)
-{
- int i;
-
- for (i = 0; early_node_map[i].end_pfn; i++) {
- unsigned long start_pfn = early_node_map[i].start_pfn;
- unsigned long end_pfn = early_node_map[i].end_pfn;
-
- if ((start_pfn <= pfn) && (pfn < end_pfn))
- return early_node_map[i].nid;
- }
-
- return -1;
-}
-#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
-
-#define for_each_active_range_index_in_nid(i, nid) \
- for (i = first_active_region_index_in_nid(nid); \
- i != MAX_ACTIVE_REGIONS; \
- i = next_active_region_index_in_nid(i, nid))
-
-void __init free_bootmem_with_active_regions(int nid,
- unsigned long max_low_pfn)
-{
- unsigned int i;
- for_each_active_range_index_in_nid(i, nid) {
- unsigned long size_pages = 0;
- unsigned long end_pfn = early_node_map[i].end_pfn;
- if (early_node_map[i].start_pfn >= max_low_pfn)
- continue;
-
- if (end_pfn > max_low_pfn)
- end_pfn = max_low_pfn;
-
- size_pages = end_pfn - early_node_map[i].start_pfn;
- free_bootmem_node(NODE_DATA(early_node_map[i].nid),
- PFN_PHYS(early_node_map[i].start_pfn),
- PFN_PHYS(size_pages));
- }
-}
-
-void __init memory_present_with_active_regions(int nid)
-{
- unsigned int i;
- for_each_active_range_index_in_nid(i, nid)
- memory_present(early_node_map[i].nid,
- early_node_map[i].start_pfn,
- early_node_map[i].end_pfn);
-}
-
-void __init get_pfn_range_for_nid(unsigned int nid,
- unsigned long *start_pfn, unsigned long *end_pfn)
-{
- unsigned int i;
- *start_pfn = -1UL;
- *end_pfn = 0;
-
- for_each_active_range_index_in_nid(i, nid) {
- if (early_node_map[i].start_pfn < *start_pfn)
- *start_pfn = early_node_map[i].start_pfn;
-
- if (early_node_map[i].end_pfn > *end_pfn)
- *end_pfn = early_node_map[i].end_pfn;
- }
-
- if (*start_pfn == -1UL) {
- printk(KERN_WARNING "Node %u active with no memory\n", nid);
- *start_pfn = 0;
- }
-}
-
-unsigned long __init zone_present_pages_in_node(int nid,
- unsigned long zone_type,
- unsigned long *ignored)
-{
- unsigned long node_start_pfn, node_end_pfn;
- unsigned long zone_start_pfn, zone_end_pfn;
-
- /* Get the start and end of the node and zone */
- get_pfn_range_for_nid(nid, &node_start_pfn, &node_end_pfn);
- zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
- zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
-
- /* Check that this node has pages within the zone's required range */
- if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn)
- return 0;
-
- /* Move the zone boundaries inside the node if necessary */
- if (zone_end_pfn > node_end_pfn)
- zone_end_pfn = node_end_pfn;
- if (zone_start_pfn < node_start_pfn)
- zone_start_pfn = node_start_pfn;
-
- /* Return the spanned pages */
- return zone_end_pfn - zone_start_pfn;
-}
-
-static inline int __init pfn_range_in_zone(unsigned long start_pfn,
- unsigned long end_pfn,
- unsigned long zone_type)
-{
- if (start_pfn < arch_zone_lowest_possible_pfn[zone_type])
- return 0;
-
- if (start_pfn >= arch_zone_highest_possible_pfn[zone_type])
- return 0;
-
- if (end_pfn < arch_zone_lowest_possible_pfn[zone_type])
- return 0;
-
- if (end_pfn >= arch_zone_highest_possible_pfn[zone_type])
- return 0;
-
- return 1;
-}
-
-unsigned long __init zone_absent_pages_in_node(int nid,
- unsigned long zone_type,
- unsigned long *ignored)
-{
- int i = 0;
- unsigned long prev_end_pfn = 0, hole_pages = 0;
- unsigned long start_pfn;
-
- /* Find the end_pfn of the first active range of pfns in the node */
- i = first_active_region_index_in_nid(nid);
- prev_end_pfn = early_node_map[i].start_pfn;
-
- /* Find all holes for the node */
- for (; i != MAX_ACTIVE_REGIONS;
- i = next_active_region_index_in_nid(i, nid)) {
-
- /* Increase the hole size if the hole is within the zone */
- start_pfn = early_node_map[i].start_pfn;
- if (pfn_range_in_zone(prev_end_pfn, start_pfn, zone_type)) {
- BUG_ON(prev_end_pfn > start_pfn);
- hole_pages += start_pfn - prev_end_pfn;
- }
-
- prev_end_pfn = early_node_map[i].end_pfn;
- }
-
- return hole_pages;
-}
-#else
-static inline unsigned long zone_present_pages_in_node(int nid,
- unsigned long zone_type,
- unsigned long *zones_size)
-{
- return zones_size[zone_type];
-}
-
-static inline unsigned long zone_absent_pages_in_node(int nid,
- unsigned long zone_type,
- unsigned long *zholes_size)
-{
- if (!zholes_size)
- return 0;
-
- return zholes_size[zone_type];
-}
-#endif
-
-static void __init calculate_node_totalpages(struct pglist_data *pgdat,
- unsigned long *zones_size, unsigned long *zholes_size)
-{
- unsigned long realtotalpages, totalpages = 0;
- int i;
-
- for (i = 0; i < MAX_NR_ZONES; i++) {
- totalpages += zone_present_pages_in_node(pgdat->node_id, i,
- zones_size);
- }
- pgdat->node_spanned_pages = totalpages;
-
- realtotalpages = totalpages;
- for (i = 0; i < MAX_NR_ZONES; i++) {
- realtotalpages -=
- zone_absent_pages_in_node(pgdat->node_id, i, zholes_size);
- }
- pgdat->node_present_pages = realtotalpages;
- printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id,
- realtotalpages);
-}
-
-/*
- * Set up the zone data structures:
- * - mark all pages reserved
- * - mark all memory queues empty
- * - clear the memory bitmaps
- */
-static void __init free_area_init_core(struct pglist_data *pgdat,
- unsigned long *zones_size, unsigned long *zholes_size)
-{
- unsigned long j;
- int nid = pgdat->node_id;
- unsigned long zone_start_pfn = pgdat->node_start_pfn;
-
- pgdat_resize_init(pgdat);
- pgdat->nr_zones = 0;
- init_waitqueue_head(&pgdat->kswapd_wait);
- pgdat->kswapd_max_order = 0;
-
- for (j = 0; j < MAX_NR_ZONES; j++) {
- struct zone *zone = pgdat->node_zones + j;
- unsigned long size, realsize;
-
- size = zone_present_pages_in_node(nid, j, zones_size);
- realsize = size - zone_absent_pages_in_node(nid, j,
- zholes_size);
- if (j < ZONE_HIGHMEM)
- nr_kernel_pages += realsize;
- nr_all_pages += realsize;
-
- zone->spanned_pages = size;
- zone->present_pages = realsize;
- zone->name = zone_names[j];
- spin_lock_init(&zone->lock);
- spin_lock_init(&zone->lru_lock);
- zone_seqlock_init(zone);
- zone->zone_pgdat = pgdat;
- zone->free_pages = 0;
-
- zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
-
- zone_pcp_init(zone);
- INIT_LIST_HEAD(&zone->active_list);
- INIT_LIST_HEAD(&zone->inactive_list);
- zone->nr_scan_active = 0;
- zone->nr_scan_inactive = 0;
- zone->nr_active = 0;
- zone->nr_inactive = 0;
- atomic_set(&zone->reclaim_in_progress, 0);
- if (!size)
- continue;
-
- zonetable_add(zone, nid, j, zone_start_pfn, size);
- init_currently_empty_zone(zone, zone_start_pfn, size);
- zone_start_pfn += size;
- }
-}
-
-static void __init alloc_node_mem_map(struct pglist_data *pgdat)
-{
- /* Skip empty nodes */
- if (!pgdat->node_spanned_pages)
- return;
-
-#ifdef CONFIG_FLAT_NODE_MEM_MAP
- /* ia64 gets its own node_mem_map, before this, without bootmem */
- if (!pgdat->node_mem_map) {
- unsigned long size;
- struct page *map;
-
- size = (pgdat->node_spanned_pages + 1) * sizeof(struct page);
- map = alloc_remap(pgdat->node_id, size);
- if (!map)
- map = alloc_bootmem_node(pgdat, size);
- pgdat->node_mem_map = map;
- }
-#ifdef CONFIG_FLATMEM
- /*
- * With no DISCONTIG, the global mem_map is just set as node 0's
- */
- if (pgdat == NODE_DATA(0))
- mem_map = NODE_DATA(0)->node_mem_map;
-#endif
-#endif /* CONFIG_FLAT_NODE_MEM_MAP */
-}
-
-void __init free_area_init_node(int nid, struct pglist_data *pgdat,
- unsigned long *zones_size, unsigned long node_start_pfn,
- unsigned long *zholes_size)
-{
- pgdat->node_id = nid;
- pgdat->node_start_pfn = node_start_pfn;
- calculate_node_totalpages(pgdat, zones_size, zholes_size);
-
- alloc_node_mem_map(pgdat);
-
- free_area_init_core(pgdat, zones_size, zholes_size);
-}
-
-#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
-void __init add_active_range(unsigned int nid, unsigned long start_pfn,
- unsigned long end_pfn)
-{
- unsigned int i;
- unsigned long pages = end_pfn - start_pfn;
-
- /* Merge with existing active regions if possible */
- for (i = 0; early_node_map[i].end_pfn; i++) {
- if (early_node_map[i].nid != nid)
- continue;
-
- if (early_node_map[i].end_pfn == start_pfn) {
- early_node_map[i].end_pfn += pages;
- return;
- }
-
- if (early_node_map[i].start_pfn == (start_pfn + pages)) {
- early_node_map[i].start_pfn -= pages;
- return;
- }
- }
-
- /*
- * Leave last entry NULL so we dont iterate off the end (we use
- * entry.end_pfn to terminate the walk).
- */
- if (i >= MAX_ACTIVE_REGIONS - 1) {
- printk(KERN_ERR "WARNING: too many memory regions in "
- "numa code, truncating\n");
- return;
- }
-
- early_node_map[i].nid = nid;
- early_node_map[i].start_pfn = start_pfn;
- early_node_map[i].end_pfn = end_pfn;
-}
-
-/* Compare two active node_active_regions */
-static int __init cmp_node_active_region(const void *a, const void *b)
-{
- struct node_active_region *arange = (struct node_active_region *)a;
- struct node_active_region *brange = (struct node_active_region *)b;
-
- /* Done this way to avoid overflows */
- if (arange->start_pfn > brange->start_pfn)
- return 1;
- if (arange->start_pfn < brange->start_pfn)
- return -1;
-
- return 0;
-}
-
-/* sort the node_map by start_pfn */
-static void __init sort_node_map(void)
-{
- size_t num = 0;
- while (early_node_map[num].end_pfn)
- num++;
-
- sort(early_node_map, num, sizeof(struct node_active_region),
- cmp_node_active_region, NULL);
-}
-
-unsigned long __init find_min_pfn(void)
-{
- int i;
- unsigned long min_pfn = -1UL;
-
- for (i = 0; early_node_map[i].end_pfn; i++) {
- if (early_node_map[i].start_pfn < min_pfn)
- min_pfn = early_node_map[i].start_pfn;
- }
-
- return min_pfn;
-}
-
-/* Find the lowest pfn in a node. This depends on a sorted early_node_map */
-unsigned long __init find_start_pfn_for_node(unsigned long nid)
-{
- int i;
-
- /* Assuming a sorted map, the first range found has the starting pfn */
- for_each_active_range_index_in_nid(i, nid) {
- return early_node_map[i].start_pfn;
- }
-
- /* nid does not exist in early_node_map */
- printk(KERN_WARNING "Could not find start_pfn for node %lu\n", nid);
- return 0;
-}
-
-void __init free_area_init_nodes(unsigned long arch_max_dma_pfn,
- unsigned long arch_max_dma32_pfn,
- unsigned long arch_max_low_pfn,
- unsigned long arch_max_high_pfn)
-{
- unsigned long nid;
-
- /* Record where the zone boundaries are */
- memset(arch_zone_lowest_possible_pfn, 0,
- sizeof(arch_zone_lowest_possible_pfn));
- memset(arch_zone_highest_possible_pfn, 0,
- sizeof(arch_zone_highest_possible_pfn));
- arch_zone_lowest_possible_pfn[ZONE_DMA] = find_min_pfn();
- arch_zone_highest_possible_pfn[ZONE_DMA] = arch_max_dma_pfn;
- arch_zone_lowest_possible_pfn[ZONE_DMA32] = arch_max_dma_pfn;
- arch_zone_highest_possible_pfn[ZONE_DMA32] = arch_max_dma32_pfn;
- arch_zone_lowest_possible_pfn[ZONE_NORMAL] = arch_max_dma32_pfn;
- arch_zone_highest_possible_pfn[ZONE_NORMAL] = arch_max_low_pfn;
- arch_zone_lowest_possible_pfn[ZONE_HIGHMEM] = arch_max_low_pfn;
- arch_zone_highest_possible_pfn[ZONE_HIGHMEM] = arch_max_high_pfn;
-
- /* Regions in the early_node_map can be in any order */
- sort_node_map();
-
- for_each_online_node(nid) {
- pg_data_t *pgdat = NODE_DATA(nid);
- free_area_init_node(nid, pgdat, NULL,
- find_start_pfn_for_node(nid), NULL);
- }
-}
-#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
-
#ifndef CONFIG_NEED_MULTIPLE_NODES
static bootmem_data_t contig_bootmem_data;
struct pglist_data contig_page_data = { .bdata = &contig_bootmem_data };
@@ -2955,32 +1977,6 @@ int lowmem_reserve_ratio_sysctl_handler(
return 0;
}
-/*
- * percpu_pagelist_fraction - changes the pcp->high for each zone on each
- * cpu. It is the fraction of total pages in each zone that a hot per cpu pagelist
- * can have before it gets flushed back to buddy allocator.
- */
-
-int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
- struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
-{
- struct zone *zone;
- unsigned int cpu;
- int ret;
-
- ret = proc_dointvec_minmax(table, write, file, buffer, length, ppos);
- if (!write || (ret == -EINVAL))
- return ret;
- for_each_zone(zone) {
- for_each_online_cpu(cpu) {
- unsigned long high;
- high = zone->present_pages / percpu_pagelist_fraction;
- setup_pagelist_highmark(zone_pcp(zone, cpu), high);
- }
- }
- return 0;
-}
-
__initdata int hashdist = HASHDIST_DEFAULT;
#ifdef CONFIG_NUMA
^ permalink raw reply
* [PATCH 0/6] [RFC] Sizing zones and holes in an architecture independent manner
From: Mel Gorman @ 2006-04-11 10:39 UTC (permalink / raw)
To: linuxppc-dev, davej, tony.luck, linux-kernel, ak; +Cc: Mel Gorman
(The TO: list are the architecture maintainers according to the
MAINTAINERS. Apologies in advance if I got the list wrong)
At a basic level, architectures define structures to record where active
ranges of page frames are located. Once located, the code to calculate
zone sizes and holes in each architecture is very similar. Some of this
zone and hole sizing code is difficult to read for no good reason. This
set of patches eliminates the similar-looking architecture-specific code.
The patches introduce a mechanism where architectures register where the
active ranges of page frames are with add_active_range(). When all areas
have been discovered, free_area_init_nodes() is called to initialise
the pgdat and zones. The zone sizes and holes are then calculated in an
architecture independent manner.
Patch 1 introduces the mechanism for registering and initialising PFN ranges
Patch 2 changes ppc to use the mechanism - 136 arch-specific LOC removed
Patch 3 changes x86 to use the mechanism - 150 arch-specific LOC removed
Patch 4 changes x86_64 to use the mechanism - 35 arch-specific LOC removed
Patch 5 changes ia64 to use the mechanism - 59 arch-specific LOC removed
At this point, there is a net reduction of 27 lines of code and the
arch-independent code is a lot easier to read in comparison to some of
the arch-specific stuff, particularly in arch/i386/ .
For Patch 6, it was also noted that page_alloc.c has a *lot* of
initialisation code which makes the file harder to read than it needs to
be. Patch 6 creates a new file mem_init.c and moves a lot of initialisation
code from page_alloc.c to it. After the patch is applied, there is still
a net reduction of 3 lines of code.
The patches have been successfully boot tested on
o x86, flatmem
o x86, NUMAQ
o PPC64, NUMA
o PPC64, CONFIG_NUMA=n
o x86_64, NUMA with SRAT
The patches have only been *compile tested* for ia64 with a flatmem
configuration. At attempt was made to boot test on an ancient RS/6000
but the vanilla kernel does not boot so I have to investigate there.
The net reduction seems small but the big benefit of this set of patches
is the reduction of 380 lines of architecture-specific code, some of
which is very hairy. There should be a greater net reduction when other
architectures use the same mechanisms for zone and hole sizing but I lack
the hardware to test on.
Comments?
Additional credit;
Dave Hansen for the initial suggestion and comments on early patches
Andy Whitcroft for reviewing early versions and catching numerous errors
arch/i386/Kconfig | 8
arch/i386/kernel/setup.c | 19
arch/i386/kernel/srat.c | 98 ----
arch/i386/mm/discontig.c | 59 --
arch/ia64/Kconfig | 3
arch/ia64/mm/contig.c | 62 --
arch/ia64/mm/discontig.c | 43 -
arch/ia64/mm/init.c | 10
arch/powerpc/Kconfig | 13
arch/powerpc/mm/mem.c | 50 --
arch/powerpc/mm/numa.c | 157 ------
arch/ppc/Kconfig | 3
arch/ppc/mm/init.c | 21
arch/x86_64/Kconfig | 3
arch/x86_64/kernel/e820.c | 18
arch/x86_64/mm/init.c | 60 --
arch/x86_64/mm/numa.c | 15
include/asm-ia64/meminit.h | 1
include/asm-x86_64/e820.h | 1
include/asm-x86_64/proto.h | 2
include/linux/mm.h | 14
include/linux/mmzone.h | 15
mm/Makefile | 2
mm/mem_init.c | 1028 +++++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 678 -----------------------------
25 files changed, 1190 insertions(+), 1193 deletions(-)
--
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply
* post to the list
From: stone @ 2006-04-11 10:23 UTC (permalink / raw)
To: linuxppc-embedded
Thank you!
Welcome to the Linuxppc-embedded@ozlabs.org mailing list!
To post to this list, send your email to:
linuxppc-embedded@ozlabs.org
General information about the mailing list is at:
https://ozlabs.org/mailman/listinfo/linuxppc-embedded
If you ever want to unsubscribe or change your options (eg, switch to
or from digest mode, change your password, etc.), visit your
subscription page at:
https://ozlabs.org/mailman/options/linuxppc-embedded/crackpig%40gmail.com
You can also make such adjustments via email by sending a message to:
Linuxppc-embedded-request@ozlabs.org
with the word `help' in the subject or body (don't include the
quotes), and you will get back a message with instructions.
You must know your password to change your options (including changing
the password, itself) or to unsubscribe. It is:
80548054
Normally, Mailman will remind you of your ozlabs.org mailing list
passwords once every month, although you can disable this if you
prefer. This reminder will also include instructions on how to
unsubscribe or change your account options. There is also a button on
your options page that will email your current password to you.
^ permalink raw reply
* Re: [Fastboot] [PATCH]ppc64 kexec tools rm platform fix
From: Michael Ellerman @ 2006-04-11 9:36 UTC (permalink / raw)
To: David Wilder; +Cc: Mohan Kumar, linuxppc-dev list, fastboot
In-Reply-To: <1144659236.26317.35.camel@localhost.localdomain>
David Wilder wrote:
> I lost Michael's reply on the list.
> Here it is with my response..
>
> Michael wrote:
>
> >/ - continue;
> />/ - }
> />/ memset(fname, 0, sizeof(fname));
> />/ strcpy(fname, device_tree);
> />/ strcat(fname, dentry->d_name);
> />/ strcat(fname, "/linux,htab-base");
> />/ if ((file = fopen(fname, "r")) == NULL) {
> />/ - perror(fname);
> />/ closedir(cdir);
> />/ + if (errno == ENOENT) {
> />/ + /* Non LPAR */
> />/ + errno = 0;
> />/ + continue;
> />/ + }
> />/ + perror(fname);
> />/ closedir(dir);
> />/ return -1;
> /
> I don't think you want to do the closedir() before the if. You
> certainly
> don't need to do it twice?
>
> The two closedir() calls are not closing the same thing.
Right. We don't seem to close cdir if it all works though. That code
needs some serious restructuring, 350+ line functions are not cool.
cheers
--
Michael Ellerman
IBM OzLabs
wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)
We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person
^ permalink raw reply
* Re: [RFC/PATCH] powerpc: Use rtas query-cpu-stopped-state in smp spinup
From: Michael Ellerman @ 2006-04-11 8:53 UTC (permalink / raw)
To: Nathan Lynch; +Cc: linuxppc-dev, Paul Mackerras, Arnd Bergmann
In-Reply-To: <20060407011002.GF9208@localdomain>
[-- Attachment #1: Type: text/plain, Size: 953 bytes --]
On Thu, 2006-04-06 at 20:10 -0500, Nathan Lynch wrote:
> I tried your patch (non-kexec boot) on a 1-way SMT power5 with
> 2.6.17-rc1 and got the badness below, and I think it matches what I
> ran into in the past. Maybe doing query-cpu-stopped-state on a thread
> that has yet to be started by RTAS mucks things up?
>
> Memory: 2047640k/2097152k available (5356k kernel code, 49512k
> reserved, 1412k data, 900k bss, 240k init)
> Calibrating delay loop... 375.80 BogoMIPS (lpj=751616)
> Security Framework v1.0.0 initialized
> SELinux: Disabled at boot.
> Mount-cache hash table entries: 256
> Processor 1 is stuck. <<<<<<<<
OK, thanks for testing it. This is clearly not going to fly.
cheers
--
Michael Ellerman
IBM OzLabs
wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)
We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 191 bytes --]
^ permalink raw reply
* RTC/I2C problems
From: kevin morizur @ 2006-04-11 8:20 UTC (permalink / raw)
To: linuxppc-embedded
Hi everybody.
I have a problem with my RTC, i work on a MEN A12 board with MPC8245 and
when my kernel start it takes 8 min to run. I have find the problem but i
don't know how to solve it. In time.c the function time_init execute the
following lines :
if (__USE_RTC()) {
/* 601 processor: dec counts down by 128 every 128ns */
tb_ticks_per_jiffy = DECREMENTER_COUNT_601;
/* mulhwu_scale_factor(1000000000, 1000000) is 0x418937 */
tb_to_us = 0x0x418937;
} else {
ppc_md.calibrate_decr();
tb_to_ns_scale = mulhwu(tb_to_us, 1000 << 10);
}
this is the ppc_md.calibrate_decr() function that takes a lot of time and i
don't know how to solve the problem. USE_RTC is fixed at 0 for 6xx config
maybe it must be at 1 ? My RTC run with SMBus (I2C). I work with ELDK 4.0
and linux 2.6.15.
Thanks in advance.
Regards.
_________________________________________________________________
Tout savoir sur la sécurité de vos enfants sur Internet !
http://go.msn.fr/10-channel/80-security/protection/default.asp
^ permalink raw reply
* init issue on linux 2.4 devel and busybox
From: jean-francois.hasson @ 2006-04-11 7:25 UTC (permalink / raw)
To: linuxppc-embedded
Hi,
I have been working on porting linux 2.4 devel to the ml403 board. It boots
well and then it goes on the CompactFlash to get the root filesystem where
the executables are from busybox. The init program is found and launches
well however the message is that init.d/rcS can't be found. I have looked
up my inittab file and the first line is ::sysinit:/etc/init.d/rcS. I do
not understand why init can't find the rcS file when it seems correct in
inittab and I have checked that the file rcS is in /etc/init.d. Does anyone
have an idea as to what might be wrong ?
Thanks,
JF Hasson
#
This e-mail and any attached documents may contain confidential or proprietary information. If you are not the intended recipient, please advise the sender immediately and delete this e-mail and all attached documents from your computer system. Any unauthorised disclosure, distribution or copying hereof is prohibited.
Ce courriel et les documents qui y sont attaches peuvent contenir des informations confidentielles. Au cas ou vous ne seriez pas le destinataire de ce courriel, vous etes prie d'en informer l'expediteur immediatement et de le detruire ainsi que tous les documents attaches de votre systeme informatique. Toute divulgation, distribution ou copie du present courriel et des documents attaches sans autorisation prealable de son emetteur est interdite.
#
^ permalink raw reply
* Re: MPC83xx: CF access in True IDE mode
From: Kumar Gala @ 2006-04-11 4:11 UTC (permalink / raw)
To: SIP COP 009; +Cc: linuxppc-embedded
In-Reply-To: <5a4792c00604101600v37616adao2f2598483a65e4d@mail.gmail.com>
On Apr 10, 2006, at 6:00 PM, SIP COP 009 wrote:
> Hi !
>
> I am looking out for some pointers to get CF working in the TRUE
> IDE mode on Linux2.6
>
> Any pointers and help would be greatly appreciated.
There was a driver posted to the linux kernel list for using CF on a
localbus.
- k
^ permalink raw reply
* Re: Create permanent mapping from PCI bus to region of physical memory
From: David Hawkins @ 2006-04-11 3:52 UTC (permalink / raw)
To: Howard, Marc; +Cc: linuxppc-embedded
In-Reply-To: <91B22F93A880FA48879475E134D6F0BE386FBD@CA1EXCLV02.adcorp.kla-tencor.com>
> Not if you're tight on routing resouces and/or design time ;-)
Yep, understood.
> Dave, I appreciate your input but scatter/gather just isn't
> an option here for a variety of reasons. Bandwidth/complexity/
> latency/time to design/time to debug/FPGA density all
> factor into this decision. BTW we're talking 66/64 PCI-X;
> 33/32 PCI isn't nearly fast enough.
No problem.
> Nope, I disagree. The MMU isn't involved here at all. I'm
> talking about setting up the PIM (PCI Inbound Map) via
> a system call as opposed to just writing it directly.
The MMU is involved since its responsible for all memory,
i.e., Linux is told that all physical memory is under
its control. I haven't looked at the PCI inbound map,
but I would assume that it can be setup to say map all
of the incoming PCI addresses to all of the SDRAM
addresses, or some block thereof. So the FPGA can see physical
pages that Linux is busy swapping data into and out of.
What you are wanting to do is tell Linux to lay-off
managing a block of SDRAM, but make it visible to the
kernel as a memory area.
So, you need the kernel to setup page tables that map
to those physical addresses and you need to tell Linux
not to treat them like it does the rest of SDRAM.
So I believe the MMU/page tables are involved.
> I really can't go into detail about my application for a
> variety of reasons (both technical and legal) but suffice
> it to say that 16 MB is just the starting point.
No sweat, it never hurts to ask.
> I guess I'll dig a little deeper into the source to see
> where the kernel does the PIM mapping. Re-architecting
> our app at this stage just isn't a practical consideration.
You always have the hack of reserving a block of memory
at the top of SDRAM, i.e., lie to Linux and tell it it
has less memory. It sounds like that might be the least
effort. Rubini has an example of how to do that.
Dave
^ permalink raw reply
* RE: Create permanent mapping from PCI bus to region of physical memory
From: Howard, Marc @ 2006-04-11 3:42 UTC (permalink / raw)
To: David Hawkins; +Cc: linuxppc-embedded
In-Reply-To: <443B1EE9.5030104@ovro.caltech.edu>
[-- Attachment #1: Type: text/plain, Size: 3052 bytes --]
>> The periperal is an FPGA. No, there is no internal processor;
>> everything is coded in Verilog.
>>
>> Scatter/gather isn't a viable option because of this.
>Er, why not, its an FPGA, everything is possible :)
Not if you're tight on routing resouces and/or design time ;-)
>So you have a PCI core, since you are planning to write to
>the host memory space, its a Bus Master PCI core.
>Who's FPGA, Altera, Xilinx, or someone else?
Xilinx Virtex 4, no internal PPC. The vendor really isn't
germain to the problem at hand however.
>> Additionally non-contiguous memory would reduce bandwidth
>> and increase FPGA design complexity.
>Not necessarily. If the target is using bus master DMA to
>write to the host memory, then you can hit pretty close
>to the bandwidth of the PCI bus. If you are DMAing in
>big blocks, the overhead of a block change isn't too much.
>I did tests with the 440EP using a DMA controller on an
>adapter board and found that the PCI bridge in the 440EP
>was the limiting factor, i.e., for a 33MHz 32-bit bus
>with a potential for 132MB/s, the *best* you can do is
>about 40MB/s since the bridge only accepts data in cache
>line sizes before sending a retry to the target. I can
>send you those results.
Dave, I appreciate your input but scatter/gather just isn't
an option here for a variety of reasons. Bandwidth/complexity/
latency/time to design/time to debug/FPGA density all
factor into this decision. BTW we're talking 66/64 PCI-X;
33/32 PCI isn't nearly fast enough.
>Randomly accessible from where; the host or an I/O interface
>at the FPGA. The pages can be made to appear contiguous to
>a host processor user-space process using the nopage callback
>of the VMA.
>From the point of view of the FPGA.
>> I realize this isn't a standard linux request but having
>> fixed, linear memory is quite common in embedded apps. There
>> should be a way to create this mapping in the 440GX's hardware
>> and I'm just looking for a system call (if there is one) to
>> implement it.
>Alas, this is one of the concessions one must make if you
>want to use a processor that enables the MMU.
Nope, I disagree. The MMU isn't involved here at all. I'm
talking about setting up the PIM (PCI Inbound Map) via
a system call as opposed to just writing it directly.
>However,
>I don't see any fundamental limitation in the design
>that would preclude a little extra work on the FPGA.
>But, it does require additional Verilog to support
>the flexibility. The long-term advantage is that you
>don't have to provide a hack (eg. reserve a block of
>high-memory under Linux).
I really can't go into detail about my application for a
variety of reasons (both technical and legal) but suffice
it to say that 16 MB is just the starting point.
I guess I'll dig a little deeper into the source to see
where the kernel does the PIM mapping. Re-architecting
our app at this stage just isn't a practical consideration.
Thanks again,
Marc
[-- Attachment #2: Type: text/html, Size: 3973 bytes --]
^ permalink raw reply
* Re: Create permanent mapping from PCI bus to region of physical memory
From: David Hawkins @ 2006-04-11 3:13 UTC (permalink / raw)
To: Howard, Marc; +Cc: linuxppc-embedded
In-Reply-To: <91B22F93A880FA48879475E134D6F0BE386FBC@CA1EXCLV02.adcorp.kla-tencor.com>
Hi Marc,
> The periperal is an FPGA. No, there is no internal processor;
> everything is coded in Verilog.
>
> Scatter/gather isn't a viable option because of this.
Er, why not, its an FPGA, everything is possible :)
So you have a PCI core, since you are planning to write to
the host memory space, its a Bus Master PCI core.
Who's FPGA, Altera, Xilinx, or someone else?
> Additionally non-contiguous memory would reduce bandwidth
> and increase FPGA design complexity.
Not necessarily. If the target is using bus master DMA to
write to the host memory, then you can hit pretty close
to the bandwidth of the PCI bus. If you are DMAing in
big blocks, the overhead of a block change isn't too much.
I did tests with the 440EP using a DMA controller on an
adapter board and found that the PCI bridge in the 440EP
was the limiting factor, i.e., for a 33MHz 32-bit bus
with a potential for 132MB/s, the *best* you can do is
about 40MB/s since the bridge only accepts data in cache
line sizes before sending a retry to the target. I can
send you those results.
> The data must be contignuous because
> of these reasons and the need for the data to be randomly
> accessible from the outside using simple address arithmetic.
Randomly accessible from where; the host or an I/O interface
at the FPGA. The pages can be made to appear contiguous to
a host processor user-space process using the nopage callback
of the VMA.
> I realize this isn't a standard linux request but having
> fixed, linear memory is quite common in embedded apps. There
> should be a way to create this mapping in the 440GX's hardware
> and I'm just looking for a system call (if there is one) to
> implement it.
Alas, this is one of the concessions one must make if you
want to use a processor that enables the MMU. However,
I don't see any fundamental limitation in the design
that would preclude a little extra work on the FPGA.
But, it does require additional Verilog to support
the flexibility. The long-term advantage is that you
don't have to provide a hack (eg. reserve a block of
high-memory under Linux).
So how about this concession. If Linux lets you alloc_pages
in 2MB max chunks, create 8 address decode regions on your
FPGA. Provide the host access to the 8 address registers.
When the Linux driver is installed, the driver alloc_pages
8 times and loads the PCI address of those regions into
the 8 registers.
On the FPGA side of things create a flat 16MB region, as
an address passes through a 2MB block, it changes the PCI
address it decodes to. So, your FPGA believes it has
a 16MB continuous block, and Linux can supply the memory
as 8 non-contiguous chunks. Back in user-space on Linux,
the nopage VMA call is used to map a page-at-a-time
and the 2MB regions appear as a 16MB contiguous region to
user-space. (There are probably other ways to make user
space see it as one block too)
Cheers
Dave
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox