* Re: freescale lite 5200 board and kernel 2.6
From: Domenico Andreoli @ 2006-04-10 21:54 UTC (permalink / raw)
To: linuxppc-embedded
In-Reply-To: <20060408082100.GA54030@server.idefix.loc>
On Sat, Apr 08, 2006 at 10:21:00AM +0200, Matthias Fechner wrote:
> Hello Domenico,
Hello Matthias,
> * Domenico Andreoli <cavokz@gmail.com> [07-04-06 00:10]:
> > kernel is built following the instructions on your wiki, i attached
> > the config file. please have a look, let me know if any check/test may
> > be advised.
>
> sry, but I have now time to try your kernel config, but I attached
> mine which is working fine for me.
> Maybe this helps you.
unfortunately it does not :/
sometimes kernel prints "eth0: config: auto-negotiation on, 100HDX,
10HDX.", the eth0 is up and everything works.
most of the times "eth0: config: auto-negotiation off, No speed/duplex
selected?." is instead printed and eth0 seems to not exist.
i tend to exclude hw problems, older kernel 2.4.x always succeed in
using eth0.
is there any way to force eth0 auto-negotiation?
thank you
domenico
-----[ Domenico Andreoli, aka cavok
--[ http://people.debian.org/~cavok/gpgkey.asc
---[ 3A0F 2F80 F79C 678A 8936 4FEE 0677 9033 A20E BC50
^ permalink raw reply
* [PATCH] kdump-ppc64-xmon-stop-cpu
From: David Wilder @ 2006-04-10 22:32 UTC (permalink / raw)
To: fastboot, linuxppc-dev, akpm, mchintage, paulus, hbabu
[-- Attachment #1: Type: text/plain, Size: 493 bytes --]
Patch 3 of 3
- During CPU(s) hang scenarios, kdump could not stop these CPUs.
However, the user could invoke soft-reset to shoot down CPUs reliably.
But, when the debugger is enabled, these CPUs are returned to hang state
after they exited from the debugger. This patch fixes this issue by
calling crash_kexec_secondary() before returns to previous state.
Please pick up this patch.
--
David Wilder
IBM Linux Technology Center
Beaverton, Oregon, USA
dwilder@us.ibm.com
(503)578-3789
[-- Attachment #2: kdump-ppc64-xmon-stop-cpu.patch --]
[-- Type: text/x-patch, Size: 1273 bytes --]
- During CPU(s) hang scenarios, kdump could not stop these CPUs. However, the user could invoke soft-reset to shoot down CPUs reliably. But, when the debugger is enabled, these CPUs are returned to hang state after they exited from the debugger. This patch fixes this issue by calling crash_kexec_secondary() before returns to previous state.
Signed-off-by: David Wilder <dwilder@us.ibm.com>
Signed-off-by: Haren Myneni <haren@us.ibm.com>
--- 2617-rc1/arch/powerpc/kernel/traps.c.orig 2006-04-05 13:25:22.000000000 -0700
+++ 2617-rc1/arch/powerpc/kernel/traps.c 2006-04-05 13:25:33.000000000 -0700
@@ -206,6 +206,16 @@ void system_reset_exception(struct pt_re
die("System Reset", regs, SIGABRT);
+ /*
+ * Some CPUs which got released from debugger will execute this path.
+ * These CPUs entered debugger first time via soft-reset - Means,
+ * could be possible that these CPUs may not repond to an IPI later.
+ * Therefore, has to call kdump func directly.
+ * Not a problem if we exited from debugger to recover. In this case
+ * there will not be any primary kexec CPU. Hence, will be returned.
+ */
+ crash_kexec_secondary(regs);
+
/* Must die if the interrupt is not recoverable */
if (!(regs->msr & MSR_RI))
panic("Unrecoverable System Reset");
^ permalink raw reply
* MPC83xx: CF access in True IDE mode
From: SIP COP 009 @ 2006-04-10 23:00 UTC (permalink / raw)
To: linuxppc-embedded
[-- Attachment #1: Type: text/plain, Size: 172 bytes --]
Hi !
I am looking out for some pointers to get CF working in the TRUE IDE mode on
Linux2.6
Any pointers and help would be greatly appreciated.
Thanks!
ashutosh
[-- Attachment #2: Type: text/html, Size: 209 bytes --]
^ permalink raw reply
* [PATCH] 2 of 3 kdump-ppc64-soft-reset-fixes
From: David Wilder @ 2006-04-10 22:31 UTC (permalink / raw)
To: fastboot, linuxppc-dev, akpm, mchintage, paulus, hbabu
[-- Attachment #1: Type: text/plain, Size: 1680 bytes --]
- When a system hangs, user will activate the soft-reset to initiate the
kdump boot. But, soft-reset behavior is indeterminate on sending FWNMI
to all CPUS. i.e, all CPUs will not get FW NMI at the same time. When
the first CPU entered (calling primary CPU here onwards) kdump using
crash_kexec(), sends an IPI to other CPUs. Some CPUs will respond to
this IPI and execute crash_ipi_callback() before receive NMI. When they
receive FW NMI, will execute die() and waiting forever since no more
kdump IPI coming from the primary CPU. This issue will be fixed by
invoking crash_kexec_secondary() directly from die().
Since the secondary CPUs will enter the IPI_callback function two times,
CPU states have to be saved only once and the primary CPU has to start
kdump boot after all CPUs are stopped. Hence, cpus_in_crash bitmap is
used to determine whether pt_regs is saved. If the bit is not set, regs
will be saved. Introduced cpus_in_sr bitmap and enter_on_soft_reset
counter which are used to let the primary CPU know that all secondary
CPUs entered via soft-reset and ready to do down.
- For the crash scenario, when a CPU hangs with interrupts disabled and
the other CPUs panic or user invoked kdump boot using sysrq-c. In this
case, the hung CPU can not be stopped and causes the kdump boot not
successful. This case can be treated as complete system hang and asks
the user to activate soft-reset if all secondary CPUs are not stopped.
Please pick-up this patch. (Note this patch is dependent on
kdump-image-rm-static.patch, see my earlier posting)
--
David Wilder
IBM Linux Technology Center
Beaverton, Oregon, USA
dwilder@us.ibm.com
(503)578-3789
[-- Attachment #2: kdump-ppc64-soft-reset-fixes.patch --]
[-- Type: text/x-patch, Size: 10588 bytes --]
- When a system hangs, user will activate the soft-reset to initiate the kdump boot. But, soft-reset behavior is indeterminate on sending FWNMI to all CPUS. i.e, all CPUs will not get FW NMI at the same time. When the first CPU entered (calling primary CPU here onwards) kdump using crash_kexec(), sends an IPI to other CPUs. Some CPUs will respond to this IPI and execute crash_ipi_callback() before receive NMI. When they receive FW NMI, will execute die() and waiting forever since no more kdump IPI coming from the primary CPU. This issue will be fixed by invoking crash_kexec_secondary() directly from die().
Since the secondary CPUs will enter the IPI_callback function two times, CPU states have to be saved only once and the primary CPU has to start kdump boot after all CPUs are stopped. Hence, cpus_in_crash bitmap is used to determine whether pt_regs is saved. If the bit is not set, regs will be saved. Introduced cpus_in_sr bitmap and enter_on_soft_reset counter which are used to let the primary CPU know that all secondary CPUs entered via soft-reset and ready to do down.
- For the crash scenario, when a CPU hangs with interrupts disabled and the other CPUs panic or user invoked kdump boot using sysrq-c. In this case, the hung CPU can not be stopped and causes the kdump boot not successful. This case can be treated as complete system hang and asks the user to activate soft-reset if all secondary CPUs are not stopped.
Signed-off-by: David Wilder <dwilder@us.ibm.com>
Signed-off-by: Haren Myneni <haren@us.ibm.com>
diff -Naurp 2617-rc1.orig/arch/powerpc/kernel/crash.c 2617-rc1/arch/powerpc/kernel/crash.c
--- 2617-rc1.orig/arch/powerpc/kernel/crash.c 2006-04-05 13:20:38.000000000 -0700
+++ 2617-rc1/arch/powerpc/kernel/crash.c 2006-04-05 13:24:19.000000000 -0700
@@ -23,6 +23,7 @@
#include <linux/elfcore.h>
#include <linux/init.h>
#include <linux/types.h>
+#include <linux/irq.h>
#include <asm/processor.h>
#include <asm/machdep.h>
@@ -40,6 +41,9 @@
/* This keeps a track of which one is crashing cpu. */
int crashing_cpu = -1;
+static cpumask_t cpus_in_crash = CPU_MASK_NONE;
+extern struct kimage *kexec_crash_image;
+extern cpumask_t cpus_in_sr;
static u32 *append_elf_note(u32 *buf, char *name, unsigned type, void *data,
size_t data_len)
@@ -97,34 +101,66 @@ static void crash_save_this_cpu(struct p
}
#ifdef CONFIG_SMP
-static atomic_t waiting_for_crash_ipi;
+static atomic_t enter_on_soft_reset = ATOMIC_INIT(0);
void crash_ipi_callback(struct pt_regs *regs)
{
int cpu = smp_processor_id();
- if (cpu == crashing_cpu)
- return;
-
if (!cpu_online(cpu))
return;
- if (ppc_md.kexec_cpu_down)
- ppc_md.kexec_cpu_down(1, 1);
-
local_irq_disable();
+ if (!cpu_isset(cpu, cpus_in_crash))
+ crash_save_this_cpu(regs, cpu);
+ cpu_set(cpu, cpus_in_crash);
- crash_save_this_cpu(regs, cpu);
- atomic_dec(&waiting_for_crash_ipi);
+ /*
+ * Entered via soft-reset - could be the kdump
+ * process is invoked using soft-reset or user activated
+ * it if some CPU did not respond to an IPI.
+ * For soft-reset, the secondary CPU can enter this func
+ * twice. 1 - using IPI, and 2. soft-reset.
+ * Tell the kexec CPU that entered via soft-reset and ready
+ * to go down.
+ */
+ if (cpu_isset(cpu, cpus_in_sr)) {
+ cpu_clear(cpu, cpus_in_sr);
+ atomic_inc(&enter_on_soft_reset);
+ }
+
+ /*
+ * Starting the kdump boot.
+ * This barrier is needed to make sure that all CPUs are stopped.
+ * If not, soft-reset will be invoked to bring other CPUs.
+ */
+ while (!cpu_isset(crashing_cpu, cpus_in_crash))
+ barrier();
+
+ if (ppc_md.kexec_cpu_down)
+ ppc_md.kexec_cpu_down(1, 1);
kexec_smp_wait();
/* NOTREACHED */
}
-static void crash_kexec_prepare_cpus(void)
+/*
+ * Wait until all CPUs are entered via soft-reset.
+ */
+static void crash_soft_reset_check(int cpu)
+{
+ unsigned int ncpus = num_online_cpus() - 1;/* Excluding the panic cpu */
+
+ cpu_clear(cpu, cpus_in_sr);
+ while (atomic_read(&enter_on_soft_reset) != ncpus)
+ barrier();
+}
+
+
+static void crash_kexec_prepare_cpus(int cpu)
{
unsigned int msecs;
- atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
+ unsigned int ncpus = num_online_cpus() - 1;/* Excluding the panic cpu */
crash_send_ipi(crash_ipi_callback);
smp_wmb();
@@ -132,13 +168,12 @@ static void crash_kexec_prepare_cpus(voi
/*
* FIXME: Until we will have the way to stop other CPUSs reliabally,
* the crash CPU will send an IPI and wait for other CPUs to
- * respond. If not, proceed the kexec boot even though we failed to
- * capture other CPU states.
+ * respond.
* Delay of at least 10 seconds.
*/
- printk(KERN_ALERT "Sending IPI to other cpus...\n");
+ printk(KERN_EMERG "Sending IPI to other cpus...\n");
msecs = 10000;
- while ((atomic_read(&waiting_for_crash_ipi) > 0) && (--msecs > 0)) {
+ while ((cpus_weight(cpus_in_crash) < ncpus) && (--msecs > 0)) {
barrier();
mdelay(1);
}
@@ -148,18 +183,71 @@ static void crash_kexec_prepare_cpus(voi
/*
* FIXME: In case if we do not get all CPUs, one possibility: ask the
* user to do soft reset such that we get all.
- * IPI handler is already set by the panic cpu initially. Therefore,
- * all cpus could invoke this handler from die() and the panic CPU
- * will call machine_kexec() directly from this handler to do
- * kexec boot.
- */
- if (atomic_read(&waiting_for_crash_ipi))
- printk(KERN_ALERT "done waiting: %d cpus not responding\n",
- atomic_read(&waiting_for_crash_ipi));
+ * Soft-reset will be used until better mechanism is implemented.
+ */
+ if (cpus_weight(cpus_in_crash) < ncpus) {
+ printk(KERN_EMERG "done waiting: %d cpu(s) not responding\n",
+ ncpus - cpus_weight(cpus_in_crash));
+ printk(KERN_EMERG "Activate soft-reset to stop other cpu(s)\n");
+ cpus_in_sr = CPU_MASK_NONE;
+ atomic_set(&enter_on_soft_reset, 0);
+ while (cpus_weight(cpus_in_crash) < ncpus)
+ barrier();
+ }
+ /*
+ * Make sure all CPUs are entered via soft-reset if the kdump is
+ * invoked using soft-reset.
+ */
+ if (cpu_isset(cpu, cpus_in_sr))
+ crash_soft_reset_check(cpu);
/* Leave the IPI callback set */
}
+
+/*
+ * This function will be called by secondary cpus or by kexec cpu
+ * if soft-reset is activated to stop some CPUs.
+ */
+void crash_kexec_secondary(struct pt_regs *regs)
+{
+ int cpu = smp_processor_id();
+ unsigned long flags;
+ int msecs = 5;
+
+ local_irq_save(flags);
+ /* Wait 5ms if the kexec CPU is not entered yet. */
+ while (crashing_cpu < 0) {
+ if (--msecs < 0) {
+ /*
+ * Either kdump image is not loaded or
+ * kdump process is not started - Probably xmon
+ * exited using 'x'(exit and recover) or
+ * kexec_should_crash() failed for all running tasks.
+ */
+ cpu_clear(cpu, cpus_in_sr);
+ local_irq_restore(flags);
+ return;
+ }
+ mdelay(1);
+ barrier();
+ }
+ if (cpu == crashing_cpu) {
+ /*
+ * Panic CPU will enter this func only via soft-reset.
+ * Wait until all secondary CPUs entered and
+ * then start kexec boot.
+ */
+ crash_soft_reset_check(cpu);
+ cpu_set(crashing_cpu, cpus_in_crash);
+ if (ppc_md.kexec_cpu_down)
+ ppc_md.kexec_cpu_down(1, 0);
+ machine_kexec(kexec_crash_image);
+ /* NOTREACHED */
+ }
+ crash_ipi_callback(regs);
+}
+
#else
-static void crash_kexec_prepare_cpus(void)
+static void crash_kexec_prepare_cpus(int cpu)
{
/*
* move the secondarys to us so that we can copy
@@ -170,6 +258,10 @@ static void crash_kexec_prepare_cpus(voi
smp_release_cpus();
}
+void crash_kexec_secondary(struct pt_regs *regs)
+{
+ cpus_in_sr = CPU_MASK_NONE;
+}
#endif
void default_machine_crash_shutdown(struct pt_regs *regs)
@@ -185,15 +277,14 @@ void default_machine_crash_shutdown(stru
* The kernel is broken so disable interrupts.
*/
local_irq_disable();
-
- if (ppc_md.kexec_cpu_down)
- ppc_md.kexec_cpu_down(1, 0);
-
/*
* Make a note of crashing cpu. Will be used in machine_kexec
* such that another IPI will not be sent.
*/
crashing_cpu = smp_processor_id();
- crash_kexec_prepare_cpus();
crash_save_this_cpu(regs, crashing_cpu);
+ crash_kexec_prepare_cpus(crashing_cpu);
+ cpu_set(crashing_cpu, cpus_in_crash);
+ if (ppc_md.kexec_cpu_down)
+ ppc_md.kexec_cpu_down(1, 0);
}
diff -Naurp 2617-rc1.orig/arch/powerpc/kernel/traps.c 2617-rc1/arch/powerpc/kernel/traps.c
--- 2617-rc1.orig/arch/powerpc/kernel/traps.c 2006-04-05 13:20:33.000000000 -0700
+++ 2617-rc1/arch/powerpc/kernel/traps.c 2006-04-05 13:22:00.000000000 -0700
@@ -51,9 +51,13 @@
#include <asm/firmware.h>
#include <asm/processor.h>
#endif
+#include <asm/kexec.h>
#ifdef CONFIG_PPC64 /* XXX */
#define _IO_BASE pci_io_base
+#ifdef CONFIG_KEXEC
+cpumask_t cpus_in_sr = CPU_MASK_NONE;
+#endif
#endif
#ifdef CONFIG_DEBUGGER
@@ -96,7 +100,7 @@ static DEFINE_SPINLOCK(die_lock);
int die(const char *str, struct pt_regs *regs, long err)
{
- static int die_counter, crash_dump_start = 0;
+ static int die_counter;
if (debugger(regs))
return 1;
@@ -128,22 +132,12 @@ int die(const char *str, struct pt_regs
print_modules();
show_regs(regs);
bust_spinlocks(0);
-
- if (!crash_dump_start && kexec_should_crash(current)) {
- crash_dump_start = 1;
- spin_unlock_irq(&die_lock);
- crash_kexec(regs);
- /* NOTREACHED */
- }
spin_unlock_irq(&die_lock);
- if (crash_dump_start)
- /*
- * Only for soft-reset: Other CPUs will be responded to an IPI
- * sent by first kexec CPU.
- */
- for(;;)
- ;
+ if (kexec_should_crash(current))
+ crash_kexec(regs);
+ crash_kexec_secondary(regs);
+
if (in_interrupt())
panic("Fatal exception in interrupt");
@@ -205,6 +199,10 @@ void system_reset_exception(struct pt_re
if (ppc_md.system_reset_exception(regs))
return;
}
+
+#ifdef CONFIG_KEXEC
+ cpu_set(smp_processor_id(), cpus_in_sr);
+#endif
die("System Reset", regs, SIGABRT);
diff -Naurp 2617-rc1.orig/include/asm-powerpc/kexec.h 2617-rc1/include/asm-powerpc/kexec.h
--- 2617-rc1.orig/include/asm-powerpc/kexec.h 2006-04-05 13:20:53.000000000 -0700
+++ 2617-rc1/include/asm-powerpc/kexec.h 2006-04-05 13:18:27.000000000 -0700
@@ -123,8 +123,10 @@ extern int default_machine_kexec_prepare
extern void default_machine_crash_shutdown(struct pt_regs *regs);
extern void machine_kexec_simple(struct kimage *image);
-
+extern void crash_kexec_secondary(struct pt_regs *regs);
#endif /* ! __ASSEMBLY__ */
+#else
+static inline void crash_kexec_secondary(struct pt_regs *regs) { }
#endif /* CONFIG_KEXEC */
#endif /* __KERNEL__ */
#endif /* _ASM_POWERPC_KEXEC_H */
^ permalink raw reply
* Re: [PATCH] 2 of 3 kdump-ppc64-soft-reset-fixes
From: Andrew Morton @ 2006-04-10 22:58 UTC (permalink / raw)
To: David Wilder; +Cc: mchintage, linuxppc-dev, fastboot, paulus
In-Reply-To: <443ADCCE.1070504@us.ibm.com>
David Wilder <dwilder@us.ibm.com> wrote:
>
>
Please don't use a filename as a patch title. See section 2 of
http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt.
> --- 2617-rc1.orig/arch/powerpc/kernel/crash.c 2006-04-05 13:20:38.000000000 -0700
> +++ 2617-rc1/arch/powerpc/kernel/crash.c 2006-04-05 13:24:19.000000000 -0700
> @@ -23,6 +23,7 @@
> #include <linux/elfcore.h>
> #include <linux/init.h>
> #include <linux/types.h>
> +#include <linux/irq.h>
>
> #include <asm/processor.h>
> #include <asm/machdep.h>
> @@ -40,6 +41,9 @@
>
> /* This keeps a track of which one is crashing cpu. */
> int crashing_cpu = -1;
> +static cpumask_t cpus_in_crash = CPU_MASK_NONE;
> +extern struct kimage *kexec_crash_image;
> +extern cpumask_t cpus_in_sr;
extern declarations should be placed in .h, not in .c.
> + while (!cpu_isset(crashing_cpu, cpus_in_crash))
> + barrier();
The patch contains lots of busy-loops which all do barrier().
barrier() is purely a compiler thing. I suspect you meant cpu_relax().
Either way, there should be a cpu_relax() in these loops.
> +
> +/*
> + * This function will be called by secondary cpus or by kexec cpu
> + * if soft-reset is activated to stop some CPUs.
> + */
> +void crash_kexec_secondary(struct pt_regs *regs)
> +{
> + int cpu = smp_processor_id();
> + unsigned long flags;
> + int msecs = 5;
> +
> + local_irq_save(flags);
> + /* Wait 5ms if the kexec CPU is not entered yet. */
> + while (crashing_cpu < 0) {
> + if (--msecs < 0) {
> + /*
> + * Either kdump image is not loaded or
> + * kdump process is not started - Probably xmon
> + * exited using 'x'(exit and recover) or
> + * kexec_should_crash() failed for all running tasks.
> + */
> + cpu_clear(cpu, cpus_in_sr);
> + local_irq_restore(flags);
> + return;
> + }
> + mdelay(1);
> + barrier();
> + }
> + if (cpu == crashing_cpu) {
Whitespace broke here.
> + /*
> + * Panic CPU will enter this func only via soft-reset.
> + * Wait until all secondary CPUs entered and
> + * then start kexec boot.
> + */
> + crash_soft_reset_check(cpu);
> + cpu_set(crashing_cpu, cpus_in_crash);
> + if (ppc_md.kexec_cpu_down)
> + ppc_md.kexec_cpu_down(1, 0);
> + machine_kexec(kexec_crash_image);
> + /* NOTREACHED */
> + }
> + crash_ipi_callback(regs);
> +}
^ permalink raw reply
* Create permanent mapping from PCI bus to region of physical memory
From: Howard, Marc @ 2006-04-11 0:17 UTC (permalink / raw)
To: linuxppc-embedded
Hi,
I'm working on a PPC440GX based board in which a PCI based peripheral
communicates with the host via a shared region of processor memory. The
peripheral will read and write this region autonomously. Because of
this I need a fixed mapping FROM the PCI bus TO system memory.
It's easy enough to rope off the top 16 MB or so of system physical
memory by lying to linux in the boot arguments. What I can't seem find
is a system call to use to create a fixed, permanent mapping of PPC
memory as seen from the PCI bus.
Has anyone out there done this and could share a code snippet with me?
Thanks,
Marc W. Howard
^ permalink raw reply
* Re: [Cbe-oss-dev] [PATCH 4/5] powerpc: export symbols for page size selection
From: Benjamin Herrenschmidt @ 2006-04-11 1:21 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linuxppc-dev, cbe-oss-dev, Arnd Bergmann
In-Reply-To: <20060407152813.GA382@lst.de>
On Fri, 2006-04-07 at 17:28 +0200, Christoph Hellwig wrote:
> On Fri, Apr 07, 2006 at 12:00:04AM +0200, arnd@arndb.de wrote:
> > We need access to some symbols in powerpc memory management
> > from spufs in order to create proper SLB entries.
>
> One more reason to disallow modular spufs..
Why ? It's not problem as long as they are exported GPL only ...
Ben.
^ permalink raw reply
* Re: Create permanent mapping from PCI bus to region of physical memory
From: David Hawkins @ 2006-04-11 1:32 UTC (permalink / raw)
To: Howard, Marc; +Cc: linuxppc-embedded
In-Reply-To: <91B22F93A880FA48879475E134D6F0BE026CD163@CA1EXCLV02.adcorp.kla-tencor.com>
Howard, Marc wrote:
> Hi,
>
> I'm working on a PPC440GX based board in which a PCI based peripheral
> communicates with the host via a shared region of processor memory. The
> peripheral will read and write this region autonomously. Because of
> this I need a fixed mapping FROM the PCI bus TO system memory.
>
> It's easy enough to rope off the top 16 MB or so of system physical
> memory by lying to linux in the boot arguments. What I can't seem find
> is a system call to use to create a fixed, permanent mapping of PPC
> memory as seen from the PCI bus.
>
> Has anyone out there done this and could share a code snippet with me?
>
Hi Marc,
So the PPC440GX is the host, but what is the peripheral, and
what restricts it to require a fixed address? Can it handle
a set of fixed addresses, eg. a scatter-gather buffer setup
by the host? Can the host and peripheral communicate in any
other way, eg. when the peripheral changes something in the
shared memory, surely it has to interrupt the host to let it
know?
Does the peripheral contain a CPU, if thats the case, then
the host and peripheral could also maintain a smaller region
that contains addresses of other pages, i.e., no need to
restrict the design to a single block of memory, you just
need a page of pointers to other shared blocks of pages.
For example, the alloc_pages call can be used to get order 11
pages, eg. 2^11 x 4096-byte pages, 8MB (hmm, thats sounds a bit
big, I seem to recall 2MB was the maximum). So your 16MBs
could be 8 separate blocks.
Anyway the comment is that you can use alloc_pages() to give
you a contiguous chunk of memory at some allocation-time
address, but not a 16MB chunk.
Of course if the peripheral is fairly dumb, then you can't do
this. But if it indicates a buffer-used condition, perhaps you
can swap over to a new 2MB block.
If you can provide a few more details on the restriction of
the peripheral, then I or others can suggest options.
Dave
^ permalink raw reply
* Re: Oops: machine check, sig: 7 [#1] - 16-bit Pccard - SOLVED!!!
From: Benjamin Herrenschmidt @ 2006-04-11 1:30 UTC (permalink / raw)
To: Daniel Ritz; +Cc: linuxppc-dev, Edward Felberbaum, linux-pcmcia, paulus
In-Reply-To: <200604092257.52206.daniel.ritz-ml@swissonline.ch>
> > I see from the dmesg output from my original post that memory ranges
> > 0xfdd7f000 and 0xfddff000 are used by the Gatwick and Heathrow mac io
> > controllers. That explains the conflict with PCMCIA over 0xfd000000.
>
> interesting...the memory ranges are used by other devices yet the
> request_resource() call in PCMCIA succeeds,,,and PCI resources shoudn't
> be there in the first place then...
>
> ok, it's in file arch/powerpc/platforms/powermac/feature.c...
> i can't see any request_resource() calls in there...so CC'ing the PPC guys..
> they can sure comment...
Most of the macio stuff is in drivers/macintosh/macio_asic.c ... the
whole macio itself is a PCI device and thus should already "occupy" it's
possibly address range...
> >
> > Question, can I minimize the range of memory that is reserved 0xffffff - or
> > is it a waste of time?
> >
>
> yeah, you probably could, but it sounds like a waste of time...
>
> > Eddie
> >
>
> rgds
> -daniel
^ permalink raw reply
* RE: Create permanent mapping from PCI bus to region of physical memory
From: Howard, Marc @ 2006-04-11 2:56 UTC (permalink / raw)
To: David Hawkins; +Cc: linuxppc-embedded
In-Reply-To: <443B0718.3050001@ovro.caltech.edu>
[-- Attachment #1: Type: text/plain, Size: 2082 bytes --]
Howard, Marc wrote:
>> Hi,
>>
>> I'm working on a PPC440GX based board in which a PCI based peripheral
>> communicates with the host via a shared region of processor memory. The
>> peripheral will read and write this region autonomously. Because of
>> this I need a fixed mapping FROM the PCI bus TO system memory.
>>
>> It's easy enough to rope off the top 16 MB or so of system physical
>> memory by lying to linux in the boot arguments. What I can't seem find
>> is a system call to use to create a fixed, permanent mapping of PPC
>> memory as seen from the PCI bus.
>>
>> Has anyone out there done this and could share a code snippet with me?
>>
>Hi Marc,
>So the PPC440GX is the host, but what is the peripheral, and
>what restricts it to require a fixed address? Can it handle
>a set of fixed addresses, eg. a scatter-gather buffer setup
>by the host? Can the host and peripheral communicate in any
>other way, eg. when the peripheral changes something in the
>shared memory, surely it has to interrupt the host to let it
>know?
>Does the peripheral contain a CPU, if thats the case, then
>the host and peripheral could also maintain a smaller region
>that contains addresses of other pages, i.e., no need to
>restrict the design to a single block of memory, you just
>need a page of pointers to other shared blocks of pages.
Dave,
The periperal is an FPGA. No, there is no internal processor;
everything is coded in Verilog.
Scatter/gather isn't a viable option because of this. Additionally
non-contiguous memory would reduce bandwidth and increase
FPGA design complexity. The data must be contignuous because
of these reasons and the need for the data to be randomly
accessible from the outside using simple address arithmetic.
I realize this isn't a standard linux request but having
fixed, linear memory is quite common in embedded apps. There
should be a way to create this mapping in the 440GX's hardware
and I'm just looking for a system call (if there is one) to
implement it.
Thanks,
Marc
[-- Attachment #2: Type: text/html, Size: 2869 bytes --]
^ permalink raw reply
* Re: Create permanent mapping from PCI bus to region of physical memory
From: David Hawkins @ 2006-04-11 3:13 UTC (permalink / raw)
To: Howard, Marc; +Cc: linuxppc-embedded
In-Reply-To: <91B22F93A880FA48879475E134D6F0BE386FBC@CA1EXCLV02.adcorp.kla-tencor.com>
Hi Marc,
> The periperal is an FPGA. No, there is no internal processor;
> everything is coded in Verilog.
>
> Scatter/gather isn't a viable option because of this.
Er, why not, its an FPGA, everything is possible :)
So you have a PCI core, since you are planning to write to
the host memory space, its a Bus Master PCI core.
Who's FPGA, Altera, Xilinx, or someone else?
> Additionally non-contiguous memory would reduce bandwidth
> and increase FPGA design complexity.
Not necessarily. If the target is using bus master DMA to
write to the host memory, then you can hit pretty close
to the bandwidth of the PCI bus. If you are DMAing in
big blocks, the overhead of a block change isn't too much.
I did tests with the 440EP using a DMA controller on an
adapter board and found that the PCI bridge in the 440EP
was the limiting factor, i.e., for a 33MHz 32-bit bus
with a potential for 132MB/s, the *best* you can do is
about 40MB/s since the bridge only accepts data in cache
line sizes before sending a retry to the target. I can
send you those results.
> The data must be contignuous because
> of these reasons and the need for the data to be randomly
> accessible from the outside using simple address arithmetic.
Randomly accessible from where; the host or an I/O interface
at the FPGA. The pages can be made to appear contiguous to
a host processor user-space process using the nopage callback
of the VMA.
> I realize this isn't a standard linux request but having
> fixed, linear memory is quite common in embedded apps. There
> should be a way to create this mapping in the 440GX's hardware
> and I'm just looking for a system call (if there is one) to
> implement it.
Alas, this is one of the concessions one must make if you
want to use a processor that enables the MMU. However,
I don't see any fundamental limitation in the design
that would preclude a little extra work on the FPGA.
But, it does require additional Verilog to support
the flexibility. The long-term advantage is that you
don't have to provide a hack (eg. reserve a block of
high-memory under Linux).
So how about this concession. If Linux lets you alloc_pages
in 2MB max chunks, create 8 address decode regions on your
FPGA. Provide the host access to the 8 address registers.
When the Linux driver is installed, the driver alloc_pages
8 times and loads the PCI address of those regions into
the 8 registers.
On the FPGA side of things create a flat 16MB region, as
an address passes through a 2MB block, it changes the PCI
address it decodes to. So, your FPGA believes it has
a 16MB continuous block, and Linux can supply the memory
as 8 non-contiguous chunks. Back in user-space on Linux,
the nopage VMA call is used to map a page-at-a-time
and the 2MB regions appear as a 16MB contiguous region to
user-space. (There are probably other ways to make user
space see it as one block too)
Cheers
Dave
^ permalink raw reply
* RE: Create permanent mapping from PCI bus to region of physical memory
From: Howard, Marc @ 2006-04-11 3:42 UTC (permalink / raw)
To: David Hawkins; +Cc: linuxppc-embedded
In-Reply-To: <443B1EE9.5030104@ovro.caltech.edu>
[-- Attachment #1: Type: text/plain, Size: 3052 bytes --]
>> The periperal is an FPGA. No, there is no internal processor;
>> everything is coded in Verilog.
>>
>> Scatter/gather isn't a viable option because of this.
>Er, why not, its an FPGA, everything is possible :)
Not if you're tight on routing resouces and/or design time ;-)
>So you have a PCI core, since you are planning to write to
>the host memory space, its a Bus Master PCI core.
>Who's FPGA, Altera, Xilinx, or someone else?
Xilinx Virtex 4, no internal PPC. The vendor really isn't
germain to the problem at hand however.
>> Additionally non-contiguous memory would reduce bandwidth
>> and increase FPGA design complexity.
>Not necessarily. If the target is using bus master DMA to
>write to the host memory, then you can hit pretty close
>to the bandwidth of the PCI bus. If you are DMAing in
>big blocks, the overhead of a block change isn't too much.
>I did tests with the 440EP using a DMA controller on an
>adapter board and found that the PCI bridge in the 440EP
>was the limiting factor, i.e., for a 33MHz 32-bit bus
>with a potential for 132MB/s, the *best* you can do is
>about 40MB/s since the bridge only accepts data in cache
>line sizes before sending a retry to the target. I can
>send you those results.
Dave, I appreciate your input but scatter/gather just isn't
an option here for a variety of reasons. Bandwidth/complexity/
latency/time to design/time to debug/FPGA density all
factor into this decision. BTW we're talking 66/64 PCI-X;
33/32 PCI isn't nearly fast enough.
>Randomly accessible from where; the host or an I/O interface
>at the FPGA. The pages can be made to appear contiguous to
>a host processor user-space process using the nopage callback
>of the VMA.
>From the point of view of the FPGA.
>> I realize this isn't a standard linux request but having
>> fixed, linear memory is quite common in embedded apps. There
>> should be a way to create this mapping in the 440GX's hardware
>> and I'm just looking for a system call (if there is one) to
>> implement it.
>Alas, this is one of the concessions one must make if you
>want to use a processor that enables the MMU.
Nope, I disagree. The MMU isn't involved here at all. I'm
talking about setting up the PIM (PCI Inbound Map) via
a system call as opposed to just writing it directly.
>However,
>I don't see any fundamental limitation in the design
>that would preclude a little extra work on the FPGA.
>But, it does require additional Verilog to support
>the flexibility. The long-term advantage is that you
>don't have to provide a hack (eg. reserve a block of
>high-memory under Linux).
I really can't go into detail about my application for a
variety of reasons (both technical and legal) but suffice
it to say that 16 MB is just the starting point.
I guess I'll dig a little deeper into the source to see
where the kernel does the PIM mapping. Re-architecting
our app at this stage just isn't a practical consideration.
Thanks again,
Marc
[-- Attachment #2: Type: text/html, Size: 3973 bytes --]
^ permalink raw reply
* Re: Create permanent mapping from PCI bus to region of physical memory
From: David Hawkins @ 2006-04-11 3:52 UTC (permalink / raw)
To: Howard, Marc; +Cc: linuxppc-embedded
In-Reply-To: <91B22F93A880FA48879475E134D6F0BE386FBD@CA1EXCLV02.adcorp.kla-tencor.com>
> Not if you're tight on routing resouces and/or design time ;-)
Yep, understood.
> Dave, I appreciate your input but scatter/gather just isn't
> an option here for a variety of reasons. Bandwidth/complexity/
> latency/time to design/time to debug/FPGA density all
> factor into this decision. BTW we're talking 66/64 PCI-X;
> 33/32 PCI isn't nearly fast enough.
No problem.
> Nope, I disagree. The MMU isn't involved here at all. I'm
> talking about setting up the PIM (PCI Inbound Map) via
> a system call as opposed to just writing it directly.
The MMU is involved since its responsible for all memory,
i.e., Linux is told that all physical memory is under
its control. I haven't looked at the PCI inbound map,
but I would assume that it can be setup to say map all
of the incoming PCI addresses to all of the SDRAM
addresses, or some block thereof. So the FPGA can see physical
pages that Linux is busy swapping data into and out of.
What you are wanting to do is tell Linux to lay-off
managing a block of SDRAM, but make it visible to the
kernel as a memory area.
So, you need the kernel to setup page tables that map
to those physical addresses and you need to tell Linux
not to treat them like it does the rest of SDRAM.
So I believe the MMU/page tables are involved.
> I really can't go into detail about my application for a
> variety of reasons (both technical and legal) but suffice
> it to say that 16 MB is just the starting point.
No sweat, it never hurts to ask.
> I guess I'll dig a little deeper into the source to see
> where the kernel does the PIM mapping. Re-architecting
> our app at this stage just isn't a practical consideration.
You always have the hack of reserving a block of memory
at the top of SDRAM, i.e., lie to Linux and tell it it
has less memory. It sounds like that might be the least
effort. Rubini has an example of how to do that.
Dave
^ permalink raw reply
* Re: MPC83xx: CF access in True IDE mode
From: Kumar Gala @ 2006-04-11 4:11 UTC (permalink / raw)
To: SIP COP 009; +Cc: linuxppc-embedded
In-Reply-To: <5a4792c00604101600v37616adao2f2598483a65e4d@mail.gmail.com>
On Apr 10, 2006, at 6:00 PM, SIP COP 009 wrote:
> Hi !
>
> I am looking out for some pointers to get CF working in the TRUE
> IDE mode on Linux2.6
>
> Any pointers and help would be greatly appreciated.
There was a driver posted to the linux kernel list for using CF on a
localbus.
- k
^ permalink raw reply
* init issue on linux 2.4 devel and busybox
From: jean-francois.hasson @ 2006-04-11 7:25 UTC (permalink / raw)
To: linuxppc-embedded
Hi,
I have been working on porting linux 2.4 devel to the ml403 board. It boots
well and then it goes on the CompactFlash to get the root filesystem where
the executables are from busybox. The init program is found and launches
well however the message is that init.d/rcS can't be found. I have looked
up my inittab file and the first line is ::sysinit:/etc/init.d/rcS. I do
not understand why init can't find the rcS file when it seems correct in
inittab and I have checked that the file rcS is in /etc/init.d. Does anyone
have an idea as to what might be wrong ?
Thanks,
JF Hasson
#
This e-mail and any attached documents may contain confidential or proprietary information. If you are not the intended recipient, please advise the sender immediately and delete this e-mail and all attached documents from your computer system. Any unauthorised disclosure, distribution or copying hereof is prohibited.
Ce courriel et les documents qui y sont attaches peuvent contenir des informations confidentielles. Au cas ou vous ne seriez pas le destinataire de ce courriel, vous etes prie d'en informer l'expediteur immediatement et de le detruire ainsi que tous les documents attaches de votre systeme informatique. Toute divulgation, distribution ou copie du present courriel et des documents attaches sans autorisation prealable de son emetteur est interdite.
#
^ permalink raw reply
* RTC/I2C problems
From: kevin morizur @ 2006-04-11 8:20 UTC (permalink / raw)
To: linuxppc-embedded
Hi everybody.
I have a problem with my RTC, i work on a MEN A12 board with MPC8245 and
when my kernel start it takes 8 min to run. I have find the problem but i
don't know how to solve it. In time.c the function time_init execute the
following lines :
if (__USE_RTC()) {
/* 601 processor: dec counts down by 128 every 128ns */
tb_ticks_per_jiffy = DECREMENTER_COUNT_601;
/* mulhwu_scale_factor(1000000000, 1000000) is 0x418937 */
tb_to_us = 0x0x418937;
} else {
ppc_md.calibrate_decr();
tb_to_ns_scale = mulhwu(tb_to_us, 1000 << 10);
}
this is the ppc_md.calibrate_decr() function that takes a lot of time and i
don't know how to solve the problem. USE_RTC is fixed at 0 for 6xx config
maybe it must be at 1 ? My RTC run with SMBus (I2C). I work with ELDK 4.0
and linux 2.6.15.
Thanks in advance.
Regards.
_________________________________________________________________
Tout savoir sur la sécurité de vos enfants sur Internet !
http://go.msn.fr/10-channel/80-security/protection/default.asp
^ permalink raw reply
* Re: [RFC/PATCH] powerpc: Use rtas query-cpu-stopped-state in smp spinup
From: Michael Ellerman @ 2006-04-11 8:53 UTC (permalink / raw)
To: Nathan Lynch; +Cc: linuxppc-dev, Paul Mackerras, Arnd Bergmann
In-Reply-To: <20060407011002.GF9208@localdomain>
[-- Attachment #1: Type: text/plain, Size: 953 bytes --]
On Thu, 2006-04-06 at 20:10 -0500, Nathan Lynch wrote:
> I tried your patch (non-kexec boot) on a 1-way SMT power5 with
> 2.6.17-rc1 and got the badness below, and I think it matches what I
> ran into in the past. Maybe doing query-cpu-stopped-state on a thread
> that has yet to be started by RTAS mucks things up?
>
> Memory: 2047640k/2097152k available (5356k kernel code, 49512k
> reserved, 1412k data, 900k bss, 240k init)
> Calibrating delay loop... 375.80 BogoMIPS (lpj=751616)
> Security Framework v1.0.0 initialized
> SELinux: Disabled at boot.
> Mount-cache hash table entries: 256
> Processor 1 is stuck. <<<<<<<<
OK, thanks for testing it. This is clearly not going to fly.
cheers
--
Michael Ellerman
IBM OzLabs
wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)
We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 191 bytes --]
^ permalink raw reply
* Re: [Fastboot] [PATCH]ppc64 kexec tools rm platform fix
From: Michael Ellerman @ 2006-04-11 9:36 UTC (permalink / raw)
To: David Wilder; +Cc: Mohan Kumar, linuxppc-dev list, fastboot
In-Reply-To: <1144659236.26317.35.camel@localhost.localdomain>
David Wilder wrote:
> I lost Michael's reply on the list.
> Here it is with my response..
>
> Michael wrote:
>
> >/ - continue;
> />/ - }
> />/ memset(fname, 0, sizeof(fname));
> />/ strcpy(fname, device_tree);
> />/ strcat(fname, dentry->d_name);
> />/ strcat(fname, "/linux,htab-base");
> />/ if ((file = fopen(fname, "r")) == NULL) {
> />/ - perror(fname);
> />/ closedir(cdir);
> />/ + if (errno == ENOENT) {
> />/ + /* Non LPAR */
> />/ + errno = 0;
> />/ + continue;
> />/ + }
> />/ + perror(fname);
> />/ closedir(dir);
> />/ return -1;
> /
> I don't think you want to do the closedir() before the if. You
> certainly
> don't need to do it twice?
>
> The two closedir() calls are not closing the same thing.
Right. We don't seem to close cdir if it all works though. That code
needs some serious restructuring, 350+ line functions are not cool.
cheers
--
Michael Ellerman
IBM OzLabs
wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)
We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person
^ permalink raw reply
* post to the list
From: stone @ 2006-04-11 10:23 UTC (permalink / raw)
To: linuxppc-embedded
Thank you!
Welcome to the Linuxppc-embedded@ozlabs.org mailing list!
To post to this list, send your email to:
linuxppc-embedded@ozlabs.org
General information about the mailing list is at:
https://ozlabs.org/mailman/listinfo/linuxppc-embedded
If you ever want to unsubscribe or change your options (eg, switch to
or from digest mode, change your password, etc.), visit your
subscription page at:
https://ozlabs.org/mailman/options/linuxppc-embedded/crackpig%40gmail.com
You can also make such adjustments via email by sending a message to:
Linuxppc-embedded-request@ozlabs.org
with the word `help' in the subject or body (don't include the
quotes), and you will get back a message with instructions.
You must know your password to change your options (including changing
the password, itself) or to unsubscribe. It is:
80548054
Normally, Mailman will remind you of your ozlabs.org mailing list
passwords once every month, although you can disable this if you
prefer. This reminder will also include instructions on how to
unsubscribe or change your account options. There is also a button on
your options page that will email your current password to you.
^ permalink raw reply
* [PATCH 0/6] [RFC] Sizing zones and holes in an architecture independent manner
From: Mel Gorman @ 2006-04-11 10:39 UTC (permalink / raw)
To: linuxppc-dev, davej, tony.luck, linux-kernel, ak; +Cc: Mel Gorman
(The TO: list are the architecture maintainers according to the
MAINTAINERS. Apologies in advance if I got the list wrong)
At a basic level, architectures define structures to record where active
ranges of page frames are located. Once located, the code to calculate
zone sizes and holes in each architecture is very similar. Some of this
zone and hole sizing code is difficult to read for no good reason. This
set of patches eliminates the similar-looking architecture-specific code.
The patches introduce a mechanism where architectures register where the
active ranges of page frames are with add_active_range(). When all areas
have been discovered, free_area_init_nodes() is called to initialise
the pgdat and zones. The zone sizes and holes are then calculated in an
architecture independent manner.
Patch 1 introduces the mechanism for registering and initialising PFN ranges
Patch 2 changes ppc to use the mechanism - 136 arch-specific LOC removed
Patch 3 changes x86 to use the mechanism - 150 arch-specific LOC removed
Patch 4 changes x86_64 to use the mechanism - 35 arch-specific LOC removed
Patch 5 changes ia64 to use the mechanism - 59 arch-specific LOC removed
At this point, there is a net reduction of 27 lines of code and the
arch-independent code is a lot easier to read in comparison to some of
the arch-specific stuff, particularly in arch/i386/ .
For Patch 6, it was also noted that page_alloc.c has a *lot* of
initialisation code which makes the file harder to read than it needs to
be. Patch 6 creates a new file mem_init.c and moves a lot of initialisation
code from page_alloc.c to it. After the patch is applied, there is still
a net reduction of 3 lines of code.
The patches have been successfully boot tested on
o x86, flatmem
o x86, NUMAQ
o PPC64, NUMA
o PPC64, CONFIG_NUMA=n
o x86_64, NUMA with SRAT
The patches have only been *compile tested* for ia64 with a flatmem
configuration. At attempt was made to boot test on an ancient RS/6000
but the vanilla kernel does not boot so I have to investigate there.
The net reduction seems small but the big benefit of this set of patches
is the reduction of 380 lines of architecture-specific code, some of
which is very hairy. There should be a greater net reduction when other
architectures use the same mechanisms for zone and hole sizing but I lack
the hardware to test on.
Comments?
Additional credit;
Dave Hansen for the initial suggestion and comments on early patches
Andy Whitcroft for reviewing early versions and catching numerous errors
arch/i386/Kconfig | 8
arch/i386/kernel/setup.c | 19
arch/i386/kernel/srat.c | 98 ----
arch/i386/mm/discontig.c | 59 --
arch/ia64/Kconfig | 3
arch/ia64/mm/contig.c | 62 --
arch/ia64/mm/discontig.c | 43 -
arch/ia64/mm/init.c | 10
arch/powerpc/Kconfig | 13
arch/powerpc/mm/mem.c | 50 --
arch/powerpc/mm/numa.c | 157 ------
arch/ppc/Kconfig | 3
arch/ppc/mm/init.c | 21
arch/x86_64/Kconfig | 3
arch/x86_64/kernel/e820.c | 18
arch/x86_64/mm/init.c | 60 --
arch/x86_64/mm/numa.c | 15
include/asm-ia64/meminit.h | 1
include/asm-x86_64/e820.h | 1
include/asm-x86_64/proto.h | 2
include/linux/mm.h | 14
include/linux/mmzone.h | 15
mm/Makefile | 2
mm/mem_init.c | 1028 +++++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 678 -----------------------------
25 files changed, 1190 insertions(+), 1193 deletions(-)
--
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply
* [PATCH 6/6] Break out memory initialisation code from page_alloc.c to mem_init.c
From: Mel Gorman @ 2006-04-11 10:41 UTC (permalink / raw)
To: davej, linuxppc-dev, tony.luck, ak, linux-kernel; +Cc: Mel Gorman
In-Reply-To: <20060411103946.18153.83059.sendpatchset@skynet>
page_alloc.c contains a large amount of memory initialisation code. This patch
breaks out the initialisation code to a separate file to make page_alloc.c
a bit easier to read.
Makefile | 2
mem_init.c | 1028 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
page_alloc.c | 1004 ----------------------------------------------------
3 files changed, 1029 insertions(+), 1005 deletions(-)
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-105-ia64_use_init_nodes/mm/Makefile linux-2.6.17-rc1-106-breakout_mem_init/mm/Makefile
--- linux-2.6.17-rc1-105-ia64_use_init_nodes/mm/Makefile 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-106-breakout_mem_init/mm/Makefile 2006-04-11 09:37:06.000000000 +0100
@@ -8,7 +8,7 @@ mmu-$(CONFIG_MMU) := fremap.o highmem.o
vmalloc.o
obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
- page_alloc.o page-writeback.o pdflush.o \
+ page_alloc.o mem_init.o page-writeback.o pdflush.o \
readahead.o swap.o truncate.o vmscan.o \
prio_tree.o util.o mmzone.o $(mmu-y)
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-105-ia64_use_init_nodes/mm/mem_init.c linux-2.6.17-rc1-106-breakout_mem_init/mm/mem_init.c
--- linux-2.6.17-rc1-105-ia64_use_init_nodes/mm/mem_init.c 2006-04-11 09:48:34.000000000 +0100
+++ linux-2.6.17-rc1-106-breakout_mem_init/mm/mem_init.c 2006-04-11 09:37:06.000000000 +0100
@@ -0,0 +1,1028 @@
+/*
+ * mm/mem_init.c
+ * Initialises the architecture independant view of memory. pgdats, zones, etc
+ *
+ * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds
+ * Swap reorganised 29.12.95, Stephen Tweedie
+ * Support of BIGMEM added by Gerhard Wichert, Siemens AG, July 1999
+ * Reshaped it to be a zoned allocator, Ingo Molnar, Red Hat, 1999
+ * Discontiguous memory support, Kanoj Sarcar, SGI, Nov 1999
+ * Zone balancing, Kanoj Sarcar, SGI, Jan 2000
+ * Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002
+ * (lots of bits borrowed from Ingo Molnar & Andrew Morton)
+ * Arch-independant zone size and hole calculation, Mel Gorman, IBM, Apr 2006
+ * (lots of bits taken from architecture code)
+ */
+#include <linux/config.h>
+#include <linux/sort.h>
+#include <linux/pfn.h>
+#include <linux/mm.h>
+#include <linux/bootmem.h>
+#include <linux/module.h>
+#include <linux/cpuset.h>
+#include <linux/cpu.h>
+#include <linux/cpuset.h>
+#include <linux/mempolicy.h>
+#include <linux/swap.h>
+#include <linux/sysctl.h>
+
+static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
+int percpu_pagelist_fraction;
+
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+ #ifdef CONFIG_MAX_ACTIVE_REGIONS
+ #define MAX_ACTIVE_REGIONS CONFIG_MAX_ACTIVE_REGIONS
+ #else
+ #define MAX_ACTIVE_REGIONS (MAX_NR_ZONES * MAX_NUMNODES + 1)
+ #endif
+
+ struct node_active_region __initdata early_node_map[MAX_ACTIVE_REGIONS];
+ unsigned long __initdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
+ unsigned long __initdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
+
+/*
+ * Builds allocation fallback zone lists.
+ *
+ * Add all populated zones of a node to the zonelist.
+ */
+static int __init build_zonelists_node(pg_data_t *pgdat,
+ struct zonelist *zonelist, int nr_zones, int zone_type)
+{
+ struct zone *zone;
+
+ BUG_ON(zone_type > ZONE_HIGHMEM);
+
+ do {
+ zone = pgdat->node_zones + zone_type;
+ if (populated_zone(zone)) {
+#ifndef CONFIG_HIGHMEM
+ BUG_ON(zone_type > ZONE_NORMAL);
+#endif
+ zonelist->zones[nr_zones++] = zone;
+ check_highest_zone(zone_type);
+ }
+ zone_type--;
+
+ } while (zone_type >= 0);
+ return nr_zones;
+}
+
+static inline int highest_zone(int zone_bits)
+{
+ int res = ZONE_NORMAL;
+ if (zone_bits & (__force int)__GFP_HIGHMEM)
+ res = ZONE_HIGHMEM;
+ if (zone_bits & (__force int)__GFP_DMA32)
+ res = ZONE_DMA32;
+ if (zone_bits & (__force int)__GFP_DMA)
+ res = ZONE_DMA;
+ return res;
+}
+
+#ifdef CONFIG_NUMA
+#define MAX_NODE_LOAD (num_online_nodes())
+static int __initdata node_load[MAX_NUMNODES];
+/**
+ * find_next_best_node - find the next node that should appear in a given node's fallback list
+ * @node: node whose fallback list we're appending
+ * @used_node_mask: nodemask_t of already used nodes
+ *
+ * We use a number of factors to determine which is the next node that should
+ * appear on a given node's fallback list. The node should not have appeared
+ * already in @node's fallback list, and it should be the next closest node
+ * according to the distance array (which contains arbitrary distance values
+ * from each node to each node in the system), and should also prefer nodes
+ * with no CPUs, since presumably they'll have very little allocation pressure
+ * on them otherwise.
+ * It returns -1 if no node is found.
+ */
+static int __init find_next_best_node(int node, nodemask_t *used_node_mask)
+{
+ int n, val;
+ int min_val = INT_MAX;
+ int best_node = -1;
+
+ /* Use the local node if we haven't already */
+ if (!node_isset(node, *used_node_mask)) {
+ node_set(node, *used_node_mask);
+ return node;
+ }
+
+ for_each_online_node(n) {
+ cpumask_t tmp;
+
+ /* Don't want a node to appear more than once */
+ if (node_isset(n, *used_node_mask))
+ continue;
+
+ /* Use the distance array to find the distance */
+ val = node_distance(node, n);
+
+ /* Penalize nodes under us ("prefer the next node") */
+ val += (n < node);
+
+ /* Give preference to headless and unused nodes */
+ tmp = node_to_cpumask(n);
+ if (!cpus_empty(tmp))
+ val += PENALTY_FOR_NODE_WITH_CPUS;
+
+ /* Slight preference for less loaded node */
+ val *= (MAX_NODE_LOAD*MAX_NUMNODES);
+ val += node_load[n];
+
+ if (val < min_val) {
+ min_val = val;
+ best_node = n;
+ }
+ }
+
+ if (best_node >= 0)
+ node_set(best_node, *used_node_mask);
+
+ return best_node;
+}
+
+static void __init build_zonelists(pg_data_t *pgdat)
+{
+ int i, j, k, node, local_node;
+ int prev_node, load;
+ struct zonelist *zonelist;
+ nodemask_t used_mask;
+
+ /* initialize zonelists */
+ for (i = 0; i < GFP_ZONETYPES; i++) {
+ zonelist = pgdat->node_zonelists + i;
+ zonelist->zones[0] = NULL;
+ }
+
+ /* NUMA-aware ordering of nodes */
+ local_node = pgdat->node_id;
+ load = num_online_nodes();
+ prev_node = local_node;
+ nodes_clear(used_mask);
+ while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
+ int distance = node_distance(local_node, node);
+
+ /*
+ * If another node is sufficiently far away then it is better
+ * to reclaim pages in a zone before going off node.
+ */
+ if (distance > RECLAIM_DISTANCE)
+ zone_reclaim_mode = 1;
+
+ /*
+ * We don't want to pressure a particular node.
+ * So adding penalty to the first node in same
+ * distance group to make it round-robin.
+ */
+
+ if (distance != node_distance(local_node, prev_node))
+ node_load[node] += load;
+ prev_node = node;
+ load--;
+ for (i = 0; i < GFP_ZONETYPES; i++) {
+ zonelist = pgdat->node_zonelists + i;
+ for (j = 0; zonelist->zones[j] != NULL; j++);
+
+ k = highest_zone(i);
+
+ j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
+ zonelist->zones[j] = NULL;
+ }
+ }
+}
+
+#else /* CONFIG_NUMA */
+
+static void __init build_zonelists(pg_data_t *pgdat)
+{
+ int i, j, k, node, local_node;
+
+ local_node = pgdat->node_id;
+ for (i = 0; i < GFP_ZONETYPES; i++) {
+ struct zonelist *zonelist;
+
+ zonelist = pgdat->node_zonelists + i;
+
+ j = 0;
+ k = highest_zone(i);
+ j = build_zonelists_node(pgdat, zonelist, j, k);
+ /*
+ * Now we build the zonelist so that it contains the zones
+ * of all the other nodes.
+ * We don't want to pressure a particular node, so when
+ * building the zones for node N, we make sure that the
+ * zones coming right after the local ones are those from
+ * node N+1 (modulo N)
+ */
+ for (node = local_node + 1; node < MAX_NUMNODES; node++) {
+ if (!node_online(node))
+ continue;
+ j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
+ }
+ for (node = 0; node < local_node; node++) {
+ if (!node_online(node))
+ continue;
+ j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
+ }
+
+ zonelist->zones[j] = NULL;
+ }
+}
+
+#endif /* CONFIG_NUMA */
+
+void __init build_all_zonelists(void)
+{
+ int i;
+
+ for_each_online_node(i)
+ build_zonelists(NODE_DATA(i));
+ printk("Built %i zonelists\n", num_online_nodes());
+ cpuset_init_current_mems_allowed();
+}
+
+/*
+ * Helper functions to size the waitqueue hash table.
+ * Essentially these want to choose hash table sizes sufficiently
+ * large so that collisions trying to wait on pages are rare.
+ * But in fact, the number of active page waitqueues on typical
+ * systems is ridiculously low, less than 200. So this is even
+ * conservative, even though it seems large.
+ *
+ * The constant PAGES_PER_WAITQUEUE specifies the ratio of pages to
+ * waitqueues, i.e. the size of the waitq table given the number of pages.
+ */
+#define PAGES_PER_WAITQUEUE 256
+
+static inline unsigned long wait_table_size(unsigned long pages)
+{
+ unsigned long size = 1;
+
+ pages /= PAGES_PER_WAITQUEUE;
+
+ while (size < pages)
+ size <<= 1;
+
+ /*
+ * Once we have dozens or even hundreds of threads sleeping
+ * on IO we've got bigger problems than wait queue collision.
+ * Limit the size of the wait table to a reasonable size.
+ */
+ size = min(size, 4096UL);
+
+ return max(size, 4UL);
+}
+
+/*
+ * This is an integer logarithm so that shifts can be used later
+ * to extract the more random high bits from the multiplicative
+ * hash function before the remainder is taken.
+ */
+static inline unsigned long wait_table_bits(unsigned long size)
+{
+ return ffz(~size);
+}
+
+#define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
+
+#ifndef __HAVE_ARCH_MEMMAP_INIT
+#define memmap_init(size, nid, zone, start_pfn) \
+ memmap_init_zone((size), (nid), (zone), (start_pfn))
+#endif
+
+/*
+ * Initially all pages are reserved - free ones are freed
+ * up by free_all_bootmem() once the early boot process is
+ * done. Non-atomic initialization, single-pass.
+ */
+void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
+ unsigned long start_pfn)
+{
+ struct page *page;
+ unsigned long end_pfn = start_pfn + size;
+ unsigned long pfn;
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+ if (!early_pfn_valid(pfn))
+ continue;
+ page = pfn_to_page(pfn);
+ set_page_links(page, zone, nid, pfn);
+ init_page_count(page);
+ reset_page_mapcount(page);
+ SetPageReserved(page);
+ INIT_LIST_HEAD(&page->lru);
+#ifdef WANT_PAGE_VIRTUAL
+ /* The shift won't overflow because ZONE_NORMAL is below 4G. */
+ if (!is_highmem_idx(zone))
+ set_page_address(page, __va(pfn << PAGE_SHIFT));
+#endif
+ }
+}
+
+void zone_init_free_lists(struct pglist_data *pgdat, struct zone *zone,
+ unsigned long size)
+{
+ int order;
+ for (order = 0; order < MAX_ORDER ; order++) {
+ INIT_LIST_HEAD(&zone->free_area[order].free_list);
+ zone->free_area[order].nr_free = 0;
+ }
+}
+
+#define ZONETABLE_INDEX(x, zone_nr) ((x << ZONES_SHIFT) | zone_nr)
+void zonetable_add(struct zone *zone, int nid, int zid, unsigned long pfn,
+ unsigned long size)
+{
+ unsigned long snum = pfn_to_section_nr(pfn);
+ unsigned long end = pfn_to_section_nr(pfn + size);
+
+ if (FLAGS_HAS_NODE)
+ zone_table[ZONETABLE_INDEX(nid, zid)] = zone;
+ else
+ for (; snum <= end; snum++)
+ zone_table[ZONETABLE_INDEX(snum, zid)] = zone;
+}
+
+static __meminit
+void zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
+{
+ int i;
+ struct pglist_data *pgdat = zone->zone_pgdat;
+
+ /*
+ * The per-page waitqueue mechanism uses hashed waitqueues
+ * per zone.
+ */
+ zone->wait_table_size = wait_table_size(zone_size_pages);
+ zone->wait_table_bits = wait_table_bits(zone->wait_table_size);
+ zone->wait_table = (wait_queue_head_t *)
+ alloc_bootmem_node(pgdat, zone->wait_table_size
+ * sizeof(wait_queue_head_t));
+
+ for(i = 0; i < zone->wait_table_size; ++i)
+ init_waitqueue_head(zone->wait_table + i);
+}
+
+/*
+ * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist
+ * to the value high for the pageset p.
+ */
+static void setup_pagelist_highmark(struct per_cpu_pageset *p,
+ unsigned long high)
+{
+ struct per_cpu_pages *pcp;
+
+ pcp = &p->pcp[0]; /* hot list */
+ pcp->high = high;
+ pcp->batch = max(1UL, high/4);
+ if ((high/4) > (PAGE_SHIFT * 8))
+ pcp->batch = PAGE_SHIFT * 8;
+}
+
+/*
+ * percpu_pagelist_fraction - changes the pcp->high for each zone on each
+ * cpu. It is the fraction of total pages in each zone that a hot per cpu pagelist
+ * can have before it gets flushed back to buddy allocator.
+ */
+int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
+ struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+ struct zone *zone;
+ unsigned int cpu;
+ int ret;
+
+ ret = proc_dointvec_minmax(table, write, file, buffer, length, ppos);
+ if (!write || (ret == -EINVAL))
+ return ret;
+ for_each_zone(zone) {
+ for_each_online_cpu(cpu) {
+ unsigned long high;
+ high = zone->present_pages / percpu_pagelist_fraction;
+ setup_pagelist_highmark(zone_pcp(zone, cpu), high);
+ }
+ }
+ return 0;
+}
+
+static int __cpuinit zone_batchsize(struct zone *zone)
+{
+ int batch;
+
+ /*
+ * The per-cpu-pages pools are set to around 1000th of the
+ * size of the zone. But no more than 1/2 of a meg.
+ *
+ * OK, so we don't know how big the cache is. So guess.
+ */
+ batch = zone->present_pages / 1024;
+ if (batch * PAGE_SIZE > 512 * 1024)
+ batch = (512 * 1024) / PAGE_SIZE;
+ batch /= 4; /* We effectively *= 4 below */
+ if (batch < 1)
+ batch = 1;
+
+ /*
+ * Clamp the batch to a 2^n - 1 value. Having a power
+ * of 2 value was found to be more likely to have
+ * suboptimal cache aliasing properties in some cases.
+ *
+ * For example if 2 tasks are alternately allocating
+ * batches of pages, one task can end up with a lot
+ * of pages of one half of the possible page colors
+ * and the other with pages of the other colors.
+ */
+ batch = (1 << (fls(batch + batch/2)-1)) - 1;
+
+ return batch;
+}
+
+inline void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
+{
+ struct per_cpu_pages *pcp;
+
+ memset(p, 0, sizeof(*p));
+
+ pcp = &p->pcp[0]; /* hot */
+ pcp->count = 0;
+ pcp->high = 6 * batch;
+ pcp->batch = max(1UL, 1 * batch);
+ INIT_LIST_HEAD(&pcp->list);
+
+ pcp = &p->pcp[1]; /* cold*/
+ pcp->count = 0;
+ pcp->high = 2 * batch;
+ pcp->batch = max(1UL, batch/2);
+ INIT_LIST_HEAD(&pcp->list);
+}
+
+#ifdef CONFIG_NUMA
+/*
+ * Boot pageset table. One per cpu which is going to be used for all
+ * zones and all nodes. The parameters will be set in such a way
+ * that an item put on a list will immediately be handed over to
+ * the buddy list. This is safe since pageset manipulation is done
+ * with interrupts disabled.
+ *
+ * Some NUMA counter updates may also be caught by the boot pagesets.
+ *
+ * The boot_pagesets must be kept even after bootup is complete for
+ * unused processors and/or zones. They do play a role for bootstrapping
+ * hotplugged processors.
+ *
+ * zoneinfo_show() and maybe other functions do
+ * not check if the processor is online before following the pageset pointer.
+ * Other parts of the kernel may not check if the zone is available.
+ */
+static struct per_cpu_pageset boot_pageset[NR_CPUS];
+
+/*
+ * Dynamically allocate memory for the
+ * per cpu pageset array in struct zone.
+ */
+static int __cpuinit process_zones(int cpu)
+{
+ struct zone *zone, *dzone;
+
+ for_each_zone(zone) {
+
+ zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
+ GFP_KERNEL, cpu_to_node(cpu));
+ if (!zone_pcp(zone, cpu))
+ goto bad;
+
+ setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
+
+ if (percpu_pagelist_fraction)
+ setup_pagelist_highmark(zone_pcp(zone, cpu),
+ (zone->present_pages / percpu_pagelist_fraction));
+ }
+
+ return 0;
+bad:
+ for_each_zone(dzone) {
+ if (dzone == zone)
+ break;
+ kfree(zone_pcp(dzone, cpu));
+ zone_pcp(dzone, cpu) = NULL;
+ }
+ return -ENOMEM;
+}
+
+static inline void free_zone_pagesets(int cpu)
+{
+ struct zone *zone;
+
+ for_each_zone(zone) {
+ struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
+
+ zone_pcp(zone, cpu) = NULL;
+ kfree(pset);
+ }
+}
+
+static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
+ unsigned long action,
+ void *hcpu)
+{
+ int cpu = (long)hcpu;
+ int ret = NOTIFY_OK;
+
+ switch (action) {
+ case CPU_UP_PREPARE:
+ if (process_zones(cpu))
+ ret = NOTIFY_BAD;
+ break;
+ case CPU_UP_CANCELED:
+ case CPU_DEAD:
+ free_zone_pagesets(cpu);
+ break;
+ default:
+ break;
+ }
+ return ret;
+}
+
+static struct notifier_block pageset_notifier =
+ { &pageset_cpuup_callback, NULL, 0 };
+
+void __init setup_per_cpu_pageset(void)
+{
+ int err;
+
+ /* Initialize per_cpu_pageset for cpu 0.
+ * A cpuup callback will do this for every cpu
+ * as it comes online
+ */
+ err = process_zones(smp_processor_id());
+ BUG_ON(err);
+ register_cpu_notifier(&pageset_notifier);
+}
+#endif
+
+static __meminit void zone_pcp_init(struct zone *zone)
+{
+ int cpu;
+ unsigned long batch = zone_batchsize(zone);
+
+ for (cpu = 0; cpu < NR_CPUS; cpu++) {
+#ifdef CONFIG_NUMA
+ /* Early boot. Slab allocator not functional yet */
+ zone_pcp(zone, cpu) = &boot_pageset[cpu];
+ setup_pageset(&boot_pageset[cpu],0);
+#else
+ setup_pageset(zone_pcp(zone,cpu), batch);
+#endif
+ }
+ if (zone->present_pages)
+ printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%lu\n",
+ zone->name, zone->present_pages, batch);
+}
+
+static __meminit void init_currently_empty_zone(struct zone *zone,
+ unsigned long zone_start_pfn, unsigned long size)
+{
+ struct pglist_data *pgdat = zone->zone_pgdat;
+
+ zone_wait_table_init(zone, size);
+ pgdat->nr_zones = zone_idx(zone) + 1;
+
+ zone->zone_start_pfn = zone_start_pfn;
+
+ memmap_init(size, pgdat->node_id, zone_idx(zone), zone_start_pfn);
+
+ zone_init_free_lists(pgdat, zone, zone->spanned_pages);
+}
+
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+static int __init first_active_region_index_in_nid(int nid)
+{
+ int i;
+ for (i = 0; early_node_map[i].end_pfn; i++) {
+ if (early_node_map[i].nid == nid)
+ return i;
+ }
+
+ return MAX_ACTIVE_REGIONS;
+}
+
+static int __init next_active_region_index_in_nid(unsigned int index, int nid)
+{
+ for (index = index + 1; early_node_map[index].end_pfn; index++) {
+ if (early_node_map[index].nid == nid)
+ return index;
+ }
+
+ return MAX_ACTIVE_REGIONS;
+}
+
+#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
+int __init early_pfn_to_nid(unsigned long pfn)
+{
+ int i;
+
+ for (i = 0; early_node_map[i].end_pfn; i++) {
+ unsigned long start_pfn = early_node_map[i].start_pfn;
+ unsigned long end_pfn = early_node_map[i].end_pfn;
+
+ if ((start_pfn <= pfn) && (pfn < end_pfn))
+ return early_node_map[i].nid;
+ }
+
+ return -1;
+}
+#endif
+
+#define for_each_active_range_index_in_nid(i, nid) \
+ for (i = first_active_region_index_in_nid(nid); \
+ i != MAX_ACTIVE_REGIONS; \
+ i = next_active_region_index_in_nid(i, nid))
+
+void __init free_bootmem_with_active_regions(int nid,
+ unsigned long max_low_pfn)
+{
+ unsigned int i;
+ for_each_active_range_index_in_nid(i, nid) {
+ unsigned long size_pages = 0;
+ unsigned long end_pfn = early_node_map[i].end_pfn;
+ if (early_node_map[i].start_pfn >= max_low_pfn)
+ continue;
+
+ if (end_pfn > max_low_pfn)
+ end_pfn = max_low_pfn;
+
+ size_pages = end_pfn - early_node_map[i].start_pfn;
+ free_bootmem_node(NODE_DATA(early_node_map[i].nid),
+ PFN_PHYS(early_node_map[i].start_pfn),
+ PFN_PHYS(size_pages));
+ }
+}
+
+void __init memory_present_with_active_regions(int nid)
+{
+ unsigned int i;
+ for_each_active_range_index_in_nid(i, nid)
+ memory_present(early_node_map[i].nid,
+ early_node_map[i].start_pfn,
+ early_node_map[i].end_pfn);
+}
+
+void __init get_pfn_range_for_nid(unsigned int nid,
+ unsigned long *start_pfn, unsigned long *end_pfn)
+{
+ unsigned int i;
+ *start_pfn = -1UL;
+ *end_pfn = 0;
+
+ for_each_active_range_index_in_nid(i, nid) {
+ if (early_node_map[i].start_pfn < *start_pfn)
+ *start_pfn = early_node_map[i].start_pfn;
+
+ if (early_node_map[i].end_pfn > *end_pfn)
+ *end_pfn = early_node_map[i].end_pfn;
+ }
+
+ if (*start_pfn == -1UL) {
+ printk(KERN_WARNING "Node %u active with no memory\n", nid);
+ *start_pfn = 0;
+ }
+}
+
+unsigned long __init zone_present_pages_in_node(int nid,
+ unsigned long zone_type,
+ unsigned long *ignored)
+{
+ unsigned long node_start_pfn, node_end_pfn;
+ unsigned long zone_start_pfn, zone_end_pfn;
+
+ /* Get the start and end of the node and zone */
+ get_pfn_range_for_nid(nid, &node_start_pfn, &node_end_pfn);
+ zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
+ zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
+
+ /* Check that this node has pages within the zone's required range */
+ if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn)
+ return 0;
+
+ /* Move the zone boundaries inside the node if necessary */
+ if (zone_end_pfn > node_end_pfn)
+ zone_end_pfn = node_end_pfn;
+ if (zone_start_pfn < node_start_pfn)
+ zone_start_pfn = node_start_pfn;
+
+ /* Return the spanned pages */
+ return zone_end_pfn - zone_start_pfn;
+}
+
+static inline int __init pfn_range_in_zone(unsigned long start_pfn,
+ unsigned long end_pfn,
+ unsigned long zone_type)
+{
+ if (start_pfn < arch_zone_lowest_possible_pfn[zone_type])
+ return 0;
+
+ if (start_pfn >= arch_zone_highest_possible_pfn[zone_type])
+ return 0;
+
+ if (end_pfn < arch_zone_lowest_possible_pfn[zone_type])
+ return 0;
+
+ if (end_pfn >= arch_zone_highest_possible_pfn[zone_type])
+ return 0;
+
+ return 1;
+}
+
+unsigned long __init zone_absent_pages_in_node(int nid,
+ unsigned long zone_type,
+ unsigned long *ignored)
+{
+ int i = 0;
+ unsigned long prev_end_pfn = 0, hole_pages = 0;
+ unsigned long start_pfn;
+
+ /* Find the end_pfn of the first active range of pfns in the node */
+ i = first_active_region_index_in_nid(nid);
+ prev_end_pfn = early_node_map[i].start_pfn;
+
+ /* Find all holes for the node */
+ for (; i != MAX_ACTIVE_REGIONS;
+ i = next_active_region_index_in_nid(i, nid)) {
+
+ /* Increase the hole size if the hole is within the zone */
+ start_pfn = early_node_map[i].start_pfn;
+ if (pfn_range_in_zone(prev_end_pfn, start_pfn, zone_type)) {
+ BUG_ON(prev_end_pfn > start_pfn);
+ hole_pages += start_pfn - prev_end_pfn;
+ }
+
+ prev_end_pfn = early_node_map[i].end_pfn;
+ }
+
+ return hole_pages;
+}
+#else
+static inline unsigned long zone_present_pages_in_node(int nid,
+ unsigned long zone_type,
+ unsigned long *zones_size)
+{
+ return zones_size[zone_type];
+}
+
+static inline unsigned long zone_absent_pages_in_node(int nid,
+ unsigned long zone_type,
+ unsigned long *zholes_size)
+{
+ if (!zholes_size)
+ return 0;
+
+ return zholes_size[zone_type];
+}
+#endif
+
+static void __init calculate_node_totalpages(struct pglist_data *pgdat,
+ unsigned long *zones_size, unsigned long *zholes_size)
+{
+ unsigned long realtotalpages, totalpages = 0;
+ int i;
+
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ totalpages += zone_present_pages_in_node(pgdat->node_id, i,
+ zones_size);
+ }
+ pgdat->node_spanned_pages = totalpages;
+
+ realtotalpages = totalpages;
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ realtotalpages -=
+ zone_absent_pages_in_node(pgdat->node_id, i, zholes_size);
+ }
+ pgdat->node_present_pages = realtotalpages;
+ printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id,
+ realtotalpages);
+}
+
+/*
+ * Set up the zone data structures:
+ * - mark all pages reserved
+ * - mark all memory queues empty
+ * - clear the memory bitmaps
+ */
+static void __init free_area_init_core(struct pglist_data *pgdat,
+ unsigned long *zones_size, unsigned long *zholes_size)
+{
+ unsigned long j;
+ int nid = pgdat->node_id;
+ unsigned long zone_start_pfn = pgdat->node_start_pfn;
+
+ pgdat_resize_init(pgdat);
+ pgdat->nr_zones = 0;
+ init_waitqueue_head(&pgdat->kswapd_wait);
+ pgdat->kswapd_max_order = 0;
+
+ for (j = 0; j < MAX_NR_ZONES; j++) {
+ struct zone *zone = pgdat->node_zones + j;
+ unsigned long size, realsize;
+
+ size = zone_present_pages_in_node(nid, j, zones_size);
+ realsize = size - zone_absent_pages_in_node(nid, j,
+ zholes_size);
+ if (j < ZONE_HIGHMEM)
+ nr_kernel_pages += realsize;
+ nr_all_pages += realsize;
+
+ zone->spanned_pages = size;
+ zone->present_pages = realsize;
+ zone->name = zone_names[j];
+ spin_lock_init(&zone->lock);
+ spin_lock_init(&zone->lru_lock);
+ zone_seqlock_init(zone);
+ zone->zone_pgdat = pgdat;
+ zone->free_pages = 0;
+
+ zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
+
+ zone_pcp_init(zone);
+ INIT_LIST_HEAD(&zone->active_list);
+ INIT_LIST_HEAD(&zone->inactive_list);
+ zone->nr_scan_active = 0;
+ zone->nr_scan_inactive = 0;
+ zone->nr_active = 0;
+ zone->nr_inactive = 0;
+ atomic_set(&zone->reclaim_in_progress, 0);
+ if (!size)
+ continue;
+
+ zonetable_add(zone, nid, j, zone_start_pfn, size);
+ init_currently_empty_zone(zone, zone_start_pfn, size);
+ zone_start_pfn += size;
+ }
+}
+
+static void __init alloc_node_mem_map(struct pglist_data *pgdat)
+{
+ /* Skip empty nodes */
+ if (!pgdat->node_spanned_pages)
+ return;
+
+#ifdef CONFIG_FLAT_NODE_MEM_MAP
+ /* ia64 gets its own node_mem_map, before this, without bootmem */
+ if (!pgdat->node_mem_map) {
+ unsigned long size;
+ struct page *map;
+
+ size = (pgdat->node_spanned_pages + 1) * sizeof(struct page);
+ map = alloc_remap(pgdat->node_id, size);
+ if (!map)
+ map = alloc_bootmem_node(pgdat, size);
+ pgdat->node_mem_map = map;
+ }
+#ifdef CONFIG_FLATMEM
+ /*
+ * With no DISCONTIG, the global mem_map is just set as node 0's
+ */
+ if (pgdat == NODE_DATA(0))
+ mem_map = NODE_DATA(0)->node_mem_map;
+#endif
+#endif /* CONFIG_FLAT_NODE_MEM_MAP */
+}
+
+void __init free_area_init_node(int nid, struct pglist_data *pgdat,
+ unsigned long *zones_size, unsigned long node_start_pfn,
+ unsigned long *zholes_size)
+{
+ pgdat->node_id = nid;
+ pgdat->node_start_pfn = node_start_pfn;
+ calculate_node_totalpages(pgdat, zones_size, zholes_size);
+
+ alloc_node_mem_map(pgdat);
+
+ free_area_init_core(pgdat, zones_size, zholes_size);
+}
+
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+void __init add_active_range(unsigned int nid, unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ unsigned int i;
+ unsigned long pages = end_pfn - start_pfn;
+
+ /* Merge with existing active regions if possible */
+ for (i = 0; early_node_map[i].end_pfn; i++) {
+ if (early_node_map[i].nid != nid)
+ continue;
+
+ if (early_node_map[i].end_pfn == start_pfn) {
+ early_node_map[i].end_pfn += pages;
+ return;
+ }
+
+ if (early_node_map[i].start_pfn == (start_pfn + pages)) {
+ early_node_map[i].start_pfn -= pages;
+ return;
+ }
+ }
+
+ /*
+ * Leave last entry NULL so we dont iterate off the end (we use
+ * entry.end_pfn to terminate the walk).
+ */
+ if (i >= MAX_ACTIVE_REGIONS - 1) {
+ printk(KERN_ERR "WARNING: too many memory regions in "
+ "numa code, truncating\n");
+ return;
+ }
+
+ early_node_map[i].nid = nid;
+ early_node_map[i].start_pfn = start_pfn;
+ early_node_map[i].end_pfn = end_pfn;
+}
+
+/* Compare two active node_active_regions */
+static int __init cmp_node_active_region(const void *a, const void *b)
+{
+ struct node_active_region *arange = (struct node_active_region *)a;
+ struct node_active_region *brange = (struct node_active_region *)b;
+
+ /* Done this way to avoid overflows */
+ if (arange->start_pfn > brange->start_pfn)
+ return 1;
+ if (arange->start_pfn < brange->start_pfn)
+ return -1;
+
+ return 0;
+}
+
+/* sort the node_map by start_pfn */
+static void __init sort_node_map(void)
+{
+ size_t num = 0;
+ while (early_node_map[num].end_pfn)
+ num++;
+
+ sort(early_node_map, num, sizeof(struct node_active_region),
+ cmp_node_active_region, NULL);
+}
+
+unsigned long __init find_min_pfn(void)
+{
+ int i;
+ unsigned long min_pfn = -1UL;
+
+ for (i = 0; early_node_map[i].end_pfn; i++) {
+ if (early_node_map[i].start_pfn < min_pfn)
+ min_pfn = early_node_map[i].start_pfn;
+ }
+
+ return min_pfn;
+}
+
+/* Find the lowest pfn in a node. This depends on a sorted early_node_map */
+unsigned long __init find_start_pfn_for_node(unsigned long nid)
+{
+ int i;
+
+ /* Assuming a sorted map, the first range found has the starting pfn */
+ for_each_active_range_index_in_nid(i, nid) {
+ return early_node_map[i].start_pfn;
+ }
+
+ /* nid does not exist in early_node_map */
+ printk(KERN_WARNING "Could not find start_pfn for node %lu\n", nid);
+ return 0;
+}
+
+void __init free_area_init_nodes(unsigned long arch_max_dma_pfn,
+ unsigned long arch_max_dma32_pfn,
+ unsigned long arch_max_low_pfn,
+ unsigned long arch_max_high_pfn)
+{
+ unsigned long nid;
+
+ /* Record where the zone boundaries are */
+ memset(arch_zone_lowest_possible_pfn, 0,
+ sizeof(arch_zone_lowest_possible_pfn));
+ memset(arch_zone_highest_possible_pfn, 0,
+ sizeof(arch_zone_highest_possible_pfn));
+ arch_zone_lowest_possible_pfn[ZONE_DMA] = find_min_pfn();
+ arch_zone_highest_possible_pfn[ZONE_DMA] = arch_max_dma_pfn;
+ arch_zone_lowest_possible_pfn[ZONE_DMA32] = arch_max_dma_pfn;
+ arch_zone_highest_possible_pfn[ZONE_DMA32] = arch_max_dma32_pfn;
+ arch_zone_lowest_possible_pfn[ZONE_NORMAL] = arch_max_dma32_pfn;
+ arch_zone_highest_possible_pfn[ZONE_NORMAL] = arch_max_low_pfn;
+ arch_zone_lowest_possible_pfn[ZONE_HIGHMEM] = arch_max_low_pfn;
+ arch_zone_highest_possible_pfn[ZONE_HIGHMEM] = arch_max_high_pfn;
+
+ /* Regions in the early_node_map can be in any order */
+ sort_node_map();
+
+ for_each_online_node(nid) {
+ pg_data_t *pgdat = NODE_DATA(nid);
+ free_area_init_node(nid, pgdat, NULL,
+ find_start_pfn_for_node(nid), NULL);
+ }
+}
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
+
+
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-105-ia64_use_init_nodes/mm/page_alloc.c linux-2.6.17-rc1-106-breakout_mem_init/mm/page_alloc.c
--- linux-2.6.17-rc1-105-ia64_use_init_nodes/mm/page_alloc.c 2006-04-11 09:33:15.000000000 +0100
+++ linux-2.6.17-rc1-106-breakout_mem_init/mm/page_alloc.c 2006-04-11 09:38:12.000000000 +0100
@@ -37,8 +37,6 @@
#include <linux/nodemask.h>
#include <linux/vmalloc.h>
#include <linux/mempolicy.h>
-#include <linux/sort.h>
-#include <linux/pfn.h>
#include <asm/tlbflush.h>
#include "internal.h"
@@ -54,7 +52,6 @@ EXPORT_SYMBOL(node_possible_map);
unsigned long totalram_pages __read_mostly;
unsigned long totalhigh_pages __read_mostly;
long nr_swap_pages;
-int percpu_pagelist_fraction;
static void __free_pages_ok(struct page *page, unsigned int order);
@@ -80,24 +77,11 @@ EXPORT_SYMBOL(totalram_pages);
struct zone *zone_table[1 << ZONETABLE_SHIFT] __read_mostly;
EXPORT_SYMBOL(zone_table);
-static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
int min_free_kbytes = 1024;
unsigned long __initdata nr_kernel_pages;
unsigned long __initdata nr_all_pages;
-#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
- #ifdef CONFIG_MAX_ACTIVE_REGIONS
- #define MAX_ACTIVE_REGIONS CONFIG_MAX_ACTIVE_REGIONS
- #else
- #define MAX_ACTIVE_REGIONS (MAX_NR_ZONES * MAX_NUMNODES + 1)
- #endif
-
- struct node_active_region __initdata early_node_map[MAX_ACTIVE_REGIONS];
- unsigned long __initdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
- unsigned long __initdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
-#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
-
#ifdef CONFIG_DEBUG_VM
static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
{
@@ -1511,968 +1495,6 @@ void show_free_areas(void)
show_swap_cache_info();
}
-/*
- * Builds allocation fallback zone lists.
- *
- * Add all populated zones of a node to the zonelist.
- */
-static int __init build_zonelists_node(pg_data_t *pgdat,
- struct zonelist *zonelist, int nr_zones, int zone_type)
-{
- struct zone *zone;
-
- BUG_ON(zone_type > ZONE_HIGHMEM);
-
- do {
- zone = pgdat->node_zones + zone_type;
- if (populated_zone(zone)) {
-#ifndef CONFIG_HIGHMEM
- BUG_ON(zone_type > ZONE_NORMAL);
-#endif
- zonelist->zones[nr_zones++] = zone;
- check_highest_zone(zone_type);
- }
- zone_type--;
-
- } while (zone_type >= 0);
- return nr_zones;
-}
-
-static inline int highest_zone(int zone_bits)
-{
- int res = ZONE_NORMAL;
- if (zone_bits & (__force int)__GFP_HIGHMEM)
- res = ZONE_HIGHMEM;
- if (zone_bits & (__force int)__GFP_DMA32)
- res = ZONE_DMA32;
- if (zone_bits & (__force int)__GFP_DMA)
- res = ZONE_DMA;
- return res;
-}
-
-#ifdef CONFIG_NUMA
-#define MAX_NODE_LOAD (num_online_nodes())
-static int __initdata node_load[MAX_NUMNODES];
-/**
- * find_next_best_node - find the next node that should appear in a given node's fallback list
- * @node: node whose fallback list we're appending
- * @used_node_mask: nodemask_t of already used nodes
- *
- * We use a number of factors to determine which is the next node that should
- * appear on a given node's fallback list. The node should not have appeared
- * already in @node's fallback list, and it should be the next closest node
- * according to the distance array (which contains arbitrary distance values
- * from each node to each node in the system), and should also prefer nodes
- * with no CPUs, since presumably they'll have very little allocation pressure
- * on them otherwise.
- * It returns -1 if no node is found.
- */
-static int __init find_next_best_node(int node, nodemask_t *used_node_mask)
-{
- int n, val;
- int min_val = INT_MAX;
- int best_node = -1;
-
- /* Use the local node if we haven't already */
- if (!node_isset(node, *used_node_mask)) {
- node_set(node, *used_node_mask);
- return node;
- }
-
- for_each_online_node(n) {
- cpumask_t tmp;
-
- /* Don't want a node to appear more than once */
- if (node_isset(n, *used_node_mask))
- continue;
-
- /* Use the distance array to find the distance */
- val = node_distance(node, n);
-
- /* Penalize nodes under us ("prefer the next node") */
- val += (n < node);
-
- /* Give preference to headless and unused nodes */
- tmp = node_to_cpumask(n);
- if (!cpus_empty(tmp))
- val += PENALTY_FOR_NODE_WITH_CPUS;
-
- /* Slight preference for less loaded node */
- val *= (MAX_NODE_LOAD*MAX_NUMNODES);
- val += node_load[n];
-
- if (val < min_val) {
- min_val = val;
- best_node = n;
- }
- }
-
- if (best_node >= 0)
- node_set(best_node, *used_node_mask);
-
- return best_node;
-}
-
-static void __init build_zonelists(pg_data_t *pgdat)
-{
- int i, j, k, node, local_node;
- int prev_node, load;
- struct zonelist *zonelist;
- nodemask_t used_mask;
-
- /* initialize zonelists */
- for (i = 0; i < GFP_ZONETYPES; i++) {
- zonelist = pgdat->node_zonelists + i;
- zonelist->zones[0] = NULL;
- }
-
- /* NUMA-aware ordering of nodes */
- local_node = pgdat->node_id;
- load = num_online_nodes();
- prev_node = local_node;
- nodes_clear(used_mask);
- while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
- int distance = node_distance(local_node, node);
-
- /*
- * If another node is sufficiently far away then it is better
- * to reclaim pages in a zone before going off node.
- */
- if (distance > RECLAIM_DISTANCE)
- zone_reclaim_mode = 1;
-
- /*
- * We don't want to pressure a particular node.
- * So adding penalty to the first node in same
- * distance group to make it round-robin.
- */
-
- if (distance != node_distance(local_node, prev_node))
- node_load[node] += load;
- prev_node = node;
- load--;
- for (i = 0; i < GFP_ZONETYPES; i++) {
- zonelist = pgdat->node_zonelists + i;
- for (j = 0; zonelist->zones[j] != NULL; j++);
-
- k = highest_zone(i);
-
- j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
- zonelist->zones[j] = NULL;
- }
- }
-}
-
-#else /* CONFIG_NUMA */
-
-static void __init build_zonelists(pg_data_t *pgdat)
-{
- int i, j, k, node, local_node;
-
- local_node = pgdat->node_id;
- for (i = 0; i < GFP_ZONETYPES; i++) {
- struct zonelist *zonelist;
-
- zonelist = pgdat->node_zonelists + i;
-
- j = 0;
- k = highest_zone(i);
- j = build_zonelists_node(pgdat, zonelist, j, k);
- /*
- * Now we build the zonelist so that it contains the zones
- * of all the other nodes.
- * We don't want to pressure a particular node, so when
- * building the zones for node N, we make sure that the
- * zones coming right after the local ones are those from
- * node N+1 (modulo N)
- */
- for (node = local_node + 1; node < MAX_NUMNODES; node++) {
- if (!node_online(node))
- continue;
- j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
- }
- for (node = 0; node < local_node; node++) {
- if (!node_online(node))
- continue;
- j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
- }
-
- zonelist->zones[j] = NULL;
- }
-}
-
-#endif /* CONFIG_NUMA */
-
-void __init build_all_zonelists(void)
-{
- int i;
-
- for_each_online_node(i)
- build_zonelists(NODE_DATA(i));
- printk("Built %i zonelists\n", num_online_nodes());
- cpuset_init_current_mems_allowed();
-}
-
-/*
- * Helper functions to size the waitqueue hash table.
- * Essentially these want to choose hash table sizes sufficiently
- * large so that collisions trying to wait on pages are rare.
- * But in fact, the number of active page waitqueues on typical
- * systems is ridiculously low, less than 200. So this is even
- * conservative, even though it seems large.
- *
- * The constant PAGES_PER_WAITQUEUE specifies the ratio of pages to
- * waitqueues, i.e. the size of the waitq table given the number of pages.
- */
-#define PAGES_PER_WAITQUEUE 256
-
-static inline unsigned long wait_table_size(unsigned long pages)
-{
- unsigned long size = 1;
-
- pages /= PAGES_PER_WAITQUEUE;
-
- while (size < pages)
- size <<= 1;
-
- /*
- * Once we have dozens or even hundreds of threads sleeping
- * on IO we've got bigger problems than wait queue collision.
- * Limit the size of the wait table to a reasonable size.
- */
- size = min(size, 4096UL);
-
- return max(size, 4UL);
-}
-
-/*
- * This is an integer logarithm so that shifts can be used later
- * to extract the more random high bits from the multiplicative
- * hash function before the remainder is taken.
- */
-static inline unsigned long wait_table_bits(unsigned long size)
-{
- return ffz(~size);
-}
-
-#define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
-
-/*
- * Initially all pages are reserved - free ones are freed
- * up by free_all_bootmem() once the early boot process is
- * done. Non-atomic initialization, single-pass.
- */
-void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
- unsigned long start_pfn)
-{
- struct page *page;
- unsigned long end_pfn = start_pfn + size;
- unsigned long pfn;
-
- for (pfn = start_pfn; pfn < end_pfn; pfn++) {
- if (!early_pfn_valid(pfn))
- continue;
- page = pfn_to_page(pfn);
- set_page_links(page, zone, nid, pfn);
- init_page_count(page);
- reset_page_mapcount(page);
- SetPageReserved(page);
- INIT_LIST_HEAD(&page->lru);
-#ifdef WANT_PAGE_VIRTUAL
- /* The shift won't overflow because ZONE_NORMAL is below 4G. */
- if (!is_highmem_idx(zone))
- set_page_address(page, __va(pfn << PAGE_SHIFT));
-#endif
- }
-}
-
-void zone_init_free_lists(struct pglist_data *pgdat, struct zone *zone,
- unsigned long size)
-{
- int order;
- for (order = 0; order < MAX_ORDER ; order++) {
- INIT_LIST_HEAD(&zone->free_area[order].free_list);
- zone->free_area[order].nr_free = 0;
- }
-}
-
-#define ZONETABLE_INDEX(x, zone_nr) ((x << ZONES_SHIFT) | zone_nr)
-void zonetable_add(struct zone *zone, int nid, int zid, unsigned long pfn,
- unsigned long size)
-{
- unsigned long snum = pfn_to_section_nr(pfn);
- unsigned long end = pfn_to_section_nr(pfn + size);
-
- if (FLAGS_HAS_NODE)
- zone_table[ZONETABLE_INDEX(nid, zid)] = zone;
- else
- for (; snum <= end; snum++)
- zone_table[ZONETABLE_INDEX(snum, zid)] = zone;
-}
-
-#ifndef __HAVE_ARCH_MEMMAP_INIT
-#define memmap_init(size, nid, zone, start_pfn) \
- memmap_init_zone((size), (nid), (zone), (start_pfn))
-#endif
-
-static int __cpuinit zone_batchsize(struct zone *zone)
-{
- int batch;
-
- /*
- * The per-cpu-pages pools are set to around 1000th of the
- * size of the zone. But no more than 1/2 of a meg.
- *
- * OK, so we don't know how big the cache is. So guess.
- */
- batch = zone->present_pages / 1024;
- if (batch * PAGE_SIZE > 512 * 1024)
- batch = (512 * 1024) / PAGE_SIZE;
- batch /= 4; /* We effectively *= 4 below */
- if (batch < 1)
- batch = 1;
-
- /*
- * Clamp the batch to a 2^n - 1 value. Having a power
- * of 2 value was found to be more likely to have
- * suboptimal cache aliasing properties in some cases.
- *
- * For example if 2 tasks are alternately allocating
- * batches of pages, one task can end up with a lot
- * of pages of one half of the possible page colors
- * and the other with pages of the other colors.
- */
- batch = (1 << (fls(batch + batch/2)-1)) - 1;
-
- return batch;
-}
-
-inline void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
-{
- struct per_cpu_pages *pcp;
-
- memset(p, 0, sizeof(*p));
-
- pcp = &p->pcp[0]; /* hot */
- pcp->count = 0;
- pcp->high = 6 * batch;
- pcp->batch = max(1UL, 1 * batch);
- INIT_LIST_HEAD(&pcp->list);
-
- pcp = &p->pcp[1]; /* cold*/
- pcp->count = 0;
- pcp->high = 2 * batch;
- pcp->batch = max(1UL, batch/2);
- INIT_LIST_HEAD(&pcp->list);
-}
-
-/*
- * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist
- * to the value high for the pageset p.
- */
-
-static void setup_pagelist_highmark(struct per_cpu_pageset *p,
- unsigned long high)
-{
- struct per_cpu_pages *pcp;
-
- pcp = &p->pcp[0]; /* hot list */
- pcp->high = high;
- pcp->batch = max(1UL, high/4);
- if ((high/4) > (PAGE_SHIFT * 8))
- pcp->batch = PAGE_SHIFT * 8;
-}
-
-
-#ifdef CONFIG_NUMA
-/*
- * Boot pageset table. One per cpu which is going to be used for all
- * zones and all nodes. The parameters will be set in such a way
- * that an item put on a list will immediately be handed over to
- * the buddy list. This is safe since pageset manipulation is done
- * with interrupts disabled.
- *
- * Some NUMA counter updates may also be caught by the boot pagesets.
- *
- * The boot_pagesets must be kept even after bootup is complete for
- * unused processors and/or zones. They do play a role for bootstrapping
- * hotplugged processors.
- *
- * zoneinfo_show() and maybe other functions do
- * not check if the processor is online before following the pageset pointer.
- * Other parts of the kernel may not check if the zone is available.
- */
-static struct per_cpu_pageset boot_pageset[NR_CPUS];
-
-/*
- * Dynamically allocate memory for the
- * per cpu pageset array in struct zone.
- */
-static int __cpuinit process_zones(int cpu)
-{
- struct zone *zone, *dzone;
-
- for_each_zone(zone) {
-
- zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
- GFP_KERNEL, cpu_to_node(cpu));
- if (!zone_pcp(zone, cpu))
- goto bad;
-
- setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
-
- if (percpu_pagelist_fraction)
- setup_pagelist_highmark(zone_pcp(zone, cpu),
- (zone->present_pages / percpu_pagelist_fraction));
- }
-
- return 0;
-bad:
- for_each_zone(dzone) {
- if (dzone == zone)
- break;
- kfree(zone_pcp(dzone, cpu));
- zone_pcp(dzone, cpu) = NULL;
- }
- return -ENOMEM;
-}
-
-static inline void free_zone_pagesets(int cpu)
-{
- struct zone *zone;
-
- for_each_zone(zone) {
- struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
-
- zone_pcp(zone, cpu) = NULL;
- kfree(pset);
- }
-}
-
-static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
- unsigned long action,
- void *hcpu)
-{
- int cpu = (long)hcpu;
- int ret = NOTIFY_OK;
-
- switch (action) {
- case CPU_UP_PREPARE:
- if (process_zones(cpu))
- ret = NOTIFY_BAD;
- break;
- case CPU_UP_CANCELED:
- case CPU_DEAD:
- free_zone_pagesets(cpu);
- break;
- default:
- break;
- }
- return ret;
-}
-
-static struct notifier_block pageset_notifier =
- { &pageset_cpuup_callback, NULL, 0 };
-
-void __init setup_per_cpu_pageset(void)
-{
- int err;
-
- /* Initialize per_cpu_pageset for cpu 0.
- * A cpuup callback will do this for every cpu
- * as it comes online
- */
- err = process_zones(smp_processor_id());
- BUG_ON(err);
- register_cpu_notifier(&pageset_notifier);
-}
-
-#endif
-
-static __meminit
-void zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
-{
- int i;
- struct pglist_data *pgdat = zone->zone_pgdat;
-
- /*
- * The per-page waitqueue mechanism uses hashed waitqueues
- * per zone.
- */
- zone->wait_table_size = wait_table_size(zone_size_pages);
- zone->wait_table_bits = wait_table_bits(zone->wait_table_size);
- zone->wait_table = (wait_queue_head_t *)
- alloc_bootmem_node(pgdat, zone->wait_table_size
- * sizeof(wait_queue_head_t));
-
- for(i = 0; i < zone->wait_table_size; ++i)
- init_waitqueue_head(zone->wait_table + i);
-}
-
-static __meminit void zone_pcp_init(struct zone *zone)
-{
- int cpu;
- unsigned long batch = zone_batchsize(zone);
-
- for (cpu = 0; cpu < NR_CPUS; cpu++) {
-#ifdef CONFIG_NUMA
- /* Early boot. Slab allocator not functional yet */
- zone_pcp(zone, cpu) = &boot_pageset[cpu];
- setup_pageset(&boot_pageset[cpu],0);
-#else
- setup_pageset(zone_pcp(zone,cpu), batch);
-#endif
- }
- if (zone->present_pages)
- printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%lu\n",
- zone->name, zone->present_pages, batch);
-}
-
-static __meminit void init_currently_empty_zone(struct zone *zone,
- unsigned long zone_start_pfn, unsigned long size)
-{
- struct pglist_data *pgdat = zone->zone_pgdat;
-
- zone_wait_table_init(zone, size);
- pgdat->nr_zones = zone_idx(zone) + 1;
-
- zone->zone_start_pfn = zone_start_pfn;
-
- memmap_init(size, pgdat->node_id, zone_idx(zone), zone_start_pfn);
-
- zone_init_free_lists(pgdat, zone, zone->spanned_pages);
-}
-
-#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
-static int __init first_active_region_index_in_nid(int nid)
-{
- int i;
- for (i = 0; early_node_map[i].end_pfn; i++) {
- if (early_node_map[i].nid == nid)
- return i;
- }
-
- return MAX_ACTIVE_REGIONS;
-}
-
-static int __init next_active_region_index_in_nid(unsigned int index, int nid)
-{
- for (index = index + 1; early_node_map[index].end_pfn; index++) {
- if (early_node_map[index].nid == nid)
- return index;
- }
-
- return MAX_ACTIVE_REGIONS;
-}
-
-#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
-int __init early_pfn_to_nid(unsigned long pfn)
-{
- int i;
-
- for (i = 0; early_node_map[i].end_pfn; i++) {
- unsigned long start_pfn = early_node_map[i].start_pfn;
- unsigned long end_pfn = early_node_map[i].end_pfn;
-
- if ((start_pfn <= pfn) && (pfn < end_pfn))
- return early_node_map[i].nid;
- }
-
- return -1;
-}
-#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
-
-#define for_each_active_range_index_in_nid(i, nid) \
- for (i = first_active_region_index_in_nid(nid); \
- i != MAX_ACTIVE_REGIONS; \
- i = next_active_region_index_in_nid(i, nid))
-
-void __init free_bootmem_with_active_regions(int nid,
- unsigned long max_low_pfn)
-{
- unsigned int i;
- for_each_active_range_index_in_nid(i, nid) {
- unsigned long size_pages = 0;
- unsigned long end_pfn = early_node_map[i].end_pfn;
- if (early_node_map[i].start_pfn >= max_low_pfn)
- continue;
-
- if (end_pfn > max_low_pfn)
- end_pfn = max_low_pfn;
-
- size_pages = end_pfn - early_node_map[i].start_pfn;
- free_bootmem_node(NODE_DATA(early_node_map[i].nid),
- PFN_PHYS(early_node_map[i].start_pfn),
- PFN_PHYS(size_pages));
- }
-}
-
-void __init memory_present_with_active_regions(int nid)
-{
- unsigned int i;
- for_each_active_range_index_in_nid(i, nid)
- memory_present(early_node_map[i].nid,
- early_node_map[i].start_pfn,
- early_node_map[i].end_pfn);
-}
-
-void __init get_pfn_range_for_nid(unsigned int nid,
- unsigned long *start_pfn, unsigned long *end_pfn)
-{
- unsigned int i;
- *start_pfn = -1UL;
- *end_pfn = 0;
-
- for_each_active_range_index_in_nid(i, nid) {
- if (early_node_map[i].start_pfn < *start_pfn)
- *start_pfn = early_node_map[i].start_pfn;
-
- if (early_node_map[i].end_pfn > *end_pfn)
- *end_pfn = early_node_map[i].end_pfn;
- }
-
- if (*start_pfn == -1UL) {
- printk(KERN_WARNING "Node %u active with no memory\n", nid);
- *start_pfn = 0;
- }
-}
-
-unsigned long __init zone_present_pages_in_node(int nid,
- unsigned long zone_type,
- unsigned long *ignored)
-{
- unsigned long node_start_pfn, node_end_pfn;
- unsigned long zone_start_pfn, zone_end_pfn;
-
- /* Get the start and end of the node and zone */
- get_pfn_range_for_nid(nid, &node_start_pfn, &node_end_pfn);
- zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
- zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
-
- /* Check that this node has pages within the zone's required range */
- if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn)
- return 0;
-
- /* Move the zone boundaries inside the node if necessary */
- if (zone_end_pfn > node_end_pfn)
- zone_end_pfn = node_end_pfn;
- if (zone_start_pfn < node_start_pfn)
- zone_start_pfn = node_start_pfn;
-
- /* Return the spanned pages */
- return zone_end_pfn - zone_start_pfn;
-}
-
-static inline int __init pfn_range_in_zone(unsigned long start_pfn,
- unsigned long end_pfn,
- unsigned long zone_type)
-{
- if (start_pfn < arch_zone_lowest_possible_pfn[zone_type])
- return 0;
-
- if (start_pfn >= arch_zone_highest_possible_pfn[zone_type])
- return 0;
-
- if (end_pfn < arch_zone_lowest_possible_pfn[zone_type])
- return 0;
-
- if (end_pfn >= arch_zone_highest_possible_pfn[zone_type])
- return 0;
-
- return 1;
-}
-
-unsigned long __init zone_absent_pages_in_node(int nid,
- unsigned long zone_type,
- unsigned long *ignored)
-{
- int i = 0;
- unsigned long prev_end_pfn = 0, hole_pages = 0;
- unsigned long start_pfn;
-
- /* Find the end_pfn of the first active range of pfns in the node */
- i = first_active_region_index_in_nid(nid);
- prev_end_pfn = early_node_map[i].start_pfn;
-
- /* Find all holes for the node */
- for (; i != MAX_ACTIVE_REGIONS;
- i = next_active_region_index_in_nid(i, nid)) {
-
- /* Increase the hole size if the hole is within the zone */
- start_pfn = early_node_map[i].start_pfn;
- if (pfn_range_in_zone(prev_end_pfn, start_pfn, zone_type)) {
- BUG_ON(prev_end_pfn > start_pfn);
- hole_pages += start_pfn - prev_end_pfn;
- }
-
- prev_end_pfn = early_node_map[i].end_pfn;
- }
-
- return hole_pages;
-}
-#else
-static inline unsigned long zone_present_pages_in_node(int nid,
- unsigned long zone_type,
- unsigned long *zones_size)
-{
- return zones_size[zone_type];
-}
-
-static inline unsigned long zone_absent_pages_in_node(int nid,
- unsigned long zone_type,
- unsigned long *zholes_size)
-{
- if (!zholes_size)
- return 0;
-
- return zholes_size[zone_type];
-}
-#endif
-
-static void __init calculate_node_totalpages(struct pglist_data *pgdat,
- unsigned long *zones_size, unsigned long *zholes_size)
-{
- unsigned long realtotalpages, totalpages = 0;
- int i;
-
- for (i = 0; i < MAX_NR_ZONES; i++) {
- totalpages += zone_present_pages_in_node(pgdat->node_id, i,
- zones_size);
- }
- pgdat->node_spanned_pages = totalpages;
-
- realtotalpages = totalpages;
- for (i = 0; i < MAX_NR_ZONES; i++) {
- realtotalpages -=
- zone_absent_pages_in_node(pgdat->node_id, i, zholes_size);
- }
- pgdat->node_present_pages = realtotalpages;
- printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id,
- realtotalpages);
-}
-
-/*
- * Set up the zone data structures:
- * - mark all pages reserved
- * - mark all memory queues empty
- * - clear the memory bitmaps
- */
-static void __init free_area_init_core(struct pglist_data *pgdat,
- unsigned long *zones_size, unsigned long *zholes_size)
-{
- unsigned long j;
- int nid = pgdat->node_id;
- unsigned long zone_start_pfn = pgdat->node_start_pfn;
-
- pgdat_resize_init(pgdat);
- pgdat->nr_zones = 0;
- init_waitqueue_head(&pgdat->kswapd_wait);
- pgdat->kswapd_max_order = 0;
-
- for (j = 0; j < MAX_NR_ZONES; j++) {
- struct zone *zone = pgdat->node_zones + j;
- unsigned long size, realsize;
-
- size = zone_present_pages_in_node(nid, j, zones_size);
- realsize = size - zone_absent_pages_in_node(nid, j,
- zholes_size);
- if (j < ZONE_HIGHMEM)
- nr_kernel_pages += realsize;
- nr_all_pages += realsize;
-
- zone->spanned_pages = size;
- zone->present_pages = realsize;
- zone->name = zone_names[j];
- spin_lock_init(&zone->lock);
- spin_lock_init(&zone->lru_lock);
- zone_seqlock_init(zone);
- zone->zone_pgdat = pgdat;
- zone->free_pages = 0;
-
- zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
-
- zone_pcp_init(zone);
- INIT_LIST_HEAD(&zone->active_list);
- INIT_LIST_HEAD(&zone->inactive_list);
- zone->nr_scan_active = 0;
- zone->nr_scan_inactive = 0;
- zone->nr_active = 0;
- zone->nr_inactive = 0;
- atomic_set(&zone->reclaim_in_progress, 0);
- if (!size)
- continue;
-
- zonetable_add(zone, nid, j, zone_start_pfn, size);
- init_currently_empty_zone(zone, zone_start_pfn, size);
- zone_start_pfn += size;
- }
-}
-
-static void __init alloc_node_mem_map(struct pglist_data *pgdat)
-{
- /* Skip empty nodes */
- if (!pgdat->node_spanned_pages)
- return;
-
-#ifdef CONFIG_FLAT_NODE_MEM_MAP
- /* ia64 gets its own node_mem_map, before this, without bootmem */
- if (!pgdat->node_mem_map) {
- unsigned long size;
- struct page *map;
-
- size = (pgdat->node_spanned_pages + 1) * sizeof(struct page);
- map = alloc_remap(pgdat->node_id, size);
- if (!map)
- map = alloc_bootmem_node(pgdat, size);
- pgdat->node_mem_map = map;
- }
-#ifdef CONFIG_FLATMEM
- /*
- * With no DISCONTIG, the global mem_map is just set as node 0's
- */
- if (pgdat == NODE_DATA(0))
- mem_map = NODE_DATA(0)->node_mem_map;
-#endif
-#endif /* CONFIG_FLAT_NODE_MEM_MAP */
-}
-
-void __init free_area_init_node(int nid, struct pglist_data *pgdat,
- unsigned long *zones_size, unsigned long node_start_pfn,
- unsigned long *zholes_size)
-{
- pgdat->node_id = nid;
- pgdat->node_start_pfn = node_start_pfn;
- calculate_node_totalpages(pgdat, zones_size, zholes_size);
-
- alloc_node_mem_map(pgdat);
-
- free_area_init_core(pgdat, zones_size, zholes_size);
-}
-
-#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
-void __init add_active_range(unsigned int nid, unsigned long start_pfn,
- unsigned long end_pfn)
-{
- unsigned int i;
- unsigned long pages = end_pfn - start_pfn;
-
- /* Merge with existing active regions if possible */
- for (i = 0; early_node_map[i].end_pfn; i++) {
- if (early_node_map[i].nid != nid)
- continue;
-
- if (early_node_map[i].end_pfn == start_pfn) {
- early_node_map[i].end_pfn += pages;
- return;
- }
-
- if (early_node_map[i].start_pfn == (start_pfn + pages)) {
- early_node_map[i].start_pfn -= pages;
- return;
- }
- }
-
- /*
- * Leave last entry NULL so we dont iterate off the end (we use
- * entry.end_pfn to terminate the walk).
- */
- if (i >= MAX_ACTIVE_REGIONS - 1) {
- printk(KERN_ERR "WARNING: too many memory regions in "
- "numa code, truncating\n");
- return;
- }
-
- early_node_map[i].nid = nid;
- early_node_map[i].start_pfn = start_pfn;
- early_node_map[i].end_pfn = end_pfn;
-}
-
-/* Compare two active node_active_regions */
-static int __init cmp_node_active_region(const void *a, const void *b)
-{
- struct node_active_region *arange = (struct node_active_region *)a;
- struct node_active_region *brange = (struct node_active_region *)b;
-
- /* Done this way to avoid overflows */
- if (arange->start_pfn > brange->start_pfn)
- return 1;
- if (arange->start_pfn < brange->start_pfn)
- return -1;
-
- return 0;
-}
-
-/* sort the node_map by start_pfn */
-static void __init sort_node_map(void)
-{
- size_t num = 0;
- while (early_node_map[num].end_pfn)
- num++;
-
- sort(early_node_map, num, sizeof(struct node_active_region),
- cmp_node_active_region, NULL);
-}
-
-unsigned long __init find_min_pfn(void)
-{
- int i;
- unsigned long min_pfn = -1UL;
-
- for (i = 0; early_node_map[i].end_pfn; i++) {
- if (early_node_map[i].start_pfn < min_pfn)
- min_pfn = early_node_map[i].start_pfn;
- }
-
- return min_pfn;
-}
-
-/* Find the lowest pfn in a node. This depends on a sorted early_node_map */
-unsigned long __init find_start_pfn_for_node(unsigned long nid)
-{
- int i;
-
- /* Assuming a sorted map, the first range found has the starting pfn */
- for_each_active_range_index_in_nid(i, nid) {
- return early_node_map[i].start_pfn;
- }
-
- /* nid does not exist in early_node_map */
- printk(KERN_WARNING "Could not find start_pfn for node %lu\n", nid);
- return 0;
-}
-
-void __init free_area_init_nodes(unsigned long arch_max_dma_pfn,
- unsigned long arch_max_dma32_pfn,
- unsigned long arch_max_low_pfn,
- unsigned long arch_max_high_pfn)
-{
- unsigned long nid;
-
- /* Record where the zone boundaries are */
- memset(arch_zone_lowest_possible_pfn, 0,
- sizeof(arch_zone_lowest_possible_pfn));
- memset(arch_zone_highest_possible_pfn, 0,
- sizeof(arch_zone_highest_possible_pfn));
- arch_zone_lowest_possible_pfn[ZONE_DMA] = find_min_pfn();
- arch_zone_highest_possible_pfn[ZONE_DMA] = arch_max_dma_pfn;
- arch_zone_lowest_possible_pfn[ZONE_DMA32] = arch_max_dma_pfn;
- arch_zone_highest_possible_pfn[ZONE_DMA32] = arch_max_dma32_pfn;
- arch_zone_lowest_possible_pfn[ZONE_NORMAL] = arch_max_dma32_pfn;
- arch_zone_highest_possible_pfn[ZONE_NORMAL] = arch_max_low_pfn;
- arch_zone_lowest_possible_pfn[ZONE_HIGHMEM] = arch_max_low_pfn;
- arch_zone_highest_possible_pfn[ZONE_HIGHMEM] = arch_max_high_pfn;
-
- /* Regions in the early_node_map can be in any order */
- sort_node_map();
-
- for_each_online_node(nid) {
- pg_data_t *pgdat = NODE_DATA(nid);
- free_area_init_node(nid, pgdat, NULL,
- find_start_pfn_for_node(nid), NULL);
- }
-}
-#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
-
#ifndef CONFIG_NEED_MULTIPLE_NODES
static bootmem_data_t contig_bootmem_data;
struct pglist_data contig_page_data = { .bdata = &contig_bootmem_data };
@@ -2955,32 +1977,6 @@ int lowmem_reserve_ratio_sysctl_handler(
return 0;
}
-/*
- * percpu_pagelist_fraction - changes the pcp->high for each zone on each
- * cpu. It is the fraction of total pages in each zone that a hot per cpu pagelist
- * can have before it gets flushed back to buddy allocator.
- */
-
-int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
- struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
-{
- struct zone *zone;
- unsigned int cpu;
- int ret;
-
- ret = proc_dointvec_minmax(table, write, file, buffer, length, ppos);
- if (!write || (ret == -EINVAL))
- return ret;
- for_each_zone(zone) {
- for_each_online_cpu(cpu) {
- unsigned long high;
- high = zone->present_pages / percpu_pagelist_fraction;
- setup_pagelist_highmark(zone_pcp(zone, cpu), high);
- }
- }
- return 0;
-}
-
__initdata int hashdist = HASHDIST_DEFAULT;
#ifdef CONFIG_NUMA
^ permalink raw reply
* [PATCH 2/6] Have Power use add_active_range() and free_area_init_nodes()
From: Mel Gorman @ 2006-04-11 10:40 UTC (permalink / raw)
To: davej, linuxppc-dev, tony.luck, ak, linux-kernel; +Cc: Mel Gorman
In-Reply-To: <20060411103946.18153.83059.sendpatchset@skynet>
Size zones and holes in an architecture independent manner for Power.
This has been boot tested on PPC64 with NUMA both enabled and disabled. It
has been compile tested for an older CHRP-based machine.
powerpc/Kconfig | 13 ++--
powerpc/mm/mem.c | 50 +++++----------
powerpc/mm/numa.c | 157 ++++---------------------------------------------
ppc/Kconfig | 3
ppc/mm/init.c | 21 +++---
5 files changed, 54 insertions(+), 190 deletions(-)
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-101-add_free_area_init_nodes/arch/powerpc/Kconfig linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/powerpc/Kconfig
--- linux-2.6.17-rc1-101-add_free_area_init_nodes/arch/powerpc/Kconfig 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/powerpc/Kconfig 2006-04-10 10:53:03.000000000 +0100
@@ -665,11 +665,16 @@ config ARCH_SPARSEMEM_DEFAULT
def_bool y
depends on SMP && PPC_PSERIES
-source "mm/Kconfig"
-
-config HAVE_ARCH_EARLY_PFN_TO_NID
+config ARCH_POPULATES_NODE_MAP
def_bool y
- depends on NEED_MULTIPLE_NODES
+
+# Value of 256 is MAX_LMB_REGIONS * 2
+config MAX_ACTIVE_REGIONS
+ int
+ default 256
+ depends on ARCH_POPULATES_NODE_MAP
+
+source "mm/Kconfig"
config ARCH_MEMORY_PROBE
def_bool y
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-101-add_free_area_init_nodes/arch/powerpc/mm/mem.c linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/powerpc/mm/mem.c
--- linux-2.6.17-rc1-101-add_free_area_init_nodes/arch/powerpc/mm/mem.c 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/powerpc/mm/mem.c 2006-04-10 10:53:03.000000000 +0100
@@ -252,30 +252,26 @@ void __init do_init_bootmem(void)
boot_mapsize = init_bootmem(start >> PAGE_SHIFT, total_pages);
- /* Add all physical memory to the bootmem map, mark each area
- * present.
- */
+ /* Add active regions with valid PFNs */
for (i = 0; i < lmb.memory.cnt; i++) {
unsigned long base = lmb.memory.region[i].base;
unsigned long size = lmb_size_bytes(&lmb.memory, i);
-#ifdef CONFIG_HIGHMEM
- if (base >= total_lowmem)
- continue;
- if (base + size > total_lowmem)
- size = total_lowmem - base;
-#endif
- free_bootmem(base, size);
+ add_active_range(0, base, base + size);
}
+ /* Add all physical memory to the bootmem map, mark each area
+ * present.
+ */
+ free_bootmem_with_active_regions(0, total_lowmem);
+
/* reserve the sections we're already using */
for (i = 0; i < lmb.reserved.cnt; i++)
reserve_bootmem(lmb.reserved.region[i].base,
lmb_size_bytes(&lmb.reserved, i));
/* XXX need to clip this if using highmem? */
- for (i = 0; i < lmb.memory.cnt; i++)
- memory_present(0, lmb_start_pfn(&lmb.memory, i),
- lmb_end_pfn(&lmb.memory, i));
+ memory_present_with_active_regions(0);
+
init_bootmem_done = 1;
}
@@ -284,8 +280,6 @@ void __init do_init_bootmem(void)
*/
void __init paging_init(void)
{
- unsigned long zones_size[MAX_NR_ZONES];
- unsigned long zholes_size[MAX_NR_ZONES];
unsigned long total_ram = lmb_phys_mem_size();
unsigned long top_of_ram = lmb_end_of_DRAM();
@@ -303,26 +297,18 @@ void __init paging_init(void)
top_of_ram, total_ram);
printk(KERN_INFO "Memory hole size: %ldMB\n",
(top_of_ram - total_ram) >> 20);
- /*
- * All pages are DMA-able so we put them all in the DMA zone.
- */
- memset(zones_size, 0, sizeof(zones_size));
- memset(zholes_size, 0, sizeof(zholes_size));
-
- zones_size[ZONE_DMA] = top_of_ram >> PAGE_SHIFT;
- zholes_size[ZONE_DMA] = (top_of_ram - total_ram) >> PAGE_SHIFT;
-
#ifdef CONFIG_HIGHMEM
- zones_size[ZONE_DMA] = total_lowmem >> PAGE_SHIFT;
- zones_size[ZONE_HIGHMEM] = (total_memory - total_lowmem) >> PAGE_SHIFT;
- zholes_size[ZONE_HIGHMEM] = (top_of_ram - total_ram) >> PAGE_SHIFT;
+ free_area_init_nodes(total_lowmem >> PAGE_SHIFT,
+ total_lowmem >> PAGE_SHIFT,
+ total_lowmem >> PAGE_SHIFT,
+ top_of_ram >> PAGE_SHIFT);
#else
- zones_size[ZONE_DMA] = top_of_ram >> PAGE_SHIFT;
- zholes_size[ZONE_DMA] = (top_of_ram - total_ram) >> PAGE_SHIFT;
-#endif /* CONFIG_HIGHMEM */
+ free_area_init_nodes(top_of_ram >> PAGE_SHIFT,
+ top_of_ram >> PAGE_SHIFT,
+ top_of_ram >> PAGE_SHIFT,
+ top_of_ram >> PAGE_SHIFT);
+#endif
- free_area_init_node(0, NODE_DATA(0), zones_size,
- __pa(PAGE_OFFSET) >> PAGE_SHIFT, zholes_size);
}
#endif /* ! CONFIG_NEED_MULTIPLE_NODES */
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-101-add_free_area_init_nodes/arch/powerpc/mm/numa.c linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/powerpc/mm/numa.c
--- linux-2.6.17-rc1-101-add_free_area_init_nodes/arch/powerpc/mm/numa.c 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/powerpc/mm/numa.c 2006-04-10 10:53:03.000000000 +0100
@@ -39,96 +39,6 @@ static bootmem_data_t __initdata plat_no
static int min_common_depth;
static int n_mem_addr_cells, n_mem_size_cells;
-/*
- * We need somewhere to store start/end/node for each region until we have
- * allocated the real node_data structures.
- */
-#define MAX_REGIONS (MAX_LMB_REGIONS*2)
-static struct {
- unsigned long start_pfn;
- unsigned long end_pfn;
- int nid;
-} init_node_data[MAX_REGIONS] __initdata;
-
-int __init early_pfn_to_nid(unsigned long pfn)
-{
- unsigned int i;
-
- for (i = 0; init_node_data[i].end_pfn; i++) {
- unsigned long start_pfn = init_node_data[i].start_pfn;
- unsigned long end_pfn = init_node_data[i].end_pfn;
-
- if ((start_pfn <= pfn) && (pfn < end_pfn))
- return init_node_data[i].nid;
- }
-
- return -1;
-}
-
-void __init add_region(unsigned int nid, unsigned long start_pfn,
- unsigned long pages)
-{
- unsigned int i;
-
- dbg("add_region nid %d start_pfn 0x%lx pages 0x%lx\n",
- nid, start_pfn, pages);
-
- for (i = 0; init_node_data[i].end_pfn; i++) {
- if (init_node_data[i].nid != nid)
- continue;
- if (init_node_data[i].end_pfn == start_pfn) {
- init_node_data[i].end_pfn += pages;
- return;
- }
- if (init_node_data[i].start_pfn == (start_pfn + pages)) {
- init_node_data[i].start_pfn -= pages;
- return;
- }
- }
-
- /*
- * Leave last entry NULL so we dont iterate off the end (we use
- * entry.end_pfn to terminate the walk).
- */
- if (i >= (MAX_REGIONS - 1)) {
- printk(KERN_ERR "WARNING: too many memory regions in "
- "numa code, truncating\n");
- return;
- }
-
- init_node_data[i].start_pfn = start_pfn;
- init_node_data[i].end_pfn = start_pfn + pages;
- init_node_data[i].nid = nid;
-}
-
-/* We assume init_node_data has no overlapping regions */
-void __init get_region(unsigned int nid, unsigned long *start_pfn,
- unsigned long *end_pfn, unsigned long *pages_present)
-{
- unsigned int i;
-
- *start_pfn = -1UL;
- *end_pfn = *pages_present = 0;
-
- for (i = 0; init_node_data[i].end_pfn; i++) {
- if (init_node_data[i].nid != nid)
- continue;
-
- *pages_present += init_node_data[i].end_pfn -
- init_node_data[i].start_pfn;
-
- if (init_node_data[i].start_pfn < *start_pfn)
- *start_pfn = init_node_data[i].start_pfn;
-
- if (init_node_data[i].end_pfn > *end_pfn)
- *end_pfn = init_node_data[i].end_pfn;
- }
-
- /* We didnt find a matching region, return start/end as 0 */
- if (*start_pfn == -1UL)
- *start_pfn = 0;
-}
-
static void __cpuinit map_cpu_to_node(int cpu, int node)
{
numa_cpu_lookup_table[cpu] = node;
@@ -449,8 +359,8 @@ new_range:
continue;
}
- add_region(nid, start >> PAGE_SHIFT,
- size >> PAGE_SHIFT);
+ add_active_range(nid, start >> PAGE_SHIFT,
+ (start >> PAGE_SHIFT) + (size >> PAGE_SHIFT));
if (--ranges)
goto new_range;
@@ -463,6 +373,7 @@ static void __init setup_nonnuma(void)
{
unsigned long top_of_ram = lmb_end_of_DRAM();
unsigned long total_ram = lmb_phys_mem_size();
+ unsigned long start_pfn, end_pfn;
unsigned int i;
printk(KERN_INFO "Top of RAM: 0x%lx, Total RAM: 0x%lx\n",
@@ -470,9 +381,11 @@ static void __init setup_nonnuma(void)
printk(KERN_INFO "Memory hole size: %ldMB\n",
(top_of_ram - total_ram) >> 20);
- for (i = 0; i < lmb.memory.cnt; ++i)
- add_region(0, lmb.memory.region[i].base >> PAGE_SHIFT,
- lmb_size_pages(&lmb.memory, i));
+ for (i = 0; i < lmb.memory.cnt; ++i) {
+ start_pfn = lmb.memory.region[i].base >> PAGE_SHIFT;
+ end_pfn = start_pfn + lmb_size_pages(&lmb.memory, i);
+ add_active_range(0, start_pfn, end_pfn);
+ }
node_set_online(0);
}
@@ -610,11 +523,11 @@ void __init do_init_bootmem(void)
(void *)(unsigned long)boot_cpuid);
for_each_online_node(nid) {
- unsigned long start_pfn, end_pfn, pages_present;
+ unsigned long start_pfn, end_pfn;
unsigned long bootmem_paddr;
unsigned long bootmap_pages;
- get_region(nid, &start_pfn, &end_pfn, &pages_present);
+ get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
/* Allocate the node structure node local if possible */
NODE_DATA(nid) = careful_allocation(nid,
@@ -647,19 +560,7 @@ void __init do_init_bootmem(void)
init_bootmem_node(NODE_DATA(nid), bootmem_paddr >> PAGE_SHIFT,
start_pfn, end_pfn);
- /* Add free regions on this node */
- for (i = 0; init_node_data[i].end_pfn; i++) {
- unsigned long start, end;
-
- if (init_node_data[i].nid != nid)
- continue;
-
- start = init_node_data[i].start_pfn << PAGE_SHIFT;
- end = init_node_data[i].end_pfn << PAGE_SHIFT;
-
- dbg("free_bootmem %lx %lx\n", start, end - start);
- free_bootmem_node(NODE_DATA(nid), start, end - start);
- }
+ free_bootmem_with_active_regions(nid, end_pfn);
/* Mark reserved regions on this node */
for (i = 0; i < lmb.reserved.cnt; i++) {
@@ -690,44 +591,14 @@ void __init do_init_bootmem(void)
}
}
- /* Add regions into sparsemem */
- for (i = 0; init_node_data[i].end_pfn; i++) {
- unsigned long start, end;
-
- if (init_node_data[i].nid != nid)
- continue;
-
- start = init_node_data[i].start_pfn;
- end = init_node_data[i].end_pfn;
-
- memory_present(nid, start, end);
- }
+ memory_present_with_active_regions(nid);
}
}
void __init paging_init(void)
{
- unsigned long zones_size[MAX_NR_ZONES];
- unsigned long zholes_size[MAX_NR_ZONES];
- int nid;
-
- memset(zones_size, 0, sizeof(zones_size));
- memset(zholes_size, 0, sizeof(zholes_size));
-
- for_each_online_node(nid) {
- unsigned long start_pfn, end_pfn, pages_present;
-
- get_region(nid, &start_pfn, &end_pfn, &pages_present);
-
- zones_size[ZONE_DMA] = end_pfn - start_pfn;
- zholes_size[ZONE_DMA] = zones_size[ZONE_DMA] - pages_present;
-
- dbg("free_area_init node %d %lx %lx (hole: %lx)\n", nid,
- zones_size[ZONE_DMA], start_pfn, zholes_size[ZONE_DMA]);
-
- free_area_init_node(nid, NODE_DATA(nid), zones_size, start_pfn,
- zholes_size);
- }
+ unsigned long end_pfn = lmb_end_of_DRAM();
+ free_area_init_nodes(end_pfn, end_pfn, end_pfn, end_pfn);
}
static int __init early_numa(char *p)
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-101-add_free_area_init_nodes/arch/ppc/Kconfig linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/ppc/Kconfig
--- linux-2.6.17-rc1-101-add_free_area_init_nodes/arch/ppc/Kconfig 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/ppc/Kconfig 2006-04-10 10:53:03.000000000 +0100
@@ -949,6 +949,9 @@ config NR_CPUS
config HIGHMEM
bool "High memory support"
+config ARCH_POPULATES_NODE_MAP
+ def_bool y
+
source kernel/Kconfig.hz
source kernel/Kconfig.preempt
source "mm/Kconfig"
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-101-add_free_area_init_nodes/arch/ppc/mm/init.c linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/ppc/mm/init.c
--- linux-2.6.17-rc1-101-add_free_area_init_nodes/arch/ppc/mm/init.c 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/ppc/mm/init.c 2006-04-10 10:53:03.000000000 +0100
@@ -359,8 +359,7 @@ void __init do_init_bootmem(void)
*/
void __init paging_init(void)
{
- unsigned long zones_size[MAX_NR_ZONES], i;
-
+ unsigned long start_pfn, end_pfn;
#ifdef CONFIG_HIGHMEM
map_page(PKMAP_BASE, 0, 0); /* XXX gross */
pkmap_page_table = pte_offset_kernel(pmd_offset(pgd_offset_k
@@ -371,18 +370,18 @@ void __init paging_init(void)
kmap_prot = PAGE_KERNEL;
#endif /* CONFIG_HIGHMEM */
- /*
- * All pages are DMA-able so we put them all in the DMA zone.
- */
- zones_size[ZONE_DMA] = total_lowmem >> PAGE_SHIFT;
- for (i = 1; i < MAX_NR_ZONES; i++)
- zones_size[i] = 0;
+ /* All pages are DMA-able so we put them all in the DMA zone. */
+ start_pfn = __pa(PAGE_OFFSET) >> PAGE_SHIFT;
+ end_pfn = start_pfn + (total_memory >> PAGE_SHIFT);
+ add_active_range(0, start_pfn, end_pfn);
#ifdef CONFIG_HIGHMEM
- zones_size[ZONE_HIGHMEM] = (total_memory - total_lowmem) >> PAGE_SHIFT;
+ free_area_init_nodes(total_lowmem, total_lowmem,
+ total_lowmem, total_memory);
+#else
+ free_area_init_nodes(total_memory, total_memory,
+ total_memory, total_memory);
#endif /* CONFIG_HIGHMEM */
-
- free_area_init(zones_size);
}
void __init mem_init(void)
^ permalink raw reply
* [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
From: Mel Gorman @ 2006-04-11 10:41 UTC (permalink / raw)
To: linuxppc-dev, davej, tony.luck, ak, linux-kernel; +Cc: Mel Gorman
In-Reply-To: <20060411103946.18153.83059.sendpatchset@skynet>
Size zones and holes in an architecture independent manner for x86_64.
This has only been boot tested on an x86_64 with NUMA and SRAT.
arch/x86_64/Kconfig | 3 ++
arch/x86_64/kernel/e820.c | 18 ++++++++++++
arch/x86_64/mm/init.c | 60 +---------------------------------------
arch/x86_64/mm/numa.c | 15 +++++-----
include/asm-x86_64/e820.h | 1
include/asm-x86_64/proto.h | 2 -
6 files changed, 32 insertions(+), 67 deletions(-)
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-103-x86_use_init_nodes/arch/x86_64/Kconfig linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/x86_64/Kconfig
--- linux-2.6.17-rc1-103-x86_use_init_nodes/arch/x86_64/Kconfig 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/x86_64/Kconfig 2006-04-10 10:54:38.000000000 +0100
@@ -73,6 +73,9 @@ config ARCH_MAY_HAVE_PC_FDC
bool
default y
+config ARCH_POPULATES_NODE_MAP
+ def_bool y
+
config DMI
bool
default y
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-103-x86_use_init_nodes/arch/x86_64/kernel/e820.c linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/x86_64/kernel/e820.c
--- linux-2.6.17-rc1-103-x86_use_init_nodes/arch/x86_64/kernel/e820.c 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/x86_64/kernel/e820.c 2006-04-10 10:54:38.000000000 +0100
@@ -220,6 +220,24 @@ e820_hole_size(unsigned long start_pfn,
return ((end - start) - ram) >> PAGE_SHIFT;
}
+/* Walk the e820 map and register active regions */
+unsigned long __init
+e820_register_active_regions(void)
+{
+ int i;
+ unsigned long start_pfn, end_pfn;
+ for (i = 0; i < e820.nr_map; i++) {
+ struct e820entry *ei = &e820.map[i];
+ if (ei->type != E820_RAM)
+ continue;
+
+ start_pfn = round_up(ei->addr, PAGE_SIZE);
+ end_pfn = round_down(ei->addr + ei->size, PAGE_SIZE);
+
+ add_active_range(0, start_pfn, end_pfn);
+ }
+}
+
/*
* Mark e820 reserved areas as busy for the resource manager.
*/
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-103-x86_use_init_nodes/arch/x86_64/mm/init.c linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/x86_64/mm/init.c
--- linux-2.6.17-rc1-103-x86_use_init_nodes/arch/x86_64/mm/init.c 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/x86_64/mm/init.c 2006-04-10 10:54:38.000000000 +0100
@@ -405,69 +405,13 @@ void __cpuinit zap_low_mappings(int cpu)
__flush_tlb_all();
}
-/* Compute zone sizes for the DMA and DMA32 zones in a node. */
-__init void
-size_zones(unsigned long *z, unsigned long *h,
- unsigned long start_pfn, unsigned long end_pfn)
-{
- int i;
- unsigned long w;
-
- for (i = 0; i < MAX_NR_ZONES; i++)
- z[i] = 0;
-
- if (start_pfn < MAX_DMA_PFN)
- z[ZONE_DMA] = MAX_DMA_PFN - start_pfn;
- if (start_pfn < MAX_DMA32_PFN) {
- unsigned long dma32_pfn = MAX_DMA32_PFN;
- if (dma32_pfn > end_pfn)
- dma32_pfn = end_pfn;
- z[ZONE_DMA32] = dma32_pfn - start_pfn;
- }
- z[ZONE_NORMAL] = end_pfn - start_pfn;
-
- /* Remove lower zones from higher ones. */
- w = 0;
- for (i = 0; i < MAX_NR_ZONES; i++) {
- if (z[i])
- z[i] -= w;
- w += z[i];
- }
-
- /* Compute holes */
- w = start_pfn;
- for (i = 0; i < MAX_NR_ZONES; i++) {
- unsigned long s = w;
- w += z[i];
- h[i] = e820_hole_size(s, w);
- }
-
- /* Add the space pace needed for mem_map to the holes too. */
- for (i = 0; i < MAX_NR_ZONES; i++)
- h[i] += (z[i] * sizeof(struct page)) / PAGE_SIZE;
-
- /* The 16MB DMA zone has the kernel and other misc mappings.
- Account them too */
- if (h[ZONE_DMA]) {
- h[ZONE_DMA] += dma_reserve;
- if (h[ZONE_DMA] >= z[ZONE_DMA]) {
- printk(KERN_WARNING
- "Kernel too large and filling up ZONE_DMA?\n");
- h[ZONE_DMA] = z[ZONE_DMA];
- }
- }
-}
-
#ifndef CONFIG_NUMA
void __init paging_init(void)
{
- unsigned long zones[MAX_NR_ZONES], holes[MAX_NR_ZONES];
-
memory_present(0, 0, end_pfn);
sparse_init();
- size_zones(zones, holes, 0, end_pfn);
- free_area_init_node(0, NODE_DATA(0), zones,
- __pa(PAGE_OFFSET) >> PAGE_SHIFT, holes);
+ e820_register_active_regions();
+ free_area_init_nodes(MAX_DMA_PFN, MAX_DMA32_PFN, end_pfn, end_pfn);
}
#endif
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-103-x86_use_init_nodes/arch/x86_64/mm/numa.c linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/x86_64/mm/numa.c
--- linux-2.6.17-rc1-103-x86_use_init_nodes/arch/x86_64/mm/numa.c 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/x86_64/mm/numa.c 2006-04-10 10:54:38.000000000 +0100
@@ -149,13 +149,12 @@ void __init setup_node_bootmem(int nodei
void __init setup_node_zones(int nodeid)
{
unsigned long start_pfn, end_pfn, memmapsize, limit;
- unsigned long zones[MAX_NR_ZONES];
- unsigned long holes[MAX_NR_ZONES];
start_pfn = node_start_pfn(nodeid);
end_pfn = node_end_pfn(nodeid);
+ add_active_range(nodeid, start_pfn, end_pfn);
- Dprintk(KERN_INFO "Setting up node %d %lx-%lx\n",
+ Dprintk(KERN_INFO "Setting up memmap for node %d %lx-%lx\n",
nodeid, start_pfn, end_pfn);
/* Try to allocate mem_map at end to not fill up precious <4GB
@@ -167,10 +166,6 @@ void __init setup_node_zones(int nodeid)
memmapsize, SMP_CACHE_BYTES,
round_down(limit - memmapsize, PAGE_SIZE),
limit);
-
- size_zones(zones, holes, start_pfn, end_pfn);
- free_area_init_node(nodeid, NODE_DATA(nodeid), zones,
- start_pfn, holes);
}
void __init numa_init_array(void)
@@ -312,12 +307,18 @@ static void __init arch_sparse_init(void
void __init paging_init(void)
{
int i;
+ unsigned long max_normal_pfn = 0;
arch_sparse_init();
for_each_online_node(i) {
setup_node_zones(i);
+ if (max_normal_pfn < node_end_pfn(i))
+ max_normal_pfn = node_end_pfn(i);
}
+
+ free_area_init_nodes(MAX_DMA_PFN, MAX_DMA32_PFN, max_normal_pfn,
+ max_normal_pfn);
}
/* [numa=off] */
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-103-x86_use_init_nodes/include/asm-x86_64/e820.h linux-2.6.17-rc1-104-x86_64_use_init_nodes/include/asm-x86_64/e820.h
--- linux-2.6.17-rc1-103-x86_use_init_nodes/include/asm-x86_64/e820.h 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-104-x86_64_use_init_nodes/include/asm-x86_64/e820.h 2006-04-10 10:54:38.000000000 +0100
@@ -53,6 +53,7 @@ extern void e820_bootmem_free(pg_data_t
extern void e820_setup_gap(void);
extern unsigned long e820_hole_size(unsigned long start_pfn,
unsigned long end_pfn);
+extern unsigned long e820_register_active_regions(void);
extern void __init parse_memopt(char *p, char **end);
extern void __init parse_memmapopt(char *p, char **end);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-103-x86_use_init_nodes/include/asm-x86_64/proto.h linux-2.6.17-rc1-104-x86_64_use_init_nodes/include/asm-x86_64/proto.h
--- linux-2.6.17-rc1-103-x86_use_init_nodes/include/asm-x86_64/proto.h 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-104-x86_64_use_init_nodes/include/asm-x86_64/proto.h 2006-04-10 10:54:38.000000000 +0100
@@ -24,8 +24,6 @@ extern void mtrr_bp_init(void);
#define mtrr_bp_init() do {} while (0)
#endif
extern void init_memory_mapping(unsigned long start, unsigned long end);
-extern void size_zones(unsigned long *z, unsigned long *h,
- unsigned long start_pfn, unsigned long end_pfn);
extern void system_call(void);
extern int kernel_syscall(void);
^ permalink raw reply
* [PATCH 3/6] Have x86 use add_active_range() and free_area_init_nodes
From: Mel Gorman @ 2006-04-11 10:40 UTC (permalink / raw)
To: davej, tony.luck, ak, linux-kernel, linuxppc-dev; +Cc: Mel Gorman
In-Reply-To: <20060411103946.18153.83059.sendpatchset@skynet>
Size zones and holes in an architecture independent manner for x86.
This has been boot tested on;
x86 with 4 CPUs, flatmem
x86 on NUMAQ
It needs to be boot tested on an x86 machine that uses SRAT.
Kconfig | 8 +---
kernel/setup.c | 19 +++-------
kernel/srat.c | 98 ----------------------------------------------------
mm/discontig.c | 59 ++++---------------------------
4 files changed, 17 insertions(+), 167 deletions(-)
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/i386/Kconfig linux-2.6.17-rc1-103-x86_use_init_nodes/arch/i386/Kconfig
--- linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/i386/Kconfig 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-103-x86_use_init_nodes/arch/i386/Kconfig 2006-04-10 10:53:52.000000000 +0100
@@ -563,12 +563,10 @@ config ARCH_SELECT_MEMORY_MODEL
def_bool y
depends on ARCH_SPARSEMEM_ENABLE
-source "mm/Kconfig"
+config ARCH_POPULATES_NODE_MAP
+ def_bool y
-config HAVE_ARCH_EARLY_PFN_TO_NID
- bool
- default y
- depends on NUMA
+source "mm/Kconfig"
config HIGHPTE
bool "Allocate 3rd-level pagetables from highmem"
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/i386/kernel/setup.c linux-2.6.17-rc1-103-x86_use_init_nodes/arch/i386/kernel/setup.c
--- linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/i386/kernel/setup.c 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-103-x86_use_init_nodes/arch/i386/kernel/setup.c 2006-04-10 10:53:52.000000000 +0100
@@ -1156,22 +1156,15 @@ static unsigned long __init setup_memory
void __init zone_sizes_init(void)
{
- unsigned long zones_size[MAX_NR_ZONES] = {0, 0, 0};
- unsigned int max_dma, low;
+ unsigned int max_dma;
+#ifndef CONFIG_HIGHMEM
+ unsigned long highend_pfn = max_low_pfn;
+#endif
max_dma = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
- low = max_low_pfn;
- if (low < max_dma)
- zones_size[ZONE_DMA] = low;
- else {
- zones_size[ZONE_DMA] = max_dma;
- zones_size[ZONE_NORMAL] = low - max_dma;
-#ifdef CONFIG_HIGHMEM
- zones_size[ZONE_HIGHMEM] = highend_pfn - low;
-#endif
- }
- free_area_init(zones_size);
+ add_active_range(0, 0, highend_pfn);
+ free_area_init_nodes(max_dma, max_dma, max_low_pfn, highend_pfn);
}
#else
extern unsigned long __init setup_memory(void);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/i386/kernel/srat.c linux-2.6.17-rc1-103-x86_use_init_nodes/arch/i386/kernel/srat.c
--- linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/i386/kernel/srat.c 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-103-x86_use_init_nodes/arch/i386/kernel/srat.c 2006-04-10 10:53:52.000000000 +0100
@@ -56,8 +56,6 @@ struct node_memory_chunk_s {
static struct node_memory_chunk_s node_memory_chunk[MAXCHUNKS];
static int num_memory_chunks; /* total number of memory chunks */
-static int zholes_size_init;
-static unsigned long zholes_size[MAX_NUMNODES * MAX_NR_ZONES];
extern void * boot_ioremap(unsigned long, unsigned long);
@@ -137,50 +135,6 @@ static void __init parse_memory_affinity
"enabled and removable" : "enabled" ) );
}
-#if MAX_NR_ZONES != 4
-#error "MAX_NR_ZONES != 4, chunk_to_zone requires review"
-#endif
-/* Take a chunk of pages from page frame cstart to cend and count the number
- * of pages in each zone, returned via zones[].
- */
-static __init void chunk_to_zones(unsigned long cstart, unsigned long cend,
- unsigned long *zones)
-{
- unsigned long max_dma;
- extern unsigned long max_low_pfn;
-
- int z;
- unsigned long rend;
-
- /* FIXME: MAX_DMA_ADDRESS and max_low_pfn are trying to provide
- * similarly scoped information and should be handled in a consistant
- * manner.
- */
- max_dma = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
-
- /* Split the hole into the zones in which it falls. Repeatedly
- * take the segment in which the remaining hole starts, round it
- * to the end of that zone.
- */
- memset(zones, 0, MAX_NR_ZONES * sizeof(long));
- while (cstart < cend) {
- if (cstart < max_dma) {
- z = ZONE_DMA;
- rend = (cend < max_dma)? cend : max_dma;
-
- } else if (cstart < max_low_pfn) {
- z = ZONE_NORMAL;
- rend = (cend < max_low_pfn)? cend : max_low_pfn;
-
- } else {
- z = ZONE_HIGHMEM;
- rend = cend;
- }
- zones[z] += rend - cstart;
- cstart = rend;
- }
-}
-
/*
* The SRAT table always lists ascending addresses, so can always
* assume that the first "start" address that you see is the real
@@ -233,7 +187,6 @@ static int __init acpi20_parse_srat(stru
memset(pxm_bitmap, 0, sizeof(pxm_bitmap)); /* init proximity domain bitmap */
memset(node_memory_chunk, 0, sizeof(node_memory_chunk));
- memset(zholes_size, 0, sizeof(zholes_size));
/* -1 in these maps means not available */
memset(pxm_to_nid_map, -1, sizeof(pxm_to_nid_map));
@@ -414,54 +367,3 @@ out_err:
printk("failed to get NUMA memory information from SRAT table\n");
return 0;
}
-
-/* For each node run the memory list to determine whether there are
- * any memory holes. For each hole determine which ZONE they fall
- * into.
- *
- * NOTE#1: this requires knowledge of the zone boundries and so
- * _cannot_ be performed before those are calculated in setup_memory.
- *
- * NOTE#2: we rely on the fact that the memory chunks are ordered by
- * start pfn number during setup.
- */
-static void __init get_zholes_init(void)
-{
- int nid;
- int c;
- int first;
- unsigned long end = 0;
-
- for_each_online_node(nid) {
- first = 1;
- for (c = 0; c < num_memory_chunks; c++){
- if (node_memory_chunk[c].nid == nid) {
- if (first) {
- end = node_memory_chunk[c].end_pfn;
- first = 0;
-
- } else {
- /* Record any gap between this chunk
- * and the previous chunk on this node
- * against the zones it spans.
- */
- chunk_to_zones(end,
- node_memory_chunk[c].start_pfn,
- &zholes_size[nid * MAX_NR_ZONES]);
- }
- }
- }
- }
-}
-
-unsigned long * __init get_zholes_size(int nid)
-{
- if (!zholes_size_init) {
- zholes_size_init++;
- get_zholes_init();
- }
- if (nid >= MAX_NUMNODES || !node_online(nid))
- printk("%s: nid = %d is invalid/offline. num_online_nodes = %d",
- __FUNCTION__, nid, num_online_nodes());
- return &zholes_size[nid * MAX_NR_ZONES];
-}
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/i386/mm/discontig.c linux-2.6.17-rc1-103-x86_use_init_nodes/arch/i386/mm/discontig.c
--- linux-2.6.17-rc1-102-powerpc_use_init_nodes/arch/i386/mm/discontig.c 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-103-x86_use_init_nodes/arch/i386/mm/discontig.c 2006-04-10 10:53:52.000000000 +0100
@@ -157,21 +157,6 @@ static void __init find_max_pfn_node(int
BUG();
}
-/* Find the owning node for a pfn. */
-int early_pfn_to_nid(unsigned long pfn)
-{
- int nid;
-
- for_each_node(nid) {
- if (node_end_pfn[nid] == 0)
- break;
- if (node_start_pfn[nid] <= pfn && node_end_pfn[nid] >= pfn)
- return nid;
- }
-
- return 0;
-}
-
/*
* Allocate memory for the pg_data_t for this node via a crude pre-bootmem
* method. For node zero take this from the bottom of memory, for
@@ -352,45 +337,17 @@ unsigned long __init setup_memory(void)
void __init zone_sizes_init(void)
{
int nid;
-
+ unsigned long max_dma_pfn;
for_each_online_node(nid) {
- unsigned long zones_size[MAX_NR_ZONES] = {0, 0, 0};
- unsigned long *zholes_size;
- unsigned int max_dma;
-
- unsigned long low = max_low_pfn;
- unsigned long start = node_start_pfn[nid];
- unsigned long high = node_end_pfn[nid];
-
- max_dma = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
-
- if (node_has_online_mem(nid)){
- if (start > low) {
-#ifdef CONFIG_HIGHMEM
- BUG_ON(start > high);
- zones_size[ZONE_HIGHMEM] = high - start;
-#endif
- } else {
- if (low < max_dma)
- zones_size[ZONE_DMA] = low;
- else {
- BUG_ON(max_dma > low);
- BUG_ON(low > high);
- zones_size[ZONE_DMA] = max_dma;
- zones_size[ZONE_NORMAL] = low - max_dma;
-#ifdef CONFIG_HIGHMEM
- zones_size[ZONE_HIGHMEM] = high - low;
-#endif
- }
- }
- }
-
- zholes_size = get_zholes_size(nid);
-
- free_area_init_node(nid, NODE_DATA(nid), zones_size, start,
- zholes_size);
+ if (node_has_online_mem(nid))
+ add_active_range(nid, node_start_pfn[nid],
+ node_end_pfn[nid]);
}
+
+ max_dma_pfn = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
+ free_area_init_nodes(max_dma_pfn, max_dma_pfn,
+ max_low_pfn, highend_pfn);
return;
}
^ permalink raw reply
* [PATCH 1/6] Introduce mechanism for registering active regions of memory
From: Mel Gorman @ 2006-04-11 10:40 UTC (permalink / raw)
To: linuxppc-dev, ak, tony.luck, linux-kernel, davej; +Cc: Mel Gorman
In-Reply-To: <20060411103946.18153.83059.sendpatchset@skynet>
This patch defines the structure to represent an active range of page
frames within a node in an architecture independent manner. Architectures
are expected to register active ranges of PFNs using add_active_range(nid,
start_pfn, end_pfn) and call free_area_init_nodes() passing the PFNs of
the end of each zone.
include/linux/mm.h | 14 +
include/linux/mmzone.h | 15 +
mm/page_alloc.c | 374 +++++++++++++++++++++++++++++++++++++++++---
3 files changed, 378 insertions(+), 25 deletions(-)
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-clean/include/linux/mm.h linux-2.6.17-rc1-101-add_free_area_init_nodes/include/linux/mm.h
--- linux-2.6.17-rc1-clean/include/linux/mm.h 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-101-add_free_area_init_nodes/include/linux/mm.h 2006-04-10 10:52:14.000000000 +0100
@@ -867,6 +867,20 @@ extern void free_area_init(unsigned long
extern void free_area_init_node(int nid, pg_data_t *pgdat,
unsigned long * zones_size, unsigned long zone_start_pfn,
unsigned long *zholes_size);
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+extern void free_area_init_nodes(unsigned long max_dma_pfn,
+ unsigned long max_dma32_pfn,
+ unsigned long max_low_pfn,
+ unsigned long max_high_pfn);
+extern void add_active_range(unsigned int nid, unsigned long start_pfn,
+ unsigned long end_pfn);
+extern void get_pfn_range_for_nid(unsigned int nid,
+ unsigned long *start_pfn, unsigned long *end_pfn);
+extern int early_pfn_to_nid(unsigned long pfn);
+extern void free_bootmem_with_active_regions(int nid,
+ unsigned long max_low_pfn);
+extern void memory_present_with_active_regions(int nid);
+#endif
extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long);
extern void setup_per_zone_pages_min(void);
extern void mem_init(void);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-clean/include/linux/mmzone.h linux-2.6.17-rc1-101-add_free_area_init_nodes/include/linux/mmzone.h
--- linux-2.6.17-rc1-clean/include/linux/mmzone.h 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-101-add_free_area_init_nodes/include/linux/mmzone.h 2006-04-10 10:52:14.000000000 +0100
@@ -271,6 +271,18 @@ struct zonelist {
struct zone *zones[MAX_NUMNODES * MAX_NR_ZONES + 1]; // NULL delimited
};
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+/*
+ * This represents an active range of physical memory. Architectures register
+ * a pfn range using add_active_range() and later initialise the nodes and
+ * free list with free_area_init_nodes()
+ */
+struct node_active_region {
+ unsigned long start_pfn;
+ unsigned long end_pfn;
+ int nid;
+};
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
/*
* The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
@@ -465,7 +477,8 @@ extern struct zone *next_zone(struct zon
#endif
-#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
+#if !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) && \
+ !defined(CONFIG_ARCH_POPULATES_NODE_MAP)
#define early_pfn_to_nid(nid) (0UL)
#endif
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-clean/mm/page_alloc.c linux-2.6.17-rc1-101-add_free_area_init_nodes/mm/page_alloc.c
--- linux-2.6.17-rc1-clean/mm/page_alloc.c 2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-101-add_free_area_init_nodes/mm/page_alloc.c 2006-04-10 10:52:14.000000000 +0100
@@ -37,6 +37,8 @@
#include <linux/nodemask.h>
#include <linux/vmalloc.h>
#include <linux/mempolicy.h>
+#include <linux/sort.h>
+#include <linux/pfn.h>
#include <asm/tlbflush.h>
#include "internal.h"
@@ -84,6 +86,18 @@ int min_free_kbytes = 1024;
unsigned long __initdata nr_kernel_pages;
unsigned long __initdata nr_all_pages;
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+ #ifdef CONFIG_MAX_ACTIVE_REGIONS
+ #define MAX_ACTIVE_REGIONS CONFIG_MAX_ACTIVE_REGIONS
+ #else
+ #define MAX_ACTIVE_REGIONS (MAX_NR_ZONES * MAX_NUMNODES + 1)
+ #endif
+
+ struct node_active_region __initdata early_node_map[MAX_ACTIVE_REGIONS];
+ unsigned long __initdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
+ unsigned long __initdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
+
#ifdef CONFIG_DEBUG_VM
static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
{
@@ -1743,25 +1757,6 @@ static inline unsigned long wait_table_b
#define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
-static void __init calculate_zone_totalpages(struct pglist_data *pgdat,
- unsigned long *zones_size, unsigned long *zholes_size)
-{
- unsigned long realtotalpages, totalpages = 0;
- int i;
-
- for (i = 0; i < MAX_NR_ZONES; i++)
- totalpages += zones_size[i];
- pgdat->node_spanned_pages = totalpages;
-
- realtotalpages = totalpages;
- if (zholes_size)
- for (i = 0; i < MAX_NR_ZONES; i++)
- realtotalpages -= zholes_size[i];
- pgdat->node_present_pages = realtotalpages;
- printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id, realtotalpages);
-}
-
-
/*
* Initially all pages are reserved - free ones are freed
* up by free_all_bootmem() once the early boot process is
@@ -2048,6 +2043,214 @@ static __meminit void init_currently_emp
zone_init_free_lists(pgdat, zone, zone->spanned_pages);
}
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+static int __init first_active_region_index_in_nid(int nid)
+{
+ int i;
+ for (i = 0; early_node_map[i].end_pfn; i++) {
+ if (early_node_map[i].nid == nid)
+ return i;
+ }
+
+ return MAX_ACTIVE_REGIONS;
+}
+
+static int __init next_active_region_index_in_nid(unsigned int index, int nid)
+{
+ for (index = index + 1; early_node_map[index].end_pfn; index++) {
+ if (early_node_map[index].nid == nid)
+ return index;
+ }
+
+ return MAX_ACTIVE_REGIONS;
+}
+
+#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
+int __init early_pfn_to_nid(unsigned long pfn)
+{
+ int i;
+
+ for (i = 0; early_node_map[i].end_pfn; i++) {
+ unsigned long start_pfn = early_node_map[i].start_pfn;
+ unsigned long end_pfn = early_node_map[i].end_pfn;
+
+ if ((start_pfn <= pfn) && (pfn < end_pfn))
+ return early_node_map[i].nid;
+ }
+
+ return -1;
+}
+#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
+
+#define for_each_active_range_index_in_nid(i, nid) \
+ for (i = first_active_region_index_in_nid(nid); \
+ i != MAX_ACTIVE_REGIONS; \
+ i = next_active_region_index_in_nid(i, nid))
+
+void __init free_bootmem_with_active_regions(int nid,
+ unsigned long max_low_pfn)
+{
+ unsigned int i;
+ for_each_active_range_index_in_nid(i, nid) {
+ unsigned long size_pages = 0;
+ unsigned long end_pfn = early_node_map[i].end_pfn;
+ if (early_node_map[i].start_pfn >= max_low_pfn)
+ continue;
+
+ if (end_pfn > max_low_pfn)
+ end_pfn = max_low_pfn;
+
+ size_pages = end_pfn - early_node_map[i].start_pfn;
+ free_bootmem_node(NODE_DATA(early_node_map[i].nid),
+ PFN_PHYS(early_node_map[i].start_pfn),
+ PFN_PHYS(size_pages));
+ }
+}
+
+void __init memory_present_with_active_regions(int nid)
+{
+ unsigned int i;
+ for_each_active_range_index_in_nid(i, nid)
+ memory_present(early_node_map[i].nid,
+ early_node_map[i].start_pfn,
+ early_node_map[i].end_pfn);
+}
+
+void __init get_pfn_range_for_nid(unsigned int nid,
+ unsigned long *start_pfn, unsigned long *end_pfn)
+{
+ unsigned int i;
+ *start_pfn = -1UL;
+ *end_pfn = 0;
+
+ for_each_active_range_index_in_nid(i, nid) {
+ if (early_node_map[i].start_pfn < *start_pfn)
+ *start_pfn = early_node_map[i].start_pfn;
+
+ if (early_node_map[i].end_pfn > *end_pfn)
+ *end_pfn = early_node_map[i].end_pfn;
+ }
+
+ if (*start_pfn == -1UL) {
+ printk(KERN_WARNING "Node %u active with no memory\n", nid);
+ *start_pfn = 0;
+ }
+}
+
+unsigned long __init zone_present_pages_in_node(int nid,
+ unsigned long zone_type,
+ unsigned long *ignored)
+{
+ unsigned long node_start_pfn, node_end_pfn;
+ unsigned long zone_start_pfn, zone_end_pfn;
+
+ /* Get the start and end of the node and zone */
+ get_pfn_range_for_nid(nid, &node_start_pfn, &node_end_pfn);
+ zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
+ zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
+
+ /* Check that this node has pages within the zone's required range */
+ if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn)
+ return 0;
+
+ /* Move the zone boundaries inside the node if necessary */
+ if (zone_end_pfn > node_end_pfn)
+ zone_end_pfn = node_end_pfn;
+ if (zone_start_pfn < node_start_pfn)
+ zone_start_pfn = node_start_pfn;
+
+ /* Return the spanned pages */
+ return zone_end_pfn - zone_start_pfn;
+}
+
+static inline int __init pfn_range_in_zone(unsigned long start_pfn,
+ unsigned long end_pfn,
+ unsigned long zone_type)
+{
+ if (start_pfn < arch_zone_lowest_possible_pfn[zone_type])
+ return 0;
+
+ if (start_pfn >= arch_zone_highest_possible_pfn[zone_type])
+ return 0;
+
+ if (end_pfn < arch_zone_lowest_possible_pfn[zone_type])
+ return 0;
+
+ if (end_pfn >= arch_zone_highest_possible_pfn[zone_type])
+ return 0;
+
+ return 1;
+}
+
+unsigned long __init zone_absent_pages_in_node(int nid,
+ unsigned long zone_type,
+ unsigned long *ignored)
+{
+ int i = 0;
+ unsigned long prev_end_pfn = 0, hole_pages = 0;
+ unsigned long start_pfn;
+
+ /* Find the end_pfn of the first active range of pfns in the node */
+ i = first_active_region_index_in_nid(nid);
+ prev_end_pfn = early_node_map[i].start_pfn;
+
+ /* Find all holes for the node */
+ for (; i != MAX_ACTIVE_REGIONS;
+ i = next_active_region_index_in_nid(i, nid)) {
+
+ /* Increase the hole size if the hole is within the zone */
+ start_pfn = early_node_map[i].start_pfn;
+ if (pfn_range_in_zone(prev_end_pfn, start_pfn, zone_type)) {
+ BUG_ON(prev_end_pfn > start_pfn);
+ hole_pages += start_pfn - prev_end_pfn;
+ }
+
+ prev_end_pfn = early_node_map[i].end_pfn;
+ }
+
+ return hole_pages;
+}
+#else
+static inline unsigned long zone_present_pages_in_node(int nid,
+ unsigned long zone_type,
+ unsigned long *zones_size)
+{
+ return zones_size[zone_type];
+}
+
+static inline unsigned long zone_absent_pages_in_node(int nid,
+ unsigned long zone_type,
+ unsigned long *zholes_size)
+{
+ if (!zholes_size)
+ return 0;
+
+ return zholes_size[zone_type];
+}
+#endif
+
+static void __init calculate_node_totalpages(struct pglist_data *pgdat,
+ unsigned long *zones_size, unsigned long *zholes_size)
+{
+ unsigned long realtotalpages, totalpages = 0;
+ int i;
+
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ totalpages += zone_present_pages_in_node(pgdat->node_id, i,
+ zones_size);
+ }
+ pgdat->node_spanned_pages = totalpages;
+
+ realtotalpages = totalpages;
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ realtotalpages -=
+ zone_absent_pages_in_node(pgdat->node_id, i, zholes_size);
+ }
+ pgdat->node_present_pages = realtotalpages;
+ printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id,
+ realtotalpages);
+}
+
/*
* Set up the zone data structures:
* - mark all pages reserved
@@ -2070,10 +2273,9 @@ static void __init free_area_init_core(s
struct zone *zone = pgdat->node_zones + j;
unsigned long size, realsize;
- realsize = size = zones_size[j];
- if (zholes_size)
- realsize -= zholes_size[j];
-
+ size = zone_present_pages_in_node(nid, j, zones_size);
+ realsize = size - zone_absent_pages_in_node(nid, j,
+ zholes_size);
if (j < ZONE_HIGHMEM)
nr_kernel_pages += realsize;
nr_all_pages += realsize;
@@ -2140,13 +2342,137 @@ void __init free_area_init_node(int nid,
{
pgdat->node_id = nid;
pgdat->node_start_pfn = node_start_pfn;
- calculate_zone_totalpages(pgdat, zones_size, zholes_size);
+ calculate_node_totalpages(pgdat, zones_size, zholes_size);
alloc_node_mem_map(pgdat);
free_area_init_core(pgdat, zones_size, zholes_size);
}
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+void __init add_active_range(unsigned int nid, unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ unsigned int i;
+ unsigned long pages = end_pfn - start_pfn;
+
+ /* Merge with existing active regions if possible */
+ for (i = 0; early_node_map[i].end_pfn; i++) {
+ if (early_node_map[i].nid != nid)
+ continue;
+
+ if (early_node_map[i].end_pfn == start_pfn) {
+ early_node_map[i].end_pfn += pages;
+ return;
+ }
+
+ if (early_node_map[i].start_pfn == (start_pfn + pages)) {
+ early_node_map[i].start_pfn -= pages;
+ return;
+ }
+ }
+
+ /*
+ * Leave last entry NULL so we dont iterate off the end (we use
+ * entry.end_pfn to terminate the walk).
+ */
+ if (i >= MAX_ACTIVE_REGIONS - 1) {
+ printk(KERN_ERR "WARNING: too many memory regions in "
+ "numa code, truncating\n");
+ return;
+ }
+
+ early_node_map[i].nid = nid;
+ early_node_map[i].start_pfn = start_pfn;
+ early_node_map[i].end_pfn = end_pfn;
+}
+
+/* Compare two active node_active_regions */
+static int __init cmp_node_active_region(const void *a, const void *b)
+{
+ struct node_active_region *arange = (struct node_active_region *)a;
+ struct node_active_region *brange = (struct node_active_region *)b;
+
+ /* Done this way to avoid overflows */
+ if (arange->start_pfn > brange->start_pfn)
+ return 1;
+ if (arange->start_pfn < brange->start_pfn)
+ return -1;
+
+ return 0;
+}
+
+/* sort the node_map by start_pfn */
+static void __init sort_node_map(void)
+{
+ size_t num = 0;
+ while (early_node_map[num].end_pfn)
+ num++;
+
+ sort(early_node_map, num, sizeof(struct node_active_region),
+ cmp_node_active_region, NULL);
+}
+
+unsigned long __init find_min_pfn(void)
+{
+ int i;
+ unsigned long min_pfn = -1UL;
+
+ for (i = 0; early_node_map[i].end_pfn; i++) {
+ if (early_node_map[i].start_pfn < min_pfn)
+ min_pfn = early_node_map[i].start_pfn;
+ }
+
+ return min_pfn;
+}
+
+/* Find the lowest pfn in a node. This depends on a sorted early_node_map */
+unsigned long __init find_start_pfn_for_node(unsigned long nid)
+{
+ int i;
+
+ /* Assuming a sorted map, the first range found has the starting pfn */
+ for_each_active_range_index_in_nid(i, nid) {
+ return early_node_map[i].start_pfn;
+ }
+
+ /* nid does not exist in early_node_map */
+ printk(KERN_WARNING "Could not find start_pfn for node %lu\n", nid);
+ return 0;
+}
+
+void __init free_area_init_nodes(unsigned long arch_max_dma_pfn,
+ unsigned long arch_max_dma32_pfn,
+ unsigned long arch_max_low_pfn,
+ unsigned long arch_max_high_pfn)
+{
+ unsigned long nid;
+
+ /* Record where the zone boundaries are */
+ memset(arch_zone_lowest_possible_pfn, 0,
+ sizeof(arch_zone_lowest_possible_pfn));
+ memset(arch_zone_highest_possible_pfn, 0,
+ sizeof(arch_zone_highest_possible_pfn));
+ arch_zone_lowest_possible_pfn[ZONE_DMA] = find_min_pfn();
+ arch_zone_highest_possible_pfn[ZONE_DMA] = arch_max_dma_pfn;
+ arch_zone_lowest_possible_pfn[ZONE_DMA32] = arch_max_dma_pfn;
+ arch_zone_highest_possible_pfn[ZONE_DMA32] = arch_max_dma32_pfn;
+ arch_zone_lowest_possible_pfn[ZONE_NORMAL] = arch_max_dma32_pfn;
+ arch_zone_highest_possible_pfn[ZONE_NORMAL] = arch_max_low_pfn;
+ arch_zone_lowest_possible_pfn[ZONE_HIGHMEM] = arch_max_low_pfn;
+ arch_zone_highest_possible_pfn[ZONE_HIGHMEM] = arch_max_high_pfn;
+
+ /* Regions in the early_node_map can be in any order */
+ sort_node_map();
+
+ for_each_online_node(nid) {
+ pg_data_t *pgdat = NODE_DATA(nid);
+ free_area_init_node(nid, pgdat, NULL,
+ find_start_pfn_for_node(nid), NULL);
+ }
+}
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
+
#ifndef CONFIG_NEED_MULTIPLE_NODES
static bootmem_data_t contig_bootmem_data;
struct pglist_data contig_page_data = { .bdata = &contig_bootmem_data };
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox