LinuxPPC-Dev Archive on lore.kernel.org
* Re: Base address of executables - weirdness?
From: Andreas Schwab @ 2006-06-06 21:15 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linuxppc-dev
In-Reply-To: <4485A279.4050403@zytor.com>

"H. Peter Anvin" <hpa@zytor.com> writes:

> Until recently, binaries linked with ld defaulted to a base address of 
> 0x10000000+SIZEOF_HEADERS.  However, recently I've gotten a couple of 
> reports -- and I've been able to confirm this on my FC5 system -- that 
> some versions of ld link at 0x01800000+SIZEOF_HEADERS.

You are probably using the wrong linker emulation.  There are three
emulations enabled when building binutils for ppc-linux, but only the
elf32ppclinux emulation is the right one that uses 0x10000000 for the base
address.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


* Re: eth0: tx queue full
From: Wolfgang Denk @ 2006-06-06 20:53 UTC (permalink / raw)
  To: salvatore cusenza; +Cc: linuxppc-embedded
In-Reply-To: <9252a64b0606060113v696adbb7ib43ad95836c0724b@mail.gmail.com>

In message <9252a64b0606060113v696adbb7ib43ad95836c0724b@mail.gmail.com> you wrote:
> 
> At runtime during the usual life of my board (MPC852 and linux-2.4.20 Denk's
> distribution)
>  I have experienced the following crash:

2.4.20 is at least 3.5 years old. Please use recent code.

Best regards,

Wolfgang Denk

-- 
Software Engineering:  Embedded and Realtime Systems,  Embedded Linux
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
Brain fried - Core dumped


* 2.4 kernel scheduling (?) problems
From: Tobias Netzel @ 2006-06-06 18:29 UTC (permalink / raw)
  To: linuxppc-dev

Hello all,

I'm new to this list - hoping you will help me with the 2.4 kernel, 
although it's old now.
I'm improving hardware support for the NuBus PMacs. The NuBus PMacs are 
68k Macs with a PPC CPU and a different bus bridge/memory controller.
The NuBus PMac port is something like a hack to the PPC architecture 
using some things from the PMac platform.
An open firmware device tree is emulated as far as needed. We use the 
same PMU driver (although I hacked it a bit because we directly route 
the PMU interrupts), ADB and RTC driver and the same functions to 
calibrate the decrementer using the VIA timer.

The problem I got is that for example during SCSI transfers (I'm using 
an old scanner) neither the X screen gets updated nor does the system 
respond to any interrupts but the NMI.
Debugging messages through the serial port are still sent. So I have to 
wait until the whole transfer is done. The kernel only receives 
interrupts after a SCSI command has finished and before a new one is 
sent. The data from the scanner is transferred in blocks of 32 kB.
The behaviour is similar when I do a performance test using "dd" or 
"hdparm" on the IDE CD-ROM drive. Burning CDs using that IDE drive 
works but causes the same problem as when scanning with the SCSI 
scanner.
With the IDE hard disk I don't get this problem.
When I run "top" during those problem transfers the CPU utilization by 
the system is higher than 95% ("top" hardly gets updated) - I doubt 
that this is necessary as on the NuBus PMacs the PPC CPUs (and 
especially the 217 MHz G3 with 512 kB L2 cache I'm using) should be 
idle most of the time waiting for the slow 33 MHz system bus.
The strange thing is that the CPU misses all interrupts (except the 
NMI) although interrupts aren't turned off in the CPU (otherwise the 
PMU would shut us down and the NMI wouldn't work). I also tried to use 
a timer to poll the interrupt controllers but the interrupt handling 
routines also only find one pending interrupt in 10 seconds even when I 
constantly move the mouse and hit keys.
At first I thought this was something caused by the hardware but in 
MacOS 9 the SCSI driver doesn't block anything.

But is it possible that this is because of something like scheduling 
problems of the kernel?
And if so might updating to the 2.6 kernel fix that issue?

Tobias


* Re: Base address of executables - weirdness?
From: Linas Vepstas @ 2006-06-06 17:33 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linuxppc-dev
In-Reply-To: <4485A279.4050403@zytor.com>

On Tue, Jun 06, 2006 at 08:42:49AM -0700, H. Peter Anvin wrote:
> I'm trying to track down an odd issue with klibc on ppc32.
> 
> Until recently, binaries linked with ld defaulted to a base address of 
> 0x10000000+SIZEOF_HEADERS.  However, recently I've gotten a couple of 
> reports -- and I've been able to confirm this on my FC5 system -- that 
> some versions of ld link at 0x01800000+SIZEOF_HEADERS.  Needless to 
> say, this is more than a bit confusing, *especially* since "ld -verbose" 
> still reports:
> 
>      PROVIDE (__executable_start = 0x10000000); . = 0x10000000 + 
> SIZEOF_HEADERS;
> 
> ... at the top of the linker script.
> 
> I'm rather baffled.  Has anyone else seen this, and/or have any other 
> explanation?

Googling "0x01800000 linux ppc" brings up some interesting but old hits.

However, I swear I saw someone suggest a patch last week that changed
0x10000000 to 0x01800000 somewhere, (vmlinux.lds ??) as a proposed cure
for a bug. Sorry, I deleted it.

--linas 


* RE: [PATCH/2.6.17-rc4 4/10]Powerpc:  Add tsi108 pic support
From: Alexandre Bounine @ 2006-06-06 18:58 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Zang Roy-r61911
  Cc: linuxppc-dev list, Paul Mackerras, Yang Xin-Xin-r48390

I forgot to mention another argument in favor of adding separate
MPIC_SPV_EOI and MPIC_CASC_NOEOI flags:

If we have an MPIC with "broken" logic but a standard register map, we can
use model ID = 0 for the standard MPIC without creating an additional data
structure.

Regards,

Alex.

-----Original Message-----
From: Alexandre Bounine
Sent: Tuesday, June 06, 2006 10:46 AM
To: 'Benjamin Herrenschmidt'; Zang Roy-r61911
Cc: Kumar Gala; linuxppc-dev list; Yang Xin-Xin-r48390; Paul Mackerras
Subject: RE: [PATCH/2.6.17-rc4 4/10]Powerpc: Add tsi108 pic support

> -----Original Message-----
> From: Benjamin Herrenschmidt [mailto:benh@kernel.crashing.org]
> Sent: Tuesday, June 06, 2006 6:17 AM
> To: Zang Roy-r61911
> Cc: Alexandre Bounine; Kumar Gala; linuxppc-dev list; Yang
> Xin-Xin-r48390; Paul Mackerras
> Subject: RE: [PATCH/2.6.17-rc4 4/10]Powerpc: Add tsi108 pic support
>
>
> On Tue, 2006-06-06 at 17:43 +0800, Zang Roy-r61911 wrote:
>
> > Update Tsi108 implementation of MPIC.
> > Any comment?
> >
> > Integrate Tundra Semiconductor tsi108 host bridge interrupt controller
> > to mpic arch.
>
> Looks much better :) Still a few things...
>

Sounds good. We are moving in the right direction :)

> > +	mpic = mpic_alloc(mpic_paddr,
> > +			MPIC_PRIMARY | MPIC_BIG_ENDIAN | MPIC_WANTS_RESET |
> > +			MPIC_SPV_EOI | MPIC_CASC_NOEOI |
> > +			MPIC_MOD_ID(MPIC_ID_TSI108),
> > +			0, /* num_sources used */
> > +			TSI108_IRQ_BASE,
> > +			0, /* num_sources used */
> > +			NR_IRQS - 4 /* XXXX */,
> > +			mpc7448_hpc2_pic_initsenses,
> > +			sizeof(mpc7448_hpc2_pic_initsenses), "Tsi108_PIC");
>
> That's a hell lot of new flags... I'm not sure we need that many or a
> single TSI108 one that encloses all the new ones. Also, I'm not sure we
> need that model ID encoding thing. Let's do things simple, besides, I
> don't want to encourage HW folks into doing the same kind of contraption
> in the future

More details in comments below.

> (btw, tell the TSI folks for me that they had a BAD BAD
> BAD idea to muck around with the base design that way, especially
> changing the register map in incompatible ways for no good reason).
>

Done!

> > +	/* Configure MPIC outputs to CPU0 */
> > +	tsi108_write_reg(TSI108_MPIC_OFFSET + 0x30c, 0);
> >  }
>
> It doesn't use the standard multiple processor outputs mechanism of
> MPIC ?
>
> > +static struct mpic_info mpic_infos[] = {
> > +	[0] = {	/* Original OpenPIC compatible MPIC */
> > +	.greg_base	= MPIC_GREG_BASE,
> > +	.greg_frr0	= MPIC_GREG_FEATURE_0,
> > +	.greg_config0	= MPIC_GREG_GLOBAL_CONF_0,
> > +	.greg_vendor_id	= MPIC_GREG_VENDOR_ID,
> > +	.greg_ipi_vp0	= MPIC_GREG_IPI_VECTOR_PRI_0,
> > +	.greg_ipi_stride	= MPIC_GREG_IPI_STRIDE,
> > +	.greg_spurious	= MPIC_GREG_SPURIOUS,
> > +	.greg_tfrr	= MPIC_GREG_TIMER_FREQ,
> > +
>
>    .../...
>
> It's a bit sad to have to go all the way to doing such tables, but I
> suspect it's probably the best way to handle it at this point. Send more
> nastygrams to the HW folks for me.
>

Done:)

> >  	mpic->num_sources = 0; /* so far */
> >  	mpic->senses = senses;
> >  	mpic->senses_count = senses_count;
> > +	mpic->hw_set = &mpic_infos[MPIC_GET_MOD_ID(flags)];
>
> Well... the model ID thing might not be that a bad idea in the end :) I
> need to think about it. I might have to deal with yet another MPIC that
> has another register map (yeah yeah, TSI aren't the only ones to not get
> it)...
>

I'll tell this to HW guys as well :)
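
The model-ID idea discussed above (a small ID packed into the flags word
and used to index a table of register maps) can be sketched in plain C
roughly as follows. This is only an illustration: the SK_* names, the bit
positions and the register offsets are invented for the example and are
not taken from the patch, whose MPIC_MOD_ID()/MPIC_GET_MOD_ID() definitions
are not shown in this thread.

```c
#include <stddef.h>

/* Hypothetical layout: low 16 bits hold feature flags, the model ID
 * lives in bits 16..19. The patch's real bit positions are not shown. */
#define SK_FLAG_PRIMARY		0x0001u
#define SK_FLAG_BIG_ENDIAN	0x0002u
#define SK_MOD_ID_SHIFT		16
#define SK_MOD_ID_MASK		0xfu
#define SK_MOD_ID(id)		(((id) & SK_MOD_ID_MASK) << SK_MOD_ID_SHIFT)
#define SK_GET_MOD_ID(flags)	(((flags) >> SK_MOD_ID_SHIFT) & SK_MOD_ID_MASK)

/* ID 0 is the standard controller, so callers that pass no ID get it. */
enum { SK_ID_STANDARD = 0, SK_ID_VARIANT = 1, SK_ID_COUNT };

/* One register map per model; the offsets are made-up values. */
struct sk_reg_map {
	unsigned int greg_base;
	unsigned int cpu_intack;
};

static const struct sk_reg_map sk_reg_maps[SK_ID_COUNT] = {
	[SK_ID_STANDARD] = { .greg_base = 0x1000, .cpu_intack = 0x00a0 },
	[SK_ID_VARIANT]  = { .greg_base = 0x0000, .cpu_intack = 0x0028 },
};

/* Pick the register map once, at allocation time.
 * Returns NULL if the encoded model ID is not in the table. */
const struct sk_reg_map *sk_select_map(unsigned int flags)
{
	unsigned int id = SK_GET_MOD_ID(flags);

	return id < SK_ID_COUNT ? &sk_reg_maps[id] : NULL;
}
```

With this layout, existing callers that never set a model ID implicitly
select entry 0, which matches the "model ID = 0 for the standard MPIC"
argument made at the top of this message.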

>   .../...
>
> > @@ -963,7 +1043,7 @@ int mpic_get_one_irq(struct mpic *mpic,
> >  {
> >  	u32 irq;
> >
> > -	irq = mpic_cpu_read(MPIC_CPU_INTACK) & MPIC_VECPRI_VECTOR_MASK;
> > +	irq = mpic_cpu_read(mpic->hw_set->cpu_intack) & mpic->hw_set->irq_vpr_vector;
> >  #ifdef DEBUG_LOW
> >  	DBG("%s: get_one_irq(): %d\n", mpic->name, irq);
> >  #endif
> > @@ -972,11 +1052,18 @@ #ifdef DEBUG_LOW
> >  		DBG("%s: cascading ...\n", mpic->name);
> >  #endif
> >  		irq = mpic->cascade(regs, mpic->cascade_data);
> > -		mpic_eoi(mpic);
> > +#ifdef DEBUG_LOW
> > +		DBG("%s: cascaded irq: %d\n", mpic->name, irq);
> > +#endif
> > +		if (!(mpic->flags & MPIC_CASC_NOEOI))
> > +			mpic_eoi(mpic);
> >  		return irq;
> >  	}
>
> Can you tell me why you need the above ? (Why you aren't EOI'ing the
> cascade ?) Note that the cascade handling is going away from mpic anyway
> with the port to genirq that I'll publish later this week for 2.6.18 and
> it will almost be handled as a normal interrupt...
>

We have a level-signalled irq from the cascaded PCI interrupt controller.
If I do EOI at this time, the level request will not have a chance to be
cleared (unless all PCI interrupts have an SA_INTERRUPT flag), which
results in recurring interrupts.

I chose to have an individual flag instead of checking the model ID to
avoid multiple checks within the ISR (in case we have more than one mpic
version requiring this option). I also expect that it may be useful for
any external level-signalling cascades connected to MPIC.
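
To make the trade-off concrete, here is a stripped-down user-space sketch
of the two checks from the quoted mpic_get_one_irq() hunks, with behaviour
selected by per-feature flag bits rather than by comparing model IDs. The
SK_* names, the bit values, the spurious vector number and the struct are
hypothetical; the EOI register write is replaced by a counter so the
effect can be observed.

```c
#define SK_CASC_NOEOI	0x0100u	/* hypothetical bit: do not EOI a cascade */
#define SK_SPV_EOI	0x0200u	/* hypothetical bit: EOI a spurious vector */
#define SK_VEC_SPURIOUS	255u	/* hypothetical spurious vector number */

struct sk_pic {
	unsigned int flags;
	unsigned int eoi_count;	/* stands in for the EOI register write */
};

static void sk_eoi(struct sk_pic *pic)
{
	pic->eoi_count++;
}

/*
 * Returns the irq number to handle, or -1 for a spurious vector.
 * Each quirk costs a single bit test on the hot path, however many
 * controller models end up sharing it.
 */
int sk_get_one_irq(struct sk_pic *pic, unsigned int vector,
		   int is_cascade, int cascaded_irq)
{
	if (is_cascade) {
		if (!(pic->flags & SK_CASC_NOEOI))
			sk_eoi(pic);
		return cascaded_irq;
	}

	if (vector == SK_VEC_SPURIOUS) {
		if (pic->flags & SK_SPV_EOI)
			sk_eoi(pic);
		return -1;
	}

	return (int)vector;
}
```

The alternative suggested in the review would replace each bit test with a
comparison against one model ID, which is just as cheap while only a
single model needs the quirk.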

> > -	if (unlikely(irq == MPIC_VEC_SPURRIOUS))
> > +	if (unlikely(irq == MPIC_VEC_SPURRIOUS)) {
> > +		if (mpic->flags & MPIC_SPV_EOI)
> > +			mpic_eoi(mpic);
> >  		return -1;
> > +	}
>
> I think the above thing could just test the model ID. It's unlikely that
> another implementation need the same "feature", so just test the model
> ID rather than adding a flag and if we ever have another model with the
> same "feature", then we'll go back to adding a flag :)
>

Motivation is the same as above - I just do not want to have multiple ID
checks here. I agree that it is driven by mpic type (model ID) only. I
can remove this one if you do not expect any new "broken" MPICs on the
horizon.

> Cheers,
> Ben.
>

Thanks for your feedback,
Alex.

* Re: ppc85xx DMA
From: Naru Sundar @ 2006-06-06 18:55 UTC (permalink / raw)
  To: Liu Dave-r63238; +Cc: linuxppc-embedded
In-Reply-To: <9FCDBA58F226D911B202000BDBAD4673026FD940@zch01exm40.ap.freescale.net>

On Tue, Jun 06, 2006 at 09:39:29AM +0800, Liu Dave-r63238 wrote:
> What is the DMA transfer mode? Is direct or chaining mode?

Direct mode.  I fixed an error with my bit ordering for the configuration
registers, and now the transfer seems to complete, but I don't see any
actual data showing up in the destination register that I am writing to.

> Did you ioremap the DMA register space?

Yes, I can write the destination address manually.  So I am thinking my addresses
are wrong.

For the source and dest address I used:

dma_map_single(NULL, ptr, len, DMA_TO_DEVICE)

(which effectively does a virt_to_bus on ppc and so should just return to me
the bus address used by the dma).

-naru


* Re: Base address of executables - weirdness?
From: H. Peter Anvin @ 2006-06-06 18:21 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: linuxppc-dev
In-Reply-To: <20060606173343.GE9294@austin.ibm.com>

Linas Vepstas wrote:
> On Tue, Jun 06, 2006 at 08:42:49AM -0700, H. Peter Anvin wrote:
>> I'm trying to track down an odd issue with klibc on ppc32.
>>
>> Until recently, binaries linked with ld defaulted to a base address of 
>> 0x10000000+SIZEOF_HEADERS.  However, recently I've gotten a couple of 
>> reports -- and I've been able to confirm this on my FC5 system -- that 
>> some versions of ld link at 0x01800000+SIZEOF_HEADERS.  Needless to 
>> say, this is more than a bit confusing, *especially* since "ld -verbose" 
>> still reports:
>>
>>      PROVIDE (__executable_start = 0x10000000); . = 0x10000000 + 
>> SIZEOF_HEADERS;
>>
>> ... at the top of the linker script.
>>
>> I'm rather baffled.  Has anyone else seen this, and/or have any other 
>> explanation?
> 
> Googling "0x01800000 linux ppc" brings up some interesting but old hits.
> 
> However, I swear I saw someone suggest a patch last week that changed
> 0x10000000 to 0x01800000 somewhere, (vmlinux.lds ??) as a proposed cure
> for a bug. Sorry, I deleted it.
> 

Well, it's worse than I previously surmised.  I can't seem to find any combination of 
options which work on both affected and unaffected binutils.  This is a real mess.

	-hpa


* RE: Intercept System call using Kernel  module is 2.6 kernel
From: Jenkins, Clive @ 2006-06-06 17:14 UTC (permalink / raw)
  To: Meswani, Mitesh, linuxppc-dev

>        x=mitesh_func();
>        printf("mitesh_func returned %d\n",x);

The first thing would be to change your user-space program
to print the error number from errno after your "system call".

        x=mitesh_func();
        printf("mitesh_func returned %d, errno=%d\n",x,errno);

Or you can use perror() -- look it up.
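
A slightly fuller sketch of the same idea, for reference. It uses the
generic syscall() wrapper from libc instead of the _syscall0 macro, and the
helper name try_syscall() is made up for this example. The point is to
clear errno before the call, save it straight afterwards, and print both
the number and its text.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Invoke raw system call `nr` with one argument and report the outcome.
 * Returns the errno value observed (0 if the call succeeded).
 */
int try_syscall(long nr, long arg)
{
	long ret;
	int saved;

	errno = 0;		/* clear any stale value first */
	ret = syscall(nr, arg);
	saved = errno;		/* save it before printf() can change it */

	if (ret == -1)
		printf("syscall %ld returned %ld, errno=%d (%s)\n",
		       nr, ret, saved, strerror(saved));
	else
		printf("syscall %ld returned %ld\n", nr, ret);

	return saved;
}
```

For instance, try_syscall(SYS_close, -1) reports EBADF. The errno value is
what tells apart a call that never reached a handler (ENOSYS) from one
whose handler ran and returned an error of its own.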

Clive


> 4) I verify from the system logs that when I insmod the kernel module I
> get all the print statements. I verified from the logs that the address
> of the sys_call_table is correctly passed and from /proc/kallsyms I can
> see that my function mitesh_func has been defined and has the address as
> indicated in the logs.
>
> The problem is that when I execute my user app I expect to see two
> things:
>  a) I should see a message in the log "Executing mitesh_func..." and
>  b) A return value of 2
>
> However I get an error value -1 returned.
>
> Any help and ideas are highly appreciated.
>
> Thank you in advance,
> Mitesh


* Re: Intercept System call using Kernel  module is 2.6 kernel
From: Arnd Bergmann @ 2006-06-06 17:48 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Meswani, Mitesh
In-Reply-To: <C26C730943E01145B4F89E37FE0A022002BBC7A6@itdsrvmail02.utep.edu>

On Tuesday 06 June 2006 18:25, Meswani, Mitesh wrote:
> Any help and ideas are highly appreciated.

Tell your professor that the task you were given is

a) pointless, as you wouldn't use this kind of thing to
   solve an actual problem other than bad OS design
   homework.
b) not a correct approach regarding maintainability, since
   you can't tell for an arbitrary kernel version if
   the particular syscall you're abusing is now used for
   something else.

As a replacement task, choose one or more of the following:

- implement a syscall by _recompiling_ the kernel and call
  that from your user application.
- write a misc device driver that exposes a device to
  do ioctl() on.
- create a file in each of sysfs, procfs and debugfs to
  do your operation on, using read() and write().
- use a netlink socket for a two way communication with
  a kernel module.

	Arnd <><


* Re: process starvation with 2.6 scheduler
From: Thiago Galesi @ 2006-06-06 17:09 UTC (permalink / raw)
  To: Kallol Biswas; +Cc: linuxppc-dev
In-Reply-To: <478F19F21671F04298A2116393EEC3D527421D@sjc1exm08.pmc_nt.nt.pmc-sierra.bc.ca>

Did you try it with a _real_ CPU? My bet is that the timer interrupt
is overwhelming the CPU (even at 100Hz, 400kHz is too slow).
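
As a rough sanity check on that estimate, taking the 400 kHz figure as the
effective CPU clock and 100 Hz as the tick rate, the cycle budget per tick
can be worked out as below. The handler cost passed to the second function
is a free parameter; no measured value is given in this thread.

```c
/* CPU cycles available between two consecutive timer ticks. */
unsigned long cycles_per_tick(unsigned long cpu_hz, unsigned long tick_hz)
{
	return cpu_hz / tick_hz;
}

/* Percentage of every tick consumed by a handler that costs
 * `handler_cycles` cycles (integer arithmetic, rounded down). */
unsigned long tick_overhead_percent(unsigned long cpu_hz,
				    unsigned long tick_hz,
				    unsigned long handler_cycles)
{
	return handler_cycles * 100 / cycles_per_tick(cpu_hz, tick_hz);
}
```

cycles_per_tick(400000, 100) gives 4000 cycles per tick. If the timer
interrupt path and scheduler work were to cost, say, 2000 cycles (an
assumed number), half of the CPU would already be gone before any process
runs.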

-- 
-
Thiago Galesi


* Re: Collecting hypervisor call stats
From: Mike Kravetz @ 2006-06-06 16:46 UTC (permalink / raw)
  To: Christopher Yeoh; +Cc: Chris Yeoh, Bryan Rosenburg, linuxppc-dev
In-Reply-To: <17534.30511.192632.558778@localhost.localdomain>

On Thu, Jun 01, 2006 at 03:12:15PM +1000, Christopher Yeoh wrote:
> Here's a patch we've used for collecting hcall counts and times.

Thanks for the patch/code Chris!  I'm using this as a basis for something
that we may want to merge into the tree.  Just a couple of questions.

Your 'wrappers' have the following general form:

> +long plpar_hcall(unsigned long opcode, unsigned long arg1,
> +			unsigned long arg2, unsigned long arg3,
> +			unsigned long arg4, unsigned long *out1,
> +			unsigned long *out2, unsigned long *out3)
> +{
> +    long retcode;
> +    unsigned long t_entry;
> +    int opcode_index;
> +    
> +    opcode_index = map_hcall_to_index(opcode);
> +    
> +    t_entry = mfspr(SPRN_PURR);
> +    barrier();
> +    
> +    retcode = plpar_hcall_real(opcode, arg1, arg2, arg3, arg4,
> +			       out1, out2, out3);
> +    
> +    barrier();
> +    get_cpu_var(hcall_type_count)[opcode_index]++;
> +    put_cpu_var(hcall_type_count);
> +    get_cpu_var(hcall_type_time)[opcode_index] += mfspr(SPRN_PURR) - t_entry;
> +    put_cpu_var(hcall_type_time);
> +    
> +    return retcode;
> +};

Can you explain the need for barrier(s) before and after the call to the
real routine?  It usually takes me a couple days of thought to figure out
exactly where these are needed. :)

The use of get_cpu_var/put_cpu_var result in disabling/enabling preemption.
I can understand why this would be desirable to assure the accuracy of the
statistics.  But, I was wondering if the desired accuracy is worth the added
overhead.  My thought was to make these as lightweight as possible and
sacrifice some accuracy if necessary.  After all, no 'internal decisions' are
being made because of this data.  It is simply exposed to user land.
Thoughts?
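
For readers who want to experiment with the wrapper pattern outside the
kernel, a single-threaded user-space analogue might look like the sketch
below. Everything in it is a stand-in: clock_gettime() replaces the PURR
read, plain arrays replace the per-CPU variables, and real_call() replaces
plpar_hcall_real(). The barrier() macro has the same definition the kernel
uses for its compiler-only barrier; the sketch does not attempt to settle
whether the barriers or the preemption handling are needed.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

#define NR_OPS 4	/* hypothetical number of tracked call types */

/* Compiler-only barrier: keeps the compiler from moving memory accesses
 * across this point. It emits no CPU instruction. */
#define barrier() __asm__ __volatile__("" : : : "memory")

static unsigned long op_count[NR_OPS];
static unsigned long long op_time_ns[NR_OPS];

static unsigned long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (unsigned long long)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* Stand-in for the real call being measured. */
static long real_call(unsigned int op, long arg)
{
	return arg + op;
}

/* Wrapper: time the real call, then update the statistics for `op`. */
long counted_call(unsigned int op, long arg)
{
	unsigned long long t_entry;
	long ret;

	t_entry = now_ns();
	barrier();

	ret = real_call(op, arg);

	barrier();
	op_count[op % NR_OPS]++;
	op_time_ns[op % NR_OPS] += now_ns() - t_entry;

	return ret;
}
```

Because the counters here are plain globals updated from one thread, the
accuracy-versus-overhead question raised above does not arise; it only
appears once several CPUs or preemption are involved.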

Thanks,
-- 
Mike


* Re: Intercept System call using Kernel  module is 2.6 kernel
From: Jeff.Fellin @ 2006-06-06 17:02 UTC (permalink / raw)
  To: mmeswani; +Cc: linuxppc-dev, linuxppc-dev-bounces+jeff.fellin=rflelect.com

                                                                                                                                     
                      "Meswani, Mitesh" <mmeswani@utep.edu>                                                                          
Sent by: linuxppc-dev-bounces+jeff.fellin=rflelect.com@ozlabs.org
06/06/2006 12:25
To:      <linuxppc-dev@ozlabs.org>
cc:
Subject: Intercept System call using Kernel  module is 2.6 kernel

>Hello
>
>I am attempting to run some user code with kernel space permission. I am
>using the ppc64 kernel version 2.6.16-rc4-3-ppc64 for IBM Power5
>processors.
>In this kernel module I am trying to implement a function that can be
>called from user space.
>
>I have found through various posts that using unused system calls and
>replacing them temporarily can achieve this objective.
>
>This is what I am doing, but it's not working, please bear with the
>slightly long code that follows:
>
>1) since the 2.6 kernel does not export sys_call_table, I grep it from the
>boot image

First sign that what you are doing is not a good idea. There are better
methods for this:
1) device driver interface with read/write/ioctl interface
2) procfs files from a module/driver
3) sysfs files from a module/driver

SNIP
>
>The problem is that when I execute my user app I expect to see two things:
> a) I should see a message in the log "Executing mitesh_func..." and
> b) A return value of 2
>However I get an error value -1 returned.

An indication that thinking in terms of system calls rather than other
methods is wrong!

>Any help and ideas are highly appreciated.

Don't add your own or hijack system calls.

>Thank you in advance,
>Mitesh


* Intercept System call using Kernel  module is 2.6 kernel
From: Meswani, Mitesh @ 2006-06-06 16:25 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <C26C730943E01145B4F89E37FE0A022002BBC7A2@itdsrvmail02.utep.edu>


 
 
Hello 
 
 
I am attempting to run some user code with kernel space permission. I am using the ppc64 kernel version 2.6.16-rc4-3-ppc64 for IBM Power5 processors. 
In this kernel module I am trying to implement a function that can be called from user space. 
 
I have found through various posts that using unused system calls and replacing them temporarily can achieve this objective. 
 
This is what I am doing, but it's not working, please bear with the slightly long code that follows: 
 
1) since the 2.6 kernel does not export sys_call_table, I grep it from the boot image
 
2) Next I write the kernel module as : 
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/sched.h>
#include <linux/syscalls.h>

unsigned long **sctable;
void *org_func;  /***** Copy of the original calls address ********/

asmlinkage int mitesh_func(void)
{
        printk(KERN_ALERT "Executing mitesh_func...\n");
        return 2;
}

int init_module(void)
{
        unsigned long ptr;
        unsigned long *p;

        ptr = 0x23203404;  /*** some hard coded addresses from grepping for sys_call_table *****/
        p = (unsigned long *)ptr;
        sctable = (unsigned long **)p;
        printk("The address of the system call table is: 0x%x\n",&sctable[0]);
        printk("The address of syscall #137 is: 0x%x\n",sctable[137]);

        org_func = (void *) (sctable[137]);  /**** Store the original sys call ****/
        printk("Original func address 0x%x stored \n",org_func);
        sctable[137] = (void *) mitesh_func;  /**** replace with mitesh_func ****/
        printk("The new sys call address is 0x%x and stored as : 0x%x\n",mitesh_func, sctable[137]);

        return 0;
}

void cleanup_module(void)
{
        sctable[137] = (void *) org_func;
        printk("Upon module unload the sctable #137 address is :0x%x\n",sctable[137]);
        printk("Module is unloaded!\n");
}

3) My user app looks like this:
#include <stdio.h>
#include <errno.h>
#include <asm-ppc64/unistd.h>
#define __NR_mitesh_func 137

_syscall0(int, mitesh_func);

int main(void)
{
        int x=0;
        x=mitesh_func();
        printf("mitesh_func returned %d\n",x);
        return 0;
}

 
4) I verify from the system logs that when I insmod the kernel module I get all the print statements. I verified from the logs that the address of the sys_call_table is correctly passed and from /proc/kallsyms I can see that my function mitesh_func has been defined and has the address as indicated in the logs. 
 
The problem is that when I execute my user app I expect to see two things: 
 a) I should see a message in the log "Executing mitesh_func..." and 
 b) A return value of 2 
 
However I get an error value -1 returned. 
 
Any help and ideas are highly appreciated.  
 
Thank you in advance, 
Mitesh 
 



* Base address of executables - weirdness?
From: H. Peter Anvin @ 2006-06-06 15:42 UTC (permalink / raw)
  To: linuxppc-dev

I'm trying to track down an odd issue with klibc on ppc32.

Until recently, binaries linked with ld defaulted to a base address of 
0x10000000+SIZEOF_HEADERS.  However, recently I've gotten a couple of 
reports -- and I've been able to confirm this on my FC5 system -- that 
some versions of ld link at 0x01800000+SIZEOF_HEADERS.  Needless to 
say, this is more than a bit confusing, *especially* since "ld -verbose" 
still reports:

     PROVIDE (__executable_start = 0x10000000); . = 0x10000000 + 
SIZEOF_HEADERS;

... at the top of the linker script.

I'm rather baffled.  Has anyone else seen this, and/or have any other 
explanation?

	-hpa
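
One way to see which base a given toolchain actually used is to have a
test program report the __executable_start symbol that the linker script
line quoted above provides. The following is a sketch under that
assumption: the symbol comes from the default GNU ld script, so a custom
script or a different linker may not define it, and the helper names are
made up for the example.

```c
#include <stdint.h>
#include <stdio.h>

/* Provided by the default GNU ld linker script (the PROVIDE line quoted
 * above). Assumed to exist; other linkers or scripts may not define it. */
extern char __executable_start[];

/* Address at which this executable's image starts. */
uintptr_t executable_base(void)
{
	return (uintptr_t)__executable_start;
}

/* Print the base address. A position-independent executable shows its
 * load-time address here rather than a fixed link-time one. */
void print_executable_base(void)
{
	printf("__executable_start = 0x%lx\n",
	       (unsigned long)executable_base());
}
```

Building the same file with an affected and an unaffected binutils and
comparing the printed value would show directly whether 0x10000000 or
0x01800000 was used for a statically positioned ppc32 binary.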


* PowerPC 8266 ADS-PCI(CPM8260) SCC (UART/Ethernet Mode) and SMC in UART mode
From: Aung Soe @ 2006-06-06 14:50 UTC (permalink / raw)
  To: linuxppc-dev


Dear All

I am working on a PowerPC 8266 ADS-PCI board with Montavista Linux with a
2.4.20 kernel to make SCC1 and SMC 1 & 2 work in UART mode, SCC2, 3 & 4 in
Ethernet mode, and FCC 2 in Fast Ethernet mode; all in NMSI and not TDM.
The drivers provided with 2.4.20 Linux on PowerPC were referred to and
reused.

http://lxr.linux.no/source/arch/ppc/8260_io/?v=2.4.20;a=ppc

When only 3 SCCs are used in Ethernet mode and the last SCC in UART mode,
the system works well.

But when 2 more SMCs are added to the UART serial driver, the SCC Ethernet
network driver stops working after all ports have been working for a while.

I would appreciate any pointers and hints.

Thanks in advance.

Sincerely

Aung



* Re: eth0: tx queue full
From: Steve Iribarne (GMail) @ 2006-06-06 14:45 UTC (permalink / raw)
  To: salvatore cusenza; +Cc: linuxppc-embedded
In-Reply-To: <9252a64b0606060113v696adbb7ib43ad95836c0724b@mail.gmail.com>

On 6/6/06, salvatore cusenza <salvatore.cusenza@gmail.com> wrote:
>
> At runtime during the usual life of my board (MPC852 and linux-2.4.20 Denk's
> distribution)
>  I have experienced the following crash:
>
>
> eth0: tx queue full!.
> eth0: tx queue full!.
> eth0: tx queue full!.
>
> Oops: kernel access of bad area, sig: 11
> NIP: C000D440 XER: 00000000 LR: C00BB040 SP: C0C9BC10 REGS: c0c9bb60 TRAP:
> 0300    Tainted: P
> MSR: 00009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11
> DAR: 00001F9D, DSISR: 000000E4
> TASK = c0c9a000[145] 'L5421' Last syscall: 4
> last math 00000000 last altivec 00000000
> GPR00: 00000000 C0C9BC10 C0C9A000 C0F56D70 00001F99 0000003C C0F56D6C
> 00000007
> GPR08: 00000001 0000003C 00000000 C0F56DB0 C0D83C3C 10071D28 00000000
> C3120000
> GPR16: C311CB04 C311C8D8 C311C754 C0170000 C3120000 C311CB30 00000001
> C0169DA0
> GPR24: F0000E00 00001F9D C0F58400 0000003C 00000040 C2080100 C0F58200
> C0F501B0
> Call backtrace:
> C00BAF8C C00BABC8 C0005848 C3119448 C31194C0 C31194F8 C00066F8
> C0011A48 C00BA8FC C00CADD8 C00C3F00 C0016B50 C00AFE4C C00B4B94
> C00B5EF4 C003571C C000457C 0FFD5E4C 0FEDB8DC 0FEDB284 1003B5B8
> 1003D558 1003A88C 0FED34A4 0FED32D0 0FFCFEE4 0FD5F590
> Kernel panic: Aiee, killing interrupt handler!
> In interrupt handler - not syncing
>  <0>Rebooting in 180 seconds..
>
>
> Could you suggest me something to investigate?
>


First thing I'd do is get my hands on ksymoops and look at the callstack.

Check out...

http://www.kernel.org/pub/linux/utils/kernel/ksymoops/v2.4/

From my past experience, it seems to me that I have had a similar oops
when the driver code I was debugging was stuck in a loop at interrupt
time and you're missing interrupts.

Or better yet, you're just plain crashing at interrupt time.  It sounds
like this is pretty easy to reproduce so follow the readme on
ksymoops.
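
At its core, what ksymoops does with the NIP and backtrace values is look
up each address against the kernel's symbol list and report the nearest
symbol at or below it. A minimal sketch of that lookup follows. The symbol
names and addresses in the table are invented placeholders, not entries
from the real System.map of this kernel; only the two addresses used in the
usage note come from the oops above.

```c
#include <stddef.h>

struct sym {
	unsigned long addr;
	const char *name;
};

/* Hypothetical excerpt of a System.map, sorted by ascending address. */
static const struct sym symtab[] = {
	{ 0xc0004000UL, "example_entry" },
	{ 0xc000d400UL, "example_copy" },
	{ 0xc00bab00UL, "example_start_xmit" },
	{ 0xc00baf00UL, "example_tx_timeout" },
};

/* Return the last symbol whose address is <= addr, or NULL if addr
 * lies below the first symbol. Binary search over the sorted table. */
const struct sym *lookup_symbol(unsigned long addr)
{
	size_t lo = 0;
	size_t hi = sizeof(symtab) / sizeof(symtab[0]);

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (symtab[mid].addr <= addr)
			lo = mid + 1;	/* candidate; look further right */
		else
			hi = mid;
	}

	return lo ? &symtab[lo - 1] : NULL;
}
```

With this made-up table, lookup_symbol(0xC000D440UL) (the NIP from the
oops) lands on the entry at 0xc000d400, and the first backtrace address
0xC00BAF8C lands on the entry at 0xc00baf00. Against the real System.map
the same lookup yields the actual function names plus offsets.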

Hope that helps.

-stv


* RE: [PATCH/2.6.17-rc4 4/10]Powerpc:  Add tsi108 pic support
From: Alexandre Bounine @ 2006-06-06 14:45 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Zang Roy-r61911
  Cc: linuxppc-dev list, Paul Mackerras, Yang Xin-Xin-r48390



> -----Original Message-----
> From: Benjamin Herrenschmidt [mailto:benh@kernel.crashing.org]
> Sent: Tuesday, June 06, 2006 6:17 AM
> To: Zang Roy-r61911
> Cc: Alexandre Bounine; Kumar Gala; linuxppc-dev list; Yang
> Xin-Xin-r48390; Paul Mackerras
> Subject: RE: [PATCH/2.6.17-rc4 4/10]Powerpc: Add tsi108 pic support
>
>
> On Tue, 2006-06-06 at 17:43 +0800, Zang Roy-r61911 wrote:
>
> > Update Tsi108 implementation of MPIC.
> > Any comment?
> >
> > Integrate Tundra Semiconductor tsi108 host bridge interrupt controller
> > to mpic arch.
>
> Looks much better :) Still a few things...
>

Sounds good. We are moving in the right direction :)

> > +	mpic = mpic_alloc(mpic_paddr,
> > +			MPIC_PRIMARY | MPIC_BIG_ENDIAN | MPIC_WANTS_RESET |
> > +			MPIC_SPV_EOI | MPIC_CASC_NOEOI |
> > +			MPIC_MOD_ID(MPIC_ID_TSI108),
> > +			0, /* num_sources used */
> > +			TSI108_IRQ_BASE,
> > +			0, /* num_sources used */
> > +			NR_IRQS - 4 /* XXXX */,
> > +			mpc7448_hpc2_pic_initsenses,
> > +			sizeof(mpc7448_hpc2_pic_initsenses), "Tsi108_PIC");
>
> That's a hell lot of new flags... I'm not sure we need that many or a
> single TSI108 one that encloses all the new ones. Also, I'm not sure we
> need that model ID encoding thing. Let's do things simple, besides, I
> don't want to encourage HW folks into doing the same kind of contraption
> in the future

More details in comments below.

> (btw, tell the TSI folks for me that they had a BAD BAD
> BAD idea to muck around with the base design that way, especially
> changing the register map in incompatible ways for no good reason).
>

Done!

> > +	/* Configure MPIC outputs to CPU0 */
> > +	tsi108_write_reg(TSI108_MPIC_OFFSET + 0x30c, 0);
> >  }
>
> It doesn't use the standard multiple processor outputs mechanism of
> MPIC ?
>
> > +static struct mpic_info mpic_infos[] = {
> > +	[0] = {	/* Original OpenPIC compatible MPIC */
> > +	.greg_base	= MPIC_GREG_BASE,
> > +	.greg_frr0	= MPIC_GREG_FEATURE_0,
> > +	.greg_config0	= MPIC_GREG_GLOBAL_CONF_0,
> > +	.greg_vendor_id	= MPIC_GREG_VENDOR_ID,
> > +	.greg_ipi_vp0	= MPIC_GREG_IPI_VECTOR_PRI_0,
> > +	.greg_ipi_stride	= MPIC_GREG_IPI_STRIDE,
> > +	.greg_spurious	= MPIC_GREG_SPURIOUS,
> > +	.greg_tfrr	= MPIC_GREG_TIMER_FREQ,
> > +
>
>    .../...
>
> It's a bit sad to have to go all the way to doing such tables, but I
> suspect it's probably the best way to handle it at this point. Send more
> nastygrams to the HW folks for me.
>

Done:)

> >  	mpic->num_sources = 0; /* so far */
> >  	mpic->senses = senses;
> >  	mpic->senses_count = senses_count;
> > +	mpic->hw_set = &mpic_infos[MPIC_GET_MOD_ID(flags)];
>
> Well... the model ID thing might not be that a bad idea in the end :) I
> need to think about it. I might have to deal with yet another MPIC that
> has another register map (yeah yeah, TSI aren't the only ones to not get
> it)...
>

I'll tell this to HW guys as well :)

>   .../...
>
> > @@ -963,7 +1043,7 @@ int mpic_get_one_irq(struct mpic *mpic,
> >  {
> >  	u32 irq;
> >
> > -	irq = mpic_cpu_read(MPIC_CPU_INTACK) & MPIC_VECPRI_VECTOR_MASK;
> > +	irq = mpic_cpu_read(mpic->hw_set->cpu_intack) & mpic->hw_set->irq_vpr_vector;
> >  #ifdef DEBUG_LOW
> >  	DBG("%s: get_one_irq(): %d\n", mpic->name, irq);
> >  #endif
> > @@ -972,11 +1052,18 @@ #ifdef DEBUG_LOW
> >  		DBG("%s: cascading ...\n", mpic->name);
> >  #endif
> >  		irq = mpic->cascade(regs, mpic->cascade_data);
> > -		mpic_eoi(mpic);
> > +#ifdef DEBUG_LOW
> > +		DBG("%s: cascaded irq: %d\n", mpic->name, irq);
> > +#endif
> > +		if (!(mpic->flags & MPIC_CASC_NOEOI))
> > +			mpic_eoi(mpic);
> >  		return irq;
> >  	}
>
> Can you tell me why you need the above ? (Why you aren't EOI'ing the
> cascade ?) Note that the cascade handling is going away from mpic anyway
> with the port to genirq that I'll publish later this week for 2.6.18 and
> it will almost be handled as a normal interrupt...
>

We have a level-signalled irq from the cascaded PCI interrupt controller.
If I do the EOI at this point, the level request will not have a chance to
be cleared (unless all PCI interrupts have the SA_INTERRUPT flag), which
results in recurring interrupts.

I chose an individual flag instead of checking the model ID to avoid
multiple checks within the ISR (in case more than one mpic version
requires this option). I also expect it may be useful for any external
level-signalling cascades connected to the MPIC.
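
To illustrate the trade-off described above, here is a minimal user-space
sketch (not the actual mpic.c code). Only the flag names MPIC_CASC_NOEOI and
MPIC_SPV_EOI come from the patch; the bit values, the enum and the helper
names are assumptions made up for this example.

```c
#include <stdbool.h>

/* Flag names are from the patch; the bit values are assumed. */
#define MPIC_CASC_NOEOI	0x1u	/* skip the EOI after a cascaded interrupt */
#define MPIC_SPV_EOI	0x2u	/* issue an EOI after a spurious vector */

/* Hypothetical model IDs, for illustration only. */
enum mpic_model { MODEL_OPENPIC, MODEL_TSI108, MODEL_OTHER_QUIRKY };

/* Per-quirk flag: a single bit test in the interrupt path. */
static bool eoi_after_cascade_by_flag(unsigned int flags)
{
	return !(flags & MPIC_CASC_NOEOI);
}

/*
 * Model-ID variant: the comparison chain grows by one term for every
 * controller that turns out to share the quirk.
 */
static bool eoi_after_cascade_by_model(enum mpic_model model)
{
	return model != MODEL_TSI108 && model != MODEL_OTHER_QUIRKY;
}
```

With the flag, a new controller that needs the same behaviour only sets the
bit at init time and the interrupt path stays unchanged.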

> > -	if (unlikely(irq == MPIC_VEC_SPURRIOUS))
> > +	if (unlikely(irq == MPIC_VEC_SPURRIOUS)) {
> > +		if (mpic->flags & MPIC_SPV_EOI)
> > +			mpic_eoi(mpic);
> >  		return -1;
> > +	}
>
> I think the above thing could just test the model ID. It's unlikely that
> another implementation needs the same "feature", so just test the model
> ID rather than adding a flag and if we ever have another model with the
> same "feature", then we'll go back to adding a flag :)
>

The motivation is the same as above: I just do not want to have multiple ID
checks here. I agree that it is driven only by the mpic type (model ID). I
can remove this one if you do not expect any new "broken" MPICs on the
horizon.

> Cheers,
> Ben.
>
Thanks for your feedback,
Alex.

^ permalink raw reply

* [PATCH 2/2] powerpc: node-aware dma allocations
From: Christoph Hellwig @ 2006-06-06 14:11 UTC (permalink / raw)
  To: linuxppc-dev

Make sure dma_alloc_coherent allocates memory from the local node.  This
is important on Cell, where we want to avoid going through the slow cpu
interconnect.

Note:  I could only test this patch on Cell; it should be verified on
some pseries machine by those that have the hardware.


Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/arch/powerpc/kernel/iommu.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/iommu.c	2006-04-25 15:53:07.000000000 +0200
+++ linux-2.6/arch/powerpc/kernel/iommu.c	2006-05-30 14:54:25.000000000 +0200
@@ -536,11 +536,12 @@
  * to the dma address (mapping) of the first page.
  */
 void *iommu_alloc_coherent(struct iommu_table *tbl, size_t size,
-		dma_addr_t *dma_handle, unsigned long mask, gfp_t flag)
+		dma_addr_t *dma_handle, unsigned long mask, gfp_t flag, int node)
 {
 	void *ret = NULL;
 	dma_addr_t mapping;
 	unsigned int npages, order;
+	struct page *page;
 
 	size = PAGE_ALIGN(size);
 	npages = size >> PAGE_SHIFT;
@@ -560,9 +561,10 @@
 		return NULL;
 
 	/* Alloc enough pages (and possibly more) */
-	ret = (void *)__get_free_pages(flag, order);
-	if (!ret)
+	page = alloc_pages_node(node, flag, order);
+	if (!page)
 		return NULL;
+	ret = page_address(page);
 	memset(ret, 0, size);
 
 	/* Set up tces to cover the allocated range */
@@ -570,9 +572,9 @@
 			      mask >> PAGE_SHIFT, order);
 	if (mapping == DMA_ERROR_CODE) {
 		free_pages((unsigned long)ret, order);
-		ret = NULL;
-	} else
-		*dma_handle = mapping;
+		return NULL;
+	}
+	*dma_handle = mapping;
 	return ret;
 }
 
Index: linux-2.6/arch/powerpc/kernel/pci_iommu.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/pci_iommu.c	2006-04-25 15:53:07.000000000 +0200
+++ linux-2.6/arch/powerpc/kernel/pci_iommu.c	2006-05-30 14:55:18.000000000 +0200
@@ -86,7 +86,8 @@
 			   dma_addr_t *dma_handle, gfp_t flag)
 {
 	return iommu_alloc_coherent(devnode_table(hwdev), size, dma_handle,
-			device_to_mask(hwdev), flag);
+			device_to_mask(hwdev), flag,
+			pcibus_to_node(to_pci_dev(hwdev)->bus));
 }
 
 static void pci_iommu_free_coherent(struct device *hwdev, size_t size,
Index: linux-2.6/arch/powerpc/kernel/vio.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/vio.c	2006-04-25 15:53:07.000000000 +0200
+++ linux-2.6/arch/powerpc/kernel/vio.c	2006-05-30 14:54:38.000000000 +0200
@@ -229,7 +229,7 @@
 			   dma_addr_t *dma_handle, gfp_t flag)
 {
 	return iommu_alloc_coherent(to_vio_dev(dev)->iommu_table, size,
-			dma_handle, ~0ul, flag);
+			dma_handle, ~0ul, flag, -1);
 }
 
 static void vio_free_coherent(struct device *dev, size_t size,
Index: linux-2.6/include/asm-powerpc/iommu.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/iommu.h	2006-04-25 15:53:07.000000000 +0200
+++ linux-2.6/include/asm-powerpc/iommu.h	2006-05-30 14:55:45.000000000 +0200
@@ -76,7 +76,8 @@
 		int nelems, enum dma_data_direction direction);
 
 extern void *iommu_alloc_coherent(struct iommu_table *tbl, size_t size,
-		dma_addr_t *dma_handle, unsigned long mask, gfp_t flag);
+		dma_addr_t *dma_handle, unsigned long mask,
+		gfp_t flag, int node);
 extern void iommu_free_coherent(struct iommu_table *tbl, size_t size,
 		void *vaddr, dma_addr_t dma_handle);
 extern dma_addr_t iommu_map_single(struct iommu_table *tbl, void *vaddr,

^ permalink raw reply

* [PATCH 1/2] powerpc: implement pcibus_to_node and pcibus_to_cpumask
From: Christoph Hellwig @ 2006-06-06 14:09 UTC (permalink / raw)
  To: linuxppc-dev

On 64bit powerpc we can find out what node a pci bus hangs off, so
implement the topology.h macros that export this information.

For 32bit this seems a little more difficult, but I don't know of 32bit
powerpc NUMA machines either, so let's leave it out for now.


Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/include/asm-powerpc/topology.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/topology.h	2006-05-02 16:26:14.000000000 +0200
+++ linux-2.6/include/asm-powerpc/topology.h	2006-05-30 14:42:18.000000000 +0200
@@ -32,8 +32,13 @@
 
 int of_node_to_nid(struct device_node *device);
 
+#ifdef CONFIG_PPC64
+#define pcibus_to_node(bus)	(of_node_to_nid(bus->sysdata))
+#define pcibus_to_cpumask(bus)	(node_to_cpumask(of_node_to_nid(bus->sysdata)))
+#else
 #define pcibus_to_node(node)    (-1)
 #define pcibus_to_cpumask(bus)	(cpu_online_map)
+#endif
 
 /* sched_domains SD_NODE_INIT for PPC64 machines */
 #define SD_NODE_INIT (struct sched_domain) {		\

^ permalink raw reply

* Re: [Alsa-devel] [RFC 4/8] snd-aoa: add i2sbus
From: Takashi Iwai @ 2006-06-06 14:00 UTC (permalink / raw)
  To: Johannes Berg; +Cc: linuxppc-dev, alsa-devel
In-Reply-To: <1149592647.5928.49.camel@johannes.berg>

At Tue, 06 Jun 2006 13:17:27 +0200,
Johannes Berg wrote:
> 
> On Fri, 2006-06-02 at 16:23 +0200, Takashi Iwai wrote:
> > > +	if (I2S_CLOCK_SPEED_18MHz % rate == 0) {
> > > +		if ((I2S_CLOCK_SPEED_18MHz / rate) % mclk == 0) {
> > 
> > Equivalent to "I2S_CLOCK_SPEED_18MHz % (rate * mclk) == 0" ?
> 
> Yeah, I guess, never really thought about that, just wrote it down the
> way I thought to do it :) That said, I think it's more readable if
> written that way, do you want me to change it regardless?

I find a single if more readable (and better for the compiler).
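
The equivalence of the two forms can be checked with a small user-space
sketch. The clock value below is an assumption made for illustration (the
real I2S_CLOCK_SPEED_18MHz constant is not shown in this thread), and the
function names are invented for the example. The two forms agree as long as
rate and mclk are non-zero and rate * mclk does not overflow.

```c
#include <stdbool.h>

#define CLOCK_HZ 18432000u	/* assumed value, for illustration only */

/* The nested form from the patch under review. */
static bool nested_check(unsigned int rate, unsigned int mclk)
{
	if (CLOCK_HZ % rate == 0) {
		if ((CLOCK_HZ / rate) % mclk == 0)
			return true;
	}
	return false;
}

/* The single-condition form suggested in the review. */
static bool single_check(unsigned int rate, unsigned int mclk)
{
	return CLOCK_HZ % (rate * mclk) == 0;
}

/* Compare both forms over a small range of non-zero inputs. */
static bool forms_agree(void)
{
	for (unsigned int rate = 1; rate <= 2000; rate++)
		for (unsigned int mclk = 1; mclk <= 512; mclk++)
			if (nested_check(rate, mclk) != single_check(rate, mclk))
				return false;
	return true;
}
```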

> > > +	/* well, we really should support scatter/gather DMA */
> > > +	/* FIXME FIXME FIXME: If this fails, we BUG() when the alsa layer
> > > +	 * later tries to allocate memory. Apparently we should be setting
> > > +	 * some device pointer for that ...
> > > +	 */
> > > +	snd_pcm_lib_preallocate_pages_for_all(
> > > +		dev->pcm, SNDRV_DMA_TYPE_DEV,
> > > +		snd_dma_pci_data(macio_get_pci_dev(i2sdev->macio)),
> > > +		64 * 1024, 64 * 1024);
> > 
> > Is the comment true?  Yes, you have to set the device pointer via
> > snd_pcm_lib_preallocate*().  But it must be OK even if preallocate
> > fails.
> 
> Hah, I don't know actually, I didn't know you set the pointer using this
> function, when I wrote the comment I just had forgotten the preallocate
> call!
> Does that mean that _preallocate_pages_for_all() has the side effect of
> setting the pointer? If so, imho that's pretty bad.

No, the only requirement is that you have to call
snd_pcm_lib_preallocate*() with the proper type and an assigned device
pointer if you use the snd_pcm_lib_malloc() function.  (If it is not
called, you get an error when compiled with the debug option.)


Takashi

^ permalink raw reply

* [PATCH 5/5] Have ia64 use add_active_range() and free_area_init_nodes
From: Mel Gorman @ 2006-06-06 13:48 UTC (permalink / raw)
  To: akpm
  Cc: davej, tony.luck, linuxppc-dev, Mel Gorman, linux-kernel,
	bob.picco, ak, linux-mm
In-Reply-To: <20060606134710.21419.48239.sendpatchset@skynet.skynet.ie>


Size zones and holes in an architecture independent manner for ia64.
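
As a rough user-space model of what sizing holes from registered ranges
means: given a list of active PFN ranges, the hole size of a zone is the zone
span minus the pages covered by those ranges. The struct and function names
below are invented for this sketch and are not the kernel API; ranges are
assumed to be half-open and non-overlapping.

```c
#include <stddef.h>

/* Half-open page frame range [start_pfn, end_pfn); hypothetical type. */
struct pfn_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/* Pages inside [start, end) that are covered by a registered range. */
static unsigned long model_present_pages(const struct pfn_range *r, size_t n,
					 unsigned long start, unsigned long end)
{
	unsigned long present = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned long s = r[i].start_pfn > start ? r[i].start_pfn : start;
		unsigned long e = r[i].end_pfn < end ? r[i].end_pfn : end;

		if (e > s)
			present += e - s;
	}
	return present;
}

/* Hole size of [start, end): its span minus the pages present in it. */
static unsigned long model_absent_pages(const struct pfn_range *r, size_t n,
					unsigned long start, unsigned long end)
{
	return (end - start) - model_present_pages(r, n, start, end);
}
```

With ranges [0, 100) and [150, 400) and a zone boundary at PFN 256, the lower
zone spans 256 pages with a 50-page hole, and the upper zone [256, 400) has
no hole.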


 arch/ia64/Kconfig          |    3 ++
 arch/ia64/mm/contig.c      |   60 +++++-----------------------------------
 arch/ia64/mm/discontig.c   |   41 ++++-----------------------
 arch/ia64/mm/init.c        |   12 ++++++++
 include/asm-ia64/meminit.h |    1 
 5 files changed, 30 insertions(+), 87 deletions(-)

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Bob Picco <bob.picco@hp.com>
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/arch/ia64/Kconfig linux-2.6.17-rc5-mm3-105-ia64_use_init_nodes/arch/ia64/Kconfig
--- linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/arch/ia64/Kconfig	2006-06-05 14:12:48.000000000 +0100
+++ linux-2.6.17-rc5-mm3-105-ia64_use_init_nodes/arch/ia64/Kconfig	2006-06-05 14:17:47.000000000 +0100
@@ -353,6 +353,9 @@ config NODES_SHIFT
 	  MAX_NUMNODES will be 2^(This value).
 	  If in doubt, use the default.
 
+config ARCH_POPULATES_NODE_MAP
+	def_bool y
+
 # VIRTUAL_MEM_MAP and FLAT_NODE_MEM_MAP are functionally equivalent.
 # VIRTUAL_MEM_MAP has been retained for historical reasons.
 config VIRTUAL_MEM_MAP
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/arch/ia64/mm/contig.c linux-2.6.17-rc5-mm3-105-ia64_use_init_nodes/arch/ia64/mm/contig.c
--- linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/arch/ia64/mm/contig.c	2006-06-05 14:12:48.000000000 +0100
+++ linux-2.6.17-rc5-mm3-105-ia64_use_init_nodes/arch/ia64/mm/contig.c	2006-06-05 14:17:47.000000000 +0100
@@ -26,10 +26,6 @@
 #include <asm/sections.h>
 #include <asm/mca.h>
 
-#ifdef CONFIG_VIRTUAL_MEM_MAP
-static unsigned long num_dma_physpages;
-#endif
-
 /**
  * show_mem - display a memory statistics summary
  *
@@ -210,18 +206,6 @@ count_pages (u64 start, u64 end, void *a
 	return 0;
 }
 
-#ifdef CONFIG_VIRTUAL_MEM_MAP
-static int
-count_dma_pages (u64 start, u64 end, void *arg)
-{
-	unsigned long *count = arg;
-
-	if (start < MAX_DMA_ADDRESS)
-		*count += (min(end, MAX_DMA_ADDRESS) - start) >> PAGE_SHIFT;
-	return 0;
-}
-#endif
-
 /*
  * Set up the page tables.
  */
@@ -230,47 +214,24 @@ void __init
 paging_init (void)
 {
 	unsigned long max_dma;
-	unsigned long zones_size[MAX_NR_ZONES];
 #ifdef CONFIG_VIRTUAL_MEM_MAP
-	unsigned long zholes_size[MAX_NR_ZONES];
+	unsigned long nid = 0;
 	unsigned long max_gap;
 #endif
 
-	/* initialize mem_map[] */
-
-	memset(zones_size, 0, sizeof(zones_size));
-
 	num_physpages = 0;
 	efi_memmap_walk(count_pages, &num_physpages);
 
 	max_dma = virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT;
 
 #ifdef CONFIG_VIRTUAL_MEM_MAP
-	memset(zholes_size, 0, sizeof(zholes_size));
-
-	num_dma_physpages = 0;
-	efi_memmap_walk(count_dma_pages, &num_dma_physpages);
-
-	if (max_low_pfn < max_dma) {
-		zones_size[ZONE_DMA] = max_low_pfn;
-		zholes_size[ZONE_DMA] = max_low_pfn - num_dma_physpages;
-	} else {
-		zones_size[ZONE_DMA] = max_dma;
-		zholes_size[ZONE_DMA] = max_dma - num_dma_physpages;
-		if (num_physpages > num_dma_physpages) {
-			zones_size[ZONE_NORMAL] = max_low_pfn - max_dma;
-			zholes_size[ZONE_NORMAL] =
-				((max_low_pfn - max_dma) -
-				 (num_physpages - num_dma_physpages));
-		}
-	}
-
 	max_gap = 0;
+	efi_memmap_walk(register_active_ranges, &nid);
 	efi_memmap_walk(find_largest_hole, (u64 *)&max_gap);
 	if (max_gap < LARGE_GAP) {
 		vmem_map = (struct page *) 0;
-		free_area_init_node(0, NODE_DATA(0), zones_size, 0,
-				    zholes_size);
+		free_area_init_nodes(max_dma, max_dma,
+				max_low_pfn, max_low_pfn);
 	} else {
 		unsigned long map_size;
 
@@ -282,19 +243,14 @@ paging_init (void)
 		efi_memmap_walk(create_mem_map_page_table, NULL);
 
 		NODE_DATA(0)->node_mem_map = vmem_map;
-		free_area_init_node(0, NODE_DATA(0), zones_size,
-				    0, zholes_size);
+		free_area_init_nodes(max_dma, max_dma,
+				max_low_pfn, max_low_pfn);
 
 		printk("Virtual mem_map starts at 0x%p\n", mem_map);
 	}
 #else /* !CONFIG_VIRTUAL_MEM_MAP */
-	if (max_low_pfn < max_dma)
-		zones_size[ZONE_DMA] = max_low_pfn;
-	else {
-		zones_size[ZONE_DMA] = max_dma;
-		zones_size[ZONE_NORMAL] = max_low_pfn - max_dma;
-	}
-	free_area_init(zones_size);
+	add_active_range(0, 0, max_low_pfn);
+	free_area_init_nodes(max_dma, max_dma, max_low_pfn, max_low_pfn);
 #endif /* !CONFIG_VIRTUAL_MEM_MAP */
 	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
 }
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/arch/ia64/mm/discontig.c linux-2.6.17-rc5-mm3-105-ia64_use_init_nodes/arch/ia64/mm/discontig.c
--- linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/arch/ia64/mm/discontig.c	2006-06-05 14:12:48.000000000 +0100
+++ linux-2.6.17-rc5-mm3-105-ia64_use_init_nodes/arch/ia64/mm/discontig.c	2006-06-05 14:17:47.000000000 +0100
@@ -703,6 +703,7 @@ static __init int count_node_pages(unsig
 {
 	unsigned long end = start + len;
 
+	add_active_range(node, start >> PAGE_SHIFT, end >> PAGE_SHIFT);
 	mem_data[node].num_physpages += len >> PAGE_SHIFT;
 	if (start <= __pa(MAX_DMA_ADDRESS))
 		mem_data[node].num_dma_physpages +=
@@ -727,9 +728,8 @@ static __init int count_node_pages(unsig
 void __init paging_init(void)
 {
 	unsigned long max_dma;
-	unsigned long zones_size[MAX_NR_ZONES];
-	unsigned long zholes_size[MAX_NR_ZONES];
 	unsigned long pfn_offset = 0;
+	unsigned long max_pfn = 0;
 	int node;
 
 	max_dma = virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT;
@@ -746,47 +746,18 @@ void __init paging_init(void)
 #endif
 
 	for_each_online_node(node) {
-		memset(zones_size, 0, sizeof(zones_size));
-		memset(zholes_size, 0, sizeof(zholes_size));
-
 		num_physpages += mem_data[node].num_physpages;
-
-		if (mem_data[node].min_pfn >= max_dma) {
-			/* All of this node's memory is above ZONE_DMA */
-			zones_size[ZONE_NORMAL] = mem_data[node].max_pfn -
-				mem_data[node].min_pfn;
-			zholes_size[ZONE_NORMAL] = mem_data[node].max_pfn -
-				mem_data[node].min_pfn -
-				mem_data[node].num_physpages;
-		} else if (mem_data[node].max_pfn < max_dma) {
-			/* All of this node's memory is in ZONE_DMA */
-			zones_size[ZONE_DMA] = mem_data[node].max_pfn -
-				mem_data[node].min_pfn;
-			zholes_size[ZONE_DMA] = mem_data[node].max_pfn -
-				mem_data[node].min_pfn -
-				mem_data[node].num_dma_physpages;
-		} else {
-			/* This node has memory in both zones */
-			zones_size[ZONE_DMA] = max_dma -
-				mem_data[node].min_pfn;
-			zholes_size[ZONE_DMA] = zones_size[ZONE_DMA] -
-				mem_data[node].num_dma_physpages;
-			zones_size[ZONE_NORMAL] = mem_data[node].max_pfn -
-				max_dma;
-			zholes_size[ZONE_NORMAL] = zones_size[ZONE_NORMAL] -
-				(mem_data[node].num_physpages -
-				 mem_data[node].num_dma_physpages);
-		}
-
 		pfn_offset = mem_data[node].min_pfn;
 
 #ifdef CONFIG_VIRTUAL_MEM_MAP
 		NODE_DATA(node)->node_mem_map = vmem_map + pfn_offset;
 #endif
-		free_area_init_node(node, NODE_DATA(node), zones_size,
-				    pfn_offset, zholes_size);
+		if (mem_data[node].max_pfn > max_pfn)
+			max_pfn = mem_data[node].max_pfn;
 	}
 
+	free_area_init_nodes(max_dma, max_dma, max_pfn, max_pfn);
+
 	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
 }
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/arch/ia64/mm/init.c linux-2.6.17-rc5-mm3-105-ia64_use_init_nodes/arch/ia64/mm/init.c
--- linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/arch/ia64/mm/init.c	2006-06-05 14:12:48.000000000 +0100
+++ linux-2.6.17-rc5-mm3-105-ia64_use_init_nodes/arch/ia64/mm/init.c	2006-06-05 14:17:47.000000000 +0100
@@ -539,6 +539,18 @@ find_largest_hole (u64 start, u64 end, v
 	last_end = end;
 	return 0;
 }
+
+int __init
+register_active_ranges(u64 start, u64 end, void *nid)
+{
+	BUG_ON(nid == NULL);
+	BUG_ON(*(unsigned long *)nid >= MAX_NUMNODES);
+
+	add_active_range(*(unsigned long *)nid,
+				__pa(start) >> PAGE_SHIFT,
+				__pa(end) >> PAGE_SHIFT);
+	return 0;
+}
 #endif /* CONFIG_VIRTUAL_MEM_MAP */
 
 static int __init
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/include/asm-ia64/meminit.h linux-2.6.17-rc5-mm3-105-ia64_use_init_nodes/include/asm-ia64/meminit.h
--- linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/include/asm-ia64/meminit.h	2006-06-05 14:12:51.000000000 +0100
+++ linux-2.6.17-rc5-mm3-105-ia64_use_init_nodes/include/asm-ia64/meminit.h	2006-06-05 14:17:47.000000000 +0100
@@ -55,6 +55,7 @@ extern void efi_memmap_init(unsigned lon
   extern unsigned long vmalloc_end;
   extern struct page *vmem_map;
   extern int find_largest_hole (u64 start, u64 end, void *arg);
+  extern int register_active_ranges (u64 start, u64 end, void *arg);
   extern int create_mem_map_page_table (u64 start, u64 end, void *arg);
 #endif
 

^ permalink raw reply

* [PATCH 4/5] Have x86_64 use add_active_range() and free_area_init_nodes
From: Mel Gorman @ 2006-06-06 13:48 UTC (permalink / raw)
  To: akpm
  Cc: davej, tony.luck, linux-mm, Mel Gorman, ak, bob.picco,
	linux-kernel, linuxppc-dev
In-Reply-To: <20060606134710.21419.48239.sendpatchset@skynet.skynet.ie>


Size zones and holes in an architecture independent manner for x86_64.
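
The core of registering a firmware memory map entry for a node is a
round-and-clip step: round the byte range inward to whole pages, then clip
it to the node's PFN range. The sketch below is a user-space model under
assumed names (struct region, clip_to_node) and an assumed 4 KiB page size;
it is not the kernel function itself.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumed 4 KiB pages */
#define MODEL_PAGE_SIZE  ((uint64_t)1 << MODEL_PAGE_SHIFT)

/* Byte-addressed memory map entry; hypothetical type. */
struct region {
	uint64_t addr;
	uint64_t size;
};

/*
 * Convert a region to whole page frames and clip it to the node range
 * [node_start, node_end).  Returns false if no full page remains.
 */
static bool clip_to_node(struct region reg,
			 uint64_t node_start, uint64_t node_end,
			 uint64_t *start_pfn, uint64_t *end_pfn)
{
	/* Round the start up and the end down to page boundaries. */
	uint64_t s = (reg.addr + MODEL_PAGE_SIZE - 1) >> MODEL_PAGE_SHIFT;
	uint64_t e = (reg.addr + reg.size) >> MODEL_PAGE_SHIFT;

	if (s >= e)			/* smaller than one full page */
		return false;
	if (e <= node_start || s >= node_end)	/* outside the node */
		return false;

	if (s < node_start)
		s = node_start;
	if (e > node_end)
		e = node_end;

	*start_pfn = s;
	*end_pfn = e;
	return true;
}
```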


 arch/x86_64/Kconfig         |    3 
 arch/x86_64/kernel/e820.c   |  125 ++++++++++++++-------------------------
 arch/x86_64/kernel/setup.c  |    7 +-
 arch/x86_64/mm/init.c       |   62 -------------------
 arch/x86_64/mm/k8topology.c |    3 
 arch/x86_64/mm/numa.c       |   18 ++---
 arch/x86_64/mm/srat.c       |   11 ++-
 include/asm-x86_64/e820.h   |    5 -
 include/asm-x86_64/proto.h  |    2 
 9 files changed, 79 insertions(+), 157 deletions(-)

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/arch/x86_64/Kconfig linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/arch/x86_64/Kconfig
--- linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/arch/x86_64/Kconfig	2006-06-05 14:12:48.000000000 +0100
+++ linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/arch/x86_64/Kconfig	2006-06-05 14:16:49.000000000 +0100
@@ -73,6 +73,9 @@ config ARCH_MAY_HAVE_PC_FDC
 	bool
 	default y
 
+config ARCH_POPULATES_NODE_MAP
+	def_bool y
+
 config DMI
 	bool
 	default y
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/arch/x86_64/kernel/e820.c linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/arch/x86_64/kernel/e820.c
--- linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/arch/x86_64/kernel/e820.c	2006-06-05 14:12:48.000000000 +0100
+++ linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/arch/x86_64/kernel/e820.c	2006-06-05 14:16:49.000000000 +0100
@@ -17,6 +17,7 @@
 #include <linux/string.h>
 #include <linux/kexec.h>
 #include <linux/module.h>
+#include <linux/mm.h>
 
 #include <asm/page.h>
 #include <asm/e820.h>
@@ -160,58 +161,14 @@ unsigned long __init find_e820_area(unsi
 	return -1UL;		
 } 
 
-/* 
- * Free bootmem based on the e820 table for a node.
- */
-void __init e820_bootmem_free(pg_data_t *pgdat, unsigned long start,unsigned long end)
-{
-	int i;
-	for (i = 0; i < e820.nr_map; i++) {
-		struct e820entry *ei = &e820.map[i]; 
-		unsigned long last, addr;
-
-		if (ei->type != E820_RAM || 
-		    ei->addr+ei->size <= start || 
-		    ei->addr >= end)
-			continue;
-
-		addr = round_up(ei->addr, PAGE_SIZE);
-		if (addr < start) 
-			addr = start;
-
-		last = round_down(ei->addr + ei->size, PAGE_SIZE); 
-		if (last >= end)
-			last = end; 
-
-		if (last > addr && last-addr >= PAGE_SIZE)
-			free_bootmem_node(pgdat, addr, last-addr);
-	}
-}
-
 /*
  * Find the highest page frame number we have available
  */
 unsigned long __init e820_end_of_ram(void)
 {
-	int i;
 	unsigned long end_pfn = 0;
 	
-	for (i = 0; i < e820.nr_map; i++) {
-		struct e820entry *ei = &e820.map[i]; 
-		unsigned long start, end;
-
-		start = round_up(ei->addr, PAGE_SIZE); 
-		end = round_down(ei->addr + ei->size, PAGE_SIZE); 
-		if (start >= end)
-			continue;
-		if (ei->type == E820_RAM) { 
-		if (end > end_pfn<<PAGE_SHIFT)
-			end_pfn = end>>PAGE_SHIFT;
-		} else { 
-			if (end > end_pfn_map<<PAGE_SHIFT) 
-				end_pfn_map = end>>PAGE_SHIFT;
-		} 
-	}
+	end_pfn = find_max_pfn_with_active_regions();
 
 	if (end_pfn > end_pfn_map) 
 		end_pfn_map = end_pfn;
@@ -222,43 +179,10 @@ unsigned long __init e820_end_of_ram(voi
 	if (end_pfn > end_pfn_map) 
 		end_pfn = end_pfn_map; 
 
+	printk("end_pfn_map = %lu\n", end_pfn_map);
 	return end_pfn;	
 }
 
-/* 
- * Compute how much memory is missing in a range.
- * Unlike the other functions in this file the arguments are in page numbers.
- */
-unsigned long __init
-e820_hole_size(unsigned long start_pfn, unsigned long end_pfn)
-{
-	unsigned long ram = 0;
-	unsigned long start = start_pfn << PAGE_SHIFT;
-	unsigned long end = end_pfn << PAGE_SHIFT;
-	int i;
-	for (i = 0; i < e820.nr_map; i++) {
-		struct e820entry *ei = &e820.map[i];
-		unsigned long last, addr;
-
-		if (ei->type != E820_RAM ||
-		    ei->addr+ei->size <= start ||
-		    ei->addr >= end)
-			continue;
-
-		addr = round_up(ei->addr, PAGE_SIZE);
-		if (addr < start)
-			addr = start;
-
-		last = round_down(ei->addr + ei->size, PAGE_SIZE);
-		if (last >= end)
-			last = end;
-
-		if (last > addr)
-			ram += last - addr;
-	}
-	return ((end - start) - ram) >> PAGE_SHIFT;
-}
-
 /*
  * Mark e820 reserved areas as busy for the resource manager.
  */
@@ -293,6 +217,49 @@ void __init e820_reserve_resources(void)
 	}
 }
 
+/* Walk the e820 map and register active regions within a node */
+void __init
+e820_register_active_regions(int nid, unsigned long start_pfn,
+							unsigned long end_pfn)
+{
+	int i;
+	unsigned long ei_startpfn, ei_endpfn;
+	for (i = 0; i < e820.nr_map; i++) {
+		struct e820entry *ei = &e820.map[i];
+		ei_startpfn = round_up(ei->addr, PAGE_SIZE) >> PAGE_SHIFT;
+		ei_endpfn = round_down(ei->addr + ei->size, PAGE_SIZE)
+								>> PAGE_SHIFT;
+
+		/* Skip map entries smaller than a page */
+		if (ei_startpfn >= ei_endpfn)
+			continue;
+
+		/* Check if end_pfn_map should be updated */
+		if (ei->type != E820_RAM && ei_endpfn > end_pfn_map)
+			end_pfn_map = ei_endpfn;
+
+		/* Skip if map is outside the node */
+		if (ei->type != E820_RAM ||
+				ei_endpfn <= start_pfn ||
+				ei_startpfn >= end_pfn)
+			continue;
+
+		/* Check for overlaps */
+		if (ei_startpfn < start_pfn)
+			ei_startpfn = start_pfn;
+		if (ei_endpfn > end_pfn)
+			ei_endpfn = end_pfn;
+
+		/* Obey end_user_pfn to save on memmap */
+		if (ei_startpfn >= end_user_pfn)
+			continue;
+		if (ei_endpfn > end_user_pfn)
+			ei_endpfn = end_user_pfn;
+
+		add_active_range(nid, ei_startpfn, ei_endpfn);
+	}
+}
+
 /* 
  * Add a memory region to the kernel e820 map.
  */ 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/arch/x86_64/kernel/setup.c linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/arch/x86_64/kernel/setup.c
--- linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/arch/x86_64/kernel/setup.c	2006-06-05 14:12:48.000000000 +0100
+++ linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/arch/x86_64/kernel/setup.c	2006-06-05 14:16:49.000000000 +0100
@@ -466,7 +466,8 @@ contig_initmem_init(unsigned long start_
 	if (bootmap == -1L)
 		panic("Cannot find bootmem map of size %ld\n",bootmap_size);
 	bootmap_size = init_bootmem(bootmap >> PAGE_SHIFT, end_pfn);
-	e820_bootmem_free(NODE_DATA(0), 0, end_pfn << PAGE_SHIFT);
+	e820_register_active_regions(0, start_pfn, end_pfn);
+	free_bootmem_with_active_regions(0, end_pfn);
 	reserve_bootmem(bootmap, bootmap_size);
 } 
 #endif
@@ -640,6 +641,7 @@ void __init setup_arch(char **cmdline_p)
 
 	early_identify_cpu(&boot_cpu_data);
 
+	e820_register_active_regions(0, 0, -1UL);
 	/*
 	 * partially used pages are not usable - thus
 	 * we are rounding upwards:
@@ -665,6 +667,9 @@ void __init setup_arch(char **cmdline_p)
 	acpi_boot_table_init();
 #endif
 
+	/* Remove active ranges so rediscovery with NUMA-awareness happens */
+	remove_all_active_ranges();
+
 #ifdef CONFIG_ACPI_NUMA
 	/*
 	 * Parse SRAT to discover nodes.
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/arch/x86_64/mm/init.c linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/arch/x86_64/mm/init.c
--- linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/arch/x86_64/mm/init.c	2006-06-05 14:12:48.000000000 +0100
+++ linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/arch/x86_64/mm/init.c	2006-06-05 14:16:49.000000000 +0100
@@ -404,69 +404,12 @@ void __cpuinit zap_low_mappings(int cpu)
 	__flush_tlb_all();
 }
 
-/* Compute zone sizes for the DMA and DMA32 zones in a node. */
-__init void
-size_zones(unsigned long *z, unsigned long *h,
-	   unsigned long start_pfn, unsigned long end_pfn)
-{
- 	int i;
- 	unsigned long w;
-
- 	for (i = 0; i < MAX_NR_ZONES; i++)
- 		z[i] = 0;
-
- 	if (start_pfn < MAX_DMA_PFN)
- 		z[ZONE_DMA] = MAX_DMA_PFN - start_pfn;
- 	if (start_pfn < MAX_DMA32_PFN) {
- 		unsigned long dma32_pfn = MAX_DMA32_PFN;
- 		if (dma32_pfn > end_pfn)
- 			dma32_pfn = end_pfn;
- 		z[ZONE_DMA32] = dma32_pfn - start_pfn;
- 	}
- 	z[ZONE_NORMAL] = end_pfn - start_pfn;
-
- 	/* Remove lower zones from higher ones. */
- 	w = 0;
- 	for (i = 0; i < MAX_NR_ZONES; i++) {
- 		if (z[i])
- 			z[i] -= w;
- 	        w += z[i];
-	}
-
-	/* Compute holes */
-	w = start_pfn;
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		unsigned long s = w;
-		w += z[i];
-		h[i] = e820_hole_size(s, w);
-	}
-
-	/* Add the space pace needed for mem_map to the holes too. */
-	for (i = 0; i < MAX_NR_ZONES; i++)
-		h[i] += (z[i] * sizeof(struct page)) / PAGE_SIZE;
-
-	/* The 16MB DMA zone has the kernel and other misc mappings.
- 	   Account them too */
-	if (h[ZONE_DMA]) {
-		h[ZONE_DMA] += dma_reserve;
-		if (h[ZONE_DMA] >= z[ZONE_DMA]) {
-			printk(KERN_WARNING
-				"Kernel too large and filling up ZONE_DMA?\n");
-			h[ZONE_DMA] = z[ZONE_DMA];
-		}
-	}
-}
-
 #ifndef CONFIG_NUMA
 void __init paging_init(void)
 {
-	unsigned long zones[MAX_NR_ZONES], holes[MAX_NR_ZONES];
-
 	memory_present(0, 0, end_pfn);
 	sparse_init();
-	size_zones(zones, holes, 0, end_pfn);
-	free_area_init_node(0, NODE_DATA(0), zones,
-			    __pa(PAGE_OFFSET) >> PAGE_SHIFT, holes);
+	free_area_init_nodes(MAX_DMA_PFN, MAX_DMA32_PFN, end_pfn, end_pfn);
 }
 #endif
 
@@ -615,7 +558,8 @@ void __init mem_init(void)
 #else
 	totalram_pages = free_all_bootmem();
 #endif
-	reservedpages = end_pfn - totalram_pages - e820_hole_size(0, end_pfn);
+	reservedpages = end_pfn - totalram_pages -
+					absent_pages_in_range(0, end_pfn);
 
 	after_bootmem = 1;
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/arch/x86_64/mm/k8topology.c linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/arch/x86_64/mm/k8topology.c
--- linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/arch/x86_64/mm/k8topology.c	2006-05-25 02:50:17.000000000 +0100
+++ linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/arch/x86_64/mm/k8topology.c	2006-06-05 14:16:49.000000000 +0100
@@ -146,6 +146,9 @@ int __init k8_scan_nodes(unsigned long s
 		
 		nodes[nodeid].start = base; 
 		nodes[nodeid].end = limit;
+		e820_register_active_regions(nodeid,
+				nodes[nodeid].start >> PAGE_SHIFT,
+				nodes[nodeid].end >> PAGE_SHIFT);
 
 		prevbase = base;
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/arch/x86_64/mm/numa.c linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/arch/x86_64/mm/numa.c
--- linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/arch/x86_64/mm/numa.c	2006-05-25 02:50:17.000000000 +0100
+++ linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/arch/x86_64/mm/numa.c	2006-06-05 14:16:49.000000000 +0100
@@ -161,7 +161,7 @@ void __init setup_node_bootmem(int nodei
 					 bootmap_start >> PAGE_SHIFT, 
 					 start_pfn, end_pfn); 
 
-	e820_bootmem_free(NODE_DATA(nodeid), start, end);
+	free_bootmem_with_active_regions(nodeid, end);
 
 	reserve_bootmem_node(NODE_DATA(nodeid), nodedata_phys, pgdat_size); 
 	reserve_bootmem_node(NODE_DATA(nodeid), bootmap_start, bootmap_pages<<PAGE_SHIFT);
@@ -175,13 +175,11 @@ void __init setup_node_bootmem(int nodei
 void __init setup_node_zones(int nodeid)
 { 
 	unsigned long start_pfn, end_pfn, memmapsize, limit;
-	unsigned long zones[MAX_NR_ZONES];
-	unsigned long holes[MAX_NR_ZONES];
 
  	start_pfn = node_start_pfn(nodeid);
  	end_pfn = node_end_pfn(nodeid);
 
-	Dprintk(KERN_INFO "Setting up node %d %lx-%lx\n",
+	Dprintk(KERN_INFO "Setting up memmap for node %d %lx-%lx\n",
 		nodeid, start_pfn, end_pfn);
 
 	/* Try to allocate mem_map at end to not fill up precious <4GB
@@ -195,10 +193,6 @@ void __init setup_node_zones(int nodeid)
 				round_down(limit - memmapsize, PAGE_SIZE), 
 				limit);
 #endif
-
-	size_zones(zones, holes, start_pfn, end_pfn);
-	free_area_init_node(nodeid, NODE_DATA(nodeid), zones,
-			    start_pfn, holes);
 } 
 
 void __init numa_init_array(void)
@@ -259,8 +253,11 @@ static int numa_emulation(unsigned long 
  		printk(KERN_ERR "No NUMA hash function found. Emulation disabled.\n");
  		return -1;
  	}
- 	for_each_online_node(i)
+ 	for_each_online_node(i) {
+		e820_register_active_regions(i, nodes[i].start >> PAGE_SHIFT,
+						nodes[i].end >> PAGE_SHIFT);
  		setup_node_bootmem(i, nodes[i].start, nodes[i].end);
+	}
  	numa_init_array();
  	return 0;
 }
@@ -299,6 +296,7 @@ void __init numa_initmem_init(unsigned l
 	for (i = 0; i < NR_CPUS; i++)
 		numa_set_node(i, 0);
 	node_to_cpumask[0] = cpumask_of_cpu(0);
+	e820_register_active_regions(0, start_pfn, end_pfn);
 	setup_node_bootmem(0, start_pfn << PAGE_SHIFT, end_pfn << PAGE_SHIFT);
 }
 
@@ -346,6 +344,8 @@ void __init paging_init(void)
 	for_each_online_node(i) {
 		setup_node_zones(i); 
 	}
+
+	free_area_init_nodes(MAX_DMA_PFN, MAX_DMA32_PFN, end_pfn, end_pfn);
 } 
 
 /* [numa=off] */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/arch/x86_64/mm/srat.c linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c
--- linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/arch/x86_64/mm/srat.c	2006-06-05 14:12:48.000000000 +0100
+++ linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c	2006-06-05 14:16:49.000000000 +0100
@@ -91,6 +91,7 @@ static __init void bad_srat(void)
 		apicid_to_node[i] = NUMA_NO_NODE;
 	for (i = 0; i < MAX_NUMNODES; i++)
 		nodes_add[i].start = nodes[i].end = 0;
+	remove_all_active_ranges();
 }
 
 static __init inline int srat_disabled(void)
@@ -173,7 +174,7 @@ static int hotadd_enough_memory(struct b
 
 	if (mem < 0)
 		return 0;
-	allowed = (end_pfn - e820_hole_size(0, end_pfn)) * PAGE_SIZE;
+	allowed = (end_pfn - absent_pages_in_range(0, end_pfn)) * PAGE_SIZE;
 	allowed = (allowed / 100) * hotadd_percent;
 	if (allocated + mem > allowed) {
 		unsigned long range;
@@ -223,7 +224,7 @@ static int reserve_hotadd(int node, unsi
 	}
 
 	/* This check might be a bit too strict, but I'm keeping it for now. */
-	if (e820_hole_size(s_pfn, e_pfn) != e_pfn - s_pfn) {
+	if (absent_pages_in_range(s_pfn, e_pfn) != e_pfn - s_pfn) {
 		printk(KERN_ERR "SRAT: Hotplug area has existing memory\n");
 		return -1;
 	}
@@ -317,6 +318,8 @@ acpi_numa_memory_affinity_init(struct ac
 
 	printk(KERN_INFO "SRAT: Node %u PXM %u %Lx-%Lx\n", node, pxm,
 	       nd->start, nd->end);
+	e820_register_active_regions(node, nd->start >> PAGE_SHIFT,
+						nd->end >> PAGE_SHIFT);
 
 #ifdef RESERVE_HOTADD
  	if (ma->flags.hot_pluggable && reserve_hotadd(node, start, end) < 0) {
@@ -341,13 +344,13 @@ static int nodes_cover_memory(void)
 		unsigned long s = nodes[i].start >> PAGE_SHIFT;
 		unsigned long e = nodes[i].end >> PAGE_SHIFT;
 		pxmram += e - s;
-		pxmram -= e820_hole_size(s, e);
+		pxmram -= absent_pages_in_range(s, e);
 		pxmram -= nodes_add[i].end - nodes_add[i].start;
 		if ((long)pxmram < 0)
 			pxmram = 0;
 	}
 
-	e820ram = end_pfn - e820_hole_size(0, end_pfn);
+	e820ram = end_pfn - absent_pages_in_range(0, end_pfn);
 	/* We seem to lose 3 pages somewhere. Allow a bit of slack. */
 	if ((long)(e820ram - pxmram) >= 1*1024*1024) {
 		printk(KERN_ERR
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/include/asm-x86_64/e820.h linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/include/asm-x86_64/e820.h
--- linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/include/asm-x86_64/e820.h	2006-05-25 02:50:17.000000000 +0100
+++ linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/include/asm-x86_64/e820.h	2006-06-05 14:16:49.000000000 +0100
@@ -50,10 +50,9 @@ extern void e820_print_map(char *who);
 extern int e820_any_mapped(unsigned long start, unsigned long end, unsigned type);
 extern int e820_all_mapped(unsigned long start, unsigned long end, unsigned type);
 
-extern void e820_bootmem_free(pg_data_t *pgdat, unsigned long start,unsigned long end);
 extern void e820_setup_gap(void);
-extern unsigned long e820_hole_size(unsigned long start_pfn,
-				    unsigned long end_pfn);
+extern void e820_register_active_regions(int nid,
+				unsigned long start_pfn, unsigned long end_pfn);
 
 extern void __init parse_memopt(char *p, char **end);
 extern void __init parse_memmapopt(char *p, char **end);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/include/asm-x86_64/proto.h linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/include/asm-x86_64/proto.h
--- linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/include/asm-x86_64/proto.h	2006-06-05 14:12:51.000000000 +0100
+++ linux-2.6.17-rc5-mm3-104-x86_64_use_init_nodes/include/asm-x86_64/proto.h	2006-06-05 14:16:49.000000000 +0100
@@ -24,8 +24,6 @@ extern void mtrr_bp_init(void);
 #define mtrr_bp_init() do {} while (0)
 #endif
 extern void init_memory_mapping(unsigned long start, unsigned long end);
-extern void size_zones(unsigned long *z, unsigned long *h,
-			unsigned long start_pfn, unsigned long end_pfn);
 
 extern void system_call(void); 
 extern int kernel_syscall(void);

* [PATCH 3/5] Have x86 use add_active_range() and free_area_init_nodes
From: Mel Gorman @ 2006-06-06 13:48 UTC
  To: akpm
  Cc: davej, tony.luck, linuxppc-dev, Mel Gorman, linux-kernel,
	bob.picco, ak, linux-mm
In-Reply-To: <20060606134710.21419.48239.sendpatchset@skynet.skynet.ie>


Size zones and holes in an architecture independent manner for x86.
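
As a rough illustration of what "sizing zones" from end PFNs means, the
sketch below models how four zone-end PFNs could partition the PFN space.
It is a userspace model under stated assumptions, not the body of
free_area_init_nodes(): the model_* names are hypothetical, holes are not
subtracted, and the zone order (DMA, DMA32, NORMAL, HIGHMEM) is taken from
the argument order of the prototype.

```c
#include <assert.h>

/* Hypothetical zone indices, ordered like the free_area_init_nodes() arguments. */
enum model_zone {
	MODEL_DMA,
	MODEL_DMA32,
	MODEL_NORMAL,
	MODEL_HIGHMEM,
	MODEL_NR_ZONES
};

/* Half-open PFN span [start_pfn, end_pfn) covered by one zone. */
struct model_zone_span {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/*
 * Derive each zone's span from the PFN at which it ends. A zone whose end
 * is not above the previous zone's end comes out empty.
 */
void model_zone_spans(unsigned long min_pfn,
		      const unsigned long zone_end[MODEL_NR_ZONES],
		      struct model_zone_span out[MODEL_NR_ZONES])
{
	unsigned long prev = min_pfn;
	int z;

	for (z = 0; z < MODEL_NR_ZONES; z++) {
		unsigned long end = zone_end[z] > prev ? zone_end[z] : prev;

		out[z].start_pfn = prev;
		out[z].end_pfn = end;
		prev = end;
	}
}

/* Number of PFNs spanned by a zone, holes included. */
unsigned long model_span_pages(const struct model_zone_span *span)
{
	assert(span->start_pfn <= span->end_pfn);
	return span->end_pfn - span->start_pfn;
}
```

Passing the same PFN for the DMA and DMA32 ends, as the i386 hunks in this
patch do with max_dma, leaves the modelled DMA32 span empty, which matches
an architecture that has no such zone.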


 Kconfig        |    8 +---
 kernel/setup.c |   19 +++------
 kernel/srat.c  |  100 +---------------------------------------------------
 mm/discontig.c |   65 +++++++--------------------------
 4 files changed, 25 insertions(+), 167 deletions(-)

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-102-powerpc_use_init_nodes/arch/i386/Kconfig linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/arch/i386/Kconfig
--- linux-2.6.17-rc5-mm3-102-powerpc_use_init_nodes/arch/i386/Kconfig	2006-06-05 14:12:48.000000000 +0100
+++ linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/arch/i386/Kconfig	2006-06-05 14:15:57.000000000 +0100
@@ -592,12 +592,10 @@ config ARCH_SELECT_MEMORY_MODEL
 config ARCH_ALIGNED_ZONE_BOUNDARIES
 	def_bool y
 
-source "mm/Kconfig"
+config ARCH_POPULATES_NODE_MAP
+	def_bool y
 
-config HAVE_ARCH_EARLY_PFN_TO_NID
-	bool
-	default y
-	depends on NUMA
+source "mm/Kconfig"
 
 config HIGHPTE
 	bool "Allocate 3rd-level pagetables from highmem"
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-102-powerpc_use_init_nodes/arch/i386/kernel/setup.c linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/arch/i386/kernel/setup.c
--- linux-2.6.17-rc5-mm3-102-powerpc_use_init_nodes/arch/i386/kernel/setup.c	2006-06-05 14:12:48.000000000 +0100
+++ linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/arch/i386/kernel/setup.c	2006-06-05 14:15:57.000000000 +0100
@@ -1206,22 +1206,15 @@ static unsigned long __init setup_memory
 
 void __init zone_sizes_init(void)
 {
-	unsigned long zones_size[MAX_NR_ZONES] = {0, 0, 0};
-	unsigned int max_dma, low;
+	unsigned int max_dma;
+#ifndef CONFIG_HIGHMEM
+	unsigned long highend_pfn = max_low_pfn;
+#endif
 
 	max_dma = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
-	low = max_low_pfn;
 
-	if (low < max_dma)
-		zones_size[ZONE_DMA] = low;
-	else {
-		zones_size[ZONE_DMA] = max_dma;
-		zones_size[ZONE_NORMAL] = low - max_dma;
-#ifdef CONFIG_HIGHMEM
-		zones_size[ZONE_HIGHMEM] = highend_pfn - low;
-#endif
-	}
-	free_area_init(zones_size);
+	add_active_range(0, 0, highend_pfn);
+	free_area_init_nodes(max_dma, max_dma, max_low_pfn, highend_pfn);
 }
 #else
 extern unsigned long __init setup_memory(void);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-102-powerpc_use_init_nodes/arch/i386/kernel/srat.c linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/arch/i386/kernel/srat.c
--- linux-2.6.17-rc5-mm3-102-powerpc_use_init_nodes/arch/i386/kernel/srat.c	2006-06-05 14:12:48.000000000 +0100
+++ linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/arch/i386/kernel/srat.c	2006-06-05 14:15:57.000000000 +0100
@@ -55,8 +55,6 @@ struct node_memory_chunk_s {
 static struct node_memory_chunk_s node_memory_chunk[MAXCHUNKS];
 
 static int num_memory_chunks;		/* total number of memory chunks */
-static int zholes_size_init;
-static unsigned long zholes_size[MAX_NUMNODES * MAX_NR_ZONES];
 
 extern void * boot_ioremap(unsigned long, unsigned long);
 
@@ -136,50 +134,6 @@ static void __init parse_memory_affinity
 		 "enabled and removable" : "enabled" ) );
 }
 
-#if MAX_NR_ZONES != 4
-#error "MAX_NR_ZONES != 4, chunk_to_zone requires review"
-#endif
-/* Take a chunk of pages from page frame cstart to cend and count the number
- * of pages in each zone, returned via zones[].
- */
-static __init void chunk_to_zones(unsigned long cstart, unsigned long cend, 
-		unsigned long *zones)
-{
-	unsigned long max_dma;
-	extern unsigned long max_low_pfn;
-
-	int z;
-	unsigned long rend;
-
-	/* FIXME: MAX_DMA_ADDRESS and max_low_pfn are trying to provide
-	 * similarly scoped information and should be handled in a consistant
-	 * manner.
-	 */
-	max_dma = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
-
-	/* Split the hole into the zones in which it falls.  Repeatedly
-	 * take the segment in which the remaining hole starts, round it
-	 * to the end of that zone.
-	 */
-	memset(zones, 0, MAX_NR_ZONES * sizeof(long));
-	while (cstart < cend) {
-		if (cstart < max_dma) {
-			z = ZONE_DMA;
-			rend = (cend < max_dma)? cend : max_dma;
-
-		} else if (cstart < max_low_pfn) {
-			z = ZONE_NORMAL;
-			rend = (cend < max_low_pfn)? cend : max_low_pfn;
-
-		} else {
-			z = ZONE_HIGHMEM;
-			rend = cend;
-		}
-		zones[z] += rend - cstart;
-		cstart = rend;
-	}
-}
-
 /*
  * The SRAT table always lists ascending addresses, so can always
  * assume that the first "start" address that you see is the real
@@ -224,7 +178,6 @@ static int __init acpi20_parse_srat(stru
 
 	memset(pxm_bitmap, 0, sizeof(pxm_bitmap));	/* init proximity domain bitmap */
 	memset(node_memory_chunk, 0, sizeof(node_memory_chunk));
-	memset(zholes_size, 0, sizeof(zholes_size));
 
 	num_memory_chunks = 0;
 	while (p < end) {
@@ -288,6 +241,7 @@ static int __init acpi20_parse_srat(stru
 		printk("chunk %d nid %d start_pfn %08lx end_pfn %08lx\n",
 		       j, chunk->nid, chunk->start_pfn, chunk->end_pfn);
 		node_read_chunk(chunk->nid, chunk);
+		add_active_range(chunk->nid, chunk->start_pfn, chunk->end_pfn);
 	}
  
 	for_each_online_node(nid) {
@@ -404,57 +358,7 @@ int __init get_memcfg_from_srat(void)
 		return acpi20_parse_srat((struct acpi_table_srat *)header);
 	}
 out_err:
+	remove_all_active_ranges();
 	printk("failed to get NUMA memory information from SRAT table\n");
 	return 0;
 }
-
-/* For each node run the memory list to determine whether there are
- * any memory holes.  For each hole determine which ZONE they fall
- * into.
- *
- * NOTE#1: this requires knowledge of the zone boundries and so
- * _cannot_ be performed before those are calculated in setup_memory.
- * 
- * NOTE#2: we rely on the fact that the memory chunks are ordered by
- * start pfn number during setup.
- */
-static void __init get_zholes_init(void)
-{
-	int nid;
-	int c;
-	int first;
-	unsigned long end = 0;
-
-	for_each_online_node(nid) {
-		first = 1;
-		for (c = 0; c < num_memory_chunks; c++){
-			if (node_memory_chunk[c].nid == nid) {
-				if (first) {
-					end = node_memory_chunk[c].end_pfn;
-					first = 0;
-
-				} else {
-					/* Record any gap between this chunk
-					 * and the previous chunk on this node
-					 * against the zones it spans.
-					 */
-					chunk_to_zones(end,
-						node_memory_chunk[c].start_pfn,
-						&zholes_size[nid * MAX_NR_ZONES]);
-				}
-			}
-		}
-	}
-}
-
-unsigned long * __init get_zholes_size(int nid)
-{
-	if (!zholes_size_init) {
-		zholes_size_init++;
-		get_zholes_init();
-	}
-	if (nid >= MAX_NUMNODES || !node_online(nid))
-		printk("%s: nid = %d is invalid/offline. num_online_nodes = %d",
-		       __FUNCTION__, nid, num_online_nodes());
-	return &zholes_size[nid * MAX_NR_ZONES];
-}
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-102-powerpc_use_init_nodes/arch/i386/mm/discontig.c linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/arch/i386/mm/discontig.c
--- linux-2.6.17-rc5-mm3-102-powerpc_use_init_nodes/arch/i386/mm/discontig.c	2006-06-05 14:12:48.000000000 +0100
+++ linux-2.6.17-rc5-mm3-103-x86_use_init_nodes/arch/i386/mm/discontig.c	2006-06-05 14:15:57.000000000 +0100
@@ -157,21 +157,6 @@ static void __init find_max_pfn_node(int
 		BUG();
 }
 
-/* Find the owning node for a pfn. */
-int early_pfn_to_nid(unsigned long pfn)
-{
-	int nid;
-
-	for_each_node(nid) {
-		if (node_end_pfn[nid] == 0)
-			break;
-		if (node_start_pfn[nid] <= pfn && node_end_pfn[nid] >= pfn)
-			return nid;
-	}
-
-	return 0;
-}
-
 /* 
  * Allocate memory for the pg_data_t for this node via a crude pre-bootmem
  * method.  For node zero take this from the bottom of memory, for
@@ -227,6 +212,8 @@ static unsigned long calculate_numa_rema
 	unsigned long pfn;
 
 	for_each_online_node(nid) {
+		unsigned long old_end_pfn = node_end_pfn[nid];
+
 		/*
 		 * The acpi/srat node info can show hot-add memroy zones
 		 * where memory could be added but not currently present.
@@ -276,6 +263,7 @@ static unsigned long calculate_numa_rema
 
 		node_end_pfn[nid] -= size;
 		node_remap_start_pfn[nid] = node_end_pfn[nid];
+		shrink_active_range(nid, old_end_pfn, node_end_pfn[nid]);
 	}
 	printk("Reserving total of %ld pages for numa KVA remap\n",
 			reserve_pages);
@@ -355,45 +343,20 @@ unsigned long __init setup_memory(void)
 void __init zone_sizes_init(void)
 {
 	int nid;
+	unsigned long max_dma_pfn;
 
-
-	for_each_online_node(nid) {
-		unsigned long zones_size[MAX_NR_ZONES] = {0, 0, 0};
-		unsigned long *zholes_size;
-		unsigned int max_dma;
-
-		unsigned long low = max_low_pfn;
-		unsigned long start = node_start_pfn[nid];
-		unsigned long high = node_end_pfn[nid];
-
-		max_dma = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
-
-		if (node_has_online_mem(nid)){
-			if (start > low) {
-#ifdef CONFIG_HIGHMEM
-				BUG_ON(start > high);
-				zones_size[ZONE_HIGHMEM] = high - start;
-#endif
-			} else {
-				if (low < max_dma)
-					zones_size[ZONE_DMA] = low;
-				else {
-					BUG_ON(max_dma > low);
-					BUG_ON(low > high);
-					zones_size[ZONE_DMA] = max_dma;
-					zones_size[ZONE_NORMAL] = low - max_dma;
-#ifdef CONFIG_HIGHMEM
-					zones_size[ZONE_HIGHMEM] = high - low;
-#endif
-				}
-			}
+	/* If SRAT has not registered memory, register it now */
+	if (find_max_pfn_with_active_regions() == 0) {
+		for_each_online_node(nid) {
+			if (node_has_online_mem(nid))
+				add_active_range(nid, node_start_pfn[nid],
+							node_end_pfn[nid]);
 		}
-
-		zholes_size = get_zholes_size(nid);
-
-		free_area_init_node(nid, NODE_DATA(nid), zones_size, start,
-				zholes_size);
 	}
+
+	max_dma_pfn = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
+	free_area_init_nodes(max_dma_pfn, max_dma_pfn,
+						max_low_pfn, highend_pfn);
 	return;
 }
 

* [PATCH 2/5] Have Power use add_active_range() and free_area_init_nodes()
From: Mel Gorman @ 2006-06-06 13:47 UTC
  To: akpm
  Cc: davej, tony.luck, linux-mm, Mel Gorman, ak, bob.picco,
	linux-kernel, linuxppc-dev
In-Reply-To: <20060606134710.21419.48239.sendpatchset@skynet.skynet.ie>


Size zones and holes in an architecture independent manner for Power.
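
One detail of the conversion is worth spelling out: the removed
add_region() took a start PFN and a page count, while add_active_range()
takes a start PFN and an exclusive end PFN. The sketch below shows that
arithmetic in isolation. The model_* names and the 4 KiB page size are
assumptions made for illustration only; the kernel uses PAGE_SHIFT for the
configured page size.

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Half-open PFN range [start_pfn, end_pfn), the form add_active_range() expects. */
struct model_pfn_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/*
 * Convert a physical region given as (base, size) in bytes into a PFN
 * range. The end PFN is start + page count, not a page count by itself.
 */
struct model_pfn_range model_bytes_to_pfn_range(unsigned long base,
						unsigned long size)
{
	struct model_pfn_range range;

	assert(base + size >= base);	/* reject address wrap-around */
	range.start_pfn = base >> MODEL_PAGE_SHIFT;
	range.end_pfn = range.start_pfn + (size >> MODEL_PAGE_SHIFT);
	return range;
}
```

This mirrors the numa.c hunk in this patch, where the second argument
changes from "size >> PAGE_SHIFT" to
"(start >> PAGE_SHIFT) + (size >> PAGE_SHIFT)".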


 powerpc/Kconfig   |    7 --
 powerpc/mm/mem.c  |   53 ++++++----------
 powerpc/mm/numa.c |  157 ++++---------------------------------------------
 ppc/Kconfig       |    3 
 ppc/mm/init.c     |   26 ++++----
 5 files changed, 56 insertions(+), 190 deletions(-)

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-101-add_free_area_init_nodes/arch/powerpc/Kconfig linux-2.6.17-rc5-mm3-102-powerpc_use_init_nodes/arch/powerpc/Kconfig
--- linux-2.6.17-rc5-mm3-101-add_free_area_init_nodes/arch/powerpc/Kconfig	2006-06-05 14:12:48.000000000 +0100
+++ linux-2.6.17-rc5-mm3-102-powerpc_use_init_nodes/arch/powerpc/Kconfig	2006-06-05 14:15:06.000000000 +0100
@@ -692,11 +692,10 @@ config ARCH_SPARSEMEM_DEFAULT
 	def_bool y
 	depends on SMP && PPC_PSERIES
 
-source "mm/Kconfig"
-
-config HAVE_ARCH_EARLY_PFN_TO_NID
+config ARCH_POPULATES_NODE_MAP
 	def_bool y
-	depends on NEED_MULTIPLE_NODES
+
+source "mm/Kconfig"
 
 config ARCH_MEMORY_PROBE
 	def_bool y
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-101-add_free_area_init_nodes/arch/powerpc/mm/mem.c linux-2.6.17-rc5-mm3-102-powerpc_use_init_nodes/arch/powerpc/mm/mem.c
--- linux-2.6.17-rc5-mm3-101-add_free_area_init_nodes/arch/powerpc/mm/mem.c	2006-06-05 14:12:48.000000000 +0100
+++ linux-2.6.17-rc5-mm3-102-powerpc_use_init_nodes/arch/powerpc/mm/mem.c	2006-06-05 14:15:06.000000000 +0100
@@ -257,20 +257,22 @@ void __init do_init_bootmem(void)
 
 	boot_mapsize = init_bootmem(start >> PAGE_SHIFT, total_pages);
 
+	/* Add active regions with valid PFNs */
+	for (i = 0; i < lmb.memory.cnt; i++) {
+		unsigned long start_pfn, end_pfn;
+		start_pfn = lmb.memory.region[i].base >> PAGE_SHIFT;
+		end_pfn = start_pfn + lmb_size_pages(&lmb.memory, i);
+		add_active_range(0, start_pfn, end_pfn);
+	}
+
 	/* Add all physical memory to the bootmem map, mark each area
 	 * present.
 	 */
-	for (i = 0; i < lmb.memory.cnt; i++) {
-		unsigned long base = lmb.memory.region[i].base;
-		unsigned long size = lmb_size_bytes(&lmb.memory, i);
 #ifdef CONFIG_HIGHMEM
-		if (base >= total_lowmem)
-			continue;
-		if (base + size > total_lowmem)
-			size = total_lowmem - base;
+	free_bootmem_with_active_regions(0, total_lowmem >> PAGE_SHIFT);
+#else
+	free_bootmem_with_active_regions(0, max_pfn);
 #endif
-		free_bootmem(base, size);
-	}
 
 	/* reserve the sections we're already using */
 	for (i = 0; i < lmb.reserved.cnt; i++)
@@ -278,9 +280,8 @@ void __init do_init_bootmem(void)
 				lmb_size_bytes(&lmb.reserved, i));
 
 	/* XXX need to clip this if using highmem? */
-	for (i = 0; i < lmb.memory.cnt; i++)
-		memory_present(0, lmb_start_pfn(&lmb.memory, i),
-			       lmb_end_pfn(&lmb.memory, i));
+	sparse_memory_present_with_active_regions(0);
+
 	init_bootmem_done = 1;
 }
 
@@ -289,8 +290,6 @@ void __init do_init_bootmem(void)
  */
 void __init paging_init(void)
 {
-	unsigned long zones_size[MAX_NR_ZONES];
-	unsigned long zholes_size[MAX_NR_ZONES];
 	unsigned long total_ram = lmb_phys_mem_size();
 	unsigned long top_of_ram = lmb_end_of_DRAM();
 
@@ -308,26 +307,18 @@ void __init paging_init(void)
 	       top_of_ram, total_ram);
 	printk(KERN_DEBUG "Memory hole size: %ldMB\n",
 	       (top_of_ram - total_ram) >> 20);
-	/*
-	 * All pages are DMA-able so we put them all in the DMA zone.
-	 */
-	memset(zones_size, 0, sizeof(zones_size));
-	memset(zholes_size, 0, sizeof(zholes_size));
-
-	zones_size[ZONE_DMA] = top_of_ram >> PAGE_SHIFT;
-	zholes_size[ZONE_DMA] = (top_of_ram - total_ram) >> PAGE_SHIFT;
-
 #ifdef CONFIG_HIGHMEM
-	zones_size[ZONE_DMA] = total_lowmem >> PAGE_SHIFT;
-	zones_size[ZONE_HIGHMEM] = (total_memory - total_lowmem) >> PAGE_SHIFT;
-	zholes_size[ZONE_HIGHMEM] = (top_of_ram - total_ram) >> PAGE_SHIFT;
+	free_area_init_nodes(total_lowmem >> PAGE_SHIFT,
+				total_lowmem >> PAGE_SHIFT,
+				total_lowmem >> PAGE_SHIFT,
+				top_of_ram >> PAGE_SHIFT);
 #else
-	zones_size[ZONE_DMA] = top_of_ram >> PAGE_SHIFT;
-	zholes_size[ZONE_DMA] = (top_of_ram - total_ram) >> PAGE_SHIFT;
-#endif /* CONFIG_HIGHMEM */
+	free_area_init_nodes(top_of_ram >> PAGE_SHIFT,
+				top_of_ram >> PAGE_SHIFT,
+				top_of_ram >> PAGE_SHIFT,
+				top_of_ram >> PAGE_SHIFT);
+#endif
 
-	free_area_init_node(0, NODE_DATA(0), zones_size,
-			    __pa(PAGE_OFFSET) >> PAGE_SHIFT, zholes_size);
 }
 #endif /* ! CONFIG_NEED_MULTIPLE_NODES */
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-101-add_free_area_init_nodes/arch/powerpc/mm/numa.c linux-2.6.17-rc5-mm3-102-powerpc_use_init_nodes/arch/powerpc/mm/numa.c
--- linux-2.6.17-rc5-mm3-101-add_free_area_init_nodes/arch/powerpc/mm/numa.c	2006-06-05 14:12:48.000000000 +0100
+++ linux-2.6.17-rc5-mm3-102-powerpc_use_init_nodes/arch/powerpc/mm/numa.c	2006-06-05 14:15:06.000000000 +0100
@@ -39,96 +39,6 @@ static bootmem_data_t __initdata plat_no
 static int min_common_depth;
 static int n_mem_addr_cells, n_mem_size_cells;
 
-/*
- * We need somewhere to store start/end/node for each region until we have
- * allocated the real node_data structures.
- */
-#define MAX_REGIONS	(MAX_LMB_REGIONS*2)
-static struct {
-	unsigned long start_pfn;
-	unsigned long end_pfn;
-	int nid;
-} init_node_data[MAX_REGIONS] __initdata;
-
-int __init early_pfn_to_nid(unsigned long pfn)
-{
-	unsigned int i;
-
-	for (i = 0; init_node_data[i].end_pfn; i++) {
-		unsigned long start_pfn = init_node_data[i].start_pfn;
-		unsigned long end_pfn = init_node_data[i].end_pfn;
-
-		if ((start_pfn <= pfn) && (pfn < end_pfn))
-			return init_node_data[i].nid;
-	}
-
-	return -1;
-}
-
-void __init add_region(unsigned int nid, unsigned long start_pfn,
-		       unsigned long pages)
-{
-	unsigned int i;
-
-	dbg("add_region nid %d start_pfn 0x%lx pages 0x%lx\n",
-		nid, start_pfn, pages);
-
-	for (i = 0; init_node_data[i].end_pfn; i++) {
-		if (init_node_data[i].nid != nid)
-			continue;
-		if (init_node_data[i].end_pfn == start_pfn) {
-			init_node_data[i].end_pfn += pages;
-			return;
-		}
-		if (init_node_data[i].start_pfn == (start_pfn + pages)) {
-			init_node_data[i].start_pfn -= pages;
-			return;
-		}
-	}
-
-	/*
-	 * Leave last entry NULL so we dont iterate off the end (we use
-	 * entry.end_pfn to terminate the walk).
-	 */
-	if (i >= (MAX_REGIONS - 1)) {
-		printk(KERN_ERR "WARNING: too many memory regions in "
-				"numa code, truncating\n");
-		return;
-	}
-
-	init_node_data[i].start_pfn = start_pfn;
-	init_node_data[i].end_pfn = start_pfn + pages;
-	init_node_data[i].nid = nid;
-}
-
-/* We assume init_node_data has no overlapping regions */
-void __init get_region(unsigned int nid, unsigned long *start_pfn,
-		       unsigned long *end_pfn, unsigned long *pages_present)
-{
-	unsigned int i;
-
-	*start_pfn = -1UL;
-	*end_pfn = *pages_present = 0;
-
-	for (i = 0; init_node_data[i].end_pfn; i++) {
-		if (init_node_data[i].nid != nid)
-			continue;
-
-		*pages_present += init_node_data[i].end_pfn -
-			init_node_data[i].start_pfn;
-
-		if (init_node_data[i].start_pfn < *start_pfn)
-			*start_pfn = init_node_data[i].start_pfn;
-
-		if (init_node_data[i].end_pfn > *end_pfn)
-			*end_pfn = init_node_data[i].end_pfn;
-	}
-
-	/* We didnt find a matching region, return start/end as 0 */
-	if (*start_pfn == -1UL)
-		*start_pfn = 0;
-}
-
 static void __cpuinit map_cpu_to_node(int cpu, int node)
 {
 	numa_cpu_lookup_table[cpu] = node;
@@ -471,8 +381,8 @@ new_range:
 				continue;
 		}
 
-		add_region(nid, start >> PAGE_SHIFT,
-			   size >> PAGE_SHIFT);
+		add_active_range(nid, start >> PAGE_SHIFT,
+				(start >> PAGE_SHIFT) + (size >> PAGE_SHIFT));
 
 		if (--ranges)
 			goto new_range;
@@ -485,6 +395,7 @@ static void __init setup_nonnuma(void)
 {
 	unsigned long top_of_ram = lmb_end_of_DRAM();
 	unsigned long total_ram = lmb_phys_mem_size();
+	unsigned long start_pfn, end_pfn;
 	unsigned int i;
 
 	printk(KERN_DEBUG "Top of RAM: 0x%lx, Total RAM: 0x%lx\n",
@@ -492,9 +403,11 @@ static void __init setup_nonnuma(void)
 	printk(KERN_DEBUG "Memory hole size: %ldMB\n",
 	       (top_of_ram - total_ram) >> 20);
 
-	for (i = 0; i < lmb.memory.cnt; ++i)
-		add_region(0, lmb.memory.region[i].base >> PAGE_SHIFT,
-			   lmb_size_pages(&lmb.memory, i));
+	for (i = 0; i < lmb.memory.cnt; ++i) {
+		start_pfn = lmb.memory.region[i].base >> PAGE_SHIFT;
+		end_pfn = start_pfn + lmb_size_pages(&lmb.memory, i);
+		add_active_range(0, start_pfn, end_pfn);
+	}
 	node_set_online(0);
 }
 
@@ -632,11 +545,11 @@ void __init do_init_bootmem(void)
 			  (void *)(unsigned long)boot_cpuid);
 
 	for_each_online_node(nid) {
-		unsigned long start_pfn, end_pfn, pages_present;
+		unsigned long start_pfn, end_pfn;
 		unsigned long bootmem_paddr;
 		unsigned long bootmap_pages;
 
-		get_region(nid, &start_pfn, &end_pfn, &pages_present);
+		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
 
 		/* Allocate the node structure node local if possible */
 		NODE_DATA(nid) = careful_allocation(nid,
@@ -669,19 +582,7 @@ void __init do_init_bootmem(void)
 		init_bootmem_node(NODE_DATA(nid), bootmem_paddr >> PAGE_SHIFT,
 				  start_pfn, end_pfn);
 
-		/* Add free regions on this node */
-		for (i = 0; init_node_data[i].end_pfn; i++) {
-			unsigned long start, end;
-
-			if (init_node_data[i].nid != nid)
-				continue;
-
-			start = init_node_data[i].start_pfn << PAGE_SHIFT;
-			end = init_node_data[i].end_pfn << PAGE_SHIFT;
-
-			dbg("free_bootmem %lx %lx\n", start, end - start);
-  			free_bootmem_node(NODE_DATA(nid), start, end - start);
-		}
+		free_bootmem_with_active_regions(nid, end_pfn);
 
 		/* Mark reserved regions on this node */
 		for (i = 0; i < lmb.reserved.cnt; i++) {
@@ -712,44 +613,14 @@ void __init do_init_bootmem(void)
 			}
 		}
 
-		/* Add regions into sparsemem */
-		for (i = 0; init_node_data[i].end_pfn; i++) {
-			unsigned long start, end;
-
-			if (init_node_data[i].nid != nid)
-				continue;
-
-			start = init_node_data[i].start_pfn;
-			end = init_node_data[i].end_pfn;
-
-			memory_present(nid, start, end);
-		}
+		sparse_memory_present_with_active_regions(nid);
 	}
 }
 
 void __init paging_init(void)
 {
-	unsigned long zones_size[MAX_NR_ZONES];
-	unsigned long zholes_size[MAX_NR_ZONES];
-	int nid;
-
-	memset(zones_size, 0, sizeof(zones_size));
-	memset(zholes_size, 0, sizeof(zholes_size));
-
-	for_each_online_node(nid) {
-		unsigned long start_pfn, end_pfn, pages_present;
-
-		get_region(nid, &start_pfn, &end_pfn, &pages_present);
-
-		zones_size[ZONE_DMA] = end_pfn - start_pfn;
-		zholes_size[ZONE_DMA] = zones_size[ZONE_DMA] - pages_present;
-
-		dbg("free_area_init node %d %lx %lx (hole: %lx)\n", nid,
-		    zones_size[ZONE_DMA], start_pfn, zholes_size[ZONE_DMA]);
-
-		free_area_init_node(nid, NODE_DATA(nid), zones_size, start_pfn,
-				    zholes_size);
-	}
+	unsigned long end_pfn = lmb_end_of_DRAM() >> PAGE_SHIFT;
+	free_area_init_nodes(end_pfn, end_pfn, end_pfn, end_pfn);
 }
 
 static int __init early_numa(char *p)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-101-add_free_area_init_nodes/arch/ppc/Kconfig linux-2.6.17-rc5-mm3-102-powerpc_use_init_nodes/arch/ppc/Kconfig
--- linux-2.6.17-rc5-mm3-101-add_free_area_init_nodes/arch/ppc/Kconfig	2006-06-05 14:12:48.000000000 +0100
+++ linux-2.6.17-rc5-mm3-102-powerpc_use_init_nodes/arch/ppc/Kconfig	2006-06-05 14:15:06.000000000 +0100
@@ -953,6 +953,9 @@ config NR_CPUS
 config HIGHMEM
 	bool "High memory support"
 
+config ARCH_POPULATES_NODE_MAP
+	def_bool y
+
 source kernel/Kconfig.hz
 source kernel/Kconfig.preempt
 source "mm/Kconfig"
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-101-add_free_area_init_nodes/arch/ppc/mm/init.c linux-2.6.17-rc5-mm3-102-powerpc_use_init_nodes/arch/ppc/mm/init.c
--- linux-2.6.17-rc5-mm3-101-add_free_area_init_nodes/arch/ppc/mm/init.c	2006-05-25 02:50:17.000000000 +0100
+++ linux-2.6.17-rc5-mm3-102-powerpc_use_init_nodes/arch/ppc/mm/init.c	2006-06-05 14:15:06.000000000 +0100
@@ -359,8 +359,7 @@ void __init do_init_bootmem(void)
  */
 void __init paging_init(void)
 {
-	unsigned long zones_size[MAX_NR_ZONES], i;
-
+	unsigned long start_pfn, end_pfn;
 #ifdef CONFIG_HIGHMEM
 	map_page(PKMAP_BASE, 0, 0);	/* XXX gross */
 	pkmap_page_table = pte_offset_kernel(pmd_offset(pgd_offset_k
@@ -370,19 +369,22 @@ void __init paging_init(void)
 			(KMAP_FIX_BEGIN), KMAP_FIX_BEGIN), KMAP_FIX_BEGIN);
 	kmap_prot = PAGE_KERNEL;
 #endif /* CONFIG_HIGHMEM */
-
-	/*
-	 * All pages are DMA-able so we put them all in the DMA zone.
-	 */
-	zones_size[ZONE_DMA] = total_lowmem >> PAGE_SHIFT;
-	for (i = 1; i < MAX_NR_ZONES; i++)
-		zones_size[i] = 0;
+	/* All pages are DMA-able so we put them all in the DMA zone. */
+	start_pfn = __pa(PAGE_OFFSET) >> PAGE_SHIFT;
+	end_pfn = start_pfn + (total_memory >> PAGE_SHIFT);
+	add_active_range(0, start_pfn, end_pfn);
 
 #ifdef CONFIG_HIGHMEM
-	zones_size[ZONE_HIGHMEM] = (total_memory - total_lowmem) >> PAGE_SHIFT;
+	free_area_init_nodes(total_lowmem >> PAGE_SHIFT,
+				total_lowmem >> PAGE_SHIFT,
+				total_lowmem >> PAGE_SHIFT,
+				total_memory >> PAGE_SHIFT);
+#else
+	free_area_init_nodes(total_memory >> PAGE_SHIFT,
+				total_memory >> PAGE_SHIFT,
+				total_memory >> PAGE_SHIFT,
+				total_memory >> PAGE_SHIFT);
 #endif /* CONFIG_HIGHMEM */
-
-	free_area_init(zones_size);
 }
 
 void __init mem_init(void)

* [PATCH 1/5] Introduce mechanism for registering active regions of memory
From: Mel Gorman @ 2006-06-06 13:47 UTC
  To: akpm
  Cc: davej, tony.luck, linuxppc-dev, Mel Gorman, linux-kernel,
	bob.picco, ak, linux-mm
In-Reply-To: <20060606134710.21419.48239.sendpatchset@skynet.skynet.ie>


This patch defines the structure to represent an active range of page
frames within a node in an architecture independent manner. Architectures
are expected to register active ranges of PFNs using add_active_range(nid,
start_pfn, end_pfn) and call free_area_init_nodes() passing the PFNs of
the end of each zone.
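
To make the registration model concrete, here is a small userspace sketch
of the idea: ranges are recorded per node, adjacent ranges on the same node
are merged, and the pages missing from a PFN window can then be counted.
It is only a model. The model_* names are hypothetical, the table size is
arbitrary, registered ranges are assumed not to overlap, and the real code
in mm/page_alloc.c may handle more cases than this.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_REGIONS 16	/* arbitrary size for this sketch */

/* One registered range of present pages, [start_pfn, end_pfn), on a node. */
struct model_region {
	unsigned long start_pfn;
	unsigned long end_pfn;
	int nid;
};

static struct model_region model_map[MODEL_MAX_REGIONS];
static size_t model_nr;

/* Register a range, extending an existing entry on the same node if adjacent. */
void model_add_active_range(int nid, unsigned long start_pfn,
			    unsigned long end_pfn)
{
	size_t i;

	assert(start_pfn < end_pfn);
	for (i = 0; i < model_nr; i++) {
		if (model_map[i].nid != nid)
			continue;
		if (model_map[i].end_pfn == start_pfn) {
			model_map[i].end_pfn = end_pfn;	/* grow forwards */
			return;
		}
		if (model_map[i].start_pfn == end_pfn) {
			model_map[i].start_pfn = start_pfn;	/* grow backwards */
			return;
		}
	}
	assert(model_nr < MODEL_MAX_REGIONS);
	model_map[model_nr].nid = nid;
	model_map[model_nr].start_pfn = start_pfn;
	model_map[model_nr].end_pfn = end_pfn;
	model_nr++;
}

/* Count pages in [start_pfn, end_pfn) that no registered range covers. */
unsigned long model_absent_pages(unsigned long start_pfn,
				 unsigned long end_pfn)
{
	unsigned long present = 0;
	size_t i;

	for (i = 0; i < model_nr; i++) {
		unsigned long s = model_map[i].start_pfn;
		unsigned long e = model_map[i].end_pfn;

		/* Clamp the region to the window being queried. */
		if (s < start_pfn)
			s = start_pfn;
		if (e > end_pfn)
			e = end_pfn;
		if (s < e)
			present += e - s;
	}
	return (end_pfn - start_pfn) - present;
}
```

In this model, registering [0, 100) and [100, 200) on node 0 leaves a
single merged entry, and after also registering [300, 400) on node 1,
model_absent_pages(0, 400) reports the 100-page hole between them.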


 include/linux/mm.h     |   45 +++
 include/linux/mmzone.h |   10 
 mm/page_alloc.c        |  550 ++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 580 insertions(+), 25 deletions(-)

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Bob Picco <bob.picco@hp.com>
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-clean/include/linux/mm.h linux-2.6.17-rc5-mm3-101-add_free_area_init_nodes/include/linux/mm.h
--- linux-2.6.17-rc5-mm3-clean/include/linux/mm.h	2006-06-05 14:12:51.000000000 +0100
+++ linux-2.6.17-rc5-mm3-101-add_free_area_init_nodes/include/linux/mm.h	2006-06-05 14:14:15.000000000 +0100
@@ -924,6 +924,51 @@ extern void free_area_init(unsigned long
 extern void free_area_init_node(int nid, pg_data_t *pgdat,
 	unsigned long * zones_size, unsigned long zone_start_pfn, 
 	unsigned long *zholes_size);
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+/*
+ * With CONFIG_ARCH_POPULATES_NODE_MAP set, an architecture may initialise its
+ * zones, allocate the backing mem_map and account for memory holes in a more
+ * architecture independent manner. This is a substitute for creating the
+ * zone_sizes[] and zholes_size[] arrays and passing them to
+ * free_area_init_node()
+ *
+ * An architecture is expected to register ranges of page frames backed by
+ * physical memory with add_active_range() before calling
+ * free_area_init_nodes() passing in the PFN each zone ends at. In basic
+ * usage, an architecture is expected to do something like
+ *
+ * for_each_valid_physical_page_range()
+ * 	add_active_range(node_id, start_pfn, end_pfn)
+ * free_area_init_nodes(max_dma, max_dma32, max_normal_pfn, max_highmem_pfn);
+ *
+ * If the architecture guarantees that there are no holes in the ranges
+ * registered with add_active_range(), free_bootmem_with_active_regions()
+ * will call free_bootmem_node() for each registered physical page range.
+ * Similarly sparse_memory_present_with_active_regions() calls
+ * memory_present() for each range when SPARSEMEM is enabled.
+ *
+ * See mm/page_alloc.c for more information on each function exposed by
+ * CONFIG_ARCH_POPULATES_NODE_MAP
+ */
+extern void free_area_init_nodes(unsigned long max_dma_pfn,
+					unsigned long max_dma32_pfn,
+					unsigned long max_low_pfn,
+					unsigned long max_high_pfn);
+extern void add_active_range(unsigned int nid, unsigned long start_pfn,
+					unsigned long end_pfn);
+extern void shrink_active_range(unsigned int nid, unsigned long old_end_pfn,
+						unsigned long new_end_pfn);
+extern void remove_all_active_ranges(void);
+extern unsigned long absent_pages_in_range(unsigned long start_pfn,
+						unsigned long end_pfn);
+extern void get_pfn_range_for_nid(unsigned int nid,
+			unsigned long *start_pfn, unsigned long *end_pfn);
+extern unsigned long find_min_pfn_with_active_regions(void);
+extern unsigned long find_max_pfn_with_active_regions(void);
+extern void free_bootmem_with_active_regions(int nid,
+						unsigned long max_low_pfn);
+extern void sparse_memory_present_with_active_regions(int nid);
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
 extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long);
 extern void setup_per_zone_pages_min(void);
 extern void mem_init(void);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-clean/include/linux/mmzone.h linux-2.6.17-rc5-mm3-101-add_free_area_init_nodes/include/linux/mmzone.h
--- linux-2.6.17-rc5-mm3-clean/include/linux/mmzone.h	2006-06-05 14:12:51.000000000 +0100
+++ linux-2.6.17-rc5-mm3-101-add_free_area_init_nodes/include/linux/mmzone.h	2006-06-05 14:14:15.000000000 +0100
@@ -277,6 +277,13 @@ struct zonelist {
 	struct zone *zones[MAX_NUMNODES * MAX_NR_ZONES + 1]; // NULL delimited
 };
 
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+struct node_active_region {
+	unsigned long start_pfn;
+	unsigned long end_pfn;
+	int nid;
+};
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
 
 /*
  * The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
@@ -484,7 +491,8 @@ extern struct zone *next_zone(struct zon
 
 #endif
 
-#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
+#if !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) && \
+	!defined(CONFIG_ARCH_POPULATES_NODE_MAP)
 #define early_pfn_to_nid(nid)  (0UL)
 #endif
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-rc5-mm3-clean/mm/page_alloc.c linux-2.6.17-rc5-mm3-101-add_free_area_init_nodes/mm/page_alloc.c
--- linux-2.6.17-rc5-mm3-clean/mm/page_alloc.c	2006-06-05 14:12:51.000000000 +0100
+++ linux-2.6.17-rc5-mm3-101-add_free_area_init_nodes/mm/page_alloc.c	2006-06-05 14:14:15.000000000 +0100
@@ -38,6 +38,8 @@
 #include <linux/vmalloc.h>
 #include <linux/mempolicy.h>
 #include <linux/stop_machine.h>
+#include <linux/sort.h>
+#include <linux/pfn.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -87,6 +89,33 @@ int min_free_kbytes = 1024;
 unsigned long __meminitdata nr_kernel_pages;
 unsigned long __meminitdata nr_all_pages;
 
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+  /*
+   * MAX_ACTIVE_REGIONS determines the maximum number of distinct
+   * ranges of memory (RAM) that may be registered with add_active_range().
+   * Ranges passed to add_active_range() will be merged if possible
+   * so the number of times add_active_range() can be called is
+   * related to the number of nodes and the number of holes
+   */
+  #ifdef CONFIG_MAX_ACTIVE_REGIONS
+    /* Allow an architecture to set MAX_ACTIVE_REGIONS to save memory */
+    #define MAX_ACTIVE_REGIONS CONFIG_MAX_ACTIVE_REGIONS
+  #else
+    #if MAX_NUMNODES >= 32
+      /* If there can be many nodes, allow up to 50 holes per node */
+      #define MAX_ACTIVE_REGIONS (MAX_NUMNODES*50)
+    #else
+      /* By default, allow up to 256 distinct regions */
+      #define MAX_ACTIVE_REGIONS 256
+    #endif
+  #endif
+
+  struct node_active_region __initdata early_node_map[MAX_ACTIVE_REGIONS];
+  int __initdata nr_nodemap_entries;
+  unsigned long __initdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
+  unsigned long __initdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
+
 #ifdef CONFIG_DEBUG_VM
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 {
@@ -1887,25 +1916,6 @@ static inline unsigned long wait_table_b
 
 #define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
 
-static void __init calculate_zone_totalpages(struct pglist_data *pgdat,
-		unsigned long *zones_size, unsigned long *zholes_size)
-{
-	unsigned long realtotalpages, totalpages = 0;
-	int i;
-
-	for (i = 0; i < MAX_NR_ZONES; i++)
-		totalpages += zones_size[i];
-	pgdat->node_spanned_pages = totalpages;
-
-	realtotalpages = totalpages;
-	if (zholes_size)
-		for (i = 0; i < MAX_NR_ZONES; i++)
-			realtotalpages -= zholes_size[i];
-	pgdat->node_present_pages = realtotalpages;
-	printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id, realtotalpages);
-}
-
-
 /*
  * Initially all pages are reserved - free ones are freed
  * up by free_all_bootmem() once the early boot process is
@@ -2223,6 +2233,272 @@ __meminit int init_currently_empty_zone(
 	return 0;
 }
 
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+/*
+ * Basic iterator support. Return the first range of PFNs for a node
+ * Note: nid == MAX_NUMNODES returns first region regardless of node
+ */
+static int __init first_active_region_index_in_nid(int nid)
+{
+	int i;
+
+	for (i = 0; i < nr_nodemap_entries; i++)
+		if (nid == MAX_NUMNODES || early_node_map[i].nid == nid)
+			return i;
+
+	return -1;
+}
+
+/*
+ * Basic iterator support. Return the next active range of PFNs for a node
+ * Note: nid == MAX_NUMNODES returns next region regardless of node
+ */
+static int __init next_active_region_index_in_nid(int index, int nid)
+{
+	for (index = index + 1; index < nr_nodemap_entries; index++)
+		if (nid == MAX_NUMNODES || early_node_map[index].nid == nid)
+			return index;
+
+	return -1;
+}
+
+#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
+/*
+ * Required by SPARSEMEM. Given a PFN, return what node the PFN is on.
+ * Architectures may implement their own version but if add_active_range()
+ * was used and there are no special requirements, this is a convenient
+ * alternative
+ */
+int __init early_pfn_to_nid(unsigned long pfn)
+{
+	int i;
+
+	for (i = 0; i < nr_nodemap_entries; i++) {
+		unsigned long start_pfn = early_node_map[i].start_pfn;
+		unsigned long end_pfn = early_node_map[i].end_pfn;
+
+		if (start_pfn <= pfn && pfn < end_pfn)
+			return early_node_map[i].nid;
+	}
+
+	return 0;
+}
+#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
+
+/* Basic iterator support to walk early_node_map[] */
+#define for_each_active_range_index_in_nid(i, nid) \
+	for (i = first_active_region_index_in_nid(nid); i != -1; \
+				i = next_active_region_index_in_nid(i, nid))
+
+/**
+ * free_bootmem_with_active_regions - Call free_bootmem_node for each active range
+ * @nid: The node to free memory on. If MAX_NUMNODES, all nodes are freed
+ * @max_low_pfn: The highest PFN that will be passed to free_bootmem_node
+ *
+ * If an architecture guarantees that all ranges registered with
+ * add_active_range() contain no holes and may be freed, this
+ * function may be used instead of calling free_bootmem() manually.
+ */
+void __init free_bootmem_with_active_regions(int nid,
+						unsigned long max_low_pfn)
+{
+	int i;
+
+	for_each_active_range_index_in_nid(i, nid) {
+		unsigned long size_pages = 0;
+		unsigned long end_pfn = early_node_map[i].end_pfn;
+
+		if (early_node_map[i].start_pfn >= max_low_pfn)
+			continue;
+
+		if (end_pfn > max_low_pfn)
+			end_pfn = max_low_pfn;
+
+		size_pages = end_pfn - early_node_map[i].start_pfn;
+		free_bootmem_node(NODE_DATA(early_node_map[i].nid),
+				PFN_PHYS(early_node_map[i].start_pfn),
+				size_pages << PAGE_SHIFT);
+	}
+}
+
+/**
+ * sparse_memory_present_with_active_regions - Call memory_present for each active range
+ * @nid: The node to call memory_present for. If MAX_NUMNODES, all nodes will be used
+ *
+ * If an architecture guarantees that all ranges registered with
+ * add_active_range() contain no holes and may be freed, this
+ * function may be used instead of calling memory_present() manually.
+ */
+void __init sparse_memory_present_with_active_regions(int nid)
+{
+	int i;
+
+	for_each_active_range_index_in_nid(i, nid)
+		memory_present(early_node_map[i].nid,
+				early_node_map[i].start_pfn,
+				early_node_map[i].end_pfn);
+}
+
+/**
+ * get_pfn_range_for_nid - Return the start and end page frames for a node
+ * @nid: The nid to return the range for. If MAX_NUMNODES, the min and max PFN are returned
+ * @start_pfn: Passed by reference. On return, it will have the node start_pfn
+ * @end_pfn: Passed by reference. On return, it will have the node end_pfn
+ *
+ * It returns the start and end page frame of a node based on information
+ * provided by an arch calling add_active_range(). If called for a node
+ * with no available memory, a warning is printed and the start and end
+ * PFNs will be 0
+ */
+void __init get_pfn_range_for_nid(unsigned int nid,
+			unsigned long *start_pfn, unsigned long *end_pfn)
+{
+	int i;
+	*start_pfn = -1UL;
+	*end_pfn = 0;
+
+	for_each_active_range_index_in_nid(i, nid) {
+		*start_pfn = min(*start_pfn, early_node_map[i].start_pfn);
+		*end_pfn = max(*end_pfn, early_node_map[i].end_pfn);
+	}
+
+	if (*start_pfn == -1UL) {
+		printk(KERN_WARNING "Node %u active with no memory\n", nid);
+		*start_pfn = 0;
+	}
+}
+
+/*
+ * Return the number of pages a zone spans in a node, including holes
+ * present_pages = zone_spanned_pages_in_node() - zone_absent_pages_in_node()
+ */
+unsigned long __init zone_spanned_pages_in_node(int nid,
+					unsigned long zone_type,
+					unsigned long *ignored)
+{
+	unsigned long node_start_pfn, node_end_pfn;
+	unsigned long zone_start_pfn, zone_end_pfn;
+
+	/* Get the start and end of the node and zone */
+	get_pfn_range_for_nid(nid, &node_start_pfn, &node_end_pfn);
+	zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
+	zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
+
+	/* Check that this node has pages within the zone's required range */
+	if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn)
+		return 0;
+
+	/* Move the zone boundaries inside the node if necessary */
+	zone_end_pfn = min(zone_end_pfn, node_end_pfn);
+	zone_start_pfn = max(zone_start_pfn, node_start_pfn);
+
+	/* Return the spanned pages */
+	return zone_end_pfn - zone_start_pfn;
+}
+
+/*
+ * Return the number of page frames in holes in a range on a node. If nid is
+ * MAX_NUMNODES, then all holes in the requested range will be accounted for
+ */
+unsigned long __init __absent_pages_in_range(int nid,
+				unsigned long range_start_pfn,
+				unsigned long range_end_pfn)
+{
+	int i = 0;
+	unsigned long prev_end_pfn = 0, hole_pages = 0;
+	unsigned long start_pfn;
+
+	/* Begin at the start of the first active range of pfns in the node */
+	i = first_active_region_index_in_nid(nid);
+	if (i == -1)
+		return 0;
+
+	prev_end_pfn = early_node_map[i].start_pfn;
+
+	/* Find all holes for the zone within the node */
+	for (; i != -1; i = next_active_region_index_in_nid(i, nid)) {
+
+		/* No need to continue if prev_end_pfn is outside the zone */
+		if (prev_end_pfn >= range_end_pfn)
+			break;
+
+		/* Make sure the end of the zone is not within the hole */
+		start_pfn = min(early_node_map[i].start_pfn, range_end_pfn);
+		prev_end_pfn = max(prev_end_pfn, range_start_pfn);
+
+		/* Update the hole size count and move on */
+		if (start_pfn > range_start_pfn) {
+			BUG_ON(prev_end_pfn > start_pfn);
+			hole_pages += start_pfn - prev_end_pfn;
+		}
+		prev_end_pfn = early_node_map[i].end_pfn;
+	}
+
+	return hole_pages;
+}
+
+/**
+ * absent_pages_in_range - Return number of page frames in holes within a range
+ * @start_pfn: The start PFN to start searching for holes
+ * @end_pfn: The end PFN to stop searching for holes
+ *
+ * It returns the number of page frames in memory holes within a range
+ */
+unsigned long __init absent_pages_in_range(unsigned long start_pfn,
+							unsigned long end_pfn)
+{
+	return __absent_pages_in_range(MAX_NUMNODES, start_pfn, end_pfn);
+}
+
+/* Return the number of page frames in holes in a zone on a node */
+unsigned long __init zone_absent_pages_in_node(int nid,
+					unsigned long zone_type,
+					unsigned long *ignored)
+{
+	return __absent_pages_in_range(nid,
+				arch_zone_lowest_possible_pfn[zone_type],
+				arch_zone_highest_possible_pfn[zone_type]);
+}
+#else
+static inline unsigned long zone_spanned_pages_in_node(int nid,
+					unsigned long zone_type,
+					unsigned long *zones_size)
+{
+	return zones_size[zone_type];
+}
+
+static inline unsigned long zone_absent_pages_in_node(int nid,
+						unsigned long zone_type,
+						unsigned long *zholes_size)
+{
+	if (!zholes_size)
+		return 0;
+
+	return zholes_size[zone_type];
+}
+#endif
+
+static void __init calculate_node_totalpages(struct pglist_data *pgdat,
+		unsigned long *zones_size, unsigned long *zholes_size)
+{
+	unsigned long realtotalpages, totalpages = 0;
+	int i;
+
+	for (i = 0; i < MAX_NR_ZONES; i++)
+		totalpages += zone_spanned_pages_in_node(pgdat->node_id, i,
+								zones_size);
+	pgdat->node_spanned_pages = totalpages;
+
+	realtotalpages = totalpages;
+	for (i = 0; i < MAX_NR_ZONES; i++)
+		realtotalpages -=
+			zone_absent_pages_in_node(pgdat->node_id, i,
+								zholes_size);
+	pgdat->node_present_pages = realtotalpages;
+	printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id,
+							realtotalpages);
+}
+
 /*
  * Set up the zone data structures:
  *   - mark all pages reserved
@@ -2246,10 +2522,9 @@ static void __meminit free_area_init_cor
 		struct zone *zone = pgdat->node_zones + j;
 		unsigned long size, realsize;
 
-		realsize = size = zones_size[j];
-		if (zholes_size)
-			realsize -= zholes_size[j];
-
+		size = zone_spanned_pages_in_node(nid, j, zones_size);
+		realsize = size - zone_absent_pages_in_node(nid, j,
+								zholes_size);
 		if (j < ZONE_HIGHMEM)
 			nr_kernel_pages += realsize;
 		nr_all_pages += realsize;
@@ -2340,13 +2615,240 @@ void __meminit free_area_init_node(int n
 {
 	pgdat->node_id = nid;
 	pgdat->node_start_pfn = node_start_pfn;
-	calculate_zone_totalpages(pgdat, zones_size, zholes_size);
+	calculate_node_totalpages(pgdat, zones_size, zholes_size);
 
 	alloc_node_mem_map(pgdat);
 
 	free_area_init_core(pgdat, zones_size, zholes_size);
 }
 
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+/**
+ * add_active_range - Register a range of PFNs backed by physical memory
+ * @nid: The node ID the range resides on
+ * @start_pfn: The start PFN of the available physical memory
+ * @end_pfn: The end PFN of the available physical memory
+ *
+ * These ranges are stored in an early_node_map[] and later used by
+ * free_area_init_nodes() to calculate zone sizes and holes. If the
+ * range spans a memory hole, it is up to the architecture to ensure
+ * the memory is not freed by the bootmem allocator. If possible
+ * the range being registered will be merged with existing ranges.
+ */
+void __init add_active_range(unsigned int nid, unsigned long start_pfn,
+						unsigned long end_pfn)
+{
+	int i;
+
+	printk(KERN_DEBUG "Entering add_active_range(%u, %lu, %lu) "
+			  "%d entries of %d used\n",
+			  nid, start_pfn, end_pfn,
+			  nr_nodemap_entries, MAX_ACTIVE_REGIONS);
+
+	/* Merge with existing active regions if possible */
+	for (i = 0; i < nr_nodemap_entries; i++) {
+		if (early_node_map[i].nid != nid)
+			continue;
+
+		/* Skip if an existing region covers this new one */
+		if (start_pfn >= early_node_map[i].start_pfn &&
+				end_pfn <= early_node_map[i].end_pfn)
+			return;
+
+		/* Merge forward if suitable */
+		if (start_pfn <= early_node_map[i].end_pfn &&
+				end_pfn > early_node_map[i].end_pfn) {
+			early_node_map[i].end_pfn = end_pfn;
+			return;
+		}
+
+		/* Merge backward if suitable */
+		if (start_pfn < early_node_map[i].end_pfn &&
+				end_pfn >= early_node_map[i].start_pfn) {
+			early_node_map[i].start_pfn = start_pfn;
+			return;
+		}
+	}
+
+	/* Check that early_node_map is large enough */
+	if (i >= MAX_ACTIVE_REGIONS) {
+		printk(KERN_CRIT "More than %d memory regions, truncating\n",
+							MAX_ACTIVE_REGIONS);
+		return;
+	}
+
+	early_node_map[i].nid = nid;
+	early_node_map[i].start_pfn = start_pfn;
+	early_node_map[i].end_pfn = end_pfn;
+	nr_nodemap_entries = i + 1;
+}
+
+/**
+ * shrink_active_range - Shrink an existing registered range of PFNs
+ * @nid: The node id the range is on that should be shrunk
+ * @old_end_pfn: The old end PFN of the range
+ * @new_end_pfn: The new end PFN of the range
+ *
+ * i386 with NUMA uses alloc_remap() to store a node_mem_map on a local node.
+ * The map is kept at the end of the physical page range that has already been
+ * registered with add_active_range(). This function allows an arch to shrink
+ * an existing registered range.
+ */
+void __init shrink_active_range(unsigned int nid, unsigned long old_end_pfn,
+						unsigned long new_end_pfn)
+{
+	int i;
+
+	/* Find the old active region end and shrink */
+	for_each_active_range_index_in_nid(i, nid)
+		if (early_node_map[i].end_pfn == old_end_pfn) {
+			early_node_map[i].end_pfn = new_end_pfn;
+			break;
+		}
+}
+
+/**
+ * remove_all_active_ranges - Remove all currently registered regions
+ * During discovery, it may be found that a table like SRAT is invalid
+ * and an alternative discovery method must be used. This function removes
+ * all currently registered regions.
+ */
+void __init remove_all_active_ranges(void)
+{
+	memset(early_node_map, 0, sizeof(early_node_map));
+	nr_nodemap_entries = 0;
+}
+
+/* Compare two active node_active_regions */
+static int __init cmp_node_active_region(const void *a, const void *b)
+{
+	const struct node_active_region *arange = a;
+	const struct node_active_region *brange = b;
+
+	/* Done this way to avoid overflows */
+	if (arange->start_pfn > brange->start_pfn)
+		return 1;
+	if (arange->start_pfn < brange->start_pfn)
+		return -1;
+
+	return 0;
+}
+
+/* sort the node_map by start_pfn */
+static void __init sort_node_map(void)
+{
+	sort(early_node_map, (size_t)nr_nodemap_entries,
+			sizeof(struct node_active_region),
+			cmp_node_active_region, NULL);
+}
+
+/* Find the lowest pfn for a node. This depends on a sorted early_node_map */
+unsigned long __init find_min_pfn_for_node(unsigned long nid)
+{
+	int i;
+
+	/* Assuming a sorted map, the first range found has the starting pfn */
+	for_each_active_range_index_in_nid(i, nid)
+		return early_node_map[i].start_pfn;
+
+	printk(KERN_WARNING "Could not find start_pfn for node %lu\n", nid);
+	return 0;
+}
+
+/**
+ * find_min_pfn_with_active_regions - Find the minimum PFN registered
+ *
+ * It returns the minimum PFN based on information provided via
+ * add_active_range()
+ */
+unsigned long __init find_min_pfn_with_active_regions(void)
+{
+	return find_min_pfn_for_node(MAX_NUMNODES);
+}
+
+/**
+ * find_max_pfn_with_active_regions - Find the maximum PFN registered
+ *
+ * It returns the maximum PFN based on information provided via
+ * add_active_range()
+ */
+unsigned long __init find_max_pfn_with_active_regions(void)
+{
+	int i;
+	unsigned long max_pfn = 0;
+
+	for (i = 0; i < nr_nodemap_entries; i++)
+		max_pfn = max(max_pfn, early_node_map[i].end_pfn);
+
+	return max_pfn;
+}
+
+/**
+ * free_area_init_nodes - Initialise all pg_data_t and zone data
+ * @arch_max_dma_pfn: The maximum PFN usable for ZONE_DMA
+ * @arch_max_dma32_pfn: The maximum PFN usable for ZONE_DMA32
+ * @arch_max_low_pfn: The maximum PFN usable for ZONE_NORMAL
+ * @arch_max_high_pfn: The maximum PFN usable for ZONE_HIGHMEM
+ *
+ * This will call free_area_init_node() for each active node in the system.
+ * Using the page ranges provided by add_active_range(), the size of each
+ * zone in each node and their holes is calculated. If the maximum PFNs
+ * of two adjacent zones match, it is assumed that the higher zone is empty.
+ * For example, if arch_max_dma_pfn == arch_max_dma32_pfn, it is assumed
+ * that ZONE_DMA32 has no pages. It is also assumed that a zone
+ * starts where the previous one ended. For example, ZONE_DMA32 starts
+ * at arch_max_dma_pfn.
+ */
+void __init free_area_init_nodes(unsigned long arch_max_dma_pfn,
+				unsigned long arch_max_dma32_pfn,
+				unsigned long arch_max_low_pfn,
+				unsigned long arch_max_high_pfn)
+{
+	unsigned long nid;
+	int i;
+
+	/* Record where the zone boundaries are */
+	memset(arch_zone_lowest_possible_pfn, 0,
+				sizeof(arch_zone_lowest_possible_pfn));
+	memset(arch_zone_highest_possible_pfn, 0,
+				sizeof(arch_zone_highest_possible_pfn));
+	arch_zone_lowest_possible_pfn[ZONE_DMA] =
+					find_min_pfn_with_active_regions();
+	arch_zone_highest_possible_pfn[ZONE_DMA] = arch_max_dma_pfn;
+	arch_zone_highest_possible_pfn[ZONE_DMA32] = arch_max_dma32_pfn;
+	arch_zone_highest_possible_pfn[ZONE_NORMAL] = arch_max_low_pfn;
+	arch_zone_highest_possible_pfn[ZONE_HIGHMEM] = arch_max_high_pfn;
+	for (i = 1; i < MAX_NR_ZONES; i++)
+		arch_zone_lowest_possible_pfn[i] =
+			arch_zone_highest_possible_pfn[i-1];
+
+	/* Regions in the early_node_map can be in any order */
+	sort_node_map();
+
+	/* Print out the zone ranges */
+	printk("Zone PFN ranges:\n");
+	for (i = 0; i < MAX_NR_ZONES; i++)
+		printk("  %-8s %8lu -> %8lu\n",
+				zone_names[i],
+				arch_zone_lowest_possible_pfn[i],
+				arch_zone_highest_possible_pfn[i]);
+
+	/* Print out the early_node_map[] */
+	printk("early_node_map[%d] active PFN ranges\n", nr_nodemap_entries);
+	for (i = 0; i < nr_nodemap_entries; i++)
+		printk("  %3d: %8lu -> %8lu\n", early_node_map[i].nid,
+						early_node_map[i].start_pfn,
+						early_node_map[i].end_pfn);
+
+	/* Initialise every node */
+	for_each_online_node(nid) {
+		pg_data_t *pgdat = NODE_DATA(nid);
+		free_area_init_node(nid, pgdat, NULL,
+				find_min_pfn_for_node(nid), NULL);
+	}
+}
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
+
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 static bootmem_data_t contig_bootmem_data;
 struct pglist_data contig_page_data = { .bdata = &contig_bootmem_data };
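
For anyone following the hole accounting in __absent_pages_in_range()
above, here is a rough userspace sketch of the same walk. It is not part
of the patch. struct region, min_ul(), max_ul() and absent_pages() are
hypothetical names made up for illustration, the map is assumed to be
sorted by start_pfn with non-overlapping entries, and the BUG_ON() is
swapped for a plain comparison so an unexpected overlap simply adds
nothing. Like the patch, it does not count space before the first
region or after the last one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct node_active_region */
struct region {
	unsigned long start_pfn;
	unsigned long end_pfn;	/* exclusive */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/*
 * Count page frames inside [range_start, range_end) that fall in the
 * gaps between consecutive regions. Assumes map[] is sorted by
 * start_pfn and its entries do not overlap.
 */
static unsigned long absent_pages(const struct region *map, size_t n,
				  unsigned long range_start,
				  unsigned long range_end)
{
	unsigned long prev_end, hole_pages = 0;
	size_t i;

	assert(range_start <= range_end);
	if (n == 0)
		return 0;

	/* Nothing before the first region counts as a hole */
	prev_end = map[0].start_pfn;

	for (i = 0; i < n; i++) {
		unsigned long start;

		/* Everything from here on lies beyond the range */
		if (prev_end >= range_end)
			break;

		/* Clamp the gap [prev_end, start) to the requested range */
		start = min_ul(map[i].start_pfn, range_end);
		prev_end = max_ul(prev_end, range_start);

		if (start > prev_end)
			hole_pages += start - prev_end;

		prev_end = map[i].end_pfn;
	}

	return hole_pages;
}
```

With regions [0,100), [150,200) and [300,400), the range [120,350)
contains the tail of the first gap (30 frames) plus the whole second
gap (100 frames).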

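The zone boundary handling in free_area_init_nodes() and
zone_spanned_pages_in_node() can be sketched the same way. Again this is
only an illustration under assumptions, not code from the patch:
NR_ZONES, zone_lowest_pfns() and zone_spanned() are hypothetical names,
and the node is treated as a single [node_start, node_end) span.

```c
#include <assert.h>

#define NR_ZONES 4	/* DMA, DMA32, NORMAL, HIGHMEM, as in the patch */

/* Each zone starts where the previous one ended; zone 0 starts at min_pfn */
static void zone_lowest_pfns(const unsigned long highest[NR_ZONES],
			     unsigned long min_pfn,
			     unsigned long lowest[NR_ZONES])
{
	int i;

	lowest[0] = min_pfn;
	for (i = 1; i < NR_ZONES; i++)
		lowest[i] = highest[i - 1];
}

/* Page frames of [zone_start, zone_end) that lie inside the node's span */
static unsigned long zone_spanned(unsigned long zone_start,
				  unsigned long zone_end,
				  unsigned long node_start,
				  unsigned long node_end)
{
	assert(zone_start <= zone_end && node_start <= node_end);

	/* The node has no pages within the zone's range */
	if (zone_end < node_start || zone_start > node_end)
		return 0;

	/* Move the zone boundaries inside the node if necessary */
	if (zone_end > node_end)
		zone_end = node_end;
	if (zone_start < node_start)
		zone_start = node_start;

	return zone_end - zone_start;
}
```

As a made-up example, passing highest[] = { 4096, 4096, 225280, 262144 }
with min_pfn 0 gives lowest[] = { 0, 4096, 4096, 225280 }. ZONE_DMA32
then has equal lowest and highest PFNs and spans no pages, which is the
"adjacent maximums match" case described in the kerneldoc comment.
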