LinuxPPC-Dev Archive on lore.kernel.org


* [PATCH 0/17] Hypervisor-mode KVM on POWER7 and PPC970
From: Paul Mackerras @ 2011-06-29 10:15 UTC (permalink / raw)
  To: linuxppc-dev, kvm, kvm-ppc, Alexander Graf

The first patch of the following series is a pure bug-fix for 32-bit
kernels.

The remaining patches in the series enable KVM to exploit
the hardware hypervisor mode on 64-bit Power ISA Book3S machines.  At
present, POWER7 and PPC970 processors are supported.  (Note that the
PPC970 processors in Apple G5 machines don't have a usable hypervisor
mode and are not supported by these patches.)

Running the KVM host in hypervisor mode means that the guest can use
both supervisor mode and user mode.  That means that the guest can
execute supervisor-privilege instructions and access supervisor-
privilege registers.  In addition the hardware directs most exceptions
to the guest.  Thus we don't need to emulate any instructions in the
host.  Generally, the only times we need to exit the guest are when it
does a hypercall or when an external interrupt or host timer
(decrementer) interrupt occurs.

The focus of this KVM implementation is to run guests that use the
PAPR (Power Architecture Platform Requirements) paravirtualization
interface, which is the interface supplied by PowerVM on IBM pSeries
machines.  Currently the "pseries" machine type in qemu is only
supported by book3s_hv KVM, and book3s_hv KVM only supports the
"pseries" machine type.  That will hopefully change in future.

These patches are against the master branch of the kvm tree.

Paul.

^ permalink raw reply

* Re: [PATCH 2/2] Add cpufreq driver for Momentum Maple boards
From: Benjamin Herrenschmidt @ 2011-06-29  8:54 UTC (permalink / raw)
  To: Dmitry Eremin-Solenikov; +Cc: Dave Jones, Paul Mackerras, linuxppc-dev, cpufreq
In-Reply-To: <BANLkTimJD3U1kRkDKVomGVriykwqLBQasA@mail.gmail.com>

On Wed, 2011-06-29 at 12:40 +0400, Dmitry Eremin-Solenikov wrote:
> On 6/29/11, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > Before I comment on this last one, a quick Q. for Dave: Do you want to
> > handle this or should I merge it via powerpc.git ? (It depends on
> > another change to the arch code to expose the SCOM functions that it
> > uses, and that patch is going to be in my -next branch).
> >
> > Now some remaining small nits:
> >
> > On Fri, 2011-06-17 at 17:10 +0400, Dmitry Eremin-Solenikov wrote:
> >> Add simple cpufreq driver for Maple-based boards (ppc970fx evaluation
> >> kit and others). Driver is based on a cpufreq driver for 64-bit powermac
> >> boxes with all pmac-dependant features removed and simple cleanup
> >> applied.
> >>
> >> Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
> >> ---
> >>  drivers/cpufreq/Kconfig         |    5 +
> >>  drivers/cpufreq/Kconfig.powerpc |    7 +
> >>  drivers/cpufreq/Makefile        |    5 +
> >>  drivers/cpufreq/maple-cpufreq.c |  314
> >> +++++++++++++++++++++++++++++++++++++++
> >
> > If we're going to have a Kconfig.powerpc, should we maybe just have a
> > powerpc subdirectory instead with the driver in it ?
> >
> > I'm happy at some later point to try moving some of my other ones there.
> 
> As Dave also isn't sure about subdirs, should I create cpufreq/powerpc
> directory,
> or not?

Don't bother, I can do it all at once if/when I choose to move the
powermac stuff there.

> > Do you get that property in your device-tree ? Or have you modified your
> > firmware ? If that requires a modified firmware, you should probably put
> > at least a link indicating where to get it somewhere and display a nicer
> > error code.
> 
> PIBS firmware (used on PPC970FX devkit/original Maple-D board) generates
> this property, if the board is started with dual CPUs (it can also be started
> with only one CPU selected). On the other hand SLOF firmware (used
> on JS2x blade servers) doesn't generate this property. It can be adapted
> however to generate it.

Ok.

> > Also this driver is specific to the Maple HW, you don't want it to kick
> > in and mess around on ... an Apple G5 for example. So stick somewhere a
> >
> > 	if (!machine_is(maple))
> > 		return 0;
> >
> >> +	printk(KERN_INFO "Registering G5 CPU frequency driver\n");
> >
> > s/G5/Maple
> 
> Hmmm. I'm actually thinking about doing it the other way: as this driver
> is mostly c&p of PowerMac G5 driver, as we are moving those from
> arch/powerpc to drivers/cpufreq, maybe I should merge two drivers (this
> one with cpufreq_64 from powermac)?

If you feel like it :-) The powermac one has quite a bit more plumbing
for voltage control etc... but it does make sense in the long run.

Maybe start with getting that maple driver in, and -then- merge ?

> >> +	printk(KERN_INFO "Frequency method: SCOM, Voltage method: none\n");
> >
> > This is useless.
> 
> Leftover from powermac thing.

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH 2/2] Add cpufreq driver for Momentum Maple boards
From: Dmitry Eremin-Solenikov @ 2011-06-29  8:40 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Dave Jones, Paul Mackerras, linuxppc-dev, cpufreq
In-Reply-To: <1309318110.32158.520.camel@pasglop>

On 6/29/11, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> Before I comment on this last one, a quick Q. for Dave: Do you want to
> handle this or should I merge it via powerpc.git ? (It depends on
> another change to the arch code to expose the SCOM functions that it
> uses, and that patch is going to be in my -next branch).
>
> Now some remaining small nits:
>
> On Fri, 2011-06-17 at 17:10 +0400, Dmitry Eremin-Solenikov wrote:
>> Add simple cpufreq driver for Maple-based boards (ppc970fx evaluation
>> kit and others). Driver is based on a cpufreq driver for 64-bit powermac
>> boxes with all pmac-dependant features removed and simple cleanup
>> applied.
>>
>> Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
>> ---
>>  drivers/cpufreq/Kconfig         |    5 +
>>  drivers/cpufreq/Kconfig.powerpc |    7 +
>>  drivers/cpufreq/Makefile        |    5 +
>>  drivers/cpufreq/maple-cpufreq.c |  314
>> +++++++++++++++++++++++++++++++++++++++
>
> If we're going to have a Kconfig.powerpc, should we maybe just have a
> powerpc subdirectory instead with the driver in it ?
>
> I'm happy at some later point to try moving some of my other ones there.

As Dave also isn't sure about subdirs, should I create a cpufreq/powerpc
directory or not?

>
>  .../...
>
>> +	/* Look for the powertune data in the device-tree */
>> +	maple_pmode_data = of_get_property(cpunode, "power-mode-data", &psize);
>> +	if (!maple_pmode_data) {
>> +		DBG("No power-mode-data !\n");
>> +		goto bail_noprops;
>> +	}
>> +	maple_pmode_max = psize / sizeof(u32) - 1;
>
> Do you get that property in your device-tree ? Or have you modified your
> firmware ? If that requires a modified firmware, you should probably put
> at least a link indicating where to get it somewhere and display a nicer
> error code.

PIBS firmware (used on PPC970FX devkit/original Maple-D board) generates
this property, if the board is started with dual CPUs (it can also be started
with only one CPU selected). On the other hand SLOF firmware (used
on JS2x blade servers) doesn't generate this property. It can be adapted
however to generate it.

> Also this driver is specific to the Maple HW, you don't want it to kick
> in and mess around on ... an Apple G5 for example. So stick somewhere a
>
> 	if (!machine_is(maple))
> 		return 0;
>
>> +	printk(KERN_INFO "Registering G5 CPU frequency driver\n");
>
> s/G5/Maple

Hmmm. I'm actually thinking about doing it the other way: as this driver
is mostly c&p of PowerMac G5 driver, as we are moving those from
arch/powerpc to drivers/cpufreq, maybe I should merge two drivers (this
one with cpufreq_64 from powermac)?

>> +	printk(KERN_INFO "Frequency method: SCOM, Voltage method: none\n");
>
> This is useless.

Leftover from powermac thing.

-- 
With best wishes
Dmitry

^ permalink raw reply
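
The "power-mode-data" lookup discussed above derives the highest mode index from the property length. A minimal user-space sketch of that arithmetic follows, with guards for a missing or malformed property. The function name and the "at least two modes" rule are assumptions made for illustration and are not taken from the posted driver.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: given a property made of u32 cells and its
 * length in bytes, return the highest valid power-mode index, or -1
 * if the property cannot be used.
 */
static int pmode_max_from_size(const void *prop, size_t psize)
{
	size_t cells;

	if (!prop)
		return -1;	/* property absent, e.g. firmware did not generate it */
	if (psize % sizeof(uint32_t) != 0)
		return -1;	/* not a whole number of u32 cells */

	cells = psize / sizeof(uint32_t);
	if (cells < 2)
		return -1;	/* assumption: need two modes to switch between */

	return (int)(cells - 1);	/* same result as psize / sizeof(u32) - 1 */
}
```

With an 8-byte property this yields a maximum index of 1, matching what the quoted `psize / sizeof(u32) - 1` computes, while an empty property is rejected instead of wrapping around.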

* Re: [PATCH 2/4] dma-mapping: add get_required_mask if arch overrides default
From: FUJITA Tomonori @ 2011-06-29  8:19 UTC (permalink / raw)
  To: nacc; +Cc: vinod.koul, linux-kernel, miltonm, dan.j.williams, linuxppc-dev
In-Reply-To: <1308942325-4813-3-git-send-email-nacc@us.ibm.com>

On Fri, 24 Jun 2011 12:05:23 -0700
Nishanth Aravamudan <nacc@us.ibm.com> wrote:

> From: Milton Miller <miltonm@bga.com>
> 
> If an architecture sets ARCH_HAS_DMA_GET_REQUIRED_MASK and has settable
> dma_map_ops, the required mask may change by the ops implementation.
> For example, a system that always has an mmu inline may only require 32
> bits while a swiotlb would desire bits to cover all of memory.
> 
> Therefore add the field if the architecture does not use the generic
> definition of dma_get_required_mask. The first use will be by powerpc.
> Note that this does add some dependency on the order in which files are
> visible here.
> 
> Signed-off-by:  Milton Miller <miltonm@bga.com>
> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-kernel@vger.kernel.org
> Cc: benh@kernel.crashing.org
> ---
>  include/linux/dma-mapping.h |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index ba8319a..d0e023b 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -49,6 +49,9 @@ struct dma_map_ops {
>  	int (*mapping_error)(struct device *dev, dma_addr_t dma_addr);
>  	int (*dma_supported)(struct device *dev, u64 mask);
>  	int (*set_dma_mask)(struct device *dev, u64 mask);
> +#ifdef ARCH_HAS_DMA_GET_REQUIRED_MASK
> +	u64 (*get_required_mask)(struct device *dev);
> +#endif
>  	int is_phys;
>  };

If you add get_required_mask to dma_map_ops, we should clean up ia64
too and implement the proper generic version in
dma-mapping-common.h. Then we can kill the ARCH_HAS_DMA_GET_REQUIRED_MASK
ifdef hack. Otherwise, I don't think it makes sense to add this to
dma_map_ops.

^ permalink raw reply
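
One way to picture the generic version suggested above is a single entry point that calls the per-ops hook when one is set and otherwise falls back to a default calculation. The sketch below is user-space C with stand-in types. Only the `get_required_mask` member name comes from the quoted patch; everything else (the struct layouts, `max_phys_addr`, the function names) is hypothetical and not the kernel's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the kernel types; not the real definitions. */
struct device;

struct dma_map_ops {
	/* Optional hook; NULL means "use the generic calculation". */
	uint64_t (*get_required_mask)(struct device *dev);
};

struct device {
	const struct dma_map_ops *ops;
	uint64_t max_phys_addr;	/* highest address the device may need to reach */
};

/* Smallest all-ones mask that covers addr. */
static uint64_t mask_covering(uint64_t addr)
{
	uint64_t mask = 0;

	while (mask < addr)
		mask = (mask << 1) | 1;
	return mask;
}

/* Generic entry point: prefer the per-ops hook, else fall back. */
static uint64_t dma_get_required_mask_sketch(struct device *dev)
{
	assert(dev != NULL);
	if (dev->ops && dev->ops->get_required_mask)
		return dev->ops->get_required_mask(dev);
	return mask_covering(dev->max_phys_addr);
}

/* Example hook for a setup where an IOMMU is always in line: 32 bits suffice. */
static uint64_t iommu_required_mask(struct device *dev)
{
	(void)dev;
	return 0xffffffffULL;
}
```

With this shape no `#ifdef` is needed in the ops structure: an architecture that has nothing special to say leaves the hook NULL.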

* [git pull] Please pull powerpc.git merge branch
From: Benjamin Herrenschmidt @ 2011-06-29  8:13 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linuxppc-dev list, Andrew Morton, Linux Kernel list

Hi Linus !

Here are a handful of minor powerpc bits for 3.0

Note: At the time of this sending, the mirrors still hadn't caught up.

Cheers,
Ben.


The following changes since commit b0af8dfdd67699e25083478c63eedef2e72ebd85:

  Linux 3.0-rc5 (2011-06-27 19:12:22 -0700)

are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc.git merge

Christian Dietrich (2):
      powerpc/rtas-rtc: remove sideeffects of printk_ratelimit
      arch/powerpc: use printk_ratelimited instead of printk_ratelimit

Michael Neuling (1):
      powerpc/pseries: remove duplicate SCSI_BNX2_ISCSI in pseries_defconfig

Scott Wood (1):
      powerpc/e500: fix breakage with fsl_rio_mcheck_exception

Shaohui Xie (1):
      powerpc/85xx: fix NAND_CMD_READID read bytes number

Timur Tabi (1):
      powerpc/p1022ds: fix audio-related properties in the device tree

 arch/powerpc/boot/dts/p1022ds.dts      |    9 +++--
 arch/powerpc/configs/pseries_defconfig |    1 -
 arch/powerpc/kernel/rtas-rtc.c         |   29 +++++++++-------
 arch/powerpc/kernel/signal_32.c        |   57 +++++++++++++++++--------------
 arch/powerpc/kernel/signal_64.c        |   17 +++++----
 arch/powerpc/kernel/traps.c            |   24 ++++++-------
 arch/powerpc/mm/fault.c                |   10 +++---
 arch/powerpc/sysdev/fsl_rio.c          |   33 +++++++++---------
 arch/powerpc/sysdev/mpic.c             |   11 +++---
 drivers/mtd/nand/fsl_elbc_nand.c       |    6 ++--
 10 files changed, 104 insertions(+), 93 deletions(-)

^ permalink raw reply

* Re: [PATCH v2] powerpc/book3e-64: use a separate TLB handler when linear map is bolted
From: Benjamin Herrenschmidt @ 2011-06-29  7:50 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc-dev
In-Reply-To: <20110622212542.GA23089@schlenkerla.am.freescale.net>

On Wed, 2011-06-22 at 16:25 -0500, Scott Wood wrote:
> On MMUs such as FSL where we can guarantee the entire linear mapping is
> bolted, we don't need to worry about linear TLB misses.  If on top of
> that we do a full table walk, we get rid of all recursive TLB faults, and
> can dispense with some state saving.  This gains a few percent on
> TLB-miss-heavy workloads, and around 50% on a benchmark that had a high
> rate of virtual page table faults under the normal handler.
> 
> While touching the EX_TLB layout, remove EX_TLB_MMUCR0, EX_TLB_SRR0, and
> EX_TLB_SRR1 as they're not used.

I merged that into -next, but it was breaking 64K pages on WSP, so I had to
add an ifdef in there to skip the PUD level when walking the page tables
(PUD_SHIFT isn't defined for asm when doing 64K pages).

Please check I didn't break anything.

Cheers,
Ben.

^ permalink raw reply

* Re: [BUG?]3.0-rc4+ftrace+kprobe: set kprobe at instruction 'stwu' lead to system crash/freeze
From: Ananth N Mavinakayanahalli @ 2011-06-29  6:46 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Jim Keniston, linux-kernel, Steven Rostedt, paulus,
	yrl.pp-manager.tt, Masami Hiramatsu, linuxppc-dev
In-Reply-To: <BANLkTimYej4_dmBqvPBCLej=JA5atLrZVA@mail.gmail.com>

On Wed, Jun 29, 2011 at 02:23:28PM +0800, Yong Zhang wrote:
> On Mon, Jun 27, 2011 at 6:01 PM, Ananth N Mavinakayanahalli
> <ananth@in.ibm.com> wrote:
> > On Sun, Jun 26, 2011 at 11:47:13PM +0900, Masami Hiramatsu wrote:
> >> (2011/06/24 19:29), Steven Rostedt wrote:
> >> > On Fri, 2011-06-24 at 17:21 +0800, Yong Zhang wrote:
> >> >> Hi,
> >> >>
> >> >> When I use kprobe to do something, I found some weird thing.
> >> >>
> >> >> When CONFIG_FUNCTION_TRACER is disabled:
> >> >> (gdb) disassemble do_fork
> >> >> Dump of assembler code for function do_fork:
> >> >>    0xc0037390 <+0>:        mflr    r0
> >> >>    0xc0037394 <+4>:        stwu    r1,-64(r1)
> >> >>    0xc0037398 <+8>:        mfcr    r12
> >> >>    0xc003739c <+12>:       stmw    r27,44(r1)
> >> >>
> >> >> Then I:
> >> >> modprobe kprobe_example func=do_fork offset=4
> >> >> ls
> >> >> Things works well.
> >> >>
> >> >> But when CONFIG_FUNCTION_TRACER is enabled:
> >> >> (gdb) disassemble do_fork
> >> >> Dump of assembler code for function do_fork:
> >> >>    0xc0040334 <+0>:        mflr    r0
> >> >>    0xc0040338 <+4>:        stw     r0,4(r1)
> >> >>    0xc004033c <+8>:        bl      0xc00109d4 <mcount>
> >> >>    0xc0040340 <+12>:       stwu    r1,-80(r1)
> >> >>    0xc0040344 <+16>:       mflr    r0
> >> >>    0xc0040348 <+20>:       stw     r0,84(r1)
> >> >>    0xc004034c <+24>:       mfcr    r12
> >> >> Then I:
> >> >> modprobe kprobe_example func=do_fork offset=12
> >> >> ls
> >> >> 'ls' will never return. The system freezes.
> >> >
> >> > I'm not sure if x86 had a similar issue.
> >> >
> >> > Masami, have any ideas to why this happened?
> >>
> >> No, I'm not familiar with the ppc implementation. I guess
> >> that single-step resume code failed to emulate the
> >> instruction, but it strongly depends on ppc arch.
> >> Maybe IBM people may know what happened.
> >>
> >> Ananth, Jim, would you have any ideas?
> >
> > On powerpc, we emulate sstep whenever possible. Only recently support to
> > emulate loads and stores got added. I don't have access to a powerpc box
> > today... but will try to recreate the problem ASAP and see what could be
> > happening in the presence of mcount.
> 
> After taking more testing on it, it looks like the issue doesn't
> depend on mcount
> (AKA. CONFIG_FUNCTION_TRACER)
> 
> As I said in the first email, with eldk-5.0 CONFIG_FUNCTION_TRACER=n
> will work well.
> 
> But when I'm using eldk-4.2[1], both will fail. But the funny thing is when I
> set kprobe at several functions some works fine but some will fail. For example,
> at this time do_fork() works well, but show_interrupt() will crash.

Certain functions are off limits for probing -- look for __kprobes
annotations in the kernel. Some such functions are arch specific, but
show_interrupts() would definitely not be one of them. It works fine on
my (64bit) test box.

At this time, I think your best bet is to work with the eldk folks to
narrow down the problem. Given the current set of data, I am inclined to
think it could be an eldk bug, not a kernel one.

Ananth

^ permalink raw reply

* Re: [BUG?]3.0-rc4+ftrace+kprobe: set kprobe at instruction 'stwu' lead to system crash/freeze
From: Yong Zhang @ 2011-06-29  6:41 UTC (permalink / raw)
  To: ananth
  Cc: Jim Keniston, linux-kernel, Steven Rostedt, paulus,
	yrl.pp-manager.tt, Masami Hiramatsu, linuxppc-dev
In-Reply-To: <20110628104128.GA4310@in.ibm.com>

On Tue, Jun 28, 2011 at 6:41 PM, Ananth N Mavinakayanahalli
<ananth@in.ibm.com> wrote:
>
> My access to a 32bit powerpc box is very limited. Also, embedded powerpc
> has had issues with gcc-4.6 while gcc-4.5 worked fine.

I think I can do some test if you have any ideas :)

>
>> > > I'm not sure if x86 had a similar issue.
>> > >
>> > > Masami, have any ideas to why this happened?
>> >
>> > No, I'm not familiar with the ppc implementation. I guess
>> > that single-step resume code failed to emulate the
>> > instruction, but it strongly depends on ppc arch.
>> > Maybe IBM people may know what happened.
>> >
>> > Ananth, Jim, would you have any ideas?
>>
>> On powerpc, we emulate sstep whenever possible. Only recently support to
>> emulate loads and stores got added. I don't have access to a powerpc box
>> today... but will try to recreate the problem ASAP and see what could be
>> happening in the presence of mcount.
>
> I tried to recreate this problem on a 64-bit pSeries box without
> success. Every one of the instructions in the stream at .do_fork are
> emulated and work fine there -- no hangs/crashes with or without
> function tracer.
>
> Yong,
> I am copying Kumar to see if he knows of any issues with 32-bit kprobes
> (he wrote it) or with the function tracer, or with the toolchain itself.
>
> You may want to check if, in the failure case, the instruction in
> question is single-stepped or emulated (print out the value of
> kprobe->ainsn.boostable in the post_handler)

It's emulated:
root@unknown:/root> insmod kprobe_example.ko func=show_interrupts
Planted kprobe at c009be18
root@unknown:/root> cat /proc/interrupts
pre_handler: p->addr = 0xc009be18, nip = 0xc009be18, msr = 0x29000
post_handler: p->addr = 0xc009be18, msr = 0x29000,boostable = 1

Since commit 0016a4cf5582415849fafbf9f019dd9530824789 almost all
of the instructions are emulated.

But if we disable the emulation of stwu (so that it gets single-stepped
instead), like below:
diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index 9a52349..07f0d4a 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -1486,7 +1486,7 @@ int __kprobes emulate_step(struct pt_regs *regs, unsigned int instr)
 		goto ldst_done;

 	case 36:	/* stw */
-	case 37:	/* stwu */
+	//case 37:	/* stwu */
 		val = regs->gpr[rd];
 		err = write_mem(val, dform_ea(instr, regs), 4, regs);
 		goto ldst_done;

The system will crash after the single-step (it looks like the stack is
corrupted, judging from the preempt_count value in
'cat/617/0x0000020a'):

pre_handler: p->addr = 0xc00ab12c, nip = 0xc00ab12c, msr = 0x29000
post_handler: p->addr = 0xc00ab12c, msr = 0x1000,boostable = -1
pre_handler: p->addr = 0xc00ab12c, nip = 0xc00ab12c, msr = 0x29000
post_handler: p->addr = 0xc00ab12c, msr = 0x1000,boostable = -1
BUG: scheduling while atomic: cat/617/0x0000020a
Modules linked in: kprobe_example [last unloaded: kprobe_example]
Call Trace:
[df157e90] [c00087c0] show_stack+0x98/0x1e4 (unreliable)
[df157ee0] [c0008938] dump_stack+0x2c/0x44
[df157ef0] [c00377c0] __schedule_bug+0x6c/0x84
[df157f00] [c060a364] schedule+0x398/0x48c
[df157f40] [c00107f4] recheck+0x0/0x24
--- Exception: c01 at 0xff1bbb8
    LR = 0x1000310c
Page fault in user mode with in_atomic() = 1 mm = df01c700
NIP = ff29314  MSR = 2d000
Oops: Weird page fault, sig: 11 [#1]
PREEMPT MPC8536 DS
Modules linked in: kprobe_example [last unloaded: kprobe_example]
NIP: 0ff29314 LR: 10001944 CTR: 0ff29314
REGS: df157f50 TRAP: 0401   Tainted: G        W   (3.0.0-rc4-00001-ge8ffcca-dirty)
MSR: 0002d000 <EE,PR,ME,CE>  CR: 88202682  XER: 20000000
TASK = df237190[617] 'cat' THREAD: df156000
GPR00: 100018b4 bfb5c060 48007ee0 00000000 0000000e 10004354 bfb5ccde 0ff1af28
GPR08: 0202d000 48000ee8 00000000 0ff29314 1000192c
NIP [0ff29314] 0xff29314
LR [10001944] 0x10001944
Call Trace:
Kernel panic - not syncing: Fatal exception in interrupt
Call Trace:
[df157da0] [c00087c0] show_stack+0x98/0x1e4 (unreliable)
[df157df0] [c0008938] dump_stack+0x2c/0x44
[df157e00] [c0042a80] panic+0xc4/0x1f4
[df157e60] [c000c4e0] die+0x1fc/0x22c
[df157e90] [c060e4a4] do_page_fault+0x130/0x4c4
[df157f40] [c00100fc] handle_page_fault+0xc/0x80
--- Exception: 401 at 0xff29314
    LR = 0x10001944

Thanks,
Yong


-- 
Only stand for myself

^ permalink raw reply related
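
For readers unfamiliar with the two opcodes touched by the diff above, here is a toy user-space model of how a D-form stw (opcode 36) and stwu (opcode 37) differ: both store a word at rA + displacement, and stwu additionally writes the effective address back into rA. This is only a sketch. The toy register file, the toy memory and all function names are hypothetical, and it is not the kernel's emulate_step().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_MEM_SIZE 256	/* hypothetical toy memory, not kernel memory */

struct toy_regs {
	uint32_t gpr[32];
	uint8_t mem[TOY_MEM_SIZE];
};

/* D-form fields: primary opcode, source register, base register. */
static unsigned int d_opcode(uint32_t instr) { return instr >> 26; }
static unsigned int d_rs(uint32_t instr) { return (instr >> 21) & 0x1f; }
static unsigned int d_ra(uint32_t instr) { return (instr >> 16) & 0x1f; }

/* Sign-extended 16-bit displacement. */
static int32_t d_disp(uint32_t instr)
{
	int32_t d = (int32_t)(instr & 0xffff);

	return (d & 0x8000) ? d - 0x10000 : d;
}

/*
 * Emulate stw or stwu on the toy state.
 * Returns 1 if emulated, 0 if the instruction is not handled here,
 * -1 if the effective address falls outside the toy memory.
 */
static int toy_emulate_store(struct toy_regs *regs, uint32_t instr)
{
	unsigned int op = d_opcode(instr);
	unsigned int rs = d_rs(instr);
	unsigned int ra = d_ra(instr);
	uint32_t ea, val;

	assert(regs != NULL);
	if (op != 36 && op != 37)
		return 0;
	if (op == 37 && ra == 0)
		return 0;	/* stwu with rA = 0 is an invalid form */

	/* For stw, rA = 0 means a literal zero base. */
	ea = (ra ? regs->gpr[ra] : 0) + (uint32_t)d_disp(instr);
	if (ea > TOY_MEM_SIZE - 4)
		return -1;

	val = regs->gpr[rs];	/* read the value before rA is updated */
	regs->mem[ea] = (uint8_t)(val >> 24);	/* big-endian store */
	regs->mem[ea + 1] = (uint8_t)(val >> 16);
	regs->mem[ea + 2] = (uint8_t)(val >> 8);
	regs->mem[ea + 3] = (uint8_t)val;

	if (op == 37)
		regs->gpr[ra] = ea;	/* the "update" part of stwu */
	return 1;
}
```

Feeding it 0x9421ffc0, the encoding of `stwu r1,-64(r1)`, with r1 = 128 stores the old r1 at address 64 and leaves r1 = 64, which is why this instruction is the usual way to push a new stack frame.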

* Re: [BUG?]3.0-rc4+ftrace+kprobe: set kprobe at instruction 'stwu' lead to system crash/freeze
From: Yong Zhang @ 2011-06-29  6:23 UTC (permalink / raw)
  To: ananth
  Cc: Jim Keniston, linux-kernel, Steven Rostedt, paulus,
	yrl.pp-manager.tt, Masami Hiramatsu, linuxppc-dev
In-Reply-To: <20110627100104.GA24705@in.ibm.com>

On Mon, Jun 27, 2011 at 6:01 PM, Ananth N Mavinakayanahalli
<ananth@in.ibm.com> wrote:
> On Sun, Jun 26, 2011 at 11:47:13PM +0900, Masami Hiramatsu wrote:
>> (2011/06/24 19:29), Steven Rostedt wrote:
>> > On Fri, 2011-06-24 at 17:21 +0800, Yong Zhang wrote:
>> >> Hi,
>> >>
>> >> When I use kprobe to do something, I found some weird thing.
>> >>
>> >> When CONFIG_FUNCTION_TRACER is disabled:
>> >> (gdb) disassemble do_fork
>> >> Dump of assembler code for function do_fork:
>> >>    0xc0037390 <+0>:        mflr    r0
>> >>    0xc0037394 <+4>:        stwu    r1,-64(r1)
>> >>    0xc0037398 <+8>:        mfcr    r12
>> >>    0xc003739c <+12>:       stmw    r27,44(r1)
>> >>
>> >> Then I:
>> >> modprobe kprobe_example func=do_fork offset=4
>> >> ls
>> >> Things works well.
>> >>
>> >> But when CONFIG_FUNCTION_TRACER is enabled:
>> >> (gdb) disassemble do_fork
>> >> Dump of assembler code for function do_fork:
>> >>    0xc0040334 <+0>:        mflr    r0
>> >>    0xc0040338 <+4>:        stw     r0,4(r1)
>> >>    0xc004033c <+8>:        bl      0xc00109d4 <mcount>
>> >>    0xc0040340 <+12>:       stwu    r1,-80(r1)
>> >>    0xc0040344 <+16>:       mflr    r0
>> >>    0xc0040348 <+20>:       stw     r0,84(r1)
>> >>    0xc004034c <+24>:       mfcr    r12
>> >> Then I:
>> >> modprobe kprobe_example func=do_fork offset=12
>> >> ls
>> >> 'ls' will never return. The system freezes.
>> >
>> > I'm not sure if x86 had a similar issue.
>> >
>> > Masami, have any ideas to why this happened?
>>
>> No, I'm not familiar with the ppc implementation. I guess
>> that single-step resume code failed to emulate the
>> instruction, but it strongly depends on ppc arch.
>> Maybe IBM people may know what happened.
>>
>> Ananth, Jim, would you have any ideas?
>
> On powerpc, we emulate sstep whenever possible. Only recently support to
> emulate loads and stores got added. I don't have access to a powerpc box
> today... but will try to recreate the problem ASAP and see what could be
> happening in the presence of mcount.

After doing more testing, it looks like the issue doesn't depend on mcount
(a.k.a. CONFIG_FUNCTION_TRACER).

As I said in the first email, with eldk-5.0 CONFIG_FUNCTION_TRACER=n
will work well.

But when I'm using eldk-4.2[1], both will fail. The funny thing is that when I
set kprobes at several functions, some work fine but some fail. For example,
at this time do_fork() works well, but show_interrupts() will crash.

root@unknown:/root> insmod kprobe_example.ko func=show_interrupts
Planted kprobe at c009be18
root@unknown:/root> cat /proc/interrupts
pre_handler: p->addr = 0xc009be18, nip = 0xc009be18, msr = 0x29000
post_handler: p->addr = 0xc009be18, msr = 0x29000,boostable = 1
Oops: Exception in kernel mode, sig: 11 [#1]
PREEMPT MPC8536 DS
Modules linked in: kprobe_example
NIP: df159e74 LR: c0106f40 CTR: c009be18
REGS: df159d90 TRAP: 0700   Not tainted  (3.0.0-rc4-00001-ge8ffcca-dirty)
MSR: 00029000 <EE,ME,CE>  CR: 20202688  XER: 00000000
TASK = dfaa5340[613] 'cat' THREAD: df158000
GPR00: fffff000 df159e40 dfaa5340 df024a00 df159e78 00000000 df159f20 00000001
GPR08: c10060d0 c009be18 00029000 df159e70 00000000 1001ca74 1ffb5f00 100a01cc
GPR16: 00000000 00000000 00000000 00000000 df024a28 df159f20 00000000 dfbff080
GPR24: 10016000 00001000 df159f20 df159e78 dfbff080 df159e78 df024a00 df159e70
NIP [df159e74] 0xdf159e74
LR [c0106f40] seq_read+0x2a4/0x568
Call Trace:
[df159e40] [00029000] 0x29000 (unreliable)
[df159e74] [00000000]   (null)
Instruction dump:
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
---[ end trace 60026bfc1fe79aed ]---
Segmentation fault

Thanks,
Yong

[1]: http://ftp.denx.de/pub/eldk/4.2/

-- 
Only stand for myself

^ permalink raw reply

* Re: [PATCH 2/2] mtd/nand : workaround for Freescale FCM to support large-page Nand chip
From: Artem Bityutskiy @ 2011-06-29  6:22 UTC (permalink / raw)
  To: b35362; +Cc: linuxppc-dev, dwmw2, linux-mtd
In-Reply-To: <1309225852-1664-2-git-send-email-b35362@freescale.com>

On Tue, 2011-06-28 at 09:50 +0800, b35362@freescale.com wrote:
> +	/* Hack for supporting the flash chip whose writesize is
> +	 * larger than 2K bytes.
> +	 */

Please, use proper kernel multi-line comments. Please, make sure
checkpatch.pl does not generate 13 errors with this patch.

-- 
Best Regards,
Artem Bityutskiy

^ permalink raw reply

* Re: [PATCH 1/2] mtd/nand : don't free the global data fsl_lbc_ctrl_dev->nand in fsl_elbc_chip_remove()
From: Artem Bityutskiy @ 2011-06-29  6:20 UTC (permalink / raw)
  To: b35362; +Cc: linuxppc-dev, dwmw2, linux-mtd
In-Reply-To: <1309225852-1664-1-git-send-email-b35362@freescale.com>

On Tue, 2011-06-28 at 09:50 +0800, b35362@freescale.com wrote:
> From: Liu Shuo <b35362@freescale.com>
> 
> The global data fsl_lbc_ctrl_dev->nand don't have to be freed in
> fsl_elbc_chip_remove(). The right place to do that is in fsl_elbc_nand_remove()
> if elbc_fcm_ctrl->counter is zero.
> 
> Signed-off-by: Liu Shuo <b35362@freescale.com>
> ---
>  drivers/mtd/nand/fsl_elbc_nand.c |    1 -
>  1 files changed, 0 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/mtd/nand/fsl_elbc_nand.c b/drivers/mtd/nand/fsl_elbc_nand.c
> index 0bb254c..a212116 100644
> --- a/drivers/mtd/nand/fsl_elbc_nand.c
> +++ b/drivers/mtd/nand/fsl_elbc_nand.c
> @@ -829,7 +829,6 @@ static int fsl_elbc_chip_remove(struct fsl_elbc_mtd *priv)
>  
>  	elbc_fcm_ctrl->chips[priv->bank] = NULL;
>  	kfree(priv);
> -	kfree(elbc_fcm_ctrl);
>  	return 0;
>  }

Do we have to assign fsl_lbc_ctrl_dev->nand to NULL in
fsl_elbc_nand_remove() then? I think that assignment can be killed then.

        if (!elbc_fcm_ctrl->counter) {
                fsl_lbc_ctrl_dev->nand = NULL;
                kfree(elbc_fcm_ctrl);
        }

-- 
Best Regards,
Artem Bityutskiy

^ permalink raw reply
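
The ownership rule being discussed (per-chip teardown frees only per-chip state, and the shared controller data goes away with the last user) can be sketched in plain user-space C as below. All names are hypothetical stand-ins, with `shared_ctrl` playing the role of `fsl_lbc_ctrl_dev->nand`. In this toy the global pointer is reset because the probe path tests it; whether the real driver needs that assignment is exactly the question raised above.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_ctrl {
	int counter;		/* number of chips using the shared state */
};

struct toy_chip {
	struct toy_ctrl *ctrl;
};

static struct toy_ctrl *shared_ctrl;	/* stand-in for the global pointer */

/* Attach a chip, allocating the shared state on first use. */
static struct toy_chip *toy_chip_probe(void)
{
	struct toy_chip *chip = malloc(sizeof(*chip));

	if (!chip)
		return NULL;
	if (!shared_ctrl) {
		shared_ctrl = calloc(1, sizeof(*shared_ctrl));
		if (!shared_ctrl) {
			free(chip);
			return NULL;
		}
	}
	shared_ctrl->counter++;
	chip->ctrl = shared_ctrl;
	return chip;
}

/* Per-chip teardown: frees only the per-chip state. */
static void toy_chip_remove(struct toy_chip *chip)
{
	free(chip);
}

/* Device teardown: drop one reference, free shared state on the last one. */
static void toy_nand_remove(struct toy_chip *chip)
{
	struct toy_ctrl *ctrl;

	assert(chip != NULL && chip->ctrl != NULL && chip->ctrl->counter > 0);
	ctrl = chip->ctrl;
	toy_chip_remove(chip);
	if (--ctrl->counter == 0) {
		shared_ctrl = NULL;
		free(ctrl);
	}
}
```

Freeing the shared state from the per-chip path, as the code did before the patch, would leave any remaining chip with a dangling `ctrl` pointer.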

* Re: [RFC][PATCH] Kexec support for PPC440x
From: Suzuki Poulose @ 2011-06-29  5:38 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: linux ppc dev, kexec@lists.infradead.org, lkml
In-Reply-To: <4DE8E746.2080100@linutronix.de>

On 06/03/11 19:23, Sebastian Andrzej Siewior wrote:
> Suzuki Poulose wrote:
>>>> The way you setup the 1:1 mapping should be close to what you are doing on
>>>> kernel entry.Isn't it possible to include the file here and in the entry
>>>> code?
>>
>>> I will make this change and resend the patch.
>>
>> I took a look at the way we do it at kernel entry. It looks more cleaner to leave
>> it untouched. Especially, when we add the support for 47x in the future, the code
>> will become more unreadable.
>>
>> What do you think ?
>
> So the entry code has one 256MiB mapping, you need 8 of those. Entry goes for TLB 63 and you need to be flexible and avoid TLB 63 :).
> So after all you don't have that much in common with the entry code. If
> you look at the FSL-book code then you will notice that I tried to share
> some code.
>
> I don't understand why you don't flip the address space bit. On fsl we
> setup the tmp mapping in the "other address" space so we don't have two
> mappings for the same address. The entry code could be doing this with STS
> bit, I'm not sure.

I am not sure if I understood this correctly.
Could you explain how there could be two mappings for the same address?
We are setting up 1:1 mapping for 0-2GiB and the only mapping that could exist
(in other words, not invalidated) is PAGE_OFFSET mapping. Since PAGE_OFFSET < 2GiB
we won't have multiple mappings. Or in other words we could limit KEXEC_*_MEMORY_LIMIT
to PAGE_OFFSET to make sure the crossing doesn't occur.

The kernel entry code sets up the mapping without a tmp mapping in 44x. i.e, it uses
the mapping setup by the firmware/boot loader.

Thanks
Suzuki

^ permalink raw reply

* Re: [PATCH 2/2] Add cpufreq driver for Momentum Maple boards
From: Dave Jones @ 2011-06-29  3:43 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Dmitry Eremin-Solenikov, Paul Mackerras, linuxppc-dev, cpufreq
In-Reply-To: <1309318110.32158.520.camel@pasglop>

On Wed, Jun 29, 2011 at 01:28:30PM +1000, Ben Herrenschmidt wrote:
 > Before I comment on this last one, a quick Q. for Dave: Do you want to
 > handle this or should I merge it via powerpc.git ? (It depends on
 > another change to the arch code to expose the SCOM functions that it
 > uses, and that patch is going to be in my -next branch).

If you're carrying the dependency, it sounds like it would make more sense
for you to carry this too. There are some changes to the Kconfig/Makefile
in drivers/cpufreq in my tree for 3.1 already, so you might get a collision
when both trees end up in next & subsequently Linus' tree. Just trivial changes though. 

 > > ---
 > >  drivers/cpufreq/Kconfig         |    5 +
 > >  drivers/cpufreq/Kconfig.powerpc |    7 +
 > >  drivers/cpufreq/Makefile        |    5 +
 > >  drivers/cpufreq/maple-cpufreq.c |  314 +++++++++++++++++++++++++++++++++++++++
 > 
 > If we're going to have a Kconfig.powerpc, should we maybe just have a
 > powerpc subdirectory instead with the driver in it ?
 > 
 > I'm happy at some later point to try moving some of my other ones there.

So far we haven't bothered with additional subarch drivers/ directories for x86/arm.
I'm not against the idea. As more archs move over, I could see drivers/cpufreq/
getting more cluttered.

	Dave

^ permalink raw reply

* Re: [PATCH 1/2] mm: Move definition of MIN_MEMORY_BLOCK_SIZE to a header
From: Benjamin Herrenschmidt @ 2011-06-29  3:40 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org
  Cc: linux-mm@kvack.org, Ingo Molnar, linuxppc-dev, Thomas Gleixner
In-Reply-To: <1308013070.2874.784.camel@pasglop>

On Tue, 2011-06-14 at 10:57 +1000, Benjamin Herrenschmidt wrote:
> The macro MIN_MEMORY_BLOCK_SIZE is currently defined twice in two .c
> files, and I need it in a third one to fix a powerpc bug, so let's
> first move it into a header
> 
> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> ---
> 
> Ingo, Thomas: Who needs to ack the x86 bit ? I'd like to send that
> to Linus asap with the powerpc fix.

Anybody ? Allo ?

I'm happy to send that to Linus myself but I'd like at least one or two
acks :-)

Cheers,
Ben.

> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index d865c4a..bbaaa00 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -28,6 +28,7 @@
>  #include <linux/poison.h>
>  #include <linux/dma-mapping.h>
>  #include <linux/module.h>
> +#include <linux/memory.h>
>  #include <linux/memory_hotplug.h>
>  #include <linux/nmi.h>
>  #include <linux/gfp.h>
> @@ -895,8 +896,6 @@ const char *arch_vma_name(struct vm_area_struct *vma)
>  }
>  
>  #ifdef CONFIG_X86_UV
> -#define MIN_MEMORY_BLOCK_SIZE   (1 << SECTION_SIZE_BITS)
> -
>  unsigned long memory_block_size_bytes(void)
>  {
>  	if (is_uv_system()) {
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 9f9b235..45d7c8f 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -30,7 +30,6 @@
>  static DEFINE_MUTEX(mem_sysfs_mutex);
>  
>  #define MEMORY_CLASS_NAME	"memory"
> -#define MIN_MEMORY_BLOCK_SIZE	(1 << SECTION_SIZE_BITS)
>  
>  static int sections_per_block;
>  
> diff --git a/include/linux/memory.h b/include/linux/memory.h
> index e1e3b2b..935699b 100644
> --- a/include/linux/memory.h
> +++ b/include/linux/memory.h
> @@ -20,6 +20,8 @@
>  #include <linux/compiler.h>
>  #include <linux/mutex.h>
>  
> +#define MIN_MEMORY_BLOCK_SIZE     (1 << SECTION_SIZE_BITS)
> +
>  struct memory_block {
>  	unsigned long start_section_nr;
>  	unsigned long end_section_nr;
> 

^ permalink raw reply

* Re: [PATCH V3 2/2] cpc925_edac: support single-processor configurations
From: Benjamin Herrenschmidt @ 2011-06-29  3:35 UTC (permalink / raw)
  To: Doug Thompson
  Cc: Harry Ciao, Paul Mackerras, Dmitry Eremin-Solenikov, linuxppc-dev
In-Reply-To: <1308315107-29182-3-git-send-email-dbaryshkov@gmail.com>

On Fri, 2011-06-17 at 16:51 +0400, Dmitry Eremin-Solenikov wrote:
> If second CPU is not enabled, CPC925 EDAC driver will spill out warnings
> about errors on second Processor Interface. Support masking that out,
> by detecting at runtime which CPUs are present in device tree.

Doug ? Are you going to carry this, or should I take it via powerpc.git ? There's
a dependency on another patch that's going into powerpc-next ...

Cheers,
Ben.

> Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
> Cc: Harry Ciao <qingtao.cao@windriver.com>
> Cc: Doug Thompson <dougthompson@xmission.com>
> Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
> ---
>  drivers/edac/cpc925_edac.c |   67 ++++++++++++++++++++++++++++++++++++++++++--
>  1 files changed, 64 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/edac/cpc925_edac.c b/drivers/edac/cpc925_edac.c
> index a687a0d..a774c0d 100644
> --- a/drivers/edac/cpc925_edac.c
> +++ b/drivers/edac/cpc925_edac.c
> @@ -90,6 +90,7 @@ enum apimask_bits {
>  	ECC_MASK_ENABLE = (APIMASK_ECC_UE_H | APIMASK_ECC_CE_H |
>  			   APIMASK_ECC_UE_L | APIMASK_ECC_CE_L),
>  };
> +#define APIMASK_ADI(n)		CPC925_BIT(((n)+1))
>  
>  /************************************************************
>   *	Processor Interface Exception Register (APIEXCP)
> @@ -581,16 +582,73 @@ static void cpc925_mc_check(struct mem_ctl_info *mci)
>  }
>  
>  /******************** CPU err device********************************/
> +static u32 cpc925_cpu_mask_disabled(void)
> +{
> +	struct device_node *cpus;
> +	struct device_node *cpunode = NULL;
> +	static u32 mask = 0;
> +
> +	/* use cached value if available */
> +	if (mask != 0)
> +		return mask;
> +
> +	mask = APIMASK_ADI0 | APIMASK_ADI1;
> +
> +	cpus = of_find_node_by_path("/cpus");
> +	if (cpus == NULL) {
> +		cpc925_printk(KERN_DEBUG, "No /cpus node !\n");
> +		return 0;
> +	}
> +
> +	while ((cpunode = of_get_next_child(cpus, cpunode)) != NULL) {
> +		const u32 *reg = of_get_property(cpunode, "reg", NULL);
> +
> +		if (strcmp(cpunode->type, "cpu")) {
> +			cpc925_printk(KERN_ERR, "Not a cpu node in /cpus: %s\n", cpunode->name);
> +			continue;
> +		}
> +
> +		if (reg == NULL || *reg > 2) {
> +			cpc925_printk(KERN_ERR, "Bad reg value at %s\n", cpunode->full_name);
> +			continue;
> +		}
> +
> +		mask &= ~APIMASK_ADI(*reg);
> +	}
> +
> +	if (mask != (APIMASK_ADI0 | APIMASK_ADI1)) {
> +		/* We assume that each CPU sits on it's own PI and that
> +		 * for present CPUs the reg property equals to the PI
> +		 * interface id */
> +		cpc925_printk(KERN_WARNING,
> +				"Assuming PI id is equal to CPU MPIC id!\n");
> +	}
> +
> +	of_node_put(cpunode);
> +	of_node_put(cpus);
> +
> +	return mask;
> +}
> +
>  /* Enable CPU Errors detection */
>  static void cpc925_cpu_init(struct cpc925_dev_info *dev_info)
>  {
>  	u32 apimask;
> +	u32 cpumask;
>  
>  	apimask = __raw_readl(dev_info->vbase + REG_APIMASK_OFFSET);
> -	if ((apimask & CPU_MASK_ENABLE) == 0) {
> -		apimask |= CPU_MASK_ENABLE;
> -		__raw_writel(apimask, dev_info->vbase + REG_APIMASK_OFFSET);
> +
> +	cpumask = cpc925_cpu_mask_disabled();
> +	if (apimask & cpumask) {
> +		cpc925_printk(KERN_WARNING, "CPU(s) not present, "
> +				"but enabled in APIMASK, disabling\n");
> +		apimask &= ~cpumask;
>  	}
> +
> +	if ((apimask & CPU_MASK_ENABLE) == 0)
> +		apimask |= CPU_MASK_ENABLE;
> +
> +	__raw_writel(apimask, dev_info->vbase + REG_APIMASK_OFFSET);
>  }
>  
>  /* Disable CPU Errors detection */
> @@ -622,6 +680,9 @@ static void cpc925_cpu_check(struct edac_device_ctl_info *edac_dev)
>  	if ((apiexcp & CPU_EXCP_DETECTED) == 0)
>  		return;
>  
> +	if ((apiexcp & ~cpc925_cpu_mask_disabled()) == 0)
> +		return;
> +
>  	apimask = __raw_readl(dev_info->vbase + REG_APIMASK_OFFSET);
>  	cpc925_printk(KERN_INFO, "Processor Interface Fault\n"
>  				 "Processor Interface register dump:\n");

^ permalink raw reply

* Re: [PATCH 2/2] Add cpufreq driver for Momentum Maple boards
From: Benjamin Herrenschmidt @ 2011-06-29  3:28 UTC (permalink / raw)
  To: Dmitry Eremin-Solenikov, Dave Jones; +Cc: Paul Mackerras, linuxppc-dev, cpufreq
In-Reply-To: <1308316207-9075-2-git-send-email-dbaryshkov@gmail.com>

Before I comment on this last one, a quick Q. for Dave: Do you want to
handle this or should I merge it via powerpc.git ? (It depends on
another change to the arch code to expose the SCOM functions that it
uses, and that patch is going to be in my -next branch).

Now some remaining small nits:

On Fri, 2011-06-17 at 17:10 +0400, Dmitry Eremin-Solenikov wrote:
> Add simple cpufreq driver for Maple-based boards (ppc970fx evaluation
> kit and others). Driver is based on a cpufreq driver for 64-bit powermac
> boxes with all pmac-dependant features removed and simple cleanup
> applied.
> 
> Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
> ---
>  drivers/cpufreq/Kconfig         |    5 +
>  drivers/cpufreq/Kconfig.powerpc |    7 +
>  drivers/cpufreq/Makefile        |    5 +
>  drivers/cpufreq/maple-cpufreq.c |  314 +++++++++++++++++++++++++++++++++++++++

If we're going to have a Kconfig.powerpc, should we maybe just have a
powerpc subdirectory instead with the driver in it ?

I'm happy at some later point to try moving some of my other ones there.

 .../...

> +	/* Look for the powertune data in the device-tree */
> +	maple_pmode_data = of_get_property(cpunode, "power-mode-data", &psize);
> +	if (!maple_pmode_data) {
> +		DBG("No power-mode-data !\n");
> +		goto bail_noprops;
> +	}
> +	maple_pmode_max = psize / sizeof(u32) - 1;

Do you get that property in your device-tree ? Or have you modified your
firmware ? If that requires a modified firmware, you should probably put
at least a link indicating where to get it somewhere and display a nicer
error code.

Also this driver is specific to the Maple HW, you don't want it to kick
in and mess around on ... an Apple G5 for example. So stick somewhere a

	if (!machine_is(maple))
		return 0;

> +	printk(KERN_INFO "Registering G5 CPU frequency driver\n");

s/G5/Maple

> +	printk(KERN_INFO "Frequency method: SCOM, Voltage method: none\n");

This is useless.

Cheers,
Ben.

^ permalink raw reply

* perf_event_open system call support in powerpc
From: ashwath narasimhan @ 2011-06-29  3:03 UTC (permalink / raw)
  To: linuxppc-dev

Hello,

I am new to the powerpc architecture and I am trying to use the
perf_event_open() system call on powerpc (e500mc) with a 2.6.32 kernel.
Is this system call supported on powerpc? If yes, is there something
similar to the arch/x86/kernel/syscall_table_32.S listing for powerpc that
indicates the number for the above system call?

Thanks in advance for assisting me. Please email me at
ashwath.narasimhan@oneconvergence.com

-- 
-Ash

^ permalink raw reply

* Re: powerpc/4xx: Regression failed on sil24 (and other) drivers
From: Benjamin Herrenschmidt @ 2011-06-29  1:42 UTC (permalink / raw)
  To: Ayman El-Khashab; +Cc: cam, linuxppc-dev
In-Reply-To: <20110627113137.GA10387@crust.elkhashab.com>

On Mon, 2011-06-27 at 06:31 -0500, Ayman El-Khashab wrote:
> On Mon, Jun 27, 2011 at 08:19:56PM +1000, Benjamin Herrenschmidt wrote:
> > On Sat, 2011-06-25 at 18:52 -0500, Ayman El-Khashab wrote:
> > > I noticed during a recent development with the 460SX that a
> > > simple device that once worked stopped.  I did a bisect to
> > > find the offending commit and it turns out to be this one:
> > > 
> > > 0e52247a2ed1f211f0c4f682dc999610a368903f is the first bad
> > > commit
> > > commit 0e52247a2ed1f211f0c4f682dc999610a368903f
> > > Author: Cam Macdonell <cam@cs.ualberta.ca>
> > > Date:   Tue Sep 7 17:25:20 2010 -0700
> > > 
> > >     PCI: fix pci_resource_alignment prototype
> > > 

Ok, let's see what I can dig out of those logs (sorry for the delay)

Let's start with iomem & ioport, stripped of the legacy & common stuff:

/proc/iomem, bad:

e00000000-e7fffffff : /plb/pciex@d00000000
  e00000000-e7fffffff : 0000:40:00.0
e80000000-effffffff : /plb/pciex@d20000000
  e80000000-effffffff : 0001:80:00.0

good:

e00000000-e7fffffff : /plb/pciex@d00000000
e80000000-effffffff : /plb/pciex@d20000000
  e80000000-e800fffff : PCI Bus 0001:81
    e80000000-e80001fff : 0001:81:00.0
      e80000000-e80001fff : sata_sil24
    e80002000-e8000207f : 0001:81:00.0
      e80002000-e8000207f : sata_sil24

So now that's interesting, you have a device at 0000:40:00.0 that
appears on your first PHB in the "bad" case and doesn't show up in the
"good" case.

In addition, on the "other" PHB, the bus itself doesn't show up in the
bad case. (Let's ignore IOs and focus on mem. for now).

Let's see what led us to that from the logs. First setup before probing
is all identical. The device at 0000:40:00.0 is detected in both cases,
it's the root complex bridge. So the scanning is identical as expected.

Now the fixup/resource allocation, we start seeing some differences:

Bad:

pci 0000:40:00.0: BAR 0: assigned [mem 0xe00000000-0xe7fffffff pref]
pci 0000:40:00.0: BAR 0: set to [mem 0xe00000000-0xe7fffffff pref] (PCI address [0x80000000-0xffffffff]

vs Good:

pci 0000:40:00.0: BAR 0: can't assign mem pref (size 0x80000000)

So the "bad" case succeeds in giving out resources to the root complex,
while the "good" case fails... fun.

And similarly for the other PHB, bad:

pci 0001:80:00.0: BAR 0: assigned [mem 0xe80000000-0xeffffffff pref]
pci 0001:80:00.0: BAR 0: set to [mem 0xe80000000-0xeffffffff pref] (PCI address [0x80000000-0xffffffff]

vs good:

pci 0001:80:00.0: BAR 0: can't assign mem pref (size 0x80000000)

This then goes down to the "bad" case:

pci 0001:80:00.0: BAR 8: can't assign mem (size 0x100000)
pci 0001:80:00.0: BAR 7: assigned [io  0xfffe1000-0xfffe1fff]
pci 0001:81:00.0: BAR 2: can't assign mem (size 0x2000)
pci 0001:81:00.0: BAR 0: can't assign mem (size 0x80)

while the "good" one succeeds assigning BAR 8,2 and 0 :

pci 0001:80:00.0: BAR 8: assigned [mem 0xe80000000-0xe800fffff]
pci 0001:81:00.0: BAR 2: assigned [mem 0xe80000000-0xe80001fff 64bit]
pci 0001:81:00.0: BAR 2: set to [mem 0xe80000000-0xe80001fff 64bit] (PCI address [0x80000000-0x80001fff]
pci 0001:81:00.0: BAR 0: assigned [mem 0xe80002000-0xe8000207f 64bit]
pci 0001:81:00.0: BAR 0: set to [mem 0xe80002000-0xe8000207f 64bit] (PCI address [0x80002000-0x8000207f]

It looks to me like the "BAR 0" of the host bridges is basically taking the
resources away from the rest of the devices. Now these "BAR 0" are not bridge
window resources, which would have been OK; they are MMIO resources of the
bridge itself.

On 44x, the problem is that those bridges (stupidly) expose BARs that represent
main memory (inbound DMA). It would make sense if these weren't host bridges,
but in this case that's totally nonsensical (and thus IMHO a HW bug).
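
To make the conflict concrete, here is a minimal userspace sketch of a
first-fit allocator over one PHB memory window. It is only a model: the
struct, the function names and the "0 means failure" convention are
assumptions made for illustration, not the kernel's resource code. The
window bounds and sizes are the ones from the logs above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one PHB memory window; not struct resource. */
struct window {
	uint64_t start;	/* first address in the window */
	uint64_t end;	/* last address in the window (inclusive) */
	uint64_t next;	/* next free address */
};

/*
 * First-fit allocation with natural alignment (size must be a power
 * of two). Returns the assigned address, or 0 if the request does
 * not fit in what is left of the window.
 */
static uint64_t window_assign(struct window *w, uint64_t size)
{
	uint64_t addr;

	assert(size != 0 && (size & (size - 1)) == 0);
	addr = (w->next + size - 1) & ~(size - 1);
	if (addr < w->next || addr > w->end || w->end - addr < size - 1)
		return 0;
	w->next = addr + size;
	return addr;
}

/* Returns the address given to the 1 MiB bridge window, 0 on failure. */
static uint64_t assign_bridge_window(int host_bar_hidden)
{
	struct window phb = { 0xe80000000ULL, 0xeffffffffULL, 0xe80000000ULL };

	if (!host_bar_hidden)
		window_assign(&phb, 0x80000000ULL);	/* host bridge BAR 0 */
	return window_assign(&phb, 0x100000ULL);	/* "BAR 8" in the logs */
}
```

With the 2 GiB host BAR assigned first, the whole window is consumed and the
1 MiB request fails, which matches the "can't assign mem (size 0x100000)"
line. With the host BAR hidden, the same request lands at 0xe80000000.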

I thought we had code to "hide" them to avoid that problem, so I wonder what's
going on... If you look at arch/powerpc/sysdev/ppc4xx_pci.c, there's this
quirk:

static void fixup_ppc4xx_pci_bridge(struct pci_dev *dev)
{
	struct pci_controller *hose;
	int i;

	if (dev->devfn != 0 || dev->bus->self != NULL)
		return;

	hose = pci_bus_to_host(dev->bus);
	if (hose == NULL)
		return;

	if (!of_device_is_compatible(hose->dn, "ibm,plb-pciex") &&
	    !of_device_is_compatible(hose->dn, "ibm,plb-pcix") &&
	    !of_device_is_compatible(hose->dn, "ibm,plb-pci"))
		return;

	if (of_device_is_compatible(hose->dn, "ibm,plb440epx-pci") ||
		of_device_is_compatible(hose->dn, "ibm,plb440grx-pci")) {
		hose->indirect_type |= PPC_INDIRECT_TYPE_BROKEN_MRM;
	}

	/* Hide the PCI host BARs from the kernel as their content doesn't
	 * fit well in the resource management
	 */
	for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
		dev->resource[i].start = dev->resource[i].end = 0;
		dev->resource[i].flags = 0;
	}

	printk(KERN_INFO "PCI: Hiding 4xx host bridge resources %s\n",
	       pci_name(dev));
}
DECLARE_PCI_FIXUP_HEADER(PCI_ANY_ID, PCI_ANY_ID, fixup_ppc4xx_pci_bridge);

This should basically "clear out" the bridge resources for the pcie
bridge itself, which doesn't appear to have been done in your case.

I suspect you don't have CONFIG_PCI_QUIRKS enabled... I think that's the
cause of your problem.

It looks like this config option controls both compiling in the "generic"
quirks from drivers/pci/quirks.c and the actual mechanism for
having quirks in the first place (pci_fixup_device() goes away without
that config option).
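
As a rough illustration of that "goes away" effect, the sketch below models
a fixup hook that becomes an empty inline stub when its option is off.
Everything here is hypothetical (MODEL_PCI_QUIRKS, the model_* names and the
BAR count are stand-ins chosen for the example); it only shows the general
pattern, not the actual kernel headers.

```c
#include <stddef.h>

#define MODEL_BAR_COUNT 6	/* hypothetical stand-in for DEVICE_COUNT_RESOURCE */

struct model_bar { unsigned long start, end, flags; };
struct model_dev { struct model_bar bar[MODEL_BAR_COUNT]; };

#ifdef MODEL_PCI_QUIRKS
/* Quirk machinery built in: the fixup runs and hides every BAR. */
static void model_fixup_device(struct model_dev *dev)
{
	size_t i;

	for (i = 0; i < MODEL_BAR_COUNT; i++) {
		dev->bar[i].start = dev->bar[i].end = 0;
		dev->bar[i].flags = 0;
	}
}
#else
/* Quirk machinery compiled out: the call site stays but does nothing. */
static inline void model_fixup_device(struct model_dev *dev)
{
	(void)dev;
}
#endif

/* Returns nonzero if BAR 0 is still visible after the fixup pass. */
static int model_bar0_visible(void)
{
	struct model_dev dev = { { { 0x80000000UL, 0xffffffffUL, 1 } } };

	model_fixup_device(&dev);
	return dev.bar[0].flags != 0;
}
```

Built without MODEL_PCI_QUIRKS defined, model_bar0_visible() reports the BAR
as still present, which is the situation the "bad" log suggests.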

I think we probably want to unconditionally select that if CONFIG_PCI is
enabled in arch/powerpc...

Can you try changing it and tell us if that helps ?

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH] powerpc/timebase_read: don't return time older than cycle_last
From: Benjamin Herrenschmidt @ 2011-06-29  1:06 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc-dev
In-Reply-To: <20110628190807.53a2c289@schlenkerla.am.freescale.net>

> > Ok two things. One is first fix the comments then to stop mentioning
> > "TSC" :-)
> 
> Doh, sorry...
> 
> > Second is, I still don't think it's right. There's an expectation on
> > powerpc that the timebase works properly. If not, you have a userspace
> > visible breakage.
> 
> As the changelog notes, this isn't a full enforcement of monotonicity, it's
> a way to avoid specific problems where the generic kernel timekeeping code
> blows up if it goes backwards.  Fixing userspace reads to be fully
> monotonic would be nice too, but it's a separate issue from the kernel
> throwing a timer into the distant future because the timebase went
> backwards one tick.

I don't think we ever want to "fix" userspace... how would you "fix" the
vDSO gettimeofday implementation for example since the vDSO has no
storage ?

> > There's no such thing as "a small drift". We assume no
> > difference is visible to software, period.
> 
> On what do we base this assumption, and what does making the assumption
> buy us?

We base this assumption on what I believe is an architectural
requirement, though of course it's not worded very explicitly and is
probably just "derived" from the architecture statement that the timebase
can always be used as a monotonic source of time.

It has always been the assumption of the Linux/ppc port that the timebase
cannot be observed going backward across the SMP fabric.

They -MUST- be sourced from the same clock (no drift) and the initial
synchronization must be "good enough" to make it impossible to observe
it going backward.

What it does buy us is a lot of complexity avoided in the time keeping
code and the ability to have things like vDSO
gettimeofday/clock_gettime, ie, a very fast path to reliably timestamp
things (which is among others a serious benefit for networking).

> Will smp-tbsync.c always converge on perfect sync (it has a limit on how
> long it will try, and the only indication it failed is a pr_debug)?  Will
> the timebase always increment on all cores at the same time, including on
> emulated hardware?

smp-tbsync.c is and has always been a "workaround" for broken HW.
Anybody with half a clue should follow the recommendation of the
architecture (this one is actually spelled out, but as a recommendation
only) to have a TB enable pin and use it to perform a perfect sync at
boot time.

> We had a bug in U-Boot's timebase sync where the boot core would sometimes
> be one tick faster than the other cores.

It's scary to think that your cores' TBs seem to be sourced from different
clock sources... ie even if you fix U-Boot, can you guarantee they won't
drift ? I hope so ... I would consider that an unfixable architecture
violation and I am not at this stage keen on implementing the necessary
"workarounds" in Linux (the userspace case is nasty, really nasty).

PowerPC always prided itself on having a "sane" time base mechanism
unlike x86, please don't tell me that you guys are now breaking that
assumption.

> It's been fixed, but there are
> probably people still running the old U-Boot.  It seems like the kind of
> thing where defensive robustness is called for, like timing out instead of
> hanging if a hardware register never flips the bit we're waiting for.

No, you'll just "hide" the problem from the kernel and horrible &
unexplainable things will happen in userspace. At the VERY LEAST you
must warn very loudly if you detect this is happening.

> > We make hard assumptions here and in various places actually.
> 
> Are there any in the kernel that this doesn't cover?

Check gtod implementation, I'm not sure whether that's enough at this
stage or not for it, and then there's the vDSO of course. Not sure
what's up with sched_clock() and whether that has similar constraints.

> > So if you want to do that test, I would require that you also add a
> > warning, of the _rate_limited or _once, kind, indicating to the user
> > that something's badly wrong.
> 
> OK.

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH] powerpc/timebase_read: don't return time older than cycle_last
From: Scott Wood @ 2011-06-29  0:08 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev
In-Reply-To: <1309303508.32158.473.camel@pasglop>

On Wed, 29 Jun 2011 09:25:08 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Tue, 2011-06-28 at 11:14 -0500, Scott Wood wrote:
> > > You are applying a bandage on a wooden leg here .... userspace (vDSO)
> > > will see the time going backward if you aren't well synchronized as
> > > well, so you're stuffed anyways.
> > 
> > Sure -- but we should avoid turning a slight backwards drift into a huge
> > positive offset in the kernel's calculations.  One way to do that is for
> > the generic timekeeping code to be robust against this, for all time
> > sources.  The other is to apply this sort of hack on time sources that are
> > known to possibly go backwards.  The former is the better fix IMHO, but the
> > latter is what was already done for TSC on x86, so I went with the less
> > intrusive change.
> 
> Ok two things. One is first fix the comments then to stop mentioning
> "TSC" :-)

Doh, sorry...

> Second is, I still don't think it's right. There's an expectation on
> powerpc that the timebase works properly. If not, you have a userspace
> visible breakage.

As the changelog notes, this isn't a full enforcement of monotonicity, it's
a way to avoid specific problems where the generic kernel timekeeping code
blows up if it goes backwards.  Fixing userspace reads to be fully
monotonic would be nice too, but it's a separate issue from the kernel
throwing a timer into the distant future because the timebase went
backwards one tick.
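
A minimal sketch of the failure mode, assuming the elapsed-cycle count is
computed with plain unsigned 64-bit subtraction. The helper names are made
up for illustration and this is not the patch itself:

```c
#include <stdint.h>

/* Cycles elapsed since the last update, as unsigned arithmetic sees it. */
static uint64_t cycles_since(uint64_t now, uint64_t cycle_last)
{
	return now - cycle_last;	/* wraps modulo 2^64 if now < cycle_last */
}

/* Clamp a raw reading so that it is never older than cycle_last. */
static uint64_t clamp_to_cycle_last(uint64_t raw, uint64_t cycle_last)
{
	return raw >= cycle_last ? raw : cycle_last;
}
```

A reading one tick behind cycle_last gives a delta of 2^64 - 1 cycles rather
than -1; clamping first turns that into a delta of zero.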

> There's no such thing as "a small drift". We assume no
> difference is visible to software, period.

On what do we base this assumption, and what does making the assumption
buy us?

Will smp-tbsync.c always converge on perfect sync (it has a limit on how
long it will try, and the only indication it failed is a pr_debug)?  Will
the timebase always increment on all cores at the same time, including on
emulated hardware?

We had a bug in U-Boot's timebase sync where the boot core would sometimes
be one tick faster than the other cores.  It's been fixed, but there are
probably people still running the old U-Boot.  It seems like the kind of
thing where defensive robustness is called for, like timing out instead of
hanging if a hardware register never flips the bit we're waiting for.

> We make hard assumptions here and in various places actually.

Are there any in the kernel that this doesn't cover?

> So if you want to do that test, I would require that you also add a
> warning, of the _rate_limited or _once, kind, indicating to the user
> that something's badly wrong.

OK.

-Scott

^ permalink raw reply

* Re: [PATCH] powerpc/timebase_read: don't return time older than cycle_last
From: Benjamin Herrenschmidt @ 2011-06-28 23:25 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc-dev
In-Reply-To: <20110628111420.15052d9f@schlenkerla.am.freescale.net>

On Tue, 2011-06-28 at 11:14 -0500, Scott Wood wrote:
> > You are applying a bandage on a wooden leg here .... userspace (vDSO)
> > will see the time going backward if you aren't well synchronized as
> > well, so you're stuffed anyways.
> 
> Sure -- but we should avoid turning a slight backwards drift into a huge
> positive offset in the kernel's calculations.  One way to do that is for
> the generic timekeeping code to be robust against this, for all time
> sources.  The other is to apply this sort of hack on time sources that are
> known to possibly go backwards.  The former is the better fix IMHO, but the
> latter is what was already done for TSC on x86, so I went with the less
> intrusive change.

Ok two things. One is first fix the comments then to stop mentioning
"TSC" :-)

Second is, I still don't think it's right. There's an expectation on
powerpc that the timebase works properly. If not, you have a userspace
visible breakage. There's no such thing as "a small drift". We assume no
difference is visible to software, period. We make hard assumptions here
and in various places actually.

So if you want to do that test, I would require that you also add a
warning, of the _rate_limited or _once, kind, indicating to the user
that something's badly wrong.
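
Something along these lines, as a userspace sketch only. The function name,
the counter and the message text are assumptions for illustration; in the
kernel the once-only or rate-limited printk helpers would take the place of
the static flag and fprintf():

```c
#include <stdint.h>
#include <stdio.h>

static unsigned int tb_warnings;	/* number of warnings printed so far */

/*
 * Clamp a raw timebase reading to cycle_last and report the first
 * backwards step seen. Hypothetical helper, not a kernel API.
 */
static uint64_t read_timebase_checked(uint64_t raw, uint64_t cycle_last)
{
	static int warned;

	if (raw >= cycle_last)
		return raw;

	if (!warned) {
		warned = 1;
		tb_warnings++;
		fprintf(stderr, "timebase went backwards by %llu ticks\n",
			(unsigned long long)(cycle_last - raw));
	}
	return cycle_last;
}
```

Repeated backwards readings are still clamped, but only the first one is
reported, so the log is not flooded.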

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH v4]PPC4xx: Adding PCI(E) MSI support
From: Benjamin Herrenschmidt @ 2011-06-28 23:15 UTC (permalink / raw)
  To: Ayman El-Khashab; +Cc: linuxppc-dev, Rupjyoti Sarmah, rsarmah, linux-kernel
In-Reply-To: <20110628223131.GA10267@crust.elkhashab.com>

On Tue, 2011-06-28 at 17:31 -0500, Ayman El-Khashab wrote:
> > > +static int ppc4xx_setup_pcieh_hw(struct platform_device *dev,
> > > +                            struct resource res, struct ppc4xx_msi *msi)
> > > +{
> > > +
> 
> <snip>
> 
> > > +
> > > +   msi->msi_dev = of_find_node_by_name(NULL, "ppc4xx-msi");
> > > +   if (msi->msi_dev)
> > > +           return -ENODEV;
> 
> This does not look correct. I guess it should probably read 
> 
> if (!msi->msi_dev) .....

Indeed, that looks bogus. Rupjyoti, please test and send fixes if
necessary, obviously this code has not been tested.

This is not part of the bits I fixed up so it looks to me like the
original patch was wrong (and thus obviously untested !!!)

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH v4]PPC4xx: Adding PCI(E) MSI support
From: Ayman El-Khashab @ 2011-06-28 22:31 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linuxppc-dev, Rupjyoti Sarmah, rsarmah, linux-kernel
In-Reply-To: <1306387484.7481.453.camel@pasglop>

On Thu, May 26, 2011 at 03:24:44PM +1000, Benjamin Herrenschmidt wrote:
> 
> Please check the result and send any "fixup" patch that might be
> necessary.
> 
> > +static int ppc4xx_setup_pcieh_hw(struct platform_device *dev,
> > +				 struct resource res, struct ppc4xx_msi *msi)
> > +{
> > +

<snip>

> > +
> > +	msi->msi_dev = of_find_node_by_name(NULL, "ppc4xx-msi");
> > +	if (msi->msi_dev)
> > +		return -ENODEV;

This does not look correct. I guess it should probably read 

if (!msi->msi_dev) .....
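
For illustration, a small userspace model of the corrected check. The
model_* names and the array-based lookup are stand-ins invented for this
sketch; only the node name "ppc4xx-msi" and the -ENODEV return come from the
patch:

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct model_node { const char *name; };	/* stand-in for struct device_node */

/* Returns the first node with a matching name, or NULL if none exists. */
static const struct model_node *model_find_node_by_name(
		const struct model_node *nodes, size_t count, const char *name)
{
	size_t i;

	for (i = 0; i < count; i++)
		if (strcmp(nodes[i].name, name) == 0)
			return &nodes[i];
	return NULL;
}

/* Fails with -ENODEV only when the lookup found nothing. */
static int model_setup_msi(const struct model_node *nodes, size_t count)
{
	const struct model_node *msi_dev;

	msi_dev = model_find_node_by_name(nodes, count, "ppc4xx-msi");
	if (!msi_dev)
		return -ENODEV;
	return 0;
}
```

With the test inverted as in the patch, the setup would bail out exactly
when the node is present and carry on with a NULL pointer when it is not.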

Ayman

^ permalink raw reply

* Re: [PATCH 2/5] hugetlb: add phys addr to struct huge_bootmem_page
From: Benjamin Herrenschmidt @ 2011-06-28 21:39 UTC (permalink / raw)
  To: Becky Bruce; +Cc: linuxppc-dev, linux-kernel, wli, david
In-Reply-To: <13092910103675-git-send-email-beckyb@kernel.crashing.org>

On Tue, 2011-06-28 at 14:54 -0500, Becky Bruce wrote:
>  struct page *alloc_huge_page_node(struct hstate *h, int nid);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6402458..2db81ea 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1105,8 +1105,14 @@ static void __init gather_bootmem_prealloc(void)
>         struct huge_bootmem_page *m;
>  
>         list_for_each_entry(m, &huge_boot_pages, list) {
> -               struct page *page = virt_to_page(m);
>                 struct hstate *h = m->hstate;
> +#ifdef CONFIG_HIGHMEM
> +               struct page *page = pfn_to_page(m->phys >> PAGE_SHIFT);
> +               free_bootmem_late((unsigned long)m,
> +                                 sizeof(struct huge_bootmem_page));
> +#else
> +               struct page *page = virt_to_page(m);
> +#endif
>                 __ClearPageReserved(page);

Why do you add free_bootmem_late() in the highmem case and not the
normal case ?

Cheers,
Ben.

^ permalink raw reply

* [PATCH 5/5] powerpc: Hugetlb for BookE
From: Becky Bruce @ 2011-06-28 19:54 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev; +Cc: wli, david
In-Reply-To: <13092911313115-git-send-email-beckyb@kernel.crashing.org>

From: Becky Bruce <beckyb@kernel.crashing.org>

Enable hugepages on Freescale BookE processors.  This allows the kernel to
use huge TLB entries to map pages, which can greatly reduce the number of
TLB misses and the amount of TLB thrashing experienced by applications with
large memory footprints.  Care should be taken when using this on FSL
processors, as the number of large TLB entries supported by the core is low
(16-64) on current processors.

The supported set of hugepage sizes includes 4m, 16m, 64m, 256m, and 1g.
Page sizes larger than the max zone size are called "gigantic" pages and
must be allocated on the command line (and cannot be deallocated).

This is currently only fully implemented for Freescale 32-bit BookE
processors, but there is some infrastructure in the code for
64-bit BookE.

Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/Kconfig                   |    3 +-
 arch/powerpc/include/asm/hugetlb.h     |   63 +++++-
 arch/powerpc/include/asm/mmu-book3e.h  |    7 +
 arch/powerpc/include/asm/mmu-hash64.h  |    3 +-
 arch/powerpc/include/asm/mmu.h         |   18 +-
 arch/powerpc/include/asm/page.h        |   31 +++-
 arch/powerpc/include/asm/page_64.h     |   11 -
 arch/powerpc/include/asm/pte-book3e.h  |    3 +
 arch/powerpc/kernel/head_fsl_booke.S   |  133 ++++++++++--
 arch/powerpc/mm/Makefile               |    1 +
 arch/powerpc/mm/hash_utils_64.c        |    3 -
 arch/powerpc/mm/hugetlbpage-book3e.c   |  121 ++++++++++
 arch/powerpc/mm/hugetlbpage.c          |  379 ++++++++++++++++++++++++++++----
 arch/powerpc/mm/init_32.c              |    9 +
 arch/powerpc/mm/mem.c                  |    5 +
 arch/powerpc/mm/mmu_context_nohash.c   |    5 +
 arch/powerpc/mm/pgtable.c              |    3 +-
 arch/powerpc/mm/tlb_low_64e.S          |   24 +-
 arch/powerpc/mm/tlb_nohash.c           |   46 ++++-
 arch/powerpc/platforms/Kconfig.cputype |    4 +-
 20 files changed, 766 insertions(+), 106 deletions(-)
 create mode 100644 arch/powerpc/mm/hugetlbpage-book3e.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2729c66..b7af257 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -426,8 +426,7 @@ config ARCH_POPULATES_NODE_MAP
 	def_bool y
 
 config SYS_SUPPORTS_HUGETLBFS
-       def_bool y
-       depends on PPC_BOOK3S_64
+	bool
 
 source "mm/Kconfig"
 
diff --git a/arch/powerpc/include/asm/hugetlb.h b/arch/powerpc/include/asm/hugetlb.h
index 5856a66..8600493 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -1,15 +1,60 @@
 #ifndef _ASM_POWERPC_HUGETLB_H
 #define _ASM_POWERPC_HUGETLB_H
 
+#ifdef CONFIG_HUGETLB_PAGE
 #include <asm/page.h>
 
+extern struct kmem_cache *hugepte_cache;
+extern void __init reserve_hugetlb_gpages(void);
+
+static inline pte_t *hugepd_page(hugepd_t hpd)
+{
+	BUG_ON(!hugepd_ok(hpd));
+	return (pte_t *)((hpd.pd & ~HUGEPD_SHIFT_MASK) | PD_HUGE);
+}
+
+static inline unsigned int hugepd_shift(hugepd_t hpd)
+{
+	return hpd.pd & HUGEPD_SHIFT_MASK;
+}
+
+static inline pte_t *hugepte_offset(hugepd_t *hpdp, unsigned long addr,
+				    unsigned pdshift)
+{
+	/*
+	 * On 32-bit, we have multiple higher-level table entries that point to
+	 * the same hugepte.  Just use the first one since they're all
+	 * identical.  So for that case, idx=0.
+	 */
+	unsigned long idx = 0;
+
+	pte_t *dir = hugepd_page(*hpdp);
+#ifdef CONFIG_PPC64
+	idx = (addr & ((1UL << pdshift) - 1)) >> hugepd_shift(*hpdp);
+#endif
+
+	return dir + idx;
+}
+
 pte_t *huge_pte_offset_and_shift(struct mm_struct *mm,
 				 unsigned long addr, unsigned *shift);
 
 void flush_dcache_icache_hugepage(struct page *page);
 
+#if defined(CONFIG_PPC_MM_SLICES) || defined(CONFIG_PPC_SUBPAGE_PROT)
 int is_hugepage_only_range(struct mm_struct *mm, unsigned long addr,
 			   unsigned long len);
+#else
+static inline int is_hugepage_only_range(struct mm_struct *mm,
+					 unsigned long addr,
+					 unsigned long len)
+{
+	return 0;
+}
+#endif
+
+void book3e_hugetlb_preload(struct mm_struct *mm, unsigned long ea, pte_t pte);
+void flush_hugetlb_page(struct vm_area_struct *vma, unsigned long vmaddr);
 
 void hugetlb_free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 			    unsigned long end, unsigned long floor,
@@ -50,8 +95,11 @@ static inline void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 					    unsigned long addr, pte_t *ptep)
 {
-	unsigned long old = pte_update(mm, addr, ptep, ~0UL, 1);
-	return __pte(old);
+#ifdef CONFIG_PPC64
+	return __pte(pte_update(mm, addr, ptep, ~0UL, 1));
+#else
+	return __pte(pte_update(ptep, ~0UL, 0));
+#endif
 }
 
 static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
@@ -93,4 +141,15 @@ static inline void arch_release_hugepage(struct page *page)
 {
 }
 
+#else /* ! CONFIG_HUGETLB_PAGE */
+static inline void reserve_hugetlb_gpages(void)
+{
+	pr_err("Cannot reserve gpages without hugetlb enabled\n");
+}
+static inline void flush_hugetlb_page(struct vm_area_struct *vma,
+				      unsigned long vmaddr)
+{
+}
+#endif
+
 #endif /* _ASM_POWERPC_HUGETLB_H */
diff --git a/arch/powerpc/include/asm/mmu-book3e.h b/arch/powerpc/include/asm/mmu-book3e.h
index 3ea0f9a..0260ea5 100644
--- a/arch/powerpc/include/asm/mmu-book3e.h
+++ b/arch/powerpc/include/asm/mmu-book3e.h
@@ -66,6 +66,7 @@
 #define MAS2_M			0x00000004
 #define MAS2_G			0x00000002
 #define MAS2_E			0x00000001
+#define MAS2_WIMGE_MASK		0x0000001f
 #define MAS2_EPN_MASK(size)		(~0 << (size + 10))
 #define MAS2_VAL(addr, size, flags)	((addr) & MAS2_EPN_MASK(size) | (flags))
 
@@ -80,6 +81,7 @@
 #define MAS3_SW			0x00000004
 #define MAS3_UR			0x00000002
 #define MAS3_SR			0x00000001
+#define MAS3_BAP_MASK		0x0000003f
 #define MAS3_SPSIZE		0x0000003e
 #define MAS3_SPSIZE_SHIFT	1
 
@@ -212,6 +214,11 @@ typedef struct {
 	unsigned int	id;
 	unsigned int	active;
 	unsigned long	vdso_base;
+#ifdef CONFIG_PPC_MM_SLICES
+	u64 low_slices_psize;   /* SLB page size encodings */
+	u64 high_slices_psize;  /* 4 bits per slice for now */
+	u16 user_psize;         /* page size index */
+#endif
 } mm_context_t;
 
 /* Page size definitions, common between 32 and 64-bit
diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h
index d865bd9..9169032 100644
--- a/arch/powerpc/include/asm/mmu-hash64.h
+++ b/arch/powerpc/include/asm/mmu-hash64.h
@@ -256,8 +256,7 @@ extern void hash_failure_debug(unsigned long ea, unsigned long access,
 extern int htab_bolt_mapping(unsigned long vstart, unsigned long vend,
 			     unsigned long pstart, unsigned long prot,
 			     int psize, int ssize);
-extern void add_gpage(unsigned long addr, unsigned long page_size,
-			  unsigned long number_of_pages);
+extern void add_gpage(u64 addr, u64 page_size, unsigned long number_of_pages);
 extern void demote_segment_4k(struct mm_struct *mm, unsigned long addr);
 
 extern void hpte_init_native(void);
diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
index b427a55..07f7b28 100644
--- a/arch/powerpc/include/asm/mmu.h
+++ b/arch/powerpc/include/asm/mmu.h
@@ -170,14 +170,16 @@ extern u64 ppc64_rma_size;
 #define MMU_PAGE_64K_AP	3	/* "Admixed pages" (hash64 only) */
 #define MMU_PAGE_256K	4
 #define MMU_PAGE_1M	5
-#define MMU_PAGE_8M	6
-#define MMU_PAGE_16M	7
-#define MMU_PAGE_256M	8
-#define MMU_PAGE_1G	9
-#define MMU_PAGE_16G	10
-#define MMU_PAGE_64G	11
-#define MMU_PAGE_COUNT	12
-
+#define MMU_PAGE_4M	6
+#define MMU_PAGE_8M	7
+#define MMU_PAGE_16M	8
+#define MMU_PAGE_64M	9
+#define MMU_PAGE_256M	10
+#define MMU_PAGE_1G	11
+#define MMU_PAGE_16G	12
+#define MMU_PAGE_64G	13
+
+#define MMU_PAGE_COUNT	14
 
 #if defined(CONFIG_PPC_STD_MMU_64)
 /* 64-bit classic hash table MMU */
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index 2cd664e..dd9c4fd 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -36,6 +36,18 @@
 
 #define PAGE_SIZE		(ASM_CONST(1) << PAGE_SHIFT)
 
+#ifndef __ASSEMBLY__
+#ifdef CONFIG_HUGETLB_PAGE
+extern unsigned int HPAGE_SHIFT;
+#else
+#define HPAGE_SHIFT PAGE_SHIFT
+#endif
+#define HPAGE_SIZE		((1UL) << HPAGE_SHIFT)
+#define HPAGE_MASK		(~(HPAGE_SIZE - 1))
+#define HUGETLB_PAGE_ORDER	(HPAGE_SHIFT - PAGE_SHIFT)
+#define HUGE_MAX_HSTATE		(MMU_PAGE_COUNT-1)
+#endif
+
 /* We do define AT_SYSINFO_EHDR but don't use the gate mechanism */
 #define __HAVE_ARCH_GATE_AREA		1
 
@@ -158,6 +170,24 @@ extern phys_addr_t kernstart_addr;
 #define is_kernel_addr(x)	((x) >= PAGE_OFFSET)
 #endif
 
+/*
+ * Use the top bit of the higher-level page table entries to indicate whether
+ * the entries we point to contain hugepages.  This works because we know that
+ * the page tables live in kernel space.  If we ever decide to support having
+ * page tables at arbitrary addresses, this breaks and will have to change.
+ */
+#ifdef CONFIG_PPC64
+#define PD_HUGE 0x8000000000000000
+#else
+#define PD_HUGE 0x80000000
+#endif
+
+/*
+ * Some number of bits at the level of the page table that points to
+ * a hugepte are used to encode the size.  This masks those bits.
+ */
+#define HUGEPD_SHIFT_MASK     0x3f
+
 #ifndef __ASSEMBLY__
 
 #undef STRICT_MM_TYPECHECKS
@@ -243,7 +273,6 @@ typedef unsigned long pgprot_t;
 #endif
 
 typedef struct { signed long pd; } hugepd_t;
-#define HUGEPD_SHIFT_MASK     0x3f
 
 #ifdef CONFIG_HUGETLB_PAGE
 static inline int hugepd_ok(hugepd_t hpd)
diff --git a/arch/powerpc/include/asm/page_64.h b/arch/powerpc/include/asm/page_64.h
index 9356262..fb40ede 100644
--- a/arch/powerpc/include/asm/page_64.h
+++ b/arch/powerpc/include/asm/page_64.h
@@ -64,17 +64,6 @@ extern void copy_page(void *to, void *from);
 /* Log 2 of page table size */
 extern u64 ppc64_pft_size;
 
-/* Large pages size */
-#ifdef CONFIG_HUGETLB_PAGE
-extern unsigned int HPAGE_SHIFT;
-#else
-#define HPAGE_SHIFT PAGE_SHIFT
-#endif
-#define HPAGE_SIZE		((1UL) << HPAGE_SHIFT)
-#define HPAGE_MASK		(~(HPAGE_SIZE - 1))
-#define HUGETLB_PAGE_ORDER	(HPAGE_SHIFT - PAGE_SHIFT)
-#define HUGE_MAX_HSTATE		(MMU_PAGE_COUNT-1)
-
 #endif /* __ASSEMBLY__ */
 
 #ifdef CONFIG_PPC_MM_SLICES
diff --git a/arch/powerpc/include/asm/pte-book3e.h b/arch/powerpc/include/asm/pte-book3e.h
index 082d515..0156702 100644
--- a/arch/powerpc/include/asm/pte-book3e.h
+++ b/arch/powerpc/include/asm/pte-book3e.h
@@ -72,6 +72,9 @@
 #define	PTE_RPN_SHIFT	(24)
 #endif
 
+#define PTE_WIMGE_SHIFT (19)
+#define PTE_BAP_SHIFT	(2)
+
 /* On 32-bit, we never clear the top part of the PTE */
 #ifdef CONFIG_PPC32
 #define _PTE_NONE_MASK	0xffffffff00000000ULL
diff --git a/arch/powerpc/kernel/head_fsl_booke.S b/arch/powerpc/kernel/head_fsl_booke.S
index 985638d..3b3de22 100644
--- a/arch/powerpc/kernel/head_fsl_booke.S
+++ b/arch/powerpc/kernel/head_fsl_booke.S
@@ -236,8 +236,24 @@ _ENTRY(__early_start)
  * if we find the pte (fall through):
  *   r11 is low pte word
  *   r12 is pointer to the pte
+ *   r10 is the pshift from the PGD, if we're a hugepage
  */
 #ifdef CONFIG_PTE_64BIT
+#ifdef CONFIG_HUGETLB_PAGE
+#define FIND_PTE	\
+	rlwinm	r12, r10, 13, 19, 29;	/* Compute pgdir/pmd offset */	\
+	lwzx	r11, r12, r11;		/* Get pgd/pmd entry */		\
+	rlwinm.	r12, r11, 0, 0, 20;	/* Extract pt base address */	\
+	blt	1000f;			/* Normal non-huge page */	\
+	beq	2f;			/* Bail if no table */		\
+	oris	r11, r11, PD_HUGE@h;	/* Put back address bit */	\
+	andi.	r10, r11, HUGEPD_SHIFT_MASK@l; /* extract size field */	\
+	xor	r12, r10, r11;		/* drop size bits from pointer */ \
+	b	1001f;							\
+1000:	rlwimi	r12, r10, 23, 20, 28;	/* Compute pte address */	\
+	li	r10, 0;			/* clear r10 */			\
+1001:	lwz	r11, 4(r12);		/* Get pte entry */
+#else
 #define FIND_PTE	\
 	rlwinm	r12, r10, 13, 19, 29;	/* Compute pgdir/pmd offset */	\
 	lwzx	r11, r12, r11;		/* Get pgd/pmd entry */		\
@@ -245,7 +261,8 @@ _ENTRY(__early_start)
 	beq	2f;			/* Bail if no table */		\
 	rlwimi	r12, r10, 23, 20, 28;	/* Compute pte address */	\
 	lwz	r11, 4(r12);		/* Get pte entry */
-#else
+#endif /* HUGEPAGE */
+#else /* !PTE_64BIT */
 #define FIND_PTE	\
 	rlwimi	r11, r10, 12, 20, 29;	/* Create L1 (pgdir/pmd) address */	\
 	lwz	r11, 0(r11);		/* Get L1 entry */			\
@@ -402,8 +419,8 @@ interrupt_base:
 
 #ifdef CONFIG_PTE_64BIT
 #ifdef CONFIG_SMP
-	subf	r10,r11,r12		/* create false data dep */
-	lwzx	r13,r11,r10		/* Get upper pte bits */
+	subf	r13,r11,r12		/* create false data dep */
+	lwzx	r13,r11,r13		/* Get upper pte bits */
 #else
 	lwz	r13,0(r12)		/* Get upper pte bits */
 #endif
@@ -483,8 +500,8 @@ interrupt_base:
 
 #ifdef CONFIG_PTE_64BIT
 #ifdef CONFIG_SMP
-	subf	r10,r11,r12		/* create false data dep */
-	lwzx	r13,r11,r10		/* Get upper pte bits */
+	subf	r13,r11,r12		/* create false data dep */
+	lwzx	r13,r11,r13		/* Get upper pte bits */
 #else
 	lwz	r13,0(r12)		/* Get upper pte bits */
 #endif
@@ -548,7 +565,7 @@ interrupt_base:
 /*
  * Both the instruction and data TLB miss get to this
  * point to load the TLB.
- *	r10 - available to use
+ *	r10 - tsize encoding (if HUGETLB_PAGE) or available to use
  *	r11 - TLB (info from Linux PTE)
  *	r12 - available to use
  *	r13 - upper bits of PTE (if PTE_64BIT) or available to use
@@ -558,21 +575,73 @@ interrupt_base:
  *	Upon exit, we reload everything and RFI.
  */
 finish_tlb_load:
+#ifdef CONFIG_HUGETLB_PAGE
+	cmpwi	6, r10, 0			/* check for huge page */
+	beq	6, finish_tlb_load_cont    	/* !huge */
+
+	/* Alas, we need more scratch registers for hugepages */
+	mfspr	r12, SPRN_SPRG_THREAD
+	stw	r14, THREAD_NORMSAVE(4)(r12)
+	stw	r15, THREAD_NORMSAVE(5)(r12)
+	stw	r16, THREAD_NORMSAVE(6)(r12)
+	stw	r17, THREAD_NORMSAVE(7)(r12)
+
+	/* Get the next_tlbcam_idx percpu var */
+#ifdef CONFIG_SMP
+	lwz	r12, THREAD_INFO-THREAD(r12)
+	lwz	r15, TI_CPU(r12)
+	lis     r14, __per_cpu_offset@h
+	ori     r14, r14, __per_cpu_offset@l
+	rlwinm  r15, r15, 2, 0, 29
+	lwzx    r16, r14, r15
+#else
+	li	r16, 0
+#endif
+	lis     r17, next_tlbcam_idx@h
+	ori	r17, r17, next_tlbcam_idx@l
+	add	r17, r17, r16			/* r17 = &next_tlbcam_idx */
+	lwz     r15, 0(r17)			/* r15 = next_tlbcam_idx */
+
+	lis	r14, MAS0_TLBSEL(1)@h		/* select TLB1 (TLBCAM) */
+	rlwimi	r14, r15, 16, 4, 15		/* next_tlbcam_idx entry */
+	mtspr	SPRN_MAS0, r14
+
+	/* Extract TLB1CFG(NENTRY) */
+	mfspr	r16, SPRN_TLB1CFG
+	andi.	r16, r16, 0xfff
+
+	/* Update next_tlbcam_idx, wrapping when necessary */
+	addi	r15, r15, 1
+	cmpw	r15, r16
+	blt 	100f
+	lis	r14, tlbcam_index@h
+	ori	r14, r14, tlbcam_index@l
+	lwz	r15, 0(r14)
+100:	stw	r15, 0(r17)
+
+	/*
+	 * Calc MAS1_TSIZE from r10 (which has pshift encoded)
+	 * tlb_enc = (pshift - 10).
+	 */
+	subi	r15, r10, 10
+	mfspr	r16, SPRN_MAS1
+	rlwimi	r16, r15, 7, 20, 24
+	mtspr	SPRN_MAS1, r16
+
+	/* copy the pshift for use later */
+	mr	r14, r10
+
+	/* fall through */
+
+#endif /* CONFIG_HUGETLB_PAGE */
+
 	/*
 	 * We set execute, because we don't have the granularity to
 	 * properly set this at the page level (Linux problem).
 	 * Many of these bits are software only.  Bits we don't set
 	 * here we (properly should) assume have the appropriate value.
 	 */
-
-	mfspr	r12, SPRN_MAS2
-#ifdef CONFIG_PTE_64BIT
-	rlwimi	r12, r11, 32-19, 27, 31	/* extract WIMGE from pte */
-#else
-	rlwimi	r12, r11, 26, 27, 31	/* extract WIMGE from pte */
-#endif
-	mtspr	SPRN_MAS2, r12
-
+finish_tlb_load_cont:
 #ifdef CONFIG_PTE_64BIT
 	rlwinm	r12, r11, 32-2, 26, 31	/* Move in perm bits */
 	andi.	r10, r11, _PAGE_DIRTY
@@ -581,22 +650,40 @@ finish_tlb_load:
 	andc	r12, r12, r10
 1:	rlwimi	r12, r13, 20, 0, 11	/* grab RPN[32:43] */
 	rlwimi	r12, r11, 20, 12, 19	/* grab RPN[44:51] */
-	mtspr	SPRN_MAS3, r12
+2:	mtspr	SPRN_MAS3, r12
 BEGIN_MMU_FTR_SECTION
 	srwi	r10, r13, 12		/* grab RPN[12:31] */
 	mtspr	SPRN_MAS7, r10
 END_MMU_FTR_SECTION_IFSET(MMU_FTR_BIG_PHYS)
 #else
 	li	r10, (_PAGE_EXEC | _PAGE_PRESENT)
+	mr	r13, r11
 	rlwimi	r10, r11, 31, 29, 29	/* extract _PAGE_DIRTY into SW */
 	and	r12, r11, r10
 	andi.	r10, r11, _PAGE_USER	/* Test for _PAGE_USER */
 	slwi	r10, r12, 1
 	or	r10, r10, r12
 	iseleq	r12, r12, r10
-	rlwimi	r11, r12, 0, 20, 31	/* Extract RPN from PTE and merge with perms */
-	mtspr	SPRN_MAS3, r11
+	rlwimi	r13, r12, 0, 20, 31	/* Get RPN from PTE, merge w/ perms */
+	mtspr	SPRN_MAS3, r13
 #endif
+
+	mfspr	r12, SPRN_MAS2
+#ifdef CONFIG_PTE_64BIT
+	rlwimi	r12, r11, 32-19, 27, 31	/* extract WIMGE from pte */
+#else
+	rlwimi	r12, r11, 26, 27, 31	/* extract WIMGE from pte */
+#endif
+#ifdef CONFIG_HUGETLB_PAGE
+	beq	6, 3f			/* don't mask if page isn't huge */
+	li	r13, 1
+	slw	r13, r13, r14
+	subi	r13, r13, 1
+	rlwinm	r13, r13, 0, 0, 19	/* bottom bits used for WIMGE/etc */
+	andc	r12, r12, r13		/* mask off ea bits within the page */
+#endif
+3:	mtspr	SPRN_MAS2, r12
+
 #ifdef CONFIG_E200
 	/* Round robin TLB1 entries assignment */
 	mfspr	r12, SPRN_MAS0
@@ -622,11 +709,19 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_BIG_PHYS)
 	mtspr	SPRN_MAS0,r12
 #endif /* CONFIG_E200 */
 
+tlb_write_entry:
 	tlbwe
 
 	/* Done...restore registers and get out of here.  */
 	mfspr	r10, SPRN_SPRG_THREAD
-	lwz	r11, THREAD_NORMSAVE(3)(r10)
+#ifdef CONFIG_HUGETLB_PAGE
+	beq	6, 8f /* skip restore for 4k page faults */
+	lwz	r14, THREAD_NORMSAVE(4)(r10)
+	lwz	r15, THREAD_NORMSAVE(5)(r10)
+	lwz	r16, THREAD_NORMSAVE(6)(r10)
+	lwz	r17, THREAD_NORMSAVE(7)(r10)
+#endif
+8:	lwz	r11, THREAD_NORMSAVE(3)(r10)
 	mtcr	r11
 	lwz	r13, THREAD_NORMSAVE(2)(r10)
 	lwz	r12, THREAD_NORMSAVE(1)(r10)
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index bdca46e..991ee81 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -29,6 +29,7 @@ obj-$(CONFIG_PPC_MM_SLICES)	+= slice.o
 ifeq ($(CONFIG_HUGETLB_PAGE),y)
 obj-y				+= hugetlbpage.o
 obj-$(CONFIG_PPC_STD_MMU_64)	+= hugetlbpage-hash64.o
+obj-$(CONFIG_PPC_BOOK3E_MMU)	+= hugetlbpage-book3e.o
 endif
 obj-$(CONFIG_PPC_SUBPAGE_PROT)	+= subpage-prot.o
 obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 26b2872..1f8b2a0 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -105,9 +105,6 @@ int mmu_kernel_ssize = MMU_SEGSIZE_256M;
 int mmu_highuser_ssize = MMU_SEGSIZE_256M;
 u16 mmu_slb_size = 64;
 EXPORT_SYMBOL_GPL(mmu_slb_size);
-#ifdef CONFIG_HUGETLB_PAGE
-unsigned int HPAGE_SHIFT;
-#endif
 #ifdef CONFIG_PPC_64K_PAGES
 int mmu_ci_restrictions;
 #endif
diff --git a/arch/powerpc/mm/hugetlbpage-book3e.c b/arch/powerpc/mm/hugetlbpage-book3e.c
new file mode 100644
index 0000000..1295b7c
--- /dev/null
+++ b/arch/powerpc/mm/hugetlbpage-book3e.c
@@ -0,0 +1,121 @@
+/*
+ * PPC Huge TLB Page Support for Book3E MMU
+ *
+ * Copyright (C) 2009 David Gibson, IBM Corporation.
+ * Copyright (C) 2011 Becky Bruce, Freescale Semiconductor
+ *
+ */
+#include <linux/mm.h>
+#include <linux/hugetlb.h>
+
+static inline int mmu_get_tsize(int psize)
+{
+	return mmu_psize_defs[psize].enc;
+}
+
+static inline int book3e_tlb_exists(unsigned long ea, unsigned long pid)
+{
+	int found = 0;
+
+	mtspr(SPRN_MAS6, pid << 16);
+	if (mmu_has_feature(MMU_FTR_USE_TLBRSRV)) {
+		asm volatile(
+			"li	%0,0\n"
+			"tlbsx.	0,%1\n"
+			"bne	1f\n"
+			"li	%0,1\n"
+			"1:\n"
+			: "=&r"(found) : "r"(ea));
+	} else {
+		asm volatile(
+			"tlbsx	0,%1\n"
+			"mfspr	%0,0x271\n"
+			"srwi	%0,%0,31\n"
+			: "=&r"(found) : "r"(ea));
+	}
+
+	return found;
+}
+
+void book3e_hugetlb_preload(struct mm_struct *mm, unsigned long ea, pte_t pte)
+{
+	unsigned long mas1, mas2;
+	u64 mas7_3;
+	unsigned long psize, tsize, shift;
+	unsigned long flags;
+
+#ifdef CONFIG_PPC_FSL_BOOK3E
+	int index, lz, ncams;
+	struct vm_area_struct *vma;
+#endif
+
+	if (unlikely(is_kernel_addr(ea)))
+		return;
+
+#ifdef CONFIG_PPC_MM_SLICES
+	psize = get_slice_psize(mm, ea);
+	tsize = mmu_get_tsize(psize);
+	shift = mmu_psize_defs[psize].shift;
+#else
+	vma = find_vma(mm, ea);
+	psize = vma_mmu_pagesize(vma);	/* returns actual size in bytes */
+	asm (PPC_CNTLZL "%0,%1" : "=r" (lz) : "r" (psize));
+	shift = 31 - lz;
+	tsize = 21 - lz;
+#endif
+
+	/*
+	 * We can't be interrupted while we're setting up the MAS
+	 * registers or after we've confirmed that no tlb exists.
+	 */
+	local_irq_save(flags);
+
+	if (unlikely(book3e_tlb_exists(ea, mm->context.id))) {
+		local_irq_restore(flags);
+		return;
+	}
+
+#ifdef CONFIG_PPC_FSL_BOOK3E
+	ncams = mfspr(SPRN_TLB1CFG) & TLBnCFG_N_ENTRY;
+
+	/* We have to use the CAM(TLB1) on FSL parts for hugepages */
+	index = __get_cpu_var(next_tlbcam_idx);
+	mtspr(SPRN_MAS0, MAS0_ESEL(index) | MAS0_TLBSEL(1));
+
+	/* Just round-robin the entries and wrap when we hit the end */
+	if (unlikely(index == ncams - 1))
+		__get_cpu_var(next_tlbcam_idx) = tlbcam_index;
+	else
+		__get_cpu_var(next_tlbcam_idx)++;
+#endif
+	mas1 = MAS1_VALID | MAS1_TID(mm->context.id) | MAS1_TSIZE(tsize);
+	mas2 = ea & ~((1UL << shift) - 1);
+	mas2 |= (pte_val(pte) >> PTE_WIMGE_SHIFT) & MAS2_WIMGE_MASK;
+	mas7_3 = (u64)pte_pfn(pte) << PAGE_SHIFT;
+	mas7_3 |= (pte_val(pte) >> PTE_BAP_SHIFT) & MAS3_BAP_MASK;
+	if (!pte_dirty(pte))
+		mas7_3 &= ~(MAS3_SW|MAS3_UW);
+
+	mtspr(SPRN_MAS1, mas1);
+	mtspr(SPRN_MAS2, mas2);
+
+	if (mmu_has_feature(MMU_FTR_USE_PAIRED_MAS)) {
+		mtspr(SPRN_MAS7_MAS3, mas7_3);
+	} else {
+		mtspr(SPRN_MAS7, upper_32_bits(mas7_3));
+		mtspr(SPRN_MAS3, lower_32_bits(mas7_3));
+	}
+
+	asm volatile ("tlbwe");
+
+	local_irq_restore(flags);
+}
+
+void flush_hugetlb_page(struct vm_area_struct *vma, unsigned long vmaddr)
+{
+	struct hstate *hstate = hstate_file(vma->vm_file);
+	unsigned long tsize = huge_page_shift(hstate) - 10;
+
+	__flush_tlb_page(vma ? vma->vm_mm : NULL, vmaddr, tsize, 0);
+
+}
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 0b9a5c1..3a5f59d 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -1,7 +1,8 @@
 /*
- * PPC64 (POWER4) Huge TLB Page Support for Kernel.
+ * PPC Huge TLB Page Support for Kernel.
  *
  * Copyright (C) 2003 David Gibson, IBM Corporation.
+ * Copyright (C) 2011 Becky Bruce, Freescale Semiconductor
  *
  * Based on the IA-32 version:
  * Copyright (C) 2002, Rohit Seth <rohit.seth@intel.com>
@@ -11,24 +12,39 @@
 #include <linux/io.h>
 #include <linux/slab.h>
 #include <linux/hugetlb.h>
+#include <linux/of_fdt.h>
+#include <linux/memblock.h>
+#include <linux/bootmem.h>
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
+#include <asm/setup.h>
 
 #define PAGE_SHIFT_64K	16
 #define PAGE_SHIFT_16M	24
 #define PAGE_SHIFT_16G	34
 
-#define MAX_NUMBER_GPAGES	1024
+unsigned int HPAGE_SHIFT;
 
-/* Tracks the 16G pages after the device tree is scanned and before the
- * huge_boot_pages list is ready.  */
-static unsigned long gpage_freearray[MAX_NUMBER_GPAGES];
+/*
+ * Tracks gpages after the device tree is scanned and before the
+ * huge_boot_pages list is ready.  On 64-bit implementations, this is
+ * just used to track 16G pages and so is a single array.  32-bit
+ * implementations may have more than one gpage size due to limitations
+ * of the memory allocators, so we need multiple arrays.
+ */
+#ifdef CONFIG_PPC64
+#define MAX_NUMBER_GPAGES	1024
+static u64 gpage_freearray[MAX_NUMBER_GPAGES];
 static unsigned nr_gpages;
-
-/* Flag to mark huge PD pointers.  This means pmd_bad() and pud_bad()
- * will choke on pointers to hugepte tables, which is handy for
- * catching screwups early. */
+#else
+#define MAX_NUMBER_GPAGES	128
+struct psize_gpages {
+	u64 gpage_list[MAX_NUMBER_GPAGES];
+	unsigned int nr_gpages;
+};
+static struct psize_gpages gpage_freearray[MMU_PAGE_COUNT];
+#endif
 
 static inline int shift_to_mmu_psize(unsigned int shift)
 {
@@ -49,25 +65,6 @@ static inline unsigned int mmu_psize_to_shift(unsigned int mmu_psize)
 
 #define hugepd_none(hpd)	((hpd).pd == 0)
 
-static inline pte_t *hugepd_page(hugepd_t hpd)
-{
-	BUG_ON(!hugepd_ok(hpd));
-	return (pte_t *)((hpd.pd & ~HUGEPD_SHIFT_MASK) | 0xc000000000000000);
-}
-
-static inline unsigned int hugepd_shift(hugepd_t hpd)
-{
-	return hpd.pd & HUGEPD_SHIFT_MASK;
-}
-
-static inline pte_t *hugepte_offset(hugepd_t *hpdp, unsigned long addr, unsigned pdshift)
-{
-	unsigned long idx = (addr & ((1UL << pdshift) - 1)) >> hugepd_shift(*hpdp);
-	pte_t *dir = hugepd_page(*hpdp);
-
-	return dir + idx;
-}
-
 pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, unsigned *shift)
 {
 	pgd_t *pg;
@@ -93,7 +90,7 @@ pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, unsigned *shift
 			if (is_hugepd(pm))
 				hpdp = (hugepd_t *)pm;
 			else if (!pmd_none(*pm)) {
-				return pte_offset_map(pm, ea);
+				return pte_offset_kernel(pm, ea);
 			}
 		}
 	}
@@ -114,8 +111,18 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 			   unsigned long address, unsigned pdshift, unsigned pshift)
 {
-	pte_t *new = kmem_cache_zalloc(PGT_CACHE(pdshift - pshift),
-				       GFP_KERNEL|__GFP_REPEAT);
+	struct kmem_cache *cachep;
+	pte_t *new;
+
+#ifdef CONFIG_PPC64
+	cachep = PGT_CACHE(pdshift - pshift);
+#else
+	int i;
+	int num_hugepd = 1 << (pshift - pdshift);
+	cachep = hugepte_cache;
+#endif
+
+	new = kmem_cache_zalloc(cachep, GFP_KERNEL|__GFP_REPEAT);
 
 	BUG_ON(pshift > HUGEPD_SHIFT_MASK);
 	BUG_ON((unsigned long)new & HUGEPD_SHIFT_MASK);
@@ -124,10 +131,31 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 		return -ENOMEM;
 
 	spin_lock(&mm->page_table_lock);
+#ifdef CONFIG_PPC64
 	if (!hugepd_none(*hpdp))
-		kmem_cache_free(PGT_CACHE(pdshift - pshift), new);
+		kmem_cache_free(cachep, new);
 	else
-		hpdp->pd = ((unsigned long)new & ~0x8000000000000000) | pshift;
+		hpdp->pd = ((unsigned long)new & ~PD_HUGE) | pshift;
+#else
+	/*
+	 * We have multiple higher-level entries that point to the same
+	 * actual pte location.  Fill in each as we go and backtrack on error.
+	 * We need all of these so the DTLB pgtable walk code can find the
+	 * right higher-level entry without knowing if it's a hugepage or not.
+	 */
+	for (i = 0; i < num_hugepd; i++, hpdp++) {
+		if (unlikely(!hugepd_none(*hpdp)))
+			break;
+		else
+			hpdp->pd = ((unsigned long)new & ~PD_HUGE) | pshift;
+	}
+	/* If we bailed from the for loop early, an error occurred, clean up */
+	if (i < num_hugepd) {
+		for (i = i - 1 ; i >= 0; i--, hpdp--)
+			hpdp->pd = 0;
+		kmem_cache_free(cachep, new);
+	}
+#endif
 	spin_unlock(&mm->page_table_lock);
 	return 0;
 }
@@ -169,11 +197,132 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 	return hugepte_offset(hpdp, addr, pdshift);
 }
 
+#ifdef CONFIG_PPC32
 /* Build list of addresses of gigantic pages.  This function is used in early
  * boot before the buddy or bootmem allocator is setup.
  */
-void add_gpage(unsigned long addr, unsigned long page_size,
-	unsigned long number_of_pages)
+void add_gpage(u64 addr, u64 page_size, unsigned long number_of_pages)
+{
+	unsigned int idx = shift_to_mmu_psize(__ffs(page_size));
+	int i;
+
+	if (addr == 0)
+		return;
+
+	gpage_freearray[idx].nr_gpages = number_of_pages;
+
+	for (i = 0; i < number_of_pages; i++) {
+		gpage_freearray[idx].gpage_list[i] = addr;
+		addr += page_size;
+	}
+}
+
+/*
+ * Moves the gigantic page addresses from the temporary list to the
+ * huge_boot_pages list.
+ */
+int alloc_bootmem_huge_page(struct hstate *hstate)
+{
+	struct huge_bootmem_page *m;
+	int idx = shift_to_mmu_psize(hstate->order + PAGE_SHIFT);
+	int nr_gpages = gpage_freearray[idx].nr_gpages;
+
+	if (nr_gpages == 0)
+		return 0;
+
+#ifdef CONFIG_HIGHMEM
+	/*
+	 * If gpages can be in highmem we can't use the trick of storing the
+	 * data structure in the page; allocate space for this
+	 */
+	m = alloc_bootmem(sizeof(struct huge_bootmem_page));
+	m->phys = gpage_freearray[idx].gpage_list[--nr_gpages];
+#else
+	m = phys_to_virt(gpage_freearray[idx].gpage_list[--nr_gpages]);
+#endif
+
+	list_add(&m->list, &huge_boot_pages);
+	gpage_freearray[idx].nr_gpages = nr_gpages;
+	gpage_freearray[idx].gpage_list[nr_gpages] = 0;
+	m->hstate = hstate;
+
+	return 1;
+}
+/*
+ * Scan the command line hugepagesz= options for gigantic pages; store those in
+ * a list that we use to allocate the memory once all options are parsed.
+ */
+
+unsigned long gpage_npages[MMU_PAGE_COUNT];
+
+static int __init do_gpage_early_setup(char *param, char *val)
+{
+	static phys_addr_t size;
+	unsigned long npages;
+
+	/*
+	 * The hugepagesz and hugepages cmdline options are interleaved.  We
+	 * use the size variable to keep track of whether or not this was done
+	 * properly and skip over instances where it is incorrect.  Other
+	 * command-line parsing code will issue warnings, so we don't need to.
+	 *
+	 */
+	if ((strcmp(param, "default_hugepagesz") == 0) ||
+	    (strcmp(param, "hugepagesz") == 0)) {
+		size = memparse(val, NULL);
+	} else if (strcmp(param, "hugepages") == 0) {
+		if (size != 0) {
+			if (sscanf(val, "%lu", &npages) <= 0)
+				npages = 0;
+			gpage_npages[shift_to_mmu_psize(__ffs(size))] = npages;
+			size = 0;
+		}
+	}
+	return 0;
+}
+
+
+/*
+ * This function allocates physical space for pages that are larger than the
+ * buddy allocator can handle.  We want to allocate these in highmem because
+ * the amount of lowmem is limited.  This means that this function MUST be
+ * called before lowmem_end_addr is set up in MMU_init() in order for the
+ * memblock allocator to grab highmem.
+ */
+void __init reserve_hugetlb_gpages(void)
+{
+	static __initdata char cmdline[COMMAND_LINE_SIZE];
+	phys_addr_t size, base;
+	int i;
+
+	strlcpy(cmdline, boot_command_line, COMMAND_LINE_SIZE);
+	parse_args("hugetlb gpages", cmdline, NULL, 0, &do_gpage_early_setup);
+
+	/*
+	 * Walk gpage list in reverse, allocating larger page sizes first.
+	 * Skip over unsupported sizes, or sizes that have 0 gpages allocated.
+	 * When we reach the point in the list where pages are no longer
+	 * considered gpages, we're done.
+	 */
+	for (i = MMU_PAGE_COUNT-1; i >= 0; i--) {
+		if (mmu_psize_defs[i].shift == 0 || gpage_npages[i] == 0)
+			continue;
+		else if (mmu_psize_to_shift(i) < (MAX_ORDER + PAGE_SHIFT))
+			break;
+
+		size = (phys_addr_t)(1ULL << mmu_psize_to_shift(i));
+		base = memblock_alloc_base(size * gpage_npages[i], size,
+					   MEMBLOCK_ALLOC_ANYWHERE);
+		add_gpage(base, size, gpage_npages[i]);
+	}
+}
+
+#else /* PPC64 */
+
+/* Build list of addresses of gigantic pages.  This function is used in early
+ * boot before the buddy or bootmem allocator is setup.
+ */
+void add_gpage(u64 addr, u64 page_size, unsigned long number_of_pages)
 {
 	if (!addr)
 		return;
@@ -199,19 +348,79 @@ int alloc_bootmem_huge_page(struct hstate *hstate)
 	m->hstate = hstate;
 	return 1;
 }
+#endif
 
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 {
 	return 0;
 }
 
+#ifdef CONFIG_PPC32
+#define HUGEPD_FREELIST_SIZE \
+	((PAGE_SIZE - sizeof(struct hugepd_freelist)) / sizeof(pte_t))
+
+struct hugepd_freelist {
+	struct rcu_head	rcu;
+	unsigned int index;
+	void *ptes[0];
+};
+
+static DEFINE_PER_CPU(struct hugepd_freelist *, hugepd_freelist_cur);
+
+static void hugepd_free_rcu_callback(struct rcu_head *head)
+{
+	struct hugepd_freelist *batch =
+		container_of(head, struct hugepd_freelist, rcu);
+	unsigned int i;
+
+	for (i = 0; i < batch->index; i++)
+		kmem_cache_free(hugepte_cache, batch->ptes[i]);
+
+	free_page((unsigned long)batch);
+}
+
+static void hugepd_free(struct mmu_gather *tlb, void *hugepte)
+{
+	struct hugepd_freelist **batchp;
+
+	batchp = &__get_cpu_var(hugepd_freelist_cur);
+
+	if (atomic_read(&tlb->mm->mm_users) < 2 ||
+	    cpumask_equal(mm_cpumask(tlb->mm),
+			  cpumask_of(smp_processor_id()))) {
+		kmem_cache_free(hugepte_cache, hugepte);
+		return;
+	}
+
+	if (*batchp == NULL) {
+		*batchp = (struct hugepd_freelist *)__get_free_page(GFP_ATOMIC);
+		(*batchp)->index = 0;
+	}
+
+	(*batchp)->ptes[(*batchp)->index++] = hugepte;
+	if ((*batchp)->index == HUGEPD_FREELIST_SIZE) {
+		call_rcu_sched(&(*batchp)->rcu, hugepd_free_rcu_callback);
+		*batchp = NULL;
+	}
+}
+#endif
+
 static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshift,
 			      unsigned long start, unsigned long end,
 			      unsigned long floor, unsigned long ceiling)
 {
 	pte_t *hugepte = hugepd_page(*hpdp);
-	unsigned shift = hugepd_shift(*hpdp);
+	int i;
+
 	unsigned long pdmask = ~((1UL << pdshift) - 1);
+	unsigned int num_hugepd = 1;
+
+#ifdef CONFIG_PPC64
+	unsigned int shift = hugepd_shift(*hpdp);
+#else
+	/* Note: On 32-bit the hpdp may be the first of several */
+	num_hugepd = (1 << (hugepd_shift(*hpdp) - pdshift));
+#endif
 
 	start &= pdmask;
 	if (start < floor)
@@ -224,9 +433,15 @@ static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshif
 	if (end - 1 > ceiling - 1)
 		return;
 
-	hpdp->pd = 0;
+	for (i = 0; i < num_hugepd; i++, hpdp++)
+		hpdp->pd = 0;
+
 	tlb->need_flush = 1;
+#ifdef CONFIG_PPC64
 	pgtable_free_tlb(tlb, hugepte, pdshift - shift);
+#else
+	hugepd_free(tlb, hugepte);
+#endif
 }
 
 static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
@@ -331,18 +546,27 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 	 * too.
 	 */
 
-	pgd = pgd_offset(tlb->mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
+		pgd = pgd_offset(tlb->mm, addr);
 		if (!is_hugepd(pgd)) {
 			if (pgd_none_or_clear_bad(pgd))
 				continue;
 			hugetlb_free_pud_range(tlb, pgd, addr, next, floor, ceiling);
 		} else {
+#ifdef CONFIG_PPC32
+			/*
+			 * Increment next by the size of the huge mapping since
+			 * on 32-bit there may be more than one entry at the pgd
+			 * level for a single hugepage, but all of them point to
+			 * the same kmem cache that holds the hugepte.
+			 */
+			next = addr + (1 << hugepd_shift(*(hugepd_t *)pgd));
+#endif
 			free_hugepd_range(tlb, (hugepd_t *)pgd, PGDIR_SHIFT,
 					  addr, next, floor, ceiling);
 		}
-	} while (pgd++, addr = next, addr != end);
+	} while (addr = next, addr != end);
 }
 
 struct page *
@@ -466,17 +690,35 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 					unsigned long len, unsigned long pgoff,
 					unsigned long flags)
 {
+#ifdef CONFIG_PPC_MM_SLICES
 	struct hstate *hstate = hstate_file(file);
 	int mmu_psize = shift_to_mmu_psize(huge_page_shift(hstate));
 
 	return slice_get_unmapped_area(addr, len, flags, mmu_psize, 1, 0);
+#else
+	return get_unmapped_area(file, addr, len, pgoff, flags);
+#endif
 }
 
 unsigned long vma_mmu_pagesize(struct vm_area_struct *vma)
 {
+#ifdef CONFIG_PPC_MM_SLICES
 	unsigned int psize = get_slice_psize(vma->vm_mm, vma->vm_start);
 
 	return 1UL << mmu_psize_to_shift(psize);
+#else
+	if (!is_vm_hugetlb_page(vma))
+		return PAGE_SIZE;
+
+	return huge_page_size(hstate_vma(vma));
+#endif
+}
+
+static inline bool is_power_of_4(unsigned long x)
+{
+	if (is_power_of_2(x))
+		return (__ilog2(x) % 2) ? false : true;
+	return false;
 }
 
 static int __init add_huge_page_size(unsigned long long size)
@@ -486,9 +728,14 @@ static int __init add_huge_page_size(unsigned long long size)
 
 	/* Check that it is a page size supported by the hardware and
 	 * that it fits within pagetable and slice limits. */
+#ifdef CONFIG_PPC_FSL_BOOK3E
+	if ((size < PAGE_SIZE) || !is_power_of_4(size))
+		return -EINVAL;
+#else
 	if (!is_power_of_2(size)
 	    || (shift > SLICE_HIGH_SHIFT) || (shift <= PAGE_SHIFT))
 		return -EINVAL;
+#endif
 
 	if ((mmu_psize = shift_to_mmu_psize(shift)) < 0)
 		return -EINVAL;
@@ -525,6 +772,46 @@ static int __init hugepage_setup_sz(char *str)
 }
 __setup("hugepagesz=", hugepage_setup_sz);
 
+#ifdef CONFIG_FSL_BOOKE
+struct kmem_cache *hugepte_cache;
+static int __init hugetlbpage_init(void)
+{
+	int psize;
+
+	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
+		unsigned shift;
+
+		if (!mmu_psize_defs[psize].shift)
+			continue;
+
+		shift = mmu_psize_to_shift(psize);
+
+		/* Don't treat normal page sizes as huge... */
+		if (shift != PAGE_SHIFT)
+			if (add_huge_page_size(1ULL << shift) < 0)
+				continue;
+	}
+
+	/*
+	 * Create a kmem cache for hugeptes.  The bottom bits in the pte have
+	 * size information encoded in them, so align them to allow this
+	 */
+	hugepte_cache =  kmem_cache_create("hugepte-cache", sizeof(pte_t),
+					   HUGEPD_SHIFT_MASK + 1, 0, NULL);
+	if (hugepte_cache == NULL)
+		panic("%s: Unable to create kmem cache for hugeptes\n",
+		      __func__);
+
+	/* Default hpage size = 4M */
+	if (mmu_psize_defs[MMU_PAGE_4M].shift)
+		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_4M].shift;
+	else
+		panic("%s: Unable to set default huge page size\n", __func__);
+
+
+	return 0;
+}
+#else
 static int __init hugetlbpage_init(void)
 {
 	int psize;
@@ -567,15 +854,23 @@ static int __init hugetlbpage_init(void)
 
 	return 0;
 }
-
+#endif
 module_init(hugetlbpage_init);
 
 void flush_dcache_icache_hugepage(struct page *page)
 {
 	int i;
+	void *start;
 
 	BUG_ON(!PageCompound(page));
 
-	for (i = 0; i < (1UL << compound_order(page)); i++)
-		__flush_dcache_icache(page_address(page+i));
+	for (i = 0; i < (1UL << compound_order(page)); i++) {
+		if (!PageHighMem(page)) {
+			__flush_dcache_icache(page_address(page+i));
+		} else {
+			start = kmap_atomic(page+i, KM_PPC_SYNC_ICACHE);
+			__flush_dcache_icache(start);
+			kunmap_atomic(start, KM_PPC_SYNC_ICACHE);
+		}
+	}
 }
diff --git a/arch/powerpc/mm/init_32.c b/arch/powerpc/mm/init_32.c
index d65b591..3e1ba4c 100644
--- a/arch/powerpc/mm/init_32.c
+++ b/arch/powerpc/mm/init_32.c
@@ -32,6 +32,8 @@
 #include <linux/pagemap.h>
 #include <linux/memblock.h>
 #include <linux/gfp.h>
+#include <linux/slab.h>
+#include <linux/hugetlb.h>
 
 #include <asm/pgalloc.h>
 #include <asm/prom.h>
@@ -44,6 +46,7 @@
 #include <asm/tlb.h>
 #include <asm/sections.h>
 #include <asm/system.h>
+#include <asm/hugetlb.h>
 
 #include "mmu_decl.h"
 
@@ -123,6 +126,12 @@ void __init MMU_init(void)
 	/* parse args from command line */
 	MMU_setup();
 
+	/*
+	 * Reserve gigantic pages for hugetlb.  This MUST occur before
+	 * lowmem_end_addr is initialized below.
+	 */
+	reserve_hugetlb_gpages();
+
 	if (memblock.memory.cnt > 1) {
 #ifndef CONFIG_WII
 		memblock.memory.cnt = 1;
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 7209901..9e6f3a6 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -510,4 +510,9 @@ void update_mmu_cache(struct vm_area_struct *vma, unsigned long address,
 		return;
 	hash_preload(vma->vm_mm, address, access, trap);
 #endif /* CONFIG_PPC_STD_MMU */
+#if (defined(CONFIG_PPC_BOOK3E_64) || defined(CONFIG_PPC_FSL_BOOK3E)) \
+	&& defined(CONFIG_HUGETLB_PAGE)
+	if (is_vm_hugetlb_page(vma))
+		book3e_hugetlb_preload(vma->vm_mm, address, *ptep);
+#endif
 }
diff --git a/arch/powerpc/mm/mmu_context_nohash.c b/arch/powerpc/mm/mmu_context_nohash.c
index 336807d..5b63bd3 100644
--- a/arch/powerpc/mm/mmu_context_nohash.c
+++ b/arch/powerpc/mm/mmu_context_nohash.c
@@ -292,6 +292,11 @@ int init_new_context(struct task_struct *t, struct mm_struct *mm)
 	mm->context.id = MMU_NO_CONTEXT;
 	mm->context.active = 0;
 
+#ifdef CONFIG_PPC_MM_SLICES
+	if (slice_mm_new_context(mm))
+		slice_set_user_psize(mm, mmu_virtual_psize);
+#endif
+
 	return 0;
 }
 
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index af40c87..214130a 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -27,6 +27,7 @@
 #include <linux/init.h>
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
+#include <linux/hugetlb.h>
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
@@ -212,7 +213,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma, unsigned long address,
 	entry = set_access_flags_filter(entry, vma, dirty);
 	changed = !pte_same(*(ptep), entry);
 	if (changed) {
-		if (!(vma->vm_flags & VM_HUGETLB))
+		if (!is_vm_hugetlb_page(vma))
 			assert_pte_locked(vma->vm_mm, address);
 		__ptep_set_access_flags(ptep, entry);
 		flush_tlb_page_nohash(vma, address);
diff --git a/arch/powerpc/mm/tlb_low_64e.S b/arch/powerpc/mm/tlb_low_64e.S
index af08922..12ae424 100644
--- a/arch/powerpc/mm/tlb_low_64e.S
+++ b/arch/powerpc/mm/tlb_low_64e.S
@@ -347,24 +347,24 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_USE_TLBRSRV)
 	rldicl	r11,r16,64-VPTE_PGD_SHIFT,64-PGD_INDEX_SIZE-3
 	clrrdi	r10,r11,3
 	ldx	r15,r10,r15
-	cmpldi	cr0,r15,0
-	beq	virt_page_table_tlb_miss_fault
+	cmpdi	cr0,r15,0
+	bge	virt_page_table_tlb_miss_fault
 
 #ifndef CONFIG_PPC_64K_PAGES
 	/* Get to PUD entry */
 	rldicl	r11,r16,64-VPTE_PUD_SHIFT,64-PUD_INDEX_SIZE-3
 	clrrdi	r10,r11,3
 	ldx	r15,r10,r15
-	cmpldi	cr0,r15,0
-	beq	virt_page_table_tlb_miss_fault
+	cmpdi	cr0,r15,0
+	bge	virt_page_table_tlb_miss_fault
 #endif /* CONFIG_PPC_64K_PAGES */
 
 	/* Get to PMD entry */
 	rldicl	r11,r16,64-VPTE_PMD_SHIFT,64-PMD_INDEX_SIZE-3
 	clrrdi	r10,r11,3
 	ldx	r15,r10,r15
-	cmpldi	cr0,r15,0
-	beq	virt_page_table_tlb_miss_fault
+	cmpdi	cr0,r15,0
+	bge	virt_page_table_tlb_miss_fault
 
 	/* Ok, we're all right, we can now create a kernel translation for
 	 * a 4K or 64K page from r16 -> r15.
@@ -596,24 +596,24 @@ htw_tlb_miss:
 	rldicl	r11,r16,64-(PGDIR_SHIFT-3),64-PGD_INDEX_SIZE-3
 	clrrdi	r10,r11,3
 	ldx	r15,r10,r15
-	cmpldi	cr0,r15,0
-	beq	htw_tlb_miss_fault
+	cmpdi	cr0,r15,0
+	bge	htw_tlb_miss_fault
 
 #ifndef CONFIG_PPC_64K_PAGES
 	/* Get to PUD entry */
 	rldicl	r11,r16,64-(PUD_SHIFT-3),64-PUD_INDEX_SIZE-3
 	clrrdi	r10,r11,3
 	ldx	r15,r10,r15
-	cmpldi	cr0,r15,0
-	beq	htw_tlb_miss_fault
+	cmpdi	cr0,r15,0
+	bge	htw_tlb_miss_fault
 #endif /* CONFIG_PPC_64K_PAGES */
 
 	/* Get to PMD entry */
 	rldicl	r11,r16,64-(PMD_SHIFT-3),64-PMD_INDEX_SIZE-3
 	clrrdi	r10,r11,3
 	ldx	r15,r10,r15
-	cmpldi	cr0,r15,0
-	beq	htw_tlb_miss_fault
+	cmpdi	cr0,r15,0
+	bge	htw_tlb_miss_fault
 
 	/* Ok, we're all right, we can now create an indirect entry for
 	 * a 1M or 256M page.
diff --git a/arch/powerpc/mm/tlb_nohash.c b/arch/powerpc/mm/tlb_nohash.c
index ea037ba..8a8a893 100644
--- a/arch/powerpc/mm/tlb_nohash.c
+++ b/arch/powerpc/mm/tlb_nohash.c
@@ -35,14 +35,49 @@
 #include <linux/preempt.h>
 #include <linux/spinlock.h>
 #include <linux/memblock.h>
+#include <linux/hugetlb.h>
 
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
 #include <asm/code-patching.h>
+#include <asm/hugetlb.h>
 
 #include "mmu_decl.h"
 
-#ifdef CONFIG_PPC_BOOK3E
+/*
+ * This struct lists the sw-supported page sizes.  The hardware MMU may support
+ * other sizes not listed here.  The .ind field is only used on MMUs that have
+ * indirect page table entries.
+ */
+#ifdef CONFIG_PPC_BOOK3E_MMU
+#ifdef CONFIG_FSL_BOOKE
+struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT] = {
+	[MMU_PAGE_4K] = {
+		.shift	= 12,
+		.enc	= BOOK3E_PAGESZ_4K,
+	},
+	[MMU_PAGE_4M] = {
+		.shift	= 22,
+		.enc	= BOOK3E_PAGESZ_4M,
+	},
+	[MMU_PAGE_16M] = {
+		.shift	= 24,
+		.enc	= BOOK3E_PAGESZ_16M,
+	},
+	[MMU_PAGE_64M] = {
+		.shift	= 26,
+		.enc	= BOOK3E_PAGESZ_64M,
+	},
+	[MMU_PAGE_256M] = {
+		.shift	= 28,
+		.enc	= BOOK3E_PAGESZ_256M,
+	},
+	[MMU_PAGE_1G] = {
+		.shift	= 30,
+		.enc	= BOOK3E_PAGESZ_1GB,
+	},
+};
+#else
 struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT] = {
 	[MMU_PAGE_4K] = {
 		.shift	= 12,
@@ -76,6 +111,8 @@ struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT] = {
 		.enc	= BOOK3E_PAGESZ_1GB,
 	},
 };
+#endif /* CONFIG_FSL_BOOKE */
+
 static inline int mmu_get_tsize(int psize)
 {
 	return mmu_psize_defs[psize].enc;
@@ -86,7 +123,7 @@ static inline int mmu_get_tsize(int psize)
 	/* This isn't used on !Book3E for now */
 	return 0;
 }
-#endif
+#endif /* CONFIG_PPC_BOOK3E_MMU */
 
 /* The variables below are currently only used on 64-bit Book3E
  * though this will probably be made common with other nohash
@@ -265,6 +302,11 @@ void __flush_tlb_page(struct mm_struct *mm, unsigned long vmaddr,
 
 void flush_tlb_page(struct vm_area_struct *vma, unsigned long vmaddr)
 {
+#ifdef CONFIG_HUGETLB_PAGE
+	if (vma && is_vm_hugetlb_page(vma))
+		flush_hugetlb_page(vma, vmaddr);
+#endif
+
 	__flush_tlb_page(vma ? vma->vm_mm : NULL, vmaddr,
 			 mmu_get_tsize(mmu_virtual_psize), 0);
 }
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index 2165b65..9abc655 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -69,6 +69,7 @@ choice
 config PPC_BOOK3S_64
 	bool "Server processors"
 	select PPC_FPU
+	select SYS_SUPPORTS_HUGETLBFS
 
 config PPC_BOOK3E_64
 	bool "Embedded processors"
@@ -173,6 +174,7 @@ config BOOKE
 config FSL_BOOKE
 	bool
 	depends on (E200 || E500) && PPC32
+	select SYS_SUPPORTS_HUGETLBFS if PHYS_64BIT
 	default y
 
 # this is for common code between PPC32 & PPC64 FSL BOOKE
@@ -296,7 +298,7 @@ config PPC_BOOK3E_MMU
 
 config PPC_MM_SLICES
 	bool
-	default y if HUGETLB_PAGE || (PPC_STD_MMU_64 && PPC_64K_PAGES)
+	default y if (PPC64 && HUGETLB_PAGE) || (PPC_STD_MMU_64 && PPC_64K_PAGES)
 	default n
 
 config VIRT_CPU_ACCOUNTING
-- 
1.5.6.5

^ permalink raw reply related