LinuxPPC-Dev Archive on lore.kernel.org
* RE: AW: PowerPC PCI DMA issues (prefetch/coherency?)
From: Benjamin Herrenschmidt @ 2009-09-08 21:34 UTC
  To: azilkie; +Cc: phazarika, Tom Burns, Andrea Zypchen, linuxppc-dev
In-Reply-To: <1252438259.2548.50.camel@Adam>

On Tue, 2009-09-08 at 15:30 -0400, Adam Zilkie wrote:
> All,
> 
> We have found that using flush_dcache_range() after each DMA solves the
> problem. Ideally, we'd like to be able to allocate the virtual page in
> cache inhibited memory to avoid the performance loss from all the flush
> calls. To do this, we'd have to change our TLB sizes and reserve a TLB
> in memory as cache inhibited (using the 'I' bit). Will update if this
> works as well. Thanks for your help in this.

I think the problem is that you are manipulating the TLB directly, which
you shouldn't have to do. You also shouldn't have to call
flush_dcache_range() yourself either.

It should all be handled by the DMA and PCI DMA APIs; you are just not
using those correctly.

You have two choices. You can either allocate memory permanently mapped
with I=1, in which case use pci_alloc_consistent() (or
dma_alloc_coherent(), same thing).

Or you can use "normal" memory and ensure you flush/invalidate the cache
at the right time, which you can do with something like
pci_map_sg/pci_unmap_sg (or dma_* variants) or the dma_sync_* functions.

These are all pretty standard mechanisms in Linux; other platforms also
have non-coherent DMA (such as some ARMs) and those functions are generic.
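
To make the failure mode concrete, here is a toy user-space model of a
buffer behind a non-coherent cache. It is only a sketch, not kernel
code: struct toy_buf and the toy_* functions are made-up names, and
toy_sync_for_cpu() merely stands in for the invalidate that the
unmap/sync calls are expected to perform on a platform like this.

```c
#include <stddef.h>
#include <string.h>

#define BUF_LEN 32

/* Toy model: "ram" is what the device sees, "cache" is what the CPU sees. */
struct toy_buf {
	unsigned char ram[BUF_LEN];
	unsigned char cache[BUF_LEN];
	int cache_valid;	/* 1 while the CPU holds cached lines */
};

/* CPU read: served from the cache when valid, else refilled from RAM. */
unsigned char toy_cpu_read(struct toy_buf *b, size_t i)
{
	if (!b->cache_valid) {
		memcpy(b->cache, b->ram, BUF_LEN);
		b->cache_valid = 1;
	}
	return b->cache[i];
}

/* Device DMA write: goes straight to RAM, the cache is not told. */
void toy_dma_write(struct toy_buf *b, unsigned char val)
{
	memset(b->ram, val, BUF_LEN);
}

/* Stand-in for the cache invalidate done on behalf of the driver. */
void toy_sync_for_cpu(struct toy_buf *b)
{
	b->cache_valid = 0;
}
```

In this model, once the CPU has read the buffer, a second
toy_dma_write() is invisible to toy_cpu_read() until
toy_sync_for_cpu() is called again: the CPU keeps seeing the data from
the previous transfer, which is the symptom you describe.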

Cheers,
Ben.

> Regards,
> Adam
> 
> On Tue, 2009-09-08 at 11:59 -0700, Prodyut Hazarika wrote:
> > Hi Adam,
> > 
> > > Yes, I am using the 440EPx (same as the sequoia board).
> > > Our ideDriver is DMA'ing blocks of 192-byte data over the PCI bus (using
> > > the Sil0680A PCI-IDE bridge). Most of the DMAs (depending on timing)
> > > end up being partially corrupted when we try to parse the data in the
> > > virtual page. We have confirmed the data is good before the PCI-IDE
> > > bridge. We are creating two 8K pages and mapping them to physical DMA
> > > memory using single-entry scatter/gather structs. When a DMA block is
> > > corrupted, we see a random portion of it (always a multiple of 16-byte
> > > cache lines) is overwritten with old data from the last time the buffer
> > > was used.
> > 
> > This looks like a cache coherency problem.
> > Can you ensure that the TLB entries corresponding to the DMA region have
> > the CacheInhibit bit set?
> > You will need a BDI connected to your system.
> > 
> > Also, you will need to invalidate and flush the lines appropriately,
> > since in 440 cores L1 cache coherency is managed entirely by software.
> > Please look at drivers/net/ibm_newemac/mal.c and core.c for examples of
> > how to do it.
> > 
> > Thanks
> > Prodyut
> > 
> > On Thu, 2009-09-03 at 13:27 -0700, Prodyut Hazarika wrote:
> > > Hi Adam,
> > > 
> > > > Are you sure there is L2 cache on the 440?
> > > 
> > > It depends on the SoC you are using. SoCs like the 460EX (Canyonlands
> > > board) have an L2 cache.
> > > It seems you are using a Sequoia board, which has a 440EPx SoC. The
> > > 440EPx has a 440 CPU core, but no L2 cache.
> > > Could you please tell me which SoC you are using?
> > > You can also refer to the appropriate dts file to see if there is an L2C.
> > > For example, in canyonlands.dts (460EX based board), we have the L2C
> > > entry.
> > >         L2C0: l2c {
> > >               ...
> > >         }
> > > 
> > > > I am seeing this problem with our custom IDE driver which is based on
> > > > pretty old code. Our driver uses pci_alloc_consistent() to allocate the
> > > > physical DMA memory and alloc_pages() to allocate a virtual page. It
> > > > then uses pci_map_sg() to map to a scatter/gather buffer. Perhaps I
> > > > should convert these to the DMA API calls as you suggest.
> > > 
> > > Could you give more details on the consistency problem? It is a good
> > > idea to change to the new DMA APIs, but pci_alloc_consistent() should
> > > work too
> > > 
> > > Thanks
> > > Prodyut	
> > > 
> > > On Thu, 2009-09-03 at 19:57 +1000, Benjamin Herrenschmidt wrote:
> > > > On Thu, 2009-09-03 at 09:05 +0100, Chris Pringle wrote:
> > > > > Hi Adam,
> > > > > 
> > > > > If you have a look in include/asm-ppc/pgtable.h for the following
> > > > > section:
> > > > > #ifdef CONFIG_44x
> > > > > #define _PAGE_BASE    (_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_GUARDED)
> > > > > #else
> > > > > #define _PAGE_BASE    (_PAGE_PRESENT | _PAGE_ACCESSED)
> > > > > #endif
> > > > > 
> > > > > Try adding _PAGE_COHERENT to the appropriate line above and see if that
> > > > > fixes your issue - this causes the 'M' bit to be set on the page, which
> > > > > should enforce cache coherency. If it doesn't, you'll need to check the
> > > > > 'M' bit isn't being masked out in head_44x.S (it was originally masked
> > > > > out on arch/powerpc, but was fixed in later kernels when the cache
> > > > > coherency issues with non-SMP systems were resolved).
> > > > 
> > > > I have some doubts about the usefulness of doing that for 4xx. AFAIK,
> > > > the 440 core just ignores M.
> > > > 
> > > > The problem probably lies elsewhere. Maybe the L2 cache coherency isn't
> > > > enabled or not working?
> > > > 
> > > > The L1 cache on 440 is simply not coherent, so drivers have to make sure
> > > > they use the appropriate DMA APIs which will do cache flushing when
> > > > needed.
> > > > 
> > > > Adam, what driver is causing you that sort of problem?
> > > > 
> > > > Cheers,
> > > > Ben.
> > > > 
> > > > 
> > -- 
> > Adam Zilkie
> > Software Designer,
> > International Datacasting Corp.
> > 

* Re: [RFC] [PATCH] Write to HVC terminal from purgatory code
From: Simon Horman @ 2009-09-08 23:09 UTC
  To: M. Mohan Kumar; +Cc: linuxppc-dev, kexec, miltonm
In-Reply-To: <20090907051407.GA2990@in.ibm.com>

On Mon, Sep 07, 2009 at 10:44:07AM +0530, M. Mohan Kumar wrote:
> Write to HVC terminal from purgatory code
> 
> Current x86/x86-64 kexec-tools print the message "I'm in purgatory" to
> serial console/VGA while executing the purgatory code.  Implement this
> feature for the POWERPC pseries platform by printing to the hvc console
> with the H_PUT_TERM_CHAR hypervisor call.

This change seems reasonable to me. Can any of the ppc people offer a review?

> Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
> ---
>  kexec/arch/ppc64/fs2dt.c               |   47 +++++++++++++++++++++++++++++++-
>  kexec/arch/ppc64/kexec-elf-ppc64.c     |    7 +++++
>  kexec/arch/ppc64/kexec-ppc64.h         |    1 +
>  purgatory/arch/ppc64/Makefile          |    1 +
>  purgatory/arch/ppc64/console-ppc64.c   |   14 +++++++++
>  purgatory/arch/ppc64/hvCall.S          |   28 +++++++++++++++++++
>  purgatory/arch/ppc64/hvCall.h          |    8 +++++
>  purgatory/arch/ppc64/purgatory-ppc64.c |    1 +
>  8 files changed, 106 insertions(+), 1 deletions(-)
>  create mode 100644 purgatory/arch/ppc64/hvCall.S
>  create mode 100644 purgatory/arch/ppc64/hvCall.h
> 
> diff --git a/kexec/arch/ppc64/fs2dt.c b/kexec/arch/ppc64/fs2dt.c
> index b01ff86..bd9d36c 100644
> --- a/kexec/arch/ppc64/fs2dt.c
> +++ b/kexec/arch/ppc64/fs2dt.c
> @@ -434,6 +434,9 @@ static void putnode(void)
>  	if (!strcmp(basename,"/chosen/")) {
>  		size_t cmd_len = 0;
>  		char *param = NULL;
> +		char filename[MAXPATH];
> +		char buff[64];

Is 64 always big enough? It seems a bit arbitrary.
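
One way to make the size question moot would be to bound the read by
the buffer size rather than by st_size. A minimal sketch, assuming a
helper along these lines (read_prop() is a hypothetical name, not
something that exists in kexec-tools, and EINTR handling is left out):

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read at most bufsz - 1 bytes from fd into buf and NUL-terminate it.
 * Returns the number of bytes stored, or -1 on error.
 * Hypothetical helper for illustration only.
 */
ssize_t read_prop(int fd, char *buf, size_t bufsz)
{
	size_t used = 0;

	if (bufsz == 0)
		return -1;

	while (used < bufsz - 1) {
		ssize_t n = read(fd, buf + used, bufsz - 1 - used);

		if (n < 0)
			return -1;
		if (n == 0)	/* end of file */
			break;
		used += (size_t)n;
	}
	buf[used] = '\0';
	return (ssize_t)used;
}
```

With something like this, a property longer than the buffer is
truncated instead of overrunning buff, and the later strcat()/strcmp()
calls always see a terminated string.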

> +		int fd;
>  
>  		cmd_len = strlen(local_cmdline);
>  		if (cmd_len != 0) {
> @@ -446,7 +449,6 @@ static void putnode(void)
>  
>  		/* ... if not, grab root= from the old command line */
>  		if (!param) {
> -			char filename[MAXPATH];
>  			FILE *fp;
>  			char *last_cmdline = NULL;
>  			char *old_param;
> @@ -483,8 +485,51 @@ static void putnode(void)
>  		dt += (cmd_len + 3)/4;
>  
>  		fprintf(stderr, "Modified cmdline:%s\n", local_cmdline);
> +
> +		/*
> +		 * Determine the platform type/stdout type, so that purgatory
> +		 * code can print 'I'm in purgatory' message. Currently only
> +		 * pseries/hvcterminal is supported.
> +		 */
> +		strcpy(filename, pathname);
> +		strcat(filename, "linux,stdout-path");
> +		fd = open(filename, O_RDONLY);
> +		if (fd == -1) {
> +			printf("Unable to find linux,stdout-path, printing"
> +					" from purgatory is disabled\n");
> +			goto no_debug;
> +		}
> +		if (fstat(fd, &statbuf)) {
> +			printf("Unable to stat linux,stdout-path, printing"
> +					" from purgatory is disabled\n");
> +			close(fd);
> +			goto no_debug;
> +		}
> +		read(fd, buff, statbuf.st_size);
> +		close(fd);
> +		strcpy(filename, "/proc/device-tree/");
> +		strcat(filename, buff);
> +		strcat(filename, "/compatible");
> +		fd = open(filename, O_RDONLY);
> +		if (fd == -1) {
> +			printf("Unable to find linux,stdout-path/compatible, "
> +				"printing from purgatory is disabled\n");
> +			goto no_debug;
> +		}
> +		if (fstat(fd, &statbuf)) {
> +			printf("Unable to stat linux,stdout-path/compatible, "
> +				"printing from purgatory is disabled\n");
> +			close(fd);
> +			goto no_debug;
> +		}
> +		read(fd, buff, statbuf.st_size);
> +		if (!strcmp(buff, "hvterm1") ||
> +					!strcmp(buff, "hvterm-protocol"))
> +			my_debug = 1;
> +		close(fd);
>  	}
>  
> +no_debug:
>  	for (i=0; i < numlist; i++) {
>  		dp = namelist[i];
>  		strcpy(dn, dp->d_name);
> diff --git a/kexec/arch/ppc64/kexec-elf-ppc64.c b/kexec/arch/ppc64/kexec-elf-ppc64.c
> index 21533cb..65fc42f 100644
> --- a/kexec/arch/ppc64/kexec-elf-ppc64.c
> +++ b/kexec/arch/ppc64/kexec-elf-ppc64.c
> @@ -41,6 +41,8 @@
>  uint64_t initrd_base, initrd_size;
>  unsigned char reuse_initrd = 0;
>  const char *ramdisk;
> +/* Used for enabling printing message from purgatory code */
> +int my_debug = 0;
>  
>  int elf_ppc64_probe(const char *buf, off_t len)
>  {
> @@ -296,6 +298,8 @@ int elf_ppc64_load(int argc, char **argv, const char *buf, off_t len,
>  	toc_addr = my_r2(&info->rhdr);
>  	elf_rel_set_symbol(&info->rhdr, "my_toc", &toc_addr, sizeof(toc_addr));
>  
> +	/* Set debug */
> +	elf_rel_set_symbol(&info->rhdr, "debug", &my_debug, sizeof(my_debug));
>  #ifdef DEBUG
>  	my_kernel = 0;
>  	my_dt_offset = 0;
> @@ -304,6 +308,7 @@ int elf_ppc64_load(int argc, char **argv, const char *buf, off_t len,
>  	my_stack = 0;
>  	toc_addr = 0;
>  	my_run_at_load = 0;
> +	my_debug = 0;
>  
>  	elf_rel_get_symbol(&info->rhdr, "kernel", &my_kernel, sizeof(my_kernel));
>  	elf_rel_get_symbol(&info->rhdr, "dt_offset", &my_dt_offset,
> @@ -317,6 +322,7 @@ int elf_ppc64_load(int argc, char **argv, const char *buf, off_t len,
>  	elf_rel_get_symbol(&info->rhdr, "stack", &my_stack, sizeof(my_stack));
>  	elf_rel_get_symbol(&info->rhdr, "my_toc", &toc_addr,
>  				sizeof(toc_addr));
> +	elf_rel_get_symbol(&info->rhdr, "debug", &my_debug, sizeof(my_debug));
>  
>  	fprintf(stderr, "info->entry is %p\n", info->entry);
>  	fprintf(stderr, "kernel is %llx\n", (unsigned long long)my_kernel);
> @@ -329,6 +335,7 @@ int elf_ppc64_load(int argc, char **argv, const char *buf, off_t len,
>  	fprintf(stderr, "stack is %llx\n", (unsigned long long)my_stack);
>  	fprintf(stderr, "toc_addr is %llx\n", (unsigned long long)toc_addr);
>  	fprintf(stderr, "purgatory size is %zu\n", purgatory_size);
> +	fprintf(stderr, "debug is %d\n", my_debug);
>  #endif
>  
>  	for (i = 0; i < info->nr_segments; i++)
> diff --git a/kexec/arch/ppc64/kexec-ppc64.h b/kexec/arch/ppc64/kexec-ppc64.h
> index 920ac46..838c6da 100644
> --- a/kexec/arch/ppc64/kexec-ppc64.h
> +++ b/kexec/arch/ppc64/kexec-ppc64.h
> @@ -20,6 +20,7 @@ unsigned long my_r2(const struct mem_ehdr *ehdr);
>  extern uint64_t initrd_base, initrd_size;
>  extern int max_memory_ranges;
>  extern unsigned char reuse_initrd;
> +extern int my_debug;
>  
>  /* boot block version 2 as defined by the linux kernel */
>  struct bootblock {
> diff --git a/purgatory/arch/ppc64/Makefile b/purgatory/arch/ppc64/Makefile
> index aaa4046..40a9e99 100644
> --- a/purgatory/arch/ppc64/Makefile
> +++ b/purgatory/arch/ppc64/Makefile
> @@ -3,6 +3,7 @@
>  #
>  
>  ppc64_PURGATORY_SRCS += purgatory/arch/ppc64/v2wrap.S
> +ppc64_PURGATORY_SRCS += purgatory/arch/ppc64/hvCall.S
>  ppc64_PURGATORY_SRCS += purgatory/arch/ppc64/purgatory-ppc64.c
>  ppc64_PURGATORY_SRCS += purgatory/arch/ppc64/console-ppc64.c
>  ppc64_PURGATORY_SRCS += purgatory/arch/ppc64/crashdump_backup.c
> diff --git a/purgatory/arch/ppc64/console-ppc64.c b/purgatory/arch/ppc64/console-ppc64.c
> index d6da7b3..78a233b 100644
> --- a/purgatory/arch/ppc64/console-ppc64.c
> +++ b/purgatory/arch/ppc64/console-ppc64.c
> @@ -20,8 +20,22 @@
>   */
>  
>  #include <purgatory.h>
> +#include "hvCall.h"
> +
> +extern int debug;
>  
>  void putchar(int c)
>  {
> +	char buff[16];
> +	unsigned long *lbuf = (unsigned long *)buff;
> +
> +	if (!debug) /* running on non pseries */
> +		return;
> +
> +	if (c == '\n')
> +		putchar('\r');
> +
> +	buff[0] = c;
> +	plpar_hcall_norets(H_PUT_TERM_CHAR, 0, 1, lbuf[0], lbuf[1]);
>  	return;
>  }
> diff --git a/purgatory/arch/ppc64/hvCall.S b/purgatory/arch/ppc64/hvCall.S
> new file mode 100644
> index 0000000..e401f81
> --- /dev/null
> +++ b/purgatory/arch/ppc64/hvCall.S
> @@ -0,0 +1,28 @@
> +/*
> + * This file contains the generic function to perform a call to the
> + * pSeries LPAR hypervisor.
> + *
> + * Created by M. Mohan Kumar (mohan@in.ibm.com)
> + * Copyright (C) IBM Corporation
> + * Taken from linux/arch/powerpc/platforms/pseries/hvCall.S
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + */
> +
> +#define HVSC	.long 0x44000022
> +.text
> +	.machine ppc64
> +.globl .plpar_hcall_norets
> +.plpar_hcall_norets:
> +	or	6,6,6			# medium low priority
> +        mfcr	0
> +        stw	0,8(1)
> +
> +        HVSC 				/* invoke the hypervisor */
> +
> +        lwz	0,8(1)
> +        mtcrf	0xff,0
> +        blr                             /* return r3 = status */
> diff --git a/purgatory/arch/ppc64/hvCall.h b/purgatory/arch/ppc64/hvCall.h
> new file mode 100644
> index 0000000..187e24d
> --- /dev/null
> +++ b/purgatory/arch/ppc64/hvCall.h
> @@ -0,0 +1,8 @@
> +#ifndef HVCALL_H
> +#define HVCALL_H
> +
> +#define H_PUT_TERM_CHAR	0x58
> +
> +long plpar_hcall_norets(unsigned long opcode, ...);
> +
> +#endif
> diff --git a/purgatory/arch/ppc64/purgatory-ppc64.c b/purgatory/arch/ppc64/purgatory-ppc64.c
> index 93f28d2..0b6d326 100644
> --- a/purgatory/arch/ppc64/purgatory-ppc64.c
> +++ b/purgatory/arch/ppc64/purgatory-ppc64.c
> @@ -28,6 +28,7 @@ unsigned long stack = 0;
>  unsigned long dt_offset = 0;
>  unsigned long my_toc = 0;
>  unsigned long kernel = 0;
> +unsigned int debug = 0;
>  
>  void setup_arch(void)
>  {
> -- 
> 1.6.2.5

* RE: Queries regarding I2C and GPIO driver for Freescale MPC5121e in Linux2.6.24 of BSP: MPC512xADS_20090603-ltib.iso
From: Chen Hongjun-R66092 @ 2009-09-09  0:08 UTC
  To: Uma Kanta Patro, linuxppc-dev
In-Reply-To: <!&!AAAAAAAAAAAYAAAAAAAAACk7Zq0ADiBFlLw+tPrMt1bCgAAAEAAAAMhVxXG9wghAjqGQWiEYvY8BAAAAAA==@implantaire.com>

An I2C driver is included in the 0603 BSP; you can refer to it.

There is no specific driver for GPIO, but you can find some GPIO
initialization code in arch/powerpc/platforms/512x/mpc5125_ads.c and
mpc512x_pm_test.c.



________________________________

	From: linuxppc-dev-bounces+hong-jun.chen=freescale.com@lists.ozlabs.org
	[mailto:linuxppc-dev-bounces+hong-jun.chen=freescale.com@lists.ozlabs.org]
	On Behalf Of Uma Kanta Patro
	Sent: Tuesday, September 08, 2009 6:56 PM
	To: linuxppc-dev@lists.ozlabs.org
	Subject: Queries regarding I2C and GPIO driver for Freescale
	MPC5121e in Linux2.6.24 of BSP: MPC512xADS_20090603-ltib.iso

	Hi all,

	I am a newbie to the PowerPC Linux kernel, but I have worked on some
	drivers on the ARM architecture. I am finding the PowerPC architecture
	to be completely different.

	I am working on a Freescale MPC5121e with the BSP
	MPC512xADS_20090603-ltib.iso running on the ADS512101 Rev4.1
	development kit.

	Can anyone help me find some documentation for understanding and
	working on the PowerPC kernel? Any links to PowerPC forums would
	also be appreciated.

	-> Currently I am developing an I2C client driver for a slave
	microcontroller in our project. I have some knowledge of writing
	I2C client drivers (legacy style and new style).

	I made a basic I2C client driver to probe for the chip address, and
	for testing I gave it the chip address 0x68 (the I2C chip address of
	the M4T162 RTC present on the board). But when inserting my driver I
	get a failure message for the detection of my chip.

	So I would like to know what else I am missing in my I2C chip driver.

	-> I also need a GPIO driver for my controller to get an interrupt
	on a state change. When I searched the kernel code I could not find
	any procedure to do that, nor could I find how to access either any
	GPIO pin macros or any register to remap with ioremap(). So please
	guide me in finding the proper way to do GPIO access and interrupt
	registration.

	Will ioremap() work on the powerpc arch? If yes, where can I find
	the memory mapping (register definitions) to use for my GPIO driver?

	Thanks for your patience in reading my queries.
	Any help is appreciated.

	Thanks & Regards,
	Uma

* [PATCH] Don't set DABR on 64-bit BookE, use DAC1 instead
From: Benjamin Herrenschmidt @ 2009-09-09  0:16 UTC
  To: linuxppc-dev list; +Cc: Kumar Gala

Also remove a duplicate setting of it in the context switch path
on BookE.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 arch/powerpc/kernel/process.c |   14 +++++++-------
 1 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 678ff13..0a32164 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -284,14 +284,13 @@ int set_dabr(unsigned long dabr)
 		return ppc_md.set_dabr(dabr);
 
 	/* XXX should we have a CPU_FTR_HAS_DABR ? */
-#if defined(CONFIG_PPC64) || defined(CONFIG_6xx)
-	mtspr(SPRN_DABR, dabr);
-#endif
-
 #if defined(CONFIG_BOOKE)
 	mtspr(SPRN_DAC1, dabr);
+#elif defined(CONFIG_PPC_BOOK3S)
+	mtspr(SPRN_DABR, dabr);
 #endif
 
+
 	return 0;
 }
 
@@ -372,15 +371,16 @@ struct task_struct *__switch_to(struct task_struct *prev,
 
 #endif /* CONFIG_SMP */
 
-	if (unlikely(__get_cpu_var(current_dabr) != new->thread.dabr))
-		set_dabr(new->thread.dabr);
-
 #if defined(CONFIG_BOOKE)
 	/* If new thread DAC (HW breakpoint) is the same then leave it */
 	if (new->thread.dabr)
 		set_dabr(new->thread.dabr);
+#else
+	if (unlikely(__get_cpu_var(current_dabr) != new->thread.dabr))
+		set_dabr(new->thread.dabr);
 #endif
 
+
 	new_thread = &new->thread;
 	old_thread = &current->thread;
 

* [PATCH] [SCSI] mpt fusion: Fix 32 bit platforms with 64 bit resources
From: pbathija @ 2009-09-09  0:15 UTC
  To: linux-scsi; +Cc: linuxppc-dev, Pravin Bathija

From: Pravin Bathija <pbathija@amcc.com>

PowerPC 44x uses a 36-bit real address, while the MPT Fusion driver holds
the real address in 32-bit types. This causes ioremap to fail and the driver
fails to initialize. This fix changes the data types representing the real
address from 32-bit unsigned long types to "phys_addr_t", which is 64-bit on
this platform. The driver has been tested: the disks get discovered correctly
and can do I/O.
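
The truncation can be illustrated with a small stand-alone sketch. This
is not driver code: the toy_* type names and the helper functions are
made up for illustration, and plain fixed-width integers stand in for
the kernel's address types.

```c
#include <stdint.h>

/* Illustrative stand-ins for a 32-bit and a 64-bit address type. */
typedef uint32_t toy_addr32_t;
typedef uint64_t toy_addr64_t;

/* Store a bus address in a 32-bit variable, as the old code effectively did. */
toy_addr32_t store_in_32(toy_addr64_t phys)
{
	return (toy_addr32_t)phys;	/* bits 32 and above are silently dropped */
}

/* Returns 1 if the address survives a round trip through 32 bits. */
int fits_in_32(toy_addr64_t phys)
{
	return (toy_addr64_t)store_in_32(phys) == phys;
}
```

Any address with bits set above bit 31 comes back from store_in_32()
as a different, lower address, so a mapping request made with the
truncated value points at the wrong location.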

Signed-off-by: Pravin Bathija <pbathija@amcc.com>
Acked-by: Feng Kan <fkan@amcc.com>
Acked-by: Prodyut Hazarika <phazarika@amcc.com>
Acked-by: Loc Ho <lho@amcc.com>
Acked-by: Tirumala Reddy Marri <tmarri@amcc.com>
Acked-by: Victor Gallardo <vgallardo@amcc.com>
---
 drivers/message/fusion/mptbase.c |   34 +++++++++++++++++++++++++---------
 drivers/message/fusion/mptbase.h |    5 +++--
 2 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/drivers/message/fusion/mptbase.c b/drivers/message/fusion/mptbase.c
index 5d496a9..d5b0f15 100644
--- a/drivers/message/fusion/mptbase.c
+++ b/drivers/message/fusion/mptbase.c
@@ -1510,11 +1510,12 @@ static int
 mpt_mapresources(MPT_ADAPTER *ioc)
 {
 	u8		__iomem *mem;
+	u8		__iomem *port;
 	int		 ii;
-	unsigned long	 mem_phys;
-	unsigned long	 port;
-	u32		 msize;
-	u32		 psize;
+	phys_addr_t	 mem_phys;
+	phys_addr_t	 port_phys;
+	resource_size_t	 msize;
+	resource_size_t	 psize;
 	u8		 revision;
 	int		 r = -ENODEV;
 	struct pci_dev *pdev;
@@ -1552,13 +1553,13 @@ mpt_mapresources(MPT_ADAPTER *ioc)
 	}
 
 	mem_phys = msize = 0;
-	port = psize = 0;
+	port_phys = psize = 0;
 	for (ii = 0; ii < DEVICE_COUNT_RESOURCE; ii++) {
 		if (pci_resource_flags(pdev, ii) & PCI_BASE_ADDRESS_SPACE_IO) {
 			if (psize)
 				continue;
 			/* Get I/O space! */
-			port = pci_resource_start(pdev, ii);
+			port_phys = pci_resource_start(pdev, ii);
 			psize = pci_resource_len(pdev, ii);
 		} else {
 			if (msize)
@@ -1580,14 +1581,23 @@ mpt_mapresources(MPT_ADAPTER *ioc)
 		return -EINVAL;
 	}
 	ioc->memmap = mem;
-	dinitprintk(ioc, printk(MYIOC_s_INFO_FMT "mem = %p, mem_phys = %lx\n",
-	    ioc->name, mem, mem_phys));
+	dinitprintk(ioc, printk(MYIOC_s_INFO_FMT "mem = %p, mem_phys = %llx\n",
+	    ioc->name, mem, (u64)mem_phys));
 
 	ioc->mem_phys = mem_phys;
 	ioc->chip = (SYSIF_REGS __iomem *)mem;
 
 	/* Save Port IO values in case we need to do downloadboot */
-	ioc->pio_mem_phys = port;
+	port = ioremap(port_phys, psize);
+	if (port == NULL) {
+		printk(MYIOC_s_ERR_FMT " : ERROR - Unable to map adapter"
+			" port !\n", ioc->name);
+		return -EINVAL;
+	}
+	ioc->portmap = port;
+	dinitprintk(ioc, printk(MYIOC_s_INFO_FMT "port=%p, port_phys=%llx\n",
+			ioc->name, port, (u64)port_phys));
+	ioc->pio_mem_phys = port_phys;
 	ioc->pio_chip = (SYSIF_REGS __iomem *)port;
 
 	return 0;
@@ -1822,6 +1832,7 @@ mpt_attach(struct pci_dev *pdev, const struct pci_device_id *id)
 		if (ioc->alt_ioc)
 			ioc->alt_ioc->alt_ioc = NULL;
 		iounmap(ioc->memmap);
+		iounmap(ioc->portmap);
 		if (r != -5)
 			pci_release_selected_regions(pdev, ioc->bars);
 
@@ -2583,6 +2594,11 @@ mpt_adapter_dispose(MPT_ADAPTER *ioc)
 		ioc->memmap = NULL;
 	}
 
+	if (ioc->portmap != NULL) {
+		iounmap(ioc->portmap);
+		ioc->portmap = NULL;
+	}
+
 	pci_disable_device(ioc->pcidev);
 	pci_release_selected_regions(ioc->pcidev, ioc->bars);
 
diff --git a/drivers/message/fusion/mptbase.h b/drivers/message/fusion/mptbase.h
index b3e981d..8e12bf8 100644
--- a/drivers/message/fusion/mptbase.h
+++ b/drivers/message/fusion/mptbase.h
@@ -584,8 +584,8 @@ typedef struct _MPT_ADAPTER
 	SYSIF_REGS __iomem	*chip;		/* == c8817000 (mmap) */
 	SYSIF_REGS __iomem	*pio_chip;	/* Programmed IO (downloadboot) */
 	u8			 bus_type;
-	u32			 mem_phys;	/* == f4020000 (mmap) */
-	u32			 pio_mem_phys;	/* Programmed IO (downloadboot) */
+	phys_addr_t		 mem_phys;	/* == f4020000 (mmap) */
+	phys_addr_t		 pio_mem_phys;	/* Programmed IO (downloadboot) */
 	int			 mem_size;	/* mmap memory size */
 	int			 number_of_buses;
 	int			 devices_per_bus;
@@ -635,6 +635,7 @@ typedef struct _MPT_ADAPTER
 	int			bars;		/* bitmask of BAR's that must be configured */
 	int			msi_enable;
 	u8			__iomem *memmap;	/* mmap address */
+	u8			__iomem *portmap;	/* mmap port address */
 	struct Scsi_Host	*sh;		/* Scsi Host pointer */
 	SpiCfgData		spi_data;	/* Scsi config. data */
 	RaidCfgData		raid_data;	/* Raid config. data */
-- 
1.5.5

* Re: [FTRACE] Enabling function_graph causes OOPS
From: Steven Rostedt @ 2009-09-09  1:05 UTC
  To: Sachin Sant; +Cc: linuxppc-dev
In-Reply-To: <4A76BE81.4080707@in.ibm.com>

On Mon, 2009-08-03 at 16:10 +0530, Sachin Sant wrote:
> Steven Rostedt wrote:
> > Thanks,
> >
> > I've seen issues with my PPC box and function graph, but the bugs were
> > also caused by other changes. I'll boot up my PPC64 box and see if
> > I see the same issues you have.
> >   
> Hi Steven,
> 
> I can still recreate this issue with 2.6.31-rc5. Let me know
> if i can provide any information to find a solution for this.

Hi Sachin,

I'm going through old email, and I found this. Do you still see this
error? I don't recall seeing it myself.

Thanks,

-- Steve

* RE: AW: PowerPC PCI DMA issues (prefetch/coherency?)
From: Benjamin Herrenschmidt @ 2009-09-09  1:34 UTC
  To: azilkie; +Cc: phazarika, Tom Burns, Andrea Zypchen, linuxppc-dev
In-Reply-To: <1252440026.2548.53.camel@Adam>

On Tue, 2009-09-08 at 16:00 -0400, Adam Zilkie wrote:
> We are using pci_alloc_consistent()

Then your flush should have no effect since pci_alloc_consistent will
return I=1 mapped memory, unless you don't have
CONFIG_NOT_COHERENT_CACHE for some reason.

Cheers,
Ben.

> Adam
> 
> On Tue, 2009-09-08 at 12:56 -0700, Prodyut Hazarika wrote:
> > > We have found that using flush_dcache_range() after each DMA solves the
> > > problem. Ideally, we'd like to be able to allocate the virtual page in
> > > cache inhibited memory to avoid the performance loss from all the flush
> > > calls. To do this, we'd have to change our TLB sizes and reserve a TLB
> > > in memory as cache inhibited (using the 'I' bit). Will update if this
> > > works as well. Thanks for your help in this.
> > 
> > Aren't you using dma_alloc_coherent to get buffers that are shared
> > between CPU and external devices?
> > 
> > Thanks
> > Prodyut

* Re: [PATCH][sata_fsl] Defer non-ncq commands when ncq commands active
From: Jeff Garzik @ 2009-09-09  1:25 UTC (permalink / raw)
  To: ashish kalra; +Cc: linux-ide, linuxppc-dev
In-Reply-To: <Pine.WNT.4.64.0907292129310.4440@B00888-02.fsl.freescale.net>

On 07/29/2009 12:03 PM, ashish kalra wrote:
> From: Ashish Kalra <Ashish.Kalra@freescale.com>
> Date: Wed, 29 Jul 2009 21:15:49 +0530
>
> Fix for non-ncq & ncq commands causing timeouts when both are issued
> simultaneously to the same device.
>
> Signed-off-by: Ashish Kalra <Ashish.Kalra@freescale.com>
> ---
> drivers/ata/sata_fsl.c | 1 +
> 1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/ata/sata_fsl.c b/drivers/ata/sata_fsl.c
> index 5a88b44..a33f130 100644
> --- a/drivers/ata/sata_fsl.c
> +++ b/drivers/ata/sata_fsl.c
> @@ -1262,6 +1262,7 @@ static struct scsi_host_template sata_fsl_sht = {
> static struct ata_port_operations sata_fsl_ops = {
> .inherits = &sata_pmp_port_ops,
>
> + .qc_defer = ata_std_qc_defer;

Applied version with obvious s/;/,/ fix...

	Jeff

^ permalink raw reply

* Re: [PATCH] powerpc/85xx: Fix SMP compile error and allow NULL for smp_ops
From: Kumar Gala @ 2009-09-09  3:08 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev
In-Reply-To: <1252445472.4950.96.camel@pasglop>


On Sep 8, 2009, at 4:31 PM, Benjamin Herrenschmidt wrote:

> On Tue, 2009-09-08 at 14:21 -0500, Kumar Gala wrote:
>
>>
>> struct smp_ops_t smp_85xx_ops = {
>> +	.message_pass = NULL,
>> +	.probe = NULL,
>> 	.kick_cpu = smp_85xx_kick_cpu,
>> +	.setup_cpu = NULL,
>> };
>
> Why explicitly setting those to NULL ?
>
> Cheers,
> Ben.

couldn't remember if we get NULL for initialized structs or not.

- k
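
For reference, C does guarantee this: when a designated initializer names
only some members, the remaining members are initialized as if they had
static storage duration, so pointers come out NULL. A minimal user-space
sketch follows; struct demo_smp_ops and demo_kick_cpu are hypothetical
stand-ins, not the kernel's smp_ops_t.

```c
#include <stddef.h>

/* Hypothetical stand-in for an ops structure; only the shape matters. */
struct demo_smp_ops {
	void (*message_pass)(int target, int msg);
	int  (*probe)(void);
	void (*kick_cpu)(int nr);
	void (*setup_cpu)(int nr);
};

static void demo_kick_cpu(int nr)
{
	(void)nr;
}

/* Only .kick_cpu is named; the other members are implicitly zeroed. */
static struct demo_smp_ops demo_ops = {
	.kick_cpu = demo_kick_cpu,
};

/* Returns 1 when every member left out of the initializer is NULL. */
static int demo_unset_members_are_null(void)
{
	return demo_ops.message_pass == NULL &&
	       demo_ops.probe == NULL &&
	       demo_ops.setup_cpu == NULL &&
	       demo_ops.kick_cpu == demo_kick_cpu;
}
```

The same rule applies to automatic (stack) objects that have an
initializer list, so the explicit NULL assignments are redundant either way.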

^ permalink raw reply

* [PATCH v2] powerpc/85xx: Fix SMP compile error and allow NULL for smp_ops
From: Kumar Gala @ 2009-09-09  3:38 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev

The following commit introduced a compile error since it removed
the implementation of smp_85xx_basic_setup:

commit 77c0a700c1c292edafa11c1e52821ce4636f81b0
Author: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date:   Fri Aug 28 14:25:04 2009 +1000

    powerpc: Properly start decrementer on BookE secondary CPUs

Make it so that smp_ops probe() and setup_cpu() can be set to NULL.

Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
---
* Removed explicit setting of NULL in structure

 arch/powerpc/kernel/smp.c         |   10 +++++++---
 arch/powerpc/platforms/85xx/smp.c |   10 ----------
 2 files changed, 7 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 96f107c..d387b39 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -269,7 +269,10 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 	cpu_callin_map[boot_cpuid] = 1;
 
 	if (smp_ops)
-		max_cpus = smp_ops->probe();
+		if (smp_ops->probe)
+			max_cpus = smp_ops->probe();
+		else
+			max_cpus = NR_CPUS;
 	else
 		max_cpus = 1;
  
@@ -493,7 +496,8 @@ int __devinit start_secondary(void *unused)
 	preempt_disable();
 	cpu_callin_map[cpu] = 1;
 
-	smp_ops->setup_cpu(cpu);
+	if (smp_ops->setup_cpu)
+		smp_ops->setup_cpu(cpu);
 	if (smp_ops->take_timebase)
 		smp_ops->take_timebase();
 
@@ -556,7 +560,7 @@ void __init smp_cpus_done(unsigned int max_cpus)
 	old_mask = current->cpus_allowed;
 	set_cpus_allowed(current, cpumask_of_cpu(boot_cpuid));
 	
-	if (smp_ops)
+	if (smp_ops && smp_ops->setup_cpu)
 		smp_ops->setup_cpu(boot_cpuid);
 
 	set_cpus_allowed(current, old_mask);
diff --git a/arch/powerpc/platforms/85xx/smp.c b/arch/powerpc/platforms/85xx/smp.c
index 94f901d..04160a4 100644
--- a/arch/powerpc/platforms/85xx/smp.c
+++ b/arch/powerpc/platforms/85xx/smp.c
@@ -88,25 +88,15 @@ struct smp_ops_t smp_85xx_ops = {
 	.kick_cpu = smp_85xx_kick_cpu,
 };
 
-static int __init smp_dummy_probe(void)
-{
-	return NR_CPUS;
-}
-
 void __init mpc85xx_smp_init(void)
 {
 	struct device_node *np;
 
-	smp_85xx_ops.message_pass = NULL;
-
 	np = of_find_node_by_type(NULL, "open-pic");
 	if (np) {
 		smp_85xx_ops.probe = smp_mpic_probe;
 		smp_85xx_ops.setup_cpu = smp_85xx_setup_cpu;
 		smp_85xx_ops.message_pass = smp_mpic_message_pass;
-	} else {
-		smp_85xx_ops.probe = smp_dummy_probe;
-		smp_85xx_ops.setup_cpu = smp_85xx_basic_setup;
 	}
 
 	if (cpu_has_feature(CPU_FTR_DBELL))
-- 
1.6.0.6
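
The calling convention this patch sets up (optional hooks that are
checked before use, with a default when probe is absent) can be sketched
outside the kernel roughly as below. All demo_* names and DEMO_NR_CPUS
are assumptions made for illustration; this is not the kernel code.

```c
#include <stddef.h>

#define DEMO_NR_CPUS 4	/* assumed value, stands in for NR_CPUS */

/* Hypothetical ops table with two optional hooks. */
struct demo_ops {
	int  (*probe)(void);		/* optional */
	void (*setup_cpu)(int cpu);	/* optional */
};

static int demo_setup_calls;

static int demo_probe_two(void)
{
	return 2;
}

static void demo_setup(int cpu)
{
	(void)cpu;
	demo_setup_calls++;
}

/* Pick the CPU count: 1 without ops, probe() if set, else the maximum. */
static int demo_max_cpus(const struct demo_ops *ops)
{
	if (!ops)
		return 1;
	if (ops->probe)
		return ops->probe();
	return DEMO_NR_CPUS;
}

/* Call the setup hook only when both the table and the hook exist. */
static void demo_setup_cpu(const struct demo_ops *ops, int cpu)
{
	if (ops && ops->setup_cpu)
		ops->setup_cpu(cpu);
}
```

Writing the outer test with an early return, as above, also sidesteps the
nested if/else without braces, which some compilers warn about.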

^ permalink raw reply related

* [0/5] Assorted hugepage cleanups
From: David Gibson @ 2009-09-09  5:55 UTC (permalink / raw)
  To: linuxppc-dev, Benjamin Herrenschmidt

Currently, ordinary pages use one pagetable layout, and each different
hugepage size uses a slightly different variant layout.  A number of
places which need to walk the pagetable must first check the slice map
to see what the pagetable layout is, then handle the various different
forms.  New hardware, like Book3E, is liable to introduce more possible
variants.

This patch series, therefore, is designed to simplify the matter by
limiting knowledge of the pagetable layout to only the allocation
path.  With this patch, ordinary pages are handled as ever, with a
fixed 4 (or 3) level tree.  All other variants branch off from some
layer of that with a specially marked PGD/PUD/PMD pointer which also
contains enough information to interpret the directories below that
point.  This means that things walking the pagetables (without
allocating) don't need to look up the slice map, they can just step
down the tree in the usual way, branching off to the "non-standard
layout" path for hugepages, which uses the embedded information to
interpret the tree from that point on.

This reduces the source size in a number of places, and means that
newer variants on the pagetable layout to handle new hardware and new
features will need to alter the existing code in fewer places.

In addition we split out the hash / classic MMU specific code into a
separate hugetlbpage-hash64.c file.  This will make adding support for
other MMUs (like 440 and/or Book3E) easier.

I've used the libhugetlbfs testsuite to test these patches on a
Power5+ machine, but they could certainly do with more testing. In
particular, I don't have any suitable hardware to test 16G pages.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply

* [1/5] Make hpte_need_flush() correctly mask for multiple page sizes
From: David Gibson @ 2009-09-09  5:59 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, linuxppc-dev
In-Reply-To: <20090909055534.GF7909@yookeroo.seuss>

Currently, hpte_need_flush() only correctly flushes the given address
for normal pages.  Callers for hugepages are required to mask the
address themselves.

But hpte_need_flush() already looks up the page sizes for its own
reasons, so this is a rather silly imposition on the callers.  This
patch alters it to mask based on the pagesize it has looked up itself,
and removes the awkward masking code in the hugepage caller.

Signed-off-by: David Gibson <dwg@au1.ibm.com>

---
 arch/powerpc/mm/hugetlbpage.c |    6 +-----
 arch/powerpc/mm/tlb_hash64.c  |    8 +++-----
 2 files changed, 4 insertions(+), 10 deletions(-)

Index: working-2.6/arch/powerpc/mm/tlb_hash64.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/tlb_hash64.c	2009-09-04 14:35:30.000000000 +1000
+++ working-2.6/arch/powerpc/mm/tlb_hash64.c	2009-09-04 14:36:12.000000000 +1000
@@ -53,11 +53,6 @@ void hpte_need_flush(struct mm_struct *m
 
 	i = batch->index;
 
-	/* We mask the address for the base page size. Huge pages will
-	 * have applied their own masking already
-	 */
-	addr &= PAGE_MASK;
-
 	/* Get page size (maybe move back to caller).
 	 *
 	 * NOTE: when using special 64K mappings in 4K environment like
@@ -75,6 +70,9 @@ void hpte_need_flush(struct mm_struct *m
 	} else
 		psize = pte_pagesize_index(mm, addr, pte);
 
+	/* Mask the address for the correct page size */
+	addr &= ~((1UL << mmu_psize_defs[psize].shift) - 1);
+
 	/* Build full vaddr */
 	if (!is_kernel_addr(addr)) {
 		ssize = user_segment_size(addr);
Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2009-09-04 14:35:30.000000000 +1000
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2009-09-04 14:36:12.000000000 +1000
@@ -445,11 +445,7 @@ void set_huge_pte_at(struct mm_struct *m
 		 * necessary anymore if we make hpte_need_flush() get the
 		 * page size from the slices
 		 */
-		unsigned int psize = get_slice_psize(mm, addr);
-		unsigned int shift = mmu_psize_to_shift(psize);
-		unsigned long sz = ((1UL) << shift);
-		struct hstate *hstate = size_to_hstate(sz);
-		pte_update(mm, addr & hstate->mask, ptep, ~0UL, 1);
+		pte_update(mm, addr, ptep, ~0UL, 1);
 	}
 	*ptep = __pte(pte_val(pte) & ~_PAGE_HPTEFLAGS);
 }
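
The masking added to hpte_need_flush() is the usual "round down to a
power-of-two boundary" idiom. A stand-alone sketch, where
demo_mask_to_page is a hypothetical helper and shift is assumed to be
smaller than the width of unsigned long:

```c
/* Round addr down to the start of the 2^shift byte page containing it. */
static unsigned long demo_mask_to_page(unsigned long addr, unsigned int shift)
{
	return addr & ~((1UL << shift) - 1);
}
```

For example, 0x12345678 masked with shift 12 (4K) gives 0x12345000, and
with shift 24 (16M) gives 0x12000000, so the same expression serves both
base pages and hugepages once the right shift has been looked up.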

^ permalink raw reply

* [2/5] Cleanup management of kmem_caches for pagetables
From: David Gibson @ 2009-09-09  5:59 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, linuxppc-dev
In-Reply-To: <20090909055534.GF7909@yookeroo.seuss>

Currently we have a fair bit of rather fiddly code to manage the
various kmem_caches used to store page tables of various levels.  We
generally have two caches holding some combination of PGD, PUD and PMD
tables, plus several more for the special hugepage pagetables.

This patch cleans this all up by taking a different approach.  Rather
than the caches being designated as for PUDs or for hugeptes for 16M
pages, the caches are simply allocated to be a specific size.  Thus
sharing of caches between different types/levels of pagetables happens
naturally.  The pagetable size, where needed, is passed around encoded
in the same way as {PGD,PUD,PMD}_INDEX_SIZE; that is n where the
pagetable contains 2^n pointers.
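
As a rough user-space illustration of keying the caches by size rather
than by purpose: the registry below is indexed by n, where a table holds
2^n pointers, and creates an entry only the first time a size is
requested. struct demo_cache and the demo_* functions are hypothetical
stand-ins for kmem_cache handling, and DEMO_MAX_SHIFT is an assumed bound.

```c
#include <stddef.h>

#define DEMO_MAX_SHIFT 15	/* assumed upper bound on the index size */

/* Hypothetical stand-in for a kmem_cache: records only the object size. */
struct demo_cache {
	size_t object_size;
	int    in_use;
};

static struct demo_cache demo_caches[DEMO_MAX_SHIFT];

/* Slot for tables holding 2^shift pointers; shift 1 maps to index 0. */
static struct demo_cache *demo_cache_for(unsigned int shift)
{
	if (shift < 1 || shift > DEMO_MAX_SHIFT)
		return NULL;
	return &demo_caches[shift - 1];
}

/*
 * Set up the cache for this size unless one already exists.
 * Returns 1 if newly created, 0 if reused, -1 if shift is out of range.
 */
static int demo_cache_add(unsigned int shift)
{
	struct demo_cache *c = demo_cache_for(shift);

	if (!c)
		return -1;
	if (c->in_use)
		return 0;
	c->object_size = sizeof(void *) << shift;
	c->in_use = 1;
	return 1;
}
```

Two table types with the same index size land in the same slot, which is
how the sharing between levels falls out without any special casing.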

Signed-off-by: David Gibson <dwg@au1.ibm.com>

---
 arch/powerpc/include/asm/pgalloc-64.h    |   43 ++++++++++++-----------------
 arch/powerpc/include/asm/pgalloc.h       |   25 +++--------------
 arch/powerpc/include/asm/pgtable-ppc64.h |    1 
 arch/powerpc/mm/hugetlbpage.c            |   45 ++++++++-----------------------
 arch/powerpc/mm/init_64.c                |   42 ++++++++++++++--------------
 arch/powerpc/mm/pgtable.c                |   25 +++++++++++------
 6 files changed, 73 insertions(+), 108 deletions(-)

Index: working-2.6/arch/powerpc/mm/init_64.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/init_64.c	2009-09-04 14:35:30.000000000 +1000
+++ working-2.6/arch/powerpc/mm/init_64.c	2009-09-04 14:38:20.000000000 +1000
@@ -148,30 +148,30 @@ static void pmd_ctor(void *addr)
 	memset(addr, 0, PMD_TABLE_SIZE);
 }
 
-static const unsigned int pgtable_cache_size[2] = {
-	PGD_TABLE_SIZE, PMD_TABLE_SIZE
-};
-static const char *pgtable_cache_name[ARRAY_SIZE(pgtable_cache_size)] = {
-#ifdef CONFIG_PPC_64K_PAGES
-	"pgd_cache", "pmd_cache",
-#else
-	"pgd_cache", "pud_pmd_cache",
-#endif /* CONFIG_PPC_64K_PAGES */
-};
-
-#ifdef CONFIG_HUGETLB_PAGE
-/* Hugepages need an extra cache per hugepagesize, initialized in
- * hugetlbpage.c.  We can't put into the tables above, because HPAGE_SHIFT
- * is not compile time constant. */
-struct kmem_cache *pgtable_cache[ARRAY_SIZE(pgtable_cache_size)+MMU_PAGE_COUNT];
-#else
-struct kmem_cache *pgtable_cache[ARRAY_SIZE(pgtable_cache_size)];
-#endif
+struct kmem_cache *pgtable_cache[PGF_SHIFT_MASK];
+
+void pgtable_cache_add(unsigned shift, void (*ctor)(void *))
+{
+	char *name;
+	unsigned long table_size = sizeof(void *) << shift;
+	struct kmem_cache *new;
+
+	BUG_ON((shift < 1) || (shift > PGF_SHIFT_MASK));
+	if (PGT_CACHE(shift))
+		return; /* Already have a cache of this size */
+	name = kasprintf(GFP_KERNEL, "pgtable-2^%d", shift);
+	new = kmem_cache_create(name, table_size, table_size, 0, ctor);
+	PGT_CACHE(shift) = new;
+}
+
 
 void pgtable_cache_init(void)
 {
-	pgtable_cache[0] = kmem_cache_create(pgtable_cache_name[0], PGD_TABLE_SIZE, PGD_TABLE_SIZE, SLAB_PANIC, pgd_ctor);
-	pgtable_cache[1] = kmem_cache_create(pgtable_cache_name[1], PMD_TABLE_SIZE, PMD_TABLE_SIZE, SLAB_PANIC, pmd_ctor);
+	pgtable_cache_add(PGD_INDEX_SIZE, pgd_ctor);
+	pgtable_cache_add(PMD_INDEX_SIZE, pmd_ctor);
+	if (!PGT_CACHE(PGD_INDEX_SIZE) || !PGT_CACHE(PMD_INDEX_SIZE))
+		panic("Couldn't allocate pgtable caches");
+	BUG_ON(!PGT_CACHE(PUD_INDEX_SIZE));
 }
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
Index: working-2.6/arch/powerpc/include/asm/pgalloc-64.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgalloc-64.h	2009-09-04 14:35:30.000000000 +1000
+++ working-2.6/arch/powerpc/include/asm/pgalloc-64.h	2009-09-04 14:38:20.000000000 +1000
@@ -16,22 +16,17 @@ static inline void subpage_prot_free(pgd
 #endif
 
 extern struct kmem_cache *pgtable_cache[];
-
-#define PGD_CACHE_NUM		0
-#define PUD_CACHE_NUM		1
-#define PMD_CACHE_NUM		1
-#define HUGEPTE_CACHE_NUM	2
-#define PTE_NONCACHE_NUM	7  /* from GFP rather than kmem_cache */
+#define PGT_CACHE(shift) (pgtable_cache[(shift)-1])
 
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	return kmem_cache_alloc(pgtable_cache[PGD_CACHE_NUM], GFP_KERNEL);
+	return kmem_cache_alloc(PGT_CACHE(PGD_INDEX_SIZE), GFP_KERNEL);
 }
 
 static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
 	subpage_prot_free(pgd);
-	kmem_cache_free(pgtable_cache[PGD_CACHE_NUM], pgd);
+	kmem_cache_free(PGT_CACHE(PGD_INDEX_SIZE), pgd);
 }
 
 #ifndef CONFIG_PPC_64K_PAGES
@@ -40,13 +35,13 @@ static inline void pgd_free(struct mm_st
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return kmem_cache_alloc(pgtable_cache[PUD_CACHE_NUM],
+	return kmem_cache_alloc(PGT_CACHE(PUD_INDEX_SIZE),
 				GFP_KERNEL|__GFP_REPEAT);
 }
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
 {
-	kmem_cache_free(pgtable_cache[PUD_CACHE_NUM], pud);
+	kmem_cache_free(PGT_CACHE(PUD_INDEX_SIZE), pud);
 }
 
 static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
@@ -78,13 +73,13 @@ static inline void pmd_populate_kernel(s
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return kmem_cache_alloc(pgtable_cache[PMD_CACHE_NUM],
+	return kmem_cache_alloc(PGT_CACHE(PMD_INDEX_SIZE),
 				GFP_KERNEL|__GFP_REPEAT);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 {
-	kmem_cache_free(pgtable_cache[PMD_CACHE_NUM], pmd);
+	kmem_cache_free(PGT_CACHE(PMD_INDEX_SIZE), pmd);
 }
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
@@ -107,24 +102,22 @@ static inline pgtable_t pte_alloc_one(st
 	return page;
 }
 
-static inline void pgtable_free(pgtable_free_t pgf)
+static inline void pgtable_free(void *table, unsigned index_size)
 {
-	void *p = (void *)(pgf.val & ~PGF_CACHENUM_MASK);
-	int cachenum = pgf.val & PGF_CACHENUM_MASK;
-
-	if (cachenum == PTE_NONCACHE_NUM)
-		free_page((unsigned long)p);
-	else
-		kmem_cache_free(pgtable_cache[cachenum], p);
+	if (!index_size)
+		free_page((unsigned long)table);
+	else {
+		BUG_ON(index_size > PGF_SHIFT_MASK);
+		kmem_cache_free(PGT_CACHE(index_size), table);
+	}
 }
 
-#define __pmd_free_tlb(tlb, pmd,addr)		      \
-	pgtable_free_tlb(tlb, pgtable_free_cache(pmd, \
-		PMD_CACHE_NUM, PMD_TABLE_SIZE-1))
+#define __pmd_free_tlb(tlb, pmd, addr)		      \
+	pgtable_free_tlb(tlb, pmd, PMD_INDEX_SIZE)
 #ifndef CONFIG_PPC_64K_PAGES
 #define __pud_free_tlb(tlb, pud, addr)		      \
-	pgtable_free_tlb(tlb, pgtable_free_cache(pud, \
-		PUD_CACHE_NUM, PUD_TABLE_SIZE-1))
+	pgtable_free_tlb(tlb, pud, PUD_INDEX_SIZE)
+
 #endif /* CONFIG_PPC_64K_PAGES */
 
 #define check_pgt_cache()	do { } while (0)
Index: working-2.6/arch/powerpc/include/asm/pgalloc.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgalloc.h	2009-09-04 14:35:30.000000000 +1000
+++ working-2.6/arch/powerpc/include/asm/pgalloc.h	2009-09-04 14:38:20.000000000 +1000
@@ -24,24 +24,12 @@ static inline void pte_free(struct mm_st
 	__free_page(ptepage);
 }
 
-typedef struct pgtable_free {
-	unsigned long val;
-} pgtable_free_t;
-
 /* This needs to be big enough to allow for MMU_PAGE_COUNT + 2 to be stored
  * and small enough to fit in the low bits of any naturally aligned page
  * table cache entry. Arbitrarily set to 0x1f, that should give us some
  * room to grow
  */
-#define PGF_CACHENUM_MASK	0x1f
-
-static inline pgtable_free_t pgtable_free_cache(void *p, int cachenum,
-						unsigned long mask)
-{
-	BUG_ON(cachenum > PGF_CACHENUM_MASK);
-
-	return (pgtable_free_t){.val = ((unsigned long) p & ~mask) | cachenum};
-}
+#define PGF_SHIFT_MASK		0xf
 
 #ifdef CONFIG_PPC64
 #include <asm/pgalloc-64.h>
@@ -50,12 +38,12 @@ static inline pgtable_free_t pgtable_fre
 #endif
 
 #ifdef CONFIG_SMP
-extern void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf);
+extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift);
 extern void pte_free_finish(void);
 #else /* CONFIG_SMP */
-static inline void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf)
+static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
-	pgtable_free(pgf);
+	pgtable_free(table, shift);
 }
 static inline void pte_free_finish(void) { }
 #endif /* !CONFIG_SMP */
@@ -63,12 +51,9 @@ static inline void pte_free_finish(void)
 static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
 				  unsigned long address)
 {
-	pgtable_free_t pgf = pgtable_free_cache(page_address(ptepage),
-						PTE_NONCACHE_NUM,
-						PTE_TABLE_SIZE-1);
 	tlb_flush_pgtable(tlb, address);
 	pgtable_page_dtor(ptepage);
-	pgtable_free_tlb(tlb, pgf);
+	pgtable_free_tlb(tlb, page_address(ptepage), 0);
 }
 
 #endif /* __KERNEL__ */
Index: working-2.6/arch/powerpc/mm/pgtable.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/pgtable.c	2009-09-04 14:35:30.000000000 +1000
+++ working-2.6/arch/powerpc/mm/pgtable.c	2009-09-04 14:38:20.000000000 +1000
@@ -47,12 +47,12 @@ struct pte_freelist_batch
 {
 	struct rcu_head	rcu;
 	unsigned int	index;
-	pgtable_free_t	tables[0];
+	unsigned long	tables[0];
 };
 
 #define PTE_FREELIST_SIZE \
 	((PAGE_SIZE - sizeof(struct pte_freelist_batch)) \
-	  / sizeof(pgtable_free_t))
+	  / sizeof(unsigned long))
 
 static void pte_free_smp_sync(void *arg)
 {
@@ -62,13 +62,13 @@ static void pte_free_smp_sync(void *arg)
 /* This is only called when we are critically out of memory
  * (and fail to get a page in pte_free_tlb).
  */
-static void pgtable_free_now(pgtable_free_t pgf)
+static void pgtable_free_now(void *table, unsigned shift)
 {
 	pte_freelist_forced_free++;
 
 	smp_call_function(pte_free_smp_sync, NULL, 1);
 
-	pgtable_free(pgf);
+	pgtable_free(table, shift);
 }
 
 static void pte_free_rcu_callback(struct rcu_head *head)
@@ -77,8 +77,12 @@ static void pte_free_rcu_callback(struct
 		container_of(head, struct pte_freelist_batch, rcu);
 	unsigned int i;
 
-	for (i = 0; i < batch->index; i++)
-		pgtable_free(batch->tables[i]);
+	for (i = 0; i < batch->index; i++) {
+		void *table = (void *)(batch->tables[i] & ~PGF_SHIFT_MASK);
+		unsigned shift = batch->tables[i] & PGF_SHIFT_MASK;
+
+		pgtable_free(table, shift);
+	}
 
 	free_page((unsigned long)batch);
 }
@@ -89,25 +93,28 @@ static void pte_free_submit(struct pte_f
 	call_rcu(&batch->rcu, pte_free_rcu_callback);
 }
 
-void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf)
+void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
 	/* This is safe since tlb_gather_mmu has disabled preemption */
 	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
+	unsigned long pgf;
 
 	if (atomic_read(&tlb->mm->mm_users) < 2 ||
 	    cpumask_equal(mm_cpumask(tlb->mm), cpumask_of(smp_processor_id()))){
-		pgtable_free(pgf);
+		pgtable_free(table, shift);
 		return;
 	}
 
 	if (*batchp == NULL) {
 		*batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC);
 		if (*batchp == NULL) {
-			pgtable_free_now(pgf);
+			pgtable_free_now(table, shift);
 			return;
 		}
 		(*batchp)->index = 0;
 	}
+	BUG_ON(shift > PGF_SHIFT_MASK);
+	pgf = (unsigned long)table | shift;
 	(*batchp)->tables[(*batchp)->index++] = pgf;
 	if ((*batchp)->index == PTE_FREELIST_SIZE) {
 		pte_free_submit(*batchp);
Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2009-09-04 14:36:12.000000000 +1000
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2009-09-04 14:38:20.000000000 +1000
@@ -43,26 +43,14 @@ static unsigned nr_gpages;
 unsigned int mmu_huge_psizes[MMU_PAGE_COUNT] = { }; /* initialize all to 0 */
 
 #define hugepte_shift			mmu_huge_psizes
-#define PTRS_PER_HUGEPTE(psize)		(1 << hugepte_shift[psize])
-#define HUGEPTE_TABLE_SIZE(psize)	(sizeof(pte_t) << hugepte_shift[psize])
+#define HUGEPTE_INDEX_SIZE(psize)	(mmu_huge_psizes[(psize)])
+#define PTRS_PER_HUGEPTE(psize)		(1 << mmu_huge_psizes[psize])
 
 #define HUGEPD_SHIFT(psize)		(mmu_psize_to_shift(psize) \
-						+ hugepte_shift[psize])
+					 + HUGEPTE_INDEX_SIZE(psize))
 #define HUGEPD_SIZE(psize)		(1UL << HUGEPD_SHIFT(psize))
 #define HUGEPD_MASK(psize)		(~(HUGEPD_SIZE(psize)-1))
 
-/* Subtract one from array size because we don't need a cache for 4K since
- * is not a huge page size */
-#define HUGE_PGTABLE_INDEX(psize)	(HUGEPTE_CACHE_NUM + psize - 1)
-#define HUGEPTE_CACHE_NAME(psize)	(huge_pgtable_cache_name[psize])
-
-static const char *huge_pgtable_cache_name[MMU_PAGE_COUNT] = {
-	[MMU_PAGE_64K]	= "hugepte_cache_64K",
-	[MMU_PAGE_1M]	= "hugepte_cache_1M",
-	[MMU_PAGE_16M]	= "hugepte_cache_16M",
-	[MMU_PAGE_16G]	= "hugepte_cache_16G",
-};
-
 /* Flag to mark huge PD pointers.  This means pmd_bad() and pud_bad()
  * will choke on pointers to hugepte tables, which is handy for
  * catching screwups early. */
@@ -114,15 +102,15 @@ static inline pte_t *hugepte_offset(huge
 static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 			   unsigned long address, unsigned int psize)
 {
-	pte_t *new = kmem_cache_zalloc(pgtable_cache[HUGE_PGTABLE_INDEX(psize)],
-				      GFP_KERNEL|__GFP_REPEAT);
+	pte_t *new = kmem_cache_zalloc(PGT_CACHE(hugepte_shift[psize]),
+				       GFP_KERNEL|__GFP_REPEAT);
 
 	if (! new)
 		return -ENOMEM;
 
 	spin_lock(&mm->page_table_lock);
 	if (!hugepd_none(*hpdp))
-		kmem_cache_free(pgtable_cache[HUGE_PGTABLE_INDEX(psize)], new);
+		kmem_cache_free(PGT_CACHE(hugepte_shift[psize]), new);
 	else
 		hpdp->pd = (unsigned long)new | HUGEPD_OK;
 	spin_unlock(&mm->page_table_lock);
@@ -271,9 +259,7 @@ static void free_hugepte_range(struct mm
 
 	hpdp->pd = 0;
 	tlb->need_flush = 1;
-	pgtable_free_tlb(tlb, pgtable_free_cache(hugepte,
-						 HUGEPTE_CACHE_NUM+psize-1,
-						 PGF_CACHENUM_MASK));
+	pgtable_free_tlb(tlb, hugepte, hugepte_shift[psize]);
 }
 
 static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
@@ -698,8 +684,6 @@ static void __init set_huge_psize(int ps
 		if (mmu_huge_psizes[psize] ||
 		   mmu_psize_defs[psize].shift == PAGE_SHIFT)
 			return;
-		if (WARN_ON(HUGEPTE_CACHE_NAME(psize) == NULL))
-			return;
 		hugetlb_add_hstate(mmu_psize_defs[psize].shift - PAGE_SHIFT);
 
 		switch (mmu_psize_defs[psize].shift) {
@@ -769,16 +753,11 @@ static int __init hugetlbpage_init(void)
 
 	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
 		if (mmu_huge_psizes[psize]) {
-			pgtable_cache[HUGE_PGTABLE_INDEX(psize)] =
-				kmem_cache_create(
-					HUGEPTE_CACHE_NAME(psize),
-					HUGEPTE_TABLE_SIZE(psize),
-					HUGEPTE_TABLE_SIZE(psize),
-					0,
-					NULL);
-			if (!pgtable_cache[HUGE_PGTABLE_INDEX(psize)])
-				panic("hugetlbpage_init(): could not create %s"\
-				      "\n", HUGEPTE_CACHE_NAME(psize));
+			pgtable_cache_add(hugepte_shift[psize], NULL);
+			if (!PGT_CACHE(hugepte_shift[psize]))
+				panic("hugetlbpage_init(): could not create "
+				      "pgtable cache for %d bit pagesize\n",
+				      mmu_psize_to_shift(psize));
 		}
 	}
 
Index: working-2.6/arch/powerpc/include/asm/pgtable-ppc64.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgtable-ppc64.h	2009-09-04 14:35:30.000000000 +1000
+++ working-2.6/arch/powerpc/include/asm/pgtable-ppc64.h	2009-09-04 14:38:20.000000000 +1000
@@ -354,6 +354,7 @@ static inline void __ptep_set_access_fla
 #define pgoff_to_pte(off)	((pte_t) {((off) << PTE_RPN_SHIFT)|_PAGE_FILE})
 #define PTE_FILE_MAX_BITS	(BITS_PER_LONG - PTE_RPN_SHIFT)
 
+void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
 void pgtable_cache_init(void);
 
 /*
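
The batch entries rely on the general technique of storing a small value
in the low bits of an aligned pointer. A self-contained sketch of a
matching pack/unpack pair is below; the demo_* names and DEMO_SHIFT_MASK
are assumptions for illustration. The point to check in any such scheme
is that the decode side recovers exactly what the encode side stored,
including a shift of 0.

```c
#include <stdint.h>

#define DEMO_SHIFT_MASK 0xfUL	/* low bits reserved for the index size */

/*
 * Pack an aligned table pointer and its index size into one word.
 * Returns 0 if the shift does not fit in the mask or the pointer
 * already has low bits set (not aligned enough to carry the tag).
 */
static uintptr_t demo_pack(void *table, unsigned int shift)
{
	uintptr_t p = (uintptr_t)table;

	if (shift > DEMO_SHIFT_MASK || (p & DEMO_SHIFT_MASK))
		return 0;
	return p | shift;
}

/* Recover the pointer by clearing the tag bits. */
static void *demo_unpack_table(uintptr_t packed)
{
	return (void *)(packed & ~(uintptr_t)DEMO_SHIFT_MASK);
}

/* Recover the index size from the tag bits. */
static unsigned int demo_unpack_shift(uintptr_t packed)
{
	return (unsigned int)(packed & DEMO_SHIFT_MASK);
}
```

Storing the shift unmodified keeps 0 usable as the "plain page, not from
a cache" marker on both sides of the round trip.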

^ permalink raw reply

* [4/5] Cleanup initialization of hugepages on powerpc
From: David Gibson @ 2009-09-09  5:59 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, linuxppc-dev
In-Reply-To: <20090909055534.GF7909@yookeroo.seuss>

This patch simplifies the logic used to initialize hugepages on
powerpc.  The somewhat oddly named set_huge_psize() is renamed to
add_huge_page_size() and now does all necessary verification of
whether it's given a valid hugepage size (instead of just some) and
instantiates the generic hstate structure (but no more).

hugetlbpage_init() now steps through the available pagesizes, checks
if they're valid for hugepages by calling add_huge_page_size() and
initializes the kmem_caches for the hugepage pagetables.  This means
we can now eliminate the mmu_huge_psizes array, since we no longer
need to pass the sizing information for the pagetable caches from
set_huge_psize() into hugetlbpage_init().

Signed-off-by: David Gibson <dwg@au1.ibm.com>

---
 arch/powerpc/mm/hugetlbpage.c |  106 +++++++++++++++++++-----------------------
 1 file changed, 49 insertions(+), 57 deletions(-)

Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2009-09-09 15:15:12.000000000 +1000
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2009-09-09 15:22:49.000000000 +1000
@@ -37,11 +37,6 @@
 static unsigned long gpage_freearray[MAX_NUMBER_GPAGES];
 static unsigned nr_gpages;
 
-/* Array of valid huge page sizes - non-zero value(hugepte_shift) is
- * stored for the huge page sizes that are valid.
- */
-static unsigned int mmu_huge_psizes[MMU_PAGE_COUNT] = { }; /* initialize all to 0 */
-
 /* Flag to mark huge PD pointers.  This means pmd_bad() and pud_bad()
  * will choke on pointers to hugepte tables, which is handy for
  * catching screwups early. */
@@ -502,8 +497,6 @@ unsigned long hugetlb_get_unmapped_area(
 	struct hstate *hstate = hstate_file(file);
 	int mmu_psize = shift_to_mmu_psize(huge_page_shift(hstate));
 
-	if (!mmu_huge_psizes[mmu_psize])
-		return -EINVAL;
 	return slice_get_unmapped_area(addr, len, flags, mmu_psize, 1, 0);
 }
 
@@ -666,47 +659,46 @@ repeat:
 	return err;
 }
 
-static void __init set_huge_psize(int psize)
+static int __init add_huge_page_size(unsigned long long size)
 {
-	unsigned pdshift;
+	int shift = __ffs(size);
+	int mmu_psize;
 
 	/* Check that it is a page size supported by the hardware and
-	 * that it fits within pagetable limits. */
-	if (mmu_psize_defs[psize].shift &&
-		mmu_psize_defs[psize].shift < SID_SHIFT_1T &&
-		(mmu_psize_defs[psize].shift > MIN_HUGEPTE_SHIFT ||
-		 mmu_psize_defs[psize].shift == PAGE_SHIFT_64K ||
-		 mmu_psize_defs[psize].shift == PAGE_SHIFT_16G)) {
-		/* Return if huge page size has already been setup or is the
-		 * same as the base page size. */
-		if (mmu_huge_psizes[psize] ||
-		   mmu_psize_defs[psize].shift == PAGE_SHIFT)
-			return;
-		hugetlb_add_hstate(mmu_psize_defs[psize].shift - PAGE_SHIFT);
+	 * that it fits within pagetable and slice limits. */
+	if (!is_power_of_2(size)
+	    || (shift > SLICE_HIGH_SHIFT) || (shift <= PAGE_SHIFT))
+		return -EINVAL;
 
-		if (mmu_psize_defs[psize].shift < PMD_SHIFT)
-			pdshift = PMD_SHIFT;
-		else if (mmu_psize_defs[psize].shift < PUD_SHIFT)
-			pdshift = PUD_SHIFT;
-		else
-			pdshift = PGDIR_SHIFT;
-		mmu_huge_psizes[psize] = pdshift - mmu_psize_defs[psize].shift;
-	}
+	if ((mmu_psize = shift_to_mmu_psize(shift)) < 0)
+		return -EINVAL;
+
+#ifdef CONFIG_SPU_FS_64K_LS
+	/* Disable support for 64K huge pages when 64K SPU local store
+	 * support is enabled as the current implementation conflicts.
+	 */
+	if (size == PAGE_SIZE_64K)
+		return -EINVAL;
+#endif /* CONFIG_SPU_FS_64K_LS */
+
+	BUG_ON(mmu_psize_defs[mmu_psize].shift != shift);
+
+	/* Return if huge page size has already been setup */
+	if (size_to_hstate(size))
+		return 0;
+
+	hugetlb_add_hstate(shift - PAGE_SHIFT);
+
+	return 0;
 }
 
 static int __init hugepage_setup_sz(char *str)
 {
 	unsigned long long size;
-	int mmu_psize;
-	int shift;
 
 	size = memparse(str, &str);
 
-	shift = __ffs(size);
-	mmu_psize = shift_to_mmu_psize(shift);
-	if (mmu_psize >= 0 && mmu_psize_defs[mmu_psize].shift)
-		set_huge_psize(mmu_psize);
-	else
+	if (add_huge_page_size(size) != 0)
 		printk(KERN_WARNING "Invalid huge page size specified(%llu)\n", size);
 
 	return 1;
@@ -720,31 +712,31 @@ static int __init hugetlbpage_init(void)
 	if (!cpu_has_feature(CPU_FTR_16M_PAGE))
 		return -ENODEV;
 
-	/* Add supported huge page sizes.  Need to change HUGE_MAX_HSTATE
-	 * and adjust PTE_NONCACHE_NUM if the number of supported huge page
-	 * sizes changes.
-	 */
-	set_huge_psize(MMU_PAGE_16M);
-	set_huge_psize(MMU_PAGE_16G);
+	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
+		unsigned shift;
+		unsigned pdshift;
 
-	/* Temporarily disable support for 64K huge pages when 64K SPU local
-	 * store support is enabled as the current implementation conflicts.
-	 */
-#ifndef CONFIG_SPU_FS_64K_LS
-	set_huge_psize(MMU_PAGE_64K);
-#endif
+		if (!mmu_psize_defs[psize].shift)
+			continue;
 
-	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
-		if (mmu_huge_psizes[psize]) {
-			pgtable_cache_add(mmu_huge_psizes[psize], NULL);
-			if (!PGT_CACHE(mmu_huge_psizes[psize]))
-				panic("hugetlbpage_init(): could not create "
-				      "pgtable cache for %d bit pagesize\n",
-				      mmu_psize_to_shift(psize));
-		}
+		shift = mmu_psize_to_shift(psize);
+
+		if (add_huge_page_size(1ULL << shift) < 0)
+			continue;
+
+		if (shift < PMD_SHIFT)
+			pdshift = PMD_SHIFT;
+		else if (shift < PUD_SHIFT)
+			pdshift = PUD_SHIFT;
+		else
+			pdshift = PGDIR_SHIFT;
+
+		pgtable_cache_add(pdshift - shift, NULL);
+		if (!PGT_CACHE(pdshift - shift))
+			panic("hugetlbpage_init(): could not create "
+			      "pgtable cache for %d bit pagesize\n", shift);
 	}
 
 	return 0;
 }
-
 module_init(hugetlbpage_init);
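
The two decisions made per page size here (is the size acceptable, and
how large is the hugepage table hanging off the chosen directory level)
can be sketched in user space as follows. The DEMO_* shifts are assumed,
illustrative values for a 4K base page configuration, and the demo_*
functions are hypothetical; the hardware and slice checks of the real
code are left out.

```c
/* Assumed, illustrative shifts for a 4K base page configuration. */
#define DEMO_PAGE_SHIFT		12
#define DEMO_PMD_SHIFT		21
#define DEMO_PUD_SHIFT		28
#define DEMO_PGDIR_SHIFT	35

/*
 * Returns log2(size) if size is a power of two, larger than the base
 * page and smaller than what one top-level entry covers; -1 otherwise.
 */
static int demo_huge_shift(unsigned long long size)
{
	int shift = 0;

	if (size == 0 || (size & (size - 1)) != 0)
		return -1;
	while ((size >> shift) != 1ULL)
		shift++;
	if (shift <= DEMO_PAGE_SHIFT || shift >= DEMO_PGDIR_SHIFT)
		return -1;
	return shift;
}

/*
 * Index size (log2 of the entry count) of the hugepage table: the
 * difference between the span of the directory entry it hangs off
 * and the span of one hugepage.
 */
static int demo_hugepd_index_size(int shift)
{
	int pdshift;

	if (shift < DEMO_PMD_SHIFT)
		pdshift = DEMO_PMD_SHIFT;
	else if (shift < DEMO_PUD_SHIFT)
		pdshift = DEMO_PUD_SHIFT;
	else
		pdshift = DEMO_PGDIR_SHIFT;
	return pdshift - shift;
}
```

With these assumed shifts, a 16M page (shift 24) hangs off a PUD-level
entry and needs a table of 2^4 entries, while a 16G page (shift 34) hangs
off the top level with 2^1 entries.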

^ permalink raw reply

* [3/5] Allow more flexible layouts for hugepage pagetables
From: David Gibson @ 2009-09-09  5:59 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, linuxppc-dev
In-Reply-To: <20090909055534.GF7909@yookeroo.seuss>

Currently each available hugepage size uses a slightly different
pagetable layout: that is, the bottom level table of pointers to
hugepages is a different size, and may branch off from the normal page
tables at a different level.  Every hugepage aware path that needs to
walk the pagetables must therefore look up the hugepage size from the
slice info first, and work out the correct way to walk the pagetables
accordingly.  Future hardware is likely to add more possible hugepage
sizes, more layout options and more mess.

This patch therefore reworks the handling of hugepage pagetables to
reduce this complexity.  In the new scheme, instead of having to
consult the slice mask, pagetable walking code can check a flag in the
PGD/PUD/PMD entries to see where to branch off to hugepage pagetables,
and the entry also contains the information (eseentially hugepage
shift) necessary to then interpret that table without recourse to the
slice mask.  This scheme can be extended neatly to handle multiple
levels of self-describing "special" hugepage pagetables, although for
now we assume only one level exists.
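
As a rough illustration of what "interpreting the table from the entry
alone" means, here is a minimal userspace sketch (not kernel code) of
the index calculation that hugepte_offset() performs in the patch
below.  The ex_ names are hypothetical and exist only for this example;
the 0x3f mask mirrors HUGEPD_SHIFT_MASK from the patch, and the entry
is modelled as a plain 64-bit integer.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors HUGEPD_SHIFT_MASK in the patch: low 6 bits carry the hugepage shift. */
#define EX_HUGEPD_SHIFT_MASK 0x3fULL

/* Extract the hugepage shift stored in the low bits of a directory entry. */
static unsigned ex_hugepd_shift(uint64_t pd)
{
	return (unsigned)(pd & EX_HUGEPD_SHIFT_MASK);
}

/*
 * Index into the hugepte table for address addr, given the shift of the
 * directory level (pdshift) at which the walk branched off.  Only the
 * entry itself and pdshift are needed; no slice lookup is involved.
 */
static uint64_t ex_hugepte_index(uint64_t pd, uint64_t addr, unsigned pdshift)
{
	unsigned shift = ex_hugepd_shift(pd);

	assert(shift <= pdshift && pdshift < 64);
	return (addr & ((1ULL << pdshift) - 1)) >> shift;
}
```

For example, with an entry carrying shift 24 (16M pages) hanging off a
level whose shift is 28, the table has 16 slots and address 0x13000000
lands in slot 3.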

This approach means that only the pagetable allocation path needs to
know how the pagetables should be set out.  All other (hugepage)
pagetable walking paths can just interpret the structure as they go.
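
The layout decision that stays in the allocation path can be sketched
as below.  This is a standalone approximation of the level selection
done in huge_pte_alloc() and set_huge_psize() in the patch; the ex_
names and the three shift constants are assumptions chosen for
illustration (they are meant to resemble a 4K base page configuration
and should not be read as the values of any particular kernel build).

```c
#include <assert.h>

/* Illustrative directory-level shifts; assumed values, not taken from the patch. */
enum { EX_PMD_SHIFT = 21, EX_PUD_SHIFT = 28, EX_PGDIR_SHIFT = 35 };

/*
 * Pick the lowest directory level whose entries each cover more address
 * space than one hugepage of the given shift.
 */
static unsigned ex_hugepd_level_shift(unsigned pshift)
{
	assert(pshift < EX_PGDIR_SHIFT);
	if (pshift < EX_PMD_SHIFT)
		return EX_PMD_SHIFT;
	if (pshift < EX_PUD_SHIFT)
		return EX_PUD_SHIFT;
	return EX_PGDIR_SHIFT;
}

/* Number of hugepte slots in the table hanging off that level. */
static unsigned long ex_hugepte_table_entries(unsigned pshift)
{
	return 1UL << (ex_hugepd_level_shift(pshift) - pshift);
}
```

Under these assumed constants, 64K pages (shift 16) branch off at the
PMD level, 16M pages (shift 24) at the PUD level and 16G pages
(shift 34) at the PGD level.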

There already was a flag bit in PGD/PUD/PMD entries for hugepage
directory pointers, but it was only used for debug.  We alter that
flag bit to instead be a 0 in the MSB to indicate a hugepage pagetable
pointer (normally it would be 1 since the pointer lies in the linear
mapping).  This means that asm pagetable walking can test for (and
punt on) hugepage pointers with the same test that checks for
unpopulated page directory entries (beq becomes bge), since hugepage
pointers will always be positive, and normal pointers always negative.
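
To make the sign trick concrete, the following userspace sketch packs
an entry the way __hugepte_alloc() does in the patch (clear the MSB, OR
in the shift) and then classifies raw entries with a single signed
comparison.  The ex_ names are hypothetical, and the example assumes a
two's complement 64-bit target, as on ppc64.

```c
#include <assert.h>
#include <stdint.h>

#define EX_HUGEPD_SHIFT_MASK 0x3fULL
#define EX_MSB               0x8000000000000000ULL

enum ex_entry_kind { EX_ENTRY_NONE, EX_ENTRY_HUGEPD, EX_ENTRY_NORMAL };

/* Build a hugepd entry from a linear-mapping table address and a shift. */
static uint64_t ex_make_hugepd(uint64_t table, unsigned shift)
{
	assert(shift <= EX_HUGEPD_SHIFT_MASK);
	assert((table & EX_HUGEPD_SHIFT_MASK) == 0);	/* table must be aligned */
	return (table & ~EX_MSB) | shift;
}

/*
 * Classify a raw directory entry by sign: zero is unpopulated, positive
 * is a hugepd pointer (MSB clear), negative is a normal linear-mapping
 * pointer (MSB set).  A walker that wants to punt on both "none" and
 * hugepd entries only has to test for "not negative".
 */
static enum ex_entry_kind ex_classify(uint64_t raw)
{
	int64_t v = (int64_t)raw;	/* relies on two's complement conversion */

	if (v == 0)
		return EX_ENTRY_NONE;
	if (v > 0)
		return EX_ENTRY_HUGEPD;
	return EX_ENTRY_NORMAL;
}
```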

While we're at it, we get rid of the confusing (and grep-defeating)
#defining of hugepte_shift to be the same thing as mmu_huge_psizes.

Signed-off-by: David Gibson <dwg@au1.ibm.com>

---
 arch/powerpc/include/asm/hugetlb.h       |   12 
 arch/powerpc/include/asm/mmu-hash64.h    |   14 
 arch/powerpc/include/asm/pgtable-ppc64.h |   13 
 arch/powerpc/kernel/perf_callchain.c     |   20 -
 arch/powerpc/mm/gup.c                    |  149 +--------
 arch/powerpc/mm/hash_utils_64.c          |   17 -
 arch/powerpc/mm/hugetlbpage.c            |  473 ++++++++++++++-----------------
 arch/powerpc/mm/init_64.c                |   11 
 8 files changed, 304 insertions(+), 405 deletions(-)

Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2009-09-08 11:44:17.000000000 +1000
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2009-09-08 17:13:33.000000000 +1000
@@ -40,25 +40,11 @@ static unsigned nr_gpages;
 /* Array of valid huge page sizes - non-zero value(hugepte_shift) is
  * stored for the huge page sizes that are valid.
  */
-unsigned int mmu_huge_psizes[MMU_PAGE_COUNT] = { }; /* initialize all to 0 */
-
-#define hugepte_shift			mmu_huge_psizes
-#define HUGEPTE_INDEX_SIZE(psize)	(mmu_huge_psizes[(psize)])
-#define PTRS_PER_HUGEPTE(psize)		(1 << mmu_huge_psizes[psize])
-
-#define HUGEPD_SHIFT(psize)		(mmu_psize_to_shift(psize) \
-					 + HUGEPTE_INDEX_SIZE(psize))
-#define HUGEPD_SIZE(psize)		(1UL << HUGEPD_SHIFT(psize))
-#define HUGEPD_MASK(psize)		(~(HUGEPD_SIZE(psize)-1))
+static unsigned int mmu_huge_psizes[MMU_PAGE_COUNT] = { }; /* initialize all to 0 */
 
 /* Flag to mark huge PD pointers.  This means pmd_bad() and pud_bad()
  * will choke on pointers to hugepte tables, which is handy for
  * catching screwups early. */
-#define HUGEPD_OK	0x1
-
-typedef struct { unsigned long pd; } hugepd_t;
-
-#define hugepd_none(hpd)	((hpd).pd == 0)
 
 static inline int shift_to_mmu_psize(unsigned int shift)
 {
@@ -82,71 +68,126 @@ static inline unsigned int mmu_psize_to_
 	BUG();
 }
 
+#define hugepd_none(hpd)	((hpd).pd == 0)
+
 static inline pte_t *hugepd_page(hugepd_t hpd)
 {
-	BUG_ON(!(hpd.pd & HUGEPD_OK));
-	return (pte_t *)(hpd.pd & ~HUGEPD_OK);
+	BUG_ON(!hugepd_ok(hpd));
+	return (pte_t *)((hpd.pd & ~HUGEPD_SHIFT_MASK) | 0xc000000000000000);
 }
 
-static inline pte_t *hugepte_offset(hugepd_t *hpdp, unsigned long addr,
-				    struct hstate *hstate)
+static inline unsigned int hugepd_shift(hugepd_t hpd)
 {
-	unsigned int shift = huge_page_shift(hstate);
-	int psize = shift_to_mmu_psize(shift);
-	unsigned long idx = ((addr >> shift) & (PTRS_PER_HUGEPTE(psize)-1));
+	return hpd.pd & HUGEPD_SHIFT_MASK;
+}
+
+static inline pte_t *hugepte_offset(hugepd_t *hpdp, unsigned long addr, unsigned pdshift)
+{
+	unsigned long idx = (addr & ((1UL << pdshift) - 1)) >> hugepd_shift(*hpdp);
 	pte_t *dir = hugepd_page(*hpdp);
 
 	return dir + idx;
 }
 
+pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, unsigned *shift)
+{
+	pgd_t *pg;
+	pud_t *pu;
+	pmd_t *pm;
+	hugepd_t *hpdp = NULL;
+	unsigned pdshift = PGDIR_SHIFT;
+
+	if (shift)
+		*shift = 0;
+
+	pg = pgdir + pgd_index(ea);
+	if (is_hugepd(pg)) {
+		hpdp = (hugepd_t *)pg;
+	} else if (!pgd_none(*pg)) {
+		pdshift = PUD_SHIFT;
+		pu = pud_offset(pg, ea);
+		if (is_hugepd(pu))
+			hpdp = (hugepd_t *)pu;
+		else if (!pud_none(*pu)) {
+			pdshift = PMD_SHIFT;
+			pm = pmd_offset(pu, ea);
+			if (is_hugepd(pm))
+				hpdp = (hugepd_t *)pm;
+			else if (!pmd_none(*pm)) {
+				return pte_offset_map(pm, ea);
+			}
+		}
+	}
+
+	if (!hpdp)
+		return NULL;
+
+	if (shift)
+		*shift = hugepd_shift(*hpdp);
+	return hugepte_offset(hpdp, ea, pdshift);
+}
+
+pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
+{
+	return find_linux_pte_or_hugepte(mm->pgd, addr, NULL);
+}
+
 static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
-			   unsigned long address, unsigned int psize)
+			   unsigned long address, unsigned pdshift, unsigned pshift)
 {
-	pte_t *new = kmem_cache_zalloc(PGT_CACHE(hugepte_shift[psize]),
+	pte_t *new = kmem_cache_zalloc(PGT_CACHE(pdshift - pshift),
 				       GFP_KERNEL|__GFP_REPEAT);
 
+	BUG_ON(pshift > HUGEPD_SHIFT_MASK);
+	BUG_ON((unsigned long)new & HUGEPD_SHIFT_MASK);
+
 	if (! new)
 		return -ENOMEM;
 
 	spin_lock(&mm->page_table_lock);
 	if (!hugepd_none(*hpdp))
-		kmem_cache_free(PGT_CACHE(hugepte_shift[psize]), new);
+		kmem_cache_free(PGT_CACHE(pdshift - pshift), new);
 	else
-		hpdp->pd = (unsigned long)new | HUGEPD_OK;
+		hpdp->pd = ((unsigned long)new & ~0x8000000000000000) | pshift;
 	spin_unlock(&mm->page_table_lock);
 	return 0;
 }
 
-
-static pud_t *hpud_offset(pgd_t *pgd, unsigned long addr, struct hstate *hstate)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
 {
-	if (huge_page_shift(hstate) < PUD_SHIFT)
-		return pud_offset(pgd, addr);
-	else
-		return (pud_t *) pgd;
-}
-static pud_t *hpud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long addr,
-			 struct hstate *hstate)
-{
-	if (huge_page_shift(hstate) < PUD_SHIFT)
-		return pud_alloc(mm, pgd, addr);
-	else
-		return (pud_t *) pgd;
-}
-static pmd_t *hpmd_offset(pud_t *pud, unsigned long addr, struct hstate *hstate)
-{
-	if (huge_page_shift(hstate) < PMD_SHIFT)
-		return pmd_offset(pud, addr);
-	else
-		return (pmd_t *) pud;
-}
-static pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr,
-			 struct hstate *hstate)
-{
-	if (huge_page_shift(hstate) < PMD_SHIFT)
-		return pmd_alloc(mm, pud, addr);
-	else
-		return (pmd_t *) pud;
+	pgd_t *pg;
+	pud_t *pu;
+	pmd_t *pm;
+	hugepd_t *hpdp = NULL;
+	unsigned pshift = __ffs(sz);
+	unsigned pdshift = PGDIR_SHIFT;
+
+	addr &= ~(sz-1);
+
+	pg = pgd_offset(mm, addr);
+	if (pshift >= PUD_SHIFT) {
+		hpdp = (hugepd_t *)pg;
+	} else {
+		pdshift = PUD_SHIFT;
+		pu = pud_alloc(mm, pg, addr);
+		if (pshift >= PMD_SHIFT) {
+			hpdp = (hugepd_t *)pu;
+		} else {
+			pdshift = PMD_SHIFT;
+			pm = pmd_alloc(mm, pu, addr);
+			hpdp = (hugepd_t *)pm;
+		}
+	}
+
+	if (!hpdp)
+		return NULL;
+
+	BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp));
+
+	if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr, pdshift, pshift))
+		return NULL;
+
+	return hugepte_offset(hpdp, addr, pdshift);
 }
 
 /* Build list of addresses of gigantic pages.  This function is used in early
@@ -180,92 +221,38 @@ int alloc_bootmem_huge_page(struct hstat
 	return 1;
 }
 
-
-/* Modelled after find_linux_pte() */
-pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
-{
-	pgd_t *pg;
-	pud_t *pu;
-	pmd_t *pm;
-
-	unsigned int psize;
-	unsigned int shift;
-	unsigned long sz;
-	struct hstate *hstate;
-	psize = get_slice_psize(mm, addr);
-	shift = mmu_psize_to_shift(psize);
-	sz = ((1UL) << shift);
-	hstate = size_to_hstate(sz);
-
-	addr &= hstate->mask;
-
-	pg = pgd_offset(mm, addr);
-	if (!pgd_none(*pg)) {
-		pu = hpud_offset(pg, addr, hstate);
-		if (!pud_none(*pu)) {
-			pm = hpmd_offset(pu, addr, hstate);
-			if (!pmd_none(*pm))
-				return hugepte_offset((hugepd_t *)pm, addr,
-						      hstate);
-		}
-	}
-
-	return NULL;
-}
-
-pte_t *huge_pte_alloc(struct mm_struct *mm,
-			unsigned long addr, unsigned long sz)
-{
-	pgd_t *pg;
-	pud_t *pu;
-	pmd_t *pm;
-	hugepd_t *hpdp = NULL;
-	struct hstate *hstate;
-	unsigned int psize;
-	hstate = size_to_hstate(sz);
-
-	psize = get_slice_psize(mm, addr);
-	BUG_ON(!mmu_huge_psizes[psize]);
-
-	addr &= hstate->mask;
-
-	pg = pgd_offset(mm, addr);
-	pu = hpud_alloc(mm, pg, addr, hstate);
-
-	if (pu) {
-		pm = hpmd_alloc(mm, pu, addr, hstate);
-		if (pm)
-			hpdp = (hugepd_t *)pm;
-	}
-
-	if (! hpdp)
-		return NULL;
-
-	if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr, psize))
-		return NULL;
-
-	return hugepte_offset(hpdp, addr, hstate);
-}
-
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 {
 	return 0;
 }
 
-static void free_hugepte_range(struct mmu_gather *tlb, hugepd_t *hpdp,
-			       unsigned int psize)
+static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshift,
+			      unsigned long start, unsigned long end,
+			      unsigned long floor, unsigned long ceiling)
 {
 	pte_t *hugepte = hugepd_page(*hpdp);
+	unsigned shift = hugepd_shift(*hpdp);
+	unsigned long pdmask = ~((1UL << pdshift) - 1);
+
+	start &= pdmask;
+	if (start < floor)
+		return;
+	if (ceiling) {
+		ceiling &= pdmask;
+		if (! ceiling)
+			return;
+	}
+	if (end - 1 > ceiling - 1)
+		return;
 
 	hpdp->pd = 0;
 	tlb->need_flush = 1;
-	pgtable_free_tlb(tlb, hugepte, hugepte_shift[psize]);
+	pgtable_free_tlb(tlb, hugepte, pdshift - shift);
 }
 
 static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 				   unsigned long addr, unsigned long end,
-				   unsigned long floor, unsigned long ceiling,
-				   unsigned int psize)
+				   unsigned long floor, unsigned long ceiling)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -277,7 +264,8 @@ static void hugetlb_free_pmd_range(struc
 		next = pmd_addr_end(addr, end);
 		if (pmd_none(*pmd))
 			continue;
-		free_hugepte_range(tlb, (hugepd_t *)pmd, psize);
+		free_hugepd_range(tlb, (hugepd_t *)pmd, PMD_SHIFT,
+				  addr, next, floor, ceiling);
 	} while (pmd++, addr = next, addr != end);
 
 	start &= PUD_MASK;
@@ -303,23 +291,19 @@ static void hugetlb_free_pud_range(struc
 	pud_t *pud;
 	unsigned long next;
 	unsigned long start;
-	unsigned int shift;
-	unsigned int psize = get_slice_psize(tlb->mm, addr);
-	shift = mmu_psize_to_shift(psize);
 
 	start = addr;
 	pud = pud_offset(pgd, addr);
 	do {
 		next = pud_addr_end(addr, end);
-		if (shift < PMD_SHIFT) {
+		if (!is_hugepd(pud)) {
 			if (pud_none_or_clear_bad(pud))
 				continue;
 			hugetlb_free_pmd_range(tlb, pud, addr, next, floor,
-					       ceiling, psize);
+					       ceiling);
 		} else {
-			if (pud_none(*pud))
-				continue;
-			free_hugepte_range(tlb, (hugepd_t *)pud, psize);
+			free_hugepd_range(tlb, (hugepd_t *)pud, PUD_SHIFT,
+					  addr, next, floor, ceiling);
 		}
 	} while (pud++, addr = next, addr != end);
 
@@ -350,74 +334,34 @@ void hugetlb_free_pgd_range(struct mmu_g
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long start;
 
 	/*
-	 * Comments below take from the normal free_pgd_range().  They
-	 * apply here too.  The tests against HUGEPD_MASK below are
-	 * essential, because we *don't* test for this at the bottom
-	 * level.  Without them we'll attempt to free a hugepte table
-	 * when we unmap just part of it, even if there are other
-	 * active mappings using it.
+	 * Because there are a number of different possible pagetable
+	 * layouts for hugepage ranges, we limit knowledge of how
+	 * things should be laid out to the allocation path
+	 * (huge_pte_alloc(), above).  Everything else works out the
+	 * structure as it goes from information in the hugepd
+	 * pointers.  That means that we can't here use the
+	 * optimization used in the normal page free_pgd_range(), of
+	 * checking whether we're actually covering a large enough
+	 * range to have to do anything at the top level of the walk
+	 * instead of at the bottom.
 	 *
-	 * The next few lines have given us lots of grief...
-	 *
-	 * Why are we testing HUGEPD* at this top level?  Because
-	 * often there will be no work to do at all, and we'd prefer
-	 * not to go all the way down to the bottom just to discover
-	 * that.
-	 *
-	 * Why all these "- 1"s?  Because 0 represents both the bottom
-	 * of the address space and the top of it (using -1 for the
-	 * top wouldn't help much: the masks would do the wrong thing).
-	 * The rule is that addr 0 and floor 0 refer to the bottom of
-	 * the address space, but end 0 and ceiling 0 refer to the top
-	 * Comparisons need to use "end - 1" and "ceiling - 1" (though
-	 * that end 0 case should be mythical).
-	 *
-	 * Wherever addr is brought up or ceiling brought down, we
-	 * must be careful to reject "the opposite 0" before it
-	 * confuses the subsequent tests.  But what about where end is
-	 * brought down by HUGEPD_SIZE below? no, end can't go down to
-	 * 0 there.
-	 *
-	 * Whereas we round start (addr) and ceiling down, by different
-	 * masks at different levels, in order to test whether a table
-	 * now has no other vmas using it, so can be freed, we don't
-	 * bother to round floor or end up - the tests don't need that.
+	 * To make sense of this, you should probably go read the big
+	 * block comment at the top of the normal free_pgd_range(),
+	 * too.
 	 */
-	unsigned int psize = get_slice_psize(tlb->mm, addr);
 
-	addr &= HUGEPD_MASK(psize);
-	if (addr < floor) {
-		addr += HUGEPD_SIZE(psize);
-		if (!addr)
-			return;
-	}
-	if (ceiling) {
-		ceiling &= HUGEPD_MASK(psize);
-		if (!ceiling)
-			return;
-	}
-	if (end - 1 > ceiling - 1)
-		end -= HUGEPD_SIZE(psize);
-	if (addr > end - 1)
-		return;
-
-	start = addr;
 	pgd = pgd_offset(tlb->mm, addr);
 	do {
-		psize = get_slice_psize(tlb->mm, addr);
-		BUG_ON(!mmu_huge_psizes[psize]);
 		next = pgd_addr_end(addr, end);
-		if (mmu_psize_to_shift(psize) < PUD_SHIFT) {
+		if (!is_hugepd(pgd)) {
 			if (pgd_none_or_clear_bad(pgd))
 				continue;
 			hugetlb_free_pud_range(tlb, pgd, addr, next, floor, ceiling);
 		} else {
-			if (pgd_none(*pgd))
-				continue;
-			free_hugepte_range(tlb, (hugepd_t *)pgd, psize);
+			free_hugepd_range(tlb, (hugepd_t *)pgd, PGDIR_SHIFT,
+					  addr, next, floor, ceiling);
 		}
 	} while (pgd++, addr = next, addr != end);
 }
@@ -448,19 +392,19 @@ follow_huge_addr(struct mm_struct *mm, u
 {
 	pte_t *ptep;
 	struct page *page;
-	unsigned int mmu_psize = get_slice_psize(mm, address);
+	unsigned shift;
+	unsigned long mask;
+
+	ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
 
 	/* Verify it is a huge page else bail. */
-	if (!mmu_huge_psizes[mmu_psize])
+	if (!ptep || !shift)
 		return ERR_PTR(-EINVAL);
 
-	ptep = huge_pte_offset(mm, address);
+	mask = (1UL << shift) - 1;
 	page = pte_page(*ptep);
-	if (page) {
-		unsigned int shift = mmu_psize_to_shift(mmu_psize);
-		unsigned long sz = ((1UL) << shift);
-		page += (address % sz) / PAGE_SIZE;
-	}
+	if (page)
+		page += (address & mask) / PAGE_SIZE;
 
 	return page;
 }
@@ -483,6 +427,73 @@ follow_huge_pmd(struct mm_struct *mm, un
 	return NULL;
 }
 
+static noinline int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
+		       unsigned long end, int write, struct page **pages, int *nr)
+{
+	unsigned long mask;
+	unsigned long pte_end;
+	struct page *head, *page;
+	pte_t pte;
+	int refs;
+
+	pte_end = (addr + sz) & ~(sz-1);
+	if (pte_end < end)
+		end = pte_end;
+
+	pte = *ptep;
+	mask = _PAGE_PRESENT | _PAGE_USER;
+	if (write)
+		mask |= _PAGE_RW;
+
+	if ((pte_val(pte) & mask) != mask)
+		return 0;
+
+	/* hugepages are never "special" */
+	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+
+	refs = 0;
+	head = pte_page(pte);
+
+	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
+	do {
+		VM_BUG_ON(compound_head(page) != head);
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+
+	if (!page_cache_add_speculative(head, refs)) {
+		*nr -= refs;
+		return 0;
+	}
+
+	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+		/* Could be optimized better */
+		while (*nr) {
+			put_page(page);
+			(*nr)--;
+		}
+	}
+
+	return 1;
+}
+
+int gup_hugepd(hugepd_t *hugepd, unsigned pdshift,
+	       unsigned long addr, unsigned long end,
+	       int write, struct page **pages, int *nr)
+{
+	pte_t *ptep;
+	unsigned long sz = 1UL << hugepd_shift(*hugepd);
+
+	ptep = hugepte_offset(hugepd, addr, pdshift);
+	do {
+		if (!gup_hugepte(ptep, sz, addr, end, write, pages, nr))
+			return 0;
+	} while (ptep++, addr += sz, addr != end);
+
+	return 1;
+}
 
 unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 					unsigned long len, unsigned long pgoff,
@@ -530,34 +541,20 @@ static unsigned int hash_huge_page_do_la
 	return rflags;
 }
 
-int hash_huge_page(struct mm_struct *mm, unsigned long access,
-		   unsigned long ea, unsigned long vsid, int local,
-		   unsigned long trap)
+int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
+		     pte_t *ptep, unsigned long trap, int local, int ssize,
+		     unsigned int shift, unsigned int mmu_psize)
 {
-	pte_t *ptep;
 	unsigned long old_pte, new_pte;
 	unsigned long va, rflags, pa, sz;
 	long slot;
 	int err = 1;
-	int ssize = user_segment_size(ea);
-	unsigned int mmu_psize;
-	int shift;
-	mmu_psize = get_slice_psize(mm, ea);
 
-	if (!mmu_huge_psizes[mmu_psize])
-		goto out;
-	ptep = huge_pte_offset(mm, ea);
+	BUG_ON(shift != mmu_psize_defs[mmu_psize].shift);
 
 	/* Search the Linux page table for a match with va */
 	va = hpt_va(ea, vsid, ssize);
 
-	/*
-	 * If no pte found or not present, send the problem up to
-	 * do_page_fault
-	 */
-	if (unlikely(!ptep || pte_none(*ptep)))
-		goto out;
-
 	/* 
 	 * Check the user's access rights to the page.  If access should be
 	 * prevented then send the problem up to do_page_fault.
@@ -588,7 +585,6 @@ int hash_huge_page(struct mm_struct *mm,
 	rflags = 0x2 | (!(new_pte & _PAGE_RW));
  	/* _PAGE_EXEC -> HW_NO_EXEC since it's inverted */
 	rflags |= ((new_pte & _PAGE_EXEC) ? 0 : HPTE_R_N);
-	shift = mmu_psize_to_shift(mmu_psize);
 	sz = ((1UL) << shift);
 	if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
 		/* No CPU has hugepages but lacks no execute, so we
@@ -672,6 +668,8 @@ repeat:
 
 static void __init set_huge_psize(int psize)
 {
+	unsigned pdshift;
+
 	/* Check that it is a page size supported by the hardware and
 	 * that it fits within pagetable limits. */
 	if (mmu_psize_defs[psize].shift &&
@@ -686,29 +684,14 @@ static void __init set_huge_psize(int ps
 			return;
 		hugetlb_add_hstate(mmu_psize_defs[psize].shift - PAGE_SHIFT);
 
-		switch (mmu_psize_defs[psize].shift) {
-		case PAGE_SHIFT_64K:
-		    /* We only allow 64k hpages with 4k base page,
-		     * which was checked above, and always put them
-		     * at the PMD */
-		    hugepte_shift[psize] = PMD_SHIFT;
-		    break;
-		case PAGE_SHIFT_16M:
-		    /* 16M pages can be at two different levels
-		     * of pagestables based on base page size */
-		    if (PAGE_SHIFT == PAGE_SHIFT_64K)
-			    hugepte_shift[psize] = PMD_SHIFT;
-		    else /* 4k base page */
-			    hugepte_shift[psize] = PUD_SHIFT;
-		    break;
-		case PAGE_SHIFT_16G:
-		    /* 16G pages are always at PGD level */
-		    hugepte_shift[psize] = PGDIR_SHIFT;
-		    break;
-		}
-		hugepte_shift[psize] -= mmu_psize_defs[psize].shift;
-	} else
-		hugepte_shift[psize] = 0;
+		if (mmu_psize_defs[psize].shift < PMD_SHIFT)
+			pdshift = PMD_SHIFT;
+		else if (mmu_psize_defs[psize].shift < PUD_SHIFT)
+			pdshift = PUD_SHIFT;
+		else
+			pdshift = PGDIR_SHIFT;
+		mmu_huge_psizes[psize] = pdshift - mmu_psize_defs[psize].shift;
+	}
 }
 
 static int __init hugepage_setup_sz(char *str)
@@ -732,7 +715,7 @@ __setup("hugepagesz=", hugepage_setup_sz
 
 static int __init hugetlbpage_init(void)
 {
-	unsigned int psize;
+	int psize;
 
 	if (!cpu_has_feature(CPU_FTR_16M_PAGE))
 		return -ENODEV;
@@ -753,8 +736,8 @@ static int __init hugetlbpage_init(void)
 
 	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
 		if (mmu_huge_psizes[psize]) {
-			pgtable_cache_add(hugepte_shift[psize], NULL);
-			if (!PGT_CACHE(hugepte_shift[psize]))
+			pgtable_cache_add(mmu_huge_psizes[psize], NULL);
+			if (!PGT_CACHE(mmu_huge_psizes[psize]))
 				panic("hugetlbpage_init(): could not create "
 				      "pgtable cache for %d bit pagesize\n",
 				      mmu_psize_to_shift(psize));
Index: working-2.6/arch/powerpc/include/asm/hugetlb.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/hugetlb.h	2009-02-18 14:27:16.000000000 +1100
+++ working-2.6/arch/powerpc/include/asm/hugetlb.h	2009-09-08 16:21:37.000000000 +1000
@@ -3,6 +3,15 @@
 
 #include <asm/page.h>
 
+typedef struct { signed long pd; } hugepd_t;
+
+static inline int hugepd_ok(hugepd_t hpd)
+{
+	return (hpd.pd > 0);
+}
+
+#define is_hugepd(pdep)               (hugepd_ok(*((hugepd_t *)(pdep))))
+#define HUGEPD_SHIFT_MASK     0x3f
 
 int is_hugepage_only_range(struct mm_struct *mm, unsigned long addr,
 			   unsigned long len);
@@ -17,6 +26,9 @@ void set_huge_pte_at(struct mm_struct *m
 pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
 			      pte_t *ptep);
 
+int gup_hugepd(hugepd_t *hugepd, unsigned pdshift, unsigned long addr,
+	       unsigned long end, int write, struct page **pages, int *nr);
+
 /*
  * The version of vma_mmu_pagesize() in arch/powerpc/mm/hugetlbpage.c needs
  * to override the version in mm/hugetlb.c
Index: working-2.6/arch/powerpc/mm/init_64.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/init_64.c	2009-09-08 11:44:17.000000000 +1000
+++ working-2.6/arch/powerpc/mm/init_64.c	2009-09-08 11:44:17.000000000 +1000
@@ -41,6 +41,7 @@
 #include <linux/module.h>
 #include <linux/poison.h>
 #include <linux/lmb.h>
+#include <linux/hugetlb.h>
 
 #include <asm/pgalloc.h>
 #include <asm/page.h>
@@ -154,13 +155,21 @@ void pgtable_cache_add(unsigned shift, v
 {
 	char *name;
 	unsigned long table_size = sizeof(void *) << shift;
+	unsigned long align = table_size;
 	struct kmem_cache *new;
 
 	BUG_ON((shift < 1) || (shift > PGF_SHIFT_MASK));
+#ifdef CONFIG_HUGETLB_PAGE
+	/* We use low bits in hugepage dir pointers to store index
+	 * size information.  Table alignment must be big enough to
+	 * fit it. */
+	align = max_t(unsigned long, align, HUGEPD_SHIFT_MASK + 1);
+#endif
+
 	if (PGT_CACHE(shift))
 		return; /* Already have a cache of this size */
 	name = kasprintf(GFP_KERNEL, "pgtable-2^%d", shift);
-	new = kmem_cache_create(name, table_size, table_size, 0, ctor);
+	new = kmem_cache_create(name, table_size, align, 0, ctor);
 	PGT_CACHE(shift) = new;
 }
 
Index: working-2.6/arch/powerpc/include/asm/pgtable-ppc64.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgtable-ppc64.h	2009-09-08 16:22:05.000000000 +1000
+++ working-2.6/arch/powerpc/include/asm/pgtable-ppc64.h	2009-09-08 16:42:56.000000000 +1000
@@ -379,7 +379,18 @@ void pgtable_cache_init(void);
 	return pt;
 }
 
-pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long address);
+#ifdef CONFIG_HUGETLB_PAGE
+pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
+				 unsigned *shift);
+#else
+static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
+					       unsigned *shift)
+{
+	if (shift)
+		*shift = 0;
+	return find_linux_pte(pgdir, ea);
+}
+#endif /* !CONFIG_HUGETLB_PAGE */
 
 #endif /* __ASSEMBLY__ */
 
Index: working-2.6/arch/powerpc/mm/gup.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/gup.c	2009-09-08 16:15:37.000000000 +1000
+++ working-2.6/arch/powerpc/mm/gup.c	2009-09-08 16:15:46.000000000 +1000
@@ -55,57 +55,6 @@ static noinline int gup_pte_range(pmd_t 
 	return 1;
 }
 
-#ifdef CONFIG_HUGETLB_PAGE
-static noinline int gup_huge_pte(pte_t *ptep, struct hstate *hstate,
-				 unsigned long *addr, unsigned long end,
-				 int write, struct page **pages, int *nr)
-{
-	unsigned long mask;
-	unsigned long pte_end;
-	struct page *head, *page;
-	pte_t pte;
-	int refs;
-
-	pte_end = (*addr + huge_page_size(hstate)) & huge_page_mask(hstate);
-	if (pte_end < end)
-		end = pte_end;
-
-	pte = *ptep;
-	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
-	if ((pte_val(pte) & mask) != mask)
-		return 0;
-	/* hugepages are never "special" */
-	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
-
-	refs = 0;
-	head = pte_page(pte);
-	page = head + ((*addr & ~huge_page_mask(hstate)) >> PAGE_SHIFT);
-	do {
-		VM_BUG_ON(compound_head(page) != head);
-		pages[*nr] = page;
-		(*nr)++;
-		page++;
-		refs++;
-	} while (*addr += PAGE_SIZE, *addr != end);
-
-	if (!page_cache_add_speculative(head, refs)) {
-		*nr -= refs;
-		return 0;
-	}
-	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-		/* Could be optimized better */
-		while (*nr) {
-			put_page(page);
-			(*nr)--;
-		}
-	}
-
-	return 1;
-}
-#endif /* CONFIG_HUGETLB_PAGE */
-
 static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		int write, struct page **pages, int *nr)
 {
@@ -119,7 +68,11 @@ static int gup_pmd_range(pud_t pud, unsi
 		next = pmd_addr_end(addr, end);
 		if (pmd_none(pmd))
 			return 0;
-		if (!gup_pte_range(pmd, addr, next, write, pages, nr))
+		if (is_hugepd(pmdp)) {
+			if (!gup_hugepd((hugepd_t *)pmdp, PMD_SHIFT,
+					addr, next, write, pages, nr))
+				return 0;
+		} else if (!gup_pte_range(pmd, addr, next, write, pages, nr))
 			return 0;
 	} while (pmdp++, addr = next, addr != end);
 
@@ -139,7 +92,11 @@ static int gup_pud_range(pgd_t pgd, unsi
 		next = pud_addr_end(addr, end);
 		if (pud_none(pud))
 			return 0;
-		if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+		if (is_hugepd(pudp)) {
+			if (!gup_hugepd((hugepd_t *)pudp, PUD_SHIFT,
+					addr, next, write, pages, nr))
+				return 0;
+		} else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
 			return 0;
 	} while (pudp++, addr = next, addr != end);
 
@@ -154,10 +111,6 @@ int get_user_pages_fast(unsigned long st
 	unsigned long next;
 	pgd_t *pgdp;
 	int nr = 0;
-#ifdef CONFIG_PPC64
-	unsigned int shift;
-	int psize;
-#endif
 
 	pr_devel("%s(%lx,%x,%s)\n", __func__, start, nr_pages, write ? "write" : "read");
 
@@ -172,25 +125,6 @@ int get_user_pages_fast(unsigned long st
 
 	pr_devel("  aligned: %lx .. %lx\n", start, end);
 
-#ifdef CONFIG_HUGETLB_PAGE
-	/* We bail out on slice boundary crossing when hugetlb is
-	 * enabled in order to not have to deal with two different
-	 * page table formats
-	 */
-	if (addr < SLICE_LOW_TOP) {
-		if (end > SLICE_LOW_TOP)
-			goto slow_irqon;
-
-		if (unlikely(GET_LOW_SLICE_INDEX(addr) !=
-			     GET_LOW_SLICE_INDEX(end - 1)))
-			goto slow_irqon;
-	} else {
-		if (unlikely(GET_HIGH_SLICE_INDEX(addr) !=
-			     GET_HIGH_SLICE_INDEX(end - 1)))
-			goto slow_irqon;
-	}
-#endif /* CONFIG_HUGETLB_PAGE */
-
 	/*
 	 * XXX: batch / limit 'nr', to avoid large irq off latency
 	 * needs some instrumenting to determine the common sizes used by
@@ -210,54 +144,23 @@ int get_user_pages_fast(unsigned long st
 	 */
 	local_irq_disable();
 
-#ifdef CONFIG_PPC64
-	/* Those bits are related to hugetlbfs implementation and only exist
-	 * on 64-bit for now
-	 */
-	psize = get_slice_psize(mm, addr);
-	shift = mmu_psize_defs[psize].shift;
-#endif /* CONFIG_PPC64 */
-
-#ifdef CONFIG_HUGETLB_PAGE
-	if (unlikely(mmu_huge_psizes[psize])) {
-		pte_t *ptep;
-		unsigned long a = addr;
-		unsigned long sz = ((1UL) << shift);
-		struct hstate *hstate = size_to_hstate(sz);
-
-		BUG_ON(!hstate);
-		/*
-		 * XXX: could be optimized to avoid hstate
-		 * lookup entirely (just use shift)
-		 */
-
-		do {
-			VM_BUG_ON(shift != mmu_psize_defs[get_slice_psize(mm, a)].shift);
-			ptep = huge_pte_offset(mm, a);
-			pr_devel(" %016lx: huge ptep %p\n", a, ptep);
-			if (!ptep || !gup_huge_pte(ptep, hstate, &a, end, write, pages,
-						   &nr))
-				goto slow;
-		} while (a != end);
-	} else
-#endif /* CONFIG_HUGETLB_PAGE */
-	{
-		pgdp = pgd_offset(mm, addr);
-		do {
-			pgd_t pgd = *pgdp;
-
-#ifdef CONFIG_PPC64
-			VM_BUG_ON(shift != mmu_psize_defs[get_slice_psize(mm, addr)].shift);
-#endif
-			pr_devel("  %016lx: normal pgd %p\n", addr,
-				 (void *)pgd_val(pgd));
-			next = pgd_addr_end(addr, end);
-			if (pgd_none(pgd))
-				goto slow;
-			if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+	pgdp = pgd_offset(mm, addr);
+	do {
+		pgd_t pgd = *pgdp;
+
+		pr_devel("  %016lx: normal pgd %p\n", addr,
+			 (void *)pgd_val(pgd));
+		next = pgd_addr_end(addr, end);
+		if (pgd_none(pgd))
+			goto slow;
+		if (is_hugepd(pgdp)) {
+			if (!gup_hugepd((hugepd_t *)pgdp, PGDIR_SHIFT,
+					addr, next, write, pages, &nr))
 				goto slow;
-		} while (pgdp++, addr = next, addr != end);
-	}
+		} else if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+			goto slow;
+	} while (pgdp++, addr = next, addr != end);
+
 	local_irq_enable();
 
 	VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
Index: working-2.6/arch/powerpc/kernel/perf_callchain.c
===================================================================
--- working-2.6.orig/arch/powerpc/kernel/perf_callchain.c	2009-09-08 16:34:33.000000000 +1000
+++ working-2.6/arch/powerpc/kernel/perf_callchain.c	2009-09-08 17:13:10.000000000 +1000
@@ -119,13 +119,6 @@ static void perf_callchain_kernel(struct
 }
 
 #ifdef CONFIG_PPC64
-
-#ifdef CONFIG_HUGETLB_PAGE
-#define is_huge_psize(pagesize)	(HPAGE_SHIFT && mmu_huge_psizes[pagesize])
-#else
-#define is_huge_psize(pagesize)	0
-#endif
-
 /*
  * On 64-bit we don't want to invoke hash_page on user addresses from
  * interrupt context, so if the access faults, we read the page tables
@@ -135,7 +128,7 @@ static int read_user_stack_slow(void __u
 {
 	pgd_t *pgdir;
 	pte_t *ptep, pte;
-	int pagesize;
+	unsigned shift;
 	unsigned long addr = (unsigned long) ptr;
 	unsigned long offset;
 	unsigned long pfn;
@@ -145,17 +138,14 @@ static int read_user_stack_slow(void __u
 	if (!pgdir)
 		return -EFAULT;
 
-	pagesize = get_slice_psize(current->mm, addr);
+	ptep = find_linux_pte_or_hugepte(pgdir, addr, &shift);
+	if (!shift)
+		shift = PAGE_SHIFT;
 
 	/* align address to page boundary */
-	offset = addr & ((1ul << mmu_psize_defs[pagesize].shift) - 1);
+	offset = addr & ((1UL << shift) - 1);
 	addr -= offset;
 
-	if (is_huge_psize(pagesize))
-		ptep = huge_pte_offset(current->mm, addr);
-	else
-		ptep = find_linux_pte(pgdir, addr);
-
 	if (ptep == NULL)
 		return -EFAULT;
 	pte = *ptep;
Index: working-2.6/arch/powerpc/mm/hash_utils_64.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hash_utils_64.c	2009-09-08 16:39:34.000000000 +1000
+++ working-2.6/arch/powerpc/mm/hash_utils_64.c	2009-09-08 17:00:01.000000000 +1000
@@ -891,6 +891,7 @@ int hash_page(unsigned long ea, unsigned
 	unsigned long vsid;
 	struct mm_struct *mm;
 	pte_t *ptep;
+	unsigned hugeshift;
 	const struct cpumask *tmp;
 	int rc, user_region = 0, local = 0;
 	int psize, ssize;
@@ -943,14 +944,6 @@ int hash_page(unsigned long ea, unsigned
 	if (user_region && cpumask_equal(mm_cpumask(mm), tmp))
 		local = 1;
 
-#ifdef CONFIG_HUGETLB_PAGE
-	/* Handle hugepage regions */
-	if (HPAGE_SHIFT && mmu_huge_psizes[psize]) {
-		DBG_LOW(" -> huge page !\n");
-		return hash_huge_page(mm, access, ea, vsid, local, trap);
-	}
-#endif /* CONFIG_HUGETLB_PAGE */
-
 #ifndef CONFIG_PPC_64K_PAGES
 	/* If we use 4K pages and our psize is not 4K, then we are hitting
 	 * a special driver mapping, we need to align the address before
@@ -961,12 +954,18 @@ int hash_page(unsigned long ea, unsigned
 #endif /* CONFIG_PPC_64K_PAGES */
 
 	/* Get PTE and page size from page tables */
-	ptep = find_linux_pte(pgdir, ea);
+	ptep = find_linux_pte_or_hugepte(pgdir, ea, &hugeshift);
 	if (ptep == NULL || !pte_present(*ptep)) {
 		DBG_LOW(" no PTE !\n");
 		return 1;
 	}
 
+#ifdef CONFIG_HUGETLB_PAGE
+	if (hugeshift)
+		return __hash_page_huge(ea, access, vsid, ptep, trap, local,
+					ssize, hugeshift, psize);
+#endif /* CONFIG_HUGETLB_PAGE */
+
 #ifndef CONFIG_PPC_64K_PAGES
 	DBG_LOW(" i-pte: %016lx\n", pte_val(*ptep));
 #else
Index: working-2.6/arch/powerpc/include/asm/mmu-hash64.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/mmu-hash64.h	2009-09-08 16:52:38.000000000 +1000
+++ working-2.6/arch/powerpc/include/asm/mmu-hash64.h	2009-09-08 17:13:25.000000000 +1000
@@ -173,14 +173,6 @@ extern unsigned long tce_alloc_start, tc
  */
 extern int mmu_ci_restrictions;
 
-#ifdef CONFIG_HUGETLB_PAGE
-/*
- * The page size indexes of the huge pages for use by hugetlbfs
- */
-extern unsigned int mmu_huge_psizes[MMU_PAGE_COUNT];
-
-#endif /* CONFIG_HUGETLB_PAGE */
-
 /*
  * This function sets the AVPN and L fields of the HPTE  appropriately
  * for the page size
@@ -254,9 +246,9 @@ extern int __hash_page_64K(unsigned long
 			   unsigned int local, int ssize);
 struct mm_struct;
 extern int hash_page(unsigned long ea, unsigned long access, unsigned long trap);
-extern int hash_huge_page(struct mm_struct *mm, unsigned long access,
-			  unsigned long ea, unsigned long vsid, int local,
-			  unsigned long trap);
+int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
+		     pte_t *ptep, unsigned long trap, int local, int ssize,
+		     unsigned int shift, unsigned int mmu_psize);
 
 extern int htab_bolt_mapping(unsigned long vstart, unsigned long vend,
 			     unsigned long pstart, unsigned long prot,

^ permalink raw reply

* [5/5] Split hash MMU specific hugepage code into a new file
From: David Gibson @ 2009-09-09  5:59 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, linuxppc-dev
In-Reply-To: <20090909055534.GF7909@yookeroo.seuss>

This patch separates the parts of hugetlbpage.c which are inherently
specific to the hash MMU into a new hugetlbpage-hash64.c file.

Signed-off-by: David Gibson <dwg@au1.ibm.com>

---
 arch/powerpc/include/asm/hugetlb.h   |    3 
 arch/powerpc/mm/Makefile             |    5 -
 arch/powerpc/mm/hugetlbpage-hash64.c |  167 ++++++++++++++++++++++++++++++++++
 arch/powerpc/mm/hugetlbpage.c        |  168 -----------------------------------
 4 files changed, 176 insertions(+), 167 deletions(-)

Index: working-2.6/arch/powerpc/mm/Makefile
===================================================================
--- working-2.6.orig/arch/powerpc/mm/Makefile	2009-08-14 16:07:54.000000000 +1000
+++ working-2.6/arch/powerpc/mm/Makefile	2009-09-09 15:24:33.000000000 +1000
@@ -28,7 +28,10 @@ obj-$(CONFIG_44x)		+= 44x_mmu.o
 obj-$(CONFIG_FSL_BOOKE)		+= fsl_booke_mmu.o
 obj-$(CONFIG_NEED_MULTIPLE_NODES) += numa.o
 obj-$(CONFIG_PPC_MM_SLICES)	+= slice.o
-obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
+ifeq ($(CONFIG_HUGETLB_PAGE),y)
+obj-y				+= hugetlbpage.o
+obj-$(CONFIG_PPC_STD_MMU_64)	+= hugetlbpage-hash64.o
+endif
 obj-$(CONFIG_PPC_SUBPAGE_PROT)	+= subpage-prot.o
 obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o
 obj-$(CONFIG_HIGHMEM)		+= highmem.o
Index: working-2.6/arch/powerpc/mm/hugetlbpage-hash64.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ working-2.6/arch/powerpc/mm/hugetlbpage-hash64.c	2009-09-09 15:25:35.000000000 +1000
@@ -0,0 +1,167 @@
+/*
+ * PPC64 Huge TLB Page Support for hash based MMUs (POWER4 and later)
+ *
+ * Copyright (C) 2003 David Gibson, IBM Corporation.
+ *
+ * Based on the IA-32 version:
+ * Copyright (C) 2002, Rohit Seth <rohit.seth@intel.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/hugetlb.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/cacheflush.h>
+#include <asm/machdep.h>
+
+/*
+ * Called by asm hashtable.S for doing lazy icache flush
+ */
+static unsigned int hash_huge_page_do_lazy_icache(unsigned long rflags,
+					pte_t pte, int trap, unsigned long sz)
+{
+	struct page *page;
+	int i;
+
+	if (!pfn_valid(pte_pfn(pte)))
+		return rflags;
+
+	page = pte_page(pte);
+
+	/* page is dirty */
+	if (!test_bit(PG_arch_1, &page->flags) && !PageReserved(page)) {
+		if (trap == 0x400) {
+			for (i = 0; i < (sz / PAGE_SIZE); i++)
+				__flush_dcache_icache(page_address(page+i));
+			set_bit(PG_arch_1, &page->flags);
+		} else {
+			rflags |= HPTE_R_N;
+		}
+	}
+	return rflags;
+}
+
+int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
+		     pte_t *ptep, unsigned long trap, int local, int ssize,
+		     unsigned int shift, unsigned int mmu_psize)
+{
+	unsigned long old_pte, new_pte;
+	unsigned long va, rflags, pa, sz;
+	long slot;
+	int err = 1;
+
+	BUG_ON(shift != mmu_psize_defs[mmu_psize].shift);
+
+	/* Search the Linux page table for a match with va */
+	va = hpt_va(ea, vsid, ssize);
+
+	/*
+	 * Check the user's access rights to the page.  If access should be
+	 * prevented then send the problem up to do_page_fault.
+	 */
+	if (unlikely(access & ~pte_val(*ptep)))
+		goto out;
+	/*
+	 * At this point, we have a pte (old_pte) which can be used to build
+	 * or update an HPTE. There are 2 cases:
+	 *
+	 * 1. There is a valid (present) pte with no associated HPTE (this is
+	 *	the most common case)
+	 * 2. There is a valid (present) pte with an associated HPTE. The
+	 *	current values of the pp bits in the HPTE prevent access
+	 *	because we are doing software DIRTY bit management and the
+	 *	page is currently not DIRTY.
+	 */
+
+
+	do {
+		old_pte = pte_val(*ptep);
+		if (old_pte & _PAGE_BUSY)
+			goto out;
+		new_pte = old_pte | _PAGE_BUSY | _PAGE_ACCESSED;
+	} while(old_pte != __cmpxchg_u64((unsigned long *)ptep,
+					 old_pte, new_pte));
+
+	rflags = 0x2 | (!(new_pte & _PAGE_RW));
+ 	/* _PAGE_EXEC -> HW_NO_EXEC since it's inverted */
+	rflags |= ((new_pte & _PAGE_EXEC) ? 0 : HPTE_R_N);
+	sz = ((1UL) << shift);
+	if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
+		/* No CPU has hugepages but lacks no execute, so we
+		 * don't need to worry about that case */
+		rflags = hash_huge_page_do_lazy_icache(rflags, __pte(old_pte),
+						       trap, sz);
+
+	/* Check if pte already has an hpte (case 2) */
+	if (unlikely(old_pte & _PAGE_HASHPTE)) {
+		/* There MIGHT be an HPTE for this pte */
+		unsigned long hash, slot;
+
+		hash = hpt_hash(va, shift, ssize);
+		if (old_pte & _PAGE_F_SECOND)
+			hash = ~hash;
+		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
+		slot += (old_pte & _PAGE_F_GIX) >> 12;
+
+		if (ppc_md.hpte_updatepp(slot, rflags, va, mmu_psize,
+					 ssize, local) == -1)
+			old_pte &= ~_PAGE_HPTEFLAGS;
+	}
+
+	if (likely(!(old_pte & _PAGE_HASHPTE))) {
+		unsigned long hash = hpt_hash(va, shift, ssize);
+		unsigned long hpte_group;
+
+		pa = pte_pfn(__pte(old_pte)) << PAGE_SHIFT;
+
+repeat:
+		hpte_group = ((hash & htab_hash_mask) *
+			      HPTES_PER_GROUP) & ~0x7UL;
+
+		/* clear HPTE slot informations in new PTE */
+#ifdef CONFIG_PPC_64K_PAGES
+		new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | _PAGE_HPTE_SUB0;
+#else
+		new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | _PAGE_HASHPTE;
+#endif
+		/* Add in WIMG bits */
+		rflags |= (new_pte & (_PAGE_WRITETHRU | _PAGE_NO_CACHE |
+				      _PAGE_COHERENT | _PAGE_GUARDED));
+
+		/* Insert into the hash table, primary slot */
+		slot = ppc_md.hpte_insert(hpte_group, va, pa, rflags, 0,
+					  mmu_psize, ssize);
+
+		/* Primary is full, try the secondary */
+		if (unlikely(slot == -1)) {
+			hpte_group = ((~hash & htab_hash_mask) *
+				      HPTES_PER_GROUP) & ~0x7UL;
+			slot = ppc_md.hpte_insert(hpte_group, va, pa, rflags,
+						  HPTE_V_SECONDARY,
+						  mmu_psize, ssize);
+			if (slot == -1) {
+				if (mftb() & 0x1)
+					hpte_group = ((hash & htab_hash_mask) *
+						      HPTES_PER_GROUP)&~0x7UL;
+
+				ppc_md.hpte_remove(hpte_group);
+				goto repeat;
+                        }
+		}
+
+		if (unlikely(slot == -2))
+			panic("hash_huge_page: pte_insert failed\n");
+
+		new_pte |= (slot << 12) & (_PAGE_F_SECOND | _PAGE_F_GIX);
+	}
+
+	/*
+	 * No need to use ldarx/stdcx here
+	 */
+	*ptep = __pte(new_pte & ~_PAGE_BUSY);
+
+	err = 0;
+
+ out:
+	return err;
+}
Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2009-09-09 15:22:49.000000000 +1000
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2009-09-09 15:25:09.000000000 +1000
@@ -7,29 +7,17 @@
  * Copyright (C) 2002, Rohit Seth <rohit.seth@intel.com>
  */
 
-#include <linux/init.h>
-#include <linux/fs.h>
 #include <linux/mm.h>
+#include <linux/io.h>
 #include <linux/hugetlb.h>
-#include <linux/pagemap.h>
-#include <linux/slab.h>
-#include <linux/err.h>
-#include <linux/sysctl.h>
-#include <asm/mman.h>
+#include <asm/pgtable.h>
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
-#include <asm/tlbflush.h>
-#include <asm/mmu_context.h>
-#include <asm/machdep.h>
-#include <asm/cputable.h>
-#include <asm/spu.h>
 
 #define PAGE_SHIFT_64K	16
 #define PAGE_SHIFT_16M	24
 #define PAGE_SHIFT_16G	34
 
-#define NUM_LOW_AREAS	(0x100000000UL >> SID_SHIFT)
-#define NUM_HIGH_AREAS	(PGTABLE_RANGE >> HTLB_AREA_SHIFT)
 #define MAX_NUMBER_GPAGES	1024
 
 /* Tracks the 16G pages after the device tree is scanned and before the
@@ -507,158 +495,6 @@ unsigned long vma_mmu_pagesize(struct vm
 	return 1UL << mmu_psize_to_shift(psize);
 }
 
-/*
- * Called by asm hashtable.S for doing lazy icache flush
- */
-static unsigned int hash_huge_page_do_lazy_icache(unsigned long rflags,
-					pte_t pte, int trap, unsigned long sz)
-{
-	struct page *page;
-	int i;
-
-	if (!pfn_valid(pte_pfn(pte)))
-		return rflags;
-
-	page = pte_page(pte);
-
-	/* page is dirty */
-	if (!test_bit(PG_arch_1, &page->flags) && !PageReserved(page)) {
-		if (trap == 0x400) {
-			for (i = 0; i < (sz / PAGE_SIZE); i++)
-				__flush_dcache_icache(page_address(page+i));
-			set_bit(PG_arch_1, &page->flags);
-		} else {
-			rflags |= HPTE_R_N;
-		}
-	}
-	return rflags;
-}
-
-int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
-		     pte_t *ptep, unsigned long trap, int local, int ssize,
-		     unsigned int shift, unsigned int mmu_psize)
-{
-	unsigned long old_pte, new_pte;
-	unsigned long va, rflags, pa, sz;
-	long slot;
-	int err = 1;
-
-	BUG_ON(shift != mmu_psize_defs[mmu_psize].shift);
-
-	/* Search the Linux page table for a match with va */
-	va = hpt_va(ea, vsid, ssize);
-
-	/* 
-	 * Check the user's access rights to the page.  If access should be
-	 * prevented then send the problem up to do_page_fault.
-	 */
-	if (unlikely(access & ~pte_val(*ptep)))
-		goto out;
-	/*
-	 * At this point, we have a pte (old_pte) which can be used to build
-	 * or update an HPTE. There are 2 cases:
-	 *
-	 * 1. There is a valid (present) pte with no associated HPTE (this is 
-	 *	the most common case)
-	 * 2. There is a valid (present) pte with an associated HPTE. The
-	 *	current values of the pp bits in the HPTE prevent access
-	 *	because we are doing software DIRTY bit management and the
-	 *	page is currently not DIRTY. 
-	 */
-
-
-	do {
-		old_pte = pte_val(*ptep);
-		if (old_pte & _PAGE_BUSY)
-			goto out;
-		new_pte = old_pte | _PAGE_BUSY | _PAGE_ACCESSED;
-	} while(old_pte != __cmpxchg_u64((unsigned long *)ptep,
-					 old_pte, new_pte));
-
-	rflags = 0x2 | (!(new_pte & _PAGE_RW));
- 	/* _PAGE_EXEC -> HW_NO_EXEC since it's inverted */
-	rflags |= ((new_pte & _PAGE_EXEC) ? 0 : HPTE_R_N);
-	sz = ((1UL) << shift);
-	if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
-		/* No CPU has hugepages but lacks no execute, so we
-		 * don't need to worry about that case */
-		rflags = hash_huge_page_do_lazy_icache(rflags, __pte(old_pte),
-						       trap, sz);
-
-	/* Check if pte already has an hpte (case 2) */
-	if (unlikely(old_pte & _PAGE_HASHPTE)) {
-		/* There MIGHT be an HPTE for this pte */
-		unsigned long hash, slot;
-
-		hash = hpt_hash(va, shift, ssize);
-		if (old_pte & _PAGE_F_SECOND)
-			hash = ~hash;
-		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-		slot += (old_pte & _PAGE_F_GIX) >> 12;
-
-		if (ppc_md.hpte_updatepp(slot, rflags, va, mmu_psize,
-					 ssize, local) == -1)
-			old_pte &= ~_PAGE_HPTEFLAGS;
-	}
-
-	if (likely(!(old_pte & _PAGE_HASHPTE))) {
-		unsigned long hash = hpt_hash(va, shift, ssize);
-		unsigned long hpte_group;
-
-		pa = pte_pfn(__pte(old_pte)) << PAGE_SHIFT;
-
-repeat:
-		hpte_group = ((hash & htab_hash_mask) *
-			      HPTES_PER_GROUP) & ~0x7UL;
-
-		/* clear HPTE slot informations in new PTE */
-#ifdef CONFIG_PPC_64K_PAGES
-		new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | _PAGE_HPTE_SUB0;
-#else
-		new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | _PAGE_HASHPTE;
-#endif
-		/* Add in WIMG bits */
-		rflags |= (new_pte & (_PAGE_WRITETHRU | _PAGE_NO_CACHE |
-				      _PAGE_COHERENT | _PAGE_GUARDED));
-
-		/* Insert into the hash table, primary slot */
-		slot = ppc_md.hpte_insert(hpte_group, va, pa, rflags, 0,
-					  mmu_psize, ssize);
-
-		/* Primary is full, try the secondary */
-		if (unlikely(slot == -1)) {
-			hpte_group = ((~hash & htab_hash_mask) *
-				      HPTES_PER_GROUP) & ~0x7UL; 
-			slot = ppc_md.hpte_insert(hpte_group, va, pa, rflags,
-						  HPTE_V_SECONDARY,
-						  mmu_psize, ssize);
-			if (slot == -1) {
-				if (mftb() & 0x1)
-					hpte_group = ((hash & htab_hash_mask) *
-						      HPTES_PER_GROUP)&~0x7UL;
-
-				ppc_md.hpte_remove(hpte_group);
-				goto repeat;
-                        }
-		}
-
-		if (unlikely(slot == -2))
-			panic("hash_huge_page: pte_insert failed\n");
-
-		new_pte |= (slot << 12) & (_PAGE_F_SECOND | _PAGE_F_GIX);
-	}
-
-	/*
-	 * No need to use ldarx/stdcx here
-	 */
-	*ptep = __pte(new_pte & ~_PAGE_BUSY);
-
-	err = 0;
-
- out:
-	return err;
-}
-
 static int __init add_huge_page_size(unsigned long long size)
 {
 	int shift = __ffs(size);
Index: working-2.6/arch/powerpc/include/asm/hugetlb.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/hugetlb.h	2009-09-09 15:15:12.000000000 +1000
+++ working-2.6/arch/powerpc/include/asm/hugetlb.h	2009-09-09 15:24:33.000000000 +1000
@@ -13,6 +13,9 @@ static inline int hugepd_ok(hugepd_t hpd
 #define is_hugepd(pdep)               (hugepd_ok(*((hugepd_t *)(pdep))))
 #define HUGEPD_SHIFT_MASK     0x3f
 
+pte_t *huge_pte_offset_and_shift(struct mm_struct *mm,
+				 unsigned long addr, unsigned *shift);
+
 int is_hugepage_only_range(struct mm_struct *mm, unsigned long addr,
 			   unsigned long len);
 

^ permalink raw reply

* Re: [FTRACE] Enabling function_graph causes OOPS
From: Sachin Sant @ 2009-09-09  6:27 UTC (permalink / raw)
  To: rostedt; +Cc: linuxppc-dev
In-Reply-To: <1252458303.20985.10.camel@gandalf.stny.rr.com>

Steven Rostedt wrote:
> I'm going through old email, and I found this. Do you still see this
> error. I don't recall seeing it myself.
>   
I can still recreate this with 2.6.31-rc9. When I enable tracing
with function_graph I see the following oops. It happens only
once; if I later enable or disable tracing again, the oops does
not recur. This behavior is observed only with function_graph.
Other tracers work fine.

Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=1024 NUMA pSeries
Modules linked in: ipv6 fuse loop dm_mod sr_mod ehea ibmveth sg cdrom sd_mod crc_t10dif ibmvscsic scsi_transport_srp scsi_tgt scsi_mod
NIP: c000000000008f30 LR: c000000000008f04 CTR: 80000000000f6d68
REGS: c00000003e98f560 TRAP: 0300   Not tainted  (2.6.31-rc9)
MSR: 8000000000009032 <EE,ME,IR,DR>  CR: 24000422  XER: 00000020
DAR: 0000000000000008, DSISR: 0000000040000000
TASK = c00000003e953b20[2483] 'irqbalance' THREAD: c00000003e98c000 CPU: 1
GPR00: c000000000008f04 c00000003e98f7e0 d00000000117ed38 0000000000000000
GPR04: 0000000000000000 0000000066000000 00000000000010bf 0000000000000000
GPR08: 0000000000000000 800000010021bb40 00000000000000ff 800000010021bb60
GPR12: 0000000000000002 c000000001032800 0000000000000000 ffffffffeffdff68
GPR16: 00000fffa39fd6a0 00000fffa39e6c38 c00000003ebe9c38 fffffffffffff000
GPR20: c00000002a6cf980 c00000003e98fdf8 c00000003e98fba8 00000fffa1740000
GPR24: fffffffffffff000 8001000003000000 ffe0000000000000 0000000000000009
GPR28: c00000003db40000 0000000000020000 d00000000117da78 c00000003e98f850
NIP [c000000000008f30] .mod_return_to_handler+0x2c/0x64
LR [c000000000008f04] .mod_return_to_handler+0x0/0x64
Call Trace:
[c00000003e98f7e0] [c00000002a6cf980] 0xc00000002a6cf980 (unreliable)
[c00000003e98f850] [c000000000008f04] .mod_return_to_handler+0x0/0x64
[c00000003e98f900] [c000000000008f04] .mod_return_to_handler+0x0/0x64
[c00000003e98f9a0] [c000000000008f04] .mod_return_to_handler+0x0/0x64
[c00000003e98fa30] [c000000000008ed0] .return_to_handler+0x0/0x34 (.bad_page_fault+0xc8/0xe8)
[c00000003e98fb30] [c000000000008ed0] .return_to_handler+0x0/0x34 (handle_page_fault+0x3c/0x5c)
[c00000003e98fc20] [c000000000008ed0] .return_to_handler+0x0/0x34 (.ehea_h_query_ehea_port+0x74/0x9c [ehea])
[c00000003e98fcd0] [c000000000008ed0] .return_to_handler+0x0/0x34 (.ehea_get_stats+0xa0/0x1d0 [ehea])
[c00000003e98fd80] [c000000000008ed0] .return_to_handler+0x0/0x34 (.dev_get_stats+0x50/0xec)
[c00000003e98fe30] [c000000000008ed0] .return_to_handler+0x0/0x34 (.dev_seq_show+0x5c/0x140)
Instruction dump:
4e800020 f881ffe0 f861ffe8 f841fff0 fbe1fff8 7c3f0b78 f821ff91 3c800000
60840000 788407c6 64840000 60840000 <e8440008> 48126375 60000000 7c6803a6
---[ end trace bb43efc994aed790 ]---

function_graph traces are recorded and can be retrieved using
/sys/kernel/debug/tracing/trace.

1)   3.936 us    |                        }
1)               |                        .release_console_sem() {
1)   0.594 us    |                          ._spin_lock_irqsave();
1)   0.560 us    |                          ._call_console_drivers();
1)   0.580 us    |                          ._call_console_drivers();
1)   0.582 us    |                          ._spin_lock_irqsave();
1)               |                          .up() {
1)   0.592 us    |                            ._spin_lock_irqsave();
1)   0.556 us    |                            ._spin_unlock_irqrestore();
1)   2.842 us    |                          }
1)   0.588 us    |                          ._spin_unlock_irqrestore();
1)   9.750 us    |                        }
1) + 75.274 us   |                      }
1)               |                      .die() {
1)               |                        .oops_enter() {

Thanks
-Sachin

-- 

---------------------------------
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
---------------------------------

^ permalink raw reply

* Re: Question about e300 core decrementer interrupt
From: Kenneth Johansson @ 2009-09-09 11:16 UTC (permalink / raw)
  To: Li Tao-B22598; +Cc: linuxppc-dev
In-Reply-To: <FF7429C6AD6EFB489C0A5A2021CDFEFA7C30E6@zmy16exm21.fsl.freescale.net>

On Tue, 2009-09-08 at 13:48 +0800, Li Tao-B22598 wrote:
> Dear all,
> 
> I have a problem with MPC5121 sleep mode. As you know, the MPC5121 uses
> the e300c4 core. When I put the e300c4 core into sleep mode, it returns
> to full power mode when the “decrementer interrupt” occurs.
> 
> But the e300 core reference manual says that the “decrementer
> interrupt” has no effect when the e300 core is in sleep mode, because
> the time base and decrementer are disabled while the core is in sleep
> mode. Can anybody explain this behavior?


Please talk to people internal to Freescale. There is an erratum on this
that has been known for a long time (more than a year now) but for some
reason has never been entered into the errata document.

I'm a bit irritated that it hasn't been, as the "solution" can mean
hardware changes and is thus potentially expensive.

^ permalink raw reply

* [PATCH] powerpc: Fix bug where perf_counters breaks oprofile
From: Paul Mackerras @ 2009-09-09 11:26 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev, Maynard Johnson

Currently there is a bug where if you use oprofile on a pSeries
machine, then use perf_counters, then use oprofile again, oprofile
will not work correctly; it will lose the PMU configuration the next
time the hypervisor does a partition context switch, and thereafter
won't count anything.

Maynard Johnson identified the sequence causing the problem:
- oprofile setup calls ppc_enable_pmcs(), which calls
  pseries_lpar_enable_pmcs, which tells the hypervisor that we want
  to use the PMU, and sets the "PMU in use" flag in the lppaca.
  This flag tells the hypervisor whether it needs to save and restore
  the PMU config.
- The perf_counter code sets and clears the "PMU in use" flag directly
  as it context-switches the PMU between tasks, and leaves it clear
  when it finishes.
- oprofile setup, called for a new oprofile run, calls ppc_enable_pmcs,
  which does nothing because it has already been called.  In particular
  it doesn't set the "PMU in use" flag.

This fixes the problem by arranging for ppc_enable_pmcs to always set
the "PMU in use" flag.  It makes the perf_counter code call
ppc_enable_pmcs also rather than calling the lower-level function
directly, and removes the setting of the "PMU in use" flag from
pseries_lpar_enable_pmcs, since that is now done in its caller.

This also removes the declaration of pasemi_enable_pmcs because it
isn't defined anywhere.
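
For readers following the sequence above, here is a minimal user-space
sketch of the flag behaviour, not kernel code. All sim_* names are
hypothetical stand-ins invented for illustration, and the model is
simplified: it collapses the per-CPU "enabled once" state and the lppaca
"PMU in use" flag into one struct, and treats a whole perf_counter session
as a single set-then-clear of the flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state involved; not a real kernel structure. */
struct sim_cpu {
	bool pmcs_enabled;	/* models the "only need to enable them once" flag */
	int pmcregs_in_use;	/* models the "PMU in use" flag in the lppaca */
};

/* Models the pre-patch platform hook, which also set the in-use flag. */
static void sim_platform_enable_old(struct sim_cpu *cpu)
{
	cpu->pmcregs_in_use = 1;
}

/* Pre-patch behaviour: the flag is only set on the very first call. */
static void sim_enable_pmcs_old(struct sim_cpu *cpu)
{
	if (cpu->pmcs_enabled)
		return;
	cpu->pmcs_enabled = true;
	sim_platform_enable_old(cpu);
}

/* Patched behaviour: the flag is set on every call, the rest only once. */
static void sim_enable_pmcs_new(struct sim_cpu *cpu)
{
	cpu->pmcregs_in_use = 1;
	if (cpu->pmcs_enabled)
		return;
	cpu->pmcs_enabled = true;
	/* the platform hook no longer touches the flag in this model */
}

/* Models a perf_counter session: flag set while in use, left clear at the end. */
static void sim_perf_session(struct sim_cpu *cpu)
{
	cpu->pmcregs_in_use = 1;
	cpu->pmcregs_in_use = 0;
}

/* Replays oprofile, then perf_counters, then oprofile; returns the final flag. */
static int sim_final_flag(void (*enable)(struct sim_cpu *))
{
	struct sim_cpu cpu = { false, 0 };

	enable(&cpu);		/* first oprofile setup */
	sim_perf_session(&cpu);	/* perf_counters run in between */
	enable(&cpu);		/* second oprofile setup */
	return cpu.pmcregs_in_use;
}
```

In this model, sim_final_flag(sim_enable_pmcs_old) leaves the flag clear
after the second oprofile setup, which corresponds to the hypervisor no
longer saving and restoring the PMU config, while
sim_final_flag(sim_enable_pmcs_new) leaves it set.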

Reported-by: Maynard Johnson <mpjohn@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/pmc.h         |   16 ++++++++++++++--
 arch/powerpc/kernel/perf_counter.c     |   13 +++----------
 arch/powerpc/kernel/sysfs.c            |    3 +++
 arch/powerpc/platforms/pseries/setup.c |    4 ----
 4 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/pmc.h b/arch/powerpc/include/asm/pmc.h
index d6a616a..ccc68b5 100644
--- a/arch/powerpc/include/asm/pmc.h
+++ b/arch/powerpc/include/asm/pmc.h
@@ -27,10 +27,22 @@ extern perf_irq_t perf_irq;
 
 int reserve_pmc_hardware(perf_irq_t new_perf_irq);
 void release_pmc_hardware(void);
+void ppc_enable_pmcs(void);
 
 #ifdef CONFIG_PPC64
-void power4_enable_pmcs(void);
-void pasemi_enable_pmcs(void);
+#include <asm/lppaca.h>
+
+static inline void ppc_set_pmu_inuse(int inuse)
+{
+	get_lppaca()->pmcregs_in_use = inuse;
+}
+
+extern void power4_enable_pmcs(void);
+
+#else /* CONFIG_PPC64 */
+
+static inline void ppc_set_pmu_inuse(int inuse) { }
+
 #endif
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/kernel/perf_counter.c b/arch/powerpc/kernel/perf_counter.c
index 70e1f57..ccd6b21 100644
--- a/arch/powerpc/kernel/perf_counter.c
+++ b/arch/powerpc/kernel/perf_counter.c
@@ -62,7 +62,6 @@ static inline unsigned long perf_ip_adjust(struct pt_regs *regs)
 {
 	return 0;
 }
-static inline void perf_set_pmu_inuse(int inuse) { }
 static inline void perf_get_data_addr(struct pt_regs *regs, u64 *addrp) { }
 static inline u32 perf_get_misc_flags(struct pt_regs *regs)
 {
@@ -93,11 +92,6 @@ static inline unsigned long perf_ip_adjust(struct pt_regs *regs)
 	return 0;
 }
 
-static inline void perf_set_pmu_inuse(int inuse)
-{
-	get_lppaca()->pmcregs_in_use = inuse;
-}
-
 /*
  * The user wants a data address recorded.
  * If we're not doing instruction sampling, give them the SDAR
@@ -531,8 +525,7 @@ void hw_perf_disable(void)
 		 * Check if we ever enabled the PMU on this cpu.
 		 */
 		if (!cpuhw->pmcs_enabled) {
-			if (ppc_md.enable_pmcs)
-				ppc_md.enable_pmcs();
+			ppc_enable_pmcs();
 			cpuhw->pmcs_enabled = 1;
 		}
 
@@ -594,7 +587,7 @@ void hw_perf_enable(void)
 		mtspr(SPRN_MMCRA, cpuhw->mmcr[2] & ~MMCRA_SAMPLE_ENABLE);
 		mtspr(SPRN_MMCR1, cpuhw->mmcr[1]);
 		if (cpuhw->n_counters == 0)
-			perf_set_pmu_inuse(0);
+			ppc_set_pmu_inuse(0);
 		goto out_enable;
 	}
 
@@ -627,7 +620,7 @@ void hw_perf_enable(void)
 	 * bit set and set the hardware counters to their initial values.
 	 * Then unfreeze the counters.
 	 */
-	perf_set_pmu_inuse(1);
+	ppc_set_pmu_inuse(1);
 	mtspr(SPRN_MMCRA, cpuhw->mmcr[2] & ~MMCRA_SAMPLE_ENABLE);
 	mtspr(SPRN_MMCR1, cpuhw->mmcr[1]);
 	mtspr(SPRN_MMCR0, (cpuhw->mmcr[0] & ~(MMCR0_PMC1CE | MMCR0_PMCjCE))
diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index f41aec8..956ab33 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -17,6 +17,7 @@
 #include <asm/prom.h>
 #include <asm/machdep.h>
 #include <asm/smp.h>
+#include <asm/pmc.h>
 
 #include "cacheinfo.h"
 
@@ -123,6 +124,8 @@ static DEFINE_PER_CPU(char, pmcs_enabled);
 
 void ppc_enable_pmcs(void)
 {
+	ppc_set_pmu_inuse(1);
+
 	/* Only need to enable them once */
 	if (__get_cpu_var(pmcs_enabled))
 		return;
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index 8d75ea2..ca5f2e1 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -223,10 +223,6 @@ static void pseries_lpar_enable_pmcs(void)
 	set = 1UL << 63;
 	reset = 0;
 	plpar_hcall_norets(H_PERFMON, set, reset);
-
-	/* instruct hypervisor to maintain PMCs */
-	if (firmware_has_feature(FW_FEATURE_SPLPAR))
-		get_lppaca()->pmcregs_in_use = 1;
 }
 
 static void __init pseries_discover_pic(void)
-- 
1.6.0.4

^ permalink raw reply related

* RE: Queries regarding I2C and GPIO driver for Freescale MPC5121e in Linux2.6.24 of BSP: MPC512xADS_20090603-ltib.iso
From: Uma Kanta Patro @ 2009-09-09 13:13 UTC (permalink / raw)
  To: 'Chen Hongjun-R66092', linuxppc-dev
In-Reply-To: <3A45394FD742FA419B760BB8D398F9ED64C5B2@zch01exm26.fsl.freescale.net>

[-- Attachment #1: Type: text/plain, Size: 4529 bytes --]

Hi Chen Hongjun-R66092,

Thanks for your response.

For the GPIO driver I am having some success, and work on it is in progress.
For the I2C chip (client) driver, however, I am making no progress.

I followed the existing driver $(LINUX)\drivers\rtc\rtc-m41t80.c (for the
M41T62 RTC present on the ADS5121Rev4.1 board) and wrote a legacy-style
driver with attach_adapter and detach_client functions defined.

For testing purposes I gave the chip address as 0x68 (the address of the
M41T62 on the board). But when I tried to insert my driver I got the
following messages:

[root@freescale chips]# insmod dis_fpc.ko
[  177.808848] i2c 0-0068: uevent
[  498.528032] In dis_fpc_init
[  498.531851] i2c-core: driver [dis_fpc] registered
[  498.532446] In dis_fpc_attach_adapter
[  498.536827] i2c-adapter i2c-0: found normal entry for adapter 0, addr 0x55
[  498.537730] i2c-adapter i2c-0: master_xfer[0] W, addr=0x55, len=0
[  498.538533] Doing write 0 bytes to 0x55 - 1 of 1 messages
[  498.539770] I2C: No RXAK
[  498.540970] In dis_fpc_attach_adapter
[  498.554500] i2c-adapter i2c-1: found normal entry for adapter 1, addr 0x55
[  498.555166] i2c-adapter i2c-1: master_xfer[0] W, addr=0x55, len=0
[  498.555872] Doing write 0 bytes to 0x55 - 1 of 1 messages
[  498.556785] I2C: MAL
[  498.557476] In dis_fpc_attach_adapter
[  498.565733] i2c-adapter i2c-2: found normal entry for adapter 2, addr 0x55
[  498.566377] i2c-adapter i2c-2: master_xfer[0] W, addr=0x55, len=0
[  498.567082] Doing write 0 bytes to 0x55 - 1 of 1 messages
[  498.568240] I2C: No RXAK

 

Can you tell me where else I need to change the configuration (for example
the i2c_platform_data definition, linking the chip to a specific I2C module
(0/1/2) through the adapter, or configuring the speed of I2C communication)?

I would appreciate any suggestions on writing an I2C chip driver for the
powerpc platform.

Thanks & Regards,

Uma
 

From: Chen Hongjun-R66092 [mailto:hong-jun.chen@freescale.com] 
Sent: Wednesday, September 09, 2009 5:39 AM
To: Uma Kanta Patro; linuxppc-dev@lists.ozlabs.org
Subject: RE: Queries regarding I2C and GPIO driver for Freescale MPC5121e in
Linux2.6.24 of BSP: MPC512xADS_20090603-ltib.iso

 

One I2C driver is already included in the 0603 BSP; you can refer to it.

There is no specific driver for GPIO, but you can find some GPIO
initialization code in arch/powerpc/platforms/512x/mpc5125_ads.c and
mpc512x_pm_test.c.

 


  _____  


From: linuxppc-dev-bounces+hong-jun.chen=freescale.com@lists.ozlabs.org
[mailto:linuxppc-dev-bounces+hong-jun.chen=freescale.com@lists.ozlabs.org]
On Behalf Of Uma Kanta Patro
Sent: Tuesday, September 08, 2009 6:56 PM
To: linuxppc-dev@lists.ozlabs.org
Subject: Queries regarding I2C and GPIO driver for Freescale MPC5121e in
Linux2.6.24 of BSP: MPC512xADS_20090603-ltib.iso

Hi all,

I am a newbie to the powerpc Linux kernel, though I have worked on some
drivers for the ARM architecture. I am finding the powerpc architecture to
be quite different.

I am working on the Freescale MPC5121e with the BSP
MPC512xADS_20090603-ltib.iso running on the ADS512101 Rev4.1 development kit.

Can anyone help me find documentation for understanding and working on the
powerpc kernel? Any links to powerpc forums would also be appreciated.

-> Currently I am developing an I2C client driver for a slave
microcontroller in our project.

I have some knowledge of writing I2C client drivers (legacy style and new
style).

I made a basic I2C client driver that probes for the chip address, and for
testing I gave it the chip address 0x68 (the I2C address of the M41T62 RTC
present on the board).

But while inserting my driver I get a failure message for the detection of
my chip.

So I would like to know what else I am missing in my I2C chip driver.

-> I also need a GPIO driver for my controller, to get an interrupt on a
state change. When I searched the kernel code I could not find any procedure
for that, nor could I find how to access GPIO pin macros or a register to
remap with ioremap(). So please guide me to the proper way to access GPIO
and register the interrupt.

Will ioremap() work on the powerpc arch? If yes, where can I find the memory
map (register definitions) to use for my GPIO driver?

Thanks for your patience in reading my queries.

Any help is appreciated.

Thanks & Regards,

Uma

 


[-- Attachment #2: Type: text/html, Size: 12627 bytes --]

^ permalink raw reply

* pq2 pro: kgdb access to MURAM?
From: Michael Barkowski @ 2009-09-09 13:26 UTC (permalink / raw)
  To: linuxppc-dev

Just wondering how I can get kgdb to show me the contents of MURAM on the QE?

-- 
Michael Barkowski
905-482-4577

^ permalink raw reply

* Re: [PATCH] powerpc: Fix bug where perf_counters breaks oprofile
From: Maynard Johnson @ 2009-09-09 13:31 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Maynard Johnson, linuxppc-dev
In-Reply-To: <19111.37067.783017.427206@cargo.ozlabs.ibm.com>

Paul Mackerras wrote:
> Currently there is a bug where if you use oprofile on a pSeries
> machine, then use perf_counters, then use oprofile again, oprofile
> will not work correctly; it will lose the PMU configuration the next
> time the hypervisor does a partition context switch, and thereafter
> won't count anything.
> 
> Maynard Johnson identified the sequence causing the problem:
> - oprofile setup calls ppc_enable_pmcs(), which calls
>   pseries_lpar_enable_pmcs, which tells the hypervisor that we want
>   to use the PMU, and sets the "PMU in use" flag in the lppaca.
>   This flag tells the hypervisor whether it needs to save and restore
>   the PMU config.
> - The perf_counter code sets and clears the "PMU in use" flag directly
>   as it context-switches the PMU between tasks, and leaves it clear
>   when it finishes.
> - oprofile setup, called for a new oprofile run, calls ppc_enable_pmcs,
>   which does nothing because it has already been called.  In particular
>   it doesn't set the "PMU in use" flag.
> 
> This fixes the problem by arranging for ppc_enable_pmcs to always set
> the "PMU in use" flag.  It makes the perf_counter code call
> ppc_enable_pmcs also rather than calling the lower-level function
> directly, and removes the setting of the "PMU in use" flag from
> pseries_lpar_enable_pmcs, since that is now done in its caller.
> 
> This also removes the declaration of pasemi_enable_pmcs because it
> isn't defined anywhere.
Thanks, Paul.  I tested the patch, and oprofile and perf now play nicely together.

-Maynard
> 
> Reported-by: Maynard Johnson <mpjohn@us.ibm.com>
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> ---
>  arch/powerpc/include/asm/pmc.h         |   16 ++++++++++++++--
>  arch/powerpc/kernel/perf_counter.c     |   13 +++----------
>  arch/powerpc/kernel/sysfs.c            |    3 +++
>  arch/powerpc/platforms/pseries/setup.c |    4 ----
>  4 files changed, 20 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pmc.h b/arch/powerpc/include/asm/pmc.h
> index d6a616a..ccc68b5 100644
> --- a/arch/powerpc/include/asm/pmc.h
> +++ b/arch/powerpc/include/asm/pmc.h
> @@ -27,10 +27,22 @@ extern perf_irq_t perf_irq;
> 
>  int reserve_pmc_hardware(perf_irq_t new_perf_irq);
>  void release_pmc_hardware(void);
> +void ppc_enable_pmcs(void);
> 
>  #ifdef CONFIG_PPC64
> -void power4_enable_pmcs(void);
> -void pasemi_enable_pmcs(void);
> +#include <asm/lppaca.h>
> +
> +static inline void ppc_set_pmu_inuse(int inuse)
> +{
> +	get_lppaca()->pmcregs_in_use = inuse;
> +}
> +
> +extern void power4_enable_pmcs(void);
> +
> +#else /* CONFIG_PPC64 */
> +
> +static inline void ppc_set_pmu_inuse(int inuse) { }
> +
>  #endif
> 
>  #endif /* __KERNEL__ */
> diff --git a/arch/powerpc/kernel/perf_counter.c b/arch/powerpc/kernel/perf_counter.c
> index 70e1f57..ccd6b21 100644
> --- a/arch/powerpc/kernel/perf_counter.c
> +++ b/arch/powerpc/kernel/perf_counter.c
> @@ -62,7 +62,6 @@ static inline unsigned long perf_ip_adjust(struct pt_regs *regs)
>  {
>  	return 0;
>  }
> -static inline void perf_set_pmu_inuse(int inuse) { }
>  static inline void perf_get_data_addr(struct pt_regs *regs, u64 *addrp) { }
>  static inline u32 perf_get_misc_flags(struct pt_regs *regs)
>  {
> @@ -93,11 +92,6 @@ static inline unsigned long perf_ip_adjust(struct pt_regs *regs)
>  	return 0;
>  }
> 
> -static inline void perf_set_pmu_inuse(int inuse)
> -{
> -	get_lppaca()->pmcregs_in_use = inuse;
> -}
> -
>  /*
>   * The user wants a data address recorded.
>   * If we're not doing instruction sampling, give them the SDAR
> @@ -531,8 +525,7 @@ void hw_perf_disable(void)
>  		 * Check if we ever enabled the PMU on this cpu.
>  		 */
>  		if (!cpuhw->pmcs_enabled) {
> -			if (ppc_md.enable_pmcs)
> -				ppc_md.enable_pmcs();
> +			ppc_enable_pmcs();
>  			cpuhw->pmcs_enabled = 1;
>  		}
> 
> @@ -594,7 +587,7 @@ void hw_perf_enable(void)
>  		mtspr(SPRN_MMCRA, cpuhw->mmcr[2] & ~MMCRA_SAMPLE_ENABLE);
>  		mtspr(SPRN_MMCR1, cpuhw->mmcr[1]);
>  		if (cpuhw->n_counters == 0)
> -			perf_set_pmu_inuse(0);
> +			ppc_set_pmu_inuse(0);
>  		goto out_enable;
>  	}
> 
> @@ -627,7 +620,7 @@ void hw_perf_enable(void)
>  	 * bit set and set the hardware counters to their initial values.
>  	 * Then unfreeze the counters.
>  	 */
> -	perf_set_pmu_inuse(1);
> +	ppc_set_pmu_inuse(1);
>  	mtspr(SPRN_MMCRA, cpuhw->mmcr[2] & ~MMCRA_SAMPLE_ENABLE);
>  	mtspr(SPRN_MMCR1, cpuhw->mmcr[1]);
>  	mtspr(SPRN_MMCR0, (cpuhw->mmcr[0] & ~(MMCR0_PMC1CE | MMCR0_PMCjCE))
> diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
> index f41aec8..956ab33 100644
> --- a/arch/powerpc/kernel/sysfs.c
> +++ b/arch/powerpc/kernel/sysfs.c
> @@ -17,6 +17,7 @@
>  #include <asm/prom.h>
>  #include <asm/machdep.h>
>  #include <asm/smp.h>
> +#include <asm/pmc.h>
> 
>  #include "cacheinfo.h"
> 
> @@ -123,6 +124,8 @@ static DEFINE_PER_CPU(char, pmcs_enabled);
> 
>  void ppc_enable_pmcs(void)
>  {
> +	ppc_set_pmu_inuse(1);
> +
>  	/* Only need to enable them once */
>  	if (__get_cpu_var(pmcs_enabled))
>  		return;
> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
> index 8d75ea2..ca5f2e1 100644
> --- a/arch/powerpc/platforms/pseries/setup.c
> +++ b/arch/powerpc/platforms/pseries/setup.c
> @@ -223,10 +223,6 @@ static void pseries_lpar_enable_pmcs(void)
>  	set = 1UL << 63;
>  	reset = 0;
>  	plpar_hcall_norets(H_PERFMON, set, reset);
> -
> -	/* instruct hypervisor to maintain PMCs */
> -	if (firmware_has_feature(FW_FEATURE_SPLPAR))
> -		get_lppaca()->pmcregs_in_use = 1;
>  }
> 
>  static void __init pseries_discover_pic(void)
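
The sequence in the commit message above can be modelled in user space.
This is only a sketch, not kernel code: pmu_in_use stands in for the
"PMU in use" flag in the lppaca, pmcs_enabled for the run-once guard, and
enable_pmcs_old() / enable_pmcs_new() are hypothetical stand-ins for
ppc_enable_pmcs() before and after the patch. The platform hook is reduced
to a counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for lppaca->pmcregs_in_use; a plain variable in this model. */
static int pmu_in_use;
/* Stand-in for the per-CPU "pmcs_enabled" guard. */
static bool pmcs_enabled;
/* Counts how often the (modelled) platform enable hook ran. */
static int platform_enable_calls;

static void model_reset(void)
{
	pmu_in_use = 0;
	pmcs_enabled = false;
	platform_enable_calls = 0;
}

/* Old behaviour: the flag is set only inside the run-once platform hook. */
static void enable_pmcs_old(void)
{
	if (pmcs_enabled)
		return;
	pmcs_enabled = true;
	platform_enable_calls++;
	pmu_in_use = 1;
}

/* Patched behaviour: the flag is set on every call, before the guard. */
static void enable_pmcs_new(void)
{
	pmu_in_use = 1;
	if (pmcs_enabled)
		return;
	pmcs_enabled = true;
	platform_enable_calls++;
}

/* perf_counter finishing: it leaves the flag clear. */
static void perf_release(void)
{
	pmu_in_use = 0;
}

/* Runs oprofile, perf, oprofile; returns the flag seen by the last run. */
static int run_sequence(void (*enable)(void))
{
	model_reset();
	enable();        /* first oprofile run */
	perf_release();  /* perf_counter used, then finished */
	enable();        /* second oprofile run */
	assert(platform_enable_calls == 1);
	return pmu_in_use;
}
```

With the old ordering the second oprofile run ends with the flag clear,
which is the state in which the hypervisor stops saving and restoring the
PMU config. With the flag set ahead of the guard, it ends set.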


* Re: AW: PowerPC PCI DMA issues (prefetch/coherency?)
From: Mikhail Zolotaryov @ 2009-09-09 13:28 UTC (permalink / raw)
  To: Prodyut Hazarika; +Cc: Tom Burns, Andrea Zypchen, linuxppc-dev, azilkie
In-Reply-To: <0CA0A16855646F4FA96D25A158E299D606F60B70@SDCEXCHANGE01.ad.amcc.com>

Hi,

Why manage cache lines manually when the appropriate code is already part
of the DMA API (__dma_sync / dma_sync_single_for_device)? This assumes
CONFIG_NOT_COHERENT_CACHE is enabled, which is the default for the
Sequoia board.
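
To picture what a sync call does on a core whose L1 cache is not coherent
with DMA, here is a small user-space model. It is a sketch, not kernel
code: the model_* helpers and count_stale() are hypothetical names, and
the 32-byte LINE_SIZE is an assumption made for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define LINE_SIZE 32              /* assumed cache line size for the model */
#define BUF_SIZE  256
#define NUM_LINES (BUF_SIZE / LINE_SIZE)

static unsigned char ram[BUF_SIZE];     /* what the device reads and writes */
static unsigned char cache[BUF_SIZE];   /* the CPU's cached copies */
static bool line_valid[NUM_LINES];      /* per-line valid bit */

/* Device writes straight to RAM; the cache is not told about it. */
static void model_dma_write(size_t off, const void *src, size_t len)
{
	assert(off + len <= BUF_SIZE);
	memcpy(ram + off, src, len);
}

/* CPU read: a valid line hits in the cache, an invalid one is refilled. */
static unsigned char model_cpu_read(size_t off)
{
	size_t line = off / LINE_SIZE;

	if (!line_valid[line]) {
		memcpy(cache + line * LINE_SIZE, ram + line * LINE_SIZE,
		       LINE_SIZE);
		line_valid[line] = true;
	}
	return cache[off];
}

/* Drop every line overlapping [off, off + len), as a sync-for-cpu would. */
static void model_invalidate(size_t off, size_t len)
{
	size_t line;

	if (len == 0)
		return;
	for (line = off / LINE_SIZE; line <= (off + len - 1) / LINE_SIZE; line++)
		line_valid[line] = false;
}

/* Counts bytes of a freshly DMA'd block that the CPU still reads stale. */
static size_t count_stale(bool sync_before_read)
{
	unsigned char fresh[192];
	size_t i, stale = 0;

	memset(ram, 0xAA, sizeof(ram));
	memset(line_valid, 0, sizeof(line_valid));
	(void)model_cpu_read(40);   /* pulls old data of line 1 into the cache */

	memset(fresh, 0x55, sizeof(fresh));
	model_dma_write(0, fresh, sizeof(fresh));
	if (sync_before_read)
		model_invalidate(0, sizeof(fresh));

	for (i = 0; i < sizeof(fresh); i++)
		if (model_cpu_read(i) != 0x55)
			stale++;
	return stale;
}
```

Without the invalidate step the model returns stale data in whole-line
units, which resembles the symptom reported in this thread. In a driver
the equivalent step would come from the DMA API's sync or unmap calls
rather than from hand-written cache code.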

Prodyut Hazarika wrote:
> Hi Adam,
>
>> Yes, I am using the 440EPx (same as the sequoia board).
>> Our ideDriver is DMA'ing blocks of 192-byte data over the PCI bus
>> (using the Sil0680A PCI-IDE bridge). Most of the DMA's (depending on
>> timing) end up being partially corrupted when we try to parse the data
>> in the virtual page. We have confirmed the data is good before the
>> PCI-IDE bridge. We are creating two 8K pages and map them to physical
>> DMA memory using single-entry scatter/gather structs. When a DMA block
>> is corrupted, we see a random portion of it (always a multiple of
>> 16byte cache lines) is overwritten with old data from the last time
>> the buffer was used.
>
> This looks like a cache coherency problem.
> Can you ensure that the TLB entries corresponding to the DMA region has
> the CacheInhibit bit set.
> You will need a BDI connected to your system.
>
> Also, you will need to invalidate and flush the lines appropriately,
> since in 440 cores, L1Cache coherency is managed entirely by software.
> Please look at drivers/net/ibm_newemac/mal.c and core.c for example on
> how to do it.
>
> Thanks
> Prodyut
>
> On Thu, 2009-09-03 at 13:27 -0700, Prodyut Hazarika wrote:
>> Hi Adam,
>>
>>> Are you sure there is L2 cache on the 440?
>>
>> It depends on the SoC you are using. SoC like 460EX (Canyonlands
>> board) have L2Cache.
>> It seems you are using a Sequoia board, which has a 440EPx SoC. 440EPx
>> has a 440 cpu core, but no L2Cache.
>> Could you please tell me which SoC you are using?
>> You can also refer to the appropriate dts file to see if there is L2C.
>> For example, in canyonlands.dts (460EX based board), we have the L2C
>> entry.
>>         L2C0: l2c {
>>               ...
>>         }
>>
>>> I am seeing this problem with our custom IDE driver which is based on
>>> pretty old code. Our driver uses pci_alloc_consistent() to allocate
>>> the physical DMA memory and alloc_pages() to allocate a virtual page.
>>> It then uses pci_map_sg() to map to a scatter/gather buffer. Perhaps
>>> I should convert these to the DMA API calls as you suggest.
>>
>> Could you give more details on the consistency problem? It is a good
>> idea to change to the new DMA APIs, but pci_alloc_consistent() should
>> work too
>>
>> Thanks
>> Prodyut
>>
>> On Thu, 2009-09-03 at 19:57 +1000, Benjamin Herrenschmidt wrote:
>>> On Thu, 2009-09-03 at 09:05 +0100, Chris Pringle wrote:
>>>> Hi Adam,
>>>>
>>>> If you have a look in include/asm-ppc/pgtable.h for the following
>>>> section:
>>>>
>>>> #ifdef CONFIG_44x
>>>> #define _PAGE_BASE    (_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_GUARDED)
>>>> #else
>>>> #define _PAGE_BASE    (_PAGE_PRESENT | _PAGE_ACCESSED)
>>>> #endif
>>>>
>>>> Try adding _PAGE_COHERENT to the appropriate line above and see if
>>>> that fixes your issue - this causes the 'M' bit to be set on the page
>>>> which sure enforce cache coherency. If it doesn't, you'll need to
>>>> check the 'M' bit isn't being masked out in head_44x.S (it was
>>>> originally masked out on arch/powerpc, but was fixed in later kernels
>>>> when the cache coherency issues with non-SMP systems were resolved).
>>>
>>> I have some doubts about the usefulness of doing that for 4xx. AFAIK,
>>> the 440 core just ignores M.
>>>
>>> The problem lies probably elsewhere. Maybe the L2 cache coherency
>>> isn't enabled or not working ?
>>>
>>> The L1 cache on 440 is simply not coherent, so drivers have to make
>>> sure they use the appropriate DMA APIs which will do cache flushing
>>> when needed.
>>>
>>> Adam, what driver is causing you that sort of problems ?
>>>
>>> Cheers,
>>> Ben.


* Bug in of_mpc8xxx_spi chipselect
From: Michael Barkowski @ 2009-09-09 13:42 UTC (permalink / raw)
  To: linuxppc-dev

Just want to document this bug, since I don't have time to make a patch:

In of_mpc8xxx_spi_get_chipselects():

	pinfo->alow_flags[i] = flags & OF_GPIO_ACTIVE_LOW;

	ret = gpio_direction_output(pinfo->gpios[i],
				    pinfo->alow_flags[i]);


The chip select should initially be disabled.  With SPI_CS_HIGH, a value
of 0 means disabled, which is fine, but...

Without SPI_CS_HIGH for a given device (which is the case for most
devices), that device stays enabled until it is disabled at the end of
the first transaction to that device.  If there are transactions to other
devices on the same bus in the meantime, this device may be confused and
fail its first transaction.

Maybe the chip select should stay disabled until the device entry is
initialized with full knowledge of its configuration.  Not sure of the
right solution.
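
The mismatch can be written down as a truth table in plain C. This is a
sketch under assumptions, not the driver's code: all function names are
hypothetical, the pin level is taken to be the raw electrical level, and a
device counts as selected when the pin is at that device's active level.

```c
#include <assert.h>
#include <stdbool.h>

/* Initial pin level as in the quoted snippet: the active-low flag itself. */
static int initial_level_from_flag(bool gpio_active_low)
{
	return gpio_active_low ? 1 : 0;
}

/* A device is selected when the pin is at that device's active level. */
static bool device_selected(int pin_level, bool cs_active_high)
{
	return pin_level == (cs_active_high ? 1 : 0);
}

/* Level that keeps the device deselected, given its real polarity. */
static int deselect_level(bool cs_active_high)
{
	return cs_active_high ? 0 : 1;
}

/* Drives a modelled pin to the inactive level and checks the result. */
static void drive_cs_inactive(int *pin, bool cs_active_high)
{
	*pin = deselect_level(cs_active_high);
	assert(!device_selected(*pin, cs_active_high));
}
```

In this model, a flag of 0 combined with an active-low device leaves the
device selected after initialization, which is the case described above.
Choosing the initial level from the device's polarity avoids it, but that
polarity is only known once the device entry has been set up.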

-- 
Michael Barkowski
905-482-4577


* Re: AW: PowerPC PCI DMA issues (prefetch/coherency?)
From: Tom Burns @ 2009-09-09 13:43 UTC (permalink / raw)
  To: lebon; +Cc: Prodyut Hazarika, Andrea Zypchen, linuxppc-dev, azilkie
In-Reply-To: <4AA7AD65.7070403@lebon.org.ua>

Hi,

With the default config for the Sequoia board on 2.6.24, calling
pci_dma_sync_sg_for_cpu() ends up executing invalidate_dcache_range()
in arch/ppc/kernel/misc.S, called from __dma_sync().  This oopses on the
PPC440 because it directly executes the dcbi instruction, which can only
be executed in supervisor mode.  We tried that before resorting to
manual cache line management with usermode-safe assembly calls.
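
Manual cache line management of this kind comes down to walking a buffer
one cache line at a time. The sketch below shows only the address
arithmetic, in plain user-space C. The cache instruction itself is left
out, and for_each_cache_line(), the callback type and the 32-byte line
size used in the example are assumptions, not code from the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*line_op_t)(uintptr_t line_addr, void *ctx);

/*
 * Calls op() once for each cache line overlapping [start, start + len).
 * line_size must be a power of two. Returns the number of lines visited.
 */
static size_t for_each_cache_line(uintptr_t start, size_t len,
				  size_t line_size, line_op_t op, void *ctx)
{
	uintptr_t mask = (uintptr_t)line_size - 1;
	uintptr_t addr, end;
	size_t n = 0;

	assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
	if (len == 0)
		return 0;

	end = start + len;                 /* one past the last byte */
	for (addr = start & ~mask; addr < end; addr += line_size) {
		if (op)
			op(addr, ctx);     /* a flush/invalidate would go here */
		n++;
	}
	return n;
}

/* Example callback: remembers the last line address it was given. */
static void record_last(uintptr_t line_addr, void *ctx)
{
	*(uintptr_t *)ctx = line_addr;
}
```

A 192-byte block that starts 4 bytes into a line touches seven 32-byte
lines, not six. The first and last lines are then shared with neighbouring
data, which is one reason DMA buffers are usually aligned to the cache
line size.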

Regards,
Tom Burns
International Datacasting Corporation

Mikhail Zolotaryov wrote:
> Hi,
>
> Why manage cache lines  manually, if appropriate code is a part of 
> __dma_sync / dma_sync_single_for_device of DMA API ? (implies 
> CONFIG_NOT_COHERENT_CACHE enabled, as default for Sequoia Board)
>
> Prodyut Hazarika wrote:
>> Hi Adam,
>>
>>> Yes, I am using the 440EPx (same as the sequoia board). Our
>>> ideDriver is DMA'ing blocks of 192-byte data over the PCI bus
>>> (using the Sil0680A PCI-IDE bridge). Most of the DMA's (depending on
>>> timing) end up being partially corrupted when we try to parse the
>>> data in the virtual page. We have confirmed the data is good before
>>> the PCI-IDE bridge. We are creating two 8K pages and map them to
>>> physical DMA memory using single-entry scatter/gather structs. When a
>>> DMA block is corrupted, we see a random portion of it (always a
>>> multiple of 16byte cache lines) is overwritten with old data from the
>>> last time the buffer was used.
>>
>> This looks like a cache coherency problem.
>> Can you ensure that the TLB entries corresponding to the DMA region has
>> the CacheInhibit bit set.
>> You will need a BDI connected to your system.
>>
>> Also, you will need to invalidate and flush the lines appropriately,
>> since in 440 cores, L1Cache coherency is managed entirely by software.
>> Please look at drivers/net/ibm_newemac/mal.c and core.c for example on
>> how to do it.
>>
>> Thanks
>> Prodyut
>>
>> On Thu, 2009-09-03 at 13:27 -0700, Prodyut Hazarika wrote:
>>> Hi Adam,
>>>
>>>> Are you sure there is L2 cache on the 440?
>>>
>>> It depends on the SoC you are using. SoC like 460EX (Canyonlands
>>> board) have L2Cache.
>>> It seems you are using a Sequoia board, which has a 440EPx SoC. 440EPx
>>> has a 440 cpu core, but no L2Cache.
>>> Could you please tell me which SoC you are using?
>>> You can also refer to the appropriate dts file to see if there is L2C.
>>> For example, in canyonlands.dts (460EX based board), we have the L2C
>>> entry.
>>>         L2C0: l2c {
>>>               ...
>>>         }
>>>
>>>> I am seeing this problem with our custom IDE driver which is based on
>>>> pretty old code. Our driver uses pci_alloc_consistent() to allocate
>>>> the physical DMA memory and alloc_pages() to allocate a virtual page.
>>>> It then uses pci_map_sg() to map to a scatter/gather buffer.
>>>> Perhaps I should convert these to the DMA API calls as you suggest.
>>>
>>> Could you give more details on the consistency problem? It is a good
>>> idea to change to the new DMA APIs, but pci_alloc_consistent() should
>>> work too
>>>
>>> Thanks
>>> Prodyut
>>>
>>> On Thu, 2009-09-03 at 19:57 +1000, Benjamin Herrenschmidt wrote:
>>>> On Thu, 2009-09-03 at 09:05 +0100, Chris Pringle wrote:
>>>>> Hi Adam,
>>>>>
>>>>> If you have a look in include/asm-ppc/pgtable.h for the following
>>>>> section:
>>>>>
>>>>> #ifdef CONFIG_44x
>>>>> #define _PAGE_BASE    (_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_GUARDED)
>>>>> #else
>>>>> #define _PAGE_BASE    (_PAGE_PRESENT | _PAGE_ACCESSED)
>>>>> #endif
>>>>>
>>>>> Try adding _PAGE_COHERENT to the appropriate line above and see if
>>>>> that fixes your issue - this causes the 'M' bit to be set on the
>>>>> page which sure enforce cache coherency. If it doesn't, you'll need
>>>>> to check the 'M' bit isn't being masked out in head_44x.S (it was
>>>>> originally masked out on arch/powerpc, but was fixed in later
>>>>> kernels when the cache coherency issues with non-SMP systems were
>>>>> resolved).
>>>>
>>>> I have some doubts about the usefulness of doing that for 4xx. AFAIK,
>>>> the 440 core just ignores M.
>>>>
>>>> The problem lies probably elsewhere. Maybe the L2 cache coherency
>>>> isn't enabled or not working ?
>>>>
>>>> The L1 cache on 440 is simply not coherent, so drivers have to make
>>>> sure they use the appropriate DMA APIs which will do cache flushing
>>>> when needed.
>>>>
>>>> Adam, what driver is causing you that sort of problems ?
>>>>
>>>> Cheers,
>>>> Ben.


