LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: Accessing flash directly from User Space [SOLVED]
From: Michael Buesch @ 2009-10-30 14:56 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Scott Wood, Bill Gatliff, Jonathan Haws
In-Reply-To: <BB99A6BA28709744BF22A68E6D7EB51F0330E230FC@midas.usurf.usu.edu>

On Friday 30 October 2009 15:50:22 Jonathan Haws wrote:
> > I suspect that the msync() was merely serving as a very heavyweight
> > memory barrier.
> 
> I did try hacking the mb() calls from the kernel source to use them in user space, but they had no effect.  I still had to include the calls to msync().

What did the resulting mb() that you used look like?

-- 
Greetings, Michael.

^ permalink raw reply

* RE: Accessing flash directly from User Space [SOLVED]
From: Jonathan Haws @ 2009-10-30 14:50 UTC (permalink / raw)
  To: Scott Wood, Joakim Tjernlund; +Cc: Bill Gatliff, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <20091029233038.GA21587@b07421-ec1.am.freescale.net>

> On Thu, Oct 29, 2009 at 10:00:28AM +0100, Joakim Tjernlund wrote:
> > > I have found the problem.  It occurred to me in the shower (okay
> not really,
> > > but most good ideas happen there).
> > >
> > > What was happening is that I was in fact able to write to the
> correct
> > > registers.  However, I would try and write to them in a batch.
> But the way
> > > mmap works (at least according to the man page) with MAP_SHARED
> is that the
> > > file may not be updated until msync() is called.  Now, I thought
> that O_SYNC
> > > would take care of that when I open /dev/mem, but that was not
> the case.
> > >
> > > Anyway, to make a long story short, I inserted an msync() after
> each
> > > assignment to the flash.  This resolved my problem and I can now
> program my flash.
> >
> > Ouch, this was news to me too. Calling msync() after every write
> kills performance.
> > We use mmap(/dev/mem) to access HW and havn't seen any issues yet.
> Is this
> > perhaps a new behaviour for mmap(/dev/mem) and is there a way
> > to avoid calling msync()?
>=20
> I suspect that the msync() was merely serving as a very heavyweight
> memory barrier.

I did try hacking the mb() calls from the kernel source to use them in user=
 space, but they had no effect.  I still had to include the calls to msync(=
).

^ permalink raw reply

* Re: mpc512x/clock: fix clk_get logic
From: Mark Brown @ 2009-10-30 12:02 UTC (permalink / raw)
  To: Wolfram Sang; +Cc: linuxppc-dev, Wolfgang Denk
In-Reply-To: <20091030113644.GE12068@pengutronix.de>

On Fri, Oct 30, 2009 at 12:36:44PM +0100, Wolfram Sang wrote:
> On Fri, Oct 30, 2009 at 10:54:14AM +0000, Mark Brown wrote:

> > > - require the id field (as _this_ is the unique identifier)

> > NULL id fields are supposed to be supported in the cannonical clkdev
> > API, unfortunately.

> Hmm, ok, thanks for the hint. Where is this documented, I can't find it?

I'm not aware of any particular documentation on it but searching for
mailing list posts from rmk ought to turn up some posts on the issue.

^ permalink raw reply

* Re: mpc512x/clock: fix clk_get logic
From: Wolfram Sang @ 2009-10-30 11:36 UTC (permalink / raw)
  To: Mark Brown; +Cc: linuxppc-dev, Wolfgang Denk
In-Reply-To: <20091030105414.GD717@sirena.org.uk>

[-- Attachment #1: Type: text/plain, Size: 813 bytes --]

On Fri, Oct 30, 2009 at 10:54:14AM +0000, Mark Brown wrote:
> On Fri, Oct 30, 2009 at 10:17:14AM +0100, Wolfram Sang wrote:
> > The matching logic returns a clock even if only the dev-part matches. This is
> > wrong as devices may utilize more than one clock, so the wrong clock may be
> > returned due to dev being not unique (noticed while working on the CAN driver).
> > The proposed new method will:
> 
> > - require the id field (as _this_ is the unique identifier)
> 
> NULL id fields are supposed to be supported in the cannonical clkdev
> API, unfortunately.

Hmm, ok, thanks for the hint. Where is this documented, I can't find it?

-- 
Pengutronix e.K.                           | Wolfram Sang                |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply

* Re: mpc512x/clock: fix clk_get logic
From: Mark Brown @ 2009-10-30 10:54 UTC (permalink / raw)
  To: Wolfram Sang; +Cc: linuxppc-dev, Wolfgang Denk
In-Reply-To: <1256894234-11264-1-git-send-email-w.sang@pengutronix.de>

On Fri, Oct 30, 2009 at 10:17:14AM +0100, Wolfram Sang wrote:
> The matching logic returns a clock even if only the dev-part matches. This is
> wrong as devices may utilize more than one clock, so the wrong clock may be
> returned due to dev being not unique (noticed while working on the CAN driver).
> The proposed new method will:

> - require the id field (as _this_ is the unique identifier)

NULL id fields are supposed to be supported in the cannonical clkdev
API, unfortunately.

^ permalink raw reply

* R_PPC_PLTREL24 reloc against local symbol
From: Pu Yiqiao @ 2009-10-30  9:35 UTC (permalink / raw)
  To: linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 1011 bytes --]

Hi,
Now I am trying to work on ppc440. And when I compile my program I received
this error message while ld:
ppc_4xx-ld: fifo_tmp.o(.text+0x6158): R_PPC_PLTREL24 reloc against local
symbol
fifo_tmp.o: could not read symbols: Bad value
I am using ELDK 4.2.
and the compile flags is as following:
 /eldk4.2/ppc_4xx/usr/bin/../
cc1 -quiet -nostdinc -v -I/home/joy/work/ppc/partikle//user/ulibc/include
-imultilib m403/pic -iprefix
/eldk4.2/ppc_4xx/usr/bin/../lib/gcc/powerpc-linux/4.2.2/ -D__unix__
-D__gnu_linux__ -D__linux__ -Dunix -D__unix -Dlinux -D__linux -Asystem=linux
-Asystem=unix -Asystem=posix -Dxm_ppc440 fifo.c -quiet -dumpbase fifo.c
-mcpu=403 -auxbase-strip fifo.o -O2 -Wall -Wno-pointer-sign -version
-fexceptions -fno-builtin -fno-strict-aliasing -fPIC -fno-stack-protector
-fomit-frame-pointer -o /tmp/ccOIjGSh.s
 as -m403 -V -Qy -K PIC -o fifo.o /tmp/ccOIjGSh.s

Is there someone know why this is happend and how to resolve it?
Thank you.

-- 
Best wishes

                       Pu Yiqiao(Joy)

[-- Attachment #2: Type: text/html, Size: 1127 bytes --]

^ permalink raw reply

* mpc512x/clock: fix clk_get logic
From: Wolfram Sang @ 2009-10-30  9:17 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Wolfgang Denk

The matching logic returns a clock even if only the dev-part matches. This is
wrong as devices may utilize more than one clock, so the wrong clock may be
returned due to dev being not unique (noticed while working on the CAN driver).
The proposed new method will:

- require the id field (as _this_ is the unique identifier)
- dev need not be given; if NULL, it will match any device.
  if given, it has to match the dev of the clock
- using the above rules, both fields need to match in order to claim the clock

Signed-off-by: Wolfram Sang <w.sang@pengutronix.de>
Cc: Wolfgang Denk <wd@denx.de>
Cc: Grant Likely <grant.likely@secretlab.ca>
---
 arch/powerpc/platforms/512x/clock.c |   14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

Index: .kernel/arch/powerpc/platforms/512x/clock.c
===================================================================
--- .kernel.orig/arch/powerpc/platforms/512x/clock.c
+++ .kernel/arch/powerpc/platforms/512x/clock.c
@@ -53,19 +53,21 @@ static DEFINE_MUTEX(clocks_mutex);
 static struct clk *mpc5121_clk_get(struct device *dev, const char *id)
 {
 	struct clk *p, *clk = ERR_PTR(-ENOENT);
-	int dev_match = 0;
-	int id_match = 0;
+	bool id_match = false;
+	/* Match any device if no dev given */
+	bool dev_match = !dev;
 
-	if (dev == NULL || id == NULL)
+	/* We need the unique identifier */
+	if (id == NULL)
 		return NULL;
 
 	mutex_lock(&clocks_mutex);
 	list_for_each_entry(p, &clocks, node) {
 		if (dev == p->dev)
-			dev_match++;
+			dev_match = true;
 		if (strcmp(id, p->name) == 0)
-			id_match++;
-		if ((dev_match || id_match) && try_module_get(p->owner)) {
+			id_match = true;
+		if (dev_match && id_match && try_module_get(p->owner)) {
 			clk = p;
 			break;
 		}

^ permalink raw reply

* Re: SPRN_SVR for MPC5121e?
From: Wolfram Sang @ 2009-10-30  9:14 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: linuxppc-dev
In-Reply-To: <20091029062650.79D42E916D8@gemini.denx.de>

[-- Attachment #1: Type: text/plain, Size: 300 bytes --]

> We saw
> 
> 	PVR/SVR = 0x80862010 / 0x80180010 for MPC5121
> and
> 
> 	PVR/SVR = 0x80862010 / 0x80180030 for MPC5123

Thanx a lot!

-- 
Pengutronix e.K.                           | Wolfram Sang                |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply

* powerpc "next" branch update
From: Benjamin Herrenschmidt @ 2009-10-30  6:26 UTC (permalink / raw)
  To: linuxppc-dev

Hoy hoy !

It's late, I know, but finally here's a first batch of patches in
powerpc "next" branch that I pushed out today. Hopefully more to come
as I start pulling from Kumar, Grant, etc... 

Cheers,
Ben.

Alexey Dobriyan (1):
      Convert /proc/device-tree/ to seq_file

Anton Vorontsov (2):
      powerpc: Make it possible to select hibernation on all PowerPCs
      of/platform: Implement support for dev_pm_ops

Benjamin Herrenschmidt (4):
      powerpc: Cleanup Kconfig selection of hugetlbfs support
      powerpc: Move /proc/ppc64 to /proc/powerpc and add symlink
      powerpc/chrp: Use the same RTAS daemon as pSeries
      powerpc/8xx: Fix build breakage with sparse irq changes

Brian King (1):
      powerpc: Add kdump support to Collaborative Memory Manager

David Gibson (6):
      powerpc/mm: Make hpte_need_flush() correctly mask for multiple page sizes
      powerpc/mm: Cleanup management of kmem_caches for pagetables
      powerpc/mm: Allow more flexible layouts for hugepage pagetables
      powerpc/mm: Cleanup initialization of hugepages on powerpc
      powerpc/mm: Split hash MMU specific hugepage code into a new file
      powerpc/mm: Bring hugepage PTE accessor functions back into sync with normal accessors

Jean Delvare (1):
      powerpc/therm_adt746x: Don't access non-existing register

Michael Ellerman (7):
      powerpc/ps3: Use pr_devel() in ps3/mm.c
      powerpc: Make NR_IRQS a CONFIG option
      powerpc/pseries: Use irq_has_action() in eeh_disable_irq()
      powerpc: Remove get_irq_desc()
      powerpc: Make virq_debug_show() cope with sparse irq_descs
      powerpc: Rearrange and fix show_interrupts() for sparse irq_descs
      powerpc: Enable sparse irq_descs on powerpc

Michael Neuling (1):
      powerpc: Fix potential compile error irqs_disabled_flags

Olof Johansson (1):
      powerpc: pasemi_defconfig update

Thomas Gleixner (3):
      powerpc/nvram_64: Remove unused code
      powerpc/nvram_64: Check nvram_error_log_index in nvram_clear_error_log()
      powerpc/nvram_64: Mark init code __init

^ permalink raw reply

* [PATCH] powerpc/8xx: Fix build breakage with sparse irq changes
From: Benjamin Herrenschmidt @ 2009-10-30  6:19 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev


Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

Sticking that on top of my -next branch

diff --git a/arch/powerpc/platforms/8xx/m8xx_setup.c b/arch/powerpc/platforms/8xx/m8xx_setup.c
index 385acfc..242954c 100644
--- a/arch/powerpc/platforms/8xx/m8xx_setup.c
+++ b/arch/powerpc/platforms/8xx/m8xx_setup.c
@@ -222,7 +222,7 @@ static void cpm_cascade(unsigned int irq, struct irq_desc *desc)
 	int cascade_irq;
 
 	if ((cascade_irq = cpm_get_irq()) >= 0) {
-		struct irq_desc *cdesc = irq_desc + cascade_irq;
+		struct irq_desc *cdesc = irq_to_desc(cascade_irq);
 
 		generic_handle_irq(cascade_irq);
 		cdesc->chip->eoi(cascade_irq);
-- 
1.6.1.2.14.gf26b5

^ permalink raw reply related

* Re: [GIT PULL] perf_event/tracing/powerpc patches from Anton Blanchard
From: Benjamin Herrenschmidt @ 2009-10-30  6:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, linuxppc-dev, Paul Mackerras, linux-kernel,
	Anton Blanchard
In-Reply-To: <20091029065501.GB12874@elte.hu>

On Thu, 2009-10-29 at 07:55 +0100, Ingo Molnar wrote:
> * Paul Mackerras <paulus@samba.org> wrote:
> 
> > Here is a series of patches from Anton Blanchard that implement some 
> > nice tracing and perf_event features on powerpc.  One of them is 
> > generic perf_event stuff (adding software events for alignment faults 
> > and instruction emulation faults).
> > 
> > Since this touches the perf_event and tracing subsystems as well as 
> > the powerpc architecture code, I think the best way forward is for 
> > both Ingo and Ben to pull it into their trees.  I have based it on the 
> > most recent point in Linus' tree that Ingo had pulled into his perf 
> > branches (as of yesterday or so).
> 
> The generic perf bits look good to me - can pull it if Ben OKs the 
> PowerPC bits.

Yup. Just went through all of them, they look fine. I also test built on
a number of default configs and it seems to pass.

Cheers,
Ben.

^ permalink raw reply

* Re: [4/6] Cleanup initialization of hugepages on powerpc
From: David Gibson @ 2009-10-30  5:29 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, linuxppc-dev
In-Reply-To: <20091027052431.B9CF4B7C07@ozlabs.org>

This patch got apply-broken by the fixes to the earlier patch in the
series.  Here's the fixed version.

Cleanup initialization of hugepages on powerpc

This patch simplifies the logic used to initialize hugepages on
powerpc.  The somewhat oddly named set_huge_psize() is renamed to
add_huge_page_size() and now does all necessary verification of
whether it's given a valid hugepage sizes (instead of just some) and
instantiates the generic hstate structure (but no more).  

hugetlbpage_init() now steps through the available pagesizes, checks
if they're valid for hugepages by calling add_huge_page_size() and
initializes the kmem_caches for the hugepage pagetables.  This means
we can now eliminate the mmu_huge_psizes array, since we no longer
need to pass the sizing information for the pagetable caches from
set_huge_psize() into hugetlbpage_init()

Determination of the default huge page size is also moved from the
hash code into the general hugepage code.

Signed-off-by: David Gibson <dwg@au1.ibm.com>

Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2009-10-30 16:16:20.000000000 +1100
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2009-10-30 16:19:32.000000000 +1100
@@ -37,27 +37,17 @@
 static unsigned long gpage_freearray[MAX_NUMBER_GPAGES];
 static unsigned nr_gpages;
 
-/* Array of valid huge page sizes - non-zero value(hugepte_shift) is
- * stored for the huge page sizes that are valid.
- */
-static unsigned int mmu_huge_psizes[MMU_PAGE_COUNT] = { }; /* initialize all to 0 */
-
 /* Flag to mark huge PD pointers.  This means pmd_bad() and pud_bad()
  * will choke on pointers to hugepte tables, which is handy for
  * catching screwups early. */
 
 static inline int shift_to_mmu_psize(unsigned int shift)
 {
-	switch (shift) {
-#ifndef CONFIG_PPC_64K_PAGES
-	case PAGE_SHIFT_64K:
-	    return MMU_PAGE_64K;
-#endif
-	case PAGE_SHIFT_16M:
-	    return MMU_PAGE_16M;
-	case PAGE_SHIFT_16G:
-	    return MMU_PAGE_16G;
-	}
+	int psize;
+
+	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize)
+		if (mmu_psize_defs[psize].shift == shift)
+			return psize;
 	return -1;
 }
 
@@ -502,8 +492,6 @@ unsigned long hugetlb_get_unmapped_area(
 	struct hstate *hstate = hstate_file(file);
 	int mmu_psize = shift_to_mmu_psize(huge_page_shift(hstate));
 
-	if (!mmu_huge_psizes[mmu_psize])
-		return -EINVAL;
 	return slice_get_unmapped_area(addr, len, flags, mmu_psize, 1, 0);
 }
 
@@ -666,47 +654,46 @@ repeat:
 	return err;
 }
 
-static void __init set_huge_psize(int psize)
+static int __init add_huge_page_size(unsigned long long size)
 {
-	unsigned pdshift;
+	int shift = __ffs(size);
+	int mmu_psize;
 
 	/* Check that it is a page size supported by the hardware and
-	 * that it fits within pagetable limits. */
-	if (mmu_psize_defs[psize].shift &&
-		mmu_psize_defs[psize].shift < SID_SHIFT_1T &&
-		(mmu_psize_defs[psize].shift > MIN_HUGEPTE_SHIFT ||
-		 mmu_psize_defs[psize].shift == PAGE_SHIFT_64K ||
-		 mmu_psize_defs[psize].shift == PAGE_SHIFT_16G)) {
-		/* Return if huge page size has already been setup or is the
-		 * same as the base page size. */
-		if (mmu_huge_psizes[psize] ||
-		   mmu_psize_defs[psize].shift == PAGE_SHIFT)
-			return;
-		hugetlb_add_hstate(mmu_psize_defs[psize].shift - PAGE_SHIFT);
+	 * that it fits within pagetable and slice limits. */
+	if (!is_power_of_2(size)
+	    || (shift > SLICE_HIGH_SHIFT) || (shift <= PAGE_SHIFT))
+		return -EINVAL;
 
-		if (mmu_psize_defs[psize].shift < PMD_SHIFT)
-			pdshift = PMD_SHIFT;
-		else if (mmu_psize_defs[psize].shift < PUD_SHIFT)
-			pdshift = PUD_SHIFT;
-		else
-			pdshift = PGDIR_SHIFT;
-		mmu_huge_psizes[psize] = pdshift - mmu_psize_defs[psize].shift;
-	}
+	if ((mmu_psize = shift_to_mmu_psize(shift)) < 0)
+		return -EINVAL;
+
+#ifdef CONFIG_SPU_FS_64K_LS
+	/* Disable support for 64K huge pages when 64K SPU local store
+	 * support is enabled as the current implementation conflicts.
+	 */
+	if (shift == PAGE_SHIFT_64K)
+		return -EINVAL;
+#endif /* CONFIG_SPU_FS_64K_LS */
+
+	BUG_ON(mmu_psize_defs[mmu_psize].shift != shift);
+
+	/* Return if huge page size has already been setup */
+	if (size_to_hstate(size))
+		return 0;
+
+	hugetlb_add_hstate(shift - PAGE_SHIFT);
+
+	return 0;
 }
 
 static int __init hugepage_setup_sz(char *str)
 {
 	unsigned long long size;
-	int mmu_psize;
-	int shift;
 
 	size = memparse(str, &str);
 
-	shift = __ffs(size);
-	mmu_psize = shift_to_mmu_psize(shift);
-	if (mmu_psize >= 0 && mmu_psize_defs[mmu_psize].shift)
-		set_huge_psize(mmu_psize);
-	else
+	if (add_huge_page_size(size) != 0)
 		printk(KERN_WARNING "Invalid huge page size specified(%llu)\n", size);
 
 	return 1;
@@ -720,30 +707,39 @@ static int __init hugetlbpage_init(void)
 	if (!cpu_has_feature(CPU_FTR_16M_PAGE))
 		return -ENODEV;
 
-	/* Add supported huge page sizes.  Need to change
-	 *  HUGE_MAX_HSTATE if the number of supported huge page sizes
-	 *  changes.
-	 */
-	set_huge_psize(MMU_PAGE_16M);
-	set_huge_psize(MMU_PAGE_16G);
+	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
+		unsigned shift;
+		unsigned pdshift;
 
-	/* Temporarily disable support for 64K huge pages when 64K SPU local
-	 * store support is enabled as the current implementation conflicts.
-	 */
-#ifndef CONFIG_SPU_FS_64K_LS
-	set_huge_psize(MMU_PAGE_64K);
-#endif
+		if (!mmu_psize_defs[psize].shift)
+			continue;
 
-	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
-		if (mmu_huge_psizes[psize]) {
-			pgtable_cache_add(mmu_huge_psizes[psize], NULL);
-			if (!PGT_CACHE(mmu_huge_psizes[psize]))
-				panic("hugetlbpage_init(): could not create "
-				      "pgtable cache for %d bit pagesize\n",
-				      mmu_psize_to_shift(psize));
-		}
+		shift = mmu_psize_to_shift(psize);
+
+		if (add_huge_page_size(1ULL << shift) < 0)
+			continue;
+
+		if (shift < PMD_SHIFT)
+			pdshift = PMD_SHIFT;
+		else if (shift < PUD_SHIFT)
+			pdshift = PUD_SHIFT;
+		else
+			pdshift = PGDIR_SHIFT;
+
+		pgtable_cache_add(pdshift - shift, NULL);
+		if (!PGT_CACHE(pdshift - shift))
+			panic("hugetlbpage_init(): could not create "
+			      "pgtable cache for %d bit pagesize\n", shift);
 	}
 
+	/* Set default large page size. Currently, we pick 16M or 1M
+	 * depending on what is available
+	 */
+	if (mmu_psize_defs[MMU_PAGE_16M].shift)
+		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_16M].shift;
+	else if (mmu_psize_defs[MMU_PAGE_1M].shift)
+		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_1M].shift;
+
 	return 0;
 }
 
Index: working-2.6/arch/powerpc/include/asm/page_64.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/page_64.h	2009-08-14 16:07:54.000000000 +1000
+++ working-2.6/arch/powerpc/include/asm/page_64.h	2009-10-30 16:16:30.000000000 +1100
@@ -90,7 +90,7 @@ extern unsigned int HPAGE_SHIFT;
 #define HPAGE_SIZE		((1UL) << HPAGE_SHIFT)
 #define HPAGE_MASK		(~(HPAGE_SIZE - 1))
 #define HUGETLB_PAGE_ORDER	(HPAGE_SHIFT - PAGE_SHIFT)
-#define HUGE_MAX_HSTATE		3
+#define HUGE_MAX_HSTATE		(MMU_PAGE_COUNT-1)
 
 #endif /* __ASSEMBLY__ */
 
Index: working-2.6/arch/powerpc/mm/hash_utils_64.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hash_utils_64.c	2009-10-30 16:16:20.000000000 +1100
+++ working-2.6/arch/powerpc/mm/hash_utils_64.c	2009-10-30 16:16:30.000000000 +1100
@@ -481,16 +481,6 @@ static void __init htab_init_page_sizes(
 #ifdef CONFIG_HUGETLB_PAGE
 	/* Reserve 16G huge page memory sections for huge pages */
 	of_scan_flat_dt(htab_dt_scan_hugepage_blocks, NULL);
-
-/* Set default large page size. Currently, we pick 16M or 1M depending
-	 * on what is available
-	 */
-	if (mmu_psize_defs[MMU_PAGE_16M].shift)
-		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_16M].shift;
-	/* With 4k/4level pagetables, we can't (for now) cope with a
-	 * huge page size < PMD_SIZE */
-	else if (mmu_psize_defs[MMU_PAGE_1M].shift)
-		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_1M].shift;
 #endif /* CONFIG_HUGETLB_PAGE */
 }
 


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply

* [PATCH v5 4/4] pseries: Serialize cpu hotplug operations during deactivate Vs deallocate
From: Gautham R Shenoy @ 2009-10-30  5:23 UTC (permalink / raw)
  To: Nathan Fontenot, Benjamin Herrenschmidt
  Cc: Peter Zijlstra, linux-kernel, Arun R Bharadwaj, Andrew Morton,
	linuxppc-dev, Ingo Molnar
In-Reply-To: <20091030052106.25493.42109.stgit@sofia.in.ibm.com>

Currently the cpu-allocation/deallocation process comprises of two steps:
- Set the indicators and to update the device tree with DLPAR node
  information.

- Online/offline the allocated/deallocated CPU.

This is achieved by writing to the sysfs tunables "probe" during allocation
and "release" during deallocation.

At the sametime, the userspace can independently online/offline the CPUs of
the system using the sysfs tunable "online".

It is quite possible that when a userspace tool offlines a CPU
for the purpose of deallocation and is in the process of updating the device
tree, some other userspace tool could bring the CPU back online by writing to
the "online" sysfs tunable thereby causing the deallocate process to fail.

The solution to this is to serialize writes to the "probe/release" sysfs
tunable with the writes to the "online" sysfs tunable.

This patch employs a mutex to provide this serialization, which is a no-op on
all architectures except PPC_PSERIES

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
---
 arch/powerpc/platforms/pseries/dlpar.c |   28 +++++++++++++++++++++++-----
 drivers/base/cpu.c                     |    2 ++
 include/linux/cpu.h                    |   13 +++++++++++++
 3 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c
index 8e04a69..b6fc6ab 100644
--- a/arch/powerpc/platforms/pseries/dlpar.c
+++ b/arch/powerpc/platforms/pseries/dlpar.c
@@ -501,6 +501,18 @@ int release_drc(u32 drc_index)
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
+static DEFINE_MUTEX(pseries_cpu_hotplug_mutex);
+
+void cpu_hotplug_driver_lock()
+{
+	mutex_lock(&pseries_cpu_hotplug_mutex);
+}
+
+void cpu_hotplug_driver_unlock()
+{
+	mutex_unlock(&pseries_cpu_hotplug_mutex);
+}
+
 static ssize_t cpu_probe_store(struct class *class, const char *buf,
 			       size_t count)
 {
@@ -509,18 +521,19 @@ static ssize_t cpu_probe_store(struct class *class, const char *buf,
 	char *cpu_name;
 	int rc;
 
+	cpu_hotplug_driver_lock();
 	rc = strict_strtoul(buf, 0, &drc_index);
 	if (rc)
-		return -EINVAL;
+		goto out;
 
 	rc = acquire_drc(drc_index);
 	if (rc)
-		return -EINVAL;
+		goto out;
 
 	dn = configure_connector(drc_index);
 	if (!dn) {
 		release_drc(drc_index);
-		return -EINVAL;
+		goto out;
 	}
 
 	/* fixup dn name */
@@ -529,7 +542,8 @@ static ssize_t cpu_probe_store(struct class *class, const char *buf,
 	if (!cpu_name) {
 		free_cc_nodes(dn);
 		release_drc(drc_index);
-		return -ENOMEM;
+		rc = -ENOMEM;
+		goto out;
 	}
 
 	sprintf(cpu_name, "/cpus/%s", dn->full_name);
@@ -541,6 +555,8 @@ static ssize_t cpu_probe_store(struct class *class, const char *buf,
 		release_drc(drc_index);
 
 	rc = online_node_cpus(dn);
+out:
+	cpu_hotplug_driver_unlock();
 
 	return rc ? -EINVAL : count;
 }
@@ -562,6 +578,7 @@ static ssize_t cpu_release_store(struct class *class, const char *buf,
 		return -EINVAL;
 	}
 
+	cpu_hotplug_driver_lock();
 	rc = offline_node_cpus(dn);
 
 	if (rc)
@@ -570,7 +587,7 @@ static ssize_t cpu_release_store(struct class *class, const char *buf,
 	rc = release_drc(*drc_index);
 	if (rc) {
 		of_node_put(dn);
-		return -EINVAL;
+		goto out;
 	}
 
 	rc = remove_device_tree_nodes(dn);
@@ -579,6 +596,7 @@ static ssize_t cpu_release_store(struct class *class, const char *buf,
 
 	of_node_put(dn);
 out:
+	cpu_hotplug_driver_unlock();
 	return rc ? -EINVAL : count;
 }
 
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index e62a4cc..07c3f05 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -35,6 +35,7 @@ static ssize_t __ref store_online(struct sys_device *dev, struct sysdev_attribut
 	struct cpu *cpu = container_of(dev, struct cpu, sysdev);
 	ssize_t ret;
 
+	cpu_hotplug_driver_lock();
 	switch (buf[0]) {
 	case '0':
 		ret = cpu_down(cpu->sysdev.id);
@@ -49,6 +50,7 @@ static ssize_t __ref store_online(struct sys_device *dev, struct sysdev_attribut
 	default:
 		ret = -EINVAL;
 	}
+	cpu_hotplug_driver_unlock();
 
 	if (ret >= 0)
 		ret = count;
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 4753619..b0ad4e1 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -115,6 +115,19 @@ extern void put_online_cpus(void);
 #define unregister_hotcpu_notifier(nb)	unregister_cpu_notifier(nb)
 int cpu_down(unsigned int cpu);
 
+#ifdef CONFIG_PPC_PSERIES
+extern void cpu_hotplug_driver_lock(void);
+extern void cpu_hotplug_driver_unlock(void);
+#else
+static inline void cpu_hotplug_driver_lock(void)
+{
+}
+
+static inline void cpu_hotplug_driver_unlock(void)
+{
+}
+#endif
+
 #else		/* CONFIG_HOTPLUG_CPU */
 
 #define get_online_cpus()	do { } while (0)

^ permalink raw reply related

* [PATCH v5 3/4] pseries: Add code to online/offline CPUs of a DLPAR node.
From: Gautham R Shenoy @ 2009-10-30  5:22 UTC (permalink / raw)
  To: Nathan Fontenot, Benjamin Herrenschmidt
  Cc: Peter Zijlstra, linux-kernel, Arun R Bharadwaj, Andrew Morton,
	linuxppc-dev, Ingo Molnar
In-Reply-To: <20091030052106.25493.42109.stgit@sofia.in.ibm.com>

Currently the cpu-allocation/deallocation on pSeries is a
two step process from the Userspace.

- Set the indicators and update the device tree by writing to the sysfs
  tunable "probe" during allocation and "release" during deallocation.
- Online / Offline the CPUs of the allocated/would_be_deallocated node by
  writing to the sysfs tunable "online".

This patch adds kernel code to online/offline the CPUs soon_after/just_before
they have been allocated/would_be_deallocated. This way, the userspace tool
that performs DLPAR operations would only have to deal with one set of sysfs
tunables namely "probe" and release".

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
---
 arch/powerpc/platforms/pseries/dlpar.c |  101 ++++++++++++++++++++++++++++++++
 1 files changed, 101 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c
index ceb6d17..8e04a69 100644
--- a/arch/powerpc/platforms/pseries/dlpar.c
+++ b/arch/powerpc/platforms/pseries/dlpar.c
@@ -19,6 +19,7 @@
 #include <linux/sysdev.h>
 #include <linux/sysfs.h>
 #include <linux/cpu.h>
+#include "offline_states.h"
 
 #include <asm/prom.h>
 #include <asm/machdep.h>
@@ -353,6 +354,98 @@ int remove_device_tree_nodes(struct device_node *dn)
 	return rc;
 }
 
+int online_node_cpus(struct device_node *dn)
+{
+	int rc = 0;
+	unsigned int cpu;
+	int len, nthreads, i;
+	const u32 *intserv;
+
+	intserv = of_get_property(dn, "ibm,ppc-interrupt-server#s", &len);
+	if (!intserv)
+		return -EINVAL;
+
+	nthreads = len / sizeof(u32);
+
+	cpu_maps_update_begin();
+	for (i = 0; i < nthreads; i++) {
+		for_each_present_cpu(cpu) {
+			if (get_hard_smp_processor_id(cpu) != intserv[i])
+				continue;
+			BUG_ON(get_cpu_current_state(cpu)
+					!= CPU_STATE_OFFLINE);
+			cpu_maps_update_done();
+			rc = cpu_up(cpu);
+			if (rc)
+				goto out;
+			cpu_maps_update_begin();
+
+			break;
+		}
+		if (cpu == num_possible_cpus())
+			printk(KERN_WARNING "Could not find cpu to online "
+			       "with physical id 0x%x\n", intserv[i]);
+	}
+	cpu_maps_update_done();
+
+out:
+	return rc;
+
+}
+
+int offline_node_cpus(struct device_node *dn)
+{
+	int rc = 0;
+	unsigned int cpu;
+	int len, nthreads, i;
+	const u32 *intserv;
+
+	intserv = of_get_property(dn, "ibm,ppc-interrupt-server#s", &len);
+	if (!intserv)
+		return -EINVAL;
+
+	nthreads = len / sizeof(u32);
+
+	cpu_maps_update_begin();
+	for (i = 0; i < nthreads; i++) {
+		for_each_present_cpu(cpu) {
+			if (get_hard_smp_processor_id(cpu) != intserv[i])
+				continue;
+
+			if (get_cpu_current_state(cpu) == CPU_STATE_OFFLINE)
+				break;
+
+			if (get_cpu_current_state(cpu) == CPU_STATE_ONLINE) {
+				cpu_maps_update_done();
+				rc = cpu_down(cpu);
+				if (rc)
+					goto out;
+				cpu_maps_update_begin();
+				break;
+
+			}
+
+			/*
+			 * The cpu is in CPU_STATE_INACTIVE.
+			 * Upgrade it's state to CPU_STATE_OFFLINE.
+			 */
+			set_preferred_offline_state(cpu, CPU_STATE_OFFLINE);
+			BUG_ON(plpar_hcall_norets(H_PROD, intserv[i])
+								!= H_SUCCESS);
+			__cpu_die(cpu);
+			break;
+		}
+		if (cpu == num_possible_cpus())
+			printk(KERN_WARNING "Could not find cpu to offline "
+			       "with physical id 0x%x\n", intserv[i]);
+	}
+	cpu_maps_update_done();
+
+out:
+	return rc;
+
+}
+
 #define DR_ENTITY_SENSE		9003
 #define DR_ENTITY_PRESENT	1
 #define DR_ENTITY_UNUSABLE	2
@@ -447,6 +540,8 @@ static ssize_t cpu_probe_store(struct class *class, const char *buf,
 	if (rc)
 		release_drc(drc_index);
 
+	rc = online_node_cpus(dn);
+
 	return rc ? -EINVAL : count;
 }
 
@@ -467,6 +562,11 @@ static ssize_t cpu_release_store(struct class *class, const char *buf,
 		return -EINVAL;
 	}
 
+	rc = offline_node_cpus(dn);
+
+	if (rc)
+		goto out;
+
 	rc = release_drc(*drc_index);
 	if (rc) {
 		of_node_put(dn);
@@ -478,6 +578,7 @@ static ssize_t cpu_release_store(struct class *class, const char *buf,
 		acquire_drc(*drc_index);
 
 	of_node_put(dn);
+out:
 	return rc ? -EINVAL : count;
 }
 

^ permalink raw reply related

* [PATCH v5 2/4] pSeries: Add hooks to put the CPU into an appropriate offline state
From: Gautham R Shenoy @ 2009-10-30  5:22 UTC (permalink / raw)
  To: Nathan Fontenot, Benjamin Herrenschmidt
  Cc: Peter Zijlstra, linux-kernel, Arun R Bharadwaj, Andrew Morton,
	linuxppc-dev, Ingo Molnar
In-Reply-To: <20091030052106.25493.42109.stgit@sofia.in.ibm.com>

When a CPU is offlined on POWER currently, we call rtas_stop_self() and hand
the CPU back to the resource pool. This path is used for DLPAR which will
cause a change in the LPAR configuration which will be visible outside.

This patch changes the default state a CPU is put into when it is offlined.
On platforms which support ceding the processor to the hypervisor with
latency hint specifier value, during a cpu offline operation,
instead of calling rtas_stop_self(), we cede the vCPU to the hypervisor
while passing a latency hint specifier value. The Hypervisor can use this hint
to provide better energy savings. Also, during the offline
operation, the control of the vCPU remains with the LPAR as oppposed to
returning it to the resource pool.

The patch achieves this by creating an infrastructure to set the
preferred_offline_state() which can be either
- CPU_STATE_OFFLINE: which is the current behaviour of calling
  rtas_stop_self()

- CPU_STATE_INACTIVE: which cedes the vCPU to the hypervisor with the latency
  hint specifier.

The codepath which wants to perform a DLPAR operation can set the
preferred_offline_state() of a CPU to CPU_STATE_OFFLINE before invoking
cpu_down().

The patch also provides a boot-time command line argument to disable/enable
CPU_STATE_INACTIVE.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
---
 Documentation/cpu-hotplug.txt                   |    6 +
 arch/powerpc/platforms/pseries/hotplug-cpu.c    |  182 ++++++++++++++++++++++-
 arch/powerpc/platforms/pseries/offline_states.h |   18 ++
 arch/powerpc/platforms/pseries/smp.c            |   19 ++
 4 files changed, 216 insertions(+), 9 deletions(-)
 create mode 100644 arch/powerpc/platforms/pseries/offline_states.h

diff --git a/Documentation/cpu-hotplug.txt b/Documentation/cpu-hotplug.txt
index 9d620c1..4d4a644 100644
--- a/Documentation/cpu-hotplug.txt
+++ b/Documentation/cpu-hotplug.txt
@@ -49,6 +49,12 @@ maxcpus=n    Restrict boot time cpus to n. Say if you have 4 cpus, using
 additional_cpus=n (*)	Use this to limit hotpluggable cpus. This option sets
   			cpu_possible_map = cpu_present_map + additional_cpus
 
+cede_offline={"off","on"}  Use this option to disable/enable putting offlined
+		            processors to an extended H_CEDE state on
+			    supported pseries platforms.
+			    If nothing is specified,
+			    cede_offline is set to "on".
+
 (*) Option valid only for following architectures
 - ia64
 
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index ebff6d9..6ea4698 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -30,6 +30,7 @@
 #include <asm/pSeries_reconfig.h>
 #include "xics.h"
 #include "plpar_wrappers.h"
+#include "offline_states.h"
 
 /* This version can't take the spinlock, because it never returns */
 static struct rtas_args rtas_stop_self_args = {
@@ -39,6 +40,55 @@ static struct rtas_args rtas_stop_self_args = {
 	.rets = &rtas_stop_self_args.args[0],
 };
 
+static DEFINE_PER_CPU(enum cpu_state_vals, preferred_offline_state) =
+							CPU_STATE_OFFLINE;
+static DEFINE_PER_CPU(enum cpu_state_vals, current_state) = CPU_STATE_OFFLINE;
+
+static enum cpu_state_vals default_offline_state = CPU_STATE_OFFLINE;
+
+static int cede_offline_enabled __read_mostly = 1;
+
+/*
+ * Enable/disable cede_offline when available.
+ */
+static int __init setup_cede_offline(char *str)
+{
+	if (!strcmp(str, "off"))
+		cede_offline_enabled = 0;
+	else if (!strcmp(str, "on"))
+		cede_offline_enabled = 1;
+	else
+		return 0;
+	return 1;
+}
+
+__setup("cede_offline=", setup_cede_offline);
+
+enum cpu_state_vals get_cpu_current_state(int cpu)
+{
+	return per_cpu(current_state, cpu);
+}
+
+void set_cpu_current_state(int cpu, enum cpu_state_vals state)
+{
+	per_cpu(current_state, cpu) = state;
+}
+
+enum cpu_state_vals get_preferred_offline_state(int cpu)
+{
+	return per_cpu(preferred_offline_state, cpu);
+}
+
+void set_preferred_offline_state(int cpu, enum cpu_state_vals state)
+{
+	per_cpu(preferred_offline_state, cpu) = state;
+}
+
+void set_default_offline_state(int cpu)
+{
+	per_cpu(preferred_offline_state, cpu) = default_offline_state;
+}
+
 static void rtas_stop_self(void)
 {
 	struct rtas_args *args = &rtas_stop_self_args;
@@ -56,11 +106,61 @@ static void rtas_stop_self(void)
 
 static void pseries_mach_cpu_die(void)
 {
+	unsigned int cpu = smp_processor_id();
+	unsigned int hwcpu = hard_smp_processor_id();
+	u8 cede_latency_hint = 0;
+
 	local_irq_disable();
 	idle_task_exit();
 	xics_teardown_cpu();
-	unregister_slb_shadow(hard_smp_processor_id(), __pa(get_slb_shadow()));
-	rtas_stop_self();
+
+	if (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) {
+		set_cpu_current_state(cpu, CPU_STATE_INACTIVE);
+		cede_latency_hint = 2;
+
+		get_lppaca()->idle = 1;
+		if (!get_lppaca()->shared_proc)
+			get_lppaca()->donate_dedicated_cpu = 1;
+
+		printk(KERN_INFO
+			"cpu %u (hwid %u) ceding for offline with hint %d\n",
+			cpu, hwcpu, cede_latency_hint);
+		while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) {
+			extended_cede_processor(cede_latency_hint);
+			printk(KERN_INFO "cpu %u (hwid %u) returned from cede.\n",
+				cpu, hwcpu);
+			printk(KERN_INFO
+			"Decrementer value = %x Timebase value = %llx\n",
+			get_dec(), get_tb());
+		}
+
+		printk(KERN_INFO "cpu %u (hwid %u) got prodded to go online\n",
+			cpu, hwcpu);
+
+		if (!get_lppaca()->shared_proc)
+			get_lppaca()->donate_dedicated_cpu = 0;
+		get_lppaca()->idle = 0;
+	}
+
+	if (get_preferred_offline_state(cpu) == CPU_STATE_ONLINE) {
+		unregister_slb_shadow(hwcpu, __pa(get_slb_shadow()));
+
+		/*
+		 * NOTE: Calling start_secondary() here for now to
+		 * start new context.
+		 * However, need to do it cleanly by resetting the
+		 * stack pointer.
+		 */
+		start_secondary();
+
+	} else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
+
+		set_cpu_current_state(cpu, CPU_STATE_OFFLINE);
+		unregister_slb_shadow(hard_smp_processor_id(),
+					__pa(get_slb_shadow()));
+		rtas_stop_self();
+	}
+
 	/* Should never get here... */
 	BUG();
 	for(;;);
@@ -106,18 +206,43 @@ static int pseries_cpu_disable(void)
 	return 0;
 }
 
+/*
+ * pseries_cpu_die: Wait for the cpu to die.
+ * @cpu: logical processor id of the CPU whose death we're awaiting.
+ *
+ * This function is called from the context of the thread which is performing
+ * the cpu-offline. Here we wait for long enough to allow the cpu in question
+ * to self-destroy so that the cpu-offline thread can send the CPU_DEAD
+ * notifications.
+ *
+ * OTOH, pseries_mach_cpu_die() is called by the @cpu when it wants to
+ * self-destruct.
+ */
 static void pseries_cpu_die(unsigned int cpu)
 {
 	int tries;
-	int cpu_status;
+	int cpu_status = 1;
 	unsigned int pcpu = get_hard_smp_processor_id(cpu);
 
-	for (tries = 0; tries < 25; tries++) {
-		cpu_status = query_cpu_stopped(pcpu);
-		if (cpu_status == 0 || cpu_status == -1)
-			break;
-		cpu_relax();
+	if (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) {
+		cpu_status = 1;
+		for (tries = 0; tries < 1000; tries++) {
+			if (get_cpu_current_state(cpu) == CPU_STATE_INACTIVE) {
+				cpu_status = 0;
+				break;
+			}
+			cpu_relax();
+		}
+	} else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
+
+		for (tries = 0; tries < 25; tries++) {
+			cpu_status = query_cpu_stopped(pcpu);
+			if (cpu_status == 0 || cpu_status == -1)
+				break;
+			cpu_relax();
+		}
 	}
+
 	if (cpu_status != 0) {
 		printk("Querying DEAD? cpu %i (%i) shows %i\n",
 		       cpu, pcpu, cpu_status);
@@ -252,10 +377,41 @@ static struct notifier_block pseries_smp_nb = {
 	.notifier_call = pseries_smp_notifier,
 };
 
+#define MAX_CEDE_LATENCY_LEVELS		4
+#define	CEDE_LATENCY_PARAM_LENGTH	10
+#define CEDE_LATENCY_PARAM_MAX_LENGTH	\
+	(MAX_CEDE_LATENCY_LEVELS * CEDE_LATENCY_PARAM_LENGTH * sizeof(char))
+#define CEDE_LATENCY_TOKEN		45
+
+static char cede_parameters[CEDE_LATENCY_PARAM_MAX_LENGTH];
+
+static int parse_cede_parameters(void)
+{
+	int call_status;
+
+	memset(cede_parameters, 0, CEDE_LATENCY_PARAM_MAX_LENGTH);
+	call_status = rtas_call(rtas_token("ibm,get-system-parameter"), 3, 1,
+				NULL,
+				CEDE_LATENCY_TOKEN,
+				__pa(cede_parameters),
+				CEDE_LATENCY_PARAM_MAX_LENGTH);
+
+	if (call_status != 0)
+		printk(KERN_INFO "CEDE_LATENCY: \
+			%s %s Error calling get-system-parameter(0x%x)\n",
+			__FILE__, __func__, call_status);
+	else
+		printk(KERN_INFO "CEDE_LATENCY: \
+			get-system-parameter successful.\n");
+
+	return call_status;
+}
+
 static int __init pseries_cpu_hotplug_init(void)
 {
 	struct device_node *np;
 	const char *typep;
+	int cpu;
 
 	for_each_node_by_name(np, "interrupt-controller") {
 		typep = of_get_property(np, "compatible", NULL);
@@ -283,8 +439,16 @@ static int __init pseries_cpu_hotplug_init(void)
 	smp_ops->cpu_die = pseries_cpu_die;
 
 	/* Processors can be added/removed only on LPAR */
-	if (firmware_has_feature(FW_FEATURE_LPAR))
+	if (firmware_has_feature(FW_FEATURE_LPAR)) {
 		pSeries_reconfig_notifier_register(&pseries_smp_nb);
+		cpu_maps_update_begin();
+		if (cede_offline_enabled && parse_cede_parameters() == 0) {
+			default_offline_state = CPU_STATE_INACTIVE;
+			for_each_online_cpu(cpu)
+				set_default_offline_state(cpu);
+		}
+		cpu_maps_update_done();
+	}
 
 	return 0;
 }
diff --git a/arch/powerpc/platforms/pseries/offline_states.h b/arch/powerpc/platforms/pseries/offline_states.h
new file mode 100644
index 0000000..22574e0
--- /dev/null
+++ b/arch/powerpc/platforms/pseries/offline_states.h
@@ -0,0 +1,18 @@
+#ifndef _OFFLINE_STATES_H_
+#define _OFFLINE_STATES_H_
+
+/* Cpu offline states go here */
+enum cpu_state_vals {
+	CPU_STATE_OFFLINE,
+	CPU_STATE_INACTIVE,
+	CPU_STATE_ONLINE,
+	CPU_MAX_OFFLINE_STATES
+};
+
+extern enum cpu_state_vals get_cpu_current_state(int cpu);
+extern void set_cpu_current_state(int cpu, enum cpu_state_vals state);
+extern enum cpu_state_vals get_preferred_offline_state(int cpu);
+extern void set_preferred_offline_state(int cpu, enum cpu_state_vals state);
+extern void set_default_offline_state(int cpu);
+extern int start_secondary(void);
+#endif
diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c
index 440000c..8868c01 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -48,6 +48,7 @@
 #include "plpar_wrappers.h"
 #include "pseries.h"
 #include "xics.h"
+#include "offline_states.h"
 
 
 /*
@@ -84,6 +85,9 @@ static inline int __devinit smp_startup_cpu(unsigned int lcpu)
 	/* Fixup atomic count: it exited inside IRQ handler. */
 	task_thread_info(paca[lcpu].__current)->preempt_count	= 0;
 
+	if (get_cpu_current_state(lcpu) == CPU_STATE_INACTIVE)
+		goto out;
+
 	/* 
 	 * If the RTAS start-cpu token does not exist then presume the
 	 * cpu is already spinning.
@@ -98,6 +102,7 @@ static inline int __devinit smp_startup_cpu(unsigned int lcpu)
 		return 0;
 	}
 
+out:
 	return 1;
 }
 
@@ -111,12 +116,16 @@ static void __devinit smp_xics_setup_cpu(int cpu)
 		vpa_init(cpu);
 
 	cpu_clear(cpu, of_spin_map);
+	set_cpu_current_state(cpu, CPU_STATE_ONLINE);
+	set_default_offline_state(cpu);
 
 }
 #endif /* CONFIG_XICS */
 
 static void __devinit smp_pSeries_kick_cpu(int nr)
 {
+	long rc;
+	unsigned long hcpuid;
 	BUG_ON(nr < 0 || nr >= NR_CPUS);
 
 	if (!smp_startup_cpu(nr))
@@ -128,6 +137,16 @@ static void __devinit smp_pSeries_kick_cpu(int nr)
 	 * the processor will continue on to secondary_start
 	 */
 	paca[nr].cpu_start = 1;
+
+	set_preferred_offline_state(nr, CPU_STATE_ONLINE);
+
+	if (get_cpu_current_state(nr) == CPU_STATE_INACTIVE) {
+		hcpuid = get_hard_smp_processor_id(nr);
+		rc = plpar_hcall_norets(H_PROD, hcpuid);
+		if (rc != H_SUCCESS)
+			panic("Error: Prod to wake up processor %d Ret= %ld\n",
+				nr, rc);
+	}
 }
 
 static int smp_pSeries_cpu_bootable(unsigned int nr)

^ permalink raw reply related

* [PATCH v5 1/4] pSeries: extended_cede_processor() helper function.
From: Gautham R Shenoy @ 2009-10-30  5:22 UTC (permalink / raw)
  To: Nathan Fontenot, Benjamin Herrenschmidt
  Cc: Peter Zijlstra, linux-kernel, Arun R Bharadwaj, Andrew Morton,
	linuxppc-dev, Ingo Molnar
In-Reply-To: <20091030052106.25493.42109.stgit@sofia.in.ibm.com>

This patch provides an extended_cede_processor() helper function
which takes the cede latency hint as an argument. This hint is to be passed
on to the hypervisor to cede to the corresponding state on platforms
which support it.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/lppaca.h               |    9 ++++++++-
 arch/powerpc/platforms/pseries/plpar_wrappers.h |   22 ++++++++++++++++++++++
 arch/powerpc/xmon/xmon.c                        |    3 ++-
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/lppaca.h b/arch/powerpc/include/asm/lppaca.h
index f78f65c..14b592d 100644
--- a/arch/powerpc/include/asm/lppaca.h
+++ b/arch/powerpc/include/asm/lppaca.h
@@ -100,7 +100,14 @@ struct lppaca {
 	// Used to pass parms from the OS to PLIC for SetAsrAndRfid
 	u64	saved_gpr3;		// Saved GPR3                   x20-x27
 	u64	saved_gpr4;		// Saved GPR4                   x28-x2F
-	u64	saved_gpr5;		// Saved GPR5                   x30-x37
+	union {
+		u64	saved_gpr5;	/* Saved GPR5               x30-x37 */
+		struct {
+			u8	cede_latency_hint;  /*			x30 */
+			u8	reserved[7];        /*		    x31-x36 */
+		} fields;
+	} gpr5_dword;
+
 
 	u8	dtl_enable_mask;	// Dispatch Trace Log mask	x38-x38
 	u8	donate_dedicated_cpu;	// Donate dedicated CPU cycles  x39-x39
diff --git a/arch/powerpc/platforms/pseries/plpar_wrappers.h b/arch/powerpc/platforms/pseries/plpar_wrappers.h
index a24a6b2..0603c91 100644
--- a/arch/powerpc/platforms/pseries/plpar_wrappers.h
+++ b/arch/powerpc/platforms/pseries/plpar_wrappers.h
@@ -9,11 +9,33 @@ static inline long poll_pending(void)
 	return plpar_hcall_norets(H_POLL_PENDING);
 }
 
+static inline u8 get_cede_latency_hint(void)
+{
+	return get_lppaca()->gpr5_dword.fields.cede_latency_hint;
+}
+
+static inline void set_cede_latency_hint(u8 latency_hint)
+{
+	get_lppaca()->gpr5_dword.fields.cede_latency_hint = latency_hint;
+}
+
 static inline long cede_processor(void)
 {
 	return plpar_hcall_norets(H_CEDE);
 }
 
+static inline long extended_cede_processor(unsigned long latency_hint)
+{
+	long rc;
+	u8 old_latency_hint = get_cede_latency_hint();
+
+	set_cede_latency_hint(latency_hint);
+	rc = cede_processor();
+	set_cede_latency_hint(old_latency_hint);
+
+	return rc;
+}
+
 static inline long vpa_call(unsigned long flags, unsigned long cpu,
 		unsigned long vpa)
 {
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index c6f0a71..57124cf 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -1623,7 +1623,8 @@ static void super_regs(void)
 			       ptrLpPaca->saved_srr0, ptrLpPaca->saved_srr1);
 			printf("    Saved Gpr3=%.16lx  Saved Gpr4=%.16lx \n",
 			       ptrLpPaca->saved_gpr3, ptrLpPaca->saved_gpr4);
-			printf("    Saved Gpr5=%.16lx \n", ptrLpPaca->saved_gpr5);
+			printf("    Saved Gpr5=%.16lx \n",
+				ptrLpPaca->gpr5_dword.saved_gpr5);
 		}
 #endif
 

^ permalink raw reply related

* [PATCH v5 0/4] pseries: Add cede support for cpu-offline
From: Gautham R Shenoy @ 2009-10-30  5:22 UTC (permalink / raw)
  To: Nathan Fontenot, Benjamin Herrenschmidt
  Cc: Peter Zijlstra, linux-kernel, Arun R Bharadwaj, Andrew Morton,
	linuxppc-dev, Ingo Molnar

Hi,

This is version 5 of patch series that provides a framework to choose the
state a pseries CPU must be put to when it is offlined.

Previous versions can be found here:
Version 4: http://lkml.org/lkml/2009/10/9/59
Version 3: http://lkml.org/lkml/2009/9/15/164
Version 2: http://lkml.org/lkml/2009/8/28/102
Version 1: http://lkml.org/lkml/2009/8/6/236

Changes from the previous version include:
- Rebased against Nathan Fontenot's latest "pseries kernel handling of dynamic
  logical paritioning v4" patches found here:
  http://lkml.org/lkml/2009/10/21/98

- Added boot-time option to disable putting the offlined vcpus into an
  extended H_CEDE state.

- Addressed Ben's comments regarding the if-else sequencing in
  pseries_mach_cpu_die().

- Addition of comments for pseries_cpu_die() to distinguish it from
  pseries_mach_cpu_die()

Also,

- This approach addresses Peter Z's objections regarding layering
  violations. The user simply offlines the cpu and doesn't worry about what
  state the CPU should be put into. That part is automatically handled by the
  kernel.

- It does not add any additional sysfs interface instead uses the existing
  sysfs interface to offline CPUs.

- On platforms which do not have support for ceding the vcpu with a
  latency specifier value, the offlining mechanism defaults to the current
  method of calling rtas_stop_self().

The patchset has been tested on the available pseries platforms and it works
as per the expectations. I believe that the patch set is ready for inclusion.
---

Gautham R Shenoy (4):
      pseries: Serialize cpu hotplug operations during deactivate Vs deallocate
      pseries: Add code to online/offline CPUs of a DLPAR node.
      pSeries: Add hooks to put the CPU into an appropriate offline state
      pSeries: extended_cede_processor() helper function.


 Documentation/cpu-hotplug.txt                   |    6 +
 arch/powerpc/include/asm/lppaca.h               |    9 +
 arch/powerpc/platforms/pseries/dlpar.c          |  129 ++++++++++++++++
 arch/powerpc/platforms/pseries/hotplug-cpu.c    |  182 ++++++++++++++++++++++-
 arch/powerpc/platforms/pseries/offline_states.h |   18 ++
 arch/powerpc/platforms/pseries/plpar_wrappers.h |   22 +++
 arch/powerpc/platforms/pseries/smp.c            |   19 ++
 arch/powerpc/xmon/xmon.c                        |    3 
 drivers/base/cpu.c                              |    2 
 include/linux/cpu.h                             |   13 ++
 10 files changed, 387 insertions(+), 16 deletions(-)
 create mode 100644 arch/powerpc/platforms/pseries/offline_states.h

-- 
Thanks and Regards
gautham.

^ permalink raw reply

* Re: [2/6] Cleanup management of kmem_caches for pagetables
From: David Gibson @ 2009-10-30  5:05 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, linuxppc-dev
In-Reply-To: <20091030033922.GA4144@yookeroo.seuss>

On Fri, Oct 30, 2009 at 02:39:22PM +1100, David Gibson wrote:
> On Thu, Oct 29, 2009 at 01:27:18PM +1100, David Gibson wrote:
> > Oops, there was one big, nasty, stupid bug in this patch.  Corrected
> > patch below.
> 
> Ugh.  And another, which broke compile on all 32-bit platforms.  I'm
> running out of brown paper bags :(.

And now one that might built for 32-bit SMP.  Sigh

Cleanup management of kmem_caches for pagetables

Currently we have a fair bit of rather fiddly code to manage the
various kmem_caches used to store page tables of various levels.  We
generally have two caches holding some combination of PGD, PUD and PMD
tables, plus several more for the special hugepage pagetables.

This patch cleans this all up by taking a different approach.  Rather
than the caches being designated as for PUDs or for hugeptes for 16M
pages, the caches are simply allocated to be a specific size.  Thus
sharing of caches between different types/levels of pagetables happens
naturally.  The pagetable size, where needed, is passed around encoded
in the same way as {PGD,PUD,PMD}_INDEX_SIZE; that is n where the
pagetable contains 2^n pointers.

Signed-off-by: David Gibson <dwg@au1.ibm.com>

Index: working-2.6/arch/powerpc/mm/init_64.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/init_64.c	2009-09-28 12:50:46.000000000 +1000
+++ working-2.6/arch/powerpc/mm/init_64.c	2009-10-30 15:55:09.000000000 +1100
@@ -119,30 +119,58 @@ static void pmd_ctor(void *addr)
 	memset(addr, 0, PMD_TABLE_SIZE);
 }
 
-static const unsigned int pgtable_cache_size[2] = {
-	PGD_TABLE_SIZE, PMD_TABLE_SIZE
-};
-static const char *pgtable_cache_name[ARRAY_SIZE(pgtable_cache_size)] = {
-#ifdef CONFIG_PPC_64K_PAGES
-	"pgd_cache", "pmd_cache",
-#else
-	"pgd_cache", "pud_pmd_cache",
-#endif /* CONFIG_PPC_64K_PAGES */
-};
-
-#ifdef CONFIG_HUGETLB_PAGE
-/* Hugepages need an extra cache per hugepagesize, initialized in
- * hugetlbpage.c.  We can't put into the tables above, because HPAGE_SHIFT
- * is not compile time constant. */
-struct kmem_cache *pgtable_cache[ARRAY_SIZE(pgtable_cache_size)+MMU_PAGE_COUNT];
-#else
-struct kmem_cache *pgtable_cache[ARRAY_SIZE(pgtable_cache_size)];
-#endif
+struct kmem_cache *pgtable_cache[MAX_PGTABLE_INDEX_SIZE];
+
+/*
+ * Create a kmem_cache() for pagetables.  This is not used for PTE
+ * pages - they're linked to struct page, come from the normal free
+ * pages pool and have a different entry size (see real_pte_t) to
+ * everything else.  Caches created by this function are used for all
+ * the higher level pagetables, and for hugepage pagetables.
+ */
+void pgtable_cache_add(unsigned shift, void (*ctor)(void *))
+{
+	char *name;
+	unsigned long table_size = sizeof(void *) << shift;
+	unsigned long align = table_size;
+
+	/* When batching pgtable pointers for RCU freeing, we store
+	 * the index size in the low bits.  Table alignment must be
+	 * big enough to fit it */
+	unsigned long minalign = MAX_PGTABLE_INDEX_SIZE + 1;
+	struct kmem_cache *new;
+
+	/* It would be nice if this was a BUILD_BUG_ON(), but at the
+	 * moment, gcc doesn't seem to recognize is_power_of_2 as a
+	 * constant expression, so so much for that. */
+	BUG_ON(!is_power_of_2(minalign));
+	BUG_ON((shift < 1) || (shift > MAX_PGTABLE_INDEX_SIZE));
+
+	if (PGT_CACHE(shift))
+		return; /* Already have a cache of this size */
+
+	align = max_t(unsigned long, align, minalign);
+	name = kasprintf(GFP_KERNEL, "pgtable-2^%d", shift);
+	new = kmem_cache_create(name, table_size, align, 0, ctor);
+	PGT_CACHE(shift) = new;
+
+	pr_debug("Allocated pgtable cache for order %d\n", shift);
+}
+
 
 void pgtable_cache_init(void)
 {
-	pgtable_cache[0] = kmem_cache_create(pgtable_cache_name[0], PGD_TABLE_SIZE, PGD_TABLE_SIZE, SLAB_PANIC, pgd_ctor);
-	pgtable_cache[1] = kmem_cache_create(pgtable_cache_name[1], PMD_TABLE_SIZE, PMD_TABLE_SIZE, SLAB_PANIC, pmd_ctor);
+	pgtable_cache_add(PGD_INDEX_SIZE, pgd_ctor);
+	pgtable_cache_add(PMD_INDEX_SIZE, pmd_ctor);
+	if (!PGT_CACHE(PGD_INDEX_SIZE) || !PGT_CACHE(PMD_INDEX_SIZE))
+		panic("Couldn't allocate pgtable caches");
+
+	/* In all current configs, when the PUD index exists it's the
+	 * same size as either the pgd or pmd index.  Verify that the
+	 * initialization above has also created a PUD cache.  This
+	 * will need re-examiniation if we add new possibilities for
+	 * the pagetable layout. */
+	BUG_ON(PUD_INDEX_SIZE && !PGT_CACHE(PUD_INDEX_SIZE));
 }
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
Index: working-2.6/arch/powerpc/include/asm/pgalloc-64.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgalloc-64.h	2009-08-03 16:00:45.000000000 +1000
+++ working-2.6/arch/powerpc/include/asm/pgalloc-64.h	2009-10-30 14:05:24.000000000 +1100
@@ -11,27 +11,39 @@
 #include <linux/cpumask.h>
 #include <linux/percpu.h>
 
+/*
+ * Functions that deal with pagetables that could be at any level of
+ * the table need to be passed an "index_size" so they know how to
+ * handle allocation.  For PTE pages (which are linked to a struct
+ * page for now, and drawn from the main get_free_pages() pool), the
+ * allocation size will be (2^index_size * sizeof(pointer)) and
+ * allocations are drawn from the kmem_cache in PGT_CACHE(index_size).
+ *
+ * The maximum index size needs to be big enough to allow any
+ * pagetable sizes we need, but small enough to fit in the low bits of
+ * any page table pointer.  In other words all pagetables, even tiny
+ * ones, must be aligned to allow at least enough low 0 bits to
+ * contain this value.  This value is also used as a mask, so it must
+ * be one less than a power of two.
+ */
+#define MAX_PGTABLE_INDEX_SIZE	0xf
+
 #ifndef CONFIG_PPC_SUBPAGE_PROT
 static inline void subpage_prot_free(pgd_t *pgd) {}
 #endif
 
 extern struct kmem_cache *pgtable_cache[];
-
-#define PGD_CACHE_NUM		0
-#define PUD_CACHE_NUM		1
-#define PMD_CACHE_NUM		1
-#define HUGEPTE_CACHE_NUM	2
-#define PTE_NONCACHE_NUM	7  /* from GFP rather than kmem_cache */
+#define PGT_CACHE(shift) (pgtable_cache[(shift)-1])
 
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	return kmem_cache_alloc(pgtable_cache[PGD_CACHE_NUM], GFP_KERNEL);
+	return kmem_cache_alloc(PGT_CACHE(PGD_INDEX_SIZE), GFP_KERNEL);
 }
 
 static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
 	subpage_prot_free(pgd);
-	kmem_cache_free(pgtable_cache[PGD_CACHE_NUM], pgd);
+	kmem_cache_free(PGT_CACHE(PGD_INDEX_SIZE), pgd);
 }
 
 #ifndef CONFIG_PPC_64K_PAGES
@@ -40,13 +52,13 @@ static inline void pgd_free(struct mm_st
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return kmem_cache_alloc(pgtable_cache[PUD_CACHE_NUM],
+	return kmem_cache_alloc(PGT_CACHE(PUD_INDEX_SIZE),
 				GFP_KERNEL|__GFP_REPEAT);
 }
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
 {
-	kmem_cache_free(pgtable_cache[PUD_CACHE_NUM], pud);
+	kmem_cache_free(PGT_CACHE(PUD_INDEX_SIZE), pud);
 }
 
 static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
@@ -78,13 +90,13 @@ static inline void pmd_populate_kernel(s
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return kmem_cache_alloc(pgtable_cache[PMD_CACHE_NUM],
+	return kmem_cache_alloc(PGT_CACHE(PMD_INDEX_SIZE),
 				GFP_KERNEL|__GFP_REPEAT);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 {
-	kmem_cache_free(pgtable_cache[PMD_CACHE_NUM], pmd);
+	kmem_cache_free(PGT_CACHE(PMD_INDEX_SIZE), pmd);
 }
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
@@ -107,24 +119,22 @@ static inline pgtable_t pte_alloc_one(st
 	return page;
 }
 
-static inline void pgtable_free(pgtable_free_t pgf)
+static inline void pgtable_free(void *table, unsigned index_size)
 {
-	void *p = (void *)(pgf.val & ~PGF_CACHENUM_MASK);
-	int cachenum = pgf.val & PGF_CACHENUM_MASK;
-
-	if (cachenum == PTE_NONCACHE_NUM)
-		free_page((unsigned long)p);
-	else
-		kmem_cache_free(pgtable_cache[cachenum], p);
+	if (!index_size)
+		free_page((unsigned long)table);
+	else {
+		BUG_ON(index_size > MAX_PGTABLE_INDEX_SIZE);
+		kmem_cache_free(PGT_CACHE(index_size), table);
+	}
 }
 
-#define __pmd_free_tlb(tlb, pmd,addr)		      \
-	pgtable_free_tlb(tlb, pgtable_free_cache(pmd, \
-		PMD_CACHE_NUM, PMD_TABLE_SIZE-1))
+#define __pmd_free_tlb(tlb, pmd, addr)		      \
+	pgtable_free_tlb(tlb, pmd, PMD_INDEX_SIZE)
 #ifndef CONFIG_PPC_64K_PAGES
 #define __pud_free_tlb(tlb, pud, addr)		      \
-	pgtable_free_tlb(tlb, pgtable_free_cache(pud, \
-		PUD_CACHE_NUM, PUD_TABLE_SIZE-1))
+	pgtable_free_tlb(tlb, pud, PUD_INDEX_SIZE)
+
 #endif /* CONFIG_PPC_64K_PAGES */
 
 #define check_pgt_cache()	do { } while (0)
Index: working-2.6/arch/powerpc/include/asm/pgalloc.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgalloc.h	2009-08-14 16:07:54.000000000 +1000
+++ working-2.6/arch/powerpc/include/asm/pgalloc.h	2009-10-30 14:05:24.000000000 +1100
@@ -24,25 +24,6 @@ static inline void pte_free(struct mm_st
 	__free_page(ptepage);
 }
 
-typedef struct pgtable_free {
-	unsigned long val;
-} pgtable_free_t;
-
-/* This needs to be big enough to allow for MMU_PAGE_COUNT + 2 to be stored
- * and small enough to fit in the low bits of any naturally aligned page
- * table cache entry. Arbitrarily set to 0x1f, that should give us some
- * room to grow
- */
-#define PGF_CACHENUM_MASK	0x1f
-
-static inline pgtable_free_t pgtable_free_cache(void *p, int cachenum,
-						unsigned long mask)
-{
-	BUG_ON(cachenum > PGF_CACHENUM_MASK);
-
-	return (pgtable_free_t){.val = ((unsigned long) p & ~mask) | cachenum};
-}
-
 #ifdef CONFIG_PPC64
 #include <asm/pgalloc-64.h>
 #else
@@ -50,12 +31,12 @@ static inline pgtable_free_t pgtable_fre
 #endif
 
 #ifdef CONFIG_SMP
-extern void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf);
+extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift);
 extern void pte_free_finish(void);
 #else /* CONFIG_SMP */
-static inline void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf)
+static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
-	pgtable_free(pgf);
+	pgtable_free(table, shift);
 }
 static inline void pte_free_finish(void) { }
 #endif /* !CONFIG_SMP */
@@ -63,12 +44,9 @@ static inline void pte_free_finish(void)
 static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
 				  unsigned long address)
 {
-	pgtable_free_t pgf = pgtable_free_cache(page_address(ptepage),
-						PTE_NONCACHE_NUM,
-						PTE_TABLE_SIZE-1);
 	tlb_flush_pgtable(tlb, address);
 	pgtable_page_dtor(ptepage);
-	pgtable_free_tlb(tlb, pgf);
+	pgtable_free_tlb(tlb, page_address(ptepage), 0);
 }
 
 #endif /* __KERNEL__ */
Index: working-2.6/arch/powerpc/mm/pgtable.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/pgtable.c	2009-10-27 11:59:11.000000000 +1100
+++ working-2.6/arch/powerpc/mm/pgtable.c	2009-10-30 14:05:24.000000000 +1100
@@ -49,12 +49,12 @@ struct pte_freelist_batch
 {
 	struct rcu_head	rcu;
 	unsigned int	index;
-	pgtable_free_t	tables[0];
+	unsigned long	tables[0];
 };
 
 #define PTE_FREELIST_SIZE \
 	((PAGE_SIZE - sizeof(struct pte_freelist_batch)) \
-	  / sizeof(pgtable_free_t))
+	  / sizeof(unsigned long))
 
 static void pte_free_smp_sync(void *arg)
 {
@@ -64,13 +64,13 @@ static void pte_free_smp_sync(void *arg)
 /* This is only called when we are critically out of memory
  * (and fail to get a page in pte_free_tlb).
  */
-static void pgtable_free_now(pgtable_free_t pgf)
+static void pgtable_free_now(void *table, unsigned shift)
 {
 	pte_freelist_forced_free++;
 
 	smp_call_function(pte_free_smp_sync, NULL, 1);
 
-	pgtable_free(pgf);
+	pgtable_free(table, shift);
 }
 
 static void pte_free_rcu_callback(struct rcu_head *head)
@@ -79,8 +79,12 @@ static void pte_free_rcu_callback(struct
 		container_of(head, struct pte_freelist_batch, rcu);
 	unsigned int i;
 
-	for (i = 0; i < batch->index; i++)
-		pgtable_free(batch->tables[i]);
+	for (i = 0; i < batch->index; i++) {
+		void *table = (void *)(batch->tables[i] & ~MAX_PGTABLE_INDEX_SIZE);
+		unsigned shift = batch->tables[i] & MAX_PGTABLE_INDEX_SIZE;
+
+		pgtable_free(table, shift);
+	}
 
 	free_page((unsigned long)batch);
 }
@@ -91,25 +95,28 @@ static void pte_free_submit(struct pte_f
 	call_rcu(&batch->rcu, pte_free_rcu_callback);
 }
 
-void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf)
+void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
 	/* This is safe since tlb_gather_mmu has disabled preemption */
 	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
+	unsigned long pgf;
 
 	if (atomic_read(&tlb->mm->mm_users) < 2 ||
 	    cpumask_equal(mm_cpumask(tlb->mm), cpumask_of(smp_processor_id()))){
-		pgtable_free(pgf);
+		pgtable_free(table, shift);
 		return;
 	}
 
 	if (*batchp == NULL) {
 		*batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC);
 		if (*batchp == NULL) {
-			pgtable_free_now(pgf);
+			pgtable_free_now(table, shift);
 			return;
 		}
 		(*batchp)->index = 0;
 	}
+	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+	pgf = (unsigned long)table | shift;
 	(*batchp)->tables[(*batchp)->index++] = pgf;
 	if ((*batchp)->index == PTE_FREELIST_SIZE) {
 		pte_free_submit(*batchp);
Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2009-10-30 14:05:24.000000000 +1100
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2009-10-30 15:57:46.000000000 +1100
@@ -43,26 +43,14 @@ static unsigned nr_gpages;
 unsigned int mmu_huge_psizes[MMU_PAGE_COUNT] = { }; /* initialize all to 0 */
 
 #define hugepte_shift			mmu_huge_psizes
-#define PTRS_PER_HUGEPTE(psize)		(1 << hugepte_shift[psize])
-#define HUGEPTE_TABLE_SIZE(psize)	(sizeof(pte_t) << hugepte_shift[psize])
+#define HUGEPTE_INDEX_SIZE(psize)	(mmu_huge_psizes[(psize)])
+#define PTRS_PER_HUGEPTE(psize)		(1 << mmu_huge_psizes[psize])
 
 #define HUGEPD_SHIFT(psize)		(mmu_psize_to_shift(psize) \
-						+ hugepte_shift[psize])
+					 + HUGEPTE_INDEX_SIZE(psize))
 #define HUGEPD_SIZE(psize)		(1UL << HUGEPD_SHIFT(psize))
 #define HUGEPD_MASK(psize)		(~(HUGEPD_SIZE(psize)-1))
 
-/* Subtract one from array size because we don't need a cache for 4K since
- * is not a huge page size */
-#define HUGE_PGTABLE_INDEX(psize)	(HUGEPTE_CACHE_NUM + psize - 1)
-#define HUGEPTE_CACHE_NAME(psize)	(huge_pgtable_cache_name[psize])
-
-static const char *huge_pgtable_cache_name[MMU_PAGE_COUNT] = {
-	[MMU_PAGE_64K]	= "hugepte_cache_64K",
-	[MMU_PAGE_1M]	= "hugepte_cache_1M",
-	[MMU_PAGE_16M]	= "hugepte_cache_16M",
-	[MMU_PAGE_16G]	= "hugepte_cache_16G",
-};
-
 /* Flag to mark huge PD pointers.  This means pmd_bad() and pud_bad()
  * will choke on pointers to hugepte tables, which is handy for
  * catching screwups early. */
@@ -114,15 +102,15 @@ static inline pte_t *hugepte_offset(huge
 static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 			   unsigned long address, unsigned int psize)
 {
-	pte_t *new = kmem_cache_zalloc(pgtable_cache[HUGE_PGTABLE_INDEX(psize)],
-				      GFP_KERNEL|__GFP_REPEAT);
+	pte_t *new = kmem_cache_zalloc(PGT_CACHE(hugepte_shift[psize]),
+				       GFP_KERNEL|__GFP_REPEAT);
 
 	if (! new)
 		return -ENOMEM;
 
 	spin_lock(&mm->page_table_lock);
 	if (!hugepd_none(*hpdp))
-		kmem_cache_free(pgtable_cache[HUGE_PGTABLE_INDEX(psize)], new);
+		kmem_cache_free(PGT_CACHE(hugepte_shift[psize]), new);
 	else
 		hpdp->pd = (unsigned long)new | HUGEPD_OK;
 	spin_unlock(&mm->page_table_lock);
@@ -271,9 +259,7 @@ static void free_hugepte_range(struct mm
 
 	hpdp->pd = 0;
 	tlb->need_flush = 1;
-	pgtable_free_tlb(tlb, pgtable_free_cache(hugepte,
-						 HUGEPTE_CACHE_NUM+psize-1,
-						 PGF_CACHENUM_MASK));
+	pgtable_free_tlb(tlb, hugepte, hugepte_shift[psize]);
 }
 
 static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
@@ -698,8 +684,6 @@ static void __init set_huge_psize(int ps
 		if (mmu_huge_psizes[psize] ||
 		   mmu_psize_defs[psize].shift == PAGE_SHIFT)
 			return;
-		if (WARN_ON(HUGEPTE_CACHE_NAME(psize) == NULL))
-			return;
 		hugetlb_add_hstate(mmu_psize_defs[psize].shift - PAGE_SHIFT);
 
 		switch (mmu_psize_defs[psize].shift) {
@@ -753,9 +737,9 @@ static int __init hugetlbpage_init(void)
 	if (!cpu_has_feature(CPU_FTR_16M_PAGE))
 		return -ENODEV;
 
-	/* Add supported huge page sizes.  Need to change HUGE_MAX_HSTATE
-	 * and adjust PTE_NONCACHE_NUM if the number of supported huge page
-	 * sizes changes.
+	/* Add supported huge page sizes.  Need to change
+	 *  HUGE_MAX_HSTATE if the number of supported huge page sizes
+	 *  changes.
 	 */
 	set_huge_psize(MMU_PAGE_16M);
 	set_huge_psize(MMU_PAGE_16G);
@@ -769,16 +753,11 @@ static int __init hugetlbpage_init(void)
 
 	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
 		if (mmu_huge_psizes[psize]) {
-			pgtable_cache[HUGE_PGTABLE_INDEX(psize)] =
-				kmem_cache_create(
-					HUGEPTE_CACHE_NAME(psize),
-					HUGEPTE_TABLE_SIZE(psize),
-					HUGEPTE_TABLE_SIZE(psize),
-					0,
-					NULL);
-			if (!pgtable_cache[HUGE_PGTABLE_INDEX(psize)])
-				panic("hugetlbpage_init(): could not create %s"\
-				      "\n", HUGEPTE_CACHE_NAME(psize));
+			pgtable_cache_add(hugepte_shift[psize], NULL);
+			if (!PGT_CACHE(hugepte_shift[psize]))
+				panic("hugetlbpage_init(): could not create "
+				      "pgtable cache for %d bit pagesize\n",
+				      mmu_psize_to_shift(psize));
 		}
 	}
 
Index: working-2.6/arch/powerpc/include/asm/pgtable-ppc64.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgtable-ppc64.h	2009-08-28 13:46:31.000000000 +1000
+++ working-2.6/arch/powerpc/include/asm/pgtable-ppc64.h	2009-10-30 15:55:09.000000000 +1100
@@ -354,6 +354,7 @@ static inline void __ptep_set_access_fla
 #define pgoff_to_pte(off)	((pte_t) {((off) << PTE_RPN_SHIFT)|_PAGE_FILE})
 #define PTE_FILE_MAX_BITS	(BITS_PER_LONG - PTE_RPN_SHIFT)
 
+void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
 void pgtable_cache_init(void);
 
 /*
Index: working-2.6/arch/powerpc/include/asm/pgalloc-32.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgalloc-32.h	2009-10-30 14:09:30.000000000 +1100
+++ working-2.6/arch/powerpc/include/asm/pgalloc-32.h	2009-10-30 15:57:20.000000000 +1100
@@ -3,7 +3,8 @@
 
 #include <linux/threads.h>
 
-#define PTE_NONCACHE_NUM	0  /* dummy for now to share code w/ppc64 */
+/* For 32-bit, all levels of page tables are just drawn from get_free_page() */
+#define MAX_PGTABLE_INDEX_SIZE	0
 
 extern void __bad_pte(pmd_t *pmd);
 
@@ -36,11 +37,10 @@ extern void pgd_free(struct mm_struct *m
 extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr);
 extern pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long addr);
 
-static inline void pgtable_free(pgtable_free_t pgf)
+static inline void pgtable_free(void *table, unsigned index_size)
 {
-	void *p = (void *)(pgf.val & ~PGF_CACHENUM_MASK);
-
-	free_page((unsigned long)p);
+	BUG_ON(index_size); /* 32-bit doesn't use this */
+	free_page((unsigned long)table);
 }
 
 #define check_pgt_cache()	do { } while (0)

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply

* Re: [git pull] Please pull powerpc.git merge branch
From: Benjamin Herrenschmidt @ 2009-10-30  4:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linuxppc-dev list, Andrew Morton, Linux Kernel list
In-Reply-To: <alpine.LFD.2.01.0910290911550.31845@localhost.localdomain>

On Thu, 2009-10-29 at 09:14 -0700, Linus Torvalds wrote:
> 
> On Tue, 27 Oct 2009, Benjamin Herrenschmidt wrote:
> > 
> > Kumar Gala (7):
> >       powerpc: Add a Book-3E 64-bit defconfig
> >       powerpc: Fix compile errors found by new ppc64e_defconfig
> >       powerpc: Limit hugetlbfs support to PPC64 Book-3S machines
> 
> This is incredibly ugly. Why should the generic fs/Kconfig know about some 
> random odd architecture detail like PPC_BOOK3S_64?
> 
> I merged it, and noticed this because Super-H caused clashes by cleaning 
> up. I would suggest PowerPC do the same.
> 
> This patch is not signed-off, nor do I want any credit. But if it works on 
> ppc, please send me something like this back.

Pushed this to my merge branch:

(BTW. Where does this SYS_* form comes from ? We had ARCH_ now we have
SYS_ * ? Oic... it's a mips thing... oh well no big deal)

>From 5a1eb5c4453207ad9e7f6e8ca4f8db289743c993 Mon Sep 17 00:00:00 2001
From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Fri, 30 Oct 2009 15:03:54 +1100
Subject: [PATCH] powerpc: Cleanup Kconfig selection of hugetlbfs support

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 arch/powerpc/Kconfig |    4 ++++
 fs/Kconfig           |    2 +-
 2 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 10a0a54..2ba14e7 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -414,6 +414,10 @@ config ARCH_SPARSEMEM_DEFAULT
 config ARCH_POPULATES_NODE_MAP
 	def_bool y
 
+config SYS_SUPPORTS_HUGETLBFS
+       def_bool y
+       depends on PPC_BOOK3S_64
+
 source "mm/Kconfig"
 
 config ARCH_MEMORY_PROBE
diff --git a/fs/Kconfig b/fs/Kconfig
index 2126078..64d44ef 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -135,7 +135,7 @@ config TMPFS_POSIX_ACL
 
 config HUGETLBFS
 	bool "HugeTLB file system support"
-	depends on X86 || IA64 || PPC_BOOK3S_64 || SPARC64 || (S390 && 64BIT) || \
+	depends on X86 || IA64 || SPARC64 || (S390 && 64BIT) || \
 		   SYS_SUPPORTS_HUGETLBFS || BROKEN
 	help
 	  hugetlbfs is a filesystem backing for HugeTLB pages, based on
-- 
1.6.1.2.14.gf26b5

^ permalink raw reply related

* Re: [2/6] Cleanup management of kmem_caches for pagetables
From: David Gibson @ 2009-10-30  3:39 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, linuxppc-dev
In-Reply-To: <20091029022718.GC10725@yookeroo.seuss>

On Thu, Oct 29, 2009 at 01:27:18PM +1100, David Gibson wrote:
> Oops, there was one big, nasty, stupid bug in this patch.  Corrected
> patch below.

Ugh.  And another, which broke compile on all 32-bit platforms.  I'm
running out of brown paper bags :(.

Cleanup management of kmem_caches for pagetables

Currently we have a fair bit of rather fiddly code to manage the
various kmem_caches used to store page tables of various levels.  We
generally have two caches holding some combination of PGD, PUD and PMD
tables, plus several more for the special hugepage pagetables.

This patch cleans this all up by taking a different approach.  Rather
than the caches being designated as for PUDs or for hugeptes for 16M
pages, the caches are simply allocated to be a specific size.  Thus
sharing of caches between different types/levels of pagetables happens
naturally.  The pagetable size, where needed, is passed around encoded
in the same way as {PGD,PUD,PMD}_INDEX_SIZE; that is n where the
pagetable contains 2^n pointers.

Signed-off-by: David Gibson <dwg@au1.ibm.com>

Index: working-2.6/arch/powerpc/mm/init_64.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/init_64.c	2009-09-28 12:50:46.000000000 +1000
+++ working-2.6/arch/powerpc/mm/init_64.c	2009-10-30 14:08:40.000000000 +1100
@@ -119,30 +119,58 @@ static void pmd_ctor(void *addr)
 	memset(addr, 0, PMD_TABLE_SIZE);
 }
 
-static const unsigned int pgtable_cache_size[2] = {
-	PGD_TABLE_SIZE, PMD_TABLE_SIZE
-};
-static const char *pgtable_cache_name[ARRAY_SIZE(pgtable_cache_size)] = {
-#ifdef CONFIG_PPC_64K_PAGES
-	"pgd_cache", "pmd_cache",
-#else
-	"pgd_cache", "pud_pmd_cache",
-#endif /* CONFIG_PPC_64K_PAGES */
-};
-
-#ifdef CONFIG_HUGETLB_PAGE
-/* Hugepages need an extra cache per hugepagesize, initialized in
- * hugetlbpage.c.  We can't put into the tables above, because HPAGE_SHIFT
- * is not compile time constant. */
-struct kmem_cache *pgtable_cache[ARRAY_SIZE(pgtable_cache_size)+MMU_PAGE_COUNT];
-#else
-struct kmem_cache *pgtable_cache[ARRAY_SIZE(pgtable_cache_size)];
-#endif
+struct kmem_cache *pgtable_cache[MAX_PGTABLE_INDEX_SIZE];
+
+/*
+ * Create a kmem_cache() for pagetables.  This is not used for PTE
+ * pages - they're linked to struct page, come from the normal free
+ * pages pool and have a different entry size (see real_pte_t) to
+ * everything else.  Caches created by this function are used for all
+ * the higher level pagetables, and for hugepage pagetables.
+ */
+void pgtable_cache_add(unsigned shift, void (*ctor)(void *))
+{
+	char *name;
+	unsigned long table_size = sizeof(void *) << shift;
+	unsigned long align = table_size;
+
+	/* When batching pgtable pointers for RCU freeing, we store
+	 * the index size in the low bits.  Table alignment must be
+	 * big enough to fit it */
+	unsigned long minalign = MAX_PGTABLE_INDEX_SIZE + 1;
+	struct kmem_cache *new;
+
+	/* It would be nice if this was a BUILD_BUG_ON(), but at the
+	 * moment, gcc doesn't seem to recognize is_power_of_2 as a
+	 * constant expression, so so much for that. */
+	BUG_ON(!is_power_of_2(minalign));
+	BUG_ON((shift < 1) || (shift > MAX_PGTABLE_INDEX_SIZE));
+
+	if (PGT_CACHE(shift))
+		return; /* Already have a cache of this size */
+
+	align = max_t(unsigned long, align, minalign);
+	name = kasprintf(GFP_KERNEL, "pgtable-2^%d", shift);
+	new = kmem_cache_create(name, table_size, align, 0, ctor);
+	PGT_CACHE(shift) = new;
+
+	pr_debug("Allocated pgtable cache for order %d\n", shift);
+}
+
 
 void pgtable_cache_init(void)
 {
-	pgtable_cache[0] = kmem_cache_create(pgtable_cache_name[0], PGD_TABLE_SIZE, PGD_TABLE_SIZE, SLAB_PANIC, pgd_ctor);
-	pgtable_cache[1] = kmem_cache_create(pgtable_cache_name[1], PMD_TABLE_SIZE, PMD_TABLE_SIZE, SLAB_PANIC, pmd_ctor);
+	pgtable_cache_add(PGD_INDEX_SIZE, pgd_ctor);
+	pgtable_cache_add(PMD_INDEX_SIZE, pmd_ctor);
+	if (!PGT_CACHE(PGD_INDEX_SIZE) || !PGT_CACHE(PMD_INDEX_SIZE))
+		panic("Couldn't allocate pgtable caches");
+
+	/* In all current configs, when the PUD index exists it's the
+	 * same size as either the pgd or pmd index.  Verify that the
+	 * initialization above has also created a PUD cache.  This
+	 * will need re-examiniation if we add new possibilities for
+	 * the pagetable layout. */
+	BUG_ON(PUD_INDEX_SIZE && !PGT_CACHE(PUD_INDEX_SIZE));
 }
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
Index: working-2.6/arch/powerpc/include/asm/pgalloc-64.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgalloc-64.h	2009-08-03 16:00:45.000000000 +1000
+++ working-2.6/arch/powerpc/include/asm/pgalloc-64.h	2009-10-30 14:05:24.000000000 +1100
@@ -11,27 +11,39 @@
 #include <linux/cpumask.h>
 #include <linux/percpu.h>
 
+/*
+ * Functions that deal with pagetables that could be at any level of
+ * the table need to be passed an "index_size" so they know how to
+ * handle allocation.  For PTE pages (which are linked to a struct
+ * page for now, and drawn from the main get_free_pages() pool), the
+ * allocation size will be (2^index_size * sizeof(pointer)) and
+ * allocations are drawn from the kmem_cache in PGT_CACHE(index_size).
+ *
+ * The maximum index size needs to be big enough to allow any
+ * pagetable sizes we need, but small enough to fit in the low bits of
+ * any page table pointer.  In other words all pagetables, even tiny
+ * ones, must be aligned to allow at least enough low 0 bits to
+ * contain this value.  This value is also used as a mask, so it must
+ * be one less than a power of two.
+ */
+#define MAX_PGTABLE_INDEX_SIZE	0xf
+
 #ifndef CONFIG_PPC_SUBPAGE_PROT
 static inline void subpage_prot_free(pgd_t *pgd) {}
 #endif
 
 extern struct kmem_cache *pgtable_cache[];
-
-#define PGD_CACHE_NUM		0
-#define PUD_CACHE_NUM		1
-#define PMD_CACHE_NUM		1
-#define HUGEPTE_CACHE_NUM	2
-#define PTE_NONCACHE_NUM	7  /* from GFP rather than kmem_cache */
+#define PGT_CACHE(shift) (pgtable_cache[(shift)-1])
 
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	return kmem_cache_alloc(pgtable_cache[PGD_CACHE_NUM], GFP_KERNEL);
+	return kmem_cache_alloc(PGT_CACHE(PGD_INDEX_SIZE), GFP_KERNEL);
 }
 
 static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
 	subpage_prot_free(pgd);
-	kmem_cache_free(pgtable_cache[PGD_CACHE_NUM], pgd);
+	kmem_cache_free(PGT_CACHE(PGD_INDEX_SIZE), pgd);
 }
 
 #ifndef CONFIG_PPC_64K_PAGES
@@ -40,13 +52,13 @@ static inline void pgd_free(struct mm_st
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return kmem_cache_alloc(pgtable_cache[PUD_CACHE_NUM],
+	return kmem_cache_alloc(PGT_CACHE(PUD_INDEX_SIZE),
 				GFP_KERNEL|__GFP_REPEAT);
 }
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
 {
-	kmem_cache_free(pgtable_cache[PUD_CACHE_NUM], pud);
+	kmem_cache_free(PGT_CACHE(PUD_INDEX_SIZE), pud);
 }
 
 static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
@@ -78,13 +90,13 @@ static inline void pmd_populate_kernel(s
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return kmem_cache_alloc(pgtable_cache[PMD_CACHE_NUM],
+	return kmem_cache_alloc(PGT_CACHE(PMD_INDEX_SIZE),
 				GFP_KERNEL|__GFP_REPEAT);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 {
-	kmem_cache_free(pgtable_cache[PMD_CACHE_NUM], pmd);
+	kmem_cache_free(PGT_CACHE(PMD_INDEX_SIZE), pmd);
 }
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
@@ -107,24 +119,22 @@ static inline pgtable_t pte_alloc_one(st
 	return page;
 }
 
-static inline void pgtable_free(pgtable_free_t pgf)
+static inline void pgtable_free(void *table, unsigned index_size)
 {
-	void *p = (void *)(pgf.val & ~PGF_CACHENUM_MASK);
-	int cachenum = pgf.val & PGF_CACHENUM_MASK;
-
-	if (cachenum == PTE_NONCACHE_NUM)
-		free_page((unsigned long)p);
-	else
-		kmem_cache_free(pgtable_cache[cachenum], p);
+	if (!index_size)
+		free_page((unsigned long)table);
+	else {
+		BUG_ON(index_size > MAX_PGTABLE_INDEX_SIZE);
+		kmem_cache_free(PGT_CACHE(index_size), table);
+	}
 }
 
-#define __pmd_free_tlb(tlb, pmd,addr)		      \
-	pgtable_free_tlb(tlb, pgtable_free_cache(pmd, \
-		PMD_CACHE_NUM, PMD_TABLE_SIZE-1))
+#define __pmd_free_tlb(tlb, pmd, addr)		      \
+	pgtable_free_tlb(tlb, pmd, PMD_INDEX_SIZE)
 #ifndef CONFIG_PPC_64K_PAGES
 #define __pud_free_tlb(tlb, pud, addr)		      \
-	pgtable_free_tlb(tlb, pgtable_free_cache(pud, \
-		PUD_CACHE_NUM, PUD_TABLE_SIZE-1))
+	pgtable_free_tlb(tlb, pud, PUD_INDEX_SIZE)
+
 #endif /* CONFIG_PPC_64K_PAGES */
 
 #define check_pgt_cache()	do { } while (0)
Index: working-2.6/arch/powerpc/include/asm/pgalloc.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgalloc.h	2009-08-14 16:07:54.000000000 +1000
+++ working-2.6/arch/powerpc/include/asm/pgalloc.h	2009-10-30 14:05:24.000000000 +1100
@@ -24,25 +24,6 @@ static inline void pte_free(struct mm_st
 	__free_page(ptepage);
 }
 
-typedef struct pgtable_free {
-	unsigned long val;
-} pgtable_free_t;
-
-/* This needs to be big enough to allow for MMU_PAGE_COUNT + 2 to be stored
- * and small enough to fit in the low bits of any naturally aligned page
- * table cache entry. Arbitrarily set to 0x1f, that should give us some
- * room to grow
- */
-#define PGF_CACHENUM_MASK	0x1f
-
-static inline pgtable_free_t pgtable_free_cache(void *p, int cachenum,
-						unsigned long mask)
-{
-	BUG_ON(cachenum > PGF_CACHENUM_MASK);
-
-	return (pgtable_free_t){.val = ((unsigned long) p & ~mask) | cachenum};
-}
-
 #ifdef CONFIG_PPC64
 #include <asm/pgalloc-64.h>
 #else
@@ -50,12 +31,12 @@ static inline pgtable_free_t pgtable_fre
 #endif
 
 #ifdef CONFIG_SMP
-extern void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf);
+extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift);
 extern void pte_free_finish(void);
 #else /* CONFIG_SMP */
-static inline void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf)
+static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
-	pgtable_free(pgf);
+	pgtable_free(table, shift);
 }
 static inline void pte_free_finish(void) { }
 #endif /* !CONFIG_SMP */
@@ -63,12 +44,9 @@ static inline void pte_free_finish(void)
 static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
 				  unsigned long address)
 {
-	pgtable_free_t pgf = pgtable_free_cache(page_address(ptepage),
-						PTE_NONCACHE_NUM,
-						PTE_TABLE_SIZE-1);
 	tlb_flush_pgtable(tlb, address);
 	pgtable_page_dtor(ptepage);
-	pgtable_free_tlb(tlb, pgf);
+	pgtable_free_tlb(tlb, page_address(ptepage), 0);
 }
 
 #endif /* __KERNEL__ */
Index: working-2.6/arch/powerpc/mm/pgtable.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/pgtable.c	2009-10-27 11:59:11.000000000 +1100
+++ working-2.6/arch/powerpc/mm/pgtable.c	2009-10-30 14:05:24.000000000 +1100
@@ -49,12 +49,12 @@ struct pte_freelist_batch
 {
 	struct rcu_head	rcu;
 	unsigned int	index;
-	pgtable_free_t	tables[0];
+	unsigned long	tables[0];
 };
 
 #define PTE_FREELIST_SIZE \
 	((PAGE_SIZE - sizeof(struct pte_freelist_batch)) \
-	  / sizeof(pgtable_free_t))
+	  / sizeof(unsigned long))
 
 static void pte_free_smp_sync(void *arg)
 {
@@ -64,13 +64,13 @@ static void pte_free_smp_sync(void *arg)
 /* This is only called when we are critically out of memory
  * (and fail to get a page in pte_free_tlb).
  */
-static void pgtable_free_now(pgtable_free_t pgf)
+static void pgtable_free_now(void *table, unsigned shift)
 {
 	pte_freelist_forced_free++;
 
 	smp_call_function(pte_free_smp_sync, NULL, 1);
 
-	pgtable_free(pgf);
+	pgtable_free(table, shift);
 }
 
 static void pte_free_rcu_callback(struct rcu_head *head)
@@ -79,8 +79,12 @@ static void pte_free_rcu_callback(struct
 		container_of(head, struct pte_freelist_batch, rcu);
 	unsigned int i;
 
-	for (i = 0; i < batch->index; i++)
-		pgtable_free(batch->tables[i]);
+	for (i = 0; i < batch->index; i++) {
+		void *table = (void *)(batch->tables[i] & ~MAX_PGTABLE_INDEX_SIZE);
+		unsigned shift = batch->tables[i] & MAX_PGTABLE_INDEX_SIZE;
+
+		pgtable_free(table, shift);
+	}
 
 	free_page((unsigned long)batch);
 }
@@ -91,25 +95,28 @@ static void pte_free_submit(struct pte_f
 	call_rcu(&batch->rcu, pte_free_rcu_callback);
 }
 
-void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf)
+void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
 	/* This is safe since tlb_gather_mmu has disabled preemption */
 	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
+	unsigned long pgf;
 
 	if (atomic_read(&tlb->mm->mm_users) < 2 ||
 	    cpumask_equal(mm_cpumask(tlb->mm), cpumask_of(smp_processor_id()))){
-		pgtable_free(pgf);
+		pgtable_free(table, shift);
 		return;
 	}
 
 	if (*batchp == NULL) {
 		*batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC);
 		if (*batchp == NULL) {
-			pgtable_free_now(pgf);
+			pgtable_free_now(table, shift);
 			return;
 		}
 		(*batchp)->index = 0;
 	}
+	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+	pgf = (unsigned long)table | shift;
 	(*batchp)->tables[(*batchp)->index++] = pgf;
 	if ((*batchp)->index == PTE_FREELIST_SIZE) {
 		pte_free_submit(*batchp);
Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2009-10-30 14:05:24.000000000 +1100
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2009-10-30 14:08:40.000000000 +1100
@@ -43,26 +43,14 @@ static unsigned nr_gpages;
 unsigned int mmu_huge_psizes[MMU_PAGE_COUNT] = { }; /* initialize all to 0 */
 
 #define hugepte_shift			mmu_huge_psizes
-#define PTRS_PER_HUGEPTE(psize)		(1 << hugepte_shift[psize])
-#define HUGEPTE_TABLE_SIZE(psize)	(sizeof(pte_t) << hugepte_shift[psize])
+#define HUGEPTE_INDEX_SIZE(psize)	(mmu_huge_psizes[(psize)])
+#define PTRS_PER_HUGEPTE(psize)		(1 << mmu_huge_psizes[psize])
 
 #define HUGEPD_SHIFT(psize)		(mmu_psize_to_shift(psize) \
-						+ hugepte_shift[psize])
+					 + HUGEPTE_INDEX_SIZE(psize))
 #define HUGEPD_SIZE(psize)		(1UL << HUGEPD_SHIFT(psize))
 #define HUGEPD_MASK(psize)		(~(HUGEPD_SIZE(psize)-1))
 
-/* Subtract one from array size because we don't need a cache for 4K since
- * is not a huge page size */
-#define HUGE_PGTABLE_INDEX(psize)	(HUGEPTE_CACHE_NUM + psize - 1)
-#define HUGEPTE_CACHE_NAME(psize)	(huge_pgtable_cache_name[psize])
-
-static const char *huge_pgtable_cache_name[MMU_PAGE_COUNT] = {
-	[MMU_PAGE_64K]	= "hugepte_cache_64K",
-	[MMU_PAGE_1M]	= "hugepte_cache_1M",
-	[MMU_PAGE_16M]	= "hugepte_cache_16M",
-	[MMU_PAGE_16G]	= "hugepte_cache_16G",
-};
-
 /* Flag to mark huge PD pointers.  This means pmd_bad() and pud_bad()
  * will choke on pointers to hugepte tables, which is handy for
  * catching screwups early. */
@@ -114,15 +102,15 @@ static inline pte_t *hugepte_offset(huge
 static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 			   unsigned long address, unsigned int psize)
 {
-	pte_t *new = kmem_cache_zalloc(pgtable_cache[HUGE_PGTABLE_INDEX(psize)],
-				      GFP_KERNEL|__GFP_REPEAT);
+	pte_t *new = kmem_cache_zalloc(PGT_CACHE(hugepte_shift[psize]),
+				       GFP_KERNEL|__GFP_REPEAT);
 
 	if (! new)
 		return -ENOMEM;
 
 	spin_lock(&mm->page_table_lock);
 	if (!hugepd_none(*hpdp))
-		kmem_cache_free(pgtable_cache[HUGE_PGTABLE_INDEX(psize)], new);
+		kmem_cache_free(PGT_CACHE(hugepte_shift[psize]), new);
 	else
 		hpdp->pd = (unsigned long)new | HUGEPD_OK;
 	spin_unlock(&mm->page_table_lock);
@@ -271,9 +259,7 @@ static void free_hugepte_range(struct mm
 
 	hpdp->pd = 0;
 	tlb->need_flush = 1;
-	pgtable_free_tlb(tlb, pgtable_free_cache(hugepte,
-						 HUGEPTE_CACHE_NUM+psize-1,
-						 PGF_CACHENUM_MASK));
+	pgtable_free_tlb(tlb, hugepte, hugepte_shift[psize]);
 }
 
 static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
@@ -698,8 +684,6 @@ static void __init set_huge_psize(int ps
 		if (mmu_huge_psizes[psize] ||
 		   mmu_psize_defs[psize].shift == PAGE_SHIFT)
 			return;
-		if (WARN_ON(HUGEPTE_CACHE_NAME(psize) == NULL))
-			return;
 		hugetlb_add_hstate(mmu_psize_defs[psize].shift - PAGE_SHIFT);
 
 		switch (mmu_psize_defs[psize].shift) {
@@ -769,16 +753,11 @@ static int __init hugetlbpage_init(void)
 
 	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
 		if (mmu_huge_psizes[psize]) {
-			pgtable_cache[HUGE_PGTABLE_INDEX(psize)] =
-				kmem_cache_create(
-					HUGEPTE_CACHE_NAME(psize),
-					HUGEPTE_TABLE_SIZE(psize),
-					HUGEPTE_TABLE_SIZE(psize),
-					0,
-					NULL);
-			if (!pgtable_cache[HUGE_PGTABLE_INDEX(psize)])
-				panic("hugetlbpage_init(): could not create %s"\
-				      "\n", HUGEPTE_CACHE_NAME(psize));
+			pgtable_cache_add(hugepte_shift[psize], NULL);
+			if (!PGT_CACHE(hugepte_shift[psize]))
+				panic("hugetlbpage_init(): could not create "
+				      "pgtable cache for %d bit pagesize\n",
+				      mmu_psize_to_shift(psize));
 		}
 	}
 
Index: working-2.6/arch/powerpc/include/asm/pgtable-ppc64.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgtable-ppc64.h	2009-08-28 13:46:31.000000000 +1000
+++ working-2.6/arch/powerpc/include/asm/pgtable-ppc64.h	2009-10-30 14:08:40.000000000 +1100
@@ -354,6 +354,7 @@ static inline void __ptep_set_access_fla
 #define pgoff_to_pte(off)	((pte_t) {((off) << PTE_RPN_SHIFT)|_PAGE_FILE})
 #define PTE_FILE_MAX_BITS	(BITS_PER_LONG - PTE_RPN_SHIFT)
 
+void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
 void pgtable_cache_init(void);
 
 /*
Index: working-2.6/arch/powerpc/include/asm/pgalloc-32.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgalloc-32.h	2009-10-30 14:09:30.000000000 +1100
+++ working-2.6/arch/powerpc/include/asm/pgalloc-32.h	2009-10-30 14:12:21.000000000 +1100
@@ -36,11 +36,10 @@ extern void pgd_free(struct mm_struct *m
 extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr);
 extern pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long addr);
 
-static inline void pgtable_free(pgtable_free_t pgf)
+static inline void pgtable_free(void *table, unsigned index_size)
 {
-	void *p = (void *)(pgf.val & ~PGF_CACHENUM_MASK);
-
-	free_page((unsigned long)p);
+	BUG_ON(index_size); /* 32-bit doesn't use this */
+	free_page((unsigned long)table);
 }
 
 #define check_pgt_cache()	do { } while (0)


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply

* Re: [PATCH 0/8]  Fix 8xx MMU/TLB
From: Joakim Tjernlund @ 2009-10-30  0:51 UTC (permalink / raw)
  To: Scott Wood; +Cc: Rex Feany, linuxppc-dev@ozlabs.org
In-Reply-To: <20091030001228.GA29259@loki.buserror.net>

Scott Wood <scottwood@freescale.com> wrote on 30/10/2009 01:12:28:
>
> On Sat, Oct 17, 2009 at 02:01:38PM +0200, Joakim Tjernlund wrote:
> > Joakim Tjernlund/Transmode wrote on 17/10/2009 13:24:18:
> > >
> > > Rex Feany <RFeany@mrv.com> wrote on 16/10/2009 22:25:41:
> > > >
> > > > Thus spake Joakim Tjernlund (joakim.tjernlund@transmode.se):
> > > >
> > > > > Right, it is the pte table walk that is blowing up.
> > > > > I just noted that 2.6 lacks a tophys() call in its table walk
> > > > > so I removed that one(there is one more tophys call but I don't think
> > > > > it should be removed).
> > > > > Try this addon patch:
> > > >
> > > > no difference
> >
> > > OK, thinking a bit more, this part should not be executed as
> > > copy_tofrom_user executes in kernel space.
> > >
> > > Any chance you can stick a HW breakpoint on FixupDAR?
> > > Perhaps there is something different with kernel
> > > virtual address to phys address?
> > > A simple topys() works in 2.4, but perhaps not in 2.6?
> > > this is the part of interest:
> > > FixupDAR: /* Entry point for dcbx workaround. */
> > >  /* fetch instruction from memory. */
> > >  mfspr r10, SPRN_SRR0
> > >  andis. r11, r10, 0x8000
> > >  tophys  (r11, r10)
> > >  beq- 139b  /* Branch if user space address */
> > > 140: lwz r11,0(r11)
> >
> > Probably better to walk the kernel page table too. Does this
> > make a difference(needs the tophys() patch I posted earlier):
>
> After applying by hand (whitespace damage), I get this and a bunch more:

OK, please send your diff to head_8xx.S. Maybe I can spot an
error, otherwise you will have to set a hw BP on fixDAR and step
through it.

 Jocke

^ permalink raw reply

* Re: [PATCH 0/8]  Fix 8xx MMU/TLB
From: Scott Wood @ 2009-10-30  0:12 UTC (permalink / raw)
  To: Joakim Tjernlund; +Cc: Rex Feany, linuxppc-dev@ozlabs.org
In-Reply-To: <OFA43B9FDF.F43B1705-ONC1257652.0041B12F-C1257652.004211A3@transmode.se>

On Sat, Oct 17, 2009 at 02:01:38PM +0200, Joakim Tjernlund wrote:
> Joakim Tjernlund/Transmode wrote on 17/10/2009 13:24:18:
> >
> > Rex Feany <RFeany@mrv.com> wrote on 16/10/2009 22:25:41:
> > >
> > > Thus spake Joakim Tjernlund (joakim.tjernlund@transmode.se):
> > >
> > > > Right, it is the pte table walk that is blowing up.
> > > > I just noted that 2.6 lacks a tophys() call in its table walk
> > > > so I removed that one(there is one more tophys call but I don't think
> > > > it should be removed).
> > > > Try this addon patch:
> > >
> > > no difference
> 
> > OK, thinking a bit more, this part should not be executed as
> > copy_tofrom_user executes in kernel space.
> >
> > Any chance you can stick a HW breakpoint on FixupDAR?
> > Perhaps there is something different with kernel
> > virtual address to phys address?
> > A simple topys() works in 2.4, but perhaps not in 2.6?
> > this is the part of interest:
> > FixupDAR: /* Entry point for dcbx workaround. */
> >  /* fetch instruction from memory. */
> >  mfspr r10, SPRN_SRR0
> >  andis. r11, r10, 0x8000
> >  tophys  (r11, r10)
> >  beq- 139b  /* Branch if user space address */
> > 140: lwz r11,0(r11)
> 
> Probably better to walk the kernel page table too. Does this
> make a difference(needs the tophys() patch I posted earlier):

After applying by hand (whitespace damage), I get this and a bunch more:

VFS: Mounted root (nfs filesystem) readonly on device 0:12.                     
Freeing unused kernel memory: 96k init                                          
INIT: version 2.85 booting                                                      
Mounting /proc and /sys                                                         
Oops: Machine check, sig: 7 [#1]                                                
Embedded Planet EP88xC                                                          
NIP: 00002020 LR: c0089c58 CTR: 00000038                                        
REGS: c38d7de0 TRAP: 0200   Not tainted  (2.6.32-rc4-00971-g2edbf13-dirty)      
MSR: 00001000 <ME>  CR: 44002428  XER: 00000000                                 
TASK = c383b7a0[173] 'udev' THREAD: c38d6000                                    
GPR00: 00000001 c38d7e90 c383b7a0 00000014 c380bffc 0000000c 3001fffc 00000001  
GPR08: 00000038 0000039b c001137c c021c000 00000000 100c7368 c01f59f4 c01f59d0  
GPR16: c0240000 100982dc 100c0aac 10095ccc 00000047 c38a5868 c38d7f20 00000000  
GPR24: c38dd880 00000400 30020000 00000000 c38d7ea0 00000000 0000039c c38a5840  
NIP [00002020] 0x2020                                                           
LR [c0089c58] seq_read+0x488/0x558                                              
Call Trace:                                                                     
[c38d7e90] [c0089a74] seq_read+0x2a4/0x558 (unreliable)                         
[c38d7ee0] [c00ac264] proc_reg_read+0x4c/0x70                                   
[c38d7ef0] [c006f7f4] vfs_read+0xb4/0x158                                       
[c38d7f10] [c006fb04] sys_read+0x4c/0x90                                        
[c38d7f40] [c000dfb8] ret_from_syscall+0x0/0x38                                 
Instruction dump:                                                               
00000000 XXXXXXXX XXXXXXXX XXXXXXXX 7d5a02a6 XXXXXXXX XXXXXXXX XXXXXXXX         
41800010 XXXXXXXX XXXXXXXX XXXXXXXX 816b0000 XXXXXXXX XXXXXXXX XXXXXXXX         
---[ end trace fab819d28e265675 ]---                                            
/etc/rc.d/rcS: line 24:   173 Bus error               /etc/rc.d/init.d/$i $mode 

-Scott

^ permalink raw reply

* Re: Accessing flash directly from User Space [SOLVED]
From: Scott Wood @ 2009-10-29 23:30 UTC (permalink / raw)
  To: Joakim Tjernlund
  Cc: Bill Gatliff, linuxppc-dev@lists.ozlabs.org, Jonathan Haws
In-Reply-To: <OFE3AA516D.46A96178-ONC125765E.0030DA7C-C125765E.00317B8C@transmode.se>

On Thu, Oct 29, 2009 at 10:00:28AM +0100, Joakim Tjernlund wrote:
> > I have found the problem.  It occurred to me in the shower (okay not really,
> > but most good ideas happen there).
> >
> > What was happening is that I was in fact able to write to the correct
> > registers.  However, I would try and write to them in a batch.  But the way
> > mmap works (at least according to the man page) with MAP_SHARED is that the
> > file may not be updated until msync() is called.  Now, I thought that O_SYNC
> > would take care of that when I open /dev/mem, but that was not the case.
> >
> > Anyway, to make a long story short, I inserted an msync() after each
> > assignment to the flash.  This resolved my problem and I can now program my flash.
> 
> Ouch, this was news to me too. Calling msync() after every write kills performance.
> We use mmap(/dev/mem) to access HW and havn't seen any issues yet. Is this
> perhaps a new behaviour for mmap(/dev/mem) and is there a way
> to avoid calling msync()?

I suspect that the msync() was merely serving as a very heavyweight
memory barrier.

-Scott

^ permalink raw reply

* Re: [git pull] Please pull powerpc.git merge branch
From: Benjamin Herrenschmidt @ 2009-10-29 23:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linuxppc-dev list, Andrew Morton, Linux Kernel list
In-Reply-To: <alpine.LFD.2.01.0910290911550.31845@localhost.localdomain>

On Thu, 2009-10-29 at 09:14 -0700, Linus Torvalds wrote:
> 
> On Tue, 27 Oct 2009, Benjamin Herrenschmidt wrote:
> > 
> > Kumar Gala (7):
> >       powerpc: Add a Book-3E 64-bit defconfig
> >       powerpc: Fix compile errors found by new ppc64e_defconfig
> >       powerpc: Limit hugetlbfs support to PPC64 Book-3S machines
> 
> This is incredibly ugly. Why should the generic fs/Kconfig know about some 
> random odd architecture detail like PPC_BOOK3S_64?
> 
> I merged it, and noticed this because Super-H caused clashes by cleaning 
> up. I would suggest PowerPC do the same.

Right. I noticed the same thing yesterday and planned to do so today

> This patch is not signed-off, nor do I want any credit. But if it works on 
> ppc, please send me something like this back.

I'll stick something like that in my tree and send you a pull request.

Cheers,
Ben.

> 		Linus
> ---
>  arch/powerpc/Kconfig |    3 +++
>  fs/Kconfig           |    2 +-
>  2 files changed, 4 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 10a0a54..877db84 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -131,6 +131,9 @@ config PPC
>  	select GENERIC_ATOMIC64 if PPC32
>  	select HAVE_PERF_EVENTS
>  
> +config SYS_SUPPORTS_HUGETLBFS
> +	defbool PPC_BOOK3S_64
> +
>  config EARLY_PRINTK
>  	bool
>  	default y
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 2126078..64d44ef 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -135,7 +135,7 @@ config TMPFS_POSIX_ACL
>  
>  config HUGETLBFS
>  	bool "HugeTLB file system support"
> -	depends on X86 || IA64 || PPC_BOOK3S_64 || SPARC64 || (S390 && 64BIT) || \
> +	depends on X86 || IA64 || SPARC64 || (S390 && 64BIT) || \
>  		   SYS_SUPPORTS_HUGETLBFS || BROKEN
>  	help
>  	  hugetlbfs is a filesystem backing for HugeTLB pages, based on

^ permalink raw reply

* Re: [PATCH v0 2/2] Crypto: Talitos: Support for Async_tx XOR offload
From: Dan Williams @ 2009-10-29 22:46 UTC (permalink / raw)
  To: Vishnu Suresh
  Cc: herbert, linux-kernel, linux-raid, linuxppc-dev, linux-crypto,
	Dipen Dudhat, Maneesh Gupta
In-Reply-To: <1255588866-4133-2-git-send-email-Vishnu@freescale.com>

On Wed, Oct 14, 2009 at 11:41 PM, Vishnu Suresh <Vishnu@freescale.com> wrot=
e:
[..]
> diff --git a/drivers/crypto/Kconfig b/drivers/crypto/Kconfig
> index b08403d..343e578 100644
> --- a/drivers/crypto/Kconfig
> +++ b/drivers/crypto/Kconfig
> @@ -192,6 +192,8 @@ config CRYPTO_DEV_TALITOS
> =A0 =A0 =A0 =A0select CRYPTO_ALGAPI
> =A0 =A0 =A0 =A0select CRYPTO_AUTHENC
> =A0 =A0 =A0 =A0select HW_RANDOM
> + =A0 =A0 =A0 select DMA_ENGINE
> + =A0 =A0 =A0 select ASYNC_XOR

You need only select DMA_ENGINE.  ASYNC_XOR is selected by its users.

> =A0 =A0 =A0 =A0depends on FSL_SOC
> =A0 =A0 =A0 =A0help
> =A0 =A0 =A0 =A0 =A0Say 'Y' here to use the Freescale Security Engine (SEC=
)
> diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
> index c47ffe8..84819d4 100644
> --- a/drivers/crypto/talitos.c
> +++ b/drivers/crypto/talitos.c
[..]
> +static void talitos_process_pending(struct talitos_xor_chan *xor_chan)
> +{
> + =A0 =A0 =A0 struct talitos_xor_desc *desc, *_desc;
> + =A0 =A0 =A0 unsigned long flags;
> + =A0 =A0 =A0 int status;
> +
> + =A0 =A0 =A0 spin_lock_irqsave(&xor_chan->desc_lock, flags);
> +
> + =A0 =A0 =A0 list_for_each_entry_safe(desc, _desc, &xor_chan->pending_q,=
 node) {
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 status =3D talitos_submit(xor_chan->dev, &d=
esc->hwdesc,
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 talitos_release_xor, desc);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (status !=3D -EINPROGRESS)
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 break;
> +
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 list_del(&desc->node);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 list_add_tail(&desc->node, &xor_chan->in_pr=
ogress_q);
> + =A0 =A0 =A0 }
> +
> + =A0 =A0 =A0 spin_unlock_irqrestore(&xor_chan->desc_lock, flags);
> +}

The driver uses spin_lock_bh everywhere else which is either a bug, or
this code is being overly protective.  In any event lockdep will
rightly complain about this.  The API and its users do not submit
operations in hard-irq context.

> +
> +static void talitos_release_xor(struct device *dev, struct talitos_desc =
*hwdesc,
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 void *conte=
xt, int error)
> +{
> + =A0 =A0 =A0 struct talitos_xor_desc *desc =3D context;
> + =A0 =A0 =A0 struct talitos_xor_chan *xor_chan;
> + =A0 =A0 =A0 dma_async_tx_callback callback;
> + =A0 =A0 =A0 void *callback_param;
> +
> + =A0 =A0 =A0 if (unlikely(error)) {
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 dev_err(dev, "xor operation: talitos error =
%d\n", error);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 BUG();
> + =A0 =A0 =A0 }
> +
> + =A0 =A0 =A0 xor_chan =3D container_of(desc->async_tx.chan, struct talit=
os_xor_chan,
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 common);
> + =A0 =A0 =A0 spin_lock_bh(&xor_chan->desc_lock);
> + =A0 =A0 =A0 if (xor_chan->completed_cookie < desc->async_tx.cookie)
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 xor_chan->completed_cookie =3D desc->async_=
tx.cookie;
> +
> + =A0 =A0 =A0 callback =3D desc->async_tx.callback;
> + =A0 =A0 =A0 callback_param =3D desc->async_tx.callback_param;
> +
> + =A0 =A0 =A0 if (callback) {
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 spin_unlock_bh(&xor_chan->desc_lock);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 callback(callback_param);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 spin_lock_bh(&xor_chan->desc_lock);
> + =A0 =A0 =A0 }

You do not need to unlock to call the callback.  Users of the API
assume that they are called in bh context and that they are not
allowed to submit new operations directly from the callback (this
simplifies the descriptor cleanup routines in other drivers).

> +
> + =A0 =A0 =A0 list_del(&desc->node);
> + =A0 =A0 =A0 list_add_tail(&desc->node, &xor_chan->free_desc);
> + =A0 =A0 =A0 spin_unlock_bh(&xor_chan->desc_lock);
> + =A0 =A0 =A0 if (!list_empty(&xor_chan->pending_q))
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 talitos_process_pending(xor_chan);
> +}

It appears this routine is missing a call to dma_run_dependencies()?
This is needed if another channel is handling memory copy offload.  I
assume this is the case due to your other patch to fsldma.  See
iop_adma_run_tx_complete_actions().  If this xor resource can also
handle copy operations that would be more efficient as it eliminates
channel switching.  See the ioatdma driver and its use of
ASYNC_TX_DISABLE_CHANNEL_SWITCH.

> +static struct dma_async_tx_descriptor * talitos_prep_dma_xor(
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 struct dma_chan *chan, dma_=
addr_t dest, dma_addr_t *src,
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 unsigned int src_cnt, size_=
t len, unsigned long flags)
> +{
> + =A0 =A0 =A0 struct talitos_xor_chan *xor_chan;
> + =A0 =A0 =A0 struct talitos_xor_desc *new;
> + =A0 =A0 =A0 struct talitos_desc *desc;
> + =A0 =A0 =A0 int i, j;
> +
> + =A0 =A0 =A0 BUG_ON(unlikely(len > TALITOS_MAX_DATA_LEN));
> +
> + =A0 =A0 =A0 xor_chan =3D container_of(chan, struct talitos_xor_chan, co=
mmon);
> +
> + =A0 =A0 =A0 spin_lock_bh(&xor_chan->desc_lock);
> + =A0 =A0 =A0 if (!list_empty(&xor_chan->free_desc)) {
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 new =3D container_of(xor_chan->free_desc.ne=
xt,
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0stru=
ct talitos_xor_desc, node);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 list_del(&new->node);
> + =A0 =A0 =A0 } else {
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 new =3D talitos_xor_alloc_descriptor(xor_ch=
an, GFP_KERNEL);
> + =A0 =A0 =A0 }
> + =A0 =A0 =A0 spin_unlock_bh(&xor_chan->desc_lock);
> +
> + =A0 =A0 =A0 if (!new) {
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 dev_err(xor_chan->common.device->dev,
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 "No free memory for XOR DMA=
 descriptor\n");
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 return NULL;
> + =A0 =A0 =A0 }
> + =A0 =A0 =A0 dma_async_tx_descriptor_init(&new->async_tx, &xor_chan->com=
mon);
> +
> + =A0 =A0 =A0 INIT_LIST_HEAD(&new->node);
> + =A0 =A0 =A0 INIT_LIST_HEAD(&new->tx_list);

You can save some overhead in the fast path by moving this
initialization to talitos_xor_alloc_descriptor.

> +
> + =A0 =A0 =A0 desc =3D &new->hwdesc;
> + =A0 =A0 =A0 /* Set destination: Last pointer pair */
> + =A0 =A0 =A0 to_talitos_ptr(&desc->ptr[6], dest);
> + =A0 =A0 =A0 desc->ptr[6].len =3D cpu_to_be16(len);
> + =A0 =A0 =A0 desc->ptr[6].j_extent =3D 0;
> +
> + =A0 =A0 =A0 /* Set Sources: End loading from second-last pointer pair *=
/
> + =A0 =A0 =A0 for (i =3D 5, j =3D 0; (j < src_cnt) && (i > 0); i--, j++) =
{
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 to_talitos_ptr(&desc->ptr[i], src[j]);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 desc->ptr[i].len =3D cpu_to_be16(len);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 desc->ptr[i].j_extent =3D 0;
> + =A0 =A0 =A0 }
> +
> + =A0 =A0 =A0 /*
> + =A0 =A0 =A0 =A0* documentation states first 0 ptr/len combo marks end o=
f sources
> + =A0 =A0 =A0 =A0* yet device produces scatter boundary error unless all =
subsequent
> + =A0 =A0 =A0 =A0* sources are zeroed out
> + =A0 =A0 =A0 =A0*/
> + =A0 =A0 =A0 for (; i >=3D 0; i--) {
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 to_talitos_ptr(&desc->ptr[i], 0);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 desc->ptr[i].len =3D 0;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 desc->ptr[i].j_extent =3D 0;
> + =A0 =A0 =A0 }
> +
> + =A0 =A0 =A0 desc->hdr =3D DESC_HDR_SEL0_AESU | DESC_HDR_MODE0_AESU_XOR
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 | DESC_HDR_TYPE_RAID_XOR;
> +
> + =A0 =A0 =A0 new->async_tx.parent =3D NULL;
> + =A0 =A0 =A0 new->async_tx.next =3D NULL;

These fields are managed by the async_tx channel switch code.  No need
to manage it here.

> + =A0 =A0 =A0 new->async_tx.cookie =3D 0;

This is set below to -EBUSY, it's redundant to touch it here.

> + =A0 =A0 =A0 async_tx_ack(&new->async_tx);

This makes it impossible to attach a dependency to this operation.
Not sure this is what you want.

--
Dan

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox