LinuxPPC-Dev Archive on lore.kernel.org
* INIT: PANIC: segmentation violation! sleeping for 30 seconds.
From: Breno Leitao @ 2009-10-26 17:01 UTC (permalink / raw)
  To: linuxppc-dev

Hi, 

I just put an upstream kernel (rc5) on a specific machine I have (Power5), and I got
the following error:

INIT: PANIC: segmentation violation! sleeping for 30 seconds.
init has generated signal 11 but has no handler for it
init used greatest stack depth: 6240 bytes left
Kernel panic - not syncing: Attempted to kill init!
Call Trace:
[c0000001c6e7f920] [c000000000012588] .show_stack+0x6c/0x194 (unreliable)
[c0000001c6e7f9d0] [c000000000088bd4] .panic+0x74/0x1c0
[c0000001c6e7fa60] [c00000000008cbdc] .do_exit+0x43c/0x82c
[c0000001c6e7fb20] [c0000000000286f4] ._exception+0x1d4/0x204
[c0000001c6e7fcf0] [c0000000004e7dc8] .do_page_fault+0x4fc/0x634
[c0000001c6e7fe30] [c00000000000560c] handle_page_fault+0x20/0x74


Downgrading to rc2 shows the same result. Interestingly enough, this is the only
machine that fails with the upstream kernel.

Has anyone seen anything similar?

Thanks


* Network Stack SKB Reallocation
From: Jonathan Haws @ 2009-10-26 18:43 UTC (permalink / raw)
  To: linuxppc-dev@lists.ozlabs.org

Quick question about the network stack in general:

Does the stack itself release an SKB allocated by the device driver back to
the heap upstream, or does it require that the device driver handle that?

Thanks!

Jonathan


* RE: Network Stack SKB Reallocation
From: Jonathan Haws @ 2009-10-26 19:16 UTC (permalink / raw)
  To: Michael Buesch, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <200910262013.52458.mb@bu3sch.de>

So, in my case, I allocate a bunch of skb's that I want to be able to reuse
during network operation (256 in fact).  When I pass it up the stack, the
stack will free that skb back to the system making any further use of it
invalid until I call alloc_skb() again?

Thanks.

> On Monday 26 October 2009 19:43:00 Jonathan Haws wrote:
> > Quick question about the network stack in general:
> >
> > Does the stack itself release an SKB allocated by the device
> > driver back to the heap upstream, or does it require that the device
> > driver handle that?
>
> There's the concept of passing responsibilities for the frames between
> the networking layers. So the driver passes the frame and all responsibilities
> to the networking stack. So if the networking stack accepts the packet in the first place,
> it needs to free it (or pass it to somebody else to take care of).
>
> --
> Greetings, Michael.


* Re: Network Stack SKB Reallocation
From: Michael Buesch @ 2009-10-26 19:13 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Jonathan Haws
In-Reply-To: <BB99A6BA28709744BF22A68E6D7EB51F0330BEB38D@midas.usurf.usu.edu>

On Monday 26 October 2009 19:43:00 Jonathan Haws wrote:
> Quick question about the network stack in general:
> 
> Does the stack itself release an SKB allocated by the device driver back to the heap upstream, or does it require that the device driver handle that?

There's the concept of passing responsibilities for the frames between
the networking layers. So the driver passes the frame and all responsibilities
to the networking stack. So if the networking stack accepts the packet in the first place,
it needs to free it (or pass it to somebody else to take care of).
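
A minimal userspace sketch of that hand-off, in plain C rather than kernel code. The names `struct pkt`, `stack_rx()`, `driver_rx_one()` and `RING_SIZE` are hypothetical stand-ins, not kernel APIs; the point is only that once the stack has accepted a buffer the driver must refill its ring slot with a fresh allocation instead of reusing the old pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define RING_SIZE 4            /* hypothetical; the thread mentions 256 */

struct pkt {                   /* stand-in for struct sk_buff */
	size_t len;
	unsigned char data[64];
};

static int stack_freed;        /* how many packets the "stack" released */

/* Stand-in for the network stack: it accepts the packet, so it now
 * owns it and is responsible for freeing it. */
static void stack_rx(struct pkt *p)
{
	free(p);
	stack_freed++;
}

/* Driver receive path for one ring slot: allocate the replacement
 * first, then hand the filled buffer up. The old pointer must not be
 * touched after stack_rx() returns. */
static int driver_rx_one(struct pkt **ring, size_t slot)
{
	struct pkt *fresh;

	assert(slot < RING_SIZE && ring[slot] != NULL);

	fresh = calloc(1, sizeof(*fresh));
	if (!fresh)
		return -1;     /* keep the old buffer; drop the frame instead */

	stack_rx(ring[slot]);  /* ownership moves to the stack here */
	ring[slot] = fresh;    /* the slot never holds a stale pointer */
	return 0;
}
```

Allocating the replacement before handing the old buffer up means an allocation failure costs one dropped frame rather than an empty ring slot.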

-- 
Greetings, Michael.


* Re: [PATCH 0/8]  Fix 8xx MMU/TLB
From: Benjamin Herrenschmidt @ 2009-10-26 22:47 UTC (permalink / raw)
  To: Joakim Tjernlund; +Cc: Scott Wood, linuxppc-dev@ozlabs.org, Rex Feany
In-Reply-To: <OFA43B9FDF.F43B1705-ONC1257652.0041B12F-C1257652.004211A3@transmode.se>


> Probably better to walk the kernel page table too. Does this
> make a difference(needs the tophys() patch I posted earlier):

This whole thing would be a -lot- easier to do from C code. Why? Simply
because you could just use get_user() to load the instruction rather
than doing this page table walking in asm. That is simpler, faster, and
more foolproof (ok, you do pay the price of a kernel entry/exit
instead, but I still believe that code simplicity and maintainability
win here).
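
As a rough illustration of what the C side could look like once the instruction word has been fetched, here is a userspace sketch of the decode step that the dcbx workaround comment describes (take the registers named by the instruction and add them). The helper names, the `gpr[]` array standing in for the saved registers, and the opcode list (taken from the PowerPC instruction encoding, with icbi left out) are assumptions for illustration, not code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* X-form field extraction. PowerPC numbers bits from the MSB, so RA is
 * bits 11..15 and RB is bits 16..20 of the 32-bit instruction word. */
#define INSN_OPCODE(i)  (((i) >> 26) & 0x3f)
#define INSN_RA(i)      (((i) >> 16) & 0x1f)
#define INSN_RB(i)      (((i) >> 11) & 0x1f)
#define INSN_XO(i)      (((i) >> 1) & 0x3ff)

/* Extended opcodes (primary opcode 31) of the data cache block
 * instructions considered here; dcbt/dcbtst are deliberately absent. */
enum { XO_DCBST = 54, XO_DCBF = 86, XO_DCBI = 470, XO_DCBZ = 1014 };

static int is_dcbx(uint32_t insn)
{
	if (INSN_OPCODE(insn) != 31)
		return 0;
	switch (INSN_XO(insn)) {
	case XO_DCBST:
	case XO_DCBF:
	case XO_DCBI:
	case XO_DCBZ:
		return 1;
	default:
		return 0;
	}
}

/* EA = (RA|0) + RB: an RA field of 0 means the literal value zero,
 * not the contents of r0. Returns 0 on success, -1 if not a dcbx. */
static int dcbx_effective_address(uint32_t insn, const uint32_t *gpr,
				  uint32_t *ea)
{
	uint32_t base;

	assert(gpr != NULL && ea != NULL);
	if (!is_dcbx(insn))
		return -1;

	base = INSN_RA(insn) ? gpr[INSN_RA(insn)] : 0;
	*ea = base + gpr[INSN_RB(insn)];
	return 0;
}
```

For example, `dcbz r3,r4` encodes as 0x7C0327EC, so with r3 = 0x1000 and r4 = 0x20 the helper yields 0x1020.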

Ben.

> From 862dda30c3d3d3bedcc605e8520626408a26891c Mon Sep 17 00:00:00 2001
> From: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
> Date: Sat, 17 Oct 2009 13:54:03 +0200
> Subject: [PATCH] 8xx: Walk the page table for kernel addresses too.
> 
> ---
>  arch/powerpc/kernel/head_8xx.S |   25 ++++++++++++-------------
>  1 files changed, 12 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
> index 0e91da4..edc9e9b 100644
> --- a/arch/powerpc/kernel/head_8xx.S
> +++ b/arch/powerpc/kernel/head_8xx.S
> @@ -532,28 +532,27 @@ DARFixed:/* Return from dcbx instruction bug workaround, r10 holds value of DAR
>   * by decoding the registers used by the dcbx instruction and adding them.
>   * DAR is set to the calculated address and r10 also holds the EA on exit.
>   */
> -#define NO_SELF_MODIFYING_CODE /* define if you don't want to use self modifying code */
> -     nop	/* A few nops to make the modified_instr: space below cache line aligned */
> -     nop
> -139:	/* fetch instruction from userspace memory */
> + /* define if you don't want to use self modifying code */
> +#define NO_SELF_MODIFYING_CODE
> +FixupDAR:/* Entry point for dcbx workaround. */
> +	/* fetch instruction from memory. */
> +     mfspr r10, SPRN_SRR0
>       DO_8xx_CPU6(0x3780, r3)
>       mtspr SPRN_MD_EPN, r10
>       mfspr r11, SPRN_M_TWB	/* Get level 1 table entry address */
> -     lwz   r11, 0(r11)	/* Get the level 1 entry */
> +     cmplwi      cr0, r11, 0x0800
> +     blt-  3f		/* Branch if user space */
> +     lis   r11, swapper_pg_dir@h
> +     ori   r11, r11, swapper_pg_dir@l
> +     rlwimi      r11, r11, 0, 2, 19
> +3:   lwz   r11, 0(r11)	/* Get the level 1 entry */
>       DO_8xx_CPU6(0x3b80, r3)
>       mtspr SPRN_MD_TWC, r11	/* Load pte table base address */
>       mfspr r11, SPRN_MD_TWC	/* ....and get the pte address */
>       lwz   r11, 0(r11)	/* Get the pte */
>       /* concat physical page address(r11) and page offset(r10) */
>       rlwimi      r11, r10, 0, 20, 31
> -     b     140f
> -FixupDAR:	/* Entry point for dcbx workaround. */
> -	/* fetch instruction from memory. */
> -     mfspr r10, SPRN_SRR0
> -     andis.      r11, r10, 0x8000
> -     tophys  (r11, r10)
> -     beq-  139b		/* Branch if user space address */
> -140: lwz   r11,0(r11)
> +     lwz   r11,0(r11)
>  /* Check if it really is a dcbx instruction. */
>  /* dcbt and dcbtst does not generate DTLB Misses/Errors,
>   * no need to include them here */
> --
> 1.6.4.4


* Re: [PATCH 0/8]  Fix 8xx MMU/TLB
From: Dan Malek @ 2009-10-26 23:26 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Scott Wood, linuxppc-dev@ozlabs.org, Rex Feany
In-Reply-To: <1256597241.2076.38.camel@pasglop>


On Oct 26, 2009, at 3:47 PM, Benjamin Herrenschmidt wrote:

> This whole thing would be a -lot- easier to do from C code. Why ?  
> Simply
> because you could just use get_user() to load the instruction rather
> than doing this page table walking in asm,

Just be careful the get_user() doesn't regenerate the same
translation error you are trying to fix by being here......
It is nice doing things in C code, but you have to be aware
of the environment and the side effects when in this kind
of exception state.

Thanks.

	-- Dan


* Re: [PATCH] [RFC] PowerPC64: Use preempt_schedule_irq instead of preempt_schedule when returning from exceptions
From: Benjamin Herrenschmidt @ 2009-10-26 23:55 UTC (permalink / raw)
  To: Valentine Barshak; +Cc: olof, linuxppc-dev, paulus
In-Reply-To: <20091019182858.GA10495@ru.mvista.com>

On Mon, 2009-10-19 at 22:28 +0400, Valentine Barshak wrote:
> Use preempt_schedule_irq to prevent infinite irq-entry and
> eventual stack overflow problems with fast-paced IRQ sources.
> This kind of problem has been observed on the PASemi Electra IDE
> controller. We have to make sure we are soft-disabled before calling
> preempt_schedule_irq and hard disable interrupts after that
> to avoid unrecoverable exceptions.
> 
> This patch also moves the "clrrdi r9,r1,THREAD_SHIFT" out of
> the #ifdef CONFIG_PPC_BOOK3E scope, since r9 is clobbered
> and has to be restored in both cases.

So I _think_ that the irqs on/off accounting for lockdep isn't quite
right. What do you think of this slightly modified version? I've only
done a quick boot test on a G5 with lockdep enabled and played a bit;
nothing shows up so far but it's definitely not conclusive.

The main difference is that I call trace_hardirqs_off to "advertise"
the fact that we are soft-disabling. It could be a duplicate call, which
is no big deal at this stage, but it is not always one: on syscall
return, for example, the kernel thinks we have interrupts enabled and
could thus get out of sync without it.

I also mark the PACA hard disable to reflect the MSR:EE state before
calling into preempt_schedule_irq().
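
The control flow of the modified path can be modelled in userspace C as a rough sketch. Everything here is a hypothetical model: `struct cpu_model` stands in for the PACA flags and MSR:EE, and `model_preempt_schedule_irq()` stands in for the real scheduler entry point; only the ordering of the steps is meant to mirror the patch.

```c
#include <assert.h>
#include <stdbool.h>

struct cpu_model {
	bool soft_enabled;   /* models PACASOFTIRQEN */
	bool hard_enabled;   /* models PACAHARDIRQEN, mirroring MSR:EE */
	int  need_resched;   /* pending requests, models _TIF_NEED_RESCHED */
	int  sched_calls;    /* how often the scheduler stand-in ran */
};

/* Stand-in for preempt_schedule_irq(): entered soft-disabled. */
static void model_preempt_schedule_irq(struct cpu_model *c)
{
	assert(!c->soft_enabled);
	c->sched_calls++;
	c->need_resched--;
}

/* Models the preemption branch of do_work after the patch. */
static void model_do_work_preempt(struct cpu_model *c)
{
	assert(c->need_resched > 0);

	c->soft_enabled = false;        /* li r0,0; stb r0,PACASOFTIRQEN */
	/* trace_hardirqs_off() would be called here */
	do {
		c->hard_enabled = true;     /* set MSR_EE, PACAHARDIRQEN = 1 */
		model_preempt_schedule_irq(c);
		c->hard_enabled = false;    /* clear MSR_EE, PACAHARDIRQEN = 0 */
	} while (c->need_resched > 0);  /* bne 1b on _TIF_NEED_RESCHED */
}
```

The soft-disable happens once, outside the loop, so every pass through the scheduler runs with soft IRQs off while the hard-enable state is toggled around each call.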

---

[PATCH v2] powerpc: Use preempt_schedule_irq instead of preempt_schedule when returning from exceptions

Use preempt_schedule_irq to prevent infinite irq-entry and
eventual stack overflow problems with fast-paced IRQ sources.
This kind of problem has been observed on the PASemi Electra IDE
controller. We have to make sure we are soft-disabled before calling
preempt_schedule_irq and hard disable interrupts after that
to avoid unrecoverable exceptions.

This patch also moves the "clrrdi r9,r1,THREAD_SHIFT" out of
the #ifdef CONFIG_PPC_BOOK3E scope, since r9 is clobbered
and has to be restored in both cases.

Signed-off-by: Valentine Barshak <vbarshak@ru.mvista.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 arch/powerpc/kernel/entry_64.S |   38 +++++++++++++++++++++-----------------
 1 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index f9fd54b..b64ae3d 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -659,34 +659,38 @@ do_work:
 	crandc	eq,cr1*4+eq,eq
 	bne	restore
 	/* here we are preempting the current task */
-1:
+	/* ensure we are soft-disabled
+	 * */
+	li	r0,0
+	stb	r0,PACASOFTIRQEN(r13)
+	/* Trace the IRQ state change */
 #ifdef CONFIG_TRACE_IRQFLAGS
-	bl	.trace_hardirqs_on
-	/* Note: we just clobbered r10 which used to contain the previous
-	 * MSR before the hard-disabling done by the caller of do_work.
-	 * We don't have that value anymore, but it doesn't matter as
-	 * we will hard-enable unconditionally, we can just reload the
-	 * current MSR into r10
-	 */
+	bl	.trace_hardirqs_off
+#endif
+1:	/* And make sure we are hard-enabled */
+#ifdef CONFIG_PPC_BOOK3E
+	wrteei	1
+#else
 	mfmsr	r10
-#endif /* CONFIG_TRACE_IRQFLAGS */
+	ori	r10,r10,MSR_EE
+	mtmsrd	r10,1
+#endif
 	li	r0,1
-	stb	r0,PACASOFTIRQEN(r13)
 	stb	r0,PACAHARDIRQEN(r13)
+	/* Call the scheduler with soft IRQs off */
+	bl	.preempt_schedule_irq
+	/* hard-disable interrupts again */
 #ifdef CONFIG_PPC_BOOK3E
-	wrteei	1
-	bl	.preempt_schedule
 	wrteei	0
 #else
-	ori	r10,r10,MSR_EE
-	mtmsrd	r10,1		/* reenable interrupts */
-	bl	.preempt_schedule
 	mfmsr	r10
-	clrrdi	r9,r1,THREAD_SHIFT
-	rldicl	r10,r10,48,1	/* disable interrupts again */
+	rldicl	r10,r10,48,1
 	rotldi	r10,r10,16
 	mtmsrd	r10,1
 #endif /* CONFIG_PPC_BOOK3E */
+	li	r0,0
+	stb	r0,PACAHARDIRQEN(r13)
+	clrrdi	r9,r1,THREAD_SHIFT
 	ld	r4,TI_FLAGS(r9)
 	andi.	r0,r4,_TIF_NEED_RESCHED
 	bne	1b
-- 
1.6.1.2.14.gf26b5


* Re: [PATCH 0/8]  Fix 8xx MMU/TLB
From: Benjamin Herrenschmidt @ 2009-10-27  0:00 UTC (permalink / raw)
  To: Dan Malek; +Cc: Scott Wood, linuxppc-dev@ozlabs.org, Rex Feany
In-Reply-To: <21C04675-E563-49E2-B2E6-7CAB9D8BE985@embeddedalley.com>

On Mon, 2009-10-26 at 16:26 -0700, Dan Malek wrote:
> Just be careful the get_user() doesn't regenerate the same
> translation error you are trying to fix by being here......

It shouldn't, since it will always come up with a proper DAR, but
you may want to double-check beforehand that the instruction
address you are loading from is -not- your marker value for a bad DAR.

> It is nice doing things in C code, but you have to be aware
> of the environment and the side effects when in this kind 

Yup.

Cheers,
Ben.


* Re: [2/6] Cleanup management of kmem_caches for pagetables
From: Benjamin Herrenschmidt @ 2009-10-27  2:28 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev
In-Reply-To: <20091016052212.E565EB7BBB@ozlabs.org>

On Fri, 2009-10-16 at 16:22 +1100, David Gibson wrote:

Minor nits... if you can respin today I should push it out to -next

> +void pgtable_cache_add(unsigned shift, void (*ctor)(void *))
> +{
> +	char *name;
> +	unsigned long table_size = sizeof(void *) << shift;
> +	unsigned long align = table_size;

This is a bit thick.. could use some air. Just separate the definitions
from the assignments so you can make the code breathe a bit :-)

Also the above warrants a comment explaining that this won't work for
PTE pages since sizeof(PTE) >= sizeof(void *) and the day we finally
move out of pte page == struct page, the code here will have to be
adapted.

> +	/* When batching pgtable pointers for RCU freeing, we store
> +	 * the index size in the low bits.  Table alignment must be
> +	 * big enough to fit it */
> +	unsigned long minalign = MAX_PGTABLE_INDEX_SIZE + 1;
> +	struct kmem_cache *new;
> +
> +	/* It would be nice if this was a BUILD_BUG_ON(), but at the
> +	 * moment, gcc doesn't seem to recognize is_power_of_2 as a
> +	 * constant expression, so so much for that. */
> +	BUG_ON(!is_power_of_2(minalign));
> +	BUG_ON((shift < 1) || (shift > MAX_PGTABLE_INDEX_SIZE));
> +
> +	if (PGT_CACHE(shift))
> +		return; /* Already have a cache of this size */

Blank line here too

> +	align = max_t(unsigned long, align, minalign);
> +	name = kasprintf(GFP_KERNEL, "pgtable-2^%d", shift);
> +	new = kmem_cache_create(name, table_size, align, 0, ctor);
> +	PGT_CACHE(shift) = new;

And here

> +	pr_debug("Allocated pgtable cache for order %d\n", shift);
> +}
> +
>  
>  void pgtable_cache_init(void)
>  {
> -	pgtable_cache[0] = kmem_cache_create(pgtable_cache_name[0], PGD_TABLE_SIZE, PGD_TABLE_SIZE, SLAB_PANIC, pgd_ctor);
> -	pgtable_cache[1] = kmem_cache_create(pgtable_cache_name[1], PMD_TABLE_SIZE, PMD_TABLE_SIZE, SLAB_PANIC, pmd_ctor);
> +	pgtable_cache_add(PGD_INDEX_SIZE, pgd_ctor);
> +	pgtable_cache_add(PMD_INDEX_SIZE, pmd_ctor);
> +	if (!PGT_CACHE(PGD_INDEX_SIZE) || !PGT_CACHE(PMD_INDEX_SIZE))
> +		panic("Couldn't allocate pgtable caches");
> +	BUG_ON(PUD_INDEX_SIZE && !PGT_CACHE(PUD_INDEX_SIZE));
>  }

panic vs. BUG_ON() ... could be a bit more consistent.
 
>  #ifdef CONFIG_SPARSEMEM_VMEMMAP
> Index: working-2.6/arch/powerpc/include/asm/pgalloc-64.h
> ===================================================================
> --- working-2.6.orig/arch/powerpc/include/asm/pgalloc-64.h	2009-10-16 12:53:45.000000000 +1100
> +++ working-2.6/arch/powerpc/include/asm/pgalloc-64.h	2009-10-16 12:53:51.000000000 +1100
> @@ -11,27 +11,30 @@
>  #include <linux/cpumask.h>
>  #include <linux/percpu.h>
>  
> +/*
> + * This needs to be big enough to allow any pagetable sizes we need,
> + * but small enough to fit in the low bits of any page table pointer.
> + * In other words all pagetables, even tiny ones, must be aligned to
> + * allow at least enough low 0 bits to contain this value.
> + */
> +#define MAX_PGTABLE_INDEX_SIZE	0xf

This also has the constraint of being a (power of 2) - 1... worth
mentioning somewhere ?

Also if you could comment somewhere that index size == 0 means a PTE
page ? Not totally obvious at first.
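
To make the (power of 2) - 1 constraint concrete, here is a small userspace sketch of the size and alignment computation from the quoted function, plus the low-bit packing the comment describes. `pgtable_pack()` and `pgtable_unpack()` are hypothetical helper names, not functions from the patch; they only show why the constant has to work both as a mask and, plus one, as an alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PGTABLE_INDEX_SIZE 0xf   /* value from the quoted patch */

/* A table with 2^shift pointer-sized entries. */
static size_t pgtable_size(unsigned shift)
{
	return sizeof(void *) << shift;
}

/* Natural alignment, but never less than what the low-bit tag needs. */
static size_t pgtable_align(unsigned shift)
{
	size_t minalign = MAX_PGTABLE_INDEX_SIZE + 1;
	size_t align = pgtable_size(shift);

	return align > minalign ? align : minalign;
}

/* Hypothetical helper: store the index size in the low pointer bits. */
static uintptr_t pgtable_pack(void *table, unsigned shift)
{
	uintptr_t p = (uintptr_t)table;

	assert(shift <= MAX_PGTABLE_INDEX_SIZE);
	assert((p & MAX_PGTABLE_INDEX_SIZE) == 0);  /* needs the alignment */
	return p | shift;
}

/* Hypothetical helper: split a packed value back into its parts. */
static void *pgtable_unpack(uintptr_t packed, unsigned *shift)
{
	*shift = (unsigned)(packed & MAX_PGTABLE_INDEX_SIZE);
	return (void *)(packed & ~(uintptr_t)MAX_PGTABLE_INDEX_SIZE);
}
```

If the constant were not of the form 2^n - 1, the mask in `pgtable_unpack()` would drop some index sizes and `MAX_PGTABLE_INDEX_SIZE + 1` would not be a valid alignment.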

Cheers,
Ben.


* Re: [3/6] Allow more flexible layouts for hugepage pagetables
From: Benjamin Herrenschmidt @ 2009-10-27  3:10 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev
In-Reply-To: <20091016052212.E1DE0B7BBD@ozlabs.org>

On Fri, 2009-10-16 at 16:22 +1100, David Gibson wrote:

So far haven't seen anything blatantly wrong, in fact, this patch
results in some nice cleanups.

One thing tho...

> -#ifdef CONFIG_HUGETLB_PAGE
> -       /* Handle hugepage regions */
> -       if (HPAGE_SHIFT && mmu_huge_psizes[psize]) {
> -               DBG_LOW(" -> huge page !\n");
> -               return hash_huge_page(mm, access, ea, vsid, local, trap);
> -       }
> -#endif /* CONFIG_HUGETLB_PAGE */
> -
>  #ifndef CONFIG_PPC_64K_PAGES
>         /* If we use 4K pages and our psize is not 4K, then we are hitting
>          * a special driver mapping, we need to align the address before
> @@ -961,12 +954,18 @@ int hash_page(unsigned long ea, unsigned
>  #endif /* CONFIG_PPC_64K_PAGES */

You basically made the above code be run with huge pages. This may not
be what you want ... It will result in cropping the low EA bits probably
at a stage where you don't want that (it might also be a non-issue, I
just want you to double check :-)

I suppose one option would be to remove that alignment and duplicate
the PTEs when creating those "special" mappings (afaik the only user
is spufs using 64K pages to map the local store)
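
A tiny sketch of what "cropping the low EA bits" means, assuming the alignment is a plain mask by the page size (the quoted hunk does not show that code, so this is an assumption). `ea_align_down()` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Round an effective address down to the start of its page; shift is
 * log2 of the page size (12 for 4K, 16 for 64K, 24 for 16M). */
static uint64_t ea_align_down(uint64_t ea, unsigned shift)
{
	assert(shift < 64);
	return ea & ~((UINT64_C(1) << shift) - 1);
}
```

With ea = 0x10012345, a 64K alignment gives 0x10010000 while a 16M alignment gives 0x10000000, so running this step on a huge page discards many more address bits than it does for the special 64K mappings.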

Cheers,
Ben.


* Re: [PATCH 10/16] percpu: make percpu symbols in powerpc unique
From: Benjamin Herrenschmidt @ 2009-10-27  3:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cl, rusty, linux-kernel, rostedt, linuxppc-dev, mingo,
	Paul Mackerras, cebbert, hpa, tglx, akpm
In-Reply-To: <1255500125-3210-11-git-send-email-tj@kernel.org>

On Wed, 2009-10-14 at 15:01 +0900, Tejun Heo wrote:
> This patch updates percpu related symbols in powerpc such that percpu
> symbols are unique and don't clash with local symbols.  This serves
> two purposes of decreasing the possibility of global percpu symbol
> collision and allowing dropping per_cpu__ prefix from percpu symbols.
> 
> * arch/powerpc/kernel/perf_callchain.c: s/callchain/cpu_perf_callchain/
> 
> * arch/powerpc/kernel/setup-common.c: s/pvr/cpu_pvr/
> 
> * arch/powerpc/platforms/pseries/dtl.c: s/dtl/cpu_dtl/
> 
> * arch/powerpc/platforms/cell/interrupt.c: s/iic/cpu_iic/
> 
> Partly based on Rusty Russell's "alloc_percpu: rename percpu vars
> which cause name clashes" patch.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Acked-by: Arnd Bergmann <arnd@arndb.de>
> Cc: Rusty Russell <rusty@rustcorp.com.au>

Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

Cheers,
Ben.


* Re: [PATCH v4 4/4] pseries: Serialize cpu hotplug operations during deactivate Vs deallocate
From: Benjamin Herrenschmidt @ 2009-10-27  3:23 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Arun R Bharadwaj, linuxppc-dev, Peter Zijlstra, linux-kernel
In-Reply-To: <20091009083105.32381.42354.stgit@sofia.in.ibm.com>

On Fri, 2009-10-09 at 14:01 +0530, Gautham R Shenoy wrote:
> Currently the cpu-allocation/deallocation process comprises of two steps:
> - Set the indicators and to update the device tree with DLPAR node
>   information.
> 
> - Online/offline the allocated/deallocated CPU.
> 
> This is achieved by writing to the sysfs tunables "probe" during allocation
> and "release" during deallocation.
> 
> At the sametime, the userspace can independently online/offline the CPUs of
> the system using the sysfs tunable "online".
> 
> It is quite possible that when a userspace tool offlines a CPU
> for the purpose of deallocation and is in the process of updating the device
> tree, some other userspace tool could bring the CPU back online by writing to
> the "online" sysfs tunable thereby causing the deallocate process to fail.
> 
> The solution to this is to serialize writes to the "probe/release" sysfs
> tunable with the writes to the "online" sysfs tunable.
> 
> This patch employs a mutex to provide this serialization, which is a no-op on
> all architectures except PPC_PSERIES
> 
> Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>

Peter, did you get a chance to review this one ?

Cheers,
Ben.

> ---
>  arch/powerpc/platforms/pseries/dlpar.c |   26 ++++++++++++++++++++++----
>  drivers/base/cpu.c                     |    2 ++
>  include/linux/cpu.h                    |   13 +++++++++++++
>  3 files changed, 37 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c
> index 9752386..fc261e6 100644
> --- a/arch/powerpc/platforms/pseries/dlpar.c
> +++ b/arch/powerpc/platforms/pseries/dlpar.c
> @@ -644,6 +644,18 @@ static ssize_t memory_release_store(struct class *class, const char *buf,
>  	return rc ? -1 : count;
>  }
>  
> +static DEFINE_MUTEX(pseries_cpu_hotplug_mutex);
> +
> +void cpu_hotplug_driver_lock()
> +{
> +	mutex_lock(&pseries_cpu_hotplug_mutex);
> +}
> +
> +void cpu_hotplug_driver_unlock()
> +{
> +	mutex_unlock(&pseries_cpu_hotplug_mutex);
> +}
> +
>  static ssize_t cpu_probe_store(struct class *class, const char *buf,
>  			       size_t count)
>  {
> @@ -656,14 +668,15 @@ static ssize_t cpu_probe_store(struct class *class, const char *buf,
>  	if (rc)
>  		return -EINVAL;
>  
> +	cpu_hotplug_driver_lock();
>  	rc = acquire_drc(drc_index);
>  	if (rc)
> -		return rc;
> +		goto out;
>  
>  	dn = configure_connector(drc_index);
>  	if (!dn) {
>  		release_drc(drc_index);
> -		return rc;
> +		goto out;
>  	}
>  
>  	/* fixup dn name */
> @@ -672,7 +685,8 @@ static ssize_t cpu_probe_store(struct class *class, const char *buf,
>  	if (!cpu_name) {
>  		free_cc_nodes(dn);
>  		release_drc(drc_index);
> -		return -ENOMEM;
> +		rc = -ENOMEM;
> +		goto out;
>  	}
>  
>  	sprintf(cpu_name, "/cpus/%s", dn->full_name);
> @@ -684,6 +698,8 @@ static ssize_t cpu_probe_store(struct class *class, const char *buf,
>  		release_drc(drc_index);
>  
>  	rc = online_node_cpus(dn);
> +out:
> +	cpu_hotplug_driver_unlock();
>  
>  	return rc ? rc : count;
>  }
> @@ -705,6 +721,7 @@ static ssize_t cpu_release_store(struct class *class, const char *buf,
>  		return -EINVAL;
>  	}
>  
> +	cpu_hotplug_driver_lock();
>  	rc = offline_node_cpus(dn);
>  
>  	if (rc)
> @@ -713,7 +730,7 @@ static ssize_t cpu_release_store(struct class *class, const char *buf,
>  	rc = release_drc(*drc_index);
>  	if (rc) {
>  		of_node_put(dn);
> -		return rc;
> +		goto out;
>  	}
>  
>  	rc = remove_device_tree_nodes(dn);
> @@ -723,6 +740,7 @@ static ssize_t cpu_release_store(struct class *class, const char *buf,
>  	of_node_put(dn);
>  
>  out:
> +	cpu_hotplug_driver_unlock();
>  	return rc ? rc : count;
>  }
>  
> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
> index e62a4cc..07c3f05 100644
> --- a/drivers/base/cpu.c
> +++ b/drivers/base/cpu.c
> @@ -35,6 +35,7 @@ static ssize_t __ref store_online(struct sys_device *dev, struct sysdev_attribut
>  	struct cpu *cpu = container_of(dev, struct cpu, sysdev);
>  	ssize_t ret;
>  
> +	cpu_hotplug_driver_lock();
>  	switch (buf[0]) {
>  	case '0':
>  		ret = cpu_down(cpu->sysdev.id);
> @@ -49,6 +50,7 @@ static ssize_t __ref store_online(struct sys_device *dev, struct sysdev_attribut
>  	default:
>  		ret = -EINVAL;
>  	}
> +	cpu_hotplug_driver_unlock();
>  
>  	if (ret >= 0)
>  		ret = count;
> diff --git a/include/linux/cpu.h b/include/linux/cpu.h
> index 4753619..b0ad4e1 100644
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -115,6 +115,19 @@ extern void put_online_cpus(void);
>  #define unregister_hotcpu_notifier(nb)	unregister_cpu_notifier(nb)
>  int cpu_down(unsigned int cpu);
>  
> +#ifdef CONFIG_PPC_PSERIES
> +extern void cpu_hotplug_driver_lock(void);
> +extern void cpu_hotplug_driver_unlock(void);
> +#else
> +static inline void cpu_hotplug_driver_lock(void)
> +{
> +}
> +
> +static inline void cpu_hotplug_driver_unlock(void)
> +{
> +}
> +#endif
> +
>  #else		/* CONFIG_HOTPLUG_CPU */
>  
>  #define get_online_cpus()	do { } while (0)
> 


* Is there a patch for MPC8548 XOR?
From: hank peng @ 2009-10-27  3:43 UTC (permalink / raw)
  To: linuxppc-dev

I want to use its XOR engine to compute RAID5 parity, but I can't
find this feature in the 2.6.30 kernel downloaded from kernel.org.
Does anyone know if there is a patch?

-- 
The simplest is not all best but the best is surely the simplest!


* Re: [2/6] Cleanup management of kmem_caches for pagetables
From: David Gibson @ 2009-10-27  3:46 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev
In-Reply-To: <1256610499.2076.69.camel@pasglop>

On Tue, Oct 27, 2009 at 01:28:19PM +1100, Benjamin Herrenschmidt wrote:
> On Fri, 2009-10-16 at 16:22 +1100, David Gibson wrote:
> 
> Minor nits... if you can respin today I should push it out to -next
> 
> > +void pgtable_cache_add(unsigned shift, void (*ctor)(void *))
> > +{
> > +	char *name;
> > +	unsigned long table_size = sizeof(void *) << shift;
> > +	unsigned long align = table_size;
> 
> This is a bit thick.. could use some air. Just separate the definitions
> from the assignments so you can make the code breathe a bit :-)

Ok.

> Also the above warrants a comment explaining that this won't work for
> PTE pages since sizeof(PTE) >= sizeof(void *) and the day we finally
> move out of pte page == struct page, the code here will have to be
> adapted.

Ok.

[snip]
> >  void pgtable_cache_init(void)
> >  {
> > -	pgtable_cache[0] = kmem_cache_create(pgtable_cache_name[0], PGD_TABLE_SIZE, PGD_TABLE_SIZE, SLAB_PANIC, pgd_ctor);
> > -	pgtable_cache[1] = kmem_cache_create(pgtable_cache_name[1], PMD_TABLE_SIZE, PMD_TABLE_SIZE, SLAB_PANIC, pmd_ctor);
> > +	pgtable_cache_add(PGD_INDEX_SIZE, pgd_ctor);
> > +	pgtable_cache_add(PMD_INDEX_SIZE, pmd_ctor);
> > +	if (!PGT_CACHE(PGD_INDEX_SIZE) || !PGT_CACHE(PMD_INDEX_SIZE))
> > +		panic("Couldn't allocate pgtable caches");
> > +	BUG_ON(PUD_INDEX_SIZE && !PGT_CACHE(PUD_INDEX_SIZE));
> >  }
> 
> panic vs. BUG_ON() ... could be a bit more consistent.

Uh.. there is actually a rationale for the difference here.  The
panic() is due to a runtime error - couldn't allocate the caches -
which isn't necessarily a kernel bug (could be hardware error, or
ludicrously short on memory).

The trick is that allocating the PGD and PMD caches is supposed to
also create the PUD cache, because the PUD index size is always the
same as either the PGD or PMD index size.  If that's not true, we've broken
the assumptions the code is based on, hence BUG().

> >  #ifdef CONFIG_SPARSEMEM_VMEMMAP
> > Index: working-2.6/arch/powerpc/include/asm/pgalloc-64.h
> > ===================================================================
> > --- working-2.6.orig/arch/powerpc/include/asm/pgalloc-64.h	2009-10-16 12:53:45.000000000 +1100
> > +++ working-2.6/arch/powerpc/include/asm/pgalloc-64.h	2009-10-16 12:53:51.000000000 +1100
> > @@ -11,27 +11,30 @@
> >  #include <linux/cpumask.h>
> >  #include <linux/percpu.h>
> >  
> > +/*
> > + * This needs to be big enough to allow any pagetable sizes we need,
> > + * but small enough to fit in the low bits of any page table pointer.
> > + * In other words all pagetables, even tiny ones, must be aligned to
> > + * allow at least enough low 0 bits to contain this value.
> > + */
> > +#define MAX_PGTABLE_INDEX_SIZE	0xf
> 
> This also has the constraint of being a (power of 2) - 1... worth
> mentioning somewhere ?
> 
> Also if you could comment somewhere that index size == 0 means a PTE
> page ? Not totally obvious at first.

Ok, I've expanded on this comment.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [2/6] Cleanup management of kmem_caches for pagetables
From: Benjamin Herrenschmidt @ 2009-10-27  4:30 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev
In-Reply-To: <20091027034602.GB20694@yookeroo.seuss>

On Tue, 2009-10-27 at 14:46 +1100, David Gibson wrote:
> 
> The trick is that allocating the PGD and PMD caches is supposed to
> also create the PUD cache, because the PUD index size is always the
> same as either the PGD or PMD index size.  If that's not true, we've broken
> the assumptions the code is based on, hence BUG().

Ok, so maybe a little comment with the above explanation concerning
the PUD index size being the same as the PGD or PMD one would be
useful :-)

Cheers,
Ben.


* [PATCH 1/6] powerpc: tracing: Add powerpc tracepoints for interrupt entry and exit
From: Anton Blanchard @ 2009-10-27  4:47 UTC (permalink / raw)
  To: Steven Rostedt, Frederic Weisbecker, Ingo Molnar, benh; +Cc: linuxppc-dev


This patch adds powerpc specific tracepoints for interrupt entry and exit.

While we already have generic irq_handler_entry and irq_handler_exit
tracepoints there are cases on our virtualised powerpc machines where an
interrupt is presented to the OS, but subsequently handled by the hypervisor.
This means no OS interrupt handler is invoked.

Here is an example on a POWER6 machine with the patch below applied:
 
<idle>-0     [006]  3243.949840744: irq_entry: pt_regs=c0000000ce31fb10
<idle>-0     [006]  3243.949850520: irq_exit: pt_regs=c0000000ce31fb10

<idle>-0     [007]  3243.950218208: irq_entry: pt_regs=c0000000ce323b10
<idle>-0     [007]  3243.950224080: irq_exit: pt_regs=c0000000ce323b10

<idle>-0     [000]  3244.021879320: irq_entry: pt_regs=c000000000a63aa0
<idle>-0     [000]  3244.021883616: irq_handler_entry: irq=87 handler=eth0
<idle>-0     [000]  3244.021887328: irq_handler_exit: irq=87 return=handled
<idle>-0     [000]  3244.021897408: irq_exit: pt_regs=c000000000a63aa0

Here we see two phantom interrupts (no handler was invoked), followed
by a real interrupt for eth0. Without the tracepoints in this patch we
would have missed the phantom interrupts.
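
As a sketch of how such a trace could be post-processed, the userspace helper below counts phantom interrupts in lines shaped like the example above. `count_phantom_irqs()` and `MAX_CPUS` are hypothetical names, and the parsing assumes the CPU number is the first bracketed field on each line, as in the sample output.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CPUS 64   /* hypothetical upper bound for this sketch */

/* Count irq_entry/irq_exit pairs with no irq_handler_entry in between,
 * tracked per CPU. */
static int count_phantom_irqs(const char *const *lines, size_t n)
{
	unsigned char handled[MAX_CPUS] = {0};
	int phantom = 0;

	for (size_t i = 0; i < n; i++) {
		const char *bracket = strchr(lines[i], '[');
		unsigned long cpu;

		if (!bracket)
			continue;               /* not a trace record */
		cpu = strtoul(bracket + 1, NULL, 10);
		assert(cpu < MAX_CPUS);

		/* The leading space keeps "irq_entry:" from being looked
		 * for inside longer event names. */
		if (strstr(lines[i], " irq_entry:"))
			handled[cpu] = 0;
		else if (strstr(lines[i], " irq_handler_entry:"))
			handled[cpu] = 1;
		else if (strstr(lines[i], " irq_exit:") && !handled[cpu])
			phantom++;
	}
	return phantom;
}
```

Fed the eight sample lines above, this reports two phantom interrupts (CPUs 6 and 7) and ignores the eth0 interrupt on CPU 0.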

Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
--

No change to this patch.

Index: linux.trees.git/arch/powerpc/include/asm/trace.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux.trees.git/arch/powerpc/include/asm/trace.h	2009-10-17 08:45:08.000000000 +1100
@@ -0,0 +1,53 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM powerpc
+
+#if !defined(_TRACE_POWERPC_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_POWERPC_H
+
+#include <linux/tracepoint.h>
+
+struct pt_regs;
+
+TRACE_EVENT(irq_entry,
+
+	TP_PROTO(struct pt_regs *regs),
+
+	TP_ARGS(regs),
+
+	TP_STRUCT__entry(
+		__field(struct pt_regs *, regs)
+	),
+
+	TP_fast_assign(
+		__entry->regs = regs;
+	),
+
+	TP_printk("pt_regs=%p", __entry->regs)
+);
+
+TRACE_EVENT(irq_exit,
+
+	TP_PROTO(struct pt_regs *regs),
+
+	TP_ARGS(regs),
+
+	TP_STRUCT__entry(
+		__field(struct pt_regs *, regs)
+	),
+
+	TP_fast_assign(
+		__entry->regs = regs;
+	),
+
+	TP_printk("pt_regs=%p", __entry->regs)
+);
+
+#endif /* _TRACE_POWERPC_H */
+
+#undef TRACE_INCLUDE_PATH
+#undef TRACE_INCLUDE_FILE
+
+#define TRACE_INCLUDE_PATH asm
+#define TRACE_INCLUDE_FILE trace
+
+#include <trace/define_trace.h>
Index: linux.trees.git/arch/powerpc/kernel/irq.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/kernel/irq.c	2009-10-17 08:44:32.000000000 +1100
+++ linux.trees.git/arch/powerpc/kernel/irq.c	2009-10-17 08:45:44.000000000 +1100
@@ -70,6 +70,8 @@
 #include <asm/firmware.h>
 #include <asm/lv1call.h>
 #endif
+#define CREATE_TRACE_POINTS
+#include <asm/trace.h>
 
 int __irq_offset_value;
 static int ppc_spurious_interrupts;
@@ -325,6 +327,8 @@ void do_IRQ(struct pt_regs *regs)
 	struct pt_regs *old_regs = set_irq_regs(regs);
 	unsigned int irq;
 
+	trace_irq_entry(regs);
+
 	irq_enter();
 
 	check_stack_overflow();
@@ -348,6 +352,8 @@ void do_IRQ(struct pt_regs *regs)
 		timer_interrupt(regs);
 	}
 #endif
+
+	trace_irq_exit(regs);
 }
 
 void __init init_IRQ(void)

^ permalink raw reply

* [PATCH 2/6] powerpc: tracing: Add powerpc tracepoints for timer entry and exit
From: Anton Blanchard @ 2009-10-27  4:49 UTC (permalink / raw)
  To: Steven Rostedt, Frederic Weisbecker, Ingo Molnar, benh; +Cc: linuxppc-dev
In-Reply-To: <20091027044742.GB3085@kryten>


We can monitor the effectiveness of our power management of both the
kernel and hypervisor by probing the timer interrupt. For example, on
this box we see a timer interrupt every 10.37s on an idle core:

<idle>-0     [010]  3900.671297: timer_interrupt_entry: pt_regs=c0000000ce1e7b10
<idle>-0     [010]  3900.671302: timer_interrupt_exit: pt_regs=c0000000ce1e7b10

<idle>-0     [010]  3911.042963: timer_interrupt_entry: pt_regs=c0000000ce1e7b10
<idle>-0     [010]  3911.042968: timer_interrupt_exit: pt_regs=c0000000ce1e7b10

<idle>-0     [010]  3921.414630: timer_interrupt_entry: pt_regs=c0000000ce1e7b10
<idle>-0     [010]  3921.414635: timer_interrupt_exit: pt_regs=c0000000ce1e7b10

Since we have a 207MHz decrementer it will go negative and fire every 10.37s
even if Linux is completely idle.
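
As a rough check of that figure: assuming the decrementer is a 32-bit
signed counter loaded with 0x7fffffff and ticking at the nominal 207MHz,
it takes about 2^31 ticks to go negative. A small userspace sketch (the
helper name is made up for illustration, this is not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Largest positive value a 32-bit signed decrementer can hold. */
#define SKETCH_DEC_MAX 0x7fffffffULL

/*
 * Seconds until a decrementer loaded with SKETCH_DEC_MAX goes negative,
 * given its tick rate in Hz. Hypothetical helper for illustration only.
 */
static double sketch_dec_wrap_seconds(uint64_t freq_hz)
{
	assert(freq_hz != 0);
	/* SKETCH_DEC_MAX + 1 ticks take the counter from DEC_MAX to -1. */
	return (double)(SKETCH_DEC_MAX + 1) / (double)freq_hz;
}
```

With a nominal 207000000 Hz this comes out a little above 10.37s, in
line with the spacing of the trace entries above.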

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: linux.trees.git/arch/powerpc/kernel/time.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/kernel/time.c	2009-10-07 17:21:21.000000000 +1100
+++ linux.trees.git/arch/powerpc/kernel/time.c	2009-10-07 17:21:52.000000000 +1100
@@ -54,6 +54,7 @@
 #include <linux/irq.h>
 #include <linux/delay.h>
 #include <linux/perf_event.h>
+#include <asm/trace.h>
 
 #include <asm/io.h>
 #include <asm/processor.h>
@@ -571,6 +572,8 @@ void timer_interrupt(struct pt_regs * re
 	struct clock_event_device *evt = &decrementer->event;
 	u64 now;
 
+	trace_timer_interrupt_entry(regs);
+
 	/* Ensure a positive value is written to the decrementer, or else
 	 * some CPUs will continuue to take decrementer exceptions */
 	set_dec(DECREMENTER_MAX);
@@ -590,6 +593,7 @@ void timer_interrupt(struct pt_regs * re
 		now = decrementer->next_tb - now;
 		if (now <= DECREMENTER_MAX)
 			set_dec((int)now);
+		trace_timer_interrupt_exit(regs);
 		return;
 	}
 	old_regs = set_irq_regs(regs);
@@ -620,6 +624,8 @@ void timer_interrupt(struct pt_regs * re
 
 	irq_exit();
 	set_irq_regs(old_regs);
+
+	trace_timer_interrupt_exit(regs);
 }
 
 void wakeup_decrementer(void)
Index: linux.trees.git/arch/powerpc/include/asm/trace.h
===================================================================
--- linux.trees.git.orig/arch/powerpc/include/asm/trace.h	2009-10-07 17:22:25.000000000 +1100
+++ linux.trees.git/arch/powerpc/include/asm/trace.h	2009-10-07 17:23:20.000000000 +1100
@@ -42,6 +42,40 @@ TRACE_EVENT(irq_exit,
 	TP_printk("pt_regs=%p", __entry->regs)
 );
 
+TRACE_EVENT(timer_interrupt_entry,
+
+	TP_PROTO(struct pt_regs *regs),
+
+	TP_ARGS(regs),
+
+	TP_STRUCT__entry(
+		__field(struct pt_regs *, regs)
+	),
+
+	TP_fast_assign(
+		__entry->regs = regs;
+	),
+
+	TP_printk("pt_regs=%p", __entry->regs)
+);
+
+TRACE_EVENT(timer_interrupt_exit,
+
+	TP_PROTO(struct pt_regs *regs),
+
+	TP_ARGS(regs),
+
+	TP_STRUCT__entry(
+		__field(struct pt_regs *, regs)
+	),
+
+	TP_fast_assign(
+		__entry->regs = regs;
+	),
+
+	TP_printk("pt_regs=%p", __entry->regs)
+);
+
 #endif /* _TRACE_POWERPC_H */
 
 #undef TRACE_INCLUDE_PATH

^ permalink raw reply

* [PATCH 3/6] powerpc: tracing: Add hypervisor call tracepoints
From: Anton Blanchard @ 2009-10-27  4:50 UTC (permalink / raw)
  To: Steven Rostedt, Frederic Weisbecker, Ingo Molnar, benh; +Cc: linuxppc-dev
In-Reply-To: <20091027044914.GC3085@kryten>


Add hcall_entry and hcall_exit tracepoints. This replaces the inline
assembly HCALL_STATS code and converts it to use the new tracepoints.

To keep the disabled case as quick as possible, we embed a status word
in the TOC so we can get at it with a single load. By doing so we
keep the overhead at a minimum. Time taken for a null hcall:

No tracepoint code:	135.79 cycles
Disabled tracepoints:	137.95 cycles

For reference, before this patch enabling HCALL_STATS resulted in a null
hcall of 201.44 cycles!
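
The disabled-case check can be pictured in C roughly as below. This is
only a userspace sketch under my own naming (every sketch_* identifier
is hypothetical); the real fast path is the ld/cmpdi/beq+ sequence in
hvCall.S, and __builtin_expect is the GCC/Clang hint standing in for the
branch prediction bit:

```c
#include <assert.h>

/* Stand-in for the word the patch places in the TOC. */
static long sketch_refcount;

/* Counts how often the slow (tracing) path was taken. */
static unsigned long sketch_traced_calls;

/* Called when a probe is attached to either hcall tracepoint. */
static void sketch_regfunc(void)
{
	sketch_refcount++;
}

/* Called when a probe is detached. */
static void sketch_unregfunc(void)
{
	assert(sketch_refcount > 0);
	sketch_refcount--;
}

/* One load and one compare decide whether any tracing work is done. */
static long sketch_hcall(unsigned long opcode)
{
	if (__builtin_expect(sketch_refcount != 0, 0))
		sketch_traced_calls++;	/* would call the entry tracepoint */

	(void)opcode;			/* the hypervisor call would go here */
	return 0;
}
```

Sharing one refcount between the entry and exit tracepoints means a
single word covers both checks.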

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: linux.trees.git/arch/powerpc/platforms/pseries/hvCall.S
===================================================================
--- linux.trees.git.orig/arch/powerpc/platforms/pseries/hvCall.S	2009-10-27 13:36:05.000000000 +1100
+++ linux.trees.git/arch/powerpc/platforms/pseries/hvCall.S	2009-10-27 14:53:21.000000000 +1100
@@ -14,20 +14,54 @@
 	
 #define STK_PARM(i)     (48 + ((i)-3)*8)
 
-#ifdef CONFIG_HCALL_STATS
+#ifdef CONFIG_TRACEPOINTS
+
+	.section	".toc","aw"
+
+	.globl hcall_tracepoint_refcount
+hcall_tracepoint_refcount:
+	.llong	0
+
+	.section	".text"
+
 /*
  * precall must preserve all registers.  use unused STK_PARM()
- * areas to save snapshots and opcode.
+ * areas to save snapshots and opcode. We branch around this
+ * in early init (eg when populating the MMU hashtable) by using an
+ * unconditional cpu feature.
  */
 #define HCALL_INST_PRECALL					\
-	std	r3,STK_PARM(r3)(r1);	/* save opcode */	\
-	mftb	r0;			/* get timebase and */	\
-	std     r0,STK_PARM(r5)(r1);	/* save for later */	\
 BEGIN_FTR_SECTION;						\
-	mfspr	r0,SPRN_PURR;		/* get PURR and */	\
-	std	r0,STK_PARM(r6)(r1);	/* save for later */	\
-END_FTR_SECTION_IFSET(CPU_FTR_PURR);
-	
+	b	1f;						\
+END_FTR_SECTION(0, 1);						\
+	ld      r12,hcall_tracepoint_refcount@toc(r2);		\
+	cmpdi	r12,0;						\
+	beq+	1f;						\
+	mflr	r0;						\
+	std	r3,STK_PARM(r3)(r1);				\
+	std	r4,STK_PARM(r4)(r1);				\
+	std	r5,STK_PARM(r5)(r1);				\
+	std	r6,STK_PARM(r6)(r1);				\
+	std	r7,STK_PARM(r7)(r1);				\
+	std	r8,STK_PARM(r8)(r1);				\
+	std	r9,STK_PARM(r9)(r1);				\
+	std	r10,STK_PARM(r10)(r1);				\
+	std	r0,16(r1);					\
+	stdu	r1,-STACK_FRAME_OVERHEAD(r1);			\
+	bl	.__trace_hcall_entry;				\
+	addi	r1,r1,STACK_FRAME_OVERHEAD;			\
+	ld	r0,16(r1);					\
+	ld	r3,STK_PARM(r3)(r1);				\
+	ld	r4,STK_PARM(r4)(r1);				\
+	ld	r5,STK_PARM(r5)(r1);				\
+	ld	r6,STK_PARM(r6)(r1);				\
+	ld	r7,STK_PARM(r7)(r1);				\
+	ld	r8,STK_PARM(r8)(r1);				\
+	ld	r9,STK_PARM(r9)(r1);				\
+	ld	r10,STK_PARM(r10)(r1);				\
+	mtlr	r0;						\
+1:
+
 /*
  * postcall is performed immediately before function return which
  * allows liberal use of volatile registers.  We branch around this
@@ -38,40 +72,21 @@ END_FTR_SECTION_IFSET(CPU_FTR_PURR);
 BEGIN_FTR_SECTION;						\
 	b	1f;						\
 END_FTR_SECTION(0, 1);						\
-	ld	r4,STK_PARM(r3)(r1);	/* validate opcode */	\
-	cmpldi	cr7,r4,MAX_HCALL_OPCODE;			\
-	bgt-	cr7,1f;						\
-								\
-	/* get time and PURR snapshots after hcall */		\
-	mftb	r7;			/* timebase after */	\
-BEGIN_FTR_SECTION;						\
-	mfspr	r8,SPRN_PURR;		/* PURR after */	\
-	ld	r6,STK_PARM(r6)(r1);	/* PURR before */	\
-	subf	r6,r6,r8;		/* delta */		\
-END_FTR_SECTION_IFSET(CPU_FTR_PURR);				\
-	ld	r5,STK_PARM(r5)(r1);	/* timebase before */	\
-	subf	r5,r5,r7;		/* time delta */	\
-								\
-	/* calculate address of stat structure r4 = opcode */	\
-	srdi	r4,r4,2;		/* index into array */	\
-	mulli	r4,r4,HCALL_STAT_SIZE;				\
-	LOAD_REG_ADDR(r7, per_cpu__hcall_stats);		\
-	add	r4,r4,r7;					\
-	ld	r7,PACA_DATA_OFFSET(r13); /* per cpu offset */	\
-	add	r4,r4,r7;					\
-								\
-	/* update stats	*/					\
-	ld	r7,HCALL_STAT_CALLS(r4); /* count */		\
-	addi	r7,r7,1;					\
-	std	r7,HCALL_STAT_CALLS(r4);			\
-	ld      r7,HCALL_STAT_TB(r4);	/* timebase */		\
-	add	r7,r7,r5;					\
-	std	r7,HCALL_STAT_TB(r4);				\
-BEGIN_FTR_SECTION;						\
-	ld	r7,HCALL_STAT_PURR(r4);	/* PURR */		\
-	add	r7,r7,r6;					\
-	std	r7,HCALL_STAT_PURR(r4);				\
-END_FTR_SECTION_IFSET(CPU_FTR_PURR);				\
+	ld      r12,hcall_tracepoint_refcount@toc(r2);		\
+	cmpdi	r12,0;						\
+	beq+	1f;						\
+	mflr	r0;						\
+	ld	r6,STK_PARM(r3)(r1);				\
+	std	r3,STK_PARM(r3)(r1);				\
+	mr	r4,r3;						\
+	mr	r3,r6;						\
+	std	r0,16(r1);					\
+	stdu	r1,-STACK_FRAME_OVERHEAD(r1);			\
+	bl	.__trace_hcall_exit;				\
+	addi	r1,r1,STACK_FRAME_OVERHEAD;			\
+	ld	r0,16(r1);					\
+	ld	r3,STK_PARM(r3)(r1);				\
+	mtlr	r0;						\
 1:
 #else
 #define HCALL_INST_PRECALL
Index: linux.trees.git/arch/powerpc/platforms/pseries/lpar.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/platforms/pseries/lpar.c	2009-10-27 13:36:05.000000000 +1100
+++ linux.trees.git/arch/powerpc/platforms/pseries/lpar.c	2009-10-27 14:53:21.000000000 +1100
@@ -39,6 +39,7 @@
 #include <asm/cputable.h>
 #include <asm/udbg.h>
 #include <asm/smp.h>
+#include <asm/trace.h>
 
 #include "plpar_wrappers.h"
 #include "pseries.h"
@@ -661,3 +662,34 @@ void arch_free_page(struct page *page, i
 EXPORT_SYMBOL(arch_free_page);
 
 #endif
+
+#ifdef CONFIG_TRACEPOINTS
+/*
+ * We optimise our hcall path by placing hcall_tracepoint_refcount
+ * directly in the TOC so we can check if the hcall tracepoints are
+ * enabled via a single load.
+ */
+
+/* NB: reg/unreg are called while guarded with the tracepoints_mutex */
+extern long hcall_tracepoint_refcount;
+
+void hcall_tracepoint_regfunc(void)
+{
+	hcall_tracepoint_refcount++;
+}
+
+void hcall_tracepoint_unregfunc(void)
+{
+	hcall_tracepoint_refcount--;
+}
+
+void __trace_hcall_entry(unsigned long opcode)
+{
+	trace_hcall_entry(opcode);
+}
+
+void __trace_hcall_exit(long opcode, unsigned long retval)
+{
+	trace_hcall_exit(opcode, retval);
+}
+#endif
Index: linux.trees.git/arch/powerpc/include/asm/trace.h
===================================================================
--- linux.trees.git.orig/arch/powerpc/include/asm/trace.h	2009-10-27 13:36:30.000000000 +1100
+++ linux.trees.git/arch/powerpc/include/asm/trace.h	2009-10-27 14:56:38.000000000 +1100
@@ -76,6 +76,51 @@ TRACE_EVENT(timer_interrupt_exit,
 	TP_printk("pt_regs=%p", __entry->regs)
 );
 
+#ifdef CONFIG_PPC_PSERIES
+extern void hcall_tracepoint_regfunc(void);
+extern void hcall_tracepoint_unregfunc(void);
+
+TRACE_EVENT_FN(hcall_entry,
+
+	TP_PROTO(unsigned long opcode),
+
+	TP_ARGS(opcode),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, opcode)
+	),
+
+	TP_fast_assign(
+		__entry->opcode = opcode;
+	),
+
+	TP_printk("opcode=%lu", __entry->opcode),
+
+	hcall_tracepoint_regfunc, hcall_tracepoint_unregfunc
+);
+
+TRACE_EVENT_FN(hcall_exit,
+
+	TP_PROTO(unsigned long opcode, unsigned long retval),
+
+	TP_ARGS(opcode, retval),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, opcode)
+		__field(unsigned long, retval)
+	),
+
+	TP_fast_assign(
+		__entry->opcode = opcode;
+		__entry->retval = retval;
+	),
+
+	TP_printk("opcode=%lu retval=%lu", __entry->opcode, __entry->retval),
+
+	hcall_tracepoint_regfunc, hcall_tracepoint_unregfunc
+);
+#endif
+
 #endif /* _TRACE_POWERPC_H */
 
 #undef TRACE_INCLUDE_PATH
Index: linux.trees.git/arch/powerpc/include/asm/hvcall.h
===================================================================
--- linux.trees.git.orig/arch/powerpc/include/asm/hvcall.h	2009-10-27 13:36:05.000000000 +1100
+++ linux.trees.git/arch/powerpc/include/asm/hvcall.h	2009-10-27 14:53:21.000000000 +1100
@@ -274,6 +274,8 @@ struct hcall_stats {
 	unsigned long	num_calls;	/* number of calls (on this CPU) */
 	unsigned long	tb_total;	/* total wall time (mftb) of calls. */
 	unsigned long	purr_total;	/* total cpu time (PURR) of calls. */
+	unsigned long	tb_start;
+	unsigned long	purr_start;
 };
 #define HCALL_STAT_ARRAY_SIZE	((MAX_HCALL_OPCODE >> 2) + 1)
 
Index: linux.trees.git/arch/powerpc/platforms/pseries/hvCall_inst.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/platforms/pseries/hvCall_inst.c	2009-10-27 13:36:05.000000000 +1100
+++ linux.trees.git/arch/powerpc/platforms/pseries/hvCall_inst.c	2009-10-27 14:53:21.000000000 +1100
@@ -26,6 +26,7 @@
 #include <asm/hvcall.h>
 #include <asm/firmware.h>
 #include <asm/cputable.h>
+#include <asm/trace.h>
 
 DEFINE_PER_CPU(struct hcall_stats[HCALL_STAT_ARRAY_SIZE], hcall_stats);
 
@@ -100,6 +101,34 @@ static const struct file_operations hcal
 #define	HCALL_ROOT_DIR		"hcall_inst"
 #define CPU_NAME_BUF_SIZE	32
 
+
+static void probe_hcall_entry(unsigned long opcode)
+{
+	struct hcall_stats *h;
+
+	if (opcode > MAX_HCALL_OPCODE)
+		return;
+
+	h = &get_cpu_var(hcall_stats)[opcode / 4];
+	h->tb_start = mftb();
+	h->purr_start = mfspr(SPRN_PURR);
+}
+
+static void probe_hcall_exit(unsigned long opcode, unsigned long retval)
+{
+	struct hcall_stats *h;
+
+	if (opcode > MAX_HCALL_OPCODE)
+		return;
+
+	h = &__get_cpu_var(hcall_stats)[opcode / 4];
+	h->num_calls++;
+	h->tb_total += mftb() - h->tb_start;
+	h->purr_total += mfspr(SPRN_PURR) - h->purr_start;
+
+	put_cpu_var(hcall_stats);
+}
+
 static int __init hcall_inst_init(void)
 {
 	struct dentry *hcall_root;
@@ -110,6 +139,14 @@ static int __init hcall_inst_init(void)
 	if (!firmware_has_feature(FW_FEATURE_LPAR))
 		return 0;
 
+	if (register_trace_hcall_entry(probe_hcall_entry))
+		return -EINVAL;
+
+	if (register_trace_hcall_exit(probe_hcall_exit)) {
+		unregister_trace_hcall_entry(probe_hcall_entry);
+		return -EINVAL;
+	}
+
 	hcall_root = debugfs_create_dir(HCALL_ROOT_DIR, NULL);
 	if (!hcall_root)
 		return -ENOMEM;
Index: linux.trees.git/arch/powerpc/Kconfig.debug
===================================================================
--- linux.trees.git.orig/arch/powerpc/Kconfig.debug	2009-10-27 13:36:05.000000000 +1100
+++ linux.trees.git/arch/powerpc/Kconfig.debug	2009-10-27 13:36:30.000000000 +1100
@@ -46,7 +46,7 @@ config DEBUG_STACK_USAGE
 
 config HCALL_STATS
 	bool "Hypervisor call instrumentation"
-	depends on PPC_PSERIES && DEBUG_FS
+	depends on PPC_PSERIES && DEBUG_FS && TRACEPOINTS
 	help
 	  Adds code to keep track of the number of hypervisor calls made and
 	  the amount of time spent in hypervisor calls.  Wall time spent in

^ permalink raw reply

* [PATCH 4/6] powerpc: tracing: Give hypervisor call tracepoints access to arguments
From: Anton Blanchard @ 2009-10-27  4:51 UTC (permalink / raw)
  To: Steven Rostedt, Frederic Weisbecker, Ingo Molnar, benh; +Cc: linuxppc-dev
In-Reply-To: <20091027045029.GD3085@kryten>


While most users of the hcall tracepoints will only want the opcode and return
code, some will want all the arguments. To avoid the complexity of using
varargs we pass a pointer to the register save area which contains all
arguments.
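
To illustrate the idea from the probe's side, here is a userspace sketch
in which a plain array stands in for the register save area. The record
structure, the extra "out" parameter and all sketch_* names are
assumptions made for the example; the real probes take only the opcode
and the pointer:

```c
#include <assert.h>
#include <stddef.h>

/* What a probe might want to keep from one hcall (hypothetical). */
struct sketch_hcall_record {
	unsigned long opcode;
	unsigned long first_arg;
};

/*
 * The probe receives a pointer to the saved argument registers rather
 * than a variable argument list, so it can index whichever arguments
 * it cares about and ignore the rest.
 */
static void sketch_probe_entry(unsigned long opcode,
			       const unsigned long *args,
			       struct sketch_hcall_record *out)
{
	assert(args != NULL && out != NULL);
	out->opcode = opcode;
	out->first_arg = args[0];
}
```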

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: linux.trees.git/arch/powerpc/platforms/pseries/hvCall.S
===================================================================
--- linux.trees.git.orig/arch/powerpc/platforms/pseries/hvCall.S	2009-10-27 14:29:09.000000000 +1100
+++ linux.trees.git/arch/powerpc/platforms/pseries/hvCall.S	2009-10-27 14:29:16.000000000 +1100
@@ -30,7 +30,7 @@ hcall_tracepoint_refcount:
  * in early init (eg when populating the MMU hashtable) by using an
  * unconditional cpu feature.
  */
-#define HCALL_INST_PRECALL					\
+#define HCALL_INST_PRECALL(FIRST_REG)				\
 BEGIN_FTR_SECTION;						\
 	b	1f;						\
 END_FTR_SECTION(0, 1);						\
@@ -47,6 +47,7 @@ END_FTR_SECTION(0, 1);						\
 	std	r9,STK_PARM(r9)(r1);				\
 	std	r10,STK_PARM(r10)(r1);				\
 	std	r0,16(r1);					\
+	addi	r4,r1,STK_PARM(FIRST_REG);			\
 	stdu	r1,-STACK_FRAME_OVERHEAD(r1);			\
 	bl	.__trace_hcall_entry;				\
 	addi	r1,r1,STACK_FRAME_OVERHEAD;			\
@@ -68,7 +69,7 @@ END_FTR_SECTION(0, 1);						\
  * in early init (eg when populating the MMU hashtable) by using an
  * unconditional cpu feature.
  */
-#define HCALL_INST_POSTCALL					\
+#define __HCALL_INST_POSTCALL					\
 BEGIN_FTR_SECTION;						\
 	b	1f;						\
 END_FTR_SECTION(0, 1);						\
@@ -88,9 +89,19 @@ END_FTR_SECTION(0, 1);						\
 	ld	r3,STK_PARM(r3)(r1);				\
 	mtlr	r0;						\
 1:
+
+#define HCALL_INST_POSTCALL_NORETS				\
+	li	r5,0;						\
+	__HCALL_INST_POSTCALL
+
+#define HCALL_INST_POSTCALL(BUFREG)				\
+	mr	r5,BUFREG;					\
+	__HCALL_INST_POSTCALL
+
 #else
-#define HCALL_INST_PRECALL
-#define HCALL_INST_POSTCALL
+#define HCALL_INST_PRECALL(FIRST_ARG)
+#define HCALL_INST_POSTCALL_NORETS
+#define HCALL_INST_POSTCALL(BUFREG)
 #endif
 
 	.text
@@ -101,11 +112,11 @@ _GLOBAL(plpar_hcall_norets)
 	mfcr	r0
 	stw	r0,8(r1)
 
-	HCALL_INST_PRECALL
+	HCALL_INST_PRECALL(r4)
 
 	HVSC				/* invoke the hypervisor */
 
-	HCALL_INST_POSTCALL
+	HCALL_INST_POSTCALL_NORETS
 
 	lwz	r0,8(r1)
 	mtcrf	0xff,r0
@@ -117,7 +128,7 @@ _GLOBAL(plpar_hcall)
 	mfcr	r0
 	stw	r0,8(r1)
 
-	HCALL_INST_PRECALL
+	HCALL_INST_PRECALL(r5)
 
 	std     r4,STK_PARM(r4)(r1)     /* Save ret buffer */
 
@@ -136,7 +147,7 @@ _GLOBAL(plpar_hcall)
 	std	r6, 16(r12)
 	std	r7, 24(r12)
 
-	HCALL_INST_POSTCALL
+	HCALL_INST_POSTCALL(r12)
 
 	lwz	r0,8(r1)
 	mtcrf	0xff,r0
@@ -183,7 +194,7 @@ _GLOBAL(plpar_hcall9)
 	mfcr	r0
 	stw	r0,8(r1)
 
-	HCALL_INST_PRECALL
+	HCALL_INST_PRECALL(r5)
 
 	std     r4,STK_PARM(r4)(r1)     /* Save ret buffer */
 
@@ -211,7 +222,7 @@ _GLOBAL(plpar_hcall9)
 	std	r11,56(r12)
 	std	r0, 64(r12)
 
-	HCALL_INST_POSTCALL
+	HCALL_INST_POSTCALL(r12)
 
 	lwz	r0,8(r1)
 	mtcrf	0xff,r0
Index: linux.trees.git/arch/powerpc/include/asm/trace.h
===================================================================
--- linux.trees.git.orig/arch/powerpc/include/asm/trace.h	2009-10-27 14:28:15.000000000 +1100
+++ linux.trees.git/arch/powerpc/include/asm/trace.h	2009-10-27 14:29:16.000000000 +1100
@@ -81,9 +81,9 @@ extern void hcall_tracepoint_unregfunc(v
 
 TRACE_EVENT_FN(hcall_entry,
 
-	TP_PROTO(unsigned long opcode),
+	TP_PROTO(unsigned long opcode, unsigned long *args),
 
-	TP_ARGS(opcode),
+	TP_ARGS(opcode, args),
 
 	TP_STRUCT__entry(
 		__field(unsigned long, opcode)
@@ -100,9 +100,10 @@ TRACE_EVENT_FN(hcall_entry,
 
 TRACE_EVENT_FN(hcall_exit,
 
-	TP_PROTO(unsigned long opcode, unsigned long retval),
+	TP_PROTO(unsigned long opcode, unsigned long retval,
+		unsigned long *retbuf),
 
-	TP_ARGS(opcode, retval),
+	TP_ARGS(opcode, retval, retbuf),
 
 	TP_STRUCT__entry(
 		__field(unsigned long, opcode)
Index: linux.trees.git/arch/powerpc/platforms/pseries/lpar.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/platforms/pseries/lpar.c	2009-10-27 14:28:16.000000000 +1100
+++ linux.trees.git/arch/powerpc/platforms/pseries/lpar.c	2009-10-27 14:29:16.000000000 +1100
@@ -683,13 +683,14 @@ void hcall_tracepoint_unregfunc(void)
 	hcall_tracepoint_refcount--;
 }
 
-void __trace_hcall_entry(unsigned long opcode)
+void __trace_hcall_entry(unsigned long opcode, unsigned long *args)
 {
-	trace_hcall_entry(opcode);
+	trace_hcall_entry(opcode, args);
 }
 
-void __trace_hcall_exit(long opcode, unsigned long retval)
+void __trace_hcall_exit(long opcode, unsigned long retval,
+			unsigned long *retbuf)
 {
-	trace_hcall_exit(opcode, retval);
+	trace_hcall_exit(opcode, retval, retbuf);
 }
 #endif
Index: linux.trees.git/arch/powerpc/platforms/pseries/hvCall_inst.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/platforms/pseries/hvCall_inst.c	2009-10-27 14:28:16.000000000 +1100
+++ linux.trees.git/arch/powerpc/platforms/pseries/hvCall_inst.c	2009-10-27 14:29:16.000000000 +1100
@@ -102,7 +102,7 @@ static const struct file_operations hcal
 #define CPU_NAME_BUF_SIZE	32
 
 
-static void probe_hcall_entry(unsigned long opcode)
+static void probe_hcall_entry(unsigned long opcode, unsigned long *args)
 {
 	struct hcall_stats *h;
 
@@ -114,7 +114,8 @@ static void probe_hcall_entry(unsigned l
 	h->purr_start = mfspr(SPRN_PURR);
 }
 
-static void probe_hcall_exit(unsigned long opcode, unsigned long retval)
+static void probe_hcall_exit(unsigned long opcode, unsigned long retval,
+			     unsigned long *retbuf)
 {
 	struct hcall_stats *h;
 

^ permalink raw reply

* [PATCH 5/6] powerpc: Disable HCALL_STATS by default
From: Anton Blanchard @ 2009-10-27  4:51 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev
In-Reply-To: <20091027045109.GE3085@kryten>


The overhead of HCALL_STATS is quite high and the functionality is very
rarely used. Key statistics are also missing (eg min/max).

With the new hcall tracepoints much more powerful tracing can be done in
a kernel module. Let's disable this by default.

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: linux.trees.git/arch/powerpc/configs/pseries_defconfig
===================================================================
--- linux.trees.git.orig/arch/powerpc/configs/pseries_defconfig	2009-10-27 14:56:58.000000000 +1100
+++ linux.trees.git/arch/powerpc/configs/pseries_defconfig	2009-10-27 14:57:11.000000000 +1100
@@ -1683,7 +1683,7 @@ CONFIG_HAVE_ARCH_KGDB=y
 CONFIG_DEBUG_STACKOVERFLOW=y
 # CONFIG_DEBUG_STACK_USAGE is not set
 # CONFIG_DEBUG_PAGEALLOC is not set
-CONFIG_HCALL_STATS=y
+# CONFIG_HCALL_STATS is not set
 # CONFIG_CODE_PATCHING_SELFTEST is not set
 # CONFIG_FTR_FIXUP_SELFTEST is not set
 # CONFIG_MSI_BITMAP_SELFTEST is not set

^ permalink raw reply

* [PATCH 6/6] powerpc: Export powerpc_debugfs_root
From: Anton Blanchard @ 2009-10-27  4:52 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev
In-Reply-To: <20091027045157.GF3085@kryten>


Kernel modules should be able to place their debug output inside our powerpc
debugfs directory.

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: linux.trees.git/arch/powerpc/kernel/setup-common.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/kernel/setup-common.c	2009-10-27 12:59:00.000000000 +1100
+++ linux.trees.git/arch/powerpc/kernel/setup-common.c	2009-10-27 12:59:15.000000000 +1100
@@ -660,6 +660,7 @@ late_initcall(check_cache_coherency);
 
 #ifdef CONFIG_DEBUG_FS
 struct dentry *powerpc_debugfs_root;
+EXPORT_SYMBOL(powerpc_debugfs_root);
 
 static int powerpc_debugfs_init(void)
 {

^ permalink raw reply

* Re: [3/6] Allow more flexible layouts for hugepage pagetables
From: David Gibson @ 2009-10-27  4:56 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev
In-Reply-To: <1256613059.11607.18.camel@pasglop>

On Tue, Oct 27, 2009 at 02:10:59PM +1100, Benjamin Herrenschmidt wrote:
> On Fri, 2009-10-16 at 16:22 +1100, David Gibson wrote:
> 
> So far haven't seen anything blatantly wrong, in fact, this patch
> results in some nice cleanups.
> 
> One thing tho...
> 
> > -#ifdef CONFIG_HUGETLB_PAGE
> > -       /* Handle hugepage regions */
> > -       if (HPAGE_SHIFT && mmu_huge_psizes[psize]) {
> > -               DBG_LOW(" -> huge page !\n");
> > -               return hash_huge_page(mm, access, ea, vsid, local, trap);
> > -       }
> > -#endif /* CONFIG_HUGETLB_PAGE */
> > -
> >  #ifndef CONFIG_PPC_64K_PAGES
> >         /* If we use 4K pages and our psize is not 4K, then we are hitting
> >          * a special driver mapping, we need to align the address before
> > @@ -961,12 +954,18 @@ int hash_page(unsigned long ea, unsigned
> >  #endif /* CONFIG_PPC_64K_PAGES */
> 
> You basically made the above code be run with huge pages. This may not
> be what you want ... It will result in cropping the low EA bits probably
> at a stage where you don't want that (it might also be a non-issue, I
> just want you to double check :-)

Ok, I've done that, and adjusted the comment accordingly.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply

* hypervisor call trace module
From: Anton Blanchard @ 2009-10-27  5:02 UTC (permalink / raw)
  To: Steven Rostedt, Frederic Weisbecker, Ingo Molnar, benh; +Cc: linuxppc-dev
In-Reply-To: <20091027045224.GG3085@kryten>

[-- Attachment #1: Type: text/plain, Size: 509 bytes --]


Here is an example of using the hcall tracepoints. This kernel
module provides strace-like functionality for hypervisor hcalls:

-> 0x64(ff000002, 1, 2, d0000000034d7a71, f, c000000000a6f388, 1, c000000000989008, c000000000a3f480)
  <- 0x64()

This was an EOI (opcode 0x64) of 0xff000002.

There are a number of drivers that carry a lot of hcall related debug
code just in case we have to chase down a bug. I'm hoping hcall tracepoints
could replace it all and allow for much more powerful debugging.

Anton

[-- Attachment #2: Makefile --]
[-- Type: text/plain, Size: 238 bytes --]

obj-m := hcall_trace.o
KDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)
default:
	$(MAKE) -C $(KDIR) SUBDIRS=$(PWD) modules

clean:
	rm -rf *.mod.c *.ko *.o .*.cmd .tmp_versions Module.markers modules.order Module.symvers

[-- Attachment #3: hcall_trace.c --]
[-- Type: text/x-csrc, Size: 3597 bytes --]

/*
 * Hypervisor hcall trace
 *
 * Look for output in /sys/kernel/debug/powerpc/hcall_trace/
 * 
 * Copyright (C) 2009 Anton Blanchard <anton@au.ibm.com>, IBM
 *      
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version
 * 2 of the License, or (at your option) any later version.
 */             
                        
#include <linux/module.h>
#include <linux/debugfs.h>
#include <linux/relay.h>
#include <asm/hvcall.h>		/* H_CEDE */
#include <asm/trace.h>

#define SUBBUF_SIZE	131072
#define N_SUBBUFS	8

#define BUFLEN		512

static struct rchan *log_chan;

static void probe_hcall_entry(unsigned long opcode, unsigned long *args)
{
	char buf[BUFLEN];

	/* Don't log H_CEDE */
	if (opcode == H_CEDE)
		return;

	snprintf(buf, BUFLEN,
		"-> 0x%lx(%lx, %lx, %lx, %lx, %lx, %lx, %lx, %lx, %lx)\n",
		opcode, *args, *(args+1), *(args+2), *(args+3), *(args+4),
		*(args+5), *(args+6), *(args+7), *(args+8));

	relay_write(log_chan, buf, strlen(buf));
}

static void probe_hcall_exit(unsigned long opcode, unsigned long retval,
			     unsigned long *retbuf)
{
	char buf[BUFLEN];

	/* Don't log H_CEDE */
	if (opcode == H_CEDE)
		return;

	if (retbuf)
		snprintf(buf, BUFLEN, 
		    "  <- 0x%lx(%lx, %lx, %lx, %lx, %lx, %lx, %lx, %lx, %lx)\n",
			opcode, *retbuf, *(retbuf+1),
			*(retbuf+2), *(retbuf+3), *(retbuf+4), *(retbuf+5),
			*(retbuf+6), *(retbuf+7), *(retbuf+8));
	else
		sprintf(buf, "  <- 0x%lx()\n", opcode);

	relay_write(log_chan, buf, strlen(buf));
}

static struct dentry *create_buf_file_handler(const char *filename,
					      struct dentry *parent, int mode,
					      struct rchan_buf *buf,
					      int *is_global)
{
	return debugfs_create_file(filename, mode, parent, buf,
		&relay_file_operations);
}

static int remove_buf_file_handler(struct dentry *dentry)
{
	debugfs_remove(dentry);
	return 0;
}

static int subbuf_start(struct rchan_buf *buf, void *subbuf, void *prev_subbuf,
			size_t prev_padding)
{
	return 1;
}

static struct rchan_callbacks relay_callbacks =
{
	.create_buf_file = create_buf_file_handler,
	.remove_buf_file = remove_buf_file_handler,
	.subbuf_start = subbuf_start,
};

static struct dentry *debugfs_root;

static int __init hcall_trace_init(void)
{
	debugfs_root = debugfs_create_dir("hcall_trace", powerpc_debugfs_root);

	if (debugfs_root == ERR_PTR(-ENODEV)) {
		printk(KERN_ERR "Debugfs not configured\n");
		goto err_out;
	}

	if (!debugfs_root) {
		printk(KERN_ERR "Could not create debugfs directory\n");
		goto err_out;
	}

	log_chan = relay_open("cpu", debugfs_root, SUBBUF_SIZE,
			      N_SUBBUFS, &relay_callbacks, NULL);
	if (!log_chan) {
		printk(KERN_ERR "relay_open failed\n");
		goto err_relay_open;
	}

	if (register_trace_hcall_entry(probe_hcall_entry)) {
		printk(KERN_ERR "probe_hcall_entry probe point failed\n");
		goto err_probe_hcall_entry;
	}

	if (register_trace_hcall_exit(probe_hcall_exit)) {
		printk(KERN_ERR "probe_hcall_exit probe point failed\n");
		goto err_probe_hcall_exit;
	}

	return 0;

err_probe_hcall_exit:
	unregister_trace_hcall_entry(probe_hcall_entry);
err_probe_hcall_entry:
	relay_close(log_chan);
err_relay_open:
	debugfs_remove(debugfs_root);
err_out:
	return -ENODEV;
}

static void __exit hcall_trace_exit(void)
{
	unregister_trace_hcall_exit(probe_hcall_exit);
	unregister_trace_hcall_entry(probe_hcall_entry);

	relay_close(log_chan);
	debugfs_remove(debugfs_root);
}

module_init(hcall_trace_init)
module_exit(hcall_trace_exit)
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Anton Blanchard");

^ permalink raw reply

* [0/6] Assorted hugepage cleanups (v4)
From: David Gibson @ 2009-10-27  5:22 UTC (permalink / raw)
  To: linuxppc-dev, Benjamin Herrenschmidt

Currently, ordinary pages use one pagetable layout, and each different
hugepage size uses a slightly different variant layout.  A number of
places which need to walk the pagetable must first check the slice map
to see what the pagetable layout is, then handle the various different
forms.  New hardware, like Book3E, is liable to introduce more possible
variants.

This patch series, therefore, is designed to simplify the matter by
limiting knowledge of the pagetable layout to only the allocation
path.  With this patch, ordinary pages are handled as ever, with a
fixed 4 (or 3) level tree.  All other variants branch off from some
layer of that with a specially marked PGD/PUD/PMD pointer which also
contains enough information to interpret the directories below that
point.  This means that things walking the pagetables (without
allocating) don't need to look up the slice map, they can just step
down the tree in the usual way, branching off to the "non-standard
layout" path for hugepages, which uses the embedded information to
interpret the tree from that point on.
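
One way to picture a "specially marked" pointer is to keep a small tag
in the low bits of a suitably aligned table address. The sketch below is
a userspace illustration only; the number of tag bits, the choice of
storing an index size there and all sketch_* names are my assumptions,
not the encoding used by the patches:

```c
#include <assert.h>
#include <stdint.h>

/* Low bits reserved for the tag; tables must be aligned to match. */
#define SKETCH_TAG_BITS	4
#define SKETCH_TAG_MASK	((uintptr_t)((1u << SKETCH_TAG_BITS) - 1))

/* Combine an aligned table address and a small index size. */
static uintptr_t sketch_encode(void *table, unsigned shift)
{
	uintptr_t p = (uintptr_t)table;

	assert((p & SKETCH_TAG_MASK) == 0);	/* alignment leaves room */
	assert(shift <= SKETCH_TAG_MASK);	/* tag must fit */
	return p | shift;
}

/* Recover the table address by masking off the tag. */
static void *sketch_table(uintptr_t entry)
{
	return (void *)(entry & ~SKETCH_TAG_MASK);
}

/* Recover the index size stored in the low bits. */
static unsigned sketch_shift(uintptr_t entry)
{
	return (unsigned)(entry & SKETCH_TAG_MASK);
}
```

A walker that finds such an entry needs nothing beyond the entry itself
to know how large the table below it is.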

This reduces the source size in a number of places, and means that
newer variants on the pagetable layout to handle new hardware and new
features will need to alter the existing code in fewer places.

In addition we split out the hash / classic MMU specific code into a
separate hugetlbpage-hash64.c file.  This will make adding support for
other MMUs (like 440 and/or Book3E) easier.

I've used the libhugetlbfs testsuite to test these patches on a
Power5+ machine, but they could certainly do with more testing. In
particular, I don't have any suitable hardware to test 16G pages.

V2: Made the tweaks that BenH suggested to patch 2 of the original
series.  Some corresponding tweaks in patch 3 to match.

V3: Fix a bug in the creation of the pgtable caches.  Slightly extend
the initialization cleanup.  Add a new patch cleaning up the hugepage
pte accessor functions.

V4: Revisions based on BenH's comments, fix compile breakage for
!CONFIG_HUGETLB_PAGE.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply

* [2/6] Cleanup management of kmem_caches for pagetables
From: David Gibson @ 2009-10-27  5:24 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, linuxppc-dev
In-Reply-To: <20091027052258.GD20694@yookeroo.seuss>

Currently we have a fair bit of rather fiddly code to manage the
various kmem_caches used to store page tables of various levels.  We
generally have two caches holding some combination of PGD, PUD and PMD
tables, plus several more for the special hugepage pagetables.

This patch cleans this all up by taking a different approach.  Rather
than the caches being designated as for PUDs or for hugeptes for 16M
pages, the caches are simply allocated to be a specific size.  Thus
sharing of caches between different types/levels of pagetables happens
naturally.  The pagetable size, where needed, is passed around encoded
in the same way as {PGD,PUD,PMD}_INDEX_SIZE; that is n where the
pagetable contains 2^n pointers.
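
For example, the size in bytes of a table follows directly from its
index size. A minimal userspace sketch of that relationship (the
function name and the bound on n are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Size in bytes of a pagetable holding 2^shift pointers, where shift
 * is the index size in the {PGD,PUD,PMD}_INDEX_SIZE sense.
 */
static size_t sketch_table_size(unsigned shift)
{
	assert(shift < 32);	/* arbitrary sanity bound for the sketch */
	return sizeof(void *) << shift;
}
```

With 8-byte pointers an index size of 9 gives a 4096-byte table, and two
levels with the same index size end up sharing one cache.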

Signed-off-by: David Gibson <dwg@au1.ibm.com>

---
 arch/powerpc/include/asm/pgalloc-64.h    |   60 +++++++++++++++-----------
 arch/powerpc/include/asm/pgalloc.h       |   30 +------------
 arch/powerpc/include/asm/pgtable-ppc64.h |    1 
 arch/powerpc/mm/hugetlbpage.c            |   45 +++++--------------
 arch/powerpc/mm/init_64.c                |   70 +++++++++++++++++++++----------
 arch/powerpc/mm/pgtable.c                |   25 +++++++----
 6 files changed, 117 insertions(+), 114 deletions(-)

Index: working-2.6/arch/powerpc/mm/init_64.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/init_64.c	2009-10-27 15:30:17.000000000 +1100
+++ working-2.6/arch/powerpc/mm/init_64.c	2009-10-27 15:37:04.000000000 +1100
@@ -119,30 +119,58 @@ static void pmd_ctor(void *addr)
 	memset(addr, 0, PMD_TABLE_SIZE);
 }
 
-static const unsigned int pgtable_cache_size[2] = {
-	PGD_TABLE_SIZE, PMD_TABLE_SIZE
-};
-static const char *pgtable_cache_name[ARRAY_SIZE(pgtable_cache_size)] = {
-#ifdef CONFIG_PPC_64K_PAGES
-	"pgd_cache", "pmd_cache",
-#else
-	"pgd_cache", "pud_pmd_cache",
-#endif /* CONFIG_PPC_64K_PAGES */
-};
-
-#ifdef CONFIG_HUGETLB_PAGE
-/* Hugepages need an extra cache per hugepagesize, initialized in
- * hugetlbpage.c.  We can't put into the tables above, because HPAGE_SHIFT
- * is not compile time constant. */
-struct kmem_cache *pgtable_cache[ARRAY_SIZE(pgtable_cache_size)+MMU_PAGE_COUNT];
-#else
-struct kmem_cache *pgtable_cache[ARRAY_SIZE(pgtable_cache_size)];
-#endif
+struct kmem_cache *pgtable_cache[MAX_PGTABLE_INDEX_SIZE];
+
+/*
+ * Create a kmem_cache() for pagetables.  This is not used for PTE
+ * pages - they're linked to struct page, come from the normal free
+ * pages pool and have a different entry size (see real_pte_t) to
+ * everything else.  Caches created by this function are used for all
+ * the higher level pagetables, and for hugepage pagetables.
+ */
+void pgtable_cache_add(unsigned shift, void (*ctor)(void *))
+{
+	char *name;
+	unsigned long table_size = sizeof(void *) << shift;
+	unsigned long align = table_size;
+
+	/* When batching pgtable pointers for RCU freeing, we store
+	 * the index size in the low bits.  Table alignment must be
+	 * big enough to fit it */
+	unsigned long minalign = MAX_PGTABLE_INDEX_SIZE + 1;
+	struct kmem_cache *new;
+
+	/* It would be nice if this was a BUILD_BUG_ON(), but at the
+	 * moment, gcc doesn't seem to recognize is_power_of_2 as a
+	 * constant expression, so so much for that. */
+	BUG_ON(!is_power_of_2(minalign));
+	BUG_ON((shift < 1) || (shift > MAX_PGTABLE_INDEX_SIZE));
+
+	if (PGT_CACHE(shift))
+		return; /* Already have a cache of this size */
+
+	align = max_t(unsigned long, align, minalign);
+	name = kasprintf(GFP_KERNEL, "pgtable-2^%d", shift);
+	new = kmem_cache_create(name, table_size, align, 0, ctor);
+	PGT_CACHE(shift) = new;
+
+	pr_debug("Allocated pgtable cache for order %d\n", shift);
+}
+
 
 void pgtable_cache_init(void)
 {
-	pgtable_cache[0] = kmem_cache_create(pgtable_cache_name[0], PGD_TABLE_SIZE, PGD_TABLE_SIZE, SLAB_PANIC, pgd_ctor);
-	pgtable_cache[1] = kmem_cache_create(pgtable_cache_name[1], PMD_TABLE_SIZE, PMD_TABLE_SIZE, SLAB_PANIC, pmd_ctor);
+	pgtable_cache_add(PGD_INDEX_SIZE, pgd_ctor);
+	pgtable_cache_add(PMD_INDEX_SIZE, pmd_ctor);
+	if (!PGT_CACHE(PGD_INDEX_SIZE) || !PGT_CACHE(PMD_INDEX_SIZE))
+		panic("Couldn't allocate pgtable caches");
+
+	/* In all current configs, when the PUD index exists it's the
+	 * same size as either the pgd or pmd index.  Verify that the
+	 * initialization above has also created a PUD cache.  This
+	 * will need re-examination if we add new possibilities for
+	 * the pagetable layout. */
+	BUG_ON(PUD_INDEX_SIZE && !PGT_CACHE(PUD_INDEX_SIZE));
 }
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
Index: working-2.6/arch/powerpc/include/asm/pgalloc-64.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgalloc-64.h	2009-10-27 15:30:16.000000000 +1100
+++ working-2.6/arch/powerpc/include/asm/pgalloc-64.h	2009-10-27 15:30:18.000000000 +1100
@@ -11,27 +11,39 @@
 #include <linux/cpumask.h>
 #include <linux/percpu.h>
 
+/*
+ * Functions that deal with pagetables that could be at any level of
+ * the table need to be passed an "index_size" so they know how to
+ * handle allocation.  PTE pages (which are linked to a struct page for
+ * now, and drawn from the main get_free_pages() pool) use index_size 0.
+ * For all other tables the allocation size is (2^index_size *
+ * sizeof(pointer)), drawn from the kmem_cache in PGT_CACHE(index_size).
+ *
+ * The maximum index size needs to be big enough to allow any
+ * pagetable sizes we need, but small enough to fit in the low bits of
+ * any page table pointer.  In other words all pagetables, even tiny
+ * ones, must be aligned to allow at least enough low 0 bits to
+ * contain this value.  This value is also used as a mask, so it must
+ * be one less than a power of two.
+ */
+#define MAX_PGTABLE_INDEX_SIZE	0xf
+
 #ifndef CONFIG_PPC_SUBPAGE_PROT
 static inline void subpage_prot_free(pgd_t *pgd) {}
 #endif
 
 extern struct kmem_cache *pgtable_cache[];
-
-#define PGD_CACHE_NUM		0
-#define PUD_CACHE_NUM		1
-#define PMD_CACHE_NUM		1
-#define HUGEPTE_CACHE_NUM	2
-#define PTE_NONCACHE_NUM	7  /* from GFP rather than kmem_cache */
+#define PGT_CACHE(shift) (pgtable_cache[(shift)-1])
 
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	return kmem_cache_alloc(pgtable_cache[PGD_CACHE_NUM], GFP_KERNEL);
+	return kmem_cache_alloc(PGT_CACHE(PGD_INDEX_SIZE), GFP_KERNEL);
 }
 
 static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
 	subpage_prot_free(pgd);
-	kmem_cache_free(pgtable_cache[PGD_CACHE_NUM], pgd);
+	kmem_cache_free(PGT_CACHE(PGD_INDEX_SIZE), pgd);
 }
 
 #ifndef CONFIG_PPC_64K_PAGES
@@ -40,13 +52,13 @@ static inline void pgd_free(struct mm_st
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return kmem_cache_alloc(pgtable_cache[PUD_CACHE_NUM],
+	return kmem_cache_alloc(PGT_CACHE(PUD_INDEX_SIZE),
 				GFP_KERNEL|__GFP_REPEAT);
 }
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
 {
-	kmem_cache_free(pgtable_cache[PUD_CACHE_NUM], pud);
+	kmem_cache_free(PGT_CACHE(PUD_INDEX_SIZE), pud);
 }
 
 static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
@@ -78,13 +90,13 @@ static inline void pmd_populate_kernel(s
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return kmem_cache_alloc(pgtable_cache[PMD_CACHE_NUM],
+	return kmem_cache_alloc(PGT_CACHE(PMD_INDEX_SIZE),
 				GFP_KERNEL|__GFP_REPEAT);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 {
-	kmem_cache_free(pgtable_cache[PMD_CACHE_NUM], pmd);
+	kmem_cache_free(PGT_CACHE(PMD_INDEX_SIZE), pmd);
 }
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
@@ -107,24 +119,22 @@ static inline pgtable_t pte_alloc_one(st
 	return page;
 }
 
-static inline void pgtable_free(pgtable_free_t pgf)
+static inline void pgtable_free(void *table, unsigned index_size)
 {
-	void *p = (void *)(pgf.val & ~PGF_CACHENUM_MASK);
-	int cachenum = pgf.val & PGF_CACHENUM_MASK;
-
-	if (cachenum == PTE_NONCACHE_NUM)
-		free_page((unsigned long)p);
-	else
-		kmem_cache_free(pgtable_cache[cachenum], p);
+	if (!index_size)
+		free_page((unsigned long)table);
+	else {
+		BUG_ON(index_size > MAX_PGTABLE_INDEX_SIZE);
+		kmem_cache_free(PGT_CACHE(index_size), table);
+	}
 }
 
-#define __pmd_free_tlb(tlb, pmd,addr)		      \
-	pgtable_free_tlb(tlb, pgtable_free_cache(pmd, \
-		PMD_CACHE_NUM, PMD_TABLE_SIZE-1))
+#define __pmd_free_tlb(tlb, pmd, addr)		      \
+	pgtable_free_tlb(tlb, pmd, PMD_INDEX_SIZE)
 #ifndef CONFIG_PPC_64K_PAGES
 #define __pud_free_tlb(tlb, pud, addr)		      \
-	pgtable_free_tlb(tlb, pgtable_free_cache(pud, \
-		PUD_CACHE_NUM, PUD_TABLE_SIZE-1))
+	pgtable_free_tlb(tlb, pud, PUD_INDEX_SIZE)
+
 #endif /* CONFIG_PPC_64K_PAGES */
 
 #define check_pgt_cache()	do { } while (0)
Index: working-2.6/arch/powerpc/include/asm/pgalloc.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgalloc.h	2009-10-27 15:30:17.000000000 +1100
+++ working-2.6/arch/powerpc/include/asm/pgalloc.h	2009-10-27 15:30:18.000000000 +1100
@@ -24,25 +24,6 @@ static inline void pte_free(struct mm_st
 	__free_page(ptepage);
 }
 
-typedef struct pgtable_free {
-	unsigned long val;
-} pgtable_free_t;
-
-/* This needs to be big enough to allow for MMU_PAGE_COUNT + 2 to be stored
- * and small enough to fit in the low bits of any naturally aligned page
- * table cache entry. Arbitrarily set to 0x1f, that should give us some
- * room to grow
- */
-#define PGF_CACHENUM_MASK	0x1f
-
-static inline pgtable_free_t pgtable_free_cache(void *p, int cachenum,
-						unsigned long mask)
-{
-	BUG_ON(cachenum > PGF_CACHENUM_MASK);
-
-	return (pgtable_free_t){.val = ((unsigned long) p & ~mask) | cachenum};
-}
-
 #ifdef CONFIG_PPC64
 #include <asm/pgalloc-64.h>
 #else
@@ -50,12 +31,12 @@ static inline pgtable_free_t pgtable_fre
 #endif
 
 #ifdef CONFIG_SMP
-extern void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf);
+extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift);
 extern void pte_free_finish(void);
 #else /* CONFIG_SMP */
-static inline void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf)
+static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
-	pgtable_free(pgf);
+	pgtable_free(table, shift);
 }
 static inline void pte_free_finish(void) { }
 #endif /* !CONFIG_SMP */
@@ -63,12 +44,9 @@ static inline void pte_free_finish(void)
 static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
 				  unsigned long address)
 {
-	pgtable_free_t pgf = pgtable_free_cache(page_address(ptepage),
-						PTE_NONCACHE_NUM,
-						PTE_TABLE_SIZE-1);
 	tlb_flush_pgtable(tlb, address);
 	pgtable_page_dtor(ptepage);
-	pgtable_free_tlb(tlb, pgf);
+	pgtable_free_tlb(tlb, page_address(ptepage), 0);
 }
 
 #endif /* __KERNEL__ */
Index: working-2.6/arch/powerpc/mm/pgtable.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/pgtable.c	2009-10-27 15:30:17.000000000 +1100
+++ working-2.6/arch/powerpc/mm/pgtable.c	2009-10-27 15:30:18.000000000 +1100
@@ -49,12 +49,12 @@ struct pte_freelist_batch
 {
 	struct rcu_head	rcu;
 	unsigned int	index;
-	pgtable_free_t	tables[0];
+	unsigned long	tables[0];
 };
 
 #define PTE_FREELIST_SIZE \
 	((PAGE_SIZE - sizeof(struct pte_freelist_batch)) \
-	  / sizeof(pgtable_free_t))
+	  / sizeof(unsigned long))
 
 static void pte_free_smp_sync(void *arg)
 {
@@ -64,13 +64,13 @@ static void pte_free_smp_sync(void *arg)
 /* This is only called when we are critically out of memory
  * (and fail to get a page in pte_free_tlb).
  */
-static void pgtable_free_now(pgtable_free_t pgf)
+static void pgtable_free_now(void *table, unsigned shift)
 {
 	pte_freelist_forced_free++;
 
 	smp_call_function(pte_free_smp_sync, NULL, 1);
 
-	pgtable_free(pgf);
+	pgtable_free(table, shift);
 }
 
 static void pte_free_rcu_callback(struct rcu_head *head)
@@ -79,8 +79,12 @@ static void pte_free_rcu_callback(struct
 		container_of(head, struct pte_freelist_batch, rcu);
 	unsigned int i;
 
-	for (i = 0; i < batch->index; i++)
-		pgtable_free(batch->tables[i]);
+	for (i = 0; i < batch->index; i++) {
+		void *table = (void *)(batch->tables[i] & ~MAX_PGTABLE_INDEX_SIZE);
+		unsigned shift = batch->tables[i] & MAX_PGTABLE_INDEX_SIZE;
+
+		pgtable_free(table, shift);
+	}
 
 	free_page((unsigned long)batch);
 }
@@ -91,25 +95,28 @@ static void pte_free_submit(struct pte_f
 	call_rcu(&batch->rcu, pte_free_rcu_callback);
 }
 
-void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf)
+void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
 	/* This is safe since tlb_gather_mmu has disabled preemption */
 	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
+	unsigned long pgf;
 
 	if (atomic_read(&tlb->mm->mm_users) < 2 ||
 	    cpumask_equal(mm_cpumask(tlb->mm), cpumask_of(smp_processor_id()))){
-		pgtable_free(pgf);
+		pgtable_free(table, shift);
 		return;
 	}
 
 	if (*batchp == NULL) {
 		*batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC);
 		if (*batchp == NULL) {
-			pgtable_free_now(pgf);
+			pgtable_free_now(table, shift);
 			return;
 		}
 		(*batchp)->index = 0;
 	}
+	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+	pgf = (unsigned long)table | shift;
 	(*batchp)->tables[(*batchp)->index++] = pgf;
 	if ((*batchp)->index == PTE_FREELIST_SIZE) {
 		pte_free_submit(*batchp);
Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2009-10-27 15:30:17.000000000 +1100
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2009-10-27 15:35:27.000000000 +1100
@@ -43,26 +43,14 @@ static unsigned nr_gpages;
 unsigned int mmu_huge_psizes[MMU_PAGE_COUNT] = { }; /* initialize all to 0 */
 
 #define hugepte_shift			mmu_huge_psizes
-#define PTRS_PER_HUGEPTE(psize)		(1 << hugepte_shift[psize])
-#define HUGEPTE_TABLE_SIZE(psize)	(sizeof(pte_t) << hugepte_shift[psize])
+#define HUGEPTE_INDEX_SIZE(psize)	(mmu_huge_psizes[(psize)])
+#define PTRS_PER_HUGEPTE(psize)		(1 << mmu_huge_psizes[psize])
 
 #define HUGEPD_SHIFT(psize)		(mmu_psize_to_shift(psize) \
-						+ hugepte_shift[psize])
+					 + HUGEPTE_INDEX_SIZE(psize))
 #define HUGEPD_SIZE(psize)		(1UL << HUGEPD_SHIFT(psize))
 #define HUGEPD_MASK(psize)		(~(HUGEPD_SIZE(psize)-1))
 
-/* Subtract one from array size because we don't need a cache for 4K since
- * is not a huge page size */
-#define HUGE_PGTABLE_INDEX(psize)	(HUGEPTE_CACHE_NUM + psize - 1)
-#define HUGEPTE_CACHE_NAME(psize)	(huge_pgtable_cache_name[psize])
-
-static const char *huge_pgtable_cache_name[MMU_PAGE_COUNT] = {
-	[MMU_PAGE_64K]	= "hugepte_cache_64K",
-	[MMU_PAGE_1M]	= "hugepte_cache_1M",
-	[MMU_PAGE_16M]	= "hugepte_cache_16M",
-	[MMU_PAGE_16G]	= "hugepte_cache_16G",
-};
-
 /* Flag to mark huge PD pointers.  This means pmd_bad() and pud_bad()
  * will choke on pointers to hugepte tables, which is handy for
  * catching screwups early. */
@@ -114,15 +102,15 @@ static inline pte_t *hugepte_offset(huge
 static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 			   unsigned long address, unsigned int psize)
 {
-	pte_t *new = kmem_cache_zalloc(pgtable_cache[HUGE_PGTABLE_INDEX(psize)],
-				      GFP_KERNEL|__GFP_REPEAT);
+	pte_t *new = kmem_cache_zalloc(PGT_CACHE(hugepte_shift[psize]),
+				       GFP_KERNEL|__GFP_REPEAT);
 
 	if (! new)
 		return -ENOMEM;
 
 	spin_lock(&mm->page_table_lock);
 	if (!hugepd_none(*hpdp))
-		kmem_cache_free(pgtable_cache[HUGE_PGTABLE_INDEX(psize)], new);
+		kmem_cache_free(PGT_CACHE(hugepte_shift[psize]), new);
 	else
 		hpdp->pd = (unsigned long)new | HUGEPD_OK;
 	spin_unlock(&mm->page_table_lock);
@@ -271,9 +259,7 @@ static void free_hugepte_range(struct mm
 
 	hpdp->pd = 0;
 	tlb->need_flush = 1;
-	pgtable_free_tlb(tlb, pgtable_free_cache(hugepte,
-						 HUGEPTE_CACHE_NUM+psize-1,
-						 PGF_CACHENUM_MASK));
+	pgtable_free_tlb(tlb, hugepte, hugepte_shift[psize]);
 }
 
 static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
@@ -698,8 +684,6 @@ static void __init set_huge_psize(int ps
 		if (mmu_huge_psizes[psize] ||
 		   mmu_psize_defs[psize].shift == PAGE_SHIFT)
 			return;
-		if (WARN_ON(HUGEPTE_CACHE_NAME(psize) == NULL))
-			return;
 		hugetlb_add_hstate(mmu_psize_defs[psize].shift - PAGE_SHIFT);
 
 		switch (mmu_psize_defs[psize].shift) {
@@ -769,16 +753,11 @@ static int __init hugetlbpage_init(void)
 
 	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
 		if (mmu_huge_psizes[psize]) {
-			pgtable_cache[HUGE_PGTABLE_INDEX(psize)] =
-				kmem_cache_create(
-					HUGEPTE_CACHE_NAME(psize),
-					HUGEPTE_TABLE_SIZE(psize),
-					HUGEPTE_TABLE_SIZE(psize),
-					0,
-					NULL);
-			if (!pgtable_cache[HUGE_PGTABLE_INDEX(psize)])
-				panic("hugetlbpage_init(): could not create %s"\
-				      "\n", HUGEPTE_CACHE_NAME(psize));
+			pgtable_cache_add(hugepte_shift[psize], NULL);
+			if (!PGT_CACHE(hugepte_shift[psize]))
+				panic("hugetlbpage_init(): could not create "
+				      "pgtable cache for %d bit pagesize\n",
+				      mmu_psize_to_shift(psize));
 		}
 	}
 
Index: working-2.6/arch/powerpc/include/asm/pgtable-ppc64.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgtable-ppc64.h	2009-10-27 15:30:17.000000000 +1100
+++ working-2.6/arch/powerpc/include/asm/pgtable-ppc64.h	2009-10-27 15:35:27.000000000 +1100
@@ -354,6 +354,7 @@ static inline void __ptep_set_access_fla
 #define pgoff_to_pte(off)	((pte_t) {((off) << PTE_RPN_SHIFT)|_PAGE_FILE})
 #define PTE_FILE_MAX_BITS	(BITS_PER_LONG - PTE_RPN_SHIFT)
 
+void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
 void pgtable_cache_init(void);
 
 /*

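The low-bit packing that the RCU batch relies on can be modelled in
userspace roughly as follows.  This is a sketch under assumptions, not
the kernel code: pgf_pack(), pgf_table() and pgf_shift() are
hypothetical helper names, and the model stores the index size itself
in the low bits, which is what the decode loop in
pte_free_rcu_callback() expects (an index size of 0 marks a PTE page).

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PGTABLE_INDEX_SIZE	0xf	/* doubles as the low-bit mask */

/*
 * Pack an aligned table pointer and its index size into one word.
 * The table must be aligned to at least MAX_PGTABLE_INDEX_SIZE + 1
 * bytes so that the low bits are free to hold the index size.
 */
static uintptr_t pgf_pack(void *table, unsigned shift)
{
	assert(shift <= MAX_PGTABLE_INDEX_SIZE);
	assert(((uintptr_t)table & MAX_PGTABLE_INDEX_SIZE) == 0);
	return (uintptr_t)table | shift;
}

/* Recover the table pointer by masking off the low bits. */
static void *pgf_table(uintptr_t pgf)
{
	return (void *)(pgf & ~(uintptr_t)MAX_PGTABLE_INDEX_SIZE);
}

/* Recover the index size from the low bits. */
static unsigned pgf_shift(uintptr_t pgf)
{
	return (unsigned)(pgf & MAX_PGTABLE_INDEX_SIZE);
}
```

The round trip only works if the encode and decode sides agree on what
the low bits hold, and if every table is aligned to at least 16 bytes,
which is why pgtable_cache_add() enforces a minimum alignment of
MAX_PGTABLE_INDEX_SIZE + 1.
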
^ permalink raw reply

