LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v2 2/4] kvm/ppc/booke64: Fix lazy ee handling in kvmppc_handle_exit()
From: Scott Wood @ 2013-05-10 22:43 UTC (permalink / raw)
  To: Bhushan Bharat-R65777
  Cc: Wood Scott-B07421, kvm@vger.kernel.org, Alexander Graf,
	kvm-ppc@vger.kernel.org, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <6A3DF150A5B70D4F9B66A25E3F7C888D0700F847@039-SN2MPN1-011.039d.mgd.msft.net>

On 05/10/2013 12:01:19 AM, Bhushan Bharat-R65777 wrote:
>=20
>=20
> > -----Original Message-----
> > From: kvm-ppc-owner@vger.kernel.org =20
> [mailto:kvm-ppc-owner@vger.kernel.org] On
> > Behalf Of Scott Wood
> > Sent: Friday, May 10, 2013 8:40 AM
> > To: Alexander Graf; Benjamin Herrenschmidt
> > Cc: kvm-ppc@vger.kernel.org; kvm@vger.kernel.org; =20
> linuxppc-dev@lists.ozlabs.org;
> > Wood Scott-B07421
> > Subject: [PATCH v2 2/4] kvm/ppc/booke64: Fix lazy ee handling in
> > kvmppc_handle_exit()
> >
> > EE is hard-disabled on entry to kvmppc_handle_exit(), so call
> > hard_irq_disable() so that PACA_IRQ_HARD_DIS is set, and =20
> soft_enabled
> > is unset.
> >
> > Without this, we get warnings such as =20
> arch/powerpc/kernel/time.c:300,
> > and sometimes host kernel hangs.
> >
> > Signed-off-by: Scott Wood <scottwood@freescale.com>
> > ---
> >  arch/powerpc/kvm/booke.c |    5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
> > index 1020119..705fc5c 100644
> > --- a/arch/powerpc/kvm/booke.c
> > +++ b/arch/powerpc/kvm/booke.c
> > @@ -833,6 +833,11 @@ int kvmppc_handle_exit(struct kvm_run *run, =20
> struct kvm_vcpu
> > *vcpu,
> >  	int r =3D RESUME_HOST;
> >  	int s;
> >
> > +#ifdef CONFIG_PPC64
> > +	WARN_ON(local_paca->irq_happened !=3D 0);
> > +#endif
> > +	hard_irq_disable();
>=20
> It is not actually to hard disable as EE is already clear but to make =20
> it looks like hard_disable to host. Right?
> If so, should we write a comment here on why we are doing this?

Yes, I can add a comment.

-Scott=

^ permalink raw reply

* Re: [PATCH v2 3/4] kvm/ppc: Call trace_hardirqs_on before entry
From: Scott Wood @ 2013-05-10 22:47 UTC (permalink / raw)
  To: tiejun.chen
  Cc: Wood Scott-B07421, kvm@vger.kernel.org, Alexander Graf,
	kvm-ppc@vger.kernel.org, Bhushan Bharat-R65777,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <518C7A28.4000903@windriver.com>

On 05/09/2013 11:40:08 PM, tiejun.chen wrote:
> On 05/10/2013 11:34 AM, Bhushan Bharat-R65777 wrote:
>>> diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
>>> index 705fc5c..eb89b83 100644
>>> --- a/arch/powerpc/kvm/booke.c
>>> +++ b/arch/powerpc/kvm/booke.c
>>> @@ -673,7 +673,7 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, =20
>>> struct kvm_vcpu
>>> *vcpu)
>>>   		ret =3D s;
>>>   		goto out;
>>>   	}
>>> -	kvmppc_lazy_ee_enable();
>>> +	kvmppc_fix_ee_before_entry();
>>=20
>> local_irq_disable() is called before kvmppc_prepare_to_enter().
>=20
> In patch 4, we call hard_irq_disable() once enter =20
> kvmppc_prepare_to_enter().

And before patch 4, we have the code near the end of =20
kvmppc_prepare_to_enter() that checks lazy_irq_pending() and aborts =20
guest entry if there was a race.  If I'd known about that bit of code =20
beforehand, I probably wouldn't have bothered with most of patch 4/4, =20
but now that it's been done it seems cleaner.

-Scott=

^ permalink raw reply

* Re: [PATCH v2 4/4] kvm/ppc: IRQ disabling cleanup
From: Scott Wood @ 2013-05-10 22:53 UTC (permalink / raw)
  To: Bhushan Bharat-R65777
  Cc: Wood Scott-B07421, kvm@vger.kernel.org, Alexander Graf,
	kvm-ppc@vger.kernel.org, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <6A3DF150A5B70D4F9B66A25E3F7C888D0700F859@039-SN2MPN1-011.039d.mgd.msft.net>

On 05/10/2013 12:01:38 AM, Bhushan Bharat-R65777 wrote:
>=20
>=20
> > -----Original Message-----
> > From: kvm-ppc-owner@vger.kernel.org =20
> [mailto:kvm-ppc-owner@vger.kernel.org] On
> > Behalf Of Scott Wood
> > Sent: Friday, May 10, 2013 8:40 AM
> > To: Alexander Graf; Benjamin Herrenschmidt
> > Cc: kvm-ppc@vger.kernel.org; kvm@vger.kernel.org; =20
> linuxppc-dev@lists.ozlabs.org;
> > Wood Scott-B07421
> > Subject: [PATCH v2 4/4] kvm/ppc: IRQ disabling cleanup
> >
> > -	WARN_ON_ONCE(!irqs_disabled());
> > +	WARN_ON(irqs_disabled());
> > +	hard_irq_disable();
>=20
> Here we hard disable in kvmppc_prepare_to_enter(), so my comment in =20
> other patch about interrupt loss is no more valid.
>=20
> So here
>   MSR.EE =3D 0
>   local_paca->soft_enabled =3D 0
>   local_paca->irq_happened |=3D PACA_IRQ_HARD_DIS;
>=20
> > +
> >  	while (true) {
> >  		if (need_resched()) {
> >  			local_irq_enable();
>=20
> This will make the state:
>   MSR.EE =3D 1
>   local_paca->soft_enabled =3D 1
>   local_paca->irq_happened =3D PACA_IRQ_HARD_DIS;  //same as before
>=20
> Is that a valid state where interrupts are fully enabled and =20
> irq_happend in not 0?

PACA_IRQ_HARD_DIS will have been cleared by local_irq_enable(), as =20
Tiejun pointed out.

> int kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
> {
>         int r =3D 0;
>         WARN_ON_ONCE(!irqs_disabled());
>=20
>         kvmppc_core_check_exceptions(vcpu);
>=20
>         if (vcpu->requests) {
>                 /* Exception delivery raised request; start over */
>                 return 1;
>         }
>=20
>         if (vcpu->arch.shared->msr & MSR_WE) {
>                 local_irq_enable();
>                 kvm_vcpu_block(vcpu);
>                 clear_bit(KVM_REQ_UNHALT, &vcpu->requests);
>                 local_irq_disable();
> ^^^
> We do not require hard_irq_disable() here?

Yes, that should be changed to hard_irq_disable(), and I'll add a =20
WARN_ON to double check that interrupts are hard-disabled (eventually =20
we'll probably want to make these critical-path assertions dependent on =20
a debug option...).  It doesn't really matter all that much, though, =20
since we don't have MSR_WE on any 64-bit booke chips. :-)

-Scott=

^ permalink raw reply

* Re: [PATCH 2/2] powerpc/85xx: enable coreint for all the 64bit boards
From: Kevin Hao @ 2013-05-11  7:00 UTC (permalink / raw)
  To: Kumar Gala, Benjamin Herrenschmidt; +Cc: Scott Wood, linuxppc
In-Reply-To: <1365643954-20798-3-git-send-email-haokexin@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 3090 bytes --]

On Thu, Apr 11, 2013 at 09:32:34AM +0800, Kevin Hao wrote:
> With the patch 7230c564 (powerpc: Rework lazy-interrupt handling),
> it seems that the coreint works pretty well on the 85xx 64bit kernel.
> So use the coreint by default for these boards.

Hi Kumar,

Since the external proxy works pretty well with the lazy EE without any change
to the current kernel, could you please pick this patch up? This has already
been acked by Scott. I also think I got a implicit ack from Benjamin.
I will try to use the PACA_IRQ_EE_EDGE to make the support for external
proxy more better a little later as suggested by Benjamin.

Thanks,
Kevin

> 
> Signed-off-by: Kevin Hao <haokexin@gmail.com>
> ---
>  arch/powerpc/platforms/85xx/p5020_ds.c  | 5 -----
>  arch/powerpc/platforms/85xx/p5040_ds.c  | 5 -----
>  arch/powerpc/platforms/85xx/t4240_qds.c | 5 -----
>  3 files changed, 15 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/85xx/p5020_ds.c b/arch/powerpc/platforms/85xx/p5020_ds.c
> index 753a42c..39cfa40 100644
> --- a/arch/powerpc/platforms/85xx/p5020_ds.c
> +++ b/arch/powerpc/platforms/85xx/p5020_ds.c
> @@ -75,12 +75,7 @@ define_machine(p5020_ds) {
>  #ifdef CONFIG_PCI
>  	.pcibios_fixup_bus	= fsl_pcibios_fixup_bus,
>  #endif
> -/* coreint doesn't play nice with lazy EE, use legacy mpic for now */
> -#ifdef CONFIG_PPC64
> -	.get_irq		= mpic_get_irq,
> -#else
>  	.get_irq		= mpic_get_coreint_irq,
> -#endif
>  	.restart		= fsl_rstcr_restart,
>  	.calibrate_decr		= generic_calibrate_decr,
>  	.progress		= udbg_progress,
> diff --git a/arch/powerpc/platforms/85xx/p5040_ds.c b/arch/powerpc/platforms/85xx/p5040_ds.c
> index 1138185..f70e74c 100644
> --- a/arch/powerpc/platforms/85xx/p5040_ds.c
> +++ b/arch/powerpc/platforms/85xx/p5040_ds.c
> @@ -66,12 +66,7 @@ define_machine(p5040_ds) {
>  #ifdef CONFIG_PCI
>  	.pcibios_fixup_bus	= fsl_pcibios_fixup_bus,
>  #endif
> -/* coreint doesn't play nice with lazy EE, use legacy mpic for now */
> -#ifdef CONFIG_PPC64
> -	.get_irq		= mpic_get_irq,
> -#else
>  	.get_irq		= mpic_get_coreint_irq,
> -#endif
>  	.restart		= fsl_rstcr_restart,
>  	.calibrate_decr		= generic_calibrate_decr,
>  	.progress		= udbg_progress,
> diff --git a/arch/powerpc/platforms/85xx/t4240_qds.c b/arch/powerpc/platforms/85xx/t4240_qds.c
> index 5998e9f..91ead6b 100644
> --- a/arch/powerpc/platforms/85xx/t4240_qds.c
> +++ b/arch/powerpc/platforms/85xx/t4240_qds.c
> @@ -75,12 +75,7 @@ define_machine(t4240_qds) {
>  #ifdef CONFIG_PCI
>  	.pcibios_fixup_bus	= fsl_pcibios_fixup_bus,
>  #endif
> -/* coreint doesn't play nice with lazy EE, use legacy mpic for now */
> -#ifdef CONFIG_PPC64
> -	.get_irq		= mpic_get_irq,
> -#else
>  	.get_irq		= mpic_get_coreint_irq,
> -#endif
>  	.restart		= fsl_rstcr_restart,
>  	.calibrate_decr		= generic_calibrate_decr,
>  	.progress		= udbg_progress,
> -- 
> 1.8.1.4
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

[-- Attachment #2: Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply

* Re: [PATCH] perf: Power7: Make CPI stack events available in sysfs
From: Sukadev Bhattiprolu @ 2013-05-11 22:31 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: linuxppc-dev, Paul Mackerras, linux-kernel,
	Arnaldo Carvalho de Melo
In-Reply-To: <20130422155505.GA26169@us.ibm.com>

Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote:
| Michael Ellerman [michael@ellerman.id.au] wrote:
| | On Sat, Apr 06, 2013 at 09:48:03AM -0700, Sukadev Bhattiprolu wrote:
| | > From bdeacf7175241f6c79b5b2be0fa6b20b0d0b7d1c Mon Sep 17 00:00:00 2001
| | > From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
| | > Date: Sat, 6 Apr 2013 08:48:26 -0700
| | > Subject: [PATCH] perf: Power7: Make CPI stack events available in sysfs
| | > 
| | > A set of Power7 events are often used for Cycles Per Instruction (CPI) stack
| | > analysis. Make these events available in sysfs (/sys/devices/cpu/events/) so
| | > they can be identified using their symbolic names:
| | > 
| | > 	perf stat -e 'cpu/PM_CMPLU_STALL_DCACHE_MISS/' /bin/ls
| | 
| | Should we take these two via the powerpc tree? Or do you want to take
| | them Arnaldo?
| 
| I think it can go through powerpc tree since it is all arch-specific.
| 

Ben, Paul,

Any comments on this patch ?

	https://lkml.org/lkml/2013/4/6/169

^ permalink raw reply

* [PATCH 0/4] enable the PACA_IRQ_EE_EDGE support for book3e
From: Kevin Hao @ 2013-05-11 23:26 UTC (permalink / raw)
  To: Kumar Gala, Benjamin Herrenschmidt; +Cc: Scott Wood, linuxppc

Hi,

In the current kernel, the external proxy works pretty well with the lazy
EE. But it still treat the external input interrupt as level sensitive
and would hard disable the irq while a interrupt occurs with irq soft
disabled. This patch series enables PACA_IRQ_EE_EDGE support for book3e
and will leave the irq still hard enabled while a external interrupt
occurs with soft irq disabled. This passed the boot test on T4240QDS
board.

---
Kevin Hao (4):
  powerpc/book3e: introduce external_input_edge exception handler for
    64bit kernel
  powerpc: move the patch_exception to a common place
  powerpc: use patch_exception to update the debug exception handler
  powerpc/fsl-book3e: enable the external_input_edge exception handler

 arch/powerpc/include/asm/code-patching.h |  7 +++++++
 arch/powerpc/include/asm/paca.h          |  1 +
 arch/powerpc/kernel/asm-offsets.c        |  1 +
 arch/powerpc/kernel/entry_64.S           |  2 ++
 arch/powerpc/kernel/exceptions-64e.S     | 26 +++++++++++++++++++++++---
 arch/powerpc/kernel/irq.c                | 12 ++++++------
 arch/powerpc/kernel/paca.c               |  3 +++
 arch/powerpc/kernel/setup_64.c           |  6 +-----
 arch/powerpc/lib/code-patching.c         | 15 +++++++++++++++
 arch/powerpc/mm/tlb_nohash.c             | 19 -------------------
 arch/powerpc/platforms/85xx/corenet_ds.c |  9 ++++++++-
 arch/powerpc/sysdev/mpic.c               |  8 ++++++++
 12 files changed, 75 insertions(+), 34 deletions(-)

Thanks,
Kevin
-- 
1.8.1.4

^ permalink raw reply

* [PATCH 1/4] powerpc/book3e: introduce external_input_edge exception handler for 64bit kernel
From: Kevin Hao @ 2013-05-11 23:26 UTC (permalink / raw)
  To: Kumar Gala, Benjamin Herrenschmidt; +Cc: Scott Wood, linuxppc
In-Reply-To: <1368314784-971-1-git-send-email-haokexin@gmail.com>

In the external proxy facility mode, the interrupt is automatically
acknowledged with the same effect as reading the IACK register. So
this makes external input interrupt more like edge sensitive. That
means we can leave the irq hard enabled when it occurs with irq soft
disabled just like the dec and doorbell interrupt. But the External
Proxy Register(EPR) is only considered valid from the time that the
external interrupt occurs until MSR[EE] is set to 1. So we have to
save the EPR before irq hard enabled.

In general, we would initialize all the interrupt sources to the same
priority. That means if a interrupt is in service and the other
interrupt would never get any chance to be delivered to the cpu.
And also in order not to complicate the implementation, we only support
to save one content of the EPR register and will set PACA_IRQ_EE
and then hard disable the irq if more external interrupt are received.

In order to support to choose the working mode of MPIC at runtime,
we introduce a new external input exception handler. And we will
use this handler if the external proxy is enabled.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
---
 arch/powerpc/include/asm/paca.h      |  1 +
 arch/powerpc/kernel/asm-offsets.c    |  1 +
 arch/powerpc/kernel/entry_64.S       |  2 ++
 arch/powerpc/kernel/exceptions-64e.S | 26 +++++++++++++++++++++++---
 arch/powerpc/kernel/irq.c            | 12 ++++++------
 arch/powerpc/kernel/paca.c           |  3 +++
 arch/powerpc/sysdev/mpic.c           |  8 ++++++++
 7 files changed, 44 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 77c91e7..9b3649d 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -118,6 +118,7 @@ struct paca_struct {
 	void *mc_kstack;
 	void *crit_kstack;
 	void *dbg_kstack;
+	u32 saved_epr;		/* EPR saved here */
 #endif /* CONFIG_PPC_BOOK3E */
 
 	mm_context_t context;
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index b51a97c..3199708 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -201,6 +201,7 @@ int main(void)
 	DEFINE(PACA_MC_STACK, offsetof(struct paca_struct, mc_kstack));
 	DEFINE(PACA_CRIT_STACK, offsetof(struct paca_struct, crit_kstack));
 	DEFINE(PACA_DBG_STACK, offsetof(struct paca_struct, dbg_kstack));
+	DEFINE(PACASAVEDEPR, offsetof(struct paca_struct, saved_epr));
 #endif /* CONFIG_PPC_BOOK3E */
 
 #ifdef CONFIG_PPC_STD_MMU_64
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 3fe5259..09ac356 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -884,6 +884,8 @@ restore_check_irq_replay:
 	 * still soft-disabled and we keep them that way.
 	*/
 	cmpwi	cr0,r3,0x500
+	cmpwi	cr1,r3,0x510
+	cror	eq,4*cr1+eq,eq
 	bne	1f
 	addi	r3,r1,STACK_FRAME_OVERHEAD;
  	bl	.do_IRQ
diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S
index 42a756e..e32bd6c 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -358,6 +358,9 @@ interrupt_end_book3e:
 	MASKABLE_EXCEPTION(0x500, BOOKE_INTERRUPT_EXTERNAL,
 			   external_input, .do_IRQ, ACK_NONE)
 
+	MASKABLE_EXCEPTION(0x510, BOOKE_INTERRUPT_EXTERNAL,
+			   external_input_edge, .do_IRQ, ACK_NONE)
+
 /* Alignment */
 	START_EXCEPTION(alignment);
 	NORMAL_EXCEPTION_PROLOG(0x600, BOOKE_INTERRUPT_ALIGNMENT,
@@ -685,13 +688,25 @@ kernel_dbg_exc:
 	ori	r10,r10,\paca_irq
 	stb	r10,PACAIRQHAPPENED(r13)
 
+	.if \paca_irq == PACA_IRQ_EE_EDGE
+	lwz	r10,PACASAVEDEPR(r13)
+	cmpwi	r10,~0
+	bne-	98f
+	mfspr	r10,SPRN_EPR
+	stw	r10,PACASAVEDEPR(r13)
+	b	99f
+98:	lbz	r10,PACAIRQHAPPENED(r13)
+	ori	r10,r10,PACA_IRQ_EE
+	stb	r10,PACAIRQHAPPENED(r13)
+	.endif
+
 	.if \full_mask == 1
 	rldicl	r10,r11,48,1		/* clear MSR_EE */
 	rotldi	r11,r10,16
 	mtspr	SPRN_SRR1,r11
 	.endif
 
-	lwz	r11,PACA_EXGEN+EX_CR(r13)
+99:	lwz	r11,PACA_EXGEN+EX_CR(r13)
 	mtcr	r11
 	ld	r10,PACA_EXGEN+EX_R10(r13)
 	ld	r11,PACA_EXGEN+EX_R11(r13)
@@ -701,9 +716,11 @@ kernel_dbg_exc:
 .endm
 
 masked_interrupt_book3e_0x500:
-	// XXX When adding support for EPR, use PACA_IRQ_EE_EDGE
 	masked_interrupt_book3e PACA_IRQ_EE 1
 
+masked_interrupt_book3e_0x510:
+	masked_interrupt_book3e PACA_IRQ_EE_EDGE 1
+
 masked_interrupt_book3e_0x900:
 	ACK_DEC(r10);
 	masked_interrupt_book3e PACA_IRQ_DEC 0
@@ -718,7 +735,7 @@ masked_interrupt_book3e_0x2c0:
 
 /*
  * Called from arch_local_irq_enable when an interrupt needs
- * to be resent. r3 contains either 0x500,0x900,0x260 or 0x280
+ * to be resent. r3 contains either 0x500,0x900,0x510 or 0x280
  * to indicate the kind of interrupt. MSR:EE is already off.
  * We generate a stackframe like if a real interrupt had happened.
  *
@@ -745,6 +762,8 @@ _GLOBAL(__replay_interrupt)
 	beq	exc_0x900_common
 	cmpwi	cr0,r3,0x280
 	beq	exc_0x280_common
+	cmpwi	cr0,r3,0x510
+	beq	exc_0x510_common
 	blr
 
 
@@ -859,6 +878,7 @@ BAD_STACK_TRAMPOLINE(0x310)
 BAD_STACK_TRAMPOLINE(0x320)
 BAD_STACK_TRAMPOLINE(0x400)
 BAD_STACK_TRAMPOLINE(0x500)
+BAD_STACK_TRAMPOLINE(0x510)
 BAD_STACK_TRAMPOLINE(0x600)
 BAD_STACK_TRAMPOLINE(0x700)
 BAD_STACK_TRAMPOLINE(0x800)
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 5cbcf4d..9349db9 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -165,11 +165,6 @@ notrace unsigned int __check_irq_replay(void)
 	if (decrementer_check_overflow())
 		return 0x900;
 
-	/* Finally check if an external interrupt happened */
-	local_paca->irq_happened &= ~PACA_IRQ_EE;
-	if (happened & PACA_IRQ_EE)
-		return 0x500;
-
 #ifdef CONFIG_PPC_BOOK3E
 	/* Finally check if an EPR external interrupt happened
 	 * this bit is typically set if we need to handle another
@@ -177,7 +172,7 @@ notrace unsigned int __check_irq_replay(void)
 	 */
 	local_paca->irq_happened &= ~PACA_IRQ_EE_EDGE;
 	if (happened & PACA_IRQ_EE_EDGE)
-		return 0x500;
+		return 0x510;
 
 	local_paca->irq_happened &= ~PACA_IRQ_DBELL;
 	if (happened & PACA_IRQ_DBELL)
@@ -191,6 +186,11 @@ notrace unsigned int __check_irq_replay(void)
 	}
 #endif /* CONFIG_PPC_BOOK3E */
 
+	/* Finally check if an external interrupt happened */
+	local_paca->irq_happened &= ~PACA_IRQ_EE;
+	if (happened & PACA_IRQ_EE)
+		return 0x500;
+
 	/* There should be nothing left ! */
 	BUG_ON(local_paca->irq_happened != 0);
 
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index f8f2468..ba370e9 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -144,6 +144,9 @@ void __init initialise_paca(struct paca_struct *new_paca, int cpu)
 #ifdef CONFIG_PPC_STD_MMU_64
 	new_paca->slb_shadow_ptr = &slb_shadow[cpu];
 #endif /* CONFIG_PPC_STD_MMU_64 */
+#ifdef CONFIG_PPC_BOOK3E
+	new_paca->saved_epr = ~0;
+#endif
 }
 
 /* Put the paca pointer into r13 and SPRG_PACA */
diff --git a/arch/powerpc/sysdev/mpic.c b/arch/powerpc/sysdev/mpic.c
index ee21b5e..4d248d3 100644
--- a/arch/powerpc/sysdev/mpic.c
+++ b/arch/powerpc/sysdev/mpic.c
@@ -1804,7 +1804,15 @@ unsigned int mpic_get_coreint_irq(void)
 
 	BUG_ON(mpic == NULL);
 
+#ifdef CONFIG_PPC64
+	if (local_paca->saved_epr != ~0) {
+		src = local_paca->saved_epr;
+		local_paca->saved_epr = ~0;
+	} else
+		src = mfspr(SPRN_EPR);
+#else
 	src = mfspr(SPRN_EPR);
+#endif
 
 	if (unlikely(src == mpic->spurious_vec)) {
 		if (mpic->flags & MPIC_SPV_EOI)
-- 
1.8.1.4

^ permalink raw reply related

* [PATCH 2/4] powerpc: move the patch_exception to a common place
From: Kevin Hao @ 2013-05-11 23:26 UTC (permalink / raw)
  To: Kumar Gala, Benjamin Herrenschmidt; +Cc: Scott Wood, linuxppc
In-Reply-To: <1368314784-971-1-git-send-email-haokexin@gmail.com>

So that it can be used by other codes. No function change.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
---
 arch/powerpc/include/asm/code-patching.h |  7 +++++++
 arch/powerpc/lib/code-patching.c         | 15 +++++++++++++++
 arch/powerpc/mm/tlb_nohash.c             | 19 -------------------
 3 files changed, 22 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/code-patching.h b/arch/powerpc/include/asm/code-patching.h
index a6f8c7a..97e02f9 100644
--- a/arch/powerpc/include/asm/code-patching.h
+++ b/arch/powerpc/include/asm/code-patching.h
@@ -34,6 +34,13 @@ int instr_is_branch_to_addr(const unsigned int *instr, unsigned long addr);
 unsigned long branch_target(const unsigned int *instr);
 unsigned int translate_branch(const unsigned int *dest,
 			      const unsigned int *src);
+#ifdef CONFIG_PPC_BOOK3E_64
+void __patch_exception(int exc, unsigned long addr);
+#define patch_exception(exc, name) do { \
+	extern unsigned int name; \
+	__patch_exception((exc), (unsigned long)&name); \
+} while (0)
+#endif
 
 static inline unsigned long ppc_function_entry(void *func)
 {
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 17e5b23..d5edbeb 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -159,6 +159,21 @@ unsigned int translate_branch(const unsigned int *dest, const unsigned int *src)
 	return 0;
 }
 
+#ifdef CONFIG_PPC_BOOK3E_64
+void __patch_exception(int exc, unsigned long addr)
+{
+	extern unsigned int interrupt_base_book3e;
+	unsigned int *ibase = &interrupt_base_book3e;
+
+	/* Our exceptions vectors start with a NOP and -then- a branch
+	 * to deal with single stepping from userspace which stops on
+	 * the second instruction. Thus we need to patch the second
+	 * instruction of the exception, not the first one
+	 */
+
+	patch_branch(ibase + (exc / 4) + 1, addr, 0);
+}
+#endif
 
 #ifdef CONFIG_CODE_PATCHING_SELFTEST
 
diff --git a/arch/powerpc/mm/tlb_nohash.c b/arch/powerpc/mm/tlb_nohash.c
index 6888cad..0658aad 100644
--- a/arch/powerpc/mm/tlb_nohash.c
+++ b/arch/powerpc/mm/tlb_nohash.c
@@ -518,25 +518,6 @@ static void setup_page_sizes(void)
 	}
 }
 
-static void __patch_exception(int exc, unsigned long addr)
-{
-	extern unsigned int interrupt_base_book3e;
- 	unsigned int *ibase = &interrupt_base_book3e;
- 
-	/* Our exceptions vectors start with a NOP and -then- a branch
-	 * to deal with single stepping from userspace which stops on
-	 * the second instruction. Thus we need to patch the second
-	 * instruction of the exception, not the first one
-	 */
-
-	patch_branch(ibase + (exc / 4) + 1, addr, 0);
-}
-
-#define patch_exception(exc, name) do { \
-	extern unsigned int name; \
-	__patch_exception((exc), (unsigned long)&name); \
-} while (0)
-
 static void setup_mmu_htw(void)
 {
 	/* Check if HW tablewalk is present, and if yes, enable it by:
-- 
1.8.1.4

^ permalink raw reply related

* [PATCH 3/4] powerpc: use patch_exception to update the debug exception handler
From: Kevin Hao @ 2013-05-11 23:26 UTC (permalink / raw)
  To: Kumar Gala, Benjamin Herrenschmidt; +Cc: Scott Wood, linuxppc
In-Reply-To: <1368314784-971-1-git-send-email-haokexin@gmail.com>

Signed-off-by: Kevin Hao <haokexin@gmail.com>
---
 arch/powerpc/kernel/setup_64.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index e379d3f..59dffd0 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -505,9 +505,6 @@ static void __init irqstack_early_init(void)
 #ifdef CONFIG_PPC_BOOK3E
 static void __init exc_lvl_early_init(void)
 {
-	extern unsigned int interrupt_base_book3e;
-	extern unsigned int exc_debug_debug_book3e;
-
 	unsigned int i;
 
 	for_each_possible_cpu(i) {
@@ -520,8 +517,7 @@ static void __init exc_lvl_early_init(void)
 	}
 
 	if (cpu_has_feature(CPU_FTR_DEBUG_LVL_EXC))
-		patch_branch(&interrupt_base_book3e + (0x040 / 4) + 1,
-			     (unsigned long)&exc_debug_debug_book3e, 0);
+		patch_exception(0x040, exc_debug_debug_book3e);
 }
 #else
 #define exc_lvl_early_init()
-- 
1.8.1.4

^ permalink raw reply related

* [PATCH 4/4] powerpc/fsl-book3e: enable the external_input_edge exception handler
From: Kevin Hao @ 2013-05-11 23:26 UTC (permalink / raw)
  To: Kumar Gala, Benjamin Herrenschmidt; +Cc: Scott Wood, linuxppc
In-Reply-To: <1368314784-971-1-git-send-email-haokexin@gmail.com>

Enable the external_input_edge exception handler for the fsl ppc64
board when external proxy is enabled. This will leave the irq still
hard enabled when a interrupt occurs with irq soft disabled.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
---
 arch/powerpc/platforms/85xx/corenet_ds.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/85xx/corenet_ds.c b/arch/powerpc/platforms/85xx/corenet_ds.c
index c59c617..ad688bf 100644
--- a/arch/powerpc/platforms/85xx/corenet_ds.c
+++ b/arch/powerpc/platforms/85xx/corenet_ds.c
@@ -25,6 +25,9 @@
 #include <asm/prom.h>
 #include <asm/udbg.h>
 #include <asm/mpic.h>
+#ifdef CONFIG_PPC64
+#include <asm/code-patching.h>
+#endif
 
 #include <linux/of_platform.h>
 #include <sysdev/fsl_soc.h>
@@ -37,8 +40,12 @@ void __init corenet_ds_pic_init(void)
 	unsigned int flags = MPIC_BIG_ENDIAN | MPIC_SINGLE_DEST_CPU |
 		MPIC_NO_RESET;
 
-	if (ppc_md.get_irq == mpic_get_coreint_irq)
+	if (ppc_md.get_irq == mpic_get_coreint_irq) {
 		flags |= MPIC_ENABLE_COREINT;
+#ifdef CONFIG_PPC64
+		patch_exception(0x0a0, exc_external_input_edge_book3e);
+#endif
+	}
 
 	mpic = mpic_alloc(NULL, 0, flags, 0, 512, " OpenPIC  ");
 	BUG_ON(mpic == NULL);
-- 
1.8.1.4

^ permalink raw reply related

* [PATCH] powerpc/mm: Use the correct mask value when looking at pgtable address
From: Aneesh Kumar K.V @ 2013-05-12  8:33 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Our pgtable are 2*sizeof(pte_t)*PTRS_PER_PTE which is PTE_FRAG_SIZE.
Instead of depending on frag size, mask with PMD_MASKED_BITS.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pgalloc-64.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/pgalloc-64.h b/arch/powerpc/include/asm/pgalloc-64.h
index 91acb12..b66ae72 100644
--- a/arch/powerpc/include/asm/pgalloc-64.h
+++ b/arch/powerpc/include/asm/pgalloc-64.h
@@ -186,7 +186,7 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 
 static inline pgtable_t pmd_pgtable(pmd_t pmd)
 {
-	return (pgtable_t)(pmd_val(pmd) & -sizeof(pte_t)*PTRS_PER_PTE);
+	return (pgtable_t)(pmd_val(pmd) & ~PMD_MASKED_BITS);
 }
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
-- 
1.8.1.2

^ permalink raw reply related

* [PATCH -V8 11/18] powerpc: Update find_linux_pte_or_hugepte to handle transparent hugepages
From: Aneesh Kumar K.V @ 2013-05-12 16:14 UTC (permalink / raw)
  To: benh, paulus, dwg; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1368375265-19593-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Reviewed-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/mm/hugetlbpage.c | 32 ++++++++++++++++++++++++++------
 1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 2865077..4928204 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -936,30 +936,50 @@ pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, unsigned *shift
 
 	pg = pgdir + pgd_index(ea);
 
-	if (pgd_huge(*pg)) {
+	/*
+	 * we should first check for none. That takes care of a
+	 * a parallel hugetlb or THP pagefault moving none entries
+	 * to respective types.
+	 */
+	if (pgd_none(*pg))
+		return NULL;
+	else if (pgd_huge(*pg)) {
 		ret_pte = (pte_t *) pg;
 		goto out;
 	} else if (is_hugepd(pg))
 		hpdp = (hugepd_t *)pg;
-	else if (!pgd_none(*pg)) {
+	else {
 		pdshift = PUD_SHIFT;
 		pu = pud_offset(pg, ea);
 
-		if (pud_huge(*pu)) {
+		if (pud_none(*pu))
+			return NULL;
+		else if (pud_huge(*pu)) {
 			ret_pte = (pte_t *) pu;
 			goto out;
 		} else if (is_hugepd(pu))
 			hpdp = (hugepd_t *)pu;
-		else if (!pud_none(*pu)) {
+		else {
 			pdshift = PMD_SHIFT;
 			pm = pmd_offset(pu, ea);
+			/*
+			 * A hugepage collapse is captured by pmd_none, because
+			 * it mark the pmd none and do a hpte invalidate.
+			 *
+			 * A hugepage split is captured by pmd_trans_splitting
+			 * because we mark the pmd trans splitting and do a
+			 * hpte invalidate
+			 *
+			 */
+			if (pmd_none(*pm) || pmd_trans_splitting(*pm))
+				return NULL;
 
-			if (pmd_huge(*pm)) {
+			if (pmd_huge(*pm) || pmd_large(*pm)) {
 				ret_pte = (pte_t *) pm;
 				goto out;
 			} else if (is_hugepd(pm))
 				hpdp = (hugepd_t *)pm;
-			else if (!pmd_none(*pm))
+			else
 				return pte_offset_kernel(pm, ea);
 		}
 	}
-- 
1.8.1.2

^ permalink raw reply related

* [PATCH -V8 00/18] THP support for PPC64
From: Aneesh Kumar K.V @ 2013-05-12 16:14 UTC (permalink / raw)
  To: benh, paulus, dwg; +Cc: linuxppc-dev

Hi,

This is the second patchset needed to support THP on ppc64. Some of the changes
included in this series are tricky in that it changes the powerpc linux page table
walk subtly. We also overload few of the pte flags for ptes at PMD level (huge
page PTEs).

The below patches are also sent as a seperate patch series to linux-mm

[PATCH -V8 02/18] mm/THP: Use pmd_populate to update the pmd with
[PATCH -V8 03/18] mm/THP: Use the right function when updating access
[PATCH -V8 04/18] mm/THP: Add pmd args to pgtable deposit and
[PATCH -V8 05/18] mm/THP: withdraw the pgtable after pmdp related
[PATCH -V8 06/18] mm/THP: Don't use HPAGE_SHIFT in transparent
[PATCH -V8 07/18] mm/THP: deposit the transpare huge pgtable before

The below fixes is also sent to linuxppc-devel as an independent patch.
[PATCH -V8 01/18] powerpc/mm: Use the correct mask value when looking

I am adding the above patches in this series for easier testing.

Some numbers:

The latency measurements code from Anton  found at
http://ozlabs.org/~anton/junkcode/latency2001.c

64K page size (With THP support)
--------------------------
[root@llmp24l02 test]# ./latency2001 8G
 8589934592    428.49 cycles    120.50 ns
[root@llmp24l02 test]# ./latency2001 -l 8G
 8589934592    471.16 cycles    132.50 ns
[root@llmp24l02 test]# echo never > /sys/kernel/mm/transparent_hugepage/enabled 
[root@llmp24l02 test]# ./latency2001 8G
 8589934592    766.52 cycles    215.56 ns
[root@llmp24l02 test]# 

4K page size (No THP support for 4K)
----------------------------
[root@llmp24l02 test]# ./latency2001 8G
 8589934592    814.88 cycles    229.16 ns
[root@llmp24l02 test]# ./latency2001 -l 8G
 8589934592    463.69 cycles    130.40 ns
[root@llmp24l02 test]# 

We are close to hugetlbfs in latency and we can achieve this with zero
config/page reservation. Most of the allocations above are fault allocated.

Another test that does 50000000 random access over 1GB area goes from
2.65 seconds to 1.07 seconds with this patchset.

split_huge_page impact:
---------------------
To look at the performance impact of large page invalidate, I tried the below
experiment. The test involved, accessing a large contiguous region of memory
location as below

    for (i = 0; i < size; i += PAGE_SIZE)
	data[i] = i;

We wanted to access the data in sequential order so that we look at the
worst case THP performance. Accesing the data in sequential order implies
we have the Page table cached and overhead of TLB miss is as minimal as
possible. We also don't touch the entire page, because that can result in
cache evict.

After we touched the full range as above, we now call mprotect on each
of that page. A mprotect will result in a hugepage split. This should
allow us to measure the impact of hugepage split.

    for (i = 0; i < size; i += PAGE_SIZE)
	 mprotect(&data[i], PAGE_SIZE, PROT_READ);

Split hugepage impact: 
---------------------
THP enabled: 2.851561705 seconds for test completion
THP disable: 3.599146098 seconds for test completion

We are 20.7% better than non THP case even when we have all the large pages split.

Detailed output:

THP enabled:
---------------------------------------
[root@llmp24l02 ~]# cat /proc/vmstat  | grep thp
thp_fault_alloc 0
thp_fault_fallback 0
thp_collapse_alloc 0
thp_collapse_alloc_failed 0
thp_split 0
thp_zero_page_alloc 0
thp_zero_page_alloc_failed 0
[root@llmp24l02 ~]# /root/thp/tools/perf/perf stat -e page-faults,dTLB-load-misses ./split-huge-page-mpro 20G                                                                      
time taken to touch all the data in ns: 2763096913 

 Performance counter stats for './split-huge-page-mpro 20G':

	     1,581 page-faults                                                 
	     3,159 dTLB-load-misses                                            

       2.851561705 seconds time elapsed

[root@llmp24l02 ~]# 
[root@llmp24l02 ~]# cat /proc/vmstat  | grep thp
thp_fault_alloc 1279
thp_fault_fallback 0
thp_collapse_alloc 0
thp_collapse_alloc_failed 0
thp_split 1279
thp_zero_page_alloc 0
thp_zero_page_alloc_failed 0
[root@llmp24l02 ~]# 

    77.05%  split-huge-page  [kernel.kallsyms]     [k] .clear_user_page                        
     7.10%  split-huge-page  [kernel.kallsyms]     [k] .perf_event_mmap_ctx                    
     1.51%  split-huge-page  split-huge-page-mpro  [.] 0x0000000000000a70                      
     0.96%  split-huge-page  [unknown]             [H] 0x000000000157e3bc                      
     0.81%  split-huge-page  [kernel.kallsyms]     [k] .up_write                               
     0.76%  split-huge-page  [kernel.kallsyms]     [k] .perf_event_mmap                        
     0.76%  split-huge-page  [kernel.kallsyms]     [k] .down_write                             
     0.74%  split-huge-page  [kernel.kallsyms]     [k] .lru_add_page_tail                      
     0.61%  split-huge-page  [kernel.kallsyms]     [k] .split_huge_page                        
     0.59%  split-huge-page  [kernel.kallsyms]     [k] .change_protection                      
     0.51%  split-huge-page  [kernel.kallsyms]     [k] .release_pages                          

     0.96%  split-huge-page  [unknown]             [H] 0x000000000157e3bc                      
	    |          
	    |--79.44%-- reloc_start
	    |          |          
	    |          |--86.54%-- .__pSeries_lpar_hugepage_invalidate
	    |          |          .pSeries_lpar_hugepage_invalidate
	    |          |          .hpte_need_hugepage_flush
	    |          |          .split_huge_page
	    |          |          .__split_huge_page_pmd
	    |          |          .vma_adjust
	    |          |          .vma_merge
	    |          |          .mprotect_fixup
	    |          |          .SyS_mprotect

THP disabled:
---------------
[root@llmp24l02 ~]# echo never > /sys/kernel/mm/transparent_hugepage/enabled
[root@llmp24l02 ~]# /root/thp/tools/perf/perf stat -e page-faults,dTLB-load-misses ./split-huge-page-mpro 20G
time taken to touch all the data in ns: 3513767220 

 Performance counter stats for './split-huge-page-mpro 20G':

	  3,27,726 page-faults                                                 
	  3,29,654 dTLB-load-misses                                            

       3.599146098 seconds time elapsed

[root@llmp24l02 ~]#

Changes from V7:
* Address review feedback.
* mm/ patches also posted as a seperate series to linux-mm list
* Fixes for races against split/collapse and page table walk.
* Updated comments regarding locking details

Changes from V6:
* split the patch series into two patchset.
* Address review feedback.

Changes from V5:
* Address review comments
* Added new patch to not use hugepd for explcit hugepages. Explicit hugepaes
  now use PTE format similar to transparent hugepages.
* We don't use page->_mapcount for tracking free PTE frags in a PTE page.
* rebased to a86d52667d8eda5de39393ce737794403bdce1eb
* Tested with libhugetlbfs test suite

Changes from V4:
* Fix bad page error in page_table_alloc
  BUG: Bad page state in process stream  pfn:f1a59
  page:f0000000034dc378 count:1 mapcount:0 mapping:          (null) index:0x0
  [c000000f322c77d0] [c00000000015e198] .bad_page+0xe8/0x140
  [c000000f322c7860] [c00000000015e3c4] .free_pages_prepare+0x1d4/0x1e0
  [c000000f322c7910] [c000000000160450] .free_hot_cold_page+0x50/0x230
  [c000000f322c79c0] [c00000000003ad18] .page_table_alloc+0x168/0x1c0

Changes from V3:
* PowerNV boot fixes

Change from V2:
* Change patch "powerpc: Reduce PTE table memory wastage" to use much simpler approach
  for PTE page sharing.
* Changes to handle huge pages in KVM code.
* Address other review comments

Changes from V1
* Address review comments
* More patch split
* Add batch hpte invalidate for hugepages.

Changes from RFC V2:
* Address review comments
* More code cleanup and patch split

Changes from RFC V1:
* HugeTLB fs now works
* Compile issues fixed
* rebased to v3.8
* Patch series reorded so that ppc64 cleanups and MM THP changes are moved
  early in the series. This should help in picking those patches early.

Thanks,
-aneesh

^ permalink raw reply

* [PATCH -V8 12/18] powerpc: Replace find_linux_pte with find_linux_pte_or_hugepte
From: Aneesh Kumar K.V @ 2013-05-12 16:14 UTC (permalink / raw)
  To: benh, paulus, dwg; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1368375265-19593-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Replace find_linux_pte with find_linux_pte_or_hugepte and explicitly
document why we don't need to handle transparent hugepages at callsites.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pgtable-ppc64.h | 24 ------------------------
 arch/powerpc/kernel/io-workarounds.c     | 10 ++++++++--
 arch/powerpc/kvm/book3s_hv_rm_mmu.c      |  2 +-
 arch/powerpc/mm/hash_utils_64.c          |  8 +++++++-
 arch/powerpc/mm/hugetlbpage.c            |  8 ++++++--
 arch/powerpc/mm/tlb_hash64.c             |  9 +++++++--
 arch/powerpc/platforms/pseries/eeh.c     |  7 ++++++-
 7 files changed, 35 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 74ce13b..69cb52b 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -343,30 +343,6 @@ static inline void __ptep_set_access_flags(pte_t *ptep, pte_t entry)
 
 void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
 void pgtable_cache_init(void);
-
-/*
- * find_linux_pte returns the address of a linux pte for a given
- * effective address and directory.  If not found, it returns zero.
- */
-static inline pte_t *find_linux_pte(pgd_t *pgdir, unsigned long ea)
-{
-	pgd_t *pg;
-	pud_t *pu;
-	pmd_t *pm;
-	pte_t *pt = NULL;
-
-	pg = pgdir + pgd_index(ea);
-	if (!pgd_none(*pg)) {
-		pu = pud_offset(pg, ea);
-		if (!pud_none(*pu)) {
-			pm = pmd_offset(pu, ea);
-			if (pmd_present(*pm))
-				pt = pte_offset_kernel(pm, ea);
-		}
-	}
-	return pt;
-}
-
 #endif /* __ASSEMBLY__ */
 
 /*
diff --git a/arch/powerpc/kernel/io-workarounds.c b/arch/powerpc/kernel/io-workarounds.c
index 50e90b7..e5263ab 100644
--- a/arch/powerpc/kernel/io-workarounds.c
+++ b/arch/powerpc/kernel/io-workarounds.c
@@ -55,6 +55,7 @@ static struct iowa_bus *iowa_pci_find(unsigned long vaddr, unsigned long paddr)
 
 struct iowa_bus *iowa_mem_find_bus(const PCI_IO_ADDR addr)
 {
+	unsigned shift;
 	struct iowa_bus *bus;
 	int token;
 
@@ -70,11 +71,16 @@ struct iowa_bus *iowa_mem_find_bus(const PCI_IO_ADDR addr)
 		if (vaddr < PHB_IO_BASE || vaddr >= PHB_IO_END)
 			return NULL;
 
-		ptep = find_linux_pte(init_mm.pgd, vaddr);
+		ptep = find_linux_pte_or_hugepte(init_mm.pgd, vaddr, &shift);
 		if (ptep == NULL)
 			paddr = 0;
-		else
+		else {
+			/*
+			 * we don't have hugepages backing iomem
+			 */
+			BUG_ON(shift);
 			paddr = pte_pfn(*ptep) << PAGE_SHIFT;
+		}
 		bus = iowa_pci_find(vaddr, paddr);
 
 		if (bus == NULL)
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 6dcbb49..dcf892d 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -27,7 +27,7 @@ static void *real_vmalloc_addr(void *x)
 	unsigned long addr = (unsigned long) x;
 	pte_t *p;
 
-	p = find_linux_pte(swapper_pg_dir, addr);
+	p = find_linux_pte_or_hugepte(swapper_pg_dir, addr, NULL);
 	if (!p || !pte_present(*p))
 		return NULL;
 	/* assume we don't have huge pages in vmalloc space... */
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 88ac0ee..49b9cb7 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1131,6 +1131,7 @@ EXPORT_SYMBOL_GPL(hash_page);
 void hash_preload(struct mm_struct *mm, unsigned long ea,
 		  unsigned long access, unsigned long trap)
 {
+	int shift;
 	unsigned long vsid;
 	pgd_t *pgdir;
 	pte_t *ptep;
@@ -1152,10 +1153,15 @@ void hash_preload(struct mm_struct *mm, unsigned long ea,
 	pgdir = mm->pgd;
 	if (pgdir == NULL)
 		return;
-	ptep = find_linux_pte(pgdir, ea);
+	/*
+	 * THP pages use update_mmu_cache_pmd. We don't do
+	 * hash preload there. Hence can ignore THP here
+	 */
+	ptep = find_linux_pte_or_hugepte(pgdir, ea, &shift);
 	if (!ptep)
 		return;
 
+	BUG_ON(shift);
 #ifdef CONFIG_PPC_64K_PAGES
 	/* If either _PAGE_4K_PFN or _PAGE_NO_CACHE is set (and we are on
 	 * a 64K kernel), then we don't preload, hash_page() will take
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 4928204..8add580 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -105,6 +105,7 @@ int pgd_huge(pgd_t pgd)
 
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 {
+	/* Only called for hugetlbfs pages, hence can ignore THP */
 	return find_linux_pte_or_hugepte(mm->pgd, addr, NULL);
 }
 
@@ -673,11 +674,14 @@ follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
 	struct page *page;
 	unsigned shift;
 	unsigned long mask;
-
+	/*
+	 * Transparent hugepages are handled by generic code. We can skip them
+	 * here.
+	 */
 	ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
 
 	/* Verify it is a huge page else bail. */
-	if (!ptep || !shift)
+	if (!ptep || !shift || pmd_trans_huge(*(pmd_t *)ptep))
 		return ERR_PTR(-EINVAL);
 
 	mask = (1UL << shift) - 1;
diff --git a/arch/powerpc/mm/tlb_hash64.c b/arch/powerpc/mm/tlb_hash64.c
index fe07f57..089afb1 100644
--- a/arch/powerpc/mm/tlb_hash64.c
+++ b/arch/powerpc/mm/tlb_hash64.c
@@ -189,6 +189,7 @@ void tlb_flush(struct mmu_gather *tlb)
 void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
 			      unsigned long end)
 {
+	int hugepage_shift;
 	unsigned long flags;
 
 	start = _ALIGN_DOWN(start, PAGE_SIZE);
@@ -206,7 +207,8 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
 	local_irq_save(flags);
 	arch_enter_lazy_mmu_mode();
 	for (; start < end; start += PAGE_SIZE) {
-		pte_t *ptep = find_linux_pte(mm->pgd, start);
+		pte_t *ptep = find_linux_pte_or_hugepte(mm->pgd, start,
+							&hugepage_shift);
 		unsigned long pte;
 
 		if (ptep == NULL)
@@ -214,7 +216,10 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
 		pte = pte_val(*ptep);
 		if (!(pte & _PAGE_HASHPTE))
 			continue;
-		hpte_need_flush(mm, start, ptep, pte, 0);
+		if (unlikely(hugepage_shift && pmd_trans_huge(*(pmd_t *)pte)))
+			hpte_do_hugepage_flush(mm, start, (pmd_t *)pte);
+		else
+			hpte_need_flush(mm, start, ptep, pte, 0);
 	}
 	arch_leave_lazy_mmu_mode();
 	local_irq_restore(flags);
diff --git a/arch/powerpc/platforms/pseries/eeh.c b/arch/powerpc/platforms/pseries/eeh.c
index 6b73d6c..d2e76d2 100644
--- a/arch/powerpc/platforms/pseries/eeh.c
+++ b/arch/powerpc/platforms/pseries/eeh.c
@@ -258,12 +258,17 @@ void eeh_slot_error_detail(struct eeh_pe *pe, int severity)
  */
 static inline unsigned long eeh_token_to_phys(unsigned long token)
 {
+	int shift;
 	pte_t *ptep;
 	unsigned long pa;
 
-	ptep = find_linux_pte(init_mm.pgd, token);
+	/*
+	 * We won't find hugepages here, iomem
+	 */
+	ptep = find_linux_pte_or_hugepte(init_mm.pgd, token, &shift);
 	if (!ptep)
 		return token;
+	BUG_ON(shift);
 	pa = pte_pfn(*ptep) << PAGE_SHIFT;
 
 	return pa | (token & (PAGE_SIZE-1));
-- 
1.8.1.2

^ permalink raw reply related

* [PATCH -V8 05/18] mm/THP: withdraw the pgtable after pmdp related operations
From: Aneesh Kumar K.V @ 2013-05-12 16:14 UTC (permalink / raw)
  To: benh, paulus, dwg; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1368375265-19593-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

For architectures like ppc64 we look at deposited pgtable when
calling pmdp_get_and_clear. So do the pgtable_trans_huge_withdraw
after finishing pmdp related operations.

Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 mm/huge_memory.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 69f4153..5d99b2f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1360,9 +1360,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		struct page *page;
 		pgtable_t pgtable;
 		pmd_t orig_pmd;
-		pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
+		/*
+		 * For architectures like ppc64 we look at deposited pgtable
+		 * when calling pmdp_get_and_clear. So do the
+		 * pgtable_trans_huge_withdraw after finishing pmdp related
+		 * operations.
+		 */
 		orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd);
 		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+		pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
 		if (is_huge_zero_pmd(orig_pmd)) {
 			tlb->mm->nr_ptes--;
 			spin_unlock(&tlb->mm->page_table_lock);
-- 
1.8.1.2

^ permalink raw reply related

* [PATCH -V8 07/18] mm/THP: deposit the transpare huge pgtable before set_pmd
From: Aneesh Kumar K.V @ 2013-05-12 16:14 UTC (permalink / raw)
  To: benh, paulus, dwg; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1368375265-19593-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Architectures like powerpc use the deposited pgtable to store
hash index values. We need to make the deposted pgtable visible
to other cpus before we are ready to take a hash fault.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 mm/huge_memory.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5d99b2f..cc47b29 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -729,8 +729,8 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		pmd_t entry;
 		entry = mk_huge_pmd(page, vma);
 		page_add_new_anon_rmap(page, vma, haddr);
-		set_pmd_at(mm, haddr, pmd, entry);
 		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		set_pmd_at(mm, haddr, pmd, entry);
 		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		mm->nr_ptes++;
 		spin_unlock(&mm->page_table_lock);
@@ -771,8 +771,8 @@ static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
 	entry = mk_pmd(zero_page, vma->vm_page_prot);
 	entry = pmd_wrprotect(entry);
 	entry = pmd_mkhuge(entry);
-	set_pmd_at(mm, haddr, pmd, entry);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
+	set_pmd_at(mm, haddr, pmd, entry);
 	mm->nr_ptes++;
 	return true;
 }
@@ -916,8 +916,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
 	pmd = pmd_mkold(pmd_wrprotect(pmd));
-	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
 	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
 	dst_mm->nr_ptes++;
 
 	ret = 0;
@@ -2362,9 +2362,9 @@ static void collapse_huge_page(struct mm_struct *mm,
 	spin_lock(&mm->page_table_lock);
 	BUG_ON(!pmd_none(*pmd));
 	page_add_new_anon_rmap(new_page, vma, address);
+	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, address, pmd, _pmd);
 	update_mmu_cache_pmd(vma, address, pmd);
-	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	spin_unlock(&mm->page_table_lock);
 
 	*hpage = NULL;
-- 
1.8.1.2

^ permalink raw reply related

* [PATCH -V8 16/18] powerpc/THP: Enable THP on PPC64
From: Aneesh Kumar K.V @ 2013-05-12 16:14 UTC (permalink / raw)
  To: benh, paulus, dwg; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1368375265-19593-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

We enable only if the we support 16MB page size.

Reviewed-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pgtable-ppc64.h |  3 +--
 arch/powerpc/mm/pgtable_64.c             | 29 +++++++++++++++++++++++++++++
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 69cb52b..d046289 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -429,8 +429,7 @@ static inline int pmd_trans_splitting(pmd_t pmd)
 	return 0;
 }
 
-/* We will enable it in the last patch */
-#define has_transparent_hugepage() 0
+extern int has_transparent_hugepage(void);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 static inline pte_t pmd_pte(pmd_t pmd)
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 2ff92aa..1abf9cd 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -818,4 +818,33 @@ pmd_t pmdp_get_and_clear(struct mm_struct *mm,
 	memset(pgtable, 0, PTE_FRAG_SIZE);
 	return old_pmd;
 }
+
+int has_transparent_hugepage(void)
+{
+	if (!mmu_has_feature(MMU_FTR_16M_PAGE))
+		return 0;
+	/*
+	 * We support THP only if PMD_SIZE is 16MB.
+	 */
+	if (mmu_psize_defs[MMU_PAGE_16M].shift != PMD_SHIFT)
+		return 0;
+	/*
+	 * We need to make sure that we support 16MB hugepage in a segement
+	 * with base page size 64K or 4K. We only enable THP with a PAGE_SIZE
+	 * of 64K.
+	 */
+	/*
+	 * If we have 64K HPTE, we will be using that by default
+	 */
+	if (mmu_psize_defs[MMU_PAGE_64K].shift &&
+	    (mmu_psize_defs[MMU_PAGE_64K].penc[MMU_PAGE_16M] == -1))
+		return 0;
+	/*
+	 * Ok we only have 4K HPTE
+	 */
+	if (mmu_psize_defs[MMU_PAGE_4K].penc[MMU_PAGE_16M] == -1)
+		return 0;
+
+	return 1;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-- 
1.8.1.2

^ permalink raw reply related

* [PATCH -V8 14/18] powerpc/THP: Add code to handle HPTE faults for hugepages
From: Aneesh Kumar K.V @ 2013-05-12 16:14 UTC (permalink / raw)
  To: benh, paulus, dwg; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1368375265-19593-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

The deposted PTE page in the second half of the PMD table is used to
track the state on hash PTEs. After updating the HPTE, we mark the
coresponding slot in the deposted PTE page valid.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/mmu-hash64.h |  13 +++
 arch/powerpc/mm/Makefile              |   1 +
 arch/powerpc/mm/hash_utils_64.c       |  13 ++-
 arch/powerpc/mm/hugepage-hash64.c     | 176 ++++++++++++++++++++++++++++++++++
 4 files changed, 199 insertions(+), 4 deletions(-)
 create mode 100644 arch/powerpc/mm/hugepage-hash64.c

diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h
index 2accc96..3d6fbb0 100644
--- a/arch/powerpc/include/asm/mmu-hash64.h
+++ b/arch/powerpc/include/asm/mmu-hash64.h
@@ -340,6 +340,19 @@ extern int hash_page(unsigned long ea, unsigned long access, unsigned long trap)
 int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
 		     pte_t *ptep, unsigned long trap, int local, int ssize,
 		     unsigned int shift, unsigned int mmu_psize);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern int __hash_page_thp(unsigned long ea, unsigned long access,
+			   unsigned long vsid, pmd_t *pmdp, unsigned long trap,
+			   int local, int ssize, unsigned int psize);
+#else
+static inline int __hash_page_thp(unsigned long ea, unsigned long access,
+				  unsigned long vsid, pmd_t *pmdp,
+				  unsigned long trap, int local,
+				  int ssize, unsigned int psize)
+{
+	BUG();
+}
+#endif
 extern void hash_failure_debug(unsigned long ea, unsigned long access,
 			       unsigned long vsid, unsigned long trap,
 			       int ssize, int psize, int lpsize,
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index fde36e6..87671eb 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -33,6 +33,7 @@ ifeq ($(CONFIG_HUGETLB_PAGE),y)
 obj-$(CONFIG_PPC_STD_MMU_64)	+= hugetlbpage-hash64.o
 obj-$(CONFIG_PPC_BOOK3E_MMU)	+= hugetlbpage-book3e.o
 endif
+obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += hugepage-hash64.o
 obj-$(CONFIG_PPC_SUBPAGE_PROT)	+= subpage-prot.o
 obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o
 obj-$(CONFIG_HIGHMEM)		+= highmem.o
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 49b9cb7..693ff94 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1041,11 +1041,16 @@ int hash_page(unsigned long ea, unsigned long access, unsigned long trap)
 		return 1;
 	}
 
+	if (hugeshift) {
+		if (pmd_trans_huge(*(pmd_t *)ptep))
+			return __hash_page_thp(ea, access, vsid, (pmd_t *)ptep,
+					       trap, local, ssize, psize);
 #ifdef CONFIG_HUGETLB_PAGE
-	if (hugeshift)
-		return __hash_page_huge(ea, access, vsid, ptep, trap, local,
-					ssize, hugeshift, psize);
-#endif /* CONFIG_HUGETLB_PAGE */
+		else
+			return __hash_page_huge(ea, access, vsid, ptep, trap,
+						local, ssize, hugeshift, psize);
+#endif
+	}
 
 #ifndef CONFIG_PPC_64K_PAGES
 	DBG_LOW(" i-pte: %016lx\n", pte_val(*ptep));
diff --git a/arch/powerpc/mm/hugepage-hash64.c b/arch/powerpc/mm/hugepage-hash64.c
new file mode 100644
index 0000000..3712e80
--- /dev/null
+++ b/arch/powerpc/mm/hugepage-hash64.c
@@ -0,0 +1,176 @@
+/*
+ * Copyright IBM Corporation, 2013
+ * Author Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ */
+
+/*
+ * PPC64 THP Support for hash based MMUs
+ */
+#include <linux/mm.h>
+#include <asm/machdep.h>
+
+int __hash_page_thp(unsigned long ea, unsigned long access, unsigned long vsid,
+		    pmd_t *pmdp, unsigned long trap, int local, int ssize,
+		    unsigned int psize)
+{
+	unsigned int index, valid;
+	unsigned char *hpte_slot_array;
+	unsigned long rflags, pa, hidx;
+	unsigned long old_pmd, new_pmd;
+	int ret, lpsize = MMU_PAGE_16M;
+	unsigned long vpn, hash, shift, slot;
+
+	/*
+	 * atomically mark the linux large page PMD busy and dirty
+	 */
+	do {
+		old_pmd = pmd_val(*pmdp);
+		/* If PMD busy, retry the access */
+		if (unlikely(old_pmd & _PAGE_BUSY))
+			return 0;
+		/* If PMD permissions don't match, take page fault */
+		if (unlikely(access & ~old_pmd))
+			return 1;
+		/*
+		 * Try to lock the PTE, add ACCESSED and DIRTY if it was
+		 * a write access
+		 */
+		new_pmd = old_pmd | _PAGE_BUSY | _PAGE_ACCESSED;
+		if (access & _PAGE_RW)
+			new_pmd |= _PAGE_DIRTY;
+	} while (old_pmd != __cmpxchg_u64((unsigned long *)pmdp,
+					  old_pmd, new_pmd));
+	/*
+	 * PP bits. _PAGE_USER is already PP bit 0x2, so we only
+	 * need to add in 0x1 if it's a read-only user page
+	 */
+	rflags = new_pmd & _PAGE_USER;
+	if ((new_pmd & _PAGE_USER) && !((new_pmd & _PAGE_RW) &&
+					   (new_pmd & _PAGE_DIRTY)))
+		rflags |= 0x1;
+	/*
+	 * _PAGE_EXEC -> HW_NO_EXEC since it's inverted
+	 */
+	rflags |= ((new_pmd & _PAGE_EXEC) ? 0 : HPTE_R_N);
+
+#if 0
+	if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE)) {
+
+		/*
+		 * No CPU has hugepages but lacks no execute, so we
+		 * don't need to worry about that case
+		 */
+		rflags = hash_page_do_lazy_icache(rflags, __pte(old_pte), trap);
+	}
+#endif
+	/*
+	 * Find the slot index details for this ea, using base page size.
+	 */
+	shift = mmu_psize_defs[psize].shift;
+	index = (ea & ~HPAGE_PMD_MASK) >> shift;
+	BUG_ON(index >= 4096);
+
+	vpn = hpt_vpn(ea, vsid, ssize);
+	hash = hpt_hash(vpn, shift, ssize);
+	/*
+	 * The hpte hindex are stored in the pgtable whose address is in the
+	 * second half of the PMD
+	 */
+	hpte_slot_array = *(char **)(pmdp + PTRS_PER_PMD);
+
+	valid = hpte_valid(hpte_slot_array, index);
+	if (valid) {
+		/* update the hpte bits */
+		hidx =  hpte_hash_index(hpte_slot_array, index);
+		if (hidx & _PTEIDX_SECONDARY)
+			hash = ~hash;
+		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
+		slot += hidx & _PTEIDX_GROUP_IX;
+
+		ret = ppc_md.hpte_updatepp(slot, rflags, vpn,
+					   psize, ssize, local);
+		/*
+		 * We failed to update, try to insert a new entry.
+		 */
+		if (ret == -1) {
+			/*
+			 * large pte is marked busy, so we can be sure
+			 * nobody is looking at hpte_slot_array. hence we can
+			 * safely update this here.
+			 */
+			valid = 0;
+			new_pmd &= ~_PAGE_HPTEFLAGS;
+			hpte_slot_array[index] = 0;
+		} else
+			/* clear the busy bits and set the hash pte bits */
+			new_pmd = (new_pmd & ~_PAGE_HPTEFLAGS) | _PAGE_HASHPTE;
+	}
+
+	if (!valid) {
+		unsigned long hpte_group;
+
+		/* insert new entry */
+		pa = pmd_pfn(__pmd(old_pmd)) << PAGE_SHIFT;
+repeat:
+		hpte_group = ((hash & htab_hash_mask) * HPTES_PER_GROUP) & ~0x7UL;
+
+		/* clear the busy bits and set the hash pte bits */
+		new_pmd = (new_pmd & ~_PAGE_HPTEFLAGS) | _PAGE_HASHPTE;
+
+		/* Add in WIMG bits */
+		rflags |= (new_pmd & (_PAGE_WRITETHRU | _PAGE_NO_CACHE |
+				      _PAGE_COHERENT | _PAGE_GUARDED));
+
+		/* Insert into the hash table, primary slot */
+		slot = ppc_md.hpte_insert(hpte_group, vpn, pa, rflags, 0,
+					  psize, lpsize, ssize);
+		/*
+		 * Primary is full, try the secondary
+		 */
+		if (unlikely(slot == -1)) {
+			hpte_group = ((~hash & htab_hash_mask) *
+				      HPTES_PER_GROUP) & ~0x7UL;
+			slot = ppc_md.hpte_insert(hpte_group, vpn, pa,
+						  rflags, HPTE_V_SECONDARY,
+						  psize, lpsize, ssize);
+			if (slot == -1) {
+				if (mftb() & 0x1)
+					hpte_group = ((hash & htab_hash_mask) *
+						      HPTES_PER_GROUP) & ~0x7UL;
+
+				ppc_md.hpte_remove(hpte_group);
+				goto repeat;
+			}
+		}
+		/*
+		 * Hypervisor failure. Restore old pmd and return -1
+		 * similar to __hash_page_*
+		 */
+		if (unlikely(slot == -2)) {
+			*pmdp = __pmd(old_pmd);
+			hash_failure_debug(ea, access, vsid, trap, ssize,
+					   psize, lpsize, old_pmd);
+			return -1;
+		}
+		/*
+		 * large pte is marked busy, so we can be sure
+		 * nobody is looking at hpte_slot_array. hence we can
+		 * safely update this here.
+		 */
+		mark_hpte_slot_valid(hpte_slot_array, index, slot);
+	}
+	/*
+	 * No need to use ldarx/stdcx here
+	 */
+	*pmdp = __pmd(new_pmd & ~_PAGE_BUSY);
+	return 0;
+}
-- 
1.8.1.2

^ permalink raw reply related

* [PATCH -V8 08/18] powerpc/THP: Double the PMD table size for THP
From: Aneesh Kumar K.V @ 2013-05-12 16:14 UTC (permalink / raw)
  To: benh, paulus, dwg; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1368375265-19593-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

THP code does PTE page allocation along with large page request and deposit them
for later use. This is to ensure that we won't have any failures when we split
hugepages to regular pages.

On powerpc we want to use the deposited PTE page for storing hash pte slot and
secondary bit information for the HPTEs. We use the second half
of the pmd table to save the deposted PTE page.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pgalloc-64.h        | 6 +++---
 arch/powerpc/include/asm/pgtable-ppc64-64k.h | 3 ++-
 arch/powerpc/include/asm/pgtable-ppc64.h     | 6 +++++-
 arch/powerpc/mm/init_64.c                    | 9 ++++++---
 4 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/pgalloc-64.h b/arch/powerpc/include/asm/pgalloc-64.h
index b66ae72..f65e27b 100644
--- a/arch/powerpc/include/asm/pgalloc-64.h
+++ b/arch/powerpc/include/asm/pgalloc-64.h
@@ -221,17 +221,17 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return kmem_cache_alloc(PGT_CACHE(PMD_INDEX_SIZE),
+	return kmem_cache_alloc(PGT_CACHE(PMD_CACHE_INDEX),
 				GFP_KERNEL|__GFP_REPEAT);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 {
-	kmem_cache_free(PGT_CACHE(PMD_INDEX_SIZE), pmd);
+	kmem_cache_free(PGT_CACHE(PMD_CACHE_INDEX), pmd);
 }
 
 #define __pmd_free_tlb(tlb, pmd, addr)		      \
-	pgtable_free_tlb(tlb, pmd, PMD_INDEX_SIZE)
+	pgtable_free_tlb(tlb, pmd, PMD_CACHE_INDEX)
 #ifndef CONFIG_PPC_64K_PAGES
 #define __pud_free_tlb(tlb, pud, addr)		      \
 	pgtable_free_tlb(tlb, pud, PUD_INDEX_SIZE)
diff --git a/arch/powerpc/include/asm/pgtable-ppc64-64k.h b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
index 45142d6..a56b82f 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64-64k.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
@@ -33,7 +33,8 @@
 #define PGDIR_MASK	(~(PGDIR_SIZE-1))
 
 /* Bits to mask out from a PMD to get to the PTE page */
-#define PMD_MASKED_BITS		0x1ff
+/* PMDs point to PTE table fragments which are 4K aligned.  */
+#define PMD_MASKED_BITS		0xfff
 /* Bits to mask out from a PGD/PUD to get to the PMD page */
 #define PUD_MASKED_BITS		0x1ff
 
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index e3d55f6f..ab84332 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -20,7 +20,11 @@
                 	    PUD_INDEX_SIZE + PGD_INDEX_SIZE + PAGE_SHIFT)
 #define PGTABLE_RANGE (ASM_CONST(1) << PGTABLE_EADDR_SIZE)
 
-
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define PMD_CACHE_INDEX	(PMD_INDEX_SIZE + 1)
+#else
+#define PMD_CACHE_INDEX	PMD_INDEX_SIZE
+#endif
 /*
  * Define the address range of the kernel non-linear virtual area
  */
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index a90b9c4..d0cd9e4 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -88,7 +88,11 @@ static void pgd_ctor(void *addr)
 
 static void pmd_ctor(void *addr)
 {
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	memset(addr, 0, PMD_TABLE_SIZE * 2);
+#else
 	memset(addr, 0, PMD_TABLE_SIZE);
+#endif
 }
 
 struct kmem_cache *pgtable_cache[MAX_PGTABLE_INDEX_SIZE];
@@ -137,10 +141,9 @@ void pgtable_cache_add(unsigned shift, void (*ctor)(void *))
 void pgtable_cache_init(void)
 {
 	pgtable_cache_add(PGD_INDEX_SIZE, pgd_ctor);
-	pgtable_cache_add(PMD_INDEX_SIZE, pmd_ctor);
-	if (!PGT_CACHE(PGD_INDEX_SIZE) || !PGT_CACHE(PMD_INDEX_SIZE))
+	pgtable_cache_add(PMD_CACHE_INDEX, pmd_ctor);
+	if (!PGT_CACHE(PGD_INDEX_SIZE) || !PGT_CACHE(PMD_CACHE_INDEX))
 		panic("Couldn't allocate pgtable caches");
-
 	/* In all current configs, when the PUD index exists it's the
 	 * same size as either the pgd or pmd index.  Verify that the
 	 * initialization above has also created a PUD cache.  This
-- 
1.8.1.2

^ permalink raw reply related

* [PATCH -V8 01/18] powerpc/mm: Use the correct mask value when looking at pgtable address
From: Aneesh Kumar K.V @ 2013-05-12 16:14 UTC (permalink / raw)
  To: benh, paulus, dwg; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1368375265-19593-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Our pgtable are 2*sizeof(pte_t)*PTRS_PER_PTE which is PTE_FRAG_SIZE.
Instead of depending on frag size, mask with PMD_MASKED_BITS.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pgalloc-64.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/pgalloc-64.h b/arch/powerpc/include/asm/pgalloc-64.h
index 91acb12..b66ae72 100644
--- a/arch/powerpc/include/asm/pgalloc-64.h
+++ b/arch/powerpc/include/asm/pgalloc-64.h
@@ -186,7 +186,7 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 
 static inline pgtable_t pmd_pgtable(pmd_t pmd)
 {
-	return (pgtable_t)(pmd_val(pmd) & -sizeof(pte_t)*PTRS_PER_PTE);
+	return (pgtable_t)(pmd_val(pmd) & ~PMD_MASKED_BITS);
 }
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
-- 
1.8.1.2

^ permalink raw reply related

* [PATCH -V8 17/18] powerpc: Optimize hugepage invalidate
From: Aneesh Kumar K.V @ 2013-05-12 16:14 UTC (permalink / raw)
  To: benh, paulus, dwg; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1368375265-19593-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Hugepage invalidate involves invalidating multiple hpte entries.
Optimize the operation using H_BULK_REMOVE on lpar platforms.
On native, reduce the number of tlb flush.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/machdep.h    |   3 +
 arch/powerpc/mm/hash_native_64.c      |  74 +++++++++++++++++++++
 arch/powerpc/mm/pgtable_64.c          |  12 +++-
 arch/powerpc/platforms/pseries/lpar.c | 122 ++++++++++++++++++++++++++++++++--
 4 files changed, 202 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 92386fc..9c5a3ad 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -57,6 +57,9 @@ struct machdep_calls {
 	void            (*hpte_removebolted)(unsigned long ea,
 					     int psize, int ssize);
 	void		(*flush_hash_range)(unsigned long number, int local);
+	void		(*hugepage_invalidate)(struct mm_struct *mm,
+					       unsigned char *hpte_slot_array,
+					       unsigned long addr, int psize);
 
 	/* special for kexec, to be called in real mode, linear mapping is
 	 * destroyed as well */
diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
index 6a2aead..4a92f14 100644
--- a/arch/powerpc/mm/hash_native_64.c
+++ b/arch/powerpc/mm/hash_native_64.c
@@ -455,6 +455,79 @@ static void native_hpte_invalidate(unsigned long slot, unsigned long vpn,
 	local_irq_restore(flags);
 }
 
+static void native_hugepage_invalidate(struct mm_struct *mm,
+				       unsigned char *hpte_slot_array,
+				       unsigned long addr, int psize)
+{
+	int ssize = 0, i;
+	int lock_tlbie;
+	struct hash_pte *hptep;
+	int actual_psize = MMU_PAGE_16M;
+	unsigned int max_hpte_count, valid;
+	unsigned long flags, s_addr = addr;
+	unsigned long hpte_v, want_v, shift;
+	unsigned long hidx, vpn = 0, vsid, hash, slot;
+
+	shift = mmu_psize_defs[psize].shift;
+	max_hpte_count = HPAGE_PMD_SIZE >> shift;
+
+	local_irq_save(flags);
+	for (i = 0; i < max_hpte_count; i++) {
+		valid = hpte_valid(hpte_slot_array, i);
+		if (!valid)
+			continue;
+		hidx =  hpte_hash_index(hpte_slot_array, i);
+
+		/* get the vpn */
+		addr = s_addr + (i * (1ul << shift));
+		if (!is_kernel_addr(addr)) {
+			ssize = user_segment_size(addr);
+			vsid = get_vsid(mm->context.id, addr, ssize);
+			WARN_ON(vsid == 0);
+		} else {
+			vsid = get_kernel_vsid(addr, mmu_kernel_ssize);
+			ssize = mmu_kernel_ssize;
+		}
+
+		vpn = hpt_vpn(addr, vsid, ssize);
+		hash = hpt_hash(vpn, shift, ssize);
+		if (hidx & _PTEIDX_SECONDARY)
+			hash = ~hash;
+
+		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
+		slot += hidx & _PTEIDX_GROUP_IX;
+
+		hptep = htab_address + slot;
+		want_v = hpte_encode_avpn(vpn, psize, ssize);
+		native_lock_hpte(hptep);
+		hpte_v = hptep->v;
+
+		/* Even if we miss, we need to invalidate the TLB */
+		if (!HPTE_V_COMPARE(hpte_v, want_v) || !(hpte_v & HPTE_V_VALID))
+			native_unlock_hpte(hptep);
+		else
+			/* Invalidate the hpte. NOTE: this also unlocks it */
+			hptep->v = 0;
+	}
+	/*
+	 * Since this is a hugepage, we just need a single tlbie.
+	 * use the last vpn.
+	 */
+	lock_tlbie = !mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE);
+	if (lock_tlbie)
+		raw_spin_lock(&native_tlbie_lock);
+
+	asm volatile("ptesync":::"memory");
+	__tlbie(vpn, psize, actual_psize, ssize);
+	asm volatile("eieio; tlbsync; ptesync":::"memory");
+
+	if (lock_tlbie)
+		raw_spin_unlock(&native_tlbie_lock);
+
+	local_irq_restore(flags);
+}
+
+
 static void hpte_decode(struct hash_pte *hpte, unsigned long slot,
 			int *psize, int *apsize, int *ssize, unsigned long *vpn)
 {
@@ -658,4 +731,5 @@ void __init hpte_init_native(void)
 	ppc_md.hpte_remove	= native_hpte_remove;
 	ppc_md.hpte_clear_all	= native_hpte_clear;
 	ppc_md.flush_hash_range = native_flush_hash_range;
+	ppc_md.hugepage_invalidate   = native_hugepage_invalidate;
 }
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 1abf9cd..8dd7c83 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -692,6 +692,7 @@ void hpte_do_hugepage_flush(struct mm_struct *mm, unsigned long addr,
 {
 	int ssize, i;
 	unsigned long s_addr;
+	int max_hpte_count;
 	unsigned int psize, valid;
 	unsigned char *hpte_slot_array;
 	unsigned long hidx, vpn, vsid, hash, shift, slot;
@@ -715,9 +716,16 @@ void hpte_do_hugepage_flush(struct mm_struct *mm, unsigned long addr,
 
 	/* get the base page size */
 	psize = get_slice_psize(mm, s_addr);
-	shift = mmu_psize_defs[psize].shift;
 
-	for (i = 0; i < (HPAGE_PMD_SIZE >> shift); i++) {
+	if (ppc_md.hugepage_invalidate)
+		return ppc_md.hugepage_invalidate(mm, hpte_slot_array,
+						  s_addr, psize);
+	/*
+	 * No bluk hpte removal support, invalidate each entry
+	 */
+	shift = mmu_psize_defs[psize].shift;
+	max_hpte_count = HPAGE_PMD_SIZE >> shift;
+	for (i = 0; i < max_hpte_count; i++) {
 		/*
 		 * 8 bits per each hpte entries
 		 * 000| [ secondary group (one bit) | hidx (3 bits) | valid bit]
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 6d62072..aa098c7 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -45,6 +45,13 @@
 #include "plpar_wrappers.h"
 #include "pseries.h"
 
+/* Flag bits for H_BULK_REMOVE */
+#define HBR_REQUEST	0x4000000000000000UL
+#define HBR_RESPONSE	0x8000000000000000UL
+#define HBR_END		0xc000000000000000UL
+#define HBR_AVPN	0x0200000000000000UL
+#define HBR_ANDCOND	0x0100000000000000UL
+
 
 /* in hvCall.S */
 EXPORT_SYMBOL(plpar_hcall);
@@ -345,6 +352,113 @@ static void pSeries_lpar_hpte_invalidate(unsigned long slot, unsigned long vpn,
 	BUG_ON(lpar_rc != H_SUCCESS);
 }
 
+/*
+ * Limit iterations holding pSeries_lpar_tlbie_lock to 3. We also need
+ * to make sure that we avoid bouncing the hypervisor tlbie lock.
+ */
+#define PPC64_HUGE_HPTE_BATCH 12
+
+static void __pSeries_lpar_hugepage_invalidate(unsigned long *slot,
+					     unsigned long *vpn, int count,
+					     int psize, int ssize)
+{
+	unsigned long param[8];
+	int i = 0, pix = 0, rc;
+	unsigned long flags = 0;
+	int lock_tlbie = !mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE);
+
+	if (lock_tlbie)
+		spin_lock_irqsave(&pSeries_lpar_tlbie_lock, flags);
+
+	for (i = 0; i < count; i++) {
+
+		if (!firmware_has_feature(FW_FEATURE_BULK_REMOVE)) {
+			pSeries_lpar_hpte_invalidate(slot[i], vpn[i], psize,
+						     ssize, 0);
+		} else {
+			param[pix] = HBR_REQUEST | HBR_AVPN | slot[i];
+			param[pix+1] = hpte_encode_avpn(vpn[i], psize, ssize);
+			pix += 2;
+			if (pix == 8) {
+				rc = plpar_hcall9(H_BULK_REMOVE, param,
+						  param[0], param[1], param[2],
+						  param[3], param[4], param[5],
+						  param[6], param[7]);
+				BUG_ON(rc != H_SUCCESS);
+				pix = 0;
+			}
+		}
+	}
+	if (pix) {
+		param[pix] = HBR_END;
+		rc = plpar_hcall9(H_BULK_REMOVE, param, param[0], param[1],
+				  param[2], param[3], param[4], param[5],
+				  param[6], param[7]);
+		BUG_ON(rc != H_SUCCESS);
+	}
+
+	if (lock_tlbie)
+		spin_unlock_irqrestore(&pSeries_lpar_tlbie_lock, flags);
+}
+
+static void pSeries_lpar_hugepage_invalidate(struct mm_struct *mm,
+				       unsigned char *hpte_slot_array,
+				       unsigned long addr, int psize)
+{
+	int ssize = 0, i, index = 0;
+	unsigned long s_addr = addr;
+	unsigned int max_hpte_count, valid;
+	unsigned long vpn_array[PPC64_HUGE_HPTE_BATCH];
+	unsigned long slot_array[PPC64_HUGE_HPTE_BATCH];
+	unsigned long shift, hidx, vpn = 0, vsid, hash, slot;
+
+	shift = mmu_psize_defs[psize].shift;
+	max_hpte_count = HPAGE_PMD_SIZE >> shift;
+
+	for (i = 0; i < max_hpte_count; i++) {
+		valid = hpte_valid(hpte_slot_array, i);
+		if (!valid)
+			continue;
+		hidx =  hpte_hash_index(hpte_slot_array, i);
+
+		/* get the vpn */
+		addr = s_addr + (i * (1ul << shift));
+		if (!is_kernel_addr(addr)) {
+			ssize = user_segment_size(addr);
+			vsid = get_vsid(mm->context.id, addr, ssize);
+			WARN_ON(vsid == 0);
+		} else {
+			vsid = get_kernel_vsid(addr, mmu_kernel_ssize);
+			ssize = mmu_kernel_ssize;
+		}
+
+		vpn = hpt_vpn(addr, vsid, ssize);
+		hash = hpt_hash(vpn, shift, ssize);
+		if (hidx & _PTEIDX_SECONDARY)
+			hash = ~hash;
+
+		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
+		slot += hidx & _PTEIDX_GROUP_IX;
+
+		slot_array[index] = slot;
+		vpn_array[index] = vpn;
+		if (index == PPC64_HUGE_HPTE_BATCH - 1) {
+			/*
+			 * Now do a bluk invalidate
+			 */
+			__pSeries_lpar_hugepage_invalidate(slot_array,
+							   vpn_array,
+							   PPC64_HUGE_HPTE_BATCH,
+							   psize, ssize);
+			index = 0;
+		} else
+			index++;
+	}
+	if (index)
+		__pSeries_lpar_hugepage_invalidate(slot_array, vpn_array,
+						   index, psize, ssize);
+}
+
 static void pSeries_lpar_hpte_removebolted(unsigned long ea,
 					   int psize, int ssize)
 {
@@ -360,13 +474,6 @@ static void pSeries_lpar_hpte_removebolted(unsigned long ea,
 	pSeries_lpar_hpte_invalidate(slot, vpn, psize, ssize, 0);
 }
 
-/* Flag bits for H_BULK_REMOVE */
-#define HBR_REQUEST	0x4000000000000000UL
-#define HBR_RESPONSE	0x8000000000000000UL
-#define HBR_END		0xc000000000000000UL
-#define HBR_AVPN	0x0200000000000000UL
-#define HBR_ANDCOND	0x0100000000000000UL
-
 /*
  * Take a spinlock around flushes to avoid bouncing the hypervisor tlbie
  * lock.
@@ -452,6 +559,7 @@ void __init hpte_init_lpar(void)
 	ppc_md.hpte_removebolted = pSeries_lpar_hpte_removebolted;
 	ppc_md.flush_hash_range	= pSeries_lpar_flush_hash_range;
 	ppc_md.hpte_clear_all   = pSeries_lpar_hptab_clear;
+	ppc_md.hugepage_invalidate = pSeries_lpar_hugepage_invalidate;
 }
 
 #ifdef CONFIG_PPC_SMLPAR
-- 
1.8.1.2

^ permalink raw reply related

* [PATCH -V8 02/18] mm/THP: Use pmd_populate to update the pmd with pgtable_t pointer
From: Aneesh Kumar K.V @ 2013-05-12 16:14 UTC (permalink / raw)
  To: benh, paulus, dwg; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1368375265-19593-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

We should not use set_pmd_at to update pmd_t with pgtable_t pointer. set_pmd_at
is used to set pmd with huge pte entries and architectures like ppc64, clear
few flags from the pte when saving a new entry. Without this change we observe
bad pte errors like below on ppc64 with THP enabled.

BUG: Bad page map in process ld mm=0xc000001ee39f4780 pte:7fc3f37848000001 pmd:c000001ec0000000
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 mm/huge_memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 03a89a2..49cb001 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2325,7 +2325,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 		pte_unmap(pte);
 		spin_lock(&mm->page_table_lock);
 		BUG_ON(!pmd_none(*pmd));
-		set_pmd_at(mm, address, pmd, _pmd);
+		pmd_populate(mm, pmd, _pmd);
 		spin_unlock(&mm->page_table_lock);
 		anon_vma_unlock_write(vma->anon_vma);
 		goto out;
-- 
1.8.1.2

^ permalink raw reply related

* [PATCH -V8 04/18] mm/THP: Add pmd args to pgtable deposit and withdraw APIs
From: Aneesh Kumar K.V @ 2013-05-12 16:14 UTC (permalink / raw)
  To: benh, paulus, dwg; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1368375265-19593-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

This will be later used by powerpc THP support. In powerpc we want to use
pgtable for storing the hash index values. So instead of adding them to
mm_context list, we would like to store them in the second half of pmd

Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/s390/include/asm/pgtable.h     |  5 +++--
 arch/s390/mm/pgtable.c              |  5 +++--
 arch/sparc/include/asm/pgtable_64.h |  5 +++--
 arch/sparc/mm/tlb.c                 |  5 +++--
 include/asm-generic/pgtable.h       |  5 +++--
 mm/huge_memory.c                    | 18 +++++++++---------
 mm/pgtable-generic.c                |  5 +++--
 7 files changed, 27 insertions(+), 21 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 4105b82..9fba7a8 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1351,10 +1351,11 @@ static inline pmd_t pmd_mkwrite(pmd_t pmd)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
 #define __HAVE_ARCH_PGTABLE_DEPOSIT
-extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable);
+extern void pgtable_trans_huge_deposit(struct mm_struct *mm, , pmd_t *pmdp,
+				       pgtable_t pgtable);
 
 #define __HAVE_ARCH_PGTABLE_WITHDRAW
-extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm);
+extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
 
 static inline int pmd_trans_splitting(pmd_t pmd)
 {
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 7805ddc..555c11d 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -1118,7 +1118,8 @@ void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
 	}
 }
 
-void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable)
+void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
+				pgtable_t pgtable)
 {
 	struct list_head *lh = (struct list_head *) pgtable;
 
@@ -1132,7 +1133,7 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable)
 	mm->pmd_huge_pte = pgtable;
 }
 
-pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm)
+pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 {
 	struct list_head *lh;
 	pgtable_t pgtable;
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 7619f2f..d22b92d 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -853,10 +853,11 @@ extern void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
 				 pmd_t *pmd);
 
 #define __HAVE_ARCH_PGTABLE_DEPOSIT
-extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable);
+extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
+				       pgtable_t pgtable);
 
 #define __HAVE_ARCH_PGTABLE_WITHDRAW
-extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm);
+extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
 #endif
 
 /* Encode and de-code a swap entry */
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index 83d89bc..f828dd3 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -188,7 +188,8 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 	}
 }
 
-void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable)
+void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
+				pgtable_t pgtable)
 {
 	struct list_head *lh = (struct list_head *) pgtable;
 
@@ -202,7 +203,7 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable)
 	mm->pmd_huge_pte = pgtable;
 }
 
-pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm)
+pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 {
 	struct list_head *lh;
 	pgtable_t pgtable;
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index a59ff51..18e27c2 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -173,11 +173,12 @@ extern void pmdp_splitting_flush(struct vm_area_struct *vma,
 #endif
 
 #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
-extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable);
+extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
+				       pgtable_t pgtable);
 #endif
 
 #ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
-extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm);
+extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
 #endif
 
 #ifndef __HAVE_ARCH_PMDP_INVALIDATE
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a43d736..69f4153 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -730,7 +730,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		entry = mk_huge_pmd(page, vma);
 		page_add_new_anon_rmap(page, vma, haddr);
 		set_pmd_at(mm, haddr, pmd, entry);
-		pgtable_trans_huge_deposit(mm, pgtable);
+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
 		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		mm->nr_ptes++;
 		spin_unlock(&mm->page_table_lock);
@@ -772,7 +772,7 @@ static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
 	entry = pmd_wrprotect(entry);
 	entry = pmd_mkhuge(entry);
 	set_pmd_at(mm, haddr, pmd, entry);
-	pgtable_trans_huge_deposit(mm, pgtable);
+	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	mm->nr_ptes++;
 	return true;
 }
@@ -917,7 +917,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
 	pmd = pmd_mkold(pmd_wrprotect(pmd));
 	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
-	pgtable_trans_huge_deposit(dst_mm, pgtable);
+	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
 	dst_mm->nr_ptes++;
 
 	ret = 0;
@@ -987,7 +987,7 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
 	pmdp_clear_flush(vma, haddr, pmd);
 	/* leave pmd empty until pte is filled */
 
-	pgtable = pgtable_trans_huge_withdraw(mm);
+	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
 	pmd_populate(mm, &_pmd, pgtable);
 
 	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
@@ -1085,7 +1085,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	pmdp_clear_flush(vma, haddr, pmd);
 	/* leave pmd empty until pte is filled */
 
-	pgtable = pgtable_trans_huge_withdraw(mm);
+	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
 	pmd_populate(mm, &_pmd, pgtable);
 
 	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
@@ -1360,7 +1360,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		struct page *page;
 		pgtable_t pgtable;
 		pmd_t orig_pmd;
-		pgtable = pgtable_trans_huge_withdraw(tlb->mm);
+		pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
 		orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd);
 		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
 		if (is_huge_zero_pmd(orig_pmd)) {
@@ -1693,7 +1693,7 @@ static int __split_huge_page_map(struct page *page,
 	pmd = page_check_address_pmd(page, mm, address,
 				     PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
 	if (pmd) {
-		pgtable = pgtable_trans_huge_withdraw(mm);
+		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
 		pmd_populate(mm, &_pmd, pgtable);
 
 		haddr = address;
@@ -2358,7 +2358,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	page_add_new_anon_rmap(new_page, vma, address);
 	set_pmd_at(mm, address, pmd, _pmd);
 	update_mmu_cache_pmd(vma, address, pmd);
-	pgtable_trans_huge_deposit(mm, pgtable);
+	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	spin_unlock(&mm->page_table_lock);
 
 	*hpage = NULL;
@@ -2664,7 +2664,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	pmdp_clear_flush(vma, haddr, pmd);
 	/* leave pmd empty until pte is filled */
 
-	pgtable = pgtable_trans_huge_withdraw(mm);
+	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
 	pmd_populate(mm, &_pmd, pgtable);
 
 	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 0c8323f..e1a6e4f 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -124,7 +124,8 @@ void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
 
 #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable)
+void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
+				pgtable_t pgtable)
 {
 	assert_spin_locked(&mm->page_table_lock);
 
@@ -141,7 +142,7 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable)
 #ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 /* no "address" argument so destroys page coloring of some arch */
-pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm)
+pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 {
 	pgtable_t pgtable;
 
-- 
1.8.1.2

^ permalink raw reply related

* [PATCH -V8 03/18] mm/THP: Use the right function when updating access flags
From: Aneesh Kumar K.V @ 2013-05-12 16:14 UTC (permalink / raw)
  To: benh, paulus, dwg; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1368375265-19593-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

We should use pmdp_set_access_flags to update access flags. Archs like powerpc
use extra checks(_PAGE_BUSY) when updating a hugepage PTE. A set_pmd_at doesn't
do those checks. We should use set_pmd_at only when updating a none hugepage PTE.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 mm/huge_memory.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 49cb001..a43d736 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1265,7 +1265,9 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 		 * young bit, instead of the current set_pmd_at.
 		 */
 		_pmd = pmd_mkyoung(pmd_mkdirty(*pmd));
-		set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmd, _pmd);
+		if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK,
+					  pmd, _pmd,  1))
+			update_mmu_cache_pmd(vma, addr, pmd);
 	}
 	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
 		if (page->mapping && trylock_page(page)) {
-- 
1.8.1.2

^ permalink raw reply related

* [PATCH -V8 18/18] powerpc: disable assert_pte_locked for collapse_huge_page
From: Aneesh Kumar K.V @ 2013-05-12 16:14 UTC (permalink / raw)
  To: benh, paulus, dwg; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1368375265-19593-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

With THP we set pmd to none, before we do pte_clear. Hence we can't
walk page table to get the pte lock ptr and verify whether it is locked.
THP do take pte lock before calling pte_clear. So we don't change the locking
rules here. It is that we can't use page table walking to check whether
pte locks are held with THP.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/mm/pgtable.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 214130a..edda589 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -235,6 +235,14 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
 	pud = pud_offset(pgd, addr);
 	BUG_ON(pud_none(*pud));
 	pmd = pmd_offset(pud, addr);
+	/*
+	 * khugepaged to collapse normal pages to hugepage, first set
+	 * pmd to none to force page fault/gup to take mmap_sem. After
+	 * pmd is set to none, we do a pte_clear which does this assertion
+	 * so if we find pmd none, return.
+	 */
+	if (pmd_none(*pmd))
+		return;
 	BUG_ON(!pmd_present(*pmd));
 	assert_spin_locked(pte_lockptr(mm, pmd));
 }
-- 
1.8.1.2

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox