LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH -V5 13/25] powerpc: Update tlbie/tlbiel as per ISA doc
From: David Gibson @ 2013-04-11  3:30 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, linux-mm
In-Reply-To: <1365055083-31956-14-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 2239 bytes --]

On Thu, Apr 04, 2013 at 11:27:51AM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> This make sure we handle multiple page size segment correctly.

This needs a much more detailed message.  In what way was the existing
code not matching the ISA documentation?  What consequences did that
have?

> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/mm/hash_native_64.c |   30 ++++++++++++++++++++++++++++--
>  1 file changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
> index b461b2d..ac84fa6 100644
> --- a/arch/powerpc/mm/hash_native_64.c
> +++ b/arch/powerpc/mm/hash_native_64.c
> @@ -61,7 +61,10 @@ static inline void __tlbie(unsigned long vpn, int psize, int apsize, int ssize)
>  
>  	switch (psize) {
>  	case MMU_PAGE_4K:
> +		/* clear out bits after (52) [0....52.....63] */
> +		va &= ~((1ul << (64 - 52)) - 1);
>  		va |= ssize << 8;
> +		va |= mmu_psize_defs[apsize].sllp << 6;

sllp is the per-segment encoding, so it sure must be looked up via
psize, not apsize.

>  		asm volatile(ASM_FTR_IFCLR("tlbie %0,0", PPC_TLBIE(%1,%0), %2)
>  			     : : "r" (va), "r"(0), "i" (CPU_FTR_ARCH_206)
>  			     : "memory");
> @@ -69,9 +72,19 @@ static inline void __tlbie(unsigned long vpn, int psize, int apsize, int ssize)
>  	default:
>  		/* We need 14 to 14 + i bits of va */
>  		penc = mmu_psize_defs[psize].penc[apsize];
> -		va &= ~((1ul << mmu_psize_defs[psize].shift) - 1);
> +		va &= ~((1ul << mmu_psize_defs[apsize].shift) - 1);
>  		va |= penc << 12;
>  		va |= ssize << 8;
> +		/* Add AVAL part */
> +		if (psize != apsize) {
> +			/*
> +			 * MPSS, 64K base page size and 16MB parge page size
> +			 * We don't need all the bits, but this seems to work.
> +			 * vpn cover upto 65 bits of va. (0...65) and we need
> +			 * 58..64 bits of va.

"seems to work" is not a comment I like to see in core MMU code...

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [PATCH -V5 12/25] powerpc: Return all the valid pte ecndoing in KVM_PPC_GET_SMMU_INFO ioctl
From: David Gibson @ 2013-04-11  3:24 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, linux-mm
In-Reply-To: <1365055083-31956-13-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 440 bytes --]

On Thu, Apr 04, 2013 at 11:27:50AM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Surely this can't be correct until the KVM H_ENTER implementation is
updated to cope with the MPSS page sizes.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [PATCH -V5 14/25] mm/THP: HPAGE_SHIFT is not a #define on some arch
From: David Gibson @ 2013-04-11  3:36 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Andrea Arcangeli, paulus, linuxppc-dev, linux-mm
In-Reply-To: <1365055083-31956-15-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 703 bytes --]

On Thu, Apr 04, 2013 at 11:27:52AM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> On archs like powerpc that support different hugepage sizes, HPAGE_SHIFT
> and other derived values like HPAGE_PMD_ORDER are not constants. So move
> that to hugepage_init
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

Looks ok to me.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [PATCH -V5 10/25] powerpc: print both base and actual page size on hash failure
From: David Gibson @ 2013-04-11  3:21 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, linux-mm
In-Reply-To: <1365055083-31956-11-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 457 bytes --]

On Thu, Apr 04, 2013 at 11:27:48AM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* [PATCH] powerpc/crypto: removed qoriq-sec4.1-0.dtsi.
From: Vakul Garg @ 2013-04-11  3:43 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Shaveta Leekha, linux-kernel, Paul Mackerras, Andy Fleming

Removing qoriq-sec4.1-0.dtsi as it is not used by any soc anymore.

Signed-off-by: Vakul Garg <vakul@freescale.com>
---
 arch/powerpc/boot/dts/fsl/qoriq-sec4.1-0.dtsi |  109 -------------------------
 1 files changed, 0 insertions(+), 109 deletions(-)
 delete mode 100644 arch/powerpc/boot/dts/fsl/qoriq-sec4.1-0.dtsi

diff --git a/arch/powerpc/boot/dts/fsl/qoriq-sec4.1-0.dtsi b/arch/powerpc/boot/dts/fsl/qoriq-sec4.1-0.dtsi
deleted file mode 100644
index 3308986..0000000
--- a/arch/powerpc/boot/dts/fsl/qoriq-sec4.1-0.dtsi
+++ /dev/null
@@ -1,109 +0,0 @@
-/*
- * QorIQ Sec/Crypto 4.1 device tree stub [ controller @ offset 0x300000 ]
- *
- * Copyright 2011 Freescale Semiconductor Inc.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright
- *       notice, this list of conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright
- *       notice, this list of conditions and the following disclaimer in the
- *       documentation and/or other materials provided with the distribution.
- *     * Neither the name of Freescale Semiconductor nor the
- *       names of its contributors may be used to endorse or promote products
- *       derived from this software without specific prior written permission.
- *
- *
- * ALTERNATIVELY, this software may be distributed under the terms of the
- * GNU General Public License ("GPL") as published by the Free Software
- * Foundation, either version 2 of that License or (at your option) any
- * later version.
- *
- * THIS SOFTWARE IS PROVIDED BY Freescale Semiconductor ``AS IS'' AND ANY
- * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL Freescale Semiconductor BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- */
-
-crypto: crypto@300000 {
-	compatible = "fsl,sec-v4.1", "fsl,sec-v4.0";
-	#address-cells = <1>;
-	#size-cells = <1>;
-	reg		 = <0x300000 0x10000>;
-	ranges		 = <0 0x300000 0x10000>;
-	interrupts	 = <92 2 0 0>;
-
-	sec_jr0: jr@1000 {
-		compatible = "fsl,sec-v4.1-job-ring",
-			     "fsl,sec-v4.0-job-ring";
-		reg = <0x1000 0x1000>;
-		interrupts = <88 2 0 0>;
-	};
-
-	sec_jr1: jr@2000 {
-		compatible = "fsl,sec-v4.1-job-ring",
-			     "fsl,sec-v4.0-job-ring";
-		reg = <0x2000 0x1000>;
-		interrupts = <89 2 0 0>;
-	};
-
-	sec_jr2: jr@3000 {
-		compatible = "fsl,sec-v4.1-job-ring",
-			     "fsl,sec-v4.0-job-ring";
-		reg = <0x3000 0x1000>;
-		interrupts = <90 2 0 0>;
-	};
-
-	sec_jr3: jr@4000 {
-		compatible = "fsl,sec-v4.1-job-ring",
-			     "fsl,sec-v4.0-job-ring";
-		reg = <0x4000 0x1000>;
-		interrupts = <91 2 0 0>;
-	};
-
-	rtic@6000 {
-		compatible = "fsl,sec-v4.1-rtic",
-			     "fsl,sec-v4.0-rtic";
-		#address-cells = <1>;
-		#size-cells = <1>;
-		reg = <0x6000 0x100>;
-		ranges = <0x0 0x6100 0xe00>;
-
-		rtic_a: rtic-a@0 {
-			compatible = "fsl,sec-v4.1-rtic-memory",
-				     "fsl,sec-v4.0-rtic-memory";
-			reg = <0x00 0x20 0x100 0x80>;
-		};
-
-		rtic_b: rtic-b@20 {
-			compatible = "fsl,sec-v4.1-rtic-memory",
-				     "fsl,sec-v4.0-rtic-memory";
-			reg = <0x20 0x20 0x200 0x80>;
-		};
-
-		rtic_c: rtic-c@40 {
-			compatible = "fsl,sec-v4.1-rtic-memory",
-				     "fsl,sec-v4.0-rtic-memory";
-			reg = <0x40 0x20 0x300 0x80>;
-		};
-
-		rtic_d: rtic-d@60 {
-			compatible = "fsl,sec-v4.1-rtic-memory",
-				     "fsl,sec-v4.0-rtic-memory";
-			reg = <0x60 0x20 0x500 0x80>;
-		};
-	};
-};
-
-sec_mon: sec_mon@314000 {
-	compatible = "fsl,sec-v4.1-mon", "fsl,sec-v4.0-mon";
-	reg = <0x314000 0x1000>;
-	interrupts = <93 2 0 0>;
-};
-- 
1.7.7.6

^ permalink raw reply related

* [PATCHv2] powerpc/crypto: Add property for 'era' in SEC dts crypto node
From: Vakul Garg @ 2013-04-11  3:45 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Shaveta Leekha, linux-kernel, Paul Mackerras, Andy Fleming

The crypto node now contains a new property 'fsl,sec-era'.
This is required so that applications can retrieve era info without
having to be able to read SEC's register space.

Signed-off-by: Vakul Garg <vakul@freescale.com>
---
Changelog:

	v2: Added era in p1023si-post.dtsi as per Kim's comments.

 arch/powerpc/boot/dts/fsl/p1023si-post.dtsi   |    1 +
 arch/powerpc/boot/dts/fsl/pq3-sec4.4-0.dtsi   |    1 +
 arch/powerpc/boot/dts/fsl/qoriq-sec4.0-0.dtsi |    1 +
 arch/powerpc/boot/dts/fsl/qoriq-sec4.2-0.dtsi |    1 +
 arch/powerpc/boot/dts/fsl/qoriq-sec5.0-0.dtsi |    1 +
 arch/powerpc/boot/dts/fsl/qoriq-sec5.2-0.dtsi |    1 +
 arch/powerpc/boot/dts/fsl/qoriq-sec5.3-0.dtsi |    1 +
 7 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/boot/dts/fsl/p1023si-post.dtsi b/arch/powerpc/boot/dts/fsl/p1023si-post.dtsi
index 941fa15..f1105bf 100644
--- a/arch/powerpc/boot/dts/fsl/p1023si-post.dtsi
+++ b/arch/powerpc/boot/dts/fsl/p1023si-post.dtsi
@@ -148,6 +148,7 @@
 
 	crypto: crypto@300000 {
 		compatible = "fsl,sec-v4.2", "fsl,sec-v4.0";
+		fsl,sec-era = <3>;
 		#address-cells = <1>;
 		#size-cells = <1>;
 		reg = <0x30000 0x10000>;
diff --git a/arch/powerpc/boot/dts/fsl/pq3-sec4.4-0.dtsi b/arch/powerpc/boot/dts/fsl/pq3-sec4.4-0.dtsi
index ffadcb5..bb3d826 100644
--- a/arch/powerpc/boot/dts/fsl/pq3-sec4.4-0.dtsi
+++ b/arch/powerpc/boot/dts/fsl/pq3-sec4.4-0.dtsi
@@ -34,6 +34,7 @@
 
 crypto@30000 {
 	compatible = "fsl,sec-v4.4", "fsl,sec-v4.0";
+	fsl,sec-era = <3>;
 	#address-cells = <1>;
 	#size-cells = <1>;
 	ranges		 = <0x0 0x30000 0x10000>;
diff --git a/arch/powerpc/boot/dts/fsl/qoriq-sec4.0-0.dtsi b/arch/powerpc/boot/dts/fsl/qoriq-sec4.0-0.dtsi
index 0cbbac3..02bee5f 100644
--- a/arch/powerpc/boot/dts/fsl/qoriq-sec4.0-0.dtsi
+++ b/arch/powerpc/boot/dts/fsl/qoriq-sec4.0-0.dtsi
@@ -34,6 +34,7 @@
 
 crypto: crypto@300000 {
 	compatible = "fsl,sec-v4.0";
+	fsl,sec-era = <1>;
 	#address-cells = <1>;
 	#size-cells = <1>;
 	reg = <0x300000 0x10000>;
diff --git a/arch/powerpc/boot/dts/fsl/qoriq-sec4.2-0.dtsi b/arch/powerpc/boot/dts/fsl/qoriq-sec4.2-0.dtsi
index 7990e0d..7f7574e 100644
--- a/arch/powerpc/boot/dts/fsl/qoriq-sec4.2-0.dtsi
+++ b/arch/powerpc/boot/dts/fsl/qoriq-sec4.2-0.dtsi
@@ -34,6 +34,7 @@
 
 crypto: crypto@300000 {
 	compatible = "fsl,sec-v4.2", "fsl,sec-v4.0";
+	fsl,sec-era = <3>;
 	#address-cells = <1>;
 	#size-cells = <1>;
 	reg		 = <0x300000 0x10000>;
diff --git a/arch/powerpc/boot/dts/fsl/qoriq-sec5.0-0.dtsi b/arch/powerpc/boot/dts/fsl/qoriq-sec5.0-0.dtsi
index ffd458f..e298efb 100644
--- a/arch/powerpc/boot/dts/fsl/qoriq-sec5.0-0.dtsi
+++ b/arch/powerpc/boot/dts/fsl/qoriq-sec5.0-0.dtsi
@@ -34,6 +34,7 @@
 
 crypto: crypto@300000 {
 	compatible = "fsl,sec-v5.0", "fsl,sec-v4.0";
+	fsl,sec-era = <5>;
 	#address-cells = <1>;
 	#size-cells = <1>;
 	reg		 = <0x300000 0x10000>;
diff --git a/arch/powerpc/boot/dts/fsl/qoriq-sec5.2-0.dtsi b/arch/powerpc/boot/dts/fsl/qoriq-sec5.2-0.dtsi
index 7b2ab8a..33ff09d 100644
--- a/arch/powerpc/boot/dts/fsl/qoriq-sec5.2-0.dtsi
+++ b/arch/powerpc/boot/dts/fsl/qoriq-sec5.2-0.dtsi
@@ -34,6 +34,7 @@
 
 crypto: crypto@300000 {
 	compatible = "fsl,sec-v5.2", "fsl,sec-v5.0", "fsl,sec-v4.0";
+	fsl,sec-era = <5>;
 	#address-cells = <1>;
 	#size-cells = <1>;
 	reg		 = <0x300000 0x10000>;
diff --git a/arch/powerpc/boot/dts/fsl/qoriq-sec5.3-0.dtsi b/arch/powerpc/boot/dts/fsl/qoriq-sec5.3-0.dtsi
index 0339825..0877822 100644
--- a/arch/powerpc/boot/dts/fsl/qoriq-sec5.3-0.dtsi
+++ b/arch/powerpc/boot/dts/fsl/qoriq-sec5.3-0.dtsi
@@ -34,6 +34,7 @@
 
 crypto: crypto@300000 {
 	compatible = "fsl,sec-v5.3", "fsl,sec-v5.0", "fsl,sec-v4.0";
+	fsl,sec-era = <4>;
 	#address-cells = <1>;
 	#size-cells = <1>;
 	reg		 = <0x300000 0x10000>;
-- 
1.7.7.6

^ permalink raw reply related

* Re: [PATCH -V5 12/25] powerpc: Return all the valid pte ecndoing in KVM_PPC_GET_SMMU_INFO ioctl
From: Aneesh Kumar K.V @ 2013-04-11  5:11 UTC (permalink / raw)
  To: David Gibson; +Cc: paulus, linuxppc-dev, linux-mm
In-Reply-To: <20130411032447.GU8165@truffula.fritz.box>

David Gibson <dwg@au1.ibm.com> writes:

> On Thu, Apr 04, 2013 at 11:27:50AM +0530, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>
> Surely this can't be correct until the KVM H_ENTER implementation is
> updated to cope with the MPSS page sizes.

Why ? We are returning info regarding penc values for different
combination. I would guess qemu to only use info related to base page
size. Rest it can ignore right ?. Obviously i haven't tested this
part. So let me know if I should drop this ?

-aneesh

^ permalink raw reply

* Re: [PATCH -V5 13/25] powerpc: Update tlbie/tlbiel as per ISA doc
From: Aneesh Kumar K.V @ 2013-04-11  5:20 UTC (permalink / raw)
  To: David Gibson; +Cc: paulus, linuxppc-dev, linux-mm
In-Reply-To: <20130411033010.GV8165@truffula.fritz.box>

David Gibson <dwg@au1.ibm.com> writes:

> On Thu, Apr 04, 2013 at 11:27:51AM +0530, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>> 
>> This make sure we handle multiple page size segment correctly.
>
> This needs a much more detailed message.  In what way was the existing
> code not matching the ISA documentation?  What consequences did that
> have?

Mostly to make sure we use the right penc values in tlbie. I did test
these changes on PowerNV. 


>
>> 
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/mm/hash_native_64.c |   30 ++++++++++++++++++++++++++++--
>>  1 file changed, 28 insertions(+), 2 deletions(-)
>> 
>> diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
>> index b461b2d..ac84fa6 100644
>> --- a/arch/powerpc/mm/hash_native_64.c
>> +++ b/arch/powerpc/mm/hash_native_64.c
>> @@ -61,7 +61,10 @@ static inline void __tlbie(unsigned long vpn, int psize, int apsize, int ssize)
>>  
>>  	switch (psize) {
>>  	case MMU_PAGE_4K:
>> +		/* clear out bits after (52) [0....52.....63] */
>> +		va &= ~((1ul << (64 - 52)) - 1);
>>  		va |= ssize << 8;
>> +		va |= mmu_psize_defs[apsize].sllp << 6;
>
> sllp is the per-segment encoding, so it sure must be looked up via
> psize, not apsize.


as per ISA doc, for base page size 4K, RB[56:58] must be set to
SLB[L|LP] encoded for the page size corresponding to the actual page
size specified by the PTE that was used to create the the TLB entry to
be invalidated.


>
>>  		asm volatile(ASM_FTR_IFCLR("tlbie %0,0", PPC_TLBIE(%1,%0), %2)
>>  			     : : "r" (va), "r"(0), "i" (CPU_FTR_ARCH_206)
>>  			     : "memory");
>> @@ -69,9 +72,19 @@ static inline void __tlbie(unsigned long vpn, int psize, int apsize, int ssize)
>>  	default:
>>  		/* We need 14 to 14 + i bits of va */
>>  		penc = mmu_psize_defs[psize].penc[apsize];
>> -		va &= ~((1ul << mmu_psize_defs[psize].shift) - 1);
>> +		va &= ~((1ul << mmu_psize_defs[apsize].shift) - 1);
>>  		va |= penc << 12;
>>  		va |= ssize << 8;
>> +		/* Add AVAL part */
>> +		if (psize != apsize) {
>> +			/*
>> +			 * MPSS, 64K base page size and 16MB parge page size
>> +			 * We don't need all the bits, but this seems to work.
>> +			 * vpn cover upto 65 bits of va. (0...65) and we need
>> +			 * 58..64 bits of va.
>
> "seems to work" is not a comment I like to see in core MMU code...
>

As per ISA spec, the "other bits" in RB[56:62] must be ignored by the
processor. Hence I didn't bother to do zero it out. Since we only
support one MPSS combination, we could easily zero out using 0xf0. 

-aneesh

^ permalink raw reply

* Re: [PATCH -V5 17/25] powerpc/THP: Implement transparent hugepages for ppc64
From: David Gibson @ 2013-04-11  5:38 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, linux-mm
In-Reply-To: <1365055083-31956-18-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 23390 bytes --]

On Thu, Apr 04, 2013 at 11:27:55AM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> We now have pmd entries covering to 16MB range. To implement THP on powerpc,
> we double the size of PMD. The second half is used to deposit the pgtable (PTE page).
> We also use the depoisted PTE page for tracking the HPTE information. The information
> include [ secondary group | 3 bit hidx | valid ]. We use one byte per each HPTE entry.
> With 16MB hugepage and 64K HPTE we need 256 entries and with 4K HPTE we need
> 4096 entries. Both will fit in a 4K PTE page.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/page.h              |    2 +-
>  arch/powerpc/include/asm/pgtable-ppc64-64k.h |    3 +-
>  arch/powerpc/include/asm/pgtable-ppc64.h     |    2 +-
>  arch/powerpc/include/asm/pgtable.h           |  240 ++++++++++++++++++++
>  arch/powerpc/mm/pgtable.c                    |  314 ++++++++++++++++++++++++++
>  arch/powerpc/mm/pgtable_64.c                 |   13 ++
>  arch/powerpc/platforms/Kconfig.cputype       |    1 +
>  7 files changed, 572 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
> index 38e7ff6..b927447 100644
> --- a/arch/powerpc/include/asm/page.h
> +++ b/arch/powerpc/include/asm/page.h
> @@ -40,7 +40,7 @@
>  #ifdef CONFIG_HUGETLB_PAGE
>  extern unsigned int HPAGE_SHIFT;
>  #else
> -#define HPAGE_SHIFT PAGE_SHIFT
> +#define HPAGE_SHIFT PMD_SHIFT

That looks like it could break everything except the 64k page size
64-bit base.

>  #endif
>  #define HPAGE_SIZE		((1UL) << HPAGE_SHIFT)
>  #define HPAGE_MASK		(~(HPAGE_SIZE - 1))
> diff --git a/arch/powerpc/include/asm/pgtable-ppc64-64k.h b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
> index 3c529b4..5c5541a 100644
> --- a/arch/powerpc/include/asm/pgtable-ppc64-64k.h
> +++ b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
> @@ -33,7 +33,8 @@
>  #define PGDIR_MASK	(~(PGDIR_SIZE-1))
>  
>  /* Bits to mask out from a PMD to get to the PTE page */
> -#define PMD_MASKED_BITS		0x1ff
> +/* PMDs point to PTE table fragments which are 4K aligned.  */
> +#define PMD_MASKED_BITS		0xfff
>  /* Bits to mask out from a PGD/PUD to get to the PMD page */
>  #define PUD_MASKED_BITS		0x1ff
>  
> diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
> index 0182c20..c0747c7 100644
> --- a/arch/powerpc/include/asm/pgtable-ppc64.h
> +++ b/arch/powerpc/include/asm/pgtable-ppc64.h
> @@ -150,7 +150,7 @@
>  #define	pmd_present(pmd)	(pmd_val(pmd) != 0)
>  #define	pmd_clear(pmdp)		(pmd_val(*(pmdp)) = 0)
>  #define pmd_page_vaddr(pmd)	(pmd_val(pmd) & ~PMD_MASKED_BITS)
> -#define pmd_page(pmd)		virt_to_page(pmd_page_vaddr(pmd))
> +extern struct page *pmd_page(pmd_t pmd);

Does unconditionally changing pmd_page() from a macro to an external
function have a noticeable performance impact?

>  #define pud_set(pudp, pudval)	(pud_val(*(pudp)) = (pudval))
>  #define pud_none(pud)		(!pud_val(pud))
> diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
> index 4b52726..9fbe2a7 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -23,7 +23,247 @@ struct mm_struct;
>   */
>  #define PTE_PAGE_HIDX_OFFSET (PTRS_PER_PTE * 8)
>  
> +/* A large part matches with pte bits */
> +#define PMD_HUGE_PRESENT	0x001 /* software: pte contains a translation */
> +#define PMD_HUGE_USER		0x002 /* matches one of the PP bits */
> +#define PMD_HUGE_FILE		0x002 /* (!present only) software: pte holds file offset */

Can we actually get hugepage PMDs that are in this state?

> +#define PMD_HUGE_EXEC		0x004 /* No execute on POWER4 and newer (we invert) */
> +#define PMD_HUGE_SPLITTING	0x008
> +#define PMD_HUGE_SAO		0x010 /* strong Access order */
> +#define PMD_HUGE_HASHPTE	0x020
> +#define PMD_ISHUGE		0x040
> +#define PMD_HUGE_DIRTY		0x080 /* C: page changed */
> +#define PMD_HUGE_ACCESSED	0x100 /* R: page referenced */
> +#define PMD_HUGE_RW		0x200 /* software: user write access allowed */
> +#define PMD_HUGE_BUSY		0x800 /* software: PTE & hash are busy */
> +#define PMD_HUGE_HPTEFLAGS	(PMD_HUGE_BUSY | PMD_HUGE_HASHPTE)
> +/*
> + * We keep both the pmd and pte rpn shift same, eventhough we use only
> + * lower 12 bits for hugepage flags at pmd level

Why?

> + */
> +#define PMD_HUGE_RPN_SHIFT	PTE_RPN_SHIFT
> +#define HUGE_PAGE_SIZE		(ASM_CONST(1) << 24)
> +#define HUGE_PAGE_MASK		(~(HUGE_PAGE_SIZE - 1))
> +
>  #ifndef __ASSEMBLY__
> +extern void hpte_need_hugepage_flush(struct mm_struct *mm, unsigned long addr,
> +				     pmd_t *pmdp);
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
> +extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
> +extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot);
> +extern void set_pmd_at(struct mm_struct *mm, unsigned long addr,
> +		       pmd_t *pmdp, pmd_t pmd);
> +extern void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
> +				 pmd_t *pmd);
> +static inline int pmd_large(pmd_t pmd)
> +{
> +	return (pmd_val(pmd) & (PMD_ISHUGE | PMD_HUGE_PRESENT)) ==
> +		(PMD_ISHUGE | PMD_HUGE_PRESENT);
> +}
> +
> +static inline int pmd_trans_splitting(pmd_t pmd)
> +{
> +	return (pmd_val(pmd) & (PMD_ISHUGE|PMD_HUGE_SPLITTING)) ==
> +		(PMD_ISHUGE|PMD_HUGE_SPLITTING);
> +}
> +
> +static inline int pmd_trans_huge(pmd_t pmd)
> +{
> +	return pmd_val(pmd) & PMD_ISHUGE;
> +}
> +/* We will enable it in the last patch */
> +#define has_transparent_hugepage() 0
> +#else
> +#define pmd_large(pmd)		0
> +#define has_transparent_hugepage() 0
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +static inline unsigned long pmd_pfn(pmd_t pmd)
> +{
> +	/*
> +	 * Only called for hugepage pmd
> +	 */
> +	return pmd_val(pmd) >> PMD_HUGE_RPN_SHIFT;
> +}
> +
> +static inline int pmd_young(pmd_t pmd)
> +{
> +	return pmd_val(pmd) & PMD_HUGE_ACCESSED;
> +}
> +
> +static inline pmd_t pmd_mkhuge(pmd_t pmd)
> +{
> +	/* Do nothing, mk_pmd() does this part.  */
> +	return pmd;
> +}
> +
> +#define __HAVE_ARCH_PMD_WRITE
> +static inline int pmd_write(pmd_t pmd)
> +{
> +	return pmd_val(pmd) & PMD_HUGE_RW;
> +}
> +
> +static inline pmd_t pmd_mkold(pmd_t pmd)
> +{
> +	pmd_val(pmd) &= ~PMD_HUGE_ACCESSED;
> +	return pmd;
> +}
> +
> +static inline pmd_t pmd_wrprotect(pmd_t pmd)
> +{
> +	pmd_val(pmd) &= ~PMD_HUGE_RW;
> +	return pmd;
> +}
> +
> +static inline pmd_t pmd_mkdirty(pmd_t pmd)
> +{
> +	pmd_val(pmd) |= PMD_HUGE_DIRTY;
> +	return pmd;
> +}
> +
> +static inline pmd_t pmd_mkyoung(pmd_t pmd)
> +{
> +	pmd_val(pmd) |= PMD_HUGE_ACCESSED;
> +	return pmd;
> +}
> +
> +static inline pmd_t pmd_mkwrite(pmd_t pmd)
> +{
> +	pmd_val(pmd) |= PMD_HUGE_RW;
> +	return pmd;
> +}
> +
> +static inline pmd_t pmd_mknotpresent(pmd_t pmd)
> +{
> +	pmd_val(pmd) &= ~PMD_HUGE_PRESENT;
> +	return pmd;
> +}
> +
> +static inline pmd_t pmd_mksplitting(pmd_t pmd)
> +{
> +	pmd_val(pmd) |= PMD_HUGE_SPLITTING;
> +	return pmd;
> +}
> +
> +/*
> + * Set the dirty and/or accessed bits atomically in a linux hugepage PMD, this
> + * function doesn't need to flush the hash entry
> + */
> +static inline void __pmdp_set_access_flags(pmd_t *pmdp, pmd_t entry)
> +{
> +	unsigned long bits = pmd_val(entry) & (PMD_HUGE_DIRTY |
> +					       PMD_HUGE_ACCESSED |
> +					       PMD_HUGE_RW | PMD_HUGE_EXEC);
> +#ifdef PTE_ATOMIC_UPDATES
> +	unsigned long old, tmp;
> +
> +	__asm__ __volatile__(
> +	"1:	ldarx	%0,0,%4\n\
> +		andi.	%1,%0,%6\n\
> +		bne-	1b \n\
> +		or	%0,%3,%0\n\
> +		stdcx.	%0,0,%4\n\
> +		bne-	1b"
> +	:"=&r" (old), "=&r" (tmp), "=m" (*pmdp)
> +	:"r" (bits), "r" (pmdp), "m" (*pmdp), "i" (PMD_HUGE_BUSY)
> +	:"cc");
> +#else
> +	unsigned long old = pmd_val(*pmdp);
> +	*pmdp = __pmd(old | bits);
> +#endif
> +}
> +
> +#define __HAVE_ARCH_PMD_SAME
> +static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
> +{
> +	return (((pmd_val(pmd_a) ^ pmd_val(pmd_b)) & ~PMD_HUGE_HPTEFLAGS) == 0);
> +}
> +
> +#define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
> +extern int pmdp_set_access_flags(struct vm_area_struct *vma,
> +				 unsigned long address, pmd_t *pmdp,
> +				 pmd_t entry, int dirty);
> +
> +static inline unsigned long pmd_hugepage_update(struct mm_struct *mm,
> +						unsigned long addr,
> +						pmd_t *pmdp, unsigned long clr)
> +{
> +#ifdef PTE_ATOMIC_UPDATES
> +	unsigned long old, tmp;
> +
> +	__asm__ __volatile__(
> +	"1:	ldarx	%0,0,%3\n\
> +		andi.	%1,%0,%6\n\
> +		bne-	1b \n\
> +		andc	%1,%0,%4 \n\
> +		stdcx.	%1,0,%3 \n\
> +		bne-	1b"
> +	: "=&r" (old), "=&r" (tmp), "=m" (*pmdp)
> +	: "r" (pmdp), "r" (clr), "m" (*pmdp), "i" (PMD_HUGE_BUSY)
> +	: "cc" );
> +#else
> +	unsigned long old = pmd_val(*pmdp);
> +	*pmdp = __pmd(old & ~clr);
> +#endif
> +
> +#ifdef CONFIG_PPC_STD_MMU_64
> +	if (old & PMD_HUGE_HASHPTE)
> +		hpte_need_hugepage_flush(mm, addr, pmdp);
> +#endif
> +	return old;
> +}
> +
> +static inline int __pmdp_test_and_clear_young(struct mm_struct *mm,
> +					      unsigned long addr, pmd_t *pmdp)
> +{
> +	unsigned long old;
> +
> +	if ((pmd_val(*pmdp) & (PMD_HUGE_ACCESSED | PMD_HUGE_HASHPTE)) == 0)
> +		return 0;
> +	old = pmd_hugepage_update(mm, addr, pmdp, PMD_HUGE_ACCESSED);
> +	return ((old & PMD_HUGE_ACCESSED) != 0);
> +}
> +
> +#define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
> +extern int pmdp_test_and_clear_young(struct vm_area_struct *vma,
> +				     unsigned long address, pmd_t *pmdp);
> +#define __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
> +extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
> +				  unsigned long address, pmd_t *pmdp);
> +
> +#define __HAVE_ARCH_PMDP_GET_AND_CLEAR
> +static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm,
> +				       unsigned long addr, pmd_t *pmdp)
> +{
> +	unsigned long old = pmd_hugepage_update(mm, addr, pmdp, ~0UL);
> +	return __pmd(old);
> +}
> +
> +#define __HAVE_ARCH_PMDP_SET_WRPROTECT
> +static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long addr,
> +				      pmd_t *pmdp)
> +{
> +
> +	if ((pmd_val(*pmdp) & PMD_HUGE_RW) == 0)
> +		return;
> +
> +	pmd_hugepage_update(mm, addr, pmdp, PMD_HUGE_RW);
> +}
> +
> +#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
> +extern void pmdp_splitting_flush(struct vm_area_struct *vma,
> +				 unsigned long address, pmd_t *pmdp);
> +
> +#define __HAVE_ARCH_PGTABLE_DEPOSIT
> +extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
> +				       pgtable_t pgtable);
> +#define __HAVE_ARCH_PGTABLE_WITHDRAW
> +extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
> +
> +#define __HAVE_ARCH_PMDP_INVALIDATE
> +extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> +			    pmd_t *pmdp);
>  
>  #include <asm/tlbflush.h>
>  
> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> index 214130a..9f33780 100644
> --- a/arch/powerpc/mm/pgtable.c
> +++ b/arch/powerpc/mm/pgtable.c
> @@ -31,6 +31,7 @@
>  #include <asm/pgalloc.h>
>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
> +#include <asm/machdep.h>
>  
>  #include "mmu_decl.h"
>  
> @@ -240,3 +241,316 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
>  }
>  #endif /* CONFIG_DEBUG_VM */
>  
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static pmd_t set_hugepage_access_flags_filter(pmd_t pmd,
> +					      struct vm_area_struct *vma,
> +					      int dirty)
> +{
> +	return pmd;
> +}

I don't really see why you're splitting out these trivial ...filter()
functions, rather than just doing it inline in the (single) caller.

> +
> +/*
> + * This is called when relaxing access to a hugepage. It's also called in the page
> + * fault path when we don't hit any of the major fault cases, ie, a minor
> + * update of _PAGE_ACCESSED, _PAGE_DIRTY, etc... The generic code will have
> + * handled those two for us, we additionally deal with missing execute
> + * permission here on some processors
> + */
> +int pmdp_set_access_flags(struct vm_area_struct *vma, unsigned long address,
> +			  pmd_t *pmdp, pmd_t entry, int dirty)
> +{
> +	int changed;
> +	entry = set_hugepage_access_flags_filter(entry, vma, dirty);
> +	changed = !pmd_same(*(pmdp), entry);
> +	if (changed) {
> +		__pmdp_set_access_flags(pmdp, entry);
> +		/*
> +		 * Since we are not supporting SW TLB systems, we don't
> +		 * have any thing similar to flush_tlb_page_nohash()
> +		 */
> +	}
> +	return changed;
> +}
> +
> +int pmdp_test_and_clear_young(struct vm_area_struct *vma,
> +			      unsigned long address, pmd_t *pmdp)
> +{
> +	return __pmdp_test_and_clear_young(vma->vm_mm, address, pmdp);
> +}
> +
> +/*
> + * We currently remove entries from the hashtable regardless of whether
> + * the entry was young or dirty. The generic routines only flush if the
> + * entry was young or dirty which is not good enough.
> + *
> + * We should be more intelligent about this but for the moment we override
> + * these functions and force a tlb flush unconditionally
> + */
> +int pmdp_clear_flush_young(struct vm_area_struct *vma,
> +				  unsigned long address, pmd_t *pmdp)
> +{
> +	return __pmdp_test_and_clear_young(vma->vm_mm, address, pmdp);
> +}
> +
> +/*
> + * We mark the pmd splitting and invalidate all the hpte
> + * entries for this hugepage.
> + */
> +void pmdp_splitting_flush(struct vm_area_struct *vma,
> +			  unsigned long address, pmd_t *pmdp)
> +{
> +	unsigned long old, tmp;
> +
> +	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> +#ifdef PTE_ATOMIC_UPDATES
> +
> +	__asm__ __volatile__(
> +	"1:	ldarx	%0,0,%3\n\
> +		andi.	%1,%0,%6\n\
> +		bne-	1b \n\
> +		ori	%1,%0,%4 \n\
> +		stdcx.	%1,0,%3 \n\
> +		bne-	1b"
> +	: "=&r" (old), "=&r" (tmp), "=m" (*pmdp)
> +	: "r" (pmdp), "i" (PMD_HUGE_SPLITTING), "m" (*pmdp), "i" (PMD_HUGE_BUSY)
> +	: "cc" );
> +#else
> +	old = pmd_val(*pmdp);
> +	*pmdp = __pmd(old | PMD_HUGE_SPLITTING);
> +#endif
> +	/*
> +	 * If we didn't had the splitting flag set, go and flush the
> +	 * HPTE entries and serialize against gup fast.
> +	 */
> +	if (!(old & PMD_HUGE_SPLITTING)) {
> +#ifdef CONFIG_PPC_STD_MMU_64
> +		/* We need to flush the hpte */
> +		if (old & PMD_HUGE_HASHPTE)
> +			hpte_need_hugepage_flush(vma->vm_mm, address, pmdp);
> +#endif
> +		/* need tlb flush only to serialize against gup-fast */
> +		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> +	}
> +}
> +
> +/*
> + * We want to put the pgtable in pmd and use pgtable for tracking
> + * the base page size hptes
> + */
> +void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
> +				pgtable_t pgtable)
> +{
> +	unsigned long *pgtable_slot;
> +	assert_spin_locked(&mm->page_table_lock);
> +	/*
> +	 * we store the pgtable in the second half of PMD
> +	 */
> +	pgtable_slot = pmdp + PTRS_PER_PMD;
> +	*pgtable_slot = (unsigned long)pgtable;
> +}
> +
> +#define PTE_FRAG_SIZE (2 * PTRS_PER_PTE * sizeof(pte_t))

Another example of why this define should be moved to a header.

> +pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
> +{
> +	pgtable_t pgtable;
> +	unsigned long *pgtable_slot;
> +
> +	assert_spin_locked(&mm->page_table_lock);
> +	pgtable_slot = pmdp + PTRS_PER_PMD;
> +	pgtable = (pgtable_t) *pgtable_slot;
> +	/*
> +	 * We store HPTE information in the deposited PTE fragment.
> +	 * zero out the content on withdraw.
> +	 */
> +	memset(pgtable, 0, PTE_FRAG_SIZE);
> +	return pgtable;
> +}
> +
> +/*
> + * Since we are looking at latest ppc64, we don't need to worry about
> + * i/d cache coherency on exec fault
> + */
> +static pmd_t set_pmd_filter(pmd_t pmd, unsigned long addr)
> +{
> +	pmd = __pmd(pmd_val(pmd) & ~PMD_HUGE_HPTEFLAGS);
> +	return pmd;
> +}
> +
> +/*
> + * We can make it less convoluted than __set_pte_at, because
> + * we can ignore lot of hardware here, because this is only for
> + * MPSS
> + */
> +static inline void __set_pmd_at(struct mm_struct *mm, unsigned long addr,
> +				pmd_t *pmdp, pmd_t pmd, int percpu)
> +{
> +	/*
> +	 * There is nothing in hash page table now, so nothing to
> +	 * invalidate, set_pte_at is used for adding new entry.
> +	 * For updating we should use update_hugepage_pmd()
> +	 */
> +	*pmdp = pmd;
> +}
> +
> +/*
> + * set a new huge pmd. We should not be called for updating
> + * an existing pmd entry. That should go via pmd_hugepage_update.
> + */
> +void set_pmd_at(struct mm_struct *mm, unsigned long addr,
> +		pmd_t *pmdp, pmd_t pmd)
> +{
> +	/*
> +	 * Note: mm->context.id might not yet have been assigned as
> +	 * this context might not have been activated yet when this
> +	 * is called.
> +	 */
> +	pmd = set_pmd_filter(pmd, addr);
> +
> +	__set_pmd_at(mm, addr, pmdp, pmd, 0);
> +
> +}
> +
> +void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> +		     pmd_t *pmdp)
> +{
> +	pmd_hugepage_update(vma->vm_mm, address, pmdp, PMD_HUGE_PRESENT);
> +	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> +}
> +
> +/*
> + * A linux hugepage PMD was changed and the corresponding hash table entry
> + * neesd to be flushed.
> + *
> + * The linux hugepage PMD now include the pmd entries followed by the address
> + * to the stashed pgtable_t. The stashed pgtable_t contains the hpte bits.
> + * [ secondary group | 3 bit hidx | valid ]. We use one byte per each HPTE entry.
> + * With 16MB hugepage and 64K HPTE we need 256 entries and with 4K HPTE we need
> + * 4096 entries. Both will fit in a 4K pgtable_t.
> + */
> +void hpte_need_hugepage_flush(struct mm_struct *mm, unsigned long addr,
> +			      pmd_t *pmdp)
> +{
> +	int ssize, i;
> +	unsigned long s_addr;
> +	unsigned int psize, valid;
> +	unsigned char *hpte_slot_array;
> +	unsigned long hidx, vpn, vsid, hash, shift, slot;
> +
> +	/*
> +	 * Flush all the hptes mapping this hugepage
> +	 */
> +	s_addr = addr & HUGE_PAGE_MASK;
> +	/*
> +	 * The hpte hindex are stored in the pgtable whose address is in the
> +	 * second half of the PMD
> +	 */
> +	hpte_slot_array = *(char **)(pmdp + PTRS_PER_PMD);
> +
> +	/* get the base page size */
> +	psize = get_slice_psize(mm, s_addr);
> +	shift = mmu_psize_defs[psize].shift;
> +
> +	for (i = 0; i < HUGE_PAGE_SIZE/(1ul << shift); i++) {

HUGE_PAGE_SIZE >> shift would be a simpler way to do this calculation.

> +		/*
> +		 * 8 bits per each hpte entries
> +		 * 000| [ secondary group (one bit) | hidx (3 bits) | valid bit]
> +		 */
> +		valid = hpte_slot_array[i] & 0x1;
> +		if (!valid)
> +			continue;
> +		hidx =  hpte_slot_array[i]  >> 1;
> +
> +		/* get the vpn */
> +		addr = s_addr + (i * (1ul << shift));
> +		if (!is_kernel_addr(addr)) {
> +			ssize = user_segment_size(addr);
> +			vsid = get_vsid(mm->context.id, addr, ssize);
> +			WARN_ON(vsid == 0);
> +		} else {
> +			vsid = get_kernel_vsid(addr, mmu_kernel_ssize);
> +			ssize = mmu_kernel_ssize;
> +		}
> +
> +		vpn = hpt_vpn(addr, vsid, ssize);
> +		hash = hpt_hash(vpn, shift, ssize);
> +		if (hidx & _PTEIDX_SECONDARY)
> +			hash = ~hash;
> +
> +		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> +		slot += hidx & _PTEIDX_GROUP_IX;
> +		ppc_md.hpte_invalidate(slot, vpn, psize, ssize, 0);
> +	}
> +}
> +
> +static pmd_t pmd_set_protbits(pmd_t pmd, pgprot_t pgprot)
> +{
> +	unsigned long pmd_prot = 0;
> +	unsigned long prot = pgprot_val(pgprot);
> +
> +	if (prot & _PAGE_PRESENT)
> +		pmd_prot |= PMD_HUGE_PRESENT;
> +	if (prot & _PAGE_USER)
> +		pmd_prot |= PMD_HUGE_USER;
> +	if (prot & _PAGE_FILE)
> +		pmd_prot |= PMD_HUGE_FILE;
> +	if (prot & _PAGE_EXEC)
> +		pmd_prot |= PMD_HUGE_EXEC;
> +	/*
> +	 * _PAGE_COHERENT should always be set
> +	 */
> +	VM_BUG_ON(!(prot & _PAGE_COHERENT));
> +
> +	if (prot & _PAGE_SAO)
> +		pmd_prot |= PMD_HUGE_SAO;

This looks dubious because _PAGE_SAO is not a single bit.  What
happens if WRITETHRU or NO_CACHE is set without the other?

> +	if (prot & _PAGE_DIRTY)
> +		pmd_prot |= PMD_HUGE_DIRTY;
> +	if (prot & _PAGE_ACCESSED)
> +		pmd_prot |= PMD_HUGE_ACCESSED;
> +	if (prot & _PAGE_RW)
> +		pmd_prot |= PMD_HUGE_RW;
> +
> +	pmd_val(pmd) |= pmd_prot;
> +	return pmd;
> +}
> +
> +pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot)
> +{
> +	pmd_t pmd;
> +
> +	pmd_val(pmd) = pfn << PMD_HUGE_RPN_SHIFT;
> +	pmd_val(pmd) |= PMD_ISHUGE;
> +	pmd = pmd_set_protbits(pmd, pgprot);
> +	return pmd;
> +}
> +
> +pmd_t mk_pmd(struct page *page, pgprot_t pgprot)
> +{
> +	return pfn_pmd(page_to_pfn(page), pgprot);
> +}
> +
> +pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
> +{
> +	/* FIXME!! why are this bits cleared ? */

You really need to answer this question...

> +	pmd_val(pmd) &= ~(PMD_HUGE_PRESENT |
> +			  PMD_HUGE_RW |
> +			  PMD_HUGE_EXEC);
> +	pmd = pmd_set_protbits(pmd, newprot);
> +	return pmd;
> +}
> +
> +/*
> + * This is called at the end of handling a user page fault, when the
> + * fault has been handled by updating a HUGE PMD entry in the linux page tables.
> + * We use it to preload an HPTE into the hash table corresponding to
> + * the updated linux HUGE PMD entry.
> + */
> +void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
> +			  pmd_t *pmd)
> +{
> +	/* FIXME!!
> +	 * Will be done in a later patch
> +	 */

If you need another patch to make the code in this patch work, they
should probably be folded together.

> +}
> +
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
> index e79840b..6fc3488 100644
> --- a/arch/powerpc/mm/pgtable_64.c
> +++ b/arch/powerpc/mm/pgtable_64.c
> @@ -338,6 +338,19 @@ EXPORT_SYMBOL(iounmap);
>  EXPORT_SYMBOL(__iounmap);
>  EXPORT_SYMBOL(__iounmap_at);
>  
> +/*
> + * For hugepage we have pfn in the pmd, we use PMD_HUGE_RPN_SHIFT bits for flags
> + * For PTE page, we have a PTE_FRAG_SIZE (4K) aligned virtual address.
> + */
> +struct page *pmd_page(pmd_t pmd)
> +{
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	if (pmd_val(pmd) & PMD_ISHUGE)
> +		return pfn_to_page(pmd_pfn(pmd));
> +#endif
> +	return virt_to_page(pmd_page_vaddr(pmd));
> +}
> +
>  #ifdef CONFIG_PPC_64K_PAGES
>  /*
>   * we support 16 fragments per PTE page. This is limited by how many
> diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
> index 72afd28..90ee19b 100644
> --- a/arch/powerpc/platforms/Kconfig.cputype
> +++ b/arch/powerpc/platforms/Kconfig.cputype
> @@ -71,6 +71,7 @@ config PPC_BOOK3S_64
>  	select PPC_FPU
>  	select PPC_HAVE_PMU_SUPPORT
>  	select SYS_SUPPORTS_HUGETLBFS
> +	select HAVE_ARCH_TRANSPARENT_HUGEPAGE if PPC_64K_PAGES
>  
>  config PPC_BOOK3E_64
>  	bool "Embedded processors"

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [PATCH -V5 12/25] powerpc: Return all the valid pte ecndoing in KVM_PPC_GET_SMMU_INFO ioctl
From: David Gibson @ 2013-04-11  5:57 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linuxppc-dev, paulus, linux-mm
In-Reply-To: <874nfdodlu.fsf@linux.vnet.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 993 bytes --]

On Thu, Apr 11, 2013 at 10:41:57AM +0530, Aneesh Kumar K.V wrote:
> David Gibson <dwg@au1.ibm.com> writes:
> 
> > On Thu, Apr 04, 2013 at 11:27:50AM +0530, Aneesh Kumar K.V wrote:
> >> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> >
> > Surely this can't be correct until the KVM H_ENTER implementation is
> > updated to cope with the MPSS page sizes.
> 
> Why ? We are returning info regarding penc values for different
> combination. I would guess qemu to only use info related to base page
> size. Rest it can ignore right ?. Obviously i haven't tested this
> part. So let me know if I should drop this ?

The guest can't actually use those encodings unless the host's H_ENTER
allows it to, though, so this patch should be moved after extended
that KVM support.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [PATCH -V5 13/25] powerpc: Update tlbie/tlbiel as per ISA doc
From: David Gibson @ 2013-04-11  6:16 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linuxppc-dev, paulus, linux-mm
In-Reply-To: <871uahod83.fsf@linux.vnet.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 3495 bytes --]

On Thu, Apr 11, 2013 at 10:50:12AM +0530, Aneesh Kumar K.V wrote:
> David Gibson <dwg@au1.ibm.com> writes:
> 
> > On Thu, Apr 04, 2013 at 11:27:51AM +0530, Aneesh Kumar K.V wrote:
> >> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> >> 
> >> This make sure we handle multiple page size segment correctly.
> >
> > This needs a much more detailed message.  In what way was the existing
> > code not matching the ISA documentation?  What consequences did that
> > have?
> 
> Mostly to make sure we use the right penc values in tlbie. I did test
> these changes on PowerNV. 

A vague description like this is not adequate.  Your commit message
needs to explain what was wrong with the existing behaviour.

> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> >> ---
> >>  arch/powerpc/mm/hash_native_64.c |   30 ++++++++++++++++++++++++++++--
> >>  1 file changed, 28 insertions(+), 2 deletions(-)
> >> 
> >> diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
> >> index b461b2d..ac84fa6 100644
> >> --- a/arch/powerpc/mm/hash_native_64.c
> >> +++ b/arch/powerpc/mm/hash_native_64.c
> >> @@ -61,7 +61,10 @@ static inline void __tlbie(unsigned long vpn, int psize, int apsize, int ssize)
> >>  
> >>  	switch (psize) {
> >>  	case MMU_PAGE_4K:
> >> +		/* clear out bits after (52) [0....52.....63] */
> >> +		va &= ~((1ul << (64 - 52)) - 1);
> >>  		va |= ssize << 8;
> >> +		va |= mmu_psize_defs[apsize].sllp << 6;
> >
> > sllp is the per-segment encoding, so it sure must be looked up via
> > psize, not apsize.
> 
> as per ISA doc, for base page size 4K, RB[56:58] must be set to
> SLB[L|LP] encoded for the page size corresponding to the actual page
> size specified by the PTE that was used to create the the TLB entry to
> be invalidated.

Ok, I see.  Wow, our architecture is even more convoluted than I
thought.  This could really do with a comment, because this is a very
surprising aspect of the architecture.

> >
> >>  		asm volatile(ASM_FTR_IFCLR("tlbie %0,0", PPC_TLBIE(%1,%0), %2)
> >>  			     : : "r" (va), "r"(0), "i" (CPU_FTR_ARCH_206)
> >>  			     : "memory");
> >> @@ -69,9 +72,19 @@ static inline void __tlbie(unsigned long vpn, int psize, int apsize, int ssize)
> >>  	default:
> >>  		/* We need 14 to 14 + i bits of va */
> >>  		penc = mmu_psize_defs[psize].penc[apsize];
> >> -		va &= ~((1ul << mmu_psize_defs[psize].shift) - 1);
> >> +		va &= ~((1ul << mmu_psize_defs[apsize].shift) - 1);
> >>  		va |= penc << 12;
> >>  		va |= ssize << 8;
> >> +		/* Add AVAL part */
> >> +		if (psize != apsize) {
> >> +			/*
> >> +			 * MPSS, 64K base page size and 16MB parge page size
> >> +			 * We don't need all the bits, but this seems to work.
> >> +			 * vpn cover upto 65 bits of va. (0...65) and we need
> >> +			 * 58..64 bits of va.
> >
> > "seems to work" is not a comment I like to see in core MMU code...
> >
> 
> As per ISA spec, the "other bits" in RB[56:62] must be ignored by the
> processor. Hence I didn't bother to do zero it out. Since we only
> support one MPSS combination, we could easily zero out using 0xf0. 

Then update the comment to clearly explain why what you're doing is
correct, not just say it "seems to work".

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [PATCH -V5 18/25] powerpc/THP: Double the PMD table size for THP
From: David Gibson @ 2013-04-11  6:18 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, linux-mm
In-Reply-To: <1365055083-31956-19-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 840 bytes --]

On Thu, Apr 04, 2013 at 11:27:56AM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> THP code does PTE page allocation along with large page request and deposit them
> for later use. This is to ensure that we won't have any failures when we split
> hugepages to regular pages.
> 
> On powerpc we want to use the deposited PTE page for storing hash pte slot and
> secondary bit information for the HPTEs. We use the second half
> of the pmd table to save the deposted PTE page.

The previous patch accesses data in that second half of the PMD table,
so this patch should go before it.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [PATCH -V5 13/25] powerpc: Update tlbie/tlbiel as per ISA doc
From: Aneesh Kumar K.V @ 2013-04-11  6:36 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, paulus, linux-mm
In-Reply-To: <20130411061630.GG8165@truffula.fritz.box>

David Gibson <dwg@au1.ibm.com> writes:

> On Thu, Apr 11, 2013 at 10:50:12AM +0530, Aneesh Kumar K.V wrote:
>> David Gibson <dwg@au1.ibm.com> writes:
>> 
>> > On Thu, Apr 04, 2013 at 11:27:51AM +0530, Aneesh Kumar K.V wrote:
>> >> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>> >> 
>> >> This make sure we handle multiple page size segment correctly.
>> >
>> > This needs a much more detailed message.  In what way was the existing
>> > code not matching the ISA documentation?  What consequences did that
>> > have?
>> 
>> Mostly to make sure we use the right penc values in tlbie. I did test
>> these changes on PowerNV. 
>
> A vague description like this is not adequate.  Your commit message
> needs to explain what was wrong with the existing behaviour.
>
>> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>> >> ---
>> >>  arch/powerpc/mm/hash_native_64.c |   30 ++++++++++++++++++++++++++++--
>> >>  1 file changed, 28 insertions(+), 2 deletions(-)
>> >> 
>> >> diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
>> >> index b461b2d..ac84fa6 100644
>> >> --- a/arch/powerpc/mm/hash_native_64.c
>> >> +++ b/arch/powerpc/mm/hash_native_64.c
>> >> @@ -61,7 +61,10 @@ static inline void __tlbie(unsigned long vpn, int psize, int apsize, int ssize)
>> >>  
>> >>  	switch (psize) {
>> >>  	case MMU_PAGE_4K:
>> >> +		/* clear out bits after (52) [0....52.....63] */
>> >> +		va &= ~((1ul << (64 - 52)) - 1);
>> >>  		va |= ssize << 8;
>> >> +		va |= mmu_psize_defs[apsize].sllp << 6;
>> >
>> > sllp is the per-segment encoding, so it sure must be looked up via
>> > psize, not apsize.
>> 
>> as per ISA doc, for base page size 4K, RB[56:58] must be set to
>> SLB[L|LP] encoded for the page size corresponding to the actual page
>> size specified by the PTE that was used to create the the TLB entry to
>> be invalidated.
>
> Ok, I see.  Wow, our architecture is even more convoluted than I
> thought.  This could really do with a comment, because this is a very
> surprising aspect of the architecture.
>
>> >
>> >>  		asm volatile(ASM_FTR_IFCLR("tlbie %0,0", PPC_TLBIE(%1,%0), %2)
>> >>  			     : : "r" (va), "r"(0), "i" (CPU_FTR_ARCH_206)
>> >>  			     : "memory");
>> >> @@ -69,9 +72,19 @@ static inline void __tlbie(unsigned long vpn, int psize, int apsize, int ssize)
>> >>  	default:
>> >>  		/* We need 14 to 14 + i bits of va */
>> >>  		penc = mmu_psize_defs[psize].penc[apsize];
>> >> -		va &= ~((1ul << mmu_psize_defs[psize].shift) - 1);
>> >> +		va &= ~((1ul << mmu_psize_defs[apsize].shift) - 1);
>> >>  		va |= penc << 12;
>> >>  		va |= ssize << 8;
>> >> +		/* Add AVAL part */
>> >> +		if (psize != apsize) {
>> >> +			/*
>> >> +			 * MPSS, 64K base page size and 16MB parge page size
>> >> +			 * We don't need all the bits, but this seems to work.
>> >> +			 * vpn cover upto 65 bits of va. (0...65) and we need
>> >> +			 * 58..64 bits of va.
>> >
>> > "seems to work" is not a comment I like to see in core MMU code...
>> >
>> 
>> As per ISA spec, the "other bits" in RB[56:62] must be ignored by the
>> processor. Hence I didn't bother to do zero it out. Since we only
>> support one MPSS combination, we could easily zero out using 0xf0. 
>
> Then update the comment to clearly explain why what you're doing is
> correct, not just say it "seems to work".

Will do.

-aneesh

^ permalink raw reply

* Re: [PATCH 9/9] powerpc: cpufreq: move cpufreq driver to drivers/cpufreq
From: Olof Johansson @ 2013-04-11  6:59 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: robin.randhawa, linux-pm, Liviu.Dudau, linux-kernel, cpufreq,
	Rafael J. Wysocki, Steve.Bannister, Paul Mackerras, Arnd Bergmann,
	arvind.chauhan, linuxppc-dev, linaro-kernel, charles.garcia-tobin
In-Reply-To: <CAKohpo=EPbGocbNaZcNtGuZy4Tst5jH4rPviQ6L=YuZceDnKMw@mail.gmail.com>

On Thu, Apr 04, 2013 at 11:55:56AM +0530, Viresh Kumar wrote:
> On 3 April 2013 16:00, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > On Wed, 2013-04-03 at 15:00 +0530, Viresh Kumar wrote:
> >> On 31 March 2013 09:33, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> 
> >> > Benjamin/Paul/Olof,
> >> >
> >> > Any comments on this?
> >>
> >> Ping!!
> >
> > I'm on vacation until end of April. No objection to the patch but
> > somebody needs to test it.
> 
> Hi,
> 
> Can somebody else from powerpc world give it a try?
> 
> OR
> 
> @Rafael: Can we get this pushed in linux-next as is and then people would
> be forced to test it and in case there are any complains, i will fix them or
> you can revert it?


That's not very nice this late in the staging cycle. Give the powerpc
guys a little more time than a single day to come back with test results,
please.


-Olof

^ permalink raw reply

* Re: [PATCH 9/9] powerpc: cpufreq: move cpufreq driver to drivers/cpufreq
From: Viresh Kumar @ 2013-04-11  7:02 UTC (permalink / raw)
  To: Olof Johansson
  Cc: robin.randhawa, linux-pm, Liviu.Dudau, linux-kernel, cpufreq,
	Rafael J. Wysocki, Steve.Bannister, Paul Mackerras, Arnd Bergmann,
	arvind.chauhan, linuxppc-dev, linaro-kernel, charles.garcia-tobin
In-Reply-To: <20130411065957.GB31465@quad.lixom.net>

On 11 April 2013 12:29, Olof Johansson <olof@lixom.net> wrote:
> That's not very nice this late in the staging cycle. Give the powerpc
> guys a little more time than a single day to come back with test results,
> please.

Yes we are waiting indefinitely for an Ack from powerpc guys.. If we can
get a Ack soon, we will get it in 3.10 otherwise next cycle.

^ permalink raw reply

* Re: [PATCH -V5 04/25] powerpc: Reduce the PTE_INDEX_SIZE
From: David Gibson @ 2013-04-11  7:10 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, linux-mm
In-Reply-To: <1365055083-31956-5-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 2564 bytes --]

On Thu, Apr 04, 2013 at 11:27:42AM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> This make one PMD cover 16MB range. That helps in easier implementation of THP
> on power. THP core code make use of one pmd entry to track the hugepage and
> the range mapped by a single pmd entry should be equal to the hugepage size
> supported by the hardware.
> 
> Acked-by: Paul Mackerras <paulus@samba.org>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/pgtable-ppc64-64k.h |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pgtable-ppc64-64k.h b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
> index be4e287..3c529b4 100644
> --- a/arch/powerpc/include/asm/pgtable-ppc64-64k.h
> +++ b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
> @@ -4,10 +4,10 @@
>  #include <asm-generic/pgtable-nopud.h>
>  
>  
> -#define PTE_INDEX_SIZE  12
> +#define PTE_INDEX_SIZE  8
>  #define PMD_INDEX_SIZE  12
>  #define PUD_INDEX_SIZE	0
> -#define PGD_INDEX_SIZE  6
> +#define PGD_INDEX_SIZE  10
>  
>  #ifndef __ASSEMBLY__
>  #define PTE_TABLE_SIZE	(sizeof(real_pte_t) << PTE_INDEX_SIZE)

Actually, I've realised there's a much more serious problem here.
This patch as is will break existing hugpage support.  With the
previous numbers we had pagetable levels covering 256M and 1TB.  That
meant that at whichever level we split off a hugepd, it would line up
with the slice/segment boundaries.  Now it won't, and that means that
(explicitly) mapping hugepages and normal pages with correctly
constructed alignments will lead to the normal page fault paths
attempting to walk down hugepds or vice versa which will cause
crashes.

In fact.. with the new boundaries, we will attempt to put explicit 16M
hugepages in a hugepd of 4096 entries covering a total of 64G.  Which
means any attempt to use explicit hugepages in a 32-bit process will
blow up horribly.

The obvious solution is to make explicit hugepages also use your new
hugepage encoding, as a PMD entry pointing directly to the page data.
That's also a good idea, to avoid yet more variants on the pagetable
encoding.  But this conversion of the explicit hugepage code really
needs to be done before attempting to implement THP.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [PATCH -V5 17/25] powerpc/THP: Implement transparent hugepages for ppc64
From: Aneesh Kumar K.V @ 2013-04-11  7:40 UTC (permalink / raw)
  To: David Gibson; +Cc: paulus, linuxppc-dev, linux-mm
In-Reply-To: <20130411053823.GE8165@truffula.fritz.box>

David Gibson <dwg@au1.ibm.com> writes:

> On Thu, Apr 04, 2013 at 11:27:55AM +0530, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>> 
>> We now have pmd entries covering to 16MB range. To implement THP on powerpc,
>> we double the size of PMD. The second half is used to deposit the pgtable (PTE page).
>> We also use the depoisted PTE page for tracking the HPTE information. The information
>> include [ secondary group | 3 bit hidx | valid ]. We use one byte per each HPTE entry.
>> With 16MB hugepage and 64K HPTE we need 256 entries and with 4K HPTE we need
>> 4096 entries. Both will fit in a 4K PTE page.
>> 
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/include/asm/page.h              |    2 +-
>>  arch/powerpc/include/asm/pgtable-ppc64-64k.h |    3 +-
>>  arch/powerpc/include/asm/pgtable-ppc64.h     |    2 +-
>>  arch/powerpc/include/asm/pgtable.h           |  240 ++++++++++++++++++++
>>  arch/powerpc/mm/pgtable.c                    |  314 ++++++++++++++++++++++++++
>>  arch/powerpc/mm/pgtable_64.c                 |   13 ++
>>  arch/powerpc/platforms/Kconfig.cputype       |    1 +
>>  7 files changed, 572 insertions(+), 3 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
>> index 38e7ff6..b927447 100644
>> --- a/arch/powerpc/include/asm/page.h
>> +++ b/arch/powerpc/include/asm/page.h
>> @@ -40,7 +40,7 @@
>>  #ifdef CONFIG_HUGETLB_PAGE
>>  extern unsigned int HPAGE_SHIFT;
>>  #else
>> -#define HPAGE_SHIFT PAGE_SHIFT
>> +#define HPAGE_SHIFT PMD_SHIFT
>
> That looks like it could break everything except the 64k page size
> 64-bit base.

How about 

diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index b927447..fadb1ce 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -37,10 +37,19 @@
 #define PAGE_SIZE		(ASM_CONST(1) << PAGE_SHIFT)
 
 #ifndef __ASSEMBLY__
-#ifdef CONFIG_HUGETLB_PAGE
+/*
+ * With hugetlbfs enabled we allow the HPAGE_SHIFT to run time
+ * configurable. But we enable THP only with 16MB hugepage.
+ * With only THP configured, we force hugepage size to 16MB.
+ * This should ensure that all subarchs that doesn't support
+ * THP continue to work fine with HPAGE_SHIFT usage.
+ */
+#if defined(CONFIG_HUGETLB_PAGE)
 extern unsigned int HPAGE_SHIFT;
-#else
+#elif defined(CONFIG_TRANSPARENT_HUGEPAGE)
 #define HPAGE_SHIFT PMD_SHIFT
+#else
+#define HPAGE_SHIFT PAGE_SHIFT
 #endif
 #define HPAGE_SIZE		((1UL) << HPAGE_SHIFT)
 #define HPAGE_MASK		(~(HPAGE_SIZE - 1))


>
>>  #endif
>>  #define HPAGE_SIZE		((1UL) << HPAGE_SHIFT)
>>  #define HPAGE_MASK		(~(HPAGE_SIZE - 1))
>> diff --git a/arch/powerpc/include/asm/pgtable-ppc64-64k.h b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
>> index 3c529b4..5c5541a 100644
>> --- a/arch/powerpc/include/asm/pgtable-ppc64-64k.h
>> +++ b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
>> @@ -33,7 +33,8 @@
>>  #define PGDIR_MASK	(~(PGDIR_SIZE-1))
>>  
>>  /* Bits to mask out from a PMD to get to the PTE page */
>> -#define PMD_MASKED_BITS		0x1ff
>> +/* PMDs point to PTE table fragments which are 4K aligned.  */
>> +#define PMD_MASKED_BITS		0xfff
>>  /* Bits to mask out from a PGD/PUD to get to the PMD page */
>>  #define PUD_MASKED_BITS		0x1ff
>>  
>> diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
>> index 0182c20..c0747c7 100644
>> --- a/arch/powerpc/include/asm/pgtable-ppc64.h
>> +++ b/arch/powerpc/include/asm/pgtable-ppc64.h
>> @@ -150,7 +150,7 @@
>>  #define	pmd_present(pmd)	(pmd_val(pmd) != 0)
>>  #define	pmd_clear(pmdp)		(pmd_val(*(pmdp)) = 0)
>>  #define pmd_page_vaddr(pmd)	(pmd_val(pmd) & ~PMD_MASKED_BITS)
>> -#define pmd_page(pmd)		virt_to_page(pmd_page_vaddr(pmd))
>> +extern struct page *pmd_page(pmd_t pmd);
>
> Does unconditionally changing pmd_page() from a macro to an external
> function have a noticeable performance impact?

I did measure performance impact with THP enabled. Didn't do a micro
benchmark to measure impact of this change. Any suggestion what test
results you would like to see ?

>
>>  #define pud_set(pudp, pudval)	(pud_val(*(pudp)) = (pudval))
>>  #define pud_none(pud)		(!pud_val(pud))
>> diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
>> index 4b52726..9fbe2a7 100644
>> --- a/arch/powerpc/include/asm/pgtable.h
>> +++ b/arch/powerpc/include/asm/pgtable.h
>> @@ -23,7 +23,247 @@ struct mm_struct;
>>   */
>>  #define PTE_PAGE_HIDX_OFFSET (PTRS_PER_PTE * 8)
>>  
>> +/* A large part matches with pte bits */
>> +#define PMD_HUGE_PRESENT	0x001 /* software: pte contains a translation */
>> +#define PMD_HUGE_USER		0x002 /* matches one of the PP bits */
>> +#define PMD_HUGE_FILE		0x002 /* (!present only) software: pte holds file offset */
>
> Can we actually get hugepage PMDs that are in this state?

Currently we can't, but we would start supporting THP for page cache later.

>
>> +#define PMD_HUGE_EXEC		0x004 /* No execute on POWER4 and newer (we invert) */
>> +#define PMD_HUGE_SPLITTING	0x008
>> +#define PMD_HUGE_SAO		0x010 /* strong Access order */
>> +#define PMD_HUGE_HASHPTE	0x020
>> +#define PMD_ISHUGE		0x040
>> +#define PMD_HUGE_DIRTY		0x080 /* C: page changed */
>> +#define PMD_HUGE_ACCESSED	0x100 /* R: page referenced */
>> +#define PMD_HUGE_RW		0x200 /* software: user write access allowed */
>> +#define PMD_HUGE_BUSY		0x800 /* software: PTE & hash are busy */
>> +#define PMD_HUGE_HPTEFLAGS	(PMD_HUGE_BUSY | PMD_HUGE_HASHPTE)
>> +/*
>> + * We keep both the pmd and pte rpn shift same, eventhough we use only
>> + * lower 12 bits for hugepage flags at pmd level
>
> Why?
>

I was trying to keep PTE level access code to as much similar to huge
PMD level code. hence retained the same RPN SHIFT, since that would work.

>> + */
>> +#define PMD_HUGE_RPN_SHIFT	PTE_RPN_SHIFT
>> +#define HUGE_PAGE_SIZE		(ASM_CONST(1) << 24)
>> +#define HUGE_PAGE_MASK		(~(HUGE_PAGE_SIZE - 1))
>> +

.....

>> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
>> index 214130a..9f33780 100644
>> --- a/arch/powerpc/mm/pgtable.c
>> +++ b/arch/powerpc/mm/pgtable.c
>> @@ -31,6 +31,7 @@
>>  #include <asm/pgalloc.h>
>>  #include <asm/tlbflush.h>
>>  #include <asm/tlb.h>
>> +#include <asm/machdep.h>
>>  
>>  #include "mmu_decl.h"
>>  
>> @@ -240,3 +241,316 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
>>  }
>>  #endif /* CONFIG_DEBUG_VM */
>>  
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +static pmd_t set_hugepage_access_flags_filter(pmd_t pmd,
>> +					      struct vm_area_struct *vma,
>> +					      int dirty)
>> +{
>> +	return pmd;
>> +}
>
> I don't really see why you're splitting out these trivial ...filter()
> functions, rather than just doing it inline in the (single) caller.


No specific reason other than keeping the hugepmd related code similar
to PTE related access code. This should enable us to easily enable THP
support for subarchs.

>
>> +
>> +/*
>> + * This is called when relaxing access to a hugepage. It's also called in the page
>> + * fault path when we don't hit any of the major fault cases, ie, a minor
>> + * update of _PAGE_ACCESSED, _PAGE_DIRTY, etc... The generic code will have
>> + * handled those two for us, we additionally deal with missing execute
>> + * permission here on some processors
>> + */
>> +int pmdp_set_access_flags(struct vm_area_struct *vma, unsigned long address,
>> +			  pmd_t *pmdp, pmd_t entry, int dirty)
>> +{
>> +	int changed;
>> +	entry = set_hugepage_access_flags_filter(entry, vma, dirty);
>> +	changed = !pmd_same(*(pmdp), entry);
>> +	if (changed) {
>> +		__pmdp_set_access_flags(pmdp, entry);
>> +		/*
>> +		 * Since we are not supporting SW TLB systems, we don't
>> +		 * have any thing similar to flush_tlb_page_nohash()
>> +		 */
>> +	}
>> +	return changed;
>> +}
>> +
>> +int pmdp_test_and_clear_young(struct vm_area_struct *vma,
>> +			      unsigned long address, pmd_t *pmdp)
>> +{
>> +	return __pmdp_test_and_clear_young(vma->vm_mm, address, pmdp);
>> +}
>> +
>> +/*
>> + * We currently remove entries from the hashtable regardless of whether
>> + * the entry was young or dirty. The generic routines only flush if the
>> + * entry was young or dirty which is not good enough.
>> + *
>> + * We should be more intelligent about this but for the moment we override
>> + * these functions and force a tlb flush unconditionally
>> + */
>> +int pmdp_clear_flush_young(struct vm_area_struct *vma,
>> +				  unsigned long address, pmd_t *pmdp)
>> +{
>> +	return __pmdp_test_and_clear_young(vma->vm_mm, address, pmdp);
>> +}
>> +
>> +/*
>> + * We mark the pmd splitting and invalidate all the hpte
>> + * entries for this hugepage.
>> + */
>> +void pmdp_splitting_flush(struct vm_area_struct *vma,
>> +			  unsigned long address, pmd_t *pmdp)
>> +{
>> +	unsigned long old, tmp;
>> +
>> +	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>> +#ifdef PTE_ATOMIC_UPDATES
>> +
>> +	__asm__ __volatile__(
>> +	"1:	ldarx	%0,0,%3\n\
>> +		andi.	%1,%0,%6\n\
>> +		bne-	1b \n\
>> +		ori	%1,%0,%4 \n\
>> +		stdcx.	%1,0,%3 \n\
>> +		bne-	1b"
>> +	: "=&r" (old), "=&r" (tmp), "=m" (*pmdp)
>> +	: "r" (pmdp), "i" (PMD_HUGE_SPLITTING), "m" (*pmdp), "i" (PMD_HUGE_BUSY)
>> +	: "cc" );
>> +#else
>> +	old = pmd_val(*pmdp);
>> +	*pmdp = __pmd(old | PMD_HUGE_SPLITTING);
>> +#endif
>> +	/*
>> +	 * If we didn't had the splitting flag set, go and flush the
>> +	 * HPTE entries and serialize against gup fast.
>> +	 */
>> +	if (!(old & PMD_HUGE_SPLITTING)) {
>> +#ifdef CONFIG_PPC_STD_MMU_64
>> +		/* We need to flush the hpte */
>> +		if (old & PMD_HUGE_HASHPTE)
>> +			hpte_need_hugepage_flush(vma->vm_mm, address, pmdp);
>> +#endif
>> +		/* need tlb flush only to serialize against gup-fast */
>> +		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>> +	}
>> +}
>> +
>> +/*
>> + * We want to put the pgtable in pmd and use pgtable for tracking
>> + * the base page size hptes
>> + */
>> +void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
>> +				pgtable_t pgtable)
>> +{
>> +	unsigned long *pgtable_slot;
>> +	assert_spin_locked(&mm->page_table_lock);
>> +	/*
>> +	 * we store the pgtable in the second half of PMD
>> +	 */
>> +	pgtable_slot = pmdp + PTRS_PER_PMD;
>> +	*pgtable_slot = (unsigned long)pgtable;
>> +}
>> +
>> +#define PTE_FRAG_SIZE (2 * PTRS_PER_PTE * sizeof(pte_t))
>
> Another example of why this define should be moved to a header.

will drop. 

>
>> +pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>> +{
>> +	pgtable_t pgtable;
>> +	unsigned long *pgtable_slot;
>> +
>> +	assert_spin_locked(&mm->page_table_lock);
>> +	pgtable_slot = pmdp + PTRS_PER_PMD;
>> +	pgtable = (pgtable_t) *pgtable_slot;
>> +	/*
>> +	 * We store HPTE information in the deposited PTE fragment.
>> +	 * zero out the content on withdraw.
>> +	 */
>> +	memset(pgtable, 0, PTE_FRAG_SIZE);
>> +	return pgtable;
>> +}
>> +
>> +/*
>> + * Since we are looking at latest ppc64, we don't need to worry about
>> + * i/d cache coherency on exec fault
>> + */
>> +static pmd_t set_pmd_filter(pmd_t pmd, unsigned long addr)
>> +{
>> +	pmd = __pmd(pmd_val(pmd) & ~PMD_HUGE_HPTEFLAGS);
>> +	return pmd;
>> +}
>> +
>> +/*
>> + * We can make it less convoluted than __set_pte_at, because
>> + * we can ignore lot of hardware here, because this is only for
>> + * MPSS
>> + */
>> +static inline void __set_pmd_at(struct mm_struct *mm, unsigned long addr,
>> +				pmd_t *pmdp, pmd_t pmd, int percpu)
>> +{
>> +	/*
>> +	 * There is nothing in hash page table now, so nothing to
>> +	 * invalidate, set_pte_at is used for adding new entry.
>> +	 * For updating we should use update_hugepage_pmd()
>> +	 */
>> +	*pmdp = pmd;
>> +}
>> +
>> +/*
>> + * set a new huge pmd. We should not be called for updating
>> + * an existing pmd entry. That should go via pmd_hugepage_update.
>> + */
>> +void set_pmd_at(struct mm_struct *mm, unsigned long addr,
>> +		pmd_t *pmdp, pmd_t pmd)
>> +{
>> +	/*
>> +	 * Note: mm->context.id might not yet have been assigned as
>> +	 * this context might not have been activated yet when this
>> +	 * is called.
>> +	 */
>> +	pmd = set_pmd_filter(pmd, addr);
>> +
>> +	__set_pmd_at(mm, addr, pmdp, pmd, 0);
>> +
>> +}
>> +
>> +void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>> +		     pmd_t *pmdp)
>> +{
>> +	pmd_hugepage_update(vma->vm_mm, address, pmdp, PMD_HUGE_PRESENT);
>> +	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>> +}
>> +
>> +/*
>> + * A linux hugepage PMD was changed and the corresponding hash table entry
>> + * neesd to be flushed.
>> + *
>> + * The linux hugepage PMD now include the pmd entries followed by the address
>> + * to the stashed pgtable_t. The stashed pgtable_t contains the hpte bits.
>> + * [ secondary group | 3 bit hidx | valid ]. We use one byte per each HPTE entry.
>> + * With 16MB hugepage and 64K HPTE we need 256 entries and with 4K HPTE we need
>> + * 4096 entries. Both will fit in a 4K pgtable_t.
>> + */
>> +void hpte_need_hugepage_flush(struct mm_struct *mm, unsigned long addr,
>> +			      pmd_t *pmdp)
>> +{
>> +	int ssize, i;
>> +	unsigned long s_addr;
>> +	unsigned int psize, valid;
>> +	unsigned char *hpte_slot_array;
>> +	unsigned long hidx, vpn, vsid, hash, shift, slot;
>> +
>> +	/*
>> +	 * Flush all the hptes mapping this hugepage
>> +	 */
>> +	s_addr = addr & HUGE_PAGE_MASK;
>> +	/*
>> +	 * The hpte hindex are stored in the pgtable whose address is in the
>> +	 * second half of the PMD
>> +	 */
>> +	hpte_slot_array = *(char **)(pmdp + PTRS_PER_PMD);
>> +
>> +	/* get the base page size */
>> +	psize = get_slice_psize(mm, s_addr);
>> +	shift = mmu_psize_defs[psize].shift;
>> +
>> +	for (i = 0; i < HUGE_PAGE_SIZE/(1ul << shift); i++) {
>
> HUGE_PAGE_SIZE >> shift would be a simpler way to do this calculation.

Wonder how I missed that


>
>> +		/*
>> +		 * 8 bits per each hpte entries
>> +		 * 000| [ secondary group (one bit) | hidx (3 bits) | valid bit]
>> +		 */
>> +		valid = hpte_slot_array[i] & 0x1;
>> +		if (!valid)
>> +			continue;
>> +		hidx =  hpte_slot_array[i]  >> 1;
>> +
>> +		/* get the vpn */
>> +		addr = s_addr + (i * (1ul << shift));
>> +		if (!is_kernel_addr(addr)) {
>> +			ssize = user_segment_size(addr);
>> +			vsid = get_vsid(mm->context.id, addr, ssize);
>> +			WARN_ON(vsid == 0);
>> +		} else {
>> +			vsid = get_kernel_vsid(addr, mmu_kernel_ssize);
>> +			ssize = mmu_kernel_ssize;
>> +		}
>> +
>> +		vpn = hpt_vpn(addr, vsid, ssize);
>> +		hash = hpt_hash(vpn, shift, ssize);
>> +		if (hidx & _PTEIDX_SECONDARY)
>> +			hash = ~hash;
>> +
>> +		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
>> +		slot += hidx & _PTEIDX_GROUP_IX;
>> +		ppc_md.hpte_invalidate(slot, vpn, psize, ssize, 0);
>> +	}
>> +}
>> +
>> +static pmd_t pmd_set_protbits(pmd_t pmd, pgprot_t pgprot)
>> +{
>> +	unsigned long pmd_prot = 0;
>> +	unsigned long prot = pgprot_val(pgprot);
>> +
>> +	if (prot & _PAGE_PRESENT)
>> +		pmd_prot |= PMD_HUGE_PRESENT;
>> +	if (prot & _PAGE_USER)
>> +		pmd_prot |= PMD_HUGE_USER;
>> +	if (prot & _PAGE_FILE)
>> +		pmd_prot |= PMD_HUGE_FILE;
>> +	if (prot & _PAGE_EXEC)
>> +		pmd_prot |= PMD_HUGE_EXEC;
>> +	/*
>> +	 * _PAGE_COHERENT should always be set
>> +	 */
>> +	VM_BUG_ON(!(prot & _PAGE_COHERENT));
>> +
>> +	if (prot & _PAGE_SAO)
>> +		pmd_prot |= PMD_HUGE_SAO;
>
> This looks dubious because _PAGE_SAO is not a single bit.  What
> happens if WRITETHRU or NO_CACHE is set without the other?

yes that should be 
    if ((prot & _PAGE_SAO) == _PAGE_SAO )


>
>> +	if (prot & _PAGE_DIRTY)
>> +		pmd_prot |= PMD_HUGE_DIRTY;
>> +	if (prot & _PAGE_ACCESSED)
>> +		pmd_prot |= PMD_HUGE_ACCESSED;
>> +	if (prot & _PAGE_RW)
>> +		pmd_prot |= PMD_HUGE_RW;
>> +
>> +	pmd_val(pmd) |= pmd_prot;
>> +	return pmd;
>> +}
>> +
>> +pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot)
>> +{
>> +	pmd_t pmd;
>> +
>> +	pmd_val(pmd) = pfn << PMD_HUGE_RPN_SHIFT;
>> +	pmd_val(pmd) |= PMD_ISHUGE;
>> +	pmd = pmd_set_protbits(pmd, pgprot);
>> +	return pmd;
>> +}
>> +
>> +pmd_t mk_pmd(struct page *page, pgprot_t pgprot)
>> +{
>> +	return pfn_pmd(page_to_pfn(page), pgprot);
>> +}
>> +
>> +pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
>> +{
>> +	/* FIXME!! why are this bits cleared ? */
>
> You really need to answer this question...

will check. 

>
>> +	pmd_val(pmd) &= ~(PMD_HUGE_PRESENT |
>> +			  PMD_HUGE_RW |
>> +			  PMD_HUGE_EXEC);
>> +	pmd = pmd_set_protbits(pmd, newprot);
>> +	return pmd;
>> +}
>> +
>> +/*
>> + * This is called at the end of handling a user page fault, when the
>> + * fault has been handled by updating a HUGE PMD entry in the linux page tables.
>> + * We use it to preload an HPTE into the hash table corresponding to
>> + * the updated linux HUGE PMD entry.
>> + */
>> +void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
>> +			  pmd_t *pmd)
>> +{
>> +	/* FIXME!!
>> +	 * Will be done in a later patch
>> +	 */
>
> If you need another patch to make the code in this patch work, they
> should probably be folded together.
>

I have that as TODO, we can do a hash_preload for hugepage here. But I
don't see we doing that for HugeTLB. So I haven't yet done that for
hugepage. Do you know why we don't do hash_preload for HugeTLB page ?


>> +}
>> +
>> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
>> index e79840b..6fc3488 100644
>> --- a/arch/powerpc/mm/pgtable_64.c
>> +++ b/arch/powerpc/mm/pgtable_64.c
>> @@ -338,6 +338,19 @@ EXPORT_SYMBOL(iounmap);
>>  EXPORT_SYMBOL(__iounmap);
>>  EXPORT_SYMBOL(__iounmap_at);
>>  
>> +/*
>> + * For hugepage we have pfn in the pmd, we use PMD_HUGE_RPN_SHIFT bits for flags
>> + * For PTE page, we have a PTE_FRAG_SIZE (4K) aligned virtual address.
>> + */
>> +struct page *pmd_page(pmd_t pmd)
>> +{
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +	if (pmd_val(pmd) & PMD_ISHUGE)
>> +		return pfn_to_page(pmd_pfn(pmd));
>> +#endif
>> +	return virt_to_page(pmd_page_vaddr(pmd));
>> +}
>> +
>>  #ifdef CONFIG_PPC_64K_PAGES
>>  /*
>>   * we support 16 fragments per PTE page. This is limited by how many
>> diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
>> index 72afd28..90ee19b 100644
>> --- a/arch/powerpc/platforms/Kconfig.cputype
>> +++ b/arch/powerpc/platforms/Kconfig.cputype
>> @@ -71,6 +71,7 @@ config PPC_BOOK3S_64
>>  	select PPC_FPU
>>  	select PPC_HAVE_PMU_SUPPORT
>>  	select SYS_SUPPORTS_HUGETLBFS
>> +	select HAVE_ARCH_TRANSPARENT_HUGEPAGE if PPC_64K_PAGES
>>  
>>  config PPC_BOOK3E_64
>>  	bool "Embedded processors"

-aneesh

^ permalink raw reply related

* RE: [PATCH V5] powerpc/85xx: Add machine check handler to fix PCIe erratum on mpc85xx
From: Jia Hongtao-B38951 @ 2013-04-11  9:14 UTC (permalink / raw)
  To: Wood Scott-B07421; +Cc: linuxppc-dev@lists.ozlabs.org, Li Yang-R58472
In-Reply-To: <1365630703.8381.23@snotra>



> -----Original Message-----
> From: Wood Scott-B07421
> Sent: Thursday, April 11, 2013 5:52 AM
> To: Jia Hongtao-B38951
> Cc: linuxppc-dev@lists.ozlabs.org; galak@kernel.crashing.org; Wood Scott-
> B07421; Li Yang-R58472; Jia Hongtao-B38951
> Subject: Re: [PATCH V5] powerpc/85xx: Add machine check handler to fix
> PCIe erratum on mpc85xx
>=20
> On 04/08/2013 03:26:54 AM, Jia Hongtao wrote:
> > @@ -826,6 +829,124 @@ u64 fsl_pci_immrbar_base(struct pci_controller
> > *hose)
> >  	return 0;
> >  }
> >
> > +#ifdef CONFIG_E500
> > +
> > +#define OP_LWZ  32
> > +#define OP_LWZU 33
> > +#define OP_LBZ  34
> > +#define OP_LBZU 35
> > +#define OP_LHZ  40
> > +#define OP_LHZU 41
> > +#define OP_LHA  42
> > +#define OP_LHAU 43
>=20
> Can you move these to asm/ppc-opcode.h (or possibly
> asm/ppc-disassemble.h if we don't want to mix the two methods of
> describing instructions)?

Yes, mix the two methods is not appropriate.
asm/ppc-disassemble.h is a nice choice.

>=20
> > +static int mcheck_handle_load(struct pt_regs *regs, u32 inst)
> > +{
> > +	unsigned int rd, ra, d;
> > +
> > +	rd =3D get_rt(inst);
> > +	ra =3D get_ra(inst);
> > +	d =3D get_d(inst);
> > +
> > +	switch (get_op(inst)) {
> > +	case OP_LWZ:
> > +		regs->gpr[rd] =3D 0xffffffff;
> > +		break;
> > +
> > +	case OP_LWZU:
> > +		regs->gpr[rd] =3D 0xffffffff;
> > +		regs->gpr[ra] +=3D (s16)d;
> > +		break;
> > +
> > +	case OP_LBZ:
> > +		regs->gpr[rd] =3D 0xff;
> > +		break;
> > +
> > +	case OP_LBZU:
> > +		regs->gpr[rd] =3D 0xff;
> > +		regs->gpr[ra] +=3D (s16)d;
> > +		break;
> > +
> > +	case OP_LHZ:
> > +		regs->gpr[rd] =3D 0xffff;
> > +		break;
> > +
> > +	case OP_LHZU:
> > +		regs->gpr[rd] =3D 0xffff;
> > +		regs->gpr[ra] +=3D (s16)d;
> > +		break;
> > +
> > +	case OP_LHA:
> > +		regs->gpr[rd] =3D 0xffff;
> > +		break;
> > +
> > +	case OP_LHAU:
> > +		regs->gpr[rd] =3D 0xffff;
> > +		regs->gpr[ra] +=3D (s16)d;
> > +		break;
>=20
> The X and (especially for PCI) BRX versions are important -- probably
> more so than the U versions.  I doubt we need the A variant.

Then I will add X and BRX variant and remove A variant.

>=20
> If you do support the A variant, why are you not sign-extending the
> value?

Just curious, sign-extending the value means fill rd with 0xffffffff?

>=20
> Is this erratum present on any 64-bit chips?

No. This erratum only happened in e500 core chips.

>=20
> -Scott

Thanks.
-Hongtao

^ permalink raw reply

* [PATCH V6] powerpc/85xx: Add machine check handler to fix PCIe erratum on mpc85xx
From: Jia Hongtao @ 2013-04-11  8:36 UTC (permalink / raw)
  To: linuxppc-dev, galak; +Cc: B07421, hongtao.jia

A PCIe erratum of mpc85xx may causes a core hang when a link of PCIe
goes down. when the link goes down, Non-posted transactions issued
via the ATMU requiring completion result in an instruction stall.
At the same time a machine-check exception is generated to the core
to allow further processing by the handler. We implements the handler
which skips the instruction caused the stall.

This patch depends on patch:
powerpc/85xx: Add platform_device declaration to fsl_pci.h

Signed-off-by: Zhao Chenhui <b35336@freescale.com>
Signed-off-by: Li Yang <leoli@freescale.com>
Signed-off-by: Liu Shuo <soniccat.liu@gmail.com>
Signed-off-by: Jia Hongtao <hongtao.jia@freescale.com>
---
V5:
* Move OP and XOP defines to a new header file: asm/ppc-disassemble.h
* Add X UX BRX variant of load instruction emulation
* Remove A variant of load instruction emulation

V4:
* Fill rd with all-Fs if the skipped instruction is load and emulate the
  instruction.
* Let KVM/QEMU deal with the exception if the machine check comes from KVM.

 arch/powerpc/include/asm/ppc-disassemble.h |  31 +++++++
 arch/powerpc/kernel/cpu_setup_fsl_booke.S  |   2 +-
 arch/powerpc/kernel/traps.c                |   3 +
 arch/powerpc/sysdev/fsl_pci.c              | 140 +++++++++++++++++++++++++++++
 arch/powerpc/sysdev/fsl_pci.h              |   6 ++
 5 files changed, 181 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/include/asm/ppc-disassemble.h

diff --git a/arch/powerpc/include/asm/ppc-disassemble.h b/arch/powerpc/include/asm/ppc-disassemble.h
new file mode 100644
index 0000000..f9782b8
--- /dev/null
+++ b/arch/powerpc/include/asm/ppc-disassemble.h
@@ -0,0 +1,31 @@
+/*
+ * Copyright 2012-2013 Freescale Semiconductor, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * provides opcode and xopcode images for use by emulating
+ * instructions
+ */
+#ifndef _ASM_POWERPC_PPC_DISASSEMBLE_H
+#define _ASM_POWERPC_PPC_DISASSEMBLE_H
+
+#define OP_LWZ  32
+#define OP_LWZU 33
+#define OP_LBZ  34
+#define OP_LBZU 35
+#define OP_LHZ  40
+#define OP_LHZU 41
+
+#define OP_31_XOP_LWZX      23
+#define OP_31_XOP_LWZUX     55
+#define OP_31_XOP_LBZX      87
+#define OP_31_XOP_LBZUX     119
+#define OP_31_XOP_LHZX      279
+#define OP_31_XOP_LHZUX     311
+#define OP_31_XOP_LWBRX     534
+#define OP_31_XOP_LHBRX     790
+
+#endif
diff --git a/arch/powerpc/kernel/cpu_setup_fsl_booke.S b/arch/powerpc/kernel/cpu_setup_fsl_booke.S
index dcd8819..f1bde90 100644
--- a/arch/powerpc/kernel/cpu_setup_fsl_booke.S
+++ b/arch/powerpc/kernel/cpu_setup_fsl_booke.S
@@ -66,7 +66,7 @@ _GLOBAL(__setup_cpu_e500v2)
 	bl	__e500_icache_setup
 	bl	__e500_dcache_setup
 	bl	__setup_e500_ivors
-#ifdef CONFIG_FSL_RIO
+#if defined(CONFIG_FSL_RIO) || defined(CONFIG_FSL_PCI)
 	/* Ensure that RFXE is set */
 	mfspr	r3,SPRN_HID1
 	oris	r3,r3,HID1_RFXE@h
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index a008cf5..dd275a4 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -59,6 +59,7 @@
 #include <asm/fadump.h>
 #include <asm/switch_to.h>
 #include <asm/debug.h>
+#include <sysdev/fsl_pci.h>
 
 #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
 int (*__debugger)(struct pt_regs *regs) __read_mostly;
@@ -556,6 +557,8 @@ int machine_check_e500(struct pt_regs *regs)
 	if (reason & MCSR_BUS_RBERR) {
 		if (fsl_rio_mcheck_exception(regs))
 			return 1;
+		if (fsl_pci_mcheck_exception(regs))
+			return 1;
 	}
 
 	printk("Machine check in kernel mode.\n");
diff --git a/arch/powerpc/sysdev/fsl_pci.c b/arch/powerpc/sysdev/fsl_pci.c
index 682084d..aaa54c5 100644
--- a/arch/powerpc/sysdev/fsl_pci.c
+++ b/arch/powerpc/sysdev/fsl_pci.c
@@ -26,11 +26,15 @@
 #include <linux/memblock.h>
 #include <linux/log2.h>
 #include <linux/slab.h>
+#include <linux/uaccess.h>
 
 #include <asm/io.h>
 #include <asm/prom.h>
 #include <asm/pci-bridge.h>
+#include <asm/ppc-pci.h>
 #include <asm/machdep.h>
+#include <asm/disassemble.h>
+#include <asm/ppc-disassemble.h>
 #include <sysdev/fsl_soc.h>
 #include <sysdev/fsl_pci.h>
 
@@ -826,6 +830,142 @@ u64 fsl_pci_immrbar_base(struct pci_controller *hose)
 	return 0;
 }
 
+#ifdef CONFIG_E500
+static int mcheck_handle_load(struct pt_regs *regs, u32 inst)
+{
+	unsigned int rd, ra, rb, d;
+
+	rd = get_rt(inst);
+	ra = get_ra(inst);
+	rb = get_rb(inst);
+	d = get_d(inst);
+
+	switch (get_op(inst)) {
+	case 31:
+		switch (get_xop(inst)) {
+		case OP_31_XOP_LWZX:
+		case OP_31_XOP_LWBRX:
+			regs->gpr[rd] = 0xffffffff;
+			break;
+
+		case OP_31_XOP_LWZUX:
+			regs->gpr[rd] = 0xffffffff;
+			regs->gpr[ra] += regs->gpr[rb];
+			break;
+
+		case OP_31_XOP_LBZX:
+			regs->gpr[rd] = 0xff;
+			break;
+
+		case OP_31_XOP_LBZUX:
+			regs->gpr[rd] = 0xff;
+			regs->gpr[ra] += regs->gpr[rb];
+			break;
+
+		case OP_31_XOP_LHZX:
+		case OP_31_XOP_LHBRX:
+			regs->gpr[rd] = 0xffff;
+			break;
+
+		case OP_31_XOP_LHZUX:
+			regs->gpr[rd] = 0xffff;
+			regs->gpr[ra] += regs->gpr[rb];
+			break;
+
+		default:
+			return 0;
+		}
+		break;
+
+	case OP_LWZ:
+		regs->gpr[rd] = 0xffffffff;
+		break;
+
+	case OP_LWZU:
+		regs->gpr[rd] = 0xffffffff;
+		regs->gpr[ra] += (s16)d;
+		break;
+
+	case OP_LBZ:
+		regs->gpr[rd] = 0xff;
+		break;
+
+	case OP_LBZU:
+		regs->gpr[rd] = 0xff;
+		regs->gpr[ra] += (s16)d;
+		break;
+
+	case OP_LHZ:
+		regs->gpr[rd] = 0xffff;
+		break;
+
+	case OP_LHZU:
+		regs->gpr[rd] = 0xffff;
+		regs->gpr[ra] += (s16)d;
+		break;
+
+	default:
+		return 0;
+	}
+
+	return 1;
+}
+
+static int is_in_pci_mem_space(phys_addr_t addr)
+{
+	struct pci_controller *hose;
+	struct resource *res;
+	int i;
+
+	list_for_each_entry(hose, &hose_list, list_node) {
+		if (!early_find_capability(hose, 0, 0, PCI_CAP_ID_EXP))
+			continue;
+
+		for (i = 0; i < 3; i++) {
+			res = &hose->mem_resources[i];
+			if ((res->flags & IORESOURCE_MEM) &&
+				addr >= res->start && addr <= res->end)
+				return 1;
+		}
+	}
+	return 0;
+}
+
+int fsl_pci_mcheck_exception(struct pt_regs *regs)
+{
+	u32 inst;
+	int ret;
+	phys_addr_t addr = 0;
+
+	/* Let KVM/QEMU deal with the exception */
+	if (regs->msr & MSR_GS)
+		return 0;
+
+#ifdef CONFIG_PHYS_64BIT
+	addr = mfspr(SPRN_MCARU);
+	addr <<= 32;
+#endif
+	addr += mfspr(SPRN_MCAR);
+
+	if (is_in_pci_mem_space(addr)) {
+		if (user_mode(regs)) {
+			pagefault_disable();
+			ret = get_user(regs->nip, &inst);
+			pagefault_enable();
+		} else {
+			ret = probe_kernel_address(regs->nip, inst);
+		}
+
+		if (mcheck_handle_load(regs, inst)) {
+			regs->nip += 4;
+			return 1;
+		}
+	}
+
+	return 0;
+}
+#endif
+
 #if defined(CONFIG_FSL_SOC_BOOKE) || defined(CONFIG_PPC_86xx)
 static const struct of_device_id pci_ids[] = {
 	{ .compatible = "fsl,mpc8540-pci", },
diff --git a/arch/powerpc/sysdev/fsl_pci.h b/arch/powerpc/sysdev/fsl_pci.h
index 851dd56..b0d01ea 100644
--- a/arch/powerpc/sysdev/fsl_pci.h
+++ b/arch/powerpc/sysdev/fsl_pci.h
@@ -115,5 +115,11 @@ static inline int mpc85xx_pci_err_probe(struct platform_device *op)
 }
 #endif
 
+#ifdef CONFIG_FSL_PCI
+extern int fsl_pci_mcheck_exception(struct pt_regs *);
+#else
+static inline int fsl_pci_mcheck_exception(struct pt_regs *regs) {return 0; }
+#endif
+
 #endif /* __POWERPC_FSL_PCI_H */
 #endif /* __KERNEL__ */
-- 
1.8.0

^ permalink raw reply related

* [PATCH 0/2] net: mv643xx_eth: use managed clk and allocation
From: Sebastian Hesselbarth @ 2013-04-11  9:29 UTC (permalink / raw)
  To: Sebastian Hesselbarth
  Cc: Andrew Lunn, Jason Cooper, Sergei Shtylyov, linux-doc,
	devicetree-discuss, linux-kernel, Rob Herring, Paul Mackerras,
	Rob Landley, netdev, linuxppc-dev, Florian Fainelli,
	Lennert Buytenhek

With introduction of common clock framework and the ability to provide
gated clocks, mv643xx_eth required calls to get and enable these clock
gates on some platforms. Back then, common clock framework api wasn't
safe for architectures without support for common clocks. This has
changed now and there are also managed (devm_) counterparts for clock
related functions.

The second patch in this series, also converts kzalloc to devm_kzalloc
where applicable.

Both patches have been sent to the corresponding mailing lists as
individual patches before. To get the order required to apply them right,
this patch set combines both patches into one set.

Sebastian Hesselbarth (2):
  net: mv643xx_eth: add shared clk and cleanup existing clk handling
  net: mv643xx_eth: use managed devm_kzalloc

 Documentation/devicetree/bindings/marvell.txt |    3 ++
 drivers/net/ethernet/marvell/mv643xx_eth.c    |   44 +++++++++----------------
 2 files changed, 18 insertions(+), 29 deletions(-)

---
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Rob Landley <rob@landley.net>
Cc: Lennert Buytenhek <buytenh@wantstofly.org>
Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Florian Fainelli <florian@openwrt.org>
Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: devicetree-discuss@lists.ozlabs.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
-- 
1.7.10.4

^ permalink raw reply

* [PATCH 1/2] net: mv643xx_eth: add shared clk and cleanup existing clk handling
From: Sebastian Hesselbarth @ 2013-04-11  9:29 UTC (permalink / raw)
  To: Sebastian Hesselbarth
  Cc: Andrew Lunn, Jason Cooper, Sergei Shtylyov, linux-doc,
	devicetree-discuss, linux-kernel, Rob Herring, Paul Mackerras,
	Rob Landley, netdev, linuxppc-dev, Florian Fainelli,
	Lennert Buytenhek
In-Reply-To: <1365672574-31123-1-git-send-email-sebastian.hesselbarth@gmail.com>

This patch adds an optional shared block clock to avoid lockups on
clock gated controllers. Besides the new clock, clock handling for
existing clocks is cleaned up and moved to devm_clk_get. Device
tree binding documentation is updated for the new clocks property.

Signed-off-by: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
---
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Rob Landley <rob@landley.net>
Cc: Lennert Buytenhek <buytenh@wantstofly.org>
Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Florian Fainelli <florian@openwrt.org>
Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: devicetree-discuss@lists.ozlabs.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
---
 Documentation/devicetree/bindings/marvell.txt |    3 +++
 drivers/net/ethernet/marvell/mv643xx_eth.c    |   27 ++++++++++---------------
 2 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/Documentation/devicetree/bindings/marvell.txt b/Documentation/devicetree/bindings/marvell.txt
index f1533d9..f7a0da6 100644
--- a/Documentation/devicetree/bindings/marvell.txt
+++ b/Documentation/devicetree/bindings/marvell.txt
@@ -115,6 +115,9 @@ prefixed with the string "marvell,", for Marvell Technology Group Ltd.
      - compatible : "marvell,mv64360-eth-block"
      - reg : Offset and length of the register set for this block
 
+   Optional properties:
+     - clocks : Phandle to the clock control device and gate bit
+
    Example Discovery Ethernet block node:
      ethernet-block@2000 {
 	     #address-cells = <1>;
diff --git a/drivers/net/ethernet/marvell/mv643xx_eth.c b/drivers/net/ethernet/marvell/mv643xx_eth.c
index aedbd82..bbe6104 100644
--- a/drivers/net/ethernet/marvell/mv643xx_eth.c
+++ b/drivers/net/ethernet/marvell/mv643xx_eth.c
@@ -268,7 +268,7 @@ struct mv643xx_eth_shared_private {
 	int extended_rx_coal_limit;
 	int tx_bw_control;
 	int tx_csum_limit;
-
+	struct clk *clk;
 };
 
 #define TX_BW_CONTROL_ABSENT		0
@@ -410,9 +410,7 @@ struct mv643xx_eth_private {
 	/*
 	 * Hardware-specific parameters.
 	 */
-#if defined(CONFIG_HAVE_CLK)
 	struct clk *clk;
-#endif
 	unsigned int t_clk;
 };
 
@@ -2569,6 +2567,10 @@ static int mv643xx_eth_shared_probe(struct platform_device *pdev)
 	if (msp->base == NULL)
 		goto out_free;
 
+	msp->clk = devm_clk_get(&pdev->dev, NULL);
+	if (!IS_ERR(msp->clk))
+		clk_prepare_enable(msp->clk);
+
 	/*
 	 * (Re-)program MBUS remapping windows if we are asked to.
 	 */
@@ -2595,6 +2597,8 @@ static int mv643xx_eth_shared_remove(struct platform_device *pdev)
 	struct mv643xx_eth_shared_private *msp = platform_get_drvdata(pdev);
 
 	iounmap(msp->base);
+	if (!IS_ERR(msp->clk))
+		clk_disable_unprepare(msp->clk);
 	kfree(msp);
 
 	return 0;
@@ -2801,13 +2805,12 @@ static int mv643xx_eth_probe(struct platform_device *pdev)
 	 * it to override the default.
 	 */
 	mp->t_clk = 133000000;
-#if defined(CONFIG_HAVE_CLK)
-	mp->clk = clk_get(&pdev->dev, (pdev->id ? "1" : "0"));
+	mp->clk = devm_clk_get(&pdev->dev, NULL);
 	if (!IS_ERR(mp->clk)) {
 		clk_prepare_enable(mp->clk);
 		mp->t_clk = clk_get_rate(mp->clk);
 	}
-#endif
+
 	set_params(mp, pd);
 	netif_set_real_num_tx_queues(dev, mp->txq_count);
 	netif_set_real_num_rx_queues(dev, mp->rxq_count);
@@ -2889,12 +2892,8 @@ static int mv643xx_eth_probe(struct platform_device *pdev)
 	return 0;
 
 out:
-#if defined(CONFIG_HAVE_CLK)
-	if (!IS_ERR(mp->clk)) {
+	if (!IS_ERR(mp->clk))
 		clk_disable_unprepare(mp->clk);
-		clk_put(mp->clk);
-	}
-#endif
 	free_netdev(dev);
 
 	return err;
@@ -2909,12 +2908,8 @@ static int mv643xx_eth_remove(struct platform_device *pdev)
 		phy_detach(mp->phy);
 	cancel_work_sync(&mp->tx_timeout_task);
 
-#if defined(CONFIG_HAVE_CLK)
-	if (!IS_ERR(mp->clk)) {
+	if (!IS_ERR(mp->clk))
 		clk_disable_unprepare(mp->clk);
-		clk_put(mp->clk);
-	}
-#endif
 
 	free_netdev(mp->dev);
 
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH v2 2/2] net: mv643xx_eth: use managed devm_kzalloc
From: Sebastian Hesselbarth @ 2013-04-11  9:29 UTC (permalink / raw)
  To: Sebastian Hesselbarth
  Cc: Andrew Lunn, Jason Cooper, Sergei Shtylyov, linux-doc,
	devicetree-discuss, linux-kernel, Rob Herring, Paul Mackerras,
	Rob Landley, netdev, linuxppc-dev, Florian Fainelli,
	Lennert Buytenhek
In-Reply-To: <1365672574-31123-1-git-send-email-sebastian.hesselbarth@gmail.com>

This patch moves shared private data kzalloc to managed devm_kzalloc and
cleans now unneccessary kfree and error handling.

Signed-off-by: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
---
Changes from v1:
- replaced EADDRNOTAVAIL with ENOMEM on failing ioremap (Reported by
  Sergei Shtylyov)

Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Rob Landley <rob@landley.net>
Cc: Lennert Buytenhek <buytenh@wantstofly.org>
Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Florian Fainelli <florian@openwrt.org>
Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: devicetree-discuss@lists.ozlabs.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
---
 drivers/net/ethernet/marvell/mv643xx_eth.c |   17 ++++-------------
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mv643xx_eth.c b/drivers/net/ethernet/marvell/mv643xx_eth.c
index bbe6104..305038f 100644
--- a/drivers/net/ethernet/marvell/mv643xx_eth.c
+++ b/drivers/net/ethernet/marvell/mv643xx_eth.c
@@ -2547,25 +2547,22 @@ static int mv643xx_eth_shared_probe(struct platform_device *pdev)
 	struct mv643xx_eth_shared_private *msp;
 	const struct mbus_dram_target_info *dram;
 	struct resource *res;
-	int ret;
 
 	if (!mv643xx_eth_version_printed++)
 		pr_notice("MV-643xx 10/100/1000 ethernet driver version %s\n",
 			  mv643xx_eth_driver_version);
 
-	ret = -EINVAL;
 	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
 	if (res == NULL)
-		goto out;
+		return -EINVAL;
 
-	ret = -ENOMEM;
-	msp = kzalloc(sizeof(*msp), GFP_KERNEL);
+	msp = devm_kzalloc(&pdev->dev, sizeof(*msp), GFP_KERNEL);
 	if (msp == NULL)
-		goto out;
+		return -ENOMEM;
 
 	msp->base = ioremap(res->start, resource_size(res));
 	if (msp->base == NULL)
-		goto out_free;
+		return -ENOMEM;
 
 	msp->clk = devm_clk_get(&pdev->dev, NULL);
 	if (!IS_ERR(msp->clk))
@@ -2585,11 +2582,6 @@ static int mv643xx_eth_shared_probe(struct platform_device *pdev)
 	platform_set_drvdata(pdev, msp);
 
 	return 0;
-
-out_free:
-	kfree(msp);
-out:
-	return ret;
 }
 
 static int mv643xx_eth_shared_remove(struct platform_device *pdev)
@@ -2599,7 +2591,6 @@ static int mv643xx_eth_shared_remove(struct platform_device *pdev)
 	iounmap(msp->base);
 	if (!IS_ERR(msp->clk))
 		clk_disable_unprepare(msp->clk);
-	kfree(msp);
 
 	return 0;
 }
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH 0/2] net: mv643xx_eth: use managed clk and allocation
From: Florian Fainelli @ 2013-04-11  9:32 UTC (permalink / raw)
  To: Sebastian Hesselbarth
  Cc: Andrew Lunn, Jason Cooper, Sergei Shtylyov, linux-doc,
	devicetree-discuss, linux-kernel, Rob Herring, Paul Mackerras,
	Rob Landley, netdev, linuxppc-dev, Lennert Buytenhek
In-Reply-To: <1365672574-31123-1-git-send-email-sebastian.hesselbarth@gmail.com>

Le 04/11/13 11:29, Sebastian Hesselbarth a écrit :
> With introduction of common clock framework and the ability to provide
> gated clocks, mv643xx_eth required calls to get and enable these clock
> gates on some platforms. Back then, common clock framework api wasn't
> safe for architectures without support for common clocks. This has
> changed now and there are also managed (devm_) counterparts for clock
> related functions.
>
> The second patch in this series, also converts kzalloc to devm_kzalloc
> where applicable.
>
> Both patches have been sent to the corresponding mailing lists as
> individual patches before. To get the order required to apply them right,
> this patch set combines both patches into one set.
>
> Sebastian Hesselbarth (2):
>    net: mv643xx_eth: add shared clk and cleanup existing clk handling
>    net: mv643xx_eth: use managed devm_kzalloc

Looks good, thanks Sebastian!

Acked-by: Florian Fainelli <florian@openwrt.org>
--
Florian

^ permalink raw reply

* [PATCH 2/8 v3] KVM: PPC: e500: Expose MMU registers via ONE_REG
From: Mihai Caraman @ 2013-04-11 10:03 UTC (permalink / raw)
  To: kvm-ppc; +Cc: Mihai Caraman, linuxppc-dev, kvm
In-Reply-To: <1365674594-17410-1-git-send-email-mihai.caraman@freescale.com>

MMU registers were exposed to user-space using sregs interface. Add them
to ONE_REG interface using kvmppc_get_one_reg/kvmppc_set_one_reg delegation
mechanism.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
---
v3:
 - Fix case breaks
 
v2:
 - Restrict set_one_reg operation for MMU registers to HW values

 Documentation/virtual/kvm/api.txt   |   11 ++++
 arch/powerpc/include/uapi/asm/kvm.h |   17 ++++++
 arch/powerpc/kvm/e500.c             |    6 ++-
 arch/powerpc/kvm/e500.h             |    4 ++
 arch/powerpc/kvm/e500_mmu.c         |   94 +++++++++++++++++++++++++++++++++++
 arch/powerpc/kvm/e500mc.c           |    6 ++-
 6 files changed, 134 insertions(+), 4 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 976eb65..1a76663 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1792,6 +1792,17 @@ registers, find a list below:
   PPC   | KVM_REG_PPC_TSR	| 32
   PPC   | KVM_REG_PPC_OR_TSR	| 32
   PPC   | KVM_REG_PPC_CLEAR_TSR	| 32
+  PPC   | KVM_REG_PPC_MAS0	| 32
+  PPC   | KVM_REG_PPC_MAS1	| 32
+  PPC   | KVM_REG_PPC_MAS2	| 64
+  PPC   | KVM_REG_PPC_MAS7_3	| 64
+  PPC   | KVM_REG_PPC_MAS4	| 32
+  PPC   | KVM_REG_PPC_MAS6	| 32
+  PPC   | KVM_REG_PPC_MMUCFG	| 32
+  PPC   | KVM_REG_PPC_TLB0CFG	| 32
+  PPC   | KVM_REG_PPC_TLB1CFG	| 32
+  PPC   | KVM_REG_PPC_TLB2CFG	| 32
+  PPC   | KVM_REG_PPC_TLB3CFG	| 32
 
 ARM registers are mapped using the lower 32 bits.  The upper 16 of that
 is the register group type, or coprocessor number:
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index c2ff99c..93d063f 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -426,4 +426,21 @@ struct kvm_get_htab_header {
 /* Debugging: Special instruction for software breakpoint */
 #define KVM_REG_PPC_DEBUG_INST	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x8b)
 
+/* MMU registers */
+#define KVM_REG_PPC_MAS0	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x8c)
+#define KVM_REG_PPC_MAS1	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x8d)
+#define KVM_REG_PPC_MAS2	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x8e)
+#define KVM_REG_PPC_MAS7_3	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x8f)
+#define KVM_REG_PPC_MAS4	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x90)
+#define KVM_REG_PPC_MAS6	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x91)
+#define KVM_REG_PPC_MMUCFG	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x92)
+/*
+ * TLBnCFG fields TLBnCFG_N_ENTRY and TLBnCFG_ASSOC can be changed only using
+ * KVM_CAP_SW_TLB ioctl
+ */
+#define KVM_REG_PPC_TLB0CFG	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x93)
+#define KVM_REG_PPC_TLB1CFG	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x94)
+#define KVM_REG_PPC_TLB2CFG	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x95)
+#define KVM_REG_PPC_TLB3CFG	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x96)
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/e500.c b/arch/powerpc/kvm/e500.c
index 576010f..ce6b73c 100644
--- a/arch/powerpc/kvm/e500.c
+++ b/arch/powerpc/kvm/e500.c
@@ -428,13 +428,15 @@ int kvmppc_core_set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id,
 			union kvmppc_one_reg *val)
 {
-	return -EINVAL;
+	int r = kvmppc_get_one_reg_e500_tlb(vcpu, id, val);
+	return r;
 }
 
 int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
 		       union kvmppc_one_reg *val)
 {
-	return -EINVAL;
+	int r = kvmppc_get_one_reg_e500_tlb(vcpu, id, val);
+	return r;
 }
 
 struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index 33db48a..b73ca7a 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -131,6 +131,10 @@ void kvmppc_e500_tlb_uninit(struct kvmppc_vcpu_e500 *vcpu_e500);
 void kvmppc_get_sregs_e500_tlb(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs);
 int kvmppc_set_sregs_e500_tlb(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs);
 
+int kvmppc_get_one_reg_e500_tlb(struct kvm_vcpu *vcpu, u64 id,
+				union kvmppc_one_reg *val);
+int kvmppc_set_one_reg_e500_tlb(struct kvm_vcpu *vcpu, u64 id,
+			       union kvmppc_one_reg *val);
 
 #ifdef CONFIG_KVM_E500V2
 unsigned int kvmppc_e500_get_sid(struct kvmppc_vcpu_e500 *vcpu_e500,
diff --git a/arch/powerpc/kvm/e500_mmu.c b/arch/powerpc/kvm/e500_mmu.c
index 5c44759..44f7762 100644
--- a/arch/powerpc/kvm/e500_mmu.c
+++ b/arch/powerpc/kvm/e500_mmu.c
@@ -596,6 +596,100 @@ int kvmppc_set_sregs_e500_tlb(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 	return 0;
 }
 
+int kvmppc_get_one_reg_e500_tlb(struct kvm_vcpu *vcpu, u64 id,
+				union kvmppc_one_reg *val)
+{
+	int r = 0;
+	long int i;
+
+	switch (id) {
+	case KVM_REG_PPC_MAS0:
+		*val = get_reg_val(id, vcpu->arch.shared->mas0);
+		break;
+	case KVM_REG_PPC_MAS1:
+		*val = get_reg_val(id, vcpu->arch.shared->mas1);
+		break;
+	case KVM_REG_PPC_MAS2:
+		*val = get_reg_val(id, vcpu->arch.shared->mas2);
+		break;
+	case KVM_REG_PPC_MAS7_3:
+		*val = get_reg_val(id, vcpu->arch.shared->mas7_3);
+		break;
+	case KVM_REG_PPC_MAS4:
+		*val = get_reg_val(id, vcpu->arch.shared->mas4);
+		break;
+	case KVM_REG_PPC_MAS6:
+		*val = get_reg_val(id, vcpu->arch.shared->mas6);
+		break;
+	case KVM_REG_PPC_MMUCFG:
+		*val = get_reg_val(id, vcpu->arch.mmucfg);
+		break;
+	case KVM_REG_PPC_TLB0CFG:
+	case KVM_REG_PPC_TLB1CFG:
+	case KVM_REG_PPC_TLB2CFG:
+	case KVM_REG_PPC_TLB3CFG:
+		i = id - KVM_REG_PPC_TLB0CFG;
+		*val = get_reg_val(id, vcpu->arch.tlbcfg[i]);
+		break;
+	default:
+		r = -EINVAL;
+		break;
+	}
+
+	return r;
+}
+
+int kvmppc_set_one_reg_e500_tlb(struct kvm_vcpu *vcpu, u64 id,
+			       union kvmppc_one_reg *val)
+{
+	int r = 0;
+	long int i;
+
+	switch (id) {
+	case KVM_REG_PPC_MAS0:
+		vcpu->arch.shared->mas0 = set_reg_val(id, *val);
+		break;
+	case KVM_REG_PPC_MAS1:
+		vcpu->arch.shared->mas1 = set_reg_val(id, *val);
+		break;
+	case KVM_REG_PPC_MAS2:
+		vcpu->arch.shared->mas2 = set_reg_val(id, *val);
+		break;
+	case KVM_REG_PPC_MAS7_3:
+		vcpu->arch.shared->mas7_3 = set_reg_val(id, *val);
+		break;
+	case KVM_REG_PPC_MAS4:
+		vcpu->arch.shared->mas4 = set_reg_val(id, *val);
+		break;
+	case KVM_REG_PPC_MAS6:
+		vcpu->arch.shared->mas6 = set_reg_val(id, *val);
+		break;
+	/* Only allow MMU registers to be set to the config supported by KVM */
+	case KVM_REG_PPC_MMUCFG: {
+		u32 reg = set_reg_val(id, *val);
+		if (reg != vcpu->arch.mmucfg)
+			r = -EINVAL;
+		break;
+	}
+	case KVM_REG_PPC_TLB0CFG:
+	case KVM_REG_PPC_TLB1CFG:
+	case KVM_REG_PPC_TLB2CFG:
+	case KVM_REG_PPC_TLB3CFG: {
+		/* MMU geometry (N_ENTRY/ASSOC) can be set only using SW_TLB */
+		u32 reg = set_reg_val(id, *val);
+		i = id - KVM_REG_PPC_TLB0CFG;
+		if (reg != vcpu->arch.tlbcfg[i])
+			r = -EINVAL;
+		break;
+	}
+	default:
+		r = -EINVAL;
+		break;
+	}
+
+	return r;
+}
+
 int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
 			      struct kvm_config_tlb *cfg)
 {
diff --git a/arch/powerpc/kvm/e500mc.c b/arch/powerpc/kvm/e500mc.c
index b071bdc..ab073a8 100644
--- a/arch/powerpc/kvm/e500mc.c
+++ b/arch/powerpc/kvm/e500mc.c
@@ -258,13 +258,15 @@ int kvmppc_core_set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id,
 			union kvmppc_one_reg *val)
 {
-	return -EINVAL;
+	int r = kvmppc_get_one_reg_e500_tlb(vcpu, id, val);
+	return r;
 }
 
 int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
 		       union kvmppc_one_reg *val)
 {
-	return -EINVAL;
+	int r = kvmppc_set_one_reg_e500_tlb(vcpu, id, val);
+	return r;
 }
 
 struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
-- 
1.7.4.1

^ permalink raw reply related

* [PATCH 4/8 v3] KVM: PPC: e500: Add support for TLBnPS registers
From: Mihai Caraman @ 2013-04-11 10:03 UTC (permalink / raw)
  To: kvm-ppc; +Cc: Mihai Caraman, linuxppc-dev, kvm
In-Reply-To: <1365674594-17410-1-git-send-email-mihai.caraman@freescale.com>

Add support for TLBnPS registers available in MMU Architecture Version
(MAV) 2.0.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
---
v3:
 - Add vcpu_ftr enum
 
v2:
 - Add vcpu generic function has_feature()

 Documentation/virtual/kvm/api.txt   |    4 ++++
 arch/powerpc/include/asm/kvm_host.h |    1 +
 arch/powerpc/include/uapi/asm/kvm.h |    4 ++++
 arch/powerpc/kvm/e500.h             |   18 ++++++++++++++++++
 arch/powerpc/kvm/e500_emulate.c     |   10 ++++++++++
 arch/powerpc/kvm/e500_mmu.c         |   22 ++++++++++++++++++++++
 6 files changed, 59 insertions(+), 0 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 1a76663..f045377 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1803,6 +1803,10 @@ registers, find a list below:
   PPC   | KVM_REG_PPC_TLB1CFG	| 32
   PPC   | KVM_REG_PPC_TLB2CFG	| 32
   PPC   | KVM_REG_PPC_TLB3CFG	| 32
+  PPC   | KVM_REG_PPC_TLB0PS	| 32
+  PPC   | KVM_REG_PPC_TLB1PS	| 32
+  PPC   | KVM_REG_PPC_TLB2PS	| 32
+  PPC   | KVM_REG_PPC_TLB3PS	| 32
 
 ARM registers are mapped using the lower 32 bits.  The upper 16 of that
 is the register group type, or coprocessor number:
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index e34f8fe..3b6cee3 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -502,6 +502,7 @@ struct kvm_vcpu_arch {
 	spinlock_t wdt_lock;
 	struct timer_list wdt_timer;
 	u32 tlbcfg[4];
+	u32 tlbps[4];
 	u32 mmucfg;
 	u32 epr;
 	u32 crit_save;
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 93d063f..91341d9 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -442,5 +442,9 @@ struct kvm_get_htab_header {
 #define KVM_REG_PPC_TLB1CFG	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x94)
 #define KVM_REG_PPC_TLB2CFG	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x95)
 #define KVM_REG_PPC_TLB3CFG	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x96)
+#define KVM_REG_PPC_TLB0PS	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x97)
+#define KVM_REG_PPC_TLB1PS	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x98)
+#define KVM_REG_PPC_TLB2PS	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x99)
+#define KVM_REG_PPC_TLB3PS	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9a)
 
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index b73ca7a..c2e5e98 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -23,6 +23,10 @@
 #include <asm/mmu-book3e.h>
 #include <asm/tlb.h>
 
+enum vcpu_ftr {
+	VCPU_FTR_MMU_V2
+};
+
 #define E500_PID_NUM   3
 #define E500_TLB_NUM   2
 
@@ -299,4 +303,18 @@ static inline unsigned int get_tlbmiss_tid(struct kvm_vcpu *vcpu)
 #define get_tlb_sts(gtlbe)              (MAS1_TS)
 #endif /* !BOOKE_HV */
 
+static inline bool has_feature(const struct kvm_vcpu *vcpu,
+			       enum vcpu_ftr ftr)
+{
+	bool has_ftr;
+	switch (ftr) {
+	case VCPU_FTR_MMU_V2:
+		has_ftr = ((vcpu->arch.mmucfg & MMUCFG_MAVN) == MMUCFG_MAVN_V2);
+		break;
+	default:
+		return false;
+	}
+	return has_ftr;
+}
+
 #endif /* KVM_E500_H */
diff --git a/arch/powerpc/kvm/e500_emulate.c b/arch/powerpc/kvm/e500_emulate.c
index e78f353..12b8de2 100644
--- a/arch/powerpc/kvm/e500_emulate.c
+++ b/arch/powerpc/kvm/e500_emulate.c
@@ -284,6 +284,16 @@ int kvmppc_core_emulate_mfspr(struct kvm_vcpu *vcpu, int sprn, ulong *spr_val)
 	case SPRN_TLB1CFG:
 		*spr_val = vcpu->arch.tlbcfg[1];
 		break;
+	case SPRN_TLB0PS:
+		if (!has_feature(vcpu, VCPU_FTR_MMU_V2))
+			return EMULATE_FAIL;
+		*spr_val = vcpu->arch.tlbps[0];
+		break;
+	case SPRN_TLB1PS:
+		if (!has_feature(vcpu, VCPU_FTR_MMU_V2))
+			return EMULATE_FAIL;
+		*spr_val = vcpu->arch.tlbps[1];
+		break;
 	case SPRN_L1CSR0:
 		*spr_val = vcpu_e500->l1csr0;
 		break;
diff --git a/arch/powerpc/kvm/e500_mmu.c b/arch/powerpc/kvm/e500_mmu.c
index 08a5b0d..a863dc1 100644
--- a/arch/powerpc/kvm/e500_mmu.c
+++ b/arch/powerpc/kvm/e500_mmu.c
@@ -631,6 +631,13 @@ int kvmppc_get_one_reg_e500_tlb(struct kvm_vcpu *vcpu, u64 id,
 		i = id - KVM_REG_PPC_TLB0CFG;
 		*val = get_reg_val(id, vcpu->arch.tlbcfg[i]);
 		break;
+	case KVM_REG_PPC_TLB0PS:
+	case KVM_REG_PPC_TLB1PS:
+	case KVM_REG_PPC_TLB2PS:
+	case KVM_REG_PPC_TLB3PS:
+		i = id - KVM_REG_PPC_TLB0PS;
+		*val = get_reg_val(id, vcpu->arch.tlbps[i]);
+		break;
 	default:
 		r = -EINVAL;
 		break;
@@ -682,6 +689,16 @@ int kvmppc_set_one_reg_e500_tlb(struct kvm_vcpu *vcpu, u64 id,
 			r = -EINVAL;
 		break;
 	}
+	case KVM_REG_PPC_TLB0PS:
+	case KVM_REG_PPC_TLB1PS:
+	case KVM_REG_PPC_TLB2PS:
+	case KVM_REG_PPC_TLB3PS: {
+		u32 reg = set_reg_val(id, *val);
+		i = id - KVM_REG_PPC_TLB0PS;
+		if (reg != vcpu->arch.tlbps[i])
+			r = -EINVAL;
+		break;
+	}
 	default:
 		r = -EINVAL;
 		break;
@@ -855,6 +872,11 @@ static int vcpu_mmu_init(struct kvm_vcpu *vcpu,
 	vcpu->arch.tlbcfg[1] |= params[1].entries;
 	vcpu->arch.tlbcfg[1] |= params[1].ways << TLBnCFG_ASSOC_SHIFT;
 
+	if (has_feature(vcpu, VCPU_FTR_MMU_V2)) {
+		vcpu->arch.tlbps[0] = mfspr(SPRN_TLB0PS);
+		vcpu->arch.tlbps[1] = mfspr(SPRN_TLB1PS);
+	}
+
 	return 0;
 }
 
-- 
1.7.4.1

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox