LinuxPPC-Dev Archive on lore.kernel.org


* Re: 3.10-rc ppc64 corrupts usermem when swapping
From: Aneesh Kumar K.V @ 2013-05-30 16:10 UTC
  To: Benjamin Herrenschmidt, Hugh Dickins
  Cc: linuxppc-dev, Anton Blanchard, Paul Mackerras, David Gibson
In-Reply-To: <87vc60na89.fsf@linux.vnet.ibm.com>

"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> writes:

> Benjamin Herrenschmidt <benh@au1.ibm.com> writes:
>
>> On Wed, 2013-05-29 at 22:47 -0700, Hugh Dickins wrote:
>>> Running my favourite swapping load (repeated make -j20 kernel builds
>>> in tmpfs in parallel with repeated make -j20 kernel builds in ext4 on
>>> loop on tmpfs file, all limited by mem=700M and swap 1.5G) on 3.10-rc
>>> on PowerMac G5, the test dies with corrupted usermem after a few hours.
>>> 
>>> Variously, segmentation fault or Binutils assertion fail or gcc Internal
>>> error in either or both builds: usually signs of swapping or TLB flushing
>>> gone wrong.  Sometimes the tmpfs build breaks first, sometimes the ext4 on
>>> loop on tmpfs, so at least it looks unrelated to loop.  No problem on x86.
>>> 
>>> This is 64-bit kernel but 4k pages and old SuSE 11.1 32-bit userspace.
>>> 
>>> I've just finished a manual bisection on arch/powerpc/mm (which might
>>> have been a wrong guess, but has paid off): the first bad commit is
>>> 7e74c3921ad9610c0b49f28b8fc69f7480505841
>>> "powerpc: Fix hpte_decode to use the correct decoding for page sizes".
>>
>> Ok, I have other reasons to think it is wrong. I debugged a case last week
>> where after kexec we still had stale TLB entries, due to the TLB cleanup
>> not working.
>>
>> Thanks for doing that bisection ! I'll investigate ASAP (though it will
>> probably have to wait for tomorrow unless Paul beats me to it)
>>
>>> I don't know if it's actually swapping to swap that's triggering the
>>> problem, or a more general page reclaim or TLB flush problem.  I hit
>>> it originally when trying to test Mel Gorman's pagevec series on top
>>> of 3.10-rc; and though I then reproduced it without that series, it
>>> did seem to take much longer: so I have been applying Mel's series to
>>> speed up each step of the bisection.  But if I went back again, might
>>> find it was just chance that I hit it sooner with Mel's series than
>>> without.  So, you're probably safe to ignore that detail, but I
>>> mention it just in case it turns out to have some relevance.
>>> 
>>> Something else peculiar that I've been doing in these runs, may or may
>>> not be relevant: I've been running swapon and swapoff repeatedly in the
>>> background, so that we're doing swapoff even while busy building.
>>> 
>>> I probably can't go into much more detail on the test (it's hard
>>> to get the balance right, to be swapping rather than OOMing or just
>>> running without reclaim), but can test any patches you'd like me to
>>> try (though it may take 24 hours for me to report back usefully).
>>
>> I think it's just failing to invalidate the TLB properly. At least one
>> bug I can spot just looking at it:
>>
>> static void native_hpte_invalidate(unsigned long slot, unsigned long vpn,
>> 				   int psize, int ssize, int local)
>>
>>    .../...
>>
>> 	native_lock_hpte(hptep);
>> 	hpte_v = hptep->v;
>>
>> 	actual_psize = hpte_actual_psize(hptep, psize);
>> 	if (actual_psize < 0) {
>> 		native_unlock_hpte(hptep);
>> 		local_irq_restore(flags);
>> 		return;
>> 	}
>>
>> That's wrong. We must still perform the TLB invalidation even if the
>> hash PTE is empty.
>>
>> In fact, Aneesh, this is a problem with MPSS for your THP work, I just
>> thought about it.
>>
>> The reason is that if a hash bucket gets full, we "evict" a more/less
>> random entry from it. When we do that we don't invalidate the TLB
>> (hpte_remove) because we assume the old translation is still technically
>> "valid".
>>
>
> Hmm, that is correct, I missed that. But to do a tlb invalidate we need
> both base and actual page size. One of the reasons I didn't update the
> hpte_invalidate callback to take both the page sizes was because, PAPR
> didn't need that for invalidate (H_REMOVE). hpte_remove did result in a
> tlb invalidate there. 
>
>
>> However that means that an hpte_invalidate *must* invalidate the TLB
>> later on even if it's not hitting the right entry in the hash.
>>
>> However, I can see why that cannot work with THP/MPSS since you have no
>> way to know the page size from the PTE anymore....
>>
>> So my question is, apart from hpte_decode used by kexec, which I will
>> fix by just blowing the whole TLB when not running phyp, why do you need
>> the "actual" size in invalidate and updatepp ? You really can't rely on
>> the size passed by the upper layers ?
>
> So for upstream I have the patch below, which should address the
> above. Meanwhile I will see what the impact would be of doing a tlb
> invalidate in hpte_remove, so that we can keep both lpar and native
> changes similar.
>

How about the below ?

commit 9f70fd8cfeb7fca33ef8453197b76df46c860b52
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date:   Thu May 30 20:09:58 2013 +0530

    powerpc/mm: Always invalidate tlb on hpte invalidate and update
    
    If a hash bucket gets full, we "evict" a more/less random entry from it.
    When we do that we don't invalidate the TLB (hpte_remove) because we assume
    the old translation is still technically "valid". This implies that when
    we are invalidating or updating pte, even if HPTE entry is not valid
    we should do a tlb invalidate.
    
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 92386fc..801e3c6 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -36,13 +36,13 @@ struct machdep_calls {
 #ifdef CONFIG_PPC64
 	void            (*hpte_invalidate)(unsigned long slot,
 					   unsigned long vpn,
-					   int psize, int ssize,
-					   int local);
+					   int bpsize, int apsize,
+					   int ssize, int local);
 	long		(*hpte_updatepp)(unsigned long slot, 
 					 unsigned long newpp, 
 					 unsigned long vpn,
-					 int psize, int ssize,
-					 int local);
+					 int bpsize, int apsize,
+					 int ssize, int local);
 	void            (*hpte_updateboltedpp)(unsigned long newpp, 
 					       unsigned long ea,
 					       int psize, int ssize);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_host.c b/arch/powerpc/kvm/book3s_64_mmu_host.c
index 3a9a1ac..176d3fd 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_host.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_host.c
@@ -34,7 +34,7 @@
 void kvmppc_mmu_invalidate_pte(struct kvm_vcpu *vcpu, struct hpte_cache *pte)
 {
 	ppc_md.hpte_invalidate(pte->slot, pte->host_vpn,
-			       MMU_PAGE_4K, MMU_SEGSIZE_256M,
+			       MMU_PAGE_4K, MMU_PAGE_4K, MMU_SEGSIZE_256M,
 			       false);
 }
 
diff --git a/arch/powerpc/mm/hash_low_64.S b/arch/powerpc/mm/hash_low_64.S
index 0e980ac..d3cbda6 100644
--- a/arch/powerpc/mm/hash_low_64.S
+++ b/arch/powerpc/mm/hash_low_64.S
@@ -289,9 +289,10 @@ htab_modify_pte:
 
 	/* Call ppc_md.hpte_updatepp */
 	mr	r5,r29			/* vpn */
-	li	r6,MMU_PAGE_4K		/* page size */
-	ld	r7,STK_PARAM(R9)(r1)	/* segment size */
-	ld	r8,STK_PARAM(R8)(r1)	/* get "local" param */
+	li	r6,MMU_PAGE_4K		/* base page size */
+	li	r7,MMU_PAGE_4K		/* actual page size */
+	ld	r8,STK_PARAM(R9)(r1)	/* segment size */
+	ld	r9,STK_PARAM(R8)(r1)	/* get "local" param */
 _GLOBAL(htab_call_hpte_updatepp)
 	bl	.			/* Patched by htab_finish_init() */
 
@@ -649,9 +650,10 @@ htab_modify_pte:
 
 	/* Call ppc_md.hpte_updatepp */
 	mr	r5,r29			/* vpn */
-	li	r6,MMU_PAGE_4K		/* page size */
-	ld	r7,STK_PARAM(R9)(r1)	/* segment size */
-	ld	r8,STK_PARAM(R8)(r1)	/* get "local" param */
+	li	r6,MMU_PAGE_4K		/* base page size */
+	li	r7,MMU_PAGE_4K		/* actual page size */
+	ld	r8,STK_PARAM(R9)(r1)	/* segment size */
+	ld	r9,STK_PARAM(R8)(r1)	/* get "local" param */
 _GLOBAL(htab_call_hpte_updatepp)
 	bl	.			/* patched by htab_finish_init() */
 
@@ -937,9 +939,10 @@ ht64_modify_pte:
 
 	/* Call ppc_md.hpte_updatepp */
 	mr	r5,r29			/* vpn */
-	li	r6,MMU_PAGE_64K
-	ld	r7,STK_PARAM(R9)(r1)	/* segment size */
-	ld	r8,STK_PARAM(R8)(r1)	/* get "local" param */
+	li	r6,MMU_PAGE_64K		/* base page size */
+	li	r7,MMU_PAGE_64K		/* actual page size */
+	ld	r8,STK_PARAM(R9)(r1)	/* segment size */
+	ld	r9,STK_PARAM(R8)(r1)	/* get "local" param */
 _GLOBAL(ht64_call_hpte_updatepp)
 	bl	.			/* patched by htab_finish_init() */
 
diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
index 6a2aead..121c8ef 100644
--- a/arch/powerpc/mm/hash_native_64.c
+++ b/arch/powerpc/mm/hash_native_64.c
@@ -273,61 +273,15 @@ static long native_hpte_remove(unsigned long hpte_group)
 	return i;
 }
 
-static inline int __hpte_actual_psize(unsigned int lp, int psize)
-{
-	int i, shift;
-	unsigned int mask;
-
-	/* start from 1 ignoring MMU_PAGE_4K */
-	for (i = 1; i < MMU_PAGE_COUNT; i++) {
-
-		/* invalid penc */
-		if (mmu_psize_defs[psize].penc[i] == -1)
-			continue;
-		/*
-		 * encoding bits per actual page size
-		 *        PTE LP     actual page size
-		 *    rrrr rrrz		>=8KB
-		 *    rrrr rrzz		>=16KB
-		 *    rrrr rzzz		>=32KB
-		 *    rrrr zzzz		>=64KB
-		 * .......
-		 */
-		shift = mmu_psize_defs[i].shift - LP_SHIFT;
-		if (shift > LP_BITS)
-			shift = LP_BITS;
-		mask = (1 << shift) - 1;
-		if ((lp & mask) == mmu_psize_defs[psize].penc[i])
-			return i;
-	}
-	return -1;
-}
-
-static inline int hpte_actual_psize(struct hash_pte *hptep, int psize)
-{
-	/* Look at the 8 bit LP value */
-	unsigned int lp = (hptep->r >> LP_SHIFT) & ((1 << LP_BITS) - 1);
-
-	if (!(hptep->v & HPTE_V_VALID))
-		return -1;
-
-	/* First check if it is large page */
-	if (!(hptep->v & HPTE_V_LARGE))
-		return MMU_PAGE_4K;
-
-	return __hpte_actual_psize(lp, psize);
-}
-
 static long native_hpte_updatepp(unsigned long slot, unsigned long newpp,
-				 unsigned long vpn, int psize, int ssize,
-				 int local)
+				 unsigned long vpn, int bpsize,
+				 int apsize, int ssize, int local)
 {
 	struct hash_pte *hptep = htab_address + slot;
 	unsigned long hpte_v, want_v;
 	int ret = 0;
-	int actual_psize;
 
-	want_v = hpte_encode_avpn(vpn, psize, ssize);
+	want_v = hpte_encode_avpn(vpn, bpsize, ssize);
 
 	DBG_LOW("    update(vpn=%016lx, avpnv=%016lx, group=%lx, newpp=%lx)",
 		vpn, want_v & HPTE_V_AVPN, slot, newpp);
@@ -335,13 +289,14 @@ static long native_hpte_updatepp(unsigned long slot, unsigned long newpp,
 	native_lock_hpte(hptep);
 
 	hpte_v = hptep->v;
-	actual_psize = hpte_actual_psize(hptep, psize);
-	if (actual_psize < 0) {
-		native_unlock_hpte(hptep);
-		return -1;
-	}
-	/* Even if we miss, we need to invalidate the TLB */
-	if (!HPTE_V_COMPARE(hpte_v, want_v)) {
+	/*
+	 * We need to invalidate the TLB always because hpte_remove doesn't do
+	 * a tlb invalidate. If a hash bucket gets full, we "evict" a more/less
+	 * random entry from it. When we do that we don't invalidate the TLB
+	 * (hpte_remove) because we assume the old translation is still technically
+	 * "valid".
+	 */
+	if (!HPTE_V_COMPARE(hpte_v, want_v) || !(hpte_v & HPTE_V_VALID)) {
 		DBG_LOW(" -> miss\n");
 		ret = -1;
 	} else {
@@ -353,7 +308,7 @@ static long native_hpte_updatepp(unsigned long slot, unsigned long newpp,
 	native_unlock_hpte(hptep);
 
 	/* Ensure it is out of the tlb too. */
-	tlbie(vpn, psize, actual_psize, ssize, local);
+	tlbie(vpn, bpsize, apsize, ssize, local);
 
 	return ret;
 }
@@ -394,7 +349,6 @@ static long native_hpte_find(unsigned long vpn, int psize, int ssize)
 static void native_hpte_updateboltedpp(unsigned long newpp, unsigned long ea,
 				       int psize, int ssize)
 {
-	int actual_psize;
 	unsigned long vpn;
 	unsigned long vsid;
 	long slot;
@@ -407,54 +361,82 @@ static void native_hpte_updateboltedpp(unsigned long newpp, unsigned long ea,
 	if (slot == -1)
 		panic("could not find page to bolt\n");
 	hptep = htab_address + slot;
-	actual_psize = hpte_actual_psize(hptep, psize);
-	if (actual_psize < 0)
-		return;
 
 	/* Update the HPTE */
 	hptep->r = (hptep->r & ~(HPTE_R_PP | HPTE_R_N)) |
 		(newpp & (HPTE_R_PP | HPTE_R_N));
-
-	/* Ensure it is out of the tlb too. */
-	tlbie(vpn, psize, actual_psize, ssize, 0);
+	/*
+	 * Ensure it is out of the tlb too. Bolted entries base and
+	 * actual page size will be same.
+	 */
+	tlbie(vpn, psize, psize, ssize, 0);
 }
 
 static void native_hpte_invalidate(unsigned long slot, unsigned long vpn,
-				   int psize, int ssize, int local)
+				   int bpsize, int apsize, int ssize, int local)
 {
 	struct hash_pte *hptep = htab_address + slot;
 	unsigned long hpte_v;
 	unsigned long want_v;
 	unsigned long flags;
-	int actual_psize;
 
 	local_irq_save(flags);
 
 	DBG_LOW("    invalidate(vpn=%016lx, hash: %lx)\n", vpn, slot);
 
-	want_v = hpte_encode_avpn(vpn, psize, ssize);
+	want_v = hpte_encode_avpn(vpn, bpsize, ssize);
 	native_lock_hpte(hptep);
 	hpte_v = hptep->v;
 
-	actual_psize = hpte_actual_psize(hptep, psize);
-	if (actual_psize < 0) {
-		native_unlock_hpte(hptep);
-		local_irq_restore(flags);
-		return;
-	}
-	/* Even if we miss, we need to invalidate the TLB */
-	if (!HPTE_V_COMPARE(hpte_v, want_v))
+	/*
+	 * We need to invalidate the TLB always because hpte_remove doesn't do
+	 * a tlb invalidate. If a hash bucket gets full, we "evict" a more/less
+	 * random entry from it. When we do that we don't invalidate the TLB
+	 * (hpte_remove) because we assume the old translation is still technically
+	 * "valid".
+	 */
+	if (!HPTE_V_COMPARE(hpte_v, want_v) || !(hpte_v & HPTE_V_VALID))
 		native_unlock_hpte(hptep);
 	else
 		/* Invalidate the hpte. NOTE: this also unlocks it */
 		hptep->v = 0;
 
 	/* Invalidate the TLB */
-	tlbie(vpn, psize, actual_psize, ssize, local);
+	tlbie(vpn, bpsize, apsize, ssize, local);
 
 	local_irq_restore(flags);
 }
 
+static inline int __hpte_actual_psize(unsigned int lp, int psize)
+{
+	int i, shift;
+	unsigned int mask;
+
+	/* start from 1 ignoring MMU_PAGE_4K */
+	for (i = 1; i < MMU_PAGE_COUNT; i++) {
+
+		/* invalid penc */
+		if (mmu_psize_defs[psize].penc[i] == -1)
+			continue;
+		/*
+		 * encoding bits per actual page size
+		 *        PTE LP     actual page size
+		 *    rrrr rrrz		>=8KB
+		 *    rrrr rrzz		>=16KB
+		 *    rrrr rzzz		>=32KB
+		 *    rrrr zzzz		>=64KB
+		 * .......
+		 */
+		shift = mmu_psize_defs[i].shift - LP_SHIFT;
+		if (shift > LP_BITS)
+			shift = LP_BITS;
+		mask = (1 << shift) - 1;
+		if ((lp & mask) == mmu_psize_defs[psize].penc[i])
+			return i;
+	}
+	return -1;
+}
+
 static void hpte_decode(struct hash_pte *hpte, unsigned long slot,
 			int *psize, int *apsize, int *ssize, unsigned long *vpn)
 {
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index e303a6d..2f47080 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1232,7 +1232,11 @@ void flush_hash_page(unsigned long vpn, real_pte_t pte, int psize, int ssize,
 		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
 		slot += hidx & _PTEIDX_GROUP_IX;
 		DBG_LOW(" sub %ld: hash=%lx, hidx=%lx\n", index, slot, hidx);
-		ppc_md.hpte_invalidate(slot, vpn, psize, ssize, local);
+		/*
+		 * We use same base page size and actual psize, because we don't
+		 * use these functions for hugepage
+		 */
+		ppc_md.hpte_invalidate(slot, vpn, psize, psize, ssize, local);
 	} pte_iterate_hashed_end();
 
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
@@ -1365,7 +1369,8 @@ static void kernel_unmap_linear_page(unsigned long vaddr, unsigned long lmi)
 		hash = ~hash;
 	slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
 	slot += hidx & _PTEIDX_GROUP_IX;
-	ppc_md.hpte_invalidate(slot, vpn, mmu_linear_psize, mmu_kernel_ssize, 0);
+	ppc_md.hpte_invalidate(slot, vpn, mmu_linear_psize, mmu_linear_psize,
+			       mmu_kernel_ssize, 0);
 }
 
 void kernel_map_pages(struct page *page, int numpages, int enable)
diff --git a/arch/powerpc/mm/hugetlbpage-hash64.c b/arch/powerpc/mm/hugetlbpage-hash64.c
index 0f1d94a..0b7fb67 100644
--- a/arch/powerpc/mm/hugetlbpage-hash64.c
+++ b/arch/powerpc/mm/hugetlbpage-hash64.c
@@ -81,7 +81,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
 		slot += (old_pte & _PAGE_F_GIX) >> 12;
 
 		if (ppc_md.hpte_updatepp(slot, rflags, vpn, mmu_psize,
-					 ssize, local) == -1)
+					 mmu_psize, ssize, local) == -1)
 			old_pte &= ~_PAGE_HPTEFLAGS;
 	}
 
diff --git a/arch/powerpc/platforms/cell/beat_htab.c b/arch/powerpc/platforms/cell/beat_htab.c
index 246e1d8..c34ee4e 100644
--- a/arch/powerpc/platforms/cell/beat_htab.c
+++ b/arch/powerpc/platforms/cell/beat_htab.c
@@ -185,7 +185,8 @@ static void beat_lpar_hptab_clear(void)
 static long beat_lpar_hpte_updatepp(unsigned long slot,
 				    unsigned long newpp,
 				    unsigned long vpn,
-				    int psize, int ssize, int local)
+				    int psize, int apsize,
+				    int ssize, int local)
 {
 	unsigned long lpar_rc;
 	u64 dummy0, dummy1;
@@ -274,7 +275,8 @@ static void beat_lpar_hpte_updateboltedpp(unsigned long newpp,
 }
 
 static void beat_lpar_hpte_invalidate(unsigned long slot, unsigned long vpn,
-					 int psize, int ssize, int local)
+				      int psize, int apsize,
+				      int ssize, int local)
 {
 	unsigned long want_v;
 	unsigned long lpar_rc;
@@ -364,9 +366,10 @@ static long beat_lpar_hpte_insert_v3(unsigned long hpte_group,
  * already zero.  For now I am paranoid.
  */
 static long beat_lpar_hpte_updatepp_v3(unsigned long slot,
-				    unsigned long newpp,
-				    unsigned long vpn,
-				    int psize, int ssize, int local)
+				       unsigned long newpp,
+				       unsigned long vpn,
+				       int psize, int apsize,
+				       int ssize, int local)
 {
 	unsigned long lpar_rc;
 	unsigned long want_v;
@@ -394,7 +397,8 @@ static long beat_lpar_hpte_updatepp_v3(unsigned long slot,
 }
 
 static void beat_lpar_hpte_invalidate_v3(unsigned long slot, unsigned long vpn,
-					 int psize, int ssize, int local)
+					 int psize, int apsize,
+					 int ssize, int local)
 {
 	unsigned long want_v;
 	unsigned long lpar_rc;
diff --git a/arch/powerpc/platforms/ps3/htab.c b/arch/powerpc/platforms/ps3/htab.c
index 177a2f7..3e270e3 100644
--- a/arch/powerpc/platforms/ps3/htab.c
+++ b/arch/powerpc/platforms/ps3/htab.c
@@ -109,7 +109,8 @@ static long ps3_hpte_remove(unsigned long hpte_group)
 }
 
 static long ps3_hpte_updatepp(unsigned long slot, unsigned long newpp,
-	unsigned long vpn, int psize, int ssize, int local)
+			      unsigned long vpn, int psize, int apsize,
+			      int ssize, int local)
 {
 	int result;
 	u64 hpte_v, want_v, hpte_rs;
@@ -162,7 +163,7 @@ static void ps3_hpte_updateboltedpp(unsigned long newpp, unsigned long ea,
 }
 
 static void ps3_hpte_invalidate(unsigned long slot, unsigned long vpn,
-	int psize, int ssize, int local)
+				int psize, int apsize, int ssize, int local)
 {
 	unsigned long flags;
 	int result;
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 6d62072..ca45c8f 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -240,7 +240,8 @@ static void pSeries_lpar_hptab_clear(void)
 static long pSeries_lpar_hpte_updatepp(unsigned long slot,
 				       unsigned long newpp,
 				       unsigned long vpn,
-				       int psize, int ssize, int local)
+				       int psize, int apsize,
+				       int ssize, int local)
 {
 	unsigned long lpar_rc;
 	unsigned long flags = (newpp & 7) | H_AVPN;
@@ -328,7 +329,8 @@ static void pSeries_lpar_hpte_updateboltedpp(unsigned long newpp,
 }
 
 static void pSeries_lpar_hpte_invalidate(unsigned long slot, unsigned long vpn,
-					 int psize, int ssize, int local)
+					 int psize, int apsize,
+					 int ssize, int local)
 {
 	unsigned long want_v;
 	unsigned long lpar_rc;
@@ -356,8 +358,10 @@ static void pSeries_lpar_hpte_removebolted(unsigned long ea,
 
 	slot = pSeries_lpar_hpte_find(vpn, psize, ssize);
 	BUG_ON(slot == -1);
-
-	pSeries_lpar_hpte_invalidate(slot, vpn, psize, ssize, 0);
+	/*
+	 * lpar doesn't use the passed actual page size
+	 */
+	pSeries_lpar_hpte_invalidate(slot, vpn, psize, 0, ssize, 0);
 }
 
 /* Flag bits for H_BULK_REMOVE */
@@ -400,8 +404,11 @@ static void pSeries_lpar_flush_hash_range(unsigned long number, int local)
 			slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
 			slot += hidx & _PTEIDX_GROUP_IX;
 			if (!firmware_has_feature(FW_FEATURE_BULK_REMOVE)) {
+				/*
+				 * lpar doesn't use the passed actual page size
+				 */
 				pSeries_lpar_hpte_invalidate(slot, vpn, psize,
-							     ssize, local);
+							     0, ssize, local);
 			} else {
 				param[pix] = HBR_REQUEST | HBR_AVPN | slot;
 				param[pix+1] = hpte_encode_avpn(vpn, psize,


* [GIT PULL 00/66] perf/core improvements and fixes
From: Arnaldo Carvalho de Melo @ 2013-05-30 16:00 UTC
  To: Ingo Molnar
  Cc: Peter Zijlstra, Stephane Eranian, linuxppc-dev, Andi Kleen,
	Paul Mackerras, Sam Ravnborg, Rabin Vincent, Jiri Olsa,
	Xiao Guangrong, Arnaldo Carvalho de Melo, Frederic Weisbecker,
	Sukadev Bhattiprolu, Corey Ashford, Namhyung Kim, Borislav Petkov,
	Runzhen Wang, William Cohen, Arnaldo Carvalho de Melo,
	Mike Galbraith, linux-kernel, Pekka Enberg, Minchan Kim,
	David Ahern

From: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>

Hi Ingo,

	Please consider pulling,

- Arnaldo

The following changes since commit c0ffaf3655fab1909a920c8f30ba1722932d01bb:

  watchdog: Remove softlockup_thresh from Documentation (2013-05-28 11:28:20 +0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux tags/perf-core-for-mingo

for you to fetch changes up to c3c44709b5095091216c06b8df83feddc01ba6b0:

  perf tools: Add missing liblk.a dependency for python/perf.so (2013-05-30 17:36:16 +0300)

----------------------------------------------------------------
perf/core improvements and fixes:

. Reset SIGTERM handler in workload child process, fix from David Ahern.

. Handle death by SIGTERM in 'perf record', fix from David Ahern.

. Fix printing of perf_event_paranoid message, from David Ahern.

. Handle realloc failures in 'perf kvm', from David Ahern.

. Fix divide by 0 in variance, from David Ahern.

. Save parent pid in thread struct, from David Ahern.

. Handle JITed code in shared memory, from Andi Kleen.

. Makefile reorganization, prep work for Kconfig patches, from Jiri Olsa.

. Fixes for 'perf diff', from Jiri Olsa.

. Add automated make test suite, from Jiri Olsa.

. 'perf tests' fixes from Jiri Olsa.

. Remove some unused struct members, from Jiri Olsa.

. Add missing liblk.a dependency for python/perf.so, fix from Jiri Olsa.

. Respect CROSS_COMPILE in liblk.a, from Rabin Vincent.

. Expand definition of sysfs format attribute, from Michael Ellerman.

. No need to do locking when adding hists in perf report, only 'top'
  needs that, from Namhyung Kim.

. Sorting improvements, from Namhyung Kim.

. Fix alignment of symbol column in the hists browser (top, report)
  when -v is given, from Namhyung Kim.

. Add --percent-limit option to 'top' and 'report', from Namhyung Kim.

. Fix 'perf top' -E option behavior, from Namhyung Kim.

. Fix bug in isupper() and islower(), from Sukadev Bhattiprolu.

. Fix compile errors in bp_signal 'perf test', from Sukadev Bhattiprolu.

. Make Power7 CPI stack events available in sysfs, from Sukadev Bhattiprolu.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

----------------------------------------------------------------
Andi Kleen (1):
      perf tools: Handle JITed code in shared memory

Arnaldo Carvalho de Melo (3):
      perf archive: Fix typo on Documentation
      perf hists browser: Use sort__has_sym
      perf test: Fix typo

David Ahern (6):
      perf record: handle death by SIGTERM
      perf evsel: Fix printing of perf_event_paranoid message
      perf kvm: Handle realloc failures
      perf stats: Fix divide by 0 in variance
      perf  tools: Save parent pid in thread struct
      perf evlist: Reset SIGTERM handler in workload child process

Jiri Olsa (32):
      perf tools: Fix tab vs spaces issue in Makefile ifdef/endif
      perf diff: Use internal rb tree for hists__precompute
      perf hists: Rename hist_entry__add_pair arguments
      perf tools: Add automated make test suite
      perf tools: Move arch check into config/Makefile
      perf tools: Move programs check into config/Makefile
      perf tools: Move compiler and linker flags check into config/Makefile
      perf tools: Move libelf check config into config/Makefile
      perf tools: Move libdw check config into config/Makefile
      perf tools: Move libunwind check config into config/Makefile
      perf tools: Move libaudit check config into config/Makefile
      perf tools: Move slang check config into config/Makefile
      perf tools: Move gtk2 check config into config/Makefile
      perf tools: Move libperl check config into config/Makefile
      perf tools: Move libpython check config into config/Makefile
      perf tools: Move libbfd check config into config/Makefile
      perf tools: Move stdlib check config into config/Makefile
      perf tools: Move libnuma check config into config/Makefile
      perf tools: Move paths config into config/Makefile
      perf tools: Final touches for CHK config move
      perf tests: Fix attr test for record -d option
      perf tests: Fix exclude_guest|exclude_host checking for attr tests
      perf tools: Remove frozen from perf_header struct
      perf tools: Remove cwdlen from struct perf_session
      perf tools: Merge all *CFLAGS* make variable into CFLAGS
      perf tools: Merge all *LDFLAGS* make variable into LDFLAGS
      perf tools: Switch to full path C include directories
      perf tools: Add NO_BIONIC variable to confiure bionic setup
      perf tools: Replace tabs with spaces for all non-commands statements
      perf tools: Replace multiple line assignment with multiple statements
      perf tools: Remove '?=' Makefile STRIP assignment
      perf tools: Add missing liblk.a dependency for python/perf.so

Michael Ellerman (1):
      perf: Expand definition of sysfs format attribute

Namhyung Kim (18):
      perf hists: Fix an invalid memory free on he->branch_info
      perf hists: Free unused mem info of a matched hist entry
      perf report: Fix alignment of symbol column when -v is given
      perf sort: Introduce sort__mode variable
      perf sort: Factor out common code in sort_dimension__add()
      perf sort: Separate out memory-specific sort keys
      perf sort: Consolidate sort_entry__setup_elide()
      perf sort: Reorder HISTC_SRCLINE index
      perf sort: Cleanup sort__has_sym setting
      perf top: Use sort__has_sym
      perf top: Fix -E option behavior
      perf top: Fix percent output when no samples collected
      perf top: Get rid of *_threaded() functions
      perf hists: Move locking to its call-sites
      perf report: Don't bother locking when adding hist entries
      perf report: Add --percent-limit option
      perf top: Add --percent-limit option
      perf report: Add report.percent-limit config variable

Rabin Vincent (1):
      tools lib lk: Respect CROSS_COMPILE

Sukadev Bhattiprolu (4):
      perf tools: Fix bug in isupper() and islower()
      perf tests: Fix compile errors in bp_signal files
      perf: Power7: Make CPI stack events available in sysfs
      perf: Power7 Update testing ABI to list CPI-stack events

 .../testing/sysfs-bus-event_source-devices-events  |  32 +-
 .../testing/sysfs-bus-event_source-devices-format  |   6 +
 arch/powerpc/perf/power7-pmu.c                     |  73 +++
 tools/lib/lk/Makefile                              |   3 +
 tools/perf/Documentation/perf-archive.txt          |   2 +-
 tools/perf/Documentation/perf-report.txt           |   4 +
 tools/perf/Documentation/perf-top.txt              |   4 +
 tools/perf/Makefile                                | 630 ++++-----------------
 tools/perf/builtin-diff.c                          |  19 +-
 tools/perf/builtin-kvm.c                           |   3 +
 tools/perf/builtin-record.c                        |   2 +-
 tools/perf/builtin-report.c                        | 102 ++--
 tools/perf/builtin-top.c                           |  74 +--
 tools/perf/config/Makefile                         | 477 ++++++++++++++++
 tools/perf/tests/attr/base-record                  |   4 +-
 tools/perf/tests/attr/base-stat                    |   4 +-
 tools/perf/tests/attr/test-record-data             |   5 +-
 tools/perf/tests/bp_signal.c                       |   6 +
 tools/perf/tests/bp_signal_overflow.c              |   6 +
 tools/perf/tests/builtin-test.c                    |   2 +-
 tools/perf/tests/make                              | 138 +++++
 tools/perf/ui/browsers/hists.c                     | 106 +++-
 tools/perf/ui/gtk/hists.c                          |  13 +-
 tools/perf/ui/stdio/hist.c                         |   7 +-
 tools/perf/util/evlist.c                           |   2 +
 tools/perf/util/evsel.c                            |   2 +-
 tools/perf/util/header.c                           |   2 -
 tools/perf/util/header.h                           |   1 -
 tools/perf/util/hist.c                             |  96 ++--
 tools/perf/util/hist.h                             |  16 +-
 tools/perf/util/map.c                              |   1 +
 tools/perf/util/session.h                          |   1 -
 tools/perf/util/setup.py                           |   5 +-
 tools/perf/util/sort.c                             | 128 +++--
 tools/perf/util/sort.h                             |  36 +-
 tools/perf/util/stat.c                             |   2 +-
 tools/perf/util/thread.c                           |   4 +
 tools/perf/util/thread.h                           |   1 +
 tools/perf/util/top.c                              |  23 +-
 tools/perf/util/top.h                              |   2 +-
 tools/perf/util/util.h                             |   4 +-
 41 files changed, 1270 insertions(+), 778 deletions(-)
 create mode 100644 tools/perf/config/Makefile
 create mode 100644 tools/perf/tests/make


* [PATCH 56/66] perf: Power7 Update testing ABI to list CPI-stack events
From: Arnaldo Carvalho de Melo @ 2013-05-30 16:01 UTC
  To: Ingo Molnar
  Cc: linux-kernel, Arnaldo Carvalho de Melo, linuxppc-dev,
	Paul Mackerras, Sukadev Bhattiprolu
In-Reply-To: <1369929699-8724-1-git-send-email-acme@infradead.org>

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

Following patch added several Power7 events into /sys/devices/cpu/events.
Document those events in the testing ABI.

	https://lists.ozlabs.org/pipermail/linuxppc-dev/2013-April/105167.html

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@ozlabs.org
Link: http://lkml.kernel.org/r/20130406170623.GA900@us.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 .../testing/sysfs-bus-event_source-devices-events  | 32 ++++++++++++++++++----
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-events b/Documentation/ABI/testing/sysfs-bus-event_source-devices-events
index 0adeb52..8b25ffb 100644
--- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-events
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-events
@@ -27,14 +27,36 @@ Description:	Generic performance monitoring events
 		"basename".
 
 
-What: 		/sys/devices/cpu/events/PM_LD_MISS_L1
-		/sys/devices/cpu/events/PM_LD_REF_L1
-		/sys/devices/cpu/events/PM_CYC
+What: 		/sys/devices/cpu/events/PM_1PLUS_PPC_CMPL
 		/sys/devices/cpu/events/PM_BRU_FIN
-		/sys/devices/cpu/events/PM_GCT_NOSLOT_CYC
 		/sys/devices/cpu/events/PM_BRU_MPRED
-		/sys/devices/cpu/events/PM_INST_CMPL
 		/sys/devices/cpu/events/PM_CMPLU_STALL
+		/sys/devices/cpu/events/PM_CMPLU_STALL_BRU
+		/sys/devices/cpu/events/PM_CMPLU_STALL_DCACHE_MISS
+		/sys/devices/cpu/events/PM_CMPLU_STALL_DFU
+		/sys/devices/cpu/events/PM_CMPLU_STALL_DIV
+		/sys/devices/cpu/events/PM_CMPLU_STALL_ERAT_MISS
+		/sys/devices/cpu/events/PM_CMPLU_STALL_FXU
+		/sys/devices/cpu/events/PM_CMPLU_STALL_IFU
+		/sys/devices/cpu/events/PM_CMPLU_STALL_LSU
+		/sys/devices/cpu/events/PM_CMPLU_STALL_REJECT
+		/sys/devices/cpu/events/PM_CMPLU_STALL_SCALAR
+		/sys/devices/cpu/events/PM_CMPLU_STALL_SCALAR_LONG
+		/sys/devices/cpu/events/PM_CMPLU_STALL_STORE
+		/sys/devices/cpu/events/PM_CMPLU_STALL_THRD
+		/sys/devices/cpu/events/PM_CMPLU_STALL_VECTOR
+		/sys/devices/cpu/events/PM_CMPLU_STALL_VECTOR_LONG
+		/sys/devices/cpu/events/PM_CYC
+		/sys/devices/cpu/events/PM_GCT_NOSLOT_BR_MPRED
+		/sys/devices/cpu/events/PM_GCT_NOSLOT_BR_MPRED_IC_MISS
+		/sys/devices/cpu/events/PM_GCT_NOSLOT_CYC
+		/sys/devices/cpu/events/PM_GCT_NOSLOT_IC_MISS
+		/sys/devices/cpu/events/PM_GRP_CMPL
+		/sys/devices/cpu/events/PM_INST_CMPL
+		/sys/devices/cpu/events/PM_LD_MISS_L1
+		/sys/devices/cpu/events/PM_LD_REF_L1
+		/sys/devices/cpu/events/PM_RUN_CYC
+		/sys/devices/cpu/events/PM_RUN_INST_CMPL
 
 Date:		2013/01/08
 
-- 
1.8.1.4

^ permalink raw reply related

* [PATCH 55/66] perf: Power7: Make CPI stack events available in sysfs
From: Arnaldo Carvalho de Melo @ 2013-05-30 16:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Arnaldo Carvalho de Melo, linuxppc-dev,
	Paul Mackerras, Sukadev Bhattiprolu
In-Reply-To: <1369929699-8724-1-git-send-email-acme@infradead.org>

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

A set of Power7 events are often used for Cycles Per Instruction (CPI) stack
analysis. Make these events available in sysfs (/sys/devices/cpu/events/) so
they can be identified using their symbolic names:

	perf stat -e 'cpu/PM_CMPLU_STALL_DCACHE_MISS/' /bin/ls

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@ozlabs.org
Link: http://lkml.kernel.org/r/20130406164803.GA408@us.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 arch/powerpc/perf/power7-pmu.c | 73 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c
index 3c475d6..13c3f0e 100644
--- a/arch/powerpc/perf/power7-pmu.c
+++ b/arch/powerpc/perf/power7-pmu.c
@@ -62,6 +62,29 @@
 #define	PME_PM_BRU_FIN			0x10068
 #define	PME_PM_BRU_MPRED		0x400f6
 
+#define PME_PM_CMPLU_STALL_FXU			0x20014
+#define PME_PM_CMPLU_STALL_DIV			0x40014
+#define PME_PM_CMPLU_STALL_SCALAR		0x40012
+#define PME_PM_CMPLU_STALL_SCALAR_LONG		0x20018
+#define PME_PM_CMPLU_STALL_VECTOR		0x2001c
+#define PME_PM_CMPLU_STALL_VECTOR_LONG		0x4004a
+#define PME_PM_CMPLU_STALL_LSU			0x20012
+#define PME_PM_CMPLU_STALL_REJECT		0x40016
+#define PME_PM_CMPLU_STALL_ERAT_MISS		0x40018
+#define PME_PM_CMPLU_STALL_DCACHE_MISS		0x20016
+#define PME_PM_CMPLU_STALL_STORE		0x2004a
+#define PME_PM_CMPLU_STALL_THRD			0x1001c
+#define PME_PM_CMPLU_STALL_IFU			0x4004c
+#define PME_PM_CMPLU_STALL_BRU			0x4004e
+#define PME_PM_GCT_NOSLOT_IC_MISS		0x2001a
+#define PME_PM_GCT_NOSLOT_BR_MPRED		0x4001a
+#define PME_PM_GCT_NOSLOT_BR_MPRED_IC_MISS	0x4001c
+#define PME_PM_GRP_CMPL				0x30004
+#define PME_PM_1PLUS_PPC_CMPL			0x100f2
+#define PME_PM_CMPLU_STALL_DFU			0x2003c
+#define PME_PM_RUN_CYC				0x200f4
+#define PME_PM_RUN_INST_CMPL			0x400fa
+
 /*
  * Layout of constraint bits:
  * 6666555555555544444444443333333333222222222211111111110000000000
@@ -393,6 +416,31 @@ POWER_EVENT_ATTR(LD_MISS_L1,			LD_MISS_L1);
 POWER_EVENT_ATTR(BRU_FIN,			BRU_FIN)
 POWER_EVENT_ATTR(BRU_MPRED,			BRU_MPRED);
 
+POWER_EVENT_ATTR(CMPLU_STALL_FXU,		CMPLU_STALL_FXU);
+POWER_EVENT_ATTR(CMPLU_STALL_DIV,		CMPLU_STALL_DIV);
+POWER_EVENT_ATTR(CMPLU_STALL_SCALAR,		CMPLU_STALL_SCALAR);
+POWER_EVENT_ATTR(CMPLU_STALL_SCALAR_LONG,	CMPLU_STALL_SCALAR_LONG);
+POWER_EVENT_ATTR(CMPLU_STALL_VECTOR,		CMPLU_STALL_VECTOR);
+POWER_EVENT_ATTR(CMPLU_STALL_VECTOR_LONG,	CMPLU_STALL_VECTOR_LONG);
+POWER_EVENT_ATTR(CMPLU_STALL_LSU,		CMPLU_STALL_LSU);
+POWER_EVENT_ATTR(CMPLU_STALL_REJECT,		CMPLU_STALL_REJECT);
+
+POWER_EVENT_ATTR(CMPLU_STALL_ERAT_MISS,		CMPLU_STALL_ERAT_MISS);
+POWER_EVENT_ATTR(CMPLU_STALL_DCACHE_MISS,	CMPLU_STALL_DCACHE_MISS);
+POWER_EVENT_ATTR(CMPLU_STALL_STORE,		CMPLU_STALL_STORE);
+POWER_EVENT_ATTR(CMPLU_STALL_THRD,		CMPLU_STALL_THRD);
+POWER_EVENT_ATTR(CMPLU_STALL_IFU,		CMPLU_STALL_IFU);
+POWER_EVENT_ATTR(CMPLU_STALL_BRU,		CMPLU_STALL_BRU);
+POWER_EVENT_ATTR(GCT_NOSLOT_IC_MISS,		GCT_NOSLOT_IC_MISS);
+
+POWER_EVENT_ATTR(GCT_NOSLOT_BR_MPRED,		GCT_NOSLOT_BR_MPRED);
+POWER_EVENT_ATTR(GCT_NOSLOT_BR_MPRED_IC_MISS,	GCT_NOSLOT_BR_MPRED_IC_MISS);
+POWER_EVENT_ATTR(GRP_CMPL,			GRP_CMPL);
+POWER_EVENT_ATTR(1PLUS_PPC_CMPL,		1PLUS_PPC_CMPL);
+POWER_EVENT_ATTR(CMPLU_STALL_DFU,		CMPLU_STALL_DFU);
+POWER_EVENT_ATTR(RUN_CYC,			RUN_CYC);
+POWER_EVENT_ATTR(RUN_INST_CMPL,			RUN_INST_CMPL);
+
 static struct attribute *power7_events_attr[] = {
 	GENERIC_EVENT_PTR(CYC),
 	GENERIC_EVENT_PTR(GCT_NOSLOT_CYC),
@@ -411,6 +459,31 @@ static struct attribute *power7_events_attr[] = {
 	POWER_EVENT_PTR(LD_MISS_L1),
 	POWER_EVENT_PTR(BRU_FIN),
 	POWER_EVENT_PTR(BRU_MPRED),
+
+	POWER_EVENT_PTR(CMPLU_STALL_FXU),
+	POWER_EVENT_PTR(CMPLU_STALL_DIV),
+	POWER_EVENT_PTR(CMPLU_STALL_SCALAR),
+	POWER_EVENT_PTR(CMPLU_STALL_SCALAR_LONG),
+	POWER_EVENT_PTR(CMPLU_STALL_VECTOR),
+	POWER_EVENT_PTR(CMPLU_STALL_VECTOR_LONG),
+	POWER_EVENT_PTR(CMPLU_STALL_LSU),
+	POWER_EVENT_PTR(CMPLU_STALL_REJECT),
+
+	POWER_EVENT_PTR(CMPLU_STALL_ERAT_MISS),
+	POWER_EVENT_PTR(CMPLU_STALL_DCACHE_MISS),
+	POWER_EVENT_PTR(CMPLU_STALL_STORE),
+	POWER_EVENT_PTR(CMPLU_STALL_THRD),
+	POWER_EVENT_PTR(CMPLU_STALL_IFU),
+	POWER_EVENT_PTR(CMPLU_STALL_BRU),
+	POWER_EVENT_PTR(GCT_NOSLOT_IC_MISS),
+	POWER_EVENT_PTR(GCT_NOSLOT_BR_MPRED),
+
+	POWER_EVENT_PTR(GCT_NOSLOT_BR_MPRED_IC_MISS),
+	POWER_EVENT_PTR(GRP_CMPL),
+	POWER_EVENT_PTR(1PLUS_PPC_CMPL),
+	POWER_EVENT_PTR(CMPLU_STALL_DFU),
+	POWER_EVENT_PTR(RUN_CYC),
+	POWER_EVENT_PTR(RUN_INST_CMPL),
 	NULL
 };
 
-- 
1.8.1.4
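
A minimal sketch of how the symbolic names relate to the raw event codes.
The name/code pairs are copied from the PME_PM_* defines in the patch above;
the table, the cpi_event_code() and cpi_event_sysfs_string() helpers, and the
assumption that a sysfs event file reads back as "event=0x<code>" are
illustrative only and are not taken from the kernel source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Name/code pairs copied from the PME_PM_* defines in the patch (subset). */
struct cpi_event {
	const char *name;
	unsigned long code;
};

static const struct cpi_event cpi_events[] = {
	{ "PM_CMPLU_STALL_FXU",		0x20014 },
	{ "PM_CMPLU_STALL_LSU",		0x20012 },
	{ "PM_CMPLU_STALL_DCACHE_MISS",	0x20016 },
	{ "PM_GCT_NOSLOT_IC_MISS",	0x2001a },
	{ "PM_RUN_CYC",			0x200f4 },
	{ "PM_RUN_INST_CMPL",		0x400fa },
};

/* Hypothetical helper: return the raw code for a name, or 0 if unknown. */
static unsigned long cpi_event_code(const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < sizeof(cpi_events) / sizeof(cpi_events[0]); i++)
		if (strcmp(cpi_events[i].name, name) == 0)
			return cpi_events[i].code;
	return 0;
}

/*
 * Hypothetical helper: format the string a sysfs event file is assumed
 * to contain ("event=0x<code>"). Returns the snprintf() result.
 */
static int cpi_event_sysfs_string(const char *name, char *buf, size_t len)
{
	return snprintf(buf, len, "event=0x%lx", cpi_event_code(name));
}
```

Under that assumption, asking perf for 'cpu/PM_CMPLU_STALL_DCACHE_MISS/'
would select the same counter as the raw code 0x20016.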

^ permalink raw reply related

* Re: [PATCH] powerpc/mpc85xx: match with the pci bus address used by u-boot for all p1_p2_rdb_pc boards
From: Scott Wood @ 2013-05-30 14:43 UTC (permalink / raw)
  To: Kumar Gala; +Cc: linuxppc, Kevin Hao
In-Reply-To: <9845CB5F-0E99-49A5-A10B-CD2E2379E903@kernel.crashing.org>

On 05/30/2013 09:21:19 AM, Kumar Gala wrote:
>
> On May 28, 2013, at 5:45 PM, Scott Wood wrote:
>
> > On 05/16/2013 01:29:45 AM, Kevin Hao wrote:
> >> All these boards use the same configuration file p1_p2_rdb_pc.h in
> >> u-boot. So they have the same pci bus address set by the u-boot.
> >> But in some of these boards the bus address set in dtb don't match
> >> the one used by u-boot. And this will trigger a kernel bug in 32bit
> >> kernel and cause the pci device malfunction. For example, on a
> >> p2020rdb-pc board the u-boot use the 0xa0000000 as both bus address
> >> and cpu address for one pci controller and then assign bus address
> >> such as 0xa0004000 to some pci device. But in the kernel, the dtb
> >> set the bus address to 0xe0000000 and the cpu address to 0xa0000000.
> >> The kernel assumes mistakenly the assigned bus address 0xa0004000
> >> in pci device is correct and keep it unchanged. This will definitely
> >> cause the pci device malfunction. I have made two patches to fix
> >> this in the pci subsystem.
> >> http://patchwork.ozlabs.org/patch/243702/
> >> http://patchwork.ozlabs.org/patch/243703/
> >> But I still think it makes sense to set these bus address to match
> >> with the u-boot. This issue can't be reproduced on 36bit kernel.
> >> But I also tweak the 36bit dtb for the above reason.
> >
> > IIRC the reason for using 0xe0000000 on all PCIe roots is to maximize
> > the memory that is DMA-addressable without involving swiotlb.
> >
> > Maybe U-Boot should be fixed?
> >
> > -Scott
>
> I feel that u-boot was the way it is to allow accessing each bus from
> the command line in u-boot w/o big changes for >32-bit addressing.
>
> Linux was able to handle the PCI bus addresses all being the same.

It's a bit of a hack though, in that you're using the device tree to
indicate how you want the hardware programmed rather than to describe
the hardware or even what U-Boot's done to it, and in that you can't
arbitrarily change what U-Boot chose -- it only works because you're
picking an address that U-Boot used for one of the PCIe controllers and
thus U-Boot covered it with a LAW.

-Scott
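
A minimal sketch of the mismatch described in the quoted report, assuming a
single outbound window per controller. The struct, the helper names and the
0x20000000 window size used in the usage note are hypothetical; only the
0xa0000000 / 0xe0000000 / 0xa0004000 addresses come from the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * One outbound window: CPU addresses [cpu_base, cpu_base + size) map to
 * bus addresses [bus_base, bus_base + size).
 */
struct pci_window {
	uint64_t cpu_base;
	uint64_t bus_base;
	uint64_t size;
};

/* True if a bus address programmed into a BAR falls inside the window. */
static bool bus_addr_in_window(const struct pci_window *w, uint64_t bus)
{
	assert(w != NULL);
	return bus >= w->bus_base && bus - w->bus_base < w->size;
}

/* CPU address at which a BAR is reachable; 0 if it is outside the window. */
static uint64_t bus_to_cpu(const struct pci_window *w, uint64_t bus)
{
	if (!bus_addr_in_window(w, bus))
		return 0;
	return w->cpu_base + (bus - w->bus_base);
}
```

With a window of { 0xa0000000, 0xa0000000, size } (what U-Boot is said to
program), a BAR holding 0xa0004000 is reachable. With { 0xa0000000,
0xe0000000, size } (what the dtb is said to describe), the same BAR value is
outside the bus range, which is the malfunction Kevin reports when the kernel
keeps the firmware-assigned value.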

^ permalink raw reply

* Re: [PATCH] powerpc/mpc85xx: match with the pci bus address used by u-boot for all p1_p2_rdb_pc boards
From: Kumar Gala @ 2013-05-30 14:21 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc, Kevin Hao
In-Reply-To: <1369781156.18630.24@snotra>


On May 28, 2013, at 5:45 PM, Scott Wood wrote:

> On 05/16/2013 01:29:45 AM, Kevin Hao wrote:
>> All these boards use the same configuration file p1_p2_rdb_pc.h in
>> u-boot. So they have the same pci bus address set by the u-boot.
>> But in some of these boards the bus address set in dtb don't match
>> the one used by u-boot. And this will trigger a kernel bug in 32bit
>> kernel and cause the pci device malfunction. For example, on a
>> p2020rdb-pc board the u-boot use the 0xa0000000 as both bus address
>> and cpu address for one pci controller and then assign bus address
>> such as 0xa0004000 to some pci device. But in the kernel, the dtb
>> set the bus address to 0xe0000000 and the cpu address to 0xa0000000.
>> The kernel assumes mistakenly the assigned bus address 0xa0004000
>> in pci device is correct and keep it unchanged. This will definitely
>> cause the pci device malfunction. I have made two patches to fix
>> this in the pci subsystem.
>> http://patchwork.ozlabs.org/patch/243702/
>> http://patchwork.ozlabs.org/patch/243703/
>> But I still think it makes sense to set these bus address to match
>> with the u-boot. This issue can't be reproduced on 36bit kernel.
>> But I also tweak the 36bit dtb for the above reason.
>
> IIRC the reason for using 0xe0000000 on all PCIe roots is to maximize
> the memory that is DMA-addressable without involving swiotlb.
>
> Maybe U-Boot should be fixed?
>
> -Scott

I feel that u-boot was the way it is to allow accessing each bus from
the command line in u-boot w/o big changes for >32-bit addressing.

Linux was able to handle the PCI bus addresses all being the same.

- k

^ permalink raw reply

* Re: 3.10-rc ppc64 corrupts usermem when swapping
From: Hugh Dickins @ 2013-05-30 13:48 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linuxppc-dev, Anton Blanchard, Paul Mackerras, Aneesh Kumar K.V,
	David Gibson
In-Reply-To: <1369902786.3928.94.camel@pasglop>

On Thu, 30 May 2013, Benjamin Herrenschmidt wrote:
> On Thu, 2013-05-30 at 13:57 +0530, Aneesh Kumar K.V wrote:
> > +               /* FIXME!!, will fail with when we enable hugepage
> > support */
> 
> Just fix that to say "Transparent huge pages" as normal huge pages
> should work fine unless I'm missing something.
> 
> Hugh, any chance you can give that a spin ? 

Sure, it's now under way.  If all goes well, I'll give you a
progress report in about 15 hours time; but given the variance in
how long it took to hit, I won't feel fully confident until this
time tomorrow, when I'll update you again.

Thank you both for the great response!

Hugh

^ permalink raw reply

* Re: can't access PCIe card under sbc8548
From: wolfking @ 2013-05-30 12:49 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <51A72D04.2090203@windriver.com>

tiejun.chen wrote
> On 05/30/2013 03:32 PM, wolfking wrote:
>> (continued)
>>    I traced the 8139too.c when it uses pci_iomap, the pci_iomap called
>> the
>> ioport_map. The difference between 8139 and my PCIe card lies in the
>> "port" value :
>> void __iomem *ioport_map(unsigned long port, unsigned int len)
>> {
>> 	return (void __iomem *) (port + _IO_BASE);
> 
> _IO_BASE is equal to isa_io_base. So if this is not zero, I think there's
> a isa 
> bridge in your platform. So you can access these I/O ports based on that
> isa 
> bridge/bus with ioreadx/iowritex.

I tried ioread8/iowrite8 after ioremap; it doesn't work.

>> }
>>    in 8139too.c, the "port" value is 0x1000; for my PCIe card, the "port"
>> value
>> is 0xfefff000. And the value is got from pci_resource_start. So you see,
>> the
> 
> But this means the port is as memory-mapped so ioremap() should be
> workable in 
> this case. Then out_bex/in_bex should be fine.
> 
> Tiejun
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

--
View this message in context: http://linuxppc.10917.n7.nabble.com/can-t-access-PCIe-card-under-sbc8548-tp71775p71827.html
Sent from the linuxppc-dev mailing list archive at Nabble.com.

^ permalink raw reply

* Re: can't access PCIe card under sbc8548
From: wolfking @ 2013-05-30 12:45 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <51A72524.2060403@windriver.com>

tiejun.chen wrote
> On 05/30/2013 06:02 PM, wolfking wrote:
>> I tried several R/W functions: inb/outb and ioread8/iowrite8. The
>> inb/outb
> 
> What about out_be32/in_be32?

My PCIe card's internal registers are only 8 bits wide, so I tried ioremap
with out_8/in_8. It still doesn't work.

> Tiejun

--
View this message in context: http://linuxppc.10917.n7.nabble.com/can-t-access-PCIe-card-under-sbc8548-tp71775p71826.html
Sent from the linuxppc-dev mailing list archive at Nabble.com.

^ permalink raw reply

* Re: can't access PCIe card under sbc8548
From: tiejun.chen @ 2013-05-30 10:42 UTC (permalink / raw)
  To: wolfking; +Cc: linuxppc-dev
In-Reply-To: <1369899157331-71783.post@n7.nabble.com>

On 05/30/2013 03:32 PM, wolfking wrote:
> (continued)
>    I traced the 8139too.c when it uses pci_iomap, the pci_iomap called the
> ioport_map. The difference between 8139 and my PCIe card lies in the
> "port" value :
> void __iomem *ioport_map(unsigned long port, unsigned int len)
> {
> 	return (void __iomem *) (port + _IO_BASE);

_IO_BASE is equal to isa_io_base. So if this is not zero, I think there's an ISA
bridge in your platform. So you can access these I/O ports based on that ISA
bridge/bus with ioreadx/iowritex.

> }
>    in 8139too.c, the "port" value is 0x1000; for my PCIe card, the "port"
> value
> is 0xfefff000. And the value is got from pci_resource_start. So you see, the

But this means the port is memory-mapped, so ioremap() should work in
this case. Then out_bex/in_bex should be fine.

Tiejun
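
A minimal user-space sketch of the distinction being discussed, assuming the
generic pci_iomap() picks its mapping path from the BAR's resource flags. The
flag values are quoted from memory of include/linux/ioport.h; bar_map_kind(),
model_ioport_map() and the io_base value in the usage note are hypothetical
and only model the arithmetic of the quoted ioport_map().

```c
#include <assert.h>
#include <stdint.h>

/* Resource type bits; values as recalled from include/linux/ioport.h. */
#define IORESOURCE_IO	0x00000100UL
#define IORESOURCE_MEM	0x00000200UL

enum map_kind { MAP_NONE, MAP_IOPORT, MAP_MMIO };

/* Which mapping path a BAR with the given resource flags would take. */
static enum map_kind bar_map_kind(unsigned long flags)
{
	if (flags & IORESOURCE_IO)
		return MAP_IOPORT;	/* ioport_map(): port + _IO_BASE */
	if (flags & IORESOURCE_MEM)
		return MAP_MMIO;	/* ioremap() of the CPU physical address */
	return MAP_NONE;
}

/* Model of the quoted ioport_map(): a plain offset from the I/O base. */
static uint64_t model_ioport_map(uint64_t io_base, unsigned long port)
{
	assert(port <= UINT64_MAX - io_base);	/* no wrap-around */
	return io_base + port;
}
```

For example, with a made-up io_base of 0xf0000000, port 0x1000 lands at
0xf0001000. A "port" of 0xfefff000 added to any non-zero base is unlikely to
land inside an I/O window, which is consistent with the suggestion to treat
the BAR as MMIO and use ioremap().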

^ permalink raw reply

* Re: [PATCH 1/3] powerpc/mpc85xx: remove the unneeded pci init functions for corenet ds board
From: Kevin Hao @ 2013-05-30 10:20 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc
In-Reply-To: <1369781529.18630.25@snotra>

[-- Attachment #1: Type: text/plain, Size: 1262 bytes --]

On Tue, May 28, 2013 at 05:52:09PM -0500, Scott Wood wrote:
> On 05/21/2013 07:04:58 AM, Kevin Hao wrote:
> >It also seems that we don't support ISA on all the current corenet ds
> >boards. So picking a primary bus seems useless, remove that function
> >too.
> 
> IIRC that was due to some bugs in the PPC PCI code in the absence of
> any primary bus.

Do you know more about these bugs?

>  fsl_pci_assign_primary() will arbitrarily pick one
> to be primary if there's no ISA.  Have the bugs been fixed?

I know there should be some reason that we put the fsl_pci_assign_primary()
here. But frankly I am not sure what bugs this workaround tries to fix. For these
corenet boards, picking one to be primary has no effect on the 64bit kernel.
And for the 32bit kernel, the only effect is that isa_io_base is set to the
io virtual base of the primary bus. But isa_io_base only makes sense when
we do have an ISA bus, so that we can access some well-known io ports directly
by using outx/inx. If we don't have an ISA bus on the board, the value of
isa_io_base should make no sense at all. So we really don't need to pick a
fake primary bus. Of course I may miss something, correct me if I am wrong. :-)

Thanks,
Kevin

> 
> -Scott

[-- Attachment #2: Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply

* Re: can't access PCIe card under sbc8548
From: tiejun.chen @ 2013-05-30 10:08 UTC (permalink / raw)
  To: wolfking; +Cc: linuxppc-dev
In-Reply-To: <1369908140054-71817.post@n7.nabble.com>

On 05/30/2013 06:02 PM, wolfking wrote:
> I tried several R/W functions: inb/outb and ioread8/iowrite8. The inb/outb

What about out_be32/in_be32?

Tiejun

^ permalink raw reply

* Re: can't access PCIe card under sbc8548
From: wolfking @ 2013-05-30 10:02 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <1369885321567-71775.post@n7.nabble.com>

I tried several R/W functions: inb/outb and ioread8/iowrite8. inb/outb
don't report any error, while ioread8 freezes. The load/store sync
code that I have used includes mb and something else.

--
View this message in context: http://linuxppc.10917.n7.nabble.com/can-t-access-PCIe-card-under-sbc8548-tp71775p71817.html
Sent from the linuxppc-dev mailing list archive at Nabble.com.

^ permalink raw reply

* Re: can't access PCIe card under sbc8548
From: tiejun.chen @ 2013-05-30  9:30 UTC (permalink / raw)
  To: wolfking; +Cc: linuxppc-dev
In-Reply-To: <1369905341079-71815.post@n7.nabble.com>

On 05/30/2013 05:15 PM, wolfking wrote:
> hi, tiejun.chen:
>    When I use ioremap, the card seems to work fine. That is, I can access

Yes, ioremap() should work for MMIO.

> part of all register. My PCIe card is a rs232 expand card, it has some
> standard UART register, for example the SCR(scratch register). My driver
> can access the SCR(write and read) normally, but the other registers
> behave odd. For example, the DLM should be 0, but it reads 5. The card
> has a software reset bit, when it is set to 1, the card reset itself. When
> it finished reset, this reset bit should be back to 0. But In sbc8548, when
> I set this
> bit, it remains high. So I guess, the area I accessed is not the PCIe card,

I suspect you're missing some load/store sync code, so what is your R/W
function exactly?

Tiejun

^ permalink raw reply

* Re: can't access PCIe card under sbc8548
From: wolfking @ 2013-05-30  9:15 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <51A71225.2040102@windriver.com>

hi, tiejun.chen:
  When I use ioremap, the card seems to work fine. That is, I can access
some of the registers. My PCIe card is an RS-232 expansion card; it has some
standard UART registers, for example the SCR (scratch register). My driver
can access the SCR (write and read) normally, but the other registers
behave oddly. For example, the DLM should be 0, but it reads 5. The card
has a software reset bit; when it is set to 1, the card resets itself. When
the reset has finished, this bit should go back to 0. But on the sbc8548, when
I set this bit, it remains high. So I guess the area I accessed is not the
PCIe card; instead it may be some RAM in the system. :>
I'm sure the card hardware is OK: I inserted it into the 8641d board and it
works OK.
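
A minimal sketch of the reasoning above, assuming 8-bit register accessors
are supplied by the caller. A scratch-register test alone cannot tell a UART
from plain memory, because memory also returns whatever was written; a
self-clearing reset bit can. The reg_ops struct, the helper names and the RAM
stand-in are hypothetical, and the reset register offset and bit in the usage
note are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Accessors for an 8-bit register window; the caller decides how they work. */
struct reg_ops {
	uint8_t (*read)(void *ctx, unsigned int off);
	void (*write)(void *ctx, unsigned int off, uint8_t val);
	void *ctx;
};

/* Write two patterns to a scratch register and check both read back. */
static bool scratch_test(const struct reg_ops *ops, unsigned int scr_off)
{
	assert(ops != NULL);
	ops->write(ops->ctx, scr_off, 0x5a);
	if (ops->read(ops->ctx, scr_off) != 0x5a)
		return false;
	ops->write(ops->ctx, scr_off, 0xa5);
	return ops->read(ops->ctx, scr_off) == 0xa5;
}

/* Set a self-clearing reset bit and poll; plain memory never clears it. */
static bool reset_bit_self_clears(const struct reg_ops *ops, unsigned int off,
				  uint8_t bit, unsigned int max_polls)
{
	unsigned int i;

	ops->write(ops->ctx, off, bit);
	for (i = 0; i < max_polls; i++)
		if (!(ops->read(ops->ctx, off) & bit))
			return true;
	return false;
}

/* Stand-in for "some RAM in the system": every write simply sticks. */
static uint8_t ram_read(void *ctx, unsigned int off)
{
	return ((uint8_t *)ctx)[off];
}

static void ram_write(void *ctx, unsigned int off, uint8_t val)
{
	((uint8_t *)ctx)[off] = val;
}
```

Run against the RAM stand-in, scratch_test() passes while
reset_bit_self_clears() fails, which matches the symptoms reported here:
SCR reads back correctly but the reset bit stays high.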

--
View this message in context: http://linuxppc.10917.n7.nabble.com/can-t-access-PCIe-card-under-sbc8548-tp71775p71815.html
Sent from the linuxppc-dev mailing list archive at Nabble.com.

^ permalink raw reply

* Re: [PATCH v5 12/13] ARM: kirkwood: remove redundant DT board files
From: Arnaud Ebalard @ 2013-05-30  9:06 UTC (permalink / raw)
  To: Sebastian Hesselbarth, Jason Cooper
  Cc: Thomas Petazzoni, Andrew Lunn, linux-kernel, Lennert Buytenhek,
	netdev, linuxppc-dev, David Miller, linux-arm-kernel
In-Reply-To: <1369855975-21489-13-git-send-email-sebastian.hesselbarth@gmail.com>

Hi Jason and Sebastian,

Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com> writes:

> With DT support for mv643xx_eth board specific init for some boards now
> is unnecessary. Remove those board files, Kconfig entries, and
> corresponding entries in kirkwood_defconfig.
>
> Signed-off-by: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
> ---
> Note: board-km_kirkwood.c is also removed, as Valentin Longchamp confirmed
> the lock-up is not caused by accessing clock gating registers but rather
> non-existent device registers. This will be addressed by dtsi separation
> for kirkwood and bobcat SoC variants.
>
> Changelog:
> v3->v4:
> - remove more boards that don't require board specific setup
>
> Cc: David Miller <davem@davemloft.net>
> Cc: Lennert Buytenhek <buytenh@wantstofly.org>
> Cc: Jason Cooper <jason@lakedaemon.net>
> Cc: Andrew Lunn <andrew@lunn.ch>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: netdev@vger.kernel.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  arch/arm/configs/kirkwood_defconfig           |   16 ----
>  arch/arm/mach-kirkwood/Kconfig                |  117 -------------------------
>  arch/arm/mach-kirkwood/Makefile               |   16 ----
>  arch/arm/mach-kirkwood/board-dnskw.c          |    7 --
>  arch/arm/mach-kirkwood/board-dockstar.c       |   32 -------
>  arch/arm/mach-kirkwood/board-dreamplug.c      |   35 --------
>  arch/arm/mach-kirkwood/board-dt.c             |   62 +------------
>  arch/arm/mach-kirkwood/board-goflexnet.c      |   34 -------
>  arch/arm/mach-kirkwood/board-guruplug.c       |   33 -------
>  arch/arm/mach-kirkwood/board-ib62x0.c         |   29 ------
>  arch/arm/mach-kirkwood/board-iconnect.c       |   10 ---
>  arch/arm/mach-kirkwood/board-iomega_ix2_200.c |   34 -------
>  arch/arm/mach-kirkwood/board-km_kirkwood.c    |   44 ----------
>  arch/arm/mach-kirkwood/board-lsxl.c           |   16 ----
>  arch/arm/mach-kirkwood/board-mplcec4.c        |   14 ---
>  arch/arm/mach-kirkwood/board-ns2.c            |   35 --------
>  arch/arm/mach-kirkwood/board-openblocks_a6.c  |   26 ------
>  arch/arm/mach-kirkwood/board-readynas.c       |    6 --

Just a stupid note: With Thomas's ongoing work to get the mvebu-pcie driver in
place and enabled for kirkwood, some board setup files will also lose
their pcie init routines, which may allow you to kill those additional
files soon.

For instance 6bd98481ab34 (arm: kirkwood: NETGEAR ReadyNAS Duo v2 init
PCIe via DT) currently sitting in jcooper/mvebu/pcie_kirkwood removes
the PCIE init routine in board-readynas.c, and yours removes the ge00
init. With both applied, the whole file can go away.

AFAICT, this may be the case soon for:

 arch/arm/mach-kirkwood/board-iconnect.c   (36e5722089)
 arch/arm/mach-kirkwood/board-mplcec4.c    (9470fbfb8d)
 arch/arm/mach-kirkwood/board-nsa310.c     (40fa8e5da2)
 arch/arm/mach-kirkwood/board-readynas.c   (6bd98481ab)
 arch/arm/mach-kirkwood/board-ts219.c      (259e234608)

Anyway, thanks for this work Sebastian.

Cheers,

a+

^ permalink raw reply

* Re: [PATCH v5 12/13] ARM: kirkwood: remove redundant DT board files
From: Sebastian Hesselbarth @ 2013-05-30  9:08 UTC (permalink / raw)
  To: Arnaud Ebalard
  Cc: Thomas Petazzoni, Andrew Lunn, Jason Cooper, linux-kernel,
	Lennert Buytenhek, netdev, linuxppc-dev, David Miller,
	linux-arm-kernel
In-Reply-To: <8738t4q1kv.fsf@natisbad.org>

On 05/30/13 11:06, Arnaud Ebalard wrote:
> Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com> writes:
>> With DT support for mv643xx_eth board specific init for some boards now
>> is unnecessary. Remove those board files, Kconfig entries, and
>> corresponding entries in kirkwood_defconfig.
>>
>> Signed-off-by: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
...
> Just a stupid note: With Thomas's ongoing work to get the mvebu-pcie driver in
> place and enabled for kirkwood, some board setup files will also lose
> their pcie init routines, which may allow you to kill those additional
> files soon.
>
> For instance 6bd98481ab34 (arm: kirkwood: NETGEAR ReadyNAS Duo v2 init
> PCIe via DT) currently sitting in jcooper/mvebu/pcie_kirkwood removes
> the PCIE init routine in board-readynas.c, and yours removes the ge00
> init. With both applied, the whole file can go away.
>
> AFAICT, this may be the case soon for:
>
>   arch/arm/mach-kirkwood/board-iconnect.c   (36e5722089)
>   arch/arm/mach-kirkwood/board-mplcec4.c    (9470fbfb8d)
>   arch/arm/mach-kirkwood/board-nsa310.c     (40fa8e5da2)
>   arch/arm/mach-kirkwood/board-readynas.c   (6bd98481ab)
>   arch/arm/mach-kirkwood/board-ts219.c      (259e234608)
>
> Anyway, thanks for this work Sebastian.

Arnaud,

I already realized this when merging Jason's recent PRs and putting the
mv643xx_eth patches on top. I am aware of it, but as the ARM part of
mv643xx_eth will be delayed until the net part surfaces, we have
plenty of time to react to current updates to mach-kirkwood
boards.

Sebastian

^ permalink raw reply

* Re: can't access PCIe card under sbc8548
From: tiejun.chen @ 2013-05-30  8:47 UTC (permalink / raw)
  To: wolfking; +Cc: linuxppc-dev
In-Reply-To: <1369898369690-71782.post@n7.nabble.com>

On 05/30/2013 03:19 PM, wolfking wrote:
> hi, tiejun.chen:
>    Thanks for replying.
>    I tried to use ioremap too, but it doesn't work. The ioport_map method

What happened?

Tiejun

^ permalink raw reply

* Re: 3.10-rc ppc64 corrupts usermem when swapping
From: Benjamin Herrenschmidt @ 2013-05-30  8:33 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Paul Mackerras, Anton Blanchard, Hugh Dickins, linuxppc-dev,
	David Gibson
In-Reply-To: <87vc60na89.fsf@linux.vnet.ibm.com>

On Thu, 2013-05-30 at 13:57 +0530, Aneesh Kumar K.V wrote:
> +               /* FIXME!!, will fail with when we enable hugepage
> support */

Just fix that to say "Transparent huge pages" as normal huge pages
should work fine unless I'm missing something.

Hugh, any chance you can give that a spin ? 

Cheers,
Ben.

^ permalink raw reply

* Re: 3.10-rc ppc64 corrupts usermem when swapping
From: Aneesh Kumar K.V @ 2013-05-30  8:27 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Hugh Dickins
  Cc: linuxppc-dev, Anton Blanchard, Paul Mackerras, David Gibson
In-Reply-To: <1369897236.3928.93.camel@pasglop>

Benjamin Herrenschmidt <benh@au1.ibm.com> writes:

> On Wed, 2013-05-29 at 22:47 -0700, Hugh Dickins wrote:
>> Running my favourite swapping load (repeated make -j20 kernel builds
>> in tmpfs in parallel with repeated make -j20 kernel builds in ext4 on
>> loop on tmpfs file, all limited by mem=700M and swap 1.5G) on 3.10-rc
>> on PowerMac G5, the test dies with corrupted usermem after a few hours.
>> 
>> Variously, segmentation fault or Binutils assertion fail or gcc Internal
>> error in either or both builds: usually signs of swapping or TLB flushing
>> gone wrong.  Sometimes the tmpfs build breaks first, sometimes the ext4 on
>> loop on tmpfs, so at least it looks unrelated to loop.  No problem on x86.
>> 
>> This is 64-bit kernel but 4k pages and old SuSE 11.1 32-bit userspace.
>> 
>> I've just finished a manual bisection on arch/powerpc/mm (which might
>> have been a wrong guess, but has paid off): the first bad commit is
>> 7e74c3921ad9610c0b49f28b8fc69f7480505841
>> "powerpc: Fix hpte_decode to use the correct decoding for page sizes".
>
> Ok, I have other reasons to think is wrong. I debugged a case last week
> where after kexec we still had stale TLB entries, due to the TLB cleanup
> not working.
>
> Thanks for doing that bisection ! I'll investigate ASAP (though it will
> probably have to wait for tomorrow unless Paul beats me to it)
>
>> I don't know if it's actually swapping to swap that's triggering the
>> problem, or a more general page reclaim or TLB flush problem.  I hit
>> it originally when trying to test Mel Gorman's pagevec series on top
>> of 3.10-rc; and though I then reproduced it without that series, it
>> did seem to take much longer: so I have been applying Mel's series to
>> speed up each step of the bisection.  But if I went back again, might
>> find it was just chance that I hit it sooner with Mel's series than
>> without.  So, you're probably safe to ignore that detail, but I
>> mention it just in case it turns out to have some relevance.
>> 
>> Something else peculiar that I've been doing in these runs, may or may
>> not be relevant: I've been running swapon and swapoff repeatedly in the
>> background, so that we're doing swapoff even while busy building.
>> 
>> I probably can't go into much more detail on the test (it's hard
>> to get the balance right, to be swapping rather than OOMing or just
>> running without reclaim), but can test any patches you'd like me to
>> try (though it may take 24 hours for me to report back usefully).
>
> I think it's just failing to invalidate the TLB properly. At least one
> bug I can spot just looking at it:
>
> static void native_hpte_invalidate(unsigned long slot, unsigned long vpn,
> 				   int psize, int ssize, int local)
>
>    .../...
>
> 	native_lock_hpte(hptep);
> 	hpte_v = hptep->v;
>
> 	actual_psize = hpte_actual_psize(hptep, psize);
> 	if (actual_psize < 0) {
> 		native_unlock_hpte(hptep);
> 		local_irq_restore(flags);
> 		return;
> 	}
>
> That's wrong. We must still perform the TLB invalidation even if the
> hash PTE is empty.
>
> In fact, Aneesh, this is a problem with MPSS for your THP work, I just
> thought about it.
>
> The reason is that if a hash bucket gets full, we "evict" a more/less
> random entry from it. When we do that we don't invalidate the TLB
> (hpte_remove) because we assume the old translation is still technically
> "valid".
>

Hmm that is correct, I missed that. But to do a tlb invalidate we need
both base and actual page size. One of the reasons I didn't update the
hpte_invalidate callback to take both the page sizes was that PAPR
didn't need that for invalidate (H_REMOVE). hpte_remove did result in a
tlb invalidate there.


> However that means that an hpte_invalidate *must* invalidate the TLB
> later on even if it's not hitting the right entry in the hash.
>
> However, I can see why that cannot work with THP/MPSS since you have no
> way to know the page size from the PTE anymore....
>
> So my question is, apart from hpte_decode used by kexec, which I will
> fix by just blowing the whole TLB when not running phyp, why do you need
> the "actual" size in invalidate and updatepp ? You really can't rely on
> the size passed by the upper layers ?

So for upstream I have below which should address the
above. Meanwhile I will see what the impact would be to do a tlb
invalidate in hpte_remove, so that we can keep both lpar and native
changes similar.


diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
index 6a2aead..6d1bd81 100644
--- a/arch/powerpc/mm/hash_native_64.c
+++ b/arch/powerpc/mm/hash_native_64.c
@@ -336,11 +336,19 @@ static long native_hpte_updatepp(unsigned long slot, unsigned long newpp,
 
 	hpte_v = hptep->v;
 	actual_psize = hpte_actual_psize(hptep, psize);
+	/*
+	 * We need to invalidate the TLB always because hpte_remove doesn't do
+	 * a tlb invalidate. If a hash bucket gets full, we "evict" a more/less
+	 * random entry from it. When we do that we don't invalidate the TLB
+	 * (hpte_remove) because we assume the old translation is still technically
+	 * "valid".
+	 */
 	if (actual_psize < 0) {
-		native_unlock_hpte(hptep);
-		return -1;
+		/* FIXME!!, will fail with when we enable hugepage support */
+		actual_psize = psize;
+		ret = -1;
+		goto err_out;
 	}
-	/* Even if we miss, we need to invalidate the TLB */
 	if (!HPTE_V_COMPARE(hpte_v, want_v)) {
 		DBG_LOW(" -> miss\n");
 		ret = -1;
@@ -350,6 +358,7 @@ static long native_hpte_updatepp(unsigned long slot, unsigned long newpp,
 		hptep->r = (hptep->r & ~(HPTE_R_PP | HPTE_R_N)) |
 			(newpp & (HPTE_R_PP | HPTE_R_N | HPTE_R_C));
 	}
+err_out:
 	native_unlock_hpte(hptep);
 
 	/* Ensure it is out of the tlb too. */
@@ -408,8 +417,9 @@ static void native_hpte_updateboltedpp(unsigned long newpp, unsigned long ea,
 		panic("could not find page to bolt\n");
 	hptep = htab_address + slot;
 	actual_psize = hpte_actual_psize(hptep, psize);
+	/* FIXME!! can this happen for bolted entry ? */
 	if (actual_psize < 0)
-		return;
+		actual_psize = psize;
 
 	/* Update the HPTE */
 	hptep->r = (hptep->r & ~(HPTE_R_PP | HPTE_R_N)) |
@@ -437,21 +447,28 @@ static void native_hpte_invalidate(unsigned long slot, unsigned long vpn,
 	hpte_v = hptep->v;
 
 	actual_psize = hpte_actual_psize(hptep, psize);
+	/*
+	 * We need to invalidate the TLB always because hpte_remove doesn't do
+	 * a tlb invalidate. If a hash bucket gets full, we "evict" a more/less
+	 * random entry from it. When we do that we don't invalidate the TLB
+	 * (hpte_remove) because we assume the old translation is still technically
+	 * "valid".
+	 */
 	if (actual_psize < 0) {
+		/* FIXME!!, will fail when we enable hugepage support */
+		actual_psize = psize;
 		native_unlock_hpte(hptep);
-		local_irq_restore(flags);
-		return;
+		goto err_out;
 	}
-	/* Even if we miss, we need to invalidate the TLB */
 	if (!HPTE_V_COMPARE(hpte_v, want_v))
 		native_unlock_hpte(hptep);
 	else
 		/* Invalidate the hpte. NOTE: this also unlocks it */
 		hptep->v = 0;
 
+err_out:
 	/* Invalidate the TLB */
 	tlbie(vpn, psize, actual_psize, ssize, local);
-
 	local_irq_restore(flags);
 }
 

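A minimal user-space sketch of the behaviour the patch above aims for, under
stated assumptions: the TLB flush happens on every path, and when the actual
page size cannot be decoded the base page size passed by the caller is used
instead. All names here (model_hpte, model_tlbie, model_hpte_invalidate) are
hypothetical stand-ins, not kernel symbols, and the return convention is
borrowed from native_hpte_updatepp.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a hash PTE; not the real HPTE layout. */
struct model_hpte {
	bool valid;
	unsigned long tag;	/* value a lookup compares against */
	int actual_psize;	/* negative when it cannot be decoded */
};

/* Records TLB flushes so that the behaviour can be observed. */
struct model_tlb_log {
	int flushes;
	int last_psize;
};

/* Stand-in for tlbie(): only notes that a flush was issued. */
static void model_tlbie(struct model_tlb_log *log, int actual_psize)
{
	log->flushes++;
	log->last_psize = actual_psize;
}

/* Returns 0 on a hit, -1 otherwise; the TLB is flushed in every case. */
static int model_hpte_invalidate(struct model_hpte *slot,
				 unsigned long want_tag, int base_psize,
				 struct model_tlb_log *log)
{
	int actual_psize = slot->actual_psize;
	int ret = 0;

	if (actual_psize < 0) {
		/* Size not decodable: fall back to the base page size. */
		actual_psize = base_psize;
		ret = -1;
	} else if (!slot->valid || slot->tag != want_tag) {
		/* Slot was evicted or reused; a stale TLB entry may remain. */
		ret = -1;
	} else {
		slot->valid = false;
	}

	model_tlbie(log, actual_psize);
	return ret;
}
```

In this model a miss and an undecodable entry both still reach model_tlbie(),
which is the property the comment added by the patch describes.
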

* [PATCH 21/23] powerpc/eeh: Process interrupts caused by EEH
From: Gavin Shan @ 2013-05-30  8:24 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Gavin Shan
In-Reply-To: <1369902245-5886-1-git-send-email-shangw@linux.vnet.ibm.com>

On the PowerNV platform, an EEH event is produced either by detection
on access to config or I/O registers, or by interrupts dedicated
to EEH reporting. This patch adds support for processing the interrupts
dedicated to EEH reporting.

First, the kernel thread is woken up to process the incoming
interrupt. The PHBs are scanned one by one to process all
existing EEH errors. Multiple kinds of EEH errors can be
reported through interrupts, and we take a different action
for each of them:

* If the IOC is dead, we simply panic the system.
* If the PHB is dead, we also simply panic the system.
* If the PHB is fenced, an EEH event is sent to the EEH core and
  the fenced PHB is expected to be reset completely.
* If a specific PE has been put into the frozen state, an EEH event
  is sent to the EEH core so that the PE will be reset.
* If the error is informational, we just output the related
  registers for debugging purposes and take no further action.
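
The list above can be sketched as a small lookup in user-space C. This is
only an illustration: the enum names are hypothetical and are not the
OPAL_EEH_* constants the patch uses, and any combination that the list does
not mention is mapped to "no action" as an assumption.

```c
/* Hypothetical names; the patch itself uses the OPAL_EEH_* constants. */
enum model_err_type { MODEL_ERR_IOC, MODEL_ERR_PHB, MODEL_ERR_PE };
enum model_severity {
	MODEL_SEV_DEAD,
	MODEL_SEV_FENCED,
	MODEL_SEV_FROZEN,
	MODEL_SEV_INFO
};
enum model_action {
	MODEL_ACT_NONE,
	MODEL_ACT_PANIC,
	MODEL_ACT_EEH_EVENT,
	MODEL_ACT_DUMP_DIAG
};

/* Maps an (error type, severity) pair to the action described above. */
static enum model_action model_dispatch(enum model_err_type type,
					enum model_severity sev)
{
	switch (type) {
	case MODEL_ERR_IOC:
		if (sev == MODEL_SEV_DEAD)
			return MODEL_ACT_PANIC;
		if (sev == MODEL_SEV_INFO)
			return MODEL_ACT_DUMP_DIAG;
		break;
	case MODEL_ERR_PHB:
		if (sev == MODEL_SEV_DEAD)
			return MODEL_ACT_PANIC;
		if (sev == MODEL_SEV_FENCED)
			return MODEL_ACT_EEH_EVENT;
		if (sev == MODEL_SEV_INFO)
			return MODEL_ACT_DUMP_DIAG;
		break;
	case MODEL_ERR_PE:
		/* A frozen PE is always handed to the EEH core. */
		return MODEL_ACT_EEH_EVENT;
	}

	/* Combination not covered by the list: assumed to need no action. */
	return MODEL_ACT_NONE;
}
```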

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/eeh.h             |    8 +
 arch/powerpc/platforms/powernv/Makefile    |    2 +-
 arch/powerpc/platforms/powernv/pci-err.c   |  475 ++++++++++++++++++++++++++++
 arch/powerpc/platforms/pseries/eeh_event.c |    8 +
 4 files changed, 492 insertions(+), 1 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/pci-err.c

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 05b70dc..7d0dfbf 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -212,6 +212,14 @@ void eeh_add_device_tree_late(struct pci_bus *);
 void eeh_add_sysfs_files(struct pci_bus *);
 void eeh_remove_bus_device(struct pci_dev *, int);
 
+#ifdef CONFIG_PPC_POWERNV
+void pci_err_event(void);
+void pci_err_release(void);
+#else
+static inline void pci_err_event(void) { }
+static inline void pci_err_release(void) { }
+#endif
+
 /**
  * EEH_POSSIBLE_ERROR() -- test for possible MMIO failure.
  *
diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
index 7fe5951..912fa7c 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -3,4 +3,4 @@ obj-y			+= opal-rtc.o opal-nvram.o
 
 obj-$(CONFIG_SMP)	+= smp.o
 obj-$(CONFIG_PCI)	+= pci.o pci-p5ioc2.o pci-ioda.o
-obj-$(CONFIG_EEH)	+= eeh-ioda.o eeh-powernv.o
+obj-$(CONFIG_EEH)	+= pci-err.o eeh-ioda.o eeh-powernv.o
diff --git a/arch/powerpc/platforms/powernv/pci-err.c b/arch/powerpc/platforms/powernv/pci-err.c
new file mode 100644
index 0000000..12ae2ce
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/pci-err.c
@@ -0,0 +1,475 @@
+/*
+ * This file is intended to handle the interrupts dedicated to error
+ * detection from IOC chips. Currently, we only support P7IOC and
+ * need to support more IOC chips in the future. The interrupts have
+ * been exported to the hypervisor through "opal-interrupts" of the
+ * "ibm,opal" OF node. When one of them comes in, the hypervisor simply
+ * turns to the firmware and expects the appropriate events to be
+ * returned. In turn, we format one message and queue it in order to
+ * process it at a later point.
+ *
+ * On the other hand, we need to maintain information about the states
+ * of IO HUBs and their associated PHBs. The information would be
+ * shared by the hypervisor and guests in future. When the hypervisor
+ * or guests access IO HUBs, PHBs and PEs, the state should be checked
+ * and appropriate results returned. That would benefit EEH RTAS
+ * emulation in the hypervisor as well.
+ *
+ * Copyright Benjamin Herrenschmidt & Gavin Shan, IBM Corporation 2013.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/kernel.h>
+#include <linux/pci.h>
+#include <linux/delay.h>
+#include <linux/string.h>
+#include <linux/semaphore.h>
+#include <linux/init.h>
+#include <linux/bootmem.h>
+#include <linux/irq.h>
+#include <linux/io.h>
+#include <linux/kthread.h>
+#include <linux/msi.h>
+
+#include <asm/firmware.h>
+#include <asm/sections.h>
+#include <asm/io.h>
+#include <asm/prom.h>
+#include <asm/pci-bridge.h>
+#include <asm/machdep.h>
+#include <asm/msi_bitmap.h>
+#include <asm/ppc-pci.h>
+#include <asm/opal.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
+#include <asm/eeh_event.h>
+#include <asm/eeh.h>
+
+#include "powernv.h"
+#include "pci.h"
+
+/* Debugging option */
+#ifdef PCI_ERR_DEBUG_ON
+#define PCI_ERR_DBG(args...)	pr_info(args)
+#else
+#define PCI_ERR_DBG(args...)
+#endif
+
+static struct task_struct *pci_err_thread;
+static struct semaphore pci_err_int_sem;
+static struct semaphore pci_err_seq_sem;
+static char *pci_err_diag;
+
+/**
+ * pci_err_event - Report PCI error event
+ *
+ * Called by the interrupt handler of the PCI error IRQs to report
+ * an event. It takes no arguments. As a result, the kthread is
+ * woken up to handle the PCI error (the kthread itself is started
+ * by pci_err_init()).
+ */
+void pci_err_event(void)
+{
+	/* Notify kthread to process error */
+	up(&pci_err_int_sem);
+}
+
+static void pci_err_take(void)
+{
+	down(&pci_err_seq_sem);
+}
+
+/**
+ * pci_err_release - Enable error report for sending events
+ *
+ * We're handling the EEH events one by one. At any time, there is only
+ * one EEH event caused by an error IRQ in flight. The function is called
+ * to re-enable error reporting so that more EEH events can be sent.
+ */
+void pci_err_release(void)
+{
+	up(&pci_err_seq_sem);
+}
+
+/*
+ * When we get global interrupts (e.g. P7IOC RGC), a PCI error has happened
+ * in a critical component of the IOC or PHB. In the former case, the
+ * firmware just returns OPAL_PCI_ERR_CLASS_HUB and we needn't proceed.
+ * In the latter case, we probably need to reset one particular PHB. To
+ * do that, we send an EEH event to the topmost PE of the
+ * problematic PHB so that the PHB can be reset by the EEH core.
+ */
+static int pci_err_check_phb(struct pci_controller *hose)
+{
+	struct eeh_pe *phb_pe;
+
+	/* Find the PHB PE */
+	phb_pe = eeh_phb_pe_get(hose);
+	if (!phb_pe) {
+		pr_debug("%s Can't find PE for PHB#%d\n",
+			__func__, hose->global_number);
+		return -EEXIST;
+	}
+	PCI_ERR_DBG("PCI_ERR: PHB#%d PE found\n",
+		hose->global_number);
+
+	/*
+	 * Fence the PHB and send one event to EEH core
+	 * for further processing. We have to fence the
+	 * PHB here because the EEH core always returns
+	 * the normal state for the PHB PE, so we can't do it
+	 * through EEH core.
+	 */
+	if (!(phb_pe->state & EEH_PE_ISOLATED)) {
+		PCI_ERR_DBG("PCI_ERR: Fence PHB#%x and send event "
+			    "to EEH core\n", hose->global_number);
+		eeh_pe_state_mark(phb_pe, EEH_PE_ISOLATED);
+		eeh_send_failure_event(phb_pe, EEH_EVENT_INT);
+	} else {
+		pci_err_release();
+	}
+
+	return 0;
+}
+
+/*
+ * When we get interrupts from a PHB, there are probably some PEs that
+ * have been put into the frozen state. What we need to do is send one
+ * message to an EEH device, no matter which one it is, so that the EEH
+ * core can check it out and do the PE reset accordingly.
+ */
+static int pci_err_check_pe(struct pci_controller *hose, u16 pe_no)
+{
+	struct eeh_pe *phb_pe, *pe;
+	struct eeh_dev dev, *edev;
+
+	/* Find the PHB PE */
+	phb_pe = eeh_phb_pe_get(hose);
+	if (!phb_pe) {
+		pr_warning("%s Can't find PE for PHB#%d\n",
+			__func__, hose->global_number);
+		return -EEXIST;
+	}
+	PCI_ERR_DBG("PCI_ERR: PHB#%d PE found\n",
+		hose->global_number);
+
+	/*
+	 * If the PHB has been put into the fenced state, we
+	 * needn't send a duplicate event because the
+	 * whole PHB is going to be reset.
+	 */
+	if (phb_pe->state & EEH_PE_ISOLATED)
+		return 0;
+
+	/* Find the PE according to PE# */
+	memset(&dev, 0, sizeof(struct eeh_dev));
+	dev.phb = hose;
+	dev.pe_config_addr = pe_no;
+	pe = eeh_pe_get(&dev);
+	if (!pe) {
+		pr_debug("%s: Can't find PE for PHB#%x - PE#%x\n",
+			__func__, hose->global_number, pe_no);
+		return -EEXIST;
+	}
+	PCI_ERR_DBG("PCI_ERR: PE (%x) found for PHB#%x - PE#%x\n",
+		pe->addr, hose->global_number, pe_no);
+
+	/*
+	 * Any EEH device on the PE can get the message, so
+	 * pick the first one. list_first_entry() never returns
+	 * NULL, so test for an empty list beforehand.
+	 */
+	if (list_empty(&pe->edevs)) {
+		pr_err("%s: No EEH devices hooked on PHB#%x - PE#%x\n",
+			__func__, hose->global_number, pe_no);
+		return -EEXIST;
+	}
+	edev = list_first_entry(&pe->edevs, struct eeh_dev, list);
+	PCI_ERR_DBG("PCI_ERR: First EEH device found on PHB#%x - PE#%x\n",
+		hose->global_number, pe_no);
+
+	if (!eeh_dev_check_failure(edev, EEH_EVENT_INT))
+		pci_err_release();
+
+	return 0;
+}
+
+static void pci_err_hub_diag_common(struct OpalIoP7IOCErrorData *data)
+{
+	/* GEM */
+	pr_info("  GEM XFIR:        %016llx\n", data->gemXfir);
+	pr_info("  GEM RFIR:        %016llx\n", data->gemRfir);
+	pr_info("  GEM RIRQFIR:     %016llx\n", data->gemRirqfir);
+	pr_info("  GEM Mask:        %016llx\n", data->gemMask);
+	pr_info("  GEM RWOF:        %016llx\n", data->gemRwof);
+
+	/* LEM */
+	pr_info("  LEM FIR:         %016llx\n", data->lemFir);
+	pr_info("  LEM Error Mask:  %016llx\n", data->lemErrMask);
+	pr_info("  LEM Action 0:    %016llx\n", data->lemAction0);
+	pr_info("  LEM Action 1:    %016llx\n", data->lemAction1);
+	pr_info("  LEM WOF:         %016llx\n", data->lemWof);
+}
+
+static void pci_err_hub_diag_data(struct pci_controller *hose)
+{
+	struct pnv_phb *phb = hose->private_data;
+	struct OpalIoP7IOCErrorData *data;
+	long ret;
+
+	data = (struct OpalIoP7IOCErrorData *)pci_err_diag;
+	ret = opal_pci_get_hub_diag_data(phb->hub_id, data, PAGE_SIZE);
+	if (ret != OPAL_SUCCESS) {
+		pr_warning("%s: Failed to get HUB#%llx diag-data, ret=%ld\n",
+			__func__, phb->hub_id, ret);
+		return;
+	}
+
+	/* Check the error type */
+	if (data->type <= OPAL_P7IOC_DIAG_TYPE_NONE ||
+	    data->type >= OPAL_P7IOC_DIAG_TYPE_LAST) {
+		pr_warning("%s: Invalid type of HUB#%llx diag-data (%d)\n",
+			__func__, phb->hub_id, data->type);
+		return;
+	}
+
+	switch (data->type) {
+	case OPAL_P7IOC_DIAG_TYPE_RGC:
+		pr_info("P7IOC diag-data for RGC\n\n");
+		pci_err_hub_diag_common(data);
+		pr_info("  RGC Status:      %016llx\n", data->rgc.rgcStatus);
+		pr_info("  RGC LDCP:        %016llx\n", data->rgc.rgcLdcp);
+		break;
+	case OPAL_P7IOC_DIAG_TYPE_BI:
+		pr_info("P7IOC diag-data for BI %s\n\n",
+			data->bi.biDownbound ? "Downbound" : "Upbound");
+		pci_err_hub_diag_common(data);
+		pr_info("  BI LDCP 0:       %016llx\n", data->bi.biLdcp0);
+		pr_info("  BI LDCP 1:       %016llx\n", data->bi.biLdcp1);
+		pr_info("  BI LDCP 2:       %016llx\n", data->bi.biLdcp2);
+		pr_info("  BI Fence Status: %016llx\n", data->bi.biFenceStatus);
+		break;
+	case OPAL_P7IOC_DIAG_TYPE_CI:
+		pr_info("P7IOC diag-data for CI Port %d\n\n",
+			data->ci.ciPort);
+		pci_err_hub_diag_common(data);
+		pr_info("  CI Port Status:  %016llx\n", data->ci.ciPortStatus);
+		pr_info("  CI Port LDCP:    %016llx\n", data->ci.ciPortLdcp);
+		break;
+	case OPAL_P7IOC_DIAG_TYPE_MISC:
+		pr_info("P7IOC diag-data for MISC\n\n");
+		pci_err_hub_diag_common(data);
+		break;
+	case OPAL_P7IOC_DIAG_TYPE_I2C:
+		pr_info("P7IOC diag-data for I2C\n\n");
+		pci_err_hub_diag_common(data);
+		break;
+	}
+}
+
+static void pci_err_phb_diag_data(struct pci_controller *hose)
+{
+	struct pnv_phb *phb = hose->private_data;
+	struct OpalIoP7IOCPhbErrorData *data;
+	int i;
+	long ret;
+
+	data = (struct OpalIoP7IOCPhbErrorData *)pci_err_diag;
+	ret = opal_pci_get_phb_diag_data(phb->opal_id, data, PAGE_SIZE, 1);
+	if (ret != OPAL_SUCCESS) {
+		pr_warning("%s: Failed to get diag-data for PHB#%x, ret=%ld\n",
+			__func__, hose->global_number, ret);
+		return;
+	}
+
+	pr_info("PHB#%x Diag-data\n\n", hose->global_number);
+	pr_info("  brdgCtl:              %08x\n", data->brdgCtl);
+
+	pr_info("  portStatusReg:        %08x\n", data->portStatusReg);
+	pr_info("  rootCmplxStatus:      %08x\n", data->rootCmplxStatus);
+	pr_info("  busAgentStatus:       %08x\n", data->busAgentStatus);
+
+	pr_info("  deviceStatus:         %08x\n", data->deviceStatus);
+	pr_info("  slotStatus:           %08x\n", data->slotStatus);
+	pr_info("  linkStatus:           %08x\n", data->linkStatus);
+	pr_info("  devCmdStatus:         %08x\n", data->devCmdStatus);
+	pr_info("  devSecStatus:         %08x\n", data->devSecStatus);
+
+	pr_info("  rootErrorStatus:      %08x\n", data->rootErrorStatus);
+	pr_info("  uncorrErrorStatus:    %08x\n", data->uncorrErrorStatus);
+	pr_info("  corrErrorStatus:      %08x\n", data->corrErrorStatus);
+	pr_info("  tlpHdr1:              %08x\n", data->tlpHdr1);
+	pr_info("  tlpHdr2:              %08x\n", data->tlpHdr2);
+	pr_info("  tlpHdr3:              %08x\n", data->tlpHdr3);
+	pr_info("  tlpHdr4:              %08x\n", data->tlpHdr4);
+	pr_info("  sourceId:             %08x\n", data->sourceId);
+
+	pr_info("  errorClass:           %016llx\n", data->errorClass);
+	pr_info("  correlator:           %016llx\n", data->correlator);
+	pr_info("  p7iocPlssr:           %016llx\n", data->p7iocPlssr);
+	pr_info("  p7iocCsr:             %016llx\n", data->p7iocCsr);
+	pr_info("  lemFir:               %016llx\n", data->lemFir);
+	pr_info("  lemErrorMask:         %016llx\n", data->lemErrorMask);
+	pr_info("  lemWOF:               %016llx\n", data->lemWOF);
+	pr_info("  phbErrorStatus:       %016llx\n", data->phbErrorStatus);
+	pr_info("  phbFirstErrorStatus:  %016llx\n", data->phbFirstErrorStatus);
+	pr_info("  phbErrorLog0:         %016llx\n", data->phbErrorLog0);
+	pr_info("  phbErrorLog1:         %016llx\n", data->phbErrorLog1);
+	pr_info("  mmioErrorStatus:      %016llx\n", data->mmioErrorStatus);
+	pr_info("  mmioFirstErrorStatus: %016llx\n", data->mmioFirstErrorStatus);
+	pr_info("  mmioErrorLog0:        %016llx\n", data->mmioErrorLog0);
+	pr_info("  mmioErrorLog1:        %016llx\n", data->mmioErrorLog1);
+	pr_info("  dma0ErrorStatus:      %016llx\n", data->dma0ErrorStatus);
+	pr_info("  dma0FirstErrorStatus: %016llx\n", data->dma0FirstErrorStatus);
+	pr_info("  dma0ErrorLog0:        %016llx\n", data->dma0ErrorLog0);
+	pr_info("  dma0ErrorLog1:        %016llx\n", data->dma0ErrorLog1);
+	pr_info("  dma1ErrorStatus:      %016llx\n", data->dma1ErrorStatus);
+	pr_info("  dma1FirstErrorStatus: %016llx\n", data->dma1FirstErrorStatus);
+	pr_info("  dma1ErrorLog0:        %016llx\n", data->dma1ErrorLog0);
+	pr_info("  dma1ErrorLog1:        %016llx\n", data->dma1ErrorLog1);
+
+	for (i = 0; i < OPAL_P7IOC_NUM_PEST_REGS; i++) {
+		if ((data->pestA[i] >> 63) == 0 &&
+		    (data->pestB[i] >> 63) == 0)
+			continue;
+
+		pr_info("  PE[%3d] PESTA:        %016llx\n", i, data->pestA[i]);
+		pr_info("          PESTB:        %016llx\n", data->pestB[i]);
+	}
+}
+
+/*
+ * Process PCI errors from IOC, PHB, or PE. Here's the list
+ * of expected error types and their severities, as well as
+ * the corresponding action.
+ *
+ * Type                 Severity                 Action
+ * OPAL_EEH_IOC_ERROR   OPAL_EEH_SEV_IOC_DEAD    panic
+ * OPAL_EEH_IOC_ERROR   OPAL_EEH_SEV_INF         diag_data
+ * OPAL_EEH_PHB_ERROR   OPAL_EEH_SEV_PHB_DEAD    panic
+ * OPAL_EEH_PHB_ERROR   OPAL_EEH_SEV_PHB_FENCED  eeh
+ * OPAL_EEH_PHB_ERROR   OPAL_EEH_SEV_INF         diag_data
+ * OPAL_EEH_PE_ERROR    OPAL_EEH_SEV_PE_ER       eeh
+ */
+static void pci_err_process(struct pci_controller *hose,
+			u16 err_type, u16 severity, u16 pe_no)
+{
+	PCI_ERR_DBG("PCI_ERR: Process error (%d, %d, %d) on PHB#%x\n",
+		err_type, severity, pe_no, hose->global_number);
+
+	switch (err_type) {
+	case OPAL_EEH_IOC_ERROR:
+		if (severity == OPAL_EEH_SEV_IOC_DEAD)
+			panic("Dead IOC of PHB#%x", hose->global_number);
+		else if (severity == OPAL_EEH_SEV_INF) {
+			pci_err_hub_diag_data(hose);
+			pci_err_release();
+		}
+
+		break;
+	case OPAL_EEH_PHB_ERROR:
+		if (severity == OPAL_EEH_SEV_PHB_DEAD)
+			panic("Dead PHB#%x", hose->global_number);
+		else if (severity == OPAL_EEH_SEV_PHB_FENCED)
+			pci_err_check_phb(hose);
+		else if (severity == OPAL_EEH_SEV_INF) {
+			pci_err_phb_diag_data(hose);
+			pci_err_release();
+		}
+
+		break;
+	case OPAL_EEH_PE_ERROR:
+		pci_err_check_pe(hose, pe_no);
+		break;
+	}
+}
+
+static int pci_err_handler(void *dummy)
+{
+	struct pnv_phb *phb;
+	struct pci_controller *hose, *tmp;
+	u64 frozen_pe_no;
+	u16 err_type, severity;
+	long ret;
+
+	while (!kthread_should_stop()) {
+		down(&pci_err_int_sem);
+		PCI_ERR_DBG("PCI_ERR: Get PCI error semaphore\n");
+
+		list_for_each_entry_safe(hose, tmp, &hose_list, list_node) {
+			phb = hose->private_data;
+restart:
+			pci_err_take();
+			ret = opal_pci_next_error(phb->opal_id,
+					&frozen_pe_no, &err_type, &severity);
+
+			/* If OPAL API returns error, we needn't proceed */
+			if (ret != OPAL_SUCCESS) {
+				PCI_ERR_DBG("PCI_ERR: Invalid return value on "
+					    "PHB#%x (0x%lx) from opal_pci_next_error\n",
+					    hose->global_number, ret);
+				pci_err_release();
+				continue;
+			}
+
+			/* If the PHB doesn't have error, stop processing */
+			if (err_type == OPAL_EEH_NO_ERROR ||
+			    severity == OPAL_EEH_SEV_NO_ERROR) {
+				PCI_ERR_DBG("PCI_ERR: No error found on PHB#%x\n",
+					hose->global_number);
+				pci_err_release();
+				continue;
+			}
+
+			/*
+			 * Process the error until there're no pending
+			 * errors on the specific PHB.
+			 */
+			pci_err_process(hose, err_type, severity, frozen_pe_no);
+			goto restart;
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * pci_err_init - Initialize PCI error handling component
+ *
+ * It should be done before the OPAL interrupts get registered,
+ * because their handler depends on this.
+ */
+static int __init pci_err_init(void)
+{
+	int ret = -ENOMEM;
+
+	if (!firmware_has_feature(FW_FEATURE_OPAL))
+		return ret;
+
+	pci_err_diag = (char *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
+	if (!pci_err_diag) {
+		pr_err("%s: Failed to alloc memory for diag data\n",
+			__func__);
+		return ret;
+	}
+
+	/* Initialize semaphore and start kthread */
+	sema_init(&pci_err_int_sem, 0);
+	sema_init(&pci_err_seq_sem, 1);
+	pci_err_thread = kthread_run(pci_err_handler, NULL, "PCI_ERR");
+	if (IS_ERR(pci_err_thread)) {
+		free_page((unsigned long)pci_err_diag);
+		ret = PTR_ERR(pci_err_thread);
+		pr_err("%s: Failed to start kthread, ret=%d\n",
+			__func__, ret);
+		return ret;
+	}
+
+	return 0;
+}
+
+arch_initcall(pci_err_init);
diff --git a/arch/powerpc/platforms/pseries/eeh_event.c b/arch/powerpc/platforms/pseries/eeh_event.c
index 1f86b80..e4c636e 100644
--- a/arch/powerpc/platforms/pseries/eeh_event.c
+++ b/arch/powerpc/platforms/pseries/eeh_event.c
@@ -84,6 +84,14 @@ static int eeh_event_handler(void * dummy)
 	eeh_handle_event(pe);
 	eeh_pe_state_clear(pe, EEH_PE_RECOVERING);
 
+	/*
+	 * If the event was caused by an error reporting IRQ,
+	 * we need to release the PCI error handler so that
+	 * subsequent events can be fired.
+	 */
+	if (event->flag & EEH_EVENT_INT)
+		pci_err_release();
+
 	kfree(event);
 	mutex_unlock(&eeh_event_mutex);
 
-- 
1.7.5.4

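The sequencing that pci_err_take() and pci_err_release() provide in the patch
above can be illustrated with a single-threaded model. This is a sketch under
stated assumptions, not the kernel implementation: model_sem is a plain
counter rather than a kernel semaphore, model_try_down() returns false where
the real kthread would sleep in down(), and all model_* names are
hypothetical.

```c
#include <stdbool.h>

/* Plain counter standing in for a kernel semaphore. */
struct model_sem {
	int count;
};

static void model_up(struct model_sem *s)
{
	s->count++;
}

/* Non-blocking stand-in for down(): false where the kthread would sleep. */
static bool model_try_down(struct model_sem *s)
{
	if (s->count == 0)
		return false;
	s->count--;
	return true;
}

struct model_state {
	struct model_sem seq_sem;	/* 1 when no EEH event is in flight */
	int pending_errors;		/* errors firmware would still report */
	int events_sent;		/* events handed to the EEH core */
};

/* One pass of the handler; true if an event was handed to the EEH core. */
static bool model_handle_one(struct model_state *st)
{
	if (!model_try_down(&st->seq_sem))
		return false;		/* previous event not released yet */

	if (st->pending_errors == 0) {
		model_up(&st->seq_sem);	/* nothing to report: release at once */
		return false;
	}

	st->pending_errors--;
	st->events_sent++;
	return true;			/* seq_sem stays held until release */
}
```

In the model, the EEH event handler calling pci_err_release() corresponds to
model_up(&st->seq_sem); until that happens, no second event is sent.
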

* [PATCH 23/23] powerpc/eeh: Add debugfs entry to inject errors
From: Gavin Shan @ 2013-05-30  8:24 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Gavin Shan
In-Reply-To: <1369902245-5886-1-git-send-email-shangw@linux.vnet.ibm.com>

This patch adds the debugfs entry powerpc/EEH/PHBx so that
the administrator can inject EEH errors into the specified PCI host
bridge for testing purposes.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/eeh-ioda.c |   36 ++++++++++++++++++++++++++++-
 1 files changed, 35 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c b/arch/powerpc/platforms/powernv/eeh-ioda.c
index ec5c524..4cc9db7 100644
--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
@@ -22,6 +22,7 @@
 
 #include <linux/bootmem.h>
 #include <linux/delay.h>
+#include <linux/debugfs.h>
 #include <linux/init.h>
 #include <linux/io.h>
 #include <linux/irq.h>
@@ -43,6 +44,29 @@
 #include "powernv.h"
 #include "pci.h"
 
+static struct dentry *ioda_eeh_dbgfs = NULL;
+
+static int ioda_eeh_dbgfs_set(void *data, u64 val)
+{
+	struct pci_controller *hose = data;
+	struct pnv_phb *phb = hose->private_data;
+
+	out_be64(phb->regs + 0xD10, val);
+	return 0;
+}
+
+static int ioda_eeh_dbgfs_get(void *data, u64 *val)
+{
+	struct pci_controller *hose = data;
+	struct pnv_phb *phb = hose->private_data;
+
+	*val = in_be64(phb->regs + 0xD10);
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(ioda_eeh_dbgfs_ops, ioda_eeh_dbgfs_get,
+			ioda_eeh_dbgfs_set, "0x%llx\n");
+
 /**
  * ioda_eeh_post_init - Chip dependent post initialization
  * @hose: PCI controller
@@ -54,10 +78,20 @@
 static int ioda_eeh_post_init(struct pci_controller *hose)
 {
 	struct pnv_phb *phb = hose->private_data;
+	char name[16];
+
+	/* Create EEH debugfs root if possible */
+	if (!ioda_eeh_dbgfs)
+		ioda_eeh_dbgfs = debugfs_create_dir("EEH", powerpc_debugfs_root);
 
 	/* FIXME: Enable it for PHB3 later */
-	if (phb->type == PNV_PHB_IODA1)
+	if (phb->type == PNV_PHB_IODA1) {
+		sprintf(name, "PHB%d", hose->global_number);
+		debugfs_create_file(name, 0600, ioda_eeh_dbgfs,
+				    hose, &ioda_eeh_dbgfs_ops);
+
 		phb->eeh_enabled = 1;
+	}
 
 	return 0;
 }
-- 
1.7.5.4


* [PATCH 19/23] powerpc/eeh: Initialization for PowerNV
From: Gavin Shan @ 2013-05-30  8:24 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Gavin Shan
In-Reply-To: <1369902245-5886-1-git-send-email-shangw@linux.vnet.ibm.com>

This patch initializes EEH for the PowerNV platform. Because the OPAL
APIs require the HUB ID, we need to track it in struct pnv_phb.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/pci-ioda.c   |   16 +++++++++++++---
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |    6 ++++--
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 9c9d15e..48b0940 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -973,6 +973,11 @@ static void pnv_pci_ioda_fixup(void)
 	pnv_pci_ioda_setup_PEs();
 	pnv_pci_ioda_setup_seg();
 	pnv_pci_ioda_setup_DMA();
+
+#ifdef CONFIG_EEH
+	eeh_addr_cache_build();
+	eeh_init();
+#endif
 }
 
 /*
@@ -1049,7 +1054,8 @@ static void pnv_pci_ioda_shutdown(struct pnv_phb *phb)
 		       OPAL_ASSERT_RESET);
 }
 
-void __init pnv_pci_init_ioda_phb(struct device_node *np, int ioda_type)
+void __init pnv_pci_init_ioda_phb(struct device_node *np,
+				  u64 hub_id, int ioda_type)
 {
 	struct pci_controller *hose;
 	static int primary = 1;
@@ -1087,6 +1093,7 @@ void __init pnv_pci_init_ioda_phb(struct device_node *np, int ioda_type)
 	hose->first_busno = 0;
 	hose->last_busno = 0xff;
 	hose->private_data = phb;
+	phb->hub_id = hub_id;
 	phb->opal_id = phb_id;
 	phb->type = ioda_type;
 
@@ -1172,6 +1179,9 @@ void __init pnv_pci_init_ioda_phb(struct device_node *np, int ioda_type)
 		phb->ioda.io_size, phb->ioda.io_segsize);
 
 	phb->hose->ops = &pnv_pci_ops;
+#ifdef CONFIG_EEH
+	phb->eeh_ops = &ioda_eeh_ops;
+#endif
 
 	/* Setup RID -> PE mapping function */
 	phb->bdfn_to_pe = pnv_ioda_bdfn_to_pe;
@@ -1212,7 +1222,7 @@ void __init pnv_pci_init_ioda_phb(struct device_node *np, int ioda_type)
 
 void pnv_pci_init_ioda2_phb(struct device_node *np)
 {
-	pnv_pci_init_ioda_phb(np, PNV_PHB_IODA2);
+	pnv_pci_init_ioda_phb(np, 0, PNV_PHB_IODA2);
 }
 
 void __init pnv_pci_init_ioda_hub(struct device_node *np)
@@ -1235,6 +1245,6 @@ void __init pnv_pci_init_ioda_hub(struct device_node *np)
 	for_each_child_of_node(np, phbn) {
 		/* Look for IODA1 PHBs */
 		if (of_device_is_compatible(phbn, "ibm,ioda-phb"))
-			pnv_pci_init_ioda_phb(phbn, PNV_PHB_IODA1);
+			pnv_pci_init_ioda_phb(phbn, hub_id, PNV_PHB_IODA1);
 	}
 }
diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index 92b37a0..ae72616 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -92,7 +92,7 @@ static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
 	set_iommu_table_base(&pdev->dev, &phb->p5ioc2.iommu_table);
 }
 
-static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np,
+static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
 					   void *tce_mem, u64 tce_size)
 {
 	struct pnv_phb *phb;
@@ -133,6 +133,7 @@ static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np,
 	phb->hose->first_busno = 0;
 	phb->hose->last_busno = 0xff;
 	phb->hose->private_data = phb;
+	phb->hub_id = hub_id;
 	phb->opal_id = phb_id;
 	phb->type = PNV_PHB_P5IOC2;
 	phb->model = PNV_PHB_MODEL_P5IOC2;
@@ -226,7 +227,8 @@ void __init pnv_pci_init_p5ioc2_hub(struct device_node *np)
 	for_each_child_of_node(np, phbn) {
 		if (of_device_is_compatible(phbn, "ibm,p5ioc2-pcix") ||
 		    of_device_is_compatible(phbn, "ibm,p5ioc2-pciex")) {
-			pnv_pci_init_p5ioc2_phb(phbn, tce_mem, tce_per_phb);
+			pnv_pci_init_p5ioc2_phb(phbn, hub_id,
+					tce_mem, tce_per_phb);
 			tce_mem += tce_per_phb;
 		}
 	}
-- 
1.7.5.4


* [PATCH 22/23] powerpc/eeh: Connect EEH error interrupt handle
From: Gavin Shan @ 2013-05-30  8:24 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Gavin Shan
In-Reply-To: <1369902245-5886-1-git-send-email-shangw@linux.vnet.ibm.com>

The EEH error interrupts should have been exported by firmware
through the device tree. The OS has already installed the interrupt
handler (opal_interrupt()) for them. This patch checks whether there
are any pending EEH errors and, if so, wakes up the kernel thread to
process them.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/opal.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index 628c564..cca78c9 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -18,6 +18,8 @@
 #include <linux/slab.h>
 #include <asm/opal.h>
 #include <asm/firmware.h>
+#include <asm/io.h>
+#include <asm/eeh.h>
 
 #include "powernv.h"
 
@@ -296,6 +298,10 @@ static irqreturn_t opal_interrupt(int irq, void *data)
 	uint64_t events;
 
 	opal_handle_interrupt(virq_to_hw(irq), &events);
+#ifdef CONFIG_EEH
+	if (events & OPAL_EVENT_PCI_ERROR)
+		pci_err_event();
+#endif
 
 	/* XXX TODO: Do something with the events */
 
-- 
1.7.5.4


* [PATCH 17/23] powerpc/eeh: I/O chip PE log and bridge setup
From: Gavin Shan @ 2013-05-30  8:23 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Gavin Shan
In-Reply-To: <1369902245-5886-1-git-send-email-shangw@linux.vnet.ibm.com>

The patch adds backends to retrieve error log and configure p2p
bridges for the indicated PE.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/eeh-ioda.c |   59 ++++++++++++++++++++++++++++-
 1 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c b/arch/powerpc/platforms/powernv/eeh-ioda.c
index be2c336..ec5c524 100644
--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
@@ -456,11 +456,66 @@ static int ioda_eeh_reset(struct eeh_pe *pe, int option)
 	return ret;
 }
 
+/**
+ * ioda_eeh_get_log - Retrieve error log
+ * @pe: EEH PE
+ * @severity: Severity level of the log
+ * @drv_log: buffer to store the log
+ * @len: space of the log buffer
+ *
+ * The function is used to retrieve error log from P7IOC.
+ */
+static int ioda_eeh_get_log(struct eeh_pe *pe, int severity,
+		char *drv_log, unsigned long len)
+{
+	s64 ret;
+	u8 extra_log = !!(pe->type & EEH_PE_PHB);
+	unsigned long flags;
+	struct pci_controller *hose = pe->phb;
+	struct pnv_phb *phb = hose->private_data;
+
+	spin_lock_irqsave(&phb->lock, flags);
+
+	ret = opal_pci_get_phb_diag_data(phb->opal_id,
+			phb->diag.blob, PNV_PCI_DIAG_BUF_SIZE,
+			!extra_log);
+	if (ret) {
+		spin_unlock_irqrestore(&phb->lock, flags);
+		pr_warning("%s: Failed to retrieve log for PHB#%x-PE#%x\n",
+			__func__, hose->global_number, pe->addr);
+		return -EIO;
+	}
+
+	/*
+	 * FIXME: We probably need to log the error somewhere.
+	 * Let's add that in the future.
+	 */
+	/* pr_info("%s", phb->diag.blob); */
+
+	spin_unlock_irqrestore(&phb->lock, flags);
+
+	return 0;
+}
+
+/**
+ * ioda_eeh_configure_bridge - Configure the PCI bridges for the indicated PE
+ * @pe: EEH PE
+ *
+ * A particular PE might include PCI bridges. In order
+ * to make the PE work properly, those PCI bridges should be configured
+ * correctly. However, we need to do nothing on P7IOC since the reset
+ * function already does everything this function would have to cover.
+ */
+static int ioda_eeh_configure_bridge(struct eeh_pe *pe)
+{
+	return 0;
+}
+
 struct pnv_eeh_ops ioda_eeh_ops = {
 	.post_init		= ioda_eeh_post_init,
 	.set_option		= ioda_eeh_set_option,
 	.get_state		= ioda_eeh_get_state,
 	.reset			= ioda_eeh_reset,
-	.get_log		= NULL,
-	.configure_bridge	= NULL
+	.get_log		= ioda_eeh_get_log,
+	.configure_bridge	= ioda_eeh_configure_bridge
 };
-- 
1.7.5.4

