LinuxPPC-Dev Archive on lore.kernel.org


* [PATCH v2 0/2] Support of the kmcoge4 board
From: Valentin Longchamp @ 2014-02-11 11:50 UTC (permalink / raw)
  To: Scott Wood, linuxppc-dev; +Cc: Valentin Longchamp, devicetree

This series adds support for Keymile's COGE4 board, called kmcoge4. This
board is the reference design for further designs at Keymile around the
P2040/P2041 SoCs from Freescale. This reference design is internally
called kmp204x.

Changes in v2:
- add a patch so that the Zarlink vendor prefix is defined
- add some nodes on the localbus CS when possible
- only use the corenet_generic machine and add kmcoge4 to the supported
  boards instead of defining a new kmp204x machine
- set better and more precise device nodes for the spi devices
- remove the partition layout for the spi_flash@0

Valentin Longchamp (2):
  devicetree: bindings: add Zarlink to the vendor prefixes
  powerpc/mpc85xx: add support for Keymile's kmcoge4 board

 .../devicetree/bindings/vendor-prefixes.txt        |   1 +
 arch/powerpc/boot/dts/kmcoge4.dts                  | 161 +++++++++++++++
 arch/powerpc/configs/85xx/kmp204x_defconfig        | 227 +++++++++++++++++++++
 arch/powerpc/platforms/85xx/Kconfig                |   2 +-
 arch/powerpc/platforms/85xx/corenet_generic.c      |   3 +-
 5 files changed, 392 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/boot/dts/kmcoge4.dts
 create mode 100644 arch/powerpc/configs/85xx/kmp204x_defconfig

-- 
1.8.0.1

^ permalink raw reply

* Re: [PATCH v2] powerpc ticket locks
From: Torsten Duwe @ 2014-02-11 10:40 UTC (permalink / raw)
  To: Raghavendra KT
  Cc: Tom Musta, Peter Zijlstra, Raghavendra KT,
	Linux Kernel Mailing List, Paul Mackerras, Anton Blanchard,
	Scott Wood, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <CAC4Lta2Pssu5QY_QLmy5C3Fm2G3fwpiuc5BNOLBnQTmmRH+W4Q@mail.gmail.com>

On Tue, Feb 11, 2014 at 03:23:51PM +0530, Raghavendra KT wrote:
> How important is it to have holder information for PPC? From my
> previous experiment on x86, it was lock-waiter preemption that was
> problematic rather than lock-holder preemption.

It's something very special to IBM pSeries: the hypervisor can assign
fractions of physical CPUs to guests. Sometimes a guest with 4 quarter
CPUs will be faster than 1 monoprocessor. (correct me if I'm wrong).

The directed yield resolves the silly situation when holder and waiter
reside on the same physical CPU, as I understand it.

x86 has nothing comparable.

	Torsten

^ permalink raw reply

* [PATCH 3/3] mm: Use ptep/pmdp_set_numa for updating _PAGE_NUMA bit
From: Aneesh Kumar K.V @ 2014-02-11 10:34 UTC (permalink / raw)
  To: benh, paulus, riel, mgorman, mpe; +Cc: linux-mm, linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1392114895-14997-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Archs like ppc64 don't do a TLB flush in the set_pte/pmd functions. ppc64 also doesn't
implement flush_tlb_range. ppc64 requires the TLB flushing to be batched within ptl locks.
The reason is to ensure that the hash page table stays in sync with the Linux page table.
We track the hpte index in the Linux pte, and if we clear it without flushing the hash and
drop the ptl lock, another CPU can update the pte and we can end up with a double hash.
We also want to keep set_pte_at simple by not requiring it to do a hash flush, for
performance reasons. Hence we cannot use these functions while updating the _PAGE_NUMA bit.
Add new functions for marking a pte/pmd numa.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pgtable.h | 22 ++++++++++++++++++++++
 include/asm-generic/pgtable.h      | 24 ++++++++++++++++++++++++
 mm/huge_memory.c                   |  9 ++-------
 mm/mprotect.c                      |  4 +---
 4 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index f83b6f3e1b39..3ebb188c3ff5 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -75,12 +75,34 @@ static inline pte_t pte_mknuma(pte_t pte)
 	return pte;
 }
 
+#define ptep_set_numa ptep_set_numa
+static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
+				 pte_t *ptep)
+{
+	if ((pte_val(*ptep) & _PAGE_PRESENT) == 0)
+		VM_BUG_ON(1);
+
+	pte_update(mm, addr, ptep, _PAGE_PRESENT, _PAGE_NUMA, 0);
+	return;
+}
+
 #define pmd_numa pmd_numa
 static inline int pmd_numa(pmd_t pmd)
 {
 	return pte_numa(pmd_pte(pmd));
 }
 
+#define pmdp_set_numa pmdp_set_numa
+static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr,
+				 pmd_t *pmdp)
+{
+	if ((pmd_val(*pmdp) & _PAGE_PRESENT) == 0)
+		VM_BUG_ON(1);
+
+	pmd_hugepage_update(mm, addr, pmdp, _PAGE_PRESENT, _PAGE_NUMA);
+	return;
+}
+
 #define pmd_mknonnuma pmd_mknonnuma
 static inline pmd_t pmd_mknonnuma(pmd_t pmd)
 {
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 8e4f41d9af4d..93fdb5315a0d 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -669,6 +669,18 @@ static inline int pmd_numa(pmd_t pmd)
 }
 #endif
 
+#ifndef pmdp_set_numa
+static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr,
+				 pmd_t *pmdp)
+{
+	pmd_t pmd = *pmdp;
+
+	pmd = pmd_mknuma(pmd);
+	set_pmd_at(mm, addr, pmdp, pmd);
+	return;
+}
+#endif
+
 /*
  * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
  * because they're called by the NUMA hinting minor page fault. If we
@@ -701,6 +713,18 @@ static inline pte_t pte_mknuma(pte_t pte)
 }
 #endif
 
+#ifndef ptep_set_numa
+static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
+				 pte_t *ptep)
+{
+	pte_t ptent = *ptep;
+
+	ptent = pte_mknuma(ptent);
+	set_pte_at(mm, addr, ptep, ptent);
+	return;
+}
+#endif
+
 #ifndef pmd_mknuma
 static inline pmd_t pmd_mknuma(pmd_t pmd)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 82166bf974e1..da23eb96779f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1545,6 +1545,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 				entry = pmd_mknonnuma(entry);
 			entry = pmd_modify(entry, newprot);
 			ret = HPAGE_PMD_NR;
+			set_pmd_at(mm, addr, pmd, entry);
 			BUG_ON(pmd_write(entry));
 		} else {
 			struct page *page = pmd_page(*pmd);
@@ -1557,16 +1558,10 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			 */
 			if (!is_huge_zero_page(page) &&
 			    !pmd_numa(*pmd)) {
-				entry = *pmd;
-				entry = pmd_mknuma(entry);
+				pmdp_set_numa(mm, addr, pmd);
 				ret = HPAGE_PMD_NR;
 			}
 		}
-
-		/* Set PMD if cleared earlier */
-		if (ret == HPAGE_PMD_NR)
-			set_pmd_at(mm, addr, pmd, entry);
-
 		spin_unlock(ptl);
 	}
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 33eab902f10e..769a67a15803 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -69,12 +69,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			} else {
 				struct page *page;
 
-				ptent = *pte;
 				page = vm_normal_page(vma, addr, oldpte);
 				if (page && !PageKsm(page)) {
 					if (!pte_numa(oldpte)) {
-						ptent = pte_mknuma(ptent);
-						set_pte_at(mm, addr, pte, ptent);
+						ptep_set_numa(mm, addr, pte);
 						updated = true;
 					}
 				}
-- 
1.8.3.2

^ permalink raw reply related
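
The patch above relies on the "#define name name" / "#ifndef name" idiom so that an architecture can supply its own ptep_set_numa() while everyone else gets a generic fallback. The following is a minimal user-space sketch of that idiom, not kernel code. The sk_ prefixed names, the flag values and the single-word pte type are all made up for illustration, and the hash flush that the real ppc64 helper performs is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag values, chosen for illustration only. */
#define SK_PAGE_PRESENT	0x1u
#define SK_PAGE_NUMA	0x2u

/* Hypothetical stand-in for pte_t. */
typedef struct { uint64_t val; } sk_pte_t;

/* Generic helper: returns a copy with PRESENT cleared and NUMA set. */
static inline sk_pte_t sk_pte_mknuma(sk_pte_t pte)
{
	pte.val = (pte.val & ~(uint64_t)SK_PAGE_PRESENT) | SK_PAGE_NUMA;
	return pte;
}

/*
 * "Arch" version. Defining the macro to its own name tells the
 * generic header below that an implementation already exists.
 */
#define sk_ptep_set_numa sk_ptep_set_numa
static inline void sk_ptep_set_numa(sk_pte_t *ptep)
{
	/* Plays the role of the VM_BUG_ON in the patch. */
	assert(ptep->val & SK_PAGE_PRESENT);
	/* Clear PRESENT and set NUMA in one store. */
	ptep->val = (ptep->val & ~(uint64_t)SK_PAGE_PRESENT) | SK_PAGE_NUMA;
}

/* "Generic" fallback, compiled only when no arch version was defined. */
#ifndef sk_ptep_set_numa
static inline void sk_ptep_set_numa(sk_pte_t *ptep)
{
	sk_pte_t ptent = *ptep;

	ptent = sk_pte_mknuma(ptent);
	*ptep = ptent;
}
#endif
```

Because the macro is defined before the #ifndef, only the first definition is compiled here. Removing the "arch" block would silently select the fallback without any change at the call sites.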

* [PATCH 0/3] powerpc: Fix random application crashes with NUMA_BALANCING enabled
From: Aneesh Kumar K.V @ 2014-02-11 10:34 UTC (permalink / raw)
  To: benh, paulus, riel, mgorman, mpe; +Cc: linux-mm, linuxppc-dev

Hello,

This patch series fixes random application crashes observed on ppc64 with NUMA
balancing enabled. Without the patches we see crashes like:

anacron[14551]: unhandled signal 11 at 0000000000000041 nip 000000003cfd54b4 lr 000000003cfd5464 code 30001
anacron[14599]: unhandled signal 11 at 0000000000000041 nip 000000003efc54b4 lr 000000003efc5464 code 30001

-aneesh

^ permalink raw reply

* [PATCH 1/3] powerpc: mm: Add new set flag argument to pte/pmd update function
From: Aneesh Kumar K.V @ 2014-02-11 10:34 UTC (permalink / raw)
  To: benh, paulus, riel, mgorman, mpe; +Cc: linux-mm, linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1392114895-14997-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

We will use this later to set the _PAGE_NUMA bit.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/hugetlb.h       |  2 +-
 arch/powerpc/include/asm/pgtable-ppc64.h | 26 +++++++++++++++-----------
 arch/powerpc/mm/pgtable_64.c             | 12 +++++++-----
 arch/powerpc/mm/subpage-prot.c           |  2 +-
 4 files changed, 24 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/hugetlb.h b/arch/powerpc/include/asm/hugetlb.h
index d750336b171d..623f2971ce0e 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -127,7 +127,7 @@ static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 					    unsigned long addr, pte_t *ptep)
 {
 #ifdef CONFIG_PPC64
-	return __pte(pte_update(mm, addr, ptep, ~0UL, 1));
+	return __pte(pte_update(mm, addr, ptep, ~0UL, 0, 1));
 #else
 	return __pte(pte_update(ptep, ~0UL, 0));
 #endif
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index bc141c950b1e..eb9261024f51 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -195,6 +195,7 @@ extern void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
 static inline unsigned long pte_update(struct mm_struct *mm,
 				       unsigned long addr,
 				       pte_t *ptep, unsigned long clr,
+				       unsigned long set,
 				       int huge)
 {
 #ifdef PTE_ATOMIC_UPDATES
@@ -205,14 +206,15 @@ static inline unsigned long pte_update(struct mm_struct *mm,
 	andi.	%1,%0,%6\n\
 	bne-	1b \n\
 	andc	%1,%0,%4 \n\
+	or	%1,%1,%7\n\
 	stdcx.	%1,0,%3 \n\
 	bne-	1b"
 	: "=&r" (old), "=&r" (tmp), "=m" (*ptep)
-	: "r" (ptep), "r" (clr), "m" (*ptep), "i" (_PAGE_BUSY)
+	: "r" (ptep), "r" (clr), "m" (*ptep), "i" (_PAGE_BUSY), "r" (set)
 	: "cc" );
 #else
 	unsigned long old = pte_val(*ptep);
-	*ptep = __pte(old & ~clr);
+	*ptep = __pte((old & ~clr) | set);
 #endif
 	/* huge pages use the old page table lock */
 	if (!huge)
@@ -231,9 +233,9 @@ static inline int __ptep_test_and_clear_young(struct mm_struct *mm,
 {
 	unsigned long old;
 
-       	if ((pte_val(*ptep) & (_PAGE_ACCESSED | _PAGE_HASHPTE)) == 0)
+	if ((pte_val(*ptep) & (_PAGE_ACCESSED | _PAGE_HASHPTE)) == 0)
 		return 0;
-	old = pte_update(mm, addr, ptep, _PAGE_ACCESSED, 0);
+	old = pte_update(mm, addr, ptep, _PAGE_ACCESSED, 0, 0);
 	return (old & _PAGE_ACCESSED) != 0;
 }
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
@@ -252,7 +254,7 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr,
 	if ((pte_val(*ptep) & _PAGE_RW) == 0)
 		return;
 
-	pte_update(mm, addr, ptep, _PAGE_RW, 0);
+	pte_update(mm, addr, ptep, _PAGE_RW, 0, 0);
 }
 
 static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
@@ -261,7 +263,7 @@ static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
 	if ((pte_val(*ptep) & _PAGE_RW) == 0)
 		return;
 
-	pte_update(mm, addr, ptep, _PAGE_RW, 1);
+	pte_update(mm, addr, ptep, _PAGE_RW, 0, 1);
 }
 
 /*
@@ -284,14 +286,14 @@ static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long addr, pte_t *ptep)
 {
-	unsigned long old = pte_update(mm, addr, ptep, ~0UL, 0);
+	unsigned long old = pte_update(mm, addr, ptep, ~0UL, 0, 0);
 	return __pte(old);
 }
 
 static inline void pte_clear(struct mm_struct *mm, unsigned long addr,
 			     pte_t * ptep)
 {
-	pte_update(mm, addr, ptep, ~0UL, 0);
+	pte_update(mm, addr, ptep, ~0UL, 0, 0);
 }
 
 
@@ -506,7 +508,9 @@ extern int pmdp_set_access_flags(struct vm_area_struct *vma,
 
 extern unsigned long pmd_hugepage_update(struct mm_struct *mm,
 					 unsigned long addr,
-					 pmd_t *pmdp, unsigned long clr);
+					 pmd_t *pmdp,
+					 unsigned long clr,
+					 unsigned long set);
 
 static inline int __pmdp_test_and_clear_young(struct mm_struct *mm,
 					      unsigned long addr, pmd_t *pmdp)
@@ -515,7 +519,7 @@ static inline int __pmdp_test_and_clear_young(struct mm_struct *mm,
 
 	if ((pmd_val(*pmdp) & (_PAGE_ACCESSED | _PAGE_HASHPTE)) == 0)
 		return 0;
-	old = pmd_hugepage_update(mm, addr, pmdp, _PAGE_ACCESSED);
+	old = pmd_hugepage_update(mm, addr, pmdp, _PAGE_ACCESSED, 0);
 	return ((old & _PAGE_ACCESSED) != 0);
 }
 
@@ -542,7 +546,7 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long addr,
 	if ((pmd_val(*pmdp) & _PAGE_RW) == 0)
 		return;
 
-	pmd_hugepage_update(mm, addr, pmdp, _PAGE_RW);
+	pmd_hugepage_update(mm, addr, pmdp, _PAGE_RW, 0);
 }
 
 #define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 65b7b65e8708..62bf5e8e78da 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -510,7 +510,8 @@ int pmdp_set_access_flags(struct vm_area_struct *vma, unsigned long address,
 }
 
 unsigned long pmd_hugepage_update(struct mm_struct *mm, unsigned long addr,
-				  pmd_t *pmdp, unsigned long clr)
+				  pmd_t *pmdp, unsigned long clr,
+				  unsigned long set)
 {
 
 	unsigned long old, tmp;
@@ -526,14 +527,15 @@ unsigned long pmd_hugepage_update(struct mm_struct *mm, unsigned long addr,
 		andi.	%1,%0,%6\n\
 		bne-	1b \n\
 		andc	%1,%0,%4 \n\
+		or	%1,%1,%7\n\
 		stdcx.	%1,0,%3 \n\
 		bne-	1b"
 	: "=&r" (old), "=&r" (tmp), "=m" (*pmdp)
-	: "r" (pmdp), "r" (clr), "m" (*pmdp), "i" (_PAGE_BUSY)
+	: "r" (pmdp), "r" (clr), "m" (*pmdp), "i" (_PAGE_BUSY), "r" (set)
 	: "cc" );
 #else
 	old = pmd_val(*pmdp);
-	*pmdp = __pmd(old & ~clr);
+	*pmdp = __pmd((old & ~clr) | set);
 #endif
 	if (old & _PAGE_HASHPTE)
 		hpte_do_hugepage_flush(mm, addr, pmdp);
@@ -708,7 +710,7 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 		     pmd_t *pmdp)
 {
-	pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT);
+	pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, 0);
 }
 
 /*
@@ -835,7 +837,7 @@ pmd_t pmdp_get_and_clear(struct mm_struct *mm,
 	unsigned long old;
 	pgtable_t *pgtable_slot;
 
-	old = pmd_hugepage_update(mm, addr, pmdp, ~0UL);
+	old = pmd_hugepage_update(mm, addr, pmdp, ~0UL, 0);
 	old_pmd = __pmd(old);
 	/*
 	 * We have pmd == none and we are holding page_table_lock.
diff --git a/arch/powerpc/mm/subpage-prot.c b/arch/powerpc/mm/subpage-prot.c
index a770df2dae70..6c0b1f5f8d2c 100644
--- a/arch/powerpc/mm/subpage-prot.c
+++ b/arch/powerpc/mm/subpage-prot.c
@@ -78,7 +78,7 @@ static void hpte_flush_range(struct mm_struct *mm, unsigned long addr,
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
 	for (; npages > 0; --npages) {
-		pte_update(mm, addr, pte, 0, 0);
+		pte_update(mm, addr, pte, 0, 0, 0);
 		addr += PAGE_SIZE;
 		++pte;
 	}
-- 
1.8.3.2

^ permalink raw reply related
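
The core of patch 1/3 is that pte_update() and pmd_hugepage_update() now compute (old & ~clr) | set in one atomic step and still return the old value. A rough user-space sketch of that semantic follows. It assumes a C11 compare-exchange loop as a stand-in for the ldarx/stdcx. sequence, the name sk_update_bits is hypothetical, and the _PAGE_BUSY wait and the hash flush are deliberately omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Atomically clear the bits in 'clr', then set the bits in 'set'.
 * Returns the value seen before the update, as pte_update() does.
 */
static uint64_t sk_update_bits(_Atomic uint64_t *word, uint64_t clr,
			       uint64_t set)
{
	uint64_t old, new;

	assert(word);
	old = atomic_load(word);
	do {
		/* Same expression as the non-atomic branch in the patch. */
		new = (old & ~clr) | set;
		/* On failure 'old' is reloaded and the loop retries. */
	} while (!atomic_compare_exchange_weak(word, &old, new));

	return old;
}
```

Existing callers that only clear bits would pass 0 for set, which is what the mechanical "0" additions in the diff do. A caller such as the later ptep_set_numa() passes one flag in clr and another in set.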

* [PATCH 2/3] mm: dirty accountable change only apply to non prot numa case
From: Aneesh Kumar K.V @ 2014-02-11 10:34 UTC (permalink / raw)
  To: benh, paulus, riel, mgorman, mpe; +Cc: linux-mm, linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1392114895-14997-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

The dirty accountable change only applies to the non prot_numa case, so move it
inside that branch of the if.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 mm/mprotect.c | 21 +++++++--------------
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 7332c1785744..33eab902f10e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -58,6 +58,13 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				if (pte_numa(ptent))
 					ptent = pte_mknonnuma(ptent);
 				ptent = pte_modify(ptent, newprot);
+				/*
+				 * Avoid taking write faults for pages we
+				 * know to be dirty.
+				 */
+				if (dirty_accountable && pte_dirty(ptent))
+					ptent = pte_mkwrite(ptent);
+				ptep_modify_prot_commit(mm, addr, pte, ptent);
 				updated = true;
 			} else {
 				struct page *page;
@@ -72,22 +79,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 					}
 				}
 			}
-
-			/*
-			 * Avoid taking write faults for pages we know to be
-			 * dirty.
-			 */
-			if (dirty_accountable && pte_dirty(ptent)) {
-				ptent = pte_mkwrite(ptent);
-				updated = true;
-			}
-
 			if (updated)
 				pages++;
-
-			/* Only !prot_numa always clears the pte */
-			if (!prot_numa)
-				ptep_modify_prot_commit(mm, addr, pte, ptent);
 		} else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
 
-- 
1.8.3.2

^ permalink raw reply related

* Re: [PATCH V4 2/3] tick/cpuidle: Initialize hrtimer mode of broadcast
From: Daniel Lezcano @ 2014-02-11 10:16 UTC (permalink / raw)
  To: Preeti U Murthy, linux-pm, peterz, benh, rafael.j.wysocki,
	linux-kernel, tglx, linuxppc-dev, mingo
  Cc: deepthi, fweisbec, paulus, srivatsa.bhat, paulmck
In-Reply-To: <20140207080632.17187.80532.stgit@preeti.in.ibm.com>

On 02/07/2014 09:06 AM, Preeti U Murthy wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> On some architectures, in certain CPU deep idle states the local timers stop.
> An external clock device is used to wakeup these CPUs. The kernel support for the
> wakeup of these CPUs is provided by the tick broadcast framework by using the
> external clock device as the wakeup source.
>
> However not all implementations of architectures provide such an external
> clock device. This patch includes support in the broadcast framework to handle
> the wakeup of the CPUs in deep idle states on such systems by queuing a hrtimer
> on one of the CPUs, which is meant to handle the wakeup of CPUs in deep idle states.
>
> This patchset introduces a pseudo clock device which can be registered by the
> archs as tick_broadcast_device in the absence of a real external clock
> device. Once registered, the broadcast framework will work as is for these
> architectures as long as the archs take care of the BROADCAST_ENTER
> notification failing for one of the CPUs. This CPU is made the stand by CPU to
> handle wakeup of the CPUs in deep idle and it *must not enter deep idle states*.
>
> The CPU with the earliest wakeup is chosen to be this CPU. Hence this way the
> stand by CPU dynamically moves around and so does the hrtimer which is queued
> to trigger at the next earliest wakeup time. This is consistent with the case where
> an external clock device is present. The smp affinity of this clock device is
> set to the CPU with the earliest wakeup.

Hi Preeti,

jumping a bit late in the thread...

Setting the smp affinity on the earliest timer should be handled
automatically with the CLOCK_EVT_FEAT_DYNIRQ flag. Did you look at using
this flag?

Another comment is about the overall approach. We enter the cpuidle
framework with a specific state to go to, and it is the tick framework
telling us we mustn't go to this state. IMO the logic is wrong; the
decision not to enter this state should be made somewhere else.

Why don't you create a cpuidle driver with the shallow idle states
assigned to a cpu (let's say cpu0) and another one with all the deeper
idle states for the rest of the cpus? The multiple cpuidle driver
support makes this possible. The timer won't be moving around and a cpu
will be dedicated to act as the broadcast timer.

Wouldn't that make sense and be less intrusive than the patchset you proposed?


> This patchset handles the hotplug of
> the stand by CPU as well by moving the hrtimer on to the CPU handling the CPU_DEAD
> notification.
>
> Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
> [Added Changelog and code to handle reprogramming of hrtimer]
> ---
>
>   include/linux/clockchips.h           |    9 +++
>   kernel/time/Makefile                 |    2 -
>   kernel/time/tick-broadcast-hrtimer.c |  105 ++++++++++++++++++++++++++++++++++
>   kernel/time/tick-broadcast.c         |   54 +++++++++++++++++
>   4 files changed, 166 insertions(+), 4 deletions(-)
>   create mode 100644 kernel/time/tick-broadcast-hrtimer.c
>
> diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
> index e0c5a6c..dbe9e14 100644
> --- a/include/linux/clockchips.h
> +++ b/include/linux/clockchips.h
> @@ -62,6 +62,11 @@ enum clock_event_mode {
>   #define CLOCK_EVT_FEAT_DYNIRQ		0x000020
>   #define CLOCK_EVT_FEAT_PERCPU		0x000040
>
> +/*
> + * Clockevent device is based on a hrtimer for broadcast
> + */
> +#define CLOCK_EVT_FEAT_HRTIMER		0x000080
> +
>   /**
>    * struct clock_event_device - clock event device descriptor
>    * @event_handler:	Assigned by the framework to be called by the low
> @@ -83,6 +88,7 @@ enum clock_event_mode {
>    * @name:		ptr to clock event name
>    * @rating:		variable to rate clock event devices
>    * @irq:		IRQ number (only for non CPU local devices)
> + * @bound_on:		Bound on CPU
>    * @cpumask:		cpumask to indicate for which CPUs this device works
>    * @list:		list head for the management code
>    * @owner:		module reference
> @@ -113,6 +119,7 @@ struct clock_event_device {
>   	const char		*name;
>   	int			rating;
>   	int			irq;
> +	int			bound_on;
>   	const struct cpumask	*cpumask;
>   	struct list_head	list;
>   	struct module		*owner;
> @@ -180,9 +187,11 @@ extern int tick_receive_broadcast(void);
>   #endif
>
>   #if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_TICK_ONESHOT)
> +extern void tick_setup_hrtimer_broadcast(void);
>   extern int tick_check_broadcast_expired(void);
>   #else
>   static inline int tick_check_broadcast_expired(void) { return 0; }
> +static inline void tick_setup_hrtimer_broadcast(void) { }
>   #endif
>
>   #ifdef CONFIG_GENERIC_CLOCKEVENTS
> diff --git a/kernel/time/Makefile b/kernel/time/Makefile
> index 9250130..06151ef 100644
> --- a/kernel/time/Makefile
> +++ b/kernel/time/Makefile
> @@ -3,7 +3,7 @@ obj-y += timeconv.o posix-clock.o alarmtimer.o
>
>   obj-$(CONFIG_GENERIC_CLOCKEVENTS_BUILD)		+= clockevents.o
>   obj-$(CONFIG_GENERIC_CLOCKEVENTS)		+= tick-common.o
> -obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST)	+= tick-broadcast.o
> +obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST)	+= tick-broadcast.o tick-broadcast-hrtimer.o
>   obj-$(CONFIG_GENERIC_SCHED_CLOCK)		+= sched_clock.o
>   obj-$(CONFIG_TICK_ONESHOT)			+= tick-oneshot.o
>   obj-$(CONFIG_TICK_ONESHOT)			+= tick-sched.o
> diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
> new file mode 100644
> index 0000000..af1e119
> --- /dev/null
> +++ b/kernel/time/tick-broadcast-hrtimer.c
> @@ -0,0 +1,105 @@
> +/*
> + * linux/kernel/time/tick-broadcast-hrtimer.c
> + * This file emulates a local clock event device
> + * via a pseudo clock device.
> + */
> +#include <linux/cpu.h>
> +#include <linux/err.h>
> +#include <linux/hrtimer.h>
> +#include <linux/interrupt.h>
> +#include <linux/percpu.h>
> +#include <linux/profile.h>
> +#include <linux/clockchips.h>
> +#include <linux/sched.h>
> +#include <linux/smp.h>
> +#include <linux/module.h>
> +
> +#include "tick-internal.h"
> +
> +static struct hrtimer bctimer;
> +
> +static void bc_set_mode(enum clock_event_mode mode,
> +			struct clock_event_device *bc)
> +{
> +	switch (mode) {
> +	case CLOCK_EVT_MODE_SHUTDOWN:
> +		/*
> +		 * Note, we cannot cancel the timer here as we might
> +		 * run into the following live lock scenario:
> +		 *
> +		 * cpu 0		cpu1
> +		 * lock(broadcast_lock);
> +		 *			hrtimer_interrupt()
> +		 *			bc_handler()
> +		 *			   tick_handle_oneshot_broadcast();
> +		 *			    lock(broadcast_lock);
> +		 * hrtimer_cancel()
> +		 *  wait_for_callback()
> +		 */
> +		hrtimer_try_to_cancel(&bctimer);
> +		break;
> +	default:
> +		break;
> +	}
> +}
> +
> +/*
> + * This is called from the guts of the broadcast code when the cpu
> + * which is about to enter idle has the earliest broadcast timer event.
> + */
> +static int bc_set_next(ktime_t expires, struct clock_event_device *bc)
> +{
> +	/*
> +	 * We try to cancel the timer first. If the callback is on
> +	 * flight on some other cpu then we let it handle it. If we
> +	 * were able to cancel the timer nothing can rearm it as we
> +	 * own broadcast_lock.
> +	 *
> +	 * However we can also be called from the event handler of
> +	 * ce_broadcast_hrtimer itself when it expires. We cannot therefore
> +	 * restart the timer since it is on flight on the same CPU. But
> +	 * due to the same reason we can reset it.
> +	 */
> +	if (hrtimer_try_to_cancel(&bctimer) >= 0) {
> +		hrtimer_start(&bctimer, expires, HRTIMER_MODE_ABS_PINNED);
> +		/* Bind the "device" to the cpu */
> +		bc->bound_on = smp_processor_id();
> +	} else if (bc->bound_on == smp_processor_id()) {
> +		hrtimer_set_expires(&bctimer, expires);
> +	}
> +	return 0;
> +}
> +
> +static struct clock_event_device ce_broadcast_hrtimer = {
> +	.set_mode		= bc_set_mode,
> +	.set_next_ktime		= bc_set_next,
> +	.features		= CLOCK_EVT_FEAT_ONESHOT |
> +				  CLOCK_EVT_FEAT_KTIME |
> +				  CLOCK_EVT_FEAT_HRTIMER,
> +	.rating			= 0,
> +	.bound_on		= -1,
> +	.min_delta_ns		= 1,
> +	.max_delta_ns		= KTIME_MAX,
> +	.min_delta_ticks	= 1,
> +	.max_delta_ticks	= KTIME_MAX,
> +	.mult			= 1,
> +	.shift			= 0,
> +	.cpumask		= cpu_all_mask,
> +};
> +
> +static enum hrtimer_restart bc_handler(struct hrtimer *t)
> +{
> +	ce_broadcast_hrtimer.event_handler(&ce_broadcast_hrtimer);
> +
> +	if (ce_broadcast_hrtimer.next_event.tv64 == KTIME_MAX)
> +		return HRTIMER_NORESTART;
> +
> +	return HRTIMER_RESTART;
> +}
> +
> +void tick_setup_hrtimer_broadcast(void)
> +{
> +	hrtimer_init(&bctimer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
> +	bctimer.function = bc_handler;
> +	clockevents_register_device(&ce_broadcast_hrtimer);
> +}
> diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
> index ddf2ac2..2f013c3 100644
> --- a/kernel/time/tick-broadcast.c
> +++ b/kernel/time/tick-broadcast.c
> @@ -630,6 +630,42 @@ again:
>   	raw_spin_unlock(&tick_broadcast_lock);
>   }
>
> +static int broadcast_needs_cpu(struct clock_event_device *bc, int cpu)
> +{
> +	if (!(bc->features & CLOCK_EVT_FEAT_HRTIMER))
> +		return 0;
> +	if (bc->next_event.tv64 == KTIME_MAX)
> +		return 0;
> +	return bc->bound_on == cpu ? -EBUSY : 0;
> +}
> +
> +static void broadcast_shutdown_local(struct clock_event_device *bc,
> +				     struct clock_event_device *dev)
> +{
> +	/*
> +	 * For hrtimer based broadcasting we cannot shutdown the cpu
> +	 * local device if our own event is the first one to expire or
> +	 * if we own the broadcast timer.
> +	 */
> +	if (bc->features & CLOCK_EVT_FEAT_HRTIMER) {
> +		if (broadcast_needs_cpu(bc, smp_processor_id()))
> +			return;
> +		if (dev->next_event.tv64 < bc->next_event.tv64)
> +			return;
> +	}
> +	clockevents_set_mode(dev, CLOCK_EVT_MODE_SHUTDOWN);
> +}
> +
> +static void broadcast_move_bc(int deadcpu)
> +{
> +	struct clock_event_device *bc = tick_broadcast_device.evtdev;
> +
> +	if (!bc || !broadcast_needs_cpu(bc, deadcpu))
> +		return;
> +	/* This moves the broadcast assignment to this cpu */
> +	clockevents_program_event(bc, bc->next_event, 1);
> +}
> +
>   /*
>    * Powerstate information: The system enters/leaves a state, where
>    * affected devices might stop
> @@ -648,7 +684,7 @@ int tick_broadcast_oneshot_control(unsigned long reason)
>   	 * states
>   	 */
>   	if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
> -		return;
> +		return 0;
>
>   	/*
>   	 * We are called with preemtion disabled from the depth of the
> @@ -659,7 +695,7 @@ int tick_broadcast_oneshot_control(unsigned long reason)
>   	dev = td->evtdev;
>
>   	if (!(dev->features & CLOCK_EVT_FEAT_C3STOP))
> -		return;
> +		return 0;
>
>   	bc = tick_broadcast_device.evtdev;
>
> @@ -667,7 +703,7 @@ int tick_broadcast_oneshot_control(unsigned long reason)
>   	if (reason == CLOCK_EVT_NOTIFY_BROADCAST_ENTER) {
>   		if (!cpumask_test_and_set_cpu(cpu, tick_broadcast_oneshot_mask)) {
>   			WARN_ON_ONCE(cpumask_test_cpu(cpu, tick_broadcast_pending_mask));
> -			clockevents_set_mode(dev, CLOCK_EVT_MODE_SHUTDOWN);
> +			broadcast_shutdown_local(bc, dev);
>   			/*
>   			 * We only reprogram the broadcast timer if we
>   			 * did not mark ourself in the force mask and
> @@ -680,6 +716,16 @@ int tick_broadcast_oneshot_control(unsigned long reason)
>   			    dev->next_event.tv64 < bc->next_event.tv64)
>   				tick_broadcast_set_event(bc, cpu, dev->next_event, 1);
>   		}
> +		/*
> +		 * If the current CPU owns the hrtimer broadcast
> +		 * mechanism, it cannot go deep idle and we remove the
> +		 * CPU from the broadcast mask. We don't have to go
> +		 * through the EXIT path as the local timer is not
> +		 * shutdown.
> +		 */
> +		ret = broadcast_needs_cpu(bc, cpu);
> +		if (ret)
> +			cpumask_clear_cpu(cpu, tick_broadcast_oneshot_mask);
>   	} else {
>   		if (cpumask_test_and_clear_cpu(cpu, tick_broadcast_oneshot_mask)) {
>   			clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT);
> @@ -853,6 +899,8 @@ void tick_shutdown_broadcast_oneshot(unsigned int *cpup)
>   	cpumask_clear_cpu(cpu, tick_broadcast_pending_mask);
>   	cpumask_clear_cpu(cpu, tick_broadcast_force_mask);
>
> +	broadcast_move_bc(cpu);
> +
>   	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
>   }
>
>


-- 
  <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs


^ permalink raw reply
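
To make the ownership rule in the quoted patch easier to follow, here is a small user-space model of broadcast_needs_cpu(). It is only a sketch. The struct, the sk_ names and the constants are hypothetical stand-ins for struct clock_event_device, CLOCK_EVT_FEAT_HRTIMER and KTIME_MAX, and sk_bc_set_next() folds "program the event" and "bind to this CPU" into one step, which the real code does in separate places.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SK_FEAT_HRTIMER	0x80u		/* stand-in for CLOCK_EVT_FEAT_HRTIMER */
#define SK_KTIME_MAX	INT64_MAX	/* "no event programmed" */

/* Hypothetical, reduced model of the broadcast clock event device. */
struct sk_bc_dev {
	unsigned int features;
	int64_t next_event;	/* absolute expiry in ns */
	int bound_on;		/* CPU hosting the hrtimer, -1 if none */
};

/*
 * Returns -EBUSY when 'cpu' hosts the broadcast hrtimer and therefore
 * must not enter a deep idle state, 0 otherwise.
 */
static int sk_broadcast_needs_cpu(const struct sk_bc_dev *bc, int cpu)
{
	assert(bc);
	if (!(bc->features & SK_FEAT_HRTIMER))
		return 0;
	if (bc->next_event == SK_KTIME_MAX)
		return 0;
	return bc->bound_on == cpu ? -EBUSY : 0;
}

/* Models a successful re-arm: the calling CPU becomes the owner. */
static void sk_bc_set_next(struct sk_bc_dev *bc, int64_t expires,
			   int this_cpu)
{
	bc->next_event = expires;
	bc->bound_on = this_cpu;
}
```

In this model the owner changes whenever a CPU with an earlier wakeup re-arms the timer, which is the "stand by CPU moves around" behaviour the changelog describes and the review comment above questions.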

* Re: [PATCH v2] powerpc ticket locks
From: Raghavendra KT @ 2014-02-11  9:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Torsten Duwe
  Cc: Tom Musta, Peter Zijlstra, Raghavendra KT,
	Linux Kernel Mailing List, Paul Mackerras, Anton Blanchard,
	Scott Wood, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <1392001823.3996.21.camel@pasglop>

On Mon, Feb 10, 2014 at 8:40 AM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Fri, 2014-02-07 at 17:58 +0100, Torsten Duwe wrote:
>>  typedef struct {
>> -       volatile unsigned int slock;
>> -} arch_spinlock_t;
>> +       union {
>> +               __ticketpair_t head_tail;
>> +               struct __raw_tickets {
>> +#ifdef __BIG_ENDIAN__          /* The "tail" part should be in the MSBs */
>> +                       __ticket_t tail, head;
>> +#else
>> +                       __ticket_t head, tail;
>> +#endif
>> +               } tickets;
>> +       };
>> +#if defined(CONFIG_PPC_SPLPAR)
>> +       u32 holder;
>> +#endif
>> +} arch_spinlock_t __aligned(4);
>
> That's still broken with lockref (which we just merged).
>
> We must have the arch_spinlock_t and the ref in the same 64-bit word
> otherwise it will break.
>
> We can make it work in theory since the holder doesn't have to be
> accessed atomically, but the practicals are a complete mess ...
> lockref would essentially have to re-implement the holder handling
> of the spinlocks and use lower level ticket stuff.
>

Probably a very basic and stupid question from me.
How important is it to have holder information for PPC? From my
previous experiment on x86, it was lock-waiter preemption that was
problematic rather than lock-holder preemption.

^ permalink raw reply
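
For readers unfamiliar with the head/tail pair in the quoted arch_spinlock_t, a minimal ticket lock can be sketched with C11 atomics as below. This is an illustration under simplifying assumptions, not the proposed powerpc implementation. It keeps head and tail in two separate 32-bit atomics instead of one packed head_tail word, it has no holder field, and the sk_ names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct sk_ticket_lock {
	_Atomic uint32_t head;	/* ticket currently being served */
	_Atomic uint32_t tail;	/* next ticket to hand out */
};

static int sk_ticket_is_locked(struct sk_ticket_lock *l)
{
	return atomic_load(&l->head) != atomic_load(&l->tail);
}

/* Take a ticket and spin until it is served. Returns the ticket. */
static uint32_t sk_ticket_lock(struct sk_ticket_lock *l)
{
	uint32_t me = atomic_fetch_add(&l->tail, 1);

	while (atomic_load(&l->head) != me)
		;	/* a paravirt variant could yield to the holder here */
	return me;
}

/* Serve the next ticket. Only the current holder may call this. */
static void sk_ticket_unlock(struct sk_ticket_lock *l)
{
	assert(sk_ticket_is_locked(l));
	atomic_fetch_add(&l->head, 1);
}
```

Waiters are served strictly in ticket order, which is why a preempted waiter can stall everyone queued behind it. A holder field, as in the quoted struct, would give the spin loop a target for a directed yield.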

* [PATCH] powerpc/spufs: Fix duplicate definition of MAX_USER_PRIO
From: Peter Zijlstra @ 2014-02-11  9:41 UTC (permalink / raw)
  To: kbuild test robot
  Cc: Dongsheng Yang, Arnd Bergmann, linux-kernel, kbuild-all,
	linuxppc-dev, Ingo Molnar
In-Reply-To: <52f92761.nu9w3QblrTnGpeYg%fengguang.wu@intel.com>

On Tue, Feb 11, 2014 at 03:24:17AM +0800, kbuild test robot wrote:
> >> arch/powerpc/platforms/cell/spufs/sched.c:86:0: warning: "MAX_USER_PRIO" redefined [enabled by default]
>     #define MAX_USER_PRIO  (MAX_PRIO - MAX_RT_PRIO)
>     ^
>    In file included from include/linux/sched.h:6:0,
>                     from arch/powerpc/platforms/cell/spufs/sched.c:26:
>    include/linux/sched/prio.h:39:0: note: this is the location of the previous definition
>     #define MAX_USER_PRIO  (USER_PRIO(MAX_PRIO))
>     ^

Since USER_PRIO(p) is ((p)-MAX_RT_PRIO) the above two definitions are
the same and we can simply remove the spufs one.
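
To spell the equivalence out, here is a minimal userspace sketch. The two
macro bodies are copied from the warning quoted above; the numeric values
of MAX_RT_PRIO and MAX_PRIO (100 and 100 + 40) and the SPUFS_/SCHED_
prefixes are assumptions made only so that both definitions can coexist
in one file.

```c
/* Assumed values, for illustration only. */
#define MAX_RT_PRIO		100
#define MAX_PRIO		(MAX_RT_PRIO + 40)

/* As described above: USER_PRIO(p) is ((p)-MAX_RT_PRIO). */
#define USER_PRIO(p)		((p) - MAX_RT_PRIO)

/* The spufs definition that the patch removes (renamed here). */
#define SPUFS_MAX_USER_PRIO	(MAX_PRIO - MAX_RT_PRIO)
/* The generic definition from include/linux/sched/prio.h (renamed here). */
#define SCHED_MAX_USER_PRIO	(USER_PRIO(MAX_PRIO))

/* Both expand to MAX_PRIO - MAX_RT_PRIO, so this holds at compile time. */
_Static_assert(SPUFS_MAX_USER_PRIO == SCHED_MAX_USER_PRIO,
	       "the two MAX_USER_PRIO definitions must agree");
```

Since the expansions are identical, SCALE_PRIO() in spufs behaves the same
with either definition.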

Fixes: 6b6350f155af ("sched: Expose some macros related to priority")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/powerpc/platforms/cell/spufs/sched.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
index 49318385d4fa..4a0a64fe25df 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -83,7 +83,6 @@ static struct timer_list spuloadavg_timer;
 #define MIN_SPU_TIMESLICE	max(5 * HZ / (1000 * SPUSCHED_TICK), 1)
 #define DEF_SPU_TIMESLICE	(100 * HZ / (1000 * SPUSCHED_TICK))
 
-#define MAX_USER_PRIO		(MAX_PRIO - MAX_RT_PRIO)
 #define SCALE_PRIO(x, prio) \
 	max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_SPU_TIMESLICE)
 

^ permalink raw reply related

* Re: [PATCH v2] powerpc ticket locks
From: Raghavendra KT @ 2014-02-11  9:39 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, Peter Zijlstra, Raghavendra KT,
	Linux Kernel Mailing List, Paul Mackerras, Anton Blanchard,
	Scott Wood, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <20140207165801.GC2107@lst.de>

On Fri, Feb 7, 2014 at 10:28 PM, Torsten Duwe <duwe@lst.de> wrote:
> Ticket locks for ppc, version 2. Changes since v1:
> * The atomically exchanged entity is always 32 bits.
> * asm inline string variations thus removed.
> * Carry the additional holder hint only #if defined(CONFIG_PPC_SPLPAR)
>
> Signed-off-by: Torsten Duwe <duwe@suse.de>
> --
[...]
> +static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
>  {
> +       register struct __raw_tickets old, tmp,
> +               inc = { .tail = TICKET_LOCK_INC };
> +
>         CLEAR_IO_SYNC;
> -       while (1) {
> -               if (likely(__arch_spin_trylock(lock) == 0))
> -                       break;
> +       __asm__ __volatile__(
> +"1:    lwarx   %0,0,%4         # arch_spin_lock\n"
> +"      add     %1,%3,%0\n"
> +       PPC405_ERR77(0, "%4")
> +"      stwcx.  %1,0,%4\n"
> +"      bne-    1b"
> +       : "=&r" (old), "=&r" (tmp), "+m" (lock->tickets)
> +       : "r" (inc), "r" (&lock->tickets)
> +       : "cc");
> +
> +       if (likely(old.head == old.tail))
> +               goto out;
> +
> +       for (;;) {
> +               unsigned count = 100;

I am sure you wanted to tune the total number of loops to the typical lock holding time ...
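
For readers following the quoted hunk, the ticket idea itself can be
sketched in portable C11. This is only an illustration under stated
assumptions: it is not the lwarx/stwcx. sequence from the patch (which bumps
the tail inside one 32-bit head/tail word), it has no holder hint and no
bounded spin count, and the names ticket_lock, ticket_acquire and
ticket_release are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

struct ticket_lock {
	_Atomic uint16_t head;	/* ticket currently being served */
	_Atomic uint16_t tail;	/* next ticket to hand out */
};

static void ticket_lock_init(struct ticket_lock *l)
{
	atomic_init(&l->head, 0);
	atomic_init(&l->tail, 0);
}

static void ticket_acquire(struct ticket_lock *l)
{
	/* Take a ticket: atomically fetch the old tail and advance it. */
	uint16_t me = atomic_fetch_add_explicit(&l->tail, 1,
						memory_order_relaxed);

	/* Uncontended case: head already equals our ticket (old head == old tail). */
	while (atomic_load_explicit(&l->head, memory_order_acquire) != me)
		;	/* spin until our ticket is served */
}

static void ticket_release(struct ticket_lock *l)
{
	/* Serve the next waiter in FIFO order. */
	atomic_fetch_add_explicit(&l->head, 1, memory_order_release);
}
```

The loop count being discussed would bound the spin in ticket_acquire()
before yielding to the hypervisor; the sketch simply spins forever.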

^ permalink raw reply

* Re: [PATCH v1 1/2] powernv: cpufreq driver for powernv platform
From: Preeti U Murthy @ 2014-02-11  8:37 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan; +Cc: linuxppc-dev, Anton Blanchard, Srivatsa S. Bhat
In-Reply-To: <20140211070201.21159.31101.stgit@drishya>

Hi Vaidy,

On 02/11/2014 12:32 PM, Vaidyanathan Srinivasan wrote:
> Backend driver to dynamically set voltage and frequency on
> IBM POWER non-virtualized platforms.  Power management SPRs
> are used to set the required PState.
> 
> This driver works in conjunction with cpufreq governors
> like 'ondemand' to provide a demand based frequency and
> voltage setting on IBM POWER non-virtualized platforms.
> 
> PState table is obtained from OPAL v3 firmware through device
> tree.
> 
> powernv_cpufreq back-end driver would parse the relevant device-tree
> nodes and initialise the cpufreq subsystem on powernv platform.
> 
> Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> Signed-off-by: Anton Blanchard <anton@samba.org>
> ---
<snip>

> +static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy)
> +{
> +	int base, i;
> +
> +#ifdef CONFIG_SMP
> +	base = cpu_first_thread_sibling(policy->cpu);
> +
> +	for (i = 0; i < threads_per_core; i++)
> +		cpumask_set_cpu(base + i, policy->cpus);
> +#endif
> +	policy->cpuinfo.transition_latency = 25000;

Is it ok to hard code this field? How about getting this also from the
device tree?

> +
> +	/* Print frequency table */
> +	for (i = 0; powernv_freqs[i].frequency != CPUFREQ_TABLE_END; i++)
> +		pr_debug("%d: %d\n", i, powernv_freqs[i].frequency);

As a result, the frequency table will be printed for every CPU when
cpufreq gets initialized. Since this information does not vary
across CPUs, can we print it once in powernv_cpufreq_init(), after
parsing the device tree for the pstates?

Thanks

Regards
Preeti U Murthy

^ permalink raw reply

* Re: [PATCH v1 2/2] powernv, cpufreq: Add per-core locking to serialize frequency transitions
From: Preeti U Murthy @ 2014-02-11  8:15 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan; +Cc: linuxppc-dev, Anton Blanchard, Srivatsa S. Bhat
In-Reply-To: <20140211070208.21159.86862.stgit@drishya>

Hi Vaidy,

On 02/11/2014 12:32 PM, Vaidyanathan Srinivasan wrote:
> From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> 
> On POWER systems, the CPU frequency is controlled at a core-level and
> hence we need to serialize so that only one of the threads in the core
> switches the core's frequency at a time.
> 
> Using a global mutex lock would needlessly serialize _all_ frequency
> transitions in the system (across all cores). So introduce per-core
> locking to enable finer-grained synchronization and thereby enhance
> the speed and responsiveness of the cpufreq driver to varying workload
> demands.
> 
> The design of per-core locking is very simple and straight-forward: we
> first define a Per-CPU lock and use the one that belongs to the first
> thread sibling of the core.
> 
> cpu_first_thread_sibling() macro is used to find the *common* lock for
> all thread siblings belonging to a core.
> 
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> ---
>  drivers/cpufreq/powernv-cpufreq.c |   21 ++++++++++++++++-----
>  1 file changed, 16 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> index ea3b630..8240e90 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -24,8 +24,15 @@
>  #include <linux/of.h>
>  #include <asm/cputhreads.h>
> 
> -/* FIXME: Make this per-core */
> -static DEFINE_MUTEX(freq_switch_mutex);
> +/* Per-Core locking for frequency transitions */
> +static DEFINE_PER_CPU(struct mutex, freq_switch_lock);
> +
> +#define lock_core_freq(cpu)				\
> +			mutex_lock(&per_cpu(freq_switch_lock,\
> +				cpu_first_thread_sibling(cpu)));
> +#define unlock_core_freq(cpu)				\
> +			mutex_unlock(&per_cpu(freq_switch_lock,\
> +				cpu_first_thread_sibling(cpu)));
> 
>  #define POWERNV_MAX_PSTATES	256
> 
> @@ -219,7 +226,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy,
>  	freqs.new = powernv_freqs[new_index].frequency;
>  	freqs.cpu = policy->cpu;
> 
> -	mutex_lock(&freq_switch_mutex);
> +	lock_core_freq(policy->cpu);
>  	cpufreq_notify_transition(policy, &freqs, CPUFREQ_PRECHANGE);
> 
>  	pr_debug("setting frequency for cpu %d to %d kHz index %d pstate %d",
> @@ -231,7 +238,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy,
>  	rc = powernv_set_freq(policy->cpus, new_index);
> 
>  	cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE);
> -	mutex_unlock(&freq_switch_mutex);
> +	unlock_core_freq(policy->cpu);
> 
>  	return rc;
>  }
> @@ -248,7 +255,7 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
> 
>  static int __init powernv_cpufreq_init(void)
>  {
> -	int rc = 0;
> +	int cpu, rc = 0;
> 
>  	/* Discover pstates from device tree and init */
> 
> @@ -258,6 +265,10 @@ static int __init powernv_cpufreq_init(void)
>  		pr_info("powernv-cpufreq disabled\n");
>  		return rc;
>  	}
> +	/* Init per-core mutex */
> +	for_each_possible_cpu(cpu) {
> +		mutex_init(&per_cpu(freq_switch_lock, cpu));
> +	}
> 
>  	rc = cpufreq_register_driver(&powernv_cpufreq_driver);
>  	return rc;

This looks good to me.
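
For anyone reading along, the sibling-to-lock mapping can be sketched in
userspace as follows. This is a rough illustration, not the driver code: it
assumes cpu_first_thread_sibling() masks off the low thread bits (a
power-of-two threads_per_core), uses C11 atomic_flag spinlocks in place of
kernel mutexes, and the NR_CPUS and THREADS_PER_CORE values are made up.

```c
#include <stdatomic.h>

#define NR_CPUS			16	/* hypothetical */
#define THREADS_PER_CORE	8	/* hypothetical, must be a power of two */

/* One lock slot per CPU; only the first sibling's slot is ever used. */
static atomic_flag freq_switch_lock[NR_CPUS];

/* Stand-in for cpu_first_thread_sibling(): clear the thread bits. */
static int first_thread_sibling(int cpu)
{
	return cpu & ~(THREADS_PER_CORE - 1);
}

/* All threads of a core resolve to the same lock. */
static atomic_flag *core_lock(int cpu)
{
	return &freq_switch_lock[first_thread_sibling(cpu)];
}

/* Mirrors the for_each_possible_cpu() init loop in the patch. */
static void init_core_locks(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		atomic_flag_clear(&freq_switch_lock[cpu]);
}

static void lock_core_freq(int cpu)
{
	while (atomic_flag_test_and_set_explicit(core_lock(cpu),
						 memory_order_acquire))
		;	/* spin; the driver sleeps on a mutex instead */
}

static void unlock_core_freq(int cpu)
{
	atomic_flag_clear_explicit(core_lock(cpu), memory_order_release);
}
```

With these values, CPUs 8..15 share the slot of CPU 8, so transitions on
core 0 and core 1 no longer serialize against each other.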

Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>

Thanks

Regards
Preeti U Murthy

^ permalink raw reply

* Re: [PATCH] powerpc/spufs: Remove MAX_USER_PRIO define
From: Kamalesh Babulal @ 2014-02-11  8:15 UTC (permalink / raw)
  To: Jeremy Kerr; +Cc: Dongsheng Yang, linuxppc-dev, Ingo Molnar, linux-kernel
In-Reply-To: <1392098717.689604.970589769393.1.gpush@pablo>

* Jeremy Kerr <jk@ozlabs.org> [2014-02-11 14:05:17]:

> Current ppc64_defconfig fails with:
> 
>  arch/powerpc/platforms/cell/spufs/sched.c:86:0: error: "MAX_USER_PRIO" redefined [-Werror]
>  cc1: all warnings being treated as errors
> 
> 6b6350f1 introduced a generic MAX_USER_PRIO macro to sched/prio.h, which
> is causing the conflict. Use that one instead of our own.

You can also use DEFAULT_PRIO from sched/prio.h instead of the local NORMAL_PRIO define.

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
index 49318385d4fa..014979db2018 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -64,11 +64,6 @@ static struct timer_list spusched_timer;
 static struct timer_list spuloadavg_timer;
 
 /*
- * Priority of a normal, non-rt, non-niced'd process (aka nice level 0).
- */
-#define NORMAL_PRIO		120
-
-/*
  * Frequency of the spu scheduler tick.  By default we do one SPU scheduler
  * tick for every 10 CPU scheduler ticks.
  */
@@ -97,7 +92,7 @@ static struct timer_list spuloadavg_timer;
  */
 void spu_set_timeslice(struct spu_context *ctx)
 {
-	if (ctx->prio < NORMAL_PRIO)
+	if (ctx->prio < DEFAULT_PRIO)
 		ctx->time_slice = SCALE_PRIO(DEF_SPU_TIMESLICE * 4, ctx->prio);
 	else
 		ctx->time_slice = SCALE_PRIO(DEF_SPU_TIMESLICE, ctx->prio);

Thanks,
Kamalesh.

^ permalink raw reply related

* Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: Gabriel Paubert @ 2014-02-11  7:26 UTC (permalink / raw)
  To: James Yang; +Cc: Chris Proctor, Stephen N Chivers, linuxppc-dev
In-Reply-To: <alpine.LRH.2.00.1402101056410.10318@ra8135-ec1.am.freescale.net>

	Hi James,

On Mon, Feb 10, 2014 at 11:03:07AM -0600, James Yang wrote:

[snipped]
> > Ok, if you have measured that method1 is faster than method2, let us go for it.
> > I believe method2 would be faster if you had a large out-of-order execution
> > window, because more parallelism can be extracted from it, but this is probably
> > only true for high end cores, which do not need FPU emulation in the first place.
> 
> Yeah, 8548 can issue 2 SFX instructions per cycle which is what the 
> compiler generated.   
> 

Then it is method1.

>  
> > The other place where we can optimize is the generation of FEX. Here is 
> > my current patch:
> > 
> > 
> > diff --git a/arch/powerpc/math-emu/mtfsf.c b/arch/powerpc/math-emu/mtfsf.c
> > index dbce92e..b57b3fa8 100644
> > --- a/arch/powerpc/math-emu/mtfsf.c
> > +++ b/arch/powerpc/math-emu/mtfsf.c
> > @@ -11,48 +11,35 @@ mtfsf(unsigned int FM, u32 *frB)
> >  	u32 mask;
> >  	u32 fpscr;
> >  
> > -	if (FM == 0)
> > +	if (likely(FM == 0xff))
> > +		mask = 0xffffffff;
> > +	else if (unlikely(FM == 0))
> >  		return 0;
> > -
> > -	if (FM == 0xff)
> > -		mask = 0x9fffffff;
> >  	else {
> > -		mask = 0;
> > -		if (FM & (1 << 0))
> > -			mask |= 0x90000000;
> > -		if (FM & (1 << 1))
> > -			mask |= 0x0f000000;
> > -		if (FM & (1 << 2))
> > -			mask |= 0x00f00000;
> > -		if (FM & (1 << 3))
> > -			mask |= 0x000f0000;
> > -		if (FM & (1 << 4))
> > -			mask |= 0x0000f000;
> > -		if (FM & (1 << 5))
> > -			mask |= 0x00000f00;
> > -		if (FM & (1 << 6))
> > -			mask |= 0x000000f0;
> > -		if (FM & (1 << 7))
> > -			mask |= 0x0000000f;
> > +		mask = (FM & 1);
> > +		mask |= (FM << 3) & 0x10;
> > +		mask |= (FM << 6) & 0x100;
> > +		mask |= (FM << 9) & 0x1000;
> > +		mask |= (FM << 12) & 0x10000;
> > +		mask |= (FM << 15) & 0x100000;
> > +		mask |= (FM << 18) & 0x1000000;
> > +		mask |= (FM << 21) & 0x10000000;
> > +		mask *= 15;
> 
> 
> Needs to also mask out bits 1 and 2, they aren't to be set from frB.
> 
> 		mask &= 0x9FFFFFFF;
> 
> 

Look at the following lines:

> 
> 
> >  	}
> >  
> > -	__FPU_FPSCR &= ~(mask);
> > -	__FPU_FPSCR |= (frB[1] & mask);
> > +	fpscr = ((__FPU_FPSCR & ~mask) | (frB[1] & mask)) &
> > +		~(FPSCR_VX | FPSCR_FEX);
 
It's here (masking FPSCR_VX and FPSCR_FEX).

Actually the previous code was redundant, it cleared FEX and VX in the
mask computation and later again when recomputing them. Clearing them
once should be enough.
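
To make this concrete, here is a small userspace sketch. It clears FEX once
and then recomputes it two ways: with the five explicit status/enable tests
from the old code, and with the shift-and-mask test from the patch quoted
further down. The FPSCR_* values are assumed to follow the architected FPSCR
layout (as in asm/reg.h); the function names are made up for illustration.

```c
#include <stdint.h>

/* Assumed FPSCR bit values (architected layout). */
#define FPSCR_FEX	0x40000000u
#define FPSCR_VX	0x20000000u
#define FPSCR_OX	0x10000000u
#define FPSCR_UX	0x08000000u
#define FPSCR_ZX	0x04000000u
#define FPSCR_XX	0x02000000u
#define FPSCR_VE	0x00000080u
#define FPSCR_OE	0x00000040u
#define FPSCR_UE	0x00000020u
#define FPSCR_ZE	0x00000010u
#define FPSCR_XE	0x00000008u

/* Old style: test each status bit against its enable bit. */
static uint32_t fex_explicit(uint32_t fpscr)
{
	fpscr &= ~FPSCR_FEX;
	if (((fpscr & FPSCR_VX) && (fpscr & FPSCR_VE)) ||
	    ((fpscr & FPSCR_OX) && (fpscr & FPSCR_OE)) ||
	    ((fpscr & FPSCR_UX) && (fpscr & FPSCR_UE)) ||
	    ((fpscr & FPSCR_ZX) && (fpscr & FPSCR_ZE)) ||
	    ((fpscr & FPSCR_XX) && (fpscr & FPSCR_XE)))
		fpscr |= FPSCR_FEX;
	return fpscr;
}

/* New style: shift the status bits down onto the enable bits and AND. */
static uint32_t fex_shift(uint32_t fpscr)
{
	fpscr &= ~FPSCR_FEX;
	if (fpscr & (fpscr >> 22) & 0xf8)
		fpscr |= FPSCR_FEX;
	return fpscr;
}
```

The shift works because VX..XX sit exactly 22 bits above VE..XE, in the
same order.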

> >  
> > -	__FPU_FPSCR &= ~(FPSCR_VX);
> > -	if (__FPU_FPSCR & (FPSCR_VXSNAN | FPSCR_VXISI | FPSCR_VXIDI |
> > +	if (fpscr & (FPSCR_VXSNAN | FPSCR_VXISI | FPSCR_VXIDI |
> >  		     FPSCR_VXZDZ | FPSCR_VXIMZ | FPSCR_VXVC |
> >  		     FPSCR_VXSOFT | FPSCR_VXSQRT | FPSCR_VXCVI))
> > -		__FPU_FPSCR |= FPSCR_VX;
> > -
> > -	fpscr = __FPU_FPSCR;
> > -	fpscr &= ~(FPSCR_FEX);
> > -	if (((fpscr & FPSCR_VX) && (fpscr & FPSCR_VE)) ||
> > -	    ((fpscr & FPSCR_OX) && (fpscr & FPSCR_OE)) ||
> > -	    ((fpscr & FPSCR_UX) && (fpscr & FPSCR_UE)) ||
> > -	    ((fpscr & FPSCR_ZX) && (fpscr & FPSCR_ZE)) ||
> > -	    ((fpscr & FPSCR_XX) && (fpscr & FPSCR_XE)))
> > -		fpscr |= FPSCR_FEX;
> > +		fpscr |= FPSCR_VX;
> > +
> > +	/* The bit order of exception enables and exception status
> > +	 * is the same. Simply shift and mask to check for enabled
> > +	 * exceptions.
> > +	 */
> > +	if (fpscr & (fpscr >> 22) &  0xf8) fpscr |= FPSCR_FEX;
> >  	__FPU_FPSCR = fpscr;
> >  
> >  #ifdef DEBUG
> >  mtfsf.c |   57 ++++++++++++++++++++++-----------------------------------
> >  1 file changed, 22 insertions(+), 35 deletions(-)
> > 
> > 
> > Notes: 
> > 
> > 1) I'm really unsure on whether 0xff is frequent or not. So the likely()
> > statement at the beginning may be wrong. Actually, if it is not very likely,
> > it might be better to remove the special case for FM==0xff. A look at
> > GCC sources shows that it never generates a mask of 0xff. From glibc
> > sources, the vast majority of cases use 0x1, only isnan() uses 0xff.
> 
> Can't handle all cases here.  

That's why I would go for the simplest possible code. Conditionals are
expensive and minimizing cache footprint is often the best measure of 
performance for infrequently used code. With this in mind, I would get 
rid of all the tests for special FM values and rely on the optimized
generic case.
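
As a sanity check of the generic case, the shift-and-multiply expansion from
the quoted hunk can be compared against a plain loop in userspace. This is a
sketch only; fm_to_mask() and fm_to_mask_ref() are hypothetical names, and
the FM==0 / FM==0xff special cases are deliberately left out.

```c
#include <stdint.h>

/* Branch-free expansion: spread the 8 FM bits to the low bit of each
 * nibble, then multiply by 15 to fill every selected nibble with 0xf. */
static uint32_t fm_to_mask(unsigned int fm)
{
	uint32_t mask;

	mask = (fm & 1);
	mask |= (fm << 3) & 0x10;
	mask |= (fm << 6) & 0x100;
	mask |= (fm << 9) & 0x1000;
	mask |= (fm << 12) & 0x10000;
	mask |= (fm << 15) & 0x100000;
	mask |= (fm << 18) & 0x1000000;
	mask |= (fm << 21) & 0x10000000;
	return mask * 15;
}

/* Reference: FM bit i selects the nibble at bits 4*i .. 4*i+3. */
static uint32_t fm_to_mask_ref(unsigned int fm)
{
	uint32_t mask = 0;

	for (int i = 0; i < 8; i++)
		if (fm & (1u << i))
			mask |= 0xfu << (4 * i);
	return mask;
}
```

The multiply cannot carry between nibbles because each nibble holds at most
a single 1 before it, so FM==0xff yields 0xffffffff and FM==0 yields 0
without any special-casing.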


>  
> > 2) it may be better to remove the check for FM==0, after all, the instruction
> > effectively becomes a nop, and generating the instruction in the first place
> > would be too stupid for words.
> 
> Hmm a heavy no-op.  I wonder if it is heavier than a sync.

In theory not. It contains the equivalent of several isync (taking an exception
and returning from it), but not any synchronization wrt the memory accesses.

	Gabriel

^ permalink raw reply

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Joonsoo Kim @ 2014-02-11  7:42 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Han Pingtian, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, Christoph Lameter, linuxppc-dev, Wanpeng Li
In-Reply-To: <20140210191321.GD1558@linux.vnet.ibm.com>

On Mon, Feb 10, 2014 at 11:13:21AM -0800, Nishanth Aravamudan wrote:
> Hi Christoph,
> 
> On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> > Here is a draft of a patch to make this work with memoryless nodes.
> > 
> > The first thing is that we modify node_match to also match if we hit an
> > empty node. In that case we simply take the current slab if it's there.
> > 
> > If there is no current slab then a regular allocation occurs with the
> > memoryless node. The page allocator will fall back to a possible node and
> > that will become the current slab. Next alloc from a memoryless node
> > will then use that slab.
> > 
> > For that we also add some tracking of allocations on nodes that were not
> > satisfied using the empty_node[] array. A successful alloc on a node
> > clears that flag.
> > 
> > I would rather avoid the empty_node[] array since it's global and there may
> > be thread specific allocation restrictions but it would be expensive to do
> > an allocation attempt via the page allocator to make sure that there is
> > really no page available from the page allocator.
> 
> With this patch on our test system (I pulled out the numa_mem_id()
> change, since you Acked Joonsoo's already), on top of 3.13.0 + my
> kthread locality change + CONFIG_HAVE_MEMORYLESS_NODES + Joonsoo's RFC
> patch 1):
> 
> MemTotal:        8264704 kB
> MemFree:         5924608 kB
> ...
> Slab:            1402496 kB
> SReclaimable:     102848 kB
> SUnreclaim:      1299648 kB
> 
> And Anton's slabusage reports:
> 
> slab                                   mem     objs    slabs
>                                       used   active   active
> ------------------------------------------------------------
> kmalloc-16384                       207 MB   98.60%  100.00%
> task_struct                         134 MB   97.82%  100.00%
> kmalloc-8192                        117 MB  100.00%  100.00%
> pgtable-2^12                        111 MB  100.00%  100.00%
> pgtable-2^10                        104 MB  100.00%  100.00%
> 
> For comparison, Anton's patch applied at the same point in the series:
> 
> meminfo:
> 
> MemTotal:        8264704 kB
> MemFree:         4150464 kB
> ...
> Slab:            1590336 kB
> SReclaimable:     208768 kB
> SUnreclaim:      1381568 kB
> 
> slabusage:
> 
> slab                                   mem     objs    slabs
>                                       used   active   active
> ------------------------------------------------------------
> kmalloc-16384                       227 MB   98.63%  100.00%
> kmalloc-8192                        130 MB  100.00%  100.00%
> task_struct                         129 MB   97.73%  100.00%
> pgtable-2^12                        112 MB  100.00%  100.00%
> pgtable-2^10                        106 MB  100.00%  100.00%
> 
> 
> Consider this patch:
> 
> Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

Hello,

I still think that there is another problem.
Your report about CONFIG_SLAB said that SLAB uses just 200MB.
Below is your previous report.

  Ok, with your patches applied and CONFIG_SLAB enabled:

  MemTotal:        8264640 kB
  MemFree:         7119680 kB
  Slab:             207232 kB
  SReclaimable:      32896 kB
  SUnreclaim:       174336 kB

The numbers for CONFIG_SLUB with these patches tell us that SLUB uses 1.4GB.
That is a large difference in slab usage.

Also, I should note that the number of active objects in slabinfo can be wrong
in some situations, since it doesn't account for the cpu slab (and cpu partial slabs).

I recommend confirming page_to_nid() and the other things I mentioned earlier.

Thanks.

^ permalink raw reply

* [PATCH v1 1/2] powernv: cpufreq driver for powernv platform
From: Vaidyanathan Srinivasan @ 2014-02-11  7:02 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Anton Blanchard
  Cc: Preeti U Murthy, linuxppc-dev, Srivatsa S. Bhat
In-Reply-To: <20140211065757.21159.49689.stgit@drishya>

Backend driver to dynamically set voltage and frequency on
IBM POWER non-virtualized platforms.  Power management SPRs
are used to set the required PState.

This driver works in conjunction with cpufreq governors
like 'ondemand' to provide a demand based frequency and
voltage setting on IBM POWER non-virtualized platforms.

PState table is obtained from OPAL v3 firmware through device
tree.

powernv_cpufreq back-end driver would parse the relevant device-tree
nodes and initialise the cpufreq subsystem on powernv platform.
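
As an illustration of how a PState id ends up in the PMCR, here is a
userspace sketch of the value composition done by set_pstate() and
powernv_set_freq() in the patch below. The helper name pmcr_with_pstate()
is hypothetical, and the field positions are taken from the patch rather
than from a hardware manual.

```c
#include <stdint.h>

/*
 * Compose a new PMCR value: keep the low 48 bits of the old value and
 * write the 8-bit PState id into both bits 63..56 and bits 55..48,
 * which the patch describes as the local and global PStates.
 */
static uint64_t pmcr_with_pstate(uint64_t pmcr, int pstate_id)
{
	/* PState ids may be negative; only the low 8 bits are written. */
	uint64_t ps = (uint64_t)pstate_id & 0xFF;

	pmcr &= 0x0000ffffffffffffULL;
	return pmcr | (ps << 56) | (ps << 48);
}
```

In the driver the old value comes from mfspr(SPRN_PMCR) and the result is
written back with mtspr() on a CPU of the target core.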

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
---
 arch/powerpc/include/asm/reg.h    |    4 +
 drivers/cpufreq/Kconfig.powerpc   |    9 +
 drivers/cpufreq/Makefile          |    1 
 drivers/cpufreq/powernv-cpufreq.c |  275 +++++++++++++++++++++++++++++++++++++
 4 files changed, 289 insertions(+)
 create mode 100644 drivers/cpufreq/powernv-cpufreq.c

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 90c06ec..84f92ca 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -271,6 +271,10 @@
 #define SPRN_HSRR1	0x13B	/* Hypervisor Save/Restore 1 */
 #define SPRN_IC		0x350	/* Virtual Instruction Count */
 #define SPRN_VTB	0x351	/* Virtual Time Base */
+#define SPRN_PMICR	0x354   /* Power Management Idle Control Reg */
+#define SPRN_PMSR	0x355   /* Power Management Status Reg */
+#define SPRN_PMCR	0x374	/* Power Management Control Register */
+
 /* HFSCR and FSCR bit numbers are the same */
 #define FSCR_TAR_LG	8	/* Enable Target Address Register */
 #define FSCR_EBB_LG	7	/* Enable Event Based Branching */
diff --git a/drivers/cpufreq/Kconfig.powerpc b/drivers/cpufreq/Kconfig.powerpc
index ca0021a..4a91ab1 100644
--- a/drivers/cpufreq/Kconfig.powerpc
+++ b/drivers/cpufreq/Kconfig.powerpc
@@ -54,3 +54,12 @@ config PPC_PASEMI_CPUFREQ
 	help
 	  This adds the support for frequency switching on PA Semi
 	  PWRficient processors.
+
+config POWERNV_CPUFREQ
+       tristate "CPU frequency scaling for IBM POWERNV platform"
+       depends on PPC_POWERNV
+       select CPU_FREQ_TABLE
+       default y
+       help
+	 This adds support for CPU frequency switching on IBM POWERNV
+	 platform
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index 7494565..0dbb963 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -86,6 +86,7 @@ obj-$(CONFIG_PPC_CORENET_CPUFREQ)   += ppc-corenet-cpufreq.o
 obj-$(CONFIG_CPU_FREQ_PMAC)		+= pmac32-cpufreq.o
 obj-$(CONFIG_CPU_FREQ_PMAC64)		+= pmac64-cpufreq.o
 obj-$(CONFIG_PPC_PASEMI_CPUFREQ)	+= pasemi-cpufreq.o
+obj-$(CONFIG_POWERNV_CPUFREQ)		+= powernv-cpufreq.o
 
 ##################################################################################
 # Other platform drivers
diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
new file mode 100644
index 0000000..ea3b630
--- /dev/null
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -0,0 +1,275 @@
+/*
+ * POWERNV cpufreq driver for the IBM POWER processors
+ *
+ * (C) Copyright IBM 2014
+ *
+ * Author: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#define pr_fmt(fmt)	"powernv-cpufreq: " fmt
+
+#include <linux/module.h>
+#include <linux/cpufreq.h>
+#include <linux/of.h>
+#include <asm/cputhreads.h>
+
+/* FIXME: Make this per-core */
+static DEFINE_MUTEX(freq_switch_mutex);
+
+#define POWERNV_MAX_PSTATES	256
+
+static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
+
+/*
+ * Initialize the freq table based on data obtained
+ * from the firmware passed via device-tree
+ */
+
+static int init_powernv_pstates(void)
+{
+	struct device_node *power_mgt;
+	int nr_pstates = 0;
+	int pstate_min, pstate_max, pstate_nominal;
+	const __be32 *pstate_ids, *pstate_freqs;
+	int i;
+	u32 len_ids, len_freqs;
+
+	power_mgt = of_find_node_by_path("/ibm,opal/power-mgt");
+	if (!power_mgt) {
+		pr_warn("power-mgt node not found\n");
+		return -ENODEV;
+	}
+
+	if (of_property_read_u32(power_mgt, "ibm,pstate-min", &pstate_min)) {
+		pr_warn("ibm,pstate-min node not found\n");
+		return -ENODEV;
+	}
+
+	if (of_property_read_u32(power_mgt, "ibm,pstate-max", &pstate_max)) {
+		pr_warn("ibm,pstate-max node not found\n");
+		return -ENODEV;
+	}
+
+	if (of_property_read_u32(power_mgt, "ibm,pstate-nominal",
+				 &pstate_nominal)) {
+		pr_warn("ibm,pstate-nominal not found\n");
+		return -ENODEV;
+	}
+	pr_info("cpufreq pstate min %d nominal %d max %d\n", pstate_min,
+		pstate_nominal, pstate_max);
+
+	pstate_ids = of_get_property(power_mgt, "ibm,pstate-ids", &len_ids);
+	if (!pstate_ids) {
+		pr_warn("ibm,pstate-ids not found\n");
+		return -ENODEV;
+	}
+
+	pstate_freqs = of_get_property(power_mgt, "ibm,pstate-frequencies-mhz",
+				      &len_freqs);
+	if (!pstate_freqs) {
+		pr_warn("ibm,pstate-frequencies-mhz not found\n");
+		return -ENODEV;
+	}
+
+	WARN_ON(len_ids != len_freqs);
+	nr_pstates = min(len_ids, len_freqs) / sizeof(u32);
+	WARN_ON(!nr_pstates);
+
+	pr_debug("NR PStates %d\n", nr_pstates);
+	for (i = 0; i < nr_pstates; i++) {
+		u32 id = be32_to_cpu(pstate_ids[i]);
+		u32 freq = be32_to_cpu(pstate_freqs[i]);
+
+		pr_debug("PState id %d freq %d MHz\n", id, freq);
+		powernv_freqs[i].driver_data = id;
+		powernv_freqs[i].frequency = freq * 1000; /* kHz */
+	}
+	/* End of list marker entry */
+	powernv_freqs[i].driver_data = 0;
+	powernv_freqs[i].frequency = CPUFREQ_TABLE_END;
+
+	return 0;
+}
+
+static struct freq_attr *powernv_cpu_freq_attr[] = {
+	&cpufreq_freq_attr_scaling_available_freqs,
+	NULL,
+};
+
+/* Helper routines */
+
+/* Access helpers to power mgt SPR */
+
+static inline unsigned long get_pmspr(unsigned long sprn)
+{
+	switch (sprn) {
+	case SPRN_PMCR:
+		return mfspr(SPRN_PMCR);
+
+	case SPRN_PMICR:
+		return mfspr(SPRN_PMICR);
+
+	case SPRN_PMSR:
+		return mfspr(SPRN_PMSR);
+	}
+	BUG();
+}
+
+static inline void set_pmspr(unsigned long sprn, unsigned long val)
+{
+	switch (sprn) {
+	case SPRN_PMCR:
+		mtspr(SPRN_PMCR, val);
+		return;
+
+	case SPRN_PMICR:
+		mtspr(SPRN_PMICR, val);
+		return;
+
+	case SPRN_PMSR:
+		mtspr(SPRN_PMSR, val);
+		return;
+	}
+	BUG();
+}
+
+static void set_pstate(void *pstate)
+{
+	unsigned long val;
+	unsigned long pstate_ul = *(unsigned long *) pstate;
+
+	val = get_pmspr(SPRN_PMCR);
+	val = val & 0x0000ffffffffffffULL;
+	/* Set both local and global PStates */
+	val = val | (pstate_ul << 56) | (pstate_ul << 48);
+	pr_debug("Setting cpu %d pmcr to %016lX\n", smp_processor_id(), val);
+	set_pmspr(SPRN_PMCR, val);
+}
+
+static int powernv_set_freq(cpumask_var_t cpus, unsigned int new_index)
+{
+	unsigned long val = (unsigned long)powernv_freqs[new_index].driver_data;
+
+	/*
+	 * Use smp_call_function to send IPI and execute the
+	 * mtspr on target cpu.  We could do that without IPI
+	 * if current CPU is within policy->cpus (core)
+	 */
+
+	val = val & 0xFF;
+	smp_call_function_any(cpus, set_pstate, &val, 1);
+	return 0;
+}
+
+static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy)
+{
+	int base, i;
+
+#ifdef CONFIG_SMP
+	base = cpu_first_thread_sibling(policy->cpu);
+
+	for (i = 0; i < threads_per_core; i++)
+		cpumask_set_cpu(base + i, policy->cpus);
+#endif
+	policy->cpuinfo.transition_latency = 25000;
+
+	/* Print frequency table */
+	for (i = 0; powernv_freqs[i].frequency != CPUFREQ_TABLE_END; i++)
+		pr_debug("%d: %d\n", i, powernv_freqs[i].frequency);
+
+	policy->cur = powernv_freqs[0].frequency;
+	cpufreq_frequency_table_get_attr(powernv_freqs, policy->cpu);
+	return cpufreq_frequency_table_cpuinfo(policy, powernv_freqs);
+}
+
+static int powernv_cpufreq_cpu_exit(struct cpufreq_policy *policy)
+{
+	cpufreq_frequency_table_put_attr(policy->cpu);
+	return 0;
+}
+
+static int powernv_cpufreq_verify(struct cpufreq_policy *policy)
+{
+	return cpufreq_frequency_table_verify(policy, powernv_freqs);
+}
+
+static int powernv_cpufreq_target(struct cpufreq_policy *policy,
+			      unsigned int target_freq,
+			      unsigned int relation)
+{
+	int rc;
+	struct cpufreq_freqs freqs;
+	unsigned int new_index;
+
+	cpufreq_frequency_table_target(policy, powernv_freqs, target_freq,
+				       relation, &new_index);
+
+	freqs.old = policy->cur;
+	freqs.new = powernv_freqs[new_index].frequency;
+	freqs.cpu = policy->cpu;
+
+	mutex_lock(&freq_switch_mutex);
+	cpufreq_notify_transition(policy, &freqs, CPUFREQ_PRECHANGE);
+
+	pr_debug("setting frequency for cpu %d to %d kHz index %d pstate %d",
+		 policy->cpu,
+		 powernv_freqs[new_index].frequency,
+		 new_index,
+		 powernv_freqs[new_index].driver_data);
+
+	rc = powernv_set_freq(policy->cpus, new_index);
+
+	cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE);
+	mutex_unlock(&freq_switch_mutex);
+
+	return rc;
+}
+
+static struct cpufreq_driver powernv_cpufreq_driver = {
+	.verify		= powernv_cpufreq_verify,
+	.target		= powernv_cpufreq_target,
+	.init		= powernv_cpufreq_cpu_init,
+	.exit		= powernv_cpufreq_cpu_exit,
+	.name		= "powernv-cpufreq",
+	.flags		= CPUFREQ_CONST_LOOPS,
+	.attr		= powernv_cpu_freq_attr,
+};
+
+static int __init powernv_cpufreq_init(void)
+{
+	int rc = 0;
+
+	/* Discover pstates from device tree and init */
+
+	rc = init_powernv_pstates();
+
+	if (rc) {
+		pr_info("powernv-cpufreq disabled\n");
+		return rc;
+	}
+
+	rc = cpufreq_register_driver(&powernv_cpufreq_driver);
+	return rc;
+}
+
+static void __exit powernv_cpufreq_exit(void)
+{
+	cpufreq_unregister_driver(&powernv_cpufreq_driver);
+}
+
+module_init(powernv_cpufreq_init);
+module_exit(powernv_cpufreq_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>");

^ permalink raw reply related

* [PATCH v1 2/2] powernv, cpufreq: Add per-core locking to serialize frequency transitions
From: Vaidyanathan Srinivasan @ 2014-02-11  7:02 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Anton Blanchard
  Cc: Preeti U Murthy, linuxppc-dev, Srivatsa S. Bhat
In-Reply-To: <20140211065757.21159.49689.stgit@drishya>

From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

On POWER systems, the CPU frequency is controlled at a core-level and
hence we need to serialize so that only one of the threads in the core
switches the core's frequency at a time.

Using a global mutex lock would needlessly serialize _all_ frequency
transitions in the system (across all cores). So introduce per-core
locking to enable finer-grained synchronization and thereby enhance
the speed and responsiveness of the cpufreq driver to varying workload
demands.

The design of per-core locking is very simple and straight-forward: we
first define a Per-CPU lock and use the one that belongs to the first
thread sibling of the core.

cpu_first_thread_sibling() macro is used to find the *common* lock for
all thread siblings belonging to a core.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---
 drivers/cpufreq/powernv-cpufreq.c |   21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index ea3b630..8240e90 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -24,8 +24,15 @@
 #include <linux/of.h>
 #include <asm/cputhreads.h>

-/* FIXME: Make this per-core */
-static DEFINE_MUTEX(freq_switch_mutex);
+/* Per-Core locking for frequency transitions */
+static DEFINE_PER_CPU(struct mutex, freq_switch_lock);
+
+#define lock_core_freq(cpu)				\
+			mutex_lock(&per_cpu(freq_switch_lock,\
+				cpu_first_thread_sibling(cpu)))
+#define unlock_core_freq(cpu)				\
+			mutex_unlock(&per_cpu(freq_switch_lock,\
+				cpu_first_thread_sibling(cpu)))

 #define POWERNV_MAX_PSTATES	256

@@ -219,7 +226,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy,
 	freqs.new = powernv_freqs[new_index].frequency;
 	freqs.cpu = policy->cpu;

-	mutex_lock(&freq_switch_mutex);
+	lock_core_freq(policy->cpu);
 	cpufreq_notify_transition(policy, &freqs, CPUFREQ_PRECHANGE);

 	pr_debug("setting frequency for cpu %d to %d kHz index %d pstate %d",
@@ -231,7 +238,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy,
 	rc = powernv_set_freq(policy->cpus, new_index);

 	cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE);
-	mutex_unlock(&freq_switch_mutex);
+	unlock_core_freq(policy->cpu);

 	return rc;
 }
@@ -248,7 +255,7 @@ static struct cpufreq_driver powernv_cpufreq_driver = {

 static int __init powernv_cpufreq_init(void)
 {
-	int rc = 0;
+	int cpu, rc = 0;

 	/* Discover pstates from device tree and init */

@@ -258,6 +265,10 @@ static int __init powernv_cpufreq_init(void)
 		pr_info("powernv-cpufreq disabled\n");
 		return rc;
 	}
+	/* Init per-core mutex */
+	for_each_possible_cpu(cpu) {
+		mutex_init(&per_cpu(freq_switch_lock, cpu));
+	}

 	rc = cpufreq_register_driver(&powernv_cpufreq_driver);
 	return rc;

^ permalink raw reply related

* [PATCH v1 0/2] powernv: cpufreq support for IBM POWERNV platform
From: Vaidyanathan Srinivasan @ 2014-02-11  7:01 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Anton Blanchard
  Cc: Preeti U Murthy, linuxppc-dev, Srivatsa S. Bhat

Hi,

The following patch series implements the platform driver to support
dynamic CPU frequency scaling on IBM POWERNV platforms.

This patch series is based on Linux kernel 3.14-rc2 and tested on
OPAL v3 based IBM POWERNV platform and IBM POWER8 processor.

--Vaidy

---

Srivatsa S. Bhat (1):
      powernv, cpufreq: Add per-core locking to serialize frequency transitions

Vaidyanathan Srinivasan (1):
      powernv: cpufreq driver for powernv platform

 arch/powerpc/include/asm/reg.h    |    4 +
 drivers/cpufreq/Kconfig.powerpc   |    9 +
 drivers/cpufreq/Makefile          |    1 
 drivers/cpufreq/powernv-cpufreq.c |  286 +++++++++++++++++++++++++++++++++++++
 4 files changed, 300 insertions(+)
 create mode 100644 drivers/cpufreq/powernv-cpufreq.c

-- 

^ permalink raw reply

* [PATCH] powerpc/spufs: Remove MAX_USER_PRIO define
From: Jeremy Kerr @ 2014-02-11  6:05 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Dongsheng Yang, linuxppc-dev, linux-kernel

Current ppc64_defconfig fails with:

 arch/powerpc/platforms/cell/spufs/sched.c:86:0: error: "MAX_USER_PRIO" redefined [-Werror]
 cc1: all warnings being treated as errors

6b6350f1 introduced a generic MAX_USER_PRIO macro to sched/prio.h, which
is causing the conflict. Use that one instead of our own.

Signed-off-by: Jeremy Kerr <jk@ozlabs.org>

---
Ingo: 6b6350f1 is currently in tip; this fixes a build breakage for spufs

---
 arch/powerpc/platforms/cell/spufs/sched.c |    1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
index 4931838..4a0a64f 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -83,7 +83,6 @@ static struct timer_list spuloadavg_timer;
 #define MIN_SPU_TIMESLICE	max(5 * HZ / (1000 * SPUSCHED_TICK), 1)
 #define DEF_SPU_TIMESLICE	(100 * HZ / (1000 * SPUSCHED_TICK))
 
-#define MAX_USER_PRIO		(MAX_PRIO - MAX_RT_PRIO)
 #define SCALE_PRIO(x, prio) \
 	max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_SPU_TIMESLICE)
 

^ permalink raw reply related

* Re: [PATCH v2] powerpc ticket locks
From: Benjamin Herrenschmidt @ 2014-02-11  3:38 UTC (permalink / raw)
  To: Al Viro
  Cc: Tom Musta, Peter Zijlstra, Linus Torvalds, linux-kernel,
	Paul Mackerras, Anton Blanchard, Scott Wood, Torsten Duwe,
	Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <20140211025645.GJ18016@ZenIV.linux.org.uk>

On Tue, 2014-02-11 at 02:56 +0000, Al Viro wrote:
> > So the question is, is it reasonable to have the ref smaller than
> > 32-bit...
> 
> Every time you open a file, you bump dentry refcount.  Something like
> libc or ld.so will be opened on just about every execve(), so I'd say
> that 16 bits is far too low.  If nothing else, 32 bits might be too
> low on 64bit boxen...

So back to square one ... we can't fit lockref, ticket locks, and our
lock confer mechanism together within 64 bits.

I see two options at this stage. Both require a custom implementation
of lockref for powerpc, so some ifdef's such that we can replace the
generic implementation completely.

 - We can use a small ref, and when it's too big, overflow into a larger
one, falling back to the "old style" lock + ref (an overflow bit or a
compare with ffff)

 - We can have lockref "build" its own lock out of the ticketpair and
ref, keeping the owner in a separate word. The owner doesn't strictly
need to be atomic.

Both are gross though :(

Does anybody have a better idea?

Ben.

^ permalink raw reply

* Re: [PATCH v2] powerpc ticket locks
From: Al Viro @ 2014-02-11  2:56 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Tom Musta, Peter Zijlstra, Linus Torvalds, linux-kernel,
	Paul Mackerras, Anton Blanchard, Scott Wood, Torsten Duwe,
	Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <1392086660.3996.50.camel@pasglop>

On Tue, Feb 11, 2014 at 01:44:20PM +1100, Benjamin Herrenschmidt wrote:
> That leaves us with 32 bits to put the ref and the owner. The question
> is how big the ref really has to be and can we have a reasonable failure
> mode if it overflows ?
> 
> If we limit ourselves to, for example, 16-bit for the ref in lockref,
> then we can have the second 32-bit split between the owner and the ref.
> 
> If we limit ourselves to 4k CPUs, then we get 4 more bits of ref ...
> 
> So the question is, is it reasonable to have the ref smaller than
> 32-bit...

Every time you open a file, you bump dentry refcount.  Something like
libc or ld.so will be opened on just about every execve(), so I'd say
that 16 bits is far too low.  If nothing else, 32 bits might be too
low on 64bit boxen...

^ permalink raw reply

* Re: [PATCH v2] powerpc ticket locks
From: Benjamin Herrenschmidt @ 2014-02-11  2:44 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, Peter Zijlstra, Linus Torvalds, linux-kernel,
	Paul Mackerras, Anton Blanchard, Scott Wood, Paul E. McKenney,
	linuxppc-dev, Ingo Molnar, Al Viro
In-Reply-To: <20140210155217.GF2107@lst.de>

(Linus, Al, a question for you down there about lockref "ref" size)

On Mon, 2014-02-10 at 16:52 +0100, Torsten Duwe wrote:

> What if I squeeze the bits a little?
> 4k vCPUs, and 256 physical, as a limit to stay within 32 bits?
> At the cost that unlock may become an ll/sc operation again.
> I could think about a trick against that.
> But alas, hw_cpu_id is 16 bit, which makes a lookup table necessary :-/
> 
> Doing another round of yields for lockrefs now doesn't
> sound so bad any more.

So, the ticketpair has to be 16:16 so we can avoid the atomic on unlock.

That leaves us with 32 bits to put the ref and the owner. The question
is how big the ref really has to be and can we have a reasonable failure
mode if it overflows ?

If we limit ourselves to, for example, 16-bit for the ref in lockref,
then we can have the second 32-bit split between the owner and the ref.

If we limit ourselves to 4k CPUs, then we get 4 more bits of ref ...

So the question is, is it reasonable to have the ref smaller than
32-bit...

Cheers,
Ben.
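
To make the bit budget above concrete, here is a small sketch of one
possible 64-bit layout: a 16:16 ticket pair in the upper word, and a lower
word split into a 12-bit owner (4k CPUs) and a 20-bit ref (16 bits plus the
4 gained). The field order, the names and the overflow behaviour are
assumptions for illustration only; the code is not atomic and is not the
lockref implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: [head:16][tail:16][owner:12][ref:20] */
#define OWNER_BITS 12u				/* 4k CPUs -> 2^12 owner ids */
#define REF_BITS   (32u - OWNER_BITS)		/* 20 bits left for the ref */
#define REF_MAX    ((1u << REF_BITS) - 1)
#define OWNER_MAX  ((1u << OWNER_BITS) - 1)

static uint64_t pack(uint16_t head, uint16_t tail, uint32_t owner, uint32_t ref)
{
	assert(owner <= OWNER_MAX);
	assert(ref <= REF_MAX);
	return ((uint64_t)head << 48) | ((uint64_t)tail << 32) |
	       ((uint64_t)owner << REF_BITS) | ref;
}

static uint32_t unpack_ref(uint64_t v)
{
	return (uint32_t)v & REF_MAX;
}

static uint32_t unpack_owner(uint64_t v)
{
	return (uint32_t)(v >> REF_BITS) & OWNER_MAX;
}

/*
 * Take a reference. Returns 0 on success, or -1 when the small ref is
 * saturated and a caller would have to fall back to some other scheme.
 */
static int ref_get(uint64_t *v)
{
	if (unpack_ref(*v) == REF_MAX)
		return -1;
	*v += 1;	/* ref is the lowest field, so +1 cannot carry out of it here */
	return 0;
}
```

Under this assumed split the ref saturates at 2^20 - 1 = 1048575
references, which is the ceiling the question about ref size is really
asking about.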

^ permalink raw reply

* [PATCH] powerpc/powernv: Add iommu DMA bypass support for IODA2
From: Benjamin Herrenschmidt @ 2014-02-11  0:38 UTC (permalink / raw)
  To: linuxppc-dev list

This patch adds support for creating a direct iommu "bypass"
window on IODA2 bridges (such as Power8), allowing 64-bit DMA capable
devices to bypass iommu page translation completely, thus
significantly improving DMA performance.

Additionally, this adds a hook to the struct iommu_table so that
the IOMMU API / VFIO can disable the bypass when external ownership
is requested, since in that case, the device will be used by an
environment such as userspace or a KVM guest which must not be
allowed to bypass translations.
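
The decision of whether a device may use the bypass window can be sketched
in userspace as follows. This is only a model of the check made in
pnv_pci_ioda_dma_set_mask() in this patch: can_bypass() and dma_bit_mask()
are hypothetical helper names, and the DRAM size is passed in as a
parameter rather than read from memblock_end_of_DRAM().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* From the patch: TVE #1 is selected by PCI address bit 59. */
#define TCE_BYPASS_BASE (1ull << 59)

/*
 * A device can use the bypass window only if its DMA mask covers the
 * highest bus address the window can produce: base + end of DRAM - 1.
 */
static bool can_bypass(bool bypass_enabled, uint64_t dram_end, uint64_t dma_mask)
{
	uint64_t top;

	if (!bypass_enabled)
		return false;
	top = TCE_BYPASS_BASE + dram_end - 1;
	return dma_mask >= top;
}

/* Mask with the low n bits set, for 1 <= n <= 64. */
static uint64_t dma_bit_mask(unsigned int n)
{
	assert(n >= 1 && n <= 64);
	return n == 64 ? ~0ull : (1ull << n) - 1;
}
```

For example, with 64GB of DRAM a device advertising a 64-bit mask
qualifies, while a 32-bit-only device keeps going through the 32-bit TCE
table.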

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

This is pretty much identical to the old version I posted a while ago,
except that it does have the iommu calls to enable/disable which I had
forgotten to git add before posting the previous one.

 arch/powerpc/include/asm/dma-mapping.h    |  1 +
 arch/powerpc/include/asm/iommu.h          |  1 +
 arch/powerpc/kernel/dma.c                 | 10 ++--
 arch/powerpc/kernel/iommu.c               | 12 +++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 84 +++++++++++++++++++++++++++++++
 arch/powerpc/platforms/powernv/pci.c      | 10 ++++
 arch/powerpc/platforms/powernv/pci.h      |  6 ++-
 arch/powerpc/platforms/powernv/powernv.h  |  4 ++
 arch/powerpc/platforms/powernv/setup.c    |  9 ++++
 9 files changed, 133 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
index e27e9ad..150866b 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -134,6 +134,7 @@ static inline int dma_supported(struct device *dev, u64 mask)
 }
 
 extern int dma_set_mask(struct device *dev, u64 dma_mask);
+extern int __dma_set_mask(struct device *dev, u64 dma_mask);
 
 #define dma_alloc_coherent(d,s,h,f)	dma_alloc_attrs(d,s,h,f,NULL)
 
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index f7a8036..42632c7 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -77,6 +77,7 @@ struct iommu_table {
 #ifdef CONFIG_IOMMU_API
 	struct iommu_group *it_group;
 #endif
+	void (*set_bypass)(struct iommu_table *tbl, bool enable);
 };
 
 /* Pure 2^n version of get_order */
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index 8032b97..ee78f6e 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -191,12 +191,10 @@ EXPORT_SYMBOL(dma_direct_ops);
 
 #define PREALLOC_DMA_DEBUG_ENTRIES (1 << 16)
 
-int dma_set_mask(struct device *dev, u64 dma_mask)
+int __dma_set_mask(struct device *dev, u64 dma_mask)
 {
 	struct dma_map_ops *dma_ops = get_dma_ops(dev);
 
-	if (ppc_md.dma_set_mask)
-		return ppc_md.dma_set_mask(dev, dma_mask);
 	if ((dma_ops != NULL) && (dma_ops->set_dma_mask != NULL))
 		return dma_ops->set_dma_mask(dev, dma_mask);
 	if (!dev->dma_mask || !dma_supported(dev, dma_mask))
@@ -204,6 +202,12 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
 	*dev->dma_mask = dma_mask;
 	return 0;
 }
+int dma_set_mask(struct device *dev, u64 dma_mask)
+{
+	if (ppc_md.dma_set_mask)
+		return ppc_md.dma_set_mask(dev, dma_mask);
+	return __dma_set_mask(dev, dma_mask);
+}
 EXPORT_SYMBOL(dma_set_mask);
 
 u64 dma_get_required_mask(struct device *dev)
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d773dd4..88e3ec6 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1088,6 +1088,14 @@ int iommu_take_ownership(struct iommu_table *tbl)
 	memset(tbl->it_map, 0xff, sz);
 	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
 
+	/*
+	 * Disable iommu bypass, otherwise the user can DMA to all of
+	 * our physical memory via the bypass window instead of just
+	 * the pages that have been explicitly mapped into the iommu
+	 */
+	if (tbl->set_bypass)
+		tbl->set_bypass(tbl, false);
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(iommu_take_ownership);
@@ -1102,6 +1110,10 @@ void iommu_release_ownership(struct iommu_table *tbl)
 	/* Restore bit#0 set by iommu_init_table() */
 	if (tbl->it_offset == 0)
 		set_bit(0, tbl->it_map);
+
+	/* The kernel owns the device now, we can restore the iommu bypass */
+	if (tbl->set_bypass)
+		tbl->set_bypass(tbl, true);
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 7d6dcc6..3b2b4fb 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -21,6 +21,7 @@
 #include <linux/irq.h>
 #include <linux/io.h>
 #include <linux/msi.h>
+#include <linux/memblock.h>
 
 #include <asm/sections.h>
 #include <asm/io.h>
@@ -460,9 +461,39 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
 		return;
 
 	pe = &phb->ioda.pe_array[pdn->pe_number];
+	WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
 	set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
 }
 
+static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
+				     struct pci_dev *pdev, u64 dma_mask)
+{
+	struct pci_dn *pdn = pci_get_pdn(pdev);
+	struct pnv_ioda_pe *pe;
+	uint64_t top;
+	bool bypass = false;
+
+	if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
+		return -ENODEV;
+
+	pe = &phb->ioda.pe_array[pdn->pe_number];
+	if (pe->tce_bypass_enabled) {
+		top = pe->tce_bypass_base + memblock_end_of_DRAM() - 1;
+		bypass = (dma_mask >= top);
+	}
+
+	if (bypass) {
+		dev_info(&pdev->dev, "Using 64-bit DMA iommu bypass\n");
+		set_dma_ops(&pdev->dev, &dma_direct_ops);
+		set_dma_offset(&pdev->dev, pe->tce_bypass_base);
+	} else {
+		dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
+		set_dma_ops(&pdev->dev, &dma_iommu_ops);
+		set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+	}
+	return 0;
+}
+
 static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
 {
 	struct pci_dev *dev;
@@ -657,6 +688,56 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 		__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
 }
 
+static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
+{
+	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
+					      tce32_table);
+	uint16_t window_id = (pe->pe_number << 1) + 1;
+	int64_t rc;
+
+	pe_info(pe, "%sabling 64-bit DMA bypass\n", enable ? "En" : "Dis");
+	if (enable) {
+		phys_addr_t top = memblock_end_of_DRAM();
+
+		top = roundup_pow_of_two(top);
+		rc = opal_pci_map_pe_dma_window_real(pe->phb->opal_id,
+						     pe->pe_number,
+						     window_id,
+						     pe->tce_bypass_base,
+						     top);
+	} else {
+		rc = opal_pci_map_pe_dma_window_real(pe->phb->opal_id,
+						     pe->pe_number,
+						     window_id,
+						     pe->tce_bypass_base,
+						     0);
+
+		/*
+		 * We might want to reset the DMA ops of all devices on
+		 * this PE. However in theory, that shouldn't be necessary
+		 * as this is used for VFIO/KVM pass-through and the device
+		 * hasn't yet been returned to its kernel driver
+		 */
+	}
+	if (rc)
+		pe_err(pe, "OPAL error %lld configuring bypass window\n", rc);
+	else
+		pe->tce_bypass_enabled = enable;
+}
+
+static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
+					  struct pnv_ioda_pe *pe)
+{
+	/* TVE #1 is selected by PCI address bit 59 */
+	pe->tce_bypass_base = 1ull << 59;
+
+	/* Install set_bypass callback for VFIO */
+	pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass;
+
+	/* Enable bypass by default */
+	pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
+}
+
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 				       struct pnv_ioda_pe *pe)
 {
@@ -727,6 +808,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	else
 		pnv_ioda_setup_bus_dma(pe, pe->pbus);
 
+	/* Also create a bypass window */
+	pnv_pci_ioda2_setup_bypass_pe(phb, pe);
 	return;
 fail:
 	if (pe->tce32_seg >= 0)
@@ -1286,6 +1369,7 @@ void __init pnv_pci_init_ioda_phb(struct device_node *np,
 
 	/* Setup TCEs */
 	phb->dma_dev_setup = pnv_pci_ioda_dma_dev_setup;
+	phb->dma_set_mask = pnv_pci_ioda_dma_set_mask;
 
 	/* Setup shutdown function for kexec */
 	phb->shutdown = pnv_pci_ioda_shutdown;
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index b555ebc..95633d7 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -634,6 +634,16 @@ static void pnv_pci_dma_dev_setup(struct pci_dev *pdev)
 		pnv_pci_dma_fallback_setup(hose, pdev);
 }
 
+int pnv_pci_dma_set_mask(struct pci_dev *pdev, u64 dma_mask)
+{
+	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
+	struct pnv_phb *phb = hose->private_data;
+
+	if (phb && phb->dma_set_mask)
+		return phb->dma_set_mask(phb, pdev, dma_mask);
+	return __dma_set_mask(&pdev->dev, dma_mask);
+}
+
 void pnv_pci_shutdown(void)
 {
 	struct pci_controller *hose;
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 13f1942..cde1694 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -54,7 +54,9 @@ struct pnv_ioda_pe {
 	struct iommu_table	tce32_table;
 	phys_addr_t		tce_inval_reg_phys;
 
-	/* XXX TODO: Add support for additional 64-bit iommus */
+	/* 64-bit TCE bypass region */
+	bool			tce_bypass_enabled;
+	uint64_t		tce_bypass_base;
 
 	/* MSIs. MVE index is identical for for 32 and 64 bit MSI
 	 * and -1 if not supported. (It's actually identical to the
@@ -113,6 +115,8 @@ struct pnv_phb {
 			 unsigned int hwirq, unsigned int virq,
 			 unsigned int is_64, struct msi_msg *msg);
 	void (*dma_dev_setup)(struct pnv_phb *phb, struct pci_dev *pdev);
+	int (*dma_set_mask)(struct pnv_phb *phb, struct pci_dev *pdev,
+			    u64 dma_mask);
 	void (*fixup_phb)(struct pci_controller *hose);
 	u32 (*bdfn_to_pe)(struct pnv_phb *phb, struct pci_bus *bus, u32 devfn);
 	void (*shutdown)(struct pnv_phb *phb);
diff --git a/arch/powerpc/platforms/powernv/powernv.h b/arch/powerpc/platforms/powernv/powernv.h
index de6819b..213887a 100644
--- a/arch/powerpc/platforms/powernv/powernv.h
+++ b/arch/powerpc/platforms/powernv/powernv.h
@@ -7,12 +7,16 @@ extern void pnv_smp_init(void);
 static inline void pnv_smp_init(void) { }
 #endif
 
+struct pci_dev;
+
 #ifdef CONFIG_PCI
 extern void pnv_pci_init(void);
 extern void pnv_pci_shutdown(void);
+extern int pnv_pci_dma_set_mask(struct pci_dev *pdev, u64 dma_mask);
 #else
 static inline void pnv_pci_init(void) { }
 static inline void pnv_pci_shutdown(void) { }
+static inline int pnv_pci_dma_set_mask(struct pci_dev *pdev, u64 dma_mask) { return 0; }
 #endif
 
 extern void pnv_lpc_init(void);
diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index 21166f6..110f4fb 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -27,6 +27,7 @@
 #include <linux/interrupt.h>
 #include <linux/bug.h>
 #include <linux/cpuidle.h>
+#include <linux/pci.h>
 
 #include <asm/machdep.h>
 #include <asm/firmware.h>
@@ -141,6 +142,13 @@ static void pnv_progress(char *s, unsigned short hex)
 {
 }
 
+static int pnv_dma_set_mask(struct device *dev, u64 dma_mask)
+{
+	if (dev_is_pci(dev))
+		return pnv_pci_dma_set_mask(to_pci_dev(dev), dma_mask);
+	return __dma_set_mask(dev, dma_mask);
+}
+
 static void pnv_shutdown(void)
 {
 	/* Let the PCI code clear up IODA tables */
@@ -238,6 +246,7 @@ define_machine(powernv) {
 	.machine_shutdown	= pnv_shutdown,
 	.power_save             = powernv_idle,
 	.calibrate_decr		= generic_calibrate_decr,
+	.dma_set_mask		= pnv_dma_set_mask,
 #ifdef CONFIG_KEXEC
 	.kexec_cpu_down		= pnv_kexec_cpu_down,
 #endif

^ permalink raw reply related

* Re: [PATCH] Handle vmalloc addresses
From: Benjamin Herrenschmidt @ 2014-02-11  0:19 UTC (permalink / raw)
  To: Nathan Fontenot; +Cc: Rong Song Shen, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <52E92DB0.1050902@linux.vnet.ibm.com>

Hi Nathan !

Please do a better submission :-)

Your subject is to be honest, crap. Something like

[PATCH] crypto/nx/nx-842: Fix handling of vmalloc addresses

Would have been much more informative.

Additionally, this is a patch for drivers/crypto, and while that driver
is powerpc-specific meaning I *could* take that patch, it should at
least be CCed to the crypto list/maintainer since that would
be the normal path for such a patch to be applied.

I'm taking it this time around but I know you can do better !

Cheers,
Ben.

On Wed, 2014-01-29 at 10:34 -0600, Nathan Fontenot wrote:
> The nx-842 compression driver does not currently handle getting
> a physical address for vmalloc addresses. The current driver
> uses __pa() for all addresses which does not properly handle
> vmalloc addresses and thus causes a failure since we do not pass
> a proper physical address to phyp.
> 
> This patch adds a routine to convert an address to a physical
> address by checking for vmalloc addresses and handling them properly.
> 
> Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
>  ---
>  drivers/crypto/nx/nx-842.c |   29 +++++++++++++++++++----------
>  1 file changed, 19 insertions(+), 10 deletions(-)
> 
> Index: linux/drivers/crypto/nx/nx-842.c
> ===================================================================
> --- linux.orig/drivers/crypto/nx/nx-842.c	2014-01-22 08:52:55.000000000 -0600
> +++ linux/drivers/crypto/nx/nx-842.c	2014-01-29 08:25:33.000000000 -0600
> @@ -158,6 +158,15 @@
>  	return sl->entry_nr * sizeof(struct nx842_slentry);
>  }
>  
> +static inline unsigned long nx842_get_pa(void *addr)
> +{
> +	if (is_vmalloc_addr(addr))
> +		return page_to_phys(vmalloc_to_page(addr))
> +		       + offset_in_page(addr);
> +	else
> +		return __pa(addr);
> +}
> +
>  static int nx842_build_scatterlist(unsigned long buf, int len,
>  			struct nx842_scatterlist *sl)
>  {
> @@ -168,7 +177,7 @@
>  
>  	entry = sl->entries;
>  	while (len) {
> -		entry->ptr = __pa(buf);
> +		entry->ptr = nx842_get_pa((void *)buf);
>  		nextpage = ALIGN(buf + 1, NX842_HW_PAGE_SIZE);
>  		if (nextpage < buf + len) {
>  			/* we aren't at the end yet */
> @@ -370,8 +379,8 @@
>  	op.flags = NX842_OP_COMPRESS;
>  	csbcpb = &workmem->csbcpb;
>  	memset(csbcpb, 0, sizeof(*csbcpb));
> -	op.csbcpb = __pa(csbcpb);
> -	op.out = __pa(slout.entries);
> +	op.csbcpb = nx842_get_pa(csbcpb);
> +	op.out = nx842_get_pa(slout.entries);
>  
>  	for (i = 0; i < hdr->blocks_nr; i++) {
>  		/*
> @@ -401,13 +410,13 @@
>  		 */
>  		if (likely(max_sync_size == NX842_HW_PAGE_SIZE)) {
>  			/* Create direct DDE */
> -			op.in = __pa(inbuf);
> +			op.in = nx842_get_pa((void *)inbuf);
>  			op.inlen = max_sync_size;
>  
>  		} else {
>  			/* Create indirect DDE (scatterlist) */
>  			nx842_build_scatterlist(inbuf, max_sync_size, &slin);
> -			op.in = __pa(slin.entries);
> +			op.in = nx842_get_pa(slin.entries);
>  			op.inlen = -nx842_get_scatterlist_size(&slin);
>  		}
>  
> @@ -565,7 +574,7 @@
>  	op.flags = NX842_OP_DECOMPRESS;
>  	csbcpb = &workmem->csbcpb;
>  	memset(csbcpb, 0, sizeof(*csbcpb));
> -	op.csbcpb = __pa(csbcpb);
> +	op.csbcpb = nx842_get_pa(csbcpb);
>  
>  	/*
>  	 * max_sync_size may have changed since compression,
> @@ -597,12 +606,12 @@
>  		if (likely((inbuf & NX842_HW_PAGE_MASK) ==
>  			((inbuf + hdr->sizes[i] - 1) & NX842_HW_PAGE_MASK))) {
>  			/* Create direct DDE */
> -			op.in = __pa(inbuf);
> +			op.in = nx842_get_pa((void *)inbuf);
>  			op.inlen = hdr->sizes[i];
>  		} else {
>  			/* Create indirect DDE (scatterlist) */
>  			nx842_build_scatterlist(inbuf, hdr->sizes[i] , &slin);
> -			op.in = __pa(slin.entries);
> +			op.in = nx842_get_pa(slin.entries);
>  			op.inlen = -nx842_get_scatterlist_size(&slin);
>  		}
>  
> @@ -613,12 +622,12 @@
>  		 */
>  		if (likely(max_sync_size == NX842_HW_PAGE_SIZE)) {
>  			/* Create direct DDE */
> -			op.out = __pa(outbuf);
> +			op.out = nx842_get_pa((void *)outbuf);
>  			op.outlen = max_sync_size;
>  		} else {
>  			/* Create indirect DDE (scatterlist) */
>  			nx842_build_scatterlist(outbuf, max_sync_size, &slout);
> -			op.out = __pa(slout.entries);
> +			op.out = nx842_get_pa(slout.entries);
>  			op.outlen = -nx842_get_scatterlist_size(&slout);
>  		}
>  

^ permalink raw reply

* Re: [PATCH v2] mtd: m25p80: Make the name of mtd_info fixed
From: Brian Norris @ 2014-02-10 19:39 UTC (permalink / raw)
  To: Hou Zhiqiang; +Cc: scottwood, linuxppc-dev, mingkai.hu, linux-mtd, linux-spi
In-Reply-To: <1390717003-28699-1-git-send-email-b48286@freescale.com>

On Sun, Jan 26, 2014 at 02:16:43PM +0800, Hou Zhiqiang wrote:
> To give spi flash layout using "mtdparts=..." in cmdline, we must
> give mtd_info a fixed name, because the cmdlinepart's parser will
> match the name given in cmdline with the mtd_info.
> 
> Now, if an OF node is used, mtd_info's name will be spi->dev->name. It
> consists of spi_master->bus_num, and the spi_master->bus_num may be
> dynamically fetched.
> So, give the mtd_info a new fixed name "name.cs", where "name" is the
> name of spi_device_id and "cs" is the chip-select in spi_dev.
> 
> Signed-off-by: Hou Zhiqiang <b48286@freescale.com>
> ---
> v2:
>  - add check for return value of function kasprintf.
>  - whether the spi_master->bus_num is dynamic is determined by the spi
>    controller driver, and it can't be checked in this driver. So, we can
>    not initialize the mtd_info's name depending on whether the spi_master
>    bus_num is dynamically allocated or not.

How about spi->master->bus_num < 0 ?

>  drivers/mtd/devices/m25p80.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/mtd/devices/m25p80.c b/drivers/mtd/devices/m25p80.c
> index eb558e8..1f494d2 100644
> --- a/drivers/mtd/devices/m25p80.c
> +++ b/drivers/mtd/devices/m25p80.c
> @@ -1011,8 +1011,12 @@ static int m25p_probe(struct spi_device *spi)
>  
>  	if (data && data->name)
>  		flash->mtd.name = data->name;
> -	else
> -		flash->mtd.name = dev_name(&spi->dev);
> +	else {
> +		flash->mtd.name = kasprintf(GFP_KERNEL, "%s.%d",
> +				id->name, spi->chip_select);

I don't think this name is specific enough. What if there is more than
one SPI controller? Then there could be more than one chip with the same
chip-select. You probably still need to incorporate the SPI master
somehow, even if it's not by using the bus number directly (because it's
dynamic).

> +		if (!flash->mtd.name)
> +			return -ENOMEM;
> +	}
>  
>  	flash->mtd.type = MTD_NORFLASH;
>  	flash->mtd.writesize = 1;

Brian
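
The collision concern above can be sketched in plain C. name_v2() mirrors
the "%s.%d" format from the patch; name_with_master() is a purely
hypothetical alternative that prefixes a stable per-controller label. The
label strings and the "-" separator are assumptions made for this
illustration, not something the driver or the SPI core provides in this
form.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Name as proposed in the patch: "<id name>.<chip select>". */
static void name_v2(char *buf, size_t len, const char *id, unsigned int cs)
{
	snprintf(buf, len, "%s.%u", id, cs);
}

/*
 * Hypothetical alternative: also include a stable label for the SPI
 * master, so the same chip on chip-select 0 of two controllers gets
 * two distinct names.
 */
static void name_with_master(char *buf, size_t len, const char *master,
			     const char *id, unsigned int cs)
{
	snprintf(buf, len, "%s-%s.%u", master, id, cs);
}
```

With name_v2(), an "m25p80" on chip-select 0 is named "m25p80.0" no matter
which controller it sits behind; with the second helper the two names
differ as long as the controller labels do.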

^ permalink raw reply
