LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [RFC PATCH] powerpc/fsl: Add barrier_nospec implementation for NXP PowerPC Book E
From: Michael Ellerman @ 2018-06-01 10:40 UTC (permalink / raw)
  To: Scott Wood, Diana Madalina Craciun, linuxppc-dev@lists.ozlabs.org; +Cc: Leo Li
In-Reply-To: <1527804215.30674.35.camel@buserror.net>

Scott Wood <oss@buserror.net> writes:

> On Thu, 2018-05-31 at 14:35 +0000, Diana Madalina Craciun wrote:
>> On 5/31/2018 5:21 PM, Michael Ellerman wrote:
>> > 
>> > We can add a nospectre_v1 command line option if necessary.
>> 
>> What about nobarrier_nospec (or similar) instead of nospectre_v1 command
>> line? We are not disabling all the v1 mitigations, the masking part will
>> remain unchanged.
>
> I think nospectre_v1 makes more sense as it's about the user's intentions
> rather than the implementation.  The user is giving the kernel permission to
> not defend against spectre v1, and it's up to the implementation which
> mitigations (if any) to disable in response to that, same as any other
> optimization.

Yeah I agree. We also have `nospectre_v2` on x86/s390 so I think keeping
consistency with that is a must.

cheers

^ permalink raw reply

* Re: [PATCH v5 0/4] powerpc patches for new Kconfig language
From: Michael Ellerman @ 2018-06-01 10:34 UTC (permalink / raw)
  To: Masahiro Yamada; +Cc: Nicholas Piggin, Linux Kbuild mailing list, linuxppc-dev
In-Reply-To: <CAK7LNASxgQOsSveFuKrx7do-+643KUZwU_Gn+z7O7JK+TtMEww@mail.gmail.com>

Hi Masahiro,

Masahiro Yamada <yamada.masahiro@socionext.com> writes:
...
>
> Also, the change logs could be dropped.
>
> I see
>
> Since v1: reworded changelog to explain the cause of the problem (thanks
> Segher) and moved the flags into the 64-32 cross compile case.
>
> or
>
> Since v1: removed extra -EB in the recordmcount script (thanks mpe)
>
>
> above your signed-off-by.
>
>
> Of course, this is your call,
> and you do not need to disturb the git history if it is too late.

You're right, sorry those are bit messy.

I'm happy to update them, I haven't merged them yet, but I see you have
them in your tree.

So I won't change them unless you confirm you're OK with it. Let me
know.

cheers

^ permalink raw reply

* [PATCH v4 7/7] powerpc/64s/radix: flush remote CPUs out of single-threaded mm_cpumask
From: Nicholas Piggin @ 2018-06-01 10:01 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin, Aneesh Kumar K . V
In-Reply-To: <20180601100121.393-1-npiggin@gmail.com>

When a single-threaded process has a non-local mm_cpumask, try to use
that point to flush the TLBs out of other CPUs in the cpumask.

An IPI is used for clearing remote CPUs for a few reasons:
- An IPI can end lazy TLB use of the mm, which is required to prevent
  TLB entries being created on the remote CPU. The alternative is to
  drop lazy TLB switching completely, which costs 7.5% in a context
  switch ping-pong test betwee a process and kernel idle thread.
- An IPI can have remote CPUs flush the entire PID, but the local CPU
  can flush a specific VA. tlbie would require over-flushing of the
  local CPU (where the process is running).
- A single threaded process that is migrated to a different CPU is
  likely to have a relatively small mm_cpumask, so IPI is reasonable.

No other thread can concurrently switch to this mm, because it must
have been given a reference to mm_users by the current thread before it
can use_mm. mm_users can be asynchronously incremented (by
mm_activate or mmget_not_zero), but those users must use remote mm
access and can't use_mm or access user address space. Existing code
makes the this assumption already, for example sparc64 has reset
mm_cpumask using this condition since the start of history, see
arch/sparc/kernel/smp_64.c.

This reduces tlbies for a kernel compile workload from 0.90M to 0.12M,
tlbiels are increased significantly due to the PID flushing for the
cleaning up remote CPUs, and increased local flushes (PID flushes take
128 tlbiels vs 1 tlbie).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/tlb.h |  13 +++
 arch/powerpc/mm/tlb-radix.c    | 148 +++++++++++++++++++++++++++------
 2 files changed, 134 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index a7eabff27a0f..9138baccebb0 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -76,6 +76,19 @@ static inline int mm_is_thread_local(struct mm_struct *mm)
 		return false;
 	return cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm));
 }
+static inline void mm_reset_thread_local(struct mm_struct *mm)
+{
+	WARN_ON(atomic_read(&mm->context.copros) > 0);
+	/*
+	 * It's possible for mm_access to take a reference on mm_users to
+	 * access the remote mm from another thread, but it's not allowed
+	 * to set mm_cpumask, so mm_users may be > 1 here.
+	 */
+	WARN_ON(current->mm != mm);
+	atomic_set(&mm->context.active_cpus, 1);
+	cpumask_clear(mm_cpumask(mm));
+	cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
+}
 #else /* CONFIG_PPC_BOOK3S_64 */
 static inline int mm_is_thread_local(struct mm_struct *mm)
 {
diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index cdc50398fd60..67a6e86d3e7e 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -12,6 +12,8 @@
 #include <linux/mm.h>
 #include <linux/hugetlb.h>
 #include <linux/memblock.h>
+#include <linux/mmu_context.h>
+#include <linux/sched/mm.h>
 
 #include <asm/ppc-opcode.h>
 #include <asm/tlb.h>
@@ -504,6 +506,15 @@ void radix__local_flush_tlb_page(struct vm_area_struct *vma, unsigned long vmadd
 }
 EXPORT_SYMBOL(radix__local_flush_tlb_page);
 
+static bool mm_is_singlethreaded(struct mm_struct *mm)
+{
+	if (atomic_read(&mm->context.copros) > 0)
+		return false;
+	if (atomic_read(&mm->mm_users) <= 1 && current->mm == mm)
+		return true;
+	return false;
+}
+
 static bool mm_needs_flush_escalation(struct mm_struct *mm)
 {
 	/*
@@ -511,10 +522,47 @@ static bool mm_needs_flush_escalation(struct mm_struct *mm)
 	 * caching PTEs and not flushing them properly when
 	 * RIC = 0 for a PID/LPID invalidate
 	 */
-	return atomic_read(&mm->context.copros) != 0;
+	if (atomic_read(&mm->context.copros) > 0)
+		return true;
+	return false;
 }
 
 #ifdef CONFIG_SMP
+static void do_exit_flush_lazy_tlb(void *arg)
+{
+	struct mm_struct *mm = arg;
+	unsigned long pid = mm->context.id;
+
+	if (current->mm == mm)
+		return; /* Local CPU */
+
+	if (current->active_mm == mm) {
+		/*
+		 * Must be a kernel thread because sender is single-threaded.
+		 */
+		BUG_ON(current->mm);
+		mmgrab(&init_mm);
+		switch_mm(mm, &init_mm, current);
+		current->active_mm = &init_mm;
+		mmdrop(mm);
+	}
+	_tlbiel_pid(pid, RIC_FLUSH_ALL);
+}
+
+static void exit_flush_lazy_tlbs(struct mm_struct *mm)
+{
+	/*
+	 * Would be nice if this was async so it could be run in
+	 * parallel with our local flush, but generic code does not
+	 * give a good API for it. Could extend the generic code or
+	 * make a special powerpc IPI for flushing TLBs.
+	 * For now it's not too performance critical.
+	 */
+	smp_call_function_many(mm_cpumask(mm), do_exit_flush_lazy_tlb,
+				(void *)mm, 1);
+	mm_reset_thread_local(mm);
+}
+
 void radix__flush_tlb_mm(struct mm_struct *mm)
 {
 	unsigned long pid;
@@ -530,17 +578,24 @@ void radix__flush_tlb_mm(struct mm_struct *mm)
 	 */
 	smp_mb();
 	if (!mm_is_thread_local(mm)) {
+		if (unlikely(mm_is_singlethreaded(mm))) {
+			exit_flush_lazy_tlbs(mm);
+			goto local;
+		}
+
 		if (mm_needs_flush_escalation(mm))
 			_tlbie_pid(pid, RIC_FLUSH_ALL);
 		else
 			_tlbie_pid(pid, RIC_FLUSH_TLB);
-	} else
+	} else {
+local:
 		_tlbiel_pid(pid, RIC_FLUSH_TLB);
+	}
 	preempt_enable();
 }
 EXPORT_SYMBOL(radix__flush_tlb_mm);
 
-void radix__flush_all_mm(struct mm_struct *mm)
+static void __flush_all_mm(struct mm_struct *mm, bool fullmm)
 {
 	unsigned long pid;
 
@@ -550,12 +605,24 @@ void radix__flush_all_mm(struct mm_struct *mm)
 
 	preempt_disable();
 	smp_mb(); /* see radix__flush_tlb_mm */
-	if (!mm_is_thread_local(mm))
+	if (!mm_is_thread_local(mm)) {
+		if (unlikely(mm_is_singlethreaded(mm))) {
+			if (!fullmm) {
+				exit_flush_lazy_tlbs(mm);
+				goto local;
+			}
+		}
 		_tlbie_pid(pid, RIC_FLUSH_ALL);
-	else
+	} else {
+local:
 		_tlbiel_pid(pid, RIC_FLUSH_ALL);
+	}
 	preempt_enable();
 }
+void radix__flush_all_mm(struct mm_struct *mm)
+{
+	__flush_all_mm(mm, false);
+}
 EXPORT_SYMBOL(radix__flush_all_mm);
 
 void radix__flush_tlb_pwc(struct mmu_gather *tlb, unsigned long addr)
@@ -575,10 +642,16 @@ void radix__flush_tlb_page_psize(struct mm_struct *mm, unsigned long vmaddr,
 
 	preempt_disable();
 	smp_mb(); /* see radix__flush_tlb_mm */
-	if (!mm_is_thread_local(mm))
+	if (!mm_is_thread_local(mm)) {
+		if (unlikely(mm_is_singlethreaded(mm))) {
+			exit_flush_lazy_tlbs(mm);
+			goto local;
+		}
 		_tlbie_va(vmaddr, pid, psize, RIC_FLUSH_TLB);
-	else
+	} else {
+local:
 		_tlbiel_va(vmaddr, pid, psize, RIC_FLUSH_TLB);
+	}
 	preempt_enable();
 }
 
@@ -638,14 +711,21 @@ void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
 
 	preempt_disable();
 	smp_mb(); /* see radix__flush_tlb_mm */
-	if (mm_is_thread_local(mm)) {
-		local = true;
-		full = (end == TLB_FLUSH_ALL ||
-				nr_pages > tlb_local_single_page_flush_ceiling);
-	} else {
+	if (!mm_is_thread_local(mm)) {
+		if (unlikely(mm_is_singlethreaded(mm))) {
+			if (end != TLB_FLUSH_ALL) {
+				exit_flush_lazy_tlbs(mm);
+				goto is_local;
+			}
+		}
 		local = false;
 		full = (end == TLB_FLUSH_ALL ||
 				nr_pages > tlb_single_page_flush_ceiling);
+	} else {
+is_local:
+		local = true;
+		full = (end == TLB_FLUSH_ALL ||
+				nr_pages > tlb_local_single_page_flush_ceiling);
 	}
 
 	if (full) {
@@ -766,7 +846,7 @@ void radix__tlb_flush(struct mmu_gather *tlb)
 	 * See the comment for radix in arch_exit_mmap().
 	 */
 	if (tlb->fullmm) {
-		radix__flush_all_mm(mm);
+		__flush_all_mm(mm, true);
 	} else if ( (psize = radix_get_mmu_psize(page_size)) == -1) {
 		if (!tlb->need_flush_all)
 			radix__flush_tlb_mm(mm);
@@ -800,24 +880,32 @@ static inline void __radix__flush_tlb_range_psize(struct mm_struct *mm,
 
 	preempt_disable();
 	smp_mb(); /* see radix__flush_tlb_mm */
-	if (mm_is_thread_local(mm)) {
-		local = true;
-		full = (end == TLB_FLUSH_ALL ||
-				nr_pages > tlb_local_single_page_flush_ceiling);
-	} else {
+	if (!mm_is_thread_local(mm)) {
+		if (unlikely(mm_is_singlethreaded(mm))) {
+			if (end != TLB_FLUSH_ALL) {
+				exit_flush_lazy_tlbs(mm);
+				goto is_local;
+			}
+		}
 		local = false;
 		full = (end == TLB_FLUSH_ALL ||
 				nr_pages > tlb_single_page_flush_ceiling);
+	} else {
+is_local:
+		local = true;
+		full = (end == TLB_FLUSH_ALL ||
+				nr_pages > tlb_local_single_page_flush_ceiling);
 	}
 
 	if (full) {
-		if (!local && mm_needs_flush_escalation(mm))
-			also_pwc = true;
-
-		if (local)
+		if (local) {
 			_tlbiel_pid(pid, also_pwc ? RIC_FLUSH_ALL : RIC_FLUSH_TLB);
-		else
-			_tlbie_pid(pid, also_pwc ? RIC_FLUSH_ALL: RIC_FLUSH_TLB);
+		} else {
+			if (mm_needs_flush_escalation(mm))
+				also_pwc = true;
+
+			_tlbie_pid(pid, also_pwc ? RIC_FLUSH_ALL : RIC_FLUSH_TLB);
+		}
 	} else {
 		if (local)
 			_tlbiel_va_range(start, end, pid, page_size, psize, also_pwc);
@@ -859,10 +947,16 @@ void radix__flush_tlb_collapsed_pmd(struct mm_struct *mm, unsigned long addr)
 	/* Otherwise first do the PWC, then iterate the pages. */
 	preempt_disable();
 	smp_mb(); /* see radix__flush_tlb_mm */
-	if (mm_is_thread_local(mm)) {
-		_tlbiel_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, true);
-	} else {
+	if (!mm_is_thread_local(mm)) {
+		if (unlikely(mm_is_singlethreaded(mm))) {
+			exit_flush_lazy_tlbs(mm);
+			goto local;
+		}
 		_tlbie_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, true);
+		goto local;
+	} else {
+local:
+		_tlbiel_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, true);
 	}
 
 	preempt_enable();
-- 
2.17.0

^ permalink raw reply related

* [PATCH v4 6/7] powerpc/64s/radix: optimise pte_update
From: Nicholas Piggin @ 2018-06-01 10:01 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin, Aneesh Kumar K . V
In-Reply-To: <20180601100121.393-1-npiggin@gmail.com>

Implementing pte_update with pte_xchg (which uses cmpxchg) is
inefficient. A single larx/stcx. works fine, no need for the less
efficient cmpxchg sequence.

Then remove the memory barriers from the operation. There is a
requirement for TLB flushing to load mm_cpumask after the store
that reduces pte permissions, which is moved into the TLB flush
code.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/book3s/64/radix.h | 25 +++++++++++-----------
 arch/powerpc/mm/mmu_context.c              |  6 ++++--
 arch/powerpc/mm/tlb-radix.c                | 11 +++++++++-
 3 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 9c567d243f61..ef9f96742ce1 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -131,20 +131,21 @@ extern void radix__ptep_set_access_flags(struct vm_area_struct *vma, pte_t *ptep
 static inline unsigned long __radix_pte_update(pte_t *ptep, unsigned long clr,
 					       unsigned long set)
 {
-	pte_t pte;
-	unsigned long old_pte, new_pte;
-
-	do {
-		pte = READ_ONCE(*ptep);
-		old_pte = pte_val(pte);
-		new_pte = (old_pte | set) & ~clr;
-
-	} while (!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
-
-	return old_pte;
+	__be64 old_be, tmp_be;
+
+	__asm__ __volatile__(
+	"1:	ldarx	%0,0,%3		# pte_update\n"
+	"	andc	%1,%0,%5	\n"
+	"	or	%1,%1,%4	\n"
+	"	stdcx.	%1,0,%3		\n"
+	"	bne-	1b"
+	: "=&r" (old_be), "=&r" (tmp_be), "=m" (*ptep)
+	: "r" (ptep), "r" (cpu_to_be64(set)), "r" (cpu_to_be64(clr))
+	: "cc" );
+
+	return be64_to_cpu(old_be);
 }
 
-
 static inline unsigned long radix__pte_update(struct mm_struct *mm,
 					unsigned long addr,
 					pte_t *ptep, unsigned long clr,
diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
index 0ab297c4cfad..f84e14f23e50 100644
--- a/arch/powerpc/mm/mmu_context.c
+++ b/arch/powerpc/mm/mmu_context.c
@@ -57,8 +57,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		 * in switch_slb(), and/or the store of paca->mm_ctx_id in
 		 * copy_mm_to_paca().
 		 *
-		 * On the read side the barrier is in pte_xchg(), which orders
-		 * the store to the PTE vs the load of mm_cpumask.
+		 * On the other side, the barrier is in mm/tlb-radix.c for
+		 * radix which orders earlier stores to clear the PTEs vs
+		 * the load of mm_cpumask. And pte_xchg which does the same
+		 * thing for hash.
 		 *
 		 * This full barrier is needed by membarrier when switching
 		 * between processes after store to rq->curr, before user-space
diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index 5ac3206c51cc..cdc50398fd60 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -524,6 +524,11 @@ void radix__flush_tlb_mm(struct mm_struct *mm)
 		return;
 
 	preempt_disable();
+	/*
+	 * Order loads of mm_cpumask vs previous stores to clear ptes before
+	 * the invalidate. See barrier in switch_mm_irqs_off
+	 */
+	smp_mb();
 	if (!mm_is_thread_local(mm)) {
 		if (mm_needs_flush_escalation(mm))
 			_tlbie_pid(pid, RIC_FLUSH_ALL);
@@ -544,6 +549,7 @@ void radix__flush_all_mm(struct mm_struct *mm)
 		return;
 
 	preempt_disable();
+	smp_mb(); /* see radix__flush_tlb_mm */
 	if (!mm_is_thread_local(mm))
 		_tlbie_pid(pid, RIC_FLUSH_ALL);
 	else
@@ -568,6 +574,7 @@ void radix__flush_tlb_page_psize(struct mm_struct *mm, unsigned long vmaddr,
 		return;
 
 	preempt_disable();
+	smp_mb(); /* see radix__flush_tlb_mm */
 	if (!mm_is_thread_local(mm))
 		_tlbie_va(vmaddr, pid, psize, RIC_FLUSH_TLB);
 	else
@@ -630,6 +637,7 @@ void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
 		return;
 
 	preempt_disable();
+	smp_mb(); /* see radix__flush_tlb_mm */
 	if (mm_is_thread_local(mm)) {
 		local = true;
 		full = (end == TLB_FLUSH_ALL ||
@@ -791,6 +799,7 @@ static inline void __radix__flush_tlb_range_psize(struct mm_struct *mm,
 		return;
 
 	preempt_disable();
+	smp_mb(); /* see radix__flush_tlb_mm */
 	if (mm_is_thread_local(mm)) {
 		local = true;
 		full = (end == TLB_FLUSH_ALL ||
@@ -849,7 +858,7 @@ void radix__flush_tlb_collapsed_pmd(struct mm_struct *mm, unsigned long addr)
 
 	/* Otherwise first do the PWC, then iterate the pages. */
 	preempt_disable();
-
+	smp_mb(); /* see radix__flush_tlb_mm */
 	if (mm_is_thread_local(mm)) {
 		_tlbiel_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, true);
 	} else {
-- 
2.17.0

^ permalink raw reply related

* [PATCH v4 5/7] powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags
From: Nicholas Piggin @ 2018-06-01 10:01 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin, Aneesh Kumar K . V
In-Reply-To: <20180601100121.393-1-npiggin@gmail.com>

The ISA suggests ptesync after setting a pte, to prevent a table walk
initiated by a subsequent access from missing that store and causing a
spurious fault. This is an architectual allowance that allows an
implementation's page table walker to be incoherent with the store
queue.

However there is no correctness problem in taking a spurious fault in
userspace -- the kernel copes with these at any time, so the updated
pte will be found eventually. Spurious kernel faults on vmap memory
must be avoided, so a ptesync is put into flush_cache_vmap.

On POWER9 so far I have not found a measurable window where this can
result in more minor faults, so as an optimisation, remove the costly
ptesync from pte updates. If an implementation benefits from ptesync,
it would be better to add it back in update_mmu_cache, so it's not
done for things like fork(2).

fork --fork --exec benchmark improved 5.2% (12400->13100).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/book3s/64/radix.h | 19 ++++++++++++++++++-
 arch/powerpc/include/asm/cacheflush.h      | 13 +++++++++++++
 arch/powerpc/mm/pgtable-radix.c            |  2 +-
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 01f6c2ca7ecd..9c567d243f61 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -202,7 +202,24 @@ static inline void radix__set_pte_at(struct mm_struct *mm, unsigned long addr,
 				 pte_t *ptep, pte_t pte, int percpu)
 {
 	*ptep = pte;
-	asm volatile("ptesync" : : : "memory");
+
+	/*
+	 * The architecture suggests a ptesync after setting the pte, which
+	 * orders the store that updates the pte with subsequent page table
+	 * walk accesses which may load the pte. Without this it may be
+	 * possible for a subsequent access to result in spurious fault.
+	 *
+	 * This is not necessary for correctness, because a spurious fault
+	 * is tolerated by the page fault handler, and this store will
+	 * eventually be seen. In testing, there was no noticable increase
+	 * in user faults on POWER9. Avoiding ptesync here is a significant
+	 * win for things like fork. If a future microarchitecture benefits
+	 * from ptesync, it should probably go into update_mmu_cache, rather
+	 * than set_pte_at (which is used to set ptes unrelated to faults).
+	 *
+	 * Spurious faults to vmalloc region are not tolerated, so there is
+	 * a ptesync in flush_cache_vmap.
+	 */
 }
 
 static inline int radix__pmd_bad(pmd_t pmd)
diff --git a/arch/powerpc/include/asm/cacheflush.h b/arch/powerpc/include/asm/cacheflush.h
index 11843e37d9cf..e9662648e72d 100644
--- a/arch/powerpc/include/asm/cacheflush.h
+++ b/arch/powerpc/include/asm/cacheflush.h
@@ -26,6 +26,19 @@
 #define flush_cache_vmap(start, end)		do { } while (0)
 #define flush_cache_vunmap(start, end)		do { } while (0)
 
+#ifdef CONFIG_BOOK3S_64
+/*
+ * Book3s has no ptesync after setting a pte, so without this ptesync it's
+ * possible for a kernel virtual mapping access to return a spurious fault
+ * if it's accessed right after the pte is set. The page fault handler does
+ * not expect this type of fault. flush_cache_vmap is not exactly the right
+ * place to put this, but it seems to work well enough.
+ */
+#define flush_cache_vmap(start, end)		do { asm volatile("ptesync"); } while (0)
+#else
+#define flush_cache_vmap(start, end)		do { } while (0)
+#endif
+
 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
 extern void flush_dcache_page(struct page *page);
 #define flush_dcache_mmap_lock(mapping)		do { } while (0)
diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index d6f74cbf0fed..96f68c5aa1f5 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -1115,5 +1115,5 @@ void radix__ptep_set_access_flags(struct vm_area_struct *vma, pte_t *ptep,
 		 * an access fault, which is defined by the architectue.
 		 */
 	}
-	asm volatile("ptesync" : : : "memory");
+	/* See ptesync comment in radix__set_pte_at */
 }
-- 
2.17.0

^ permalink raw reply related

* [PATCH v4 4/7] powerpc/64s/radix: prefetch user address in update_mmu_cache
From: Nicholas Piggin @ 2018-06-01 10:01 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin, Aneesh Kumar K . V
In-Reply-To: <20180601100121.393-1-npiggin@gmail.com>

Prefetch the faulting address in update_mmu_cache to give the page
table walker perhaps 100 cycles head start as locks are dropped and
the interrupt completed.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/mm/mem.c              | 4 +++-
 arch/powerpc/mm/pgtable-book3s64.c | 3 ++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index c3c39b02b2ba..8cecda4bd66a 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -509,8 +509,10 @@ void update_mmu_cache(struct vm_area_struct *vma, unsigned long address,
 	 */
 	unsigned long access, trap;
 
-	if (radix_enabled())
+	if (radix_enabled()) {
+		prefetch((void *)address);
 		return;
+	}
 
 	/* We only want HPTEs for linux PTEs that have _PAGE_ACCESSED set */
 	if (!pte_young(*ptep) || address >= TASK_SIZE)
diff --git a/arch/powerpc/mm/pgtable-book3s64.c b/arch/powerpc/mm/pgtable-book3s64.c
index 82fed87289de..c1f4ca45c93a 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -152,7 +152,8 @@ pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
 			  pmd_t *pmd)
 {
-	return;
+	if (radix_enabled())
+		prefetch((void *)addr);
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-- 
2.17.0

^ permalink raw reply related

* [PATCH v4 3/7] powerpc/64s/radix: make ptep_get_and_clear_full non-atomic for the full case
From: Nicholas Piggin @ 2018-06-01 10:01 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin, Aneesh Kumar K . V
In-Reply-To: <20180601100121.393-1-npiggin@gmail.com>

This matches other architectures, when we know there will be no
further accesses to the address (e.g., for teardown), page table
entries can be cleared non-atomically.

The comments about NMMU are bogus: all MMU notifiers (including NMMU)
are released at this point, with their TLBs flushed. An NMMU access at
this point would be a bug.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/book3s/64/radix.h | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 62a73a7a78a4..01f6c2ca7ecd 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -180,14 +180,8 @@ static inline pte_t radix__ptep_get_and_clear_full(struct mm_struct *mm,
 	unsigned long old_pte;

 	if (full) {
-		/*
-		 * If we are trying to clear the pte, we can skip
-		 * the DD1 pte update sequence and batch the tlb flush. The
-		 * tlb flush batching is done by mmu gather code. We
-		 * still keep the cmp_xchg update to make sure we get
-		 * correct R/C bit which might be updated via Nest MMU.
-		 */
-		old_pte = __radix_pte_update(ptep, ~0ul, 0);
+		old_pte = pte_val(*ptep);
+		*ptep = __pte(0);
 	} else
 		old_pte = radix__pte_update(mm, addr, ptep, ~0ul, 0, 0);

-- 
2.17.0

^ permalink raw reply related

* [PATCH v4 2/7] powerpc/64s/radix: do not flush TLB on spurious fault
From: Nicholas Piggin @ 2018-06-01 10:01 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin, Aneesh Kumar K . V
In-Reply-To: <20180601100121.393-1-npiggin@gmail.com>

In the case of a spurious fault (which can happen due to a race with
another thread that changes the page table), the default Linux mm code
calls flush_tlb_page for that address. This is not required because
the pte will be re-fetched. Hash does not wire this up to a hardware
TLB flush for this reason. This patch avoids the flush for radix.

>From Power ISA v3.0B, p.1090:

    Setting a Reference or Change Bit or Upgrading Access Authority
    (PTE Subject to Atomic Hardware Updates)

    If the only change being made to a valid PTE that is subject to
    atomic hardware updates is to set the Refer- ence or Change bit to
    1 or to add access authorities, a simpler sequence suffices
    because the translation hardware will refetch the PTE if an access
    is attempted for which the only problems were reference and/or
    change bits needing to be set or insufficient access authority.

The nest MMU on POWER9 does not re-fetch the PTE after such an access
attempt before faulting, so address spaces with a coprocessor
attached will continue to flush in these cases.

This reduces tlbies for a kernel compile workload from 0.95M to 0.90M.

fork --fork --exec benchmark improved 0.5% (12300->12400).

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/book3s/64/tlbflush.h | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h b/arch/powerpc/include/asm/book3s/64/tlbflush.h
index 0cac17253513..ebf572ea621e 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
@@ -4,7 +4,7 @@

 #define MMU_NO_CONTEXT	~0UL

-
+#include <linux/mm_types.h>
 #include <asm/book3s/64/tlbflush-hash.h>
 #include <asm/book3s/64/tlbflush-radix.h>

@@ -137,6 +137,16 @@ static inline void flush_all_mm(struct mm_struct *mm)
 #define flush_tlb_page(vma, addr)	local_flush_tlb_page(vma, addr)
 #define flush_all_mm(mm)		local_flush_all_mm(mm)
 #endif /* CONFIG_SMP */
+
+#define flush_tlb_fix_spurious_fault flush_tlb_fix_spurious_fault
+static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
+						unsigned long address)
+{
+	/* See ptep_set_access_flags comment */
+	if (atomic_read(&vma->vm_mm->context.copros) > 0)
+		flush_tlb_page(vma, address);
+}
+
 /*
  * flush the page walk cache for the address
  */
-- 
2.17.0

^ permalink raw reply related

* [PATCH v4 1/7] powerpc/64s/radix: do not flush TLB when relaxing access
From: Nicholas Piggin @ 2018-06-01 10:01 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin, Aneesh Kumar K . V
In-Reply-To: <20180601100121.393-1-npiggin@gmail.com>

Radix flushes the TLB when updating ptes to increase permissiveness
of protection (increase access authority). Book3S does not require
TLB flushing in this case, and it is not done on hash. This patch
avoids the flush for radix.

>From Power ISA v3.0B, p.1090:

    Setting a Reference or Change Bit or Upgrading Access Authority
    (PTE Subject to Atomic Hardware Updates)

    If the only change being made to a valid PTE that is subject to
    atomic hardware updates is to set the Reference or Change bit to 1
    or to add access authorities, a simpler sequence suffices because
    the translation hardware will refetch the PTE if an access is
    attempted for which the only problems were reference and/or change
    bits needing to be set or insufficient access authority.

The nest MMU on POWER9 does not re-fetch the PTE after such an access
attempt before faulting, so address spaces with a coprocessor
attached will continue to flush in these cases.

This reduces tlbies for a kernel compile workload from 1.28M to 0.95M,
tlbiels from 20.17M 19.68M.

fork --fork --exec benchmark improved 2.77% (12000->12300).

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/mm/pgtable-radix.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index 0ddfe591cd24..d6f74cbf0fed 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -1108,7 +1108,12 @@ void radix__ptep_set_access_flags(struct vm_area_struct *vma, pte_t *ptep,
 		__radix_pte_update(ptep, 0, new_pte);
 	} else {
 		__radix_pte_update(ptep, 0, set);
-		radix__flush_tlb_page_psize(mm, address, psize);
+		/*
+		 * Book3S does not require a TLB flush when relaxing access
+		 * restrictions when the address space is not attached to a
+		 * NMMU, because the core MMU will reload the pte after taking
+		 * an access fault, which is defined by the architectue.
+		 */
 	}
 	asm volatile("ptesync" : : : "memory");
 }
-- 
2.17.0

^ permalink raw reply related

* [PATCH v4 0/7] Various TLB and PTE improvements
From: Nicholas Piggin @ 2018-06-01 10:01 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin, Aneesh Kumar K . V

Since last time:
- Rebased on top of Aneesh's series "[PATCH V2 1/4] powerpc/mm/hugetlb:
  Update huge_ptep_set_access_flags to call __ptep_set_access_flags"

Thanks,
Nick

Nicholas Piggin (7):
  powerpc/64s/radix: do not flush TLB when relaxing access
  powerpc/64s/radix: do not flush TLB on spurious fault
  powerpc/64s/radix: make ptep_get_and_clear_full non-atomic for the
    full case
  powerpc/64s/radix: prefetch user address in update_mmu_cache
  powerpc/64s/radix: avoid ptesync after set_pte and
    ptep_set_access_flags
  powerpc/64s/radix: optimise pte_update
  powerpc/64s/radix: flush remote CPUs out of single-threaded mm_cpumask

 arch/powerpc/include/asm/book3s/64/radix.h    |  54 +++---
 arch/powerpc/include/asm/book3s/64/tlbflush.h |  12 +-
 arch/powerpc/include/asm/cacheflush.h         |  13 ++
 arch/powerpc/include/asm/tlb.h                |  13 ++
 arch/powerpc/mm/mem.c                         |   4 +-
 arch/powerpc/mm/mmu_context.c                 |   6 +-
 arch/powerpc/mm/pgtable-book3s64.c            |   3 +-
 arch/powerpc/mm/pgtable-radix.c               |   9 +-
 arch/powerpc/mm/tlb-radix.c                   | 159 +++++++++++++++---
 9 files changed, 217 insertions(+), 56 deletions(-)

-- 
2.17.0

^ permalink raw reply

* [PATCH] powerpc/mm/hugetlb: Update hugetlb related locks
From: Aneesh Kumar K.V @ 2018-06-01  8:24 UTC (permalink / raw)
  To: benh, paulus, mpe, npiggin; +Cc: linuxppc-dev, Aneesh Kumar K.V

With split pmd page table lock enabled, we don't use mm->page_table_lock when
updating pmd entries. This patch update hugetlb path to use the right lock
when inserting huge page directory entries into page table.

ex: if we are using hugepd and inserting hugepd entry at the pmd level, we
use pmd_lockptr, which based on config can be split pmd lock.

For update huge page directory entries itself we use mm->page_table_lock. We
do have a helper huge_pte_lockptr() for that.

Fixes: 675d99529 ("powerpc/book3s64: Enable split pmd ptlock")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/hugetlbpage.c | 33 +++++++++++++++++++++++----------
 arch/powerpc/mm/pgtable.c     | 12 +++++++-----
 2 files changed, 30 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 2a4b1bf8bde6..7c5f479c5c00 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -52,7 +52,8 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr, unsigned long s
 }
 
 static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
-			   unsigned long address, unsigned pdshift, unsigned pshift)
+			   unsigned long address, unsigned int pdshift,
+			   unsigned int pshift, spinlock_t *ptl)
 {
 	struct kmem_cache *cachep;
 	pte_t *new;
@@ -82,8 +83,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 	 */
 	smp_wmb();
 
-	spin_lock(&mm->page_table_lock);
-
+	spin_lock(ptl);
 	/*
 	 * We have multiple higher-level entries that point to the same
 	 * actual pte location.  Fill in each as we go and backtrack on error.
@@ -113,7 +113,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 			*hpdp = __hugepd(0);
 		kmem_cache_free(cachep, new);
 	}
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	return 0;
 }
 
@@ -138,6 +138,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 	hugepd_t *hpdp = NULL;
 	unsigned pshift = __ffs(sz);
 	unsigned pdshift = PGDIR_SHIFT;
+	spinlock_t *ptl;
 
 	addr &= ~(sz-1);
 	pg = pgd_offset(mm, addr);
@@ -146,39 +147,46 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 	if (pshift == PGDIR_SHIFT)
 		/* 16GB huge page */
 		return (pte_t *) pg;
-	else if (pshift > PUD_SHIFT)
+	else if (pshift > PUD_SHIFT) {
 		/*
 		 * We need to use hugepd table
 		 */
+		ptl = &mm->page_table_lock;
 		hpdp = (hugepd_t *)pg;
-	else {
+	} else {
 		pdshift = PUD_SHIFT;
 		pu = pud_alloc(mm, pg, addr);
 		if (pshift == PUD_SHIFT)
 			return (pte_t *)pu;
-		else if (pshift > PMD_SHIFT)
+		else if (pshift > PMD_SHIFT) {
+			ptl = pud_lockptr(mm, pu);
 			hpdp = (hugepd_t *)pu;
-		else {
+		} else {
 			pdshift = PMD_SHIFT;
 			pm = pmd_alloc(mm, pu, addr);
 			if (pshift == PMD_SHIFT)
 				/* 16MB hugepage */
 				return (pte_t *)pm;
-			else
+			else {
+				ptl = pmd_lockptr(mm, pm);
 				hpdp = (hugepd_t *)pm;
+			}
 		}
 	}
 #else
 	if (pshift >= HUGEPD_PGD_SHIFT) {
+		ptl = &mm->page_table_lock;
 		hpdp = (hugepd_t *)pg;
 	} else {
 		pdshift = PUD_SHIFT;
 		pu = pud_alloc(mm, pg, addr);
 		if (pshift >= HUGEPD_PUD_SHIFT) {
+			ptl = pud_lockptr(mm, pu);
 			hpdp = (hugepd_t *)pu;
 		} else {
 			pdshift = PMD_SHIFT;
 			pm = pmd_alloc(mm, pu, addr);
+			ptl = pmd_lockptr(mm, pm);
 			hpdp = (hugepd_t *)pm;
 		}
 	}
@@ -188,7 +196,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 
 	BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp));
 
-	if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr, pdshift, pshift))
+	if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr,
+						  pdshift, pshift, ptl))
 		return NULL;
 
 	return hugepte_offset(*hpdp, addr, pdshift);
@@ -499,6 +508,10 @@ struct page *follow_huge_pd(struct vm_area_struct *vma,
 	struct mm_struct *mm = vma->vm_mm;
 
 retry:
+	/*
+	 * hugepage directory entries are protected by mm->page_table_lock
+	 * Use this instead of huge_pte_lockptr
+	 */
 	ptl = &mm->page_table_lock;
 	spin_lock(ptl);
 
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 5281c2c064af..d71c7777669c 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -249,17 +249,19 @@ extern int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 	if (changed) {
 
 #ifdef CONFIG_PPC_BOOK3S_64
-		struct hstate *hstate = hstate_file(vma->vm_file);
-		psize = hstate_get_psize(hstate);
+		struct hstate *h = hstate_vma(vma);
+
+		psize = hstate_get_psize(h);
+#ifdef CONFIG_DEBUG_VM
+		assert_spin_locked(huge_pte_lockptr(h, vma->vm_mm, ptep));
+#endif
+
 #else
 		/*
 		 * Not used on non book3s64 platforms. But 8xx
 		 * can possibly use tsize derived from hstate.
 		 */
 		psize = 0;
-#endif
-#ifdef CONFIG_DEBUG_VM
-		assert_spin_locked(&vma->vm_mm->page_table_lock);
 #endif
 		__ptep_set_access_flags(vma, ptep, pte, addr, psize);
 	}
-- 
2.17.0

^ permalink raw reply related

* [PATCH] powerpc/mm/hash: hard disable irq in the SLB insert path
From: Aneesh Kumar K.V @ 2018-06-01  8:24 UTC (permalink / raw)
  To: benh, paulus, mpe, npiggin; +Cc: linuxppc-dev, Aneesh Kumar K.V

When inserting SLB entries for EA above 512TB, we need to hard disable irq.
This will make sure we don't take a PMU interrupt that can possibly touch
user space address via a stack dump. To prevent this, we need to hard disable
the interrupt.

Also add a comment explaining why we don't need context synchronizing isync
with slbmte.

Fixes: f384796c4 ("powerpc/mm: Add support for handling > 512TB address in SLB miss")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/slb.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index 66577cc66dc9..27f5b81c372a 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -352,6 +352,14 @@ static void insert_slb_entry(unsigned long vsid, unsigned long ea,
 	/*
 	 * We are irq disabled, hence should be safe to access PACA.
 	 */
+	VM_WARN_ON(!irqs_disabled());
+
+	/*
+	 * We can't take a PMU exception in the following code, so hard
+	 * disable interrupts.
+	 */
+	hard_irq_disable();
+
 	index = get_paca()->stab_rr;
 
 	/*
@@ -369,6 +377,11 @@ static void insert_slb_entry(unsigned long vsid, unsigned long ea,
 		    ((unsigned long) ssize << SLB_VSID_SSIZE_SHIFT);
 	esid_data = mk_esid_data(ea, ssize, index);
 
+	/*
+	 * No need for an isync before or after this slbmte. The exception
+	 * we enter with and the rfid we exit with are context synchronizing.
+	 * Also we only handle user segments here.
+	 */
 	asm volatile("slbmte %0, %1" : : "r" (vsid_data), "r" (esid_data)
 		     : "memory");
 
-- 
2.17.0

^ permalink raw reply related

* [PATCH kernel 2/2] powerpc/powernv: Define PHB4 type and enable sketchy bypass on POWER9
From: Alexey Kardashevskiy @ 2018-06-01  8:10 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alistair Popple, Benjamin Herrenschmidt,
	Russell Currey
In-Reply-To: <20180601081028.29401-1-aik@ozlabs.ru>

These are found in POWER9 chips. Right now these PHBs have unknown type
so changing it to PHB4 won't make much of a difference except enabling
sketchy bypass for POWER9 as this does below.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci.h      | 1 +
 arch/powerpc/platforms/powernv/pci-ioda.c | 5 ++++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index eada4b6..1408247 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -23,6 +23,7 @@ enum pnv_phb_model {
 	PNV_PHB_MODEL_UNKNOWN,
 	PNV_PHB_MODEL_P7IOC,
 	PNV_PHB_MODEL_PHB3,
+	PNV_PHB_MODEL_PHB4,
 	PNV_PHB_MODEL_NPU,
 	PNV_PHB_MODEL_NPU2,
 };
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 9239142..66c2804 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1882,7 +1882,8 @@ static int pnv_pci_ioda_dma_set_mask(struct pci_dev *pdev, u64 dma_mask)
 		if (dma_mask >> 32 &&
 		    dma_mask > (memory_hotplug_max() + (1ULL << 32)) &&
 		    pnv_pci_ioda_pe_single_vendor(pe) &&
-		    phb->model == PNV_PHB_MODEL_PHB3) {
+		    (phb->model == PNV_PHB_MODEL_PHB3 ||
+		     phb->model == PNV_PHB_MODEL_PHB4)) {
 			/* Configure the bypass mode */
 			rc = pnv_pci_ioda_dma_64bit_bypass(pe);
 			if (rc)
@@ -3930,6 +3931,8 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 		phb->model = PNV_PHB_MODEL_P7IOC;
 	else if (of_device_is_compatible(np, "ibm,power8-pciex"))
 		phb->model = PNV_PHB_MODEL_PHB3;
+	else if (of_device_is_compatible(np, "ibm,power9-pciex"))
+		phb->model = PNV_PHB_MODEL_PHB4;
 	else if (of_device_is_compatible(np, "ibm,power8-npu-pciex"))
 		phb->model = PNV_PHB_MODEL_NPU;
 	else if (of_device_is_compatible(np, "ibm,power9-npu-pciex"))
-- 
2.11.0

^ permalink raw reply related

* [PATCH kernel 1/2] powerpc/powernv: Reuse existing TCE code for sketchy bypass
From: Alexey Kardashevskiy @ 2018-06-01  8:10 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alistair Popple, Benjamin Herrenschmidt,
	Russell Currey
In-Reply-To: <20180601081028.29401-1-aik@ozlabs.ru>

The existing sketchy bypass ignores the existing default 32bit TCE table
(created by default for every PE at boot time or after being used by
VFIO) and it allocates another table instead without updating PE DMA
config (pe->table_group). So if we decide to use such device for VFIO
later, this new table will also leak memory.

This replaces adhoc table allocation and programming with the existing
API which handles memory leaks.

This programs the default 32bit table back to TVE#0 if configuring
the new table failed for some reason.

While we are at it, switch from the hardcoded 256MB TCEs to the biggest
size supported by the hardware and reported by the firmware. This allows
the sketchy bypass (originally made for POWER8 only) to work on POWER9
too assuming that PHB4 type is defined and pnv_pci_ioda_dma_64bit_bypass()
is called (coming next).

This does not call iommu_init_table() for the new table as the caller
will use &dma_nommu_ops and therefore ::it_map is not needed.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

Tested with:
 	if (pe->tce_bypass_enabled) {
 		top = pe->tce_bypass_base + memblock_end_of_DRAM() - 1;
-		bypass = (dma_mask >= top);
+		bypass = false;//(dma_mask >= top);
 	}
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 71 +++++++++++++++++--------------
 1 file changed, 39 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index ceb7e64..9239142 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1791,54 +1791,61 @@ static bool pnv_pci_ioda_pe_single_vendor(struct pnv_ioda_pe *pe)
  *
  * Currently this will only work on PHB3 (POWER8).
  */
+static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
+		int num, __u32 page_shift, __u64 window_size, __u32 levels,
+		struct iommu_table **ptbl);
+
+static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
+		int num, struct iommu_table *tbl);
+
+static unsigned long pnv_ioda_parse_tce_sizes(struct pnv_phb *phb);
+
 static int pnv_pci_ioda_dma_64bit_bypass(struct pnv_ioda_pe *pe)
 {
-	u64 window_size, table_size, tce_count, addr;
-	struct page *table_pages;
-	u64 tce_order = 28; /* 256MB TCEs */
-	__be64 *tces;
+	u64 window_size;
 	s64 rc;
+	struct iommu_table *tbl, *oldtbl = NULL;
+	unsigned long shift, offset;
 
 	/*
 	 * Window size needs to be a power of two, but needs to account for
 	 * shifting memory by the 4GB offset required to skip 32bit space.
 	 */
-	window_size = roundup_pow_of_two(memory_hotplug_max() + (1ULL << 32));
-	tce_count = window_size >> tce_order;
-	table_size = tce_count << 3;
-
-	if (table_size < PAGE_SIZE)
-		table_size = PAGE_SIZE;
+	window_size = roundup_pow_of_two(memory_hotplug_max() + SZ_4G);
+	shift = ilog2(pnv_ioda_parse_tce_sizes(pe->phb));
+	rc = pnv_pci_ioda2_create_table(&pe->table_group, 0, shift, window_size,
+			POWERNV_IOMMU_DEFAULT_LEVELS, &tbl);
+	if (rc) {
+		pe_err(pe, "Failed to create 64-bypass TCE table, err %ld", rc);
+		return rc;
+	}
 
-	table_pages = alloc_pages_node(pe->phb->hose->node, GFP_KERNEL,
-				       get_order(table_size));
-	if (!table_pages)
+	offset = SZ_4G >> shift;
+	rc = tbl->it_ops->set(tbl, offset, tbl->it_size - offset,
+			0 /* uaddr */, DMA_BIDIRECTIONAL, 0 /* attrs */);
+	if (rc)
 		goto err;
 
-	tces = page_address(table_pages);
-	if (!tces)
+	if (pe->table_group.tables[0]) {
+		oldtbl = pe->table_group.tables[0];
+		pnv_pci_ioda2_unset_window(&pe->table_group, 0);
+	}
+
+	rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
+	if (rc != OPAL_SUCCESS) {
+		rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, oldtbl);
 		goto err;
+	}
 
-	memset(tces, 0, table_size);
+	if (oldtbl)
+		iommu_tce_table_put(oldtbl);
 
-	for (addr = 0; addr < memory_hotplug_max(); addr += (1 << tce_order)) {
-		tces[(addr + (1ULL << 32)) >> tce_order] =
-			cpu_to_be64(addr | TCE_PCI_READ | TCE_PCI_WRITE);
-	}
+	pe_info(pe, "Using 64-bit DMA iommu bypass (through TVE#0)\n");
+	return 0;
 
-	rc = opal_pci_map_pe_dma_window(pe->phb->opal_id,
-					pe->pe_number,
-					/* reconfigure window 0 */
-					(pe->pe_number << 1) + 0,
-					1,
-					__pa(tces),
-					table_size,
-					1 << tce_order);
-	if (rc == OPAL_SUCCESS) {
-		pe_info(pe, "Using 64-bit DMA iommu bypass (through TVE#0)\n");
-		return 0;
-	}
 err:
+	iommu_tce_table_put(tbl);
+
 	pe_err(pe, "Error configuring 64-bit DMA bypass\n");
 	return -EIO;
 }
-- 
2.11.0

^ permalink raw reply related

* [PATCH kernel 0/2] powerpc/powernv: Rework sketchy bypass
From: Alexey Kardashevskiy @ 2018-06-01  8:10 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alistair Popple, Benjamin Herrenschmidt,
	Russell Currey

I came across this adhoc implementation and thought it could use some
polishing. This fixes memory leaks and add P9 support. Based on the current
upstream.


Please comment. Thanks.



Alexey Kardashevskiy (2):
  powerpc/powernv: Reuse existing TCE code for sketchy bypass
  powerpc/powernv: Define PHB4 type and enable sketchy bypass on POWER9

 arch/powerpc/platforms/powernv/pci.h      |  1 +
 arch/powerpc/platforms/powernv/pci-ioda.c | 76 +++++++++++++++++--------------
 2 files changed, 44 insertions(+), 33 deletions(-)

-- 
2.11.0

^ permalink raw reply

* [PATCH kernel] powerpc/powernv/ioda2: Reduce upper limit for DMA window size
From: Alexey Kardashevskiy @ 2018-06-01  8:06 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Russell Currey

We use PHB in mode1 which uses bit 59 to select a correct DMA window.
However there is mode2 which uses bits 59:55 and allows up to 32 DMA
windows per a PE.

Even though documentation does not clearly specify that, it seems that
the actual hardware does not support bits 59:55 even in mode1, in other
words we can create a window as big as 1<<58 but DMA simply won't work.

This reduces the upper limit from 59 to 55 bits to let the userspace know
about the hardware limits.

Fixes: 7aafac11e3 "powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested"
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 92ca662..50e21d7 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2839,7 +2839,7 @@ static long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
 	level_shift = entries_shift + 3;
 	level_shift = max_t(unsigned, level_shift, PAGE_SHIFT);

-	if ((level_shift - 3) * levels + page_shift >= 60)
+	if ((level_shift - 3) * levels + page_shift >= 55)
 		return -EINVAL;

 	/* Allocate TCE table */
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH] crypto/nx: Initialize 842 high and normal RxFIFO control registers
From: Stewart Smith @ 2018-06-01  7:41 UTC (permalink / raw)
  To: Haren Myneni, mpe; +Cc: linuxppc-dev, herbert, linux-crypto
In-Reply-To: <1527789287.5945.23.camel@hbabu-laptop>

Haren Myneni <haren@linux.vnet.ibm.com> writes:
> NX increments readOffset by FIFO size in receive FIFO control register
> when CRB is read. But the index in RxFIFO has to match with the
> corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
> may be processing incorrect CRBs and can cause CRB timeout.
>
> VAS FIFO offset is 0 when the receive window is opened during
> initialization. When the module is reloaded or in kexec boot, readOffset
> in FIFO control register may not match with VAS entry. This patch adds
> nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
> control register for both high and normal FIFOs.
>
> Signed-off-by: Haren Myneni <haren@us.ibm.com>
>
> diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
> index d886a5b..ff61e4b 100644
> --- a/arch/powerpc/include/asm/opal-api.h
> +++ b/arch/powerpc/include/asm/opal-api.h
> @@ -206,7 +206,8 @@
>  #define OPAL_NPU_TL_SET				161
>  #define OPAL_PCI_GET_PBCQ_TUNNEL_BAR		164
>  #define OPAL_PCI_SET_PBCQ_TUNNEL_BAR		165
> -#define OPAL_LAST				165
> +#define	OPAL_NX_COPROC_INIT			167
> +#define OPAL_LAST				167
>  
>  /* Device tree flags */
>  
> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
> index 7159e1a..d79eb82 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -288,6 +288,7 @@ int64_t opal_imc_counters_init(uint32_t type, uint64_t address,
>  int opal_get_power_shift_ratio(u32 handle, int token, u32 *psr);
>  int opal_set_power_shift_ratio(u32 handle, int token, u32 psr);
>  int opal_sensor_group_clear(u32 group_hndl, int token);
> +int opal_nx_coproc_init(uint32_t chip_id, uint32_t ct);
>  
>  s64 opal_signal_system_reset(s32 cpu);
>  
> diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
> index 3da30c2..c7541a9 100644
> --- a/arch/powerpc/platforms/powernv/opal-wrappers.S
> +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
> @@ -325,3 +325,4 @@ OPAL_CALL(opal_npu_spa_clear_cache,		OPAL_NPU_SPA_CLEAR_CACHE);
>  OPAL_CALL(opal_npu_tl_set,			OPAL_NPU_TL_SET);
>  OPAL_CALL(opal_pci_get_pbcq_tunnel_bar,		OPAL_PCI_GET_PBCQ_TUNNEL_BAR);
>  OPAL_CALL(opal_pci_set_pbcq_tunnel_bar,		OPAL_PCI_SET_PBCQ_TUNNEL_BAR);
> +OPAL_CALL(opal_nx_coproc_init,			OPAL_NX_COPROC_INIT);
> diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
> index 48fbb41..5e13908 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -1035,3 +1035,5 @@ void powernv_set_nmmu_ptcr(unsigned long ptcr)
>  EXPORT_SYMBOL_GPL(opal_int_set_mfrr);
>  EXPORT_SYMBOL_GPL(opal_int_eoi);
>  EXPORT_SYMBOL_GPL(opal_error_code);
> +/* Export the below symbol for NX compression */
> +EXPORT_SYMBOL(opal_nx_coproc_init);
> diff --git a/drivers/crypto/nx/nx-842-powernv.c b/drivers/crypto/nx/nx-842-powernv.c
> index 1e87637..6c4784d 100644
> --- a/drivers/crypto/nx/nx-842-powernv.c
> +++ b/drivers/crypto/nx/nx-842-powernv.c
> @@ -24,6 +24,8 @@
>  #include <asm/icswx.h>
>  #include <asm/vas.h>
>  #include <asm/reg.h>
> +#include <asm/opal-api.h>
> +#include <asm/opal.h>
>  
>  MODULE_LICENSE("GPL");
>  MODULE_AUTHOR("Dan Streetman <ddstreet@ieee.org>");
> @@ -803,9 +805,26 @@ static int __init vas_cfg_coproc_info(struct device_node *dn, int chip_id,
>  	if (!coproc)
>  		return -ENOMEM;
>  
> -	if (!strcmp(priority, "High"))
> +	if (!strcmp(priority, "High")) {
> +		/*
> +		 * (lpid, pid, tid) combination has to be unique for each
> +		 * coprocessor instance in the system. So to make it
> +		 * unique, skiboot uses coprocessor type such as 842 or
> +		 * GZIP for pid and provides this value to kernel in pid
> +		 * device-tree property.
> +		 *
> +		 * Initialize each NX instance for both high and normal
> +		 * priority FIFOs.
> +		 */
> +		ret = opal_nx_coproc_init(chip_id, pid);
> +		if (ret) {
> +			pr_err("Failed to initialize NX coproc: %d\n", ret);
> +			ret = opal_error_code(ret);
> +			goto err_out;
> +		}
> +
>  		coproc->ct = VAS_COP_TYPE_842_HIPRI;

I think this should be called for all priority queues as it would be at
least theoretically possible to only have Normal priority queues, in
which case this patch wouldn't fix the problem.


-- 
Stewart Smith
OPAL Architect, IBM.

^ permalink raw reply

* Re: [PATCH] crypto/nx: Initialize 842 high and normal RxFIFO control registers
From: Haren Myneni @ 2018-06-01  6:30 UTC (permalink / raw)
  To: Stewart Smith; +Cc: mpe, linuxppc-dev, herbert, linux-crypto
In-Reply-To: <87sh672mf7.fsf@linux.vnet.ibm.com>

On 05/31/2018 08:52 PM, Stewart Smith wrote:
> Haren Myneni <haren@linux.vnet.ibm.com> writes:
>> NX increments readOffset by FIFO size in receive FIFO control register
>> when CRB is read. But the index in RxFIFO has to match with the
>> corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
>> may be processing incorrect CRBs and can cause CRB timeout.
>>
>> VAS FIFO offset is 0 when the receive window is opened during
>> initialization. When the module is reloaded or in kexec boot, readOffset
>> in FIFO control register may not match with VAS entry. This patch adds
>> nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
>> control register for both high and normal FIFOs.
>>
>> Signed-off-by: Haren Myneni <haren@us.ibm.com>
> 
> I've yet to go and check out the skiboot patch properly, but should this
> be both:
> Fixes: b0d6c9bab crypto/nx: Add P9 NX support for 842 compression engine
> CC: stable # v4.14+
> 
> as otherwise "rmmod ; insmod" will crash, and possibly even issues over kexec?
> 

Correct, P9 NX support is included in 4.14. We also need fix in stable trees (4.14+). But this patch will not apply cleanly. I will post different patch for 4.14 and 4.16 stable trees. 

Thanks
Haren

^ permalink raw reply

* Re: [PATCH] cpuidle:powernv: Make the snooze timeout dynamic.
From: Gautham R Shenoy @ 2018-06-01  4:54 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Gautham R. Shenoy, Rafael J. Wysocki, Daniel Lezcano,
	Michael Ellerman, Stewart Smith, Michael Neuling,
	Vaidyanathan Srinivasan, Shilpasri G Bhat, Akshay Adiga,
	Nicholas Piggin, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-kernel@vger.kernel.org, linux-pm
In-Reply-To: <CAKTCnz=WhC_vamKPZruEqc04WKpD5ZqZm0zTcCtagRHib6OZPw@mail.gmail.com>

Hi Balbir,

Thanks for reviewing the patch!

On Fri, Jun 01, 2018 at 12:51:05AM +1000, Balbir Singh wrote:
> On Thu, May 31, 2018 at 10:15 PM, Gautham R. Shenoy

[..snip..]
> >
> > +static u64 get_snooze_timeout(struct cpuidle_device *dev,
> > +                             struct cpuidle_driver *drv,
> > +                             int index)
> > +{
> > +       int i;
> > +
> > +       if (unlikely(!snooze_timeout_en))
> > +               return default_snooze_timeout;
> > +
> > +       for (i = index + 1; i < drv->state_count; i++) {
> > +               struct cpuidle_state *s = &drv->states[i];
> > +               struct cpuidle_state_usage *su = &dev->states_usage[i];
> > +
> > +               if (s->disabled || su->disable)
> > +                       continue;
> > +
> > +               return s->target_residency * tb_ticks_per_usec;
> 
> Can we ensure this is not prone to overflow?

s->target_residency is an "unsigned int" so can take a maximum value
of UINT_MAX. tb_ticks_per_usec is an "unsigned long" with a value in
the range of 100-1000. The return value is a u64. The product of
s->target_residency and tb_ticks_per_usec should be contained in u64.

Is there a potential case of overflow that I am missing ?

> 
> Otherwise looks good
> 
> Reviewed-by: Balbir Singh <bsingharora@gmail.com>

--
Thanks and Regards
gautham.

^ permalink raw reply

* Re: [PATCH] crypto/nx: Initialize 842 high and normal RxFIFO control registers
From: Stewart Smith @ 2018-06-01  3:52 UTC (permalink / raw)
  To: Haren Myneni, mpe; +Cc: linuxppc-dev, herbert, linux-crypto
In-Reply-To: <1527789287.5945.23.camel@hbabu-laptop>

Haren Myneni <haren@linux.vnet.ibm.com> writes:
> NX increments readOffset by FIFO size in receive FIFO control register
> when CRB is read. But the index in RxFIFO has to match with the
> corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
> may be processing incorrect CRBs and can cause CRB timeout.
>
> VAS FIFO offset is 0 when the receive window is opened during
> initialization. When the module is reloaded or in kexec boot, readOffset
> in FIFO control register may not match with VAS entry. This patch adds
> nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
> control register for both high and normal FIFOs.
>
> Signed-off-by: Haren Myneni <haren@us.ibm.com>

I've yet to go and check out the skiboot patch properly, but should this
be both:
Fixes: b0d6c9bab crypto/nx: Add P9 NX support for 842 compression engine
CC: stable # v4.14+

as otherwise "rmmod ; insmod" will crash, and possibly even issues over kexec?

-- 
Stewart Smith
OPAL Architect, IBM.

^ permalink raw reply

* Re: [PATCH] powerpc/64s: Fix compiler store ordering to SLB shadow area
From: Nicholas Piggin @ 2018-05-31 22:52 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev, Aneesh Kumar K . V, Segher Boessenkool
In-Reply-To: <87tvqnykf6.fsf@concordia.ellerman.id.au>

On Fri, 01 Jun 2018 00:22:21 +1000
Michael Ellerman <mpe@ellerman.id.au> wrote:

> Nicholas Piggin <npiggin@gmail.com> writes:
> 
> > The stores to update the SLB shadow area must be made as they appear
> > in the C code, so that the hypervisor does not see an entry with
> > mismatched vsid and esid. Use WRITE_ONCE for this.
> >
> > GCC has been observed to elide the first store to esid in the update,
> > which means that if the hypervisor interrupts the guest after storing
> > to vsid, it could see an entry with old esid and new vsid, which may
> > possibly result in memory corruption.
> >
> > Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> > ---
> >  arch/powerpc/mm/slb.c | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
> > index 66577cc66dc9..2f4b33b24b3b 100644
> > --- a/arch/powerpc/mm/slb.c
> > +++ b/arch/powerpc/mm/slb.c
> > @@ -63,14 +63,14 @@ static inline void slb_shadow_update(unsigned long ea, int ssize,
> >  	 * updating it.  No write barriers are needed here, provided
> >  	 * we only update the current CPU's SLB shadow buffer.
> >  	 */
> > -	p->save_area[index].esid = 0;
> > -	p->save_area[index].vsid = cpu_to_be64(mk_vsid_data(ea, ssize, flags));
> > -	p->save_area[index].esid = cpu_to_be64(mk_esid_data(ea, ssize, index));
> > +	WRITE_ONCE(p->save_area[index].esid, 0);
> > +	WRITE_ONCE(p->save_area[index].vsid, cpu_to_be64(mk_vsid_data(ea, ssize, flags)));
> > +	WRITE_ONCE(p->save_area[index].esid, cpu_to_be64(mk_esid_data(ea, ssize, index)));  
> 
> What's the code-gen for that look like? I suspect it's terrible?

Yeah it's not great.

> 
> Should we just do it in inline-asm I wonder?

There should be no fundamental correctness reason why we can't store
to a volatile with a byteswap store. The other option we could do is
add a compiler barrier() between each store. The reason I didn't is
that in theory we don't need to invalidate all memory contents here,
but in practice probably the end result code generation would be
better.

Thanks,
Nick

^ permalink raw reply

* Re: [RFC PATCH] powerpc/fsl: Add barrier_nospec implementation for NXP PowerPC Book E
From: Scott Wood @ 2018-05-31 22:03 UTC (permalink / raw)
  To: Diana Madalina Craciun, Michael Ellerman,
	linuxppc-dev@lists.ozlabs.org
  Cc: Leo Li
In-Reply-To: <VI1PR0401MB2463C17743BC5C25C2152631FF630@VI1PR0401MB2463.eurprd04.prod.outlook.com>

On Thu, 2018-05-31 at 14:35 +0000, Diana Madalina Craciun wrote:
> On 5/31/2018 5:21 PM, Michael Ellerman wrote:
> > 
> > We can add a nospectre_v1 command line option if necessary.
> 
> What about nobarrier_nospec (or similar) instead of nospectre_v1 command
> line? We are not disabling all the v1 mitigations, the masking part will
> remain unchanged.

I think nospectre_v1 makes more sense as it's about the user's intentions
rather than the implementation.  The user is giving the kernel permission to
not defend against spectre v1, and it's up to the implementation which
mitigations (if any) to disable in response to that, same as any other
optimization.

-Scott

^ permalink raw reply

* Re: [PATCH 2/2] powerpc: Add support for function error injection
From: Naveen N. Rao @ 2018-05-31 19:06 UTC (permalink / raw)
  To: Josef Bacik, Masami Hiramatsu, Ingo Molnar, Michael Ellerman
  Cc: linux-kernel, linuxppc-dev
In-Reply-To: <87zi0fyktj.fsf@concordia.ellerman.id.au>

Michael Ellerman wrote:
> "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> writes:
>=20
>> Michael Ellerman wrote:
>>> "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> writes:
>>>> ...
>>>=20
>>> A change log is always nice even if it's short :)
>>
>> I tried, but really couldn't come up with anything more to write. I'll=20
>> try harder for v2 :)
>=20
> Yeah true.
>=20
> You could just say "All that's required is to define x and y and select
> the Kconfig symbol".

Ok, sure.

>=20
>>> This looks fine to me, it's probably easiest if it goes in via tip alon=
g
>>> with patch 1.
>>>=20
>>> Acked-by: Michael Ellerman <mpe@ellerman.id.au>
>>
>> Thanks for the review. I'll base v2 on -tip
>=20
> I guess if it doesn't already apply to tip you should rebase it. You've
> probably missed 4.18 anyway.

Oh ok. I just tried and it seems to apply just fine. I'll post v2 after=20
giving this a quick test.

- Naveen

=

^ permalink raw reply

* [PATCH] crypto/nx: Initialize 842 high and normal RxFIFO control registers
From: Haren Myneni @ 2018-05-31 17:54 UTC (permalink / raw)
  To: mpe; +Cc: herbert, linuxppc-dev, linux-crypto, hbabu


NX increments readOffset by FIFO size in receive FIFO control register
when CRB is read. But the index in RxFIFO has to match with the
corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
may be processing incorrect CRBs and can cause CRB timeout.

VAS FIFO offset is 0 when the receive window is opened during
initialization. When the module is reloaded or in kexec boot, readOffset
in FIFO control register may not match with VAS entry. This patch adds
nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
control register for both high and normal FIFOs.

Signed-off-by: Haren Myneni <haren@us.ibm.com>

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index d886a5b..ff61e4b 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -206,7 +206,8 @@
 #define OPAL_NPU_TL_SET				161
 #define OPAL_PCI_GET_PBCQ_TUNNEL_BAR		164
 #define OPAL_PCI_SET_PBCQ_TUNNEL_BAR		165
-#define OPAL_LAST				165
+#define	OPAL_NX_COPROC_INIT			167
+#define OPAL_LAST				167
 
 /* Device tree flags */
 
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 7159e1a..d79eb82 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -288,6 +288,7 @@ int64_t opal_imc_counters_init(uint32_t type, uint64_t address,
 int opal_get_power_shift_ratio(u32 handle, int token, u32 *psr);
 int opal_set_power_shift_ratio(u32 handle, int token, u32 psr);
 int opal_sensor_group_clear(u32 group_hndl, int token);
+int opal_nx_coproc_init(uint32_t chip_id, uint32_t ct);
 
 s64 opal_signal_system_reset(s32 cpu);
 
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index 3da30c2..c7541a9 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -325,3 +325,4 @@ OPAL_CALL(opal_npu_spa_clear_cache,		OPAL_NPU_SPA_CLEAR_CACHE);
 OPAL_CALL(opal_npu_tl_set,			OPAL_NPU_TL_SET);
 OPAL_CALL(opal_pci_get_pbcq_tunnel_bar,		OPAL_PCI_GET_PBCQ_TUNNEL_BAR);
 OPAL_CALL(opal_pci_set_pbcq_tunnel_bar,		OPAL_PCI_SET_PBCQ_TUNNEL_BAR);
+OPAL_CALL(opal_nx_coproc_init,			OPAL_NX_COPROC_INIT);
diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index 48fbb41..5e13908 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -1035,3 +1035,5 @@ void powernv_set_nmmu_ptcr(unsigned long ptcr)
 EXPORT_SYMBOL_GPL(opal_int_set_mfrr);
 EXPORT_SYMBOL_GPL(opal_int_eoi);
 EXPORT_SYMBOL_GPL(opal_error_code);
+/* Export the below symbol for NX compression */
+EXPORT_SYMBOL(opal_nx_coproc_init);
diff --git a/drivers/crypto/nx/nx-842-powernv.c b/drivers/crypto/nx/nx-842-powernv.c
index 1e87637..6c4784d 100644
--- a/drivers/crypto/nx/nx-842-powernv.c
+++ b/drivers/crypto/nx/nx-842-powernv.c
@@ -24,6 +24,8 @@
 #include <asm/icswx.h>
 #include <asm/vas.h>
 #include <asm/reg.h>
+#include <asm/opal-api.h>
+#include <asm/opal.h>
 
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Dan Streetman <ddstreet@ieee.org>");
@@ -803,9 +805,26 @@ static int __init vas_cfg_coproc_info(struct device_node *dn, int chip_id,
 	if (!coproc)
 		return -ENOMEM;
 
-	if (!strcmp(priority, "High"))
+	if (!strcmp(priority, "High")) {
+		/*
+		 * (lpid, pid, tid) combination has to be unique for each
+		 * coprocessor instance in the system. So to make it
+		 * unique, skiboot uses coprocessor type such as 842 or
+		 * GZIP for pid and provides this value to kernel in pid
+		 * device-tree property.
+		 *
+		 * Initialize each NX instance for both high and normal
+		 * priority FIFOs.
+		 */
+		ret = opal_nx_coproc_init(chip_id, pid);
+		if (ret) {
+			pr_err("Failed to initialize NX coproc: %d\n", ret);
+			ret = opal_error_code(ret);
+			goto err_out;
+		}
+
 		coproc->ct = VAS_COP_TYPE_842_HIPRI;
-	else if (!strcmp(priority, "Normal"))
+	} else if (!strcmp(priority, "Normal"))
 		coproc->ct = VAS_COP_TYPE_842;
 	else {
 		pr_err("Invalid RxFIFO priority value\n");

^ permalink raw reply related

* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Michael S. Tsirkin @ 2018-05-31 17:43 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Ram Pai, robh, aik, jasowang, linux-kernel, virtualization, hch,
	joe, linuxppc-dev, elfring, david
In-Reply-To: <0c508eb2-08df-3f76-c260-90cf7137af80@linux.vnet.ibm.com>

On Thu, May 31, 2018 at 09:09:24AM +0530, Anshuman Khandual wrote:
> On 05/24/2018 12:51 PM, Ram Pai wrote:
> > On Wed, May 23, 2018 at 09:50:02PM +0300, Michael S. Tsirkin wrote:
> >> subj: s/virito/virtio/
> >>
> > ..snip..
> >>>  machine_subsys_initcall_sync(pseries, tce_iommu_bus_notifier_init);
> >>> +
> >>> +bool platform_forces_virtio_dma(struct virtio_device *vdev)
> >>> +{
> >>> +	/*
> >>> +	 * On protected guest platforms, force virtio core to use DMA
> >>> +	 * MAP API for all virtio devices. But there can also be some
> >>> +	 * exceptions for individual devices like virtio balloon.
> >>> +	 */
> >>> +	return (of_find_compatible_node(NULL, NULL, "ibm,ultravisor") != NULL);
> >>> +}
> >>
> >> Isn't this kind of slow?  vring_use_dma_api is on
> >> data path and supposed to be very fast.
> > 
> > Yes it is slow and not ideal. This won't be the final code. The final
> > code will cache the information in some global variable and used
> > in this function.
> 
> Right this will be a simple check based on a global variable.
> 

Pls work on a long term solution. Short term needs can be served by
enabling the iommu platform in qemu.

-- 
MST

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox