* [PATCH 1/3] powerpc/64s/radix: make ptep_get_and_clear_full non-atomic for the full case
In reply to: [PATCH 0/3] powerpc/64s/radix pte manipulation optimisations (Nicholas Piggin, 2018-05-13 4:21)
From: Nicholas Piggin @ 2018-05-13 4:21 UTC
To: linuxppc-dev; +Cc: Nicholas Piggin
This matches other architectures: when we know there will be no
further accesses to the address (e.g., for teardown), page table
entries can be cleared non-atomically.

The comments about the NMMU are bogus: all MMU notifiers (including
the NMMU) have been released at this point, with their TLBs flushed,
so an NMMU access at this point would be a bug.
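For context, the "full" case comes from the mmu_gather teardown path. The
following is an illustrative sketch of the caller side only (not taken from
this patch; simplified from the generic zap path), showing why nothing can
race with the plain store:

#include <linux/mm.h>
#include <asm/tlb.h>

/*
 * Illustrative sketch, not verbatim kernel code: tlb->fullmm is set when
 * the whole address space is being torn down (exit_mmap), so the "full"
 * flag means no further access to these PTEs is possible and the clear
 * does not need to be atomic.
 */
static void zap_pte_sketch(struct mmu_gather *tlb, struct mm_struct *mm,
			   unsigned long addr, pte_t *ptep)
{
	pte_t old = ptep_get_and_clear_full(mm, addr, ptep, tlb->fullmm);

	/* TLB invalidation is batched by the mmu_gather code */
	tlb_remove_tlb_entry(tlb, ptep, addr);
	(void)old;
}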
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/powerpc/include/asm/book3s/64/radix.h | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 705193e7192f..fcd92f9b6ec0 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -176,14 +176,8 @@ static inline pte_t radix__ptep_get_and_clear_full(struct mm_struct *mm,
unsigned long old_pte;
if (full) {
- /*
- * If we are trying to clear the pte, we can skip
- * the DD1 pte update sequence and batch the tlb flush. The
- * tlb flush batching is done by mmu gather code. We
- * still keep the cmp_xchg update to make sure we get
- * correct R/C bit which might be updated via Nest MMU.
- */
- old_pte = __radix_pte_update(ptep, ~0ul, 0);
+ old_pte = pte_val(*ptep);
+ *ptep = __pte(0);
} else
old_pte = radix__pte_update(mm, addr, ptep, ~0ul, 0, 0);
--
2.17.0
* [PATCH 2/3] powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags
From: Nicholas Piggin @ 2018-05-13 4:21 UTC
To: linuxppc-dev; +Cc: Nicholas Piggin
The ISA suggests a ptesync after setting a pte, to prevent a table
walk initiated by a subsequent access from causing a spurious fault,
which may be an allowance for implementations to have page table walk
loads incoherent with store queues.
However there is no correctness problem with spurious faults -- the
kernel copes with them at any time, and the architecture requires the
pte to be re-loaded, which will eventually find the updated pte.

On POWER9 there does not appear to be a large window where this is a
problem, so as an optimisation, remove the costly ptesync from pte
updates. If an implementation does benefit from ptesync, it would
likely be better placed in update_mmu_cache, rather than in set_pte
etc., which are also called for things like fork and mprotect.
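If some implementation did turn out to want the ptesync, a minimal sketch of
the suggested placement (hypothetical, not part of this patch) would issue it
from the fault path only:

#include <linux/mm.h>
#include <asm/mmu.h>

/*
 * Hypothetical sketch, not part of this patch: issue the ptesync from
 * update_mmu_cache(), which runs on the fault path where a hardware walk
 * of the new PTE is imminent, rather than from set_pte_at() /
 * ptep_set_access_flags(), which are also called for fork and mprotect.
 */
void update_mmu_cache(struct vm_area_struct *vma, unsigned long address,
		      pte_t *ptep)
{
	if (radix_enabled()) {
		asm volatile("ptesync" : : : "memory");
		return;
	}
	/* hash MMU preloading would remain here */
}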
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/powerpc/include/asm/book3s/64/radix.h | 2 --
1 file changed, 2 deletions(-)
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index fcd92f9b6ec0..45bf1e1b1d33 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -209,7 +209,6 @@ static inline void radix__ptep_set_access_flags(struct mm_struct *mm,
__radix_pte_update(ptep, 0, new_pte);
} else
__radix_pte_update(ptep, 0, set);
- asm volatile("ptesync" : : : "memory");
}
static inline int radix__pte_same(pte_t pte_a, pte_t pte_b)
@@ -226,7 +225,6 @@ static inline void radix__set_pte_at(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte, int percpu)
{
*ptep = pte;
- asm volatile("ptesync" : : : "memory");
}
static inline int radix__pmd_bad(pmd_t pmd)
--
2.17.0
* [PATCH 3/3] powerpc/64s/radix: optimise pte_update
From: Nicholas Piggin @ 2018-05-13 4:21 UTC
To: linuxppc-dev; +Cc: Nicholas Piggin
Implementing pte_update with pte_xchg (which uses cmpxchg) is
inefficient: a single larx/stcx. sequence works fine, with no need for
the more expensive cmpxchg loop.

Then remove the memory barriers from the operation. There is still a
requirement for TLB flushing to load mm_cpumask after the store that
reduces pte permissions; the barrier providing that ordering is moved
into the TLB flush code.
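The ordering being relied on is a barrier pairing between the unmap side and
the context-switch side. A condensed sketch of the flush side (simplified
from radix__flush_tlb_mm; identifiers assumed from tlb-radix.c, not verbatim):

/*
 * Simplified sketch, not verbatim kernel code.
 *
 *   CPU A (unmapping)                      CPU B (switching to the mm)
 *   store: clear / permission-reduce PTE   store: set this CPU in mm_cpumask
 *   smp_mb()   <- added in tlb-radix.c     smp_mb()   <- switch_mm_irqs_off
 *   load:  mm_cpumask (is B using mm?)     load:  the PTE (hardware walk)
 *
 * Either A sees B in mm_cpumask and broadcasts a tlbie, or B's page table
 * walk sees the already-updated PTE.
 */
static void flush_tlb_mm_sketch(struct mm_struct *mm)
{
	unsigned long pid = mm->context.id;

	preempt_disable();
	smp_mb();	/* order PTE stores before the mm_cpumask load below */
	if (mm_is_thread_local(mm))
		_tlbiel_pid(pid, RIC_FLUSH_TLB);	/* local invalidate */
	else
		_tlbie_pid(pid, RIC_FLUSH_TLB);		/* broadcast invalidate */
	preempt_enable();
}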
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/powerpc/include/asm/book3s/64/radix.h | 25 +++++++++++-----------
arch/powerpc/mm/mmu_context.c | 6 ++++--
arch/powerpc/mm/tlb-radix.c | 11 +++++++++-
3 files changed, 27 insertions(+), 15 deletions(-)
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 45bf1e1b1d33..cc9437a542cc 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -127,20 +127,21 @@ extern void radix__mark_initmem_nx(void);
static inline unsigned long __radix_pte_update(pte_t *ptep, unsigned long clr,
unsigned long set)
{
- pte_t pte;
- unsigned long old_pte, new_pte;
-
- do {
- pte = READ_ONCE(*ptep);
- old_pte = pte_val(pte);
- new_pte = (old_pte | set) & ~clr;
-
- } while (!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
-
- return old_pte;
+ __be64 old_be, tmp_be;
+
+ __asm__ __volatile__(
+ "1: ldarx %0,0,%3 # pte_update\n"
+ " andc %1,%0,%5 \n"
+ " or %1,%1,%4 \n"
+ " stdcx. %1,0,%3 \n"
+ " bne- 1b"
+ : "=&r" (old_be), "=&r" (tmp_be), "=m" (*ptep)
+ : "r" (ptep), "r" (cpu_to_be64(set)), "r" (cpu_to_be64(clr))
+ : "cc" );
+
+ return be64_to_cpu(old_be);
}
-
static inline unsigned long radix__pte_update(struct mm_struct *mm,
unsigned long addr,
pte_t *ptep, unsigned long clr,
diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
index 0ab297c4cfad..f84e14f23e50 100644
--- a/arch/powerpc/mm/mmu_context.c
+++ b/arch/powerpc/mm/mmu_context.c
@@ -57,8 +57,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
* in switch_slb(), and/or the store of paca->mm_ctx_id in
* copy_mm_to_paca().
*
- * On the read side the barrier is in pte_xchg(), which orders
- * the store to the PTE vs the load of mm_cpumask.
+ * On the other side, the barrier is in mm/tlb-radix.c for
+ * radix which orders earlier stores to clear the PTEs vs
+ * the load of mm_cpumask. And pte_xchg which does the same
+ * thing for hash.
*
* This full barrier is needed by membarrier when switching
* between processes after store to rq->curr, before user-space
diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index 55f93d66c8d2..b419702b1ba6 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -535,6 +535,11 @@ void radix__flush_tlb_mm(struct mm_struct *mm)
return;
preempt_disable();
+ /*
+ * Order loads of mm_cpumask vs previous stores to clear ptes before
+ * the invalidate. See barrier in switch_mm_irqs_off
+ */
+ smp_mb();
if (!mm_is_thread_local(mm)) {
if (mm_is_singlethreaded(mm)) {
_tlbie_pid(pid, RIC_FLUSH_ALL);
@@ -560,6 +565,7 @@ void radix__flush_all_mm(struct mm_struct *mm)
return;
preempt_disable();
+ smp_mb(); /* see radix__flush_tlb_mm */
if (!mm_is_thread_local(mm)) {
_tlbie_pid(pid, RIC_FLUSH_ALL);
if (mm_is_singlethreaded(mm))
@@ -587,6 +593,7 @@ void radix__flush_tlb_page_psize(struct mm_struct *mm, unsigned long vmaddr,
return;
preempt_disable();
+ smp_mb(); /* see radix__flush_tlb_mm */
if (mm_is_thread_local(mm)) {
_tlbiel_va(vmaddr, pid, psize, RIC_FLUSH_TLB);
} else {
@@ -655,6 +662,7 @@ void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
return;
preempt_disable();
+ smp_mb(); /* see radix__flush_tlb_mm */
if (mm_is_thread_local(mm)) {
local = true;
full = (end == TLB_FLUSH_ALL ||
@@ -820,6 +828,7 @@ static inline void __radix__flush_tlb_range_psize(struct mm_struct *mm,
return;
preempt_disable();
+ smp_mb(); /* see radix__flush_tlb_mm */
if (mm_is_thread_local(mm)) {
local = true;
full = (end == TLB_FLUSH_ALL ||
@@ -882,7 +891,7 @@ void radix__flush_tlb_collapsed_pmd(struct mm_struct *mm, unsigned long addr)
/* Otherwise first do the PWC, then iterate the pages. */
preempt_disable();
-
+ smp_mb(); /* see radix__flush_tlb_mm */
if (mm_is_thread_local(mm)) {
_tlbiel_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, true);
} else {
--
2.17.0