* [PATCH v3 0/7] Various TLB and PTE improvements
@ 2018-05-24 17:58 Nicholas Piggin
2018-05-24 17:58 ` [PATCH v3 1/7] powerpc/64s/radix: do not flush TLB when relaxing access Nicholas Piggin
` (6 more replies)
0 siblings, 7 replies; 8+ messages in thread
From: Nicholas Piggin @ 2018-05-24 17:58 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Nicholas Piggin
Since last time:
- Fixed compile error on ppc32
- Significantly reworked mm_cpumask reset patch to restore the
lazy PID context switch optimisation, and not over-flush the
local CPU when flushing remotes (using IPIs).
- Moved mm_cpumask reset patch to the end of the series.
Nicholas Piggin (7):
powerpc/64s/radix: do not flush TLB when relaxing access
powerpc/64s/radix: do not flush TLB on spurious fault
powerpc/64s/radix: make ptep_get_and_clear_full non-atomic for the
full case
powerpc/64s/radix: prefetch user address in update_mmu_cache
powerpc/64s/radix: avoid ptesync after set_pte and
ptep_set_access_flags
powerpc/64s/radix: optimise pte_update
powerpc/64s/radix: flush remote CPUs out of single-threaded mm_cpumask
arch/powerpc/include/asm/book3s/64/radix.h | 37 ++--
arch/powerpc/include/asm/book3s/64/tlbflush.h | 12 +-
arch/powerpc/include/asm/cacheflush.h | 13 ++
arch/powerpc/include/asm/tlb.h | 13 ++
arch/powerpc/mm/mem.c | 4 +-
arch/powerpc/mm/mmu_context.c | 6 +-
arch/powerpc/mm/pgtable-book3s64.c | 13 +-
arch/powerpc/mm/pgtable.c | 25 ++-
arch/powerpc/mm/tlb-radix.c | 159 +++++++++++++++---
9 files changed, 222 insertions(+), 60 deletions(-)
--
2.17.0
* [PATCH v3 1/7] powerpc/64s/radix: do not flush TLB when relaxing access
From: Nicholas Piggin @ 2018-05-24 17:58 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Nicholas Piggin
Radix flushes the TLB when updating ptes to increase permissiveness
of protection (increase access authority). Book3S does not require
TLB flushing in this case, and it is not done on hash. This patch
avoids the flush for radix.
From Power ISA v3.0B, p.1090:
Setting a Reference or Change Bit or Upgrading Access Authority
(PTE Subject to Atomic Hardware Updates)
If the only change being made to a valid PTE that is subject to
atomic hardware updates is to set the Reference or Change bit to 1
or to add access authorities, a simpler sequence suffices because
the translation hardware will refetch the PTE if an access is
attempted for which the only problems were reference and/or change
bits needing to be set or insufficient access authority.
The nest MMU on POWER9 does not re-fetch the PTE after such an access
attempt before faulting, so address spaces with a coprocessor
attached will continue to flush in these cases.
This reduces tlbies for a kernel compile workload from 1.28M to 0.95M,
and tlbiels from 20.17M to 19.68M.
fork --fork --exec benchmark improved 2.77% (12000->12300).
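The flush decision described above can be sketched as a small userspace
model. This is only an illustration, not the kernel code: struct
mm_access_model, its tlb_flushes counter and relax_access_model() are
hypothetical names, and the counter stands in for the real
flush_tlb_page() call.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-mm state; not a kernel structure. */
struct mm_access_model {
	int copros;			/* coprocessors (NMMU users) attached */
	unsigned long tlb_flushes;	/* flushes issued, for illustration */
};

/*
 * Model of a PTE update that only adds access authority or sets R/C
 * bits. Returns true if the PTE changed.
 */
static bool relax_access_model(struct mm_access_model *mm,
			       unsigned long *pte, unsigned long entry)
{
	bool changed = (*pte != entry);

	if (changed) {
		*pte = entry;
		/*
		 * The core MMU refetches the PTE after an access fault,
		 * the nest MMU does not, so flush only for coprocessors.
		 */
		if (mm->copros > 0)
			mm->tlb_flushes++;
	}
	return changed;
}
```

With copros == 0 the model changes the PTE without counting a flush,
which is the case this patch optimises.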
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/powerpc/mm/pgtable-book3s64.c | 10 +++++++---
arch/powerpc/mm/pgtable.c | 25 +++++++++++++++++++++++--
2 files changed, 30 insertions(+), 5 deletions(-)
diff --git a/arch/powerpc/mm/pgtable-book3s64.c b/arch/powerpc/mm/pgtable-book3s64.c
index 518518fb7c45..994492453f0e 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -31,16 +31,20 @@ int (*register_process_table)(unsigned long base, unsigned long page_size,
int pmdp_set_access_flags(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp, pmd_t entry, int dirty)
{
+ struct mm_struct *mm = vma->vm_mm;
int changed;
#ifdef CONFIG_DEBUG_VM
WARN_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
- assert_spin_locked(&vma->vm_mm->page_table_lock);
+ assert_spin_locked(&mm->page_table_lock);
#endif
changed = !pmd_same(*(pmdp), entry);
if (changed) {
- __ptep_set_access_flags(vma->vm_mm, pmdp_ptep(pmdp),
+ __ptep_set_access_flags(mm, pmdp_ptep(pmdp),
pmd_pte(entry), address);
- flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+ /* See ptep_set_access_flags comments */
+ if (atomic_read(&mm->context.copros) > 0)
+ flush_pmd_tlb_range(vma, address,
+ address + HPAGE_PMD_SIZE);
}
return changed;
}
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 9f361ae571e9..02a24bce7e51 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -217,14 +217,35 @@ void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
int ptep_set_access_flags(struct vm_area_struct *vma, unsigned long address,
pte_t *ptep, pte_t entry, int dirty)
{
+ struct mm_struct *mm = vma->vm_mm;
int changed;
+
entry = set_access_flags_filter(entry, vma, dirty);
changed = !pte_same(*(ptep), entry);
if (changed) {
if (!is_vm_hugetlb_page(vma))
- assert_pte_locked(vma->vm_mm, address);
- __ptep_set_access_flags(vma->vm_mm, ptep, entry, address);
+ assert_pte_locked(mm, address);
+ __ptep_set_access_flags(mm, ptep, entry, address);
+#ifdef CONFIG_PPC_BOOK3S_64
+ /*
+ * Book3S does not require a TLB flush when relaxing access
+ * restrictions because the core MMU will reload the pte after
+ * taking an access fault. However the NMMU on POWER9 does not
+ * re-load the pte, so flush if we have a coprocessor attached
+ * to this address space.
+ *
+ * This could be further refined and pushed out to NMMU drivers
+ * so TLBIEs are only done for NMMU faults, but this is a more
+ * minimal fix. The NMMU fault handler does
+ * get_user_pages_remote or similar to bring the page tables
+ * in, and this flush_tlb_page will do a global TLBIE because
+ * the coprocessor is attached to the address space.
+ */
+ if (atomic_read(&mm->context.copros) > 0)
+ flush_tlb_page(vma, address);
+#else
flush_tlb_page(vma, address);
+#endif
}
return changed;
}
--
2.17.0
* [PATCH v3 2/7] powerpc/64s/radix: do not flush TLB on spurious fault
From: Nicholas Piggin @ 2018-05-24 17:58 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Nicholas Piggin
In the case of a spurious fault (which can happen due to a race with
another thread that changes the page table), the default Linux mm code
calls flush_tlb_page for that address. This is not required because
the pte will be re-fetched. Hash does not wire this up to a hardware
TLB flush for this reason. This patch avoids the flush for radix.
From Power ISA v3.0B, p.1090:
Setting a Reference or Change Bit or Upgrading Access Authority
(PTE Subject to Atomic Hardware Updates)
If the only change being made to a valid PTE that is subject to
atomic hardware updates is to set the Reference or Change bit to
1 or to add access authorities, a simpler sequence suffices
because the translation hardware will refetch the PTE if an access
is attempted for which the only problems were reference and/or
change bits needing to be set or insufficient access authority.
The nest MMU on POWER9 does not re-fetch the PTE after such an access
attempt before faulting, so address spaces with a coprocessor
attached will continue to flush in these cases.
This reduces tlbies for a kernel compile workload from 0.95M to 0.90M.
fork --fork --exec benchmark improved 0.5% (12300->12400).
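The override relies on the usual idiom of defining a macro with the same
name as the arch function, so that a generic "#ifndef" fallback is
skipped. A minimal self-contained sketch of that idiom follows. All
names ending in _model are hypothetical, and the flushes counter stands
in for a real TLB flush.

```c
/* Hypothetical stand-in for a vma; not a kernel structure. */
struct vma_model {
	int copros;	/* coprocessors attached to the address space */
	int flushes;	/* flushes issued, for illustration */
};

static inline void flush_tlb_page_model(struct vma_model *vma,
					unsigned long addr)
{
	(void)addr;
	vma->flushes++;
}

/* Arch side: the same-named macro makes the hook visible to #ifndef. */
#define fix_spurious_fault_model fix_spurious_fault_model
static inline void fix_spurious_fault_model(struct vma_model *vma,
					    unsigned long addr)
{
	/* Only address spaces with a coprocessor still need the flush. */
	if (vma->copros > 0)
		flush_tlb_page_model(vma, addr);
}

/* Generic side: fallback used only if the arch did not provide a hook. */
#ifndef fix_spurious_fault_model
#define fix_spurious_fault_model(vma, addr) flush_tlb_page_model(vma, addr)
#endif
```

Because the macro expands to its own name, calls resolve to the inline
function and the generic fallback is never defined.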
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/powerpc/include/asm/book3s/64/tlbflush.h | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h b/arch/powerpc/include/asm/book3s/64/tlbflush.h
index 0cac17253513..ebf572ea621e 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
@@ -4,7 +4,7 @@
#define MMU_NO_CONTEXT ~0UL
-
+#include <linux/mm_types.h>
#include <asm/book3s/64/tlbflush-hash.h>
#include <asm/book3s/64/tlbflush-radix.h>
@@ -137,6 +137,16 @@ static inline void flush_all_mm(struct mm_struct *mm)
#define flush_tlb_page(vma, addr) local_flush_tlb_page(vma, addr)
#define flush_all_mm(mm) local_flush_all_mm(mm)
#endif /* CONFIG_SMP */
+
+#define flush_tlb_fix_spurious_fault flush_tlb_fix_spurious_fault
+static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
+ unsigned long address)
+{
+ /* See ptep_set_access_flags comment */
+ if (atomic_read(&vma->vm_mm->context.copros) > 0)
+ flush_tlb_page(vma, address);
+}
+
/*
* flush the page walk cache for the address
*/
--
2.17.0
* [PATCH v3 3/7] powerpc/64s/radix: make ptep_get_and_clear_full non-atomic for the full case
From: Nicholas Piggin @ 2018-05-24 17:58 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Nicholas Piggin
This matches other architectures: when we know there will be no
further accesses to the address (e.g., for teardown), page table
entries can be cleared non-atomically.
The comments about NMMU are bogus: all MMU notifiers (including NMMU)
are released at this point, with their TLBs flushed. An NMMU access at
this point would be a bug.
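The difference between the two paths can be modelled in portable C11.
This is a sketch under assumptions, not the kernel implementation:
pte_model_t and get_and_clear_model() are hypothetical names, and a
relaxed atomic load/store pair stands in for the plain read and write
the patch uses in the full case.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef _Atomic unsigned long pte_model_t;	/* hypothetical PTE model */

static unsigned long get_and_clear_model(pte_model_t *ptep, bool full)
{
	unsigned long old;

	if (full) {
		/*
		 * Teardown: nothing can update R/C bits concurrently,
		 * so a separate load and store are sufficient.
		 */
		old = atomic_load_explicit(ptep, memory_order_relaxed);
		atomic_store_explicit(ptep, 0, memory_order_relaxed);
	} else {
		/*
		 * Live address space: R/C bits may be set concurrently,
		 * so read and clear in one atomic step.
		 */
		old = atomic_exchange(ptep, 0);
	}
	return old;
}
```

Both paths return the old value and leave the entry clear; only the
non-full path guards against a concurrent update between the read and
the write.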
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/powerpc/include/asm/book3s/64/radix.h | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 705193e7192f..fcd92f9b6ec0 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -176,14 +176,8 @@ static inline pte_t radix__ptep_get_and_clear_full(struct mm_struct *mm,
unsigned long old_pte;
if (full) {
- /*
- * If we are trying to clear the pte, we can skip
- * the DD1 pte update sequence and batch the tlb flush. The
- * tlb flush batching is done by mmu gather code. We
- * still keep the cmp_xchg update to make sure we get
- * correct R/C bit which might be updated via Nest MMU.
- */
- old_pte = __radix_pte_update(ptep, ~0ul, 0);
+ old_pte = pte_val(*ptep);
+ *ptep = __pte(0);
} else
old_pte = radix__pte_update(mm, addr, ptep, ~0ul, 0, 0);
--
2.17.0
* [PATCH v3 4/7] powerpc/64s/radix: prefetch user address in update_mmu_cache
From: Nicholas Piggin @ 2018-05-24 17:58 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Nicholas Piggin
Prefetch the faulting address in update_mmu_cache to give the page
table walker a head start of perhaps 100 cycles while locks are
dropped and the interrupt completes.
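The idea of issuing the hint early and touching the address later can be
shown in userspace with the GCC/Clang __builtin_prefetch builtin. This
is an illustrative sketch only: finish_fault_model() and its bookkeeping
argument are hypothetical, and the arithmetic merely stands in for the
unlock and interrupt-return work done before userspace retries the
access.

```c
/*
 * Issue a prefetch hint for an address, do unrelated work, then access
 * the address. Assumes a GCC-compatible compiler for the builtin.
 */
static long finish_fault_model(const long *user_addr, long bookkeeping)
{
	/* Hint only: it does not fault and the hardware may ignore it. */
	__builtin_prefetch(user_addr, 0, 3);

	/* Stand-in for dropping locks and returning from the interrupt. */
	long work = bookkeeping * 2 + 1;

	/* The retried access, which hopefully finds the walk already done. */
	return work + *user_addr;
}
```

Whether the hint helps depends on how much work separates it from the
access; the commit message estimates roughly 100 cycles here.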
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/powerpc/mm/mem.c | 4 +++-
arch/powerpc/mm/pgtable-book3s64.c | 3 ++-
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index c3c39b02b2ba..8cecda4bd66a 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -509,8 +509,10 @@ void update_mmu_cache(struct vm_area_struct *vma, unsigned long address,
*/
unsigned long access, trap;
- if (radix_enabled())
+ if (radix_enabled()) {
+ prefetch((void *)address);
return;
+ }
/* We only want HPTEs for linux PTEs that have _PAGE_ACCESSED set */
if (!pte_young(*ptep) || address >= TASK_SIZE)
diff --git a/arch/powerpc/mm/pgtable-book3s64.c b/arch/powerpc/mm/pgtable-book3s64.c
index 994492453f0e..7ce889a7e5ce 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -145,7 +145,8 @@ pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd)
{
- return;
+ if (radix_enabled())
+ prefetch((void *)addr);
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
--
2.17.0
* [PATCH v3 5/7] powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags
From: Nicholas Piggin @ 2018-05-24 17:58 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Nicholas Piggin
The ISA suggests ptesync after setting a pte, to prevent a table walk
initiated by a subsequent access from missing that store and causing a
spurious fault. This is an architectural allowance that lets an
implementation's page table walker to be incoherent with the store
queue.
However there is no correctness problem in taking a spurious fault in
userspace -- the kernel copes with these at any time, so the updated
pte will be found eventually. Spurious kernel faults on vmap memory
must be avoided, so a ptesync is put into flush_cache_vmap.
On POWER9 so far I have not found a measurable window where this can
result in more minor faults, so as an optimisation, remove the costly
ptesync from pte updates. If an implementation benefits from ptesync,
it would be better to add it back in update_mmu_cache, so it's not
done for things like fork(2).
fork --fork --exec benchmark improved 5.2% (12400->13100).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/powerpc/include/asm/book3s/64/radix.h | 2 --
arch/powerpc/include/asm/cacheflush.h | 13 +++++++++++++
2 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index fcd92f9b6ec0..45bf1e1b1d33 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -209,7 +209,6 @@ static inline void radix__ptep_set_access_flags(struct mm_struct *mm,
__radix_pte_update(ptep, 0, new_pte);
} else
__radix_pte_update(ptep, 0, set);
- asm volatile("ptesync" : : : "memory");
}
static inline int radix__pte_same(pte_t pte_a, pte_t pte_b)
@@ -226,7 +225,6 @@ static inline void radix__set_pte_at(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte, int percpu)
{
*ptep = pte;
- asm volatile("ptesync" : : : "memory");
}
static inline int radix__pmd_bad(pmd_t pmd)
diff --git a/arch/powerpc/include/asm/cacheflush.h b/arch/powerpc/include/asm/cacheflush.h
index 11843e37d9cf..e9662648e72d 100644
--- a/arch/powerpc/include/asm/cacheflush.h
+++ b/arch/powerpc/include/asm/cacheflush.h
@@ -26,6 +26,19 @@
#define flush_cache_vmap(start, end) do { } while (0)
#define flush_cache_vunmap(start, end) do { } while (0)
+#ifdef CONFIG_BOOK3S_64
+/*
+ * Book3s has no ptesync after setting a pte, so without this ptesync it's
+ * possible for a kernel virtual mapping access to return a spurious fault
+ * if it's accessed right after the pte is set. The page fault handler does
+ * not expect this type of fault. flush_cache_vmap is not exactly the right
+ * place to put this, but it seems to work well enough.
+ */
+#define flush_cache_vmap(start, end) do { asm volatile("ptesync"); } while (0)
+#else
+#define flush_cache_vmap(start, end) do { } while (0)
+#endif
+
#define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
extern void flush_dcache_page(struct page *page);
#define flush_dcache_mmap_lock(mapping) do { } while (0)
--
2.17.0
* [PATCH v3 6/7] powerpc/64s/radix: optimise pte_update
From: Nicholas Piggin @ 2018-05-24 17:58 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Nicholas Piggin
Implementing pte_update with pte_xchg (which uses cmpxchg) is
inefficient. A single larx/stcx. works fine, no need for the less
efficient cmpxchg sequence.
Then remove the memory barriers from the operation. There is a
requirement for TLB flushing to load mm_cpumask after the store
that reduces pte permissions, which is moved into the TLB flush
code.
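The semantics of the update, and the ordering the flush side now has to
provide, can be modelled in portable C11. This is a sketch under
assumptions: C cannot express ldarx/stdcx. directly, so the model uses a
relaxed compare-exchange loop purely to show the computed value, and
pte_update_model() and flush_path_read_cpumask_model() are hypothetical
names.

```c
#include <stdatomic.h>

/*
 * Atomically replace *ptep with (old & ~clr) | set and return old.
 * No memory barriers are implied, matching the relaxed update.
 */
static unsigned long pte_update_model(_Atomic unsigned long *ptep,
				      unsigned long clr, unsigned long set)
{
	unsigned long old = atomic_load_explicit(ptep, memory_order_relaxed);

	/* On failure 'old' is refreshed and the new value recomputed. */
	while (!atomic_compare_exchange_weak_explicit(ptep, &old,
			(old & ~clr) | set,
			memory_order_relaxed, memory_order_relaxed))
		;
	return old;
}

/*
 * Flush side: a full barrier orders earlier PTE stores before the load
 * of the cpumask, which is the role smp_mb() plays in the flush code.
 */
static unsigned long
flush_path_read_cpumask_model(_Atomic unsigned long *cpumask)
{
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(cpumask, memory_order_relaxed);
}
```

Keeping the barrier in the flush path means PTE updates that are never
followed by a flush do not pay for it.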
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/powerpc/include/asm/book3s/64/radix.h | 25 +++++++++++-----------
arch/powerpc/mm/mmu_context.c | 6 ++++--
arch/powerpc/mm/tlb-radix.c | 11 +++++++++-
3 files changed, 27 insertions(+), 15 deletions(-)
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 45bf1e1b1d33..cc9437a542cc 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -127,20 +127,21 @@ extern void radix__mark_initmem_nx(void);
static inline unsigned long __radix_pte_update(pte_t *ptep, unsigned long clr,
unsigned long set)
{
- pte_t pte;
- unsigned long old_pte, new_pte;
-
- do {
- pte = READ_ONCE(*ptep);
- old_pte = pte_val(pte);
- new_pte = (old_pte | set) & ~clr;
-
- } while (!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
-
- return old_pte;
+ __be64 old_be, tmp_be;
+
+ __asm__ __volatile__(
+ "1: ldarx %0,0,%3 # pte_update\n"
+ " andc %1,%0,%5 \n"
+ " or %1,%1,%4 \n"
+ " stdcx. %1,0,%3 \n"
+ " bne- 1b"
+ : "=&r" (old_be), "=&r" (tmp_be), "=m" (*ptep)
+ : "r" (ptep), "r" (cpu_to_be64(set)), "r" (cpu_to_be64(clr))
+ : "cc" );
+
+ return be64_to_cpu(old_be);
}
-
static inline unsigned long radix__pte_update(struct mm_struct *mm,
unsigned long addr,
pte_t *ptep, unsigned long clr,
diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
index 0ab297c4cfad..f84e14f23e50 100644
--- a/arch/powerpc/mm/mmu_context.c
+++ b/arch/powerpc/mm/mmu_context.c
@@ -57,8 +57,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
* in switch_slb(), and/or the store of paca->mm_ctx_id in
* copy_mm_to_paca().
*
- * On the read side the barrier is in pte_xchg(), which orders
- * the store to the PTE vs the load of mm_cpumask.
+ * On the other side, the barrier is in mm/tlb-radix.c for
+ * radix which orders earlier stores to clear the PTEs vs
+ * the load of mm_cpumask. And pte_xchg which does the same
+ * thing for hash.
*
* This full barrier is needed by membarrier when switching
* between processes after store to rq->curr, before user-space
diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index 5ac3206c51cc..cdc50398fd60 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -524,6 +524,11 @@ void radix__flush_tlb_mm(struct mm_struct *mm)
return;
preempt_disable();
+ /*
+ * Order loads of mm_cpumask vs previous stores to clear ptes before
+ * the invalidate. See barrier in switch_mm_irqs_off
+ */
+ smp_mb();
if (!mm_is_thread_local(mm)) {
if (mm_needs_flush_escalation(mm))
_tlbie_pid(pid, RIC_FLUSH_ALL);
@@ -544,6 +549,7 @@ void radix__flush_all_mm(struct mm_struct *mm)
return;
preempt_disable();
+ smp_mb(); /* see radix__flush_tlb_mm */
if (!mm_is_thread_local(mm))
_tlbie_pid(pid, RIC_FLUSH_ALL);
else
@@ -568,6 +574,7 @@ void radix__flush_tlb_page_psize(struct mm_struct *mm, unsigned long vmaddr,
return;
preempt_disable();
+ smp_mb(); /* see radix__flush_tlb_mm */
if (!mm_is_thread_local(mm))
_tlbie_va(vmaddr, pid, psize, RIC_FLUSH_TLB);
else
@@ -630,6 +637,7 @@ void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
return;
preempt_disable();
+ smp_mb(); /* see radix__flush_tlb_mm */
if (mm_is_thread_local(mm)) {
local = true;
full = (end == TLB_FLUSH_ALL ||
@@ -791,6 +799,7 @@ static inline void __radix__flush_tlb_range_psize(struct mm_struct *mm,
return;
preempt_disable();
+ smp_mb(); /* see radix__flush_tlb_mm */
if (mm_is_thread_local(mm)) {
local = true;
full = (end == TLB_FLUSH_ALL ||
@@ -849,7 +858,7 @@ void radix__flush_tlb_collapsed_pmd(struct mm_struct *mm, unsigned long addr)
/* Otherwise first do the PWC, then iterate the pages. */
preempt_disable();
-
+ smp_mb(); /* see radix__flush_tlb_mm */
if (mm_is_thread_local(mm)) {
_tlbiel_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, true);
} else {
--
2.17.0
* [PATCH v3 7/7] powerpc/64s/radix: flush remote CPUs out of single-threaded mm_cpumask
From: Nicholas Piggin @ 2018-05-24 17:58 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Nicholas Piggin
When a single-threaded process has a non-local mm_cpumask, try to use
that point to flush the TLBs out of other CPUs in the cpumask.
An IPI is used for clearing remote CPUs for a few reasons:
- An IPI can end lazy TLB use of the mm, which is required to prevent
TLB entries being created on the remote CPU. The alternative is to
drop lazy TLB switching completely, which costs 7.5% in a context
switch ping-pong test between a process and the kernel idle thread.
- An IPI can have remote CPUs flush the entire PID, but the local CPU
can flush a specific VA. tlbie would require over-flushing of the
local CPU (where the process is running).
- A single threaded process that is migrated to a different CPU is
likely to have a relatively small mm_cpumask, so IPI is reasonable.
No other thread can concurrently switch to this mm, because it must
have been given a reference to mm_users by the current thread before it
can use_mm. mm_users can be asynchronously incremented (by
mm_activate or mmget_not_zero), but those users must use remote mm
access and can't use_mm or access user address space. Existing code
makes this assumption already, for example sparc64 has reset
mm_cpumask using this condition since the start of history, see
arch/sparc/kernel/smp_64.c.
This reduces tlbies for a kernel compile workload from 0.90M to 0.12M.
tlbiels increase significantly because of the PID flushes used to
clean up remote CPUs and the increased local flushes (a PID flush
takes 128 tlbiels vs 1 tlbie).
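The resulting choice between a local flush, an IPI followed by a local
flush, and a broadcast tlbie can be sketched as a userspace model. It is
a simplification under assumptions: struct mm_flush_model and the
*_model functions are hypothetical, the cpumask is a plain bitmask, the
active_cpus count is omitted, and the IPI itself is not modelled.

```c
#include <stdbool.h>

enum flush_kind { FLUSH_LOCAL, FLUSH_IPI_THEN_LOCAL, FLUSH_BROADCAST };

/* Hypothetical stand-in for the mm state consulted by the flush code. */
struct mm_flush_model {
	unsigned long cpumask;	/* bit n set: CPU n may hold translations */
	int mm_users;
	int copros;
};

static bool mm_is_thread_local_model(const struct mm_flush_model *mm,
				     int cpu)
{
	return mm->cpumask == (1UL << cpu);
}

static bool mm_is_singlethreaded_model(const struct mm_flush_model *mm,
				       bool current_owns_mm)
{
	if (mm->copros > 0)
		return false;
	return mm->mm_users <= 1 && current_owns_mm;
}

static enum flush_kind choose_flush_model(struct mm_flush_model *mm,
					  int cpu, bool current_owns_mm)
{
	if (mm_is_thread_local_model(mm, cpu))
		return FLUSH_LOCAL;

	if (mm_is_singlethreaded_model(mm, current_owns_mm)) {
		/* Remote CPUs would be IPI'd here; only this CPU remains. */
		mm->cpumask = 1UL << cpu;
		return FLUSH_IPI_THEN_LOCAL;
	}
	return FLUSH_BROADCAST;
}
```

After the first IPI-based cleanup the mask is local again, so later
flushes of the same mm in this model take the cheap FLUSH_LOCAL path
until the process migrates.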
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/powerpc/include/asm/tlb.h | 13 +++
arch/powerpc/mm/tlb-radix.c | 148 +++++++++++++++++++++++++++------
2 files changed, 134 insertions(+), 27 deletions(-)
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index a7eabff27a0f..9138baccebb0 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -76,6 +76,19 @@ static inline int mm_is_thread_local(struct mm_struct *mm)
return false;
return cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm));
}
+static inline void mm_reset_thread_local(struct mm_struct *mm)
+{
+ WARN_ON(atomic_read(&mm->context.copros) > 0);
+ /*
+ * It's possible for mm_access to take a reference on mm_users to
+ * access the remote mm from another thread, but it's not allowed
+ * to set mm_cpumask, so mm_users may be > 1 here.
+ */
+ WARN_ON(current->mm != mm);
+ atomic_set(&mm->context.active_cpus, 1);
+ cpumask_clear(mm_cpumask(mm));
+ cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
+}
#else /* CONFIG_PPC_BOOK3S_64 */
static inline int mm_is_thread_local(struct mm_struct *mm)
{
diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index cdc50398fd60..67a6e86d3e7e 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -12,6 +12,8 @@
#include <linux/mm.h>
#include <linux/hugetlb.h>
#include <linux/memblock.h>
+#include <linux/mmu_context.h>
+#include <linux/sched/mm.h>
#include <asm/ppc-opcode.h>
#include <asm/tlb.h>
@@ -504,6 +506,15 @@ void radix__local_flush_tlb_page(struct vm_area_struct *vma, unsigned long vmadd
}
EXPORT_SYMBOL(radix__local_flush_tlb_page);
+static bool mm_is_singlethreaded(struct mm_struct *mm)
+{
+ if (atomic_read(&mm->context.copros) > 0)
+ return false;
+ if (atomic_read(&mm->mm_users) <= 1 && current->mm == mm)
+ return true;
+ return false;
+}
+
static bool mm_needs_flush_escalation(struct mm_struct *mm)
{
/*
@@ -511,10 +522,47 @@ static bool mm_needs_flush_escalation(struct mm_struct *mm)
* caching PTEs and not flushing them properly when
* RIC = 0 for a PID/LPID invalidate
*/
- return atomic_read(&mm->context.copros) != 0;
+ if (atomic_read(&mm->context.copros) > 0)
+ return true;
+ return false;
}
#ifdef CONFIG_SMP
+static void do_exit_flush_lazy_tlb(void *arg)
+{
+ struct mm_struct *mm = arg;
+ unsigned long pid = mm->context.id;
+
+ if (current->mm == mm)
+ return; /* Local CPU */
+
+ if (current->active_mm == mm) {
+ /*
+ * Must be a kernel thread because sender is single-threaded.
+ */
+ BUG_ON(current->mm);
+ mmgrab(&init_mm);
+ switch_mm(mm, &init_mm, current);
+ current->active_mm = &init_mm;
+ mmdrop(mm);
+ }
+ _tlbiel_pid(pid, RIC_FLUSH_ALL);
+}
+
+static void exit_flush_lazy_tlbs(struct mm_struct *mm)
+{
+ /*
+ * Would be nice if this was async so it could be run in
+ * parallel with our local flush, but generic code does not
+ * give a good API for it. Could extend the generic code or
+ * make a special powerpc IPI for flushing TLBs.
+ * For now it's not too performance critical.
+ */
+ smp_call_function_many(mm_cpumask(mm), do_exit_flush_lazy_tlb,
+ (void *)mm, 1);
+ mm_reset_thread_local(mm);
+}
+
void radix__flush_tlb_mm(struct mm_struct *mm)
{
unsigned long pid;
@@ -530,17 +578,24 @@ void radix__flush_tlb_mm(struct mm_struct *mm)
*/
smp_mb();
if (!mm_is_thread_local(mm)) {
+ if (unlikely(mm_is_singlethreaded(mm))) {
+ exit_flush_lazy_tlbs(mm);
+ goto local;
+ }
+
if (mm_needs_flush_escalation(mm))
_tlbie_pid(pid, RIC_FLUSH_ALL);
else
_tlbie_pid(pid, RIC_FLUSH_TLB);
- } else
+ } else {
+local:
_tlbiel_pid(pid, RIC_FLUSH_TLB);
+ }
preempt_enable();
}
EXPORT_SYMBOL(radix__flush_tlb_mm);
-void radix__flush_all_mm(struct mm_struct *mm)
+static void __flush_all_mm(struct mm_struct *mm, bool fullmm)
{
unsigned long pid;
@@ -550,12 +605,24 @@ void radix__flush_all_mm(struct mm_struct *mm)
preempt_disable();
smp_mb(); /* see radix__flush_tlb_mm */
- if (!mm_is_thread_local(mm))
+ if (!mm_is_thread_local(mm)) {
+ if (unlikely(mm_is_singlethreaded(mm))) {
+ if (!fullmm) {
+ exit_flush_lazy_tlbs(mm);
+ goto local;
+ }
+ }
_tlbie_pid(pid, RIC_FLUSH_ALL);
- else
+ } else {
+local:
_tlbiel_pid(pid, RIC_FLUSH_ALL);
+ }
preempt_enable();
}
+void radix__flush_all_mm(struct mm_struct *mm)
+{
+ __flush_all_mm(mm, false);
+}
EXPORT_SYMBOL(radix__flush_all_mm);
void radix__flush_tlb_pwc(struct mmu_gather *tlb, unsigned long addr)
@@ -575,10 +642,16 @@ void radix__flush_tlb_page_psize(struct mm_struct *mm, unsigned long vmaddr,
preempt_disable();
smp_mb(); /* see radix__flush_tlb_mm */
- if (!mm_is_thread_local(mm))
+ if (!mm_is_thread_local(mm)) {
+ if (unlikely(mm_is_singlethreaded(mm))) {
+ exit_flush_lazy_tlbs(mm);
+ goto local;
+ }
_tlbie_va(vmaddr, pid, psize, RIC_FLUSH_TLB);
- else
+ } else {
+local:
_tlbiel_va(vmaddr, pid, psize, RIC_FLUSH_TLB);
+ }
preempt_enable();
}
@@ -638,14 +711,21 @@ void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
preempt_disable();
smp_mb(); /* see radix__flush_tlb_mm */
- if (mm_is_thread_local(mm)) {
- local = true;
- full = (end == TLB_FLUSH_ALL ||
- nr_pages > tlb_local_single_page_flush_ceiling);
- } else {
+ if (!mm_is_thread_local(mm)) {
+ if (unlikely(mm_is_singlethreaded(mm))) {
+ if (end != TLB_FLUSH_ALL) {
+ exit_flush_lazy_tlbs(mm);
+ goto is_local;
+ }
+ }
local = false;
full = (end == TLB_FLUSH_ALL ||
nr_pages > tlb_single_page_flush_ceiling);
+ } else {
+is_local:
+ local = true;
+ full = (end == TLB_FLUSH_ALL ||
+ nr_pages > tlb_local_single_page_flush_ceiling);
}
if (full) {
@@ -766,7 +846,7 @@ void radix__tlb_flush(struct mmu_gather *tlb)
* See the comment for radix in arch_exit_mmap().
*/
if (tlb->fullmm) {
- radix__flush_all_mm(mm);
+ __flush_all_mm(mm, true);
} else if ( (psize = radix_get_mmu_psize(page_size)) == -1) {
if (!tlb->need_flush_all)
radix__flush_tlb_mm(mm);
@@ -800,24 +880,32 @@ static inline void __radix__flush_tlb_range_psize(struct mm_struct *mm,
preempt_disable();
smp_mb(); /* see radix__flush_tlb_mm */
- if (mm_is_thread_local(mm)) {
- local = true;
- full = (end == TLB_FLUSH_ALL ||
- nr_pages > tlb_local_single_page_flush_ceiling);
- } else {
+ if (!mm_is_thread_local(mm)) {
+ if (unlikely(mm_is_singlethreaded(mm))) {
+ if (end != TLB_FLUSH_ALL) {
+ exit_flush_lazy_tlbs(mm);
+ goto is_local;
+ }
+ }
local = false;
full = (end == TLB_FLUSH_ALL ||
nr_pages > tlb_single_page_flush_ceiling);
+ } else {
+is_local:
+ local = true;
+ full = (end == TLB_FLUSH_ALL ||
+ nr_pages > tlb_local_single_page_flush_ceiling);
}
if (full) {
- if (!local && mm_needs_flush_escalation(mm))
- also_pwc = true;
-
- if (local)
+ if (local) {
_tlbiel_pid(pid, also_pwc ? RIC_FLUSH_ALL : RIC_FLUSH_TLB);
- else
- _tlbie_pid(pid, also_pwc ? RIC_FLUSH_ALL: RIC_FLUSH_TLB);
+ } else {
+ if (mm_needs_flush_escalation(mm))
+ also_pwc = true;
+
+ _tlbie_pid(pid, also_pwc ? RIC_FLUSH_ALL : RIC_FLUSH_TLB);
+ }
} else {
if (local)
_tlbiel_va_range(start, end, pid, page_size, psize, also_pwc);
@@ -859,10 +947,16 @@ void radix__flush_tlb_collapsed_pmd(struct mm_struct *mm, unsigned long addr)
/* Otherwise first do the PWC, then iterate the pages. */
preempt_disable();
smp_mb(); /* see radix__flush_tlb_mm */
- if (mm_is_thread_local(mm)) {
- _tlbiel_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, true);
- } else {
+ if (!mm_is_thread_local(mm)) {
+ if (unlikely(mm_is_singlethreaded(mm))) {
+ exit_flush_lazy_tlbs(mm);
+ goto local;
+ }
_tlbie_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, true);
+ goto local;
+ } else {
+local:
+ _tlbiel_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, true);
}
preempt_enable();
--
2.17.0