* [PATCH 00/10] arm64 switch_mm improvements
@ 2015-09-17 12:50 Will Deacon
  2015-09-17 12:50 ` [PATCH 01/10] arm64: mm: remove unused cpu_set_idmap_tcr_t0sz function Will Deacon
                   ` (10 more replies)
  0 siblings, 11 replies; 18+ messages in thread
From: Will Deacon @ 2015-09-17 12:50 UTC (permalink / raw)
  To: linux-arm-kernel

Hello,

This series aims to improve our switch_mm implementation and bring the
ASID allocator up to speed with what we already have for arch/arm/.

In particular, this series:

  - Introduces routines for CPU-local I-cache and TLB invalidation, and
    converts relevant callers to use these in preference to the
    inner-shareable variants

  - Rewrites the ASID allocator to use the bitmap algorithm implemented
    for arch/arm/ plus the new relaxed atomics introduced with 4.3 (a
    subsequent patch series will optimise these for arm64).

  - Degrades fullmm TLB flushing on exit/execve to a NOP, since the ASID
    allocator will never re-allocate a dirty ASID

  - Kills mm_cpumask, as it is no longer used and therefore doesn't need
    to be kept up-to-date

  - Removes some redundant DSB instructions for synchronising page-table
    updates

Feedback (particularly as a result of testing!) is more than welcome.

Will

--->8

Will Deacon (10):
  arm64: mm: remove unused cpu_set_idmap_tcr_t0sz function
  arm64: proc: de-scope TLBI operation during cold boot
  arm64: flush: use local TLB and I-cache invalidation
  arm64: mm: rewrite ASID allocator and MM context-switching code
  arm64: tlbflush: remove redundant ASID casts to (unsigned long)
  arm64: tlbflush: avoid flushing when fullmm == 1
  arm64: switch_mm: simplify mm and CPU checks
  arm64: mm: kill mm_cpumask usage
  arm64: tlb: remove redundant barrier from __flush_tlb_pgtable
  arm64: mm: remove dsb from update_mmu_cache

 arch/arm64/include/asm/cacheflush.h  |   7 ++
 arch/arm64/include/asm/mmu.h         |  10 +-
 arch/arm64/include/asm/mmu_context.h | 113 ++++-------------
 arch/arm64/include/asm/pgtable.h     |   6 +-
 arch/arm64/include/asm/thread_info.h |   1 -
 arch/arm64/include/asm/tlb.h         |  26 ++--
 arch/arm64/include/asm/tlbflush.h    |  18 ++-
 arch/arm64/kernel/asm-offsets.c      |   2 +-
 arch/arm64/kernel/efi.c              |   5 +-
 arch/arm64/kernel/smp.c              |   9 +-
 arch/arm64/kernel/suspend.c          |   2 +-
 arch/arm64/mm/context.c              | 236 +++++++++++++++++++++--------------
 arch/arm64/mm/mmu.c                  |   2 +-
 arch/arm64/mm/proc.S                 |   6 +-
 14 files changed, 217 insertions(+), 226 deletions(-)

-- 
2.1.4

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 01/10] arm64: mm: remove unused cpu_set_idmap_tcr_t0sz function
  2015-09-17 12:50 [PATCH 00/10] arm64 switch_mm improvements Will Deacon
@ 2015-09-17 12:50 ` Will Deacon
  2015-09-17 12:50 ` [PATCH 02/10] arm64: proc: de-scope TLBI operation during cold boot Will Deacon
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Will Deacon @ 2015-09-17 12:50 UTC (permalink / raw)
  To: linux-arm-kernel

With commit b08d4640a3dc ("arm64: remove dead code"),
cpu_set_idmap_tcr_t0sz is no longer called and can therefore be removed
from the kernel.

This patch removes the function and effectively inlines the helper
function __cpu_set_tcr_t0sz into cpu_set_default_tcr_t0sz.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/mmu_context.h | 35 ++++++++++++-----------------------
 1 file changed, 12 insertions(+), 23 deletions(-)

diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index 8ec41e5f56f0..549b89554ce8 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -77,34 +77,23 @@ static inline bool __cpu_uses_extended_idmap(void)
 		unlikely(idmap_t0sz != TCR_T0SZ(VA_BITS)));
 }
 
-static inline void __cpu_set_tcr_t0sz(u64 t0sz)
-{
-	unsigned long tcr;
-
-	if (__cpu_uses_extended_idmap())
-		asm volatile (
-		"	mrs	%0, tcr_el1	;"
-		"	bfi	%0, %1, %2, %3	;"
-		"	msr	tcr_el1, %0	;"
-		"	isb"
-		: "=&r" (tcr)
-		: "r"(t0sz), "I"(TCR_T0SZ_OFFSET), "I"(TCR_TxSZ_WIDTH));
-}
-
-/*
- * Set TCR.T0SZ to the value appropriate for activating the identity map.
- */
-static inline void cpu_set_idmap_tcr_t0sz(void)
-{
-	__cpu_set_tcr_t0sz(idmap_t0sz);
-}
-
 /*
  * Set TCR.T0SZ to its default value (based on VA_BITS)
  */
 static inline void cpu_set_default_tcr_t0sz(void)
 {
-	__cpu_set_tcr_t0sz(TCR_T0SZ(VA_BITS));
+	unsigned long tcr;
+
+	if (!__cpu_uses_extended_idmap())
+		return;
+
+	asm volatile (
+	"	mrs	%0, tcr_el1	;"
+	"	bfi	%0, %1, %2, %3	;"
+	"	msr	tcr_el1, %0	;"
+	"	isb"
+	: "=&r" (tcr)
+	: "r"(TCR_T0SZ(VA_BITS)), "I"(TCR_T0SZ_OFFSET), "I"(TCR_TxSZ_WIDTH));
 }
 
 static inline void switch_new_context(struct mm_struct *mm)
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 02/10] arm64: proc: de-scope TLBI operation during cold boot
  2015-09-17 12:50 [PATCH 00/10] arm64 switch_mm improvements Will Deacon
  2015-09-17 12:50 ` [PATCH 01/10] arm64: mm: remove unused cpu_set_idmap_tcr_t0sz function Will Deacon
@ 2015-09-17 12:50 ` Will Deacon
  2015-09-17 12:50 ` [PATCH 03/10] arm64: flush: use local TLB and I-cache invalidation Will Deacon
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Will Deacon @ 2015-09-17 12:50 UTC (permalink / raw)
  To: linux-arm-kernel

When cold-booting a CPU, we must invalidate any junk entries from the
local TLB prior to enabling the MMU. This doesn't require broadcasting
within the inner-shareable domain, so de-scope the operation to apply
only to the local CPU.

Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/mm/proc.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index e4ee7bd8830a..bbde13d77da5 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -146,8 +146,8 @@ ENDPROC(cpu_do_switch_mm)
  *	value of the SCTLR_EL1 register.
  */
 ENTRY(__cpu_setup)
-	tlbi	vmalle1is			// invalidate I + D TLBs
-	dsb	ish
+	tlbi	vmalle1				// Invalidate local TLB
+	dsb	nsh
 
 	mov	x0, #3 << 20
 	msr	cpacr_el1, x0			// Enable FP/ASIMD
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 03/10] arm64: flush: use local TLB and I-cache invalidation
  2015-09-17 12:50 [PATCH 00/10] arm64 switch_mm improvements Will Deacon
  2015-09-17 12:50 ` [PATCH 01/10] arm64: mm: remove unused cpu_set_idmap_tcr_t0sz function Will Deacon
  2015-09-17 12:50 ` [PATCH 02/10] arm64: proc: de-scope TLBI operation during cold boot Will Deacon
@ 2015-09-17 12:50 ` Will Deacon
  2015-09-17 12:50 ` [PATCH 04/10] arm64: mm: rewrite ASID allocator and MM context-switching code Will Deacon
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Will Deacon @ 2015-09-17 12:50 UTC (permalink / raw)
  To: linux-arm-kernel

There are a number of places where a single CPU is running with a
private page-table and we need to perform maintenance on the TLB and
I-cache in order to ensure correctness, but do not require the operation
to be broadcast to other CPUs.

This patch adds local variants of flush_tlb_all and __flush_icache_all
to support these use-cases and updates the callers accordingly.
__local_flush_icache_all also implies an isb, since it is intended to be
used synchronously.
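
For illustration, a call site running on a private page table ends up
with a pattern like this (a sketch mirroring the efi_set_pgd() hunk
below, not a new interface):

	/* Sketch: maintenance after installing a CPU-private pgd */
	cpu_switch_mm(mm->pgd, mm);
	local_flush_tlb_all();			/* local TLB only, no broadcast */
	if (icache_is_aivivt())
		__local_flush_icache_all();	/* local I-cache; implies an isb */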

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/cacheflush.h | 7 +++++++
 arch/arm64/include/asm/tlbflush.h   | 8 ++++++++
 arch/arm64/kernel/efi.c             | 4 ++--
 arch/arm64/kernel/smp.c             | 2 +-
 arch/arm64/kernel/suspend.c         | 2 +-
 arch/arm64/mm/context.c             | 4 ++--
 arch/arm64/mm/mmu.c                 | 2 +-
 7 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index c75b8d027eb1..54efedaf331f 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -115,6 +115,13 @@ extern void copy_to_user_page(struct vm_area_struct *, struct page *,
 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
 extern void flush_dcache_page(struct page *);
 
+static inline void __local_flush_icache_all(void)
+{
+	asm("ic iallu");
+	dsb(nsh);
+	isb();
+}
+
 static inline void __flush_icache_all(void)
 {
 	asm("ic	ialluis");
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 7bd2da021658..96f944e75dc4 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -63,6 +63,14 @@
  *		only require the D-TLB to be invalidated.
  *		- kaddr - Kernel virtual memory address
  */
+static inline void local_flush_tlb_all(void)
+{
+	dsb(nshst);
+	asm("tlbi	vmalle1");
+	dsb(nsh);
+	isb();
+}
+
 static inline void flush_tlb_all(void)
 {
 	dsb(ishst);
diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index e8ca6eaedd02..b0f6dbdc5260 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -343,9 +343,9 @@ static void efi_set_pgd(struct mm_struct *mm)
 	else
 		cpu_switch_mm(mm->pgd, mm);
 
-	flush_tlb_all();
+	local_flush_tlb_all();
 	if (icache_is_aivivt())
-		__flush_icache_all();
+		__local_flush_icache_all();
 }
 
 void efi_virtmap_load(void)
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index dbdaacddd9a5..fdd4d4dbd64f 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -152,7 +152,7 @@ asmlinkage void secondary_start_kernel(void)
 	 * point to zero page to avoid speculatively fetching new entries.
 	 */
 	cpu_set_reserved_ttbr0();
-	flush_tlb_all();
+	local_flush_tlb_all();
 	cpu_set_default_tcr_t0sz();
 
 	preempt_disable();
diff --git a/arch/arm64/kernel/suspend.c b/arch/arm64/kernel/suspend.c
index 8297d502217e..3c5e4e6dcf68 100644
--- a/arch/arm64/kernel/suspend.c
+++ b/arch/arm64/kernel/suspend.c
@@ -90,7 +90,7 @@ int cpu_suspend(unsigned long arg, int (*fn)(unsigned long))
 		else
 			cpu_switch_mm(mm->pgd, mm);
 
-		flush_tlb_all();
+		local_flush_tlb_all();
 
 		/*
 		 * Restore per-cpu offset before any kernel
diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
index d70ff14dbdbd..48b53fb381af 100644
--- a/arch/arm64/mm/context.c
+++ b/arch/arm64/mm/context.c
@@ -48,9 +48,9 @@ static void flush_context(void)
 {
 	/* set the reserved TTBR0 before flushing the TLB */
 	cpu_set_reserved_ttbr0();
-	flush_tlb_all();
+	local_flush_tlb_all();
 	if (icache_is_aivivt())
-		__flush_icache_all();
+		__local_flush_icache_all();
 }
 
 static void set_mm_context(struct mm_struct *mm, unsigned int asid)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 9211b8527f25..71a310478c9e 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -456,7 +456,7 @@ void __init paging_init(void)
 	 * point to zero page to avoid speculatively fetching new entries.
 	 */
 	cpu_set_reserved_ttbr0();
-	flush_tlb_all();
+	local_flush_tlb_all();
 	cpu_set_default_tcr_t0sz();
 }
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 04/10] arm64: mm: rewrite ASID allocator and MM context-switching code
  2015-09-17 12:50 [PATCH 00/10] arm64 switch_mm improvements Will Deacon
                   ` (2 preceding siblings ...)
  2015-09-17 12:50 ` [PATCH 03/10] arm64: flush: use local TLB and I-cache invalidation Will Deacon
@ 2015-09-17 12:50 ` Will Deacon
  2015-09-29  8:46   ` Catalin Marinas
  2015-09-17 12:50 ` [PATCH 05/10] arm64: tlbflush: remove redundant ASID casts to (unsigned long) Will Deacon
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 18+ messages in thread
From: Will Deacon @ 2015-09-17 12:50 UTC (permalink / raw)
  To: linux-arm-kernel

Our current switch_mm implementation suffers from a number of problems:

  (1) The ASID allocator relies on IPIs to synchronise the CPUs on a
      rollover event

  (2) Because of (1), we cannot allocate ASIDs with interrupts disabled
      and therefore make use of a TIF_SWITCH_MM flag to postpone the
      actual switch to finish_arch_post_lock_switch

  (3) We run context switch with a reserved (invalid) TTBR0 value, even
      though the ASID and pgd are updated atomically

  (4) We take a global spinlock (cpu_asid_lock) during context-switch

  (5) We use h/w broadcast TLB operations when they are not required
      (e.g. in flush_context)

This patch addresses these problems by rewriting the ASID algorithm to
match the bitmap-based arch/arm/ implementation more closely. This in
turn allows us to remove many of the complications surrounding switch_mm,
including the ugly thread flag.
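
For reference, the new context.id packs a rolling allocator generation
above the hardware ASID. A sketch of the layout and of the fast-path
check, assuming 16-bit ASIDs (the real definitions are in the hunks
below):

	/*
	 * context.id layout (sketch, asid_bits == 16):
	 *
	 *   bits [15:0]   hardware ASID (installed into TTBR0_EL1[63:48])
	 *   bits [63:16]  allocator generation
	 *
	 * A rollover bumps asid_generation by ASID_FIRST_VERSION (1 << 16),
	 * so deciding whether an mm's ASID is still current only needs the
	 * upper bits:
	 */
	u64 asid = atomic64_read(&mm->context.id);

	if (!((asid ^ atomic64_read(&asid_generation)) >> asid_bits))
		;	/* current generation: no new allocation required */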

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/mmu.h         |  10 +-
 arch/arm64/include/asm/mmu_context.h |  76 ++---------
 arch/arm64/include/asm/thread_info.h |   1 -
 arch/arm64/kernel/asm-offsets.c      |   2 +-
 arch/arm64/kernel/efi.c              |   1 -
 arch/arm64/mm/context.c              | 238 +++++++++++++++++++++--------------
 arch/arm64/mm/proc.S                 |   2 +-
 7 files changed, 161 insertions(+), 169 deletions(-)

diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 030208767185..6af677c4f118 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -17,15 +17,11 @@
 #define __ASM_MMU_H
 
 typedef struct {
-	unsigned int id;
-	raw_spinlock_t id_lock;
-	void *vdso;
+	atomic64_t	id;
+	void		*vdso;
 } mm_context_t;
 
-#define INIT_MM_CONTEXT(name) \
-	.context.id_lock = __RAW_SPIN_LOCK_UNLOCKED(name.context.id_lock),
-
-#define ASID(mm)	((mm)->context.id & 0xffff)
+#define ASID(mm)	((mm)->context.id.counter & 0xffff)
 
 extern void paging_init(void);
 extern void __iomem *early_io_map(phys_addr_t phys, unsigned long virt);
diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index 549b89554ce8..f4c74a951b6c 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -28,13 +28,6 @@
 #include <asm/cputype.h>
 #include <asm/pgtable.h>
 
-#define MAX_ASID_BITS	16
-
-extern unsigned int cpu_last_asid;
-
-void __init_new_context(struct task_struct *tsk, struct mm_struct *mm);
-void __new_context(struct mm_struct *mm);
-
 #ifdef CONFIG_PID_IN_CONTEXTIDR
 static inline void contextidr_thread_switch(struct task_struct *next)
 {
@@ -96,66 +89,19 @@ static inline void cpu_set_default_tcr_t0sz(void)
 	: "r"(TCR_T0SZ(VA_BITS)), "I"(TCR_T0SZ_OFFSET), "I"(TCR_TxSZ_WIDTH));
 }
 
-static inline void switch_new_context(struct mm_struct *mm)
-{
-	unsigned long flags;
-
-	__new_context(mm);
-
-	local_irq_save(flags);
-	cpu_switch_mm(mm->pgd, mm);
-	local_irq_restore(flags);
-}
-
-static inline void check_and_switch_context(struct mm_struct *mm,
-					    struct task_struct *tsk)
-{
-	/*
-	 * Required during context switch to avoid speculative page table
-	 * walking with the wrong TTBR.
-	 */
-	cpu_set_reserved_ttbr0();
-
-	if (!((mm->context.id ^ cpu_last_asid) >> MAX_ASID_BITS))
-		/*
-		 * The ASID is from the current generation, just switch to the
-		 * new pgd. This condition is only true for calls from
-		 * context_switch() and interrupts are already disabled.
-		 */
-		cpu_switch_mm(mm->pgd, mm);
-	else if (irqs_disabled())
-		/*
-		 * Defer the new ASID allocation until after the context
-		 * switch critical region since __new_context() cannot be
-		 * called with interrupts disabled.
-		 */
-		set_ti_thread_flag(task_thread_info(tsk), TIF_SWITCH_MM);
-	else
-		/*
-		 * That is a direct call to switch_mm() or activate_mm() with
-		 * interrupts enabled and a new context.
-		 */
-		switch_new_context(mm);
-}
-
-#define init_new_context(tsk,mm)	(__init_new_context(tsk,mm),0)
+/*
+ * It would be nice to return ASIDs back to the allocator, but unfortunately
+ * that introduces a race with a generation rollover where we could erroneously
+ * free an ASID allocated in a future generation. We could workaround this by
+ * freeing the ASID from the context of the dying mm (e.g. in arch_exit_mmap),
+ * but we'd then need to make sure that we didn't dirty any TLBs afterwards.
+ * Setting a reserved TTBR0 or EPD0 would work, but it all gets ugly when you
+ * take CPU migration into account.
+ */
 #define destroy_context(mm)		do { } while(0)
+void check_and_switch_context(struct mm_struct *mm, unsigned int cpu);
 
-#define finish_arch_post_lock_switch \
-	finish_arch_post_lock_switch
-static inline void finish_arch_post_lock_switch(void)
-{
-	if (test_and_clear_thread_flag(TIF_SWITCH_MM)) {
-		struct mm_struct *mm = current->mm;
-		unsigned long flags;
-
-		__new_context(mm);
-
-		local_irq_save(flags);
-		cpu_switch_mm(mm->pgd, mm);
-		local_irq_restore(flags);
-	}
-}
+#define init_new_context(tsk,mm)	({ atomic64_set(&mm->context.id, 0); 0; })
 
 /*
  * This is called when "tsk" is about to enter lazy TLB mode.
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index dcd06d18a42a..555c6dec5ef2 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -111,7 +111,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_RESTORE_SIGMASK	20
 #define TIF_SINGLESTEP		21
 #define TIF_32BIT		22	/* 32bit process */
-#define TIF_SWITCH_MM		23	/* deferred switch_mm */
 
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 8d89cf8dae55..25de8b244961 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -60,7 +60,7 @@ int main(void)
   DEFINE(S_SYSCALLNO,		offsetof(struct pt_regs, syscallno));
   DEFINE(S_FRAME_SIZE,		sizeof(struct pt_regs));
   BLANK();
-  DEFINE(MM_CONTEXT_ID,		offsetof(struct mm_struct, context.id));
+  DEFINE(MM_CONTEXT_ID,		offsetof(struct mm_struct, context.id.counter));
   BLANK();
   DEFINE(VMA_VM_MM,		offsetof(struct vm_area_struct, vm_mm));
   DEFINE(VMA_VM_FLAGS,		offsetof(struct vm_area_struct, vm_flags));
diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index b0f6dbdc5260..de30a469ccd5 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -48,7 +48,6 @@ static struct mm_struct efi_mm = {
 	.mmap_sem		= __RWSEM_INITIALIZER(efi_mm.mmap_sem),
 	.page_table_lock	= __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
 	.mmlist			= LIST_HEAD_INIT(efi_mm.mmlist),
-	INIT_MM_CONTEXT(efi_mm)
 };
 
 static int uefi_debug __initdata;
diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
index 48b53fb381af..e902229b1a3d 100644
--- a/arch/arm64/mm/context.c
+++ b/arch/arm64/mm/context.c
@@ -17,135 +17,187 @@
  * along with this program.  If not, see <http://www.gnu.org/licenses/>.
  */
 
-#include <linux/init.h>
+#include <linux/bitops.h>
 #include <linux/sched.h>
+#include <linux/slab.h>
 #include <linux/mm.h>
-#include <linux/smp.h>
-#include <linux/percpu.h>
 
+#include <asm/cpufeature.h>
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
-#include <asm/cachetype.h>
 
-#define asid_bits(reg) \
-	(((read_cpuid(ID_AA64MMFR0_EL1) & 0xf0) >> 2) + 8)
+static u32 asid_bits;
+static DEFINE_RAW_SPINLOCK(cpu_asid_lock);
 
-#define ASID_FIRST_VERSION	(1 << MAX_ASID_BITS)
+static atomic64_t asid_generation;
+static unsigned long *asid_map;
 
-static DEFINE_RAW_SPINLOCK(cpu_asid_lock);
-unsigned int cpu_last_asid = ASID_FIRST_VERSION;
+static DEFINE_PER_CPU(atomic64_t, active_asids);
+static DEFINE_PER_CPU(u64, reserved_asids);
+static cpumask_t tlb_flush_pending;
 
-/*
- * We fork()ed a process, and we need a new context for the child to run in.
- */
-void __init_new_context(struct task_struct *tsk, struct mm_struct *mm)
+#define ASID_MASK		(~GENMASK(asid_bits - 1, 0))
+#define ASID_FIRST_VERSION	(1UL << asid_bits)
+#define NUM_USER_ASIDS		ASID_FIRST_VERSION
+
+static void flush_context(unsigned int cpu)
 {
-	mm->context.id = 0;
-	raw_spin_lock_init(&mm->context.id_lock);
+	int i;
+	u64 asid;
+
+	/* Update the list of reserved ASIDs and the ASID bitmap. */
+	bitmap_clear(asid_map, 0, NUM_USER_ASIDS);
+
+	/*
+	 * Ensure the generation bump is observed before we xchg the
+	 * active_asids.
+	 */
+	smp_wmb();
+
+	for_each_possible_cpu(i) {
+		asid = atomic64_xchg_relaxed(&per_cpu(active_asids, i), 0);
+		/*
+		 * If this CPU has already been through a
+		 * rollover, but hasn't run another task in
+		 * the meantime, we must preserve its reserved
+		 * ASID, as this is the only trace we have of
+		 * the process it is still running.
+		 */
+		if (asid == 0)
+			asid = per_cpu(reserved_asids, i);
+		__set_bit(asid & ~ASID_MASK, asid_map);
+		per_cpu(reserved_asids, i) = asid;
+	}
+
+	/* Queue a TLB invalidate and flush the I-cache if necessary. */
+	cpumask_setall(&tlb_flush_pending);
+
+	if (icache_is_aivivt())
+		__flush_icache_all();
 }
 
-static void flush_context(void)
+static int is_reserved_asid(u64 asid)
 {
-	/* set the reserved TTBR0 before flushing the TLB */
-	cpu_set_reserved_ttbr0();
-	local_flush_tlb_all();
-	if (icache_is_aivivt())
-		__local_flush_icache_all();
+	int cpu;
+	for_each_possible_cpu(cpu)
+		if (per_cpu(reserved_asids, cpu) == asid)
+			return 1;
+	return 0;
 }
 
-static void set_mm_context(struct mm_struct *mm, unsigned int asid)
+static u64 new_context(struct mm_struct *mm, unsigned int cpu)
 {
-	unsigned long flags;
+	static u32 cur_idx = 1;
+	u64 asid = atomic64_read(&mm->context.id);
+	u64 generation = atomic64_read(&asid_generation);
 
-	/*
-	 * Locking needed for multi-threaded applications where the same
-	 * mm->context.id could be set from different CPUs during the
-	 * broadcast. This function is also called via IPI so the
-	 * mm->context.id_lock has to be IRQ-safe.
-	 */
-	raw_spin_lock_irqsave(&mm->context.id_lock, flags);
-	if (likely((mm->context.id ^ cpu_last_asid) >> MAX_ASID_BITS)) {
+	if (asid != 0) {
 		/*
-		 * Old version of ASID found. Set the new one and reset
-		 * mm_cpumask(mm).
+		 * If our current ASID was active during a rollover, we
+		 * can continue to use it and this was just a false alarm.
 		 */
-		mm->context.id = asid;
-		cpumask_clear(mm_cpumask(mm));
+		if (is_reserved_asid(asid))
+			return generation | (asid & ~ASID_MASK);
+
+		/*
+		 * We had a valid ASID in a previous life, so try to re-use
+		 * it if possible.
+		 */
+		asid &= ~ASID_MASK;
+		if (!__test_and_set_bit(asid, asid_map))
+			goto bump_gen;
 	}
-	raw_spin_unlock_irqrestore(&mm->context.id_lock, flags);
 
 	/*
-	 * Set the mm_cpumask(mm) bit for the current CPU.
+	 * Allocate a free ASID. If we can't find one, take a note of the
+	 * currently active ASIDs and mark the TLBs as requiring flushes.
+	 * We always count from ASID #1, as we use ASID #0 when setting a
+	 * reserved TTBR0 for the init_mm.
 	 */
-	cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
+	asid = find_next_zero_bit(asid_map, NUM_USER_ASIDS, cur_idx);
+	if (asid != NUM_USER_ASIDS)
+		goto set_asid;
+
+	/* We're out of ASIDs, so increment the global generation count */
+	generation = atomic64_add_return_relaxed(ASID_FIRST_VERSION,
+						 &asid_generation);
+	flush_context(cpu);
+
+	/* We have at least 1 ASID per CPU, so this will always succeed */
+	asid = find_next_zero_bit(asid_map, NUM_USER_ASIDS, 1);
+
+set_asid:
+	__set_bit(asid, asid_map);
+	cur_idx = asid;
+
+bump_gen:
+	asid |= generation;
+	cpumask_clear(mm_cpumask(mm));
+	return asid;
 }
 
-/*
- * Reset the ASID on the current CPU. This function call is broadcast from the
- * CPU handling the ASID rollover and holding cpu_asid_lock.
- */
-static void reset_context(void *info)
+void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
 {
-	unsigned int asid;
-	unsigned int cpu = smp_processor_id();
-	struct mm_struct *mm = current->active_mm;
+	unsigned long flags;
+	u64 asid;
+
+	asid = atomic64_read(&mm->context.id);
 
 	/*
-	 * current->active_mm could be init_mm for the idle thread immediately
-	 * after secondary CPU boot or hotplug. TTBR0_EL1 is already set to
-	 * the reserved value, so no need to reset any context.
+	 * The memory ordering here is subtle. We rely on the control
+	 * dependency between the generation read and the update of
+	 * active_asids to ensure that we are synchronised with a
+	 * parallel rollover (i.e. this pairs with the smp_wmb() in
+	 * flush_context).
 	 */
-	if (mm == &init_mm)
-		return;
+	if (!((asid ^ atomic64_read(&asid_generation)) >> asid_bits)
+	    && atomic64_xchg_relaxed(&per_cpu(active_asids, cpu), asid))
+		goto switch_mm_fastpath;
+
+	raw_spin_lock_irqsave(&cpu_asid_lock, flags);
+	/* Check that our ASID belongs to the current generation. */
+	asid = atomic64_read(&mm->context.id);
+	if ((asid ^ atomic64_read(&asid_generation)) >> asid_bits) {
+		asid = new_context(mm, cpu);
+		atomic64_set(&mm->context.id, asid);
+	}
 
-	smp_rmb();
-	asid = cpu_last_asid + cpu;
+	if (cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending))
+		local_flush_tlb_all();
 
-	flush_context();
-	set_mm_context(mm, asid);
+	atomic64_set(&per_cpu(active_asids, cpu), asid);
+	cpumask_set_cpu(cpu, mm_cpumask(mm));
+	raw_spin_unlock_irqrestore(&cpu_asid_lock, flags);
 
-	/* set the new ASID */
+switch_mm_fastpath:
 	cpu_switch_mm(mm->pgd, mm);
 }
 
-void __new_context(struct mm_struct *mm)
+static int asids_init(void)
 {
-	unsigned int asid;
-	unsigned int bits = asid_bits();
-
-	raw_spin_lock(&cpu_asid_lock);
-	/*
-	 * Check the ASID again, in case the change was broadcast from another
-	 * CPU before we acquired the lock.
-	 */
-	if (!unlikely((mm->context.id ^ cpu_last_asid) >> MAX_ASID_BITS)) {
-		cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
-		raw_spin_unlock(&cpu_asid_lock);
-		return;
-	}
-	/*
-	 * At this point, it is guaranteed that the current mm (with an old
-	 * ASID) isn't active on any other CPU since the ASIDs are changed
-	 * simultaneously via IPI.
-	 */
-	asid = ++cpu_last_asid;
-
-	/*
-	 * If we've used up all our ASIDs, we need to start a new version and
-	 * flush the TLB.
-	 */
-	if (unlikely((asid & ((1 << bits) - 1)) == 0)) {
-		/* increment the ASID version */
-		cpu_last_asid += (1 << MAX_ASID_BITS) - (1 << bits);
-		if (cpu_last_asid == 0)
-			cpu_last_asid = ASID_FIRST_VERSION;
-		asid = cpu_last_asid + smp_processor_id();
-		flush_context();
-		smp_wmb();
-		smp_call_function(reset_context, NULL, 1);
-		cpu_last_asid += NR_CPUS - 1;
+	int fld = cpuid_feature_extract_field(read_cpuid(ID_AA64MMFR0_EL1), 4);
+
+	switch (fld) {
+	default:
+		pr_warn("Unknown ASID size (%d); assuming 8-bit\n", fld);
+		/* Fallthrough */
+	case 0:
+		asid_bits = 8;
+		break;
+	case 2:
+		asid_bits = 16;
 	}
 
-	set_mm_context(mm, asid);
-	raw_spin_unlock(&cpu_asid_lock);
+	/* If we end up with more CPUs than ASIDs, expect things to crash */
+	WARN_ON(NUM_USER_ASIDS < num_possible_cpus());
+	atomic64_set(&asid_generation, ASID_FIRST_VERSION);
+	asid_map = kzalloc(BITS_TO_LONGS(NUM_USER_ASIDS) * sizeof(*asid_map),
+			   GFP_KERNEL);
+	if (!asid_map)
+		panic("Failed to allocate bitmap for %lu ASIDs\n",
+		      NUM_USER_ASIDS);
+
+	pr_info("ASID allocator initialised with %lu entries\n", NUM_USER_ASIDS);
+	return 0;
 }
+early_initcall(asids_init);
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index bbde13d77da5..91cb2eaac256 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -130,7 +130,7 @@ ENDPROC(cpu_do_resume)
  *	- pgd_phys - physical address of new TTB
  */
 ENTRY(cpu_do_switch_mm)
-	mmid	w1, x1				// get mm->context.id
+	mmid	x1, x1				// get mm->context.id
 	bfi	x0, x1, #48, #16		// set the ASID
 	msr	ttbr0_el1, x0			// set TTBR0
 	isb
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 05/10] arm64: tlbflush: remove redundant ASID casts to (unsigned long)
  2015-09-17 12:50 [PATCH 00/10] arm64 switch_mm improvements Will Deacon
                   ` (3 preceding siblings ...)
  2015-09-17 12:50 ` [PATCH 04/10] arm64: mm: rewrite ASID allocator and MM context-switching code Will Deacon
@ 2015-09-17 12:50 ` Will Deacon
  2015-09-17 12:50 ` [PATCH 06/10] arm64: tlbflush: avoid flushing when fullmm == 1 Will Deacon
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Will Deacon @ 2015-09-17 12:50 UTC (permalink / raw)
  To: linux-arm-kernel

The ASID macro returns a 64-bit (long long) value, so there is no need
to cast to (unsigned long) before shifting prior to a TLBI operation.
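
For reference, the operand passed to the TLBI-by-VA instructions packs
the ASID into the top 16 bits, so the 64-bit value from ASID() can be
shifted directly (sketch of the encoding used in the hunks below):

	unsigned long addr = uaddr >> 12 |	/* VA[55:12] -> operand bits [43:0] */
			     (ASID(mm) << 48);	/* ASID      -> operand bits [63:48] */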

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/tlbflush.h | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 96f944e75dc4..93e9f964805c 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -81,7 +81,7 @@ static inline void flush_tlb_all(void)
 
 static inline void flush_tlb_mm(struct mm_struct *mm)
 {
-	unsigned long asid = (unsigned long)ASID(mm) << 48;
+	unsigned long asid = ASID(mm) << 48;
 
 	dsb(ishst);
 	asm("tlbi	aside1is, %0" : : "r" (asid));
@@ -91,8 +91,7 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 static inline void flush_tlb_page(struct vm_area_struct *vma,
 				  unsigned long uaddr)
 {
-	unsigned long addr = uaddr >> 12 |
-		((unsigned long)ASID(vma->vm_mm) << 48);
+	unsigned long addr = uaddr >> 12 | (ASID(vma->vm_mm) << 48);
 
 	dsb(ishst);
 	asm("tlbi	vale1is, %0" : : "r" (addr));
@@ -109,7 +108,7 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 				     unsigned long start, unsigned long end,
 				     bool last_level)
 {
-	unsigned long asid = (unsigned long)ASID(vma->vm_mm) << 48;
+	unsigned long asid = ASID(vma->vm_mm) << 48;
 	unsigned long addr;
 
 	if ((end - start) > MAX_TLB_RANGE) {
@@ -162,7 +161,7 @@ static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end
 static inline void __flush_tlb_pgtable(struct mm_struct *mm,
 				       unsigned long uaddr)
 {
-	unsigned long addr = uaddr >> 12 | ((unsigned long)ASID(mm) << 48);
+	unsigned long addr = uaddr >> 12 | (ASID(mm) << 48);
 
 	dsb(ishst);
 	asm("tlbi	vae1is, %0" : : "r" (addr));
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 06/10] arm64: tlbflush: avoid flushing when fullmm == 1
  2015-09-17 12:50 [PATCH 00/10] arm64 switch_mm improvements Will Deacon
                   ` (4 preceding siblings ...)
  2015-09-17 12:50 ` [PATCH 05/10] arm64: tlbflush: remove redundant ASID casts to (unsigned long) Will Deacon
@ 2015-09-17 12:50 ` Will Deacon
  2015-09-29  9:29   ` Catalin Marinas
  2015-09-17 12:50 ` [PATCH 07/10] arm64: switch_mm: simplify mm and CPU checks Will Deacon
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 18+ messages in thread
From: Will Deacon @ 2015-09-17 12:50 UTC (permalink / raw)
  To: linux-arm-kernel

The TLB gather code sets fullmm=1 when tearing down the entire address
space for an mm_struct on exit or execve. Given that the ASID allocator
will never re-allocate a dirty ASID, this flushing is not needed and can
simply be avoided in the flushing code.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/tlb.h | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index d6e6b6660380..ffdaea7954bb 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -37,17 +37,21 @@ static inline void __tlb_remove_table(void *_table)
 
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
-	if (tlb->fullmm) {
-		flush_tlb_mm(tlb->mm);
-	} else {
-		struct vm_area_struct vma = { .vm_mm = tlb->mm, };
-		/*
-		 * The intermediate page table levels are already handled by
-		 * the __(pte|pmd|pud)_free_tlb() functions, so last level
-		 * TLBI is sufficient here.
-		 */
-		__flush_tlb_range(&vma, tlb->start, tlb->end, true);
-	}
+	struct vm_area_struct vma = { .vm_mm = tlb->mm, };
+
+	/*
+	 * The ASID allocator will either invalidate the ASID or mark
+	 * it as used.
+	 */
+	if (tlb->fullmm)
+		return;
+
+	/*
+	 * The intermediate page table levels are already handled by
+	 * the __(pte|pmd|pud)_free_tlb() functions, so last level
+	 * TLBI is sufficient here.
+	 */
+	__flush_tlb_range(&vma, tlb->start, tlb->end, true);
 }
 
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 07/10] arm64: switch_mm: simplify mm and CPU checks
  2015-09-17 12:50 [PATCH 00/10] arm64 switch_mm improvements Will Deacon
                   ` (5 preceding siblings ...)
  2015-09-17 12:50 ` [PATCH 06/10] arm64: tlbflush: avoid flushing when fullmm == 1 Will Deacon
@ 2015-09-17 12:50 ` Will Deacon
  2015-09-17 12:50 ` [PATCH 08/10] arm64: mm: kill mm_cpumask usage Will Deacon
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Will Deacon @ 2015-09-17 12:50 UTC (permalink / raw)
  To: linux-arm-kernel

switch_mm performs some checks to try and avoid entering the ASID
allocator:

  (1) If we're switching to the init_mm (no user mappings), then simply
      set a reserved TTBR0 value with no page table (the zero page)

  (2) If prev == next *and* the mm_cpumask indicates that we've run on
      this CPU before, then we can skip the allocator.

However, there is plenty of redundancy here. With the new ASID allocator,
if prev == next, then we know that our ASID is valid and do not need to
worry about re-allocation. Consequently, we can drop the mm_cpumask check
in (2) and move the prev == next check before the init_mm check, since
if prev == next == init_mm then there's nothing to do.
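
The resulting switch_mm() flow is roughly as follows (a sketch of the
end state; the actual change is in the hunk below):

	if (prev == next)
		return;				/* ASID is still valid for this mm */

	if (next == &init_mm) {
		cpu_set_reserved_ttbr0();	/* no user mappings to install */
		return;
	}

	check_and_switch_context(next, cpu);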

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/mmu_context.h | 6 ++++--
 arch/arm64/mm/context.c              | 2 +-
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index f4c74a951b6c..c0e87898ba96 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -129,6 +129,9 @@ switch_mm(struct mm_struct *prev, struct mm_struct *next,
 {
 	unsigned int cpu = smp_processor_id();
 
+	if (prev == next)
+		return;
+
 	/*
 	 * init_mm.pgd does not contain any user mappings and it is always
 	 * active for kernel addresses in TTBR1. Just set the reserved TTBR0.
@@ -138,8 +141,7 @@ switch_mm(struct mm_struct *prev, struct mm_struct *next,
 		return;
 	}
 
-	if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next)) || prev != next)
-		check_and_switch_context(next, tsk);
+	check_and_switch_context(next, cpu);
 }
 
 #define deactivate_mm(tsk,mm)	do { } while (0)
diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
index e902229b1a3d..4b9ec4484e3f 100644
--- a/arch/arm64/mm/context.c
+++ b/arch/arm64/mm/context.c
@@ -166,10 +166,10 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
 		local_flush_tlb_all();
 
 	atomic64_set(&per_cpu(active_asids, cpu), asid);
-	cpumask_set_cpu(cpu, mm_cpumask(mm));
 	raw_spin_unlock_irqrestore(&cpu_asid_lock, flags);
 
 switch_mm_fastpath:
+	cpumask_set_cpu(cpu, mm_cpumask(mm));
 	cpu_switch_mm(mm->pgd, mm);
 }
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 08/10] arm64: mm: kill mm_cpumask usage
  2015-09-17 12:50 [PATCH 00/10] arm64 switch_mm improvements Will Deacon
                   ` (6 preceding siblings ...)
  2015-09-17 12:50 ` [PATCH 07/10] arm64: switch_mm: simplify mm and CPU checks Will Deacon
@ 2015-09-17 12:50 ` Will Deacon
  2015-09-17 12:50 ` [PATCH 09/10] arm64: tlb: remove redundant barrier from __flush_tlb_pgtable Will Deacon
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Will Deacon @ 2015-09-17 12:50 UTC (permalink / raw)
  To: linux-arm-kernel

mm_cpumask isn't actually used for anything on arm64, so remove all the
code trying to keep it up-to-date.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/kernel/smp.c | 7 -------
 arch/arm64/mm/context.c | 2 --
 2 files changed, 9 deletions(-)

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index fdd4d4dbd64f..03b0aa28ea61 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -142,7 +142,6 @@ asmlinkage void secondary_start_kernel(void)
 	 */
 	atomic_inc(&mm->mm_count);
 	current->active_mm = mm;
-	cpumask_set_cpu(cpu, mm_cpumask(mm));
 
 	set_my_cpu_offset(per_cpu_offset(smp_processor_id()));
 	printk("CPU%u: Booted secondary processor\n", cpu);
@@ -233,12 +232,6 @@ int __cpu_disable(void)
 	 * OK - migrate IRQs away from this CPU
 	 */
 	migrate_irqs();
-
-	/*
-	 * Remove this CPU from the vm mask set of all processes.
-	 */
-	clear_tasks_mm_cpumask(cpu);
-
 	return 0;
 }
 
diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
index 4b9ec4484e3f..f636a2639f03 100644
--- a/arch/arm64/mm/context.c
+++ b/arch/arm64/mm/context.c
@@ -132,7 +132,6 @@ set_asid:
 
 bump_gen:
 	asid |= generation;
-	cpumask_clear(mm_cpumask(mm));
 	return asid;
 }
 
@@ -169,7 +168,6 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
 	raw_spin_unlock_irqrestore(&cpu_asid_lock, flags);
 
 switch_mm_fastpath:
-	cpumask_set_cpu(cpu, mm_cpumask(mm));
 	cpu_switch_mm(mm->pgd, mm);
 }
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 09/10] arm64: tlb: remove redundant barrier from __flush_tlb_pgtable
  2015-09-17 12:50 [PATCH 00/10] arm64 switch_mm improvements Will Deacon
                   ` (7 preceding siblings ...)
  2015-09-17 12:50 ` [PATCH 08/10] arm64: mm: kill mm_cpumask usage Will Deacon
@ 2015-09-17 12:50 ` Will Deacon
  2015-09-17 12:50 ` [PATCH 10/10] arm64: mm: remove dsb from update_mmu_cache Will Deacon
  2015-09-29  9:55 ` [PATCH 00/10] arm64 switch_mm improvements Catalin Marinas
  10 siblings, 0 replies; 18+ messages in thread
From: Will Deacon @ 2015-09-17 12:50 UTC (permalink / raw)
  To: linux-arm-kernel

__flush_tlb_pgtable is used to invalidate intermediate page table
entries after they have been cleared and are about to be freed. Since
the pXd_clear() helpers imply memory barriers, we don't need the extra
one here.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/tlbflush.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 93e9f964805c..b460ae28e346 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -163,7 +163,6 @@ static inline void __flush_tlb_pgtable(struct mm_struct *mm,
 {
 	unsigned long addr = uaddr >> 12 | (ASID(mm) << 48);
 
-	dsb(ishst);
 	asm("tlbi	vae1is, %0" : : "r" (addr));
 	dsb(ish);
 }
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 10/10] arm64: mm: remove dsb from update_mmu_cache
  2015-09-17 12:50 [PATCH 00/10] arm64 switch_mm improvements Will Deacon
                   ` (8 preceding siblings ...)
  2015-09-17 12:50 ` [PATCH 09/10] arm64: tlb: remove redundant barrier from __flush_tlb_pgtable Will Deacon
@ 2015-09-17 12:50 ` Will Deacon
  2015-09-29  9:55 ` [PATCH 00/10] arm64 switch_mm improvements Catalin Marinas
  10 siblings, 0 replies; 18+ messages in thread
From: Will Deacon @ 2015-09-17 12:50 UTC (permalink / raw)
  To: linux-arm-kernel

update_mmu_cache() consists of a dsb(ishst) instruction so that new user
mappings are guaranteed to be visible to the page table walker on
exception return.

In reality this can be a very expensive operation which is rarely needed.
Removing this barrier shows a modest improvement in hackbench scores and,
in the worst case, we re-take the user fault and establish that there
was nothing to do.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 6900b2d95371..7be3c16d8545 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -650,10 +650,10 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
 				    unsigned long addr, pte_t *ptep)
 {
 	/*
-	 * set_pte() does not have a DSB for user mappings, so make sure that
-	 * the page table write is visible.
+	 * We don't do anything here, so there's a very small chance of
+	 * us retaking a user fault which we just fixed up. The alternative
+	 * is doing a dsb(ishst), but that penalises the fastpath.
 	 */
-	dsb(ishst);
 }
 
 #define update_mmu_cache_pmd(vma, address, pmd) do { } while (0)
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 04/10] arm64: mm: rewrite ASID allocator and MM context-switching code
  2015-09-17 12:50 ` [PATCH 04/10] arm64: mm: rewrite ASID allocator and MM context-switching code Will Deacon
@ 2015-09-29  8:46   ` Catalin Marinas
  2015-10-05 16:31     ` Will Deacon
  0 siblings, 1 reply; 18+ messages in thread
From: Catalin Marinas @ 2015-09-29  8:46 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Sep 17, 2015 at 01:50:13PM +0100, Will Deacon wrote:
> Our current switch_mm implementation suffers from a number of problems:
> 
>   (1) The ASID allocator relies on IPIs to synchronise the CPUs on a
>       rollover event
> 
>   (2) Because of (1), we cannot allocate ASIDs with interrupts disabled
>       and therefore make use of a TIF_SWITCH_MM flag to postpone the
>       actual switch to finish_arch_post_lock_switch
> 
>   (3) We run context switch with a reserved (invalid) TTBR0 value, even
>       though the ASID and pgd are updated atomically
> 
>   (4) We take a global spinlock (cpu_asid_lock) during context-switch
> 
>   (5) We use h/w broadcast TLB operations when they are not required
>       (e.g. in flush_context)
> 
> This patch addresses these problems by rewriting the ASID algorithm to
> match the bitmap-based arch/arm/ implementation more closely. This in
> turn allows us to remove many of the complications surrounding switch_mm,
> including the ugly thread flag.
> 
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>  arch/arm64/include/asm/mmu.h         |  10 +-
>  arch/arm64/include/asm/mmu_context.h |  76 ++---------
>  arch/arm64/include/asm/thread_info.h |   1 -
>  arch/arm64/kernel/asm-offsets.c      |   2 +-
>  arch/arm64/kernel/efi.c              |   1 -
>  arch/arm64/mm/context.c              | 238 +++++++++++++++++++++--------------
>  arch/arm64/mm/proc.S                 |   2 +-
>  7 files changed, 161 insertions(+), 169 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> index 030208767185..6af677c4f118 100644
> --- a/arch/arm64/include/asm/mmu.h
> +++ b/arch/arm64/include/asm/mmu.h
> @@ -17,15 +17,11 @@
>  #define __ASM_MMU_H
>  
>  typedef struct {
> -	unsigned int id;
> -	raw_spinlock_t id_lock;
> -	void *vdso;
> +	atomic64_t	id;
> +	void		*vdso;
>  } mm_context_t;
>  
> -#define INIT_MM_CONTEXT(name) \
> -	.context.id_lock = __RAW_SPIN_LOCK_UNLOCKED(name.context.id_lock),
> -
> -#define ASID(mm)	((mm)->context.id & 0xffff)
> +#define ASID(mm)	((mm)->context.id.counter & 0xffff)

If you changed the id to atomic64_t, can you not use atomic64_read()
here?

> diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
> index 48b53fb381af..e902229b1a3d 100644
> --- a/arch/arm64/mm/context.c
> +++ b/arch/arm64/mm/context.c
> @@ -17,135 +17,187 @@
>   * along with this program.  If not, see <http://www.gnu.org/licenses/>.
>   */
>  
> -#include <linux/init.h>
> +#include <linux/bitops.h>
>  #include <linux/sched.h>
> +#include <linux/slab.h>
>  #include <linux/mm.h>
> -#include <linux/smp.h>
> -#include <linux/percpu.h>
>  
> +#include <asm/cpufeature.h>
>  #include <asm/mmu_context.h>
>  #include <asm/tlbflush.h>
> -#include <asm/cachetype.h>
>  
> -#define asid_bits(reg) \
> -	(((read_cpuid(ID_AA64MMFR0_EL1) & 0xf0) >> 2) + 8)
> +static u32 asid_bits;
> +static DEFINE_RAW_SPINLOCK(cpu_asid_lock);
>  
> -#define ASID_FIRST_VERSION	(1 << MAX_ASID_BITS)
> +static atomic64_t asid_generation;
> +static unsigned long *asid_map;
>  
> -static DEFINE_RAW_SPINLOCK(cpu_asid_lock);
> -unsigned int cpu_last_asid = ASID_FIRST_VERSION;
> +static DEFINE_PER_CPU(atomic64_t, active_asids);
> +static DEFINE_PER_CPU(u64, reserved_asids);
> +static cpumask_t tlb_flush_pending;
>  
> -/*
> - * We fork()ed a process, and we need a new context for the child to run in.
> - */
> -void __init_new_context(struct task_struct *tsk, struct mm_struct *mm)
> +#define ASID_MASK		(~GENMASK(asid_bits - 1, 0))
> +#define ASID_FIRST_VERSION	(1UL << asid_bits)
> +#define NUM_USER_ASIDS		ASID_FIRST_VERSION

Apart from NUM_USER_ASIDS, I think we can live with constants for
ASID_MASK and ASID_FIRST_VERSION (as per 16-bit ASIDs, together with
some shifts converted to a constant), giving marginally more optimal code
generation that avoids reading asid_bits all the time. We should be OK
with a 48-bit generation field.

> +static void flush_context(unsigned int cpu)
>  {
> -	mm->context.id = 0;
> -	raw_spin_lock_init(&mm->context.id_lock);
> +	int i;
> +	u64 asid;
> +
> +	/* Update the list of reserved ASIDs and the ASID bitmap. */
> +	bitmap_clear(asid_map, 0, NUM_USER_ASIDS);
> +
> +	/*
> +	 * Ensure the generation bump is observed before we xchg the
> +	 * active_asids.
> +	 */
> +	smp_wmb();
> +
> +	for_each_possible_cpu(i) {
> +		asid = atomic64_xchg_relaxed(&per_cpu(active_asids, i), 0);
> +		/*
> +		 * If this CPU has already been through a
> +		 * rollover, but hasn't run another task in
> +		 * the meantime, we must preserve its reserved
> +		 * ASID, as this is the only trace we have of
> +		 * the process it is still running.
> +		 */
> +		if (asid == 0)
> +			asid = per_cpu(reserved_asids, i);
> +		__set_bit(asid & ~ASID_MASK, asid_map);
> +		per_cpu(reserved_asids, i) = asid;
> +	}
> +
> +	/* Queue a TLB invalidate and flush the I-cache if necessary. */
> +	cpumask_setall(&tlb_flush_pending);
> +
> +	if (icache_is_aivivt())
> +		__flush_icache_all();
>  }
[...]
> +void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
>  {
> -	unsigned int asid;
> -	unsigned int cpu = smp_processor_id();
> -	struct mm_struct *mm = current->active_mm;
> +	unsigned long flags;
> +	u64 asid;
> +
> +	asid = atomic64_read(&mm->context.id);
>  
>  	/*
> -	 * current->active_mm could be init_mm for the idle thread immediately
> -	 * after secondary CPU boot or hotplug. TTBR0_EL1 is already set to
> -	 * the reserved value, so no need to reset any context.
> +	 * The memory ordering here is subtle. We rely on the control
> +	 * dependency between the generation read and the update of
> +	 * active_asids to ensure that we are synchronised with a
> +	 * parallel rollover (i.e. this pairs with the smp_wmb() in
> +	 * flush_context).
>  	 */
> -	if (mm == &init_mm)
> -		return;
> +	if (!((asid ^ atomic64_read(&asid_generation)) >> asid_bits)
> +	    && atomic64_xchg_relaxed(&per_cpu(active_asids, cpu), asid))
> +		goto switch_mm_fastpath;

Just trying to make sense of this ;). At a parallel roll-over, we have
two cases for the asid check above: it either (1) sees the new
generation or (2) the old one.

(1) is simple since it falls back on the slow path.

(2) means that it goes on and performs an atomic64_xchg. This may happen
before or after the active_asids xchg in flush_context(). We now have
two sub-cases:

a) if the code above sees the updated (in flush_context()) active_asids,
it falls back on the slow path since xchg returns 0. Here we are
guaranteed that another read of asid_generation returns the new value
(by the smp_wmb() in flush_context).

b) the code above sees the old active_asids, goes to the fast path just
like a roll-over hasn't happened (on this CPU). On the CPU doing the
roll-over, we want the active_asids xchg to see the new asid. That's
guaranteed by the atomicity of the xchg implementation (otherwise it
would be case (a) above).

So what the control dependency actually buys us is that a store
(exclusive) is not architecturally visible if the generation check
fails. I guess this only works (with respect to the load) because of the
exclusiveness of the memory accesses.

> +	raw_spin_lock_irqsave(&cpu_asid_lock, flags);
> +	/* Check that our ASID belongs to the current generation. */
> +	asid = atomic64_read(&mm->context.id);
> +	if ((asid ^ atomic64_read(&asid_generation)) >> asid_bits) {
> +		asid = new_context(mm, cpu);
> +		atomic64_set(&mm->context.id, asid);
> +	}
>  
> -	smp_rmb();
> -	asid = cpu_last_asid + cpu;
> +	if (cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending))
> +		local_flush_tlb_all();
>  
> -	flush_context();
> -	set_mm_context(mm, asid);
> +	atomic64_set(&per_cpu(active_asids, cpu), asid);
> +	cpumask_set_cpu(cpu, mm_cpumask(mm));
> +	raw_spin_unlock_irqrestore(&cpu_asid_lock, flags);
>  
> -	/* set the new ASID */
> +switch_mm_fastpath:
>  	cpu_switch_mm(mm->pgd, mm);
>  }

And on the slow path, races with roll-overs on other CPUs are serialised
by cpu_asid_lock.

-- 
Catalin

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 06/10] arm64: tlbflush: avoid flushing when fullmm == 1
  2015-09-17 12:50 ` [PATCH 06/10] arm64: tlbflush: avoid flushing when fullmm == 1 Will Deacon
@ 2015-09-29  9:29   ` Catalin Marinas
  2015-10-05 16:33     ` Will Deacon
  0 siblings, 1 reply; 18+ messages in thread
From: Catalin Marinas @ 2015-09-29  9:29 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Sep 17, 2015 at 01:50:15PM +0100, Will Deacon wrote:
> The TLB gather code sets fullmm=1 when tearing down the entire address
> space for an mm_struct on exit or execve. Given that the ASID allocator
> will never re-allocate a dirty ASID, this flushing is not needed and can
> simply be avoided in the flushing code.
> 
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>  arch/arm64/include/asm/tlb.h | 26 +++++++++++++++-----------
>  1 file changed, 15 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
> index d6e6b6660380..ffdaea7954bb 100644
> --- a/arch/arm64/include/asm/tlb.h
> +++ b/arch/arm64/include/asm/tlb.h
> @@ -37,17 +37,21 @@ static inline void __tlb_remove_table(void *_table)
>  
>  static inline void tlb_flush(struct mmu_gather *tlb)
>  {
> -	if (tlb->fullmm) {
> -		flush_tlb_mm(tlb->mm);
> -	} else {
> -		struct vm_area_struct vma = { .vm_mm = tlb->mm, };
> -		/*
> -		 * The intermediate page table levels are already handled by
> -		 * the __(pte|pmd|pud)_free_tlb() functions, so last level
> -		 * TLBI is sufficient here.
> -		 */
> -		__flush_tlb_range(&vma, tlb->start, tlb->end, true);
> -	}
> +	struct vm_area_struct vma = { .vm_mm = tlb->mm, };
> +
> +	/*
> +	 * The ASID allocator will either invalidate the ASID or mark
> +	 * it as used.
> +	 */
> +	if (tlb->fullmm)
> +		return;

BTW, do we actually need this flush_tlb_mm() with the current ASID
allocator? It doesn't reuse old ASIDs before a full TLBI either (just
trying to remember if we had any logic here; maybe it was needed before
the non-lazy __pte_free_tlb).

-- 
Catalin

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 00/10] arm64 switch_mm improvements
  2015-09-17 12:50 [PATCH 00/10] arm64 switch_mm improvements Will Deacon
                   ` (9 preceding siblings ...)
  2015-09-17 12:50 ` [PATCH 10/10] arm64: mm: remove dsb from update_mmu_cache Will Deacon
@ 2015-09-29  9:55 ` Catalin Marinas
  10 siblings, 0 replies; 18+ messages in thread
From: Catalin Marinas @ 2015-09-29  9:55 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Sep 17, 2015 at 01:50:09PM +0100, Will Deacon wrote:
> Will Deacon (10):
>   arm64: mm: remove unused cpu_set_idmap_tcr_t0sz function
>   arm64: proc: de-scope TLBI operation during cold boot
>   arm64: flush: use local TLB and I-cache invalidation
>   arm64: mm: rewrite ASID allocator and MM context-switching code
>   arm64: tlbflush: remove redundant ASID casts to (unsigned long)
>   arm64: tlbflush: avoid flushing when fullmm == 1
>   arm64: switch_mm: simplify mm and CPU checks
>   arm64: mm: kill mm_cpumask usage
>   arm64: tlb: remove redundant barrier from __flush_tlb_pgtable
>   arm64: mm: remove dsb from update_mmu_cache

Apart from some minor comments on patch 4, the series looks fine to me.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

I'll do some more tests in the meantime.

-- 
Catalin

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 04/10] arm64: mm: rewrite ASID allocator and MM context-switching code
  2015-09-29  8:46   ` Catalin Marinas
@ 2015-10-05 16:31     ` Will Deacon
  2015-10-05 17:16       ` Catalin Marinas
  0 siblings, 1 reply; 18+ messages in thread
From: Will Deacon @ 2015-10-05 16:31 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Catalin,

On Tue, Sep 29, 2015 at 09:46:15AM +0100, Catalin Marinas wrote:
> On Thu, Sep 17, 2015 at 01:50:13PM +0100, Will Deacon wrote:
> > diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> > index 030208767185..6af677c4f118 100644
> > --- a/arch/arm64/include/asm/mmu.h
> > +++ b/arch/arm64/include/asm/mmu.h
> > @@ -17,15 +17,11 @@
> >  #define __ASM_MMU_H
> >  
> >  typedef struct {
> > -	unsigned int id;
> > -	raw_spinlock_t id_lock;
> > -	void *vdso;
> > +	atomic64_t	id;
> > +	void		*vdso;
> >  } mm_context_t;
> >  
> > -#define INIT_MM_CONTEXT(name) \
> > -	.context.id_lock = __RAW_SPIN_LOCK_UNLOCKED(name.context.id_lock),
> > -
> > -#define ASID(mm)	((mm)->context.id & 0xffff)
> > +#define ASID(mm)	((mm)->context.id.counter & 0xffff)
> 
> If you changed the id to atomic64_t, can you not use atomic64_read()
> here?

I could, but it forces the access to be volatile which I don't think is
necessary for any of the users of this macro (i.e. the tlbflushing code).

> > -#define asid_bits(reg) \
> > -	(((read_cpuid(ID_AA64MMFR0_EL1) & 0xf0) >> 2) + 8)
> > +static u32 asid_bits;
> > +static DEFINE_RAW_SPINLOCK(cpu_asid_lock);
> >  
> > -#define ASID_FIRST_VERSION	(1 << MAX_ASID_BITS)
> > +static atomic64_t asid_generation;
> > +static unsigned long *asid_map;
> >  
> > -static DEFINE_RAW_SPINLOCK(cpu_asid_lock);
> > -unsigned int cpu_last_asid = ASID_FIRST_VERSION;
> > +static DEFINE_PER_CPU(atomic64_t, active_asids);
> > +static DEFINE_PER_CPU(u64, reserved_asids);
> > +static cpumask_t tlb_flush_pending;
> >  
> > -/*
> > - * We fork()ed a process, and we need a new context for the child to run in.
> > - */
> > -void __init_new_context(struct task_struct *tsk, struct mm_struct *mm)
> > +#define ASID_MASK		(~GENMASK(asid_bits - 1, 0))
> > +#define ASID_FIRST_VERSION	(1UL << asid_bits)
> > +#define NUM_USER_ASIDS		ASID_FIRST_VERSION
> 
> Apart from NUM_USER_ASIDS, I think we can live with constants for
> ASID_MASK and ASID_FIRST_VERSION (as per 16-bit ASIDs, together with
> some shifts converted to a constant), giving marginally more optimal code
> generation that avoids reading asid_bits all the time. We should be OK
> with a 48-bit generation field.

The main reason for writing it like this is that it's easy to test the
code with different asid sizes -- you just change asid_bits and all of
the masks change accordingly. If we hardcode ASID_MASK then we'll break
flush_context (which uses it to generate a bitmap index) and, given that
ASID_MASK and ASID_FIRST_VERSION are only used on the slow-path, I'd
favour the current code over a micro-optimisation.
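
(Concretely, flush_context() indexes the ASID bitmap using the mask:

	__set_bit(asid & ~ASID_MASK, asid_map);

so a hard-coded 16-bit ASID_MASK combined with an 8-bit asid_map would
index past the end of the bitmap.)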

> > +void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
> >  {
> > -	unsigned int asid;
> > -	unsigned int cpu = smp_processor_id();
> > -	struct mm_struct *mm = current->active_mm;
> > +	unsigned long flags;
> > +	u64 asid;
> > +
> > +	asid = atomic64_read(&mm->context.id);
> >  
> >  	/*
> > -	 * current->active_mm could be init_mm for the idle thread immediately
> > -	 * after secondary CPU boot or hotplug. TTBR0_EL1 is already set to
> > -	 * the reserved value, so no need to reset any context.
> > +	 * The memory ordering here is subtle. We rely on the control
> > +	 * dependency between the generation read and the update of
> > +	 * active_asids to ensure that we are synchronised with a
> > +	 * parallel rollover (i.e. this pairs with the smp_wmb() in
> > +	 * flush_context).
> >  	 */
> > -	if (mm == &init_mm)
> > -		return;
> > +	if (!((asid ^ atomic64_read(&asid_generation)) >> asid_bits)
> > +	    && atomic64_xchg_relaxed(&per_cpu(active_asids, cpu), asid))
> > +		goto switch_mm_fastpath;
> 
> Just trying to make sense of this ;). At a parallel roll-over, we have
> two cases for the asid check above: it either (1) sees the new
> generation or (2) the old one.
> 
> (1) is simple since it falls back on the slow path.
> 
> (2) means that it goes on and performs an atomic64_xchg. This may happen
> before or after the active_asids xchg in flush_context(). We now have
> two sub-cases:
> 
> a) if the code above sees the updated (in flush_context()) active_asids,
> it falls back on the slow path since xchg returns 0. Here we are
> guaranteed that another read of asid_generation returns the new value
> (by the smp_wmb() in flush_context).
> 
> b) the code above sees the old active_asids, goes to the fast path just
> like a roll-over hasn't happened (on this CPU). On the CPU doing the
> roll-over, we want the active_asids xchg to see the new asid. That's
> guaranteed by the atomicity of the xchg implementation (otherwise it
> would be case (a) above).
> 
> So what the control dependency actually buys us is that a store
> (exclusive) is not architecturally visible if the generation check
> fails. I guess this only works (with respect to the load) because of the
> exclusiveness of the memory accesses.

This is also the case for non-exclusive stores (i.e. a control dependency
from a load to a store creates order) since we don't permit speculative
writes. So here, the control dependency is between the atomic64_read of
the generation and the store-exclusive part of the xchg. The
exclusiveness then guarantees that we replay the load-exclusive part of
the xchg in the face of contention (due to a parallel rollover).
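
(A rough schematic of the pairing being described; the rollover side is
paraphrased from the patch and partly pseudo-code, so names and details
are approximate:)

	/* Fast path in check_and_switch_context(), as quoted above: */
	asid = atomic64_read(&mm->context.id);
	if (!((asid ^ atomic64_read(&asid_generation)) >> asid_bits) &&
	    atomic64_xchg_relaxed(&per_cpu(active_asids, cpu), asid))
		goto switch_mm_fastpath;	/* store ordered by ctrl dep */

	/* Rollover path (flush_context()), paraphrased: */
	bump asid_generation;			/* pseudo-code */
	smp_wmb();				/* pairs with the ctrl dep above */
	for_each_possible_cpu(i)
		xchg per_cpu(active_asids, i) to 0;	/* pseudo-code */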

You seem to have the gist of it, though.

Will

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 06/10] arm64: tlbflush: avoid flushing when fullmm == 1
  2015-09-29  9:29   ` Catalin Marinas
@ 2015-10-05 16:33     ` Will Deacon
  2015-10-05 17:18       ` Catalin Marinas
  0 siblings, 1 reply; 18+ messages in thread
From: Will Deacon @ 2015-10-05 16:33 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Sep 29, 2015 at 10:29:58AM +0100, Catalin Marinas wrote:
> On Thu, Sep 17, 2015 at 01:50:15PM +0100, Will Deacon wrote:
> > The TLB gather code sets fullmm=1 when tearing down the entire address
> > space for an mm_struct on exit or execve. Given that the ASID allocator
> > will never re-allocate a dirty ASID, this flushing is not needed and can
> > simply be avoided in the flushing code.
> > 
> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> > ---
> >  arch/arm64/include/asm/tlb.h | 26 +++++++++++++++-----------
> >  1 file changed, 15 insertions(+), 11 deletions(-)
> > 
> > diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
> > index d6e6b6660380..ffdaea7954bb 100644
> > --- a/arch/arm64/include/asm/tlb.h
> > +++ b/arch/arm64/include/asm/tlb.h
> > @@ -37,17 +37,21 @@ static inline void __tlb_remove_table(void *_table)
> >  
> >  static inline void tlb_flush(struct mmu_gather *tlb)
> >  {
> > -	if (tlb->fullmm) {
> > -		flush_tlb_mm(tlb->mm);
> > -	} else {
> > -		struct vm_area_struct vma = { .vm_mm = tlb->mm, };
> > -		/*
> > -		 * The intermediate page table levels are already handled by
> > -		 * the __(pte|pmd|pud)_free_tlb() functions, so last level
> > -		 * TLBI is sufficient here.
> > -		 */
> > -		__flush_tlb_range(&vma, tlb->start, tlb->end, true);
> > -	}
> > +	struct vm_area_struct vma = { .vm_mm = tlb->mm, };
> > +
> > +	/*
> > +	 * The ASID allocator will either invalidate the ASID or mark
> > +	 * it as used.
> > +	 */
> > +	if (tlb->fullmm)
> > +		return;
> 
> BTW, do we actually need this flush_tlb_mm() with the current ASID
> allocator? It doesn't reuse old ASIDs either before a full TLBI (just
> trying to remember if we had any logic; or maybe it was needed before
> non-lazy __pte_free_tlb).

I'm afraid I don't follow you here. This diff is removing the flush_tlb_mm
because, as you point out, it's no longer needed.

What am I missing?

Will

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 04/10] arm64: mm: rewrite ASID allocator and MM context-switching code
  2015-10-05 16:31     ` Will Deacon
@ 2015-10-05 17:16       ` Catalin Marinas
  0 siblings, 0 replies; 18+ messages in thread
From: Catalin Marinas @ 2015-10-05 17:16 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Oct 05, 2015 at 05:31:00PM +0100, Will Deacon wrote:
> On Tue, Sep 29, 2015 at 09:46:15AM +0100, Catalin Marinas wrote:
> > On Thu, Sep 17, 2015 at 01:50:13PM +0100, Will Deacon wrote:
> > > diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> > > index 030208767185..6af677c4f118 100644
> > > --- a/arch/arm64/include/asm/mmu.h
> > > +++ b/arch/arm64/include/asm/mmu.h
> > > @@ -17,15 +17,11 @@
> > >  #define __ASM_MMU_H
> > >  
> > >  typedef struct {
> > > -	unsigned int id;
> > > -	raw_spinlock_t id_lock;
> > > -	void *vdso;
> > > +	atomic64_t	id;
> > > +	void		*vdso;
> > >  } mm_context_t;
> > >  
> > > -#define INIT_MM_CONTEXT(name) \
> > > -	.context.id_lock = __RAW_SPIN_LOCK_UNLOCKED(name.context.id_lock),
> > > -
> > > -#define ASID(mm)	((mm)->context.id & 0xffff)
> > > +#define ASID(mm)	((mm)->context.id.counter & 0xffff)
> > 
> > If you changed the id to atomic64_t, can you not use atomic64_read()
> > here?
> 
> I could, but it forces the access to be volatile, which I don't think is
> necessary for any of the users of this macro (i.e. the TLB flushing code).

OK. But please add a small comment (it can be a separate patch, up to
you).

> > > -#define asid_bits(reg) \
> > > -	(((read_cpuid(ID_AA64MMFR0_EL1) & 0xf0) >> 2) + 8)
> > > +static u32 asid_bits;
> > > +static DEFINE_RAW_SPINLOCK(cpu_asid_lock);
> > >  
> > > -#define ASID_FIRST_VERSION	(1 << MAX_ASID_BITS)
> > > +static atomic64_t asid_generation;
> > > +static unsigned long *asid_map;
> > >  
> > > -static DEFINE_RAW_SPINLOCK(cpu_asid_lock);
> > > -unsigned int cpu_last_asid = ASID_FIRST_VERSION;
> > > +static DEFINE_PER_CPU(atomic64_t, active_asids);
> > > +static DEFINE_PER_CPU(u64, reserved_asids);
> > > +static cpumask_t tlb_flush_pending;
> > >  
> > > -/*
> > > - * We fork()ed a process, and we need a new context for the child to run in.
> > > - */
> > > -void __init_new_context(struct task_struct *tsk, struct mm_struct *mm)
> > > +#define ASID_MASK		(~GENMASK(asid_bits - 1, 0))
> > > +#define ASID_FIRST_VERSION	(1UL << asid_bits)
> > > +#define NUM_USER_ASIDS		ASID_FIRST_VERSION
> > 
> > Apart from NUM_USER_ASIDS, I think we can live with constants for
> > ASID_MASK and ASID_FIRST_VERSION (as per 16-bit ASIDs, together with
> > some shifts converted to constants), giving marginally more optimal
> > code generation that avoids reading asid_bits all the time. We should
> > be OK with a 48-bit generation field.
> 
> The main reason for writing it like this is that it's easy to test the
> code with different asid sizes -- you just change asid_bits and all of
> the masks change accordingly.

My point was that an inclusive mask should be enough as long as
NUM_USER_ASIDS changes.

> If we hardcode ASID_MASK then we'll break flush_context (which uses it
> to generate a bitmap index)

I don't fully get how it would break if the generation always starts
from bit 16 and the ASIDs are capped to NUM_USER_ASIDS. But I'm probably
missing something.

> and, given that ASID_MASK and ASID_FIRST_VERSION are only used on the
> slow-path, I'd favour the current code over a micro-optimisation.

That's a good point. So leave it as it is, or maybe just avoid negating
it twice and use (GENMASK(...)) directly.
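
(A sketch of that suggestion -- illustrative only; the __set_bit() use
site is assumed here rather than quoted from the series:)

	/* current: inverted mask, negated again at the use site */
	#define ASID_MASK	(~GENMASK(asid_bits - 1, 0))
	__set_bit(asid & ~ASID_MASK, asid_map);

	/* suggested: keep the inclusive mask and use it directly */
	#define ASID_MASK	GENMASK(asid_bits - 1, 0)
	__set_bit(asid & ASID_MASK, asid_map);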

-- 
Catalin

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 06/10] arm64: tlbflush: avoid flushing when fullmm == 1
  2015-10-05 16:33     ` Will Deacon
@ 2015-10-05 17:18       ` Catalin Marinas
  0 siblings, 0 replies; 18+ messages in thread
From: Catalin Marinas @ 2015-10-05 17:18 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Oct 05, 2015 at 05:33:13PM +0100, Will Deacon wrote:
> On Tue, Sep 29, 2015 at 10:29:58AM +0100, Catalin Marinas wrote:
> > On Thu, Sep 17, 2015 at 01:50:15PM +0100, Will Deacon wrote:
> > > The TLB gather code sets fullmm=1 when tearing down the entire address
> > > space for an mm_struct on exit or execve. Given that the ASID allocator
> > > will never re-allocate a dirty ASID, this flushing is not needed and can
> > > simply be avoided in the flushing code.
> > > 
> > > Signed-off-by: Will Deacon <will.deacon@arm.com>
> > > ---
> > >  arch/arm64/include/asm/tlb.h | 26 +++++++++++++++-----------
> > >  1 file changed, 15 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
> > > index d6e6b6660380..ffdaea7954bb 100644
> > > --- a/arch/arm64/include/asm/tlb.h
> > > +++ b/arch/arm64/include/asm/tlb.h
> > > @@ -37,17 +37,21 @@ static inline void __tlb_remove_table(void *_table)
> > >  
> > >  static inline void tlb_flush(struct mmu_gather *tlb)
> > >  {
> > > -	if (tlb->fullmm) {
> > > -		flush_tlb_mm(tlb->mm);
> > > -	} else {
> > > -		struct vm_area_struct vma = { .vm_mm = tlb->mm, };
> > > -		/*
> > > -		 * The intermediate page table levels are already handled by
> > > -		 * the __(pte|pmd|pud)_free_tlb() functions, so last level
> > > -		 * TLBI is sufficient here.
> > > -		 */
> > > -		__flush_tlb_range(&vma, tlb->start, tlb->end, true);
> > > -	}
> > > +	struct vm_area_struct vma = { .vm_mm = tlb->mm, };
> > > +
> > > +	/*
> > > +	 * The ASID allocator will either invalidate the ASID or mark
> > > +	 * it as used.
> > > +	 */
> > > +	if (tlb->fullmm)
> > > +		return;
> > 
> > BTW, do we actually need this flush_tlb_mm() with the current ASID
> > allocator? It doesn't reuse old ASIDs either before a full TLBI (just
> > trying to remember if we had any logic; or maybe it was needed before
> > non-lazy __pte_free_tlb).
> 
> I'm afraid I don't follow you here. This diff is removing the flush_tlb_mm
> because, as you point out, it's no longer needed.

I don't think you are missing anything; I was just wondering why we had
it before (we probably forgot to remove it when the non-lazy
__pte_free_tlb was added).

-- 
Catalin

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, newest: 2015-10-05 17:18 UTC

Thread overview: 18+ messages
2015-09-17 12:50 [PATCH 00/10] arm64 switch_mm improvements Will Deacon
2015-09-17 12:50 ` [PATCH 01/10] arm64: mm: remove unused cpu_set_idmap_tcr_t0sz function Will Deacon
2015-09-17 12:50 ` [PATCH 02/10] arm64: proc: de-scope TLBI operation during cold boot Will Deacon
2015-09-17 12:50 ` [PATCH 03/10] arm64: flush: use local TLB and I-cache invalidation Will Deacon
2015-09-17 12:50 ` [PATCH 04/10] arm64: mm: rewrite ASID allocator and MM context-switching code Will Deacon
2015-09-29  8:46   ` Catalin Marinas
2015-10-05 16:31     ` Will Deacon
2015-10-05 17:16       ` Catalin Marinas
2015-09-17 12:50 ` [PATCH 05/10] arm64: tlbflush: remove redundant ASID casts to (unsigned long) Will Deacon
2015-09-17 12:50 ` [PATCH 06/10] arm64: tlbflush: avoid flushing when fullmm == 1 Will Deacon
2015-09-29  9:29   ` Catalin Marinas
2015-10-05 16:33     ` Will Deacon
2015-10-05 17:18       ` Catalin Marinas
2015-09-17 12:50 ` [PATCH 07/10] arm64: switch_mm: simplify mm and CPU checks Will Deacon
2015-09-17 12:50 ` [PATCH 08/10] arm64: mm: kill mm_cpumask usage Will Deacon
2015-09-17 12:50 ` [PATCH 09/10] arm64: tlb: remove redundant barrier from __flush_tlb_pgtable Will Deacon
2015-09-17 12:50 ` [PATCH 10/10] arm64: mm: remove dsb from update_mmu_cache Will Deacon
2015-09-29  9:55 ` [PATCH 00/10] arm64 switch_mm improvements Catalin Marinas
