* [RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on local CPU
@ 2025-08-29 15:35 Ryan Roberts
2025-08-29 15:35 ` [RFC PATCH v1 1/2] arm64: tlbflush: Move invocation of __flush_tlb_range_op() to a macro Ryan Roberts
` (3 more replies)
0 siblings, 4 replies; 11+ messages in thread
From: Ryan Roberts @ 2025-08-29 15:35 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Mark Rutland, James Morse
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel
Hi All,
This is an RFC for my implementation of an idea from James Morse to avoid
broadcasting TLBIs to remote CPUs if it can be proven that no remote CPU could
have ever observed the pgtable entry for the TLB entry that is being
invalidated. It turns out that x86 does something similar in principle.
The primary feedback I'm looking for is: is this actually correct and safe?
James and I both believe it to be, but it would be useful to get further
validation.
Beyond that, the next question is: does it actually improve performance?
stress-ng's --tlb-shootdown stressor suggests yes; as concurrency increases, we
do a much better job of sustaining the overall number of "tlb shootdowns per
second" after the change:
+------------+--------------------------+--------------------------+--------------------------+
|            |     Baseline (v6.15)     |        tlbi local        |       Improvement        |
+------------+-------------+------------+-------------+------------+-------------+------------+
| nr_threads |   ops/sec   |  ops/sec   |   ops/sec   |  ops/sec   |      %      |     %      |
|            | (real time) | (cpu time) | (real time) | (cpu time) | (real time) | (cpu time) |
+------------+-------------+------------+-------------+------------+-------------+------------+
|          1 |        9109 |       2573 |        8903 |       3653 |         -2% |        42% |
|          4 |        8115 |       1299 |        9892 |       1059 |         22% |       -18% |
|          8 |        5119 |        477 |       11854 |       1265 |        132% |       165% |
|         16 |        4796 |        286 |       14176 |        821 |        196% |       187% |
|         32 |        1593 |         38 |       15328 |        474 |        862% |      1147% |
|         64 |        1486 |         19 |        8096 |        131 |        445% |       589% |
|        128 |        1315 |         16 |        8257 |        145 |        528% |       806% |
+------------+-------------+------------+-------------+------------+-------------+------------+
But looking at real-world benchmarks, I haven't yet found anything where it
makes a huge difference. When compiling the kernel, it reduces kernel time by
~2.2%, but overall wall time remains the same. I'd be interested in any
suggestions for workloads where this might prove valuable.
All mm selftests have been run and no regressions are observed. Applies on
v6.17-rc3.
Thanks,
Ryan
Ryan Roberts (2):
arm64: tlbflush: Move invocation of __flush_tlb_range_op() to a macro
arm64: tlbflush: Don't broadcast if mm was only active on local cpu
arch/arm64/include/asm/mmu.h | 12 +++
arch/arm64/include/asm/mmu_context.h | 2 +
arch/arm64/include/asm/tlbflush.h | 116 ++++++++++++++++++++++++---
arch/arm64/mm/context.c | 30 ++++++-
4 files changed, 145 insertions(+), 15 deletions(-)
--
2.43.0
* [RFC PATCH v1 1/2] arm64: tlbflush: Move invocation of __flush_tlb_range_op() to a macro
2025-08-29 15:35 [RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on local CPU Ryan Roberts
@ 2025-08-29 15:35 ` Ryan Roberts
2025-09-02 16:25 ` Catalin Marinas
2025-08-29 15:35 ` [RFC PATCH v1 2/2] arm64: tlbflush: Don't broadcast if mm was only active on local cpu Ryan Roberts
` (2 subsequent siblings)
3 siblings, 1 reply; 11+ messages in thread
From: Ryan Roberts @ 2025-08-29 15:35 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Mark Rutland, James Morse
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel
__flush_tlb_range_op() is a pre-processor macro that takes the TLBI
operation as a string, and builds the instruction from it. This prevents
passing the TLBI operation around as a variable. __flush_tlb_range_op()
also takes 7 other arguments.
Adding extra invocations for different TLB operations means duplicating
the whole thing, but those 7 extra arguments are the same each time.
Add an enum for the TLBI operations that __flush_tlb_range() uses, and a
macro to pass the operation name as a string to __flush_tlb_range_op(),
and the rest of the arguments using __VA_ARGS__.
The result is easier to add new TLBI operations to, and to modify any of
the other arguments as they only appear once.
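(For background on why the operation cannot simply be passed around as a
variable: the op token is stringified straight into the instruction text at
preprocessing time, roughly as in the sketch below. The macro name here is
made up purely for illustration; the real plumbing is the __tlbi() machinery
in this header.)

	/*
	 * Hypothetical, simplified sketch: the op must be a compile-time
	 * token because it is pasted into the asm string.
	 */
	#define __tlbi_sketch(op, arg) \
		asm volatile("tlbi " #op ", %0" : : "r" (arg))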
Suggested-by: James Morse <james.morse@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/arm64/include/asm/tlbflush.h | 30 ++++++++++++++++++++++++------
1 file changed, 24 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 18a5dc0c9a54..f66b8c4696d0 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -11,6 +11,7 @@
#ifndef __ASSEMBLY__
#include <linux/bitfield.h>
+#include <linux/build_bug.h>
#include <linux/mm_types.h>
#include <linux/sched.h>
#include <linux/mmu_notifier.h>
@@ -433,12 +434,32 @@ static inline bool __flush_tlb_range_limit_excess(unsigned long start,
return false;
}
+enum tlbi_op {
+ TLBI_VALE1IS,
+ TLBI_VAE1IS,
+};
+
+#define flush_tlb_range_op(op, ...) \
+do { \
+ switch (op) { \
+ case TLBI_VALE1IS: \
+ __flush_tlb_range_op(vale1is, __VA_ARGS__); \
+ break; \
+ case TLBI_VAE1IS: \
+ __flush_tlb_range_op(vae1is, __VA_ARGS__); \
+ break; \
+ default: \
+ BUILD_BUG_ON_MSG(1, "Unknown TLBI op"); \
+ } \
+} while (0)
+
static inline void __flush_tlb_range_nosync(struct mm_struct *mm,
unsigned long start, unsigned long end,
unsigned long stride, bool last_level,
int tlb_level)
{
unsigned long asid, pages;
+ enum tlbi_op tlbi_op;
start = round_down(start, stride);
end = round_up(end, stride);
@@ -452,12 +473,9 @@ static inline void __flush_tlb_range_nosync(struct mm_struct *mm,
dsb(ishst);
asid = ASID(mm);
- if (last_level)
- __flush_tlb_range_op(vale1is, start, pages, stride, asid,
- tlb_level, true, lpa2_is_enabled());
- else
- __flush_tlb_range_op(vae1is, start, pages, stride, asid,
- tlb_level, true, lpa2_is_enabled());
+ tlbi_op = last_level ? TLBI_VALE1IS : TLBI_VAE1IS;
+ flush_tlb_range_op(tlbi_op, start, pages, stride, asid, tlb_level,
+ true, lpa2_is_enabled());
mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
}
--
2.43.0
* [RFC PATCH v1 2/2] arm64: tlbflush: Don't broadcast if mm was only active on local cpu
2025-08-29 15:35 [RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on local CPU Ryan Roberts
2025-08-29 15:35 ` [RFC PATCH v1 1/2] arm64: tlbflush: Move invocation of __flush_tlb_range_op() to a macro Ryan Roberts
@ 2025-08-29 15:35 ` Ryan Roberts
2025-09-01 9:08 ` Alexandru Elisei
2025-09-02 16:23 ` Catalin Marinas
2025-09-02 16:47 ` [RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on local CPU Catalin Marinas
2025-09-03 2:12 ` Huang, Ying
3 siblings, 2 replies; 11+ messages in thread
From: Ryan Roberts @ 2025-08-29 15:35 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Mark Rutland, James Morse
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel
There are 3 variants of tlb flush that invalidate user mappings:
flush_tlb_mm(), flush_tlb_page() and __flush_tlb_range(). All of these
would previously unconditionally broadcast their tlbis to all cpus in
the inner shareable domain.
But this is a waste of effort if we can prove that the mm for which we
are flushing the mappings has only ever been active on the local cpu. In
that case, it is safe to avoid the broadcast and simply invalidate the
current cpu.
So let's track in mm_context_t::active_cpu whether the mm has never been
active on any cpu, has been active on more than 1 cpu, or has been
active on precisely 1 cpu - and in that case, which one. We update this
when switching context, being careful to ensure that it gets updated
*before* installing the mm's pgtables. On the reader side, we ensure we
read *after* the previous write(s) to the pgtable(s) that necessitated
the tlb flush have completed. This guarantees that if a cpu that is
doing a tlb flush sees its own id in active_cpu, then the old pgtable
entry cannot have been seen by any other cpu and we can flush only the
local cpu.
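As a rough sketch of the policy (a plain C model only; the real code below
additionally needs the cmpxchg and barriers to be safe against concurrent
switches and flushes, and uses different sentinel values):

	/* Sketch only: simplified model of the active_cpu tracking. */
	enum { SKETCH_CPU_NONE = -1, SKETCH_CPU_MULTIPLE = -2 };

	/* Context switch on this_cpu: record that the mm ran here. */
	static void sketch_note_active(int *active_cpu, int this_cpu)
	{
		int active = *active_cpu;

		if (active == this_cpu || active == SKETCH_CPU_MULTIPLE)
			return;				/* nothing changes */
		if (active == SKETCH_CPU_NONE)
			*active_cpu = this_cpu;		/* first cpu to run the mm */
		else
			*active_cpu = SKETCH_CPU_MULTIPLE; /* second cpu seen */
	}

	/* TLB flush on this_cpu: true means a non-broadcast TLBI suffices. */
	static bool sketch_flush_is_local(int active_cpu, int this_cpu)
	{
		return active_cpu == this_cpu;
	}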
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/arm64/include/asm/mmu.h | 12 ++++
arch/arm64/include/asm/mmu_context.h | 2 +
arch/arm64/include/asm/tlbflush.h | 90 +++++++++++++++++++++++++---
arch/arm64/mm/context.c | 30 +++++++++-
4 files changed, 123 insertions(+), 11 deletions(-)
diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 6e8aa8e72601..ca32fb860309 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -17,6 +17,17 @@
#include <linux/refcount.h>
#include <asm/cpufeature.h>
+/*
+ * Sentinel values for mm_context_t::active_cpu. ACTIVE_CPU_NONE indicates the
+ * mm has never been active on any CPU. ACTIVE_CPU_MULTIPLE indicates the mm
+ * has been active on multiple CPUs. Any other value is the ID of the single
+ * CPU that the mm has been active on.
+ */
+enum active_cpu {
+ ACTIVE_CPU_NONE = UINT_MAX,
+ ACTIVE_CPU_MULTIPLE = UINT_MAX - 1,
+};
+
typedef struct {
atomic64_t id;
#ifdef CONFIG_COMPAT
@@ -26,6 +37,7 @@ typedef struct {
void *vdso;
unsigned long flags;
u8 pkey_allocation_map;
+ unsigned int active_cpu;
} mm_context_t;
/*
diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index 0dbe3b29049b..db2505a8eba8 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -180,6 +180,8 @@ init_new_context(struct task_struct *tsk, struct mm_struct *mm)
/* pkey 0 is the default, so always reserve it. */
mm->context.pkey_allocation_map = BIT(0);
+ WRITE_ONCE(mm->context.active_cpu, ACTIVE_CPU_NONE);
+
return 0;
}
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index f66b8c4696d0..651440e0aff9 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -182,6 +182,52 @@ static inline unsigned long get_trans_granule(void)
(__pages >> (5 * (scale) + 1)) - 1; \
})
+/*
+ * Determines whether the user tlbi invalidation can be performed only on the
+ * local CPU or whether it needs to be broadcast. (Returns true for local).
+ * Additionally issues appropriate barrier to ensure prior pgtable updates are
+ * visible to the table walker. Must be paired with flush_tlb_user_post().
+ */
+static inline bool flush_tlb_user_pre(struct mm_struct *mm)
+{
+ unsigned int self, active;
+ bool local;
+
+ migrate_disable();
+
+ self = smp_processor_id();
+
+ /*
+ * The load of mm->context.active_cpu must not be reordered before the
+ * store to the pgtable that necessitated this flush. This ensures that
+ * if the value read is our cpu id, then no other cpu can have seen the
+ * old pgtable value and therefore does not need this old value to be
+ * flushed from its tlb. But we don't want to upgrade the dsb(ishst),
+ * needed to make the pgtable updates visible to the walker, to a
+ * dsb(ish) by default. So speculatively load without a barrier and if
+ * it indicates our cpu id, then upgrade the barrier and re-load.
+ */
+ active = READ_ONCE(mm->context.active_cpu);
+ if (active == self) {
+ dsb(ish);
+ active = READ_ONCE(mm->context.active_cpu);
+ } else {
+ dsb(ishst);
+ }
+
+ local = active == self;
+ if (!local)
+ migrate_enable();
+
+ return local;
+}
+
+static inline void flush_tlb_user_post(bool local)
+{
+ if (local)
+ migrate_enable();
+}
+
/*
* TLB Invalidation
* ================
@@ -274,11 +320,18 @@ static inline void flush_tlb_all(void)
static inline void flush_tlb_mm(struct mm_struct *mm)
{
unsigned long asid;
+ bool local;
- dsb(ishst);
+ local = flush_tlb_user_pre(mm);
asid = __TLBI_VADDR(0, ASID(mm));
- __tlbi(aside1is, asid);
- __tlbi_user(aside1is, asid);
+ if (local) {
+ __tlbi(aside1, asid);
+ __tlbi_user(aside1, asid);
+ } else {
+ __tlbi(aside1is, asid);
+ __tlbi_user(aside1is, asid);
+ }
+ flush_tlb_user_post(local);
dsb(ish);
mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
}
@@ -287,11 +340,18 @@ static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
unsigned long uaddr)
{
unsigned long addr;
+ bool local;
- dsb(ishst);
+ local = flush_tlb_user_pre(mm);
addr = __TLBI_VADDR(uaddr, ASID(mm));
- __tlbi(vale1is, addr);
- __tlbi_user(vale1is, addr);
+ if (local) {
+ __tlbi(vale1, addr);
+ __tlbi_user(vale1, addr);
+ } else {
+ __tlbi(vale1is, addr);
+ __tlbi_user(vale1is, addr);
+ }
+ flush_tlb_user_post(local);
mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
(uaddr & PAGE_MASK) + PAGE_SIZE);
}
@@ -437,6 +497,8 @@ static inline bool __flush_tlb_range_limit_excess(unsigned long start,
enum tlbi_op {
TLBI_VALE1IS,
TLBI_VAE1IS,
+ TLBI_VALE1,
+ TLBI_VAE1,
};
#define flush_tlb_range_op(op, ...) \
@@ -448,6 +510,12 @@ do { \
case TLBI_VAE1IS: \
__flush_tlb_range_op(vae1is, __VA_ARGS__); \
break; \
+ case TLBI_VALE1: \
+ __flush_tlb_range_op(vale1, __VA_ARGS__); \
+ break; \
+ case TLBI_VAE1: \
+ __flush_tlb_range_op(vae1, __VA_ARGS__); \
+ break; \
default: \
BUILD_BUG_ON_MSG(1, "Unknown TLBI op"); \
} \
@@ -460,6 +528,7 @@ static inline void __flush_tlb_range_nosync(struct mm_struct *mm,
{
unsigned long asid, pages;
enum tlbi_op tlbi_op;
+ bool local;
start = round_down(start, stride);
end = round_up(end, stride);
@@ -470,13 +539,18 @@ static inline void __flush_tlb_range_nosync(struct mm_struct *mm,
return;
}
- dsb(ishst);
+ local = flush_tlb_user_pre(mm);
asid = ASID(mm);
- tlbi_op = last_level ? TLBI_VALE1IS : TLBI_VAE1IS;
+ if (local)
+ tlbi_op = last_level ? TLBI_VALE1 : TLBI_VAE1;
+ else
+ tlbi_op = last_level ? TLBI_VALE1IS : TLBI_VAE1IS;
+
flush_tlb_range_op(tlbi_op, start, pages, stride, asid, tlb_level,
true, lpa2_is_enabled());
+ flush_tlb_user_post(local);
mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
}
diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
index b2ac06246327..adf4fc26ddb6 100644
--- a/arch/arm64/mm/context.c
+++ b/arch/arm64/mm/context.c
@@ -214,9 +214,10 @@ static u64 new_context(struct mm_struct *mm)
void check_and_switch_context(struct mm_struct *mm)
{
- unsigned long flags;
- unsigned int cpu;
+ unsigned int cpu = smp_processor_id();
u64 asid, old_active_asid;
+ unsigned int active;
+ unsigned long flags;
if (system_supports_cnp())
cpu_set_reserved_ttbr0();
@@ -251,7 +252,6 @@ void check_and_switch_context(struct mm_struct *mm)
atomic64_set(&mm->context.id, asid);
}
- cpu = smp_processor_id();
if (cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending))
local_flush_tlb_all();
@@ -262,6 +262,30 @@ void check_and_switch_context(struct mm_struct *mm)
arm64_apply_bp_hardening();
+ /*
+ * Update mm->context.active_cpu in such a manner that we avoid cmpxchg
+ * and dsb unless we definitely need it. If initially ACTIVE_CPU_NONE
+ * then we are the first cpu to run so set it to our id. If initially
+ * any id other than ours, we are the second cpu to run so set it to
+ * ACTIVE_CPU_MULTIPLE. If we update the value then we must issue
+ * dsb(ishst) to ensure stores to mm->context.active_cpu are ordered
+ * against the TTBR0 write in cpu_switch_mm()/uaccess_enable(); the
+ * store must be visible to another cpu before this cpu could have
+ * populated any TLB entries based on the pgtables that will be
+ * installed.
+ */
+ active = READ_ONCE(mm->context.active_cpu);
+ if (active != cpu && active != ACTIVE_CPU_MULTIPLE) {
+ if (active == ACTIVE_CPU_NONE)
+ active = cmpxchg_relaxed(&mm->context.active_cpu,
+ ACTIVE_CPU_NONE, cpu);
+
+ if (active != ACTIVE_CPU_NONE)
+ WRITE_ONCE(mm->context.active_cpu, ACTIVE_CPU_MULTIPLE);
+
+ dsb(ishst);
+ }
+
/*
* Defer TTBR0_EL1 setting for user threads to uaccess_enable() when
* emulating PAN.
--
2.43.0
* Re: [RFC PATCH v1 2/2] arm64: tlbflush: Don't broadcast if mm was only active on local cpu
2025-08-29 15:35 ` [RFC PATCH v1 2/2] arm64: tlbflush: Don't broadcast if mm was only active on local cpu Ryan Roberts
@ 2025-09-01 9:08 ` Alexandru Elisei
2025-09-01 9:18 ` Ryan Roberts
2025-09-02 16:23 ` Catalin Marinas
1 sibling, 1 reply; 11+ messages in thread
From: Alexandru Elisei @ 2025-09-01 9:08 UTC (permalink / raw)
To: Ryan Roberts
Cc: Catalin Marinas, Will Deacon, Mark Rutland, James Morse,
linux-arm-kernel, linux-kernel
Hi Ryan,
On Fri, Aug 29, 2025 at 04:35:08PM +0100, Ryan Roberts wrote:
> There are 3 variants of tlb flush that invalidate user mappings:
> flush_tlb_mm(), flush_tlb_page() and __flush_tlb_range(). All of these
> would previously unconditionally broadcast their tlbis to all cpus in
> the inner shareable domain.
>
> But this is a waste of effort if we can prove that the mm for which we
> are flushing the mappings has only ever been active on the local cpu. In
> that case, it is safe to avoid the broadcast and simply invalidate the
> current cpu.
>
> So let's track in mm_context_t::active_cpu whether the mm has never been
> active on any cpu, has been active on more than 1 cpu, or has been
> active on precisely 1 cpu - and in that case, which one. We update this
> when switching context, being careful to ensure that it gets updated
> *before* installing the mm's pgtables. On the reader side, we ensure we
> read *after* the previous write(s) to the pgtable(s) that necessitated
> the tlb flush have completed. This guarantees that if a cpu that is
> doing a tlb flush sees its own id in active_cpu, then the old pgtable
> entry cannot have been seen by any other cpu and we can flush only the
> local cpu.
Just curious, how does this work under virtualization, when the VCPU can get
moved around on different physical CPUs?
Thanks,
Alex
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> arch/arm64/include/asm/mmu.h | 12 ++++
> arch/arm64/include/asm/mmu_context.h | 2 +
> arch/arm64/include/asm/tlbflush.h | 90 +++++++++++++++++++++++++---
> arch/arm64/mm/context.c | 30 +++++++++-
> 4 files changed, 123 insertions(+), 11 deletions(-)
>
> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> index 6e8aa8e72601..ca32fb860309 100644
> --- a/arch/arm64/include/asm/mmu.h
> +++ b/arch/arm64/include/asm/mmu.h
> @@ -17,6 +17,17 @@
> #include <linux/refcount.h>
> #include <asm/cpufeature.h>
>
> +/*
> + * Sentinel values for mm_context_t::active_cpu. ACTIVE_CPU_NONE indicates the
> + * mm has never been active on any CPU. ACTIVE_CPU_MULTIPLE indicates the mm
> + * has been active on multiple CPUs. Any other value is the ID of the single
> + * CPU that the mm has been active on.
> + */
> +enum active_cpu {
> + ACTIVE_CPU_NONE = UINT_MAX,
> + ACTIVE_CPU_MULTIPLE = UINT_MAX - 1,
> +};
> +
> typedef struct {
> atomic64_t id;
> #ifdef CONFIG_COMPAT
> @@ -26,6 +37,7 @@ typedef struct {
> void *vdso;
> unsigned long flags;
> u8 pkey_allocation_map;
> + unsigned int active_cpu;
> } mm_context_t;
>
> /*
> diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
> index 0dbe3b29049b..db2505a8eba8 100644
> --- a/arch/arm64/include/asm/mmu_context.h
> +++ b/arch/arm64/include/asm/mmu_context.h
> @@ -180,6 +180,8 @@ init_new_context(struct task_struct *tsk, struct mm_struct *mm)
> /* pkey 0 is the default, so always reserve it. */
> mm->context.pkey_allocation_map = BIT(0);
>
> + WRITE_ONCE(mm->context.active_cpu, ACTIVE_CPU_NONE);
> +
> return 0;
> }
>
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index f66b8c4696d0..651440e0aff9 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -182,6 +182,52 @@ static inline unsigned long get_trans_granule(void)
> (__pages >> (5 * (scale) + 1)) - 1; \
> })
>
> +/*
> + * Determines whether the user tlbi invalidation can be performed only on the
> + * local CPU or whether it needs to be broadcast. (Returns true for local).
> + * Additionally issues appropriate barrier to ensure prior pgtable updates are
> + * visible to the table walker. Must be paired with flush_tlb_user_post().
> + */
> +static inline bool flush_tlb_user_pre(struct mm_struct *mm)
> +{
> + unsigned int self, active;
> + bool local;
> +
> + migrate_disable();
> +
> + self = smp_processor_id();
> +
> + /*
> + * The load of mm->context.active_cpu must not be reordered before the
> + * store to the pgtable that necessitated this flush. This ensures that
> + * if the value read is our cpu id, then no other cpu can have seen the
> + * old pgtable value and therefore does not need this old value to be
> + * flushed from its tlb. But we don't want to upgrade the dsb(ishst),
> + * needed to make the pgtable updates visible to the walker, to a
> + * dsb(ish) by default. So speculatively load without a barrier and if
> + * it indicates our cpu id, then upgrade the barrier and re-load.
> + */
> + active = READ_ONCE(mm->context.active_cpu);
> + if (active == self) {
> + dsb(ish);
> + active = READ_ONCE(mm->context.active_cpu);
> + } else {
> + dsb(ishst);
> + }
> +
> + local = active == self;
> + if (!local)
> + migrate_enable();
> +
> + return local;
> +}
> +
> +static inline void flush_tlb_user_post(bool local)
> +{
> + if (local)
> + migrate_enable();
> +}
> +
> /*
> * TLB Invalidation
> * ================
> @@ -274,11 +320,18 @@ static inline void flush_tlb_all(void)
> static inline void flush_tlb_mm(struct mm_struct *mm)
> {
> unsigned long asid;
> + bool local;
>
> - dsb(ishst);
> + local = flush_tlb_user_pre(mm);
> asid = __TLBI_VADDR(0, ASID(mm));
> - __tlbi(aside1is, asid);
> - __tlbi_user(aside1is, asid);
> + if (local) {
> + __tlbi(aside1, asid);
> + __tlbi_user(aside1, asid);
> + } else {
> + __tlbi(aside1is, asid);
> + __tlbi_user(aside1is, asid);
> + }
> + flush_tlb_user_post(local);
> dsb(ish);
> mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
> }
> @@ -287,11 +340,18 @@ static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> unsigned long uaddr)
> {
> unsigned long addr;
> + bool local;
>
> - dsb(ishst);
> + local = flush_tlb_user_pre(mm);
> addr = __TLBI_VADDR(uaddr, ASID(mm));
> - __tlbi(vale1is, addr);
> - __tlbi_user(vale1is, addr);
> + if (local) {
> + __tlbi(vale1, addr);
> + __tlbi_user(vale1, addr);
> + } else {
> + __tlbi(vale1is, addr);
> + __tlbi_user(vale1is, addr);
> + }
> + flush_tlb_user_post(local);
> mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
> (uaddr & PAGE_MASK) + PAGE_SIZE);
> }
> @@ -437,6 +497,8 @@ static inline bool __flush_tlb_range_limit_excess(unsigned long start,
> enum tlbi_op {
> TLBI_VALE1IS,
> TLBI_VAE1IS,
> + TLBI_VALE1,
> + TLBI_VAE1,
> };
>
> #define flush_tlb_range_op(op, ...) \
> @@ -448,6 +510,12 @@ do { \
> case TLBI_VAE1IS: \
> __flush_tlb_range_op(vae1is, __VA_ARGS__); \
> break; \
> + case TLBI_VALE1: \
> + __flush_tlb_range_op(vale1, __VA_ARGS__); \
> + break; \
> + case TLBI_VAE1: \
> + __flush_tlb_range_op(vae1, __VA_ARGS__); \
> + break; \
> default: \
> BUILD_BUG_ON_MSG(1, "Unknown TLBI op"); \
> } \
> @@ -460,6 +528,7 @@ static inline void __flush_tlb_range_nosync(struct mm_struct *mm,
> {
> unsigned long asid, pages;
> enum tlbi_op tlbi_op;
> + bool local;
>
> start = round_down(start, stride);
> end = round_up(end, stride);
> @@ -470,13 +539,18 @@ static inline void __flush_tlb_range_nosync(struct mm_struct *mm,
> return;
> }
>
> - dsb(ishst);
> + local = flush_tlb_user_pre(mm);
> asid = ASID(mm);
>
> - tlbi_op = last_level ? TLBI_VALE1IS : TLBI_VAE1IS;
> + if (local)
> + tlbi_op = last_level ? TLBI_VALE1 : TLBI_VAE1;
> + else
> + tlbi_op = last_level ? TLBI_VALE1IS : TLBI_VAE1IS;
> +
> flush_tlb_range_op(tlbi_op, start, pages, stride, asid, tlb_level,
> true, lpa2_is_enabled());
>
> + flush_tlb_user_post(local);
> mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
> }
>
> diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
> index b2ac06246327..adf4fc26ddb6 100644
> --- a/arch/arm64/mm/context.c
> +++ b/arch/arm64/mm/context.c
> @@ -214,9 +214,10 @@ static u64 new_context(struct mm_struct *mm)
>
> void check_and_switch_context(struct mm_struct *mm)
> {
> - unsigned long flags;
> - unsigned int cpu;
> + unsigned int cpu = smp_processor_id();
> u64 asid, old_active_asid;
> + unsigned int active;
> + unsigned long flags;
>
> if (system_supports_cnp())
> cpu_set_reserved_ttbr0();
> @@ -251,7 +252,6 @@ void check_and_switch_context(struct mm_struct *mm)
> atomic64_set(&mm->context.id, asid);
> }
>
> - cpu = smp_processor_id();
> if (cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending))
> local_flush_tlb_all();
>
> @@ -262,6 +262,30 @@ void check_and_switch_context(struct mm_struct *mm)
>
> arm64_apply_bp_hardening();
>
> + /*
> + * Update mm->context.active_cpu in such a manner that we avoid cmpxchg
> + * and dsb unless we definitely need it. If initially ACTIVE_CPU_NONE
> + * then we are the first cpu to run so set it to our id. If initially
> + * any id other than ours, we are the second cpu to run so set it to
> + * ACTIVE_CPU_MULTIPLE. If we update the value then we must issue
> + * dsb(ishst) to ensure stores to mm->context.active_cpu are ordered
> + * against the TTBR0 write in cpu_switch_mm()/uaccess_enable(); the
> + * store must be visible to another cpu before this cpu could have
> + * populated any TLB entries based on the pgtables that will be
> + * installed.
> + */
> + active = READ_ONCE(mm->context.active_cpu);
> + if (active != cpu && active != ACTIVE_CPU_MULTIPLE) {
> + if (active == ACTIVE_CPU_NONE)
> + active = cmpxchg_relaxed(&mm->context.active_cpu,
> + ACTIVE_CPU_NONE, cpu);
> +
> + if (active != ACTIVE_CPU_NONE)
> + WRITE_ONCE(mm->context.active_cpu, ACTIVE_CPU_MULTIPLE);
> +
> + dsb(ishst);
> + }
> +
> /*
> * Defer TTBR0_EL1 setting for user threads to uaccess_enable() when
> * emulating PAN.
> --
> 2.43.0
>
>
* Re: [RFC PATCH v1 2/2] arm64: tlbflush: Don't broadcast if mm was only active on local cpu
2025-09-01 9:08 ` Alexandru Elisei
@ 2025-09-01 9:18 ` Ryan Roberts
0 siblings, 0 replies; 11+ messages in thread
From: Ryan Roberts @ 2025-09-01 9:18 UTC (permalink / raw)
To: Alexandru Elisei
Cc: Catalin Marinas, Will Deacon, Mark Rutland, James Morse,
linux-arm-kernel, linux-kernel
On 01/09/2025 10:08, Alexandru Elisei wrote:
> Hi Ryan,
>
> On Fri, Aug 29, 2025 at 04:35:08PM +0100, Ryan Roberts wrote:
>> There are 3 variants of tlb flush that invalidate user mappings:
>> flush_tlb_mm(), flush_tlb_page() and __flush_tlb_range(). All of these
>> would previously unconditionally broadcast their tlbis to all cpus in
>> the inner shareable domain.
>>
>> But this is a waste of effort if we can prove that the mm for which we
>> are flushing the mappings has only ever been active on the local cpu. In
>> that case, it is safe to avoid the broadcast and simply invalidate the
>> current cpu.
>>
>> So let's track in mm_context_t::active_cpu whether the mm has never been
>> active on any cpu, has been active on more than 1 cpu, or has been
>> active on precisely 1 cpu - and in that case, which one. We update this
>> when switching context, being careful to ensure that it gets updated
>> *before* installing the mm's pgtables. On the reader side, we ensure we
>> read *after* the previous write(s) to the pgtable(s) that necessitated
>> the tlb flush have completed. This guarantees that if a cpu that is
>> doing a tlb flush sees its own id in active_cpu, then the old pgtable
>> entry cannot have been seen by any other cpu and we can flush only the
>> local cpu.
>
> Just curious, how does this work under virtualization, when the VCPU can get
> moved around on different physical CPUs?
I believe the HW is configured to upgrade local TLBIs to broadcast for this
case? I'll have to check the details of that though...
>
> Thanks,
> Alex
>
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>> arch/arm64/include/asm/mmu.h | 12 ++++
>> arch/arm64/include/asm/mmu_context.h | 2 +
>> arch/arm64/include/asm/tlbflush.h | 90 +++++++++++++++++++++++++---
>> arch/arm64/mm/context.c | 30 +++++++++-
>> 4 files changed, 123 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
>> index 6e8aa8e72601..ca32fb860309 100644
>> --- a/arch/arm64/include/asm/mmu.h
>> +++ b/arch/arm64/include/asm/mmu.h
>> @@ -17,6 +17,17 @@
>> #include <linux/refcount.h>
>> #include <asm/cpufeature.h>
>>
>> +/*
>> + * Sentinel values for mm_context_t::active_cpu. ACTIVE_CPU_NONE indicates the
>> + * mm has never been active on any CPU. ACTIVE_CPU_MULTIPLE indicates the mm
>> + * has been active on multiple CPUs. Any other value is the ID of the single
>> + * CPU that the mm has been active on.
>> + */
>> +enum active_cpu {
>> + ACTIVE_CPU_NONE = UINT_MAX,
>> + ACTIVE_CPU_MULTIPLE = UINT_MAX - 1,
>> +};
>> +
>> typedef struct {
>> atomic64_t id;
>> #ifdef CONFIG_COMPAT
>> @@ -26,6 +37,7 @@ typedef struct {
>> void *vdso;
>> unsigned long flags;
>> u8 pkey_allocation_map;
>> + unsigned int active_cpu;
>> } mm_context_t;
>>
>> /*
>> diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
>> index 0dbe3b29049b..db2505a8eba8 100644
>> --- a/arch/arm64/include/asm/mmu_context.h
>> +++ b/arch/arm64/include/asm/mmu_context.h
>> @@ -180,6 +180,8 @@ init_new_context(struct task_struct *tsk, struct mm_struct *mm)
>> /* pkey 0 is the default, so always reserve it. */
>> mm->context.pkey_allocation_map = BIT(0);
>>
>> + WRITE_ONCE(mm->context.active_cpu, ACTIVE_CPU_NONE);
>> +
>> return 0;
>> }
>>
>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
>> index f66b8c4696d0..651440e0aff9 100644
>> --- a/arch/arm64/include/asm/tlbflush.h
>> +++ b/arch/arm64/include/asm/tlbflush.h
>> @@ -182,6 +182,52 @@ static inline unsigned long get_trans_granule(void)
>> (__pages >> (5 * (scale) + 1)) - 1; \
>> })
>>
>> +/*
>> + * Determines whether the user tlbi invalidation can be performed only on the
>> + * local CPU or whether it needs to be broadcast. (Returns true for local).
>> + * Additionally issues appropriate barrier to ensure prior pgtable updates are
>> + * visible to the table walker. Must be paired with flush_tlb_user_post().
>> + */
>> +static inline bool flush_tlb_user_pre(struct mm_struct *mm)
>> +{
>> + unsigned int self, active;
>> + bool local;
>> +
>> + migrate_disable();
>> +
>> + self = smp_processor_id();
>> +
>> + /*
>> + * The load of mm->context.active_cpu must not be reordered before the
>> + * store to the pgtable that necessitated this flush. This ensures that
>> + * if the value read is our cpu id, then no other cpu can have seen the
>> + * old pgtable value and therefore does not need this old value to be
>> + * flushed from its tlb. But we don't want to upgrade the dsb(ishst),
>> + * needed to make the pgtable updates visible to the walker, to a
>> + * dsb(ish) by default. So speculatively load without a barrier and if
>> + * it indicates our cpu id, then upgrade the barrier and re-load.
>> + */
>> + active = READ_ONCE(mm->context.active_cpu);
>> + if (active == self) {
>> + dsb(ish);
>> + active = READ_ONCE(mm->context.active_cpu);
>> + } else {
>> + dsb(ishst);
>> + }
>> +
>> + local = active == self;
>> + if (!local)
>> + migrate_enable();
>> +
>> + return local;
>> +}
>> +
>> +static inline void flush_tlb_user_post(bool local)
>> +{
>> + if (local)
>> + migrate_enable();
>> +}
>> +
>> /*
>> * TLB Invalidation
>> * ================
>> @@ -274,11 +320,18 @@ static inline void flush_tlb_all(void)
>> static inline void flush_tlb_mm(struct mm_struct *mm)
>> {
>> unsigned long asid;
>> + bool local;
>>
>> - dsb(ishst);
>> + local = flush_tlb_user_pre(mm);
>> asid = __TLBI_VADDR(0, ASID(mm));
>> - __tlbi(aside1is, asid);
>> - __tlbi_user(aside1is, asid);
>> + if (local) {
>> + __tlbi(aside1, asid);
>> + __tlbi_user(aside1, asid);
>> + } else {
>> + __tlbi(aside1is, asid);
>> + __tlbi_user(aside1is, asid);
>> + }
>> + flush_tlb_user_post(local);
>> dsb(ish);
>> mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
>> }
>> @@ -287,11 +340,18 @@ static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>> unsigned long uaddr)
>> {
>> unsigned long addr;
>> + bool local;
>>
>> - dsb(ishst);
>> + local = flush_tlb_user_pre(mm);
>> addr = __TLBI_VADDR(uaddr, ASID(mm));
>> - __tlbi(vale1is, addr);
>> - __tlbi_user(vale1is, addr);
>> + if (local) {
>> + __tlbi(vale1, addr);
>> + __tlbi_user(vale1, addr);
>> + } else {
>> + __tlbi(vale1is, addr);
>> + __tlbi_user(vale1is, addr);
>> + }
>> + flush_tlb_user_post(local);
>> mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
>> (uaddr & PAGE_MASK) + PAGE_SIZE);
>> }
>> @@ -437,6 +497,8 @@ static inline bool __flush_tlb_range_limit_excess(unsigned long start,
>> enum tlbi_op {
>> TLBI_VALE1IS,
>> TLBI_VAE1IS,
>> + TLBI_VALE1,
>> + TLBI_VAE1,
>> };
>>
>> #define flush_tlb_range_op(op, ...) \
>> @@ -448,6 +510,12 @@ do { \
>> case TLBI_VAE1IS: \
>> __flush_tlb_range_op(vae1is, __VA_ARGS__); \
>> break; \
>> + case TLBI_VALE1: \
>> + __flush_tlb_range_op(vale1, __VA_ARGS__); \
>> + break; \
>> + case TLBI_VAE1: \
>> + __flush_tlb_range_op(vae1, __VA_ARGS__); \
>> + break; \
>> default: \
>> BUILD_BUG_ON_MSG(1, "Unknown TLBI op"); \
>> } \
>> @@ -460,6 +528,7 @@ static inline void __flush_tlb_range_nosync(struct mm_struct *mm,
>> {
>> unsigned long asid, pages;
>> enum tlbi_op tlbi_op;
>> + bool local;
>>
>> start = round_down(start, stride);
>> end = round_up(end, stride);
>> @@ -470,13 +539,18 @@ static inline void __flush_tlb_range_nosync(struct mm_struct *mm,
>> return;
>> }
>>
>> - dsb(ishst);
>> + local = flush_tlb_user_pre(mm);
>> asid = ASID(mm);
>>
>> - tlbi_op = last_level ? TLBI_VALE1IS : TLBI_VAE1IS;
>> + if (local)
>> + tlbi_op = last_level ? TLBI_VALE1 : TLBI_VAE1;
>> + else
>> + tlbi_op = last_level ? TLBI_VALE1IS : TLBI_VAE1IS;
>> +
>> flush_tlb_range_op(tlbi_op, start, pages, stride, asid, tlb_level,
>> true, lpa2_is_enabled());
>>
>> + flush_tlb_user_post(local);
>> mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
>> }
>>
>> diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
>> index b2ac06246327..adf4fc26ddb6 100644
>> --- a/arch/arm64/mm/context.c
>> +++ b/arch/arm64/mm/context.c
>> @@ -214,9 +214,10 @@ static u64 new_context(struct mm_struct *mm)
>>
>> void check_and_switch_context(struct mm_struct *mm)
>> {
>> - unsigned long flags;
>> - unsigned int cpu;
>> + unsigned int cpu = smp_processor_id();
>> u64 asid, old_active_asid;
>> + unsigned int active;
>> + unsigned long flags;
>>
>> if (system_supports_cnp())
>> cpu_set_reserved_ttbr0();
>> @@ -251,7 +252,6 @@ void check_and_switch_context(struct mm_struct *mm)
>> atomic64_set(&mm->context.id, asid);
>> }
>>
>> - cpu = smp_processor_id();
>> if (cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending))
>> local_flush_tlb_all();
>>
>> @@ -262,6 +262,30 @@ void check_and_switch_context(struct mm_struct *mm)
>>
>> arm64_apply_bp_hardening();
>>
>> + /*
>> + * Update mm->context.active_cpu in such a manner that we avoid cmpxchg
>> + * and dsb unless we definitely need it. If initially ACTIVE_CPU_NONE
>> + * then we are the first cpu to run so set it to our id. If initially
>> + * any id other than ours, we are the second cpu to run so set it to
>> + * ACTIVE_CPU_MULTIPLE. If we update the value then we must issue
>> + * dsb(ishst) to ensure stores to mm->context.active_cpu are ordered
>> + * against the TTBR0 write in cpu_switch_mm()/uaccess_enable(); the
>> + * store must be visible to another cpu before this cpu could have
>> + * populated any TLB entries based on the pgtables that will be
>> + * installed.
>> + */
>> + active = READ_ONCE(mm->context.active_cpu);
>> + if (active != cpu && active != ACTIVE_CPU_MULTIPLE) {
>> + if (active == ACTIVE_CPU_NONE)
>> + active = cmpxchg_relaxed(&mm->context.active_cpu,
>> + ACTIVE_CPU_NONE, cpu);
>> +
>> + if (active != ACTIVE_CPU_NONE)
>> + WRITE_ONCE(mm->context.active_cpu, ACTIVE_CPU_MULTIPLE);
>> +
>> + dsb(ishst);
>> + }
>> +
>> /*
>> * Defer TTBR0_EL1 setting for user threads to uaccess_enable() when
>> * emulating PAN.
>> --
>> 2.43.0
>>
>>
* Re: [RFC PATCH v1 2/2] arm64: tlbflush: Don't broadcast if mm was only active on local cpu
2025-08-29 15:35 ` [RFC PATCH v1 2/2] arm64: tlbflush: Don't broadcast if mm was only active on local cpu Ryan Roberts
2025-09-01 9:08 ` Alexandru Elisei
@ 2025-09-02 16:23 ` Catalin Marinas
2025-09-02 16:54 ` Ryan Roberts
1 sibling, 1 reply; 11+ messages in thread
From: Catalin Marinas @ 2025-09-02 16:23 UTC (permalink / raw)
To: Ryan Roberts
Cc: Will Deacon, Mark Rutland, James Morse, linux-arm-kernel,
linux-kernel
On Fri, Aug 29, 2025 at 04:35:08PM +0100, Ryan Roberts wrote:
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index f66b8c4696d0..651440e0aff9 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -182,6 +182,52 @@ static inline unsigned long get_trans_granule(void)
> (__pages >> (5 * (scale) + 1)) - 1; \
> })
>
> +/*
> + * Determines whether the user tlbi invalidation can be performed only on the
> + * local CPU or whether it needs to be broadcast. (Returns true for local).
> + * Additionally issues appropriate barrier to ensure prior pgtable updates are
> + * visible to the table walker. Must be paired with flush_tlb_user_post().
> + */
> +static inline bool flush_tlb_user_pre(struct mm_struct *mm)
> +{
> + unsigned int self, active;
> + bool local;
> +
> + migrate_disable();
> +
> + self = smp_processor_id();
> +
> + /*
> + * The load of mm->context.active_cpu must not be reordered before the
> + * store to the pgtable that necessitated this flush. This ensures that
> + * if the value read is our cpu id, then no other cpu can have seen the
> + * old pgtable value and therefore does not need this old value to be
> + * flushed from its tlb. But we don't want to upgrade the dsb(ishst),
> + * needed to make the pgtable updates visible to the walker, to a
> + * dsb(ish) by default. So speculatively load without a barrier and if
> + * it indicates our cpu id, then upgrade the barrier and re-load.
> + */
> + active = READ_ONCE(mm->context.active_cpu);
> + if (active == self) {
> + dsb(ish);
> + active = READ_ONCE(mm->context.active_cpu);
> + } else {
> + dsb(ishst);
> + }
Does the ISH vs ISHST make much difference in practice? I wonder whether
we could keep this simple.
> diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
> index b2ac06246327..adf4fc26ddb6 100644
> --- a/arch/arm64/mm/context.c
> +++ b/arch/arm64/mm/context.c
> @@ -214,9 +214,10 @@ static u64 new_context(struct mm_struct *mm)
>
> void check_and_switch_context(struct mm_struct *mm)
> {
> - unsigned long flags;
> - unsigned int cpu;
> + unsigned int cpu = smp_processor_id();
> u64 asid, old_active_asid;
> + unsigned int active;
> + unsigned long flags;
>
> if (system_supports_cnp())
> cpu_set_reserved_ttbr0();
> @@ -251,7 +252,6 @@ void check_and_switch_context(struct mm_struct *mm)
> atomic64_set(&mm->context.id, asid);
> }
>
> - cpu = smp_processor_id();
> if (cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending))
> local_flush_tlb_all();
>
> @@ -262,6 +262,30 @@ void check_and_switch_context(struct mm_struct *mm)
>
> arm64_apply_bp_hardening();
>
> + /*
> + * Update mm->context.active_cpu in such a manner that we avoid cmpxchg
> + * and dsb unless we definitely need it. If initially ACTIVE_CPU_NONE
> + * then we are the first cpu to run so set it to our id. If initially
> + * any id other than ours, we are the second cpu to run so set it to
> + * ACTIVE_CPU_MULTIPLE. If we update the value then we must issue
> + * dsb(ishst) to ensure stores to mm->context.active_cpu are ordered
> + * against the TTBR0 write in cpu_switch_mm()/uaccess_enable(); the
> + * store must be visible to another cpu before this cpu could have
> + * populated any TLB entries based on the pgtables that will be
> + * installed.
> + */
> + active = READ_ONCE(mm->context.active_cpu);
> + if (active != cpu && active != ACTIVE_CPU_MULTIPLE) {
> + if (active == ACTIVE_CPU_NONE)
> + active = cmpxchg_relaxed(&mm->context.active_cpu,
> + ACTIVE_CPU_NONE, cpu);
> +
> + if (active != ACTIVE_CPU_NONE)
> + WRITE_ONCE(mm->context.active_cpu, ACTIVE_CPU_MULTIPLE);
> +
> + dsb(ishst);
> + }
I think this works. I recall James had a litmus test for the model
checker confirming this.
In a cut-down version, we'd have:
P0:                        P1:

set_pte();                 WRITE_ONCE(active_cpu);
dsb();                     dsb();
READ_ONCE(active_cpu);     READ_ONCE(pte);
The pte read on P1 is implied following the TTBR0 write. As you
described, if P0 did not see the active_cpu update, P1 would have seen
the updated pte.
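In LKMM litmus form this is the classic store-buffering shape, with smp_mb()
standing in for the DSBs (so the ISH vs ISHST question above is out of scope).
A sketch of such an encoding for herd7:

	C sb_pte_active_cpu

	{}

	P0(int *pte, int *active_cpu)
	{
		int r0;

		WRITE_ONCE(*pte, 1);
		smp_mb();
		r0 = READ_ONCE(*active_cpu);
	}

	P1(int *pte, int *active_cpu)
	{
		int r1;

		WRITE_ONCE(*active_cpu, 1);
		smp_mb();
		r1 = READ_ONCE(*pte);
	}

	exists (0:r0=0 /\ 1:r1=0)

where the "exists" outcome (each CPU missing the other's store) is forbidden.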
So far I couldn't fail this, so:
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
* Re: [RFC PATCH v1 1/2] arm64: tlbflush: Move invocation of __flush_tlb_range_op() to a macro
2025-08-29 15:35 ` [RFC PATCH v1 1/2] arm64: tlbflush: Move invocation of __flush_tlb_range_op() to a macro Ryan Roberts
@ 2025-09-02 16:25 ` Catalin Marinas
0 siblings, 0 replies; 11+ messages in thread
From: Catalin Marinas @ 2025-09-02 16:25 UTC (permalink / raw)
To: Ryan Roberts
Cc: Will Deacon, Mark Rutland, James Morse, linux-arm-kernel,
linux-kernel
On Fri, Aug 29, 2025 at 04:35:07PM +0100, Ryan Roberts wrote:
> __flush_tlb_range_op() is a pre-processor macro that takes the TLBI
> operation as a string, and builds the instruction from it. This prevents
> passing the TLBI operation around as a variable. __flush_tlb_range_op()
> also takes 7 other arguments.
>
> Adding extra invocations for different TLB operations means duplicating
> the whole thing, but those 7 extra arguments are the same each time.
>
> Add an enum for the TLBI operations that __flush_tlb_range() uses, and a
> macro to pass the operation name as a string to __flush_tlb_range_op(),
> and the rest of the arguments using __VA_ARGS__.
>
> The result is easier to add new TLBI operations to, and to modify any of
> the other arguments as they only appear once.
>
> Suggested-by: James Morse <james.morse@arm.com>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
* Re: [RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on local CPU
2025-08-29 15:35 [RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on local CPU Ryan Roberts
2025-08-29 15:35 ` [RFC PATCH v1 1/2] arm64: tlbflush: Move invocation of __flush_tlb_range_op() to a macro Ryan Roberts
2025-08-29 15:35 ` [RFC PATCH v1 2/2] arm64: tlbflush: Don't broadcast if mm was only active on local cpu Ryan Roberts
@ 2025-09-02 16:47 ` Catalin Marinas
2025-09-02 16:56 ` Ryan Roberts
2025-09-03 2:12 ` Huang, Ying
3 siblings, 1 reply; 11+ messages in thread
From: Catalin Marinas @ 2025-09-02 16:47 UTC (permalink / raw)
To: Ryan Roberts
Cc: Will Deacon, Mark Rutland, James Morse, linux-arm-kernel,
linux-kernel
On Fri, Aug 29, 2025 at 04:35:06PM +0100, Ryan Roberts wrote:
> Beyond that, the next question is; does it actually improve performance?
> stress-ng's --tlb-shootdown stressor suggests yes; as concurrency increases, we
> do a much better job of sustaining the overall number of "tlb shootdowns per
> second" after the change:
>
> +------------+--------------------------+--------------------------+--------------------------+
> | | Baseline (v6.15) | tlbi local | Improvement |
> +------------+-------------+------------+-------------+------------+-------------+------------+
> | nr_threads | ops/sec | ops/sec | ops/sec | ops/sec | ops/sec | ops/sec |
> | | (real time) | (cpu time) | (real time) | (cpu time) | (real time) | (cpu time) |
> +------------+-------------+------------+-------------+------------+-------------+------------+
> | 1 | 9109 | 2573 | 8903 | 3653 | -2% | 42% |
> | 4 | 8115 | 1299 | 9892 | 1059 | 22% | -18% |
> | 8 | 5119 | 477 | 11854 | 1265 | 132% | 165% |
> | 16 | 4796 | 286 | 14176 | 821 | 196% | 187% |
> | 32 | 1593 | 38 | 15328 | 474 | 862% | 1147% |
> | 64 | 1486 | 19 | 8096 | 131 | 445% | 589% |
> | 128 | 1315 | 16 | 8257 | 145 | 528% | 806% |
> +------------+-------------+------------+-------------+------------+-------------+------------+
>
> But looking at real-world benchmarks, I haven't yet found anything where it
> makes a huge difference; When compiling the kernel, it reduces kernel time by
> ~2.2%, but overall wall time remains the same. I'd be interested in any
> suggestions for workloads where this might prove valuable.
I suspect it's highly dependent on hardware and how it handles the DVM
messages. There were some old proposals from Fujitsu:
https://lore.kernel.org/linux-arm-kernel/20190617143255.10462-1-indou.takao@jp.fujitsu.com/
Christoph Lameter (Ampere) also followed with some refactoring in this
area to allow a boot-configurable way to do TLBI via IS ops or IPI:
https://lore.kernel.org/linux-arm-kernel/20231207035703.158053467@gentwo.org/
(for some reason, the patches did not make it to the list, I have them
in my inbox if you are interested)
I don't remember any real-world workload, more like hand-crafted
mprotect() loops.
Anyway, I think the approach in your series doesn't have downsides, it's
fairly clean and addresses some low-hanging fruits. For multi-threaded
workloads where a flush_tlb_mm() is cheaper than a series of per-page
TLBIs, I think we can wait for that hardware to be phased out. The TLBI
range operations should significantly reduce the DVM messages between
CPUs.
--
Catalin
* Re: [RFC PATCH v1 2/2] arm64: tlbflush: Don't broadcast if mm was only active on local cpu
2025-09-02 16:23 ` Catalin Marinas
@ 2025-09-02 16:54 ` Ryan Roberts
0 siblings, 0 replies; 11+ messages in thread
From: Ryan Roberts @ 2025-09-02 16:54 UTC (permalink / raw)
To: Catalin Marinas
Cc: Will Deacon, Mark Rutland, James Morse, linux-arm-kernel,
linux-kernel
On 02/09/2025 17:23, Catalin Marinas wrote:
> On Fri, Aug 29, 2025 at 04:35:08PM +0100, Ryan Roberts wrote:
>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
>> index f66b8c4696d0..651440e0aff9 100644
>> --- a/arch/arm64/include/asm/tlbflush.h
>> +++ b/arch/arm64/include/asm/tlbflush.h
>> @@ -182,6 +182,52 @@ static inline unsigned long get_trans_granule(void)
>> (__pages >> (5 * (scale) + 1)) - 1; \
>> })
>>
>> +/*
>> + * Determines whether the user tlbi invalidation can be performed only on the
>> + * local CPU or whether it needs to be broadcast. (Returns true for local).
>> + * Additionally issues appropriate barrier to ensure prior pgtable updates are
>> + * visible to the table walker. Must be paired with flush_tlb_user_post().
>> + */
>> +static inline bool flush_tlb_user_pre(struct mm_struct *mm)
>> +{
>> + unsigned int self, active;
>> + bool local;
>> +
>> + migrate_disable();
>> +
>> + self = smp_processor_id();
>> +
>> + /*
>> + * The load of mm->context.active_cpu must not be reordered before the
>> + * store to the pgtable that necessitated this flush. This ensures that
>> + * if the value read is our cpu id, then no other cpu can have seen the
>> + * old pgtable value and therefore does not need this old value to be
>> + * flushed from its tlb. But we don't want to upgrade the dsb(ishst),
>> + * needed to make the pgtable updates visible to the walker, to a
>> + * dsb(ish) by default. So speculatively load without a barrier and if
>> + * it indicates our cpu id, then upgrade the barrier and re-load.
>> + */
>> + active = READ_ONCE(mm->context.active_cpu);
>> + if (active == self) {
>> + dsb(ish);
>> + active = READ_ONCE(mm->context.active_cpu);
>> + } else {
>> + dsb(ishst);
>> + }
>
> Does the ISH vs ISHST make much difference in practice? I wonder whether
> we could keep this simple.
I don't know. I was being conservative - I'm a bit nervous about upgrading a
barrier unconditionally. I'll run some benchmarks with the simpler version and
compare.
>
>> diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
>> index b2ac06246327..adf4fc26ddb6 100644
>> --- a/arch/arm64/mm/context.c
>> +++ b/arch/arm64/mm/context.c
>> @@ -214,9 +214,10 @@ static u64 new_context(struct mm_struct *mm)
>>
>> void check_and_switch_context(struct mm_struct *mm)
>> {
>> - unsigned long flags;
>> - unsigned int cpu;
>> + unsigned int cpu = smp_processor_id();
>> u64 asid, old_active_asid;
>> + unsigned int active;
>> + unsigned long flags;
>>
>> if (system_supports_cnp())
>> cpu_set_reserved_ttbr0();
>> @@ -251,7 +252,6 @@ void check_and_switch_context(struct mm_struct *mm)
>> atomic64_set(&mm->context.id, asid);
>> }
>>
>> - cpu = smp_processor_id();
>> if (cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending))
>> local_flush_tlb_all();
>>
>> @@ -262,6 +262,30 @@ void check_and_switch_context(struct mm_struct *mm)
>>
>> arm64_apply_bp_hardening();
>>
>> + /*
>> + * Update mm->context.active_cpu in such a manner that we avoid cmpxchg
>> + * and dsb unless we definitely need it. If initially ACTIVE_CPU_NONE
>> + * then we are the first cpu to run so set it to our id. If initially
>> + * any id other than ours, we are the second cpu to run so set it to
>> + * ACTIVE_CPU_MULTIPLE. If we update the value then we must issue
>> + * dsb(ishst) to ensure stores to mm->context.active_cpu are ordered
>> + * against the TTBR0 write in cpu_switch_mm()/uaccess_enable(); the
>> + * store must be visible to another cpu before this cpu could have
>> + * populated any TLB entries based on the pgtables that will be
>> + * installed.
>> + */
>> + active = READ_ONCE(mm->context.active_cpu);
>> + if (active != cpu && active != ACTIVE_CPU_MULTIPLE) {
>> + if (active == ACTIVE_CPU_NONE)
>> + active = cmpxchg_relaxed(&mm->context.active_cpu,
>> + ACTIVE_CPU_NONE, cpu);
>> +
>> + if (active != ACTIVE_CPU_NONE)
>> + WRITE_ONCE(mm->context.active_cpu, ACTIVE_CPU_MULTIPLE);
>> +
>> + dsb(ishst);
>> + }
>
> I think this works. I recall James had a litmus test for the model
> checker confirming this.
>
> In a cut-down version, we'd have:
>
> P0:                        P1:
>
> set_pte();                 WRITE_ONCE(active_cpu);
> dsb();                     dsb();
> READ_ONCE(active_cpu);     READ_ONCE(pte);
>
> The pte read on P1 is implied following the TTBR0 write. As you
> described, if P0 did not see the active_cpu update, P1 would have seen
> the updated pte.
>
> So far I couldn't fail this, so:
>
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Hmm... Rb at v1.. are you feeling ok? :)
* Re: [RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on local CPU
2025-09-02 16:47 ` [RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on local CPU Catalin Marinas
@ 2025-09-02 16:56 ` Ryan Roberts
0 siblings, 0 replies; 11+ messages in thread
From: Ryan Roberts @ 2025-09-02 16:56 UTC (permalink / raw)
To: Catalin Marinas
Cc: Will Deacon, Mark Rutland, James Morse, linux-arm-kernel,
linux-kernel
On 02/09/2025 17:47, Catalin Marinas wrote:
> On Fri, Aug 29, 2025 at 04:35:06PM +0100, Ryan Roberts wrote:
>> Beyond that, the next question is; does it actually improve performance?
>> stress-ng's --tlb-shootdown stressor suggests yes; as concurrency increases, we
>> do a much better job of sustaining the overall number of "tlb shootdowns per
>> second" after the change:
>>
>> +------------+--------------------------+--------------------------+--------------------------+
>> | | Baseline (v6.15) | tlbi local | Improvement |
>> +------------+-------------+------------+-------------+------------+-------------+------------+
>> | nr_threads | ops/sec | ops/sec | ops/sec | ops/sec | ops/sec | ops/sec |
>> | | (real time) | (cpu time) | (real time) | (cpu time) | (real time) | (cpu time) |
>> +------------+-------------+------------+-------------+------------+-------------+------------+
>> | 1 | 9109 | 2573 | 8903 | 3653 | -2% | 42% |
>> | 4 | 8115 | 1299 | 9892 | 1059 | 22% | -18% |
>> | 8 | 5119 | 477 | 11854 | 1265 | 132% | 165% |
>> | 16 | 4796 | 286 | 14176 | 821 | 196% | 187% |
>> | 32 | 1593 | 38 | 15328 | 474 | 862% | 1147% |
>> | 64 | 1486 | 19 | 8096 | 131 | 445% | 589% |
>> | 128 | 1315 | 16 | 8257 | 145 | 528% | 806% |
>> +------------+-------------+------------+-------------+------------+-------------+------------+
>>
>> But looking at real-world benchmarks, I haven't yet found anything where it
>> makes a huge difference; When compiling the kernel, it reduces kernel time by
>> ~2.2%, but overall wall time remains the same. I'd be interested in any
>> suggestions for workloads where this might prove valuable.
>
> I suspect it's highly dependent on hardware and how it handles the DVM
> messages. There were some old proposals from Fujitsu:
>
> https://lore.kernel.org/linux-arm-kernel/20190617143255.10462-1-indou.takao@jp.fujitsu.com/
>
> Christoph Lameter (Ampere) also followed with some refactoring in this
> area to allow a boot-configurable way to do TLBI via IS ops or IPI:
>
> https://lore.kernel.org/linux-arm-kernel/20231207035703.158053467@gentwo.org/
>
> (for some reason, the patches did not make it to the list, I have them
> in my inbox if you are interested)
>
> I don't remember any real-world workload, more like hand-crafted
> mprotect() loops.
>
> Anyway, I think the approach in your series doesn't have downsides, it's
> fairly clean and addresses some low-hanging fruits. For multi-threaded
> workloads where a flush_tlb_mm() is cheaper than a series of per-page
> TLBIs, I think we can wait for that hardware to be phased out. The TLBI
> range operations should significantly reduce the DVM messages between
> CPUs.
I'll gather some more numbers and try to make a case for merging it then. I
don't really want to add complexity if there is no clear value.
Thanks for the review.
Thanks,
Ryan
* Re: [RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on local CPU
2025-08-29 15:35 [RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on local CPU Ryan Roberts
` (2 preceding siblings ...)
2025-09-02 16:47 ` [RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on local CPU Catalin Marinas
@ 2025-09-03 2:12 ` Huang, Ying
3 siblings, 0 replies; 11+ messages in thread
From: Huang, Ying @ 2025-09-03 2:12 UTC (permalink / raw)
To: Ryan Roberts
Cc: Catalin Marinas, Will Deacon, Mark Rutland, James Morse,
linux-arm-kernel, linux-kernel, Takao Indoh, QI Fuli,
Andrea Arcangeli, Rafael Aquini
Hi, Ryan,
Ryan Roberts <ryan.roberts@arm.com> writes:
> Hi All,
>
> This is an RFC for my implementation of an idea from James Morse to avoid
> broadcasting TLBIs to remote CPUs if it can be proven that no remote CPU could
> have ever observed the pgtable entry for the TLB entry that is being
> invalidated. It turns out that x86 does something similar in principle.
>
> The primary feedback I'm looking for is; is this actually correct and safe?
> James and I both believe it to be, but it would be useful to get further
> validation.
>
> Beyond that, the next question is; does it actually improve performance?
> stress-ng's --tlb-shootdown stressor suggests yes; as concurrency increases, we
> do a much better job of sustaining the overall number of "tlb shootdowns per
> second" after the change:
>
> +------------+--------------------------+--------------------------+--------------------------+
> | | Baseline (v6.15) | tlbi local | Improvement |
> +------------+-------------+------------+-------------+------------+-------------+------------+
> | nr_threads | ops/sec | ops/sec | ops/sec | ops/sec | ops/sec | ops/sec |
> | | (real time) | (cpu time) | (real time) | (cpu time) | (real time) | (cpu time) |
> +------------+-------------+------------+-------------+------------+-------------+------------+
> | 1 | 9109 | 2573 | 8903 | 3653 | -2% | 42% |
> | 4 | 8115 | 1299 | 9892 | 1059 | 22% | -18% |
> | 8 | 5119 | 477 | 11854 | 1265 | 132% | 165% |
> | 16 | 4796 | 286 | 14176 | 821 | 196% | 187% |
> | 32 | 1593 | 38 | 15328 | 474 | 862% | 1147% |
> | 64 | 1486 | 19 | 8096 | 131 | 445% | 589% |
> | 128 | 1315 | 16 | 8257 | 145 | 528% | 806% |
> +------------+-------------+------------+-------------+------------+-------------+------------+
>
> But looking at real-world benchmarks, I haven't yet found anything where it
> makes a huge difference; When compiling the kernel, it reduces kernel time by
> ~2.2%, but overall wall time remains the same. I'd be interested in any
> suggestions for workloads where this might prove valuable.
>
> All mm selftests have been run and no regressions are observed. Applies on
> v6.17-rc3.
Thanks for working on this.
Several previous TLBI broadcast optimizations have been tried before;
I've Cc'ed the original authors for discussion. Some workloads showed good
improvement,
https://lore.kernel.org/lkml/20190617143255.10462-1-indou.takao@jp.fujitsu.com/
https://lore.kernel.org/all/20200203201745.29986-1-aarcange@redhat.com/
Especially in the following mail,
https://lore.kernel.org/all/20200314031609.GB2250@redhat.com/
---
Best Regards,
Huang, Ying
Thread overview: 11+ messages
2025-08-29 15:35 [RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on local CPU Ryan Roberts
2025-08-29 15:35 ` [RFC PATCH v1 1/2] arm64: tlbflush: Move invocation of __flush_tlb_range_op() to a macro Ryan Roberts
2025-09-02 16:25 ` Catalin Marinas
2025-08-29 15:35 ` [RFC PATCH v1 2/2] arm64: tlbflush: Don't broadcast if mm was only active on local cpu Ryan Roberts
2025-09-01 9:08 ` Alexandru Elisei
2025-09-01 9:18 ` Ryan Roberts
2025-09-02 16:23 ` Catalin Marinas
2025-09-02 16:54 ` Ryan Roberts
2025-09-02 16:47 ` [RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on local CPU Catalin Marinas
2025-09-02 16:56 ` Ryan Roberts
2025-09-03 2:12 ` Huang, Ying