[PATCH v2] arm64: tlbflush: Don't broadcast if mm was only active on local cpu

Linux-ARM-Kernel Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2] arm64: tlbflush: Don't broadcast if mm was only active on local cpu
@ 2026-05-23 13:47 Linu Cherian
  2026-06-14 11:04 ` Will Deacon
  0 siblings, 1 reply; 8+ messages in thread
From: Linu Cherian @ 2026-05-23 13:47 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ryan Roberts, Kevin Brodsky,
	Anshuman Khandual, Yang Shi, Mark Rutland, Huang Ying
  Cc: linux-arm-kernel, linux-kernel, Linu Cherian

From: Ryan Roberts <ryan.roberts@arm.com>

There are 3 variants of tlb flush that invalidate user mappings:
flush_tlb_mm(), flush_tlb_page() and __flush_tlb_range(). All of these
would previously unconditionally broadcast their tlbis to all cpus in
the inner shareable domain.

But this is a waste of effort if we can prove that the mm for which we
are flushing the mappings has only ever been active on the local cpu. In
that case, it is safe to avoid the broadcast and simply invalidate the
current cpu.

So let's track in mm_context_t::active_cpu either the mm has never been
active on any cpu, has been active on more than 1 cpu, or has been
active on precisely 1 cpu - and in that case, which one. We update this
when switching context, being careful to ensure that it gets updated
*before* installing the mm's pgtables. On the reader side, we ensure we
read *after* the previous write(s) to the pgtable(s) that necessitated
the tlb flush have completed. This guarrantees that if a cpu that is
doing a tlb flush sees it's own id in active_cpu, then the old pgtable
entry cannot have been seen by any other cpu and we can flush only the
local cpu.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Huang Ying <ying.huang@linux.alibaba.com>
[linu.cherian@arm.com: Adapted for v7.1 flush tlb API changes]
Signed-off-by: Linu Cherian <linu.cherian@arm.com>
---
Changelog from RFC v1:
- Adapted for v7.1 flush tlb API changes
  No changes in core logic
- Collected Rb and Tb tags
- lat_mmap benchmark showed dsb(ishst) performs better than dsb(ish),
  hence retained dsb(ishst) in flush_tlb_user_pre	


Testing with 7.1-rc4 :
+-----------------------+---------------------------------------------------+-------------+
| Benchmark             | Result Class                                      |  Improvement|  
+=======================+===================================================+=============+
| perf/syscall          | fork (ops/sec)                                    |   (I) 3.25% |
+-----------------------+---------------------------------------------------+-------------+
| pts/memtier-benchmark | Protocol: Redis Clients: 100 Ratio: 1:5 (Ops/sec) |   (I) 2.70% |
| 			| Protocol: Redis Clients: 100 Ratio: 5:1 (Ops/sec) |   (I) 2.13% |
+-----------------------+---------------------------------------------------+-------------+

 arch/arm64/include/asm/mmu.h         |  12 +++
 arch/arm64/include/asm/mmu_context.h |   2 +
 arch/arm64/include/asm/tlbflush.h    | 127 +++++++++++++++++++++------
 arch/arm64/mm/context.c              |  30 ++++++-
 4 files changed, 141 insertions(+), 30 deletions(-)

diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 5e1211c540ab..0002101c1f21 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -16,6 +16,17 @@
 #include <linux/refcount.h>
 #include <asm/cpufeature.h>
 
+/*
+ * Sentinal values for mm_context_t::active_cpu. ACTIVE_CPU_NONE indicates the
+ * mm has never been active on any CPU. ACTIVE_CPU_MULTIPLE indicates the mm
+ * has been active on multiple CPUs. Any other value is the ID of the single
+ * CPU that the mm has been active on.
+ */
+enum active_cpu {
+	ACTIVE_CPU_NONE = UINT_MAX,
+	ACTIVE_CPU_MULTIPLE = UINT_MAX - 1,
+};
+
 typedef struct {
 	atomic64_t	id;
 #ifdef CONFIG_COMPAT
@@ -25,6 +36,7 @@ typedef struct {
 	void		*vdso;
 	unsigned long	flags;
 	u8		pkey_allocation_map;
+	unsigned int	active_cpu;
 } mm_context_t;
 
 /*
diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index 803b68758152..101cae0c7262 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -172,6 +172,8 @@ init_new_context(struct task_struct *tsk, struct mm_struct *mm)
 	/* pkey 0 is the default, so always reserve it. */
 	mm->context.pkey_allocation_map = BIT(0);
 
+	WRITE_ONCE(mm->context.active_cpu, ACTIVE_CPU_NONE);
+
 	return 0;
 }
 
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index c0bf5b398041..1f75bce4fa0d 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -164,6 +164,12 @@ static inline void sme_dvmsync_batch(struct arch_tlbflush_unmap_batch *batch)
 
 typedef void (*tlbi_op)(u64 arg);
 
+static __always_inline void vae1(u64 arg)
+{
+	__tlbi(vae1, arg);
+	__tlbi_user(vae1, arg);
+}
+
 static __always_inline void vae1is(u64 arg)
 {
 	__tlbi(vae1is, arg);
@@ -308,6 +314,74 @@ static inline void __tlbi_sync_s1ish_hyp(void)
 	__repeat_tlbi_sync(vale2is, 0);
 }
 
+typedef unsigned __bitwise tlbf_t;
+
+/* No special behaviour. */
+#define TLBF_NONE		((__force tlbf_t)0)
+
+/* Invalidate tlb entries only, leaving the page table walk cache intact. */
+#define TLBF_NOWALKCACHE	((__force tlbf_t)BIT(0))
+
+/* Skip the trailing dsb after issuing tlbi. */
+#define TLBF_NOSYNC		((__force tlbf_t)BIT(1))
+
+/* Suppress tlb notifier callbacks for this flush operation. */
+#define TLBF_NONOTIFY		((__force tlbf_t)BIT(2))
+
+/* Perform the tlbi locally without broadcasting to other CPUs. */
+#define TLBF_NOBROADCAST	((__force tlbf_t)BIT(3))
+
+/*
+ * Determines whether the user tlbi invalidation can be performed only on the
+ * local CPU or whether it needs to be broadcast. (Returns true for local).
+ * Additionally issues appropriate barrier to ensure prior pgtable updates are
+ * visible to the table walker. Must be paired with flush_tlb_user_post().
+ */
+static inline bool flush_tlb_user_pre(struct mm_struct *mm, tlbf_t flags)
+{
+	unsigned int self, active;
+	bool local;
+
+	migrate_disable();
+
+	if (flags & TLBF_NOBROADCAST) {
+		dsb(nshst);
+		return true;
+	}
+
+	self = smp_processor_id();
+
+	/*
+	 * The load of mm->context.active_cpu must not be reordered before the
+	 * store to the pgtable that necessitated this flush. This ensures that
+	 * if the value read is our cpu id, then no other cpu can have seen the
+	 * old pgtable value and therefore does not need this old value to be
+	 * flushed from its tlb. But we don't want to upgrade the dsb(ishst),
+	 * needed to make the pgtable updates visible to the walker, to a
+	 * dsb(ish) by default. So speculatively load without a barrier and if
+	 * it indicates our cpu id, then upgrade the barrier and re-load.
+	 */
+	active = READ_ONCE(mm->context.active_cpu);
+	if (active == self) {
+		dsb(ish);
+		active = READ_ONCE(mm->context.active_cpu);
+	} else {
+		dsb(ishst);
+	}
+
+	local = active == self;
+	if (!local)
+		migrate_enable();
+
+	return local;
+}
+
+static inline void flush_tlb_user_post(bool local)
+{
+	if (local)
+		migrate_enable();
+}
+
 /*
  *	TLB Invalidation
  *	================
@@ -408,12 +482,20 @@ static inline void flush_tlb_all(void)
 static inline void flush_tlb_mm(struct mm_struct *mm)
 {
 	unsigned long asid;
+	bool local;
 
-	dsb(ishst);
+	local = flush_tlb_user_pre(mm, TLBF_NONE);
 	asid = __TLBI_VADDR(0, ASID(mm));
-	__tlbi(aside1is, asid);
-	__tlbi_user(aside1is, asid);
-	__tlbi_sync_s1ish(mm);
+	if (local) {
+		__tlbi(aside1, asid);
+		__tlbi_user(aside1, asid);
+		dsb(nsh);
+	} else {
+		__tlbi(aside1is, asid);
+		__tlbi_user(aside1is, asid);
+		__tlbi_sync_s1ish(mm);
+	}
+	flush_tlb_user_post(local);
 	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
 }
 
@@ -475,6 +557,12 @@ static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
  *    operations can only span an even number of pages. We save this for last to
  *    ensure 64KB start alignment is maintained for the LPA2 case.
  */
+static __always_inline void rvae1(u64 arg)
+{
+	__tlbi(rvae1, arg);
+	__tlbi_user(rvae1, arg);
+}
+
 static __always_inline void rvae1is(u64 arg)
 {
 	__tlbi(rvae1is, arg);
@@ -573,23 +661,6 @@ static inline bool __flush_tlb_range_limit_excess(unsigned long pages,
 	return pages >= (MAX_DVM_OPS * stride) >> PAGE_SHIFT;
 }
 
-typedef unsigned __bitwise tlbf_t;
-
-/* No special behaviour. */
-#define TLBF_NONE		((__force tlbf_t)0)
-
-/* Invalidate tlb entries only, leaving the page table walk cache intact. */
-#define TLBF_NOWALKCACHE	((__force tlbf_t)BIT(0))
-
-/* Skip the trailing dsb after issuing tlbi. */
-#define TLBF_NOSYNC		((__force tlbf_t)BIT(1))
-
-/* Suppress tlb notifier callbacks for this flush operation. */
-#define TLBF_NONOTIFY		((__force tlbf_t)BIT(2))
-
-/* Perform the tlbi locally without broadcasting to other CPUs. */
-#define TLBF_NOBROADCAST	((__force tlbf_t)BIT(3))
-
 static __always_inline void __do_flush_tlb_range(struct vm_area_struct *vma,
 					unsigned long start, unsigned long end,
 					unsigned long stride, int tlb_level,
@@ -597,6 +668,7 @@ static __always_inline void __do_flush_tlb_range(struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long asid, pages;
+	bool local;
 
 	pages = (end - start) >> PAGE_SHIFT;
 
@@ -605,10 +677,9 @@ static __always_inline void __do_flush_tlb_range(struct vm_area_struct *vma,
 		return;
 	}
 
-	if (!(flags & TLBF_NOBROADCAST))
-		dsb(ishst);
-	else
-		dsb(nshst);
+	local = flush_tlb_user_pre(mm, flags);
+	if (local && !(flags & TLBF_NOBROADCAST))
+		flags |= TLBF_NOBROADCAST;
 
 	asid = ASID(mm);
 
@@ -622,8 +693,8 @@ static __always_inline void __do_flush_tlb_range(struct vm_area_struct *vma,
 					asid, tlb_level);
 		break;
 	case TLBF_NOBROADCAST:
-		/* Combination unused */
-		BUG();
+		__flush_s1_tlb_range_op(vae1, start, pages, stride,
+					asid, tlb_level);
 		break;
 	case TLBF_NOWALKCACHE | TLBF_NOBROADCAST:
 		__flush_s1_tlb_range_op(vale1, start, pages, stride,
@@ -640,6 +711,8 @@ static __always_inline void __do_flush_tlb_range(struct vm_area_struct *vma,
 		else
 			dsb(nsh);
 	}
+
+	flush_tlb_user_post(local);
 }
 
 static inline void __flush_tlb_range(struct vm_area_struct *vma,
diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
index 0f4a28b87469..f34ed78393e0 100644
--- a/arch/arm64/mm/context.c
+++ b/arch/arm64/mm/context.c
@@ -214,9 +214,10 @@ static u64 new_context(struct mm_struct *mm)
 
 void check_and_switch_context(struct mm_struct *mm)
 {
-	unsigned long flags;
-	unsigned int cpu;
+	unsigned int cpu = smp_processor_id();
 	u64 asid, old_active_asid;
+	unsigned int active;
+	unsigned long flags;
 
 	if (system_supports_cnp())
 		cpu_set_reserved_ttbr0();
@@ -251,7 +252,6 @@ void check_and_switch_context(struct mm_struct *mm)
 		atomic64_set(&mm->context.id, asid);
 	}
 
-	cpu = smp_processor_id();
 	if (cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending))
 		local_flush_tlb_all();
 
@@ -262,6 +262,30 @@ void check_and_switch_context(struct mm_struct *mm)
 
 	arm64_apply_bp_hardening();
 
+	/*
+	 * Update mm->context.active_cpu in such a manner that we avoid cmpxchg
+	 * and dsb unless we definitely need it. If initially ACTIVE_CPU_NONE
+	 * then we are the first cpu to run so set it to our id. If initially
+	 * any id other than ours, we are the second cpu to run so set it to
+	 * ACTIVE_CPU_MULTIPLE. If we update the value then we must issue
+	 * dsb(ishst) to ensure stores to mm->context.active_cpu are ordered
+	 * against the TTBR0 write in cpu_switch_mm()/uaccess_enable(); the
+	 * store must be visible to another cpu before this cpu could have
+	 * populated any TLB entries based on the pgtables that will be
+	 * installed.
+	 */
+	active = READ_ONCE(mm->context.active_cpu);
+	if (active != cpu && active != ACTIVE_CPU_MULTIPLE) {
+		if (active == ACTIVE_CPU_NONE)
+			active = cmpxchg_relaxed(&mm->context.active_cpu,
+						 ACTIVE_CPU_NONE, cpu);
+
+		if (active != ACTIVE_CPU_NONE)
+			WRITE_ONCE(mm->context.active_cpu, ACTIVE_CPU_MULTIPLE);
+
+		dsb(ishst);
+	}
+
 	/*
 	 * Defer TTBR0_EL1 setting for user threads to uaccess_enable() when
 	 * emulating PAN.
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] arm64: tlbflush: Don't broadcast if mm was only active on local cpu
  2026-05-23 13:47 [PATCH v2] arm64: tlbflush: Don't broadcast if mm was only active on local cpu Linu Cherian
@ 2026-06-14 11:04 ` Will Deacon
  2026-06-14 11:33   ` Will Deacon
                     ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Will Deacon @ 2026-06-14 11:04 UTC (permalink / raw)
  To: Linu Cherian
  Cc: Catalin Marinas, Ryan Roberts, Kevin Brodsky, Anshuman Khandual,
	Yang Shi, Mark Rutland, Huang Ying, linux-arm-kernel,
	linux-kernel

On Sat, May 23, 2026 at 07:17:10PM +0530, Linu Cherian wrote:
> From: Ryan Roberts <ryan.roberts@arm.com>
> 
> There are 3 variants of tlb flush that invalidate user mappings:
> flush_tlb_mm(), flush_tlb_page() and __flush_tlb_range(). All of these
> would previously unconditionally broadcast their tlbis to all cpus in
> the inner shareable domain.
> 
> But this is a waste of effort if we can prove that the mm for which we
> are flushing the mappings has only ever been active on the local cpu. In
> that case, it is safe to avoid the broadcast and simply invalidate the
> current cpu.
> 
> So let's track in mm_context_t::active_cpu either the mm has never been
> active on any cpu, has been active on more than 1 cpu, or has been
> active on precisely 1 cpu - and in that case, which one. We update this
> when switching context, being careful to ensure that it gets updated
> *before* installing the mm's pgtables. On the reader side, we ensure we
> read *after* the previous write(s) to the pgtable(s) that necessitated
> the tlb flush have completed. This guarrantees that if a cpu that is
> doing a tlb flush sees it's own id in active_cpu, then the old pgtable
> entry cannot have been seen by any other cpu and we can flush only the
> local cpu.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
> Tested-by: Huang Ying <ying.huang@linux.alibaba.com>
> [linu.cherian@arm.com: Adapted for v7.1 flush tlb API changes]
> Signed-off-by: Linu Cherian <linu.cherian@arm.com>
> ---
> Changelog from RFC v1:
> - Adapted for v7.1 flush tlb API changes
>   No changes in core logic
> - Collected Rb and Tb tags
> - lat_mmap benchmark showed dsb(ishst) performs better than dsb(ish),
>   hence retained dsb(ishst) in flush_tlb_user_pre	
> 
> 
> Testing with 7.1-rc4 :
> +-----------------------+---------------------------------------------------+-------------+
> | Benchmark             | Result Class                                      |  Improvement|  
> +=======================+===================================================+=============+
> | perf/syscall          | fork (ops/sec)                                    |   (I) 3.25% |
> +-----------------------+---------------------------------------------------+-------------+
> | pts/memtier-benchmark | Protocol: Redis Clients: 100 Ratio: 1:5 (Ops/sec) |   (I) 2.70% |
> | 			| Protocol: Redis Clients: 100 Ratio: 5:1 (Ops/sec) |   (I) 2.13% |
> +-----------------------+---------------------------------------------------+-------------+

I think we need a much more comprehensive set of benchmarks before we can
begin to consider a change like this.

>  arch/arm64/include/asm/mmu.h         |  12 +++
>  arch/arm64/include/asm/mmu_context.h |   2 +
>  arch/arm64/include/asm/tlbflush.h    | 127 +++++++++++++++++++++------
>  arch/arm64/mm/context.c              |  30 ++++++-
>  4 files changed, 141 insertions(+), 30 deletions(-)

Doesn't this break BTM/SVM with the SMMU? I think that's a non-starter
even if you can provide some more compelling numbers.

> +static inline bool flush_tlb_user_pre(struct mm_struct *mm, tlbf_t flags)
> +{
> +	unsigned int self, active;
> +	bool local;
> +
> +	migrate_disable();
> +
> +	if (flags & TLBF_NOBROADCAST) {
> +		dsb(nshst);
> +		return true;
> +	}

Why does the NOBROADCAST case need migration disabled? It didn't before...

> +
> +	self = smp_processor_id();
> +
> +	/*
> +	 * The load of mm->context.active_cpu must not be reordered before the
> +	 * store to the pgtable that necessitated this flush. This ensures that
> +	 * if the value read is our cpu id, then no other cpu can have seen the
> +	 * old pgtable value and therefore does not need this old value to be
> +	 * flushed from its tlb. But we don't want to upgrade the dsb(ishst),
> +	 * needed to make the pgtable updates visible to the walker, to a
> +	 * dsb(ish) by default. So speculatively load without a barrier and if
> +	 * it indicates our cpu id, then upgrade the barrier and re-load.
> +	 */
> +	active = READ_ONCE(mm->context.active_cpu);
> +	if (active == self) {
> +		dsb(ish);
> +		active = READ_ONCE(mm->context.active_cpu);
> +	} else {
> +		dsb(ishst);
> +	}

Why can't you just do:

	dsb(ishst);
	active = READ_ONCE(mm->context.active_cpu);

?

> +
> +	local = active == self;
> +	if (!local)
> +		migrate_enable();
> +
> +	return local;
> +}
> +
> +static inline void flush_tlb_user_post(bool local)
> +{
> +	if (local)
> +		migrate_enable();
> +}

I was under the impression that disabling/enabling migration was an
expensive thing to do, so I'd really want to see some more numbers to
justify this (including from inside a VM) and allow us to consider the
trade-offs properly. It's also not at all clear to me that it's safe
from such a low-level TLB invalidation helper.

> +
>  /*
>   *	TLB Invalidation
>   *	================
> @@ -408,12 +482,20 @@ static inline void flush_tlb_all(void)
>  static inline void flush_tlb_mm(struct mm_struct *mm)
>  {
>  	unsigned long asid;
> +	bool local;
>  
> -	dsb(ishst);
> +	local = flush_tlb_user_pre(mm, TLBF_NONE);
>  	asid = __TLBI_VADDR(0, ASID(mm));
> -	__tlbi(aside1is, asid);
> -	__tlbi_user(aside1is, asid);
> -	__tlbi_sync_s1ish(mm);
> +	if (local) {
> +		__tlbi(aside1, asid);
> +		__tlbi_user(aside1, asid);
> +		dsb(nsh);
> +	} else {
> +		__tlbi(aside1is, asid);
> +		__tlbi_user(aside1is, asid);
> +		__tlbi_sync_s1ish(mm);
> +	}
> +	flush_tlb_user_post(local);

I think you've changed this since Ryan's original patch, but why are you
only calling __tlbi_sync_s1ish() for the !local case? Doesn't that break
the erratum workaround when running as a VM if the vCPU is migrated?

> diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
> index 0f4a28b87469..f34ed78393e0 100644
> --- a/arch/arm64/mm/context.c
> +++ b/arch/arm64/mm/context.c
> @@ -214,9 +214,10 @@ static u64 new_context(struct mm_struct *mm)
>  
>  void check_and_switch_context(struct mm_struct *mm)
>  {
> -	unsigned long flags;
> -	unsigned int cpu;
> +	unsigned int cpu = smp_processor_id();
>  	u64 asid, old_active_asid;
> +	unsigned int active;
> +	unsigned long flags;
>  
>  	if (system_supports_cnp())
>  		cpu_set_reserved_ttbr0();
> @@ -251,7 +252,6 @@ void check_and_switch_context(struct mm_struct *mm)
>  		atomic64_set(&mm->context.id, asid);
>  	}
>  
> -	cpu = smp_processor_id();
>  	if (cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending))
>  		local_flush_tlb_all();
>  
> @@ -262,6 +262,30 @@ void check_and_switch_context(struct mm_struct *mm)
>  
>  	arm64_apply_bp_hardening();
>  
> +	/*
> +	 * Update mm->context.active_cpu in such a manner that we avoid cmpxchg
> +	 * and dsb unless we definitely need it. If initially ACTIVE_CPU_NONE
> +	 * then we are the first cpu to run so set it to our id. If initially
> +	 * any id other than ours, we are the second cpu to run so set it to
> +	 * ACTIVE_CPU_MULTIPLE. If we update the value then we must issue
> +	 * dsb(ishst) to ensure stores to mm->context.active_cpu are ordered
> +	 * against the TTBR0 write in cpu_switch_mm()/uaccess_enable(); the
> +	 * store must be visible to another cpu before this cpu could have
> +	 * populated any TLB entries based on the pgtables that will be
> +	 * installed.
> +	 */
> +	active = READ_ONCE(mm->context.active_cpu);
> +	if (active != cpu && active != ACTIVE_CPU_MULTIPLE) {
> +		if (active == ACTIVE_CPU_NONE)
> +			active = cmpxchg_relaxed(&mm->context.active_cpu,
> +						 ACTIVE_CPU_NONE, cpu);
> +
> +		if (active != ACTIVE_CPU_NONE)
> +			WRITE_ONCE(mm->context.active_cpu, ACTIVE_CPU_MULTIPLE);
> +
> +		dsb(ishst);
> +	}
> +

Can you simplify the 'if' condition here?

	if (active == ACTIVE_CPU_NONE) {
		if (!try_cmpxchg_relaxed(...))
			WRITE_ONCE(...);

		dsb(ishst);
	}

(as an aside, maybe we should implement arch_try_cmpxchg{,_relaxed} so
 we could drop the READ_ONCE() here as well?)

Will


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] arm64: tlbflush: Don't broadcast if mm was only active on local cpu
  2026-06-14 11:04 ` Will Deacon
@ 2026-06-14 11:33   ` Will Deacon
  2026-06-15 11:21   ` Ryan Roberts
  2026-06-15 12:39   ` Mark Rutland
  2 siblings, 0 replies; 8+ messages in thread
From: Will Deacon @ 2026-06-14 11:33 UTC (permalink / raw)
  To: Linu Cherian
  Cc: Catalin Marinas, Ryan Roberts, Kevin Brodsky, Anshuman Khandual,
	Yang Shi, Mark Rutland, Huang Ying, linux-arm-kernel,
	linux-kernel

On Sun, Jun 14, 2026 at 12:04:44PM +0100, Will Deacon wrote:
> Can you simplify the 'if' condition here?
> 
> 	if (active == ACTIVE_CPU_NONE) {
> 		if (!try_cmpxchg_relaxed(...))
> 			WRITE_ONCE(...);
> 
> 		dsb(ishst);
> 	}
> 
> (as an aside, maybe we should implement arch_try_cmpxchg{,_relaxed} so
>  we could drop the READ_ONCE() here as well?)

Mulling this over a little more, we probably can't drop the READ_ONCE()
even if we optimised our try_cmpxchg() implementation, as it would
prevent us from eliding the DSB on the fast path.

The rest of my comments (including the refactoring above) stand, however.

Will


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] arm64: tlbflush: Don't broadcast if mm was only active on local cpu
  2026-06-14 11:04 ` Will Deacon
  2026-06-14 11:33   ` Will Deacon
@ 2026-06-15 11:21   ` Ryan Roberts
  2026-06-15 14:43     ` Will Deacon
  2026-06-15 12:39   ` Mark Rutland
  2 siblings, 1 reply; 8+ messages in thread
From: Ryan Roberts @ 2026-06-15 11:21 UTC (permalink / raw)
  To: Will Deacon, Linu Cherian
  Cc: Catalin Marinas, Kevin Brodsky, Anshuman Khandual, Yang Shi,
	Mark Rutland, Huang Ying, linux-arm-kernel, linux-kernel

On 14/06/2026 12:04, Will Deacon wrote:
> On Sat, May 23, 2026 at 07:17:10PM +0530, Linu Cherian wrote:
>> From: Ryan Roberts <ryan.roberts@arm.com>
>>
>> There are 3 variants of tlb flush that invalidate user mappings:
>> flush_tlb_mm(), flush_tlb_page() and __flush_tlb_range(). All of these
>> would previously unconditionally broadcast their tlbis to all cpus in
>> the inner shareable domain.
>>
>> But this is a waste of effort if we can prove that the mm for which we
>> are flushing the mappings has only ever been active on the local cpu. In
>> that case, it is safe to avoid the broadcast and simply invalidate the
>> current cpu.
>>
>> So let's track in mm_context_t::active_cpu either the mm has never been
>> active on any cpu, has been active on more than 1 cpu, or has been
>> active on precisely 1 cpu - and in that case, which one. We update this
>> when switching context, being careful to ensure that it gets updated
>> *before* installing the mm's pgtables. On the reader side, we ensure we
>> read *after* the previous write(s) to the pgtable(s) that necessitated
>> the tlb flush have completed. This guarrantees that if a cpu that is
>> doing a tlb flush sees it's own id in active_cpu, then the old pgtable
>> entry cannot have been seen by any other cpu and we can flush only the
>> local cpu.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
>> Tested-by: Huang Ying <ying.huang@linux.alibaba.com>
>> [linu.cherian@arm.com: Adapted for v7.1 flush tlb API changes]
>> Signed-off-by: Linu Cherian <linu.cherian@arm.com>
>> ---
>> Changelog from RFC v1:
>> - Adapted for v7.1 flush tlb API changes
>>   No changes in core logic
>> - Collected Rb and Tb tags
>> - lat_mmap benchmark showed dsb(ishst) performs better than dsb(ish),
>>   hence retained dsb(ishst) in flush_tlb_user_pre	
>>
>>
>> Testing with 7.1-rc4 :
>> +-----------------------+---------------------------------------------------+-------------+
>> | Benchmark             | Result Class                                      |  Improvement|  
>> +=======================+===================================================+=============+
>> | perf/syscall          | fork (ops/sec)                                    |   (I) 3.25% |
>> +-----------------------+---------------------------------------------------+-------------+
>> | pts/memtier-benchmark | Protocol: Redis Clients: 100 Ratio: 1:5 (Ops/sec) |   (I) 2.70% |
>> | 			| Protocol: Redis Clients: 100 Ratio: 5:1 (Ops/sec) |   (I) 2.13% |
>> +-----------------------+---------------------------------------------------+-------------+
> 
> I think we need a much more comprehensive set of benchmarks before we can
> begin to consider a change like this.

I believe that Linu ran a wider set of benchmarks and didn't find any
regressions. These are just the ones that show improvement (Linu, please correct
me and/or provide details).

Additionally Huang Ying did some testing against the RFC and reported 4.5%
improvement with Redis:

https://lore.kernel.org/linux-arm-kernel/87segumv6w.fsf@DESKTOP-5N7EMDA/

> 
>>  arch/arm64/include/asm/mmu.h         |  12 +++
>>  arch/arm64/include/asm/mmu_context.h |   2 +
>>  arch/arm64/include/asm/tlbflush.h    | 127 +++++++++++++++++++++------
>>  arch/arm64/mm/context.c              |  30 ++++++-
>>  4 files changed, 141 insertions(+), 30 deletions(-)
> 
> Doesn't this break BTM/SVM with the SMMU? I think that's a non-starter
> even if you can provide some more compelling numbers.

AIUI, we don't support BTM upstream - the SMMU uses private ASIDs and implements
MMU notifiers to forward the TLBIs via its command queue interface.

I was also under the impression that supporting BTM upsteam was not desired;
Please correct me if that's not accurate or if you're aware of plans to add
support. I've been (coincidentlly) looking at some other stuff that could
benefit from BTM but had concluded it wouldn't be an acceptable approach upstream.

If we did ever want to add SMMU BTM support though, I think it would be simple
enough to add an interface to allow the SMMU to disable the optimization (i.e.
force active_cpu to ACTIVE_CPU_MULTIPLE)?

> 
>> +static inline bool flush_tlb_user_pre(struct mm_struct *mm, tlbf_t flags)
>> +{
>> +	unsigned int self, active;
>> +	bool local;
>> +
>> +	migrate_disable();
>> +
>> +	if (flags & TLBF_NOBROADCAST) {
>> +		dsb(nshst);
>> +		return true;
>> +	}
> 
> Why does the NOBROADCAST case need migration disabled? It didn't before...

The existing semantic for TLBF_BOBROADCAST is that it emits a local TLBI on
whatever CPU we happen to be executing on. It's used for lazily fixing up
spurious faults (i.e. hitting RO TLB entries when the PTE has been relaxed to
RW). So it's still functionally correct if the thread migrates CPU between
taking the fault and issuing the local TLBI - in the worst case it just leads to
another spurious fault.

For this new case, we need to ensure we don't get migrated between reading
active_cpu and issuing the local TLBI, otherwise we would only issue a local
TLBI when a broadcast was required.

> 
>> +
>> +	self = smp_processor_id();
>> +
>> +	/*
>> +	 * The load of mm->context.active_cpu must not be reordered before the
>> +	 * store to the pgtable that necessitated this flush. This ensures that
>> +	 * if the value read is our cpu id, then no other cpu can have seen the
>> +	 * old pgtable value and therefore does not need this old value to be
>> +	 * flushed from its tlb. But we don't want to upgrade the dsb(ishst),
>> +	 * needed to make the pgtable updates visible to the walker, to a
>> +	 * dsb(ish) by default. So speculatively load without a barrier and if
>> +	 * it indicates our cpu id, then upgrade the barrier and re-load.
>> +	 */
>> +	active = READ_ONCE(mm->context.active_cpu);
>> +	if (active == self) {
>> +		dsb(ish);
>> +		active = READ_ONCE(mm->context.active_cpu);
>> +	} else {
>> +		dsb(ishst);
>> +	}
> 
> Why can't you just do:
> 
> 	dsb(ishst);
> 	active = READ_ONCE(mm->context.active_cpu);
> 
> ?

Prior to this optimization, we always issued a dsb(ishst) here. Catalin
suggested the same simplification against the RFC. I believe Linu tried it but
saw regressions; Hopefully Linu can provide the details.

> 
>> +
>> +	local = active == self;
>> +	if (!local)
>> +		migrate_enable();
>> +
>> +	return local;
>> +}
>> +
>> +static inline void flush_tlb_user_post(bool local)
>> +{
>> +	if (local)
>> +		migrate_enable();
>> +}
> 
> I was under the impression that disabling/enabling migration was an
> expensive thing to do, so I'd really want to see some more numbers to
> justify this (including from inside a VM) and allow us to consider the
> trade-offs properly. It's also not at all clear to me that it's safe
> from such a low-level TLB invalidation helper.

I had assumed it wasn't very expensive, but perhaps I'm wrong. I know
preempt_enable() can be expensive because it has to test to see if it needs to
reschedule. But I assumed for disabling/enabling migration, it would just be a
counter and the scheduler would check that it's zero before considing moving the
task to another run queue. (But I have practically zero understanding of the
scheduler so I'll assume I'm wrong...).

Instead of disabling migration, perhaps we could re-check active_cpu after
issuing the local tlbi - if it's now reporting "multiple" we must have been
migrated and we need to upgrade to a braodcast TLBI?

> 
>> +
>>  /*
>>   *	TLB Invalidation
>>   *	================
>> @@ -408,12 +482,20 @@ static inline void flush_tlb_all(void)
>>  static inline void flush_tlb_mm(struct mm_struct *mm)
>>  {
>>  	unsigned long asid;
>> +	bool local;
>>  
>> -	dsb(ishst);
>> +	local = flush_tlb_user_pre(mm, TLBF_NONE);
>>  	asid = __TLBI_VADDR(0, ASID(mm));
>> -	__tlbi(aside1is, asid);
>> -	__tlbi_user(aside1is, asid);
>> -	__tlbi_sync_s1ish(mm);
>> +	if (local) {
>> +		__tlbi(aside1, asid);
>> +		__tlbi_user(aside1, asid);
>> +		dsb(nsh);
>> +	} else {
>> +		__tlbi(aside1is, asid);
>> +		__tlbi_user(aside1is, asid);
>> +		__tlbi_sync_s1ish(mm);
>> +	}
>> +	flush_tlb_user_post(local);
> 
> I think you've changed this since Ryan's original patch, but why are you
> only calling __tlbi_sync_s1ish() for the !local case? Doesn't that break
> the erratum workaround when running as a VM if the vCPU is migrated?

Hmm. So from the guest kernel's perspective, it has concluded that it only needs
to target the local (v)CPU. Since the errata only affect boardcast TLBIs, it
concludes there is no need to issue the workarounds. But since it's a VM, the HW
will upgrade the local TLBIs to broadcast TLBIs, but will not magically
re-instate the workarounds. I guess the simplest solution would be to disable
the optimization when either workaround is enabled.

Perhaps this is all getting a bit too complex for not enough benefit...

Thanks,
Ryan

> 
>> diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
>> index 0f4a28b87469..f34ed78393e0 100644
>> --- a/arch/arm64/mm/context.c
>> +++ b/arch/arm64/mm/context.c
>> @@ -214,9 +214,10 @@ static u64 new_context(struct mm_struct *mm)
>>  
>>  void check_and_switch_context(struct mm_struct *mm)
>>  {
>> -	unsigned long flags;
>> -	unsigned int cpu;
>> +	unsigned int cpu = smp_processor_id();
>>  	u64 asid, old_active_asid;
>> +	unsigned int active;
>> +	unsigned long flags;
>>  
>>  	if (system_supports_cnp())
>>  		cpu_set_reserved_ttbr0();
>> @@ -251,7 +252,6 @@ void check_and_switch_context(struct mm_struct *mm)
>>  		atomic64_set(&mm->context.id, asid);
>>  	}
>>  
>> -	cpu = smp_processor_id();
>>  	if (cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending))
>>  		local_flush_tlb_all();
>>  
>> @@ -262,6 +262,30 @@ void check_and_switch_context(struct mm_struct *mm)
>>  
>>  	arm64_apply_bp_hardening();
>>  
>> +	/*
>> +	 * Update mm->context.active_cpu in such a manner that we avoid cmpxchg
>> +	 * and dsb unless we definitely need it. If initially ACTIVE_CPU_NONE
>> +	 * then we are the first cpu to run so set it to our id. If initially
>> +	 * any id other than ours, we are the second cpu to run so set it to
>> +	 * ACTIVE_CPU_MULTIPLE. If we update the value then we must issue
>> +	 * dsb(ishst) to ensure stores to mm->context.active_cpu are ordered
>> +	 * against the TTBR0 write in cpu_switch_mm()/uaccess_enable(); the
>> +	 * store must be visible to another cpu before this cpu could have
>> +	 * populated any TLB entries based on the pgtables that will be
>> +	 * installed.
>> +	 */
>> +	active = READ_ONCE(mm->context.active_cpu);
>> +	if (active != cpu && active != ACTIVE_CPU_MULTIPLE) {
>> +		if (active == ACTIVE_CPU_NONE)
>> +			active = cmpxchg_relaxed(&mm->context.active_cpu,
>> +						 ACTIVE_CPU_NONE, cpu);
>> +
>> +		if (active != ACTIVE_CPU_NONE)
>> +			WRITE_ONCE(mm->context.active_cpu, ACTIVE_CPU_MULTIPLE);
>> +
>> +		dsb(ishst);
>> +	}
>> +
> 
> Can you simplify the 'if' condition here?
> 
> 	if (active == ACTIVE_CPU_NONE) {
> 		if (!try_cmpxchg_relaxed(...))
> 			WRITE_ONCE(...);
> 
> 		dsb(ishst);
> 	}
> 
> (as an aside, maybe we should implement arch_try_cmpxchg{,_relaxed} so
>  we could drop the READ_ONCE() here as well?)
> 
> Will



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] arm64: tlbflush: Don't broadcast if mm was only active on local cpu
  2026-06-14 11:04 ` Will Deacon
  2026-06-14 11:33   ` Will Deacon
  2026-06-15 11:21   ` Ryan Roberts
@ 2026-06-15 12:39   ` Mark Rutland
  2026-06-15 14:44     ` Will Deacon
  2 siblings, 1 reply; 8+ messages in thread
From: Mark Rutland @ 2026-06-15 12:39 UTC (permalink / raw)
  To: Will Deacon
  Cc: Linu Cherian, Catalin Marinas, Ryan Roberts, Kevin Brodsky,
	Anshuman Khandual, Yang Shi, Huang Ying, linux-arm-kernel,
	linux-kernel

Hi Will,

On Sun, Jun 14, 2026 at 12:04:44PM +0100, Will Deacon wrote:
> On Sat, May 23, 2026 at 07:17:10PM +0530, Linu Cherian wrote:

> >  static inline void flush_tlb_mm(struct mm_struct *mm)
> >  {
> >  	unsigned long asid;
> > +	bool local;
> >  
> > -	dsb(ishst);
> > +	local = flush_tlb_user_pre(mm, TLBF_NONE);
> >  	asid = __TLBI_VADDR(0, ASID(mm));
> > -	__tlbi(aside1is, asid);
> > -	__tlbi_user(aside1is, asid);
> > -	__tlbi_sync_s1ish(mm);
> > +	if (local) {
> > +		__tlbi(aside1, asid);
> > +		__tlbi_user(aside1, asid);
> > +		dsb(nsh);
> > +	} else {
> > +		__tlbi(aside1is, asid);
> > +		__tlbi_user(aside1is, asid);
> > +		__tlbi_sync_s1ish(mm);
> > +	}
> > +	flush_tlb_user_post(local);
> 
> I think you've changed this since Ryan's original patch, but why are you
> only calling __tlbi_sync_s1ish() for the !local case? Doesn't that break
> the erratum workaround when running as a VM if the vCPU is migrated?

The errata mitigated by __tlbi_sync_s1ish() only affect broadcast
maintenance (the 'ish' in the name was intended to convey that). No
workaround is necessary for local TLB maintenance; aside from anything
else, when some PE executes the DSB to complete the maintenance, that
DSB alone is sufficient to complete memory accesses made by that PE.

If it would make things clearer, we could add a __tlbi_sync_s1nsh()
helper for the local case, which would boil down to a DSB NSH.

Regardless of the erratum, to correctly handle a vCPU being migrated
from pCPU-x to pCPU-y, we rely on:

* The host to set HCR_EL2.FB to ensure that TLB maintenance is
  broadcast to the ISH domain.

* The host to set HCR_EL2.BSU to ensure the DSB is upgrade to ISH such
  that any guest-issued DSB NSH will it can complete any TLB maintenance
  that was upgraded to ISH.

* The host to issue a DSB ISH on pCPU-x before the vCPU can run on
  pCPU-y, to complete any outstanding maintenance that was issued on
  pCPU-x. IIUC a DSB ISH on pCPU-y is not architecturally sufficient; it
  must be executed on the same CPU which issued the TLB maintenance.

... but as above, all of that should be independent of any of the errata
that require the workaround.

Mark.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] arm64: tlbflush: Don't broadcast if mm was only active on local cpu
  2026-06-15 11:21   ` Ryan Roberts
@ 2026-06-15 14:43     ` Will Deacon
  2026-06-15 15:41       ` Ryan Roberts
  0 siblings, 1 reply; 8+ messages in thread
From: Will Deacon @ 2026-06-15 14:43 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Linu Cherian, Catalin Marinas, Kevin Brodsky, Anshuman Khandual,
	Yang Shi, Mark Rutland, Huang Ying, linux-arm-kernel,
	linux-kernel, shameerali.kolothum.thodi

On Mon, Jun 15, 2026 at 12:21:19PM +0100, Ryan Roberts wrote:
> On 14/06/2026 12:04, Will Deacon wrote:
> > On Sat, May 23, 2026 at 07:17:10PM +0530, Linu Cherian wrote:
> >> From: Ryan Roberts <ryan.roberts@arm.com>
> >>
> >> Testing with 7.1-rc4 :
> >> +-----------------------+---------------------------------------------------+-------------+
> >> | Benchmark             | Result Class                                      |  Improvement|  
> >> +=======================+===================================================+=============+
> >> | perf/syscall          | fork (ops/sec)                                    |   (I) 3.25% |
> >> +-----------------------+---------------------------------------------------+-------------+
> >> | pts/memtier-benchmark | Protocol: Redis Clients: 100 Ratio: 1:5 (Ops/sec) |   (I) 2.70% |
> >> | 			| Protocol: Redis Clients: 100 Ratio: 5:1 (Ops/sec) |   (I) 2.13% |
> >> +-----------------------+---------------------------------------------------+-------------+
> > 
> > I think we need a much more comprehensive set of benchmarks before we can
> > begin to consider a change like this.
> 
> I believe that Linu ran a wider set of benchmarks and didn't find any
> regressions. These are just the ones that show improvement (Linu, please correct
> me and/or provide details).

I think it's important to show the ones that suffer as well... and also
look at different configurations (e.g. preemptible settings) and different
environments (e.g. native vs in a VM).

> Additionally Huang Ying did some testing against the RFC and reported 4.5%
> improvement with Redis:
> 
> https://lore.kernel.org/linux-arm-kernel/87segumv6w.fsf@DESKTOP-5N7EMDA 

To be clear: I'm not disputing that some benchmarks appear to show a small
boost from this series. I'm just worried that's not the whole story.

> >>  arch/arm64/include/asm/mmu.h         |  12 +++
> >>  arch/arm64/include/asm/mmu_context.h |   2 +
> >>  arch/arm64/include/asm/tlbflush.h    | 127 +++++++++++++++++++++------
> >>  arch/arm64/mm/context.c              |  30 ++++++-
> >>  4 files changed, 141 insertions(+), 30 deletions(-)
> > 
> > Doesn't this break BTM/SVM with the SMMU? I think that's a non-starter
> > even if you can provide some more compelling numbers.
> 
> AIUI, we don't support BTM upstream - the SMMU uses private ASIDs and implements
> MMU notifiers to forward the TLBIs via its command queue interface.
> 
> I was also under the impression that supporting BTM upsteam was not desired;
> Please correct me if that's not accurate or if you're aware of plans to add
> support. I've been (coincidentlly) looking at some other stuff that could
> benefit from BTM but had concluded it wouldn't be an acceptable approach upstream.
> 
> If we did ever want to add SMMU BTM support though, I think it would be simple
> enough to add an interface to allow the SMMU to disable the optimization (i.e.
> force active_cpu to ACTIVE_CPU_MULTIPLE)?

We used to have some initial BTM support in the SMMUv3 driver but the
main problem was finding an upstream driver/soc that can use it properly
and so it was ultimately removed in d38c28dbefee ("iommu/arm-smmu-v3: Put
the SVA mmu notifier in the smmu_domain") because it was getting in the
way of wider driver rework and we couldn't test it.

However, there *is* work to re-enable it on top of that rework (and other
changes):

  https://lore.kernel.org/linux-iommu/20250319173202.78988-6-shameerali.kolothum.thodi@huawei.com/

although I don't know if Shameer intends to repost that...

> >> +static inline bool flush_tlb_user_pre(struct mm_struct *mm, tlbf_t flags)
> >> +{
> >> +	unsigned int self, active;
> >> +	bool local;
> >> +
> >> +	migrate_disable();
> >> +
> >> +	if (flags & TLBF_NOBROADCAST) {
> >> +		dsb(nshst);
> >> +		return true;
> >> +	}
> > 
> > Why does the NOBROADCAST case need migration disabled? It didn't before...
> 
> The existing semantic for TLBF_BOBROADCAST is that it emits a local TLBI on
> whatever CPU we happen to be executing on. It's used for lazily fixing up
> spurious faults (i.e. hitting RO TLB entries when the PTE has been relaxed to
> RW). So it's still functionally correct if the thread migrates CPU between
> taking the fault and issuing the local TLBI - in the worst case it just leads to
> another spurious fault.
> 
> For this new case, we need to ensure we don't get migrated between reading
> active_cpu and issuing the local TLBI, otherwise we would only issue a local
> TLBI when a broadcast was required.

Sounds like those two users probably need separating out, then?

> >> +	self = smp_processor_id();
> >> +
> >> +	/*
> >> +	 * The load of mm->context.active_cpu must not be reordered before the
> >> +	 * store to the pgtable that necessitated this flush. This ensures that
> >> +	 * if the value read is our cpu id, then no other cpu can have seen the
> >> +	 * old pgtable value and therefore does not need this old value to be
> >> +	 * flushed from its tlb. But we don't want to upgrade the dsb(ishst),
> >> +	 * needed to make the pgtable updates visible to the walker, to a
> >> +	 * dsb(ish) by default. So speculatively load without a barrier and if
> >> +	 * it indicates our cpu id, then upgrade the barrier and re-load.
> >> +	 */
> >> +	active = READ_ONCE(mm->context.active_cpu);
> >> +	if (active == self) {
> >> +		dsb(ish);
> >> +		active = READ_ONCE(mm->context.active_cpu);
> >> +	} else {
> >> +		dsb(ishst);
> >> +	}
> > 
> > Why can't you just do:
> > 
> > 	dsb(ishst);
> > 	active = READ_ONCE(mm->context.active_cpu);
> > 
> > ?
> 
> Prior to this optimization, we always issued a dsb(ishst) here. Catalin
> suggested the same simplification against the RFC. I believe Linu tried it but
> saw regressions; Hopefully Linu can provide the details.

I don't follow...

The old code always did dsb(ishst). The proposed code here does either
dsb(ish) or dsb(ishst). How can that possibly be faster?

> >> +	local = active == self;
> >> +	if (!local)
> >> +		migrate_enable();
> >> +
> >> +	return local;
> >> +}
> >> +
> >> +static inline void flush_tlb_user_post(bool local)
> >> +{
> >> +	if (local)
> >> +		migrate_enable();
> >> +}
> > 
> > I was under the impression that disabling/enabling migration was an
> > expensive thing to do, so I'd really want to see some more numbers to
> > justify this (including from inside a VM) and allow us to consider the
> > trade-offs properly. It's also not at all clear to me that it's safe
> > from such a low-level TLB invalidation helper.
> 
> I had assumed it wasn't very expensive, but perhaps I'm wrong. I know
> preempt_enable() can be expensive because it has to test to see if it needs to
> reschedule. But I assumed for disabling/enabling migration, it would just be a
> counter and the scheduler would check that it's zero before considing moving the
> task to another run queue. (But I have practically zero understanding of the
> scheduler so I'll assume I'm wrong...).

I'm not an expert here either, but reading the code shows that it has
a preempt guard along with additional book-keeping.

> Instead of disabling migration, perhaps we could re-check active_cpu after
> issuing the local tlbi - if it's now reporting "multiple" we must have been
> migrated and we need to upgrade to a braodcast TLBI?

That's an interesting idea, although I suppose it means the
post-invalidation DSB needs to be ISH for the local case to check the
active_cpu safely?

> >>   *	TLB Invalidation
> >>   *	================
> >> @@ -408,12 +482,20 @@ static inline void flush_tlb_all(void)
> >>  static inline void flush_tlb_mm(struct mm_struct *mm)
> >>  {
> >>  	unsigned long asid;
> >> +	bool local;
> >>  
> >> -	dsb(ishst);
> >> +	local = flush_tlb_user_pre(mm, TLBF_NONE);
> >>  	asid = __TLBI_VADDR(0, ASID(mm));
> >> -	__tlbi(aside1is, asid);
> >> -	__tlbi_user(aside1is, asid);
> >> -	__tlbi_sync_s1ish(mm);
> >> +	if (local) {
> >> +		__tlbi(aside1, asid);
> >> +		__tlbi_user(aside1, asid);
> >> +		dsb(nsh);
> >> +	} else {
> >> +		__tlbi(aside1is, asid);
> >> +		__tlbi_user(aside1is, asid);
> >> +		__tlbi_sync_s1ish(mm);
> >> +	}
> >> +	flush_tlb_user_post(local);
> > 
> > I think you've changed this since Ryan's original patch, but why are you
> > only calling __tlbi_sync_s1ish() for the !local case? Doesn't that break
> > the erratum workaround when running as a VM if the vCPU is migrated?
> 
> Hmm. So from the guest kernel's perspective, it has concluded that it only needs
> to target the local (v)CPU. Since the errata only affect boardcast TLBIs, it
> concludes there is no need to issue the workarounds. But since it's a VM, the HW
> will upgrade the local TLBIs to broadcast TLBIs, but will not magically
> re-instate the workarounds. I guess the simplest solution would be to disable
> the optimization when either workaround is enabled.

That's what I was thinking, but Mark seems to think it's ok. I'll reply
to him on the other part of the thread.

> Perhaps this is all getting a bit too complex for not enough benefit...

I don't think the complexity is unmanageable, but I'm not yet convinced
that this offers any real benefit overall.

Will


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] arm64: tlbflush: Don't broadcast if mm was only active on local cpu
  2026-06-15 12:39   ` Mark Rutland
@ 2026-06-15 14:44     ` Will Deacon
  0 siblings, 0 replies; 8+ messages in thread
From: Will Deacon @ 2026-06-15 14:44 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Linu Cherian, Catalin Marinas, Ryan Roberts, Kevin Brodsky,
	Anshuman Khandual, Yang Shi, Huang Ying, linux-arm-kernel,
	linux-kernel

On Mon, Jun 15, 2026 at 01:39:43PM +0100, Mark Rutland wrote:
> On Sun, Jun 14, 2026 at 12:04:44PM +0100, Will Deacon wrote:
> > On Sat, May 23, 2026 at 07:17:10PM +0530, Linu Cherian wrote:
> 
> > >  static inline void flush_tlb_mm(struct mm_struct *mm)
> > >  {
> > >  	unsigned long asid;
> > > +	bool local;
> > >  
> > > -	dsb(ishst);
> > > +	local = flush_tlb_user_pre(mm, TLBF_NONE);
> > >  	asid = __TLBI_VADDR(0, ASID(mm));
> > > -	__tlbi(aside1is, asid);
> > > -	__tlbi_user(aside1is, asid);
> > > -	__tlbi_sync_s1ish(mm);
> > > +	if (local) {
> > > +		__tlbi(aside1, asid);
> > > +		__tlbi_user(aside1, asid);
> > > +		dsb(nsh);
> > > +	} else {
> > > +		__tlbi(aside1is, asid);
> > > +		__tlbi_user(aside1is, asid);
> > > +		__tlbi_sync_s1ish(mm);
> > > +	}
> > > +	flush_tlb_user_post(local);
> > 
> > I think you've changed this since Ryan's original patch, but why are you
> > only calling __tlbi_sync_s1ish() for the !local case? Doesn't that break
> > the erratum workaround when running as a VM if the vCPU is migrated?
> 
> The errata mitigated by __tlbi_sync_s1ish() only affect broadcast
> maintenance (the 'ish' in the name was intended to convey that). No
> workaround is necessary for local TLB maintenance; aside from anything
> else, when some PE executes the DSB to complete the maintenance, that
> DSB alone is sufficient to complete memory accesses made by that PE.
> 
> If it would make things clearer, we could add a __tlbi_sync_s1nsh()
> helper for the local case, which would boil down to a DSB NSH.

No, I don't think that's what I'm concerned about.

> Regardless of the erratum, to correctly handle a vCPU being migrated
> from pCPU-x to pCPU-y, we rely on:
> 
> * The host to set HCR_EL2.FB to ensure that TLB maintenance is
>   broadcast to the ISH domain.
> 
> * The host to set HCR_EL2.BSU to ensure the DSB is upgrade to ISH such
>   that any guest-issued DSB NSH will it can complete any TLB maintenance
>   that was upgraded to ISH.
> 
> * The host to issue a DSB ISH on pCPU-x before the vCPU can run on
>   pCPU-y, to complete any outstanding maintenance that was issued on
>   pCPU-x. IIUC a DSB ISH on pCPU-y is not architecturally sufficient; it
>   must be executed on the same CPU which issued the TLB maintenance.
> 
> ... but as above, all of that should be independent of any of the errata
> that require the workaround.

Yes, I understand all of the above but the case I'm struggling with is
where a vCPU runs on a system that needs the TLB invalidation to be
performed twice. For non-broadcast invalidation (from the guest
perspective), this patch will mean that it only performs the
invalidation once. So if the vCPU migrates to another physical CPU, can
that effectively undo the HCR_EL2.FB upgrade unless KVM issues TLB
invalidation as well as a DSB on migration?

Maybe I'm missing something, as it looks like upstream already elides
the call to __tlbi_sync_s1ish() for the NOBROADCAST case.

Will


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] arm64: tlbflush: Don't broadcast if mm was only active on local cpu
  2026-06-15 14:43     ` Will Deacon
@ 2026-06-15 15:41       ` Ryan Roberts
  0 siblings, 0 replies; 8+ messages in thread
From: Ryan Roberts @ 2026-06-15 15:41 UTC (permalink / raw)
  To: Will Deacon
  Cc: Linu Cherian, Catalin Marinas, Kevin Brodsky, Anshuman Khandual,
	Yang Shi, Mark Rutland, Huang Ying, linux-arm-kernel,
	linux-kernel, shameerali.kolothum.thodi

On 15/06/2026 15:43, Will Deacon wrote:
> On Mon, Jun 15, 2026 at 12:21:19PM +0100, Ryan Roberts wrote:
>> On 14/06/2026 12:04, Will Deacon wrote:
>>> On Sat, May 23, 2026 at 07:17:10PM +0530, Linu Cherian wrote:
>>>> From: Ryan Roberts <ryan.roberts@arm.com>
>>>>
>>>> Testing with 7.1-rc4 :
>>>> +-----------------------+---------------------------------------------------+-------------+
>>>> | Benchmark             | Result Class                                      |  Improvement|  
>>>> +=======================+===================================================+=============+
>>>> | perf/syscall          | fork (ops/sec)                                    |   (I) 3.25% |
>>>> +-----------------------+---------------------------------------------------+-------------+
>>>> | pts/memtier-benchmark | Protocol: Redis Clients: 100 Ratio: 1:5 (Ops/sec) |   (I) 2.70% |
>>>> | 			| Protocol: Redis Clients: 100 Ratio: 5:1 (Ops/sec) |   (I) 2.13% |
>>>> +-----------------------+---------------------------------------------------+-------------+
>>>
>>> I think we need a much more comprehensive set of benchmarks before we can
>>> begin to consider a change like this.
>>
>> I believe that Linu ran a wider set of benchmarks and didn't find any
>> regressions. These are just the ones that show improvement (Linu, please correct
>> me and/or provide details).
> 
> I think it's important to show the ones that suffer as well... and also
> look at different configurations (e.g. preemptible settings) and different
> environments (e.g. native vs in a VM).
> 
>> Additionally Huang Ying did some testing against the RFC and reported 4.5%
>> improvement with Redis:
>>
>> https://lore.kernel.org/linux-arm-kernel/87segumv6w.fsf@DESKTOP-5N7EMDA 
> 
> To be clear: I'm not disputing that some benchmarks appear to show a small
> boost from this series. I'm just worried that's not the whole story.
> 
>>>>  arch/arm64/include/asm/mmu.h         |  12 +++
>>>>  arch/arm64/include/asm/mmu_context.h |   2 +
>>>>  arch/arm64/include/asm/tlbflush.h    | 127 +++++++++++++++++++++------
>>>>  arch/arm64/mm/context.c              |  30 ++++++-
>>>>  4 files changed, 141 insertions(+), 30 deletions(-)
>>>
>>> Doesn't this break BTM/SVM with the SMMU? I think that's a non-starter
>>> even if you can provide some more compelling numbers.
>>
>> AIUI, we don't support BTM upstream - the SMMU uses private ASIDs and implements
>> MMU notifiers to forward the TLBIs via its command queue interface.
>>
>> I was also under the impression that supporting BTM upsteam was not desired;
>> Please correct me if that's not accurate or if you're aware of plans to add
>> support. I've been (coincidentlly) looking at some other stuff that could
>> benefit from BTM but had concluded it wouldn't be an acceptable approach upstream.
>>
>> If we did ever want to add SMMU BTM support though, I think it would be simple
>> enough to add an interface to allow the SMMU to disable the optimization (i.e.
>> force active_cpu to ACTIVE_CPU_MULTIPLE)?
> 
> We used to have some initial BTM support in the SMMUv3 driver but the
> main problem was finding an upstream driver/soc that can use it properly
> and so it was ultimately removed in d38c28dbefee ("iommu/arm-smmu-v3: Put
> the SVA mmu notifier in the smmu_domain") because it was getting in the
> way of wider driver rework and we couldn't test it.
> 
> However, there *is* work to re-enable it on top of that rework (and other
> changes):
> 
>   https://lore.kernel.org/linux-iommu/20250319173202.78988-6-shameerali.kolothum.thodi@huawei.com/
> 
> although I don't know if Shameer intends to repost that...

Thanks for the pointers; That's very interesting feedback. I'll take a closer
look :)

> 
>>>> +static inline bool flush_tlb_user_pre(struct mm_struct *mm, tlbf_t flags)
>>>> +{
>>>> +	unsigned int self, active;
>>>> +	bool local;
>>>> +
>>>> +	migrate_disable();
>>>> +
>>>> +	if (flags & TLBF_NOBROADCAST) {
>>>> +		dsb(nshst);
>>>> +		return true;
>>>> +	}
>>>
>>> Why does the NOBROADCAST case need migration disabled? It didn't before...
>>
>> The existing semantic for TLBF_BOBROADCAST is that it emits a local TLBI on
>> whatever CPU we happen to be executing on. It's used for lazily fixing up
>> spurious faults (i.e. hitting RO TLB entries when the PTE has been relaxed to
>> RW). So it's still functionally correct if the thread migrates CPU between
>> taking the fault and issuing the local TLBI - in the worst case it just leads to
>> another spurious fault.
>>
>> For this new case, we need to ensure we don't get migrated between reading
>> active_cpu and issuing the local TLBI, otherwise we would only issue a local
>> TLBI when a broadcast was required.
> 
> Sounds like those two users probably need separating out, then?

Ahh, I see; I'll admit I hadn't actually reviewed the new integration part. I
agree - NOBROADCAST is different to to this. This is an optimization for the
"not NOBROADCAST" case. We need to avoid disabling migration in the NOBROADCAST
case.

> 
>>>> +	self = smp_processor_id();
>>>> +
>>>> +	/*
>>>> +	 * The load of mm->context.active_cpu must not be reordered before the
>>>> +	 * store to the pgtable that necessitated this flush. This ensures that
>>>> +	 * if the value read is our cpu id, then no other cpu can have seen the
>>>> +	 * old pgtable value and therefore does not need this old value to be
>>>> +	 * flushed from its tlb. But we don't want to upgrade the dsb(ishst),
>>>> +	 * needed to make the pgtable updates visible to the walker, to a
>>>> +	 * dsb(ish) by default. So speculatively load without a barrier and if
>>>> +	 * it indicates our cpu id, then upgrade the barrier and re-load.
>>>> +	 */
>>>> +	active = READ_ONCE(mm->context.active_cpu);
>>>> +	if (active == self) {
>>>> +		dsb(ish);
>>>> +		active = READ_ONCE(mm->context.active_cpu);
>>>> +	} else {
>>>> +		dsb(ishst);
>>>> +	}
>>>
>>> Why can't you just do:
>>>
>>> 	dsb(ishst);
>>> 	active = READ_ONCE(mm->context.active_cpu);
>>>
>>> ?
>>
>> Prior to this optimization, we always issued a dsb(ishst) here. Catalin
>> suggested the same simplification against the RFC. I believe Linu tried it but
>> saw regressions; Hopefully Linu can provide the details.
> 
> I don't follow...
> 
> The old code always did dsb(ishst). The proposed code here does either
> dsb(ish) or dsb(ishst). How can that possibly be faster?

Ugh, sorry - I read your suggestion as unconditionally issuing a dsb(ish).

Ignore my previous answer, and now I'll demonstrate my total lack of
understanding of barriers instead...

As the comment says, "The load of mm->context.active_cpu must not be reordered
before the store to the pgtable that necessitated this flush". I thought that a
dsb(ishst) would only provide ordering between stores. Don't we need the
dsb(ish) to prevent the load from being reordered before the store?

> 
>>>> +	local = active == self;
>>>> +	if (!local)
>>>> +		migrate_enable();
>>>> +
>>>> +	return local;
>>>> +}
>>>> +
>>>> +static inline void flush_tlb_user_post(bool local)
>>>> +{
>>>> +	if (local)
>>>> +		migrate_enable();
>>>> +}
>>>
>>> I was under the impression that disabling/enabling migration was an
>>> expensive thing to do, so I'd really want to see some more numbers to
>>> justify this (including from inside a VM) and allow us to consider the
>>> trade-offs properly. It's also not at all clear to me that it's safe
>>> from such a low-level TLB invalidation helper.
>>
>> I had assumed it wasn't very expensive, but perhaps I'm wrong. I know
>> preempt_enable() can be expensive because it has to test to see if it needs to
>> reschedule. But I assumed for disabling/enabling migration, it would just be a
>> counter and the scheduler would check that it's zero before considing moving the
>> task to another run queue. (But I have practically zero understanding of the
>> scheduler so I'll assume I'm wrong...).
> 
> I'm not an expert here either, but reading the code shows that it has
> a preempt guard along with additional book-keeping.
> 
>> Instead of disabling migration, perhaps we could re-check active_cpu after
>> issuing the local tlbi - if it's now reporting "multiple" we must have been
>> migrated and we need to upgrade to a braodcast TLBI?
> 
> That's an interesting idea, although I suppose it means the
> post-invalidation DSB needs to be ISH for the local case to check the
> active_cpu safely?

Another idea could be to use PeterZ's "kernel mode restartable sequence" thingy,
then we can detect migration and retry? There a thread briefly talking about
doing something similar to avoid disabling preemption in the this_cpu_ ops, but
not sure if it went anywhere.

> 
>>>>   *	TLB Invalidation
>>>>   *	================
>>>> @@ -408,12 +482,20 @@ static inline void flush_tlb_all(void)
>>>>  static inline void flush_tlb_mm(struct mm_struct *mm)
>>>>  {
>>>>  	unsigned long asid;
>>>> +	bool local;
>>>>  
>>>> -	dsb(ishst);
>>>> +	local = flush_tlb_user_pre(mm, TLBF_NONE);
>>>>  	asid = __TLBI_VADDR(0, ASID(mm));
>>>> -	__tlbi(aside1is, asid);
>>>> -	__tlbi_user(aside1is, asid);
>>>> -	__tlbi_sync_s1ish(mm);
>>>> +	if (local) {
>>>> +		__tlbi(aside1, asid);
>>>> +		__tlbi_user(aside1, asid);
>>>> +		dsb(nsh);
>>>> +	} else {
>>>> +		__tlbi(aside1is, asid);
>>>> +		__tlbi_user(aside1is, asid);
>>>> +		__tlbi_sync_s1ish(mm);
>>>> +	}
>>>> +	flush_tlb_user_post(local);
>>>
>>> I think you've changed this since Ryan's original patch, but why are you
>>> only calling __tlbi_sync_s1ish() for the !local case? Doesn't that break
>>> the erratum workaround when running as a VM if the vCPU is migrated?
>>
>> Hmm. So from the guest kernel's perspective, it has concluded that it only needs
>> to target the local (v)CPU. Since the errata only affect boardcast TLBIs, it
>> concludes there is no need to issue the workarounds. But since it's a VM, the HW
>> will upgrade the local TLBIs to broadcast TLBIs, but will not magically
>> re-instate the workarounds. I guess the simplest solution would be to disable
>> the optimization when either workaround is enabled.
> 
> That's what I was thinking, but Mark seems to think it's ok. I'll reply
> to him on the other part of the thread.
> 
>> Perhaps this is all getting a bit too complex for not enough benefit...
> 
> I don't think the complexity is unmanageable, but I'm not yet convinced
> that this offers any real benefit overall.

I'll talk with Linu and see if we can present a clearer view.

Thanks,
Ryan


> 
> Will



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-06-15 15:41 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-23 13:47 [PATCH v2] arm64: tlbflush: Don't broadcast if mm was only active on local cpu Linu Cherian
2026-06-14 11:04 ` Will Deacon
2026-06-14 11:33   ` Will Deacon
2026-06-15 11:21   ` Ryan Roberts
2026-06-15 14:43     ` Will Deacon
2026-06-15 15:41       ` Ryan Roberts
2026-06-15 12:39   ` Mark Rutland
2026-06-15 14:44     ` Will Deacon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox