[PATCH 0/2] arm64: tlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI

* [PATCH 0/2] arm64: tlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI
From: Mark Rutland @ 2026-02-18 16:43 UTC
  To: linux-arm-kernel
  Cc: catalin.marinas, mark.rutland, maz, oupton, ryan.roberts, will

Hi all,

Some Arm partners have complained that the overhead of
ARM64_WORKAROUND_REPEAT_TLBI is too large, and despite the relevant
errata being categorized as "rare", they still want to use the
workaround in some deployments.

For historical reasons, the current workaround is far stronger (and
consequently far more expensive) than necessary. In part, the SDENs had
somewhat misleading descriptions, which have recently been clarified:

* Arm Cortex-A76 erratum #1286807
  SDEN v33: https://developer.arm.com/documentation/SDEN-885749/33-0/

* Arm Cortex-A55 erratum #2441007
  SDEN v16: https://developer.arm.com/documentation/SDEN-859338/1600/

* Arm Cortex-A510 erratum #2441009
  SDEN v19: https://developer.arm.com/documentation/SDEN-1873351/1900/

Patch 1 allows the __TLBI*() helpers to generate XZR as an argument.
I've split this out as its own patch to make bisection easier in case we
see any problems due to incorrect trap+emulation handling of XZR.
Otherwise this shouldn't have any functional change.

Patch 2 is the actual optimization, spelled out in detail in the commit
message. The gist is that it's not necessary to duplicate every
individual TLBI, and it's sufficient to have a single arbitrary TLBI;DSB
after any number of batched TLBIs;DSB.

As mentioned in the commit message for patch 2, this results in fewer
alternatives and better code generation whenever
ARM64_WORKAROUND_REPEAT_TLBI is built into the kernel, so it's a
(trivial) win on hardware that isn't affected by the relevant errata.

Mark.

Mark Rutland (2):
  arm64: tlb: Allow XZR argument to TLBI ops
  arm64: tlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI

 arch/arm64/include/asm/tlbflush.h | 63 ++++++++++++++++++-------------
 arch/arm64/kernel/sys_compat.c    |  2 +-
 arch/arm64/kvm/hyp/nvhe/mm.c      |  2 +-
 arch/arm64/kvm/hyp/nvhe/tlb.c     |  8 ++--
 arch/arm64/kvm/hyp/pgtable.c      |  2 +-
 arch/arm64/kvm/hyp/vhe/tlb.c      | 10 ++---
 6 files changed, 49 insertions(+), 38 deletions(-)

-- 
2.30.2

* [PATCH 1/2] arm64: tlb: Allow XZR argument to TLBI ops
From: Mark Rutland @ 2026-02-18 16:43 UTC
  To: linux-arm-kernel
  Cc: catalin.marinas, mark.rutland, maz, oupton, ryan.roberts, will

The TLBI instruction accepts XZR as a register argument, and for TLBI
operations with a register argument, there is no functional difference
between using XZR or another GPR which contains zeroes. Operations
without a register argument are encoded as if XZR were used.

Allow the __TLBI_1() macro to use XZR when a register argument is all
zeroes.

Today this only results in a trivial code saving in
__do_compat_cache_op()'s workaround for Neoverse-N1 erratum #1542419. In
subsequent patches this pattern will be used more generally.

There should be no functional change as a result of this patch.
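
As a rough userspace illustration of the operand change (a sketch, not the
kernel macro): the "rZ" constraint lets the compiler satisfy the operand with
either a register or the constant zero, and the %x modifier prints the operand
as a 64-bit register name, so a constant zero can be emitted as XZR. TLBI is
privileged, so the hypothetical pass_through() helper below uses MOV instead,
and it falls back to plain C when not built for arm64.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): copy 'val' through a register
 * using the same operand style as the patched __TLBI_1().
 */
static inline uint64_t pass_through(uint64_t val)
{
#if defined(__aarch64__)
	uint64_t out;

	/* "rZ": register or constant zero; "%x1": 64-bit name, XZR for zero. */
	__asm__ ("mov %x0, %x1" : "=r" (out) : "rZ" (val));
	return out;
#else
	/* Assumption: no arm64 toolchain available, so model the data flow only. */
	return val;
#endif
}

/* Usage: the value is unchanged whether or not XZR was chosen. */
static inline void pass_through_selftest(void)
{
	assert(pass_through(0) == 0);
	assert(pass_through(0x1234) == 0x1234);
}
```

When built for arm64 with optimization, a call such as pass_through(0) is
expected to let the compiler pick the "Z" alternative, but that choice is up to
the compiler and is not checked here.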

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oupton@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/tlbflush.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index a2d65d7d6aaeb..bf1cc9949dc87 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -38,12 +38,12 @@
 			    : : )
 
 #define __TLBI_1(op, arg) asm (ARM64_ASM_PREAMBLE			       \
-			       "tlbi " #op ", %0\n"			       \
+			       "tlbi " #op ", %x0\n"			       \
 		   ALTERNATIVE("nop\n			nop",		       \
-			       "dsb ish\n		tlbi " #op ", %0",     \
+			       "dsb ish\n		tlbi " #op ", %x0",    \
 			       ARM64_WORKAROUND_REPEAT_TLBI,		       \
 			       CONFIG_ARM64_WORKAROUND_REPEAT_TLBI)	       \
-			    : : "r" (arg))
+			    : : "rZ" (arg))
 
 #define __TLBI_N(op, arg, n, ...) __TLBI_##n(op, arg)
 
-- 
2.30.2


* [PATCH 2/2] arm64: tlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI
From: Mark Rutland @ 2026-02-18 16:43 UTC
  To: linux-arm-kernel
  Cc: catalin.marinas, mark.rutland, maz, oupton, ryan.roberts, will

The ARM64_WORKAROUND_REPEAT_TLBI workaround is used to mitigate several
errata where broadcast TLBI;DSB sequences don't provide all the
architecturally required synchronization. The workaround performs more
work than necessary, and can have significant overhead. This patch
optimizes the workaround, as explained below.

The workaround was originally added for Qualcomm Falkor erratum 1009 in
commit:

  d9ff80f83ecb ("arm64: Work around Falkor erratum 1009")

As noted in the message for that commit, the workaround is applied even
in cases where it is not strictly necessary.

The workaround was later reused without changes for:

* Arm Cortex-A76 erratum #1286807
  SDEN v33: https://developer.arm.com/documentation/SDEN-885749/33-0/

* Arm Cortex-A55 erratum #2441007
  SDEN v16: https://developer.arm.com/documentation/SDEN-859338/1600/

* Arm Cortex-A510 erratum #2441009
  SDEN v19: https://developer.arm.com/documentation/SDEN-1873351/1900/

The important details to note are as follows:

1. All relevant errata only affect the ordering and/or completion of
   memory accesses which have been translated by an invalidated TLB
   entry. The actual invalidation of TLB entries is unaffected.

2. The existing workaround is applied to both broadcast and local TLB
   invalidation, whereas for all relevant errata it is only necessary to
   apply a workaround for broadcast invalidation.

3. The existing workaround replaces every TLBI with a TLBI;DSB;TLBI
   sequence, whereas for all relevant errata it is only necessary to
   execute a single additional TLBI;DSB sequence after any number of
   TLBIs are completed by a DSB.

   For example, for a sequence of batched TLBIs:

       TLBI <op1>[, <arg1>]
       TLBI <op2>[, <arg2>]
       TLBI <op3>[, <arg3>]
       DSB ISH

   ... the existing workaround will expand this to:

       TLBI <op1>[, <arg1>]
       DSB ISH                  // additional
       TLBI <op1>[, <arg1>]     // additional
       TLBI <op2>[, <arg2>]
       DSB ISH                  // additional
       TLBI <op2>[, <arg2>]     // additional
       TLBI <op3>[, <arg3>]
       DSB ISH                  // additional
       TLBI <op3>[, <arg3>]     // additional
       DSB ISH

   ... whereas it is sufficient to have:

       TLBI <op1>[, <arg1>]
       TLBI <op2>[, <arg2>]
       TLBI <op3>[, <arg3>]
       DSB ISH
       TLBI <opX>[, <argX>]     // additional
       DSB ISH                  // additional

   Using a single additional TLBI and DSB at the end of the sequence can
   have significantly lower overhead as each DSB which completes a TLBI
   must synchronize with other PEs in the system, with potential
   performance effects both locally and system-wide.

4. The existing workaround repeats each specific TLBI operation, whereas
   for all relevant errata it is sufficient for the additional TLBI to
   use *any* operation which will be broadcast, regardless of which
   translation regime or stage of translation the operation applies to.

   For example, for a single TLBI:

       TLBI ALLE2IS
       DSB ISH

   ... the existing workaround will expand this to:

       TLBI ALLE2IS
       DSB ISH
       TLBI ALLE2IS             // additional
       DSB ISH                  // additional

   ... whereas it is sufficient to have:

       TLBI ALLE2IS
       DSB ISH
       TLBI VALE1IS, XZR        // additional
       DSB ISH                  // additional

   As the additional TLBI doesn't have to match a specific earlier TLBI,
   the additional TLBI can be implemented in separate code, with no
   memory of the earlier TLBIs. The additional TLBI can also use a
   cheaper TLBI operation.

5. The existing workaround is applied to both Stage-1 and Stage-2 TLB
   invalidation, whereas for all relevant errata it is only necessary to
   apply a workaround for Stage-1 invalidation.

   Architecturally, TLBI operations which invalidate only Stage-2
   information (e.g. IPAS2E1IS) are not required to invalidate TLB
   entries which combine information from Stage-1 and Stage-2
   translation table entries, and consequently may not complete memory
   accesses translated by those combined entries. In these cases,
   completion of memory accesses is only guaranteed after subsequent
   invalidation of Stage-1 information (e.g. VMALLE1IS).
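
The instruction counts implied by point 3 can be sketched as a small userspace
model. This is illustrative only: the struct and function names are
hypothetical, and the model counts instructions rather than measuring their
cost on real hardware.

```c
#include <assert.h>

/* Instructions issued for one batch of n TLBIs, including the final DSB. */
struct tlbi_cost {
	unsigned int tlbis;
	unsigned int dsbs;
};

/* No workaround: n TLBIs completed by a single DSB. */
struct tlbi_cost cost_baseline(unsigned int n)
{
	struct tlbi_cost c = { .tlbis = n, .dsbs = 1 };
	return c;
}

/* Existing workaround: each TLBI becomes TLBI;DSB;TLBI, then the final DSB. */
struct tlbi_cost cost_repeat_each(unsigned int n)
{
	struct tlbi_cost c = { .tlbis = 2 * n, .dsbs = n + 1 };
	return c;
}

/* Reworked workaround: n TLBIs, DSB, one additional TLBI, DSB. */
struct tlbi_cost cost_repeat_once(unsigned int n)
{
	struct tlbi_cost c = { .tlbis = n + 1, .dsbs = 2 };
	return c;
}

/* Usage: the three-TLBI batch from the example in point 3. */
void tlbi_cost_selftest(void)
{
	assert(cost_repeat_each(3).tlbis == 6 && cost_repeat_each(3).dsbs == 4);
	assert(cost_repeat_once(3).tlbis == 4 && cost_repeat_once(3).dsbs == 2);
}
```

In this model the number of DSBs grows with the batch size under the existing
workaround, and stays at two under the reworked one.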

Taking the above points into account, this patch reworks the workaround
logic to reduce overhead:

* New __tlbi_sync_s1ish() and __tlbi_sync_s1ish_hyp() functions are
  added and used in place of any dsb(ish) which is used to complete
  broadcast Stage-1 TLB maintenance. When the
  ARM64_WORKAROUND_REPEAT_TLBI workaround is enabled, these helpers will
  execute an additional TLBI;DSB sequence.

  For consistency, it might make sense to add __tlbi_sync_*() helpers
  for local and Stage-2 maintenance. For now I've left those with
  open-coded dsb() to keep the diff small.

* The duplication of TLBIs in __TLBI_0() and __TLBI_1() is removed. This
  is no longer needed as the necessary synchronization will happen in
  __tlbi_sync_s1ish() or __tlbi_sync_s1ish_hyp().

* The additional TLBI operation is chosen to have minimal impact:

  - __tlbi_sync_s1ish() uses "TLBI VALE1IS, XZR". This is only used at
    EL1 or at EL2 with {E2H,TGE}=={1,1}, where it will target an unused
    entry for the reserved ASID in the kernel's own translation regime,
    and have no adverse effect.

  - __tlbi_sync_s1ish_hyp() uses "TLBI VALE2IS, XZR". This is only used
    in hyp code, where it will target an unused entry in the hyp code's
    TTBR0 mapping, and should have no adverse effect.

* As __TLBI_0() and __TLBI_1() no longer replace each TLBI with a
  TLBI;DSB;TLBI sequence, batching TLBIs is worthwhile, and there's no
  need for arch_tlbbatch_should_defer() to consider
  ARM64_WORKAROUND_REPEAT_TLBI.
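
The resulting instruction sequence can be sketched with a userspace model that
records instructions in a trace instead of executing them. The model_*() names,
emit() and the repeat_tlbi_cap flag are hypothetical stand-ins: the real
helpers use dsb(), __tlbi() and alternative_has_cap_unlikely(), as shown in the
diff below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static char trace[256];
static bool repeat_tlbi_cap;	/* models ARM64_WORKAROUND_REPEAT_TLBI being enabled */

/* Append one "instruction" to the trace, separated by "; ". */
static void emit(const char *insn)
{
	size_t len = strlen(trace);

	snprintf(trace + len, sizeof(trace) - len, "%s%s", len ? "; " : "", insn);
}

/* Models __tlbi_sync_s1ish(): DSB, then one extra TLBI;DSB when affected. */
static void model_tlbi_sync_s1ish(void)
{
	emit("dsb ish");
	if (!repeat_tlbi_cap)
		return;
	emit("tlbi vale1is, xzr");
	emit("dsb ish");
}

/* Models flush_tlb_all() after the patch and returns the recorded trace. */
const char *model_flush_tlb_all(bool affected)
{
	trace[0] = '\0';
	repeat_tlbi_cap = affected;

	emit("dsb ishst");
	emit("tlbi vmalle1is");
	model_tlbi_sync_s1ish();
	emit("isb");

	return trace;
}

/* Usage: unaffected hardware sees only the original sequence. */
void model_selftest(void)
{
	assert(!strcmp(model_flush_tlb_all(false),
		       "dsb ishst; tlbi vmalle1is; dsb ish; isb"));
}
```

On affected hardware the model appends exactly one "tlbi vale1is, xzr; dsb ish"
pair after the first DSB, however many TLBIs preceded it.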

When building defconfig with GCC 15.1.0, compared to v6.19-rc1, this
patch saves ~1KiB of text, makes the vmlinux ~42KiB smaller, and makes
the resulting Image 64KiB smaller:

| [mark@lakrids:~/src/linux]% size vmlinux-*
|    text    data     bss     dec     hex filename
| 21179831        19660919         708216 41548966        279fca6 vmlinux-after
| 21181075        19660903         708216 41550194        27a0172 vmlinux-before
| [mark@lakrids:~/src/linux]% ls -l vmlinux-*
| -rwxr-xr-x 1 mark mark 157771472 Feb  4 12:05 vmlinux-after
| -rwxr-xr-x 1 mark mark 157815432 Feb  4 12:05 vmlinux-before
| [mark@lakrids:~/src/linux]% ls -l Image-*
| -rw-r--r-- 1 mark mark 41007616 Feb  4 12:05 Image-after
| -rw-r--r-- 1 mark mark 41073152 Feb  4 12:05 Image-before
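
The quoted savings follow from the listings above by subtraction. A minimal
check, with the byte counts copied from that output (saved_bytes() is a
hypothetical helper used only for this arithmetic):

```c
#include <assert.h>

/* Bytes saved going from the "before" build to the "after" build. */
long saved_bytes(long before, long after)
{
	return before - after;
}

/* Usage: text, vmlinux and Image sizes from the size/ls output. */
void saved_bytes_selftest(void)
{
	assert(saved_bytes(21181075, 21179831) == 1244);	/* text: ~1.2KiB */
	assert(saved_bytes(157815432, 157771472) == 43960);	/* vmlinux: ~42.9KiB */
	assert(saved_bytes(41073152, 41007616) == 65536);	/* Image: exactly 64KiB */
}
```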

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oupton@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/tlbflush.h | 59 ++++++++++++++++++-------------
 arch/arm64/kernel/sys_compat.c    |  2 +-
 arch/arm64/kvm/hyp/nvhe/mm.c      |  2 +-
 arch/arm64/kvm/hyp/nvhe/tlb.c     |  8 ++---
 arch/arm64/kvm/hyp/pgtable.c      |  2 +-
 arch/arm64/kvm/hyp/vhe/tlb.c      | 10 +++---
 6 files changed, 47 insertions(+), 36 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index bf1cc9949dc87..1416e652612b7 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -31,18 +31,10 @@
  */
 #define __TLBI_0(op, arg) asm (ARM64_ASM_PREAMBLE			       \
 			       "tlbi " #op "\n"				       \
-		   ALTERNATIVE("nop\n			nop",		       \
-			       "dsb ish\n		tlbi " #op,	       \
-			       ARM64_WORKAROUND_REPEAT_TLBI,		       \
-			       CONFIG_ARM64_WORKAROUND_REPEAT_TLBI)	       \
 			    : : )
 
 #define __TLBI_1(op, arg) asm (ARM64_ASM_PREAMBLE			       \
 			       "tlbi " #op ", %x0\n"			       \
-		   ALTERNATIVE("nop\n			nop",		       \
-			       "dsb ish\n		tlbi " #op ", %x0",    \
-			       ARM64_WORKAROUND_REPEAT_TLBI,		       \
-			       CONFIG_ARM64_WORKAROUND_REPEAT_TLBI)	       \
 			    : : "rZ" (arg))
 
 #define __TLBI_N(op, arg, n, ...) __TLBI_##n(op, arg)
@@ -181,6 +173,34 @@ static inline unsigned long get_trans_granule(void)
 		(__pages >> (5 * (scale) + 1)) - 1;			\
 	})
 
+#define __repeat_tlbi_sync(op, arg...)						\
+do {										\
+	if (!alternative_has_cap_unlikely(ARM64_WORKAROUND_REPEAT_TLBI))	\
+		break;								\
+	__tlbi(op, ##arg);							\
+	dsb(ish);								\
+} while (0)
+
+/*
+ * Complete broadcast TLB maintenance issued by the host which invalidates
+ * stage 1 information in the host's own translation regime.
+ */
+static inline void __tlbi_sync_s1ish(void)
+{
+	dsb(ish);
+	__repeat_tlbi_sync(vale1is, 0);
+}
+
+/*
+ * Complete broadcast TLB maintenance issued by hyp code which invalidates
+ * stage 1 translation information in any translation regime.
+ */
+static inline void __tlbi_sync_s1ish_hyp(void)
+{
+	dsb(ish);
+	__repeat_tlbi_sync(vale2is, 0);
+}
+
 /*
  *	TLB Invalidation
  *	================
@@ -279,7 +299,7 @@ static inline void flush_tlb_all(void)
 {
 	dsb(ishst);
 	__tlbi(vmalle1is);
-	dsb(ish);
+	__tlbi_sync_s1ish();
 	isb();
 }
 
@@ -291,7 +311,7 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 	asid = __TLBI_VADDR(0, ASID(mm));
 	__tlbi(aside1is, asid);
 	__tlbi_user(aside1is, asid);
-	dsb(ish);
+	__tlbi_sync_s1ish();
 	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
 }
 
@@ -345,20 +365,11 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 				  unsigned long uaddr)
 {
 	flush_tlb_page_nosync(vma, uaddr);
-	dsb(ish);
+	__tlbi_sync_s1ish();
 }
 
 static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
 {
-	/*
-	 * TLB flush deferral is not required on systems which are affected by
-	 * ARM64_WORKAROUND_REPEAT_TLBI, as __tlbi()/__tlbi_user() implementation
-	 * will have two consecutive TLBI instructions with a dsb(ish) in between
-	 * defeating the purpose (i.e save overall 'dsb ish' cost).
-	 */
-	if (alternative_has_cap_unlikely(ARM64_WORKAROUND_REPEAT_TLBI))
-		return false;
-
 	return true;
 }
 
@@ -374,7 +385,7 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
  */
 static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
-	dsb(ish);
+	__tlbi_sync_s1ish();
 }
 
 /*
@@ -509,7 +520,7 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 {
 	__flush_tlb_range_nosync(vma->vm_mm, start, end, stride,
 				 last_level, tlb_level);
-	dsb(ish);
+	__tlbi_sync_s1ish();
 }
 
 static inline void local_flush_tlb_contpte(struct vm_area_struct *vma,
@@ -557,7 +568,7 @@ static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end
 	dsb(ishst);
 	__flush_tlb_range_op(vaale1is, start, pages, stride, 0,
 			     TLBI_TTL_UNKNOWN, false, lpa2_is_enabled());
-	dsb(ish);
+	__tlbi_sync_s1ish();
 	isb();
 }
 
@@ -571,7 +582,7 @@ static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
 
 	dsb(ishst);
 	__tlbi(vaae1is, addr);
-	dsb(ish);
+	__tlbi_sync_s1ish();
 	isb();
 }
 
diff --git a/arch/arm64/kernel/sys_compat.c b/arch/arm64/kernel/sys_compat.c
index 4a609e9b65de0..b9d4998c97efa 100644
--- a/arch/arm64/kernel/sys_compat.c
+++ b/arch/arm64/kernel/sys_compat.c
@@ -37,7 +37,7 @@ __do_compat_cache_op(unsigned long start, unsigned long end)
 			 * We pick the reserved-ASID to minimise the impact.
 			 */
 			__tlbi(aside1is, __TLBI_VADDR(0, 0));
-			dsb(ish);
+			__tlbi_sync_s1ish();
 		}
 
 		ret = caches_clean_inval_user_pou(start, start + chunk);
diff --git a/arch/arm64/kvm/hyp/nvhe/mm.c b/arch/arm64/kvm/hyp/nvhe/mm.c
index ae8391baebc30..218976287d3fe 100644
--- a/arch/arm64/kvm/hyp/nvhe/mm.c
+++ b/arch/arm64/kvm/hyp/nvhe/mm.c
@@ -271,7 +271,7 @@ static void fixmap_clear_slot(struct hyp_fixmap_slot *slot)
 	 */
 	dsb(ishst);
 	__tlbi_level(vale2is, __TLBI_VADDR(addr, 0), level);
-	dsb(ish);
+	__tlbi_sync_s1ish_hyp();
 	isb();
 }
 
diff --git a/arch/arm64/kvm/hyp/nvhe/tlb.c b/arch/arm64/kvm/hyp/nvhe/tlb.c
index 48da9ca9763f6..3dc1ce0d27fe6 100644
--- a/arch/arm64/kvm/hyp/nvhe/tlb.c
+++ b/arch/arm64/kvm/hyp/nvhe/tlb.c
@@ -169,7 +169,7 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
 	 */
 	dsb(ish);
 	__tlbi(vmalle1is);
-	dsb(ish);
+	__tlbi_sync_s1ish_hyp();
 	isb();
 
 	exit_vmid_context(&cxt);
@@ -226,7 +226,7 @@ void __kvm_tlb_flush_vmid_range(struct kvm_s2_mmu *mmu,
 
 	dsb(ish);
 	__tlbi(vmalle1is);
-	dsb(ish);
+	__tlbi_sync_s1ish_hyp();
 	isb();
 
 	exit_vmid_context(&cxt);
@@ -240,7 +240,7 @@ void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu)
 	enter_vmid_context(mmu, &cxt, false);
 
 	__tlbi(vmalls12e1is);
-	dsb(ish);
+	__tlbi_sync_s1ish_hyp();
 	isb();
 
 	exit_vmid_context(&cxt);
@@ -266,5 +266,5 @@ void __kvm_flush_vm_context(void)
 	/* Same remark as in enter_vmid_context() */
 	dsb(ish);
 	__tlbi(alle1is);
-	dsb(ish);
+	__tlbi_sync_s1ish_hyp();
 }
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 947ac1a951a5b..da8f8d4c4d5da 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -483,7 +483,7 @@ static int hyp_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
 		*unmapped += granule;
 	}
 
-	dsb(ish);
+	__tlbi_sync_s1ish_hyp();
 	isb();
 	mm_ops->put_page(ctx->ptep);
 
diff --git a/arch/arm64/kvm/hyp/vhe/tlb.c b/arch/arm64/kvm/hyp/vhe/tlb.c
index ec25698186297..35855dadfb1b3 100644
--- a/arch/arm64/kvm/hyp/vhe/tlb.c
+++ b/arch/arm64/kvm/hyp/vhe/tlb.c
@@ -115,7 +115,7 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
 	 */
 	dsb(ish);
 	__tlbi(vmalle1is);
-	dsb(ish);
+	__tlbi_sync_s1ish_hyp();
 	isb();
 
 	exit_vmid_context(&cxt);
@@ -176,7 +176,7 @@ void __kvm_tlb_flush_vmid_range(struct kvm_s2_mmu *mmu,
 
 	dsb(ish);
 	__tlbi(vmalle1is);
-	dsb(ish);
+	__tlbi_sync_s1ish_hyp();
 	isb();
 
 	exit_vmid_context(&cxt);
@@ -192,7 +192,7 @@ void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu)
 	enter_vmid_context(mmu, &cxt);
 
 	__tlbi(vmalls12e1is);
-	dsb(ish);
+	__tlbi_sync_s1ish_hyp();
 	isb();
 
 	exit_vmid_context(&cxt);
@@ -217,7 +217,7 @@ void __kvm_flush_vm_context(void)
 {
 	dsb(ishst);
 	__tlbi(alle1is);
-	dsb(ish);
+	__tlbi_sync_s1ish_hyp();
 }
 
 /*
@@ -358,7 +358,7 @@ int __kvm_tlbi_s1e2(struct kvm_s2_mmu *mmu, u64 va, u64 sys_encoding)
 	default:
 		ret = -EINVAL;
 	}
-	dsb(ish);
+	__tlbi_sync_s1ish_hyp();
 	isb();
 
 	if (mmu)
-- 
2.30.2


* Re: [PATCH 0/2] arm64: tlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI
From: Will Deacon @ 2026-02-26  0:06 UTC
  To: linux-arm-kernel, Mark Rutland
  Cc: catalin.marinas, kernel-team, Will Deacon, maz, oupton,
	ryan.roberts

On Wed, 18 Feb 2026 16:43:46 +0000, Mark Rutland wrote:
> Some Arm partners have complained that the overhead of
> ARM64_WORKAROUND_REPEAT_TLBI is too large, and despite the relevant
> errata being categorized as "rare", they still want to use the
> workaround in some deployments.
> 
> For historical reasons, the current workaround is far stronger (and
> consequently far more expensive) than necessary. In part, the SDENs had
> somewhat misleading descriptions, which have recently been clarified:
> 
> [...]

Applied to arm64 (for-next/fixes), thanks!

[1/2] arm64: tlb: Allow XZR argument to TLBI ops
      https://git.kernel.org/arm64/c/bfd9c931d19a
[2/2] arm64: tlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI
      https://git.kernel.org/arm64/c/a8f78680ee6b

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev


* Re: [PATCH 0/2] arm64: tlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI
From: Marc Zyngier @ 2026-02-26  8:58 UTC
  To: Mark Rutland
  Cc: linux-arm-kernel, catalin.marinas, oupton, ryan.roberts, will

On Wed, 18 Feb 2026 16:43:46 +0000,
Mark Rutland <mark.rutland@arm.com> wrote:
> [...]

A bit late, but FTR,

Reviewed-by: Marc Zyngier <maz@kernel.org>

	M.

-- 
Without deviation from the norm, progress is not possible.