* [PATCH 0/2] arm64: tlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI
@ 2026-02-18 16:43 Mark Rutland
2026-02-18 16:43 ` [PATCH 1/2] arm64: tlb: Allow XZR argument to TLBI ops Mark Rutland
` (3 more replies)
0 siblings, 4 replies; 5+ messages in thread
From: Mark Rutland @ 2026-02-18 16:43 UTC
To: linux-arm-kernel
Cc: catalin.marinas, mark.rutland, maz, oupton, ryan.roberts, will
Hi all,
Some Arm partners have complained that the overhead of
ARM64_WORKAROUND_REPEAT_TLBI is too large, and despite the relevant
errata being categorized as "rare", they still want to use the
workaround in some deployments.
For historical reasons, the current workaround is far stronger (and
consequently far more expensive) than necessary. In part, the SDENs had
somewhat misleading descriptions, which have recently been clarified:
* Arm Cortex-A76 erratum #1286807
SDEN v33: https://developer.arm.com/documentation/SDEN-885749/33-0/
* Arm Cortex-A55 erratum #2441007
SDEN v16: https://developer.arm.com/documentation/SDEN-859338/1600/
* Arm Cortex-A510 erratum #2441009
SDEN v19: https://developer.arm.com/documentation/SDEN-1873351/1900/
Patch 1 allows the __TLBI*() helpers to generate XZR as an argument.
I've split this out as its own patch to make bisection easier in case we
see any problems due to incorrect trap+emulation handling of XZR.
Otherwise this shouldn't have any functional change.
Patch 2 is the actual optimization, spelled out in detail in the commit
message. The gist is that it's not necessary to duplicate every
individual TLBI, and it's sufficient to have a single arbitrary TLBI;DSB
after any number of batched TLBIs;DSB.
As mentioned in the commit message for patch 2, this results in fewer
alternatives and better code generation whenever
ARM64_WORKAROUND_REPEAT_TLBI is built into the kernel, so it's a
(trivial) win on hardware that isn't affected by the relevant errata.
Mark.
Mark Rutland (2):
arm64: tlb: Allow XZR argument to TLBI ops
arm64: tlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI
arch/arm64/include/asm/tlbflush.h | 63 ++++++++++++++++++-------------
arch/arm64/kernel/sys_compat.c | 2 +-
arch/arm64/kvm/hyp/nvhe/mm.c | 2 +-
arch/arm64/kvm/hyp/nvhe/tlb.c | 8 ++--
arch/arm64/kvm/hyp/pgtable.c | 2 +-
arch/arm64/kvm/hyp/vhe/tlb.c | 10 ++---
6 files changed, 49 insertions(+), 38 deletions(-)
--
2.30.2
* [PATCH 1/2] arm64: tlb: Allow XZR argument to TLBI ops
From: Mark Rutland @ 2026-02-18 16:43 UTC
To: linux-arm-kernel
Cc: catalin.marinas, mark.rutland, maz, oupton, ryan.roberts, will
The TLBI instruction accepts XZR as a register argument, and for TLBI
operations with a register argument, there is no functional difference
between using XZR or another GPR which contains zeroes. Operations
without a register argument are encoded as if XZR were used.
Allow the __TLBI_1() macro to use XZR when a register argument is all
zeroes.
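As an illustration of the constraint change, here is a minimal userspace
sketch (not kernel code). It assumes GCC- or Clang-style inline asm. The
helper name copy_reg() is hypothetical, and a harmless MOV stands in for
TLBI, which cannot be executed from userspace. With the "rZ" constraint
and the %x operand modifier, the compiler may emit XZR when the operand
is a compile-time zero, and an ordinary X register otherwise. On
non-arm64 builds the sketch falls back to plain C.

```c
#include <stdint.h>

/*
 * Hypothetical helper: move 'val' through a register.
 *
 * "rZ" accepts either a general-purpose register or a constant zero.
 * The %x1 modifier prints the 64-bit register name, or "xzr" when the
 * operand was matched as a constant zero.
 */
static inline uint64_t copy_reg(uint64_t val)
{
#if defined(__aarch64__)
	uint64_t out;

	__asm__ ("mov %0, %x1" : "=r" (out) : "rZ" (val));
	return out;
#else
	/* Fallback so the sketch still builds on other architectures. */
	return val;
#endif
}
```

Whether XZR is actually chosen depends on the compiler seeing a constant
zero at the call site (e.g. after inlining), so the result is the same
either way and only the generated code differs.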
Today this only results in a trivial code saving in
__do_compat_cache_op()'s workaround for Neoverse-N1 erratum #1542419. In
subsequent patches this pattern will be used more generally.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oupton@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
---
arch/arm64/include/asm/tlbflush.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index a2d65d7d6aaeb..bf1cc9949dc87 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -38,12 +38,12 @@
: : )
#define __TLBI_1(op, arg) asm (ARM64_ASM_PREAMBLE \
- "tlbi " #op ", %0\n" \
+ "tlbi " #op ", %x0\n" \
ALTERNATIVE("nop\n nop", \
- "dsb ish\n tlbi " #op ", %0", \
+ "dsb ish\n tlbi " #op ", %x0", \
ARM64_WORKAROUND_REPEAT_TLBI, \
CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \
- : : "r" (arg))
+ : : "rZ" (arg))
#define __TLBI_N(op, arg, n, ...) __TLBI_##n(op, arg)
--
2.30.2
* [PATCH 2/2] arm64: tlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI
From: Mark Rutland @ 2026-02-18 16:43 UTC
To: linux-arm-kernel
Cc: catalin.marinas, mark.rutland, maz, oupton, ryan.roberts, will
The ARM64_WORKAROUND_REPEAT_TLBI workaround is used to mitigate several
errata where broadcast TLBI;DSB sequences don't provide all the
architecturally required synchronization. The workaround performs more
work than necessary, and can have significant overhead. This patch
optimizes the workaround, as explained below.
The workaround was originally added for Qualcomm Falkor erratum 1009 in
commit:
d9ff80f83ecb ("arm64: Work around Falkor erratum 1009")
As noted in the message for that commit, the workaround is applied even
in cases where it is not strictly necessary.
The workaround was later reused without changes for:
* Arm Cortex-A76 erratum #1286807
SDEN v33: https://developer.arm.com/documentation/SDEN-885749/33-0/
* Arm Cortex-A55 erratum #2441007
SDEN v16: https://developer.arm.com/documentation/SDEN-859338/1600/
* Arm Cortex-A510 erratum #2441009
SDEN v19: https://developer.arm.com/documentation/SDEN-1873351/1900/
The important details to note are as follows:
1. All relevant errata only affect the ordering and/or completion of
memory accesses which have been translated by an invalidated TLB
entry. The actual invalidation of TLB entries is unaffected.
2. The existing workaround is applied to both broadcast and local TLB
invalidation, whereas for all relevant errata it is only necessary to
apply a workaround for broadcast invalidation.
3. The existing workaround replaces every TLBI with a TLBI;DSB;TLBI
sequence, whereas for all relevant errata it is only necessary to
execute a single additional TLBI;DSB sequence after any number of
TLBIs are completed by a DSB.
For example, for a sequence of batched TLBIs:
TLBI <op1>[, <arg1>]
TLBI <op2>[, <arg2>]
TLBI <op3>[, <arg3>]
DSB ISH
... the existing workaround will expand this to:
TLBI <op1>[, <arg1>]
DSB ISH // additional
TLBI <op1>[, <arg1>] // additional
TLBI <op2>[, <arg2>]
DSB ISH // additional
TLBI <op2>[, <arg2>] // additional
TLBI <op3>[, <arg3>]
DSB ISH // additional
TLBI <op3>[, <arg3>] // additional
DSB ISH
... whereas it is sufficient to have:
TLBI <op1>[, <arg1>]
TLBI <op2>[, <arg2>]
TLBI <op3>[, <arg3>]
DSB ISH
TLBI <opX>[, <argX>] // additional
DSB ISH // additional
Using a single additional TLBI and DSB at the end of the sequence can
have significantly lower overhead as each DSB which completes a TLBI
must synchronize with other PEs in the system, with potential
performance effects both locally and system-wide.
4. The existing workaround repeats each specific TLBI operation, whereas
for all relevant errata it is sufficient for the additional TLBI to
use *any* operation which will be broadcast, regardless of which
translation regime or stage of translation the operation applies to.
For example, for a single TLBI:
TLBI ALLE2IS
DSB ISH
... the existing workaround will expand this to:
TLBI ALLE2IS
DSB ISH
TLBI ALLE2IS // additional
DSB ISH // additional
... whereas it is sufficient to have:
TLBI ALLE2IS
DSB ISH
TLBI VALE1IS, XZR // additional
DSB ISH // additional
As the additional TLBI doesn't have to match a specific earlier TLBI,
the additional TLBI can be implemented in separate code, with no
memory of the earlier TLBIs. The additional TLBI can also use a
cheaper TLBI operation.
5. The existing workaround is applied to both Stage-1 and Stage-2 TLB
invalidation, whereas for all relevant errata it is only necessary to
apply a workaround for Stage-1 invalidation.
Architecturally, TLBI operations which invalidate only Stage-2
information (e.g. IPAS2E1IS) are not required to invalidate TLB
entries which combine information from Stage-1 and Stage-2
translation table entries, and consequently may not complete memory
accesses translated by those combined entries. In these cases,
completion of memory accesses is only guaranteed after subsequent
invalidation of Stage-1 information (e.g. VMALLE1IS).
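To make the difference in point 3 concrete, the instruction counts for a
batch of n TLBIs can be tallied as in the sketch below. This is only a
counting model derived from the sequences shown above. The struct and
function names are hypothetical, and it assumes n >= 1.

```c
#include <stddef.h>

struct seq_cost {
	size_t tlbi;	/* TLBI instructions executed */
	size_t dsb;	/* DSB ISH instructions executed */
};

/* No workaround: n TLBIs completed by a single DSB. */
static struct seq_cost cost_baseline(size_t n)
{
	return (struct seq_cost){ .tlbi = n, .dsb = 1 };
}

/* Existing workaround: every TLBI becomes TLBI;DSB;TLBI, then the final DSB. */
static struct seq_cost cost_existing(size_t n)
{
	return (struct seq_cost){ .tlbi = 2 * n, .dsb = n + 1 };
}

/* Reworked workaround: the batch and its DSB, then one extra TLBI;DSB. */
static struct seq_cost cost_reworked(size_t n)
{
	return (struct seq_cost){ .tlbi = n + 1, .dsb = 2 };
}
```

For the three-TLBI example above this gives 6 TLBIs and 4 DSBs for the
existing workaround versus 4 TLBIs and 2 DSBs for the reworked one. The
number of DSBs no longer grows with the batch size.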
Taking the above points into account, this patch reworks the workaround
logic to reduce overhead:
* New __tlbi_sync_s1ish() and __tlbi_sync_s1ish_hyp() functions are
added and used in place of any dsb(ish) which is used to complete
broadcast Stage-1 TLB maintenance. When the
ARM64_WORKAROUND_REPEAT_TLBI workaround is enabled, these helpers will
execute an additional TLBI;DSB sequence.
For consistency, it might make sense to add __tlbi_sync_*() helpers
for local and stage 2 maintenance. For now I've left those with
open-coded dsb() to keep the diff small.
* The duplication of TLBIs in __TLBI_0() and __TLBI_1() is removed. This
is no longer needed as the necessary synchronization will happen in
__tlbi_sync_s1ish() or __tlbi_sync_s1ish_hyp().
* The additional TLBI operation is chosen to have minimal impact:
- __tlbi_sync_s1ish() uses "TLBI VALE1IS, XZR". This is only used at
EL1 or at EL2 with {E2H,TGE}=={1,1}, where it will target an unused
entry for the reserved ASID in the kernel's own translation regime,
and have no adverse effect.
- __tlbi_sync_s1ish_hyp() uses "TLBI VALE2IS, XZR". This is only used
in hyp code, where it will target an unused entry in the hyp code's
TTBR0 mapping, and should have no adverse effect.
* As __TLBI_0() and __TLBI_1() no longer replace each TLBI with a
TLBI;DSB;TLBI sequence, batching TLBIs is worthwhile, and there's no
need for arch_tlbbatch_should_defer() to consider
ARM64_WORKAROUND_REPEAT_TLBI.
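The control flow of the new sync helper can be modelled in userspace as
sketched below. This is not the kernel implementation: the mock_*()
names, the trace buffer and the boolean flag are all hypothetical
stand-ins, with the flag playing the role of the
ARM64_WORKAROUND_REPEAT_TLBI capability check. Each batched TLBI is
recorded as 'T', each DSB ISH as 'D', and the additional TLBI as 't'.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the ARM64_WORKAROUND_REPEAT_TLBI capability check. */
static bool repeat_tlbi_enabled;

/* Recorded instruction trace, always NUL-terminated. */
static char trace[64];
static size_t trace_len;

static void emit(char c)
{
	if (trace_len < sizeof(trace) - 1)
		trace[trace_len++] = c;
}

/* Models __tlbi_sync_s1ish(): DSB, then an optional extra TLBI;DSB. */
static void mock_tlbi_sync_s1ish(void)
{
	emit('D');
	if (!repeat_tlbi_enabled)
		return;
	emit('t');	/* models "TLBI VALE1IS, XZR" */
	emit('D');
}

/* Issue n batched TLBIs followed by one sync, and return the trace. */
static const char *mock_flush_batch(size_t n, bool workaround)
{
	repeat_tlbi_enabled = workaround;
	memset(trace, 0, sizeof(trace));
	trace_len = 0;

	for (size_t i = 0; i < n; i++)
		emit('T');
	mock_tlbi_sync_s1ish();
	return trace;
}
```

In this model a batch of three TLBIs yields "TTTD" with the workaround
disabled and "TTTDtD" with it enabled, matching the sequences shown
earlier.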
When building defconfig with GCC 15.1.0, compared to v6.19-rc1, this
patch saves ~1KiB of text, makes the vmlinux ~42KiB smaller, and makes
the resulting Image 64KiB smaller:
| [mark@lakrids:~/src/linux]% size vmlinux-*
| text data bss dec hex filename
| 21179831 19660919 708216 41548966 279fca6 vmlinux-after
| 21181075 19660903 708216 41550194 27a0172 vmlinux-before
| [mark@lakrids:~/src/linux]% ls -l vmlinux-*
| -rwxr-xr-x 1 mark mark 157771472 Feb 4 12:05 vmlinux-after
| -rwxr-xr-x 1 mark mark 157815432 Feb 4 12:05 vmlinux-before
| [mark@lakrids:~/src/linux]% ls -l Image-*
| -rw-r--r-- 1 mark mark 41007616 Feb 4 12:05 Image-after
| -rw-r--r-- 1 mark mark 41073152 Feb 4 12:05 Image-before
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oupton@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
---
arch/arm64/include/asm/tlbflush.h | 59 ++++++++++++++++++-------------
arch/arm64/kernel/sys_compat.c | 2 +-
arch/arm64/kvm/hyp/nvhe/mm.c | 2 +-
arch/arm64/kvm/hyp/nvhe/tlb.c | 8 ++---
arch/arm64/kvm/hyp/pgtable.c | 2 +-
arch/arm64/kvm/hyp/vhe/tlb.c | 10 +++---
6 files changed, 47 insertions(+), 36 deletions(-)
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index bf1cc9949dc87..1416e652612b7 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -31,18 +31,10 @@
*/
#define __TLBI_0(op, arg) asm (ARM64_ASM_PREAMBLE \
"tlbi " #op "\n" \
- ALTERNATIVE("nop\n nop", \
- "dsb ish\n tlbi " #op, \
- ARM64_WORKAROUND_REPEAT_TLBI, \
- CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \
: : )
#define __TLBI_1(op, arg) asm (ARM64_ASM_PREAMBLE \
"tlbi " #op ", %x0\n" \
- ALTERNATIVE("nop\n nop", \
- "dsb ish\n tlbi " #op ", %x0", \
- ARM64_WORKAROUND_REPEAT_TLBI, \
- CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \
: : "rZ" (arg))
#define __TLBI_N(op, arg, n, ...) __TLBI_##n(op, arg)
@@ -181,6 +173,34 @@ static inline unsigned long get_trans_granule(void)
(__pages >> (5 * (scale) + 1)) - 1; \
})
+#define __repeat_tlbi_sync(op, arg...) \
+do { \
+ if (!alternative_has_cap_unlikely(ARM64_WORKAROUND_REPEAT_TLBI)) \
+ break; \
+ __tlbi(op, ##arg); \
+ dsb(ish); \
+} while (0)
+
+/*
+ * Complete broadcast TLB maintenance issued by the host which invalidates
+ * stage 1 information in the host's own translation regime.
+ */
+static inline void __tlbi_sync_s1ish(void)
+{
+ dsb(ish);
+ __repeat_tlbi_sync(vale1is, 0);
+}
+
+/*
+ * Complete broadcast TLB maintenance issued by hyp code which invalidates
+ * stage 1 translation information in any translation regime.
+ */
+static inline void __tlbi_sync_s1ish_hyp(void)
+{
+ dsb(ish);
+ __repeat_tlbi_sync(vale2is, 0);
+}
+
/*
* TLB Invalidation
* ================
@@ -279,7 +299,7 @@ static inline void flush_tlb_all(void)
{
dsb(ishst);
__tlbi(vmalle1is);
- dsb(ish);
+ __tlbi_sync_s1ish();
isb();
}
@@ -291,7 +311,7 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
asid = __TLBI_VADDR(0, ASID(mm));
__tlbi(aside1is, asid);
__tlbi_user(aside1is, asid);
- dsb(ish);
+ __tlbi_sync_s1ish();
mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
}
@@ -345,20 +365,11 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
unsigned long uaddr)
{
flush_tlb_page_nosync(vma, uaddr);
- dsb(ish);
+ __tlbi_sync_s1ish();
}
static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
{
- /*
- * TLB flush deferral is not required on systems which are affected by
- * ARM64_WORKAROUND_REPEAT_TLBI, as __tlbi()/__tlbi_user() implementation
- * will have two consecutive TLBI instructions with a dsb(ish) in between
- * defeating the purpose (i.e save overall 'dsb ish' cost).
- */
- if (alternative_has_cap_unlikely(ARM64_WORKAROUND_REPEAT_TLBI))
- return false;
-
return true;
}
@@ -374,7 +385,7 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
*/
static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
- dsb(ish);
+ __tlbi_sync_s1ish();
}
/*
@@ -509,7 +520,7 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
{
__flush_tlb_range_nosync(vma->vm_mm, start, end, stride,
last_level, tlb_level);
- dsb(ish);
+ __tlbi_sync_s1ish();
}
static inline void local_flush_tlb_contpte(struct vm_area_struct *vma,
@@ -557,7 +568,7 @@ static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end
dsb(ishst);
__flush_tlb_range_op(vaale1is, start, pages, stride, 0,
TLBI_TTL_UNKNOWN, false, lpa2_is_enabled());
- dsb(ish);
+ __tlbi_sync_s1ish();
isb();
}
@@ -571,7 +582,7 @@ static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
dsb(ishst);
__tlbi(vaae1is, addr);
- dsb(ish);
+ __tlbi_sync_s1ish();
isb();
}
diff --git a/arch/arm64/kernel/sys_compat.c b/arch/arm64/kernel/sys_compat.c
index 4a609e9b65de0..b9d4998c97efa 100644
--- a/arch/arm64/kernel/sys_compat.c
+++ b/arch/arm64/kernel/sys_compat.c
@@ -37,7 +37,7 @@ __do_compat_cache_op(unsigned long start, unsigned long end)
* We pick the reserved-ASID to minimise the impact.
*/
__tlbi(aside1is, __TLBI_VADDR(0, 0));
- dsb(ish);
+ __tlbi_sync_s1ish();
}
ret = caches_clean_inval_user_pou(start, start + chunk);
diff --git a/arch/arm64/kvm/hyp/nvhe/mm.c b/arch/arm64/kvm/hyp/nvhe/mm.c
index ae8391baebc30..218976287d3fe 100644
--- a/arch/arm64/kvm/hyp/nvhe/mm.c
+++ b/arch/arm64/kvm/hyp/nvhe/mm.c
@@ -271,7 +271,7 @@ static void fixmap_clear_slot(struct hyp_fixmap_slot *slot)
*/
dsb(ishst);
__tlbi_level(vale2is, __TLBI_VADDR(addr, 0), level);
- dsb(ish);
+ __tlbi_sync_s1ish_hyp();
isb();
}
diff --git a/arch/arm64/kvm/hyp/nvhe/tlb.c b/arch/arm64/kvm/hyp/nvhe/tlb.c
index 48da9ca9763f6..3dc1ce0d27fe6 100644
--- a/arch/arm64/kvm/hyp/nvhe/tlb.c
+++ b/arch/arm64/kvm/hyp/nvhe/tlb.c
@@ -169,7 +169,7 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
*/
dsb(ish);
__tlbi(vmalle1is);
- dsb(ish);
+ __tlbi_sync_s1ish_hyp();
isb();
exit_vmid_context(&cxt);
@@ -226,7 +226,7 @@ void __kvm_tlb_flush_vmid_range(struct kvm_s2_mmu *mmu,
dsb(ish);
__tlbi(vmalle1is);
- dsb(ish);
+ __tlbi_sync_s1ish_hyp();
isb();
exit_vmid_context(&cxt);
@@ -240,7 +240,7 @@ void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu)
enter_vmid_context(mmu, &cxt, false);
__tlbi(vmalls12e1is);
- dsb(ish);
+ __tlbi_sync_s1ish_hyp();
isb();
exit_vmid_context(&cxt);
@@ -266,5 +266,5 @@ void __kvm_flush_vm_context(void)
/* Same remark as in enter_vmid_context() */
dsb(ish);
__tlbi(alle1is);
- dsb(ish);
+ __tlbi_sync_s1ish_hyp();
}
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 947ac1a951a5b..da8f8d4c4d5da 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -483,7 +483,7 @@ static int hyp_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
*unmapped += granule;
}
- dsb(ish);
+ __tlbi_sync_s1ish_hyp();
isb();
mm_ops->put_page(ctx->ptep);
diff --git a/arch/arm64/kvm/hyp/vhe/tlb.c b/arch/arm64/kvm/hyp/vhe/tlb.c
index ec25698186297..35855dadfb1b3 100644
--- a/arch/arm64/kvm/hyp/vhe/tlb.c
+++ b/arch/arm64/kvm/hyp/vhe/tlb.c
@@ -115,7 +115,7 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
*/
dsb(ish);
__tlbi(vmalle1is);
- dsb(ish);
+ __tlbi_sync_s1ish_hyp();
isb();
exit_vmid_context(&cxt);
@@ -176,7 +176,7 @@ void __kvm_tlb_flush_vmid_range(struct kvm_s2_mmu *mmu,
dsb(ish);
__tlbi(vmalle1is);
- dsb(ish);
+ __tlbi_sync_s1ish_hyp();
isb();
exit_vmid_context(&cxt);
@@ -192,7 +192,7 @@ void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu)
enter_vmid_context(mmu, &cxt);
__tlbi(vmalls12e1is);
- dsb(ish);
+ __tlbi_sync_s1ish_hyp();
isb();
exit_vmid_context(&cxt);
@@ -217,7 +217,7 @@ void __kvm_flush_vm_context(void)
{
dsb(ishst);
__tlbi(alle1is);
- dsb(ish);
+ __tlbi_sync_s1ish_hyp();
}
/*
@@ -358,7 +358,7 @@ int __kvm_tlbi_s1e2(struct kvm_s2_mmu *mmu, u64 va, u64 sys_encoding)
default:
ret = -EINVAL;
}
- dsb(ish);
+ __tlbi_sync_s1ish_hyp();
isb();
if (mmu)
--
2.30.2
* Re: [PATCH 0/2] arm64: tlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI
From: Will Deacon @ 2026-02-26 0:06 UTC
To: linux-arm-kernel, Mark Rutland
Cc: catalin.marinas, kernel-team, Will Deacon, maz, oupton,
ryan.roberts
On Wed, 18 Feb 2026 16:43:46 +0000, Mark Rutland wrote:
> Some Arm partners have complained that the overhead of
> ARM64_WORKAROUND_REPEAT_TLBI is too large, and despite the relevant
> errata being categorized as "rare", they still want to use the
> workaround in some deployments.
>
> For historical reasons, the current workaround is far stronger (and
> consequently far more expensive) than necessary. In part, the SDENs had
> somewhat misleading descriptions, which have recently been clarified:
>
> [...]
Applied to arm64 (for-next/fixes), thanks!
[1/2] arm64: tlb: Allow XZR argument to TLBI ops
https://git.kernel.org/arm64/c/bfd9c931d19a
[2/2] arm64: tlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI
https://git.kernel.org/arm64/c/a8f78680ee6b
Cheers,
--
Will
https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev
* Re: [PATCH 0/2] arm64: tlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI
From: Marc Zyngier @ 2026-02-26 8:58 UTC
To: Mark Rutland
Cc: linux-arm-kernel, catalin.marinas, oupton, ryan.roberts, will
On Wed, 18 Feb 2026 16:43:46 +0000,
Mark Rutland <mark.rutland@arm.com> wrote:
>
> Hi all,
>
> Some Arm partners have complained that the overhead of
> ARM64_WORKAROUND_REPEAT_TLBI is too large, and despite the relevant
> errata being categorized as "rare", they still want to use the
> workaround in some deployments.
>
> For historical reasons, the current workaround is far stronger (and
> consequently far more expensive) than necessary. In part, the SDENs had
> somewhat misleading descriptions, which have recently been clarified:
>
> * Arm Cortex-A76 erratum #1286807
> SDEN v33: https://developer.arm.com/documentation/SDEN-885749/33-0/
>
> * Arm Cortex-A55 erratum #2441007
> SDEN v16: https://developer.arm.com/documentation/SDEN-859338/1600/
>
> * Arm Cortex-A510 erratum #2441009
> SDEN v19: https://developer.arm.com/documentation/SDEN-1873351/1900/
>
> Patch 1 allows the __TLBI*() helpers to generate XZR as an argument.
> I've split this out as its own patch to make bisection easier in case we
> see any problems due to incorrect trap+emulation handling of XZR.
> Otherwise this shouldn't have any functional change.
>
> Patch 2 is the actual optimization, spelled out in detail in the commit
> message. The gist is that it's not necessary to duplicate every
> individual TLBI, and it's sufficient to have a single arbitrary TLBI;DSB
> after any number of batched TLBIs;DSB.
>
> As mentioned in the commit message for patch 2, this results in fewer
> alternatives and better code generation whenever
> ARM64_WORKAROUND_REPEAT_TLBI is built into the kernel, so it's a
> (trivial) win on hardware that isn't affected by the relevant errata.
>
> Mark.
>
> Mark Rutland (2):
> arm64: tlb: Allow XZR argument to TLBI ops
> arm64: tlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI
>
> arch/arm64/include/asm/tlbflush.h | 63 ++++++++++++++++++-------------
> arch/arm64/kernel/sys_compat.c | 2 +-
> arch/arm64/kvm/hyp/nvhe/mm.c | 2 +-
> arch/arm64/kvm/hyp/nvhe/tlb.c | 8 ++--
> arch/arm64/kvm/hyp/pgtable.c | 2 +-
> arch/arm64/kvm/hyp/vhe/tlb.c | 10 ++---
> 6 files changed, 49 insertions(+), 38 deletions(-)
>
A bit late, but FTR,
Reviewed-by: Marc Zyngier <maz@kernel.org>
M.
--
Without deviation from the norm, progress is not possible.
Thread overview: 5+ messages
2026-02-18 16:43 [PATCH 0/2] arm64: tlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI Mark Rutland
2026-02-18 16:43 ` [PATCH 1/2] arm64: tlb: Allow XZR argument to TLBI ops Mark Rutland
2026-02-18 16:43 ` [PATCH 2/2] arm64: tlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI Mark Rutland
2026-02-26 0:06 ` [PATCH 0/2] " Will Deacon
2026-02-26 8:58 ` Marc Zyngier