* [PATCH 0/4] arm64: Work around C1-Pro erratum 4193714 (CVE-2026-0995)
@ 2026-03-02 16:57 Catalin Marinas
2026-03-02 16:57 ` [PATCH 1/4] arm64: tlb: Use __tlbi_sync_s1ish_kernel() for kernel TLB maintenance Catalin Marinas
` (3 more replies)
0 siblings, 4 replies; 24+ messages in thread
From: Catalin Marinas @ 2026-03-02 16:57 UTC (permalink / raw)
To: linux-arm-kernel
Cc: Will Deacon, Marc Zyngier, Oliver Upton, Lorenzo Pieralisi,
Sudeep Holla, James Morse, Mark Rutland, Mark Brown, kvmarm
Arm C1-Pro prior to r1p3 has an erratum (4193714) where a TLBI+DSB
sequence might fail to ensure the completion of all outstanding SME
(Scalable Matrix Extension) memory accesses. The DVMSync message is
acknowledged before the SME accesses have fully completed, potentially
allowing pages to be reused before all in-flight accesses are done.
The workaround consists of executing a DSB locally (via IPI)
on all affected CPUs running with SME enabled, after the TLB
invalidation. This ensures the SME accesses have completed before the
IPI is acknowledged.
The first two patches are preparatory: patch 1 adds
__tlbi_sync_s1ish_kernel() to distinguish kernel from user TLB
maintenance; patch 2 passes the mm_struct to __tlbi_sync_s1ish().
Patch 3 implements the actual erratum workaround for the kernel
(non-virtualised) case. It applies only to user mappings and is limited to
tasks using SME (tracked via a new MMCF_SME_DVMSYNC flag) and running at
EL0. The smp_call_function() does not need an explicit DSB on the
interrupted CPUs since SCTLR_EL1.IESB=1 forces the completion of SME
accesses when entering the kernel from EL0.
Patch 4 handles the pKVM case. The aim is to ensure the kernel will not
compromise the security of protected guests. pKVM delegates the
workaround to EL3 via an SMC call (to Trusted Firmware-A). The TF-A
patches are provided separately in the project's repository.
Since SME in guests is not currently supported, no additional KVM
workaround is needed to prevent guests from exploiting the erratum.
This has been assigned CVE-2026-0995:
https://developer.arm.com/documentation/111823/latest/
Backports available here (no stable-6.12.y since SME is not supported):
https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git errata/c1-pro-erratum-4193714-stable-6.19.y
https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git errata/c1-pro-erratum-4193714-stable-6.18.y
https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git errata/c1-pro-erratum-4193714-android16-6.12-lts
Thanks.
Catalin Marinas (3):
arm64: tlb: Use __tlbi_sync_s1ish_kernel() for kernel TLB maintenance
arm64: tlb: Pass the corresponding mm to __tlbi_sync_s1ish()
arm64: errata: Work around early CME DVMSync acknowledgement
James Morse (1):
KVM: arm64: Add SMC hook for SME dvmsync erratum
arch/arm64/Kconfig | 12 ++++
arch/arm64/include/asm/cpucaps.h | 2 +
arch/arm64/include/asm/cputype.h | 2 +
arch/arm64/include/asm/fpsimd.h | 29 ++++++++++
arch/arm64/include/asm/mmu.h | 1 +
arch/arm64/include/asm/tlbflush.h | 39 ++++++++++---
arch/arm64/kernel/cpu_errata.c | 19 +++++++
arch/arm64/kernel/entry-common.c | 3 +
arch/arm64/kernel/fpsimd.c | 81 +++++++++++++++++++++++++++
arch/arm64/kernel/process.c | 7 +++
arch/arm64/kernel/sys_compat.c | 2 +-
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 17 ++++++
arch/arm64/tools/cpucaps | 1 +
include/linux/arm-smccc.h | 5 ++
14 files changed, 211 insertions(+), 9 deletions(-)
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 1/4] arm64: tlb: Use __tlbi_sync_s1ish_kernel() for kernel TLB maintenance
2026-03-02 16:57 [PATCH 0/4] arm64: Work around C1-Pro erratum 4193714 (CVE-2026-0995) Catalin Marinas
@ 2026-03-02 16:57 ` Catalin Marinas
2026-03-03 13:12 ` Mark Rutland
2026-03-02 16:57 ` [PATCH 2/4] arm64: tlb: Pass the corresponding mm to __tlbi_sync_s1ish() Catalin Marinas
` (2 subsequent siblings)
3 siblings, 1 reply; 24+ messages in thread
From: Catalin Marinas @ 2026-03-02 16:57 UTC (permalink / raw)
To: linux-arm-kernel
Cc: Will Deacon, Marc Zyngier, Oliver Upton, Lorenzo Pieralisi,
Sudeep Holla, James Morse, Mark Rutland, Mark Brown, kvmarm
Add __tlbi_sync_s1ish_kernel() similar to __tlbi_sync_s1ish() and use it
for kernel TLB maintenance. Also use this function in flush_tlb_all()
which is only used in relation to kernel mappings. Subsequent patches
can differentiate between workarounds that apply to user only or both
user and kernel.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
---
arch/arm64/include/asm/tlbflush.h | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 1416e652612b..19be0f7bfca5 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -191,6 +191,12 @@ static inline void __tlbi_sync_s1ish(void)
__repeat_tlbi_sync(vale1is, 0);
}
+static inline void __tlbi_sync_s1ish_kernel(void)
+{
+ dsb(ish);
+ __repeat_tlbi_sync(vale1is, 0);
+}
+
/*
* Complete broadcast TLB maintenance issued by hyp code which invalidates
* stage 1 translation information in any translation regime.
@@ -299,7 +305,7 @@ static inline void flush_tlb_all(void)
{
dsb(ishst);
__tlbi(vmalle1is);
- __tlbi_sync_s1ish();
+ __tlbi_sync_s1ish_kernel();
isb();
}
@@ -568,7 +574,7 @@ static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end
dsb(ishst);
__flush_tlb_range_op(vaale1is, start, pages, stride, 0,
TLBI_TTL_UNKNOWN, false, lpa2_is_enabled());
- __tlbi_sync_s1ish();
+ __tlbi_sync_s1ish_kernel();
isb();
}
@@ -582,7 +588,7 @@ static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
dsb(ishst);
__tlbi(vaae1is, addr);
- __tlbi_sync_s1ish();
+ __tlbi_sync_s1ish_kernel();
isb();
}
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 2/4] arm64: tlb: Pass the corresponding mm to __tlbi_sync_s1ish()
2026-03-02 16:57 [PATCH 0/4] arm64: Work around C1-Pro erratum 4193714 (CVE-2026-0995) Catalin Marinas
2026-03-02 16:57 ` [PATCH 1/4] arm64: tlb: Use __tlbi_sync_s1ish_kernel() for kernel TLB maintenance Catalin Marinas
@ 2026-03-02 16:57 ` Catalin Marinas
2026-03-05 14:33 ` Will Deacon
2026-03-02 16:57 ` [PATCH 3/4] arm64: errata: Work around early CME DVMSync acknowledgement Catalin Marinas
2026-03-02 16:57 ` [PATCH 4/4] KVM: arm64: Add SMC hook for SME dvmsync erratum Catalin Marinas
3 siblings, 1 reply; 24+ messages in thread
From: Catalin Marinas @ 2026-03-02 16:57 UTC (permalink / raw)
To: linux-arm-kernel
Cc: Will Deacon, Marc Zyngier, Oliver Upton, Lorenzo Pieralisi,
Sudeep Holla, James Morse, Mark Rutland, Mark Brown, kvmarm
The mm structure will be used for workarounds that need to be limited to
specific tasks.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
---
arch/arm64/include/asm/tlbflush.h | 10 +++++-----
arch/arm64/kernel/sys_compat.c | 2 +-
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 19be0f7bfca5..14f116bfec73 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -185,7 +185,7 @@ do { \
* Complete broadcast TLB maintenance issued by the host which invalidates
* stage 1 information in the host's own translation regime.
*/
-static inline void __tlbi_sync_s1ish(void)
+static inline void __tlbi_sync_s1ish(struct mm_struct *mm)
{
dsb(ish);
__repeat_tlbi_sync(vale1is, 0);
@@ -317,7 +317,7 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
asid = __TLBI_VADDR(0, ASID(mm));
__tlbi(aside1is, asid);
__tlbi_user(aside1is, asid);
- __tlbi_sync_s1ish();
+ __tlbi_sync_s1ish(mm);
mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
}
@@ -371,7 +371,7 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
unsigned long uaddr)
{
flush_tlb_page_nosync(vma, uaddr);
- __tlbi_sync_s1ish();
+ __tlbi_sync_s1ish(vma->vm_mm);
}
static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
@@ -391,7 +391,7 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
*/
static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
- __tlbi_sync_s1ish();
+ __tlbi_sync_s1ish(NULL);
}
/*
@@ -526,7 +526,7 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
{
__flush_tlb_range_nosync(vma->vm_mm, start, end, stride,
last_level, tlb_level);
- __tlbi_sync_s1ish();
+ __tlbi_sync_s1ish(vma->vm_mm);
}
static inline void local_flush_tlb_contpte(struct vm_area_struct *vma,
diff --git a/arch/arm64/kernel/sys_compat.c b/arch/arm64/kernel/sys_compat.c
index b9d4998c97ef..03fde2677d5b 100644
--- a/arch/arm64/kernel/sys_compat.c
+++ b/arch/arm64/kernel/sys_compat.c
@@ -37,7 +37,7 @@ __do_compat_cache_op(unsigned long start, unsigned long end)
* We pick the reserved-ASID to minimise the impact.
*/
__tlbi(aside1is, __TLBI_VADDR(0, 0));
- __tlbi_sync_s1ish();
+ __tlbi_sync_s1ish(current->mm);
}
ret = caches_clean_inval_user_pou(start, start + chunk);
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 3/4] arm64: errata: Work around early CME DVMSync acknowledgement
2026-03-02 16:57 [PATCH 0/4] arm64: Work around C1-Pro erratum 4193714 (CVE-2026-0995) Catalin Marinas
2026-03-02 16:57 ` [PATCH 1/4] arm64: tlb: Use __tlbi_sync_s1ish_kernel() for kernel TLB maintenance Catalin Marinas
2026-03-02 16:57 ` [PATCH 2/4] arm64: tlb: Pass the corresponding mm to __tlbi_sync_s1ish() Catalin Marinas
@ 2026-03-02 16:57 ` Catalin Marinas
2026-03-05 14:32 ` Will Deacon
2026-03-02 16:57 ` [PATCH 4/4] KVM: arm64: Add SMC hook for SME dvmsync erratum Catalin Marinas
3 siblings, 1 reply; 24+ messages in thread
From: Catalin Marinas @ 2026-03-02 16:57 UTC (permalink / raw)
To: linux-arm-kernel
Cc: Will Deacon, Marc Zyngier, Oliver Upton, Lorenzo Pieralisi,
Sudeep Holla, James Morse, Mark Rutland, Mark Brown, kvmarm
C1-Pro acknowledges DVMSync messages before completing the SME/CME
memory accesses. Work around this by issuing an IPI+DSB to the affected
CPUs if they are running in EL0 with SME enabled.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mark Brown <broonie@kernel.org>
---
arch/arm64/Kconfig | 12 +++++
arch/arm64/include/asm/cpucaps.h | 2 +
arch/arm64/include/asm/cputype.h | 2 +
arch/arm64/include/asm/fpsimd.h | 29 +++++++++++
arch/arm64/include/asm/mmu.h | 1 +
arch/arm64/include/asm/tlbflush.h | 17 +++++++
arch/arm64/kernel/cpu_errata.c | 19 ++++++++
arch/arm64/kernel/entry-common.c | 3 ++
arch/arm64/kernel/fpsimd.c | 81 +++++++++++++++++++++++++++++++
arch/arm64/kernel/process.c | 7 +++
arch/arm64/tools/cpucaps | 1 +
11 files changed, 174 insertions(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 38dba5f7e4d2..f07cdb6ada08 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1175,6 +1175,18 @@ config ARM64_ERRATUM_4311569
If unsure, say Y.
+config ARM64_ERRATUM_SME_DVMSYNC
+ bool "C1-Pro: 4193714: SME DVMSync early acknowledgement"
+ depends on ARM64_SME
+ default y
+ help
+ Enable workaround for C1-Pro acknowledging the DVMSync before
+ the SME memory accesses are complete. This would cause TLB
+ maintenance for processes using SME to also issue an IPI to
+ the affected CPUs.
+
+ If unsure, say Y.
+
config CAVIUM_ERRATUM_22375
bool "Cavium erratum 22375, 24313"
default y
diff --git a/arch/arm64/include/asm/cpucaps.h b/arch/arm64/include/asm/cpucaps.h
index 177c691914f8..d0e6cff93876 100644
--- a/arch/arm64/include/asm/cpucaps.h
+++ b/arch/arm64/include/asm/cpucaps.h
@@ -64,6 +64,8 @@ cpucap_is_possible(const unsigned int cap)
return IS_ENABLED(CONFIG_ARM64_WORKAROUND_REPEAT_TLBI);
case ARM64_WORKAROUND_SPECULATIVE_SSBS:
return IS_ENABLED(CONFIG_ARM64_ERRATUM_3194386);
+ case ARM64_WORKAROUND_SME_DVMSYNC:
+ return IS_ENABLED(CONFIG_ARM64_ERRATUM_SME_DVMSYNC);
case ARM64_MPAM:
/*
* KVM MPAM support doesn't rely on the host kernel supporting MPAM.
diff --git a/arch/arm64/include/asm/cputype.h b/arch/arm64/include/asm/cputype.h
index 08860d482e60..7b518e81dd15 100644
--- a/arch/arm64/include/asm/cputype.h
+++ b/arch/arm64/include/asm/cputype.h
@@ -98,6 +98,7 @@
#define ARM_CPU_PART_CORTEX_A725 0xD87
#define ARM_CPU_PART_CORTEX_A720AE 0xD89
#define ARM_CPU_PART_NEOVERSE_N3 0xD8E
+#define ARM_CPU_PART_C1_PRO 0xD8B
#define APM_CPU_PART_XGENE 0x000
#define APM_CPU_VAR_POTENZA 0x00
@@ -189,6 +190,7 @@
#define MIDR_CORTEX_A725 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A725)
#define MIDR_CORTEX_A720AE MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A720AE)
#define MIDR_NEOVERSE_N3 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_N3)
+#define MIDR_C1_PRO MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_C1_PRO)
#define MIDR_THUNDERX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX)
#define MIDR_THUNDERX_81XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX_81XX)
#define MIDR_THUNDERX_83XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX_83XX)
diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h
index 1d2e33559bd5..a956fe12fc4d 100644
--- a/arch/arm64/include/asm/fpsimd.h
+++ b/arch/arm64/include/asm/fpsimd.h
@@ -428,6 +428,32 @@ static inline size_t sme_state_size(struct task_struct const *task)
return __sme_state_size(task_get_sme_vl(task));
}
+#ifdef CONFIG_ARM64_ERRATUM_SME_DVMSYNC
+
+void sme_enable_dvmsync(void);
+void sme_set_active(unsigned int cpu);
+void sme_clear_active(unsigned int cpu);
+
+#else
+
+static inline void sme_enable_dvmsync(void) { }
+static inline void sme_set_active(unsigned int cpu) { }
+static inline void sme_clear_active(unsigned int cpu) { }
+
+#endif /* CONFIG_ARM64_ERRATUM_SME_DVMSYNC */
+
+static inline void sme_enter_from_user_mode(void)
+{
+ if (test_thread_flag(TIF_SME))
+ sme_clear_active(smp_processor_id());
+}
+
+static inline void sme_exit_to_user_mode(void)
+{
+ if (test_thread_flag(TIF_SME))
+ sme_set_active(smp_processor_id());
+}
+
#else
static inline void sme_user_disable(void) { BUILD_BUG(); }
@@ -456,6 +482,9 @@ static inline size_t sme_state_size(struct task_struct const *task)
return 0;
}
+static inline void sme_enter_from_user_mode(void) { }
+static inline void sme_exit_to_user_mode(void) { }
+
#endif /* ! CONFIG_ARM64_SME */
/* For use by EFI runtime services calls only */
diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 137a173df1ff..ec6003db4d20 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -8,6 +8,7 @@
#include <asm/cputype.h>
#define MMCF_AARCH32 0x1 /* mm context flag for AArch32 executables */
+#define MMCF_SME_DVMSYNC 0x2 /* force DVMSync via IPI for SME completion */
#define USER_ASID_BIT 48
#define USER_ASID_FLAG (UL(1) << USER_ASID_BIT)
#define TTBR_ASID_MASK (UL(0xffff) << 48)
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 14f116bfec73..e3ea0246a4f4 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -80,6 +80,22 @@ static inline unsigned long get_trans_granule(void)
}
}
+#ifdef CONFIG_ARM64_ERRATUM_SME_DVMSYNC
+void sme_do_dvmsync(void);
+
+static inline void sme_dvmsync(struct mm_struct *mm)
+{
+ if (!cpus_have_final_cap(ARM64_WORKAROUND_SME_DVMSYNC))
+ return;
+ if (mm && !test_bit(ilog2(MMCF_SME_DVMSYNC), &mm->context.flags))
+ return;
+
+ sme_do_dvmsync();
+}
+#else
+static inline void sme_dvmsync(struct mm_struct *mm) { }
+#endif
+
/*
* Level-based TLBI operations.
*
@@ -189,6 +205,7 @@ static inline void __tlbi_sync_s1ish(struct mm_struct *mm)
{
dsb(ish);
__repeat_tlbi_sync(vale1is, 0);
+ sme_dvmsync(mm);
}
static inline void __tlbi_sync_s1ish_kernel(void)
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 5c0ab6bfd44a..fef522a6b4b7 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -11,6 +11,7 @@
#include <asm/cpu.h>
#include <asm/cputype.h>
#include <asm/cpufeature.h>
+#include <asm/fpsimd.h>
#include <asm/kvm_asm.h>
#include <asm/smp_plat.h>
@@ -575,6 +576,14 @@ static const struct midr_range erratum_spec_ssbs_list[] = {
};
#endif
+#ifdef CONFIG_ARM64_ERRATUM_SME_DVMSYNC
+static void cpu_enable_sme_dvmsync(const struct arm64_cpu_capabilities *__unused)
+{
+ if (this_cpu_has_cap(ARM64_WORKAROUND_SME_DVMSYNC))
+ sme_enable_dvmsync();
+}
+#endif
+
#ifdef CONFIG_AMPERE_ERRATUM_AC03_CPU_38
static const struct midr_range erratum_ac03_cpu_38_list[] = {
MIDR_ALL_VERSIONS(MIDR_AMPERE1),
@@ -901,6 +910,16 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
.matches = need_arm_si_l1_workaround_4311569,
},
#endif
+#ifdef CONFIG_ARM64_ERRATUM_SME_DVMSYNC
+ {
+ .desc = "C1-Pro SME DVMSync early acknowledgement",
+ .capability = ARM64_WORKAROUND_SME_DVMSYNC,
+ .cpu_enable = cpu_enable_sme_dvmsync,
+ /* C1-Pro r0p0 - r1p2 (the latter only when REVIDR_EL1[0]==0) */
+ ERRATA_MIDR_RANGE(MIDR_C1_PRO, 0, 0, 1, 2),
+ MIDR_FIXED(MIDR_CPU_VAR_REV(1, 2), BIT(0)),
+ },
+#endif
#ifdef CONFIG_ARM64_WORKAROUND_SPECULATIVE_UNPRIV_LOAD
{
.desc = "ARM errata 2966298, 3117295",
diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
index 3625797e9ee8..fb1e374af622 100644
--- a/arch/arm64/kernel/entry-common.c
+++ b/arch/arm64/kernel/entry-common.c
@@ -21,6 +21,7 @@
#include <asm/daifflags.h>
#include <asm/esr.h>
#include <asm/exception.h>
+#include <asm/fpsimd.h>
#include <asm/irq_regs.h>
#include <asm/kprobes.h>
#include <asm/mmu.h>
@@ -67,6 +68,7 @@ static __always_inline void arm64_enter_from_user_mode(struct pt_regs *regs)
{
enter_from_user_mode(regs);
mte_disable_tco_entry(current);
+ sme_enter_from_user_mode();
}
/*
@@ -80,6 +82,7 @@ static __always_inline void arm64_exit_to_user_mode(struct pt_regs *regs)
local_irq_disable();
exit_to_user_mode_prepare_legacy(regs);
local_daif_mask();
+ sme_exit_to_user_mode();
mte_check_tfsr_exit();
exit_to_user_mode();
}
diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
index 9de1d8a604cb..90015fc29722 100644
--- a/arch/arm64/kernel/fpsimd.c
+++ b/arch/arm64/kernel/fpsimd.c
@@ -15,6 +15,7 @@
#include <linux/compiler.h>
#include <linux/cpu.h>
#include <linux/cpu_pm.h>
+#include <linux/cpumask.h>
#include <linux/ctype.h>
#include <linux/kernel.h>
#include <linux/linkage.h>
@@ -28,6 +29,7 @@
#include <linux/sched/task_stack.h>
#include <linux/signal.h>
#include <linux/slab.h>
+#include <linux/smp.h>
#include <linux/stddef.h>
#include <linux/sysctl.h>
#include <linux/swab.h>
@@ -1358,6 +1360,85 @@ void do_sve_acc(unsigned long esr, struct pt_regs *regs)
put_cpu_fpsimd_context();
}
+#ifdef CONFIG_ARM64_ERRATUM_SME_DVMSYNC
+
+/*
+ * SME/CME erratum handling
+ */
+static cpumask_var_t sme_dvmsync_cpus;
+static cpumask_var_t sme_active_cpus;
+
+void sme_set_active(unsigned int cpu)
+{
+ if (!cpus_have_final_cap(ARM64_WORKAROUND_SME_DVMSYNC))
+ return;
+ if (!cpumask_test_cpu(cpu, sme_dvmsync_cpus))
+ return;
+
+ if (!test_bit(ilog2(MMCF_SME_DVMSYNC), &current->mm->context.flags))
+ set_bit(ilog2(MMCF_SME_DVMSYNC), &current->mm->context.flags);
+
+ cpumask_set_cpu(cpu, sme_active_cpus);
+
+ /*
+ * Ensure subsequent (SME) memory accesses are observed after the
+ * cpumask and the MMCF_SME_DVMSYNC flag setting.
+ */
+ smp_mb();
+}
+
+void sme_clear_active(unsigned int cpu)
+{
+ if (!cpus_have_final_cap(ARM64_WORKAROUND_SME_DVMSYNC))
+ return;
+ if (!cpumask_test_cpu(cpu, sme_dvmsync_cpus))
+ return;
+
+ /*
+ * With SCTLR_EL1.IESB enabled, the SME memory transactions are
+ * completed on entering EL1.
+ */
+ cpumask_clear_cpu(cpu, sme_active_cpus);
+}
+
+static void sme_dvmsync_ipi(void *unused)
+{
+ /*
+ * With SCTLR_EL1.IESB on, taking an exception is sufficient to ensure
+ * the completion of the SME memory accesses, so no need for an
+ * explicit DSB.
+ */
+}
+
+void sme_do_dvmsync(void)
+{
+ /*
+ * This is called from the TLB maintenance functions after the DSB ISH
+ * to send the hardware DVMSync message. If this CPU sees the mask as
+ * empty, the remote CPU executing sme_set_active() would have seen
+ * the DVMSync and no IPI is required.
+ */
+ if (cpumask_empty(sme_active_cpus))
+ return;
+
+ preempt_disable();
+ smp_call_function_many(sme_active_cpus, sme_dvmsync_ipi, NULL, true);
+ preempt_enable();
+}
+
+void sme_enable_dvmsync(void)
+{
+ if ((!cpumask_available(sme_dvmsync_cpus) &&
+ !zalloc_cpumask_var(&sme_dvmsync_cpus, GFP_ATOMIC)) ||
+ (!cpumask_available(sme_active_cpus) &&
+ !zalloc_cpumask_var(&sme_active_cpus, GFP_ATOMIC)))
+ panic("Unable to allocate the cpumasks for SME DVMSync erratum");
+
+ cpumask_set_cpu(smp_processor_id(), sme_dvmsync_cpus);
+}
+
+#endif /* CONFIG_ARM64_ERRATUM_SME_DVMSYNC */
+
/*
* Trapped SME access
*
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 489554931231..6154d0b454a3 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -471,6 +471,13 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
ret = copy_thread_za(p, current);
if (ret)
return ret;
+ /*
+ * Disable the SME DVMSync workaround for the
+ * new process; it will be enabled on return
+ * to user if TIF_SME is set.
+ */
+ if (cpus_have_final_cap(ARM64_WORKAROUND_SME_DVMSYNC))
+ p->mm->context.flags &= ~MMCF_SME_DVMSYNC;
} else {
p->thread.tpidr2_el0 = 0;
WARN_ON_ONCE(p->thread.svcr & SVCR_ZA_MASK);
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index 7261553b644b..7d69d8a16eae 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -123,6 +123,7 @@ WORKAROUND_PMUV3_IMPDEF_TRAPS
WORKAROUND_QCOM_FALKOR_E1003
WORKAROUND_QCOM_ORYON_CNTVOFF
WORKAROUND_REPEAT_TLBI
+WORKAROUND_SME_DVMSYNC
WORKAROUND_SPECULATIVE_AT
WORKAROUND_SPECULATIVE_SSBS
WORKAROUND_SPECULATIVE_UNPRIV_LOAD
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 4/4] KVM: arm64: Add SMC hook for SME dvmsync erratum
2026-03-02 16:57 [PATCH 0/4] arm64: Work around C1-Pro erratum 4193714 (CVE-2026-0995) Catalin Marinas
` (2 preceding siblings ...)
2026-03-02 16:57 ` [PATCH 3/4] arm64: errata: Work around early CME DVMSync acknowledgement Catalin Marinas
@ 2026-03-02 16:57 ` Catalin Marinas
2026-03-05 14:32 ` Will Deacon
3 siblings, 1 reply; 24+ messages in thread
From: Catalin Marinas @ 2026-03-02 16:57 UTC (permalink / raw)
To: linux-arm-kernel
Cc: Will Deacon, Marc Zyngier, Oliver Upton, Lorenzo Pieralisi,
Sudeep Holla, James Morse, Mark Rutland, Mark Brown, kvmarm
From: James Morse <james.morse@arm.com>
C1-Pro cores with SME have an erratum where a TLBI+DSB sequence does not
complete all outstanding SME accesses. Instead, a DSB needs to be
executed on the affected CPUs. The implication is that pages cannot be
unmapped from the host stage 2 and then provided to the guest, as host
SME accesses may still occur after this point.
This erratum breaks pKVM's guarantees, and the workaround is hard to
implement as EL2 and EL1 share a security state, meaning EL1 can mask
IPIs sent by EL2, leading to interrupt blackouts.
Instead, do this in EL3. This has the advantage of a separate security
state, meaning a lower EL cannot mask the IPI. It is also simpler for EL3
to know about CPUs that are off or in PSCI's CPU_SUSPEND.
Add the needed hook.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oupton@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Lorenzo Pieralisi <lpieralisi@kernel.org>
Cc: Sudeep Holla <sudeep.holla@kernel.org>
---
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 17 +++++++++++++++++
include/linux/arm-smccc.h | 5 +++++
2 files changed, 22 insertions(+)
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 38f66a56a766..ab7f9273fddf 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -5,6 +5,8 @@
*/
#include <linux/kvm_host.h>
+#include <linux/arm-smccc.h>
+
#include <asm/kvm_emulate.h>
#include <asm/kvm_hyp.h>
#include <asm/kvm_mmu.h>
@@ -28,6 +30,15 @@ static struct hyp_pool host_s2_pool;
static DEFINE_PER_CPU(struct pkvm_hyp_vm *, __current_vm);
#define current_vm (*this_cpu_ptr(&__current_vm))
+static void pkvm_sme_dvmsync_fw_call(void)
+{
+ if (cpus_have_final_cap(ARM64_WORKAROUND_SME_DVMSYNC)) {
+ struct arm_smccc_res res;
+
+ arm_smccc_1_1_smc(ARM_SMCCC_CPU_SME_DVMSYNC_WORKAROUND, &res);
+ }
+}
+
static void guest_lock_component(struct pkvm_hyp_vm *vm)
{
hyp_spin_lock(&vm->lock);
@@ -553,6 +564,12 @@ int host_stage2_set_owner_locked(phys_addr_t addr, u64 size, u8 owner_id)
if (ret)
return ret;
+ /*
+ * After stage2 maintenance has happened, but before the page owner has
+ * changed.
+ */
+ pkvm_sme_dvmsync_fw_call();
+
/* Don't forget to update the vmemmap tracking for the host */
if (owner_id == PKVM_ID_HOST)
__host_update_page_state(addr, size, PKVM_PAGE_OWNED);
diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
index 50b47eba7d01..3489db78b0bd 100644
--- a/include/linux/arm-smccc.h
+++ b/include/linux/arm-smccc.h
@@ -105,6 +105,11 @@
ARM_SMCCC_SMC_32, \
0, 0x3fff)
+#define ARM_SMCCC_CPU_SME_DVMSYNC_WORKAROUND \
+ ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
+ ARM_SMCCC_SMC_32, \
+ ARM_SMCCC_OWNER_CPU, 0x10)
+
#define ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID \
ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
ARM_SMCCC_SMC_32, \
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 1/4] arm64: tlb: Use __tlbi_sync_s1ish_kernel() for kernel TLB maintenance
2026-03-02 16:57 ` [PATCH 1/4] arm64: tlb: Use __tlbi_sync_s1ish_kernel() for kernel TLB maintenance Catalin Marinas
@ 2026-03-03 13:12 ` Mark Rutland
2026-03-05 11:27 ` Catalin Marinas
0 siblings, 1 reply; 24+ messages in thread
From: Mark Rutland @ 2026-03-03 13:12 UTC (permalink / raw)
To: Catalin Marinas
Cc: linux-arm-kernel, Will Deacon, Marc Zyngier, Oliver Upton,
Lorenzo Pieralisi, Sudeep Holla, James Morse, Mark Brown, kvmarm
Hi Catalin,
On Mon, Mar 02, 2026 at 04:57:54PM +0000, Catalin Marinas wrote:
> Add __tlbi_sync_s1ish_kernel() similar to __tlbi_sync_s1ish() and use it
> for kernel TLB maintenance. Also use this function in flush_tlb_all()
> which is only used in relation to kernel mappings. Subsequent patches
> can differentiate between workarounds that apply to user only or both
> user and kernel.
>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
This looks fine to me. I have one minor comment/naming nit below, but
this looks functionally correct, and I'm happy to spin a follow-up for
that.
With or without the changes below:
Acked-by: Mark Rutland <mark.rutland@arm.com>
> ---
> arch/arm64/include/asm/tlbflush.h | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 1416e652612b..19be0f7bfca5 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -191,6 +191,12 @@ static inline void __tlbi_sync_s1ish(void)
> __repeat_tlbi_sync(vale1is, 0);
> }
>
> +static inline void __tlbi_sync_s1ish_kernel(void)
> +{
> + dsb(ish);
> + __repeat_tlbi_sync(vale1is, 0);
> +}
> +
> /*
> * Complete broadcast TLB maintenance issued by hyp code which invalidates
> * stage 1 translation information in any translation regime.
> @@ -299,7 +305,7 @@ static inline void flush_tlb_all(void)
> {
> dsb(ishst);
> __tlbi(vmalle1is);
> - __tlbi_sync_s1ish();
> + __tlbi_sync_s1ish_kernel();
> isb();
> }
The commit message is correct that flush_tlb_all() is only used for
kernel mappings today, via flush_tlb_kernel_range(), so this is safe.
However, the big comment block around line 200 says:
flush_tlb_all()
Invalidate the entire TLB (kernel + user) on all CPUs
... and:
local_flush_tlb_all()
Same as flush_tlb_all(), but only applies to the calling CPU.
... where the latter is used for user mappings (upon ASID overflow), so
I think there's some risk of future confusion.
To minimize the risk that flush_tlb_all() gets used for user mappings in
future, how about we rename flush_tlb_all() => flush_tlb_kernel_all(), and
update those comments:
flush_tlb_kernel_all()
Invalidate all kernel mappings on all CPUs.
Should not be used to invalidate user mappings.
local_flush_tlb_all()
Invalidate all (kernel + user) mappings on the calling CPU.
Note: I chose flush_tlb_kernel_all() rather than flush_tlb_all_kernel() to
match __flush_tlb_kernel_{pgtable,range}, with 'kernel' before the
operation/scope.
Thanks,
Mark.
> @@ -568,7 +574,7 @@ static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end
> dsb(ishst);
> __flush_tlb_range_op(vaale1is, start, pages, stride, 0,
> TLBI_TTL_UNKNOWN, false, lpa2_is_enabled());
> - __tlbi_sync_s1ish();
> + __tlbi_sync_s1ish_kernel();
> isb();
> }
>
> @@ -582,7 +588,7 @@ static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
>
> dsb(ishst);
> __tlbi(vaae1is, addr);
> - __tlbi_sync_s1ish();
> + __tlbi_sync_s1ish_kernel();
> isb();
> }
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/4] arm64: tlb: Use __tlbi_sync_s1ish_kernel() for kernel TLB maintenance
2026-03-03 13:12 ` Mark Rutland
@ 2026-03-05 11:27 ` Catalin Marinas
2026-03-09 12:12 ` Mark Rutland
0 siblings, 1 reply; 24+ messages in thread
From: Catalin Marinas @ 2026-03-05 11:27 UTC (permalink / raw)
To: Mark Rutland
Cc: linux-arm-kernel, Will Deacon, Marc Zyngier, Oliver Upton,
Lorenzo Pieralisi, Sudeep Holla, James Morse, Mark Brown, kvmarm
On Tue, Mar 03, 2026 at 01:12:50PM +0000, Mark Rutland wrote:
> On Mon, Mar 02, 2026 at 04:57:54PM +0000, Catalin Marinas wrote:
> > Add __tlbi_sync_s1ish_kernel() similar to __tlbi_sync_s1ish() and use it
> > for kernel TLB maintenance. Also use this function in flush_tlb_all()
> > which is only used in relation to kernel mappings. Subsequent patches
> > can differentiate between workarounds that apply to user only or both
> > user and kernel.
> >
> > Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Mark Rutland <mark.rutland@arm.com>
>
> This looks fine to me. I have one minor comment/naming nit below, but
> this looks functionally correct, and I'm happy to spin a follow-up for
> that.
>
> With or without the changes below:
>
> Acked-by: Mark Rutland <mark.rutland@arm.com>
Thanks Mark.
> > diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> > index 1416e652612b..19be0f7bfca5 100644
> > --- a/arch/arm64/include/asm/tlbflush.h
> > +++ b/arch/arm64/include/asm/tlbflush.h
> > @@ -191,6 +191,12 @@ static inline void __tlbi_sync_s1ish(void)
> > __repeat_tlbi_sync(vale1is, 0);
> > }
> >
> > +static inline void __tlbi_sync_s1ish_kernel(void)
> > +{
> > + dsb(ish);
> > + __repeat_tlbi_sync(vale1is, 0);
> > +}
> > +
> > /*
> > * Complete broadcast TLB maintenance issued by hyp code which invalidates
> > * stage 1 translation information in any translation regime.
> > @@ -299,7 +305,7 @@ static inline void flush_tlb_all(void)
> > {
> > dsb(ishst);
> > __tlbi(vmalle1is);
> > - __tlbi_sync_s1ish();
> > + __tlbi_sync_s1ish_kernel();
> > isb();
> > }
>
> The commit message is correct that flush_tlb_all() is only used for
> kernel mappings today, via flush_tlb_kernel_range(), so this is safe.
Unfortunately, it's also used by the core code -
hugetlb_vmemmap_restore_folios() (and another function in this file).
> However, the big comment block around line 200 says:
>
> flush_tlb_all()
> Invalidate the entire TLB (kernel + user) on all CPUs
>
> ... and:
>
> local_flush_tlb_all()
> Same as flush_tlb_all(), but only applies to the calling CPU.
>
> ... where the latter is used for user mappings (upon ASID overflow), so
> I think there's some risk of future confusion.
Ignoring this erratum, the statements are still correct for arm64 as it
flushes both kernel and user, though I see what you mean w.r.t. its
intended use.
> To minimize the risk that flush_tlb_all() gets used for user mappings in
> future, how about we rename flush_tlb_all() => flush_tlb_kernel_all(), and
> update those comments:
>
> flush_tlb_kernel_all()
> Invalidate all kernel mappings on all CPUs.
> Should not be used to invalidate user mappings.
>
> local_flush_tlb_all()
> Invalidate all (kernel + user) mappings on the calling CPU.
>
> Note: I chose flush_tlb_kernel_all() rather than flush_tlb_all_kernel() to match
> __flush_tlb_kernel_{pgtable,range}, with 'kernel' before the operation/scope.
I'm fine with updating the comments but, for backporting, I'd rather not
change the function name as that would mean touching core code. Ideally we should go
around and change the other architectures to follow the same semantics
(I briefly looked at x86 and powerpc and they also seem to use
flush_tlb_all() only for kernel mappings).
So, I think it's better to do this cleanup separately ;).
--
Catalin
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/4] arm64: errata: Work around early CME DVMSync acknowledgement
2026-03-02 16:57 ` [PATCH 3/4] arm64: errata: Work around early CME DVMSync acknowledgement Catalin Marinas
@ 2026-03-05 14:32 ` Will Deacon
2026-03-06 12:00 ` Catalin Marinas
0 siblings, 1 reply; 24+ messages in thread
From: Will Deacon @ 2026-03-05 14:32 UTC (permalink / raw)
To: Catalin Marinas
Cc: linux-arm-kernel, Marc Zyngier, Oliver Upton, Lorenzo Pieralisi,
Sudeep Holla, James Morse, Mark Rutland, Mark Brown, kvmarm
On Mon, Mar 02, 2026 at 04:57:56PM +0000, Catalin Marinas wrote:
> C1-Pro acknowledges DVMSync messages before completing the SME/CME
> memory accesses. Work around this by issuing an IPI+DSB to the affected
> CPUs if they are running in EL0 with SME enabled.
Just to make sure I understand the implications: this _only_ applies
to explicit memory accesses from the SME unit and not, for example, to
page-table walks initiated by SME instructions?
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Mark Brown <broonie@kernel.org>
> ---
> arch/arm64/Kconfig | 12 +++++
> arch/arm64/include/asm/cpucaps.h | 2 +
> arch/arm64/include/asm/cputype.h | 2 +
> arch/arm64/include/asm/fpsimd.h | 29 +++++++++++
> arch/arm64/include/asm/mmu.h | 1 +
> arch/arm64/include/asm/tlbflush.h | 17 +++++++
> arch/arm64/kernel/cpu_errata.c | 19 ++++++++
> arch/arm64/kernel/entry-common.c | 3 ++
> arch/arm64/kernel/fpsimd.c | 81 +++++++++++++++++++++++++++++++
> arch/arm64/kernel/process.c | 7 +++
> arch/arm64/tools/cpucaps | 1 +
> 11 files changed, 174 insertions(+)
[...]
> @@ -575,6 +576,14 @@ static const struct midr_range erratum_spec_ssbs_list[] = {
> };
> #endif
>
> +#ifdef CONFIG_ARM64_ERRATUM_SME_DVMSYNC
> +static void cpu_enable_sme_dvmsync(const struct arm64_cpu_capabilities *__unused)
> +{
> + if (this_cpu_has_cap(ARM64_WORKAROUND_SME_DVMSYNC))
> + sme_enable_dvmsync();
> +}
> +#endif
> +
> #ifdef CONFIG_AMPERE_ERRATUM_AC03_CPU_38
> static const struct midr_range erratum_ac03_cpu_38_list[] = {
> MIDR_ALL_VERSIONS(MIDR_AMPERE1),
> @@ -901,6 +910,16 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
> .matches = need_arm_si_l1_workaround_4311569,
> },
> #endif
> +#ifdef CONFIG_ARM64_ERRATUM_SME_DVMSYNC
> + {
> + .desc = "C1-Pro SME DVMSync early acknowledgement",
> + .capability = ARM64_WORKAROUND_SME_DVMSYNC,
> + .cpu_enable = cpu_enable_sme_dvmsync,
> > + /* C1-Pro r0p0 - r1p2 (the latter only when REVIDR_EL1[0]==0) */
> + ERRATA_MIDR_RANGE(MIDR_C1_PRO, 0, 0, 1, 2),
> + MIDR_FIXED(MIDR_CPU_VAR_REV(1, 2), BIT(0)),
> + },
> +#endif
An alternative to this workaround is just to disable SME entirely, perhaps
by passing 'arm64.nosme' on the cmdline. Maybe we should disable the
workaround in that case?
> @@ -1358,6 +1360,85 @@ void do_sve_acc(unsigned long esr, struct pt_regs *regs)
> put_cpu_fpsimd_context();
> }
>
> +#ifdef CONFIG_ARM64_ERRATUM_SME_DVMSYNC
> +
> +/*
> + * SME/CME erratum handling
> + */
> +static cpumask_var_t sme_dvmsync_cpus;
> +static cpumask_var_t sme_active_cpus;
> +
> +void sme_set_active(unsigned int cpu)
> +{
> + if (!cpus_have_final_cap(ARM64_WORKAROUND_SME_DVMSYNC))
> + return;
> + if (!cpumask_test_cpu(cpu, sme_dvmsync_cpus))
> + return;
> +
> > + if (!test_bit(ilog2(MMCF_SME_DVMSYNC), &current->mm->context.flags))
> > + set_bit(ilog2(MMCF_SME_DVMSYNC), &current->mm->context.flags);
> +
> + cpumask_set_cpu(cpu, sme_active_cpus);
> +
> + /*
> + * Ensure subsequent (SME) memory accesses are observed after the
> + * cpumask and the MMCF_SME_DVMSYNC flag setting.
> + */
> + smp_mb();
I can't convince myself that a DMB is enough here, as the whole issue
is that the SME memory accesses can be observed _after_ the TLB
invalidation. I'd have thought we'd need a DSB to ensure that the flag
updates are visible before the exception return.
> +void sme_do_dvmsync(void)
> +{
> + /*
> + * This is called from the TLB maintenance functions after the DSB ISH
> + * to send hardware DVMSync message. If this CPU sees the mask as
> + * empty, the remote CPU executing sme_set_active() would have seen
> + * the DVMSync and no IPI required.
> + */
> + if (cpumask_empty(sme_active_cpus))
> + return;
> +
> + preempt_disable();
> + smp_call_function_many(sme_active_cpus, sme_dvmsync_ipi, NULL, true);
> + preempt_enable();
> +}
Why do we care about all CPUs using SME, rather than limiting it to the
set of CPUs using SME with the mm we've invalidated? This looks like it
will result in unnecessary cross-calls when multiple tasks are using SME
(especially as the mm flag is only cleared on fork).
Will
* Re: [PATCH 4/4] KVM: arm64: Add SMC hook for SME dvmsync erratum
2026-03-02 16:57 ` [PATCH 4/4] KVM: arm64: Add SMC hook for SME dvmsync erratum Catalin Marinas
@ 2026-03-05 14:32 ` Will Deacon
2026-03-06 12:52 ` Catalin Marinas
0 siblings, 1 reply; 24+ messages in thread
From: Will Deacon @ 2026-03-05 14:32 UTC (permalink / raw)
To: Catalin Marinas
Cc: linux-arm-kernel, Marc Zyngier, Oliver Upton, Lorenzo Pieralisi,
Sudeep Holla, James Morse, Mark Rutland, Mark Brown, kvmarm
On Mon, Mar 02, 2026 at 04:57:57PM +0000, Catalin Marinas wrote:
> From: James Morse <james.morse@arm.com>
>
> C1-Pro cores with SME have an erratum where TLBI+DSB does not complete
> all outstanding SME accesses. Instead, a DSB needs to be executed on the
> affected CPUs. The implication is that pages cannot be unmapped from the
> host stage 2 and then provided to the guest, as host SME accesses may
> still occur after this point.
>
> This erratum breaks pKVM's guarantees, and the workaround is hard to
> implement as EL2 and EL1 share a security state, meaning EL1 can mask
> IPIs sent by EL2, leading to interrupt blackouts.
>
> Instead, do this in EL3. This has the advantage of a separate security
> state, meaning a lower EL cannot mask the IPI. It is also simpler for EL3
> to know about CPUs that are off or in PSCI's CPU_SUSPEND.
>
> Add the needed hook.
>
> Signed-off-by: James Morse <james.morse@arm.com>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Oliver Upton <oupton@kernel.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Lorenzo Pieralisi <lpieralisi@kernel.org>
> Cc: Sudeep Holla <sudeep.holla@kernel.org>
> ---
> arch/arm64/kvm/hyp/nvhe/mem_protect.c | 17 +++++++++++++++++
> include/linux/arm-smccc.h | 5 +++++
> 2 files changed, 22 insertions(+)
>
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index 38f66a56a766..ab7f9273fddf 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -5,6 +5,8 @@
> */
>
> #include <linux/kvm_host.h>
> +#include <linux/arm-smccc.h>
> +
> #include <asm/kvm_emulate.h>
> #include <asm/kvm_hyp.h>
> #include <asm/kvm_mmu.h>
> @@ -28,6 +30,15 @@ static struct hyp_pool host_s2_pool;
> static DEFINE_PER_CPU(struct pkvm_hyp_vm *, __current_vm);
> #define current_vm (*this_cpu_ptr(&__current_vm))
>
> +static void pkvm_sme_dvmsync_fw_call(void)
> +{
> + if (cpus_have_final_cap(ARM64_WORKAROUND_SME_DVMSYNC)) {
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_smc(ARM_SMCCC_CPU_SME_DVMSYNC_WORKAROUND, &res);
> + }
> +}
> +
> static void guest_lock_component(struct pkvm_hyp_vm *vm)
> {
> hyp_spin_lock(&vm->lock);
> @@ -553,6 +564,12 @@ int host_stage2_set_owner_locked(phys_addr_t addr, u64 size, u8 owner_id)
> if (ret)
> return ret;
>
> + /*
> + * After stage2 maintenance has happened, but before the page owner has
> + * changed.
> + */
> + pkvm_sme_dvmsync_fw_call();
Please note that this will conflict with my patch series adding support
for protected memory with pkvm. I _think_ the right answer is to
move this call into host_stage2_set_owner_metadata_locked().
Will
* Re: [PATCH 2/4] arm64: tlb: Pass the corresponding mm to __tlbi_sync_s1ish()
2026-03-02 16:57 ` [PATCH 2/4] arm64: tlb: Pass the corresponding mm to __tlbi_sync_s1ish() Catalin Marinas
@ 2026-03-05 14:33 ` Will Deacon
2026-03-05 19:19 ` Catalin Marinas
0 siblings, 1 reply; 24+ messages in thread
From: Will Deacon @ 2026-03-05 14:33 UTC (permalink / raw)
To: Catalin Marinas
Cc: linux-arm-kernel, Marc Zyngier, Oliver Upton, Lorenzo Pieralisi,
Sudeep Holla, James Morse, Mark Rutland, Mark Brown, kvmarm
On Mon, Mar 02, 2026 at 04:57:55PM +0000, Catalin Marinas wrote:
> The mm structure will be used for workarounds that need limiting to
> specific tasks.
>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> ---
> arch/arm64/include/asm/tlbflush.h | 10 +++++-----
> arch/arm64/kernel/sys_compat.c | 2 +-
> 2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 19be0f7bfca5..14f116bfec73 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -185,7 +185,7 @@ do { \
> * Complete broadcast TLB maintenance issued by the host which invalidates
> * stage 1 information in the host's own translation regime.
> */
> -static inline void __tlbi_sync_s1ish(void)
> +static inline void __tlbi_sync_s1ish(struct mm_struct *mm)
> {
> dsb(ish);
> __repeat_tlbi_sync(vale1is, 0);
> @@ -317,7 +317,7 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
> asid = __TLBI_VADDR(0, ASID(mm));
> __tlbi(aside1is, asid);
> __tlbi_user(aside1is, asid);
> - __tlbi_sync_s1ish();
> + __tlbi_sync_s1ish(mm);
> mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
> }
>
> @@ -371,7 +371,7 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
> unsigned long uaddr)
> {
> flush_tlb_page_nosync(vma, uaddr);
> - __tlbi_sync_s1ish();
> + __tlbi_sync_s1ish(vma->vm_mm);
> }
>
> static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> @@ -391,7 +391,7 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> */
> static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> {
> - __tlbi_sync_s1ish();
> + __tlbi_sync_s1ish(NULL);
Hmm, it seems a bit rubbish to pass NULL here as that means that we'll
deploy the mitigation regardless of the mm flags when finishing the
batch.
It also looks like we could end up doing the workaround multiple times
if arch_tlbbatch_add_pending() is passed a large enough region that the
__flush_tlb_range_limit_excess() check fires.
So perhaps we should stash the mm in 'struct arch_tlbflush_unmap_batch'
alongside some state to track whether or not we have uncompleted TLB
maintenance in flight?
Will
* Re: [PATCH 2/4] arm64: tlb: Pass the corresponding mm to __tlbi_sync_s1ish()
2026-03-05 14:33 ` Will Deacon
@ 2026-03-05 19:19 ` Catalin Marinas
2026-03-06 11:15 ` Catalin Marinas
0 siblings, 1 reply; 24+ messages in thread
From: Catalin Marinas @ 2026-03-05 19:19 UTC (permalink / raw)
To: Will Deacon
Cc: linux-arm-kernel, Marc Zyngier, Oliver Upton, Lorenzo Pieralisi,
Sudeep Holla, James Morse, Mark Rutland, Mark Brown, kvmarm
On Thu, Mar 05, 2026 at 02:33:18PM +0000, Will Deacon wrote:
> On Mon, Mar 02, 2026 at 04:57:55PM +0000, Catalin Marinas wrote:
> > The mm structure will be used for workarounds that need limiting to
> > specific tasks.
> >
> > Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Mark Rutland <mark.rutland@arm.com>
> > ---
> > arch/arm64/include/asm/tlbflush.h | 10 +++++-----
> > arch/arm64/kernel/sys_compat.c | 2 +-
> > 2 files changed, 6 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> > index 19be0f7bfca5..14f116bfec73 100644
> > --- a/arch/arm64/include/asm/tlbflush.h
> > +++ b/arch/arm64/include/asm/tlbflush.h
> > @@ -185,7 +185,7 @@ do { \
> > * Complete broadcast TLB maintenance issued by the host which invalidates
> > * stage 1 information in the host's own translation regime.
> > */
> > -static inline void __tlbi_sync_s1ish(void)
> > +static inline void __tlbi_sync_s1ish(struct mm_struct *mm)
> > {
> > dsb(ish);
> > __repeat_tlbi_sync(vale1is, 0);
> > @@ -317,7 +317,7 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
> > asid = __TLBI_VADDR(0, ASID(mm));
> > __tlbi(aside1is, asid);
> > __tlbi_user(aside1is, asid);
> > - __tlbi_sync_s1ish();
> > + __tlbi_sync_s1ish(mm);
> > mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
> > }
> >
> > @@ -371,7 +371,7 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
> > unsigned long uaddr)
> > {
> > flush_tlb_page_nosync(vma, uaddr);
> > - __tlbi_sync_s1ish();
> > + __tlbi_sync_s1ish(vma->vm_mm);
> > }
> >
> > static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> > @@ -391,7 +391,7 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> > */
> > static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> > {
> > - __tlbi_sync_s1ish();
> > + __tlbi_sync_s1ish(NULL);
>
> Hmm, it seems a bit rubbish to pass NULL here as that means that we'll
> deploy the mitigation regardless of the mm flags when finishing the
> batch.
>
> It also looks like we could end up doing the workaround multiple times
> if arch_tlbbatch_add_pending() is passed a large enough region that the
> __flush_tlb_range_limit_excess() check fires.
>
> So perhaps we should stash the mm in 'struct arch_tlbflush_unmap_batch'
> alongside some state to track whether or not we have uncompleted TLB
> maintenance in flight?
The problem is that arch_tlbbatch_flush() can be called to synchronise
multiple mm structures that were touched by TTU. We can't have the mm in
arch_tlbflush_unmap_batch. But we can track if any of the mms had
MMCF_SME_DVMSYNC flag set, something like below (needs testing, tidying
up). TBH, I did not notice any problem in benchmarking as I guess we
haven't exercised the TTU path much, so did not bother to optimise it.
For the TTU case, I don't think we need to worry about the excess limit
and doing the IPI twice. But I'll double check the code paths tomorrow.
diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
index fedb0b87b8db..e756eaca6cb8 100644
--- a/arch/arm64/include/asm/tlbbatch.h
+++ b/arch/arm64/include/asm/tlbbatch.h
@@ -7,6 +7,8 @@ struct arch_tlbflush_unmap_batch {
* For arm64, HW can do tlb shootdown, so we don't
* need to record cpumask for sending IPI
*/
+
+ bool sme_dvmsync;
};
#endif /* _ARCH_ARM64_TLBBATCH_H */
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index e3ea0246a4f4..c1141a684854 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -201,10 +201,15 @@ do { \
* Complete broadcast TLB maintenance issued by the host which invalidates
* stage 1 information in the host's own translation regime.
*/
-static inline void __tlbi_sync_s1ish(struct mm_struct *mm)
+static inline void __tlbi_sync_s1ish_no_sme_dvmsync(void)
{
dsb(ish);
__repeat_tlbi_sync(vale1is, 0);
+}
+
+static inline void __tlbi_sync_s1ish(struct mm_struct *mm)
+{
+ __tlbi_sync_s1ish_no_sme_dvmsync();
sme_dvmsync(mm);
}
@@ -408,7 +413,11 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
*/
static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
- __tlbi_sync_s1ish(NULL);
+ __tlbi_sync_s1ish_no_sme_dvmsync();
+ if (batch->sme_dvmsync) {
+ batch->sme_dvmsync = false;
+ sme_dvmsync(NULL);
+ }
}
/*
@@ -613,6 +622,8 @@ static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *b
struct mm_struct *mm, unsigned long start, unsigned long end)
{
__flush_tlb_range_nosync(mm, start, end, PAGE_SIZE, true, 3);
+ if (test_bit(ilog2(MMCF_SME_DVMSYNC), &mm->context.flags))
+ batch->sme_dvmsync = true;
}
static inline bool __pte_flags_need_flush(ptdesc_t oldval, ptdesc_t newval)
* Re: [PATCH 2/4] arm64: tlb: Pass the corresponding mm to __tlbi_sync_s1ish()
2026-03-05 19:19 ` Catalin Marinas
@ 2026-03-06 11:15 ` Catalin Marinas
2026-03-12 15:00 ` Will Deacon
0 siblings, 1 reply; 24+ messages in thread
From: Catalin Marinas @ 2026-03-06 11:15 UTC (permalink / raw)
To: Will Deacon
Cc: linux-arm-kernel, Marc Zyngier, Oliver Upton, Lorenzo Pieralisi,
Sudeep Holla, James Morse, Mark Rutland, Mark Brown, kvmarm
On Thu, Mar 05, 2026 at 07:19:15PM +0000, Catalin Marinas wrote:
> On Thu, Mar 05, 2026 at 02:33:18PM +0000, Will Deacon wrote:
> > On Mon, Mar 02, 2026 at 04:57:55PM +0000, Catalin Marinas wrote:
> > > @@ -391,7 +391,7 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> > > */
> > > static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> > > {
> > > - __tlbi_sync_s1ish();
> > > + __tlbi_sync_s1ish(NULL);
> >
> > Hmm, it seems a bit rubbish to pass NULL here as that means that we'll
> > deploy the mitigation regardless of the mm flags when finishing the
> > batch.
> >
> > It also looks like we could end up doing the workaround multiple times
> > if arch_tlbbatch_add_pending() is passed a large enough region that the
> > __flush_tlb_range_limit_excess() check fires.
> >
> > So perhaps we should stash the mm in 'struct arch_tlbflush_unmap_batch'
> > alongside some state to track whether or not we have uncompleted TLB
> > maintenance in flight?
>
> The problem is that arch_tlbbatch_flush() can be called to synchronise
> multiple mm structures that were touched by TTU. We can't have the mm in
> arch_tlbflush_unmap_batch. But we can track if any of the mms had
> MMCF_SME_DVMSYNC flag set, something like below (needs testing, tidying
> up). TBH, I did not notice any problem in benchmarking as I guess we
> haven't exercised the TTU path much, so did not bother to optimise it.
>
> For the TTU case, I don't think we need to worry about the excess limit
> and doing the IPI twice. But I'll double check the code paths tomorrow.
>
> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
> index fedb0b87b8db..e756eaca6cb8 100644
> --- a/arch/arm64/include/asm/tlbbatch.h
> +++ b/arch/arm64/include/asm/tlbbatch.h
> @@ -7,6 +7,8 @@ struct arch_tlbflush_unmap_batch {
> * For arm64, HW can do tlb shootdown, so we don't
> * need to record cpumask for sending IPI
> */
> +
> + bool sme_dvmsync;
> };
>
> #endif /* _ARCH_ARM64_TLBBATCH_H */
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index e3ea0246a4f4..c1141a684854 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -201,10 +201,15 @@ do { \
> * Complete broadcast TLB maintenance issued by the host which invalidates
> * stage 1 information in the host's own translation regime.
> */
> -static inline void __tlbi_sync_s1ish(struct mm_struct *mm)
> +static inline void __tlbi_sync_s1ish_no_sme_dvmsync(void)
> {
> dsb(ish);
> __repeat_tlbi_sync(vale1is, 0);
> +}
> +
> +static inline void __tlbi_sync_s1ish(struct mm_struct *mm)
> +{
> + __tlbi_sync_s1ish_no_sme_dvmsync();
> sme_dvmsync(mm);
> }
>
> @@ -408,7 +413,11 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> */
> static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> {
> - __tlbi_sync_s1ish(NULL);
> + __tlbi_sync_s1ish_no_sme_dvmsync();
> + if (batch->sme_dvmsync) {
> + batch->sme_dvmsync = false;
> + sme_dvmsync(NULL);
> + }
> }
>
> /*
> @@ -613,6 +622,8 @@ static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *b
> struct mm_struct *mm, unsigned long start, unsigned long end)
> {
> __flush_tlb_range_nosync(mm, start, end, PAGE_SIZE, true, 3);
> + if (test_bit(ilog2(MMCF_SME_DVMSYNC), &mm->context.flags))
> + batch->sme_dvmsync = true;
> }
While writing a reply to your other comments, I realised why this
wouldn't work (I had something similar but dropped it) - we can have the
flag cleared here (or mm_cpumask() if we are to track per-mm) but we
have not issued the DVMSync yet. The task may start using SME before
arch_tlbbatch_flush() and we would simply miss it. Any check on whether to
issue the IPI, such as reading the flags, needs to happen after the DVMSync.
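To illustrate (a hypothetical interleaving, using the names from the diff above), sampling the flag at arch_tlbbatch_add_pending() time can miss a task that starts using SME afterwards:

```
CPU0 (task T, user)                 CPU1 (reclaim)
-------------------                 ------------------------------------
                                    arch_tlbbatch_add_pending(mm)
                                      reads MMCF_SME_DVMSYNC: clear
                                      -> batch->sme_dvmsync stays false
T traps, sme_set_active():
  sets MMCF_SME_DVMSYNC
  starts SME accesses to page P
                                    arch_tlbbatch_flush()
                                      TLBI; DSB      (DVMSync sent)
                                      batch->sme_dvmsync == false
                                      -> no IPI; SME accesses to P may
                                         still be in flight
```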
Anyway, more on the next patch where you asked about the DMB.
--
Catalin
* Re: [PATCH 3/4] arm64: errata: Work around early CME DVMSync acknowledgement
2026-03-05 14:32 ` Will Deacon
@ 2026-03-06 12:00 ` Catalin Marinas
2026-03-06 12:19 ` Catalin Marinas
2026-03-09 10:13 ` Vladimir Murzin
0 siblings, 2 replies; 24+ messages in thread
From: Catalin Marinas @ 2026-03-06 12:00 UTC (permalink / raw)
To: Will Deacon
Cc: linux-arm-kernel, Marc Zyngier, Oliver Upton, Lorenzo Pieralisi,
Sudeep Holla, James Morse, Mark Rutland, Mark Brown, kvmarm
On Thu, Mar 05, 2026 at 02:32:11PM +0000, Will Deacon wrote:
> On Mon, Mar 02, 2026 at 04:57:56PM +0000, Catalin Marinas wrote:
> > C1-Pro acknowledges DVMSync messages before completing the SME/CME
> > memory accesses. Work around this by issuing an IPI+DSB to the affected
> > CPUs if they are running in EL0 with SME enabled.
>
> Just to make sure I understand the implications: this _only_ applies
> to explicit memory accesses from the SME unit and not, for example, to
> page-table walks initiated by SME instructions?
Yes, only explicit accesses from the SME unit (CME).
> > @@ -575,6 +576,14 @@ static const struct midr_range erratum_spec_ssbs_list[] = {
> > };
> > #endif
> >
> > +#ifdef CONFIG_ARM64_ERRATUM_SME_DVMSYNC
> > +static void cpu_enable_sme_dvmsync(const struct arm64_cpu_capabilities *__unused)
> > +{
> > + if (this_cpu_has_cap(ARM64_WORKAROUND_SME_DVMSYNC))
> > + sme_enable_dvmsync();
> > +}
> > +#endif
> > +
> > #ifdef CONFIG_AMPERE_ERRATUM_AC03_CPU_38
> > static const struct midr_range erratum_ac03_cpu_38_list[] = {
> > MIDR_ALL_VERSIONS(MIDR_AMPERE1),
> > @@ -901,6 +910,16 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
> > .matches = need_arm_si_l1_workaround_4311569,
> > },
> > #endif
> > +#ifdef CONFIG_ARM64_ERRATUM_SME_DVMSYNC
> > + {
> > + .desc = "C1-Pro SME DVMSync early acknowledgement",
> > + .capability = ARM64_WORKAROUND_SME_DVMSYNC,
> > + .cpu_enable = cpu_enable_sme_dvmsync,
> > > + /* C1-Pro r0p0 - r1p2 (the latter only when REVIDR_EL1[0]==0) */
> > + ERRATA_MIDR_RANGE(MIDR_C1_PRO, 0, 0, 1, 2),
> > + MIDR_FIXED(MIDR_CPU_VAR_REV(1, 2), BIT(0)),
> > + },
> > +#endif
>
> An alternative to this workaround is just to disable SME entirely, perhaps
> by passing 'arm64.nosme' on the cmdline. Maybe we should disable the
> workaround in that case?
Good point, the workaround isn't necessary if SME is off. I can add
an extra check, though given that no-one would run in user-space with
TIF_SME, the overhead is only in the sme_active_cpus mask checking.
> > @@ -1358,6 +1360,85 @@ void do_sve_acc(unsigned long esr, struct pt_regs *regs)
> > put_cpu_fpsimd_context();
> > }
> >
> > +#ifdef CONFIG_ARM64_ERRATUM_SME_DVMSYNC
> > +
> > +/*
> > + * SME/CME erratum handling
> > + */
> > +static cpumask_var_t sme_dvmsync_cpus;
> > +static cpumask_var_t sme_active_cpus;
> > +
> > +void sme_set_active(unsigned int cpu)
> > +{
> > + if (!cpus_have_final_cap(ARM64_WORKAROUND_SME_DVMSYNC))
> > + return;
> > + if (!cpumask_test_cpu(cpu, sme_dvmsync_cpus))
> > + return;
> > +
> > > + if (!test_bit(ilog2(MMCF_SME_DVMSYNC), &current->mm->context.flags))
> > > + set_bit(ilog2(MMCF_SME_DVMSYNC), &current->mm->context.flags);
> > +
> > + cpumask_set_cpu(cpu, sme_active_cpus);
> > +
> > + /*
> > + * Ensure subsequent (SME) memory accesses are observed after the
> > + * cpumask and the MMCF_SME_DVMSYNC flag setting.
> > + */
> > + smp_mb();
>
> I can't convince myself that a DMB is enough here, as the whole issue
> is that the SME memory accesses can be observed _after_ the TLB
> invalidation. I'd have thought we'd need a DSB to ensure that the flag
> updates are visible before the exception return.
This is only to ensure that the sme_active_cpus mask is observed before
any SME accesses. The mask is later used to decide whether to send the
IPI. We have something like this:
P0
STSET [sme_active_cpus]
DMB
SME access to [addr]
P1
TLBI [addr]
DSB
LDR [sme_active_cpus]
CBZ out
Do IPI
out:
If P1 did not observe the STSET to [sme_active_cpus], P0 should have
received and acknowledged the DVMSync before the STSET. Is your concern
that P1 can observe the subsequent SME access but not the STSET?
No idea whether herd can model this (I only put this in TLA+ for the
main logic check but it doesn't do subtle memory ordering).
> > +void sme_do_dvmsync(void)
> > +{
> > + /*
> > + * This is called from the TLB maintenance functions after the DSB ISH
> > + * to send hardware DVMSync message. If this CPU sees the mask as
> > + * empty, the remote CPU executing sme_set_active() would have seen
> > + * the DVMSync and no IPI required.
> > + */
> > + if (cpumask_empty(sme_active_cpus))
> > + return;
> > +
> > + preempt_disable();
> > + smp_call_function_many(sme_active_cpus, sme_dvmsync_ipi, NULL, true);
> > + preempt_enable();
> > +}
>
> Why do we care about all CPUs using SME, rather than limiting it to the
> set of CPUs using SME with the mm we've invalidated? This looks like it
> will result in unnecessary cross-calls when multiple tasks are using SME
> (especially as the mm flag is only cleared on fork).
Yes, it's a possibility but I traded it for simplicity. We also have the
TTU case where we don't have an mm and we don't want to broadcast to all
CPUs either, hence an sme_active_cpus mask. As I just replied on patch
2, for the TLB batching we wouldn't be able to use a cpumask in the
batching structure since, per the ordering above, we need the DVMSync
before checking if/where to send the IPI.
For the typical TLBI (not TTU), we can track a per-mm mask passed down
to this function (I have patches doing this but it didn't make a
significant difference in benchmarks). However, for upstream we may want
to use mm_cpumask() for something else in the future (FEAT_TLBID; work
in progress), so we should probably add a different mask. Well, C1-Pro
doesn't support FEAT_TLBID, so we could disable the workaround and use
the same mm_cpumask(); it just gets messier.
We can keep the sme_active_cpus for the TTU case and easily add a
cpumask for the other TLBI cases where we have the mm. Is it worth it? I
don't think so if the only SME apps are some benchmarks ;).
(for the Android backports I did not want to break KMI by expanding
kernel data structures; of course, not a concern for mainline but the
only users of this workaround are more likely to be using GKI than upstream)
--
Catalin
* Re: [PATCH 3/4] arm64: errata: Work around early CME DVMSync acknowledgement
2026-03-06 12:00 ` Catalin Marinas
@ 2026-03-06 12:19 ` Catalin Marinas
2026-03-09 10:13 ` Vladimir Murzin
1 sibling, 0 replies; 24+ messages in thread
From: Catalin Marinas @ 2026-03-06 12:19 UTC (permalink / raw)
To: Will Deacon
Cc: linux-arm-kernel, Marc Zyngier, Oliver Upton, Lorenzo Pieralisi,
Sudeep Holla, James Morse, Mark Rutland, Mark Brown, kvmarm
On Fri, Mar 06, 2026 at 12:00:30PM +0000, Catalin Marinas wrote:
> On Thu, Mar 05, 2026 at 02:32:11PM +0000, Will Deacon wrote:
> > On Mon, Mar 02, 2026 at 04:57:56PM +0000, Catalin Marinas wrote:
> > > +void sme_do_dvmsync(void)
> > > +{
> > > + /*
> > > + * This is called from the TLB maintenance functions after the DSB ISH
> > > + * to send hardware DVMSync message. If this CPU sees the mask as
> > > + * empty, the remote CPU executing sme_set_active() would have seen
> > > + * the DVMSync and no IPI required.
> > > + */
> > > + if (cpumask_empty(sme_active_cpus))
> > > + return;
> > > +
> > > + preempt_disable();
> > > + smp_call_function_many(sme_active_cpus, sme_dvmsync_ipi, NULL, true);
> > > + preempt_enable();
> > > +}
> >
> > Why do we care about all CPUs using SME, rather than limiting it to the
> > set of CPUs using SME with the mm we've invalidated? This looks like it
> > will result in unnecessary cross-calls when multiple tasks are using SME
> > (especially as the mm flag is only cleared on fork).
>
> Yes, it's a possibility but I traded it for simplicity. We also have the
> TTU case where we don't have an mm and we don't want to broadcast to all
> CPUs either, hence an sme_active_cpus mask. As I just replied on patch
> 2, for the TLB batching we wouldn't be able to use a cpumask in the
> batching structure since, per the ordering above, we need the DVMSync
> before checking if/where to send the IPI to.
>
> For the typical TLBI (not TTU), we can track a per-mm mask passed down
> to this function (I have patches doing this but it didn't make a
> significant difference in benchmarks).
Reusing the current mm_cpumask(), something like below. We could also
scrap the MMCF_SME_DVMSYNC flag, though we would then end up always calling
sme_do_dvmsync() and checking the mask, probably more expensive than a
flag check.
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index e3ea0246a4f4..2c77ca41cb14 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -81,7 +81,7 @@ static inline unsigned long get_trans_granule(void)
}
#ifdef CONFIG_ARM64_ERRATUM_SME_DVMSYNC
-void sme_do_dvmsync(void);
+void sme_do_dvmsync(struct mm_struct *mm);
static inline void sme_dvmsync(struct mm_struct *mm)
{
@@ -90,7 +90,7 @@ static inline void sme_dvmsync(struct mm_struct *mm)
if (mm && !test_bit(ilog2(MMCF_SME_DVMSYNC), &mm->context.flags))
return;
- sme_do_dvmsync();
+ sme_do_dvmsync(mm);
}
#else
static inline void sme_dvmsync(struct mm_struct *mm) { }
diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
index 90015fc29722..37e215cd0f39 100644
--- a/arch/arm64/kernel/fpsimd.c
+++ b/arch/arm64/kernel/fpsimd.c
@@ -1378,6 +1378,7 @@ void sme_set_active(unsigned int cpu)
if (!test_bit(ilog2(MMCF_SME_DVMSYNC), &current->mm->context.flags))
set_bit(ilog2(MMCF_SME_DVMSYNC), &current->mm->context.flags);
+ cpumask_set_cpu(cpu, mm_cpumask(current->mm));
cpumask_set_cpu(cpu, sme_active_cpus);
/*
@@ -1398,6 +1399,7 @@ void sme_clear_active(unsigned int cpu)
* With SCTLR_EL1.IESB enabled, the SME memory transactions are
* completed on entering EL1.
*/
+ cpumask_clear_cpu(cpu, mm_cpumask(current->mm));
cpumask_clear_cpu(cpu, sme_active_cpus);
}
@@ -1410,19 +1412,25 @@ static void sme_dvmsync_ipi(void *unused)
*/
}
-void sme_do_dvmsync(void)
+void sme_do_dvmsync(struct mm_struct *mm)
{
/*
* This is called from the TLB maintenance functions after the DSB ISH
* to send hardware DVMSync message. If this CPU sees the mask as
* empty, the remote CPU executing sme_set_active() would have seen
* the DVMSync and no IPI required.
+ *
+ * When an mm is provided, limit the IPI to CPUs that are actively
+ * running SME code for that mm (recorded in mm_cpumask()), otherwise
+ * fall back to the global sme_active_cpus mask.
*/
- if (cpumask_empty(sme_active_cpus))
+ const struct cpumask *mask = mm ? mm_cpumask(mm) : sme_active_cpus;
+
+ if (cpumask_empty(mask))
return;
preempt_disable();
- smp_call_function_many(sme_active_cpus, sme_dvmsync_ipi, NULL, true);
+ smp_call_function_many(mask, sme_dvmsync_ipi, NULL, true);
preempt_enable();
}
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 4/4] KVM: arm64: Add SMC hook for SME dvmsync erratum
2026-03-05 14:32 ` Will Deacon
@ 2026-03-06 12:52 ` Catalin Marinas
0 siblings, 0 replies; 24+ messages in thread
From: Catalin Marinas @ 2026-03-06 12:52 UTC (permalink / raw)
To: Will Deacon
Cc: linux-arm-kernel, Marc Zyngier, Oliver Upton, Lorenzo Pieralisi,
Sudeep Holla, James Morse, Mark Rutland, Mark Brown, kvmarm
On Thu, Mar 05, 2026 at 02:32:54PM +0000, Will Deacon wrote:
> On Mon, Mar 02, 2026 at 04:57:57PM +0000, Catalin Marinas wrote:
> > @@ -28,6 +30,15 @@ static struct hyp_pool host_s2_pool;
> > static DEFINE_PER_CPU(struct pkvm_hyp_vm *, __current_vm);
> > #define current_vm (*this_cpu_ptr(&__current_vm))
> >
> > +static void pkvm_sme_dvmsync_fw_call(void)
> > +{
> > + if (cpus_have_final_cap(ARM64_WORKAROUND_SME_DVMSYNC)) {
> > + struct arm_smccc_res res;
> > +
> > + arm_smccc_1_1_smc(ARM_SMCCC_CPU_SME_DVMSYNC_WORKAROUND, &res);
> > + }
> > +}
> > +
> > static void guest_lock_component(struct pkvm_hyp_vm *vm)
> > {
> > hyp_spin_lock(&vm->lock);
> > @@ -553,6 +564,12 @@ int host_stage2_set_owner_locked(phys_addr_t addr, u64 size, u8 owner_id)
> > if (ret)
> > return ret;
> >
> > + /*
> > + * After stage2 maintenance has happened, but before the page owner has
> > + * changed.
> > + */
> > + pkvm_sme_dvmsync_fw_call();
>
> Please note that this will conflict with my patch series adding support
> for protected memory with pkvm. I _think_ the right answer is to
> move this call into host_stage2_set_owner_metadata_locked().
Yes, it needs to be after host_stage2_try(), so it makes sense to move
it to host_stage2_set_owner_metadata_locked(). Let's see which order the
patches go in; we may have to fix up the conflict during the merge.
We can also leave the pKVM workaround for later once your rework goes
in.
--
Catalin
* Re: [PATCH 3/4] arm64: errata: Work around early CME DVMSync acknowledgement
2026-03-06 12:00 ` Catalin Marinas
2026-03-06 12:19 ` Catalin Marinas
@ 2026-03-09 10:13 ` Vladimir Murzin
2026-03-10 15:35 ` Catalin Marinas
1 sibling, 1 reply; 24+ messages in thread
From: Vladimir Murzin @ 2026-03-09 10:13 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon
Cc: linux-arm-kernel, Marc Zyngier, Oliver Upton, Lorenzo Pieralisi,
Sudeep Holla, James Morse, Mark Rutland, Mark Brown, kvmarm
Hi All,
On 3/6/26 12:00, Catalin Marinas wrote:
>>> @@ -1358,6 +1360,85 @@ void do_sve_acc(unsigned long esr, struct pt_regs *regs)
>>> put_cpu_fpsimd_context();
>>> }
>>>
>>> +#ifdef CONFIG_ARM64_ERRATUM_SME_DVMSYNC
>>> +
>>> +/*
>>> + * SME/CME erratum handling
>>> + */
>>> +static cpumask_var_t sme_dvmsync_cpus;
>>> +static cpumask_var_t sme_active_cpus;
>>> +
>>> +void sme_set_active(unsigned int cpu)
>>> +{
>>> + if (!cpus_have_final_cap(ARM64_WORKAROUND_SME_DVMSYNC))
>>> + return;
>>> + if (!cpumask_test_cpu(cpu, sme_dvmsync_cpus))
>>> + return;
>>> +
>>> + if (!test_bit(ilog2(MMCF_SME_DVMSYNC), &current->mm->context.flags))
>>> + set_bit(ilog2(MMCF_SME_DVMSYNC), &current->mm->context.flags);
>>> +
>>> + cpumask_set_cpu(cpu, sme_active_cpus);
>>> +
>>> + /*
>>> + * Ensure subsequent (SME) memory accesses are observed after the
>>> + * cpumask and the MMCF_SME_DVMSYNC flag setting.
>>> + */
>>> + smp_mb();
>> I can't convince myself that a DMB is enough here, as the whole issue
>> is that the SME memory accesses can be observed _after_ the TLB
>> invalidation. I'd have thought we'd need a DSB to ensure that the flag
>> updates are visible before the exception return.
> This is only to ensure that the sme_active_cpus mask is observed before
> any SME accesses. The mask is later used to decide whether to send the
> IPI. We have something like this:
>
> P0
> STSET [sme_active_cpus]
> DMB
> SME access to [addr]
>
> P1
> TLBI [addr]
> DSB
> LDR [sme_active_cpus]
> CBZ out
> Do IPI
> out:
>
> If P1 did not observe the STSET to [sme_active_cpus], P0 should have
> received and acknowledged the DVMSync before the STSET. Is your concern
> that P1 can observe the subsequent SME access but not the STSET?
>
> No idea whether herd can model this (I only put this in TLA+ for the
> main logic check but it doesn't do subtle memory ordering).
JFYI, herd support for SME is still work-in-progress (specifically it misses
updates in cat), yet it can model VMSA.
IIUC, expectation here is that either
- P1 observes sme_active_cpus, so we have to do_IPI or
- P0 observes TLBI (say shutdown, so it must fault)
anything else is unexpected/forbidden.
AArch64 A
variant=vmsa
{
int x=0;
int active=0;
0:X1=active;
0:X3=x;
1:X0=(valid:0);
1:X1=PTE(x);
1:X2=x;
1:X3=active;
}
P0 | P1 ;
MOV W0,#1 | STR X0,[X1] ;
STR W0,[X1] (* sme_active_cpus *) | DSB ISH ;
DMB SY | LSR X9,X2,#12 ;
LDR W2,[X3] (* access to [addr] *) | TLBI VAAE1IS,X9 (* [addr] *) ;
| DSB ISH ;
| LDR W4,[X3] (* sme_active_cpus *) ;
exists ~(1:X4=1 \/ fault(P0,x))
Is that correct understanding? Have I missed anything?
Cheers
Vladimir
* Re: [PATCH 1/4] arm64: tlb: Use __tlbi_sync_s1ish_kernel() for kernel TLB maintenance
2026-03-05 11:27 ` Catalin Marinas
@ 2026-03-09 12:12 ` Mark Rutland
0 siblings, 0 replies; 24+ messages in thread
From: Mark Rutland @ 2026-03-09 12:12 UTC (permalink / raw)
To: Catalin Marinas
Cc: linux-arm-kernel, Will Deacon, Marc Zyngier, Oliver Upton,
Lorenzo Pieralisi, Sudeep Holla, James Morse, Mark Brown, kvmarm
On Thu, Mar 05, 2026 at 11:27:54AM +0000, Catalin Marinas wrote:
> On Tue, Mar 03, 2026 at 01:12:50PM +0000, Mark Rutland wrote:
> > On Mon, Mar 02, 2026 at 04:57:54PM +0000, Catalin Marinas wrote:
> > > diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> > > index 1416e652612b..19be0f7bfca5 100644
> > > --- a/arch/arm64/include/asm/tlbflush.h
> > > +++ b/arch/arm64/include/asm/tlbflush.h
> > > @@ -191,6 +191,12 @@ static inline void __tlbi_sync_s1ish(void)
> > > __repeat_tlbi_sync(vale1is, 0);
> > > }
> > >
> > > +static inline void __tlbi_sync_s1ish_kernel(void)
> > > +{
> > > + dsb(ish);
> > > + __repeat_tlbi_sync(vale1is, 0);
> > > +}
> > > +
> > > /*
> > > * Complete broadcast TLB maintenance issued by hyp code which invalidates
> > > * stage 1 translation information in any translation regime.
> > > @@ -299,7 +305,7 @@ static inline void flush_tlb_all(void)
> > > {
> > > dsb(ishst);
> > > __tlbi(vmalle1is);
> > > - __tlbi_sync_s1ish();
> > > + __tlbi_sync_s1ish_kernel();
> > > isb();
> > > }
> >
> > The commit message is correct that flush_tlb_all() is only used for
> > kernel mappings today, via flush_tlb_kernel_range(), so this is safe.
>
> Unfortunately, it's also used by the core code -
> hugetlb_vmemmap_restore_folios() (and another function in this file).
Sorry, I missed that.
AFAICT even with that it's only used for kernel mappings, though it's
not clear to me exactly what the xen balloon driver is doing.
> > However, the big comment block around line 200 says:
> >
> > flush_tlb_all()
> > Invalidate the entire TLB (kernel + user) on all CPUs
> >
> > ... and:
> >
> > local_flush_tlb_all()
> > Same as flush_tlb_all(), but only applies to the calling CPU.
> >
> > ... where the latter is used for user mappings (upon ASID overflow), so
> > I think there's some risk of future confusion.
>
> Ignoring this erratum, the statements are still correct for arm64 as it
> flushes both kernel and user, though I see what you mean w.r.t. its
> intended use.
My concern was just in the presence of this erratum, since we skip the
workaround in flush_tlb_all() by calling __tlbi_sync_s1ish_kernel().
As below, I agree this doesn't need to change now.
> > To minimize the risk that flush_tlb_all() gets used for user mappings in
> > future, how about we rename flush_tlb_all() => flush_tlb_kernel_all(), and
> > update those comments:
> >
> > flush_tlb_kernel_all()
> > Invalidate all kernel mappings on all CPUs.
> > Should not be used to invalidate user mappings.
> >
> > local_flush_tlb_all()
> > Invalidate all (kernel + user) mappings on the calling CPU.
> >
> > Note: I chose flush_tlb_kernel_all() rather than flush_tlb_all_kernel() to
> > match the existing __flush_tlb_kernel_{pgtable,range}, with 'kernel' before
> > the operation/scope.
>
> I'm fine to update the comments but, for backporting, I'd not change the
> function name as it will have to touch core code. Ideally we should go
> around and change the other architectures to follow the same semantics
> (I briefly looked at x86 and powerpc and they also seem to use
> flush_tlb_all() only for kernel mappings).
>
> So, I think it's better to do this cleanup separately ;).
That's fair enough, and my ack stands even without any changes.
Mark.
* Re: [PATCH 3/4] arm64: errata: Work around early CME DVMSync acknowledgement
2026-03-09 10:13 ` Vladimir Murzin
@ 2026-03-10 15:35 ` Catalin Marinas
2026-03-12 14:55 ` Will Deacon
0 siblings, 1 reply; 24+ messages in thread
From: Catalin Marinas @ 2026-03-10 15:35 UTC (permalink / raw)
To: Vladimir Murzin
Cc: Will Deacon, linux-arm-kernel, Marc Zyngier, Oliver Upton,
Lorenzo Pieralisi, Sudeep Holla, James Morse, Mark Rutland,
Mark Brown, kvmarm
Thanks Vladimir,
On Mon, Mar 09, 2026 at 10:13:20AM +0000, Vladimir Murzin wrote:
> On 3/6/26 12:00, Catalin Marinas wrote:
> >>> @@ -1358,6 +1360,85 @@ void do_sve_acc(unsigned long esr, struct pt_regs *regs)
> >>> put_cpu_fpsimd_context();
> >>> }
> >>>
> >>> +#ifdef CONFIG_ARM64_ERRATUM_SME_DVMSYNC
> >>> +
> >>> +/*
> >>> + * SME/CME erratum handling
> >>> + */
> >>> +static cpumask_var_t sme_dvmsync_cpus;
> >>> +static cpumask_var_t sme_active_cpus;
> >>> +
> >>> +void sme_set_active(unsigned int cpu)
> >>> +{
> >>> + if (!cpus_have_final_cap(ARM64_WORKAROUND_SME_DVMSYNC))
> >>> + return;
> >>> + if (!cpumask_test_cpu(cpu, sme_dvmsync_cpus))
> >>> + return;
> >>> +
> >>> + if (!test_bit(ilog2(MMCF_SME_DVMSYNC), &current->mm->context.flags))
> >>> + set_bit(ilog2(MMCF_SME_DVMSYNC), &current->mm->context.flags);
> >>> +
> >>> + cpumask_set_cpu(cpu, sme_active_cpus);
> >>> +
> >>> + /*
> >>> + * Ensure subsequent (SME) memory accesses are observed after the
> >>> + * cpumask and the MMCF_SME_DVMSYNC flag setting.
> >>> + */
> >>> + smp_mb();
> >>
> >> I can't convince myself that a DMB is enough here, as the whole issue
> >> is that the SME memory accesses can be observed _after_ the TLB
> >> invalidation. I'd have thought we'd need a DSB to ensure that the flag
> >> updates are visible before the exception return.
> >
> > This is only to ensure that the sme_active_cpus mask is observed before
> > any SME accesses. The mask is later used to decide whether to send the
> > IPI. We have something like this:
> >
> > P0
> > STSET [sme_active_cpus]
> > DMB
> > SME access to [addr]
> >
> > P1
> > TLBI [addr]
> > DSB
> > LDR [sme_active_cpus]
> > CBZ out
> > Do IPI
> > out:
> >
> > If P1 did not observe the STSET to [sme_active_cpus], P0 should have
> > received and acknowledged the DVMSync before the STSET. Is your concern
> > that P1 can observe the subsequent SME access but not the STSET?
> >
> > No idea whether herd can model this (I only put this in TLA+ for the
> > main logic check but it doesn't do subtle memory ordering).
>
> JFYI, herd support for SME is still work-in-progress (specifically it misses
> updates in cat), yet it can model VMSA.
>
> IIUC, expectation here is that either
> - P1 observes sme_active_cpus, so we have to do_IPI or
> - P0 observes TLBI (say shutdown, so it must fault)
>
> anything else is unexpected/forbidden.
>
> AArch64 A
> variant=vmsa
> {
> int x=0;
> int active=0;
>
> 0:X1=active;
> 0:X3=x;
>
> 1:X0=(valid:0);
> 1:X1=PTE(x);
> 1:X2=x;
> 1:X3=active;
>
> }
> P0 | P1 ;
> MOV W0,#1 | STR X0,[X1] ;
> STR W0,[X1] (* sme_active_cpus *) | DSB ISH ;
> DMB SY | LSR X9,X2,#12 ;
> LDR W2,[X3] (* access to [addr] *) | TLBI VAAE1IS,X9 (* [addr] *) ;
> | DSB ISH ;
> | LDR W4,[X3] (* sme_active_cpus *) ;
>
> exists ~(1:X4=1 \/ fault(P0,x))
>
> Is that correct understanding? Have I missed anything?
Yes, I think that's correct. Another tweak specific to this erratum
would be for P1 to do a store to x via another mapping after the
TLBI+DSB and the P0 load should not see it.
Even with the CPU erratum, if the P1 DVMSync is received/acknowledged by
P0 before its STR to sme_active_cpus, I don't see how the subsequent SME
load would overtake the STR given the DMB. The erratum messed up the
DVMSync acknowledgement, not the barriers.
--
Catalin
* Re: [PATCH 3/4] arm64: errata: Work around early CME DVMSync acknowledgement
2026-03-10 15:35 ` Catalin Marinas
@ 2026-03-12 14:55 ` Will Deacon
2026-03-13 15:48 ` Catalin Marinas
2026-03-17 12:09 ` Mark Rutland
0 siblings, 2 replies; 24+ messages in thread
From: Will Deacon @ 2026-03-12 14:55 UTC (permalink / raw)
To: Catalin Marinas
Cc: Vladimir Murzin, linux-arm-kernel, Marc Zyngier, Oliver Upton,
Lorenzo Pieralisi, Sudeep Holla, James Morse, Mark Rutland,
Mark Brown, kvmarm
On Tue, Mar 10, 2026 at 03:35:19PM +0000, Catalin Marinas wrote:
> Thanks Vladimir,
>
> On Mon, Mar 09, 2026 at 10:13:20AM +0000, Vladimir Murzin wrote:
> > On 3/6/26 12:00, Catalin Marinas wrote:
> > >>> @@ -1358,6 +1360,85 @@ void do_sve_acc(unsigned long esr, struct pt_regs *regs)
> > >>> put_cpu_fpsimd_context();
> > >>> }
> > >>>
> > >>> +#ifdef CONFIG_ARM64_ERRATUM_SME_DVMSYNC
> > >>> +
> > >>> +/*
> > >>> + * SME/CME erratum handling
> > >>> + */
> > >>> +static cpumask_var_t sme_dvmsync_cpus;
> > >>> +static cpumask_var_t sme_active_cpus;
> > >>> +
> > >>> +void sme_set_active(unsigned int cpu)
> > >>> +{
> > >>> + if (!cpus_have_final_cap(ARM64_WORKAROUND_SME_DVMSYNC))
> > >>> + return;
> > >>> + if (!cpumask_test_cpu(cpu, sme_dvmsync_cpus))
> > >>> + return;
> > >>> +
> > >>> + if (!test_bit(ilog2(MMCF_SME_DVMSYNC), &current->mm->context.flags))
> > >>> + set_bit(ilog2(MMCF_SME_DVMSYNC), &current->mm->context.flags);
> > >>> +
> > >>> + cpumask_set_cpu(cpu, sme_active_cpus);
> > >>> +
> > >>> + /*
> > >>> + * Ensure subsequent (SME) memory accesses are observed after the
> > >>> + * cpumask and the MMCF_SME_DVMSYNC flag setting.
> > >>> + */
> > >>> + smp_mb();
> > >>
> > >> I can't convince myself that a DMB is enough here, as the whole issue
> > >> is that the SME memory accesses can be observed _after_ the TLB
> > >> invalidation. I'd have thought we'd need a DSB to ensure that the flag
> > >> updates are visible before the exception return.
> > >
> > > This is only to ensure that the sme_active_cpus mask is observed before
> > > any SME accesses. The mask is later used to decide whether to send the
> > > IPI. We have something like this:
> > >
> > > P0
> > > STSET [sme_active_cpus]
> > > DMB
> > > SME access to [addr]
> > >
> > > P1
> > > TLBI [addr]
> > > DSB
> > > LDR [sme_active_cpus]
> > > CBZ out
> > > Do IPI
> > > out:
> > >
> > > If P1 did not observe the STSET to [sme_active_cpus], P0 should have
> > > received and acknowledged the DVMSync before the STSET. Is your concern
> > > that P1 can observe the subsequent SME access but not the STSET?
> > >
> > > No idea whether herd can model this (I only put this in TLA+ for the
> > > main logic check but it doesn't do subtle memory ordering).
> >
> > JFYI, herd support for SME is still work-in-progress (specifically it misses
> > updates in cat), yet it can model VMSA.
> >
> > IIUC, expectation here is that either
> > - P1 observes sme_active_cpus, so we have to do_IPI or
> > - P0 observes TLBI (say shutdown, so it must fault)
> >
> > anything else is unexpected/forbidden.
> >
> > AArch64 A
> > variant=vmsa
> > {
> > int x=0;
> > int active=0;
> >
> > 0:X1=active;
> > 0:X3=x;
> >
> > 1:X0=(valid:0);
> > 1:X1=PTE(x);
> > 1:X2=x;
> > 1:X3=active;
> >
> > }
> > P0 | P1 ;
> > MOV W0,#1 | STR X0,[X1] ;
> > STR W0,[X1] (* sme_active_cpus *) | DSB ISH ;
> > DMB SY | LSR X9,X2,#12 ;
> > LDR W2,[X3] (* access to [addr] *) | TLBI VAAE1IS,X9 (* [addr] *) ;
> > | DSB ISH ;
> > | LDR W4,[X3] (* sme_active_cpus *) ;
> >
> > exists ~(1:X4=1 \/ fault(P0,x))
> >
> > Is that correct understanding? Have I missed anything?
>
> Yes, I think that's correct. Another tweak specific to this erratum
> would be for P1 to do a store to x via another mapping after the
> TLBI+DSB and the P0 load should not see it.
>
> Even with the CPU erratum, if the P1 DVMSync is received/acknowledged by
> P0 before its STR to sme_active_cpus, I don't see how the subsequent SME
> load would overtake the STR given the DMB. The erratum messed up the
> DVMSync acknowledgement, not the barriers.
I'm still finding this hard to reason about.
Why can't:
1. P0 translates its SME load and puts the valid translation into its TLB
2. P1 runs to completion, sees sme_active_cpus as 0 and so doesn't IPI
3. P0 writes to sme_active_cpus and then does the SME load using the
translation from (1)
I guess it's diving into ugly corners of what the erratum actually is...
Will
* Re: [PATCH 2/4] arm64: tlb: Pass the corresponding mm to __tlbi_sync_s1ish()
2026-03-06 11:15 ` Catalin Marinas
@ 2026-03-12 15:00 ` Will Deacon
2026-03-13 16:27 ` Catalin Marinas
0 siblings, 1 reply; 24+ messages in thread
From: Will Deacon @ 2026-03-12 15:00 UTC (permalink / raw)
To: Catalin Marinas
Cc: linux-arm-kernel, Marc Zyngier, Oliver Upton, Lorenzo Pieralisi,
Sudeep Holla, James Morse, Mark Rutland, Mark Brown, kvmarm
On Fri, Mar 06, 2026 at 11:15:34AM +0000, Catalin Marinas wrote:
> On Thu, Mar 05, 2026 at 07:19:15PM +0000, Catalin Marinas wrote:
> > On Thu, Mar 05, 2026 at 02:33:18PM +0000, Will Deacon wrote:
> > > On Mon, Mar 02, 2026 at 04:57:55PM +0000, Catalin Marinas wrote:
> > > > @@ -391,7 +391,7 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> > > > */
> > > > static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> > > > {
> > > > - __tlbi_sync_s1ish();
> > > > + __tlbi_sync_s1ish(NULL);
> > >
> > > Hmm, it seems a bit rubbish to pass NULL here as that means that we'll
> > > deploy the mitigation regardless of the mm flags when finishing the
> > > batch.
> > >
> > > It also looks like we could end up doing the workaround multiple times
> > > if arch_tlbbatch_add_pending() is passed a large enough region that we
> > > call __flush_tlb_range_limit_excess() fires.
> > >
> > > So perhaps we should stash the mm in 'struct arch_tlbflush_unmap_batch'
> > > alongside some state to track whether or not we have uncompleted TLB
> > > maintenance in flight?
> >
> > The problem is that arch_tlbbatch_flush() can be called to synchronise
> > multiple mm structures that were touched by TTU. We can't have the mm in
> > arch_tlbflush_unmap_batch. But we can track if any of the mms had
> > MMCF_SME_DVMSYNC flag set, something like below (needs testing, tidying
> > up). TBH, I did not notice any problem in benchmarking as I guess we
> > haven't exercised the TTU path much, so did not bother to optimise it.
> >
> > For the TTU case, I don't think we need to worry about the excess limit
> > and doing the IPI twice. But I'll double check the code paths tomorrow.
> >
> > diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
> > index fedb0b87b8db..e756eaca6cb8 100644
> > --- a/arch/arm64/include/asm/tlbbatch.h
> > +++ b/arch/arm64/include/asm/tlbbatch.h
> > @@ -7,6 +7,8 @@ struct arch_tlbflush_unmap_batch {
> > * For arm64, HW can do tlb shootdown, so we don't
> > * need to record cpumask for sending IPI
> > */
> > +
> > + bool sme_dvmsync;
> > };
> >
> > #endif /* _ARCH_ARM64_TLBBATCH_H */
> > diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> > index e3ea0246a4f4..c1141a684854 100644
> > --- a/arch/arm64/include/asm/tlbflush.h
> > +++ b/arch/arm64/include/asm/tlbflush.h
> > @@ -201,10 +201,15 @@ do { \
> > * Complete broadcast TLB maintenance issued by the host which invalidates
> > * stage 1 information in the host's own translation regime.
> > */
> > -static inline void __tlbi_sync_s1ish(struct mm_struct *mm)
> > +static inline void __tlbi_sync_s1ish_no_sme_dvmsync(void)
> > {
> > dsb(ish);
> > __repeat_tlbi_sync(vale1is, 0);
> > +}
> > +
> > +static inline void __tlbi_sync_s1ish(struct mm_struct *mm)
> > +{
> > + __tlbi_sync_s1ish_no_sme_dvmsync();
> > sme_dvmsync(mm);
> > }
> >
> > @@ -408,7 +413,11 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> > */
> > static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> > {
> > - __tlbi_sync_s1ish(NULL);
> > + __tlbi_sync_s1ish_no_sme_dvmsync();
> > + if (batch->sme_dvmsync) {
> > + batch->sme_dvmsync = false;
> > + sme_dvmsync(NULL);
> > + }
> > }
> >
> > /*
> > @@ -613,6 +622,8 @@ static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *b
> > struct mm_struct *mm, unsigned long start, unsigned long end)
> > {
> > __flush_tlb_range_nosync(mm, start, end, PAGE_SIZE, true, 3);
> > + if (test_bit(ilog2(MMCF_SME_DVMSYNC), &mm->context.flags))
> > + batch->sme_dvmsync = true;
> > }
>
> While writing a reply to your other comments, I realised why this
> wouldn't work (I had something similar but dropped it) - we can have the
> flag cleared here (or mm_cpumask() if we are to track per-mm) but we
> have not issued the DVMSync yet. The task may start using SME before
> arch_tlbbatch_flush() and we just missed it. Any checks on whether to
> issue the IPI, such as reading the flags, need to happen after the DVMSync.
Ah, yeah. I wonder if it's worth detecting the change of mm in
arch_tlbbatch_add_pending() and then pro-actively doing the DSB on CPUs
with the erratum? I suppose it depends on how often SME is being used.
Will
* Re: [PATCH 3/4] arm64: errata: Work around early CME DVMSync acknowledgement
2026-03-12 14:55 ` Will Deacon
@ 2026-03-13 15:48 ` Catalin Marinas
2026-03-13 15:58 ` Will Deacon
2026-03-17 12:09 ` Mark Rutland
1 sibling, 1 reply; 24+ messages in thread
From: Catalin Marinas @ 2026-03-13 15:48 UTC (permalink / raw)
To: Will Deacon
Cc: Vladimir Murzin, linux-arm-kernel, Marc Zyngier, Oliver Upton,
Lorenzo Pieralisi, Sudeep Holla, James Morse, Mark Rutland,
Mark Brown, kvmarm
On Thu, Mar 12, 2026 at 02:55:15PM +0000, Will Deacon wrote:
> On Tue, Mar 10, 2026 at 03:35:19PM +0000, Catalin Marinas wrote:
> > On Mon, Mar 09, 2026 at 10:13:20AM +0000, Vladimir Murzin wrote:
> > > On 3/6/26 12:00, Catalin Marinas wrote:
> > > >>> @@ -1358,6 +1360,85 @@ void do_sve_acc(unsigned long esr, struct pt_regs *regs)
> > > >>> put_cpu_fpsimd_context();
> > > >>> }
> > > >>>
> > > >>> +#ifdef CONFIG_ARM64_ERRATUM_SME_DVMSYNC
> > > >>> +
> > > >>> +/*
> > > >>> + * SME/CME erratum handling
> > > >>> + */
> > > >>> +static cpumask_var_t sme_dvmsync_cpus;
> > > >>> +static cpumask_var_t sme_active_cpus;
> > > >>> +
> > > >>> +void sme_set_active(unsigned int cpu)
> > > >>> +{
> > > >>> + if (!cpus_have_final_cap(ARM64_WORKAROUND_SME_DVMSYNC))
> > > >>> + return;
> > > >>> + if (!cpumask_test_cpu(cpu, sme_dvmsync_cpus))
> > > >>> + return;
> > > >>> +
> > > >>> + if (!test_bit(ilog2(MMCF_SME_DVMSYNC), &current->mm->context.flags))
> > > >>> + set_bit(ilog2(MMCF_SME_DVMSYNC), &current->mm->context.flags);
> > > >>> +
> > > >>> + cpumask_set_cpu(cpu, sme_active_cpus);
> > > >>> +
> > > >>> + /*
> > > >>> + * Ensure subsequent (SME) memory accesses are observed after the
> > > >>> + * cpumask and the MMCF_SME_DVMSYNC flag setting.
> > > >>> + */
> > > >>> + smp_mb();
> > > >>
> > > >> I can't convince myself that a DMB is enough here, as the whole issue
> > > >> is that the SME memory accesses can be observed _after_ the TLB
> > > >> invalidation. I'd have thought we'd need a DSB to ensure that the flag
> > > >> updates are visible before the exception return.
> > > >
> > > > This is only to ensure that the sme_active_cpus mask is observed before
> > > > any SME accesses. The mask is later used to decide whether to send the
> > > > IPI. We have something like this:
> > > >
> > > > P0
> > > > STSET [sme_active_cpus]
> > > > DMB
> > > > SME access to [addr]
> > > >
> > > > P1
> > > > TLBI [addr]
> > > > DSB
> > > > LDR [sme_active_cpus]
> > > > CBZ out
> > > > Do IPI
> > > > out:
> > > >
> > > > If P1 did not observe the STSET to [sme_active_cpus], P0 should have
> > > > received and acknowledged the DVMSync before the STSET. Is your concern
> > > > that P1 can observe the subsequent SME access but not the STSET?
> > > >
> > > > No idea whether herd can model this (I only put this in TLA+ for the
> > > > main logic check but it doesn't do subtle memory ordering).
> > >
> > > JFYI, herd support for SME is still work-in-progress (specifically it misses
> > > updates in cat), yet it can model VMSA.
> > >
> > > IIUC, expectation here is that either
> > > - P1 observes sme_active_cpus, so we have to do_IPI or
> > > - P0 observes TLBI (say shutdown, so it must fault)
> > >
> > > anything else is unexpected/forbidden.
> > >
> > > AArch64 A
> > > variant=vmsa
> > > {
> > > int x=0;
> > > int active=0;
> > >
> > > 0:X1=active;
> > > 0:X3=x;
> > >
> > > 1:X0=(valid:0);
> > > 1:X1=PTE(x);
> > > 1:X2=x;
> > > 1:X3=active;
> > >
> > > }
> > > P0 | P1 ;
> > > MOV W0,#1 | STR X0,[X1] ;
> > > STR W0,[X1] (* sme_active_cpus *) | DSB ISH ;
> > > DMB SY | LSR X9,X2,#12 ;
> > > LDR W2,[X3] (* access to [addr] *) | TLBI VAAE1IS,X9 (* [addr] *) ;
> > > | DSB ISH ;
> > > | LDR W4,[X3] (* sme_active_cpus *) ;
> > >
> > > exists ~(1:X4=1 \/ fault(P0,x))
> > >
> > > Is that correct understanding? Have I missed anything?
> >
> > Yes, I think that's correct. Another tweak specific to this erratum
> > would be for P1 to do a store to x via another mapping after the
> > TLBI+DSB and the P0 load should not see it.
> >
> > Even with the CPU erratum, if the P1 DVMSync is received/acknowledged by
> > P0 before its STR to sme_active_cpus, I don't see how the subsequent SME
> > load would overtake the STR given the DMB. The erratum messed up the
> > DVMSync acknowledgement, not the barriers.
>
> I'm still finding this hard to reason about.
>
> Why can't:
>
> 1. P0 translates its SME load and puts the valid translation into its TLB
> 2. P1 runs to completion, sees sme_active_cpus as 0 and so doesn't IPI
> 3. P0 writes to sme_active_cpus and then does the SME load using the
> translation from (1)
>
> I guess it's diving into ugly corners of what the erratum actually is...
From discussing with the microarchitects at the time, a DMB ISH was
sufficient on the ERET path. Whether they thought about your scenario,
I'm not sure. Memory ordering isn't broken by this bug; it's only the DVMSync
acknowledgement not waiting for the CME unit (shared by multiple CPUs)
to complete an in-flight memory access. My assumption is that step (1)
won't actually start until the STR in (3) is issued and this would
include the TLB lookup.
Anyway, I'll ask them again to be sure.
--
Catalin
* Re: [PATCH 3/4] arm64: errata: Work around early CME DVMSync acknowledgement
2026-03-13 15:48 ` Catalin Marinas
@ 2026-03-13 15:58 ` Will Deacon
0 siblings, 0 replies; 24+ messages in thread
From: Will Deacon @ 2026-03-13 15:58 UTC (permalink / raw)
To: Catalin Marinas
Cc: Vladimir Murzin, linux-arm-kernel, Marc Zyngier, Oliver Upton,
Lorenzo Pieralisi, Sudeep Holla, James Morse, Mark Rutland,
Mark Brown, kvmarm
On Fri, Mar 13, 2026 at 03:48:57PM +0000, Catalin Marinas wrote:
> On Thu, Mar 12, 2026 at 02:55:15PM +0000, Will Deacon wrote:
> > On Tue, Mar 10, 2026 at 03:35:19PM +0000, Catalin Marinas wrote:
> > > On Mon, Mar 09, 2026 at 10:13:20AM +0000, Vladimir Murzin wrote:
> > > > On 3/6/26 12:00, Catalin Marinas wrote:
> > > > >>> @@ -1358,6 +1360,85 @@ void do_sve_acc(unsigned long esr, struct pt_regs *regs)
> > > > >>> put_cpu_fpsimd_context();
> > > > >>> }
> > > > >>>
> > > > >>> +#ifdef CONFIG_ARM64_ERRATUM_SME_DVMSYNC
> > > > >>> +
> > > > >>> +/*
> > > > >>> + * SME/CME erratum handling
> > > > >>> + */
> > > > >>> +static cpumask_var_t sme_dvmsync_cpus;
> > > > >>> +static cpumask_var_t sme_active_cpus;
> > > > >>> +
> > > > >>> +void sme_set_active(unsigned int cpu)
> > > > >>> +{
> > > > >>> + if (!cpus_have_final_cap(ARM64_WORKAROUND_SME_DVMSYNC))
> > > > >>> + return;
> > > > >>> + if (!cpumask_test_cpu(cpu, sme_dvmsync_cpus))
> > > > >>> + return;
> > > > >>> +
> > > > >>> + if (!test_bit(ilog2(MMCF_SME_DVMSYNC), &current->mm->context.flags))
> > > > >>> + set_bit(ilog2(MMCF_SME_DVMSYNC), &current->mm->context.flags);
> > > > >>> +
> > > > >>> + cpumask_set_cpu(cpu, sme_active_cpus);
> > > > >>> +
> > > > >>> + /*
> > > > >>> + * Ensure subsequent (SME) memory accesses are observed after the
> > > > >>> + * cpumask and the MMCF_SME_DVMSYNC flag setting.
> > > > >>> + */
> > > > >>> + smp_mb();
> > > > >>
> > > > >> I can't convince myself that a DMB is enough here, as the whole issue
> > > > >> is that the SME memory accesses can be observed _after_ the TLB
> > > > >> invalidation. I'd have thought we'd need a DSB to ensure that the flag
> > > > >> updates are visible before the exception return.
> > > > >
> > > > > This is only to ensure that the sme_active_cpus mask is observed before
> > > > > any SME accesses. The mask is later used to decide whether to send the
> > > > > IPI. We have something like this:
> > > > >
> > > > > P0
> > > > > STSET [sme_active_cpus]
> > > > > DMB
> > > > > SME access to [addr]
> > > > >
> > > > > P1
> > > > > TLBI [addr]
> > > > > DSB
> > > > > LDR [sme_active_cpus]
> > > > > CBZ out
> > > > > Do IPI
> > > > > out:
> > > > >
> > > > > If P1 did not observe the STSET to [sme_active_cpus], P0 should have
> > > > > received and acknowledged the DVMSync before the STSET. Is your concern
> > > > > that P1 can observe the subsequent SME access but not the STSET?
> > > > >
> > > > > No idea whether herd can model this (I only put this in TLA+ for the
> > > > > main logic check but it doesn't do subtle memory ordering).
> > > >
> > > > JFYI, herd support for SME is still work-in-progress (specifically it misses
> > > > updates in cat), yet it can model VMSA.
> > > >
> > > > IIUC, expectation here is that either
> > > > - P1 observes sme_active_cpus, so we have to do_IPI or
> > > > - P0 observes TLBI (say shutdown, so it must fault)
> > > >
> > > > anything else is unexpected/forbidden.
> > > >
> > > > AArch64 A
> > > > variant=vmsa
> > > > {
> > > > int x=0;
> > > > int active=0;
> > > >
> > > > 0:X1=active;
> > > > 0:X3=x;
> > > >
> > > > 1:X0=(valid:0);
> > > > 1:X1=PTE(x);
> > > > 1:X2=x;
> > > > 1:X3=active;
> > > >
> > > > }
> > > > P0 | P1 ;
> > > > MOV W0,#1 | STR X0,[X1] ;
> > > > STR W0,[X1] (* sme_active_cpus *) | DSB ISH ;
> > > > DMB SY | LSR X9,X2,#12 ;
> > > > LDR W2,[X3] (* access to [addr] *) | TLBI VAAE1IS,X9 (* [addr] *) ;
> > > > | DSB ISH ;
> > > > | LDR W4,[X3] (* sme_active_cpus *) ;
> > > >
> > > > exists ~(1:X4=1 \/ fault(P0,x))
> > > >
> > > > Is that correct understanding? Have I missed anything?
> > >
> > > Yes, I think that's correct. Another tweak specific to this erratum
> > > would be for P1 to do a store to x via another mapping after the
> > > TLBI+DSB and the P0 load should not see it.
> > >
> > > Even with the CPU erratum, if the P1 DVMSync is received/acknowledged by
> > > P0 before its STR to sme_active_cpus, I don't see how the subsequent SME
> > > load would overtake the STR given the DMB. The erratum messed up the
> > > DVMSync acknowledgement, not the barriers.
> >
> > I'm still finding this hard to reason about.
> >
> > Why can't:
> >
> > 1. P0 translates its SME load and puts the valid translation into its TLB
> > 2. P1 runs to completion, sees sme_active_cpus as 0 and so doesn't IPI
> > 3. P0 writes to sme_active_cpus and then does the SME load using the
> > translation from (1)
> >
> > I guess it's diving into ugly corners of what the erratum actually is...
>
> From discussing with the microarchitects at the time, a DMB ISH was
> sufficient on the ERET path. Whether they thought about your scenario,
> not sure. Memory ordering isn't broken by this bug, only the DVMSync
> acknowledgement not waiting for the CME unit (shared by multiple CPUs)
> to complete an in-flight memory access. My assumption is that step (1)
> won't actually start until the STR in (3) is issued and this would
> include the TLB lookup.
Maybe, but that sounds slow...
> Anyway, I'll ask them again to be sure.
Thanks.
If you're speaking to them, it might also be worth asking whether or not
the exception return alone is sufficient (given that we run with IESB=1).
Will
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/4] arm64: tlb: Pass the corresponding mm to __tlbi_sync_s1ish()
2026-03-12 15:00 ` Will Deacon
@ 2026-03-13 16:27 ` Catalin Marinas
0 siblings, 0 replies; 24+ messages in thread
From: Catalin Marinas @ 2026-03-13 16:27 UTC (permalink / raw)
To: Will Deacon
Cc: linux-arm-kernel, Marc Zyngier, Oliver Upton, Lorenzo Pieralisi,
Sudeep Holla, James Morse, Mark Rutland, Mark Brown, kvmarm
On Thu, Mar 12, 2026 at 03:00:09PM +0000, Will Deacon wrote:
> On Fri, Mar 06, 2026 at 11:15:34AM +0000, Catalin Marinas wrote:
> > On Thu, Mar 05, 2026 at 07:19:15PM +0000, Catalin Marinas wrote:
> > > On Thu, Mar 05, 2026 at 02:33:18PM +0000, Will Deacon wrote:
> > > > On Mon, Mar 02, 2026 at 04:57:55PM +0000, Catalin Marinas wrote:
> > > > > @@ -391,7 +391,7 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> > > > > */
> > > > > static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> > > > > {
> > > > > - __tlbi_sync_s1ish();
> > > > > + __tlbi_sync_s1ish(NULL);
> > > >
> > > > Hmm, it seems a bit rubbish to pass NULL here as that means that we'll
> > > > deploy the mitigation regardless of the mm flags when finishing the
> > > > batch.
> > > >
> > > > It also looks like we could end up doing the workaround multiple times
> > > > if arch_tlbbatch_add_pending() is passed a large enough region that
> > > > the __flush_tlb_range_limit_excess() check fires.
> > > >
> > > > So perhaps we should stash the mm in 'struct arch_tlbflush_unmap_batch'
> > > > alongside some state to track whether or not we have uncompleted TLB
> > > > maintenance in flight?
> > >
> > > The problem is that arch_tlbbatch_flush() can be called to synchronise
> > > multiple mm structures that were touched by TTU. We can't have the mm in
> > > arch_tlbflush_unmap_batch. But we can track if any of the mms had
> > > MMCF_SME_DVMSYNC flag set, something like below (needs testing, tidying
> > > up). TBH, I did not notice any problem in benchmarking as I guess we
> > > haven't exercised the TTU path much, so did not bother to optimise it.
> > >
> > > For the TTU case, I don't think we need to worry about the excess limit
> > > and doing the IPI twice. But I'll double check the code paths tomorrow.
> > >
> > > diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
> > > index fedb0b87b8db..e756eaca6cb8 100644
> > > --- a/arch/arm64/include/asm/tlbbatch.h
> > > +++ b/arch/arm64/include/asm/tlbbatch.h
> > > @@ -7,6 +7,8 @@ struct arch_tlbflush_unmap_batch {
> > > * For arm64, HW can do tlb shootdown, so we don't
> > > * need to record cpumask for sending IPI
> > > */
> > > +
> > > + bool sme_dvmsync;
> > > };
> > >
> > > #endif /* _ARCH_ARM64_TLBBATCH_H */
> > > diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> > > index e3ea0246a4f4..c1141a684854 100644
> > > --- a/arch/arm64/include/asm/tlbflush.h
> > > +++ b/arch/arm64/include/asm/tlbflush.h
> > > @@ -201,10 +201,15 @@ do { \
> > > * Complete broadcast TLB maintenance issued by the host which invalidates
> > > * stage 1 information in the host's own translation regime.
> > > */
> > > -static inline void __tlbi_sync_s1ish(struct mm_struct *mm)
> > > +static inline void __tlbi_sync_s1ish_no_sme_dvmsync(void)
> > > {
> > > dsb(ish);
> > > __repeat_tlbi_sync(vale1is, 0);
> > > +}
> > > +
> > > +static inline void __tlbi_sync_s1ish(struct mm_struct *mm)
> > > +{
> > > + __tlbi_sync_s1ish_no_sme_dvmsync();
> > > sme_dvmsync(mm);
> > > }
> > >
> > > @@ -408,7 +413,11 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> > > */
> > > static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> > > {
> > > - __tlbi_sync_s1ish(NULL);
> > > + __tlbi_sync_s1ish_no_sme_dvmsync();
> > > + if (batch->sme_dvmsync) {
> > > + batch->sme_dvmsync = false;
> > > + sme_dvmsync(NULL);
> > > + }
> > > }
> > >
> > > /*
> > > @@ -613,6 +622,8 @@ static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *b
> > > struct mm_struct *mm, unsigned long start, unsigned long end)
> > > {
> > > __flush_tlb_range_nosync(mm, start, end, PAGE_SIZE, true, 3);
> > > + if (test_bit(ilog2(MMCF_SME_DVMSYNC), &mm->context.flags))
> > > + batch->sme_dvmsync = true;
> > > }
> >
> > While writing a reply to your other comments, I realised why this
> > wouldn't work (I had something similar but dropped it) - we can have the
> > flag cleared here (or mm_cpumask() if we are to track per-mm) but we
> > have not issued the DVMSync yet. The task may start using SME before
> > arch_tlbbatch_flush() and we just missed it. Any check on whether to
> > issue the IPI, such as reading the flags, needs to happen after the DVMSync.
>
> Ah, yeah. I wonder if it's worth detecting the change of mm in
> arch_tlbbatch_add_pending() and then pro-actively doing the DSB on CPUs
> with the erratum? I suppose it depends on how often SME is being used.
I don't think it's worth it with the current use of SME (short bursts).
sme_do_dvmsync() already tracks which CPUs run with SME enabled at
EL0, so it's not like we always broadcast to all. But if you want it,
something like below would work, with more ifs and ifdefs for when the
erratum is disabled:
diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
index fedb0b87b8db..948d6f248250 100644
--- a/arch/arm64/include/asm/tlbbatch.h
+++ b/arch/arm64/include/asm/tlbbatch.h
@@ -2,11 +2,12 @@
#ifndef _ARCH_ARM64_TLBBATCH_H
#define _ARCH_ARM64_TLBBATCH_H
+struct mm_struct;
+
+#define ARCH_TLBBATCH_MULTIPLE_MMS ((struct mm_struct *)-1UL)
+
struct arch_tlbflush_unmap_batch {
- /*
- * For arm64, HW can do tlb shootdown, so we don't
- * need to record cpumask for sending IPI
- */
+ struct mm_struct *mm;
};
#endif /* _ARCH_ARM64_TLBBATCH_H */
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 2c77ca41cb14..1ed19480305f 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -408,7 +408,13 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
*/
static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
- __tlbi_sync_s1ish(NULL);
+ struct mm_struct *mm = batch->mm;
+
+ batch->mm = NULL;
+ if (mm == ARCH_TLBBATCH_MULTIPLE_MMS)
+ mm = NULL;
+
+ __tlbi_sync_s1ish(mm);
}
/*
@@ -612,6 +618,11 @@ static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
struct mm_struct *mm, unsigned long start, unsigned long end)
{
+ if (!batch->mm)
+ batch->mm = mm;
+ else if (batch->mm != mm && batch->mm != ARCH_TLBBATCH_MULTIPLE_MMS)
+ batch->mm = ARCH_TLBBATCH_MULTIPLE_MMS;
+
__flush_tlb_range_nosync(mm, start, end, PAGE_SIZE, true, 3);
}
* Re: [PATCH 3/4] arm64: errata: Work around early CME DVMSync acknowledgement
2026-03-12 14:55 ` Will Deacon
2026-03-13 15:48 ` Catalin Marinas
@ 2026-03-17 12:09 ` Mark Rutland
1 sibling, 0 replies; 24+ messages in thread
From: Mark Rutland @ 2026-03-17 12:09 UTC (permalink / raw)
To: Will Deacon
Cc: Catalin Marinas, Vladimir Murzin, linux-arm-kernel, Marc Zyngier,
Oliver Upton, Lorenzo Pieralisi, Sudeep Holla, James Morse,
Mark Brown, kvmarm
On Thu, Mar 12, 2026 at 02:55:15PM +0000, Will Deacon wrote:
> On Tue, Mar 10, 2026 at 03:35:19PM +0000, Catalin Marinas wrote:
> > Thanks Vladimir,
> >
> > On Mon, Mar 09, 2026 at 10:13:20AM +0000, Vladimir Murzin wrote:
> > > On 3/6/26 12:00, Catalin Marinas wrote:
> > > >>> @@ -1358,6 +1360,85 @@ void do_sve_acc(unsigned long esr, struct pt_regs *regs)
> > > >>> put_cpu_fpsimd_context();
> > > >>> }
> > > >>>
> > > >>> +#ifdef CONFIG_ARM64_ERRATUM_SME_DVMSYNC
> > > >>> +
> > > >>> +/*
> > > >>> + * SME/CME erratum handling
> > > >>> + */
> > > >>> +static cpumask_var_t sme_dvmsync_cpus;
> > > >>> +static cpumask_var_t sme_active_cpus;
> > > >>> +
> > > >>> +void sme_set_active(unsigned int cpu)
> > > >>> +{
> > > >>> + if (!cpus_have_final_cap(ARM64_WORKAROUND_SME_DVMSYNC))
> > > >>> + return;
> > > >>> + if (!cpumask_test_cpu(cpu, sme_dvmsync_cpus))
> > > >>> + return;
> > > >>> +
> > > >>> + if (!test_bit(ilog2(MMCF_SME_DVMSYNC), ¤t->mm->context.flags))
> > > >>> + set_bit(ilog2(MMCF_SME_DVMSYNC), ¤t->mm->context.flags);
> > > >>> +
> > > >>> + cpumask_set_cpu(cpu, sme_active_cpus);
> > > >>> +
> > > >>> + /*
> > > >>> + * Ensure subsequent (SME) memory accesses are observed after the
> > > >>> + * cpumask and the MMCF_SME_DVMSYNC flag setting.
> > > >>> + */
> > > >>> + smp_mb();
> > > >>
> > > >> I can't convince myself that a DMB is enough here, as the whole issue
> > > >> is that the SME memory accesses can be observed _after_ the TLB
> > > >> invalidation. I'd have thought we'd need a DSB to ensure that the flag
> > > >> updates are visible before the exception return.
> > > >
> > > > This is only to ensure that the sme_active_cpus mask is observed before
> > > > any SME accesses. The mask is later used to decide whether to send the
> > > > IPI. We have something like this:
> > > >
> > > > P0
> > > > STSET [sme_active_cpus]
> > > > DMB
> > > > SME access to [addr]
> > > >
> > > > P1
> > > > TLBI [addr]
> > > > DSB
> > > > LDR [sme_active_cpus]
> > > > CBZ out
> > > > Do IPI
> > > > out:
> > > >
> > > > If P1 did not observe the STSET to [sme_active_cpus], P0 should have
> > > > received and acknowledged the DVMSync before the STSET. Is your concern
> > > > that P1 can observe the subsequent SME access but not the STSET?
> > > >
> > > > No idea whether herd can model this (I only put this in TLA+ for the
> > > > main logic check but it doesn't do subtle memory ordering).
> > >
> > > JFYI, herd support for SME is still work-in-progress (specifically it misses
> > > updates in cat), yet it can model VMSA.
> > >
> > > IIUC, expectation here is that either
> > > - P1 observes sme_active_cpus, so we have to do_IPI or
> > > - P0 observes TLBI (say shutdown, so it must fault)
> > >
> > > anything else is unexpected/forbidden.
> > >
> > > AArch64 A
> > > variant=vmsa
> > > {
> > > int x=0;
> > > int active=0;
> > >
> > > 0:X1=active;
> > > 0:X3=x;
> > >
> > > 1:X0=(valid:0);
> > > 1:X1=PTE(x);
> > > 1:X2=x;
> > > 1:X3=active;
> > >
> > > }
> > > P0 | P1 ;
> > > MOV W0,#1 | STR X0,[X1] ;
> > > STR W0,[X1] (* sme_active_cpus *) | DSB ISH ;
> > > DMB SY | LSR X9,X2,#12 ;
> > > LDR W2,[X3] (* access to [addr] *) | TLBI VAAE1IS,X9 (* [addr] *) ;
> > > | DSB ISH ;
> > > | LDR W4,[X3] (* sme_active_cpus *) ;
> > >
> > > exists ~(1:X4=1 \/ fault(P0,x))
> > >
> > > Is that correct understanding? Have I missed anything?
> >
> > Yes, I think that's correct. Another tweak specific to this erratum
> > would be for P1 to do a store to x via another mapping after the
> > TLBI+DSB and the P0 load should not see it.
> >
> > Even with the CPU erratum, if the P1 DVMSync is received/acknowledged by
> > P0 before its STR to sme_active_cpus, I don't see how the subsequent SME
> > load would overtake the STR given the DMB. The erratum messed up the
> > DVMSync acknowledgement, not the barriers.
>
> I'm still finding this hard to reason about.
>
> Why can't:
>
> 1. P0 translates its SME load and puts the valid translation into its TLB
> 2. P1 runs to completion, sees sme_active_cpus as 0 and so doesn't IPI
> 3. P0 writes to sme_active_cpus and then does the SME load using the
> translation from (1)
The key thing is that for micro-architectural reasons, C1-Pro provides
stronger than architectural properties for TLB invalidation (aside from
*completion* of SME accesses specifically). The DMB is not material to
this example, but could matter if we wanted ordering in the absence of a
TLBI.
Specifically, where C1-Pro receives a broadcast TLBI, and that TLBI
architecturally affects the translation of an explicit memory effect of
some instruction INSN (which may be an SME instruction), C1-Pro will
also complete the explicit memory effects of all earlier (non-SME)
instructions *in program order* before INSN. This happens regardless of
out-of-order execution, etc.
When C1-Pro executes a sequence:
STR <1>, [<flag_addr>]
SME_LDR <dst>, [<sme_addr>]
... if a broadcast TLBI is received which affects sme_addr, either:
(a) The TLBI is received before any of SME_LDR's accesses to sme_addr
are translated. The SME_LDR instruction WILL NOT use the stale
translation for sme_addr.
(b) The TLBI is received after any of SME_LDR's accesses to sme_addr are
translated. The SME_LDR instruction MIGHT use the stale translation
for sme_addr. Completion of the TLBI WILL ensure that the STR to
flag_addr has been globally observed. Until completion of the TLBI,
the STR to flag_addr and the SME_LDR to sme_addr could become
observed in any order.
... and so IF the SME_LDR consumes a stale translation for sme_addr, the
store to flag_addr WILL be globally observed before completion of the
TLBI.
When the STR and SME_LDR are either side of an ERET, the ERET itself is
immaterial, and the scenario decays to the example above:
STR <1>, [<flag_addr>]
ERET // immaterial
SME_LDR <dst>, [<sme_addr>]
However, when clearing the flag *after* executing SME loads/stores, we
still need to complete those SME loads/stores before clearing the flag.
Either a DSB or IESB (as part of exception entry) is sufficient to
complete those earlier SME accesses.
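Pulling the thread together, the constraints on the flag-owning CPU look
roughly like this. This is my condensed reading of the above, as schematic
pseudocode, not an authoritative sequence:

/* P0, starting to use SME: */
	STSET	[sme_active_cpus]	// advertise SME activity
	DMB	ISH			// flag ordered before later SME accesses
	ERET				// immaterial, per the above
	...SME loads/stores...		// if one consumes a stale translation,
					// the C1-Pro property guarantees the
					// STSET is globally observed before
					// the TLBI completes on the sender

/* P0, finished with SME: */
	...SME loads/stores...
	DSB	SY			// or IESB on exception entry: completes
					// the SME accesses before the flag may
					// be cleared
	STCLR	[sme_active_cpus]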
Mark.
end of thread, other threads:[~2026-03-17 12:09 UTC | newest]
Thread overview: 24+ messages
2026-03-02 16:57 [PATCH 0/4] arm64: Work around C1-Pro erratum 4193714 (CVE-2026-0995) Catalin Marinas
2026-03-02 16:57 ` [PATCH 1/4] arm64: tlb: Use __tlbi_sync_s1ish_kernel() for kernel TLB maintenance Catalin Marinas
2026-03-03 13:12 ` Mark Rutland
2026-03-05 11:27 ` Catalin Marinas
2026-03-09 12:12 ` Mark Rutland
2026-03-02 16:57 ` [PATCH 2/4] arm64: tlb: Pass the corresponding mm to __tlbi_sync_s1ish() Catalin Marinas
2026-03-05 14:33 ` Will Deacon
2026-03-05 19:19 ` Catalin Marinas
2026-03-06 11:15 ` Catalin Marinas
2026-03-12 15:00 ` Will Deacon
2026-03-13 16:27 ` Catalin Marinas
2026-03-02 16:57 ` [PATCH 3/4] arm64: errata: Work around early CME DVMSync acknowledgement Catalin Marinas
2026-03-05 14:32 ` Will Deacon
2026-03-06 12:00 ` Catalin Marinas
2026-03-06 12:19 ` Catalin Marinas
2026-03-09 10:13 ` Vladimir Murzin
2026-03-10 15:35 ` Catalin Marinas
2026-03-12 14:55 ` Will Deacon
2026-03-13 15:48 ` Catalin Marinas
2026-03-13 15:58 ` Will Deacon
2026-03-17 12:09 ` Mark Rutland
2026-03-02 16:57 ` [PATCH 4/4] KVM: arm64: Add SMC hook for SME dvmsync erratum Catalin Marinas
2026-03-05 14:32 ` Will Deacon
2026-03-06 12:52 ` Catalin Marinas