* [PATCH 0/3] xen/arm{32,64}: perform IPA-based TLBI when IPA is
@ 2026-01-18 13:33 ` Haseeb Ashraf
0 siblings, 0 replies; 10+ messages in thread
From: Haseeb Ashraf @ 2025-12-08 13:55 UTC (permalink / raw)
To: xen-devel
Cc: Haseeb Ashraf, Stefano Stabellini, Julien Grall, Bertrand Marquis,
Michal Orzel, Volodymyr Babchuk, Dan Driscoll, Noor Ahsan Khawaja,
Fahad Arslan, Andrew Bachtel
From: Haseeb Ashraf <haseeb.ashraf@siemens.com>
This patch series addresses a major issue when running Xen on KVM,
i.e. the costly emulation of VMALLS12E1IS, which becomes worse the
more often this TLBI is invoked. There are mainly two places where
this is problematic:
(a) When vCPUs are switched on one or more pCPUs.
(b) When DomU pages mapped into Dom0 are unmapped: each page removed
by XENMEM_remove_from_physmap has its TLB entries invalidated by
VMALLS12E1IS, i.e. the TLBI variant that flushes everything for the
current VMID instead of only the affected page.
This patch series prefers IPA-based TLBIs wherever possible over
flushing the TLBs completely every time.
It consists of three patches. The first one addresses the issue
described above for Arm64. The second patch further optimizes the
combined stage-1 and stage-2 TLB flushes by leveraging FEAT_nTLBPA.
The third patch introduces IPA-based TLBI for Arm32 in presence of
FEAT_nTLBPA.
Haseeb Ashraf (3):
xen/arm/p2m: perform IPA-based TLBI when IPA is known
xen/arm: optimize stage-1,2 combined TLBI in presence of FEAT_nTLBPA
xen/arm32: add CPU capability for IPA-based TLBI
Changes in v3:
- Mainly the handling of repeat TLBI workaround with IPA-based TLBI,
so that the extra TLBI and DSB are repeated only for the final TLBI
and DSB of the whole sequence.
- Updated code comments as per feedback. Further details are
available in each commit's changelog.
- Minor updates to code as per feedback. Further details are
available in each commit's changelog.
Changes in v2:
- Split the commit into 3 commits. The first commit provides the
baseline implementation without adding any new CPU
capabilities. The new CPU capabilities are implemented separately to
emphasize how each of them optimizes the TLB invalidation.
- Moved ARM32 and ARM64 specific implementations of TLBIs to
architecture specific flushtlb.h.
- Added references of ARM ARM in code comments.
- Evaluated and added a threshold to select between IPA-based TLB
invalidation vs fallback to full stage TLB invalidation above
the threshold.
- Introduced ARM_HAS_NTLBPA CPU capability which leverages
FEAT_nTLBPA for arm32 as well as arm64.
- Introduced ARM_HAS_TLB_IPA CPU capability for IPA-based TLBI
for arm32.
xen/arch/arm/cpufeature.c | 31 ++++++++
xen/arch/arm/include/asm/arm32/flushtlb.h | 87 +++++++++++++++++++++
xen/arch/arm/include/asm/arm64/flushtlb.h | 77 +++++++++++++++++++
xen/arch/arm/include/asm/cpregs.h | 4 +
xen/arch/arm/include/asm/cpufeature.h | 27 ++++++-
xen/arch/arm/include/asm/mmu/p2m.h | 2 +
xen/arch/arm/include/asm/processor.h | 10 +++
xen/arch/arm/mmu/p2m.c | 92 +++++++++++++++++------
8 files changed, 302 insertions(+), 28 deletions(-)
--
2.43.0
* [PATCH 1/3] xen/arm/p2m: perform IPA-based TLBI when IPA is known
@ 2026-01-18 13:33 ` Haseeb Ashraf
0 siblings, 0 replies; 10+ messages in thread
From: Haseeb Ashraf @ 2025-12-08 13:55 UTC (permalink / raw)
To: xen-devel
Cc: Haseeb Ashraf, Stefano Stabellini, Julien Grall, Bertrand Marquis,
Michal Orzel, Volodymyr Babchuk, Dan Driscoll, Noor Ahsan Khawaja,
Fahad Arslan, Andrew Bachtel
From: Haseeb Ashraf <haseeb.ashraf@siemens.com>
This commit addresses a major issue when running Xen on KVM, i.e.
the costly emulation of VMALLS12E1IS, which becomes worse the more
often this TLBI is invoked. There are mainly two places where this is
problematic:
(a) When vCPUs are switched on one or more pCPUs.
(b) When DomU pages mapped into Dom0 are unmapped: each page removed
by XENMEM_remove_from_physmap has its TLB entries invalidated by
VMALLS12E1IS.
The first one is addressed by relaxing VMALLS12E1 -> VMALLE1, as the
stage-2 is common between all the vCPUs of a VM. Since each CPU has
its own private TLBs, a flush between vCPUs of the same domain is
still required to prevent translations of vCPUx from "leaking" to
vCPUy, and VMALLE1 is sufficient for that.
The second one is addressed by using IPA-based TLBI (IPAS2E1) in
combination with VMALLE1 whenever the IPA range is known, instead of
using VMALLS12E1. There is an upper cap on the number of IPA-based
TLBIs issued per flush. The ratio of the execution time of VMALLS12E1
to that of IPAS2E1 was found to be 70K on Graviton4 when running Xen
on KVM. So, 64K * 4KB = 256MB is set as the threshold.
For arm32, the TLBIALL instruction can invalidate both stage-1 and
stage-2 entries, so using IPA-based TLBI would be redundant, as
TLBIALL is required in any case to invalidate the corresponding
cached entries from stage-1.
Suggested-by: Julien Grall <julien@xen.org>
Signed-off-by: Haseeb Ashraf <haseeb.ashraf@siemens.com>
Changes in v3:
- Updated IPA-based TLBI sequence to apply ARM64 repeat TLBI
workaround to only final TLBI and DSB of the sequence.
- Removed TLB_HELPER_IPA and instead directly used the TLBI
instruction where needed as that was the only instance where it is
being used.
- Removed flush_guest_tlb_range_ipa_local() as it was not being used.
- Updated comments as per feedback in v2 about holding lock before
p2m_load_vttbr.
- Updated references of ARM ARM to use newer version DDI 0487L.b
instead of older version DDI 0487A.e.
Changes in v2:
- This commit provides the baseline implementation to address the
problem at hand. Removed the FEAT_nTLBPA implementation from this
commit; it is implemented in the following commit using a CPU
capability.
- Moved ARM32 and ARM64 specific implementations of TLBIs to
architecture specific flushtlb.h.
- Added references of ARM ARM in code comments.
- Evaluated and added a threshold to select between IPA-based TLB
invalidation vs fallback to full stage TLB invalidation above
the threshold.
---
xen/arch/arm/include/asm/arm32/flushtlb.h | 53 +++++++++++++
xen/arch/arm/include/asm/arm64/flushtlb.h | 46 ++++++++++++
xen/arch/arm/include/asm/mmu/p2m.h | 2 +
xen/arch/arm/mmu/p2m.c | 92 +++++++++++++++++------
4 files changed, 168 insertions(+), 25 deletions(-)
diff --git a/xen/arch/arm/include/asm/arm32/flushtlb.h b/xen/arch/arm/include/asm/arm32/flushtlb.h
index 61c25a3189..3c0c2123d4 100644
--- a/xen/arch/arm/include/asm/arm32/flushtlb.h
+++ b/xen/arch/arm/include/asm/arm32/flushtlb.h
@@ -45,6 +45,43 @@ TLB_HELPER(flush_xen_tlb_local, TLBIALLH, nsh)
#undef TLB_HELPER
+/*
+ * Flush TLB of local processor. Use when flush for only stage-1 is intended.
+ *
+ * The following function should be used where intention is to clear only
+ * stage-1 TLBs. This would be helpful in future in identifying which stage-1
+ * TLB flushes can be skipped such as in present of FEAT_nTLBPA.
+ */
+static inline void flush_guest_tlb_s1_local(void)
+{
+ /*
+ * Same instruction can invalidate both stage-1 and stage-2 TLBs depending
+ * upon the execution context.
+ *
+ * See ARMv8 (DDI 0487L.b): G5-11698 Table G5-23.
+ */
+ return flush_guest_tlb_local();
+}
+
+/*
+ * Flush TLB of inner-shareable processor domain. Use when flush for only
+ * stage-1 is intended.
+ *
+ * The following function should be used where intention is to clear only
+ * stage-1 TLBs. This would be helpful in future in identifying which stage-1
+ * TLB flushes can be skipped such as in present of FEAT_nTLBPA.
+ */
+static inline void flush_guest_tlb_s1(void)
+{
+ /*
+ * Same instruction can invalidate both stage-1 and stage-2 TLBs depending
+ * upon the execution context.
+ *
+ * See ARMv8 (DDI 0487L.b): G5-11698 Table G5-23.
+ */
+ return flush_guest_tlb();
+}
+
/* Flush TLB of local processor for address va. */
static inline void __flush_xen_tlb_one_local(vaddr_t va)
{
@@ -57,6 +94,22 @@ static inline void __flush_xen_tlb_one(vaddr_t va)
asm volatile(STORE_CP32(0, TLBIMVAHIS) : : "r" (va) : "memory");
}
+/*
+ * Flush a range of IPA's mappings from the TLB of all processors in the
+ * inner-shareable domain.
+ */
+static inline void flush_guest_tlb_range_ipa(paddr_t ipa,
+ unsigned long size)
+{
+ /*
+ * Following can invalidate both stage-1 and stage-2 TLBs depending upon
+ * the execution mode.
+ *
+ * See ARMv8 (DDI 0487L.b): G5-11698 Table G5-23.
+ */
+ flush_guest_tlb();
+}
+
#endif /* __ASM_ARM_ARM32_FLUSHTLB_H__ */
/*
* Local variables:
diff --git a/xen/arch/arm/include/asm/arm64/flushtlb.h b/xen/arch/arm/include/asm/arm64/flushtlb.h
index 3b99c11b50..67ae616993 100644
--- a/xen/arch/arm/include/asm/arm64/flushtlb.h
+++ b/xen/arch/arm/include/asm/arm64/flushtlb.h
@@ -1,6 +1,8 @@
#ifndef __ASM_ARM_ARM64_FLUSHTLB_H__
#define __ASM_ARM_ARM64_FLUSHTLB_H__
+#include <xen/sizes.h> /* For SZ_* macros. */
+
/*
* Every invalidation operation use the following patterns:
*
@@ -72,6 +74,12 @@ TLB_HELPER(flush_guest_tlb_local, vmalls12e1, nsh)
/* Flush innershareable TLBs, current VMID only */
TLB_HELPER(flush_guest_tlb, vmalls12e1is, ish)
+/* Flush local TLBs, current VMID, stage-1 only */
+TLB_HELPER(flush_guest_tlb_s1_local, vmalle1, nsh)
+
+/* Flush innershareable TLBs, current VMID, stage-1 only */
+TLB_HELPER(flush_guest_tlb_s1, vmalle1is, ish)
+
/* Flush local TLBs, all VMIDs, non-hypervisor mode */
TLB_HELPER(flush_all_guests_tlb_local, alle1, nsh)
@@ -90,6 +98,44 @@ TLB_HELPER_VA(__flush_xen_tlb_one, vae2is)
#undef TLB_HELPER
#undef TLB_HELPER_VA
+/*
+ * Flush a range of IPA's mappings from the TLB of all processors in the
+ * inner-shareable domain.
+ */
+static inline void flush_guest_tlb_range_ipa(paddr_t ipa, unsigned long size)
+{
+ paddr_t end;
+
+ /*
+ * If IPA range is too big (empirically found to be 256M), then fallback to
+ * full TLB flush.
+ */
+ if ( size > SZ_256M )
+ return flush_guest_tlb();
+
+ end = ipa + size;
+
+ /*
+ * See ARM ARM DDI 0487L.b D8.17.6.1 (Invalidating TLB entries from stage 2
+ * translations) for details of TLBI sequence.
+ */
+ dsb(ishst); /* Ensure prior page-tables updates have completed */
+ while ( ipa < end )
+ {
+ /* Flush stage-2 TLBs for ipa address */
+ asm_inline volatile (
+ "tlbi ipas2e1is, %0;" : : "r" (ipa >> PAGE_SHIFT) : "memory" );
+ ipa += PAGE_SIZE;
+ }
+ /*
+ * As ARM64_WORKAROUND_REPEAT_TLBI is required to be applied to last TLBI
+ * of the sequence, it is only needed to be handled in the following
+ * invocation. Final dsb() and isb() are also applied in the following
+ * invocation.
+ */
+ flush_guest_tlb_s1();
+}
+
#endif /* __ASM_ARM_ARM64_FLUSHTLB_H__ */
/*
* Local variables:
diff --git a/xen/arch/arm/include/asm/mmu/p2m.h b/xen/arch/arm/include/asm/mmu/p2m.h
index 58496c0b09..8a16722b82 100644
--- a/xen/arch/arm/include/asm/mmu/p2m.h
+++ b/xen/arch/arm/include/asm/mmu/p2m.h
@@ -10,6 +10,8 @@ extern unsigned int p2m_root_level;
struct p2m_domain;
void p2m_force_tlb_flush_sync(struct p2m_domain *p2m);
+void p2m_force_tlb_flush_range_sync(struct p2m_domain *p2m, uint64_t start_ipa,
+ uint64_t page_count);
void p2m_tlb_flush_sync(struct p2m_domain *p2m);
void p2m_clear_root_pages(struct p2m_domain *p2m);
diff --git a/xen/arch/arm/mmu/p2m.c b/xen/arch/arm/mmu/p2m.c
index 51abf3504f..eec59056fa 100644
--- a/xen/arch/arm/mmu/p2m.c
+++ b/xen/arch/arm/mmu/p2m.c
@@ -235,33 +235,28 @@ void p2m_restore_state(struct vcpu *n)
* when running multiple vCPU of the same domain on a single pCPU.
*/
if ( *last_vcpu_ran != INVALID_VCPU_ID && *last_vcpu_ran != n->vcpu_id )
- flush_guest_tlb_local();
+ flush_guest_tlb_s1_local();
*last_vcpu_ran = n->vcpu_id;
}
/*
- * Force a synchronous P2M TLB flush.
+ * Loads VTTBR from given P2M.
*
* Must be called with the p2m lock held.
+ *
+ * This returns switched out VTTBR.
*/
-void p2m_force_tlb_flush_sync(struct p2m_domain *p2m)
+static uint64_t p2m_load_vttbr(struct p2m_domain *p2m, unsigned long *flags)
{
- unsigned long flags = 0;
uint64_t ovttbr;
- ASSERT(p2m_is_write_locked(p2m));
-
- /*
- * ARM only provides an instruction to flush TLBs for the current
- * VMID. So switch to the VTTBR of a given P2M if different.
- */
ovttbr = READ_SYSREG64(VTTBR_EL2);
if ( ovttbr != p2m->vttbr )
{
uint64_t vttbr;
- local_irq_save(flags);
+ local_irq_save(*flags);
/*
* ARM64_WORKAROUND_AT_SPECULATE: We need to stop AT to allocate
@@ -280,8 +275,14 @@ void p2m_force_tlb_flush_sync(struct p2m_domain *p2m)
isb();
}
- flush_guest_tlb();
+ return ovttbr;
+}
+/*
+ * Restores VTTBR which was switched out as a result of p2m_load_vttbr().
+ */
+static void p2m_restore_vttbr(uint64_t ovttbr, unsigned long flags)
+{
if ( ovttbr != READ_SYSREG64(VTTBR_EL2) )
{
WRITE_SYSREG64(ovttbr, VTTBR_EL2);
@@ -289,10 +290,58 @@ void p2m_force_tlb_flush_sync(struct p2m_domain *p2m)
isb();
local_irq_restore(flags);
}
+}
+
+/*
+ * Force a synchronous P2M TLB flush.
+ *
+ * Must be called with the p2m lock held.
+ */
+void p2m_force_tlb_flush_sync(struct p2m_domain *p2m)
+{
+ unsigned long flags = 0;
+ uint64_t ovttbr;
+
+ ASSERT(p2m_is_write_locked(p2m));
+
+ /*
+ * ARM only provides an instruction to flush TLBs for the current
+ * VMID. So switch to the VTTBR of a given P2M if different.
+ */
+ ovttbr = p2m_load_vttbr(p2m, &flags);
+
+ flush_guest_tlb();
+
+ p2m_restore_vttbr(ovttbr, flags);
p2m->need_flush = false;
}
+/*
+ * Force a synchronous P2M TLB flush on a range of addresses.
+ *
+ * Must be called with the p2m lock held.
+ */
+void p2m_force_tlb_flush_range_sync(struct p2m_domain *p2m, uint64_t start_ipa,
+ uint64_t page_count)
+{
+ unsigned long flags = 0;
+ uint64_t ovttbr;
+
+ ASSERT(p2m_is_write_locked(p2m));
+
+ /*
+ * ARM only provides an instruction to flush TLBs for the current
+ * VMID. So switch to the VTTBR of a given P2M if different.
+ */
+ ovttbr = p2m_load_vttbr(p2m, &flags);
+
+ /* Invalidate TLB entries by IPA range */
+ flush_guest_tlb_range_ipa(start_ipa, PAGE_SIZE * page_count);
+
+ p2m_restore_vttbr(ovttbr, flags);
+}
+
void p2m_tlb_flush_sync(struct p2m_domain *p2m)
{
if ( p2m->need_flush )
@@ -1034,7 +1083,8 @@ static int __p2m_set_entry(struct p2m_domain *p2m,
* For more details see (D4.7.1 in ARM DDI 0487A.j).
*/
p2m_remove_pte(entry, p2m->clean_pte);
- p2m_force_tlb_flush_sync(p2m);
+ p2m_force_tlb_flush_range_sync(p2m, gfn_x(sgfn) << PAGE_SHIFT,
+ 1UL << page_order);
p2m_write_pte(entry, split_pte, p2m->clean_pte);
@@ -1090,8 +1140,8 @@ static int __p2m_set_entry(struct p2m_domain *p2m,
p2m_remove_pte(entry, p2m->clean_pte);
if ( removing_mapping )
- /* Flush can be deferred if the entry is removed */
- p2m->need_flush |= !!lpae_is_valid(orig_pte);
+ p2m_force_tlb_flush_range_sync(p2m, gfn_x(sgfn) << PAGE_SHIFT,
+ 1UL << page_order);
else
{
lpae_t pte = mfn_to_p2m_entry(smfn, t, a);
@@ -1102,18 +1152,10 @@ static int __p2m_set_entry(struct p2m_domain *p2m,
/*
* It is necessary to flush the TLB before writing the new entry
* to keep coherency when the previous entry was valid.
- *
- * Although, it could be defered when only the permissions are
- * changed (e.g in case of memaccess).
*/
if ( lpae_is_valid(orig_pte) )
- {
- if ( likely(!p2m->mem_access_enabled) ||
- P2M_CLEAR_PERM(pte) != P2M_CLEAR_PERM(orig_pte) )
- p2m_force_tlb_flush_sync(p2m);
- else
- p2m->need_flush = true;
- }
+ p2m_force_tlb_flush_range_sync(p2m, gfn_x(sgfn) << PAGE_SHIFT,
+ 1UL << page_order);
else if ( !p2m_is_valid(orig_pte) ) /* new mapping */
p2m->stats.mappings[level]++;
--
2.43.0
* [PATCH 2/3] xen/arm: optimize stage-1,2 combined TLBI in presence of FEAT_nTLBPA
@ 2026-01-18 13:33 ` Haseeb Ashraf
0 siblings, 0 replies; 10+ messages in thread
From: Haseeb Ashraf @ 2025-12-08 13:55 UTC (permalink / raw)
To: xen-devel
Cc: Haseeb Ashraf, Stefano Stabellini, Julien Grall, Bertrand Marquis,
Michal Orzel, Volodymyr Babchuk, Dan Driscoll, Noor Ahsan Khawaja,
Fahad Arslan, Andrew Bachtel, Mohamed Mediouni
From: Haseeb Ashraf <haseeb.ashraf@siemens.com>
FEAT_nTLBPA (quoting definition) introduces a mechanism to identify
if the intermediate caching of translation table walks does not
include non-coherent caches of previous valid translation table
entries since the last completed TLBI applicable to the PE.
As there won't be any such non-coherent caches since the last
completed TLBI, a stage-1 TLBI is not required when performing a
stage-2 TLBI. This feature is optionally available in both arm32 and
arm64.
Suggested-by: Mohamed Mediouni <mohamed@unpredictable.fr>
Signed-off-by: Haseeb Ashraf <haseeb.ashraf@siemens.com>
Changes in v3:
- This commit has no functional change in v3, only rebasing changes
due to updates in commit-1.
Changes in v2:
- This commit is new in v2 and is split from commit-1 of v1. It is
implemented using a CPU capability.
---
xen/arch/arm/cpufeature.c | 19 ++++++
xen/arch/arm/include/asm/arm32/flushtlb.h | 14 +++--
xen/arch/arm/include/asm/arm64/flushtlb.h | 77 ++++++++++++++++-------
xen/arch/arm/include/asm/cpufeature.h | 24 ++++++-
xen/arch/arm/include/asm/processor.h | 7 +++
5 files changed, 109 insertions(+), 32 deletions(-)
diff --git a/xen/arch/arm/cpufeature.c b/xen/arch/arm/cpufeature.c
index 1a80738571..9fa1c45869 100644
--- a/xen/arch/arm/cpufeature.c
+++ b/xen/arch/arm/cpufeature.c
@@ -17,7 +17,19 @@ DECLARE_BITMAP(cpu_hwcaps, ARM_NCAPS);
struct cpuinfo_arm __read_mostly domain_cpuinfo;
+#ifdef CONFIG_ARM_32
+static bool has_ntlbpa(const struct arm_cpu_capabilities *entry)
+{
+ return system_cpuinfo.mm32.ntlbpa == MM32_NTLBPA_SUPPORT_IMP;
+}
+#endif
+
#ifdef CONFIG_ARM_64
+static bool has_ntlbpa(const struct arm_cpu_capabilities *entry)
+{
+ return system_cpuinfo.mm64.ntlbpa == MM64_NTLBPA_SUPPORT_IMP;
+}
+
static bool has_sb_instruction(const struct arm_cpu_capabilities *entry)
{
return system_cpuinfo.isa64.sb;
@@ -25,6 +37,13 @@ static bool has_sb_instruction(const struct arm_cpu_capabilities *entry)
#endif
static const struct arm_cpu_capabilities arm_features[] = {
+#if defined(CONFIG_ARM_32) || defined(CONFIG_ARM_64)
+ {
+ .desc = "Intermediate caching of translation table walks (nTLBPA)",
+ .capability = ARM_HAS_NTLBPA,
+ .matches = has_ntlbpa,
+ },
+#endif
#ifdef CONFIG_ARM_64
{
.desc = "Speculation barrier instruction (SB)",
diff --git a/xen/arch/arm/include/asm/arm32/flushtlb.h b/xen/arch/arm/include/asm/arm32/flushtlb.h
index 3c0c2123d4..7cff042508 100644
--- a/xen/arch/arm/include/asm/arm32/flushtlb.h
+++ b/xen/arch/arm/include/asm/arm32/flushtlb.h
@@ -49,8 +49,8 @@ TLB_HELPER(flush_xen_tlb_local, TLBIALLH, nsh)
* Flush TLB of local processor. Use when flush for only stage-1 is intended.
*
* The following function should be used where intention is to clear only
- * stage-1 TLBs. This would be helpful in future in identifying which stage-1
- * TLB flushes can be skipped such as in present of FEAT_nTLBPA.
+ * stage-1 TLBs. This would be helpful in identifying which stage-1 TLB flushes
+ * can be skipped such as in present of FEAT_nTLBPA.
*/
static inline void flush_guest_tlb_s1_local(void)
{
@@ -60,7 +60,8 @@ static inline void flush_guest_tlb_s1_local(void)
*
* See ARMv8 (DDI 0487L.b): G5-11698 Table G5-23.
*/
- return flush_guest_tlb_local();
+ if ( !cpus_have_const_cap(ARM_HAS_NTLBPA) )
+ flush_guest_tlb_local();
}
/*
@@ -68,8 +69,8 @@ static inline void flush_guest_tlb_s1_local(void)
* stage-1 is intended.
*
* The following function should be used where intention is to clear only
- * stage-1 TLBs. This would be helpful in future in identifying which stage-1
- * TLB flushes can be skipped such as in present of FEAT_nTLBPA.
+ * stage-1 TLBs. This would be helpful in identifying which stage-1 TLB flushes
+ * can be skipped such as in present of FEAT_nTLBPA.
*/
static inline void flush_guest_tlb_s1(void)
{
@@ -79,7 +80,8 @@ static inline void flush_guest_tlb_s1(void)
*
* See ARMv8 (DDI 0487L.b): G5-11698 Table G5-23.
*/
- return flush_guest_tlb();
+ if ( !cpus_have_const_cap(ARM_HAS_NTLBPA) )
+ flush_guest_tlb();
}
/* Flush TLB of local processor for address va. */
diff --git a/xen/arch/arm/include/asm/arm64/flushtlb.h b/xen/arch/arm/include/asm/arm64/flushtlb.h
index 67ae616993..0f0d5050e5 100644
--- a/xen/arch/arm/include/asm/arm64/flushtlb.h
+++ b/xen/arch/arm/include/asm/arm64/flushtlb.h
@@ -47,6 +47,24 @@ static inline void name(void) \
: : : "memory"); \
}
+#define TLB_HELPER_NTLBPA(name, tlbop, sh) \
+static inline void name(void) \
+{ \
+ if ( !cpus_have_const_cap(ARM_HAS_NTLBPA) ) \
+ asm_inline volatile ( \
+ "dsb " # sh "st;" \
+ "tlbi " # tlbop ";" \
+ ALTERNATIVE( \
+ "nop; nop;", \
+ "dsb ish;" \
+ "tlbi " # tlbop ";", \
+ ARM64_WORKAROUND_REPEAT_TLBI, \
+ CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \
+ "dsb " # sh ";" \
+ "isb;" \
+ : : : "memory"); \
+}
+
/*
* FLush TLB by VA. This will likely be used in a loop, so the caller
* is responsible to use the appropriate memory barriers before/after
@@ -75,10 +93,10 @@ TLB_HELPER(flush_guest_tlb_local, vmalls12e1, nsh)
TLB_HELPER(flush_guest_tlb, vmalls12e1is, ish)
/* Flush local TLBs, current VMID, stage-1 only */
-TLB_HELPER(flush_guest_tlb_s1_local, vmalle1, nsh)
+TLB_HELPER_NTLBPA(flush_guest_tlb_s1_local, vmalle1, nsh)
/* Flush innershareable TLBs, current VMID, stage-1 only */
-TLB_HELPER(flush_guest_tlb_s1, vmalle1is, ish)
+TLB_HELPER_NTLBPA(flush_guest_tlb_s1, vmalle1is, ish)
/* Flush local TLBs, all VMIDs, non-hypervisor mode */
TLB_HELPER(flush_all_guests_tlb_local, alle1, nsh)
@@ -104,8 +122,6 @@ TLB_HELPER_VA(__flush_xen_tlb_one, vae2is)
*/
static inline void flush_guest_tlb_range_ipa(paddr_t ipa, unsigned long size)
{
- paddr_t end;
-
/*
* If IPA range is too big (empirically found to be 256M), then fallback to
* full TLB flush.
@@ -113,27 +129,42 @@ static inline void flush_guest_tlb_range_ipa(paddr_t ipa, unsigned long size)
if ( size > SZ_256M )
return flush_guest_tlb();
- end = ipa + size;
-
- /*
- * See ARM ARM DDI 0487L.b D8.17.6.1 (Invalidating TLB entries from stage 2
- * translations) for details of TLBI sequence.
- */
- dsb(ishst); /* Ensure prior page-tables updates have completed */
- while ( ipa < end )
+ else if ( size > 0 )
{
- /* Flush stage-2 TLBs for ipa address */
- asm_inline volatile (
- "tlbi ipas2e1is, %0;" : : "r" (ipa >> PAGE_SHIFT) : "memory" );
- ipa += PAGE_SIZE;
+ paddr_t end = ipa + size;
+
+ /*
+ * See ARM ARM DDI 0487L.b D8.17.6.1 (Invalidating TLB entries from
+ * stage 2 translations) for details on TLBI sequence.
+ */
+ dsb(ishst); /* Ensure prior page-tables updates have completed */
+ while ( ipa < end )
+ {
+ /* Flush stage-2 TLBs for ipa address */
+ asm_inline volatile (
+ "tlbi ipas2e1is, %0;" : : "r" (ipa >> PAGE_SHIFT) : "memory" );
+ ipa += PAGE_SIZE;
+ }
+ if ( cpus_have_const_cap(ARM_HAS_NTLBPA) )
+ asm_inline volatile (
+ ALTERNATIVE(
+ "nop; nop;",
+ "dsb ish;"
+ "tlbi ipas2e1is, %0;",
+ ARM64_WORKAROUND_REPEAT_TLBI,
+ CONFIG_ARM64_WORKAROUND_REPEAT_TLBI)
+ "dsb ish;"
+ "isb;"
+ : : "r" ((ipa - PAGE_SIZE) >> PAGE_SHIFT) : "memory" );
+ else
+ /*
+ * As ARM64_WORKAROUND_REPEAT_TLBI is required to be applied to
+ * last TLBI of the sequence, it is only needed to be handled in
+ * the following invocation. Final dsb() and isb() are also applied
+ * in the following invocation.
+ */
+ flush_guest_tlb_s1();
}
- /*
- * As ARM64_WORKAROUND_REPEAT_TLBI is required to be applied to last TLBI
- * of the sequence, it is only needed to be handled in the following
- * invocation. Final dsb() and isb() are also applied in the following
- * invocation.
- */
- flush_guest_tlb_s1();
}
#endif /* __ASM_ARM_ARM64_FLUSHTLB_H__ */
diff --git a/xen/arch/arm/include/asm/cpufeature.h b/xen/arch/arm/include/asm/cpufeature.h
index 13353c8e1a..9f796ed4c1 100644
--- a/xen/arch/arm/include/asm/cpufeature.h
+++ b/xen/arch/arm/include/asm/cpufeature.h
@@ -76,8 +76,9 @@
#define ARM_WORKAROUND_BHB_SMCC_3 15
#define ARM_HAS_SB 16
#define ARM64_WORKAROUND_1508412 17
+#define ARM_HAS_NTLBPA 18
-#define ARM_NCAPS 18
+#define ARM_NCAPS 19
#ifndef __ASSEMBLER__
@@ -269,7 +270,8 @@ struct cpuinfo_arm {
unsigned long ets:4;
unsigned long __res1:4;
unsigned long afp:4;
- unsigned long __res2:12;
+ unsigned long ntlbpa:4;
+ unsigned long __res2:8;
unsigned long ecbhb:4;
/* MMFR2 */
@@ -430,8 +432,24 @@ struct cpuinfo_arm {
register_t bits[1];
} aux32;
- struct {
+ union {
register_t bits[6];
+ struct {
+ /* MMFR0 */
+ unsigned long __res0:32;
+ /* MMFR1 */
+ unsigned long __res1:32;
+ /* MMFR2 */
+ unsigned long __res2:32;
+ /* MMFR3 */
+ unsigned long __res3:32;
+ /* MMFR4 */
+ unsigned long __res4:32;
+ /* MMFR5 */
+ unsigned long __res5:4;
+ unsigned long ntlbpa:4;
+ unsigned long __res6:24;
+ };
} mm32;
struct {
diff --git a/xen/arch/arm/include/asm/processor.h b/xen/arch/arm/include/asm/processor.h
index 1a48c9ff3b..85f3b643a0 100644
--- a/xen/arch/arm/include/asm/processor.h
+++ b/xen/arch/arm/include/asm/processor.h
@@ -459,9 +459,16 @@
/* FSR long format */
#define FSRL_STATUS_DEBUG (_AC(0x22,UL)<<0)
+#ifdef CONFIG_ARM_32
+#define MM32_NTLBPA_SUPPORT_NI 0x0
+#define MM32_NTLBPA_SUPPORT_IMP 0x1
+#endif
+
#ifdef CONFIG_ARM_64
#define MM64_VMID_8_BITS_SUPPORT 0x0
#define MM64_VMID_16_BITS_SUPPORT 0x2
+#define MM64_NTLBPA_SUPPORT_NI 0x0
+#define MM64_NTLBPA_SUPPORT_IMP 0x1
#endif
#ifndef __ASSEMBLER__
--
2.43.0
* [PATCH 3/3] xen/arm32: add CPU capability for IPA-based TLBI
@ 2026-01-18 13:33 ` Haseeb Ashraf
0 siblings, 0 replies; 10+ messages in thread
From: Haseeb Ashraf @ 2025-12-08 13:55 UTC (permalink / raw)
To: xen-devel
Cc: Haseeb Ashraf, Stefano Stabellini, Julien Grall, Bertrand Marquis,
Michal Orzel, Volodymyr Babchuk, Dan Driscoll, Noor Ahsan Khawaja,
Fahad Arslan, Andrew Bachtel
From: Haseeb Ashraf <haseeb.ashraf@siemens.com>
IPA-based TLBI is available since Armv8 and can be used to perform
stage-2 invalidation by IPA on arm32. XENMEM_remove_from_physmap
performs a TLB invalidation in each hypercall, so in presence of
nTLBPA this code path is optimized to flush by IPA instead of
performing a TLBIALL each time.
Suggested-by: Julien Grall <julien@xen.org>
Signed-off-by: Haseeb Ashraf <haseeb.ashraf@siemens.com>
Changes in v3:
- There are no functional changes in this version. There are minor
code updates and comment updates as per the feedback on v2.
- The cpregs are defined in order as per Coprocessor-> CRn-> Opcode 1
-> CRm-> Opcode 2.
- Added comment to explain why IPA-based TLBI is added only in
presence of FEAT_nTLBPA.
- Replaced `goto default_tlbi` with if...else.
- Removed extra definitions of MM32_UNITLB_* macros which were not
being used.
Changes in v2:
- This commit is implemented in v2 as per the feedback to implement
IPA-based TLBI for Arm32 in addition to Arm64.
---
xen/arch/arm/cpufeature.c | 12 +++++++
xen/arch/arm/include/asm/arm32/flushtlb.h | 42 ++++++++++++++++++++---
xen/arch/arm/include/asm/cpregs.h | 4 +++
xen/arch/arm/include/asm/cpufeature.h | 15 ++++----
xen/arch/arm/include/asm/processor.h | 3 ++
5 files changed, 65 insertions(+), 11 deletions(-)
diff --git a/xen/arch/arm/cpufeature.c b/xen/arch/arm/cpufeature.c
index 9fa1c45869..d18c6449c6 100644
--- a/xen/arch/arm/cpufeature.c
+++ b/xen/arch/arm/cpufeature.c
@@ -18,6 +18,11 @@ DECLARE_BITMAP(cpu_hwcaps, ARM_NCAPS);
struct cpuinfo_arm __read_mostly domain_cpuinfo;
#ifdef CONFIG_ARM_32
+static bool has_tlb_ipa_instruction(const struct arm_cpu_capabilities *entry)
+{
+ return system_cpuinfo.mm32.unitlb == MM32_UNITLB_BY_IPA;
+}
+
static bool has_ntlbpa(const struct arm_cpu_capabilities *entry)
{
return system_cpuinfo.mm32.ntlbpa == MM32_NTLBPA_SUPPORT_IMP;
@@ -37,6 +42,13 @@ static bool has_sb_instruction(const struct arm_cpu_capabilities *entry)
#endif
static const struct arm_cpu_capabilities arm_features[] = {
+#ifdef CONFIG_ARM_32
+ {
+ .desc = "IPA-based TLB Invalidation",
+ .capability = ARM32_HAS_TLB_IPA,
+ .matches = has_tlb_ipa_instruction,
+ },
+#endif
#if defined(CONFIG_ARM_32) || defined(CONFIG_ARM_64)
{
.desc = "Intermediate caching of translation table walks (nTLBPA)",
diff --git a/xen/arch/arm/include/asm/arm32/flushtlb.h b/xen/arch/arm/include/asm/arm32/flushtlb.h
index 7cff042508..3e6f86f6d2 100644
--- a/xen/arch/arm/include/asm/arm32/flushtlb.h
+++ b/xen/arch/arm/include/asm/arm32/flushtlb.h
@@ -1,6 +1,8 @@
#ifndef __ASM_ARM_ARM32_FLUSHTLB_H__
#define __ASM_ARM_ARM32_FLUSHTLB_H__
+#include <xen/sizes.h> /* For SZ_* macros. */
+
/*
* Every invalidation operation use the following patterns:
*
@@ -104,12 +106,42 @@ static inline void flush_guest_tlb_range_ipa(paddr_t ipa,
unsigned long size)
{
/*
- * Following can invalidate both stage-1 and stage-2 TLBs depending upon
- * the execution mode.
- *
- * See ARMv8 (DDI 0487L.b): G5-11698 Table G5-23.
+ * IPA-based TLBI is used only in presence of nTLBPA, otherwise, stage-1
+ * invalidation would still be required and there is no separate TLBI for
+ * stage-1 on Arm32. So in absence of nTLBPA, it is pointless to flush by
+ * IPA.
*/
- flush_guest_tlb();
+ if ( cpus_have_const_cap(ARM_HAS_NTLBPA) &&
+ cpus_have_const_cap(ARM32_HAS_TLB_IPA) )
+ {
+ /*
+ * If IPA range is too big (empirically found to be 256M), then
+ * fallback to full TLB flush
+ */
+ if ( size > SZ_256M )
+ /*
+ * Following can invalidate both stage-1 and stage-2 TLBs depending
+ * upon the execution mode.
+ *
+ * See ARMv8 (DDI 0487L.b): G5-11698 Table G5-23.
+ */
+ flush_guest_tlb();
+ else
+ {
+ paddr_t end = ipa + size;
+
+ dsb(ishst); /* Ensure prior page-tables updates have completed */
+ while ( ipa < end )
+ {
+ /* Flush stage-2 TLBs for ipa address. */
+ asm volatile(STORE_CP32(0, TLBIIPAS2IS)
+ : : "r" (ipa >> PAGE_SHIFT) : "memory");
+ ipa += PAGE_SIZE;
+ }
+ dsb(ish);
+ isb();
+ }
+ }
}
#endif /* __ASM_ARM_ARM32_FLUSHTLB_H__ */
diff --git a/xen/arch/arm/include/asm/cpregs.h b/xen/arch/arm/include/asm/cpregs.h
index a7503a190f..51f091dace 100644
--- a/xen/arch/arm/include/asm/cpregs.h
+++ b/xen/arch/arm/include/asm/cpregs.h
@@ -223,9 +223,13 @@
#define TLBIMVA p15,0,c8,c7,1 /* invalidate unified TLB entry by MVA */
#define TLBIASID p15,0,c8,c7,2 /* invalid unified TLB by ASID match */
#define TLBIMVAA p15,0,c8,c7,3 /* invalidate unified TLB entries by MVA all ASID */
+#define TLBIIPAS2IS p15,4,c8,c0,1 /* Invalidate unified TLB entry for stage 2 by IPA inner shareable */
+#define TLBIIPAS2LIS p15,4,c8,c0,5 /* Invalidate unified TLB entry for stage 2 last level by IPA inner shareable */
#define TLBIALLHIS p15,4,c8,c3,0 /* Invalidate Entire Hyp. Unified TLB inner shareable */
#define TLBIMVAHIS p15,4,c8,c3,1 /* Invalidate Unified Hyp. TLB by MVA inner shareable */
#define TLBIALLNSNHIS p15,4,c8,c3,4 /* Invalidate Entire Non-Secure Non-Hyp. Unified TLB inner shareable */
+#define TLBIIPAS2 p15,4,c8,c4,1 /* Invalidate unified TLB entry for stage 2 by IPA */
+#define TLBIIPAS2L p15,4,c8,c4,5 /* Invalidate unified TLB entry for stage 2 last level by IPA */
#define TLBIALLH p15,4,c8,c7,0 /* Invalidate Entire Hyp. Unified TLB */
#define TLBIMVAH p15,4,c8,c7,1 /* Invalidate Unified Hyp. TLB by MVA */
#define TLBIALLNSNH p15,4,c8,c7,4 /* Invalidate Entire Non-Secure Non-Hyp. Unified TLB */
diff --git a/xen/arch/arm/include/asm/cpufeature.h b/xen/arch/arm/include/asm/cpufeature.h
index 9f796ed4c1..07f1d770b3 100644
--- a/xen/arch/arm/include/asm/cpufeature.h
+++ b/xen/arch/arm/include/asm/cpufeature.h
@@ -77,8 +77,9 @@
#define ARM_HAS_SB 16
#define ARM64_WORKAROUND_1508412 17
#define ARM_HAS_NTLBPA 18
+#define ARM32_HAS_TLB_IPA 19
-#define ARM_NCAPS 19
+#define ARM_NCAPS 20
#ifndef __ASSEMBLER__
@@ -440,15 +441,17 @@ struct cpuinfo_arm {
/* MMFR1 */
unsigned long __res1:32;
/* MMFR2 */
- unsigned long __res2:32;
+ unsigned long __res2:16;
+ unsigned long unitlb:4;
+ unsigned long __res3:12;
/* MMFR3 */
- unsigned long __res3:32;
- /* MMFR4 */
unsigned long __res4:32;
+ /* MMFR4 */
+ unsigned long __res5:32;
/* MMFR5 */
- unsigned long __res5:4;
+ unsigned long __res6:4;
unsigned long ntlbpa:4;
- unsigned long __res6:24;
+ unsigned long __res7:24;
};
} mm32;
diff --git a/xen/arch/arm/include/asm/processor.h b/xen/arch/arm/include/asm/processor.h
index 85f3b643a0..eda39566e1 100644
--- a/xen/arch/arm/include/asm/processor.h
+++ b/xen/arch/arm/include/asm/processor.h
@@ -460,6 +460,9 @@
#define FSRL_STATUS_DEBUG (_AC(0x22,UL)<<0)
#ifdef CONFIG_ARM_32
+#define MM32_UNITLB_NI 0x0
+#define MM32_UNITLB_BY_IPA 0x6
+
#define MM32_NTLBPA_SUPPORT_NI 0x0
#define MM32_NTLBPA_SUPPORT_IMP 0x1
#endif
--
2.43.0
* Re: [PATCH 0/3] xen/arm{32,64}: perform IPA-based TLBI when IPA is
2026-01-18 13:33 ` [XEN PATCH v3 0/3] xen/arm{32,64}: perform IPA-based TLBI when IPA is known Haseeb Ashraf
` (3 preceding siblings ...)
(?)
@ 2025-12-16 12:08 ` Ashraf, Haseeb
2026-01-06 6:40 ` Ashraf, Haseeb
-1 siblings, 1 reply; 10+ messages in thread
From: Ashraf, Haseeb @ 2025-12-16 12:08 UTC (permalink / raw)
To: Haseeb Ashraf, xen-devel@lists.xenproject.org, Julien Grall
Cc: Stefano Stabellini, Bertrand Marquis, Michal Orzel,
Volodymyr Babchuk, Driscoll, Dan, Ahsan Khawaja, Noor,
Arslan, Fahad, Bachtel, Andrew
[-- Attachment #1: Type: text/plain, Size: 188 bytes --]
Hi @Julien Grall<mailto:julien@xen.org>,
Bringing up this patch series again. Please have a look and let me know if there are any comments on v3 of this series.
Thanks & Regards,
Haseeb
* Re: [PATCH 0/3] xen/arm{32,64}: perform IPA-based TLBI when IPA is
2025-12-16 12:08 ` [PATCH 0/3] xen/arm{32,64}: perform IPA-based TLBI when IPA is Ashraf, Haseeb
@ 2026-01-06 6:40 ` Ashraf, Haseeb
0 siblings, 0 replies; 10+ messages in thread
From: Ashraf, Haseeb @ 2026-01-06 6:40 UTC (permalink / raw)
To: Haseeb Ashraf, xen-devel@lists.xenproject.org, Julien Grall
Cc: Stefano Stabellini, Bertrand Marquis, Michal Orzel,
Volodymyr Babchuk, Driscoll, Dan, Ahsan Khawaja, Noor,
Arslan, Fahad, Bachtel, Andrew
[-- Attachment #1: Type: text/plain, Size: 1201 bytes --]
Hello everyone,
Are there any comments on this patch series? I would like to get it merged, as it is a blocker for supporting Xen on KVM.
Regards,
Haseeb
* [XEN PATCH v3 0/3] xen/arm{32,64}: perform IPA-based TLBI when IPA is known
@ 2026-01-18 13:33 ` Haseeb Ashraf
0 siblings, 0 replies; 10+ messages in thread
From: Haseeb Ashraf @ 2026-01-18 13:33 UTC (permalink / raw)
To: xen-devel
Cc: Haseeb Ashraf, Stefano Stabellini, Julien Grall, Bertrand Marquis,
Michal Orzel, Volodymyr Babchuk
From: Haseeb Ashraf <haseeb.ashraf@siemens.com>
This patch series addresses a major issue for running Xen on KVM, i.e.
the costly emulation of VMALLS12E1IS, which becomes worse when this
TLBI is invoked many times. There are mainly two places where this is
problematic:
(a) When vCPUs switch on a pCPU or pCPUs
(b) When domU pages mapped into dom0 are to be unmapped, each page
being removed by XENMEM_remove_from_physmap has its TLBs
invalidated by VMALLS12E1IS, which flushes all stage-1 and stage-2
entries for the current VMID.
This patch series prefers IPA-based TLBIs wherever possible instead of
flushing the TLBs completely every time.
It consists of three patches. The first one addresses the issue
described above for Arm64. The second patch further optimizes the
combined stage-1,2 TLB flushes by leveraging FEAT_nTLBPA. The third
patch introduces IPA-based TLBI for Arm32 in the presence of
FEAT_nTLBPA.
Haseeb Ashraf (3):
xen/arm/p2m: perform IPA-based TLBI when IPA is known
xen/arm: optimize stage-1,2 combined TLBI in presence of FEAT_nTLBPA
xen/arm32: add CPU capability for IPA-based TLBI
Changes in v3:
- Mainly the handling of repeat TLBI workaround with IPA-based TLBI,
so that the extra TLBI and DSB are repeated only for the final TLBI
and DSB of the whole sequence.
- Updated code comments as per feedback. Further details are
available in each commit's changelog.
- Minor updates to code as per feedback. Further details are
available in each commit's changelog.
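To illustrate the first v3 item, here is a rough sketch (not code from
the series) that counts the TLBIs issued for a range flush. It models
the arm64 path without FEAT_nTLBPA, where the per-page TLBIs are
followed by one closing stage-1 TLBI. The function names are
hypothetical, and the "repeat each" variant is only a comparison point
assumed for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * TLBIs issued when the repeat-TLBI workaround is applied only to the
 * final TLBI of the sequence: one per page, one closing TLBI, plus one
 * repeat of that closing TLBI if the workaround is active.
 */
static uint64_t sketch_tlbis_repeat_final_only(uint64_t pages, bool workaround)
{
    return pages + 1 + (workaround ? 1 : 0);
}

/* Hypothetical alternative: every TLBI in the sequence is repeated. */
static uint64_t sketch_tlbis_repeat_each(uint64_t pages, bool workaround)
{
    return (pages + 1) * (workaround ? 2 : 1);
}
```

For a 16-page range on an affected CPU this gives 18 TLBIs instead of
34, and the same saving applies to the accompanying DSBs.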
Changes in v2:
- Split the commit into 3 commits. The first commit provides the
baseline implementation without adding any new CPU
capabilities. The new CPU caps are implemented in separate commits
to emphasize how each of them optimizes the TLB invalidation.
- Moved ARM32 and ARM64 specific implementations of TLBIs to
architecture specific flushtlb.h.
- Added references of ARM ARM in code comments.
- Evaluated and added a threshold to select between IPA-based TLB
invalidation vs fallback to full stage TLB invalidation above
the threshold.
- Introduced ARM_HAS_NTLBPA CPU capability which leverages
FEAT_nTLBPA for arm32 as well as arm64.
- Introduced ARM_HAS_TLB_IPA CPU capability for IPA-based TLBI
for arm32.
Haseeb Ashraf (3):
xen/arm/p2m: perform IPA-based TLBI when IPA is known
xen/arm: optimize stage-1,2 combined TLBI in presence of FEAT_nTLBPA
xen/arm32: add CPU capability for IPA-based TLBI
xen/arch/arm/cpufeature.c | 31 ++++++++
xen/arch/arm/include/asm/arm32/flushtlb.h | 87 +++++++++++++++++++++
xen/arch/arm/include/asm/arm64/flushtlb.h | 77 +++++++++++++++++++
xen/arch/arm/include/asm/cpregs.h | 4 +
xen/arch/arm/include/asm/cpufeature.h | 27 ++++++-
xen/arch/arm/include/asm/mmu/p2m.h | 2 +
xen/arch/arm/include/asm/processor.h | 10 +++
xen/arch/arm/mmu/p2m.c | 92 +++++++++++++++++------
8 files changed, 302 insertions(+), 28 deletions(-)
--
2.43.0
* [XEN PATCH v3 1/3] xen/arm/p2m: perform IPA-based TLBI when IPA is known
@ 2026-01-18 13:33 ` Haseeb Ashraf
0 siblings, 0 replies; 10+ messages in thread
From: Haseeb Ashraf @ 2026-01-18 13:33 UTC (permalink / raw)
To: xen-devel
Cc: Haseeb Ashraf, Stefano Stabellini, Julien Grall, Bertrand Marquis,
Michal Orzel, Volodymyr Babchuk
From: Haseeb Ashraf <haseeb.ashraf@siemens.com>
This commit addresses a major issue for running Xen on KVM, i.e. the
costly emulation of VMALLS12E1IS, which becomes worse when this TLBI
is invoked many times. There are mainly two places where this is
problematic:
(a) When vCPUs switch on a pCPU or pCPUs
(b) When domU pages mapped into dom0 are to be unmapped, each page
being removed by XENMEM_remove_from_physmap has its TLBs
invalidated by VMALLS12E1IS.
The first one is addressed by relaxing VMALLS12E1 -> VMALLE1, as the
stage-2 is common between all the vCPUs of a VM. Since each CPU has
its own private TLBs, a flush between vCPUs of the same domain is
still required to prevent translations from vCPUx from "leaking" to
vCPUy; this can be achieved by using VMALLE1.
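As a minimal sketch of the vCPU-switch condition described above (an
illustration only, not the Xen code): the helper below decides whether
a stage-1 flush is needed when a vCPU of a domain is scheduled on a
pCPU. The sentinel and function name are hypothetical stand-ins for
INVALID_VCPU_ID and the check in p2m_restore_state().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sentinel, standing in for Xen's INVALID_VCPU_ID. */
#define SKETCH_INVALID_VCPU_ID UINT8_MAX

/*
 * Decide whether a stage-1 flush is needed when vCPU `next` of a domain
 * is about to run on a pCPU whose last vCPU of that domain is recorded
 * in `*last_ran`, then record `next` as the last vCPU run.
 */
static bool sketch_need_s1_flush(uint8_t *last_ran, uint8_t next)
{
    bool flush = (*last_ran != SKETCH_INVALID_VCPU_ID) && (*last_ran != next);

    *last_ran = next;
    return flush;
}
```

No flush is needed the first time a domain runs on the pCPU or when the
same vCPU runs again; only a switch between different vCPUs of the same
domain requires the (stage-1 only) invalidation.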
The second one is addressed by using IPA-based TLBI (IPAS2E1) in
combination with VMALLE1 whenever the IPA range is known, instead of
using VMALLS12E1. There is an upper cap on the number of IPA-based
TLBIs. The ratio of the execution time of VMALLS12E1 to that of
IPAS2E1 was found to be about 70K on Graviton4 with Xen on KVM. So,
64K * 4KB = 256MB is set as the threshold.
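The threshold arithmetic can be sketched as follows. This is only an
illustration under the assumption of a 4KB page size; the macro and
function names are hypothetical and do not appear in the patch, which
compares the size against SZ_256M directly.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE     4096ULL          /* 4KB pages (assumed) */
#define SKETCH_MAX_IPA_TLBIS (64ULL * 1024)   /* cap on per-page TLBIs */
#define SKETCH_RANGE_LIMIT   (SKETCH_MAX_IPA_TLBIS * SKETCH_PAGE_SIZE)

/* True if the range is small enough to be flushed page by page. */
static bool sketch_use_ipa_tlbi(uint64_t size)
{
    return size <= SKETCH_RANGE_LIMIT;
}

/* Number of per-page TLBIs needed to cover `size` bytes. */
static uint64_t sketch_ipa_tlbi_count(uint64_t size)
{
    return (size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}
```

With these values the limit works out to 256MB: a range of exactly
256MB is still flushed by IPA, and anything larger falls back to the
full flush.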
For arm32, TLBIALL instruction can invalidate both stage-1 and
stage-2 entries, so using IPA-based TLBI would be redundant as
TLBIALL is required in any case to invalidate corresponding cached
entries from stage-1.
Suggested-by: Julien Grall <julien@xen.org>
Signed-off-by: Haseeb Ashraf <haseeb.ashraf@siemens.com>
Changes in v3:
- Updated IPA-based TLBI sequence to apply ARM64 repeat TLBI
workaround to only final TLBI and DSB of the sequence.
- Removed TLB_HELPER_IPA and instead directly used the TLBI
instruction where needed as that was the only instance where it is
being used.
- Removed flush_guest_tlb_range_ipa_local() as it was not being used.
- Updated comments as per feedback in v2 about holding lock before
p2m_load_vttbr.
- Updated references of ARM ARM to use newer version DDI 0487L.b
instead of older version DDI 0487A.e.
Changes in v2:
- This commit provides the baseline implementation to address the
problem at hand. Removed the FEAT_nTLBPA implementation from this
commit; it is implemented in the following commit using a CPU
capability.
- Moved ARM32 and ARM64 specific implementations of TLBIs to
architecture specific flushtlb.h.
- Added references of ARM ARM in code comments.
- Evaluated and added a threshold to select between IPA-based TLB
invalidation vs fallback to full stage TLB invalidation above
the threshold.
---
xen/arch/arm/include/asm/arm32/flushtlb.h | 53 +++++++++++++
xen/arch/arm/include/asm/arm64/flushtlb.h | 46 ++++++++++++
xen/arch/arm/include/asm/mmu/p2m.h | 2 +
xen/arch/arm/mmu/p2m.c | 92 +++++++++++++++++------
4 files changed, 168 insertions(+), 25 deletions(-)
diff --git a/xen/arch/arm/include/asm/arm32/flushtlb.h b/xen/arch/arm/include/asm/arm32/flushtlb.h
index 61c25a3189..3c0c2123d4 100644
--- a/xen/arch/arm/include/asm/arm32/flushtlb.h
+++ b/xen/arch/arm/include/asm/arm32/flushtlb.h
@@ -45,6 +45,43 @@ TLB_HELPER(flush_xen_tlb_local, TLBIALLH, nsh)
#undef TLB_HELPER
+/*
+ * Flush TLB of local processor. Use when flush for only stage-1 is intended.
+ *
+ * The following function should be used where intention is to clear only
+ * stage-1 TLBs. This would be helpful in future in identifying which stage-1
+ * TLB flushes can be skipped such as in present of FEAT_nTLBPA.
+ */
+static inline void flush_guest_tlb_s1_local(void)
+{
+ /*
+ * Same instruction can invalidate both stage-1 and stage-2 TLBs depending
+ * upon the execution context.
+ *
+ * See ARMv8 (DDI 0487L.b): G5-11698 Table G5-23.
+ */
+ return flush_guest_tlb_local();
+}
+
+/*
+ * Flush TLB of inner-shareable processor domain. Use when flush for only
+ * stage-1 is intended.
+ *
+ * The following function should be used where intention is to clear only
+ * stage-1 TLBs. This would be helpful in future in identifying which stage-1
+ * TLB flushes can be skipped such as in present of FEAT_nTLBPA.
+ */
+static inline void flush_guest_tlb_s1(void)
+{
+ /*
+ * Same instruction can invalidate both stage-1 and stage-2 TLBs depending
+ * upon the execution context.
+ *
+ * See ARMv8 (DDI 0487L.b): G5-11698 Table G5-23.
+ */
+ return flush_guest_tlb();
+}
+
/* Flush TLB of local processor for address va. */
static inline void __flush_xen_tlb_one_local(vaddr_t va)
{
@@ -57,6 +94,22 @@ static inline void __flush_xen_tlb_one(vaddr_t va)
asm volatile(STORE_CP32(0, TLBIMVAHIS) : : "r" (va) : "memory");
}
+/*
+ * Flush a range of IPA's mappings from the TLB of all processors in the
+ * inner-shareable domain.
+ */
+static inline void flush_guest_tlb_range_ipa(paddr_t ipa,
+ unsigned long size)
+{
+ /*
+ * Following can invalidate both stage-1 and stage-2 TLBs depending upon
+ * the execution mode.
+ *
+ * See ARMv8 (DDI 0487L.b): G5-11698 Table G5-23.
+ */
+ flush_guest_tlb();
+}
+
#endif /* __ASM_ARM_ARM32_FLUSHTLB_H__ */
/*
* Local variables:
diff --git a/xen/arch/arm/include/asm/arm64/flushtlb.h b/xen/arch/arm/include/asm/arm64/flushtlb.h
index 3b99c11b50..67ae616993 100644
--- a/xen/arch/arm/include/asm/arm64/flushtlb.h
+++ b/xen/arch/arm/include/asm/arm64/flushtlb.h
@@ -1,6 +1,8 @@
#ifndef __ASM_ARM_ARM64_FLUSHTLB_H__
#define __ASM_ARM_ARM64_FLUSHTLB_H__
+#include <xen/sizes.h> /* For SZ_* macros. */
+
/*
* Every invalidation operation use the following patterns:
*
@@ -72,6 +74,12 @@ TLB_HELPER(flush_guest_tlb_local, vmalls12e1, nsh)
/* Flush innershareable TLBs, current VMID only */
TLB_HELPER(flush_guest_tlb, vmalls12e1is, ish)
+/* Flush local TLBs, current VMID, stage-1 only */
+TLB_HELPER(flush_guest_tlb_s1_local, vmalle1, nsh)
+
+/* Flush innershareable TLBs, current VMID, stage-1 only */
+TLB_HELPER(flush_guest_tlb_s1, vmalle1is, ish)
+
/* Flush local TLBs, all VMIDs, non-hypervisor mode */
TLB_HELPER(flush_all_guests_tlb_local, alle1, nsh)
@@ -90,6 +98,44 @@ TLB_HELPER_VA(__flush_xen_tlb_one, vae2is)
#undef TLB_HELPER
#undef TLB_HELPER_VA
+/*
+ * Flush a range of IPA's mappings from the TLB of all processors in the
+ * inner-shareable domain.
+ */
+static inline void flush_guest_tlb_range_ipa(paddr_t ipa, unsigned long size)
+{
+ paddr_t end;
+
+ /*
+ * If IPA range is too big (empirically found to be 256M), then fallback to
+ * full TLB flush.
+ */
+ if ( size > SZ_256M )
+ return flush_guest_tlb();
+
+ end = ipa + size;
+
+ /*
+ * See ARM ARM DDI 0487L.b D8.17.6.1 (Invalidating TLB entries from stage 2
+ * translations) for details of TLBI sequence.
+ */
+ dsb(ishst); /* Ensure prior page-tables updates have completed */
+ while ( ipa < end )
+ {
+ /* Flush stage-2 TLBs for ipa address */
+ asm_inline volatile (
+ "tlbi ipas2e1is, %0;" : : "r" (ipa >> PAGE_SHIFT) : "memory" );
+ ipa += PAGE_SIZE;
+ }
+ /*
+ * As ARM64_WORKAROUND_REPEAT_TLBI is required to be applied to last TLBI
+ * of the sequence, it is only needed to be handled in the following
+ * invocation. Final dsb() and isb() are also applied in the following
+ * invocation.
+ */
+ flush_guest_tlb_s1();
+}
+
#endif /* __ASM_ARM_ARM64_FLUSHTLB_H__ */
/*
* Local variables:
diff --git a/xen/arch/arm/include/asm/mmu/p2m.h b/xen/arch/arm/include/asm/mmu/p2m.h
index 58496c0b09..8a16722b82 100644
--- a/xen/arch/arm/include/asm/mmu/p2m.h
+++ b/xen/arch/arm/include/asm/mmu/p2m.h
@@ -10,6 +10,8 @@ extern unsigned int p2m_root_level;
struct p2m_domain;
void p2m_force_tlb_flush_sync(struct p2m_domain *p2m);
+void p2m_force_tlb_flush_range_sync(struct p2m_domain *p2m, uint64_t start_ipa,
+ uint64_t page_count);
void p2m_tlb_flush_sync(struct p2m_domain *p2m);
void p2m_clear_root_pages(struct p2m_domain *p2m);
diff --git a/xen/arch/arm/mmu/p2m.c b/xen/arch/arm/mmu/p2m.c
index 51abf3504f..eec59056fa 100644
--- a/xen/arch/arm/mmu/p2m.c
+++ b/xen/arch/arm/mmu/p2m.c
@@ -235,33 +235,28 @@ void p2m_restore_state(struct vcpu *n)
* when running multiple vCPU of the same domain on a single pCPU.
*/
if ( *last_vcpu_ran != INVALID_VCPU_ID && *last_vcpu_ran != n->vcpu_id )
- flush_guest_tlb_local();
+ flush_guest_tlb_s1_local();
*last_vcpu_ran = n->vcpu_id;
}
/*
- * Force a synchronous P2M TLB flush.
+ * Loads VTTBR from given P2M.
*
* Must be called with the p2m lock held.
+ *
+ * This returns switched out VTTBR.
*/
-void p2m_force_tlb_flush_sync(struct p2m_domain *p2m)
+static uint64_t p2m_load_vttbr(struct p2m_domain *p2m, unsigned long *flags)
{
- unsigned long flags = 0;
uint64_t ovttbr;
- ASSERT(p2m_is_write_locked(p2m));
-
- /*
- * ARM only provides an instruction to flush TLBs for the current
- * VMID. So switch to the VTTBR of a given P2M if different.
- */
ovttbr = READ_SYSREG64(VTTBR_EL2);
if ( ovttbr != p2m->vttbr )
{
uint64_t vttbr;
- local_irq_save(flags);
+ local_irq_save(*flags);
/*
* ARM64_WORKAROUND_AT_SPECULATE: We need to stop AT to allocate
@@ -280,8 +275,14 @@ void p2m_force_tlb_flush_sync(struct p2m_domain *p2m)
isb();
}
- flush_guest_tlb();
+ return ovttbr;
+}
+/*
+ * Restores VTTBR which was switched out as a result of p2m_load_vttbr().
+ */
+static void p2m_restore_vttbr(uint64_t ovttbr, unsigned long flags)
+{
if ( ovttbr != READ_SYSREG64(VTTBR_EL2) )
{
WRITE_SYSREG64(ovttbr, VTTBR_EL2);
@@ -289,10 +290,58 @@ void p2m_force_tlb_flush_sync(struct p2m_domain *p2m)
isb();
local_irq_restore(flags);
}
+}
+
+/*
+ * Force a synchronous P2M TLB flush.
+ *
+ * Must be called with the p2m lock held.
+ */
+void p2m_force_tlb_flush_sync(struct p2m_domain *p2m)
+{
+ unsigned long flags = 0;
+ uint64_t ovttbr;
+
+ ASSERT(p2m_is_write_locked(p2m));
+
+ /*
+ * ARM only provides an instruction to flush TLBs for the current
+ * VMID. So switch to the VTTBR of a given P2M if different.
+ */
+ ovttbr = p2m_load_vttbr(p2m, &flags);
+
+ flush_guest_tlb();
+
+ p2m_restore_vttbr(ovttbr, flags);
p2m->need_flush = false;
}
+/*
+ * Force a synchronous P2M TLB flush on a range of addresses.
+ *
+ * Must be called with the p2m lock held.
+ */
+void p2m_force_tlb_flush_range_sync(struct p2m_domain *p2m, uint64_t start_ipa,
+ uint64_t page_count)
+{
+ unsigned long flags = 0;
+ uint64_t ovttbr;
+
+ ASSERT(p2m_is_write_locked(p2m));
+
+ /*
+ * ARM only provides an instruction to flush TLBs for the current
+ * VMID. So switch to the VTTBR of a given P2M if different.
+ */
+ ovttbr = p2m_load_vttbr(p2m, &flags);
+
+ /* Invalidate TLB entries by IPA range */
+ flush_guest_tlb_range_ipa(start_ipa, PAGE_SIZE * page_count);
+
+ p2m_restore_vttbr(ovttbr, flags);
+}
+
void p2m_tlb_flush_sync(struct p2m_domain *p2m)
{
if ( p2m->need_flush )
@@ -1034,7 +1083,8 @@ static int __p2m_set_entry(struct p2m_domain *p2m,
* For more details see (D4.7.1 in ARM DDI 0487A.j).
*/
p2m_remove_pte(entry, p2m->clean_pte);
- p2m_force_tlb_flush_sync(p2m);
+ p2m_force_tlb_flush_range_sync(p2m, gfn_x(sgfn) << PAGE_SHIFT,
+ 1UL << page_order);
p2m_write_pte(entry, split_pte, p2m->clean_pte);
@@ -1090,8 +1140,8 @@ static int __p2m_set_entry(struct p2m_domain *p2m,
p2m_remove_pte(entry, p2m->clean_pte);
if ( removing_mapping )
- /* Flush can be deferred if the entry is removed */
- p2m->need_flush |= !!lpae_is_valid(orig_pte);
+ p2m_force_tlb_flush_range_sync(p2m, gfn_x(sgfn) << PAGE_SHIFT,
+ 1UL << page_order);
else
{
lpae_t pte = mfn_to_p2m_entry(smfn, t, a);
@@ -1102,18 +1152,10 @@ static int __p2m_set_entry(struct p2m_domain *p2m,
/*
* It is necessary to flush the TLB before writing the new entry
* to keep coherency when the previous entry was valid.
- *
- * Although, it could be defered when only the permissions are
- * changed (e.g in case of memaccess).
*/
if ( lpae_is_valid(orig_pte) )
- {
- if ( likely(!p2m->mem_access_enabled) ||
- P2M_CLEAR_PERM(pte) != P2M_CLEAR_PERM(orig_pte) )
- p2m_force_tlb_flush_sync(p2m);
- else
- p2m->need_flush = true;
- }
+ p2m_force_tlb_flush_range_sync(p2m, gfn_x(sgfn) << PAGE_SHIFT,
+ 1UL << page_order);
else if ( !p2m_is_valid(orig_pte) ) /* new mapping */
p2m->stats.mappings[level]++;
--
2.43.0
* [XEN PATCH v3 2/3] xen/arm: optimize stage-1,2 combined TLBI in presence of FEAT_nTLBPA
@ 2026-01-18 13:33 ` Haseeb Ashraf
0 siblings, 0 replies; 10+ messages in thread
From: Haseeb Ashraf @ 2026-01-18 13:33 UTC (permalink / raw)
To: xen-devel
Cc: Haseeb Ashraf, Stefano Stabellini, Julien Grall, Bertrand Marquis,
Michal Orzel, Volodymyr Babchuk, Mohamed Mediouni
From: Haseeb Ashraf <haseeb.ashraf@siemens.com>
FEAT_nTLBPA (quoting definition) introduces a mechanism to identify
if the intermediate caching of translation table walks does not
include non-coherent caches of previous valid translation table
entries since the last completed TLBI applicable to the PE.
As there won't be any non-coherent caches since the last completed
TLBI, stage-1 TLBI won't be required while performing stage-2 TLBI.
This feature is optionally available in both arm32 and arm64.
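For reference, a small sketch of how the feature can be detected from
the raw ID register values. It assumes nTLBPA lives in bits [51:48] of
ID_AA64MMFR1_EL1 and bits [7:4] of ID_MMFR5, consistent with the
bitfield layout added by this patch, and that a field value of 0x1
means implemented. The helper names are hypothetical; the patch itself
reads the field through struct cpuinfo_arm.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MMFR1_NTLBPA_SHIFT 48  /* ID_AA64MMFR1_EL1.nTLBPA, bits [51:48] */
#define SKETCH_MMFR5_NTLBPA_SHIFT 4   /* ID_MMFR5.nTLBPA, bits [7:4] */
#define SKETCH_NTLBPA_IMP         0x1 /* field value when implemented */

/* Extract a 4-bit ID register field starting at bit `shift`. */
static unsigned int sketch_id_field(uint64_t reg, unsigned int shift)
{
    return (unsigned int)((reg >> shift) & 0xf);
}

static bool sketch_has_ntlbpa64(uint64_t id_aa64mmfr1)
{
    return sketch_id_field(id_aa64mmfr1, SKETCH_MMFR1_NTLBPA_SHIFT) ==
           SKETCH_NTLBPA_IMP;
}

static bool sketch_has_ntlbpa32(uint32_t id_mmfr5)
{
    return sketch_id_field(id_mmfr5, SKETCH_MMFR5_NTLBPA_SHIFT) ==
           SKETCH_NTLBPA_IMP;
}
```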
Suggested-by: Mohamed Mediouni <mohamed@unpredictable.fr>
Signed-off-by: Haseeb Ashraf <haseeb.ashraf@siemens.com>
Changes in v3:
- This commit has no functional change in v3, only rebasing changes
due to updates in commit-1.
Changes in v2:
- This commit was introduced in v2, split out from commit-1 of v1. It
is implemented using a CPU capability.
---
xen/arch/arm/cpufeature.c | 19 ++++++
xen/arch/arm/include/asm/arm32/flushtlb.h | 14 +++--
xen/arch/arm/include/asm/arm64/flushtlb.h | 77 ++++++++++++++++-------
xen/arch/arm/include/asm/cpufeature.h | 24 ++++++-
xen/arch/arm/include/asm/processor.h | 7 +++
5 files changed, 109 insertions(+), 32 deletions(-)
diff --git a/xen/arch/arm/cpufeature.c b/xen/arch/arm/cpufeature.c
index 1a80738571..9fa1c45869 100644
--- a/xen/arch/arm/cpufeature.c
+++ b/xen/arch/arm/cpufeature.c
@@ -17,7 +17,19 @@ DECLARE_BITMAP(cpu_hwcaps, ARM_NCAPS);
struct cpuinfo_arm __read_mostly domain_cpuinfo;
+#ifdef CONFIG_ARM_32
+static bool has_ntlbpa(const struct arm_cpu_capabilities *entry)
+{
+ return system_cpuinfo.mm32.ntlbpa == MM32_NTLBPA_SUPPORT_IMP;
+}
+#endif
+
#ifdef CONFIG_ARM_64
+static bool has_ntlbpa(const struct arm_cpu_capabilities *entry)
+{
+ return system_cpuinfo.mm64.ntlbpa == MM64_NTLBPA_SUPPORT_IMP;
+}
+
static bool has_sb_instruction(const struct arm_cpu_capabilities *entry)
{
return system_cpuinfo.isa64.sb;
@@ -25,6 +37,13 @@ static bool has_sb_instruction(const struct arm_cpu_capabilities *entry)
#endif
static const struct arm_cpu_capabilities arm_features[] = {
+#if defined(CONFIG_ARM_32) || defined(CONFIG_ARM_64)
+ {
+ .desc = "Intermediate caching of translation table walks (nTLBPA)",
+ .capability = ARM_HAS_NTLBPA,
+ .matches = has_ntlbpa,
+ },
+#endif
#ifdef CONFIG_ARM_64
{
.desc = "Speculation barrier instruction (SB)",
diff --git a/xen/arch/arm/include/asm/arm32/flushtlb.h b/xen/arch/arm/include/asm/arm32/flushtlb.h
index 3c0c2123d4..7cff042508 100644
--- a/xen/arch/arm/include/asm/arm32/flushtlb.h
+++ b/xen/arch/arm/include/asm/arm32/flushtlb.h
@@ -49,8 +49,8 @@ TLB_HELPER(flush_xen_tlb_local, TLBIALLH, nsh)
* Flush TLB of local processor. Use when flush for only stage-1 is intended.
*
* The following function should be used where intention is to clear only
- * stage-1 TLBs. This would be helpful in future in identifying which stage-1
- * TLB flushes can be skipped such as in present of FEAT_nTLBPA.
+ * stage-1 TLBs. This would be helpful in identifying which stage-1 TLB flushes
+ * can be skipped such as in present of FEAT_nTLBPA.
*/
static inline void flush_guest_tlb_s1_local(void)
{
@@ -60,7 +60,8 @@ static inline void flush_guest_tlb_s1_local(void)
*
* See ARMv8 (DDI 0487L.b): G5-11698 Table G5-23.
*/
- return flush_guest_tlb_local();
+ if ( !cpus_have_const_cap(ARM_HAS_NTLBPA) )
+ flush_guest_tlb_local();
}
/*
@@ -68,8 +69,8 @@ static inline void flush_guest_tlb_s1_local(void)
* stage-1 is intended.
*
* The following function should be used where intention is to clear only
- * stage-1 TLBs. This would be helpful in future in identifying which stage-1
- * TLB flushes can be skipped such as in present of FEAT_nTLBPA.
+ * stage-1 TLBs. This would be helpful in identifying which stage-1 TLB flushes
+ * can be skipped such as in present of FEAT_nTLBPA.
*/
static inline void flush_guest_tlb_s1(void)
{
@@ -79,7 +80,8 @@ static inline void flush_guest_tlb_s1(void)
*
* See ARMv8 (DDI 0487L.b): G5-11698 Table G5-23.
*/
- return flush_guest_tlb();
+ if ( !cpus_have_const_cap(ARM_HAS_NTLBPA) )
+ flush_guest_tlb();
}
/* Flush TLB of local processor for address va. */
diff --git a/xen/arch/arm/include/asm/arm64/flushtlb.h b/xen/arch/arm/include/asm/arm64/flushtlb.h
index 67ae616993..0f0d5050e5 100644
--- a/xen/arch/arm/include/asm/arm64/flushtlb.h
+++ b/xen/arch/arm/include/asm/arm64/flushtlb.h
@@ -47,6 +47,24 @@ static inline void name(void) \
: : : "memory"); \
}
+#define TLB_HELPER_NTLBPA(name, tlbop, sh) \
+static inline void name(void) \
+{ \
+ if ( !cpus_have_const_cap(ARM_HAS_NTLBPA) ) \
+ asm_inline volatile ( \
+ "dsb " # sh "st;" \
+ "tlbi " # tlbop ";" \
+ ALTERNATIVE( \
+ "nop; nop;", \
+ "dsb ish;" \
+ "tlbi " # tlbop ";", \
+ ARM64_WORKAROUND_REPEAT_TLBI, \
+ CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \
+ "dsb " # sh ";" \
+ "isb;" \
+ : : : "memory"); \
+}
+
/*
* FLush TLB by VA. This will likely be used in a loop, so the caller
* is responsible to use the appropriate memory barriers before/after
@@ -75,10 +93,10 @@ TLB_HELPER(flush_guest_tlb_local, vmalls12e1, nsh)
TLB_HELPER(flush_guest_tlb, vmalls12e1is, ish)
/* Flush local TLBs, current VMID, stage-1 only */
-TLB_HELPER(flush_guest_tlb_s1_local, vmalle1, nsh)
+TLB_HELPER_NTLBPA(flush_guest_tlb_s1_local, vmalle1, nsh)
/* Flush innershareable TLBs, current VMID, stage-1 only */
-TLB_HELPER(flush_guest_tlb_s1, vmalle1is, ish)
+TLB_HELPER_NTLBPA(flush_guest_tlb_s1, vmalle1is, ish)
/* Flush local TLBs, all VMIDs, non-hypervisor mode */
TLB_HELPER(flush_all_guests_tlb_local, alle1, nsh)
@@ -104,8 +122,6 @@ TLB_HELPER_VA(__flush_xen_tlb_one, vae2is)
*/
static inline void flush_guest_tlb_range_ipa(paddr_t ipa, unsigned long size)
{
- paddr_t end;
-
/*
* If IPA range is too big (empirically found to be 256M), then fallback to
* full TLB flush.
@@ -113,27 +129,42 @@ static inline void flush_guest_tlb_range_ipa(paddr_t ipa, unsigned long size)
if ( size > SZ_256M )
return flush_guest_tlb();
- end = ipa + size;
-
- /*
- * See ARM ARM DDI 0487L.b D8.17.6.1 (Invalidating TLB entries from stage 2
- * translations) for details of TLBI sequence.
- */
- dsb(ishst); /* Ensure prior page-tables updates have completed */
- while ( ipa < end )
+ else if ( size > 0 )
{
- /* Flush stage-2 TLBs for ipa address */
- asm_inline volatile (
- "tlbi ipas2e1is, %0;" : : "r" (ipa >> PAGE_SHIFT) : "memory" );
- ipa += PAGE_SIZE;
+ paddr_t end = ipa + size;
+
+ /*
+ * See ARM ARM DDI 0487L.b D8.17.6.1 (Invalidating TLB entries from
+ * stage 2 translations) for details on TLBI sequence.
+ */
+ dsb(ishst); /* Ensure prior page-tables updates have completed */
+ while ( ipa < end )
+ {
+ /* Flush stage-2 TLBs for ipa address */
+ asm_inline volatile (
+ "tlbi ipas2e1is, %0;" : : "r" (ipa >> PAGE_SHIFT) : "memory" );
+ ipa += PAGE_SIZE;
+ }
+ if ( cpus_have_const_cap(ARM_HAS_NTLBPA) )
+ asm_inline volatile (
+ ALTERNATIVE(
+ "nop; nop;",
+ "dsb ish;"
+ "tlbi ipas2e1is, %0;",
+ ARM64_WORKAROUND_REPEAT_TLBI,
+ CONFIG_ARM64_WORKAROUND_REPEAT_TLBI)
+ "dsb ish;"
+ "isb;"
+ : : "r" ((ipa - PAGE_SIZE) >> PAGE_SHIFT) : "memory" );
+ else
+ /*
+ * As ARM64_WORKAROUND_REPEAT_TLBI is required to be applied to
+ * last TLBI of the sequence, it is only needed to be handled in
+ * the following invocation. Final dsb() and isb() are also applied
+ * in the following invocation.
+ */
+ flush_guest_tlb_s1();
}
- /*
- * As ARM64_WORKAROUND_REPEAT_TLBI is required to be applied to last TLBI
- * of the sequence, it is only needed to be handled in the following
- * invocation. Final dsb() and isb() are also applied in the following
- * invocation.
- */
- flush_guest_tlb_s1();
}
#endif /* __ASM_ARM_ARM64_FLUSHTLB_H__ */
diff --git a/xen/arch/arm/include/asm/cpufeature.h b/xen/arch/arm/include/asm/cpufeature.h
index 13353c8e1a..9f796ed4c1 100644
--- a/xen/arch/arm/include/asm/cpufeature.h
+++ b/xen/arch/arm/include/asm/cpufeature.h
@@ -76,8 +76,9 @@
#define ARM_WORKAROUND_BHB_SMCC_3 15
#define ARM_HAS_SB 16
#define ARM64_WORKAROUND_1508412 17
+#define ARM_HAS_NTLBPA 18
-#define ARM_NCAPS 18
+#define ARM_NCAPS 19
#ifndef __ASSEMBLER__
@@ -269,7 +270,8 @@ struct cpuinfo_arm {
unsigned long ets:4;
unsigned long __res1:4;
unsigned long afp:4;
- unsigned long __res2:12;
+ unsigned long ntlbpa:4;
+ unsigned long __res2:8;
unsigned long ecbhb:4;
/* MMFR2 */
@@ -430,8 +432,24 @@ struct cpuinfo_arm {
register_t bits[1];
} aux32;
- struct {
+ union {
register_t bits[6];
+ struct {
+ /* MMFR0 */
+ unsigned long __res0:32;
+ /* MMFR1 */
+ unsigned long __res1:32;
+ /* MMFR2 */
+ unsigned long __res2:32;
+ /* MMFR3 */
+ unsigned long __res3:32;
+ /* MMFR4 */
+ unsigned long __res4:32;
+ /* MMFR5 */
+ unsigned long __res5:4;
+ unsigned long ntlbpa:4;
+ unsigned long __res6:24;
+ };
} mm32;
struct {
diff --git a/xen/arch/arm/include/asm/processor.h b/xen/arch/arm/include/asm/processor.h
index 1a48c9ff3b..85f3b643a0 100644
--- a/xen/arch/arm/include/asm/processor.h
+++ b/xen/arch/arm/include/asm/processor.h
@@ -459,9 +459,16 @@
/* FSR long format */
#define FSRL_STATUS_DEBUG (_AC(0x22,UL)<<0)
+#ifdef CONFIG_ARM_32
+#define MM32_NTLBPA_SUPPORT_NI 0x0
+#define MM32_NTLBPA_SUPPORT_IMP 0x1
+#endif
+
#ifdef CONFIG_ARM_64
#define MM64_VMID_8_BITS_SUPPORT 0x0
#define MM64_VMID_16_BITS_SUPPORT 0x2
+#define MM64_NTLBPA_SUPPORT_NI 0x0
+#define MM64_NTLBPA_SUPPORT_IMP 0x1
#endif
#ifndef __ASSEMBLER__
--
2.43.0
* [XEN PATCH v3 3/3] xen/arm32: add CPU capability for IPA-based TLBI
@ 2026-01-18 13:33 ` Haseeb Ashraf
0 siblings, 0 replies; 10+ messages in thread
From: Haseeb Ashraf @ 2026-01-18 13:33 UTC (permalink / raw)
To: xen-devel
Cc: Haseeb Ashraf, Stefano Stabellini, Julien Grall, Bertrand Marquis,
Michal Orzel, Volodymyr Babchuk
From: Haseeb Ashraf <haseeb.ashraf@siemens.com>
IPA-based TLBI has been available since Armv8 and can be used on
arm32. XENMEM_remove_from_physmap performs a TLB invalidation in each
hypercall, so in the presence of nTLBPA this code path is optimized by
invalidating by IPA instead of performing a TLBIALL each time.
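The selection logic can be sketched as below. This is an illustration
under stated assumptions rather than the patch's code: the UniTLB field
is taken to be bits [19:16] of ID_MMFR2 with 0x6 indicating the by-IPA
operations (matching MM32_UNITLB_BY_IPA), and the fall back to a full
flush when either capability is missing is this sketch's reading of the
intended behaviour. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_UNITLB_SHIFT  16   /* ID_MMFR2.UniTLB, bits [19:16] */
#define SKETCH_UNITLB_BY_IPA 0x6  /* stage-2 by-IPA TLBIs available */
#define SKETCH_SZ_256M       (256ULL * 1024 * 1024)

enum sketch_flush { SKETCH_FLUSH_FULL, SKETCH_FLUSH_BY_IPA };

/* Mirror of the capability check: compare the UniTLB field to 0x6. */
static bool sketch_has_tlb_ipa(uint32_t id_mmfr2)
{
    return ((id_mmfr2 >> SKETCH_UNITLB_SHIFT) & 0xf) == SKETCH_UNITLB_BY_IPA;
}

/*
 * Choose the invalidation strategy for a range of `size` bytes. By-IPA
 * flushing is used only with both capabilities and a range within the
 * threshold; everything else is assumed to use the full flush.
 */
static enum sketch_flush sketch_arm32_choose(bool has_ntlbpa, bool has_tlb_ipa,
                                             uint64_t size)
{
    if ( has_ntlbpa && has_tlb_ipa && size <= SKETCH_SZ_256M )
        return SKETCH_FLUSH_BY_IPA;

    return SKETCH_FLUSH_FULL;
}
```

Without nTLBPA a stage-1 invalidation would still be needed, and arm32
has no separate stage-1 TLBI, so the by-IPA path brings no benefit in
that case.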
Suggested-by: Julien Grall <julien@xen.org>
Signed-off-by: Haseeb Ashraf <haseeb.ashraf@siemens.com>
Changes in v3:
- There are no functional changes in this version. There are minor
code updates and comment updates as per the feedback on v2.
- The cpregs are defined in order as per Coprocessor-> CRn-> Opcode 1
-> CRm-> Opcode 2.
- Added comment to explain why IPA-based TLBI is added only in
presence of FEAT_nTLBPA.
- Replaced `goto default_tlbi` with if...else.
- Removed extra definitions of MM32_UNITLB_* macros which were not
being used.
Changes in v2:
- This commit is implemented in v2 as per the feedback to implement
IPA-based TLBI for Arm32 in addition to Arm64.
---
xen/arch/arm/cpufeature.c | 12 +++++++
xen/arch/arm/include/asm/arm32/flushtlb.h | 42 ++++++++++++++++++++---
xen/arch/arm/include/asm/cpregs.h | 4 +++
xen/arch/arm/include/asm/cpufeature.h | 15 ++++----
xen/arch/arm/include/asm/processor.h | 3 ++
5 files changed, 65 insertions(+), 11 deletions(-)
diff --git a/xen/arch/arm/cpufeature.c b/xen/arch/arm/cpufeature.c
index 9fa1c45869..d18c6449c6 100644
--- a/xen/arch/arm/cpufeature.c
+++ b/xen/arch/arm/cpufeature.c
@@ -18,6 +18,11 @@ DECLARE_BITMAP(cpu_hwcaps, ARM_NCAPS);
struct cpuinfo_arm __read_mostly domain_cpuinfo;
#ifdef CONFIG_ARM_32
+static bool has_tlb_ipa_instruction(const struct arm_cpu_capabilities *entry)
+{
+ return system_cpuinfo.mm32.unitlb == MM32_UNITLB_BY_IPA;
+}
+
static bool has_ntlbpa(const struct arm_cpu_capabilities *entry)
{
return system_cpuinfo.mm32.ntlbpa == MM32_NTLBPA_SUPPORT_IMP;
@@ -37,6 +42,13 @@ static bool has_sb_instruction(const struct arm_cpu_capabilities *entry)
#endif
static const struct arm_cpu_capabilities arm_features[] = {
+#ifdef CONFIG_ARM_32
+ {
+ .desc = "IPA-based TLB Invalidation",
+ .capability = ARM32_HAS_TLB_IPA,
+ .matches = has_tlb_ipa_instruction,
+ },
+#endif
#if defined(CONFIG_ARM_32) || defined(CONFIG_ARM_64)
{
.desc = "Intermediate caching of translation table walks (nTLBPA)",
diff --git a/xen/arch/arm/include/asm/arm32/flushtlb.h b/xen/arch/arm/include/asm/arm32/flushtlb.h
index 7cff042508..3e6f86f6d2 100644
--- a/xen/arch/arm/include/asm/arm32/flushtlb.h
+++ b/xen/arch/arm/include/asm/arm32/flushtlb.h
@@ -1,6 +1,8 @@
#ifndef __ASM_ARM_ARM32_FLUSHTLB_H__
#define __ASM_ARM_ARM32_FLUSHTLB_H__
+#include <xen/sizes.h> /* For SZ_* macros. */
+
/*
* Every invalidation operation use the following patterns:
*
@@ -104,12 +106,42 @@ static inline void flush_guest_tlb_range_ipa(paddr_t ipa,
unsigned long size)
{
/*
- * Following can invalidate both stage-1 and stage-2 TLBs depending upon
- * the execution mode.
- *
- * See ARMv8 (DDI 0487L.b): G5-11698 Table G5-23.
+ * IPA-based TLBI is used only in the presence of nTLBPA. Without nTLBPA,
+ * stage-1 invalidation would still be required, and there is no separate
+ * TLBI for stage-1 on Arm32, so flushing by IPA would be pointless in
+ * that case.
*/
- flush_guest_tlb();
+ if ( cpus_have_const_cap(ARM_HAS_NTLBPA) &&
+ cpus_have_const_cap(ARM32_HAS_TLB_IPA) )
+ {
+ /*
+ * If the IPA range is too big (threshold empirically found to be
+ * 256M), fall back to a full TLB flush.
+ */
+ if ( size > SZ_256M )
+ /*
+ * The following can invalidate both stage-1 and stage-2 TLBs,
+ * depending upon the execution mode.
+ *
+ * See ARMv8 (DDI 0487L.b): G5-11698 Table G5-23.
+ */
+ flush_guest_tlb();
+ else
+ {
+ paddr_t end = ipa + size;
+
+ dsb(ishst); /* Ensure prior page-table updates have completed */
+ while ( ipa < end )
+ {
+ /* Flush stage-2 TLB entries for this IPA. */
+ asm volatile(STORE_CP32(0, TLBIIPAS2IS)
+ : : "r" (ipa >> PAGE_SHIFT) : "memory");
+ ipa += PAGE_SIZE;
+ }
+ dsb(ish);
+ isb();
+ }
+ }
+ else
+ flush_guest_tlb();
}
#endif /* __ASM_ARM_ARM32_FLUSHTLB_H__ */
diff --git a/xen/arch/arm/include/asm/cpregs.h b/xen/arch/arm/include/asm/cpregs.h
index a7503a190f..51f091dace 100644
--- a/xen/arch/arm/include/asm/cpregs.h
+++ b/xen/arch/arm/include/asm/cpregs.h
@@ -223,9 +223,13 @@
#define TLBIMVA p15,0,c8,c7,1 /* invalidate unified TLB entry by MVA */
#define TLBIASID p15,0,c8,c7,2 /* invalid unified TLB by ASID match */
#define TLBIMVAA p15,0,c8,c7,3 /* invalidate unified TLB entries by MVA all ASID */
+#define TLBIIPAS2IS p15,4,c8,c0,1 /* Invalidate unified TLB entry for stage 2 by IPA inner shareable */
+#define TLBIIPAS2LIS p15,4,c8,c0,5 /* Invalidate unified TLB entry for stage 2 last level by IPA inner shareable */
#define TLBIALLHIS p15,4,c8,c3,0 /* Invalidate Entire Hyp. Unified TLB inner shareable */
#define TLBIMVAHIS p15,4,c8,c3,1 /* Invalidate Unified Hyp. TLB by MVA inner shareable */
#define TLBIALLNSNHIS p15,4,c8,c3,4 /* Invalidate Entire Non-Secure Non-Hyp. Unified TLB inner shareable */
+#define TLBIIPAS2 p15,4,c8,c4,1 /* Invalidate unified TLB entry for stage 2 by IPA */
+#define TLBIIPAS2L p15,4,c8,c4,5 /* Invalidate unified TLB entry for stage 2 last level by IPA */
#define TLBIALLH p15,4,c8,c7,0 /* Invalidate Entire Hyp. Unified TLB */
#define TLBIMVAH p15,4,c8,c7,1 /* Invalidate Unified Hyp. TLB by MVA */
#define TLBIALLNSNH p15,4,c8,c7,4 /* Invalidate Entire Non-Secure Non-Hyp. Unified TLB */
diff --git a/xen/arch/arm/include/asm/cpufeature.h b/xen/arch/arm/include/asm/cpufeature.h
index 9f796ed4c1..07f1d770b3 100644
--- a/xen/arch/arm/include/asm/cpufeature.h
+++ b/xen/arch/arm/include/asm/cpufeature.h
@@ -77,8 +77,9 @@
#define ARM_HAS_SB 16
#define ARM64_WORKAROUND_1508412 17
#define ARM_HAS_NTLBPA 18
+#define ARM32_HAS_TLB_IPA 19
-#define ARM_NCAPS 19
+#define ARM_NCAPS 20
#ifndef __ASSEMBLER__
@@ -440,15 +441,17 @@ struct cpuinfo_arm {
/* MMFR1 */
unsigned long __res1:32;
/* MMFR2 */
- unsigned long __res2:32;
+ unsigned long __res2:16;
+ unsigned long unitlb:4;
+ unsigned long __res3:12;
/* MMFR3 */
- unsigned long __res3:32;
- /* MMFR4 */
unsigned long __res4:32;
+ /* MMFR4 */
+ unsigned long __res5:32;
/* MMFR5 */
- unsigned long __res5:4;
+ unsigned long __res6:4;
unsigned long ntlbpa:4;
- unsigned long __res6:24;
+ unsigned long __res7:24;
};
} mm32;
diff --git a/xen/arch/arm/include/asm/processor.h b/xen/arch/arm/include/asm/processor.h
index 85f3b643a0..eda39566e1 100644
--- a/xen/arch/arm/include/asm/processor.h
+++ b/xen/arch/arm/include/asm/processor.h
@@ -460,6 +460,9 @@
#define FSRL_STATUS_DEBUG (_AC(0x22,UL)<<0)
#ifdef CONFIG_ARM_32
+#define MM32_UNITLB_NI 0x0
+#define MM32_UNITLB_BY_IPA 0x6
+
#define MM32_NTLBPA_SUPPORT_NI 0x0
#define MM32_NTLBPA_SUPPORT_IMP 0x1
#endif
--
2.43.0
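The decision made in flush_guest_tlb_range_ipa() can be modelled in plain C as below. This is a hedged sketch rather than the Xen code: `plan_flush`, `struct flush_plan` and `enum flush_kind` are hypothetical names, the TLBI and barrier instructions are replaced by a counter, a 4K page size is assumed, and a full flush is assumed to be the intended fallback whenever either capability is missing. Like the loop in the patch, it assumes a page-aligned `ipa`.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12                      /* assumption: 4K pages */
#define PAGE_SIZE  (UINT64_C(1) << PAGE_SHIFT)
#define SZ_256M    UINT64_C(0x10000000)

enum flush_kind { FLUSH_FULL, FLUSH_BY_IPA };

struct flush_plan {
    enum flush_kind kind;
    uint64_t nr_pages; /* per-page TLBIIPAS2IS operations; 0 for a full flush */
};

/* Hypothetical model of the range-flush decision; issues no instructions. */
static struct flush_plan plan_flush(uint64_t ipa, uint64_t size,
                                    bool has_ntlbpa, bool has_tlb_ipa)
{
    struct flush_plan plan = { FLUSH_FULL, 0 };
    uint64_t end = ipa + size;

    /* Assumed fallback: full flush without both caps or for large ranges. */
    if ( !has_ntlbpa || !has_tlb_ipa || size > SZ_256M )
        return plan;

    plan.kind = FLUSH_BY_IPA;
    /* One stage-2 invalidation per page, as in the loop in the patch. */
    for ( ; ipa < end; ipa += PAGE_SIZE )
        plan.nr_pages++;

    return plan;
}
```

Under this model, a 16K range at IPA 0x40000000 with both capabilities present maps to four per-page invalidations, while any range larger than 256M maps to a single full flush.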