[PATCH v2 00/13] KVM Dirty-bit cleaning hw accelerator (HACDBS)

Kernel KVM virtualization development
 help / color / mirror / Atom feed

* [PATCH v2 00/13] KVM Dirty-bit cleaning hw accelerator (HACDBS)
@ 2026-06-29 11:17 Leonardo Bras
  2026-06-29 11:17 ` [PATCH v2 01/13] KVM: arm64: HDBSS bits Leonardo Bras
                   ` (12 more replies)
  0 siblings, 13 replies; 47+ messages in thread
From: Leonardo Bras @ 2026-06-29 11:17 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Marc Zyngier, Oliver Upton,
	Joey Gouly, Steffen Eiden, Suzuki K Poulose, Zenghui Yu,
	Rafael J. Wysocki, Len Brown, Saket Dumbre, Paolo Bonzini,
	Jonathan Cameron, Chengwen Feng, Leonardo Bras, Kees Cook,
	Mikołaj Lenczewski, James Morse, Zeng Heng, mrigendrachaubey,
	Thomas Huth, Ryan Roberts, Yeoreum Yun, Mark Brown, Kevin Brodsky,
	James Clark, Fuad Tabba, Raghavendra Rao Ananta,
	Lorenzo Pieralisi, Sascha Bischoff, Anshuman Khandual, Tian Zheng
  Cc: linux-arm-kernel, linux-kernel, kvmarm, linux-acpi, acpica-devel,
	kvm


Disclaimer: While this patchset is buildable and testable on it's own,
it is not ready to be merged, as it depends on bits from another
patchset that will superseed the one in [0], and will enable HDBSS.
See note below on patches 1,2.

My expectation on sharing this earlier is to gather feedback on
implementation and testing methods, to make sure it's sure it's ready
when [0] becomes ready.

===

Create an arch-generic dirty-bit cleaning acceleration interface, which
compiles-out if the arch does not implement it, and creates no new API.
Using that, implement an arm64 accelerator based on HACDBS.

This implementation is able to accelerate the cleaning on both
dirty-bitmap and dirty-ring tracking mechanisms on KVM.

Patch 1 & 2 are here just to make this testable, as this patchset
depends on bits from HDBSS that are not upstream yet.

Patch 2 should be included in the HDBSS patchset, and patch 1
is a bunch of bits that I collected across other patches so this can
work. So few free to ignore them on review, as they should be reviewed
in the HDBSS patchset.

To be able to properly use HACDBS, it requires a PPI IRQ that triggers
either on error, or when processing is complete. It's called
HACDBSIRQ, and there is currently no upstream way of announcing it on
ACPI tables, so this patchset uses the suggested table/index in [1],
which at the moment is not merged.

Kernel v7.2-rc1 + this patchset builds properly, passing both kvm selftests
for dirty-bit tracking[2] and a qemu live migration test, with both
HW HACDBS enabled or disabled.

On terms of performance improvement, tests were done using
dirty_log_perf_test[3] to measure the time spent on the following ioctl:
a - KVM_GET_DIRTY_LOG,     using dirty-bitmap without manual protect
    command:    ./dirty_log_perf_test -m 3 -m 6 -m 12 -g
b - KVM_CLEAR_DIRTY_LOG,   using dirty-bitmap with manual protect
    command:    ./dirty_log_perf_test -m 3 -m 6 -m 12
c - KVM_RESET_DIRTY_RINGS, using dirty-ring, using 4096 entries/vcpu.
    command:    ./dirty_log_perf_test -m 3 -m 6 -m 12 -d 4096

Tests ran in the model show that runtime was reduced by:
-(a) 82.19% (0.16% stdev) for 1GB memory, and
     82.45% (0.02% stdev) for 3GB memory
-(b) 81.74% (0.19% stdev) for 1GB memory, and
     82.40% (0.38% stdev) for 3GB memory
-(b) 70.90% (0.18% stdev) for 1GB memory, and
     70.92% (0.01% stdev) for 3GB memory

Above numbers already take into account the improvements in S2
hugepage-splitting that is implemented by [4].

Please let me know of any question :)

Thanks for reviewing!
Leo


Changes since v1:
- Improvements on splitting with manual protect, skipping when cleaning
  pages from the same level-2 entry. (new patch)
- Got the correct concept of chunk_size and thus:
  - Corrected it to a reasonable chunk to do page splitting before
    rescheduling, considering new improvements from [4].
  - TTWL is not based in chunk size, so fix it in LAST_LEVEL
- Minor fixes, removing debugging traces
v1 Link: https://lore.kernel.org/all/20260430111424.3479613-2-leo.bras@arm.com/


[0]: https://lore.kernel.org/all/20260225040421.2683931-1-zhengtian10@huawei.com/
[1]: https://github.com/tianocore/edk2/issues/12409
[2]: dirty_log_test && dirty_log_perf_test
[3]: using this patchset to enable dirty-ring on dirty_log_perf_test:
     https://lore.kernel.org/all/20260629105950.1790259-1-leo.bras@arm.com/
[4]: https://lore.kernel.org/all/20260618131447.764085-1-leo.bras@arm.com/

Leonardo Bras (13):
  KVM: arm64: HDBSS bits
  KVM: arm64: Enable eager hugepage splitting if HDBSS is available
  arm64/cpufeature: Add system-wide FEAT_HACDBS detection
  arm64/sysreg: Add HACDBS consumer and base registers
  KVM: arm64: Detect (via ACPI) and initialize HACDBSIRQ
  KVM: arm64: dirty_bit: Add base FEAT_HACDBS cleaning routine
  kvm: Add arch-generic interface for hw-accelerated dirty-bitmap
    cleaning
  KVM: arm64: Add hardware-accelerated dirty-bitmap cleaning routine
  KVM: arm64: Dirty-bitmap: avoid splitting previously split blocks
  kvm/dirty_ring: Introduce get_memslot and move helpers to header
  kvm/dirty_ring: Add arch-generic interface for hw-accelerated
    dirty-ring cleaning
  KVM: arm64: Add hardware-accelerated dirty-ring cleaning routine
  KVM: arm64: Enable KVM_HW_DIRTY_BIT

 arch/arm64/include/asm/acpi.h          |   3 +
 arch/arm64/include/asm/cpufeature.h    |  10 +
 arch/arm64/include/asm/kvm_dirty_bit.h |  67 ++++
 arch/arm64/include/asm/kvm_pgtable.h   |   3 +
 include/acpi/actbl2.h                  |   1 +
 include/linux/kvm_dirty_bit.h          |  34 ++
 include/linux/kvm_dirty_ring.h         |  12 +
 include/linux/kvm_host.h               |   3 +
 arch/arm64/kernel/cpufeature.c         |  20 ++
 arch/arm64/kvm/arm.c                   |   5 +
 arch/arm64/kvm/dirty_bit.c             | 411 +++++++++++++++++++++++++
 arch/arm64/kvm/hyp/pgtable.c           |  15 +-
 arch/arm64/kvm/mmu.c                   |  12 +-
 virt/kvm/dirty_ring.c                  |  34 +-
 virt/kvm/kvm_main.c                    |  13 +-
 arch/arm64/kvm/Kconfig                 |   1 +
 arch/arm64/kvm/Makefile                |   2 +-
 arch/arm64/tools/cpucaps               |   2 +
 arch/arm64/tools/sysreg                |  30 ++
 virt/kvm/Kconfig                       |   3 +
 20 files changed, 659 insertions(+), 22 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_dirty_bit.h
 create mode 100644 include/linux/kvm_dirty_bit.h
 create mode 100644 arch/arm64/kvm/dirty_bit.c


base-commit: dc59e4fea9d83f03bad6bddf3fa2e52491777482
-- 
2.54.0


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 01/13] KVM: arm64: HDBSS bits
  2026-06-29 11:17 [PATCH v2 00/13] KVM Dirty-bit cleaning hw accelerator (HACDBS) Leonardo Bras
@ 2026-06-29 11:17 ` Leonardo Bras
  2026-06-29 11:34   ` sashiko-bot
  2026-06-29 11:17 ` [PATCH v2 02/13] KVM: arm64: Enable eager hugepage splitting if HDBSS is available Leonardo Bras
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 47+ messages in thread
From: Leonardo Bras @ 2026-06-29 11:17 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Marc Zyngier, Oliver Upton,
	Joey Gouly, Steffen Eiden, Suzuki K Poulose, Zenghui Yu,
	Rafael J. Wysocki, Len Brown, Saket Dumbre, Paolo Bonzini,
	Jonathan Cameron, Chengwen Feng, Leonardo Bras, Kees Cook,
	Mikołaj Lenczewski, James Morse, Zeng Heng, mrigendrachaubey,
	Thomas Huth, Ryan Roberts, Yeoreum Yun, Mark Brown, Kevin Brodsky,
	James Clark, Fuad Tabba, Raghavendra Rao Ananta,
	Lorenzo Pieralisi, Sascha Bischoff, Anshuman Khandual, Tian Zheng
  Cc: linux-arm-kernel, linux-kernel, kvmarm, linux-acpi, acpica-devel,
	kvm

All those bits should come from a future version of HDBSS patchset:
https://lore.kernel.org/lkml/20260225040421.2683931-1-zhengtian10@huawei.com

I added them here in order to fulfill the dependencies and be able to
easily build this patchset, but this particular patch should not be merged
upstream.

Signed-off-by: Leonardo Bras <leo.bras@arm.com>
---
 arch/arm64/include/asm/cpufeature.h    |  5 +++++
 arch/arm64/include/asm/kvm_dirty_bit.h | 12 ++++++++++++
 arch/arm64/include/asm/kvm_pgtable.h   |  3 +++
 arch/arm64/kernel/cpufeature.c         | 12 ++++++++++++
 arch/arm64/kvm/dirty_bit.c             | 16 ++++++++++++++++
 arch/arm64/kvm/hyp/pgtable.c           | 15 +++++++++++++--
 arch/arm64/kvm/Makefile                |  2 +-
 arch/arm64/tools/cpucaps               |  1 +
 8 files changed, 63 insertions(+), 3 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_dirty_bit.h
 create mode 100644 arch/arm64/kvm/dirty_bit.c

diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index a57870fa96db..bdfab086fd94 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -856,20 +856,25 @@ static inline bool system_supports_poe(void)
 static inline bool system_supports_gcs(void)
 {
 	return alternative_has_cap_unlikely(ARM64_HAS_GCS);
 }
 
 static inline bool system_supports_haft(void)
 {
 	return cpus_have_final_cap(ARM64_HAFT);
 }
 
+static inline bool system_supports_hdbss(void)
+{
+	return cpus_have_final_cap(ARM64_HAS_HDBSS);
+}
+
 static __always_inline bool system_supports_mpam(void)
 {
 	return alternative_has_cap_unlikely(ARM64_MPAM);
 }
 
 static __always_inline bool system_supports_mpam_hcr(void)
 {
 	return alternative_has_cap_unlikely(ARM64_MPAM_HCR);
 }
 
diff --git a/arch/arm64/include/asm/kvm_dirty_bit.h b/arch/arm64/include/asm/kvm_dirty_bit.h
new file mode 100644
index 000000000000..dd16438f0651
--- /dev/null
+++ b/arch/arm64/include/asm/kvm_dirty_bit.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2026 ARM Ltd.
+ * Author: Leonardo Bras <leo.bras@arm.com>
+ */
+
+#ifndef __ARM64_KVM_DIRTY_BIT_H__
+#define __ARM64_KVM_DIRTY_BIT_H__
+
+#include <asm/kvm_pgtable.h>
+
+#endif /* __ARM64_KVM_DIRTY_BIT_H__ */
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 41a8687938eb..646ff88e0258 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -86,20 +86,22 @@ typedef u64 kvm_pte_t;
 #define KVM_PTE_LEAF_ATTR_HI		GENMASK(63, 50)
 
 #define KVM_PTE_LEAF_ATTR_HI_SW		GENMASK(58, 55)
 
 #define KVM_PTE_LEAF_ATTR_HI_S1_XN	BIT(54)
 #define KVM_PTE_LEAF_ATTR_HI_S1_UXN	BIT(54)
 #define KVM_PTE_LEAF_ATTR_HI_S1_PXN	BIT(53)
 
 #define KVM_PTE_LEAF_ATTR_HI_S2_XN	GENMASK(54, 53)
 
+#define KVM_PTE_LEAF_ATTR_HI_S2_DBM	BIT(51)
+
 #define KVM_PTE_LEAF_ATTR_HI_S1_GP	BIT(50)
 
 #define KVM_PTE_LEAF_ATTR_S2_PERMS	(KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
 					 KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
 					 KVM_PTE_LEAF_ATTR_HI_S2_XN)
 
 /* pKVM invalid pte encodings */
 #define KVM_INVALID_PTE_TYPE_MASK	GENMASK(63, 60)
 #define KVM_INVALID_PTE_ANNOT_MASK	~(KVM_PTE_VALID | \
 					  KVM_INVALID_PTE_TYPE_MASK)
@@ -246,20 +248,21 @@ struct kvm_pgtable_mm_ops {
 };
 
 /**
  * enum kvm_pgtable_stage2_flags - Stage-2 page-table flags.
  * @KVM_PGTABLE_S2_IDMAP:	Only use identity mappings.
  * @KVM_PGTABLE_S2_AS_S1:	Final memory attributes are that of Stage-1.
  */
 enum kvm_pgtable_stage2_flags {
 	KVM_PGTABLE_S2_IDMAP			= BIT(0),
 	KVM_PGTABLE_S2_AS_S1			= BIT(1),
+	KVM_PGTABLE_S2_DBM			= BIT(2),
 };
 
 /**
  * enum kvm_pgtable_prot - Page-table permissions and attributes.
  * @KVM_PGTABLE_PROT_UX:	Unprivileged execute permission.
  * @KVM_PGTABLE_PROT_PX:	Privileged execute permission.
  * @KVM_PGTABLE_PROT_X:		Privileged and unprivileged execute permission.
  * @KVM_PGTABLE_PROT_W:		Write permission.
  * @KVM_PGTABLE_PROT_R:		Read permission.
  * @KVM_PGTABLE_PROT_DEVICE:	Device attributes.
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 9a22df0c5120..aa327eebaf1c 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -2124,20 +2124,25 @@ static bool has_nested_virt_support(const struct arm64_cpu_capabilities *cap,
 
 	return true;
 }
 
 static bool hvhe_possible(const struct arm64_cpu_capabilities *entry,
 			  int __unused)
 {
 	return arm64_test_sw_feature_override(ARM64_SW_FEATURE_OVERRIDE_HVHE);
 }
 
+static bool has_vhe_hdbss(const struct arm64_cpu_capabilities *entry, int cope)
+{
+	return is_kernel_in_hyp_mode() && has_cpuid_feature(entry, cope);
+}
+
 bool cpu_supports_bbml2_noabort(void)
 {
 	/*
 	 * We want to allow usage of BBML2 in as wide a range of kernel contexts
 	 * as possible. This list is therefore an allow-list of known-good
 	 * implementations that both support BBML2 and additionally, fulfill the
 	 * extra constraint of never generating TLB conflict aborts when using
 	 * the relaxed BBML2 semantics (such aborts make use of BBML2 in certain
 	 * kernel contexts difficult to prove safe against recursive aborts).
 	 *
@@ -2774,20 +2779,27 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
 		 * cannot be emulated in software (no access fault will occur).
 		 * Therefore this should be used only if it's supported system
 		 * wide.
 		 */
 		.type = ARM64_CPUCAP_SYSTEM_FEATURE,
 		.capability = ARM64_HAFT,
 		.matches = has_cpuid_feature,
 		ARM64_CPUID_FIELDS(ID_AA64MMFR1_EL1, HAFDBS, HAFT)
 	},
 #endif
+	{
+		.desc = "Hardware Dirty state tracking structure (HDBSS)",
+		.type = ARM64_CPUCAP_SYSTEM_FEATURE,
+		.capability = ARM64_HAS_HDBSS,
+		.matches = has_vhe_hdbss,
+		ARM64_CPUID_FIELDS(ID_AA64MMFR1_EL1, HAFDBS, HDBSS)
+	},
 	{
 		.desc = "CRC32 instructions",
 		.capability = ARM64_HAS_CRC32,
 		.type = ARM64_CPUCAP_SYSTEM_FEATURE,
 		.matches = has_cpuid_feature,
 		ARM64_CPUID_FIELDS(ID_AA64ISAR0_EL1, CRC32, IMP)
 	},
 	{
 		.desc = "Speculative Store Bypassing Safe (SSBS)",
 		.capability = ARM64_SSBS,
diff --git a/arch/arm64/kvm/dirty_bit.c b/arch/arm64/kvm/dirty_bit.c
new file mode 100644
index 000000000000..32fe938d6bf7
--- /dev/null
+++ b/arch/arm64/kvm/dirty_bit.c
@@ -0,0 +1,16 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2026 ARM Ltd.
+ * Author: Leonardo Bras <leo.bras@arm.com>
+ */
+
+#include <asm/kvm_dirty_bit.h>
+
+/* HDBSS entry field definitions */
+#define HDBSS_ENTRY_VALID BIT(0)
+#define HDBSS_ENTRY_TTWL_SHIFT (1)
+#define HDBSS_ENTRY_TTWL_MASK (GENMASK(3, 1))
+#define HDBSS_ENTRY_TTWL(x) \
+	(((x) << HDBSS_ENTRY_TTWL_SHIFT) & HDBSS_ENTRY_TTWL_MASK)
+#define HDBSS_ENTRY_TTWL_RESV HDBSS_ENTRY_TTWL(-4)
+#define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 91a7dfad6686..e16729f0b7bd 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -724,23 +724,27 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p
 		attr = KVM_S2_MEMATTR(pgt, NORMAL);
 	}
 
 	r = stage2_set_xn_attr(prot, &attr);
 	if (r)
 		return r;
 
 	if (prot & KVM_PGTABLE_PROT_R)
 		attr |= KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R;
 
-	if (prot & KVM_PGTABLE_PROT_W)
+	if (prot & KVM_PGTABLE_PROT_W) {
 		attr |= KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W;
 
+		if (pgt->flags & KVM_PGTABLE_S2_DBM)
+			attr |= KVM_PTE_LEAF_ATTR_HI_S2_DBM;
+	}
+
 	if (!kvm_lpa2_is_enabled())
 		attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S2_SH, sh);
 
 	attr |= KVM_PTE_LEAF_ATTR_LO_S2_AF;
 	attr |= prot & KVM_PTE_LEAF_ATTR_HI_SW;
 	*ptep = attr;
 
 	return 0;
 }
 
@@ -1360,23 +1364,27 @@ int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
 	kvm_pte_t xn = 0, set = 0, clr = 0;
 	s8 level;
 	int ret;
 
 	if (prot & KVM_PTE_LEAF_ATTR_HI_SW)
 		return -EINVAL;
 
 	if (prot & KVM_PGTABLE_PROT_R)
 		set |= KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R;
 
-	if (prot & KVM_PGTABLE_PROT_W)
+	if (prot & KVM_PGTABLE_PROT_W) {
 		set |= KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W;
 
+		if (pgt->flags & KVM_PGTABLE_S2_DBM)
+			set |= KVM_PTE_LEAF_ATTR_HI_S2_DBM;
+	}
+
 	ret = stage2_set_xn_attr(prot, &xn);
 	if (ret)
 		return ret;
 
 	set |= xn & KVM_PTE_LEAF_ATTR_HI_S2_XN;
 	clr |= ~xn & KVM_PTE_LEAF_ATTR_HI_S2_XN;
 
 	ret = stage2_update_leaf_attrs(pgt, addr, 1, set, clr, NULL, &level, flags);
 	if (!ret || ret == -EAGAIN)
 		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa_nsh, pgt->mmu, addr, level);
@@ -1578,20 +1586,23 @@ int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
 	u64 vtcr = mmu->vtcr;
 	u32 ia_bits = VTCR_EL2_IPA(vtcr);
 	u32 sl0 = FIELD_GET(VTCR_EL2_SL0_MASK, vtcr);
 	s8 start_level = VTCR_EL2_TGRAN_SL0_BASE - sl0;
 
 	pgd_sz = kvm_pgd_pages(ia_bits, start_level) * PAGE_SIZE;
 	pgt->pgd = (kvm_pteref_t)mm_ops->zalloc_pages_exact(pgd_sz);
 	if (!pgt->pgd)
 		return -ENOMEM;
 
+	if (system_supports_hdbss())
+		flags |= KVM_PGTABLE_S2_DBM;
+
 	pgt->ia_bits		= ia_bits;
 	pgt->start_level	= start_level;
 	pgt->mm_ops		= mm_ops;
 	pgt->mmu		= mmu;
 	pgt->flags		= flags;
 	pgt->force_pte_cb	= force_pte_cb;
 
 	/* Ensure zeroed PGD pages are visible to the hardware walker */
 	dsb(ishst);
 	return 0;
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 59612d2f277c..6faacd857346 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -17,21 +17,21 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
 	 inject_fault.o va_layout.o handle_exit.o config.o \
 	 guest.o debug.o reset.o sys_regs.o stacktrace.o \
 	 vgic-sys-reg-v3.o fpsimd.o pkvm.o \
 	 arch_timer.o trng.o vmid.o emulate-nested.o nested.o at.o \
 	 vgic/vgic.o vgic/vgic-init.o \
 	 vgic/vgic-irqfd.o vgic/vgic-v2.o \
 	 vgic/vgic-v3.o vgic/vgic-v4.o \
 	 vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
 	 vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
 	 vgic/vgic-its.o vgic/vgic-debug.o vgic/vgic-v3-nested.o \
-	 vgic/vgic-v5.o
+	 vgic/vgic-v5.o dirty_bit.o
 
 kvm-$(CONFIG_HW_PERF_EVENTS)  += pmu-emul.o pmu.o
 kvm-$(CONFIG_ARM64_PTR_AUTH)  += pauth.o
 kvm-$(CONFIG_PTDUMP_STAGE2_DEBUGFS) += ptdump.o
 
 kvm-$(CONFIG_NVHE_EL2_TRACING) += hyp_trace.o
 
 always-y := hyp_constants.h hyp-constants.s
 
 define rule_gen_hyp_constants
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index 9b85a84f6fd4..a87706c9d160 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -62,20 +62,21 @@ HAS_RASV1P1_EXTN
 HAS_RNG
 HAS_SB
 HAS_STAGE2_FWB
 HAS_TCR2
 HAS_TIDCP1
 HAS_TLB_RANGE
 HAS_VA52
 HAS_VIRT_HOST_EXTN
 HAS_WFXT
 HAS_XNX
+HAS_HDBSS
 HAFT
 HW_DBM
 KVM_HVHE
 KVM_PROTECTED_MODE
 MISMATCHED_CACHE_TYPE
 MPAM
 MPAM_HCR
 MTE
 MTE_ASYMM
 MTE_FAR
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 02/13] KVM: arm64: Enable eager hugepage splitting if HDBSS is available
  2026-06-29 11:17 [PATCH v2 00/13] KVM Dirty-bit cleaning hw accelerator (HACDBS) Leonardo Bras
  2026-06-29 11:17 ` [PATCH v2 01/13] KVM: arm64: HDBSS bits Leonardo Bras
@ 2026-06-29 11:17 ` Leonardo Bras
  2026-06-29 11:36   ` sashiko-bot
  2026-06-29 11:17 ` [PATCH v2 03/13] arm64/cpufeature: Add system-wide FEAT_HACDBS detection Leonardo Bras
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 47+ messages in thread
From: Leonardo Bras @ 2026-06-29 11:17 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Marc Zyngier, Oliver Upton,
	Joey Gouly, Steffen Eiden, Suzuki K Poulose, Zenghui Yu,
	Rafael J. Wysocki, Len Brown, Saket Dumbre, Paolo Bonzini,
	Jonathan Cameron, Chengwen Feng, Leonardo Bras, Kees Cook,
	Mikołaj Lenczewski, James Morse, Zeng Heng, mrigendrachaubey,
	Thomas Huth, Ryan Roberts, Yeoreum Yun, Mark Brown, Kevin Brodsky,
	James Clark, Fuad Tabba, Raghavendra Rao Ananta,
	Lorenzo Pieralisi, Sascha Bischoff, Anshuman Khandual, Tian Zheng
  Cc: linux-arm-kernel, linux-kernel, kvmarm, linux-acpi, acpica-devel,
	kvm

FEAT_HDBSS speeds up guest memory dirty tracking by avoiding a page fault
and saving the entry in a tracking structure.

That may be a problem when we have guest memory backed by hugepages or
transparent huge pages, as it's not possible to do on-demand hugepage
splitting, relying only on eager hugepage splitting.

So, at stage2 initialization, enable eager hugepage splitting with
chunk = 256K * PAGE_SIZE if the system supports HDBSS.

Signed-off-by: Leonardo Bras <leo.bras@arm.com>
---
 arch/arm64/kvm/mmu.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 6c941aaa10c6..e086c01a9325 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1020,22 +1020,26 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
 
 	mmu->last_vcpu_ran = alloc_percpu(typeof(*mmu->last_vcpu_ran));
 	if (!mmu->last_vcpu_ran) {
 		err = -ENOMEM;
 		goto out_destroy_pgtable;
 	}
 
 	for_each_possible_cpu(cpu)
 		*per_cpu_ptr(mmu->last_vcpu_ran, cpu) = -1;
 
-	 /* The eager page splitting is disabled by default */
-	mmu->split_page_chunk_size = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT;
+	 /* The eager page splitting is disabled by default if system has no HDBSS */
+	if (system_supports_hdbss())
+		mmu->split_page_chunk_size = 256 * 1024 * PAGE_SIZE;
+	else
+		mmu->split_page_chunk_size = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT;
+
 	mmu->split_page_cache.gfp_zero = __GFP_ZERO;
 
 	mmu->pgd_phys = __pa(pgt->pgd);
 
 	if (kvm_is_nested_s2_mmu(kvm, mmu))
 		kvm_init_nested_s2_mmu(mmu);
 
 	return 0;
 
 out_destroy_pgtable:
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 03/13] arm64/cpufeature: Add system-wide FEAT_HACDBS detection
  2026-06-29 11:17 [PATCH v2 00/13] KVM Dirty-bit cleaning hw accelerator (HACDBS) Leonardo Bras
  2026-06-29 11:17 ` [PATCH v2 01/13] KVM: arm64: HDBSS bits Leonardo Bras
  2026-06-29 11:17 ` [PATCH v2 02/13] KVM: arm64: Enable eager hugepage splitting if HDBSS is available Leonardo Bras
@ 2026-06-29 11:17 ` Leonardo Bras
  2026-06-29 11:17 ` [PATCH v2 04/13] arm64/sysreg: Add HACDBS consumer and base registers Leonardo Bras
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 47+ messages in thread
From: Leonardo Bras @ 2026-06-29 11:17 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Marc Zyngier, Oliver Upton,
	Joey Gouly, Steffen Eiden, Suzuki K Poulose, Zenghui Yu,
	Rafael J. Wysocki, Len Brown, Saket Dumbre, Paolo Bonzini,
	Jonathan Cameron, Chengwen Feng, Leonardo Bras, Kees Cook,
	Mikołaj Lenczewski, James Morse, Zeng Heng, mrigendrachaubey,
	Thomas Huth, Ryan Roberts, Yeoreum Yun, Mark Brown, Kevin Brodsky,
	James Clark, Fuad Tabba, Raghavendra Rao Ananta,
	Lorenzo Pieralisi, Sascha Bischoff, Anshuman Khandual, Tian Zheng
  Cc: linux-arm-kernel, linux-kernel, kvmarm, linux-acpi, acpica-devel,
	kvm

FEAT_HACDBS will only be used for dirty-bit cleaning if it is detected in
all running cpus.

Signed-off-by: Leonardo Bras <leo.bras@arm.com>
---
 arch/arm64/include/asm/cpufeature.h | 5 +++++
 arch/arm64/kernel/cpufeature.c      | 8 ++++++++
 arch/arm64/tools/cpucaps            | 1 +
 3 files changed, 14 insertions(+)

diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index bdfab086fd94..620ae4cddb76 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -861,20 +861,25 @@ static inline bool system_supports_gcs(void)
 static inline bool system_supports_haft(void)
 {
 	return cpus_have_final_cap(ARM64_HAFT);
 }
 
 static inline bool system_supports_hdbss(void)
 {
 	return cpus_have_final_cap(ARM64_HAS_HDBSS);
 }
 
+static inline bool system_supports_hacdbs(void)
+{
+	return cpus_have_final_cap(ARM64_HACDBS);
+}
+
 static __always_inline bool system_supports_mpam(void)
 {
 	return alternative_has_cap_unlikely(ARM64_MPAM);
 }
 
 static __always_inline bool system_supports_mpam_hcr(void)
 {
 	return alternative_has_cap_unlikely(ARM64_MPAM_HCR);
 }
 
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index aa327eebaf1c..62f56bbd0a65 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -516,20 +516,21 @@ static const struct arm64_ftr_bits ftr_id_aa64mmfr3[] = {
 		       FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64MMFR3_EL1_S1POE_SHIFT, 4, 0),
 	ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64MMFR3_EL1_S1PIE_SHIFT, 4, 0),
 	ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64MMFR3_EL1_SCTLRX_SHIFT, 4, 0),
 	ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64MMFR3_EL1_TCRX_SHIFT, 4, 0),
 	ARM64_FTR_END,
 };
 
 static const struct arm64_ftr_bits ftr_id_aa64mmfr4[] = {
 	S_ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64MMFR4_EL1_E2H0_SHIFT, 4, 0),
 	ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64MMFR4_EL1_NV_frac_SHIFT, 4, 0),
+	ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64MMFR4_EL1_HACDBS_SHIFT, 4, 0),
 	ARM64_FTR_END,
 };
 
 static const struct arm64_ftr_bits ftr_ctr[] = {
 	ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_EXACT, 31, 1, 1), /* RES1 */
 	ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, CTR_EL0_DIC_SHIFT, 1, 1),
 	ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, CTR_EL0_IDC_SHIFT, 1, 1),
 	ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_HIGHER_OR_ZERO_SAFE, CTR_EL0_CWG_SHIFT, 4, 0),
 	ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_HIGHER_OR_ZERO_SAFE, CTR_EL0_ERG_SHIFT, 4, 0),
 	ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, CTR_EL0_DminLine_SHIFT, 4, 1),
@@ -2764,20 +2765,27 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
 	{
 		.desc = "Hardware dirty bit management",
 		.type = ARM64_CPUCAP_WEAK_LOCAL_CPU_FEATURE,
 		.capability = ARM64_HW_DBM,
 		.matches = has_hw_dbm,
 		.cpu_enable = cpu_enable_hw_dbm,
 		.cpus = &dbm_cpus,
 		ARM64_CPUID_FIELDS(ID_AA64MMFR1_EL1, HAFDBS, DBM)
 	},
 #endif
+	{
+		.desc = "Hardware dirty bit Cleaning",
+		.type = ARM64_CPUCAP_SYSTEM_FEATURE,
+		.capability = ARM64_HACDBS,
+		.matches = has_cpuid_feature,
+		ARM64_CPUID_FIELDS(ID_AA64MMFR4_EL1, HACDBS, IMP)
+	},
 #ifdef CONFIG_ARM64_HAFT
 	{
 		.desc = "Hardware managed Access Flag for Table Descriptors",
 		/*
 		 * Contrary to the page/block access flag, the table access flag
 		 * cannot be emulated in software (no access fault will occur).
 		 * Therefore this should be used only if it's supported system
 		 * wide.
 		 */
 		.type = ARM64_CPUCAP_SYSTEM_FEATURE,
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index a87706c9d160..bd2c0bb98da6 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -65,20 +65,21 @@ HAS_STAGE2_FWB
 HAS_TCR2
 HAS_TIDCP1
 HAS_TLB_RANGE
 HAS_VA52
 HAS_VIRT_HOST_EXTN
 HAS_WFXT
 HAS_XNX
 HAS_HDBSS
 HAFT
 HW_DBM
+HACDBS
 KVM_HVHE
 KVM_PROTECTED_MODE
 MISMATCHED_CACHE_TYPE
 MPAM
 MPAM_HCR
 MTE
 MTE_ASYMM
 MTE_FAR
 MTE_STORE_ONLY
 SME
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 04/13] arm64/sysreg: Add HACDBS consumer and base registers
  2026-06-29 11:17 [PATCH v2 00/13] KVM Dirty-bit cleaning hw accelerator (HACDBS) Leonardo Bras
                   ` (2 preceding siblings ...)
  2026-06-29 11:17 ` [PATCH v2 03/13] arm64/cpufeature: Add system-wide FEAT_HACDBS detection Leonardo Bras
@ 2026-06-29 11:17 ` Leonardo Bras
  2026-06-29 11:17 ` [PATCH v2 05/13] KVM: arm64: Detect (via ACPI) and initialize HACDBSIRQ Leonardo Bras
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 47+ messages in thread
From: Leonardo Bras @ 2026-06-29 11:17 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Marc Zyngier, Oliver Upton,
	Joey Gouly, Steffen Eiden, Suzuki K Poulose, Zenghui Yu,
	Rafael J. Wysocki, Len Brown, Saket Dumbre, Paolo Bonzini,
	Jonathan Cameron, Chengwen Feng, Leonardo Bras, Kees Cook,
	Mikołaj Lenczewski, James Morse, Zeng Heng, mrigendrachaubey,
	Thomas Huth, Ryan Roberts, Yeoreum Yun, Mark Brown, Kevin Brodsky,
	James Clark, Fuad Tabba, Raghavendra Rao Ananta,
	Lorenzo Pieralisi, Sascha Bischoff, Anshuman Khandual, Tian Zheng
  Cc: linux-arm-kernel, linux-kernel, kvmarm, linux-acpi, acpica-devel,
	kvm

They will be used on a later commit to make use of the FEAT_HACDBS
mechanism if available.

Signed-off-by: Leonardo Bras <leo.bras@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
---
 arch/arm64/tools/sysreg | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
index bc1788b1662b..7b7c3d6a0f03 100644
--- a/arch/arm64/tools/sysreg
+++ b/arch/arm64/tools/sysreg
@@ -4627,20 +4627,50 @@ EndEnum
 Enum	9:8	IRGN0
 	0b00	NC
 	0b01	WBWA
 	0b10	WT
 	0b11	WBnWA
 EndEnum
 Field	7:6	SL0
 Field	5:0	T0SZ
 EndSysreg
 
+Sysreg	HACDBSBR_EL2	3	4	2	3	4
+Res0	63:56
+Field	55:12	BADDR
+Field	11	EN
+Res0	10:4
+UnsignedEnum	3:0	SZ
+	0b0000	4K
+	0b0001	8K
+	0b0010	16K
+	0b0011	32K
+	0b0100	64K
+	0b0101	128K
+	0b0110	256K
+	0b0111	512K
+	0b1000	1M
+	0b1001	2M
+EndEnum
+EndSysreg
+
+Sysreg	HACDBSCONS_EL2	3	4	2	3	5
+UnsignedEnum	63:62	ERR_REASON
+	0b00	NOF
+	0b01	STRUCTF
+	0b10	IPAF
+	0b11	IPAHACF
+EndEnum
+Res0	61:19
+Field	18:0	INDEX
+EndSysreg
+
 Sysreg	GCSCR_EL2	3	4	2	5	0
 Fields	GCSCR_ELx
 EndSysreg
 
 Sysreg	GCSPR_EL2	3	4	2	5	1
 Fields	GCSPR_ELx
 EndSysreg
 
 Sysreg	HDBSSBR_EL2	3	4	2	3	2
 Res0	63:56
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 05/13] KVM: arm64: Detect (via ACPI) and initialize HACDBSIRQ
  2026-06-29 11:17 [PATCH v2 00/13] KVM Dirty-bit cleaning hw accelerator (HACDBS) Leonardo Bras
                   ` (3 preceding siblings ...)
  2026-06-29 11:17 ` [PATCH v2 04/13] arm64/sysreg: Add HACDBS consumer and base registers Leonardo Bras
@ 2026-06-29 11:17 ` Leonardo Bras
  2026-06-29 11:32   ` sashiko-bot
  2026-06-29 17:22   ` Oliver Upton
  2026-06-29 11:17 ` [PATCH v2 06/13] KVM: arm64: dirty_bit: Add base FEAT_HACDBS cleaning routine Leonardo Bras
                   ` (7 subsequent siblings)
  12 siblings, 2 replies; 47+ messages in thread
From: Leonardo Bras @ 2026-06-29 11:17 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Marc Zyngier, Oliver Upton,
	Joey Gouly, Steffen Eiden, Suzuki K Poulose, Zenghui Yu,
	Rafael J. Wysocki, Len Brown, Saket Dumbre, Paolo Bonzini,
	Jonathan Cameron, Chengwen Feng, Leonardo Bras, Kees Cook,
	Mikołaj Lenczewski, James Morse, Zeng Heng, mrigendrachaubey,
	Thomas Huth, Ryan Roberts, Yeoreum Yun, Mark Brown, Kevin Brodsky,
	James Clark, Fuad Tabba, Raghavendra Rao Ananta,
	Lorenzo Pieralisi, Sascha Bischoff, Anshuman Khandual, Tian Zheng
  Cc: linux-arm-kernel, linux-kernel, kvmarm, linux-acpi, acpica-devel,
	kvm

Find via ACPI [1] the Id for HACDBSIRQ, initialize it as a per-cpu IRQ
and make sure any cpu able to run virtualization has it active.

Introduce a per-cpu structure used by the HACDBSIRQ handler to keep track
of entries size and the status of HACDBS. Size is used to detect end of
processing in case the number of entries being processed is different of
the supported entries size.

Status may look easily replaceable by checking HACDBS registers now, but
will make the OFF/IDLE detection easier in next patches.

Signed-off-by: Leonardo Bras <leo.bras@arm.com>

[1] https://github.com/tianocore/edk2/issues/12409
---
 arch/arm64/include/asm/acpi.h          |  3 +
 arch/arm64/include/asm/kvm_dirty_bit.h | 18 +++++
 include/acpi/actbl2.h                  |  1 +
 arch/arm64/kvm/arm.c                   |  5 ++
 arch/arm64/kvm/dirty_bit.c             | 97 ++++++++++++++++++++++++++
 5 files changed, 124 insertions(+)

diff --git a/arch/arm64/include/asm/acpi.h b/arch/arm64/include/asm/acpi.h
index 8a54ca6ba602..883315e9d79d 100644
--- a/arch/arm64/include/asm/acpi.h
+++ b/arch/arm64/include/asm/acpi.h
@@ -38,20 +38,23 @@
 
 #define BAD_MADT_GICC_ENTRY(entry, end)					\
 	(!(entry) || (entry)->header.length < ACPI_MADT_GICC_MIN_LENGTH || \
 	(unsigned long)(entry) + (entry)->header.length > (end))
 
 #define ACPI_MADT_GICC_SPE  (offsetof(struct acpi_madt_generic_interrupt, \
 	spe_interrupt) + sizeof(u16))
 
 #define ACPI_MADT_GICC_TRBE  (offsetof(struct acpi_madt_generic_interrupt, \
 	trbe_interrupt) + sizeof(u16))
+
+#define ACPI_MADT_GICC_HACDBSIRQ  (offsetof(struct acpi_madt_generic_interrupt, \
+	hacdbsirq_gsi) + sizeof(u32))
 /*
  * Arm® Functional Fixed Hardware Specification Version 1.2.
  * Table 2: Arm Architecture context loss flags
  */
 #define CPUIDLE_CORE_CTXT		BIT(0) /* Core context Lost */
 
 static inline unsigned int arch_get_idle_state_flags(u32 arch_flags)
 {
 	if (arch_flags & CPUIDLE_CORE_CTXT)
 		return CPUIDLE_FLAG_TIMER_STOP;
diff --git a/arch/arm64/include/asm/kvm_dirty_bit.h b/arch/arm64/include/asm/kvm_dirty_bit.h
index dd16438f0651..904e59f95b7e 100644
--- a/arch/arm64/include/asm/kvm_dirty_bit.h
+++ b/arch/arm64/include/asm/kvm_dirty_bit.h
@@ -2,11 +2,29 @@
 /*
  * Copyright (C) 2026 ARM Ltd.
  * Author: Leonardo Bras <leo.bras@arm.com>
  */
 
 #ifndef __ARM64_KVM_DIRTY_BIT_H__
 #define __ARM64_KVM_DIRTY_BIT_H__
 
 #include <asm/kvm_pgtable.h>
 
+enum hacdbs_status {
+	HACDBS_OFF,
+	HACDBS_IDLE,
+	HACDBS_RUNNING,
+	HACDBS_ERROR
+};
+
+struct hacdbs {
+	enum hacdbs_status status;
+	int size;
+};
+
+DECLARE_PER_CPU(struct hacdbs, hacdbs_pcp);
+
+void __init kvm_hacdbs_init(void);
+void kvm_hacdbs_cpu_up(void);
+void kvm_hacdbs_cpu_down(void);
+
 #endif /* __ARM64_KVM_DIRTY_BIT_H__ */
diff --git a/include/acpi/actbl2.h b/include/acpi/actbl2.h
index baef525367b5..eaefb494ef59 100644
--- a/include/acpi/actbl2.h
+++ b/include/acpi/actbl2.h
@@ -1442,20 +1442,21 @@ struct acpi_madt_generic_interrupt {
 	u64 gich_base_address;
 	u32 vgic_interrupt;
 	u64 gicr_base_address;
 	u64 arm_mpidr;
 	u8 efficiency_class;
 	u8 reserved2[1];
 	u16 spe_interrupt;	/* ACPI 6.3 */
 	u16 trbe_interrupt;	/* ACPI 6.5 */
 	u16 iaffid;		/* ACPI 6.7 */
 	u32 irs_id;
+	u32 hacdbsirq_gsi;	/* ACPI 6.X */
 };
 
 /* Masks for Flags field above */
 
 /* ACPI_MADT_ENABLED                    (1)      Processor is usable if set */
 #define ACPI_MADT_PERFORMANCE_IRQ_MODE  (1<<1)	/* 01: Performance Interrupt Mode */
 #define ACPI_MADT_VGIC_IRQ_MODE         (1<<2)	/* 02: VGIC Maintenance Interrupt mode */
 #define ACPI_MADT_GICC_ONLINE_CAPABLE   (1<<3)	/* 03: Processor is online capable  */
 #define ACPI_MADT_GICC_NON_COHERENT     (1<<4)	/* 04: GIC redistributor is not coherent */
 
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 50adfff75be8..dc1a4629aaeb 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -35,20 +35,21 @@
 #include <asm/cpufeature.h>
 #include <asm/virt.h>
 #include <asm/kvm_arm.h>
 #include <asm/kvm_asm.h>
 #include <asm/kvm_emulate.h>
 #include <asm/kvm_hyp.h>
 #include <asm/kvm_mmu.h>
 #include <asm/kvm_nested.h>
 #include <asm/kvm_pkvm.h>
 #include <asm/kvm_ptrauth.h>
+#include <asm/kvm_dirty_bit.h>
 #include <asm/sections.h>
 #include <asm/stacktrace/nvhe.h>
 
 #include <kvm/arm_hypercalls.h>
 #include <kvm/arm_pmu.h>
 #include <kvm/arm_psci.h>
 #include <kvm/arm_vgic.h>
 
 #include <linux/irqchip/arm-gic-v5.h>
 
@@ -2300,28 +2301,30 @@ int kvm_arch_enable_virtualization_cpu(void)
 	 * disabled, but not with preemption disabled. The former is
 	 * enough to ensure correctness, but most of the helpers
 	 * expect the later and will throw a tantrum otherwise.
 	 */
 	preempt_disable();
 
 	cpu_hyp_init(NULL);
 
 	kvm_vgic_cpu_up();
 	kvm_timer_cpu_up();
+	kvm_hacdbs_cpu_up();
 
 	preempt_enable();
 
 	return 0;
 }
 
 void kvm_arch_disable_virtualization_cpu(void)
 {
+	kvm_hacdbs_cpu_down();
 	kvm_timer_cpu_down();
 	kvm_vgic_cpu_down();
 
 	if (!is_protected_kvm_enabled())
 		cpu_hyp_uninit(NULL);
 }
 
 #ifdef CONFIG_CPU_PM
 static int hyp_init_cpu_pm_notifier(struct notifier_block *self,
 				    unsigned long cmd,
@@ -2474,20 +2477,22 @@ static int __init init_subsystems(void)
 		goto out;
 	}
 
 	/*
 	 * Init HYP architected timer support
 	 */
 	err = kvm_timer_hyp_init(vgic_present);
 	if (err)
 		goto out;
 
+	kvm_hacdbs_init();
+
 	kvm_register_perf_callbacks();
 
 	err = kvm_hyp_trace_init();
 	if (err)
 		kvm_err("Failed to initialize Hyp tracing\n");
 
 out:
 	if (err)
 		hyp_cpu_pm_exit();
 
diff --git a/arch/arm64/kvm/dirty_bit.c b/arch/arm64/kvm/dirty_bit.c
index 32fe938d6bf7..789da8712b1b 100644
--- a/arch/arm64/kvm/dirty_bit.c
+++ b/arch/arm64/kvm/dirty_bit.c
@@ -1,16 +1,113 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /*
  * Copyright (C) 2026 ARM Ltd.
  * Author: Leonardo Bras <leo.bras@arm.com>
  */
 
 #include <asm/kvm_dirty_bit.h>
+#include <linux/kconfig.h>
+#include <linux/acpi.h>
+
+DEFINE_PER_CPU(struct hacdbs, hacdbs_pcp) = {
+	.status = HACDBS_OFF,
+	.size = 0,
+};
 
 /* HDBSS entry field definitions */
 #define HDBSS_ENTRY_VALID BIT(0)
 #define HDBSS_ENTRY_TTWL_SHIFT (1)
 #define HDBSS_ENTRY_TTWL_MASK (GENMASK(3, 1))
 #define HDBSS_ENTRY_TTWL(x) \
 	(((x) << HDBSS_ENTRY_TTWL_SHIFT) & HDBSS_ENTRY_TTWL_MASK)
 #define HDBSS_ENTRY_TTWL_RESV HDBSS_ENTRY_TTWL(-4)
 #define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
+
+static __ro_after_init int hacdbsirq = -1;
+
+static irqreturn_t hacdbsirq_handler(int irq, void *pcpu)
+{
+	u64 cons = read_sysreg_s(SYS_HACDBSCONS_EL2);
+	unsigned long err = FIELD_GET(HACDBSCONS_EL2_ERR_REASON, cons);
+
+	switch (err) {
+	case HACDBSCONS_EL2_ERR_REASON_NOF:
+		this_cpu_write(hacdbs_pcp.status, HACDBS_IDLE);
+		break;
+	case HACDBSCONS_EL2_ERR_REASON_IPAHACF:
+		/* When size not a power of two >= 4k, exit with reserved TTLW */
+		int index = FIELD_GET(HACDBSCONS_EL2_INDEX, cons);
+
+		if (index >= this_cpu_read(hacdbs_pcp.size)) {
+			this_cpu_write(hacdbs_pcp.status, HACDBS_IDLE);
+			break;
+		}
+		fallthrough;
+	case HACDBSCONS_EL2_ERR_REASON_STRUCTF:
+	case HACDBSCONS_EL2_ERR_REASON_IPAF:
+		this_cpu_write(hacdbs_pcp.status, HACDBS_ERROR);
+		break;
+	}
+
+	return IRQ_HANDLED;
+}
+
+void kvm_hacdbs_cpu_up(void)
+{
+	if (hacdbsirq < 0)
+		return;
+
+	enable_percpu_irq(hacdbsirq, IRQ_TYPE_LEVEL_HIGH);
+	this_cpu_write(hacdbs_pcp.status, HACDBS_IDLE);
+}
+
+void kvm_hacdbs_cpu_down(void)
+{
+	if (hacdbsirq < 0)
+		return;
+
+	disable_percpu_irq(hacdbsirq);
+	this_cpu_write(hacdbs_pcp.status, HACDBS_OFF);
+}
+
+#ifdef CONFIG_ACPI
+static int __init hacdbs_acpi_get_irq(void)
+{
+	struct acpi_madt_generic_interrupt *gicc;
+	u32 gsi;
+	int irq;
+
+	gicc = acpi_cpu_get_madt_gicc(smp_processor_id());
+	if (gicc->header.length < ACPI_MADT_GICC_HACDBSIRQ)
+		return -ENXIO;
+
+	gsi =  gicc->hacdbsirq_gsi;
+
+	irq = acpi_register_gsi(NULL, gsi, ACPI_LEVEL_SENSITIVE, ACPI_ACTIVE_HIGH);
+	if (irq < 0) {
+		pr_warn("ACPI: Unable to register HACDBS interrupt: %d\n", gsi);
+		return -ENXIO;
+	}
+
+	return irq;
+}
+#else
+#define hacdbs_acpi_get_irq() (-ENXIO)
+#endif
+
+void __init kvm_hacdbs_init(void)
+{
+	int irq;
+
+	/* FEAT_HACDBS is only supported if Linux runs in EL2 (VHE) */
+	if (!system_supports_hacdbs() || !is_kernel_in_hyp_mode())
+		return;
+
+	irq = hacdbs_acpi_get_irq();
+	if (irq < 0)
+		return;
+
+	if (request_percpu_irq(irq, hacdbsirq_handler, "HACDBSIRQ", &hacdbs_pcp) < 0)
+		return;
+
+	hacdbsirq = irq;
+}
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 06/13] KVM: arm64: dirty_bit: Add base FEAT_HACDBS cleaning routine
  2026-06-29 11:17 [PATCH v2 00/13] KVM Dirty-bit cleaning hw accelerator (HACDBS) Leonardo Bras
                   ` (4 preceding siblings ...)
  2026-06-29 11:17 ` [PATCH v2 05/13] KVM: arm64: Detect (via ACPI) and initialize HACDBSIRQ Leonardo Bras
@ 2026-06-29 11:17 ` Leonardo Bras
  2026-06-29 11:29   ` sashiko-bot
  2026-06-29 17:36   ` Oliver Upton
  2026-06-29 11:17 ` [PATCH v2 07/13] kvm: Add arch-generic interface for hw-accelerated dirty-bitmap cleaning Leonardo Bras
                   ` (6 subsequent siblings)
  12 siblings, 2 replies; 47+ messages in thread
From: Leonardo Bras @ 2026-06-29 11:17 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Marc Zyngier, Oliver Upton,
	Joey Gouly, Steffen Eiden, Suzuki K Poulose, Zenghui Yu,
	Rafael J. Wysocki, Len Brown, Saket Dumbre, Paolo Bonzini,
	Jonathan Cameron, Chengwen Feng, Leonardo Bras, Kees Cook,
	Mikołaj Lenczewski, James Morse, Zeng Heng, mrigendrachaubey,
	Thomas Huth, Ryan Roberts, Yeoreum Yun, Mark Brown, Kevin Brodsky,
	James Clark, Fuad Tabba, Raghavendra Rao Ananta,
	Lorenzo Pieralisi, Sascha Bischoff, Anshuman Khandual, Tian Zheng
  Cc: linux-arm-kernel, linux-kernel, kvmarm, linux-acpi, acpica-devel,
	kvm

Introduce the basic cleaning routine that is going to be used for both
dirty-bitmap and dirty-ring routines.

It sets the required registers with the input buffer, and wait for
HACDBSIRQ to happen, which means either the task is done, or there was some
error during processing.

It is ran with preemption disabled, as a task being scheduled in could
change the translation registers used by HACDBS and end up corrupting the
current dirty-bit tracking and the sched-in task's S2 pagetables.

Signed-off-by: Leonardo Bras <leo.bras@arm.com>
---
 arch/arm64/kvm/dirty_bit.c | 81 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)

diff --git a/arch/arm64/kvm/dirty_bit.c b/arch/arm64/kvm/dirty_bit.c
index 789da8712b1b..e4283828b780 100644
--- a/arch/arm64/kvm/dirty_bit.c
+++ b/arch/arm64/kvm/dirty_bit.c
@@ -1,36 +1,117 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /*
  * Copyright (C) 2026 ARM Ltd.
  * Author: Leonardo Bras <leo.bras@arm.com>
  */
 
 #include <asm/kvm_dirty_bit.h>
+#include <asm/kvm_mmu.h>
 #include <linux/kconfig.h>
 #include <linux/acpi.h>
 
 DEFINE_PER_CPU(struct hacdbs, hacdbs_pcp) = {
 	.status = HACDBS_OFF,
 	.size = 0,
 };
 
 /* HDBSS entry field definitions */
 #define HDBSS_ENTRY_VALID BIT(0)
 #define HDBSS_ENTRY_TTWL_SHIFT (1)
 #define HDBSS_ENTRY_TTWL_MASK (GENMASK(3, 1))
 #define HDBSS_ENTRY_TTWL(x) \
 	(((x) << HDBSS_ENTRY_TTWL_SHIFT) & HDBSS_ENTRY_TTWL_MASK)
 #define HDBSS_ENTRY_TTWL_RESV HDBSS_ENTRY_TTWL(-4)
 #define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
 
 static __ro_after_init int hacdbsirq = -1;
 
+static void hacdbs_start(u64 *hw_entries, int size)
+{
+	u64 br;
+	/* Each entry is 8 bytes */
+	int size_b = size * sizeof(hw_entries[0]);
+	int size_p2 = max(roundup_pow_of_two(size_b), PAGE_SIZE);
+
+	/* If not using the full size of the array, put a stop entry at the end */
+	if (size_b < size_p2)
+		hw_entries[size] = HDBSS_ENTRY_VALID | HDBSS_ENTRY_TTWL_RESV;
+
+	sysreg_clear_set_s(SYS_HACDBSCONS_EL2,
+			   HACDBSCONS_EL2_ERR_REASON | HACDBSCONS_EL2_INDEX, 0);
+
+	br = (virt_to_phys(hw_entries) & HACDBSBR_EL2_BADDR_MASK) |
+	     FIELD_PREP(HACDBSBR_EL2_SZ, ilog2(size_p2) - 12) |
+	     FIELD_PREP(HACDBSBR_EL2_EN, 1);
+
+	this_cpu_write(hacdbs_pcp.status, HACDBS_RUNNING);
+	this_cpu_write(hacdbs_pcp.size, size);
+	write_sysreg_s(br, SYS_HACDBSBR_EL2);
+	isb();
+}
+
+static int hacdbs_stop(void)
+{
+	write_sysreg_s(0, SYS_HACDBSBR_EL2);
+	isb();
+
+	if (this_cpu_read(hacdbs_pcp.status) == HACDBS_ERROR) {
+		/* In case of error, HACDBSCONS_EL2.INDEX should point the faulty entry */
+		u64 cons = read_sysreg_s(SYS_HACDBSCONS_EL2);
+		int idx = FIELD_GET(HACDBSCONS_EL2_INDEX, cons);
+
+		this_cpu_write(hacdbs_pcp.status, HACDBS_IDLE);
+
+		return idx;
+	}
+
+	return this_cpu_read(hacdbs_pcp.size);
+}
+
+/*
+ * Clears dirty-bits for an array of pages (hw_entries) using HACDBS
+ * Returns the number of items cleaned from the array. If returns value < size,
+ *	there was an error in the processing.
+ */
+static int dirty_bit_clear(struct kvm *kvm, u64 *hw_entries, int size)
+{
+	u64 hcr_el2;
+	int ret;
+
+	preempt_disable();
+
+	hcr_el2 = read_sysreg(HCR_EL2);
+	write_sysreg(hcr_el2 | HCR_EL2_VM, HCR_EL2);
+	__load_stage2(&kvm->arch.mmu);
+
+	hacdbs_start(hw_entries, size);
+
+	do {
+		wfi();
+	} while (this_cpu_read(hacdbs_pcp.status) == HACDBS_RUNNING);
+
+	ret = hacdbs_stop();
+
+	write_sysreg(hcr_el2, HCR_EL2);
+	isb();
+
+	/*
+	 * No DSB is needed here, as kvm_flush_remote_tlbs_memslot() that happens
+	 * later in generic dirty-cleaning code already performs a DSB before
+	 * doing the TLBI.
+	 */
+
+	preempt_enable();
+
+	return ret;
+}
+
 static irqreturn_t hacdbsirq_handler(int irq, void *pcpu)
 {
 	u64 cons = read_sysreg_s(SYS_HACDBSCONS_EL2);
 	unsigned long err = FIELD_GET(HACDBSCONS_EL2_ERR_REASON, cons);
 
 	switch (err) {
 	case HACDBSCONS_EL2_ERR_REASON_NOF:
 		this_cpu_write(hacdbs_pcp.status, HACDBS_IDLE);
 		break;
 	case HACDBSCONS_EL2_ERR_REASON_IPAHACF:
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 07/13] kvm: Add arch-generic interface for hw-accelerated dirty-bitmap cleaning
  2026-06-29 11:17 [PATCH v2 00/13] KVM Dirty-bit cleaning hw accelerator (HACDBS) Leonardo Bras
                   ` (5 preceding siblings ...)
  2026-06-29 11:17 ` [PATCH v2 06/13] KVM: arm64: dirty_bit: Add base FEAT_HACDBS cleaning routine Leonardo Bras
@ 2026-06-29 11:17 ` Leonardo Bras
  2026-06-29 11:38   ` sashiko-bot
  2026-06-29 11:17 ` [PATCH v2 08/13] KVM: arm64: Add hardware-accelerated dirty-bitmap cleaning routine Leonardo Bras
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 47+ messages in thread
From: Leonardo Bras @ 2026-06-29 11:17 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Marc Zyngier, Oliver Upton,
	Joey Gouly, Steffen Eiden, Suzuki K Poulose, Zenghui Yu,
	Rafael J. Wysocki, Len Brown, Saket Dumbre, Paolo Bonzini,
	Jonathan Cameron, Chengwen Feng, Leonardo Bras, Kees Cook,
	Mikołaj Lenczewski, James Morse, Zeng Heng, mrigendrachaubey,
	Thomas Huth, Ryan Roberts, Yeoreum Yun, Mark Brown, Kevin Brodsky,
	James Clark, Fuad Tabba, Raghavendra Rao Ananta,
	Lorenzo Pieralisi, Sascha Bischoff, Anshuman Khandual, Tian Zheng
  Cc: linux-arm-kernel, linux-kernel, kvmarm, linux-acpi, acpica-devel,
	kvm

Introduce kvm_arch_dirty_log_clear() that allow implementation of
arch-specific hardware-accelerated dirty-log routines.

A call to that is added on both kvm_get_dirty_log_protect() and
kvm_clear_dirty_log_protect() and will fall back to software version if
not implemented, or any error was detected in the arch-specific routine.

For an arch to implement this function, it's required to provide an
asm/kvm_dirty_bit.h and have CONFIG_HAVE_KVM_HW_DIRTY_BIT=y on building.

If the arch does not implement it, and thus lack above config, the
introduced snippet is expected to be compiled-out and have zero impact at
runtime.

Signed-off-by: Leonardo Bras <leo.bras@arm.com>
---
 include/linux/kvm_dirty_bit.h | 27 +++++++++++++++++++++++++++
 virt/kvm/kvm_main.c           | 13 ++++++++++++-
 virt/kvm/Kconfig              |  3 +++
 3 files changed, 42 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/kvm_dirty_bit.h

diff --git a/include/linux/kvm_dirty_bit.h b/include/linux/kvm_dirty_bit.h
new file mode 100644
index 000000000000..fa4f6b67b623
--- /dev/null
+++ b/include/linux/kvm_dirty_bit.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2026 ARM Ltd.
+ * Author: Leonardo Bras <leo.bras@arm.com>
+ */
+
+#ifndef __KVM_DIRTY_BIT_H__
+#define __KVM_DIRTY_BIT_H__
+
+#ifndef CONFIG_HAVE_KVM_HW_DIRTY_BIT
+
+static inline int kvm_arch_dirty_log_clear(struct kvm *kvm,
+					   struct kvm_memory_slot *memslot,
+					   struct kvm_clear_dirty_log *log,
+					   unsigned long *bitmap,
+					   bool *flush)
+{
+	return -ENXIO;
+}
+
+#else /* CONFIG_HAVE_KVM_HW_DIRTY_BIT */
+
+#include <asm/kvm_dirty_bit.h>
+
+#endif /* CONFIG_HAVE_KVM_HW_DIRTY_BIT */
+
+#endif /* __KVM_DIRTY_BIT_H__ */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e44c20c04961..a25b8902cdfc 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -58,20 +58,21 @@
 #include "async_pf.h"
 #include "kvm_mm.h"
 #include "vfio.h"
 
 #include <trace/events/ipi.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/kvm.h>
 
 #include <linux/kvm_dirty_ring.h>
+#include <linux/kvm_dirty_bit.h>
 
 
 /* Worst case buffer size needed for holding an integer. */
 #define ITOA_MAX_LEN 12
 
 MODULE_AUTHOR("Qumranet");
 MODULE_DESCRIPTION("Kernel-based Virtual Machine (KVM) Hypervisor");
 MODULE_LICENSE("GPL");
 
 /* Architectures should define their poll value according to the halt latency */
@@ -2255,39 +2256,44 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
 		 * is some code duplication between this function and
 		 * kvm_get_dirty_log, but hopefully all architecture
 		 * transition to kvm_get_dirty_log_protect and kvm_get_dirty_log
 		 * can be eliminated.
 		 */
 		dirty_bitmap_buffer = dirty_bitmap;
 	} else {
 		dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot);
 		memset(dirty_bitmap_buffer, 0, n);
 
+		if (kvm_arch_dirty_log_clear(kvm, memslot, NULL,
+					     dirty_bitmap_buffer, &flush) >= 0)
+			goto out;
+
 		KVM_MMU_LOCK(kvm);
 		for (i = 0; i < n / sizeof(long); i++) {
 			unsigned long mask;
 			gfn_t offset;
 
 			if (!dirty_bitmap[i])
 				continue;
 
 			flush = true;
 			mask = xchg(&dirty_bitmap[i], 0);
 			dirty_bitmap_buffer[i] = mask;
 
 			offset = i * BITS_PER_LONG;
 			kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
 								offset, mask);
 		}
 		KVM_MMU_UNLOCK(kvm);
 	}
 
+out:
 	if (flush)
 		kvm_flush_remote_tlbs_memslot(kvm, memslot);
 
 	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
 		return -EFAULT;
 	return 0;
 }
 
 
 /**
@@ -2366,45 +2372,50 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
 	    (log->num_pages < memslot->npages - log->first_page && (log->num_pages & 63)))
 	    return -EINVAL;
 
 	kvm_arch_sync_dirty_log(kvm, memslot);
 
 	flush = false;
 	dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot);
 	if (copy_from_user(dirty_bitmap_buffer, log->dirty_bitmap, n))
 		return -EFAULT;
 
+	if (kvm_arch_dirty_log_clear(kvm, memslot, log, dirty_bitmap_buffer,
+				     &flush) >= 0)
+		goto out;
+
 	KVM_MMU_LOCK(kvm);
 	for (offset = log->first_page, i = offset / BITS_PER_LONG,
 		 n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--;
 	     i++, offset += BITS_PER_LONG) {
 		unsigned long mask = *dirty_bitmap_buffer++;
 		atomic_long_t *p = (atomic_long_t *) &dirty_bitmap[i];
 		if (!mask)
 			continue;
 
 		mask &= atomic_long_fetch_andnot(mask, p);
 
 		/*
 		 * mask contains the bits that really have been cleared.  This
 		 * never includes any bits beyond the length of the memslot (if
 		 * the length is not aligned to 64 pages), therefore it is not
 		 * a problem if userspace sets them in log->dirty_bitmap.
 		*/
 		if (mask) {
 			flush = true;
+
 			kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
 								offset, mask);
 		}
 	}
 	KVM_MMU_UNLOCK(kvm);
-
+out:
 	if (flush)
 		kvm_flush_remote_tlbs_memslot(kvm, memslot);
 
 	return 0;
 }
 
 static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
 					struct kvm_clear_dirty_log *log)
 {
 	int r;
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 794976b88c6f..f8757b5b84b3 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -13,20 +13,23 @@ config HAVE_KVM_PFNCACHE
 
 config HAVE_KVM_IRQCHIP
        bool
 
 config HAVE_KVM_IRQ_ROUTING
        bool
 
 config HAVE_KVM_DIRTY_RING
        bool
 
+config HAVE_KVM_HW_DIRTY_BIT
+       bool
+
 # Only strongly ordered architectures can select this, as it doesn't
 # put any explicit constraint on userspace ordering. They can also
 # select the _ACQ_REL version.
 config HAVE_KVM_DIRTY_RING_TSO
        bool
        select HAVE_KVM_DIRTY_RING
        depends on X86
 
 # Weakly ordered architectures can only select this, advertising
 # to userspace the additional ordering requirements.
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 08/13] KVM: arm64: Add hardware-accelerated dirty-bitmap cleaning routine
  2026-06-29 11:17 [PATCH v2 00/13] KVM Dirty-bit cleaning hw accelerator (HACDBS) Leonardo Bras
                   ` (6 preceding siblings ...)
  2026-06-29 11:17 ` [PATCH v2 07/13] kvm: Add arch-generic interface for hw-accelerated dirty-bitmap cleaning Leonardo Bras
@ 2026-06-29 11:17 ` Leonardo Bras
  2026-06-29 11:45   ` sashiko-bot
  2026-06-29 11:17 ` [PATCH v2 09/13] KVM: arm64: Dirty-bitmap: avoid splitting previously split blocks Leonardo Bras
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 47+ messages in thread
From: Leonardo Bras @ 2026-06-29 11:17 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Marc Zyngier, Oliver Upton,
	Joey Gouly, Steffen Eiden, Suzuki K Poulose, Zenghui Yu,
	Rafael J. Wysocki, Len Brown, Saket Dumbre, Paolo Bonzini,
	Jonathan Cameron, Chengwen Feng, Leonardo Bras, Kees Cook,
	Mikołaj Lenczewski, James Morse, Zeng Heng, mrigendrachaubey,
	Thomas Huth, Ryan Roberts, Yeoreum Yun, Mark Brown, Kevin Brodsky,
	James Clark, Fuad Tabba, Raghavendra Rao Ananta,
	Lorenzo Pieralisi, Sascha Bischoff, Anshuman Khandual, Tian Zheng
  Cc: linux-arm-kernel, linux-kernel, kvmarm, linux-acpi, acpica-devel,
	kvm

Implement arm64 version of kvm_arch_dirty_log_clear() making use of
FEAT_HACDBS.

It works by transversing the dirty-bitmap and converting the set bits into
HDBSS entries in a 64-page blocks granularity.

The resulting HDBSS array is then fed to the HACDBS mechanism that walks
the pagetable marking writable-dirty pages as writable-clean.

In case of error, rewrite all unprocessed entries, including the faulting
one, to the dirty-bitmap and fall back to generic software cleaning.

In case of the options to "manual protect + init set" are enabled, do
the hugepage splitting in the same fashion as the generic software
cleaning, i.e. in 64-page blocks. For that, remove the static qualifier
from kvm_mmu_split_huge_pages() and make the function available on
kvm_host.h.

Signed-off-by: Leonardo Bras <leo.bras@arm.com>
---
 arch/arm64/include/asm/kvm_dirty_bit.h |  24 ++++
 include/linux/kvm_host.h               |   3 +
 arch/arm64/kvm/dirty_bit.c             | 146 +++++++++++++++++++++++++
 arch/arm64/kvm/mmu.c                   |   4 +-
 4 files changed, 175 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_dirty_bit.h b/arch/arm64/include/asm/kvm_dirty_bit.h
index 904e59f95b7e..3d749f979c67 100644
--- a/arch/arm64/include/asm/kvm_dirty_bit.h
+++ b/arch/arm64/include/asm/kvm_dirty_bit.h
@@ -20,11 +20,35 @@ struct hacdbs {
 	enum hacdbs_status status;
 	int size;
 };
 
 DECLARE_PER_CPU(struct hacdbs, hacdbs_pcp);
 
 void __init kvm_hacdbs_init(void);
 void kvm_hacdbs_cpu_up(void);
 void kvm_hacdbs_cpu_down(void);
 
+int __kvm_arch_dirty_log_clear(struct kvm *kvm,
+			       struct kvm_memory_slot *memslot,
+			       struct kvm_clear_dirty_log *log,
+			       unsigned long *bitmap,
+			       bool *flush);
+
+static inline bool kvm_arch_dirty_clear_enabled(struct kvm *kvm)
+{
+	return this_cpu_read(hacdbs_pcp.status) == HACDBS_IDLE &&
+	       (kvm->arch.mmu.pgt->flags & KVM_PGTABLE_S2_DBM);
+}
+
+static inline int kvm_arch_dirty_log_clear(struct kvm *kvm,
+					   struct kvm_memory_slot *memslot,
+					   struct kvm_clear_dirty_log *log,
+					   unsigned long *bitmap,
+					   bool *flush)
+{
+	if (!kvm_arch_dirty_clear_enabled(kvm))
+		return -EPERM;
+
+	return __kvm_arch_dirty_log_clear(kvm, memslot, log, bitmap, flush);
+}
+
 #endif /* __ARM64_KVM_DIRTY_BIT_H__ */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ab8cfaec82d3..7ea6ed7ce203 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1662,20 +1662,23 @@ void kvm_arch_disable_virtualization_cpu(void);
 bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu);
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
 bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu);
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
 bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu);
 bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
 bool kvm_arch_vcpu_preempted_in_kernel(struct kvm_vcpu *vcpu);
 void kvm_arch_pre_destroy_vm(struct kvm *kvm);
 void kvm_arch_create_vm_debugfs(struct kvm *kvm);
 
+int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
+			     phys_addr_t end);
+
 #ifndef __KVM_HAVE_ARCH_VM_ALLOC
 /*
  * All architectures that want to use vzalloc currently also
  * need their own kvm_arch_alloc_vm implementation.
  */
 static inline struct kvm *kvm_arch_alloc_vm(void)
 {
 	return kzalloc_obj(struct kvm, GFP_KERNEL_ACCOUNT);
 }
 #endif
diff --git a/arch/arm64/kvm/dirty_bit.c b/arch/arm64/kvm/dirty_bit.c
index e4283828b780..d05af3de78be 100644
--- a/arch/arm64/kvm/dirty_bit.c
+++ b/arch/arm64/kvm/dirty_bit.c
@@ -98,20 +98,166 @@ static int dirty_bit_clear(struct kvm *kvm, u64 *hw_entries, int size)
 	 * No DSB is needed here, as kvm_flush_remote_tlbs_memslot() that happens
 	 * later in generic dirty-cleaning code already performs a DSB before
 	 * doing the TLBI.
 	 */
 
 	preempt_enable();
 
 	return ret;
 }
 
+static inline void hdbss_to_bitmap(u64 *hdbss_array, int start, int end,
+				   unsigned long *dirty_bitmap,
+				   unsigned long long offset)
+{
+	u64 w = (gpa_to_gfn(hdbss_array[start]) - offset) / BITS_PER_LONG;
+	u64 mask = 0;
+	int idx = start;
+
+	do {
+		u64 entry = (gpa_to_gfn(hdbss_array[idx]) - offset);
+
+		if (entry / BITS_PER_LONG == w) {
+			mask |= BIT(entry % BITS_PER_LONG);
+		} else {
+			atomic_long_or(mask, (atomic_long_t *)&dirty_bitmap[w]);
+			w = entry / BITS_PER_LONG;
+			mask = BIT(entry % BITS_PER_LONG);
+		}
+	} while (++idx < end);
+	atomic_long_or(mask, (atomic_long_t *)&dirty_bitmap[w]);
+}
+
+static inline int mask_to_hdbss(unsigned long *mask, u64 *hw_entries, const gfn_t offset,
+				u64 ttwl, int idx, int entries_sz)
+{
+	while (idx < entries_sz) {
+		int j = __ffs(*mask);
+		u64 a = gfn_to_gpa(offset + j);
+
+		hw_entries[idx++] = (a & HDBSS_ENTRY_IPA) |
+				    ttwl |
+				    HDBSS_ENTRY_VALID;
+
+		*mask &= ~BIT(j);
+		if (!*mask)
+			break;
+	}
+
+	return idx;
+}
+
+int __kvm_arch_dirty_log_clear(struct kvm *kvm,
+			       struct kvm_memory_slot *memslot,
+			       struct kvm_clear_dirty_log *log,
+			       unsigned long *bitmap,
+			       bool *flush)
+{
+	int ret = 0;
+	int idx = 0;
+	unsigned long *dirty_bitmap = memslot->dirty_bitmap;
+	u64 *hw_entries;
+	const int entries_sz = PAGE_SIZE / sizeof(*hw_entries);
+	u64 ttwl;
+	u64 start, end;
+	gfn_t base_gfn;
+
+	hw_entries = kmalloc_objs(u64, entries_sz, GFP_KERNEL);
+	if (!hw_entries)
+		return -ENOMEM;
+
+	ttwl = HDBSS_ENTRY_TTWL(KVM_PGTABLE_LAST_LEVEL);
+
+	if (log) {
+		start = log->first_page / BITS_PER_LONG;
+		end = start + DIV_ROUND_UP(log->num_pages, BITS_PER_LONG);
+		base_gfn = memslot->base_gfn + log->first_page % BITS_PER_LONG;
+	} else {
+		start = 0;
+		end = kvm_dirty_bitmap_bytes(memslot) / sizeof(long);
+		base_gfn = memslot->base_gfn;
+	}
+
+	write_lock(&kvm->mmu_lock);
+
+	for (unsigned long i = start; i < end; i++) {
+		unsigned long mask;
+		gfn_t offset;
+		atomic_long_t *p;
+
+		if (log) { /* Clean only what is in the input bitmap */
+			mask = bitmap[i];
+			if (!mask)
+				continue;
+
+			p = (atomic_long_t *)&dirty_bitmap[i];
+			mask &= atomic_long_fetch_andnot(mask, p);
+		} else { /* Clean everything */
+			if (!dirty_bitmap[i])
+				continue;
+
+			mask = xchg(&dirty_bitmap[i], 0);
+			bitmap[i] = mask;
+		}
+
+		if (!mask)
+			continue;
+
+		offset = base_gfn + i * BITS_PER_LONG;
+
+		if (kvm_dirty_log_manual_protect_and_init_set(kvm))
+			kvm_mmu_split_huge_pages(kvm,
+						 gfn_to_gpa(offset + __ffs(mask)),
+						 gfn_to_gpa(offset + __fls(mask) + 1));
+
+		do {
+			idx = mask_to_hdbss(&mask, hw_entries, offset, ttwl, idx, entries_sz);
+			if (idx >= entries_sz) {
+				ret = dirty_bit_clear(kvm, hw_entries, idx);
+				*flush = *flush || ret > 0;
+				if (ret != idx) {
+					/* Save bits not converted back to bitmap */
+					atomic_long_or(mask, (atomic_long_t *)&dirty_bitmap[i]);
+					goto out_err;
+				}
+				idx = 0;
+			}
+		} while (mask);
+	}
+
+	if (idx != 0) {
+		ret = dirty_bit_clear(kvm, hw_entries, idx);
+		*flush = *flush || ret > 0;
+	}
+out_err:
+	if (unlikely(ret != idx)) {
+		/*
+		 * In case there is an error and not all entries in HACDBS get
+		 * cleaned, we have to mark the dirty bits back in the bitmap,
+		 * as that will be used by the software routine.
+		 *
+		 * Entries should be in order, since they were extraxed from
+		 * the dirty-bitmap, so batching the atomic writes is efficient.
+		 */
+
+		if (ret < idx)
+			hdbss_to_bitmap(hw_entries, ret, idx, dirty_bitmap, memslot->base_gfn);
+
+		ret = -EAGAIN;
+	}
+
+	write_unlock(&kvm->mmu_lock);
+	kfree(hw_entries);
+
+	return ret;
+}
+
 static irqreturn_t hacdbsirq_handler(int irq, void *pcpu)
 {
 	u64 cons = read_sysreg_s(SYS_HACDBSCONS_EL2);
 	unsigned long err = FIELD_GET(HACDBSCONS_EL2_ERR_REASON, cons);
 
 	switch (err) {
 	case HACDBSCONS_EL2_ERR_REASON_NOF:
 		this_cpu_write(hacdbs_pcp.status, HACDBS_IDLE);
 		break;
 	case HACDBSCONS_EL2_ERR_REASON_IPAHACF:
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index e086c01a9325..2f9d90c35668 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -110,22 +110,22 @@ static bool need_split_memcache_topup_or_resched(struct kvm *kvm)
 
 	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
 		return true;
 
 	chunk_size = kvm->arch.mmu.split_page_chunk_size;
 	min = kvm_mmu_split_nr_page_tables(chunk_size);
 	cache = &kvm->arch.mmu.split_page_cache;
 	return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
 }
 
-static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
-				    phys_addr_t end)
+int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
+			     phys_addr_t end)
 {
 	struct kvm_mmu_memory_cache *cache;
 	struct kvm_pgtable *pgt;
 	int ret, cache_capacity;
 	u64 next, chunk_size;
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
 	chunk_size = kvm->arch.mmu.split_page_chunk_size;
 	cache_capacity = kvm_mmu_split_nr_page_tables(chunk_size);
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 09/13] KVM: arm64: Dirty-bitmap: avoid splitting previously split blocks
  2026-06-29 11:17 [PATCH v2 00/13] KVM Dirty-bit cleaning hw accelerator (HACDBS) Leonardo Bras
                   ` (7 preceding siblings ...)
  2026-06-29 11:17 ` [PATCH v2 08/13] KVM: arm64: Add hardware-accelerated dirty-bitmap cleaning routine Leonardo Bras
@ 2026-06-29 11:17 ` Leonardo Bras
  2026-06-29 11:39   ` sashiko-bot
  2026-06-29 11:17 ` [PATCH v2 10/13] kvm/dirty_ring: Introduce get_memslot and move helpers to header Leonardo Bras
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 47+ messages in thread
From: Leonardo Bras @ 2026-06-29 11:17 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Marc Zyngier, Oliver Upton,
	Joey Gouly, Steffen Eiden, Suzuki K Poulose, Zenghui Yu,
	Rafael J. Wysocki, Len Brown, Saket Dumbre, Paolo Bonzini,
	Jonathan Cameron, Chengwen Feng, Leonardo Bras, Kees Cook,
	Mikołaj Lenczewski, James Morse, Zeng Heng, mrigendrachaubey,
	Thomas Huth, Ryan Roberts, Yeoreum Yun, Mark Brown, Kevin Brodsky,
	James Clark, Fuad Tabba, Raghavendra Rao Ananta,
	Lorenzo Pieralisi, Sascha Bischoff, Anshuman Khandual, Tian Zheng
  Cc: linux-arm-kernel, linux-kernel, kvmarm, linux-acpi, acpica-devel,
	kvm

If previous dirty-clean already split a block, then avoid calling the
split helper on that block again.

Signed-off-by: Leonardo Bras <leo.bras@arm.com>
---
 arch/arm64/kvm/dirty_bit.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/dirty_bit.c b/arch/arm64/kvm/dirty_bit.c
index d05af3de78be..6c928677ce12 100644
--- a/arch/arm64/kvm/dirty_bit.c
+++ b/arch/arm64/kvm/dirty_bit.c
@@ -153,20 +153,21 @@ int __kvm_arch_dirty_log_clear(struct kvm *kvm,
 			       bool *flush)
 {
 	int ret = 0;
 	int idx = 0;
 	unsigned long *dirty_bitmap = memslot->dirty_bitmap;
 	u64 *hw_entries;
 	const int entries_sz = PAGE_SIZE / sizeof(*hw_entries);
 	u64 ttwl;
 	u64 start, end;
 	gfn_t base_gfn;
+	gpa_t split_end = 0;
 
 	hw_entries = kmalloc_objs(u64, entries_sz, GFP_KERNEL);
 	if (!hw_entries)
 		return -ENOMEM;
 
 	ttwl = HDBSS_ENTRY_TTWL(KVM_PGTABLE_LAST_LEVEL);
 
 	if (log) {
 		start = log->first_page / BITS_PER_LONG;
 		end = start + DIV_ROUND_UP(log->num_pages, BITS_PER_LONG);
@@ -197,24 +198,28 @@ int __kvm_arch_dirty_log_clear(struct kvm *kvm,
 
 			mask = xchg(&dirty_bitmap[i], 0);
 			bitmap[i] = mask;
 		}
 
 		if (!mask)
 			continue;
 
 		offset = base_gfn + i * BITS_PER_LONG;
 
-		if (kvm_dirty_log_manual_protect_and_init_set(kvm))
-			kvm_mmu_split_huge_pages(kvm,
-						 gfn_to_gpa(offset + __ffs(mask)),
-						 gfn_to_gpa(offset + __fls(mask) + 1));
+		if (kvm_dirty_log_manual_protect_and_init_set(kvm) &&
+		    (offset + BITS_PER_LONG > split_end)) {
+			gpa_t start = gfn_to_gpa(offset + __ffs(mask));
+			gpa_t end = gfn_to_gpa(offset + __fls(mask) + 1);
+
+			kvm_mmu_split_huge_pages(kvm, start, end);
+			split_end = gpa_to_gfn(ALIGN_DOWN(end, PMD_SIZE) + PMD_SIZE - 1);
+		}
 
 		do {
 			idx = mask_to_hdbss(&mask, hw_entries, offset, ttwl, idx, entries_sz);
 			if (idx >= entries_sz) {
 				ret = dirty_bit_clear(kvm, hw_entries, idx);
 				*flush = *flush || ret > 0;
 				if (ret != idx) {
 					/* Save bits not converted back to bitmap */
 					atomic_long_or(mask, (atomic_long_t *)&dirty_bitmap[i]);
 					goto out_err;
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 10/13] kvm/dirty_ring: Introduce get_memslot and move helpers to header
  2026-06-29 11:17 [PATCH v2 00/13] KVM Dirty-bit cleaning hw accelerator (HACDBS) Leonardo Bras
                   ` (8 preceding siblings ...)
  2026-06-29 11:17 ` [PATCH v2 09/13] KVM: arm64: Dirty-bitmap: avoid splitting previously split blocks Leonardo Bras
@ 2026-06-29 11:17 ` Leonardo Bras
  2026-06-29 11:17 ` [PATCH v2 11/13] kvm/dirty_ring: Add arch-generic interface for hw-accelerated dirty-ring cleaning Leonardo Bras
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 47+ messages in thread
From: Leonardo Bras @ 2026-06-29 11:17 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Marc Zyngier, Oliver Upton,
	Joey Gouly, Steffen Eiden, Suzuki K Poulose, Zenghui Yu,
	Rafael J. Wysocki, Len Brown, Saket Dumbre, Paolo Bonzini,
	Jonathan Cameron, Chengwen Feng, Leonardo Bras, Kees Cook,
	Mikołaj Lenczewski, James Morse, Zeng Heng, mrigendrachaubey,
	Thomas Huth, Ryan Roberts, Yeoreum Yun, Mark Brown, Kevin Brodsky,
	James Clark, Fuad Tabba, Raghavendra Rao Ananta,
	Lorenzo Pieralisi, Sascha Bischoff, Anshuman Khandual, Tian Zheng
  Cc: linux-arm-kernel, linux-kernel, kvmarm, linux-acpi, acpica-devel,
	kvm

Dirty-ring entry struct carries a slot number which is used to return
a struct memslot*. That struct carry important information such as
memory offset of that memory slot, which is used to calculate the guest
physical address of the page which needs to be marked clean.

In order to fetch that memslot information without duplicating code, split
that part of kvm_reset_dirty_gfn() into a new helper
kvm_dirty_ring_get_memslot().

Along with that, make it available on kvm_dirty_ring.h other helpers such
as kvm_dirty_gfn_harvested() and kvm_dirty_gfn_set_invalid() which will be
useful on implementing arch specific dirty-ring cleaning accelerators.

Signed-off-by: Leonardo Bras <leo.bras@arm.com>
---
 include/linux/kvm_dirty_ring.h | 12 ++++++++++++
 virt/kvm/dirty_ring.c          | 30 ++++++++++++++++--------------
 2 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
index eb10d87adf7d..190d97fce4a4 100644
--- a/include/linux/kvm_dirty_ring.h
+++ b/include/linux/kvm_dirty_ring.h
@@ -77,18 +77,30 @@ bool kvm_use_dirty_bitmap(struct kvm *kvm);
 bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm);
 u32 kvm_dirty_ring_get_rsvd_entries(struct kvm *kvm);
 int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring,
 			 int index, u32 size);
 int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
 			 int *nr_entries_reset);
 void kvm_dirty_ring_push(struct kvm_vcpu *vcpu, u32 slot, u64 offset);
 
 bool kvm_dirty_ring_check_request(struct kvm_vcpu *vcpu);
 
+static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
+{
+	return smp_load_acquire(&gfn->flags) & KVM_DIRTY_GFN_F_RESET;
+}
+
+static inline void kvm_dirty_gfn_set_invalid(struct kvm_dirty_gfn *gfn)
+{
+	smp_store_release(&gfn->flags, 0);
+}
+
+struct kvm_memory_slot *kvm_dirty_ring_get_memslot(struct kvm *kvm, u32 slot);
+
 /* for use in vm_operations_struct */
 struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset);
 
 void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
 
 #endif /* CONFIG_HAVE_KVM_DIRTY_RING */
 
 #endif	/* KVM_DIRTY_RING_H */
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index 572b854edf74..42de1a511037 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -43,32 +43,44 @@ static u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
 static bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring)
 {
 	return kvm_dirty_ring_used(ring) >= ring->soft_limit;
 }
 
 static bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
 {
 	return kvm_dirty_ring_used(ring) >= ring->size;
 }
 
-static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
+static inline struct kvm_memory_slot *
+__kvm_dirty_ring_get_memslot(struct kvm *kvm, u32 slot)
 {
-	struct kvm_memory_slot *memslot;
 	int as_id, id;
 
 	as_id = slot >> 16;
 	id = (u16)slot;
 
 	if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
-		return;
+		return 0;
 
-	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
+	return id_to_memslot(__kvm_memslots(kvm, as_id), id);
+}
+
+struct kvm_memory_slot *kvm_dirty_ring_get_memslot(struct kvm *kvm, u32 slot)
+{
+	return __kvm_dirty_ring_get_memslot(kvm, slot);
+}
+
+static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
+{
+	struct kvm_memory_slot *memslot;
+
+	memslot = __kvm_dirty_ring_get_memslot(kvm, slot);
 
 	if (!memslot || offset >= memslot->npages ||
 	    offset + __fls(mask) >= memslot->npages)
 		return;
 
 	KVM_MMU_LOCK(kvm);
 	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
 	KVM_MMU_UNLOCK(kvm);
 }
 
@@ -81,35 +93,25 @@ int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring,
 
 	ring->size = size / sizeof(struct kvm_dirty_gfn);
 	ring->soft_limit = ring->size - kvm_dirty_ring_get_rsvd_entries(kvm);
 	ring->dirty_index = 0;
 	ring->reset_index = 0;
 	ring->index = index;
 
 	return 0;
 }
 
-static inline void kvm_dirty_gfn_set_invalid(struct kvm_dirty_gfn *gfn)
-{
-	smp_store_release(&gfn->flags, 0);
-}
-
 static inline void kvm_dirty_gfn_set_dirtied(struct kvm_dirty_gfn *gfn)
 {
 	gfn->flags = KVM_DIRTY_GFN_F_DIRTY;
 }
 
-static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
-{
-	return smp_load_acquire(&gfn->flags) & KVM_DIRTY_GFN_F_RESET;
-}
-
 int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
 			 int *nr_entries_reset)
 {
 	/*
 	 * To minimize mmu_lock contention, batch resets for harvested entries
 	 * whose gfns are in the same slot, and are within N frame numbers of
 	 * each other, where N is the number of bits in an unsigned long.  For
 	 * simplicity, process the current set of entries when the next entry
 	 * can't be included in the batch.
 	 *
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 11/13] kvm/dirty_ring: Add arch-generic interface for hw-accelerated dirty-ring cleaning
  2026-06-29 11:17 [PATCH v2 00/13] KVM Dirty-bit cleaning hw accelerator (HACDBS) Leonardo Bras
                   ` (9 preceding siblings ...)
  2026-06-29 11:17 ` [PATCH v2 10/13] kvm/dirty_ring: Introduce get_memslot and move helpers to header Leonardo Bras
@ 2026-06-29 11:17 ` Leonardo Bras
  2026-06-29 11:49   ` sashiko-bot
  2026-06-29 11:18 ` [PATCH v2 12/13] KVM: arm64: Add hardware-accelerated dirty-ring cleaning routine Leonardo Bras
  2026-06-29 11:18 ` [PATCH v2 13/13] KVM: arm64: Enable KVM_HW_DIRTY_BIT Leonardo Bras
  12 siblings, 1 reply; 47+ messages in thread
From: Leonardo Bras @ 2026-06-29 11:17 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Marc Zyngier, Oliver Upton,
	Joey Gouly, Steffen Eiden, Suzuki K Poulose, Zenghui Yu,
	Rafael J. Wysocki, Len Brown, Saket Dumbre, Paolo Bonzini,
	Jonathan Cameron, Chengwen Feng, Leonardo Bras, Kees Cook,
	Mikołaj Lenczewski, James Morse, Zeng Heng, mrigendrachaubey,
	Thomas Huth, Ryan Roberts, Yeoreum Yun, Mark Brown, Kevin Brodsky,
	James Clark, Fuad Tabba, Raghavendra Rao Ananta,
	Lorenzo Pieralisi, Sascha Bischoff, Anshuman Khandual, Tian Zheng
  Cc: linux-arm-kernel, linux-kernel, kvmarm, linux-acpi, acpica-devel,
	kvm

Introduce kvm_arch_dirty_ring_clear() that allow implementation of
arch-specific hardware-accelerated dirty-ring routines.

A call to that is added on kvm_dirty_ring_reset() and will fall back to
software version if not implemented, or any error was detected in the
arch-specific routine.

For an arch to implement this function, it's required to provide it in a
asm/kvm_dirty_bit.h and have CONFIG_HAVE_KVM_HW_DIRTY_BIT=y on building.

If the arch does not implement it, and thus lack above config, the
introduced snippet is expected to be compiled-out and have zero impact at
runtime.

Signed-off-by: Leonardo Bras <leo.bras@arm.com>
---
 include/linux/kvm_dirty_bit.h | 7 +++++++
 virt/kvm/dirty_ring.c         | 4 ++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/kvm_dirty_bit.h b/include/linux/kvm_dirty_bit.h
index fa4f6b67b623..8492979d694e 100644
--- a/include/linux/kvm_dirty_bit.h
+++ b/include/linux/kvm_dirty_bit.h
@@ -11,17 +11,24 @@
 
 static inline int kvm_arch_dirty_log_clear(struct kvm *kvm,
 					   struct kvm_memory_slot *memslot,
 					   struct kvm_clear_dirty_log *log,
 					   unsigned long *bitmap,
 					   bool *flush)
 {
 	return -ENXIO;
 }
 
+static inline int kvm_arch_dirty_ring_clear(struct kvm *kvm,
+					    struct kvm_dirty_ring *ring,
+					    int *nr_entries_reset)
+{
+	return -ENXIO;
+}
+
 #else /* CONFIG_HAVE_KVM_HW_DIRTY_BIT */
 
 #include <asm/kvm_dirty_bit.h>
 
 #endif /* CONFIG_HAVE_KVM_HW_DIRTY_BIT */
 
 #endif /* __KVM_DIRTY_BIT_H__ */
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index 42de1a511037..fe4e7da6cc4a 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -1,20 +1,21 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /*
  * KVM dirty ring implementation
  *
  * Copyright 2019 Red Hat, Inc.
  */
 #include <linux/kvm_host.h>
 #include <linux/kvm.h>
 #include <linux/vmalloc.h>
 #include <linux/kvm_dirty_ring.h>
+#include <linux/kvm_dirty_bit.h>
 #include <trace/events/kvm.h>
 #include "kvm_mm.h"
 
 int __weak kvm_cpu_dirty_log_size(struct kvm *kvm)
 {
 	return 0;
 }
 
 u32 kvm_dirty_ring_get_rsvd_entries(struct kvm *kvm)
 {
@@ -126,20 +127,23 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
 	struct kvm_dirty_gfn *entry;
 
 	/*
 	 * Ensure concurrent calls to KVM_RESET_DIRTY_RINGS are serialized,
 	 * e.g. so that KVM fully resets all entries processed by a given call
 	 * before returning to userspace.  Holding slots_lock also protects
 	 * the various memslot accesses.
 	 */
 	lockdep_assert_held(&kvm->slots_lock);
 
+	if (kvm_arch_dirty_ring_clear(kvm, ring, nr_entries_reset) >= 0)
+		return 0;
+
 	while (likely((*nr_entries_reset) < INT_MAX)) {
 		if (signal_pending(current))
 			return -EINTR;
 
 		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
 
 		if (!kvm_dirty_gfn_harvested(entry))
 			break;
 
 		next_slot = READ_ONCE(entry->slot);
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 12/13] KVM: arm64: Add hardware-accelerated dirty-ring cleaning routine
  2026-06-29 11:17 [PATCH v2 00/13] KVM Dirty-bit cleaning hw accelerator (HACDBS) Leonardo Bras
                   ` (10 preceding siblings ...)
  2026-06-29 11:17 ` [PATCH v2 11/13] kvm/dirty_ring: Add arch-generic interface for hw-accelerated dirty-ring cleaning Leonardo Bras
@ 2026-06-29 11:18 ` Leonardo Bras
  2026-06-29 11:49   ` sashiko-bot
  2026-06-29 11:18 ` [PATCH v2 13/13] KVM: arm64: Enable KVM_HW_DIRTY_BIT Leonardo Bras
  12 siblings, 1 reply; 47+ messages in thread
From: Leonardo Bras @ 2026-06-29 11:18 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Marc Zyngier, Oliver Upton,
	Joey Gouly, Steffen Eiden, Suzuki K Poulose, Zenghui Yu,
	Rafael J. Wysocki, Len Brown, Saket Dumbre, Paolo Bonzini,
	Jonathan Cameron, Chengwen Feng, Leonardo Bras, Kees Cook,
	Mikołaj Lenczewski, James Morse, Zeng Heng, mrigendrachaubey,
	Thomas Huth, Ryan Roberts, Yeoreum Yun, Mark Brown, Kevin Brodsky,
	James Clark, Fuad Tabba, Raghavendra Rao Ananta,
	Lorenzo Pieralisi, Sascha Bischoff, Anshuman Khandual, Tian Zheng
  Cc: linux-arm-kernel, linux-kernel, kvmarm, linux-acpi, acpica-devel,
	kvm

Implement arm64 version of kvm_arch_dirty_ring_clear() making use of
FEAT_HACDBS.

It works by transversing the dirty-ring and converting its entries into
HDBSS entries based on the slot offset.

The resulting HDBSS array is then fed to the HACDBS mechanism that walks
the pagetable marking writable-dirty pages as writable-clean.

Only successfully cleaned entries are set as invalid on the dirty-ring, so
in case of error, falling back to generic software cleaning will take care
of any remaining entry in the dirty-ring.

Signed-off-by: Leonardo Bras <leo.bras@arm.com>
---
 arch/arm64/include/asm/kvm_dirty_bit.h | 13 +++++
 arch/arm64/kvm/dirty_bit.c             | 66 ++++++++++++++++++++++++++
 2 files changed, 79 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_dirty_bit.h b/arch/arm64/include/asm/kvm_dirty_bit.h
index 3d749f979c67..d76c109937d8 100644
--- a/arch/arm64/include/asm/kvm_dirty_bit.h
+++ b/arch/arm64/include/asm/kvm_dirty_bit.h
@@ -26,29 +26,42 @@ DECLARE_PER_CPU(struct hacdbs, hacdbs_pcp);
 void __init kvm_hacdbs_init(void);
 void kvm_hacdbs_cpu_up(void);
 void kvm_hacdbs_cpu_down(void);
 
 int __kvm_arch_dirty_log_clear(struct kvm *kvm,
 			       struct kvm_memory_slot *memslot,
 			       struct kvm_clear_dirty_log *log,
 			       unsigned long *bitmap,
 			       bool *flush);
 
+int __kvm_arch_dirty_ring_clear(struct kvm *kvm, struct kvm_dirty_ring *ring,
+				int *nr_entries_reset);
+
 static inline bool kvm_arch_dirty_clear_enabled(struct kvm *kvm)
 {
 	return this_cpu_read(hacdbs_pcp.status) == HACDBS_IDLE &&
 	       (kvm->arch.mmu.pgt->flags & KVM_PGTABLE_S2_DBM);
 }
 
 static inline int kvm_arch_dirty_log_clear(struct kvm *kvm,
 					   struct kvm_memory_slot *memslot,
 					   struct kvm_clear_dirty_log *log,
 					   unsigned long *bitmap,
 					   bool *flush)
 {
 	if (!kvm_arch_dirty_clear_enabled(kvm))
 		return -EPERM;
 
 	return __kvm_arch_dirty_log_clear(kvm, memslot, log, bitmap, flush);
 }
 
+static inline int kvm_arch_dirty_ring_clear(struct kvm *kvm,
+					    struct kvm_dirty_ring *ring,
+					    int *nr_entries_reset)
+{
+	if (!kvm_arch_dirty_clear_enabled(kvm))
+		return -EPERM;
+
+	return __kvm_arch_dirty_ring_clear(kvm, ring, nr_entries_reset);
+}
+
 #endif /* __ARM64_KVM_DIRTY_BIT_H__ */
diff --git a/arch/arm64/kvm/dirty_bit.c b/arch/arm64/kvm/dirty_bit.c
index 6c928677ce12..19289ea73d96 100644
--- a/arch/arm64/kvm/dirty_bit.c
+++ b/arch/arm64/kvm/dirty_bit.c
@@ -249,20 +249,86 @@ int __kvm_arch_dirty_log_clear(struct kvm *kvm,
 
 		ret = -EAGAIN;
 	}
 
 	write_unlock(&kvm->mmu_lock);
 	kfree(hw_entries);
 
 	return ret;
 }
 
+int __kvm_arch_dirty_ring_clear(struct kvm *kvm, struct kvm_dirty_ring *ring,
+				int *nr_entries_reset)
+{
+	u64 *hw_entries;
+	u64 slot_offset = 0;
+	u64 ttwl;
+	int i, ret;
+	u32 slot = -1;
+
+	if (signal_pending(current))
+		return -EINTR;
+
+	ttwl = HDBSS_ENTRY_TTWL(KVM_PGTABLE_LAST_LEVEL);
+
+	hw_entries = kmalloc(max(ring->size * sizeof(u64), PAGE_SIZE), GFP_KERNEL);
+	if (!hw_entries)
+		return -ENOMEM;
+
+	for (i = 0; i < ring->size; i++) {
+		struct kvm_dirty_gfn *entry;
+		gfn_t gfn;
+
+		entry = &ring->dirty_gfns[(ring->reset_index + i) &
+					  (ring->size - 1)];
+
+		if (!kvm_dirty_gfn_harvested(entry))
+			break;
+
+		if (entry->slot != slot) {
+			struct kvm_memory_slot *memslot;
+
+			memslot = kvm_dirty_ring_get_memslot(kvm, entry->slot);
+			slot = entry->slot;
+			slot_offset = memslot->base_gfn;
+		}
+
+		gfn = slot_offset + entry->offset;
+
+		hw_entries[i] = (gfn_to_gpa(gfn) & HDBSS_ENTRY_IPA) |
+				ttwl | HDBSS_ENTRY_VALID;
+	}
+
+	ret = dirty_bit_clear(kvm, hw_entries, i);
+
+	/* Set as invalid all successfully cleaned entries */
+	for (int j = 0; j < ret; j++) {
+		struct kvm_dirty_gfn *entry;
+
+		entry = &ring->dirty_gfns[(ring->reset_index + j) &
+					  (ring->size - 1)];
+
+		kvm_dirty_gfn_set_invalid(entry);
+	}
+
+	/* In case of error, try software cleaning from the faulting entry */
+	ring->reset_index += ret;
+	*nr_entries_reset += ret;
+
+	kfree(hw_entries);
+
+	if (ret < i)
+		return -EFAULT;
+
+	return ret;
+}
+
 static irqreturn_t hacdbsirq_handler(int irq, void *pcpu)
 {
 	u64 cons = read_sysreg_s(SYS_HACDBSCONS_EL2);
 	unsigned long err = FIELD_GET(HACDBSCONS_EL2_ERR_REASON, cons);
 
 	switch (err) {
 	case HACDBSCONS_EL2_ERR_REASON_NOF:
 		this_cpu_write(hacdbs_pcp.status, HACDBS_IDLE);
 		break;
 	case HACDBSCONS_EL2_ERR_REASON_IPAHACF:
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 13/13] KVM: arm64: Enable KVM_HW_DIRTY_BIT
  2026-06-29 11:17 [PATCH v2 00/13] KVM Dirty-bit cleaning hw accelerator (HACDBS) Leonardo Bras
                   ` (11 preceding siblings ...)
  2026-06-29 11:18 ` [PATCH v2 12/13] KVM: arm64: Add hardware-accelerated dirty-ring cleaning routine Leonardo Bras
@ 2026-06-29 11:18 ` Leonardo Bras
  2026-06-29 11:52   ` sashiko-bot
  12 siblings, 1 reply; 47+ messages in thread
From: Leonardo Bras @ 2026-06-29 11:18 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Marc Zyngier, Oliver Upton,
	Joey Gouly, Steffen Eiden, Suzuki K Poulose, Zenghui Yu,
	Rafael J. Wysocki, Len Brown, Saket Dumbre, Paolo Bonzini,
	Jonathan Cameron, Chengwen Feng, Leonardo Bras, Kees Cook,
	Mikołaj Lenczewski, James Morse, Zeng Heng, mrigendrachaubey,
	Thomas Huth, Ryan Roberts, Yeoreum Yun, Mark Brown, Kevin Brodsky,
	James Clark, Fuad Tabba, Raghavendra Rao Ananta,
	Lorenzo Pieralisi, Sascha Bischoff, Anshuman Khandual, Tian Zheng
  Cc: linux-arm-kernel, linux-kernel, kvmarm, linux-acpi, acpica-devel,
	kvm

Set the corresponding bit to enable hardware accelerated dirty-bitmap and
dirty-ring cleaning for arm64. Actually using acceleration depends on the
cpus enabling FEAT_HACDBS as well as the pre-requisite features for it,
such as FEAT_HDBSS and FEAT_HAFDBS.

Signed-off-by: Leonardo Bras <leo.bras@arm.com>
---
 arch/arm64/kvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 449154f9a485..db8487bf738b 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -27,20 +27,21 @@ menuconfig KVM
 	select VIRT_XFER_TO_GUEST_WORK
 	select KVM_VFIO
 	select HAVE_KVM_DIRTY_RING_ACQ_REL
 	select NEED_KVM_DIRTY_RING_WITH_BITMAP
 	select HAVE_KVM_MSI
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_IRQ_ROUTING
 	select HAVE_KVM_IRQ_BYPASS
 	select HAVE_KVM_READONLY_MEM
 	select HAVE_KVM_VCPU_RUN_PID_CHANGE
+	select HAVE_KVM_HW_DIRTY_BIT if ACPI
 	select SCHED_INFO
 	select GUEST_PERF_EVENTS if PERF_EVENTS
 	select KVM_GUEST_MEMFD
 	help
 	  Support hosting virtualized guest machines.
 
 	  If unsure, say N.
 
 if KVM
 
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 06/13] KVM: arm64: dirty_bit: Add base FEAT_HACDBS cleaning routine
  2026-06-29 11:17 ` [PATCH v2 06/13] KVM: arm64: dirty_bit: Add base FEAT_HACDBS cleaning routine Leonardo Bras
@ 2026-06-29 11:29   ` sashiko-bot
  2026-06-29 15:54     ` Leonardo Bras
  2026-06-29 17:36   ` Oliver Upton
  1 sibling, 1 reply; 47+ messages in thread
From: sashiko-bot @ 2026-06-29 11:29 UTC (permalink / raw)
  To: Leonardo Bras; +Cc: kvmarm, kvm, Marc Zyngier, Oliver Upton

Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
- [High] Missing Context Synchronization Event (ISB) after writes to control-plane system registers.
- [High] TOCTOU race in wfi() polling loop causes a lost wakeup and indefinite CPU stall.
- [High] Missing memory barrier (dsb) before starting the HACDBS hardware engine.
- [High] Missing speculative execution barrier (dsb) after disabling an asynchronous hardware engine.
--

commit 47f5964570817fd652e798e082c349d012b3edda
Author: Leonardo Bras <leo.bras@arm.com>

KVM: arm64: dirty_bit: Add base FEAT_HACDBS cleaning routine

Introduce the basic cleaning routine that is going to be used for both
dirty-bitmap and dirty-ring routines. It sets the required registers with
the input buffer, and wait for HACDBSIRQ to happen, which means either
the task is done, or there was some error during processing.

> diff --git a/arch/arm64/kvm/dirty_bit.c b/arch/arm64/kvm/dirty_bit.c
> index 789da8712b1b4..e4283828b7807 100644
> --- a/arch/arm64/kvm/dirty_bit.c
> +++ b/arch/arm64/kvm/dirty_bit.c

[ ... ]

> @@ -24,6 +25,86 @@ DEFINE_PER_CPU(struct hacdbs, hacdbs_pcp) = {
>  
>  static __ro_after_init int hacdbsirq = -1;
>  
> +static void hacdbs_start(u64 *hw_entries, int size)
> +{
> +	u64 br;
> +	/* Each entry is 8 bytes */
> +	int size_b = size * sizeof(hw_entries[0]);
> +	int size_p2 = max(roundup_pow_of_two(size_b), PAGE_SIZE);
> +
> +	/* If not using the full size of the array, put a stop entry at the end */
> +	if (size_b < size_p2)
> +		hw_entries[size] = HDBSS_ENTRY_VALID | HDBSS_ENTRY_TTWL_RESV;
> +
> +	sysreg_clear_set_s(SYS_HACDBSCONS_EL2,
> +			   HACDBSCONS_EL2_ERR_REASON | HACDBSCONS_EL2_INDEX, 0);
> +

[Severity: High]
Does this code need an immediate isb() after writing the control-plane system
register SYS_HACDBSCONS_EL2? The ARM64 guidelines indicate that an isb() must
be the very next instruction after writing a control-plane system register to
ensure the pipeline doesn't operate with architecturally undefined state.

[Severity: High]
Additionally, should there be a dsb() between the memory store to
hw_entries[size] and the write to SYS_HACDBSBR_EL2 below? Without an explicit
memory barrier, the store buffer might not be globally visible before the
engine begins reading memory, which could cause it to miss the stop marker
and access out-of-bounds memory.

> +	br = (virt_to_phys(hw_entries) & HACDBSBR_EL2_BADDR_MASK) |
> +	     FIELD_PREP(HACDBSBR_EL2_SZ, ilog2(size_p2) - 12) |
> +	     FIELD_PREP(HACDBSBR_EL2_EN, 1);
> +
> +	this_cpu_write(hacdbs_pcp.status, HACDBS_RUNNING);
> +	this_cpu_write(hacdbs_pcp.size, size);
> +	write_sysreg_s(br, SYS_HACDBSBR_EL2);
> +	isb();
> +}

[ ... ]

> +/*
> + * Clears dirty-bits for an array of pages (hw_entries) using HACDBS
> + * Returns the number of items cleaned from the array. If returns value < size,
> + *	there was an error in the processing.
> + */
> +static int dirty_bit_clear(struct kvm *kvm, u64 *hw_entries, int size)
> +{
> +	u64 hcr_el2;
> +	int ret;
> +
> +	preempt_disable();
> +
> +	hcr_el2 = read_sysreg(HCR_EL2);
> +	write_sysreg(hcr_el2 | HCR_EL2_VM, HCR_EL2);
> +	__load_stage2(&kvm->arch.mmu);

[Severity: High]
Is an isb() required immediately after writing HCR_EL2 before proceeding to
__load_stage2()? HCR_EL2 is a control-plane register, and delaying the barrier
can cause the pipeline to operate with architecturally undefined state.

> +
> +	hacdbs_start(hw_entries, size);
> +
> +	do {
> +		wfi();
> +	} while (this_cpu_read(hacdbs_pcp.status) == HACDBS_RUNNING);

[Severity: High]
Could this polling loop cause a lost wakeup and indefinite CPU stall? The loop
unconditionally sleeps on the first iteration and checks the condition
non-atomically with respect to interrupts. 

If the HACDBSIRQ interrupt is delivered and serviced just before the wfi()
instruction executes, the wfi() will wait for a new interrupt that will never
arrive. Should the check be made atomic with the sleep, for example by masking
IRQs before the check?

> +
> +	ret = hacdbs_stop();
> +
> +	write_sysreg(hcr_el2, HCR_EL2);
> +	isb();

[Severity: High]
Should there be a speculative execution barrier (dsb) after stopping the engine
in hacdbs_stop() and before changing the translation context here? 

The ARM64 guidelines mandate a speculative execution barrier (dsb + isb) after
disabling an asynchronous engine to definitively halt out-of-context execution
before updating the translation regime.

> +
> +	/*
> +	 * No DSB is needed here, as kvm_flush_remote_tlbs_memslot() that happens
> +	 * later in generic dirty-cleaning code already performs a DSB before
> +	 * doing the TLBI.
> +	 */
> +
> +	preempt_enable();
> +
> +	return ret;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260629111820.1873540-1-leo.bras@arm.com?part=6

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 05/13] KVM: arm64: Detect (via ACPI) and initialize HACDBSIRQ
  2026-06-29 11:17 ` [PATCH v2 05/13] KVM: arm64: Detect (via ACPI) and initialize HACDBSIRQ Leonardo Bras
@ 2026-06-29 11:32   ` sashiko-bot
  2026-06-29 15:43     ` Leonardo Bras
  2026-06-29 17:22   ` Oliver Upton
  1 sibling, 1 reply; 47+ messages in thread
From: sashiko-bot @ 2026-06-29 11:32 UTC (permalink / raw)
  To: Leonardo Bras; +Cc: kvmarm, Oliver Upton, Marc Zyngier, kvm

Thank you for your contribution! Sashiko AI review found 12 potential issue(s) to consider:
- [Critical] The `wfi()` loop in `dirty_bit_clear()` creates a check-then-sleep race condition that can stall the CPU indefinitely.
- [Critical] The level-triggered `HACDBSIRQ` is not acknowledged or cleared in the interrupt handler, resulting in an infinite interrupt storm.
- [High] Missing `isb()` context synchronization barrier after writing to control-plane system registers `HCR_EL2` and `SYS_HACDBSCONS_EL2`.
- [High] Missing memory barrier (`dmb ish`) before starting the HACDBS hardware engine.
- [High] The ACPI MADT parser registers `hacdbsirq_gsi` without verifying it is non-zero, risking a hijack of SGI 0.
- [High] Missing `default:` case in the error handling switch statement leaves the tracking status unchanged on unexpected hardware errors, causing a vCPU hang.
- [High] If HACDBS IRQ initialization fails, the feature remains enabled in KVM, causing a permanent host hang upon first use of dirty tracking.
- [Medium] ACPI GSI mapping is leaked if `request_percpu_irq()` fails.
- [Medium] Missing teardown of HACDBS resources if KVM subsystem initialization fails.
- [Medium] The per-cpu HACDBS status is initialized after enabling the per-cpu interrupt, causing a TOCTOU race.
- [Medium] Calling `smp_processor_id()` in a preemptible context triggers a kernel BUG/warning.
- [Low] A variable declaration immediately follows a case label, violating C syntax rules and causing a build failure.
--

commit a0d2df687f2678f7a10866e285efd8319bf84cea
Author: Leonardo Bras <leo.bras@arm.com>

KVM: arm64: Detect (via ACPI) and initialize HACDBSIRQ

Find via ACPI the Id for HACDBSIRQ, initialize it as a per-cpu IRQ and make
sure any cpu able to run virtualization has it active. Introduce a per-cpu
structure used by the HACDBSIRQ handler to keep track of entries size and
the status of HACDBS.

Link: https://github.com/tianocore/edk2/issues/12409

> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 50adfff75be82..dc1a4629aaeb9 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
[ ... ]
> @@ -2481,6 +2484,8 @@ static int __init init_subsystems(void)
>  	if (err)
>  		goto out;
>  
> +	kvm_hacdbs_init();
> +
>  	kvm_register_perf_callbacks();
>  
>  	err = kvm_hyp_trace_init();

[Severity: Medium]
If a subsequent initialization step like kvm_hyp_trace_init() fails, the
code branches to the out: error path and aborts KVM initialization. Does
this path need to call a teardown function to free the per-CPU IRQ and
release the GSI allocated by kvm_hacdbs_init()?

> diff --git a/arch/arm64/kvm/dirty_bit.c b/arch/arm64/kvm/dirty_bit.c
> index 32fe938d6bf70..789da8712b1b4 100644
> --- a/arch/arm64/kvm/dirty_bit.c
> +++ b/arch/arm64/kvm/dirty_bit.c
[ ... ]
> @@ -14,3 +21,93 @@
>  	(((x) << HDBSS_ENTRY_TTWL_SHIFT) & HDBSS_ENTRY_TTWL_MASK)
>  #define HDBSS_ENTRY_TTWL_RESV HDBSS_ENTRY_TTWL(-4)
>  #define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
> +
> +static __ro_after_init int hacdbsirq = -1;
> +
> +static irqreturn_t hacdbsirq_handler(int irq, void *pcpu)
> +{
> +	u64 cons = read_sysreg_s(SYS_HACDBSCONS_EL2);
> +	unsigned long err = FIELD_GET(HACDBSCONS_EL2_ERR_REASON, cons);
> +
> +	switch (err) {
> +	case HACDBSCONS_EL2_ERR_REASON_NOF:
> +		this_cpu_write(hacdbs_pcp.status, HACDBS_IDLE);
> +		break;
> +	case HACDBSCONS_EL2_ERR_REASON_IPAHACF:
> +		/* When size not a power of two >= 4k, exit with reserved TTLW */
> +		int index = FIELD_GET(HACDBSCONS_EL2_INDEX, cons);

[Severity: Low]
Does this variable declaration cause a build failure? Under standard C rules,
a label can only be attached to a statement, and a declaration is not a
statement.

> +
> +		if (index >= this_cpu_read(hacdbs_pcp.size)) {
> +			this_cpu_write(hacdbs_pcp.status, HACDBS_IDLE);
> +			break;
> +		}
> +		fallthrough;
> +	case HACDBSCONS_EL2_ERR_REASON_STRUCTF:
> +	case HACDBSCONS_EL2_ERR_REASON_IPAF:
> +		this_cpu_write(hacdbs_pcp.status, HACDBS_ERROR);
> +		break;

[Severity: High]
What happens if the hardware reports an undefined or reserved error reason
in HACDBSCONS_EL2_ERR_REASON?

Without a default: case, the tracking status wouldn't be updated, remaining
as HACDBS_RUNNING. Could this cause the synchronous wfi() wait loop in
dirty_bit_clear() to spin forever and lock up the host CPU?

> +	}
> +
> +	return IRQ_HANDLED;
> +}

[Severity: Critical]
Is it necessary to write back to a hardware register to acknowledge or clear
the active source of this interrupt before returning?

The ACPI MADT registers the HACDBS interrupt as ACPI_LEVEL_SENSITIVE.
Since a level-triggered interrupt remains asserted until explicitly cleared,
would returning IRQ_HANDLED without clearing the hardware signal trap the CPU
in an infinite loop of executing the ISR?

> +
> +void kvm_hacdbs_cpu_up(void)
> +{
> +	if (hacdbsirq < 0)
> +		return;
> +
> +	enable_percpu_irq(hacdbsirq, IRQ_TYPE_LEVEL_HIGH);
> +	this_cpu_write(hacdbs_pcp.status, HACDBS_IDLE);
> +}

[Severity: Medium]
Could this sequence cause a race condition if a pending interrupt fires
immediately upon being unmasked?

If hacdbsirq_handler() executes and updates the state, and then
kvm_hacdbs_cpu_up() resumes and unconditionally overwrites the state
with HACDBS_IDLE, the handler's update would be lost. Should the status
be initialized before enabling the interrupt?

> +
> +void kvm_hacdbs_cpu_down(void)
> +{
> +	if (hacdbsirq < 0)
> +		return;
> +
> +	disable_percpu_irq(hacdbsirq);
> +	this_cpu_write(hacdbs_pcp.status, HACDBS_OFF);
> +}
> +
> +#ifdef CONFIG_ACPI
> +static int __init hacdbs_acpi_get_irq(void)
> +{
> +	struct acpi_madt_generic_interrupt *gicc;
> +	u32 gsi;
> +	int irq;
> +
> +	gicc = acpi_cpu_get_madt_gicc(smp_processor_id());

[Severity: Medium]
Is it safe to call smp_processor_id() here?

Since kvm_hacdbs_init() is called from init_subsystems() during a
device_initcall, preemption is enabled. Would calling smp_processor_id()
directly without disabling preemption trigger a kernel bug/warning if
CONFIG_DEBUG_PREEMPT is enabled? Should raw_smp_processor_id() or
get_cpu()/put_cpu() be used instead?

> +	if (gicc->header.length < ACPI_MADT_GICC_HACDBSIRQ)
> +		return -ENXIO;
> +
> +	gsi =  gicc->hacdbsirq_gsi;
> +
> +	irq = acpi_register_gsi(NULL, gsi, ACPI_LEVEL_SENSITIVE, ACPI_ACTIVE_HIGH);

[Severity: High]
Does the GSI read from ACPI need to be validated before registration?

If the firmware does not support HACDBS, it might leave this field as 0.
On ARM64, GSI 0 corresponds to Software Generated Interrupt (SGI) 0, which is
used for Inter-Processor Interrupts (IPIs). Could registering a device
interrupt handler on GSI 0 corrupt host IPI handling and crash the system?

> +	if (irq < 0) {
> +		pr_warn("ACPI: Unable to register HACDBS interrupt: %d\n", gsi);
> +		return -ENXIO;
> +	}
> +
> +	return irq;
> +}

[Severity: Medium]
If the subsequent call to request_percpu_irq() in kvm_hacdbs_init() fails,
how is this ACPI GSI mapping cleaned up?

Since hacdbs_acpi_get_irq() only returns the mapped IRQ and discards the
original GSI, it seems impossible for the caller to properly unregister it
via acpi_unregister_gsi(), which would result in leaking the resource.

> +#else
> +#define hacdbs_acpi_get_irq() (-ENXIO)
> +#endif
> +
> +void __init kvm_hacdbs_init(void)
> +{
> +	int irq;
> +
> +	/* FEAT_HACDBS is only supported if Linux runs in EL2 (VHE) */
> +	if (!system_supports_hacdbs() || !is_kernel_in_hyp_mode())
> +		return;
> +
> +	irq = hacdbs_acpi_get_irq();
> +	if (irq < 0)
> +		return;
> +
> +	if (request_percpu_irq(irq, hacdbsirq_handler, "HACDBSIRQ", &hacdbs_pcp) < 0)
> +		return;

[Severity: High]
If request_percpu_irq() or hacdbs_acpi_get_irq() fails, the function returns
early. Is there a global flag that needs to be unset to tell KVM that HACDBS
hardware cleaning is unavailable?

If KVM later assumes hardware support is available and calls
dirty_bit_clear() unconditionally, it might set the software status to
HACDBS_RUNNING and wait in a wfi() loop. Since the IRQ is disabled or
unmapped, the handler would never fire, resulting in a host hang.

Additionally, I have a few questions regarding the implementation of
dirty_bit_clear() and hacdbs_start(), which appear to interact with the
features introduced in this patch:

[Severity: High]
In dirty_bit_clear(), is an isb() required after the write to HCR_EL2?

dirty_bit_clear() {
    ...
    write_sysreg(hcr_el2 | HCR_EL2_VM, HCR_EL2);
    __load_stage2(&kvm->arch.mmu);
    ...
}

Similarly, in hacdbs_start(), is an isb() required after writing to
SYS_HACDBSCONS_EL2?

hacdbs_start() {
    ...
    sysreg_clear_set_s(SYS_HACDBSCONS_EL2,
                       HACDBSCONS_EL2_ERR_REASON | HACDBSCONS_EL2_INDEX, 0);

    br = (virt_to_phys(hw_entries) & HACDBSBR_EL2_BADDR_MASK) |
    ...
}

The ARM64 subsystem guidelines indicate that writes to control-plane system
registers must be followed by an isb() as the very next instruction to
ensure subsequent instructions do not execute with an architecturally
undefined pipeline state.

[Severity: Critical]
Does the wfi() loop in dirty_bit_clear() contain a check-then-sleep race
condition?

dirty_bit_clear() {
    ...
    do {
        wfi();
    } while (this_cpu_read(hacdbs_pcp.status) == HACDBS_RUNNING);
    ...
}

Although preemption is disabled, local interrupts are not disabled during this
check. If the hardware interrupt fires after the condition is checked but
before the wfi() executes, the CPU could service the interrupt, update the
status to HACDBS_IDLE, return, and then execute wfi(). The CPU would then
sleep indefinitely, causing a host hard lockup.

[Severity: High]
In hacdbs_start(), is a memory barrier like `dmb ish` required before starting
the hardware engine?

hacdbs_start() {
    ...
    if (size_b < size_p2)
        hw_entries[size] = HDBSS_ENTRY_VALID | HDBSS_ENTRY_TTWL_RESV;
    ...
    write_sysreg_s(br, SYS_HACDBSBR_EL2);
    isb();
}

A sentinel stop entry is written to the memory buffer before writing to
SYS_HACDBSBR_EL2 to enable the engine. Since system register writes are not
inherently ordered with respect to memory stores, the hardware engine might
begin reading before the stop entry is visible in coherent memory.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260629111820.1873540-1-leo.bras@arm.com?part=5

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 01/13] KVM: arm64: HDBSS bits
  2026-06-29 11:17 ` [PATCH v2 01/13] KVM: arm64: HDBSS bits Leonardo Bras
@ 2026-06-29 11:34   ` sashiko-bot
  2026-06-29 12:57     ` Leonardo Bras
  0 siblings, 1 reply; 47+ messages in thread
From: sashiko-bot @ 2026-06-29 11:34 UTC (permalink / raw)
  To: Leonardo Bras; +Cc: kvmarm, Oliver Upton, Marc Zyngier, kvm

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [Critical] Mapping writeable pages with both DBM and S2AP_W set causes `kvm_pgtable_stage2_wrprotect` to silently bypass software write-protection.
- [High] `has_vhe_hdbss` incorrectly uses `is_kernel_in_hyp_mode()` which evaluates to true during early boot for pKVM (nVHE), erroneously enabling VHE-specific features.
- [Low] Typo in the parameter name for `has_vhe_hdbss`.
--

commit 49089779295fae526ba79b7c7e78d31fbe6a9636
Author: Leonardo Bras <leo.bras@arm.com>

KVM: arm64: HDBSS bits

The commit introduces HDBSS bits from a future patchset to fulfill
dependencies and allow building.

> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> index 9a22df0c5120f..aa327eebaf1ca 100644
> --- a/arch/arm64/kernel/cpufeature.c
> +++ b/arch/arm64/kernel/cpufeature.c

[ ... ]

> @@ -2131,6 +2131,11 @@ static bool hvhe_possible(const struct arm64_cpu_capabilities *entry,
>  	return arm64_test_sw_feature_override(ARM64_SW_FEATURE_OVERRIDE_HVHE);
>  }
>  
> +static bool has_vhe_hdbss(const struct arm64_cpu_capabilities *entry, int cope)
                                                                             ^^^^
[Severity: Low]
This isn't a bug, but there is a typo in the parameter name (cope instead
of scope).

> +{
> +	return is_kernel_in_hyp_mode() && has_cpuid_feature(entry, cope);

[Severity: High]
Does using is_kernel_in_hyp_mode() here erroneously enable VHE-specific
features for pKVM (nVHE)?

During early boot feature detection, is_kernel_in_hyp_mode() evaluates to
true because the host kernel boots at EL2 and only drops privileges to
EL1 later during device_initcall_sync (finalize_pkvm()).

This early check could force ARM64_HAS_HDBSS to be enabled on pKVM
systems, propagating the KVM_PGTABLE_S2_DBM flag to hypervisor-managed
stage-2 page tables.

Should this use is_vhe_hyp_mode() instead to avoid incorrectly matching
during the nVHE privileged window?

> +}

[ ... ]

> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 91a7dfad66866..e16729f0b7bd2 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c

[ ... ]

> @@ -731,9 +731,13 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p
>  	if (prot & KVM_PGTABLE_PROT_R)
>  		attr |= KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R;
>  
> -	if (prot & KVM_PGTABLE_PROT_W)
> +	if (prot & KVM_PGTABLE_PROT_W) {
>  		attr |= KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W;
>  
> +		if (pgt->flags & KVM_PGTABLE_S2_DBM)
> +			attr |= KVM_PTE_LEAF_ATTR_HI_S2_DBM;

[Severity: Critical]
Can setting KVM_PTE_LEAF_ATTR_HI_S2_DBM here lead to a silent bypass of
software write-protection?

When a page is mapped writable, both DBM and S2AP_W are set. If KVM later
write-protects the page via kvm_pgtable_stage2_wrprotect(), only S2AP_W
is cleared:

kvm_pgtable_stage2_wrprotect() {
    ...
    return stage2_update_leaf_attrs(pgt, addr, size, 0,
                                    KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W,
                                    NULL, NULL,
                                    KVM_PGTABLE_WALK_IGNORE_EAGAIN);
}

This leaves DBM intact. Hardware interprets DBM=1, S2AP_W=0 as a clean
writable page and automatically sets S2AP_W=1 on the next write without
generating a Permission Fault.

For pKVM, this allows a guest to bypass memory isolation. For nested
virtualization, nested guest writes will silently set S2AP_W=1 without
trapping to the host, resulting in lost dirty pages during live migration.

> +	}
> +
>  	if (!kvm_lpa2_is_enabled())
>  		attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S2_SH, sh);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260629111820.1873540-1-leo.bras@arm.com?part=1

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 02/13] KVM: arm64: Enable eager hugepage splitting if HDBSS is available
  2026-06-29 11:17 ` [PATCH v2 02/13] KVM: arm64: Enable eager hugepage splitting if HDBSS is available Leonardo Bras
@ 2026-06-29 11:36   ` sashiko-bot
  2026-06-29 14:47     ` Leonardo Bras
  0 siblings, 1 reply; 47+ messages in thread
From: sashiko-bot @ 2026-06-29 11:36 UTC (permalink / raw)
  To: Leonardo Bras; +Cc: Marc Zyngier, Oliver Upton, kvmarm, kvm

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

New issues:
- [High] The default eager page splitting chunk size violates the KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE UAPI contract by using a value that is not a valid block size for 16KB and 64KB page kernels.

Pre-existing issues:
- [Critical] Dropping mmu_lock between primary and nested stage-2 write-protection creates a window where nested guests can bypass dirty logging.
--

commit 09e1d250ed331b9dc20b42b7f9315892a86a5d50
Author: Leonardo Bras <leo.bras@arm.com>

KVM: arm64: Enable eager hugepage splitting if HDBSS is available

This commit enables eager hugepage splitting by default for systems
supporting HDBSS, setting the default chunk size to 256K pages at
stage2 initialization.

> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 6c941aaa10c63..e086c01a9325e 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1027,8 +1027,12 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>  	for_each_possible_cpu(cpu)
>  		*per_cpu_ptr(mmu->last_vcpu_ran, cpu) = -1;
>  
> -	 /* The eager page splitting is disabled by default */
> -	mmu->split_page_chunk_size = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT;
> +	 /* The eager page splitting is disabled by default if system has no HDBSS */
> +	if (system_supports_hdbss())
> +		mmu->split_page_chunk_size = 256 * 1024 * PAGE_SIZE;

[Severity: High]
Does this hardcoded chunk size violate the KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
UAPI contract on kernels with 16KB or 64KB pages?

By setting the default chunk size to 256 * 1024 * PAGE_SIZE, this evaluates to
4GB on 16KB page systems and 16GB on 64KB page systems.

However, in arch/arm64/kvm/arm.c:kvm_vm_ioctl_enable_cap(), the chunk size is
validated against valid architectural block sizes:

    if (!new_cap || kvm_is_block_size_supported(new_cap)) {
    ...

Since 4GB and 16GB are not valid block sizes on those configurations, a VMM
like QEMU attempting to read the default capability via KVM_CHECK_EXTENSION
and set it back via KVM_ENABLE_CAP will be rejected with -EINVAL.

[Severity: Critical]
This is a pre-existing issue, but by enabling eager hugepage splitting by
default for HDBSS, does this exacerbate a race condition with nested guests
bypassing dirty logging?

In arch/arm64/kvm/mmu.c:kvm_arch_mmu_enable_log_dirty_pt_masked(), KVM enables
dirty logging by write-protecting the primary stage-2 and then calling
kvm_mmu_split_huge_pages():

    if (kvm_dirty_log_manual_protect_and_init_set(kvm))
        kvm_mmu_split_huge_pages(kvm, start, end);

    kvm_nested_s2_wp(kvm);

Because kvm_mmu_split_huge_pages() can drop the mmu_lock to top-up the memory
cache or reschedule, a race window is created before kvm_nested_s2_wp()
protects the nested stage-2 tables.

While the lock is dropped, a nested VCPU can fault, map the page via its
separate shadow stage-2 (which hasn't been write-protected yet), and write to
the page without setting the dirty bitmap, leading to silent memory corruption
after live migration.

> +	else
> +		mmu->split_page_chunk_size = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT;
> +
>  	mmu->split_page_cache.gfp_zero = __GFP_ZERO;
>  
>  	mmu->pgd_phys = __pa(pgt->pgd);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260629111820.1873540-1-leo.bras@arm.com?part=2

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 07/13] kvm: Add arch-generic interface for hw-accelerated dirty-bitmap cleaning
  2026-06-29 11:17 ` [PATCH v2 07/13] kvm: Add arch-generic interface for hw-accelerated dirty-bitmap cleaning Leonardo Bras
@ 2026-06-29 11:38   ` sashiko-bot
  2026-06-29 16:07     ` Leonardo Bras
  0 siblings, 1 reply; 47+ messages in thread
From: sashiko-bot @ 2026-06-29 11:38 UTC (permalink / raw)
  To: Leonardo Bras; +Cc: Oliver Upton, Marc Zyngier, kvm, kvmarm

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Critical] Data loss (lost dirty pages) when kvm_arch_dirty_log_clear fails during KVM_GET_DIRTY_LOG.
- [High] Out-of-bounds read of userspace buffer in kvm_arch_dirty_log_clear during KVM_CLEAR_DIRTY_LOG.
--

commit ea34df3acf2397922e1a2152d707b3b9f6012964
Author: Leonardo Bras <leo.bras@arm.com>

kvm: Add arch-generic interface for hw-accelerated dirty-bitmap cleaning

Introduce kvm_arch_dirty_log_clear() that allow implementation of
arch-specific hardware-accelerated dirty-log routines.

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index e44c20c049610..a25b8902cdfc1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2262,6 +2263,10 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
>  		dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot);
>  		memset(dirty_bitmap_buffer, 0, n);
>  
> +		if (kvm_arch_dirty_log_clear(kvm, memslot, NULL,
> +					     dirty_bitmap_buffer, &flush) >= 0)
> +			goto out;
> +

[Severity: Critical]
If kvm_arch_dirty_log_clear() successfully clears some pages in hardware but
subsequently fails and returns an error, will the fallback loop overwrite the
successfully cleared bits?

It appears that if the arch hook returns an error, KVM falls back to the
generic software loop which includes:

    mask = xchg(&dirty_bitmap[i], 0);
    dirty_bitmap_buffer[i] = mask;

Because the generic fallback loop blindly overwrites dirty_bitmap_buffer[i]
with the remaining uncleared bits, could this result in silently losing the
dirty pages that the arch hook managed to clear before failing?

[ ... ]

> @@ -2373,6 +2379,10 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
>  	if (copy_from_user(dirty_bitmap_buffer, log->dirty_bitmap, n))
>  		return -EFAULT;
>  
> +	if (kvm_arch_dirty_log_clear(kvm, memslot, log, dirty_bitmap_buffer,
> +				     &flush) >= 0)
> +		goto out;
> +

[Severity: High]
Could passing dirty_bitmap_buffer directly to the arch-specific routine lead
to an out-of-bounds read?

When userspace calls KVM_CLEAR_DIRTY_LOG with a first_page > 0, KVM copies
exactly num_pages bits into dirty_bitmap_buffer starting at index 0.

If the arch implementation of kvm_arch_dirty_log_clear() assumes the buffer
represents the entire memslot and reads from it using an absolute offset like
log->first_page / BITS_PER_LONG, would it read past the end of this
partially-sized array?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260629111820.1873540-1-leo.bras@arm.com?part=7

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 09/13] KVM: arm64: Dirty-bitmap: avoid splitting previously split blocks
  2026-06-29 11:17 ` [PATCH v2 09/13] KVM: arm64: Dirty-bitmap: avoid splitting previously split blocks Leonardo Bras
@ 2026-06-29 11:39   ` sashiko-bot
  2026-06-29 17:07     ` Leonardo Bras
  0 siblings, 1 reply; 47+ messages in thread
From: sashiko-bot @ 2026-06-29 11:39 UTC (permalink / raw)
  To: Leonardo Bras; +Cc: kvmarm, Oliver Upton, Marc Zyngier, kvm

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Incorrect PMD boundary calculation skips huge page splitting for adjacent PMDs and causes redundant splitting for the last chunk of each PMD.
- [Low] Variable `split_end` is declared as a Guest Physical Address (`gpa_t`) but assigned and compared as a Guest Frame Number (`gfn_t`).
- [Low] The local variables `start` and `end` shadow the outer loop bounds.
--

commit 8863fe6c6c67ca7006255c8805499ac99f887014
Author: Leonardo Bras <leo.bras@arm.com>

KVM: arm64: Dirty-bitmap: avoid splitting previously split blocks

If previous dirty-clean already split a block, then avoid calling the
split helper on that block again.

> diff --git a/arch/arm64/kvm/dirty_bit.c b/arch/arm64/kvm/dirty_bit.c
> --- a/arch/arm64/kvm/dirty_bit.c
> +++ b/arch/arm64/kvm/dirty_bit.c
> @@ -160,6 +160,7 @@ int __kvm_arch_dirty_log_clear(struct kvm *kvm,
>  	u64 ttwl;
>  	u64 start, end;
>  	gfn_t base_gfn;
> +	gpa_t split_end = 0;

[Severity: Low]
This isn't a bug, but should split_end be declared as a gfn_t?

It is declared as a Guest Physical Address (gpa_t) here, but later receives
a Guest Frame Number (gfn_t) from gpa_to_gfn(). While both compile as u64,
this mixes address types.

[ ... ]

> @@ -204,10 +205,14 @@ int __kvm_arch_dirty_log_clear(struct kvm *kvm,
>  
>  		offset = base_gfn + i * BITS_PER_LONG;
>  
> -		if (kvm_dirty_log_manual_protect_and_init_set(kvm))
> -			kvm_mmu_split_huge_pages(kvm,
> -						 gfn_to_gpa(offset + __ffs(mask)),
> -						 gfn_to_gpa(offset + __fls(mask) + 1));
> +		if (kvm_dirty_log_manual_protect_and_init_set(kvm) &&
> +		    (offset + BITS_PER_LONG > split_end)) {

[Severity: High]
Does this comparison check an exclusive GFN against an inclusive GFN?

The expression offset + BITS_PER_LONG calculates an exclusive boundary,
while split_end is calculated below as an inclusive boundary:

    split_end = gpa_to_gfn(ALIGN_DOWN(end, PMD_SIZE) + PMD_SIZE - 1);

Comparing these might cause a redundant split for the last 64-page batch
of an already-split PMD.

> +			gpa_t start = gfn_to_gpa(offset + __ffs(mask));
> +			gpa_t end = gfn_to_gpa(offset + __fls(mask) + 1);

[Severity: Low]
This isn't a bug, but do these variables shadow the outer loop bounds?

The variables start and end declared here share names with the loop bounds:

    u64 start, end;
    ...
    for (unsigned long i = start; i < end; i++) {

While scoping rules use the correct variables, shadowing loop control
variables might make the code prone to subtle bugs during future refactoring.

> +
> +			kvm_mmu_split_huge_pages(kvm, start, end);
> +			split_end = gpa_to_gfn(ALIGN_DOWN(end, PMD_SIZE) + PMD_SIZE - 1);

[Severity: High]
Can this alignment skip splitting for the next PMD?

The variable end is an exclusive physical address:

    gpa_t end = gfn_to_gpa(offset + __fls(mask) + 1);

If a block of dirty pages ends exactly at a PMD boundary, ALIGN_DOWN() on
the exclusive end address aligns it to that boundary. Adding PMD_SIZE - 1
then advances split_end to the end of the next PMD.

This would skip huge page splitting for the adjacent PMD, breaking dirty
logging and potentially causing data corruption during live migration as
dirty pages would not be tracked at the PTE level.

> +		}
>  
>  		do {
>  			idx = mask_to_hdbss(&mask, hw_entries, offset, ttwl, idx, entries_sz);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260629111820.1873540-1-leo.bras@arm.com?part=9

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 08/13] KVM: arm64: Add hardware-accelerated dirty-bitmap cleaning routine
  2026-06-29 11:17 ` [PATCH v2 08/13] KVM: arm64: Add hardware-accelerated dirty-bitmap cleaning routine Leonardo Bras
@ 2026-06-29 11:45   ` sashiko-bot
  2026-06-29 16:49     ` Leonardo Bras
  0 siblings, 1 reply; 47+ messages in thread
From: sashiko-bot @ 2026-06-29 11:45 UTC (permalink / raw)
  To: Leonardo Bras; +Cc: kvmarm, Marc Zyngier, Oliver Upton, kvm

Thank you for your contribution! Sashiko AI review found 10 potential issue(s) to consider:
- [High] Missing `isb()` context synchronization event after writing to the `HCR_EL2` control-plane system register.
- [High] Host kernel directly accesses EL2 system registers, which will crash nVHE configurations.
- [High] Incorrect indexing of the userspace-provided `bitmap` buffer leads to out-of-bounds memory reads.
- [High] Dropping `kvm->mmu_lock` before pending HACDBS entries are flushed creates a TOCTOU race window that permanently loses guest writes.
- [High] Off-by-one error in calculating `split_end` causes KVM to mistakenly skip huge page splitting for subsequent PMDs, leading to data loss.
- [High] Flawed return value logic in `__kvm_arch_dirty_log_clear()` incorrectly returns `-EAGAIN` on exact-multiple success or a positive integer on general success.
- [High] The `wfi()` loop waits indefinitely with `preempt_disable()` and `kvm->mmu_lock` held in write mode without masking local interrupts, risking lost wakeups and hard lockups.
- [High] Ignoring the return value of `kvm_mmu_split_huge_pages()` can permanently break dirty logging for huge pages.
- [High] Dereferencing `kvm->arch.mmu.pgt` without checking if it is NULL leads to a kernel panic.
- [Low] A variable declaration immediately follows a `case` label, resulting in a build failure.
--

commit 8c1efc438f0d50c036b9548bfa7efea74d90a446
Author: Leonardo Bras <leo.bras@arm.com>

KVM: arm64: Add hardware-accelerated dirty-bitmap cleaning routine

This patch implements an arm64-specific kvm_arch_dirty_log_clear() using the
FEAT_HACDBS hardware feature. It converts dirty bitmap entries into HDBSS
structures and feeds them to the hardware to clean the page tables. It also
splits huge pages in chunks if manual protect and init set options are
enabled.

> diff --git a/arch/arm64/include/asm/kvm_dirty_bit.h b/arch/arm64/include/asm/kvm_dirty_bit.h
> --- a/arch/arm64/include/asm/kvm_dirty_bit.h
> +++ b/arch/arm64/include/asm/kvm_dirty_bit.h

[ ... ]

> +static inline bool kvm_arch_dirty_clear_enabled(struct kvm *kvm)
> +{
> +	return this_cpu_read(hacdbs_pcp.status) == HACDBS_IDLE &&
> +	       (kvm->arch.mmu.pgt->flags & KVM_PGTABLE_S2_DBM);
> +}

[Severity: High]
Can this cause a kernel panic if invoked before the MMU is fully initialized?
The stage-2 page tables (pgt) are allocated lazily, so if userspace calls
the clear dirty log ioctl before initialization, kvm->arch.mmu.pgt could be
NULL.

> diff --git a/arch/arm64/kvm/dirty_bit.c b/arch/arm64/kvm/dirty_bit.c
> --- a/arch/arm64/kvm/dirty_bit.c
> +++ b/arch/arm64/kvm/dirty_bit.c

[ ... ]

> +int __kvm_arch_dirty_log_clear(struct kvm *kvm,
> +			       struct kvm_memory_slot *memslot,
> +			       struct kvm_clear_dirty_log *log,
> +			       unsigned long *bitmap,
> +			       bool *flush)
> +{

[ ... ]

> +	for (unsigned long i = start; i < end; i++) {
> +		unsigned long mask;
> +		gfn_t offset;
> +		atomic_long_t *p;
> +
> +		if (log) { /* Clean only what is in the input bitmap */
> +			mask = bitmap[i];

[Severity: High]
Does this code read out of bounds of the userspace-provided bitmap?
When a dirty log is provided, the bitmap is sized relative to log->num_pages.
Since i is an absolute index into the guest's memory slot (starting at
log->first_page / BITS_PER_LONG), accessing bitmap[i] uses the absolute 
offset instead of a 0-based relative offset.

> +			if (!mask)
> +				continue;

[ ... ]

> +		offset = base_gfn + i * BITS_PER_LONG;
> +
> +		if (kvm_dirty_log_manual_protect_and_init_set(kvm))
> +			kvm_mmu_split_huge_pages(kvm,
> +						 gfn_to_gpa(offset + __ffs(mask)),
> +						 gfn_to_gpa(offset + __fls(mask) + 1));

[Severity: High]
Does this create a race condition that loses guest writes?
The software dirty bitmap is cleared at the start of this loop. If
kvm_mmu_split_huge_pages() is called here, it drops and reacquires
kvm->mmu_lock to top up the memory cache. Any guest write that occurs
while the lock is dropped is recorded in the hardware PTE, but because
the software bit was already cleared, the subsequent hardware execution
clears the PTE's dirty bit, erasing the concurrent write.

[Severity: High]
Can this permanently break dirty logging for huge pages if splitting fails?
The return value of kvm_mmu_split_huge_pages() is ignored. If it fails
(e.g., returning -ENOMEM), the block is not split, but the IPAs are
still fed to the hardware. The hardware will then clear the dirty bit on
the 2MB block descriptor, losing dirty tracking for the entire region.

[Severity: High]
Is there an off-by-one error in the split_end calculation?
Looking at the split_end assignment in the full context:

__kvm_arch_dirty_log_clear() {
    ...
    split_end = gpa_to_gfn(ALIGN_DOWN(end, PMD_SIZE) + PMD_SIZE - 1);
    ...
}

Since end is an exclusive upper bound, if the dirty region perfectly aligns
with a PMD boundary, this calculates a split_end that spans to the end of
the next PMD. This would cause the next PMD to skip splitting, leading to
data loss when the hardware clears the dirty bit on the unsplit block.

> +		do {
> +			idx = mask_to_hdbss(&mask, hw_entries, offset, ttwl, idx, entries_sz);
> +			if (idx >= entries_sz) {
> +				ret = dirty_bit_clear(kvm, hw_entries, idx);

[Severity: High]
Is there a missing context synchronization event in dirty_bit_clear()?
When this calls dirty_bit_clear(), that function writes to the HCR_EL2
control-plane system register and immediately calls __load_stage2():

dirty_bit_clear() {
    ...
    hcr_el2 = read_sysreg(HCR_EL2);
    write_sysreg(hcr_el2 | HCR_EL2_VM, HCR_EL2);
    __load_stage2(&kvm->arch.mmu);
    ...
}

The architecture requires an isb() as the very next instruction after
writing a control-plane system register. Without it, the CPU executes
subsequent instructions in an architecturally undefined pipeline state.

[Severity: High]
Does this code access EL2 system registers directly from the host kernel?
In nVHE configurations, the host kernel runs at EL1. The dirty_bit_clear()
function directly reads and writes HCR_EL2. This will trigger an Undefined 
Instruction exception and panic the host kernel on nVHE systems. KVM must 
use EL2-private hyp code or hypercalls to access EL2 state outside of VHE 
mode.

[Severity: High]
Does the wait loop in dirty_bit_clear() risk a lost wakeup and hard lockup?
The code disables preemption but fails to disable local interrupts before
calling wfi():

dirty_bit_clear() {
    preempt_disable();
    ...
    do {
        wfi();
    } while (this_cpu_read(hacdbs_pcp.status) == HACDBS_RUNNING);
    ...
}

If the hardware interrupt fires and sets the status to idle after the
condition is checked but before the wfi() instruction executes, the CPU
will sleep indefinitely.

> +				*flush = *flush || ret > 0;
> +				if (ret != idx) {
> +					/* Save bits not converted back to bitmap */
> +					atomic_long_or(mask, (atomic_long_t *)&dirty_bitmap[i]);
> +					goto out_err;
> +				}
> +				idx = 0;
> +			}
> +		} while (mask);
> +	}

[ ... ]

> +out_err:
> +	if (unlikely(ret != idx)) {
> +		/*
> +		 * In case there is an error and not all entries in HACDBS get
> +		 * cleaned, we have to mark the dirty bits back in the bitmap,
> +		 * as that will be used by the software routine.
> +		 *
> +		 * Entries should be in order, since they were extraxed from
> +		 * the dirty-bitmap, so batching the atomic writes is efficient.
> +		 */
> +
> +		if (ret < idx)
> +			hdbss_to_bitmap(hw_entries, ret, idx, dirty_bitmap, memslot->base_gfn);
> +
> +		ret = -EAGAIN;
> +	}
> +
> +	write_unlock(&kvm->mmu_lock);
> +	kfree(hw_entries);
> +
> +	return ret;
> +}

[Severity: High]
Does this error handling block incorrectly treat exact-multiple success states
as failures?
If the processed entries perfectly divide by entries_sz, the loop resets
idx = 0 and sets ret = entries_sz. The check if (unlikely(ret != idx))
then evaluates to true, falsely triggering this error path and returning
-EAGAIN.

Additionally, if it is not an exact multiple, the function exits with
ret = idx (a positive integer), which the generic KVM ioctl layer treats
as a failure code.

>  static irqreturn_t hacdbsirq_handler(int irq, void *pcpu)
>  {
>  	u64 cons = read_sysreg_s(SYS_HACDBSCONS_EL2);

[Severity: High]
Does this handler also directly access an EL2-only register from the host?
Just like dirty_bit_clear(), calling read_sysreg_s(SYS_HACDBSCONS_EL2)
from the host IRQ handler running at EL1 will crash nVHE configurations
with an Undefined Instruction exception.

[Severity: Low]
Is there a syntax error in this IRQ handler's switch statement?
Looking at the rest of hacdbsirq_handler():

hacdbsirq_handler() {
    ...
    case HACDBSCONS_EL2_ERR_REASON_IPAHACF:
        int index = ...;
    ...
}

A variable declaration immediately following a case label is a syntax
error in C, which will cause a build failure.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260629111820.1873540-1-leo.bras@arm.com?part=8

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 11/13] kvm/dirty_ring: Add arch-generic interface for hw-accelerated dirty-ring cleaning
  2026-06-29 11:17 ` [PATCH v2 11/13] kvm/dirty_ring: Add arch-generic interface for hw-accelerated dirty-ring cleaning Leonardo Bras
@ 2026-06-29 11:49   ` sashiko-bot
  2026-06-29 17:09     ` Leonardo Bras
  0 siblings, 1 reply; 47+ messages in thread
From: sashiko-bot @ 2026-06-29 11:49 UTC (permalink / raw)
  To: Leonardo Bras; +Cc: kvm, Oliver Upton, Marc Zyngier, kvmarm

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] Hardware-accelerated dirty ring reset completely bypasses the reset tracepoint.
--

commit 8d48fca1a0a1a0e2a8d66dd93d79ffc96e524e83
Author: Leonardo Bras <leo.bras@arm.com>

kvm/dirty_ring: Add arch-generic interface for hw-accelerated dirty-ring cleaning

This commit introduces kvm_arch_dirty_ring_clear() to allow architectures
to implement hardware-accelerated dirty-ring routines. It adds a call to
this new function in kvm_dirty_ring_reset(), which will fall back to the
software version if the hardware version is not implemented or fails.

[ ... ]

> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index 42de1a511037a..fe4e7da6cc4a6 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -133,6 +134,9 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
>  	 */
>  	lockdep_assert_held(&kvm->slots_lock);
>  
> +	if (kvm_arch_dirty_ring_clear(kvm, ring, nr_entries_reset) >= 0)
> +		return 0;

[Severity: Medium]
Does this early return bypass the trace_kvm_dirty_ring_reset() tracepoint
at the bottom of the function?

When hardware-accelerated dirty ring cleaning succeeds, returning early here
skips the tracepoint entirely. This could break observability tools and
scripts that rely on this tracepoint to monitor KVM dirty ring resets.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260629111820.1873540-1-leo.bras@arm.com?part=11

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 12/13] KVM: arm64: Add hardware-accelerated dirty-ring cleaning routine
  2026-06-29 11:18 ` [PATCH v2 12/13] KVM: arm64: Add hardware-accelerated dirty-ring cleaning routine Leonardo Bras
@ 2026-06-29 11:49   ` sashiko-bot
  2026-06-29 17:26     ` Leonardo Bras
  0 siblings, 1 reply; 47+ messages in thread
From: sashiko-bot @ 2026-06-29 11:49 UTC (permalink / raw)
  To: Leonardo Bras; +Cc: Marc Zyngier, Oliver Upton, kvmarm, kvm

Thank you for your contribution! Sashiko AI review found 6 potential issue(s) to consider:

New issues:
- [Critical] Unvalidated userspace memory accesses in dirty ring clearing lead to host DoS and memory corruption.
- [High] Missing TLB invalidation and DSB in dirty ring hardware path.
- [High] Hardware buffer allocation lacks strict physical alignment guarantees.

Pre-existing issues:
- [High] Missing architecturally required context synchronization (ISB) after control register writes.
- [High] Unsafe wfi() polling loop causes lost wakeups and CPU lockups.
- [Medium] Hardware dirty log clearing erroneously falls back to software due to incorrect return value processing.
--

commit 9f463875fd0f7eddcaa7053b557d7a2cc21e83c1
Author: Leonardo Bras <leo.bras@arm.com>

KVM: arm64: Add hardware-accelerated dirty-ring cleaning routine

This commit implements an arm64 version of kvm_arch_dirty_ring_clear using
FEAT_HACDBS. It translates dirty-ring entries into HDBSS entries and uses
the hardware mechanism to mark writable-dirty pages as clean. Invalid
entries are then cleared, falling back to software cleaning on error.

> diff --git a/arch/arm64/kvm/dirty_bit.c b/arch/arm64/kvm/dirty_bit.c
> --- a/arch/arm64/kvm/dirty_bit.c
> +++ b/arch/arm64/kvm/dirty_bit.c
[ ... ]
> static int dirty_bit_clear(struct kvm *kvm, u64 *hw_entries, int size)
> {
> 	u64 hcr_el2;
> 	int ret;
> 
> 	preempt_disable();
> 
> 	hcr_el2 = read_sysreg(HCR_EL2);
> 	write_sysreg(hcr_el2 | HCR_EL2_VM, HCR_EL2);
> 	__load_stage2(&kvm->arch.mmu);

[Severity: High]
This is a pre-existing issue, but does this missing ISB after control register
writes expose the pipeline to undefined architectural states?

Writing to HCR_EL2 requires an immediate context synchronization event.
Executing __load_stage2 immediately after without an isb means it might
operate under a stale configuration.

> 
> 	hacdbs_start(hw_entries, size);
> 
> 	do {
> 		wfi();
> 	} while (this_cpu_read(hacdbs_pcp.status) == HACDBS_RUNNING);

[Severity: High]
This is a pre-existing issue, but can this unprotected wfi wait loop cause
a lost wakeup and an indefinite CPU stall?

Because local interrupts are not masked around this check and wait, a timely
interrupt could satisfy the condition prematurely before wfi executes.

[ ... ]
> 	if (idx != 0) {
> 		ret = dirty_bit_clear(kvm, hw_entries, idx);
> 		*flush = *flush || ret > 0;
> 	}
> out_err:
> 	if (unlikely(ret != idx)) {

[Severity: Medium]
This is a pre-existing issue, but does this logic incorrectly evaluate a
successful batch outcome as an error?

If the loop finishes exactly on a batch boundary, idx is reset to 0, but ret
holds the success count of the previous batch. This causes the mismatch check
to evaluate to true, forcing a -EAGAIN return and triggering an unnecessary
software fallback.

[ ... ]
> @@ -256,6 +256,72 @@ int __kvm_arch_dirty_log_clear(struct kvm *kvm,
>  	return ret;
>  }
>  
> +int __kvm_arch_dirty_ring_clear(struct kvm *kvm, struct kvm_dirty_ring *ring,
> +				int *nr_entries_reset)
> +{
> +	u64 *hw_entries;
> +	u64 slot_offset = 0;
> +	u64 ttwl;
> +	int i, ret;
> +	u32 slot = -1;
> +
> +	if (signal_pending(current))
> +		return -EINTR;
> +
> +	ttwl = HDBSS_ENTRY_TTWL(KVM_PGTABLE_LAST_LEVEL);
> +
> +	hw_entries = kmalloc(max(ring->size * sizeof(u64), PAGE_SIZE), GFP_KERNEL);

[Severity: High]
Does this kmalloc call lack the strict physical alignment guarantees required
by the hardware accelerator?

Since kmalloc does not guarantee natural physical alignment for large buffers,
is it possible for physical address truncation to occur? This could cause the
hardware accelerator to parse unaligned memory and corrupt arbitrary page DBM
bits.

> +	if (!hw_entries)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < ring->size; i++) {
> +		struct kvm_dirty_gfn *entry;
> +		gfn_t gfn;
> +
> +		entry = &ring->dirty_gfns[(ring->reset_index + i) &
> +					  (ring->size - 1)];
> +
> +		if (!kvm_dirty_gfn_harvested(entry))
> +			break;
> +
> +		if (entry->slot != slot) {
> +			struct kvm_memory_slot *memslot;
> +
> +			memslot = kvm_dirty_ring_get_memslot(kvm, entry->slot);
> +			slot = entry->slot;
> +			slot_offset = memslot->base_gfn;

[Severity: Critical]
Can an invalid, user-provided entry->slot lead to a NULL pointer dereference
here?

The memslot pointer is dereferenced directly to read base_gfn without a NULL
check.

> +		}
> +
> +		gfn = slot_offset + entry->offset;

[Severity: Critical]
Is entry->offset validated against the memslot bounds?

Using an unvalidated user-provided offset here allows arbitrary Stage-2 GPA
computation. This could enable Stage-2 memory corruption or privilege
escalation.

> +
> +		hw_entries[i] = (gfn_to_gpa(gfn) & HDBSS_ENTRY_IPA) |
> +				ttwl | HDBSS_ENTRY_VALID;
> +	}
> +
> +	ret = dirty_bit_clear(kvm, hw_entries, i);

[Severity: High]
Does this path miss an architecturally required TLB invalidation and DSB?

dirty_bit_clear relies on the caller to handle the TLBI (as noted in its
internal comments for the log clearing path), but this ring clearing routine
appears to jump directly to invalidating the entries and returning.

This could leave stale writable TLB entries in the guest.

> +
> +	/* Set as invalid all successfully cleaned entries */
> +	for (int j = 0; j < ret; j++) {
[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260629111820.1873540-1-leo.bras@arm.com?part=12

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 13/13] KVM: arm64: Enable KVM_HW_DIRTY_BIT
  2026-06-29 11:18 ` [PATCH v2 13/13] KVM: arm64: Enable KVM_HW_DIRTY_BIT Leonardo Bras
@ 2026-06-29 11:52   ` sashiko-bot
  0 siblings, 0 replies; 47+ messages in thread
From: sashiko-bot @ 2026-06-29 11:52 UTC (permalink / raw)
  To: Leonardo Bras; +Cc: kvm, kvmarm, Marc Zyngier, Oliver Upton

Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:

Pre-existing issues:
- [Critical] Malicious userspace can trigger a host kernel panic by providing an invalid memslot ID in the dirty ring.
- [High] Unbounded stall in `dirty_bit_clear()` due to a missed wakeup race with `wfi()`.
- [High] Missing `isb()` immediately after writes to control-plane system registers `HCR_EL2` and `SYS_HACDBSCONS_EL2`.
- [High] Off-by-one error in `split_end` calculation skips huge page splits for subsequent PMD blocks.
- [High] kvm_arch_dirty_clear_enabled() uses this_cpu_read() without preemption disabled.
--

commit 7b3668eabb9d9e558d0cab4dd205a0b0b5c74a14
Author: Leonardo Bras <leo.bras@arm.com>

KVM: arm64: Enable KVM_HW_DIRTY_BIT

This commit sets the Kconfig flag to enable hardware accelerated
dirty-bitmap and dirty-ring cleaning for arm64 when ACPI is enabled.

> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index 449154f9a4852..db8487bf738b7 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -34,6 +34,7 @@ menuconfig KVM
>  	select HAVE_KVM_IRQ_BYPASS
>  	select HAVE_KVM_READONLY_MEM
>  	select HAVE_KVM_VCPU_RUN_PID_CHANGE
> +	select HAVE_KVM_HW_DIRTY_BIT if ACPI
>  	select SCHED_INFO
>  	select GUEST_PERF_EVENTS if PERF_EVENTS
>  	select KVM_GUEST_MEMFD

I noticed a few preexisting issues in the hardware dirty tracking
implementation enabled by this configuration.

arch/arm64/kvm/dirty_bit.c:dirty_bit_clear() {
    ...
    do {
        wfi();
    } while (this_cpu_read(hacdbs_pcp.status) == HACDBS_RUNNING);
    ...
}

[Severity: High]
This isn't a bug introduced by this patch, but can this wfi() sleep
unconditionally without disabling local interrupts first?

If the HACDBS hardware finishes and its interrupt is handled immediately
after hacdbs_start() but before the wfi() instruction, wouldn't the wakeup
event be missed, causing a regression with an unbounded stall?

arch/arm64/kvm/dirty_bit.c:__kvm_arch_dirty_ring_clear() {
    ...
    memslot = kvm_dirty_ring_get_memslot(kvm, entry->slot);
    slot = entry->slot;
    slot_offset = memslot->base_gfn;
    ...
}

[Severity: Critical]
This is a pre-existing issue, but is it safe to blindly dereference memslot
here without checking if kvm_dirty_ring_get_memslot() returned NULL?

If unprivileged userspace provides an invalid or out-of-bounds slot ID in
the dirty ring, kvm_dirty_ring_get_memslot() will return NULL, which would
trigger a panic regression when memslot->base_gfn is accessed.

arch/arm64/kvm/dirty_bit.c:dirty_bit_clear() {
    ...
    write_sysreg(hcr_el2 | HCR_EL2_VM, HCR_EL2);
    __load_stage2(&kvm->arch.mmu);
    ...
}

arch/arm64/kvm/dirty_bit.c:hacdbs_start() {
    ...
    sysreg_clear_set_s(SYS_HACDBSCONS_EL2,
                       HACDBSCONS_EL2_ERR_REASON | HACDBSCONS_EL2_INDEX, 0);

    br = (virt_to_phys(hw_entries) & HACDBSBR_EL2_BADDR_MASK) |
    ...
}

[Severity: High]
This isn't a bug introduced by this patch, but do these writes to HCR_EL2
and SYS_HACDBSCONS_EL2 need an immediate isb()?

In dirty_bit_clear(), HCR_EL2 is written and immediately followed by
__load_stage2(). In hacdbs_start(), SYS_HACDBSCONS_EL2 is written without
an immediate isb(). Could this cause a regression where the CPU executes
subsequent instructions with architecturally undefined pipeline state?

arch/arm64/kvm/dirty_bit.c:__kvm_arch_dirty_log_clear() {
    ...
    kvm_mmu_split_huge_pages(kvm, start, end);
    split_end = gpa_to_gfn(ALIGN_DOWN(end, PMD_SIZE) + PMD_SIZE - 1);
    ...
}

[Severity: High]
This is a pre-existing issue, but does the calculation for split_end have
an off-by-one error?

If the dirtied range perfectly aligns with a PMD boundary (where end is a
multiple of PMD_SIZE), split_end would advance to cover the entire next
PMD block.

Subsequent iterations would then evaluate offset + BITS_PER_LONG > split_end
as false, which skips the kvm_mmu_split_huge_pages() call for that next
block.

arch/arm64/include/asm/kvm_dirty_bit.h:kvm_arch_dirty_clear_enabled() {
    ...
    static inline bool kvm_arch_dirty_clear_enabled(struct kvm *kvm)
    {
        return this_cpu_read(hacdbs_pcp.status) == HACDBS_IDLE &&
    ...
}

[Severity: High]
This isn't a bug introduced by this patch, but does
kvm_arch_dirty_clear_enabled() use this_cpu_read() in a preemptible
context?

This function is called from kvm_arch_dirty_log_clear() and
kvm_arch_dirty_ring_clear(), where preemption is enabled. Could the thread
migrate to another CPU right after the check, leading to a regression where
dirty clearing logic runs on a CPU with the hardware scanner disabled?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260629111820.1873540-1-leo.bras@arm.com?part=13

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 01/13] KVM: arm64: HDBSS bits
  2026-06-29 11:34   ` sashiko-bot
@ 2026-06-29 12:57     ` Leonardo Bras
  0 siblings, 0 replies; 47+ messages in thread
From: Leonardo Bras @ 2026-06-29 12:57 UTC (permalink / raw)
  To: kvmarm; +Cc: Leonardo Bras, Oliver Upton, Marc Zyngier, kvm

On Mon, Jun 29, 2026 at 11:34:56AM +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
> - [Critical] Mapping writeable pages with both DBM and S2AP_W set causes `kvm_pgtable_stage2_wrprotect` to silently bypass software write-protection.
> - [High] `has_vhe_hdbss` incorrectly uses `is_kernel_in_hyp_mode()` which evaluates to true during early boot for pKVM (nVHE), erroneously enabling VHE-specific features.
> - [Low] Typo in the parameter name for `has_vhe_hdbss`.
> --
> 
> commit 49089779295fae526ba79b7c7e78d31fbe6a9636
> Author: Leonardo Bras <leo.bras@arm.com>
> 
> KVM: arm64: HDBSS bits
> 
> The commit introduces HDBSS bits from a future patchset to fulfill
> dependencies and allow building.
> 
> > diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> > index 9a22df0c5120f..aa327eebaf1ca 100644
> > --- a/arch/arm64/kernel/cpufeature.c
> > +++ b/arch/arm64/kernel/cpufeature.c
> 
> [ ... ]
> 
> > @@ -2131,6 +2131,11 @@ static bool hvhe_possible(const struct arm64_cpu_capabilities *entry,
> >  	return arm64_test_sw_feature_override(ARM64_SW_FEATURE_OVERRIDE_HVHE);
> >  }
> >  
> > +static bool has_vhe_hdbss(const struct arm64_cpu_capabilities *entry, int cope)
>                                                                              ^^^^
> [Severity: Low]
> This isn't a bug, but there is a typo in the parameter name (cope instead
> of scope).
> 
> > +{
> > +	return is_kernel_in_hyp_mode() && has_cpuid_feature(entry, cope);
> 
> [Severity: High]
> Does using is_kernel_in_hyp_mode() here erroneously enable VHE-specific
> features for pKVM (nVHE)?
> 
> During early boot feature detection, is_kernel_in_hyp_mode() evaluates to
> true because the host kernel boots at EL2 and only drops privileges to
> EL1 later during device_initcall_sync (finalize_pkvm()).
> 
> This early check could force ARM64_HAS_HDBSS to be enabled on pKVM
> systems, propagating the KVM_PGTABLE_S2_DBM flag to hypervisor-managed
> stage-2 page tables.
> 
> Should this use is_vhe_hyp_mode() instead to avoid incorrectly matching
> during the nVHE privileged window?
> 

Copied from HDBSS patchset, so notes for Tian Zheng ^

> > +}
> 
> [ ... ]
> 
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 91a7dfad66866..e16729f0b7bd2 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> 
> [ ... ]
> 
> > @@ -731,9 +731,13 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p
> >  	if (prot & KVM_PGTABLE_PROT_R)
> >  		attr |= KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R;
> >  
> > -	if (prot & KVM_PGTABLE_PROT_W)
> > +	if (prot & KVM_PGTABLE_PROT_W) {
> >  		attr |= KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W;
> >  
> > +		if (pgt->flags & KVM_PGTABLE_S2_DBM)
> > +			attr |= KVM_PTE_LEAF_ATTR_HI_S2_DBM;
> 
> [Severity: Critical]
> Can setting KVM_PTE_LEAF_ATTR_HI_S2_DBM here lead to a silent bypass of
> software write-protection?
> 
> When a page is mapped writable, both DBM and S2AP_W are set. If KVM later
> write-protects the page via kvm_pgtable_stage2_wrprotect(), only S2AP_W
> is cleared:
> 
> kvm_pgtable_stage2_wrprotect() {
>     ...
>     return stage2_update_leaf_attrs(pgt, addr, size, 0,
>                                     KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W,
>                                     NULL, NULL,
>                                     KVM_PGTABLE_WALK_IGNORE_EAGAIN);
> }
> 
> This leaves DBM intact. Hardware interprets DBM=1, S2AP_W=0 as a clean
> writable page and automatically sets S2AP_W=1 on the next write without
> generating a Permission Fault.
> 
> For pKVM, this allows a guest to bypass memory isolation. For nested
> virtualization, nested guest writes will silently set S2AP_W=1 without
> trapping to the host, resulting in lost dirty pages during live migration.
> 

Is there any case where we want to write protect a page that was marked as 
writeable, but not for dirty-tracking? 
If so, we need to also clear DBM for that, and make sure it is done only 
in those cases. 

The whole idea of HAFDBS / HDBSS is that the page is not actually write 
protected, it just doing dirty-detection, and this is done much much 
faster.

Thanks!
Leo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 02/13] KVM: arm64: Enable eager hugepage splitting if HDBSS is available
  2026-06-29 11:36   ` sashiko-bot
@ 2026-06-29 14:47     ` Leonardo Bras
  2026-06-29 17:06       ` Oliver Upton
  0 siblings, 1 reply; 47+ messages in thread
From: Leonardo Bras @ 2026-06-29 14:47 UTC (permalink / raw)
  To: sashiko-reviews
  Cc: Leonardo Bras, Marc Zyngier, Oliver Upton, kvmarm, kvm,
	Wei-Lin Chang

On Mon, Jun 29, 2026 at 11:36:45AM +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
> 
> New issues:
> - [High] The default eager page splitting chunk size violates the KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE UAPI contract by using a value that is not a valid block size for 16KB and 64KB page kernels.
> 
> Pre-existing issues:
> - [Critical] Dropping mmu_lock between primary and nested stage-2 write-protection creates a window where nested guests can bypass dirty logging.
> --
> 
> commit 09e1d250ed331b9dc20b42b7f9315892a86a5d50
> Author: Leonardo Bras <leo.bras@arm.com>
> 
> KVM: arm64: Enable eager hugepage splitting if HDBSS is available
> 
> This commit enables eager hugepage splitting by default for systems
> supporting HDBSS, setting the default chunk size to 256K pages at
> stage2 initialization.
> 
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 6c941aaa10c63..e086c01a9325e 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1027,8 +1027,12 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
> >  	for_each_possible_cpu(cpu)
> >  		*per_cpu_ptr(mmu->last_vcpu_ran, cpu) = -1;
> >  
> > -	 /* The eager page splitting is disabled by default */
> > -	mmu->split_page_chunk_size = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT;
> > +	 /* The eager page splitting is disabled by default if system has no HDBSS */
> > +	if (system_supports_hdbss())
> > +		mmu->split_page_chunk_size = 256 * 1024 * PAGE_SIZE;
> 
> [Severity: High]
> Does this hardcoded chunk size violate the KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
> UAPI contract on kernels with 16KB or 64KB pages?
> 
> By setting the default chunk size to 256 * 1024 * PAGE_SIZE, this evaluates to
> 4GB on 16KB page systems and 16GB on 64KB page systems.
> 
> However, in arch/arm64/kvm/arm.c:kvm_vm_ioctl_enable_cap(), the chunk size is
> validated against valid architectural block sizes:
> 
>     if (!new_cap || kvm_is_block_size_supported(new_cap)) {
>     ...
> 
> Since 4GB and 16GB are not valid block sizes on those configurations, a VMM
> like QEMU attempting to read the default capability via KVM_CHECK_EXTENSION
> and set it back via KVM_ENABLE_CAP will be rejected with -EINVAL.

Sure, changing this to something like PMD_SIZE or PUD_SIZE should be no issue.
 
But this raises a topic I would like to understand:
- Do we actually need this to be a block_size to assure correctness? or is 
  it just about efficiency?

I was thinking on using PUD_SIZE here before, but that would mean very 
different splitting granularity, as on 4k pagesize we would be splitting 
512*512 pages (above mentioned 256k pages), but on 16k pagesize we would be 
splitting 2048*2048 pages (4M), and on 64k pagesize it would be 8k*8k pages 
(64M). 

Doing that could defeat the purpose of chunk size to have a possible sleep 
to not keep the lock hold for too long. In that case, would not it make 
more sense to use a fixed number of pages instead?


> 
> [Severity: Critical]
> This is a pre-existing issue, but by enabling eager hugepage splitting by
> default for HDBSS, does this exacerbate a race condition with nested guests
> bypassing dirty logging?
> 
> In arch/arm64/kvm/mmu.c:kvm_arch_mmu_enable_log_dirty_pt_masked(), KVM enables
> dirty logging by write-protecting the primary stage-2 and then calling
> kvm_mmu_split_huge_pages():
> 
>     if (kvm_dirty_log_manual_protect_and_init_set(kvm))
>         kvm_mmu_split_huge_pages(kvm, start, end);
> 
>     kvm_nested_s2_wp(kvm);
> 
> Because kvm_mmu_split_huge_pages() can drop the mmu_lock to top-up the memory
> cache or reschedule, a race window is created before kvm_nested_s2_wp()
> protects the nested stage-2 tables.
> 
> While the lock is dropped, a nested VCPU can fault, map the page via its
> separate shadow stage-2 (which hasn't been write-protected yet), and write to
> the page without setting the dirty bitmap, leading to silent memory corruption
> after live migration.
> 

But if it's already write-protected in S2, if the nested VCPU faults 
something, would not that be handled by non-nested faulting routine?

I understand very little of nested, and I can't understand how does that 
work properly. Maybe Marc or Wei Lin may have a better answer for that.

CC: Wei Lin

Thanks!
Leo


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 05/13] KVM: arm64: Detect (via ACPI) and initialize HACDBSIRQ
  2026-06-29 11:32   ` sashiko-bot
@ 2026-06-29 15:43     ` Leonardo Bras
  2026-06-29 16:52       ` Vladimir Murzin
  0 siblings, 1 reply; 47+ messages in thread
From: Leonardo Bras @ 2026-06-29 15:43 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: Leonardo Bras, kvmarm, Oliver Upton, Marc Zyngier, kvm

On Mon, Jun 29, 2026 at 11:32:42AM +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 12 potential issue(s) to consider:
> - [Critical] The `wfi()` loop in `dirty_bit_clear()` creates a check-then-sleep race condition that can stall the CPU indefinitely.
> - [Critical] The level-triggered `HACDBSIRQ` is not acknowledged or cleared in the interrupt handler, resulting in an infinite interrupt storm.
> - [High] Missing `isb()` context synchronization barrier after writing to control-plane system registers `HCR_EL2` and `SYS_HACDBSCONS_EL2`.
> - [High] Missing memory barrier (`dmb ish`) before starting the HACDBS hardware engine.
> - [High] The ACPI MADT parser registers `hacdbsirq_gsi` without verifying it is non-zero, risking a hijack of SGI 0.
> - [High] Missing `default:` case in the error handling switch statement leaves the tracking status unchanged on unexpected hardware errors, causing a vCPU hang.
> - [High] If HACDBS IRQ initialization fails, the feature remains enabled in KVM, causing a permanent host hang upon first use of dirty tracking.
> - [Medium] ACPI GSI mapping is leaked if `request_percpu_irq()` fails.
> - [Medium] Missing teardown of HACDBS resources if KVM subsystem initialization fails.
> - [Medium] The per-cpu HACDBS status is initialized after enabling the per-cpu interrupt, causing a TOCTOU race.
> - [Medium] Calling `smp_processor_id()` in a preemptible context triggers a kernel BUG/warning.
> - [Low] A variable declaration immediately follows a case label, violating C syntax rules and causing a build failure.
> --
> 
> commit a0d2df687f2678f7a10866e285efd8319bf84cea
> Author: Leonardo Bras <leo.bras@arm.com>
> 
> KVM: arm64: Detect (via ACPI) and initialize HACDBSIRQ
> 
> Find via ACPI the Id for HACDBSIRQ, initialize it as a per-cpu IRQ and make
> sure any cpu able to run virtualization has it active. Introduce a per-cpu
> structure used by the HACDBSIRQ handler to keep track of entries size and
> the status of HACDBS.
> 
> Link: https://github.com/tianocore/edk2/issues/12409
> 
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 50adfff75be82..dc1a4629aaeb9 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> [ ... ]
> > @@ -2481,6 +2484,8 @@ static int __init init_subsystems(void)
> >  	if (err)
> >  		goto out;
> >  
> > +	kvm_hacdbs_init();
> > +
> >  	kvm_register_perf_callbacks();
> >  
> >  	err = kvm_hyp_trace_init();
> 
> [Severity: Medium]
> If a subsequent initialization step like kvm_hyp_trace_init() fails, the
> code branches to the out: error path and aborts KVM initialization. Does
> this path need to call a teardown function to free the per-CPU IRQ and
> release the GSI allocated by kvm_hacdbs_init()?
> 

Maybe. Another solution would be to put it after anything that may present 
any error. I wonder how ofter do we get kvm to fail startup, though.

> > diff --git a/arch/arm64/kvm/dirty_bit.c b/arch/arm64/kvm/dirty_bit.c
> > index 32fe938d6bf70..789da8712b1b4 100644
> > --- a/arch/arm64/kvm/dirty_bit.c
> > +++ b/arch/arm64/kvm/dirty_bit.c
> [ ... ]
> > @@ -14,3 +21,93 @@
> >  	(((x) << HDBSS_ENTRY_TTWL_SHIFT) & HDBSS_ENTRY_TTWL_MASK)
> >  #define HDBSS_ENTRY_TTWL_RESV HDBSS_ENTRY_TTWL(-4)
> >  #define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
> > +
> > +static __ro_after_init int hacdbsirq = -1;
> > +
> > +static irqreturn_t hacdbsirq_handler(int irq, void *pcpu)
> > +{
> > +	u64 cons = read_sysreg_s(SYS_HACDBSCONS_EL2);
> > +	unsigned long err = FIELD_GET(HACDBSCONS_EL2_ERR_REASON, cons);
> > +
> > +	switch (err) {
> > +	case HACDBSCONS_EL2_ERR_REASON_NOF:
> > +		this_cpu_write(hacdbs_pcp.status, HACDBS_IDLE);
> > +		break;
> > +	case HACDBSCONS_EL2_ERR_REASON_IPAHACF:
> > +		/* When size not a power of two >= 4k, exit with reserved TTLW */
> > +		int index = FIELD_GET(HACDBSCONS_EL2_INDEX, cons);
> 
> [Severity: Low]
> Does this variable declaration cause a build failure?

No

> Under standard C rules,
> a label can only be attached to a statement, and a declaration is not a
> statement.

That's not a label.

> 
> > +
> > +		if (index >= this_cpu_read(hacdbs_pcp.size)) {
> > +			this_cpu_write(hacdbs_pcp.status, HACDBS_IDLE);
> > +			break;
> > +		}
> > +		fallthrough;
> > +	case HACDBSCONS_EL2_ERR_REASON_STRUCTF:
> > +	case HACDBSCONS_EL2_ERR_REASON_IPAF:
> > +		this_cpu_write(hacdbs_pcp.status, HACDBS_ERROR);
> > +		break;
> 
> [Severity: High]
> What happens if the hardware reports an undefined or reserved error reason
> in HACDBSCONS_EL2_ERR_REASON?
> 
> Without a default: case, the tracking status wouldn't be updated, remaining
> as HACDBS_RUNNING. Could this cause the synchronous wfi() wait loop in
> dirty_bit_clear() to spin forever and lock up the host CPU?

I get the error code by reading a 2-bit bitfield. There can only be 4 
different combinations here, and all are dealt with. 

> 
> > +	}
> > +
> > +	return IRQ_HANDLED;
> > +}
> 
> [Severity: Critical]
> Is it necessary to write back to a hardware register to acknowledge or clear
> the active source of this interrupt before returning?
> 
> The ACPI MADT registers the HACDBS interrupt as ACPI_LEVEL_SENSITIVE.
> Since a level-triggered interrupt remains asserted until explicitly cleared,
> would returning IRQ_HANDLED without clearing the hardware signal trap the CPU
> in an infinite loop of executing the ISR?
> 

I am pretty sure this is done by generic code.

> > +
> > +void kvm_hacdbs_cpu_up(void)
> > +{
> > +	if (hacdbsirq < 0)
> > +		return;
> > +
> > +	enable_percpu_irq(hacdbsirq, IRQ_TYPE_LEVEL_HIGH);
> > +	this_cpu_write(hacdbs_pcp.status, HACDBS_IDLE);
> > +}
> 
> [Severity: Medium]
> Could this sequence cause a race condition if a pending interrupt fires
> immediately upon being unmasked?
> 
> If hacdbsirq_handler() executes and updates the state, and then
> kvm_hacdbs_cpu_up() resumes and unconditionally overwrites the state
> with HACDBS_IDLE, the handler's update would be lost. Should the status
> be initialized before enabling the interrupt?

This is a per-cpu irq, and above happens when we are enabling 
virtualization on a CPU. Prior to the endo of kvm_hacdbs_cpu_up() there 
should be no guest running in this cpu, so faulting is not an issue. 

> 
> > +
> > +void kvm_hacdbs_cpu_down(void)
> > +{
> > +	if (hacdbsirq < 0)
> > +		return;
> > +
> > +	disable_percpu_irq(hacdbsirq);
> > +	this_cpu_write(hacdbs_pcp.status, HACDBS_OFF);
> > +}
> > +
> > +#ifdef CONFIG_ACPI
> > +static int __init hacdbs_acpi_get_irq(void)
> > +{
> > +	struct acpi_madt_generic_interrupt *gicc;
> > +	u32 gsi;
> > +	int irq;
> > +
> > +	gicc = acpi_cpu_get_madt_gicc(smp_processor_id());
> 
> [Severity: Medium]
> Is it safe to call smp_processor_id() here?
> 
> Since kvm_hacdbs_init() is called from init_subsystems() during a
> device_initcall, preemption is enabled. Would calling smp_processor_id()
> directly without disabling preemption trigger a kernel bug/warning if
> CONFIG_DEBUG_PREEMPT is enabled? Should raw_smp_processor_id() or
> get_cpu()/put_cpu() be used instead?

This is called in init time, by kvm_arm_init(). Also, it should not matter 
as we are not actually touching anything from that cpu, but instead just 
fetching this read-only information about the IRQ.


> 
> > +	if (gicc->header.length < ACPI_MADT_GICC_HACDBSIRQ)
> > +		return -ENXIO;
> > +
> > +	gsi =  gicc->hacdbsirq_gsi;
> > +
> > +	irq = acpi_register_gsi(NULL, gsi, ACPI_LEVEL_SENSITIVE, ACPI_ACTIVE_HIGH);
> 
> [Severity: High]
> Does the GSI read from ACPI need to be validated before registration?
> 
> If the firmware does not support HACDBS, it might leave this field as 0.
> On ARM64, GSI 0 corresponds to Software Generated Interrupt (SGI) 0, which is
> used for Inter-Processor Interrupts (IPIs). Could registering a device
> interrupt handler on GSI 0 corrupt host IPI handling and crash the system?
> 

Adding a test here could be easy.
In any case I remember testing with the table entry being zero and this 
not being an issue.

> > +	if (irq < 0) {
> > +		pr_warn("ACPI: Unable to register HACDBS interrupt: %d\n", gsi);
> > +		return -ENXIO;
> > +	}
> > +
> > +	return irq;
> > +}
> 
> [Severity: Medium]
> If the subsequent call to request_percpu_irq() in kvm_hacdbs_init() fails,
> how is this ACPI GSI mapping cleaned up?
> 
> Since hacdbs_acpi_get_irq() only returns the mapped IRQ and discards the
> original GSI, it seems impossible for the caller to properly unregister it
> via acpi_unregister_gsi(), which would result in leaking the resource.
> 

This feature should only be enabled if the whole system supports HACDBS, so 
once one cpu succeeds in request_percpu_irq() every other one should 
succeed as well.

> > +#else
> > +#define hacdbs_acpi_get_irq() (-ENXIO)
> > +#endif
> > +
> > +void __init kvm_hacdbs_init(void)
> > +{
> > +	int irq;
> > +
> > +	/* FEAT_HACDBS is only supported if Linux runs in EL2 (VHE) */
> > +	if (!system_supports_hacdbs() || !is_kernel_in_hyp_mode())
> > +		return;
> > +
> > +	irq = hacdbs_acpi_get_irq();
> > +	if (irq < 0)
> > +		return;
> > +
> > +	if (request_percpu_irq(irq, hacdbsirq_handler, "HACDBSIRQ", &hacdbs_pcp) < 0)
> > +		return;
> 
> [Severity: High]
> If request_percpu_irq() or hacdbs_acpi_get_irq() fails, the function returns
> early. Is there a global flag that needs to be unset to tell KVM that HACDBS
> hardware cleaning is unavailable?

The availability is tracked by hacdbsirq variable. Once it's set, it means 
the whole system supports hacdbs, and the IRQ is available.


> 
> If KVM later assumes hardware support is available and calls
> dirty_bit_clear() unconditionally, it might set the software status to
> HACDBS_RUNNING and wait in a wfi() loop. Since the IRQ is disabled or
> unmapped, the handler would never fire, resulting in a host hang.
>
> 
> Additionally, I have a few questions regarding the implementation of
> dirty_bit_clear() and hacdbs_start(), which appear to interact with the
> features introduced in this patch:
> 
> [Severity: High]
> In dirty_bit_clear(), is an isb() required after the write to HCR_EL2?
>

No, and __load_stage2 will do that anyway if needed, as it also sets 
registers. Also, it's done at the end of hacdbs_start()
 
> dirty_bit_clear() {
>     ...
>     write_sysreg(hcr_el2 | HCR_EL2_VM, HCR_EL2);
>     __load_stage2(&kvm->arch.mmu);
>     ...
> }
> 
> Similarly, in hacdbs_start(), is an isb() required after writing to
> SYS_HACDBSCONS_EL2?
> 

it's done at the end of hacdbs_start.

> hacdbs_start() {
>     ...
>     sysreg_clear_set_s(SYS_HACDBSCONS_EL2,
>                        HACDBSCONS_EL2_ERR_REASON | HACDBSCONS_EL2_INDEX, 0);
> 
>     br = (virt_to_phys(hw_entries) & HACDBSBR_EL2_BADDR_MASK) |
>     ...
> }
> 
> The ARM64 subsystem guidelines indicate that writes to control-plane system
> registers must be followed by an isb() as the very next instruction to
> ensure subsequent instructions do not execute with an architecturally
> undefined pipeline state.
> 

Those intructions are unrelated between themselves, the one at the end of 
hacdbs_start() should be enough.

> [Severity: Critical]
> Does the wfi() loop in dirty_bit_clear() contain a check-then-sleep race
> condition?
> 
> dirty_bit_clear() {
>     ...
>     do {
>         wfi();
>     } while (this_cpu_read(hacdbs_pcp.status) == HACDBS_RUNNING);
>     ...
> }
> 
> Although preemption is disabled, local interrupts are not disabled during this
> check. If the hardware interrupt fires after the condition is checked but
> before the wfi() executes, the CPU could service the interrupt, update the
> status to HACDBS_IDLE, return, and then execute wfi(). The CPU would then
> sleep indefinitely, causing a host hard lockup.

Until the next interrupt, though. Very unlikely, but possible.

Maybe a WFIT (timeout), then?

> 
> [Severity: High]
> In hacdbs_start(), is a memory barrier like `dmb ish` required before starting
> the hardware engine?
> 
> hacdbs_start() {
>     ...
>     if (size_b < size_p2)
>         hw_entries[size] = HDBSS_ENTRY_VALID | HDBSS_ENTRY_TTWL_RESV;
>     ...
>     write_sysreg_s(br, SYS_HACDBSBR_EL2);
>     isb();
> }
> 
> A sentinel stop entry is written to the memory buffer before writing to
> SYS_HACDBSBR_EL2 to enable the engine. Since system register writes are not
> inherently ordered with respect to memory stores, the hardware engine might
> begin reading before the stop entry is visible in coherent memory.
> 

And that could mean possibly running though entries that are not supposed 
to be valid, but could be because we are talking about random data.
So we need to make sure the write buffer is clear before setting the 
register.

Thanks
Leo



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 06/13] KVM: arm64: dirty_bit: Add base FEAT_HACDBS cleaning routine
  2026-06-29 11:29   ` sashiko-bot
@ 2026-06-29 15:54     ` Leonardo Bras
  0 siblings, 0 replies; 47+ messages in thread
From: Leonardo Bras @ 2026-06-29 15:54 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: Leonardo Bras, kvmarm, kvm, Marc Zyngier, Oliver Upton

On Mon, Jun 29, 2026 at 11:29:48AM +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
> - [High] Missing Context Synchronization Event (ISB) after writes to control-plane system registers.
> - [High] TOCTOU race in wfi() polling loop causes a lost wakeup and indefinite CPU stall.
> - [High] Missing memory barrier (dsb) before starting the HACDBS hardware engine.
> - [High] Missing speculative execution barrier (dsb) after disabling an asynchronous hardware engine.
> --
> 
> commit 47f5964570817fd652e798e082c349d012b3edda
> Author: Leonardo Bras <leo.bras@arm.com>
> 
> KVM: arm64: dirty_bit: Add base FEAT_HACDBS cleaning routine
> 
> Introduce the basic cleaning routine that is going to be used for both
> dirty-bitmap and dirty-ring routines. It sets the required registers with
> the input buffer, and wait for HACDBSIRQ to happen, which means either
> the task is done, or there was some error during processing.
> 
> > diff --git a/arch/arm64/kvm/dirty_bit.c b/arch/arm64/kvm/dirty_bit.c
> > index 789da8712b1b4..e4283828b7807 100644
> > --- a/arch/arm64/kvm/dirty_bit.c
> > +++ b/arch/arm64/kvm/dirty_bit.c
> 
> [ ... ]
> 
> > @@ -24,6 +25,86 @@ DEFINE_PER_CPU(struct hacdbs, hacdbs_pcp) = {
> >  
> >  static __ro_after_init int hacdbsirq = -1;
> >  
> > +static void hacdbs_start(u64 *hw_entries, int size)
> > +{
> > +	u64 br;
> > +	/* Each entry is 8 bytes */
> > +	int size_b = size * sizeof(hw_entries[0]);
> > +	int size_p2 = max(roundup_pow_of_two(size_b), PAGE_SIZE);
> > +
> > +	/* If not using the full size of the array, put a stop entry at the end */
> > +	if (size_b < size_p2)
> > +		hw_entries[size] = HDBSS_ENTRY_VALID | HDBSS_ENTRY_TTWL_RESV;
> > +
> > +	sysreg_clear_set_s(SYS_HACDBSCONS_EL2,
> > +			   HACDBSCONS_EL2_ERR_REASON | HACDBSCONS_EL2_INDEX, 0);
> > +
> 
> [Severity: High]
> Does this code need an immediate isb() after writing the control-plane system
> register SYS_HACDBSCONS_EL2? The ARM64 guidelines indicate that an isb() must
> be the very next instruction after writing a control-plane system register to
> ensure the pipeline doesn't operate with architecturally undefined state.

IIUC they will have no effect until the base register says enabled. So an 
isb() at the end should be enough.

> 
> [Severity: High]
> Additionally, should there be a dsb() between the memory store to
> hw_entries[size] and the write to SYS_HACDBSBR_EL2 below? Without an explicit
> memory barrier, the store buffer might not be globally visible before the
> engine begins reading memory, which could cause it to miss the stop marker
> and access out-of-bounds memory.
> 
> > +	br = (virt_to_phys(hw_entries) & HACDBSBR_EL2_BADDR_MASK) |
> > +	     FIELD_PREP(HACDBSBR_EL2_SZ, ilog2(size_p2) - 12) |
> > +	     FIELD_PREP(HACDBSBR_EL2_EN, 1);
> > +
> > +	this_cpu_write(hacdbs_pcp.status, HACDBS_RUNNING);
> > +	this_cpu_write(hacdbs_pcp.size, size);
> > +	write_sysreg_s(br, SYS_HACDBSBR_EL2);
> > +	isb();
> > +}
> 
> [ ... ]
> 
> > +/*
> > + * Clears dirty-bits for an array of pages (hw_entries) using HACDBS
> > + * Returns the number of items cleaned from the array. If returns value < size,
> > + *	there was an error in the processing.
> > + */
> > +static int dirty_bit_clear(struct kvm *kvm, u64 *hw_entries, int size)
> > +{
> > +	u64 hcr_el2;
> > +	int ret;
> > +
> > +	preempt_disable();
> > +
> > +	hcr_el2 = read_sysreg(HCR_EL2);
> > +	write_sysreg(hcr_el2 | HCR_EL2_VM, HCR_EL2);
> > +	__load_stage2(&kvm->arch.mmu);
> 
> [Severity: High]
> Is an isb() required immediately after writing HCR_EL2 before proceeding to
> __load_stage2()? HCR_EL2 is a control-plane register, and delaying the barrier
> can cause the pipeline to operate with architecturally undefined state.
> 

Again, it won't operate until the enable at the end of hacdbs_start()

> > +
> > +	hacdbs_start(hw_entries, size);
> > +
> > +	do {
> > +		wfi();
> > +	} while (this_cpu_read(hacdbs_pcp.status) == HACDBS_RUNNING);
> 
> [Severity: High]
> Could this polling loop cause a lost wakeup and indefinite CPU stall? The loop
> unconditionally sleeps on the first iteration and checks the condition
> non-atomically with respect to interrupts. 
> 
> If the HACDBSIRQ interrupt is delivered and serviced just before the wfi()
> instruction executes, the wfi() will wait for a new interrupt that will never
> arrive. Should the check be made atomic with the sleep, for example by masking
> IRQs before the check?
> 

Answered before. Yes, it's unlikely, but maybe we need a WFIT.

> > +
> > +	ret = hacdbs_stop();
> > +
> > +	write_sysreg(hcr_el2, HCR_EL2);
> > +	isb();
> 
> [Severity: High]
> Should there be a speculative execution barrier (dsb) after stopping the engine
> in hacdbs_stop() and before changing the translation context here? 
> 
> The ARM64 guidelines mandate a speculative execution barrier (dsb + isb) after
> disabling an asynchronous engine to definitively halt out-of-context execution
> before updating the translation regime.


There is LITERALY a comment below addressing this.

> 
> > +
> > +	/*
> > +	 * No DSB is needed here, as kvm_flush_remote_tlbs_memslot() that happens
> > +	 * later in generic dirty-cleaning code already performs a DSB before
> > +	 * doing the TLBI.
> > +	 */
> > +
> > +	preempt_enable();
> > +
> > +	return ret;
> > +}
> 
> -- 
> Sashiko AI review · https://sashiko.dev/#/patchset/20260629111820.1873540-1-leo.bras@arm.com?part=6

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 07/13] kvm: Add arch-generic interface for hw-accelerated dirty-bitmap cleaning
  2026-06-29 11:38   ` sashiko-bot
@ 2026-06-29 16:07     ` Leonardo Bras
  0 siblings, 0 replies; 47+ messages in thread
From: Leonardo Bras @ 2026-06-29 16:07 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: Leonardo Bras, Oliver Upton, Marc Zyngier, kvm, kvmarm

On Mon, Jun 29, 2026 at 11:38:18AM +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
> - [Critical] Data loss (lost dirty pages) when kvm_arch_dirty_log_clear fails during KVM_GET_DIRTY_LOG.
> - [High] Out-of-bounds read of userspace buffer in kvm_arch_dirty_log_clear during KVM_CLEAR_DIRTY_LOG.
> --
> 
> commit ea34df3acf2397922e1a2152d707b3b9f6012964
> Author: Leonardo Bras <leo.bras@arm.com>
> 
> kvm: Add arch-generic interface for hw-accelerated dirty-bitmap cleaning
> 
> Introduce kvm_arch_dirty_log_clear() that allow implementation of
> arch-specific hardware-accelerated dirty-log routines.
> 
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index e44c20c049610..a25b8902cdfc1 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2262,6 +2263,10 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
> >  		dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot);
> >  		memset(dirty_bitmap_buffer, 0, n);
> >  
> > +		if (kvm_arch_dirty_log_clear(kvm, memslot, NULL,
> > +					     dirty_bitmap_buffer, &flush) >= 0)
> > +			goto out;
> > +
> 
> [Severity: Critical]
> If kvm_arch_dirty_log_clear() successfully clears some pages in hardware but
> subsequently fails and returns an error, will the fallback loop overwrite the
> successfully cleared bits?
> 
> It appears that if the arch hook returns an error, KVM falls back to the
> generic software loop which includes:
> 
>     mask = xchg(&dirty_bitmap[i], 0);
>     dirty_bitmap_buffer[i] = mask;
>     
> Because the generic fallback loop blindly overwrites dirty_bitmap_buffer[i]
> with the remaining uncleared bits, could this result in silently losing the
> dirty pages that the arch hook managed to clear before failing?

Yes, it falls back to software. 
But it will only overwrite the bits IF the entry is not empty.

One thing that could happen, though, is that the error happened in 
the middle of the 64-page slot, and the output buffer would contain bits 
set, while the dirty-bitmap also contains them. That would cause the bits 
to be overwritten. To avoid that, we could ask to make sure to only clear 
the dirty-bitmap word/slot if all bits have been cleaned. 

> 
> [ ... ]
> 
> > @@ -2373,6 +2379,10 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
> >  	if (copy_from_user(dirty_bitmap_buffer, log->dirty_bitmap, n))
> >  		return -EFAULT;
> >  
> > +	if (kvm_arch_dirty_log_clear(kvm, memslot, log, dirty_bitmap_buffer,
> > +				     &flush) >= 0)
> > +		goto out;
> > +
> 
> [Severity: High]
> Could passing dirty_bitmap_buffer directly to the arch-specific routine lead
> to an out-of-bounds read?
> 
> When userspace calls KVM_CLEAR_DIRTY_LOG with a first_page > 0, KVM copies
> exactly num_pages bits into dirty_bitmap_buffer starting at index 0.
> 
> If the arch implementation of kvm_arch_dirty_log_clear() assumes the buffer
> represents the entire memslot and reads from it using an absolute offset like
> log->first_page / BITS_PER_LONG, would it read past the end of this
> partially-sized array?
> 

Dealt with this in the implementation. Maybe it should be clear to whoever 
implements it will have to take that into account as well.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 08/13] KVM: arm64: Add hardware-accelerated dirty-bitmap cleaning routine
  2026-06-29 11:45   ` sashiko-bot
@ 2026-06-29 16:49     ` Leonardo Bras
  0 siblings, 0 replies; 47+ messages in thread
From: Leonardo Bras @ 2026-06-29 16:49 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: Leonardo Bras, kvmarm, Marc Zyngier, Oliver Upton, kvm

On Mon, Jun 29, 2026 at 11:45:01AM +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 10 potential issue(s) to consider:
> - [High] Missing `isb()` context synchronization event after writing to the `HCR_EL2` control-plane system register.
> - [High] Host kernel directly accesses EL2 system registers, which will crash nVHE configurations.
> - [High] Incorrect indexing of the userspace-provided `bitmap` buffer leads to out-of-bounds memory reads.
> - [High] Dropping `kvm->mmu_lock` before pending HACDBS entries are flushed creates a TOCTOU race window that permanently loses guest writes.
> - [High] Off-by-one error in calculating `split_end` causes KVM to mistakenly skip huge page splitting for subsequent PMDs, leading to data loss.
> - [High] Flawed return value logic in `__kvm_arch_dirty_log_clear()` incorrectly returns `-EAGAIN` on exact-multiple success or a positive integer on general success.
> - [High] The `wfi()` loop waits indefinitely with `preempt_disable()` and `kvm->mmu_lock` held in write mode without masking local interrupts, risking lost wakeups and hard lockups.
> - [High] Ignoring the return value of `kvm_mmu_split_huge_pages()` can permanently break dirty logging for huge pages.
> - [High] Dereferencing `kvm->arch.mmu.pgt` without checking if it is NULL leads to a kernel panic.
> - [Low] A variable declaration immediately follows a `case` label, resulting in a build failure.
> --
> 
> commit 8c1efc438f0d50c036b9548bfa7efea74d90a446
> Author: Leonardo Bras <leo.bras@arm.com>
> 
> KVM: arm64: Add hardware-accelerated dirty-bitmap cleaning routine
> 
> This patch implements an arm64-specific kvm_arch_dirty_log_clear() using the
> FEAT_HACDBS hardware feature. It converts dirty bitmap entries into HDBSS
> structures and feeds them to the hardware to clean the page tables. It also
> splits huge pages in chunks if manual protect and init set options are
> enabled.
> 
> > diff --git a/arch/arm64/include/asm/kvm_dirty_bit.h b/arch/arm64/include/asm/kvm_dirty_bit.h
> > --- a/arch/arm64/include/asm/kvm_dirty_bit.h
> > +++ b/arch/arm64/include/asm/kvm_dirty_bit.h
> 
> [ ... ]
> 
> > +static inline bool kvm_arch_dirty_clear_enabled(struct kvm *kvm)
> > +{
> > +	return this_cpu_read(hacdbs_pcp.status) == HACDBS_IDLE &&
> > +	       (kvm->arch.mmu.pgt->flags & KVM_PGTABLE_S2_DBM);
> > +}
> 
> [Severity: High]
> Can this cause a kernel panic if invoked before the MMU is fully initialized?
> The stage-2 page tables (pgt) are allocated lazily, so if userspace calls
> the clear dirty log ioctl before initialization, kvm->arch.mmu.pgt could be
> NULL.
>

It should only be cleaned after dirtied, which happens after dirty_log is 
enabled, which happens way after initalization.
 
> > diff --git a/arch/arm64/kvm/dirty_bit.c b/arch/arm64/kvm/dirty_bit.c
> > --- a/arch/arm64/kvm/dirty_bit.c
> > +++ b/arch/arm64/kvm/dirty_bit.c
> 
> [ ... ]
> 
> > +int __kvm_arch_dirty_log_clear(struct kvm *kvm,
> > +			       struct kvm_memory_slot *memslot,
> > +			       struct kvm_clear_dirty_log *log,
> > +			       unsigned long *bitmap,
> > +			       bool *flush)
> > +{
> 
> [ ... ]
> 
> > +	for (unsigned long i = start; i < end; i++) {
> > +		unsigned long mask;
> > +		gfn_t offset;
> > +		atomic_long_t *p;
> > +
> > +		if (log) { /* Clean only what is in the input bitmap */
> > +			mask = bitmap[i];
> 
> [Severity: High]
> Does this code read out of bounds of the userspace-provided bitmap?
> When a dirty log is provided, the bitmap is sized relative to log->num_pages.
> Since i is an absolute index into the guest's memory slot (starting at
> log->first_page / BITS_PER_LONG), accessing bitmap[i] uses the absolute 
> offset instead of a 0-based relative offset.

It is the same algorithm as the generic sw cleaning. Those verifications 
already happen in the caller function.

> 
> > +			if (!mask)
> > +				continue;
> 
> [ ... ]
> 
> > +		offset = base_gfn + i * BITS_PER_LONG;
> > +
> > +		if (kvm_dirty_log_manual_protect_and_init_set(kvm))
> > +			kvm_mmu_split_huge_pages(kvm,
> > +						 gfn_to_gpa(offset + __ffs(mask)),
> > +						 gfn_to_gpa(offset + __fls(mask) + 1));
> 
> [Severity: High]
> Does this create a race condition that loses guest writes?
> The software dirty bitmap is cleared at the start of this loop. If
> kvm_mmu_split_huge_pages() is called here, it drops and reacquires
> kvm->mmu_lock to top up the memory cache. Any guest write that occurs
> while the lock is dropped is recorded in the hardware PTE, but because
> the software bit was already cleared, the subsequent hardware execution
> clears the PTE's dirty bit, erasing the concurrent write.
> 

The write is already captured, and will be reported to the user at the 
end of this syscall. If it gets dirty before the bit is actually clean, 
there is no issue, as the user already knows the page is dirty.


> [Severity: High]
> Can this permanently break dirty logging for huge pages if splitting fails?
> The return value of kvm_mmu_split_huge_pages() is ignored. If it fails
> (e.g., returning -ENOMEM), the block is not split, but the IPAs are
> still fed to the hardware. The hardware will then clear the dirty bit on
> the 2MB block descriptor, losing dirty tracking for the entire region.
> 

It will not. 
HACDBS will fail because the block level is not the same as reported.

So it falls back to software.

Interestingly enough, no code actually checks if the splitting actually 
succeeded before proceeding.


> [Severity: High]
> Is there an off-by-one error in the split_end calculation?
> Looking at the split_end assignment in the full context:
> 
> __kvm_arch_dirty_log_clear() {
>     ...
>     split_end = gpa_to_gfn(ALIGN_DOWN(end, PMD_SIZE) + PMD_SIZE - 1);
>     ...
> }
> 
> Since end is an exclusive upper bound, if the dirty region perfectly aligns
> with a PMD boundary, this calculates a split_end that spans to the end of
> the next PMD. This would cause the next PMD to skip splitting, leading to
> data loss when the hardware clears the dirty bit on the unsplit block.
> 

Its the other way around, split_end is used to skip splitting if the value 
is smaller than it. 

Say the block addr and size are 0x1000, it splits from 0x1000 to 
0x2000. split_end will be 0x1FFF, and any value above that (starting at 
0x2000) will trigger a new split. That's why the '-1' is at the end of its 
calculation.
 

> > +		do {
> > +			idx = mask_to_hdbss(&mask, hw_entries, offset, ttwl, idx, entries_sz);
> > +			if (idx >= entries_sz) {
> > +				ret = dirty_bit_clear(kvm, hw_entries, idx);
> 
> [Severity: High]
> Is there a missing context synchronization event in dirty_bit_clear()?
> When this calls dirty_bit_clear(), that function writes to the HCR_EL2
> control-plane system register and immediately calls __load_stage2():
> 
> dirty_bit_clear() {
>     ...
>     hcr_el2 = read_sysreg(HCR_EL2);
>     write_sysreg(hcr_el2 | HCR_EL2_VM, HCR_EL2);
>     __load_stage2(&kvm->arch.mmu);
>     ...
> }
> 
> The architecture requires an isb() as the very next instruction after
> writing a control-plane system register. Without it, the CPU executes
> subsequent instructions in an architecturally undefined pipeline state.
> 

Look previous patches.


> [Severity: High]
> Does this code access EL2 system registers directly from the host kernel?
> In nVHE configurations, the host kernel runs at EL1. The dirty_bit_clear()
> function directly reads and writes HCR_EL2. This will trigger an Undefined 
> Instruction exception and panic the host kernel on nVHE systems. KVM must 
> use EL2-private hyp code or hypercalls to access EL2 state outside of VHE 
> mode.
> 

This is a VHE exclusive feature, as seen in the first patch.

> [Severity: High]
> Does the wait loop in dirty_bit_clear() risk a lost wakeup and hard lockup?
> The code disables preemption but fails to disable local interrupts before
> calling wfi():
> 
> dirty_bit_clear() {
>     preempt_disable();
>     ...
>     do {
>         wfi();
>     } while (this_cpu_read(hacdbs_pcp.status) == HACDBS_RUNNING);
>     ...
> }
> 
> If the hardware interrupt fires and sets the status to idle after the
> condition is checked but before the wfi() instruction executes, the CPU
> will sleep indefinitely.
> 

Yes, see previous patches.

> > +				*flush = *flush || ret > 0;
> > +				if (ret != idx) {
> > +					/* Save bits not converted back to bitmap */
> > +					atomic_long_or(mask, (atomic_long_t *)&dirty_bitmap[i]);
> > +					goto out_err;
> > +				}
> > +				idx = 0;
> > +			}
> > +		} while (mask);
> > +	}
> 
> [ ... ]
> 
> > +out_err:
> > +	if (unlikely(ret != idx)) {
> > +		/*
> > +		 * In case there is an error and not all entries in HACDBS get
> > +		 * cleaned, we have to mark the dirty bits back in the bitmap,
> > +		 * as that will be used by the software routine.
> > +		 *
> > +		 * Entries should be in order, since they were extraxed from
> > +		 * the dirty-bitmap, so batching the atomic writes is efficient.
> > +		 */
> > +
> > +		if (ret < idx)
> > +			hdbss_to_bitmap(hw_entries, ret, idx, dirty_bitmap, memslot->base_gfn);
> > +
> > +		ret = -EAGAIN;
> > +	}
> > +
> > +	write_unlock(&kvm->mmu_lock);
> > +	kfree(hw_entries);
> > +
> > +	return ret;
> > +}
> 
> [Severity: High]
> Does this error handling block incorrectly treat exact-multiple success states
> as failures?
> If the processed entries perfectly divide by entries_sz, the loop resets
> idx = 0 and sets ret = entries_sz. The check if (unlikely(ret != idx))
> then evaluates to true, falsely triggering this error path and returning
> -EAGAIN.

Looks like so. But the software routine would just scan the dirty-bitmap 
and do nothing but flush the TLBs as it's needed. We can avoid this by 
adding and ret=0; after idx=0; so they will be always equal in this case.


> 
> Additionally, if it is not an exact multiple, the function exits with
> ret = idx (a positive integer), which the generic KVM ioctl layer treats
> as a failure code.

It does exit with a positive integer, but that is not returned by generic 
code. It will only cause a goto: out, which will return 0;

> 
> >  static irqreturn_t hacdbsirq_handler(int irq, void *pcpu)
> >  {
> >  	u64 cons = read_sysreg_s(SYS_HACDBSCONS_EL2);
> 
> [Severity: High]
> Does this handler also directly access an EL2-only register from the host?
> Just like dirty_bit_clear(), calling read_sysreg_s(SYS_HACDBSCONS_EL2)
> from the host IRQ handler running at EL1 will crash nVHE configurations
> with an Undefined Instruction exception.
> 

VHE-only feature.

> [Severity: Low]
> Is there a syntax error in this IRQ handler's switch statement?
> Looking at the rest of hacdbsirq_handler():
> 
> hacdbsirq_handler() {
>     ...
>     case HACDBSCONS_EL2_ERR_REASON_IPAHACF:
>         int index = ...;
>     ...
> }
> 
> A variable declaration immediately following a case label is a syntax
> error in C, which will cause a build failure.

It can be put in a block, it can be declared before.
I have no issue dealing with this, but does not currently cause any build 
error.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 05/13] KVM: arm64: Detect (via ACPI) and initialize HACDBSIRQ
  2026-06-29 15:43     ` Leonardo Bras
@ 2026-06-29 16:52       ` Vladimir Murzin
  2026-06-30 14:52         ` Leonardo Bras
  0 siblings, 1 reply; 47+ messages in thread
From: Vladimir Murzin @ 2026-06-29 16:52 UTC (permalink / raw)
  To: Leonardo Bras, sashiko-reviews; +Cc: kvmarm, Oliver Upton, Marc Zyngier, kvm

Hi,

On 6/29/26 16:43, Leonardo Bras wrote:
>>> +	case HACDBSCONS_EL2_ERR_REASON_IPAHACF:
>>> +		/* When size not a power of two >= 4k, exit with reserved TTLW */
>>> +		int index = FIELD_GET(HACDBSCONS_EL2_INDEX, cons);
>> [Severity: Low]
>> Does this variable declaration cause a build failure?
> No
> 
>> Under standard C rules,
>> a label can only be attached to a statement, and a declaration is not a
>> statement.
> That's not a label.

I think Sashiko is right

arch/arm64/kvm/dirty_bit.c: In function 'hacdbsirq_handler':
arch/arm64/kvm/dirty_bit.c:336:3: error: a label can only be part of a statement and a declaration is not a statement
  336 |   int index = FIELD_GET(HACDBSCONS_EL2_INDEX, cons);
      |   ^~~
make[4]: *** [scripts/Makefile.build:289: arch/arm64/kvm/dirty_bit.o] Error 1
make[4]: *** Waiting for unfinished jobs....

Based on other threads you might need GCC<=10.5.0 to hit that.

Obvious fix-up would be to wrap the case into curly braces :)

Cheers
Vladimir

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 02/13] KVM: arm64: Enable eager hugepage splitting if HDBSS is available
  2026-06-29 14:47     ` Leonardo Bras
@ 2026-06-29 17:06       ` Oliver Upton
  2026-06-30 12:58         ` Leonardo Bras
  0 siblings, 1 reply; 47+ messages in thread
From: Oliver Upton @ 2026-06-29 17:06 UTC (permalink / raw)
  To: Leonardo Bras; +Cc: sashiko-reviews, Marc Zyngier, kvmarm, kvm, Wei-Lin Chang

On Mon, Jun 29, 2026 at 03:47:51PM +0100, Leonardo Bras wrote:
> On Mon, Jun 29, 2026 at 11:36:45AM +0000, sashiko-bot@kernel.org wrote:
> > Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
> > 
> > New issues:
> > - [High] The default eager page splitting chunk size violates the KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE UAPI contract by using a value that is not a valid block size for 16KB and 64KB page kernels.
> > 
> > Pre-existing issues:
> > - [Critical] Dropping mmu_lock between primary and nested stage-2 write-protection creates a window where nested guests can bypass dirty logging.
> > --
> > 
> > commit 09e1d250ed331b9dc20b42b7f9315892a86a5d50
> > Author: Leonardo Bras <leo.bras@arm.com>
> > 
> > KVM: arm64: Enable eager hugepage splitting if HDBSS is available
> > 
> > This commit enables eager hugepage splitting by default for systems
> > supporting HDBSS, setting the default chunk size to 256K pages at
> > stage2 initialization.
> > 
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index 6c941aaa10c63..e086c01a9325e 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -1027,8 +1027,12 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
> > >  	for_each_possible_cpu(cpu)
> > >  		*per_cpu_ptr(mmu->last_vcpu_ran, cpu) = -1;
> > >  
> > > -	 /* The eager page splitting is disabled by default */
> > > -	mmu->split_page_chunk_size = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT;
> > > +	 /* The eager page splitting is disabled by default if system has no HDBSS */
> > > +	if (system_supports_hdbss())
> > > +		mmu->split_page_chunk_size = 256 * 1024 * PAGE_SIZE;
> > 
> > [Severity: High]
> > Does this hardcoded chunk size violate the KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
> > UAPI contract on kernels with 16KB or 64KB pages?
> > 
> > By setting the default chunk size to 256 * 1024 * PAGE_SIZE, this evaluates to
> > 4GB on 16KB page systems and 16GB on 64KB page systems.
> > 
> > However, in arch/arm64/kvm/arm.c:kvm_vm_ioctl_enable_cap(), the chunk size is
> > validated against valid architectural block sizes:
> > 
> >     if (!new_cap || kvm_is_block_size_supported(new_cap)) {
> >     ...
> > 
> > Since 4GB and 16GB are not valid block sizes on those configurations, a VMM
> > like QEMU attempting to read the default capability via KVM_CHECK_EXTENSION
> > and set it back via KVM_ENABLE_CAP will be rejected with -EINVAL.
> 
> Sure, changing this to something like PMD_SIZE or PUD_SIZE should be no issue.
>  
> But this raises a topic I would like to understand:
> - Do we actually need this to be a block_size to assure correctness? or is 
>   it just about efficiency?

What value is there in having a chunk size larger than the largest
possible block mapping? The whole UAPI is deliberately tied up with page
table geometry.

Overall, I'm not buying the argument for changing the behavior of
KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE. There are very good reasons for
*not* eagerly splitting the entire address space, especially if you know
the working set of the VM is small.

You can still use HDBSS without eagerly splitting, so long as block
mappings are {DBM, S2AP_W} = {0, 0} and leaf mappings (which have
a writable PFN) are {1, 0}.

> > [Severity: Critical]
> > This is a pre-existing issue, but by enabling eager hugepage splitting by
> > default for HDBSS, does this exacerbate a race condition with nested guests
> > bypassing dirty logging?
> > 
> > In arch/arm64/kvm/mmu.c:kvm_arch_mmu_enable_log_dirty_pt_masked(), KVM enables
> > dirty logging by write-protecting the primary stage-2 and then calling
> > kvm_mmu_split_huge_pages():
> > 
> >     if (kvm_dirty_log_manual_protect_and_init_set(kvm))
> >         kvm_mmu_split_huge_pages(kvm, start, end);
> > 
> >     kvm_nested_s2_wp(kvm);
> > 
> > Because kvm_mmu_split_huge_pages() can drop the mmu_lock to top-up the memory
> > cache or reschedule, a race window is created before kvm_nested_s2_wp()
> > protects the nested stage-2 tables.
> > 
> > While the lock is dropped, a nested VCPU can fault, map the page via its
> > separate shadow stage-2 (which hasn't been write-protected yet), and write to
> > the page without setting the dirty bitmap, leading to silent memory corruption
> > after live migration.
> > 
> 
> But if it's already write-protected in S2, if the nested VCPU faults 
> something, would not that be handled by non-nested faulting routine?
> 
> I understand very little of nested, and I can't understand how does that 
> work properly. Maybe Marc or Wei Lin may have a better answer for that.

The only time this would be an issue is if we allowed userspace to
observe a clean dirty bitmap with a stale, writable shadow stage-2. All
roads go through the slots_lock, so I don't believe this is actually an
issue.

The order of events Sashiko mentions makes no sense. If the shadow
stage-2 translation is already writable there would be no intervening
fault. Since the dirty bitmap is already cleared for these GFNs, a
racing write permission fault would actually set the dirty bitmap.

With that being said, I'd be happy with kvm_nested_s2_wp() getting moved
up earlier in the function anyway as it is easy to forget that the
mmu_lock could be dropped in between.

Thanks,
Oliver

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 09/13] KVM: arm64: Dirty-bitmap: avoid splitting previously split blocks
  2026-06-29 11:39   ` sashiko-bot
@ 2026-06-29 17:07     ` Leonardo Bras
  0 siblings, 0 replies; 47+ messages in thread
From: Leonardo Bras @ 2026-06-29 17:07 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: Leonardo Bras, kvmarm, Oliver Upton, Marc Zyngier, kvm

On Mon, Jun 29, 2026 at 11:39:37AM +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
> - [High] Incorrect PMD boundary calculation skips huge page splitting for adjacent PMDs and causes redundant splitting for the last chunk of each PMD.
> - [Low] Variable `split_end` is declared as a Guest Physical Address (`gpa_t`) but assigned and compared as a Guest Frame Number (`gfn_t`).
> - [Low] The local variables `start` and `end` shadow the outer loop bounds.
> --
> 
> commit 8863fe6c6c67ca7006255c8805499ac99f887014
> Author: Leonardo Bras <leo.bras@arm.com>
> 
> KVM: arm64: Dirty-bitmap: avoid splitting previously split blocks
> 
> If previous dirty-clean already split a block, then avoid calling the
> split helper on that block again.
> 
> > diff --git a/arch/arm64/kvm/dirty_bit.c b/arch/arm64/kvm/dirty_bit.c
> > --- a/arch/arm64/kvm/dirty_bit.c
> > +++ b/arch/arm64/kvm/dirty_bit.c
> > @@ -160,6 +160,7 @@ int __kvm_arch_dirty_log_clear(struct kvm *kvm,
> >  	u64 ttwl;
> >  	u64 start, end;
> >  	gfn_t base_gfn;
> > +	gpa_t split_end = 0;
> 
> [Severity: Low]
> This isn't a bug, but should split_end be declared as a gfn_t?
> 
> It is declared as a Guest Physical Address (gpa_t) here, but later receives
> a Guest Frame Number (gfn_t) from gpa_to_gfn(). While both compile as u64,
> this mixes address types.
> 
> [ ... ]

Sure, can be changed.

> 
> > @@ -204,10 +205,14 @@ int __kvm_arch_dirty_log_clear(struct kvm *kvm,
> >  
> >  		offset = base_gfn + i * BITS_PER_LONG;
> >  
> > -		if (kvm_dirty_log_manual_protect_and_init_set(kvm))
> > -			kvm_mmu_split_huge_pages(kvm,
> > -						 gfn_to_gpa(offset + __ffs(mask)),
> > -						 gfn_to_gpa(offset + __fls(mask) + 1));
> > +		if (kvm_dirty_log_manual_protect_and_init_set(kvm) &&
> > +		    (offset + BITS_PER_LONG > split_end)) {
> 
> [Severity: High]
> Does this comparison check an exclusive GFN against an inclusive GFN?
> 
> The expression offset + BITS_PER_LONG calculates an exclusive boundary,
> while split_end is calculated below as an inclusive boundary:
> 
>     split_end = gpa_to_gfn(ALIGN_DOWN(end, PMD_SIZE) + PMD_SIZE - 1);
> 
> Comparing these might cause a redundant split for the last 64-page batch
> of an already-split PMD.
> 

It can not.
offset+bits_per_long has to be bigger than split_end. Since split_end is 
the last bit of a level-2 block, it starts splitting again only if the 
address > split_end. 

> > +			gpa_t start = gfn_to_gpa(offset + __ffs(mask));
> > +			gpa_t end = gfn_to_gpa(offset + __fls(mask) + 1);
> 
> [Severity: Low]
> This isn't a bug, but do these variables shadow the outer loop bounds?
> 
> The variables start and end declared here share names with the loop bounds:
> 

Yes, they are used as loop bounds.
No they are not re-declared inside a loop, so it should be fine.

>     u64 start, end;
>     ...
>     for (unsigned long i = start; i < end; i++) {
> 
> While scoping rules use the correct variables, shadowing loop control
> variables might make the code prone to subtle bugs during future refactoring.
> 
> > +
> > +			kvm_mmu_split_huge_pages(kvm, start, end);
> > +			split_end = gpa_to_gfn(ALIGN_DOWN(end, PMD_SIZE) + PMD_SIZE - 1);
> 
> [Severity: High]
> Can this alignment skip splitting for the next PMD?
> 
> The variable end is an exclusive physical address:
> 
>     gpa_t end = gfn_to_gpa(offset + __fls(mask) + 1);
> 
> If a block of dirty pages ends exactly at a PMD boundary, ALIGN_DOWN() on
> the exclusive end address aligns it to that boundary. Adding PMD_SIZE - 1
> then advances split_end to the end of the next PMD.

Okay, because this end is non-inclusive in kvm_mmu_split_huge_pages, but 
for split_end it will cause to use the next block of address.

> 
> This would skip huge page splitting for the adjacent PMD, breaking dirty
> logging and potentially causing data corruption during live migration as
> dirty pages would not be tracked at the PTE level.

I need to fix  this in v3.


> 
> > +		}
> >  
> >  		do {
> >  			idx = mask_to_hdbss(&mask, hw_entries, offset, ttwl, idx, entries_sz);
> 
> -- 
> Sashiko AI review · https://sashiko.dev/#/patchset/20260629111820.1873540-1-leo.bras@arm.com?part=9

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 11/13] kvm/dirty_ring: Add arch-generic interface for hw-accelerated dirty-ring cleaning
  2026-06-29 11:49   ` sashiko-bot
@ 2026-06-29 17:09     ` Leonardo Bras
  0 siblings, 0 replies; 47+ messages in thread
From: Leonardo Bras @ 2026-06-29 17:09 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: Leonardo Bras, kvm, Oliver Upton, Marc Zyngier, kvmarm

On Mon, Jun 29, 2026 at 11:49:03AM +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [Medium] Hardware-accelerated dirty ring reset completely bypasses the reset tracepoint.
> --
> 
> commit 8d48fca1a0a1a0e2a8d66dd93d79ffc96e524e83
> Author: Leonardo Bras <leo.bras@arm.com>
> 
> kvm/dirty_ring: Add arch-generic interface for hw-accelerated dirty-ring cleaning
> 
> This commit introduces kvm_arch_dirty_ring_clear() to allow architectures
> to implement hardware-accelerated dirty-ring routines. It adds a call to
> this new function in kvm_dirty_ring_reset(), which will fall back to the
> software version if the hardware version is not implemented or fails.
> 
> [ ... ]
> 
> > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > index 42de1a511037a..fe4e7da6cc4a6 100644
> > --- a/virt/kvm/dirty_ring.c
> > +++ b/virt/kvm/dirty_ring.c
> > @@ -133,6 +134,9 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
> >  	 */
> >  	lockdep_assert_held(&kvm->slots_lock);
> >  
> > +	if (kvm_arch_dirty_ring_clear(kvm, ring, nr_entries_reset) >= 0)
> > +		return 0;
> 
> [Severity: Medium]
> Does this early return bypass the trace_kvm_dirty_ring_reset() tracepoint
> at the bottom of the function?
> 
> When hardware-accelerated dirty ring cleaning succeeds, returning early here
> skips the tracepoint entirely. This could break observability tools and
> scripts that rely on this tracepoint to monitor KVM dirty ring resets.

Ok, sure, I can goto-out to before the tracing, then.

> 
> -- 
> Sashiko AI review · https://sashiko.dev/#/patchset/20260629111820.1873540-1-leo.bras@arm.com?part=11

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 05/13] KVM: arm64: Detect (via ACPI) and initialize HACDBSIRQ
  2026-06-29 11:17 ` [PATCH v2 05/13] KVM: arm64: Detect (via ACPI) and initialize HACDBSIRQ Leonardo Bras
  2026-06-29 11:32   ` sashiko-bot
@ 2026-06-29 17:22   ` Oliver Upton
  2026-06-30 14:50     ` Leonardo Bras
  1 sibling, 1 reply; 47+ messages in thread
From: Oliver Upton @ 2026-06-29 17:22 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Catalin Marinas, Will Deacon, Marc Zyngier, Joey Gouly,
	Steffen Eiden, Suzuki K Poulose, Zenghui Yu, Rafael J. Wysocki,
	Len Brown, Saket Dumbre, Paolo Bonzini, Jonathan Cameron,
	Chengwen Feng, Kees Cook, Mikołaj Lenczewski, James Morse,
	Zeng Heng, mrigendrachaubey, Thomas Huth, Ryan Roberts,
	Yeoreum Yun, Mark Brown, Kevin Brodsky, James Clark, Fuad Tabba,
	Raghavendra Rao Ananta, Lorenzo Pieralisi, Sascha Bischoff,
	Anshuman Khandual, Tian Zheng, linux-arm-kernel, linux-kernel,
	kvmarm, linux-acpi, acpica-devel, kvm

On Mon, Jun 29, 2026 at 12:17:53PM +0100, Leonardo Bras wrote:
> Find via ACPI [1] the Id for HACDBSIRQ, initialize it as a per-cpu IRQ
> and make sure any cpu able to run virtualization has it active.
> 
> Introduce a per-cpu structure used by the HACDBSIRQ handler to keep track
> of entries size and the status of HACDBS. Size is used to detect end of
> processing in case the number of entries being processed is different of
> the supported entries size.
> 
> Status may look easily replaceable by checking HACDBS registers now, but
> will make the OFF/IDLE detection easier in next patches.
> 
> Signed-off-by: Leonardo Bras <leo.bras@arm.com>
> 
> [1] https://github.com/tianocore/edk2/issues/12409

Reference the ACPI specification instead please. Any link you want to
include in a changelog should use the Link: footer, the linkage to the
inline citation will be obvious.

If we need to initialize the IRQ I'd really like to see device tree
bindings for HACDBSIRQ as well. Pretty much any system us plebs can get
our hands on is gonna be DT anyway.

> +static irqreturn_t hacdbsirq_handler(int irq, void *pcpu)
> +{
> +	u64 cons = read_sysreg_s(SYS_HACDBSCONS_EL2);
> +	unsigned long err = FIELD_GET(HACDBSCONS_EL2_ERR_REASON, cons);
> +
> +	switch (err) {
> +	case HACDBSCONS_EL2_ERR_REASON_NOF:
> +		this_cpu_write(hacdbs_pcp.status, HACDBS_IDLE);
> +		break;
> +	case HACDBSCONS_EL2_ERR_REASON_IPAHACF:
> +		/* When size not a power of two >= 4k, exit with reserved TTLW */
> +		int index = FIELD_GET(HACDBSCONS_EL2_INDEX, cons);
> +
> +		if (index >= this_cpu_read(hacdbs_pcp.size)) {
> +			this_cpu_write(hacdbs_pcp.status, HACDBS_IDLE);
> +			break;
> +		}
> +		fallthrough;
> +	case HACDBSCONS_EL2_ERR_REASON_STRUCTF:
> +	case HACDBSCONS_EL2_ERR_REASON_IPAF:
> +		this_cpu_write(hacdbs_pcp.status, HACDBS_ERROR);
> +		break;
> +	}
> +
> +	return IRQ_HANDLED;
> +}

I have a pretty extreme distaste for creating a state machine between
the callsite and the IRQ handler. The callsite should poll HACDBS for
completion. The thread has nothing better to do anyway.

Thanks,
Oliver

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 12/13] KVM: arm64: Add hardware-accelerated dirty-ring cleaning routine
  2026-06-29 11:49   ` sashiko-bot
@ 2026-06-29 17:26     ` Leonardo Bras
  0 siblings, 0 replies; 47+ messages in thread
From: Leonardo Bras @ 2026-06-29 17:26 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: Leonardo Bras, Marc Zyngier, Oliver Upton, kvmarm, kvm

On Mon, Jun 29, 2026 at 11:49:20AM +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 6 potential issue(s) to consider:
> 
> New issues:
> - [Critical] Unvalidated userspace memory accesses in dirty ring clearing lead to host DoS and memory corruption.
> - [High] Missing TLB invalidation and DSB in dirty ring hardware path.
> - [High] Hardware buffer allocation lacks strict physical alignment guarantees.
> 
> Pre-existing issues:
> - [High] Missing architecturally required context synchronization (ISB) after control register writes.
> - [High] Unsafe wfi() polling loop causes lost wakeups and CPU lockups.
> - [Medium] Hardware dirty log clearing erroneously falls back to software due to incorrect return value processing.
> --
> 
> commit 9f463875fd0f7eddcaa7053b557d7a2cc21e83c1
> Author: Leonardo Bras <leo.bras@arm.com>
> 
> KVM: arm64: Add hardware-accelerated dirty-ring cleaning routine
> 
> This commit implements an arm64 version of kvm_arch_dirty_ring_clear using
> FEAT_HACDBS. It translates dirty-ring entries into HDBSS entries and uses
> the hardware mechanism to mark writable-dirty pages as clean. Invalid
> entries are then cleared, falling back to software cleaning on error.
> 
> > diff --git a/arch/arm64/kvm/dirty_bit.c b/arch/arm64/kvm/dirty_bit.c
> > --- a/arch/arm64/kvm/dirty_bit.c
> > +++ b/arch/arm64/kvm/dirty_bit.c
> [ ... ]
> > static int dirty_bit_clear(struct kvm *kvm, u64 *hw_entries, int size)
> > {
> > 	u64 hcr_el2;
> > 	int ret;
> > 
> > 	preempt_disable();
> > 
> > 	hcr_el2 = read_sysreg(HCR_EL2);
> > 	write_sysreg(hcr_el2 | HCR_EL2_VM, HCR_EL2);
> > 	__load_stage2(&kvm->arch.mmu);
> 
> [Severity: High]
> This is a pre-existing issue, but does this missing ISB after control register
> writes expose the pipeline to undefined architectural states?
> 
> Writing to HCR_EL2 requires an immediate context synchronization event.
> Executing __load_stage2 immediately after without an isb means it might
> operate under a stale configuration.

See prev patches


> 
> > 
> > 	hacdbs_start(hw_entries, size);
> > 
> > 	do {
> > 		wfi();
> > 	} while (this_cpu_read(hacdbs_pcp.status) == HACDBS_RUNNING);
> 
> [Severity: High]
> This is a pre-existing issue, but can this unprotected wfi wait loop cause
> a lost wakeup and an indefinite CPU stall?
> 
> Because local interrupts are not masked around this check and wait, a timely
> interrupt could satisfy the condition prematurely before wfi executes.
> 

iden

> [ ... ]
> > 	if (idx != 0) {
> > 		ret = dirty_bit_clear(kvm, hw_entries, idx);
> > 		*flush = *flush || ret > 0;
> > 	}
> > out_err:
> > 	if (unlikely(ret != idx)) {
> 
> [Severity: Medium]
> This is a pre-existing issue, but does this logic incorrectly evaluate a
> successful batch outcome as an error?
> 
> If the loop finishes exactly on a batch boundary, idx is reset to 0, but ret
> holds the success count of the previous batch. This causes the mismatch check
> to evaluate to true, forcing a -EAGAIN return and triggering an unnecessary
> software fallback.
> 

iden

> [ ... ]
> > @@ -256,6 +256,72 @@ int __kvm_arch_dirty_log_clear(struct kvm *kvm,
> >  	return ret;
> >  }
> >  
> > +int __kvm_arch_dirty_ring_clear(struct kvm *kvm, struct kvm_dirty_ring *ring,
> > +				int *nr_entries_reset)
> > +{
> > +	u64 *hw_entries;
> > +	u64 slot_offset = 0;
> > +	u64 ttwl;
> > +	int i, ret;
> > +	u32 slot = -1;
> > +
> > +	if (signal_pending(current))
> > +		return -EINTR;
> > +
> > +	ttwl = HDBSS_ENTRY_TTWL(KVM_PGTABLE_LAST_LEVEL);
> > +
> > +	hw_entries = kmalloc(max(ring->size * sizeof(u64), PAGE_SIZE), GFP_KERNEL);
> 
> [Severity: High]
> Does this kmalloc call lack the strict physical alignment guarantees required
> by the hardware accelerator?
> 
> Since kmalloc does not guarantee natural physical alignment for large buffers,
> is it possible for physical address truncation to occur? This could cause the
> hardware accelerator to parse unaligned memory and corrupt arbitrary page DBM
> bits.

I recall it had size-alignmen guarantees. Largest buffer size should be 
64k x 8B, so it should be ok.


> 
> > +	if (!hw_entries)
> > +		return -ENOMEM;
> > +
> > +	for (i = 0; i < ring->size; i++) {
> > +		struct kvm_dirty_gfn *entry;
> > +		gfn_t gfn;
> > +
> > +		entry = &ring->dirty_gfns[(ring->reset_index + i) &
> > +					  (ring->size - 1)];
> > +
> > +		if (!kvm_dirty_gfn_harvested(entry))
> > +			break;
> > +
> > +		if (entry->slot != slot) {
> > +			struct kvm_memory_slot *memslot;
> > +
> > +			memslot = kvm_dirty_ring_get_memslot(kvm, entry->slot);
> > +			slot = entry->slot;
> > +			slot_offset = memslot->base_gfn;
> 
> [Severity: Critical]
> Can an invalid, user-provided entry->slot lead to a NULL pointer dereference
> here?
> 
> The memslot pointer is dereferenced directly to read base_gfn without a NULL
> check.
> 
> > +		}
> > +
> > +		gfn = slot_offset + entry->offset;
> 

Correct. I have to verify the return value of get_memslot before using it.
Will do in v3

> [Severity: Critical]
> Is entry->offset validated against the memslot bounds?
> 
> Using an unvalidated user-provided offset here allows arbitrary Stage-2 GPA
> computation. This could enable Stage-2 memory corruption or privilege
> escalation.

Correct. Will verify the offset against memslot size before this.
Will be done in v3.

> 
> > +
> > +		hw_entries[i] = (gfn_to_gpa(gfn) & HDBSS_ENTRY_IPA) |
> > +				ttwl | HDBSS_ENTRY_VALID;
> > +	}
> > +
> > +	ret = dirty_bit_clear(kvm, hw_entries, i);
> 
> [Severity: High]
> Does this path miss an architecturally required TLB invalidation and DSB?
> 
> dirty_bit_clear relies on the caller to handle the TLBI (as noted in its
> internal comments for the log clearing path), but this ring clearing routine
> appears to jump directly to invalidating the entries and returning.

There is a dsb() and then a TLBI before returning to guest.

> 
> This could leave stale writable TLB entries in the guest.
> 
> > +
> > +	/* Set as invalid all successfully cleaned entries */
> > +	for (int j = 0; j < ret; j++) {
> [ ... ]
> 
> -- 
> Sashiko AI review · https://sashiko.dev/#/patchset/20260629111820.1873540-1-leo.bras@arm.com?part=12

Thanks!
Leo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 06/13] KVM: arm64: dirty_bit: Add base FEAT_HACDBS cleaning routine
  2026-06-29 11:17 ` [PATCH v2 06/13] KVM: arm64: dirty_bit: Add base FEAT_HACDBS cleaning routine Leonardo Bras
  2026-06-29 11:29   ` sashiko-bot
@ 2026-06-29 17:36   ` Oliver Upton
  2026-06-30 14:59     ` Leonardo Bras
  1 sibling, 1 reply; 47+ messages in thread
From: Oliver Upton @ 2026-06-29 17:36 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Catalin Marinas, Will Deacon, Marc Zyngier, Joey Gouly,
	Steffen Eiden, Suzuki K Poulose, Zenghui Yu, Rafael J. Wysocki,
	Len Brown, Saket Dumbre, Paolo Bonzini, Jonathan Cameron,
	Chengwen Feng, Kees Cook, Mikołaj Lenczewski, James Morse,
	Zeng Heng, mrigendrachaubey, Thomas Huth, Ryan Roberts,
	Yeoreum Yun, Mark Brown, Kevin Brodsky, James Clark, Fuad Tabba,
	Raghavendra Rao Ananta, Lorenzo Pieralisi, Sascha Bischoff,
	Anshuman Khandual, Tian Zheng, linux-arm-kernel, linux-kernel,
	kvmarm, linux-acpi, acpica-devel, kvm

On Mon, Jun 29, 2026 at 12:17:54PM +0100, Leonardo Bras wrote:
> Introduce the basic cleaning routine that is going to be used for both
> dirty-bitmap and dirty-ring routines.
> 
> It sets the required registers with the input buffer, and wait for
> HACDBSIRQ to happen, which means either the task is done, or there was some
> error during processing.
> 
> It is ran with preemption disabled, as a task being scheduled in could
> change the translation registers used by HACDBS and end up corrupting the
> current dirty-bit tracking and the sched-in task's S2 pagetables.
> 
> Signed-off-by: Leonardo Bras <leo.bras@arm.com>
> ---
>  arch/arm64/kvm/dirty_bit.c | 81 ++++++++++++++++++++++++++++++++++++++
>  1 file changed, 81 insertions(+)
> 
> diff --git a/arch/arm64/kvm/dirty_bit.c b/arch/arm64/kvm/dirty_bit.c
> index 789da8712b1b..e4283828b780 100644
> --- a/arch/arm64/kvm/dirty_bit.c
> +++ b/arch/arm64/kvm/dirty_bit.c
> @@ -1,36 +1,117 @@
>  // SPDX-License-Identifier: GPL-2.0-only
>  /*
>   * Copyright (C) 2026 ARM Ltd.
>   * Author: Leonardo Bras <leo.bras@arm.com>
>   */
>  
>  #include <asm/kvm_dirty_bit.h>
> +#include <asm/kvm_mmu.h>
>  #include <linux/kconfig.h>
>  #include <linux/acpi.h>
>  
>  DEFINE_PER_CPU(struct hacdbs, hacdbs_pcp) = {
>  	.status = HACDBS_OFF,
>  	.size = 0,
>  };
>  
>  /* HDBSS entry field definitions */
>  #define HDBSS_ENTRY_VALID BIT(0)
>  #define HDBSS_ENTRY_TTWL_SHIFT (1)
>  #define HDBSS_ENTRY_TTWL_MASK (GENMASK(3, 1))
>  #define HDBSS_ENTRY_TTWL(x) \
>  	(((x) << HDBSS_ENTRY_TTWL_SHIFT) & HDBSS_ENTRY_TTWL_MASK)
>  #define HDBSS_ENTRY_TTWL_RESV HDBSS_ENTRY_TTWL(-4)
>  #define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
>  
>  static __ro_after_init int hacdbsirq = -1;
>  
> +static void hacdbs_start(u64 *hw_entries, int size)
> +{
> +	u64 br;
> +	/* Each entry is 8 bytes */
> +	int size_b = size * sizeof(hw_entries[0]);
> +	int size_p2 = max(roundup_pow_of_two(size_b), PAGE_SIZE);
> +
> +	/* If not using the full size of the array, put a stop entry at the end */
> +	if (size_b < size_p2)
> +		hw_entries[size] = HDBSS_ENTRY_VALID | HDBSS_ENTRY_TTWL_RESV;
> +
> +	sysreg_clear_set_s(SYS_HACDBSCONS_EL2,
> +			   HACDBSCONS_EL2_ERR_REASON | HACDBSCONS_EL2_INDEX, 0);
> +
> +	br = (virt_to_phys(hw_entries) & HACDBSBR_EL2_BADDR_MASK) |
> +	     FIELD_PREP(HACDBSBR_EL2_SZ, ilog2(size_p2) - 12) |
> +	     FIELD_PREP(HACDBSBR_EL2_EN, 1);
> +
> +	this_cpu_write(hacdbs_pcp.status, HACDBS_RUNNING);
> +	this_cpu_write(hacdbs_pcp.size, size);
> +	write_sysreg_s(br, SYS_HACDBSBR_EL2);
> +	isb();
> +}
> +
> +static int hacdbs_stop(void)
> +{
> +	write_sysreg_s(0, SYS_HACDBSBR_EL2);
> +	isb();
> +
> +	if (this_cpu_read(hacdbs_pcp.status) == HACDBS_ERROR) {
> +		/* In case of error, HACDBSCONS_EL2.INDEX should point the faulty entry */
> +		u64 cons = read_sysreg_s(SYS_HACDBSCONS_EL2);
> +		int idx = FIELD_GET(HACDBSCONS_EL2_INDEX, cons);
> +
> +		this_cpu_write(hacdbs_pcp.status, HACDBS_IDLE);
> +
> +		return idx;
> +	}
> +
> +	return this_cpu_read(hacdbs_pcp.size);
> +}
> +
> +/*
> + * Clears dirty-bits for an array of pages (hw_entries) using HACDBS
> + * Returns the number of items cleaned from the array. If returns value < size,
> + *	there was an error in the processing.
> + */
> +static int dirty_bit_clear(struct kvm *kvm, u64 *hw_entries, int size)
> +{
> +	u64 hcr_el2;
> +	int ret;
> +
> +	preempt_disable();
> +
> +	hcr_el2 = read_sysreg(HCR_EL2);
> +	write_sysreg(hcr_el2 | HCR_EL2_VM, HCR_EL2);

sysreg_clear_set_hcr(). I'm pretty sure all the speculative AT errata
depend on HCR_EL2.VM being set _after_ the stage-2 MMU has been loaded.

> +	__load_stage2(&kvm->arch.mmu);

Pretty sure you need an ISB here to ensure loading the MMU is ordered
with enabling HACDBS.

> +	hacdbs_start(hw_entries, size);
> +
> +	do {
> +		wfi();
> +	} while (this_cpu_read(hacdbs_pcp.status) == HACDBS_RUNNING);

This is exactly why I said you should just poll hardware instead. It is
entirely possible that the IRQ arrives before you WFI.

> +	ret = hacdbs_stop();
> +
> +	write_sysreg(hcr_el2, HCR_EL2);

write_sysreg_hcr()

Thanks,
Oliver

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 02/13] KVM: arm64: Enable eager hugepage splitting if HDBSS is available
  2026-06-29 17:06       ` Oliver Upton
@ 2026-06-30 12:58         ` Leonardo Bras
  2026-06-30 15:44           ` Oliver Upton
  0 siblings, 1 reply; 47+ messages in thread
From: Leonardo Bras @ 2026-06-30 12:58 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Leonardo Bras, sashiko-reviews, Marc Zyngier, kvmarm, kvm,
	Wei-Lin Chang

On Mon, Jun 29, 2026 at 10:06:38AM -0700, Oliver Upton wrote:
> On Mon, Jun 29, 2026 at 03:47:51PM +0100, Leonardo Bras wrote:
> > On Mon, Jun 29, 2026 at 11:36:45AM +0000, sashiko-bot@kernel.org wrote:
> > > Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
> > > 
> > > New issues:
> > > - [High] The default eager page splitting chunk size violates the KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE UAPI contract by using a value that is not a valid block size for 16KB and 64KB page kernels.
> > > 
> > > Pre-existing issues:
> > > - [Critical] Dropping mmu_lock between primary and nested stage-2 write-protection creates a window where nested guests can bypass dirty logging.
> > > --
> > > 
> > > commit 09e1d250ed331b9dc20b42b7f9315892a86a5d50
> > > Author: Leonardo Bras <leo.bras@arm.com>
> > > 
> > > KVM: arm64: Enable eager hugepage splitting if HDBSS is available
> > > 
> > > This commit enables eager hugepage splitting by default for systems
> > > supporting HDBSS, setting the default chunk size to 256K pages at
> > > stage2 initialization.
> > > 
> > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > > index 6c941aaa10c63..e086c01a9325e 100644
> > > > --- a/arch/arm64/kvm/mmu.c
> > > > +++ b/arch/arm64/kvm/mmu.c
> > > > @@ -1027,8 +1027,12 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
> > > >  	for_each_possible_cpu(cpu)
> > > >  		*per_cpu_ptr(mmu->last_vcpu_ran, cpu) = -1;
> > > >  
> > > > -	 /* The eager page splitting is disabled by default */
> > > > -	mmu->split_page_chunk_size = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT;
> > > > +	 /* The eager page splitting is disabled by default if system has no HDBSS */
> > > > +	if (system_supports_hdbss())
> > > > +		mmu->split_page_chunk_size = 256 * 1024 * PAGE_SIZE;
> > > 
> > > [Severity: High]
> > > Does this hardcoded chunk size violate the KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
> > > UAPI contract on kernels with 16KB or 64KB pages?
> > > 
> > > By setting the default chunk size to 256 * 1024 * PAGE_SIZE, this evaluates to
> > > 4GB on 16KB page systems and 16GB on 64KB page systems.
> > > 
> > > However, in arch/arm64/kvm/arm.c:kvm_vm_ioctl_enable_cap(), the chunk size is
> > > validated against valid architectural block sizes:
> > > 
> > >     if (!new_cap || kvm_is_block_size_supported(new_cap)) {
> > >     ...
> > > 
> > > Since 4GB and 16GB are not valid block sizes on those configurations, a VMM
> > > like QEMU attempting to read the default capability via KVM_CHECK_EXTENSION
> > > and set it back via KVM_ENABLE_CAP will be rejected with -EINVAL.
> > 
> > Sure, changing this to something like PMD_SIZE or PUD_SIZE should be no issue.
> >  
> > But this raises a topic I would like to understand:
> > - Do we actually need this to be a block_size to assure correctness? or is 
> >   it just about efficiency?
> 
> What value is there in having a chunk size larger than the largest
> possible block mapping? The whole UAPI is deliberately tied up with page
> table geometry.

Not larger, possibly smallerv My concern was the difference in pages to 
split between 4k, 16k and 64k.

> 
> Overall, I'm not buying the argument for changing the behavior of
> KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE. There are very good reasons for
> *not* eagerly splitting the entire address space, especially if you know
> the working set of the VM is small.
> 
> You can still use HDBSS without eagerly splitting, so long as block
> mappings are {DBM, S2AP_W} = {0, 0} and leaf mappings (which have
> a writable PFN) are {1, 0}.
> 

Block mappings being read-only, and leaf mappings being writable-clean, 
then? Could you please ellaborate on why does not it need eager-split?

As a review, what I recall from the strategy for hw dirty-logging was:
- If we have HDBSS, add DBM for all writable pages {1, 1}
- On dirty-logging start, make them writable-clean {1, 0}
  - Can be done using HACDBS
  - Enable HDBSS & HAFDBS

We don't have a fault for making pages dirty anymore, as this is done 
by HAFDBS and recorded by HDBSS, so splitting does not happen on demand 
anymore. So if we want to split pages, for better tracking granularity, or 
anything, we have to eager-split them.

Not doing so may cause us to have different level leaf sizes (pages & 
hugepages) mixed up in HDBSS, which adds complexity, on top of possibly 
making a live migration slower (the whole leaf entry has to be sent over 
network).

> > > [Severity: Critical]
> > > This is a pre-existing issue, but by enabling eager hugepage splitting by
> > > default for HDBSS, does this exacerbate a race condition with nested guests
> > > bypassing dirty logging?
> > > 
> > > In arch/arm64/kvm/mmu.c:kvm_arch_mmu_enable_log_dirty_pt_masked(), KVM enables
> > > dirty logging by write-protecting the primary stage-2 and then calling
> > > kvm_mmu_split_huge_pages():
> > > 
> > >     if (kvm_dirty_log_manual_protect_and_init_set(kvm))
> > >         kvm_mmu_split_huge_pages(kvm, start, end);
> > > 
> > >     kvm_nested_s2_wp(kvm);
> > > 
> > > Because kvm_mmu_split_huge_pages() can drop the mmu_lock to top-up the memory
> > > cache or reschedule, a race window is created before kvm_nested_s2_wp()
> > > protects the nested stage-2 tables.
> > > 
> > > While the lock is dropped, a nested VCPU can fault, map the page via its
> > > separate shadow stage-2 (which hasn't been write-protected yet), and write to
> > > the page without setting the dirty bitmap, leading to silent memory corruption
> > > after live migration.
> > > 
> > 
> > But if it's already write-protected in S2, if the nested VCPU faults 
> > something, would not that be handled by non-nested faulting routine?
> > 
> > I understand very little of nested, and I can't understand how does that 
> > work properly. Maybe Marc or Wei Lin may have a better answer for that.
> 
> The only time this would be an issue is if we allowed userspace to
> observe a clean dirty bitmap with a stale, writable shadow stage-2. All
> roads go through the slots_lock, so I don't believe this is actually an
> issue.
> 

Awesome! 

> The order of events Sashiko mentions makes no sense. If the shadow
> stage-2 translation is already writable there would be no intervening
> fault. Since the dirty bitmap is already cleared for these GFNs, a
> racing write permission fault would actually set the dirty bitmap.
> 
> With that being said, I'd be happy with kvm_nested_s2_wp() getting moved
> up earlier in the function anyway as it is easy to forget that the
> mmu_lock could be dropped in between.
> 

Makes sense.

Thanks!
Leo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 05/13] KVM: arm64: Detect (via ACPI) and initialize HACDBSIRQ
  2026-06-29 17:22   ` Oliver Upton
@ 2026-06-30 14:50     ` Leonardo Bras
  2026-06-30 16:03       ` Oliver Upton
  0 siblings, 1 reply; 47+ messages in thread
From: Leonardo Bras @ 2026-06-30 14:50 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Leonardo Bras, Catalin Marinas, Will Deacon, Marc Zyngier,
	Joey Gouly, Steffen Eiden, Suzuki K Poulose, Zenghui Yu,
	Rafael J. Wysocki, Len Brown, Saket Dumbre, Paolo Bonzini,
	Jonathan Cameron, Chengwen Feng, Kees Cook,
	Mikołaj Lenczewski, James Morse, Zeng Heng, mrigendrachaubey,
	Thomas Huth, Ryan Roberts, Yeoreum Yun, Mark Brown, Kevin Brodsky,
	James Clark, Fuad Tabba, Raghavendra Rao Ananta,
	Lorenzo Pieralisi, Sascha Bischoff, Anshuman Khandual, Tian Zheng,
	linux-arm-kernel, linux-kernel, kvmarm, linux-acpi, acpica-devel,
	kvm

On Mon, Jun 29, 2026 at 10:22:12AM -0700, Oliver Upton wrote:
> On Mon, Jun 29, 2026 at 12:17:53PM +0100, Leonardo Bras wrote:
> > Find via ACPI [1] the Id for HACDBSIRQ, initialize it as a per-cpu IRQ
> > and make sure any cpu able to run virtualization has it active.
> > 
> > Introduce a per-cpu structure used by the HACDBSIRQ handler to keep track
> > of entries size and the status of HACDBS. Size is used to detect end of
> > processing in case the number of entries being processed is different of
> > the supported entries size.
> > 
> > Status may look easily replaceable by checking HACDBS registers now, but
> > will make the OFF/IDLE detection easier in next patches.
> > 
> > Signed-off-by: Leonardo Bras <leo.bras@arm.com>
> > 
> > [1] https://github.com/tianocore/edk2/issues/12409
> 
> Reference the ACPI specification instead please.

It's not accepted yet, unfortunately. I commented that into the cover 
letter, but forgot to add it here.

> Any link you want to
> include in a changelog should use the Link: footer, the linkage to the
> inline citation will be obvious.

Sure, will remember that in the future.

> 
> If we need to initialize the IRQ I'd really like to see device tree
> bindings for HACDBSIRQ as well. Pretty much any system us plebs can get
> our hands on is gonna be DT anyway.

Agree. I started out with ACPI because that's what the main target is, as 
dirty-logging is focused in Live Migration, which is usually more 
appreciated in the server space, which generally uses ACPI.

I spoke to some people, and I could not hear of anyone releasing a product 
based in DT that would implement this yet, so I postponed the DT 
enablement.

> 
> > +static irqreturn_t hacdbsirq_handler(int irq, void *pcpu)
> > +{
> > +	u64 cons = read_sysreg_s(SYS_HACDBSCONS_EL2);
> > +	unsigned long err = FIELD_GET(HACDBSCONS_EL2_ERR_REASON, cons);
> > +
> > +	switch (err) {
> > +	case HACDBSCONS_EL2_ERR_REASON_NOF:
> > +		this_cpu_write(hacdbs_pcp.status, HACDBS_IDLE);
> > +		break;
> > +	case HACDBSCONS_EL2_ERR_REASON_IPAHACF:
> > +		/* When size not a power of two >= 4k, exit with reserved TTLW */
> > +		int index = FIELD_GET(HACDBSCONS_EL2_INDEX, cons);
> > +
> > +		if (index >= this_cpu_read(hacdbs_pcp.size)) {
> > +			this_cpu_write(hacdbs_pcp.status, HACDBS_IDLE);
> > +			break;
> > +		}
> > +		fallthrough;
> > +	case HACDBSCONS_EL2_ERR_REASON_STRUCTF:
> > +	case HACDBSCONS_EL2_ERR_REASON_IPAF:
> > +		this_cpu_write(hacdbs_pcp.status, HACDBS_ERROR);
> > +		break;
> > +	}
> > +
> > +	return IRQ_HANDLED;
> > +}
> 
> I have a pretty extreme distaste for creating a state machine between
> the callsite and the IRQ handler. The callsite should poll HACDBS for
> completion. The thread has nothing better to do anyway.

Well, there is one argument it could just wait and save some energy, but I 
agree it is not relevant in server space. The main reason I did this is 
because I am planning on later doing an improved version of this that would 
clean the dirty-bit *while* running the guest, and having the IRQ is needed 
for exiting guest so we can notify userspace the cleaning is done. So I 
laid the HACDBSIRQ infra here so we don't have both polling and IRQ options 
happening. 

That idea would require us to add new API (a return value for 'cleaned'), 
and also a new flag for the clean ioctl. We also need the VMM to 
implement that, but then we get a proper cpu usage of cleaning time.

I wanted to start with a backwards compatible version, and do the above 
idea once I put my hands in hardware that implements HACDBS, so I can 
properly measure how much performance we get on above strategy.

What do you think?

Thanks!
Leo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 05/13] KVM: arm64: Detect (via ACPI) and initialize HACDBSIRQ
  2026-06-29 16:52       ` Vladimir Murzin
@ 2026-06-30 14:52         ` Leonardo Bras
  0 siblings, 0 replies; 47+ messages in thread
From: Leonardo Bras @ 2026-06-30 14:52 UTC (permalink / raw)
  To: Vladimir Murzin
  Cc: Leonardo Bras, sashiko-reviews, kvmarm, Oliver Upton,
	Marc Zyngier, kvm

On Mon, Jun 29, 2026 at 05:52:45PM +0100, Vladimir Murzin wrote:
> Hi,
> 
> On 6/29/26 16:43, Leonardo Bras wrote:
> >>> +	case HACDBSCONS_EL2_ERR_REASON_IPAHACF:
> >>> +		/* When size not a power of two >= 4k, exit with reserved TTLW */
> >>> +		int index = FIELD_GET(HACDBSCONS_EL2_INDEX, cons);
> >> [Severity: Low]
> >> Does this variable declaration cause a build failure?
> > No
> > 
> >> Under standard C rules,
> >> a label can only be attached to a statement, and a declaration is not a
> >> statement.
> > That's not a label.
> 
> I think Sashiko is right
> 
> arch/arm64/kvm/dirty_bit.c: In function 'hacdbsirq_handler':
> arch/arm64/kvm/dirty_bit.c:336:3: error: a label can only be part of a statement and a declaration is not a statement
>   336 |   int index = FIELD_GET(HACDBSCONS_EL2_INDEX, cons);
>       |   ^~~
> make[4]: *** [scripts/Makefile.build:289: arch/arm64/kvm/dirty_bit.o] Error 1
> make[4]: *** Waiting for unfinished jobs....
> 
> Based on other threads you might need GCC<=10.5.0 to hit that.
> 
> Obvious fix-up would be to wrap the case into curly braces :)
> 
> Cheers
> Vladimir

Right!
I could finally understand what it meant on a next patch, where it 
explained the issue in a different way. Will add curly braces (or declare 
the variable before, honestly), and this should go.

Thanks!
Leo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 06/13] KVM: arm64: dirty_bit: Add base FEAT_HACDBS cleaning routine
  2026-06-29 17:36   ` Oliver Upton
@ 2026-06-30 14:59     ` Leonardo Bras
  2026-06-30 19:06       ` Oliver Upton
  0 siblings, 1 reply; 47+ messages in thread
From: Leonardo Bras @ 2026-06-30 14:59 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Leonardo Bras, Catalin Marinas, Will Deacon, Marc Zyngier,
	Joey Gouly, Steffen Eiden, Suzuki K Poulose, Zenghui Yu,
	Rafael J. Wysocki, Len Brown, Saket Dumbre, Paolo Bonzini,
	Jonathan Cameron, Chengwen Feng, Kees Cook,
	Mikołaj Lenczewski, James Morse, Zeng Heng, mrigendrachaubey,
	Thomas Huth, Ryan Roberts, Yeoreum Yun, Mark Brown, Kevin Brodsky,
	James Clark, Fuad Tabba, Raghavendra Rao Ananta,
	Lorenzo Pieralisi, Sascha Bischoff, Anshuman Khandual, Tian Zheng,
	linux-arm-kernel, linux-kernel, kvmarm, linux-acpi, acpica-devel,
	kvm

On Mon, Jun 29, 2026 at 10:36:40AM -0700, Oliver Upton wrote:
> On Mon, Jun 29, 2026 at 12:17:54PM +0100, Leonardo Bras wrote:
> > Introduce the basic cleaning routine that is going to be used for both
> > dirty-bitmap and dirty-ring routines.
> > 
> > It sets the required registers with the input buffer, and wait for
> > HACDBSIRQ to happen, which means either the task is done, or there was some
> > error during processing.
> > 
> > It is ran with preemption disabled, as a task being scheduled in could
> > change the translation registers used by HACDBS and end up corrupting the
> > current dirty-bit tracking and the sched-in task's S2 pagetables.
> > 
> > Signed-off-by: Leonardo Bras <leo.bras@arm.com>
> > ---
> >  arch/arm64/kvm/dirty_bit.c | 81 ++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 81 insertions(+)
> > 
> > diff --git a/arch/arm64/kvm/dirty_bit.c b/arch/arm64/kvm/dirty_bit.c
> > index 789da8712b1b..e4283828b780 100644
> > --- a/arch/arm64/kvm/dirty_bit.c
> > +++ b/arch/arm64/kvm/dirty_bit.c
> > @@ -1,36 +1,117 @@
> >  // SPDX-License-Identifier: GPL-2.0-only
> >  /*
> >   * Copyright (C) 2026 ARM Ltd.
> >   * Author: Leonardo Bras <leo.bras@arm.com>
> >   */
> >  
> >  #include <asm/kvm_dirty_bit.h>
> > +#include <asm/kvm_mmu.h>
> >  #include <linux/kconfig.h>
> >  #include <linux/acpi.h>
> >  
> >  DEFINE_PER_CPU(struct hacdbs, hacdbs_pcp) = {
> >  	.status = HACDBS_OFF,
> >  	.size = 0,
> >  };
> >  
> >  /* HDBSS entry field definitions */
> >  #define HDBSS_ENTRY_VALID BIT(0)
> >  #define HDBSS_ENTRY_TTWL_SHIFT (1)
> >  #define HDBSS_ENTRY_TTWL_MASK (GENMASK(3, 1))
> >  #define HDBSS_ENTRY_TTWL(x) \
> >  	(((x) << HDBSS_ENTRY_TTWL_SHIFT) & HDBSS_ENTRY_TTWL_MASK)
> >  #define HDBSS_ENTRY_TTWL_RESV HDBSS_ENTRY_TTWL(-4)
> >  #define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
> >  
> >  static __ro_after_init int hacdbsirq = -1;
> >  
> > +static void hacdbs_start(u64 *hw_entries, int size)
> > +{
> > +	u64 br;
> > +	/* Each entry is 8 bytes */
> > +	int size_b = size * sizeof(hw_entries[0]);
> > +	int size_p2 = max(roundup_pow_of_two(size_b), PAGE_SIZE);
> > +
> > +	/* If not using the full size of the array, put a stop entry at the end */
> > +	if (size_b < size_p2)
> > +		hw_entries[size] = HDBSS_ENTRY_VALID | HDBSS_ENTRY_TTWL_RESV;
> > +
> > +	sysreg_clear_set_s(SYS_HACDBSCONS_EL2,
> > +			   HACDBSCONS_EL2_ERR_REASON | HACDBSCONS_EL2_INDEX, 0);
> > +
> > +	br = (virt_to_phys(hw_entries) & HACDBSBR_EL2_BADDR_MASK) |
> > +	     FIELD_PREP(HACDBSBR_EL2_SZ, ilog2(size_p2) - 12) |
> > +	     FIELD_PREP(HACDBSBR_EL2_EN, 1);
> > +
> > +	this_cpu_write(hacdbs_pcp.status, HACDBS_RUNNING);
> > +	this_cpu_write(hacdbs_pcp.size, size);
> > +	write_sysreg_s(br, SYS_HACDBSBR_EL2);
> > +	isb();
> > +}
> > +
> > +static int hacdbs_stop(void)
> > +{
> > +	write_sysreg_s(0, SYS_HACDBSBR_EL2);
> > +	isb();
> > +
> > +	if (this_cpu_read(hacdbs_pcp.status) == HACDBS_ERROR) {
> > +		/* In case of error, HACDBSCONS_EL2.INDEX should point the faulty entry */
> > +		u64 cons = read_sysreg_s(SYS_HACDBSCONS_EL2);
> > +		int idx = FIELD_GET(HACDBSCONS_EL2_INDEX, cons);
> > +
> > +		this_cpu_write(hacdbs_pcp.status, HACDBS_IDLE);
> > +
> > +		return idx;
> > +	}
> > +
> > +	return this_cpu_read(hacdbs_pcp.size);
> > +}
> > +
> > +/*
> > + * Clears dirty-bits for an array of pages (hw_entries) using HACDBS
> > + * Returns the number of items cleaned from the array. If returns value < size,
> > + *	there was an error in the processing.
> > + */
> > +static int dirty_bit_clear(struct kvm *kvm, u64 *hw_entries, int size)
> > +{
> > +	u64 hcr_el2;
> > +	int ret;
> > +
> > +	preempt_disable();
> > +
> > +	hcr_el2 = read_sysreg(HCR_EL2);
> > +	write_sysreg(hcr_el2 | HCR_EL2_VM, HCR_EL2);
> 
> sysreg_clear_set_hcr(). I'm pretty sure all the speculative AT errata
> depend on HCR_EL2.VM being set _after_ the stage-2 MMU has been loaded.
> 

So, move this to after __load_stage2()?
ok

> > +	__load_stage2(&kvm->arch.mmu);
> 
> Pretty sure you need an ISB here to ensure loading the MMU is ordered
> with enabling HACDBS.
>

does not __load_stage2() have an isb() here?
In any case, will add an isb() after sysreg_clear_set_hcr(), which should 
come after __load_stage2() IIUC.


 
> > +	hacdbs_start(hw_entries, size);
> > +
> > +	do {
> > +		wfi();
> > +	} while (this_cpu_read(hacdbs_pcp.status) == HACDBS_RUNNING);
> 
> This is exactly why I said you should just poll hardware instead. It is
> entirely possible that the IRQ arrives before you WFI.

It should be fine with WFIT, though, right?

I understand the reason in pooling, and even done some workaround in 
pooling for getting this to run in the model. 

Based on the previous reply, do you think I should only use polling for 
now, and implement the IRQ later?

> 
> > +	ret = hacdbs_stop();
> > +
> > +	write_sysreg(hcr_el2, HCR_EL2);
> 
> write_sysreg_hcr()

Sure!

Thanks for reviewing!
Leo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 02/13] KVM: arm64: Enable eager hugepage splitting if HDBSS is available
  2026-06-30 12:58         ` Leonardo Bras
@ 2026-06-30 15:44           ` Oliver Upton
  2026-06-30 17:09             ` Leonardo Bras
  0 siblings, 1 reply; 47+ messages in thread
From: Oliver Upton @ 2026-06-30 15:44 UTC (permalink / raw)
  To: Leonardo Bras; +Cc: sashiko-reviews, Marc Zyngier, kvmarm, kvm, Wei-Lin Chang

On Tue, Jun 30, 2026 at 01:58:48PM +0100, Leonardo Bras wrote:
> On Mon, Jun 29, 2026 at 10:06:38AM -0700, Oliver Upton wrote:
> > > But this raises a topic I would like to understand:
> > > - Do we actually need this to be a block_size to assure correctness? or is 
> > >   it just about efficiency?
> > 
> > What value is there in having a chunk size larger than the largest
> > possible block mapping? The whole UAPI is deliberately tied up with page
> > table geometry.
> 
> Not larger, possibly smallerv My concern was the difference in pages to 
> split between 4k, 16k and 64k.

Ok, well in any case the upper bound is going to be the largest possible
block mapping for a given page granule.

> > 
> > Overall, I'm not buying the argument for changing the behavior of
> > KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE. There are very good reasons for
> > *not* eagerly splitting the entire address space, especially if you know
> > the working set of the VM is small.
> > 
> > You can still use HDBSS without eagerly splitting, so long as block
> > mappings are {DBM, S2AP_W} = {0, 0} and leaf mappings (which have
> > a writable PFN) are {1, 0}.
> > 
> 
> Block mappings being read-only, and leaf mappings being writable-clean, 
> then? Could you please ellaborate on why does not it need eager-split?

Read-only translations will continue to generate permission faults
whereas writable-clean descriptors can be updated by hardware. You get
the opportunity to split a block mapping lazily while preserving
hardware dirty tracking for page mappings.

> As a review, what I recall from the strategy for hw dirty-logging was:
> - If we have HDBSS, add DBM for all writable pages {1, 1}
> - On dirty-logging start, make them writable-clean {1, 0}
>   - Can be done using HACDBS
>   - Enable HDBSS & HAFDBS
> 
> We don't have a fault for making pages dirty anymore, as this is done 
> by HAFDBS and recorded by HDBSS, so splitting does not happen on demand 
> anymore. So if we want to split pages, for better tracking granularity, or 
> anything, we have to eager-split them.

What I'm saying is the presumption that eager page splitting is always a
net-win is wrong. Nor is eager page splitting a hard requirement for
using HDBSS since you can set up the stage-2 in such a way that only
page granularity mappings are dirtied by hardware.

You could, in theory, have a workload that is read-heavy for a majority
of the VM's address space and writes to only a subset of that memory.
Eagerly splitting pages would likely regress the workload from a higher
rate of TLB refills / more TLB walk steps.

Lazily splitting would have the effect of leaving block mappings in
place for most of the VM. This is exactly why the VMM is in the driver
seat for deciding whether to lazily or eagerly split the stage-2.

The approach I think we may need is:

 - Use a software bit in the PTE to stash whether or not a PFN is
   'software-writable' when constructing the stage-2. By this I mean
   we've already faulted it in for write from the primary MMU.

 - At the time of write protection, reap the hardware-writable state
   from all PTEs but preserve the software-writable bit.

 - Whenever splitting a block mapping, set the DBM bit in the page-level
   PTEs if the block was software-writable and HDBSS is present.

That way you'd have sufficient metadata in the PTE to safely set DBM. We
could even make use of that metadata for write faults on non-HDBSS
hardware to avoid the overheads of user_mem_abort() (e.g. VMA lookup)
and treat it more like access flag updates.

The last point still needs some thought.

Thanks,
Oliver

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 05/13] KVM: arm64: Detect (via ACPI) and initialize HACDBSIRQ
  2026-06-30 14:50     ` Leonardo Bras
@ 2026-06-30 16:03       ` Oliver Upton
  2026-06-30 17:19         ` Leonardo Bras
  0 siblings, 1 reply; 47+ messages in thread
From: Oliver Upton @ 2026-06-30 16:03 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Catalin Marinas, Will Deacon, Marc Zyngier, Joey Gouly,
	Steffen Eiden, Suzuki K Poulose, Zenghui Yu, Rafael J. Wysocki,
	Len Brown, Saket Dumbre, Paolo Bonzini, Jonathan Cameron,
	Chengwen Feng, Kees Cook, Mikołaj Lenczewski, James Morse,
	Zeng Heng, mrigendrachaubey, Thomas Huth, Ryan Roberts,
	Yeoreum Yun, Mark Brown, Kevin Brodsky, James Clark, Fuad Tabba,
	Raghavendra Rao Ananta, Lorenzo Pieralisi, Sascha Bischoff,
	Anshuman Khandual, Tian Zheng, linux-arm-kernel, linux-kernel,
	kvmarm, linux-acpi, acpica-devel, kvm

On Tue, Jun 30, 2026 at 03:50:17PM +0100, Leonardo Bras wrote:
> On Mon, Jun 29, 2026 at 10:22:12AM -0700, Oliver Upton wrote:
> > If we need to initialize the IRQ I'd really like to see device tree
> > bindings for HACDBSIRQ as well. Pretty much any system us plebs can get
> > our hands on is gonna be DT anyway.
> 
> Agree. I started out with ACPI because that's what the main target is, as 
> dirty-logging is focused in Live Migration, which is usually more 
> appreciated in the server space, which generally uses ACPI.
> 
> I spoke to some people, and I could not hear of anyone releasing a product 
> based in DT that would implement this yet, so I postponed the DT 
> enablement.

Nested virt is always a good example. In some distant future KVM could
expose FEAT_HACDBS to the L1 hypervisor, and the VMM may be using DT
instead of ACPI (like kvmtool).

> > 
> > > +static irqreturn_t hacdbsirq_handler(int irq, void *pcpu)
> > > +{
> > > +	u64 cons = read_sysreg_s(SYS_HACDBSCONS_EL2);
> > > +	unsigned long err = FIELD_GET(HACDBSCONS_EL2_ERR_REASON, cons);
> > > +
> > > +	switch (err) {
> > > +	case HACDBSCONS_EL2_ERR_REASON_NOF:
> > > +		this_cpu_write(hacdbs_pcp.status, HACDBS_IDLE);
> > > +		break;
> > > +	case HACDBSCONS_EL2_ERR_REASON_IPAHACF:
> > > +		/* When size not a power of two >= 4k, exit with reserved TTLW */
> > > +		int index = FIELD_GET(HACDBSCONS_EL2_INDEX, cons);
> > > +
> > > +		if (index >= this_cpu_read(hacdbs_pcp.size)) {
> > > +			this_cpu_write(hacdbs_pcp.status, HACDBS_IDLE);
> > > +			break;
> > > +		}
> > > +		fallthrough;
> > > +	case HACDBSCONS_EL2_ERR_REASON_STRUCTF:
> > > +	case HACDBSCONS_EL2_ERR_REASON_IPAF:
> > > +		this_cpu_write(hacdbs_pcp.status, HACDBS_ERROR);
> > > +		break;
> > > +	}
> > > +
> > > +	return IRQ_HANDLED;
> > > +}
> > 
> > I have a pretty extreme distaste for creating a state machine between
> > the callsite and the IRQ handler. The callsite should poll HACDBS for
> > completion. The thread has nothing better to do anyway.
> 
> Well, there is one argument it could just wait and save some energy, but I 
> agree it is not relevant in server space.

I wouldn't suggest polling in a tight loop :) I'd say use something like
__mdelay() to get the core into a low-power state w/o using a naked WFI.
In fact, that already uses WFxT under the hood.

> The main reason I did this is 
> because I am planning on later doing an improved version of this that would 
> clean the dirty-bit *while* running the guest, and having the IRQ is needed 
> for exiting guest so we can notify userspace the cleaning is done. So I 
> laid the HACDBSIRQ infra here so we don't have both polling and IRQ options 
> happening. 
> 
> That idea would require us to add new API (a return value for 'cleaned'), 
> and also a new flag for the clean ioctl. We also need the VMM to 
> implement that, but then we get a proper cpu usage of cleaning time.
>
> I wanted to start with a backwards compatible version, and do the above 
> idea once I put my hands in hardware that implements HACDBS, so I can 
> properly measure how much performance we get on above strategy.
> 
> What do you think?

Yeah, I'd want to see some extremely compelling performance numbers for
this approach before considering it, alongside the necessary VMM patches
to actually activate it.

Seems likely to me that the VMM will want the background thread back
ASAP that calls the clean ioctl so you'll need to work out how to cope
with idle vCPUs in that case.

Even still, with this hypothetical approach I'd expect KVM to inspect
the HACDBS state on every exit. The IRQ is just a convenient kick back
out to the main KVM_RUN loop.

Thanks,
Oliver

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 02/13] KVM: arm64: Enable eager hugepage splitting if HDBSS is available
  2026-06-30 15:44           ` Oliver Upton
@ 2026-06-30 17:09             ` Leonardo Bras
  2026-06-30 18:43               ` Oliver Upton
  0 siblings, 1 reply; 47+ messages in thread
From: Leonardo Bras @ 2026-06-30 17:09 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Leonardo Bras, sashiko-reviews, Marc Zyngier, kvmarm, kvm,
	Wei-Lin Chang

On Tue, Jun 30, 2026 at 08:44:20AM -0700, Oliver Upton wrote:
> On Tue, Jun 30, 2026 at 01:58:48PM +0100, Leonardo Bras wrote:
> > On Mon, Jun 29, 2026 at 10:06:38AM -0700, Oliver Upton wrote:
> > > > But this raises a topic I would like to understand:
> > > > - Do we actually need this to be a block_size to assure correctness? or is 
> > > >   it just about efficiency?
> > > 
> > > What value is there in having a chunk size larger than the largest
> > > possible block mapping? The whole UAPI is deliberately tied up with page
> > > table geometry.
> > 
> > Not larger, possibly smallerv My concern was the difference in pages to 
> > split between 4k, 16k and 64k.
> 
> Ok, well in any case the upper bound is going to be the largest possible
> block mapping for a given page granule.
>

Sure, we can do this.

I was worried because that would mean dealing, per granule, with 256k pages 
in PGSIZE 4k, a 4M pages in PGSIZE 16k, an 64M pages in PGSIZE 64k. 
Those are values with different orders of magnitude, and I worried that it 
would take too long, or require too much cache for a single run.

But if you think that's ok, sure then. 


 
> > > 
> > > Overall, I'm not buying the argument for changing the behavior of
> > > KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE. There are very good reasons for
> > > *not* eagerly splitting the entire address space, especially if you know
> > > the working set of the VM is small.
> > > 
> > > You can still use HDBSS without eagerly splitting, so long as block
> > > mappings are {DBM, S2AP_W} = {0, 0} and leaf mappings (which have
> > > a writable PFN) are {1, 0}.
> > > 
> > 
> > Block mappings being read-only, and leaf mappings being writable-clean, 
> > then? Could you please ellaborate on why does not it need eager-split?
> 
> Read-only translations will continue to generate permission faults
> whereas writable-clean descriptors can be updated by hardware. You get
> the opportunity to split a block mapping lazily while preserving
> hardware dirty tracking for page mappings.
> 

So you suggest we only enable DBM bit after we split the block, that will 
happen only after a block is dirtied for the first time after dirty-log 
starts? 

> > As a review, what I recall from the strategy for hw dirty-logging was:
> > - If we have HDBSS, add DBM for all writable pages {1, 1}
> > - On dirty-logging start, make them writable-clean {1, 0}
> >   - Can be done using HACDBS
> >   - Enable HDBSS & HAFDBS
> > 
> > We don't have a fault for making pages dirty anymore, as this is done 
> > by HAFDBS and recorded by HDBSS, so splitting does not happen on demand 
> > anymore. So if we want to split pages, for better tracking granularity, or 
> > anything, we have to eager-split them.
> 
> What I'm saying is the presumption that eager page splitting is always a
> net-win is wrong. Nor is eager page splitting a hard requirement for
> using HDBSS since you can set up the stage-2 in such a way that only
> page granularity mappings are dirtied by hardware.
> 
> You could, in theory, have a workload that is read-heavy for a majority
> of the VM's address space and writes to only a subset of that memory.
> Eagerly splitting pages would likely regress the workload from a higher
> rate of TLB refills / more TLB walk steps.
> 
> Lazily splitting would have the effect of leaving block mappings in
> place for most of the VM. This is exactly why the VMM is in the driver
> seat for deciding whether to lazily or eagerly split the stage-2.
> 

I see your point.

> The approach I think we may need is:
> 
>  - Use a software bit in the PTE to stash whether or not a PFN is
>    'software-writable' when constructing the stage-2. By this I mean
>    we've already faulted it in for write from the primary MMU.
> 
>  - At the time of write protection, reap the hardware-writable state
>    from all PTEs but preserve the software-writable bit.
> 
>  - Whenever splitting a block mapping, set the DBM bit in the page-level
>    PTEs if the block was software-writable and HDBSS is present.
> 
> That way you'd have sufficient metadata in the PTE to safely set DBM.

I remember that, for some reason I can't recall, it would not be great to 
set DBM during dirty-log start, and instead we should have it since VM 
creation. Maybe it had to do with part of the pagetable using the old 
encoding (no DBM), and the other part using the new one.

IIRC, only blocks that are backed by writable memory (S1) were supposed to 
receive the DBM bit. We could use that info for deciding what to split, 
then.

Another option would be to split when we are collecting a dirty-entry from 
HDBSS, but for live migration that would mean we have to transfer the whole 
block (possibly a large LEVEL1 block), because we have no idea which part 
of it got dirty.



> We
> could even make use of that metadata for write faults on non-HDBSS
> hardware to avoid the overheads of user_mem_abort() (e.g. VMA lookup)
> and treat it more like access flag updates.
> 
> The last point still needs some thought.
> 

I don't quite understand this yet. But will take a look on how would that 
work.


Thanks!
Leo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 05/13] KVM: arm64: Detect (via ACPI) and initialize HACDBSIRQ
  2026-06-30 16:03       ` Oliver Upton
@ 2026-06-30 17:19         ` Leonardo Bras
  0 siblings, 0 replies; 47+ messages in thread
From: Leonardo Bras @ 2026-06-30 17:19 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Leonardo Bras, Catalin Marinas, Will Deacon, Marc Zyngier,
	Joey Gouly, Steffen Eiden, Suzuki K Poulose, Zenghui Yu,
	Rafael J. Wysocki, Len Brown, Saket Dumbre, Paolo Bonzini,
	Jonathan Cameron, Chengwen Feng, Kees Cook,
	Mikołaj Lenczewski, James Morse, Zeng Heng, mrigendrachaubey,
	Thomas Huth, Ryan Roberts, Yeoreum Yun, Mark Brown, Kevin Brodsky,
	James Clark, Fuad Tabba, Raghavendra Rao Ananta,
	Lorenzo Pieralisi, Sascha Bischoff, Anshuman Khandual, Tian Zheng,
	linux-arm-kernel, linux-kernel, kvmarm, linux-acpi, acpica-devel,
	kvm

On Tue, Jun 30, 2026 at 09:03:20AM -0700, Oliver Upton wrote:
> On Tue, Jun 30, 2026 at 03:50:17PM +0100, Leonardo Bras wrote:
> > On Mon, Jun 29, 2026 at 10:22:12AM -0700, Oliver Upton wrote:
> > > If we need to initialize the IRQ I'd really like to see device tree
> > > bindings for HACDBSIRQ as well. Pretty much any system us plebs can get
> > > our hands on is gonna be DT anyway.
> > 
> > Agree. I started out with ACPI because that's what the main target is, as 
> > dirty-logging is focused in Live Migration, which is usually more 
> > appreciated in the server space, which generally uses ACPI.
> > 
> > I spoke to some people, and I could not hear of anyone releasing a product 
> > based in DT that would implement this yet, so I postponed the DT 
> > enablement.
> 
> Nested virt is always a good example. In some distant future KVM could
> expose FEAT_HACDBS to the L1 hypervisor, and the VMM may be using DT
> instead of ACPI (like kvmtool).

Oh, good point.

> 
> > > 
> > > > +static irqreturn_t hacdbsirq_handler(int irq, void *pcpu)
> > > > +{
> > > > +	u64 cons = read_sysreg_s(SYS_HACDBSCONS_EL2);
> > > > +	unsigned long err = FIELD_GET(HACDBSCONS_EL2_ERR_REASON, cons);
> > > > +
> > > > +	switch (err) {
> > > > +	case HACDBSCONS_EL2_ERR_REASON_NOF:
> > > > +		this_cpu_write(hacdbs_pcp.status, HACDBS_IDLE);
> > > > +		break;
> > > > +	case HACDBSCONS_EL2_ERR_REASON_IPAHACF:
> > > > +		/* When size not a power of two >= 4k, exit with reserved TTLW */
> > > > +		int index = FIELD_GET(HACDBSCONS_EL2_INDEX, cons);
> > > > +
> > > > +		if (index >= this_cpu_read(hacdbs_pcp.size)) {
> > > > +			this_cpu_write(hacdbs_pcp.status, HACDBS_IDLE);
> > > > +			break;
> > > > +		}
> > > > +		fallthrough;
> > > > +	case HACDBSCONS_EL2_ERR_REASON_STRUCTF:
> > > > +	case HACDBSCONS_EL2_ERR_REASON_IPAF:
> > > > +		this_cpu_write(hacdbs_pcp.status, HACDBS_ERROR);
> > > > +		break;
> > > > +	}
> > > > +
> > > > +	return IRQ_HANDLED;
> > > > +}
> > > 
> > > I have a pretty extreme distaste for creating a state machine between
> > > the callsite and the IRQ handler. The callsite should poll HACDBS for
> > > completion. The thread has nothing better to do anyway.
> > 
> > Well, there is one argument it could just wait and save some energy, but I 
> > agree it is not relevant in server space.
> 
> I wouldn't suggest polling in a tight loop :) I'd say use something like
> __mdelay() to get the core into a low-power state w/o using a naked WFI.
> In fact, that already uses WFxT under the hood.

Awesome!

> 
> > The main reason I did this is 
> > because I am planning on later doing an improved version of this that would 
> > clean the dirty-bit *while* running the guest, and having the IRQ is needed 
> > for exiting guest so we can notify userspace the cleaning is done. So I 
> > laid the HACDBSIRQ infra here so we don't have both polling and IRQ options 
> > happening. 
> > 
> > That idea would require us to add new API (a return value for 'cleaned'), 
> > and also a new flag for the clean ioctl. We also need the VMM to 
> > implement that, but then we get a proper cpu usage of cleaning time.
> >
> > I wanted to start with a backwards compatible version, and do the above 
> > idea once I put my hands in hardware that implements HACDBS, so I can 
> > properly measure how much performance we get on above strategy.
> > 
> > What do you think?
> 
> Yeah, I'd want to see some extremely compelling performance numbers for
> this approach before considering it, alongside the necessary VMM patches
> to actually activate it.
> 
> Seems likely to me that the VMM will want the background thread back
> ASAP that calls the clean ioctl so you'll need to work out how to cope
> with idle vCPUs in that case.

Fair point, HACDBS should be disabled if the vcpu gets scheduled-out, so we 
would need to be sure the vcpus stay scheduled, or the cleaning may take 
too long.

> 
> Even still, with this hypothetical approach I'd expect KVM to inspect
> the HACDBS state on every exit. The IRQ is just a convenient kick back
> out to the main KVM_RUN loop.

Got it. Will use the HACDBSCONS register instead to get that info on 
stopping.

Thanks!
Leo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 02/13] KVM: arm64: Enable eager hugepage splitting if HDBSS is available
  2026-06-30 17:09             ` Leonardo Bras
@ 2026-06-30 18:43               ` Oliver Upton
  0 siblings, 0 replies; 47+ messages in thread
From: Oliver Upton @ 2026-06-30 18:43 UTC (permalink / raw)
  To: Leonardo Bras; +Cc: sashiko-reviews, Marc Zyngier, kvmarm, kvm, Wei-Lin Chang

On Tue, Jun 30, 2026 at 06:09:56PM +0100, Leonardo Bras wrote:
> On Tue, Jun 30, 2026 at 08:44:20AM -0700, Oliver Upton wrote:
> > On Tue, Jun 30, 2026 at 01:58:48PM +0100, Leonardo Bras wrote:
> > > On Mon, Jun 29, 2026 at 10:06:38AM -0700, Oliver Upton wrote:
> > > > > But this raises a topic I would like to understand:
> > > > > - Do we actually need this to be a block_size to assure correctness? or is 
> > > > >   it just about efficiency?
> > > > 
> > > > What value is there in having a chunk size larger than the largest
> > > > possible block mapping? The whole UAPI is deliberately tied up with page
> > > > table geometry.
> > > 
> > > Not larger, possibly smallerv My concern was the difference in pages to 
> > > split between 4k, 16k and 64k.
> > 
> > Ok, well in any case the upper bound is going to be the largest possible
> > block mapping for a given page granule.
> >
> 
> Sure, we can do this.

I'm describing the UAPI behavior of what we already have. Please do not
change anything and continue to leave userspace in charge of eager
splitting / chunk size.

> > > > 
> > > > Overall, I'm not buying the argument for changing the behavior of
> > > > KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE. There are very good reasons for
> > > > *not* eagerly splitting the entire address space, especially if you know
> > > > the working set of the VM is small.
> > > > 
> > > > You can still use HDBSS without eagerly splitting, so long as block
> > > > mappings are {DBM, S2AP_W} = {0, 0} and leaf mappings (which have
> > > > a writable PFN) are {1, 0}.
> > > > 
> > > 
> > > Block mappings being read-only, and leaf mappings being writable-clean, 
> > > then? Could you please ellaborate on why does not it need eager-split?
> > 
> > Read-only translations will continue to generate permission faults
> > whereas writable-clean descriptors can be updated by hardware. You get
> > the opportunity to split a block mapping lazily while preserving
> > hardware dirty tracking for page mappings.
> > 
> 
> So you suggest we only enable DBM bit after we split the block, that will 
> happen only after a block is dirtied for the first time after dirty-log 
> starts? 

If the VMM wants to do lazy splitting, yes. Otherwise this would happen
as part of eager page splitting for block mappings that are known to be
writable.

> > The approach I think we may need is:
> > 
> >  - Use a software bit in the PTE to stash whether or not a PFN is
> >    'software-writable' when constructing the stage-2. By this I mean
> >    we've already faulted it in for write from the primary MMU.
> > 
> >  - At the time of write protection, reap the hardware-writable state
> >    from all PTEs but preserve the software-writable bit.
> > 
> >  - Whenever splitting a block mapping, set the DBM bit in the page-level
> >    PTEs if the block was software-writable and HDBSS is present.
> > 
> > That way you'd have sufficient metadata in the PTE to safely set DBM.
> 
> I remember that, for some reason I can't recall, it would not be great to 
> set DBM during dirty-log start, and instead we should have it since VM 
> creation. Maybe it had to do with part of the pagetable using the old 
> encoding (no DBM), and the other part using the new one.
> 
> IIRC, only blocks that are backed by writable memory (S1) were supposed to 
> receive the DBM bit. We could use that info for deciding what to split, 
> then.

That's why I'm saying to use a software bit. We can't leave DBM=1 on
block mappings and set VTCR_EL2.HD=1 for obvious reasons.

> Another option would be to split when we are collecting a dirty-entry from 
> HDBSS, but for live migration that would mean we have to transfer the whole 
> block (possibly a large LEVEL1 block), because we have no idea which part 
> of it got dirty.

We need to do dirty tracking at page granularity. The only scenario
where we have writable block mappings while dirty logging is enabled on
a memslot is if the VMM enabled KVM_DIRTY_LOG_INITIALLY_SET.

Thanks,
Oliver

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 06/13] KVM: arm64: dirty_bit: Add base FEAT_HACDBS cleaning routine
  2026-06-30 14:59     ` Leonardo Bras
@ 2026-06-30 19:06       ` Oliver Upton
  0 siblings, 0 replies; 47+ messages in thread
From: Oliver Upton @ 2026-06-30 19:06 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Catalin Marinas, Will Deacon, Marc Zyngier, Joey Gouly,
	Steffen Eiden, Suzuki K Poulose, Zenghui Yu, Rafael J. Wysocki,
	Len Brown, Saket Dumbre, Paolo Bonzini, Jonathan Cameron,
	Chengwen Feng, Kees Cook, Mikołaj Lenczewski, James Morse,
	Zeng Heng, mrigendrachaubey, Thomas Huth, Ryan Roberts,
	Yeoreum Yun, Mark Brown, Kevin Brodsky, James Clark, Fuad Tabba,
	Raghavendra Rao Ananta, Lorenzo Pieralisi, Sascha Bischoff,
	Anshuman Khandual, Tian Zheng, linux-arm-kernel, linux-kernel,
	kvmarm, linux-acpi, acpica-devel, kvm

On Tue, Jun 30, 2026 at 03:59:38PM +0100, Leonardo Bras wrote:
> > > +	hcr_el2 = read_sysreg(HCR_EL2);
> > > +	write_sysreg(hcr_el2 | HCR_EL2_VM, HCR_EL2);
> > 
> > sysreg_clear_set_hcr(). I'm pretty sure all the speculative AT errata
> > depend on HCR_EL2.VM being set _after_ the stage-2 MMU has been loaded.
> > 
> 
> So, move this to after __load_stage2()?
> ok

Yes.

> > > +	__load_stage2(&kvm->arch.mmu);
> > 
> > Pretty sure you need an ISB here to ensure loading the MMU is ordered
> > with enabling HACDBS.
> >
> 
> does not __load_stage2() have an isb() here?
> In any case, will add an isb() after sysreg_clear_set_hcr(), which should 
> come after __load_stage2() IIUC.

No, __load_stage2() inserts an ISB only for hardware subject to the
speculative AT errata. If an implementation has broken AT and HACDBS in
the future then it gets an additional ISB. Oh well.

> > > +	hacdbs_start(hw_entries, size);
> > > +
> > > +	do {
> > > +		wfi();
> > > +	} while (this_cpu_read(hacdbs_pcp.status) == HACDBS_RUNNING);
> > 
> > This is exactly why I said you should just poll hardware instead. It is
> > entirely possible that the IRQ arrives before you WFI.
> 
> It should be fine with WFIT, though, right?

Sure, but we shouldn't assume a functional WFxT even if we have HACDBS.
Just rely on pre-existing kernel infrastructure to do the thing you want
to.

> I understand the reason in pooling, and even done some workaround in 
> pooling for getting this to run in the model. 
> 
> Based on the previous reply, do you think I should only use polling for 
> now, and implement the IRQ later?

Yes.

Thanks,
Oliver

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2026-06-30 19:06 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-29 11:17 [PATCH v2 00/13] KVM Dirty-bit cleaning hw accelerator (HACDBS) Leonardo Bras
2026-06-29 11:17 ` [PATCH v2 01/13] KVM: arm64: HDBSS bits Leonardo Bras
2026-06-29 11:34   ` sashiko-bot
2026-06-29 12:57     ` Leonardo Bras
2026-06-29 11:17 ` [PATCH v2 02/13] KVM: arm64: Enable eager hugepage splitting if HDBSS is available Leonardo Bras
2026-06-29 11:36   ` sashiko-bot
2026-06-29 14:47     ` Leonardo Bras
2026-06-29 17:06       ` Oliver Upton
2026-06-30 12:58         ` Leonardo Bras
2026-06-30 15:44           ` Oliver Upton
2026-06-30 17:09             ` Leonardo Bras
2026-06-30 18:43               ` Oliver Upton
2026-06-29 11:17 ` [PATCH v2 03/13] arm64/cpufeature: Add system-wide FEAT_HACDBS detection Leonardo Bras
2026-06-29 11:17 ` [PATCH v2 04/13] arm64/sysreg: Add HACDBS consumer and base registers Leonardo Bras
2026-06-29 11:17 ` [PATCH v2 05/13] KVM: arm64: Detect (via ACPI) and initialize HACDBSIRQ Leonardo Bras
2026-06-29 11:32   ` sashiko-bot
2026-06-29 15:43     ` Leonardo Bras
2026-06-29 16:52       ` Vladimir Murzin
2026-06-30 14:52         ` Leonardo Bras
2026-06-29 17:22   ` Oliver Upton
2026-06-30 14:50     ` Leonardo Bras
2026-06-30 16:03       ` Oliver Upton
2026-06-30 17:19         ` Leonardo Bras
2026-06-29 11:17 ` [PATCH v2 06/13] KVM: arm64: dirty_bit: Add base FEAT_HACDBS cleaning routine Leonardo Bras
2026-06-29 11:29   ` sashiko-bot
2026-06-29 15:54     ` Leonardo Bras
2026-06-29 17:36   ` Oliver Upton
2026-06-30 14:59     ` Leonardo Bras
2026-06-30 19:06       ` Oliver Upton
2026-06-29 11:17 ` [PATCH v2 07/13] kvm: Add arch-generic interface for hw-accelerated dirty-bitmap cleaning Leonardo Bras
2026-06-29 11:38   ` sashiko-bot
2026-06-29 16:07     ` Leonardo Bras
2026-06-29 11:17 ` [PATCH v2 08/13] KVM: arm64: Add hardware-accelerated dirty-bitmap cleaning routine Leonardo Bras
2026-06-29 11:45   ` sashiko-bot
2026-06-29 16:49     ` Leonardo Bras
2026-06-29 11:17 ` [PATCH v2 09/13] KVM: arm64: Dirty-bitmap: avoid splitting previously split blocks Leonardo Bras
2026-06-29 11:39   ` sashiko-bot
2026-06-29 17:07     ` Leonardo Bras
2026-06-29 11:17 ` [PATCH v2 10/13] kvm/dirty_ring: Introduce get_memslot and move helpers to header Leonardo Bras
2026-06-29 11:17 ` [PATCH v2 11/13] kvm/dirty_ring: Add arch-generic interface for hw-accelerated dirty-ring cleaning Leonardo Bras
2026-06-29 11:49   ` sashiko-bot
2026-06-29 17:09     ` Leonardo Bras
2026-06-29 11:18 ` [PATCH v2 12/13] KVM: arm64: Add hardware-accelerated dirty-ring cleaning routine Leonardo Bras
2026-06-29 11:49   ` sashiko-bot
2026-06-29 17:26     ` Leonardo Bras
2026-06-29 11:18 ` [PATCH v2 13/13] KVM: arm64: Enable KVM_HW_DIRTY_BIT Leonardo Bras
2026-06-29 11:52   ` sashiko-bot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox